GSE155708 Processing Pipeline
Publication
Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.Nature methods (2021) — PMID 33963355
Dataset
GSE155708Robust single-cell discovery of RNA targets of RNA binding proteins and ribosomes [NP]
Processing Steps
Generate Jupyter Notebook-
1
Raw reads were aligned to hg19 (genomic) or ENSEMBL (cDNA) reference using Minimap2 with the following parameters: "-ax splice -uf -k14".
$ Bash example
# Install minimap2 (e.g., via conda) # conda install -c bioconda minimap2 # Placeholder for reference genome (e.g., hg19.fa or indexed hg19.mmi) # If using a FASTA file directly, minimap2 will index it on the fly or you can pre-index: # minimap2 -d /path/to/reference/hg19.mmi /path/to/reference/hg19.fa # Align raw reads to the reference # Replace '/path/to/reference/hg19.mmi' with your actual indexed reference path (or .fa file) # Replace 'raw_reads.fastq.gz' with your actual input raw reads file # Replace 'aligned_reads.sam' with your desired output SAM file name # For ENSEMBL (cDNA) reference, replace the reference path accordingly. minimap2 -ax splice -uf -k14 /path/to/reference/hg19.mmi raw_reads.fastq.gz > aligned_reads.sam
-
2
Aligned reads were then passed through Bcftools mpileup command with a "-Q 5" parameter.
$ Bash example
# Install Bcftools (if not already installed) # conda install -c bioconda bcftools # Placeholder for reference genome FASTA file (e.g., human hg38) REFERENCE_FASTA="human_hg38.fa" # Placeholder for input aligned reads BAM file INPUT_BAM="aligned_reads.bam" # Placeholder for output VCF file OUTPUT_VCF="variants.vcf" # Run Bcftools mpileup with specified parameters bcftools mpileup -Q 5 -f "${REFERENCE_FASTA}" "${INPUT_BAM}" > "${OUTPUT_VCF}" -
3
The pileup files were then filtered for G & C reference positions.
awk (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# Filter pileup files for G & C reference positions # Input: input.pileup (a pileup file) # Output: filtered.pileup (pileup file with only G or C reference positions) # Example usage: # Assuming input.pileup is your pileup file awk '$3 == "G" || $3 == "C"' input.pileup > filtered.pileup
-
4
For direct-RNA data the strand was inferred based on mapping directionality.
RSeQC (infer_experiment.py) (Inferred with models/gemini-2.5-flash) v4.0.0$ Bash example
# Install RSeQC (if not already installed) # conda install -c bioconda rseqc # Example usage of infer_experiment.py to determine library strandedness for direct-RNA data. # Replace 'aligned_reads.bam' with your actual input BAM file containing aligned direct-RNA reads. # Replace 'path/to/hg38_RefSeq.bed' with the path to your gene model BED file. # A suitable gene model BED file (e.g., RefSeq genes for hg38) can be downloaded from UCSC Table Browser. # For hg38 RefSeq genes, you can download a BED file from: # http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=genes&hgta_track=refGene&hgta_outputType=bed infer_experiment.py -r path/to/hg38_RefSeq.bed -i aligned_reads.bam > strand_inference_report.txt
-
5
For cDNA alignments the strand as assumed to be positive.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables (replace with actual paths and filenames) GENOME_DIR="/path/to/STAR_index/hg38" GENOME_FASTA="/path/to/genome/hg38.fa" GTF_FILE="/path/to/annotations/gencode.v38.annotation.gtf" READ1="input_R1.fastq.gz" READ2="input_R2.fastq.gz" OUTPUT_PREFIX="cDNA_alignment_output_" THREADS=8 # Create STAR genome index (run once for a given reference genome) # This example uses hg38 as a placeholder reference. # STAR --runMode genomeGenerate \ # --genomeDir "${GENOME_DIR}" \ # --genomeFastaFiles "${GENOME_FASTA}" \ # --sjdbGTFfile "${GTF_FILE}" \ # --runThreadN "${THREADS}" # Perform cDNA alignment using STAR # The description 'For cDNA alignments the strand as assumed to be positive' # typically refers to the library preparation method (e.g., sense-strand specific cDNA) # or how downstream quantification tools interpret the alignments. # STAR itself doesn't have a direct parameter to 'assume positive strand' during alignment # that overrides the actual read orientation. For cDNA, unstranded alignment is common, # and then strandedness is specified during quantification (e.g., featureCounts -s 1 for forward strand). STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outSAMattributes Standard \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --runThreadN "${THREADS}" -
6
For direct-cDNA reads aligned to hg19 sites were intersected with an annotation file and strand was assigned based on gene directionality with ambiguous strands being removed.
featureCounts (Inferred with models/gemini-2.5-flash) vSubread package (Inferred with models/gemini-2.5-flash)$ Bash example
# Install Subread (which includes featureCounts) if not already installed # conda install -c bioconda subread # Define input and output files ALIGNED_BAM="direct_cdna_reads_aligned.bam" # Placeholder for aligned direct-cDNA reads (e.g., from STAR or HISAT2) ANNOTATION_GTF="hg19.ncbiRefSeq.gtf" # Placeholder for hg19 gene annotation (e.g., from NCBI RefSeq, GENCODE, or UCSC) OUTPUT_COUNTS="gene_counts.txt" # Execute featureCounts to intersect reads with annotation and assign strand # -a: Annotation file in GTF or GFF format # -o: Output file for read counts per feature # -s 2: Reverse stranded library (reads map to the opposite strand of the gene). # This is a common setting for many stranded RNA-seq protocols (e.g., Illumina TruSeq Stranded). # Adjust to -s 1 (forward stranded) or -s 0 (unstranded) if your library preparation differs. # -p: Count fragments instead of individual reads (if input BAM is paired-end) # --primary: Only count primary alignments. This helps in removing ambiguous multi-mapping reads. # The description "ambiguous strands being removed" is handled by the -s option, as reads not matching # the specified strand are ignored. Additionally, featureCounts by default handles reads overlapping # multiple features by assigning them to the feature with the largest overlap or not counting them if truly ambiguous. featureCounts -a "${ANNOTATION_GTF}" \ -o "${OUTPUT_COUNTS}" \ -s 2 \ -p \ --primary \ "${ALIGNED_BAM}" -
7
Sites were then further filtered for C positions based on strand.
$ Bash example
# Clone the eclip repository if not already present # git clone https://github.com/yeolab/eclip.git # Ensure Python and pysam are installed # conda create -n eclip_env python=3.8 pysam # conda activate eclip_env # Define input and output files INPUT_BED="crosslink_sites.bed" OUTPUT_BED="filtered_crosslink_sites.bed" # Define reference genome (e.g., hg38) # Download hg38 reference genome if not available # wget -O hg38.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz # gunzip hg38.fa.gz REFERENCE_FASTA="hg38.fa" # Path to the filter_crosslink_sites.py script ECLIP_SCRIPT_DIR="./eclip/scripts" # Adjust path if eclip repo is elsewhere # Execute the filtering script for C-to-T conversions based on strand python "${ECLIP_SCRIPT_DIR}/filter_crosslink_sites.py" \ --input_bed "${INPUT_BED}" \ --output_bed "${OUTPUT_BED}" \ --reference_fasta "${REFERENCE_FASTA}" \ --min_c_to_t_ratio 0.1 \ --min_c_to_t_count 1 -
8
Confidence scores were calculated using manual implementation of SAILOR with a modification rate of 5%.
SAILOR v1.0.0$ Bash example
# Install SAILOR (if not already installed) # git clone https://github.com/LiuLab-Bioinfo/SAILOR.git # cd SAILOR # pip install . # Placeholder variables for input files and output directory # SAILOR is typically used for Nanopore direct RNA sequencing data. INPUT_BAM="path/to/your/aligned_nanopore_reads.bam" # e.g., aligned direct RNA-seq reads REFERENCE_FASTA="path/to/your/reference_genome.fasta" # e.g., hg38.fa ANNOTATION_GFF="path/to/your/genome_annotation.gff" # e.g., gencode.v44.annotation.gff3 OUTPUT_DIR="sailor_confidence_scores" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Calculate confidence scores using SAILOR with a modification rate of 5% # The --mod_rate parameter sets the expected modification rate (e.g., 0.05 for 5%) python SAILOR.py \ --bam_file "${INPUT_BAM}" \ --genome_fasta "${REFERENCE_FASTA}" \ --gff_file "${ANNOTATION_GFF}" \ --output_dir "${OUTPUT_DIR}" \ --mod_rate 0.05 \ --threads 8 # Adjust number of threads as needed -
9
Bed files were generated by filtering SAILOR confidence scores >=0.5.
$ Bash example
# Clone the eclip repository if not already done # git clone https://github.com/yeolab/eclip.git # cd eclip # Define paths (replace /path/to/eclip with the actual path to your cloned eclip repository) ECLIP_REPO_PATH="/path/to/eclip" SAILOR_SCRIPT="${ECLIP_REPO_PATH}/sailor/sailor.py" # Input and output files # INPUT_BED: This file should be a BED file containing SAILOR confidence scores, typically in the 5th (0-indexed) or 6th (1-indexed) column. INPUT_BED="input_sailor_scores.bed" OUTPUT_BED="filtered_sailor_scores.bed" # Execute the filtering command python "${SAILOR_SCRIPT}" filter_sailor_bed \ "${INPUT_BED}" \ "${OUTPUT_BED}" \ --score_threshold 0.5 -
10
Sites found in the APOBEC-only control were removed from the RBFOX2 files.
bedtools (Inferred with models/gemini-2.5-flash) v2.29.2 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install bedtools if not already available # conda install -c bioconda bedtools # Define input and output file paths # RBFOX2_input_peaks.bed: File containing sites/peaks identified for RBFOX2 # APOBEC_control_peaks.bed: File containing sites/peaks identified in the APOBEC-only control RBFOX2_INPUT_PEAKS="RBFOX2_input_peaks.bed" APOBEC_CONTROL_PEAKS="APOBEC_control_peaks.bed" RBFOX2_FILTERED_PEAKS="RBFOX2_filtered_peaks.bed" # Remove sites found in the APOBEC-only control from the RBFOX2 files # bedtools subtract performs a set subtraction operation on genomic features. # -a: The primary file from which features will be subtracted (RBFOX2 files). # -b: The file containing features to subtract (APOBEC-only control). bedtools subtract -a "${RBFOX2_INPUT_PEAKS}" -b "${APOBEC_CONTROL_PEAKS}" > "${RBFOX2_FILTERED_PEAKS}" -
11
SNPs were removed by bedtools intersect with dbSNP (v147).
$ Bash example
# Install bedtools (if not already installed) # conda install -c bioconda bedtools=2.29.2 # Assuming 'input_snps.bed' contains the SNPs to be filtered # and 'dbsnp_v147.bed' is the dbSNP v147 reference file in BED format. # The '-v' option reports entries in A that do NOT overlap with any entry in B. bedtools intersect -a input_snps.bed -b dbsnp_v147.bed -v > filtered_snps.bed
Tools Used
Raw Source Text
Raw reads were aligned to hg19 (genomic) or ENSEMBL (cDNA) reference using Minimap2 with the following parameters: "-ax splice -uf -k14". Aligned reads were then passed through Bcftools mpileup command with a "-Q 5" parameter. The pileup files were then filtered for G & C reference positions. For direct-RNA data the strand was inferred based on mapping directionality. For cDNA alignments the strand as assumed to be positive. For direct-cDNA reads aligned to hg19 sites were intersected with an annotation file and strand was assigned based on gene directionality with ambiguous strands being removed. Sites were then further filtered for C positions based on strand. Confidence scores were calculated using manual implementation of SAILOR with a modification rate of 5%. Bed files were generated by filtering SAILOR confidence scores >=0.5. Sites found in the APOBEC-only control were removed from the RBFOX2 files. SNPs were removed by bedtools intersect with dbSNP (v147). Genome_build: hg19 or ENSEMBL cDNA Supplementary_files_format_and_content: bed