GSE155708 Processing Pipeline

RNA-Seq code_examples 11 steps

Publication

Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.

Nature methods (2021) — PMID 33963355

Dataset

Robust single-cell discovery of RNA targets of RNA binding proteins and ribosomes [NP]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Raw reads were aligned to hg19 (genomic) or ENSEMBL (cDNA) reference using Minimap2 with the following parameters: "-ax splice -uf -k14".

minimap2 v2.24 GitHub

$ Bash example

# Install minimap2 (e.g., via conda)
# conda install -c bioconda minimap2

# Placeholder for reference genome (e.g., hg19.fa or indexed hg19.mmi)
# If using a FASTA file directly, minimap2 will index it on the fly or you can pre-index:
# minimap2 -d /path/to/reference/hg19.mmi /path/to/reference/hg19.fa

# Align raw reads to the reference
# Replace '/path/to/reference/hg19.mmi' with your actual indexed reference path (or .fa file)
# Replace 'raw_reads.fastq.gz' with your actual input raw reads file
# Replace 'aligned_reads.sam' with your desired output SAM file name
# For ENSEMBL (cDNA) reference, replace the reference path accordingly.
minimap2 -ax splice -uf -k14 /path/to/reference/hg19.mmi raw_reads.fastq.gz > aligned_reads.sam

View on GitHub

Aligned reads were then passed through Bcftools mpileup command with a "-Q 5" parameter.

Bcftools v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Bcftools (if not already installed)
# conda install -c bioconda bcftools

# Placeholder for reference genome FASTA file (e.g., human hg38)
REFERENCE_FASTA="human_hg38.fa"

# Placeholder for input aligned reads BAM file
INPUT_BAM="aligned_reads.bam"

# Placeholder for output VCF file
OUTPUT_VCF="variants.vcf"

# Run Bcftools mpileup with specified parameters
bcftools mpileup -Q 5 -f "${REFERENCE_FASTA}" "${INPUT_BAM}" > "${OUTPUT_VCF}"

View on GitHub

The pileup files were then filtered for G & C reference positions.

awk (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# Filter pileup files for G & C reference positions
# Input: input.pileup (a pileup file)
# Output: filtered.pileup (pileup file with only G or C reference positions)

# Example usage:
# Assuming input.pileup is your pileup file
awk '$3 == "G" || $3 == "C"' input.pileup > filtered.pileup

For direct-RNA data the strand was inferred based on mapping directionality.

RSeQC (infer_experiment.py) (Inferred with models/gemini-2.5-flash) v4.0.0

$ Bash example

# Install RSeQC (if not already installed)
# conda install -c bioconda rseqc

# Example usage of infer_experiment.py to determine library strandedness for direct-RNA data.
# Replace 'aligned_reads.bam' with your actual input BAM file containing aligned direct-RNA reads.
# Replace 'path/to/hg38_RefSeq.bed' with the path to your gene model BED file.
# A suitable gene model BED file (e.g., RefSeq genes for hg38) can be downloaded from UCSC Table Browser.
# For hg38 RefSeq genes, you can download a BED file from:
# http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=genes&hgta_track=refGene&hgta_outputType=bed

infer_experiment.py -r path/to/hg38_RefSeq.bed -i aligned_reads.bam > strand_inference_report.txt

For cDNA alignments the strand as assumed to be positive.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables (replace with actual paths and filenames)
GENOME_DIR="/path/to/STAR_index/hg38"
GENOME_FASTA="/path/to/genome/hg38.fa"
GTF_FILE="/path/to/annotations/gencode.v38.annotation.gtf"
READ1="input_R1.fastq.gz"
READ2="input_R2.fastq.gz"
OUTPUT_PREFIX="cDNA_alignment_output_"
THREADS=8

# Create STAR genome index (run once for a given reference genome)
# This example uses hg38 as a placeholder reference.
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles "${GENOME_FASTA}" \
#      --sjdbGTFfile "${GTF_FILE}" \
#      --runThreadN "${THREADS}"

# Perform cDNA alignment using STAR
# The description 'For cDNA alignments the strand as assumed to be positive' 
# typically refers to the library preparation method (e.g., sense-strand specific cDNA) 
# or how downstream quantification tools interpret the alignments. 
# STAR itself doesn't have a direct parameter to 'assume positive strand' during alignment 
# that overrides the actual read orientation. For cDNA, unstranded alignment is common, 
# and then strandedness is specified during quantification (e.g., featureCounts -s 1 for forward strand).
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --runThreadN "${THREADS}"

View on GitHub

For direct-cDNA reads aligned to hg19 sites were intersected with an annotation file and strand was assigned based on gene directionality with ambiguous strands being removed.

featureCounts (Inferred with models/gemini-2.5-flash) vSubread package (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Subread (which includes featureCounts) if not already installed
# conda install -c bioconda subread

# Define input and output files
ALIGNED_BAM="direct_cdna_reads_aligned.bam" # Placeholder for aligned direct-cDNA reads (e.g., from STAR or HISAT2)
ANNOTATION_GTF="hg19.ncbiRefSeq.gtf" # Placeholder for hg19 gene annotation (e.g., from NCBI RefSeq, GENCODE, or UCSC)
OUTPUT_COUNTS="gene_counts.txt"

# Execute featureCounts to intersect reads with annotation and assign strand
# -a: Annotation file in GTF or GFF format
# -o: Output file for read counts per feature
# -s 2: Reverse stranded library (reads map to the opposite strand of the gene).
#       This is a common setting for many stranded RNA-seq protocols (e.g., Illumina TruSeq Stranded).
#       Adjust to -s 1 (forward stranded) or -s 0 (unstranded) if your library preparation differs.
# -p: Count fragments instead of individual reads (if input BAM is paired-end)
# --primary: Only count primary alignments. This helps in removing ambiguous multi-mapping reads.
# The description "ambiguous strands being removed" is handled by the -s option, as reads not matching
# the specified strand are ignored. Additionally, featureCounts by default handles reads overlapping
# multiple features by assigning them to the feature with the largest overlap or not counting them if truly ambiguous.
featureCounts -a "${ANNOTATION_GTF}" \
              -o "${OUTPUT_COUNTS}" \
              -s 2 \
              -p \
              --primary \
              "${ALIGNED_BAM}"

Sites were then further filtered for C positions based on strand.

filter_crosslink_sites.py (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# Clone the eclip repository if not already present
# git clone https://github.com/yeolab/eclip.git

# Ensure Python and pysam are installed
# conda create -n eclip_env python=3.8 pysam
# conda activate eclip_env

# Define input and output files
INPUT_BED="crosslink_sites.bed"
OUTPUT_BED="filtered_crosslink_sites.bed"

# Define reference genome (e.g., hg38)
# Download hg38 reference genome if not available
# wget -O hg38.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
# gunzip hg38.fa.gz
REFERENCE_FASTA="hg38.fa"

# Path to the filter_crosslink_sites.py script
ECLIP_SCRIPT_DIR="./eclip/scripts" # Adjust path if eclip repo is elsewhere

# Execute the filtering script for C-to-T conversions based on strand
python "${ECLIP_SCRIPT_DIR}/filter_crosslink_sites.py" \
    --input_bed "${INPUT_BED}" \
    --output_bed "${OUTPUT_BED}" \
    --reference_fasta "${REFERENCE_FASTA}" \
    --min_c_to_t_ratio 0.1 \
    --min_c_to_t_count 1

View on GitHub

Confidence scores were calculated using manual implementation of SAILOR with a modification rate of 5%.

SAILOR v1.0.0

$ Bash example

# Install SAILOR (if not already installed)
# git clone https://github.com/LiuLab-Bioinfo/SAILOR.git
# cd SAILOR
# pip install .

# Placeholder variables for input files and output directory
# SAILOR is typically used for Nanopore direct RNA sequencing data.
INPUT_BAM="path/to/your/aligned_nanopore_reads.bam" # e.g., aligned direct RNA-seq reads
REFERENCE_FASTA="path/to/your/reference_genome.fasta" # e.g., hg38.fa
ANNOTATION_GFF="path/to/your/genome_annotation.gff" # e.g., gencode.v44.annotation.gff3
OUTPUT_DIR="sailor_confidence_scores"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Calculate confidence scores using SAILOR with a modification rate of 5%
# The --mod_rate parameter sets the expected modification rate (e.g., 0.05 for 5%)
python SAILOR.py \
    --bam_file "${INPUT_BAM}" \
    --genome_fasta "${REFERENCE_FASTA}" \
    --gff_file "${ANNOTATION_GFF}" \
    --output_dir "${OUTPUT_DIR}" \
    --mod_rate 0.05 \
    --threads 8 # Adjust number of threads as needed

Bed files were generated by filtering SAILOR confidence scores >=0.5.

SAILOR vv1.0.0 GitHub

$ Bash example

# Clone the eclip repository if not already done
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# Define paths (replace /path/to/eclip with the actual path to your cloned eclip repository)
ECLIP_REPO_PATH="/path/to/eclip"
SAILOR_SCRIPT="${ECLIP_REPO_PATH}/sailor/sailor.py"

# Input and output files
# INPUT_BED: This file should be a BED file containing SAILOR confidence scores, typically in the 5th (0-indexed) or 6th (1-indexed) column.
INPUT_BED="input_sailor_scores.bed"
OUTPUT_BED="filtered_sailor_scores.bed"

# Execute the filtering command
python "${SAILOR_SCRIPT}" filter_sailor_bed \
    "${INPUT_BED}" \
    "${OUTPUT_BED}" \
    --score_threshold 0.5

View on GitHub

Sites found in the APOBEC-only control were removed from the RBFOX2 files.

bedtools (Inferred with models/gemini-2.5-flash) v2.29.2 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install bedtools if not already available
# conda install -c bioconda bedtools

# Define input and output file paths
# RBFOX2_input_peaks.bed: File containing sites/peaks identified for RBFOX2
# APOBEC_control_peaks.bed: File containing sites/peaks identified in the APOBEC-only control
RBFOX2_INPUT_PEAKS="RBFOX2_input_peaks.bed"
APOBEC_CONTROL_PEAKS="APOBEC_control_peaks.bed"
RBFOX2_FILTERED_PEAKS="RBFOX2_filtered_peaks.bed"

# Remove sites found in the APOBEC-only control from the RBFOX2 files
# bedtools subtract performs a set subtraction operation on genomic features.
# -a: The primary file from which features will be subtracted (RBFOX2 files).
# -b: The file containing features to subtract (APOBEC-only control).
bedtools subtract -a "${RBFOX2_INPUT_PEAKS}" -b "${APOBEC_CONTROL_PEAKS}" > "${RBFOX2_FILTERED_PEAKS}"

View on GitHub

SNPs were removed by bedtools intersect with dbSNP (v147).

bedtools vv2.29.2 GitHub

$ Bash example

# Install bedtools (if not already installed)
# conda install -c bioconda bedtools=2.29.2

# Assuming 'input_snps.bed' contains the SNPs to be filtered
# and 'dbsnp_v147.bed' is the dbSNP v147 reference file in BED format.
# The '-v' option reports entries in A that do NOT overlap with any entry in B.
bedtools intersect -a input_snps.bed -b dbsnp_v147.bed -v > filtered_snps.bed

View on GitHub

Tools Used

SAILOR

Raw Source Text

Raw reads were aligned to hg19 (genomic) or ENSEMBL (cDNA) reference using Minimap2 with the following parameters: "-ax splice -uf -k14". Aligned reads were then passed through Bcftools mpileup command with a "-Q 5" parameter. The pileup files were then filtered for G & C reference positions. For direct-RNA data the strand was inferred based on mapping directionality. For cDNA alignments the strand as assumed to be positive. For direct-cDNA reads aligned to hg19 sites were intersected with an annotation file and strand was assigned based on gene directionality with ambiguous strands being removed. Sites were then further filtered for C positions based on strand. Confidence scores were calculated using manual implementation of SAILOR with a modification rate of 5%. Bed files were generated by filtering SAILOR confidence scores >=0.5. Sites found in the APOBEC-only control were removed from the RBFOX2 files. SNPs were removed by bedtools intersect with dbSNP (v147).
Genome_build: hg19 or ENSEMBL cDNA
Supplementary_files_format_and_content: bed

← Back to Analysis