GSE94460 Processing Pipeline
OTHER
code_examples
4 steps
Publication
Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.Nature methods (2021) — PMID 33963355
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Library strategy: Ribosome profiling
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define reference genome paths (replace with actual paths) # For human GRCh38 and Gencode v44 GENOME_DIR="/path/to/STAR_genome_index_GRCh38" # Directory for STAR genome index GENOME_FASTA="/path/to/GRCh38.primary_assembly.genome.fa" # e.g., from Gencode or Ensembl GTF_FILE="/path/to/gencode.v44.annotation.gtf" # e.g., from Gencode # Define input and output paths INPUT_FASTQ="ribo_seq_sample_R1.fastq.gz" # Assuming single-end reads, common for Ribo-seq OUTPUT_PREFIX="ribo_seq_aligned" # Prefix for output files # Define typical Ribo-seq read length for sjdbOverhang (e.g., 30nt) # sjdbOverhang should be readLength - 1 READ_LENGTH=30 SJDB_OVERHANG=$((READ_LENGTH - 1)) # 1. Build STAR genome index (run once per genome version) # Check if index directory exists and is populated if [ ! -d "${GENOME_DIR}" ] || [ -z "$(ls -A ${GENOME_DIR})" ]; then echo "Building STAR genome index for GRCh38..." mkdir -p "${GENOME_DIR}" STAR --runMode genomeGenerate \ --genomeDir "${GENOME_DIR}" \ --genomeFastaFiles "${GENOME_FASTA}" \ --sjdbGTFfile "${GTF_FILE}" \ --sjdbOverhang "${SJDB_OVERHANG}" \ --runThreadN 8 # Adjust thread count as needed else echo "STAR genome index already exists at ${GENOME_DIR}" fi # 2. Align Ribo-seq reads echo "Aligning Ribo-seq reads with STAR..." STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${INPUT_FASTQ}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}_" \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --alignSJFilterReads UniqueBoth \ --alignSJDBoverhangMin 1 \ --runThreadN 8 # Adjust thread count as needed -
2
Fastq reads were trimmed and mapped to human rRNA sequences using bowtie allowing 2 mismatches.
$ Bash example
# Install Bowtie (if not already installed) # conda install -c bioconda bowtie=1.3.1 # Reference: Human rRNA sequences index (e.g., 'hg38_rRNA') # This index needs to be built beforehand from a FASTA file containing human rRNA sequences. # A common source for human rRNA sequences is NCBI or UCSC. For example, the eCLIP pipeline # often uses a custom FASTA file containing known human rRNA sequences (e.g., NR_003286.2 for 18S, NR_003287.2 for 28S). # Example command to build the index: # bowtie-build human_rRNA.fa hg38_rRNA # Input: Placeholder for trimmed Fastq reads INPUT_FASTQ="trimmed_reads.fastq" # Output: SAM file containing reads mapped to rRNA OUTPUT_SAM="rRNA_mapped.sam" # Map reads to human rRNA sequences allowing 2 mismatches # -v 2: Allow up to 2 mismatches in the alignment # -S: Output alignments in SAM format # hg38_rRNA: The name of the pre-built Bowtie index for human rRNA sequences # ${INPUT_FASTQ}: The input Fastq file bowtie -v 2 -S "${OUTPUT_SAM}" hg38_rRNA "${INPUT_FASTQ}" # If the primary goal is to filter out rRNA reads, you would typically use the --un option to output unmapped reads: # bowtie -v 2 --un "non_rRNA_reads.fastq" hg38_rRNA "${INPUT_FASTQ}" -
3
Unmapped reads were mapped to human genome GRCh38 with Ensembl gene annotation release 83 using STAR with parameters â--outSAMattributes All --outFilterMismatchNmax 2 --alignEndsType EndToEnd --outFilterIntronMotifs RemoveNoncanonicalUnannotated --alignIntronMax 20000 --outMultimapperOrder Random --outSAMmultNmax 1â
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # --- Prepare Genome Index (if not already done) --- # Download GRCh38 primary assembly FASTA (Ensembl release 83 corresponds to GRCh38.p7) # wget http://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # Download Ensembl 83 GTF annotation # wget http://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens/Homo_sapiens.GRCh38.83.gtf.gz # gunzip Homo_sapiens.GRCh38.83.gtf.gz # Create STAR genome index # mkdir STAR_GRCh38_Ensembl83_index # STAR --runThreadN 8 \ # --runMode genomeGenerate \ # --genomeDir STAR_GRCh38_Ensembl83_index \ # --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \ # --sjdbGTFfile Homo_sapiens.GRCh38.83.gtf \ # --sjdbOverhang 100 # Adjust sjdbOverhang based on your read length (e.g., ReadLength - 1) # --- Alignment Step --- # Define input and output INPUT_FASTQ="input.fastq.gz" # Replace with your actual input FASTQ file(s) OUTPUT_PREFIX="aligned_reads_STAR_GRCh38_Ensembl83_" GENOME_DIR="STAR_GRCh38_Ensembl83_index" NUM_THREADS=8 # Adjust as needed STAR --runThreadN ${NUM_THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${INPUT_FASTQ} \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMattributes All \ --outFilterMismatchNmax 2 \ --alignEndsType EndToEnd \ --outFilterIntronMotifs RemoveNoncanonicalUnannotated \ --alignIntronMax 20000 \ --outMultimapperOrder Random \ --outSAMmultNmax 1 -
4
Reads with mapQ=0 or NH>5 were filtered out and only primary loci are counted.
samtools (Inferred with models/gemini-2.5-flash), awk (Inferred with models/gemini-2.5-flash) vsamtools 1.9$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # Filter reads based on mapQ, NH tag, and primary alignment status # -F 0x100: Exclude secondary alignments # -F 0x800: Exclude supplementary alignments # -q 1: Exclude reads with mapping quality 0 # The awk command filters out reads where the NH tag is greater than 5. # It prints headers, reads without an NH tag, or reads where NH is 0-5. samtools view -h -F 0x100 -F 0x800 -q 1 input.bam | \ awk 'BEGIN{FS="\t"}{if($1 ~ /^@/ || $0 !~ /NH:i:[0-9]+/ || $0 ~ /NH:i:[0-5]$/) print}' | \ samtools view -bS -o filtered.bam
Raw Source Text
Library strategy: Ribosome profiling Fastq reads were trimmed and mapped to human rRNA sequences using bowtie allowing 2 mismatches. Unmapped reads were mapped to human genome GRCh38 with Ensembl gene annotation release 83 using STAR with parameters â--outSAMattributes All --outFilterMismatchNmax 2 --alignEndsType EndToEnd --outFilterIntronMotifs RemoveNoncanonicalUnannotated --alignIntronMax 20000 --outMultimapperOrder Random --outSAMmultNmax 1â Reads with mapQ=0 or NH>5 were filtered out and only primary loci are counted. Genome_build: GRCh38 Supplementary_files_format_and_content: tab-delimited text files include Ensembl gene id, gene symbol, length and raw counts