GSE94460 Processing Pipeline

OTHER code_examples 4 steps

Publication

Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.

Nature methods (2021) — PMID 33963355

Dataset

GSE94460

Ribosome profiling of HEK293 cells.

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Library strategy: Ribosome profiling

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define reference genome paths (replace with actual paths)
    # For human GRCh38 and Gencode v44
    GENOME_DIR="/path/to/STAR_genome_index_GRCh38" # Directory for STAR genome index
    GENOME_FASTA="/path/to/GRCh38.primary_assembly.genome.fa" # e.g., from Gencode or Ensembl
    GTF_FILE="/path/to/gencode.v44.annotation.gtf" # e.g., from Gencode
    
    # Define input and output paths
    INPUT_FASTQ="ribo_seq_sample_R1.fastq.gz" # Assuming single-end reads, common for Ribo-seq
    OUTPUT_PREFIX="ribo_seq_aligned" # Prefix for output files
    
    # Define typical Ribo-seq read length for sjdbOverhang (e.g., 30nt)
    # sjdbOverhang should be readLength - 1
    READ_LENGTH=30
    SJDB_OVERHANG=$((READ_LENGTH - 1))
    
    # 1. Build STAR genome index (run once per genome version)
    # Check if index directory exists and is populated
    if [ ! -d "${GENOME_DIR}" ] || [ -z "$(ls -A ${GENOME_DIR})" ]; then
        echo "Building STAR genome index for GRCh38..."
        mkdir -p "${GENOME_DIR}"
        STAR --runMode genomeGenerate \
             --genomeDir "${GENOME_DIR}" \
             --genomeFastaFiles "${GENOME_FASTA}" \
             --sjdbGTFfile "${GTF_FILE}" \
             --sjdbOverhang "${SJDB_OVERHANG}" \
             --runThreadN 8 # Adjust thread count as needed
    else
        echo "STAR genome index already exists at ${GENOME_DIR}"
    fi
    
    # 2. Align Ribo-seq reads
    echo "Aligning Ribo-seq reads with STAR..."
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${INPUT_FASTQ}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outFilterType BySJout \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --alignSJFilterReads UniqueBoth \
         --alignSJDBoverhangMin 1 \
         --runThreadN 8 # Adjust thread count as needed
  2. 2

    Fastq reads were trimmed and mapped to human rRNA sequences using bowtie allowing 2 mismatches.

    Bowtie v1.3.1 GitHub
    $ Bash example
    # Install Bowtie (if not already installed)
    # conda install -c bioconda bowtie=1.3.1
    
    # Reference: Human rRNA sequences index (e.g., 'hg38_rRNA')
    # This index needs to be built beforehand from a FASTA file containing human rRNA sequences.
    # A common source for human rRNA sequences is NCBI or UCSC. For example, the eCLIP pipeline
    # often uses a custom FASTA file containing known human rRNA sequences (e.g., NR_003286.2 for 18S, NR_003287.2 for 28S).
    # Example command to build the index:
    # bowtie-build human_rRNA.fa hg38_rRNA
    
    # Input: Placeholder for trimmed Fastq reads
    INPUT_FASTQ="trimmed_reads.fastq"
    # Output: SAM file containing reads mapped to rRNA
    OUTPUT_SAM="rRNA_mapped.sam"
    
    # Map reads to human rRNA sequences allowing 2 mismatches
    # -v 2: Allow up to 2 mismatches in the alignment
    # -S: Output alignments in SAM format
    # hg38_rRNA: The name of the pre-built Bowtie index for human rRNA sequences
    # ${INPUT_FASTQ}: The input Fastq file
    bowtie -v 2 -S "${OUTPUT_SAM}" hg38_rRNA "${INPUT_FASTQ}"
    
    # If the primary goal is to filter out rRNA reads, you would typically use the --un option to output unmapped reads:
    # bowtie -v 2 --un "non_rRNA_reads.fastq" hg38_rRNA "${INPUT_FASTQ}"
  3. 3

    Unmapped reads were mapped to human genome GRCh38 with Ensembl gene annotation release 83 using STAR with parameters ‘--outSAMattributes All --outFilterMismatchNmax 2 --alignEndsType EndToEnd --outFilterIntronMotifs RemoveNoncanonicalUnannotated --alignIntronMax 20000 --outMultimapperOrder Random --outSAMmultNmax 1’

    STAR vSTAR (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # --- Prepare Genome Index (if not already done) ---
    # Download GRCh38 primary assembly FASTA (Ensembl release 83 corresponds to GRCh38.p7)
    # wget http://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    # gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    
    # Download Ensembl 83 GTF annotation
    # wget http://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens/Homo_sapiens.GRCh38.83.gtf.gz
    # gunzip Homo_sapiens.GRCh38.83.gtf.gz
    
    # Create STAR genome index
    # mkdir STAR_GRCh38_Ensembl83_index
    # STAR --runThreadN 8 \
    #      --runMode genomeGenerate \
    #      --genomeDir STAR_GRCh38_Ensembl83_index \
    #      --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    #      --sjdbGTFfile Homo_sapiens.GRCh38.83.gtf \
    #      --sjdbOverhang 100 # Adjust sjdbOverhang based on your read length (e.g., ReadLength - 1)
    
    # --- Alignment Step ---
    # Define input and output
    INPUT_FASTQ="input.fastq.gz" # Replace with your actual input FASTQ file(s)
    OUTPUT_PREFIX="aligned_reads_STAR_GRCh38_Ensembl83_"
    GENOME_DIR="STAR_GRCh38_Ensembl83_index"
    NUM_THREADS=8 # Adjust as needed
    
    STAR --runThreadN ${NUM_THREADS} \
         --genomeDir ${GENOME_DIR} \
         --readFilesIn ${INPUT_FASTQ} \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMattributes All \
         --outFilterMismatchNmax 2 \
         --alignEndsType EndToEnd \
         --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
         --alignIntronMax 20000 \
         --outMultimapperOrder Random \
         --outSAMmultNmax 1
  4. 4

    Reads with mapQ=0 or NH>5 were filtered out and only primary loci are counted.

    samtools (Inferred with models/gemini-2.5-flash), awk (Inferred with models/gemini-2.5-flash) vsamtools 1.9
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Filter reads based on mapQ, NH tag, and primary alignment status
    # -F 0x100: Exclude secondary alignments
    # -F 0x800: Exclude supplementary alignments
    # -q 1: Exclude reads with mapping quality 0
    # The awk command filters out reads where the NH tag is greater than 5.
    # It prints headers, reads without an NH tag, or reads where NH is 0-5.
    samtools view -h -F 0x100 -F 0x800 -q 1 input.bam | \
    awk 'BEGIN{FS="\t"}{if($1 ~ /^@/ || $0 !~ /NH:i:[0-9]+/ || $0 ~ /NH:i:[0-5]$/) print}' | \
    samtools view -bS -o filtered.bam

Tools Used

Raw Source Text
Library strategy: Ribosome profiling
Fastq reads were trimmed and mapped to human rRNA sequences using bowtie allowing 2 mismatches.
Unmapped reads were mapped to human genome GRCh38 with Ensembl gene annotation release 83 using STAR with parameters ‘--outSAMattributes All --outFilterMismatchNmax 2 --alignEndsType EndToEnd --outFilterIntronMotifs RemoveNoncanonicalUnannotated --alignIntronMax 20000 --outMultimapperOrder Random --outSAMmultNmax 1’
Reads with mapQ=0 or NH>5 were filtered out and only primary loci are counted.
Genome_build: GRCh38
Supplementary_files_format_and_content: tab-delimited text files include Ensembl gene id, gene symbol, length and raw counts
← Back to Analysis