GSE94460 Processing Pipeline

OTHER code_examples 4 steps

Publication

Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.

Nature methods (2021) — PMID 33963355

Dataset

Ribosome profiling of HEK293 cells.

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Library strategy: Ribosome profiling

Ribo-seq v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define reference genome paths (replace with actual paths)
# For human GRCh38 and Gencode v44
GENOME_DIR="/path/to/STAR_genome_index_GRCh38" # Directory for STAR genome index
GENOME_FASTA="/path/to/GRCh38.primary_assembly.genome.fa" # e.g., from Gencode or Ensembl
GTF_FILE="/path/to/gencode.v44.annotation.gtf" # e.g., from Gencode

# Define input and output paths
INPUT_FASTQ="ribo_seq_sample_R1.fastq.gz" # Assuming single-end reads, common for Ribo-seq
OUTPUT_PREFIX="ribo_seq_aligned" # Prefix for output files

# Define typical Ribo-seq read length for sjdbOverhang (e.g., 30nt)
# sjdbOverhang should be readLength - 1
READ_LENGTH=30
SJDB_OVERHANG=$((READ_LENGTH - 1))

# 1. Build STAR genome index (run once per genome version)
# Check if index directory exists and is populated
if [ ! -d "${GENOME_DIR}" ] || [ -z "$(ls -A ${GENOME_DIR})" ]; then
    echo "Building STAR genome index for GRCh38..."
    mkdir -p "${GENOME_DIR}"
    STAR --runMode genomeGenerate \
         --genomeDir "${GENOME_DIR}" \
         --genomeFastaFiles "${GENOME_FASTA}" \
         --sjdbGTFfile "${GTF_FILE}" \
         --sjdbOverhang "${SJDB_OVERHANG}" \
         --runThreadN 8 # Adjust thread count as needed
else
    echo "STAR genome index already exists at ${GENOME_DIR}"
fi

# 2. Align Ribo-seq reads
echo "Aligning Ribo-seq reads with STAR..."
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_FASTQ}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --alignSJFilterReads UniqueBoth \
     --alignSJDBoverhangMin 1 \
     --runThreadN 8 # Adjust thread count as needed

View on GitHub

Fastq reads were trimmed and mapped to human rRNA sequences using bowtie allowing 2 mismatches.

Bowtie v1.3.1 GitHub

$ Bash example

# Install Bowtie (if not already installed)
# conda install -c bioconda bowtie=1.3.1

# Reference: Human rRNA sequences index (e.g., 'hg38_rRNA')
# This index needs to be built beforehand from a FASTA file containing human rRNA sequences.
# A common source for human rRNA sequences is NCBI or UCSC. For example, the eCLIP pipeline
# often uses a custom FASTA file containing known human rRNA sequences (e.g., NR_003286.2 for 18S, NR_003287.2 for 28S).
# Example command to build the index:
# bowtie-build human_rRNA.fa hg38_rRNA

# Input: Placeholder for trimmed Fastq reads
INPUT_FASTQ="trimmed_reads.fastq"
# Output: SAM file containing reads mapped to rRNA
OUTPUT_SAM="rRNA_mapped.sam"

# Map reads to human rRNA sequences allowing 2 mismatches
# -v 2: Allow up to 2 mismatches in the alignment
# -S: Output alignments in SAM format
# hg38_rRNA: The name of the pre-built Bowtie index for human rRNA sequences
# ${INPUT_FASTQ}: The input Fastq file
bowtie -v 2 -S "${OUTPUT_SAM}" hg38_rRNA "${INPUT_FASTQ}"

# If the primary goal is to filter out rRNA reads, you would typically use the --un option to output unmapped reads:
# bowtie -v 2 --un "non_rRNA_reads.fastq" hg38_rRNA "${INPUT_FASTQ}"

View on GitHub

Unmapped reads were mapped to human genome GRCh38 with Ensembl gene annotation release 83 using STAR with parameters â--outSAMattributes All --outFilterMismatchNmax 2 --alignEndsType EndToEnd --outFilterIntronMotifs RemoveNoncanonicalUnannotated --alignIntronMax 20000 --outMultimapperOrder Random --outSAMmultNmax 1â

STAR vSTAR (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# --- Prepare Genome Index (if not already done) ---
# Download GRCh38 primary assembly FASTA (Ensembl release 83 corresponds to GRCh38.p7)
# wget http://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# Download Ensembl 83 GTF annotation
# wget http://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens/Homo_sapiens.GRCh38.83.gtf.gz
# gunzip Homo_sapiens.GRCh38.83.gtf.gz

# Create STAR genome index
# mkdir STAR_GRCh38_Ensembl83_index
# STAR --runThreadN 8 \
#      --runMode genomeGenerate \
#      --genomeDir STAR_GRCh38_Ensembl83_index \
#      --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
#      --sjdbGTFfile Homo_sapiens.GRCh38.83.gtf \
#      --sjdbOverhang 100 # Adjust sjdbOverhang based on your read length (e.g., ReadLength - 1)

# --- Alignment Step ---
# Define input and output
INPUT_FASTQ="input.fastq.gz" # Replace with your actual input FASTQ file(s)
OUTPUT_PREFIX="aligned_reads_STAR_GRCh38_Ensembl83_"
GENOME_DIR="STAR_GRCh38_Ensembl83_index"
NUM_THREADS=8 # Adjust as needed

STAR --runThreadN ${NUM_THREADS} \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${INPUT_FASTQ} \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMattributes All \
     --outFilterMismatchNmax 2 \
     --alignEndsType EndToEnd \
     --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
     --alignIntronMax 20000 \
     --outMultimapperOrder Random \
     --outSAMmultNmax 1

View on GitHub

Reads with mapQ=0 or NH>5 were filtered out and only primary loci are counted.

samtools (Inferred with models/gemini-2.5-flash), awk (Inferred with models/gemini-2.5-flash) vsamtools 1.9

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Filter reads based on mapQ, NH tag, and primary alignment status
# -F 0x100: Exclude secondary alignments
# -F 0x800: Exclude supplementary alignments
# -q 1: Exclude reads with mapping quality 0
# The awk command filters out reads where the NH tag is greater than 5.
# It prints headers, reads without an NH tag, or reads where NH is 0-5.
samtools view -h -F 0x100 -F 0x800 -q 1 input.bam | \
awk 'BEGIN{FS="\t"}{if($1 ~ /^@/ || $0 !~ /NH:i:[0-9]+/ || $0 ~ /NH:i:[0-5]$/) print}' | \
samtools view -bS -o filtered.bam

Tools Used

Ribo-seq STAR

Raw Source Text

Library strategy: Ribosome profiling
Fastq reads were trimmed and mapped to human rRNA sequences using bowtie allowing 2 mismatches.
Unmapped reads were mapped to human genome GRCh38 with Ensembl gene annotation release 83 using STAR with parameters â--outSAMattributes All --outFilterMismatchNmax 2 --alignEndsType EndToEnd --outFilterIntronMotifs RemoveNoncanonicalUnannotated --alignIntronMax 20000 --outMultimapperOrder Random --outSAMmultNmax 1â
Reads with mapQ=0 or NH>5 were filtered out and only primary loci are counted.
Genome_build: GRCh38
Supplementary_files_format_and_content: tab-delimited text files include Ensembl gene id, gene symbol, length and raw counts

← Back to Analysis