GSE109423 Processing Pipeline

RNA-Seq code_examples 6 steps

Publication

A Transcriptome-wide Translational Program Defined by LIN28B Expression Level.

Molecular cell (2019) — PMID 30527666

Dataset

A transcriptome-wide divergence in protein translation scales with LIN28B expression

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Library strategy: ribosome profiling

Ribo-seq v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Replace with actual paths and filenames
GENOME_DIR="/path/to/STAR_genome_index/hg38" # Directory containing STAR genome index files (e.g., SA, SAindex)
READS_FILE="input_riboseq_reads.fastq.gz" # Input gzipped FASTQ file
OUTPUT_PREFIX="riboseq_alignment" # Prefix for output files
THREADS=8 # Number of threads to use

# Note: A STAR genome index must be pre-built using STAR --runMode genomeGenerate.
# Example command for building index (run once per genome/annotation combination):
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/hg38.fa \
#      --sjdbGTFfile /path/to/hg38.gtf \
#      --runThreadN ${THREADS}

# Align Ribo-seq reads using STAR
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READS_FILE} \
     --runThreadN ${THREADS} \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped None \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --alignSJoverhangMin 8 \
     --alignSJDBoverhangMin 1 \
     --sjdbScore 1 \
     --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB)

View on GitHub

Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.

cutadapt v3.4 GitHub

$ Bash example

# Install cutadapt if not already available
# conda install -c bioconda cutadapt=3.4

# Define input and output files
INPUT_READS="raw_reads.fastq.gz"
OUTPUT_TRIMMED_READS="trimmed_reads.fastq.gz"

# Define adapter and barcode sequences common in eCLIP
# Common Illumina TruSeq Universal Adapter (3' end)
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# Placeholder for a 5' barcode/UMI (e.g., 8 Ns)
# In a real eCLIP experiment, this would be a specific barcode sequence or a UMI pattern.
BARCODE_5PRIME="NNNNNNNN"

# Trim adapter and barcode sequences, perform quality trimming, and filter by length
cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -g "${BARCODE_5PRIME}" \
  -m 18 \
  -q 20 \
  -e 0.1 \
  --cores 4 \
  -o "${OUTPUT_TRIMMED_READS}" \
  "${INPUT_READS}"

View on GitHub

Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences.

bowtie2 (Inferred with models/gemini-2.5-flash) v2.4.5 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install bowtie2 if not already installed
# conda install -c bioconda bowtie2

# --- Reference Data Preparation (RepBase) ---
# RepBase sequences are typically obtained from the Genetic Information Research Institute (GIRI).
# For demonstration, we assume a RepBase FASTA file is available. 
# The actual download might involve specific species or versions (e.g., human, mouse).
# Example (illustrative, actual URL/process may vary):
# wget -O RepBase.fasta "http://www.girinst.org/repbase/update/RepBase.fasta"

REPBASE_FASTA="RepBase.fasta" # Placeholder for the RepBase FASTA file
REPBASE_INDEX_PREFIX="RepBase_index"

# Build bowtie2 index for RepBase
# This step only needs to be done once per RepBase FASTA
if [ ! -f "${REPBASE_INDEX_PREFIX}.1.bt2" ]; then
    echo "Building bowtie2 index for RepBase..."
    bowtie2-build "${REPBASE_FASTA}" "${REPBASE_INDEX_PREFIX}"
    echo "bowtie2 index built."
else
    echo "bowtie2 index for RepBase already exists."
fi

# --- Read Mapping and Filtering ---
INPUT_FASTQ="trimmed_reads.fastq.gz" # Input trimmed reads
OUTPUT_UNMAPPED_FASTQ="reads_without_repetitive_sequences.fastq.gz" # Output reads that did not map to RepBase
NUM_THREADS=8 # Number of threads to use

echo "Mapping trimmed reads against RepBase to remove repetitive sequences..."
bowtie2 -x "${REPBASE_INDEX_PREFIX}" \
        -U "${INPUT_FASTQ}" \
        --un-gz "${OUTPUT_UNMAPPED_FASTQ}" \
        -S /dev/null \
        -p "${NUM_THREADS}" \
        --very-sensitive-local # Using a sensitive local alignment mode for better detection of repetitive elements

echo "Reads not mapping to RepBase saved to ${OUTPUT_UNMAPPED_FASTQ}"

View on GitHub

Remaining reads were mapped to the appropriate genome build using STAR aligner

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda create -n star_env star=2.7.10a -c bioconda -c conda-forge
# conda activate star_env

# Define variables
# Placeholder for GRCh38 genome index. Replace with the actual path to your STAR genome index.
GENOME_DIR="/path/to/STAR_genome_index/GRCh38"

# Input FASTQ files. Assuming paired-end reads. Adjust if single-end.
READS_R1="remaining_reads_R1.fastq.gz"
READS_R2="remaining_reads_R2.fastq.gz"

OUTPUT_PREFIX="remaining_reads_aligned"
NUM_THREADS=8 # Adjust based on available CPU cores
RAM_LIMIT_BAM_SORT=30000000000 # 30GB, adjust based on available RAM

# Run STAR alignment
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --readFilesCommand zcat \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outFilterType BySJout \
     --outFilterMismatchNmax 10 \
     --outFilterScoreMinOverLread 0.3 \
     --outFilterMatchNminOverLread 0.3 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM "${RAM_LIMIT_BAM_SORT}"

View on GitHub

RPKM values were calculated for all polysome and rnaseq samples.

RNA-seq v1.3.3 GitHub

$ Bash example

# Install RSEM (if not already installed)
# conda install -c bioconda rsem

# --- Reference Preparation (run once) ---
# RSEM requires a reference genome and a GTF annotation file to build its index.
# For human (GRCh38), you can download these from Ensembl or GENCODE.
# Example for human GRCh38 (Ensembl release 109):
# wget -O Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# wget -O Homo_sapiens.GRCh38.109.gtf.gz ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
# gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# gunzip Homo_sapiens.GRCh38.109.gtf.gz

# Define paths for reference files (replace with your actual paths)
# GENOME_FA="/path/to/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
# GTF_FILE="/path/to/Homo_sapiens.GRCh38.109.gtf"
# RSEM_REF_INDEX="/path/to/rsem_ref_index_GRCh38"

# Build RSEM reference index (uncomment and run once per reference)
# mkdir -p $(dirname "${RSEM_REF_INDEX}")
# rsem-prepare-reference \
#     --gtf "${GTF_FILE}" \
#     "${GENOME_FA}" \
#     "${RSEM_REF_INDEX}"

# --- RPKM Calculation for Samples ---
# Define paths for input FASTQ files and output results
# INPUT_DIR="/path/to/your/fastq_files"
# OUTPUT_DIR="/path/to/your/rsem_results"
# RSEM_REF_INDEX="/path/to/rsem_ref_index_GRCh38" # Same as used for reference preparation

# Create output directory if it doesn't exist
# mkdir -p "${OUTPUT_DIR}"

# Loop through samples (assuming paired-end FASTQ files: SAMPLE_ID_R1.fastq.gz, SAMPLE_ID_R2.fastq.gz)
# Replace 'sample1 sample2' with the actual list of your sample IDs
for sample_id in sample1 sample2; do
    READ1="${INPUT_DIR}/${sample_id}_R1.fastq.gz"
    READ2="${INPUT_DIR}/${sample_id}_R2.fastq.gz"
    OUTPUT_PREFIX="${OUTPUT_DIR}/${sample_id}"

    echo "Processing sample: ${sample_id}"
    rsem-calculate-expression \
        --paired-end \
        --num-threads 8 \
        "${READ1}" \
        "${READ2}" \
        "${RSEM_REF_INDEX}" \
        "${OUTPUT_PREFIX}"
done

# RPKM values for genes are typically found in the '.genes.results' file
# The 6th column of this file (0-indexed) contains the RPKM values.
# Example: To view gene_id and RPKM for 'sample1':
# head -n 1 "${OUTPUT_DIR}/sample1.genes.results"
# cat "${OUTPUT_DIR}/sample1.genes.results" | awk 'NR==1 || $1 ~ /^ENSG/ {print $1, $6}'

View on GitHub

For eCLIP samples, read densities were calculated to identify eCLIP peaks. eCLIP peaks were identified using clipper

CLIPper vNot specified (from yeolab/clipper GitHub repo, last commit 2019-07-17) GitHub

$ Bash example

# Clone the clipper repository if you haven't already
# git clone https://github.com/yeolab/clipper.git
# cd clipper

# Install dependencies (if not already present in your environment)
# CLIPper is a Python script and typically requires numpy, scipy, and pysam.
# conda install -c conda-forge numpy scipy pysam

# Define input files and parameters
# Replace with your actual file paths and desired parameters
TREATMENT_BAM="path/to/your/eCLIP_sample.bam"
CONTROL_BAM="path/to/your/SMInput_or_IgG_control.bam" # This is typically the size-matched input or IgG control BAM
GENOME_SIZE="hg38" # Example: hg38 for human, mm10 for mouse. Adjust as needed.
OUTPUT_PREFIX="eCLIP_peaks_output"
P_VALUE_THRESHOLD="0.01" # Default p-value threshold in clipper.py
FDR_THRESHOLD="0.05"   # Default FDR threshold in clipper.py
MIN_PEAK_LENGTH="10"   # Default minimum peak length in clipper.py

# Execute clipper
# Ensure clipper.py is in your current directory or in your PATH
python clipper.py \
    -t "${TREATMENT_BAM}" \
    -c "${CONTROL_BAM}" \
    -s "${GENOME_SIZE}" \
    -o "${OUTPUT_PREFIX}" \
    -p "${P_VALUE_THRESHOLD}" \
    -f "${FDR_THRESHOLD}" \
    -l "${MIN_PEAK_LENGTH}"

View on GitHub

Tools Used

Ribo-seq STAR RNA-seq

Raw Source Text

Library strategy: ribosome profiling
Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt. Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences. Remaining reads were mapped to the appropriate genome build using STAR aligner
RPKM values were calculated for all polysome and rnaseq samples. For eCLIP samples, read densities were calculated to identify eCLIP peaks. eCLIP peaks were identified using clipper
Genome_build: mm10 and hg19
Supplementary_files_format_and_content: RPKM files for all polysome samples. eCLIP input normalized peaks in BED format

← Back to Analysis