GSE195494 Processing Pipeline

RIP-Seq code_examples 5 steps

Publication

Nuclear and cytoplasmic poly(A) binding proteins (PABPs) favor distinct transcripts and isoforms.

Nucleic acids research (2022) — PMID 35438785

Dataset

GSE195494

Ribo-STAMP reveals translation status of RNAs bound by PABPN and PABPC

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

All reads were aligned to the human genome hg38 primary assembly using STAR.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (e.g., using conda)
# conda create -n star_env star=2.7.10a -c bioconda -c conda-forge
# conda activate star_env

# Define variables
NUM_THREADS=8 # Example: Number of threads
GENOME_DIR="/path/to/STAR_index/hg38_primary_assembly" # Path to pre-built STAR index for hg38 primary assembly
READS_R1="input_reads_R1.fastq.gz" # Input FASTQ file for Read 1
# READS_R2="input_reads_R2.fastq.gz" # Input FASTQ file for Read 2 (uncomment and provide if paired-end)
OUTPUT_PREFIX="aligned_reads" # Prefix for output files

# Align reads to the human genome hg38 primary assembly
STAR --runThreadN ${NUM_THREADS} \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READS_R1} ${READS_R2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped None \
     --outSAMattributes Standard \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.1 \
     --outFilterScoreMinOverLread 0.66 \
     --outFilterMatchNminOverLread 0.66 \
     --limitBAMsortRAM 30000000000

View on GitHub

featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34

featureCounts v2.0.2

$ Bash example

# Install featureCounts (part of the Subread package)
# conda install -c bioconda subread

# Example command for annotating reads (e.g., counting reads per gene)
# Replace 'input.bam' with your actual alignment file(s).
# Replace 'custom_gencode_v34.gtf' with the path to your custom GTF file.
# The output 'counts.txt' will contain the read counts per feature.
featureCounts -a custom_gencode_v34.gtf -o counts.txt --minOverlap 20 -F GTF -t exon -g gene_id input.bam

EPKM values were determined as in (Brannan et al., 2021).

Skipper (Inferred with models/gemini-2.5-flash) v2021 GitHub

$ Bash example

# Install bedtools if not available
# conda install -c bioconda bedtools

# Define input files and reference data
INPUT_BAM="aligned_reads/sample.bam" # Aligned reads from eCLIP experiment
FEATURES_BED="eclip_peaks/merged_peaks.bed" # Regions to quantify (e.g., merged eCLIP peaks or gene annotations)
OUTPUT_EPKM_FILE="quantification/sample.epkm.tsv"

# Step 1: Count reads overlapping features using bedtools multicov
# This command counts reads in INPUT_BAM that overlap regions in FEATURES_BED
# The output will have the original BED fields plus the read count as the last column.
bedtools multicov -bams "$INPUT_BAM" -bed "$FEATURES_BED" > "${OUTPUT_EPKM_FILE}.counts"

# Step 2: Calculate EPKM from counts
# EPKM = (reads_in_feature / feature_length_kb) / (total_mapped_reads / 1,000,000)
# This step typically involves a custom script that reads the counts, feature lengths,
# and total mapped reads (obtained from BAM file stats, e.g., samtools idxstats).
# For demonstration, we'll use a Python script.

# Get total mapped reads (example using samtools, requires BAM index)
# samtools index "$INPUT_BAM"
# TOTAL_MAPPED_READS=$(samtools idxstats "$INPUT_BAM" | awk '{sum += $3} END {print sum}')
# If samtools is not available or BAM is not indexed, use a placeholder:
TOTAL_MAPPED_READS=10000000 # Placeholder for total mapped reads in the BAM file

python -c "
import pandas as pd
import sys

counts_file = sys.argv[1]
output_file = sys.argv[2]
total_mapped_reads = int(sys.argv[3])

# Read counts from bedtools multicov output
# Assuming bedtools multicov output format: chrom, start, end, name, score, strand, count
df = pd.read_csv(counts_file, sep='\t', header=None,
                 names=['chrom', 'start', 'end', 'name', 'score', 'strand', 'count'])

# Calculate feature length in kilobases
df['length_kb'] = (df['end'] - df['start']) / 1000.0

# Calculate EPKM
# EPKM = (reads_in_feature / feature_length_kb) / (total_mapped_reads / 1,000,000)
# Add a small epsilon to length_kb to avoid division by zero for features with length 0
df['epkm'] = (df['count'] / (df['length_kb'] + 1e-9)) / (total_mapped_reads / 1_000_000.0)

# Select relevant columns and save to output
df[['name', 'epkm']].to_csv(output_file, sep='\t', index=False)
" "${OUTPUT_EPKM_FILE}.counts" "$OUTPUT_EPKM_FILE" "$TOTAL_MAPPED_READS"

echo "EPKM values calculated and saved to $OUTPUT_EPKM_FILE"

View on GitHub

Furthermore, for each gene, EPKM values obtained in the Control-STAMP cell line were subtracted from the RPS2-STAMP EPKM values, giving a normalized EPKM value for each gene which indicates the degree of translation of that transcript.

awk (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# Assuming input files 'control_epkm.tsv' and 'rps2_epkm.tsv'
# Each file is expected to be tab-separated with at least two columns:
# 1. Gene Identifier (e.g., GeneID)
# 2. EPKM value
# Example content for control_epkm.tsv:
# GeneA\t100.5
# GeneB\t50.2
# GeneC\t200.1

# Example content for rps2_epkm.tsv:
# GeneA\t120.7
# GeneB\t60.1
# GeneC\t180.5

# Step 1: Ensure both input files are sorted by the gene identifier.
# This is crucial for the 'join' command to work correctly.
# For simplicity, assuming no headers or headers are handled upstream.
sort -k1,1 control_epkm.tsv > control_epkm_sorted.tsv
sort -k1,1 rps2_epkm.tsv > rps2_epkm_sorted.tsv

# Step 2: Join the sorted files based on the gene identifier (first column).
# The 'join' command merges lines from two files that have a common field.
# -t $'\t': Specifies tab as the field separator.
# -1 1: Specifies the join field from file 1 (control_epkm_sorted.tsv) is the first column.
# -2 1: Specifies the join field from file 2 (rps2_epkm_sorted.tsv) is the first column.
# The output of join will be: GeneID\tControl_EPKM\tRPS2_EPKM
# Example: GeneA\t100.5\t120.7
joined_epkm_file=$(mktemp)
join -t $'\t' -1 1 -2 1 control_epkm_sorted.tsv rps2_epkm_sorted.tsv > "$joined_epkm_file"

# Step 3: Use awk to subtract the Control-STAMP EPKM from RPS2-STAMP EPKM.
# The 'awk' command processes the joined file.
# -F'\t': Specifies tab as the input field separator.
# OFS='\t': Specifies tab as the output field separator.
# '{print $1, $3 - $2}': Prints the gene identifier ($1) and the result of (RPS2_EPKM - Control_EPKM).
# In the joined file: $1 is GeneID, $2 is Control_EPKM, $3 is RPS2_EPKM.
awk -F'\t' '{print $1, $3 - $2}' OFS='\t' "$joined_epkm_file" > normalized_epkm.tsv

# Clean up temporary files
rm control_epkm_sorted.tsv rps2_epkm_sorted.tsv "$joined_epkm_file"

Control-STAMP PABPC IP Rep3 was not a successful immunoprecipitation and thus was excluded from analysis.

Snakemake (Inferred with models/gemini-2.5-flash) v7.32.4 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# The description indicates that 'Control-STAMP PABPC IP Rep3' was deemed an unsuccessful immunoprecipitation (IP) and was therefore excluded from downstream analysis. This is a quality control (QC) decision.
# In the context of a bioinformatics pipeline, such an exclusion is typically managed by the workflow system itself, often by flagging the sample in a sample manifest or configuration file.
# For eCLIP assays, the Yeo lab's Skipper workflow (which uses Snakemake) would handle this by reading a sample sheet where the problematic sample is either omitted or explicitly marked for exclusion (e.g., with a 'FAIL' status).

# Example of a hypothetical 'samples.tsv' file that the Skipper workflow would read:
# sample_id	condition	replicate	fastq_r1	fastq_r2	qc_status	notes
# SampleA	control	1	sampleA_R1.fastq.gz	sampleA_R2.fastq.gz	PASS	Successful IP
# Control-STAMP PABPC IP Rep3	IP	3	control_stamp_pabpc_ip_rep3_R1.fastq.gz	control_stamp_pabpc_ip_rep3_R2.fastq.gz	FAIL	Unsuccessful immunoprecipitation
# SampleB	treatment	1	sampleB_R1.fastq.gz	sampleB_R2.fastq.gz	PASS	Successful IP

# The Snakemake workflow (Skipper) would then be executed, and its internal rules would skip processing 'Control-STAMP PABPC IP Rep3' based on its 'qc_status' or its absence from the list of samples to process.

# Installation (example, if Snakemake is not already available and a specific version is desired):
# conda create -n skipper_env -c bioconda -c conda-forge snakemake-minimal=7.32.4
# conda activate skipper_env

# Navigate to the Skipper workflow directory (replace with actual path to your Skipper clone)
# cd /path/to/yeolab/skipper

# Execute the Snakemake workflow.
# The 'sample_sheet' config parameter points to the manifest file (e.g., 'samples.tsv').
# The workflow's internal logic will then ensure 'Control-STAMP PABPC IP Rep3' is excluded from analysis.
snakemake --cores 8 --use-conda --config sample_sheet=samples.tsv

View on GitHub

Tools Used

STAR

Raw Source Text

All reads were aligned to the human genome hg38 primary assembly using STAR.
featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34
EPKM values were determined as in (Brannan et al., 2021). Furthermore, for each gene, EPKM values obtained in the Control-STAMP cell line were subtracted from the RPS2-STAMP EPKM values, giving a normalized EPKM value for each gene which indicates the degree of translation of that transcript. Control-STAMP PABPC IP Rep3 was not a successful immunoprecipitation and thus was excluded from analysis.
Genome_build: hg38
Supplementary_files_format_and_content: comma separated file with normalized epkm values calculated for each gene in each condition (Input, PABPN IP or PABPC IP). Transcript per kilobase Million (TPM) values are included for each condition so that users can filter based on the overall abundance of a gene

← Back to Analysis