GSE117291 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Large-scale tethered function assays identify factors that regulate mRNA stability and translation.

Nature structural & molecular biology (2020) — PMID 32807991

Dataset

Large-scale tethered function assays identify factors that regulate mRNA stability and translation (polysome profiling)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# --- Configuration ---
# Placeholder for input trimmed RNA-seq reads (output from cutadapt)
# Replace with the actual path to your trimmed FASTQ file(s)
TRIMMED_READS_FASTQ="trimmed_reads.fastq.gz"

# Placeholder for the STAR genome index directory built from RepBase v18.04 sequences.
# You would need to download RepBase v18.04 sequences and build a STAR index
# using `STAR --runMode genomeGenerate --genomeDir /path/to/RepBase_v18.04_STAR_index --genomeFastaFiles RepBase_v18.04.fasta`
REPBASE_STAR_INDEX_DIR="/path/to/RepBase_v18.04_STAR_index"

# Output directory for STAR results
OUTPUT_DIR="star_repbase_mapping_results"
mkdir -p "${OUTPUT_DIR}"

# Number of threads for STAR
NUM_THREADS=8 # Adjust based on available resources

# --- STAR Mapping Command ---
STAR \
  --runThreadN "${NUM_THREADS}" \
  --genomeDir "${REPBASE_STAR_INDEX_DIR}" \
  --readFilesIn "${TRIMMED_READS_FASTQ}" \
  --outFileNamePrefix "${OUTPUT_DIR}/repbase_mapped." \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 100 \
  --outFilterMismatchNmax 10

View on GitHub

Reads did not map to repetitive elements were then mapped to the human genome (hg19).

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda create -n star_env star=2.7.10a -y
# conda activate star_env

# --- Reference Genome Preparation (hg19) ---
# This step is for building the STAR index. It needs to be done once.
# Download hg19 FASTA and GTF files from UCSC:
# mkdir -p ~/star_index_references
# cd ~/star_index_references
# wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.refGene.gtf.gz
# gunzip hg19.fa.gz
# gunzip hg19.refGene.gtf.gz

# Create STAR genome index for hg19:
# mkdir -p ~/star_index/hg19
# STAR --runMode genomeGenerate \
#      --genomeDir ~/star_index/hg19 \
#      --genomeFastaFiles ~/star_index_references/hg19.fa \
#      --sjdbGTFfile ~/star_index_references/hg19.refGene.gtf \
#      --sjdbOverhang 100 \
#      --runThreadN 8 # Adjust threads as needed

# --- Alignment Step ---
# Input: FASTQ file containing reads that did not map to repetitive elements.
# Replace 'unmapped_to_repeats.fastq.gz' with your actual input file.
# Replace 'aligned_to_hg19' with your desired output prefix.
# Replace '~/star_index/hg19' with the actual path to your STAR hg19 index.
# Adjust --runThreadN based on available CPU cores.
# Adjust --limitBAMsortRAM based on available RAM (e.g., 30GB = 30000000000 bytes).

STAR --genomeDir ~/star_index/hg19 \
     --readFilesIn unmapped_to_repeats.fastq.gz \
     --runThreadN 8 \
     --outFileNamePrefix aligned_to_hg19_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes NH HI AS NM MD \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --alignSJoverhangMin 8 \
     --alignSJDBoverhangMin 1 \
     --sjdbScore 1 \
     --readFilesCommand zcat \
     --outSAMunmapped Within \
     --outFilterScoreMin 0 \
     --outFilterScoreMinOverLread 0 \
     --outFilterMatchNmin 0 \
     --outFilterMatchNminOverLread 0 \
     --limitBAMsortRAM 30000000000

View on GitHub

Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.

featureCounts v1.5.0

$ Bash example

# Download GENCODE v19 annotation GTF file
# Source: https://www.gencodegenes.org/human/release_19.html
# wget is usually pre-installed or easily installed via package manager
# sudo apt-get update && sudo apt-get install -y wget
wget -O gencode.v19.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz

# Decompress the GTF file
# gunzip is usually pre-installed
gunzip gencode.v19.annotation.gtf.gz

# Run featureCounts to create read count matrices
# Replace 'input_sampleX.bam' with your actual aligned BAM files.
# -a: Annotation file in GTF/GFF format.
# -o: Output file for read counts.
# -t exon: Count reads mapping to 'exon' features (default for gene-level counting).
# -g gene_id: Aggregate counts by 'gene_id' attribute in the GTF (default for gene-level counting).
# -F GTF: Specify input annotation file format as GTF.
# -T 8: Use 8 threads (adjust as needed).
# -p: Include this flag if your reads are paired-end.
# -s 2: Include this flag if your RNA-seq library is reverse stranded (0 for unstranded, 1 for forward).
featureCounts -a gencode.v19.annotation.gtf -o read_counts.txt -t exon -g gene_id -F GTF -T 8 input_sample1.bam input_sample2.bam input_sample3.bam

Differential expression was calculated using DESeq2 version 1.10.1

DESeq2 v1.10.1 GitHub

$ Bash example

# This script assumes you have R and DESeq2 installed.
# To install DESeq2 version 1.10.1 (which corresponds to Bioconductor 3.1, compatible with R 3.2.x):
# First, ensure you have R 3.2.x installed.
# Then, run the following inside R:
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install(version = "3.1") # This installs Bioconductor 3.1
# BiocManager::install("DESeq2") # This will install DESeq2 1.10.1 for Bioc 3.1

# Create an R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
library(DESeq2)

# --- Configuration ---
# Input count matrix file (e.g., from featureCounts, HTSeq-count)
# Expected format: gene_id as first column, sample counts in subsequent columns.
COUNT_FILE="counts.tsv"

# Sample metadata file
# Expected format: sample_id as first column, metadata columns (e.g., 'condition', 'batch')
# Sample IDs in this file must match column names in COUNT_FILE.
SAMPLE_INFO_FILE="sample_info.tsv"

# Output file for differential expression results
OUTPUT_RESULTS_FILE="deseq2_results.csv"

# Design formula for DESeq2 (e.g., ~ condition, ~ batch + condition)
DESIGN_FORMULA_STR="~ condition"

# Contrast for results (e.g., c("condition", "treated", "control"))
# If NULL, DESeq2 will pick the last level of the primary variable as reference.
CONTRAST_VECTOR=c("condition", "treated", "control") # Example: comparing 'treated' vs 'control'

# --- Load Data ---
# Load count data
count_data <- read.table(COUNT_FILE, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)

# Load sample information
sample_info <- read.table(SAMPLE_INFO_FILE, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)

# Ensure sample order and names match
common_samples <- intersect(colnames(count_data), rownames(sample_info))
if (length(common_samples) == 0) {
    stop("No common samples found between count data and sample information.")
}
count_data <- count_data[, common_samples]
sample_info <- sample_info[common_samples, , drop = FALSE]

# Convert design formula string to formula object
design_formula <- as.formula(DESIGN_FORMULA_STR)

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = design_formula)

# --- Pre-filtering (optional but recommended) ---
# Remove genes with very low counts (e.g., less than 10 total counts across all samples)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# --- Run DESeq2 Analysis ---
dds <- DESeq(dds)

# --- Extract Results ---
if (!is.null(CONTRAST_VECTOR)) {
    res <- results(dds, contrast = CONTRAST_VECTOR)
} else {
    res <- results(dds)
}

# Order results by adjusted p-value
res <- res[order(res$padj),]

# Save results
write.csv(as.data.frame(res), file = OUTPUT_RESULTS_FILE)

# Optional: Save normalized counts
# normalized_counts <- counts(dds, normalized=TRUE)
# write.csv(as.data.frame(normalized_counts), file = "normalized_counts.csv")

message(paste("DESeq2 analysis complete. Results saved to", OUTPUT_RESULTS_FILE))

EOF

# Execute the R script
Rscript run_deseq2.R

# --- Placeholder for creating dummy input files for demonstration ---
# In a real pipeline, these files would come from upstream steps (e.g., featureCounts, Salmon quantification)
# echo -e "gene\tsample1_control\tsample2_control\tsample3_treated\tsample4_treated" > counts.tsv
# echo -e "geneA\t100\t120\t500\t550" >> counts.tsv
# echo -e "geneB\t50\t60\t30\t35" >> counts.tsv
# echo -e "geneC\t200\t210\t220\t230" >> counts.tsv
# echo -e "geneD\t10\t12\t15\t18" >> counts.tsv
# echo -e "geneE\t5\t7\t8\t9" >> counts.tsv # This gene might be filtered out

# echo -e "sample\tcondition" > sample_info.tsv
# echo -e "sample1_control\tcontrol" >> sample_info.tsv
# echo -e "sample2_control\tcontrol" >> sample_info.tsv
# echo -e "sample3_treated\ttreated" >> sample_info.tsv
# echo -e "sample4_treated\ttreated" >> sample_info.tsv

View on GitHub

The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.

Custom Python script (Inferred with models/gemini-2.5-flash) vPython 3.x (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Create dummy input files for demonstration
# In a real pipeline, 'input_counts.tsv' and 'polysome_counts.tsv' would be outputs from a read counting tool (e.g., featureCounts, HTSeq-count).

# gene_lengths.tsv: gene_id<tab>length_in_bp
# Source: Derived from a genome annotation file (e.g., GTF/GFF) for a specific assembly (e.g., GRCh38/hg38).
echo -e "geneA\t1000" > gene_lengths.tsv
echo -e "geneB\t2000" >> gene_lengths.tsv
echo -e "geneC\t500" >> gene_lengths.tsv

# input_counts.tsv: gene_id<tab>read_count (for input fraction)
echo -e "geneA\t100" > input_counts.tsv
echo -e "geneB\t200" >> input_counts.tsv
echo -e "geneC\t50" >> input_counts.tsv

# polysome_counts.tsv: gene_id<tab>read_count (for polysome fraction)
echo -e "geneA\t150" > polysome_counts.tsv
echo -e "geneB\t300" >> polysome_counts.tsv
echo -e "geneC\t75" >> polysome_counts.tsv

# Create a Python script to calculate RPKM
# This script requires the pandas library. Install with: # pip install pandas
cat << 'EOF' > rpkm_calculator.py
import pandas as pd
import sys

def calculate_rpkm(counts_file, lengths_file, output_file):
    """
    Calculates RPKM from read counts and gene lengths.
    Formula: RPKM = (C * 10^9) / (N * L)
    C: Number of reads mapped to a transcript/gene
    N: Total number of mapped reads in the experiment (sum of all C values for the sample)
    L: Length of the transcript/gene in base pairs
    """
    try:
        counts_df = pd.read_csv(counts_file, sep='\t', header=None, names=['gene_id', 'read_count'])
        lengths_df = pd.read_csv(lengths_file, sep='\t', header=None, names=['gene_id', 'length_in_bp'])
    except FileNotFoundError as e:
        print(f"Error: Input file not found - {e}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error reading input files: {e}", file=sys.stderr)
        sys.exit(1)

    # Merge counts and lengths based on gene_id
    merged_df = pd.merge(counts_df, lengths_df, on='gene_id', how='inner')

    if merged_df.empty:
        print(f"Warning: No matching genes found between {counts_file} and {lengths_file}. Output file will be empty.", file=sys.stderr)
        merged_df['rpkm'] = []
    else:
        # Calculate total mapped reads (N) for the current sample
        total_mapped_reads = merged_df['read_count'].sum()

        if total_mapped_reads == 0:
            print(f"Warning: Total mapped reads for {counts_file} is zero. RPKM will be zero for all genes.", file=sys.stderr)
            merged_df['rpkm'] = 0.0
        else:
            # Calculate RPKM: (C * 10^9) / (N * L)
            merged_df['rpkm'] = (merged_df['read_count'] * 1_000_000_000) / (total_mapped_reads * merged_df['length_in_bp'])

    # Save results
    merged_df[['gene_id', 'rpkm']].to_csv(output_file, sep='\t', index=False, header=False)
    print(f"RPKM calculated and saved to {output_file}", file=sys.stderr)

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python rpkm_calculator.py <counts_file> <lengths_file> <output_file>", file=sys.stderr)
        sys.exit(1)
    
    counts_file = sys.argv[1]
    lengths_file = sys.argv[2]
    output_file = sys.argv[3]
    
    calculate_rpkm(counts_file, lengths_file, output_file)
EOF

# Calculate RPKM for input fraction
python rpkm_calculator.py input_counts.tsv gene_lengths.tsv input_rpkm.tsv

# Calculate RPKM for polysome fraction
python rpkm_calculator.py polysome_counts.tsv gene_lengths.tsv polysome_rpkm.tsv

# Clean up the script (optional, but good practice in a pipeline)
rm rpkm_calculator.py

View on GitHub

Tools Used

STAR DESeq2

Raw Source Text

RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.
Differential expression was calculated using DESeq2 version 1.10.1
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
Genome_build: Homo sapiens UCSC hg19

← Back to Analysis