GSE117293 Processing Pipeline

RNA-Seq, 5 steps

Publication

Large-scale tethered function assays identify factors that regulate mRNA stability and translation.

Nature Structural & Molecular Biology (2020), PMID 32807991

Dataset

Large-scale tethered function assays identify factors that regulate mRNA stability and translation (HEK293T_RNA-seq)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Adaptor sequences were trimmed from the RNA-sequencing reads using cutadapt (v1.4.0), and the trimmed reads were mapped to repetitive elements (RepBase v18.04) using STAR (v2.4.0i).
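The trimming part of this step has no example of its own, so here is a minimal sketch of how it might look. The adapter sequence, the minimum length, the file names, and the `trim_reads` wrapper are all assumptions for illustration; the source text states only the tool and version. The wrapper prints the command by default (`DRY_RUN=1`) so it can be inspected before anything is run.

```shell
# Placeholder inputs and settings: none of these values are given in the source text.
RAW_READS="raw_reads.fastq.gz"
TRIMMED_READS="trimmed_reads.fastq.gz"
ADAPTER_3P="AGATCGGAAGAGC"   # assumed 3' adapter; replace with your library's adapter
MIN_LENGTH=18                # assumed minimum read length to keep after trimming
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the command, 0 = actually run cutadapt

# Hypothetical wrapper: $1 = raw FASTQ, $2 = trimmed FASTQ to write
trim_reads() {
    # Build the cutadapt command line from the settings above
    set -- cutadapt -a "${ADAPTER_3P}" -m "${MIN_LENGTH}" -o "$2" "$1"
    if [ "${DRY_RUN}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

trim_reads "${RAW_READS}" "${TRIMMED_READS}"
```

Setting `DRY_RUN=0` in the environment runs cutadapt for real; the output name matches the `trimmed_reads.fastq.gz` placeholder used by the STAR example for this step.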

STAR v2.4.0i GitHub

$ Bash example

# Install STAR if not already available
# conda install -c bioconda star

# Define reference files and directories
REPBASE_FASTA="RepBase_v18.04.fasta" # Placeholder for RepBase v18.04 FASTA file
STAR_INDEX_DIR="RepBase_STAR_index"
INPUT_READS="trimmed_reads.fastq.gz" # Placeholder for trimmed RNA-seq reads (output from cutadapt)

# Create STAR genome index for RepBase
# This step assumes you have the RepBase v18.04 FASTA file.
# RepBase data typically requires a license for download from GIRI.
# For a small reference such as a repeat library, the STAR manual recommends lowering
# --genomeSAindexNbases (default 14) to min(14, log2(GenomeLength)/2 - 1).
mkdir -p "${STAR_INDEX_DIR}"
STAR --runMode genomeGenerate \
     --genomeDir "${STAR_INDEX_DIR}" \
     --genomeFastaFiles "${REPBASE_FASTA}" \
     --runThreadN 8 # Adjust thread count as needed

# Map RNA-sequencing reads to repetitive elements using STAR
# --readFilesCommand zcat is needed because the input FASTQ is gzipped.
# --outReadsUnmapped Fastx writes the reads that did not map to repeats;
# these are the input to the subsequent hg19 mapping step.
STAR --runMode alignReads \
     --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn "${INPUT_READS}" \
     --readFilesCommand zcat \
     --runThreadN 8 \
     --outFileNamePrefix RepBase_mapping_ \
     --outSAMtype BAM SortedByCoordinate \
     --outBAMcompression 6 \
     --outReadsUnmapped Fastx


Reads that did not map to repetitive elements were then mapped to the human genome (hg19).
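With `--outReadsUnmapped Fastx`, STAR writes single-end unmapped reads to `<prefix>Unmapped.out.mate1` as uncompressed FASTQ. The sketch below shows one way to hand those reads to the hg19 mapping step. The helper name `collect_unmapped_reads` and the demo read are hypothetical; the prefix and output file name follow the placeholders used in the examples on this page.

```shell
# Hypothetical helper: compress STAR's unmapped-read output for the next mapping step.
# $1: the --outFileNamePrefix used for the RepBase run
# $2: destination gzipped FASTQ
collect_unmapped_reads() {
    gzip -c "${1}Unmapped.out.mate1" > "$2"
}

# Demo with a one-read stand-in file (illustrative data, not from the dataset)
unmapped_demo_dir="$(mktemp -d)"
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' \
    > "${unmapped_demo_dir}/RepBase_mapping_Unmapped.out.mate1"

collect_unmapped_reads "${unmapped_demo_dir}/RepBase_mapping_" \
    "${unmapped_demo_dir}/reads_unmapped_to_repeats.fastq.gz"

# Confirm the compressed file holds the same read
gzip -dc "${unmapped_demo_dir}/reads_unmapped_to_repeats.fastq.gz" | head -n 2
```

Compressing is optional: STAR can also read the uncompressed file directly if `--readFilesCommand zcat` is dropped from the alignment command.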

STAR v2.7.10a (tool and version inferred with models/gemini-2.5-flash; the source text reports STAR v2.4.0i for the preceding mapping step) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Replace with actual paths and filenames
GENOME_DIR="/path/to/STAR_index/hg19" # Placeholder for hg19 STAR genome index
INPUT_FASTQ="reads_unmapped_to_repeats.fastq.gz" # Placeholder for input reads (those not mapped to repetitive elements)
OUTPUT_PREFIX="aligned_to_hg19" # Prefix for output files
THREADS=8 # Number of threads to use

# Map reads to the human genome (hg19)
# This command aligns the input FASTQ reads to the hg19 genome index.
# --readFilesCommand zcat is used for gzipped FASTQ files.
# --outSAMtype BAM SortedByCoordinate outputs a sorted BAM file.
# --outFilterMultimapNmax 1 keeps only reads that map to a single locus.
# The filter thresholds below are assumptions; the source text does not state them.
STAR --runMode alignReads \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_FASTQ}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --outFilterScoreMinOverLread 0.66 \
     --outFilterMatchNminOverLread 0.66 \
     --runThreadN "${THREADS}"


Read count matrices were created using GENCODE (v19) gene annotations and featureCounts (v1.5.0).

featureCounts v1.5.0 GitHub

$ Bash example

# Install featureCounts (part of Subread package)
# conda install -c bioconda subread

# Define reference annotation file (GENCODE v19)
GENCODE_V19_GTF="gencode.v19.annotation.gtf"

# Placeholder for input BAM files (aligned reads)
INPUT_BAMS="sample1.bam sample2.bam"

# Output file for gene count matrix
OUTPUT_COUNTS="gene_counts.txt"

# Create read count matrices using featureCounts
# -a: Annotation file (GTF/GFF)
# -o: Output file
# -t exon: Count reads overlapping exons (default for gene-level counting)
# -g gene_id: Group features by 'gene_id' attribute to get gene-level counts
featureCounts -a "${GENCODE_V19_GTF}" -o "${OUTPUT_COUNTS}" -t exon -g gene_id ${INPUT_BAMS}
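The featureCounts output is not yet a plain genes-by-samples matrix: it begins with a `#` comment line and carries five annotation columns (Chr, Start, End, Strand, Length) between the gene ID and the sample counts. A minimal sketch of reducing it to the shape a count-based tool expects follows. The helper name `featurecounts_to_matrix` and the demo values are hypothetical and not taken from the dataset.

```shell
# Hypothetical helper: keep the gene ID column and the per-sample count columns only.
# $1: featureCounts output file; the matrix is written to stdout
featurecounts_to_matrix() {
    grep -v '^#' "$1" | cut -f1,7-
}

# Demo on a tiny made-up table in featureCounts layout (illustrative values)
fc_demo_dir="$(mktemp -d)"
{
    printf '# Program:featureCounts v1.5.0\n'
    printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample1.bam\tsample2.bam\n'
    printf 'GENE_A\tchr1\t100\t1099\t+\t1000\t50\t70\n'
    printf 'GENE_B\tchr2\t200\t2199\t-\t2000\t150\t130\n'
} > "${fc_demo_dir}/gene_counts.txt"

featurecounts_to_matrix "${fc_demo_dir}/gene_counts.txt" > "${fc_demo_dir}/counts.tsv"
cat "${fc_demo_dir}/counts.tsv"
```

The column headers remain the BAM file names that were passed to featureCounts, so any sample metadata table has to use the same names or the headers need renaming.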


Differential expression was calculated using DESeq2 version 1.10.1.

DESeq2 v1.10.1 GitHub

$ Bash example

# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'

# Create a placeholder R script for DESeq2 differential expression analysis.
# This script assumes 'counts.tsv' contains raw gene counts (genes as rows, samples as columns)
# and 'sample_info.tsv' contains metadata (e.g., 'sample_id', 'condition').
# Raw featureCounts output is not in this shape: it starts with a comment line and carries
# annotation columns (Chr, Start, End, Strand, Length) that must be removed first.
# Replace 'counts.tsv' and 'sample_info.tsv' with your actual input files.
# Replace 'condition' with your actual experimental variable.
# Replace 'control' and 'treated' with your actual group names for comparison.

cat << 'EOF' > run_deseq2.R
# Load DESeq2 library
library(DESeq2)

# --- Configuration ---
# Input files
counts_file <- "counts.tsv" # Raw count matrix (genes x samples)
sample_info_file <- "sample_info.tsv" # Sample metadata (samples x variables)

# Output files
results_output_file <- "deseq2_results.tsv"
normalized_counts_output_file <- "deseq2_normalized_counts.tsv"

# Design formula (e.g., ~ condition)
design_formula <- ~ condition

# Reference level for comparison (e.g., 'control' vs 'treated')
# Ensure this matches a level in your 'condition' column in sample_info_file
reference_level <- "control"
test_level <- "treated"

# --- Data Loading ---
# Load count data
# Assuming gene IDs are in the first column and sample counts in subsequent columns
count_data <- read.table(counts_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)

# Load sample information
# Assuming sample IDs are in the first column and metadata in subsequent columns
sample_info <- read.table(sample_info_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)

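# Ensure sample names match and are in the same order
# It's crucial that column names of count_data match row names of sample_info;
# stop early instead of silently creating NA rows when a sample is missing.
stopifnot(all(colnames(count_data) %in% rownames(sample_info)))
sample_info <- sample_info[colnames(count_data), , drop = FALSE]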

# Convert condition column to factor and set reference level
sample_info$condition <- factor(sample_info$condition, levels = c(reference_level, test_level))

# --- DESeq2 Analysis ---
# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = design_formula)

# Pre-filtering: remove genes with very low counts
# (e.g., keep genes with at least 10 counts in total across all samples)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Extract results for the specified contrast
# The contrast argument specifies the comparison: c("variable", "level_of_interest", "reference_level")
res <- results(dds, contrast = c("condition", test_level, reference_level))

# Order results by adjusted p-value
res <- res[order(res$padj),]

# --- Output Results ---
# Write differential expression results to a TSV file
write.table(as.data.frame(res), file = results_output_file, sep = "\t", quote = FALSE, row.names = TRUE)

# Get normalized counts
normalized_counts <- counts(dds, normalized = TRUE)
write.table(as.data.frame(normalized_counts), file = normalized_counts_output_file, sep = "\t", quote = FALSE, row.names = TRUE)

message(paste("DESeq2 differential expression results saved to:", results_output_file))
message(paste("Normalized counts saved to:", normalized_counts_output_file))

EOF

# Execute the R script
Rscript run_deseq2.R


The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
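Because the source says the RPKMs were derived from the read count matrices, the calculation can be sketched directly on a featureCounts table, using RPKM = 10^9 × count / (length in bp × total reads in the sample). The sketch makes two assumptions the source does not confirm: the length is the featureCounts `Length` column, and the per-sample total is the sum of assigned counts. The helper name `counts_to_rpkm` and the demo values are hypothetical.

```shell
# Hypothetical helper: per-sample RPKM from a featureCounts table.
# $1: featureCounts output (column 6 = Length, columns 7+ = per-sample counts)
counts_to_rpkm() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                      # skip the featureCounts comment line
        NR == FNR {                        # pass 1: sum each sample column
            if ($1 != "Geneid")
                for (i = 7; i <= NF; i++) total[i] += $i
            next
        }
        {                                  # pass 2: copy the header, convert data rows
            printf "%s", $1
            for (i = 7; i <= NF; i++) {
                if ($1 == "Geneid") printf "%s%s", OFS, $i
                else printf "%s%.2f", OFS, (total[i] > 0 ? 1e9 * $i / ($6 * total[i]) : 0)
            }
            printf "\n"
        }
    ' "$1" "$1"
}

# Demo on a tiny made-up table (illustrative values, not from the dataset)
rpkm_demo_dir="$(mktemp -d)"
{
    printf '# Program:featureCounts v1.5.0\n'
    printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tinput.bam\tpolysome.bam\n'
    printf 'GENE_A\tchr1\t100\t1099\t+\t1000\t50\t70\n'
    printf 'GENE_B\tchr2\t200\t2199\t-\t2000\t150\t130\n'
} > "${rpkm_demo_dir}/gene_counts.txt"

counts_to_rpkm "${rpkm_demo_dir}/gene_counts.txt" > "${rpkm_demo_dir}/rpkm.tsv"
cat "${rpkm_demo_dir}/rpkm.tsv"
```

In the demo each sample totals 200 reads, so GENE_A in the first sample works out to 10^9 × 50 / (1000 × 200) = 250000. Using the total number of mapped reads as the denominator instead of the assigned total is an equally common convention and would change the values.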

RSEM (Inferred with models/gemini-2.5-flash) v1.3.1 GitHub

$ Bash example

# Install RSEM (if not already installed)
# conda install -c bioconda rsem

# Define reference paths (replace with actual paths to your reference files)
# The source text reports genome build hg19 and GENCODE v19 annotations, so those are used here.
GENOME_FASTA="path/to/hg19.fa"
GENES_GTF="path/to/gencode.v19.annotation.gtf"
RSEM_REF_INDEX="path/to/rsem_hg19_index" # Reference name (path prefix) for the RSEM index files

# Prepare RSEM reference (run this step once for your reference genome/transcriptome)
# This command builds the necessary index files for RSEM quantification.
# rsem-prepare-reference --gtf "${GENES_GTF}" "${GENOME_FASTA}" "${RSEM_REF_INDEX}"

# Define input BAM files (replace with actual paths to your aligned reads)
# RSEM expects alignments in transcriptome coordinates (for example, STAR run with
# --quantMode TranscriptomeSAM), not genome-coordinate BAM files.
INPUT_BAM="path/to/input_fraction.bam"
POLYSOME_BAM="path/to/polysome_fraction.bam"

# Define output prefixes for RSEM results
INPUT_OUTPUT_PREFIX="input_fraction_rsem"
POLYSOME_OUTPUT_PREFIX="polysome_fraction_rsem"

# Estimate expression for the input fraction
# Note: the source text states that RPKMs were calculated from the read count matrices.
# RSEM is an inferred substitute that estimates abundance from the alignments instead,
# so its values may differ from those reported by the authors.
#   -p 8          use 8 threads
#   --alignments  input is a SAM/BAM/CRAM alignment file
#   positional arguments: alignment file, RSEM reference name, output prefix
rsem-calculate-expression \
    -p 8 \
    --alignments \
    "${INPUT_BAM}" \
    "${RSEM_REF_INDEX}" \
    "${INPUT_OUTPUT_PREFIX}"

# Estimate expression for the polysome fraction
rsem-calculate-expression \
    -p 8 \
    --alignments \
    "${POLYSOME_BAM}" \
    "${RSEM_REF_INDEX}" \
    "${POLYSOME_OUTPUT_PREFIX}"

# Gene-level results are written to "${INPUT_OUTPUT_PREFIX}.genes.results" and
# "${POLYSOME_OUTPUT_PREFIX}.genes.results"; transcript-level results are in the
# matching ".isoforms.results" files. For single-end reads the FPKM column corresponds
# to RPKM. The TPM column is a different normalization and is not interchangeable with RPKM.


Tools Used

STAR DESeq2

Raw Source Text

RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.
Differential expression was calculated using DESeq2 version 1.10.1
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
Genome_build: Homo sapiens UCSC hg19
