GSE117294 Processing Pipeline

GSE code_examples 5 steps

Publication

Large-scale tethered function assays identify factors that regulate mRNA stability and translation.

Nature structural & molecular biology (2020) — PMID 32807991

Dataset

Large-scale tethered function assays identify factors that regulate mRNA stability and translation

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i).

STAR v2.4.0i GitHub

$ Bash example

# Define variables
INPUT_READS="trimmed_reads.fastq.gz" # Output from cutadapt
GENOME_DIR="RepBase_STAR_index" # Path to the pre-built STAR index for RepBase v18.04
OUTPUT_PREFIX="STAR_RepBase_mapping_"
NUM_THREADS=8 # Adjust as needed

# Install STAR (if not already installed)
# conda install -c bioconda star

# Map trimmed RNA-seq reads to the RepBase index
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_READS}" \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outBAMcompression 6 \
     --outReadsUnmapped Fastx

View on GitHub

Reads did not map to repetitive elements were then mapped to the human genome (hg19).

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Replace with the actual path to your STAR genome index for hg19
GENOME_DIR="/path/to/STAR_index/hg19"
# Replace with the actual path to your input FASTQ file (reads unmapped to repetitive elements)
INPUT_FASTQ="unmapped_reads.fastq.gz"
# Define output prefix for aligned files
OUTPUT_PREFIX="aligned_hg19_"
# Number of threads to use
NUM_THREADS=8

# Create STAR genome index if it doesn't exist (run once per genome version)
# This step is typically done prior to alignment. Assuming index already exists.
# STAR --runThreadN ${NUM_THREADS} \
#      --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/hg19.fa \
#      --sjdbGTFfile /path/to/hg19.gtf \
#      --sjdbOverhang 100 # Adjust based on read length - 1

# Align reads to the human genome (hg19) using STAR
STAR --runThreadN ${NUM_THREADS} \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${INPUT_FASTQ} \
     --readFilesCommand gunzip -c \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --outFilterScoreMinOverLread 0.66 \
     --outFilterMatchNminOverLread 0.66 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --outSAMstrandField intronMotif \
     --outFilterType BySJout

View on GitHub

Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.

featureCounts v1.5.0

$ Bash example

# Install featureCounts (part of Subread package)
# conda install -c bioconda subread

# Define reference annotation file
GENCODE_V19_GTF="gencode.v19.annotation.gtf"

# Placeholder for input BAM files (replace with actual paths to your aligned BAM files)
INPUT_BAMS="sample1.bam sample2.bam"

# Output count matrix file
OUTPUT_COUNTS="gene_counts.txt"

# Download GENCODE v19 annotation if not available (example command)
# Note: Ensure you have the correct URL for GENCODE v19. This is a common pattern.
# wget -O "${GENCODE_V19_GTF}.gz" "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"
# gunzip "${GENCODE_V19_GTF}.gz"

# Run featureCounts to create read count matrices
# -a: specify annotation file (GTF/GFF)
# -o: specify output file for counts
# -F GTF: specify annotation file format (GTF)
# -t exon: specify feature type to count (e.g., 'exon' for gene-level counts)
# -g gene_id: specify attribute type to aggregate features by (e.g., 'gene_id')
# -s 0: specify strand specificity (0=unstranded, 1=stranded, 2=reverse stranded). Assuming unstranded as not specified.
# -T 8: specify number of threads (adjust as needed)
featureCounts -a "${GENCODE_V19_GTF}" -o "${OUTPUT_COUNTS}" -F GTF -t exon -g gene_id -s 0 -T 8 ${INPUT_BAMS}

Differential expression was calculated using DESeq2 version 1.10.1

DESeq2 v1.10.1 GitHub

$ Bash example

# Install R and DESeq2 if not already installed.
# Note: DESeq2 version 1.10.1 is quite old (released with Bioconductor 3.2, R 3.2).
# Modern R/Bioconductor installations will typically provide a much newer version.
# To ensure version 1.10.1, you might need to use an older R/Bioconductor environment
# or a specific BiocManager version (e.g., BiocManager::install('DESeq2', version = '3.2')).

# For R installation (example for Debian/Ubuntu):
# sudo apt-get update
# sudo apt-get install r-base

# For DESeq2 package installation in R:
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')"

# Create an R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
# Load DESeq2 library
# Ensure DESeq2 version 1.10.1 is installed and loaded for consistency.
library(DESeq2)

# --- Configuration ---
# Input count matrix file (e.g., from featureCounts, HTSeq, Salmon, Kallisto).
# This file should be tab-separated, with gene/feature IDs in the first column
# and raw counts for each sample in subsequent columns.
# Example:
# gene_id    sample1    sample2    sample3    sample4
# geneA      100        120        50         60
# geneB      200        210        100        110
count_file <- "counts.tsv"

# Sample information file (metadata).
# This file should be tab-separated, with sample IDs in the first column
# and experimental factors (e.g., condition, batch) in subsequent columns.
# Sample IDs must match the column names in the count_file.
# Example:
# sample_id    condition    batch
# sample1      control      batch1
# sample2      control      batch2
# sample3      treated      batch1
# sample4      treated      batch2
sample_info_file <- "sample_info.tsv"

# Output directory for results
output_dir <- "deseq2_results"

# Define the comparison: 'treatment_condition' vs 'reference_condition'.
# These values must exactly match entries in the 'design_variable' column of sample_info_file.
reference_condition <- "control"
treatment_condition <- "treated"

# The column in sample_info_file that defines the primary experimental groups for comparison.
design_variable <- "condition"

# Create output directory if it doesn't exist
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

# --- Load Data ---
# Read count data
# Ensure row names are gene IDs and column names are sample IDs
count_data <- as.matrix(read.table(count_file, header = TRUE, row.names = 1, sep = "\t"))

# Read sample information
# Ensure row names are sample IDs
sample_info <- read.table(sample_info_file, header = TRUE, row.names = 1, sep = "\t")

# Ensure sample names in count_data columns match sample_info row names and are in the same order
sample_info <- sample_info[colnames(count_data), , drop = FALSE]

# --- Create DESeqDataSet object ---
# The 'design' formula specifies the experimental design.
# Here, we are comparing based on the 'design_variable' (e.g., 'condition').
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = as.formula(paste("~", design_variable)))

# Set the reference level for the primary design variable.
# This ensures the fold changes are calculated as 'treatment' vs 'reference'.
dds[[design_variable]] <- relevel(dds[[design_variable]], ref = reference_condition)

# --- Run DESeq2 analysis ---
message("Running DESeq2 analysis...")
dds <- DESeq(dds)
message("DESeq2 analysis complete.")

# --- Extract Results ---
# Get results for the specified comparison
message(paste0("Extracting results for: ", treatment_condition, " vs ", reference_condition))
res <- results(dds, contrast = c(design_variable, treatment_condition, reference_condition))

# Order results by adjusted p-value
res_ordered <- res[order(res$padj),]

# Write differential expression results to a CSV file
results_file <- file.path(output_dir, paste0("deseq2_differential_expression_", treatment_condition, "_vs_", reference_condition, ".csv"))
write.csv(as.data.frame(res_ordered), file = results_file)
message(paste0("Differential expression results saved to: ", results_file))

# Optional: Save normalized counts
normalized_counts <- counts(dds, normalized = TRUE)
normalized_counts_file <- file.path(output_dir, "deseq2_normalized_counts.csv")
write.csv(as.data.frame(normalized_counts), file = normalized_counts_file)
message(paste0("Normalized counts saved to: ", normalized_counts_file))

# Optional: Save session info for reproducibility
session_info_file <- file.path(output_dir, "deseq2_session_info.txt")
sink(session_info_file)
sessionInfo()
sink()
message(paste0("Session information saved to: ", session_info_file))

EOF

# Execute the R script
# Replace 'counts.tsv' and 'sample_info.tsv' with your actual input files.
# Ensure these files are in the current working directory or provide full paths.
Rscript run_deseq2.R

View on GitHub

The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.

edgeR (Inferred with models/gemini-2.5-flash) v3.44.0 GitHub

$ Bash example

# Install R and edgeR if not already present
# conda install -c bioconda bioconductor-edger r-base

# Create dummy count matrices and gene lengths for demonstration
# In a real scenario, these would be pre-existing files.
mkdir -p data
cat <<EOF > data/input_counts.tsv
GeneID	Sample1	Sample2
GeneA	100	120
GeneB	50	60
GeneC	200	210
EOF

cat <<EOF > data/polysome_counts.tsv
GeneID	Sample1	Sample2
GeneA	200	250
GeneB	100	110
GeneC	400	420
EOF

# Gene lengths for GRCh38 (example values, replace with actual lengths from GTF/GFF)
cat <<EOF > data/gene_lengths.tsv
GeneID	Length
GeneA	1000
GeneB	2000
GeneC	500
EOF

# R script to calculate RPKM using edgeR
R_SCRIPT="calculate_rpkm.R"
cat <<'EOF' > "$R_SCRIPT"
library(edgeR)

# Function to calculate RPKM for a given count matrix
calculate_rpkm_from_file <- function(count_file, gene_lengths_file, output_file) {
  # Read count matrix
  counts_df <- read.delim(count_file, row.names = 1, check.names = FALSE)
  counts_matrix <- as.matrix(counts_df)

  # Read gene lengths
  gene_lengths_df <- read.delim(gene_lengths_file, row.names = 1, check.names = FALSE)
  
  # Ensure gene lengths are in the same order as counts and cover all genes
  # Filter gene_lengths_df to only include genes present in counts_matrix
  common_genes <- intersect(rownames(counts_matrix), rownames(gene_lengths_df))
  if (length(common_genes) == 0) {
    stop("No common genes found between count matrix and gene lengths file.")
  }
  
  counts_matrix <- counts_matrix[common_genes, , drop = FALSE]
  gene_lengths <- gene_lengths_df[common_genes, "Length", drop = FALSE]
  
  # Create a DGEList object (required by edgeR's rpkm function)
  dge <- DGEList(counts = counts_matrix)
  
  # Calculate RPKM
  # The rpkm function expects gene.length as a numeric vector
  rpkm_matrix <- rpkm(dge, gene.length = gene_lengths$Length)
  
  # Write RPKM matrix to file
  write.table(rpkm_matrix, file = output_file, sep = "\t", quote = FALSE, col.names = NA)
  message(paste("RPKM calculated and saved to", output_file))
}

# Define input and output files
input_counts_file <- "data/input_counts.tsv"
polysome_counts_file <- "data/polysome_counts.tsv"
gene_lengths_file <- "data/gene_lengths.tsv"

input_rpkm_output_file <- "input_rpkm.tsv"
polysome_rpkm_output_file <- "polysome_rpkm.tsv"

# Calculate RPKM for input fraction
calculate_rpkm_from_file(input_counts_file, gene_lengths_file, input_rpkm_output_file)

# Calculate RPKM for polysome fraction
calculate_rpkm_from_file(polysome_counts_file, gene_lengths_file, polysome_rpkm_output_file)

EOF

# Execute the R script
Rscript "$R_SCRIPT"

View on GitHub

Tools Used

STAR DESeq2

Raw Source Text

RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.
Differential expression was calculated using DESeq2 version 1.10.1
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
Genome_build: Homo sapiens UCSC hg19

← Back to Analysis