GSE117294 Processing Pipeline

GSE code_examples 5 steps

Publication

Large-scale tethered function assays identify factors that regulate mRNA stability and translation.

Nature structural & molecular biology (2020) — PMID 32807991

Dataset

GSE117294

Large-scale tethered function assays identify factors that regulate mRNA stability and translation

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i).

    $ Bash example
    # Define variables
    INPUT_READS="trimmed_reads.fastq.gz" # Output from cutadapt
    GENOME_DIR="RepBase_STAR_index" # Path to the pre-built STAR index for RepBase v18.04
    OUTPUT_PREFIX="STAR_RepBase_mapping_"
    NUM_THREADS=8 # Adjust as needed
    
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Map trimmed RNA-seq reads to the RepBase index
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${INPUT_READS}" \
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate \
         --outBAMcompression 6 \
         --outReadsUnmapped Fastx
  2. 2

    Reads did not map to repetitive elements were then mapped to the human genome (hg19).

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    # Replace with the actual path to your STAR genome index for hg19
    GENOME_DIR="/path/to/STAR_index/hg19"
    # Replace with the actual path to your input FASTQ file (reads unmapped to repetitive elements)
    INPUT_FASTQ="unmapped_reads.fastq.gz"
    # Define output prefix for aligned files
    OUTPUT_PREFIX="aligned_hg19_"
    # Number of threads to use
    NUM_THREADS=8
    
    # Create STAR genome index if it doesn't exist (run once per genome version)
    # This step is typically done prior to alignment. Assuming index already exists.
    # STAR --runThreadN ${NUM_THREADS} \
    #      --runMode genomeGenerate \
    #      --genomeDir ${GENOME_DIR} \
    #      --genomeFastaFiles /path/to/hg19.fa \
    #      --sjdbGTFfile /path/to/hg19.gtf \
    #      --sjdbOverhang 100 # Adjust based on read length - 1
    
    # Align reads to the human genome (hg19) using STAR
    STAR --runThreadN ${NUM_THREADS} \
         --genomeDir ${GENOME_DIR} \
         --readFilesIn ${INPUT_FASTQ} \
         --readFilesCommand gunzip -c \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 3 \
         --outFilterScoreMinOverLread 0.66 \
         --outFilterMatchNminOverLread 0.66 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --outSAMstrandField intronMotif \
         --outFilterType BySJout
    
  3. 3

    Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.

    featureCounts v1.5.0
    $ Bash example
    # Install featureCounts (part of Subread package)
    # conda install -c bioconda subread
    
    # Define reference annotation file
    GENCODE_V19_GTF="gencode.v19.annotation.gtf"
    
    # Placeholder for input BAM files (replace with actual paths to your aligned BAM files)
    INPUT_BAMS="sample1.bam sample2.bam"
    
    # Output count matrix file
    OUTPUT_COUNTS="gene_counts.txt"
    
    # Download GENCODE v19 annotation if not available (example command)
    # Note: Ensure you have the correct URL for GENCODE v19. This is a common pattern.
    # wget -O "${GENCODE_V19_GTF}.gz" "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"
    # gunzip "${GENCODE_V19_GTF}.gz"
    
    # Run featureCounts to create read count matrices
    # -a: specify annotation file (GTF/GFF)
    # -o: specify output file for counts
    # -F GTF: specify annotation file format (GTF)
    # -t exon: specify feature type to count (e.g., 'exon' for gene-level counts)
    # -g gene_id: specify attribute type to aggregate features by (e.g., 'gene_id')
    # -s 0: specify strand specificity (0=unstranded, 1=stranded, 2=reverse stranded). Assuming unstranded as not specified.
    # -T 8: specify number of threads (adjust as needed)
    featureCounts -a "${GENCODE_V19_GTF}" -o "${OUTPUT_COUNTS}" -F GTF -t exon -g gene_id -s 0 -T 8 ${INPUT_BAMS}
  4. 4

    Differential expression was calculated using DESeq2 version 1.10.1

    $ Bash example
    # Install R and DESeq2 if not already installed.
    # Note: DESeq2 version 1.10.1 is quite old (released with Bioconductor 3.2, R 3.2).
    # Modern R/Bioconductor installations will typically provide a much newer version.
    # To ensure version 1.10.1, you might need to use an older R/Bioconductor environment
    # or a specific BiocManager version (e.g., BiocManager::install('DESeq2', version = '3.2')).
    
    # For R installation (example for Debian/Ubuntu):
    # sudo apt-get update
    # sudo apt-get install r-base
    
    # For DESeq2 package installation in R:
    # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')"
    
    # Create an R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    # Load DESeq2 library
    # Ensure DESeq2 version 1.10.1 is installed and loaded for consistency.
    library(DESeq2)
    
    # --- Configuration ---
    # Input count matrix file (e.g., from featureCounts, HTSeq, Salmon, Kallisto).
    # This file should be tab-separated, with gene/feature IDs in the first column
    # and raw counts for each sample in subsequent columns.
    # Example:
    # gene_id    sample1    sample2    sample3    sample4
    # geneA      100        120        50         60
    # geneB      200        210        100        110
    count_file <- "counts.tsv"
    
    # Sample information file (metadata).
    # This file should be tab-separated, with sample IDs in the first column
    # and experimental factors (e.g., condition, batch) in subsequent columns.
    # Sample IDs must match the column names in the count_file.
    # Example:
    # sample_id    condition    batch
    # sample1      control      batch1
    # sample2      control      batch2
    # sample3      treated      batch1
    # sample4      treated      batch2
    sample_info_file <- "sample_info.tsv"
    
    # Output directory for results
    output_dir <- "deseq2_results"
    
    # Define the comparison: 'treatment_condition' vs 'reference_condition'.
    # These values must exactly match entries in the 'design_variable' column of sample_info_file.
    reference_condition <- "control"
    treatment_condition <- "treated"
    
    # The column in sample_info_file that defines the primary experimental groups for comparison.
    design_variable <- "condition"
    
    # Create output directory if it doesn't exist
    if (!dir.exists(output_dir)) {
      dir.create(output_dir)
    }
    
    # --- Load Data ---
    # Read count data
    # Ensure row names are gene IDs and column names are sample IDs
    count_data <- as.matrix(read.table(count_file, header = TRUE, row.names = 1, sep = "\t"))
    
    # Read sample information
    # Ensure row names are sample IDs
    sample_info <- read.table(sample_info_file, header = TRUE, row.names = 1, sep = "\t")
    
    # Ensure sample names in count_data columns match sample_info row names and are in the same order
    sample_info <- sample_info[colnames(count_data), , drop = FALSE]
    
    # --- Create DESeqDataSet object ---
    # The 'design' formula specifies the experimental design.
    # Here, we are comparing based on the 'design_variable' (e.g., 'condition').
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_info,
                                  design = as.formula(paste("~", design_variable)))
    
    # Set the reference level for the primary design variable.
    # This ensures the fold changes are calculated as 'treatment' vs 'reference'.
    dds[[design_variable]] <- relevel(dds[[design_variable]], ref = reference_condition)
    
    # --- Run DESeq2 analysis ---
    message("Running DESeq2 analysis...")
    dds <- DESeq(dds)
    message("DESeq2 analysis complete.")
    
    # --- Extract Results ---
    # Get results for the specified comparison
    message(paste0("Extracting results for: ", treatment_condition, " vs ", reference_condition))
    res <- results(dds, contrast = c(design_variable, treatment_condition, reference_condition))
    
    # Order results by adjusted p-value
    res_ordered <- res[order(res$padj),]
    
    # Write differential expression results to a CSV file
    results_file <- file.path(output_dir, paste0("deseq2_differential_expression_", treatment_condition, "_vs_", reference_condition, ".csv"))
    write.csv(as.data.frame(res_ordered), file = results_file)
    message(paste0("Differential expression results saved to: ", results_file))
    
    # Optional: Save normalized counts
    normalized_counts <- counts(dds, normalized = TRUE)
    normalized_counts_file <- file.path(output_dir, "deseq2_normalized_counts.csv")
    write.csv(as.data.frame(normalized_counts), file = normalized_counts_file)
    message(paste0("Normalized counts saved to: ", normalized_counts_file))
    
    # Optional: Save session info for reproducibility
    session_info_file <- file.path(output_dir, "deseq2_session_info.txt")
    sink(session_info_file)
    sessionInfo()
    sink()
    message(paste0("Session information saved to: ", session_info_file))
    
    EOF
    
    # Execute the R script
    # Replace 'counts.tsv' and 'sample_info.tsv' with your actual input files.
    # Ensure these files are in the current working directory or provide full paths.
    Rscript run_deseq2.R
  5. 5

    The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.

    edgeR (Inferred with models/gemini-2.5-flash) v3.44.0 GitHub
    $ Bash example
    # Install R and edgeR if not already present
    # conda install -c bioconda bioconductor-edger r-base
    
    # Create dummy count matrices and gene lengths for demonstration
    # In a real scenario, these would be pre-existing files.
    mkdir -p data
    cat <<EOF > data/input_counts.tsv
    GeneID	Sample1	Sample2
    GeneA	100	120
    GeneB	50	60
    GeneC	200	210
    EOF
    
    cat <<EOF > data/polysome_counts.tsv
    GeneID	Sample1	Sample2
    GeneA	200	250
    GeneB	100	110
    GeneC	400	420
    EOF
    
    # Gene lengths for GRCh38 (example values, replace with actual lengths from GTF/GFF)
    cat <<EOF > data/gene_lengths.tsv
    GeneID	Length
    GeneA	1000
    GeneB	2000
    GeneC	500
    EOF
    
    # R script to calculate RPKM using edgeR
    R_SCRIPT="calculate_rpkm.R"
    cat <<'EOF' > "$R_SCRIPT"
    library(edgeR)
    
    # Function to calculate RPKM for a given count matrix
    calculate_rpkm_from_file <- function(count_file, gene_lengths_file, output_file) {
      # Read count matrix
      counts_df <- read.delim(count_file, row.names = 1, check.names = FALSE)
      counts_matrix <- as.matrix(counts_df)
    
      # Read gene lengths
      gene_lengths_df <- read.delim(gene_lengths_file, row.names = 1, check.names = FALSE)
      
      # Ensure gene lengths are in the same order as counts and cover all genes
      # Filter gene_lengths_df to only include genes present in counts_matrix
      common_genes <- intersect(rownames(counts_matrix), rownames(gene_lengths_df))
      if (length(common_genes) == 0) {
        stop("No common genes found between count matrix and gene lengths file.")
      }
      
      counts_matrix <- counts_matrix[common_genes, , drop = FALSE]
      gene_lengths <- gene_lengths_df[common_genes, "Length", drop = FALSE]
      
      # Create a DGEList object (required by edgeR's rpkm function)
      dge <- DGEList(counts = counts_matrix)
      
      # Calculate RPKM
      # The rpkm function expects gene.length as a numeric vector
      rpkm_matrix <- rpkm(dge, gene.length = gene_lengths$Length)
      
      # Write RPKM matrix to file
      write.table(rpkm_matrix, file = output_file, sep = "\t", quote = FALSE, col.names = NA)
      message(paste("RPKM calculated and saved to", output_file))
    }
    
    # Define input and output files
    input_counts_file <- "data/input_counts.tsv"
    polysome_counts_file <- "data/polysome_counts.tsv"
    gene_lengths_file <- "data/gene_lengths.tsv"
    
    input_rpkm_output_file <- "input_rpkm.tsv"
    polysome_rpkm_output_file <- "polysome_rpkm.tsv"
    
    # Calculate RPKM for input fraction
    calculate_rpkm_from_file(input_counts_file, gene_lengths_file, input_rpkm_output_file)
    
    # Calculate RPKM for polysome fraction
    calculate_rpkm_from_file(polysome_counts_file, gene_lengths_file, polysome_rpkm_output_file)
    
    EOF
    
    # Execute the R script
    Rscript "$R_SCRIPT"
    

Tools Used

Raw Source Text
RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.
Differential expression was calculated using DESeq2 version 1.10.1
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
Genome_build: Homo sapiens UCSC hg19
← Back to Analysis