GSE117294 Processing Pipeline
GSE
code_examples
5 steps
Publication
Large-scale tethered function assays identify factors that regulate mRNA stability and translation.Nature structural & molecular biology (2020) — PMID 32807991
Dataset
GSE117294Large-scale tethered function assays identify factors that regulate mRNA stability and translation
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i).
$ Bash example
# Define variables INPUT_READS="trimmed_reads.fastq.gz" # Output from cutadapt GENOME_DIR="RepBase_STAR_index" # Path to the pre-built STAR index for RepBase v18.04 OUTPUT_PREFIX="STAR_RepBase_mapping_" NUM_THREADS=8 # Adjust as needed # Install STAR (if not already installed) # conda install -c bioconda star # Map trimmed RNA-seq reads to the RepBase index STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${INPUT_READS}" \ --runThreadN "${NUM_THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outBAMcompression 6 \ --outReadsUnmapped Fastx -
2
Reads did not map to repetitive elements were then mapped to the human genome (hg19).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables # Replace with the actual path to your STAR genome index for hg19 GENOME_DIR="/path/to/STAR_index/hg19" # Replace with the actual path to your input FASTQ file (reads unmapped to repetitive elements) INPUT_FASTQ="unmapped_reads.fastq.gz" # Define output prefix for aligned files OUTPUT_PREFIX="aligned_hg19_" # Number of threads to use NUM_THREADS=8 # Create STAR genome index if it doesn't exist (run once per genome version) # This step is typically done prior to alignment. Assuming index already exists. # STAR --runThreadN ${NUM_THREADS} \ # --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/hg19.fa \ # --sjdbGTFfile /path/to/hg19.gtf \ # --sjdbOverhang 100 # Adjust based on read length - 1 # Align reads to the human genome (hg19) using STAR STAR --runThreadN ${NUM_THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${INPUT_FASTQ} \ --readFilesCommand gunzip -c \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 3 \ --outFilterScoreMinOverLread 0.66 \ --outFilterMatchNminOverLread 0.66 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --outSAMstrandField intronMotif \ --outFilterType BySJout -
3
Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.
featureCounts v1.5.0$ Bash example
# Install featureCounts (part of Subread package) # conda install -c bioconda subread # Define reference annotation file GENCODE_V19_GTF="gencode.v19.annotation.gtf" # Placeholder for input BAM files (replace with actual paths to your aligned BAM files) INPUT_BAMS="sample1.bam sample2.bam" # Output count matrix file OUTPUT_COUNTS="gene_counts.txt" # Download GENCODE v19 annotation if not available (example command) # Note: Ensure you have the correct URL for GENCODE v19. This is a common pattern. # wget -O "${GENCODE_V19_GTF}.gz" "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz" # gunzip "${GENCODE_V19_GTF}.gz" # Run featureCounts to create read count matrices # -a: specify annotation file (GTF/GFF) # -o: specify output file for counts # -F GTF: specify annotation file format (GTF) # -t exon: specify feature type to count (e.g., 'exon' for gene-level counts) # -g gene_id: specify attribute type to aggregate features by (e.g., 'gene_id') # -s 0: specify strand specificity (0=unstranded, 1=stranded, 2=reverse stranded). Assuming unstranded as not specified. # -T 8: specify number of threads (adjust as needed) featureCounts -a "${GENCODE_V19_GTF}" -o "${OUTPUT_COUNTS}" -F GTF -t exon -g gene_id -s 0 -T 8 ${INPUT_BAMS} -
4
Differential expression was calculated using DESeq2 version 1.10.1
$ Bash example
# Install R and DESeq2 if not already installed. # Note: DESeq2 version 1.10.1 is quite old (released with Bioconductor 3.2, R 3.2). # Modern R/Bioconductor installations will typically provide a much newer version. # To ensure version 1.10.1, you might need to use an older R/Bioconductor environment # or a specific BiocManager version (e.g., BiocManager::install('DESeq2', version = '3.2')). # For R installation (example for Debian/Ubuntu): # sudo apt-get update # sudo apt-get install r-base # For DESeq2 package installation in R: # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')" # Create an R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R # Load DESeq2 library # Ensure DESeq2 version 1.10.1 is installed and loaded for consistency. library(DESeq2) # --- Configuration --- # Input count matrix file (e.g., from featureCounts, HTSeq, Salmon, Kallisto). # This file should be tab-separated, with gene/feature IDs in the first column # and raw counts for each sample in subsequent columns. # Example: # gene_id sample1 sample2 sample3 sample4 # geneA 100 120 50 60 # geneB 200 210 100 110 count_file <- "counts.tsv" # Sample information file (metadata). # This file should be tab-separated, with sample IDs in the first column # and experimental factors (e.g., condition, batch) in subsequent columns. # Sample IDs must match the column names in the count_file. # Example: # sample_id condition batch # sample1 control batch1 # sample2 control batch2 # sample3 treated batch1 # sample4 treated batch2 sample_info_file <- "sample_info.tsv" # Output directory for results output_dir <- "deseq2_results" # Define the comparison: 'treatment_condition' vs 'reference_condition'. # These values must exactly match entries in the 'design_variable' column of sample_info_file. reference_condition <- "control" treatment_condition <- "treated" # The column in sample_info_file that defines the primary experimental groups for comparison. design_variable <- "condition" # Create output directory if it doesn't exist if (!dir.exists(output_dir)) { dir.create(output_dir) } # --- Load Data --- # Read count data # Ensure row names are gene IDs and column names are sample IDs count_data <- as.matrix(read.table(count_file, header = TRUE, row.names = 1, sep = "\t")) # Read sample information # Ensure row names are sample IDs sample_info <- read.table(sample_info_file, header = TRUE, row.names = 1, sep = "\t") # Ensure sample names in count_data columns match sample_info row names and are in the same order sample_info <- sample_info[colnames(count_data), , drop = FALSE] # --- Create DESeqDataSet object --- # The 'design' formula specifies the experimental design. # Here, we are comparing based on the 'design_variable' (e.g., 'condition'). dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = as.formula(paste("~", design_variable))) # Set the reference level for the primary design variable. # This ensures the fold changes are calculated as 'treatment' vs 'reference'. dds[[design_variable]] <- relevel(dds[[design_variable]], ref = reference_condition) # --- Run DESeq2 analysis --- message("Running DESeq2 analysis...") dds <- DESeq(dds) message("DESeq2 analysis complete.") # --- Extract Results --- # Get results for the specified comparison message(paste0("Extracting results for: ", treatment_condition, " vs ", reference_condition)) res <- results(dds, contrast = c(design_variable, treatment_condition, reference_condition)) # Order results by adjusted p-value res_ordered <- res[order(res$padj),] # Write differential expression results to a CSV file results_file <- file.path(output_dir, paste0("deseq2_differential_expression_", treatment_condition, "_vs_", reference_condition, ".csv")) write.csv(as.data.frame(res_ordered), file = results_file) message(paste0("Differential expression results saved to: ", results_file)) # Optional: Save normalized counts normalized_counts <- counts(dds, normalized = TRUE) normalized_counts_file <- file.path(output_dir, "deseq2_normalized_counts.csv") write.csv(as.data.frame(normalized_counts), file = normalized_counts_file) message(paste0("Normalized counts saved to: ", normalized_counts_file)) # Optional: Save session info for reproducibility session_info_file <- file.path(output_dir, "deseq2_session_info.txt") sink(session_info_file) sessionInfo() sink() message(paste0("Session information saved to: ", session_info_file)) EOF # Execute the R script # Replace 'counts.tsv' and 'sample_info.tsv' with your actual input files. # Ensure these files are in the current working directory or provide full paths. Rscript run_deseq2.R -
5
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
$ Bash example
# Install R and edgeR if not already present # conda install -c bioconda bioconductor-edger r-base # Create dummy count matrices and gene lengths for demonstration # In a real scenario, these would be pre-existing files. mkdir -p data cat <<EOF > data/input_counts.tsv GeneID Sample1 Sample2 GeneA 100 120 GeneB 50 60 GeneC 200 210 EOF cat <<EOF > data/polysome_counts.tsv GeneID Sample1 Sample2 GeneA 200 250 GeneB 100 110 GeneC 400 420 EOF # Gene lengths for GRCh38 (example values, replace with actual lengths from GTF/GFF) cat <<EOF > data/gene_lengths.tsv GeneID Length GeneA 1000 GeneB 2000 GeneC 500 EOF # R script to calculate RPKM using edgeR R_SCRIPT="calculate_rpkm.R" cat <<'EOF' > "$R_SCRIPT" library(edgeR) # Function to calculate RPKM for a given count matrix calculate_rpkm_from_file <- function(count_file, gene_lengths_file, output_file) { # Read count matrix counts_df <- read.delim(count_file, row.names = 1, check.names = FALSE) counts_matrix <- as.matrix(counts_df) # Read gene lengths gene_lengths_df <- read.delim(gene_lengths_file, row.names = 1, check.names = FALSE) # Ensure gene lengths are in the same order as counts and cover all genes # Filter gene_lengths_df to only include genes present in counts_matrix common_genes <- intersect(rownames(counts_matrix), rownames(gene_lengths_df)) if (length(common_genes) == 0) { stop("No common genes found between count matrix and gene lengths file.") } counts_matrix <- counts_matrix[common_genes, , drop = FALSE] gene_lengths <- gene_lengths_df[common_genes, "Length", drop = FALSE] # Create a DGEList object (required by edgeR's rpkm function) dge <- DGEList(counts = counts_matrix) # Calculate RPKM # The rpkm function expects gene.length as a numeric vector rpkm_matrix <- rpkm(dge, gene.length = gene_lengths$Length) # Write RPKM matrix to file write.table(rpkm_matrix, file = output_file, sep = "\t", quote = FALSE, col.names = NA) message(paste("RPKM calculated and saved to", output_file)) } # Define input and output files input_counts_file <- "data/input_counts.tsv" polysome_counts_file <- "data/polysome_counts.tsv" gene_lengths_file <- "data/gene_lengths.tsv" input_rpkm_output_file <- "input_rpkm.tsv" polysome_rpkm_output_file <- "polysome_rpkm.tsv" # Calculate RPKM for input fraction calculate_rpkm_from_file(input_counts_file, gene_lengths_file, input_rpkm_output_file) # Calculate RPKM for polysome fraction calculate_rpkm_from_file(polysome_counts_file, gene_lengths_file, polysome_rpkm_output_file) EOF # Execute the R script Rscript "$R_SCRIPT"
Raw Source Text
RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices. Differential expression was calculated using DESeq2 version 1.10.1 The transcript RPKMs of input and polysome fractions were calculated from the read count matrices. Genome_build: Homo sapiens UCSC hg19