GSE117291 Processing Pipeline
RNA-Seq
code_examples
5 steps
Publication
Large-scale tethered function assays identify factors that regulate mRNA stability and translation.Nature structural & molecular biology (2020) — PMID 32807991
Dataset
GSE117291Large-scale tethered function assays identify factors that regulate mRNA stability and translation (polysome profiling)
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # --- Configuration --- # Placeholder for input trimmed RNA-seq reads (output from cutadapt) # Replace with the actual path to your trimmed FASTQ file(s) TRIMMED_READS_FASTQ="trimmed_reads.fastq.gz" # Placeholder for the STAR genome index directory built from RepBase v18.04 sequences. # You would need to download RepBase v18.04 sequences and build a STAR index # using `STAR --runMode genomeGenerate --genomeDir /path/to/RepBase_v18.04_STAR_index --genomeFastaFiles RepBase_v18.04.fasta` REPBASE_STAR_INDEX_DIR="/path/to/RepBase_v18.04_STAR_index" # Output directory for STAR results OUTPUT_DIR="star_repbase_mapping_results" mkdir -p "${OUTPUT_DIR}" # Number of threads for STAR NUM_THREADS=8 # Adjust based on available resources # --- STAR Mapping Command --- STAR \ --runThreadN "${NUM_THREADS}" \ --genomeDir "${REPBASE_STAR_INDEX_DIR}" \ --readFilesIn "${TRIMMED_READS_FASTQ}" \ --outFileNamePrefix "${OUTPUT_DIR}/repbase_mapped." \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outFilterMultimapNmax 100 \ --outFilterMismatchNmax 10 -
2
Reads did not map to repetitive elements were then mapped to the human genome (hg19).
$ Bash example
# Install STAR (example using conda) # conda create -n star_env star=2.7.10a -y # conda activate star_env # --- Reference Genome Preparation (hg19) --- # This step is for building the STAR index. It needs to be done once. # Download hg19 FASTA and GTF files from UCSC: # mkdir -p ~/star_index_references # cd ~/star_index_references # wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz # wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.refGene.gtf.gz # gunzip hg19.fa.gz # gunzip hg19.refGene.gtf.gz # Create STAR genome index for hg19: # mkdir -p ~/star_index/hg19 # STAR --runMode genomeGenerate \ # --genomeDir ~/star_index/hg19 \ # --genomeFastaFiles ~/star_index_references/hg19.fa \ # --sjdbGTFfile ~/star_index_references/hg19.refGene.gtf \ # --sjdbOverhang 100 \ # --runThreadN 8 # Adjust threads as needed # --- Alignment Step --- # Input: FASTQ file containing reads that did not map to repetitive elements. # Replace 'unmapped_to_repeats.fastq.gz' with your actual input file. # Replace 'aligned_to_hg19' with your desired output prefix. # Replace '~/star_index/hg19' with the actual path to your STAR hg19 index. # Adjust --runThreadN based on available CPU cores. # Adjust --limitBAMsortRAM based on available RAM (e.g., 30GB = 30000000000 bytes). STAR --genomeDir ~/star_index/hg19 \ --readFilesIn unmapped_to_repeats.fastq.gz \ --runThreadN 8 \ --outFileNamePrefix aligned_to_hg19_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes NH HI AS NM MD \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --sjdbScore 1 \ --readFilesCommand zcat \ --outSAMunmapped Within \ --outFilterScoreMin 0 \ --outFilterScoreMinOverLread 0 \ --outFilterMatchNmin 0 \ --outFilterMatchNminOverLread 0 \ --limitBAMsortRAM 30000000000 -
3
Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.
featureCounts v1.5.0$ Bash example
# Download GENCODE v19 annotation GTF file # Source: https://www.gencodegenes.org/human/release_19.html # wget is usually pre-installed or easily installed via package manager # sudo apt-get update && sudo apt-get install -y wget wget -O gencode.v19.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz # Decompress the GTF file # gunzip is usually pre-installed gunzip gencode.v19.annotation.gtf.gz # Run featureCounts to create read count matrices # Replace 'input_sampleX.bam' with your actual aligned BAM files. # -a: Annotation file in GTF/GFF format. # -o: Output file for read counts. # -t exon: Count reads mapping to 'exon' features (default for gene-level counting). # -g gene_id: Aggregate counts by 'gene_id' attribute in the GTF (default for gene-level counting). # -F GTF: Specify input annotation file format as GTF. # -T 8: Use 8 threads (adjust as needed). # -p: Include this flag if your reads are paired-end. # -s 2: Include this flag if your RNA-seq library is reverse stranded (0 for unstranded, 1 for forward). featureCounts -a gencode.v19.annotation.gtf -o read_counts.txt -t exon -g gene_id -F GTF -T 8 input_sample1.bam input_sample2.bam input_sample3.bam
-
4
Differential expression was calculated using DESeq2 version 1.10.1
$ Bash example
# This script assumes you have R and DESeq2 installed. # To install DESeq2 version 1.10.1 (which corresponds to Bioconductor 3.1, compatible with R 3.2.x): # First, ensure you have R 3.2.x installed. # Then, run the following inside R: # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install(version = "3.1") # This installs Bioconductor 3.1 # BiocManager::install("DESeq2") # This will install DESeq2 1.10.1 for Bioc 3.1 # Create an R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R library(DESeq2) # --- Configuration --- # Input count matrix file (e.g., from featureCounts, HTSeq-count) # Expected format: gene_id as first column, sample counts in subsequent columns. COUNT_FILE="counts.tsv" # Sample metadata file # Expected format: sample_id as first column, metadata columns (e.g., 'condition', 'batch') # Sample IDs in this file must match column names in COUNT_FILE. SAMPLE_INFO_FILE="sample_info.tsv" # Output file for differential expression results OUTPUT_RESULTS_FILE="deseq2_results.csv" # Design formula for DESeq2 (e.g., ~ condition, ~ batch + condition) DESIGN_FORMULA_STR="~ condition" # Contrast for results (e.g., c("condition", "treated", "control")) # If NULL, DESeq2 will pick the last level of the primary variable as reference. CONTRAST_VECTOR=c("condition", "treated", "control") # Example: comparing 'treated' vs 'control' # --- Load Data --- # Load count data count_data <- read.table(COUNT_FILE, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE) # Load sample information sample_info <- read.table(SAMPLE_INFO_FILE, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE) # Ensure sample order and names match common_samples <- intersect(colnames(count_data), rownames(sample_info)) if (length(common_samples) == 0) { stop("No common samples found between count data and sample information.") } count_data <- count_data[, common_samples] sample_info <- sample_info[common_samples, , drop = FALSE] # Convert design formula string to formula object design_formula <- as.formula(DESIGN_FORMULA_STR) # Create DESeqDataSet object dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = design_formula) # --- Pre-filtering (optional but recommended) --- # Remove genes with very low counts (e.g., less than 10 total counts across all samples) keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep,] # --- Run DESeq2 Analysis --- dds <- DESeq(dds) # --- Extract Results --- if (!is.null(CONTRAST_VECTOR)) { res <- results(dds, contrast = CONTRAST_VECTOR) } else { res <- results(dds) } # Order results by adjusted p-value res <- res[order(res$padj),] # Save results write.csv(as.data.frame(res), file = OUTPUT_RESULTS_FILE) # Optional: Save normalized counts # normalized_counts <- counts(dds, normalized=TRUE) # write.csv(as.data.frame(normalized_counts), file = "normalized_counts.csv") message(paste("DESeq2 analysis complete. Results saved to", OUTPUT_RESULTS_FILE)) EOF # Execute the R script Rscript run_deseq2.R # --- Placeholder for creating dummy input files for demonstration --- # In a real pipeline, these files would come from upstream steps (e.g., featureCounts, Salmon quantification) # echo -e "gene\tsample1_control\tsample2_control\tsample3_treated\tsample4_treated" > counts.tsv # echo -e "geneA\t100\t120\t500\t550" >> counts.tsv # echo -e "geneB\t50\t60\t30\t35" >> counts.tsv # echo -e "geneC\t200\t210\t220\t230" >> counts.tsv # echo -e "geneD\t10\t12\t15\t18" >> counts.tsv # echo -e "geneE\t5\t7\t8\t9" >> counts.tsv # This gene might be filtered out # echo -e "sample\tcondition" > sample_info.tsv # echo -e "sample1_control\tcontrol" >> sample_info.tsv # echo -e "sample2_control\tcontrol" >> sample_info.tsv # echo -e "sample3_treated\ttreated" >> sample_info.tsv # echo -e "sample4_treated\ttreated" >> sample_info.tsv -
5
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
Custom Python script (Inferred with models/gemini-2.5-flash) vPython 3.x (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Create dummy input files for demonstration # In a real pipeline, 'input_counts.tsv' and 'polysome_counts.tsv' would be outputs from a read counting tool (e.g., featureCounts, HTSeq-count). # gene_lengths.tsv: gene_id<tab>length_in_bp # Source: Derived from a genome annotation file (e.g., GTF/GFF) for a specific assembly (e.g., GRCh38/hg38). echo -e "geneA\t1000" > gene_lengths.tsv echo -e "geneB\t2000" >> gene_lengths.tsv echo -e "geneC\t500" >> gene_lengths.tsv # input_counts.tsv: gene_id<tab>read_count (for input fraction) echo -e "geneA\t100" > input_counts.tsv echo -e "geneB\t200" >> input_counts.tsv echo -e "geneC\t50" >> input_counts.tsv # polysome_counts.tsv: gene_id<tab>read_count (for polysome fraction) echo -e "geneA\t150" > polysome_counts.tsv echo -e "geneB\t300" >> polysome_counts.tsv echo -e "geneC\t75" >> polysome_counts.tsv # Create a Python script to calculate RPKM # This script requires the pandas library. Install with: # pip install pandas cat << 'EOF' > rpkm_calculator.py import pandas as pd import sys def calculate_rpkm(counts_file, lengths_file, output_file): """ Calculates RPKM from read counts and gene lengths. Formula: RPKM = (C * 10^9) / (N * L) C: Number of reads mapped to a transcript/gene N: Total number of mapped reads in the experiment (sum of all C values for the sample) L: Length of the transcript/gene in base pairs """ try: counts_df = pd.read_csv(counts_file, sep='\t', header=None, names=['gene_id', 'read_count']) lengths_df = pd.read_csv(lengths_file, sep='\t', header=None, names=['gene_id', 'length_in_bp']) except FileNotFoundError as e: print(f"Error: Input file not found - {e}", file=sys.stderr) sys.exit(1) except Exception as e: print(f"Error reading input files: {e}", file=sys.stderr) sys.exit(1) # Merge counts and lengths based on gene_id merged_df = pd.merge(counts_df, lengths_df, on='gene_id', how='inner') if merged_df.empty: print(f"Warning: No matching genes found between {counts_file} and {lengths_file}. Output file will be empty.", file=sys.stderr) merged_df['rpkm'] = [] else: # Calculate total mapped reads (N) for the current sample total_mapped_reads = merged_df['read_count'].sum() if total_mapped_reads == 0: print(f"Warning: Total mapped reads for {counts_file} is zero. RPKM will be zero for all genes.", file=sys.stderr) merged_df['rpkm'] = 0.0 else: # Calculate RPKM: (C * 10^9) / (N * L) merged_df['rpkm'] = (merged_df['read_count'] * 1_000_000_000) / (total_mapped_reads * merged_df['length_in_bp']) # Save results merged_df[['gene_id', 'rpkm']].to_csv(output_file, sep='\t', index=False, header=False) print(f"RPKM calculated and saved to {output_file}", file=sys.stderr) if __name__ == "__main__": if len(sys.argv) != 4: print("Usage: python rpkm_calculator.py <counts_file> <lengths_file> <output_file>", file=sys.stderr) sys.exit(1) counts_file = sys.argv[1] lengths_file = sys.argv[2] output_file = sys.argv[3] calculate_rpkm(counts_file, lengths_file, output_file) EOF # Calculate RPKM for input fraction python rpkm_calculator.py input_counts.tsv gene_lengths.tsv input_rpkm.tsv # Calculate RPKM for polysome fraction python rpkm_calculator.py polysome_counts.tsv gene_lengths.tsv polysome_rpkm.tsv # Clean up the script (optional, but good practice in a pipeline) rm rpkm_calculator.py
Raw Source Text
RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices. Differential expression was calculated using DESeq2 version 1.10.1 The transcript RPKMs of input and polysome fractions were calculated from the read count matrices. Genome_build: Homo sapiens UCSC hg19