GSE117293 Processing Pipeline
RNA-Seq
Publication
Large-scale tethered function assays identify factors that regulate mRNA stability and translation. Nature Structural & Molecular Biology (2020). PMID 32807991
Dataset
GSE117293: Large-scale tethered function assays identify factors that regulate mRNA stability and translation (HEK293T_RNA-seq)
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
RNA-sequencing reads were trimmed of adaptor sequences using cutadapt (v1.4.0) and mapped to repetitive elements (RepBase v18.04) using STAR (v2.4.0i).
$ Bash example
# Install STAR if not already available
# conda install -c bioconda star

# Define reference files and directories
REPBASE_FASTA="RepBase_v18.04.fasta"    # Placeholder for the RepBase v18.04 FASTA file
STAR_INDEX_DIR="RepBase_STAR_index"
INPUT_READS="trimmed_reads.fastq.gz"    # Placeholder for trimmed RNA-seq reads (output from cutadapt)

# Create a STAR genome index for RepBase
# This step assumes you have the RepBase v18.04 FASTA file.
# RepBase data typically requires a license for download from GIRI.
# For a small reference such as RepBase, the STAR manual recommends lowering
# --genomeSAindexNbases from its default of 14; the right value depends on the FASTA size.
mkdir -p "${STAR_INDEX_DIR}"
STAR --runMode genomeGenerate \
    --genomeDir "${STAR_INDEX_DIR}" \
    --genomeFastaFiles "${REPBASE_FASTA}" \
    --runThreadN 8    # Adjust thread count as needed

# Map RNA-sequencing reads to repetitive elements using STAR
# --readFilesCommand zcat is required because the input FASTQ is gzipped.
# --outReadsUnmapped Fastx writes the reads that did not map to repeats to
# RepBase_mapping_Unmapped.out.mate1; these are the input for the hg19 mapping step.
STAR --runMode alignReads \
    --genomeDir "${STAR_INDEX_DIR}" \
    --readFilesIn "${INPUT_READS}" \
    --readFilesCommand zcat \
    --runThreadN 8 \
    --outFileNamePrefix RepBase_mapping_ \
    --outSAMtype BAM SortedByCoordinate \
    --outBAMcompression 6 \
    --outReadsUnmapped Fastx
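The step description includes adaptor trimming with cutadapt, which the mapping example takes as already done. A minimal sketch of that trimming follows. The adaptor sequence, the minimum length, and the function names `cutadapt_cmd` and `trim_reads` are assumptions for illustration; the source text does not state the adaptor or any cutadapt options.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: neither value comes from the source text.
ADAPTER_3P="AGATCGGAAGAGC"   # assumed 3' adaptor; confirm against the library prep
MIN_LENGTH=18                # assumed minimum read length kept after trimming

# Print the cutadapt command for one FASTQ file so it can be reviewed before running.
cutadapt_cmd() {
    local in_fastq="$1" out_fastq="$2"
    printf 'cutadapt -a %s -m %s -o %s %s\n' \
        "$ADAPTER_3P" "$MIN_LENGTH" "$out_fastq" "$in_fastq"
}

# Run the trimming, but only if cutadapt is actually installed.
trim_reads() {
    local in_fastq="$1" out_fastq="$2"
    if ! command -v cutadapt >/dev/null 2>&1; then
        echo "cutadapt not found on PATH" >&2
        return 1
    fi
    cutadapt -a "$ADAPTER_3P" -m "$MIN_LENGTH" -o "$out_fastq" "$in_fastq"
}

# Example (hypothetical file names):
#   cutadapt_cmd raw_reads.fastq.gz trimmed_reads.fastq.gz
#   trim_reads   raw_reads.fastq.gz trimmed_reads.fastq.gz
```

Printing the command first makes it easy to check the adaptor and length settings against the actual library before any files are written.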
2
Reads that did not map to repetitive elements were then mapped to the human genome (hg19).
STAR v2.7.10a (tool and version inferred; the source text names STAR v2.4.0i only for the repeat-mapping step)
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Replace with actual paths and filenames
GENOME_DIR="/path/to/STAR_index/hg19"               # Placeholder for hg19 STAR genome index
INPUT_FASTQ="reads_unmapped_to_repeats.fastq.gz"    # Placeholder for reads not mapped to repetitive elements
OUTPUT_PREFIX="aligned_to_hg19"                     # Prefix for output files
THREADS=8                                           # Number of threads to use

# Map reads to the human genome (hg19)
# This command aligns the input FASTQ reads to the hg19 genome index.
# --readFilesCommand zcat is used for gzipped FASTQ files.
# --outSAMtype BAM SortedByCoordinate outputs a sorted BAM file.
# --outFilterMultimapNmax 1 ensures only uniquely mapping reads are reported.
# The filter thresholds below are illustrative; the source text does not state any STAR parameters.
STAR --runMode alignReads \
    --genomeDir "${GENOME_DIR}" \
    --readFilesIn "${INPUT_FASTQ}" \
    --readFilesCommand zcat \
    --outFileNamePrefix "${OUTPUT_PREFIX}_" \
    --outSAMtype BAM SortedByCoordinate \
    --outFilterMultimapNmax 1 \
    --outFilterMismatchNmax 3 \
    --outFilterScoreMinOverLread 0.66 \
    --outFilterMatchNminOverLread 0.66 \
    --runThreadN "${THREADS}"
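The hg19 example expects a gzipped FASTQ of reads that failed to map to repeats, while STAR run with `--outReadsUnmapped Fastx` writes those reads uncompressed to `<prefix>Unmapped.out.mate1`. A small sketch of the handoff between the two steps follows. The function name `collect_unmapped` is hypothetical, and single-end reads are assumed because the source text does not state the library layout.

```shell
#!/usr/bin/env bash
# Compress STAR's unmapped-read output so it matches the gzipped FASTQ
# that the hg19 mapping example reads with --readFilesCommand zcat.
#   $1: the --outFileNamePrefix used in the repeat-mapping run (e.g. RepBase_mapping_)
#   $2: destination file (e.g. reads_unmapped_to_repeats.fastq.gz)
collect_unmapped() {
    local prefix="$1" out_fastq_gz="$2"
    local unmapped="${prefix}Unmapped.out.mate1"   # single-end assumption
    if [ ! -s "$unmapped" ]; then
        echo "missing or empty: $unmapped" >&2
        return 1
    fi
    gzip -c "$unmapped" > "$out_fastq_gz"
}

# Example:
#   collect_unmapped RepBase_mapping_ reads_unmapped_to_repeats.fastq.gz
```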
3
Read count matrices were created using GENCODE (v19) gene annotations and featureCounts (v1.5.0).
$ Bash example
# Install featureCounts (part of the Subread package)
# conda install -c bioconda subread

# Define reference annotation file (GENCODE v19)
GENCODE_V19_GTF="gencode.v19.annotation.gtf"

# Placeholder for input BAM files (aligned reads)
INPUT_BAMS="sample1.bam sample2.bam"

# Output file for gene count matrix
OUTPUT_COUNTS="gene_counts.txt"

# Create read count matrices using featureCounts
# -a: Annotation file (GTF/GFF)
# -o: Output file
# -t exon: Count reads overlapping exons (default for gene-level counting)
# -g gene_id: Group features by the 'gene_id' attribute to get gene-level counts
# ${INPUT_BAMS} is left unquoted on purpose so that it expands to one argument per BAM file.
featureCounts -a "${GENCODE_V19_GTF}" -o "${OUTPUT_COUNTS}" -t exon -g gene_id ${INPUT_BAMS}
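The file written by featureCounts is not yet a plain genes-by-samples table: it starts with a `#` comment line recording the command, and columns 2 to 6 hold Chr, Start, End, Strand, and Length. A sketch of stripping these so the result can be read as a count matrix follows. The function name `featurecounts_to_matrix` and the output name `counts.tsv` are illustrative choices.

```shell
#!/usr/bin/env bash
# Convert featureCounts output into a tab-separated gene-by-sample matrix.
# Keeps column 1 (Geneid) and columns 7 onward (one count column per BAM file).
featurecounts_to_matrix() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                     # skip the "# Program:featureCounts ..." line
        {
            row = $1                      # Geneid (or the header label)
            for (i = 7; i <= NF; i++) row = row OFS $i
            print row
        }' "$1"
}

# Example:
#   featurecounts_to_matrix gene_counts.txt > counts.tsv
```

The sample columns keep the BAM file names as headers, so any sample metadata table used downstream needs matching identifiers.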
4
Differential expression was calculated using DESeq2 (v1.10.1).
$ Bash example
# Install R and Bioconductor if not already installed
# Note: BiocManager installs the current DESeq2 release, not version 1.10.1 named in the source.
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'

# Create a placeholder R script for DESeq2 differential expression analysis.
# This script assumes 'counts.tsv' contains raw gene counts (genes as rows, samples as columns)
# and 'sample_info.tsv' contains metadata (e.g., 'sample_id', 'condition').
# Raw featureCounts output also carries a comment line and Chr/Start/End/Strand/Length columns,
# which must be removed before the file can serve as 'counts.tsv'.
# Replace 'counts.tsv' and 'sample_info.tsv' with your actual input files.
# Replace 'condition' with your actual experimental variable.
# Replace 'control' and 'treated' with your actual group names for comparison.
cat << 'EOF' > run_deseq2.R
# Load DESeq2 library
library(DESeq2)

# --- Configuration ---
# Input files
counts_file <- "counts.tsv"            # Raw count matrix (genes x samples)
sample_info_file <- "sample_info.tsv"  # Sample metadata (samples x variables)

# Output files
results_output_file <- "deseq2_results.tsv"
normalized_counts_output_file <- "deseq2_normalized_counts.tsv"

# Design formula (e.g., ~ condition)
design_formula <- ~ condition

# Reference and test levels for the comparison
# Both must match levels in the 'condition' column of sample_info_file
reference_level <- "control"
test_level <- "treated"

# --- Data Loading ---
# Load count data (gene IDs in the first column, sample counts in the remaining columns)
count_data <- read.table(counts_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)

# Load sample information (sample IDs in the first column, metadata in the remaining columns)
sample_info <- read.table(sample_info_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)

# Column names of count_data must all appear as row names of sample_info;
# stop early if they do not, then put the metadata in the same order as the counts
stopifnot(all(colnames(count_data) %in% rownames(sample_info)))
sample_info <- sample_info[colnames(count_data), , drop = FALSE]

# Convert condition column to factor and set reference level
sample_info$condition <- factor(sample_info$condition, levels = c(reference_level, test_level))

# --- DESeq2 Analysis ---
# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = design_formula)

# Pre-filtering: remove genes with very low counts
# (e.g., keep genes with at least 10 counts in total across all samples)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Extract results for the specified contrast
# The contrast argument is c("variable", "level_of_interest", "reference_level")
res <- results(dds, contrast = c("condition", test_level, reference_level))

# Order results by adjusted p-value
res <- res[order(res$padj), ]

# --- Output Results ---
# Write differential expression results to a TSV file
write.table(as.data.frame(res), file = results_output_file, sep = "\t", quote = FALSE, row.names = TRUE)

# Get normalized counts
normalized_counts <- counts(dds, normalized = TRUE)
write.table(as.data.frame(normalized_counts), file = normalized_counts_output_file, sep = "\t", quote = FALSE, row.names = TRUE)

message(paste("DESeq2 differential expression results saved to:", results_output_file))
message(paste("Normalized counts saved to:", normalized_counts_output_file))
EOF

# Execute the R script
Rscript run_deseq2.R
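Once `deseq2_results.tsv` exists, a quick way to inspect it is to list the genes that pass an adjusted p-value cutoff. The sketch below assumes the table layout that `write.table` produces with row names enabled: a header row with one field fewer than the data rows, and `padj` as the last column. The function name `significant_genes` and the default cutoff of 0.05 are illustrative; the source text does not report a threshold.

```shell
#!/usr/bin/env bash
# Print gene ID, log2 fold change, and padj for genes below a padj cutoff.
#   $1: DESeq2 results table written by write.table(..., row.names = TRUE)
#   $2: padj cutoff (hypothetical default 0.05, not taken from the source)
significant_genes() {
    local results="$1" cutoff="${2:-0.05}"
    awk -F'\t' -v cutoff="$cutoff" '
        NR == 1     { next }              # header row
        $NF == "NA" { next }              # padj is NA for genes DESeq2 filtered out
        ($NF + 0) < cutoff { print $1 "\t" $3 "\t" $NF }
    ' "$results"
}

# Example:
#   significant_genes deseq2_results.tsv 0.05 > significant_genes.tsv
```

In data rows, field 1 is the gene ID, field 3 is `log2FoldChange`, and the last field is `padj`.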
5
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
$ Bash example
# NOTE: RSEM is a substitute technique in this example. The source text states only that RPKMs
# were calculated from the read count matrices; it does not mention RSEM.

# Install RSEM (if not already installed)
# conda install -c bioconda rsem

# Define reference paths (replace with actual paths to your reference files)
# The source specifies the hg19 genome build and GENCODE v19 annotations.
GENOME_FASTA="path/to/hg19.fa"
GENES_GTF="path/to/gencode.v19.annotation.gtf"
RSEM_REF_INDEX="path/to/rsem_hg19_index"    # Reference name (path prefix) for the RSEM index files

# Prepare RSEM reference (run this step once for your reference genome/transcriptome)
# This command builds the necessary index files for RSEM quantification.
# rsem-prepare-reference --gtf "${GENES_GTF}" "${GENOME_FASTA}" "${RSEM_REF_INDEX}"

# Define input BAM files (replace with actual paths to your aligned reads)
# RSEM requires alignments in transcriptome coordinates (for example, STAR run with
# --quantMode TranscriptomeSAM), not genome-coordinate BAM files.
INPUT_BAM="path/to/input_fraction.bam"
POLYSOME_BAM="path/to/polysome_fraction.bam"

# Define output prefixes for RSEM results
INPUT_OUTPUT_PREFIX="input_fraction_rsem"
POLYSOME_OUTPUT_PREFIX="polysome_fraction_rsem"

# Quantify the input fraction
# -p 8: use 8 threads
# --alignments: the input is a SAM/BAM/CRAM file (older RSEM releases use --bam instead)
# Positional arguments: alignment file, reference name, output prefix
rsem-calculate-expression \
    -p 8 \
    --alignments \
    "${INPUT_BAM}" \
    "${RSEM_REF_INDEX}" \
    "${INPUT_OUTPUT_PREFIX}"

# Quantify the polysome fraction
rsem-calculate-expression \
    -p 8 \
    --alignments \
    "${POLYSOME_BAM}" \
    "${RSEM_REF_INDEX}" \
    "${POLYSOME_OUTPUT_PREFIX}"

# Gene-level results are written to "${INPUT_OUTPUT_PREFIX}.genes.results" and
# "${POLYSOME_OUTPUT_PREFIX}.genes.results"; transcript-level results are in the matching
# .isoforms.results files. The FPKM column corresponds to RPKM for single-end reads.
# The TPM column uses a different normalization and is not interchangeable with RPKM.
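The step description says RPKMs were calculated from the read count matrices, so a more direct reading is to normalize the featureCounts table itself. The sketch below applies the standard definition, RPKM = count × 10^9 / (feature length in bp × total counted reads in the sample), to featureCounts output, using its Length column (column 6). Two points are assumptions: the library size is taken as the sum of counted reads per sample, and the features are whatever featureCounts summarized (genes when `-g gene_id` is used), whereas the source speaks of transcript RPKMs without giving details. The function name `counts_to_rpkm` is hypothetical.

```shell
#!/usr/bin/env bash
# Compute RPKM per feature and sample from featureCounts output.
# The file is read twice: pass 1 sums counts per sample, pass 2 normalizes.
counts_to_rpkm() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                         # skip the featureCounts comment line
        $1 == "Geneid" {                      # header row
            if (NR != FNR) {                  # print it once, during the second pass
                row = $1
                for (i = 7; i <= NF; i++) row = row OFS $i
                print row
            }
            next
        }
        NR == FNR {                           # pass 1: per-sample library size
            for (i = 7; i <= NF; i++) total[i] += $i
            next
        }
        {                                     # pass 2: RPKM for each sample column
            row = $1
            for (i = 7; i <= NF; i++) {
                rpkm = (total[i] > 0 && $6 > 0) ? $i * 1e9 / ($6 * total[i]) : 0
                row = row OFS sprintf("%.4f", rpkm)
            }
            print row
        }' "$1" "$1"
}

# Example:
#   counts_to_rpkm gene_counts.txt > rpkm.tsv
```

Input and polysome fractions would each appear as sample columns in the same table, so their RPKMs can be compared feature by feature.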
Raw Source Text
RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices. Differential expression was calculated using DESeq2 version 1.10.1 The transcript RPKMs of input and polysome fractions were calculated from the read count matrices. Genome_build: Homo sapiens UCSC hg19