GSE134597 Processing Pipeline
RNA-Seq
code_examples
4 steps
Publication
Characterization of an RNA binding protein interactome reveals a context-specific post-transcriptional landscape of MYC-amplified medulloblastoma.Nature communications (2022) — PMID 36473869
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.4.0i # Define input and output paths # Assuming 'trimmed_reads.fastq.gz' is the output from cutadapt trimming READS="trimmed_reads.fastq.gz" # Placeholder for the STAR index built from RepBase v18.04 sequences. # This index needs to be pre-built using STAR's --runMode genomeGenerate command. REPBASE_STAR_INDEX="/path/to/RepBase_v18.04_STAR_index" OUTPUT_PREFIX="repbase_mapped_star" THREADS=8 # Adjust based on available computational resources # Map RNA-sequencing reads to repetitive elements using STAR STAR --runThreadN ${THREADS} \ --genomeDir ${REPBASE_STAR_INDEX} \ --readFilesIn ${READS} \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outBAMcompression 6 -
2
Reads did not map to repetitive elements were then mapped to the human genome (hg19).
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star=2.5.2b # Align reads to the human genome (hg19) # Assumes a STAR genome index for hg19 is pre-built at /path/to/star_index/hg19. # The hg19 reference genome can be obtained from sources like UCSC (e.g., hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz). # Assumes 'filtered_reads.fastq.gz' contains reads that did not map to repetitive elements. # The output BAM file will be named 'aligned_reads_Aligned.sortedByCoordinate.out.bam'. STAR --runThreadN 8 \ --genomeDir /path/to/star_index/hg19 \ --readFilesIn filtered_reads.fastq.gz \ --readFilesCommand zcat \ --outFileNamePrefix aligned_reads_ \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 3 \ --outFilterScoreMinOverLread 0.66 \ --outFilterMatchNminOverLread 0.66 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMtype BAM SortedByCoordinate \ --limitBAMsortRAM 30000000000 -
3
Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.
featureCounts v1.5.0$ Bash example
# Install featureCounts (part of Subread package) # conda install -c bioconda subread # Placeholder for GENCODE v19 annotation GTF file # Ensure gencode.v19.annotation.gtf is available in the working directory or specified path # Example download (replace with actual path if already available): # wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz # gunzip gencode.v19.annotation.gtf.gz GENCODE_GTF="gencode.v19.annotation.gtf" # Placeholder for input BAM files (e.g., aligned RNA-seq reads) # Replace with actual BAM file paths, space-separated for multiple files INPUT_BAM_FILES="sample1.bam sample2.bam" # Output file for the read count matrix OUTPUT_COUNTS="gene_read_counts.txt" featureCounts -a "${GENCODE_GTF}" \ -o "${OUTPUT_COUNTS}" \ -t exon \ -g gene_id \ ${INPUT_BAM_FILES} -
4
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
$ Bash example
# Install R and Bioconductor if not already installed # sudo apt-get update # sudo apt-get install r-base # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('edgeR')" # --- Reference Data Preparation (Example for Human GENCODE v38) --- # Transcript/gene lengths are crucial for RPKM calculation. They can be derived from a GTF/GFF annotation file. # Example command to download a GTF file and generate a gene_lengths.tsv (requires a custom script like gtf_to_genelength.py): # wget -nc ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz # gunzip -c gencode.v38.annotation.gtf.gz > gencode.v38.annotation.gtf # python gtf_to_genelength.py gencode.v38.annotation.gtf > gene_lengths.tsv # (A simple gtf_to_genelength.py script would parse the GTF to extract gene_id/transcript_id and calculate the effective length for each entry) # Create dummy input files for demonstration purposes. # In a real pipeline, these count matrices would be outputs from an upstream quantification step (e.g., featureCounts, HTSeq-count, Salmon, Kallisto). # The 'GeneID' column (or 'TranscriptID') should match the IDs in 'gene_lengths.tsv'. cat <<EOF > input_counts.tsv GeneID Input_Rep1 Input_Rep2 GeneA 100 120 GeneB 50 60 GeneC 200 210 EOF cat <<EOF > polysome_counts.tsv GeneID Polysome_Rep1 Polysome_Rep2 GeneA 150 160 GeneB 80 90 GeneC 300 310 EOF cat <<EOF > gene_lengths.tsv GeneID Length GeneA 1000 GeneB 2000 GeneC 500 EOF # R script to calculate RPKM for input and polysome fractions using edgeR Rscript -e ' # Load the edgeR library library(edgeR) # Define input and output file names input_counts_file <- "input_counts.tsv" polysome_counts_file <- "polysome_counts.tsv" gene_lengths_file <- "gene_lengths.tsv" output_input_rpkm_file <- "input_rpkm.tsv" output_polysome_rpkm_file <- "polysome_rpkm.tsv" # Load gene/transcript lengths. Ensure the first column is the ID and second is Length. gene_lengths <- read.delim(gene_lengths_file, row.names = 1, stringsAsFactors = FALSE) # --- Process Input Fraction --- # Load input fraction read counts. Ensure the first column is the ID. input_counts_df <- read.delim(input_counts_file, row.names = 1, stringsAsFactors = FALSE) # Ensure gene/transcript order in counts matches gene_lengths for correct RPKM calculation # Only keep IDs present in both counts and lengths, and in the same order common_ids_input <- intersect(rownames(input_counts_df), rownames(gene_lengths)) input_counts_df <- input_counts_df[common_ids_input, , drop = FALSE] gene_lengths_input <- gene_lengths[common_ids_input, , drop = FALSE] # Create a DGEList object for the input fraction dge_input <- DGEList(counts = input_counts_df, genes = gene_lengths_input) # Calculate RPKM for the input fraction # gene.length parameter expects a vector of lengths corresponding to the rows of the counts matrix rpkm_input <- rpkm(dge_input, gene.length = dge_input$genes$Length) # Write the input RPKM matrix to a TSV file write.table(rpkm_input, output_input_rpkm_file, sep = "\t", quote = FALSE, col.names = NA) message(paste("Input RPKM values saved to:", output_input_rpkm_file)) # --- Process Polysome Fraction --- # Load polysome fraction read counts polysome_counts_df <- read.delim(polysome_counts_file, row.names = 1, stringsAsFactors = FALSE) # Ensure gene/transcript order in counts matches gene_lengths common_ids_polysome <- intersect(rownames(polysome_counts_df), rownames(gene_lengths)) polysome_counts_df <- polysome_counts_df[common_ids_polysome, , drop = FALSE] gene_lengths_polysome <- gene_lengths[common_ids_polysome, , drop = FALSE] # Create a DGEList object for the polysome fraction dge_polysome <- DGEList(counts = polysome_counts_df, genes = gene_lengths_polysome) # Calculate RPKM for the polysome fraction rpkm_polysome <- rpkm(dge_polysome, gene.length = dge_polysome$genes$Length) # Write the polysome RPKM matrix to a TSV file write.table(rpkm_polysome, output_polysome_rpkm_file, sep = "\t", quote = FALSE, col.names = NA) message(paste("Polysome RPKM values saved to:", output_polysome_rpkm_file)) ' # Clean up dummy files (optional, in a real pipeline these would be inputs and outputs handled by the workflow manager) # rm input_counts.tsv polysome_counts.tsv gene_lengths.tsv
Tools Used
Raw Source Text
RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices. The transcript RPKMs of input and polysome fractions were calculated from the read count matrices. Genome_build: Homo sapiens UCSC hg19 (GRCh37) Supplementary_files_format_and_content: msi_polysome_RPKM.csv: Comma-separated text file includes RPKM values.