GSE134597 Processing Pipeline

RNA-Seq code_examples 4 steps

Publication

Characterization of an RNA binding protein interactome reveals a context-specific post-transcriptional landscape of MYC-amplified medulloblastoma.

Nature communications (2022) — PMID 36473869

Dataset

GSE134597

Musashi1 is a master regulator of aberrant translation in Group 3 medulloblastoma

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i).

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star=2.4.0i
    
    # Define input and output paths
    # Assuming 'trimmed_reads.fastq.gz' is the output from cutadapt trimming
    READS="trimmed_reads.fastq.gz"
    
    # Placeholder for the STAR index built from RepBase v18.04 sequences.
    # This index needs to be pre-built using STAR's --runMode genomeGenerate command.
    REPBASE_STAR_INDEX="/path/to/RepBase_v18.04_STAR_index"
    
    OUTPUT_PREFIX="repbase_mapped_star"
    THREADS=8 # Adjust based on available computational resources
    
    # Map RNA-sequencing reads to repetitive elements using STAR
    STAR --runThreadN ${THREADS} \
         --genomeDir ${REPBASE_STAR_INDEX} \
         --readFilesIn ${READS} \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outBAMcompression 6
  2. 2

    Reads did not map to repetitive elements were then mapped to the human genome (hg19).

    STAR (Inferred with models/gemini-2.5-flash) v2.5.2b (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star=2.5.2b
    
    # Align reads to the human genome (hg19)
    # Assumes a STAR genome index for hg19 is pre-built at /path/to/star_index/hg19.
    # The hg19 reference genome can be obtained from sources like UCSC (e.g., hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
    # Assumes 'filtered_reads.fastq.gz' contains reads that did not map to repetitive elements.
    # The output BAM file will be named 'aligned_reads_Aligned.sortedByCoordinate.out.bam'.
    STAR --runThreadN 8 \
         --genomeDir /path/to/star_index/hg19 \
         --readFilesIn filtered_reads.fastq.gz \
         --readFilesCommand zcat \
         --outFileNamePrefix aligned_reads_ \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 3 \
         --outFilterScoreMinOverLread 0.66 \
         --outFilterMatchNminOverLread 0.66 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --outSAMattributes All \
         --outSAMunmapped Within \
         --outSAMtype BAM SortedByCoordinate \
         --limitBAMsortRAM 30000000000
  3. 3

    Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.

    featureCounts v1.5.0
    $ Bash example
    # Install featureCounts (part of Subread package)
    # conda install -c bioconda subread
    
    # Placeholder for GENCODE v19 annotation GTF file
    # Ensure gencode.v19.annotation.gtf is available in the working directory or specified path
    # Example download (replace with actual path if already available):
    # wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
    # gunzip gencode.v19.annotation.gtf.gz
    GENCODE_GTF="gencode.v19.annotation.gtf"
    
    # Placeholder for input BAM files (e.g., aligned RNA-seq reads)
    # Replace with actual BAM file paths, space-separated for multiple files
    INPUT_BAM_FILES="sample1.bam sample2.bam"
    
    # Output file for the read count matrix
    OUTPUT_COUNTS="gene_read_counts.txt"
    
    featureCounts -a "${GENCODE_GTF}" \
                  -o "${OUTPUT_COUNTS}" \
                  -t exon \
                  -g gene_id \
                  ${INPUT_BAM_FILES}
  4. 4

    The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.

    edgeR (Inferred with models/gemini-2.5-flash) v4.0.0 GitHub
    $ Bash example
    # Install R and Bioconductor if not already installed
    # sudo apt-get update
    # sudo apt-get install r-base
    # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('edgeR')"
    
    # --- Reference Data Preparation (Example for Human GENCODE v38) ---
    # Transcript/gene lengths are crucial for RPKM calculation. They can be derived from a GTF/GFF annotation file.
    # Example command to download a GTF file and generate a gene_lengths.tsv (requires a custom script like gtf_to_genelength.py):
    # wget -nc ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
    # gunzip -c gencode.v38.annotation.gtf.gz > gencode.v38.annotation.gtf
    # python gtf_to_genelength.py gencode.v38.annotation.gtf > gene_lengths.tsv
    # (A simple gtf_to_genelength.py script would parse the GTF to extract gene_id/transcript_id and calculate the effective length for each entry)
    
    # Create dummy input files for demonstration purposes.
    # In a real pipeline, these count matrices would be outputs from an upstream quantification step (e.g., featureCounts, HTSeq-count, Salmon, Kallisto).
    # The 'GeneID' column (or 'TranscriptID') should match the IDs in 'gene_lengths.tsv'.
    cat <<EOF > input_counts.tsv
    GeneID	Input_Rep1	Input_Rep2
    GeneA	100	120
    GeneB	50	60
    GeneC	200	210
    EOF
    
    cat <<EOF > polysome_counts.tsv
    GeneID	Polysome_Rep1	Polysome_Rep2
    GeneA	150	160
    GeneB	80	90
    GeneC	300	310
    EOF
    
    cat <<EOF > gene_lengths.tsv
    GeneID	Length
    GeneA	1000
    GeneB	2000
    GeneC	500
    EOF
    
    # R script to calculate RPKM for input and polysome fractions using edgeR
    Rscript -e '
      # Load the edgeR library
      library(edgeR)
    
      # Define input and output file names
      input_counts_file <- "input_counts.tsv"
      polysome_counts_file <- "polysome_counts.tsv"
      gene_lengths_file <- "gene_lengths.tsv"
      output_input_rpkm_file <- "input_rpkm.tsv"
      output_polysome_rpkm_file <- "polysome_rpkm.tsv"
    
      # Load gene/transcript lengths. Ensure the first column is the ID and second is Length.
      gene_lengths <- read.delim(gene_lengths_file, row.names = 1, stringsAsFactors = FALSE)
    
      # --- Process Input Fraction ---
      # Load input fraction read counts. Ensure the first column is the ID.
      input_counts_df <- read.delim(input_counts_file, row.names = 1, stringsAsFactors = FALSE)
      
      # Ensure gene/transcript order in counts matches gene_lengths for correct RPKM calculation
      # Only keep IDs present in both counts and lengths, and in the same order
      common_ids_input <- intersect(rownames(input_counts_df), rownames(gene_lengths))
      input_counts_df <- input_counts_df[common_ids_input, , drop = FALSE]
      gene_lengths_input <- gene_lengths[common_ids_input, , drop = FALSE]
    
      # Create a DGEList object for the input fraction
      dge_input <- DGEList(counts = input_counts_df, genes = gene_lengths_input)
    
      # Calculate RPKM for the input fraction
      # gene.length parameter expects a vector of lengths corresponding to the rows of the counts matrix
      rpkm_input <- rpkm(dge_input, gene.length = dge_input$genes$Length)
    
      # Write the input RPKM matrix to a TSV file
      write.table(rpkm_input, output_input_rpkm_file, sep = "\t", quote = FALSE, col.names = NA)
      message(paste("Input RPKM values saved to:", output_input_rpkm_file))
    
      # --- Process Polysome Fraction ---
      # Load polysome fraction read counts
      polysome_counts_df <- read.delim(polysome_counts_file, row.names = 1, stringsAsFactors = FALSE)
    
      # Ensure gene/transcript order in counts matches gene_lengths
      common_ids_polysome <- intersect(rownames(polysome_counts_df), rownames(gene_lengths))
      polysome_counts_df <- polysome_counts_df[common_ids_polysome, , drop = FALSE]
      gene_lengths_polysome <- gene_lengths[common_ids_polysome, , drop = FALSE]
    
      # Create a DGEList object for the polysome fraction
      dge_polysome <- DGEList(counts = polysome_counts_df, genes = gene_lengths_polysome)
    
      # Calculate RPKM for the polysome fraction
      rpkm_polysome <- rpkm(dge_polysome, gene.length = dge_polysome$genes$Length)
    
      # Write the polysome RPKM matrix to a TSV file
      write.table(rpkm_polysome, output_polysome_rpkm_file, sep = "\t", quote = FALSE, col.names = NA)
      message(paste("Polysome RPKM values saved to:", output_polysome_rpkm_file))
    '
    
    # Clean up dummy files (optional, in a real pipeline these would be inputs and outputs handled by the workflow manager)
    # rm input_counts.tsv polysome_counts.tsv gene_lengths.tsv
    

Tools Used

Raw Source Text
RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
Genome_build: Homo sapiens UCSC hg19 (GRCh37)
Supplementary_files_format_and_content: msi_polysome_RPKM.csv: Comma-separated text file includes RPKM values.
← Back to Analysis