GSE134597 Processing Pipeline

RNA-Seq code_examples 4 steps

Publication

Characterization of an RNA binding protein interactome reveals a context-specific post-transcriptional landscape of MYC-amplified medulloblastoma.

Nature communications (2022) — PMID 36473869

Dataset

GSE134597

Musashi1 is a master regulator of aberrant translation in Group 3 medulloblastoma

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.4.0i

# Define input and output paths
# Assuming 'trimmed_reads.fastq.gz' is the output from cutadapt trimming
READS="trimmed_reads.fastq.gz"

# Placeholder for the STAR index built from RepBase v18.04 sequences.
# This index needs to be pre-built using STAR's --runMode genomeGenerate command.
REPBASE_STAR_INDEX="/path/to/RepBase_v18.04_STAR_index"

OUTPUT_PREFIX="repbase_mapped_star"
THREADS=8 # Adjust based on available computational resources

# Map RNA-sequencing reads to repetitive elements using STAR
STAR --runThreadN ${THREADS} \
     --genomeDir ${REPBASE_STAR_INDEX} \
     --readFilesIn ${READS} \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outBAMcompression 6

View on GitHub

Reads did not map to repetitive elements were then mapped to the human genome (hg19).

STAR (Inferred with models/gemini-2.5-flash) v2.5.2b (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star=2.5.2b

# Align reads to the human genome (hg19)
# Assumes a STAR genome index for hg19 is pre-built at /path/to/star_index/hg19.
# The hg19 reference genome can be obtained from sources like UCSC (e.g., hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
# Assumes 'filtered_reads.fastq.gz' contains reads that did not map to repetitive elements.
# The output BAM file will be named 'aligned_reads_Aligned.sortedByCoordinate.out.bam'.
STAR --runThreadN 8 \
     --genomeDir /path/to/star_index/hg19 \
     --readFilesIn filtered_reads.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix aligned_reads_ \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --outFilterScoreMinOverLread 0.66 \
     --outFilterMatchNminOverLread 0.66 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --outSAMattributes All \
     --outSAMunmapped Within \
     --outSAMtype BAM SortedByCoordinate \
     --limitBAMsortRAM 30000000000

View on GitHub

Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.

featureCounts v1.5.0

$ Bash example

# Install featureCounts (part of Subread package)
# conda install -c bioconda subread

# Placeholder for GENCODE v19 annotation GTF file
# Ensure gencode.v19.annotation.gtf is available in the working directory or specified path
# Example download (replace with actual path if already available):
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip gencode.v19.annotation.gtf.gz
GENCODE_GTF="gencode.v19.annotation.gtf"

# Placeholder for input BAM files (e.g., aligned RNA-seq reads)
# Replace with actual BAM file paths, space-separated for multiple files
INPUT_BAM_FILES="sample1.bam sample2.bam"

# Output file for the read count matrix
OUTPUT_COUNTS="gene_read_counts.txt"

featureCounts -a "${GENCODE_GTF}" \
              -o "${OUTPUT_COUNTS}" \
              -t exon \
              -g gene_id \
              ${INPUT_BAM_FILES}

The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.

edgeR (Inferred with models/gemini-2.5-flash) v4.0.0 GitHub

$ Bash example

# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install r-base
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('edgeR')"

# --- Reference Data Preparation (Example for Human GENCODE v38) ---
# Transcript/gene lengths are crucial for RPKM calculation. They can be derived from a GTF/GFF annotation file.
# Example command to download a GTF file and generate a gene_lengths.tsv (requires a custom script like gtf_to_genelength.py):
# wget -nc ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
# gunzip -c gencode.v38.annotation.gtf.gz > gencode.v38.annotation.gtf
# python gtf_to_genelength.py gencode.v38.annotation.gtf > gene_lengths.tsv
# (A simple gtf_to_genelength.py script would parse the GTF to extract gene_id/transcript_id and calculate the effective length for each entry)

# Create dummy input files for demonstration purposes.
# In a real pipeline, these count matrices would be outputs from an upstream quantification step (e.g., featureCounts, HTSeq-count, Salmon, Kallisto).
# The 'GeneID' column (or 'TranscriptID') should match the IDs in 'gene_lengths.tsv'.
cat <<EOF > input_counts.tsv
GeneID	Input_Rep1	Input_Rep2
GeneA	100	120
GeneB	50	60
GeneC	200	210
EOF

cat <<EOF > polysome_counts.tsv
GeneID	Polysome_Rep1	Polysome_Rep2
GeneA	150	160
GeneB	80	90
GeneC	300	310
EOF

cat <<EOF > gene_lengths.tsv
GeneID	Length
GeneA	1000
GeneB	2000
GeneC	500
EOF

# R script to calculate RPKM for input and polysome fractions using edgeR
Rscript -e '
  # Load the edgeR library
  library(edgeR)

  # Define input and output file names
  input_counts_file <- "input_counts.tsv"
  polysome_counts_file <- "polysome_counts.tsv"
  gene_lengths_file <- "gene_lengths.tsv"
  output_input_rpkm_file <- "input_rpkm.tsv"
  output_polysome_rpkm_file <- "polysome_rpkm.tsv"

  # Load gene/transcript lengths. Ensure the first column is the ID and second is Length.
  gene_lengths <- read.delim(gene_lengths_file, row.names = 1, stringsAsFactors = FALSE)

  # --- Process Input Fraction ---
  # Load input fraction read counts. Ensure the first column is the ID.
  input_counts_df <- read.delim(input_counts_file, row.names = 1, stringsAsFactors = FALSE)
  
  # Ensure gene/transcript order in counts matches gene_lengths for correct RPKM calculation
  # Only keep IDs present in both counts and lengths, and in the same order
  common_ids_input <- intersect(rownames(input_counts_df), rownames(gene_lengths))
  input_counts_df <- input_counts_df[common_ids_input, , drop = FALSE]
  gene_lengths_input <- gene_lengths[common_ids_input, , drop = FALSE]

  # Create a DGEList object for the input fraction
  dge_input <- DGEList(counts = input_counts_df, genes = gene_lengths_input)

  # Calculate RPKM for the input fraction
  # gene.length parameter expects a vector of lengths corresponding to the rows of the counts matrix
  rpkm_input <- rpkm(dge_input, gene.length = dge_input$genes$Length)

  # Write the input RPKM matrix to a TSV file
  write.table(rpkm_input, output_input_rpkm_file, sep = "\t", quote = FALSE, col.names = NA)
  message(paste("Input RPKM values saved to:", output_input_rpkm_file))

  # --- Process Polysome Fraction ---
  # Load polysome fraction read counts
  polysome_counts_df <- read.delim(polysome_counts_file, row.names = 1, stringsAsFactors = FALSE)

  # Ensure gene/transcript order in counts matches gene_lengths
  common_ids_polysome <- intersect(rownames(polysome_counts_df), rownames(gene_lengths))
  polysome_counts_df <- polysome_counts_df[common_ids_polysome, , drop = FALSE]
  gene_lengths_polysome <- gene_lengths[common_ids_polysome, , drop = FALSE]

  # Create a DGEList object for the polysome fraction
  dge_polysome <- DGEList(counts = polysome_counts_df, genes = gene_lengths_polysome)

  # Calculate RPKM for the polysome fraction
  rpkm_polysome <- rpkm(dge_polysome, gene.length = dge_polysome$genes$Length)

  # Write the polysome RPKM matrix to a TSV file
  write.table(rpkm_polysome, output_polysome_rpkm_file, sep = "\t", quote = FALSE, col.names = NA)
  message(paste("Polysome RPKM values saved to:", output_polysome_rpkm_file))
'

# Clean up dummy files (optional, in a real pipeline these would be inputs and outputs handled by the workflow manager)
# rm input_counts.tsv polysome_counts.tsv gene_lengths.tsv

View on GitHub

Tools Used

STAR

Raw Source Text

RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
Genome_build: Homo sapiens UCSC hg19 (GRCh37)
Supplementary_files_format_and_content: msi_polysome_RPKM.csv: Comma-separated text file includes RPKM values.

← Back to Analysis