GSE126337 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

Characterization of an RNA binding protein interactome reveals a context-specific post-transcriptional landscape of MYC-amplified medulloblastoma.

Nature communications (2022) — PMID 36473869

Dataset

GSE126337

Musashi-1 is a master regulator of aberrant translation in Group 3 medulloblastoma

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Reads were mapped to GRCh38 using STAR (v2.4.2a) and Gencode v25 gene annotations

STAR v2.4.2a GitHub

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star=2.4.2a

# --- Reference Data Setup (Example paths, adjust as needed) ---
# Define directories
# REF_DIR="/path/to/references"
# STAR_INDEX_DIR="${REF_DIR}/star_index/GRCh38_Gencode_v25"
# mkdir -p ${REF_DIR} ${STAR_INDEX_DIR}

# # Download GRCh38 primary assembly (e.g., from Ensembl or NCBI)
# # wget -P ${REF_DIR} ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# # gunzip ${REF_DIR}/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# # GENOME_FASTA="${REF_DIR}/Homo_sapiens.GRCh38.dna.primary_assembly.fa"

# # Download Gencode v25 GTF annotation (e.g., from Gencode FTP)
# # wget -P ${REF_DIR} ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz
# # gunzip ${REF_DIR}/gencode.v25.annotation.gtf.gz
# # GTF_FILE="${REF_DIR}/gencode.v25.annotation.gtf"

# # --- Build STAR Index (if not already built) ---
# # This step uses GRCh38 and Gencode v25 annotations to build the splice-aware index.
# # STAR --runMode genomeGenerate \
# #      --genomeDir ${STAR_INDEX_DIR} \
# #      --genomeFastaFiles ${GENOME_FASTA} \
# #      --sjdbGTFfile ${GTF_FILE} \
# #      --sjdbOverhang 100 \
# #      --runThreadN 8 # Adjust threads as needed

# --- STAR Alignment (Execution Command) ---
# Define input/output paths and parameters
STAR_INDEX_DIR="/path/to/star_index/GRCh38_Gencode_v25" # Path to the pre-built STAR index
READS_R1="input_reads_R1.fastq.gz" # Replace with your actual R1 FASTQ file
READS_R2="input_reads_R2.fastq.gz" # Replace with your actual R2 FASTQ file (remove if single-end)
OUTPUT_PREFIX="mapped_reads" # Prefix for output files
NUM_THREADS=8 # Number of threads to use for alignment

# Run STAR alignment
STAR --genomeDir ${STAR_INDEX_DIR} \
     --readFilesIn ${READS_R1} ${READS_R2} \
     --readFilesCommand zcat \
     --runThreadN ${NUM_THREADS} \
     --outFileNamePrefix ${OUTPUT_PREFIX}. \
     --outSAMtype BAM SortedByCoordinate

View on GitHub

STAR --genomeDir /STARgenomes/hg38_Gencode25 --outFileNamePrefix $sampleName"_" --readFilesIn /path/to/read1 /path/to/read2 --readFilesCommand zcat --runThreadN 8 --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --sjdbGTFfile /path/to/gencode.v25.annotation.gtf

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (e.g., using conda)
# conda install -c bioconda star

# Define variables (example)
# Replace with actual sample name, read paths, and reference paths
SAMPLE_NAME="my_sample"
READ1="/path/to/my_sample_R1.fastq.gz"
READ2="/path/to/my_sample_R2.fastq.gz"

# Reference datasets:
# Genome index directory built with hg38 and Gencode v25
GENOME_DIR="/STARgenomes/hg38_Gencode25"
# Gencode v25 annotation GTF file
GTF_FILE="/path/to/gencode.v25.annotation.gtf"

STAR --genomeDir "${GENOME_DIR}" \
     --outFileNamePrefix "${SAMPLE_NAME}_" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \
     --runThreadN 8 \
     --outSAMstrandField intronMotif \
     --outSAMtype BAM SortedByCoordinate \
     --quantMode GeneCounts \
     --sjdbGTFfile "${GTF_FILE}"

View on GitHub

Read counts from STAR were merged along with gene annotations using bespoke R script

STAR v2.7.10a GitHub

$ Bash example

# Install R and necessary packages if not already installed
# conda install -c conda-forge r-base r-data.table r-dplyr r-rtracklayer

# Placeholder for STAR output file containing read counts per gene.
# In a real pipeline, this would be the output from a previous STAR alignment step (e.g., ReadsPerGene.out.tab).
STAR_COUNTS_FILE="star_output/ReadsPerGene.out.tab"

# Placeholder for gene annotation file (e.g., GTF).
# Using a common human assembly and annotation release as a placeholder.
GENOME_ASSEMBLY="GRCh38"
ENSEMBL_RELEASE="109" # Or a relevant Ensembl release
GTF_FILE="Homo_sapiens.${GENOME_ASSEMBLY}.${ENSEMBL_RELEASE}.gtf"

# Assume the GTF file is already available at a specified path.
# In a real scenario, you might download it first:
# GTF_URL="http://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/gtf/homo_sapiens/${GTF_FILE}.gz"
# mkdir -p annotations
# if [ ! -f "annotations/${GTF_FILE}" ]; then
#   echo "Downloading ${GTF_FILE}..."
#   wget -O "annotations/${GTF_FILE}.gz" "${GTF_URL}"
#   gunzip -f "annotations/${GTF_FILE}.gz"
# fi
GENE_ANNOTATION_FILE="/path/to/annotations/${GTF_FILE}"

# Output file for the merged data
MERGED_OUTPUT_FILE="merged_star_gene_counts.tsv"

# Create a dummy R script for demonstration purposes.
# This represents the "bespoke R script" mentioned in the description.
cat << 'EOF' > merge_star_counts.R
#!/usr/bin/env Rscript

args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 3) {
  stop("Usage: Rscript merge_star_counts.R <star_counts_file> <gene_annotation_file> <output_file>", call. = FALSE)
}

star_counts_file <- args[1]
gene_annotation_file <- args[2]
output_file <- args[3]

# Load necessary libraries (install if not present, e.g., install.packages(c("data.table", "dplyr", "rtracklayer")))
library(data.table)
library(dplyr)
library(rtracklayer) # For GTF parsing

message(paste("Reading STAR counts from:", star_counts_file))
# STAR's ReadsPerGene.out.tab typically has 4 columns: GeneID, unstranded, stranded_forward, stranded_reverse.
# The first 4 lines are usually header/summary information and should be skipped.
star_counts <- fread(star_counts_file, skip = 4, header = FALSE, col.names = c("GeneID", "Unstranded", "Stranded_Forward", "Stranded_Reverse"))
# For merging, we'll use GeneID and Unstranded counts as an example.
star_counts <- star_counts %>% select(GeneID, Unstranded)

message(paste("Reading gene annotations from:", gene_annotation_file))
# Read gene annotations from GTF file.
# This is a simplified example; real annotation merging might involve more complex parsing or pre-processed annotation objects.
gtf <- import(gene_annotation_file)
genes_df <- as.data.frame(gtf) %>% 
  filter(type == "gene") %>% # Filter for gene entries
  select(gene_id, gene_name, seqnames, start, end, strand) %>% # Select relevant columns
  distinct(gene_id, .keep_all = TRUE) # Ensure unique gene_id entries

# Rename gene_id to GeneID to match the column name in STAR counts for merging.
genes_df <- genes_df %>% rename(GeneID = gene_id)

message("Merging STAR counts with gene annotations...")
merged_data <- left_join(star_counts, genes_df, by = "GeneID")

message(paste("Writing merged data to:", output_file))
fwrite(merged_data, file = output_file, sep = "\t")

message("Merging complete.")
EOF

# Execute the bespoke R script to merge STAR counts with gene annotations
Rscript merge_star_counts.R "${STAR_COUNTS_FILE}" "${GENE_ANNOTATION_FILE}" "${MERGED_OUTPUT_FILE}"

View on GitHub

Tools Used

STAR

Raw Source Text

Reads were mapped to GRCh38 using STAR (v2.4.2a) and Gencode v25 gene annotations
STAR --genomeDir /STARgenomes/hg38_Gencode25 --outFileNamePrefix $sampleName"_" --readFilesIn /path/to/read1 /path/to/read2 --readFilesCommand zcat --runThreadN 8 --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --sjdbGTFfile /path/to/gencode.v25.annotation.gtf
Read counts from STAR were merged along with gene annotations using bespoke R script
Genome_build: GRCh38
Supplementary_files_format_and_content: Tab-delimited file containing read counts per gene

← Back to Analysis