GSE126337 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

Characterization of an RNA binding protein interactome reveals a context-specific post-transcriptional landscape of MYC-amplified medulloblastoma.

Nature communications (2022) — PMID 36473869

Dataset

GSE126337

Musashi-1 is a master regulator of aberrant translation in Group 3 medulloblastoma

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Reads were mapped to GRCh38 using STAR (v2.4.2a) and Gencode v25 gene annotations

    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star=2.4.2a
    
    # --- Reference Data Setup (Example paths, adjust as needed) ---
    # Define directories
    # REF_DIR="/path/to/references"
    # STAR_INDEX_DIR="${REF_DIR}/star_index/GRCh38_Gencode_v25"
    # mkdir -p ${REF_DIR} ${STAR_INDEX_DIR}
    
    # # Download GRCh38 primary assembly (e.g., from Ensembl or NCBI)
    # # wget -P ${REF_DIR} ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    # # gunzip ${REF_DIR}/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    # # GENOME_FASTA="${REF_DIR}/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
    
    # # Download Gencode v25 GTF annotation (e.g., from Gencode FTP)
    # # wget -P ${REF_DIR} ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz
    # # gunzip ${REF_DIR}/gencode.v25.annotation.gtf.gz
    # # GTF_FILE="${REF_DIR}/gencode.v25.annotation.gtf"
    
    # # --- Build STAR Index (if not already built) ---
    # # This step uses GRCh38 and Gencode v25 annotations to build the splice-aware index.
    # # STAR --runMode genomeGenerate \
    # #      --genomeDir ${STAR_INDEX_DIR} \
    # #      --genomeFastaFiles ${GENOME_FASTA} \
    # #      --sjdbGTFfile ${GTF_FILE} \
    # #      --sjdbOverhang 100 \
    # #      --runThreadN 8 # Adjust threads as needed
    
    # --- STAR Alignment (Execution Command) ---
    # Define input/output paths and parameters
    STAR_INDEX_DIR="/path/to/star_index/GRCh38_Gencode_v25" # Path to the pre-built STAR index
    READS_R1="input_reads_R1.fastq.gz" # Replace with your actual R1 FASTQ file
    READS_R2="input_reads_R2.fastq.gz" # Replace with your actual R2 FASTQ file (remove if single-end)
    OUTPUT_PREFIX="mapped_reads" # Prefix for output files
    NUM_THREADS=8 # Number of threads to use for alignment
    
    # Run STAR alignment
    STAR --genomeDir ${STAR_INDEX_DIR} \
         --readFilesIn ${READS_R1} ${READS_R2} \
         --readFilesCommand zcat \
         --runThreadN ${NUM_THREADS} \
         --outFileNamePrefix ${OUTPUT_PREFIX}. \
         --outSAMtype BAM SortedByCoordinate
    
  2. 2

    STAR --genomeDir /STARgenomes/hg38_Gencode25 --outFileNamePrefix $sampleName"_" --readFilesIn /path/to/read1 /path/to/read2 --readFilesCommand zcat --runThreadN 8 --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --sjdbGTFfile /path/to/gencode.v25.annotation.gtf

    $ Bash example
    # Install STAR (e.g., using conda)
    # conda install -c bioconda star
    
    # Define variables (example)
    # Replace with actual sample name, read paths, and reference paths
    SAMPLE_NAME="my_sample"
    READ1="/path/to/my_sample_R1.fastq.gz"
    READ2="/path/to/my_sample_R2.fastq.gz"
    
    # Reference datasets:
    # Genome index directory built with hg38 and Gencode v25
    GENOME_DIR="/STARgenomes/hg38_Gencode25"
    # Gencode v25 annotation GTF file
    GTF_FILE="/path/to/gencode.v25.annotation.gtf"
    
    STAR --genomeDir "${GENOME_DIR}" \
         --outFileNamePrefix "${SAMPLE_NAME}_" \
         --readFilesIn "${READ1}" "${READ2}" \
         --readFilesCommand zcat \
         --runThreadN 8 \
         --outSAMstrandField intronMotif \
         --outSAMtype BAM SortedByCoordinate \
         --quantMode GeneCounts \
         --sjdbGTFfile "${GTF_FILE}"
  3. 3

    Read counts from STAR were merged along with gene annotations using bespoke R script

    $ Bash example
    # Install R and necessary packages if not already installed
    # conda install -c conda-forge r-base r-data.table r-dplyr r-rtracklayer
    
    # Placeholder for STAR output file containing read counts per gene.
    # In a real pipeline, this would be the output from a previous STAR alignment step (e.g., ReadsPerGene.out.tab).
    STAR_COUNTS_FILE="star_output/ReadsPerGene.out.tab"
    
    # Placeholder for gene annotation file (e.g., GTF).
    # Using a common human assembly and annotation release as a placeholder.
    GENOME_ASSEMBLY="GRCh38"
    ENSEMBL_RELEASE="109" # Or a relevant Ensembl release
    GTF_FILE="Homo_sapiens.${GENOME_ASSEMBLY}.${ENSEMBL_RELEASE}.gtf"
    
    # Assume the GTF file is already available at a specified path.
    # In a real scenario, you might download it first:
    # GTF_URL="http://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/gtf/homo_sapiens/${GTF_FILE}.gz"
    # mkdir -p annotations
    # if [ ! -f "annotations/${GTF_FILE}" ]; then
    #   echo "Downloading ${GTF_FILE}..."
    #   wget -O "annotations/${GTF_FILE}.gz" "${GTF_URL}"
    #   gunzip -f "annotations/${GTF_FILE}.gz"
    # fi
    GENE_ANNOTATION_FILE="/path/to/annotations/${GTF_FILE}"
    
    # Output file for the merged data
    MERGED_OUTPUT_FILE="merged_star_gene_counts.tsv"
    
    # Create a dummy R script for demonstration purposes.
    # This represents the "bespoke R script" mentioned in the description.
    cat << 'EOF' > merge_star_counts.R
    #!/usr/bin/env Rscript
    
    args <- commandArgs(trailingOnly = TRUE)
    if (length(args) != 3) {
      stop("Usage: Rscript merge_star_counts.R <star_counts_file> <gene_annotation_file> <output_file>", call. = FALSE)
    }
    
    star_counts_file <- args[1]
    gene_annotation_file <- args[2]
    output_file <- args[3]
    
    # Load necessary libraries (install if not present, e.g., install.packages(c("data.table", "dplyr", "rtracklayer")))
    library(data.table)
    library(dplyr)
    library(rtracklayer) # For GTF parsing
    
    message(paste("Reading STAR counts from:", star_counts_file))
    # STAR's ReadsPerGene.out.tab typically has 4 columns: GeneID, unstranded, stranded_forward, stranded_reverse.
    # The first 4 lines are usually header/summary information and should be skipped.
    star_counts <- fread(star_counts_file, skip = 4, header = FALSE, col.names = c("GeneID", "Unstranded", "Stranded_Forward", "Stranded_Reverse"))
    # For merging, we'll use GeneID and Unstranded counts as an example.
    star_counts <- star_counts %>% select(GeneID, Unstranded)
    
    message(paste("Reading gene annotations from:", gene_annotation_file))
    # Read gene annotations from GTF file.
    # This is a simplified example; real annotation merging might involve more complex parsing or pre-processed annotation objects.
    gtf <- import(gene_annotation_file)
    genes_df <- as.data.frame(gtf) %>% 
      filter(type == "gene") %>% # Filter for gene entries
      select(gene_id, gene_name, seqnames, start, end, strand) %>% # Select relevant columns
      distinct(gene_id, .keep_all = TRUE) # Ensure unique gene_id entries
    
    # Rename gene_id to GeneID to match the column name in STAR counts for merging.
    genes_df <- genes_df %>% rename(GeneID = gene_id)
    
    message("Merging STAR counts with gene annotations...")
    merged_data <- left_join(star_counts, genes_df, by = "GeneID")
    
    message(paste("Writing merged data to:", output_file))
    fwrite(merged_data, file = output_file, sep = "\t")
    
    message("Merging complete.")
    EOF
    
    # Execute the bespoke R script to merge STAR counts with gene annotations
    Rscript merge_star_counts.R "${STAR_COUNTS_FILE}" "${GENE_ANNOTATION_FILE}" "${MERGED_OUTPUT_FILE}"

Tools Used

Raw Source Text
Reads were mapped to GRCh38 using STAR (v2.4.2a) and Gencode v25 gene annotations
STAR --genomeDir /STARgenomes/hg38_Gencode25 --outFileNamePrefix $sampleName"_" --readFilesIn /path/to/read1 /path/to/read2 --readFilesCommand zcat --runThreadN 8 --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --sjdbGTFfile /path/to/gencode.v25.annotation.gtf
Read counts from STAR were merged along with gene annotations using bespoke R script
Genome_build: GRCh38
Supplementary_files_format_and_content: Tab-delimited file containing read counts per gene
← Back to Analysis