GSE126337 Processing Pipeline
RNA-Seq
code_examples
3 steps
Publication
Characterization of an RNA binding protein interactome reveals a context-specific post-transcriptional landscape of MYC-amplified medulloblastoma.Nature communications (2022) — PMID 36473869
Dataset
GSE126337Musashi-1 is a master regulator of aberrant translation in Group 3 medulloblastoma
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Reads were mapped to GRCh38 using STAR (v2.4.2a) and Gencode v25 gene annotations
$ Bash example
# Install STAR if not already installed # conda install -c bioconda star=2.4.2a # --- Reference Data Setup (Example paths, adjust as needed) --- # Define directories # REF_DIR="/path/to/references" # STAR_INDEX_DIR="${REF_DIR}/star_index/GRCh38_Gencode_v25" # mkdir -p ${REF_DIR} ${STAR_INDEX_DIR} # # Download GRCh38 primary assembly (e.g., from Ensembl or NCBI) # # wget -P ${REF_DIR} ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # # gunzip ${REF_DIR}/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # # GENOME_FASTA="${REF_DIR}/Homo_sapiens.GRCh38.dna.primary_assembly.fa" # # Download Gencode v25 GTF annotation (e.g., from Gencode FTP) # # wget -P ${REF_DIR} ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz # # gunzip ${REF_DIR}/gencode.v25.annotation.gtf.gz # # GTF_FILE="${REF_DIR}/gencode.v25.annotation.gtf" # # --- Build STAR Index (if not already built) --- # # This step uses GRCh38 and Gencode v25 annotations to build the splice-aware index. # # STAR --runMode genomeGenerate \ # # --genomeDir ${STAR_INDEX_DIR} \ # # --genomeFastaFiles ${GENOME_FASTA} \ # # --sjdbGTFfile ${GTF_FILE} \ # # --sjdbOverhang 100 \ # # --runThreadN 8 # Adjust threads as needed # --- STAR Alignment (Execution Command) --- # Define input/output paths and parameters STAR_INDEX_DIR="/path/to/star_index/GRCh38_Gencode_v25" # Path to the pre-built STAR index READS_R1="input_reads_R1.fastq.gz" # Replace with your actual R1 FASTQ file READS_R2="input_reads_R2.fastq.gz" # Replace with your actual R2 FASTQ file (remove if single-end) OUTPUT_PREFIX="mapped_reads" # Prefix for output files NUM_THREADS=8 # Number of threads to use for alignment # Run STAR alignment STAR --genomeDir ${STAR_INDEX_DIR} \ --readFilesIn ${READS_R1} ${READS_R2} \ --readFilesCommand zcat \ --runThreadN ${NUM_THREADS} \ --outFileNamePrefix ${OUTPUT_PREFIX}. \ --outSAMtype BAM SortedByCoordinate -
2
STAR --genomeDir /STARgenomes/hg38_Gencode25 --outFileNamePrefix $sampleName"_" --readFilesIn /path/to/read1 /path/to/read2 --readFilesCommand zcat --runThreadN 8 --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --sjdbGTFfile /path/to/gencode.v25.annotation.gtf
$ Bash example
# Install STAR (e.g., using conda) # conda install -c bioconda star # Define variables (example) # Replace with actual sample name, read paths, and reference paths SAMPLE_NAME="my_sample" READ1="/path/to/my_sample_R1.fastq.gz" READ2="/path/to/my_sample_R2.fastq.gz" # Reference datasets: # Genome index directory built with hg38 and Gencode v25 GENOME_DIR="/STARgenomes/hg38_Gencode25" # Gencode v25 annotation GTF file GTF_FILE="/path/to/gencode.v25.annotation.gtf" STAR --genomeDir "${GENOME_DIR}" \ --outFileNamePrefix "${SAMPLE_NAME}_" \ --readFilesIn "${READ1}" "${READ2}" \ --readFilesCommand zcat \ --runThreadN 8 \ --outSAMstrandField intronMotif \ --outSAMtype BAM SortedByCoordinate \ --quantMode GeneCounts \ --sjdbGTFfile "${GTF_FILE}" -
3
Read counts from STAR were merged along with gene annotations using bespoke R script
$ Bash example
# Install R and necessary packages if not already installed # conda install -c conda-forge r-base r-data.table r-dplyr r-rtracklayer # Placeholder for STAR output file containing read counts per gene. # In a real pipeline, this would be the output from a previous STAR alignment step (e.g., ReadsPerGene.out.tab). STAR_COUNTS_FILE="star_output/ReadsPerGene.out.tab" # Placeholder for gene annotation file (e.g., GTF). # Using a common human assembly and annotation release as a placeholder. GENOME_ASSEMBLY="GRCh38" ENSEMBL_RELEASE="109" # Or a relevant Ensembl release GTF_FILE="Homo_sapiens.${GENOME_ASSEMBLY}.${ENSEMBL_RELEASE}.gtf" # Assume the GTF file is already available at a specified path. # In a real scenario, you might download it first: # GTF_URL="http://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/gtf/homo_sapiens/${GTF_FILE}.gz" # mkdir -p annotations # if [ ! -f "annotations/${GTF_FILE}" ]; then # echo "Downloading ${GTF_FILE}..." # wget -O "annotations/${GTF_FILE}.gz" "${GTF_URL}" # gunzip -f "annotations/${GTF_FILE}.gz" # fi GENE_ANNOTATION_FILE="/path/to/annotations/${GTF_FILE}" # Output file for the merged data MERGED_OUTPUT_FILE="merged_star_gene_counts.tsv" # Create a dummy R script for demonstration purposes. # This represents the "bespoke R script" mentioned in the description. cat << 'EOF' > merge_star_counts.R #!/usr/bin/env Rscript args <- commandArgs(trailingOnly = TRUE) if (length(args) != 3) { stop("Usage: Rscript merge_star_counts.R <star_counts_file> <gene_annotation_file> <output_file>", call. = FALSE) } star_counts_file <- args[1] gene_annotation_file <- args[2] output_file <- args[3] # Load necessary libraries (install if not present, e.g., install.packages(c("data.table", "dplyr", "rtracklayer"))) library(data.table) library(dplyr) library(rtracklayer) # For GTF parsing message(paste("Reading STAR counts from:", star_counts_file)) # STAR's ReadsPerGene.out.tab typically has 4 columns: GeneID, unstranded, stranded_forward, stranded_reverse. # The first 4 lines are usually header/summary information and should be skipped. star_counts <- fread(star_counts_file, skip = 4, header = FALSE, col.names = c("GeneID", "Unstranded", "Stranded_Forward", "Stranded_Reverse")) # For merging, we'll use GeneID and Unstranded counts as an example. star_counts <- star_counts %>% select(GeneID, Unstranded) message(paste("Reading gene annotations from:", gene_annotation_file)) # Read gene annotations from GTF file. # This is a simplified example; real annotation merging might involve more complex parsing or pre-processed annotation objects. gtf <- import(gene_annotation_file) genes_df <- as.data.frame(gtf) %>% filter(type == "gene") %>% # Filter for gene entries select(gene_id, gene_name, seqnames, start, end, strand) %>% # Select relevant columns distinct(gene_id, .keep_all = TRUE) # Ensure unique gene_id entries # Rename gene_id to GeneID to match the column name in STAR counts for merging. genes_df <- genes_df %>% rename(GeneID = gene_id) message("Merging STAR counts with gene annotations...") merged_data <- left_join(star_counts, genes_df, by = "GeneID") message(paste("Writing merged data to:", output_file)) fwrite(merged_data, file = output_file, sep = "\t") message("Merging complete.") EOF # Execute the bespoke R script to merge STAR counts with gene annotations Rscript merge_star_counts.R "${STAR_COUNTS_FILE}" "${GENE_ANNOTATION_FILE}" "${MERGED_OUTPUT_FILE}"
Tools Used
Raw Source Text
Reads were mapped to GRCh38 using STAR (v2.4.2a) and Gencode v25 gene annotations STAR --genomeDir /STARgenomes/hg38_Gencode25 --outFileNamePrefix $sampleName"_" --readFilesIn /path/to/read1 /path/to/read2 --readFilesCommand zcat --runThreadN 8 --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --sjdbGTFfile /path/to/gencode.v25.annotation.gtf Read counts from STAR were merged along with gene annotations using bespoke R script Genome_build: GRCh38 Supplementary_files_format_and_content: Tab-delimited file containing read counts per gene