GSE126263 Processing Pipeline
Publication
Characterization of an RNA binding protein interactome reveals a context-specific post-transcriptional landscape of MYC-amplified medulloblastoma.Nature communications (2022) — PMID 36473869
Dataset
GSE126263Musashi-1 is a master regulator of aberrant translation in Group 3 medulloblastoma
Processing Steps
Generate Jupyter Notebook-
1
Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1).
$ Bash example
# Assuming eclipdemux is installed and in your PATH, or you have the script available. # For example, if using the script directly: # git clone https://github.com/yeolab/eclip.git # cd eclip/scripts # python eclipdemux.py --input_fastq input.fastq.gz --output_fastq output_demux.fastq.gz --barcode_length 6 --randomer_length 10 # Placeholder for input and output files INPUT_FASTQ="input.fastq.gz" OUTPUT_FASTQ="output_demux.fastq.gz" # Inferring common parameters for barcode and randomer lengths # (These values are examples; actual lengths would depend on library preparation) BARCODE_LENGTH=6 RANDOMER_LENGTH=10 eclipdemux --input_fastq ${INPUT_FASTQ} --output_fastq ${OUTPUT_FASTQ} --barcode_length ${BARCODE_LENGTH} --randomer_length ${RANDOMER_LENGTH} -
2
Args: --length 10
$ Bash example
# Install fastp if not already installed # conda install -c bioconda fastp # Example usage: Filter reads to a minimum length of 10 # Replace input.fastq.gz and output.fastq.gz with actual file names fastp --in1 input.fastq.gz --out1 output.fastq.gz --length_required 10
-
3
Reads were then trimmed with cutadapt (1.9.1).
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt=1.9.1 # Define input and output file names (replace with actual file paths) INPUT_R1="input_R1.fastq.gz" INPUT_R2="input_R2.fastq.gz" OUTPUT_R1="trimmed_R1.fastq.gz" OUTPUT_R2="trimmed_R2.fastq.gz" LOG_FILE="cutadapt_trimming.log" # Define adapter sequences (replace with actual adapters if known) # For eCLIP, common adapters might be specific to the library preparation. # Example Illumina universal adapters (replace with your specific adapters): # ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # If adapters are not explicitly known, cutadapt can sometimes auto-detect or you might use a generic sequence. # Using placeholders as no specific adapters were provided in the description. ADAPTER_R1="YOUR_ADAPTER_SEQUENCE_R1" ADAPTER_R2="YOUR_ADAPTER_SEQUENCE_R2" # Define quality cutoff and minimum length (common values, adjust as needed) QUALITY_CUTOFF=20 # Phred quality score cutoff MIN_LENGTH=20 # Minimum read length after trimming # Execute cutadapt for paired-end trimming cutadapt -a "${ADAPTER_R1}" \ -A "${ADAPTER_R2}" \ -q "${QUALITY_CUTOFF}" \ -m "${MIN_LENGTH}" \ -o "${OUTPUT_R1}" \ -p "${OUTPUT_R2}" \ "${INPUT_R1}" "${INPUT_R2}" \ &> "${LOG_FILE}" echo "Trimming complete. Output files: ${OUTPUT_R1}, ${OUTPUT_R2}" echo "Log file: ${LOG_FILE}" -
4
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
$ Bash example
# Clone the eCLIP pipeline repository (if not already installed) # git clone https://github.com/yeolab/eclip.git # cd eclip # git checkout 0.1.5 # Or the specific version # export PATH=$(pwd)/bin:$PATH # Add eCLIP scripts to PATH # Example input and output files INPUT_FASTQ="input.fastq.gz" OUTPUT_FASTQ="output.fastq.gz" # Adapter sequences (these are typically generated or provided by the pipeline # and contain the actual adapter sequences to be trimmed. The description # implies these specific fasta files were prepared by a previous step # within the eCLIP pipeline). # For demonstration, assuming they exist in the current directory or a specified path. G_ADAPTERS="g_adapters.fasta" A_ADAPTERS="A_adapters.fasta" a_ADAPTERS="a_adapters.fasta" # Execute parsebarcodes.sh with the specified arguments parsebarcodes.sh "${INPUT_FASTQ}" "${OUTPUT_FASTQ}" \ --match-read-wildcards \ -O 1 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -g "${G_ADAPTERS}" \ -A "${A_ADAPTERS}" \ -a "${a_ADAPTERS}" -
5
Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events.
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt=1.9.1 # Define input and output file names INPUT_FASTQ="input_reads.fastq.gz" OUTPUT_FASTQ="trimmed_reads_double_ligation.fastq.gz" # Define the adapter sequence associated with double-ligation events. # This sequence needs to be determined based on the specific library preparation protocol. # For eCLIP, this might be a specific internal adapter or the 3' adapter itself appearing internally. # Placeholder: Replace 'ADAPTER_SEQUENCE_FOR_DOUBLE_LIGATION' with the actual sequence. # A common Illumina 3' adapter is 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA'. # If a specific internal adapter is used in the eCLIP protocol for double-ligation events, # that sequence should be used here. ADAPTER_SEQUENCE_FOR_DOUBLE_LIGATION="SPECIFIC_DOUBLE_LIGATION_ADAPTER_SEQUENCE" # Execute cutadapt to remove the double-ligation adapter sequence # -b ADAPTER_SEQUENCE: Trims the adapter from anywhere in the read (5', 3', or internal). # -o: Specifies the output file. cutadapt -b "${ADAPTER_SEQUENCE_FOR_DOUBLE_LIGATION}" \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
6
Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
$ Bash example
# Install cutadapt # conda install -c bioconda cutadapt # Placeholder for input and output files # Replace 'input.fastq.gz' with your actual input FASTQ file # Replace 'trimmed.fastq.gz' with your desired output FASTQ file # A_adapters.fasta should be generated from parsebarcodes.sh as mentioned in the description. # Example: parsebarcodes.sh <input_fastq> > A_adapters.fasta cutadapt \ --match-read-wildcards \ -O 5 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -A A_adapters.fasta \ -o trimmed.fastq.gz \ input.fastq.gz -
7
Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05).
$ Bash example
# Install STAR if not already installed # conda install -c bioconda star # Define variables TRIMMED_READS="trimmed_reads.fastq.gz" # Placeholder for trimmed reads input REPBASE_INDEX="/path/to/RepBase18.05_STAR_index" # Placeholder for STAR index built from RepBase 18.05 OUTPUT_PREFIX="mapped_to_repbase" # Prefix for output files NUM_THREADS=8 # Placeholder for number of threads # Run STAR alignment STAR --genomeDir "${REPBASE_INDEX}" \ --readFilesIn "${TRIMMED_READS}" \ --runThreadN "${NUM_THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate -
8
Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables # Placeholder: Replace with the actual path to your STAR genome index directory for human_repbase GENOME_DIR="human_repbase" READ1="path/to/read1" # Placeholder: Replace with the actual path to your R1 FASTQ file READ2="path/to/read2" # Placeholder: Replace with the actual path to your R2 FASTQ file OUT_PREFIX="out_prefix" # Desired prefix for output files # Run STAR alignment STAR \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --outFileNamePrefix "${OUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
9
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19).
$ Bash example
# Install STAR (version 2.4.0i) # conda create -n star_env star=2.4.0i -c bioconda -c conda-forge # conda activate star_env # Placeholder for STAR genome index directory # This directory should contain the STAR index files generated for hg19. # Example command to build index (run once): # STAR --runMode genomeGenerate \ # --genomeDir /path/to/hg19_STAR_index \ # --genomeFastaFiles /path/to/hg19.fa \ # --sjdbGTFfile /path/to/hg19.gtf \ # --sjdbOverhang 100 \ # --runThreadN 8 # Adjust threads as needed GENOME_DIR="/path/to/hg19_STAR_index" # Replace with actual path to hg19 STAR index INPUT_READS="unmapped_filtered_reads.fastq.gz" # Placeholder for input FASTQ file (e.g., from previous filtering step) OUTPUT_PREFIX="mapped_hg19" # Prefix for output files STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${INPUT_READS}" \ --runThreadN 8 \ --outFileNamePrefix "${OUTPUT_PREFIX}_" \ --outSAMtype BAM SortedByCoordinate -
10
Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Placeholder for STAR genome index generation (if not already done) # This step needs to be performed once for a given genome and annotation. # Example for human GRCh38: # STAR --runThreadN 16 \ # --runMode genomeGenerate \ # --genomeDir /path/to/STAR_genome_index_GRCh38 \ # --genomeFastaFiles /path/to/GRCh38.fa \ # --sjdbGTFfile /path/to/GRCh38.gtf # Align reads using STAR STAR \ --runThreadN 16 \ --genomeDir /path/to/STAR_genome_index_GRCh38 \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
11
Aligned reads were sorted with samtools (1.4.1)
$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools=1.4.1 # Sort aligned reads # Assuming 'aligned_reads.bam' is the input BAM file # And 'sorted_reads.bam' will be the sorted output BAM file samtools sort -o sorted_reads.bam aligned_reads.bam
-
12
Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines.
$ Bash example
# Install eclip (assuming Python 3 is available) # pip install eclip # Or, if using conda: # conda create -n eclip_env python=3.8 # conda activate eclip_env # pip install eclip # Define input and output files (placeholders) INPUT_R1="sorted_reads_R1.fastq.gz" INPUT_R2="sorted_reads_R2.fastq.gz" OUTPUT_R1="collapsed_reads_R1.fastq.gz" OUTPUT_R2="collapsed_reads_R2.fastq.gz" BARCODE_LENGTH=10 # Common barcode length in eCLIP protocols (e.g., 10bp) MIN_READS=2 # Minimum number of reads to collapse (default in some protocols) # Execute barcodecollapsepe.py barcodecollapsepe.py \ --input "${INPUT_R1}" \ --input2 "${INPUT_R2}" \ --output "${OUTPUT_R1}" \ --output2 "${OUTPUT_R2}" \ --barcode-length "${BARCODE_LENGTH}" \ --min-reads "${MIN_READS}" -
13
Args: -o preRmDup.bam -m metrics_file -b rmDup.bam
picard MarkDuplicates (Inferred with models/gemini-2.5-flash) v2.27.4 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install Picard (e.g., via Conda): # conda install -c bioconda picard # Execute picard MarkDuplicates # The arguments -b, -o, -m are mapped to Picard's I=, O=, M= respectively. picard MarkDuplicates I=rmDup.bam O=preRmDup.bam M=metrics_file
-
14
PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.4.1 # Merge PCR de-duped BAM files samtools merge merged.bam input1_deduped.bam input2_deduped.bam inputN_deduped.bam
-
15
Merged alignments were split to keep just read2 using samtools (1.4.1) view.
$ Bash example
# Define input and output file paths INPUT_BAM="merged_alignments.bam" # Placeholder for your merged alignment BAM file OUTPUT_READ2_BAM="read2_alignments.bam" # Placeholder for the output BAM containing only read2 # Split merged alignments to keep just read2 using samtools view # -f 128: Selects alignments where the 0x80 flag (read is second in pair) is set. # -b: Output in BAM format. samtools view -f 128 -b "${INPUT_BAM}" > "${OUTPUT_READ2_BAM}" -
16
Args: -h -b -f 128
Unknown Tool (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# The tool could not be inferred from the provided description: "Args: -h -b -f 128". # The arguments -h, -b, -f 128 are generic and do not point to a specific bioinformatics tool. # Please provide more context or a tool name to generate a specific command. # No specific installation instructions can be provided as the tool is unknown. # Placeholder command reflecting the provided arguments: unknown_tool -h -b -f 128
-
17
Read2 BAM files were used to identify peak clusters with Clipper (1.2.2).
$ Bash example
# Install CLIPper and its dependencies (e.g., using conda) # conda create -n clipper_env python=3.8 # conda activate clipper_env # pip install pysam numpy scipy # Clone the CLIPper repository if not already available # git clone https://github.com/yeolab/clipper.git # cd clipper # Run CLIPper to identify peak clusters # Replace 'input_read2.bam' with your actual Read2 BAM file. # Replace 'hg38' with the appropriate genome assembly if different. # The output files will be prefixed with 'clipper_peaks'. python clipper.py \ -g hg38 \ -o clipper_peaks \ input_read2.bam -
18
Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
$ Bash example
# This tool is inferred to be a custom Python script or a specialized gene-aware BAM processor, likely part of an eCLIP pipeline. # The specific tool name and installation steps are not explicitly provided in the description. # Installation instructions would depend on the specific script and its dependencies. # Example (placeholder): # git clone https://github.com/some/repo.git # cd repo # pip install -r requirements.txt # Reference genome: hg19 (Human Genome build 19) # Source: UCSC Genome Browser or NCBI # Execute the inferred tool gene_aware_bam_processor --species hg19 \ --bam path/to/input.bam \ --timeout 3600000 \ --maxgenes 1000000 \ --save-pickle \ --outfile path/to/output.bam -
19
Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
$ Bash example
# Install eCLIP scripts (if not already installed) # git clone https://github.com/yeolab/eclip.git # export PATH=$PATH:$(pwd)/eclip/bin # Define input files and output prefix PEAK_FILE="peak_clusters.bed" # Input peak clusters file (e.g., from a previous peak calling step) IP_BAM="ip.bam" # Original IP BAM file (paired-end) INPUT_BAM="input.bam" # Original INPUT BAM file (paired-end) OUTPUT_PREFIX="normalized_peaks" # Prefix for output files # Filter IP and INPUT BAM files to retain only read2 (second in pair) # The -f 0x80 flag for samtools view selects the second segment in a pair. # The -b flag outputs in BAM format. samtools view -b -f 0x80 "${IP_BAM}" > "${IP_BAM%.bam}_read2.bam" samtools view -b -f 0x80 "${INPUT_BAM}" > "${INPUT_BAM%.bam}_read2.bam" IP_READ2_BAM="${IP_BAM%.bam}_read2.bam" INPUT_READ2_BAM="${INPUT_BAM%.bam}_read2.bam" # Run peaksnormalize.pl for normalization # Usage: peaksnormalize.pl <peakfile> <IP_BAM> <INPUT_BAM> <output_prefix> # This script will output files such as <output_prefix>.normalized.bed, <output_prefix>.log, etc. peaksnormalize.pl "${PEAK_FILE}" "${IP_READ2_BAM}" "${INPUT_READ2_BAM}" "${OUTPUT_PREFIX}" # Optional: Clean up intermediate read2 BAM files # rm "${IP_READ2_BAM}" "${INPUT_READ2_BAM}" -
20
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
$ Bash example
# Install eCLIP pipeline (version 0.1.5 or compatible) # This script is typically part of the eCLIP pipeline distribution. # You might need to clone the repository and add its scripts directory to your PATH, # or call the script with its full path. # Example: # git clone https://github.com/yeolab/eclip.git # cd eclip # # If a specific tag for v0.1.5 exists, use it: # # git checkout v0.1.5 # # Otherwise, use a commit close to that version or the main branch. # export PATH=$PATH:$(pwd)/scripts # Assuming the script is in a 'scripts' directory within the cloned repo # Define input and output files # These are placeholder names for normalized peak regions in BED format. INPUT_PEAKS_REP1="replicate1_normalized_peaks.bed" INPUT_PEAKS_REP2="replicate2_normalized_peaks.bed" OUTPUT_MERGED_PEAKS="merged_replicate_overlapping_peaks.bed" # Execute the merging script # The script 'compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl' # is designed to merge overlapping normalized peak regions from replicates. # It likely takes multiple BED files as input and outputs a single merged BED file. # The 'l2foldenrpeakfi' in the name suggests it processes L2 fold enrichment information. perl /path/to/eclip/scripts/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \ "${INPUT_PEAKS_REP1}" \ "${INPUT_PEAKS_REP2}" \ > "${OUTPUT_MERGED_PEAKS}" -
21
Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
$ Bash example
# Install IDR (if not already installed) # conda install -c bioconda idr # Example input files (replace with actual paths to normalized peak files, # which have likely been processed by make_informationcontent_from_peaks.pl # to include entropy scores, potentially in the signal.value column). REP1_PEAKS="replicate1_normalized_peaks.narrowPeak" REP2_PEAKS="replicate2_normalized_peaks.narrowPeak" OUTPUT_PREFIX="idr_reproducible_peaks" # Run IDR to determine reproducible peaks idr \ --samples "${REP1_PEAKS}" "${REP2_PEAKS}" \ --output-file "${OUTPUT_PREFIX}.idr" \ --rank signal.value \ --plot \ --log-output-file "${OUTPUT_PREFIX}.log" \ --soft-idr-threshold 0.05 -
22
Reproducible peaks were annotated by overlapping peak regions with Gencode hg19 annotations
$ Bash example
# Install bedtools (if not already installed) # conda install -c bioconda bedtools=2.29.2 # Download Gencode hg19 annotation GTF wget -nc http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz gunzip -f gencode.v19.annotation.gtf.gz # Convert Gencode GTF to a BED file containing gene regions # This extracts chromosome, 0-based start, 1-based end, gene_id, strand, and a placeholder score (e.g., 0) # The gene_id is extracted from the attributes column. awk -v OFS='\t' '$3 == "gene" { # Extract gene_id from the 9th column (attributes) gene_id = ""; if (match($9, /gene_id "([^"]+)"/)) { gene_id = substr($9, RSTART + 9, RLENGTH - 10); } print $1, $4-1, $5, gene_id, 0, $7 }' gencode.v19.annotation.gtf > gencode.v19.genes.bed # Placeholder for reproducible peaks file # Replace 'reproducible_peaks.bed' with the actual path to your peak file # Example: reproducible_peaks.bed could be the output of an IDR step. # For eCLIP, this might be the merged peaks from yeolab/merge_peaks. # Example: cp /path/to/your/reproducible_peaks.bed . # Annotate reproducible peaks by overlapping with Gencode hg19 gene annotations # -a: reproducible peaks file (e.g., reproducible_peaks.bed) # -b: Gencode hg19 gene annotations BED file (gencode.v19.genes.bed) # -wa: write the original entry in A for each overlap # -wb: write the original entry in B for each overlap # The output will contain the peak region followed by the overlapping gene region. bedtools intersect -a reproducible_peaks.bed -b gencode.v19.genes.bed -wa -wb > annotated_peaks.bed -
23
Region-based fold enriched was calculated by mapped reads fold change between IP and Input.
$ Bash example
# Install deeptools if not already installed # conda install -c bioconda deeptools=3.5.1 # Calculate genome-wide log2 fold enrichment between IP and Input bigWig files. # This output bigWig file represents the 'region-based fold enriched' signal across the genome. # It can then be used for further region-specific analysis (e.g., averaging over called peaks). # Assuming 'ip.bigWig' and 'input.bigWig' are already normalized bigWig files (e.g., RPKM, CPM). # The --pseudocount 1 is added to avoid issues with zero values when calculating log2. # The bigWig files are assumed to be based on a common reference genome assembly, e.g., hg38. deep_tools bigwigCompare \ -b1 ip.bigWig \ -b2 input.bigWig \ --operation log2 \ --pseudocount 1 \ -o fold_enrichment.bigWig \ --numberOfProcessors auto \ --verbose -
24
Each read was assigned to a single region with the following descending priority order: CDS, 5âUTR, 3âUTR, Intron.
$ Bash example
# Define input and output files INPUT_BAM="aligned_reads.bam" # Placeholder: Replace with your actual sorted and indexed BAM file OUTPUT_BAM_TAGGED="assigned_reads_tagged.bam" OUTPUT_SUMMARY="read_assignment_summary.tsv" # Reference files (placeholders - replace with actual paths to your reference files) # For hg38 and GENCODE v44: # GENOME_FASTA="path/to/GRCh38.primary_assembly.genome.fa" # GTF_FILE="path/to/gencode.v44.annotation.gtf" # --- Installation (commented out) --- # # Install necessary tools if not already available. # # This script requires Python 3, pysam, pybedtools, bedtools, and samtools. # # Example using conda: # # conda create -n read_assign python=3.9 pysam pybedtools bedtools samtools -c bioconda -y # # conda activate read_assign # --- Step 1: Generate feature BED files from GTF --- # This step extracts coordinates for CDS, 5'UTR, 3'UTR, and Introns from the GTF. # This is a conceptual representation. In a real pipeline, you would parse # your GTF file (e.g., gencode.v44.annotation.gtf) to generate these BED files. # A robust implementation would handle overlapping features within the same category # and accurately derive intron coordinates (e.g., from gene regions minus exons). # For the purpose of this example, we create dummy BED files. # Replace this section with your actual GTF parsing logic or pre-generated BED files. echo -e "chr1\t1000\t2000\tgeneA_CDS\t0\t+" > cds.bed echo -e "chr1\t500\t999\tgeneA_CDS_exon2\t0\t+" >> cds.bed # Example of multiple CDS regions echo -e "chr1\t200\t499\tgeneA_5UTR\t0\t+" > utr5.bed echo -e "chr1\t2001\t2500\tgeneA_3UTR\t0\t+" > utr3.bed echo -e "chr1\t2501\t3000\tgeneA_Intron1\t0\t+" > intron.bed echo -e "chr1\t3001\t4000\tgeneA_Intron2\t0\t+" >> intron.bed # Sort and merge BED files to ensure proper intersection behavior and remove overlaps within categories. # This is crucial for accurate feature assignment. sort -k1,1 -k2,2n cds.bed | bedtools merge -i - > cds.merged.bed sort -k1,1 -k2,2n utr5.bed | bedtools merge -i - > utr5.merged.bed sort -k1,1 -k2,2n utr3.bed | bedtools merge -i - > utr3.merged.bed sort -k1,1 -k2,2n intron.bed | bedtools merge -i - > intron.merged.bed # --- Step 2: Prioritized Read Assignment Script --- # This Python script assigns each read to a single region based on the defined priority. # It uses pysam to read/write BAM files and pybedtools for efficient genomic interval intersection. # Reads will be tagged with an 'RA' (Read Assignment) tag indicating their assigned feature type. # Create the Python script inline for demonstration purposes. cat << 'EOF' > assign_reads_priority.py import pysam import pybedtools import sys import os def main(): if len(sys.argv) != 8: print("Usage: python assign_reads_priority.py <input_bam> <output_bam> <output_summary> <cds_bed> <utr5_bed> <utr3_bed> <intron_bed>") sys.exit(1) input_bam = sys.argv[1] output_bam = sys.argv[2] output_summary = sys.argv[3] cds_bed_path = sys.argv[4] utr5_bed_path = sys.argv[5] utr3_bed_path = sys.argv[6] intron_bed_path = sys.argv[7] # Load feature BED files # Ensure BED files are sorted and merged for optimal performance cds_features = pybedtools.BedTool(cds_bed_path) utr5_features = pybedtools.BedTool(utr5_bed_path) utr3_features = pybedtools.BedTool(utr3_bed_path) intron_features = pybedtools.BedTool(intron_bed_path) # Open input and output BAM files samfile = pysam.AlignmentFile(input_bam, "rb") outfile = pysam.AlignmentFile(output_bam, "wb", template=samfile) assigned_counts = {'CDS': 0, '5UTR': 0, '3UTR': 0, 'Intron': 0, 'Unassigned': 0} total_reads = 0 for read in samfile.fetch(): total_reads += 1 assigned = False # Create a pybedtools Interval for the read # Ensure coordinates are 0-based for pybedtools read_interval = pybedtools.create_interval_from_list([read.reference_name, str(read.reference_start), str(read.reference_end), read.query_name, '0', '-' if read.is_reverse else '+']) # Priority 1: CDS if cds_features.intersect(read_interval, wa=True, u=True): read.set_tag('RA', 'CDS', value_type='Z') # Read Assignment tag assigned_counts['CDS'] += 1 assigned = True # Priority 2: 5'UTR (if not assigned yet) elif utr5_features.intersect(read_interval, wa=True, u=True): read.set_tag('RA', '5UTR', value_type='Z') assigned_counts['5UTR'] += 1 assigned = True # Priority 3: 3'UTR (if not assigned yet) elif utr3_features.intersect(read_interval, wa=True, u=True): read.set_tag('RA', '3UTR', value_type='Z') assigned_counts['3UTR'] += 1 assigned = True # Priority 4: Intron (if not assigned yet) elif intron_features.intersect(read_interval, wa=True, u=True): read.set_tag('RA', 'Intron', value_type='Z') assigned_counts['Intron'] += 1 assigned = True else: read.set_tag('RA', 'Unassigned', value_type='Z') assigned_counts['Unassigned'] += 1 outfile.write(read) samfile.close() outfile.close() # Write summary with open(output_summary, 'w') as f: f.write(f'Total_Reads\t{total_reads}\n') for category, count in assigned_counts.items(): f.write(f'{category}\t{count}\n') # Index the output BAM pysam.index(output_bam) print("Read assignment complete. Output BAM tagged and indexed.") if __name__ == "__main__": main() EOF # Execute the Python script python assign_reads_priority.py \ "${INPUT_BAM}" \ "${OUTPUT_BAM_TAGGED}" \ "${OUTPUT_SUMMARY}" \ cds.merged.bed \ utr5.merged.bed \ utr3.merged.bed \ intron.merged.bed # Clean up temporary files (optional) rm cds.bed utr5.bed utr3.bed intron.bed \ cds.merged.bed utr5.merged.bed utr3.merged.bed intron.merged.bed rm assign_reads_priority.py -
25
For each gene, reads were summed up across each region to calculate final region counts, and fold change.
featureCounts (Inferred with models/gemini-2.5-flash) v2.0.3$ Bash example
# Install Subread (which includes featureCounts) if not already installed # conda install -c bioconda subread # Define input and output paths BAM_FILE="aligned_reads.bam" # Replace with your actual aligned BAM file GTF_FILE="Homo_sapiens.GRCh38.109.gtf" # Replace with your actual GTF annotation file (e.g., from Ensembl or Gencode) OUTPUT_COUNTS_FILE="gene_region_counts.txt" # Sum reads across gene regions to calculate final region counts # -a: annotation file (GTF/GFF format) # -o: output file for counts # -t exon: specify feature type to count (e.g., 'exon' for gene-level counts) # -g gene_id: specify attribute type to aggregate counts by (e.g., 'gene_id') # -p: specify if reads are paired-end (remove if single-end) # -s 0: strand-specific (0=unstranded, 1=stranded, 2=reverse stranded) - adjust based on library prep featureCounts -a "${GTF_FILE}" \ -o "${OUTPUT_COUNTS_FILE}" \ -t "exon" \ -g "gene_id" \ -p \ -s 0 \ "${BAM_FILE}" # Note: Calculating fold change is typically a subsequent step, often performed # using statistical packages like DESeq2 or edgeR in R, which take the # count matrix (e.g., gene_region_counts.txt) as input along with sample metadata. # This part is not a direct bash command but a conceptual follow-up analysis. # Example conceptual R code (not bash): # library(DESeq2) # counts_data <- read.table("gene_region_counts.txt", header=TRUE, row.names=1, skip=1) # Adjust skip based on featureCounts header # # ... further DESeq2 steps to create a DESeqDataSet and calculate fold change ...
Raw Source Text
Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1). Args: --length 10 Reads were then trimmed with cutadapt (1.9.1). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline) Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline) Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19). Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Aligned reads were sorted with samtools (1.4.1) Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines. Args: -o preRmDup.bam -m metrics_file -b rmDup.bam PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam) Merged alignments were split to keep just read2 using samtools (1.4.1) view. Args: -h -b -f 128 Read2 BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+. Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+ Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks. Reproducible peaks were annotated by overlapping peak regions with Gencode hg19 annotations Region-based fold enriched was calculated by mapped reads fold change between IP and Input. Each read was assigned to a single region with the following descending priority order: CDS, 5âUTR, 3âUTR, Intron. For each gene, reads were summed up across each region to calculate final region counts, and fold change. Genome_build: hg19 Supplementary_files_format_and_content: tab-delimited text files include -log10pValues and Log2FoldChange values for each IDR peaks called after cutoff of 3,3 in each parameter