GSE126263 Processing Pipeline

RIP-Seq code_examples 25 steps

Publication

Characterization of an RNA binding protein interactome reveals a context-specific post-transcriptional landscape of MYC-amplified medulloblastoma.

Nature communications (2022) — PMID 36473869

Dataset

GSE126263

Musashi-1 is a master regulator of aberrant translation in Group 3 medulloblastoma

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1).

    eclipdemux v0.0.1 GitHub
    $ Bash example
    # Assuming eclipdemux is installed and in your PATH, or you have the script available.
    # For example, if using the script directly:
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip/scripts
    # python eclipdemux.py --input_fastq input.fastq.gz --output_fastq output_demux.fastq.gz --barcode_length 6 --randomer_length 10
    
    # Placeholder for input and output files
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_FASTQ="output_demux.fastq.gz"
    
    # Inferring common parameters for barcode and randomer lengths
    # (These values are examples; actual lengths would depend on library preparation)
    BARCODE_LENGTH=6
    RANDOMER_LENGTH=10
    
    eclipdemux --input_fastq ${INPUT_FASTQ} --output_fastq ${OUTPUT_FASTQ} --barcode_length ${BARCODE_LENGTH} --randomer_length ${RANDOMER_LENGTH}
  2. 2

    Args: --length 10

    fastp (Inferred with models/gemini-2.5-flash) v0.23.2 GitHub
    $ Bash example
    # Install fastp if not already installed
    # conda install -c bioconda fastp
    
    # Example usage: Filter reads to a minimum length of 10
    # Replace input.fastq.gz and output.fastq.gz with actual file names
    fastp --in1 input.fastq.gz --out1 output.fastq.gz --length_required 10
  3. 3

    Reads were then trimmed with cutadapt (1.9.1).

    cutadapt v1.9.1 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt=1.9.1
    
    # Define input and output file names (replace with actual file paths)
    INPUT_R1="input_R1.fastq.gz"
    INPUT_R2="input_R2.fastq.gz"
    OUTPUT_R1="trimmed_R1.fastq.gz"
    OUTPUT_R2="trimmed_R2.fastq.gz"
    LOG_FILE="cutadapt_trimming.log"
    
    # Define adapter sequences (replace with actual adapters if known)
    # For eCLIP, common adapters might be specific to the library preparation.
    # Example Illumina universal adapters (replace with your specific adapters):
    # ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    # ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    # If adapters are not explicitly known, cutadapt can sometimes auto-detect or you might use a generic sequence.
    # Using placeholders as no specific adapters were provided in the description.
    ADAPTER_R1="YOUR_ADAPTER_SEQUENCE_R1"
    ADAPTER_R2="YOUR_ADAPTER_SEQUENCE_R2"
    
    # Define quality cutoff and minimum length (common values, adjust as needed)
    QUALITY_CUTOFF=20 # Phred quality score cutoff
    MIN_LENGTH=20   # Minimum read length after trimming
    
    # Execute cutadapt for paired-end trimming
    cutadapt -a "${ADAPTER_R1}" \
             -A "${ADAPTER_R2}" \
             -q "${QUALITY_CUTOFF}" \
             -m "${MIN_LENGTH}" \
             -o "${OUTPUT_R1}" \
             -p "${OUTPUT_R2}" \
             "${INPUT_R1}" "${INPUT_R2}" \
             &> "${LOG_FILE}"
    
    echo "Trimming complete. Output files: ${OUTPUT_R1}, ${OUTPUT_R2}"
    echo "Log file: ${LOG_FILE}"
  4. 4

    Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

    $ Bash example
    # Clone the eCLIP pipeline repository (if not already installed)
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    # git checkout 0.1.5 # Or the specific version
    # export PATH=$(pwd)/bin:$PATH # Add eCLIP scripts to PATH
    
    # Example input and output files
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_FASTQ="output.fastq.gz"
    
    # Adapter sequences (these are typically generated or provided by the pipeline
    # and contain the actual adapter sequences to be trimmed. The description
    # implies these specific fasta files were prepared by a previous step
    # within the eCLIP pipeline).
    # For demonstration, assuming they exist in the current directory or a specified path.
    G_ADAPTERS="g_adapters.fasta"
    A_ADAPTERS="A_adapters.fasta"
    a_ADAPTERS="a_adapters.fasta"
    
    # Execute parsebarcodes.sh with the specified arguments
    parsebarcodes.sh "${INPUT_FASTQ}" "${OUTPUT_FASTQ}" \
        --match-read-wildcards \
        -O 1 \
        --times 1 \
        -e 0.1 \
        --quality-cutoff 6 \
        -m 18 \
        -g "${G_ADAPTERS}" \
        -A "${A_ADAPTERS}" \
        -a "${a_ADAPTERS}"
  5. 5

    Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events.

    cutadapt v1.9.1 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt=1.9.1
    
    # Define input and output file names
    INPUT_FASTQ="input_reads.fastq.gz"
    OUTPUT_FASTQ="trimmed_reads_double_ligation.fastq.gz"
    
    # Define the adapter sequence associated with double-ligation events.
    # This sequence needs to be determined based on the specific library preparation protocol.
    # For eCLIP, this might be a specific internal adapter or the 3' adapter itself appearing internally.
    # Placeholder: Replace 'ADAPTER_SEQUENCE_FOR_DOUBLE_LIGATION' with the actual sequence.
    # A common Illumina 3' adapter is 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA'.
    # If a specific internal adapter is used in the eCLIP protocol for double-ligation events,
    # that sequence should be used here.
    ADAPTER_SEQUENCE_FOR_DOUBLE_LIGATION="SPECIFIC_DOUBLE_LIGATION_ADAPTER_SEQUENCE"
    
    # Execute cutadapt to remove the double-ligation adapter sequence
    # -b ADAPTER_SEQUENCE: Trims the adapter from anywhere in the read (5', 3', or internal).
    # -o: Specifies the output file.
    cutadapt -b "${ADAPTER_SEQUENCE_FOR_DOUBLE_LIGATION}" \
             -o "${OUTPUT_FASTQ}" \
             "${INPUT_FASTQ}"
  6. 6

    Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

    $ Bash example
    # Install cutadapt
    # conda install -c bioconda cutadapt
    
    # Placeholder for input and output files
    # Replace 'input.fastq.gz' with your actual input FASTQ file
    # Replace 'trimmed.fastq.gz' with your desired output FASTQ file
    
    # A_adapters.fasta should be generated from parsebarcodes.sh as mentioned in the description.
    # Example: parsebarcodes.sh <input_fastq> > A_adapters.fasta
    
    cutadapt \
        --match-read-wildcards \
        -O 5 \
        --times 1 \
        -e 0.1 \
        --quality-cutoff 6 \
        -m 18 \
        -A A_adapters.fasta \
        -o trimmed.fastq.gz \
        input.fastq.gz
  7. 7

    Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05).

    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star
    
    # Define variables
    TRIMMED_READS="trimmed_reads.fastq.gz" # Placeholder for trimmed reads input
    REPBASE_INDEX="/path/to/RepBase18.05_STAR_index" # Placeholder for STAR index built from RepBase 18.05
    OUTPUT_PREFIX="mapped_to_repbase" # Prefix for output files
    NUM_THREADS=8 # Placeholder for number of threads
    
    # Run STAR alignment
    STAR --genomeDir "${REPBASE_INDEX}" \
         --readFilesIn "${TRIMMED_READS}" \
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate
  8. 8

    Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    # Placeholder: Replace with the actual path to your STAR genome index directory for human_repbase
    GENOME_DIR="human_repbase"
    READ1="path/to/read1"      # Placeholder: Replace with the actual path to your R1 FASTQ file
    READ2="path/to/read2"      # Placeholder: Replace with the actual path to your R2 FASTQ file
    OUT_PREFIX="out_prefix"    # Desired prefix for output files
    
    # Run STAR alignment
    STAR \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${READ1}" "${READ2}" \
      --outFileNamePrefix "${OUT_PREFIX}" \
      --outReadsUnmapped Fastx \
      --outSAMtype BAM Unsorted \
      --outSAMattributes All \
      --outSAMunmapped Within \
      --outSAMattrRGline ID:foo \
      --outFilterType BySJout \
      --outFilterMultimapNmax 30 \
      --outFilterMultimapScoreRange 1 \
      --outFilterScoreMin 10 \
      --alignEndsType EndToEnd
  9. 9

    Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19).

    $ Bash example
    # Install STAR (version 2.4.0i)
    # conda create -n star_env star=2.4.0i -c bioconda -c conda-forge
    # conda activate star_env
    
    # Placeholder for STAR genome index directory
    # This directory should contain the STAR index files generated for hg19.
    # Example command to build index (run once):
    # STAR --runMode genomeGenerate \
    #      --genomeDir /path/to/hg19_STAR_index \
    #      --genomeFastaFiles /path/to/hg19.fa \
    #      --sjdbGTFfile /path/to/hg19.gtf \
    #      --sjdbOverhang 100 \
    #      --runThreadN 8 # Adjust threads as needed
    
    GENOME_DIR="/path/to/hg19_STAR_index" # Replace with actual path to hg19 STAR index
    INPUT_READS="unmapped_filtered_reads.fastq.gz" # Placeholder for input FASTQ file (e.g., from previous filtering step)
    OUTPUT_PREFIX="mapped_hg19" # Prefix for output files
    
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${INPUT_READS}" \
         --runThreadN 8 \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate
  10. 10

    Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) vNot specified GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Placeholder for STAR genome index generation (if not already done)
    # This step needs to be performed once for a given genome and annotation.
    # Example for human GRCh38:
    # STAR --runThreadN 16 \
    #      --runMode genomeGenerate \
    #      --genomeDir /path/to/STAR_genome_index_GRCh38 \
    #      --genomeFastaFiles /path/to/GRCh38.fa \
    #      --sjdbGTFfile /path/to/GRCh38.gtf
    
    # Align reads using STAR
    STAR \
        --runThreadN 16 \
        --genomeDir /path/to/STAR_genome_index_GRCh38 \
        --readFilesIn /path/to/read1 /path/to/read2 \
        --outFileNamePrefix out_prefix \
        --outReadsUnmapped Fastx \
        --outSAMtype BAM Unsorted \
        --outSAMattributes All \
        --outSAMunmapped Within \
        --outSAMattrRGline ID:foo \
        --outFilterType BySJout \
        --outFilterMultimapNmax 1 \
        --outFilterMultimapScoreRange 1 \
        --outFilterScoreMin 10 \
        --alignEndsType EndToEnd
  11. 11

    Aligned reads were sorted with samtools (1.4.1)

    samtools v1.4.1 GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools=1.4.1
    
    # Sort aligned reads
    # Assuming 'aligned_reads.bam' is the input BAM file
    # And 'sorted_reads.bam' will be the sorted output BAM file
    samtools sort -o sorted_reads.bam aligned_reads.bam
  12. 12

    Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines.

    $ Bash example
    # Install eclip (assuming Python 3 is available)
    # pip install eclip
    # Or, if using conda:
    # conda create -n eclip_env python=3.8
    # conda activate eclip_env
    # pip install eclip
    
    # Define input and output files (placeholders)
    INPUT_R1="sorted_reads_R1.fastq.gz"
    INPUT_R2="sorted_reads_R2.fastq.gz"
    OUTPUT_R1="collapsed_reads_R1.fastq.gz"
    OUTPUT_R2="collapsed_reads_R2.fastq.gz"
    BARCODE_LENGTH=10 # Common barcode length in eCLIP protocols (e.g., 10bp)
    MIN_READS=2     # Minimum number of reads to collapse (default in some protocols)
    
    # Execute barcodecollapsepe.py
    barcodecollapsepe.py \
        --input "${INPUT_R1}" \
        --input2 "${INPUT_R2}" \
        --output "${OUTPUT_R1}" \
        --output2 "${OUTPUT_R2}" \
        --barcode-length "${BARCODE_LENGTH}" \
        --min-reads "${MIN_READS}"
  13. 13

    Args: -o preRmDup.bam -m metrics_file -b rmDup.bam

    picard MarkDuplicates (Inferred with models/gemini-2.5-flash) v2.27.4 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install Picard (e.g., via Conda):
    # conda install -c bioconda picard
    
    # Execute picard MarkDuplicates
    # The arguments -b, -o, -m are mapped to Picard's I=, O=, M= respectively.
    picard MarkDuplicates I=rmDup.bam O=preRmDup.bam M=metrics_file
  14. 14

    PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)

    samtools v1.4.1 GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools=1.4.1
    
    # Merge PCR de-duped BAM files
    samtools merge merged.bam input1_deduped.bam input2_deduped.bam inputN_deduped.bam
  15. 15

    Merged alignments were split to keep just read2 using samtools (1.4.1) view.

    samtools v1.4.1 GitHub
    $ Bash example
    # Define input and output file paths
    INPUT_BAM="merged_alignments.bam" # Placeholder for your merged alignment BAM file
    OUTPUT_READ2_BAM="read2_alignments.bam" # Placeholder for the output BAM containing only read2
    
    # Split merged alignments to keep just read2 using samtools view
    # -f 128: Selects alignments where the 0x80 flag (read is second in pair) is set.
    # -b: Output in BAM format.
    samtools view -f 128 -b "${INPUT_BAM}" > "${OUTPUT_READ2_BAM}"
  16. 16

    Args: -h -b -f 128

    Unknown Tool (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # The tool could not be inferred from the provided description: "Args: -h -b -f 128".
    # The arguments -h, -b, -f 128 are generic and do not point to a specific bioinformatics tool.
    # Please provide more context or a tool name to generate a specific command.
    
    # No specific installation instructions can be provided as the tool is unknown.
    
    # Placeholder command reflecting the provided arguments:
    unknown_tool -h -b -f 128
  17. 17

    Read2 BAM files were used to identify peak clusters with Clipper (1.2.2).

    CLIPper v1.2.2 GitHub
    $ Bash example
    # Install CLIPper and its dependencies (e.g., using conda)
    # conda create -n clipper_env python=3.8
    # conda activate clipper_env
    # pip install pysam numpy scipy
    
    # Clone the CLIPper repository if not already available
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    
    # Run CLIPper to identify peak clusters
    # Replace 'input_read2.bam' with your actual Read2 BAM file.
    # Replace 'hg38' with the appropriate genome assembly if different.
    # The output files will be prefixed with 'clipper_peaks'.
    python clipper.py \
        -g hg38 \
        -o clipper_peaks \
        input_read2.bam
  18. 18

    Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

    Gene-aware BAM Processor (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # This tool is inferred to be a custom Python script or a specialized gene-aware BAM processor, likely part of an eCLIP pipeline.
    # The specific tool name and installation steps are not explicitly provided in the description.
    # Installation instructions would depend on the specific script and its dependencies.
    # Example (placeholder):
    # git clone https://github.com/some/repo.git
    # cd repo
    # pip install -r requirements.txt
    
    # Reference genome: hg19 (Human Genome build 19)
    # Source: UCSC Genome Browser or NCBI
    
    # Execute the inferred tool
    gene_aware_bam_processor --species hg19 \
                                 --bam path/to/input.bam \
                                 --timeout 3600000 \
                                 --maxgenes 1000000 \
                                 --save-pickle \
                                 --outfile path/to/output.bam
  19. 19

    Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.

    $ Bash example
    # Install eCLIP scripts (if not already installed)
    # git clone https://github.com/yeolab/eclip.git
    # export PATH=$PATH:$(pwd)/eclip/bin
    
    # Define input files and output prefix
    PEAK_FILE="peak_clusters.bed" # Input peak clusters file (e.g., from a previous peak calling step)
    IP_BAM="ip.bam"             # Original IP BAM file (paired-end)
    INPUT_BAM="input.bam"         # Original INPUT BAM file (paired-end)
    OUTPUT_PREFIX="normalized_peaks" # Prefix for output files
    
    # Filter IP and INPUT BAM files to retain only read2 (second in pair)
    # The -f 0x80 flag for samtools view selects the second segment in a pair.
    # The -b flag outputs in BAM format.
    samtools view -b -f 0x80 "${IP_BAM}" > "${IP_BAM%.bam}_read2.bam"
    samtools view -b -f 0x80 "${INPUT_BAM}" > "${INPUT_BAM%.bam}_read2.bam"
    
    IP_READ2_BAM="${IP_BAM%.bam}_read2.bam"
    INPUT_READ2_BAM="${INPUT_BAM%.bam}_read2.bam"
    
    # Run peaksnormalize.pl for normalization
    # Usage: peaksnormalize.pl <peakfile> <IP_BAM> <INPUT_BAM> <output_prefix>
    # This script will output files such as <output_prefix>.normalized.bed, <output_prefix>.log, etc.
    peaksnormalize.pl "${PEAK_FILE}" "${IP_READ2_BAM}" "${INPUT_READ2_BAM}" "${OUTPUT_PREFIX}"
    
    # Optional: Clean up intermediate read2 BAM files
    # rm "${IP_READ2_BAM}" "${INPUT_READ2_BAM}"
  20. 20

    Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

    $ Bash example
    # Install eCLIP pipeline (version 0.1.5 or compatible)
    # This script is typically part of the eCLIP pipeline distribution.
    # You might need to clone the repository and add its scripts directory to your PATH,
    # or call the script with its full path.
    # Example:
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    # # If a specific tag for v0.1.5 exists, use it:
    # # git checkout v0.1.5
    # # Otherwise, use a commit close to that version or the main branch.
    # export PATH=$PATH:$(pwd)/scripts # Assuming the script is in a 'scripts' directory within the cloned repo
    
    # Define input and output files
    # These are placeholder names for normalized peak regions in BED format.
    INPUT_PEAKS_REP1="replicate1_normalized_peaks.bed"
    INPUT_PEAKS_REP2="replicate2_normalized_peaks.bed"
    OUTPUT_MERGED_PEAKS="merged_replicate_overlapping_peaks.bed"
    
    # Execute the merging script
    # The script 'compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl'
    # is designed to merge overlapping normalized peak regions from replicates.
    # It likely takes multiple BED files as input and outputs a single merged BED file.
    # The 'l2foldenrpeakfi' in the name suggests it processes L2 fold enrichment information.
    perl /path/to/eclip/scripts/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \
        "${INPUT_PEAKS_REP1}" \
        "${INPUT_PEAKS_REP2}" \
        > "${OUTPUT_MERGED_PEAKS}"
  21. 21

    Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

    IDR v2.0.2 GitHub
    $ Bash example
    # Install IDR (if not already installed)
    # conda install -c bioconda idr
    
    # Example input files (replace with actual paths to normalized peak files, 
    # which have likely been processed by make_informationcontent_from_peaks.pl 
    # to include entropy scores, potentially in the signal.value column).
    REP1_PEAKS="replicate1_normalized_peaks.narrowPeak"
    REP2_PEAKS="replicate2_normalized_peaks.narrowPeak"
    OUTPUT_PREFIX="idr_reproducible_peaks"
    
    # Run IDR to determine reproducible peaks
    idr \
        --samples "${REP1_PEAKS}" "${REP2_PEAKS}" \
        --output-file "${OUTPUT_PREFIX}.idr" \
        --rank signal.value \
        --plot \
        --log-output-file "${OUTPUT_PREFIX}.log" \
        --soft-idr-threshold 0.05
  22. 22

    Reproducible peaks were annotated by overlapping peak regions with Gencode hg19 annotations

    GENCODE v2.29.2 GitHub
    $ Bash example
    # Install bedtools (if not already installed)
    # conda install -c bioconda bedtools=2.29.2
    
    # Download Gencode hg19 annotation GTF
    wget -nc http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
    gunzip -f gencode.v19.annotation.gtf.gz
    
    # Convert Gencode GTF to a BED file containing gene regions
    # This extracts chromosome, 0-based start, 1-based end, gene_id, strand, and a placeholder score (e.g., 0)
    # The gene_id is extracted from the attributes column.
    awk -v OFS='\t' '$3 == "gene" {
        # Extract gene_id from the 9th column (attributes)
        gene_id = "";
        if (match($9, /gene_id "([^"]+)"/)) {
            gene_id = substr($9, RSTART + 9, RLENGTH - 10);
        }
        print $1, $4-1, $5, gene_id, 0, $7
    }' gencode.v19.annotation.gtf > gencode.v19.genes.bed
    
    # Placeholder for reproducible peaks file
    # Replace 'reproducible_peaks.bed' with the actual path to your peak file
    # Example: reproducible_peaks.bed could be the output of an IDR step.
    # For eCLIP, this might be the merged peaks from yeolab/merge_peaks.
    # Example: cp /path/to/your/reproducible_peaks.bed .
    
    # Annotate reproducible peaks by overlapping with Gencode hg19 gene annotations
    # -a: reproducible peaks file (e.g., reproducible_peaks.bed)
    # -b: Gencode hg19 gene annotations BED file (gencode.v19.genes.bed)
    # -wa: write the original entry in A for each overlap
    # -wb: write the original entry in B for each overlap
    # The output will contain the peak region followed by the overlapping gene region.
    bedtools intersect -a reproducible_peaks.bed -b gencode.v19.genes.bed -wa -wb > annotated_peaks.bed
  23. 23

    Region-based fold enriched was calculated by mapped reads fold change between IP and Input.

    deeptools bigwigCompare (Inferred with models/gemini-2.5-flash) v3.5.1 GitHub
    $ Bash example
    # Install deeptools if not already installed
    # conda install -c bioconda deeptools=3.5.1
    
    # Calculate genome-wide log2 fold enrichment between IP and Input bigWig files.
    # This output bigWig file represents the 'region-based fold enriched' signal across the genome.
    # It can then be used for further region-specific analysis (e.g., averaging over called peaks).
    # Assuming 'ip.bigWig' and 'input.bigWig' are already normalized bigWig files (e.g., RPKM, CPM).
    # The --pseudocount 1 is added to avoid issues with zero values when calculating log2.
    # The bigWig files are assumed to be based on a common reference genome assembly, e.g., hg38.
    deep_tools bigwigCompare \
        -b1 ip.bigWig \
        -b2 input.bigWig \
        --operation log2 \
        --pseudocount 1 \
        -o fold_enrichment.bigWig \
        --numberOfProcessors auto \
        --verbose
  24. 24

    Each read was assigned to a single region with the following descending priority order: CDS, 5’UTR, 3’UTR, Intron.

    Feature Assignment Script (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # Define input and output files
    INPUT_BAM="aligned_reads.bam" # Placeholder: Replace with your actual sorted and indexed BAM file
    OUTPUT_BAM_TAGGED="assigned_reads_tagged.bam"
    OUTPUT_SUMMARY="read_assignment_summary.tsv"
    
    # Reference files (placeholders - replace with actual paths to your reference files)
    # For hg38 and GENCODE v44:
    # GENOME_FASTA="path/to/GRCh38.primary_assembly.genome.fa"
    # GTF_FILE="path/to/gencode.v44.annotation.gtf"
    
    # --- Installation (commented out) ---
    # # Install necessary tools if not already available.
    # # This script requires Python 3, pysam, pybedtools, bedtools, and samtools.
    # # Example using conda:
    # # conda create -n read_assign python=3.9 pysam pybedtools bedtools samtools -c bioconda -y
    # # conda activate read_assign
    
    # --- Step 1: Generate feature BED files from GTF ---
    # This step extracts coordinates for CDS, 5'UTR, 3'UTR, and Introns from the GTF.
    # This is a conceptual representation. In a real pipeline, you would parse
    # your GTF file (e.g., gencode.v44.annotation.gtf) to generate these BED files.
    # A robust implementation would handle overlapping features within the same category
    # and accurately derive intron coordinates (e.g., from gene regions minus exons).
    
    # For the purpose of this example, we create dummy BED files.
    # Replace this section with your actual GTF parsing logic or pre-generated BED files.
    echo -e "chr1\t1000\t2000\tgeneA_CDS\t0\t+" > cds.bed
    echo -e "chr1\t500\t999\tgeneA_CDS_exon2\t0\t+" >> cds.bed # Example of multiple CDS regions
    echo -e "chr1\t200\t499\tgeneA_5UTR\t0\t+" > utr5.bed
    echo -e "chr1\t2001\t2500\tgeneA_3UTR\t0\t+" > utr3.bed
    echo -e "chr1\t2501\t3000\tgeneA_Intron1\t0\t+" > intron.bed
    echo -e "chr1\t3001\t4000\tgeneA_Intron2\t0\t+" >> intron.bed
    
    # Sort and merge BED files to ensure proper intersection behavior and remove overlaps within categories.
    # This is crucial for accurate feature assignment.
    sort -k1,1 -k2,2n cds.bed | bedtools merge -i - > cds.merged.bed
    sort -k1,1 -k2,2n utr5.bed | bedtools merge -i - > utr5.merged.bed
    sort -k1,1 -k2,2n utr3.bed | bedtools merge -i - > utr3.merged.bed
    sort -k1,1 -k2,2n intron.bed | bedtools merge -i - > intron.merged.bed
    
    # --- Step 2: Prioritized Read Assignment Script ---
    # This Python script assigns each read to a single region based on the defined priority.
    # It uses pysam to read/write BAM files and pybedtools for efficient genomic interval intersection.
    # Reads will be tagged with an 'RA' (Read Assignment) tag indicating their assigned feature type.
    
    # Create the Python script inline for demonstration purposes.
    cat << 'EOF' > assign_reads_priority.py
    import pysam
    import pybedtools
    import sys
    import os
    
    def main():
        if len(sys.argv) != 8:
            print("Usage: python assign_reads_priority.py <input_bam> <output_bam> <output_summary> <cds_bed> <utr5_bed> <utr3_bed> <intron_bed>")
            sys.exit(1)
    
        input_bam = sys.argv[1]
        output_bam = sys.argv[2]
        output_summary = sys.argv[3]
        cds_bed_path = sys.argv[4]
        utr5_bed_path = sys.argv[5]
        utr3_bed_path = sys.argv[6]
        intron_bed_path = sys.argv[7]
    
        # Load feature BED files
        # Ensure BED files are sorted and merged for optimal performance
        cds_features = pybedtools.BedTool(cds_bed_path)
        utr5_features = pybedtools.BedTool(utr5_bed_path)
        utr3_features = pybedtools.BedTool(utr3_bed_path)
        intron_features = pybedtools.BedTool(intron_bed_path)
    
        # Open input and output BAM files
        samfile = pysam.AlignmentFile(input_bam, "rb")
        outfile = pysam.AlignmentFile(output_bam, "wb", template=samfile)
    
        assigned_counts = {'CDS': 0, '5UTR': 0, '3UTR': 0, 'Intron': 0, 'Unassigned': 0}
        total_reads = 0
    
        for read in samfile.fetch():
            total_reads += 1
            assigned = False
            # Create a pybedtools Interval for the read
            # Ensure coordinates are 0-based for pybedtools
            read_interval = pybedtools.create_interval_from_list([read.reference_name, str(read.reference_start), str(read.reference_end), read.query_name, '0', '-' if read.is_reverse else '+'])
    
            # Priority 1: CDS
            if cds_features.intersect(read_interval, wa=True, u=True):
                read.set_tag('RA', 'CDS', value_type='Z') # Read Assignment tag
                assigned_counts['CDS'] += 1
                assigned = True
            # Priority 2: 5'UTR (if not assigned yet)
            elif utr5_features.intersect(read_interval, wa=True, u=True):
                read.set_tag('RA', '5UTR', value_type='Z')
                assigned_counts['5UTR'] += 1
                assigned = True
            # Priority 3: 3'UTR (if not assigned yet)
            elif utr3_features.intersect(read_interval, wa=True, u=True):
                read.set_tag('RA', '3UTR', value_type='Z')
                assigned_counts['3UTR'] += 1
                assigned = True
            # Priority 4: Intron (if not assigned yet)
            elif intron_features.intersect(read_interval, wa=True, u=True):
                read.set_tag('RA', 'Intron', value_type='Z')
                assigned_counts['Intron'] += 1
                assigned = True
            else:
                read.set_tag('RA', 'Unassigned', value_type='Z')
                assigned_counts['Unassigned'] += 1
    
            outfile.write(read)
    
        samfile.close()
        outfile.close()
    
        # Write summary
        with open(output_summary, 'w') as f:
            f.write(f'Total_Reads\t{total_reads}\n')
            for category, count in assigned_counts.items():
                f.write(f'{category}\t{count}\n')
    
        # Index the output BAM
        pysam.index(output_bam)
    
        print("Read assignment complete. Output BAM tagged and indexed.")
    
    if __name__ == "__main__":
        main()
    EOF
    
    # Execute the Python script
    python assign_reads_priority.py \
        "${INPUT_BAM}" \
        "${OUTPUT_BAM_TAGGED}" \
        "${OUTPUT_SUMMARY}" \
        cds.merged.bed \
        utr5.merged.bed \
        utr3.merged.bed \
        intron.merged.bed
    
    # Clean up temporary files (optional)
    rm cds.bed utr5.bed utr3.bed intron.bed \
       cds.merged.bed utr5.merged.bed utr3.merged.bed intron.merged.bed
    rm assign_reads_priority.py
  25. 25

    For each gene, reads were summed up across each region to calculate final region counts, and fold change.

    featureCounts (Inferred with models/gemini-2.5-flash) v2.0.3
    $ Bash example
    # Install Subread (which includes featureCounts) if not already installed
    # conda install -c bioconda subread
    
    # Define input and output paths
    BAM_FILE="aligned_reads.bam" # Replace with your actual aligned BAM file
    GTF_FILE="Homo_sapiens.GRCh38.109.gtf" # Replace with your actual GTF annotation file (e.g., from Ensembl or Gencode)
    OUTPUT_COUNTS_FILE="gene_region_counts.txt"
    
    # Sum reads across gene regions to calculate final region counts
    # -a: annotation file (GTF/GFF format)
    # -o: output file for counts
    # -t exon: specify feature type to count (e.g., 'exon' for gene-level counts)
    # -g gene_id: specify attribute type to aggregate counts by (e.g., 'gene_id')
    # -p: specify if reads are paired-end (remove if single-end)
    # -s 0: strand-specific (0=unstranded, 1=stranded, 2=reverse stranded) - adjust based on library prep
    featureCounts -a "${GTF_FILE}" \
                  -o "${OUTPUT_COUNTS_FILE}" \
                  -t "exon" \
                  -g "gene_id" \
                  -p \
                  -s 0 \
                  "${BAM_FILE}"
    
    # Note: Calculating fold change is typically a subsequent step, often performed
    # using statistical packages like DESeq2 or edgeR in R, which take the
    # count matrix (e.g., gene_region_counts.txt) as input along with sample metadata.
    # This part is not a direct bash command but a conceptual follow-up analysis.
    # Example conceptual R code (not bash):
    # library(DESeq2)
    # counts_data <- read.table("gene_region_counts.txt", header=TRUE, row.names=1, skip=1) # Adjust skip based on featureCounts header
    # # ... further DESeq2 steps to create a DESeqDataSet and calculate fold change ...
    

Tools Used

Raw Source Text
Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1). Args: --length 10
Reads were then trimmed with cutadapt (1.9.1). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 \  --genomeDir human_repbase \  --readFilesIn path/to/read1 path/to/read2 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 30 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19). Args: --runThreadN 16 \  --genomeDir genomedir \  --readFilesIn /path/to/read1 /path/to/read2 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM   Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 1 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.4.1)
Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines. Args: -o preRmDup.bam -m metrics_file -b rmDup.bam
PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)
Merged alignments were split to keep just read2 using samtools (1.4.1) view. Args: -h -b -f 128
Read2 BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
Reproducible peaks were annotated by overlapping peak regions with Gencode hg19 annotations
Region-based fold enriched was calculated by mapped reads fold change between IP and Input. Each read was assigned to a single region with the following descending priority order: CDS, 5’UTR, 3’UTR, Intron. For each gene, reads were summed up across each region to calculate final region counts, and fold change.
Genome_build: hg19
Supplementary_files_format_and_content: tab-delimited text files include -log10pValues and Log2FoldChange values for each IDR peaks called after cutoff of 3,3 in each parameter
← Back to Analysis