GSE157259 Processing Pipeline

RIP-Seq code_examples 22 steps

Publication

RNA binding protein DDX5 restricts RORγt<sup>+</sup> T<sub>reg</sub> suppressor function to promote intestine inflammation.

Science advances (2023) — PMID 36724232

Dataset

GSE157259

DDX5 RNA interactome in cultured T cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1).

    eclipdemux v0.0.1 GitHub
    $ Bash example
    # Install eclipdemux (if not already installed)
    # pip install eclipdemux
    
    # Example usage of eclipdemux
    # input_reads.fastq.gz: Your raw sequenced reads containing inline barcodes.
    # demuxed_reads.fastq.gz: Output file with barcodes removed and randomers added to headers.
    # inline_barcodes.txt: A file containing the inline barcode sequences to be removed.
    # 10: Placeholder for the length of the randomer to be added to read headers.
    eclipdemux -i input_reads.fastq.gz -o demuxed_reads.fastq.gz -b inline_barcodes.txt -r 10
  2. 2

    Args: --length 10

    cutadapt (Inferred with models/gemini-2.5-flash) v4.1 GitHub
    $ Bash example
    # conda install -c bioconda cutadapt
    cutadapt --minimum-length 10 -o output.fastq.gz input.fastq.gz
  3. 3

    Reads were then trimmed with cutadapt (1.9.1).

    cutadapt v1.9.1 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt=1.9.1
    
    # Define input and output file names (placeholders)
    # Replace with actual input FASTQ files
    INPUT_R1="input_reads_R1.fastq.gz"
    INPUT_R2="input_reads_R2.fastq.gz"
    OUTPUT_R1="trimmed_reads_R1.fastq.gz"
    OUTPUT_R2="trimmed_reads_R2.fastq.gz"
    
    # Trim reads for quality and minimum length.
    # -q 20,20: Trim low-quality bases from 5' and 3' ends with a quality threshold of 20.
    # -m 20: Discard reads shorter than 20 bp after trimming.
    # No specific adapters were mentioned in the description, so adapter trimming is omitted.
    cutadapt -q 20,20 -m 20 -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" "${INPUT_R1}" "${INPUT_R2}"
  4. 4

    Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

    $ Bash example
    # Clone the eCLIP pipeline repository to access parsebarcodes.sh
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    # # Ensure parsebarcodes.sh is executable and in PATH, or call it directly
    # # chmod +x bin/parsebarcodes.sh
    
    # Placeholder for input raw reads file (e.g., a FASTQ file containing reads with barcodes/adapters)
    input_fastq="raw_reads.fastq"
    
    # Placeholder for output prefix
    # parsebarcodes.sh typically outputs processed FASTQ files (e.g., processed_reads.fastq) and a log file.
    output_prefix="processed_reads"
    
    # Adapter files (placeholders - these would be specific files from the eCLIP pipeline resources)
    # These files contain the adapter sequences to be matched and trimmed.
    # They are typically located in the 'resources' or 'adapters' directory of the eCLIP pipeline.
    g_adapters_fasta="g_adapters.fasta" # Example: eclip/resources/g_adapters.fasta
    A_adapters_fasta="A_adapters.fasta" # Example: eclip/resources/A_adapters.fasta
    a_adapters_fasta="a_adapters.fasta" # Example: eclip/resources/a_adapters.fasta
    
    # Execute parsebarcodes.sh with the specified arguments
    # The script processes the input_fastq, identifies and trims adapters/barcodes,
    # and outputs processed reads to files named with the output_prefix.
    parsebarcodes.sh \
        "${input_fastq}" \
        "${output_prefix}" \
        --match-read-wildcards \
        -O 1 \
        --times 1 \
        -e 0.1 \
        --quality-cutoff 6 \
        -m 18 \
        -g "${g_adapters_fasta}" \
        -A "${A_adapters_fasta}" \
        -a "${a_adapters_fasta}"
  5. 5

    Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events.

    cutadapt v1.9.1 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt=1.9.1
    
    # Define input and output files
    # INPUT_FASTQ should be the output from the previous trimming step or raw reads if this is the first trimming.
    INPUT_FASTQ="input_reads_from_previous_trimming.fastq.gz"
    OUTPUT_FASTQ="trimmed_reads_final.fastq.gz"
    
    # Define the 3' adapter sequence commonly associated with double-ligation events in eCLIP.
    # This sequence is a placeholder for the actual 3' adapter used in the library preparation (e.g., Illumina TruSeq Small RNA 3' Adapter).
    ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
    
    # Define minimum read length after trimming. Common for eCLIP is 18bp.
    MIN_READ_LENGTH=18
    
    # Execute cutadapt to remove the 3' adapter and filter by minimum length.
    # The -a flag specifies a 3' adapter to be removed.
    # The -m flag specifies the minimum length of reads to be kept after trimming.
    cutadapt -a "${ADAPTER_3PRIME}" \
             -m "${MIN_READ_LENGTH}" \
             -o "${OUTPUT_FASTQ}" \
             "${INPUT_FASTQ}"
  6. 6

    Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

    $ Bash example
    # Install cutadapt if not already available
    # conda install -c bioconda cutadapt
    
    # Assuming 'input_reads.fastq.gz' are the raw reads and 'trimmed_reads.fastq.gz' will be the output.
    # A_adapters.fasta is generated by parsebarcodes.sh as described within the eclip 0.1.5+ pipeline.
    cutadapt \
      --match-read-wildcards \
      -O 1 \
      --times 1 \
      -e 0.1 \
      --quality-cutoff 6 \
      -m 18 \
      -A A_adapters.fasta \
      -o trimmed_reads.fastq.gz \
      input_reads.fastq.gz
  7. 7

    Trimmed reads were then mapped with STAR (2.4.0i) against a mouse-specific repeat element database (RepBase 18.05).

    $ Bash example
    # Assume STAR is installed and available in PATH.
    # Assume the RepBase 18.05 mouse repeat element database FASTA file (e.g., RepBase18.05_mouse.fasta) has been indexed by STAR.
    # Example command to generate STAR index (if not already done):
    # mkdir -p /path/to/repbase_18.05_star_index
    # STAR --runMode genomeGenerate \
    #      --genomeDir /path/to/repbase_18.05_star_index \
    #      --genomeFastaFiles /path/to/RepBase18.05_mouse.fasta \
    #      --genomeSAindexNbases 10 # Adjust based on repeat element lengths and total size
    
    # Placeholder for STAR genome index directory for RepBase 18.05
    REPBASE_STAR_INDEX="/path/to/repbase_18.05_star_index"
    
    # Placeholder for input trimmed reads file (single-end or first read of paired-end)
    TRIMMED_READS_FASTQ="trimmed_reads.fastq.gz"
    
    # If paired-end reads, use:
    # TRIMMED_READS_FASTQ_R1="trimmed_reads_R1.fastq.gz"
    # TRIMMED_READS_FASTQ_R2="trimmed_reads_R2.fastq.gz"
    # And modify --readFilesIn to "${TRIMMED_READS_FASTQ_R1}" "${TRIMMED_READS_FASTQ_R2}"
    
    # Placeholder for output prefix
    OUTPUT_PREFIX="star_repbase_mapped"
    
    # Placeholder for number of threads
    NUM_THREADS=8
    
    # STAR mapping command for repeat elements
    # Key parameters for repeat mapping:
    # --alignIntronMax 1: Disables splicing, as repeats are not expected to have introns.
    # --outFilterMultimapNmax 1000: Allows reads to map to up to 1000 locations, crucial for highly repetitive sequences.
    # --outSAMtype BAM SortedByCoordinate: Outputs a sorted BAM file.
    STAR --runThreadN ${NUM_THREADS} \
         --genomeDir ${REPBASE_STAR_INDEX} \
         --readFilesIn ${TRIMMED_READS_FASTQ} \
         --outFileNamePrefix ${OUTPUT_PREFIX}_ \
         --outSAMtype BAM SortedByCoordinate \
         --outFilterMultimapNmax 1000 \
         --alignIntronMax 1 \
         --outSAMattributes NH HI AS nM NM MD
  8. 8

    Args: --runThreadN 16 \ --genomeDir mouse_repbase \ --readFilesIn path/to/read1 path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) vNot specified GitHub
    $ Bash example
    # Install STAR (example, uncomment if needed)
    # conda install -c bioconda star
    
    # Define variables (placeholders)
    # 'mouse_repbase' implies a custom STAR genome index built for mouse (e.g., GRCm39) 
    # and potentially including repetitive elements from databases like Repbase.
    # Replace with the actual path to your pre-built STAR genome index.
    GENOME_DIR="path/to/mouse_genome_index_GRCm39_with_repbase"
    READ1="path/to/read1.fastq.gz"
    READ2="path/to/read2.fastq.gz"
    OUTPUT_PREFIX="out_prefix"
    
    # Run STAR alignment
    STAR \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${READ1}" "${READ2}" \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outReadsUnmapped Fastx \
      --outSAMtype BAM Unsorted \
      --outSAMattributes All \
      --outSAMunmapped Within \
      --outSAMattrRGline ID:foo \
      --outFilterType BySJout \
      --outFilterMultimapNmax 30 \
      --outFilterMultimapScoreRange 1 \
      --outFilterScoreMin 10 \
      --alignEndsType EndToEnd
  9. 9

    Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a mouse genome (mm10).

    $ Bash example
    # Install STAR if not already available
    # conda install -c bioconda star
    
    # Define variables
    READS_FILE="unmapped_filtered_reads.fastq.gz" # Placeholder for input FASTQ file
    GENOME_DIR="/path/to/mm10_star_index" # Placeholder for pre-built STAR genome index for mm10
    OUTPUT_PREFIX="aligned_reads_" # Prefix for output files
    THREADS=8 # Number of threads to use
    
    # Note: The STAR genome index for mm10 needs to be built once prior to alignment.
    # Example command to build the index (replace with actual paths):
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${GENOME_DIR} \
    #      --genomeFastaFiles /path/to/Mus_musculus.GRCm38.dna.primary_assembly.fa \
    #      --sjdbGTFfile /path/to/Mus_musculus.GRCm38.102.gtf \
    #      --runThreadN ${THREADS}
    
    # Align unmapped reads to the mouse genome (mm10) using STAR
    STAR --genomeDir ${GENOME_DIR} \
         --readFilesIn ${READS_FILE} \
         --runThreadN ${THREADS} \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes Standard \
         --readFilesCommand zcat
  10. 10

    Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables
    # Replace 'path/to/STAR_genome_index' with the actual path to your STAR genome index directory.
    # This index should be built for the relevant reference genome (e.g., hg38/GRCh38 for human).
    STAR_GENOME_DIR="path/to/STAR_genome_index"
    
    # Replace '/path/to/read1' and '/path/to/read2' with the actual paths to your input FASTQ files.
    # Ensure these paths are correct for your data.
    READ1_FASTQ="/path/to/read1"
    READ2_FASTQ="/path/to/read2"
    
    # Replace 'out_prefix' with your desired prefix for output files (e.g., sample_aligned).
    OUTPUT_PREFIX="out_prefix"
    
    STAR \
        --runThreadN 16 \
        --genomeDir "${STAR_GENOME_DIR}" \
        --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
        --outFileNamePrefix "${OUTPUT_PREFIX}" \
        --outReadsUnmapped Fastx \
        --outSAMtype BAM Unsorted \
        --outSAMattributes All \
        --outSAMunmapped Within \
        --outSAMattrRGline ID:foo \
        --outFilterType BySJout \
        --outFilterMultimapNmax 1 \
        --outFilterMultimapScoreRange 1 \
        --outFilterScoreMin 10 \
        --alignEndsType EndToEnd
  11. 11

    Aligned reads were sorted with samtools (1.4.1)

    samtools v1.4.1 GitHub
    $ Bash example
    # Install samtools (version 1.4.1)
    # conda install -c bioconda samtools=1.4.1
    
    # Sort aligned reads by coordinate
    samtools sort -o output.sorted.bam input.bam
  12. 12

    Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines.

    $ Bash example
    # Install eCLIP (assuming a specific version or cloning the repository)
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    # # If using a specific release, checkout the tag, e.g., git checkout v0.1.5
    # # Ensure Python environment is set up with necessary dependencies
    
    # Define input and output file paths
    INPUT_READ1="sorted_reads_R1.fastq.gz"
    INPUT_READ2="sorted_reads_R2.fastq.gz"
    OUTPUT_PREFIX="collapsed_reads"
    STATS_FILE="collapsed_reads_stats.txt"
    
    # Execute barcodecollapsepe.py to collapse sorted paired-end reads
    # The script is located within the cloned eclip repository at eclip/eclip/barcodecollapsepe.py
    python eclip/eclip/barcodecollapsepe.py \
      -r1 "${INPUT_READ1}" \
      -r2 "${INPUT_READ2}" \
      -o "${OUTPUT_PREFIX}" \
      -s "${STATS_FILE}" \
      -b 10 \
      -m 10
  13. 13

    Args: -o preRmDup.bam -m metrics_file -b rmDup.bam

    sambamba (Inferred with models/gemini-2.5-flash) v0.8.2 GitHub
    $ Bash example
    # Install sambamba if not already available
    # conda install -c bioconda sambamba
    
    # Define input and output files based on the description's arguments
    # -b: Input BAM file (e.g., a BAM file that still contains duplicates)
    # -o: Output BAM file (e.g., a BAM file with duplicates removed)
    # -m: Metrics file (sambamba markdup outputs statistics to stderr, which can be redirected)
    INPUT_BAM="rmDup.bam" # Placeholder for the input BAM file
    OUTPUT_BAM="preRmDup.bam" # Placeholder for the output BAM file after deduplication
    METRICS_FILE="metrics_file" # Placeholder for the deduplication metrics file
    
    # Execute sambamba markdup to remove duplicates
    # The command assumes that rmDup.bam is the input and preRmDup.bam is the output.
    # Deduplication statistics are redirected to the metrics_file.
    sambamba markdup "${INPUT_BAM}" "${OUTPUT_BAM}" 2> "${METRICS_FILE}"
  14. 14

    PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)

    samtools v1.4.1 GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools=1.4.1
    
    # Merge PCR de-duped reads from multiple inline barcodes
    # Replace input_barcode_1.bam, input_barcode_2.bam, etc. with your actual input BAM files.
    samtools merge merged.bam input_barcode_1.bam input_barcode_2.bam input_barcode_N.bam
  15. 15

    Merged alignments were split to keep just read2 using samtools (1.4.1) view.

    samtools v1.4.1 GitHub
    $ Bash example
    # samtools version 1.4.1
    # Input: merged.bam (placeholder for merged alignments)
    # Output: read2.bam (placeholder for alignments containing only read2)
    
    samtools view -f 0x80 -b -o read2.bam merged.bam
  16. 16

    Args: -h -b -f 128

    Unknown (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # The specific tool for these arguments is not provided in the description.
    # This command uses a placeholder 'unknown_tool'.
    unknown_tool -h -b -f 128
  17. 17

    Read2 BAM files were used to identify peak clusters with Clipper (1.2.2).

    CLIPper v1.2.2 GitHub
    $ Bash example
    # Install CLIPper (if not already installed)
    # pip install clipper # Or clone from GitHub and run setup.py
    
    # Define input and output files
    INPUT_BAM="read2.bam" # Placeholder for "Read2 BAM files"
    OUTPUT_DIR="clipper_peaks"
    SPECIES="hg38" # Placeholder for species, e.g., hg38, mm10. Adjust as per experiment.
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run CLIPper to identify peak clusters
    # CLIPper version 1.2.2
    # Note: For eCLIP, it's common to use a control BAM (e.g., input, IgG) with the -c flag.
    # Example: clipper.py -s "${SPECIES}" -o "${OUTPUT_DIR}" -c "control.bam" "${INPUT_BAM}"
    clipper.py -s "${SPECIES}" -o "${OUTPUT_DIR}" "${INPUT_BAM}"
  18. 18

    Args: --species mm10 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

    gene_bam_processor (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # This is a placeholder for a custom script or tool that processes BAM files
    # based on gene information and saves intermediate data in pickle format.
    # The specific installation instructions would depend on the actual script/tool.
    # For example, if it's a Python script:
    # # conda create -n gene_bam_env python=3.9
    # # conda activate gene_bam_env
    # # pip install pandas numpy # (example dependencies)
    
    # Reference genome for mm10 (Mouse)
    # Source: UCSC Genome Browser (e.g., for genome annotation like GTF/GFF for gene definitions)
    # MM10_GENOME_ANNOTATION="/path/to/mm10.ncbiRefSeq.gtf" # Placeholder for actual annotation file
    
    # Execute the inferred gene_bam_processor
    # Assuming the script is named 'gene_bam_processor.py'
    python gene_bam_processor.py \
        --species mm10 \
        --bam path/to/input.bam \
        --timeout 3600000 \
        --maxgenes 1000000 \
        --save-pickle \
        --outfile path/to/output.bam
  19. 19

    Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.

    $ Bash example
    # Clone the eCLIP repository if not already available
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip/scripts
    
    # Example usage of peaksnormalize.pl
    # Assuming initial_peaks.bed is the input peak file (peak clusters),
    # ip_read2.bam is the IP read2 BAM file (containing only read2s),
    # and input_read2.bam is the INPUT read2 BAM file (containing only read2s).
    # The script will output normalized peak files with the specified prefix (e.g., normalized_peaks.bed).
    perl peaksnormalize.pl initial_peaks.bed ip_read2.bam input_read2.bam normalized_peaks
  20. 20

    Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

    $ Bash example
    # Install Perl if not available
    # sudo apt-get update && sudo apt-get install perl
    #
    # Clone the eCLIP repository to get the script
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    #
    # Define input and output files (placeholders)
    REPLICATE1_PEAKS="replicate1_normalized_peaks.bed"
    REPLICATE2_PEAKS="replicate2_normalized_peaks.bed"
    OUTPUT_PREFIX="merged_reproducible_peaks"
    MIN_OVERLAP_REPLICATES=2 # Example: require overlap in at least 2 replicates
    MIN_LOG2_FOLD_ENRICHMENT=1.0 # Example: require a minimum log2 fold enrichment of 1.0
    
    # Path to the script (assuming eclip repo is cloned and current directory is eclip)
    SCRIPT_PATH="./scripts/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl"
    
    # Execute the script to merge overlapping normalized peak regions
    perl "${SCRIPT_PATH}" \
        "${REPLICATE1_PEAKS}" \
        "${REPLICATE2_PEAKS}" \
        "${OUTPUT_PREFIX}" \
        "${MIN_OVERLAP_REPLICATES}" \
        "${MIN_LOG2_FOLD_ENRICHMENT}"
  21. 21

    Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

    IDR v2.0.2 GitHub
    $ Bash example
    # Define input files and output prefix
    REP1_PEAKS="rep1.normalized.peaks.bed" # Placeholder for normalized peak file from replicate 1
    REP2_PEAKS="rep2.normalized.peaks.bed" # Placeholder for normalized peak file from replicate 2
    OUTPUT_PREFIX="idr_reproducible_peaks"
    
    # --- Step 1: Rank peak files by entropy score ---
    # make_informationcontent_from_peaks.pl is included within the merge_peaks pipeline (https://github.com/yeolab/merge_peaks).
    # Ensure the script is in your PATH or provide its full path.
    # Example installation (assuming merge_peaks is cloned):
    # git clone https://github.com/yeolab/merge_peaks.git
    # export PATH=$(pwd)/merge_peaks/bin:$PATH
    
    echo "Ranking replicate 1 peaks..."
    make_informationcontent_from_peaks.pl "${REP1_PEAKS}" > "${OUTPUT_PREFIX}.rep1.ranked.bed"
    
    echo "Ranking replicate 2 peaks..."
    make_informationcontent_from_peaks.pl "${REP2_PEAKS}" > "${OUTPUT_PREFIX}.rep2.ranked.bed"
    
    # --- Step 2: Run IDR to determine reproducible peaks ---
    # IDR (Irreproducible Discovery Rate) version 2.0.2
    # Installation:
    # conda create -n idr_env idr=2.0.2 -c bioconda
    # conda activate idr_env
    
    echo "Running IDR (2.0.2) to find reproducible peaks..."
    idr --input-file-type narrowPeak \
        --rank-by signal.value \
        --output-file "${OUTPUT_PREFIX}.idr.peaks.bed" \
        --plot \
        --log-output-file "${OUTPUT_PREFIX}.idr.log" \
        --soft-idr-threshold 0.05 \
        "${OUTPUT_PREFIX}.rep1.ranked.bed" \
        "${OUTPUT_PREFIX}.rep2.ranked.bed"
    
    echo "IDR analysis complete. Reproducible peaks saved to ${OUTPUT_PREFIX}.idr.peaks.bed"
  22. 22

    Reproducible peaks were annotated by overlapping peak regions with Gencode M10 annotations

    GENCODE v2.30.0 GitHub
    $ Bash example
    # Install bedtools (if not already installed)
    # conda install -c bioconda bedtools
    
    # Download Gencode M10 annotation GTF file
    # wget -O gencode.vM10.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M10/gencode.vM10.annotation.gtf.gz
    # gunzip gencode.vM10.annotation.gtf.gz
    
    # Convert Gencode GTF to BED format for bedtools intersection.
    # This command extracts gene, transcript, and exon features, converting 1-based start to 0-based.
    # It also extracts gene_id, gene_name, and strand to create a 6-column BED file.
    awk -F'\t' '($3 == "gene" || $3 == "transcript" || $3 == "exon") {
        OFS="\t";
        split($9, a, "; ");
        gene_id=""; gene_name="";
        for (i=1; i<=length(a); i++) {
            if (a[i] ~ /gene_id/) { gene_id = substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1); }
            if (a[i] ~ /gene_name/) { gene_name = substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1); }
        }
        print $1, $4-1, $5, gene_id":"gene_name, "0", $7
    }' gencode.vM10.annotation.gtf > gencode.vM10.annotation.bed
    
    # Placeholder for input reproducible peak regions file (e.g., from IDR or peak calling)
    # reproducible_peaks.bed
    
    # Annotate reproducible peaks by overlapping with Gencode M10 annotations.
    # -a: input peak file (e.g., reproducible_peaks.bed)
    # -b: Gencode M10 annotation BED file
    # -wao: write original entry in A and overlap from B, report 0 for no overlap.
    #       This ensures all peaks are reported, with annotation details appended if an overlap exists.
    bedtools intersect -a reproducible_peaks.bed -b gencode.vM10.annotation.bed -wao > annotated_reproducible_peaks.bed

Tools Used

Raw Source Text
Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1). Args: --length 10
Reads were then trimmed with cutadapt (1.9.1). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events. Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Trimmed reads were then mapped with STAR (2.4.0i) against a mouse-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 \  --genomeDir mouse_repbase \  --readFilesIn path/to/read1 path/to/read2 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 30 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a mouse genome (mm10). Args: --runThreadN 16 \  --genomeDir genomedir \  --readFilesIn /path/to/read1 /path/to/read2 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM   Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 1 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.4.1)
Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines. Args: -o preRmDup.bam -m metrics_file -b rmDup.bam
PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)
Merged alignments were split to keep just read2 using samtools (1.4.1) view. Args: -h -b -f 128
Read2 BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species mm10 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
Reproducible peaks were annotated by overlapping peak regions with Gencode M10 annotations
Genome_build: mm10
Supplementary_files_format_and_content: bigWig files; tab-delimited text file includes -log10pValues and Log2FoldChange values for each IDR peaks called after cutoff of 3,3 in each parameter
← Back to Analysis