GSE157259 Processing Pipeline
Publication
RNA binding protein DDX5 restricts RORγt<sup>+</sup> T<sub>reg</sub> suppressor function to promote intestine inflammation.Science advances (2023) — PMID 36724232
Processing Steps
Generate Jupyter Notebook-
1
Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1).
$ Bash example
# Install eclipdemux (if not already installed) # pip install eclipdemux # Example usage of eclipdemux # input_reads.fastq.gz: Your raw sequenced reads containing inline barcodes. # demuxed_reads.fastq.gz: Output file with barcodes removed and randomers added to headers. # inline_barcodes.txt: A file containing the inline barcode sequences to be removed. # 10: Placeholder for the length of the randomer to be added to read headers. eclipdemux -i input_reads.fastq.gz -o demuxed_reads.fastq.gz -b inline_barcodes.txt -r 10
-
2
Args: --length 10
$ Bash example
# conda install -c bioconda cutadapt cutadapt --minimum-length 10 -o output.fastq.gz input.fastq.gz
-
3
Reads were then trimmed with cutadapt (1.9.1).
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt=1.9.1 # Define input and output file names (placeholders) # Replace with actual input FASTQ files INPUT_R1="input_reads_R1.fastq.gz" INPUT_R2="input_reads_R2.fastq.gz" OUTPUT_R1="trimmed_reads_R1.fastq.gz" OUTPUT_R2="trimmed_reads_R2.fastq.gz" # Trim reads for quality and minimum length. # -q 20,20: Trim low-quality bases from 5' and 3' ends with a quality threshold of 20. # -m 20: Discard reads shorter than 20 bp after trimming. # No specific adapters were mentioned in the description, so adapter trimming is omitted. cutadapt -q 20,20 -m 20 -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" "${INPUT_R1}" "${INPUT_R2}" -
4
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
$ Bash example
# Clone the eCLIP pipeline repository to access parsebarcodes.sh # git clone https://github.com/yeolab/eclip.git # cd eclip # # Ensure parsebarcodes.sh is executable and in PATH, or call it directly # # chmod +x bin/parsebarcodes.sh # Placeholder for input raw reads file (e.g., a FASTQ file containing reads with barcodes/adapters) input_fastq="raw_reads.fastq" # Placeholder for output prefix # parsebarcodes.sh typically outputs processed FASTQ files (e.g., processed_reads.fastq) and a log file. output_prefix="processed_reads" # Adapter files (placeholders - these would be specific files from the eCLIP pipeline resources) # These files contain the adapter sequences to be matched and trimmed. # They are typically located in the 'resources' or 'adapters' directory of the eCLIP pipeline. g_adapters_fasta="g_adapters.fasta" # Example: eclip/resources/g_adapters.fasta A_adapters_fasta="A_adapters.fasta" # Example: eclip/resources/A_adapters.fasta a_adapters_fasta="a_adapters.fasta" # Example: eclip/resources/a_adapters.fasta # Execute parsebarcodes.sh with the specified arguments # The script processes the input_fastq, identifies and trims adapters/barcodes, # and outputs processed reads to files named with the output_prefix. parsebarcodes.sh \ "${input_fastq}" \ "${output_prefix}" \ --match-read-wildcards \ -O 1 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -g "${g_adapters_fasta}" \ -A "${A_adapters_fasta}" \ -a "${a_adapters_fasta}" -
5
Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events.
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt=1.9.1 # Define input and output files # INPUT_FASTQ should be the output from the previous trimming step or raw reads if this is the first trimming. INPUT_FASTQ="input_reads_from_previous_trimming.fastq.gz" OUTPUT_FASTQ="trimmed_reads_final.fastq.gz" # Define the 3' adapter sequence commonly associated with double-ligation events in eCLIP. # This sequence is a placeholder for the actual 3' adapter used in the library preparation (e.g., Illumina TruSeq Small RNA 3' Adapter). ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" # Define minimum read length after trimming. Common for eCLIP is 18bp. MIN_READ_LENGTH=18 # Execute cutadapt to remove the 3' adapter and filter by minimum length. # The -a flag specifies a 3' adapter to be removed. # The -m flag specifies the minimum length of reads to be kept after trimming. cutadapt -a "${ADAPTER_3PRIME}" \ -m "${MIN_READ_LENGTH}" \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
6
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
$ Bash example
# Install cutadapt if not already available # conda install -c bioconda cutadapt # Assuming 'input_reads.fastq.gz' are the raw reads and 'trimmed_reads.fastq.gz' will be the output. # A_adapters.fasta is generated by parsebarcodes.sh as described within the eclip 0.1.5+ pipeline. cutadapt \ --match-read-wildcards \ -O 1 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -A A_adapters.fasta \ -o trimmed_reads.fastq.gz \ input_reads.fastq.gz
-
7
Trimmed reads were then mapped with STAR (2.4.0i) against a mouse-specific repeat element database (RepBase 18.05).
$ Bash example
# Assume STAR is installed and available in PATH. # Assume the RepBase 18.05 mouse repeat element database FASTA file (e.g., RepBase18.05_mouse.fasta) has been indexed by STAR. # Example command to generate STAR index (if not already done): # mkdir -p /path/to/repbase_18.05_star_index # STAR --runMode genomeGenerate \ # --genomeDir /path/to/repbase_18.05_star_index \ # --genomeFastaFiles /path/to/RepBase18.05_mouse.fasta \ # --genomeSAindexNbases 10 # Adjust based on repeat element lengths and total size # Placeholder for STAR genome index directory for RepBase 18.05 REPBASE_STAR_INDEX="/path/to/repbase_18.05_star_index" # Placeholder for input trimmed reads file (single-end or first read of paired-end) TRIMMED_READS_FASTQ="trimmed_reads.fastq.gz" # If paired-end reads, use: # TRIMMED_READS_FASTQ_R1="trimmed_reads_R1.fastq.gz" # TRIMMED_READS_FASTQ_R2="trimmed_reads_R2.fastq.gz" # And modify --readFilesIn to "${TRIMMED_READS_FASTQ_R1}" "${TRIMMED_READS_FASTQ_R2}" # Placeholder for output prefix OUTPUT_PREFIX="star_repbase_mapped" # Placeholder for number of threads NUM_THREADS=8 # STAR mapping command for repeat elements # Key parameters for repeat mapping: # --alignIntronMax 1: Disables splicing, as repeats are not expected to have introns. # --outFilterMultimapNmax 1000: Allows reads to map to up to 1000 locations, crucial for highly repetitive sequences. # --outSAMtype BAM SortedByCoordinate: Outputs a sorted BAM file. STAR --runThreadN ${NUM_THREADS} \ --genomeDir ${REPBASE_STAR_INDEX} \ --readFilesIn ${TRIMMED_READS_FASTQ} \ --outFileNamePrefix ${OUTPUT_PREFIX}_ \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 1000 \ --alignIntronMax 1 \ --outSAMattributes NH HI AS nM NM MD -
8
Args: --runThreadN 16 \ --genomeDir mouse_repbase \ --readFilesIn path/to/read1 path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR (example, uncomment if needed) # conda install -c bioconda star # Define variables (placeholders) # 'mouse_repbase' implies a custom STAR genome index built for mouse (e.g., GRCm39) # and potentially including repetitive elements from databases like Repbase. # Replace with the actual path to your pre-built STAR genome index. GENOME_DIR="path/to/mouse_genome_index_GRCm39_with_repbase" READ1="path/to/read1.fastq.gz" READ2="path/to/read2.fastq.gz" OUTPUT_PREFIX="out_prefix" # Run STAR alignment STAR \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
9
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a mouse genome (mm10).
$ Bash example
# Install STAR if not already available # conda install -c bioconda star # Define variables READS_FILE="unmapped_filtered_reads.fastq.gz" # Placeholder for input FASTQ file GENOME_DIR="/path/to/mm10_star_index" # Placeholder for pre-built STAR genome index for mm10 OUTPUT_PREFIX="aligned_reads_" # Prefix for output files THREADS=8 # Number of threads to use # Note: The STAR genome index for mm10 needs to be built once prior to alignment. # Example command to build the index (replace with actual paths): # STAR --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/Mus_musculus.GRCm38.dna.primary_assembly.fa \ # --sjdbGTFfile /path/to/Mus_musculus.GRCm38.102.gtf \ # --runThreadN ${THREADS} # Align unmapped reads to the mouse genome (mm10) using STAR STAR --genomeDir ${GENOME_DIR} \ --readFilesIn ${READS_FILE} \ --runThreadN ${THREADS} \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes Standard \ --readFilesCommand zcat -
10
Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
STAR (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables # Replace 'path/to/STAR_genome_index' with the actual path to your STAR genome index directory. # This index should be built for the relevant reference genome (e.g., hg38/GRCh38 for human). STAR_GENOME_DIR="path/to/STAR_genome_index" # Replace '/path/to/read1' and '/path/to/read2' with the actual paths to your input FASTQ files. # Ensure these paths are correct for your data. READ1_FASTQ="/path/to/read1" READ2_FASTQ="/path/to/read2" # Replace 'out_prefix' with your desired prefix for output files (e.g., sample_aligned). OUTPUT_PREFIX="out_prefix" STAR \ --runThreadN 16 \ --genomeDir "${STAR_GENOME_DIR}" \ --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
11
Aligned reads were sorted with samtools (1.4.1)
$ Bash example
# Install samtools (version 1.4.1) # conda install -c bioconda samtools=1.4.1 # Sort aligned reads by coordinate samtools sort -o output.sorted.bam input.bam
-
12
Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines.
$ Bash example
# Install eCLIP (assuming a specific version or cloning the repository) # git clone https://github.com/yeolab/eclip.git # cd eclip # # If using a specific release, checkout the tag, e.g., git checkout v0.1.5 # # Ensure Python environment is set up with necessary dependencies # Define input and output file paths INPUT_READ1="sorted_reads_R1.fastq.gz" INPUT_READ2="sorted_reads_R2.fastq.gz" OUTPUT_PREFIX="collapsed_reads" STATS_FILE="collapsed_reads_stats.txt" # Execute barcodecollapsepe.py to collapse sorted paired-end reads # The script is located within the cloned eclip repository at eclip/eclip/barcodecollapsepe.py python eclip/eclip/barcodecollapsepe.py \ -r1 "${INPUT_READ1}" \ -r2 "${INPUT_READ2}" \ -o "${OUTPUT_PREFIX}" \ -s "${STATS_FILE}" \ -b 10 \ -m 10 -
13
Args: -o preRmDup.bam -m metrics_file -b rmDup.bam
$ Bash example
# Install sambamba if not already available # conda install -c bioconda sambamba # Define input and output files based on the description's arguments # -b: Input BAM file (e.g., a BAM file that still contains duplicates) # -o: Output BAM file (e.g., a BAM file with duplicates removed) # -m: Metrics file (sambamba markdup outputs statistics to stderr, which can be redirected) INPUT_BAM="rmDup.bam" # Placeholder for the input BAM file OUTPUT_BAM="preRmDup.bam" # Placeholder for the output BAM file after deduplication METRICS_FILE="metrics_file" # Placeholder for the deduplication metrics file # Execute sambamba markdup to remove duplicates # The command assumes that rmDup.bam is the input and preRmDup.bam is the output. # Deduplication statistics are redirected to the metrics_file. sambamba markdup "${INPUT_BAM}" "${OUTPUT_BAM}" 2> "${METRICS_FILE}" -
14
PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.4.1 # Merge PCR de-duped reads from multiple inline barcodes # Replace input_barcode_1.bam, input_barcode_2.bam, etc. with your actual input BAM files. samtools merge merged.bam input_barcode_1.bam input_barcode_2.bam input_barcode_N.bam
-
15
Merged alignments were split to keep just read2 using samtools (1.4.1) view.
$ Bash example
# samtools version 1.4.1 # Input: merged.bam (placeholder for merged alignments) # Output: read2.bam (placeholder for alignments containing only read2) samtools view -f 0x80 -b -o read2.bam merged.bam
-
16
Args: -h -b -f 128
Unknown (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# The specific tool for these arguments is not provided in the description. # This command uses a placeholder 'unknown_tool'. unknown_tool -h -b -f 128
-
17
Read2 BAM files were used to identify peak clusters with Clipper (1.2.2).
$ Bash example
# Install CLIPper (if not already installed) # pip install clipper # Or clone from GitHub and run setup.py # Define input and output files INPUT_BAM="read2.bam" # Placeholder for "Read2 BAM files" OUTPUT_DIR="clipper_peaks" SPECIES="hg38" # Placeholder for species, e.g., hg38, mm10. Adjust as per experiment. # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run CLIPper to identify peak clusters # CLIPper version 1.2.2 # Note: For eCLIP, it's common to use a control BAM (e.g., input, IgG) with the -c flag. # Example: clipper.py -s "${SPECIES}" -o "${OUTPUT_DIR}" -c "control.bam" "${INPUT_BAM}" clipper.py -s "${SPECIES}" -o "${OUTPUT_DIR}" "${INPUT_BAM}" -
18
Args: --species mm10 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
gene_bam_processor (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# This is a placeholder for a custom script or tool that processes BAM files # based on gene information and saves intermediate data in pickle format. # The specific installation instructions would depend on the actual script/tool. # For example, if it's a Python script: # # conda create -n gene_bam_env python=3.9 # # conda activate gene_bam_env # # pip install pandas numpy # (example dependencies) # Reference genome for mm10 (Mouse) # Source: UCSC Genome Browser (e.g., for genome annotation like GTF/GFF for gene definitions) # MM10_GENOME_ANNOTATION="/path/to/mm10.ncbiRefSeq.gtf" # Placeholder for actual annotation file # Execute the inferred gene_bam_processor # Assuming the script is named 'gene_bam_processor.py' python gene_bam_processor.py \ --species mm10 \ --bam path/to/input.bam \ --timeout 3600000 \ --maxgenes 1000000 \ --save-pickle \ --outfile path/to/output.bam -
19
Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
$ Bash example
# Clone the eCLIP repository if not already available # git clone https://github.com/yeolab/eclip.git # cd eclip/scripts # Example usage of peaksnormalize.pl # Assuming initial_peaks.bed is the input peak file (peak clusters), # ip_read2.bam is the IP read2 BAM file (containing only read2s), # and input_read2.bam is the INPUT read2 BAM file (containing only read2s). # The script will output normalized peak files with the specified prefix (e.g., normalized_peaks.bed). perl peaksnormalize.pl initial_peaks.bed ip_read2.bam input_read2.bam normalized_peaks
-
20
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
$ Bash example
# Install Perl if not available # sudo apt-get update && sudo apt-get install perl # # Clone the eCLIP repository to get the script # git clone https://github.com/yeolab/eclip.git # cd eclip # # Define input and output files (placeholders) REPLICATE1_PEAKS="replicate1_normalized_peaks.bed" REPLICATE2_PEAKS="replicate2_normalized_peaks.bed" OUTPUT_PREFIX="merged_reproducible_peaks" MIN_OVERLAP_REPLICATES=2 # Example: require overlap in at least 2 replicates MIN_LOG2_FOLD_ENRICHMENT=1.0 # Example: require a minimum log2 fold enrichment of 1.0 # Path to the script (assuming eclip repo is cloned and current directory is eclip) SCRIPT_PATH="./scripts/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl" # Execute the script to merge overlapping normalized peak regions perl "${SCRIPT_PATH}" \ "${REPLICATE1_PEAKS}" \ "${REPLICATE2_PEAKS}" \ "${OUTPUT_PREFIX}" \ "${MIN_OVERLAP_REPLICATES}" \ "${MIN_LOG2_FOLD_ENRICHMENT}" -
21
Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
$ Bash example
# Define input files and output prefix REP1_PEAKS="rep1.normalized.peaks.bed" # Placeholder for normalized peak file from replicate 1 REP2_PEAKS="rep2.normalized.peaks.bed" # Placeholder for normalized peak file from replicate 2 OUTPUT_PREFIX="idr_reproducible_peaks" # --- Step 1: Rank peak files by entropy score --- # make_informationcontent_from_peaks.pl is included within the merge_peaks pipeline (https://github.com/yeolab/merge_peaks). # Ensure the script is in your PATH or provide its full path. # Example installation (assuming merge_peaks is cloned): # git clone https://github.com/yeolab/merge_peaks.git # export PATH=$(pwd)/merge_peaks/bin:$PATH echo "Ranking replicate 1 peaks..." make_informationcontent_from_peaks.pl "${REP1_PEAKS}" > "${OUTPUT_PREFIX}.rep1.ranked.bed" echo "Ranking replicate 2 peaks..." make_informationcontent_from_peaks.pl "${REP2_PEAKS}" > "${OUTPUT_PREFIX}.rep2.ranked.bed" # --- Step 2: Run IDR to determine reproducible peaks --- # IDR (Irreproducible Discovery Rate) version 2.0.2 # Installation: # conda create -n idr_env idr=2.0.2 -c bioconda # conda activate idr_env echo "Running IDR (2.0.2) to find reproducible peaks..." idr --input-file-type narrowPeak \ --rank-by signal.value \ --output-file "${OUTPUT_PREFIX}.idr.peaks.bed" \ --plot \ --log-output-file "${OUTPUT_PREFIX}.idr.log" \ --soft-idr-threshold 0.05 \ "${OUTPUT_PREFIX}.rep1.ranked.bed" \ "${OUTPUT_PREFIX}.rep2.ranked.bed" echo "IDR analysis complete. Reproducible peaks saved to ${OUTPUT_PREFIX}.idr.peaks.bed" -
22
Reproducible peaks were annotated by overlapping peak regions with Gencode M10 annotations
$ Bash example
# Install bedtools (if not already installed) # conda install -c bioconda bedtools # Download Gencode M10 annotation GTF file # wget -O gencode.vM10.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M10/gencode.vM10.annotation.gtf.gz # gunzip gencode.vM10.annotation.gtf.gz # Convert Gencode GTF to BED format for bedtools intersection. # This command extracts gene, transcript, and exon features, converting 1-based start to 0-based. # It also extracts gene_id, gene_name, and strand to create a 6-column BED file. awk -F'\t' '($3 == "gene" || $3 == "transcript" || $3 == "exon") { OFS="\t"; split($9, a, "; "); gene_id=""; gene_name=""; for (i=1; i<=length(a); i++) { if (a[i] ~ /gene_id/) { gene_id = substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1); } if (a[i] ~ /gene_name/) { gene_name = substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1); } } print $1, $4-1, $5, gene_id":"gene_name, "0", $7 }' gencode.vM10.annotation.gtf > gencode.vM10.annotation.bed # Placeholder for input reproducible peak regions file (e.g., from IDR or peak calling) # reproducible_peaks.bed # Annotate reproducible peaks by overlapping with Gencode M10 annotations. # -a: input peak file (e.g., reproducible_peaks.bed) # -b: Gencode M10 annotation BED file # -wao: write original entry in A and overlap from B, report 0 for no overlap. # This ensures all peaks are reported, with annotation details appended if an overlap exists. bedtools intersect -a reproducible_peaks.bed -b gencode.vM10.annotation.bed -wao > annotated_reproducible_peaks.bed
Raw Source Text
Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1). Args: --length 10 Reads were then trimmed with cutadapt (1.9.1). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline) Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events. Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline) Trimmed reads were then mapped with STAR (2.4.0i) against a mouse-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 \ --genomeDir mouse_repbase \ --readFilesIn path/to/read1 path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a mouse genome (mm10). Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Aligned reads were sorted with samtools (1.4.1) Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines. Args: -o preRmDup.bam -m metrics_file -b rmDup.bam PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam) Merged alignments were split to keep just read2 using samtools (1.4.1) view. Args: -h -b -f 128 Read2 BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species mm10 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+. Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+ Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks. Reproducible peaks were annotated by overlapping peak regions with Gencode M10 annotations Genome_build: mm10 Supplementary_files_format_and_content: bigWig files; tab-delimited text file includes -log10pValues and Log2FoldChange values for each IDR peaks called after cutoff of 3,3 in each parameter