GSE127944 Processing Pipeline
Publication
In Vivo Screening Unveils Pervasive RNA-Binding Protein Dependencies in Leukemic Stem Cells and Identifies ELAVL1 as a Therapeutic Target.Blood cancer discovery (2023) — PMID 36763002
Dataset
GSE127944In vivo CRISPR screening unveils RNA binding protein dependencies for leukemic stem cells and identifies ELAVL1 as a potential therapeutic target [eC…
Processing Steps
Generate Jupyter Notebook-
1
Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1).
$ Bash example
# Install eclipdemux (assuming it's part of the yeolab/eclip repository) # git clone https://github.com/yeolab/eclip.git # cd eclip/scripts # Ensure Python environment is set up. # Define input and output files INPUT_FASTQ="input_reads.fastq.gz" OUTPUT_PREFIX="demuxed_reads" # Define barcode and randomer lengths (example values, adjust as per experimental design) # The description implies both inline barcodes are removed and randomers are added to headers. BARCODE_LENGTH=6 # Example: length of the inline barcode to be removed RANDOMER_LENGTH=10 # Example: length of the randomer to be extracted and added to header # Execute eclipdemux to remove inline barcodes and add randomers to read headers # The --gzip flag is used assuming input and output files are gzipped. python eclipdemux.py \ -i "${INPUT_FASTQ}" \ -o "${OUTPUT_PREFIX}" \ --barcode-length "${BARCODE_LENGTH}" \ --randomer-length "${RANDOMER_LENGTH}" \ --gzip -
2
Args: --length 10
Unknown (Inferred with models/gemini-2.5-flash) vUnknown$ Bash example
your_tool_here --length 10
-
3
Reads were then trimmed with cutadapt (1.9.1).
$ Bash example
# Install cutadapt (e.g., using conda) # conda install -c bioconda cutadapt=1.9.1 # Define input and output file names (placeholders) # Replace with actual file paths INPUT_READS_R1="input_reads_R1.fastq.gz" INPUT_READS_R2="input_reads_R2.fastq.gz" OUTPUT_TRIMMED_R1="trimmed_reads_R1.fastq.gz" OUTPUT_TRIMMED_R2="trimmed_reads_R2.fastq.gz" # Define common Illumina adapter sequences (placeholders) # Adjust these based on the specific library preparation kit used # Example: Illumina Universal Adapter (3' end of Read 1) ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example: Illumina Small RNA 3' Adapter (3' end of Read 2, reverse complement of Read 1 adapter if same) ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Trim reads with cutadapt # -a: 3' adapter for R1 # -A: 3' adapter for R2 # -q: Quality trimming (e.g., trim bases with quality < 20 from 5' and 3' ends) # -m: Minimum read length after trimming (e.g., 20 bp) # -o: Output file for R1 # -p: Output file for R2 cutadapt -a "${ADAPTER_R1}" -A "${ADAPTER_R2}" \ -q 20 -m 20 \ -o "${OUTPUT_TRIMMED_R1}" -p "${OUTPUT_TRIMMED_R2}" \ "${INPUT_READS_R1}" "${INPUT_READS_R2}" -
4
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
eCLIP v0.1.5$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt # This step uses cutadapt for adapter trimming, with adapter sequences # (g_adapters.fasta, A_adapters.fasta, a_adapters.fasta) that are typically # generated by a preceding parsebarcodes.sh step within the eCLIP pipeline. # Input and output filenames are placeholders as they were not specified in the description. cutadapt \ --match-read-wildcards \ -O 1 \ --times 1 \ -e 0.1 \ -q 6 \ -m 18 \ -g g_adapters.fasta \ -A A_adapters.fasta \ -a a_adapters.fasta \ -o trimmed_reads.fastq.gz \ input_reads.fastq.gz
-
5
Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events.
cutadapt v1.9.1$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt=1.9.1 # Define adapter sequences from eCLIP workflow (yeolab/eclip) ADAPTER_A="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Illumina TruSeq Universal Adapter ADAPTER_B="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Illumina TruSeq Read 2 Adapter # Define input and output filenames INPUT_READ1="input_R1.fastq.gz" INPUT_READ2="input_R2.fastq.gz" OUTPUT_READ1="trimmed_R1.fastq.gz" OUTPUT_READ2="trimmed_R2.fastq.gz" # Define number of threads N_THREADS=8 # Trim reads to remove adapter sequences and potential double-ligation events # -a: 3' adapter for R1 # -A: 3' adapter for R2 # -O 3: Minimum overlap of 3 bases required to trim an adapter # -m 18: Discard reads shorter than 18 bp after trimming # --cores: Number of CPU cores to use cutadapt -a "${ADAPTER_A}" \ -A "${ADAPTER_B}" \ -O 3 \ -m 18 \ --cores "${N_THREADS}" \ -o "${OUTPUT_READ1}" \ -p "${OUTPUT_READ2}" \ "${INPUT_READ1}" \ "${INPUT_READ2}" -
6
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=1.18 # Placeholder for input and output files: # A_adapters.fasta: Fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline. # This file contains the adapter sequences to be trimmed. # input_reads.fastq: Raw sequencing reads (e.g., from a single-end eCLIP experiment). # trimmed_reads.fastq: Output file with trimmed reads. # Note on '-A A_adapters.fasta' in description vs. '-a file:A_adapters.fasta' in code: # The eCLIP pipeline's CWL workflow typically uses '-a file:ADAPTERS.fasta' for 3' adapter trimming # from a FASTA file on the first read. While the description uses '-A', which is for the second read # in paired-end data, we infer '-a file:' based on common eCLIP pipeline implementation for single-end # or primary read trimming, and the standard cutadapt syntax for reading adapters from a file. cutadapt \ --match-read-wildcards \ -O 1 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -a file:A_adapters.fasta \ -o trimmed_reads.fastq \ input_reads.fastq
-
7
Trimmed reads were then mapped with STAR (2.4.0i) against a mouse-specific repeat element database (RepBase 18.05).
$ Bash example
# Install STAR if not already available # conda install -c bioconda star=2.4.0i # Define paths and parameters REPBASE_FASTA="RepBase18.05_mouse.fasta" # Placeholder for the mouse-specific RepBase FASTA file STAR_INDEX_DIR="star_repbase_index" TRIMMED_READS="trimmed_reads.fastq.gz" # Placeholder for your trimmed reads file (e.g., single-end) # For paired-end reads, use: # TRIMMED_READS_R1="trimmed_reads_R1.fastq.gz" # TRIMMED_READS_R2="trimmed_reads_R2.fastq.gz" # And modify --readFilesIn accordingly: --readFilesIn "${TRIMMED_READS_R1}" "${TRIMMED_READS_R2}" OUTPUT_PREFIX="repbase_mapping_output" NUM_THREADS=8 # Example number of threads # 1. Create STAR genome index for the RepBase database mkdir -p "${STAR_INDEX_DIR}" # Note: For repeat element databases, splicing is not relevant, so sjdbOverhang is omitted. # The genome is essentially the collection of repeat sequences. STAR --runMode genomeGenerate \ --genomeDir "${STAR_INDEX_DIR}" \ --genomeFastaFiles "${REPBASE_FASTA}" \ --runThreadN "${NUM_THREADS}" # 2. Map trimmed reads to the RepBase index # For repeat element mapping, it's common to allow more multimapping and disable splicing. STAR --runThreadN "${NUM_THREADS}" \ --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${TRIMMED_READS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outBAMcompression 6 \ --alignIntronMax 1 \ --outFilterMultimapNmax 100 \ --outFilterScoreMinOverLread 0.6 \ --outFilterMatchNminOverLread 0.6 -
8
Args: --runThreadN 16 \ --genomeDir mouse_repbase \ --readFilesIn path/to/read1 path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables # Placeholder for mouse genome index. For eCLIP, this would typically be a STAR index built from a mouse reference genome (e.g., mm10/GRCm38) potentially with repetitive elements (like from RepeatMasker) added to the FASTA for indexing. GENOME_DIR="/path/to/STAR_genome_indices/mouse_mm10_repbase" READ1="path/to/read1" READ2="path/to/read2" OUT_PREFIX="out_prefix" THREADS=16 # Run STAR alignment STAR \ --runThreadN "${THREADS}" \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --outFileNamePrefix "${OUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
9
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a mouse genome (mm9).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.4.0i # Define reference genome path for mm9 # The mm9 genome FASTA files can be downloaded from UCSC (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz). # A corresponding GTF/GFF annotation file (e.g., from Ensembl or UCSC refGene) is also recommended for splice-aware alignment. GENOME_DIR="/path/to/STAR_index/mm9" # Example: Build STAR index for mm9 (if not already built) # Replace 'Mus_musculus.mm9.fa' and 'Mus_musculus.mm9.gtf' with your actual file paths. # Adjust --sjdbOverhang based on your read length (typically ReadLength - 1). # STAR --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles Mus_musculus.mm9.fa \ # --sjdbGTFfile Mus_musculus.mm9.gtf \ # --sjdbOverhang 100 \ # --runThreadN 8 # Input reads (assuming unmapped reads filtered of repeat elements are in this file) INPUT_READS="unmapped_filtered_reads.fastq.gz" OUTPUT_PREFIX="mapped_star_mm9_" NUM_THREADS=8 # Adjust based on available resources STAR --runMode alignReads \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${INPUT_READS} \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --runThreadN ${NUM_THREADS} -
10
Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables for input/output and reference # For RNA-based assays, a splice-aware aligner like STAR requires a pre-built genome index. # The latest human reference genome assembly is GRCh38. # Example: Download or build STAR index for GRCh38 (e.g., from Gencode or ENCODE project). GENOME_DIR="/path/to/STAR_genome_index/GRCh38" READ1="/path/to/sample_R1.fastq.gz" READ2="/path/to/sample_R2.fastq.gz" OUTPUT_PREFIX="sample_aligned" # Run STAR alignment STAR \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
11
Aligned reads were sorted with samtools (1.4.1)
$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools=1.4.1 # Sort aligned reads by coordinate # Replace 'input.bam' with the actual path to your aligned BAM file # Replace 'sorted_reads.bam' with the desired output path for the sorted BAM file samtools sort -o sorted_reads.bam input.bam
-
12
Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines.
$ Bash example
# Install eclip (assuming Python 3 environment) # It's recommended to install eclip within a conda environment # conda create -n eclip_env python=3.8 # conda activate eclip_env # pip install eclip==0.1.5 # If available on PyPI, otherwise clone and install from source # If installing from source: # git clone https://github.com/yeolab/eclip.git # cd eclip # pip install . # Ensure barcodecollapsepe.py is in your PATH or call it directly, e.g., python3 /path/to/eclip/tools/barcodecollapsepe.py # Define input and output files INPUT_R1="sorted_reads_R1.fastq.gz" INPUT_R2="sorted_reads_R2.fastq.gz" OUTPUT_R1_PREFIX="collapsed_reads_R1" OUTPUT_R2_PREFIX="collapsed_reads_R2" # Decompress input files for processing gunzip -c "${INPUT_R1}" > "${INPUT_R1%.gz}" gunzip -c "${INPUT_R2}" > "${INPUT_R2%.gz}" # Execute barcodecollapsepe.py to collapse reads # Using default parameters from eclip CWL workflow (min_qual, min_read_qual, min_seq_qual) python3 /path/to/eclip/tools/barcodecollapsepe.py \ -i "${INPUT_R1%.gz}" \ -j "${INPUT_R2%.gz}" \ -o "${OUTPUT_R1_PREFIX}.fastq" \ -p "${OUTPUT_R2_PREFIX}.fastq" \ -m 20 -q 20 -s 20 # Recompress output files gzip "${OUTPUT_R1_PREFIX}.fastq" gzip "${OUTPUT_R2_PREFIX}.fastq" # Clean up uncompressed intermediate files (optional) # rm "${INPUT_R1%.gz}" "${INPUT_R2%.gz}" -
13
Args: -o preRmDup.bam -m metrics_file -b rmDup.bam
$ Bash example
# Install Picard (e.g., via Conda) # conda create -n picard_env picard -c bioconda -c conda-forge # conda activate picard_env # Define variables for clarity, mapping description's arguments to Picard's standard flags # -b rmDup.bam -> Input BAM file # -o preRmDup.bam -> Output BAM file (with duplicates marked) # -m metrics_file -> Output metrics file INPUT_BAM="rmDup.bam" OUTPUT_BAM="preRmDup.bam" METRICS_FILE="metrics_file" # Execute Picard MarkDuplicates to identify and mark duplicate reads in a BAM file. # ASSUME_SORTED=true is often used as input BAMs are typically sorted by coordinate. # VALIDATION_STRINGENCY=SILENT is used to suppress verbose validation warnings. picard MarkDuplicates \ I="${INPUT_BAM}" \ O="${OUTPUT_BAM}" \ M="${METRICS_FILE}" \ ASSUME_SORTED=true \ VALIDATION_STRINGENCY=SILENT -
14
PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools=1.4.1 # Merge PCR de-duped reads from each inline barcode # Replace input_deduped_barcode_1.bam, input_deduped_barcode_2.bam with your actual input files samtools merge merged.bam input_deduped_barcode_1.bam input_deduped_barcode_2.bam
-
15
Merged alignments were split to keep just read2 using samtools (1.4.1) view.
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools=1.4.1 # Define input and output file names INPUT_BAM="merged_alignments.bam" OUTPUT_BAM="read2_alignments.bam" # Split merged alignments to keep just read2 using samtools view # -b: output BAM format # -f 0x80: require all of these flags to be present (0x80 = second in pair) samtools view -b -f 0x80 "${INPUT_BAM}" > "${OUTPUT_BAM}" -
16
Args: -h -b -f 128
Unknown (Inferred with models/gemini-2.5-flash) vUnknown$ Bash example
# The specific tool could not be inferred from the provided description and arguments. # Please specify the tool to generate a more accurate command. # Example placeholder command: unknown_tool -h -b -f 128
-
17
Read2 BAM files were used to identify peak clusters with Clipper (1.2.2).
$ Bash example
# Install CLIPper (example, specific version might require cloning the repo or a specific conda/pip package) # conda install -c bioconda clipper # Define input and output files INPUT_BAM="read2.bam" # Placeholder for "Read2 BAM files" OUTPUT_PREFIX="peak_clusters" GENOME_SIZES="hg38.chrom.sizes" # Placeholder for genome size file, often required by CLIPper # Ensure the genome size file exists or provide a path to it. # Example: Download from UCSC or ENCODE # wget -O ${GENOME_SIZES} http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes # Run CLIPper to identify peak clusters # Using common default parameters as none were specified in the description. # -b: Input BAM file # -o: Output prefix for peak files # -s: Genome size file # -p: P-value threshold (example: 0.01) # -f: FDR threshold (example: 0.05) # -w: Window size (example: 20) python clipper.py -b "${INPUT_BAM}" -o "${OUTPUT_PREFIX}" -s "${GENOME_SIZES}" -p 0.01 -f 0.05 -w 20 -
18
Args: --species mm9 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
count_reads_in_genes.py (Inferred with models/gemini-2.5-flash) vPart of yeolab/eclip workflow GitHub$ Bash example
# Installation instructions: # This script is typically part of the yeolab/eclip repository. # git clone https://github.com/yeolab/eclip.git # cd eclip/scripts # # Ensure Python dependencies are met, e.g., pysam, pandas, numpy # # conda create -n eclip_env python=3.8 # # conda activate eclip_env # # pip install pysam pandas numpy # Define input/output paths and reference data INPUT_BAM="path/to/input.bam" OUTPUT_FILE="path/to/output.bam" # As specified in the description SPECIES="mm9" # Reference GTF file for mm9 is required by count_reads_in_genes.py for gene definitions. # Placeholder: Replace with actual path to mm9 GTF file (e.g., from UCSC or Ensembl). # Example: Download from UCSC Genome Browser, Table Browser, group: Genes and Gene Predictions, track: NCBI RefSeq, output format: GTF MM9_GTF="/path/to/mm9.ncbiRefSeq.gtf" # Execute the tool # Note: The original count_reads_in_genes.py script from yeolab/eclip uses '--output-file' for text output # and '--pickle-file' for pickle output, and does not directly output a BAM file. # The '--outfile path/to/output.bam' argument from the description might imply a modified script, # a wrapper script, or a subsequent conversion step in the pipeline. # This command uses the argument names exactly as provided in the description. python count_reads_in_genes.py \ --species "${SPECIES}" \ --bam "${INPUT_BAM}" \ --timeout 3600000 \ --maxgenes 1000000 \ --save-pickle \ --outfile "${OUTPUT_FILE}" \ --gtf "${MM9_GTF}" # This argument is inferred as necessary for the tool's functionality with gene definitions -
19
Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
$ Bash example
# Install eCLIP pipeline (if not already installed) # git clone https://github.com/yeolab/eclip.git # export PATH="/path/to/eclip/bin:$PATH" # Ensure Perl dependencies are met (e.g., Bio::DB::Sam, Statistics::R, etc.) # Define input files based on description IP_READ2_BAM="ip_sample_read2.bam" INPUT_READ2_BAM="input_sample_read2.bam" PEAK_CLUSTERS_FILE="peak_clusters.narrowPeak" # Assuming peak clusters are in narrowPeak format OUTPUT_PREFIX="peak_clusters_normalized" # Execute the normalization script peaksnormalize.pl "${PEAK_CLUSTERS_FILE}" "${IP_READ2_BAM}" "${INPUT_READ2_BAM}" "${OUTPUT_PREFIX}" -
20
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
$ Bash example
# Clone the eCLIP repository to obtain the script # git clone https://github.com/yeolab/eclip.git # cd eclip/scripts # Ensure Perl is installed on your system # sudo apt-get update && sudo apt-get install -y perl # Define input peak files (example names for normalized peak regions from replicates) INPUT_PEAK_FILE_REP1="sample_rep1_normalized_peaks.bed" INPUT_PEAK_FILE_REP2="sample_rep2_normalized_peaks.bed" # Define the output file for merged peaks OUTPUT_MERGED_PEAKS="sample_merged_replicate_peaks.bed" # Define the minimum number of replicates required for a peak to be considered reproducible # This is a common parameter for merging replicate peaks to ensure reproducibility. MIN_REPLICATES=2 # Execute the merging script # The script `compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl` takes multiple BED files # as input and merges overlapping regions across replicates. # -c: Specifies the minimum number of replicates that must overlap for a peak to be included in the output. # Other parameters like -f (fold enrichment column), -p (p-value column), -s (score column) # can be specified if the input BED files deviate from the script's default column assignments (f=7, p=8, s=5). perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \ -c "${MIN_REPLICATES}" \ "${INPUT_PEAK_FILE_REP1}" \ "${INPUT_PEAK_FILE_REP2}" \ > "${OUTPUT_MERGED_PEAKS}" -
21
Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
$ Bash example
# Install IDR (if not already installed) # conda install -c bioconda idr=2.0.2 # Define input files (placeholders) # The description mentions "Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline)" # This implies that the input files to IDR are the output of make_informationcontent_from_peaks.pl, which are typically ranked peak files. # Assuming two replicate peak files (e.g., from MACS2 or similar peak caller, then processed by make_informationcontent_from_peaks.pl) INPUT_PEAKS_REP1="rep1.ranked_peaks.narrowPeak" INPUT_PEAKS_REP2="rep2.ranked_peaks.narrowPeak" # Define output prefix OUTPUT_PREFIX="reproducible_peaks" # Run IDR # The merge_peaks pipeline (https://github.com/yeolab/merge_peaks) typically uses a soft-idr-threshold of 0.05 and ranks by p.value. # The input file type is usually narrowPeak for eCLIP. idr \ --input-file-type narrowPeak \ --rank p.value \ --output-file "${OUTPUT_PREFIX}.idr.out" \ --plot \ --log-output-file "${OUTPUT_PREFIX}.idr.log" \ --soft-idr-threshold 0.05 \ "${INPUT_PEAKS_REP1}" \ "${INPUT_PEAKS_REP2}" -
22
Reproducible peaks were annotated by overlapping peak regions with Gencode vM1 annotations
$ Bash example
# Install bedtools if not already installed # conda install -c bioconda bedtools # Download Gencode vM1 mouse annotations (GTF format) # Note: vM1 is a very old version. The latest mouse release is vM33. # The exact FTP path for vM1 might be in an archive or an older release directory. # This is a placeholder for where the file would be downloaded from. # For example, a path might look like: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M1/gencode.vM1.annotation.gtf.gz # wget -O gencode.vM1.annotation.gtf.gz "<GENCODE_VM1_GTF_URL>" # gunzip gencode.vM1.annotation.gtf.gz # Convert Gencode GTF to BED format for all relevant features (e.g., genes, transcripts, exons). # This step is crucial for robust interval operations with bedtools. # A dedicated tool like 'gtf2bed' or a custom script is often used. # Example using awk to extract all features as basic BED entries (simplified): # awk -F'\t' '($1 ~ /^chr/ || $1 ~ /^[0-9XYM]/) && $3 != "gene" && $3 != "transcript" {print $1"\t"$4-1"\t"$5"\t"$3":"$9"\t.\t"$7}' gencode.vM1.annotation.gtf | sed 's/gene_id "//;s/"; transcript_id "/\t/;s/"; exon_number "/\t/;s/";/g' > gencode.vM1.annotation.bed # For this example, we assume 'gencode.vM1.annotation.bed' is already prepared. # Assuming 'reproducible_peaks.bed' contains the reproducible peak regions # and 'gencode.vM1.annotation.bed' is a BED file derived from the Gencode vM1 GTF, # containing the genomic regions of interest for annotation. # Overlap reproducible peaks with Gencode vM1 annotations # -a: reproducible peaks (input A) # -b: Gencode vM1 annotations (input B) # -wao: Write the original entry in A and the original entry in B, plus the number of base pairs of overlap. # If there is no overlap, a "0" is reported for the overlap. bedtools intersect -a reproducible_peaks.bed -b gencode.vM1.annotation.bed -wao > reproducible_peaks_annotated.bed
Raw Source Text
Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1). Args: --length 10 Reads were then trimmed with cutadapt (1.9.1). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline) Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events. Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline) Trimmed reads were then mapped with STAR (2.4.0i) against a mouse-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 \ --genomeDir mouse_repbase \ --readFilesIn path/to/read1 path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a mouse genome (mm9). Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Aligned reads were sorted with samtools (1.4.1) Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines. Args: -o preRmDup.bam -m metrics_file -b rmDup.bam PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam) Merged alignments were split to keep just read2 using samtools (1.4.1) view. Args: -h -b -f 128 Read2 BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species mm9 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+. Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+ Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks. Reproducible peaks were annotated by overlapping peak regions with Gencode vM1 annotations Genome_build: mm9 Supplementary_files_format_and_content: tab-delimited text files include -log10pValues and Log2FoldChange values for each IDR peaks called after cutoff of 3,3 in each parameter