GSE167466 Processing Pipeline
Publication
Crosstalk between CRISPR-Cas9 and the human transcriptome.Nature communications (2022) — PMID 35236841
Processing Steps
Generate Jupyter Notebook-
1
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).
$ Bash example
# Install UMI-tools if not already installed # conda install -c bioconda umi-tools # Reformat sequenced reads to include randomers (UMIs) in read headers. # This command extracts UMIs from the start of the read (e.g., 8bp) and appends them to the read header. # Adjust --bc-pattern and input/output filenames as per actual data and UMI location. umi_tools extract \ --input input.fastq.gz \ --output output.fastq.gz \ --extract-method=string \ --bc-pattern=NNNNNNNN \ --log umi_tools_extract.log -
2
Args: --random-seed 1 --bc-pattern NNNNNNNNNN
$ Bash example
# Install umi_tools (example using conda) # conda create -n umi_tools_env umi_tools=1.1.2 -c bioconda -c conda-forge # conda activate umi_tools_env # Example usage of umi_tools extract with the provided arguments. # This command assumes input reads are in 'input.fastq.gz' and output will be written to 'output.fastq.gz'. # The --bc-pattern NNNNNNNNNN specifies a 10-nucleotide barcode at the start of the read. # Adjust input/output filenames and other parameters as per your specific pipeline. umi_tools extract \ --random-seed 1 \ --bc-pattern NNNNNNNNNN \ --stdin input.fastq.gz \ --stdout output.fastq.gz
-
3
Reads were then trimmed with cutadapt (1.14).
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=1.14 # Define input and output files INPUT_READS="input_reads.fastq.gz" TRIMMED_READS="trimmed_reads.fastq.gz" # Define adapter sequence (replace with actual adapter sequence if known) # Common Illumina adapters: # TruSeq Universal Adapter: AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT # TruSeq Small RNA 3' Adapter: TGGAATTCTCGGGTGCCAAGGAACTCCAG ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" # Example adapter # Define quality threshold (e.g., Q20) QUALITY_THRESHOLD=20 # Define minimum read length after trimming MIN_LENGTH=20 # Execute cutadapt for trimming cutadapt -a "${ADAPTER_SEQUENCE}" \ -q "${QUALITY_THRESHOLD}" \ -m "${MIN_LENGTH}" \ -o "${TRIMMED_READS}" \ "${INPUT_READS}" -
4
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
$ Bash example
# Install umi_tools if not already available # conda install -c bioconda umi_tools # Download reference adapter file # wget https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA_adapters.fasta # Define input and output files (placeholders for a paired-end eCLIP experiment) INPUT_R1="input_R1.fastq.gz" INPUT_R2="input_R2.fastq.gz" OUTPUT_R1="output_R1.fastq.gz" OUTPUT_R2="output_R2.fastq.gz" ADAPTER_FILE="InvRNA_adapters.fasta" # Execute umi_tools extract command # This step extracts Unique Molecular Identifiers (UMIs) from reads # and places them in the read header, often involving adapter trimming. # The specific patterns for UMI extraction (bc-pattern, read2-pattern) # are inferred from the YeoLab eCLIP CWL workflow, assuming an 18bp UMI # located in Read 2, and using the provided adapter file. umi_tools extract \ --stdin "${INPUT_R1}" \ --stdout "${OUTPUT_R1}" \ --read2-in "${INPUT_R2}" \ --read2-out "${OUTPUT_R2}" \ --extract-method=regex \ --bc-pattern=NNNNNNNNNNNNNNNNNN \ --read2-pattern="^(?P<discard1>.*)(?P<umi_1>.{18})(?P<discard2>.*)" \ --adapter-file "${ADAPTER_FILE}" \ --match-read-wildcards \ --output-encoding=1 \ --error-correct-threshold=1 \ --error-rate=0.1 \ --quality-cutoff=6 \ --min-umi-length=18 -
5
Reads were then trimmed once more with cutadapt (1.14) to remove double-ligation events.
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=1.14 # Example command for trimming double-ligation events. # Replace 'ADAPTER_SEQUENCE' with the actual adapter sequence for double-ligation events. # Replace 'input_R1.fastq.gz', 'input_R2.fastq.gz' with your input files. # Replace 'output_R1_trimmed.fastq.gz', 'output_R2_trimmed.fastq.gz' with your desired output files. # The specific adapter sequence for 'double-ligation events' is assay-dependent and not provided in the description. # A common approach for eCLIP is to remove the 3' adapter, which might be involved in double-ligation. # For example, if the 3' adapter is 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA', it might be used with -a or -b. cutadapt -a ADAPTER_SEQUENCE -A ADAPTER_SEQUENCE \ -o output_R1_trimmed.fastq.gz -p output_R2_trimmed.fastq.gz \ input_R1.fastq.gz input_R2.fastq.gz -
6
Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
eCLIP vNot explicitly versioned, inferred from `clipper` repository's commit history (main logic stable since ~2019) GitHub$ Bash example
# Installation (if not already installed) # It is recommended to install clipper in a dedicated conda environment: # conda create -n clipper_env python=3.8 # conda activate clipper_env # git clone https://github.com/yeolab/clipper.git # cd clipper # pip install -r requirements.txt # # Ensure the clipper.py script is executable or called with python # # For example, add the clipper directory to your PATH: # # export PATH=$(pwd):$PATH # # Or call directly: python /path/to/clipper/clipper.py # Download reference data specified in the description # The description mentions 'InvRNA*.fasta', assuming 'InvRNA.fasta' from the example inputs. wget -nc https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA.fasta # Placeholder for input BAM files and output prefix, as they were not provided in the description. # Replace these with your actual aligned read (input) and control BAM files. INPUT_BAM="path/to/your/input.bam" CONTROL_BAM="path/to/your/control.bam" # Optional, but recommended for eCLIP peak calling OUTPUT_PREFIX="eclip_peaks_output" SPECIES="hg38" # Example species, replace with your actual species (e.g., mm10 for mouse) # Execute the clipper peak calling command # Assuming clipper.py is in your PATH or called with its full path clipper.py \ -s "${SPECIES}" \ -o "${OUTPUT_PREFIX}" \ "${INPUT_BAM}" \ "${CONTROL_BAM}" \ --match-read-wildcards \ -O 5 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -a InvRNA.fasta -
7
Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # --- Reference Data Preparation (RepBase 18.05) --- # Placeholder for RepBase 18.05 FASTA file. # You would typically download this from a source like GIRI (Genetic Information Research Institute). # Example: wget -O RepBase18.05.fasta.gz "http://www.girinst.org/repbase/update/RepBase18.05.fasta.gz" # gunzip RepBase18.05.fasta.gz REPBASE_FASTA="RepBase18.05.fasta" STAR_INDEX_DIR="RepBase_STAR_index" # Create STAR genome index for RepBase mkdir -p "${STAR_INDEX_DIR}" STAR --runMode genomeGenerate \ --genomeDir "${STAR_INDEX_DIR}" \ --genomeFastaFiles "${REPBASE_FASTA}" \ --genomeSAindexNbases 10 # Adjust based on genome size; 10 is suitable for smaller genomes/databases # --- Mapping Trimmed Reads with STAR --- # Input trimmed reads file TRIMMED_READS="trimmed_reads.fastq.gz" # Output prefix for alignment files OUTPUT_PREFIX="mapped_to_repbase" # Run STAR alignment STAR --runMode alignReads \ --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${TRIMMED_READS}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}." \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 3 \ --outFilterScoreMinOverLread 0.66 \ --outFilterMatchNminOverLread 0.66 \ --outReadsUnmapped Fastx \ --runThreadN 8 # Adjust number of threads as needed -
8
Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR if not already available # conda install -c bioconda star # Define variables for clarity # The genome directory should contain a STAR index built from a human reference genome (e.g., GRCh38) # and potentially repeat sequences (e.g., from RepBase) as indicated by 'human_repbase'. GENOME_DIR="human_repbase" READ_FILE_1="path/to/read1.fastq.gz" # Placeholder for input reads OUT_PREFIX="out_prefix" # Prefix for output files # Execute STAR alignment STAR \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ_FILE_1}" \ --outFileNamePrefix "${OUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
9
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (GRCh38).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.4.0i # Define variables INPUT_READS="unmapped_filtered_reads.fastq.gz" # Placeholder for the input reads file STAR_INDEX_DIR="/path/to/STAR_index/GRCh38" # Placeholder for the STAR genome index directory OUTPUT_PREFIX="aligned_reads" # Prefix for output files NUM_THREADS=8 # Placeholder for number of threads # Run STAR alignment STAR --runThreadN ${NUM_THREADS} \ --genomeDir ${STAR_INDEX_DIR} \ --readFilesIn ${INPUT_READS} \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outReadsUnmapped Fastx \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 10 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 -
10
Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables # Replace with your actual paths and filenames GENOME_DIR="/path/to/STAR_index/GRCh38" # Example: STAR genome index directory for human GRCh38 READ1_FILE="/path/to/your/sample_R1.fastq.gz" # Path to your input FASTQ file (read 1) OUTPUT_PREFIX="my_sample_aligned" # Prefix for all output files # Note: The genome index (GENOME_DIR) must be pre-built using STAR's genomeGenerate command. # Example command for genome generation (run once per genome, adjust parameters as needed): # STAR --runThreadN 16 --runMode genomeGenerate \ # --genomeDir "${GENOME_DIR}" \ # --genomeFastaFiles /path/to/GRCh38.primary_assembly.fa \ # --sjdbGTFfile /path/to/gencode.v38.annotation.gtf \ # --sjdbOverhang 100 # Recommended: (ReadLength - 1) # Align reads using STAR STAR --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1_FILE}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
11
Aligned reads were sorted with samtools (1.6)
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.6 # Sort aligned reads samtools sort -o sorted_reads.bam aligned_reads.bam
-
12
Sorted reads were collapsed with umi_tools (1.0.0).
$ Bash example
# Install umi_tools if not already installed # conda install -c bioconda umi-tools # Placeholder for input sorted BAM file (e.g., from alignment) INPUT_BAM="sorted_aligned_reads.bam" # Placeholder for output collapsed BAM file OUTPUT_BAM="collapsed_reads.bam" # Placeholder for log file LOG_FILE="umi_tools_collapse.log" # Collapse sorted reads using umi_tools # The 'collapse' command groups reads with identical UMIs and mapping positions. # The '--method directional' is a common default for collapsing, considering strand information. # The '--umi-separator' might be needed if UMIs are not in the read name (e.g., in a separate tag). # Assuming UMIs are already extracted and present in the read names or a BAM tag. umi_tools collapse \ -I "${INPUT_BAM}" \ -S "${OUTPUT_BAM}" \ --log "${LOG_FILE}" \ --log-level INFO \ --method directional -
13
Args: --random-seed 1 --method unique
(Inferred with models/gemini-2.5-flash)$ Bash example
some_tool --random-seed 1 --method unique
-
14
BAM files were used to identify peak clusters with Clipper (1.2.2).
$ Bash example
# Define input and output files INPUT_BAM="sample_ip.bam" # Placeholder for your IP BAM file CONTROL_BAM="sample_input.bam" # Placeholder for your control/input BAM file OUTPUT_DIR="clipper_peaks_output" GENOME_SIZE_FILE="hg38.chrom.sizes" # Placeholder for human hg38 genome sizes # --- Installation (commented out) --- # It is recommended to run CLIPper within a dedicated Python environment. # You can clone the repository and install dependencies: # git clone https://github.com/yeolab/clipper.git # cd clipper # pip install -r requirements.txt # Or, if available via pip: # pip install clipper # --- Reference Data (commented out) --- # Download hg38.chrom.sizes if not available. # This file is crucial for CLIPper to determine chromosome lengths. # mkdir -p references # wget -O references/${GENOME_SIZE_FILE} http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes # GENOME_SIZE_FILE="references/${GENOME_SIZE_FILE}" # Update path if downloaded # Create output directory mkdir -p "${OUTPUT_DIR}" # Execute CLIPper to identify peak clusters # Parameters: # -b: Input BAM file (IP sample) # -c: Control BAM file (e.g., size-matched input or IgG control) # -g: Genome size file (e.g., hg38.chrom.sizes) # -o: Output directory for results # -p: Peak type (e.g., 'CLIP' for eCLIP, 'ChIP' for ChIP-seq) # -f: False Discovery Rate (FDR) threshold for peak calling (e.g., 0.05) # Note: Other parameters like -s (size factor file) might be relevant depending on experimental design. python clipper.py \ -b "${INPUT_BAM}" \ -c "${CONTROL_BAM}" \ -g "${GENOME_SIZE_FILE}" \ -o "${OUTPUT_DIR}" \ -p CLIP \ -f 0.05 -
15
Args: --species GRCh38 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
$ Bash example
# This command executes a Python-based script or tool for processing BAM files. # It likely involves operations related to gene features, given the '--maxgenes' argument, # and saves intermediate data using Python's pickle format. # The '--species GRCh38' argument indicates that a GRCh38 genome annotation file (e.g., GTF/GFF) # would be a required reference dataset for this tool. For GRCh38, common sources include Ensembl or GENCODE. # Please replace 'process_bam_script.py' with the actual script or tool name. # Also, ensure the GRCh38 annotation file (e.g., 'path/to/GRCh38.gtf') is accessible if required by the script. process_bam_script.py \ --species GRCh38 \ --bam path/to/input.bam \ --timeout 3600000 \ --maxgenes 1000000 \ --save-pickle \ --outfile path/to/output.bam -
16
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
$ Bash example
# Clone the eCLIP repository if not already available # git clone https://github.com/yeolab/eclip.git # export PATH=$PATH:/path/to/eclip/bin # Define input files (placeholders) # Replace with your actual peak file (e.g., BED or narrowPeak format) PEAK_FILE="your_peak_clusters.bed" # Replace with your actual IP BAM file IP_BAM="your_ip_sample.bam" # Replace with your actual INPUT BAM file INPUT_BAM="your_input_sample.bam" # Define an output prefix for the normalized peak file OUTPUT_PREFIX="normalized_peak_clusters" # Execute the peaksnormalize.pl script to normalize peak clusters # The script takes a peak file, an IP BAM file, an INPUT BAM file, and an output prefix. peaksnormalize.pl "${PEAK_FILE}" "${IP_BAM}" "${INPUT_BAM}" "${OUTPUT_PREFIX}" -
17
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
$ Bash example
# Install Perl (if not already available) # conda install -c bioconda perl # Clone the merge_peaks repository which contains the script # git clone https://github.com/yeolab/merge_peaks.git # cd merge_peaks # Example usage: Merge overlapping normalized peak regions from replicates # Replace normalized_peaks_repX.bed with your actual input peak files. # The script typically takes multiple input BED files and outputs a merged BED file. perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \ normalized_peaks_rep1.bed \ normalized_peaks_rep2.bed \ normalized_peaks_rep3.bed \ > merged_replicate_overlapping_peaks.bed -
18
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
$ Bash example
# Assuming replicate1.compressed.bed and replicate2.compressed.bed are the normalized peak files # and we need to rank them by entropy score. # make_informationcontent_from_peaks.pl is included within the merge_peaks pipeline. # Installation for merge_peaks (containing make_informationcontent_from_peaks.pl): # git clone https://github.com/yeolab/merge_peaks.git # export PATH=$PATH:/path/to/merge_peaks/scripts # Rank replicate 1 peaks by entropy score make_informationcontent_from_peaks.pl replicate1.compressed.bed replicate1.ranked.bed # Rank replicate 2 peaks by entropy score make_informationcontent_from_peaks.pl replicate2.compressed.bed replicate2.ranked.bed # IDR installation: # conda install -c bioconda idr=2.0.2 # Run IDR with the ranked peak files to determine reproducible peaks. # The --rank signal.value parameter is common in eCLIP IDR pipelines, referring to the score column # (which is the entropy score in this case, output by make_informationcontent_from_peaks.pl). idr --samples replicate1.ranked.bed replicate2.ranked.bed \ --output-file-prefix idr_reproducible_peaks \ --rank signal.value \ --plot \ --log-output-file idr.log
Raw Source Text
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0). Args: --random-seed 1 --bc-pattern NNNNNNNNNN Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/) Reads were then trimmed once more with cutadapt (1.14) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/) Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (GRCh38). Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Aligned reads were sorted with samtools (1.6) Sorted reads were collapsed with umi_tools (1.0.0). Args: --random-seed 1 --method unique BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species GRCh38 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+. Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+ Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks. Genome_build: GRCh38