GSE249245 Processing Pipeline
Publication
Integrated multi-omics analysis of zinc-finger proteins uncovers roles in RNA regulation.Molecular cell (2024) — PMID 39303722
Dataset
GSE249245Integrated multi-omics analysis of zinc finger proteins uncovers roles in RNA regulation [Cut&Run]
Processing Steps
Generate Jupyter Notebook-
1
*library strategy: Cut&Run
$ Bash example
# Install Cromwell (if not already installed) # Download the latest Cromwell JAR from https://github.com/broadinstitute/cromwell/releases # Example: wget https://github.com/broadinstitute/cromwell/releases/download/85/cromwell-85.jar # Make sure Java is installed (e.g., sudo apt install openjdk-11-jre) # Clone the ENCODE ChIP-seq pipeline repository # git clone https://github.com/ENCODE-DCC/chip-seq-pipeline2.git # cd chip-seq-pipeline2 # Define input FASTQ files (replace with actual paths to your Cut&Run data) # Assuming paired-end reads for a single replicate FASTQ_REP1_R1="path/to/your/sample_rep1_R1.fastq.gz" FASTQ_REP1_R2="path/to/your/sample_rep1_R2.fastq.gz" # Define output directory OUTPUT_DIR="cut_run_analysis_output" mkdir -p "$OUTPUT_DIR" # Define reference genome and associated files for hg38 (human) # The ENCODE pipeline requires a 'genome_tsv' file that points to various reference files # (e.g., genome FASTA, Bowtie2 index, chromosome sizes, blacklist regions). # You can find pre-built ENCODE reference files or create your own. # For hg38, you would typically download these from the ENCODE portal or prepare them. # Placeholder paths: Replace with actual paths to your reference files. GENOME_TSV="path/to/your/hg38_encode_genome.tsv" BLACKLIST_BED="path/to/your/hg38_blacklist.bed" # e.g., from ENCODE: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz CHRSZ_FILE="path/to/your/hg38.chrom.sizes" # e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes # Create an inputs JSON file for the ENCODE ChIP-seq pipeline # This configuration is tailored for a Cut&Run experiment (often single replicate, no explicit input control) cat << EOF > inputs.json { "chip.fastqs_rep1_R1": ["${FASTQ_REP1_R1}"], "chip.fastqs_rep1_R2": ["${FASTQ_REP1_R2}"], "chip.paired_end": true, "chip.pipeline_type": "chip", "chip.peak_caller": "macs2", "chip.genome_tsv": "${GENOME_TSV}", "chip.output_prefix": "cut_run_sample", "chip.auto_detect_adapter": true, "chip.enable_idr": false, # IDR requires multiple replicates for reproducibility assessment "chip.blacklist_bed": "${BLACKLIST_BED}", "chip.chrsz": "${CHRSZ_FILE}", "chip.aligner": "bowtie2", # Default for ENCODE ChIP-seq pipeline "chip.macs2_qval": 0.01 # Common q-value threshold for MACS2 } EOF # Execute the ENCODE ChIP-seq pipeline using Cromwell # Ensure 'cromwell.jar' is accessible (e.g., in the current directory or specified with full path) # The main WDL file for the pipeline is typically 'chip.wdl' or 'chip_seq.wdl' # Assuming 'chip.wdl' is in the current directory (e.g., after 'cd chip-seq-pipeline2') java -jar /path/to/cromwell-*.jar run chip.wdl -i inputs.json -o "$OUTPUT_DIR" -
2
Data analysis was performed using a modified version of CUT&RUNTools 2.0.
CUT&RUNTools v2.0$ Bash example
# Example installation (if not already installed via conda or pip) # conda create -n cutruntools python=3.8 # conda activate cutruntools # pip install cutruntools # Example usage of CUT&RUNTools 2.0 with placeholder inputs. # The description mentions a 'modified version', which might involve custom scripts or parameters not reflected here. # This command assumes paired-end sequencing for both sample and control, and uses hg38 as a placeholder genome. # Define input and output paths SAMPLE_FASTQ_R1="path/to/sample_R1.fastq.gz" SAMPLE_FASTQ_R2="path/to/sample_R2.fastq.gz" CONTROL_FASTQ_R1="path/to/control_R1.fastq.gz" CONTROL_FASTQ_R2="path/to/control_R2.fastq.gz" OUTPUT_DIR="./cutrun_analysis_output" REFERENCE_GENOME="hg38" # Placeholder for human genome assembly # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Execute CUT&RUNTools 2.0 python cutruntools.py run \ --fastq1 "${SAMPLE_FASTQ_R1}" \ --fastq2 "${SAMPLE_FASTQ_R2}" \ --control_fastq1 "${CONTROL_FASTQ_R1}" \ --control_fastq2 "${CONTROL_FASTQ_R2}" \ --genome "${REFERENCE_GENOME}" \ --output_dir "${OUTPUT_DIR}" \ --threads 8 # Example: use 8 threads -
3
Adapters were trimmed using Trimmomatic (v0.36), followed by a second round of trimming to remove any remaining adapter overhang sequences not removed due to fragment read-through.
$ Bash example
# Install Trimmomatic (if not already installed) # For example, using conda: # conda install -c bioconda trimmomatic=0.36 # Define input and output file paths INPUT_R1="input_R1.fastq.gz" INPUT_R2="input_R2.fastq.gz" # Intermediate files for the first trimming round TEMP_R1_PAIRED="temp_R1_paired.fastq.gz" TEMP_R1_UNPAIRED="temp_R1_unpaired.fastq.gz" TEMP_R2_PAIRED="temp_R2_paired.fastq.gz" TEMP_R2_UNPAIRED="temp_R2_unpaired.fastq.gz" # Final output files after the second trimming round OUTPUT_R1_PAIRED="output_R1_paired.fastq.gz" OUTPUT_R1_UNPAIRED="output_R1_unpaired.fastq.gz" OUTPUT_R2_PAIRED="output_R2_paired.fastq.gz" OUTPUT_R2_UNPAIRED="output_R2_unpaired.fastq.gz" ADAPTER_FILE="TruSeq3-PE.fa" # Placeholder: Replace with the actual adapter file used (e.g., from Trimmomatic's adapters directory) # Ensure the adapter file exists. This is a common adapter file distributed with Trimmomatic. # You might need to locate it, e.g., in the Trimmomatic installation directory. # Example: cp /path/to/trimmomatic/adapters/TruSeq3-PE.fa . # First round of trimming: Standard adapter removal and quality filtering # ILLUMINACLIP:2:30:10 - 2 seed mismatches, 30 for palindrome clip threshold, 10 for simple clip threshold # LEADING:3 - Remove leading low quality bases (below quality 3) # TRAILING:3 - Remove trailing low quality bases (below quality 3) # SLIDINGWINDOW:4:15 - Scan the read with a 4-base wide window, cutting when the average quality per base drops below 15 # MINLEN:36 - Drop reads shorter than 36 bases after trimming java -jar /path/to/trimmomatic-0.36.jar PE \ -phred33 \ "${INPUT_R1}" "${INPUT_R2}" \ "${TEMP_R1_PAIRED}" "${TEMP_R1_UNPAIRED}" \ "${TEMP_R2_PAIRED}" "${TEMP_R2_UNPAIRED}" \ ILLUMINACLIP:"${ADAPTER_FILE}":2:30:10 \ LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 # Second round of trimming: To remove any remaining adapter overhang sequences not removed due to fragment read-through. # This round uses slightly more aggressive ILLUMINACLIP parameters (e.g., lower simple clip threshold of 5) # to catch very short remaining adapter fragments. java -jar /path/to/trimmomatic-0.36.jar PE \ -phred33 \ "${TEMP_R1_PAIRED}" "${TEMP_R2_PAIRED}" \ "${OUTPUT_R1_PAIRED}" "${OUTPUT_R1_UNPAIRED}" \ "${OUTPUT_R2_PAIRED}" "${OUTPUT_R2_UNPAIRED}" \ ILLUMINACLIP:"${ADAPTER_FILE}":1:20:5 \ LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 # Clean up intermediate files (optional) # rm "${TEMP_R1_PAIRED}" "${TEMP_R1_UNPAIRED}" "${TEMP_R2_PAIRED}" "${TEMP_R2_UNPAIRED}" -
4
Reads were aligned to hg38 using bowtie2 (v2.3.5.1) using preset â--very-sensitive-local,â minimum fragment length 10, and maximum fragment length 700.
$ Bash example
# Install bowtie2 if not already installed # conda install -c bioconda bowtie2=2.3.5.1 # Assume hg38 index is available at /path/to/hg38_index/hg38 # Replace reads_1.fastq and reads_2.fastq with your actual input files (paired-end) # Replace aligned.sam with your desired output file bowtie2 --very-sensitive-local -I 10 -X 700 -x /path/to/hg38_index/hg38 -1 reads_1.fastq -2 reads_2.fastq -S aligned.sam
-
5
After removing PCR duplicates with Picard (v0.1.8), peaks were called using MACS2 (v2.2.7.1) on the default narrowPeak setting using the same-batch V5 mock IP sample as a normalization control.
$ Bash example
# Install MACS2 if not already installed # conda install -c bioconda macs2=2.2.7 # Define input files and parameters TREATMENT_BAM="path/to/deduplicated_treatment.bam" CONTROL_BAM="path/to/deduplicated_V5_mock_IP.bam" # "same-batch V5 mock IP sample as a normalization control" OUTPUT_PREFIX="sample_name_macs2_peaks" OUTPUT_DIR="macs2_output" GENOME_SIZE="hs" # Placeholder for human genome size (e.g., hs for human, mm for mouse) mkdir -p "${OUTPUT_DIR}" # Run MACS2 callpeak with default narrowPeak settings # MACS2 will attempt to build a model for fragment size and shifting. # Input BAMs are assumed to be deduplicated by Picard. # MACS2 will automatically detect the input file format (e.g., BAM, BAMPE). macs2 callpeak \ -t "${TREATMENT_BAM}" \ -c "${CONTROL_BAM}" \ -g "${GENOME_SIZE}" \ -n "${OUTPUT_PREFIX}" \ --outdir "${OUTPUT_DIR}"
Raw Source Text
*library strategy: Cut&Run Data analysis was performed using a modified version of CUT&RUNTools 2.0. Adapters were trimmed using Trimmomatic (v0.36), followed by a second round of trimming to remove any remaining adapter overhang sequences not removed due to fragment read-through. Reads were aligned to hg38 using bowtie2 (v2.3.5.1) using preset â--very-sensitive-local,â minimum fragment length 10, and maximum fragment length 700. After removing PCR duplicates with Picard (v0.1.8), peaks were called using MACS2 (v2.2.7.1) on the default narrowPeak setting using the same-batch V5 mock IP sample as a normalization control. Assembly: hg38 Supplementary files format and content: narrowPeak files (q<0.05) output by MACS2