GSE249245 Processing Pipeline

OTHER code_examples 5 steps

Publication

Integrated multi-omics analysis of zinc-finger proteins uncovers roles in RNA regulation.

Molecular cell (2024) — PMID 39303722

Dataset

GSE249245

Integrated multi-omics analysis of zinc finger proteins uncovers roles in RNA regulation [Cut&Run]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    *library strategy: Cut&Run

    ChIP-seq vlatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install Cromwell (if not already installed)
    # Download the latest Cromwell JAR from https://github.com/broadinstitute/cromwell/releases
    # Example: wget https://github.com/broadinstitute/cromwell/releases/download/85/cromwell-85.jar
    # Make sure Java is installed (e.g., sudo apt install openjdk-11-jre)
    
    # Clone the ENCODE ChIP-seq pipeline repository
    # git clone https://github.com/ENCODE-DCC/chip-seq-pipeline2.git
    # cd chip-seq-pipeline2
    
    # Define input FASTQ files (replace with actual paths to your Cut&Run data)
    # Assuming paired-end reads for a single replicate
    FASTQ_REP1_R1="path/to/your/sample_rep1_R1.fastq.gz"
    FASTQ_REP1_R2="path/to/your/sample_rep1_R2.fastq.gz"
    
    # Define output directory
    OUTPUT_DIR="cut_run_analysis_output"
    mkdir -p "$OUTPUT_DIR"
    
    # Define reference genome and associated files for hg38 (human)
    # The ENCODE pipeline requires a 'genome_tsv' file that points to various reference files
    # (e.g., genome FASTA, Bowtie2 index, chromosome sizes, blacklist regions).
    # You can find pre-built ENCODE reference files or create your own.
    # For hg38, you would typically download these from the ENCODE portal or prepare them.
    # Placeholder paths: Replace with actual paths to your reference files.
    GENOME_TSV="path/to/your/hg38_encode_genome.tsv"
    BLACKLIST_BED="path/to/your/hg38_blacklist.bed" # e.g., from ENCODE: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
    CHRSZ_FILE="path/to/your/hg38.chrom.sizes" # e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
    
    # Create an inputs JSON file for the ENCODE ChIP-seq pipeline
    # This configuration is tailored for a Cut&Run experiment (often single replicate, no explicit input control)
    cat << EOF > inputs.json
    {
      "chip.fastqs_rep1_R1": ["${FASTQ_REP1_R1}"],
      "chip.fastqs_rep1_R2": ["${FASTQ_REP1_R2}"],
      "chip.paired_end": true,
      "chip.pipeline_type": "chip",
      "chip.peak_caller": "macs2",
      "chip.genome_tsv": "${GENOME_TSV}",
      "chip.output_prefix": "cut_run_sample",
      "chip.auto_detect_adapter": true,
      "chip.enable_idr": false, # IDR requires multiple replicates for reproducibility assessment
      "chip.blacklist_bed": "${BLACKLIST_BED}",
      "chip.chrsz": "${CHRSZ_FILE}",
      "chip.aligner": "bowtie2", # Default for ENCODE ChIP-seq pipeline
      "chip.macs2_qval": 0.01 # Common q-value threshold for MACS2
    }
    EOF
    
    # Execute the ENCODE ChIP-seq pipeline using Cromwell
    # Ensure 'cromwell.jar' is accessible (e.g., in the current directory or specified with full path)
    # The main WDL file for the pipeline is typically 'chip.wdl' or 'chip_seq.wdl'
    # Assuming 'chip.wdl' is in the current directory (e.g., after 'cd chip-seq-pipeline2')
    java -jar /path/to/cromwell-*.jar run chip.wdl -i inputs.json -o "$OUTPUT_DIR"
  2. 2

    Data analysis was performed using a modified version of CUT&RUNTools 2.0.

    CUT&RUNTools v2.0
    $ Bash example
    # Example installation (if not already installed via conda or pip)
    # conda create -n cutruntools python=3.8
    # conda activate cutruntools
    # pip install cutruntools
    
    # Example usage of CUT&RUNTools 2.0 with placeholder inputs.
    # The description mentions a 'modified version', which might involve custom scripts or parameters not reflected here.
    # This command assumes paired-end sequencing for both sample and control, and uses hg38 as a placeholder genome.
    
    # Define input and output paths
    SAMPLE_FASTQ_R1="path/to/sample_R1.fastq.gz"
    SAMPLE_FASTQ_R2="path/to/sample_R2.fastq.gz"
    CONTROL_FASTQ_R1="path/to/control_R1.fastq.gz"
    CONTROL_FASTQ_R2="path/to/control_R2.fastq.gz"
    OUTPUT_DIR="./cutrun_analysis_output"
    REFERENCE_GENOME="hg38" # Placeholder for human genome assembly
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Execute CUT&RUNTools 2.0
    python cutruntools.py run \
        --fastq1 "${SAMPLE_FASTQ_R1}" \
        --fastq2 "${SAMPLE_FASTQ_R2}" \
        --control_fastq1 "${CONTROL_FASTQ_R1}" \
        --control_fastq2 "${CONTROL_FASTQ_R2}" \
        --genome "${REFERENCE_GENOME}" \
        --output_dir "${OUTPUT_DIR}" \
        --threads 8 # Example: use 8 threads
    
  3. 3

    Adapters were trimmed using Trimmomatic (v0.36), followed by a second round of trimming to remove any remaining adapter overhang sequences not removed due to fragment read-through.

    Trimmomatic v0.36 GitHub
    $ Bash example
    # Install Trimmomatic (if not already installed)
    # For example, using conda:
    # conda install -c bioconda trimmomatic=0.36
    
    # Define input and output file paths
    INPUT_R1="input_R1.fastq.gz"
    INPUT_R2="input_R2.fastq.gz"
    
    # Intermediate files for the first trimming round
    TEMP_R1_PAIRED="temp_R1_paired.fastq.gz"
    TEMP_R1_UNPAIRED="temp_R1_unpaired.fastq.gz"
    TEMP_R2_PAIRED="temp_R2_paired.fastq.gz"
    TEMP_R2_UNPAIRED="temp_R2_unpaired.fastq.gz"
    
    # Final output files after the second trimming round
    OUTPUT_R1_PAIRED="output_R1_paired.fastq.gz"
    OUTPUT_R1_UNPAIRED="output_R1_unpaired.fastq.gz"
    OUTPUT_R2_PAIRED="output_R2_paired.fastq.gz"
    OUTPUT_R2_UNPAIRED="output_R2_unpaired.fastq.gz"
    
    ADAPTER_FILE="TruSeq3-PE.fa" # Placeholder: Replace with the actual adapter file used (e.g., from Trimmomatic's adapters directory)
    
    # Ensure the adapter file exists. This is a common adapter file distributed with Trimmomatic.
    # You might need to locate it, e.g., in the Trimmomatic installation directory.
    # Example: cp /path/to/trimmomatic/adapters/TruSeq3-PE.fa .
    
    # First round of trimming: Standard adapter removal and quality filtering
    # ILLUMINACLIP:2:30:10 - 2 seed mismatches, 30 for palindrome clip threshold, 10 for simple clip threshold
    # LEADING:3 - Remove leading low quality bases (below quality 3)
    # TRAILING:3 - Remove trailing low quality bases (below quality 3)
    # SLIDINGWINDOW:4:15 - Scan the read with a 4-base wide window, cutting when the average quality per base drops below 15
    # MINLEN:36 - Drop reads shorter than 36 bases after trimming
    java -jar /path/to/trimmomatic-0.36.jar PE \
        -phred33 \
        "${INPUT_R1}" "${INPUT_R2}" \
        "${TEMP_R1_PAIRED}" "${TEMP_R1_UNPAIRED}" \
        "${TEMP_R2_PAIRED}" "${TEMP_R2_UNPAIRED}" \
        ILLUMINACLIP:"${ADAPTER_FILE}":2:30:10 \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
    
    # Second round of trimming: To remove any remaining adapter overhang sequences not removed due to fragment read-through.
    # This round uses slightly more aggressive ILLUMINACLIP parameters (e.g., lower simple clip threshold of 5) 
    # to catch very short remaining adapter fragments.
    java -jar /path/to/trimmomatic-0.36.jar PE \
        -phred33 \
        "${TEMP_R1_PAIRED}" "${TEMP_R2_PAIRED}" \
        "${OUTPUT_R1_PAIRED}" "${OUTPUT_R1_UNPAIRED}" \
        "${OUTPUT_R2_PAIRED}" "${OUTPUT_R2_UNPAIRED}" \
        ILLUMINACLIP:"${ADAPTER_FILE}":1:20:5 \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
    
    # Clean up intermediate files (optional)
    # rm "${TEMP_R1_PAIRED}" "${TEMP_R1_UNPAIRED}" "${TEMP_R2_PAIRED}" "${TEMP_R2_UNPAIRED}"
  4. 4

    Reads were aligned to hg38 using bowtie2 (v2.3.5.1) using preset “--very-sensitive-local,” minimum fragment length 10, and maximum fragment length 700.

    $ Bash example
    # Install bowtie2 if not already installed
    # conda install -c bioconda bowtie2=2.3.5.1
    
    # Assume hg38 index is available at /path/to/hg38_index/hg38
    # Replace reads_1.fastq and reads_2.fastq with your actual input files (paired-end)
    # Replace aligned.sam with your desired output file
    
    bowtie2 --very-sensitive-local -I 10 -X 700 -x /path/to/hg38_index/hg38 -1 reads_1.fastq -2 reads_2.fastq -S aligned.sam
  5. 5

    After removing PCR duplicates with Picard (v0.1.8), peaks were called using MACS2 (v2.2.7.1) on the default narrowPeak setting using the same-batch V5 mock IP sample as a normalization control.

    MACS2 v2.2.7 GitHub
    $ Bash example
    # Install MACS2 if not already installed
    # conda install -c bioconda macs2=2.2.7
    
    # Define input files and parameters
    TREATMENT_BAM="path/to/deduplicated_treatment.bam"
    CONTROL_BAM="path/to/deduplicated_V5_mock_IP.bam" # "same-batch V5 mock IP sample as a normalization control"
    OUTPUT_PREFIX="sample_name_macs2_peaks"
    OUTPUT_DIR="macs2_output"
    GENOME_SIZE="hs" # Placeholder for human genome size (e.g., hs for human, mm for mouse)
    
    mkdir -p "${OUTPUT_DIR}"
    
    # Run MACS2 callpeak with default narrowPeak settings
    # MACS2 will attempt to build a model for fragment size and shifting.
    # Input BAMs are assumed to be deduplicated by Picard.
    # MACS2 will automatically detect the input file format (e.g., BAM, BAMPE).
    macs2 callpeak \
        -t "${TREATMENT_BAM}" \
        -c "${CONTROL_BAM}" \
        -g "${GENOME_SIZE}" \
        -n "${OUTPUT_PREFIX}" \
        --outdir "${OUTPUT_DIR}"

Tools Used

Raw Source Text
*library strategy: Cut&Run
Data analysis was performed using a modified version of CUT&RUNTools 2.0. Adapters were trimmed using Trimmomatic (v0.36), followed by a second round of trimming to remove any remaining adapter overhang sequences not removed due to fragment read-through. Reads were aligned to hg38 using bowtie2 (v2.3.5.1) using preset “--very-sensitive-local,” minimum fragment length 10, and maximum fragment length 700. After removing PCR duplicates with Picard (v0.1.8), peaks were called using MACS2 (v2.2.7.1) on the default narrowPeak setting using the same-batch V5 mock IP sample as a normalization control.
Assembly: hg38
Supplementary files format and content: narrowPeak files (q<0.05) output by MACS2
← Back to Analysis