GSE249245 Processing Pipeline

OTHER code_examples 5 steps

Publication

Integrated multi-omics analysis of zinc-finger proteins uncovers roles in RNA regulation.

Molecular cell (2024) — PMID 39303722

Dataset

Integrated multi-omics analysis of zinc finger proteins uncovers roles in RNA regulation [Cut&Run]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

*library strategy: Cut&Run

ChIP-seq vlatest (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Cromwell (if not already installed)
# Download the latest Cromwell JAR from https://github.com/broadinstitute/cromwell/releases
# Example: wget https://github.com/broadinstitute/cromwell/releases/download/85/cromwell-85.jar
# Make sure Java is installed (e.g., sudo apt install openjdk-11-jre)

# Clone the ENCODE ChIP-seq pipeline repository
# git clone https://github.com/ENCODE-DCC/chip-seq-pipeline2.git
# cd chip-seq-pipeline2

# Define input FASTQ files (replace with actual paths to your Cut&Run data)
# Assuming paired-end reads for a single replicate
FASTQ_REP1_R1="path/to/your/sample_rep1_R1.fastq.gz"
FASTQ_REP1_R2="path/to/your/sample_rep1_R2.fastq.gz"

# Define output directory
OUTPUT_DIR="cut_run_analysis_output"
mkdir -p "$OUTPUT_DIR"

# Define reference genome and associated files for hg38 (human)
# The ENCODE pipeline requires a 'genome_tsv' file that points to various reference files
# (e.g., genome FASTA, Bowtie2 index, chromosome sizes, blacklist regions).
# You can find pre-built ENCODE reference files or create your own.
# For hg38, you would typically download these from the ENCODE portal or prepare them.
# Placeholder paths: Replace with actual paths to your reference files.
GENOME_TSV="path/to/your/hg38_encode_genome.tsv"
BLACKLIST_BED="path/to/your/hg38_blacklist.bed" # e.g., from ENCODE: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
CHRSZ_FILE="path/to/your/hg38.chrom.sizes" # e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes

# Create an inputs JSON file for the ENCODE ChIP-seq pipeline
# This configuration is tailored for a Cut&Run experiment (often single replicate, no explicit input control)
cat << EOF > inputs.json
{
  "chip.fastqs_rep1_R1": ["${FASTQ_REP1_R1}"],
  "chip.fastqs_rep1_R2": ["${FASTQ_REP1_R2}"],
  "chip.paired_end": true,
  "chip.pipeline_type": "chip",
  "chip.peak_caller": "macs2",
  "chip.genome_tsv": "${GENOME_TSV}",
  "chip.output_prefix": "cut_run_sample",
  "chip.auto_detect_adapter": true,
  "chip.enable_idr": false, # IDR requires multiple replicates for reproducibility assessment
  "chip.blacklist_bed": "${BLACKLIST_BED}",
  "chip.chrsz": "${CHRSZ_FILE}",
  "chip.aligner": "bowtie2", # Default for ENCODE ChIP-seq pipeline
  "chip.macs2_qval": 0.01 # Common q-value threshold for MACS2
}
EOF

# Execute the ENCODE ChIP-seq pipeline using Cromwell
# Ensure 'cromwell.jar' is accessible (e.g., in the current directory or specified with full path)
# The main WDL file for the pipeline is typically 'chip.wdl' or 'chip_seq.wdl'
# Assuming 'chip.wdl' is in the current directory (e.g., after 'cd chip-seq-pipeline2')
java -jar /path/to/cromwell-*.jar run chip.wdl -i inputs.json -o "$OUTPUT_DIR"

View on GitHub

Data analysis was performed using a modified version of CUT&RUNTools 2.0.

CUT&RUNTools v2.0

$ Bash example

# Example installation (if not already installed via conda or pip)
# conda create -n cutruntools python=3.8
# conda activate cutruntools
# pip install cutruntools

# Example usage of CUT&RUNTools 2.0 with placeholder inputs.
# The description mentions a 'modified version', which might involve custom scripts or parameters not reflected here.
# This command assumes paired-end sequencing for both sample and control, and uses hg38 as a placeholder genome.

# Define input and output paths
SAMPLE_FASTQ_R1="path/to/sample_R1.fastq.gz"
SAMPLE_FASTQ_R2="path/to/sample_R2.fastq.gz"
CONTROL_FASTQ_R1="path/to/control_R1.fastq.gz"
CONTROL_FASTQ_R2="path/to/control_R2.fastq.gz"
OUTPUT_DIR="./cutrun_analysis_output"
REFERENCE_GENOME="hg38" # Placeholder for human genome assembly

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Execute CUT&RUNTools 2.0
python cutruntools.py run \
    --fastq1 "${SAMPLE_FASTQ_R1}" \
    --fastq2 "${SAMPLE_FASTQ_R2}" \
    --control_fastq1 "${CONTROL_FASTQ_R1}" \
    --control_fastq2 "${CONTROL_FASTQ_R2}" \
    --genome "${REFERENCE_GENOME}" \
    --output_dir "${OUTPUT_DIR}" \
    --threads 8 # Example: use 8 threads

Adapters were trimmed using Trimmomatic (v0.36), followed by a second round of trimming to remove any remaining adapter overhang sequences not removed due to fragment read-through.

Trimmomatic v0.36 GitHub

$ Bash example

# Install Trimmomatic (if not already installed)
# For example, using conda:
# conda install -c bioconda trimmomatic=0.36

# Define input and output file paths
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"

# Intermediate files for the first trimming round
TEMP_R1_PAIRED="temp_R1_paired.fastq.gz"
TEMP_R1_UNPAIRED="temp_R1_unpaired.fastq.gz"
TEMP_R2_PAIRED="temp_R2_paired.fastq.gz"
TEMP_R2_UNPAIRED="temp_R2_unpaired.fastq.gz"

# Final output files after the second trimming round
OUTPUT_R1_PAIRED="output_R1_paired.fastq.gz"
OUTPUT_R1_UNPAIRED="output_R1_unpaired.fastq.gz"
OUTPUT_R2_PAIRED="output_R2_paired.fastq.gz"
OUTPUT_R2_UNPAIRED="output_R2_unpaired.fastq.gz"

ADAPTER_FILE="TruSeq3-PE.fa" # Placeholder: Replace with the actual adapter file used (e.g., from Trimmomatic's adapters directory)

# Ensure the adapter file exists. This is a common adapter file distributed with Trimmomatic.
# You might need to locate it, e.g., in the Trimmomatic installation directory.
# Example: cp /path/to/trimmomatic/adapters/TruSeq3-PE.fa .

# First round of trimming: Standard adapter removal and quality filtering
# ILLUMINACLIP:2:30:10 - 2 seed mismatches, 30 for palindrome clip threshold, 10 for simple clip threshold
# LEADING:3 - Remove leading low quality bases (below quality 3)
# TRAILING:3 - Remove trailing low quality bases (below quality 3)
# SLIDINGWINDOW:4:15 - Scan the read with a 4-base wide window, cutting when the average quality per base drops below 15
# MINLEN:36 - Drop reads shorter than 36 bases after trimming
java -jar /path/to/trimmomatic-0.36.jar PE \
    -phred33 \
    "${INPUT_R1}" "${INPUT_R2}" \
    "${TEMP_R1_PAIRED}" "${TEMP_R1_UNPAIRED}" \
    "${TEMP_R2_PAIRED}" "${TEMP_R2_UNPAIRED}" \
    ILLUMINACLIP:"${ADAPTER_FILE}":2:30:10 \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

# Second round of trimming: To remove any remaining adapter overhang sequences not removed due to fragment read-through.
# This round uses slightly more aggressive ILLUMINACLIP parameters (e.g., lower simple clip threshold of 5) 
# to catch very short remaining adapter fragments.
java -jar /path/to/trimmomatic-0.36.jar PE \
    -phred33 \
    "${TEMP_R1_PAIRED}" "${TEMP_R2_PAIRED}" \
    "${OUTPUT_R1_PAIRED}" "${OUTPUT_R1_UNPAIRED}" \
    "${OUTPUT_R2_PAIRED}" "${OUTPUT_R2_UNPAIRED}" \
    ILLUMINACLIP:"${ADAPTER_FILE}":1:20:5 \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

# Clean up intermediate files (optional)
# rm "${TEMP_R1_PAIRED}" "${TEMP_R1_UNPAIRED}" "${TEMP_R2_PAIRED}" "${TEMP_R2_UNPAIRED}"

View on GitHub

Reads were aligned to hg38 using bowtie2 (v2.3.5.1) using preset â--very-sensitive-local,â minimum fragment length 10, and maximum fragment length 700.

Bowtie2 v2.3.5 GitHub

$ Bash example

# Install bowtie2 if not already installed
# conda install -c bioconda bowtie2=2.3.5.1

# Assume hg38 index is available at /path/to/hg38_index/hg38
# Replace reads_1.fastq and reads_2.fastq with your actual input files (paired-end)
# Replace aligned.sam with your desired output file

bowtie2 --very-sensitive-local -I 10 -X 700 -x /path/to/hg38_index/hg38 -1 reads_1.fastq -2 reads_2.fastq -S aligned.sam

View on GitHub

After removing PCR duplicates with Picard (v0.1.8), peaks were called using MACS2 (v2.2.7.1) on the default narrowPeak setting using the same-batch V5 mock IP sample as a normalization control.

MACS2 v2.2.7 GitHub

$ Bash example

# Install MACS2 if not already installed
# conda install -c bioconda macs2=2.2.7

# Define input files and parameters
TREATMENT_BAM="path/to/deduplicated_treatment.bam"
CONTROL_BAM="path/to/deduplicated_V5_mock_IP.bam" # "same-batch V5 mock IP sample as a normalization control"
OUTPUT_PREFIX="sample_name_macs2_peaks"
OUTPUT_DIR="macs2_output"
GENOME_SIZE="hs" # Placeholder for human genome size (e.g., hs for human, mm for mouse)

mkdir -p "${OUTPUT_DIR}"

# Run MACS2 callpeak with default narrowPeak settings
# MACS2 will attempt to build a model for fragment size and shifting.
# Input BAMs are assumed to be deduplicated by Picard.
# MACS2 will automatically detect the input file format (e.g., BAM, BAMPE).
macs2 callpeak \
    -t "${TREATMENT_BAM}" \
    -c "${CONTROL_BAM}" \
    -g "${GENOME_SIZE}" \
    -n "${OUTPUT_PREFIX}" \
    --outdir "${OUTPUT_DIR}"

View on GitHub

Tools Used

ChIP-seq Bowtie2

Raw Source Text

*library strategy: Cut&Run
Data analysis was performed using a modified version of CUT&RUNTools 2.0. Adapters were trimmed using Trimmomatic (v0.36), followed by a second round of trimming to remove any remaining adapter overhang sequences not removed due to fragment read-through. Reads were aligned to hg38 using bowtie2 (v2.3.5.1) using preset â--very-sensitive-local,â minimum fragment length 10, and maximum fragment length 700. After removing PCR duplicates with Picard (v0.1.8), peaks were called using MACS2 (v2.2.7.1) on the default narrowPeak setting using the same-batch V5 mock IP sample as a normalization control.
Assembly: hg38
Supplementary files format and content: narrowPeak files (q<0.05) output by MACS2

← Back to Analysis