GSE120104 Processing Pipeline

ChIP-Seq code_examples 5 steps

Publication

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription.

Cell (2019) — PMID 31251911

Dataset

GSE120104

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription [ChIP-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

ChIP-seq data were processed in accordance with ENCODE uniform transcription factor ChIP-seq pipeline (https://www.encodeproject.org/chip-seq/transcription_factor/).

ChIP-seq vlatest stable release (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Clone the ENCODE ChIP-seq pipeline repository (if not already cloned)
# git clone https://github.com/ENCODE-DCC/chip-seq-pipeline2.git
# cd chip-seq-pipeline2

# Create an input JSON file for a paired-end Transcription Factor ChIP-seq experiment (hg38).
# Replace '/path/to/...' with your actual FASTQ file paths.
# Reference genome files are sourced from ENCODE Project (GRCh38_gencode_v34).
cat << EOF > input.json
{
  "chip.fastqs_rep1_R1": ["/path/to/rep1_R1.fastq.gz"],
  "chip.fastqs_rep1_R2": ["/path/to/rep1_R2.fastq.gz"],
  "chip.fastqs_ctl_R1": ["/path/to/control_R1.fastq.gz"],
  "chip.fastqs_ctl_R2": ["/path/to/control_R2.fastq.gz"],
  "chip.paired_end": true,
  "chip.pipeline_type": "tf",
  "chip.genome_tsv": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.tsv",
  "chip.blacklist_bed": "https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz",
  "chip.chrsz": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.chrsz.tsv",
  "chip.ref_fa": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.fa.gz",
  "chip.bowtie2_idx": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.bowtie2_idx.tar",
  "chip.bwa_idx": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.bwa_idx.tar",
  "chip.output_prefix": "my_tf_chipseq_hg38"
}
EOF

# Execute the ENCODE ChIP-seq pipeline using Cromwell.
# This command assumes 'cromwell.jar' is in your PATH or specified with its full path,
# and that you are running this from the root directory of the cloned 'chip-seq-pipeline2' repository.
# Replace 'cromwell.jar' with the actual path to your Cromwell JAR file if not in PATH.
# The 'chip.wdl' file is the main workflow definition.
java -jar cromwell.jar run chip.wdl -i input.json

View on GitHub

Briefly, ChIP-seq reads were aligned to the GRCh37 human genome using BWA

BWA v0.7.17 GitHub

$ Bash example

# Install BWA (if not already installed)
# conda install -c bioconda bwa

# Install Samtools (if not already installed)
# conda install -c bioconda samtools

# Define variables
REF_GENOME="/path/to/GRCh37/GRCh37.fa" # Placeholder for GRCh37 human genome reference
READS_R1="input_reads_R1.fastq.gz" # Placeholder for ChIP-seq forward reads
READS_R2="input_reads_R2.fastq.gz" # Placeholder for ChIP-seq reverse reads (if paired-end)
OUTPUT_BAM="aligned_reads.bam"
OUTPUT_SORTED_BAM="aligned_reads.sorted.bam"

# 1. Index the reference genome (if not already indexed)
# This step only needs to be run once per reference genome
# bwa index ${REF_GENOME}

# 2. Align reads to the GRCh37 human genome using BWA-MEM
# Assuming paired-end reads for ChIP-seq, adjust for single-end if necessary
bwa mem -M -t 8 ${REF_GENOME} ${READS_R1} ${READS_R2} | samtools view -bS - > ${OUTPUT_BAM}

# 3. Sort the resulting BAM file
samtools sort ${OUTPUT_BAM} -o ${OUTPUT_SORTED_BAM}

# 4. Index the sorted BAM file
samtools index ${OUTPUT_SORTED_BAM}

View on GitHub

PCR duplicates and reads of low quality were filtered

samtools (Inferred with models/gemini-2.5-flash) v1.9 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Filter out reads marked as PCR duplicates (FLAG 1024) and reads with mapping quality less than 10
samtools view -F 1024 -q 10 -b input.bam > filtered.bam

View on GitHub

Peak-calling was performed with SPP and IDR with 0.02 IDR threshhold

IDR v0.02

$ Bash example

# Install IDR (example using pip, check official documentation for best practice)
# pip install idr

# IDR (Irreproducible Discovery Rate) analysis
# This step takes peak files (e.g., from SPP for two replicates) and identifies reproducible peaks.
# The '0.02 IDR threshhold' refers to the --output-threshold parameter, meaning peaks with an IDR score <= 0.02 are considered reproducible.
idr --samples rep1_spp_peaks.narrowPeak rep2_spp_peaks.narrowPeak \
    --input-file-type narrowPeak \
    --rank-by signal.value \
    --output-file idr_reproducible_peaks.narrowPeak \
    --plot \
    --log-file idr_log.txt \
    --output-threshold 0.02

Signal tracks were generated with MACS2 and bigWig applications from UCSC genome browser (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/)

MACS2 vMACS2: 2.2.7.1 (Inferred with models/gemini-2.5-flash); UCSC bedGraphToBigWig: Not explicitly versioned, typically latest available at time of download. GitHub

$ Bash example

# Placeholder variables
TREATMENT_BAM="path/to/treatment.bam"
CONTROL_BAM="path/to/control.bam" # Optional, but common for MACS2
OUTPUT_PREFIX="sample_name"
GENOME_ASSEMBLY="hg38" # Latest assembly as placeholder
OUTPUT_DIR="macs2_output"
CHROM_SIZES_FILE="${GENOME_ASSEMBLY}.chrom.sizes"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# --- MACS2 Installation (commented out) ---
# conda install -c bioconda macs2

# --- UCSC Tools Installation (commented out) ---
# Download bedGraphToBigWig from UCSC Genome Browser utilities
# wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig
# chmod +x bedGraphToBigWig
# mv bedGraphToBigWig /usr/local/bin/ # Or add to PATH

# --- Generate/Download chrom.sizes file ---
# This file is required by bedGraphToBigWig. Download from UCSC Genome Browser for the specified assembly.
wget -O "${CHROM_SIZES_FILE}" "http://hgdownload.soe.ucsc.edu/goldenPath/${GENOME_ASSEMBLY}/bigZips/${GENOME_ASSEMBLY}.chrom.sizes"

# --- MACS2 Execution ---
# Generate signal tracks (bedGraph files) using MACS2 callpeak with -B and -SPMR options.
# -t: treatment file (BAM format)
# -c: control file (BAM format, optional but recommended for ChIP-seq/eCLIP)
# -f BAM: input file format (BAM for single-end or paired-end if not using BAMPE)
# -g hs: genome size (e.g., hs for human, mm for mouse). Use a more precise value if known.
# -n: name string for output files
# -B: store fragment pileup and control lambda values in bedGraph files
# -SPMR: calculate signal per million reads (SPMR)
# --outdir: output directory
macs2 callpeak -t "${TREATMENT_BAM}" \
               -c "${CONTROL_BAM}" \
               -f BAM \
               -g hs \
               -n "${OUTPUT_PREFIX}" \
               -B \
               -SPMR \
               --outdir "${OUTPUT_DIR}"

# --- Convert MACS2 bedGraph outputs to bigWig using UCSC bedGraphToBigWig ---
# MACS2 generates several bedGraph files with -B.
# Common signal tracks are:
# 1. ${OUTPUT_PREFIX}_treat_pileup.bdg (raw signal pileup)
# 2. ${OUTPUT_PREFIX}_control_lambda.bdg (background lambda values)
# Often, fold enrichment (FE) or -log10(pvalue) signal is desired, which requires an additional 'macs2 bdgcmp' step.

# Convert treatment pileup bedGraph to bigWig
bedGraphToBigWig "${OUTPUT_DIR}/${OUTPUT_PREFIX}_treat_pileup.bdg" \
                 "${CHROM_SIZES_FILE}" \
                 "${OUTPUT_DIR}/${OUTPUT_PREFIX}_treat_pileup.bw"

# Example: Generate and convert fold enrichment signal if desired
# macs2 bdgcmp -t "${OUTPUT_DIR}/${OUTPUT_PREFIX}_treat_pileup.bdg" \
#              -c "${OUTPUT_DIR}/${OUTPUT_PREFIX}_control_lambda.bdg" \
#              -m FE \
#              -o "${OUTPUT_DIR}/${OUTPUT_PREFIX}_FE.bdg"
# bedGraphToBigWig "${OUTPUT_DIR}/${OUTPUT_PREFIX}_FE.bdg" \
#                  "${CHROM_SIZES_FILE}" \
#                  "${OUTPUT_DIR}/${OUTPUT_PREFIX}_FE.bw"

View on GitHub

Tools Used

ChIP-seq

Raw Source Text

ChIP-seq data were processed in accordance with ENCODE uniform transcription factor ChIP-seq pipeline (https://www.encodeproject.org/chip-seq/transcription_factor/).
Briefly, ChIP-seq reads were aligned to the GRCh37 human genome using BWA;
PCR duplicates and reads of low quality were filtered;
Peak-calling was performed with SPP and IDR with 0.02 IDR threshhold;
Signal tracks were generated with MACS2 and bigWig applications from UCSC genome browser (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/)
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: Peak files were narrowPeak files generated by SPP; column information of narrowPeak file is explicated at the link (https://genome-euro.ucsc.edu/FAQ/FAQformat.html#format12).

← Back to Analysis