GSE125954 Processing Pipeline

ChIP-Seq code_examples 4 steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell stem cell (2019) — PMID 31588046

Dataset

The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [ChIP-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Reads were trimmed using cutadapt and aligned to the hg19 genome using BWA.

BWA v0.7.17 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt
# conda install -c bioconda cutadapt

# Install bwa
# conda install -c bioconda bwa

# Install samtools (often needed with BWA for post-processing)
# conda install -c bioconda samtools

# Define input files and reference genome
READ1="read1.fastq.gz"
READ2="read2.fastq.gz"
HG19_REF="path/to/hg19.fa" # Placeholder for hg19 reference genome

# --- Step 1: Trim reads using cutadapt ---
# Common Illumina universal adapters are used here. Adjust if specific adapters are known.
# -a: 3' adapter for R1
# -A: 3' adapter for R2
# -q 20: Trim low-quality ends (quality below 20)
# --minimum-length 20: Discard reads shorter than 20 bp after trimming
# -o: Output file for R1
# -p: Output file for R2
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
         -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
         -q 20 \
         --minimum-length 20 \
         -o trimmed_${READ1} \
         -p trimmed_${READ2} \
         ${READ1} ${READ2}

# --- Step 2: Align trimmed reads to hg19 using BWA-MEM ---
# -M: Mark shorter split hits as secondary (for Picard compatibility)
# -t 8: Use 8 threads (adjust based on available CPU cores)
# | samtools view -Sb - > aligned.bam: Pipe SAM output to samtools to convert to unsorted BAM
bwa mem -M -t 8 ${HG19_REF} trimmed_${READ1} trimmed_${READ2} | samtools view -Sb - > aligned.bam

# --- Step 3: Sort and index the BAM file ---
samtools sort -o aligned.sorted.bam aligned.bam
samtools index aligned.sorted.bam

# Note: The hg19 reference genome must be indexed for BWA prior to alignment.
# If not already indexed, run:
# bwa index ${HG19_REF}

View on GitHub

Duplicate reads were marked using picard tools, and reads were filtered for a mapq score > 15.

Picard vPicard 2.27.4, Samtools 1.16.1 GitHub

$ Bash example

# Install Picard (if not already installed)
# For example, using conda:
# conda install -c bioconda picard

# Install Samtools (if not already installed)
# For example, using conda:
# conda install -c bioconda samtools

# Mark duplicate reads using Picard MarkDuplicates
# Assuming input.bam is the aligned and sorted BAM file
picard MarkDuplicates \
    I=input.bam \
    O=marked_duplicates.bam \
    M=markduplicates.metrics \
    ASSUME_SORTED=true \
    VALIDATION_STRINGENCY=SILENT

# Filter reads for a MAPQ score > 15 using samtools view
samtools view -b -q 15 marked_duplicates.bam > filtered_mapq.bam
samtools index filtered_mapq.bam

View on GitHub

Macs2 callpeak was used to call peaks, with --broad used for H3K9me3.

MACS2 v2.2.7.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install MACS2 (if not already installed)
# conda install -c bioconda macs2

# Define input and output files/directories (placeholders)
INPUT_BAM="input.bam" # Path to the H3K9me3 ChIP-seq alignment file
CONTROL_BAM="control.bam" # Path to the control (input) alignment file
OUTPUT_DIR="macs2_peaks_H3K9me3"
PREFIX="H3K9me3_broad_peaks"
GENOME_SIZE="hs" # For human, can also be 'mm' for mouse, or a specific number like '2.7e9'

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run MACS2 callpeak with --broad for H3K9me3
macs2 callpeak \
  -t "${INPUT_BAM}" \
  -c "${CONTROL_BAM}" \
  -f BAM \
  -g "${GENOME_SIZE}" \
  -n "${PREFIX}" \
  --outdir "${OUTPUT_DIR}" \
  --broad \
  --broad-cutoff 0.1 \
  -q 0.05

View on GitHub

To generate peak sets, macs2 called peaks were extended by 500bp then only kept if present and overlapping in both biological replicates.

MACS2 v2.29.2 GitHub

$ Bash example

# Install bedtools if not already present
# conda install -c bioconda bedtools

# Define input MACS2 peak files for replicates
# These are assumed to be the 'macs2 called peaks' mentioned in the description.
REP1_MACS2_PEAKS="rep1_macs2_peaks.bed"
REP2_MACS2_PEAKS="rep2_macs2_peaks.bed"

# Define genome sizes file (e.g., for hg38). This is crucial for bedtools slop to prevent extending off chromosome ends.
# Replace 'hg38.chrom.sizes' with the actual genome sizes file used in your pipeline (e.g., from UCSC).
GENOME_SIZES="hg38.chrom.sizes"

# Step 1: Extend MACS2 called peaks from replicate 1 by 500bp on both sides
# bedtools slop adds 'b' base pairs to both ends of each feature.
bedtools slop -i "${REP1_MACS2_PEAKS}" -g "${GENOME_SIZES}" -b 500 > rep1_macs2_peaks_extended.bed

# Step 2: Extend MACS2 called peaks from replicate 2 by 500bp on both sides
bedtools slop -i "${REP2_MACS2_PEAKS}" -g "${GENOME_SIZES}" -b 500 > rep2_macs2_peaks_extended.bed

# Step 3: Identify peaks that are present and overlapping in both biological replicates.
# This involves intersecting the extended peaks from both replicates bidirectionally.
# The '-wao' option writes the original entry in A, the original entry in B, and the number of base pairs of overlap.

# Intersect extended peaks from replicate 1 with replicate 2
bedtools intersect -a rep1_macs2_peaks_extended.bed -b rep2_macs2_peaks_extended.bed -wao > rep1_rep2_overlap.tmp

# Intersect extended peaks from replicate 2 with replicate 1
bedtools intersect -a rep2_macs2_peaks_extended.bed -b rep1_macs2_peaks_extended.bed -wao > rep2_rep1_overlap.tmp

# Step 4: Combine all identified overlapping regions, sort them, and then merge adjacent or overlapping regions
# The awk command extracts the chromosome, start, and end coordinates of the original peak from the first file (A) that had an overlap.
# This effectively creates a union of all peaks from both replicates that showed an overlap.
cat rep1_rep2_overlap.tmp rep2_rep1_overlap.tmp | \
    awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' | \
    sort -k1,1 -k2,2n | \
    bedtools merge -i - > reproducible_macs2_peaks.bed

# Clean up temporary files
rm rep1_macs2_peaks_extended.bed rep2_macs2_peaks_extended.bed rep1_rep2_overlap.tmp rep2_rep1_overlap.tmp

View on GitHub

Raw Source Text

Reads were trimmed using cutadapt and aligned to the hg19 genome using BWA. Duplicate reads were marked using picard tools, and reads were filtered for a mapq score > 15.
Macs2 callpeak was used to call peaks, with --broad used for H3K9me3.
To generate peak sets, macs2 called peaks were extended by 500bp then only kept if present and overlapping in both biological replicates.
Genome_build: hg19
Supplementary_files_format_and_content: Processed peak files are in bed format: chr, start, end.

← Back to Analysis