GSE125954 Processing Pipeline

ChIP-Seq code_examples 4 steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell stem cell (2019) — PMID 31588046

Dataset

GSE125954

The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [ChIP-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Reads were trimmed using cutadapt and aligned to the hg19 genome using BWA.

    BWA v0.7.17 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install cutadapt
    # conda install -c bioconda cutadapt
    
    # Install bwa
    # conda install -c bioconda bwa
    
    # Install samtools (often needed with BWA for post-processing)
    # conda install -c bioconda samtools
    
    # Define input files and reference genome
    READ1="read1.fastq.gz"
    READ2="read2.fastq.gz"
    HG19_REF="path/to/hg19.fa" # Placeholder for hg19 reference genome
    
    # --- Step 1: Trim reads using cutadapt ---
    # Common Illumina universal adapters are used here. Adjust if specific adapters are known.
    # -a: 3' adapter for R1
    # -A: 3' adapter for R2
    # -q 20: Trim low-quality ends (quality below 20)
    # --minimum-length 20: Discard reads shorter than 20 bp after trimming
    # -o: Output file for R1
    # -p: Output file for R2
    cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
             -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
             -q 20 \
             --minimum-length 20 \
             -o trimmed_${READ1} \
             -p trimmed_${READ2} \
             ${READ1} ${READ2}
    
    # --- Step 2: Align trimmed reads to hg19 using BWA-MEM ---
    # -M: Mark shorter split hits as secondary (for Picard compatibility)
    # -t 8: Use 8 threads (adjust based on available CPU cores)
    # | samtools view -Sb - > aligned.bam: Pipe SAM output to samtools to convert to unsorted BAM
    bwa mem -M -t 8 ${HG19_REF} trimmed_${READ1} trimmed_${READ2} | samtools view -Sb - > aligned.bam
    
    # --- Step 3: Sort and index the BAM file ---
    samtools sort -o aligned.sorted.bam aligned.bam
    samtools index aligned.sorted.bam
    
    # Note: The hg19 reference genome must be indexed for BWA prior to alignment.
    # If not already indexed, run:
    # bwa index ${HG19_REF}
  2. 2

    Duplicate reads were marked using picard tools, and reads were filtered for a mapq score > 15.

    Picard vPicard 2.27.4, Samtools 1.16.1 GitHub
    $ Bash example
    # Install Picard (if not already installed)
    # For example, using conda:
    # conda install -c bioconda picard
    
    # Install Samtools (if not already installed)
    # For example, using conda:
    # conda install -c bioconda samtools
    
    # Mark duplicate reads using Picard MarkDuplicates
    # Assuming input.bam is the aligned and sorted BAM file
    picard MarkDuplicates \
        I=input.bam \
        O=marked_duplicates.bam \
        M=markduplicates.metrics \
        ASSUME_SORTED=true \
        VALIDATION_STRINGENCY=SILENT
    
    # Filter reads for a MAPQ score > 15 using samtools view
    samtools view -b -q 15 marked_duplicates.bam > filtered_mapq.bam
    samtools index filtered_mapq.bam
  3. 3

    Macs2 callpeak was used to call peaks, with --broad used for H3K9me3.

    MACS2 v2.2.7.1 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install MACS2 (if not already installed)
    # conda install -c bioconda macs2
    
    # Define input and output files/directories (placeholders)
    INPUT_BAM="input.bam" # Path to the H3K9me3 ChIP-seq alignment file
    CONTROL_BAM="control.bam" # Path to the control (input) alignment file
    OUTPUT_DIR="macs2_peaks_H3K9me3"
    PREFIX="H3K9me3_broad_peaks"
    GENOME_SIZE="hs" # For human, can also be 'mm' for mouse, or a specific number like '2.7e9'
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run MACS2 callpeak with --broad for H3K9me3
    macs2 callpeak \
      -t "${INPUT_BAM}" \
      -c "${CONTROL_BAM}" \
      -f BAM \
      -g "${GENOME_SIZE}" \
      -n "${PREFIX}" \
      --outdir "${OUTPUT_DIR}" \
      --broad \
      --broad-cutoff 0.1 \
      -q 0.05
  4. 4

    To generate peak sets, macs2 called peaks were extended by 500bp then only kept if present and overlapping in both biological replicates.

    MACS2 v2.29.2 GitHub
    $ Bash example
    # Install bedtools if not already present
    # conda install -c bioconda bedtools
    
    # Define input MACS2 peak files for replicates
    # These are assumed to be the 'macs2 called peaks' mentioned in the description.
    REP1_MACS2_PEAKS="rep1_macs2_peaks.bed"
    REP2_MACS2_PEAKS="rep2_macs2_peaks.bed"
    
    # Define genome sizes file (e.g., for hg38). This is crucial for bedtools slop to prevent extending off chromosome ends.
    # Replace 'hg38.chrom.sizes' with the actual genome sizes file used in your pipeline (e.g., from UCSC).
    GENOME_SIZES="hg38.chrom.sizes"
    
    # Step 1: Extend MACS2 called peaks from replicate 1 by 500bp on both sides
    # bedtools slop adds 'b' base pairs to both ends of each feature.
    bedtools slop -i "${REP1_MACS2_PEAKS}" -g "${GENOME_SIZES}" -b 500 > rep1_macs2_peaks_extended.bed
    
    # Step 2: Extend MACS2 called peaks from replicate 2 by 500bp on both sides
    bedtools slop -i "${REP2_MACS2_PEAKS}" -g "${GENOME_SIZES}" -b 500 > rep2_macs2_peaks_extended.bed
    
    # Step 3: Identify peaks that are present and overlapping in both biological replicates.
    # This involves intersecting the extended peaks from both replicates bidirectionally.
    # The '-wao' option writes the original entry in A, the original entry in B, and the number of base pairs of overlap.
    
    # Intersect extended peaks from replicate 1 with replicate 2
    bedtools intersect -a rep1_macs2_peaks_extended.bed -b rep2_macs2_peaks_extended.bed -wao > rep1_rep2_overlap.tmp
    
    # Intersect extended peaks from replicate 2 with replicate 1
    bedtools intersect -a rep2_macs2_peaks_extended.bed -b rep1_macs2_peaks_extended.bed -wao > rep2_rep1_overlap.tmp
    
    # Step 4: Combine all identified overlapping regions, sort them, and then merge adjacent or overlapping regions
    # The awk command extracts the chromosome, start, and end coordinates of the original peak from the first file (A) that had an overlap.
    # This effectively creates a union of all peaks from both replicates that showed an overlap.
    cat rep1_rep2_overlap.tmp rep2_rep1_overlap.tmp | \
        awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' | \
        sort -k1,1 -k2,2n | \
        bedtools merge -i - > reproducible_macs2_peaks.bed
    
    # Clean up temporary files
    rm rep1_macs2_peaks_extended.bed rep2_macs2_peaks_extended.bed rep1_rep2_overlap.tmp rep2_rep1_overlap.tmp
Raw Source Text
Reads were trimmed using cutadapt and aligned to the hg19 genome using BWA. Duplicate reads were marked using picard tools, and reads were filtered for a mapq score > 15.
Macs2 callpeak was used to call peaks, with --broad used for H3K9me3.
To generate peak sets, macs2 called peaks were extended by 500bp then only kept if present and overlapping in both biological replicates.
Genome_build: hg19
Supplementary_files_format_and_content: Processed peak files are in bed format: chr, start, end.
← Back to Analysis