GSE120104 Processing Pipeline

ChIP-Seq code_examples 5 steps

Publication

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription.

Cell (2019) — PMID 31251911

Dataset

GSE120104

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription [ChIP-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    ChIP-seq data were processed in accordance with ENCODE uniform transcription factor ChIP-seq pipeline (https://www.encodeproject.org/chip-seq/transcription_factor/).

    ChIP-seq vlatest stable release (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Clone the ENCODE ChIP-seq pipeline repository (if not already cloned)
    # git clone https://github.com/ENCODE-DCC/chip-seq-pipeline2.git
    # cd chip-seq-pipeline2
    
    # Create an input JSON file for a paired-end Transcription Factor ChIP-seq experiment (hg38).
    # Replace '/path/to/...' with your actual FASTQ file paths.
    # Reference genome files are sourced from ENCODE Project (GRCh38_gencode_v34).
    cat << EOF > input.json
    {
      "chip.fastqs_rep1_R1": ["/path/to/rep1_R1.fastq.gz"],
      "chip.fastqs_rep1_R2": ["/path/to/rep1_R2.fastq.gz"],
      "chip.fastqs_ctl_R1": ["/path/to/control_R1.fastq.gz"],
      "chip.fastqs_ctl_R2": ["/path/to/control_R2.fastq.gz"],
      "chip.paired_end": true,
      "chip.pipeline_type": "tf",
      "chip.genome_tsv": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.tsv",
      "chip.blacklist_bed": "https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz",
      "chip.chrsz": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.chrsz.tsv",
      "chip.ref_fa": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.fa.gz",
      "chip.bowtie2_idx": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.bowtie2_idx.tar",
      "chip.bwa_idx": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.bwa_idx.tar",
      "chip.output_prefix": "my_tf_chipseq_hg38"
    }
    EOF
    
    # Execute the ENCODE ChIP-seq pipeline using Cromwell.
    # This command assumes 'cromwell.jar' is in your PATH or specified with its full path,
    # and that you are running this from the root directory of the cloned 'chip-seq-pipeline2' repository.
    # Replace 'cromwell.jar' with the actual path to your Cromwell JAR file if not in PATH.
    # The 'chip.wdl' file is the main workflow definition.
    java -jar cromwell.jar run chip.wdl -i input.json
  2. 2

    Briefly, ChIP-seq reads were aligned to the GRCh37 human genome using BWA

    BWA v0.7.17 GitHub
    $ Bash example
    # Install BWA (if not already installed)
    # conda install -c bioconda bwa
    
    # Install Samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Define variables
    REF_GENOME="/path/to/GRCh37/GRCh37.fa" # Placeholder for GRCh37 human genome reference
    READS_R1="input_reads_R1.fastq.gz" # Placeholder for ChIP-seq forward reads
    READS_R2="input_reads_R2.fastq.gz" # Placeholder for ChIP-seq reverse reads (if paired-end)
    OUTPUT_BAM="aligned_reads.bam"
    OUTPUT_SORTED_BAM="aligned_reads.sorted.bam"
    
    # 1. Index the reference genome (if not already indexed)
    # This step only needs to be run once per reference genome
    # bwa index ${REF_GENOME}
    
    # 2. Align reads to the GRCh37 human genome using BWA-MEM
    # Assuming paired-end reads for ChIP-seq, adjust for single-end if necessary
    bwa mem -M -t 8 ${REF_GENOME} ${READS_R1} ${READS_R2} | samtools view -bS - > ${OUTPUT_BAM}
    
    # 3. Sort the resulting BAM file
    samtools sort ${OUTPUT_BAM} -o ${OUTPUT_SORTED_BAM}
    
    # 4. Index the sorted BAM file
    samtools index ${OUTPUT_SORTED_BAM}
  3. 3

    PCR duplicates and reads of low quality were filtered

    samtools (Inferred with models/gemini-2.5-flash) v1.9 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Filter out reads marked as PCR duplicates (FLAG 1024) and reads with mapping quality less than 10
    samtools view -F 1024 -q 10 -b input.bam > filtered.bam
  4. 4

    Peak-calling was performed with SPP and IDR with 0.02 IDR threshhold

    IDR v0.02
    $ Bash example
    # Install IDR (example using pip, check official documentation for best practice)
    # pip install idr
    
    # IDR (Irreproducible Discovery Rate) analysis
    # This step takes peak files (e.g., from SPP for two replicates) and identifies reproducible peaks.
    # The '0.02 IDR threshhold' refers to the --output-threshold parameter, meaning peaks with an IDR score <= 0.02 are considered reproducible.
    idr --samples rep1_spp_peaks.narrowPeak rep2_spp_peaks.narrowPeak \
        --input-file-type narrowPeak \
        --rank-by signal.value \
        --output-file idr_reproducible_peaks.narrowPeak \
        --plot \
        --log-file idr_log.txt \
        --output-threshold 0.02
  5. 5

    Signal tracks were generated with MACS2 and bigWig applications from UCSC genome browser (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/)

    MACS2 vMACS2: 2.2.7.1 (Inferred with models/gemini-2.5-flash); UCSC bedGraphToBigWig: Not explicitly versioned, typically latest available at time of download. GitHub
    $ Bash example
    # Placeholder variables
    TREATMENT_BAM="path/to/treatment.bam"
    CONTROL_BAM="path/to/control.bam" # Optional, but common for MACS2
    OUTPUT_PREFIX="sample_name"
    GENOME_ASSEMBLY="hg38" # Latest assembly as placeholder
    OUTPUT_DIR="macs2_output"
    CHROM_SIZES_FILE="${GENOME_ASSEMBLY}.chrom.sizes"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # --- MACS2 Installation (commented out) ---
    # conda install -c bioconda macs2
    
    # --- UCSC Tools Installation (commented out) ---
    # Download bedGraphToBigWig from UCSC Genome Browser utilities
    # wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig
    # chmod +x bedGraphToBigWig
    # mv bedGraphToBigWig /usr/local/bin/ # Or add to PATH
    
    # --- Generate/Download chrom.sizes file ---
    # This file is required by bedGraphToBigWig. Download from UCSC Genome Browser for the specified assembly.
    wget -O "${CHROM_SIZES_FILE}" "http://hgdownload.soe.ucsc.edu/goldenPath/${GENOME_ASSEMBLY}/bigZips/${GENOME_ASSEMBLY}.chrom.sizes"
    
    # --- MACS2 Execution ---
    # Generate signal tracks (bedGraph files) using MACS2 callpeak with -B and -SPMR options.
    # -t: treatment file (BAM format)
    # -c: control file (BAM format, optional but recommended for ChIP-seq/eCLIP)
    # -f BAM: input file format (BAM for single-end or paired-end if not using BAMPE)
    # -g hs: genome size (e.g., hs for human, mm for mouse). Use a more precise value if known.
    # -n: name string for output files
    # -B: store fragment pileup and control lambda values in bedGraph files
    # -SPMR: calculate signal per million reads (SPMR)
    # --outdir: output directory
    macs2 callpeak -t "${TREATMENT_BAM}" \
                   -c "${CONTROL_BAM}" \
                   -f BAM \
                   -g hs \
                   -n "${OUTPUT_PREFIX}" \
                   -B \
                   -SPMR \
                   --outdir "${OUTPUT_DIR}"
    
    # --- Convert MACS2 bedGraph outputs to bigWig using UCSC bedGraphToBigWig ---
    # MACS2 generates several bedGraph files with -B.
    # Common signal tracks are:
    # 1. ${OUTPUT_PREFIX}_treat_pileup.bdg (raw signal pileup)
    # 2. ${OUTPUT_PREFIX}_control_lambda.bdg (background lambda values)
    # Often, fold enrichment (FE) or -log10(pvalue) signal is desired, which requires an additional 'macs2 bdgcmp' step.
    
    # Convert treatment pileup bedGraph to bigWig
    bedGraphToBigWig "${OUTPUT_DIR}/${OUTPUT_PREFIX}_treat_pileup.bdg" \
                     "${CHROM_SIZES_FILE}" \
                     "${OUTPUT_DIR}/${OUTPUT_PREFIX}_treat_pileup.bw"
    
    # Example: Generate and convert fold enrichment signal if desired
    # macs2 bdgcmp -t "${OUTPUT_DIR}/${OUTPUT_PREFIX}_treat_pileup.bdg" \
    #              -c "${OUTPUT_DIR}/${OUTPUT_PREFIX}_control_lambda.bdg" \
    #              -m FE \
    #              -o "${OUTPUT_DIR}/${OUTPUT_PREFIX}_FE.bdg"
    # bedGraphToBigWig "${OUTPUT_DIR}/${OUTPUT_PREFIX}_FE.bdg" \
    #                  "${CHROM_SIZES_FILE}" \
    #                  "${OUTPUT_DIR}/${OUTPUT_PREFIX}_FE.bw"

Tools Used

Raw Source Text
ChIP-seq data were processed in accordance with ENCODE uniform transcription factor ChIP-seq pipeline (https://www.encodeproject.org/chip-seq/transcription_factor/).
Briefly, ChIP-seq reads were aligned to the GRCh37 human genome using BWA;
PCR duplicates and reads of low quality were filtered;
Peak-calling was performed with SPP and IDR with 0.02 IDR threshhold;
Signal tracks were generated with MACS2 and bigWig applications from UCSC genome browser (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/)
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: Peak files were narrowPeak files generated by SPP; column information of narrowPeak file is explicated at the link (https://genome-euro.ucsc.edu/FAQ/FAQformat.html#format12).
← Back to Analysis