GSE120104 Processing Pipeline
ChIP-Seq
code_examples
5 steps
Publication
Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription.Cell (2019) — PMID 31251911
Dataset
GSE120104Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription [ChIP-Seq]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
ChIP-seq data were processed in accordance with ENCODE uniform transcription factor ChIP-seq pipeline (https://www.encodeproject.org/chip-seq/transcription_factor/).
$ Bash example
# Clone the ENCODE ChIP-seq pipeline repository (if not already cloned) # git clone https://github.com/ENCODE-DCC/chip-seq-pipeline2.git # cd chip-seq-pipeline2 # Create an input JSON file for a paired-end Transcription Factor ChIP-seq experiment (hg38). # Replace '/path/to/...' with your actual FASTQ file paths. # Reference genome files are sourced from ENCODE Project (GRCh38_gencode_v34). cat << EOF > input.json { "chip.fastqs_rep1_R1": ["/path/to/rep1_R1.fastq.gz"], "chip.fastqs_rep1_R2": ["/path/to/rep1_R2.fastq.gz"], "chip.fastqs_ctl_R1": ["/path/to/control_R1.fastq.gz"], "chip.fastqs_ctl_R2": ["/path/to/control_R2.fastq.gz"], "chip.paired_end": true, "chip.pipeline_type": "tf", "chip.genome_tsv": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.tsv", "chip.blacklist_bed": "https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz", "chip.chrsz": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.chrsz.tsv", "chip.ref_fa": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.fa.gz", "chip.bowtie2_idx": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.bowtie2_idx.tar", "chip.bwa_idx": "https://www.encodeproject.org/files/GRCh38_gencode_v34/@@download/GRCh38_gencode_v34.bwa_idx.tar", "chip.output_prefix": "my_tf_chipseq_hg38" } EOF # Execute the ENCODE ChIP-seq pipeline using Cromwell. # This command assumes 'cromwell.jar' is in your PATH or specified with its full path, # and that you are running this from the root directory of the cloned 'chip-seq-pipeline2' repository. # Replace 'cromwell.jar' with the actual path to your Cromwell JAR file if not in PATH. # The 'chip.wdl' file is the main workflow definition. java -jar cromwell.jar run chip.wdl -i input.json -
2
Briefly, ChIP-seq reads were aligned to the GRCh37 human genome using BWA
$ Bash example
# Install BWA (if not already installed) # conda install -c bioconda bwa # Install Samtools (if not already installed) # conda install -c bioconda samtools # Define variables REF_GENOME="/path/to/GRCh37/GRCh37.fa" # Placeholder for GRCh37 human genome reference READS_R1="input_reads_R1.fastq.gz" # Placeholder for ChIP-seq forward reads READS_R2="input_reads_R2.fastq.gz" # Placeholder for ChIP-seq reverse reads (if paired-end) OUTPUT_BAM="aligned_reads.bam" OUTPUT_SORTED_BAM="aligned_reads.sorted.bam" # 1. Index the reference genome (if not already indexed) # This step only needs to be run once per reference genome # bwa index ${REF_GENOME} # 2. Align reads to the GRCh37 human genome using BWA-MEM # Assuming paired-end reads for ChIP-seq, adjust for single-end if necessary bwa mem -M -t 8 ${REF_GENOME} ${READS_R1} ${READS_R2} | samtools view -bS - > ${OUTPUT_BAM} # 3. Sort the resulting BAM file samtools sort ${OUTPUT_BAM} -o ${OUTPUT_SORTED_BAM} # 4. Index the sorted BAM file samtools index ${OUTPUT_SORTED_BAM} -
3
PCR duplicates and reads of low quality were filtered
samtools (Inferred with models/gemini-2.5-flash) v1.9 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # Filter out reads marked as PCR duplicates (FLAG 1024) and reads with mapping quality less than 10 samtools view -F 1024 -q 10 -b input.bam > filtered.bam
-
4
Peak-calling was performed with SPP and IDR with 0.02 IDR threshhold
IDR v0.02$ Bash example
# Install IDR (example using pip, check official documentation for best practice) # pip install idr # IDR (Irreproducible Discovery Rate) analysis # This step takes peak files (e.g., from SPP for two replicates) and identifies reproducible peaks. # The '0.02 IDR threshhold' refers to the --output-threshold parameter, meaning peaks with an IDR score <= 0.02 are considered reproducible. idr --samples rep1_spp_peaks.narrowPeak rep2_spp_peaks.narrowPeak \ --input-file-type narrowPeak \ --rank-by signal.value \ --output-file idr_reproducible_peaks.narrowPeak \ --plot \ --log-file idr_log.txt \ --output-threshold 0.02 -
5
Signal tracks were generated with MACS2 and bigWig applications from UCSC genome browser (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/)
MACS2 vMACS2: 2.2.7.1 (Inferred with models/gemini-2.5-flash); UCSC bedGraphToBigWig: Not explicitly versioned, typically latest available at time of download. GitHub$ Bash example
# Placeholder variables TREATMENT_BAM="path/to/treatment.bam" CONTROL_BAM="path/to/control.bam" # Optional, but common for MACS2 OUTPUT_PREFIX="sample_name" GENOME_ASSEMBLY="hg38" # Latest assembly as placeholder OUTPUT_DIR="macs2_output" CHROM_SIZES_FILE="${GENOME_ASSEMBLY}.chrom.sizes" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # --- MACS2 Installation (commented out) --- # conda install -c bioconda macs2 # --- UCSC Tools Installation (commented out) --- # Download bedGraphToBigWig from UCSC Genome Browser utilities # wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig # chmod +x bedGraphToBigWig # mv bedGraphToBigWig /usr/local/bin/ # Or add to PATH # --- Generate/Download chrom.sizes file --- # This file is required by bedGraphToBigWig. Download from UCSC Genome Browser for the specified assembly. wget -O "${CHROM_SIZES_FILE}" "http://hgdownload.soe.ucsc.edu/goldenPath/${GENOME_ASSEMBLY}/bigZips/${GENOME_ASSEMBLY}.chrom.sizes" # --- MACS2 Execution --- # Generate signal tracks (bedGraph files) using MACS2 callpeak with -B and -SPMR options. # -t: treatment file (BAM format) # -c: control file (BAM format, optional but recommended for ChIP-seq/eCLIP) # -f BAM: input file format (BAM for single-end or paired-end if not using BAMPE) # -g hs: genome size (e.g., hs for human, mm for mouse). Use a more precise value if known. # -n: name string for output files # -B: store fragment pileup and control lambda values in bedGraph files # -SPMR: calculate signal per million reads (SPMR) # --outdir: output directory macs2 callpeak -t "${TREATMENT_BAM}" \ -c "${CONTROL_BAM}" \ -f BAM \ -g hs \ -n "${OUTPUT_PREFIX}" \ -B \ -SPMR \ --outdir "${OUTPUT_DIR}" # --- Convert MACS2 bedGraph outputs to bigWig using UCSC bedGraphToBigWig --- # MACS2 generates several bedGraph files with -B. # Common signal tracks are: # 1. ${OUTPUT_PREFIX}_treat_pileup.bdg (raw signal pileup) # 2. ${OUTPUT_PREFIX}_control_lambda.bdg (background lambda values) # Often, fold enrichment (FE) or -log10(pvalue) signal is desired, which requires an additional 'macs2 bdgcmp' step. # Convert treatment pileup bedGraph to bigWig bedGraphToBigWig "${OUTPUT_DIR}/${OUTPUT_PREFIX}_treat_pileup.bdg" \ "${CHROM_SIZES_FILE}" \ "${OUTPUT_DIR}/${OUTPUT_PREFIX}_treat_pileup.bw" # Example: Generate and convert fold enrichment signal if desired # macs2 bdgcmp -t "${OUTPUT_DIR}/${OUTPUT_PREFIX}_treat_pileup.bdg" \ # -c "${OUTPUT_DIR}/${OUTPUT_PREFIX}_control_lambda.bdg" \ # -m FE \ # -o "${OUTPUT_DIR}/${OUTPUT_PREFIX}_FE.bdg" # bedGraphToBigWig "${OUTPUT_DIR}/${OUTPUT_PREFIX}_FE.bdg" \ # "${CHROM_SIZES_FILE}" \ # "${OUTPUT_DIR}/${OUTPUT_PREFIX}_FE.bw"
Tools Used
Raw Source Text
ChIP-seq data were processed in accordance with ENCODE uniform transcription factor ChIP-seq pipeline (https://www.encodeproject.org/chip-seq/transcription_factor/). Briefly, ChIP-seq reads were aligned to the GRCh37 human genome using BWA; PCR duplicates and reads of low quality were filtered; Peak-calling was performed with SPP and IDR with 0.02 IDR threshhold; Signal tracks were generated with MACS2 and bigWig applications from UCSC genome browser (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/) Genome_build: GRCh37 (hg19) Supplementary_files_format_and_content: Peak files were narrowPeak files generated by SPP; column information of narrowPeak file is explicated at the link (https://genome-euro.ucsc.edu/FAQ/FAQformat.html#format12).