GSE143804 Processing Pipeline

ChIP-Seq code_examples 9 steps

Publication

A super-enhancer-regulated RNA-binding protein cascade drives pancreatic cancer.

Nature communications (2023) — PMID 37673892

Dataset

GSE143804

MYC ChIP-seq in human pancreatic ductal adenocarcinoma cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Reads trimmed using default paramters of Trim Galore!

    Trim Galore v0.6.10 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install Trim Galore! (and its dependency Cutadapt) if not already installed
    # conda install -c bioconda trim-galore
    
    # Define input files (adjust as needed for your data)
    # For paired-end reads:
    INPUT_READ1="input_read1.fastq.gz"
    INPUT_READ2="input_read2.fastq.gz"
    
    # For single-end reads (uncomment and use if applicable):
    # INPUT_READ="input_read.fastq.gz"
    
    # Define output directory (optional, Trim Galore! creates files in current dir by default)
    OUTPUT_DIR="."
    
    # Trim reads using Trim Galore! with default parameters.
    # Default parameters include: quality cutoff 20, minimum read length 20bp, adapter auto-detection.
    # For paired-end reads:
    trim_galore --paired "${INPUT_READ1}" "${INPUT_READ2}" --output_dir "${OUTPUT_DIR}"
    
    # For single-end reads (uncomment and use if applicable):
    # trim_galore "${INPUT_READ}" --output_dir "${OUTPUT_DIR}"
    
  2. 2

    Version 0.4.3

    Unknown (Inferred with models/gemini-2.5-flash) v0.4.3
    $ Bash example
    # The description 'Version 0.4.3' does not provide enough information to infer a specific tool, command, or parameters.
  3. 3

    Trimmed reads aligned to human reference genome GRCh38, release 87 using bowtie2 (version 2.3.4)

    $ Bash example
    # Install Bowtie2 (if not already installed)
    # conda install -c bioconda bowtie2=2.3.4
    
    # Create a directory for reference data
    mkdir -p reference_data
    cd reference_data
    
    # Download human reference genome GRCh38, release 87 (Ensembl)
    # Ensembl release 87 corresponds to GRCh38. We use the primary assembly.
    wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    
    # Build Bowtie2 index for the reference genome
    # The index prefix will be 'grch38_r87_index'
    bowtie2-build Homo_sapiens.GRCh38.dna.primary_assembly.fa grch38_r87_index
    
    # Go back to the main working directory
    cd ..
    
    # Placeholder for trimmed reads file
    # Replace 'trimmed_reads.fastq' with the actual path to your input FASTQ file(s).
    # If your input is gzipped, ensure it's unzipped or use zcat/gunzip in a pipe.
    # Example for single-end reads:
    # trimmed_reads_file="path/to/your/trimmed_reads.fastq"
    # Example for paired-end reads:
    # trimmed_reads_file_1="path/to/your/trimmed_reads_R1.fastq"
    # trimmed_reads_file_2="path/to/your/trimmed_reads_R2.fastq"
    
    # Align trimmed reads to the human reference genome using Bowtie2
    # -x: specify the index basename
    # -U: specify single-end FASTQ reads (use -1 and -2 for paired-end reads)
    # -S: specify SAM output file
    # -p: number of threads (optional, adjust as needed for performance)
    bowtie2 -x reference_data/grch38_r87_index -U trimmed_reads.fastq -S aligned_reads.sam -p 8
    
    # Optional: Convert SAM to BAM, sort, and index (common post-alignment steps)
    # samtools view -bS aligned_reads.sam > aligned_reads.bam
    # samtools sort aligned_reads.bam -o aligned_reads.sorted.bam
    # samtools index aligned_reads.sorted.bam
  4. 4

    Alignments filtered for reads with MAPQ score >= 11 using samtools (version 1.3.1)

    samtools v1.3.1 GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools=1.3.1
    
    # Filter alignments for reads with MAPQ score >= 11
    # Replace input.bam with your actual input alignment file
    # Replace filtered.bam with your desired output file name
    samtools view -b -q 11 input.bam -o filtered.bam
  5. 5

    Duplicates removed using picard tools (version 2.9.0)

    Picard v2.9.0 GitHub
    $ Bash example
    # Install Picard (if not already installed)
    # For example, using conda:
    # conda install -c bioconda picard=2.9.0
    
    # Define input and output file names
    INPUT_BAM="input.bam" # Replace with your actual input BAM file (e.g., aligned.bam)
    OUTPUT_BAM="output.dedup.bam" # Replace with your desired output BAM file
    METRICS_FILE="dedup_metrics.txt" # Replace with your desired metrics file
    
    # Remove duplicates using Picard RemoveDuplicates
    # This command creates a new BAM file with duplicate reads removed.
    # Ensure 'picard.jar' is in your PATH or provide the full path to the JAR file.
    java -jar picard.jar RemoveDuplicates \
        I=${INPUT_BAM} \
        O=${OUTPUT_BAM} \
        M=${METRICS_FILE}
  6. 6

    Peaks called using MACS2 (version 2.1.1.20160309)

    MACS2 v2.1.1 GitHub
    $ Bash example
    # Install MACS2 (e.g., via Bioconda)
    # conda install -c bioconda macs2
    
    # Define input files and parameters (placeholders)
    # Replace with actual paths and desired values
    TREATMENT_BAM="path/to/your/treatment.bam" # Aligned ChIP-seq reads
    CONTROL_BAM="path/to/your/control.bam"     # Aligned Input/Control reads
    OUTPUT_DIR="macs2_peaks"                   # Directory for output files
    GENOME_SIZE="hs"                           # Genome size: e.g., 'hs' for human, 'mm' for mouse, or a specific number like '2.7e9'
    Q_VALUE="0.01"                             # Q-value (FDR) cutoff for peak detection
    PREFIX="sample_name"                       # Prefix for output file names
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Execute MACS2 callpeak command
    macs2 callpeak \
        -t "${TREATMENT_BAM}" \
        -c "${CONTROL_BAM}" \
        -f BAMPE \
        -g "${GENOME_SIZE}" \
        -n "${PREFIX}" \
        -q "${Q_VALUE}" \
        --outdir "${OUTPUT_DIR}" \
        --keep-dup all # Keep all duplicate reads (common for ChIP-seq)
  7. 7

    Peaks filtered for known chromosomes only

    filter_peaks_on_chromosomes.py (Inferred with models/gemini-2.5-flash) vv0.1.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Assume input_peaks.bed is the output from the previous peak calling step
    # Assume hg38 reference genome was used for alignment
    
    # Create a file with known chromosomes for hg38.
    # This list can be customized based on the specific reference genome and desired chromosomes.
    # For example, if using a different assembly or only canonical chromosomes.
    echo -e "chr1\nchr2\nchr3\nchr4\nchr5\nchr6\nchr7\nchr8\nchr9\nchr10\nchr11\nchr12\nchr13\nchr14\nchr15\nchr16\nchr17\nchr18\nchr19\nchr20\nchr21\nchr22\nchrX\nchrY\nchrM" > known_chromosomes_hg38.txt
    
    # The 'filter_peaks_on_chromosomes.py' script is part of the yeolab/eclip workflow.
    # It expects an input BED file, an output BED file, and a file listing known chromosomes.
    
    # Example of how to obtain the script (uncomment and run if not already available):
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    # git checkout v0.1.0 # Or the specific version of the workflow being used
    # cd .. # Go back to the working directory
    
    # Execute the filtering script
    # Assuming the script is located at eclip/scripts/filter_peaks_on_chromosomes.py relative to the current directory
    python eclip/scripts/filter_peaks_on_chromosomes.py \
        input_peaks.bed \
        filtered_peaks_on_known_chromosomes.bed \
        known_chromosomes_hg38.txt
  8. 8

    common peaks shared between replicates determined by findOverlapsOfPeaks() from ChIPpeakAnno R package (version 3.14.2)

    $ Bash example
    #!/bin/bash
    
    # Define output file for the common peaks
    OUTPUT_BED="common_reproducible_peaks.bed"
    
    # Create an R script to perform the peak overlap analysis
    cat << 'EOF' > run_findOverlapsOfPeaks.R
    # Installation instructions (uncomment and run if packages are not installed):
    # if (!requireNamespace("BiocManager", quietly = TRUE))
    #     install.packages("BiocManager")
    #
    # BiocManager::install("ChIPpeakAnno")
    # BiocManager::install("rtracklayer") # Useful for importing/exporting genomic ranges
    
    # Load necessary libraries
    library(ChIPpeakAnno)
    library(GenomicRanges) # Provides GRanges objects
    library(rtracklayer)   # For importing/exporting BED/narrowPeak files
    
    # --- Configuration for input peak files ---
    # IMPORTANT: Replace these placeholder file paths with your actual peak files.
    # These files should typically be in BED, narrowPeak, or similar formats.
    # Example of how to load actual peak files:
    # peak_file_rep1 <- "path/to/your/replicate1.narrowPeak"
    # peak_file_rep2 <- "path/to/your/replicate2.narrowPeak"
    # peaks_rep1 <- toGRanges(peak_file_rep1, format="narrowPeak")
    # peaks_rep2 <- toGRanges(peak_file_rep2, format="narrowPeak")
    
    # For demonstration purposes, we'll create dummy GRanges objects.
    # In a real scenario, you would load your peak files as shown above.
    
    # Dummy Replicate 1 peaks
    peaks_rep1 <- GRanges(
        seqnames = c("chr1", "chr1", "chr2", "chr3"),
        ranges = IRanges(start = c(100, 300, 50, 1000), end = c(200, 400, 150, 1100)),
        strand = "*",
        score = c(10, 15, 12, 20)
    )
    
    # Dummy Replicate 2 peaks
    peaks_rep2 <- GRanges(
        seqnames = c("chr1", "chr1", "chr2", "chr3"),
        ranges = IRanges(start = c(150, 350, 60, 1050), end = c(250, 450, 160, 1150)),
        strand = "*",
        score = c(11, 16, 13, 22)
    )
    
    # If you have more replicates, create similar GRanges objects for them.
    # Example for a third replicate:
    # peaks_rep3 <- GRanges(
    #     seqnames = c("chr1", "chr1", "chr2"),
    #     ranges = IRanges(start = c(120, 320, 70), end = c(220, 420, 170)),
    #     strand = "*",
    #     score = c(9, 14, 11)
    # )
    
    # Create a list of peak GRanges objects for all replicates you want to compare.
    peaklist <- list(
        "replicate1" = peaks_rep1,
        "replicate2" = peaks_rep2
        # "replicate3" = peaks_rep3 # Uncomment and add if you have more replicates
    )
    
    # Determine common peaks shared between replicates using findOverlapsOfPeaks()
    # Default parameters:
    #   minoverlap = 1 (at least 1 bp overlap)
    #   maxgap = -1 (no gap allowed between overlapping features)
    #   type = "any" (any overlap type, e.g., "start", "end", "within", "equal")
    # Adjust these parameters if your definition of "common" differs (e.g., minoverlap = 0.5 for 50% overlap).
    common_peaks_result <- findOverlapsOfPeaks(peaklist)
    
    # The result 'common_peaks_result' is a GRangesList containing several elements.
    # The '$overlappingPeaks' element provides the merged regions of common peaks,
    # which represent the reproducible peaks across replicates.
    merged_common_regions <- common_peaks_result$overlappingPeaks
    
    # Print the identified common (reproducible) peaks to standard output for inspection
    message("Identified common (reproducible) peaks:")
    print(merged_common_regions)
    
    # Save the common peaks to a BED file
    output_file <- Sys.getenv("OUTPUT_BED")
    if (output_file != "") {
        message(paste("Saving common peaks to:", output_file))
        # For BED format, typically only seqnames, ranges, and strand are used.
        # You can add score or other metadata as extra columns if desired.
        export(merged_common_regions, output_file, format = "bed")
    } else {
        message("OUTPUT_BED environment variable not set. Not saving common peaks to a file.")
    }
    EOF
    
    # Execute the R script
    # Ensure R and the necessary Bioconductor packages (ChIPpeakAnno, rtracklayer) are installed
    # and accessible in your system's PATH.
    Rscript run_findOverlapsOfPeaks.R
    
  9. 9

    common peaks shared between treatments determined by findOverlapsOfPeaks() from ChIPpeakAnno R package (version 3.14.2)

    $ Bash example
    # Install R and Bioconductor if not already present
    # For Bioconductor 3.14 (R 4.1), use:
    # Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # Rscript -e 'BiocManager::install(c("ChIPpeakAnno", "rtracklayer"), version = "3.14")'
    
    # Create dummy peak files for demonstration
    # In a real scenario, these would be actual peak files from your treatments
    cat <<EOF > treatment1_peaks.bed
    chr1    1000    1500    peak1_t1    1000    .
    chr1    2000    2500    peak2_t1    800     .
    chr2    500     1000    peak3_t1    900     .
    EOF
    
    cat <<EOF > treatment2_peaks.bed
    chr1    1200    1700    peak1_t2    1100    .
    chr1    2300    2800    peak2_t2    950     .
    chr2    600     1100    peak3_t2    1050    .
    chr3    100     200     peak4_t2    700     .
    EOF
    
    Rscript -e '
      library(ChIPpeakAnno)
      library(GenomicRanges)
      library(rtracklayer)
    
      # Load peak files
      # Assuming peak files are in BED format. Adjust format argument if needed (e.g., "narrowPeak").
      peaks_t1 <- import("treatment1_peaks.bed", format="BED")
      peaks_t2 <- import("treatment2_peaks.bed", format="BED")
    
      # Create a list of GRanges objects for input to findOverlapsOfPeaks
      peaklist <- GRangesList(treatment1 = peaks_t1, treatment2 = peaks_t2)
    
      # Find common peaks using findOverlapsOfPeaks.
      # The `overlappingPeaks` slot of the returned object is a GRanges object
      # representing the union of all overlapping regions across the input peak lists.
      # `minoverlap = 1L` means at least 1 base pair overlap is required.
      overlaps_obj <- findOverlapsOfPeaks(peaklist, minoverlap = 1L)
    
      # Extract the GRanges object containing the actual common genomic regions.
      common_regions <- overlaps_obj$overlappingPeaks
    
      # Save the common regions to a BED file.
      # The output BED file will contain the genomic coordinates of the regions
      # where peaks from different treatments overlap.
      export(common_regions, "common_peaks.bed", format="BED")
    
      message("Common peaks saved to common_peaks.bed")
    '
    

Tools Used

Raw Source Text
Reads trimmed using default paramters of Trim Galore! Version 0.4.3
Trimmed reads aligned to human reference genome GRCh38, release 87 using bowtie2 (version 2.3.4)
Alignments filtered for reads with MAPQ score >= 11 using samtools (version 1.3.1)
Duplicates removed using picard tools (version 2.9.0)
Peaks called using MACS2 (version 2.1.1.20160309)
Peaks filtered for known chromosomes only
common peaks shared between replicates determined by findOverlapsOfPeaks() from ChIPpeakAnno R package (version 3.14.2)
common peaks shared between treatments determined by findOverlapsOfPeaks() from ChIPpeakAnno R package (version 3.14.2)
Genome_build: GRCh38
← Back to Analysis