GSE143804 Processing Pipeline
ChIP-Seq
code_examples
9 steps
Publication
A super-enhancer-regulated RNA-binding protein cascade drives pancreatic cancer.Nature communications (2023) — PMID 37673892
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Reads trimmed using default paramters of Trim Galore!
$ Bash example
# Install Trim Galore! (and its dependency Cutadapt) if not already installed # conda install -c bioconda trim-galore # Define input files (adjust as needed for your data) # For paired-end reads: INPUT_READ1="input_read1.fastq.gz" INPUT_READ2="input_read2.fastq.gz" # For single-end reads (uncomment and use if applicable): # INPUT_READ="input_read.fastq.gz" # Define output directory (optional, Trim Galore! creates files in current dir by default) OUTPUT_DIR="." # Trim reads using Trim Galore! with default parameters. # Default parameters include: quality cutoff 20, minimum read length 20bp, adapter auto-detection. # For paired-end reads: trim_galore --paired "${INPUT_READ1}" "${INPUT_READ2}" --output_dir "${OUTPUT_DIR}" # For single-end reads (uncomment and use if applicable): # trim_galore "${INPUT_READ}" --output_dir "${OUTPUT_DIR}" -
2
Version 0.4.3
Unknown (Inferred with models/gemini-2.5-flash) v0.4.3$ Bash example
# The description 'Version 0.4.3' does not provide enough information to infer a specific tool, command, or parameters.
-
3
Trimmed reads aligned to human reference genome GRCh38, release 87 using bowtie2 (version 2.3.4)
$ Bash example
# Install Bowtie2 (if not already installed) # conda install -c bioconda bowtie2=2.3.4 # Create a directory for reference data mkdir -p reference_data cd reference_data # Download human reference genome GRCh38, release 87 (Ensembl) # Ensembl release 87 corresponds to GRCh38. We use the primary assembly. wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # Build Bowtie2 index for the reference genome # The index prefix will be 'grch38_r87_index' bowtie2-build Homo_sapiens.GRCh38.dna.primary_assembly.fa grch38_r87_index # Go back to the main working directory cd .. # Placeholder for trimmed reads file # Replace 'trimmed_reads.fastq' with the actual path to your input FASTQ file(s). # If your input is gzipped, ensure it's unzipped or use zcat/gunzip in a pipe. # Example for single-end reads: # trimmed_reads_file="path/to/your/trimmed_reads.fastq" # Example for paired-end reads: # trimmed_reads_file_1="path/to/your/trimmed_reads_R1.fastq" # trimmed_reads_file_2="path/to/your/trimmed_reads_R2.fastq" # Align trimmed reads to the human reference genome using Bowtie2 # -x: specify the index basename # -U: specify single-end FASTQ reads (use -1 and -2 for paired-end reads) # -S: specify SAM output file # -p: number of threads (optional, adjust as needed for performance) bowtie2 -x reference_data/grch38_r87_index -U trimmed_reads.fastq -S aligned_reads.sam -p 8 # Optional: Convert SAM to BAM, sort, and index (common post-alignment steps) # samtools view -bS aligned_reads.sam > aligned_reads.bam # samtools sort aligned_reads.bam -o aligned_reads.sorted.bam # samtools index aligned_reads.sorted.bam
-
4
Alignments filtered for reads with MAPQ score >= 11 using samtools (version 1.3.1)
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.3.1 # Filter alignments for reads with MAPQ score >= 11 # Replace input.bam with your actual input alignment file # Replace filtered.bam with your desired output file name samtools view -b -q 11 input.bam -o filtered.bam
-
5
Duplicates removed using picard tools (version 2.9.0)
$ Bash example
# Install Picard (if not already installed) # For example, using conda: # conda install -c bioconda picard=2.9.0 # Define input and output file names INPUT_BAM="input.bam" # Replace with your actual input BAM file (e.g., aligned.bam) OUTPUT_BAM="output.dedup.bam" # Replace with your desired output BAM file METRICS_FILE="dedup_metrics.txt" # Replace with your desired metrics file # Remove duplicates using Picard RemoveDuplicates # This command creates a new BAM file with duplicate reads removed. # Ensure 'picard.jar' is in your PATH or provide the full path to the JAR file. java -jar picard.jar RemoveDuplicates \ I=${INPUT_BAM} \ O=${OUTPUT_BAM} \ M=${METRICS_FILE} -
6
Peaks called using MACS2 (version 2.1.1.20160309)
$ Bash example
# Install MACS2 (e.g., via Bioconda) # conda install -c bioconda macs2 # Define input files and parameters (placeholders) # Replace with actual paths and desired values TREATMENT_BAM="path/to/your/treatment.bam" # Aligned ChIP-seq reads CONTROL_BAM="path/to/your/control.bam" # Aligned Input/Control reads OUTPUT_DIR="macs2_peaks" # Directory for output files GENOME_SIZE="hs" # Genome size: e.g., 'hs' for human, 'mm' for mouse, or a specific number like '2.7e9' Q_VALUE="0.01" # Q-value (FDR) cutoff for peak detection PREFIX="sample_name" # Prefix for output file names # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Execute MACS2 callpeak command macs2 callpeak \ -t "${TREATMENT_BAM}" \ -c "${CONTROL_BAM}" \ -f BAMPE \ -g "${GENOME_SIZE}" \ -n "${PREFIX}" \ -q "${Q_VALUE}" \ --outdir "${OUTPUT_DIR}" \ --keep-dup all # Keep all duplicate reads (common for ChIP-seq) -
7
Peaks filtered for known chromosomes only
filter_peaks_on_chromosomes.py (Inferred with models/gemini-2.5-flash) vv0.1.0 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Assume input_peaks.bed is the output from the previous peak calling step # Assume hg38 reference genome was used for alignment # Create a file with known chromosomes for hg38. # This list can be customized based on the specific reference genome and desired chromosomes. # For example, if using a different assembly or only canonical chromosomes. echo -e "chr1\nchr2\nchr3\nchr4\nchr5\nchr6\nchr7\nchr8\nchr9\nchr10\nchr11\nchr12\nchr13\nchr14\nchr15\nchr16\nchr17\nchr18\nchr19\nchr20\nchr21\nchr22\nchrX\nchrY\nchrM" > known_chromosomes_hg38.txt # The 'filter_peaks_on_chromosomes.py' script is part of the yeolab/eclip workflow. # It expects an input BED file, an output BED file, and a file listing known chromosomes. # Example of how to obtain the script (uncomment and run if not already available): # git clone https://github.com/yeolab/eclip.git # cd eclip # git checkout v0.1.0 # Or the specific version of the workflow being used # cd .. # Go back to the working directory # Execute the filtering script # Assuming the script is located at eclip/scripts/filter_peaks_on_chromosomes.py relative to the current directory python eclip/scripts/filter_peaks_on_chromosomes.py \ input_peaks.bed \ filtered_peaks_on_known_chromosomes.bed \ known_chromosomes_hg38.txt -
8
common peaks shared between replicates determined by findOverlapsOfPeaks() from ChIPpeakAnno R package (version 3.14.2)
$ Bash example
#!/bin/bash # Define output file for the common peaks OUTPUT_BED="common_reproducible_peaks.bed" # Create an R script to perform the peak overlap analysis cat << 'EOF' > run_findOverlapsOfPeaks.R # Installation instructions (uncomment and run if packages are not installed): # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # # BiocManager::install("ChIPpeakAnno") # BiocManager::install("rtracklayer") # Useful for importing/exporting genomic ranges # Load necessary libraries library(ChIPpeakAnno) library(GenomicRanges) # Provides GRanges objects library(rtracklayer) # For importing/exporting BED/narrowPeak files # --- Configuration for input peak files --- # IMPORTANT: Replace these placeholder file paths with your actual peak files. # These files should typically be in BED, narrowPeak, or similar formats. # Example of how to load actual peak files: # peak_file_rep1 <- "path/to/your/replicate1.narrowPeak" # peak_file_rep2 <- "path/to/your/replicate2.narrowPeak" # peaks_rep1 <- toGRanges(peak_file_rep1, format="narrowPeak") # peaks_rep2 <- toGRanges(peak_file_rep2, format="narrowPeak") # For demonstration purposes, we'll create dummy GRanges objects. # In a real scenario, you would load your peak files as shown above. # Dummy Replicate 1 peaks peaks_rep1 <- GRanges( seqnames = c("chr1", "chr1", "chr2", "chr3"), ranges = IRanges(start = c(100, 300, 50, 1000), end = c(200, 400, 150, 1100)), strand = "*", score = c(10, 15, 12, 20) ) # Dummy Replicate 2 peaks peaks_rep2 <- GRanges( seqnames = c("chr1", "chr1", "chr2", "chr3"), ranges = IRanges(start = c(150, 350, 60, 1050), end = c(250, 450, 160, 1150)), strand = "*", score = c(11, 16, 13, 22) ) # If you have more replicates, create similar GRanges objects for them. # Example for a third replicate: # peaks_rep3 <- GRanges( # seqnames = c("chr1", "chr1", "chr2"), # ranges = IRanges(start = c(120, 320, 70), end = c(220, 420, 170)), # strand = "*", # score = c(9, 14, 11) # ) # Create a list of peak GRanges objects for all replicates you want to compare. peaklist <- list( "replicate1" = peaks_rep1, "replicate2" = peaks_rep2 # "replicate3" = peaks_rep3 # Uncomment and add if you have more replicates ) # Determine common peaks shared between replicates using findOverlapsOfPeaks() # Default parameters: # minoverlap = 1 (at least 1 bp overlap) # maxgap = -1 (no gap allowed between overlapping features) # type = "any" (any overlap type, e.g., "start", "end", "within", "equal") # Adjust these parameters if your definition of "common" differs (e.g., minoverlap = 0.5 for 50% overlap). common_peaks_result <- findOverlapsOfPeaks(peaklist) # The result 'common_peaks_result' is a GRangesList containing several elements. # The '$overlappingPeaks' element provides the merged regions of common peaks, # which represent the reproducible peaks across replicates. merged_common_regions <- common_peaks_result$overlappingPeaks # Print the identified common (reproducible) peaks to standard output for inspection message("Identified common (reproducible) peaks:") print(merged_common_regions) # Save the common peaks to a BED file output_file <- Sys.getenv("OUTPUT_BED") if (output_file != "") { message(paste("Saving common peaks to:", output_file)) # For BED format, typically only seqnames, ranges, and strand are used. # You can add score or other metadata as extra columns if desired. export(merged_common_regions, output_file, format = "bed") } else { message("OUTPUT_BED environment variable not set. Not saving common peaks to a file.") } EOF # Execute the R script # Ensure R and the necessary Bioconductor packages (ChIPpeakAnno, rtracklayer) are installed # and accessible in your system's PATH. Rscript run_findOverlapsOfPeaks.R -
9
common peaks shared between treatments determined by findOverlapsOfPeaks() from ChIPpeakAnno R package (version 3.14.2)
$ Bash example
# Install R and Bioconductor if not already present # For Bioconductor 3.14 (R 4.1), use: # Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # Rscript -e 'BiocManager::install(c("ChIPpeakAnno", "rtracklayer"), version = "3.14")' # Create dummy peak files for demonstration # In a real scenario, these would be actual peak files from your treatments cat <<EOF > treatment1_peaks.bed chr1 1000 1500 peak1_t1 1000 . chr1 2000 2500 peak2_t1 800 . chr2 500 1000 peak3_t1 900 . EOF cat <<EOF > treatment2_peaks.bed chr1 1200 1700 peak1_t2 1100 . chr1 2300 2800 peak2_t2 950 . chr2 600 1100 peak3_t2 1050 . chr3 100 200 peak4_t2 700 . EOF Rscript -e ' library(ChIPpeakAnno) library(GenomicRanges) library(rtracklayer) # Load peak files # Assuming peak files are in BED format. Adjust format argument if needed (e.g., "narrowPeak"). peaks_t1 <- import("treatment1_peaks.bed", format="BED") peaks_t2 <- import("treatment2_peaks.bed", format="BED") # Create a list of GRanges objects for input to findOverlapsOfPeaks peaklist <- GRangesList(treatment1 = peaks_t1, treatment2 = peaks_t2) # Find common peaks using findOverlapsOfPeaks. # The `overlappingPeaks` slot of the returned object is a GRanges object # representing the union of all overlapping regions across the input peak lists. # `minoverlap = 1L` means at least 1 base pair overlap is required. overlaps_obj <- findOverlapsOfPeaks(peaklist, minoverlap = 1L) # Extract the GRanges object containing the actual common genomic regions. common_regions <- overlaps_obj$overlappingPeaks # Save the common regions to a BED file. # The output BED file will contain the genomic coordinates of the regions # where peaks from different treatments overlap. export(common_regions, "common_peaks.bed", format="BED") message("Common peaks saved to common_peaks.bed") '
Tools Used
Raw Source Text
Reads trimmed using default paramters of Trim Galore! Version 0.4.3 Trimmed reads aligned to human reference genome GRCh38, release 87 using bowtie2 (version 2.3.4) Alignments filtered for reads with MAPQ score >= 11 using samtools (version 1.3.1) Duplicates removed using picard tools (version 2.9.0) Peaks called using MACS2 (version 2.1.1.20160309) Peaks filtered for known chromosomes only common peaks shared between replicates determined by findOverlapsOfPeaks() from ChIPpeakAnno R package (version 3.14.2) common peaks shared between treatments determined by findOverlapsOfPeaks() from ChIPpeakAnno R package (version 3.14.2) Genome_build: GRCh38