GSE125954 Processing Pipeline
ChIP-Seq
code_examples
4 steps
Publication
The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.Cell stem cell (2019) — PMID 31588046
Dataset
GSE125954The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [ChIP-Seq]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Reads were trimmed using cutadapt and aligned to the hg19 genome using BWA.
$ Bash example
# Install cutadapt # conda install -c bioconda cutadapt # Install bwa # conda install -c bioconda bwa # Install samtools (often needed with BWA for post-processing) # conda install -c bioconda samtools # Define input files and reference genome READ1="read1.fastq.gz" READ2="read2.fastq.gz" HG19_REF="path/to/hg19.fa" # Placeholder for hg19 reference genome # --- Step 1: Trim reads using cutadapt --- # Common Illumina universal adapters are used here. Adjust if specific adapters are known. # -a: 3' adapter for R1 # -A: 3' adapter for R2 # -q 20: Trim low-quality ends (quality below 20) # --minimum-length 20: Discard reads shorter than 20 bp after trimming # -o: Output file for R1 # -p: Output file for R2 cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \ -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \ -q 20 \ --minimum-length 20 \ -o trimmed_${READ1} \ -p trimmed_${READ2} \ ${READ1} ${READ2} # --- Step 2: Align trimmed reads to hg19 using BWA-MEM --- # -M: Mark shorter split hits as secondary (for Picard compatibility) # -t 8: Use 8 threads (adjust based on available CPU cores) # | samtools view -Sb - > aligned.bam: Pipe SAM output to samtools to convert to unsorted BAM bwa mem -M -t 8 ${HG19_REF} trimmed_${READ1} trimmed_${READ2} | samtools view -Sb - > aligned.bam # --- Step 3: Sort and index the BAM file --- samtools sort -o aligned.sorted.bam aligned.bam samtools index aligned.sorted.bam # Note: The hg19 reference genome must be indexed for BWA prior to alignment. # If not already indexed, run: # bwa index ${HG19_REF} -
2
Duplicate reads were marked using picard tools, and reads were filtered for a mapq score > 15.
$ Bash example
# Install Picard (if not already installed) # For example, using conda: # conda install -c bioconda picard # Install Samtools (if not already installed) # For example, using conda: # conda install -c bioconda samtools # Mark duplicate reads using Picard MarkDuplicates # Assuming input.bam is the aligned and sorted BAM file picard MarkDuplicates \ I=input.bam \ O=marked_duplicates.bam \ M=markduplicates.metrics \ ASSUME_SORTED=true \ VALIDATION_STRINGENCY=SILENT # Filter reads for a MAPQ score > 15 using samtools view samtools view -b -q 15 marked_duplicates.bam > filtered_mapq.bam samtools index filtered_mapq.bam -
3
Macs2 callpeak was used to call peaks, with --broad used for H3K9me3.
$ Bash example
# Install MACS2 (if not already installed) # conda install -c bioconda macs2 # Define input and output files/directories (placeholders) INPUT_BAM="input.bam" # Path to the H3K9me3 ChIP-seq alignment file CONTROL_BAM="control.bam" # Path to the control (input) alignment file OUTPUT_DIR="macs2_peaks_H3K9me3" PREFIX="H3K9me3_broad_peaks" GENOME_SIZE="hs" # For human, can also be 'mm' for mouse, or a specific number like '2.7e9' # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run MACS2 callpeak with --broad for H3K9me3 macs2 callpeak \ -t "${INPUT_BAM}" \ -c "${CONTROL_BAM}" \ -f BAM \ -g "${GENOME_SIZE}" \ -n "${PREFIX}" \ --outdir "${OUTPUT_DIR}" \ --broad \ --broad-cutoff 0.1 \ -q 0.05 -
4
To generate peak sets, macs2 called peaks were extended by 500bp then only kept if present and overlapping in both biological replicates.
$ Bash example
# Install bedtools if not already present # conda install -c bioconda bedtools # Define input MACS2 peak files for replicates # These are assumed to be the 'macs2 called peaks' mentioned in the description. REP1_MACS2_PEAKS="rep1_macs2_peaks.bed" REP2_MACS2_PEAKS="rep2_macs2_peaks.bed" # Define genome sizes file (e.g., for hg38). This is crucial for bedtools slop to prevent extending off chromosome ends. # Replace 'hg38.chrom.sizes' with the actual genome sizes file used in your pipeline (e.g., from UCSC). GENOME_SIZES="hg38.chrom.sizes" # Step 1: Extend MACS2 called peaks from replicate 1 by 500bp on both sides # bedtools slop adds 'b' base pairs to both ends of each feature. bedtools slop -i "${REP1_MACS2_PEAKS}" -g "${GENOME_SIZES}" -b 500 > rep1_macs2_peaks_extended.bed # Step 2: Extend MACS2 called peaks from replicate 2 by 500bp on both sides bedtools slop -i "${REP2_MACS2_PEAKS}" -g "${GENOME_SIZES}" -b 500 > rep2_macs2_peaks_extended.bed # Step 3: Identify peaks that are present and overlapping in both biological replicates. # This involves intersecting the extended peaks from both replicates bidirectionally. # The '-wao' option writes the original entry in A, the original entry in B, and the number of base pairs of overlap. # Intersect extended peaks from replicate 1 with replicate 2 bedtools intersect -a rep1_macs2_peaks_extended.bed -b rep2_macs2_peaks_extended.bed -wao > rep1_rep2_overlap.tmp # Intersect extended peaks from replicate 2 with replicate 1 bedtools intersect -a rep2_macs2_peaks_extended.bed -b rep1_macs2_peaks_extended.bed -wao > rep2_rep1_overlap.tmp # Step 4: Combine all identified overlapping regions, sort them, and then merge adjacent or overlapping regions # The awk command extracts the chromosome, start, and end coordinates of the original peak from the first file (A) that had an overlap. # This effectively creates a union of all peaks from both replicates that showed an overlap. cat rep1_rep2_overlap.tmp rep2_rep1_overlap.tmp | \ awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' | \ sort -k1,1 -k2,2n | \ bedtools merge -i - > reproducible_macs2_peaks.bed # Clean up temporary files rm rep1_macs2_peaks_extended.bed rep2_macs2_peaks_extended.bed rep1_rep2_overlap.tmp rep2_rep1_overlap.tmp
Raw Source Text
Reads were trimmed using cutadapt and aligned to the hg19 genome using BWA. Duplicate reads were marked using picard tools, and reads were filtered for a mapq score > 15. Macs2 callpeak was used to call peaks, with --broad used for H3K9me3. To generate peak sets, macs2 called peaks were extended by 500bp then only kept if present and overlapping in both biological replicates. Genome_build: hg19 Supplementary_files_format_and_content: Processed peak files are in bed format: chr, start, end.