GSE181136 Processing Pipeline

GSE code_examples 15 steps

Publication

Identification of the global miR-130a targetome reveals a role for TBL1XR1 in hematopoietic stem cell self-renewal and t(8;21) AML.

Cell reports (2022) — PMID 35263585

Dataset

GSE181136

Global Identification of the miR-130a targetome in human HSPC Reveals a Novel Role for TBL1XR1 in Hematopoietic Stem Cell Self-Renewal and t(8;21) A…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Library strategy: CUT&RUN

    ChIP-seq vLatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install Nextflow if not already installed
    # curl -s https://get.nextflow.io | bash
    # mv nextflow /usr/local/bin/
    
    # Create a dummy directory for input FASTQ files
    # mkdir -p data
    
    # Create dummy FASTQ files for demonstration purposes.
    # In a real scenario, replace these with your actual gzipped FASTQ files.
    # touch data/H3K27me3_rep1_R1.fastq.gz data/H3K27me3_rep1_R2.fastq.gz
    # touch data/IgG_rep1_R1.fastq.gz data/IgG_rep1_R2.fastq.gz
    
    # Create a samples.tsv file to define input samples for the ENCODE ChIP-seq pipeline.
    # For CUT&RUN, this typically includes target samples (e.g., H3K27me3) and an IgG control.
    # The 'control_id' column links target samples to their corresponding control.
    cat << EOF > samples.tsv
    group	replicate	fastq_rep1_R1	fastq_rep1_R2	fastq_rep2_R1	fastq_rep2_R2	antibody	control_id
    H3K27me3_sample	1	data/H3K27me3_rep1_R1.fastq.gz	data/H3K27me3_rep1_R2.fastq.gz			H3K27me3	IgG_control
    IgG_control	1	data/IgG_rep1_R1.fastq.gz	data/IgG_rep1_R2.fastq.gz			IgG	
    EOF
    
    # Run the ENCODE ChIP-seq pipeline using Nextflow.
    # -profile docker: Uses Docker for reproducible execution. Ensure Docker is installed and running.
    # --input samples.tsv: Specifies the input sample sheet created above.
    # --genome GRCh38: Specifies the reference genome (e.g., human hg38/GRCh38).
    # --paired_end true: Indicates that the input reads are paired-end.
    # --outdir results: Specifies the directory where pipeline outputs will be stored.
    nextflow run ENCODE-DCC/chip-seq-pipeline2 \
        -profile docker \
        --input samples.tsv \
        --genome GRCh38 \
        --paired_end true \
        --outdir results
  2. 2

    CUT&RUN paired-end data was trimmed using fastp v.

    fastp v0.23.4 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install fastp (if not already installed)
    # conda install -c bioconda fastp
    
    # Define input and output file names (placeholders)
    INPUT_R1="sample_R1.fastq.gz"
    INPUT_R2="sample_R2.fastq.gz"
    OUTPUT_R1="sample_R1.trimmed.fastq.gz"
    OUTPUT_R2="sample_R2.trimmed.fastq.gz"
    REPORT_JSON="sample.fastp.json"
    REPORT_HTML="sample.fastp.html"
    
    # Execute fastp for paired-end trimming
    fastp \
      -i "${INPUT_R1}" \
      -I "${INPUT_R2}" \
      -o "${OUTPUT_R1}" \
      -O "${OUTPUT_R2}" \
      -j "${REPORT_JSON}" \
      -h "${REPORT_HTML}"
  3. 3

    0.19.5 to remove adapters and low quality base pairs with base pair quality score <30 and read length <35 bp.

    cutadapt (Inferred with models/gemini-2.5-flash) v2.10 GitHub
    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt=2.10
    
    # Define input and output files
    INPUT_R1="input_R1.fastq.gz"
    INPUT_R2="input_R2.fastq.gz"
    OUTPUT_R1="trimmed_R1.fastq.gz"
    OUTPUT_R2="trimmed_R2.fastq.gz"
    
    # Define adapter sequence(s). Replace with the actual adapter sequence used in your library preparation.
    # For Illumina TruSeq, a common 3' adapter for Read 1 is:
    # ADAPTER_3PRIME_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    # For eCLIP, a specific linker adapter might be used, e.g., from the Yeo lab eCLIP protocol.
    # Example placeholder adapter:
    ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    
    # Execute cutadapt for adapter trimming and quality/length filtering
    # -a ADAPTER_SEQUENCE: 3' adapter for Read 1
    # -A ADAPTER_SEQUENCE: 3' adapter for Read 2 (often the same or reverse complement depending on library)
    # -q 30,30: Trim low-quality bases from both ends with quality score < 30
    # -m 35: Discard reads shorter than 35 bp after trimming
    # --pair-filter=any: Discard a read pair if either read becomes too short after trimming
    # -o: Output file for Read 1
    # -p: Output file for Read 2
    cutadapt -a "${ADAPTER_SEQUENCE}" -A "${ADAPTER_SEQUENCE}" \
             -q 30,30 \
             -m 35 \
             --pair-filter=any \
             -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" \
             "${INPUT_R1}" "${INPUT_R2}"
  4. 4

    Trimmed reads were aligned to the hg38 human reference genome using bowtie2 v.

    Bowtie2 v2.3.4.3 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Installation
    # conda install -c bioconda bowtie2 samtools
    
    # Reference genome preparation (example, replace with actual hg38 fasta path)
    # mkdir -p /path/to/hg38_bowtie2_index
    # wget -O /path/to/hg38_bowtie2_index/hg38.fa.gz http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
    # gunzip /path/to/hg38_bowtie2_index/hg38.fa.gz
    # bowtie2-build /path/to/hg38_bowtie2_index/hg38.fa /path/to/hg38_bowtie2_index/hg38
    
    # Alignment
    # Assume trimmed_reads.fastq.gz is the input file (single-end example)
    # Assume /path/to/hg38_bowtie2_index/hg38 is the prefix for the bowtie2 index files
    # Assume output will be aligned_reads.bam
    
    bowtie2 --local --very-sensitive-local \
            -x /path/to/hg38_bowtie2_index/hg38 \
            -U trimmed_reads.fastq.gz \
            -p 8 \
            2> bowtie2_alignment.log \
            | samtools view -bS - \
            | samtools sort -o aligned_reads.bam -
    
    samtools index aligned_reads.bam
  5. 5

    2.3.5.

    Unknown (Inferred with models/gemini-2.5-flash) vUnknown (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # No specific tool or command can be inferred from the step description "2.3.5.".
    # Please provide a more detailed description of the step (e.g., "Peak calling for eCLIP").
    # A common reference genome placeholder might be:
    # REF_GENOME="/path/to/reference/genome/hg38.fa"
    
  6. 6

    Reads were also aligned to the S. cerevisiae yeast genome (sacCer3) yeast genome to evaluate the yeast spike-in control for the normalization.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub
    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star
    
    # Define variables
    # Replace with your actual fastq file(s). For single-end reads, remove READS_R2.
    READS_R1="path/to/your/spike_in_reads_R1.fastq.gz" 
    READS_R2="path/to/your/spike_in_reads_R2.fastq.gz" 
    
    # Path to your pre-built STAR genome index for S. cerevisiae (sacCer3)
    # This index needs to be generated once using the sacCer3 fasta and GTF files.
    # Example for building index (run once):
    # STAR --runMode genomeGenerate \
    #      --genomeDir sacCer3_star_index \
    #      --genomeFastaFiles sacCer3.fasta \
    #      --sjdbGTFfile sacCer3.gtf \
    #      --runThreadN 8 # Adjust threads as needed
    GENOME_DIR="sacCer3_star_index" 
    
    OUTPUT_PREFIX="aligned_sacCer3_spikein" # Prefix for output files
    
    # Run STAR alignment for spike-in reads
    # Adjust --readFilesIn for single-end reads (e.g., --readFilesIn ${READS_R1})
    STAR --genomeDir ${GENOME_DIR} \
         --readFilesIn ${READS_R1} ${READS_R2} \
         --runThreadN 8 \
         --outFileNamePrefix ${OUTPUT_PREFIX}. \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB)
    
    # Rename the output BAM file for clarity
    mv ${OUTPUT_PREFIX}.Aligned.sortedByCoord.out.bam ${OUTPUT_PREFIX}.bam
  7. 7

    Peaks were called from the pooled replicates against the IgG using MACS2 v.

    MACS2 v2.2.7.1 GitHub
    $ Bash example
    # Install MACS2 (if not already installed)
    # conda install -c bioconda macs2=2.2.7.1
    
    # Run MACS2 peak calling
    # Assuming 'treatment.pooled.bam' is the pooled replicates BAM and 'control.pooled.bam' is the pooled IgG control BAM.
    # '-g hs' specifies the genome size for human (hg38 equivalent). Adjust if a different organism is used.
    # '-f BAM' specifies the input file format as BAM. Use 'BAMPE' for paired-end data.
    # '-q 0.01' sets the minimum FDR (q-value) cutoff for peak detection.
    # '--outdir' specifies the output directory.
    # '-n' specifies the prefix for output files.
    macs2 callpeak \
      -t treatment.pooled.bam \
      -c control.pooled.bam \
      -f BAM \
      -g hs \
      -q 0.01 \
      --outdir ./ \
      -n macs2_peaks
  8. 8

    2.2.5 with normalizing the samples based on the spike-in reads qunatification.

    DESeq2 (Inferred with models/gemini-2.5-flash) v1.38.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install R and DESeq2 if not already present
    # conda install -c bioconda bioconductor-deseq2 r-base
    
    # Example: Create dummy spike_in_counts.tsv and raw_counts.tsv for demonstration
    # In a real pipeline, these files would be generated by upstream quantification steps.
    # spike_in_counts.tsv: Contains spike-in read counts per sample
    # raw_counts.tsv: Contains raw feature (e.g., gene) counts per sample
    
    echo -e "sample_id\tspike_in_reads" > spike_in_counts.tsv
    echo -e "sample1\t1000" >> spike_in_counts.tsv
    echo -e "sample2\t1500" >> spike_in_counts.tsv
    echo -e "sample3\t800" >> spike_in_counts.tsv
    
    echo -e "gene_id\tsample1\tsample2\tsample3" > raw_counts.tsv
    echo -e "geneA\t100\t150\t80" >> raw_counts.tsv
    echo -e "geneB\t200\t300\t160" >> raw_counts.tsv
    echo -e "geneC\t50\t75\t40" >> raw_counts.tsv
    
    # R script to perform spike-in normalization using DESeq2
    Rscript -e '
      library(DESeq2)
      
      # Load spike-in counts
      spike_in_data <- read.delim("spike_in_counts.tsv", row.names = 1)
      spike_in_counts <- as.numeric(spike_in_data$spike_in_reads)
      names(spike_in_counts) <- rownames(spike_in_data)
      
      # Calculate custom size factors based on spike-in reads
      # A common method is to normalize to the median spike-in count across samples.
      # Each sample''s counts will be scaled by (median_spike_in / sample_spike_in_count).
      median_spike_in <- median(spike_in_counts)
      custom_size_factors <- median_spike_in / spike_in_counts
      
      # Load raw feature counts (e.g., gene counts)
      raw_counts_data <- read.delim("raw_counts.tsv", row.names = 1)
      
      # Ensure sample order in custom_size_factors matches the column order in raw_counts_data
      sample_names <- colnames(raw_counts_data)
      custom_size_factors <- custom_size_factors[sample_names] 
      
      # Create colData for DESeqDataSet (minimal, as we are only normalizing)
      colData <- data.frame(row.names = sample_names, condition = rep("control", length(sample_names)))
      
      # Create DESeqDataSet object
      dds <- DESeqDataSetFromMatrix(countData = raw_counts_data,
                                    colData = colData,
                                    design = ~ 1) # Simple design as we are only normalizing
      
      # Assign custom size factors derived from spike-in reads
      sizeFactors(dds) <- custom_size_factors
      
      # Run DESeq to apply normalization. 
      # estimateDispersions is needed for the counts(dds, normalized=TRUE) function to work correctly,
      # even if not performing full differential expression analysis.
      dds <- estimateDispersions(dds)
      
      # Get normalized counts
      normalized_counts <- counts(dds, normalized = TRUE)
      
      # Write normalized counts to an output file
      write.table(as.data.frame(normalized_counts), "normalized_counts_spikein.tsv", sep = "\t", quote = FALSE, col.names = NA)
    '
  9. 9

    ChipPeakAnnot library in the R package v.3.6.1 was used to determine the overlap between peak sets and determine the shared and specific peaks represented in the Venn diagrams.

    $ Bash example
    #!/bin/bash
    
    # This script uses the ChIPpeakAnno R package to determine overlap between peak sets
    # and generate Venn diagrams, as described in the context.
    
    # Ensure R is installed and available in the PATH. The context specifies R v.3.6.1.
    # You might need to use a specific R environment or installation for this version.
    # Example for installing R (adjust for your OS/package manager and desired version):
    # sudo apt-get update
    # sudo apt-get install r-base=3.6.1-3bionic # Example for Ubuntu 18.04
    
    # Install BiocManager if not already installed
    # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager', repos='https://cloud.r-project.org')"
    
    # Install ChIPpeakAnno package from Bioconductor
    # R -e "BiocManager::install('ChIPpeakAnno')"
    
    # Define input peak files (replace with actual paths to your BED files).
    # These are placeholders. In a real pipeline, these would be outputs from a peak caller.
    # ChIPpeakAnno's makeVennDiagram function supports up to 5 peak sets.
    PEAK_FILE_1="sample1_peaks.bed"
    PEAK_FILE_2="sample2_peaks.bed"
    PEAK_FILE_3="sample3_peaks.bed" # Example for a 3-way Venn diagram
    # PEAK_FILE_4="sample4_peaks.bed"
    # PEAK_FILE_5="sample5_peaks.bed"
    
    # Define output PDF file for the Venn diagram
    OUTPUT_VENN_PDF="peak_overlap_venn_diagram.pdf"
    
    # Create an R script to perform the peak overlap and Venn diagram generation
    cat <<EOF > run_chippeakanno.R
    library(ChIPpeakAnno)
    library(GenomicRanges) # Often loaded implicitly or explicitly with ChIPpeakAnno
    
    # Function to load peaks from a BED file into a GRanges object
    # This function assumes a standard BED format (at least chr, start, end).
    # Adjust 'col.names' and other parameters if your BED file has a different structure.
    load_peaks <- function(file_path) {
      if (!file.exists(file_path)) {
        stop(paste("Peak file not found:", file_path))
      }
      
      # Read BED file, skipping comment lines (starting with #)
      peaks_df <- read.delim(file_path, header=FALSE, comment.char="#", stringsAsFactors=FALSE)
      
      # Ensure at least 3 columns for seqnames (chromosome), start, end
      if (ncol(peaks_df) < 3) {
        stop(paste("Invalid BED file format for", file_path, ": expected at least 3 columns."))
      }
      
      # Assign standard column names for GRanges conversion
      colnames(peaks_df)[1:3] <- c("seqnames", "start", "end")
      
      # Create GRanges object
      gr <- makeGRangesFromDataFrame(peaks_df,
                                     keep.extra.columns=TRUE, # Keep any additional columns (e.g., name, score)
                                     ignore.strand=TRUE,      # Peaks are typically strand-agnostic for overlap analysis
                                     seqinfo=NULL)            # Can provide seqinfo if known, but often not strictly necessary for overlap
      return(gr)
    }
    
    # Load peak sets into a list of GRanges objects
    peaklist <- list()
    
    # Check if peak files exist before attempting to load them
    if (file.exists("${PEAK_FILE_1}")) {
      peaklist[["Set1"]] <- load_peaks("${PEAK_FILE_1}")
    } else {
      message(paste("Warning: Peak file not found, skipping:", "${PEAK_FILE_1}"))
    }
    
    if (file.exists("${PEAK_FILE_2}")) {
      peaklist[["Set2"]] <- load_peaks("${PEAK_FILE_2}")
    } else {
      message(paste("Warning: Peak file not found, skipping:", "${PEAK_FILE_2}"))
    }
    
    if (file.exists("${PEAK_FILE_3}")) {
      peaklist[["Set3"]] <- load_peaks("${PEAK_FILE_3}")
    } else {
      message(paste("Warning: Peak file not found, skipping:", "${PEAK_FILE_3}"))
    }
    
    # Add more peak files here if needed (up to 5 for makeVennDiagram)
    # if (file.exists("${PEAK_FILE_4}")) {
    #   peaklist[["Set4"]] <- load_peaks("${PEAK_FILE_4}")
    # }
    # if (file.exists("${PEAK_FILE_5}")) {
    #   peaklist[["Set5"]] <- load_peaks("${PEAK_FILE_5}")
    # }
    
    # Ensure at least two peak files were successfully loaded for a Venn diagram
    if (length(peaklist) < 2) {
      stop("At least two peak files are required to generate a Venn diagram. Please check input file paths.")
    }
    
    # Determine overlap and generate Venn diagram
    # The 'makeVennDiagram' function visualizes shared and specific peaks.
    pdf("${OUTPUT_VENN_PDF}")
    venn_results <- makeVennDiagram(
      peaklist,
      NameOfPeaks = names(peaklist), # Labels for each set in the diagram
      main = "Overlap of Peak Sets", # Main title for the plot
      fill = c("red", "blue", "green", "yellow", "purple")[1:length(peaklist)], # Customize fill colors
      cat.col = c("red", "blue", "green", "yellow", "purple")[1:length(peaklist)], # Customize category label colors
      cex = 1.5,       # Size of the numbers in the diagram
      cat.cex = 1.5,   # Size of the category labels
      ind = TRUE       # Set to TRUE to get indices of overlapping peaks in the results object
    )
    dev.off()
    
    message(paste("Venn diagram showing peak overlaps saved to:", "${OUTPUT_VENN_PDF}"))
    
    # Optional: To extract specific overlapping/unique peak sets as GRanges objects.
    # This can be done using the 'venn_results' object returned by makeVennDiagram.
    # For example, to get peaks unique to Set1 (assuming 3 sets):
    # if (length(peaklist) == 3) {
    #   unique_set1_indices <- venn_results$vennCounts[venn_results$vennCounts$Set1 == 1 & venn_results$vennCounts$Set2 == 0 & venn_results$vennCounts$Set3 == 0, "idx"]
    #   if (length(unique_set1_indices) > 0) {
    #     unique_peaks_set1 <- peaklist[["Set1"]][unique_set1_indices]
    #     # write.table(as.data.frame(unique_peaks_set1), "unique_peaks_set1.bed", sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)
    #     message(paste("Number of peaks unique to Set1:", length(unique_peaks_set1)))
    #   }
    # }
    EOF
    
    # Execute the R script using the specified R version (3.6.1)
    Rscript run_chippeakanno.R
  10. 10

    Minimum peak overlap was set to 100 bp to determine the intersection peaks between the two samples.

    bedtools intersect (Inferred with models/gemini-2.5-flash) v2.30.0 GitHub
    $ Bash example
    # Install bedtools if not already installed
    # conda install -c bioconda bedtools
    
    # Define input and output files
    # Replace 'sample1_peaks.bed' and 'sample2_peaks.bed' with your actual peak file paths.
    INPUT_PEAKS_SAMPLE1="sample1_peaks.bed"
    INPUT_PEAKS_SAMPLE2="sample2_peaks.bed"
    OUTPUT_INTERSECTION_PEAKS="intersection_peaks.bed"
    
    # Determine the intersection peaks between the two samples with a minimum overlap of 100 bp.
    # bedtools intersect outputs the overlapping segments between two BED files.
    # The awk command then filters these segments, keeping only those whose length (end - start) is 100 bp or greater.
    bedtools intersect -a "${INPUT_PEAKS_SAMPLE1}" -b "${INPUT_PEAKS_SAMPLE2}" | awk '($3-$2) >= 100' > "${OUTPUT_INTERSECTION_PEAKS}"
    
    # Note: The input peak files (e.g., sample1_peaks.bed, sample2_peaks.bed) are assumed to be in BED format
    # and based on a specific reference genome (e.g., hg38, mm10), though the genome assembly is not directly
    # used by bedtools intersect itself.
  11. 11

    Shared peaks between the sets ETO and Flag antibodies in the Control and miR-130a KD samples were used for the downstream analysis.

    intersect_peaks.py (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Clone the yeolab/merge_peaks repository if not already available
    # git clone https://github.com/yeolab/merge_peaks.git
    # cd merge_peaks
    
    # Assuming peak files from previous steps are available in BED format.
    # Replace these placeholder paths/names with your actual peak files.
    ETO_Control_peaks="ETO_Control_reproducible_peaks.bed"
    Flag_Control_peaks="Flag_Control_reproducible_peaks.bed"
    ETO_KD_peaks="ETO_KD_reproducible_peaks.bed"
    Flag_KD_peaks="Flag_KD_reproducible_peaks.bed"
    
    # Output file for the shared peaks
    output_shared_peaks="shared_peaks_ETO_Flag_Control_KD.bed"
    
    # Execute the intersect_peaks.py script to find peaks common to all specified sets.
    # This command assumes the script is in the current working directory or in PATH.
    python intersect_peaks.py \
        "${ETO_Control_peaks}" \
        "${Flag_Control_peaks}" \
        "${ETO_KD_peaks}" \
        "${Flag_KD_peaks}" \
        -o "${output_shared_peaks}"
  12. 12

    These two sets of shared peaks were intersected again using ChipPeakAnnot library to determine the shared peaks between the two sets and the unique peaks to each of the two sets.

    ChipPeakAnnot library vNot specified GitHub
    $ Bash example
    # The description refers to 'ChipPeakAnnot library' which appears to be a custom or less common Python library.
    # The functionality described (intersecting peak sets and identifying unique peaks) is commonly and robustly performed using bedtools.
    # The following commands demonstrate how to achieve the described operations using bedtools.
    
    # Install bedtools if not already available
    # conda install -c bioconda bedtools
    
    # Define input peak files (replace with actual filenames)
    INPUT_PEAKS_SET1="input_peaks_set1.bed"
    INPUT_PEAKS_SET2="input_peaks_set2.bed"
    
    # Define output files
    SHARED_PEAKS="shared_peaks.bed"
    UNIQUE_TO_SET1="unique_to_set1.bed"
    UNIQUE_TO_SET2="unique_to_set2.bed"
    
    # Determine shared peaks between the two sets
    # By default, bedtools intersect outputs the overlapping regions.
    bedtools intersect -a "${INPUT_PEAKS_SET1}" -b "${INPUT_PEAKS_SET2}" > "${SHARED_PEAKS}"
    
    # Determine unique peaks to Set 1 (peaks in Set 1 that do not overlap with Set 2)
    bedtools intersect -a "${INPUT_PEAKS_SET1}" -b "${INPUT_PEAKS_SET2}" -v > "${UNIQUE_TO_SET1}"
    
    # Determine unique peaks to Set 2 (peaks in Set 2 that do not overlap with Set 1)
    bedtools intersect -a "${INPUT_PEAKS_SET2}" -b "${INPUT_PEAKS_SET1}" -v > "${UNIQUE_TO_SET2}"
    
  13. 13

    Peaks were annotated to the genomic features of the hg38 human genomes.

    HOMER (Hypergeometric Optimization of Motif EnRichment) - annotatePeaks.pl (Inferred with models/gemini-2.5-flash) vv4.11 GitHub
    $ Bash example
    # Install HOMER (if not already installed)
    # conda install -c bioconda homer
    
    # Example: Download and install hg38 genome for HOMER (if not already done)
    # perl /path/to/homer/bin/configureHomer.pl -install hg38
    
    # Define input peak file (replace with your actual peak file, e.g., a BED file)
    input_peaks="input_peaks.bed"
    
    # Define output annotation file
    output_annotation="annotated_peaks.txt"
    
    # Annotate peaks to hg38 genomic features
    # The 'hg38' argument tells HOMER to use its pre-indexed hg38 genome annotation.
    annotatePeaks.pl "${input_peaks}" hg38 > "${output_annotation}"
  14. 14

    HOMER v.

    HOMER vnot specified GitHub
    $ Bash example
    # Install HOMER (example using conda)
    # conda install -c bioconda homer
    
    # Placeholder variables - replace with actual paths and values from your experiment
    # Input peak file (e.g., BED format) from a previous step (e.g., peak calling)
    INPUT_PEAKS="your_experiment_peaks.bed" 
    # Reference genome assembly (e.g., hg38, mm10). 
    # Ensure HOMER indexes for this genome are available (e.g., by running 'perl configureHomer.pl -install hg38').
    GENOME_ASSEMBLY="hg38" 
    # Output directory for HOMER results
    OUTPUT_DIR="homer_motif_results"
    # Motif lengths to search for (e.g., 8, 10, 12 base pairs)
    MOTIF_LENGTHS="8,10,12"
    
    # Example: Run HOMER's findMotifsGenome.pl for de novo motif discovery
    # This command searches for motifs in the provided peak regions against the specified genome.
    # Parameters like -size and -len are common. -size given is typically used for BED files with varying region sizes.
    # Adjust parameters (-mask, -mis, -h, -p, etc.) based on specific experimental needs.
    findMotifsGenome.pl "${INPUT_PEAKS}" "${GENOME_ASSEMBLY}" "${OUTPUT_DIR}" -size given -len "${MOTIF_LENGTHS}"
  15. 15

    4.8 was used for de novo motif analysis of the sets of peaks.

    MEME-ChIP (Inferred with models/gemini-2.5-flash) v5.4.1 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install MEME Suite (e.g., via Bioconda)
    # conda install -c bioconda meme=5.4.1
    
    # Placeholder for input peak file (e.g., a BED file containing peak regions)
    PEAKS_BED="your_input_peaks.bed"
    
    # Placeholder for reference genome FASTA file
    # This file is used by meme-chip to extract sequences corresponding to the peaks.
    REFERENCE_GENOME_FASTA="path/to/your/reference_genome.fa" # e.g., hg38.fa
    
    # Output directory for the motif analysis results
    OUTPUT_DIR="de_novo_motif_analysis_results"
    
    # Create the output directory if it does not exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run MEME-ChIP for de novo motif analysis
    # -oc: Output directory
    # -genome: Path to the reference genome FASTA file
    # -fimo-skip: Skip FIMO analysis (can be run separately if needed)
    # -meme-maxw: Maximum width of motifs to find by MEME
    # -meme-minw: Minimum width of motifs to find by MEME
    # -meme-nmotifs: Number of motifs to find by MEME
    # -dreme-e: E-value threshold for DREME
    # -dna: Specifies that the input sequences are DNA
    # The last argument is the input peak file (BED format is accepted with -genome)
    meme-chip -oc "${OUTPUT_DIR}" \
              -genome "${REFERENCE_GENOME_FASTA}" \
              -fimo-skip \
              -meme-maxw 20 \
              -meme-minw 6 \
              -meme-nmotifs 10 \
              -dreme-e 0.05 \
              -dna "${PEAKS_BED}"

Tools Used

Raw Source Text
Library strategy: CUT&RUN
CUT&RUN paired-end data was trimmed using fastp v. 0.19.5 to remove adapters and low quality base pairs with base pair quality score <30 and read length <35 bp.
Trimmed reads were aligned to the hg38 human reference genome using bowtie2 v. 2.3.5. Reads were also aligned to the  S. cerevisiae yeast genome (sacCer3) yeast genome to evaluate the yeast spike-in control for the normalization.
Peaks were called from the pooled replicates against the IgG using MACS2 v. 2.2.5  with normalizing the samples based on the spike-in reads qunatification.
ChipPeakAnnot library in the R package v.3.6.1 was used to determine the overlap between peak sets and determine the shared and specific peaks represented in the Venn diagrams. Minimum peak overlap was set to 100 bp to determine the intersection peaks between the two samples.
Shared peaks between the sets ETO and Flag antibodies in the Control and miR-130a KD samples were used for the downstream analysis. These two sets of shared peaks were intersected again using ChipPeakAnnot library to determine the shared peaks between the two sets and the unique peaks to each of the two sets.
Peaks were annotated to the genomic features of the hg38 human genomes.
HOMER v. 4.8  was used for de novo motif analysis of the sets of peaks.
Genome_build: hg38
Supplementary_files_format_and_content: Bed files
← Back to Analysis