GSE211676 Processing Pipeline

GSE code_examples 10 steps

Publication

RANK ligand converts the NCoR/HDAC3 co-repressor to a PGC1β- and RNA-dependent co-activator of osteoclast gene expression.

Molecular cell (2023) — PMID 37751740

Dataset

GSE211676

RANK ligand converts the NCoR/HDAC3 co-repressor to a PGC1β- and RNA-dependent co-activator of osteoclast gene expression

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    FASTQ files were mapped to the mm10 mouse reference genome with Bowtie2.

    $ Bash example
    # Install Bowtie2
    # conda install -c bioconda bowtie2
    
    # Install Samtools (often used with Bowtie2 for BAM conversion and sorting)
    # conda install -c bioconda samtools
    
    # Define variables
    # Placeholder for the Bowtie2 index for mm10. This index must be pre-built.
    # Example: bowtie2-build /path/to/mm10.fa /path/to/bowtie2_indexes/mm10
    REF_GENOME_INDEX="/path/to/bowtie2_indexes/mm10"
    
    # Placeholder for input FASTQ files (assuming paired-end reads)
    INPUT_FASTQ_R1="input_R1.fastq.gz"
    INPUT_FASTQ_R2="input_R2.fastq.gz"
    
    # Desired output BAM file name
    OUTPUT_BAM="output.bam"
    
    # Number of threads to use for alignment
    NUM_THREADS=8
    
    # Run Bowtie2 alignment, pipe output to samtools for BAM conversion and sorting
    bowtie2 -x "${REF_GENOME_INDEX}" \
            -1 "${INPUT_FASTQ_R1}" \
            -2 "${INPUT_FASTQ_R2}" \
            --threads "${NUM_THREADS}" \
            --very-sensitive \
            2> "${OUTPUT_BAM%.bam}.log" \
        | samtools view -bS - \
        | samtools sort -o "${OUTPUT_BAM}" -
  2. 2

    Peaks were called with HOMER (findPeaks) using parameters “-style factor -minDist 200 -size 200”.

    HOMER vv4.12
    $ Bash example
    # Install HOMER (if not already installed)
    # conda install -c bioconda homer
    
    # Define input and output files
    # Replace 'path/to/your/treatment.bam' with the actual path to your aligned reads BAM file
    TREATMENT_BAM="path/to/your/treatment.bam"
    
    # Define output prefix for the peak files
    OUTPUT_PREFIX="homer_factor_peaks"
    
    # Call peaks with HOMER findPeaks
    # This command will generate files like homer_factor_peaks.peaks.txt and homer_factor_peaks.peaks.bed
    findPeaks "${TREATMENT_BAM}" -style factor -minDist 200 -size 200 -o "${OUTPUT_PREFIX}"
  3. 3

    After merging these peaks, correlations among replicates from the same cell subset/treatment were evaluated by correlation using tag counts.

    deepTools (Inferred with models/gemini-2.5-flash) v3.5.1 GitHub
    $ Bash example
    # Install deepTools (example using conda)
    # conda create -n deeptools_env deeptools=3.5.1
    # conda activate deeptools_env
    
    # Define input files (placeholders)
    # MERGED_PEAKS: BED file containing the merged peak regions from the previous step.
    # REPLICATE_BWS: Space-separated list of bigWig files for each replicate from the same cell subset/treatment.
    MERGED_PEAKS="merged_peaks.bed"
    REPLICATE_BWS="replicate1.bw replicate2.bw replicate3.bw"
    
    # Output files
    MATRIX_FILE="replicate_correlation_matrix.npz"
    PLOT_FILE="replicate_correlation_heatmap.png"
    TEXT_FILE="replicate_correlation_values.txt"
    
    # Labels for the replicates, corresponding to the order in REPLICATE_BWS
    REPLICATE_LABELS="rep1 rep2 rep3"
    
    # 1. Compute tag counts/signal for merged peak regions across replicates using multiBigwigSummary
    # This step quantifies the signal within the specified BED regions for each bigWig file.
    multiBigwigSummary BED \
        --BED ${MERGED_PEAKS} \
        --bws ${REPLICATE_BWS} \
        --outFileName ${MATRIX_FILE} \
        --labels ${REPLICATE_LABELS}
    
    # 2. Evaluate correlations and generate a heatmap using plotCorrelation
    # This step takes the matrix generated by multiBigwigSummary and calculates Pearson correlation,
    # then visualizes it as a heatmap.
    plotCorrelation \
        -in ${MATRIX_FILE} \
        --corMethod pearson \
        --plotType heatmap \
        --outFileName ${PLOT_FILE} \
        --plotNumbers \
        --labels ${REPLICATE_LABELS} \
        --outFileCorrelations ${TEXT_FILE}
  4. 4

    The two most highly correlated samples were used for identifying the most robust peaks using the irreproducible discovery rate (IDR) method.

    IDR vlatest from repository GitHub
    $ Bash example
    # Clone the merge_peaks repository if not already present
    # git clone https://github.com/yeolab/merge_peaks.git
    # Assuming the repository is cloned to a known location, e.g., /opt/yeolab_tools/merge_peaks
    MERGE_PEAKS_DIR="/opt/yeolab_tools/merge_peaks"
    
    # Install the IDR executable (version 2.0.4 is commonly used and is a dependency for merge_peaks)
    # conda install -c bioconda idr=2.0.4
    
    # Placeholder for input peak files from the two most highly correlated samples.
    # These would typically be BED files generated by a peak caller (e.g., CLIPper)
    # and selected based on correlation analysis from a previous step.
    replicate1_peaks="path/to/your/replicate1_peaks.bed"
    replicate2_peaks="path/to/your/replicate2_peaks.bed"
    
    # Output prefix for IDR results
    output_prefix="idr_robust_peaks"
    
    # Rank type for IDR (e.g., 'p.value' or 'score'). 'p.value' is a common choice for peak callers.
    # The description does not specify, so a common default is used.
    rank_type="p.value"
    
    # IDR cutoff (e.g., 0.01 or 0.05). 0.01 is a stringent common choice for high confidence peaks.
    idr_cutoff=0.01
    
    # Execute the IDR replicate pipeline script from the merge_peaks repository.
    # This script orchestrates the IDR analysis using the underlying 'idr' executable.
    python "${MERGE_PEAKS_DIR}/idr_replicate_pipeline.py" \
        -p "${replicate1_peaks}" \
        -q "${replicate2_peaks}" \
        -o "${output_prefix}" \
        -r "${rank_type}" \
        -c "${idr_cutoff}"
    
    # The output will include files such as:
    # - ${output_prefix}-idr.log: IDR log file
    # - ${output_prefix}-idr.out: Raw IDR output
    # - Potentially filtered peak files (e.g., BED files with reproducible peaks)
  5. 5

    For this step, peaks were called with HOMER’s findPeaks, using parameters “-L 0 -C 0 -fdr 0.9 -minDist 200 -size 200”.

    HOMER v4.11.1 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install HOMER (if not already installed)
    # conda install -c bioconda homer
    # perl /path/to/homer/configureHomer.pl -install hg38 # Example for genome installation
    
    # Assuming a tag directory 'my_sample_tags' has been created from alignment files
    # Example: makeTagDirectory my_sample_tags my_sample_aligned.bam
    
    # Define input and output paths (placeholders as not specified in description)
    TAG_DIRECTORY="my_sample_tags"
    OUTPUT_PEAKS="my_sample_peaks.txt"
    GENOME="hg38" # Placeholder for reference genome
    
    # Execute HOMER findPeaks with specified parameters
    findPeaks "${TAG_DIRECTORY}" -o "${OUTPUT_PEAKS}" -L 0 -C 0 -fdr 0.9 -minDist 200 -size 200 -g "${GENOME}" -style factor
  6. 6

    IDR peaks from different conditions involved in a comparison merged with HOMER’s mergePeaks and annotated with HOMER’s annotatePeaks.pl.

    IDR v2.0.4
    $ Bash example
    # The input to this step are peaks generated by the IDR (Irreproducible Discovery Rate) method.
    # The specific actions described (merging and annotation) are performed using HOMER tools.
    
    # Install HOMER (if not already installed)
    # conda install -c bioconda homer
    
    # Example input IDR peak files from different conditions
    # These files are assumed to be in HOMER peak format or a compatible format (e.g., BED, narrowPeak).
    # Replace with actual input file paths.
    input_idr_peaks_cond1="condition1_idr_peaks.narrowPeak"
    input_idr_peaks_cond2="condition2_idr_peaks.narrowPeak"
    input_idr_peaks_cond3="condition3_idr_peaks.narrowPeak"
    
    # Merge IDR peaks from different conditions using HOMER's mergePeaks
    # The '-d' option can be used to specify a distance for merging overlapping peaks (e.g., -d 200 for 200bp).
    # Without '-d', it merges peaks that overlap by at least 1 bp.
    # Output is typically a BED-like format.
    merged_peaks_file="merged_idr_peaks.bed"
    mergePeaks "${input_idr_peaks_cond1}" "${input_idr_peaks_cond2}" "${input_idr_peaks_cond3}" > "${merged_peaks_file}"
    
    # Annotate the merged IDR peaks using HOMER's annotatePeaks.pl
    # Using hg38 as the reference genome (latest assembly placeholder). Replace with appropriate genome if needed.
    # The output is a tab-separated text file with annotation details.
    annotated_peaks_file="annotated_idr_peaks.txt"
    annotatePeaks.pl "${merged_peaks_file}" hg38 > "${annotated_peaks_file}"
  7. 7

    The raw tags of all samples which had reasonable correlation were quantified with HOMER (annotatePeaks.pl) using parameter “-noadj”.

    HOMER vNot specified (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install HOMER (e.g., via conda)
    # conda install -c bioconda homer=4.12
    
    # Assuming 'input_peaks.bed' is the file containing raw tags/peaks that had reasonable correlation
    # and 'hg38' is the reference genome assembly (replace with actual if known)
    # The output will be a tab-separated file with peak annotations.
    annotatePeaks.pl input_peaks.bed hg38 -noadj > output_annotations.txt
  8. 8

    Peaks which contained at least 4 tags in at least 1 sample were used to identify differentially bounded peaks (DBP) by DESeq2.

    $ Bash example
    # Install R and DESeq2 (if not already installed)
    # conda install -c conda-forge r-base
    # conda install -c bioconda bioconductor-deseq2
    
    # Create dummy input files for demonstration
    # In a real pipeline, 'peak_counts.tsv' would be generated by an upstream peak quantification step
    # and would already reflect the pre-filtering criterion: 
    # "Peaks which contained at least 4 tags in at least 1 sample".
    # 'sample_info.tsv' would contain metadata for each sample.
    cat << EOF > peak_counts.tsv
    peak_id	sample1	sample2	sample3	sample4
    peak1	10	12	5	7
    peak2	0	1	0	0
    peak3	50	60	25	30
    peak4	2	3	1	2
    peak5	15	18	8	10
    peak6	4	5	2	3
    peak7	0	0	0	0
    peak8	6	7	3	4
    EOF
    
    cat << EOF > sample_info.tsv
    sample_id	condition
    sample1	control
    sample2	control
    sample3	treated
    sample4	treated
    EOF
    
    # R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    library(DESeq2)
    
    # Load count data
    # Assuming 'peak_counts.tsv' already reflects the pre-filtering: 
    # "Peaks which contained at least 4 tags in at least 1 sample"
    counts_df <- read.table("peak_counts.tsv", header=TRUE, row.names=1, sep="\t")
    # Ensure counts are integers
    counts_df <- round(counts_df)
    
    # Load sample information
    sample_info <- read.table("sample_info.tsv", header=TRUE, row.names=1, sep="\t")
    
    # Ensure sample order matches between counts and sample_info
    sample_info <- sample_info[colnames(counts_df), , drop=FALSE]
    
    # Create DESeqDataSet object
    # The design formula specifies the experimental factors to be modeled.
    # Here, we assume a simple comparison between 'control' and 'treated' conditions.
    dds <- DESeqDataSetFromMatrix(countData = counts_df,
                                  colData = sample_info,
                                  design = ~ condition)
    
    # DESeq2 recommends filtering out rows that have zero or very few counts across all samples.
    # If the input 'peak_counts.tsv' was already filtered based on "at least 4 tags in at least 1 sample",
    # then this internal filtering might be less critical but still good practice.
    # For demonstration, we'll proceed assuming adequate pre-filtering.
    
    # Run DESeq2 analysis
    dds <- DESeq(dds)
    
    # Get results for the contrast of interest (e.g., treated vs control)
    # The contrast argument specifies the factor, the numerator level, and the denominator level.
    res <- results(dds, contrast=c("condition", "treated", "control"))
    
    # Order results by adjusted p-value
    res <- res[order(res$padj),]
    
    # Save results
    write.csv(as.data.frame(res), file="deseq2_results.csv")
    
    # Optional: Save normalized counts
    # normalized_counts <- counts(dds, normalized=TRUE)
    # write.csv(as.data.frame(normalized_counts), file="deseq2_normalized_counts.csv")
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R
  9. 9

    Peaks were categorized as distal peaks which are 2 kb away from known TSS and promotor peaks which are located within 2 kb region of known TSS sites.

    bedtools (Inferred with models/gemini-2.5-flash) v2.30.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install bedtools (if not already installed)
    # conda install -c bioconda bedtools
    
    # Define input files and reference data
    # Replace 'input_peaks.bed' with the actual path to your peak file.
    PEAKS_BED="input_peaks.bed"
    
    # Replace 'hg38_gencode_v38_tss.bed' with the actual path to your TSS BED file.
    # This file should contain 1bp regions representing the Transcription Start Sites.
    # Example: chr1\t10000\t10001\tgeneA_tss\t0\t+
    TSS_BED="hg38_gencode_v38_tss.bed"
    
    # Replace 'hg38.chrom.sizes' with the actual path to your chromosome sizes file.
    # This file is required by bedtools slop to prevent extending regions off chromosome ends.
    # Example: chr1\t248956422
    CHROM_SIZES="hg38.chrom.sizes"
    
    # Create dummy files for demonstration if they don't exist.
    # In a real pipeline, these files would be pre-existing and properly generated.
    if [ ! -f "$CHROM_SIZES" ]; then
        echo "Warning: Creating dummy $CHROM_SIZES for demonstration. Replace with actual file for real analysis."
        echo -e "chr1\t248956422\nchr2\t242193529\nchr3\t198295559\nchrX\t156040895\nchrY\t57227415" > "$CHROM_SIZES"
    fi
    
    if [ ! -f "$TSS_BED" ]; then
        echo "Warning: Creating dummy $TSS_BED for demonstration. Replace with actual file for real analysis."
        # Example TSS for a few genes on chr1
        echo -e "chr1\t10000\t10001\tgeneA_tss\t0\t+\nchr1\t50000\t50001\tgeneB_tss\t0\t-" > "$TSS_BED"
    fi
    
    # Define the promoter region size (2 kb)
    PROMOTER_DISTANCE=2000
    
    # 1. Define promoter regions by extending TSS sites by PROMOTER_DISTANCE in both directions.
    #    bedtools slop extends features by a specified number of base pairs.
    bedtools slop -i "$TSS_BED" -g "$CHROM_SIZES" -b "$PROMOTER_DISTANCE" > promoter_regions_slop.bed
    
    # 2. Merge overlapping promoter regions to create a consolidated set of promoter areas.
    #    This prevents a single peak from being counted multiple times if it overlaps several TSS regions.
    bedtools merge -i promoter_regions_slop.bed > merged_promoter_regions.bed
    
    # 3. Identify promoter peaks: Peaks that overlap with the merged promoter regions.
    #    bedtools intersect -a <peaks> -b <promoter_regions> finds overlaps.
    bedtools intersect -a "$PEAKS_BED" -b merged_promoter_regions.bed > promoter_peaks.bed
    
    # 4. Identify distal peaks: Peaks that do NOT overlap with the merged promoter regions.
    #    bedtools intersect -a <peaks> -b <promoter_regions> -v finds features in -a that do NOT overlap -b.
    bedtools intersect -a "$PEAKS_BED" -b merged_promoter_regions.bed -v > distal_peaks.bed
    
    # Clean up intermediate files
    rm promoter_regions_slop.bed merged_promoter_regions.bed
  10. 10

    ATAC peak quantification was normalized to the total tags in peaks.

    DiffBind (Inferred with models/gemini-2.5-flash) v3.10.0
    $ Bash example
    # Install R and Bioconductor if not already present
    # sudo apt-get update && sudo apt-get install -y r-base
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DiffBind")'
    
    # Create an R script file for ATAC peak quantification and normalization
    # This script assumes you have a 'samplesheet.csv' file defining your samples, 
    # BAM files, and peak files. Example 'samplesheet.csv' structure:
    # SampleID,Condition,Replicate,bamReads,Peaks,PeakCaller
    # Sample1,Control,1,sample1.bam,sample1_peaks.bed,bed
    # Sample2,Control,2,sample2.bam,sample2_peaks.bed,bed
    cat << 'EOF' > run_atac_quant_norm.R
    library(DiffBind)
    
    # Load samplesheet (replace with your actual samplesheet path)
    samplesheet <- dba(sampleSheet="samplesheet.csv")
    
    # Step 1: Count reads in peaks for all samples
    # This creates a dba object with raw counts for each peak in each sample.
    # score=DBA_SCORE_READS ensures raw read counts are used.
    # bUseSummarizeOverlaps=TRUE uses the more robust SummarizedExperiment approach.
    message("Counting reads in peaks...")
    samplesheet_counts <- dba.count(samplesheet, bUseSummarizeOverlaps=TRUE, score=DBA_SCORE_READS)
    
    # Step 2: Extract raw counts matrix
    # This retrieves the raw count matrix where rows are peaks and columns are samples.
    raw_counts_matrix <- dba.peakset(samplesheet_counts, bRetrieve=TRUE, DataType=DBA_DATA_COUNTS)
    
    # Step 3: Calculate total tags in peaks for each sample
    # This sums the raw counts for all peaks for each individual sample.
    total_tags_in_peaks <- colSums(raw_counts_matrix)
    
    # Step 4: Normalize counts by total tags in peaks
    # Each peak's count is divided by the total tags in peaks for its respective sample.
    # A small pseudocount (1e-6) is added to the denominator to prevent division by zero.
    # The result is scaled by 1e6 to represent Counts Per Million (CPM) within peaks.
    message("Normalizing counts to total tags in peaks...")
    normalized_counts_matrix <- t(t(raw_counts_matrix) / (total_tags_in_peaks + 1e-6)) * 1e6
    
    # Save the normalized counts to a CSV file
    output_file <- "atac_normalized_peak_counts.csv"
    write.csv(normalized_counts_matrix, output_file, row.names=TRUE)
    
    message(paste0("ATAC peak quantification and normalization complete. Normalized counts saved to ", output_file))
    EOF
    
    # Execute the R script
    Rscript run_atac_quant_norm.R

Tools Used

Raw Source Text
FASTQ files were mapped to the mm10 mouse reference genome with Bowtie2.  Peaks were called with HOMER (findPeaks) using parameters “-style factor -minDist 200 -size 200”.  After merging these peaks, correlations among replicates from the same cell subset/treatment were evaluated by correlation using tag counts.  The two most highly correlated samples were used for identifying the most robust peaks using the irreproducible discovery rate (IDR) method.  For this step, peaks were called with HOMER’s findPeaks, using parameters “-L 0 -C 0 -fdr 0.9 -minDist 200 -size 200”.  IDR peaks from different conditions involved in a comparison merged with HOMER’s mergePeaks and annotated with HOMER’s annotatePeaks.pl.  The raw tags of all samples which had reasonable correlation were quantified with HOMER (annotatePeaks.pl) using parameter “-noadj”.  Peaks which contained at least 4 tags in at least 1 sample were used to identify differentially bounded peaks (DBP) by DESeq2.  Peaks were categorized as distal peaks which are 2 kb away from known TSS and promotor peaks which are located within 2 kb region of known TSS sites.
ATAC peak quantification was normalized to the total tags in peaks.
Assembly: mm10
← Back to Analysis