GSE211676 Processing Pipeline

GSE code_examples 10 steps

Publication

RANK ligand converts the NCoR/HDAC3 co-repressor to a PGC1β- and RNA-dependent co-activator of osteoclast gene expression.

Molecular cell (2023) — PMID 37751740

Dataset

GSE211676

RANK ligand converts the NCoR/HDAC3 co-repressor to a PGC1Î²- and RNA-dependent co-activator of osteoclast gene expression

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

FASTQ files were mapped to the mm10 mouse reference genome with Bowtie2.

Bowtie2 v2.3.4.3 GitHub

$ Bash example

# Install Bowtie2
# conda install -c bioconda bowtie2

# Install Samtools (often used with Bowtie2 for BAM conversion and sorting)
# conda install -c bioconda samtools

# Define variables
# Placeholder for the Bowtie2 index for mm10. This index must be pre-built.
# Example: bowtie2-build /path/to/mm10.fa /path/to/bowtie2_indexes/mm10
REF_GENOME_INDEX="/path/to/bowtie2_indexes/mm10"

# Placeholder for input FASTQ files (assuming paired-end reads)
INPUT_FASTQ_R1="input_R1.fastq.gz"
INPUT_FASTQ_R2="input_R2.fastq.gz"

# Desired output BAM file name
OUTPUT_BAM="output.bam"

# Number of threads to use for alignment
NUM_THREADS=8

# Run Bowtie2 alignment, pipe output to samtools for BAM conversion and sorting
bowtie2 -x "${REF_GENOME_INDEX}" \
        -1 "${INPUT_FASTQ_R1}" \
        -2 "${INPUT_FASTQ_R2}" \
        --threads "${NUM_THREADS}" \
        --very-sensitive \
        2> "${OUTPUT_BAM%.bam}.log" \
    | samtools view -bS - \
    | samtools sort -o "${OUTPUT_BAM}" -

View on GitHub

Peaks were called with HOMER (findPeaks) using parameters â-style factor -minDist 200 -size 200â.

HOMER vv4.12

$ Bash example

# Install HOMER (if not already installed)
# conda install -c bioconda homer

# Define input and output files
# Replace 'path/to/your/treatment.bam' with the actual path to your aligned reads BAM file
TREATMENT_BAM="path/to/your/treatment.bam"

# Define output prefix for the peak files
OUTPUT_PREFIX="homer_factor_peaks"

# Call peaks with HOMER findPeaks
# This command will generate files like homer_factor_peaks.peaks.txt and homer_factor_peaks.peaks.bed
findPeaks "${TREATMENT_BAM}" -style factor -minDist 200 -size 200 -o "${OUTPUT_PREFIX}"

After merging these peaks, correlations among replicates from the same cell subset/treatment were evaluated by correlation using tag counts.

deepTools (Inferred with models/gemini-2.5-flash) v3.5.1 GitHub

$ Bash example

# Install deepTools (example using conda)
# conda create -n deeptools_env deeptools=3.5.1
# conda activate deeptools_env

# Define input files (placeholders)
# MERGED_PEAKS: BED file containing the merged peak regions from the previous step.
# REPLICATE_BWS: Space-separated list of bigWig files for each replicate from the same cell subset/treatment.
MERGED_PEAKS="merged_peaks.bed"
REPLICATE_BWS="replicate1.bw replicate2.bw replicate3.bw"

# Output files
MATRIX_FILE="replicate_correlation_matrix.npz"
PLOT_FILE="replicate_correlation_heatmap.png"
TEXT_FILE="replicate_correlation_values.txt"

# Labels for the replicates, corresponding to the order in REPLICATE_BWS
REPLICATE_LABELS="rep1 rep2 rep3"

# 1. Compute tag counts/signal for merged peak regions across replicates using multiBigwigSummary
# This step quantifies the signal within the specified BED regions for each bigWig file.
multiBigwigSummary BED \
    --BED ${MERGED_PEAKS} \
    --bws ${REPLICATE_BWS} \
    --outFileName ${MATRIX_FILE} \
    --labels ${REPLICATE_LABELS}

# 2. Evaluate correlations and generate a heatmap using plotCorrelation
# This step takes the matrix generated by multiBigwigSummary and calculates Pearson correlation,
# then visualizes it as a heatmap.
plotCorrelation \
    -in ${MATRIX_FILE} \
    --corMethod pearson \
    --plotType heatmap \
    --outFileName ${PLOT_FILE} \
    --plotNumbers \
    --labels ${REPLICATE_LABELS} \
    --outFileCorrelations ${TEXT_FILE}

View on GitHub

The two most highly correlated samples were used for identifying the most robust peaks using the irreproducible discovery rate (IDR) method.

IDR vlatest from repository GitHub

$ Bash example

# Clone the merge_peaks repository if not already present
# git clone https://github.com/yeolab/merge_peaks.git
# Assuming the repository is cloned to a known location, e.g., /opt/yeolab_tools/merge_peaks
MERGE_PEAKS_DIR="/opt/yeolab_tools/merge_peaks"

# Install the IDR executable (version 2.0.4 is commonly used and is a dependency for merge_peaks)
# conda install -c bioconda idr=2.0.4

# Placeholder for input peak files from the two most highly correlated samples.
# These would typically be BED files generated by a peak caller (e.g., CLIPper)
# and selected based on correlation analysis from a previous step.
replicate1_peaks="path/to/your/replicate1_peaks.bed"
replicate2_peaks="path/to/your/replicate2_peaks.bed"

# Output prefix for IDR results
output_prefix="idr_robust_peaks"

# Rank type for IDR (e.g., 'p.value' or 'score'). 'p.value' is a common choice for peak callers.
# The description does not specify, so a common default is used.
rank_type="p.value"

# IDR cutoff (e.g., 0.01 or 0.05). 0.01 is a stringent common choice for high confidence peaks.
idr_cutoff=0.01

# Execute the IDR replicate pipeline script from the merge_peaks repository.
# This script orchestrates the IDR analysis using the underlying 'idr' executable.
python "${MERGE_PEAKS_DIR}/idr_replicate_pipeline.py" \
    -p "${replicate1_peaks}" \
    -q "${replicate2_peaks}" \
    -o "${output_prefix}" \
    -r "${rank_type}" \
    -c "${idr_cutoff}"

# The output will include files such as:
# - ${output_prefix}-idr.log: IDR log file
# - ${output_prefix}-idr.out: Raw IDR output
# - Potentially filtered peak files (e.g., BED files with reproducible peaks)

View on GitHub

For this step, peaks were called with HOMERâs findPeaks, using parameters â-L 0 -C 0 -fdr 0.9 -minDist 200 -size 200â.

HOMER v4.11.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install HOMER (if not already installed)
# conda install -c bioconda homer
# perl /path/to/homer/configureHomer.pl -install hg38 # Example for genome installation

# Assuming a tag directory 'my_sample_tags' has been created from alignment files
# Example: makeTagDirectory my_sample_tags my_sample_aligned.bam

# Define input and output paths (placeholders as not specified in description)
TAG_DIRECTORY="my_sample_tags"
OUTPUT_PEAKS="my_sample_peaks.txt"
GENOME="hg38" # Placeholder for reference genome

# Execute HOMER findPeaks with specified parameters
findPeaks "${TAG_DIRECTORY}" -o "${OUTPUT_PEAKS}" -L 0 -C 0 -fdr 0.9 -minDist 200 -size 200 -g "${GENOME}" -style factor

View on GitHub

IDR peaks from different conditions involved in a comparison merged with HOMERâs mergePeaks and annotated with HOMERâs annotatePeaks.pl.

IDR v2.0.4

$ Bash example

# The input to this step are peaks generated by the IDR (Irreproducible Discovery Rate) method.
# The specific actions described (merging and annotation) are performed using HOMER tools.

# Install HOMER (if not already installed)
# conda install -c bioconda homer

# Example input IDR peak files from different conditions
# These files are assumed to be in HOMER peak format or a compatible format (e.g., BED, narrowPeak).
# Replace with actual input file paths.
input_idr_peaks_cond1="condition1_idr_peaks.narrowPeak"
input_idr_peaks_cond2="condition2_idr_peaks.narrowPeak"
input_idr_peaks_cond3="condition3_idr_peaks.narrowPeak"

# Merge IDR peaks from different conditions using HOMER's mergePeaks
# The '-d' option can be used to specify a distance for merging overlapping peaks (e.g., -d 200 for 200bp).
# Without '-d', it merges peaks that overlap by at least 1 bp.
# Output is typically a BED-like format.
merged_peaks_file="merged_idr_peaks.bed"
mergePeaks "${input_idr_peaks_cond1}" "${input_idr_peaks_cond2}" "${input_idr_peaks_cond3}" > "${merged_peaks_file}"

# Annotate the merged IDR peaks using HOMER's annotatePeaks.pl
# Using hg38 as the reference genome (latest assembly placeholder). Replace with appropriate genome if needed.
# The output is a tab-separated text file with annotation details.
annotated_peaks_file="annotated_idr_peaks.txt"
annotatePeaks.pl "${merged_peaks_file}" hg38 > "${annotated_peaks_file}"

The raw tags of all samples which had reasonable correlation were quantified with HOMER (annotatePeaks.pl) using parameter â-noadjâ.

HOMER vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install HOMER (e.g., via conda)
# conda install -c bioconda homer=4.12

# Assuming 'input_peaks.bed' is the file containing raw tags/peaks that had reasonable correlation
# and 'hg38' is the reference genome assembly (replace with actual if known)
# The output will be a tab-separated file with peak annotations.
annotatePeaks.pl input_peaks.bed hg38 -noadj > output_annotations.txt

View on GitHub

Peaks which contained at least 4 tags in at least 1 sample were used to identify differentially bounded peaks (DBP) by DESeq2.

DESeq2 v1.42.0 GitHub

$ Bash example

# Install R and DESeq2 (if not already installed)
# conda install -c conda-forge r-base
# conda install -c bioconda bioconductor-deseq2

# Create dummy input files for demonstration
# In a real pipeline, 'peak_counts.tsv' would be generated by an upstream peak quantification step
# and would already reflect the pre-filtering criterion: 
# "Peaks which contained at least 4 tags in at least 1 sample".
# 'sample_info.tsv' would contain metadata for each sample.
cat << EOF > peak_counts.tsv
peak_id	sample1	sample2	sample3	sample4
peak1	10	12	5	7
peak2	0	1	0	0
peak3	50	60	25	30
peak4	2	3	1	2
peak5	15	18	8	10
peak6	4	5	2	3
peak7	0	0	0	0
peak8	6	7	3	4
EOF

cat << EOF > sample_info.tsv
sample_id	condition
sample1	control
sample2	control
sample3	treated
sample4	treated
EOF

# R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
library(DESeq2)

# Load count data
# Assuming 'peak_counts.tsv' already reflects the pre-filtering: 
# "Peaks which contained at least 4 tags in at least 1 sample"
counts_df <- read.table("peak_counts.tsv", header=TRUE, row.names=1, sep="\t")
# Ensure counts are integers
counts_df <- round(counts_df)

# Load sample information
sample_info <- read.table("sample_info.tsv", header=TRUE, row.names=1, sep="\t")

# Ensure sample order matches between counts and sample_info
sample_info <- sample_info[colnames(counts_df), , drop=FALSE]

# Create DESeqDataSet object
# The design formula specifies the experimental factors to be modeled.
# Here, we assume a simple comparison between 'control' and 'treated' conditions.
dds <- DESeqDataSetFromMatrix(countData = counts_df,
                              colData = sample_info,
                              design = ~ condition)

# DESeq2 recommends filtering out rows that have zero or very few counts across all samples.
# If the input 'peak_counts.tsv' was already filtered based on "at least 4 tags in at least 1 sample",
# then this internal filtering might be less critical but still good practice.
# For demonstration, we'll proceed assuming adequate pre-filtering.

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results for the contrast of interest (e.g., treated vs control)
# The contrast argument specifies the factor, the numerator level, and the denominator level.
res <- results(dds, contrast=c("condition", "treated", "control"))

# Order results by adjusted p-value
res <- res[order(res$padj),]

# Save results
write.csv(as.data.frame(res), file="deseq2_results.csv")

# Optional: Save normalized counts
# normalized_counts <- counts(dds, normalized=TRUE)
# write.csv(as.data.frame(normalized_counts), file="deseq2_normalized_counts.csv")
EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

Peaks were categorized as distal peaks which are 2 kb away from known TSS and promotor peaks which are located within 2 kb region of known TSS sites.

bedtools (Inferred with models/gemini-2.5-flash) v2.30.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install bedtools (if not already installed)
# conda install -c bioconda bedtools

# Define input files and reference data
# Replace 'input_peaks.bed' with the actual path to your peak file.
PEAKS_BED="input_peaks.bed"

# Replace 'hg38_gencode_v38_tss.bed' with the actual path to your TSS BED file.
# This file should contain 1bp regions representing the Transcription Start Sites.
# Example: chr1\t10000\t10001\tgeneA_tss\t0\t+
TSS_BED="hg38_gencode_v38_tss.bed"

# Replace 'hg38.chrom.sizes' with the actual path to your chromosome sizes file.
# This file is required by bedtools slop to prevent extending regions off chromosome ends.
# Example: chr1\t248956422
CHROM_SIZES="hg38.chrom.sizes"

# Create dummy files for demonstration if they don't exist.
# In a real pipeline, these files would be pre-existing and properly generated.
if [ ! -f "$CHROM_SIZES" ]; then
echo "Warning: Creating dummy $CHROM_SIZES for demonstration. Replace with actual file for real analysis."
echo -e "chr1\t248956422\nchr2\t242193529\nchr3\t198295559\nchrX\t156040895\nchrY\t57227415" > "$CHROM_SIZES"
fi

if [ ! -f "$TSS_BED" ]; then
echo "Warning: Creating dummy $TSS_BED for demonstration. Replace with actual file for real analysis."
# Example TSS for a few genes on chr1
echo -e "chr1\t10000\t10001\tgeneA_tss\t0\t+\nchr1\t50000\t50001\tgeneB_tss\t0\t-" > "$TSS_BED"
fi

# Define the promoter region size (2 kb)
PROMOTER_DISTANCE=2000

# 1. Define promoter regions by extending TSS sites by PROMOTER_DISTANCE in both directions.
# bedtools slop extends features by a specified number of base pairs.
bedtools slop -i "$TSS_BED" -g "$CHROM_SIZES" -b "$PROMOTER_DISTANCE" > promoter_regions_slop.bed

# 2. Merge overlapping promoter regions to create a consolidated set of promoter areas.
# This prevents a single peak from being counted multiple times if it overlaps several TSS regions.
bedtools merge -i promoter_regions_slop.bed > merged_promoter_regions.bed

# 3. Identify promoter peaks: Peaks that overlap with the merged promoter regions.
# bedtools intersect -a <peaks> -b <promoter_regions> finds overlaps.
bedtools intersect -a "$PEAKS_BED" -b merged_promoter_regions.bed > promoter_peaks.bed

# 4. Identify distal peaks: Peaks that do NOT overlap with the merged promoter regions.
# bedtools intersect -a <peaks> -b <promoter_regions> -v finds features in -a that do NOT overlap -b.
bedtools intersect -a "$PEAKS_BED" -b merged_promoter_regions.bed -v > distal_peaks.bed

# Clean up intermediate files
rm promoter_regions_slop.bed merged_promoter_regions.bed

View on GitHub

ATAC peak quantification was normalized to the total tags in peaks.

DiffBind (Inferred with models/gemini-2.5-flash) v3.10.0

$ Bash example

# Install R and Bioconductor if not already present
# sudo apt-get update && sudo apt-get install -y r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DiffBind")'

# Create an R script file for ATAC peak quantification and normalization
# This script assumes you have a 'samplesheet.csv' file defining your samples, 
# BAM files, and peak files. Example 'samplesheet.csv' structure:
# SampleID,Condition,Replicate,bamReads,Peaks,PeakCaller
# Sample1,Control,1,sample1.bam,sample1_peaks.bed,bed
# Sample2,Control,2,sample2.bam,sample2_peaks.bed,bed
cat << 'EOF' > run_atac_quant_norm.R
library(DiffBind)

# Load samplesheet (replace with your actual samplesheet path)
samplesheet <- dba(sampleSheet="samplesheet.csv")

# Step 1: Count reads in peaks for all samples
# This creates a dba object with raw counts for each peak in each sample.
# score=DBA_SCORE_READS ensures raw read counts are used.
# bUseSummarizeOverlaps=TRUE uses the more robust SummarizedExperiment approach.
message("Counting reads in peaks...")
samplesheet_counts <- dba.count(samplesheet, bUseSummarizeOverlaps=TRUE, score=DBA_SCORE_READS)

# Step 2: Extract raw counts matrix
# This retrieves the raw count matrix where rows are peaks and columns are samples.
raw_counts_matrix <- dba.peakset(samplesheet_counts, bRetrieve=TRUE, DataType=DBA_DATA_COUNTS)

# Step 3: Calculate total tags in peaks for each sample
# This sums the raw counts for all peaks for each individual sample.
total_tags_in_peaks <- colSums(raw_counts_matrix)

# Step 4: Normalize counts by total tags in peaks
# Each peak's count is divided by the total tags in peaks for its respective sample.
# A small pseudocount (1e-6) is added to the denominator to prevent division by zero.
# The result is scaled by 1e6 to represent Counts Per Million (CPM) within peaks.
message("Normalizing counts to total tags in peaks...")
normalized_counts_matrix <- t(t(raw_counts_matrix) / (total_tags_in_peaks + 1e-6)) * 1e6

# Save the normalized counts to a CSV file
output_file <- "atac_normalized_peak_counts.csv"
write.csv(normalized_counts_matrix, output_file, row.names=TRUE)

message(paste0("ATAC peak quantification and normalization complete. Normalized counts saved to ", output_file))
EOF

# Execute the R script
Rscript run_atac_quant_norm.R

Tools Used

Bowtie2 HOMER DESeq2

Raw Source Text

FASTQ files were mapped to the mm10 mouse reference genome with Bowtie2.  Peaks were called with HOMER (findPeaks) using parameters â-style factor -minDist 200 -size 200â.  After merging these peaks, correlations among replicates from the same cell subset/treatment were evaluated by correlation using tag counts.  The two most highly correlated samples were used for identifying the most robust peaks using the irreproducible discovery rate (IDR) method.  For this step, peaks were called with HOMERâs findPeaks, using parameters â-L 0 -C 0 -fdr 0.9 -minDist 200 -size 200â.  IDR peaks from different conditions involved in a comparison merged with HOMERâs mergePeaks and annotated with HOMERâs annotatePeaks.pl.  The raw tags of all samples which had reasonable correlation were quantified with HOMER (annotatePeaks.pl) using parameter â-noadjâ.  Peaks which contained at least 4 tags in at least 1 sample were used to identify differentially bounded peaks (DBP) by DESeq2.  Peaks were categorized as distal peaks which are 2 kb away from known TSS and promotor peaks which are located within 2 kb region of known TSS sites.
ATAC peak quantification was normalized to the total tags in peaks.
Assembly: mm10

← Back to Analysis