GSE213467 Processing Pipeline

ChIP-Seq code_examples 11 steps

Publication

Early transcriptional and epigenetic divergence of CD8+ T cells responding to acute versus chronic infection.

PLoS biology (2023) — PMID 36716323

Dataset

Areas of H3K27 tri-methylation in CD8+ T cells from the spleens of LCMV-Armstrong or LCMV-Clone 13 infected mice

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Basecalls were performed using bcl2fastq v2.17 for Novaseq output.

bcl2fastq v2.17 GitHub

$ Bash example

# Install bcl2fastq (example, specific installation depends on OS and Illumina's distribution method)
# On some systems, it might be available via apt/yum or direct download from Illumina.
# For example, if using a container:
# docker pull illumina/bcl2fastq:2.17

# Example command for bcl2fastq v2.17
# Replace /path/to/novaseq_run_folder with the actual path to the BCL run folder from the Novaseq instrument.
# Replace /path/to/output_fastq_folder with the desired output directory for FASTQ files.
bcl2fastq --runfolder-dir /path/to/novaseq_run_folder --output-dir /path/to/output_fastq_folder

View on GitHub

Reads were trimmed from 3' end until the final base had a quality score > 30, using Trimmomatic v0.38, discarding reads left with < 25 bp

Trimmomatic v0.38

$ Bash example

# Install Trimmomatic (if not already installed)
# conda install -c bioconda trimmomatic=0.38

# Define input and output file names (placeholders)
# For single-end reads:
INPUT_READS="input_reads.fastq.gz"
OUTPUT_READS="trimmed_reads.fastq.gz"

# For paired-end reads (uncomment and adjust if applicable):
# INPUT_READS_FWD="input_reads_R1.fastq.gz"
# INPUT_READS_REV="input_reads_R2.fastq.gz"
# OUTPUT_READS_FWD_PAIRED="trimmed_reads_R1_paired.fastq.gz"
# OUTPUT_READS_REV_PAIRED="trimmed_reads_R2_paired.fastq.gz"
# OUTPUT_READS_FWD_UNPAIRED="trimmed_reads_R1_unpaired.fastq.gz"
# OUTPUT_READS_REV_UNPAIRED="trimmed_reads_R2_unpaired.fastq.gz"

# Path to Trimmomatic JAR file (adjust if installed via conda or other means)
# If installed via conda, 'trimmomatic' might be directly in your PATH
TRIMMOMATIC_JAR="/path/to/trimmomatic-0.38.jar"

# Execute Trimmomatic for single-end reads
# Parameters:
# -phred33: Assumes Phred+33 quality scores (common for Illumina)
# TRAILING:30: Trim bases from the 3' end if their quality score is below 30.
# MINLEN:25: Discard reads shorter than 25 bp after trimming.
java -jar ${TRIMMOMATIC_JAR} SE \
    -phred33 \
    ${INPUT_READS} \
    ${OUTPUT_READS} \
    TRAILING:30 \
    MINLEN:25

# Example for paired-end reads (uncomment and adjust if applicable):
# java -jar ${TRIMMOMATIC_JAR} PE \
#     -phred33 \
#     ${INPUT_READS_FWD} \
#     ${INPUT_READS_REV} \
#     ${OUTPUT_READS_FWD_PAIRED} \
#     ${OUTPUT_READS_FWD_UNPAIRED} \
#     ${OUTPUT_READS_REV_PAIRED} \
#     ${OUTPUT_READS_REV_UNPAIRED} \
#     TRAILING:30 \
#     MINLEN:25

Reads were aligned to the mouse genome (mm10) using Bowtie version 1.1.2.

Bowtie v1.1.2 GitHub

$ Bash example

# Install Bowtie 1.1.2
# conda install -c bioconda bowtie=1.1.2

# Align reads to the mm10 genome using Bowtie 1.1.2
# Input: reads.fastq (single-end reads assumed, adjust for paired-end if needed)
# Output: aligned_reads.sam (SAM format)
# Parameters:
#   -S: Output alignments in SAM format
#   -p 8: Use 8 threads (adjust as per available resources)
#   -v 2: Allow up to 2 mismatches in the alignment seed (common for short reads with Bowtie 1)
#   --best --strata: Report only alignments that are "best" in terms of stratum and number of mismatches
#   <mm10_index_base>: Base name of the Bowtie index files for mm10 (e.g., /path/to/bowtie_indexes/mm10)
#   <reads.fastq>: Input FASTQ file containing sequencing reads
#   > aligned_reads.sam: Redirect output to a SAM file
bowtie -S -p 8 -v 2 --best --strata <mm10_index_base> <reads.fastq> > aligned_reads.sam

View on GitHub

Parameters: --local --very-sensitive-local --no-unal --no-mixed --no-discordant --phred33 -I 10 -X 700

Bowtie2 (Inferred with models/gemini-2.5-flash) v2.5.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Bowtie2 (example using conda)
# conda install -c bioconda bowtie2

# Placeholder for Bowtie2 index (e.g., GRCh38) and input/output files.
# Replace 'path/to/genome_index/GRCh38' with the actual path to your Bowtie2 index files for the reference genome.
# Replace 'reads_R1.fastq.gz' and 'reads_R2.fastq.gz' with your actual input paired-end read files.
# Replace 'output.sam' with your desired output SAM file name.

bowtie2 \
  --local \
  --very-sensitive-local \
  --no-unal \
  --no-mixed \
  --no-discordant \
  --phred33 \
  -I 10 \
  -X 700 \
  -x path/to/genome_index/GRCh38 \
  -1 reads_R1.fastq.gz \
  -2 reads_R2.fastq.gz \
  -S output.sam

View on GitHub

Reads were aligned to the yeast genome (sacCer3) using Bowtie version 1.1.2.

Bowtie v1.1.2 GitHub

$ Bash example

# Install Bowtie (if not already installed)
# conda install -c bioconda bowtie=1.1.2

# Download yeast genome (sacCer3) from UCSC
# wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.fa.gz
# gunzip sacCer3.fa.gz
# mv sacCer3.fa sacCer3.fasta

# Build Bowtie index
# Replace sacCer3.fasta with the actual path to your genome file
bowtie-build sacCer3.fasta sacCer3_index

# Align reads
# Replace reads.fastq with your actual input reads file (e.g., single-end reads)
# Replace aligned_reads.sam with your desired output file name
# -S: output SAM format
# -p 4: use 4 threads (adjust as needed for your system)
bowtie -S -p 4 sacCer3_index reads.fastq > aligned_reads.sam

View on GitHub

Parameters: --local --very-sensitive-local --no-overlap --no-dovetail --no-unal --no-mixed --no-discordant --phred33 -I 10 -X 70

bowtie2 (Inferred with models/gemini-2.5-flash) v2.4.2 GitHub

$ Bash example

# Install bowtie2 if not already installed
# conda install -c bioconda bowtie2

# Placeholder for reference genome index
# Replace with the actual path to your bowtie2 index prefix (e.g., /path/to/your/genome/index/hg38)
# The index files should be named hg38.1.bt2, hg38.2.bt2, etc.
REFERENCE_GENOME_INDEX="/path/to/your/genome/index/hg38"

# Placeholder for input FASTQ files
# Replace with your actual input FASTQ file paths
READ1_FASTQ="input_read1.fastq.gz"
READ2_FASTQ="input_read2.fastq.gz"

# Placeholder for output SAM file
OUTPUT_SAM="output_aligned.sam"

# Number of threads to use
NUM_THREADS=8 # Adjust as needed

bowtie2 \
    --local \
    --very-sensitive-local \
    --no-overlap \
    --no-dovetail \
    --no-unal \
    --no-mixed \
    --no-discordant \
    --phred33 \
    -I 10 \
    -X 70 \
    -p "${NUM_THREADS}" \
    -x "${REFERENCE_GENOME_INDEX}" \
    -1 "${READ1_FASTQ}" \
    -2 "${READ2_FASTQ}" \
    -S "${OUTPUT_SAM}"

View on GitHub

Spike-in normalization.

Custom Script (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# This is a conceptual script for spike-in normalization.
# Actual implementation may vary based on assay (e.g., eCLIP, RNA-seq) and specific spike-in design.
# It typically involves counting reads mapping to spike-in sequences and using this to scale endogenous counts.

# Define input parameters (placeholders)
INPUT_BAM="sample_aligned.bam" # BAM file aligned to combined genome (e.g., human + dm6)
SPIKEIN_CHROMOSOME_PREFIX="dm" # Prefix for spike-in chromosomes (e.g., dm6_chr2L, dm6_chrU)
TARGET_SPIKEIN_READ_COUNT=1000000 # A reference spike-in count for normalization (e.g., average across samples)
OUTPUT_NORMALIZATION_FACTOR_FILE="normalization_factor.txt"

# Ensure samtools is installed
# conda install -c bioconda samtools

# 1. Count reads mapping to spike-in sequences
# This command counts reads where the reference name starts with the spike-in prefix.
# Note: This assumes spike-in chromosomes are clearly distinguishable in the BAM header.
# A more robust approach might involve a list of specific spike-in contigs.
SPIKEIN_READ_COUNT=$(samtools idxstats "$INPUT_BAM" | awk -v prefix="$SPIKEIN_CHROMOSOME_PREFIX" '$1 ~ "^"prefix {sum+=$3} END {print sum}')

echo "Observed spike-in read count: $SPIKEIN_READ_COUNT"

# 2. Calculate the normalization factor
if (( $(echo "$SPIKEIN_READ_COUNT == 0" | bc -l) )); then
    echo "Warning: Spike-in read count is zero. Normalization factor set to 1.0."
    NORMALIZATION_FACTOR=1.0
else
    NORMALIZATION_FACTOR=$(awk "BEGIN { print $TARGET_SPIKEIN_READ_COUNT / $SPIKEIN_READ_COUNT }")
fi

echo "Calculated normalization factor: $NORMALIZATION_FACTOR"
echo "$NORMALIZATION_FACTOR" > "$OUTPUT_NORMALIZATION_FACTOR_FILE"

# 3. Conceptual step: Apply the normalization factor
echo "# The calculated factor ($NORMALIZATION_FACTOR) would then be used to scale"
echo "# endogenous read counts, signal tracks (e.g., bigWig files), or other quantitative metrics."
echo "# Example (conceptual): scale_bigwig.py -i unnormalized.bw -o normalized.bw -f $NORMALIZATION_FACTOR"

View on GitHub

The genomic coverage were then normalized by applying the scaling factor that is calculated from the number of mapped reads to mouse genome and yeast genome

Custom Python Script (eCLIP Spike-in Normalization) (Inferred with models/gemini-2.5-flash) vN/A (part of Skipper pipeline) (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# --- Prerequisites: Input BAM files and unnormalized BigWig ---
# Replace with actual paths to your BAM files and BigWig files.
# The BAM files are assumed to be generated by aligning reads to their respective genomes.
# Reference genomes used for alignment (e.g., mouse: mm39, yeast: sacCer3) are implicitly required for these BAMs.
MOUSE_BAM="path/to/your/mouse_mapped_reads.bam"
YEAST_BAM="path/to/your/yeast_spikein_mapped_reads.bam"
INPUT_BIGWIG="path/to/your/unnormalized_coverage.bw"
OUTPUT_BIGWIG="path/to/your/normalized_coverage.bw"

# --- Step 1: Count mapped reads using samtools ---
# Install samtools if not available
# conda install -c bioconda samtools

# Get total mapped reads for the target genome (e.g., mouse)
MOUSE_MAPPED_READS=$(samtools flagstat "${MOUSE_BAM}" | awk '/mapped \(/ {print $1}')

# Get total mapped reads for the spike-in genome (e.g., yeast)
YEAST_MAPPED_READS=$(samtools flagstat "${YEAST_BAM}" | awk '/mapped \(/ {print $1}')

echo "Mouse Mapped Reads: ${MOUSE_MAPPED_READS}"
echo "Yeast Mapped Reads: ${YEAST_MAPPED_READS}"

# --- Step 2: Calculate the scaling factor ---
# This example assumes normalization to a fixed reference amount of yeast spike-in reads.
# Adjust REFERENCE_YEAST_READS based on your experimental design or pipeline standards.
REFERENCE_YEAST_READS=10000000 # Example: Normalize all samples as if they had 10 million yeast reads

if (( $(echo "$YEAST_MAPPED_READS == 0" | bc -l) )); then
    echo "Error: No yeast reads mapped (${YEAST_BAM}). Cannot calculate scaling factor."
    exit 1
fi

# Scaling factor to apply to mouse coverage: (Reference Yeast Reads / Actual Yeast Reads)
SCALING_FACTOR=$(echo "scale=6; ${REFERENCE_YEAST_READS} / ${YEAST_MAPPED_READS}" | bc)

echo "Calculated Scaling Factor: ${SCALING_FACTOR}"

# --- Step 3: Apply the scaling factor to genomic coverage ---
# This step uses a custom Python script, commonly found in eCLIP pipelines like Yeo lab's Skipper.
# The script `normalize_bigwig_with_spikein.py` is inferred from the context.
# Ensure the script is accessible (e.g., by cloning the Skipper repository).

# Clone the Skipper repository if the script is not locally available
# git clone https://github.com/yeolab/skipper.git

# Replace with actual path to the Skipper scripts directory
SKIPPER_SCRIPTS_DIR="/path/to/yeolab/skipper/scripts"

python "${SKIPPER_SCRIPTS_DIR}/normalize_bigwig_with_spikein.py" \
    --input_bigwig "${INPUT_BIGWIG}" \
    --output_bigwig "${OUTPUT_BIGWIG}" \
    --scaling_factor "${SCALING_FACTOR}"

View on GitHub

Peaks were called uing MACS v2.1.0 with the significance cut-off q-value <=0.01, using Callpeak, and compared the sample to corresponding IgG samples using bdgcmp in FE (fold-enrichment) mode.

MACS2 v2.1.0

$ Bash example

# Install MACS2 if not already installed
# conda install -c bioconda macs2=2.1.0

# Define input files and output prefix
TREATMENT_BAM="treatment.bam"
CONTROL_BAM="control.bam"
OUTPUT_PREFIX="my_sample"
GENOME_SIZE="hs" # Example: 'hs' for human, 'mm' for mouse, or a specific number like 2.7e9

# Call peaks using MACS2 callpeak
# -t: Treatment file (sample)
# -c: Control file (corresponding IgG samples)
# -f BAM: Input file format (assuming BAM, could be BAMPE for paired-end)
# -g: Genome size (e.g., hs for human, mm for mouse)
# -q 0.01: Q-value cutoff for peak detection (significance cut-off q-value <=0.01)
# -n: Name string for output files
macs2 callpeak -t "${TREATMENT_BAM}" -c "${CONTROL_BAM}" -f BAM -g "${GENOME_SIZE}" -q 0.01 -n "${OUTPUT_PREFIX}"

# Generate fold-enrichment bedGraph file using bdgcmp
# -t: Treatment pileup bedGraph file (generated by callpeak)
# -c: Control lambda bedGraph file (generated by callpeak)
# -m FE: Mode for comparison (Fold Enrichment)
# -o: Output bedGraph file
macs2 bdgcmp -t "${OUTPUT_PREFIX}_treat_pileup.bdg" -c "${OUTPUT_PREFIX}_control_lambda.bdg" -m FE -o "${OUTPUT_PREFIX}_FE.bdg"

bigWig files were generated using bdg2bw from the bedgraph output from last step.

UCSC tools vLatest GitHub

$ Bash example

# Install UCSC tools (contains bedGraphToBigWig, which bdg2bw typically wraps)
# conda install -c bioconda ucsc-bedgraphtobigwig

# Download chromosome sizes file for hg38 (example reference genome)
# wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes -O hg38.chrom.sizes

# Define input and output files
input_bedgraph="input.bedgraph" # Replace with actual bedgraph file from previous step
chrom_sizes_file="hg38.chrom.sizes" # Replace with actual chrom.sizes file
output_bigwig="output.bigWig" # Desired output bigWig file

# Generate bigWig file using bdg2bw (assumed to be a wrapper for bedGraphToBigWig)
bdg2bw "${input_bedgraph}" "${chrom_sizes_file}" "${output_bigwig}"

View on GitHub

Score represents the FE values at a given genomic coordinate. narrowPeak files were generated using MACS v2 with default settings.

MACS2 v2

$ Bash example

# Install MACS2 (if not already installed)
# conda install -c bioconda macs2

# Define input files and output prefix
TREATMENT_BAM="treatment.bam" # Placeholder for treatment BAM file
CONTROL_BAM="control.bam"   # Placeholder for control BAM file
OUTPUT_PREFIX="my_experiment" # Prefix for output files
OUTPUT_DIR="macs2_output"
GENOME_SIZE="hs" # Example: 'hs' for human, 'mm' for mouse, or a specific number like '2.7e9'

mkdir -p "${OUTPUT_DIR}"

# Run MACS2 callpeak with default settings
# The description mentions "default settings" and "narrowPeak files".
# MACS2's default behavior for callpeak is to generate narrowPeak files.
# The score in narrowPeak files (column 5) is typically -log10(pvalue),
# while fold enrichment (FE) is in column 7. The description "Score represents the FE values"
# might be referring to the FE column or a derived score.
# We'll use standard callpeak command with common defaults.
macs2 callpeak \
  -t "${TREATMENT_BAM}" \
  -c "${CONTROL_BAM}" \
  -f BAM \
  -g "${GENOME_SIZE}" \
  -n "${OUTPUT_PREFIX}" \
  --outdir "${OUTPUT_DIR}" \
  --qvalue 0.05 # Common default q-value threshold for peak calling

Tools Used

UCSC tools

Raw Source Text

Basecalls were performed using bcl2fastq v2.17 for Novaseq output.
Reads were trimmed from 3' end until the final base had a quality score > 30, using Trimmomatic v0.38, discarding reads left with < 25 bp
Reads were aligned to the mouse genome (mm10) using Bowtie version 1.1.2. Parameters: --local --very-sensitive-local --no-unal --no-mixed --no-discordant --phred33 -I 10 -X 700
Reads were aligned to the yeast genome (sacCer3) using Bowtie version 1.1.2. Parameters: --local --very-sensitive-local --no-overlap --no-dovetail --no-unal --no-mixed --no-discordant --phred33 -I 10 -X 70
Spike-in normalization. The genomic coverage were then normalized by applying the scaling factor that is calculated from the number of mapped reads to mouse genome and yeast genome
Peaks were called uing MACS v2.1.0 with the significance cut-off q-value <=0.01, using Callpeak, and compared the sample to corresponding IgG samples using bdgcmp in FE (fold-enrichment) mode.
bigWig files were generated using bdg2bw from the bedgraph output from last step. Score represents the FE values at a given genomic coordinate. narrowPeak files were generated using MACS v2 with default settings.
Assembly: mm10
Supplementary files format and content: bigWig, narrowPeak (except for IgG sample)

← Back to Analysis