GSE213467 Processing Pipeline
Publication
Early transcriptional and epigenetic divergence of CD8+ T cells responding to acute versus chronic infection.PLoS biology (2023) — PMID 36716323
Dataset
GSE213467Areas of H3K27 tri-methylation in CD8+ T cells from the spleens of LCMV-Armstrong or LCMV-Clone 13 infected mice
Processing Steps
Generate Jupyter Notebook-
1
Basecalls were performed using bcl2fastq v2.17 for Novaseq output.
$ Bash example
# Install bcl2fastq (example, specific installation depends on OS and Illumina's distribution method) # On some systems, it might be available via apt/yum or direct download from Illumina. # For example, if using a container: # docker pull illumina/bcl2fastq:2.17 # Example command for bcl2fastq v2.17 # Replace /path/to/novaseq_run_folder with the actual path to the BCL run folder from the Novaseq instrument. # Replace /path/to/output_fastq_folder with the desired output directory for FASTQ files. bcl2fastq --runfolder-dir /path/to/novaseq_run_folder --output-dir /path/to/output_fastq_folder
-
2
Reads were trimmed from 3' end until the final base had a quality score > 30, using Trimmomatic v0.38, discarding reads left with < 25 bp
Trimmomatic v0.38$ Bash example
# Install Trimmomatic (if not already installed) # conda install -c bioconda trimmomatic=0.38 # Define input and output file names (placeholders) # For single-end reads: INPUT_READS="input_reads.fastq.gz" OUTPUT_READS="trimmed_reads.fastq.gz" # For paired-end reads (uncomment and adjust if applicable): # INPUT_READS_FWD="input_reads_R1.fastq.gz" # INPUT_READS_REV="input_reads_R2.fastq.gz" # OUTPUT_READS_FWD_PAIRED="trimmed_reads_R1_paired.fastq.gz" # OUTPUT_READS_REV_PAIRED="trimmed_reads_R2_paired.fastq.gz" # OUTPUT_READS_FWD_UNPAIRED="trimmed_reads_R1_unpaired.fastq.gz" # OUTPUT_READS_REV_UNPAIRED="trimmed_reads_R2_unpaired.fastq.gz" # Path to Trimmomatic JAR file (adjust if installed via conda or other means) # If installed via conda, 'trimmomatic' might be directly in your PATH TRIMMOMATIC_JAR="/path/to/trimmomatic-0.38.jar" # Execute Trimmomatic for single-end reads # Parameters: # -phred33: Assumes Phred+33 quality scores (common for Illumina) # TRAILING:30: Trim bases from the 3' end if their quality score is below 30. # MINLEN:25: Discard reads shorter than 25 bp after trimming. java -jar ${TRIMMOMATIC_JAR} SE \ -phred33 \ ${INPUT_READS} \ ${OUTPUT_READS} \ TRAILING:30 \ MINLEN:25 # Example for paired-end reads (uncomment and adjust if applicable): # java -jar ${TRIMMOMATIC_JAR} PE \ # -phred33 \ # ${INPUT_READS_FWD} \ # ${INPUT_READS_REV} \ # ${OUTPUT_READS_FWD_PAIRED} \ # ${OUTPUT_READS_FWD_UNPAIRED} \ # ${OUTPUT_READS_REV_PAIRED} \ # ${OUTPUT_READS_REV_UNPAIRED} \ # TRAILING:30 \ # MINLEN:25 -
3
Reads were aligned to the mouse genome (mm10) using Bowtie version 1.1.2.
$ Bash example
# Install Bowtie 1.1.2 # conda install -c bioconda bowtie=1.1.2 # Align reads to the mm10 genome using Bowtie 1.1.2 # Input: reads.fastq (single-end reads assumed, adjust for paired-end if needed) # Output: aligned_reads.sam (SAM format) # Parameters: # -S: Output alignments in SAM format # -p 8: Use 8 threads (adjust as per available resources) # -v 2: Allow up to 2 mismatches in the alignment seed (common for short reads with Bowtie 1) # --best --strata: Report only alignments that are "best" in terms of stratum and number of mismatches # <mm10_index_base>: Base name of the Bowtie index files for mm10 (e.g., /path/to/bowtie_indexes/mm10) # <reads.fastq>: Input FASTQ file containing sequencing reads # > aligned_reads.sam: Redirect output to a SAM file bowtie -S -p 8 -v 2 --best --strata <mm10_index_base> <reads.fastq> > aligned_reads.sam
-
4
Parameters: --local --very-sensitive-local --no-unal --no-mixed --no-discordant --phred33 -I 10 -X 700
Bowtie2 (Inferred with models/gemini-2.5-flash) v2.5.1 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install Bowtie2 (example using conda) # conda install -c bioconda bowtie2 # Placeholder for Bowtie2 index (e.g., GRCh38) and input/output files. # Replace 'path/to/genome_index/GRCh38' with the actual path to your Bowtie2 index files for the reference genome. # Replace 'reads_R1.fastq.gz' and 'reads_R2.fastq.gz' with your actual input paired-end read files. # Replace 'output.sam' with your desired output SAM file name. bowtie2 \ --local \ --very-sensitive-local \ --no-unal \ --no-mixed \ --no-discordant \ --phred33 \ -I 10 \ -X 700 \ -x path/to/genome_index/GRCh38 \ -1 reads_R1.fastq.gz \ -2 reads_R2.fastq.gz \ -S output.sam
-
5
Reads were aligned to the yeast genome (sacCer3) using Bowtie version 1.1.2.
$ Bash example
# Install Bowtie (if not already installed) # conda install -c bioconda bowtie=1.1.2 # Download yeast genome (sacCer3) from UCSC # wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.fa.gz # gunzip sacCer3.fa.gz # mv sacCer3.fa sacCer3.fasta # Build Bowtie index # Replace sacCer3.fasta with the actual path to your genome file bowtie-build sacCer3.fasta sacCer3_index # Align reads # Replace reads.fastq with your actual input reads file (e.g., single-end reads) # Replace aligned_reads.sam with your desired output file name # -S: output SAM format # -p 4: use 4 threads (adjust as needed for your system) bowtie -S -p 4 sacCer3_index reads.fastq > aligned_reads.sam
-
6
Parameters: --local --very-sensitive-local --no-overlap --no-dovetail --no-unal --no-mixed --no-discordant --phred33 -I 10 -X 70
$ Bash example
# Install bowtie2 if not already installed # conda install -c bioconda bowtie2 # Placeholder for reference genome index # Replace with the actual path to your bowtie2 index prefix (e.g., /path/to/your/genome/index/hg38) # The index files should be named hg38.1.bt2, hg38.2.bt2, etc. REFERENCE_GENOME_INDEX="/path/to/your/genome/index/hg38" # Placeholder for input FASTQ files # Replace with your actual input FASTQ file paths READ1_FASTQ="input_read1.fastq.gz" READ2_FASTQ="input_read2.fastq.gz" # Placeholder for output SAM file OUTPUT_SAM="output_aligned.sam" # Number of threads to use NUM_THREADS=8 # Adjust as needed bowtie2 \ --local \ --very-sensitive-local \ --no-overlap \ --no-dovetail \ --no-unal \ --no-mixed \ --no-discordant \ --phred33 \ -I 10 \ -X 70 \ -p "${NUM_THREADS}" \ -x "${REFERENCE_GENOME_INDEX}" \ -1 "${READ1_FASTQ}" \ -2 "${READ2_FASTQ}" \ -S "${OUTPUT_SAM}" -
7
Spike-in normalization.
$ Bash example
# This is a conceptual script for spike-in normalization. # Actual implementation may vary based on assay (e.g., eCLIP, RNA-seq) and specific spike-in design. # It typically involves counting reads mapping to spike-in sequences and using this to scale endogenous counts. # Define input parameters (placeholders) INPUT_BAM="sample_aligned.bam" # BAM file aligned to combined genome (e.g., human + dm6) SPIKEIN_CHROMOSOME_PREFIX="dm" # Prefix for spike-in chromosomes (e.g., dm6_chr2L, dm6_chrU) TARGET_SPIKEIN_READ_COUNT=1000000 # A reference spike-in count for normalization (e.g., average across samples) OUTPUT_NORMALIZATION_FACTOR_FILE="normalization_factor.txt" # Ensure samtools is installed # conda install -c bioconda samtools # 1. Count reads mapping to spike-in sequences # This command counts reads where the reference name starts with the spike-in prefix. # Note: This assumes spike-in chromosomes are clearly distinguishable in the BAM header. # A more robust approach might involve a list of specific spike-in contigs. SPIKEIN_READ_COUNT=$(samtools idxstats "$INPUT_BAM" | awk -v prefix="$SPIKEIN_CHROMOSOME_PREFIX" '$1 ~ "^"prefix {sum+=$3} END {print sum}') echo "Observed spike-in read count: $SPIKEIN_READ_COUNT" # 2. Calculate the normalization factor if (( $(echo "$SPIKEIN_READ_COUNT == 0" | bc -l) )); then echo "Warning: Spike-in read count is zero. Normalization factor set to 1.0." NORMALIZATION_FACTOR=1.0 else NORMALIZATION_FACTOR=$(awk "BEGIN { print $TARGET_SPIKEIN_READ_COUNT / $SPIKEIN_READ_COUNT }") fi echo "Calculated normalization factor: $NORMALIZATION_FACTOR" echo "$NORMALIZATION_FACTOR" > "$OUTPUT_NORMALIZATION_FACTOR_FILE" # 3. Conceptual step: Apply the normalization factor echo "# The calculated factor ($NORMALIZATION_FACTOR) would then be used to scale" echo "# endogenous read counts, signal tracks (e.g., bigWig files), or other quantitative metrics." echo "# Example (conceptual): scale_bigwig.py -i unnormalized.bw -o normalized.bw -f $NORMALIZATION_FACTOR" -
8
The genomic coverage were then normalized by applying the scaling factor that is calculated from the number of mapped reads to mouse genome and yeast genome
Custom Python Script (eCLIP Spike-in Normalization) (Inferred with models/gemini-2.5-flash) vN/A (part of Skipper pipeline) (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# --- Prerequisites: Input BAM files and unnormalized BigWig --- # Replace with actual paths to your BAM files and BigWig files. # The BAM files are assumed to be generated by aligning reads to their respective genomes. # Reference genomes used for alignment (e.g., mouse: mm39, yeast: sacCer3) are implicitly required for these BAMs. MOUSE_BAM="path/to/your/mouse_mapped_reads.bam" YEAST_BAM="path/to/your/yeast_spikein_mapped_reads.bam" INPUT_BIGWIG="path/to/your/unnormalized_coverage.bw" OUTPUT_BIGWIG="path/to/your/normalized_coverage.bw" # --- Step 1: Count mapped reads using samtools --- # Install samtools if not available # conda install -c bioconda samtools # Get total mapped reads for the target genome (e.g., mouse) MOUSE_MAPPED_READS=$(samtools flagstat "${MOUSE_BAM}" | awk '/mapped \(/ {print $1}') # Get total mapped reads for the spike-in genome (e.g., yeast) YEAST_MAPPED_READS=$(samtools flagstat "${YEAST_BAM}" | awk '/mapped \(/ {print $1}') echo "Mouse Mapped Reads: ${MOUSE_MAPPED_READS}" echo "Yeast Mapped Reads: ${YEAST_MAPPED_READS}" # --- Step 2: Calculate the scaling factor --- # This example assumes normalization to a fixed reference amount of yeast spike-in reads. # Adjust REFERENCE_YEAST_READS based on your experimental design or pipeline standards. REFERENCE_YEAST_READS=10000000 # Example: Normalize all samples as if they had 10 million yeast reads if (( $(echo "$YEAST_MAPPED_READS == 0" | bc -l) )); then echo "Error: No yeast reads mapped (${YEAST_BAM}). Cannot calculate scaling factor." exit 1 fi # Scaling factor to apply to mouse coverage: (Reference Yeast Reads / Actual Yeast Reads) SCALING_FACTOR=$(echo "scale=6; ${REFERENCE_YEAST_READS} / ${YEAST_MAPPED_READS}" | bc) echo "Calculated Scaling Factor: ${SCALING_FACTOR}" # --- Step 3: Apply the scaling factor to genomic coverage --- # This step uses a custom Python script, commonly found in eCLIP pipelines like Yeo lab's Skipper. # The script `normalize_bigwig_with_spikein.py` is inferred from the context. # Ensure the script is accessible (e.g., by cloning the Skipper repository). # Clone the Skipper repository if the script is not locally available # git clone https://github.com/yeolab/skipper.git # Replace with actual path to the Skipper scripts directory SKIPPER_SCRIPTS_DIR="/path/to/yeolab/skipper/scripts" python "${SKIPPER_SCRIPTS_DIR}/normalize_bigwig_with_spikein.py" \ --input_bigwig "${INPUT_BIGWIG}" \ --output_bigwig "${OUTPUT_BIGWIG}" \ --scaling_factor "${SCALING_FACTOR}" -
9
Peaks were called uing MACS v2.1.0 with the significance cut-off q-value <=0.01, using Callpeak, and compared the sample to corresponding IgG samples using bdgcmp in FE (fold-enrichment) mode.
MACS2 v2.1.0$ Bash example
# Install MACS2 if not already installed # conda install -c bioconda macs2=2.1.0 # Define input files and output prefix TREATMENT_BAM="treatment.bam" CONTROL_BAM="control.bam" OUTPUT_PREFIX="my_sample" GENOME_SIZE="hs" # Example: 'hs' for human, 'mm' for mouse, or a specific number like 2.7e9 # Call peaks using MACS2 callpeak # -t: Treatment file (sample) # -c: Control file (corresponding IgG samples) # -f BAM: Input file format (assuming BAM, could be BAMPE for paired-end) # -g: Genome size (e.g., hs for human, mm for mouse) # -q 0.01: Q-value cutoff for peak detection (significance cut-off q-value <=0.01) # -n: Name string for output files macs2 callpeak -t "${TREATMENT_BAM}" -c "${CONTROL_BAM}" -f BAM -g "${GENOME_SIZE}" -q 0.01 -n "${OUTPUT_PREFIX}" # Generate fold-enrichment bedGraph file using bdgcmp # -t: Treatment pileup bedGraph file (generated by callpeak) # -c: Control lambda bedGraph file (generated by callpeak) # -m FE: Mode for comparison (Fold Enrichment) # -o: Output bedGraph file macs2 bdgcmp -t "${OUTPUT_PREFIX}_treat_pileup.bdg" -c "${OUTPUT_PREFIX}_control_lambda.bdg" -m FE -o "${OUTPUT_PREFIX}_FE.bdg" -
10
bigWig files were generated using bdg2bw from the bedgraph output from last step.
$ Bash example
# Install UCSC tools (contains bedGraphToBigWig, which bdg2bw typically wraps) # conda install -c bioconda ucsc-bedgraphtobigwig # Download chromosome sizes file for hg38 (example reference genome) # wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes -O hg38.chrom.sizes # Define input and output files input_bedgraph="input.bedgraph" # Replace with actual bedgraph file from previous step chrom_sizes_file="hg38.chrom.sizes" # Replace with actual chrom.sizes file output_bigwig="output.bigWig" # Desired output bigWig file # Generate bigWig file using bdg2bw (assumed to be a wrapper for bedGraphToBigWig) bdg2bw "${input_bedgraph}" "${chrom_sizes_file}" "${output_bigwig}" -
11
Score represents the FE values at a given genomic coordinate. narrowPeak files were generated using MACS v2 with default settings.
MACS2 v2$ Bash example
# Install MACS2 (if not already installed) # conda install -c bioconda macs2 # Define input files and output prefix TREATMENT_BAM="treatment.bam" # Placeholder for treatment BAM file CONTROL_BAM="control.bam" # Placeholder for control BAM file OUTPUT_PREFIX="my_experiment" # Prefix for output files OUTPUT_DIR="macs2_output" GENOME_SIZE="hs" # Example: 'hs' for human, 'mm' for mouse, or a specific number like '2.7e9' mkdir -p "${OUTPUT_DIR}" # Run MACS2 callpeak with default settings # The description mentions "default settings" and "narrowPeak files". # MACS2's default behavior for callpeak is to generate narrowPeak files. # The score in narrowPeak files (column 5) is typically -log10(pvalue), # while fold enrichment (FE) is in column 7. The description "Score represents the FE values" # might be referring to the FE column or a derived score. # We'll use standard callpeak command with common defaults. macs2 callpeak \ -t "${TREATMENT_BAM}" \ -c "${CONTROL_BAM}" \ -f BAM \ -g "${GENOME_SIZE}" \ -n "${OUTPUT_PREFIX}" \ --outdir "${OUTPUT_DIR}" \ --qvalue 0.05 # Common default q-value threshold for peak calling
Tools Used
Raw Source Text
Basecalls were performed using bcl2fastq v2.17 for Novaseq output. Reads were trimmed from 3' end until the final base had a quality score > 30, using Trimmomatic v0.38, discarding reads left with < 25 bp Reads were aligned to the mouse genome (mm10) using Bowtie version 1.1.2. Parameters: --local --very-sensitive-local --no-unal --no-mixed --no-discordant --phred33 -I 10 -X 700 Reads were aligned to the yeast genome (sacCer3) using Bowtie version 1.1.2. Parameters: --local --very-sensitive-local --no-overlap --no-dovetail --no-unal --no-mixed --no-discordant --phred33 -I 10 -X 70 Spike-in normalization. The genomic coverage were then normalized by applying the scaling factor that is calculated from the number of mapped reads to mouse genome and yeast genome Peaks were called uing MACS v2.1.0 with the significance cut-off q-value <=0.01, using Callpeak, and compared the sample to corresponding IgG samples using bdgcmp in FE (fold-enrichment) mode. bigWig files were generated using bdg2bw from the bedgraph output from last step. Score represents the FE values at a given genomic coordinate. narrowPeak files were generated using MACS v2 with default settings. Assembly: mm10 Supplementary files format and content: bigWig, narrowPeak (except for IgG sample)