GSE271651 Processing Pipeline

RNA-Seq code_examples 8 steps

Publication

Autism-associated CHD8 controls reactive gliosis and neuroinflammation via remodeling chromatin in astrocytes.

Cell reports (2024) — PMID 39154337

Dataset

Autism-associated CHD8 controls reactive gliosis and neuroinflammation via remodeling chromatin in astrocytes [Microglia RNA-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Raw reads were trimmed using Trimmomatic.

Trimmomatic v0.39 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Trimmomatic (if not already installed)
# conda install -c bioconda trimmomatic

# Define input and output file paths (placeholders)
READ1_IN="raw_reads/sample_R1.fastq.gz"
READ2_IN="raw_reads/sample_R2.fastq.gz"

READ1_PAIRED_OUT="trimmed_reads/sample_R1_paired.fastq.gz"
READ1_UNPAIRED_OUT="trimmed_reads/sample_R1_unpaired.fastq.gz"
READ2_PAIRED_OUT="trimmed_reads/sample_R2_paired.fastq.gz"
READ2_UNPAIRED_OUT="trimmed_reads/sample_R2_unpaired.fastq.gz"

# Define adapter file path (adjust if Trimmomatic is not in PATH or adapters are elsewhere)
# Trimmomatic's adapter files are typically found in its installation directory, e.g.,
# TRIMMOMATIC_ADAPTERS="$(dirname $(which trimmomatic))/../share/trimmomatic-0.39-0/adapters"
# For a more robust path, you might need to locate it based on your installation.
# For this example, we assume 'adapters' directory is accessible or Trimmomatic finds it.
ADAPTER_FILE="TruSeq3-PE.fa"

# Create output directory if it doesn't exist
mkdir -p trimmed_reads

# Execute Trimmomatic for paired-end trimming
# Parameters explained:
# PE: Paired-End mode
# -phred33: Quality score encoding
# ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10: Adapter trimming. Path to adapter file, seed mismatches, palindrome clip threshold, simple clip threshold.
# LEADING:3: Remove bases from the start of the read if their quality is below 3.
# TRAILING:3: Remove bases from the end of the read if their quality is below 3.
# SLIDINGWINDOW:4:15: Scan the read with a 4-base wide window, cutting when the average quality per base drops below 15.
# MINLEN:36: Drop reads shorter than 36 bases after trimming.
java -jar $(which trimmomatic) PE -phred33 \
    "${READ1_IN}" "${READ2_IN}" \
    "${READ1_PAIRED_OUT}" "${READ1_UNPAIRED_OUT}" \
    "${READ2_PAIRED_OUT}" "${READ2_UNPAIRED_OUT}" \
    ILLUMINACLIP:"${ADAPTER_FILE}":2:30:10:8:TRUE \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

View on GitHub

FastQC (Babraham Bioinformatics) was used before and after trimming to assess quality control.

FastQC v0.11.9 GitHub

$ Bash example

# Install FastQC using Bioconda
# conda install -c bioconda fastqc

# Create output directories for FastQC reports
mkdir -p fastqc_raw_output
mkdir -p fastqc_trimmed_output

# Run FastQC on raw sequencing reads (example with paired-end reads)
# Replace 'raw_reads_R1.fastq.gz' and 'raw_reads_R2.fastq.gz' with your actual raw input file paths
fastqc -o fastqc_raw_output raw_reads_R1.fastq.gz raw_reads_R2.fastq.gz

# Run FastQC on trimmed sequencing reads (example with paired-end reads)
# Replace 'trimmed_reads_R1.fastq.gz' and 'trimmed_reads_R2.fastq.gz' with your actual trimmed input file paths
fastqc -o fastqc_trimmed_output trimmed_reads_R1.fastq.gz trimmed_reads_R2.fastq.gz

View on GitHub

Reads were aligned to the murine reference genome (mm10) using the STAR aligner tool.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR
# conda install -c bioconda star

# Create a directory for reference files
mkdir -p star_index_mm10
cd star_index_mm10

# Download murine reference genome (mm10 / GRCm38) FASTA and GTF from Ensembl release 105
wget http://ftp.ensembl.org/pub/release-105/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
wget http://ftp.ensembl.org/pub/release-105/gtf/mus_musculus/Mus_musculus.GRCm38.105.gtf.gz

# Unzip the files
gunzip Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
gunzip Mus_musculus.GRCm38.105.gtf.gz

# Generate STAR genome index
# sjdbOverhang is typically (read_length - 1). A value of 100 is common for 101bp reads.
STAR --runMode genomeGenerate \
     --genomeDir . \
     --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly.fa \
     --sjdbGTFfile Mus_musculus.GRCm38.105.gtf \
     --sjdbOverhang 100 \
     --runThreadN 8

# Navigate back to the working directory
cd ..

# Example alignment command
# Replace reads_R1.fastq.gz and reads_R2.fastq.gz with actual input file names.
# Adjust --runThreadN based on available CPU cores.
# Output files will be prefixed with 'aligned_reads_'.
STAR --genomeDir star_index_mm10 \
     --readFilesIn reads_R1.fastq.gz reads_R2.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix aligned_reads_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 10 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --runThreadN 8

View on GitHub

MultiQC (v1.12) was used to assess the quality of alignment.

MultiQC v1.12 GitHub

$ Bash example

# Install MultiQC (if not already installed)
# conda install -c bioconda multiqc

# Assuming MultiQC is run in a directory containing various QC reports (e.g., FastQC, Picard, STAR, etc.)
# MultiQC will automatically find and aggregate these reports.
multiqc . -o multiqc_report

View on GitHub

All aligned samples showed more than 89% uniquely mapped reads.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz" # If paired-end, otherwise set to empty or omit
GENOME_DIR="/path/to/STAR_genome_index/GRCh38" # Placeholder for human GRCh38 genome index
OUTPUT_DIR="./star_output"
SAMPLE_NAME="sample_name"
MIN_UNIQUE_MAPPING_PERCENTAGE=89

# Create output directory
mkdir -p "${OUTPUT_DIR}"

# 1. Perform alignment using STAR
# This command assumes paired-end reads. Adjust for single-end by omitting --readFilesIn "${READ2}".
# Reference genome index for GRCh38 should be pre-built using STAR --runMode genomeGenerate.
STAR \
  --runThreadN 8 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_NAME}_" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes Standard \
  --quantMode GeneCounts

# 2. Extract uniquely mapped reads percentage from STAR's Log.final.out
LOG_FILE="${OUTPUT_DIR}/${SAMPLE_NAME}_Log.final.out"

# Check if the log file exists
if [ ! -f "$LOG_FILE" ]; then
  echo "Error: STAR log file not found at $LOG_FILE"
  exit 1
fi

# Extract the percentage of uniquely mapped reads
UNIQUE_MAPPED_PERCENTAGE=$(grep "Uniquely mapped reads %" "$LOG_FILE" | awk '{print $NF}' | sed 's/%//')

echo "${SAMPLE_NAME}: Uniquely mapped reads percentage: ${UNIQUE_MAPPED_PERCENTAGE}%"

# 3. Check if the percentage meets the threshold (as described in the step)
if (( $(echo "$UNIQUE_MAPPED_PERCENTAGE >= $MIN_UNIQUE_MAPPING_PERCENTAGE" | bc -l) )); then
  echo "${SAMPLE_NAME}: Uniquely mapped reads percentage (${UNIQUE_MAPPED_PERCENTAGE}%) meets the threshold of ${MIN_UNIQUE_MAPPING_PERCENTAGE}%."
else
  echo "${SAMPLE_NAME}: WARNING: Uniquely mapped reads percentage (${UNIQUE_MAPPED_PERCENTAGE}%) is below the threshold of ${MIN_UNIQUE_MAPPING_PERCENTAGE}%."
  # Optionally, you might want to exit or flag the sample for review
  # exit 1
fi

View on GitHub

reads were counted using the FeatureCounts package

featureCounts v2.0.3 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Subread package (which includes featureCounts)
# conda install -c bioconda subread

# Example command for counting reads using featureCounts
# Replace 'input.bam' with your input alignment file(s).
# Replace 'genes.gtf' with your genomic annotation file (GTF/GFF).
# Replace 'counts.txt' with your desired output file name.
# -a: Specify annotation file.
# -o: Specify output file.
# -F GTF: Specify annotation file format (GTF or GFF).
# -t exon: Specify feature type in the annotation file to count (e.g., 'exon' for genes).
# -g gene_id: Specify attribute type in the annotation file to use as feature identifiers (e.g., 'gene_id').
# -s 0: Specify strandedness (0: unstranded, 1: stranded, 2: reverse stranded). Adjust based on library prep.
# -T 8: Use 8 threads for faster processing (adjust based on available CPU cores).
featureCounts -a genes.gtf \
              -o counts.txt \
              -F GTF \
              -t exon \
              -g gene_id \
              -s 0 \
              -T 8 \
              input.bam

Expected gene count quantifications from RSEM analysis were imported into R for differential gene analysis using edgeR (v.3.30.3)

RSEM v1.3.3 GitHub

$ Bash example

# Install RSEM
# conda install -c bioconda rsem

# Placeholder for reference genome and annotation
# Replace with actual paths to your reference genome (FASTA) and gene annotation (GTF)
GENOME_FASTA="path/to/human_genome.fasta" # e.g., GRCh38.p13.genome.fa
GENOME_GTF="path/to/human_genes.gtf"     # e.g., gencode.v38.annotation.gtf
RSEM_REFERENCE_NAME="human_rsem_reference"

# Prepare RSEM reference (run once, if not already done)
# This step builds the RSEM index from the genome and GTF file.
# rsem-prepare-reference --gtf "${GENOME_GTF}" "${GENOME_FASTA}" "${RSEM_REFERENCE_NAME}"

# Placeholder for input RNA-Seq reads (FASTQ files)
# Replace with actual paths to your input FASTQ files
READS_R1="path/to/sample_R1.fastq.gz"
READS_R2="path/to/sample_R2.fastq.gz"
OUTPUT_PREFIX="sample_output"

# Run RSEM to calculate gene and isoform expression
# This command assumes paired-end reads and outputs expected counts.
# The --star option integrates STAR aligner for read alignment, which is a common workflow.
# --gzip-compressed indicates input FASTQ files are gzipped.
rsem-calculate-expression \
    --paired-end \
    --star \
    --gzip-compressed \
    "${READS_R1}" \
    "${READS_R2}" \
    "${RSEM_REFERENCE_NAME}" \
    "${OUTPUT_PREFIX}"

# The primary output for gene count quantifications will be in:
# ${OUTPUT_PREFIX}.genes.results (contains gene-level expected counts, TPM, FPKM, etc.)
# Other outputs include:
# ${OUTPUT_PREFIX}.isoforms.results (isoform-level quantifications)
# ${OUTPUT_PREFIX}.stat (alignment statistics)

View on GitHub

The Goseq tool was used to acquire overrepressented gene categories

goseq v1.54.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install r-base

# Open R and install Bioconductor and goseq
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("goseq")'
# R -e 'BiocManager::install("org.Hs.eg.db")' # Example for human GO annotations

# Create a dummy R script for goseq analysis
cat << 'EOF' > goseq_analysis.R
library(goseq)
library(org.Hs.eg.db) # Example for human, required for GO term mapping if not using goseq's internal fetching

# --- Input files (placeholders) ---
# File containing differentially expressed genes (one gene ID per line)
deg_genes_file <- "deg_genes.txt"
# File containing all genes considered in the experiment (one gene ID per line)
all_genes_file <- "all_genes.txt"
# File containing gene lengths (e.g., Ensembl gene ID and length in bp, tab-separated)
# Example format:
# ENSG00000000003\t1234
# ENSG00000000005\t5678
gene_lengths_file <- "gene_lengths.txt"

# --- Output file ---
output_file <- "goseq_overrepresented_categories.tsv"

# --- Load gene lists ---
deg_genes <- readLines(deg_genes_file)
all_genes <- readLines(all_genes_file)

# --- Load gene lengths ---
gene_lengths_df <- read.delim(gene_lengths_file, header = FALSE, col.names = c("gene_id", "length"))
gene_lengths <- setNames(gene_lengths_df$length, gene_lengths_df$gene_id)

# --- Prepare gene vector for goseq ---
# Create a binary vector: 1 if gene is DEG, 0 otherwise
gene_vector <- as.integer(all_genes %in% deg_genes)
names(gene_vector) <- all_genes

# Filter gene_vector to only include genes for which we have length information
# and ensure order matches gene_lengths
common_genes <- intersect(names(gene_vector), names(gene_lengths))
gene_vector_filtered <- gene_vector[common_genes]
gene_lengths_filtered <- gene_lengths[common_genes]

# --- Build Probability Weighting Function (PWF) ---
# This corrects for gene length bias
pwf <- nullp(gene_vector_filtered, bias.data = gene_lengths_filtered, plot.fit = FALSE)

# --- Perform GO enrichment analysis ---
# Assuming Ensembl IDs for input genes and human genome (e.g., hg19 or GRCh38)
# goseq can fetch GO annotations for Ensembl IDs directly using 'genome' and 'id' arguments.
# For specific organisms, using the annotation package (e.g., org.Hs.eg.db) is also an option
# if you map your IDs to Entrez first.
GO.wall <- goseq(pwf, genome = "hg19", id = "ensembl") # Example for human Ensembl IDs

# --- Adjust p-values for multiple testing ---
GO.wall$over_represented_p_adj <- p.adjust(GO.wall$over_represented_pvalue, method = "BH")
GO.wall$under_represented_p_adj <- p.adjust(GO.wall$under_represented_pvalue, method = "BH")

# --- Filter and write results ---
# Filter for overrepresented categories with adjusted p-value < 0.05
overrepresented_results <- GO.wall[GO.wall$over_represented_p_adj < 0.05, ]
overrepresented_results <- overrepresented_results[order(overrepresented_results$over_represented_p_adj), ]

write.table(overrepresented_results, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE)

message(paste("Goseq analysis complete. Overrepresented categories written to:", output_file))
EOF

# Create dummy input files for demonstration
# Replace these with your actual input files
echo -e "ENSG00000100000\nENSG00000100001\nENSG00000100002" > deg_genes.txt
echo -e "ENSG00000100000\nENSG00000100001\nENSG00000100002\nENSG00000100003\nENSG00000100004" > all_genes.txt
echo -e "ENSG00000100000\t1500\nENSG00000100001\t2000\nENSG00000100002\t1000\nENSG00000100003\t3000\nENSG00000100004\t2500" > gene_lengths.txt

# Execute the R script
Rscript goseq_analysis.R

View on GitHub

Tools Used

STAR

Raw Source Text

Raw reads were trimmed using Trimmomatic. FastQC (Babraham Bioinformatics) was used before and after trimming to assess quality control.
Reads were aligned to the murine reference genome (mm10) using the STAR aligner tool. MultiQC (v1.12) was used to assess the quality of alignment. All aligned samples showed more than 89% uniquely mapped reads.
reads were counted using the FeatureCounts package
Expected gene count quantifications from RSEM analysis were imported into R for differential gene analysis using edgeR (v.3.30.3)
The Goseq tool was used to acquire overrepressented gene categories
Assembly: mm10
Supplementary files format and content: DESeq2 outputed differential gene expression files
Supplementary files format and content: DESeq2 outputed normalized counts

← Back to Analysis