GSE271650 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Autism-associated CHD8 controls reactive gliosis and neuroinflammation via remodeling chromatin in astrocytes.

Cell reports (2024) — PMID 39154337

Dataset

Autism-associated CHD8 controls reactive gliosis and neuroinflammation via remodeling chromatin in astrocytes [bulk RNA-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Adapter sequences, low base call quality (Phred score <30), and reads with fewer than 5 bases, were trimmed from all FASTQ files using Cutadapt (v3.5)

cutadapt v3.5 GitHub

$ Bash example

# Install Cutadapt (if not already installed)
# conda install -c bioconda cutadapt=3.5

# Define input and output file names
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_R1_TRIMMED="output_R1_trimmed.fastq.gz"
OUTPUT_R2_TRIMMED="output_R2_trimmed.fastq.gz"

# Define common Illumina adapter sequences (adjust if specific adapters are known)
# For Illumina Universal Adapter (R1) and Illumina Reverse Complement Adapter (R2)
ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Run Cutadapt for paired-end reads
cutadapt \
  -a "$ADAPTER_R1" \
  -A "$ADAPTER_R2" \
  -q 30,30 \
  -m 5 \
  -o "$OUTPUT_R1_TRIMMED" \
  -p "$OUTPUT_R2_TRIMMED" \
  "$INPUT_R1" \
  "$INPUT_R2"

View on GitHub

STAR (Dobin et al., Bioinformatics, 2012, DOI: 10.1093/bioinformatics/bts635) aligner was used to align trimmed sequences to the mouse reference genome (GRCm39) with gene annotations from mouse Gencode (vM28)

STAR v2.3.0.1

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables for paths and files
GENOME_DIR="star_genome_index_GRCm39_Gencode_vM28"
GENOME_FASTA="GRCm39.primary_assembly.genome.fa" # Placeholder for GRCm39 FASTA file
GTF_FILE="gencode.vM28.annotation.gtf" # Placeholder for Gencode vM28 GTF file
READ_1="trimmed_R1.fastq.gz" # Placeholder for trimmed R1 (forward) reads
READ_2="trimmed_R2.fastq.gz" # Placeholder for trimmed R2 (reverse) reads
OUTPUT_PREFIX="aligned_reads"
NUM_THREADS=8 # Adjust as needed for your system

# --- Download reference files (example placeholders) ---
# These commands are illustrative. You would typically download these from Ensembl/GENCODE FTP.
# mkdir -p references
# cd references
# wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M28/GRCm39.primary_assembly.genome.fa.gz
# wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M28/gencode.vM28.annotation.gtf.gz
# gunzip GRCm39.primary_assembly.genome.fa.gz
# gunzip gencode.vM28.annotation.gtf.gz
# cd ..

# --- Step 1: Generate STAR genome index (run once) ---
# This step uses the mouse reference genome (GRCm39) and Gencode vM28 annotations
# to build a STAR index. The sjdbOverhang parameter should ideally be (read_length - 1).
# A value of 100 is common for typical RNA-seq read lengths.
mkdir -p "${GENOME_DIR}"
STAR --runMode genomeGenerate \
     --genomeDir "${GENOME_DIR}" \
     --genomeFastaFiles "${GENOME_FASTA}" \
     --sjdbGTFfile "${GTF_FILE}" \
     --sjdbOverhang 100 \
     --runThreadN "${NUM_THREADS}"

# --- Step 2: Align trimmed sequences ---
# Align trimmed sequences to the mouse reference genome (GRCm39) using the generated index.
# Outputs a sorted BAM file and gene counts (ReadsPerGene.out.tab).
STAR --runThreadN "${NUM_THREADS}" \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ_1}" "${READ_2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 10 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --quantMode GeneCounts

Gene count quantifications were analyzed using RSEM (RNA-Seq by Expectation Maximization; v.1.3.1)

RSEM v1.3.1 GitHub

$ Bash example

# Install RSEM (example using conda)
# conda install -c bioconda rsem

# Example: Prepare RSEM reference (run once per reference genome/transcriptome)
# This step requires a FASTA file of the genome (e.g., hg38.fa) and a GTF/GFF annotation file (e.g., gencode.v44.annotation.gtf).
# rsem-prepare-reference --gtf gencode.v44.annotation.gtf hg38.fa hg38_rsem_reference

# Perform gene count quantification using RSEM
# Assuming paired-end reads and a pre-built RSEM reference index (e.g., hg38_rsem_reference)
# Replace 'read1.fastq.gz', 'read2.fastq.gz', 'hg38_rsem_reference', and 'sample_output_prefix' with actual file paths and desired output prefix.
rsem-calculate-expression \
    --paired-end \
    --num-threads 8 \
    read1.fastq.gz \
    read2.fastq.gz \
    hg38_rsem_reference \
    sample_output_prefix

View on GitHub

Expected gene count quantifications from RSEM analysis were imported into R for differential gene analysis using edgeR (v.3.30.3)

RSEM v1.3.3 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install RSEM (if not already installed)
# conda install -c bioconda rsem

# --- RSEM Reference Preparation (run once per reference genome) ---
# Replace 'genome.fa' with your reference genome FASTA file
# Replace 'genes.gtf' with your gene annotation GTF file
# Replace 'rsem_ref' with your desired reference name/path
# rsem-prepare-reference --gtf genes.gtf genome.fa rsem_ref

# --- RSEM Quantification ---
# This command calculates gene and isoform expression levels from RNA-Seq reads.
# Replace 'read1.fq.gz' and 'read2.fq.gz' with your actual gzipped paired-end FASTQ files.
# Replace 'rsem_ref' with the path to your prepared RSEM reference.
# Replace 'sample_output' with your desired output file prefix.
# The output will include 'sample_output.genes.results' and 'sample_output.isoforms.results'.
rsem-calculate-expression \
    --paired-end \
    --gzip-compressed \
    --num-threads 8 \
    read1.fq.gz read2.fq.gz \
    rsem_ref \
    sample_output

View on GitHub

Functional pathway analysis was performed using Metascape (Zhou et al., Nature Communications, 2019, DOI: 10.1038/s41467-019-09234-6), gProfiler (Reimand et al., R package version 0.67, 2018; https://cran.r-project.org/web/packages/gProfileR/gProfileR.pdf), and ClusterProfiler

clusterProfiler v3.9.2 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R and Bioconductor if not already present.
# For example, using conda:
# conda create -n r_env r-base bioconductor-clusterprofiler r-org.hs.eg.db -y
# conda activate r_env

# Create an R script for functional pathway analysis using clusterProfiler
cat << 'EOF' > run_clusterprofiler.R
# Load necessary packages
library(clusterProfiler)
library(org.Hs.eg.db) # For human gene annotations (Homo sapiens)

# Example gene list (replace with your actual gene list of Entrez IDs)
# In a real scenario, this would typically come from differential expression analysis.
# For demonstration, let's use a small set of placeholder Entrez IDs.
gene_list <- c("10000", "10001", "10002", "10003", "10004", "10005", "10006", "10007", "10008", "10009")

# If your gene list is in a different format (e.g., gene symbols), you would convert it:
# gene_symbols <- c("TP53", "MYC", "EGFR")
# eg <- bitr(gene_symbols, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
# gene_list <- eg$ENTREZID

# Perform GO enrichment analysis (Biological Process, Molecular Function, or Cellular Component)
go_enrichment_bp <- enrichGO(gene          = gene_list,
                             OrgDb         = org.Hs.eg.db,
                             ont           = "BP", # Biological Process
                             pAdjustMethod = "BH",
                             pvalueCutoff  = 0.05,
                             qvalueCutoff  = 0.05,
                             readable      = TRUE)

go_enrichment_mf <- enrichGO(gene          = gene_list,
                             OrgDb         = org.Hs.eg.db,
                             ont           = "MF", # Molecular Function
                             pAdjustMethod = "BH",
                             pvalueCutoff  = 0.05,
                             qvalueCutoff  = 0.05,
                             readable      = TRUE)

# Perform KEGG enrichment analysis
kegg_enrichment <- enrichKEGG(gene         = gene_list,
                              organism     = 'hsa', # 'hsa' for Homo sapiens
                              pvalueCutoff = 0.05)

# Save results to CSV files
if (!is.null(go_enrichment_bp)) {
  write.csv(as.data.frame(go_enrichment_bp), "go_enrichment_bp_results.csv", row.names = FALSE)
}

if (!is.null(go_enrichment_mf)) {
  write.csv(as.data.frame(go_enrichment_mf), "go_enrichment_mf_results.csv", row.names = FALSE)
}

if (!is.null(kegg_enrichment)) {
  write.csv(as.data.frame(kegg_enrichment), "kegg_enrichment_results.csv", row.names = FALSE)
}

message("Functional pathway analysis completed. Results saved to CSV files.")
EOF

# Execute the R script
Rscript run_clusterprofiler.R

View on GitHub

Tools Used

STAR

Raw Source Text

Adapter sequences, low base call quality (Phred score <30), and reads with fewer than 5 bases, were trimmed from all FASTQ files using Cutadapt (v3.5)
STAR (Dobin et al., Bioinformatics, 2012, DOI: 10.1093/bioinformatics/bts635) aligner was used to align trimmed sequences to the mouse reference genome (GRCm39) with gene annotations from mouse Gencode (vM28)
Gene count quantifications were analyzed using RSEM (RNA-Seq by Expectation Maximization; v.1.3.1)
Expected gene count quantifications from RSEM analysis were imported into R for differential gene analysis using edgeR (v.3.30.3)
Functional pathway analysis was performed using Metascape (Zhou et al., Nature Communications, 2019, DOI: 10.1038/s41467-019-09234-6), gProfiler (Reimand et al., R package version 0.67, 2018; https://cran.r-project.org/web/packages/gProfileR/gProfileR.pdf), and ClusterProfiler
Assembly: GRCm39
Supplementary files format and content: EdgeR outputed differential gene expression files

← Back to Analysis