GSE271650 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Autism-associated CHD8 controls reactive gliosis and neuroinflammation via remodeling chromatin in astrocytes.

Cell reports (2024) — PMID 39154337

Dataset

GSE271650

Autism-associated CHD8 controls reactive gliosis and neuroinflammation via remodeling chromatin in astrocytes [bulk RNA-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Adapter sequences, low base call quality (Phred score <30), and reads with fewer than 5 bases, were trimmed from all FASTQ files using Cutadapt (v3.5)

    cutadapt v3.5 GitHub
    $ Bash example
    # Install Cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=3.5
    
    # Define input and output file names
    INPUT_R1="input_R1.fastq.gz"
    INPUT_R2="input_R2.fastq.gz"
    OUTPUT_R1_TRIMMED="output_R1_trimmed.fastq.gz"
    OUTPUT_R2_TRIMMED="output_R2_trimmed.fastq.gz"
    
    # Define common Illumina adapter sequences (adjust if specific adapters are known)
    # For Illumina Universal Adapter (R1) and Illumina Reverse Complement Adapter (R2)
    ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    
    # Run Cutadapt for paired-end reads
    cutadapt \
      -a "$ADAPTER_R1" \
      -A "$ADAPTER_R2" \
      -q 30,30 \
      -m 5 \
      -o "$OUTPUT_R1_TRIMMED" \
      -p "$OUTPUT_R2_TRIMMED" \
      "$INPUT_R1" \
      "$INPUT_R2"
  2. 2

    STAR (Dobin et al., Bioinformatics, 2012, DOI: 10.1093/bioinformatics/bts635) aligner was used to align trimmed sequences to the mouse reference genome (GRCm39) with gene annotations from mouse Gencode (vM28)

    STAR v2.3.0.1
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables for paths and files
    GENOME_DIR="star_genome_index_GRCm39_Gencode_vM28"
    GENOME_FASTA="GRCm39.primary_assembly.genome.fa" # Placeholder for GRCm39 FASTA file
    GTF_FILE="gencode.vM28.annotation.gtf" # Placeholder for Gencode vM28 GTF file
    READ_1="trimmed_R1.fastq.gz" # Placeholder for trimmed R1 (forward) reads
    READ_2="trimmed_R2.fastq.gz" # Placeholder for trimmed R2 (reverse) reads
    OUTPUT_PREFIX="aligned_reads"
    NUM_THREADS=8 # Adjust as needed for your system
    
    # --- Download reference files (example placeholders) ---
    # These commands are illustrative. You would typically download these from Ensembl/GENCODE FTP.
    # mkdir -p references
    # cd references
    # wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M28/GRCm39.primary_assembly.genome.fa.gz
    # wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M28/gencode.vM28.annotation.gtf.gz
    # gunzip GRCm39.primary_assembly.genome.fa.gz
    # gunzip gencode.vM28.annotation.gtf.gz
    # cd ..
    
    # --- Step 1: Generate STAR genome index (run once) ---
    # This step uses the mouse reference genome (GRCm39) and Gencode vM28 annotations
    # to build a STAR index. The sjdbOverhang parameter should ideally be (read_length - 1).
    # A value of 100 is common for typical RNA-seq read lengths.
    mkdir -p "${GENOME_DIR}"
    STAR --runMode genomeGenerate \
         --genomeDir "${GENOME_DIR}" \
         --genomeFastaFiles "${GENOME_FASTA}" \
         --sjdbGTFfile "${GTF_FILE}" \
         --sjdbOverhang 100 \
         --runThreadN "${NUM_THREADS}"
    
    # --- Step 2: Align trimmed sequences ---
    # Align trimmed sequences to the mouse reference genome (GRCm39) using the generated index.
    # Outputs a sorted BAM file and gene counts (ReadsPerGene.out.tab).
    STAR --runThreadN "${NUM_THREADS}" \
         --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ_1}" "${READ_2}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 10 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --quantMode GeneCounts
  3. 3

    Gene count quantifications were analyzed using RSEM (RNA-Seq by Expectation Maximization; v.1.3.1)

    RSEM v1.3.1 GitHub
    $ Bash example
    # Install RSEM (example using conda)
    # conda install -c bioconda rsem
    
    # Example: Prepare RSEM reference (run once per reference genome/transcriptome)
    # This step requires a FASTA file of the genome (e.g., hg38.fa) and a GTF/GFF annotation file (e.g., gencode.v44.annotation.gtf).
    # rsem-prepare-reference --gtf gencode.v44.annotation.gtf hg38.fa hg38_rsem_reference
    
    # Perform gene count quantification using RSEM
    # Assuming paired-end reads and a pre-built RSEM reference index (e.g., hg38_rsem_reference)
    # Replace 'read1.fastq.gz', 'read2.fastq.gz', 'hg38_rsem_reference', and 'sample_output_prefix' with actual file paths and desired output prefix.
    rsem-calculate-expression \
        --paired-end \
        --num-threads 8 \
        read1.fastq.gz \
        read2.fastq.gz \
        hg38_rsem_reference \
        sample_output_prefix
  4. 4

    Expected gene count quantifications from RSEM analysis were imported into R for differential gene analysis using edgeR (v.3.30.3)

    RSEM v1.3.3 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install RSEM (if not already installed)
    # conda install -c bioconda rsem
    
    # --- RSEM Reference Preparation (run once per reference genome) ---
    # Replace 'genome.fa' with your reference genome FASTA file
    # Replace 'genes.gtf' with your gene annotation GTF file
    # Replace 'rsem_ref' with your desired reference name/path
    # rsem-prepare-reference --gtf genes.gtf genome.fa rsem_ref
    
    # --- RSEM Quantification ---
    # This command calculates gene and isoform expression levels from RNA-Seq reads.
    # Replace 'read1.fq.gz' and 'read2.fq.gz' with your actual gzipped paired-end FASTQ files.
    # Replace 'rsem_ref' with the path to your prepared RSEM reference.
    # Replace 'sample_output' with your desired output file prefix.
    # The output will include 'sample_output.genes.results' and 'sample_output.isoforms.results'.
    rsem-calculate-expression \
        --paired-end \
        --gzip-compressed \
        --num-threads 8 \
        read1.fq.gz read2.fq.gz \
        rsem_ref \
        sample_output
  5. 5

    Functional pathway analysis was performed using Metascape (Zhou et al., Nature Communications, 2019, DOI: 10.1038/s41467-019-09234-6), gProfiler (Reimand et al., R package version 0.67, 2018; https://cran.r-project.org/web/packages/gProfileR/gProfileR.pdf), and ClusterProfiler

    clusterProfiler v3.9.2 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install R and Bioconductor if not already present.
    # For example, using conda:
    # conda create -n r_env r-base bioconductor-clusterprofiler r-org.hs.eg.db -y
    # conda activate r_env
    
    # Create an R script for functional pathway analysis using clusterProfiler
    cat << 'EOF' > run_clusterprofiler.R
    # Load necessary packages
    library(clusterProfiler)
    library(org.Hs.eg.db) # For human gene annotations (Homo sapiens)
    
    # Example gene list (replace with your actual gene list of Entrez IDs)
    # In a real scenario, this would typically come from differential expression analysis.
    # For demonstration, let's use a small set of placeholder Entrez IDs.
    gene_list <- c("10000", "10001", "10002", "10003", "10004", "10005", "10006", "10007", "10008", "10009")
    
    # If your gene list is in a different format (e.g., gene symbols), you would convert it:
    # gene_symbols <- c("TP53", "MYC", "EGFR")
    # eg <- bitr(gene_symbols, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
    # gene_list <- eg$ENTREZID
    
    # Perform GO enrichment analysis (Biological Process, Molecular Function, or Cellular Component)
    go_enrichment_bp <- enrichGO(gene          = gene_list,
                                 OrgDb         = org.Hs.eg.db,
                                 ont           = "BP", # Biological Process
                                 pAdjustMethod = "BH",
                                 pvalueCutoff  = 0.05,
                                 qvalueCutoff  = 0.05,
                                 readable      = TRUE)
    
    go_enrichment_mf <- enrichGO(gene          = gene_list,
                                 OrgDb         = org.Hs.eg.db,
                                 ont           = "MF", # Molecular Function
                                 pAdjustMethod = "BH",
                                 pvalueCutoff  = 0.05,
                                 qvalueCutoff  = 0.05,
                                 readable      = TRUE)
    
    # Perform KEGG enrichment analysis
    kegg_enrichment <- enrichKEGG(gene         = gene_list,
                                  organism     = 'hsa', # 'hsa' for Homo sapiens
                                  pvalueCutoff = 0.05)
    
    # Save results to CSV files
    if (!is.null(go_enrichment_bp)) {
      write.csv(as.data.frame(go_enrichment_bp), "go_enrichment_bp_results.csv", row.names = FALSE)
    }
    
    if (!is.null(go_enrichment_mf)) {
      write.csv(as.data.frame(go_enrichment_mf), "go_enrichment_mf_results.csv", row.names = FALSE)
    }
    
    if (!is.null(kegg_enrichment)) {
      write.csv(as.data.frame(kegg_enrichment), "kegg_enrichment_results.csv", row.names = FALSE)
    }
    
    message("Functional pathway analysis completed. Results saved to CSV files.")
    EOF
    
    # Execute the R script
    Rscript run_clusterprofiler.R

Tools Used

Raw Source Text
Adapter sequences, low base call quality (Phred score <30), and reads with fewer than 5 bases, were trimmed from all FASTQ files using Cutadapt (v3.5)
STAR (Dobin et al., Bioinformatics, 2012, DOI: 10.1093/bioinformatics/bts635) aligner was used to align trimmed sequences to the mouse reference genome (GRCm39) with gene annotations from mouse Gencode (vM28)
Gene count quantifications were analyzed using RSEM (RNA-Seq by Expectation Maximization; v.1.3.1)
Expected gene count quantifications from RSEM analysis were imported into R for differential gene analysis using edgeR (v.3.30.3)
Functional pathway analysis was performed using Metascape (Zhou et al., Nature Communications, 2019, DOI: 10.1038/s41467-019-09234-6), gProfiler (Reimand et al., R package version 0.67, 2018; https://cran.r-project.org/web/packages/gProfileR/gProfileR.pdf), and ClusterProfiler
Assembly: GRCm39
Supplementary files format and content: EdgeR outputed differential gene expression files
← Back to Analysis