GSE271650 Processing Pipeline
Publication
Autism-associated CHD8 controls reactive gliosis and neuroinflammation via remodeling chromatin in astrocytes.Cell reports (2024) — PMID 39154337
Dataset
GSE271650Autism-associated CHD8 controls reactive gliosis and neuroinflammation via remodeling chromatin in astrocytes [bulk RNA-seq]
Processing Steps
Generate Jupyter Notebook-
1
Adapter sequences, low base call quality (Phred score <30), and reads with fewer than 5 bases, were trimmed from all FASTQ files using Cutadapt (v3.5)
$ Bash example
# Install Cutadapt (if not already installed) # conda install -c bioconda cutadapt=3.5 # Define input and output file names INPUT_R1="input_R1.fastq.gz" INPUT_R2="input_R2.fastq.gz" OUTPUT_R1_TRIMMED="output_R1_trimmed.fastq.gz" OUTPUT_R2_TRIMMED="output_R2_trimmed.fastq.gz" # Define common Illumina adapter sequences (adjust if specific adapters are known) # For Illumina Universal Adapter (R1) and Illumina Reverse Complement Adapter (R2) ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Run Cutadapt for paired-end reads cutadapt \ -a "$ADAPTER_R1" \ -A "$ADAPTER_R2" \ -q 30,30 \ -m 5 \ -o "$OUTPUT_R1_TRIMMED" \ -p "$OUTPUT_R2_TRIMMED" \ "$INPUT_R1" \ "$INPUT_R2"
-
2
STAR (Dobin et al., Bioinformatics, 2012, DOI: 10.1093/bioinformatics/bts635) aligner was used to align trimmed sequences to the mouse reference genome (GRCm39) with gene annotations from mouse Gencode (vM28)
STAR v2.3.0.1$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables for paths and files GENOME_DIR="star_genome_index_GRCm39_Gencode_vM28" GENOME_FASTA="GRCm39.primary_assembly.genome.fa" # Placeholder for GRCm39 FASTA file GTF_FILE="gencode.vM28.annotation.gtf" # Placeholder for Gencode vM28 GTF file READ_1="trimmed_R1.fastq.gz" # Placeholder for trimmed R1 (forward) reads READ_2="trimmed_R2.fastq.gz" # Placeholder for trimmed R2 (reverse) reads OUTPUT_PREFIX="aligned_reads" NUM_THREADS=8 # Adjust as needed for your system # --- Download reference files (example placeholders) --- # These commands are illustrative. You would typically download these from Ensembl/GENCODE FTP. # mkdir -p references # cd references # wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M28/GRCm39.primary_assembly.genome.fa.gz # wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M28/gencode.vM28.annotation.gtf.gz # gunzip GRCm39.primary_assembly.genome.fa.gz # gunzip gencode.vM28.annotation.gtf.gz # cd .. # --- Step 1: Generate STAR genome index (run once) --- # This step uses the mouse reference genome (GRCm39) and Gencode vM28 annotations # to build a STAR index. The sjdbOverhang parameter should ideally be (read_length - 1). # A value of 100 is common for typical RNA-seq read lengths. mkdir -p "${GENOME_DIR}" STAR --runMode genomeGenerate \ --genomeDir "${GENOME_DIR}" \ --genomeFastaFiles "${GENOME_FASTA}" \ --sjdbGTFfile "${GTF_FILE}" \ --sjdbOverhang 100 \ --runThreadN "${NUM_THREADS}" # --- Step 2: Align trimmed sequences --- # Align trimmed sequences to the mouse reference genome (GRCm39) using the generated index. # Outputs a sorted BAM file and gene counts (ReadsPerGene.out.tab). STAR --runThreadN "${NUM_THREADS}" \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ_1}" "${READ_2}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 10 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --quantMode GeneCounts -
3
Gene count quantifications were analyzed using RSEM (RNA-Seq by Expectation Maximization; v.1.3.1)
$ Bash example
# Install RSEM (example using conda) # conda install -c bioconda rsem # Example: Prepare RSEM reference (run once per reference genome/transcriptome) # This step requires a FASTA file of the genome (e.g., hg38.fa) and a GTF/GFF annotation file (e.g., gencode.v44.annotation.gtf). # rsem-prepare-reference --gtf gencode.v44.annotation.gtf hg38.fa hg38_rsem_reference # Perform gene count quantification using RSEM # Assuming paired-end reads and a pre-built RSEM reference index (e.g., hg38_rsem_reference) # Replace 'read1.fastq.gz', 'read2.fastq.gz', 'hg38_rsem_reference', and 'sample_output_prefix' with actual file paths and desired output prefix. rsem-calculate-expression \ --paired-end \ --num-threads 8 \ read1.fastq.gz \ read2.fastq.gz \ hg38_rsem_reference \ sample_output_prefix -
4
Expected gene count quantifications from RSEM analysis were imported into R for differential gene analysis using edgeR (v.3.30.3)
$ Bash example
# Install RSEM (if not already installed) # conda install -c bioconda rsem # --- RSEM Reference Preparation (run once per reference genome) --- # Replace 'genome.fa' with your reference genome FASTA file # Replace 'genes.gtf' with your gene annotation GTF file # Replace 'rsem_ref' with your desired reference name/path # rsem-prepare-reference --gtf genes.gtf genome.fa rsem_ref # --- RSEM Quantification --- # This command calculates gene and isoform expression levels from RNA-Seq reads. # Replace 'read1.fq.gz' and 'read2.fq.gz' with your actual gzipped paired-end FASTQ files. # Replace 'rsem_ref' with the path to your prepared RSEM reference. # Replace 'sample_output' with your desired output file prefix. # The output will include 'sample_output.genes.results' and 'sample_output.isoforms.results'. rsem-calculate-expression \ --paired-end \ --gzip-compressed \ --num-threads 8 \ read1.fq.gz read2.fq.gz \ rsem_ref \ sample_output -
5
Functional pathway analysis was performed using Metascape (Zhou et al., Nature Communications, 2019, DOI: 10.1038/s41467-019-09234-6), gProfiler (Reimand et al., R package version 0.67, 2018; https://cran.r-project.org/web/packages/gProfileR/gProfileR.pdf), and ClusterProfiler
$ Bash example
# Install R and Bioconductor if not already present. # For example, using conda: # conda create -n r_env r-base bioconductor-clusterprofiler r-org.hs.eg.db -y # conda activate r_env # Create an R script for functional pathway analysis using clusterProfiler cat << 'EOF' > run_clusterprofiler.R # Load necessary packages library(clusterProfiler) library(org.Hs.eg.db) # For human gene annotations (Homo sapiens) # Example gene list (replace with your actual gene list of Entrez IDs) # In a real scenario, this would typically come from differential expression analysis. # For demonstration, let's use a small set of placeholder Entrez IDs. gene_list <- c("10000", "10001", "10002", "10003", "10004", "10005", "10006", "10007", "10008", "10009") # If your gene list is in a different format (e.g., gene symbols), you would convert it: # gene_symbols <- c("TP53", "MYC", "EGFR") # eg <- bitr(gene_symbols, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db") # gene_list <- eg$ENTREZID # Perform GO enrichment analysis (Biological Process, Molecular Function, or Cellular Component) go_enrichment_bp <- enrichGO(gene = gene_list, OrgDb = org.Hs.eg.db, ont = "BP", # Biological Process pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, readable = TRUE) go_enrichment_mf <- enrichGO(gene = gene_list, OrgDb = org.Hs.eg.db, ont = "MF", # Molecular Function pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, readable = TRUE) # Perform KEGG enrichment analysis kegg_enrichment <- enrichKEGG(gene = gene_list, organism = 'hsa', # 'hsa' for Homo sapiens pvalueCutoff = 0.05) # Save results to CSV files if (!is.null(go_enrichment_bp)) { write.csv(as.data.frame(go_enrichment_bp), "go_enrichment_bp_results.csv", row.names = FALSE) } if (!is.null(go_enrichment_mf)) { write.csv(as.data.frame(go_enrichment_mf), "go_enrichment_mf_results.csv", row.names = FALSE) } if (!is.null(kegg_enrichment)) { write.csv(as.data.frame(kegg_enrichment), "kegg_enrichment_results.csv", row.names = FALSE) } message("Functional pathway analysis completed. Results saved to CSV files.") EOF # Execute the R script Rscript run_clusterprofiler.R
Tools Used
Raw Source Text
Adapter sequences, low base call quality (Phred score <30), and reads with fewer than 5 bases, were trimmed from all FASTQ files using Cutadapt (v3.5) STAR (Dobin et al., Bioinformatics, 2012, DOI: 10.1093/bioinformatics/bts635) aligner was used to align trimmed sequences to the mouse reference genome (GRCm39) with gene annotations from mouse Gencode (vM28) Gene count quantifications were analyzed using RSEM (RNA-Seq by Expectation Maximization; v.1.3.1) Expected gene count quantifications from RSEM analysis were imported into R for differential gene analysis using edgeR (v.3.30.3) Functional pathway analysis was performed using Metascape (Zhou et al., Nature Communications, 2019, DOI: 10.1038/s41467-019-09234-6), gProfiler (Reimand et al., R package version 0.67, 2018; https://cran.r-project.org/web/packages/gProfileR/gProfileR.pdf), and ClusterProfiler Assembly: GRCm39 Supplementary files format and content: EdgeR outputed differential gene expression files