GSE146916 Processing Pipeline
RNA-Seq
code_examples
4 steps
Publication
A role for alternative splicing in circadian control of exocytosis and glucose homeostasis.Genes & development (2020) — PMID 32616519
Dataset
GSE146916A role for alternative splicing in circadian control of insulin secretion and glucose homeostasis
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Sequencing on Illumina NextSeq500 and basecalling by Real-Time Analysis (RTA) software version 2.4.11
Real-Time Analysis (RTA) software v2.4.11$ Bash example
# This step describes the physical sequencing process on an Illumina NextSeq500 instrument. # The Real-Time Analysis (RTA) software, version 2.4.11, is proprietary Illumina software # integrated into the instrument's control system (e.g., NextSeq Control Software). # RTA performs image analysis, base calling, and quality scoring in real-time during the sequencing run. # There is no direct user-executable bash command for this process; it is managed by the instrument's onboard software. # The output of this step typically includes BCL (base call) files, which are then demultiplexed and converted to FASTQ files # in subsequent bioinformatics steps.
-
2
Alignment to the mm10 genome using STAR version 2.5.0
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.5.0 # Define variables # Replace with actual paths to your mm10 genome index and input FASTQ files GENOME_DIR="/path/to/STAR_mm10_index" # Example: /data/genomes/mm10/STAR_index READ1="sample_R1.fastq.gz" READ2="sample_R2.fastq.gz" # Remove if single-end (e.g., for eCLIP, often single-end) OUTPUT_PREFIX="sample_aligned_" THREADS=8 # Adjust based on available CPU cores # --- Genome Index Generation (run once per genome) --- # This step requires the mm10 FASTA and GTF files. # Example sources: UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/) # Ensembl (ftp://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/) # Ensembl (ftp://ftp.ensembl.org/pub/release-109/gtf/mus_musculus/) # # Example command to generate index: # STAR --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/mm10.fa \ # --sjdbGTFfile /path/to/mm10.gtf \ # --runThreadN ${THREADS} \ # --sjdbOverhang 100 # Recommended for RNA-seq, typically read_length - 1 # --- Alignment Command --- STAR --genomeDir ${GENOME_DIR} \ --readFilesIn ${READ1} ${READ2} \ --runThreadN ${THREADS} \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --readFilesCommand zcat \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --sjdbScore 1 \ --limitBAMsortRAM 30000000000 # Adjust based on available RAM, e.g., 30GB -
3
mRNA quantitation using featureCounts version 1.5.3 and the option âât exonâ was specified to count mRNA features
$ Bash example
# Install featureCounts (part of the Subread package) # conda install -c bioconda subread # Placeholder for annotation file. Replace with the actual path to your GTF/GFF file. # For example, to download a human GRCh38 GTF from Ensembl: # wget http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz # gunzip Homo_sapiens.GRCh38.109.gtf.gz ANNOTATION_FILE="Homo_sapiens.GRCh38.109.gtf" # Placeholder for input BAM files. Replace with your actual aligned BAM files. # Example: INPUT_BAM_FILES="sample1_aligned.bam sample2_aligned.bam" INPUT_BAM_FILES="input1.bam input2.bam" # Output file for the gene counts OUTPUT_COUNTS_FILE="mrna_gene_counts.txt" # Execute featureCounts for mRNA quantitation, counting reads overlapping 'exon' features # and summarizing by 'gene_id'. featureCounts -t exon -g gene_id -a "${ANNOTATION_FILE}" -o "${OUTPUT_COUNTS_FILE}" ${INPUT_BAM_FILES} -
4
Differential expression using DESeq2 version 1.22.1 specifying a false discovery cutoff of 0.01 for independent filtering
$ Bash example
#!/bin/bash # --- Installation instructions (commented out) --- # # Install R (example for Ubuntu/Debian) # # sudo apt-get update # # sudo apt-get install r-base # # # Install BiocManager and DESeq2 within R # # DESeq2 version 1.22.1 corresponds to Bioconductor 3.7 (released Oct 2018). # # To ensure this specific version: # # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # # R -e 'BiocManager::install(version = "3.7", ask = FALSE)' # Install Bioconductor 3.7 # # R -e 'BiocManager::install("DESeq2", ask = FALSE)' # Install DESeq2 for Bioconductor 3.7 # --- Create dummy input files for demonstration --- # In a real pipeline, these files (e.g., featureCounts output) would be generated by upstream steps. # Create a dummy counts matrix (e.g., from featureCounts) echo "GeneID,sample1,sample2,sample3,sample4,sample5,sample6" > counts_matrix.csv echo "geneA,100,120,110,200,210,220" >> counts_matrix.csv echo "geneB,50,60,55,100,110,105" >> counts_matrix.csv echo "geneC,10,12,11,20,22,21" >> counts_matrix.csv echo "geneD,500,510,505,500,510,505" >> counts_matrix.csv echo "geneE,1000,1050,1020,100,110,105" >> counts_matrix.csv echo "geneF,10,11,12,100,105,102" >> counts_matrix.csv echo "geneG,5,6,7,8,9,10" >> counts_matrix.csv # Low count gene echo "geneH,0,0,0,0,0,0" >> counts_matrix.csv # Zero count gene # Create a dummy colData file (sample metadata) echo "sample,condition" > coldata.csv echo "sample1,control" >> coldata.csv echo "sample2,control" >> coldata.csv echo "sample3,control" >> coldata.csv echo "sample4,treated" >> coldata.csv echo "sample5,treated" >> coldata.csv echo "sample6,treated" >> coldata.csv # --- Execute DESeq2 analysis using Rscript --- Rscript -e ' # Load DESeq2 library library(DESeq2) # Read count data # Assuming the first column is GeneID and subsequent columns are sample counts counts_data <- read.csv("counts_matrix.csv", row.names = 1) # Read sample metadata # Assuming the first column is sample name and subsequent columns are metadata col_data <- read.csv("coldata.csv", row.names = 1) col_data$condition <- factor(col_data$condition) # Ensure condition is a factor # Ensure sample names in counts_data columns match row names in col_data if (!all(colnames(counts_data) == rownames(col_data))) { stop("Sample names in counts matrix columns and colData row names do not match or are not in the same order.") } # Create DESeqDataSet object # Design formula specifies the experimental design (e.g., ~ condition) dds <- DESeqDataSetFromMatrix(countData = counts_data, colData = col_data, design = ~ condition) # Pre-filtering: remove genes with very low counts across all samples # This helps to reduce the number of tests and increase statistical power. # A common threshold is to keep genes with at least 10 counts summed across all samples. keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep,] # Run DESeq2 analysis # This performs normalization, dispersion estimation, and Wald test. dds <- DESeq(dds) # Extract results with a specified false discovery rate (FDR) cutoff (alpha) # Independent filtering is applied by default in the results() function, # which aims to optimize the number of significant genes by filtering out # genes with very low mean normalized counts. res <- results(dds, alpha = 0.01) # Order results by adjusted p-value (padj) res_ordered <- res[order(res$padj),] # Save the differential expression results to a CSV file write.csv(as.data.frame(res_ordered), "deseq2_results_alpha0.01.csv") message("DESeq2 analysis complete. Results saved to deseq2_results_alpha0.01.csv") '
Raw Source Text
Sequencing on Illumina NextSeq500 and basecalling by Real-Time Analysis (RTA) software version 2.4.11 Alignment to the mm10 genome using STAR version 2.5.0 mRNA quantitation using featureCounts version 1.5.3 and the option âât exonâ was specified to count mRNA features Differential expression using DESeq2 version 1.22.1 specifying a false discovery cutoff of 0.01 for independent filtering Genome_build: mm10 Supplementary_files_format_and_content: Transcripts per kilobase million (TPM) normalized count matrix of mapped reads, and DESeq2 outputs for Bmal1 and Clock knockout comparisons in ".csv" format