GSE146916 Processing Pipeline

RNA-Seq code_examples 4 steps

Publication

A role for alternative splicing in circadian control of exocytosis and glucose homeostasis.

Genes & development (2020) — PMID 32616519

Dataset

A role for alternative splicing in circadian control of insulin secretion and glucose homeostasis

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Sequencing on Illumina NextSeq500 and basecalling by Real-Time Analysis (RTA) software version 2.4.11

Real-Time Analysis (RTA) software v2.4.11

$ Bash example

# This step describes the physical sequencing process on an Illumina NextSeq500 instrument.
# The Real-Time Analysis (RTA) software, version 2.4.11, is proprietary Illumina software
# integrated into the instrument's control system (e.g., NextSeq Control Software).
# RTA performs image analysis, base calling, and quality scoring in real-time during the sequencing run.
# There is no direct user-executable bash command for this process; it is managed by the instrument's onboard software.
# The output of this step typically includes BCL (base call) files, which are then demultiplexed and converted to FASTQ files
# in subsequent bioinformatics steps.

Alignment to the mm10 genome using STAR version 2.5.0

STAR v2.5.0 GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.5.0

# Define variables
# Replace with actual paths to your mm10 genome index and input FASTQ files
GENOME_DIR="/path/to/STAR_mm10_index" # Example: /data/genomes/mm10/STAR_index
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz" # Remove if single-end (e.g., for eCLIP, often single-end)
OUTPUT_PREFIX="sample_aligned_"
THREADS=8 # Adjust based on available CPU cores

# --- Genome Index Generation (run once per genome) ---
# This step requires the mm10 FASTA and GTF files.
# Example sources: UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/)
#                  Ensembl (ftp://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/)
#                  Ensembl (ftp://ftp.ensembl.org/pub/release-109/gtf/mus_musculus/)
#
# Example command to generate index:
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/mm10.fa \
#      --sjdbGTFfile /path/to/mm10.gtf \
#      --runThreadN ${THREADS} \
#      --sjdbOverhang 100 # Recommended for RNA-seq, typically read_length - 1

# --- Alignment Command ---
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READ1} ${READ2} \
     --runThreadN ${THREADS} \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --readFilesCommand zcat \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --sjdbScore 1 \
     --limitBAMsortRAM 30000000000 # Adjust based on available RAM, e.g., 30GB

View on GitHub

mRNA quantitation using featureCounts version 1.5.3 and the option âât exonâ was specified to count mRNA features

featureCounts v1.5.3 GitHub

$ Bash example

# Install featureCounts (part of the Subread package)
# conda install -c bioconda subread

# Placeholder for annotation file. Replace with the actual path to your GTF/GFF file.
# For example, to download a human GRCh38 GTF from Ensembl:
# wget http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
# gunzip Homo_sapiens.GRCh38.109.gtf.gz
ANNOTATION_FILE="Homo_sapiens.GRCh38.109.gtf"

# Placeholder for input BAM files. Replace with your actual aligned BAM files.
# Example: INPUT_BAM_FILES="sample1_aligned.bam sample2_aligned.bam"
INPUT_BAM_FILES="input1.bam input2.bam"

# Output file for the gene counts
OUTPUT_COUNTS_FILE="mrna_gene_counts.txt"

# Execute featureCounts for mRNA quantitation, counting reads overlapping 'exon' features
# and summarizing by 'gene_id'.
featureCounts -t exon -g gene_id -a "${ANNOTATION_FILE}" -o "${OUTPUT_COUNTS_FILE}" ${INPUT_BAM_FILES}

View on GitHub

Differential expression using DESeq2 version 1.22.1 specifying a false discovery cutoff of 0.01 for independent filtering

DESeq2 v1.22.1 GitHub

$ Bash example

#!/bin/bash

# --- Installation instructions (commented out) ---
# # Install R (example for Ubuntu/Debian)
# # sudo apt-get update
# # sudo apt-get install r-base
#
# # Install BiocManager and DESeq2 within R
# # DESeq2 version 1.22.1 corresponds to Bioconductor 3.7 (released Oct 2018).
# # To ensure this specific version:
# # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# # R -e 'BiocManager::install(version = "3.7", ask = FALSE)' # Install Bioconductor 3.7
# # R -e 'BiocManager::install("DESeq2", ask = FALSE)' # Install DESeq2 for Bioconductor 3.7

# --- Create dummy input files for demonstration ---
# In a real pipeline, these files (e.g., featureCounts output) would be generated by upstream steps.

# Create a dummy counts matrix (e.g., from featureCounts)
echo "GeneID,sample1,sample2,sample3,sample4,sample5,sample6" > counts_matrix.csv
echo "geneA,100,120,110,200,210,220" >> counts_matrix.csv
echo "geneB,50,60,55,100,110,105" >> counts_matrix.csv
echo "geneC,10,12,11,20,22,21" >> counts_matrix.csv
echo "geneD,500,510,505,500,510,505" >> counts_matrix.csv
echo "geneE,1000,1050,1020,100,110,105" >> counts_matrix.csv
echo "geneF,10,11,12,100,105,102" >> counts_matrix.csv
echo "geneG,5,6,7,8,9,10" >> counts_matrix.csv # Low count gene
echo "geneH,0,0,0,0,0,0" >> counts_matrix.csv # Zero count gene

# Create a dummy colData file (sample metadata)
echo "sample,condition" > coldata.csv
echo "sample1,control" >> coldata.csv
echo "sample2,control" >> coldata.csv
echo "sample3,control" >> coldata.csv
echo "sample4,treated" >> coldata.csv
echo "sample5,treated" >> coldata.csv
echo "sample6,treated" >> coldata.csv

# --- Execute DESeq2 analysis using Rscript ---
Rscript -e '
  # Load DESeq2 library
  library(DESeq2)

  # Read count data
  # Assuming the first column is GeneID and subsequent columns are sample counts
  counts_data <- read.csv("counts_matrix.csv", row.names = 1)

  # Read sample metadata
  # Assuming the first column is sample name and subsequent columns are metadata
  col_data <- read.csv("coldata.csv", row.names = 1)
  col_data$condition <- factor(col_data$condition) # Ensure condition is a factor

  # Ensure sample names in counts_data columns match row names in col_data
  if (!all(colnames(counts_data) == rownames(col_data))) {
    stop("Sample names in counts matrix columns and colData row names do not match or are not in the same order.")
  }

  # Create DESeqDataSet object
  # Design formula specifies the experimental design (e.g., ~ condition)
  dds <- DESeqDataSetFromMatrix(countData = counts_data,
                                colData = col_data,
                                design = ~ condition)

  # Pre-filtering: remove genes with very low counts across all samples
  # This helps to reduce the number of tests and increase statistical power.
  # A common threshold is to keep genes with at least 10 counts summed across all samples.
  keep <- rowSums(counts(dds)) >= 10
  dds <- dds[keep,]

  # Run DESeq2 analysis
  # This performs normalization, dispersion estimation, and Wald test.
  dds <- DESeq(dds)

  # Extract results with a specified false discovery rate (FDR) cutoff (alpha)
  # Independent filtering is applied by default in the results() function,
  # which aims to optimize the number of significant genes by filtering out
  # genes with very low mean normalized counts.
  res <- results(dds, alpha = 0.01)

  # Order results by adjusted p-value (padj)
  res_ordered <- res[order(res$padj),]

  # Save the differential expression results to a CSV file
  write.csv(as.data.frame(res_ordered), "deseq2_results_alpha0.01.csv")

  message("DESeq2 analysis complete. Results saved to deseq2_results_alpha0.01.csv")
'

View on GitHub

Tools Used

STAR DESeq2

Raw Source Text

Sequencing on Illumina NextSeq500 and basecalling by Real-Time Analysis (RTA) software version 2.4.11
Alignment to the mm10 genome using STAR version 2.5.0
mRNA quantitation using featureCounts version 1.5.3 and the option âât exonâ was specified to count mRNA features
Differential expression using DESeq2 version 1.22.1 specifying a false discovery cutoff of 0.01 for independent filtering
Genome_build: mm10
Supplementary_files_format_and_content: Transcripts per kilobase million (TPM) normalized count matrix of mapped reads, and DESeq2 outputs for Bmal1 and Clock knockout comparisons in ".csv" format

← Back to Analysis