GSE146014 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

RNA binding protein DDX5 directs tuft cell specification and function to regulate microbial repertoire and disease susceptibility in the intestine.

Gut (2022) — PMID 34853057

Dataset

GSE146014

DDX5 in colonic tumors

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Sequenced reads were mapped to mm10 using STAR v2.6.1a with parameters "--outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000".

STAR v2.6.1a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.6.1a

# Create STAR genome index (if not already created)
# STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /path/to/STAR_index/mm10 --genomeFastaFiles /path/to/mm10.fa --sjdbGTFfile /path/to/mm10.gtf

# Align sequenced reads to mm10 using STAR
STAR --genomeDir /path/to/STAR_index/mm10 \
     --readFilesIn input_R1.fastq.gz input_R2.fastq.gz \
     --outFileNamePrefix aligned_reads_ \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN 8 \
     --outFilterMultimapNmax 20 \
     --alignSJoverhangMin 8 \
     --alignSJDBoverhangMin 1 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverReadLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000

View on GitHub

Uniquely mapped reads overlapping with the exons were counted using featureCounts for each gene in GENCODE.vM19 annotation.

featureCounts v2.0.6

$ Bash example

# Install featureCounts (part of Subread package)
# conda install -c bioconda subread

# Download GENCODE vM19 annotation GTF
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M19/gencode.vM19.annotation.gtf.gz
# gunzip gencode.vM19.annotation.gtf.gz

# Define variables
GTF_FILE="gencode.vM19.annotation.gtf"
INPUT_BAM="aligned_reads.bam" # Placeholder for uniquely mapped and sorted BAM file
OUTPUT_FILE="gene_counts.txt"
NUM_THREADS=8 # Adjust as needed

# Run featureCounts
featureCounts -a "${GTF_FILE}" \
              -o "${OUTPUT_FILE}" \
              -F GTF \
              -t exon \
              -g gene_id \
              --primary \
              -s 0 \
              -T "${NUM_THREADS}" \
              "${INPUT_BAM}"

Differential expression analysis were performed by DESeq2(v1.18.1) package

DESeq2 v1.18.1 GitHub

$ Bash example

#!/bin/bash

# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install -y r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2", version = "3.8")' # DESeq2 v1.18.1 corresponds to Bioconductor 3.8

# Create dummy count data (replace with your actual count matrix)
echo -e "gene\tsample1\tsample2\tsample3\tsample4" > counts.tsv
echo -e "geneA\t100\t120\t50\t60" >> counts.tsv
echo -e "geneB\t20\t25\t80\t90" >> counts.tsv
echo -e "geneC\t500\t550\t200\t220" >> counts.tsv
echo -e "geneD\t10\t15\t5\t8" >> counts.tsv

# Create dummy sample information (replace with your actual sample metadata)
echo -e "sample\tcondition" > sample_info.tsv
echo -e "sample1\tcontrol" >> sample_info.tsv
echo -e "sample2\tcontrol" >> sample_info.tsv
echo -e "sample3\ttreated" >> sample_info.tsv
echo -e "sample4\ttreated" >> sample_info.tsv

# Run DESeq2 analysis using R
Rscript -e '
  library(DESeq2)

  # Load count data
  # Ensure the first column is gene names and subsequent columns are sample counts
  count_data <- read.table("counts.tsv", header = TRUE, row.names = 1)

  # Load sample information
  # Ensure the first column is sample names and subsequent columns are metadata
  sample_info <- read.table("sample_info.tsv", header = TRUE, row.names = 1)

  # Ensure sample order in colData matches countData columns
  sample_info <- sample_info[colnames(count_data), , drop = FALSE]

  # Create DESeqDataSet object
  # Replace `~ condition` with your actual design formula based on `sample_info` columns
  dds <- DESeqDataSetFromMatrix(countData = count_data,
                                colData = sample_info,
                                design = ~ condition)

  # Run DESeq2 analysis
  dds <- DESeq(dds)

  # Get results
  # You can specify contrast, alpha, etc. here, e.g., results(dds, contrast=c("condition","treated","control"))
  res <- results(dds)

  # Save results to a TSV file
  write.table(as.data.frame(res), file = "deseq2_results.tsv", sep = "\t", quote = FALSE, row.names = TRUE)

  message("DESeq2 analysis complete. Results saved to deseq2_results.tsv")
'

View on GitHub

Tools Used

STAR DESeq2

Raw Source Text

Sequenced reads were mapped to mm10 using STAR v2.6.1a with parameters "--outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000".
Uniquely mapped reads overlapping with the exons were counted using featureCounts for each gene in GENCODE.vM19 annotation.
Differential expression analysis were performed by DESeq2(v1.18.1) package
Genome_build: MM10
Supplementary_files_format_and_content: tab-delimited text files include raw counts, normalized counts values log2FoldChange and pvalue for each gene

← Back to Analysis