GSE75128 Processing Pipeline

RNA-Seq code_examples 7 steps

Publication

A Small RNA-Catalytic Argonaute Pathway Tunes Germline Transcript Levels to Ensure Embryonic Divisions.

Cell (2016) — PMID 27020753

Dataset

A Small RNA-Catalytic Argonaute Pathway Tunes Germline Transcript Levels to Ensure Embryonic Divisions

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

A genome index was first generated using the Ensembl C. elegans genome WBcel235 and the Ensembl gene annotation file (version WBcel235.82) by the command: âSTAR --runThreadN 8 --runMode genomeGenerate --genomeDir <output folder> --genomeFastaFiles <path to genome fasta> --sjdbGTFfile <path to .gtf gene annotation file> --sjdbOverhang 49 --genomeSAindexNbases 12â.

STAR v(Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Create directories for reference files and output
mkdir -p references
mkdir -p star_genome_index

# Download Ensembl C. elegans genome WBcel235 (release 82)
wget -O references/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz ftp://ftp.ensembl.org/pub/release-82/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz

# Download Ensembl C. elegans gene annotation file WBcel235.82 (release 82)
wget -O references/Caenorhabditis_elegans.WBcel235.82.gtf ftp://ftp.ensembl.org/pub/release-82/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.82.gtf

# Generate genome index
STAR --runThreadN 8 \
     --runMode genomeGenerate \
     --genomeDir star_genome_index \
     --genomeFastaFiles references/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz \
     --sjdbGTFfile references/Caenorhabditis_elegans.WBcel235.82.gtf \
     --sjdbOverhang 49 \
     --genomeSAindexNbases 12

View on GitHub

Reads from all six samples were then mapped onto the genome using: âSTAR --runThreadN 8 --genomeDir <path to genome index> --readFilesIn <list of sequence files> --outFileNamePrefix <Prefix for output files> --outFilterMismatchNmax 3 --outFilterMultimapNmax 10â.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables for clarity
# Placeholder: Replace with actual path to your STAR genome index (e.g., for human hg38)
GENOME_DIR="path/to/STAR_genome_index_hg38"

# Placeholder: Replace with actual list of sequence files for all six samples.
# Assuming paired-end reads for typical RNA-seq, adjust as needed for single-end.
READ_FILES="sample1_R1.fastq.gz sample1_R2.fastq.gz \
            sample2_R1.fastq.gz sample2_R2.fastq.gz \
            sample3_R1.fastq.gz sample3_R2.fastq.gz \
            sample4_R1.fastq.gz sample4_R2.fastq.gz \
            sample5_R1.fastq.gz sample5_R2.fastq.gz \
            sample6_R1.fastq.gz sample6_R2.fastq.gz"

# Placeholder: Replace with desired prefix for output files
OUTPUT_PREFIX="mapped_reads"

# Execute STAR alignment
STAR \
  --runThreadN 8 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn ${READ_FILES} \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outFilterMismatchNmax 3 \
  --outFilterMultimapNmax 10

View on GitHub

3
Reads that uniquely mapped to annotated genes were counted by htseq-count using the command: âhtseq-count --stranded=no <.sam file> <.gtf gene annotation file> > <output file name>â.

HTSeq vInferred with models/gemini-2.5-flash GitHub
$ Bash example
```
# Define input and output files
SAM_FILE="input.sam"
GTF_FILE="annotation.gtf"
OUTPUT_FILE="gene_counts.txt"

# Count reads uniquely mapped to annotated genes
htseq-count --stranded=no "${SAM_FILE}" "${GTF_FILE}" > "${OUTPUT_FILE}"
```
View on GitHub

DESeq2 (Love et al., 2014) was used to perform differential analysis of gene expression between different groups.

DESeq2 v1.4.5 GitHub

$ Bash example

# Install R and Bioconductor if not already present
# For example, using conda:
# conda create -n deseq2_env r-base bioconductor-deseq2 r-readr r-dplyr
# conda activate deseq2_env

# Create dummy input files for demonstration
# In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto)
echo "gene_id,sample1,sample2,sample3,sample4" > counts.csv
echo "geneA,100,120,50,60" >> counts.csv
echo "geneB,20,25,80,90" >> counts.csv
echo "geneC,500,550,200,220" >> counts.csv

echo "sample,group" > coldata.csv
echo "sample1,control" >> coldata.csv
echo "sample2,control" >> coldata.csv
echo "sample3,treatment" >> coldata.csv
echo "sample4,treatment" >> coldata.csv

# Create an R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
# Load necessary libraries
library(DESeq2)
library(readr)
library(dplyr)

# --- Configuration ---
counts_file <- "counts.csv"
coldata_file <- "coldata.csv"
output_file <- "deseq2_results.csv"
design_formula <- "~ group" # Example design formula

# --- Load Data ---
# Read count data
counts_df <- read_csv(counts_file)
# Assuming the first column is gene_id and subsequent columns are sample counts
gene_ids <- counts_df[[1]]
counts_matrix <- as.matrix(counts_df[,-1])
rownames(counts_matrix) <- gene_ids

# Read colData (sample metadata)
coldata_df <- read_csv(coldata_file)
# Ensure sample names match column names in counts_matrix
# And ensure order matches
coldata_df <- coldata_df %>%
  filter(sample %in% colnames(counts_matrix)) %>%
  arrange(match(sample, colnames(counts_matrix)))
rownames(coldata_df) <- coldata_df$sample

# Ensure colData has factors for grouping variables
coldata_df$group <- factor(coldata_df$group)

# --- Create DESeqDataSet object ---
dds <- DESeqDataSetFromMatrix(countData = counts_matrix,
                              colData = coldata_df,
                              design = as.formula(design_formula))

# --- Run DESeq2 analysis ---
dds <- DESeq(dds)

# --- Extract results ---
# Example: comparing 'treatment' vs 'control'
# The contrast should be adjusted based on the actual groups in your colData
res <- results(dds, contrast=c("group", "treatment", "control"))

# Order results by adjusted p-value
res_ordered <- res[order(res$padj),]

# Convert to data frame and write to CSV
res_df <- as.data.frame(res_ordered)
write_csv(res_df, output_file)

message(paste("DESeq2 analysis complete. Results saved to", output_file))
EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

First, a DESeq2 data set was created by: â(dds <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ genotype))â.

DESeq2 v1.x.x GitHub

$ Bash example

# Install BiocManager and DESeq2 (if not already installed)
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'

# Create a dummy countdata.csv for demonstration purposes.
# In a real pipeline, this would be generated by a previous step (e.g., featureCounts).
# The first column represents gene IDs, and subsequent columns are raw counts for samples.
cat <<EOF > countdata.csv
gene1,100,120,110,200,210,220
gene2,50,60,55,100,110,105
gene3,200,210,205,400,410,405
gene4,10,12,11,20,22,21
gene5,30,35,32,60,65,62
gene6,70,75,72,140,145,142
EOF

# Create a dummy coldata.csv for demonstration purposes.
# In a real pipeline, this would be provided as sample metadata.
# The first column represents sample IDs, and subsequent columns are metadata, including 'genotype'.
cat <<EOF > coldata.csv
sample,genotype
sample1,WT
sample2,WT
sample3,WT
sample4,KO
sample5,KO
sample6,KO
EOF

# R script to create the DESeq2 data set object.
# This script reads the count and colData, then constructs the DESeqDataSet object.
cat <<'EOF_R_SCRIPT' > deseq2_prep.R
#!/usr/bin/env Rscript

# Load DESeq2 library
library(DESeq2)

# Read count data from CSV.
# Assumes the first column contains row names (gene IDs) and the header is present.
countdata_raw <- read.csv("countdata.csv", row.names = 1, header = TRUE)
# Convert the data frame of counts to a matrix, which is required by DESeqDataSetFromMatrix.
countdata <- as.matrix(countdata_raw)

# Read colData (sample metadata) from CSV.
# Assumes the first column contains row names (sample IDs) and the header is present.
coldata <- read.csv("coldata.csv", row.names = 1, header = TRUE)

# Ensure colData sample names match countData column names and are in the same order.
# This step is crucial for correct mapping of metadata to count data.
coldata <- coldata[colnames(countdata), , drop=FALSE]

# Ensure the 'genotype' column is treated as a factor, which is necessary for DESeq2 design formulas.
coldata$genotype <- factor(coldata$genotype)

# Create a DESeq2 data set object (dds).
# This object encapsulates count data, sample metadata, and the experimental design formula.
dds <- DESeqDataSetFromMatrix(countData = countdata,
                              colData = coldata,
                              design = ~ genotype)

# Save the dds object to an .rds file.
# This allows the object to be loaded and used in subsequent DESeq2 analysis steps (e.g., normalization, differential expression).
saveRDS(dds, file = "dds_object.rds")

# Optional: Print a summary of the created dds object to the console.
print(dds)
EOF_R_SCRIPT

# Execute the R script to perform the DESeq2 data set creation.
Rscript deseq2_prep.R

View on GitHub

Second, the size factors (normalization for sequence depth) for each group were estimated by: âdds <- estimateSizeFactors(dds)â.

DESeq2 (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Install DESeq2 if not already installed
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'

# Define input and output file paths for the DESeqDataSet object
# Replace with actual paths in a real pipeline
INPUT_DDS_FILE="input_dds.rds" # Assuming dds object from previous step
OUTPUT_DDS_FILE="dds_with_size_factors.rds"

# Execute DESeq2 command to estimate size factors
Rscript -e "
  library(DESeq2)
  # Load the DESeqDataSet object from the previous step
  if (!file.exists('${INPUT_DDS_FILE}')) {
    stop('Input DDS file not found: ${INPUT_DDS_FILE}')
  }
  dds <- readRDS('${INPUT_DDS_FILE}')

  # Estimate size factors
  dds <- estimateSizeFactors(dds)

  # Save the updated DESeqDataSet object
  saveRDS(dds, file='${OUTPUT_DDS_FILE}')
"

View on GitHub

Differential analysis was performed by calling âdds <- DESeq(dds)â and pairwise comparison was performed at a false discovery rate of 0.05 by: âres_group1_group2_alpha.05 <- results(dds, contrast=c("genotype", "group1", "group2"), alpha = .05)â.

DESeq2 vNot specified GitHub

$ Bash example

# Install R and Bioconductor if not already present
# conda install -c bioconda bioconductor-deseq2
# Or in R:
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("DESeq2")

# Create an R script (e.g., deseq2_analysis.R)
cat << 'EOF' > deseq2_analysis.R
library(DESeq2)
library(dplyr) # For data manipulation if needed for colData

# --- Placeholder for creating/loading the DESeqDataSet object 'dds' ---
# In a real pipeline, 'dds' would be created from a count matrix
# and a sample metadata (colData) table from previous steps.
# Replace this section with your actual data loading/creation.

# Example: Create dummy count data and colData for illustration
# This assumes you have a count matrix (e.g., from featureCounts)
# and a metadata file describing your samples.

# Dummy count matrix (replace with your actual counts)
# Rows are genes, columns are samples
set.seed(123)
counts_data <- matrix(sample(1:1000, 100*6, replace=TRUE), ncol=6)
rownames(counts_data) <- paste0("gene", 1:100)
colnames(counts_data) <- c("sample1_group1", "sample2_group1", "sample3_group1",
                           "sample4_group2", "sample5_group2", "sample6_group2")

# Dummy sample metadata (colData)
col_data <- data.frame(
  sample = colnames(counts_data),
  genotype = factor(c(rep("group1", 3), rep("group2", 3))),
  row.names = colnames(counts_data)
)

# Ensure colData has the same sample names as count matrix columns
stopifnot(all(colnames(counts_data) == rownames(col_data)))

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = counts_data,
                              colData = col_data,
                              design = ~ genotype)

message("DESeqDataSet object 'dds' created/loaded.")
# --- End of Placeholder ---


# Perform differential expression analysis
message("Performing DESeq analysis...")
dds <- DESeq(dds)
message("DESeq analysis complete.")

# Perform pairwise comparison for group1 vs group2
# The contrast specifies the factor ("genotype"), the numerator ("group1"), and the denominator ("group2")
message("Performing pairwise comparison for group1 vs group2 at alpha = 0.05...")
res_group1_group2_alpha.05 <- results(dds, contrast=c("genotype", "group1", "group2"), alpha = .05)
message("Pairwise comparison complete.")

# Save the results
output_file <- "deseq2_results_group1_vs_group2_alpha05.csv"
write.csv(as.data.frame(res_group1_group2_alpha.05), file=output_file, row.names=TRUE)
message(paste("Results saved to:", output_file))

# Optionally save the dds object after analysis
# saveRDS(dds, file="dds_after_deseq.rds")
EOF

# Execute the R script
Rscript deseq2_analysis.R

View on GitHub

Tools Used

STAR DESeq2

Raw Source Text

A genome index was first generated using the Ensembl  C. elegans genome WBcel235 and the Ensembl gene annotation file (version WBcel235.82)  by the command: âSTAR --runThreadN 8 --runMode genomeGenerate --genomeDir <output  folder> --genomeFastaFiles <path to genome fasta> --sjdbGTFfile <path to .gtf gene  annotation file> --sjdbOverhang 49 --genomeSAindexNbases 12â.
Reads from all six samples were then mapped onto the genome using: âSTAR --runThreadN 8 --genomeDir  <path to genome index> --readFilesIn <list of sequence files> --outFileNamePrefix <Prefix  for output files> --outFilterMismatchNmax 3 --outFilterMultimapNmax 10â.
Reads that uniquely mapped to annotated genes were counted by htseq-count using the command: âhtseq-count --stranded=no <.sam file> <.gtf gene annotation file> > <output file name>â.
DESeq2 (Love et al., 2014) was used to perform differential analysis of gene  expression between different groups. First, a DESeq2 data set was created by: â(dds <-  DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~  genotype))â. Second, the size factors (normalization for sequence depth) for each group  were estimated by: âdds <- estimateSizeFactors(dds)â.
Differential analysis was performed by calling âdds <- DESeq(dds)â and pairwise  comparison was performed at a false discovery rate of 0.05 by: âres_group1_group2_alpha.05 <- results(dds, contrast=c("genotype", "group1", "group2"), alpha = .05)â.
Genome_build: WBcel235
Supplementary_files_format_and_content: csv file that contains counts, normalized expression (rpkm), log2FoldChange(SIN/WT), adjusted p-values as well as useful published information for each gene annotation and each of the six samples

← Back to Analysis