GSE75128 Processing Pipeline

RNA-Seq code_examples 7 steps

Publication

A Small RNA-Catalytic Argonaute Pathway Tunes Germline Transcript Levels to Ensure Embryonic Divisions.

Cell (2016) — PMID 27020753

Dataset

GSE75128

A Small RNA-Catalytic Argonaute Pathway Tunes Germline Transcript Levels to Ensure Embryonic Divisions

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    A genome index was first generated using the Ensembl C. elegans genome WBcel235 and the Ensembl gene annotation file (version WBcel235.82) by the command: “STAR --runThreadN 8 --runMode genomeGenerate --genomeDir <output folder> --genomeFastaFiles <path to genome fasta> --sjdbGTFfile <path to .gtf gene annotation file> --sjdbOverhang 49 --genomeSAindexNbases 12”.

    STAR v(Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Create directories for reference files and output
    mkdir -p references
    mkdir -p star_genome_index
    
    # Download Ensembl C. elegans genome WBcel235 (release 82)
    wget -O references/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz ftp://ftp.ensembl.org/pub/release-82/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz
    
    # Download Ensembl C. elegans gene annotation file WBcel235.82 (release 82)
    wget -O references/Caenorhabditis_elegans.WBcel235.82.gtf ftp://ftp.ensembl.org/pub/release-82/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.82.gtf
    
    # Generate genome index
    STAR --runThreadN 8 \
         --runMode genomeGenerate \
         --genomeDir star_genome_index \
         --genomeFastaFiles references/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz \
         --sjdbGTFfile references/Caenorhabditis_elegans.WBcel235.82.gtf \
         --sjdbOverhang 49 \
         --genomeSAindexNbases 12
  2. 2

    Reads from all six samples were then mapped onto the genome using: “STAR --runThreadN 8 --genomeDir <path to genome index> --readFilesIn <list of sequence files> --outFileNamePrefix <Prefix for output files> --outFilterMismatchNmax 3 --outFilterMultimapNmax 10”.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables for clarity
    # Placeholder: Replace with actual path to your STAR genome index (e.g., for human hg38)
    GENOME_DIR="path/to/STAR_genome_index_hg38"
    
    # Placeholder: Replace with actual list of sequence files for all six samples.
    # Assuming paired-end reads for typical RNA-seq, adjust as needed for single-end.
    READ_FILES="sample1_R1.fastq.gz sample1_R2.fastq.gz \
                sample2_R1.fastq.gz sample2_R2.fastq.gz \
                sample3_R1.fastq.gz sample3_R2.fastq.gz \
                sample4_R1.fastq.gz sample4_R2.fastq.gz \
                sample5_R1.fastq.gz sample5_R2.fastq.gz \
                sample6_R1.fastq.gz sample6_R2.fastq.gz"
    
    # Placeholder: Replace with desired prefix for output files
    OUTPUT_PREFIX="mapped_reads"
    
    # Execute STAR alignment
    STAR \
      --runThreadN 8 \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn ${READ_FILES} \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outFilterMismatchNmax 3 \
      --outFilterMultimapNmax 10
  3. 3

    Reads that uniquely mapped to annotated genes were counted by htseq-count using the command: “htseq-count --stranded=no <.sam file> <.gtf gene annotation file> > <output file name>”.

    HTSeq vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Define input and output files
    SAM_FILE="input.sam"
    GTF_FILE="annotation.gtf"
    OUTPUT_FILE="gene_counts.txt"
    
    # Count reads uniquely mapped to annotated genes
    htseq-count --stranded=no "${SAM_FILE}" "${GTF_FILE}" > "${OUTPUT_FILE}"
  4. 4

    DESeq2 (Love et al., 2014) was used to perform differential analysis of gene expression between different groups.

    $ Bash example
    # Install R and Bioconductor if not already present
    # For example, using conda:
    # conda create -n deseq2_env r-base bioconductor-deseq2 r-readr r-dplyr
    # conda activate deseq2_env
    
    # Create dummy input files for demonstration
    # In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto)
    echo "gene_id,sample1,sample2,sample3,sample4" > counts.csv
    echo "geneA,100,120,50,60" >> counts.csv
    echo "geneB,20,25,80,90" >> counts.csv
    echo "geneC,500,550,200,220" >> counts.csv
    
    echo "sample,group" > coldata.csv
    echo "sample1,control" >> coldata.csv
    echo "sample2,control" >> coldata.csv
    echo "sample3,treatment" >> coldata.csv
    echo "sample4,treatment" >> coldata.csv
    
    # Create an R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    # Load necessary libraries
    library(DESeq2)
    library(readr)
    library(dplyr)
    
    # --- Configuration ---
    counts_file <- "counts.csv"
    coldata_file <- "coldata.csv"
    output_file <- "deseq2_results.csv"
    design_formula <- "~ group" # Example design formula
    
    # --- Load Data ---
    # Read count data
    counts_df <- read_csv(counts_file)
    # Assuming the first column is gene_id and subsequent columns are sample counts
    gene_ids <- counts_df[[1]]
    counts_matrix <- as.matrix(counts_df[,-1])
    rownames(counts_matrix) <- gene_ids
    
    # Read colData (sample metadata)
    coldata_df <- read_csv(coldata_file)
    # Ensure sample names match column names in counts_matrix
    # And ensure order matches
    coldata_df <- coldata_df %>%
      filter(sample %in% colnames(counts_matrix)) %>%
      arrange(match(sample, colnames(counts_matrix)))
    rownames(coldata_df) <- coldata_df$sample
    
    # Ensure colData has factors for grouping variables
    coldata_df$group <- factor(coldata_df$group)
    
    # --- Create DESeqDataSet object ---
    dds <- DESeqDataSetFromMatrix(countData = counts_matrix,
                                  colData = coldata_df,
                                  design = as.formula(design_formula))
    
    # --- Run DESeq2 analysis ---
    dds <- DESeq(dds)
    
    # --- Extract results ---
    # Example: comparing 'treatment' vs 'control'
    # The contrast should be adjusted based on the actual groups in your colData
    res <- results(dds, contrast=c("group", "treatment", "control"))
    
    # Order results by adjusted p-value
    res_ordered <- res[order(res$padj),]
    
    # Convert to data frame and write to CSV
    res_df <- as.data.frame(res_ordered)
    write_csv(res_df, output_file)
    
    message(paste("DESeq2 analysis complete. Results saved to", output_file))
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R
  5. 5

    First, a DESeq2 data set was created by: “(dds <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ genotype))”.

    $ Bash example
    # Install BiocManager and DESeq2 (if not already installed)
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # R -e 'BiocManager::install("DESeq2")'
    
    # Create a dummy countdata.csv for demonstration purposes.
    # In a real pipeline, this would be generated by a previous step (e.g., featureCounts).
    # The first column represents gene IDs, and subsequent columns are raw counts for samples.
    cat <<EOF > countdata.csv
    gene1,100,120,110,200,210,220
    gene2,50,60,55,100,110,105
    gene3,200,210,205,400,410,405
    gene4,10,12,11,20,22,21
    gene5,30,35,32,60,65,62
    gene6,70,75,72,140,145,142
    EOF
    
    # Create a dummy coldata.csv for demonstration purposes.
    # In a real pipeline, this would be provided as sample metadata.
    # The first column represents sample IDs, and subsequent columns are metadata, including 'genotype'.
    cat <<EOF > coldata.csv
    sample,genotype
    sample1,WT
    sample2,WT
    sample3,WT
    sample4,KO
    sample5,KO
    sample6,KO
    EOF
    
    # R script to create the DESeq2 data set object.
    # This script reads the count and colData, then constructs the DESeqDataSet object.
    cat <<'EOF_R_SCRIPT' > deseq2_prep.R
    #!/usr/bin/env Rscript
    
    # Load DESeq2 library
    library(DESeq2)
    
    # Read count data from CSV.
    # Assumes the first column contains row names (gene IDs) and the header is present.
    countdata_raw <- read.csv("countdata.csv", row.names = 1, header = TRUE)
    # Convert the data frame of counts to a matrix, which is required by DESeqDataSetFromMatrix.
    countdata <- as.matrix(countdata_raw)
    
    # Read colData (sample metadata) from CSV.
    # Assumes the first column contains row names (sample IDs) and the header is present.
    coldata <- read.csv("coldata.csv", row.names = 1, header = TRUE)
    
    # Ensure colData sample names match countData column names and are in the same order.
    # This step is crucial for correct mapping of metadata to count data.
    coldata <- coldata[colnames(countdata), , drop=FALSE]
    
    # Ensure the 'genotype' column is treated as a factor, which is necessary for DESeq2 design formulas.
    coldata$genotype <- factor(coldata$genotype)
    
    # Create a DESeq2 data set object (dds).
    # This object encapsulates count data, sample metadata, and the experimental design formula.
    dds <- DESeqDataSetFromMatrix(countData = countdata,
                                  colData = coldata,
                                  design = ~ genotype)
    
    # Save the dds object to an .rds file.
    # This allows the object to be loaded and used in subsequent DESeq2 analysis steps (e.g., normalization, differential expression).
    saveRDS(dds, file = "dds_object.rds")
    
    # Optional: Print a summary of the created dds object to the console.
    print(dds)
    EOF_R_SCRIPT
    
    # Execute the R script to perform the DESeq2 data set creation.
    Rscript deseq2_prep.R
  6. 6

    Second, the size factors (normalization for sequence depth) for each group were estimated by: “dds <- estimateSizeFactors(dds)”.

    DESeq2 (Inferred with models/gemini-2.5-flash) vNot specified GitHub
    $ Bash example
    # Install DESeq2 if not already installed
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'
    
    # Define input and output file paths for the DESeqDataSet object
    # Replace with actual paths in a real pipeline
    INPUT_DDS_FILE="input_dds.rds" # Assuming dds object from previous step
    OUTPUT_DDS_FILE="dds_with_size_factors.rds"
    
    # Execute DESeq2 command to estimate size factors
    Rscript -e "
      library(DESeq2)
      # Load the DESeqDataSet object from the previous step
      if (!file.exists('${INPUT_DDS_FILE}')) {
        stop('Input DDS file not found: ${INPUT_DDS_FILE}')
      }
      dds <- readRDS('${INPUT_DDS_FILE}')
    
      # Estimate size factors
      dds <- estimateSizeFactors(dds)
    
      # Save the updated DESeqDataSet object
      saveRDS(dds, file='${OUTPUT_DDS_FILE}')
    "
  7. 7

    Differential analysis was performed by calling “dds <- DESeq(dds)” and pairwise comparison was performed at a false discovery rate of 0.05 by: “res_group1_group2_alpha.05 <- results(dds, contrast=c("genotype", "group1", "group2"), alpha = .05)”.

    DESeq2 vNot specified GitHub
    $ Bash example
    # Install R and Bioconductor if not already present
    # conda install -c bioconda bioconductor-deseq2
    # Or in R:
    # if (!requireNamespace("BiocManager", quietly = TRUE))
    #     install.packages("BiocManager")
    # BiocManager::install("DESeq2")
    
    # Create an R script (e.g., deseq2_analysis.R)
    cat << 'EOF' > deseq2_analysis.R
    library(DESeq2)
    library(dplyr) # For data manipulation if needed for colData
    
    # --- Placeholder for creating/loading the DESeqDataSet object 'dds' ---
    # In a real pipeline, 'dds' would be created from a count matrix
    # and a sample metadata (colData) table from previous steps.
    # Replace this section with your actual data loading/creation.
    
    # Example: Create dummy count data and colData for illustration
    # This assumes you have a count matrix (e.g., from featureCounts)
    # and a metadata file describing your samples.
    
    # Dummy count matrix (replace with your actual counts)
    # Rows are genes, columns are samples
    set.seed(123)
    counts_data <- matrix(sample(1:1000, 100*6, replace=TRUE), ncol=6)
    rownames(counts_data) <- paste0("gene", 1:100)
    colnames(counts_data) <- c("sample1_group1", "sample2_group1", "sample3_group1",
                               "sample4_group2", "sample5_group2", "sample6_group2")
    
    # Dummy sample metadata (colData)
    col_data <- data.frame(
      sample = colnames(counts_data),
      genotype = factor(c(rep("group1", 3), rep("group2", 3))),
      row.names = colnames(counts_data)
    )
    
    # Ensure colData has the same sample names as count matrix columns
    stopifnot(all(colnames(counts_data) == rownames(col_data)))
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromMatrix(countData = counts_data,
                                  colData = col_data,
                                  design = ~ genotype)
    
    message("DESeqDataSet object 'dds' created/loaded.")
    # --- End of Placeholder ---
    
    
    # Perform differential expression analysis
    message("Performing DESeq analysis...")
    dds <- DESeq(dds)
    message("DESeq analysis complete.")
    
    # Perform pairwise comparison for group1 vs group2
    # The contrast specifies the factor ("genotype"), the numerator ("group1"), and the denominator ("group2")
    message("Performing pairwise comparison for group1 vs group2 at alpha = 0.05...")
    res_group1_group2_alpha.05 <- results(dds, contrast=c("genotype", "group1", "group2"), alpha = .05)
    message("Pairwise comparison complete.")
    
    # Save the results
    output_file <- "deseq2_results_group1_vs_group2_alpha05.csv"
    write.csv(as.data.frame(res_group1_group2_alpha.05), file=output_file, row.names=TRUE)
    message(paste("Results saved to:", output_file))
    
    # Optionally save the dds object after analysis
    # saveRDS(dds, file="dds_after_deseq.rds")
    EOF
    
    # Execute the R script
    Rscript deseq2_analysis.R

Tools Used

Raw Source Text
A genome index was first generated using the Ensembl  C. elegans genome WBcel235 and the Ensembl gene annotation file (version WBcel235.82)  by the command: “STAR --runThreadN 8 --runMode genomeGenerate --genomeDir <output  folder> --genomeFastaFiles <path to genome fasta> --sjdbGTFfile <path to .gtf gene  annotation file> --sjdbOverhang 49 --genomeSAindexNbases 12”.
Reads from all six samples were then mapped onto the genome using: “STAR --runThreadN 8 --genomeDir  <path to genome index> --readFilesIn <list of sequence files> --outFileNamePrefix <Prefix  for output files> --outFilterMismatchNmax 3 --outFilterMultimapNmax 10”.
Reads that uniquely mapped to annotated genes were counted by htseq-count using the command: “htseq-count --stranded=no <.sam file> <.gtf gene annotation file> > <output file name>”.
DESeq2 (Love et al., 2014) was used to perform differential analysis of gene  expression between different groups. First, a DESeq2 data set was created by: “(dds <-  DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~  genotype))”. Second, the size factors (normalization for sequence depth) for each group  were estimated by: “dds <- estimateSizeFactors(dds)”.
Differential analysis was performed by calling “dds <- DESeq(dds)” and pairwise  comparison was performed at a false discovery rate of 0.05 by: “res_group1_group2_alpha.05 <- results(dds, contrast=c("genotype", "group1", "group2"), alpha = .05)”.
Genome_build: WBcel235
Supplementary_files_format_and_content: csv file that contains counts, normalized expression (rpkm), log2FoldChange(SIN/WT), adjusted p-values as well as useful published information for each gene annotation and each of the six samples
← Back to Analysis