GSE75128 Processing Pipeline
Publication
A Small RNA-Catalytic Argonaute Pathway Tunes Germline Transcript Levels to Ensure Embryonic Divisions.Cell (2016) — PMID 27020753
Dataset
GSE75128A Small RNA-Catalytic Argonaute Pathway Tunes Germline Transcript Levels to Ensure Embryonic Divisions
Processing Steps
Generate Jupyter Notebook-
1
A genome index was first generated using the Ensembl C. elegans genome WBcel235 and the Ensembl gene annotation file (version WBcel235.82) by the command: âSTAR --runThreadN 8 --runMode genomeGenerate --genomeDir <output folder> --genomeFastaFiles <path to genome fasta> --sjdbGTFfile <path to .gtf gene annotation file> --sjdbOverhang 49 --genomeSAindexNbases 12â.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Create directories for reference files and output mkdir -p references mkdir -p star_genome_index # Download Ensembl C. elegans genome WBcel235 (release 82) wget -O references/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz ftp://ftp.ensembl.org/pub/release-82/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz # Download Ensembl C. elegans gene annotation file WBcel235.82 (release 82) wget -O references/Caenorhabditis_elegans.WBcel235.82.gtf ftp://ftp.ensembl.org/pub/release-82/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.82.gtf # Generate genome index STAR --runThreadN 8 \ --runMode genomeGenerate \ --genomeDir star_genome_index \ --genomeFastaFiles references/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz \ --sjdbGTFfile references/Caenorhabditis_elegans.WBcel235.82.gtf \ --sjdbOverhang 49 \ --genomeSAindexNbases 12 -
2
Reads from all six samples were then mapped onto the genome using: âSTAR --runThreadN 8 --genomeDir <path to genome index> --readFilesIn <list of sequence files> --outFileNamePrefix <Prefix for output files> --outFilterMismatchNmax 3 --outFilterMultimapNmax 10â.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables for clarity # Placeholder: Replace with actual path to your STAR genome index (e.g., for human hg38) GENOME_DIR="path/to/STAR_genome_index_hg38" # Placeholder: Replace with actual list of sequence files for all six samples. # Assuming paired-end reads for typical RNA-seq, adjust as needed for single-end. READ_FILES="sample1_R1.fastq.gz sample1_R2.fastq.gz \ sample2_R1.fastq.gz sample2_R2.fastq.gz \ sample3_R1.fastq.gz sample3_R2.fastq.gz \ sample4_R1.fastq.gz sample4_R2.fastq.gz \ sample5_R1.fastq.gz sample5_R2.fastq.gz \ sample6_R1.fastq.gz sample6_R2.fastq.gz" # Placeholder: Replace with desired prefix for output files OUTPUT_PREFIX="mapped_reads" # Execute STAR alignment STAR \ --runThreadN 8 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn ${READ_FILES} \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outFilterMismatchNmax 3 \ --outFilterMultimapNmax 10 -
3
Reads that uniquely mapped to annotated genes were counted by htseq-count using the command: âhtseq-count --stranded=no <.sam file> <.gtf gene annotation file> > <output file name>â.
$ Bash example
# Define input and output files SAM_FILE="input.sam" GTF_FILE="annotation.gtf" OUTPUT_FILE="gene_counts.txt" # Count reads uniquely mapped to annotated genes htseq-count --stranded=no "${SAM_FILE}" "${GTF_FILE}" > "${OUTPUT_FILE}" -
4
DESeq2 (Love et al., 2014) was used to perform differential analysis of gene expression between different groups.
$ Bash example
# Install R and Bioconductor if not already present # For example, using conda: # conda create -n deseq2_env r-base bioconductor-deseq2 r-readr r-dplyr # conda activate deseq2_env # Create dummy input files for demonstration # In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto) echo "gene_id,sample1,sample2,sample3,sample4" > counts.csv echo "geneA,100,120,50,60" >> counts.csv echo "geneB,20,25,80,90" >> counts.csv echo "geneC,500,550,200,220" >> counts.csv echo "sample,group" > coldata.csv echo "sample1,control" >> coldata.csv echo "sample2,control" >> coldata.csv echo "sample3,treatment" >> coldata.csv echo "sample4,treatment" >> coldata.csv # Create an R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R # Load necessary libraries library(DESeq2) library(readr) library(dplyr) # --- Configuration --- counts_file <- "counts.csv" coldata_file <- "coldata.csv" output_file <- "deseq2_results.csv" design_formula <- "~ group" # Example design formula # --- Load Data --- # Read count data counts_df <- read_csv(counts_file) # Assuming the first column is gene_id and subsequent columns are sample counts gene_ids <- counts_df[[1]] counts_matrix <- as.matrix(counts_df[,-1]) rownames(counts_matrix) <- gene_ids # Read colData (sample metadata) coldata_df <- read_csv(coldata_file) # Ensure sample names match column names in counts_matrix # And ensure order matches coldata_df <- coldata_df %>% filter(sample %in% colnames(counts_matrix)) %>% arrange(match(sample, colnames(counts_matrix))) rownames(coldata_df) <- coldata_df$sample # Ensure colData has factors for grouping variables coldata_df$group <- factor(coldata_df$group) # --- Create DESeqDataSet object --- dds <- DESeqDataSetFromMatrix(countData = counts_matrix, colData = coldata_df, design = as.formula(design_formula)) # --- Run DESeq2 analysis --- dds <- DESeq(dds) # --- Extract results --- # Example: comparing 'treatment' vs 'control' # The contrast should be adjusted based on the actual groups in your colData res <- results(dds, contrast=c("group", "treatment", "control")) # Order results by adjusted p-value res_ordered <- res[order(res$padj),] # Convert to data frame and write to CSV res_df <- as.data.frame(res_ordered) write_csv(res_df, output_file) message(paste("DESeq2 analysis complete. Results saved to", output_file)) EOF # Execute the R script Rscript run_deseq2.R -
5
First, a DESeq2 data set was created by: â(dds <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ genotype))â.
$ Bash example
# Install BiocManager and DESeq2 (if not already installed) # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("DESeq2")' # Create a dummy countdata.csv for demonstration purposes. # In a real pipeline, this would be generated by a previous step (e.g., featureCounts). # The first column represents gene IDs, and subsequent columns are raw counts for samples. cat <<EOF > countdata.csv gene1,100,120,110,200,210,220 gene2,50,60,55,100,110,105 gene3,200,210,205,400,410,405 gene4,10,12,11,20,22,21 gene5,30,35,32,60,65,62 gene6,70,75,72,140,145,142 EOF # Create a dummy coldata.csv for demonstration purposes. # In a real pipeline, this would be provided as sample metadata. # The first column represents sample IDs, and subsequent columns are metadata, including 'genotype'. cat <<EOF > coldata.csv sample,genotype sample1,WT sample2,WT sample3,WT sample4,KO sample5,KO sample6,KO EOF # R script to create the DESeq2 data set object. # This script reads the count and colData, then constructs the DESeqDataSet object. cat <<'EOF_R_SCRIPT' > deseq2_prep.R #!/usr/bin/env Rscript # Load DESeq2 library library(DESeq2) # Read count data from CSV. # Assumes the first column contains row names (gene IDs) and the header is present. countdata_raw <- read.csv("countdata.csv", row.names = 1, header = TRUE) # Convert the data frame of counts to a matrix, which is required by DESeqDataSetFromMatrix. countdata <- as.matrix(countdata_raw) # Read colData (sample metadata) from CSV. # Assumes the first column contains row names (sample IDs) and the header is present. coldata <- read.csv("coldata.csv", row.names = 1, header = TRUE) # Ensure colData sample names match countData column names and are in the same order. # This step is crucial for correct mapping of metadata to count data. coldata <- coldata[colnames(countdata), , drop=FALSE] # Ensure the 'genotype' column is treated as a factor, which is necessary for DESeq2 design formulas. coldata$genotype <- factor(coldata$genotype) # Create a DESeq2 data set object (dds). # This object encapsulates count data, sample metadata, and the experimental design formula. dds <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ genotype) # Save the dds object to an .rds file. # This allows the object to be loaded and used in subsequent DESeq2 analysis steps (e.g., normalization, differential expression). saveRDS(dds, file = "dds_object.rds") # Optional: Print a summary of the created dds object to the console. print(dds) EOF_R_SCRIPT # Execute the R script to perform the DESeq2 data set creation. Rscript deseq2_prep.R -
6
Second, the size factors (normalization for sequence depth) for each group were estimated by: âdds <- estimateSizeFactors(dds)â.
$ Bash example
# Install DESeq2 if not already installed # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")' # Define input and output file paths for the DESeqDataSet object # Replace with actual paths in a real pipeline INPUT_DDS_FILE="input_dds.rds" # Assuming dds object from previous step OUTPUT_DDS_FILE="dds_with_size_factors.rds" # Execute DESeq2 command to estimate size factors Rscript -e " library(DESeq2) # Load the DESeqDataSet object from the previous step if (!file.exists('${INPUT_DDS_FILE}')) { stop('Input DDS file not found: ${INPUT_DDS_FILE}') } dds <- readRDS('${INPUT_DDS_FILE}') # Estimate size factors dds <- estimateSizeFactors(dds) # Save the updated DESeqDataSet object saveRDS(dds, file='${OUTPUT_DDS_FILE}') " -
7
Differential analysis was performed by calling âdds <- DESeq(dds)â and pairwise comparison was performed at a false discovery rate of 0.05 by: âres_group1_group2_alpha.05 <- results(dds, contrast=c("genotype", "group1", "group2"), alpha = .05)â.
$ Bash example
# Install R and Bioconductor if not already present # conda install -c bioconda bioconductor-deseq2 # Or in R: # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("DESeq2") # Create an R script (e.g., deseq2_analysis.R) cat << 'EOF' > deseq2_analysis.R library(DESeq2) library(dplyr) # For data manipulation if needed for colData # --- Placeholder for creating/loading the DESeqDataSet object 'dds' --- # In a real pipeline, 'dds' would be created from a count matrix # and a sample metadata (colData) table from previous steps. # Replace this section with your actual data loading/creation. # Example: Create dummy count data and colData for illustration # This assumes you have a count matrix (e.g., from featureCounts) # and a metadata file describing your samples. # Dummy count matrix (replace with your actual counts) # Rows are genes, columns are samples set.seed(123) counts_data <- matrix(sample(1:1000, 100*6, replace=TRUE), ncol=6) rownames(counts_data) <- paste0("gene", 1:100) colnames(counts_data) <- c("sample1_group1", "sample2_group1", "sample3_group1", "sample4_group2", "sample5_group2", "sample6_group2") # Dummy sample metadata (colData) col_data <- data.frame( sample = colnames(counts_data), genotype = factor(c(rep("group1", 3), rep("group2", 3))), row.names = colnames(counts_data) ) # Ensure colData has the same sample names as count matrix columns stopifnot(all(colnames(counts_data) == rownames(col_data))) # Create DESeqDataSet object dds <- DESeqDataSetFromMatrix(countData = counts_data, colData = col_data, design = ~ genotype) message("DESeqDataSet object 'dds' created/loaded.") # --- End of Placeholder --- # Perform differential expression analysis message("Performing DESeq analysis...") dds <- DESeq(dds) message("DESeq analysis complete.") # Perform pairwise comparison for group1 vs group2 # The contrast specifies the factor ("genotype"), the numerator ("group1"), and the denominator ("group2") message("Performing pairwise comparison for group1 vs group2 at alpha = 0.05...") res_group1_group2_alpha.05 <- results(dds, contrast=c("genotype", "group1", "group2"), alpha = .05) message("Pairwise comparison complete.") # Save the results output_file <- "deseq2_results_group1_vs_group2_alpha05.csv" write.csv(as.data.frame(res_group1_group2_alpha.05), file=output_file, row.names=TRUE) message(paste("Results saved to:", output_file)) # Optionally save the dds object after analysis # saveRDS(dds, file="dds_after_deseq.rds") EOF # Execute the R script Rscript deseq2_analysis.R
Raw Source Text
A genome index was first generated using the Ensembl C. elegans genome WBcel235 and the Ensembl gene annotation file (version WBcel235.82) by the command: âSTAR --runThreadN 8 --runMode genomeGenerate --genomeDir <output folder> --genomeFastaFiles <path to genome fasta> --sjdbGTFfile <path to .gtf gene annotation file> --sjdbOverhang 49 --genomeSAindexNbases 12â.
Reads from all six samples were then mapped onto the genome using: âSTAR --runThreadN 8 --genomeDir <path to genome index> --readFilesIn <list of sequence files> --outFileNamePrefix <Prefix for output files> --outFilterMismatchNmax 3 --outFilterMultimapNmax 10â.
Reads that uniquely mapped to annotated genes were counted by htseq-count using the command: âhtseq-count --stranded=no <.sam file> <.gtf gene annotation file> > <output file name>â.
DESeq2 (Love et al., 2014) was used to perform differential analysis of gene expression between different groups. First, a DESeq2 data set was created by: â(dds <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ genotype))â. Second, the size factors (normalization for sequence depth) for each group were estimated by: âdds <- estimateSizeFactors(dds)â.
Differential analysis was performed by calling âdds <- DESeq(dds)â and pairwise comparison was performed at a false discovery rate of 0.05 by: âres_group1_group2_alpha.05 <- results(dds, contrast=c("genotype", "group1", "group2"), alpha = .05)â.
Genome_build: WBcel235
Supplementary_files_format_and_content: csv file that contains counts, normalized expression (rpkm), log2FoldChange(SIN/WT), adjusted p-values as well as useful published information for each gene annotation and each of the six samples