GSE195493 Processing Pipeline

RIP-Seq code_examples 5 steps

Publication

Nuclear and cytoplasmic poly(A) binding proteins (PABPs) favor distinct transcripts and isoforms.

Nucleic acids research (2022) — PMID 35438785

Dataset

GSE195493

Cytoplasmic and nuclear fractionation from HEK293T cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

All reads were aligned to the human genome hg38 primary assembly using STAR.

STAR v2.7.9a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Path to pre-built STAR index for hg38 primary assembly. 
# This index should be built using the hg38 primary assembly FASTA and GTF files.
GENOME_DIR="/path/to/STAR_index/hg38_primary_assembly"
READS_R1="sample_R1.fastq.gz" # Placeholder for forward reads
READS_R2="sample_R2.fastq.gz" # Placeholder for reverse reads (remove if single-end)
OUTPUT_PREFIX="aligned_reads" # Prefix for output files
THREADS=8 # Number of threads to use

# Align reads to the human genome hg38 primary assembly using STAR
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --readFilesCommand zcat \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard

View on GitHub

featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34

featureCounts v2.0.2 GitHub

$ Bash example

# Install featureCounts (part of the Subread package, if not already installed)
# conda install -c bioconda subread

# Define input and reference files
INPUT_BAM="aligned_reads.bam" # Placeholder for your aligned reads BAM file
CUSTOM_GTF="custom_gencode_v34.gtf" # Placeholder for your custom GTF file derived from gencode v34
OUTPUT_COUNTS="featureCounts_output_counts.txt" # Desired output count file

# Execute featureCounts to annotate reads
# -a: Annotation file (GTF/GFF)
# -o: Output file
# -F GTF: Specify annotation file format as GTF
# -t exon: Specify feature type to count (e.g., 'exon')
# -g gene_id: Specify attribute type to group features (e.g., 'gene_id')
# --minOverlap 20: Minimum overlap required between a read and a feature
featureCounts -a "${CUSTOM_GTF}" -o "${OUTPUT_COUNTS}" -F GTF -t exon -g gene_id --minOverlap 20 "${INPUT_BAM}"

View on GitHub

Differential expression was calculated using DESeq2.

DESeq2 v1.42.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R (if not already installed)
# For Ubuntu/Debian:
# sudo apt update
# sudo apt install r-base
#
# Install Bioconductor and DESeq2 package within R
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'
# R -e 'install.packages("readr")' # For read_tsv function

# Create dummy input files for demonstration purposes.
# In a real pipeline, these files would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto).
# counts.tsv: Tab-separated file with gene IDs as rows and sample IDs as columns, containing raw counts.
# sample_metadata.tsv: Tab-separated file with sample IDs and experimental conditions.

# Example dummy counts.tsv
echo -e "gene_id\tsample1\tsample2\tsample3\tsample4" > counts.tsv
echo -e "geneA\t100\t120\t50\t60" >> counts.tsv
echo -e "geneB\t200\t210\t100\t110" >> counts.tsv
echo -e "geneC\t50\t60\t150\t160" >> counts.tsv
echo -e "geneD\t10\t12\t8\t9" >> counts.tsv

# Example dummy sample_metadata.tsv
echo -e "sample_id\tcondition" > sample_metadata.tsv
echo -e "sample1\ttreated" >> sample_metadata.tsv
echo -e "sample2\ttreated" >> sample_metadata.tsv
echo -e "sample3\tuntreated" >> sample_metadata.tsv
echo -e "sample4\tuntreated" >> sample_metadata.tsv

# Save the R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
#!/usr/bin/env Rscript

# Load necessary packages
library(DESeq2)
library(readr) # For read_tsv

# --- Configuration ---
counts_file <- "counts.tsv"
metadata_file <- "sample_metadata.tsv"
design_formula <- "~ condition" # Example design formula. Adjust based on your experimental design.
output_prefix <- "deseq2_results"

# --- Load Data ---
# Read counts data
counts_data <- read_tsv(counts_file, col_names = TRUE)
# Assuming the first column is gene ID, and subsequent columns are sample counts
gene_ids <- counts_data[[1]]
counts_matrix <- as.matrix(counts_data[,-1])
rownames(counts_matrix) <- gene_ids

# Read sample metadata
sample_metadata <- read_tsv(metadata_file, col_names = TRUE)
# Ensure sample names match between counts and metadata
# Assuming sample_metadata has a column named 'sample_id' that matches column names in counts_matrix
sample_metadata <- as.data.frame(sample_metadata) # DESeq2 expects data.frame
rownames(sample_metadata) <- sample_metadata$sample_id # Or whatever column holds sample IDs
sample_metadata <- sample_metadata[colnames(counts_matrix), ] # Order metadata to match counts

# --- Create DESeqDataSet ---
dds <- DESeqDataSetFromMatrix(countData = counts_matrix,
                              colData = sample_metadata,
                              design = as.formula(design_formula))

# --- Run DESeq2 Analysis ---
dds <- DESeq(dds)

# --- Extract Results ---
# Get results for a specific contrast (e.g., condition treated vs untreated)
# Adjust the contrast based on your actual data and design levels.
# Example: res <- results(dds, contrast=c("condition", "treated", "untreated"))
# If you have multiple factors, you might specify 'name' argument, e.g., results(dds, name="condition_treated_vs_untreated")
res <- results(dds)

# Order results by adjusted p-value
res_ordered <- res[order(res$padj),]

# Write results to CSV
write.csv(as.data.frame(res_ordered), paste0(output_prefix, "_differential_expression.csv"))

# Optional: Save normalized counts
normalized_counts <- counts(dds, normalized=TRUE)
write.csv(as.data.frame(normalized_counts), paste0(output_prefix, "_normalized_counts.csv"))

# Optional: Save DESeq2 object for later use
saveRDS(dds, paste0(output_prefix, "_dds_object.rds"))
EOF

# Make the R script executable
chmod +x run_deseq2.R

# Execute the R script
./run_deseq2.R

View on GitHub

Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01.

DESeq2 (Inferred with models/gemini-2.5-flash) v1.40.0 GitHub

$ Bash example

# Install R and Bioconductor if not already present
# conda install -c conda-forge r-base
# conda install -c bioconda bioconductor-deseq2

# Create an R script to perform DESeq2 analysis and filter results
cat << 'EOF' > run_deseq2_and_filter.R
# Load DESeq2 library
library(DESeq2)

# --- Placeholder for DESeq2 analysis setup ---
# In a real scenario, you would load your count matrix and sample metadata.
# Example: Assuming 'counts_matrix.tsv' contains raw counts and 'sample_metadata.tsv' has sample info.
# counts_data <- as.matrix(read.delim("counts_matrix.tsv", row.names = 1))
# sample_info <- read.delim("sample_metadata.tsv", row.names = 1)

# For demonstration, creating dummy data:
set.seed(123)
counts_data <- matrix(rnbinom(n=2000, size=10, mu=100), ncol=10)
rownames(counts_data) <- paste0("gene", 1:200)
colnames(counts_data) <- paste0("sample", 1:10)

sample_info <- data.frame(
  sample = colnames(counts_data),
  condition = factor(rep(c("control", "treated"), each=5))
)
rownames(sample_info) <- sample_info$sample

# Ensure sample order matches between counts and metadata
sample_info <- sample_info[colnames(counts_data), , drop = FALSE]

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = counts_data,
                              colData = sample_info,
                              design = ~ condition)

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results for the contrast of interest (e.g., 'treated' vs 'control')
res <- results(dds, contrast=c("condition", "treated", "control"))

# --- Filtering step based on description ---
# Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01.

# Filter results based on specified cut-offs
significant_genes <- subset(res, baseMean > 50 & padj <= 0.01)

# Order significant genes by adjusted p-value
significant_genes <- significant_genes[order(significant_genes$padj),]

# Save significant genes to a CSV file
write.csv(as.data.frame(significant_genes), "significant_differential_expression.csv", row.names = TRUE)

# Optionally, save all DESeq2 results
write.csv(as.data.frame(res), "deseq2_all_results.csv", row.names = TRUE)

message("DESeq2 analysis and filtering complete. Results saved to significant_differential_expression.csv and deseq2_all_results.csv")
EOF

# Execute the R script
Rscript run_deseq2_and_filter.R

View on GitHub

In this comparison, a negative log2FoldChange indicates that a transcript was more nuclear, while a positive log2FoldChange indicates that a transcript was more cytoplasmic.

DESeq2 (Inferred with models/gemini-2.5-flash) v1.38.3 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Create dummy count data and sample metadata for demonstration purposes.
# In a real pipeline, these files would be generated by upstream steps (e.g., featureCounts, Salmon)
# based on alignment to a reference genome like hg38 (placeholder for reference dataset).

# Example: Create a dummy counts.tsv
cat <<EOF > counts.tsv
gene_id	nuclear_rep1	nuclear_rep2	nuclear_rep3	cytoplasmic_rep1	cytoplasmic_rep2	cytoplasmic_rep3
gene1	200	220	190	100	110	95
gene2	50	55	48	120	130	115
gene3	150	145	155	150	145	155
gene4	300	310	290	150	160	145
gene5	80	85	78	180	190	175
EOF

# Example: Create a dummy sample_metadata.tsv
cat <<EOF > sample_metadata.tsv
sample_id	condition
nuclear_rep1	nuclear
nuclear_rep2	nuclear
nuclear_rep3	nuclear
cytoplasmic_rep1	cytoplasmic
cytoplasmic_rep2	cytoplasmic
cytoplasmic_rep3	cytoplasmic
EOF

# Install R and DESeq2 if not already installed
# For Ubuntu/Debian:
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'

# Create the R script for DESeq2 analysis
cat <<'EOF' > run_deseq2.R
# run_deseq2.R
# This script performs differential localization analysis using DESeq2
# based on nuclear and cytoplasmic count data.

library(DESeq2)
library(dplyr) # For data manipulation
library(tibble) # For rownames_to_column

# --- Configuration ---
count_file <- "counts.tsv"
sample_info_file <- "sample_metadata.tsv"
output_file <- "differential_localization_results.tsv"

# Read count data
count_data <- read.delim(count_file, row.names = 1, check.names = FALSE)
count_data <- round(count_data) # Ensure counts are integers

# Read sample metadata
sample_info <- read.delim(sample_info_file, row.names = 1)

# Ensure sample names match and are in the same order
sample_info <- sample_info[colnames(count_data), , drop = FALSE]
if (!all(rownames(sample_info) == colnames(count_data))) {
    stop("Sample names in count data and metadata do not match or are not in the same order.")
}

# Set 'nuclear' as the reference level for the 'condition' factor.
# This ensures that when comparing 'cytoplasmic' vs 'nuclear',
# a positive log2FoldChange indicates higher expression in 'cytoplasmic'.
sample_info$condition <- factor(sample_info$condition, levels = c("nuclear", "cytoplasmic"))

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = ~ condition)

# Run DESeq analysis
dds <- DESeq(dds)

# Get results for cytoplasmic vs nuclear.
# According to the description:
# - A positive log2FoldChange indicates that a transcript was more cytoplasmic.
#   (i.e., log2(cytoplasmic / nuclear) > 0, meaning cytoplasmic > nuclear)
# - A negative log2FoldChange indicates that a transcript was more nuclear.
#   (i.e., log2(cytoplasmic / nuclear) < 0, meaning nuclear > cytoplasmic)
# This contrast directly aligns with the desired interpretation.
res <- results(dds, contrast=c("condition", "cytoplasmic", "nuclear"))

# Convert to data frame and save
res_df <- as.data.frame(res) %>%
    tibble::rownames_to_column("gene_id")

write.table(res_df, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE)

message(paste("Differential localization results saved to:", output_file))
EOF

# Execute the R script
Rscript run_deseq2.R

# Clean up the R script and dummy data (optional)
# rm run_deseq2.R counts.tsv sample_metadata.tsv

View on GitHub

Tools Used

STAR DESeq2

Raw Source Text

All reads were aligned to the human genome hg38 primary assembly using STAR.
featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34
Differential expression was calculated using DESeq2. Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01. In this comparison, a negative log2FoldChange indicates that a transcript was more nuclear, while a positive log2FoldChange indicates that a transcript was more cytoplasmic.
Genome_build: hg38
Supplementary_files_format_and_content: comma separated files with output of DESeq2; normalized abudnance measurements for all genes when comparing the nuclear and cytoplasmic fractions. In this comparison, a negative log2FoldChange indicates that a transcript was more nuclear, while a positive log2FoldChange indicates that a transcript was more cytoplasmic

← Back to Analysis