GSE195493 Processing Pipeline
RIP-Seq
code_examples
5 steps
Publication
Nuclear and cytoplasmic poly(A) binding proteins (PABPs) favor distinct transcripts and isoforms.Nucleic acids research (2022) — PMID 35438785
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
All reads were aligned to the human genome hg38 primary assembly using STAR.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables # Path to pre-built STAR index for hg38 primary assembly. # This index should be built using the hg38 primary assembly FASTA and GTF files. GENOME_DIR="/path/to/STAR_index/hg38_primary_assembly" READS_R1="sample_R1.fastq.gz" # Placeholder for forward reads READS_R2="sample_R2.fastq.gz" # Placeholder for reverse reads (remove if single-end) OUTPUT_PREFIX="aligned_reads" # Prefix for output files THREADS=8 # Number of threads to use # Align reads to the human genome hg38 primary assembly using STAR STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS_R1}" "${READS_R2}" \ --readFilesCommand zcat \ --runThreadN "${THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}_" \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outSAMattributes Standard -
2
featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34
$ Bash example
# Install featureCounts (part of the Subread package, if not already installed) # conda install -c bioconda subread # Define input and reference files INPUT_BAM="aligned_reads.bam" # Placeholder for your aligned reads BAM file CUSTOM_GTF="custom_gencode_v34.gtf" # Placeholder for your custom GTF file derived from gencode v34 OUTPUT_COUNTS="featureCounts_output_counts.txt" # Desired output count file # Execute featureCounts to annotate reads # -a: Annotation file (GTF/GFF) # -o: Output file # -F GTF: Specify annotation file format as GTF # -t exon: Specify feature type to count (e.g., 'exon') # -g gene_id: Specify attribute type to group features (e.g., 'gene_id') # --minOverlap 20: Minimum overlap required between a read and a feature featureCounts -a "${CUSTOM_GTF}" -o "${OUTPUT_COUNTS}" -F GTF -t exon -g gene_id --minOverlap 20 "${INPUT_BAM}" -
3
Differential expression was calculated using DESeq2.
$ Bash example
# Install R (if not already installed) # For Ubuntu/Debian: # sudo apt update # sudo apt install r-base # # Install Bioconductor and DESeq2 package within R # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("DESeq2")' # R -e 'install.packages("readr")' # For read_tsv function # Create dummy input files for demonstration purposes. # In a real pipeline, these files would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto). # counts.tsv: Tab-separated file with gene IDs as rows and sample IDs as columns, containing raw counts. # sample_metadata.tsv: Tab-separated file with sample IDs and experimental conditions. # Example dummy counts.tsv echo -e "gene_id\tsample1\tsample2\tsample3\tsample4" > counts.tsv echo -e "geneA\t100\t120\t50\t60" >> counts.tsv echo -e "geneB\t200\t210\t100\t110" >> counts.tsv echo -e "geneC\t50\t60\t150\t160" >> counts.tsv echo -e "geneD\t10\t12\t8\t9" >> counts.tsv # Example dummy sample_metadata.tsv echo -e "sample_id\tcondition" > sample_metadata.tsv echo -e "sample1\ttreated" >> sample_metadata.tsv echo -e "sample2\ttreated" >> sample_metadata.tsv echo -e "sample3\tuntreated" >> sample_metadata.tsv echo -e "sample4\tuntreated" >> sample_metadata.tsv # Save the R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R #!/usr/bin/env Rscript # Load necessary packages library(DESeq2) library(readr) # For read_tsv # --- Configuration --- counts_file <- "counts.tsv" metadata_file <- "sample_metadata.tsv" design_formula <- "~ condition" # Example design formula. Adjust based on your experimental design. output_prefix <- "deseq2_results" # --- Load Data --- # Read counts data counts_data <- read_tsv(counts_file, col_names = TRUE) # Assuming the first column is gene ID, and subsequent columns are sample counts gene_ids <- counts_data[[1]] counts_matrix <- as.matrix(counts_data[,-1]) rownames(counts_matrix) <- gene_ids # Read sample metadata sample_metadata <- read_tsv(metadata_file, col_names = TRUE) # Ensure sample names match between counts and metadata # Assuming sample_metadata has a column named 'sample_id' that matches column names in counts_matrix sample_metadata <- as.data.frame(sample_metadata) # DESeq2 expects data.frame rownames(sample_metadata) <- sample_metadata$sample_id # Or whatever column holds sample IDs sample_metadata <- sample_metadata[colnames(counts_matrix), ] # Order metadata to match counts # --- Create DESeqDataSet --- dds <- DESeqDataSetFromMatrix(countData = counts_matrix, colData = sample_metadata, design = as.formula(design_formula)) # --- Run DESeq2 Analysis --- dds <- DESeq(dds) # --- Extract Results --- # Get results for a specific contrast (e.g., condition treated vs untreated) # Adjust the contrast based on your actual data and design levels. # Example: res <- results(dds, contrast=c("condition", "treated", "untreated")) # If you have multiple factors, you might specify 'name' argument, e.g., results(dds, name="condition_treated_vs_untreated") res <- results(dds) # Order results by adjusted p-value res_ordered <- res[order(res$padj),] # Write results to CSV write.csv(as.data.frame(res_ordered), paste0(output_prefix, "_differential_expression.csv")) # Optional: Save normalized counts normalized_counts <- counts(dds, normalized=TRUE) write.csv(as.data.frame(normalized_counts), paste0(output_prefix, "_normalized_counts.csv")) # Optional: Save DESeq2 object for later use saveRDS(dds, paste0(output_prefix, "_dds_object.rds")) EOF # Make the R script executable chmod +x run_deseq2.R # Execute the R script ./run_deseq2.R -
4
Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01.
$ Bash example
# Install R and Bioconductor if not already present # conda install -c conda-forge r-base # conda install -c bioconda bioconductor-deseq2 # Create an R script to perform DESeq2 analysis and filter results cat << 'EOF' > run_deseq2_and_filter.R # Load DESeq2 library library(DESeq2) # --- Placeholder for DESeq2 analysis setup --- # In a real scenario, you would load your count matrix and sample metadata. # Example: Assuming 'counts_matrix.tsv' contains raw counts and 'sample_metadata.tsv' has sample info. # counts_data <- as.matrix(read.delim("counts_matrix.tsv", row.names = 1)) # sample_info <- read.delim("sample_metadata.tsv", row.names = 1) # For demonstration, creating dummy data: set.seed(123) counts_data <- matrix(rnbinom(n=2000, size=10, mu=100), ncol=10) rownames(counts_data) <- paste0("gene", 1:200) colnames(counts_data) <- paste0("sample", 1:10) sample_info <- data.frame( sample = colnames(counts_data), condition = factor(rep(c("control", "treated"), each=5)) ) rownames(sample_info) <- sample_info$sample # Ensure sample order matches between counts and metadata sample_info <- sample_info[colnames(counts_data), , drop = FALSE] # Create DESeqDataSet object dds <- DESeqDataSetFromMatrix(countData = counts_data, colData = sample_info, design = ~ condition) # Run DESeq2 analysis dds <- DESeq(dds) # Get results for the contrast of interest (e.g., 'treated' vs 'control') res <- results(dds, contrast=c("condition", "treated", "control")) # --- Filtering step based on description --- # Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01. # Filter results based on specified cut-offs significant_genes <- subset(res, baseMean > 50 & padj <= 0.01) # Order significant genes by adjusted p-value significant_genes <- significant_genes[order(significant_genes$padj),] # Save significant genes to a CSV file write.csv(as.data.frame(significant_genes), "significant_differential_expression.csv", row.names = TRUE) # Optionally, save all DESeq2 results write.csv(as.data.frame(res), "deseq2_all_results.csv", row.names = TRUE) message("DESeq2 analysis and filtering complete. Results saved to significant_differential_expression.csv and deseq2_all_results.csv") EOF # Execute the R script Rscript run_deseq2_and_filter.R -
5
In this comparison, a negative log2FoldChange indicates that a transcript was more nuclear, while a positive log2FoldChange indicates that a transcript was more cytoplasmic.
DESeq2 (Inferred with models/gemini-2.5-flash) v1.38.3 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Create dummy count data and sample metadata for demonstration purposes. # In a real pipeline, these files would be generated by upstream steps (e.g., featureCounts, Salmon) # based on alignment to a reference genome like hg38 (placeholder for reference dataset). # Example: Create a dummy counts.tsv cat <<EOF > counts.tsv gene_id nuclear_rep1 nuclear_rep2 nuclear_rep3 cytoplasmic_rep1 cytoplasmic_rep2 cytoplasmic_rep3 gene1 200 220 190 100 110 95 gene2 50 55 48 120 130 115 gene3 150 145 155 150 145 155 gene4 300 310 290 150 160 145 gene5 80 85 78 180 190 175 EOF # Example: Create a dummy sample_metadata.tsv cat <<EOF > sample_metadata.tsv sample_id condition nuclear_rep1 nuclear nuclear_rep2 nuclear nuclear_rep3 nuclear cytoplasmic_rep1 cytoplasmic cytoplasmic_rep2 cytoplasmic cytoplasmic_rep3 cytoplasmic EOF # Install R and DESeq2 if not already installed # For Ubuntu/Debian: # sudo apt-get update # sudo apt-get install r-base # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")' # Create the R script for DESeq2 analysis cat <<'EOF' > run_deseq2.R # run_deseq2.R # This script performs differential localization analysis using DESeq2 # based on nuclear and cytoplasmic count data. library(DESeq2) library(dplyr) # For data manipulation library(tibble) # For rownames_to_column # --- Configuration --- count_file <- "counts.tsv" sample_info_file <- "sample_metadata.tsv" output_file <- "differential_localization_results.tsv" # Read count data count_data <- read.delim(count_file, row.names = 1, check.names = FALSE) count_data <- round(count_data) # Ensure counts are integers # Read sample metadata sample_info <- read.delim(sample_info_file, row.names = 1) # Ensure sample names match and are in the same order sample_info <- sample_info[colnames(count_data), , drop = FALSE] if (!all(rownames(sample_info) == colnames(count_data))) { stop("Sample names in count data and metadata do not match or are not in the same order.") } # Set 'nuclear' as the reference level for the 'condition' factor. # This ensures that when comparing 'cytoplasmic' vs 'nuclear', # a positive log2FoldChange indicates higher expression in 'cytoplasmic'. sample_info$condition <- factor(sample_info$condition, levels = c("nuclear", "cytoplasmic")) # Create DESeqDataSet object dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = ~ condition) # Run DESeq analysis dds <- DESeq(dds) # Get results for cytoplasmic vs nuclear. # According to the description: # - A positive log2FoldChange indicates that a transcript was more cytoplasmic. # (i.e., log2(cytoplasmic / nuclear) > 0, meaning cytoplasmic > nuclear) # - A negative log2FoldChange indicates that a transcript was more nuclear. # (i.e., log2(cytoplasmic / nuclear) < 0, meaning nuclear > cytoplasmic) # This contrast directly aligns with the desired interpretation. res <- results(dds, contrast=c("condition", "cytoplasmic", "nuclear")) # Convert to data frame and save res_df <- as.data.frame(res) %>% tibble::rownames_to_column("gene_id") write.table(res_df, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE) message(paste("Differential localization results saved to:", output_file)) EOF # Execute the R script Rscript run_deseq2.R # Clean up the R script and dummy data (optional) # rm run_deseq2.R counts.tsv sample_metadata.tsv
Raw Source Text
All reads were aligned to the human genome hg38 primary assembly using STAR. featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34 Differential expression was calculated using DESeq2. Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01. In this comparison, a negative log2FoldChange indicates that a transcript was more nuclear, while a positive log2FoldChange indicates that a transcript was more cytoplasmic. Genome_build: hg38 Supplementary_files_format_and_content: comma separated files with output of DESeq2; normalized abudnance measurements for all genes when comparing the nuclear and cytoplasmic fractions. In this comparison, a negative log2FoldChange indicates that a transcript was more nuclear, while a positive log2FoldChange indicates that a transcript was more cytoplasmic