GSE195493 Processing Pipeline

RIP-Seq code_examples 5 steps

Publication

Nuclear and cytoplasmic poly(A) binding proteins (PABPs) favor distinct transcripts and isoforms.

Nucleic acids research (2022) — PMID 35438785

Dataset

GSE195493

Cytoplasmic and nuclear fractionation from HEK293T cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    All reads were aligned to the human genome hg38 primary assembly using STAR.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    # Path to pre-built STAR index for hg38 primary assembly. 
    # This index should be built using the hg38 primary assembly FASTA and GTF files.
    GENOME_DIR="/path/to/STAR_index/hg38_primary_assembly"
    READS_R1="sample_R1.fastq.gz" # Placeholder for forward reads
    READS_R2="sample_R2.fastq.gz" # Placeholder for reverse reads (remove if single-end)
    OUTPUT_PREFIX="aligned_reads" # Prefix for output files
    THREADS=8 # Number of threads to use
    
    # Align reads to the human genome hg38 primary assembly using STAR
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READS_R1}" "${READS_R2}" \
         --readFilesCommand zcat \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outSAMattributes Standard
  2. 2

    featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34

    featureCounts v2.0.2 GitHub
    $ Bash example
    # Install featureCounts (part of the Subread package, if not already installed)
    # conda install -c bioconda subread
    
    # Define input and reference files
    INPUT_BAM="aligned_reads.bam" # Placeholder for your aligned reads BAM file
    CUSTOM_GTF="custom_gencode_v34.gtf" # Placeholder for your custom GTF file derived from gencode v34
    OUTPUT_COUNTS="featureCounts_output_counts.txt" # Desired output count file
    
    # Execute featureCounts to annotate reads
    # -a: Annotation file (GTF/GFF)
    # -o: Output file
    # -F GTF: Specify annotation file format as GTF
    # -t exon: Specify feature type to count (e.g., 'exon')
    # -g gene_id: Specify attribute type to group features (e.g., 'gene_id')
    # --minOverlap 20: Minimum overlap required between a read and a feature
    featureCounts -a "${CUSTOM_GTF}" -o "${OUTPUT_COUNTS}" -F GTF -t exon -g gene_id --minOverlap 20 "${INPUT_BAM}"
  3. 3

    Differential expression was calculated using DESeq2.

    DESeq2 v1.42.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install R (if not already installed)
    # For Ubuntu/Debian:
    # sudo apt update
    # sudo apt install r-base
    #
    # Install Bioconductor and DESeq2 package within R
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # R -e 'BiocManager::install("DESeq2")'
    # R -e 'install.packages("readr")' # For read_tsv function
    
    # Create dummy input files for demonstration purposes.
    # In a real pipeline, these files would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto).
    # counts.tsv: Tab-separated file with gene IDs as rows and sample IDs as columns, containing raw counts.
    # sample_metadata.tsv: Tab-separated file with sample IDs and experimental conditions.
    
    # Example dummy counts.tsv
    echo -e "gene_id\tsample1\tsample2\tsample3\tsample4" > counts.tsv
    echo -e "geneA\t100\t120\t50\t60" >> counts.tsv
    echo -e "geneB\t200\t210\t100\t110" >> counts.tsv
    echo -e "geneC\t50\t60\t150\t160" >> counts.tsv
    echo -e "geneD\t10\t12\t8\t9" >> counts.tsv
    
    # Example dummy sample_metadata.tsv
    echo -e "sample_id\tcondition" > sample_metadata.tsv
    echo -e "sample1\ttreated" >> sample_metadata.tsv
    echo -e "sample2\ttreated" >> sample_metadata.tsv
    echo -e "sample3\tuntreated" >> sample_metadata.tsv
    echo -e "sample4\tuntreated" >> sample_metadata.tsv
    
    # Save the R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    #!/usr/bin/env Rscript
    
    # Load necessary packages
    library(DESeq2)
    library(readr) # For read_tsv
    
    # --- Configuration ---
    counts_file <- "counts.tsv"
    metadata_file <- "sample_metadata.tsv"
    design_formula <- "~ condition" # Example design formula. Adjust based on your experimental design.
    output_prefix <- "deseq2_results"
    
    # --- Load Data ---
    # Read counts data
    counts_data <- read_tsv(counts_file, col_names = TRUE)
    # Assuming the first column is gene ID, and subsequent columns are sample counts
    gene_ids <- counts_data[[1]]
    counts_matrix <- as.matrix(counts_data[,-1])
    rownames(counts_matrix) <- gene_ids
    
    # Read sample metadata
    sample_metadata <- read_tsv(metadata_file, col_names = TRUE)
    # Ensure sample names match between counts and metadata
    # Assuming sample_metadata has a column named 'sample_id' that matches column names in counts_matrix
    sample_metadata <- as.data.frame(sample_metadata) # DESeq2 expects data.frame
    rownames(sample_metadata) <- sample_metadata$sample_id # Or whatever column holds sample IDs
    sample_metadata <- sample_metadata[colnames(counts_matrix), ] # Order metadata to match counts
    
    # --- Create DESeqDataSet ---
    dds <- DESeqDataSetFromMatrix(countData = counts_matrix,
                                  colData = sample_metadata,
                                  design = as.formula(design_formula))
    
    # --- Run DESeq2 Analysis ---
    dds <- DESeq(dds)
    
    # --- Extract Results ---
    # Get results for a specific contrast (e.g., condition treated vs untreated)
    # Adjust the contrast based on your actual data and design levels.
    # Example: res <- results(dds, contrast=c("condition", "treated", "untreated"))
    # If you have multiple factors, you might specify 'name' argument, e.g., results(dds, name="condition_treated_vs_untreated")
    res <- results(dds)
    
    # Order results by adjusted p-value
    res_ordered <- res[order(res$padj),]
    
    # Write results to CSV
    write.csv(as.data.frame(res_ordered), paste0(output_prefix, "_differential_expression.csv"))
    
    # Optional: Save normalized counts
    normalized_counts <- counts(dds, normalized=TRUE)
    write.csv(as.data.frame(normalized_counts), paste0(output_prefix, "_normalized_counts.csv"))
    
    # Optional: Save DESeq2 object for later use
    saveRDS(dds, paste0(output_prefix, "_dds_object.rds"))
    EOF
    
    # Make the R script executable
    chmod +x run_deseq2.R
    
    # Execute the R script
    ./run_deseq2.R
  4. 4

    Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01.

    DESeq2 (Inferred with models/gemini-2.5-flash) v1.40.0 GitHub
    $ Bash example
    # Install R and Bioconductor if not already present
    # conda install -c conda-forge r-base
    # conda install -c bioconda bioconductor-deseq2
    
    # Create an R script to perform DESeq2 analysis and filter results
    cat << 'EOF' > run_deseq2_and_filter.R
    # Load DESeq2 library
    library(DESeq2)
    
    # --- Placeholder for DESeq2 analysis setup ---
    # In a real scenario, you would load your count matrix and sample metadata.
    # Example: Assuming 'counts_matrix.tsv' contains raw counts and 'sample_metadata.tsv' has sample info.
    # counts_data <- as.matrix(read.delim("counts_matrix.tsv", row.names = 1))
    # sample_info <- read.delim("sample_metadata.tsv", row.names = 1)
    
    # For demonstration, creating dummy data:
    set.seed(123)
    counts_data <- matrix(rnbinom(n=2000, size=10, mu=100), ncol=10)
    rownames(counts_data) <- paste0("gene", 1:200)
    colnames(counts_data) <- paste0("sample", 1:10)
    
    sample_info <- data.frame(
      sample = colnames(counts_data),
      condition = factor(rep(c("control", "treated"), each=5))
    )
    rownames(sample_info) <- sample_info$sample
    
    # Ensure sample order matches between counts and metadata
    sample_info <- sample_info[colnames(counts_data), , drop = FALSE]
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromMatrix(countData = counts_data,
                                  colData = sample_info,
                                  design = ~ condition)
    
    # Run DESeq2 analysis
    dds <- DESeq(dds)
    
    # Get results for the contrast of interest (e.g., 'treated' vs 'control')
    res <- results(dds, contrast=c("condition", "treated", "control"))
    
    # --- Filtering step based on description ---
    # Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01.
    
    # Filter results based on specified cut-offs
    significant_genes <- subset(res, baseMean > 50 & padj <= 0.01)
    
    # Order significant genes by adjusted p-value
    significant_genes <- significant_genes[order(significant_genes$padj),]
    
    # Save significant genes to a CSV file
    write.csv(as.data.frame(significant_genes), "significant_differential_expression.csv", row.names = TRUE)
    
    # Optionally, save all DESeq2 results
    write.csv(as.data.frame(res), "deseq2_all_results.csv", row.names = TRUE)
    
    message("DESeq2 analysis and filtering complete. Results saved to significant_differential_expression.csv and deseq2_all_results.csv")
    EOF
    
    # Execute the R script
    Rscript run_deseq2_and_filter.R
  5. 5

    In this comparison, a negative log2FoldChange indicates that a transcript was more nuclear, while a positive log2FoldChange indicates that a transcript was more cytoplasmic.

    DESeq2 (Inferred with models/gemini-2.5-flash) v1.38.3 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Create dummy count data and sample metadata for demonstration purposes.
    # In a real pipeline, these files would be generated by upstream steps (e.g., featureCounts, Salmon)
    # based on alignment to a reference genome like hg38 (placeholder for reference dataset).
    
    # Example: Create a dummy counts.tsv
    cat <<EOF > counts.tsv
    gene_id	nuclear_rep1	nuclear_rep2	nuclear_rep3	cytoplasmic_rep1	cytoplasmic_rep2	cytoplasmic_rep3
    gene1	200	220	190	100	110	95
    gene2	50	55	48	120	130	115
    gene3	150	145	155	150	145	155
    gene4	300	310	290	150	160	145
    gene5	80	85	78	180	190	175
    EOF
    
    # Example: Create a dummy sample_metadata.tsv
    cat <<EOF > sample_metadata.tsv
    sample_id	condition
    nuclear_rep1	nuclear
    nuclear_rep2	nuclear
    nuclear_rep3	nuclear
    cytoplasmic_rep1	cytoplasmic
    cytoplasmic_rep2	cytoplasmic
    cytoplasmic_rep3	cytoplasmic
    EOF
    
    # Install R and DESeq2 if not already installed
    # For Ubuntu/Debian:
    # sudo apt-get update
    # sudo apt-get install r-base
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'
    
    # Create the R script for DESeq2 analysis
    cat <<'EOF' > run_deseq2.R
    # run_deseq2.R
    # This script performs differential localization analysis using DESeq2
    # based on nuclear and cytoplasmic count data.
    
    library(DESeq2)
    library(dplyr) # For data manipulation
    library(tibble) # For rownames_to_column
    
    # --- Configuration ---
    count_file <- "counts.tsv"
    sample_info_file <- "sample_metadata.tsv"
    output_file <- "differential_localization_results.tsv"
    
    # Read count data
    count_data <- read.delim(count_file, row.names = 1, check.names = FALSE)
    count_data <- round(count_data) # Ensure counts are integers
    
    # Read sample metadata
    sample_info <- read.delim(sample_info_file, row.names = 1)
    
    # Ensure sample names match and are in the same order
    sample_info <- sample_info[colnames(count_data), , drop = FALSE]
    if (!all(rownames(sample_info) == colnames(count_data))) {
        stop("Sample names in count data and metadata do not match or are not in the same order.")
    }
    
    # Set 'nuclear' as the reference level for the 'condition' factor.
    # This ensures that when comparing 'cytoplasmic' vs 'nuclear',
    # a positive log2FoldChange indicates higher expression in 'cytoplasmic'.
    sample_info$condition <- factor(sample_info$condition, levels = c("nuclear", "cytoplasmic"))
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_info,
                                  design = ~ condition)
    
    # Run DESeq analysis
    dds <- DESeq(dds)
    
    # Get results for cytoplasmic vs nuclear.
    # According to the description:
    # - A positive log2FoldChange indicates that a transcript was more cytoplasmic.
    #   (i.e., log2(cytoplasmic / nuclear) > 0, meaning cytoplasmic > nuclear)
    # - A negative log2FoldChange indicates that a transcript was more nuclear.
    #   (i.e., log2(cytoplasmic / nuclear) < 0, meaning nuclear > cytoplasmic)
    # This contrast directly aligns with the desired interpretation.
    res <- results(dds, contrast=c("condition", "cytoplasmic", "nuclear"))
    
    # Convert to data frame and save
    res_df <- as.data.frame(res) %>%
        tibble::rownames_to_column("gene_id")
    
    write.table(res_df, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE)
    
    message(paste("Differential localization results saved to:", output_file))
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R
    
    # Clean up the R script and dummy data (optional)
    # rm run_deseq2.R counts.tsv sample_metadata.tsv

Tools Used

Raw Source Text
All reads were aligned to the human genome hg38 primary assembly using STAR.
featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34
Differential expression was calculated using DESeq2. Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01. In this comparison, a negative log2FoldChange indicates that a transcript was more nuclear, while a positive log2FoldChange indicates that a transcript was more cytoplasmic.
Genome_build: hg38
Supplementary_files_format_and_content: comma separated files with output of DESeq2; normalized abudnance measurements for all genes when comparing the nuclear and cytoplasmic fractions. In this comparison, a negative log2FoldChange indicates that a transcript was more nuclear, while a positive log2FoldChange indicates that a transcript was more cytoplasmic
← Back to Analysis