GSE295421 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

Enhancing RNA base editing on mammalian transcripts with small nuclear RNAs.

Nature chemical biology (2025) — PMID 40968292

Dataset

GSE295421

Enhancing RNA base editing on mammalian transcripts with small nuclear RNAs

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Reads were mapped using STAR 2.7.6a

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    READS_R1="reads_R1.fastq.gz" # Replace with your actual R1 fastq file
    READS_R2="reads_R2.fastq.gz" # Replace with your actual R2 fastq file (remove if single-end)
    GENOME_DIR="path/to/STAR_genome_index/GRCh38" # Replace with the path to your STAR genome index for GRCh38 (e.g., built from Gencode v44 annotation)
    OUTPUT_PREFIX="mapped_reads_" # Prefix for output files
    THREADS=8 # Number of threads to use
    
    # Run STAR alignment with common RNA-seq parameters
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READS_R1}" "${READS_R2}" \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes Standard \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignSJDBoverhangMin 1 \
         --outReadsUnmapped Fastx \
         --quantMode GeneCounts # Optional: for gene quantification (e.g., for RNA-seq)
  2. 2

    Read count extraction was performed using featureCounts from the Subread package, with results sorted into counts matrices

    featureCounts v2.0.3
    $ Bash example
    # Install featureCounts (part of Subread package)
    # conda install -c bioconda subread
    
    # Define input and output files
    INPUT_BAM="aligned_reads.bam" # Replace with your actual aligned BAM file
    GENOME_GTF="gencode.v38.annotation.gtf" # Replace with your actual GTF annotation file (e.g., from GENCODE for GRCh38)
    OUTPUT_COUNTS="gene_counts.txt"
    
    # Run featureCounts to extract read counts into a matrix
    # -a: Annotation file (GTF/GFF)
    # -o: Output file for counts
    # -F GTF: Specify GTF format for annotation
    # -t exon: Count features of type 'exon'
    # -g gene_id: Group features by 'gene_id'
    # -s 0: Unstranded (0), stranded (1), reverse stranded (2) - adjust as needed
    # -T 8: Use 8 threads for parallel processing
    # --primary: Only count primary alignments
    # --ignore-multimapping: Ignore multi-mapping reads
    featureCounts -a "${GENOME_GTF}" \
                  -o "${OUTPUT_COUNTS}" \
                  -F GTF \
                  -t exon \
                  -g gene_id \
                  -s 0 \
                  -T 8 \
                  --primary \
                  --ignore-multimapping \
                  "${INPUT_BAM}"
  3. 3

    Differential gene expression analysis was performed on the counts matrices using DESeq2

    $ Bash example
    # Install R and Bioconductor if not already present
    # For Ubuntu/Debian:
    # sudo apt-get update
    # sudo apt-get install r-base
    
    # Open R and install Bioconductor packages
    # R
    # if (!require("BiocManager", quietly = TRUE))
    #     install.packages("BiocManager")
    # BiocManager::install("DESeq2")
    # quit()
    
    # Create dummy input files for demonstration purposes
    # In a real scenario, these files would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto)
    # Example counts.tsv (replace with your actual counts matrix)
    cat <<EOF > counts.tsv
    GeneID	Sample1	Sample2	Sample3	Sample4
    GeneA	100	120	50	60
    GeneB	500	550	200	220
    GeneC	10	15	5	8
    GeneD	2000	2100	1000	1100
    GeneE	5	7	20	25
    EOF
    
    # Example sample_info.tsv (replace with your actual sample metadata)
    cat <<EOF > sample_info.tsv
    Sample	condition
    Sample1	control
    Sample2	control
    Sample3	treated
    Sample4	treated
    EOF
    
    # R script for DESeq2 analysis
    cat <<'EOF' > run_deseq2.R
    # Load DESeq2 package
    library(DESeq2)
    
    # --- Configuration ---
    # Input files
    counts_file <- "counts.tsv" # Path to your raw count matrix
    sample_info_file <- "sample_info.tsv" # Path to your sample metadata file
    
    # Output file
    results_file <- "deseq2_results.tsv"
    
    # Design formula (e.g., ~ condition, ~ treatment + condition, etc.)
    # This should match a column name in your sample_info_file
    design_formula_str <- "~ condition" 
    
    # --- Read input data ---
    # Read count matrix
    # Assuming counts.tsv has genes/features as rows and samples as columns
    # First column is gene ID, subsequent columns are counts
    counts_data <- read.delim(counts_file, row.names = 1, sep = "\t")
    
    # Read sample information
    # Assuming sample_info.tsv has sample IDs as rows and metadata as columns
    # First column is sample ID, subsequent columns are metadata (e.g., condition, treatment)
    sample_info <- read.delim(sample_info_file, row.names = 1, sep = "\t")
    
    # Ensure sample names match and are in the same order
    # It's crucial that the column names of counts_data match the row names of sample_info
    # and are in the same order. Reorder counts_data columns to match sample_info rows.
    counts_data <- counts_data[, rownames(sample_info)]
    
    # Convert counts to integer matrix (DESeq2 requirement)
    counts_data <- round(counts_data)
    counts_data <- as.matrix(counts_data)
    
    # --- Create DESeqDataSet object ---
    dds <- DESeqDataSetFromMatrix(countData = counts_data,
                                  colData = sample_info,
                                  design = as.formula(design_formula_str))
    
    # --- Pre-filtering (optional but recommended) ---
    # Remove genes with very low counts across all samples
    # For example, keep genes that have at least 10 counts in total across all samples
    keep <- rowSums(counts(dds)) >= 10
    dds <- dds[keep,]
    
    # --- Run DESeq2 analysis ---
    dds <- DESeq(dds)
    
    # --- Extract results ---
    # Get results for the primary comparison (e.g., "condition_treated_vs_control")
    # To see available comparisons: resultsNames(dds)
    # Assuming 'condition' has levels 'control' and 'treated'
    res <- results(dds, contrast=c("condition", "treated", "control"))
    
    # Order results by adjusted p-value
    res <- res[order(res$padj),]
    
    # --- Write results to file ---
    write.table(as.data.frame(res), file = results_file, sep = "\t", quote = FALSE, row.names = TRUE)
    
    message(paste("DESeq2 analysis complete. Results saved to:", results_file))
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R

Tools Used

Raw Source Text
Reads were mapped using STAR 2.7.6a
Read count extraction was performed using featureCounts from the Subread package, with results sorted into counts matrices
Differential gene expression analysis was performed on the counts matrices using DESeq2
Assembly: GRCh38
Supplementary files format and content: Supplementary files format and content: tab-delimited text files include gene expression analysis for each transfection condition relative to pUC19 (empty vector)
← Back to Analysis