GSE295421 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

Enhancing RNA base editing on mammalian transcripts with small nuclear RNAs.

Nature chemical biology (2025) — PMID 40968292

Dataset

Enhancing RNA base editing on mammalian transcripts with small nuclear RNAs

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Reads were mapped using STAR 2.7.6a

STAR v2.7.6a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
READS_R1="reads_R1.fastq.gz" # Replace with your actual R1 fastq file
READS_R2="reads_R2.fastq.gz" # Replace with your actual R2 fastq file (remove if single-end)
GENOME_DIR="path/to/STAR_genome_index/GRCh38" # Replace with the path to your STAR genome index for GRCh38 (e.g., built from Gencode v44 annotation)
OUTPUT_PREFIX="mapped_reads_" # Prefix for output files
THREADS=8 # Number of threads to use

# Run STAR alignment with common RNA-seq parameters
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignSJDBoverhangMin 1 \
     --outReadsUnmapped Fastx \
     --quantMode GeneCounts # Optional: for gene quantification (e.g., for RNA-seq)

View on GitHub

Read count extraction was performed using featureCounts from the Subread package, with results sorted into counts matrices

featureCounts v2.0.3

$ Bash example

# Install featureCounts (part of Subread package)
# conda install -c bioconda subread

# Define input and output files
INPUT_BAM="aligned_reads.bam" # Replace with your actual aligned BAM file
GENOME_GTF="gencode.v38.annotation.gtf" # Replace with your actual GTF annotation file (e.g., from GENCODE for GRCh38)
OUTPUT_COUNTS="gene_counts.txt"

# Run featureCounts to extract read counts into a matrix
# -a: Annotation file (GTF/GFF)
# -o: Output file for counts
# -F GTF: Specify GTF format for annotation
# -t exon: Count features of type 'exon'
# -g gene_id: Group features by 'gene_id'
# -s 0: Unstranded (0), stranded (1), reverse stranded (2) - adjust as needed
# -T 8: Use 8 threads for parallel processing
# --primary: Only count primary alignments
# --ignore-multimapping: Ignore multi-mapping reads
featureCounts -a "${GENOME_GTF}" \
              -o "${OUTPUT_COUNTS}" \
              -F GTF \
              -t exon \
              -g gene_id \
              -s 0 \
              -T 8 \
              --primary \
              --ignore-multimapping \
              "${INPUT_BAM}"

Differential gene expression analysis was performed on the counts matrices using DESeq2

DESeq2 v1.38.3 GitHub

$ Bash example

# Install R and Bioconductor if not already present
# For Ubuntu/Debian:
# sudo apt-get update
# sudo apt-get install r-base

# Open R and install Bioconductor packages
# R
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("DESeq2")
# quit()

# Create dummy input files for demonstration purposes
# In a real scenario, these files would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto)
# Example counts.tsv (replace with your actual counts matrix)
cat <<EOF > counts.tsv
GeneID	Sample1	Sample2	Sample3	Sample4
GeneA	100	120	50	60
GeneB	500	550	200	220
GeneC	10	15	5	8
GeneD	2000	2100	1000	1100
GeneE	5	7	20	25
EOF

# Example sample_info.tsv (replace with your actual sample metadata)
cat <<EOF > sample_info.tsv
Sample	condition
Sample1	control
Sample2	control
Sample3	treated
Sample4	treated
EOF

# R script for DESeq2 analysis
cat <<'EOF' > run_deseq2.R
# Load DESeq2 package
library(DESeq2)

# --- Configuration ---
# Input files
counts_file <- "counts.tsv" # Path to your raw count matrix
sample_info_file <- "sample_info.tsv" # Path to your sample metadata file

# Output file
results_file <- "deseq2_results.tsv"

# Design formula (e.g., ~ condition, ~ treatment + condition, etc.)
# This should match a column name in your sample_info_file
design_formula_str <- "~ condition" 

# --- Read input data ---
# Read count matrix
# Assuming counts.tsv has genes/features as rows and samples as columns
# First column is gene ID, subsequent columns are counts
counts_data <- read.delim(counts_file, row.names = 1, sep = "\t")

# Read sample information
# Assuming sample_info.tsv has sample IDs as rows and metadata as columns
# First column is sample ID, subsequent columns are metadata (e.g., condition, treatment)
sample_info <- read.delim(sample_info_file, row.names = 1, sep = "\t")

# Ensure sample names match and are in the same order
# It's crucial that the column names of counts_data match the row names of sample_info
# and are in the same order. Reorder counts_data columns to match sample_info rows.
counts_data <- counts_data[, rownames(sample_info)]

# Convert counts to integer matrix (DESeq2 requirement)
counts_data <- round(counts_data)
counts_data <- as.matrix(counts_data)

# --- Create DESeqDataSet object ---
dds <- DESeqDataSetFromMatrix(countData = counts_data,
                              colData = sample_info,
                              design = as.formula(design_formula_str))

# --- Pre-filtering (optional but recommended) ---
# Remove genes with very low counts across all samples
# For example, keep genes that have at least 10 counts in total across all samples
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# --- Run DESeq2 analysis ---
dds <- DESeq(dds)

# --- Extract results ---
# Get results for the primary comparison (e.g., "condition_treated_vs_control")
# To see available comparisons: resultsNames(dds)
# Assuming 'condition' has levels 'control' and 'treated'
res <- results(dds, contrast=c("condition", "treated", "control"))

# Order results by adjusted p-value
res <- res[order(res$padj),]

# --- Write results to file ---
write.table(as.data.frame(res), file = results_file, sep = "\t", quote = FALSE, row.names = TRUE)

message(paste("DESeq2 analysis complete. Results saved to:", results_file))
EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

Tools Used

STAR DESeq2

Raw Source Text

Reads were mapped using STAR 2.7.6a
Read count extraction was performed using featureCounts from the Subread package, with results sorted into counts matrices
Differential gene expression analysis was performed on the counts matrices using DESeq2
Assembly: GRCh38
Supplementary files format and content: Supplementary files format and content: tab-delimited text files include gene expression analysis for each transfection condition relative to pUC19 (empty vector)

← Back to Analysis