GSE214109 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

An RNA-targeting CRISPR-Cas13d system alleviates disease-related phenotypes in Huntington's disease models.

Nature neuroscience (2023) — PMID 36510111

Dataset

RNA-Targeting CRISPR/Cas13d System Alleviates Disease-Related Phenotypes in Pre-clinical Models of Huntingtonâs Disease (Mouse).

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

RNAseq reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i) (Dobin et al., 2013).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star=2.4.0i

# --- Reference Data Setup (Prerequisites) ---
# The description specifies mapping to "human-specific repetitive elements from RepBase (version 18.05)".
# This requires obtaining the relevant FASTA file from RepBase and building a STAR index from it.

# Placeholder for RepBase 18.05 human-specific repetitive elements FASTA file.
# Note: Access to RepBase often requires a license or subscription.
# For demonstration, let's assume a file named 'RepBase18.05_human_repeats.fasta' is available.
# Example: wget -O RepBase18.05_human_repeats.fasta "URL_TO_REPEATS_FASTA"

# Create STAR index for RepBase elements (if not already done)
# GENOME_DIR="repbase_index_18.05"
# mkdir -p "${GENOME_DIR}"
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles RepBase18.05_human_repeats.fasta \
#      --runThreadN 8 # Adjust threads as needed for index generation

# --- STAR Mapping Command ---
# Assume adapter-trimmed reads are available from the previous Cutadapt step.
TRIMMED_READS="cutadapt_trimmed_reads.fastq.gz" # Placeholder for actual trimmed reads file
OUTPUT_PREFIX="star_repbase_mapping_output/"
GENOME_DIR="repbase_index_18.05" # Directory containing the pre-built STAR index for RepBase

mkdir -p "${OUTPUT_PREFIX}"

STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${TRIMMED_READS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --runThreadN 8 \
     --outSAMtype BAM SortedByCoordinate \
     --outReadsUnmapped Fastx # Output unmapped reads to Fastx files, often useful for further analysis

View on GitHub

Repeat-mapping reads were removed, and remaining reads were mapped to the mouse genome assembly (mm10) with STAR (v2.4.0i)

STAR v2.4.0i GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.4.0i

# Define variables
# Replace with the actual path to your pre-built STAR index for mm10
GENOME_DIR="/path/to/STAR_index/mm10"
# Replace with your actual input FASTQ file(s). For paired-end, use "read1.fastq.gz read2.fastq.gz"
READS_FILE="input_reads.fastq.gz"
OUTPUT_DIR="STAR_output" # Output directory for STAR results
NUM_THREADS=8 # Number of threads to use

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run STAR alignment
# --outFilterMultimapNmax 1: Ensures only uniquely mapping reads are reported, effectively "removing" repeat-mapping reads from the output.
# --outSAMtype BAM SortedByCoordinate: Outputs a sorted BAM file, which is standard for downstream analysis.
# --outBAMcompression 6: Sets BAM compression level.
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_FILE}" \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_DIR}/" \
     --outFilterMultimapNmax 1 \
     --outSAMtype BAM SortedByCoordinate \
     --outBAMcompression 6

View on GitHub

Read counts for all genes annotated in GENCODE (vM20) were calculated using the read summarization program featureCounts (v1.5.3) (Liao et al., 2014).

featureCounts v1.5.3 GitHub

$ Bash example

# Install subread (which includes featureCounts) if not already installed
# conda install -c bioconda subread

# Define variables for input and output
ANNOTATION_GTF="gencode.vM20.annotation.gtf" # Placeholder for GENCODE vM20 annotation file
INPUT_BAM="aligned_reads.bam" # Placeholder for your input BAM file(s)
OUTPUT_COUNTS="gene_counts.txt"

# Download GENCODE vM20 annotation if needed (example, adjust URL if necessary)
# wget -O ${ANNOTATION_GTF}.gz "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.annotation.gtf.gz"
# gunzip -f ${ANNOTATION_GTF}.gz

# Run featureCounts
# -a: Annotation file (GTF/GFF format)
# -o: Output file for read counts
# -t exon: Specify feature type to count (exons are typically used for gene-level counts)
# -g gene_id: Specify attribute to group features by (gene_id for gene-level counts)
# -s 0: Strandedness (0=unstranded, 1=forward, 2=reverse). Adjust based on library preparation protocol.
# -T 8: Number of threads (adjust as needed for parallel processing)
# --primary: Only count primary alignments (recommended for RNA-seq to avoid counting multi-mapped reads multiple times)
featureCounts -a "${ANNOTATION_GTF}" \
              -o "${OUTPUT_COUNTS}" \
              -t exon \
              -g gene_id \
              -s 0 \
              -T 8 \
              --primary \
              "${INPUT_BAM}"

View on GitHub

Batch correction was completed with Combat-seq

sva (Inferred with models/gemini-2.5-flash) v3.54.0 GitHub

$ Bash example

# Install R and Bioconductor if not already installed
# For Bioconductor 3.19 (R 4.4):
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("sva")'

# Create dummy input files for demonstration (replace with your actual data)
# Example counts.tsv
echo -e "gene\tsample1\tsample2\tsample3\tsample4" > counts.tsv
echo -e "geneA\t100\t120\t80\t90" >> counts.tsv
echo -e "geneB\t50\t60\t40\t45" >> counts.tsv
echo -e "geneC\t200\t210\t180\t190" >> counts.tsv

# Example metadata.tsv
echo -e "sample\tbatch\tcondition" > metadata.tsv
echo -e "sample1\tbatch1\tcontrol" >> metadata.tsv
echo -e "sample2\tbatch1\ttreated" >> metadata.tsv
echo -e "sample3\tbatch2\tcontrol" >> metadata.tsv
echo -e "sample4\tbatch2\ttreated" >> metadata.tsv

# R script to perform ComBat-seq batch correction
Rscript -e '
  library(sva)
  library(data.table) # For fread/fwrite

  # Load count data
  counts_df <- fread("counts.tsv", data.table = FALSE)
  rownames(counts_df) <- counts_df$gene
  counts_matrix <- as.matrix(counts_df[, -1])

  # Load metadata
  metadata_df <- fread("metadata.tsv", data.table = FALSE)
  rownames(metadata_df) <- metadata_df$sample

  # Ensure sample order matches between counts and metadata
  metadata_df <- metadata_df[colnames(counts_matrix), ]

  # Define batch variable
  batch <- metadata_df$batch

  # Define biological covariates to preserve (e.g., condition)
  # If no other covariates, set covar_mod = NULL
  covar_mod <- model.matrix(~condition, data = metadata_df) # Example with "condition"

  # Perform ComBat-seq batch correction
  # For more options, refer to ?ComBat_seq in R
  corrected_counts <- ComBat_seq(
    counts = counts_matrix,
    batch = batch,
    covar_mod = covar_mod,
    full_mod = TRUE # Use full_mod = TRUE if covar_mod is a full design matrix
  )

  # Output corrected counts
  fwrite(as.data.frame(corrected_counts), "corrected_counts_combatseq.tsv", sep = "\t", row.names = TRUE)
'

View on GitHub

Differential expression was completed by DESeq2 (v1.5.3)

DESeq2 v1.5.3 GitHub

$ Bash example

# Install Bioconductor and DESeq2 if not already installed.
# Note: DESeq2 v1.5.3 is an older version. To install it, you would typically need an older R version (e.g., R 3.0.x or 3.1.x) and an older Bioconductor release (e.g., Bioconductor 2.13 or 2.14).
# The general installation method for older Bioconductor packages was:
# source("http://bioconductor.org/biocLite.R")
# biocLite("DESeq2") # This would install the version compatible with your R version at the time.

# Create a placeholder R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
# Load DESeq2 library
library(DESeq2)

# --- Placeholder for loading count data and sample metadata ---
# Replace 'counts.tsv' and 'sample_info.tsv' with your actual input files.
# 'counts.tsv' should be a tab-separated or comma-separated file with gene IDs as row names
# and sample IDs as column names (e.g., output from featureCounts, HTSeq-count, or aggregated Salmon/Kallisto counts).
# 'sample_info.tsv' should be a tab-separated or comma-separated file with sample IDs as row names
# and experimental conditions/covariates as columns.

# Example: Load count matrix
# counts_matrix <- as.matrix(read.table("counts.tsv", header = TRUE, row.names = 1, sep = "\t"))

# Example: Load sample metadata
# sample_metadata <- read.table("sample_info.tsv", header = TRUE, row_names = 1, sep = "\t"))

# Ensure sample names match and are in the same order
# if (!all(rownames(sample_metadata) == colnames(counts_matrix))) {
#   stop("Sample names in metadata and counts matrix do not match or are not in the same order.")
# }

# --- Create a DESeqDataSet object ---
# Replace 'condition' with the actual column name in your sample_metadata that defines your experimental groups.
# dds <- DESeqDataSetFromMatrix(countData = counts_matrix,
#                               colData = sample_metadata,
#                               design = ~ condition)

# --- Run DESeq2 analysis ---
# dds <- DESeq(dds)

# --- Extract results ---
# Replace 'condition_treatment_vs_control' with your specific contrast.
# For example, if 'condition' has levels 'control' and 'treatment':
# res <- results(dds, contrast = c("condition", "treatment", "control"))

# --- Optional: Apply LFC shrinkage (requires 'apeglm' or 'ashr' package) ---
# resLFC <- lfcShrink(dds, coef="condition_treatment_vs_control", type="apeglm")

# --- Save results ---
# write.csv(as.data.frame(res), file = "deseq2_differential_expression_results.csv")
# write.csv(as.data.frame(resLFC), file = "deseq2_lfc_shrinkage_results.csv")

# --- Print session info for reproducibility ---
sessionInfo()
EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

Tools Used

STAR DESeq2

Raw Source Text

RNAseq reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i) (Dobin et al., 2013).
Repeat-mapping reads were removed, and remaining reads were mapped to the mouse genome assembly (mm10) with STAR (v2.4.0i)
Read counts for all genes annotated in GENCODE (vM20) were calculated using the read summarization program featureCounts (v1.5.3) (Liao et al., 2014).
Batch correction was completed with Combat-seq
Differential expression was completed by DESeq2 (v1.5.3)
Assembly: mm10

← Back to Analysis