GSE214109 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

An RNA-targeting CRISPR-Cas13d system alleviates disease-related phenotypes in Huntington's disease models.

Nature neuroscience (2023) — PMID 36510111

Dataset

GSE214109

RNA-Targeting CRISPR/Cas13d System Alleviates Disease-Related Phenotypes in Pre-clinical Models of Huntington’s Disease (Mouse).

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    RNAseq reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i) (Dobin et al., 2013).

    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star=2.4.0i
    
    # --- Reference Data Setup (Prerequisites) ---
    # The description specifies mapping to "human-specific repetitive elements from RepBase (version 18.05)".
    # This requires obtaining the relevant FASTA file from RepBase and building a STAR index from it.
    
    # Placeholder for RepBase 18.05 human-specific repetitive elements FASTA file.
    # Note: Access to RepBase often requires a license or subscription.
    # For demonstration, let's assume a file named 'RepBase18.05_human_repeats.fasta' is available.
    # Example: wget -O RepBase18.05_human_repeats.fasta "URL_TO_REPEATS_FASTA"
    
    # Create STAR index for RepBase elements (if not already done)
    # GENOME_DIR="repbase_index_18.05"
    # mkdir -p "${GENOME_DIR}"
    # STAR --runMode genomeGenerate \
    #      --genomeDir "${GENOME_DIR}" \
    #      --genomeFastaFiles RepBase18.05_human_repeats.fasta \
    #      --runThreadN 8 # Adjust threads as needed for index generation
    
    # --- STAR Mapping Command ---
    # Assume adapter-trimmed reads are available from the previous Cutadapt step.
    TRIMMED_READS="cutadapt_trimmed_reads.fastq.gz" # Placeholder for actual trimmed reads file
    OUTPUT_PREFIX="star_repbase_mapping_output/"
    GENOME_DIR="repbase_index_18.05" # Directory containing the pre-built STAR index for RepBase
    
    mkdir -p "${OUTPUT_PREFIX}"
    
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${TRIMMED_READS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --runThreadN 8 \
         --outSAMtype BAM SortedByCoordinate \
         --outReadsUnmapped Fastx # Output unmapped reads to Fastx files, often useful for further analysis
  2. 2

    Repeat-mapping reads were removed, and remaining reads were mapped to the mouse genome assembly (mm10) with STAR (v2.4.0i)

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star=2.4.0i
    
    # Define variables
    # Replace with the actual path to your pre-built STAR index for mm10
    GENOME_DIR="/path/to/STAR_index/mm10"
    # Replace with your actual input FASTQ file(s). For paired-end, use "read1.fastq.gz read2.fastq.gz"
    READS_FILE="input_reads.fastq.gz"
    OUTPUT_DIR="STAR_output" # Output directory for STAR results
    NUM_THREADS=8 # Number of threads to use
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run STAR alignment
    # --outFilterMultimapNmax 1: Ensures only uniquely mapping reads are reported, effectively "removing" repeat-mapping reads from the output.
    # --outSAMtype BAM SortedByCoordinate: Outputs a sorted BAM file, which is standard for downstream analysis.
    # --outBAMcompression 6: Sets BAM compression level.
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READS_FILE}" \
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_DIR}/" \
         --outFilterMultimapNmax 1 \
         --outSAMtype BAM SortedByCoordinate \
         --outBAMcompression 6
  3. 3

    Read counts for all genes annotated in GENCODE (vM20) were calculated using the read summarization program featureCounts (v1.5.3) (Liao et al., 2014).

    featureCounts v1.5.3 GitHub
    $ Bash example
    # Install subread (which includes featureCounts) if not already installed
    # conda install -c bioconda subread
    
    # Define variables for input and output
    ANNOTATION_GTF="gencode.vM20.annotation.gtf" # Placeholder for GENCODE vM20 annotation file
    INPUT_BAM="aligned_reads.bam" # Placeholder for your input BAM file(s)
    OUTPUT_COUNTS="gene_counts.txt"
    
    # Download GENCODE vM20 annotation if needed (example, adjust URL if necessary)
    # wget -O ${ANNOTATION_GTF}.gz "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.annotation.gtf.gz"
    # gunzip -f ${ANNOTATION_GTF}.gz
    
    # Run featureCounts
    # -a: Annotation file (GTF/GFF format)
    # -o: Output file for read counts
    # -t exon: Specify feature type to count (exons are typically used for gene-level counts)
    # -g gene_id: Specify attribute to group features by (gene_id for gene-level counts)
    # -s 0: Strandedness (0=unstranded, 1=forward, 2=reverse). Adjust based on library preparation protocol.
    # -T 8: Number of threads (adjust as needed for parallel processing)
    # --primary: Only count primary alignments (recommended for RNA-seq to avoid counting multi-mapped reads multiple times)
    featureCounts -a "${ANNOTATION_GTF}" \
                  -o "${OUTPUT_COUNTS}" \
                  -t exon \
                  -g gene_id \
                  -s 0 \
                  -T 8 \
                  --primary \
                  "${INPUT_BAM}"
  4. 4

    Batch correction was completed with Combat-seq

    sva (Inferred with models/gemini-2.5-flash) v3.54.0 GitHub
    $ Bash example
    # Install R and Bioconductor if not already installed
    # For Bioconductor 3.19 (R 4.4):
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # R -e 'BiocManager::install("sva")'
    
    # Create dummy input files for demonstration (replace with your actual data)
    # Example counts.tsv
    echo -e "gene\tsample1\tsample2\tsample3\tsample4" > counts.tsv
    echo -e "geneA\t100\t120\t80\t90" >> counts.tsv
    echo -e "geneB\t50\t60\t40\t45" >> counts.tsv
    echo -e "geneC\t200\t210\t180\t190" >> counts.tsv
    
    # Example metadata.tsv
    echo -e "sample\tbatch\tcondition" > metadata.tsv
    echo -e "sample1\tbatch1\tcontrol" >> metadata.tsv
    echo -e "sample2\tbatch1\ttreated" >> metadata.tsv
    echo -e "sample3\tbatch2\tcontrol" >> metadata.tsv
    echo -e "sample4\tbatch2\ttreated" >> metadata.tsv
    
    # R script to perform ComBat-seq batch correction
    Rscript -e '
      library(sva)
      library(data.table) # For fread/fwrite
    
      # Load count data
      counts_df <- fread("counts.tsv", data.table = FALSE)
      rownames(counts_df) <- counts_df$gene
      counts_matrix <- as.matrix(counts_df[, -1])
    
      # Load metadata
      metadata_df <- fread("metadata.tsv", data.table = FALSE)
      rownames(metadata_df) <- metadata_df$sample
    
      # Ensure sample order matches between counts and metadata
      metadata_df <- metadata_df[colnames(counts_matrix), ]
    
      # Define batch variable
      batch <- metadata_df$batch
    
      # Define biological covariates to preserve (e.g., condition)
      # If no other covariates, set covar_mod = NULL
      covar_mod <- model.matrix(~condition, data = metadata_df) # Example with "condition"
    
      # Perform ComBat-seq batch correction
      # For more options, refer to ?ComBat_seq in R
      corrected_counts <- ComBat_seq(
        counts = counts_matrix,
        batch = batch,
        covar_mod = covar_mod,
        full_mod = TRUE # Use full_mod = TRUE if covar_mod is a full design matrix
      )
    
      # Output corrected counts
      fwrite(as.data.frame(corrected_counts), "corrected_counts_combatseq.tsv", sep = "\t", row.names = TRUE)
    '
  5. 5

    Differential expression was completed by DESeq2 (v1.5.3)

    $ Bash example
    # Install Bioconductor and DESeq2 if not already installed.
    # Note: DESeq2 v1.5.3 is an older version. To install it, you would typically need an older R version (e.g., R 3.0.x or 3.1.x) and an older Bioconductor release (e.g., Bioconductor 2.13 or 2.14).
    # The general installation method for older Bioconductor packages was:
    # source("http://bioconductor.org/biocLite.R")
    # biocLite("DESeq2") # This would install the version compatible with your R version at the time.
    
    # Create a placeholder R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    # Load DESeq2 library
    library(DESeq2)
    
    # --- Placeholder for loading count data and sample metadata ---
    # Replace 'counts.tsv' and 'sample_info.tsv' with your actual input files.
    # 'counts.tsv' should be a tab-separated or comma-separated file with gene IDs as row names
    # and sample IDs as column names (e.g., output from featureCounts, HTSeq-count, or aggregated Salmon/Kallisto counts).
    # 'sample_info.tsv' should be a tab-separated or comma-separated file with sample IDs as row names
    # and experimental conditions/covariates as columns.
    
    # Example: Load count matrix
    # counts_matrix <- as.matrix(read.table("counts.tsv", header = TRUE, row.names = 1, sep = "\t"))
    
    # Example: Load sample metadata
    # sample_metadata <- read.table("sample_info.tsv", header = TRUE, row_names = 1, sep = "\t"))
    
    # Ensure sample names match and are in the same order
    # if (!all(rownames(sample_metadata) == colnames(counts_matrix))) {
    #   stop("Sample names in metadata and counts matrix do not match or are not in the same order.")
    # }
    
    # --- Create a DESeqDataSet object ---
    # Replace 'condition' with the actual column name in your sample_metadata that defines your experimental groups.
    # dds <- DESeqDataSetFromMatrix(countData = counts_matrix,
    #                               colData = sample_metadata,
    #                               design = ~ condition)
    
    # --- Run DESeq2 analysis ---
    # dds <- DESeq(dds)
    
    # --- Extract results ---
    # Replace 'condition_treatment_vs_control' with your specific contrast.
    # For example, if 'condition' has levels 'control' and 'treatment':
    # res <- results(dds, contrast = c("condition", "treatment", "control"))
    
    # --- Optional: Apply LFC shrinkage (requires 'apeglm' or 'ashr' package) ---
    # resLFC <- lfcShrink(dds, coef="condition_treatment_vs_control", type="apeglm")
    
    # --- Save results ---
    # write.csv(as.data.frame(res), file = "deseq2_differential_expression_results.csv")
    # write.csv(as.data.frame(resLFC), file = "deseq2_lfc_shrinkage_results.csv")
    
    # --- Print session info for reproducibility ---
    sessionInfo()
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R

Tools Used

Raw Source Text
RNAseq reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i) (Dobin et al., 2013).
Repeat-mapping reads were removed, and remaining reads were mapped to the mouse genome assembly (mm10) with STAR (v2.4.0i)
Read counts for all genes annotated in GENCODE (vM20) were calculated using the read summarization program featureCounts (v1.5.3) (Liao et al., 2014).
Batch correction was completed with Combat-seq
Differential expression was completed by DESeq2 (v1.5.3)
Assembly: mm10
← Back to Analysis