GSE65973 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Aberrant NOVA1 function disrupts alternative splicing in early stages of amyotrophic lateral sclerosis.

Acta neuropathologica (2022) — PMID 35778567

Dataset

GSE65973

RNA sequencing of mice expressing NLS-hTDP-43

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    3' Illumina adaptors trimmed with cutadapt

    cutadapt v4.0 GitHub
    $ Bash example
    # Install cutadapt via conda
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output filenames (placeholders)
    INPUT_FASTQ="input.fastq.gz"
    TRIMMED_FASTQ="trimmed.fastq.gz"
    
    # Define common Illumina 3' adapter sequence (for single-end or Read 1 of paired-end)
    ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    
    # Define minimum read length after trimming and quality threshold
    MIN_LENGTH="18"
    QUALITY_THRESHOLD="20"
    
    # Define number of CPU cores to use for parallel processing
    NUM_CORES="4"
    
    # Execute cutadapt to trim 3' Illumina adaptors and perform quality trimming
    # -a: Specifies the 3' adapter sequence to be removed.
    # -o: Specifies the output file for trimmed reads.
    # --minimum-length: Discards reads shorter than this length after trimming.
    # --cores: Uses multiple CPU cores for faster processing.
    # -q: Performs quality trimming from both 5' and 3' ends with the specified quality threshold.
    cutadapt \
      -a "${ADAPTER_3PRIME}" \
      -o "${TRIMMED_FASTQ}" \
      "${INPUT_FASTQ}" \
      --minimum-length "${MIN_LENGTH}" \
      --cores "${NUM_CORES}" \
      -q "${QUALITY_THRESHOLD}","${QUALITY_THRESHOLD}"
  2. 2

    Mapped to mm9 genome build with Bowtie, version 0.12.7 with options --time -k 100 --best --strata -v 2 -p 3, and filtered to a mismatch rate of 6%

    Bowtie v0.12.7 GitHub
    $ Bash example
    # Install Bowtie (if not already installed)
    # conda install -c bioconda bowtie=0.12.7
    
    # Define variables
    # Replace /path/to/mm9_bowtie_index/mm9 with the actual path to your Bowtie index for mm9.
    # The index files (e.g., mm9.1.ebwt, mm9.2.ebwt, etc.) should be in this directory.
    BOWTIE_INDEX_PREFIX="/path/to/mm9_bowtie_index/mm9"
    INPUT_FASTQ="input_reads.fastq" # Replace with your input FASTQ file (e.g., single-end reads)
    OUTPUT_SAM="aligned_reads.sam" # Replace with your desired output SAM file name
    
    # Execute Bowtie alignment
    # --time: Report wall-clock time taken by Bowtie
    # -k 100: Report up to 100 valid alignments per read
    # --best: Report alignments that are "best" in terms of stratum and number of mismatches
    # --strata: Don't report alignments for a read unless they fall into the best stratum
    # -v 2: Align reads with at most 2 mismatches (this implicitly handles the 6% mismatch rate for typical read lengths)
    # -p 3: Use 3 parallel threads
    bowtie --time -k 100 --best --strata -v 2 -p 3 "$BOWTIE_INDEX_PREFIX" "$INPUT_FASTQ" > "$OUTPUT_SAM"
  3. 3

    Mapped reads were counted for each exon using bedtools multicov and merged into raw gene counts with bedtoools groupBy

    bedtools v2.31.0 GitHub
    $ Bash example
    # Define reference genome and annotation
    GENOME="hg38"
    GTF_URL="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz"
    GTF_FILE="gencode.v45.annotation.gtf.gz"
    EXONS_BED="gencode.v45.exons.bed"
    
    # Input BAM file (placeholder - replace with your actual mapped reads BAM file)
    MAPPED_READS_BAM="mapped_reads.bam"
    # Ensure BAM is sorted and indexed for optimal performance with bedtools
    # Example:
    # samtools sort -o mapped_reads.sorted.bam mapped_reads.bam
    # samtools index mapped_reads.sorted.bam
    # MAPPED_READS_BAM="mapped_reads.sorted.bam"
    
    # Output files
    EXON_COUNTS_TSV="exon_raw_counts.tsv"
    GENE_COUNTS_TSV="gene_raw_counts.tsv"
    
    # --- Installation (commented out) ---
    # Install bedtools and optionally samtools (for BAM sorting/indexing) if not already available
    # conda install -c bioconda bedtools
    # conda install -c bioconda samtools
    
    # --- Prepare exon annotations (if not already available) ---
    # Download GTF if not present
    if [ ! -f "$GTF_FILE" ]; then
        echo "Downloading GTF file from $GTF_URL..."
        wget "$GTF_URL"
    fi
    
    # Convert GTF to BED format for exons, ensuring gene_id is in the name field (column 4).
    # This step is crucial for bedtools groupBy later to correctly group by gene.
    # The awk script extracts chromosome, 0-based start, 1-based end, gene_id, score (set to 0), and strand.
    if [ ! -f "$EXONS_BED" ]; then
        echo "Creating exon BED file from GTF: $EXONS_BED..."
        zcat "$GTF_FILE" | awk '
            BEGIN { OFS="\t" }
            $3 == "exon" {
                chr = $1;
                start = $4 - 1; # Convert to 0-based start
                end = $5;
                strand = $7;
                
                # Extract gene_id from the 9th column (attributes)
                gene_id = "";
                split($9, attrs, "; ");
                for (i=1; i<=length(attrs); i++) {
                    if (attrs[i] ~ /^gene_id /) {
                        gene_id = substr(attrs[i], index(attrs[i], "\"")+1, length(attrs[i]) - index(attrs[i], "\"") - 1);
                        break;
                    }
                }
                
                if (gene_id != "") {
                    print chr, start, end, gene_id, "0", strand;
                }
            }' > "$EXONS_BED"
        echo "Exon BED file created: $EXONS_BED"
    else
        echo "Exon BED file already exists: $EXONS_BED"
    fi
    
    # --- Main execution ---
    
    # 1. Count reads for each exon using bedtools multicov
    #    -b: BED file of features (exons)
    #    -f: BAM file of mapped reads
    #    Output format: chr\tstart\tend\tgene_id\tscore\tstrand\tcount
    echo "Counting reads per exon using bedtools multicov..."
    bedtools multicov -b "$EXONS_BED" -f "$MAPPED_READS_BAM" > "$EXON_COUNTS_TSV"
    
    # 2. Merge into raw gene counts with bedtools groupBy
    #    -i: Input file (exon_raw_counts.tsv from the previous step)
    #    -g 4: Group by the 4th column (which contains the gene_id)
    #    -c 7: Operate on the 7th column (which contains the exon read count)
    #    -o sum: Sum the values in the 7th column for each group (gene)
    echo "Merging exon counts into gene counts using bedtools groupBy..."
    bedtools groupBy -i "$EXON_COUNTS_TSV" -g 4 -c 7 -o sum > "$GENE_COUNTS_TSV"
    
    echo "Raw exon counts saved to: $EXON_COUNTS_TSV"
    echo "Raw gene counts saved to: $GENE_COUNTS_TSV"
  4. 4

    DESeq2 was used to normalize read counts.

    DESeq2 v1.42.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install R and Bioconductor (if not already installed)
    # sudo apt-get update
    # sudo apt-get install r-base
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'
    
    # This script assumes you have a count matrix file (e.g., 'counts.tsv')
    # and an optional sample metadata file (e.g., 'coldata.tsv').
    # The count matrix should have gene IDs as row names and sample IDs as column names.
    # The metadata file should have sample IDs as row names and relevant experimental factors as columns.
    
    # Example of how to create dummy input files for demonstration:
    # echo -e "gene_id\tsample1\tsample2\tsample3\tsample4\ngeneA\t100\t120\t90\t110\ngeneB\t50\t60\t45\t55\ngeneC\t200\t210\t190\t220" > counts.tsv
    # echo -e "sample_id\tcondition\nsample1\tcontrol\nsample2\tcontrol\nsample3\ttreated\nsample4\ttreated" > coldata.tsv
    
    Rscript -e '
      library(DESeq2)
      
      # --- Configuration ---
      count_matrix_file <- "counts.tsv" # Path to your raw count matrix file
      sample_metadata_file <- "coldata.tsv" # Path to your sample metadata file
      output_normalized_counts_file <- "normalized_counts.tsv" # Output file for normalized counts
      
      # --- Load Data ---
      # Load count data: first column as row names (gene IDs), tab-separated
      count_data <- read.table(count_matrix_file, header = TRUE, row.names = 1, sep = "\t")
      
      # Load sample metadata: first column as row names (sample IDs), tab-separated
      coldata <- read.table(sample_metadata_file, header = TRUE, row.names = 1, sep = "\t")
      
      # Ensure sample order in coldata matches column order in count_data
      # This is crucial for DESeqDataSetFromMatrix
      coldata <- coldata[colnames(count_data), , drop = FALSE]
      
      # --- Create DESeqDataSet object ---
      # For simple normalization, a design of ~1 is sufficient as we are not performing
      # differential expression analysis here, only estimating size factors.
      dds <- DESeqDataSetFromMatrix(countData = count_data,
                                    colData = coldata,
                                    design = ~ 1)
      
      # --- Perform Normalization ---
      # Estimate size factors to account for differences in library size and sequencing depth.
      # This is the core normalization step in DESeq2.
      dds <- estimateSizeFactors(dds)
      
      # Extract normalized counts
      normalized_counts <- counts(dds, normalized = TRUE)
      
      # --- Save Results ---
      # Write normalized counts to a tab-separated file
      # col.names = NA ensures that the row names (gene IDs) are written as the first column
      write.table(normalized_counts, output_normalized_counts_file, sep = "\t", quote = FALSE, col.names = NA)
      
      message(paste0("DESeq2 normalization complete. Normalized counts saved to ", output_normalized_counts_file))
    '
  5. 5

    Genes were filtered to have a minimum of 10 reads per million in at least one sample, and those counts were used to test for differential expression

    DESeq2 (Inferred with models/gemini-2.5-flash) v1.38.3 GitHub
    $ Bash example
    # Install R and DESeq2 (if not already installed)
    # sudo apt-get update
    # sudo apt-get install -y r-base
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # R -e 'BiocManager::install("DESeq2")'
    
    # Create dummy input files for demonstration (replace with your actual files)
    # gene_counts.csv: A CSV file with gene IDs as row names and sample IDs as column names, containing raw read counts.
    # sample_metadata.csv: A CSV file with sample IDs as row names and experimental conditions (e.g., 'condition', 'treatment') as columns.
    
    cat <<EOF > gene_counts.csv
    GeneID,sample1,sample2,sample3,sample4,sample5,sample6
    geneA,100,120,500,550,110,520
    geneB,5,8,10,12,6,11
    geneC,200,210,250,260,205,255
    geneD,15,18,20,22,16,21
    geneE,0,1,0,2,0,1
    geneF,1000,1100,1200,1300,1050,1250
    EOF
    
    cat <<EOF > sample_metadata.csv
    SampleID,condition
    sample1,control
    sample2,control
    sample3,treatment
    sample4,treatment
    sample5,control
    sample6,treatment
    EOF
    
    # R script to perform filtering and differential expression using DESeq2
    cat <<'EOF_R_SCRIPT' > run_deseq2.R
    library(DESeq2)
    
    # Load count matrix
    counts_df <- read.csv("gene_counts.csv", row.names = 1)
    
    # Load sample metadata
    coldata <- read.csv("sample_metadata.csv", row.names = 1)
    
    # Ensure order of samples in counts_df and coldata matches
    stopifnot(all(colnames(counts_df) == rownames(coldata)))
    
    # --- Filtering step: minimum of 10 reads per million in at least one sample ---
    # Calculate total reads per sample
    total_reads_per_sample <- colSums(counts_df)
    
    # Calculate RPM for each gene in each sample
    # Add a small pseudocount to total_reads_per_sample to avoid division by zero if a sample has 0 total reads
    rpm_matrix <- t(t(counts_df) / (total_reads_per_sample + 1e-6)) * 1e6
    
    # Identify genes that have at least 10 RPM in at least one sample
    keep_genes <- rowSums(rpm_matrix >= 10) >= 1
    
    # Filter the count matrix
    filtered_counts_df <- counts_df[keep_genes, ]
    
    # --- Differential Expression using DESeq2 ---
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromMatrix(countData = round(filtered_counts_df), # DESeq2 expects integer counts
                                  colData = coldata,
                                  design = ~ condition)
    
    # Run DESeq2
    dds <- DESeq(dds)
    
    # Get results for 'treatment' vs 'control'
    res <- results(dds, contrast = c("condition", "treatment", "control"))
    
    # Order by adjusted p-value
    res_ordered <- res[order(res$padj), ]
    
    # Save results
    write.csv(as.data.frame(res_ordered), "deseq2_differential_expression_results.csv")
    
    # Optional: Save normalized counts
    normalized_counts <- counts(dds, normalized=TRUE)
    write.csv(as.data.frame(normalized_counts), "deseq2_normalized_counts.csv")
    
    message("DESeq2 analysis complete. Results saved to deseq2_differential_expression_results.csv")
    message("Normalized counts saved to deseq2_normalized_counts.csv")
    EOF_R_SCRIPT
    
    # Execute the R script
    Rscript run_deseq2.R

Tools Used

Raw Source Text
3' Illumina adaptors trimmed with cutadapt
Mapped to mm9 genome build with Bowtie, version 0.12.7 with options --time -k 100 --best --strata -v 2 -p 3, and filtered to a mismatch rate of 6%
Mapped reads were counted for each exon using bedtools multicov and merged into raw gene counts with bedtoools groupBy
DESeq2 was used to normalize read counts. Genes were filtered to have a minimum of 10 reads per million in at least one sample, and those counts were used to test for differential expression
Genome_build: mm9
Supplementary_files_format_and_content: NLS_deseq2_all_alluniq_vs_mmu9htdp43nls.txt is the differential expression output of DESeq2. NLS_norm_counts_all.txt is the file of normalized read counts per gene as computed by DESeq2. The counts are in matrix format, one file for the whole experiment.
← Back to Analysis