GSE65973 Processing Pipeline
RNA-Seq
code_examples
5 steps
Publication
Aberrant NOVA1 function disrupts alternative splicing in early stages of amyotrophic lateral sclerosis.Acta neuropathologica (2022) — PMID 35778567
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
3' Illumina adaptors trimmed with cutadapt
$ Bash example
# Install cutadapt via conda # conda install -c bioconda cutadapt=4.0 # Define input and output filenames (placeholders) INPUT_FASTQ="input.fastq.gz" TRIMMED_FASTQ="trimmed.fastq.gz" # Define common Illumina 3' adapter sequence (for single-end or Read 1 of paired-end) ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Define minimum read length after trimming and quality threshold MIN_LENGTH="18" QUALITY_THRESHOLD="20" # Define number of CPU cores to use for parallel processing NUM_CORES="4" # Execute cutadapt to trim 3' Illumina adaptors and perform quality trimming # -a: Specifies the 3' adapter sequence to be removed. # -o: Specifies the output file for trimmed reads. # --minimum-length: Discards reads shorter than this length after trimming. # --cores: Uses multiple CPU cores for faster processing. # -q: Performs quality trimming from both 5' and 3' ends with the specified quality threshold. cutadapt \ -a "${ADAPTER_3PRIME}" \ -o "${TRIMMED_FASTQ}" \ "${INPUT_FASTQ}" \ --minimum-length "${MIN_LENGTH}" \ --cores "${NUM_CORES}" \ -q "${QUALITY_THRESHOLD}","${QUALITY_THRESHOLD}" -
2
Mapped to mm9 genome build with Bowtie, version 0.12.7 with options --time -k 100 --best --strata -v 2 -p 3, and filtered to a mismatch rate of 6%
$ Bash example
# Install Bowtie (if not already installed) # conda install -c bioconda bowtie=0.12.7 # Define variables # Replace /path/to/mm9_bowtie_index/mm9 with the actual path to your Bowtie index for mm9. # The index files (e.g., mm9.1.ebwt, mm9.2.ebwt, etc.) should be in this directory. BOWTIE_INDEX_PREFIX="/path/to/mm9_bowtie_index/mm9" INPUT_FASTQ="input_reads.fastq" # Replace with your input FASTQ file (e.g., single-end reads) OUTPUT_SAM="aligned_reads.sam" # Replace with your desired output SAM file name # Execute Bowtie alignment # --time: Report wall-clock time taken by Bowtie # -k 100: Report up to 100 valid alignments per read # --best: Report alignments that are "best" in terms of stratum and number of mismatches # --strata: Don't report alignments for a read unless they fall into the best stratum # -v 2: Align reads with at most 2 mismatches (this implicitly handles the 6% mismatch rate for typical read lengths) # -p 3: Use 3 parallel threads bowtie --time -k 100 --best --strata -v 2 -p 3 "$BOWTIE_INDEX_PREFIX" "$INPUT_FASTQ" > "$OUTPUT_SAM"
-
3
Mapped reads were counted for each exon using bedtools multicov and merged into raw gene counts with bedtoools groupBy
$ Bash example
# Define reference genome and annotation GENOME="hg38" GTF_URL="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz" GTF_FILE="gencode.v45.annotation.gtf.gz" EXONS_BED="gencode.v45.exons.bed" # Input BAM file (placeholder - replace with your actual mapped reads BAM file) MAPPED_READS_BAM="mapped_reads.bam" # Ensure BAM is sorted and indexed for optimal performance with bedtools # Example: # samtools sort -o mapped_reads.sorted.bam mapped_reads.bam # samtools index mapped_reads.sorted.bam # MAPPED_READS_BAM="mapped_reads.sorted.bam" # Output files EXON_COUNTS_TSV="exon_raw_counts.tsv" GENE_COUNTS_TSV="gene_raw_counts.tsv" # --- Installation (commented out) --- # Install bedtools and optionally samtools (for BAM sorting/indexing) if not already available # conda install -c bioconda bedtools # conda install -c bioconda samtools # --- Prepare exon annotations (if not already available) --- # Download GTF if not present if [ ! -f "$GTF_FILE" ]; then echo "Downloading GTF file from $GTF_URL..." wget "$GTF_URL" fi # Convert GTF to BED format for exons, ensuring gene_id is in the name field (column 4). # This step is crucial for bedtools groupBy later to correctly group by gene. # The awk script extracts chromosome, 0-based start, 1-based end, gene_id, score (set to 0), and strand. if [ ! -f "$EXONS_BED" ]; then echo "Creating exon BED file from GTF: $EXONS_BED..." zcat "$GTF_FILE" | awk ' BEGIN { OFS="\t" } $3 == "exon" { chr = $1; start = $4 - 1; # Convert to 0-based start end = $5; strand = $7; # Extract gene_id from the 9th column (attributes) gene_id = ""; split($9, attrs, "; "); for (i=1; i<=length(attrs); i++) { if (attrs[i] ~ /^gene_id /) { gene_id = substr(attrs[i], index(attrs[i], "\"")+1, length(attrs[i]) - index(attrs[i], "\"") - 1); break; } } if (gene_id != "") { print chr, start, end, gene_id, "0", strand; } }' > "$EXONS_BED" echo "Exon BED file created: $EXONS_BED" else echo "Exon BED file already exists: $EXONS_BED" fi # --- Main execution --- # 1. Count reads for each exon using bedtools multicov # -b: BED file of features (exons) # -f: BAM file of mapped reads # Output format: chr\tstart\tend\tgene_id\tscore\tstrand\tcount echo "Counting reads per exon using bedtools multicov..." bedtools multicov -b "$EXONS_BED" -f "$MAPPED_READS_BAM" > "$EXON_COUNTS_TSV" # 2. Merge into raw gene counts with bedtools groupBy # -i: Input file (exon_raw_counts.tsv from the previous step) # -g 4: Group by the 4th column (which contains the gene_id) # -c 7: Operate on the 7th column (which contains the exon read count) # -o sum: Sum the values in the 7th column for each group (gene) echo "Merging exon counts into gene counts using bedtools groupBy..." bedtools groupBy -i "$EXON_COUNTS_TSV" -g 4 -c 7 -o sum > "$GENE_COUNTS_TSV" echo "Raw exon counts saved to: $EXON_COUNTS_TSV" echo "Raw gene counts saved to: $GENE_COUNTS_TSV" -
4
DESeq2 was used to normalize read counts.
$ Bash example
# Install R and Bioconductor (if not already installed) # sudo apt-get update # sudo apt-get install r-base # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")' # This script assumes you have a count matrix file (e.g., 'counts.tsv') # and an optional sample metadata file (e.g., 'coldata.tsv'). # The count matrix should have gene IDs as row names and sample IDs as column names. # The metadata file should have sample IDs as row names and relevant experimental factors as columns. # Example of how to create dummy input files for demonstration: # echo -e "gene_id\tsample1\tsample2\tsample3\tsample4\ngeneA\t100\t120\t90\t110\ngeneB\t50\t60\t45\t55\ngeneC\t200\t210\t190\t220" > counts.tsv # echo -e "sample_id\tcondition\nsample1\tcontrol\nsample2\tcontrol\nsample3\ttreated\nsample4\ttreated" > coldata.tsv Rscript -e ' library(DESeq2) # --- Configuration --- count_matrix_file <- "counts.tsv" # Path to your raw count matrix file sample_metadata_file <- "coldata.tsv" # Path to your sample metadata file output_normalized_counts_file <- "normalized_counts.tsv" # Output file for normalized counts # --- Load Data --- # Load count data: first column as row names (gene IDs), tab-separated count_data <- read.table(count_matrix_file, header = TRUE, row.names = 1, sep = "\t") # Load sample metadata: first column as row names (sample IDs), tab-separated coldata <- read.table(sample_metadata_file, header = TRUE, row.names = 1, sep = "\t") # Ensure sample order in coldata matches column order in count_data # This is crucial for DESeqDataSetFromMatrix coldata <- coldata[colnames(count_data), , drop = FALSE] # --- Create DESeqDataSet object --- # For simple normalization, a design of ~1 is sufficient as we are not performing # differential expression analysis here, only estimating size factors. dds <- DESeqDataSetFromMatrix(countData = count_data, colData = coldata, design = ~ 1) # --- Perform Normalization --- # Estimate size factors to account for differences in library size and sequencing depth. # This is the core normalization step in DESeq2. dds <- estimateSizeFactors(dds) # Extract normalized counts normalized_counts <- counts(dds, normalized = TRUE) # --- Save Results --- # Write normalized counts to a tab-separated file # col.names = NA ensures that the row names (gene IDs) are written as the first column write.table(normalized_counts, output_normalized_counts_file, sep = "\t", quote = FALSE, col.names = NA) message(paste0("DESeq2 normalization complete. Normalized counts saved to ", output_normalized_counts_file)) ' -
5
Genes were filtered to have a minimum of 10 reads per million in at least one sample, and those counts were used to test for differential expression
$ Bash example
# Install R and DESeq2 (if not already installed) # sudo apt-get update # sudo apt-get install -y r-base # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("DESeq2")' # Create dummy input files for demonstration (replace with your actual files) # gene_counts.csv: A CSV file with gene IDs as row names and sample IDs as column names, containing raw read counts. # sample_metadata.csv: A CSV file with sample IDs as row names and experimental conditions (e.g., 'condition', 'treatment') as columns. cat <<EOF > gene_counts.csv GeneID,sample1,sample2,sample3,sample4,sample5,sample6 geneA,100,120,500,550,110,520 geneB,5,8,10,12,6,11 geneC,200,210,250,260,205,255 geneD,15,18,20,22,16,21 geneE,0,1,0,2,0,1 geneF,1000,1100,1200,1300,1050,1250 EOF cat <<EOF > sample_metadata.csv SampleID,condition sample1,control sample2,control sample3,treatment sample4,treatment sample5,control sample6,treatment EOF # R script to perform filtering and differential expression using DESeq2 cat <<'EOF_R_SCRIPT' > run_deseq2.R library(DESeq2) # Load count matrix counts_df <- read.csv("gene_counts.csv", row.names = 1) # Load sample metadata coldata <- read.csv("sample_metadata.csv", row.names = 1) # Ensure order of samples in counts_df and coldata matches stopifnot(all(colnames(counts_df) == rownames(coldata))) # --- Filtering step: minimum of 10 reads per million in at least one sample --- # Calculate total reads per sample total_reads_per_sample <- colSums(counts_df) # Calculate RPM for each gene in each sample # Add a small pseudocount to total_reads_per_sample to avoid division by zero if a sample has 0 total reads rpm_matrix <- t(t(counts_df) / (total_reads_per_sample + 1e-6)) * 1e6 # Identify genes that have at least 10 RPM in at least one sample keep_genes <- rowSums(rpm_matrix >= 10) >= 1 # Filter the count matrix filtered_counts_df <- counts_df[keep_genes, ] # --- Differential Expression using DESeq2 --- # Create DESeqDataSet object dds <- DESeqDataSetFromMatrix(countData = round(filtered_counts_df), # DESeq2 expects integer counts colData = coldata, design = ~ condition) # Run DESeq2 dds <- DESeq(dds) # Get results for 'treatment' vs 'control' res <- results(dds, contrast = c("condition", "treatment", "control")) # Order by adjusted p-value res_ordered <- res[order(res$padj), ] # Save results write.csv(as.data.frame(res_ordered), "deseq2_differential_expression_results.csv") # Optional: Save normalized counts normalized_counts <- counts(dds, normalized=TRUE) write.csv(as.data.frame(normalized_counts), "deseq2_normalized_counts.csv") message("DESeq2 analysis complete. Results saved to deseq2_differential_expression_results.csv") message("Normalized counts saved to deseq2_normalized_counts.csv") EOF_R_SCRIPT # Execute the R script Rscript run_deseq2.R
Tools Used
Raw Source Text
3' Illumina adaptors trimmed with cutadapt Mapped to mm9 genome build with Bowtie, version 0.12.7 with options --time -k 100 --best --strata -v 2 -p 3, and filtered to a mismatch rate of 6% Mapped reads were counted for each exon using bedtools multicov and merged into raw gene counts with bedtoools groupBy DESeq2 was used to normalize read counts. Genes were filtered to have a minimum of 10 reads per million in at least one sample, and those counts were used to test for differential expression Genome_build: mm9 Supplementary_files_format_and_content: NLS_deseq2_all_alluniq_vs_mmu9htdp43nls.txt is the differential expression output of DESeq2. NLS_norm_counts_all.txt is the file of normalized read counts per gene as computed by DESeq2. The counts are in matrix format, one file for the whole experiment.