GSE86043 Processing Pipeline

RNA-Seq code_examples 8 steps

Publication

Protein-RNA Networks Regulated by Normal and ALS-Associated Mutant HNRNPA2B1 in the Nervous System.

Neuron (2016) — PMID 27773581

Dataset

GSE86043

HNRNPA2B1 regulates alternative RNA processing in the nervous system and accumulates in granules in ALS IPSC-derived motor neurons [hnRNPA2B1_RNA-seq…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Sequencing reads from RNA-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff' 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.

    cutadapt v3.4 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install cutadapt (e.g., using conda)
    # conda install -c bioconda cutadapt
    
    # Define input and output file paths
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_FASTQ="trimmed_output.fastq.gz"
    
    cutadapt -o "${OUTPUT_FASTQ}" \
        --match-read-wildcards \
        --times 2 \
        -e 0 \
        -O 5 \
        --quality-cutoff 6 \
        -m 18 \
        -b TCGTATGCCGTCTTCTGCTTG \
        -b ATCTCGTATGCCGTCTTCTGCTTG \
        -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
        -b TGGAATTCTCGGGTGCCAAGG \
        -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \
        -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT \
        "${INPUT_FASTQ}"
  2. 2

    Reads were then mapped against a database of repetitive elements derived from RepBase18.05.

    bowtie (Inferred with models/gemini-2.5-flash) v1.2.2 GitHub
    $ Bash example
    # Install bowtie
    # conda install -c bioconda bowtie=1.2.2
    
    # --- Reference Data Preparation (RepBase18.05) ---
    # Note: RepBase requires a license for direct download. This is a placeholder for demonstration.
    # Download RepBase 18.05 FASTA file (e.g., from GIRI website after obtaining a license)
    # Example (replace with actual download method after obtaining license):
    # wget -O RepBase18.05.fasta "https://www.girinst.org/downloads/RepBase18.05.fasta" 
    
    # Build bowtie index for RepBase18.05
    bowtie-build RepBase18.05.fasta RepBase18.05_index
    
    # --- Mapping Reads ---
    # Replace 'input_reads.fastq' with your actual input FASTQ file
    # Replace 'repbase_aligned.sam' with your desired output SAM file name
    bowtie -q -v 2 -a --best --strata -S RepBase18.05_index input_reads.fastq repbase_aligned.sam
  3. 3

    Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).

    Bowtie v1.0.0 GitHub
    $ Bash example
    # Install Bowtie (if not already installed)
    # conda install -c bioconda bowtie=1.0.0
    
    # Placeholder for the Repbase index. This index must be pre-built using bowtie-build
    # from the Repbase sequences. Replace 'path/to/repbase_index' with the actual path to your index.
    REPBASE_INDEX="path/to/repbase_index"
    
    # Placeholder for input reads. Replace 'path/to/input_reads.fastq' with the actual path to your FASTQ file.
    INPUT_READS="path/to/input_reads.fastq"
    
    # Output SAM file for aligned reads
    OUTPUT_SAM="aligned_reads.sam"
    
    # Run Bowtie alignment
    bowtie -S -q -p 16 -e 100 -l 20 "${REPBASE_INDEX}" "${INPUT_READS}" > "${OUTPUT_SAM}"
  4. 4

    Reads not mapped to Repbase sequences were aligned to the mm9 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within –outFilterMultimapNmax 1 –outFilterMultimapScoreRange 1.

    STAR v2.3.0e
    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star
    
    # Define variables for input reads and genome directory
    # INPUT_READS should be the FASTQ/FASTA file containing reads not mapped to Repbase.
    INPUT_READS="unmapped_reads.fastq" # Placeholder for input reads
    # GENOME_DIR should point to the pre-built STAR index for the mm9 human genome.
    # The mm9 genome FASTA and GTF can be downloaded from UCSC (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/).
    # A STAR index for mm9 would need to be generated using STAR --runMode genomeGenerate.
    GENOME_DIR="/path/to/STAR_mm9_index" # Placeholder for mm9 STAR index
    OUTPUT_PREFIX="aligned_to_mm9"
    
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${INPUT_READS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMunmapped Within \
         --outFilterMultimapNmax 1 \
         --outFilterMultimapScoreRange 1 \
         --runThreadN 8 # Adjust thread count as needed
  5. 5

    Alternative polyadenylation sites were identified from RNA-seq data using the bioinformatics algorithm DaPars (Xia et al., 2014), which uses a regression model to locate endpoints of alternative polyadenylation sites, was used to identify differences in APA events between hnRNP A2/B1 depleted samples and controls.

    RNA-seq vv0.9.1
    $ Bash example
    # Define paths and filenames
    DAPARS_DIR="DaPars" # Assuming DaPars is cloned here
    OUTPUT_BASE_DIR="dapars_analysis"
    mkdir -p "${OUTPUT_BASE_DIR}"
    
    # Placeholder for reference genome and annotation
    # Download Gencode annotation for hg38 (example)
    # wget -P "${OUTPUT_BASE_DIR}" http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
    # gunzip "${OUTPUT_BASE_DIR}/gencode.v38.annotation.gtf.gz"
    GENCODE_GTF="${OUTPUT_BASE_DIR}/gencode.v38.annotation.gtf"
    
    # Placeholder for input BAM files (hnRNP A2/B1 depleted vs controls)
    # Ensure BAM files are sorted and indexed (.bai)
    CONTROL_BAM_1="control_rep1.bam"
    CONTROL_BAM_2="control_rep2.bam"
    DEPLETED_BAM_1="depleted_rep1.bam"
    DEPLETED_BAM_2="depleted_rep2.bam"
    
    # --- Installation (commented out) ---
    # Clone DaPars repository
    # git clone https://github.com/zhenglab/DaPars.git
    # cd DaPars
    
    # Create a conda environment for DaPars dependencies (Python 2.7 is required)
    # conda create -n dapars_env python=2.7 numpy scipy pandas pysam -y
    # conda activate dapars_env
    
    # --- DaPars Workflow ---
    
    # Step 1: Prepare DaPars reference file
    # This step identifies potential APA sites based on the GTF and read coverage from a representative BAM.
    # Using one control BAM for preparation.
    DAPARS_REFERENCE_DIR="${OUTPUT_BASE_DIR}/dapars_reference"
    mkdir -p "${DAPARS_REFERENCE_DIR}"
    python "${DAPARS_DIR}/DaPars.py" prepare \
        -b "${CONTROL_BAM_1}" \
        -g "${GENCODE_GTF}" \
        -o "${DAPARS_REFERENCE_DIR}" \
        --species human # Assuming human, can be omitted if not specified
    
    DAPARS_REFERENCE_FILE="${DAPARS_REFERENCE_DIR}/DaPars_Reference.txt" # Default output name
    
    # Step 2: Run DaPars on individual samples
    # This step quantifies APA usage for each sample.
    CONTROL_OUTPUT_DIR="${OUTPUT_BASE_DIR}/control_dapars_output"
    DEPLETED_OUTPUT_DIR="${OUTPUT_BASE_DIR}/depleted_dapars_output"
    mkdir -p "${CONTROL_OUTPUT_DIR}" "${DEPLETED_OUTPUT_DIR}"
    
    # Run for control samples
    python "${DAPARS_DIR}/DaPars.py" run \
        -b "${CONTROL_BAM_1}" \
        -r "${DAPARS_REFERENCE_FILE}" \
        -o "${CONTROL_OUTPUT_DIR}/control_rep1"
    
    python "${DAPARS_DIR}/DaPars.py" run \
        -b "${CONTROL_BAM_2}" \
        -r "${DAPARS_REFERENCE_FILE}" \
        -o "${CONTROL_OUTPUT_DIR}/control_rep2"
    
    # Run for depleted samples
    python "${DAPARS_DIR}/DaPars.py" run \
        -b "${DEPLETED_BAM_1}" \
        -r "${DAPARS_REFERENCE_FILE}" \
        -o "${DEPLETED_OUTPUT_DIR}/depleted_rep1"
    
    python "${DAPARS_DIR}/DaPars.py" run \
        -b "${DEPLETED_BAM_2}" \
        -r "${DAPARS_REFERENCE_FILE}" \
        -o "${DEPLETED_OUTPUT_DIR}/depleted_rep2"
    
    # Step 3: Compare DaPars results between control and depleted samples
    # Create sample list files for comparison
    CONTROL_SAMPLE_LIST="${OUTPUT_BASE_DIR}/control_samples.txt"
    DEPLETED_SAMPLE_LIST="${OUTPUT_BASE_DIR}/depleted_samples.txt"
    
    # The 'run' step outputs a file like 'sample_name_DaPars_Output.txt'
    find "${CONTROL_OUTPUT_DIR}" -name "*_DaPars_Output.txt" > "${CONTROL_SAMPLE_LIST}"
    find "${DEPLETED_OUTPUT_DIR}" -name "*_DaPars_Output.txt" > "${DEPLETED_SAMPLE_LIST}"
    
    COMPARISON_OUTPUT_DIR="${OUTPUT_BASE_DIR}/dapars_comparison"
    mkdir -p "${COMPARISON_OUTPUT_DIR}"
    
    python "${DAPARS_DIR}/DaPars.py" compare \
        -c "${CONTROL_SAMPLE_LIST}" \
        -t "${DEPLETED_SAMPLE_LIST}" \
        -o "${COMPARISON_OUTPUT_DIR}" \
        --species human # Again, assuming human
  6. 6

    To identify significant APA events, we used the following cutoffs: FDR < 0.05, |ΔPDUI| ≥ 0.2, and |dPDUI| ≥ 0.2.

    DaPars (Inferred with models/gemini-2.5-flash) v2.0
    $ Bash example
    # Tool: DaPars (Inferred with models/gemini-2.5-flash)
    # Version: 2.0 (Inferred, corresponding to DaPars2)
    # Description: Filtering significant Alternative Polyadenylation (APA) events from DaPars output.
    # The input file 'dapars_output.txt' is assumed to be generated by DaPars2 differential analysis.
    # It typically contains columns like: GeneName, chr, strand, pas_pos, pas_type, PDUI_control, PDUI_treatment, dPDUI, p_value, FDR.
    # Reference genome (e.g., hg38) would have been used in the preceding DaPars run, but is not directly used in this filtering step.
    
    # Define cutoffs for significance
    FDR_CUTOFF=0.05
    DPDUI_CUTOFF=0.2
    
    # Define column indices for filtering (1-indexed, based on typical DaPars2 output)
    # Column 8: dPDUI (difference in Percent Different Usage Index)
    # Column 10: FDR (False Discovery Rate)
    DPDUI_COL=8
    FDR_COL=10
    
    # Input file from DaPars2 differential analysis
    INPUT_FILE="dapars_output.txt"
    # Output file for significant APA events
    OUTPUT_FILE="significant_apa_events.tsv"
    
    # Filter significant APA events based on the specified cutoffs:
    # FDR < 0.05, |dPDUI| >= 0.2
    # Note: The description mentions |ΔPDUI| >= 0.2 and |dPDUI| >= 0.2.
    # In DaPars, dPDUI is the primary metric for the change in PDUI.
    # We apply the absolute value check to dPDUI to satisfy the magnitude requirement.
    # If ΔPDUI refers to a distinct column, the command would need adjustment.
    awk -v FDR_CUTOFF="${FDR_CUTOFF}" -v DPDUI_CUTOFF="${DPDUI_CUTOFF}" \
        -v DPDUI_COL="${DPDUI_COL}" -v FDR_COL="${FDR_COL}" \
        'BEGIN {OFS="\t"} \
        NR==1 {print; next} \
        ( $(FDR_COL) < FDR_CUTOFF ) && ( sqrt( ($(DPDUI_COL))^2 ) >= DPDUI_CUTOFF ) {print}' \
        "${INPUT_FILE}" > "${OUTPUT_FILE}"
  7. 7

    APA scatter plots were generated using the R-Studio program with the ggplot2 library.Â

    R vR 4.x, ggplot2 3.x GitHub
    $ Bash example
    # Install R (example for Ubuntu/Debian)
    # sudo apt-get update
    # sudo apt-get install r-base
    
    # Install R packages (if not already installed)
    # R -e "install.packages('ggplot2', repos='http://cran.us.r-project.org')"
    
    # Create a dummy R script for APA scatter plot generation
    # In a real scenario, this script would be provided or generated based on specific APA data.
    cat << 'EOF' > generate_apa_scatter.R
    # Load necessary library
    library(ggplot2)
    
    # --- Configuration ---
    input_file <- "apa_data.tsv" # Placeholder for input data
    output_file <- "apa_scatter_plot.pdf" # Placeholder for output plot
    
    # --- Dummy Data Generation (replace with actual data loading) ---
    # In a real scenario, you would load your APA data here, e.g.:
    # data <- read.delim(input_file, header = TRUE, sep = "\t")
    # For demonstration, let's create some dummy data
    set.seed(123)
    data <- data.frame(
      Condition1_APA = rnorm(100, mean = 0, sd = 1),
      Condition2_APA = rnorm(100, mean = 0.5, sd = 1.2),
      Gene = paste0("Gene", 1:100)
    )
    
    # --- Generate Scatter Plot ---
    p <- ggplot(data, aes(x = Condition1_APA, y = Condition2_APA)) +
      geom_point(alpha = 0.7) +
      geom_smooth(method = "lm", se = FALSE, color = "blue") + # Add a linear regression line
      labs(
        title = "APA Scatter Plot",
        x = "APA Score (Condition 1)",
        y = "APA Score (Condition 2)"
      ) +
      theme_minimal()
    
    # --- Save the plot ---
    ggsave(output_file, plot = p, width = 7, height = 6)
    
    message(paste("APA scatter plot saved to:", output_file))
    EOF
    
    # Create a dummy input data file (replace with actual data)
    # This file would contain the APA scores for different conditions/genes
    cat << 'EOF' > apa_data.tsv
    Gene\tCondition1_APA\tCondition2_APA
    Gene1\t0.12\t0.34
    Gene2\t-0.56\t-0.12
    Gene3\t0.89\t1.01
    Gene4\t-0.23\t-0.05
    Gene5\t0.55\t0.67
    EOF
    
    # Execute the R script
    Rscript generate_apa_scatter.R
  8. 8

    counts of reads for each gene annotated in gencode vM1 were calculated from featureCounts

    featureCounts v2.0.3 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install subread (which includes featureCounts) if not already installed
    # conda install -c bioconda subread
    
    # Define input and output files
    # Replace 'input_aligned_reads.bam' with the actual path to your aligned BAM file(s).
    # If multiple BAM files, list them space-separated after the -s 0 parameter.
    INPUT_BAM="input_aligned_reads.bam"
    
    # Placeholder for gencode vM1 annotation GTF file.
    # You would need to obtain the specific gencode vM1 GTF file used in the original pipeline.
    # A common path structure might look like: /path/to/gencode/mouse/M1/gencode.vM1.annotation.gtf
    GENCODE_GTF="gencode.vM1.annotation.gtf"
    
    # Define the output file for gene counts
    OUTPUT_COUNTS="gene_counts.txt"
    
    # Execute featureCounts to calculate read counts for each gene
    # -a: Annotation file (GTF/GFF)
    # -o: Output file
    # -F GTF: Specify annotation file format as GTF
    # -t exon: Count reads mapping to 'exon' features
    # -g gene_id: Group features by 'gene_id' attribute to get gene-level counts
    # -s 0: Assume unstranded data (0=unstranded, 1=stranded, 2=reverse stranded - adjust based on assay)
    featureCounts -a "${GENCODE_GTF}" \
                  -o "${OUTPUT_COUNTS}" \
                  -F GTF \
                  -t exon \
                  -g gene_id \
                  -s 0 \
                  "${INPUT_BAM}"

Tools Used

Raw Source Text
Sequencing reads from RNA-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff' 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.
Reads were then mapped against a database of repetitive elements derived from RepBase18.05. Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).
Reads not mapped to Repbase sequences were aligned to the mm9 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within –outFilterMultimapNmax 1 –outFilterMultimapScoreRange 1.
Alternative polyadenylation sites were identified from RNA-seq data using the bioinformatics algorithm DaPars (Xia et al., 2014), which uses a regression model to locate endpoints of alternative polyadenylation sites, was used to identify differences in APA events between hnRNP A2/B1 depleted samples and controls. To identify significant APA events, we used the following cutoffs: FDR < 0.05, |ΔPDUI| ≥ 0.2, and |dPDUI| ≥ 0.2. APA scatter plots were generated using the R-Studio program with the ggplot2 library.Â
counts of reads for each gene annotated in gencode vM1 were calculated from featureCounts
Genome_build: mm9
Supplementary_files_format_and_content: rpkm file, containts rpkm for each sample
← Back to Analysis