GSE86043 Processing Pipeline

RNA-Seq code_examples 8 steps

Publication

Protein-RNA Networks Regulated by Normal and ALS-Associated Mutant HNRNPA2B1 in the Nervous System.

Neuron (2016) — PMID 27773581

Dataset

HNRNPA2B1 regulates alternative RNA processing in the nervous system and accumulates in granules in ALS IPSC-derived motor neurons [hnRNPA2B1_RNA-seq…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Sequencing reads from RNA-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff' 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.

cutadapt v3.4 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt

# Define input and output file paths
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="trimmed_output.fastq.gz"

cutadapt -o "${OUTPUT_FASTQ}" \
    --match-read-wildcards \
    --times 2 \
    -e 0 \
    -O 5 \
    --quality-cutoff 6 \
    -m 18 \
    -b TCGTATGCCGTCTTCTGCTTG \
    -b ATCTCGTATGCCGTCTTCTGCTTG \
    -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
    -b TGGAATTCTCGGGTGCCAAGG \
    -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \
    -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT \
    "${INPUT_FASTQ}"

View on GitHub

Reads were then mapped against a database of repetitive elements derived from RepBase18.05.

bowtie (Inferred with models/gemini-2.5-flash) v1.2.2 GitHub

$ Bash example

# Install bowtie
# conda install -c bioconda bowtie=1.2.2

# --- Reference Data Preparation (RepBase18.05) ---
# Note: RepBase requires a license for direct download. This is a placeholder for demonstration.
# Download RepBase 18.05 FASTA file (e.g., from GIRI website after obtaining a license)
# Example (replace with actual download method after obtaining license):
# wget -O RepBase18.05.fasta "https://www.girinst.org/downloads/RepBase18.05.fasta" 

# Build bowtie index for RepBase18.05
bowtie-build RepBase18.05.fasta RepBase18.05_index

# --- Mapping Reads ---
# Replace 'input_reads.fastq' with your actual input FASTQ file
# Replace 'repbase_aligned.sam' with your desired output SAM file name
bowtie -q -v 2 -a --best --strata -S RepBase18.05_index input_reads.fastq repbase_aligned.sam

View on GitHub

Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).

Bowtie v1.0.0 GitHub

$ Bash example

# Install Bowtie (if not already installed)
# conda install -c bioconda bowtie=1.0.0

# Placeholder for the Repbase index. This index must be pre-built using bowtie-build
# from the Repbase sequences. Replace 'path/to/repbase_index' with the actual path to your index.
REPBASE_INDEX="path/to/repbase_index"

# Placeholder for input reads. Replace 'path/to/input_reads.fastq' with the actual path to your FASTQ file.
INPUT_READS="path/to/input_reads.fastq"

# Output SAM file for aligned reads
OUTPUT_SAM="aligned_reads.sam"

# Run Bowtie alignment
bowtie -S -q -p 16 -e 100 -l 20 "${REPBASE_INDEX}" "${INPUT_READS}" > "${OUTPUT_SAM}"

View on GitHub

Reads not mapped to Repbase sequences were aligned to the mm9 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within âoutFilterMultimapNmax 1 âoutFilterMultimapScoreRange 1.

STAR v2.3.0e

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star

# Define variables for input reads and genome directory
# INPUT_READS should be the FASTQ/FASTA file containing reads not mapped to Repbase.
INPUT_READS="unmapped_reads.fastq" # Placeholder for input reads
# GENOME_DIR should point to the pre-built STAR index for the mm9 human genome.
# The mm9 genome FASTA and GTF can be downloaded from UCSC (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/).
# A STAR index for mm9 would need to be generated using STAR --runMode genomeGenerate.
GENOME_DIR="/path/to/STAR_mm9_index" # Placeholder for mm9 STAR index
OUTPUT_PREFIX="aligned_to_mm9"

STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_READS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 1 \
     --outFilterMultimapScoreRange 1 \
     --runThreadN 8 # Adjust thread count as needed

Alternative polyadenylation sites were identified from RNA-seq data using the bioinformatics algorithm DaPars (Xia et al., 2014), which uses a regression model to locate endpoints of alternative polyadenylation sites, was used to identify differences in APA events between hnRNP A2/B1 depleted samples and controls.

RNA-seq vv0.9.1

$ Bash example

# Define paths and filenames
DAPARS_DIR="DaPars" # Assuming DaPars is cloned here
OUTPUT_BASE_DIR="dapars_analysis"
mkdir -p "${OUTPUT_BASE_DIR}"

# Placeholder for reference genome and annotation
# Download Gencode annotation for hg38 (example)
# wget -P "${OUTPUT_BASE_DIR}" http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
# gunzip "${OUTPUT_BASE_DIR}/gencode.v38.annotation.gtf.gz"
GENCODE_GTF="${OUTPUT_BASE_DIR}/gencode.v38.annotation.gtf"

# Placeholder for input BAM files (hnRNP A2/B1 depleted vs controls)
# Ensure BAM files are sorted and indexed (.bai)
CONTROL_BAM_1="control_rep1.bam"
CONTROL_BAM_2="control_rep2.bam"
DEPLETED_BAM_1="depleted_rep1.bam"
DEPLETED_BAM_2="depleted_rep2.bam"

# --- Installation (commented out) ---
# Clone DaPars repository
# git clone https://github.com/zhenglab/DaPars.git
# cd DaPars

# Create a conda environment for DaPars dependencies (Python 2.7 is required)
# conda create -n dapars_env python=2.7 numpy scipy pandas pysam -y
# conda activate dapars_env

# --- DaPars Workflow ---

# Step 1: Prepare DaPars reference file
# This step identifies potential APA sites based on the GTF and read coverage from a representative BAM.
# Using one control BAM for preparation.
DAPARS_REFERENCE_DIR="${OUTPUT_BASE_DIR}/dapars_reference"
mkdir -p "${DAPARS_REFERENCE_DIR}"
python "${DAPARS_DIR}/DaPars.py" prepare \
    -b "${CONTROL_BAM_1}" \
    -g "${GENCODE_GTF}" \
    -o "${DAPARS_REFERENCE_DIR}" \
    --species human # Assuming human, can be omitted if not specified

DAPARS_REFERENCE_FILE="${DAPARS_REFERENCE_DIR}/DaPars_Reference.txt" # Default output name

# Step 2: Run DaPars on individual samples
# This step quantifies APA usage for each sample.
CONTROL_OUTPUT_DIR="${OUTPUT_BASE_DIR}/control_dapars_output"
DEPLETED_OUTPUT_DIR="${OUTPUT_BASE_DIR}/depleted_dapars_output"
mkdir -p "${CONTROL_OUTPUT_DIR}" "${DEPLETED_OUTPUT_DIR}"

# Run for control samples
python "${DAPARS_DIR}/DaPars.py" run \
    -b "${CONTROL_BAM_1}" \
    -r "${DAPARS_REFERENCE_FILE}" \
    -o "${CONTROL_OUTPUT_DIR}/control_rep1"

python "${DAPARS_DIR}/DaPars.py" run \
    -b "${CONTROL_BAM_2}" \
    -r "${DAPARS_REFERENCE_FILE}" \
    -o "${CONTROL_OUTPUT_DIR}/control_rep2"

# Run for depleted samples
python "${DAPARS_DIR}/DaPars.py" run \
    -b "${DEPLETED_BAM_1}" \
    -r "${DAPARS_REFERENCE_FILE}" \
    -o "${DEPLETED_OUTPUT_DIR}/depleted_rep1"

python "${DAPARS_DIR}/DaPars.py" run \
    -b "${DEPLETED_BAM_2}" \
    -r "${DAPARS_REFERENCE_FILE}" \
    -o "${DEPLETED_OUTPUT_DIR}/depleted_rep2"

# Step 3: Compare DaPars results between control and depleted samples
# Create sample list files for comparison
CONTROL_SAMPLE_LIST="${OUTPUT_BASE_DIR}/control_samples.txt"
DEPLETED_SAMPLE_LIST="${OUTPUT_BASE_DIR}/depleted_samples.txt"

# The 'run' step outputs a file like 'sample_name_DaPars_Output.txt'
find "${CONTROL_OUTPUT_DIR}" -name "*_DaPars_Output.txt" > "${CONTROL_SAMPLE_LIST}"
find "${DEPLETED_OUTPUT_DIR}" -name "*_DaPars_Output.txt" > "${DEPLETED_SAMPLE_LIST}"

COMPARISON_OUTPUT_DIR="${OUTPUT_BASE_DIR}/dapars_comparison"
mkdir -p "${COMPARISON_OUTPUT_DIR}"

python "${DAPARS_DIR}/DaPars.py" compare \
    -c "${CONTROL_SAMPLE_LIST}" \
    -t "${DEPLETED_SAMPLE_LIST}" \
    -o "${COMPARISON_OUTPUT_DIR}" \
    --species human # Again, assuming human

To identify significant APA events, we used the following cutoffs: FDR < 0.05, |ÎPDUI| â¥ 0.2, and |dPDUI| â¥ 0.2.

DaPars (Inferred with models/gemini-2.5-flash) v2.0

$ Bash example

# Tool: DaPars (Inferred with models/gemini-2.5-flash)
# Version: 2.0 (Inferred, corresponding to DaPars2)
# Description: Filtering significant Alternative Polyadenylation (APA) events from DaPars output.
# The input file 'dapars_output.txt' is assumed to be generated by DaPars2 differential analysis.
# It typically contains columns like: GeneName, chr, strand, pas_pos, pas_type, PDUI_control, PDUI_treatment, dPDUI, p_value, FDR.
# Reference genome (e.g., hg38) would have been used in the preceding DaPars run, but is not directly used in this filtering step.

# Define cutoffs for significance
FDR_CUTOFF=0.05
DPDUI_CUTOFF=0.2

# Define column indices for filtering (1-indexed, based on typical DaPars2 output)
# Column 8: dPDUI (difference in Percent Different Usage Index)
# Column 10: FDR (False Discovery Rate)
DPDUI_COL=8
FDR_COL=10

# Input file from DaPars2 differential analysis
INPUT_FILE="dapars_output.txt"
# Output file for significant APA events
OUTPUT_FILE="significant_apa_events.tsv"

# Filter significant APA events based on the specified cutoffs:
# FDR < 0.05, |dPDUI| >= 0.2
# Note: The description mentions |ΔPDUI| >= 0.2 and |dPDUI| >= 0.2.
# In DaPars, dPDUI is the primary metric for the change in PDUI.
# We apply the absolute value check to dPDUI to satisfy the magnitude requirement.
# If ΔPDUI refers to a distinct column, the command would need adjustment.
awk -v FDR_CUTOFF="${FDR_CUTOFF}" -v DPDUI_CUTOFF="${DPDUI_CUTOFF}" \
    -v DPDUI_COL="${DPDUI_COL}" -v FDR_COL="${FDR_COL}" \
    'BEGIN {OFS="\t"} \
    NR==1 {print; next} \
    ( $(FDR_COL) < FDR_CUTOFF ) && ( sqrt( ($(DPDUI_COL))^2 ) >= DPDUI_CUTOFF ) {print}' \
    "${INPUT_FILE}" > "${OUTPUT_FILE}"

APA scatter plots were generated using the R-Studio program with the ggplot2 library.Â

R vR 4.x, ggplot2 3.x GitHub

$ Bash example

# Install R (example for Ubuntu/Debian)
# sudo apt-get update
# sudo apt-get install r-base

# Install R packages (if not already installed)
# R -e "install.packages('ggplot2', repos='http://cran.us.r-project.org')"

# Create a dummy R script for APA scatter plot generation
# In a real scenario, this script would be provided or generated based on specific APA data.
cat << 'EOF' > generate_apa_scatter.R
# Load necessary library
library(ggplot2)

# --- Configuration ---
input_file <- "apa_data.tsv" # Placeholder for input data
output_file <- "apa_scatter_plot.pdf" # Placeholder for output plot

# --- Dummy Data Generation (replace with actual data loading) ---
# In a real scenario, you would load your APA data here, e.g.:
# data <- read.delim(input_file, header = TRUE, sep = "\t")
# For demonstration, let's create some dummy data
set.seed(123)
data <- data.frame(
  Condition1_APA = rnorm(100, mean = 0, sd = 1),
  Condition2_APA = rnorm(100, mean = 0.5, sd = 1.2),
  Gene = paste0("Gene", 1:100)
)

# --- Generate Scatter Plot ---
p <- ggplot(data, aes(x = Condition1_APA, y = Condition2_APA)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") + # Add a linear regression line
  labs(
    title = "APA Scatter Plot",
    x = "APA Score (Condition 1)",
    y = "APA Score (Condition 2)"
  ) +
  theme_minimal()

# --- Save the plot ---
ggsave(output_file, plot = p, width = 7, height = 6)

message(paste("APA scatter plot saved to:", output_file))
EOF

# Create a dummy input data file (replace with actual data)
# This file would contain the APA scores for different conditions/genes
cat << 'EOF' > apa_data.tsv
Gene\tCondition1_APA\tCondition2_APA
Gene1\t0.12\t0.34
Gene2\t-0.56\t-0.12
Gene3\t0.89\t1.01
Gene4\t-0.23\t-0.05
Gene5\t0.55\t0.67
EOF

# Execute the R script
Rscript generate_apa_scatter.R

View on GitHub

counts of reads for each gene annotated in gencode vM1 were calculated from featureCounts

featureCounts v2.0.3 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install subread (which includes featureCounts) if not already installed
# conda install -c bioconda subread

# Define input and output files
# Replace 'input_aligned_reads.bam' with the actual path to your aligned BAM file(s).
# If multiple BAM files, list them space-separated after the -s 0 parameter.
INPUT_BAM="input_aligned_reads.bam"

# Placeholder for gencode vM1 annotation GTF file.
# You would need to obtain the specific gencode vM1 GTF file used in the original pipeline.
# A common path structure might look like: /path/to/gencode/mouse/M1/gencode.vM1.annotation.gtf
GENCODE_GTF="gencode.vM1.annotation.gtf"

# Define the output file for gene counts
OUTPUT_COUNTS="gene_counts.txt"

# Execute featureCounts to calculate read counts for each gene
# -a: Annotation file (GTF/GFF)
# -o: Output file
# -F GTF: Specify annotation file format as GTF
# -t exon: Count reads mapping to 'exon' features
# -g gene_id: Group features by 'gene_id' attribute to get gene-level counts
# -s 0: Assume unstranded data (0=unstranded, 1=stranded, 2=reverse stranded - adjust based on assay)
featureCounts -a "${GENCODE_GTF}" \
              -o "${OUTPUT_COUNTS}" \
              -F GTF \
              -t exon \
              -g gene_id \
              -s 0 \
              "${INPUT_BAM}"

Tools Used

STAR RNA-seq R

Raw Source Text

Sequencing reads from RNA-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff' 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.
Reads were then mapped against a database of repetitive elements derived from RepBase18.05. Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).
Reads not mapped to Repbase sequences were aligned to the mm9 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within âoutFilterMultimapNmax 1 âoutFilterMultimapScoreRange 1.
Alternative polyadenylation sites were identified from RNA-seq data using the bioinformatics algorithm DaPars (Xia et al., 2014), which uses a regression model to locate endpoints of alternative polyadenylation sites, was used to identify differences in APA events between hnRNP A2/B1 depleted samples and controls. To identify significant APA events, we used the following cutoffs: FDR < 0.05, |ÎPDUI| â¥ 0.2, and |dPDUI| â¥ 0.2. APA scatter plots were generated using the R-Studio program with the ggplot2 library.Â
counts of reads for each gene annotated in gencode vM1 were calculated from featureCounts
Genome_build: mm9
Supplementary_files_format_and_content: rpkm file, containts rpkm for each sample

← Back to Analysis