GSE59288 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

Widespread RNA editing dysregulation in brains from autistic individuals.

Nature neuroscience (2019) — PMID 30559470

Dataset

Disruption of human-specific synaptogenesis program in autism [RNA-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

The raw sequencing reads were mapped to each speciesâ reference genome and the splice junctions using Bowtie, allowing up to three mismatches.

Bowtie v2.4.5 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Bowtie2 (if not already installed)
# conda install -c bioconda bowtie2

# Placeholder for reference genome and splice junction index. 
# Bowtie2 is not inherently splice-aware. For RNA-seq alignment involving splice junctions, 
# splice-aware aligners like STAR or HISAT2 are typically used. 
# If Bowtie2 was used, it implies a custom index that incorporates splice junction sequences, 
# or that splice junction handling was performed by a downstream tool or a custom pre-processing step.
# For demonstration, we assume a pre-built Bowtie2 index named 'genome_and_junctions_index'.
# Example for building a standard genome index (replace 'reference.fa' with your actual reference genome file):
# bowtie2-build reference.fa genome_and_junctions_index

# Align raw sequencing reads to the reference genome and splice junctions
# -x: specify the basename of the index
# -N 3: allow up to 3 mismatches in the seed alignment (as per description)
# -U: specify single-end reads (use -1 and -2 for paired-end reads)
# -S: output alignments in SAM format

bowtie2 -x genome_and_junctions_index -N 3 -U raw_sequencing_reads.fastq -S aligned_reads.sam

View on GitHub

Only uniquely mapped reads were used to calculate the gene expression levels.

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install RSEM via conda
# conda install -c bioconda rsem

# Define variables
# Placeholder: Replace with actual genome FASTA (e.g., GRCh38.primary_assembly.genome.fa)
GENOME_FA="path/to/reference_genome.fa"
# Placeholder: Replace with actual gene annotation GTF (e.g., gencode.v38.annotation.gtf)
GTF_FILE="path/to/gene_annotation.gtf"
# Name for the RSEM reference index
RSEM_REF_NAME="rsem_reference_index"
# Placeholder: Replace with your uniquely mapped BAM file
# This BAM file is assumed to contain only uniquely mapped reads, as per the description.
INPUT_BAM="path/to/sample_uniquely_mapped.bam"
# Prefix for output files
OUTPUT_PREFIX="sample_gene_expression"
# Number of threads to use
NUM_THREADS=8

# 1. Prepare RSEM reference (run once per reference genome/annotation combination)
# This step builds the RSEM index. It's crucial that the GTF and genome FASTA match the alignment.
# If you have already prepared an RSEM reference, you can skip this step.
# rsem-prepare-reference --gtf "${GTF_FILE}" --star --num-threads "${NUM_THREADS}" "${GENOME_FA}" "${RSEM_REF_NAME}"

# 2. Calculate gene expression levels from uniquely mapped BAM files
# --bam: specifies that the input is a BAM file.
# --no-qualities: speeds up processing by ignoring quality scores (often not needed for quantification).
# --forward-prob 0.5: assumes unstranded data. Adjust to 1.0 for forward-stranded, 0.0 for reverse-stranded.
# The description "Only uniquely mapped reads were used" implies that the input BAM file
# has already been filtered to contain only uniquely mapped reads (e.g., using samtools view -q 1).
# RSEM will then quantify expression based on these provided unique alignments.
rsem-calculate-expression \
    --bam \
    --no-qualities \
    --forward-prob 0.5 \
    --num-threads "${NUM_THREADS}" \
    "${INPUT_BAM}" \
    "${RSEM_REF_NAME}" \
    "${OUTPUT_PREFIX}"

# Output files will include:
# ${OUTPUT_PREFIX}.genes.results (gene-level expression values like TPM, FPKM, counts)
# ${OUTPUT_PREFIX}.isoforms.results (isoform-level expression values)
# ${OUTPUT_PREFIX}.stat (statistics about the run)

View on GitHub

Gene expression levels were calculated using the read coverage normalized by gene length and the total number of reads mapped to genes in a sample.

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install RSEM (if not already installed)
# conda install -c bioconda rsem

# Placeholder for reference genome and annotation (e.g., human hg38 and GENCODE v44)
# Download human reference genome (e.g., hg38 primary assembly)
# wget -O genome.fa.gz "http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
# gunzip genome.fa.gz
# Download human gene annotation (e.g., GENCODE v44 GTF for GRCh38)
# wget -O annotation.gtf.gz "http://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz"
# gunzip annotation.gtf.gz

# Prepare RSEM reference (run once per reference, e.g., for hg38)
# rsem-prepare-reference --gtf annotation.gtf genome.fa RSEM_reference_index

# Assume 'aligned_reads.bam' is the input BAM file from a splice-aware aligner (e.g., STAR)
# Assume 'RSEM_reference_index' is the directory created by rsem-prepare-reference
# Assume 'sample_output' is the desired prefix for output files (e.g., 'SRR1234567')

# Calculate gene expression levels using mapped reads, normalizing by gene length and total mapped reads.
# Use --paired-end if reads are paired-end, otherwise omit for single-end.
rsem-calculate-expression \
    --bam \
    --no-qualities \
    --paired-end \
    aligned_reads.bam \
    RSEM_reference_index \
    sample_output

View on GitHub

Raw Source Text

The raw sequencing reads were mapped to each speciesâ reference genome and the splice junctions using Bowtie, allowing up to three mismatches.
Only uniquely mapped reads were used to calculate the gene expression levels.
Gene expression levels were calculated using the read coverage normalized by gene length and the total number of reads mapped to genes in a sample.
Genome_build: human: hg19; chimpanzee: panTro3.
Supplementary_files_format_and_content: tab-delimited text file shows RPKM values of each gene in each sample.

← Back to Analysis