GSE159074 Processing Pipeline

RNA-Seq code_examples 4 steps

Publication

Predicting the functional states of human iPSC-derived neurons with single-cell RNA-seq and electrophysiology.

Molecular psychiatry (2016) — PMID 27698428

Dataset

Predicting the functional states of human iPSC-derived neurons with single-cell RNA-seq and electrophysiology

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Raw sequencing reads were mapped to the human reference transcriptome (GENCODE v19) using gapped-alignment strategies.

GENCODE v2.5.3a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
GENOME_DIR="star_index_gencode_v19"
GENOME_FASTA="Homo_sapiens.GRCh37.dna.primary_assembly.fa"
GTF_FILE="gencode.v19.annotation.gtf"
READ1="raw_reads_R1.fastq.gz" # Placeholder for raw sequencing reads
READ2="raw_reads_R2.fastq.gz" # Placeholder for raw sequencing reads (remove if single-end)
OUTPUT_PREFIX="aligned_reads"
THREADS=8 # Number of threads to use
READ_LENGTH=100 # Assumed read length for sjdbOverhang (e.g., 100bp)

# 1. Download reference genome (GRCh37/hg19) and GENCODE v19 GTF
# For GRCh37 primary assembly (Ensembl release 75 corresponds to GENCODE v19)
# wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
# gunzip Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz

# For GENCODE v19 GTF
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip gencode.v19.annotation.gtf.gz

# Create genome directory
mkdir -p "${GENOME_DIR}"

# 2. Build STAR genome index
STAR --runMode genomeGenerate \
     --genomeDir "${GENOME_DIR}" \
     --genomeFastaFiles "${GENOME_FASTA}" \
     --sjdbGTFfile "${GTF_FILE}" \
     --sjdbOverhang $((READ_LENGTH - 1)) \
     --runThreadN "${THREADS}"

# 3. Perform gapped-alignment of raw sequencing reads to the transcriptome
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}." \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --outReadsUnmapped Fastx

View on GitHub

Alignment was performed by STAR (version 2.3.0).

STAR v2.3.0 GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.3.0

# Placeholder for reference genome index directory. 
# For a real analysis, replace 'path/to/GRCh38_STAR_index' with the actual path to your STAR genome index.
# If you need to generate the index, here's an example (uncomment and modify as needed):
# STAR --runMode genomeGenerate \
#      --genomeDir path/to/GRCh38_STAR_index \
#      --genomeFastaFiles path/to/GRCh38.fa \
#      --sjdbGTFfile path/to/GRCh38.gtf \
#      --runThreadN 16

# Placeholder for input FASTQ files. 
# Replace 'read1.fastq.gz' and 'read2.fastq.gz' with your actual input file names.
# Assuming paired-end reads, common for RNA-seq.
# If single-end, remove --readFilesCommand zcat and the second FASTQ file.

# Perform alignment using STAR
STAR --genomeDir path/to/GRCh38_STAR_index \
     --readFilesIn read1.fastq.gz read2.fastq.gz \
     --runThreadN 8 \
     --outFileNamePrefix aligned_reads_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 30000000000 \
     --readFilesCommand zcat

View on GitHub

Gene-level quantification was performed with HTseq (version 0.6.1).

HTSeq v0.6.1 GitHub

$ Bash example

# Install HTSeq (if not already installed)
# conda install -c bioconda htseq
# pip install HTSeq

# Define input files and parameters (placeholders - replace with actual paths and values)
# aligned_reads.bam: Path to your alignment file (BAM format).
# annotation.gtf: Path to your gene annotation file (GTF format).
#                  Example: GENCODE human hg38 (gencode.v44.annotation.gtf) from https://www.gencodegenes.org/human/release_44.html
# output_counts.txt: Desired name for the output count file.

# HTSeq htseq-count command for gene-level quantification
# Parameters used are common defaults or reasonable assumptions, as specific parameters were not provided in the description.
# -f bam: Input file format is BAM.
# -r pos: Order of features in GTF/GFF file (default is 'pos').
# -s reverse: Strandedness of the library. Common options are 'yes', 'no', 'reverse'.
#             This is a critical parameter; verify your library's strandedness based on your library preparation protocol.
# -a 10: Minimum alignment quality score (default is 10).
# -m union: Mode to handle reads overlapping multiple features (default is 'union').
#           Other options: 'intersection-strict', 'intersection-nonempty'.
# --idattr=gene_id: Attribute in the GTF file to use as feature ID (default is 'gene_id').

htseq-count \
    -f bam \
    -r pos \
    -s reverse \
    -a 10 \
    -m union \
    --idattr=gene_id \
    aligned_reads.bam \
    annotation.gtf \
    > output_counts.txt

# Note: Replace 'aligned_reads.bam', 'annotation.gtf', and 'output_counts.txt'
#       with your actual file paths and desired output name.
#       Adjust the '-s' (strandedness) parameter based on your library preparation.

View on GitHub

Per-gene expression outputs were scaled to transcripts per million (TPM) units.

RSEM (Inferred with models/gemini-2.5-flash) v1.3.0 GitHub

$ Bash example

# Install RSEM (if not already installed)
# conda install -c bioconda rsem=1.3.0

# Define variables
# REFERENCE_NAME: Placeholder for the RSEM reference name (e.g., GRCh38, mm10).
# This reference must have been prepared previously using rsem-prepare-reference.
REFERENCE_NAME="GRCh38" 
RSEM_REFERENCE_DIR="/path/to/rsem_references/${REFERENCE_NAME}" # Directory containing RSEM reference files

# INPUT_BAM: Input alignment file, typically generated by an aligner like STAR.
# Ensure this BAM file is sorted and indexed.
INPUT_BAM="aligned_reads.bam" 

OUTPUT_DIR="rsem_quantification_output"
SAMPLE_NAME="sample1"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run RSEM to calculate gene and isoform expression and scale to TPM.
# This example assumes a paired-end, unstranded library and a STAR-aligned BAM file.
# Adjust parameters like --paired-end, --strandedness, --fragment-length-dist, 
# --forward-prob, --seed, -p (number of threads), etc., as needed for your specific data.
# --no-qualities is often used for BAM files from aligners like STAR to save memory.
rsem-calculate-expression \
    --bam \
    --paired-end \
    --no-qualities \
    -p 8 \
    "${INPUT_BAM}" \
    "${RSEM_REFERENCE_DIR}/${REFERENCE_NAME}" \
    "${OUTPUT_DIR}/${SAMPLE_NAME}"

# The output files will include:
# ${OUTPUT_DIR}/${SAMPLE_NAME}.genes.results (contains TPM values for genes)
# ${OUTPUT_DIR}/${SAMPLE_NAME}.isoforms.results (contains TPM values for isoforms)

View on GitHub

Tools Used

STAR

Raw Source Text

Raw sequencing reads were mapped to the human reference transcriptome (GENCODE v19) using gapped-alignment strategies.
Alignment was performed by STAR (version 2.3.0).
Gene-level quantification was performed with HTseq (version 0.6.1).
Per-gene expression outputs were scaled to transcripts per million (TPM) units.
Genome_build: GENCODE v19
Supplementary_files_format_and_content: tab-delimited text file provides transcripts per million (TPM) expression values (rows = genes, columns = cells)

← Back to Analysis