GSE159074 Processing Pipeline

RNA-Seq code_examples 4 steps

Publication

Predicting the functional states of human iPSC-derived neurons with single-cell RNA-seq and electrophysiology.

Molecular psychiatry (2016) — PMID 27698428

Dataset

GSE159074

Predicting the functional states of human iPSC-derived neurons with single-cell RNA-seq and electrophysiology

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Raw sequencing reads were mapped to the human reference transcriptome (GENCODE v19) using gapped-alignment strategies.

    GENCODE v2.5.3a GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    GENOME_DIR="star_index_gencode_v19"
    GENOME_FASTA="Homo_sapiens.GRCh37.dna.primary_assembly.fa"
    GTF_FILE="gencode.v19.annotation.gtf"
    READ1="raw_reads_R1.fastq.gz" # Placeholder for raw sequencing reads
    READ2="raw_reads_R2.fastq.gz" # Placeholder for raw sequencing reads (remove if single-end)
    OUTPUT_PREFIX="aligned_reads"
    THREADS=8 # Number of threads to use
    READ_LENGTH=100 # Assumed read length for sjdbOverhang (e.g., 100bp)
    
    # 1. Download reference genome (GRCh37/hg19) and GENCODE v19 GTF
    # For GRCh37 primary assembly (Ensembl release 75 corresponds to GENCODE v19)
    # wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
    # gunzip Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
    
    # For GENCODE v19 GTF
    # wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
    # gunzip gencode.v19.annotation.gtf.gz
    
    # Create genome directory
    mkdir -p "${GENOME_DIR}"
    
    # 2. Build STAR genome index
    STAR --runMode genomeGenerate \
         --genomeDir "${GENOME_DIR}" \
         --genomeFastaFiles "${GENOME_FASTA}" \
         --sjdbGTFfile "${GTF_FILE}" \
         --sjdbOverhang $((READ_LENGTH - 1)) \
         --runThreadN "${THREADS}"
    
    # 3. Perform gapped-alignment of raw sequencing reads to the transcriptome
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ1}" "${READ2}" \
         --readFilesCommand zcat \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}." \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outFilterType BySJout \
         --outFilterMultimapNmax 20 \
         --alignSJDBoverhangMin 1 \
         --alignSJoverhangMin 8 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --outReadsUnmapped Fastx
  2. 2

    Alignment was performed by STAR (version 2.3.0).

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star=2.3.0
    
    # Placeholder for reference genome index directory. 
    # For a real analysis, replace 'path/to/GRCh38_STAR_index' with the actual path to your STAR genome index.
    # If you need to generate the index, here's an example (uncomment and modify as needed):
    # STAR --runMode genomeGenerate \
    #      --genomeDir path/to/GRCh38_STAR_index \
    #      --genomeFastaFiles path/to/GRCh38.fa \
    #      --sjdbGTFfile path/to/GRCh38.gtf \
    #      --runThreadN 16
    
    # Placeholder for input FASTQ files. 
    # Replace 'read1.fastq.gz' and 'read2.fastq.gz' with your actual input file names.
    # Assuming paired-end reads, common for RNA-seq.
    # If single-end, remove --readFilesCommand zcat and the second FASTQ file.
    
    # Perform alignment using STAR
    STAR --genomeDir path/to/GRCh38_STAR_index \
         --readFilesIn read1.fastq.gz read2.fastq.gz \
         --runThreadN 8 \
         --outFileNamePrefix aligned_reads_ \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outFilterType BySJout \
         --outFilterMultimapNmax 20 \
         --alignSJDBoverhangMin 1 \
         --alignSJoverhangMin 8 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --limitBAMsortRAM 30000000000 \
         --readFilesCommand zcat
  3. 3

    Gene-level quantification was performed with HTseq (version 0.6.1).

    HTSeq v0.6.1 GitHub
    $ Bash example
    # Install HTSeq (if not already installed)
    # conda install -c bioconda htseq
    # pip install HTSeq
    
    # Define input files and parameters (placeholders - replace with actual paths and values)
    # aligned_reads.bam: Path to your alignment file (BAM format).
    # annotation.gtf: Path to your gene annotation file (GTF format).
    #                  Example: GENCODE human hg38 (gencode.v44.annotation.gtf) from https://www.gencodegenes.org/human/release_44.html
    # output_counts.txt: Desired name for the output count file.
    
    # HTSeq htseq-count command for gene-level quantification
    # Parameters used are common defaults or reasonable assumptions, as specific parameters were not provided in the description.
    # -f bam: Input file format is BAM.
    # -r pos: Order of features in GTF/GFF file (default is 'pos').
    # -s reverse: Strandedness of the library. Common options are 'yes', 'no', 'reverse'.
    #             This is a critical parameter; verify your library's strandedness based on your library preparation protocol.
    # -a 10: Minimum alignment quality score (default is 10).
    # -m union: Mode to handle reads overlapping multiple features (default is 'union').
    #           Other options: 'intersection-strict', 'intersection-nonempty'.
    # --idattr=gene_id: Attribute in the GTF file to use as feature ID (default is 'gene_id').
    
    htseq-count \
        -f bam \
        -r pos \
        -s reverse \
        -a 10 \
        -m union \
        --idattr=gene_id \
        aligned_reads.bam \
        annotation.gtf \
        > output_counts.txt
    
    # Note: Replace 'aligned_reads.bam', 'annotation.gtf', and 'output_counts.txt'
    #       with your actual file paths and desired output name.
    #       Adjust the '-s' (strandedness) parameter based on your library preparation.
  4. 4

    Per-gene expression outputs were scaled to transcripts per million (TPM) units.

    RSEM (Inferred with models/gemini-2.5-flash) v1.3.0 GitHub
    $ Bash example
    # Install RSEM (if not already installed)
    # conda install -c bioconda rsem=1.3.0
    
    # Define variables
    # REFERENCE_NAME: Placeholder for the RSEM reference name (e.g., GRCh38, mm10).
    # This reference must have been prepared previously using rsem-prepare-reference.
    REFERENCE_NAME="GRCh38" 
    RSEM_REFERENCE_DIR="/path/to/rsem_references/${REFERENCE_NAME}" # Directory containing RSEM reference files
    
    # INPUT_BAM: Input alignment file, typically generated by an aligner like STAR.
    # Ensure this BAM file is sorted and indexed.
    INPUT_BAM="aligned_reads.bam" 
    
    OUTPUT_DIR="rsem_quantification_output"
    SAMPLE_NAME="sample1"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run RSEM to calculate gene and isoform expression and scale to TPM.
    # This example assumes a paired-end, unstranded library and a STAR-aligned BAM file.
    # Adjust parameters like --paired-end, --strandedness, --fragment-length-dist, 
    # --forward-prob, --seed, -p (number of threads), etc., as needed for your specific data.
    # --no-qualities is often used for BAM files from aligners like STAR to save memory.
    rsem-calculate-expression \
        --bam \
        --paired-end \
        --no-qualities \
        -p 8 \
        "${INPUT_BAM}" \
        "${RSEM_REFERENCE_DIR}/${REFERENCE_NAME}" \
        "${OUTPUT_DIR}/${SAMPLE_NAME}"
    
    # The output files will include:
    # ${OUTPUT_DIR}/${SAMPLE_NAME}.genes.results (contains TPM values for genes)
    # ${OUTPUT_DIR}/${SAMPLE_NAME}.isoforms.results (contains TPM values for isoforms)

Tools Used

Raw Source Text
Raw sequencing reads were mapped to the human reference transcriptome (GENCODE v19) using gapped-alignment strategies.
Alignment was performed by STAR (version 2.3.0).
Gene-level quantification was performed with HTseq (version 0.6.1).
Per-gene expression outputs were scaled to transcripts per million (TPM) units.
Genome_build: GENCODE v19
Supplementary_files_format_and_content: tab-delimited text file provides transcripts per million (TPM) expression values (rows = genes, columns = cells)
← Back to Analysis