GSE159074 Processing Pipeline
RNA-Seq
code_examples
4 steps
Publication
Predicting the functional states of human iPSC-derived neurons with single-cell RNA-seq and electrophysiology.Molecular psychiatry (2016) — PMID 27698428
Dataset
GSE159074Predicting the functional states of human iPSC-derived neurons with single-cell RNA-seq and electrophysiology
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Raw sequencing reads were mapped to the human reference transcriptome (GENCODE v19) using gapped-alignment strategies.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables GENOME_DIR="star_index_gencode_v19" GENOME_FASTA="Homo_sapiens.GRCh37.dna.primary_assembly.fa" GTF_FILE="gencode.v19.annotation.gtf" READ1="raw_reads_R1.fastq.gz" # Placeholder for raw sequencing reads READ2="raw_reads_R2.fastq.gz" # Placeholder for raw sequencing reads (remove if single-end) OUTPUT_PREFIX="aligned_reads" THREADS=8 # Number of threads to use READ_LENGTH=100 # Assumed read length for sjdbOverhang (e.g., 100bp) # 1. Download reference genome (GRCh37/hg19) and GENCODE v19 GTF # For GRCh37 primary assembly (Ensembl release 75 corresponds to GENCODE v19) # wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz # gunzip Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz # For GENCODE v19 GTF # wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz # gunzip gencode.v19.annotation.gtf.gz # Create genome directory mkdir -p "${GENOME_DIR}" # 2. Build STAR genome index STAR --runMode genomeGenerate \ --genomeDir "${GENOME_DIR}" \ --genomeFastaFiles "${GENOME_FASTA}" \ --sjdbGTFfile "${GTF_FILE}" \ --sjdbOverhang $((READ_LENGTH - 1)) \ --runThreadN "${THREADS}" # 3. Perform gapped-alignment of raw sequencing reads to the transcriptome STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --readFilesCommand zcat \ --runThreadN "${THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}." \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --alignSJDBoverhangMin 1 \ --alignSJoverhangMin 8 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --outReadsUnmapped Fastx -
2
Alignment was performed by STAR (version 2.3.0).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.3.0 # Placeholder for reference genome index directory. # For a real analysis, replace 'path/to/GRCh38_STAR_index' with the actual path to your STAR genome index. # If you need to generate the index, here's an example (uncomment and modify as needed): # STAR --runMode genomeGenerate \ # --genomeDir path/to/GRCh38_STAR_index \ # --genomeFastaFiles path/to/GRCh38.fa \ # --sjdbGTFfile path/to/GRCh38.gtf \ # --runThreadN 16 # Placeholder for input FASTQ files. # Replace 'read1.fastq.gz' and 'read2.fastq.gz' with your actual input file names. # Assuming paired-end reads, common for RNA-seq. # If single-end, remove --readFilesCommand zcat and the second FASTQ file. # Perform alignment using STAR STAR --genomeDir path/to/GRCh38_STAR_index \ --readFilesIn read1.fastq.gz read2.fastq.gz \ --runThreadN 8 \ --outFileNamePrefix aligned_reads_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --alignSJDBoverhangMin 1 \ --alignSJoverhangMin 8 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM 30000000000 \ --readFilesCommand zcat -
3
Gene-level quantification was performed with HTseq (version 0.6.1).
$ Bash example
# Install HTSeq (if not already installed) # conda install -c bioconda htseq # pip install HTSeq # Define input files and parameters (placeholders - replace with actual paths and values) # aligned_reads.bam: Path to your alignment file (BAM format). # annotation.gtf: Path to your gene annotation file (GTF format). # Example: GENCODE human hg38 (gencode.v44.annotation.gtf) from https://www.gencodegenes.org/human/release_44.html # output_counts.txt: Desired name for the output count file. # HTSeq htseq-count command for gene-level quantification # Parameters used are common defaults or reasonable assumptions, as specific parameters were not provided in the description. # -f bam: Input file format is BAM. # -r pos: Order of features in GTF/GFF file (default is 'pos'). # -s reverse: Strandedness of the library. Common options are 'yes', 'no', 'reverse'. # This is a critical parameter; verify your library's strandedness based on your library preparation protocol. # -a 10: Minimum alignment quality score (default is 10). # -m union: Mode to handle reads overlapping multiple features (default is 'union'). # Other options: 'intersection-strict', 'intersection-nonempty'. # --idattr=gene_id: Attribute in the GTF file to use as feature ID (default is 'gene_id'). htseq-count \ -f bam \ -r pos \ -s reverse \ -a 10 \ -m union \ --idattr=gene_id \ aligned_reads.bam \ annotation.gtf \ > output_counts.txt # Note: Replace 'aligned_reads.bam', 'annotation.gtf', and 'output_counts.txt' # with your actual file paths and desired output name. # Adjust the '-s' (strandedness) parameter based on your library preparation. -
4
Per-gene expression outputs were scaled to transcripts per million (TPM) units.
$ Bash example
# Install RSEM (if not already installed) # conda install -c bioconda rsem=1.3.0 # Define variables # REFERENCE_NAME: Placeholder for the RSEM reference name (e.g., GRCh38, mm10). # This reference must have been prepared previously using rsem-prepare-reference. REFERENCE_NAME="GRCh38" RSEM_REFERENCE_DIR="/path/to/rsem_references/${REFERENCE_NAME}" # Directory containing RSEM reference files # INPUT_BAM: Input alignment file, typically generated by an aligner like STAR. # Ensure this BAM file is sorted and indexed. INPUT_BAM="aligned_reads.bam" OUTPUT_DIR="rsem_quantification_output" SAMPLE_NAME="sample1" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run RSEM to calculate gene and isoform expression and scale to TPM. # This example assumes a paired-end, unstranded library and a STAR-aligned BAM file. # Adjust parameters like --paired-end, --strandedness, --fragment-length-dist, # --forward-prob, --seed, -p (number of threads), etc., as needed for your specific data. # --no-qualities is often used for BAM files from aligners like STAR to save memory. rsem-calculate-expression \ --bam \ --paired-end \ --no-qualities \ -p 8 \ "${INPUT_BAM}" \ "${RSEM_REFERENCE_DIR}/${REFERENCE_NAME}" \ "${OUTPUT_DIR}/${SAMPLE_NAME}" # The output files will include: # ${OUTPUT_DIR}/${SAMPLE_NAME}.genes.results (contains TPM values for genes) # ${OUTPUT_DIR}/${SAMPLE_NAME}.isoforms.results (contains TPM values for isoforms)
Tools Used
Raw Source Text
Raw sequencing reads were mapped to the human reference transcriptome (GENCODE v19) using gapped-alignment strategies. Alignment was performed by STAR (version 2.3.0). Gene-level quantification was performed with HTseq (version 0.6.1). Per-gene expression outputs were scaled to transcripts per million (TPM) units. Genome_build: GENCODE v19 Supplementary_files_format_and_content: tab-delimited text file provides transcripts per million (TPM) expression values (rows = genes, columns = cells)