GSE61946 Processing Pipeline

RNA-Seq code_examples 9 steps

Publication

A Gene Regulatory Network Cooperatively Controlled by Pdx1 and Sox9 Governs Lineage Allocation of Foregut Progenitor Cells.

Cell reports (2015) — PMID 26440894

Dataset

GSE61946

hESC-derived liver transcriptome analysis

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Quality Control.

FastQC (Inferred with models/gemini-2.5-flash) v0.11.9 GitHub

$ Bash example

# Install FastQC (if not already installed)
# conda install -c bioconda fastqc=0.11.9

# Run FastQC on input FASTQ file
# Replace 'input_reads.fastq.gz' with your actual input file
# The '-o .' option specifies the output directory as the current directory.
fastqc input_reads.fastq.gz -o .

View on GitHub

Quality of sequencing data is analyzed using the software FastQC v0.10.1.

FastQC v0.10.1

$ Bash example

# Install FastQC (example using conda)
# conda install -c bioconda fastqc

# Run FastQC to analyze sequencing data quality
# Assuming 'input.fastq.gz' is the raw sequencing data file
# And 'fastqc_output' is the directory where reports will be saved
mkdir -p fastqc_output
fastqc -o fastqc_output input.fastq.gz

The results are examined to determine if samples are of questionable quality on an array of metrics.

MultiQC (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install MultiQC if not already installed
# conda install -c bioconda multiqc

# Assuming various QC output files (e.g., FastQC reports, alignment logs) are located in a 'qc_results' directory.
# MultiQC will scan this directory and its subdirectories for supported QC output files.
# The report will be generated in the 'multiqc_reports' directory.
mkdir -p multiqc_reports
multiqc qc_results/ --outdir multiqc_reports/ --filename "sample_quality_assessment_report.html" --title "Sample Quality Assessment"

View on GitHub

Mapping.

STAR (Inferred with models/gemini-2.5-flash) v2.7.0f GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables (replace with actual paths and filenames)
GENOME_DIR="/path/to/STAR_genome_index/hg38"
READ1_FASTQ="sample_R1.fastq.gz"
READ2_FASTQ="sample_R2.fastq.gz"
OUTPUT_PREFIX="aligned_sample_"
THREADS=8

# Create output directory if it doesn't exist
# mkdir -p "$(dirname ${OUTPUT_PREFIX})"

# Execute STAR alignment
STAR \
  --runThreadN ${THREADS} \
  --genomeDir ${GENOME_DIR} \
  --readFilesIn ${READ1_FASTQ} ${READ2_FASTQ} \
  --readFilesCommand zcat \
  --outFileNamePrefix ${OUTPUT_PREFIX} \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --outFilterMultimapNmax 1 \
  --outFilterMismatchNmax 3 \
  --alignIntronMax 200000 \
  --alignMatesGapMax 200000 \
  --outFilterScoreMinOverLread 0.6 \
  --outFilterMatchNminOverLread 0.6 \
  --outFilterMatchNmin 15 \
  --outReadsUnmapped Fastx

View on GitHub

Alignment of sequencing data to reference genomes is performed with the software RNA-Star 2.3.0e.

RNA-Star v2.3.0e GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for STAR genome index directory (e.g., for human hg38)
# Replace with your actual genome index path
GENOME_DIR="/path/to/STAR_index/hg38"

# Placeholder for input sequencing data (FASTQ file(s))
# For single-end reads: READS_FILE="input_reads.fastq.gz"
# For paired-end reads: READS_FILE="input_reads_R1.fastq.gz input_reads_R2.fastq.gz"
READS_FILE="input_reads.fastq.gz"

# Placeholder for output prefix
OUTPUT_PREFIX="aligned_output"

# Number of threads to use
NUM_THREADS=8

# Run STAR alignment
STAR \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READS_FILE}" \
  --runThreadN "${NUM_THREADS}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outBAMcompression 1 \
  --quantMode GeneCounts \
  --outFilterType BySJout \
  --outFilterMismatchNmax 10 \
  --outFilterMismatchNoverLmax 0.04 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --alignSJoverhangMin 8 \
  --alignSJDBoverhangMin 1

# The primary output BAM file will be named: ${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam
# The gene counts file will be named: ${OUTPUT_PREFIX}ReadsPerGene.out.tab

View on GitHub

Parameters are set to default and reads are mapped to references along with splice junction databases.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Create STAR genome index (if not already created)
# This step is typically done once per genome/annotation version.
# It requires a genome FASTA file and a GTF/GFF annotation file to build the splice junction database.
# Example for GRCh38:
# GENOME_FASTA="/path/to/genome/GRCh38.primary_assembly.genome.fa"
# GTF_FILE="/path/to/annotation/gencode.v38.annotation.gtf"
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index/GRCh38 \
#      --genomeFastaFiles ${GENOME_FASTA} \
#      --sjdbGTFfile ${GTF_FILE} \
#      --sjdbOverhang 100 \
#      --runThreadN 8

THREADS=8
GENOME_DIR="/path/to/STAR_index/GRCh38" # Placeholder for GRCh38 genome index
READ1="sample_R1.fastq.gz" # Placeholder for input R1 FASTQ file
READ2="sample_R2.fastq.gz" # Placeholder for input R2 FASTQ file (remove if single-end)
OUTPUT_PREFIX="sample_aligned_" # Prefix for output files

STAR --runThreadN ${THREADS} \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READ1} ${READ2} \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB)

View on GitHub

Gene Expression Quantification.

Salmon (Inferred with models/gemini-2.5-flash) v1.10.2 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Salmon (example using conda)
# conda create -n salmon_env salmon -y
# conda activate salmon_env

# --- Placeholder for Salmon index creation (if not already available) ---
# This step assumes you have a FASTA file of the transcriptome (e.g., from Ensembl or GENCODE)
# For example, using human GRCh38 (hg38) transcriptome:
# wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
# gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz
# salmon index -t Homo_sapiens.GRCh38.cdna.all.fa -i salmon_index_hg38 --gencode

# --- Gene Expression Quantification using Salmon ---
# Input FASTQ files (replace with actual file names)
READ1="read1.fastq.gz"
READ2="read2.fastq.gz"
# Salmon index path (replace with actual path to your index, e.g., 'salmon_index_hg38')
SALMON_INDEX="salmon_index_hg38"
# Output directory
OUTPUT_DIR="salmon_quant_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run Salmon quantification (paired-end, common parameters)
salmon quant -i "${SALMON_INDEX}" \
             -l A \
             -1 "${READ1}" \
             -2 "${READ2}" \
             -o "${OUTPUT_DIR}" \
             --validateMappings \
             --gcBias \
             --seqBias

View on GitHub

To obtain gene expression values, several quantification methods are used (Sailfish 0.6.3, Cufflinks 2.2.0).

Cufflinks v2.2.0 GitHub

$ Bash example

# Install Cufflinks (e.g., via Bioconda)
# conda install -c bioconda cufflinks=2.2.0

# Define input and output files/directories
INPUT_BAM="aligned_reads.bam" # Placeholder: Replace with your aligned RNA-seq BAM file
GENOME_GTF="gencode.v38.annotation.gtf" # Placeholder: Replace with your genome annotation GTF file (e.g., from GENCODE or Ensembl)
OUTPUT_DIR="cufflinks_output"
NUM_THREADS="8"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run Cufflinks for transcript assembly and quantification
cufflinks \
  -o "${OUTPUT_DIR}" \
  -g "${GENOME_GTF}" \
  -p "${NUM_THREADS}" \
  "${INPUT_BAM}"

View on GitHub

Expression values are calculated for entries in the gene annotation references.

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub

$ Bash example

# Install RSEM (example using conda)
# conda install -c bioconda rsem

# --- Prerequisite: RSEM Reference Preparation (run once per genome/annotation) ---
# This step builds the necessary indices for RSEM quantification from a reference genome and GTF annotation.
# Replace 'GRCh38.primary_assembly.genome.fa' with your reference genome FASTA file (e.g., downloaded from Ensembl/Gencode).
# Replace 'gencode.v38.annotation.gtf' with your gene annotation GTF file (e.g., downloaded from Ensembl/Gencode).
# The output reference files will be prefixed with 'GRCh38_rsem_ref'.
# rsem-prepare-reference --gtf gencode.v38.annotation.gtf GRCh38.primary_assembly.genome.fa GRCh38_rsem_ref

# --- Main Step: Calculate expression values ---
# This command calculates gene and isoform expression levels from RNA-seq reads.
# Replace 'read1.fastq.gz' and 'read2.fastq.gz' with your input paired-end FASTQ files.
# Replace 'GRCh38_rsem_ref' with the prefix of the RSEM reference prepared in the prerequisite step.
# 'sample_output' will be the prefix for all output files (e.g., sample_output.genes.results, sample_output.isoforms.results).
rsem-calculate-expression --paired-end read1.fastq.gz read2.fastq.gz GRCh38_rsem_ref sample_output

View on GitHub

Tools Used

Cufflinks

Raw Source Text

Quality Control. Quality of sequencing data is analyzed using the software FastQC v0.10.1. The results are examined to determine if samples are of questionable quality on an array of metrics.
Mapping. Alignment of sequencing data to reference genomes is performed with the software RNA-Star 2.3.0e. Parameters are set to default and reads are mapped to references along with splice junction databases.
Gene Expression Quantification. To obtain gene expression values, several quantification methods are used (Sailfish 0.6.3, Cufflinks 2.2.0). Expression values are calculated for entries in the gene annotation references.
Genome_build: hg19
Supplementary_files_format_and_content: Txt file of ensemble identifiers with RPKMs of samples used in study

← Back to Analysis