GSE61945 Processing Pipeline

RNA-Seq, 9 steps with code examples

Publication

A Gene Regulatory Network Cooperatively Controlled by Pdx1 and Sox9 Governs Lineage Allocation of Foregut Progenitor Cells.

Cell Reports (2015), PMID 26440894

Dataset

GSE61945

Human fetal pancreas transcriptome analysis

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Quality Control.

FastQC (Inferred with models/gemini-2.5-flash) v0.11.9

$ Bash example

# Install FastQC (e.g., using conda)
# conda install -c bioconda fastqc

# Run FastQC on a FASTQ file to assess raw read quality
# Replace 'input.fastq.gz' with your actual input FASTQ file(s).
# Replace 'output_dir' with your desired output directory for reports.
fastqc input.fastq.gz -o output_dir

Quality of sequencing data is analyzed using the software FastQC v0.10.1.

FastQC v0.10.1

$ Bash example

# Install FastQC (if not already installed)
# conda install -c bioconda fastqc=0.10.1

# Run FastQC on sequencing data
# Replace 'input.fastq.gz' with your actual input file(s)
# Replace 'fastqc_output_dir' with your desired output directory
fastqc input.fastq.gz -o fastqc_output_dir

The results are examined to determine if samples are of questionable quality on an array of metrics.
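
One way to act on this step is to scan the per-sample `summary.txt` files that FastQC writes inside each report and list the modules flagged FAIL. The sketch below assumes the reports have already been unzipped (for example by running `fastqc --extract`) under a directory such as `fastqc_output_dir`; the function name `list_failed_modules` is hypothetical, and which failures actually matter for a given library type is a judgment call that the source does not spell out.

```shell
# List FastQC modules flagged FAIL, one "input file <TAB> module" line each.
# Assumes <report_dir>/<sample>_fastqc/summary.txt exists for every sample.
# summary.txt is tab-separated: status, module name, input file name.
list_failed_modules() {
    report_dir="$1"
    for summary in "$report_dir"/*_fastqc/summary.txt; do
        [ -e "$summary" ] || continue   # glob matched nothing
        awk -F'\t' '$1 == "FAIL" { print $3 "\t" $2 }' "$summary"
    done
}

# Usage (directory name is a placeholder):
# list_failed_modules fastqc_output_dir
```

An empty result means no module failed outright; WARN entries are ignored here and can be included by extending the awk condition.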

MultiQC (Inferred with models/gemini-2.5-flash) v1.20

$ Bash example

# Install MultiQC (example using pip)
# pip install multiqc

# Run MultiQC to aggregate and summarize all detected QC reports from the current directory or specified paths.
# This command assumes various QC tool outputs (e.g., FastQC, Picard, STAR logs) are present in the input directory.
# The output will be an HTML report in the specified output directory.
multiqc . -o multiqc_report

Mapping.

STAR (Inferred with models/gemini-2.5-flash) v2.7.9a (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
GENOME_DIR="/path/to/STAR_index/hg19" # Path to your STAR genome index (the source reports genome build hg19)
READS_R1="input_reads_R1.fastq.gz"
READS_R2="input_reads_R2.fastq.gz" # Adjust if single-end or different naming
OUTPUT_PREFIX="aligned_rnaseq"
NUM_THREADS=8

# Create STAR genome index if it doesn't exist (run once per genome)
# STAR --runMode genomeGenerate --genomeDir ${GENOME_DIR} --genomeFastaFiles /path/to/hg19.fa --sjdbGTFfile /path/to/hg19.gtf --runThreadN ${NUM_THREADS}

# Perform mapping with STAR
# Note: the filter and intron options below are non-default values; the source states that default parameters were used.
STAR \
  --genomeDir ${GENOME_DIR} \
  --readFilesIn ${READS_R1} ${READS_R2} \
  --readFilesCommand zcat \
  --outFileNamePrefix ${OUTPUT_PREFIX}_ \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes Standard \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.04 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --limitBAMsortRAM 30000000000 \
  --runThreadN ${NUM_THREADS}

# The output will be ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam

Alignment of sequencing data to reference genomes is performed with the software RNA-Star 2.3.0e.

STAR v2.3.0e

$ Bash example

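# Install STAR (if not already installed)
# This old release may not be packaged on Bioconda; if the pinned install fails, use a newer STAR build or compile the tagged release from source.
# conda install -c bioconda star=2.3.0e

# Define variables for reference genome and annotation.
# The source reports genome build hg19; the GRCh38 / GENCODE v38 names below are placeholders only.
# Substitute an hg19 (GRCh37) FASTA and a matching GTF to stay consistent with the original analysis.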
GENOME_DIR="./star_index_hg38" # Directory where STAR genome index will be stored
GENOME_FASTA="GRCh38.primary_assembly.genome.fa" # Placeholder: Path to your reference genome FASTA file (e.g., from UCSC or Ensembl)
GTF_FILE="gencode.v38.annotation.gtf" # Placeholder: Path to your gene annotation GTF file (e.g., from GENCODE)

# Define variables for input and output
READ1="sample_R1.fastq.gz" # Placeholder: Path to your first FASTQ file
READ2="sample_R2.fastq.gz" # Placeholder: Path to your second FASTQ file (remove if single-end)
OUTPUT_PREFIX="sample_aligned" # Prefix for output files
THREADS=8 # Number of threads to use for alignment

# --- Prerequisite: Generate STAR genome index (run this once for your reference genome and GTF) ---
# This step creates the necessary index files in the ${GENOME_DIR}.
# You would typically download the FASTA and GTF files from sources like GENCODE or UCSC.
# Example download (uncomment and modify as needed):
# wget -P . https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh38.primary_assembly.genome.fa.gz
# wget -P . https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
# gunzip GRCh38.primary_assembly.genome.fa.gz
# gunzip gencode.v38.annotation.gtf.gz

# mkdir -p ${GENOME_DIR}
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles ${GENOME_FASTA} \
#      --sjdbGTFfile ${GTF_FILE} \
#      --sjdbOverhang 100 \
#      --runThreadN ${THREADS}

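# --- Perform alignment of sequencing data to the reference genome ---
# Note: the source states default parameters; the filter and intron options below are non-default,
# and options such as --outSAMtype BAM were introduced in later STAR releases, so 2.3.0e may reject them.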
STAR --runMode alignReads \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READ1} ${READ2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX}. \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.05 \
     --outFilterScoreMinOverLread 0.66 \
     --outFilterMatchNminOverLread 0.66 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --runThreadN ${THREADS}

Parameters are set to default and reads are mapped to references along with splice junction databases.
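
In STAR, the splice junction database is supplied when the genome index is generated, through an annotation GTF (`--sjdbGTFfile`) and an overhang length (`--sjdbOverhang`). The STAR manual suggests setting the overhang to the maximum read length minus one. The sketch below derives that value from a gzipped FASTQ file; the function name `sjdb_overhang` and the file names are placeholders, and the source does not say which overhang or annotation was used for this dataset.

```shell
# Print a candidate --sjdbOverhang value (longest sampled read length minus one)
# for a gzipped FASTQ file. Only the first 40000 lines (10000 reads) are sampled.
sjdb_overhang() {
    gzip -dc "$1" | head -n 40000 |
        awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
             END { print max - 1 }'
}

# Usage (file name is a placeholder):
# OVERHANG=$(sjdb_overhang sample_R1.fastq.gz)
# STAR --runMode genomeGenerate ... --sjdbOverhang "${OVERHANG}"
```

Sampling only the start of the file keeps the check fast; for libraries with trimmed or variable-length reads, scanning the whole file would be the safer choice.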

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Replace with actual paths and filenames
GENOME_DIR="/path/to/STAR_genome_index/hg19" # Path to pre-built STAR genome index (the source reports hg19; build it with a GTF file for splice junctions)
READ1="input_reads_R1.fastq.gz"             # Path to forward reads
READ2="input_reads_R2.fastq.gz"             # Path to reverse reads (if paired-end)
OUTPUT_PREFIX="aligned_reads"               # Prefix for output files
NUM_THREADS=8                               # Number of threads to use

# Align reads to the reference genome with splice junction awareness
# Using default parameters as specified in the description.
# --outSAMtype BAM SortedByCoordinate: Output sorted BAM file
# --runThreadN: Number of threads
# --readFilesCommand zcat: Decompress gzipped input files on the fly
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}." \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN "${NUM_THREADS}" \
     --readFilesCommand zcat

Gene Expression Quantification.

Salmon (Inferred with models/gemini-2.5-flash) v1.10.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Salmon (if not already installed)
# conda install -c bioconda salmon

# Define variables
READS_R1="sample_R1.fastq.gz"
READS_R2="sample_R2.fastq.gz"
SALMON_INDEX="GRCh38.p14.salmon_index" # Placeholder for a pre-built Salmon index
OUTPUT_DIR="salmon_quant_output"
THREADS=8 # Number of threads to use

# Create output directory
mkdir -p "${OUTPUT_DIR}"

# Run Salmon quantification
# Note: the source names Sailfish 0.6.3 and genome build hg19; Salmon and the GRCh38 index name here are inferred stand-ins.
# Assumes a Salmon index has been built from a reference transcriptome (e.g., GENCODE, Ensembl)
# Example command to build an index (if needed, typically done once per reference):
# salmon index -t GRCh38.p14.transcriptome.fasta -i GRCh38.p14.salmon_index

salmon quant \
    -i "${SALMON_INDEX}" \
    -l A \
    -1 "${READS_R1}" \
    -2 "${READS_R2}" \
    -p "${THREADS}" \
    --validateMappings \
    --gcBias \
    -o "${OUTPUT_DIR}"

To obtain gene expression values, several quantification methods are used (Sailfish 0.6.3, Cufflinks 2.2.0).

Cufflinks v2.2.0

$ Bash example

# Install Cufflinks (example using Bioconda)
# conda install -c bioconda cufflinks=2.2.0

# Example usage of Cufflinks for gene expression quantification
# Replace 'aligned_reads.bam' with your coordinate-sorted input BAM file (e.g., from TopHat or STAR alignment).
# Replace 'genes.gtf' with your reference genome annotation file.
# -G quantifies only the transcripts in the annotation; -g would instead use the GTF to guide assembly of novel transcripts.
# The output will be generated in the 'cufflinks_output' directory.
cufflinks -o cufflinks_output -G genes.gtf aligned_reads.bam

Expression values are calculated for entries in the gene annotation references.
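
Cufflinks reports per-gene values in `genes.fpkm_tracking`, a tab-separated table with a header row. The sketch below pulls the gene identifier and FPKM columns into a two-column table, similar in shape to the deposited text file of identifiers and expression values. Columns are looked up by header name rather than by position; the function name `extract_gene_fpkm` and the paths are placeholders, and this is not necessarily how the authors produced their supplementary file (which reports RPKMs).

```shell
# Extract gene_id and FPKM from a Cufflinks genes.fpkm_tracking file.
# Columns are located by their header names, so column order does not matter.
extract_gene_fpkm() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("gene_id" in col) || !("FPKM" in col)) {
                print "missing gene_id or FPKM column" > "/dev/stderr"
                exit 1
            }
            print "gene_id\tFPKM"
            next
        }
        { print $(col["gene_id"]) "\t" $(col["FPKM"]) }
    ' "$1"
}

# Usage (paths are placeholders):
# extract_gene_fpkm cufflinks_output/genes.fpkm_tracking > sample_fpkm.txt
```

Running this once per sample and joining the outputs on `gene_id` would give a genes-by-samples matrix.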

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3

$ Bash example

# Install RSEM
# conda install -c bioconda rsem

# --- Reference preparation (one-time setup) ---
# This step assumes you have a genome FASTA and a gene annotation GTF file.
# Example for GRCh38 from Ensembl:
# wget -O genome.fa.gz "ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
# wget -O genes.gtf.gz "ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz"
# gunzip genome.fa.gz
# gunzip genes.gtf.gz

# Build RSEM reference index (replace with your actual paths and desired reference name)
# rsem-prepare-reference --gtf genes.gtf genome.fa rsem_reference_GRCh38

# --- Expression calculation ---
# Input: BAM file of reads aligned to the transcriptome (e.g., STAR run with --quantMode TranscriptomeSAM), not a genome-coordinate BAM.
# Replace with your actual input BAM file, RSEM reference name, and desired output prefix.
INPUT_BAM="path/to/your/aligned_reads.toTranscriptome.bam"
RSEM_REFERENCE_PATH="path/to/your/rsem_reference_GRCh38" # Reference name prefix passed to rsem-prepare-reference (not a directory)
OUTPUT_PREFIX="sample_expression_results"
NUM_THREADS=8 # Number of threads to use for calculation

# --alignments marks the input as SAM/BAM/CRAM in RSEM 1.3.x (older releases used --bam)
rsem-calculate-expression \
    --alignments \
    --paired-end \
    -p "${NUM_THREADS}" \
    "${INPUT_BAM}" \
    "${RSEM_REFERENCE_PATH}" \
    "${OUTPUT_PREFIX}"

Tools Used

FastQC, STAR, Sailfish, Cufflinks

Raw Source Text

Quality Control. Quality of sequencing data is analyzed using the software FastQC v0.10.1. The results are examined to determine if samples are of questionable quality on an array of metrics.
Mapping. Alignment of sequencing data to reference genomes is performed with the software RNA-Star 2.3.0e. Parameters are set to default and reads are mapped to references along with splice junction databases.
Gene Expression Quantification. To obtain gene expression values, several quantification methods are used (Sailfish 0.6.3, Cufflinks 2.2.0). Expression values are calculated for entries in the gene annotation references.
Genome_build: hg19
Supplementary_files_format_and_content: Txt file of ensemble identifiers with RPKMs of samples used in study
