GSE61946 Processing Pipeline
RNA-Seq
code_examples
9 steps
Publication
A Gene Regulatory Network Cooperatively Controlled by Pdx1 and Sox9 Governs Lineage Allocation of Foregut Progenitor Cells.Cell reports (2015) — PMID 26440894
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Quality Control.
$ Bash example
# Install FastQC (if not already installed) # conda install -c bioconda fastqc=0.11.9 # Run FastQC on input FASTQ file # Replace 'input_reads.fastq.gz' with your actual input file # The '-o .' option specifies the output directory as the current directory. fastqc input_reads.fastq.gz -o .
-
2
Quality of sequencing data is analyzed using the software FastQC v0.10.1.
FastQC v0.10.1$ Bash example
# Install FastQC (example using conda) # conda install -c bioconda fastqc # Run FastQC to analyze sequencing data quality # Assuming 'input.fastq.gz' is the raw sequencing data file # And 'fastqc_output' is the directory where reports will be saved mkdir -p fastqc_output fastqc -o fastqc_output input.fastq.gz
-
3
The results are examined to determine if samples are of questionable quality on an array of metrics.
$ Bash example
# Install MultiQC if not already installed # conda install -c bioconda multiqc # Assuming various QC output files (e.g., FastQC reports, alignment logs) are located in a 'qc_results' directory. # MultiQC will scan this directory and its subdirectories for supported QC output files. # The report will be generated in the 'multiqc_reports' directory. mkdir -p multiqc_reports multiqc qc_results/ --outdir multiqc_reports/ --filename "sample_quality_assessment_report.html" --title "Sample Quality Assessment"
-
4
Mapping.
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables (replace with actual paths and filenames) GENOME_DIR="/path/to/STAR_genome_index/hg38" READ1_FASTQ="sample_R1.fastq.gz" READ2_FASTQ="sample_R2.fastq.gz" OUTPUT_PREFIX="aligned_sample_" THREADS=8 # Create output directory if it doesn't exist # mkdir -p "$(dirname ${OUTPUT_PREFIX})" # Execute STAR alignment STAR \ --runThreadN ${THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${READ1_FASTQ} ${READ2_FASTQ} \ --readFilesCommand zcat \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 3 \ --alignIntronMax 200000 \ --alignMatesGapMax 200000 \ --outFilterScoreMinOverLread 0.6 \ --outFilterMatchNminOverLread 0.6 \ --outFilterMatchNmin 15 \ --outReadsUnmapped Fastx -
5
Alignment of sequencing data to reference genomes is performed with the software RNA-Star 2.3.0e.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Placeholder for STAR genome index directory (e.g., for human hg38) # Replace with your actual genome index path GENOME_DIR="/path/to/STAR_index/hg38" # Placeholder for input sequencing data (FASTQ file(s)) # For single-end reads: READS_FILE="input_reads.fastq.gz" # For paired-end reads: READS_FILE="input_reads_R1.fastq.gz input_reads_R2.fastq.gz" READS_FILE="input_reads.fastq.gz" # Placeholder for output prefix OUTPUT_PREFIX="aligned_output" # Number of threads to use NUM_THREADS=8 # Run STAR alignment STAR \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS_FILE}" \ --runThreadN "${NUM_THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outBAMcompression 1 \ --quantMode GeneCounts \ --outFilterType BySJout \ --outFilterMismatchNmax 10 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 # The primary output BAM file will be named: ${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam # The gene counts file will be named: ${OUTPUT_PREFIX}ReadsPerGene.out.tab -
6
Parameters are set to default and reads are mapped to references along with splice junction databases.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Create STAR genome index (if not already created) # This step is typically done once per genome/annotation version. # It requires a genome FASTA file and a GTF/GFF annotation file to build the splice junction database. # Example for GRCh38: # GENOME_FASTA="/path/to/genome/GRCh38.primary_assembly.genome.fa" # GTF_FILE="/path/to/annotation/gencode.v38.annotation.gtf" # STAR --runMode genomeGenerate \ # --genomeDir /path/to/STAR_index/GRCh38 \ # --genomeFastaFiles ${GENOME_FASTA} \ # --sjdbGTFfile ${GTF_FILE} \ # --sjdbOverhang 100 \ # --runThreadN 8 THREADS=8 GENOME_DIR="/path/to/STAR_index/GRCh38" # Placeholder for GRCh38 genome index READ1="sample_R1.fastq.gz" # Placeholder for input R1 FASTQ file READ2="sample_R2.fastq.gz" # Placeholder for input R2 FASTQ file (remove if single-end) OUTPUT_PREFIX="sample_aligned_" # Prefix for output files STAR --runThreadN ${THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${READ1} ${READ2} \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outSAMattributes Standard \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --alignSJDBoverhangMin 1 \ --alignSJoverhangMin 8 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB) -
7
Gene Expression Quantification.
Salmon (Inferred with models/gemini-2.5-flash) v1.10.2 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install Salmon (example using conda) # conda create -n salmon_env salmon -y # conda activate salmon_env # --- Placeholder for Salmon index creation (if not already available) --- # This step assumes you have a FASTA file of the transcriptome (e.g., from Ensembl or GENCODE) # For example, using human GRCh38 (hg38) transcriptome: # wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz # gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz # salmon index -t Homo_sapiens.GRCh38.cdna.all.fa -i salmon_index_hg38 --gencode # --- Gene Expression Quantification using Salmon --- # Input FASTQ files (replace with actual file names) READ1="read1.fastq.gz" READ2="read2.fastq.gz" # Salmon index path (replace with actual path to your index, e.g., 'salmon_index_hg38') SALMON_INDEX="salmon_index_hg38" # Output directory OUTPUT_DIR="salmon_quant_output" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run Salmon quantification (paired-end, common parameters) salmon quant -i "${SALMON_INDEX}" \ -l A \ -1 "${READ1}" \ -2 "${READ2}" \ -o "${OUTPUT_DIR}" \ --validateMappings \ --gcBias \ --seqBias -
8
To obtain gene expression values, several quantification methods are used (Sailfish 0.6.3, Cufflinks 2.2.0).
$ Bash example
# Install Cufflinks (e.g., via Bioconda) # conda install -c bioconda cufflinks=2.2.0 # Define input and output files/directories INPUT_BAM="aligned_reads.bam" # Placeholder: Replace with your aligned RNA-seq BAM file GENOME_GTF="gencode.v38.annotation.gtf" # Placeholder: Replace with your genome annotation GTF file (e.g., from GENCODE or Ensembl) OUTPUT_DIR="cufflinks_output" NUM_THREADS="8" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run Cufflinks for transcript assembly and quantification cufflinks \ -o "${OUTPUT_DIR}" \ -g "${GENOME_GTF}" \ -p "${NUM_THREADS}" \ "${INPUT_BAM}" -
9
Expression values are calculated for entries in the gene annotation references.
$ Bash example
# Install RSEM (example using conda) # conda install -c bioconda rsem # --- Prerequisite: RSEM Reference Preparation (run once per genome/annotation) --- # This step builds the necessary indices for RSEM quantification from a reference genome and GTF annotation. # Replace 'GRCh38.primary_assembly.genome.fa' with your reference genome FASTA file (e.g., downloaded from Ensembl/Gencode). # Replace 'gencode.v38.annotation.gtf' with your gene annotation GTF file (e.g., downloaded from Ensembl/Gencode). # The output reference files will be prefixed with 'GRCh38_rsem_ref'. # rsem-prepare-reference --gtf gencode.v38.annotation.gtf GRCh38.primary_assembly.genome.fa GRCh38_rsem_ref # --- Main Step: Calculate expression values --- # This command calculates gene and isoform expression levels from RNA-seq reads. # Replace 'read1.fastq.gz' and 'read2.fastq.gz' with your input paired-end FASTQ files. # Replace 'GRCh38_rsem_ref' with the prefix of the RSEM reference prepared in the prerequisite step. # 'sample_output' will be the prefix for all output files (e.g., sample_output.genes.results, sample_output.isoforms.results). rsem-calculate-expression --paired-end read1.fastq.gz read2.fastq.gz GRCh38_rsem_ref sample_output
Tools Used
Raw Source Text
Quality Control. Quality of sequencing data is analyzed using the software FastQC v0.10.1. The results are examined to determine if samples are of questionable quality on an array of metrics. Mapping. Alignment of sequencing data to reference genomes is performed with the software RNA-Star 2.3.0e. Parameters are set to default and reads are mapped to references along with splice junction databases. Gene Expression Quantification. To obtain gene expression values, several quantification methods are used (Sailfish 0.6.3, Cufflinks 2.2.0). Expression values are calculated for entries in the gene annotation references. Genome_build: hg19 Supplementary_files_format_and_content: Txt file of ensemble identifiers with RPKMs of samples used in study