GSE92602 Processing Pipeline

RNA-Seq, 5 steps

Publication

A role for alternative splicing in circadian control of exocytosis and glucose homeostasis.

Genes & development (2020) — PMID 32616519

Dataset

Identification of islet-enriched long non-coding RNAs contributing to beta-cell failure in type 2 diabetes

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Base calling was performed with Illumina GAP Pipeline Software v1.90

Illumina GAP Pipeline Software v1.90

$ Bash example

# Base calling is an initial processing step performed by the Illumina sequencing instrument's onboard software.
# It converts raw signal data (intensities) from the sequencer into base calls (A, T, C, G) and quality scores.
# This process is typically not executed by the user via a command-line tool post-sequencing.
# The specified software, Illumina GAP Pipeline Software v1.90, is proprietary to Illumina.
# No user-executable command is available for this step.
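
Although base calling itself cannot be re-run, its output can be sanity-checked before alignment. The following is a minimal sketch, assuming the reads have already been exported as a gzip-compressed FASTQ file with four lines per record (the source does not state the delivery format). The function name `fastq_summary` is hypothetical.

```shell
# fastq_summary FILE.fastq.gz
# Prints the number of records and the mean read length.
# Assumption: well-formed, gzip-compressed FASTQ with exactly four lines per record.
fastq_summary() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 { n++; total += length($0) }   # line 2 of each record is the sequence
        END {
            if (n == 0) { print "reads=0"; exit }
            printf "reads=%d mean_length=%.1f\n", n, total / n
        }'
}

# Example call (placeholder file name):
# fastq_summary input_reads_R1.fastq.gz
```

A read count or mean length that differs from what the sequencing facility reported is a hint that the file is truncated or was exported incorrectly.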

Sequences were aligned to mm9 reference genome using Tophat 2.0.8 with option -g with mm9 reference GTF

TopHat v2.0.8

$ Bash example

# Install TopHat 2.0.8 and Bowtie 2 (TopHat 2 aligns with Bowtie 2 by default).
# Note: TopHat is an older tool and may need a dedicated environment. The package
# versions below are assumptions and may not all be available from bioconda.
# conda create -n tophat2_env tophat=2.0.8 bowtie2 -c bioconda -c conda-forge
# conda activate tophat2_env

# Define input and output files (placeholders - replace with actual paths)
READS_1="input_reads_R1.fastq.gz" # Path to input FASTQ file(s) for read 1
# READS_2="input_reads_R2.fastq.gz" # Uncomment and provide path if paired-end reads
OUTPUT_DIR="tophat_alignment_output"

# Define reference files (placeholders - replace with actual paths)
# mm9 reference genome FASTA file
GENOME_FASTA="/path/to/mm9.fa"
# mm9 reference GTF file
GTF_FILE="/path/to/mm9.gtf"
# Prefix for the Bowtie 2 index files built from the mm9 genome
BOWTIE_INDEX_PREFIX="/path/to/mm9_bowtie_index/mm9"

# --- Prerequisite: build the Bowtie 2 index if it does not exist yet ---
# TopHat requires a Bowtie index for the reference genome.
# If the index for mm9 is not already built at BOWTIE_INDEX_PREFIX, uncomment and run the following:
# mkdir -p "$(dirname "$BOWTIE_INDEX_PREFIX")"
# bowtie2-build "$GENOME_FASTA" "$BOWTIE_INDEX_PREFIX"
# (To use a Bowtie 1 index built with bowtie-build instead, add --bowtie1 to the tophat2 call.)

# --- Run TopHat 2.0.8 for alignment ---
# Align reads to mm9, supplying the reference GTF as the known transcript annotation.
# The source text says "option -g", but in TopHat -g/--max-multihits takes an integer.
# The annotation file is passed with -G/--GTF, which this example assumes was meant.
tophat2 \
    -G "$GTF_FILE" \
    -o "$OUTPUT_DIR" \
    "$BOWTIE_INDEX_PREFIX" \
    "$READS_1" # Add "$READS_2" here if using paired-end reads


Novel transcripts were predicted with Cufflinks 2.1.1

Cufflinks v2.1.1

$ Bash example

# Install Cufflinks (example using conda)
# conda install -c bioconda cufflinks=2.1.1

# Define input and output paths
# Replace with the BAM file from the alignment step (TopHat writes accepted_hits.bam)
INPUT_BAM="path/to/aligned_reads.bam"
# Replace with the mm9 reference GTF used for alignment
REFERENCE_GTF="path/to/mm9.gtf"
# mm9 genome FASTA, required as the argument of --frag-bias-correct
GENOME_FASTA="path/to/mm9.fa"
OUTPUT_DIR="cufflinks_novel_transcripts_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run Cufflinks to predict novel transcripts
# The source names only the tool and version; every option below is an assumption.
# -o: Output directory
# -g: Reference annotation that guides assembly; novel transcripts are still reported
# --frag-bias-correct <genome.fa>: Correct for sequence-specific bias (takes the genome FASTA)
# --multi-read-correct: Weight reads that map to multiple locations
# -p: Number of threads (adjust as needed)
cufflinks \
  -o "${OUTPUT_DIR}" \
  -g "${REFERENCE_GTF}" \
  --frag-bias-correct "${GENOME_FASTA}" \
  --multi-read-correct \
  -p 8 \
  "${INPUT_BAM}"

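Cufflinks writes each assembly to a file named transcripts.gtf inside its output directory. When several samples are assembled, their paths are usually collected into a plain-text list, one path per line, for the merge step. A minimal sketch follows; the function name `write_assembly_list` and the directory layout `cufflinks_out/<sample>/transcripts.gtf` are hypothetical.

```shell
# write_assembly_list ROOT_DIR LIST_FILE
# Finds every transcripts.gtf below ROOT_DIR and writes one path per line,
# sorted so that the list is reproducible between runs.
write_assembly_list() {
    find "$1" -type f -name 'transcripts.gtf' | sort > "$2"
    # Fail loudly if nothing was found instead of passing an empty list on.
    if [ ! -s "$2" ]; then
        echo "no transcripts.gtf found under $1" >&2
        return 1
    fi
}

# Example call (hypothetical layout: cufflinks_out/<sample>/transcripts.gtf):
# write_assembly_list cufflinks_out novel_transcript_assemblies.txt
```

Matching on the exact file name skips other Cufflinks outputs, such as skipped.gtf, that may sit in the same directories.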

Novel transcript predictions were merged with mm9 reference genome using Cuffmerge 2.1.1 with option -G with mm9 reference GTF

Cuffmerge v2.1.1

$ Bash example

# Install Cufflinks suite (which includes Cuffmerge)
# conda install -c bioconda cufflinks=2.1.1

# Define reference paths
# Placeholder paths for mm9 reference GTF and FASTA.
# These files can typically be downloaded from UCSC Genome Browser (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/)
# or Ensembl (e.g., ftp://ftp.ensembl.org/pub/release-54/gtf/mus_musculus/)
MM9_REFERENCE_GTF="/path/to/mm9.gtf"
MM9_REFERENCE_FASTA="/path/to/mm9.fa"

# Define input file(s) for novel transcript predictions.
# This should be a text file where each line is the path to a GTF file
# containing novel transcript predictions (e.g., from Cufflinks assembly output).
NOVEL_TRANSCRIPTS_GTF_LIST="novel_transcript_assemblies.txt"

# Define the output directory; Cuffmerge writes the merged annotation to merged.gtf inside it
OUTPUT_DIR="cuffmerge_output"

# Execute Cuffmerge to merge novel transcript predictions with the mm9 reference GTF
# The source text says "option -G"; Cuffmerge's reference annotation option is -g/--ref-gtf.
# The -s option specifies the reference genome FASTA file.
# The -o option names an output directory, not a file.
# Options are placed before the list file, following the usage line: cuffmerge [options] <assembly_GTF_list.txt>
cuffmerge -o "${OUTPUT_DIR}" -g "${MM9_REFERENCE_GTF}" -s "${MM9_REFERENCE_FASTA}" "${NOVEL_TRANSCRIPTS_GTF_LIST}"
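
When a reference annotation is supplied, the merged GTF normally carries a class_code attribute that relates each merged transcript to the reference, using the Cuffcompare codes ("=" for a complete match, "j" for a potentially novel isoform, "u" for an unknown intergenic transcript, among others). Tallying these codes gives a quick view of how many transcripts are candidate novel loci. The sketch below assumes that transcript_id and class_code appear in column 9 of the merged file; the function name `class_code_counts` is hypothetical.

```shell
# class_code_counts MERGED_GTF
# Prints "<class_code><TAB><number of transcripts>", one line per code.
# Each transcript is counted once, however many exon lines it has.
class_code_counts() {
    awk -F '\t' '
        {
            tid = ""; code = ""
            # Pull the quoted values out of the attribute column.
            if (match($9, /transcript_id "[^"]+"/))
                tid = substr($9, RSTART + 15, RLENGTH - 16)
            if (match($9, /class_code "[^"]+"/))
                code = substr($9, RSTART + 12, RLENGTH - 13)
            if (tid != "" && code != "") seen[tid] = code
        }
        END {
            for (t in seen) n[seen[t]]++
            for (c in n) printf "%s\t%d\n", c, n[c]
        }' "$1" | sort
}

# Example call (placeholder path):
# class_code_counts cuffmerge_output/merged.gtf
```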


Counts were generated using htseq-count v0.5.4p3

HTSeq v0.5.4p3

$ Bash example

# Install HTSeq (if not already installed)
# conda install -c bioconda htseq

# Example usage of htseq-count for generating gene counts from an alignment file and a GTF annotation.
# The source names only the tool and version; all parameters below are assumptions based on common usage.
# Option availability differs between HTSeq releases; check `htseq-count --help` for the installed version.
# -f bam: Input file format is BAM.
# -r pos: Alignments are sorted by position (only relevant for paired-end data).
# -s no: Treat the library as unstranded (the default is 'yes'; use 'yes' or 'reverse' for stranded data).
# -a 10: Skip reads with an alignment quality below 10.
# -t exon: Feature type to count is 'exon'.
# -i gene_id: Attribute in the GTF file to use as feature ID.
# aligned_reads.bam: Placeholder for the input alignment file.
# annotation.gtf: Placeholder for the mm9 annotation, e.g. the merged GTF from the Cuffmerge step (an assumption; the source does not say which GTF was used for counting).
# > gene_counts.txt: Output file for the generated counts.
htseq-count \
  -f bam \
  -r pos \
  -s no \
  -a 10 \
  -t exon \
  -i gene_id \
  aligned_reads.bam \
  annotation.gtf \
  > gene_counts.txt

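htseq-count produces one two-column table per sample (feature ID, count), followed by summary rows such as no_feature and ambiguous (newer releases prefix these with two underscores). Downstream tools usually expect a single matrix instead. The sketch below joins several such tables with awk and drops the summary rows; the function name `merge_counts` and the sample file names are hypothetical.

```shell
# merge_counts SAMPLE1.txt SAMPLE2.txt ...
# Prints a tab-delimited matrix: one row per feature, one column per input file.
# Features missing from a file are reported as 0.
merge_counts() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1 { nfile++; name[nfile] = FILENAME }   # a new input file starts
        # Skip htseq-count summary rows in both naming styles.
        $1 ~ /^(__)?(no_feature|ambiguous|too_low_aQual|not_aligned|alignment_not_unique)$/ { next }
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++ngene] = $1 }
            count[$1, nfile] = $2
        }
        END {
            header = "feature"
            for (f = 1; f <= nfile; f++) header = header OFS name[f]
            print header
            for (g = 1; g <= ngene; g++) {
                row = order[g]
                for (f = 1; f <= nfile; f++) {
                    key = order[g] SUBSEP f
                    row = row OFS ((key in count) ? count[key] : 0)
                }
                print row
            }
        }' "$@"
}

# Example call (placeholder file names):
# merge_counts sample1_counts.txt sample2_counts.txt > count_matrix.txt
```

Features keep the order in which they are first seen, so tables generated against the same GTF line up without extra sorting.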

Tools Used

TopHat Cufflinks

Raw Source Text

Base calling was with Illumina GAP Pipeline Software v1.90
Sequences were aligned to mm9 reference genome using Tophat 2.0.8 with option -g with mm9 reference GTF
Novel transcripts were predicted with Cufflinks 2.1.1
Novel transcripts predictions were merged with mm9 reference genome using Cuffmerge 2.1.1 with option -G with mm9 reference GTF
Counts were generated using htseq-count v0.5.4p3
Genome_build: mm9
Supplementary_files_format_and_content: Raw count data for genes were normalized to the relative size of each library using R/Bioconductor package edgeR calcNormFactors function. Count data are provided in tab-delimited format
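
The normalization mentioned above was done in R with edgeR, whose calcNormFactors function computes TMM scaling factors by default. That calculation is not reproduced here. As a simplified stand-in that only illustrates scaling counts to library size, the sketch below converts a tab-delimited count matrix (header row, feature IDs in the first column) to counts per million with awk. The function name `cpm_scale` and the matrix layout are assumptions, and the result will differ from edgeR's TMM-adjusted values.

```shell
# cpm_scale COUNT_MATRIX.tsv
# Prints the matrix with each count divided by its column total and multiplied by 1e6.
# The file is read twice: once to sum each column, once to rescale.
cpm_scale() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR {                         # pass 1: accumulate library sizes
            if (FNR > 1) for (i = 2; i <= NF; i++) lib[i] += $i
            next
        }
        FNR == 1 { print; next }            # pass 2: header is passed through unchanged
        {
            row = $1
            for (i = 2; i <= NF; i++)
                row = row OFS sprintf("%.2f", lib[i] > 0 ? $i * 1e6 / lib[i] : 0)
            print row
        }' "$1" "$1"
}

# Example call (placeholder file name):
# cpm_scale count_matrix.txt > cpm_matrix.txt
```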
