GSE92602 Processing Pipeline

RNA-Seq, code examples, 5 steps

Publication

A role for alternative splicing in circadian control of exocytosis and glucose homeostasis.

Genes & development (2020) — PMID 32616519

Dataset

GSE92602

Identification of islet-enriched long non-coding RNAs contributing to beta-cell failure in type 2 diabetes

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Base calling was with Illumina GAP Pipeline Software v1.90

    $ Bash example
    # Base calling is an initial processing step performed by the Illumina sequencing instrument's onboard software.
    # It converts raw signal data (intensities) from the sequencer into base calls (A, T, C, G) and quality scores.
    # This process is typically not executed by the user via a command-line tool post-sequencing.
    # The specified software, Illumina GAP Pipeline Software v1.90, is proprietary to Illumina.
    # No user-executable command is available for this step.
  2.

    Sequences were aligned to mm9 reference genome using Tophat 2.0.8 with option -g with mm9 reference GTF

    $ Bash example
    # Install TopHat 2.0.8 and its dependencies. TopHat 2 aligns with Bowtie 2 by default;
    # Bowtie 1 indexes are used only when --bowtie1 is passed.
    # Note: TopHat is an older tool. The version pin below is an assumption and may not be
    # available from current conda channels.
    # conda create -n tophat2_env tophat=2.0.8 bowtie2 samtools -c bioconda -c conda-forge
    # conda activate tophat2_env
    
    # Define input and output files (placeholders - replace with actual paths)
    READS_1="input_reads_R1.fastq.gz" # Path to input FASTQ file(s) for read 1
    # READS_2="input_reads_R2.fastq.gz" # Uncomment and provide path if paired-end reads
    OUTPUT_DIR="tophat_alignment_output"
    
    # Define reference files (placeholders - replace with actual paths)
    # mm9 reference genome FASTA file
    GENOME_FASTA="/path/to/mm9.fa"
    # mm9 reference GTF file
    GTF_FILE="/path/to/mm9.gtf"
    # Prefix for the Bowtie 2 index files built from the mm9 genome
    BOWTIE_INDEX_PREFIX="/path/to/mm9_bowtie2_index/mm9"
    
    # --- Prerequisite: build the Bowtie 2 index if not already present ---
    # TopHat requires a Bowtie 2 index for the reference genome.
    # If the index for mm9 is not already built at BOWTIE_INDEX_PREFIX, uncomment and run the following:
    # mkdir -p "$(dirname "$BOWTIE_INDEX_PREFIX")"
    # bowtie2-build "$GENOME_FASTA" "$BOWTIE_INDEX_PREFIX"
    
    # --- Run TopHat 2.0.8 for alignment ---
    # Align reads to the mm9 reference genome, supplying the reference GTF so that known
    # splice junctions are used.
    # The source text says "-g", but in TopHat -g/--max-multihits takes an integer while
    # -G/--GTF takes the annotation file, so -G is assumed here.
    tophat2 \
        -o "$OUTPUT_DIR" \
        -G "$GTF_FILE" \
        "$BOWTIE_INDEX_PREFIX" \
        "$READS_1" # Add "$READS_2" here if using paired-end reads
    
  3.

    Novel transcripts were predicted with Cufflinks 2.1.1

    $ Bash example
    # Install Cufflinks (example using conda; the version pin may not be available from current channels)
    # conda install -c bioconda cufflinks=2.1.1
    
    # Define input and output paths (placeholders - replace with actual paths)
    # TopHat writes its alignments to <output_dir>/accepted_hits.bam
    INPUT_BAM="path/to/accepted_hits.bam"
    # mm9 reference annotation (the dataset is mouse, genome build mm9)
    REFERENCE_GTF="/path/to/mm9.gtf"
    # mm9 genome FASTA, required as the argument to --frag-bias-correct
    GENOME_FASTA="/path/to/mm9.fa"
    OUTPUT_DIR="cufflinks_novel_transcripts_output"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run Cufflinks to predict novel transcripts
    # The source text names only the tool and version; every option below is inferred.
    # -o: Output directory
    # -g: Reference annotation used to guide assembly; novel transcripts are reported alongside known ones
    # --frag-bias-correct <genome.fa>: Correct for sequence-specific bias (takes the genome FASTA)
    # --multi-read-correct: Correct for reads mapping to multiple locations
    # -p: Number of threads (adjust as needed)
    cufflinks \
      -o "${OUTPUT_DIR}" \
      -g "${REFERENCE_GTF}" \
      --frag-bias-correct "${GENOME_FASTA}" \
      --multi-read-correct \
      -p 8 \
      "${INPUT_BAM}"
  4.

    Novel transcripts predictions were merged with mm9 reference genome using Cuffmerge 2.1.1 with option -G with mm9 reference GTF

    $ Bash example
    # Install Cufflinks suite (which includes Cuffmerge)
    # conda install -c bioconda cufflinks=2.1.1
    
    # Define reference paths
    # Placeholder paths for mm9 reference GTF and FASTA.
    # These files can typically be downloaded from UCSC Genome Browser (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/)
    # or Ensembl (e.g., ftp://ftp.ensembl.org/pub/release-54/gtf/mus_musculus/)
    MM9_REFERENCE_GTF="/path/to/mm9.gtf"
    MM9_REFERENCE_FASTA="/path/to/mm9.fa"
    
    # Define input file(s) for novel transcript predictions.
    # This should be a text file where each line is the path to a GTF file
    # containing novel transcript predictions (e.g., from Cufflinks assembly output).
    NOVEL_TRANSCRIPTS_GTF_LIST="novel_transcript_assemblies.txt"
    
    # Define the output directory; Cuffmerge writes the merged annotation to <dir>/merged.gtf
    OUTPUT_DIR="cuffmerge_output"
    
    # Execute Cuffmerge to merge novel transcript predictions with the mm9 reference GTF
    # The source text says "-G", but Cuffmerge's reference annotation option is -g (--ref-gtf).
    # The -s option specifies the reference genome FASTA file.
    # The -o option specifies the output directory.
    # Options are placed before the list file, which is the only positional argument.
    cuffmerge -o "${OUTPUT_DIR}" -g "${MM9_REFERENCE_GTF}" -s "${MM9_REFERENCE_FASTA}" "${NOVEL_TRANSCRIPTS_GTF_LIST}"
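
Cuffmerge reads its input from a plain-text list with one assembly GTF path per line. A minimal sketch for building that list follows. It assumes each sample's Cufflinks run wrote to its own subdirectory under a common parent; that directory layout and the function name `write_assembly_list` are hypothetical and do not come from the GEO record.

```shell
#!/usr/bin/env bash
set -eu

# write_assembly_list PARENT_DIR LIST_FILE
# Collects every PARENT_DIR/<sample>/transcripts.gtf into LIST_FILE,
# one path per line, sorted so the list is reproducible.
# (Hypothetical helper; the directory layout is an assumption.)
write_assembly_list() {
    parent_dir="$1"
    list_file="$2"

    find "$parent_dir" -mindepth 2 -maxdepth 2 -type f -name 'transcripts.gtf' \
        | sort > "$list_file"

    # Fail loudly if nothing was found, rather than handing Cuffmerge an empty list.
    if [ ! -s "$list_file" ]; then
        echo "No transcripts.gtf files found under $parent_dir" >&2
        return 1
    fi
}
```

For example, `write_assembly_list cufflinks_out novel_transcript_assemblies.txt` would produce the list file that is then passed to Cuffmerge as its positional argument.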
  5.

    Counts were generated using htseq-count v0.5.4p3

    $ Bash example
    # Install HTSeq (if not already installed)
    # Note: this installs a current release, not v0.5.4p3 as named in the source text.
    # conda install -c bioconda htseq
    
    # Example usage of htseq-count for generating gene counts from an alignment file and a GTF annotation.
    # The source text names only the tool and version; all parameters below are inferred
    # from common usage for RNA-seq data.
    # -f bam: Input file format is BAM.
    # -r pos: Reads are sorted by position.
    #   (-f and -r may not be available in v0.5.4p3; with that version, stream SAM instead,
    #    e.g. samtools view aligned_reads.bam | htseq-count [options] - annotation.gtf)
    # -s no: Data is unstranded (use 'yes' or 'reverse' for stranded data).
    # -a 10: Minimum alignment quality score is 10.
    # -t exon: Feature type to count is 'exon'.
    # -i gene_id: Attribute in the GTF file to use as feature ID (e.g., gene_id).
    # aligned_reads.bam: Placeholder for the input alignment file.
    # /path/to/annotation.gtf: Placeholder for the mm9 annotation. This is presumably the merged
    #   GTF from the Cuffmerge step, although the source text does not say which annotation was used.
    # > gene_counts.txt: Output file for the generated counts.
    htseq-count \
      -f bam \
      -r pos \
      -s no \
      -a 10 \
      -t exon \
      -i gene_id \
      aligned_reads.bam \
      /path/to/annotation.gtf \
      > gene_counts.txt
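
htseq-count writes one two-column table (feature ID, count) per sample, followed by a few summary rows such as `no_feature` and `ambiguous`; newer releases prefix those rows with `__`. Downstream tools usually expect a single gene-by-sample matrix. The sketch below shows one way to combine per-sample files with `awk`. The function name `merge_counts` and the file naming are hypothetical, and the column headers are taken from the input file names.

```shell
#!/usr/bin/env bash
set -eu

# merge_counts FILE...
# Prints a tab-delimited matrix to stdout: one row per feature, one column per input file.
# (Hypothetical helper, not part of HTSeq.)
merge_counts() {
    awk -F '\t' -v OFS='\t' '
        # First line of each file: register a new sample column named after the file.
        FNR == 1 {
            nfiles++
            s = FILENAME
            sub(/.*\//, "", s)       # drop the directory part
            sub(/\.[^.]*$/, "", s)   # drop the file extension
            sample[nfiles] = s
        }
        # Skip the summary rows, with or without the "__" prefix.
        $1 ~ /^(__)?(no_feature|ambiguous|too_low_aQual|not_aligned|alignment_not_unique)$/ { next }
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++ngenes] = $1 }
            count[$1, nfiles] = $2
        }
        END {
            header = "gene_id"
            for (j = 1; j <= nfiles; j++) header = header OFS sample[j]
            print header
            for (i = 1; i <= ngenes; i++) {
                row = order[i]
                for (j = 1; j <= nfiles; j++) {
                    key = order[i] SUBSEP j
                    # A feature missing from one file is reported as 0 for that sample.
                    row = row OFS ((key in count) ? count[key] : 0)
                }
                print row
            }
        }
    ' "$@"
}
```

A call such as `merge_counts counts/*.txt > count_matrix.tsv` would then give a table that can be loaded directly into R. Joining on the feature ID, rather than pasting columns side by side, avoids relying on every file listing features in the same order.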

Raw Source Text
Base calling was with Illumina GAP Pipeline Software v1.90
Sequences were aligned to mm9 reference genome using Tophat 2.0.8 with option -g with mm9 reference GTF
Novel transcripts were predicted with Cufflinks 2.1.1
Novel transcripts predictions were merged with mm9 reference genome using Cuffmerge 2.1.1 with option -G with mm9 reference GTF
Counts were generated using htseq-count v0.5.4p3
Genome_build: mm9
Supplementary_files_format_and_content: Raw count data for genes were normalized to the relative size of each library using R/Bioconductor package edgeR calcNormFactors function. Count data are provided in tab-delimited format
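
To illustrate what scaling counts "to the relative size of each library" means, the sketch below converts a count matrix to counts per million using each sample's column total. This is plain library-size scaling, not the TMM factors that edgeR's `calcNormFactors` computes, so it will not reproduce the deposited values; it is included only to keep the example in shell. The function name `cpm_matrix` and the input layout (header row, gene IDs in column 1, tab-delimited) are assumptions.

```shell
#!/usr/bin/env bash
set -eu

# cpm_matrix MATRIX_FILE
# Reads a tab-delimited count matrix (header row, gene IDs in column 1) and prints
# counts per million for every sample column.
# (Hypothetical helper; simple library-size scaling, NOT edgeR TMM normalization.)
cpm_matrix() {
    matrix="$1"
    awk -F '\t' -v OFS='\t' '
        # Pass 1: sum each sample column to get its library size.
        NR == FNR {
            if (FNR > 1) for (j = 2; j <= NF; j++) lib[j] += $j
            next
        }
        # Pass 2: keep the header, then scale each count by its library size.
        FNR == 1 { print; next }
        {
            row = $1
            for (j = 2; j <= NF; j++)
                row = row OFS (lib[j] > 0 ? sprintf("%.2f", $j * 1000000 / lib[j]) : "NA")
            print row
        }
    ' "$matrix" "$matrix"
}
```

The file is passed to `awk` twice so that library sizes are known before any row is scaled. To follow the record itself, the raw count matrix would instead be loaded into R and passed through edgeR.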