GSE83687 Processing Pipeline

RNA-Seq, 4 steps

Publication

RNA binding protein DDX5 directs tuft cell specification and function to regulate microbial repertoire and disease susceptibility in the intestine.

Gut (2022) — PMID 34853057

Dataset

GSE83687

A functional genomics predictive network model identifies regulators of inflammatory bowel disease: Mount Sinai Hospital (MSH) Population Specimen Co…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Illumina Casava 1.82 software was used for basecalling.

    Illumina Casava v1.82
    $ Bash example
    # The description states only that Illumina Casava 1.82 (presumably CASAVA 1.8.2) was used for basecalling.
    # CASAVA was Illumina's software suite for processing sequencing output; in the 1.8 series
    # it converted and demultiplexed per-cycle base-call (BCL) files into FASTQ.
    # The base calls themselves are produced on the instrument from raw intensity data
    # and written as BCL files together with quality scores.
    # The original command line, parameters, and sample sheet are not given in the description.
    # A typical CASAVA 1.8 conversion looked roughly like this (all paths are placeholders):
    #   configureBclToFastq.pl --input-dir /path/to/run/Data/Intensities/BaseCalls \
    #                          --output-dir /path/to/Unaligned \
    #                          --sample-sheet /path/to/SampleSheet.csv
    #   make -C /path/to/Unaligned -j 8
    # Current Illumina workflows typically use bcl2fastq or BCL Convert instead.
  2.

    Short reads in FASTQ format are processed using RAPiD, an RNA-Seq analysis framework developed and maintained by the Technology Development group at Icahn Institute for Genomics and MultiScale Biology.

    RNA-seq, version not specified (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # RAPiD is described as an RNA-Seq analysis framework developed and maintained by the Technology Development group at Icahn Institute for Genomics and MultiScale Biology.
    # The description gives no installation instructions or command-line details for it.
    # The following is a conceptual sketch of how such a tool might be invoked, assuming standard RNA-Seq inputs.
    
    # Placeholder for input FASTQ files
    # For single-end reads:
    # INPUT_FASTQ="sample_reads.fastq.gz"
    # For paired-end reads:
    # INPUT_FASTQ_R1="sample_R1.fastq.gz"
    # INPUT_FASTQ_R2="sample_R2.fastq.gz"
    
    # Placeholder for reference genome and annotation files (the dataset reports genome build hg19)
    # GENOME_FASTA="/path/to/reference/hg19.fa"
    # GENE_ANNOTATION_GTF="/path/to/reference/hg19_genes.gtf"
    
    # Placeholder for output directory
    # OUTPUT_DIR="rapid_analysis_output"
    
    # Conceptual command only: "rapid_run" and its flags are hypothetical, not RAPiD's actual interface.
    # rapid_run --fastq "$INPUT_FASTQ" --genome "$GENOME_FASTA" --gtf "$GENE_ANNOTATION_GTF" --output "$OUTPUT_DIR"
    # rapid_run --fastq-r1 "$INPUT_FASTQ_R1" --fastq-r2 "$INPUT_FASTQ_R2" --genome "$GENOME_FASTA" --gtf "$GENE_ANNOTATION_GTF" --output "$OUTPUT_DIR"
    
    echo "RAPiD command-line usage and parameters are not given in the dataset description."
    echo "Consult the Technology Development group at Icahn Institute for Genomics and MultiScale Biology for detailed instructions."
  3.

    RAPiD uses STAR to map the short reads to the hg19 reference; the resulting alignments in BAM format are quantified for gene-level expression using featureCounts from the Subread package.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    # Placeholder for the hg19 STAR index directory, as built with STAR --runMode genomeGenerate (contains Genome, SA, SAindex, etc.)
    GENOME_DIR="/path/to/STAR_index/hg19"
    # Placeholder for input R1 FASTQ file (e.g., sample_R1.fastq.gz)
    READS_R1="input_reads_R1.fastq.gz"
    # Placeholder for input R2 FASTQ file (e.g., sample_R2.fastq.gz). Remove if single-end.
    READS_R2="input_reads_R2.fastq.gz"
    # Prefix for output files; STAR appends its own suffixes (e.g., sample_alignedAligned.sortedByCoord.out.bam)
    OUTPUT_PREFIX="sample_aligned"
    # Number of threads to use for alignment
    THREADS=8
    
    # Run STAR alignment
    # --genomeDir: Path to the STAR genome index directory
    # --readFilesIn: Input FASTQ files (space-separated for paired-end)
    # --runThreadN: Number of threads
    # --outFileNamePrefix: Prefix for all output files
    # --outSAMtype BAM SortedByCoordinate: Output sorted BAM file
    # --readFilesCommand zcat: Command to decompress gzipped FASTQ files
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READS_R1}" "${READS_R2}" \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate \
         --readFilesCommand zcat
  4.

    Detailed QC metrics are generated using the RNASeQC package.

    RNASeQC v2.3.6 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install RNA-SeQC via conda (the bioconda package is named rna-seqc)
    # conda install -c bioconda rna-seqc
    
    # Define variables for input and output
    INPUT_BAM="sample_alignedAligned.sortedByCoord.out.bam" # Placeholder: coordinate-sorted BAM from the STAR step
    GENE_MODEL_GTF="/path/to/annotation/hg19_genes.collapsed.gtf" # Placeholder: hg19 gene model matching the alignment reference
    OUTPUT_DIR="rnaseqc_output"
    SAMPLE_ID="sample1" # Identifier for the sample, used as the output file prefix
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Execute RNA-SeQC
    # The version (v2.3.6) is inferred, not stated in the description; the original analysis
    # may have used the older Java release, whose arguments differ.
    # RNA-SeQC 2.x takes positional arguments: <gtf> <bam> <output_dir>.
    # It expects a GTF in which each gene is collapsed to a single transcript model.
    rnaseqc "${GENE_MODEL_GTF}" "${INPUT_BAM}" "${OUTPUT_DIR}" \
      --sample "${SAMPLE_ID}"
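
Step 3 also names featureCounts (Subread package) for gene-level quantification, which the STAR example does not cover. The following is a minimal sketch of that step, not the authors' actual invocation: the BAM name follows from the STAR output prefix used above, while the GTF path, output file name, and thread count are placeholders. The `-t exon -g gene_id` options are featureCounts defaults written out explicitly. The script only runs featureCounts when the tool and both input files are present.

```shell
# Placeholders (assumptions, not taken from the dataset description)
BAM="sample_alignedAligned.sortedByCoord.out.bam"   # STAR output from step 3
GTF="/path/to/annotation/hg19_genes.gtf"            # hg19 gene annotation
COUNTS="sample_gene_counts.txt"                     # featureCounts output table
THREADS=8

# Gene-level counting: reads overlapping "exon" features are summed per gene_id.
# For paired-end libraries add -p (plus --countReadPairs in Subread >= 2.0.2).
run_featurecounts() {
  featureCounts -T "${THREADS}" -t exon -g gene_id \
    -a "${GTF}" -o "${COUNTS}" "${BAM}"
}

# Run only if the tool and both inputs exist; otherwise report and skip.
if command -v featureCounts >/dev/null 2>&1 && [ -f "${BAM}" ] && [ -f "${GTF}" ]; then
  run_featurecounts
  STATUS="ran"
else
  STATUS="skipped"
  echo "featureCounts or its inputs not found; nothing was run."
fi
```

The output table lists one row per gene with its summed exon length and read count, which is the input needed for a length-normalised measure such as FPKM.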

Raw Source Text
Illumina Casava1.82 software used for basecalling.
Short reads in fastQ format are processed using RAPiD, which is a RNA-Seq analysis framework developed and maintained by the Technology Development group at Icahn Institute for Genomics and MultiScale Biology. .
RAPiD uses STAR to map the short reads to the  [hg19 ] reference and resultant alignment map in BAM format is quantified for gene level expression using featureCounts of the subreads package. Detailed QC metrics are generated using the RNASeQC package
Genome_build: hg19
Supplementary_files_format_and_content: tab-delimited text files include fragment per kilobase per million (FPKM) for each gene
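
The supplementary files report FPKM per gene, but the description does not say how it was computed. As an illustration of the arithmetic only, FPKM for a gene can be taken as count × 10^9 / (gene length in bp × total counted fragments). The sketch below applies that formula to a toy featureCounts-style table. The file names and all values are hypothetical, and normalising by the sum of assigned counts (rather than all mapped reads) is an assumption.

```shell
# Toy featureCounts-style table (hypothetical values, for illustration only)
COUNTS="toy_counts.txt"
printf '# Program:featureCounts (toy example)\n' > "${COUNTS}"
printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.bam\n' >> "${COUNTS}"
printf 'geneA\tchr1\t1\t1000\t+\t1000\t500\n' >> "${COUNTS}"
printf 'geneB\tchr1\t2000\t5999\t-\t2000\t1500\n' >> "${COUNTS}"

# Read the file twice: the first pass sums the counts (column 7),
# the second pass computes FPKM from count, length (column 6), and that total.
awk -F'\t' '
  /^#/ || $1 == "Geneid" { next }        # skip comment and header lines
  NR == FNR { total += $7; next }        # first pass: accumulate total counts
  { printf "%s\t%.2f\n", $1, ($7 * 1e9) / ($6 * total) }
' "${COUNTS}" "${COUNTS}" > toy_fpkm.txt

cat toy_fpkm.txt
```

With these toy numbers the total is 2000 counts, so geneA (500 counts over 1000 bp) comes out at 250000.00 and geneB (1500 counts over 2000 bp) at 375000.00.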