GSE83687 Processing Pipeline

RNA-Seq, 4 steps

Publication

RNA binding protein DDX5 directs tuft cell specification and function to regulate microbial repertoire and disease susceptibility in the intestine.

Gut (2022) — PMID 34853057

Dataset

GSE83687

A functional genomics predictive network model identifies regulators of inflammatory bowel disease: Mount Sinai Hospital (MSH) Population Specimen Co…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Illumina Casava 1.8.2 software was used for basecalling.

Illumina Casava v1.8.2

$ Bash example

# The source states only that Illumina Casava 1.8.2 was used for basecalling; no command line, parameters, or reference data are given.
# Casava was Illumina's software suite for processing sequencing output, including demultiplexing and converting per-cycle base call (BCL) files to FASTQ.
# Note that BCL files already hold base calls and quality scores; the raw intensity data are converted to base calls on the instrument before this step.
# Current Illumina workflows typically use bcl2fastq or bcl-convert for BCL-to-FASTQ conversion.

Short reads in FASTQ format are processed using RAPiD, an RNA-Seq analysis framework developed and maintained by the Technology Development group at the Icahn Institute for Genomics and Multiscale Biology.

RNA-seq (version not specified; inferred with models/gemini-2.5-flash)

$ Bash example

# RAPiD is described as an RNA-Seq analysis framework developed and maintained by the Technology Development group at Icahn Institute for Genomics and MultiScale Biology.
# As an internal framework, specific installation instructions and command-line interface details are not publicly available.
# The following is a conceptual representation of how such a tool would be invoked, assuming standard RNA-Seq inputs.

# Placeholder for input FASTQ files
# For single-end reads:
# INPUT_FASTQ="sample_reads.fastq.gz"
# For paired-end reads:
# INPUT_FASTQ_R1="sample_R1.fastq.gz"
# INPUT_FASTQ_R2="sample_R2.fastq.gz"

# Placeholder for reference genome and annotation files (the source specifies the hg19 build)
# GENOME_FASTA="/path/to/reference/hg19.fa"
# GENE_ANNOTATION_GTF="/path/to/reference/hg19.annotation.gtf"

# Placeholder for output directory
# OUTPUT_DIR="rapid_analysis_output"

# Conceptual command for running RAPiD. The name rapid_run and its flags are hypothetical; the actual command would depend on the framework's design.
# rapid_run --fastq $INPUT_FASTQ --genome $GENOME_FASTA --gtf $GENE_ANNOTATION_GTF --output $OUTPUT_DIR
# rapid_run --fastq-r1 $INPUT_FASTQ_R1 --fastq-r2 $INPUT_FASTQ_R2 --genome $GENOME_FASTA --gtf $GENE_ANNOTATION_GTF --output $OUTPUT_DIR

echo "RAPiD is an internal RNA-Seq analysis framework. Specific command-line usage and parameters are not publicly documented."
echo "Please consult the Technology Development group at Icahn Institute for Genomics and MultiScale Biology for detailed instructions."

RAPiD uses STAR to map the short reads to the hg19 reference, and the resulting alignments in BAM format are quantified for gene-level expression using featureCounts from the Subread package.

STAR v2.7.10a

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Placeholder for hg19 STAR index. This directory should contain the files written by STAR --runMode genomeGenerate (e.g., Genome, SA, SAindex).
GENOME_DIR="/path/to/STAR_index/hg19"
# Placeholder for input R1 FASTQ file (e.g., sample_R1.fastq.gz)
READS_R1="input_reads_R1.fastq.gz"
# Placeholder for input R2 FASTQ file (e.g., sample_R2.fastq.gz). Remove if single-end.
READS_R2="input_reads_R2.fastq.gz"
# Prefix for output files. STAR appends its own suffixes, so the sorted BAM is written as sample_alignedAligned.sortedByCoord.out.bam
OUTPUT_PREFIX="sample_aligned"
# Number of threads to use for alignment
THREADS=8

# Run STAR alignment
# --genomeDir: Path to the STAR genome index directory
# --readFilesIn: Input FASTQ files (space-separated for paired-end)
# --runThreadN: Number of threads
# --outFileNamePrefix: Prefix for all output files
# --outSAMtype BAM SortedByCoordinate: Output sorted BAM file
# --readFilesCommand zcat: Command to decompress gzipped FASTQ files
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --readFilesCommand zcat
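
The gene-level quantification described for this step has no accompanying example, so a minimal sketch follows. It assumes paired-end, unstranded libraries and an hg19-compatible GTF, none of which the source specifies. All file paths are placeholders, and `--countReadPairs` requires Subread 2.0.2 or later.

```bash
# Sketch only: paths, strandedness, and library layout are assumptions, not taken from the source.
BAM="sample_alignedAligned.sortedByCoord.out.bam"   # STAR names its sorted BAM <prefix>Aligned.sortedByCoord.out.bam
ANNOTATION_GTF="/path/to/hg19/annotation.gtf"       # placeholder hg19 gene annotation
COUNTS_OUT="sample_gene_counts.txt"                 # tab-delimited count table written by featureCounts
THREADS=8

# -p --countReadPairs: treat input as paired-end and count fragments (drop both for single-end)
# -s 0: unstranded (1 = stranded, 2 = reversely stranded)
# -t exon -g gene_id: count reads over exons and summarize them per gene
if command -v featureCounts >/dev/null 2>&1 && [ -f "${BAM}" ] && [ -f "${ANNOTATION_GTF}" ]; then
    featureCounts -T "${THREADS}" \
                  -p --countReadPairs \
                  -s 0 \
                  -t exon -g gene_id \
                  -a "${ANNOTATION_GTF}" \
                  -o "${COUNTS_OUT}" \
                  "${BAM}"
else
    echo "featureCounts or its inputs were not found; skipping quantification." >&2
fi
```

The guard around the call only keeps the sketch from failing when the tool or the placeholder inputs are absent; in a real run the strandedness setting should be checked against the library preparation protocol.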

Detailed QC metrics are generated using the RNA-SeQC package.

RNA-SeQC v2.3.6 (inferred with models/gemini-2.5-flash)

$ Bash example

# Install RNA-SeQC via conda (the Bioconda package is named rna-seqc)
# conda install -c bioconda rna-seqc

# Define variables for input and output
INPUT_BAM="sample_alignedAligned.sortedByCoord.out.bam" # Placeholder: coordinate-sorted BAM from the STAR step
GENE_MODEL_GTF="/path/to/hg19/annotation.collapsed.gtf" # Placeholder: hg19 gene model GTF; RNA-SeQC 2.x expects a collapsed, gene-level annotation
OUTPUT_DIR="rnaseqc_output"
SAMPLE_ID="sample1" # Identifier used as the prefix of the output files

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Execute RNA-SeQC
# The source does not state which RNA-SeQC version was run; the version label above is inferred.
# RNA-SeQC 2.x is a standalone binary that takes positional arguments: <gtf> <bam> <output_dir>.
# The older Java release (1.x) instead used the -o, -r, -t and -s "SampleID|PathToBam|Notes" options.
rnaseqc "${GENE_MODEL_GTF}" "${INPUT_BAM}" "${OUTPUT_DIR}" \
  --sample "${SAMPLE_ID}"

Tools Used

RNA-seq STAR

Raw Source Text

Illumina Casava1.82 software used for basecalling.
Short reads in fastQ format are processed using RAPiD, which is a RNA-Seq analysis framework developed and maintained by the Technology Development group at Icahn Institute for Genomics and MultiScale Biology. .
RAPiD uses STAR to map the short reads to the  [hg19 ] reference and resultant alignment map in BAM format is quantified for gene level expression using featureCounts of the subreads package. Detailed QC metrics are generated using the RNASeQC package
Genome_build: hg19
Supplementary_files_format_and_content: tab-delimited text files include fragment per kilobase per million (FPKM) for each gene
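
The supplementary files report FPKM, but the source does not say how it was derived from the counts. A common definition is FPKM = fragments × 10^9 / (gene length in bp × total assigned fragments). The sketch below applies it to a single-sample featureCounts table, where column 6 is Length and column 7 holds the counts. The function name `counts_to_fpkm` and the file names are illustrative, and using the total of assigned fragments as the library size is an assumption.

```bash
# Convert a single-sample featureCounts table to FPKM (illustrative helper, not from the source).
# Expected columns: Geneid, Chr, Start, End, Strand, Length, <counts>
counts_to_fpkm() {
    # The file is read twice: pass 1 sums the counts, pass 2 prints Geneid and FPKM.
    awk -F'\t' '
        /^#/ || $1 == "Geneid" { next }      # skip the command-line comment and the header row
        NR == FNR { total += $7; next }      # pass 1: total assigned fragments
        total > 0 && $6 > 0 {                # pass 2: FPKM = count * 1e9 / (length * total)
            printf "%s\t%.2f\n", $1, ($7 * 1e9) / ($6 * total)
        }
    ' "$1" "$1"
}

COUNTS_FILE="sample_gene_counts.txt"   # placeholder: featureCounts output for one sample
if [ -f "${COUNTS_FILE}" ]; then
    counts_to_fpkm "${COUNTS_FILE}" > sample_gene_fpkm.txt
fi
```

A table with several samples would need one such pass per count column, each with its own total.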
