GSE83687 Processing Pipeline
RNA-Seq
4 steps
Publication
RNA binding protein DDX5 directs tuft cell specification and function to regulate microbial repertoire and disease susceptibility in the intestine. Gut (2022), PMID 34853057
Dataset
GSE83687: A functional genomics predictive network model identifies regulators of inflammatory bowel disease: Mount Sinai Hospital (MSH) Population Specimen Co…
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
Illumina Casava 1.8.2 software was used for basecalling.
Illumina Casava v1.8.2
$ Bash example
# The description states: "Illumina Casava1.82 software used for basecalling."
# Casava was an integrated Illumina software suite for processing sequencing data,
# including conversion of base call (BCL) files to FASTQ.
# Base calling turns raw intensity data into base calls and quality scores;
# these are stored as BCL files and then converted to FASTQ.
# The exact command line used with Casava 1.8.2 is not provided in the description.
# Current Illumina workflows typically perform the BCL-to-FASTQ conversion
# with bcl2fastq or BCL Convert.
# No specific parameters or reference datasets are mentioned in the description.
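Because no Casava command is recorded, one practical check on downloaded reads is whether their FASTQ headers follow the Casava 1.8+ layout (`@instrument:run:flowcell:lane:tile:x:y read:is_filtered:control:index`). The sketch below is only an illustration: the function names `is_casava18_header` and `passed_filter` are hypothetical helpers, and the regular expression is a simplified reading of that header layout, not something taken from the dataset record.

```shell
#!/usr/bin/env bash
# Simplified pattern for a Casava 1.8+ FASTQ header line.
# The capture group holds the is_filtered flag: Y = read was filtered out, N = read kept.
CASAVA18_RE='^@[^:[:space:]]+:[0-9]+:[^:[:space:]]+:[0-9]+:[0-9]+:[0-9]+:[0-9]+ [12]:([YN]):[0-9]+:[^[:space:]]*$'

# Hypothetical helper: succeed if the header line matches the Casava 1.8+ layout.
is_casava18_header() {
    [[ "$1" =~ $CASAVA18_RE ]]
}

# Hypothetical helper: succeed if the header matches and the read was not filtered out.
passed_filter() {
    [[ "$1" =~ $CASAVA18_RE ]] && [[ "${BASH_REMATCH[1]}" == "N" ]]
}

# Usage sketch: inspect the first header of a gzipped FASTQ file given as $1.
if [[ $# -gt 0 && -f "$1" ]]; then
    IFS= read -r first_header < <(gzip -cd "$1")
    if is_casava18_header "$first_header"; then
        echo "Casava 1.8+ style header: $first_header"
    else
        echo "Header does not match the Casava 1.8+ layout: $first_header"
    fi
fi
```

Headers from older pipelines (for example `@HWUSI-EAS100R:6:73:941:1973#0/1`) do not match this pattern, and reads re-exported by a sequence archive may carry rewritten headers, so a non-match does not by itself contradict the processing description.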
2
Short reads in FASTQ format are processed using RAPiD, an RNA-Seq analysis framework developed and maintained by the Technology Development group at the Icahn Institute for Genomics and Multiscale Biology.
RNA-seq, version not specified (inferred with models/gemini-2.5-flash)
$ Bash example
# RAPiD is described as an RNA-Seq analysis framework developed and maintained by the
# Technology Development group at Icahn Institute for Genomics and MultiScale Biology.
# As an internal framework, its installation instructions and command-line interface
# are not publicly available.
# The following is a conceptual representation of how such a tool might be invoked,
# assuming standard RNA-Seq inputs.

# Placeholder for input FASTQ files
# For single-end reads:
# INPUT_FASTQ="sample_reads.fastq.gz"
# For paired-end reads:
# INPUT_FASTQ_R1="sample_R1.fastq.gz"
# INPUT_FASTQ_R2="sample_R2.fastq.gz"

# Placeholder for reference genome and annotation files.
# The source states the hg19 build; the annotation release is not specified.
# GENOME_FASTA="/path/to/reference/hg19.fa"
# GENE_ANNOTATION_GTF="/path/to/reference/hg19_genes.gtf"

# Placeholder for output directory
# OUTPUT_DIR="rapid_analysis_output"

# Conceptual command for running RAPiD. "rapid_run" and its options are hypothetical;
# the actual command would depend on the framework's design.
# rapid_run --fastq $INPUT_FASTQ --genome $GENOME_FASTA --gtf $GENE_ANNOTATION_GTF --output $OUTPUT_DIR
# rapid_run --fastq-r1 $INPUT_FASTQ_R1 --fastq-r2 $INPUT_FASTQ_R2 --genome $GENOME_FASTA --gtf $GENE_ANNOTATION_GTF --output $OUTPUT_DIR

echo "RAPiD is an internal RNA-Seq analysis framework. Specific command-line usage and parameters are not publicly documented."
echo "Please consult the Technology Development group at Icahn Institute for Genomics and MultiScale Biology for detailed instructions."
3
RAPiD uses STAR to map the short reads to the hg19 reference, and the resulting alignments in BAM format are quantified for gene-level expression using featureCounts from the Subread package.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Placeholder for the hg19 STAR index directory. It must have been built beforehand
# with STAR --runMode genomeGenerate and contains files such as SA and SAindex.
GENOME_DIR="/path/to/STAR_index/hg19"
# Placeholder for input R1 FASTQ file (e.g., sample_R1.fastq.gz)
READS_R1="input_reads_R1.fastq.gz"
# Placeholder for input R2 FASTQ file (e.g., sample_R2.fastq.gz).
# The source does not say whether reads are paired; remove R2 here and below if single-end.
READS_R2="input_reads_R2.fastq.gz"
# Prefix for output files. STAR appends fixed suffixes, so the sorted BAM is written to
# sample_alignedAligned.sortedByCoord.out.bam
OUTPUT_PREFIX="sample_aligned"
# Number of threads to use for alignment
THREADS=8

# Run STAR alignment
# --genomeDir: Path to the STAR genome index directory
# --readFilesIn: Input FASTQ files (space-separated for paired-end)
# --runThreadN: Number of threads
# --outFileNamePrefix: Prefix for all output files
# --outSAMtype BAM SortedByCoordinate: Output a coordinate-sorted BAM file
# --readFilesCommand zcat: Command to decompress gzipped FASTQ files
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --readFilesCommand zcat
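The step description also names featureCounts for gene-level quantification, which the STAR example does not cover. A minimal sketch of that counting step follows. The file paths, the unstranded setting (`-s 0`), and the paired-end flag (`-p`) are assumptions, since the source gives no featureCounts parameters; the command is printed rather than run when the tool or inputs are missing.

```shell
#!/usr/bin/env bash
# Sketch: gene-level counting with featureCounts on the coordinate-sorted BAM from STAR.
# All paths are placeholders, not values taken from the dataset record.
BAM="${BAM:-sample_alignedAligned.sortedByCoord.out.bam}"
GTF="${GTF:-/path/to/annotation/hg19_genes.gtf}"   # hypothetical hg19 gene annotation
COUNTS_OUT="${COUNTS_OUT:-sample_gene_counts.txt}"
THREADS="${THREADS:-8}"

# Build the command as an array so it can be inspected before it is run.
# -t exon -g gene_id : count reads overlapping exons, summarised per gene
# -s 0               : unstranded library (assumption; strandedness is not stated)
# -p                 : paired-end input (assumption; drop for single-end data)
FC_CMD=(featureCounts -T "$THREADS" -p -s 0 -t exon -g gene_id
        -a "$GTF" -o "$COUNTS_OUT" "$BAM")

if command -v featureCounts >/dev/null 2>&1 && [[ -f "$BAM" && -f "$GTF" ]]; then
    "${FC_CMD[@]}"
else
    # Dry run: show the command when featureCounts or the inputs are unavailable.
    printf 'Would run: %s\n' "${FC_CMD[*]}"
fi
```

Recent Subread releases additionally require `--countReadPairs` alongside `-p` to count fragments instead of individual reads, so check the behaviour of the installed version before comparing counts across runs.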
4
Detailed QC metrics are generated using the RNA-SeQC package.
$ Bash example
# Install RNA-SeQC, for example via Bioconda. Check the package name and version
# available in your channel before relying on this line.
# conda install -c bioconda rna-seqc

# The RNA-SeQC version is not stated in the source. The call below follows the Java
# command line of RNA-SeQC 1.1.x. RNA-SeQC 2.x is a different program that takes
# positional arguments instead: rnaseqc <gtf> <bam> <output_dir>

# Define variables for input and output
RNASEQC_JAR="/path/to/RNA-SeQC.jar"  # Placeholder: path to the RNA-SeQC 1.1.x JAR
INPUT_BAM="sample_alignedAligned.sortedByCoord.out.bam"  # Placeholder: coordinate-sorted BAM from the STAR step
REFERENCE_FASTA="/path/to/reference/hg19.fa"  # Placeholder: hg19 reference FASTA, matching the build used for alignment
GENE_MODEL_GTF="/path/to/annotation/hg19_genes.gtf"  # Placeholder: gene model GTF on hg19 coordinates
OUTPUT_DIR="rnaseqc_output"
SAMPLE_ID="sample1"  # Identifier for the sample, used in reports

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Execute RNA-SeQC
# RNA-SeQC 1.1.x expects an indexed BAM with read groups, and a reference FASTA
# accompanied by its .fai and .dict files.
java -jar "${RNASEQC_JAR}" \
    -o "${OUTPUT_DIR}" \
    -r "${REFERENCE_FASTA}" \
    -t "${GENE_MODEL_GTF}" \
    -s "${SAMPLE_ID}|${INPUT_BAM}|RNA"  # Format: "SampleID|PathToBam|Notes"
Raw Source Text
Illumina Casava1.82 software used for basecalling. Short reads in fastQ format are processed using RAPiD, which is a RNA-Seq analysis framework developed and maintained by the Technology Development group at Icahn Institute for Genomics and MultiScale Biology. . RAPiD uses STAR to map the short reads to the [hg19 ] reference and resultant alignment map in BAM format is quantified for gene level expression using featureCounts of the subreads package. Detailed QC metrics are generated using the RNASeQC package Genome_build: hg19 Supplementary_files_format_and_content: tab-delimited text files include fragment per kilobase per million (FPKM) for each gene
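The supplementary files are described as holding FPKM per gene, but the source does not say how FPKM was derived from the featureCounts output. The sketch below applies the standard definition, FPKM = count × 10^9 / (gene length in bp × total counted fragments), to a featureCounts table with a single sample column. The function name `counts_to_fpkm` is hypothetical, and using the sum of assigned counts as the library-size denominator is an assumption; a pipeline may use total mapped fragments instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a single-sample featureCounts table to FPKM.
# Expected columns: Geneid, Chr, Start, End, Strand, Length, <sample counts>.
counts_to_fpkm() {
    local counts_file="$1"
    # The file is read twice: first to total the counts, then to compute FPKM.
    awk -F'\t' '
        # Pass 1: sum counts (column 7), skipping the "#" comment and header lines.
        NR == FNR {
            if ($0 !~ /^#/ && $1 != "Geneid") total += $7
            next
        }
        # Pass 2: skip the same non-data lines.
        /^#/ || $1 == "Geneid" { next }
        # FPKM = count * 1e9 / (gene length in bp * total assigned counts)
        {
            fpkm = (total > 0 && $6 > 0) ? $7 * 1e9 / ($6 * total) : 0
            printf "%s\t%.4f\n", $1, fpkm
        }
    ' "$counts_file" "$counts_file"
}

# Usage sketch: print gene ID and FPKM for a counts file given as $1.
if [[ $# -gt 0 && -f "$1" ]]; then
    counts_to_fpkm "$1"
fi
```

As a worked check of the arithmetic, a 1,000 bp gene with 100 counts in a library of 400 assigned counts gives 100 × 10^9 / (1,000 × 400) = 250,000 FPKM. Values computed this way will not necessarily match the deposited files if a different length or library-size definition was used.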