GSE54968 Processing Pipeline

RNA-Seq, 4 steps

Publication

Identification of novel long noncoding RNAs underlying vertebrate cardiovascular development.

Circulation (2015) — PMID 25739401

Dataset

Transcriptomic analysis reveals novel long non-coding RNAs critical for vertebrate development [RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Illumina Casava 1.7 software was used for basecalling.

Illumina Casava v1.7

$ Bash example

# Illumina Casava 1.7 is proprietary software typically integrated with Illumina sequencing instruments.
# Basecalling runs as part of data acquisition on the sequencer, so users typically
# do not execute a command for the basecalling step itself.
# The output of this step is BCL (base call) files.
# Demultiplexing and conversion to FASTQ are separate steps handled by other Illumina tools
# (e.g., configureBclToFastq.pl and make in Casava 1.8 and later, or the newer bcl2fastq).


Reads were aligned to the human genome (hg19) with STAR (v2.2.0c) with default parameters.

STAR v2.2.0c

$ Bash example

# Install STAR if not already installed.
# Note: v2.2.0c is an old release and may not be available from bioconda;
# if it is not, build it from the STAR source archive or use a newer release.
# conda install -c bioconda star=2.2.0c

# Define variables
# Replace /path/to/your/hg19/STAR_index with the actual path to your STAR genome index for hg19.
# The index is built once with STAR --runMode genomeGenerate.
GENOME_DIR="/path/to/your/hg19/STAR_index"
READS_FILE="reads.fastq.gz" # Single-end input FASTQ; for paired-end data, pass the R1 and R2 files to --readFilesIn as two separate arguments
OUTPUT_PREFIX="star_output/" # Output files are written with this prefix
THREADS=8 # Number of threads; adjust to the available CPU cores

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_PREFIX}"

# Run STAR alignment with default alignment parameters.
# --readFilesCommand zcat is needed because the input FASTQ is gzip-compressed.
# --outSAMtype BAM SortedByCoordinate is not a STAR default (the default output is unsorted SAM);
# it is added here for convenience and may not be supported by releases as old as v2.2.0c.
STAR \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READS_FILE}" \
  --readFilesCommand zcat \
  --runThreadN "${THREADS}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes Standard
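
The alignment step expects a prebuilt STAR genome index. The following is a minimal sketch of building one, not the authors' documented procedure: the function name build_star_index and the FASTA and index paths are hypothetical, and a splice-junction annotation is left out because the source text does not say whether one was used.

```shell
#!/usr/bin/env bash
# Sketch: build a STAR genome index once, then reuse it for every alignment.
# build_star_index is a hypothetical helper name; all paths are placeholders.
build_star_index() {
  local genome_fasta="$1"   # genome FASTA file, e.g. an hg19 assembly
  local genome_dir="$2"     # directory that will hold the index
  local threads="${3:-8}"   # thread count, defaults to 8

  mkdir -p "${genome_dir}"
  STAR \
    --runMode genomeGenerate \
    --genomeDir "${genome_dir}" \
    --genomeFastaFiles "${genome_fasta}" \
    --runThreadN "${threads}"
}

# Usage (uncomment and adjust the paths):
# build_star_index /path/to/hg19.fa /path/to/your/hg19/STAR_index 8
```

Index generation for a human genome needs substantial memory, so it is usually run once on a large node and the resulting directory is shared across samples.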


Only uniquely alignable reads were used for downstream analysis.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star

# Create a directory for STAR output
mkdir -p star_output

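# Align reads and keep only uniquely mapping reads (--outFilterMultimapNmax 1).
# Note: several of the --outFilter* and --align* options below are not STAR defaults, while the
# source text reports default parameters; drop them to stay closer to the original analysis.
# Replace /path/to/STAR_genome_index with the actual path to your STAR genome index
# Replace reads_R1.fastq.gz and reads_R2.fastq.gz with your input FASTQ files (paired-end input is assumed here)
# Adjust --runThreadN based on available CPU cores
# Adjust --limitBAMsortRAM based on available RAM (e.g., 30000000000 bytes for 30 GB)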
STAR \
  --runThreadN 8 \
  --genomeDir /path/to/STAR_genome_index \
  --readFilesIn reads_R1.fastq.gz reads_R2.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix star_output/aligned_unique_reads_ \
  --outSAMtype BAM SortedByCoordinate \
  --outFilterMultimapNmax 1 \
  --outFilterType BySJout \
  --outFilterMismatchNmax 10 \
  --outFilterMismatchNoverLmax 0.04 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --limitBAMsortRAM 30000000000
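
An alternative to filtering during alignment is to filter an existing alignment file. By default STAR reports MAPQ 255 for uniquely mapped reads, so a sketch like the one below keeps only those records from a SAM file. It assumes the MAPQ default was not changed at alignment time; filter_unique_sam and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: keep SAM header lines plus alignments whose MAPQ (column 5) is 255,
# the value STAR assigns to uniquely mapped reads by default.
# filter_unique_sam is a hypothetical helper name.
filter_unique_sam() {
  awk -F '\t' '/^@/ || $5 == 255' "$1"
}

# Usage (uncomment and adjust the file names):
# filter_unique_sam star_output/Aligned.out.sam > star_output/unique.sam
```

For BAM input, samtools view -h -q 255 applies the same MAPQ threshold without converting to text first.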


Reads Per Kilobase of exon per million mapped reads (RPKM) were calculated using HOMER (http://homer.salk.edu/homer/) across the entire gene (both introns and exons).

HOMER v4.11

$ Bash example

# Install HOMER (if not already installed)
# Download from http://homer.salk.edu/homer/ and follow the installation instructions there
# Or, if available via a package manager:
# conda install -c bioconda homer

# Define input and output paths
INPUT_BAM="path/to/your/aligned_reads.bam"
OUTPUT_TAG_DIR="homer_tag_directory"
GENOME_BUILD="hg19" # The source text specifies hg19. Ensure HOMER has this genome installed.
OUTPUT_RPKM_FILE="gene_rpkm_quantification.txt"

# Step 1: Create a HOMER tag directory from the BAM file
# This step processes the aligned reads into a format HOMER can use efficiently.
# -format sam: declares SAM/BAM input (reading BAM requires samtools on the PATH)
makeTagDirectory "${OUTPUT_TAG_DIR}" "${INPUT_BAM}" -format sam

# Step 2: Calculate RPKM per gene
# The source text does not name the HOMER command; analyzeRepeats.pl, HOMER's
# RNA-seq quantification script, is assumed here.
# rna: use HOMER's default gene annotation for the genome build (the deposited table
#      reports Ensembl IDs, so a custom annotation may have been supplied instead).
# -count genes: count reads across the whole gene body (introns and exons), as the source describes.
# -rpkm: normalize to reads per kilobase per million mapped reads.
# -d: the tag directory created in Step 1.
analyzeRepeats.pl rna "${GENOME_BUILD}" -count genes -rpkm -d "${OUTPUT_TAG_DIR}" > "${OUTPUT_RPKM_FILE}"
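
To make the normalization concrete, RPKM for one gene is reads × 10^9 / (gene length in bp × total mapped reads). The sketch below only illustrates that arithmetic and is not HOMER's implementation. Because the source counts reads across introns and exons, the length passed in is assumed to be the full gene span; the function name rpkm and the example numbers are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: RPKM = reads * 1e9 / (gene_length_bp * total_mapped_reads).
# rpkm is a hypothetical helper name; the inputs are illustrative values.
rpkm() {
  # $1 = reads assigned to the gene
  # $2 = gene length in bp (full gene span when introns are included)
  # $3 = total mapped reads in the sample
  awk -v c="$1" -v len="$2" -v n="$3" \
    'BEGIN { printf "%.3f\n", (c * 1e9) / (len * n) }'
}

# Example: 500 reads on a 10 kb gene in a sample with 20 million mapped reads
rpkm 500 10000 20000000   # prints 2.500
```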

Tools Used

STAR HOMER

Raw Source Text

Illumina Casava1.7 software used for basecalling.
Reads were aligned to the human genome (hg19) with STAR (v2.2.0c) with default parameters.  Only uniquely alignable reads were used for downstream analysis
Reads Per Kilobase of exon per milion mapped reads were calculated using HOMER (http://homer.salk.edu/homer/) across the entire gene (both introns and exons)
Genome_build: hg19
Supplementary_files_format_and_content: Tab delimited text file, Column Def: 1-ensembl ID, 2-chr, 3-gene start, 4-gene end, 5-strand, 6-gene length, 7-copies in genome, 8-gene name, 9-chr band, 10-gene type, 11-status, 12-aliases, 13-18: RPKM values
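
Given the column layout described above, a short sketch for pulling the Ensembl ID, gene name, and mean RPKM out of the supplementary table could look like the following. It assumes a tab-delimited file whose first line is a header and whose columns 13 to 18 all hold numeric RPKM values; summarize_rpkm_table and the file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print Ensembl ID (column 1), gene name (column 8), and the mean of
# the six RPKM columns (13-18) for every data row of the supplementary table.
# Assumes the first line is a header; summarize_rpkm_table is a hypothetical name.
summarize_rpkm_table() {
  awk -F '\t' 'NR > 1 {
    sum = 0
    for (i = 13; i <= 18; i++) sum += $i
    printf "%s\t%s\t%.3f\n", $1, $8, sum / 6
  }' "$1"
}

# Usage (uncomment and adjust the file name):
# summarize_rpkm_table supplementary_rpkm_table.txt > mean_rpkm_per_gene.txt
```

If the deposited file has no header line, change NR > 1 to NR >= 1 so the first gene is not skipped.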
