GSE54968 Processing Pipeline
RNA-Seq
4 steps
Publication
Identification of novel long noncoding RNAs underlying vertebrate cardiovascular development. Circulation (2015), PMID 25739401
Dataset
GSE54968: Transcriptomic analysis reveals novel long non-coding RNAs critical for vertebrate development [RNA-Seq]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
Illumina Casava 1.7 software was used for basecalling.
$ Bash example
# Illumina Casava 1.7 is proprietary software typically integrated with Illumina sequencing instruments.
# Basecalling is performed by the instrument's control software, which utilizes Casava components.
# The output of this step is BCL (base call) files.
# Users typically do not run basecalling from the command line,
# because it is part of real-time data acquisition on the sequencer.
# Subsequent steps such as demultiplexing and BCL-to-FASTQ conversion use other Casava tools
# (e.g., configureBclToFastq.pl and make in later Casava releases, or the successor bcl2fastq).
2
Reads were aligned to the human genome (hg19) with STAR (v2.2.0c) with default parameters.
$ Bash example
# Install STAR if not already installed
# conda install -c bioconda star=2.2.0c

# Define variables.
# Replace /path/to/your/hg19/STAR_index with the actual path to your STAR genome index for hg19.
# The index needs to be built once using STAR --runMode genomeGenerate.
GENOME_DIR="/path/to/your/hg19/STAR_index"
# Input FASTQ file. For paired-end data, pass both mate files to --readFilesIn
# as two separate arguments instead of this single variable.
READS_FILE="reads.fastq.gz"
OUTPUT_PREFIX="star_output/"  # Output files will be prefixed with this path
THREADS=8                     # Number of threads; adjust to the available CPU cores

# Create the output directory if it doesn't exist
mkdir -p "${OUTPUT_PREFIX}"

# Run STAR alignment with default alignment parameters.
# --readFilesCommand zcat is needed because the input file is gzip-compressed.
# --outSAMtype BAM SortedByCoordinate is a practical choice for downstream analysis, not a STAR default;
# it may be unavailable in releases as old as v2.2.0c, which write SAM output.
STAR \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READS_FILE}" \
  --readFilesCommand zcat \
  --runThreadN "${THREADS}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes Standard
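After alignment, STAR writes a summary file named Log.final.out under the output prefix, and its "Uniquely mapped reads %" line is a quick sanity check before moving on. The following is a minimal sketch, not part of the published pipeline: the function names `unique_rate` and `make_demo_log` are hypothetical, and the log excerpt is made up to mimic the layout of that file.

```shell
#!/usr/bin/env bash
# unique_rate [LOGFILE]: print the "Uniquely mapped reads %" value
# without the percent sign. Reads stdin when no file is given.
unique_rate() {
  awk -F'|' '/Uniquely mapped reads %/ {
    gsub(/[ \t%]/, "", $2)   # strip padding, tabs, and the percent sign
    print $2
  }' "$@"
}

# Made-up excerpt in the layout of STAR's Log.final.out (illustration only).
make_demo_log() {
  printf '          Number of input reads |\t1000000\n'
  printf '   Uniquely mapped reads number |\t934500\n'
  printf '        Uniquely mapped reads %% |\t93.45%%\n'
}

make_demo_log | unique_rate   # prints 93.45
```

On real output the call would be `unique_rate star_output/Log.final.out`, assuming the output prefix used in the example above.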
3
Only uniquely alignable reads were used for downstream analysis.
$ Bash example
# Install STAR if not already installed
# conda install -c bioconda star

# Create a directory for STAR output
mkdir -p star_output

# Align reads and keep only uniquely mapping reads.
# Replace /path/to/STAR_genome_index with the actual path to your STAR genome index.
# Replace reads_R1.fastq.gz and reads_R2.fastq.gz with your input FASTQ files.
# Adjust --runThreadN based on available CPU cores.
# Adjust --limitBAMsortRAM based on available RAM (30000000000 bytes is 30 GB).
# Only --outFilterMultimapNmax 1 enforces uniqueness. The remaining filter and intron
# options are ENCODE-style settings, not STAR defaults, and are not stated in the source.
STAR \
  --runThreadN 8 \
  --genomeDir /path/to/STAR_genome_index \
  --readFilesIn reads_R1.fastq.gz reads_R2.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix star_output/aligned_unique_reads_ \
  --outSAMtype BAM SortedByCoordinate \
  --outFilterMultimapNmax 1 \
  --outFilterType BySJout \
  --outFilterMismatchNmax 10 \
  --outFilterMismatchNoverLmax 0.04 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --limitBAMsortRAM 30000000000
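If the alignment itself was run with default parameters, as the source states, uniqueness can instead be enforced after the fact. The sketch below assumes STAR's default MAPQ convention, in which uniquely mapped reads receive MAPQ 255, and filters SAM text on that column. The function names `keep_unique` and `make_demo_sam` are hypothetical, and the SAM records are made up for illustration.

```shell
#!/usr/bin/env bash
# keep_unique [SAMFILE]: keep header lines plus alignments whose MAPQ
# (column 5) is 255, the value STAR assigns to unique reads by default.
# Reads stdin when no file is given.
keep_unique() {
  awk -F'\t' '/^@/ || $5 == 255' "$@"
}

# Made-up SAM records: r1 and r3 map uniquely, r2 maps to two loci.
make_demo_sam() {
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'r1\t0\tchr1\t10\t255\t5M\t*\t0\t0\tACGTA\tIIIII\tNH:i:1\n'
  printf 'r2\t0\tchr1\t20\t3\t5M\t*\t0\t0\tACGTA\tIIIII\tNH:i:2\n'
  printf 'r2\t256\tchr1\t90\t3\t5M\t*\t0\t0\tACGTA\tIIIII\tNH:i:2\n'
  printf 'r3\t16\tchr1\t40\t255\t5M\t*\t0\t0\tTACGT\tIIIII\tNH:i:1\n'
}

# Count the alignments that survive the filter.
make_demo_sam | keep_unique | grep -vc '^@'   # prints 2
```

For BAM input, the same filter can be fed from `samtools view -h`, assuming samtools is installed.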
4
Reads Per Kilobase of exon per million mapped reads (RPKM) were calculated using HOMER (http://homer.salk.edu/homer/) across the entire gene (both introns and exons).
HOMER v4.11
$ Bash example
# Install HOMER (if not already installed)
# Download from http://homer.salk.edu/homer/software/install.html
# Or, if available via a package manager:
# conda install -c bioconda homer

# Define input and output paths
INPUT_BAM="path/to/your/aligned_reads.bam"
OUTPUT_TAG_DIR="homer_tag_directory"
GENOME_BUILD="hg19"  # The source aligned to hg19; HOMER must have this genome installed.
OUTPUT_RPKM_FILE="gene_rpkm_quantification.txt"

# Step 1: Create a HOMER tag directory from the BAM file.
# This converts the aligned reads into the format HOMER uses for counting.
# -format sam: SAM-style input; HOMER reads BAM through samtools, which must be on the PATH.
makeTagDirectory "${OUTPUT_TAG_DIR}" "${INPUT_BAM}" -format sam

# Step 2: Calculate RPKM per gene with analyzeRepeats.pl.
# The source does not name the HOMER command it used; analyzeRepeats.pl, HOMER's
# RNA-seq quantification tool, is assumed here.
# rna: use HOMER's built-in gene annotation for the chosen genome. The source reports
#      Ensembl IDs, so a custom GTF file can be given in its place.
# -count genes: count reads across the whole gene body (introns and exons), as the source describes.
# -rpkm: report reads per kilobase per million mapped reads.
# -d: the tag directory created in Step 1.
analyzeRepeats.pl rna "${GENOME_BUILD}" -count genes -rpkm -d "${OUTPUT_TAG_DIR}" > "${OUTPUT_RPKM_FILE}"
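The RPKM value itself is simple arithmetic: read count times 10^9, divided by the product of the region length in base pairs and the total number of mapped reads. The sketch below works one example by hand. It assumes the length is the full gene span, matching the source's "entire gene (both introns and exons)"; the function name `rpkm` and the input numbers are hypothetical, and HOMER's exact normalization may differ in detail.

```shell
#!/usr/bin/env bash
# rpkm READS LENGTH_BP TOTAL_MAPPED_READS
# Prints reads per kilobase per million mapped reads, to two decimals.
rpkm() {
  awk -v n="$1" -v len="$2" -v total="$3" \
    'BEGIN { printf "%.2f\n", n * 1e9 / (len * total) }'
}

# 500 reads on a 2,000 bp gene in a library of 10 million mapped reads:
# 500 * 1e9 / (2000 * 1e7) = 25
rpkm 500 2000 10000000   # prints 25.00
```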
Raw Source Text
Illumina Casava1.7 software used for basecalling.
Reads were aligned to the human genome (hg19) with STAR (v2.2.0c) with default parameters.
Only uniquely alignable reads were used for downstream analysis
Reads Per Kilobase of exon per milion mapped reads were calculated using HOMER (http://homer.salk.edu/homer/) across the entire gene (both introns and exons)
Genome_build: hg19
Supplementary_files_format_and_content: Tab delimited text file, Column Def: 1-ensembl ID, 2-chr, 3-gene start, 4-gene end, 5-strand, 6-gene length, 7-copies in genome, 8-gene name, 9-chr band, 10-gene type, 11-status, 12-aliases, 13-18: RPKM values
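The column layout described for the supplementary file (annotation in columns 1 to 12, RPKM values in columns 13 to 18) can be read with a short awk filter. This is a minimal sketch under stated assumptions: the file is assumed to start with one header row, the function names `mean_rpkm` and `make_demo_table` are hypothetical, and the demo row is made up rather than taken from the dataset.

```shell
#!/usr/bin/env bash
# mean_rpkm [TABLE]: print Ensembl ID (col 1), gene name (col 8), and the
# mean of the six RPKM columns (13-18). Reads stdin when no file is given.
mean_rpkm() {
  awk -F'\t' '
    NR == 1 { next }   # skip the header row (assumed to be present)
    {
      sum = 0
      for (i = 13; i <= 18; i++) sum += $i
      printf "%s\t%s\t%.2f\n", $1, $8, sum / 6
    }' "$@"
}

# Made-up two-line table following the 18-column layout (illustration only).
make_demo_table() {
  printf 'id\tchr\tstart\tend\tstrand\tlength\tcopies\tname\tband\ttype\tstatus\taliases\ts1\ts2\ts3\ts4\ts5\ts6\n'
  printf 'ENSG_DEMO1\tchr1\t100\t2100\t+\t2000\t1\tGENE1\t1p36\tlincRNA\tKNOWN\t-\t1\t2\t3\t4\t5\t6\n'
}

make_demo_table | mean_rpkm   # prints ENSG_DEMO1, GENE1, 3.50 (tab-separated)
```

If the real file has no header row, the `NR == 1 { next }` rule should be dropped.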