GSE67040 Processing Pipeline

Publication

The splicing factor RBM17 drives leukemic stem cell maintenance by evading nonsense-mediated decay of pro-leukemic factors.

Nature Communications (2022), PMID 35781533

Dataset

GSE67040

Leucegene: AML sequencing

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Basecalls performed using CASAVA 1.8

CASAVA v1.8

$ Bash example

# CASAVA 1.8 is a legacy Illumina proprietary software suite. Base calling itself is normally done
# on the instrument by RTA; CASAVA's bcl2fastq component converts the resulting BCL files to FASTQ
# and demultiplexes them. It typically runs on a dedicated analysis server.
# The conversion is configured with configureBclToFastq.pl, which writes a Makefile to the output
# directory; running 'make' in that directory performs the actual conversion.
# This command is a generic example; actual paths and parameters depend on the specific run and setup.

# Example BCL-to-FASTQ conversion and demultiplexing (all paths are placeholders):
configureBclToFastq.pl \
    --input-dir /path/to/illumina/run/folder/Data/Intensities/BaseCalls \
    --output-dir /path/to/output/fastq/files \
    --mismatches 1
# cd /path/to/output/fastq/files && make
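
Once the conversion has finished, a quick sanity check is to count the reads in each FASTQ file and compare the totals with the run's demultiplexing summary. The sketch below is not part of the original pipeline: `count_fastq_reads` is a hypothetical helper name, and it assumes standard four-line FASTQ records.

```shell
# count_fastq_reads FILE
# Prints the number of reads in a plain or gzip-compressed FASTQ file,
# assuming every record spans exactly four lines.
count_fastq_reads() {
    case "$1" in
        *.gz) gzip -cd -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | awk 'END { print NR / 4 }'
}

# Usage (path is a placeholder):
# count_fastq_reads /path/to/output/fastq/files/sample_R1.fastq.gz
```

A fractional result indicates a truncated or malformed file, since the line count is then not a multiple of four.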

Sequenced reads were mapped to the reference genome using CASAVA 1.8

CASAVA v1.8

$ Bash example

# CASAVA 1.8 was an integrated Illumina software suite for BCL conversion, demultiplexing, alignment, and variant calling.
# Read mapping was performed by its ELAND aligner as one stage of the larger workflow,
# so it is configured through CASAVA's scripts rather than run as a single standalone command.

# Placeholder for the reference genome. The GEO record for this dataset states Genome_build: hg19.
REFERENCE_GENOME="path/to/hg19.fasta"

# Placeholder for input FASTQ files directory.
# CASAVA often processed a directory containing FASTQ files for multiple samples.
FASTQ_INPUT_DIR="path/to/sequenced_reads_fastq_directory"

# Placeholder for the output directory where CASAVA will place its results, including aligned BAM files.
CASAVA_OUTPUT_DIR="path/to/casava_output_directory"

# Placeholder for a CASAVA configuration file (e.g., XML or a custom format).
# This file would specify parameters for the build, including alignment settings.
CASAVA_BUILD_CONFIG="path/to/casava_build_configuration.xml"

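# NOTE: the command below is a schematic placeholder, not a verified CASAVA 1.8 invocation.
# In CASAVA 1.8, alignment was configured with 'configureAlignment.pl', while 'configureBuild.pl'
# handled the post-alignment steps (variant detection and read counting).
# The option names shown here are illustrative only; take the real ones from the CASAVA 1.8 User Guide.
# Both scripts generate a Makefile and other necessary files in their output directory.
configureBuild.pl \
    --fastq-dir "${FASTQ_INPUT_DIR}" \
    --output-dir "${CASAVA_OUTPUT_DIR}" \
    --genome "${REFERENCE_GENOME}" \
    --config "${CASAVA_BUILD_CONFIG}"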

# After configuration, one would typically navigate to the generated output directory
# and run 'make' to execute the pipeline, which includes the alignment step.
# cd ${CASAVA_OUTPUT_DIR}
# make

Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using CASAVA 1.8

RSEM v1.3.3 (inferred with models/gemini-2.5-flash)

$ Bash example

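# NOTE: RSEM is a stand-in here. The original step computed RPKM inside CASAVA 1.8;
# RSEM reports TPM and FPKM instead.
# Install RSEM (if not already installed)
# conda install -c bioconda rsem
# conda install -c bioconda star # Optional: RSEM aligns with Bowtie by default and uses STAR only when run with --star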

# Define reference genome and annotation. The GEO record states Genome_build: hg19, so GRCh37 with
# GENCODE v19 is used here as a placeholder; the annotation the authors used is not stated.
# Taking both files from GENCODE keeps chromosome names ("chr1", ...) consistent between FASTA and GTF.
# Download GRCh37 genome FASTA
# wget -O GRCh37.p13.genome.fa.gz "http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz"
# gunzip GRCh37.p13.genome.fa.gz
# Download GENCODE v19 annotation GTF
# wget -O gencode.v19.annotation.gtf.gz "http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"
# gunzip gencode.v19.annotation.gtf.gz

GENOME_FASTA="GRCh37.p13.genome.fa"
GTF_FILE="gencode.v19.annotation.gtf"
RSEM_REF_DIR="rsem_ref_GRCh37_gencode19"

# Build RSEM reference (this step needs to be done once)
# rsem-prepare-reference --gtf "${GTF_FILE}" "${GENOME_FASTA}" "${RSEM_REF_DIR}"

# Input BAM file. RSEM expects reads aligned to the transcriptome (for example STAR's
# --quantMode TranscriptomeSAM output), not a genome-coordinate BAM.
# Replace 'sample.aligned.bam' with the actual path to your transcriptome-aligned BAM file.
INPUT_BAM="sample.aligned.bam"

# Output directory and prefix for RSEM results
OUTPUT_DIR="rsem_quantification"
SAMPLE_PREFIX="sample_rpkm"

mkdir -p "${OUTPUT_DIR}"

# Run RSEM quantification
# --alignments: Input is a SAM/BAM/CRAM alignment file (older RSEM releases used --sam/--bam)
# --no-qualities: Tells RSEM the reads carry no quality scores; remove this flag if the BAM retains them
# --strandedness none: Assumes unstranded library. Adjust to 'forward' or 'reverse' if known.
# --paired-end: Not set here; add it if the library is paired-end
# -p 8: Use 8 threads for parallel processing
# The output files will be in ${OUTPUT_DIR} with prefix ${SAMPLE_PREFIX}
# RSEM does not report RPKM. In both .genes.results and .isoforms.results, column 6 is TPM and column 7 is FPKM.
# FPKM coincides with RPKM for single-end reads.

rsem-calculate-expression \
    --alignments \
    --no-qualities \
    --strandedness none \
    -p 8 \
    "${INPUT_BAM}" \
    "${RSEM_REF_DIR}" \
    "${OUTPUT_DIR}/${SAMPLE_PREFIX}"

echo "FPKM values are in column 7 (TPM in column 6) of ${OUTPUT_DIR}/${SAMPLE_PREFIX}.genes.results and ${OUTPUT_DIR}/${SAMPLE_PREFIX}.isoforms.results."
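
For reference, the RPKM named in this step follows the usual definition: reads mapped to a gene, divided by the gene's exonic length in kilobases and by the library size in millions of mapped reads. The sketch below works through that arithmetic with awk. The function name `rpkm` and the example numbers are illustrative assumptions, and CASAVA's own implementation (which also reports a base count) may differ in detail.

```shell
# rpkm READS EXON_BP TOTAL_MAPPED_READS
# RPKM = reads * 1e9 / (exonic length in bp * total mapped reads)
rpkm() {
    awk -v r="$1" -v len="$2" -v n="$3" \
        'BEGIN { printf "%.4f\n", (r * 1e9) / (len * n) }'
}

# Hypothetical example: 500 reads on a gene with 2,000 exonic bp,
# in a library of 10 million mapped reads.
rpkm 500 2000 10000000   # prints 25.0000
```

The factor of 1e9 combines the per-kilobase (1e3) and per-million-reads (1e6) scalings into a single constant.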

Raw Source Text

Basecalls performed using CASAVA 1.8
Sequenced reads were mapped to the reference genome using CASAVA 1.8
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using CASAVA 1.8
Genome_build: hg19
Supplementary_files_format_and_content: Tab-delimited text files include RPKM values for each sample. Columns are: Chromosome, Start location, Stop location, Gene, RPKM, base count.
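
Given the column layout described above (Chromosome, Start location, Stop location, Gene, RPKM, base count), the supplementary files can be filtered directly with awk. This is a minimal sketch, not part of the deposited pipeline: `filter_expressed` is a hypothetical helper, the file name in the usage line is a placeholder, and whether the files carry a header row is not stated, so rows with a non-numeric RPKM field are skipped.

```shell
# filter_expressed FILE MIN_RPKM
# Prints "Gene<TAB>RPKM" for every row whose RPKM (column 5) is at least MIN_RPKM.
# Rows with a non-numeric column 5 (for example a header line) are skipped.
filter_expressed() {
    awk -F'\t' -v min="$2" '
        $5 ~ /^[0-9.eE+-]+$/ && ($5 + 0) >= (min + 0) { print $4 "\t" $5 }
    ' "$1"
}

# Usage (file name is a placeholder): top ten genes with RPKM >= 1
# filter_expressed sample_rpkm.txt 1 | sort -k2,2nr | head
```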
