GSE157917 Processing Pipeline

RNA-Seq, 3 steps

Publication

Loss of LUC7L2 and U1 snRNP subunits shifts energy metabolism from glycolysis to OXPHOS.

Molecular Cell (2021), PMID 33852893

Dataset

Loss of LUC7L2 and U1 snRNP subunits shifts energy metabolism from glycolysis to OXPHOS

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

The reads were aligned with STAR (Dobin et al., 2013) to the human genome hg19 using default parameters and a two-pass approach.

STAR v2.4.0d

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
# Replace with actual paths and filenames
GENOME_DIR="/path/to/STAR_genome_index_hg19" # Directory for STAR genome index
GENOME_FASTA="/path/to/hg19.fa" # Human genome hg19 FASTA file (e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz)
GTF_FILE="/path/to/genes.gtf" # Gene annotation GTF file for hg19 (e.g., Gencode v19 or Ensembl GRCh37)
READS_R1="sample_R1.fastq.gz" # Input FASTQ file for Read 1 (gzipped)
READS_R2="sample_R2.fastq.gz" # Input FASTQ file for Read 2 (gzipped, remove if single-end)
OUTPUT_DIR="STAR_alignment_output"
SAMPLE_NAME="sample"
NUM_THREADS=8 # Number of threads to use

# 1. Generate STAR genome index (run once for the genome)
# This step requires the genome FASTA and a GTF file for splice junction annotation.
# The --sjdbOverhang parameter is typically set to (readLength - 1) or 100.
# mkdir -p "${GENOME_DIR}"
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles "${GENOME_FASTA}" \
#      --sjdbGTFfile "${GTF_FILE}" \
#      --sjdbOverhang 100 \
#      --runThreadN "${NUM_THREADS}"

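# 2. Align reads with STAR using a two-pass approach and default parameters
# Note: --twopassMode Basic runs both passes within a single sample. It appears to
# require STAR 2.4.1a or later, so check the manual of the installed version.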
mkdir -p "${OUTPUT_DIR}"

# For paired-end reads:
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --readFilesCommand zcat \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_NAME}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --twopassMode Basic

# For single-end reads, modify the --readFilesIn parameter:
# STAR --genomeDir "${GENOME_DIR}" \
#      --readFilesIn "${READS_R1}" \
#      --readFilesCommand zcat \
#      --runThreadN "${NUM_THREADS}" \
#      --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_NAME}_" \
#      --outSAMtype BAM SortedByCoordinate \
#      --outSAMunmapped Within \
#      --outSAMattributes Standard \
#      --twopassMode Basic
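The index-generation comment above suggests setting --sjdbOverhang to (readLength - 1). A minimal sketch of deriving that value from the data follows. The function names `max_read_length` and `suggest_sjdb_overhang` are hypothetical helpers, not part of STAR, and the sketch assumes standard four-line FASTQ records in a gzipped file.

```shell
# max_read_length FASTQ_GZ [N_READS]
# Prints the longest read among the first N_READS records (default 10000).
# Assumes four lines per FASTQ record, with the sequence on line 2 of each record.
max_read_length() {
    gzip -dc "$1" \
        | head -n "$(( ${2:-10000} * 4 ))" \
        | awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
               END { print max + 0 }'
}

# suggest_sjdb_overhang FASTQ_GZ
# Prints (longest read length - 1), the value usually passed to --sjdbOverhang.
suggest_sjdb_overhang() {
    echo "$(( $(max_read_length "$1") - 1 ))"
}

# Usage (READS_R1 as defined in the example above):
# SJDB_OVERHANG="$(suggest_sjdb_overhang "${READS_R1}")"
```

Only the first records are sampled, so the result is an estimate when read lengths vary (for example after adapter trimming).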

Following a first pass alignment of each sample, novel splice junctions were pooled across all samples from the same cell type and incorporated into the genome annotation for a second pass alignment.

STAR v2.7.10a (Inferred with models/gemini-2.5-flash)

$ Bash example

# Define variables
GENOME_DIR="/path/to/STAR_genome_index_hg19" # The study reports genome build hg19
GENOME_FASTA="/path/to/hg19.fa" # e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
GTF_FILE="/path/to/genes.gtf" # hg19-compatible annotation (e.g., Gencode v19); the annotation actually used is not stated
READ_LENGTH=100 # Example read length, adjust as needed based on sequencing data
SJDB_OVERHANG=$((READ_LENGTH - 1))
STAR_VERSION="2.7.10a"

# Placeholders for sample-specific and cell-type-specific data
SAMPLE_ID="sample_1" # Replace with actual sample ID (e.g., SRR1234567)
CELL_TYPE_ID="cell_type_A" # Replace with actual cell type ID (e.g., K562)
FASTQ_R1="${SAMPLE_ID}_R1.fastq.gz" # Replace with actual path to R1 fastq
FASTQ_R2="${SAMPLE_ID}_R2.fastq.gz" # Replace with actual path to R2 fastq
OUTPUT_BASE_DIR="/path/to/output"
OUTPUT_DIR="${OUTPUT_BASE_DIR}/${CELL_TYPE_ID}"
mkdir -p "${OUTPUT_DIR}"

# --- STAR Installation (if not already installed) ---
# conda install -c bioconda star=${STAR_VERSION}

# --- 1. STAR Genome Generation (Run once per reference genome) ---
# This step creates the STAR index. It should be run before any alignments.
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles "${GENOME_FASTA}" \
#      --sjdbGTFfile "${GTF_FILE}" \
#      --sjdbOverhang "${SJDB_OVERHANG}" \
#      --runThreadN 8

# --- 2. First Pass Alignment for each sample ---
# This pass aligns reads with default parameters and writes the splice junctions
# detected in each individual sample to <prefix>SJ.out.tab.
# --twopassMode is deliberately not set here: junctions are pooled across samples
# in step 3 rather than re-inserted per sample.
STAR --runMode alignReads \
     --runThreadN 8 \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${FASTQ_R1}" "${FASTQ_R2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_ID}_pass1_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within

# --- 3. Pool Novel Splice Junctions across all samples from the same cell type ---
# This step must be run AFTER all first pass alignments for ALL samples of a given cell type are complete.
# It concatenates every first-pass SJ.out.tab file for CELL_TYPE_ID, keeps the junctions
# that are absent from the annotation (column 6 == 0), and de-duplicates them by
# chromosome, start, end, and strand (columns 1-4).
# Any further filtering (e.g., by read support in column 7) is not described in the source.
POOLED_JUNCTIONS_FILE="${OUTPUT_DIR}/${CELL_TYPE_ID}_pooled_junctions.tab"
cat "${OUTPUT_DIR}"/*_pass1_SJ.out.tab \
    | awk 'BEGIN{OFS="\t"} $6 == 0 {print $1, $2, $3, $4}' \
    | sort -u -k1,1 -k2,2n -k3,3n -k4,4n > "${POOLED_JUNCTIONS_FILE}"

# --- 4. Second Pass Alignment for each sample, incorporating pooled junctions ---
# This pass re-aligns reads with default parameters while STAR inserts the pooled
# splice junctions into the genome annotation on the fly.
STAR --runMode alignReads \
     --runThreadN 8 \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${FASTQ_R1}" "${FASTQ_R2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_ID}_pass2_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --sjdbFileChrStartEnd "${POOLED_JUNCTIONS_FILE}" # Incorporate pooled junctions for second pass
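The source does not say whether the pooled junctions were filtered before the second pass. As a hedged sketch of one possible filter, the hypothetical helper `pool_novel_junctions` below sums the uniquely mapped read support (column 7 of SJ.out.tab) for each unannotated junction across samples and keeps junctions that reach a minimum total. The threshold is an illustrative assumption, not a value from the study.

```shell
# pool_novel_junctions MIN_READS SJ_FILE...
# Prints chr, start, end, strand for unannotated junctions (column 6 == 0) whose
# uniquely mapped read support (column 7), summed over all input files, is >= MIN_READS.
pool_novel_junctions() {
    min_reads="$1"
    shift
    awk -v min="${min_reads}" 'BEGIN { FS = OFS = "\t" }
        $6 == 0 {
            key = $1 OFS $2 OFS $3 OFS $4
            support[key] += $7
        }
        END {
            for (key in support)
                if (support[key] >= min) print key
        }' "$@" | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n
}

# Usage (hypothetical threshold of 2 reads; paths follow the example above):
# pool_novel_junctions 2 "${OUTPUT_DIR}"/*_pass1_SJ.out.tab > "${POOLED_JUNCTIONS_FILE}"
```

A threshold of 1 reproduces plain pooling of every novel junction that has at least one uniquely mapped read.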

Second pass gene counts derived from uniquely mapping pairs with the expected strandedness were output by STAR.

STAR v2.7.10a (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables for paths and files
GENOME_DIR="/path/to/STAR_genome_index_hg19" # The study reports genome build hg19
GENOME_FASTA="/path/to/hg19.fa" # e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
GTF_FILE="/path/to/genes.gtf" # hg19-compatible annotation (e.g., Gencode v19); the annotation actually used is not stated
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz"
OUTPUT_PREFIX="sample_star_output_"
THREADS=20 # Adjust as needed

# --- Reference Data Setup (run once per genome/annotation combination) ---
# mkdir -p ${GENOME_DIR}
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles ${GENOME_FASTA} \
#      --sjdbGTFfile ${GTF_FILE} \
#      --sjdbOverhang 100 \
#      --runThreadN ${THREADS}

# --- Alignment and Gene Counting (Second Pass) ---
# The second pass re-aligns each sample with the novel junctions pooled from the first pass,
# supplied through --sjdbFileChrStartEnd (--twopassMode Basic is a per-sample alternative).
# 'Gene counts' are generated by --quantMode GeneCounts, outputting to ReadsPerGene.out.tab.
# 'Uniquely mapping pairs': GeneCounts counts only uniquely mapped reads, so no extra
# multimapper filter is needed and the alignment parameters stay at their defaults.
# 'Expected strandedness' implies the user will select the correct column from ReadsPerGene.out.tab
# (e.g., column 3 for forward-stranded, column 4 for reverse-stranded, column 2 for unstranded).
POOLED_JUNCTIONS_FILE="/path/to/pooled_junctions.tab" # Placeholder: junctions pooled from the first pass
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --sjdbFileChrStartEnd "${POOLED_JUNCTIONS_FILE}" \
     --quantMode GeneCounts
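Choosing the column with the expected strandedness can be sketched as follows. The first four rows of ReadsPerGene.out.tab are summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous), so both helpers skip them. The function names `strand_totals` and `extract_counts` are hypothetical, and the library type of this dataset is not stated in the source.

```shell
# strand_totals READS_PER_GENE
# Prints the summed gene counts of columns 2, 3, and 4 (unstranded, forward, reverse),
# skipping the four summary rows at the top of the file.
strand_totals() {
    awk 'BEGIN { FS = "\t" }
         NR > 4 { u += $2; f += $3; r += $4 }
         END { printf "%.0f\t%.0f\t%.0f\n", u, f, r }' "$1"
}

# extract_counts READS_PER_GENE COLUMN
# Prints gene ID and the chosen count column (2, 3, or 4) for every gene row.
extract_counts() {
    awk -v col="$2" 'BEGIN { FS = OFS = "\t" }
         NR > 4 { print $1, $col }' "$1"
}

# Usage (file name follows the OUTPUT_PREFIX of the example above):
# strand_totals sample_star_output_ReadsPerGene.out.tab
# extract_counts sample_star_output_ReadsPerGene.out.tab 4 > sample_counts.tsv
```

In a stranded library one of the column 3 or column 4 totals is typically far larger than the other, which indicates the column to extract.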

Tools Used

STAR

Raw Source Text

The reads were aligned with STAR (Dobin et al., 2013) to the human genome hg19 using default parameters and a two-pass approach.
Following a first pass alignment of each sample, novel splice junctions were pooled across all samples from the same cell type and incorporated into the genome annotation for a second pass alignment.
Second pass gene counts derived from uniquely mapping pairs with the expected strandedness were output by STAR.
Genome_build: hg19
Supplementary_files_format_and_content: Raw gene counts for every gene and every sample
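
The supplementary files are described as raw gene counts for every gene and every sample. A minimal sketch of assembling such a matrix from per-sample ReadsPerGene.out.tab files is given below. The helper name `build_count_matrix` is hypothetical, the sample name is taken from the file name, and all files are assumed to come from the same STAR index so that they list identical genes.

```shell
# build_count_matrix COLUMN READS_PER_GENE...
# Prints a tab-separated gene x sample matrix using the chosen count column (2, 3, or 4).
# Assumes every input file lists the same genes (same STAR index and annotation).
build_count_matrix() {
    count_col="$1"
    shift
    awk -v col="${count_col}" 'BEGIN { FS = OFS = "\t" }
        FNR == 1 {
            nfiles++
            name = FILENAME
            sub(/.*\//, "", name)                      # drop the directory part
            sub(/_?ReadsPerGene\.out\.tab$/, "", name) # drop the STAR suffix
            header = header OFS name
        }
        FNR > 4 {                                      # skip the four summary rows
            if (nfiles == 1) order[++n] = $1           # remember gene order from the first file
            counts[$1] = counts[$1] OFS $col
        }
        END {
            print "gene_id" header
            for (i = 1; i <= n; i++) print order[i] counts[order[i]]
        }' "$@"
}

# Usage (hypothetical file names; column 4 assumes a reverse-stranded library):
# build_count_matrix 4 sampleA_ReadsPerGene.out.tab sampleB_ReadsPerGene.out.tab > raw_counts.tsv
```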
