GSE157917 Processing Pipeline

RNA-Seq, 3 steps

Publication

Loss of LUC7L2 and U1 snRNP subunits shifts energy metabolism from glycolysis to OXPHOS.

Molecular Cell (2021), PMID 33852893

Dataset

GSE157917

Loss of LUC7L2 and U1 snRNP subunits shifts energy metabolism from glycolysis to OXPHOS

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    The reads were aligned with STAR (Dobin et al., 2013) to the human genome hg19 using default parameters and a two-pass approach.

    STAR v2.4.0d
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables
    # Replace with actual paths and filenames
    GENOME_DIR="/path/to/STAR_genome_index_hg19" # Directory for STAR genome index
    GENOME_FASTA="/path/to/hg19.fa" # Human genome hg19 FASTA file (e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz)
    GTF_FILE="/path/to/genes.gtf" # Gene annotation GTF file for hg19 (e.g., Gencode v19 or Ensembl GRCh37)
    READS_R1="sample_R1.fastq.gz" # Input FASTQ file for Read 1 (gzipped)
    READS_R2="sample_R2.fastq.gz" # Input FASTQ file for Read 2 (gzipped, remove if single-end)
    OUTPUT_DIR="STAR_alignment_output"
    SAMPLE_NAME="sample"
    NUM_THREADS=8 # Number of threads to use
    
    # 1. Generate STAR genome index (run once for the genome)
    # This step requires the genome FASTA and a GTF file for splice junction annotation.
    # The --sjdbOverhang parameter is typically set to (readLength - 1) or 100.
    # mkdir -p "${GENOME_DIR}"
    # STAR --runMode genomeGenerate \
    #      --genomeDir "${GENOME_DIR}" \
    #      --genomeFastaFiles "${GENOME_FASTA}" \
    #      --sjdbGTFfile "${GTF_FILE}" \
    #      --sjdbOverhang 100 \
    #      --runThreadN "${NUM_THREADS}"
    
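    # 2. Align reads with STAR using a two-pass approach and default parameters
    # Note: --twopassMode Basic runs both passes within a single sample and does not pool
    # junctions across samples (the pooled variant is shown in step 2). The flag may not be
    # available in STAR v2.4.0d; verify that your STAR version supports it.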
    mkdir -p "${OUTPUT_DIR}"
    
    # For paired-end reads:
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READS_R1}" "${READS_R2}" \
         --readFilesCommand zcat \
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_NAME}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outSAMattributes Standard \
         --twopassMode Basic
    
    # For single-end reads, modify the --readFilesIn parameter:
    # STAR --genomeDir "${GENOME_DIR}" \
    #      --readFilesIn "${READS_R1}" \
    #      --readFilesCommand zcat \
    #      --runThreadN "${NUM_THREADS}" \
    #      --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_NAME}_" \
    #      --outSAMtype BAM SortedByCoordinate \
    #      --outSAMunmapped Within \
    #      --outSAMattributes Standard \
    #      --twopassMode Basic
  2.

    Following a first pass alignment of each sample, novel splice junctions were pooled across all samples from the same cell type and incorporated into the genome annotation for a second pass alignment.

    STAR v2.7.10a (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Define variables
    GENOME_DIR="/path/to/STAR_genome_index_hg19" # hg19 index, matching the Genome_build stated in the source text
    GENOME_FASTA="/path/to/hg19.fa" # Human genome hg19 FASTA (e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz)
    GTF_FILE="/path/to/genes.gtf" # Gene annotation GTF file for hg19 (e.g., Gencode v19 or Ensembl GRCh37)
    READ_LENGTH=100 # Example read length, adjust as needed based on sequencing data
    SJDB_OVERHANG=$((READ_LENGTH - 1))
    STAR_VERSION="2.7.10a"
    
    # Placeholders for sample-specific and cell-type-specific data
    SAMPLE_ID="sample_1" # Replace with actual sample ID (e.g., SRR1234567)
    CELL_TYPE_ID="cell_type_A" # Replace with actual cell type ID (e.g., K562)
    FASTQ_R1="${SAMPLE_ID}_R1.fastq.gz" # Replace with actual path to R1 fastq
    FASTQ_R2="${SAMPLE_ID}_R2.fastq.gz" # Replace with actual path to R2 fastq
    OUTPUT_BASE_DIR="/path/to/output"
    OUTPUT_DIR="${OUTPUT_BASE_DIR}/${CELL_TYPE_ID}"
    mkdir -p "${OUTPUT_DIR}"
    
    # --- STAR Installation (if not already installed) ---
    # conda install -c bioconda star=${STAR_VERSION}
    
    # --- 1. STAR Genome Generation (Run once per reference genome) ---
    # This step creates the STAR index. It should be run before any alignments.
    # STAR --runMode genomeGenerate \
    #      --genomeDir "${GENOME_DIR}" \
    #      --genomeFastaFiles "${GENOME_FASTA}" \
    #      --sjdbGTFfile "${GTF_FILE}" \
    #      --sjdbOverhang "${SJDB_OVERHANG}" \
    #      --runThreadN 8
    
    # --- 2. First Pass Alignment for each sample ---
    # This pass aligns reads and discovers novel splice junctions for each individual sample.
    STAR --runThreadN 8 \
         --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${FASTQ_R1}" "${FASTQ_R2}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_ID}_pass1_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.05 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --sjdbScore 1 \
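         --runMode alignReads
    # No --twopassMode here: the second pass is run manually in section 4 with junctions pooled across samples.
    # STAR writes SJ.out.tab for every alignment run, so no extra flag is needed to obtain it.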
    
    # --- 3. Pool Novel Splice Junctions across all samples from the same cell type ---
    # This step must be run AFTER all first pass alignments for ALL samples of a given cell type are complete.
    # It collects all SJ.out.tab files from the first pass alignments for CELL_TYPE_ID,
    # concatenates them, and filters/sorts unique junctions to create a master list.
    # This master list is then used in the second pass alignment.
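    # Example command to combine (adjust paths as necessary).
    # SJ.out.tab has 9 columns; column 6 is 0 for unannotated (novel) junctions and 1 for annotated ones.
    # Only chr, start, end, and strand are kept so that the same junction seen in several samples collapses to one row.
    # cat "${OUTPUT_DIR}"/*_pass1_SJ.out.tab | awk 'BEGIN{OFS="\t"} $6==0 {print $1,$2,$3,$4}' | sort -u -k1,1 -k2,2n -k3,3n -k4,4n > "${OUTPUT_DIR}/${CELL_TYPE_ID}_pooled_junctions.tab"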
    POOLED_JUNCTIONS_FILE="${OUTPUT_DIR}/${CELL_TYPE_ID}_pooled_junctions.tab" # Placeholder for the combined and filtered junction file
    
    # --- 4. Second Pass Alignment for each sample, incorporating pooled junctions ---
    # This pass re-aligns reads using the refined set of pooled splice junctions for more accurate mapping.
    STAR --runThreadN 8 \
         --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${FASTQ_R1}" "${FASTQ_R2}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_ID}_pass2_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.05 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --sjdbScore 1 \
         --runMode alignReads \
         --sjdbFileChrStartEnd "${POOLED_JUNCTIONS_FILE}" # Incorporate pooled junctions for second pass
    
  3.

    Second pass gene counts derived from uniquely mapping pairs with the expected strandedness were output by STAR.

    STAR v2.7.10a (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables for paths and files
    GENOME_DIR="/path/to/STAR_genome_index_hg19" # hg19 index, matching the Genome_build stated in the source text
    GENOME_FASTA="/path/to/hg19.fa" # Human genome hg19 FASTA (e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz)
    GTF_FILE="/path/to/genes.gtf" # Gene annotation GTF file for hg19 (e.g., Gencode v19 or Ensembl GRCh37)
    READ1="sample_R1.fastq.gz"
    READ2="sample_R2.fastq.gz"
    OUTPUT_PREFIX="sample_star_output_"
    THREADS=20 # Adjust as needed
    
    # --- Reference Data Setup (run once per genome/annotation combination) ---
    # mkdir -p "${GENOME_DIR}"
    # STAR --runMode genomeGenerate \
    #      --genomeDir "${GENOME_DIR}" \
    #      --genomeFastaFiles "${GENOME_FASTA}" \
    #      --sjdbGTFfile "${GTF_FILE}" \
    #      --sjdbOverhang 100 \
    #      --runThreadN "${THREADS}"
    
    # --- Alignment and Gene Counting (Second Pass) ---
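    # The second pass reuses the junctions pooled across samples in step 2 via --sjdbFileChrStartEnd.
    # (--twopassMode Basic would instead rediscover junctions separately for each sample.)
    # 'Gene counts' are generated by --quantMode GeneCounts, outputting to ReadsPerGene.out.tab.
    # GeneCounts counts only uniquely mapping reads; --outFilterMultimapNmax 1 additionally
    # drops multimappers from the BAM output.
    # 'Expected strandedness' implies the user will select the correct column from ReadsPerGene.out.tab
    # (column 2 for unstranded, column 3 when read 1 matches the RNA strand, column 4 when read 2 does).
    # The source text reports default parameters; the explicit filter and intron settings below are
    # illustrative and can be dropped to fall back to STAR defaults.
    POOLED_JUNCTIONS_FILE="/path/to/output/cell_type_A/cell_type_A_pooled_junctions.tab" # Placeholder: pooled junction file from step 2
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ1}" "${READ2}" \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes Standard \
         --quantMode GeneCounts \
         --sjdbFileChrStartEnd "${POOLED_JUNCTIONS_FILE}" \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.05 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --sjdbOverhang 100 \
         --outSAMstrandField intronMotif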
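
The junction pooling between the first and second pass (step 2) can be sketched as a small shell function. This is a minimal sketch under stated assumptions, not the authors' script: the function name `pool_novel_junctions`, the file names, and the mock junction rows are hypothetical, and "novel" is taken to mean that column 6 of STAR's SJ.out.tab is 0 (unannotated).

```shell
#!/usr/bin/env bash
set -eu

# pool_novel_junctions OUT IN [IN ...]
# Hypothetical helper: keeps junctions that STAR marked as unannotated
# (column 6 == 0) and writes unique chr/start/end/strand rows to OUT.
pool_novel_junctions() (
    out="$1"; shift
    awk 'BEGIN { OFS = "\t" } $6 == 0 { print $1, $2, $3, $4 }' "$@" \
        | LC_ALL=C sort -u -k1,1 -k2,2n -k3,3n -k4,4n > "${out}"
)

# Mock first-pass outputs for two samples (assumed data, for illustration only).
# Columns: chr, intron start, intron end, strand (0/1/2), motif,
# annotated (0/1), unique reads, multimapping reads, max overhang.
demo_dir="$(mktemp -d)"
printf 'chr1\t100\t200\t1\t1\t0\t5\t0\t30\nchr1\t300\t400\t1\t1\t1\t9\t0\t40\n' \
    > "${demo_dir}/s1_pass1_SJ.out.tab"
printf 'chr1\t100\t200\t1\t1\t0\t2\t1\t25\nchr2\t50\t90\t2\t2\t0\t3\t0\t20\n' \
    > "${demo_dir}/s2_pass1_SJ.out.tab"

# Pool the novel junctions of all samples from one (mock) cell type.
pool_novel_junctions "${demo_dir}/pooled_junctions.tab" "${demo_dir}"/*_pass1_SJ.out.tab
cat "${demo_dir}/pooled_junctions.tab"
```

In a real run, every first-pass SJ.out.tab from one cell type would be passed in, and the resulting file given to `--sjdbFileChrStartEnd` for the second pass. Additional filters, such as a minimum number of uniquely mapped reads in column 7, are a common choice but are not stated in the source.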

Raw Source Text
The reads were aligned with STAR (Dobin et al., 2013) to the human genome hg19 using default parameters and a two-pass approach.
Following a first pass alignment of each sample, novel splice junctions were pooled across all samples from the same cell type and incorporated into the genome annotation for a second pass alignment.
Second pass gene counts derived from uniquely mapping pairs with the expected strandedness were output by STAR.
Genome_build: hg19
Supplementary_files_format_and_content: Raw gene counts for every gene and every sample
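
A raw count table for every gene and every sample can be assembled from the per-sample ReadsPerGene.out.tab files that STAR writes. The following is a minimal sketch, not the authors' code: the function name `build_count_matrix`, the file naming pattern, and the mock counts are hypothetical, and column 4 is used only as an example because the source does not state which strand column was the expected one. It also assumes all files were produced with the same annotation, so genes appear in the same order.

```shell
#!/usr/bin/env bash
set -eu

# build_count_matrix COLUMN OUT FILE [FILE ...]
# Hypothetical helper: takes one count column (2, 3, or 4) from each
# ReadsPerGene.out.tab file and joins them into a gene-by-sample table.
build_count_matrix() (
    col="$1"; out="$2"; shift 2
    tmp="$(mktemp -d)"
    # Gene IDs come from the first file; the N_* summary rows are skipped.
    awk -F '\t' '$1 !~ /^N_/ { print $1 }' "$1" > "${tmp}/matrix"
    cp "${tmp}/matrix" "${tmp}/genes_ref"
    printf 'gene_id' > "${tmp}/header"
    for f in "$@"; do
        sample="$(basename "${f}" _ReadsPerGene.out.tab)"
        # Refuse to paste columns if a file lists genes in a different order.
        awk -F '\t' '$1 !~ /^N_/ { print $1 }' "${f}" > "${tmp}/genes"
        if ! cmp -s "${tmp}/genes_ref" "${tmp}/genes"; then
            echo "gene order differs in ${f}" >&2
            exit 1
        fi
        awk -F '\t' -v c="${col}" '$1 !~ /^N_/ { print $c }' "${f}" > "${tmp}/col"
        paste "${tmp}/matrix" "${tmp}/col" > "${tmp}/next"
        mv "${tmp}/next" "${tmp}/matrix"
        printf '\t%s' "${sample}" >> "${tmp}/header"
    done
    printf '\n' >> "${tmp}/header"
    cat "${tmp}/header" "${tmp}/matrix" > "${out}"
    rm -rf "${tmp}"
)

# Mock STAR outputs for two samples (assumed data, for illustration only).
demo_dir="$(mktemp -d)"
printf 'N_unmapped\t10\t10\t10\nN_multimapping\t5\t5\t5\nN_noFeature\t3\t50\t4\nN_ambiguous\t1\t0\t1\nGENE_A\t20\t1\t19\nGENE_B\t7\t0\t7\n' \
    > "${demo_dir}/s1_ReadsPerGene.out.tab"
printf 'N_unmapped\t8\t8\t8\nN_multimapping\t4\t4\t4\nN_noFeature\t2\t40\t3\nN_ambiguous\t0\t0\t0\nGENE_A\t30\t2\t28\nGENE_B\t4\t1\t3\n' \
    > "${demo_dir}/s2_ReadsPerGene.out.tab"

build_count_matrix 4 "${demo_dir}/counts.tsv" \
    "${demo_dir}/s1_ReadsPerGene.out.tab" "${demo_dir}/s2_ReadsPerGene.out.tab"
cat "${demo_dir}/counts.tsv"
```

Comparing the N_noFeature rows across columns 3 and 4 is one informal way to judge which strand column fits a library: the column matching the true strandedness usually has far fewer reads without a feature.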