GSE157917 Processing Pipeline
RNA-Seq
3 steps
Publication
Loss of LUC7L2 and U1 snRNP subunits shifts energy metabolism from glycolysis to OXPHOS. Molecular Cell (2021), PMID 33852893
Dataset
GSE157917: Loss of LUC7L2 and U1 snRNP subunits shifts energy metabolism from glycolysis to OXPHOS
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
The reads were aligned with STAR (Dobin et al., 2013) to the human genome hg19 using default parameters and a two-pass approach.
STAR v2.4.0d
$ Bash example
# Install STAR (example using conda)
# conda install -c bioconda star
# Note: to the best of our knowledge, --twopassMode Basic (used below) was
# introduced after v2.4.0d, the version reported for this dataset, so a newer
# STAR release is assumed here.

# Define variables (replace with actual paths and filenames)
GENOME_DIR="/path/to/STAR_genome_index_hg19"  # Directory for the STAR genome index
GENOME_FASTA="/path/to/hg19.fa"               # Human genome hg19 FASTA (e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz)
GTF_FILE="/path/to/genes.gtf"                 # hg19 gene annotation GTF (e.g., Gencode v19 or Ensembl GRCh37); chromosome names must match the FASTA
READS_R1="sample_R1.fastq.gz"                 # Input FASTQ for read 1 (gzipped)
READS_R2="sample_R2.fastq.gz"                 # Input FASTQ for read 2 (gzipped; omit for single-end data)
OUTPUT_DIR="STAR_alignment_output"
SAMPLE_NAME="sample"
NUM_THREADS=8                                 # Number of threads to use

# 1. Generate the STAR genome index (run once per genome)
# Requires the genome FASTA and a GTF file for splice junction annotation.
# --sjdbOverhang is ideally (readLength - 1); the default of 100 works in most cases.
# mkdir -p "${GENOME_DIR}"
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles "${GENOME_FASTA}" \
#      --sjdbGTFfile "${GTF_FILE}" \
#      --sjdbOverhang 100 \
#      --runThreadN "${NUM_THREADS}"

# 2. Align reads with default parameters and a two-pass approach
# --twopassMode Basic runs both passes per sample; the pooled, per-cell-type
# variant described by the authors is covered in step 2.
mkdir -p "${OUTPUT_DIR}"

# Paired-end reads (for single-end data, pass only "${READS_R1}" to --readFilesIn):
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --readFilesCommand zcat \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_NAME}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --twopassMode Basic
2
Following a first pass alignment of each sample, novel splice junctions were pooled across all samples from the same cell type and incorporated into the genome annotation for a second pass alignment.
$ Bash example
# Define variables (placeholders; the study reports hg19, so an hg19 index is assumed)
GENOME_DIR="/path/to/STAR_genome_index_hg19"
GENOME_FASTA="/path/to/hg19.fa"   # e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
GTF_FILE="/path/to/genes.gtf"     # hg19 gene annotation (e.g., Gencode v19)
READ_LENGTH=100                   # Example read length; adjust to the sequencing data
SJDB_OVERHANG=$((READ_LENGTH - 1))
STAR_VERSION="2.7.10a"            # Assumption: a recent release; the study reports v2.4.0d
NUM_THREADS=8

# Placeholders for sample-specific and cell-type-specific data
SAMPLE_ID="sample_1"                  # Replace with the actual sample ID (e.g., SRR1234567)
CELL_TYPE_ID="cell_type_A"            # Replace with the actual cell type ID (e.g., K562)
FASTQ_R1="${SAMPLE_ID}_R1.fastq.gz"   # Replace with the actual path to the R1 FASTQ
FASTQ_R2="${SAMPLE_ID}_R2.fastq.gz"   # Replace with the actual path to the R2 FASTQ
OUTPUT_BASE_DIR="/path/to/output"
OUTPUT_DIR="${OUTPUT_BASE_DIR}/${CELL_TYPE_ID}"
mkdir -p "${OUTPUT_DIR}"

# --- STAR installation (if not already installed) ---
# conda install -c bioconda star=${STAR_VERSION}

# --- 1. STAR genome generation (run once per reference genome) ---
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles "${GENOME_FASTA}" \
#      --sjdbGTFfile "${GTF_FILE}" \
#      --sjdbOverhang "${SJDB_OVERHANG}" \
#      --runThreadN "${NUM_THREADS}"

# --- 2. First-pass alignment (run for every sample) ---
# Default alignment parameters, as stated in the source. --twopassMode is NOT
# set here: it would run a per-sample second pass, whereas the junctions are
# pooled across samples before the second pass. Each run writes SJ.out.tab.
STAR --runMode alignReads \
     --runThreadN "${NUM_THREADS}" \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${FASTQ_R1}" "${FASTQ_R2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_ID}_pass1_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within

# --- 3. Pool novel splice junctions across all samples of the same cell type ---
# Run only AFTER the first pass has finished for ALL samples of CELL_TYPE_ID.
# In SJ.out.tab, column 6 is 0 for junctions absent from the annotation (novel);
# columns 1-4 are chromosome, intron start, intron end, and strand.
# Example command (adjust paths as necessary):
# cat "${OUTPUT_DIR}"/*_pass1_SJ.out.tab \
#   | awk 'BEGIN{OFS="\t"} $6==0 {print $1,$2,$3,$4}' \
#   | sort -u -k1,1 -k2,2n -k3,3n -k4,4 \
#   > "${OUTPUT_DIR}/${CELL_TYPE_ID}_pooled_junctions.tab"
POOLED_JUNCTIONS_FILE="${OUTPUT_DIR}/${CELL_TYPE_ID}_pooled_junctions.tab"  # Combined junction file

# --- 4. Second-pass alignment (run for every sample) ---
# Re-aligns the reads with the pooled junctions inserted into the index on the fly.
STAR --runMode alignReads \
     --runThreadN "${NUM_THREADS}" \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${FASTQ_R1}" "${FASTQ_R2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_ID}_pass2_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --sjdbFileChrStartEnd "${POOLED_JUNCTIONS_FILE}"
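The pooling step described in step 2 can be sketched as a small shell function. This is a minimal sketch under stated assumptions, not the authors' script: the function name pool_novel_junctions and the per-file minimum-unique-read threshold are hypothetical (the source mentions no read-support filter), and the column meanings follow the SJ.out.tab layout documented in the STAR manual.

```shell
# Hypothetical helper: pool novel junctions from first-pass SJ.out.tab files.
# SJ.out.tab columns: 1 chromosome, 2 intron start, 3 intron end, 4 strand,
# 5 intron motif, 6 annotated flag (0 = not in the annotation, i.e. novel),
# 7 uniquely mapping reads, 8 multi-mapping reads, 9 maximum overhang.
pool_novel_junctions() {
    local min_unique="$1"   # assumed threshold, applied per input file
    shift
    local tab
    tab="$(printf '\t')"
    cat "$@" \
        | awk -v min="${min_unique}" 'BEGIN { FS = OFS = "\t" }
              $6 == 0 && $7 >= min { print $1, $2, $3, $4 }' \
        | LC_ALL=C sort -t "${tab}" -u -k1,1 -k2,2n -k3,3n -k4,4
}

# Example call (variable names follow the step 2 snippet):
# pool_novel_junctions 1 "${OUTPUT_DIR}"/*_pass1_SJ.out.tab > "${POOLED_JUNCTIONS_FILE}"
```

Deduplicating on all four columns keeps junctions that share a start coordinate but differ in end or strand. Setting the threshold to 0 keeps every novel junction STAR reported.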
3
Second pass gene counts derived from uniquely mapping pairs with the expected strandedness were output by STAR.
$ Bash example
# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables for paths and files (placeholders; the study reports hg19)
GENOME_DIR="/path/to/STAR_genome_index_hg19"
GENOME_FASTA="/path/to/hg19.fa"
GTF_FILE="/path/to/genes.gtf"   # hg19 gene annotation; --quantMode GeneCounts needs an annotated index
POOLED_JUNCTIONS_FILE="/path/to/cell_type_A_pooled_junctions.tab"  # Pooled first-pass junctions (step 2)
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz"
OUTPUT_PREFIX="sample_star_output_"
THREADS=20   # Adjust as needed

# --- Reference data setup (run once per genome/annotation combination) ---
# mkdir -p "${GENOME_DIR}"
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles "${GENOME_FASTA}" \
#      --sjdbGTFfile "${GTF_FILE}" \
#      --sjdbOverhang 100 \
#      --runThreadN "${THREADS}"

# --- Second-pass alignment with gene counting ---
# The second pass uses the pooled junctions via --sjdbFileChrStartEnd
# (--twopassMode Basic would be the per-sample alternative).
# Gene counts come from --quantMode GeneCounts, written to ReadsPerGene.out.tab.
# GeneCounts already counts only uniquely mapping reads, so the default
# alignment filters are left unchanged, as stated in the source.
# "Expected strandedness" means picking the matching column of ReadsPerGene.out.tab:
# column 2 for unstranded, column 3 for forward-stranded, column 4 for reverse-stranded.
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --sjdbFileChrStartEnd "${POOLED_JUNCTIONS_FILE}" \
     --quantMode GeneCounts
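Selecting the counts "with the expected strandedness" amounts to choosing one column of ReadsPerGene.out.tab. The sketch below is illustrative only: the function names extract_gene_counts and summarize_strand_columns are hypothetical, and it assumes the layout documented in the STAR manual (four summary rows starting with N_, then one row per gene with columns for unstranded, forward, and reverse counts). The source does not say which column matched the library, so the column is left as a parameter.

```shell
# Hypothetical helper: print gene ID and one count column from
# ReadsPerGene.out.tab, skipping the four N_* summary rows at the top.
# column: 2 = unstranded, 3 = forward-stranded, 4 = reverse-stranded.
extract_gene_counts() {
    local counts_file="$1"
    local column="$2"
    awk -v col="${column}" 'BEGIN { FS = OFS = "\t" }
        NR > 4 { print $1, $col }' "${counts_file}"
}

# Hypothetical helper: total the three count columns over all genes.
# For a stranded library, one of forward/reverse is expected to dominate.
summarize_strand_columns() {
    awk 'BEGIN { FS = "\t" }
         NR > 4 { u += $2; f += $3; r += $4 }
         END { printf "unstranded=%d\tforward=%d\treverse=%d\n", u, f, r }' "$1"
}

# Example calls (file name follows the step 3 snippet):
# summarize_strand_columns sample_star_output_ReadsPerGene.out.tab
# extract_gene_counts sample_star_output_ReadsPerGene.out.tab 4 > sample_counts.tsv
```

Comparing the column totals first is a common sanity check before committing to a column; it is a heuristic, and the library preparation protocol remains the authoritative source for strandedness.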
Tools Used
Raw Source Text
The reads were aligned with STAR (Dobin et al., 2013) to the human genome hg19 using default parameters and a two-pass approach. Following a first pass alignment of each sample, novel splice junctions were pooled across all samples from the same cell type and incorporated into the genome annotation for a second pass alignment. Second pass gene counts derived from uniquely mapping pairs with the expected strandedness were output by STAR.
Genome_build: hg19
Supplementary_files_format_and_content: Raw gene counts for every gene and every sample