GSE85908 Processing Pipeline
Publication
Single-Cell Alternative Splicing Analysis with Expedition Reveals Splicing Dynamics during Neuron Differentiation.Molecular cell (2017) — PMID 28673540
Dataset
GSE85908Single-cell alternative splicing analysis with Expedition reveals splicing dynamics during neuron differentiation
Processing Steps
Generate Jupyter Notebook-
1
Reads were trimmed of adapter sequences TCGTATGCCGTCTTCTGCTTG, ATCTCGTATGCCGTCTTCTGCTTG, CGACAGGTTCAGAGTTCTACAGTCCGACGATC GATCGGAAGAGCACACGTCTGAACTCCAGTCAC using cutadapt v1.8.1
$ Bash example
# Install cutadapt (version 1.8.1 is quite old, consider using a virtual environment or specific package manager for older versions) # conda create -n cutadapt_1_8_1 cutadapt=1.8.1 # conda activate cutadapt_1_8_1 # Placeholder for input and output files. Adjust as needed. # If reads are paired-end, use -A for the reverse read adapter and -o/-p for paired output. INPUT_READS="input_reads.fastq.gz" OUTPUT_TRIMMED_READS="trimmed_reads.fastq.gz" # Adapter sequences identified from the description ADAPTER1="TCGTATGCCGTCTTCTGCTTG" ADAPTER2="ATCTCGTATGCCGTCTTCTGCTTG" ADAPTER3="CGACAGGTTCAGAGTTCTACAGTCCGACGATC" ADAPTER4="GATCGGAAGAGCACACGTCTGAACTCCAGTCAC" # Execute cutadapt to trim the specified adapter sequences # Assuming these are 3' adapters (-a). If they were 5' adapters, -g would be used. cutadapt -a "${ADAPTER1}" \ -a "${ADAPTER2}" \ -a "${ADAPTER3}" \ -a "${ADAPTER4}" \ -o "${OUTPUT_TRIMMED_READS}" \ "${INPUT_READS}" -
2
Trimmed reads were aligned to RepBase v18.05 using STAR v2.4.01
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.4.01 # --- Reference Preparation: RepBase v18.05 STAR Index --- # RepBase v18.05 is a database of repetitive elements. You would typically download # the FASTA file from the Genetic Information Research Institute (GIRI). # The exact download URL might require registration or vary, this is a placeholder. # For example: # mkdir -p reference/repbase_v18.05 # cd reference/repbase_v18.05 # wget -O repbase_v18.05.fasta.gz "http://www.girinst.org/repbase/update/RepBase_v18.05.fasta.gz" # Placeholder URL # gunzip repbase_v18.05.fasta.gz # cd ../../ # Define paths for reference files FASTA_FILE="reference/repbase_v18.05/repbase_v18.05.fasta" # Path to your downloaded RepBase FASTA GENOME_DIR="reference/repbase_v18.05_star_index" # Create directory for STAR index mkdir -p "${GENOME_DIR}" # Build STAR genome index for RepBase v18.05 # Note: For non-spliced alignment to a collection of sequences like RepBase, # STAR's default parameters for genome generation are usually sufficient. # The --sjdbGTFfile parameter is not needed here as there are no introns. STAR --runMode genomeGenerate \ --genomeDir "${GENOME_DIR}" \ --genomeFastaFiles "${FASTA_FILE}" \ --runThreadN 8 # Adjust number of threads as appropriate # --- Alignment Step --- # Define input and output paths INPUT_READS="trimmed_reads.fastq.gz" # Replace with your actual trimmed reads file OUTPUT_PREFIX="aligned_to_repbase" # Prefix for output files # Run STAR alignment # Parameters chosen for alignment to repetitive elements: # --outFilterMultimapNmax 100: Allows reads to map to up to 100 locations, common for repetitive elements. # --alignIntronMax 1: Effectively disables intron searching, as RepBase sequences do not contain introns. # --alignEndsType Local: Local alignment can be useful for partial matches to repetitive elements. STAR --runThreadN 8 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${INPUT_READS}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}_" \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outFilterMultimapNmax 100 \ --outFilterMismatchNmax 10 \ --alignIntronMax 1 \ --alignEndsType Local -
3
Reads that didn't align to RepBase were aligned to hg19 using STAR v2.4.01 and a splice junction database derived from GENCODE v19
$ Bash example
# Install STAR if not already installed # conda install -c bioconda star=2.4.01 # Define paths for reference genome index and input reads # The STAR genome index for hg19 and GENCODE v19 annotations # should have been pre-built using a command similar to: # STAR --runMode genomeGenerate \ # --genomeDir /path/to/star_index/hg19_gencode_v19 \ # --genomeFastaFiles /path/to/hg19.fa \ # --sjdbGTFfile /path/to/gencode.v19.annotation.gtf \ # --sjdbOverhang 100 # A common value, adjust as needed based on read length GENOME_DIR="/path/to/star_index/hg19_gencode_v19" # Placeholder for the pre-built STAR index INPUT_READS="unaligned_to_repbase.fastq.gz" # Placeholder for the FASTQ file containing reads that didn't align to RepBase OUTPUT_PREFIX="aligned_to_hg19" # Prefix for output files # Align reads to hg19 using STAR with the GENCODE v19-derived splice junction database STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${INPUT_READS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --runThreadN 8 # Example: use 8 threads, adjust as needed # Other common parameters might include: # --outSAMtype BAM SortedByCoordinate # Output sorted BAM # --outSAMunmapped Within # Keep unmapped reads in BAM # --outFilterMultimapNmax 20 # Max number of loci the read is allowed to map to # --outFilterMismatchNmax 999 # Max number of mismatches per read # --outFilterMismatchNoverLmax 0.04 # Max fraction of mismatches per read length # --quantMode GeneCounts # If gene quantification is desired -
4
Transcripts per million (TPM) per transcript was calculated from reads that didn't align to RepBase were quasi-aligned to GENCODE v19 protein coding and lincRNA transcripts using Sailfish v0.6.3
$ Bash example
# Install Sailfish # conda install -c bioconda sailfish=0.6.3 # Placeholder for Sailfish index built from GENCODE v19 protein coding and lincRNA transcripts. # This index would have been created using a command like: # sailfish index -t <path_to_gencode_v19_pc_lincRNA_transcripts.fa> -o gencode_v19_pc_lincRNA_sailfish_index -k 25 SAILFISH_INDEX="gencode_v19_pc_lincRNA_sailfish_index" # Placeholder for input reads (reads that didn't align to RepBase) INPUT_READS="unaligned_reads.fastq" # Output directory for Sailfish results OUTPUT_DIR="sailfish_tpm_output" # Run Sailfish quantification # -i: Path to the Sailfish index # -r: Input FASTQ file (assuming single-end reads based on common usage for Sailfish v0.6.3 without explicit mention) # -o: Output directory # -l U: Library type (U for unstranded, single-end reads). Adjust if paired-end or stranded. sailfish quant -i "${SAILFISH_INDEX}" -r "${INPUT_READS}" -o "${OUTPUT_DIR}" -l U -
5
TPM per gene was calculated by summing the TPM for all transcripts of a gene
$ Bash example
# Install RSEM (example using conda) # conda install -c bioconda rsem # --- Reference Preparation (run once) --- # Download human GRCh38 genome FASTA and Ensembl GTF (release 109 as example) # mkdir -p rsem_ref # cd rsem_ref # wget ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # wget ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz # gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # gunzip Homo_sapiens.GRCh38.109.gtf.gz # GENOME_FASTA="Homo_sapiens.GRCh38.dna.primary_assembly.fa" # GTF_FILE="Homo_sapiens.GRCh38.109.gtf" # RSEM_REFERENCE_INDEX="Homo_sapiens_GRCh38_Ensembl109_rsem_index" # Build RSEM reference (this can take a long time and significant memory) # rsem-prepare-reference \ # --gtf ${GTF_FILE} \ # --num-threads 8 \ # ${GENOME_FASTA} \ # ${RSEM_REFERENCE_INDEX} # cd .. # --- End Reference Preparation --- # Define variables for expression calculation INPUT_BAM="path/to/your/aligned_reads.bam" # Replace with actual input BAM file (e.g., from STAR alignment) RSEM_REFERENCE_INDEX="path/to/your/rsem_ref/Homo_sapiens_GRCh38_Ensembl109_rsem_index" # Path to your RSEM reference index SAMPLE_NAME="your_sample_name" # Replace with your sample name NUM_THREADS=8 # Number of threads to use # Execute RSEM to calculate expression # This command will output gene-level TPMs in ${SAMPLE_NAME}.genes.results # RSEM calculates gene-level TPMs by summing the estimated TPMs for all transcripts of a gene. # --bam: Input is a BAM file # --no-qualities: Often needed if BAM is from STAR and quality scores are not standard # --paired-end: Use if reads are paired-end (remove if single-end) # --num-threads: Number of threads # --output-genome-bam: Output a genome-aligned BAM file (optional, can be large) rsem-calculate-expression \ --bam \ --no-qualities \ --paired-end \ --num-threads ${NUM_THREADS} \ ${INPUT_BAM} \ ${RSEM_REFERENCE_INDEX} \ ${SAMPLE_NAME} # To extract the gene-level TPMs: # The 'TPM' column in the ${SAMPLE_NAME}.genes.results file contains the desired values. # For example, to view the first few lines of gene results: # head ${SAMPLE_NAME}.genes.results # To extract just the gene_id and TPM columns: # awk 'NR > 1 {print $1, $6}' ${SAMPLE_NAME}.genes.results > ${SAMPLE_NAME}.gene_tpm.tsv -
6
Percent spliced-in (Psi) was calculated from junction read output from STAR aligner (SJ.out.tab) were used to create a de novo splicing index in conjunction with GENCODE v19 gene annotations, using outrigger
$ Bash example
# Install outrigger (example using pip) # pip install outrigger # STAR aligner (version 2.7.10a or similar) is assumed to have been run upstream # to generate the SJ.out.tab file. # Define variables GENCODE_VERSION="v19" GENOME_ASSEMBLY="hg19" # Inferred from GENCODE v19 GENCODE_GTF="gencode.${GENCODE_VERSION}.annotation.gtf" GENOME_FASTA="${GENOME_ASSEMBLY}.fa" STAR_SJ_OUT_TAB="sample_SJ.out.tab" # Placeholder for STAR's junction output OUTRIGGER_INDEX_DIR="outrigger_index_${GENOME_ASSEMBLY}_${GENCODE_VERSION}" OUTPUT_PSI_FILE="sample_psi.csv" # Download GENCODE v19 GTF (if not already present) # wget -O "${GENCODE_GTF}.gz" ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz # gunzip -f "${GENCODE_GTF}.gz" # Download hg19 genome FASTA (if not already present, required for outrigger index) # wget -O "${GENOME_FASTA}.gz" http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz # gunzip -f "${GENOME_FASTA}.gz" # 1. Create a de novo splicing index using outrigger # This step uses the GENCODE v19 gene annotations (GTF) and the corresponding genome FASTA. outrigger index \ --gtf "${GENCODE_GTF}" \ --fasta "${GENOME_FASTA}" \ --output-folder "${OUTRIGGER_INDEX_DIR}" # 2. Calculate Percent spliced-in (Psi) values # This step uses the STAR junction read output (SJ.out.tab) and the previously created outrigger index. outrigger psi \ --outrigger-index "${OUTRIGGER_INDEX_DIR}" \ --sj-out-tab "${STAR_SJ_OUT_TAB}" \ --output-file "${OUTPUT_PSI_FILE}" -
7
Outlier cells were calculated using K-means clustering with k=3 on the expression matrix and omitting cells which clustered into a group consisting of a majority of other cell types
scikit-learn (Inferred with models/gemini-2.5-flash) v1.2.2$ Bash example
# Install necessary libraries # pip install scikit-learn pandas numpy # Create dummy input files for demonstration if they don't exist python3 -c " import pandas as pd import numpy as np import os if not os.path.exists('expression_matrix.csv'): np.random.seed(42) expression_data = np.random.rand(100, 50) * 100 expression_df = pd.DataFrame(expression_data, columns=[f'gene_{i}' for i in range(50)]) expression_df.index.name = 'cell_id' expression_df.to_csv('expression_matrix.csv') print('Created dummy expression_matrix.csv') if not os.path.exists('cell_types.csv'): cell_types = np.random.choice(['TypeA', 'TypeB', 'TypeC', 'TypeD'], size=100, p=[0.4, 0.3, 0.2, 0.1]) cell_types_df = pd.DataFrame({'cell_type': cell_types}, index=expression_df.index if 'expression_df' in locals() else pd.RangeIndex(start=0, stop=100, name='cell_id')) cell_types_df.index.name = 'cell_id' cell_types_df.to_csv('cell_types.csv') print('Created dummy cell_types.csv') " # Execute the Python script for K-means clustering and outlier filtering python3 -c " import pandas as pd import numpy as np from sklearn.cluster import KMeans # Load expression matrix (e.g., from 'expression_matrix.csv') # This file is expected to have cell IDs as index and gene expression values as columns. expression_matrix = pd.read_csv('expression_matrix.csv', index_col='cell_id') # Load known cell types (e.g., from 'cell_types.csv') # This file is expected to have cell IDs as index and a 'cell_type' column. cell_types_df = pd.read_csv('cell_types.csv', index_col='cell_id') cell_types = cell_types_df['cell_type'] # Ensure cell_types index matches expression_matrix index common_cells = expression_matrix.index.intersection(cell_types.index) expression_matrix = expression_matrix.loc[common_cells] cell_types = cell_types.loc[common_cells] # Perform K-means clustering with k=3 k = 3 kmeans = KMeans(n_clusters=k, random_state=42, n_init=10) # n_init for robust results cluster_labels = kmeans.fit_predict(expression_matrix) # Add cluster labels to the expression matrix for easier filtering clustered_data = expression_matrix.copy() clustered_data['cluster_label'] = cluster_labels clustered_data['known_cell_type'] = cell_types # Identify outlier clusters based on cell type composition outlier_cells_indices = [] for cluster_id in range(k): cluster_cells = clustered_data[clustered_data['cluster_label'] == cluster_id] if not cluster_cells.empty: # Get counts of known cell types within this cluster type_counts = cluster_cells['known_cell_type'].value_counts() if not type_counts.empty: # Identify the most frequent cell type in this cluster most_frequent_type = type_counts.index[0] # Calculate the number of cells that are NOT the most frequent type num_other_types = cluster_cells.shape[0] - type_counts.iloc[0] # Check if a majority of cells are 'other cell types' # This means if the count of all cell types *except* the most frequent one # is greater than the count of the most frequent one. if num_other_types > type_counts.iloc[0]: print(f'Cluster {cluster_id} identified as an outlier cluster: majority of other cell types.') outlier_cells_indices.extend(cluster_cells.index.tolist()) else: print(f'Cluster {cluster_id} is considered valid.') else: print(f'Cluster {cluster_id} is empty or has no known cell types.') else: print(f'Cluster {cluster_id} is empty.') # Filter out outlier cells filtered_expression_matrix = expression_matrix.loc[ ~expression_matrix.index.isin(outlier_cells_indices) ] print(f'\nOriginal number of cells: {expression_matrix.shape[0]}') print(f'Number of outlier cells removed: {len(outlier_cells_indices)}') print(f'Number of cells after filtering: {filtered_expression_matrix.shape[0]}') # Save the filtered expression matrix filtered_expression_matrix.to_csv('filtered_expression_matrix.csv') print('Filtered expression matrix saved to filtered_expression_matrix.csv') " -
8
Modality was calculated using anchor on the psi scores for all events detected in at least 10 cells, for each cell type
Custom Python script (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# This script calculates splicing modality using a method referred to as "anchor". # The "anchor" method likely involves statistical modeling (e.g., mixture models like beta-binomial mixture models) # to characterize the distribution of PSI scores for each splicing event within a given cell type. # The tool is inferred as a custom Python script, as a standard command-line tool # named "anchor" specifically for modality calculation on psi scores is not widely documented. # Prerequisites: Python 3.x with common scientific libraries (e.g., pandas, numpy, scipy, scikit-learn for mixture models). # # conda create -n modality_env python=3.9 # # conda activate modality_env # # pip install pandas numpy scipy scikit-learn # Input: A TSV or CSV file containing PSI scores for all events across cells. # This file should include columns for event ID, cell ID, cell type, and PSI value. # Example format (long format): # event_id cell_id cell_type psi_score # event1 cellA1 Type1 0.15 # event1 cellA2 Type1 0.18 # ... # event2 cellB1 Type2 0.80 # ... # Output: A TSV or CSV file with modality results for each event per cell type. # Example format: # event_id cell_type modality_type modality_score p_value # Define input and output paths and parameters INPUT_PSI_SCORES="all_events_psi_scores.tsv" OUTPUT_MODALITY_RESULTS="splicing_modality_results.tsv" MIN_CELLS_THRESHOLD=10 CELL_TYPE_COLUMN="cell_type" # Column in INPUT_PSI_SCORES that identifies cell type EVENT_ID_COLUMN="event_id" # Column in INPUT_PSI_SCORES that identifies the splicing event PSI_SCORE_COLUMN="psi_score" # Column in INPUT_PSI_SCORES that contains the PSI value # Execute the hypothetical custom Python script # This script would conceptually: # 1. Load the input PSI scores file. # 2. Group data by the specified cell type column. # 3. For each cell type: # a. Filter events to include only those detected (i.e., having a non-NaN PSI score) in at least MIN_CELLS_THRESHOLD cells. # b. For each filtered event, apply the "anchor" modality calculation method # (e.g., fitting a beta-binomial mixture model or similar statistical test # to determine if the PSI distribution is unimodal, bimodal, etc.). # 4. Aggregate and save the results to the specified output file. python /path/to/custom_scripts/calculate_splicing_modality.py \ --input_psi_scores "${INPUT_PSI_SCORES}" \ --output_modality_results "${OUTPUT_MODALITY_RESULTS}" \ --min_cells_per_event "${MIN_CELLS_THRESHOLD}" \ --cell_type_column "${CELL_TYPE_COLUMN}" \ --event_id_column "${EVENT_ID_COLUMN}" \ --psi_score_column "${PSI_SCORE_COLUMN}" \ --modality_method "anchor" # Explicitly state the method used within the script -
9
Waypoints were calculated using bonvoyage on the psi scores for all events detected in at least 10 cells, for each cell type
$ Bash example
# Install bonvoyage (if not already installed) # pip install bonvoyage # Assuming 'psi_scores_celltypeX.tsv' contains PSI scores for events for a specific cell type. # The description implies that events are already filtered to be detected in at least 10 cells. # The 'min_cells_per_event' parameter in bonvoyage can enforce this filtering if not done upstream. # The tool calculates waypoints based on these psi scores for each cell type. # Example for processing a single cell type: CELL_TYPE="example_cell_type" INPUT_PSI_FILE="${CELL_TYPE}_psi_scores.tsv" OUTPUT_PREFIX="${CELL_TYPE}_waypoints" bonvoyage --psi_table "${INPUT_PSI_FILE}" --output_prefix "${OUTPUT_PREFIX}" --min_cells_per_event 10 -
10
Splicing events were annotated using poshsplice to annotate corresponding genes, splice site strength using MaxEntScan, intron and exon lengths, conservation of alternative splicing from Merkin et al Science 2012,
poshsplice v0.1.0$ Bash example
# Install poshsplice and its dependency maxentscan # pip install poshsplice maxentscan # Download MaxEntScan models (if not already present) # maxentscan download_models # Assuming models are downloaded to ~/.maxentscan_models or a specified path MAXENTSCAN_MODEL_DIR="${HOME}/.maxentscan_models" # Adjust path if models are elsewhere # Define input files and reference data INPUT_BAM="input.bam" # Placeholder for aligned RNA-seq reads GENOME_FASTA="hg38.fa" # Placeholder for reference genome FASTA (e.g., GRCh38) GENE_ANNOTATION_GTF="gencode.v38.annotation.gtf" # Placeholder for gene annotation GTF (e.g., GENCODE v38) OUTPUT_DIR="poshsplice_output" # Create output directory mkdir -p "${OUTPUT_DIR}" # Run poshsplice to annotate splicing events, genes, intron/exon lengths, and splice site strength # Parameters inferred: # -b: Input BAM file # -g: Gene annotation GTF file # -o: Output directory # --genome: Reference genome FASTA for sequence extraction (needed for MaxEntScan) # --maxentscan-model-dir: Path to MaxEntScan models for splice site strength calculation poshsplice \ -b "${INPUT_BAM}" \ -g "${GENE_ANNOTATION_GTF}" \ -o "${OUTPUT_DIR}" \ --genome "${GENOME_FASTA}" \ --maxentscan-model-dir "${MAXENTSCAN_MODEL_DIR}" # Note on "conservation of alternative splicing from Merkin et al Science 2012": # poshsplice identifies splicing events. The conservation analysis from Merkin et al. # likely involves a post-processing step comparing the identified events to a # database or set of conserved events derived from the Merkin et al. publication. # This step would typically be implemented as a custom script using poshsplice's output. # Example (conceptual, not an actual command): # python custom_conservation_script.py \ # --poshsplice_events "${OUTPUT_DIR}/splicing_events.tsv" \ # --merkin_conservation_db "merkin_et_al_conserved_events.tsv" \ # --output_file "${OUTPUT_DIR}/conserved_splicing_events.tsv" -
11
Genes were annotated with attributes available in Gencode v19, Phylostrata for the matching ENSEMBL identifiers in Domazet-Loso et al MBE 2008, and TF or RBP category from Gerstberger et al Nat Rev Gen 2014
$ Bash example
# --- Reference Datasets --- # Gencode v19 annotation GTF file GENCODE_GTF_URL="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz" GENCODE_GTF_FILE="gencode.v19.annotation.gtf.gz" GENCODE_GENES_TSV="gencode_v19_genes.tsv" # Domazet-Loso et al. MBE 2008 Phylostrata data # Source: Domazet-Loso, T., & Tautz, D. (2008). A phylogenetically based transcriptomics approach... MBE, 25(11), 2413-2421. DOI: 10.1093/molbev/msn191 # This data is typically found in supplementary tables or derived databases. A direct download link for a pre-processed file is not readily available and would require specific processing of supplementary materials or database queries. # For demonstration, we simulate a file. In a real pipeline, this would be downloaded or generated from the original source. PHYLOSTRATA_FILE="domazet_loso_phylostrata.tsv" # Gerstberger et al. Nat Rev Gen 2014 TF/RBP categories # Source: Gerstberger, S., Hafner, M., & Tuschl, T. (2014). A census of human RNA-binding proteins. Nat Rev Gen, 15(12), 829-840. DOI: 10.1038/nrg3763 # This data is typically found in supplementary tables. A direct download link for a pre-processed file mapped to ENSEMBL IDs is not readily available and would require specific processing of supplementary materials and gene ID mapping. # For demonstration, we simulate a file. In a real pipeline, this would be downloaded or generated and mapped to ENSEMBL IDs. TF_RBP_FILE="gerstberger_tf_rbp.tsv" # --- Step 1: Download Gencode v19 GTF --- echo "Downloading Gencode v19 GTF..." wget -nc "${GENCODE_GTF_URL}" # --- Step 2: Extract gene attributes from Gencode GTF --- # Extract gene_id (without version suffix), gene_name, gene_type echo "Extracting gene attributes from Gencode GTF..." zcat "${GENCODE_GTF_FILE}" | \ awk '$3 == "gene" { gene_id = gene_name = gene_type = ""; for (i=9; i<=NF; i++) { if ($i ~ /^gene_id "/) { gene_id = substr($i, 10, length($i)-11); } else if ($i ~ /^gene_name "/) { gene_name = substr($i, 12, length($i)-13); } else if ($i ~ /^gene_type "/) { gene_type = substr($i, 12, length($i)-13); } } # Remove version suffix from ENSEMBL gene ID for consistent joining gsub(/\.[0-9]+$/, "", gene_id); print gene_id "\t" gene_name "\t" gene_type; }' | sort -k1,1 > "${GENCODE_GENES_TSV}" # --- Step 3: Simulate/Prepare Phylostrata data --- # In a real scenario, this file would be downloaded or processed from supplementary data. # Example content: ENSEMBL_ID<tab>Phylostratum echo "Preparing dummy Phylostrata data..." cat <<EOF > "${PHYLOSTRATA_FILE}" ENSG00000000003\t1 ENSG00000000005\t2 ENSG00000000419\t3 ENSG00000000457\t4 ENSG00000000460\t5 ENSG00000000971\t6 ENSG00000001036\t7 ENSG00000001084\t8 ENSG00000001167\t9 ENSG00000001461\t10 EOF sort -k1,1 "${PHYLOSTRATA_FILE}" -o "${PHYLOSTRATA_FILE}" # --- Step 4: Simulate/Prepare TF/RBP category data --- # In a real scenario, this file would be downloaded or processed from supplementary data and mapped to ENSEMBL IDs. # Example content: ENSEMBL_ID<tab>Category (e.g., RBP, TF) echo "Preparing dummy TF/RBP data..." cat <<EOF > "${TF_RBP_FILE}" ENSG00000000003\tRBP ENSG00000000457\tTF ENSG00000001036\tRBP ENSG00000001461\tTF EOF sort -k1,1 "${TF_RBP_FILE}" -o "${TF_RBP_FILE}" # --- Step 5: Join all attributes based on ENSEMBL gene ID --- # Join Gencode attributes with Phylostrata echo "Joining Gencode attributes with Phylostrata..." join -t $'\t' -a 1 -e "NA" -o auto "${GENCODE_GENES_TSV}" "${PHYLOSTRATA_FILE}" > gencode_phylostrata.tsv # Join the result with TF/RBP categories echo "Joining with TF/RBP categories..." join -t $'\t' -a 1 -e "NA" -o auto gencode_phylostrata.tsv "${TF_RBP_FILE}" > annotated_genes.tsv echo "Annotation complete. Output: annotated_genes.tsv" echo "Example output (first 5 lines):" head -n 5 annotated_genes.tsv
Tools Used
Raw Source Text
Reads were trimmed of adapter sequences TCGTATGCCGTCTTCTGCTTG, ATCTCGTATGCCGTCTTCTGCTTG, CGACAGGTTCAGAGTTCTACAGTCCGACGATC GATCGGAAGAGCACACGTCTGAACTCCAGTCAC using cutadapt v1.8.1 Trimmed reads were aligned to RepBase v18.05 using STAR v2.4.01 Reads that didn't align to RepBase were aligned to hg19 using STAR v2.4.01 and a splice junction database derived from GENCODE v19 Transcripts per million (TPM) per transcript was calculated from reads that didn't align to RepBase were quasi-aligned to GENCODE v19 protein coding and lincRNA transcripts using Sailfish v0.6.3 TPM per gene was calculated by summing the TPM for all transcripts of a gene Percent spliced-in (Psi) was calculated from junction read output from STAR aligner (SJ.out.tab) were used to create a de novo splicing index in conjunction with GENCODE v19 gene annotations, using outrigger Outlier cells were calculated using K-means clustering with k=3 on the expression matrix and omitting cells which clustered into a group consisting of a majority of other cell types Modality was calculated using anchor on the psi scores for all events detected in at least 10 cells, for each cell type Waypoints were calculated using bonvoyage on the psi scores for all events detected in at least 10 cells, for each cell type Splicing events were annotated using poshsplice to annotate corresponding genes, splice site strength using MaxEntScan, intron and exon lengths, conservation of alternative splicing from Merkin et al Science 2012, Genes were annotated with attributes available in Gencode v19, Phylostrata for the matching ENSEMBL identifiers in Domazet-Loso et al MBE 2008, and TF or RBP category from Gerstberger et al Nat Rev Gen 2014 Genome_build: hg19 Supplementary_files_format_and_content: "metadata.csv.gz" contains sample metadata, "splicing.csv.gz" contains Psi scores, "splicing_feature.csv.gz" contains metadata for splicing events, "expression.csv.gz" contains TPM values for genes, "expression_feature.csv.gz" contains metadata for genes, "mapping_stats.csv.gz" contains the number and percentages of reads mapped for each samples