GSE85908 Processing Pipeline

RNA-Seq code_examples 11 steps

Publication

Single-Cell Alternative Splicing Analysis with Expedition Reveals Splicing Dynamics during Neuron Differentiation.

Molecular cell (2017) — PMID 28673540

Dataset

Single-cell alternative splicing analysis with Expedition reveals splicing dynamics during neuron differentiation

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Reads were trimmed of adapter sequences TCGTATGCCGTCTTCTGCTTG, ATCTCGTATGCCGTCTTCTGCTTG, CGACAGGTTCAGAGTTCTACAGTCCGACGATC GATCGGAAGAGCACACGTCTGAACTCCAGTCAC using cutadapt v1.8.1

cutadapt v1.8.1 GitHub

$ Bash example

# Install cutadapt (version 1.8.1 is quite old, consider using a virtual environment or specific package manager for older versions)
# conda create -n cutadapt_1_8_1 cutadapt=1.8.1
# conda activate cutadapt_1_8_1

# Placeholder for input and output files. Adjust as needed.
# If reads are paired-end, use -A for the reverse read adapter and -o/-p for paired output.
INPUT_READS="input_reads.fastq.gz"
OUTPUT_TRIMMED_READS="trimmed_reads.fastq.gz"

# Adapter sequences identified from the description
ADAPTER1="TCGTATGCCGTCTTCTGCTTG"
ADAPTER2="ATCTCGTATGCCGTCTTCTGCTTG"
ADAPTER3="CGACAGGTTCAGAGTTCTACAGTCCGACGATC"
ADAPTER4="GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Execute cutadapt to trim the specified adapter sequences
# Assuming these are 3' adapters (-a). If they were 5' adapters, -g would be used.
cutadapt -a "${ADAPTER1}" \
         -a "${ADAPTER2}" \
         -a "${ADAPTER3}" \
         -a "${ADAPTER4}" \
         -o "${OUTPUT_TRIMMED_READS}" \
         "${INPUT_READS}"

View on GitHub

Trimmed reads were aligned to RepBase v18.05 using STAR v2.4.01

STAR v2.4.01 GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.4.01

# --- Reference Preparation: RepBase v18.05 STAR Index ---
# RepBase v18.05 is a database of repetitive elements. You would typically download
# the FASTA file from the Genetic Information Research Institute (GIRI).
# The exact download URL might require registration or vary, this is a placeholder.
# For example:
# mkdir -p reference/repbase_v18.05
# cd reference/repbase_v18.05
# wget -O repbase_v18.05.fasta.gz "http://www.girinst.org/repbase/update/RepBase_v18.05.fasta.gz" # Placeholder URL
# gunzip repbase_v18.05.fasta.gz
# cd ../../

# Define paths for reference files
FASTA_FILE="reference/repbase_v18.05/repbase_v18.05.fasta" # Path to your downloaded RepBase FASTA
GENOME_DIR="reference/repbase_v18.05_star_index"

# Create directory for STAR index
mkdir -p "${GENOME_DIR}"

# Build STAR genome index for RepBase v18.05
# Note: For non-spliced alignment to a collection of sequences like RepBase,
# STAR's default parameters for genome generation are usually sufficient.
# The --sjdbGTFfile parameter is not needed here as there are no introns.
STAR --runMode genomeGenerate \
     --genomeDir "${GENOME_DIR}" \
     --genomeFastaFiles "${FASTA_FILE}" \
     --runThreadN 8 # Adjust number of threads as appropriate

# --- Alignment Step ---
# Define input and output paths
INPUT_READS="trimmed_reads.fastq.gz" # Replace with your actual trimmed reads file
OUTPUT_PREFIX="aligned_to_repbase" # Prefix for output files

# Run STAR alignment
# Parameters chosen for alignment to repetitive elements:
# --outFilterMultimapNmax 100: Allows reads to map to up to 100 locations, common for repetitive elements.
# --alignIntronMax 1: Effectively disables intron searching, as RepBase sequences do not contain introns.
# --alignEndsType Local: Local alignment can be useful for partial matches to repetitive elements.
STAR --runThreadN 8 \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_READS}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 100 \
     --outFilterMismatchNmax 10 \
     --alignIntronMax 1 \
     --alignEndsType Local

View on GitHub

Reads that didn't align to RepBase were aligned to hg19 using STAR v2.4.01 and a splice junction database derived from GENCODE v19

STAR v2.4.01 GitHub

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star=2.4.01

# Define paths for reference genome index and input reads
# The STAR genome index for hg19 and GENCODE v19 annotations
# should have been pre-built using a command similar to:
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/star_index/hg19_gencode_v19 \
#      --genomeFastaFiles /path/to/hg19.fa \
#      --sjdbGTFfile /path/to/gencode.v19.annotation.gtf \
#      --sjdbOverhang 100 # A common value, adjust as needed based on read length

GENOME_DIR="/path/to/star_index/hg19_gencode_v19" # Placeholder for the pre-built STAR index
INPUT_READS="unaligned_to_repbase.fastq.gz" # Placeholder for the FASTQ file containing reads that didn't align to RepBase
OUTPUT_PREFIX="aligned_to_hg19" # Prefix for output files

# Align reads to hg19 using STAR with the GENCODE v19-derived splice junction database
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_READS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --runThreadN 8 # Example: use 8 threads, adjust as needed
     # Other common parameters might include:
     # --outSAMtype BAM SortedByCoordinate # Output sorted BAM
     # --outSAMunmapped Within # Keep unmapped reads in BAM
     # --outFilterMultimapNmax 20 # Max number of loci the read is allowed to map to
     # --outFilterMismatchNmax 999 # Max number of mismatches per read
     # --outFilterMismatchNoverLmax 0.04 # Max fraction of mismatches per read length
     # --quantMode GeneCounts # If gene quantification is desired

View on GitHub

Transcripts per million (TPM) per transcript was calculated from reads that didn't align to RepBase were quasi-aligned to GENCODE v19 protein coding and lincRNA transcripts using Sailfish v0.6.3

GENCODE v0.6.3 GitHub

$ Bash example

# Install Sailfish
# conda install -c bioconda sailfish=0.6.3

# Placeholder for Sailfish index built from GENCODE v19 protein coding and lincRNA transcripts.
# This index would have been created using a command like:
# sailfish index -t <path_to_gencode_v19_pc_lincRNA_transcripts.fa> -o gencode_v19_pc_lincRNA_sailfish_index -k 25
SAILFISH_INDEX="gencode_v19_pc_lincRNA_sailfish_index"

# Placeholder for input reads (reads that didn't align to RepBase)
INPUT_READS="unaligned_reads.fastq"

# Output directory for Sailfish results
OUTPUT_DIR="sailfish_tpm_output"

# Run Sailfish quantification
# -i: Path to the Sailfish index
# -r: Input FASTQ file (assuming single-end reads based on common usage for Sailfish v0.6.3 without explicit mention)
# -o: Output directory
# -l U: Library type (U for unstranded, single-end reads). Adjust if paired-end or stranded.
sailfish quant -i "${SAILFISH_INDEX}" -r "${INPUT_READS}" -o "${OUTPUT_DIR}" -l U

View on GitHub

TPM per gene was calculated by summing the TPM for all transcripts of a gene

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub

$ Bash example

# Install RSEM (example using conda)
# conda install -c bioconda rsem

# --- Reference Preparation (run once) ---
# Download human GRCh38 genome FASTA and Ensembl GTF (release 109 as example)
# mkdir -p rsem_ref
# cd rsem_ref
# wget ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# wget ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
# gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# gunzip Homo_sapiens.GRCh38.109.gtf.gz
# GENOME_FASTA="Homo_sapiens.GRCh38.dna.primary_assembly.fa"
# GTF_FILE="Homo_sapiens.GRCh38.109.gtf"
# RSEM_REFERENCE_INDEX="Homo_sapiens_GRCh38_Ensembl109_rsem_index"

# Build RSEM reference (this can take a long time and significant memory)
# rsem-prepare-reference \
#     --gtf ${GTF_FILE} \
#     --num-threads 8 \
#     ${GENOME_FASTA} \
#     ${RSEM_REFERENCE_INDEX}
# cd ..
# --- End Reference Preparation ---

# Define variables for expression calculation
INPUT_BAM="path/to/your/aligned_reads.bam" # Replace with actual input BAM file (e.g., from STAR alignment)
RSEM_REFERENCE_INDEX="path/to/your/rsem_ref/Homo_sapiens_GRCh38_Ensembl109_rsem_index" # Path to your RSEM reference index
SAMPLE_NAME="your_sample_name" # Replace with your sample name
NUM_THREADS=8 # Number of threads to use

# Execute RSEM to calculate expression
# This command will output gene-level TPMs in ${SAMPLE_NAME}.genes.results
# RSEM calculates gene-level TPMs by summing the estimated TPMs for all transcripts of a gene.
# --bam: Input is a BAM file
# --no-qualities: Often needed if BAM is from STAR and quality scores are not standard
# --paired-end: Use if reads are paired-end (remove if single-end)
# --num-threads: Number of threads
# --output-genome-bam: Output a genome-aligned BAM file (optional, can be large)
rsem-calculate-expression \
    --bam \
    --no-qualities \
    --paired-end \
    --num-threads ${NUM_THREADS} \
    ${INPUT_BAM} \
    ${RSEM_REFERENCE_INDEX} \
    ${SAMPLE_NAME}

# To extract the gene-level TPMs:
# The 'TPM' column in the ${SAMPLE_NAME}.genes.results file contains the desired values.
# For example, to view the first few lines of gene results:
# head ${SAMPLE_NAME}.genes.results
# To extract just the gene_id and TPM columns:
# awk 'NR > 1 {print $1, $6}' ${SAMPLE_NAME}.genes.results > ${SAMPLE_NAME}.gene_tpm.tsv

View on GitHub

Percent spliced-in (Psi) was calculated from junction read output from STAR aligner (SJ.out.tab) were used to create a de novo splicing index in conjunction with GENCODE v19 gene annotations, using outrigger

STAR vv0.1.0 GitHub

$ Bash example

# Install outrigger (example using pip)
# pip install outrigger

# STAR aligner (version 2.7.10a or similar) is assumed to have been run upstream
# to generate the SJ.out.tab file.

# Define variables
GENCODE_VERSION="v19"
GENOME_ASSEMBLY="hg19" # Inferred from GENCODE v19
GENCODE_GTF="gencode.${GENCODE_VERSION}.annotation.gtf"
GENOME_FASTA="${GENOME_ASSEMBLY}.fa"
STAR_SJ_OUT_TAB="sample_SJ.out.tab" # Placeholder for STAR's junction output
OUTRIGGER_INDEX_DIR="outrigger_index_${GENOME_ASSEMBLY}_${GENCODE_VERSION}"
OUTPUT_PSI_FILE="sample_psi.csv"

# Download GENCODE v19 GTF (if not already present)
# wget -O "${GENCODE_GTF}.gz" ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip -f "${GENCODE_GTF}.gz"

# Download hg19 genome FASTA (if not already present, required for outrigger index)
# wget -O "${GENOME_FASTA}.gz" http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip -f "${GENOME_FASTA}.gz"

# 1. Create a de novo splicing index using outrigger
# This step uses the GENCODE v19 gene annotations (GTF) and the corresponding genome FASTA.
outrigger index \
    --gtf "${GENCODE_GTF}" \
    --fasta "${GENOME_FASTA}" \
    --output-folder "${OUTRIGGER_INDEX_DIR}"

# 2. Calculate Percent spliced-in (Psi) values
# This step uses the STAR junction read output (SJ.out.tab) and the previously created outrigger index.
outrigger psi \
    --outrigger-index "${OUTRIGGER_INDEX_DIR}" \
    --sj-out-tab "${STAR_SJ_OUT_TAB}" \
    --output-file "${OUTPUT_PSI_FILE}"

View on GitHub

Outlier cells were calculated using K-means clustering with k=3 on the expression matrix and omitting cells which clustered into a group consisting of a majority of other cell types

scikit-learn (Inferred with models/gemini-2.5-flash) v1.2.2

$ Bash example

# Install necessary libraries
# pip install scikit-learn pandas numpy

# Create dummy input files for demonstration if they don't exist
python3 -c "
import pandas as pd
import numpy as np
import os

if not os.path.exists('expression_matrix.csv'):
    np.random.seed(42)
    expression_data = np.random.rand(100, 50) * 100
    expression_df = pd.DataFrame(expression_data, columns=[f'gene_{i}' for i in range(50)])
    expression_df.index.name = 'cell_id'
    expression_df.to_csv('expression_matrix.csv')
    print('Created dummy expression_matrix.csv')

if not os.path.exists('cell_types.csv'):
    cell_types = np.random.choice(['TypeA', 'TypeB', 'TypeC', 'TypeD'], size=100, p=[0.4, 0.3, 0.2, 0.1])
    cell_types_df = pd.DataFrame({'cell_type': cell_types}, index=expression_df.index if 'expression_df' in locals() else pd.RangeIndex(start=0, stop=100, name='cell_id'))
    cell_types_df.index.name = 'cell_id'
    cell_types_df.to_csv('cell_types.csv')
    print('Created dummy cell_types.csv')
"

# Execute the Python script for K-means clustering and outlier filtering
python3 -c "
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# Load expression matrix (e.g., from 'expression_matrix.csv')
# This file is expected to have cell IDs as index and gene expression values as columns.
expression_matrix = pd.read_csv('expression_matrix.csv', index_col='cell_id')

# Load known cell types (e.g., from 'cell_types.csv')
# This file is expected to have cell IDs as index and a 'cell_type' column.
cell_types_df = pd.read_csv('cell_types.csv', index_col='cell_id')
cell_types = cell_types_df['cell_type']

# Ensure cell_types index matches expression_matrix index
common_cells = expression_matrix.index.intersection(cell_types.index)
expression_matrix = expression_matrix.loc[common_cells]
cell_types = cell_types.loc[common_cells]

# Perform K-means clustering with k=3
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10) # n_init for robust results
cluster_labels = kmeans.fit_predict(expression_matrix)

# Add cluster labels to the expression matrix for easier filtering
clustered_data = expression_matrix.copy()
clustered_data['cluster_label'] = cluster_labels
clustered_data['known_cell_type'] = cell_types

# Identify outlier clusters based on cell type composition
outlier_cells_indices = []
for cluster_id in range(k):
    cluster_cells = clustered_data[clustered_data['cluster_label'] == cluster_id]
    
    if not cluster_cells.empty:
        # Get counts of known cell types within this cluster
        type_counts = cluster_cells['known_cell_type'].value_counts()
        
        if not type_counts.empty:
            # Identify the most frequent cell type in this cluster
            most_frequent_type = type_counts.index[0]
            
            # Calculate the number of cells that are NOT the most frequent type
            num_other_types = cluster_cells.shape[0] - type_counts.iloc[0]
            
            # Check if a majority of cells are 'other cell types'
            # This means if the count of all cell types *except* the most frequent one
            # is greater than the count of the most frequent one.
            if num_other_types > type_counts.iloc[0]:
                print(f'Cluster {cluster_id} identified as an outlier cluster: majority of other cell types.')
                outlier_cells_indices.extend(cluster_cells.index.tolist())
            else:
                print(f'Cluster {cluster_id} is considered valid.')
        else:
            print(f'Cluster {cluster_id} is empty or has no known cell types.')
    else:
        print(f'Cluster {cluster_id} is empty.')

# Filter out outlier cells
filtered_expression_matrix = expression_matrix.loc[
    ~expression_matrix.index.isin(outlier_cells_indices)
]

print(f'\nOriginal number of cells: {expression_matrix.shape[0]}')
print(f'Number of outlier cells removed: {len(outlier_cells_indices)}')
print(f'Number of cells after filtering: {filtered_expression_matrix.shape[0]}')

# Save the filtered expression matrix
filtered_expression_matrix.to_csv('filtered_expression_matrix.csv')
print('Filtered expression matrix saved to filtered_expression_matrix.csv')
"

Modality was calculated using anchor on the psi scores for all events detected in at least 10 cells, for each cell type

Custom Python script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This script calculates splicing modality using a method referred to as "anchor".
# The "anchor" method likely involves statistical modeling (e.g., mixture models like beta-binomial mixture models)
# to characterize the distribution of PSI scores for each splicing event within a given cell type.
# The tool is inferred as a custom Python script, as a standard command-line tool
# named "anchor" specifically for modality calculation on psi scores is not widely documented.

# Prerequisites: Python 3.x with common scientific libraries (e.g., pandas, numpy, scipy, scikit-learn for mixture models).
# # conda create -n modality_env python=3.9
# # conda activate modality_env
# # pip install pandas numpy scipy scikit-learn

# Input: A TSV or CSV file containing PSI scores for all events across cells.
# This file should include columns for event ID, cell ID, cell type, and PSI value.
# Example format (long format):
# event_id cell_id cell_type psi_score
# event1 cellA1 Type1 0.15
# event1 cellA2 Type1 0.18
# ...
# event2 cellB1 Type2 0.80
# ...

# Output: A TSV or CSV file with modality results for each event per cell type.
# Example format:
# event_id cell_type modality_type modality_score p_value

# Define input and output paths and parameters
INPUT_PSI_SCORES="all_events_psi_scores.tsv"
OUTPUT_MODALITY_RESULTS="splicing_modality_results.tsv"
MIN_CELLS_THRESHOLD=10
CELL_TYPE_COLUMN="cell_type" # Column in INPUT_PSI_SCORES that identifies cell type
EVENT_ID_COLUMN="event_id" # Column in INPUT_PSI_SCORES that identifies the splicing event
PSI_SCORE_COLUMN="psi_score" # Column in INPUT_PSI_SCORES that contains the PSI value

# Execute the hypothetical custom Python script
# This script would conceptually:
# 1. Load the input PSI scores file.
# 2. Group data by the specified cell type column.
# 3. For each cell type:
# a. Filter events to include only those detected (i.e., having a non-NaN PSI score) in at least MIN_CELLS_THRESHOLD cells.
# b. For each filtered event, apply the "anchor" modality calculation method
# (e.g., fitting a beta-binomial mixture model or similar statistical test
# to determine if the PSI distribution is unimodal, bimodal, etc.).
# 4. Aggregate and save the results to the specified output file.

python /path/to/custom_scripts/calculate_splicing_modality.py \
--input_psi_scores "${INPUT_PSI_SCORES}" \
--output_modality_results "${OUTPUT_MODALITY_RESULTS}" \
--min_cells_per_event "${MIN_CELLS_THRESHOLD}" \
--cell_type_column "${CELL_TYPE_COLUMN}" \
--event_id_column "${EVENT_ID_COLUMN}" \
--psi_score_column "${PSI_SCORE_COLUMN}" \
--modality_method "anchor" # Explicitly state the method used within the script

Waypoints were calculated using bonvoyage on the psi scores for all events detected in at least 10 cells, for each cell type

bonvoyage v0.1.0 GitHub

$ Bash example

# Install bonvoyage (if not already installed)
# pip install bonvoyage

# Assuming 'psi_scores_celltypeX.tsv' contains PSI scores for events for a specific cell type.
# The description implies that events are already filtered to be detected in at least 10 cells.
# The 'min_cells_per_event' parameter in bonvoyage can enforce this filtering if not done upstream.
# The tool calculates waypoints based on these psi scores for each cell type.

# Example for processing a single cell type:
CELL_TYPE="example_cell_type"
INPUT_PSI_FILE="${CELL_TYPE}_psi_scores.tsv"
OUTPUT_PREFIX="${CELL_TYPE}_waypoints"

bonvoyage --psi_table "${INPUT_PSI_FILE}" --output_prefix "${OUTPUT_PREFIX}" --min_cells_per_event 10

View on GitHub

Splicing events were annotated using poshsplice to annotate corresponding genes, splice site strength using MaxEntScan, intron and exon lengths, conservation of alternative splicing from Merkin et al Science 2012,

poshsplice v0.1.0

$ Bash example

# Install poshsplice and its dependency maxentscan
# pip install poshsplice maxentscan

# Download MaxEntScan models (if not already present)
# maxentscan download_models
# Assuming models are downloaded to ~/.maxentscan_models or a specified path
MAXENTSCAN_MODEL_DIR="${HOME}/.maxentscan_models" # Adjust path if models are elsewhere

# Define input files and reference data
INPUT_BAM="input.bam" # Placeholder for aligned RNA-seq reads
GENOME_FASTA="hg38.fa" # Placeholder for reference genome FASTA (e.g., GRCh38)
GENE_ANNOTATION_GTF="gencode.v38.annotation.gtf" # Placeholder for gene annotation GTF (e.g., GENCODE v38)
OUTPUT_DIR="poshsplice_output"

# Create output directory
mkdir -p "${OUTPUT_DIR}"

# Run poshsplice to annotate splicing events, genes, intron/exon lengths, and splice site strength
# Parameters inferred:
# -b: Input BAM file
# -g: Gene annotation GTF file
# -o: Output directory
# --genome: Reference genome FASTA for sequence extraction (needed for MaxEntScan)
# --maxentscan-model-dir: Path to MaxEntScan models for splice site strength calculation
poshsplice \
    -b "${INPUT_BAM}" \
    -g "${GENE_ANNOTATION_GTF}" \
    -o "${OUTPUT_DIR}" \
    --genome "${GENOME_FASTA}" \
    --maxentscan-model-dir "${MAXENTSCAN_MODEL_DIR}"

# Note on "conservation of alternative splicing from Merkin et al Science 2012":
# poshsplice identifies splicing events. The conservation analysis from Merkin et al.
# likely involves a post-processing step comparing the identified events to a
# database or set of conserved events derived from the Merkin et al. publication.
# This step would typically be implemented as a custom script using poshsplice's output.
# Example (conceptual, not an actual command):
# python custom_conservation_script.py \
#     --poshsplice_events "${OUTPUT_DIR}/splicing_events.tsv" \
#     --merkin_conservation_db "merkin_et_al_conserved_events.tsv" \
#     --output_file "${OUTPUT_DIR}/conserved_splicing_events.tsv"

Genes were annotated with attributes available in Gencode v19, Phylostrata for the matching ENSEMBL identifiers in Domazet-Loso et al MBE 2008, and TF or RBP category from Gerstberger et al Nat Rev Gen 2014

GENCODE vN/A GitHub

$ Bash example

# --- Reference Datasets ---
# Gencode v19 annotation GTF file
GENCODE_GTF_URL="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"
GENCODE_GTF_FILE="gencode.v19.annotation.gtf.gz"
GENCODE_GENES_TSV="gencode_v19_genes.tsv"

# Domazet-Loso et al. MBE 2008 Phylostrata data
# Source: Domazet-Loso, T., & Tautz, D. (2008). A phylogenetically based transcriptomics approach... MBE, 25(11), 2413-2421. DOI: 10.1093/molbev/msn191
# This data is typically found in supplementary tables or derived databases. A direct download link for a pre-processed file is not readily available and would require specific processing of supplementary materials or database queries.
# For demonstration, we simulate a file. In a real pipeline, this would be downloaded or generated from the original source.
PHYLOSTRATA_FILE="domazet_loso_phylostrata.tsv"

# Gerstberger et al. Nat Rev Gen 2014 TF/RBP categories
# Source: Gerstberger, S., Hafner, M., & Tuschl, T. (2014). A census of human RNA-binding proteins. Nat Rev Gen, 15(12), 829-840. DOI: 10.1038/nrg3763
# This data is typically found in supplementary tables. A direct download link for a pre-processed file mapped to ENSEMBL IDs is not readily available and would require specific processing of supplementary materials and gene ID mapping.
# For demonstration, we simulate a file. In a real pipeline, this would be downloaded or generated and mapped to ENSEMBL IDs.
TF_RBP_FILE="gerstberger_tf_rbp.tsv"

# --- Step 1: Download Gencode v19 GTF ---
echo "Downloading Gencode v19 GTF..."
wget -nc "${GENCODE_GTF_URL}"

# --- Step 2: Extract gene attributes from Gencode GTF ---
# Extract gene_id (without version suffix), gene_name, gene_type
echo "Extracting gene attributes from Gencode GTF..."
zcat "${GENCODE_GTF_FILE}" | \
awk '$3 == "gene" {
    gene_id = gene_name = gene_type = "";
    for (i=9; i<=NF; i++) {
        if ($i ~ /^gene_id "/) { gene_id = substr($i, 10, length($i)-11); }
        else if ($i ~ /^gene_name "/) { gene_name = substr($i, 12, length($i)-13); }
        else if ($i ~ /^gene_type "/) { gene_type = substr($i, 12, length($i)-13); }
    }
    # Remove version suffix from ENSEMBL gene ID for consistent joining
    gsub(/\.[0-9]+$/, "", gene_id);
    print gene_id "\t" gene_name "\t" gene_type;
}' | sort -k1,1 > "${GENCODE_GENES_TSV}"

# --- Step 3: Simulate/Prepare Phylostrata data ---
# In a real scenario, this file would be downloaded or processed from supplementary data.
# Example content: ENSEMBL_ID<tab>Phylostratum
echo "Preparing dummy Phylostrata data..."
cat <<EOF > "${PHYLOSTRATA_FILE}"
ENSG00000000003\t1
ENSG00000000005\t2
ENSG00000000419\t3
ENSG00000000457\t4
ENSG00000000460\t5
ENSG00000000971\t6
ENSG00000001036\t7
ENSG00000001084\t8
ENSG00000001167\t9
ENSG00000001461\t10
EOF
sort -k1,1 "${PHYLOSTRATA_FILE}" -o "${PHYLOSTRATA_FILE}"

# --- Step 4: Simulate/Prepare TF/RBP category data ---
# In a real scenario, this file would be downloaded or processed from supplementary data and mapped to ENSEMBL IDs.
# Example content: ENSEMBL_ID<tab>Category (e.g., RBP, TF)
echo "Preparing dummy TF/RBP data..."
cat <<EOF > "${TF_RBP_FILE}"
ENSG00000000003\tRBP
ENSG00000000457\tTF
ENSG00000001036\tRBP
ENSG00000001461\tTF
EOF
sort -k1,1 "${TF_RBP_FILE}" -o "${TF_RBP_FILE}"

# --- Step 5: Join all attributes based on ENSEMBL gene ID ---
# Join Gencode attributes with Phylostrata
echo "Joining Gencode attributes with Phylostrata..."
join -t $'\t' -a 1 -e "NA" -o auto "${GENCODE_GENES_TSV}" "${PHYLOSTRATA_FILE}" > gencode_phylostrata.tsv

# Join the result with TF/RBP categories
echo "Joining with TF/RBP categories..."
join -t $'\t' -a 1 -e "NA" -o auto gencode_phylostrata.tsv "${TF_RBP_FILE}" > annotated_genes.tsv

echo "Annotation complete. Output: annotated_genes.tsv"
echo "Example output (first 5 lines):"
head -n 5 annotated_genes.tsv

View on GitHub

Tools Used

STAR

Raw Source Text

Reads were trimmed of adapter sequences TCGTATGCCGTCTTCTGCTTG, ATCTCGTATGCCGTCTTCTGCTTG, CGACAGGTTCAGAGTTCTACAGTCCGACGATC GATCGGAAGAGCACACGTCTGAACTCCAGTCAC using cutadapt v1.8.1
Trimmed reads were aligned to RepBase v18.05 using STAR v2.4.01
Reads that didn't align to RepBase were aligned to hg19 using STAR v2.4.01 and a splice junction database derived from GENCODE v19
Transcripts per million (TPM) per transcript was calculated from reads that didn't align to RepBase were quasi-aligned to GENCODE v19 protein coding and lincRNA transcripts using Sailfish v0.6.3
TPM per gene was calculated by summing the TPM for all transcripts of a gene
Percent spliced-in (Psi) was calculated from junction read output from STAR aligner (SJ.out.tab) were used to create a de novo splicing index in conjunction with GENCODE v19 gene annotations, using outrigger
Outlier cells were calculated using K-means clustering with k=3 on the expression matrix and omitting cells which clustered into a group consisting of a majority of other cell types
Modality was calculated using anchor on the psi scores for all events detected in at least 10 cells, for each cell type
Waypoints were calculated using bonvoyage on the psi scores for all events detected in at least 10 cells, for each cell type
Splicing events were annotated using poshsplice to annotate corresponding genes, splice site strength using MaxEntScan, intron and exon lengths, conservation of alternative splicing from Merkin et al Science 2012,
Genes were annotated with attributes available in Gencode v19, Phylostrata for the matching ENSEMBL identifiers in Domazet-Loso et al MBE 2008, and TF or RBP category from Gerstberger et al Nat Rev Gen 2014
Genome_build: hg19
Supplementary_files_format_and_content: "metadata.csv.gz" contains sample metadata, "splicing.csv.gz" contains Psi scores, "splicing_feature.csv.gz" contains metadata for splicing events, "expression.csv.gz" contains TPM values for genes, "expression_feature.csv.gz" contains metadata for genes, "mapping_stats.csv.gz" contains the number and percentages of reads mapped for each samples

← Back to Analysis