GSE232598 Processing Pipeline

RNA-Seq code_examples 6 steps

Publication

Large-scale evaluation of the ability of RNA-binding proteins to activate exon inclusion.

Nature biotechnology (2024) — PMID 38168984

Dataset

Systematic identification of RNA-binding proteins and tethered domains that activate exon splicing inclusion [RNA-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Reads were mapped using STAR 2.7.6a

STAR v2.7.6a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.7.6a

# Define variables
# Placeholder for STAR genome index. Replace with the actual path to your indexed genome (e.g., GRCh38, hg38).
# To build a genome index (example for GRCh38):
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_genome_index/GRCh38 \
#      --genomeFastaFiles /path/to/GRCh38.fa \
#      --sjdbGTFfile /path/to/GRCh38.gtf \
#      --runThreadN 8
GENOME_DIR="/path/to/STAR_genome_index/GRCh38" 

# Placeholder for input FASTQ files. Replace with your actual read files.
# Assuming paired-end reads. For single-end, remove READ2_FASTQ and adjust --readFilesIn.
READ1_FASTQ="sample_R1.fastq.gz"
READ2_FASTQ="sample_R2.fastq.gz" 

# Output file prefix
OUTPUT_PREFIX="sample_mapped"

# Number of threads to use
THREADS=8 

# Run STAR mapping
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --outReadsUnmapped Fastx \
     --quantMode GeneCounts # Optional: for gene quantification (e.g., for RNA-seq)

View on GitHub

Read count extraction was performed using featureCounts from the Subread package.

featureCounts v2.0.6

$ Bash example

# Install Subread package (which includes featureCounts)
# conda install -c bioconda subread

# Placeholder for annotation file (e.g., human GRCh38, Ensembl release 109)
# Download from Ensembl: ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
# gunzip Homo_sapiens.GRCh38.109.gtf.gz
GTF_FILE="Homo_sapiens.GRCh38.109.gtf"

# Placeholder for input aligned BAM file(s)
# Replace with your actual aligned BAM files. Multiple BAMs can be provided.
INPUT_BAM="input_aligned.bam"

# Placeholder for output count file
OUTPUT_COUNTS="gene_counts.txt"

# Perform read count extraction using featureCounts
# -a: Annotation file (GTF/GFF format)
# -o: Output file for read counts
# -F GTF: Specify annotation file format as GTF
# -t exon: Specify feature type to count (e.g., 'exon' for gene-level counts)
# -g gene_id: Specify attribute type in GTF/GFF to use as gene identifier
# -s 0: Strandedness (0: unstranded, 1: stranded, 2: reversely stranded). Assumed unstranded.
# -T 8: Number of threads/cores to use
# -p: Count fragments instead of reads (for paired-end data). Remove if single-end.
featureCounts -a "${GTF_FILE}" \
              -o "${OUTPUT_COUNTS}" \
              -F GTF \
              -t exon \
              -g gene_id \
              -s 0 \
              -T 8 \
              -p \
              "${INPUT_BAM}"

Results were sorted into counts matrices

featureCounts (Inferred with models/gemini-2.5-flash) vSubread 2.0.6 GitHub

$ Bash example

# Install Subread (which includes featureCounts)
# conda install -c bioconda subread

# Define input and output files
# Replace 'aligned_reads.bam' with your actual aligned BAM file(s)
INPUT_BAM="aligned_reads.bam"
# Replace 'Homo_sapiens.GRCh38.111.gtf' with your actual annotation file (e.g., from Ensembl or GENCODE)
ANNOTATION_GTF="Homo_sapiens.GRCh38.111.gtf"
OUTPUT_COUNTS="gene_counts.txt"

# Run featureCounts to generate a gene count matrix
# -a: Specify the annotation file (GTF/GFF format)
# -o: Specify the output file for the counts matrix
# -p: Indicates that reads are paired-end (remove if single-end)
# -t exon: Specifies 'exon' as the feature type to count (common for gene-level counts)
# -g gene_id: Specifies 'gene_id' as the attribute in the GTF/GFF file to use as the feature identifier
# --extraAttributes gene_name: Includes the 'gene_name' attribute in the output (optional)
# -T 8: Use 8 threads for parallel processing (adjust based on available resources)
featureCounts -a "${ANNOTATION_GTF}" \
              -o "${OUTPUT_COUNTS}" \
              -p \
              -t exon \
              -g gene_id \
              --extraAttributes gene_name \
              -T 8 \
              "${INPUT_BAM}"

View on GitHub

TPM was calculated manually from counts matrix

Python/R Script (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# This script calculates Transcripts Per Million (TPM) from a gene counts matrix.
# It assumes the input file is tab-separated and contains at least:
# - A gene identifier column (e.g., 'gene_id') as the first column (index_col=0 for pandas)
# - A gene length column (e.g., 'length')
# - Subsequent columns for sample counts (e.g., 'sample1', 'sample2', ...)

# Example input file (counts.tsv):
# gene_id\tlength\tsample1\tsample2
# geneA\t1000\t100\t200
# geneB\t2000\t50\t150
# geneC\t500\t200\t100

# Create a dummy counts file for demonstration
echo -e "gene_id\tlength\tsample1\tsample2" > counts.tsv
echo -e "geneA\t1000\t100\t200" >> counts.tsv
echo -e "geneB\t2000\t50\t150" >> counts.tsv
echo -e "geneC\t500\t200\t100" >> counts.tsv

# Python script to calculate TPM
# Save this content as 'calculate_tpm.py'
cat << 'EOF' > calculate_tpm.py
#!/usr/bin/env python3
import pandas as pd
import sys

def calculate_tpm(input_file, output_file):
    try:
        df = pd.read_csv(input_file, sep='\t', index_col=0)
    except FileNotFoundError:
        print(f"Error: Input file '{input_file}' not found.", file=sys.stderr)
        sys.exit(1)

    if 'length' not in df.columns:
        print("Error: 'length' column not found in the input matrix.", file=sys.stderr)
        sys.exit(1)

    gene_lengths = df['length']
    counts_df = df.drop(columns=['length'])

    # Calculate RPK (Reads Per Kilobase)
    # Divide counts by gene length in kilobases
    rpk_df = counts_df.div(gene_lengths / 1000, axis=0)

    # Calculate TPM (Transcripts Per Million)
    # Divide RPK by the sum of RPKs for each sample, then multiply by 1,000,000
    tpm_df = rpk_df.div(rpk_df.sum(axis=0), axis=1) * 1_000_000

    tpm_df.to_csv(output_file, sep='\t')

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python calculate_tpm.py <input_counts.tsv> <output_tpm.tsv>", file=sys.stderr)
        sys.exit(1)
    input_counts_file = sys.argv[1]
    output_tpm_file = sys.argv[2]
    calculate_tpm(input_counts_file, output_tpm_file)
EOF

# Make the script executable
chmod +x calculate_tpm.py

# Install pandas if not already installed
# pip install pandas

# Execute the TPM calculation
./calculate_tpm.py counts.tsv tpm.tsv

View on GitHub

Differential splicing analysis was performed on STAR-aligned reads using rMATS 4.0.2.

STAR v4.0.2 GitHub

$ Bash example

# Install rMATS (if not already installed)
# conda install -c bioconda rmats-turbo

# Placeholder variables (replace with actual paths and values)
CASE_BAM_FILES="case_replicate1.bam,case_replicate2.bam"
CONTROL_BAM_FILES="control_replicate1.bam,control_replicate2.bam"
GTF_ANNOTATION="gencode.vM25.annotation.gtf" # Example: latest mouse Gencode GTF
READ_LENGTH=100 # Replace with actual read length
LIB_TYPE="fr-unstranded" # Example: paired-end, unstranded. Adjust as needed (e.g., 'fr-firststrand', 'fr-secondstrand', 'single')
OUTPUT_DIR="rmats_output"
NUM_THREADS=8

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run rMATS for differential splicing analysis
rmats.py \
  --b1 "${CASE_BAM_FILES}" \
  --b2 "${CONTROL_BAM_FILES}" \
  --gtf "${GTF_ANNOTATION}" \
  --readLength "${READ_LENGTH}" \
  --libType "${LIB_TYPE}" \
  --dataType RNA \
  --nthread "${NUM_THREADS}" \
  -o "${OUTPUT_DIR}"

View on GitHub

For each condition, shRNA knockdown samples represent SAMPLE_1, while non-targeting controls (shNT_1, shNT_2, and shNT_3) represent SAMPLE_2.

(Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# Define sample groups based on the description for downstream analysis
# SAMPLE_1: shRNA knockdown samples
# Replace 'shRNA_KD_sample_X' with actual sample identifiers for shRNA knockdown samples
SAMPLE_1_SAMPLES="shRNA_KD_sample_1 shRNA_KD_sample_2 shRNA_KD_sample_3"

# SAMPLE_2: Non-targeting controls
SAMPLE_2_SAMPLES="shNT_1 shNT_2 shNT_3"

# These variables would typically be passed to a differential analysis tool
# For example, if using a tool that takes sample lists:
# differential_analysis_tool --group1 "${SAMPLE_1_SAMPLES}" --group2 "${SAMPLE_2_SAMPLES}" --output_prefix "differential_results"

Tools Used

STAR

Raw Source Text

Reads were mapped using STAR 2.7.6a
Read count extraction was performed using featureCounts from the Subread package. Results were sorted into counts matrices
TPM was calculated manually from counts matrix
Differential splicing analysis was performed on STAR-aligned reads using rMATS 4.0.2.  For each condition, shRNA knockdown samples represent SAMPLE_1, while non-targeting controls (shNT_1, shNT_2, and shNT_3) represent SAMPLE_2.
Assembly: GRCh38
Supplementary files format and content: csv files include TPM values for each sample in comparison to non-targeting shRNA control, along with sample means and standard deviations
Supplementary files format and content: tab-delimited text files include differential skipped exon splicing analysis for each knockdown (SE.MATS.JC.txt from rMATS)

← Back to Analysis