GSE280899 Processing Pipeline

Publication

Integrative CRISPR Screening and RNA Analyses Discover an Essential Role for PUF60 Interactions with 3' Splice Sites in Cancer Progression.

Cancer Research (2025), PMID 41411621

Dataset

GSE280899

PUF60-Mediated Splicing Is a Key Driver of Triple Negative Breast Cancer

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Reads were mapped using STAR 2.7.6a

STAR v2.7.6a

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.7.6a

# --- Configuration Variables ---
# Input FASTQ files (adjust paths and filenames as needed)
# Assuming paired-end reads, common for RNA-seq. Adjust if single-end.
READ1="sample_R1.fastq.gz" # Placeholder for input read 1
READ2="sample_R2.fastq.gz" # Placeholder for input read 2

# Reference genome directory (STAR index must be pre-built)
# Placeholder: Replace with your actual path to the STAR genome index for GRCh38/hg38.
# To build a STAR genome index (example for GRCh38 with GTF):
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_genome_index/GRCh38 \
#      --genomeFastaFiles /path/to/GRCh38.primary_assembly.fa \
#      --sjdbGTFfile /path/to/gencode.v38.annotation.gtf \
#      --runThreadN 16 # Adjust threads as needed for index generation
GENOME_DIR="/path/to/STAR_genome_index/GRCh38" # Placeholder for genome index directory

# Output prefix for all generated files (e.g., sample_STAR_aligned_Aligned.sortedByCoord.out.bam)
OUTPUT_PREFIX="sample_STAR_aligned"

# Number of threads to use for alignment
THREADS=16 # Adjust based on available CPU cores

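# --- STAR Alignment Command ---
# --readFilesCommand zcat decompresses the gzipped FASTQ placeholders on the fly;
# drop it if your input files are uncompressed.
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \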
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --outReadsUnmapped Fastx \
     --quantMode GeneCounts \
     --outBAMcompression 6 \
     --limitBAMsortRAM 30000000000 # ~30GB, adjust based on available RAM
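
Because the command above requests --quantMode GeneCounts, STAR also writes a ReadsPerGene.out.tab file per sample. Its four columns are the gene ID, unstranded counts, counts for reads on the gene's strand, and counts for reads on the opposite strand. The sketch below sums those columns to suggest a value for featureCounts -s. It is not part of the published pipeline; the function names and the 5-fold ratio threshold are illustrative assumptions.

```python
import csv
import io


def strand_totals(lines):
    """Sum the three count columns of a STAR ReadsPerGene.out.tab file.

    Rows whose ID starts with "N_" (N_unmapped, N_multimapping, ...) are
    summary rows rather than genes and are skipped.
    Returns (unstranded, same_strand, opposite_strand).
    """
    totals = [0, 0, 0]
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("N_"):
            continue
        for i in range(3):
            totals[i] += int(row[i + 1])
    return tuple(totals)


def suggest_featurecounts_strand(totals, ratio=5.0):
    """Suggest a featureCounts -s value (0, 1 or 2) from the column totals.

    `ratio` is an arbitrary illustrative threshold, not a published cutoff.
    """
    _, sense, antisense = totals
    if sense > ratio * max(antisense, 1):
        return 1  # forward-stranded library
    if antisense > ratio * max(sense, 1):
        return 2  # reverse-stranded library
    return 0  # no clear strand bias: treat as unstranded


# Tiny in-memory example with hypothetical genes; real use would pass
# open("sample_STAR_aligned_ReadsPerGene.out.tab") instead.
example = io.StringIO(
    "N_unmapped\t10\t10\t10\n"
    "N_multimapping\t5\t5\t5\n"
    "N_noFeature\t3\t40\t4\n"
    "N_ambiguous\t1\t0\t1\n"
    "GENE_A\t100\t2\t98\n"
    "GENE_B\t50\t1\t49\n"
)
totals = strand_totals(example)
print(totals, suggest_featurecounts_strand(totals))
```

If the suggestion disagrees with the library preparation protocol, trust the protocol and inspect the data further before choosing -s.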


Read count extraction was performed using featureCounts from the Subread package.

featureCounts v2.0.6

$ Bash example

# Install Subread package (which includes featureCounts)
# conda install -c bioconda subread

# Example usage of featureCounts for gene-level read quantification
# This command assumes unstranded RNA-seq data and counts reads mapping to exons,
# grouping them by gene_id. Adjust -s (strandedness) and -t (feature type) as needed for your assay.

# Define input and output files, and reference annotation
INPUT_BAM="aligned_reads.bam" # Replace with your aligned BAM file(s)
OUTPUT_COUNTS="gene_counts.txt"
GENOME_ANNOTATION="gencode.v44.annotation.gtf" # Example: GENCODE human hg38 annotation
NUM_THREADS=8 # Number of threads to use

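# Execute featureCounts.
# -p --countReadPairs counts fragments for paired-end data (Subread >= 2.0.2 syntax),
# matching the paired-end assumption of the STAR step; remove both for single-end libraries.
featureCounts \
  -a "${GENOME_ANNOTATION}" \
  -o "${OUTPUT_COUNTS}" \
  -F GTF \
  -t exon \
  -g gene_id \
  -s 0 \
  -p --countReadPairs \
  -T "${NUM_THREADS}" \
  "${INPUT_BAM}"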

Results were sorted into counts matrices

featureCounts v2.0.3 (tool and version inferred with models/gemini-2.5-flash)

$ Bash example

# Install Subread (which includes featureCounts) if not already installed
# conda install -c bioconda subread=2.0.3

# Define variables
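# Replace 'path/to/annotation.gtf' with the actual path to your GTF/GFF annotation file.
# Use an annotation whose chromosome names match the genome used for alignment
# (e.g., a GENCODE GTF with a GENCODE FASTA); mismatched names yield zero assigned reads.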
ANNOTATION_FILE="path/to/annotation.gtf"
OUTPUT_DIR="counts_matrices"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Example command for generating gene counts from aligned BAM files.
# Adjust parameters (-p, -s, -t, -g) based on your experimental design (e.g., paired-end, strandedness, feature type).
# -p --countReadPairs assumes paired-end data counted as fragments (Subread >= 2.0.2 syntax); remove both for single-end libraries.
# Replace 'sample1.bam sample2.bam' with the actual paths to your input BAM files.
featureCounts -a "${ANNOTATION_FILE}" \
              -o "${OUTPUT_DIR}/gene_counts.txt" \
              -F GTF \
              -t exon \
              -g gene_id \
              -s 0 \
              -p --countReadPairs \
              -T 8 \
              sample1.bam sample2.bam
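
The featureCounts output file starts with a comment line, followed by a header whose first six columns are Geneid, Chr, Start, End, Strand and Length, with one count column per BAM file after that. The source does not say how the results were "sorted into counts matrices", so the sketch below shows only one plausible approach: drop the annotation columns, keep the counts, and keep the Length column separately for later normalization. The function names and the toy input are hypothetical.

```python
import csv
import io


def split_featurecounts(lines):
    """Split featureCounts output into (sample_names, counts, lengths).

    counts maps gene ID -> list of integer counts, one per sample;
    lengths maps gene ID -> value of the featureCounts Length column.
    """
    data = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    rows = csv.reader(data, delimiter="\t")
    header = next(rows)
    samples = header[6:]  # columns after Length are the per-BAM counts
    counts, lengths = {}, {}
    for row in rows:
        gene = row[0]
        lengths[gene] = int(row[5])
        counts[gene] = [int(value) for value in row[6:]]
    return samples, counts, lengths


def write_matrix(handle, samples, counts):
    """Write a tab-separated gene-by-sample matrix, genes sorted by ID."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id"] + list(samples))
    for gene in sorted(counts):
        writer.writerow([gene] + counts[gene])


# Toy input with hypothetical genes; real use would pass
# open("counts_matrices/gene_counts.txt") instead.
example = io.StringIO(
    "# (featureCounts writes a comment line with the program and command here)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample1.bam\tsample2.bam\n"
    "GENE_B\tchr2\t50\t149\t-\t100\t0\t7\n"
    "GENE_A\tchr1;chr1\t100;500\t300;900\t+;+\t602\t10\t12\n"
)
samples, counts, lengths = split_featurecounts(example)
out = io.StringIO()
write_matrix(out, samples, counts)
print(out.getvalue())
```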

TPM was calculated manually from counts matrix

Custom Script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This script assumes a Python environment is set up.
# If not, you might need to install Python and pandas:
# conda install -c anaconda python pandas

# Placeholder for gene lengths file (e.g., derived from hg38 annotation).
# This file should contain gene IDs and their corresponding lengths in base pairs.
# Example format (tab-separated):
# gene_id\tlength_in_bp
# ENSG00000000003.10\t1000
# ENSG00000000005.5\t2000
GENE_LENGTHS_FILE="gene_lengths_hg38.tsv"

# Input counts matrix (e.g., from featureCounts, HTSeq-count, etc.).
# This file should contain gene IDs as the first column and raw counts for each sample.
# Example format (tab-separated):
# gene_id\tsample1_count\tsample2_count
# ENSG00000000003.10\t100\t150
# ENSG00000000005.5\t50\t75
COUNTS_MATRIX="counts_matrix.tsv"

# Output TPM matrix file.
OUTPUT_TPM_MATRIX="tpm_matrix.tsv"

# Execute a hypothetical Python script to calculate TPM.
# This script (e.g., 'calculate_tpm.py') would read the counts matrix and gene lengths,
# perform the TPM calculation (RPK = counts / length_kb, TPM = RPK / sum(RPK) * 1e6),
# and write the resulting TPM matrix to the output file.
# The 'calculate_tpm.py' script is a placeholder for the "manual calculation" described.
python calculate_tpm.py \
    --counts_matrix "${COUNTS_MATRIX}" \
    --gene_lengths "${GENE_LENGTHS_FILE}" \
    --output "${OUTPUT_TPM_MATRIX}"
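
The calculate_tpm.py script above is a placeholder, and the source does not say how the authors implemented the manual calculation or which gene lengths they used. The sketch below shows the standard TPM formula the comments describe: RPK = counts / length in kb, then TPM = RPK / sum(RPK) * 1e6 within each sample. The function name and the toy values are assumptions. Taking gene lengths from the Length column of featureCounts is one common choice, not a confirmed detail of this dataset.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts to TPM.

    counts: {gene_id: [count for each sample]}
    lengths_bp: {gene_id: gene length in base pairs}
    Genes without a positive length are skipped.
    Returns {gene_id: [TPM for each sample]}.
    """
    genes = [g for g in counts if lengths_bp.get(g, 0) > 0]
    n_samples = len(counts[genes[0]]) if genes else 0

    # Reads per kilobase: normalize each count by gene length in kb.
    rpk = {g: [c / (lengths_bp[g] / 1000.0) for c in counts[g]] for g in genes}

    # Per-sample scaling factor: the sum of RPK over all genes.
    totals = [sum(rpk[g][i] for g in genes) for i in range(n_samples)]

    tpm = {}
    for g in genes:
        tpm[g] = [
            rpk[g][i] / totals[i] * 1e6 if totals[i] > 0 else 0.0
            for i in range(n_samples)
        ]
    return tpm


# Toy example with hypothetical genes and counts.
toy_counts = {"GENE_A": [100, 0], "GENE_B": [100, 50]}
toy_lengths = {"GENE_A": 1000, "GENE_B": 2000}
toy_tpm = counts_to_tpm(toy_counts, toy_lengths)
for gene, values in sorted(toy_tpm.items()):
    print(gene, [round(v, 1) for v in values])
```

Each sample's TPM values sum to one million by construction, which is a quick sanity check on any real output.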

Differential splicing analysis was performed on STAR-aligned reads using rMATS 4.0.2.

rMATS v4.0.2

$ Bash example

# Install rMATS 4.0.2 via conda
# rMATS 4.0.2 targets Python 2.7; the environment below assumes the bioconda build follows suit.
# conda create -n rmats_env python=2.7
# conda activate rmats_env
# conda install -c bioconda rmats=4.0.2

# Placeholder for input BAM files (replace with actual paths)
# rMATS reads each group from a text file holding ONE line of comma-separated BAM paths.
# Create b1.txt for SAMPLE_1 (shRNA knockdown replicates)
# Example:
# echo "/path/to/knockdown_rep1.bam,/path/to/knockdown_rep2.bam" > b1.txt

# Create b2.txt for SAMPLE_2 (non-targeting controls shNT1, shNT2, shNT3)
# Example:
# echo "/path/to/shNT1.bam,/path/to/shNT2.bam,/path/to/shNT3.bam" > b2.txt

# Placeholder for GTF annotation file (replace with actual path)
# For human hg38, a common GTF is from GENCODE (e.g., gencode.v44.annotation.gtf):
# wget -O gencode.v44.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz
# gunzip gencode.v44.annotation.gtf.gz
GTF_FILE="gencode.v44.annotation.gtf" # Placeholder for the GTF file path

# Define output and temporary directories
OUTPUT_DIR="rmats_output"
TMP_DIR="rmats_tmp"

mkdir -p "${OUTPUT_DIR}"
mkdir -p "${TMP_DIR}"

# Execute rMATS for differential splicing analysis
# Assuming paired-end reads (-t paired) and a read length of 100
# Adjust --readLength and --nthread as appropriate for your data and system
# Note: --tmp is required by rMATS 4.1.0 and later; it may not be accepted by 4.0.2,
# so drop it if your build reports an unrecognized argument.
rmats.py --b1 b1.txt --b2 b2.txt \
         --gtf "${GTF_FILE}" \
         --od "${OUTPUT_DIR}" --tmp "${TMP_DIR}" \
         -t paired --readLength 100 --nthread 8
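
The dataset's supplementary files include SE.MATS.JC.txt, the rMATS table of skipped-exon events quantified from junction counts. Its columns include geneSymbol, PValue, FDR and IncLevelDifference, where IncLevelDifference is the mean inclusion level of SAMPLE_1 minus that of SAMPLE_2. The sketch below filters such a table for events passing an FDR and inclusion-difference cutoff. The thresholds (FDR <= 0.05, absolute difference >= 0.1) are illustrative assumptions; the source does not state which cutoffs the authors used. The example rows are hypothetical and show only a subset of the real columns.

```python
import csv


def significant_events(lines, max_fdr=0.05, min_abs_dpsi=0.1):
    """Return (gene_symbol, inc_level_difference, fdr) for events passing both cutoffs.

    Rows whose FDR or IncLevelDifference is not numeric (e.g. "NA") are skipped.
    """
    hits = []
    for row in csv.DictReader(lines, delimiter="\t"):
        try:
            fdr = float(row["FDR"])
            dpsi = float(row["IncLevelDifference"])
        except ValueError:
            continue
        if fdr <= max_fdr and abs(dpsi) >= min_abs_dpsi:
            # rMATS quotes gene symbols; strip any quotes the csv parser left behind.
            hits.append((row["geneSymbol"].strip('"'), dpsi, fdr))
    return hits


# Hypothetical rows; real use would pass open("rmats_output/SE.MATS.JC.txt").
example = [
    "ID\tGeneID\tgeneSymbol\tPValue\tFDR\tIncLevelDifference",
    '1\t"ENSG_HYPOTHETICAL_1"\t"GENE_A"\t0.0001\t0.001\t-0.25',
    '2\t"ENSG_HYPOTHETICAL_2"\t"GENE_B"\t0.2\t0.6\t0.30',
    '3\t"ENSG_HYPOTHETICAL_3"\t"GENE_C"\t0.001\t0.01\t0.02',
    '4\t"ENSG_HYPOTHETICAL_4"\t"GENE_D"\tNA\tNA\tNA',
]
hits = significant_events(example)
print(hits)
```

With knockdown samples as SAMPLE_1, a negative IncLevelDifference means the exon is included less often in the knockdown than in the non-targeting controls.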

For each condition, shRNA knockdown samples represent SAMPLE_1, while non-targeting controls (shNT1, shNT2, and shNT3) represent SAMPLE_2.

Sample Grouping/Definition (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This step defines sample groups based on experimental conditions.
# It does not involve a specific command-line tool execution, but rather
# sets up metadata for downstream comparative analysis.

# Group 1: shRNA knockdown samples
# Represents the experimental condition.
# Example variable assignment for a workflow:
# GROUP1_SAMPLES="shRNA_KD_Sample1 shRNA_KD_Sample2"

# Group 2: Non-targeting control samples
# Represents the control condition.
# Example variable assignment for a workflow:
# GROUP2_SAMPLES="shNT1 shNT2 shNT3"

# In this pipeline the groups feed the differential splicing step: rMATS takes
# Group 1 (knockdown) as SAMPLE_1 via --b1 and Group 2 (controls) as SAMPLE_2 via --b2.
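
The grouping above maps onto the two rMATS input files: knockdown BAMs go in the SAMPLE_1 file (--b1) and non-targeting control BAMs in the SAMPLE_2 file (--b2). Each file holds a single line of comma-separated BAM paths. The sketch below writes both files from two lists of paths. The function name, file names and BAM paths are hypothetical placeholders, and the knockdown sample names are not given in the source.

```python
import os
import tempfile


def write_rmats_groups(knockdown_bams, control_bams, out_dir):
    """Write b1.txt (SAMPLE_1, knockdown) and b2.txt (SAMPLE_2, controls).

    Each file contains one line of comma-separated BAM paths.
    Returns a dict mapping file name to the path written.
    """
    written = {}
    for name, bams in (("b1.txt", knockdown_bams), ("b2.txt", control_bams)):
        if not bams:
            raise ValueError("no BAM files given for " + name)
        path = os.path.join(out_dir, name)
        with open(path, "w") as handle:
            handle.write(",".join(bams) + "\n")
        written[name] = path
    return written


# Hypothetical paths; a temporary directory keeps the demo self-contained.
knockdown = ["/path/to/knockdown_rep1.bam", "/path/to/knockdown_rep2.bam"]
controls = ["/path/to/shNT1.bam", "/path/to/shNT2.bam", "/path/to/shNT3.bam"]
with tempfile.TemporaryDirectory() as demo_dir:
    files = write_rmats_groups(knockdown, controls, demo_dir)
    with open(files["b2.txt"]) as handle:
        print(handle.read().strip())
```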


Tools Used

STAR

featureCounts (Subread)

rMATS

Raw Source Text

Reads were mapped using STAR 2.7.6a
Read count extraction was performed using featureCounts from the Subread package. Results were sorted into counts matrices
TPM was calculated manually from counts matrix
Differential splicing analysis was performed on STAR-aligned reads using rMATS 4.0.2. For each condition, shRNA knockdown samples represent SAMPLE_1, while non-targeting controls (shNT1, shNT2, and shNT3) represent SAMPLE_2.
Assembly: GRCh38
Supplementary files format and content: csv files include TPM values for each sample in comparison to non-targeting shRNA control, along with sample means and standard deviations
Supplementary files format and content: tab-delimited text files include differential skipped exon splicing analysis for each knockdown (SE.MATS.JC.txt from rMATS)
