GSE133479 Processing Pipeline

RNA-Seq, 2 steps

Publication

Longitudinal assessment of tumor development using cancer avatars derived from genetically engineered pluripotent stem cells.

Nature Communications (2020), PMID 31992716

Dataset

GSE133479

Cancer avatars derived from genetically engineered pluripotent stem cells allow for longitudinal assessment of tumor development

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

RNA-seq reads were aligned to the human genome (hg19) with STAR 2.4.0h (outFilterMultimapNmax 20, outFilterMismatchNmax 999, outFilterMismatchNoverLmax 0.04, outFilterIntronMotifs RemoveNoncanonicalUnannotated, outSJfilterOverhangMin 6 6 6 6, seedSearchStartLmax 20, alignSJDBoverhangMin 1) using a gene database constructed from Gencode v19.

STAR v2.4.0h

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
READ1_FASTQ="input_R1.fastq.gz" # Replace with your R1 fastq file
# READ2_FASTQ="input_R2.fastq.gz" # Uncomment and replace if paired-end
OUTPUT_PREFIX="aligned_reads/sample_name_" # Replace with your desired output prefix; STAR appends file names such as Aligned.sortedByCoord.out.bam directly to it
STAR_INDEX_DIR="/path/to/your/star_index/hg19_gencode_v19" # Replace with the path to your STAR index (built with hg19 and Gencode v19)
NUM_THREADS=8 # Adjust as needed

# Create output directory if it doesn't exist
mkdir -p "$(dirname "${OUTPUT_PREFIX}")"

# Run STAR alignment
# For paired-end data, pass both files instead: --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}"
STAR \
  --runThreadN "${NUM_THREADS}" \
  --genomeDir "${STAR_INDEX_DIR}" \
  --readFilesIn "${READ1_FASTQ}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.04 \
  --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
  --outSJfilterOverhangMin 6 6 6 6 \
  --seedSearchStartLmax 20 \
  --alignSJDBoverhangMin 1 \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within \
  --outSAMattributes All \
  --outSAMstrandField intronMotif
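
The alignment above assumes an existing STAR index built from hg19 with the Gencode v19 annotation. The source does not give the index command, so the following is only a sketch of how such an index could be generated. The file names, the read length used to derive --sjdbOverhang, and the run helper with its RUN switch are assumptions for illustration, not details from the original pipeline.

```shell
# Hypothetical inputs: replace with your own hg19 FASTA and Gencode v19 GTF.
GENOME_FASTA="hg19.fa"
GTF_FILE="gencode.v19.annotation.gtf"
STAR_INDEX_DIR="star_index/hg19_gencode_v19"
READ_LENGTH=100   # Assumption: the read length is not stated in the source
NUM_THREADS=8

# The STAR manual suggests sjdbOverhang = read length - 1.
SJDB_OVERHANG=$((READ_LENGTH - 1))

# Helper (not part of STAR): print a command, and execute it only when RUN=1.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

run mkdir -p "${STAR_INDEX_DIR}"
run STAR \
  --runMode genomeGenerate \
  --runThreadN "${NUM_THREADS}" \
  --genomeDir "${STAR_INDEX_DIR}" \
  --genomeFastaFiles "${GENOME_FASTA}" \
  --sjdbGTFfile "${GTF_FILE}" \
  --sjdbOverhang "${SJDB_OVERHANG}"
```

By default the snippet only prints the commands; setting RUN=1 in the environment executes them.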

Reads overlapping exon coordinates were counted with htseq-count from HTSeq (-s reverse -a 0 -t exon -i gene_id -m union).

HTSeq v0.13.5 (tool name and version inferred with models/gemini-2.5-flash)

$ Bash example

# Install HTSeq (if not already installed)
# conda install -c bioconda htseq

# Placeholder for aligned reads (BAM/SAM) and annotation (GTF/GFF)
# Replace 'aligned_reads.bam' with your actual aligned reads file.
# Replace 'annotation.gtf' with your actual gene annotation file (e.g., from GENCODE, Ensembl).
# Ensure the GTF/GFF file contains 'exon' features and 'gene_id' attributes.

# -f bam declares BAM input. For paired-end data sorted by coordinate, also add -r pos.
htseq-count -f bam -s reverse -a 0 -t exon -i gene_id -m union aligned_reads.bam annotation.gtf > gene_counts.txt
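
The comments above ask that the annotation contain 'exon' features and 'gene_id' attributes. A quick check along those lines might look like the sketch below. check_gtf is a hypothetical helper name, and the sketch assumes an uncompressed, tab-separated GTF with the feature type in column 3 and the attributes in column 9.

```shell
# check_gtf FILE: count exon records and exon records that lack a gene_id.
# Returns non-zero if there are no exons or if any exon has no gene_id.
check_gtf() {
  awk -F '\t' '
    /^#/ { next }                        # skip header and comment lines
    $3 == "exon" {
      exons++
      if ($9 !~ /gene_id "[^"]+"/) missing++
    }
    END {
      printf "exon_records=%d missing_gene_id=%d\n", exons, missing
      if (exons == 0 || missing > 0) exit 1
    }
  ' "$1"
}

# Example call (annotation.gtf is the placeholder name used above):
# check_gtf annotation.gtf && echo "annotation looks usable with -t exon -i gene_id"
```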

Tools Used

STAR

Raw Source Text

RNA-seq reads were aligned to the human genome (hg19) with STAR 2.4.0h (outFilterMultimapNmax 20, outFilterMismatchNmax 999, outFilterMismatchNoverLmax 0.04, outFilterIntronMotifs RemoveNoncanonicalUnannotated, outSJfilterOverhangMin 6 6 6 6, seedSearchStartLmax 20, alignSJDBoverhangMin 1) using a gene database constructed from Gencode v19
Reads that overlap with exon coordinates were counted using HTSeqcount (-s reverse -a 0 -t exon -i gene_id -m union)
Genome_build: hg19
Supplementary_files_format_and_content: tab-separated file created using featureCounts v1.5.0 (number of reads mapped to each Gencode V.19 gene)
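
The supplementary files are described as tab-separated tables of read counts per Gencode v19 gene. As a rough sketch of how per-sample count files with two columns (gene ID, count) could be combined into one such table, the snippet below drops the summary rows that htseq-count appends (those starting with __) and joins the counts by gene ID. merge_counts is a hypothetical helper, not part of the original pipeline, and it assumes every input file contains the same set of genes.

```shell
# merge_counts FILE...: print a tab-separated gene-by-sample table to stdout.
merge_counts() {
  awk -F '\t' -v OFS='\t' '
    FNR == 1 { header = header OFS FILENAME }   # one column per input file
    /^__/ { next }                               # skip htseq-count summary rows
    {
      # Remember the first-seen order of gene IDs, then append this count.
      if (!($1 in row)) { order[++n] = $1; row[$1] = $1 }
      row[$1] = row[$1] OFS $2
    }
    END {
      print "gene_id" header
      for (i = 1; i <= n; i++) print row[order[i]]
    }
  ' "$@"
}

# Example call with hypothetical file names:
# merge_counts sample1_counts.txt sample2_counts.txt > count_matrix.tsv
```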