GSE55727 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Zmat3 Is a Key Splicing Regulator in the p53 Tumor Suppression Program.

Molecular cell (2020) — PMID 33157015

Dataset

Integrative Genomic Analysis Reveals Widespread Enhancer Regulation by p53 in Response to DNA Damage

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

RNA-Seq Read Alignment: TopHat 2.0 with default options using UCSC known genes from hg19 or mm9 as reference transcriptome

TopHat v2.0 GitHub

$ Bash example

# Install TopHat 2.0 (and Bowtie2, which TopHat depends on)
# conda install -c bioconda tophat=2.0 bowtie2

# --- Reference Preparation (hg19 example) ---
# Create a directory for reference files
# mkdir -p references/hg19
# cd references/hg19

# Download hg19 genome FASTA
# wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip hg19.fa.gz

# Build Bowtie2 index for hg19 genome
# bowtie2-build hg19.fa hg19_index

# Download a GTF file representing UCSC known genes for hg19.
# Gencode v19 is a comprehensive annotation for hg19 and often used as a proxy for "known genes".
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip gencode.v19.annotation.gtf.gz
# mv gencode.v19.annotation.gtf genes.gtf
# cd ../../ # Go back to the main working directory

# --- RNA-Seq Read Alignment with TopHat 2.0 ---
# Define variables
GENOME_INDEX="references/hg19/hg19_index" # Path to the Bowtie2 index for hg19
GTF_FILE="references/hg19/genes.gtf"     # Path to the GTF file (e.g., Gencode v19 for hg19 known genes)
READS_R1="input_reads_R1.fastq.gz"        # Placeholder for input forward reads
READS_R2="input_reads_R2.fastq.gz"        # Placeholder for input reverse reads (remove if single-end)
OUTPUT_DIR="tophat_alignment_output"      # Output directory for TopHat results

# Execute TopHat 2.0 with default options and specified GTF
# TopHat 2.0 uses 'tophat2' command.
tophat2 -o "${OUTPUT_DIR}" \
        --GTF "${GTF_FILE}" \
        "${GENOME_INDEX}" \
        "${READS_R1}" "${READS_R2}"

View on GitHub

RNA-Seq Differential Expression: Cuffdiff 2.1 with default options using UCSC known genes from hg19 or mm9 as reference transcriptome

Cufflinks v2.1 GitHub

$ Bash example

# Install Cufflinks (which includes Cuffdiff)
# conda install -c bioconda cufflinks=2.1

# Placeholder for reference transcriptome (UCSC known genes from hg19 or mm9)
# This GTF file would contain the gene and transcript annotations.
# You would typically download this from UCSC's Table Browser or a similar resource.
# For example, a GTF derived from UCSC known genes for hg19:
REFERENCE_GTF="hg19_UCSC_known_genes.gtf" # Or mm9_UCSC_known_genes.gtf

# Placeholder for input BAM files (aligned RNA-Seq reads for different conditions/replicates)
# Replace with your actual BAM files. Cuffdiff expects comma-separated BAMs for replicates within a condition.
CONDITION_A_REPS="conditionA_rep1.bam,conditionA_rep2.bam"
CONDITION_B_REPS="conditionB_rep1.bam,conditionB_rep2.bam"

# Output directory for Cuffdiff results
OUTPUT_DIR="cuffdiff_results"

# Run Cuffdiff with default options
# The command requires the reference GTF, followed by comma-separated BAM files for each condition.
cuffdiff -o "${OUTPUT_DIR}" "${REFERENCE_GTF}" "${CONDITION_A_REPS}" "${CONDITION_B_REPS}"

View on GitHub

ChIP-Seq Read Alignment: Bowtie 2.0 with default options using hg19 or mm9 as reference genome

Bowtie v2.0 GitHub

$ Bash example

# Install Bowtie 2 (if not already installed)
# conda install -c bioconda bowtie2

# Define variables
# REFERENCE_INDEX: Path to the Bowtie2 index prefix (e.g., /path/to/bowtie2_indexes/hg19/hg19)
# Pre-built indices for hg19 and mm9 can be downloaded from sources like the Bowtie 2 website (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
# or UCSC Genome Browser (e.g., for hg19: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz, then build index with bowtie2-build)
REFERENCE_INDEX="/path/to/bowtie2_indexes/hg19/hg19" # Example for hg19. Replace with mm9 for mouse.
READS_R1="sample_R1.fastq.gz" # Replace with your actual R1 FASTQ file
READS_R2="sample_R2.fastq.gz" # Replace with your actual R2 FASTQ file (remove if single-end)
OUTPUT_SAM="sample_aligned.sam"

# Perform paired-end alignment with Bowtie 2 using default options
# -x: reference index basename
# -1: reads file 1
# -2: reads file 2
# -S: output SAM file
# Default options for Bowtie 2 are generally suitable for ChIP-Seq.
# For single-end reads, use -U instead of -1 and -2: bowtie2 -x "${REFERENCE_INDEX}" -U "${READS_R1}" -S "${OUTPUT_SAM}"
bowtie2 -x "${REFERENCE_INDEX}" -1 "${READS_R1}" -2 "${READS_R2}" -S "${OUTPUT_SAM}"

# Common post-alignment steps (e.g., convert SAM to BAM, sort, and index)
# OUTPUT_BAM="sample_aligned.bam"
# OUTPUT_SORTED_BAM="sample_aligned.sorted.bam"
# samtools view -bS "${OUTPUT_SAM}" > "${OUTPUT_BAM}"
# samtools sort "${OUTPUT_BAM}" -o "${OUTPUT_SORTED_BAM}"
# samtools index "${OUTPUT_SORTED_BAM}"

View on GitHub

ChIP-Seq general peak calling: Scripture using hg19 or mm9 as reference genome

ChIP-seq v1.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Scripture (requires Java)
# Download Scripture from the Broad Institute website (e.g., https://www.broadinstitute.org/cancer/cga/scripture_download)
# Ensure Java is installed and in your PATH
# For example, if you downloaded scripture.jar to your current directory:

# Define variables for input files, genome, and output
CHIP_BAM="chip_sample.bam" # Aligned ChIP-seq reads (BAM format)
CONTROL_BAM="control_sample.bam" # Aligned control/input reads (BAM format)
GENOME_FASTA="/path/to/reference/hg19.fa" # Reference genome FASTA (hg19 or mm9)
GENOME_SIZE_FILE="/path/to/reference/hg19.chrom.sizes" # Chromosome sizes file (e.g., from UCSC)
OUTPUT_DIR="scripture_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run Scripture for peak calling
# Note: Scripture requires specific input formats (e.g., .bed or .wig for reads, .fasta for genome, .chrom.sizes for genome sizes).
# If your input is BAM, you might need to convert it to BED or WIG first.
# Example conversion from BAM to BED (adjust for paired-end if necessary):
# samtools view -b -F 0x04 "${CHIP_BAM}" | bedtools bamtobed -i - > "${OUTPUT_DIR}/chip_sample.bed"
# samtools view -b -F 0x04 "${CONTROL_BAM}" | bedtools bamtobed -i - > "${OUTPUT_DIR}/control_sample.bed"

# Assuming input BED files are available:
CHIP_BED="${OUTPUT_DIR}/chip_sample.bed"
CONTROL_BED="${OUTPUT_DIR}/control_sample.bed"

# Example command for Scripture (parameters may need adjustment based on specific data and desired sensitivity/specificity)
# This is a general example, actual parameters might vary.
# -r: ChIP-seq read file (BED format)
# -c: Control read file (BED format)
# -g: Reference genome FASTA file
# -s: Genome sizes file
# -o: Output directory
# -p: P-value threshold (e.g., 0.05)
# -f: FDR threshold (e.g., 0.05)
# -w: Window size (e.g., 100 bp)
# -e: Extension size (e.g., 150 bp for typical fragment length)

java -Xmx4g -jar scripture.jar \
  -r "${CHIP_BED}" \
  -c "${CONTROL_BED}" \
  -g "${GENOME_FASTA}" \
  -s "${GENOME_SIZE_FILE}" \
  -o "${OUTPUT_DIR}" \
  -p 0.05 \
  -f 0.05 \
  -w 100 \
  -e 150

echo "Scripture peak calling completed. Results are in ${OUTPUT_DIR}"

ChIP-Seq significant peak calling: Cuffdiff 2.0 using Scripture output as peak reference

Cufflinks v2.0 GitHub

$ Bash example

# Install Cufflinks suite (includes Cuffdiff)
# conda install -c bioconda cufflinks=2.0

# Placeholder for Scripture output converted to GTF format.
# Scripture typically outputs BED files of transcribed regions.
# Cuffdiff requires a GTF file of transcripts for quantification.
# This step assumes a conversion from Scripture BED to GTF has been performed.
# Replace 'scripture_output.gtf' with the actual path to your converted Scripture output.
SCRIPTURE_GTF="scripture_output.gtf"

# Placeholder for ChIP-Seq alignment BAM files for two conditions/samples.
# Replace with actual paths to your aligned ChIP-Seq data for each replicate.
CHIP_BAM_SAMPLE1_REP1="chip_sample1_conditionA_replicate1.bam"
CHIP_BAM_SAMPLE1_REP2="chip_sample1_conditionA_replicate2.bam"
CHIP_BAM_SAMPLE2_REP1="chip_sample2_conditionB_replicate1.bam"
CHIP_BAM_SAMPLE2_REP2="chip_sample2_conditionB_replicate2.bam"

# Output directory for Cuffdiff results
OUTPUT_DIR="cuffdiff_chipseq_results"

# Run Cuffdiff for differential expression analysis.
# Note: Cuffdiff is primarily designed for RNA-Seq differential expression analysis,
# not direct ChIP-Seq peak calling. Its application here for "ChIP-Seq significant peak calling"
# using "Scripture output as peak reference" is highly unconventional.
# This command will perform differential quantification of features
# defined in the SCRIPTURE_GTF across the provided ChIP-Seq BAM files,
# treating them as expression data. The output will be differential abundance
# of these features, not traditional ChIP-Seq peaks.
cuffdiff \
-o "${OUTPUT_DIR}" \
-L "ConditionA,ConditionB" \
"${SCRIPTURE_GTF}" \
"${CHIP_BAM_SAMPLE1_REP1},${CHIP_BAM_SAMPLE1_REP2}" \
"${CHIP_BAM_SAMPLE2_REP1},${CHIP_BAM_SAMPLE2_REP2}"

View on GitHub

Tools Used

TopHat Cufflinks ChIP-seq

Raw Source Text

RNA-Seq Read Alignment: TopHat 2.0 with default options using UCSC known genes from hg19 or mm9 as reference transcriptome
RNA-Seq Differential Expression: Cuffdiff 2.1 with default options using UCSC known genes from hg19 or mm9 as reference transcriptome
ChIP-Seq Read Alignment: Bowtie 2.0 with default options using hg19 or mm9 as reference genome
ChIP-Seq general peak calling: Scripture using hg19 or mm9 as reference genome
ChIP-Seq significant peak calling: Cuffdiff 2.0 using Scripture output as peak reference
Genome_build: hg19 / mm9
Supplementary_files_format_and_content: TXT files are expression matrices with FPKM values for each sample / BED files are p53 ChIP-Seq peak coordinates

← Back to Analysis