GSE120105 Processing Pipeline

OTHER code_examples 5 steps

Publication

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription.

Cell (2019) — PMID 31251911

Dataset

GSE120105

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription [GRO-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Library strategy: GRO-seq

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
READS="groseq_sample.fastq.gz" # Assuming single-end reads, common for GRO-seq
OUTPUT_DIR="star_alignment_output"
GENOME_DIR="/path/to/STAR_genome_index/hg38" # Placeholder for human hg38 genome index. Replace with actual path.

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run STAR alignment for GRO-seq data
# GRO-seq measures nascent transcription, so alignment to a reference genome is a primary step.
# STAR is a splice-aware aligner suitable for RNA-based assays.
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS}" \
     --runThreadN 8 \
     --outFileNamePrefix "${OUTPUT_DIR}/groseq_sample_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --readFilesCommand zcat

View on GitHub

GRO-seq reads were mapped on the human reference genome using the Bowtie2 local model

Bowtie2 vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Install Bowtie2 and Samtools if not already installed
# conda install -c bioconda bowtie2 samtools

# Define reference genome index (replace with actual path to your hg38 reference FASTA)
# To build the index for hg38 (human reference genome), assuming you have hg38.fa:
# bowtie2-build hg38.fa hg38_index

# Placeholder for input reads and output BAM
READS_FILE="gro_seq_reads.fastq" # Replace with your actual GRO-seq reads file
REFERENCE_INDEX="hg38_index" # Path to your Bowtie2 index for hg38
OUTPUT_BAM="mapped_gro_seq_reads.bam"

# Run Bowtie2 with the local alignment model, pipe to samtools for BAM conversion and sorting
# -p 8 specifies 8 threads; adjust as needed based on available resources
bowtie2 --local -p 8 -x "${REFERENCE_INDEX}" -U "${READS_FILE}" | samtools view -bS - | samtools sort -o "${OUTPUT_BAM}"

View on GitHub

Non-redundant reads were determined and kept for downstream analysis by using Samtools

samtools v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Input BAM file (assumed to be coordinate-sorted and indexed)
INPUT_BAM="aligned_reads_sorted.bam"
OUTPUT_DEDUP_BAM="non_redundant_reads.bam"

# Determine and keep non-redundant reads by removing PCR duplicates.
# The -s flag removes duplicate reads from the output.
# The input BAM file must be sorted by coordinate.
# The output BAM file will contain only unique reads.
samtools markdup -s "${INPUT_BAM}" "${OUTPUT_DEDUP_BAM}"

# Index the deduplicated BAM file for downstream analysis
samtools index "${OUTPUT_DEDUP_BAM}"

View on GitHub

DEseq2 package was used to search for differentially expressed genes (DEGs), which was defined with the following cutoff: adjusted p-value of â¤0.05 and fold change of â¤2/3 or â¥3/2.

DESeq2 v1.40.2 GitHub

$ Bash example

# Create dummy input files for demonstration purposes.
# In a real scenario, 'counts.tsv' would be generated by an upstream quantification step
# (e.g., featureCounts, Salmon, Kallisto) and 'coldata.tsv' would be provided by the user.
# Reference datasets for DESeq2 are typically the experimental count matrix and sample metadata,
# not a genomic reference assembly.

# Example counts.tsv (replace with your actual count matrix)
echo -e "gene\tsample1\tsample2\tsample3\tsample4" > counts.tsv
echo -e "geneA\t100\t120\t50\t60" >> counts.tsv
echo -e "geneB\t50\t60\t100\t110" >> counts.tsv
echo -e "geneC\t200\t210\t220\t230" >> counts.tsv
echo -e "geneD\t10\t12\t15\t18" >> counts.tsv

# Example coldata.tsv (replace with your actual sample metadata)
echo -e "sample\tcondition" > coldata.tsv
echo -e "sample1\ttreated" >> coldata.tsv
echo -e "sample2\ttreated" >> coldata.tsv
echo -e "sample3\tcontrol" >> coldata.tsv
echo -e "sample4\tcontrol" >> coldata.tsv

# R script for DESeq2 analysis
# This script performs differential expression analysis using DESeq2
# based on the specified cutoffs: adjusted p-value <= 0.05 and
# fold change <= 2/3 or >= 3/2 (i.e., |log2FoldChange| >= log2(3/2)).
cat << 'EOF' > run_deseq2.R
# R script for DESeq2 analysis
# Assumes 'counts.tsv' contains raw counts and 'coldata.tsv' contains sample metadata.

# Load DESeq2 library
# If DESeq2 is not installed, uncomment and run the following lines:
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("DESeq2", update = FALSE, ask = FALSE)
library(DESeq2)

# Load count data
# Assuming counts.tsv has gene IDs as row names and sample IDs as column names
countData <- as.matrix(read.csv("counts.tsv", sep="\t", row.names=1))

# Load sample metadata
# Assuming coldata.tsv has sample IDs as row names and 'condition' as a column
colData <- read.csv("coldata.tsv", sep="\t", row.names=1)
colData$condition <- factor(colData$condition, levels = c("control", "treated")) # Ensure condition is a factor and set reference level

# Ensure sample names match between countData and colData
if (!all(rownames(colData) %in% colnames(countData))) {
  stop("Sample names in colData do not match column names in countData.")
}
countData <- countData[, rownames(colData)] # Reorder countData columns to match colData rows
if (!all(rownames(colData) == colnames(countData))) {
  stop("Sample order mismatch after reordering.")
}

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = colData,
                              design = ~ condition)

# Pre-filtering: remove genes with very low counts
# This helps to reduce the memory footprint and increase the speed of DESeq2.
# A common threshold is to keep genes with at least 10 total counts across all samples.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Extract results
# The contrast specifies the comparison: 'treated' vs 'control'
res <- results(dds, contrast=c("condition", "treated", "control"))

# Define cutoff values
padj_cutoff <- 0.05
log2fc_cutoff <- log2(3/2) # log2(1.5) approximately 0.585

# Filter for differentially expressed genes
# Genes with adjusted p-value <= 0.05 AND (log2FoldChange <= -log2(3/2) OR log2FoldChange >= log2(3/2))
degs <- subset(res, padj <= padj_cutoff & (log2FoldChange >= log2fc_cutoff | log2FoldChange <= -log2fc_cutoff))

# Sort by adjusted p-value
degs <- degs[order(degs$padj),]

# Save results
write.csv(as.data.frame(degs), file="differentially_expressed_genes.csv", row.names=TRUE)
write.csv(as.data.frame(res), file="deseq2_all_results.csv", row.names=TRUE)

# Print summary
cat("DESeq2 analysis completed.\n")
cat("Number of differentially expressed genes found (padj <= ", padj_cutoff, " and |log2FC| >= ", round(log2fc_cutoff, 3), "): ", nrow(degs), "\n", sep="")
EOF

# Execute the R script
Rscript run_deseq2.R

# Optional: Clean up dummy input files and the R script
# rm counts.tsv coldata.tsv run_deseq2.R

View on GitHub

processed data files format and content: BED and BigWig

UCSC tools vLatest available binaries (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# UCSC tools are typically downloaded as pre-compiled binaries.
# The following commands demonstrate how to download them for a Linux x86_64 system.
# Adjust the URL for other operating systems (e.g., macOsX.x86_64).

# --- Installation of bedGraphToBigWig ---
# wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig
# chmod +x bedGraphToBigWig

# --- Installation of bedToBigBed ---
# wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
# chmod +x bedToBigBed

# --- Download chromosome sizes for hg38 (latest assembly as placeholder) ---
# This file is required for both bedGraphToBigWig and bedToBigBed.
# wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes -O hg38.chrom.sizes

# --- Example: Convert a bedGraph file to BigWig ---
# Create a dummy bedGraph input file for demonstration
# echo -e "chr1\t100\t200\t0.5\nchr1\t250\t300\t1.2\nchr2\t50\t150\t0.8" > input.bedGraph

# Execute bedGraphToBigWig
./bedGraphToBigWig input.bedGraph hg38.chrom.sizes output.bigWig

# --- Example: Convert a BED file to BigBed ---
# Create a dummy BED input file (e.g., bed6 format) for demonstration
# echo -e "chr1\t100\t200\tfeatureA\t0\t+\nchr1\t250\t300\tfeatureB\t0\t-" > input.bed

# Execute bedToBigBed (assuming bed6 format for input.bed)
./bedToBigBed -type=bed6 input.bed hg38.chrom.sizes output.bigBed

View on GitHub

Tools Used

Bowtie2 DESeq2 UCSC tools

Raw Source Text

Library strategy: GRO-seq
GRO-seq reads were mapped on the human reference genome using the Bowtie2 local model
Non-redundant reads were determined and kept for downstream analysis by using Samtools
DEseq2 package was used to search for differentially expressed genes (DEGs), which was defined with the following cutoff: adjusted p-value of â¤0.05 and fold change of â¤2/3 or â¥3/2.
Genome_build: GRCh37 (hg19)
processed data files format and content: BED and BigWig

← Back to Analysis