GSE120105 Processing Pipeline
OTHER
code_examples
5 steps
Publication
Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription.Cell (2019) — PMID 31251911
Dataset
GSE120105Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription [GRO-Seq]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Library strategy: GRO-seq
STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables READS="groseq_sample.fastq.gz" # Assuming single-end reads, common for GRO-seq OUTPUT_DIR="star_alignment_output" GENOME_DIR="/path/to/STAR_genome_index/hg38" # Placeholder for human hg38 genome index. Replace with actual path. # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run STAR alignment for GRO-seq data # GRO-seq measures nascent transcription, so alignment to a reference genome is a primary step. # STAR is a splice-aware aligner suitable for RNA-based assays. STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS}" \ --runThreadN 8 \ --outFileNamePrefix "${OUTPUT_DIR}/groseq_sample_" \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --readFilesCommand zcat -
2
GRO-seq reads were mapped on the human reference genome using the Bowtie2 local model
$ Bash example
# Install Bowtie2 and Samtools if not already installed # conda install -c bioconda bowtie2 samtools # Define reference genome index (replace with actual path to your hg38 reference FASTA) # To build the index for hg38 (human reference genome), assuming you have hg38.fa: # bowtie2-build hg38.fa hg38_index # Placeholder for input reads and output BAM READS_FILE="gro_seq_reads.fastq" # Replace with your actual GRO-seq reads file REFERENCE_INDEX="hg38_index" # Path to your Bowtie2 index for hg38 OUTPUT_BAM="mapped_gro_seq_reads.bam" # Run Bowtie2 with the local alignment model, pipe to samtools for BAM conversion and sorting # -p 8 specifies 8 threads; adjust as needed based on available resources bowtie2 --local -p 8 -x "${REFERENCE_INDEX}" -U "${READS_FILE}" | samtools view -bS - | samtools sort -o "${OUTPUT_BAM}" -
3
Non-redundant reads were determined and kept for downstream analysis by using Samtools
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # Input BAM file (assumed to be coordinate-sorted and indexed) INPUT_BAM="aligned_reads_sorted.bam" OUTPUT_DEDUP_BAM="non_redundant_reads.bam" # Determine and keep non-redundant reads by removing PCR duplicates. # The -s flag removes duplicate reads from the output. # The input BAM file must be sorted by coordinate. # The output BAM file will contain only unique reads. samtools markdup -s "${INPUT_BAM}" "${OUTPUT_DEDUP_BAM}" # Index the deduplicated BAM file for downstream analysis samtools index "${OUTPUT_DEDUP_BAM}" -
4
DEseq2 package was used to search for differentially expressed genes (DEGs), which was defined with the following cutoff: adjusted p-value of â¤0.05 and fold change of â¤2/3 or â¥3/2.
$ Bash example
# Create dummy input files for demonstration purposes. # In a real scenario, 'counts.tsv' would be generated by an upstream quantification step # (e.g., featureCounts, Salmon, Kallisto) and 'coldata.tsv' would be provided by the user. # Reference datasets for DESeq2 are typically the experimental count matrix and sample metadata, # not a genomic reference assembly. # Example counts.tsv (replace with your actual count matrix) echo -e "gene\tsample1\tsample2\tsample3\tsample4" > counts.tsv echo -e "geneA\t100\t120\t50\t60" >> counts.tsv echo -e "geneB\t50\t60\t100\t110" >> counts.tsv echo -e "geneC\t200\t210\t220\t230" >> counts.tsv echo -e "geneD\t10\t12\t15\t18" >> counts.tsv # Example coldata.tsv (replace with your actual sample metadata) echo -e "sample\tcondition" > coldata.tsv echo -e "sample1\ttreated" >> coldata.tsv echo -e "sample2\ttreated" >> coldata.tsv echo -e "sample3\tcontrol" >> coldata.tsv echo -e "sample4\tcontrol" >> coldata.tsv # R script for DESeq2 analysis # This script performs differential expression analysis using DESeq2 # based on the specified cutoffs: adjusted p-value <= 0.05 and # fold change <= 2/3 or >= 3/2 (i.e., |log2FoldChange| >= log2(3/2)). cat << 'EOF' > run_deseq2.R # R script for DESeq2 analysis # Assumes 'counts.tsv' contains raw counts and 'coldata.tsv' contains sample metadata. # Load DESeq2 library # If DESeq2 is not installed, uncomment and run the following lines: # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("DESeq2", update = FALSE, ask = FALSE) library(DESeq2) # Load count data # Assuming counts.tsv has gene IDs as row names and sample IDs as column names countData <- as.matrix(read.csv("counts.tsv", sep="\t", row.names=1)) # Load sample metadata # Assuming coldata.tsv has sample IDs as row names and 'condition' as a column colData <- read.csv("coldata.tsv", sep="\t", row.names=1) colData$condition <- factor(colData$condition, levels = c("control", "treated")) # Ensure condition is a factor and set reference level # Ensure sample names match between countData and colData if (!all(rownames(colData) %in% colnames(countData))) { stop("Sample names in colData do not match column names in countData.") } countData <- countData[, rownames(colData)] # Reorder countData columns to match colData rows if (!all(rownames(colData) == colnames(countData))) { stop("Sample order mismatch after reordering.") } # Create DESeqDataSet object dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ condition) # Pre-filtering: remove genes with very low counts # This helps to reduce the memory footprint and increase the speed of DESeq2. # A common threshold is to keep genes with at least 10 total counts across all samples. keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep,] # Run DESeq2 analysis dds <- DESeq(dds) # Extract results # The contrast specifies the comparison: 'treated' vs 'control' res <- results(dds, contrast=c("condition", "treated", "control")) # Define cutoff values padj_cutoff <- 0.05 log2fc_cutoff <- log2(3/2) # log2(1.5) approximately 0.585 # Filter for differentially expressed genes # Genes with adjusted p-value <= 0.05 AND (log2FoldChange <= -log2(3/2) OR log2FoldChange >= log2(3/2)) degs <- subset(res, padj <= padj_cutoff & (log2FoldChange >= log2fc_cutoff | log2FoldChange <= -log2fc_cutoff)) # Sort by adjusted p-value degs <- degs[order(degs$padj),] # Save results write.csv(as.data.frame(degs), file="differentially_expressed_genes.csv", row.names=TRUE) write.csv(as.data.frame(res), file="deseq2_all_results.csv", row.names=TRUE) # Print summary cat("DESeq2 analysis completed.\n") cat("Number of differentially expressed genes found (padj <= ", padj_cutoff, " and |log2FC| >= ", round(log2fc_cutoff, 3), "): ", nrow(degs), "\n", sep="") EOF # Execute the R script Rscript run_deseq2.R # Optional: Clean up dummy input files and the R script # rm counts.tsv coldata.tsv run_deseq2.R -
5
processed data files format and content: BED and BigWig
$ Bash example
# UCSC tools are typically downloaded as pre-compiled binaries. # The following commands demonstrate how to download them for a Linux x86_64 system. # Adjust the URL for other operating systems (e.g., macOsX.x86_64). # --- Installation of bedGraphToBigWig --- # wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig # chmod +x bedGraphToBigWig # --- Installation of bedToBigBed --- # wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed # chmod +x bedToBigBed # --- Download chromosome sizes for hg38 (latest assembly as placeholder) --- # This file is required for both bedGraphToBigWig and bedToBigBed. # wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes -O hg38.chrom.sizes # --- Example: Convert a bedGraph file to BigWig --- # Create a dummy bedGraph input file for demonstration # echo -e "chr1\t100\t200\t0.5\nchr1\t250\t300\t1.2\nchr2\t50\t150\t0.8" > input.bedGraph # Execute bedGraphToBigWig ./bedGraphToBigWig input.bedGraph hg38.chrom.sizes output.bigWig # --- Example: Convert a BED file to BigBed --- # Create a dummy BED input file (e.g., bed6 format) for demonstration # echo -e "chr1\t100\t200\tfeatureA\t0\t+\nchr1\t250\t300\tfeatureB\t0\t-" > input.bed # Execute bedToBigBed (assuming bed6 format for input.bed) ./bedToBigBed -type=bed6 input.bed hg38.chrom.sizes output.bigBed
Tools Used
Raw Source Text
Library strategy: GRO-seq GRO-seq reads were mapped on the human reference genome using the Bowtie2 local model Non-redundant reads were determined and kept for downstream analysis by using Samtools DEseq2 package was used to search for differentially expressed genes (DEGs), which was defined with the following cutoff: adjusted p-value of â¤0.05 and fold change of â¤2/3 or â¥3/2. Genome_build: GRCh37 (hg19) processed data files format and content: BED and BigWig