GSE120105 Processing Pipeline

OTHER code_examples 5 steps

Publication

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription.

Cell (2019) — PMID 31251911

Dataset

GSE120105

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription [GRO-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Library strategy: GRO-seq

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables
    READS="groseq_sample.fastq.gz" # Assuming single-end reads, common for GRO-seq
    OUTPUT_DIR="star_alignment_output"
    GENOME_DIR="/path/to/STAR_genome_index/hg38" # Placeholder for human hg38 genome index. Replace with actual path.
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run STAR alignment for GRO-seq data
    # GRO-seq measures nascent transcription, so alignment to a reference genome is a primary step.
    # STAR is a splice-aware aligner suitable for RNA-based assays.
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READS}" \
         --runThreadN 8 \
         --outFileNamePrefix "${OUTPUT_DIR}/groseq_sample_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --readFilesCommand zcat
  2. 2

    GRO-seq reads were mapped on the human reference genome using the Bowtie2 local model

    Bowtie2 vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install Bowtie2 and Samtools if not already installed
    # conda install -c bioconda bowtie2 samtools
    
    # Define reference genome index (replace with actual path to your hg38 reference FASTA)
    # To build the index for hg38 (human reference genome), assuming you have hg38.fa:
    # bowtie2-build hg38.fa hg38_index
    
    # Placeholder for input reads and output BAM
    READS_FILE="gro_seq_reads.fastq" # Replace with your actual GRO-seq reads file
    REFERENCE_INDEX="hg38_index" # Path to your Bowtie2 index for hg38
    OUTPUT_BAM="mapped_gro_seq_reads.bam"
    
    # Run Bowtie2 with the local alignment model, pipe to samtools for BAM conversion and sorting
    # -p 8 specifies 8 threads; adjust as needed based on available resources
    bowtie2 --local -p 8 -x "${REFERENCE_INDEX}" -U "${READS_FILE}" | samtools view -bS - | samtools sort -o "${OUTPUT_BAM}"
  3. 3

    Non-redundant reads were determined and kept for downstream analysis by using Samtools

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Input BAM file (assumed to be coordinate-sorted and indexed)
    INPUT_BAM="aligned_reads_sorted.bam"
    OUTPUT_DEDUP_BAM="non_redundant_reads.bam"
    
    # Determine and keep non-redundant reads by removing PCR duplicates.
    # The -s flag removes duplicate reads from the output.
    # The input BAM file must be sorted by coordinate.
    # The output BAM file will contain only unique reads.
    samtools markdup -s "${INPUT_BAM}" "${OUTPUT_DEDUP_BAM}"
    
    # Index the deduplicated BAM file for downstream analysis
    samtools index "${OUTPUT_DEDUP_BAM}"
  4. 4

    DEseq2 package was used to search for differentially expressed genes (DEGs), which was defined with the following cutoff: adjusted p-value of ≤0.05 and fold change of ≤2/3 or ≥3/2.

    $ Bash example
    # Create dummy input files for demonstration purposes.
    # In a real scenario, 'counts.tsv' would be generated by an upstream quantification step
    # (e.g., featureCounts, Salmon, Kallisto) and 'coldata.tsv' would be provided by the user.
    # Reference datasets for DESeq2 are typically the experimental count matrix and sample metadata,
    # not a genomic reference assembly.
    
    # Example counts.tsv (replace with your actual count matrix)
    echo -e "gene\tsample1\tsample2\tsample3\tsample4" > counts.tsv
    echo -e "geneA\t100\t120\t50\t60" >> counts.tsv
    echo -e "geneB\t50\t60\t100\t110" >> counts.tsv
    echo -e "geneC\t200\t210\t220\t230" >> counts.tsv
    echo -e "geneD\t10\t12\t15\t18" >> counts.tsv
    
    # Example coldata.tsv (replace with your actual sample metadata)
    echo -e "sample\tcondition" > coldata.tsv
    echo -e "sample1\ttreated" >> coldata.tsv
    echo -e "sample2\ttreated" >> coldata.tsv
    echo -e "sample3\tcontrol" >> coldata.tsv
    echo -e "sample4\tcontrol" >> coldata.tsv
    
    # R script for DESeq2 analysis
    # This script performs differential expression analysis using DESeq2
    # based on the specified cutoffs: adjusted p-value <= 0.05 and
    # fold change <= 2/3 or >= 3/2 (i.e., |log2FoldChange| >= log2(3/2)).
    cat << 'EOF' > run_deseq2.R
    # R script for DESeq2 analysis
    # Assumes 'counts.tsv' contains raw counts and 'coldata.tsv' contains sample metadata.
    
    # Load DESeq2 library
    # If DESeq2 is not installed, uncomment and run the following lines:
    # if (!requireNamespace("BiocManager", quietly = TRUE))
    #     install.packages("BiocManager")
    # BiocManager::install("DESeq2", update = FALSE, ask = FALSE)
    library(DESeq2)
    
    # Load count data
    # Assuming counts.tsv has gene IDs as row names and sample IDs as column names
    countData <- as.matrix(read.csv("counts.tsv", sep="\t", row.names=1))
    
    # Load sample metadata
    # Assuming coldata.tsv has sample IDs as row names and 'condition' as a column
    colData <- read.csv("coldata.tsv", sep="\t", row.names=1)
    colData$condition <- factor(colData$condition, levels = c("control", "treated")) # Ensure condition is a factor and set reference level
    
    # Ensure sample names match between countData and colData
    if (!all(rownames(colData) %in% colnames(countData))) {
      stop("Sample names in colData do not match column names in countData.")
    }
    countData <- countData[, rownames(colData)] # Reorder countData columns to match colData rows
    if (!all(rownames(colData) == colnames(countData))) {
      stop("Sample order mismatch after reordering.")
    }
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromMatrix(countData = countData,
                                  colData = colData,
                                  design = ~ condition)
    
    # Pre-filtering: remove genes with very low counts
    # This helps to reduce the memory footprint and increase the speed of DESeq2.
    # A common threshold is to keep genes with at least 10 total counts across all samples.
    keep <- rowSums(counts(dds)) >= 10
    dds <- dds[keep,]
    
    # Run DESeq2 analysis
    dds <- DESeq(dds)
    
    # Extract results
    # The contrast specifies the comparison: 'treated' vs 'control'
    res <- results(dds, contrast=c("condition", "treated", "control"))
    
    # Define cutoff values
    padj_cutoff <- 0.05
    log2fc_cutoff <- log2(3/2) # log2(1.5) approximately 0.585
    
    # Filter for differentially expressed genes
    # Genes with adjusted p-value <= 0.05 AND (log2FoldChange <= -log2(3/2) OR log2FoldChange >= log2(3/2))
    degs <- subset(res, padj <= padj_cutoff & (log2FoldChange >= log2fc_cutoff | log2FoldChange <= -log2fc_cutoff))
    
    # Sort by adjusted p-value
    degs <- degs[order(degs$padj),]
    
    # Save results
    write.csv(as.data.frame(degs), file="differentially_expressed_genes.csv", row.names=TRUE)
    write.csv(as.data.frame(res), file="deseq2_all_results.csv", row.names=TRUE)
    
    # Print summary
    cat("DESeq2 analysis completed.\n")
    cat("Number of differentially expressed genes found (padj <= ", padj_cutoff, " and |log2FC| >= ", round(log2fc_cutoff, 3), "): ", nrow(degs), "\n", sep="")
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R
    
    # Optional: Clean up dummy input files and the R script
    # rm counts.tsv coldata.tsv run_deseq2.R
  5. 5

    processed data files format and content: BED and BigWig

    UCSC tools vLatest available binaries (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # UCSC tools are typically downloaded as pre-compiled binaries.
    # The following commands demonstrate how to download them for a Linux x86_64 system.
    # Adjust the URL for other operating systems (e.g., macOsX.x86_64).
    
    # --- Installation of bedGraphToBigWig ---
    # wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig
    # chmod +x bedGraphToBigWig
    
    # --- Installation of bedToBigBed ---
    # wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
    # chmod +x bedToBigBed
    
    # --- Download chromosome sizes for hg38 (latest assembly as placeholder) ---
    # This file is required for both bedGraphToBigWig and bedToBigBed.
    # wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes -O hg38.chrom.sizes
    
    # --- Example: Convert a bedGraph file to BigWig ---
    # Create a dummy bedGraph input file for demonstration
    # echo -e "chr1\t100\t200\t0.5\nchr1\t250\t300\t1.2\nchr2\t50\t150\t0.8" > input.bedGraph
    
    # Execute bedGraphToBigWig
    ./bedGraphToBigWig input.bedGraph hg38.chrom.sizes output.bigWig
    
    # --- Example: Convert a BED file to BigBed ---
    # Create a dummy BED input file (e.g., bed6 format) for demonstration
    # echo -e "chr1\t100\t200\tfeatureA\t0\t+\nchr1\t250\t300\tfeatureB\t0\t-" > input.bed
    
    # Execute bedToBigBed (assuming bed6 format for input.bed)
    ./bedToBigBed -type=bed6 input.bed hg38.chrom.sizes output.bigBed
    

Tools Used

Raw Source Text
Library strategy: GRO-seq
GRO-seq reads were mapped on the human reference genome using the Bowtie2 local model
Non-redundant reads were determined and kept for downstream analysis by using Samtools
DEseq2 package was used to search for differentially expressed genes (DEGs), which was defined with the following cutoff: adjusted p-value of ≤0.05 and fold change of ≤2/3 or ≥3/2.
Genome_build: GRCh37 (hg19)
processed data files format and content: BED and BigWig
← Back to Analysis