GSE120110 Processing Pipeline

Publication

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription.

Cell (2019) — PMID 31251911

Dataset

GSE120110

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Library strategy: HiC-Seq

Hi-C (version not specified; inferred with models/gemini-2.5-flash)

$ Bash example

# Install Juicer (example, actual installation might vary)
# git clone https://github.com/aidenlab/juicer.git
# cd juicer
# # Follow Juicer installation instructions, which typically involve building from source or using a pre-compiled version.
# # Ensure Java is installed and BWA is in your system's PATH.

# Define Juicer installation directory
JUICER_DIR="/path/to/juicer" # Replace with actual path to Juicer installation

# Define genome ID and restriction enzyme name (must match Juicer's reference setup)
# Juicer requires pre-built reference files (fasta, chrom.sizes, restriction sites)
# in a specific directory structure (e.g., ${JUICER_DIR}/references/${GENOME_ID}/${GENOME_ID}.fasta)
GENOME_ID="hg19" # The dataset record lists Genome_build: GRCh37 (hg19)
ENZYME_NAME="MboI" # Assumption: the source text does not name the restriction enzyme

# Input FASTQ files (replace with actual file paths)
INPUT_FASTQ_R1="sample_R1.fastq.gz"
INPUT_FASTQ_R2="sample_R2.fastq.gz"

# Working directory. Juicer does not take FASTQ paths on the command line:
# it reads them from a "fastq" subdirectory of the working directory and
# expects "_R1" / "_R2" in the file names.
OUTPUT_DIR="hic_processing_output"
mkdir -p "${OUTPUT_DIR}/fastq"
ln -s "$(realpath "${INPUT_FASTQ_R1}")" "${OUTPUT_DIR}/fastq/"
ln -s "$(realpath "${INPUT_FASTQ_R2}")" "${OUTPUT_DIR}/fastq/"

# Run Juicer pipeline
# This command orchestrates alignment (using BWA), sorting, merging, and contact map generation.
# -g: Genome ID (e.g., hg19)
# -s: Restriction enzyme name (e.g., MboI)
# -D: Path to the Juicer installation directory
# -d: Top-level working directory (must contain the fastq/ subdirectory)
# -t: Number of threads
# Flags differ between Juicer versions (in cluster versions -q selects a job queue);
# confirm them with "juicer.sh -h" before running.
"${JUICER_DIR}/scripts/juicer.sh" -g "${GENOME_ID}" -s "${ENZYME_NAME}" -D "${JUICER_DIR}" -d "$(realpath "${OUTPUT_DIR}")" -t 16

The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter setting: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" -C 1.

ChIA-PET2 (version not specified)

$ Bash example

# Install ChIA-PET2 (example, adjust based on actual installation method)
# git clone https://github.com/GuipengLi/ChIA-PET2.git
# cd ChIA-PET2
# # Follow the README to build the tool and install its dependencies

# ChIA-PET2 starts from raw paired-end FASTQ files ("-s 1" begins at linker trimming),
# not from an aligned BAM file. All paths below are placeholders.
GENOME_INDEX="/path/to/bwa_index/hg19.fa"   # BWA index (dataset record: GRCh37/hg19)
CHROM_SIZES="/path/to/hg19.chrom.sizes"     # Chromosome sizes file for bedtools
FASTQ_R1="sample_R1.fastq.gz"
FASTQ_R2="sample_R2.fastq.gz"
OUTPUT_DIR="chiapet2_output"
OUTPUT_PREFIX="chiapet2_interactions"

# Run ChIA-PET2 for quality control and identification of chromatin interactions
# Note: the last option is garbled in the original description and is read here as "-C 1".
ChIA-PET2 \
  -g "${GENOME_INDEX}" \
  -b "${CHROM_SIZES}" \
  -f "${FASTQ_R1}" \
  -r "${FASTQ_R2}" \
  -o "${OUTPUT_DIR}" \
  -n "${OUTPUT_PREFIX}" \
  -A ACGCGATATCTTATC \
  -B AGTCAGATAAGATAT \
  -s 1 \
  -m 1 \
  -t 4 \
  -k 2 \
  -e 1 \
  -l 15 \
  -S 500 \
  -M "-q 0.05" \
  -C 1

PCA was then applied to the 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used to identify A or B compartments (Heinz et al., 2010).
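
HiC-Pro usually writes contact maps as sparse triplets (bin_i, bin_j, value) together with a BED file that maps bin indices to coordinates, while a PCA on the full matrix needs a dense square layout. The following is a minimal sketch of that conversion with awk. It assumes bins numbered 1..n in the BED file, an upper-triangle triplet file, and a matrix small enough to hold in memory (in practice, one chromosome at a time). The file names and toy values are hypothetical and do not come from GSE120110.

```shell
# Hypothetical toy inputs in the sparse layout described above.
printf 'chr1\t0\t40000\t1\nchr1\t40000\t80000\t2\nchr1\t80000\t120000\t3\n' > toy_40000_abs.bed
printf '1\t1\t5\n1\t2\t2\n2\t3\t4\n' > toy_40000.matrix

# Convert sparse triplets to a dense, tab-separated matrix.
# Output rows: bin name (chr:start-end) followed by one value per bin.
awk -v OFS='\t' '
  # First file (BED): remember the name of each bin and the largest bin index.
  NR == FNR {
    name[$4] = $1 ":" $2 "-" $3
    if ($4 + 0 > n) n = $4 + 0
    next
  }
  # Second file (triplets): store each value symmetrically.
  { m[$1, $2] = $3; m[$2, $1] = $3 }
  END {
    for (i = 1; i <= n; i++) {
      line = name[i]
      for (j = 1; j <= n; j++)
        line = line OFS (((i, j) in m) ? m[i, j] : 0)   # missing pairs become 0
      print line
    }
  }
' toy_40000_abs.bed toy_40000.matrix > toy_40kb_dense.matrix
```

The result has no header and carries the bin name in the first column. A genome-wide 40-kb matrix is far too large for this approach, so subsetting the triplets per chromosome (and renumbering bins from 1) would be needed first.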

PCA (version not specified; inferred from description)

$ Bash example

# Install R if not already installed
# conda install -c conda-forge r-base

# Assume input matrix is 'sample_40kb_normalized.matrix'
# This matrix should be a dense, square, normalized interaction matrix (e.g., ICE, VC, KR normalized)
# where rows/columns correspond to genomic bins.
# HiC-Pro writes sparse triplet matrices (bin_i, bin_j, value) plus a BED file of bins,
# so its output must be converted to a dense matrix before this step.
# The first column is assumed to be bin identifiers (e.g., "chr1:0-40000").

# Create an R script to perform PCA on the interaction matrix
cat << 'EOF' > run_hic_pca.R
# Read the normalized interaction matrix
# Assuming the matrix is tab-separated, has no header, and the first column contains bin names.
# Adjust 'sep', 'header', and 'row.names' as needed based on the actual matrix format.
# Example: chr1:10000-50000 \t 0.1 \t 0.2 \t ...
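interaction_matrix <- as.matrix(read.table("sample_40kb_normalized.matrix", sep="\t", header=FALSE, row.names=1))
# The matrix is square, so label the columns with the same bin names as the rows.
# Without this, the PCA output rows would be named V2, V3, ... instead of bin IDs.
colnames(interaction_matrix) <- rownames(interaction_matrix)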

# For Hi-C compartment analysis, PCA is typically performed on the Pearson correlation matrix
# of the normalized interaction matrix.
# Handle potential NaNs or infinite values that might arise from normalization or empty bins.
# 'use = "pairwise.complete.obs"' handles NA values by using all available observations for each pair.
correlation_matrix <- cor(interaction_matrix, use = "pairwise.complete.obs", method = "pearson")

# Replace any remaining NaNs in the correlation matrix with 0.
# This can happen if a bin has no valid interactions with any other bin.
correlation_matrix[is.na(correlation_matrix)] <- 0

# Perform PCA on the correlation matrix
# 'center = TRUE' is standard.
# 'scale. = FALSE' because a correlation matrix is already scaled (values between -1 and 1).
pca_result <- prcomp(correlation_matrix, center = TRUE, scale. = FALSE)

# Extract PC1 values
pc1_values <- pca_result$x[,1]

# Write PC1 values to a file
# The row names of pca_result$x correspond to the bin names from the input matrix.
# Output format: Bin_ID \t PC1_Value
write.table(data.frame(Bin=rownames(pca_result$x), PC1=pc1_values),
            "pc1_values.txt", sep="\t", quote=FALSE, row.names=FALSE)

# Optional: Save PCA results object for further analysis
# saveRDS(pca_result, "pca_result.rds")

# Optional: Generate a scree plot to visualize explained variance
# pdf("pca_scree_plot.pdf")
# plot(pca_result, type = "l", main = "Scree Plot of PCA on Hi-C Correlation Matrix")
# dev.off()
EOF

# Execute the R script
Rscript run_hic_pca.R

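# The output 'pc1_values.txt' will contain the PC1 values for each genomic bin.
# This file can then be used to identify A/B compartments based on continuous positive/negative PC1 values.
# The sign of PC1 is arbitrary, so orient it (e.g., by gene density) before labeling regions A or B.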
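
The "regions of continuous positive or negative PC1 values" mentioned in the description can be obtained by merging adjacent bins that share a sign. The sketch below does this with awk under several assumptions: the input has a header line followed by `Bin<TAB>PC1` rows, bin names follow the `chr:start-end` pattern, bins are sorted by position, there are no NA values, and chromosome names contain no ':' or '-'. Segments are labeled `positive` or `negative` rather than A or B because the PC1 sign still has to be oriented. The file names and values are hypothetical toy data.

```shell
# Hypothetical toy PC1 table (header plus one row per 40-kb bin).
printf 'Bin\tPC1\n' > toy_pc1_values.txt
printf 'chr1:0-40000\t0.8\nchr1:40000-80000\t0.3\n' >> toy_pc1_values.txt
printf 'chr1:80000-120000\t-0.5\nchr1:120000-160000\t-0.1\n' >> toy_pc1_values.txt
printf 'chr2:0-40000\t-0.2\n' >> toy_pc1_values.txt

# Merge consecutive same-sign bins into segments: chr, start, end, sign.
awk -F'\t' -v OFS='\t' '
  function emit() { if (chr != "") print chr, seg_start, seg_end, sign }
  NR == 1 { next }                          # skip the header line
  {
    split($1, a, /[:-]/)                    # a[1]=chr, a[2]=start, a[3]=end
    s = ($2 >= 0) ? "positive" : "negative"
    # Extend the open segment when chromosome, sign, and adjacency all match.
    if (a[1] == chr && s == sign && a[2] == seg_end) { seg_end = a[3]; next }
    emit()                                  # otherwise close it and start a new one
    chr = a[1]; seg_start = a[2]; seg_end = a[3]; sign = s
  }
  END { emit() }
' toy_pc1_values.txt > toy_pc1_segments.bed
```

On real data the same command would read `pc1_values.txt`. Bins with a PC1 of exactly zero are counted as positive here, which is an arbitrary choice.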

The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015).

HiCPlotter (version not specified; inferred with models/gemini-2.5-flash)

$ Bash example

# Installation (example, adjust as needed)
# git clone https://github.com/kcakdemir/HiCPlotter.git
# cd HiCPlotter
# # HiCPlotter is a standalone script; make sure its Python dependencies are installed

# Example command for visualizing an interaction matrix
# Replace the matrix and BED file names with your actual HiC-Pro output files
# Replace 'output_hic_plot' with your desired output file prefix
# -f: matrix file, -n: sample name, -chr: chromosome to plot, -r: resolution in kb
# -tri 1 and -bed: read a sparse triplet matrix (HiC-Pro style) with its bin BED file
# Option names should be checked against "python HiCPlotter.py -h" for your version
python HiCPlotter.py -f sample_40000_iced.matrix -tri 1 -bed sample_40000_abs.bed -n Sample -chr chr1 -r 40 -o output_hic_plot

High-confidence interactions, defined as those with >3 PET counts and a q-value < 0.05, were used for downstream analysis.

awk (standard Unix utility; inferred with models/gemini-2.5-flash)

$ Bash example

# This script filters a tab-separated file of interactions or peaks
# based on PET counts (read support) and q-value thresholds.
# It assumes PET counts are in the 5th column and q-value in the 6th column.
# Adjust column numbers ($5, $6) if your input file structure differs.
# Replace 'input_interactions.tsv' with the actual input file name.
# Replace 'high_confident_interactions.tsv' with your desired output file name.

# Example input file structure (adjust column indices as needed):
# chr\tstart\tend\tname\tPET_counts\tq_value\t...

awk -F'\t' '$5 > 3 && $6 < 0.05' input_interactions.tsv > high_confident_interactions.tsv
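
Column positions differ between interaction file formats, so a variant that looks columns up by header name may be less error-prone than fixed indices. The sketch below assumes a tab-separated file whose first line is a header. The column names `count` and `fdr` are placeholders passed in as variables, not confirmed names from the ChIA-PET2 or MICC output, and the toy rows are invented for illustration.

```shell
# Hypothetical toy interaction table with a header line.
printf 'chr1\tstart1\tend1\tchr2\tstart2\tend2\tcount\tfdr\n' > toy_interactions.tsv
printf 'chr1\t100\t200\tchr1\t5000\t5100\t4\t0.01\n' >> toy_interactions.tsv
printf 'chr1\t100\t200\tchr1\t9000\t9100\t3\t0.001\n' >> toy_interactions.tsv
printf 'chr2\t300\t400\tchr2\t7000\t7100\t10\t0.2\n' >> toy_interactions.tsv

# Keep the header plus rows with PET count > 3 and q-value < 0.05.
awk -F'\t' -v OFS='\t' -v count_col="count" -v q_col="fdr" '
  NR == 1 {
    # Locate the two columns by name; stop if either is missing.
    for (i = 1; i <= NF; i++) {
      if ($i == count_col) c = i
      if ($i == q_col) q = i
    }
    if (!c || !q) {
      print "missing column: " count_col " or " q_col > "/dev/stderr"
      exit 1
    }
    print
    next
  }
  $c > 3 && $q < 0.05
' toy_interactions.tsv > toy_high_confidence.tsv
```

The second toy row is dropped because its count equals 3 and the threshold is strictly greater than 3. The third is dropped on its q-value.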

Tools Used

Hi-C

Raw Source Text

Library strategy: HiC-Seq
The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter setting: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" -C 1.
PCA analysis is then applied to 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used for the identification of A or B compartments (Heinz et al., 2010).
The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015).
High confident interactions were defined as those with >3 PET counts and q-value < 0.05 for downstream analysis.
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: MICC
