GSE120110 Processing Pipeline
Publication
Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription. Cell (2019), PMID 31251911
Dataset
GSE120110: Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription
Processing Steps
1
Library strategy: HiC-Seq
$ Bash example
# Illustrative only: the source gives just the library strategy for this step.
# Juicer is one possible Hi-C pipeline; the source itself cites HiC-Pro for matrix generation (step 3).

# Install Juicer (example; actual installation may vary)
# git clone https://github.com/aidenlab/juicer.git
# Follow the Juicer installation instructions. Java and BWA must be on your PATH.

# Juicer installation directory
JUICER_DIR="/path/to/juicer"   # replace with the actual path

# Genome ID and restriction enzyme name (must match Juicer's reference setup).
# Juicer expects pre-built reference files (FASTA, chrom.sizes, restriction sites),
# e.g. ${JUICER_DIR}/references/${GENOME_ID}/${GENOME_ID}.fasta
GENOME_ID="hg19"      # the source lists GRCh37 (hg19) as the genome build
ENZYME_NAME="MboI"    # placeholder; the source does not name the enzyme

# Input FASTQ files (replace with actual file paths)
INPUT_FASTQ_R1="sample_R1.fastq.gz"
INPUT_FASTQ_R2="sample_R2.fastq.gz"

# Juicer reads FASTQ files from a fastq/ subdirectory of its top-level
# working directory (names containing _R1 / _R2) and writes its output there too.
OUTPUT_DIR="hic_processing_output"
mkdir -p "${OUTPUT_DIR}/fastq"
ln -s "$(realpath "${INPUT_FASTQ_R1}")" "${OUTPUT_DIR}/fastq/"
ln -s "$(realpath "${INPUT_FASTQ_R2}")" "${OUTPUT_DIR}/fastq/"

# Run the Juicer pipeline (alignment with BWA, sorting, merging, contact map generation).
# -g: genome ID (e.g., hg19)
# -s: restriction enzyme name (e.g., MboI)
# -D: path to the Juicer installation directory
# -d: top-level working directory (contains fastq/)
# -t: number of threads
# Check `juicer.sh -h` for the options supported by your Juicer version.
"${JUICER_DIR}/scripts/juicer.sh" -g "${GENOME_ID}" -s "${ENZYME_NAME}" -D "${JUICER_DIR}" -d "$(realpath "${OUTPUT_DIR}")" -t 16
2
The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter setting: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" -C 1.
ChIA-PET2 (Li et al., 2017a)
$ Bash example
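# Install ChIA-PET2 (example; adjust to the actual installation method)
# git clone https://github.com/GuipengLi/ChIA-PET2.git
# cd ChIA-PET2
# Follow the installation instructions in the repository.

# Placeholder inputs (replace with actual paths). With -s 1 the run starts at
# linker trimming, so the inputs are paired-end FASTQ files rather than a BAM file.
BWA_INDEX="path/to/bwa_index/hg19.fa"      # BWA genome index (hg19 per the source)
CHROM_SIZES="path/to/hg19.chrom.sizes"    # genome file for bedtools
FASTQ_R1="sample_R1.fastq.gz"
FASTQ_R2="sample_R2.fastq.gz"
OUTPUT_DIR="chiapet2_output"
OUTPUT_PREFIX="chiapet2_interactions"

# Run ChIA-PET2 for quality control and identification of chromatin interactions.
# The analysis parameters (-A through -C) are taken from the source text, where the
# dash before "C 1" is garbled; it is read here as "-C 1".
# The input/output options (-g, -b, -f, -r, -o, -n) follow the ChIA-PET2 usage
# message; confirm them with `ChIA-PET2 -h` for your installed version.
ChIA-PET2 \
  -g "${BWA_INDEX}" \
  -b "${CHROM_SIZES}" \
  -f "${FASTQ_R1}" \
  -r "${FASTQ_R2}" \
  -o "${OUTPUT_DIR}" \
  -n "${OUTPUT_PREFIX}" \
  -A ACGCGATATCTTATC \
  -B AGTCAGATAAGATAT \
  -s 1 \
  -m 1 \
  -t 4 \
  -k 2 \
  -e 1 \
  -l 15 \
  -S 500 \
  -M "-q 0.05" \
  -C 1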
3
PCA was then applied to the 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used to identify A or B compartments (Heinz et al., 2010).
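HiC-Pro writes contact matrices as sparse triplets (bin_i, bin_j, value) next to a BED file of bin coordinates, whereas a PCA over bins needs a dense, symmetric matrix. The sketch below shows one way to do that conversion for a single chromosome. The file names (sample_40000_abs.bed, sample_40000_iced.matrix), the chromosome, and the three-bin demo data are hypothetical, and this is an illustration rather than the authors' code.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical three-bin demo input in HiC-Pro's layout.
workdir="$(mktemp -d)"
bed_file="${workdir}/sample_40000_abs.bed"        # chr, start, end, bin index
sparse_file="${workdir}/sample_40000_iced.matrix" # bin_i, bin_j, value
dense_file="${workdir}/sample_40kb_normalized.matrix"

printf 'chr1\t0\t40000\t1\nchr1\t40000\t80000\t2\nchr1\t80000\t120000\t3\n' > "${bed_file}"
printf '1\t1\t10\n1\t2\t4\n2\t3\t6\n3\t3\t8\n' > "${sparse_file}"

# Build a dense matrix for one chromosome: first column is the bin name,
# remaining columns are contact values; missing pairs are written as 0.
awk -F'\t' -v OFS='\t' -v chrom="chr1" '
  NR == FNR {                           # first file: bin annotation
    if ($1 == chrom) { n++; idx[$4] = n; name[n] = $1 ":" $2 "-" $3 }
    next
  }
  ($1 in idx) && ($2 in idx) {          # second file: sparse triplets
    m[idx[$1], idx[$2]] = $3
    m[idx[$2], idx[$1]] = $3            # mirror the stored triangle
  }
  END {
    for (i = 1; i <= n; i++) {
      row = name[i]
      for (j = 1; j <= n; j++) row = row OFS (((i, j) in m) ? m[i, j] : 0)
      print row
    }
  }
' "${bed_file}" "${sparse_file}" > "${dense_file}"

cat "${dense_file}"
```

The output has no header and carries the bin name in the first column, which is the layout a `read.table(..., header=FALSE, row.names=1)` call expects. For whole-genome matrices at 40 kb, a per-chromosome conversion like this keeps the dense file manageable.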
PCA (version not stated in the source)
$ Bash example
# Install R if not already installed
# conda install -c r r-base r-essentials

# Assume the input matrix is 'sample_40kb_normalized.matrix'.
# It should be a square, normalized interaction matrix (e.g., ICE, VC, or KR normalized)
# whose rows and columns correspond to genomic bins.
# The first column is assumed to hold bin identifiers (e.g., "chr1:10000-50000").

# Create an R script that performs PCA on the interaction matrix
cat << 'EOF' > run_hic_pca.R
# Read the normalized interaction matrix.
# Assumes a tab-separated file with no header and bin names in the first column.
# Adjust 'sep', 'header', and 'row.names' to match the actual matrix format.
# Example row: chr1:10000-50000 \t 0.1 \t 0.2 \t ...
interaction_matrix <- as.matrix(read.table("sample_40kb_normalized.matrix",
                                           sep="\t", header=FALSE, row.names=1))

# For Hi-C compartment analysis, PCA is typically performed on the Pearson
# correlation matrix of the normalized interaction matrix.
# 'use = "pairwise.complete.obs"' handles NA values by using all available
# observations for each pair of bins.
correlation_matrix <- cor(interaction_matrix, use = "pairwise.complete.obs", method = "pearson")

# Replace any remaining NaNs in the correlation matrix with 0.
# This can happen if a bin has no valid interactions with any other bin.
correlation_matrix[is.na(correlation_matrix)] <- 0

# Perform PCA on the correlation matrix.
# 'center = TRUE' is standard.
# 'scale. = FALSE' because a correlation matrix is already scaled (values between -1 and 1).
pca_result <- prcomp(correlation_matrix, center = TRUE, scale. = FALSE)

# Extract PC1 values
pc1_values <- pca_result$x[,1]

# Write PC1 values to a file.
# The row names of pca_result$x correspond to the bin names from the input matrix.
# Output format: Bin_ID \t PC1_Value
write.table(data.frame(Bin=rownames(pca_result$x), PC1=pc1_values),
            "pc1_values.txt", sep="\t", quote=FALSE, row.names=FALSE)

# Optional: save the PCA result object for further analysis
# saveRDS(pca_result, "pca_result.rds")

# Optional: generate a scree plot to visualize explained variance
# pdf("pca_scree_plot.pdf")
# plot(pca_result, type = "l", main = "Scree Plot of PCA on Hi-C Correlation Matrix")
# dev.off()
EOF

# Execute the R script
Rscript run_hic_pca.R

# The output 'pc1_values.txt' contains the PC1 value for each genomic bin.
# It can then be used to identify A/B compartments from runs of positive or negative PC1 values.
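The step text defines compartments as regions of continuous positive or negative PC1 values. A minimal sketch of that segmentation follows, assuming a "Bin<TAB>PC1" file with bin names of the form chr:start-end, sorted by position. The demo values are made up. The sign of PC1 is arbitrary, and the source does not say which sign was taken as A or B, so the output is labeled only "positive" or "negative".

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo input: header line, then one bin per row.
workdir="$(mktemp -d)"
pc1_file="${workdir}/pc1_values.txt"
segments_file="${workdir}/pc1_segments.bed"

printf 'Bin\tPC1\n' > "${pc1_file}"
printf 'chr1:0-40000\t0.8\nchr1:40000-80000\t0.5\n' >> "${pc1_file}"
printf 'chr1:80000-120000\t-0.3\nchr1:120000-160000\t-0.6\n' >> "${pc1_file}"
printf 'chr1:160000-200000\t0.2\nchr2:0-40000\t0.4\n' >> "${pc1_file}"

# Merge adjacent bins on the same chromosome that share the same PC1 sign.
# Bins with PC1 exactly 0 are grouped with the negative runs here;
# change that rule if zero-valued bins should be excluded instead.
awk -F'\t' -v OFS='\t' '
  function emit() { if (chr != "") print chr, seg_start, seg_end, sign }
  NR == 1 { next }                                  # skip the header line
  {
    split($1, loc, ":"); split(loc[2], pos, "-")    # loc[1]=chr, pos[1]=start, pos[2]=end
    s = ($2 > 0) ? "positive" : "negative"
    if (loc[1] == chr && s == sign && pos[1] == seg_end) {
      seg_end = pos[2]                              # extend the current run
    } else {
      emit()                                        # close the previous run
      chr = loc[1]; seg_start = pos[1]; seg_end = pos[2]; sign = s
    }
  }
  END { emit() }
' "${pc1_file}" > "${segments_file}"

cat "${segments_file}"
```

Each output row is chr, start, end, and PC1 sign. Assigning A or B to a sign usually needs an external signal such as gene density or active histone marks, which is outside what the source text specifies.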
4
The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015).
HiCPlotter (Akdemir and Chin, 2015)
$ Bash example
# Installation (example; adjust as needed). HiCPlotter is a standalone Python script.
# git clone https://github.com/kcakdemir/HiCPlotter.git
# cd HiCPlotter

# Example command for visualizing a HiC-Pro interaction matrix.
# File names, sample name, and chromosome are placeholders; the resolution
# matches the 40-kb matrix named in the source.
# -f: matrix file          -n: sample name       -chr: chromosome to plot
# -r: resolution in bp     -o: output file prefix
# -tri 1 with -bed: read HiC-Pro's three-column matrix plus its bin annotation
# See `python HiCPlotter.py -h` for the color-scale and range options in your version.
python HiCPlotter.py -f sample_40000_iced.matrix -bed sample_40000_abs.bed -tri 1 -n sample -chr chr1 -r 40000 -o output_hic_plot
5
High-confidence interactions were defined as those with >3 PET counts and a q-value < 0.05, and were used for downstream analysis.
$ Bash example
# This script filters a tab-separated file of interactions or peaks
# based on PET counts (read support) and q-value thresholds.
# It assumes PET counts are in the 5th column and q-values in the 6th column.
# Adjust the column numbers ($5, $6) if your input file structure differs.
# Replace 'input_interactions.tsv' with the actual input file name.
# Replace 'high_confident_interactions.tsv' with your desired output file name.
# Example input file structure (adjust column indices as needed):
# chr\tstart\tend\tname\tPET_counts\tq_value\t...
awk -F'\t' '$5 > 3 && $6 < 0.05' input_interactions.tsv > high_confident_interactions.tsv
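Since the column positions in the example above are an assumption, a variant that looks the columns up by header name may be less error-prone. The sketch below assumes a hypothetical header with columns named PET_counts and q_value and runs on a small made-up table. Note that ">3" is strict, so an interaction with exactly 3 PETs is dropped, as is one with a q-value of exactly 0.05.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo input with a header line.
workdir="$(mktemp -d)"
in_file="${workdir}/input_interactions.tsv"
out_file="${workdir}/high_confidence_interactions.tsv"

printf 'chr\tstart\tend\tname\tPET_counts\tq_value\n' > "${in_file}"
printf 'chr1\t100\t200\ti1\t3\t0.01\n'  >> "${in_file}"   # dropped: PET count not > 3
printf 'chr1\t300\t400\ti2\t4\t0.01\n'  >> "${in_file}"   # kept
printf 'chr2\t500\t600\ti3\t10\t0.05\n' >> "${in_file}"   # dropped: q-value not < 0.05
printf 'chr2\t700\t800\ti4\t7\t0.001\n' >> "${in_file}"   # kept

# Locate the two columns by name, then apply the thresholds from the step text.
awk -F'\t' -v OFS='\t' -v pet_col="PET_counts" -v q_col="q_value" '
  NR == 1 {
    for (i = 1; i <= NF; i++) col[$i] = i
    if (!(pet_col in col) || !(q_col in col)) {
      print "required column not found in header" > "/dev/stderr"
      exit 1
    }
    print                                   # keep the header in the output
    next
  }
  $(col[pet_col]) > 3 && $(col[q_col]) < 0.05
' "${in_file}" > "${out_file}"

cat "${out_file}"
```

The column names are passed in with `-v`, so the same command can be pointed at a differently labeled file without editing the awk program.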
Tools Used
Raw Source Text
Library strategy: HiC-Seq The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter setting: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" -C 1. PCA analysis is then applied to 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used for the identification of A or B compartments (Heinz et al., 2010). The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015). High confident interactions were defined as those with >3 PET counts and q-value < 0.05 for downstream analysis. Genome_build: GRCh37 (hg19) Supplementary_files_format_and_content: MICC