GSE120110 Processing Pipeline


Publication

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription.

Cell (2019) — PMID 31251911

Dataset

GSE120110

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Library strategy: HiC-Seq

    Hi-C (version not specified; inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install Juicer (example, actual installation might vary)
    # git clone https://github.com/aidenlab/juicer.git
    # cd juicer
    # # Follow Juicer installation instructions, which typically involve building from source or using a pre-compiled version.
    # # Ensure Java is installed and BWA is in your system's PATH.
    
    # Define Juicer installation directory
    JUICER_DIR="/path/to/juicer" # Replace with actual path to Juicer installation
    
    # Define genome ID and restriction enzyme name (must match Juicer's reference setup)
    # Juicer requires pre-built reference files (fasta with BWA index, chrom.sizes, restriction sites);
    # it looks for them under ${JUICER_DIR}/references and ${JUICER_DIR}/restriction_sites.
    GENOME_ID="hg19" # The source text lists Genome_build: GRCh37 (hg19)
    ENZYME_NAME="MboI" # Assumption: the source text does not name the restriction enzyme
    
    # Input FASTQ files (replace with actual file paths)
    INPUT_FASTQ_R1="sample_R1.fastq.gz"
    INPUT_FASTQ_R2="sample_R2.fastq.gz"
    
    # Juicer working (top-level) directory; outputs are written here
    OUTPUT_DIR="hic_processing_output"
    
    # Juicer does not take FASTQ paths on the command line. It reads paired files
    # named *_R1*.fastq* and *_R2*.fastq* from the fastq/ subdirectory of the working directory.
    mkdir -p "${OUTPUT_DIR}/fastq"
    ln -s "$(realpath "${INPUT_FASTQ_R1}")" "$(realpath "${INPUT_FASTQ_R2}")" "${OUTPUT_DIR}/fastq/"
    
    # Run Juicer pipeline
    # This command orchestrates alignment (using BWA), sorting, merging, and contact map generation.
    # -g: Genome ID (e.g., hg19)
    # -s: Restriction enzyme name (e.g., MboI)
    # -d: Working directory that contains fastq/
    # -D: Path to the Juicer installation directory
    # -t: Number of threads (check `juicer.sh -h`; not every Juicer variant supports it)
    # Note: in Juicer, -q selects a cluster queue; it does not take FASTQ files.
    "${JUICER_DIR}/scripts/juicer.sh" -g "${GENOME_ID}" -s "${ENZYME_NAME}" -d "$(realpath "${OUTPUT_DIR}")" -D "${JUICER_DIR}" -t 16
  2.

    The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter setting: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" –C 1.

    ChIA-PET2 (version not specified; cited as Li et al., 2017a)
    $ Bash example
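    # Install ChIA-PET2 (example, adjust based on actual installation method)
    # git clone https://github.com/GuipengLi/ChIA-PET2.git
    # cd ChIA-PET2
    # # Follow the repository's installation instructions and make sure its dependencies
    # # (e.g., BWA, samtools, bedtools, MACS2, R) are available
    
    # ChIA-PET2 starts from raw paired-end FASTQ files and performs linker trimming and alignment itself.
    # All paths below are placeholders.
    BWA_INDEX="path/to/bwa_index/hg19.fa"    # BWA index prefix (-g)
    CHROM_SIZES="path/to/hg19.chrom.sizes"   # bedtools genome file (-b)
    INPUT_FASTQ_R1="sample_R1.fastq.gz"
    INPUT_FASTQ_R2="sample_R2.fastq.gz"
    OUTPUT_DIR="chiapet2_output"
    OUTPUT_PREFIX="chiapet2_interactions_output"
    
    # Run ChIA-PET2 for quality control and identification of chromatin interactions
    # Note: The original description had "–C 1" which is corrected to "-C 1".
    # Option names follow the ChIA-PET2 README; confirm them with `ChIA-PET2 -h` for your version.
    ChIA-PET2 \
      -g "${BWA_INDEX}" \
      -b "${CHROM_SIZES}" \
      -f "${INPUT_FASTQ_R1}" \
      -r "${INPUT_FASTQ_R2}" \
      -o "${OUTPUT_DIR}" \
      -n "${OUTPUT_PREFIX}" \
      -A ACGCGATATCTTATC \
      -B AGTCAGATAAGATAT \
      -s 1 \
      -m 1 \
      -t 4 \
      -k 2 \
      -e 1 \
      -l 15 \
      -S 500 \
      -M "-q 0.05" \
      -C 1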
  3.

    PCA was then applied to the 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used to identify A or B compartments (Heinz et al., 2010).

    PCA (tool and version not specified; inferred from description)
    $ Bash example
    # Install R if not already installed
    # conda install -c r r-base r-essentials
    
    # Assume input matrix is 'sample_40kb_normalized.matrix'
    # This matrix should be a square (dense), normalized interaction matrix (e.g., ICE, VC, KR normalized)
    # where rows/columns correspond to genomic bins.
    # HiC-Pro writes matrices in a sparse three-column format, so convert to a dense matrix first.
    # The first column is assumed to be bin identifiers (e.g., "chr1:0-40000").
    
    # Create an R script to perform PCA on the interaction matrix
    cat << 'EOF' > run_hic_pca.R
    # Read the normalized interaction matrix
    # Assuming the matrix is tab-separated, has no header, and the first column contains bin names.
    # Adjust 'sep', 'header', and 'row.names' as needed based on the actual matrix format.
    # Example: chr1:0-40000 \t 0.1 \t 0.2 \t ...
    interaction_matrix <- as.matrix(read.table("sample_40kb_normalized.matrix", sep="\t", header=FALSE, row.names=1))
    
    # Label the columns with the bin names. cor() names its output after the input columns,
    # so without this the PCA rows would be called V2, V3, ... instead of the bin IDs.
    colnames(interaction_matrix) <- rownames(interaction_matrix)
    
    # For Hi-C compartment analysis, PCA is typically performed on the Pearson correlation matrix
    # of the normalized interaction matrix.
    # Handle potential NaNs or infinite values that might arise from normalization or empty bins.
    # 'use = "pairwise.complete.obs"' handles NA values by using all available observations for each pair.
    correlation_matrix <- cor(interaction_matrix, use = "pairwise.complete.obs", method = "pearson")
    
    # Replace any remaining NaNs in the correlation matrix with 0.
    # This can happen if a bin has no valid interactions with any other bin.
    correlation_matrix[is.na(correlation_matrix)] <- 0
    
    # Perform PCA on the correlation matrix
    # 'center = TRUE' is standard.
    # 'scale. = FALSE' because a correlation matrix is already scaled (values between -1 and 1).
    pca_result <- prcomp(correlation_matrix, center = TRUE, scale. = FALSE)
    
    # Extract PC1 values
    pc1_values <- pca_result$x[,1]
    
    # Write PC1 values to a file
    # The row names of pca_result$x correspond to the bin names from the input matrix.
    # Output format: Bin_ID \t PC1_Value
    write.table(data.frame(Bin=rownames(pca_result$x), PC1=pc1_values),
                "pc1_values.txt", sep="\t", quote=FALSE, row.names=FALSE)
    
    # Optional: Save PCA results object for further analysis
    # saveRDS(pca_result, "pca_result.rds")
    
    # Optional: Generate a scree plot to visualize explained variance
    # pdf("pca_scree_plot.pdf")
    # plot(pca_result, type = "l", main = "Scree Plot of PCA on Hi-C Correlation Matrix")
    # dev.off()
    EOF
    
    # Execute the R script
    Rscript run_hic_pca.R
    
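    # The output 'pc1_values.txt' will contain the PC1 values for each genomic bin.
    # This file can then be used to identify A/B compartments based on continuous positive/negative PC1 values.
    # The sign of PC1 is arbitrary, so orient it (e.g., against gene density) before labeling regions A or B.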
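The source describes compartments as regions of continuous positive or negative PC1 values. A minimal sketch of that merging step is shown here, assuming the two-column `pc1_values.txt` layout written by the R script above and bin IDs of the form `chr:start-end`. The toy values and the output name `compartments.bed` are hypothetical. Runs are labeled by sign only, because mapping a sign to A or B depends on how PC1 is oriented.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy PC1 table (hypothetical values) in the layout: Bin <tab> PC1
printf 'Bin\tPC1\n' > pc1_values.txt
printf 'chr1:0-40000\t0.8\nchr1:40000-80000\t0.5\nchr1:80000-120000\t-0.3\n' >> pc1_values.txt
printf 'chr1:120000-160000\t-0.6\nchr2:0-40000\t-0.2\nchr2:40000-80000\t0.4\n' >> pc1_values.txt

# Merge adjacent bins on the same chromosome that share the same PC1 sign.
awk -F'\t' -v OFS='\t' '
  function emit() { if (chr != "") print chr, run_start, run_end, label }
  NR == 1 { next }                               # skip the header row
  $2 + 0 == 0 { emit(); chr = ""; next }         # bins with PC1 == 0 break a run
  {
    split($1, a, /[-:]/)                         # a[1]=chr, a[2]=start, a[3]=end
    sign = ($2 + 0 > 0) ? "positive" : "negative"
    if (a[1] == chr && sign == label && a[2] == run_end) { run_end = a[3]; next }
    emit()
    chr = a[1]; run_start = a[2]; run_end = a[3]; label = sign
  }
  END { emit() }
' pc1_values.txt > compartments.bed

cat compartments.bed
```

With the toy input, the first two chr1 bins collapse into one positive region and the next two into one negative region, and each chr2 bin stays separate because the sign flips between them.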
  4.

    The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015).

    HiCPlotter (version not specified; cited as Akdemir and Chin, 2015; inferred with models/gemini-2.5-flash)
    $ Bash example
    # Installation (example, adjust as needed)
    # git clone https://github.com/kcakdemir/HiCPlotter.git
    # cd HiCPlotter
    # HiCPlotter is a single Python script; ensure its dependencies (numpy, scipy, matplotlib) are installed.
    
    # Example command for visualizing an interaction matrix
    # Replace 'input_interaction_matrix.txt' with your actual matrix file
    # Replace 'output_hic_plot' with your desired output file prefix
    # -n: sample name, -chr: chromosome to plot, -r: bin size in bp (40 kb per the source text)
    # For HiC-Pro sparse matrices, also pass -tri 1 and the matching bin file via -bed.
    # Check `python HiCPlotter.py -h` for the options available in your version.
    python HiCPlotter.py -f input_interaction_matrix.txt -n Sample -chr chr1 -r 40000 -o output_hic_plot
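The source states that the interaction matrix came from HiC-Pro, which stores matrices as sparse triplets (bin_i, bin_j, count) alongside a bin table (chr, start, end, bin index). The sketch below shows one way to expand a single chromosome into a plain dense matrix for plotting or inspection. It assumes that bin indices are contiguous within a chromosome. The toy file names and counts are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy bin table and sparse matrix in HiC-Pro-style layout (hypothetical values).
printf 'chr1\t0\t40000\t1\nchr1\t40000\t80000\t2\nchr2\t0\t40000\t3\n' > toy_40000_abs.bed
printf '1\t1\t10\n1\t2\t4\n2\t2\t6\n2\t3\t7\n' > toy_40000.matrix

awk -v chrom="chr1" -v OFS='\t' '
  # First file (bin table): remember the first and last bin index of the chromosome.
  NR == FNR {
    if ($1 == chrom) { if (n == 0) first = $4; last = $4; n++ }
    next
  }
  # Second file (sparse matrix): keep entries whose two bins both lie on the chromosome.
  $1 >= first && $1 <= last && $2 >= first && $2 <= last {
    m[$1, $2] = $3
    m[$2, $1] = $3                               # mirror the upper triangle
  }
  END {
    if (n == 0) { print "chromosome not found in bin table" > "/dev/stderr"; exit 1 }
    for (i = first; i <= last; i++) {
      row = ""
      for (j = first; j <= last; j++) {
        v = ((i, j) in m) ? m[i, j] : 0          # missing pairs are zero contacts
        row = row (j > first ? OFS : "") v
      }
      print row
    }
  }
' toy_40000_abs.bed toy_40000.matrix > toy_chr1_dense.tsv

cat toy_chr1_dense.tsv
```

The contact between bins 2 and 3 is dropped because bin 3 lies on chr2. This approach holds the whole chromosome in memory, so it suits inspection of single chromosomes more than genome-wide conversion.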
  5.

    High-confidence interactions were defined as those with >3 PET counts and q-value < 0.05 for downstream analysis.

    awk (standard Unix utility, no version specified; inferred with models/gemini-2.5-flash)
    $ Bash example
    # This script filters a tab-separated file of interactions or peaks
    # based on PET counts (read support) and q-value thresholds.
    # It assumes PET counts are in the 5th column and q-value in the 6th column.
    # Adjust column numbers ($5, $6) if your input file structure differs.
    # Replace 'input_interactions.tsv' with the actual input file name.
    # Replace 'high_confident_interactions.tsv' with your desired output file name.
    
    # Example input file structure (adjust column indices as needed):
    # chr\tstart\tend\tname\tPET_counts\tq_value\t...
    
    awk -F'\t' '$5 > 3 && $6 < 0.05' input_interactions.tsv > high_confident_interactions.tsv
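The filter above depends on fixed column positions, which the source does not document. A variant that looks columns up by header name may be more robust when the input has a header row. This is a sketch only: the column names `PET_counts` and `q_value` and the toy rows are hypothetical and need to be matched to the real file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy interaction table with a header row (hypothetical column names and values).
printf 'chr\tstart\tend\tname\tPET_counts\tq_value\n' > input_interactions.tsv
printf 'chr1\t100\t200\ta\t5\t0.01\nchr1\t300\t400\tb\t3\t0.001\n' >> input_interactions.tsv
printf 'chr2\t100\t200\tc\t10\t0.2\nchr2\t500\t600\td\t4\t1e-5\n' >> input_interactions.tsv

# Keep rows with more than 3 PETs and q-value below 0.05, locating columns by name.
awk -F'\t' -v pet_col='PET_counts' -v q_col='q_value' '
  NR == 1 {
    for (i = 1; i <= NF; i++) idx[$i] = i        # map header name -> column number
    if (!(pet_col in idx) || !(q_col in idx)) {
      print "required column missing from header" > "/dev/stderr"
      exit 1
    }
    print                                        # keep the header in the output
    next
  }
  $(idx[pet_col]) + 0 > 3 && $(idx[q_col]) + 0 < 0.05
' input_interactions.tsv > high_confident_interactions.tsv

cat high_confident_interactions.tsv
```

In the toy data, row `b` is excluded because the threshold is strictly greater than 3 PETs, and row `c` is excluded on q-value, leaving the header plus rows `a` and `d`.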

Raw Source Text
Library strategy: HiC-Seq
The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter setting: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" –C 1.
PCA analysis is then applied to 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used for the identification of A or B compartments (Heinz et al., 2010).
The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015).
High confident interactions were defined as those with >3 PET counts and q-value < 0.05 for downstream analysis.
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: MICC