GSE120023 Processing Pipeline

Publication

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription.

Cell (2019) — PMID 31251911

Dataset

GSE120023

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription [HiC-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Library strategy: HiC-Seq

Hi-C (version not specified; inferred with models/gemini-2.5-flash)

$ Bash example

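# Install HiC-Pro following its documentation (typically a source build with
# "make configure" and "make install"). The conda recipe below is unverified;
# confirm the package exists in your channels before relying on it.
# conda create -n hicpro_env python=3.8
# conda activate hicpro_env
# conda install -c bioconda hic-pro

# Define variables for input, output, and reference files.
# All paths and the enzyme choice are placeholders; the source text does not state the enzyme.
INPUT_FASTQ_DIR="/path/to/your/fastq_files" # Directory containing subdirectories for each sample, e.g., /path/to/your/fastq_files/sample1/sample1_R1.fastq.gz
OUTPUT_DIR="hic_pro_results"
GENOME_NAME="hg19" # Bowtie2 index basename; the source reports Genome_build GRCh37 (hg19)
GENOME_SIZE="/path/to/reference/hg19.chrom.sizes" # format: chr_name<tab>chr_length
BOWTIE2_INDEX_DIR="/path/to/bowtie2/index_dir" # directory holding the ${GENOME_NAME}.*.bt2 files
GENOME_FRAGMENT="/path/to/reference/hg19_MboI.bed" # restriction fragments, e.g., from HiC-Pro's digest_genome.py utility
LIGATION_SITE="GATCGATC" # ligation junction for MboI/DpnII; adjust based on your enzyme

# Create a HiC-Pro configuration file.
# Key names follow HiC-Pro's template config-hicpro.txt; check them against your installed version.
# Comments sit on their own lines because an inline comment may be read as part of the value.
# This fragment is not a complete configuration: start from the template shipped with HiC-Pro
# and override the entries below.
CONFIG_FILE="hicpro_config.txt"
cat << EOF > "${CONFIG_FILE}"
# Number of CPUs to use
N_CPU = 8
# Suffixes that distinguish the two mates in the FASTQ file names
PAIR1_EXT = _R1
PAIR2_EXT = _R2
# Alignment reference
BOWTIE2_IDX_PATH = ${BOWTIE2_INDEX_DIR}
REFERENCE_GENOME = ${GENOME_NAME}
GENOME_SIZE = ${GENOME_SIZE}
# Digestion
GENOME_FRAGMENT = ${GENOME_FRAGMENT}
LIGATION_SITE = ${LIGATION_SITE}
# Contact-map resolution in bp (40 kb, as used for the PCA step)
BIN_SIZE = 40000
# Other parameters can be added as needed, e.g., MIN_INSERT_SIZE, MAX_INSERT_SIZE, etc.
EOF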

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Execute HiC-Pro
# HiC-Pro expects the input directory (-i) to contain subdirectories,
# where each subdirectory represents a sample and contains its R1/R2 FASTQ files.
# Example structure:
# /path/to/your/fastq_files/
# ├── sample1/
# │   ├── sample1_R1.fastq.gz
# │   └── sample1_R2.fastq.gz
# └── sample2/
#     ├── sample2_R1.fastq.gz
#     └── sample2_R2.fastq.gz
HiC-Pro -i "${INPUT_FASTQ_DIR}" -o "${OUTPUT_DIR}" -c "${CONFIG_FILE}"
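
HiC-Pro writes its binned contact maps as sparse three-column text files (bin_i, bin_j, count, with 1-based bin indices and, by default, only the upper triangle). The sketch below shows one way to turn such a file into a dense symmetric matrix for the downstream PCA and plotting steps. The function name `load_hicpro_triplets` and the toy input are hypothetical, and the number of bins is assumed to be known (for example, from the line count of the matching `_abs.bed` file).

```python
import io

import numpy as np


def load_hicpro_triplets(handle, n_bins):
    """Build a dense symmetric matrix from HiC-Pro-style 'i j count' lines.

    Bin indices in the file are assumed to be 1-based; they are shifted
    to 0-based positions in the returned array.
    """
    mat = np.zeros((n_bins, n_bins), dtype=float)
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        i, j, count = line.split()
        i, j = int(i) - 1, int(j) - 1
        mat[i, j] = float(count)
        mat[j, i] = float(count)  # mirror the upper triangle
    return mat


# Toy example with three bins (values are made up for illustration).
toy_file = io.StringIO("1\t1\t5\n1\t2\t3\n2\t3\t1.5\n")
dense = load_hicpro_triplets(toy_file, n_bins=3)
print(dense)
```

For a genome-wide 40-kb matrix a dense array can be large, so in practice one would usually load a single chromosome at a time.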

The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter setting: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" -C 1.

ChIA-PET2 v2017

$ Bash example

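# Install ChIA-PET2 following its README. The repository is believed to be
# GuipengLi/ChIA-PET2; verify the location and build steps before use.
# git clone https://github.com/GuipengLi/ChIA-PET2.git
# cd ChIA-PET2

# ChIA-PET2 is a command-line pipeline that starts from paired-end FASTQ files
# (step 1 trims linkers), not from a BAM file. The input and reference options
# below (-g, -b, -f, -r, -n) are not given in the source text; their names follow
# the ChIA-PET2 usage message and should be confirmed with `ChIA-PET2 -h`.
# All file names are placeholders.
fastq_r1="sample_R1.fastq.gz"
fastq_r2="sample_R2.fastq.gz"
bwa_index="/path/to/bwa/index/hg19.fa"
chrom_sizes="/path/to/hg19.chrom.sizes"
output_dir="chia_pet2_output"
sample_name="sample"

# Create output directory if it doesn't exist
mkdir -p "${output_dir}"

# Execute ChIA-PET2 with the parameters reported in the source text
ChIA-PET2 \
  -g "${bwa_index}" \
  -b "${chrom_sizes}" \
  -f "${fastq_r1}" \
  -r "${fastq_r2}" \
  -A ACGCGATATCTTATC \
  -B AGTCAGATAAGATAT \
  -s 1 \
  -m 1 \
  -t 4 \
  -k 2 \
  -e 1 \
  -l 15 \
  -S 500 \
  -M "-q 0.05" \
  -C 1 \
  -o "${output_dir}" \
  -n "${sample_name}"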
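
To make two of the reported parameters more concrete, the sketch below illustrates what "at most one mismatch in the linker" (-e 1) and "minimum read length after trimming" (-l 15) could mean in practice, assuming those flags carry the meanings given in the ChIA-PET2 usage message. It is a simplified illustration, not ChIA-PET2's actual trimming code; the function names `find_linker` and `trim_at_linker` are hypothetical, and only the sequence upstream of the linker is kept.

```python
def find_linker(read, linker, max_mismatches=1):
    """Return the first position where `linker` matches `read` with at most
    `max_mismatches` substitutions (Hamming distance), or -1 if none."""
    k = len(linker)
    for start in range(len(read) - k + 1):
        window = read[start:start + k]
        mismatches = sum(a != b for a, b in zip(window, linker))
        if mismatches <= max_mismatches:
            return start
    return -1


def trim_at_linker(read, linker, max_mismatches=1, min_length=15):
    """Keep the sequence upstream of the linker; return None if the linker
    is not found or the remaining sequence is shorter than `min_length`."""
    pos = find_linker(read, linker, max_mismatches)
    if pos < 0:
        return None
    trimmed = read[:pos]
    return trimmed if len(trimmed) >= min_length else None


# Linker sequence taken from the -A option in the source text;
# the read itself is made up for illustration.
linker_a = "ACGCGATATCTTATC"
example_read = "T" * 20 + linker_a + "GG"
print(trim_at_linker(example_read, linker_a))
```

A real trimmer would also handle the second linker (-B), the reverse read, and the bridge-linker layout selected by -m 1.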

PCA was then applied to the 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used to identify A or B compartments (Heinz et al., 2010).

PCA (version not specified)

$ Bash example

# The source text does not name the PCA software (Heinz et al., 2010 is the usual
# citation for HOMER). This example swaps in cooltools as one common way to compute
# Hi-C eigenvectors; it is not necessarily the authors' method.
# The input is assumed to be a single-resolution .cool file at 40 kb, converted from
# HiC-Pro output (e.g., with HiC-Pro's hicpro2higlass.sh utility or `cooler load`).

# Installation (commented out)
# conda install -c bioconda cooler cooltools

# Input .cool file containing the 40-kb interaction matrix (placeholder name)
INPUT_COOL_FILE="hicpro_output_40kb.cool"

# Prefix for the output files
OUTPUT_PREFIX="pc1_values_40kb"

# cooltools works on balanced data by default, so add balancing weights if they are absent.
cooler balance "${INPUT_COOL_FILE}"

# Compute the first cis eigenvector (used here as PC1) for every 40-kb bin.
# Command and option names are from recent cooltools releases (older releases used
# `call-compartments`); confirm with `cooltools eigs-cis --help`.
# Without a phasing track (e.g., GC content via --phasing-track) the sign of the
# eigenvector is arbitrary, so positive values do not automatically mean compartment A.
cooltools eigs-cis \
    --n-eigs 1 \
    --out-prefix "${OUTPUT_PREFIX}" \
    "${INPUT_COOL_FILE}"

# The eigenvector table is expected at ${OUTPUT_PREFIX}.cis.vecs.tsv.
# Further processing (e.g., identifying continuous positive/negative regions)
# is required to define A/B compartments as described in the source text.
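
The two ideas in this step, a first principal component per bin and runs of same-sign PC1 values, can be sketched in a few lines of NumPy. This is a generic illustration under stated assumptions, not the authors' procedure: the matrix is distance-normalized (observed/expected per diagonal), PC1 is taken as the leading eigenvector of the row correlation matrix, and positive values are labeled "A" purely by convention (the sign is arbitrary until it is oriented with an external track such as gene density). The function names `pc1_from_contacts` and `sign_runs` and the toy matrix are hypothetical.

```python
import numpy as np


def pc1_from_contacts(mat):
    """Leading eigenvector of the correlation matrix of the
    observed/expected (distance-normalized) contact matrix."""
    mat = np.asarray(mat, dtype=float)
    n = mat.shape[0]
    oe = np.zeros_like(mat)
    for d in range(n):
        diag = np.diagonal(mat, d)
        mean = diag.mean()  # expected contacts at genomic distance d
        if mean > 0:
            idx = np.arange(n - d)
            oe[idx, idx + d] = diag / mean
            oe[idx + d, idx] = diag / mean
    corr = np.nan_to_num(np.corrcoef(oe))  # constant rows give NaN; set to 0
    _, vecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    return vecs[:, -1]


def sign_runs(pc1):
    """Return (start_bin, end_bin_exclusive, label) for each run of
    consecutive bins whose PC1 values share the same sign."""
    signs = np.sign(pc1)
    runs, start = [], 0
    for i in range(1, len(signs) + 1):
        if i == len(signs) or signs[i] != signs[start]:
            label = "A" if signs[start] > 0 else "B" if signs[start] < 0 else "NA"
            runs.append((start, i, label))
            start = i
    return runs


# Toy 8-bin matrix: bins 0-3 and 4-7 form two compartments, with contacts
# decaying with distance and doubled within a compartment.
bins = np.arange(8)
same = (bins[:, None] < 4) == (bins[None, :] < 4)
toy = np.where(same, 2.0, 1.0) / (1.0 + np.abs(bins[:, None] - bins[None, :]))
print(sign_runs(pc1_from_contacts(toy)))
```

Multiplying a run's bin indices by the 40-kb bin size converts it to genomic coordinates; real data would also need low-coverage bins masked before the correlation step.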

The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015).

HiCPlotter v2015

$ Bash example

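# Clone the HiCPlotter repository (believed to be kcakdemir/HiCPlotter; verify before cloning)
# git clone https://github.com/kcakdemir/HiCPlotter.git
# cd HiCPlotter

# Install dependencies (e.g., numpy, scipy, matplotlib); HiCPlotter was written for Python 2.7
# pip install numpy scipy matplotlib

# Example usage of HiCPlotter.py. Option names follow the HiCPlotter README;
# confirm them with `python HiCPlotter.py -h` for your copy.
# -f: interaction matrix file(s); -n: track name(s); -chr: chromosome to plot;
# -r: resolution in bp; -o: output file prefix.
# For HiC-Pro's three-column sparse matrices, -tri 1 together with -bed <..._abs.bed> is expected.
# File names and the chromosome are placeholders.
python HiCPlotter.py -f sample_40000_iced.matrix -n sample -chr chr1 -r 40000 -o hic_visualization_output -tri 1 -bed sample_40000_abs.bed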

High-confidence interactions, defined as those with >3 PET counts and a q-value < 0.05, were used for downstream analysis.

Python v3.9 (inferred with models/gemini-2.5-flash)

$ Bash example

# Assuming 'interactions.tsv' is the input file with a header.
# Assuming 'PET_counts' and 'q_value' are the column names for the respective metrics.

# Python is a common scripting language for bioinformatics.
# conda install -c conda-forge python=3.9 pandas # Example installation

# Create the Python script to filter interactions
cat << 'EOF' > filter_interactions.py
import pandas as pd
import sys

def filter_interactions(input_file, output_file, pet_count_col, q_value_col, min_pet_counts=3, max_q_value=0.05):
    """
    Filters interactions based on PET counts and q-value.
    Assumes the input file is tab-separated with a header.
    """
    try:
        df = pd.read_csv(input_file, sep='\t')
    except Exception as e:
        print(f"Error reading input file {input_file}: {e}", file=sys.stderr)
        sys.exit(1)

    if pet_count_col not in df.columns:
        print(f"Error: PET count column '{pet_count_col}' not found in input file.", file=sys.stderr)
        sys.exit(1)
    if q_value_col not in df.columns:
        print(f"Error: q-value column '{q_value_col}' not found in input file.", file=sys.stderr)
        sys.exit(1)

    # Ensure columns are numeric, coercing errors to NaN
    df[pet_count_col] = pd.to_numeric(df[pet_count_col], errors='coerce')
    df[q_value_col] = pd.to_numeric(df[q_value_col], errors='coerce')

    # Filter out rows where conversion failed (NaNs) for critical columns
    df.dropna(subset=[pet_count_col, q_value_col], inplace=True)

    # Apply the filtering criteria
    filtered_df = df[(df[pet_count_col] > min_pet_counts) & (df[q_value_col] < max_q_value)]

    filtered_df.to_csv(output_file, sep='\t', index=False)
    print(f"Filtered interactions saved to {output_file}")

if __name__ == "__main__":
    if len(sys.argv) != 5:
        print("Usage: python filter_interactions.py <input_tsv> <output_tsv> <pet_count_column_name> <q_value_column_name>", file=sys.stderr)
        sys.exit(1)

    input_tsv = sys.argv[1]
    output_tsv = sys.argv[2]
    pet_col = sys.argv[3]
    q_col = sys.argv[4]

    filter_interactions(input_tsv, output_tsv, pet_col, q_col)
EOF

# Execute the Python script with example column names
python filter_interactions.py interactions.tsv high_confident_interactions.tsv PET_counts q_value

Tools Used

Hi-C

Raw Source Text

Library strategy: HiC-Seq
The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter setting: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" -C 1.
PCA analysis is then applied to 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used for the identification of A or B compartments (Heinz et al., 2010).
The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015).
High confident interactions were defined as those with >3 PET counts and q-value < 0.05 for downstream analysis.
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: MICC