GSE120023 Processing Pipeline


Publication

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription.

Cell (2019) — PMID 31251911

Dataset

GSE120023

Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription [HiC-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Library strategy: HiC-Seq

    Hi-C (version not specified; tool inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install HiC-Pro (if not already installed)
    # The conda package name below is unverified; the HiC-Pro repository documents the supported installation.
    # conda create -n hicpro_env python=3.8
    # conda activate hicpro_env
    # conda install -c bioconda hic-pro
    
    # Define variables for input, output, and reference files
    # The genome build for this dataset is GRCh37 (hg19). The enzyme is a placeholder: the source text does not name one.
    INPUT_FASTQ_DIR="/path/to/your/fastq_files" # One subdirectory per sample, e.g., sample1/sample1_R1.fastq.gz
    OUTPUT_DIR="hic_pro_results"
    GENOME_NAME="hg19" # Must match the basename of the Bowtie2 index files
    GENOME_SIZE="/path/to/reference/chrom_hg19.sizes" # format: chr_name<tab>chr_length
    BOWTIE2_INDEX_DIR="/path/to/bowtie2/index_dir" # Directory containing hg19.*.bt2
    GENOME_FRAGMENT="/path/to/MboI_resfrag_hg19.bed" # Restriction fragments, e.g., from HiC-Pro's digest_genome.py
    
    # Create a HiC-Pro configuration file
    # Only the keys most likely to need editing are shown. In practice, copy the
    # config-hicpro.txt template shipped with HiC-Pro and change these values there.
    # LIGATION_SITE is the ligation junction (GATCGATC for MboI/DpnII), not the recognition site.
    # BIN_SIZE is set to 40000 to match the 40-kb matrix used for the PCA step.
    CONFIG_FILE="hicpro_config.txt"
    cat << EOF > "${CONFIG_FILE}"
    N_CPU = 8
    PAIR1_EXT = _R1
    PAIR2_EXT = _R2
    BOWTIE2_IDX_PATH = ${BOWTIE2_INDEX_DIR}
    REFERENCE_GENOME = ${GENOME_NAME}
    GENOME_SIZE = ${GENOME_SIZE}
    GENOME_FRAGMENT = ${GENOME_FRAGMENT}
    LIGATION_SITE = GATCGATC
    BIN_SIZE = 40000
    EOF
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Execute HiC-Pro
    # HiC-Pro expects the input directory (-i) to contain subdirectories,
    # where each subdirectory represents a sample and contains its R1/R2 FASTQ files.
    # Example structure:
    # /path/to/your/fastq_files/
    # ├── sample1/
    # │   ├── sample1_R1.fastq.gz
    # │   └── sample1_R2.fastq.gz
    # └── sample2/
    #     ├── sample2_R1.fastq.gz
    #     └── sample2_R2.fastq.gz
    HiC-Pro -i "${INPUT_FASTQ_DIR}" -o "${OUTPUT_DIR}" -c "${CONFIG_FILE}"
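
    HiC-Pro's LIGATION_SITE setting expects the sequence formed when two digested, filled-in ends are ligated, not the enzyme's recognition site. The sketch below derives that junction for enzymes that leave a 5' overhang. The function name and the small enzyme table are illustrative assumptions, not part of HiC-Pro.

```python
def ligation_junction(site, cut_offset):
    """Return the fill-in ligation junction for a 5'-overhang restriction site.

    site: recognition sequence written 5'->3' (e.g. "GATC")
    cut_offset: number of bases upstream of the cut on the top strand
    """
    site = site.upper()
    if not 0 <= cut_offset <= len(site) // 2:
        raise ValueError("cut_offset must describe a 5' overhang")
    # After digestion and end fill-in, the upstream fragment ends with
    # site[:len(site) - cut_offset] and the downstream fragment starts with
    # site[cut_offset:]. Ligating the two gives the junction.
    return site[: len(site) - cut_offset] + site[cut_offset:]


# Illustrative enzyme table: (recognition site, top-strand cut offset)
ENZYMES = {
    "MboI": ("GATC", 0),      # ^GATC
    "DpnII": ("GATC", 0),     # ^GATC
    "HindIII": ("AAGCTT", 1), # A^AGCTT
}

for name, (site, cut) in ENZYMES.items():
    print(name, ligation_junction(site, cut))
```

    For MboI and DpnII this gives GATCGATC, and for HindIII it gives AAGCTAGCTT. The enzyme actually used for this dataset is not stated in the source text, so confirm it before setting LIGATION_SITE.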
  2.

    The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter settings: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" -C 1.

    ChIA-PET2 v2017
    $ Bash example
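    # Obtain ChIA-PET2 (repository location assumed; follow its README for build steps)
    # git clone https://github.com/GuipengLi/ChIA-PET2.git
    
    # ChIA-PET2 is run as a command-line pipeline. With -s 1 it starts from linker
    # trimming, so the inputs are paired FASTQ files rather than a BAM file.
    # The -g, -b, -f, -r, -o and -n options and all paths below are assumptions based on
    # the ChIA-PET2 usage message; confirm them with `ChIA-PET2 -h` before use.
    bwa_index="/path/to/bwa/index/hg19.fa"   # BWA index prefix (GRCh37/hg19)
    chrom_sizes="/path/to/hg19.chrom.sizes"  # bedtools genome file
    fastq_r1="sample_R1.fastq.gz"
    fastq_r2="sample_R2.fastq.gz"
    output_dir="chia_pet2_output"
    
    # Create output directory if it doesn't exist
    mkdir -p "${output_dir}"
    
    # Execute ChIA-PET2; options -A through -C are taken verbatim from the methods text
    ChIA-PET2 \
      -g "${bwa_index}" \
      -b "${chrom_sizes}" \
      -f "${fastq_r1}" \
      -r "${fastq_r2}" \
      -A ACGCGATATCTTATC \
      -B AGTCAGATAAGATAT \
      -s 1 \
      -m 1 \
      -t 4 \
      -k 2 \
      -e 1 \
      -l 15 \
      -S 500 \
      -M "-q 0.05" \
      -C 1 \
      -o "${output_dir}" \
      -n sample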
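
    The parameter string in the methods text contains a typographic dash before `C` and a quoted value for `-M`, both of which are easy to mishandle when the string is pasted into a shell. The sketch below normalizes the dash and splits the string into flag/value pairs. The function name is hypothetical, and the parser assumes every flag is a single letter followed by exactly one value.

```python
import shlex

# Parameter string as it appears in the methods text; \u2013 is the en dash before "C".
METHODS_PARAMS = (
    '-A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 '
    '-e 1 -l 15 -S 500 -M "-q 0.05" \u2013C 1'
)


def parse_flags(text):
    """Split a flag string into an ordered {flag: value} mapping."""
    # Replace typographic dashes with an ASCII hyphen before tokenizing.
    cleaned = text.replace("\u2013", "-").replace("\u2014", "-")
    tokens = shlex.split(cleaned)  # keeps "-q 0.05" together as one value
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating flag/value tokens")
    flags = {}
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if len(flag) != 2 or not flag.startswith("-"):
            raise ValueError(f"unexpected flag token: {flag!r}")
        flags[flag] = value
    return flags


params = parse_flags(METHODS_PARAMS)
for flag, value in params.items():
    print(flag, repr(value))
```

    Parsing the string first also makes it easy to check that the value passed through `-M` reaches the peak caller as the single argument `-q 0.05`.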
  3.

    PCA was then applied to the 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used to identify A or B compartments (Heinz et al., 2010).

    PCA (version not specified)
    $ Bash example
    # The input is assumed to be a balanced multi-resolution cooler (.mcool) built from the
    # HiC-Pro matrices, for example with `cooler load`, `cooler zoomify` and `cooler balance`.
    # The methods cite Heinz et al., 2010 for compartment identification; cooltools is used
    # here only as an illustrative substitute, not as the authors' tool.
    
    # Installation (commented out)
    # conda install -c bioconda cooltools
    
    # The genome build for this dataset is GRCh37 (hg19).
    # Input .mcool file containing the interaction matrix
    INPUT_MCOOL_FILE="hicpro_output.mcool"
    
    # Resolution specified in the description (40-kb)
    RESOLUTION="40000"
    
    # Track used to orient the sign of PC1 (e.g., GC content per 40-kb bin); placeholder path
    PHASING_TRACK="path/to/hg19_gc_40kb.bedgraph"
    
    # Output prefix for the eigenvector (PC1) tables
    OUTPUT_PREFIX="pc1_values_40kb"
    
    # Compute the first cis eigenvector (PC1) at the specified resolution.
    # Option names follow recent cooltools releases (`eigs-cis`); older releases used
    # `cooltools call-compartments`. Check `cooltools eigs-cis --help` for your version.
    cooltools eigs-cis \
        --n-eigs 1 \
        --phasing-track "${PHASING_TRACK}" \
        -o "${OUTPUT_PREFIX}" \
        "${INPUT_MCOOL_FILE}::/resolutions/${RESOLUTION}"
    
    # The eigenvector table written under ${OUTPUT_PREFIX} holds one PC1 value per genomic bin.
    # Further processing (e.g., identifying continuous positive/negative regions) is still
    # required to define A/B compartments as described in the methods.
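
    The methods define compartments as regions of continuous positive or negative PC1 values. A minimal sketch of that segmentation step is shown below. The function name and the input tuple layout are hypothetical, and labeling positive PC1 as "A" is an assumption that holds only if PC1 has been oriented (for example by GC content or gene density).

```python
import math
from itertools import groupby


def compartment_runs(bins, positive_label="A", negative_label="B"):
    """Merge consecutive bins whose PC1 values share a sign.

    bins: iterable of (chrom, start, end, pc1), sorted by chrom and start.
    Returns a list of (chrom, start, end, label, n_bins).
    Bins with missing, NaN, or zero PC1 are skipped and break a run.
    """
    def key(b):
        chrom, _start, _end, pc1 = b
        if pc1 is None or math.isnan(pc1) or pc1 == 0:
            return (chrom, None)
        return (chrom, positive_label if pc1 > 0 else negative_label)

    runs = []
    for (chrom, label), group in groupby(bins, key=key):
        group = list(group)
        if label is None:
            continue  # unassigned bins do not form a compartment
        runs.append((chrom, group[0][1], group[-1][2], label, len(group)))
    return runs


# Toy PC1 track at 40-kb resolution (values are made up for illustration)
toy_bins = [
    ("chr1", 0, 40000, 0.5),
    ("chr1", 40000, 80000, 0.2),
    ("chr1", 80000, 120000, -0.3),
    ("chr1", 120000, 160000, float("nan")),
    ("chr1", 160000, 200000, -0.1),
    ("chr2", 0, 40000, -0.4),
]

for run in compartment_runs(toy_bins):
    print(run)
```

    Runs never cross chromosome boundaries because the chromosome name is part of the grouping key.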
  4.

    The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015).

    HiCPlotter v2015
    $ Bash example
    # Clone the HiCPlotter repository (location assumed; HiCPlotter was written for Python 2.7)
    # git clone https://github.com/kcakdemir/HiCPlotter.git
    # cd HiCPlotter
    
    # Install dependencies (e.g., numpy, scipy, matplotlib)
    # pip install numpy scipy matplotlib
    
    # Example usage of HiCPlotter.py with a HiC-Pro sparse matrix.
    # -f: matrix file(s); -n: track name(s); -chr: chromosome to plot; -r: bin size in bp;
    # -o: output prefix; -tri 1 with -bed: HiC-Pro triplet matrix plus its bin BED file.
    # File names and the chromosome are placeholders; confirm the options with `python HiCPlotter.py -h`.
    python HiCPlotter.py -f sample_40000.matrix -n Sample -chr chr1 -r 40000 -tri 1 -bed sample_40000_abs.bed -o hic_visualization_output
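
    HiC-Pro writes contact maps as sparse triplets (bin_i, bin_j, count) with 1-based bin indices, whereas many plotting tools also accept a dense square matrix. The sketch below shows the conversion on a toy example. The function name is hypothetical, and whether the authors performed such a conversion is not stated in the source.

```python
def triplets_to_dense(lines, n_bins):
    """Convert HiC-Pro style sparse triplets into a dense symmetric matrix.

    lines: iterable of "bin_i<TAB>bin_j<TAB>count" strings (1-based bin indices)
    n_bins: total number of bins in the region being plotted
    """
    dense = [[0.0] * n_bins for _ in range(n_bins)]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        i_str, j_str, count_str = line.split()[:3]
        i, j = int(i_str) - 1, int(j_str) - 1  # shift to 0-based indices
        count = float(count_str)
        dense[i][j] = count
        dense[j][i] = count  # mirror the upper triangle
    return dense


# Toy triplets for a 3-bin region (counts are made up for illustration)
toy_triplets = [
    "1\t1\t10",
    "1\t2\t4",
    "2\t3\t2.5",
    "3\t3\t8",
]

for row in triplets_to_dense(toy_triplets, n_bins=3):
    print("\t".join(str(value) for value in row))
```

    A dense genome-wide matrix at 40-kb resolution is very large, so in practice this is only sensible for a single chromosome or a smaller region.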
  5.

    High-confidence interactions, defined as those with >3 PET counts and a q-value < 0.05, were used for downstream analysis.

    Python v3.9 (inferred with models/gemini-2.5-flash)
    $ Bash example
    # Assuming 'interactions.tsv' is the input file with a header.
    # Assuming 'PET_counts' and 'q_value' are the column names for the respective metrics.
    
    # Python is a common scripting language for bioinformatics.
    # conda install -c conda-forge python=3.9 pandas # Example installation
    
    # Create the Python script to filter interactions
    cat << 'EOF' > filter_interactions.py
    import pandas as pd
    import sys
    
    def filter_interactions(input_file, output_file, pet_count_col, q_value_col, min_pet_counts=3, max_q_value=0.05):
        """
        Filters interactions based on PET counts and q-value.
        Assumes the input file is tab-separated with a header.
        """
        try:
            df = pd.read_csv(input_file, sep='\t')
        except Exception as e:
            print(f"Error reading input file {input_file}: {e}", file=sys.stderr)
            sys.exit(1)
    
        if pet_count_col not in df.columns:
            print(f"Error: PET count column '{pet_count_col}' not found in input file.", file=sys.stderr)
            sys.exit(1)
        if q_value_col not in df.columns:
            print(f"Error: q-value column '{q_value_col}' not found in input file.", file=sys.stderr)
            sys.exit(1)
    
        # Ensure columns are numeric, coercing errors to NaN
        df[pet_count_col] = pd.to_numeric(df[pet_count_col], errors='coerce')
        df[q_value_col] = pd.to_numeric(df[q_value_col], errors='coerce')
    
        # Filter out rows where conversion failed (NaNs) for critical columns
        df.dropna(subset=[pet_count_col, q_value_col], inplace=True)
    
        # Apply the filtering criteria
        filtered_df = df[(df[pet_count_col] > min_pet_counts) & (df[q_value_col] < max_q_value)]
    
        filtered_df.to_csv(output_file, sep='\t', index=False)
        print(f"Filtered interactions saved to {output_file}")
    
    if __name__ == "__main__":
        if len(sys.argv) != 5:
            print("Usage: python filter_interactions.py <input_tsv> <output_tsv> <pet_count_column_name> <q_value_column_name>", file=sys.stderr)
            sys.exit(1)
    
        input_tsv = sys.argv[1]
        output_tsv = sys.argv[2]
        pet_col = sys.argv[3]
        q_col = sys.argv[4]
    
        filter_interactions(input_tsv, output_tsv, pet_col, q_col)
    EOF
    
    # Execute the Python script with example column names
    python filter_interactions.py interactions.tsv high_confident_interactions.tsv PET_counts q_value
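
    Both thresholds in the methods text are strict inequalities, so an interaction with exactly 3 PETs or a q-value of exactly 0.05 is excluded. The sketch below checks that boundary behavior on a toy table using only the standard library. The column names and rows are made up for illustration.

```python
import csv
import io

# Toy tab-separated table; column names are hypothetical.
TOY_TABLE = (
    "anchor_pair\tPET_counts\tq_value\n"
    "pair1\t3\t0.001\n"   # fails: PET count is not greater than 3
    "pair2\t4\t0.049\n"   # passes both thresholds
    "pair3\t10\t0.05\n"   # fails: q-value is not below 0.05
    "pair4\t7\t0.01\n"    # passes both thresholds
)


def high_confidence(rows, pets_above=3, q_below=0.05):
    """Keep rows with PET count > pets_above and q-value < q_below."""
    return [
        row for row in rows
        if int(row["PET_counts"]) > pets_above and float(row["q_value"]) < q_below
    ]


rows = list(csv.DictReader(io.StringIO(TOY_TABLE), delimiter="\t"))
kept = [row["anchor_pair"] for row in high_confidence(rows)]
print(kept)
```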

Raw Source Text
Library strategy: HiC-Seq
The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter setting: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" –C 1.
PCA analysis is then applied to 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used for the identification of A or B compartments (Heinz et al., 2010).
The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015).
High confident interactions were defined as those with >3 PET counts and q-value < 0.05 for downstream analysis.
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: MICC