GSE120023 Processing Pipeline
Publication
Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription. Cell (2019), PMID 31251911
Dataset
GSE120023: Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-based Regulation of Transcription [HiC-Seq]
Processing Steps
1
Library strategy: HiC-Seq
$ Bash example
# Install HiC-Pro first (see the HiC-Pro documentation); installation is not shown here.

# Define variables for input, output, and reference files.
# All paths are placeholders. The GEO record lists the genome build as GRCh37 (hg19).
INPUT_FASTQ_DIR="/path/to/your/fastq_files"        # one subdirectory per sample
OUTPUT_DIR="hic_pro_results"
BOWTIE2_INDEX_DIR="/path/to/bowtie2/index_dir"     # directory holding the Bowtie2 index files
REFERENCE_GENOME="hg19"                            # basename of the Bowtie2 index
GENOME_SIZE="/path/to/reference/hg19.chrom.sizes"  # format: chr_name<tab>chr_length
GENOME_FRAGMENT="/path/to/reference/hg19_fragments.bed"  # restriction fragments, e.g. from HiC-Pro's digest_genome.py
LIGATION_SITE="GATCGATC"  # ligation junction for MboI/DpnII; the enzyme is an assumption, adjust to your protocol

# Write the dataset-specific HiC-Pro settings.
# HiC-Pro also needs the remaining settings from its config-hicpro.txt template
# (e.g. Bowtie2 options), so merge these values into a copy of that template before running.
# Comments are kept on their own lines because the config is not parsed as shell.
CONFIG_FILE="hicpro_config.txt"
cat << EOF > "${CONFIG_FILE}"
# Number of CPUs to use
N_CPU = 8
# Suffixes that distinguish the two mates of each read pair
PAIR1_EXT = _R1
PAIR2_EXT = _R2
BOWTIE2_IDX_PATH = ${BOWTIE2_INDEX_DIR}
REFERENCE_GENOME = ${REFERENCE_GENOME}
GENOME_SIZE = ${GENOME_SIZE}
GENOME_FRAGMENT = ${GENOME_FRAGMENT}
LIGATION_SITE = ${LIGATION_SITE}
# 40-kb bins, as used for the PCA step
BIN_SIZE = 40000
EOF

# Execute HiC-Pro
# HiC-Pro expects the input directory (-i) to contain subdirectories,
# where each subdirectory represents a sample and contains its R1/R2 FASTQ files.
# Example structure:
# /path/to/your/fastq_files/
# ├── sample1/
# │   ├── sample1_R1.fastq.gz
# │   └── sample1_R2.fastq.gz
# └── sample2/
#     ├── sample2_R1.fastq.gz
#     └── sample2_R2.fastq.gz
# HiC-Pro creates the output directory (-o) itself.
HiC-Pro -i "${INPUT_FASTQ_DIR}" -o "${OUTPUT_DIR}" -c "${CONFIG_FILE}"
2
The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter setting: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" -C 1.
ChIA-PET2 v2017
$ Bash example
# Install ChIA-PET2 following its README (not shown here). The command below assumes the
# ChIA-PET2 executable is on the PATH; check its usage message for your version's options.

# Placeholders: ChIA-PET2 starts from paired FASTQ files (-f/-r), not from a BAM file.
fastq_r1="sample_R1.fastq.gz"
fastq_r2="sample_R2.fastq.gz"
bwa_index="/path/to/bwa/index/hg19.fa"    # BWA index prefix (-g)
chrom_sizes="/path/to/hg19.chrom.sizes"   # chromosome lengths (-b)
output_dir="chia_pet2_output"
prefix="sample"                           # output file prefix (-n)

# Execute ChIA-PET2 with the parameters reported for this dataset
ChIA-PET2 \
  -g "${bwa_index}" \
  -b "${chrom_sizes}" \
  -f "${fastq_r1}" \
  -r "${fastq_r2}" \
  -A ACGCGATATCTTATC \
  -B AGTCAGATAAGATAT \
  -s 1 \
  -m 1 \
  -t 4 \
  -k 2 \
  -e 1 \
  -l 15 \
  -S 500 \
  -M "-q 0.05" \
  -C 1 \
  -o "${output_dir}" \
  -n "${prefix}"
3
PCA was then applied to the 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used to identify A or B compartments (Heinz et al., 2010).
$ Bash example
# The input interaction matrix is assumed to be in .cool format, converted from the HiC-Pro
# output (for example with `cooler load` or HiC-Pro's hicpro2higlass.sh utility).
# 'hicpro_output_40kb.cool' is a placeholder for a single-resolution, balanced 40-kb cooler.
# For a multi-resolution .mcool file, append ::/resolutions/40000 to the path instead.
# Note: the source cites Heinz et al., 2010 for compartment identification; cooltools is used
# here only as an illustrative alternative, not as the authors' confirmed tool.

# Installation (commented out)
# conda install -c conda-forge -c bioconda cooltools

# Input .cool file containing the 40-kb interaction matrix
INPUT_COOL_FILE="hicpro_output_40kb.cool"

# Phasing track (placeholder path): a per-bin signal such as GC content, used to orient
# PC1 so that positive values correspond to the A compartment.
PHASING_TRACK="path/to/hg19_gc_40kb.bedgraph"

# Output prefix for the eigenvector and eigenvalue tables
OUTPUT_PREFIX="pc1_values_40kb"

# Compute the first cis eigenvector (PC1) per chromosome with cooltools eigs-cis.
# Option names differ between cooltools releases; confirm with `cooltools eigs-cis --help`.
cooltools eigs-cis \
  --n-eigs 1 \
  --phasing-track "${PHASING_TRACK}" \
  -o "${OUTPUT_PREFIX}" \
  "${INPUT_COOL_FILE}"

# Further processing of the PC1 table (identifying continuous positive/negative regions)
# is required to define A/B compartments as described in the source.
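The last step described above, turning per-bin PC1 values into A or B regions, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' procedure: the function name `call_compartments`, the tuple layout, and the convention that positive PC1 means A are all assumptions (the sign of an eigenvector is arbitrary until it is oriented against a track such as GC content or gene density).

```python
import math
from itertools import groupby


def call_compartments(bins):
    """Merge consecutive bins with the same PC1 sign into A/B regions.

    `bins` is an iterable of (chrom, start, end, pc1) tuples sorted by
    chromosome and start. Bins whose pc1 is None, NaN, or exactly 0 carry
    no label and therefore interrupt a region.
    Returns a list of (chrom, start, end, label) tuples.
    """
    def label(pc1):
        if pc1 is None or math.isnan(pc1) or pc1 == 0:
            return None
        # Assumed convention: positive PC1 = A, negative PC1 = B.
        return "A" if pc1 > 0 else "B"

    regions = []
    # groupby only merges neighbours, so a region never spans chromosomes
    # or jumps over an unlabeled bin.
    for (chrom, lab), run in groupby(bins, key=lambda b: (b[0], label(b[3]))):
        run = list(run)
        if lab is not None:
            regions.append((chrom, run[0][1], run[-1][2], lab))
    return regions


# Hypothetical 40-kb bins for illustration only.
example_bins = [
    ("chr1", 0, 40000, 0.5),
    ("chr1", 40000, 80000, 0.2),
    ("chr1", 80000, 120000, -0.3),
    ("chr1", 120000, 160000, None),
    ("chr1", 160000, 200000, -0.1),
    ("chr2", 0, 40000, -0.4),
]
print(call_compartments(example_bins))
```

Unlabeled bins are kept in the input on purpose: dropping them beforehand would let two B runs on either side of a gap merge into one region.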
4
The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015).
HiCPlotter v2015
$ Bash example
# Obtain HiCPlotter from its repository and install its dependencies (e.g., numpy, scipy, matplotlib)
# pip install numpy scipy matplotlib

# Example usage of HiCPlotter.py with HiC-Pro output (sparse triplet matrix plus bin BED file).
# File names, the sample name, and the chromosome are placeholders. The flags follow the
# HiCPlotter README as best understood here; confirm them with `python HiCPlotter.py -h`.
python HiCPlotter.py \
  -f sample_40000.matrix \
  -tri 1 \
  -bed sample_40000_abs.bed \
  -n Sample \
  -chr chr1 \
  -r 40 \
  -o hic_visualization_output
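Before plotting, it can help to check that the HiC-Pro matrix loads as expected. The sketch below assumes the usual HiC-Pro layout: a bin file with `chrom, start, end, bin_id` columns and a sparse matrix file with `bin_i, bin_j, count` columns. The function name `load_dense_matrix` and the toy input lines are hypothetical, and the code works on lists of lines so it can be tried without real files.

```python
def load_dense_matrix(bed_lines, matrix_lines, chrom):
    """Build a symmetric dense contact matrix for one chromosome.

    bed_lines:    lines of "chrom<TAB>start<TAB>end<TAB>bin_id"
    matrix_lines: lines of "bin_i<TAB>bin_j<TAB>count" (sparse triplets)
    Returns a list of lists of floats, indexed by bin order on `chrom`.
    """
    # Map genome-wide bin ids on `chrom` to 0-based local indices.
    local = {}
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[0] == chrom:
            local[int(fields[3])] = len(local)

    n = len(local)
    dense = [[0.0] * n for _ in range(n)]
    for line in matrix_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        i, j, count = int(fields[0]), int(fields[1]), float(fields[2])
        # Keep only contacts where both bins lie on the requested chromosome.
        if i in local and j in local:
            a, b = local[i], local[j]
            dense[a][b] = count
            dense[b][a] = count  # mirror the upper triangle
    return dense


# Toy input for illustration only.
toy_bed = ["chr1\t0\t40000\t1", "chr1\t40000\t80000\t2", "chr2\t0\t40000\t3"]
toy_matrix = ["1\t1\t10", "1\t2\t4", "2\t3\t7"]
print(load_dense_matrix(toy_bed, toy_matrix, "chr1"))
```

With real data the same function could be called as `load_dense_matrix(open(bed_path), open(matrix_path), "chr1")`, although a dense list of lists is only practical for a single chromosome at coarse resolution.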
5
High-confidence interactions were defined as those with >3 PET counts and q-value < 0.05 for downstream analysis.
$ Bash example
# Assuming 'interactions.tsv' is the input file with a header.
# Assuming 'PET_counts' and 'q_value' are the column names for the respective metrics.

# Python is a common scripting language for bioinformatics.
# conda install -c conda-forge python=3.9 pandas  # Example installation

# Create the Python script to filter interactions
cat << 'EOF' > filter_interactions.py
import sys

import pandas as pd


def filter_interactions(input_file, output_file, pet_count_col, q_value_col,
                        min_pet_counts=3, max_q_value=0.05):
    """
    Filters interactions based on PET counts and q-value.
    Assumes the input file is tab-separated with a header.
    """
    try:
        df = pd.read_csv(input_file, sep='\t')
    except Exception as e:
        print(f"Error reading input file {input_file}: {e}", file=sys.stderr)
        sys.exit(1)

    if pet_count_col not in df.columns:
        print(f"Error: PET count column '{pet_count_col}' not found in input file.", file=sys.stderr)
        sys.exit(1)
    if q_value_col not in df.columns:
        print(f"Error: q-value column '{q_value_col}' not found in input file.", file=sys.stderr)
        sys.exit(1)

    # Ensure columns are numeric, coercing errors to NaN
    df[pet_count_col] = pd.to_numeric(df[pet_count_col], errors='coerce')
    df[q_value_col] = pd.to_numeric(df[q_value_col], errors='coerce')

    # Filter out rows where conversion failed (NaNs) for critical columns
    df.dropna(subset=[pet_count_col, q_value_col], inplace=True)

    # Apply the filtering criteria: strictly more than 3 PETs and q-value below 0.05
    filtered_df = df[(df[pet_count_col] > min_pet_counts) & (df[q_value_col] < max_q_value)]

    filtered_df.to_csv(output_file, sep='\t', index=False)
    print(f"Filtered interactions saved to {output_file}")


if __name__ == "__main__":
    if len(sys.argv) != 5:
        print("Usage: python filter_interactions.py <input_tsv> <output_tsv> "
              "<pet_count_column_name> <q_value_column_name>", file=sys.stderr)
        sys.exit(1)

    input_tsv = sys.argv[1]
    output_tsv = sys.argv[2]
    pet_col = sys.argv[3]
    q_col = sys.argv[4]

    filter_interactions(input_tsv, output_tsv, pet_col, q_col)
EOF

# Execute the Python script with example column names
python filter_interactions.py interactions.tsv high_confident_interactions.tsv PET_counts q_value
Tools Used
Raw Source Text
Library strategy: HiC-Seq The ChIA-PET2 software (Li et al., 2017a) was used for quality control and identification of chromatin interactions with the following parameter setting: -A ACGCGATATCTTATC -B AGTCAGATAAGATAT -s 1 -m 1 -t 4 -k 2 -e 1 -l 15 -S 500 -M "-q 0.05" -C 1. PCA analysis is then applied to 40-kb resolution interaction matrix generated by HiC-Pro (Servant et al., 2015), and regions of continuous positive or negative PC1 values were used for the identification of A or B compartments (Heinz et al., 2010). The interaction matrix was visualized by HiCPlotter (Akdemir and Chin, 2015). High confident interactions were defined as those with >3 PET counts and q-value < 0.05 for downstream analysis. Genome_build: GRCh37 (hg19) Supplementary_files_format_and_content: MICC