Graph-Regularized NMF and Clustering for Hi-C

View the Project on GitHub Roy-lab/grinch

GRiNCH: Graph-Regularized NMF and Clustering for Hi-C

License: GPLv3

GRiNCH applies non-negative matrix factorization (NMF) with graph regularization to discover clusters of highly interacting genomic regions from high-throughput chromosome conformation capture (Hi-C) data. GRiNCH can also be used to smooth the input matrices, and can be applied to data from any 3D genome platform (e.g. HiChIP, SPRITE) provided in a non-negative square matrix format. Now published in Genome Biology.

[Step 1] Install

The installation instructions below were tested on the CentOS 7 Linux distribution. GSL (GNU Scientific Library) is used to handle matrix- and vector-related operations.

  1. If you already have GSL installed, edit the first few lines of the Makefile to point to the correct include and shared library directories, then jump to step 3.
  2. If you do not have GSL installed, or you are not sure, the easiest way to get it installed is to use conda:
    conda install -c conda-forge gsl
  3. Make sure to add the location of the installed shared library to where the compiler/linker will be looking. If you used conda to install GSL to the default location in step 2, run the following command (or add the appropriate path if you already have GSL installed):
  4. And let’s install! In the directory where you downloaded the code and Makefile (either by cloning the repository or by downloading a release), run:
  5. If all went well, you won’t get any alarming messages, and you will see an executable named grinch created in the same directory. A quick test below will print the manual for grinch:
    ./grinch -h
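
If you are not sure whether GSL is already visible on your system (steps 1 and 2), the following is a minimal sketch of one way to check from Python. It is not part of GRiNCH; the helper names `gsl_library` and `describe_gsl` are made up for illustration. It relies on the standard-library function `ctypes.util.find_library`, whose search may not match what your compiler and linker see, so treat a negative result as a hint rather than proof.

```python
from ctypes.util import find_library


def gsl_library():
    """Return the name of the GSL shared library if Python's lookup finds it, else None."""
    # find_library takes the library name without the "lib" prefix or file suffix.
    return find_library("gsl")


def describe_gsl():
    """Summarize the lookup result in one line (hypothetical helper, not part of GRiNCH)."""
    name = gsl_library()
    if name is None:
        return "GSL not found by this lookup; see steps 2 and 3"
    return f"GSL found: {name}"


print(describe_gsl())
```

A library installed with conda in a non-standard location may only be reported as found after its directory has been added to the library search path as described in step 3.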

Note: to implement NNDSVD initialization of the factors, a fast randomized SVD algorithm from RSVDPACK was used. A minor modification to allow random seed specification was made to the original RSVDPACK code. The updated code is included under the modules/random_svd directory. Compilation of this code is part of the included Makefile; no additional step is necessary for installation.

[Step 2] Run

Basic usage

./grinch input.txt input.bed [-o output_prefix] [-k number_of_clusters] [-e expected_size_of_cluster] [-n neighborhood_radius] [-l regularization_strength] [-fgs]

>> ./grinch -h
>> ./grinch Huvec_chr12_25kb.txt Huvec_chr12_25kb.bed -o output/Huvec_chr12_25kb
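
The second example above writes to an `output/` directory, and the parameter table notes that `-o` will not create a missing directory. As a sketch only, a small Python wrapper could create the directory and assemble the command line from the usage string above. The function names and keyword arguments here are hypothetical, and the default executable path `./grinch` assumes you run from the build directory.

```python
import os
import subprocess


def build_grinch_command(matrix, bed, prefix=None, k=None, expected_size=None,
                         radius=None, strength=None, flags="", exe="./grinch"):
    """Assemble the argument list following the usage line of the README."""
    cmd = [exe, matrix, bed]
    options = (("-o", prefix), ("-k", k), ("-e", expected_size),
               ("-n", radius), ("-l", strength))
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    # -f, -g and -s are the switches shown as [-fgs] in the usage line.
    if set(flags) - set("fgs"):
        raise ValueError("flags may only contain the letters f, g and s")
    if flags:
        cmd.append("-" + flags)
    return cmd


def ensure_output_dir(prefix):
    """Create the directory part of the output prefix, since -o will not create it."""
    parent = os.path.dirname(prefix)
    if parent:
        os.makedirs(parent, exist_ok=True)


def run_grinch(matrix, bed, prefix="output", **kwargs):
    """Create the output directory if needed, then call the executable."""
    ensure_output_dir(prefix)
    cmd = build_grinch_command(matrix, bed, prefix=prefix, **kwargs)
    return subprocess.run(cmd, check=True)
```

For instance, `run_grinch("Huvec_chr12_25kb.txt", "Huvec_chr12_25kb.bed", prefix="output/Huvec_chr12_25kb")` would create `output/` first and then issue the same command as the example above.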

Input file format


| Position | Parameter | Description | Default Value/Behavior |
| --- | --- | --- | --- |
| 1 | input matrix file | Input matrix file path and name. Tab-delimited, in sparse matrix format, no header, e.g. 0 10 1201.78 | N/A |
| 2 | input bed file | Input bed file mapping each index in the input matrix file to a chromosomal coordinate. Tab-delimited, no header, e.g. chr1 50000 75000 0. Note: the bin size/resolution of the Hi-C data is assumed to be the same across all bins. Also, only cis-interactions are handled, i.e. the chromosome is assumed to be the same for all bins. | N/A |
| optional | -o | Output file path and prefix. Note: will NOT create a directory if the specified directory does not exist. | ‘output’ |
| optional | -k | Number of clusters, an integer value. | n/(1000000/bin size), where n is the dimension of the symmetric input Hi-C matrix and bin size is the resolution in base pairs, i.e. k is set such that the expected size of a cluster is 1Mb. |
| optional | -e | A different way to specify the number of clusters, by the expected size of a cluster, e.g. with -e 500000, k = n/(500kb/bin size), where n is the number of bins in the input matrix. Note: -k will override -e. | 1000000, i.e. k = n/(1Mb/bin size) |
| optional | -n | Neighborhood radius used in the regularization graph, in base pairs and in multiples of the resolution. For example, -n 100000 gives a neighborhood radius of 4 bins at 25kb resolution, so the 4 adjacent regions on either side of a given region are used to regularize or ‘smooth’ the matrix factors. Increase for lower-depth or sparser data, to use more neighbors for smoothing. | 100000 |
| optional | -l | Strength of regularization. | 1 |
| optional | -s | Print the smoothed matrix to file, in tab-delimited sparse matrix format (e.g. 25000 50000 500.3). This file can be large. | Do NOT output smoothed matrix. |
| optional | -f | Print the factors U and V to file. The file may be large, especially for higher-resolution input, since the matrices are written in a dense format. | Do NOT output factor matrix. |
| optional | -g | Print the graph used in regularization to file. The file may be large, since the matrix is written in a dense format. | Do NOT output graph. |
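
To make the arithmetic behind the defaults for `-k`, `-e` and `-n` concrete, here is a minimal Python sketch that reads a bed file in the format described above, checks the two stated assumptions (one chromosome, one bin size), and derives the number of clusters and the neighborhood radius in bins. The function names are hypothetical and the rounding (integer division, with at least one cluster) is an assumption; GRiNCH itself may round differently.

```python
def read_bins(bed_lines):
    """Parse 'chrom<TAB>start<TAB>end<TAB>index' lines; return (number of bins, bin size)."""
    bins = []
    for line in bed_lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, start, end, index = line.rstrip("\n").split("\t")
        bins.append((chrom, int(start), int(end), int(index)))
    if not bins:
        raise ValueError("no bins found")
    chroms = {b[0] for b in bins}
    sizes = {b[2] - b[1] for b in bins}
    if len(chroms) != 1:
        raise ValueError("only cis-interactions are handled: expected one chromosome")
    if len(sizes) != 1:
        # Strict check for illustration; a real file may need a looser rule.
        raise ValueError("bin size is assumed to be the same across all bins")
    return len(bins), sizes.pop()


def number_of_clusters(n, bin_size, k=None, expected_size=1_000_000):
    """k = n / (expected_size / bin_size); an explicit k (-k) overrides expected_size (-e)."""
    if k is not None:
        return k
    # Assumption: round down, and never return fewer than one cluster.
    return max(1, n * bin_size // expected_size)


def radius_in_bins(radius_bp, bin_size):
    """Convert the -n radius in base pairs to a number of bins on either side."""
    return radius_bp // bin_size


# 200 bins of 25kb on chr12, i.e. a 5Mb region (made-up example data).
bed = [f"chr12\t{i * 25000}\t{(i + 1) * 25000}\t{i}" for i in range(200)]
n, bin_size = read_bins(bed)
print(number_of_clusters(n, bin_size))                         # default -e of 1Mb: 5
print(number_of_clusters(n, bin_size, expected_size=500_000))  # -e 500000: 10
print(radius_in_bins(100_000, bin_size))                       # default -n of 100000: 4
```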

Output files

[Step 3] Visualize

Refer to our handy dandy visualization tutorial to generate images of Hi-C heatmaps, GRiNCH clusters, and other 1D epigenetic signals.