Graph-Regularized NMF and Clustering for Hi-C
GRiNCH applies non-negative matrix factorization (NMF) with graph regularization to discover clusters of highly interacting genomic regions from high-throughput chromosome conformation capture (Hi-C) data. GRiNCH can also be used to smooth the input matrices, and it can be applied to data from any 3D genome platform (e.g. HiChIP, SPRITE) provided in a non-negative square matrix format. Now published in Genome Biology: https://doi.org/10.1186/s13059-021-02378-z
The installation instructions below were tested on the CentOS 7 Linux distribution. GSL (GNU Scientific Library) is used to handle matrix- and vector-related operations.
#CHANGE PATHS AS NEEDED:
INCLUDE_PATH = ${CONDA_PREFIX}/include
LIBRARY_PATH = ${CONDA_PREFIX}/lib
conda install -c conda-forge gsl
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CONDA_PREFIX}/lib
make
An executable named grinch will be created in the same directory. A quick test below will print the manual for grinch:
./grinch -h
Note: to implement NNDSVD initialization of the factors, a fast randomized SVD algorithm from RSVDPACK is used. A minor modification was made to the original RSVDPACK code to allow a random seed to be specified. The updated code is included under the modules/random_svd directory. It is compiled by the included Makefile, so no additional installation step is necessary.
./grinch input.txt input.bed [-o output_prefix] [-k number_of_clusters] [-e expected_size_of_cluster] [-n neighborhood_radius] [-l regularization_strength] [-fgs]
>> ./grinch -h
>> ./grinch Huvec_chr12_25kb.txt Huvec_chr12_25kb.bed -o output/Huvec_chr12_25kb
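Since -o does not create a missing output directory, it can help to create it before calling grinch. The following is a minimal Python sketch of that idea, not part of GRiNCH itself. The helper name `prepare_grinch_command` and its arguments are hypothetical; only the flags listed in the usage line above are used.

```python
import os


def prepare_grinch_command(matrix_path, bed_path, output_prefix,
                           k=None, extra_flags=(), grinch_bin="./grinch"):
    """Create the output directory if needed and return the argument list.

    Hypothetical helper: it only builds the command; it does not run grinch.
    """
    out_dir = os.path.dirname(output_prefix)
    if out_dir:
        # grinch will not create this directory on its own.
        os.makedirs(out_dir, exist_ok=True)
    cmd = [grinch_bin, matrix_path, bed_path, "-o", output_prefix]
    if k is not None:
        cmd += ["-k", str(k)]
    cmd += list(extra_flags)  # e.g. ["-s"] to also write the smoothed matrix
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the grinch executable.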
Example input matrix file (sparse format):
0 0 1000.2
0 1 1201.78
10 1 200.7
...
Example input bed file:
chr1 50000 75000 0
chr1 75000 100000 1
chr1 100000 125000 2
...
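As an illustration of the two input formats, here is a small Python sketch that produces lines in the same shape as the examples above. The function names are hypothetical. It writes every non-zero entry of a dense matrix, which is an assumption: this README does not state whether grinch expects both triangles of the symmetric matrix or only one.

```python
def bed_lines(chrom, first_start, bin_size, n_bins):
    """Yield 'chrom<TAB>start<TAB>end<TAB>index' lines for equally sized bins."""
    for idx in range(n_bins):
        start = first_start + idx * bin_size
        yield f"{chrom}\t{start}\t{start + bin_size}\t{idx}"


def sparse_lines(dense):
    """Yield 'row<TAB>col<TAB>value' lines for every non-zero entry.

    Assumption: all non-zero entries are written, including both triangles.
    """
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:
                yield f"{i}\t{j}\t{value}"


if __name__ == "__main__":
    # Three 25kb bins on chr1 starting at 50000, as in the bed example above.
    print("\n".join(bed_lines("chr1", 50000, 25000, 3)))
    print("\n".join(sparse_lines([[1000.2, 1201.78], [1201.78, 0]])))
```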
Position | Parameter | Description | Default Value/Behavior |
---|---|---|---|
1 | input matrix file | Input matrix file path and name. Tab-delimited, in sparse matrix format, no header, e.g. 0 10 1201.78 | N/A |
2 | input bed file | Input bed file mapping each index in the input matrix file to a chromosomal coordinate. Tab-delimited, no header, e.g. chr1 50000 75000 0. Note: the bin size/resolution of the Hi-C data is assumed to be the same across all bins. Also, only cis-interactions are handled, i.e. all bins are assumed to be on the same chromosome. | N/A |
optional | -o | Output file path and prefix. Note: will NOT create a directory if the specified directory does not exist. | ‘output’ |
optional | -k | Number of clusters, an integer value. | n/(1000000/bin size), where n is the dimension of the symmetric input Hi-C matrix and bin size is the resolution in base pairs, i.e. k is set such that the expected size of a cluster is 1Mb. |
optional | -e | An alternative way to specify the number of clusters, by the expected size of a cluster in base pairs, e.g. with -e 500000, k = n/(500000/bin size), where n is the number of bins in the input matrix. Note: -k will override -e. | 1000000, i.e. k = n/(1Mb/bin size) |
optional | -n | Neighborhood radius used in the regularization graph, in base pairs, given as a multiple of the resolution. For example, -n 100000 gives a neighborhood radius of 4 bins at 25kb resolution, so the 4 adjacent regions on either side of a given region are used to regularize or ‘smooth’ the matrix factors. Increase for lower-depth or sparser data, to use more neighbors for smoothing. | 100000 |
optional | -l | Strength of regularization. | 1 |
optional | -s | Print the smoothed matrix to file, in tab-delimited sparse matrix format (e.g. 25000 50000 500.3). This file can be large. | Do NOT output smoothed matrix. |
optional | -f | Print to file the factor U and V. File may be large, since the matrix is written in a dense format, especially for higher-resolution input. | Do NOT output factor matrix. |
optional | -g | Print to file the graph used in regularization. File may be large, since the matrix is written in a dense format. | Do NOT output graph. |
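The defaults for -k, -e and -n in the table above come down to simple arithmetic on the number of bins and the bin size. The Python sketch below works through that arithmetic. The function names are hypothetical, and the rounding to an integer is an assumption, since the table does not say how grinch rounds a non-integer k.

```python
def bed_summary(lines):
    """Return (number of bins, bin size) from bed lines.

    Strict check of the stated assumptions: one chromosome and one bin size.
    A shorter final bin would be rejected by this sketch.
    """
    chroms, sizes, n_bins = set(), set(), 0
    for line in lines:
        if not line.strip():
            continue
        chrom, start, end, _index = line.split()[:4]
        chroms.add(chrom)
        sizes.add(int(end) - int(start))
        n_bins += 1
    if len(chroms) != 1 or len(sizes) != 1:
        raise ValueError("expected exactly one chromosome and one bin size")
    return n_bins, sizes.pop()


def default_k(n_bins, bin_size, expected_cluster_size=1_000_000):
    """k = n / (expected cluster size / bin size); rounding is an assumption."""
    bins_per_cluster = expected_cluster_size / bin_size
    return max(1, round(n_bins / bins_per_cluster))


def radius_in_bins(radius_bp, bin_size):
    """Neighborhood radius (-n, in base pairs) expressed in bins."""
    return radius_bp // bin_size
```

For example, 5000 bins at 25kb resolution give `default_k(5000, 25000)` = 125 with the default 1Mb cluster size and 250 with `expected_cluster_size=500000`, and `radius_in_bins(100000, 25000)` is 4, matching the -n description.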
.tads returns a list of putative TADs, each line with the first and last bin of each TAD (inclusive of last bin).
.log returns a plain-text file with list of parameter values used and time/memory consumption.
.smoothed returns the smoothed matrix in a tab-delimited sparse matrix format. This file may be large.
.U and .V return the factors U and V respectively, in dense matrix format. File may be large, especially for higher-resolution input. Note that since the input cis-interaction Hi-C matrix is symmetric, U and V are equivalent up to some scaling factor and numerical error.
.graph returns the graph used in regularization. File may be large, since the matrix is written in a dense format.

Refer to our handy dandy visualization tutorial to generate images of Hi-C heatmaps, GRiNCH clusters, and other 1D epigenetic signals.
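Because each TAD in the .tads output is described by its first and last bin, with the last bin included, converting a TAD to a genomic interval takes a little care. The Python sketch below shows one way to do it. It assumes the bins are given as indices into the input bed file; if your .tads file already lists coordinates, the lookup is unnecessary. The function names are hypothetical.

```python
def parse_bed(lines):
    """Map bin index -> (chrom, start, end) from bed lines."""
    bed = {}
    for line in lines:
        if not line.strip():
            continue
        chrom, start, end, index = line.split()[:4]
        bed[int(index)] = (chrom, int(start), int(end))
    return bed


def tad_coordinates(first_bin, last_bin, bed):
    """Genomic interval of a TAD; the last bin is inclusive."""
    chrom, start, _ = bed[first_bin]
    _, _, end = bed[last_bin]
    return chrom, start, end


def tad_size_in_bins(first_bin, last_bin):
    """Number of bins in a TAD, counting both end bins."""
    return last_bin - first_bin + 1
```

With the three example bed lines shown earlier, a TAD spanning bins 0 to 2 covers 3 bins and maps to chr1 50000 to 125000.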