MUSCARI

Multi-task Spectral Clustering AlgoRIthm

View the Project on GitHub Roy-lab/Muscari

MUSCARI (Multi-task Spectral Clustering AlgoRIthm)

Muscari is a new multi-task, graph-based clustering algorithm for identifying gene co-expression modules (or subnetworks) jointly across species from species-specific, genome-wide co-expression networks. It leverages both the phylogenetic relationship between species and the graph-based nature of co-expression matrices. Muscari is based on the Arboretum-HiC (https://doi.org/10.1186/s13059-016-0962-8) multi-task graph clustering algorithm, which defines one task per species, each a spectral graph clustering problem. The multi-task learning framework simultaneously searches for groups of genes that interact in multiple species while accounting for the phylogenetic relationship between species. Unlike Arboretum-HiC, Muscari takes gene expression matrices as input and converts them into fully connected, weighted gene co-expression networks.


INSTALLATION


1. Required environment

2. Download the code

3. Compiling Muscari code

Build with the Makefile in the code directory:

make

If compilation succeeds, you will find the program named “muscari” in the code directory.



RUNNING MUSCARI


1. Preparation of input requirement files

The following files must be prepared before running Muscari. Please follow the descriptions carefully.
Example files for each requirement are in the sample_data directory.

2. Running Muscari (shell script)

We provide a wrapper shell script, run_muscari.sh, which performs (a) the eigenvector matrix calculation (MATLAB) and (b) the Muscari clustering (C++).

Note that the script is set up to run Muscari on the sample data provided in the sample_data directory. To use run_muscari.sh as is, first put the requirement files you prepared above into the sample_data directory.

The run_muscari.sh script requires the following arguments:

  • K: The number of resultant modules
  • P: transition probability; start with the default of 0.8, then try 0.5, 0.2, …
  • X: fixed covariance value; start with the default of 0.1, then try 0.15, 0.2, …
  • best species: The name of the most well-studied species. If an orthogroup contains a gene from this species, that gene's ID is reported instead of the OGID.
  • output dir name: The name of the directory in which the results will be written
    USAGE: ./run_muscari.sh [K] [P] [X] [best] [output]
    e.g.  ./run_muscari.sh 10 0.8 0.1 ath result_k10
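
The suggested values for P and X lend themselves to a small parameter sweep. The following is a minimal sketch and not part of the Muscari distribution: the function name print_sweep, the output directory naming scheme, and the choice of three values per parameter are assumptions made for illustration. It only prints the commands so that they can be inspected before anything is run.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one run_muscari.sh command per (P, X) combination.
# print_sweep and the result_k*_p*_x* naming scheme are hypothetical,
# not part of Muscari; the P and X values are the ones suggested above.
print_sweep() {
  local k="$1" best="$2"
  local p x
  for p in 0.8 0.5 0.2; do
    for x in 0.1 0.15 0.2; do
      # Argument order follows the USAGE line: [K] [P] [X] [best] [output]
      echo "./run_muscari.sh ${k} ${p} ${x} ${best} result_k${k}_p${p}_x${x}"
    done
  done
}

print_sweep 10 ath
```

Giving each run its own output directory keeps the results of different settings apart, so they can be compared afterwards.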
    

(optional) 2-1. preparation of eigenvector matrices (MATLAB)

This step is part of run_muscari.sh and runs automatically when you use that script; it is described here to show what the script does.
The MATLAB script eigvecmat_calc.m generates an eigenvector matrix with the user-specified number of eigenvectors, k. Input arguments are:

  • K: The number of resultant modules. Note that this is also the number of eigenvectors. For example, if k=10, the eigenvector matrix (the output of the eigvecmat_calc.m script) will contain, for each gene, a vector over 10 eigenvectors.
  • output prefix: the prefix (usually the species name) of the output eigenvector matrix file.
    USAGE: matlab -r eigvecmat_calc\(\'[value_matrix]\',K,\'[output_prefix]\'\)
    e.g.  matlab -r eigvecmat_calc\(\'sample_data/ath_sample_matrix.txt\',10,\'ath\'\)
    

    The script will generate an output file named “[output_prefix].eigvecs.matrix.txt”. Please also refer to the wrapper script for usage.

    matlab -r run_eigvecmat_calculation\(10\) (k=10 for example)
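
Since the output prefix is usually the species name, the eigenvector step is presumably repeated once per species. A minimal sketch of such a loop follows. It assumes every species has a matrix file named like the single example above (sample_data/<species>_sample_matrix.txt); the function name print_eigvec_cmds and the second species code osa are placeholders, not names taken from Muscari. The commands are only printed, not run.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one eigvecmat_calc call per species.
# print_eigvec_cmds is a hypothetical helper; the <species>_sample_matrix.txt
# naming is assumed from the single example above.
print_eigvec_cmds() {
  local k="$1"; shift
  local sp
  for sp in "$@"; do
    # Double quotes around the MATLAB statement replace the backslash
    # escaping used in the USAGE line; the statement itself is the same.
    printf "matlab -r \"eigvecmat_calc('sample_data/%s_sample_matrix.txt',%s,'%s')\"\n" \
      "$sp" "$k" "$sp"
  done
}

print_eigvec_cmds 10 ath osa
```

MATLAB's -r option leaves the session open after the statement finishes, so if eigvecmat_calc.m does not exit on its own, appending "; exit" to the statement may be needed for unattended runs.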
    

(optional) 2-2. detailed parameter usage of muscari

This step is part of run_muscari.sh and runs automatically when you use that script; it is described here to show what the parameters are.
The argument keys of the muscari program are as follows:

- -s: species order file name (requirement 2)
- -e: orthogroup ID (OGID) to gene ID file (requirement 3)
- -k: number of resultant modules
- -t: species tree file name (requirement 1)
- -c: config file name (made in step1)
- -r: species tree file name
- -o: output directory name
- -m: defines the mode in which the Arboretum algorithm is to be used; learn|generate|visualize|crossvalidate are the options, default=learn
- -b: name of the most well-studied species
- -i: initialization method for transition probabilities for cluster membership across species; uniform|branchlength, default=uniform
- -p: transition probability value
- -x: fixed covariance value
- -w: a true|false option for overwriting results in an existing directory. default=true
- -f: a true|false option for running initial clustering generation. default=true
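
To show how these keys combine, the following minimal sketch assembles a muscari command line and prints it without running it. It is an illustration only: the function name print_muscari_cmd and all file names (species_order.txt, ogid_to_gene.txt, species_tree.txt, config.txt) are hypothetical placeholders, and which keys are mandatory is not stated here. The wrapper script remains the authoritative reference for the actual call.

```shell
#!/usr/bin/env bash
# Dry-run sketch: assemble a muscari command line from the keys listed above.
# All file names are hypothetical placeholders; substitute your own
# requirement files. Keys not passed here keep their documented defaults.
print_muscari_cmd() {
  local k="$1" p="$2" x="$3" best="$4" outdir="$5"
  # echo joins its arguments with single spaces into one command line.
  echo "code/muscari" \
    "-s species_order.txt" \
    "-e ogid_to_gene.txt" \
    "-t species_tree.txt" \
    "-c config.txt" \
    "-k ${k} -p ${p} -x ${x}" \
    "-b ${best} -o ${outdir}" \
    "-m learn -i uniform"
}

print_muscari_cmd 10 0.8 0.1 ath result_k10
```

The -m and -i values shown are the documented defaults, written out explicitly so that the printed command records the full configuration of a run.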



3. Outputs

If Muscari finishes successfully, you will find the result files in the output directory you specified.
The following result files contain the most relevant information about the clustering result: