A bioinformatic implementation of mutual information as a distance measure for identification of clusters of variables

Chris Pardy, Susan Wilson


The size of data sets produced in genetic experiments is steadily increasing. Very often there are many more variables than observations, leading to the so-called ``large $p$, small $n$'' problem. For such data, clustering and distance-based procedures are useful tools for identifying groups of variables associated with outcomes of interest. We develop a novel approach that uses mutual information as a measure of distance (here, dependency) between probability distributions. The measure is valid for comparing pairs of variables that are both continuous, both discrete, or one of each. This yields an overall information matrix that can be used as a distance matrix in clustering procedures and to define a weighted network of associations between variables. We present computational aspects of implementing our procedures in R.
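As a rough illustration of the idea, the sketch below builds a pairwise distance matrix from plug-in mutual information estimates. It is not the authors' R implementation. It assumes that continuous variables are handled by equal-width binning and that mutual information is converted to a distance via $1 - I(X;Y)/H(X,Y)$; both choices, along with all function names and the toy data, are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def discretize(values, bins=4):
    """Equal-width binning of a continuous variable (an illustrative assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels):
    """Plug-in (empirical) Shannon entropy in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Plug-in estimate I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete labels."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def mi_distance(x, y):
    """Assumed normalisation 1 - I(X;Y)/H(X,Y): 0 = fully dependent, 1 = independent."""
    joint = entropy(list(zip(x, y)))
    if joint == 0.0:
        return 0.0  # both variables are constant
    return max(0.0, 1.0 - mutual_information(x, y) / joint)


def distance_matrix(columns):
    """Pairwise distances between variables, each given as a list of discrete labels."""
    p = len(columns)
    return [[mi_distance(columns[i], columns[j]) for j in range(p)] for i in range(p)]


# Hypothetical toy data: one discrete variable, one continuous variable that
# tracks it closely, and one unrelated binary variable.
genotype = [0, 0, 1, 1, 2, 2, 0, 1]
expression = [0.1, 0.2, 1.1, 0.9, 2.2, 1.9, 0.0, 1.0]
noise = [0, 1, 0, 1, 0, 1, 1, 0]

columns = [genotype, discretize(expression, bins=3), noise]
D = distance_matrix(columns)
```

The resulting matrix `D` plays the role of the information-based distance matrix described above: strongly dependent variables sit close together and unrelated ones far apart, so it could be passed to any clustering routine that accepts precomputed distances.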

Mutual information, bioinformatics, R

DOI: http://dx.doi.org/10.21914/anziamj.v52i0.3959

ANZIAM Journal, ISSN 1446-8735, copyright Australian Mathematical Society.