A bioinformatic implementation of mutual information as a distance measure for identification of clusters of variables

Chris Pardy, Susan Wilson


The size of data sets produced in genetic experiments is steadily increasing. Very often there are many more variables than observations, leading to the so-called ``large $p$, small $n$'' problem. For such data, clustering and distance-based procedures are useful tools for identifying groups of variables associated with outcomes of interest. We develop a novel approach that uses mutual information as a measure of distance (here, dependency) between probability distributions. The measure is valid for comparisons between pairs of variables that are both continuous, both discrete, or one of each. This gives an overall information matrix that can be used as a distance matrix in clustering procedures and to define a so-called weighted network of associations between variables. We present computational aspects of implementing our procedures in R.
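As a rough illustration of the idea of turning pairwise mutual information into a matrix usable by a clustering routine, the following minimal sketch handles only the case where both variables are discrete. It is not the authors' R implementation: it is written in Python, it uses the simple plug-in (empirical frequency) estimator, and the conversion $d(X,Y) = 1 - I(X;Y)/H(X,Y)$ is an assumption chosen for illustration, as are all function and variable names. The continuous and mixed cases mentioned above are not covered.

```python
import math
from collections import Counter


def entropy(values):
    """Plug-in (empirical) Shannon entropy of a discrete sample, in nats."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def mi_dissimilarity(x, y):
    """Assumed conversion to a dissimilarity: 1 - I(X;Y) / H(X,Y).

    Equals 0 when each variable determines the other and 1 when the
    samples are empirically independent.
    """
    h_xy = entropy(list(zip(x, y)))
    if h_xy == 0.0:  # both variables are constant in the sample
        return 0.0
    d = 1.0 - mutual_information(x, y) / h_xy
    return min(1.0, max(0.0, d))  # guard against floating-point drift


def dissimilarity_matrix(columns):
    """Pairwise dissimilarities for a dict mapping variable name -> sample."""
    names = list(columns)
    return {(a, b): mi_dissimilarity(columns[a], columns[b])
            for a in names for b in names}


# Hypothetical toy data: three discrete variables observed on four samples.
data = {
    "v1": [0, 0, 1, 1],
    "v2": ["a", "a", "b", "b"],  # a relabelling of v1
    "v3": [0, 1, 0, 1],          # empirically independent of v1
}
dist = dissimilarity_matrix(data)
```

In this toy example `dist[("v1", "v2")]` is zero because `v2` is a relabelling of `v1`, while `dist[("v1", "v3")]` is close to one. A matrix of this kind could then be passed to a hierarchical clustering routine in place of a conventional distance matrix.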



Keywords: mutual information, bioinformatics, R


DOI: http://dx.doi.org/10.21914/anziamj.v52i0.3959

ANZIAM Journal, ISSN 1446-8735, copyright Australian Mathematical Society.