Selection bias in plots of microarray or other data that have been sampled from a high-dimensional space

Authors

  • J. H. Maindonald
  • C. J. Burden

DOI:

https://doi.org/10.21914/anziamj.v46i0.947

Abstract

For data that have many more features than observations, finding a low-dimensional representation that accurately reflects known prior groupings is non-trivial. Microarray gene expression data, used to create a "signature" or discrimination rule that distinguishes cancer tissues classified according to type of cancer, are an important special case. The optimal number of features is suitably determined using cross-validation, in which each of several parts of the data becomes in turn the test set, with the remaining data used for training. At each such division or "fold" of the data into a training and test set, both the selection of features and the derivation of the discriminant rule must be repeated. Use of the complete data for prior selection of features can lead to a grossly optimistic assessment of predictive accuracy and, in scatter plots of discriminant function scores, to a spurious or exaggerated separation between groups. At each division or fold, a plot of test scores on the second versus the first discriminant axis can be drawn. This paper presents a method for bringing these different plots, which have different choices of features and relate to different coordinate systems, into a single plot in which the configuration of points fairly reflects the accuracy of the discriminant procedure. The methodology is applicable, in principle, to any discriminant analysis method, or to ordination or multidimensional scaling, for obtaining a low-dimensional graphical representation of data.
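
The selection effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the data are simulated pure noise, the feature score is a plain difference in class means, and the classifier is a nearest-centroid rule standing in for a discriminant analysis. All function names and parameters (`make_noise_data`, `select_features`, `cv_accuracy`, `k`, `n_folds`) are hypothetical and chosen only for this illustration.

```python
import random


def make_noise_data(n_samples=40, n_features=2000, seed=0):
    """Simulated pure noise: the labels carry no information about X."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # two balanced, arbitrary groups
    return X, y


def class_means(X, y, rows, features):
    """Per-class mean of each listed feature, using only the given rows."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][j] for i in members) / len(members)
                        for j in features]
    return means


def select_features(X, y, rows, k):
    """Keep the k features with the largest absolute class-mean difference."""
    all_features = list(range(len(X[0])))
    m = class_means(X, y, rows, all_features)
    gaps = [abs(m[0][j] - m[1][j]) for j in all_features]
    return sorted(all_features, key=lambda j: gaps[j], reverse=True)[:k]


def predict(X, means, features, i):
    """Nearest-centroid rule on the selected features (assumed stand-in)."""
    def dist(label):
        return sum((X[i][j] - c) ** 2
                   for j, c in zip(features, means[label]))
    return 0 if dist(0) <= dist(1) else 1


def cv_accuracy(X, y, k=10, n_folds=5, select_inside_fold=True):
    """Cross-validated accuracy, selecting features per fold or once up front."""
    all_rows = list(range(len(X)))
    # Selecting on the complete data lets the test rows influence the features.
    preselected = (None if select_inside_fold
                   else select_features(X, y, all_rows, k))
    correct = 0
    for fold in range(n_folds):
        test = [i for i in all_rows if i % n_folds == fold]
        train = [i for i in all_rows if i % n_folds != fold]
        features = (select_features(X, y, train, k)
                    if select_inside_fold else preselected)
        means = class_means(X, y, train, features)
        correct += sum(predict(X, means, features, i) == y[i] for i in test)
    return correct / len(X)


if __name__ == "__main__":
    X, y = make_noise_data()
    print("selection repeated in each fold:",
          cv_accuracy(X, y, select_inside_fold=True))
    print("selection on the complete data: ",
          cv_accuracy(X, y, select_inside_fold=False))
```

Because the simulated groups are arbitrary, an honest estimate should sit near chance. Under these assumptions, selecting features once on the complete data would be expected to report a noticeably higher accuracy, which is the optimistic assessment the abstract warns about.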

Published

2005-03-15

Section

Proceedings Computational Techniques and Applications Conference