ANZIAM J. 46(E) ppC59--C74, 2005.

Selection bias in plots of microarray or other data that have been sampled from a high-dimensional space

J. H. Maindonald

C. J. Burden

(Received 16 November 2004, revised 14 February 2005)


For data that have many more features than observations, finding a low-dimensional representation that accurately reflects known prior groupings is non-trivial. An important special case is microarray gene expression data that are used to create a ``signature'' or discrimination rule that distinguishes cancer tissues classified by type of cancer. The optimal number of features can suitably be determined by cross-validation, in which each of several parts of the data becomes in turn the test set, with the remaining data used for training. At each such division or ``fold'' of the data into a training and test set, both the selection of features and the derivation of the discriminant rule must be repeated. Using the complete data for prior selection of features can lead to a grossly optimistic assessment of predictive accuracy and, in scatterplots of discriminant function scores, to a spurious or exaggerated separation between groups. At each division or fold, a plot of test scores on the second versus the first discriminant axis can be drawn. This paper presents a method for bringing these different plots, which have different choices of features and relate to different coordinate systems, into a single plot in which the configuration of points fairly reflects the accuracy of the discriminant procedure. The approach applies, in principle, to any discriminant analysis method, or to ordination or multidimensional scaling, that is used to obtain a low-dimensional graphical representation of data.
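The selection bias described above can be illustrated with a minimal sketch. It is not the authors' code, and every choice in it is an assumption made for illustration: pure-noise Gaussian data with no real group difference, ranking of features by absolute difference in class means, a nearest-centroid rule standing in for the discriminant rule, and the hypothetical helper names `select_features`, `centroid`, `predict` and `cv_accuracy`. It compares five-fold cross-validation in which features are chosen once from the complete data with cross-validation in which selection is repeated within each fold.

```python
import random


def select_features(X, y, k):
    """Return indices of the k features with the largest absolute
    difference between the two class means (an illustrative criterion)."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(sum(g0) / len(g0) - sum(g1) / len(g1)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def centroid(rows, features):
    """Mean of the given rows, restricted to the selected features."""
    return [sum(r[j] for r in rows) / len(rows) for j in features]


def predict(row, features, c0, c1):
    """Assign the row to the class whose centroid is nearer (squared distance)."""
    d0 = sum((row[j] - m) ** 2 for j, m in zip(features, c0))
    d1 = sum((row[j] - m) ** 2 for j, m in zip(features, c1))
    return 0 if d0 <= d1 else 1


def cv_accuracy(X, y, k, n_folds, select_within_fold):
    """Cross-validated accuracy. If select_within_fold is False, features
    are chosen once from the complete data, which is the biased procedure."""
    n = len(X)
    folds = [list(range(i, n, n_folds)) for i in range(n_folds)]
    global_features = None if select_within_fold else select_features(X, y, k)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        if select_within_fold:
            features = select_features(X_train, y_train, k)
        else:
            features = global_features
        c0 = centroid([r for r, l in zip(X_train, y_train) if l == 0], features)
        c1 = centroid([r for r, l in zip(X_train, y_train) if l == 1], features)
        for i in test_idx:
            correct += predict(X[i], features, c0, c1) == y[i]
    return correct / n


# Assumed toy data: 40 observations, 1000 noise features, two arbitrary groups.
rng = random.Random(0)
n_obs, n_feat = 40, 1000
X = [[rng.gauss(0.0, 1.0) for _ in range(n_feat)] for _ in range(n_obs)]
y = [i % 2 for i in range(n_obs)]

biased = cv_accuracy(X, y, k=10, n_folds=5, select_within_fold=False)
honest = cv_accuracy(X, y, k=10, n_folds=5, select_within_fold=True)
print(f"features selected on complete data: accuracy = {biased:.2f}")
print(f"features selected within each fold: accuracy = {honest:.2f}")
```

Because the groups are arbitrary labels on noise, no rule can truly do better than chance. Any apparent accuracy well above one half in the first line of output is therefore due solely to selecting features from the complete data before cross-validation.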



J. H. Maindonald
Centre for Bioinformation Science, Mathematical Sciences Institute, Australian National University, Canberra, ACT 0200, Australia.
C. J. Burden
Centre for Bioinformation Science, John Curtin School of Medical Research & Mathematical Sciences Institute, Australian National University, Canberra, ACT 0200, Australia.

Published 15 March 2005, amended 18 March 2005. ISSN 1446-8735

