ANZIAM J. 47(EMAC2005) pp.C419--C431, 2006.

Minimising fourth order correlations improves latent semantic analysis performance

B. M. Pincombe

(received 14 October 2005; revised 29 September 2006)

Abstract

Latent Semantic Analysis (LSA) uses semantic correlations across a corpus to reduce problems with polysemy, synonymy and inflexion when assessing document similarity. It uses singular value decomposition (SVD) to estimate a generalised linear model. This model assumes that the appearance of terms in documents results from additive noise and the product of topic and mixing matrices. Here, only the largest fourth order pairwise cross cumulants in the SVD output are minimised. Improved performance relative to LSA, as measured using precision-recall curves, is shown on the Medlars test set for a small number of retained vectors. This approach avoids the assumptions and complications of moving towards full higher order decorrelation and is also shown to produce better precision-recall curves than JADE and FastICA on this standard data set. The conclusion is that minimising fourth order correlations improves the performance of LSA on at least some information retrieval tasks. Three tasks likely to benefit from removing a small number of the largest pairwise cross cumulants are identification of writing genre, detection of copied computer code, and retrieval of objects or people from video streams.
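As a rough illustration of the quantities named in the abstract, the following Python sketch computes a truncated SVD of a term-by-document matrix, ranks pairs of retained vectors by the magnitude of their fourth order cross cumulant, and reduces that cumulant for one pair by a plane rotation. It is not the paper's algorithm. The function names, the choice of the right singular vectors as the signals, and the grid search over rotation angles are all assumptions made for this example, since the abstract does not specify how the minimisation is carried out.

```python
import numpy as np

def lsa_vectors(term_doc, k):
    """Thin SVD of a term-by-document matrix, keeping the k leading components."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def cross_cumulant_22(x, y):
    """Sample estimate of the fourth order cross cumulant cum(x, x, y, y)."""
    x = x - x.mean()
    y = y - y.mean()
    return (np.mean(x**2 * y**2)
            - np.mean(x**2) * np.mean(y**2)
            - 2.0 * np.mean(x * y) ** 2)

def largest_cross_cumulant_pairs(V, n_pairs):
    """Rank pairs of rows of V by |cum(v_i, v_i, v_j, v_j)|, largest first."""
    k = V.shape[0]
    scored = [(abs(cross_cumulant_22(V[i], V[j])), i, j)
              for i in range(k) for j in range(i + 1, k)]
    scored.sort(reverse=True)
    return scored[:n_pairs]

def rotate_pair_to_minimise(x, y, n_angles=181):
    """Assumed minimisation step: grid search over plane rotations of (x, y)."""
    def rotated(t):
        c, s = np.cos(t), np.sin(t)
        return c * x + s * y, -s * x + c * y
    thetas = np.linspace(0.0, np.pi / 2, n_angles)  # includes the identity, t = 0
    best = min(thetas, key=lambda t: abs(cross_cumulant_22(*rotated(t))))
    xr, yr = rotated(best)
    return xr, yr, best

# Toy data: a random count matrix standing in for a real term-by-document matrix.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(40, 60)).astype(float)
U, s, Vt = lsa_vectors(X, k=5)
score, i, j = largest_cross_cumulant_pairs(Vt, n_pairs=1)[0]
xr, yr, theta = rotate_pair_to_minimise(Vt[i], Vt[j])
```

Because the angle grid includes zero rotation, the rotated pair never has a larger cross cumulant magnitude than the original pair. A practical method would need a principled choice of angle and of how many pairs to treat, neither of which is given in the abstract.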

Authors

B. M. Pincombe
Intelligence, Surveillance and Reconnaissance Division, Defence Science and Technology Organisation, Edinburgh, Australia. Brandon.Pincombe@dsto.defence.gov.au

Published October 26, 2006. ISSN 1446-8735

References

  1. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas and R. A. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, vol. 41, pp. 391--407, 1990.
  2. C. H. Papadimitriou, P. Raghavan, H. Tamaki and S. Vempala, Latent semantic indexing: a probabilistic analysis. In Proc. ACM Conference on Principles of Database Systems, Seattle, WA, 1998, pp. 159--168.
  3. T. Hofmann, Probabilistic latent semantic indexing. In 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, 1999, pp. 50--57.
  4. D. Blei, A. Ng and M. Jordan, Latent Dirichlet allocation. In Neural Information Processing Systems 14, MIT Press, Cambridge, MA, 2002, pp. 601--608. http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=944937
  5. D. Blei, A. Ng and M. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, vol. 3, pp. 993--1022, 2003.
  6. T. L. Griffiths and M. Steyvers, A probabilistic approach to semantic representation. In Proc. of the Twenty-Fourth Annual Conference of Cognitive Science Society, George Mason University, Fairfax, VA, 2002.
  7. M. D. Lee, B. Pincombe and M. Welsh, An empirical evaluation of models of text document similarity. In Proc. of the XXVII Annual Conference of the Cognitive Science Society, pp. 1254--1259, 2005.
  8. A. Hyvärinen, J. Karhunen and E. Oja, Independent component analysis. John Wiley & Sons, New York, 2001.
  9. J. F. Cardoso and A. Souloumiac, Blind beamforming for non Gaussian signals. IEE Proceedings-F, vol. 140, no. 6, pp. 362--370, December 1993.
  10. P. Nakov, Latent semantic analysis for German literature investigation. In Proceedings of the 7th Fuzzy Days'01, International Conference on Computational Intelligence, LNCS 2206, pp. 834--841, 2001.
  11. P. Nakov, Latent semantic analysis of textual data. In Proceedings of the International Conference on Computer Systems and Technologies, pp. V.3-1--V.3-5, 2000. http://portal.acm.org/citation.cfm?id=365382
  12. F. Souvannavong, L. Hohl, B. Merialdo and B. Huet, Structurally enhanced latent semantic analysis for video object retrieval. IEE Proceedings Vision, Image and Signal Processing, vol. 152, no. 6, pp. 859--867, 2005.