Minimising fourth order correlations improves latent semantic analysis performance
Abstract. Latent Semantic Analysis (LSA) uses semantic correlations across a corpus to reduce problems with polysemy, synonymy and inflexion when assessing document similarity. It uses singular value decomposition (SVD) to estimate a generalised linear model, which assumes that the appearance of terms in documents results from the product of topic and mixing matrices plus additive noise. Here, only the largest fourth order pairwise cross cumulants in the SVD output are minimised. Improved performance relative to LSA, as measured using precision-recall curves, is shown on the Medlars test set for a small number of retained vectors. This approach avoids the assumptions and complications of moving towards full higher order decorrelation, and it also produces better precision-recall curves than JADE and FastICA on this standard data set. The conclusion is that minimising fourth order correlations improves the performance of LSA on at least some information retrieval tasks. Three tasks likely to benefit from removing a small number of the largest pairwise cross cumulants are identification of writing genre, detection of copied computer code, and retrieval of objects or people from video streams.
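As background to the abstract, the two quantities involved can be sketched briefly: standard LSA keeps the leading singular triplets of a term-document matrix, and a fourth order pairwise cross cumulant cum(x, x, y, y) measures the residual higher order dependence that SVD (which only decorrelates at second order) leaves behind. The sketch below is illustrative only and uses a random toy matrix; the function names and the choice of cumulant are our assumptions, not the paper's implementation.

```python
import numpy as np

def lsa(term_doc, k):
    """Standard LSA: truncated SVD keeping the k largest singular triplets."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def cross_cumulant(x, y):
    """Fourth order pairwise cross cumulant cum(x, x, y, y),
    computed from sample moments after mean removal."""
    x = x - x.mean()
    y = y - y.mean()
    return (np.mean(x**2 * y**2)
            - np.mean(x**2) * np.mean(y**2)
            - 2 * np.mean(x * y)**2)

rng = np.random.default_rng(0)
# Toy term-document count matrix: 50 terms, 30 documents (hypothetical data).
X = rng.poisson(1.0, size=(50, 30)).astype(float)
U, s, Vt = lsa(X, k=5)

# Document coordinates in the latent space. SVD guarantees these are
# uncorrelated at second order, but their fourth order cross cumulants
# are generally nonzero; the paper's method targets the largest of them.
docs = np.diag(s) @ Vt
c = cross_cumulant(docs[0], docs[1])
```

Under the paper's approach, one would then apply a rotation in this latent space chosen to shrink the largest such cumulants, rather than pursuing the full higher order decorrelation attempted by ICA methods such as JADE or FastICA.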
Proceedings Engineering Mathematics and Applications Conference