Minimising fourth order correlations improves latent semantic analysis performance

B. M. Pincombe

Abstract


Latent Semantic Analysis (LSA) uses semantic correlations across a corpus to reduce problems with polysemy, synonymy and inflexion when assessing document similarity. It uses singular value decomposition (SVD) to estimate a generalised linear model. This model assumes that the appearance of terms in documents results from the product of topic and mixing matrices plus additive noise. Here, only the largest fourth order pairwise cross cumulants in the SVD output are minimised. Improved performance relative to LSA, as measured using precision-recall curves, is shown on the Medlars test set for a small number of retained vectors. This approach avoids the assumptions and complications of moving towards full higher order decorrelation, and it is also shown to produce better precision-recall curves than JADE and FastICA on this standard data set. The conclusion is that minimising fourth order correlations improves the performance of LSA on at least some information retrieval tasks. Three tasks likely to benefit from removing a small number of the largest pairwise cross cumulants are identification of writing genre, detection of copied computer code, and retrieval of objects or people from video streams.
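
The idea of reducing only the largest pairwise cross cumulant in the SVD output can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's procedure: the abstract does not specify the cumulant estimator or the minimisation method, so the sketch assumes the cross cumulant cum(x, x, y, y) of mean-centred vectors and a simple grid search over a plane rotation of one pair of retained vectors. The toy matrix and the names `cross_cumulant` and `rotate_largest_pair` are hypothetical.

```python
import numpy as np

def cross_cumulant(x, y):
    """Sample fourth order cross cumulant cum(x, x, y, y) after mean-centring."""
    x = x - x.mean()
    y = y - y.mean()
    return (np.mean(x**2 * y**2)
            - np.mean(x**2) * np.mean(y**2)
            - 2 * np.mean(x * y) ** 2)

def rotate_largest_pair(V, n_angles=181):
    """Rotate the pair of columns of V with the largest |cross cumulant|.

    Assumption: a grid search over rotation angles stands in for whatever
    optimiser the paper actually uses.
    """
    k = V.shape[1]
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    # Pick the pair of retained vectors with the largest cross cumulant.
    i, j = max(pairs, key=lambda p: abs(cross_cumulant(V[:, p[0]], V[:, p[1]])))

    def rotated(theta):
        c, s = np.cos(theta), np.sin(theta)
        return c * V[:, i] + s * V[:, j], -s * V[:, i] + c * V[:, j]

    # The grid includes (approximately) zero, so the result is never worse
    # than leaving the pair unrotated.
    angles = np.linspace(-np.pi / 4, np.pi / 4, n_angles)
    best = min(angles, key=lambda t: abs(cross_cumulant(*rotated(t))))

    out = V.copy()
    out[:, i], out[:, j] = rotated(best)
    return out, (i, j), best

# Toy term-by-document count matrix (hypothetical data, not Medlars).
rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(50, 30)).astype(float)

# Standard LSA step: SVD, then keep a small number of vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 4
V = Vt[:k].T  # one row per document, one column per retained vector

V_new, pair, theta = rotate_largest_pair(V)
```

Because the update is a plane rotation, the retained vectors stay orthonormal and span the same subspace; only the basis within that subspace changes. Repeating the step for a few of the largest pairs would correspond to removing "a small number of the largest pairwise cross cumulants" in the sense of the abstract, again under the assumptions above.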



DOI: http://dx.doi.org/10.21914/anziamj.v47i0.1053



ANZIAM Journal, ISSN 1446-8735, copyright Australian Mathematical Society.