Using LSA to Compute Word Sense Frequencies
Project Description:
The goal of this project is to study Latent Semantic Analysis (LSA)
based methods for estimating word sense frequencies. Existing word frequency
measures are based on the relative frequency with which a word appears in a
corpus. To estimate word-sense frequencies for a particular word W, we need to
be able to tag every instance of W in the corpus with its corresponding sense.
Hence, the problem of word-sense frequency estimation is closely related to the
problem of Word Sense Disambiguation (WSD).
Our approach estimates word sense frequencies by unsupervised, model-based clustering of the contexts in which a polysemous word occurs, with each context represented by its LSA vector. Given such a clustering, the word sense frequency distribution can be computed as the relative count of instances in each cluster.
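As a minimal sketch of the last step, assuming each occurrence of the word has already been assigned a cluster label, the sense frequency distribution is the normalized count of labels. The function name `sense_frequencies` and the toy labels are hypothetical and serve only as an illustration:

```python
from collections import Counter


def sense_frequencies(cluster_labels):
    """Return the relative frequency of each cluster label.

    Each label stands for one induced word sense; the input holds one
    label per occurrence of the polysemous word in the corpus.
    """
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy example: eight occurrences of a word, grouped into three clusters.
labels = [0, 0, 1, 0, 2, 1, 0, 0]
freqs = sense_frequencies(labels)
print(freqs)
```

How well these frequencies match true sense frequencies depends on how closely the clusters correspond to actual senses.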
We will evaluate the proposed framework and algorithms experimentally, using test data and standardized scoring from Senseval-3, http://www.senseval.org/.
We will focus our investigation on three aspects within this framework:
1. Minimizing or eliminating the need for supervision (manual data inspection and tagging). In particular, we will investigate corpus-based methodologies that automatically find the best clustering of the data along with the optimal number of clusters (corresponding to word senses). To find the most meaningful clustering and prevent overfitting the data, we plan to use regularization methods such as Minimum Description Length (MDL) or Structural Risk Minimization (SRM).
2. Context Size: We believe that the size of the context in the LSA representation may play an important role in WSD performance. In our research, we will experimentally study the effect of the context size on disambiguation performance.
3. Dimensionality of the LSA representation of the context: An important step in deriving the LSA representation is dimensionality reduction following singular value decomposition of a normalized co-occurrence matrix. This step is critical for the abstraction inherent in the LSA representation, so an important part of applying the technique is finding the optimal dimensionality for the final representation. In our experimental investigation we will evaluate how WSD performance depends on the underlying dimension of the LSA representation of the context.
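Aspects 2 and 3 above can be illustrated with a minimal sketch, assuming NumPy is available. The function names `context_windows` and `lsa_context_vectors`, the toy sentence, the row normalization, and the choice to run the SVD directly on a small context-by-word count matrix are all assumptions made for illustration. A full LSA pipeline would typically derive the space from a large corpus, and this is not the project's implementation:

```python
import numpy as np


def context_windows(tokens, target, size):
    """Collect the words within `size` positions of each `target` occurrence."""
    windows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - size):i]
            right = tokens[i + 1:i + 1 + size]
            windows.append(left + right)
    return windows


def lsa_context_vectors(windows, k):
    """Map each context window to a k-dimensional vector via truncated SVD."""
    vocab = sorted({w for win in windows for w in win})
    if not windows or not vocab:
        return np.zeros((len(windows), 0))
    index = {w: j for j, w in enumerate(vocab)}

    # Context-by-word count matrix: one row per occurrence of the target.
    counts = np.zeros((len(windows), len(vocab)))
    for i, win in enumerate(windows):
        for w in win:
            counts[i, index[w]] += 1.0

    # Assumed normalization: scale each row to unit Euclidean length.
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    counts = counts / np.where(norms == 0, 1.0, norms)

    # Keep only the k largest singular values (dimensionality reduction).
    U, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    return U[:, :k] * s[:k]


tokens = "the bank of the river was steep and the bank approved the loan".split()
windows = context_windows(tokens, "bank", size=2)
vectors = lsa_context_vectors(windows, k=1)
print(windows)
print(vectors.shape)
```

Varying `size` and `k` in a loop and scoring the resulting clusterings is one way the dependence of WSD performance on context size and dimensionality could be explored.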
Relevant Papers and Resources:
LSA
· Latent Semantic Analysis @ CU Boulder
· Introduction to Latent Semantic Analysis, by T. K. Landauer, P. W. Foltz, & D. Laham, Discourse Processes, 25, 259-284 (1998).
· Indexing by Latent Semantic Analysis, by S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, & R. Harshman, Journal of the American Society for Information Science, 41(6), 391-407 (1990).
· InfoVis page on Latent Semantic Analysis
· Wikipedia page on Latent Semantic Analysis
· Probabilistic Latent Semantic Analysis, by T. Hofmann, Proc. Uncertainty in Artificial Intelligence (1999).