Using LSA to Compute Word Sense Frequencies
This project is sponsored by a grant from the Air Force Research Lab.
Project Description:
The goal of this project is to study methods based on Latent Semantic Analysis (LSA) for estimating word sense frequencies. Existing word frequency measures are based on the relative frequency with which a word appears in a corpus. To estimate word-sense frequencies for a particular word token W, we need to tag every instance of W in the corpus with its corresponding sense. Hence, the problem of word-sense frequency estimation is closely related to the problem of Word Sense Disambiguation (WSD).
Our approach estimates word sense frequencies by unsupervised, model-based clustering of the contexts in which a polysemous word occurs, with each context described by its LSA vector representation. Given such a clustering, the word sense frequency distribution can be computed as the relative count of instances in each cluster.
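As a minimal sketch of the last step only, assuming the clustering has already assigned a cluster label to each occurrence of the word, the relative counts could be computed as follows. The function name `sense_frequencies` and the sample labels are hypothetical and serve only as illustration.

```python
from collections import Counter

def sense_frequencies(cluster_labels):
    """Return the relative count of context instances in each cluster."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical cluster assignments for 8 occurrences of one polysemous word
labels = [0, 0, 1, 0, 2, 1, 0, 0]
freqs = sense_frequencies(labels)
print(freqs)  # {0: 0.625, 1: 0.25, 2: 0.125}
```

Each cluster stands in for one induced sense, so the resulting dictionary is the estimated sense frequency distribution for that word.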
We will evaluate the proposed framework and algorithms experimentally, using test data and standardized scoring obtained from Senseval-3, http://www.senseval.org/.
We will focus our investigation on three aspects within this framework:
1. Minimizing and eliminating the need for supervision (manual data inspection and tagging). In particular, we will investigate corpus-based methodologies that automatically find the best clustering of the data along with the optimal number of clusters (corresponding to word senses). To find the most meaningful clustering and prevent over-fitting, we plan to use regularization methods such as Minimum Description Length (MDL) or Structural Risk Minimization (SRM).
2. Context Size: We believe that the size of the context in the LSA representation may play an important role in WSD performance. In our research, we will experimentally study the effect of the context size on disambiguation performance.
3. Dimensionality of LSA representation of the context: An important step in deriving the LSA representation is dimensionality reduction following singular value decomposition of a normalized co-occurrence matrix. This step is critical for the abstraction inherent in the LSA representation. Thus, an important component of applying the technique is finding the optimal dimensionality for the final representation. In our experimental investigation we will evaluate how WSD performance depends on the dimension of the LSA representation of the context.
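The two experimental parameters in items 2 and 3, context size and LSA dimensionality, can be illustrated with a small sketch. It rests on several assumptions that do not come from the project itself: the context is a symmetric window of `window` tokens on each side, normalization is plain column length normalization, and the SVD is taken over the contexts alone rather than over a large background corpus. The function names `context_windows` and `lsa_context_vectors` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def context_windows(tokens, target, window):
    """Collect the words within `window` tokens on each side of every
    occurrence of `target` (the target itself is excluded)."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(left + right)
    return contexts

def lsa_context_vectors(contexts, k):
    """Build a term-by-context count matrix, normalize it, and keep the
    top-k singular dimensions. Returns one row per context."""
    if not contexts:
        return np.zeros((0, k))
    vocab = sorted({w for ctx in contexts for w in ctx})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(contexts)))
    for c, ctx in enumerate(contexts):
        for w in ctx:
            X[index[w], c] += 1.0
    # Assumption: length-normalize each context column; the weighting
    # scheme actually used for the co-occurrence matrix is not specified.
    norms = np.linalg.norm(X, axis=0)
    X = X / np.where(norms == 0, 1.0, norms)
    # Thin SVD, then truncate to at most k dimensions.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(s))
    return Vt[:k].T * s[:k]

tokens = ("the bank raised interest rates while "
          "the river bank flooded the field").split()
contexts = context_windows(tokens, "bank", window=2)
vectors = lsa_context_vectors(contexts, k=2)
```

Varying `window` and `k` in such a pipeline, and then clustering the rows of `vectors`, is one way the dependence of disambiguation performance on context size and dimensionality could be measured.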
Relevant Papers and Resources:
Introduction to LSA
Variations on LSA
Application of LSA in Natural Language Processing
ACT-R and Double R Theory
Minimum Description Length (MDL)
Software:
Matrix Manipulation and Singular Value Decomposition (SVD)
Machine Learning (K-means clustering, MDL, etc.)
Corpora:
Plain Text or Parsed (For unsupervised clustering)
Sense-tagged (For testing, or for training with the tags ignored):
All sense tags use various versions of WordNet: