Internet Web-sites are typically organized by subject. The collection of subjects covered represents the content of a Web-site. This content may consist of company information, items offered for sale or by subscription, advertisements for products or services, and so on. Web-site designers classify subjects in ways meant to facilitate access to the information users seek; that is, designers try to anticipate user requests in their classification schemes.

Potential Problems: In the most commonly used scheme, subjects are organized hierarchically in a rooted-tree structure in which nodes correspond to Web-pages and edges correspond to the hyperlinks that direct users from the page they are viewing to other Web-pages. A Web-site's entry page (e.g., that designated by the URL used to identify an Internet company such as www.amazon.com) corresponds to the root node. This page contains general information and has a set of indices to guide the user's traversal of the subjects collected in the site. Other pages (corresponding to internal nodes or leaves of the tree) present sub-categories or specialized information: the farther a node is from the root, the more specialized the information it contains. Trees, being acyclic, cannot represent all possible information structures; graphs clearly offer greater generality. However, the tree representation is adopted in this proposal because it is especially well-suited for modeling classification schemes, several of which may be implicit in the organization of a Web-site. For instance, the home page of the CUNY Graduate Center classifies the content of the site into doctoral programs, other programs, research centers, etc. In addition, it offers pointers for direct access to library information, computing facilities, site maps, etc. The organization of this content can thus be modeled as two tree structures, one representing functional entities in the University, the other representing information resources.
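The rooted-tree model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the Page class, the depth_of helper, and the "Computer Science" child page are hypothetical names introduced here, and the tree is a hand-built fragment rather than the actual structure of any site.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """A Web-page as a tree node; children are reached by hyperlinks (edges)."""
    title: str
    children: List["Page"] = field(default_factory=list)


def depth_of(root: Page, title: str, depth: int = 0) -> Optional[int]:
    """Return the distance from the root to the first page with this title,
    or None if no such page exists in the tree."""
    if root.title == title:
        return depth
    for child in root.children:
        found = depth_of(child, title, depth + 1)
        if found is not None:
            return found
    return None


# Hypothetical fragment: the entry page is the root, categories lie below it.
home = Page("Home", [
    Page("Doctoral Programs", [Page("Computer Science")]),
    Page("Research Centers"),
])

print(depth_of(home, "Home"))              # root, most general
print(depth_of(home, "Computer Science"))  # deeper, more specialized
```

In this sketch the depth of a page stands in for its degree of specialization, which is the relationship the tree model assumes.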

Although, from a designer's viewpoint, the content of a Web-site might be neatly represented in an ad hoc tree structure, this particular representation may not capture the ways in which users search for information. Moreover, the information sought by a user may not be included explicitly in the classification scheme. Thus, the information might not be found by traversing a path from the root of the classification tree; it may not appear in the tree-based scheme at all. For instance, finding information about a 'database course' without detailed knowledge of the CCNY Web-site is difficult. The information is located along the path 'CS Dept,' followed by 'Prof Kawaguchi,' and then 'CSc571X.' A user might instead start with 'Course Schedule' and arrive at 'Spring 2001,' a leaf that contains no details about database courses. A similar problem arises when the same kind of information falls under several different internal (or leaf) nodes of the classification tree. For instance, to find 'logic courses' on the CUNY Web-site one may need to go through the course catalogs offered by several academic departments such as Computer Science, Electrical Engineering, Philosophy, and Mathematics. This happens because a static classification scheme requires each subject to be placed in a specific category; it does not reflect any inherent weakness in the concept of subject classification itself.
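The second problem, the same kind of information scattered across several branches, can be made concrete with a small sketch. The nested-dictionary site below and the course titles in it are invented for illustration; only the department names come from the text, and paths_matching is a hypothetical helper.

```python
def paths_matching(tree, keyword, prefix=()):
    """Yield every root-to-node path whose final title contains the keyword.

    `tree` maps a page title to a dict of its child pages."""
    for title, subtree in tree.items():
        path = prefix + (title,)
        if keyword.lower() in title.lower():
            yield path
        yield from paths_matching(subtree, keyword, path)


# Hypothetical classification tree; course titles are made up.
site = {
    "Home": {
        "Computer Science": {"Logic in Computer Science": {}},
        "Philosophy": {"Symbolic Logic": {}},
        "Mathematics": {"Calculus": {}},
    }
}

for path in paths_matching(site, "logic"):
    print(" > ".join(path))
```

Under these assumptions a single query for 'logic' yields paths through two different departments, so a user following only one branch of the static tree would miss the other.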

Our Goals: To overcome these shortcomings of conventional classification, many Web-sites have recently started installing a site-oriented search capability that lets the user quickly identify a URL matching a specified keyword set. That is to say, the Web-site has a dedicated search engine that finds information restricted to the site. This project aims to extend classification beyond localized search based on ad hoc tree structures, that is, to organize the content of a Web-site based on an analysis of the semantic structure of the site's contents. A complete taxonomy tree will be generated recursively as follows. The HTML pages of the Web-site of interest will be compared to determine their "semantic closeness". Web-pages will be compared by measuring the similarity of vectors representing them, where the vector components are defined in terms of the frequencies of words occurring in the Web-pages. Pages having a similar meaning will be grouped into a single category of the taxonomy, and a higher level of semantics will be extracted from the set of words they have in common. We have developed measures to assess the performance of commercial search engines, and this work will be extended to the measurement of similarity between Web-pages.
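One level of the grouping step described above might look like the following sketch. It rests on several assumptions not stated in the text: cosine similarity is used as the vector similarity measure, pages are grouped greedily against the first member of each group, and the threshold of 0.5 is arbitrary. The function names and the three sample pages are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a page as a word-frequency vector (word -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_pages(vectors, threshold=0.5):
    """Greedy grouping (an assumed strategy): a page joins the first group
    whose first member is at least `threshold` similar, else starts a new one."""
    groups = []
    for name, vec in vectors.items():
        for group in groups:
            if cosine_similarity(vec, vectors[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


def shared_words(group, vectors):
    """Words occurring in every page of a group; a candidate category label."""
    return set.intersection(*(set(vectors[name]) for name in group))


# Invented sample pages, for illustration only.
pages = {
    "db1": "database course relational database query",
    "db2": "database query course sql",
    "logic": "logic proof theorem logic",
}
vectors = {name: term_vector(text) for name, text in pages.items()}
groups = group_pages(vectors)
for group in groups:
    print(group, sorted(shared_words(group, vectors)))
```

This covers a single level only. One way to obtain the recursive taxonomy would be to treat each group's shared vocabulary as a new vector and repeat the grouping, but that choice is an assumption of this sketch.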

Project Significance: Analyzing the semantic structure of a Web-site provides a sound foundation for classification. This approach will lead to improvements over the simple keyword-matching search capabilities currently available for a particular site. Furthermore, the semantic structure gives Web-site designers the means to discover similar or redundant information, which in turn will help them reconfigure a site to improve presentation and search performance.

© 2001, Akira Kawaguchi and Abbe Mowshowitz, All rights reserved.