Abstract
Integrating a web application into an automated business process requires designing wrappers that take user queries as input and map them onto the search forms that the application provides. Such wrappers build on automatic navigators, which are responsible for navigating to the pages that provide the information required to answer the original user queries. A navigator relies on a web page classifier that discerns which pages provide that information and which do not. In the literature, there are many proposals to classify web pages, but none of them fulfills the requirements for a web page classifier in a navigator context. We address the problem of designing an unsupervised web page classifier that builds solely on the information provided by the URLs and does not require extensive crawling of the site being analysed. Our contribution is CALA, a new automated proposal to generate URL-based web page classifiers. Its salient features are that it does not need to crawl the complete web site beforehand, it is unsupervised, it does not require downloading a page before classifying it, and it is computationally tractable. It has been validated by a number of experiments using real-world, top-visited web sites.
