Enterprise information integration

Abstract

Integrating a web application into an automated business process requires designing wrappers that take user queries as input and map them onto the search forms that the application provides. Such wrappers build on automatic navigators, which are responsible for navigating to the pages that provide the information required to answer the original user queries. A navigator relies on a web page classifier that discerns which pages provide the information and which do not. In the literature, there are many proposals to classify web pages, but none of them fulfills the requirements for a web page classifier in a navigator context. We address the problem of designing an unsupervised web page classifier that builds solely on the information provided by the URLs and does not require extensive crawling of the site being analysed. Our contribution is CALA, a new automated proposal to generate URL-based web page classifiers. Its salient features are that it does not need to crawl the complete web site beforehand, it is unsupervised, it does not require downloading a page before classifying it, and it is computationally tractable. It has been validated by a number of experiments using real-world, top-visited web sites.
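
To make the idea of classifying a page from its URL alone more concrete, the following is a minimal illustrative sketch. It is not the CALA algorithm, whose details are not given here. It assumes a very simple heuristic: path segments that contain digits are treated as variable parts, and URLs that reduce to the same pattern are grouped together without downloading any page. The function names (url_pattern, cluster_urls, classify) and the example.com URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url: str) -> str:
    """Reduce a URL to a coarse pattern (assumed heuristic, not CALA's):
    path segments containing a digit become '*', and the query string
    keeps only its sorted parameter names."""
    parts = urlsplit(url)
    segments = [
        "*" if any(ch.isdigit() for ch in seg) else seg
        for seg in parts.path.split("/")
        if seg
    ]
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    pattern = "/" + "/".join(segments)
    if names:
        pattern += "?" + "&".join(names)
    return pattern


def cluster_urls(urls):
    """Group URLs by pattern; no labels are needed and no page is fetched."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


def classify(url: str, relevant_patterns) -> bool:
    """Decide from the URL alone whether a page belongs to a relevant cluster."""
    return url_pattern(url) in relevant_patterns


if __name__ == "__main__":
    sample = [
        "http://example.com/product/1",
        "http://example.com/product/2",
        "http://example.com/about",
    ]
    for pattern, members in cluster_urls(sample).items():
        print(pattern, len(members))
```

In a navigator setting, a classifier of this kind would be consulted before following a link, so that only pages likely to contain the required information are downloaded.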

Keywords

Web page classification, navigation, crawling
