Contextual weighting approach to compute term weight in layered vector space model

Abstract

The World Wide Web (WWW) is the largest available repository of information. This huge volume of information makes retrieving trustworthy information from the WWW a challenge, and it confronts researchers with new issues of diversity and complexity. Information retrieval from the web demands approaches that go beyond conventional information retrieval: the heterogeneity, complexity and sheer volume of web information require a distinct retrieval approach. End-users introduce further difficulties, since the queries they submit are sometimes subtle and ambiguous. The primary concern in information retrieval is predicting the relevance of documents. In this article, a new approach is proposed that logically separates a web document into five layers, namely, the title, header, hyperlink, meta tag and body layers. The proposed method effectively combines the textual information and structural evidence of a web document for retrieving information from the web. In the proposed layered vector space model, each layer has an allocated priority, which is used to compute the weight factor for that layer. The proposed method derives an equation that combines the priority of the layer and the length of the layer to calculate the weight of the layer.
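To make the layered idea concrete, the following is a minimal sketch, not the article's method. It splits an HTML page into the five layers named above and then combines a per-layer priority with the layer length. The abstract does not state the equation or the priority values, so the `PRIORITY` table and the rule "priority divided by layer length, normalised over layers" are placeholder assumptions, as are all class and function names. Text inside `script` or `style` elements is not filtered out in this sketch.

```python
from collections import Counter
from html.parser import HTMLParser

LAYERS = ("title", "header", "hyperlink", "meta", "body")
HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

# Hypothetical priorities (higher = more important); NOT taken from the article.
PRIORITY = {"title": 5, "header": 4, "hyperlink": 3, "meta": 2, "body": 1}


class LayerSplitter(HTMLParser):
    """Collect the lower-cased tokens of an HTML page into five layers."""

    def __init__(self):
        super().__init__()
        self.layers = {name: [] for name in LAYERS}
        self._stack = []  # layers of the currently open title/header/link tags

    @staticmethod
    def _layer_for(tag):
        if tag == "title":
            return "title"
        if tag in HEADER_TAGS:
            return "header"
        if tag == "a":
            return "hyperlink"
        return None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # Meta tags carry their text in the "content" attribute.
            content = dict(attrs).get("content")
            if content:
                self.layers["meta"].extend(content.lower().split())
            return
        layer = self._layer_for(tag)
        if layer:
            self._stack.append(layer)

    def handle_endtag(self, tag):
        layer = self._layer_for(tag)
        if layer and self._stack and self._stack[-1] == layer:
            self._stack.pop()

    def handle_data(self, data):
        # Text outside title/header/link tags falls through to the body layer.
        layer = self._stack[-1] if self._stack else "body"
        self.layers[layer].extend(data.lower().split())


def layer_weight_factors(layers, priority=PRIORITY):
    """Placeholder rule (an assumption): priority / layer length, normalised to sum to 1."""
    raw = {name: priority[name] / len(tokens)
           for name, tokens in layers.items() if tokens}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()} if raw else {}


def term_weights(layers, priority=PRIORITY):
    """Sum, over layers, the layer weight factor times the term count in that layer."""
    factors = layer_weight_factors(layers, priority)
    weights = Counter()
    for name, factor in factors.items():
        for term, count in Counter(layers[name]).items():
            weights[term] += factor * count
    return dict(weights)
```

Under this placeholder rule, a term that appears in a short, high-priority layer such as the title receives more weight than the same term appearing once in a long body layer. Any other way of combining priority and length could be substituted in `layer_weight_factors` without changing the rest of the sketch.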

Keywords

HTML tags; layered VSM; term weight; web information retrieval; weight factor

