Improved algorithm of Context Graph based on feature selection

Abstract

To address the low efficiency of traditional theme crawlers in locating theme pages, this paper discusses the crawling algorithm based on Context Graph. After analyzing the working principle and process of the algorithm, we introduce a new approach based on feature selection. The new algorithm modifies the original TF-IDF formula accordingly and resolves the problems identified in the original algorithm.

Keywords

Context Graph, feature selection, theme crawler, TF-IDF
