A semantic approach to cross-document person profiling in Web

Abstract

Cross-document person profiling aims to identify and link person entities across Web pages and to extract their relevant structured information. In this paper, we focus on the core task of person profiling, namely attribute extraction. Existing attribute extraction approaches face several challenges, two of the most important being (i) syntactic and structural variation and (ii) cross-sentence and cross-document information extraction. To address these challenges and improve on existing methods, we propose a semantic attribute extraction approach that relies on probabilistic reasoning. Our approach produces structured, meaningful profiles in which the extracted textual facts are linked to their possible meanings in a distant ontology. We evaluate our approach on standard profile extraction datasets. Experimental results show that it outperforms several baselines and state-of-the-art counterparts, which indicates that it is a promising solution to the person profiling problem.

Keywords

Web mining, information extraction, cross-document person profiling, attribute extraction
