Rough sets based span and its application to extractive text summarization

Abstract

Rough Sets provide a mathematical tool for handling decision making under uncertainty. Natural language text is a major domain characterized by inherent ambiguity, which often leads to uncertainty in understanding the intent and relative importance of a sentence with respect to its context in the whole text. As a consequence, selecting sentences for an extractive summary can logically be considered a process of decision making under uncertainty. In this paper we use Rough Set based techniques to deal with this uncertainty. The paper's contribution is two-fold. Firstly, it proposes a novel Rough Set based uncertainty measure called span and defines special Rough subsets of the universe called spanning sets. Span is a Rough Set based measure of the salience of a subset of the universe, and a spanning set is a subset that maximizes the span. Such a set corresponds to the key elements representing a problem and can be used in various real-life applications. Secondly, the concepts are applied to determine extracts of text documents. The idea behind the present work is to determine the most suitable subset(s) of the universe of sentences under consideration. An optimization problem is formulated to generate the extract of a text using the proposed uncertainty measure of span, and it is solved using Particle Swarm Optimization. Experimental results on the DUC2001 and DUC2002 single-document data sets and the Enron Email datasets establish the effectiveness of the proposed technique. There has been substantial work on Rough Sets, but considering a stochastic Rough subset of the universe and determining its aptness as a representative of the universe is still unexplored. The proposed technique is a novel effort to fill this gap.

Keywords

Rough set; extractive text summarization; span; spanning set; particle swarm optimization; ROUGE; extraction; lexical chains; DUC2001; DUC2002; LSA; graph; random indexing; GLOVE
