Polarity classification for Spanish tweets using the COST corpus

Abstract

It was not until 2010 that businesses, politicians and people in general began to realize the potential of Twitter in Spain. This has awakened research interest in extracting knowledge from Twitter. This paper aims to address the lack of resources for Twitter sentiment analysis in Spanish by studying different features and machine learning algorithms for classifying the polarity of Twitter posts. The result is a new corpus of Spanish tweets called COST, on which we have carried out a wide-ranging experiment using different machine learning algorithms. Furthermore, we have tested the influence of different weighting schemes for unigrams, of eliminating stop-words and of applying a stemmer.
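
The feature variations named above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which weighting schemes were compared, so binary occurrence, raw term frequency and TF-IDF are assumed here as common choices. The stop-word list, the function names `tokenize` and `weight_unigrams`, and the example tweets are all hypothetical. Stemming is left as an optional callable, so that a real stemmer (for instance a Snowball stemmer for Spanish) could be plugged in.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only;
# it is NOT the list used in the paper.
STOP_WORDS = {"el", "la", "de", "que", "y", "es", "un", "una", "en", "muy"}


def tokenize(tweet, remove_stop_words=False, stem=None):
    """Lowercase and whitespace-split a tweet into unigrams.

    remove_stop_words: drop tokens found in STOP_WORDS.
    stem: optional callable applied to every token (assumed interface).
    """
    tokens = tweet.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens


def weight_unigrams(docs, scheme="tfidf"):
    """Turn token lists into sparse {unigram: weight} vectors.

    scheme: "binary" (presence), "tf" (raw count) or
            "tfidf" (count * log(N / document frequency)).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each unigram appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        if scheme == "binary":
            vec = {t: 1.0 for t in counts}
        elif scheme == "tf":
            vec = {t: float(c) for t, c in counts.items()}
        elif scheme == "tfidf":
            vec = {t: c * math.log(n_docs / doc_freq[t])
                   for t, c in counts.items()}
        else:
            raise ValueError(f"unknown weighting scheme: {scheme}")
        vectors.append(vec)
    return vectors


# Made-up example tweets, not taken from the COST corpus.
tweets = ["me encanta la pelicula", "la pelicula es muy mala mala"]
docs = [tokenize(t, remove_stop_words=True) for t in tweets]
for scheme in ("binary", "tf", "tfidf"):
    print(scheme, weight_unigrams(docs, scheme))
```

The resulting sparse vectors could then be passed to any of the classifiers being compared. Under the TF-IDF variant sketched here, a unigram that occurs in every tweet receives a weight of zero.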

Keywords

Opinion mining, polarity classification, sentiment analysis, short text analysis, social networks, Spanish corpus, Twitter
