Abstract
In the fast-moving field of Natural Language Processing (NLP), lexicon creation remains a core method for many text analysis applications. Lexicons have traditionally been generated using techniques such as semantic matching, word embeddings, and tools like EMPATH. With the arrival of Large Language Models (LLMs) such as GPT-3.5, GPT-4, and Mistral 7b 0.1, new ways to create lexicons have emerged. This study takes a close look at how these established methods compare with the newer options offered by LLMs. We carried out a detailed analysis of how well different methods create lexicons, focusing on their precision, their scalability, and how efficiently they can be applied in real-world settings. Using standard NLP tasks such as document classification, emotion classification, and sentiment analysis, we evaluated the resulting lexicons across a variety of datasets. These findings aim to help professionals and researchers identify the best current approaches to lexicon creation, setting the stage for further research in the NLP field.