Finding answers to COVID-19-specific questions: An information retrieval system based on latent keywords and adapted TF-IDF

Abstract

The scientific community has responded to the COVID-19 outbreak by producing a large body of literature that helps us understand many pandemic-related topics from different perspectives. Handling this volume of information is challenging, especially when researchers need answers to complex questions about specific topics. We present an information retrieval system that uses latent information to select works relevant to specific concepts. By applying Latent Dirichlet Allocation (LDA) models to the documents, we identify key concepts related to a given query and corpus. The method is iterative: starting from an initial query defined by the user, the query is expanded at each subsequent iteration. The method also works with a limited amount of information per article. We evaluated our proposal using human validation and two evaluation strategies, obtaining good results with both. For the first strategy, we conducted two surveys to assess the model's performance; in every category studied, precision was greater than 0.6 and accuracy was greater than 0.8. The second strategy also gave good results, reaching a precision of 1.0 in one category and exceeding 0.7 overall.
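The iterative query expansion described above can be illustrated with a minimal sketch. This is not the authors' implementation: it substitutes plain TF-IDF weighting for the LDA models and the adapted TF-IDF named in the keywords, and the function names (`tokenize`, `tf_idf_vectors`, `expand_query`) and parameters (`iterations`, `top_docs`, `new_terms`) are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in tokens if t]


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document (plain, smoothed TF-IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def expand_query(query_terms, docs, iterations=2, top_docs=2, new_terms=2):
    """Grow the query by adding high-weight terms from the best-matching docs.

    Assumption: a stand-in for the paper's LDA-based keyword discovery.
    """
    vectors = tf_idf_vectors(docs)
    terms = list(query_terms)
    for _ in range(iterations):
        # Score each document by the summed weight of the current query terms.
        scores = [sum(vec.get(t, 0.0) for t in terms) for vec in vectors]
        ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
        best = [i for i in ranked[:top_docs] if scores[i] > 0]
        # Pool term weights over the best documents.
        pooled = Counter()
        for i in best:
            pooled.update(vectors[i])
        # Add the strongest terms that are not already in the query.
        candidates = [t for t, _ in pooled.most_common() if t not in terms]
        terms.extend(candidates[:new_terms])
    return terms


docs = [
    "masks reduce coronavirus transmission",
    "coronavirus transmission in hospitals",
    "stock markets fell sharply",
]
expanded = expand_query(["masks"], docs)
print(expanded)
```

Each pass ranks documents against the current query and then enlarges the query with terms drawn only from the matching documents, so unrelated documents contribute nothing. A practical system would also filter stop words and, as in the paper, derive candidate keywords from topic models rather than from raw term weights.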

Keywords

ATF.IDF; COVID-19; document filtering; information retrieval; keywords generation; latent Dirichlet allocation; TF-IDF


