Evaluating text embeddings for two-dimensional text corpora representations

Abstract

Several text corpus visualizations use a map-like metaphor in which the layout reflects the semantic similarity between documents. The underlying two-dimensional scatterplots are created by combining a latent embedding with a subsequent dimensionality reduction. In this work, we analyze the impact of embedding quality on layout quality. We evaluate the accuracy of the layout, that is, how well the local and global structures of the text corpus are preserved in its two-dimensional representation. Additionally, we assess class separation, that is, how well classes can be distinguished within the two-dimensional space. We introduce a benchmark $B = (D, L, Q_{E}, Q_{DR})$ consisting of a collection of text corpora $D$, a set of layout algorithms $L$ that combine text embeddings with dimensionality reductions, quality metrics $Q_{E}$ for evaluating text embeddings, and quality metrics $Q_{DR}$ for assessing accuracy and class separation. Evaluating this benchmark yields a multivariate dataset, which we then analyze descriptively. Our results indicate that, for Latent Semantic Indexing combined with tf-idf weighting and t-distributed Stochastic Neighbor Embedding, coherence plays a substantial role in determining the accuracy of the layout. Additionally, our findings reveal that embeddings do not enhance class separation in the two-dimensional scatterplot representation. As our main result, we provide more fine-grained guidelines for effectively using text embeddings and dimensionality reduction techniques to generate two-dimensional scatterplot representations of text corpora that reflect semantic similarity.

Keywords

Text embeddings, dimensionality reductions, benchmark studies, text spatializations, visualization quality metrics
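
To make the notion of layout accuracy from the abstract more concrete, the following is a minimal sketch of one possible local-structure metric: the mean overlap between each document's k nearest neighbors in the embedding space and in the two-dimensional layout. The abstract does not name the specific metrics in $Q_{DR}$, so this k-nearest-neighbor overlap is an illustrative assumption, not the benchmark's definition. The function names `knn_indices` and `neighborhood_preservation` and the toy coordinates are hypothetical.

```python
import math


def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest neighbors (Euclidean)."""
    result = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        order = sorted(others, key=lambda j: math.dist(p, points[j]))
        result.append(set(order[:k]))
    return result


def neighborhood_preservation(high, low, k):
    """Mean fraction of k nearest neighbors shared between the two spaces (1.0 = identical)."""
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    shared = sum(len(a & b) for a, b in zip(nn_high, nn_low))
    return shared / (k * len(high))


# Toy data (assumed for illustration): four "documents" forming two tight pairs
# in a three-dimensional embedding space.
high = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]

# A layout that keeps the pairs together, and one that swaps documents 1 and 2.
good_layout = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
bad_layout = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]

print(neighborhood_preservation(high, good_layout, k=1))  # 1.0
print(neighborhood_preservation(high, bad_layout, k=1))   # 0.0
```

A score near 1 indicates that local neighborhoods survive the projection, while a score near 0 indicates that they are lost. Assessing global structure or class separation would require different measures and is not covered by this sketch.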

