Abstract
Organizational research increasingly uses natural language processing (NLP) to measure textual similarity. Despite common usage, the meaning and consistency of similarity measures (e.g., cosine similarity and Euclidean distance) across common NLP methods (e.g., n-grams and document embeddings) are unclear. This risks misalignment between theoretical constructs and textual measures, undermining the comparability of findings across studies. To address this gap, we review studies using textual similarity in organizational and psychological research, finding a jingle-jangle fallacy: identical labels are used for similarity estimates from different NLP methods, and different labels are used for the same method. Additionally, we examine the consistency of similarity measures across and within NLP methods. Different transformer-based embeddings’ similarity results are interchangeable. However, n-grams yield distinct, inconsistent results and are less appropriate for estimating similarity with distance measures. When applied to multi-word inputs, dictionaries and word embeddings return similar results reflecting linguistic style. We provide best practice recommendations and example code for operationalizing textual similarity, including clarifying which NLP methods correspond to content similarity, linguistic style similarity, and semantic similarity at the word, sentence, and document levels of analysis.
Textual data have become a central empirical resource in organizational research. Scholars increasingly analyze text produced inside organizations, such as internal emails and other forms of workplace communication (Barley et al., 2011; Hannigan et al., 2019), as well as text produced for external audiences, including regulatory disclosures such as SEC 10-K filings (Arts et al., 2025; Testoni, 2022), patents (Arts et al., 2018), and outward-facing communication such as CEO earnings call transcripts, letters to shareholders, websites, and press releases (Graffin et al., 2011; Haans & Mertens, 2026; Harrison et al., 2019; Nadkarni & Chen, 2014). Researchers have also leveraged text generated by external audiences and intermediaries, including employee reviews on job platforms (Corritore et al., 2020), online professional profiles and hiring data (Marchetti & Puranam, 2025), analyst communications surrounding earnings calls (Eklund & Mannor, 2026), and customer feedback in online reviews (Favaron & Di Stefano, 2025).
Across these settings, textual similarity has become a widely used analytic device for operationalizing theoretical constructs. At the macro level, researchers have used textual similarity to measure product market rivalry and competitive overlap (Arts et al., 2025; Hoberg & Phillips, 2010), strategic positioning and optimal distinctiveness in market categories (Barlow et al., 2019; van Angeren et al., 2022), differentiation and distinctiveness (Majzoubi et al., 2024), and interfirm technological relatedness in innovation search (Arts et al., 2018). At the meso level, similarity measures have been used to study organizational vocabularies (Tasselli et al., 2020), organizational culture and cultural fit (Corritore et al., 2020; Goldberg et al., 2016), the transfer of cultural imprints (Ahn & Greve, 2025), and idea originality and distinctiveness (Piezunka & Dahlander, 2015, 2019). At the micro level, textual similarity has been applied to capture applicants’ experience-job relatedness (Parasurama et al., 2025) and interpersonal fit, accommodation, and relational dynamics in interactional contexts (Shi et al., 2019). Beyond organizational research, textual similarity is also widely used in social and experimental psychology (Günther et al., 2016; Yu et al., 2025) and in the development and validation of psychological scales (Hernandez & Nie, 2023).
Despite its widespread use, the meaning, validity, and consistency of textual similarity measures remain insufficiently understood. In practice, organizational researchers frequently compute similarity statistics, most commonly cosine similarity, on vectors generated by fundamentally different natural language processing (NLP) methods, including dictionaries, n-grams, and embedding models. These similarity estimates are often treated as interchangeable indicators of a single underlying construct, even though the underlying vector representations encode conceptually distinct aspects of language. As a result, identical similarity statistics (e.g., cosine similarity) may reflect very different properties of text, depending on how the text has been represented.
This practice raises concerns about construct validity. For instance, a cosine similarity value of .80 may indicate substantial semantic overlap when applied to transformer-based embeddings, but merely shared, high-frequency vocabulary when applied to n-grams. Yet in empirical work, both are often labeled “content similarity” (e.g., Hasan et al., 2015; Patterson et al., 2024). Conversely, identical NLP methods are sometimes described using different conceptual labels, such as content, lexical, or semantic similarity (e.g., Hasan et al., 2015; Margulis et al., 2022; Rule et al., 2015). Taken together, these patterns suggest a jingle–jangle fallacy in the use of textual similarity: the same labels are applied to distinct operationalizations, while different labels are applied to the same operationalization.
A related issue concerns the consistency of similarity estimates across textual similarity measures. Textual similarity is typically operationalized by applying a similarity or distance measure—most often cosine similarity, but also Euclidean distance, Manhattan distance, and Jensen–Shannon (JS) divergence—to numerical vectors derived from text via NLP. These measures differ in how they quantify similarity. Cosine similarity captures directional alignment between vectors, whereas geometric distance measures (e.g., Euclidean distance and Manhattan distance) incorporate information about vector magnitudes. When applied to different NLP representations, these measures may yield substantially different rank orderings of similarity between text pairs. In organizational research, where textual similarity measures are frequently used as key independent variables, such inconsistencies can alter substantive conclusions.
Prior work in computer science has examined the behavior of similarity and distance measures in high-dimensional spaces (Aggarwal et al., 2001; Cantrell, 2018). However, this literature has largely focused on the abstract properties of these measures or on technical tasks such as information retrieval. Importantly, it does not address whether similarity estimates derived from different NLP methods are interchangeable for the purpose of construct operationalization in applied social science research. Nor does it provide guidance on how to align similarity measures and NLP methods with theoretically distinct dimensions of similarity. As a result, organizational scholars lack clear guidance on when different textual similarity operationalizations can be expected to yield comparable results and when they should not.
We address these issues by developing a clearer conceptual and empirical basis for the use of textual similarity in organizational research. We focus on three conceptually distinct dimensions of similarity that recur, often implicitly, in prior work. Content similarity reflects overlap in surface vocabulary. Linguistic style similarity captures how language is used, including function words, grammatical patterns, and stylistic markers. Semantic similarity reflects overlap in meaning that is not dependent on shared word choice. Each of these dimensions aligns more naturally with particular NLP methods, yet these distinctions are rarely made explicit in empirical research.
The article proceeds in four steps. First, we introduce the concept of textual similarity and demonstrate its uses across multiple NLP methods. Second, we review applications of textual similarity in organizational research, documenting how similarity measures and NLP methods are used in practice and highlighting conceptual inconsistencies in labeling and interpretation. Third, we conduct a large-scale empirical analysis examining the consistency of textual similarity estimates across commonly used similarity measures (e.g., cosine similarity and Euclidean distance) and NLP methods (e.g., n-grams, dictionaries, and embeddings). This analysis identifies when similarity estimates are robust across measures or NLP methods and when they are not. Where textual similarity scores are inconsistent across similarity measures and NLP methods, researchers need to critically evaluate the meaning and appropriateness of each approach. These analyses also speak to the reliability of textual similarity measures within each NLP method.
Fourth, we build on Poschmann et al. (2024) by synthesizing these insights into broader best-practice recommendations for using textual similarity in organizational research. We specify how researchers can align NLP methods with the dimension of similarity they seek to capture and recommend comparing results across methods and measures to test robustness. We also highlight opportunities for future micro- and meso-level research using textual similarity, aiming to inspire more rigorous and conceptually grounded applications in organizational studies. Overall, this manuscript aims to improve the methodological validity, transparency, and cumulative value of NLP-based research in organizational studies.
Textual Similarity
Textual similarity involves applying a similarity measure to the numerical vectors generated from applying NLP to two texts (e.g., words, sentences, and documents), resulting in an estimate of how similar those two texts are. 1 Cosine similarity is considered the most important similarity measure for binary and real-valued vectors (Manning & Schütze, 1999) and is commonly used in organizational research to operationalize textual similarity. Cosine similarity operationalizes similarity as the cosine of the angle between two vectors (Jones & Furnas, 1987), disregarding differences in magnitude. Table 1 provides the formula for cosine similarity between two vectors A and B in an n-dimensional Euclidean space. Cosine similarity ranges from −1 to 1, where −1 indicates opposite vectors (i.e., the angle between two vectors is 180°), 0 indicates orthogonal vectors (i.e., the angle is 90°), and 1 indicates vectors pointing in the same direction (i.e., the angle is 0°). 2 Cosine similarity is closer to 1 when the angle between two vectors is smaller.
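To make the computation concrete, here is a minimal Python/NumPy sketch of the cosine similarity formula from Table 1; the two vectors are hypothetical stand-ins for NLP-generated representations.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors A and B (see Table 1)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude
print(cosine_similarity(a, b))  # 1.0: differences in magnitude are disregarded
```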
Definitions and Mathematical Formulas of Textual Similarity/Dissimilarity Measures.
Note: For cosine similarity and distance measures, each formula shows the corresponding similarity/distance between two vectors A and B in an n-dimensional Euclidean space.
Other common similarity measures include geometric distance measures (e.g., Euclidean distance and Manhattan distance), set-based measures such as Jaccard (or Tanimoto) coefficient (Jaccard, 1912; Tanimoto, 1958; de França, 2016), and probabilistic measures such as JS divergence (Lin, 1991). Tables 1 and 2 summarize the Euclidean distance and Manhattan distance, detailing their formulas (Table 1) and their properties and limitations (Table 2). 3 Euclidean distance measures the straight-line distance between two points by aggregating squared deviations across all dimensions. Manhattan distance sums the absolute differences between the coordinates across dimensions. Minkowski distance is a generalized distance measure with a parameter p ≥ 1 (also known as the power). When the parameter p is set to 1, Minkowski distance is Manhattan distance; when p is set to 2, it is Euclidean distance; and as p approaches infinity, it converges to Chebyshev distance, which captures the largest difference between two vectors on any single dimension.
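As a sketch of how these distances relate, the following applies SciPy's distance functions to two hypothetical vectors; varying the Minkowski parameter p recovers each measure.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.minkowski(a, b, p=1))  # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(distance.minkowski(a, b, p=2))  # Euclidean: sqrt(9 + 4 + 0) ≈ 3.61
print(distance.chebyshev(a, b))       # p -> infinity: largest single-dimension gap = 3.0
```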
Properties and Drawbacks of Textual Similarity/Dissimilarity Measures.
Cosine similarity and geometric distance measures differ in how they quantify similarity. Cosine similarity focuses solely on the similarity of the direction of two vectors, without considering their magnitude. In practice, this means that two vectors are considered similar if they show a comparable pattern of relatively higher/lower values across dimensions, even if one vector exhibits generally higher or lower absolute values (analogous to measures of correlation). Geometric distance measures, however, account for both direction and magnitude of vectors by quantifying the geometric distance between their endpoints. Thus, geometric distance measures such as Euclidean distance emphasize differences in the magnitude or frequency of these dimensions (analogous to measures of agreement).
When vectors are normalized by their Euclidean length, cosine similarity and Euclidean distance become mathematically linked, as shown in Equation (1): 4

$$\text{Euclidean distance}(A, B) = \sqrt{2\bigl(1 - \text{cosine similarity}(A, B)\bigr)} \quad (1)$$

Vector normalization involves dividing each element by the vector's length—for example, the vector (3, 4) normalizes to (0.6, 0.8) because its Euclidean length is 5. 5 After vector normalization, cosine similarity and Euclidean distance capture overlapping information in an inverted form: they yield the same similarity rankings across multiple text pairs (Manning & Schütze, 1999; Poschmann et al., 2024).
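A minimal numeric check of Equation (1), using hypothetical vectors:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([5.0, 12.0])

# Normalize each vector by its Euclidean length (L2 norm).
a_n = a / np.linalg.norm(a)  # (0.6, 0.8)
b_n = b / np.linalg.norm(b)

cos = np.dot(a_n, b_n)           # cosine similarity (unchanged by normalization)
euc = np.linalg.norm(a_n - b_n)  # Euclidean distance between the normalized vectors

# Equation (1): Euclidean distance = sqrt(2 * (1 - cosine similarity))
print(np.isclose(euc, np.sqrt(2 * (1 - cos))))  # True
```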
In addition, set-based and probabilistic approaches are commonly used to assess similarity between certain types of vectors. Set-based measures such as Jaccard coefficient and Dice coefficient (Dice, 1945; Sorensen, 1948) are applicable only to binary vectors, 6 used to represent texts by the presence or absence (i.e., 1 or 0) of words. Both range from 0 (no overlap) to 1 (identical sets). Probabilistic measures are typically applied to discrete probability distributions (Prochaska & Theodore, 2018), where probabilities sum to one in each row of data. 7
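For illustration, this sketch computes both set-based coefficients on hypothetical binary presence/absence vectors:

```python
import numpy as np

# Binary vectors coding word presence (1) or absence (0) over a shared vocabulary.
a = np.array([1, 1, 0, 1, 0])
b = np.array([1, 0, 0, 1, 1])

intersection = np.sum((a == 1) & (b == 1))  # words appearing in both texts: 2
union = np.sum((a == 1) | (b == 1))         # words appearing in either text: 4

jaccard = intersection / union                 # 2/4 = 0.50
dice = 2 * intersection / (a.sum() + b.sum())  # 4/6 ≈ 0.67
print(jaccard, dice)                           # both equal 1 only for identical sets
```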
Probabilistic measures include Kullback–Leibler divergence (KL divergence; Kullback & Leibler, 1951), JS divergence, Hellinger distance, and Bhattacharyya distance. Intuitively, KL divergence measures the difference between two distributions, represented by the information loss incurred when one distribution is used to approximate the other. In contrast, JS divergence averages the information loss incurred when each distribution is approximated by their combined data distribution (Manning & Schütze, 1999). Both KL and JS divergence are non-negative, with higher values indicating greater dissimilarity and a value of zero indicating identical distributions. Hellinger distance measures the geometric distance between the square roots of probability distributions, while Bhattacharyya distance measures the amount of overlap between two distributions.
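The sketch below computes KL and JS divergence for two hypothetical probability distributions (e.g., normalized unigram counts), using SciPy's entropy function, which returns the KL divergence when given two distributions:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

# Two discrete probability distributions, e.g., normalized unigram counts.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = entropy(p, q, base=2)  # asymmetric: KL(p||q) != KL(q||p) in general

m = 0.5 * (p + q)           # the combined ("mixture") distribution
js = 0.5 * entropy(p, m, base=2) + 0.5 * entropy(q, m, base=2)  # symmetric

print(kl, js)               # both equal 0 only when p and q are identical
```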
NLP Methods and Illustrative Examples of Textual Similarity
NLP Methods
Multiple NLP approaches have been used to convert documents into numeric vectors to operationalize textual similarity. We provide an overview in Table 3. Our Open Science Framework (OSF) repository 8 includes the notebook Comparison between NLP methods in similarity calculation.ipynb, which illustrates how to apply each NLP method to texts of varying lengths and estimate their similarity.
Natural Language Processing Vectorization Techniques Used in Present Study.
Note: We used the LIWC 2015 program to run the vectorization analysis. RoBERTa embeddings were generated by a sentence transformer model named “all-roberta-large-v1” (https://huggingface.co/sentence-transformers/all-roberta-large-v1). GIST embeddings were generated by a sentence transformer model named “avsolatorio/GIST-Embedding-v0” (https://huggingface.co/avsolatorio/GIST-Embedding-v0).
Closed vocabulary approaches, such as Linguistic Inquiry and Word Count (LIWC; Pennebaker et al., 2015), count the occurrence of words and phrases in dictionaries, including both psychological and linguistic constructs (Hickman et al., 2022). LIWC focuses on “the ways people use words” (Pennebaker et al., 2015, p. 1) or the style of speech, as opposed to the content of text or its semantics (Pennebaker, 2016; Piezunka & Dahlander, 2019). LIWC counts the proportion of words in a text that occur in dictionaries such as pronouns (e.g., I and them) and causation (e.g., because and effect). For example, Srivastava et al. (2018) applied LIWC to measure cultural fit through employees’ email exchanges, focusing on similarity in the use of linguistic categories such as nouns and negations. Joseph et al. (2023) used LIWC to detect language pattern changes in corporate executive communication. Specific LIWC categories have also been used to code particular organizational constructs; for example, past, present, and future tenses were used to capture the temporal focus of CEO language (Crilly et al., 2016; Nadkarni & Chen, 2014), positive or negative emotions to capture public attitudes towards films in news articles (Odziemkowska, 2022), and certainty to capture precise language in firm reports or start-up pitches (El-Zayaty et al., 2025; Guo et al., 2017).
A variety of open vocabulary approaches exist. The most basic rely on n-grams where n = 1, or single words (Kobayashi et al., 2018). When n = 1, this provides a document-term matrix, where each document is represented as a row and each term (i.e., unigram) as a column. The entries can be binary (i.e., one if the term is present in the document, zero if not), counts (i.e., the number of times the term appeared in that document), or the counts can be transformed to give greater weight to infrequent terms (e.g., through the term frequency – inverse document frequency, or TF-IDF transformation). Therefore, each n-gram method captures distinct aspects of the words used, since binary unigrams focus on which words were (not) used, n-gram counts focus on how often words were used, and TF-IDF transformation emphasizes words that are less frequent in the corpus. n-grams focus on the specific words used in a text, but like LIWC, n-grams ignore the context of word use (when n = 1) and their semantics. n-grams and LIWC are primarily applicable to input texts with more than one word.
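A minimal scikit-learn sketch of the three document-term matrix variants; the two example documents are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the dog chased the cat", "the cat chased the dog"]

binary = CountVectorizer(binary=True).fit_transform(docs)         # word present (1) or absent (0)
counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)  # unigram and bigram counts
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)   # counts reweighted toward rare terms

print(binary.toarray())  # identical rows: unigrams ignore word order entirely
```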
More recently, researchers have begun applying embedding approaches that convert texts into high-dimensional vectors that capture semantics. Early approaches focused on word embeddings, including word2vec (Mikolov et al., 2013) and others (e.g., FastText and GloVe) that are applied to individual words. These models learn word representations based on the contexts in which words appear in the training data, resulting in a single vector for each word. An interesting application of word embeddings involves training multiple word embedding models on different subsets of data to estimate how the meaning of words changes over time or in response to discrete events. For example, Lawson et al. (2022) found that after hiring female CEOs, corporate documents represented women as more agentic than they previously did.
However, when these word embeddings are used to represent multi-word inputs, they often fail to capture full sentence semantics. For instance, given that each word has only one static vector, sentence representations are typically formed by averaging the word vectors (e.g., Abdurahman et al., 2024). This averaging procedure disregards word order and syntactic structure, so two sentences with different meanings but identical words (e.g., the dog chased the cat vs. the cat chased the dog) are assigned the same vector. Similarly, LIWC and unigram approaches also yield equivalent representations for such sentences. n-gram methods with n > 1 can reflect local word order, but they still struggle with meaning: two documents that express nearly identical ideas may receive a low similarity score if they use different but related terms, such as synonyms or hypernym–hyponym pairs (Rahutomo et al., 2012).
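To illustrate this order-insensitivity, the sketch below averages pretrained word2vec vectors for the two example sentences; loading via gensim's downloader and the specific pretrained model named here are illustrative choices, not the setup of any particular study:

```python
import numpy as np
import gensim.downloader as api

# A commonly available pretrained word2vec model (large download); any static
# word-embedding model would illustrate the same point.
wv = api.load("word2vec-google-news-300")

def avg_embedding(text: str) -> np.ndarray:
    """Average the static vectors of all in-vocabulary words in the text."""
    vectors = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vectors, axis=0)

s1 = avg_embedding("the dog chased the cat")
s2 = avg_embedding("the cat chased the dog")
print(np.allclose(s1, s2))  # True: averaging discards word order
```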
After 2017, transformer-based models including bidirectional encoder representations from transformers (BERT; Devlin et al., 2019) and generative pre-trained transformers (GPT; Radford et al., 2018) revolutionized NLP by providing contextualized representations that capture the semantics of multi-word inputs. While these models operate over token-level embeddings, the transformer architecture (Vaswani et al., 2017) dynamically transforms these representations based on surrounding context, enabling accurate modeling of sentence- and document-level meaning. Transformer architectures also underpin modern generative large language models, including those that power systems such as ChatGPT. In organizational applications of supervised machine learning, 9 sentence and document embeddings outperform other NLP methods (Hickman et al., 2024; Thompson et al., 2023). Unlike word embeddings, these models are rarely trained from scratch by organizational researchers. These methods can handle single words, but they are especially effective for phrases and multi-word inputs.
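As a minimal sketch, the following encodes the two example sentences with the sentence-transformers library, using the all-roberta-large-v1 model listed in Table 3; unlike the averaged word2vec vectors above, the resulting contextualized embeddings are similar but not identical across the two word orders:

```python
from sentence_transformers import SentenceTransformer, util

# The RoBERTa-based sentence transformer listed in Table 3.
model = SentenceTransformer("all-roberta-large-v1")

emb = model.encode(["the dog chased the cat",
                    "the cat chased the dog"])

# The two sentences receive a high but imperfect similarity score:
# contextualized representations preserve information about word order.
print(util.cos_sim(emb[0], emb[1]))
```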
Illustrative Textual Similarity Example
For single words, sentences, and paragraphs, Table 4 reports cosine similarities, Euclidean distances, Manhattan distances, and JS divergences, as well as the rank order of similarities in each measure (in parentheses, from most [1] to least similar [6]), using NLP operationalizations from Table 3: LIWC, n-grams (binary unigrams, n-gram counts where n = 1 and 2, and TF-IDF transformed n-gram counts), word2vec (applied to each word then averaged across the document), and RoBERTa embeddings. Supplemental Table S3 reports the same for Minkowski distance, adjusted cosine similarity, Jaccard coefficient, and KL divergence. Our aim is to illustrate the potential variation in textual similarity estimates caused by different NLP methods and measures.
Examples of Natural Language Processing Vectorization Techniques Applied in Similarity Calculation.
Note: The number in parentheses indicates the rank order of similarities for each measure. We extracted text examples of sentences and paragraphs from the suggestion-reply dataset used in the present study. We did not calculate each similarity/distance measure using the LIWC and three n-grams methods for the single word group, given that these methods are less applicable to single-word inputs. We calculated each similarity/distance measure for the sentence/paragraph groups by using all six NLP methods. We calculated JS divergence only with LIWC, binary unigrams, and n-gram counts, because it can only be applied to probabilistic vectors. We used base 2 for the logarithm in JS divergence. We used the LIWC 2015 program. We used n-grams with n = 1 and 2 to calculate similarities based on n-gram counts and TF-IDF n-gram counts and with n = 1 for binary unigrams. We averaged the word2vec embeddings for the sentence and paragraph groups and generated the RoBERTa embeddings by applying a sentence transformer model named “all-roberta-large-v1” (https://huggingface.co/sentence-transformers/all-roberta-large-v1). The code for calculating these examples is stored in the OSF repository (https://osf.io/ny94z/?view_only=a069b78b2f544ce3a7a0ae5553138bc0).
Notably, the possible range of textual similarity values differs across methods. The values of geometric distance measures (e.g., Euclidean distance) and JS divergence are always positive. For cosine similarity, the range varies as a function of the NLP method: for LIWC and n-gram counts, where variable values are always zero or positive, cosine similarity ranges from 0 to 1. However, with embedding approaches, variable values can be negative so the possible range extends from −1 to 1.
Substantial variation occurs both within and between similarity measures for word-, sentence-, and paragraph-level comparisons. We observe substantial rank order inconsistency in results: No two columns exhibit the same rank order across all three comparison groups. The variation in rank orders of those values arises because NLP methods encode language differently, and similarity measures highlight distinct aspects of those representations. For example, LIWC receives the highest similarity rankings for cosine similarity and JS divergence for the sentence- and paragraph-length texts, while returning the lowest (or second lowest) similarity rankings for the two distance measures. This pattern can be attributed to the fact that cosine similarity is scale invariant and JS divergence is insensitive to vector magnitude once vectors are normalized into probability distributions. In contrast, distance measures are highly sensitive to absolute magnitude differences.
Limitations of Textual Similarity Measures
Despite textual similarity's increasing use, several additional challenges and limitations can impact the validity of textual similarity measures. First, variation across similarity measures is problematic because it can alter the results of statistical significance tests. On the one hand, these differences could indicate unreliability of the analytical approach. On the other hand, textual similarity estimated from different NLP methods may indicate different constructs (e.g., Piezunka & Dahlander, 2019).
Second, the “curse of dimensionality” (Bellman, 1966) 10 undermines the effectiveness and interpretability of similarity measures in high-dimensional spaces. As dimensionality increases, data tends to become sparse (i.e., many zeroes, such as occurs with n-grams) or the number of variables exceeds the number of observations, as is common for document-term matrices (Aggarwal et al., 2001). In such contexts, distance metrics lose discriminative power because distances between points tend to converge, making it difficult to separate observations meaningfully (Aggarwal et al., 2001; Kabán, 2011). Echoing the no free lunch theorem (Wolpert & Macready, 2002), no single similarity measure performs optimally across all tasks, and the appropriate distance measure in high-dimensional contexts is often unclear (Aggarwal et al., 2001).
Third, researchers have claimed that cosine similarity tends to be overly biased towards features with higher values while being less affected by the number of features shared between two vectors (Li & Han, 2013). Adjusted cosine similarity addresses this issue by modifying the standard cosine similarity measure to reduce its bias toward high-magnitude features (Sarwar et al., 2001). We probed this in Supplemental Table S4 by examining a simple document-term matrix with three texts and three n-grams. The tendency to prioritize high-value features was observed not only in cosine similarity but also in other distance measures, suggesting that the bias stems from using the n-gram representations rather than from any specific similarity measure.
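The following sketch mirrors the kind of probe reported in Supplemental Table S4, with hypothetical values rather than the actual supplemental data: a small document-term matrix in which one high-count n-gram dominates both cosine similarity and Euclidean distance.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical document-term matrix: three texts over three n-grams.
# A and B share only the high-count first n-gram; A and C share all three.
A = np.array([10.0, 1.0, 1.0])
B = np.array([10.0, 0.0, 0.0])
C = np.array([1.0, 1.0, 1.0])

for name, other in [("B", B), ("C", C)]:
    cos = 1 - distance.cosine(A, other)  # SciPy returns cosine *distance*
    print(name, round(cos, 3), round(distance.euclidean(A, other), 3))

# A is rated closer to B than to C on both measures (cosine: .990 vs. .686;
# Euclidean: 1.414 vs. 9.0), even though A and C share every n-gram:
# the high-value feature dominates regardless of the measure chosen.
```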
To address these issues, we next review applications of textual similarity in organizational research, including the reporting of robustness checks for similarity measurement. Notably, some papers use multiple NLP methods and similarity measures for similarity operationalizations (e.g., testing for consistent results across multiple NLP methods or similarity measures as robustness checks; Guo et al., 2021; Rule et al., 2015). These validation approaches for textual similarity measurement are crucial as they demonstrate that the research results generalize across alternative NLP methods or similarity measures, rather than being mere statistical artifacts of a specific approach. Then, we provide an empirical demonstration comparing results across NLP methods and similarity measures.
Review of Applications of Textual Similarity
Method
To build a comprehensive understanding of how textual similarity is applied in existing organizational and psychological research, we sought published studies that employed textual similarity in their NLP analyses. Specifically, as is common in organizational research reviews (e.g., Aguinis et al., 2009), we searched nine prominent journals for multiple similarity measures: Academy of Management Journal, Academy of Management Discoveries, Proceedings of the National Academy of Sciences, Personnel Psychology, Journal of Applied Psychology, Journal of Management, Strategic Management Journal, Organization Science, and Administrative Science Quarterly. The search terms were “cosine similarity,” “Euclidean distance,” “Manhattan distance,” “Hellinger distance,” “Bhattacharyya distance,” “Jaccard coefficient” (or “Jaccard index” or “Tanimoto coefficient”), “Dice coefficient,” “Kullback–Leibler divergence” (or “KL divergence” or “relative entropy”), and “Jensen–Shannon divergence” (or “information radius” or “IRad”). When an initial search returned an excessive number of papers, we added search terms to narrow the results, including “natural language processing,” “NLP,” “text,” “LIWC,” “ngram,” “word2vec,” and “embedding.” Although not commonly included in organizational reviews, we included Proceedings of the National Academy of Sciences because we were aware of organization-relevant papers published there that used textual similarity.
The initial search returned 1928 papers. We excluded papers that did not employ NLP analyses, used textual similarity only in supplementary analysis, or failed to report their NLP method. Ultimately, 58 papers published between 2010 and 2025, all utilizing one or more similarity measures with NLP, were included in our review. These papers were coded by summarizing their NLP methods, dimensions of vector representations (i.e., the number of variables used to operationalize text), research topics, measured constructs, relevant findings, supplementary analyses supporting measurement validity (if any), texts used for similarity measurement, sample sizes, and data sources.
Descriptive Results and Discussion of Textual Similarity Applications
Table 5 summarizes the aggregated results of our review. For each similarity measure, Table 5 lists the number of studies using each NLP method, constructs measured, text sources used, and the number of studies reporting robustness checks. Supplemental Material B (Supplemental Tables S5 to S9) provides comprehensive details about each study included in the review.
Summary of Review of Applications of Textual Similarity in Organizational and Psychological Research.
Note: For simplicity, n-grams include unigrams, binary unigrams, and other n-gram models that did not specify the value of n; BERT includes a broad collection of several word/sentence embedding methods based on the BERT architecture (i.e., BERT-Large, Sentence-BERT, BERT-base-uncased, and DistilBERT). all-mpnet-base-v2 is a sentence-transformer model (https://huggingface.co/sentence-transformers/all-mpnet-base-v2) similar in nature to BERT. k indicates the number of studies that utilized the corresponding NLP method when applying the textual similarity measure. Some studies used multiple NLP methods to operationalize the constructs in the main analyses, so the sum of k exceeds the number of studies included in this review. n reporting robustness checks lists the number of papers using alternative similarity measures or NLP methods to confirm the robustness of their results obtained with the original measures or NLP methods.
Cosine similarity was the most common textual similarity measure, appearing in 43 papers and applied to the widest range of NLP methods. In contrast, all other similarity measures were used in just 15 papers. Six used JS divergence and four used KL divergence, typically with unigram counts (Hughes et al., 2012; Klingenstein et al., 2014), LIWC (Lu et al., 2024), or LDA (Ahn & Greve, 2025; Barron et al., 2018), since these methods yield vector representations that are either inherently probabilistic or can be transformed into probability distributions (e.g., unigram counts; see p. 305; Manning & Schütze, 1999). In these cases, KL and JS divergence measured either content similarity (e.g., trial-class distinguishability and category heterogeneity) or linguistic similarity (e.g., conformity). Euclidean distance appeared in three papers, used twice with word2vec to capture the semantic speed of discourse, or how rapidly topics shift in text (Berger & Toubia, 2024; Toubia et al., 2021). Finally, the Jaccard coefficient was used only twice, both times with binary unigrams to assess overlap in patent abstracts or applications (Arts et al., 2018; Barber IV & Diestre, 2022).
Prior work included in our review used six major types of NLP methods. Four (7%) of the 58 studies used LIWC. Twenty-one (36%) used n-grams (primarily unigram counts and binary unigrams, with six using TF-IDF transformed n-gram counts). A number of these studies cited Hoberg and Phillips (2010), which is perhaps the earliest study cited by management research to apply cosine similarity to binary unigrams. Seven studies (12%) applied latent Dirichlet allocation (LDA) topic modeling to n-grams. One study (2%) used another topic modeling method, latent semantic analysis (LSA). Twenty studies (34%) used techniques that capture semantic relationships between words, including word2vec, FastText, GloVe, and HistWords. One study (2%) generated document embeddings using doc2vec, an extension of word2vec. Transformer-based sentence embedding models, such as BERT and all-mpnet-base-v2, were used in five studies (9%), starting in 2023.
Twelve studies defined and examined four distinct types of similarity (content, linguistic, lexical, and semantic). For example, Piezunka and Dahlander (2015, 2019) called similarity based on binary unigrams content similarity and LIWC-based similarity linguistic similarity. Other researchers have used the term content similarity with a variety of NLP methods, including TF-IDF transformed n-grams, word2vec, and transformer-based document embeddings (Guo et al., 2021; Hasan et al., 2015; Patterson et al., 2024; Testoni, 2022). Two researchers used topic modeling methods (i.e., LSA and LDA) to measure constructs (e.g., level of justification and information diversity) related to content similarity (Tuertscher et al., 2014; Wu & Kane, 2021). One researcher also dubbed their approach lexical similarity when using word2vec to measure similarity between digital transformation content and CEO letters to shareholders (Filatotchev et al., 2023). Finally, several researchers referred to semantic similarity, which emphasizes the meaning overlap between texts, when using TF-IDF transformed n-grams, word2vec, doc2vec, and BERT (Hernandez & Nie, 2023; Lewis et al., 2023; Margulis et al., 2022). For example, Hernandez and Nie (2023) fine-tuned a paired BERT model to generate cosine similarity scores that matched item intercorrelations.
Methodological Issues in Textual Similarity Applications
A key finding here is that the linguistic, content, semantic, and lexical similarity labels have been used inconsistently in prior research. Linguistic similarity, or similarity of style, is well captured by dictionary approaches such as LIWC (e.g., Piezunka & Dahlander, 2019). Content and lexical similarity are, essentially, synonyms, in that both refer to the specific words used or vocabulary of language. Thus, n-gram counts are well-suited to operationalizing such similarity. However, as mentioned above, n-grams were sometimes labeled as capturing texts’ semantics, despite being more suitable for measuring content. We recommend that semantic similarity be operationalized with embeddings (e.g., BERT). Similarly, word2vec captures the semantics of individual words, yet was labeled as measuring lexical similarity. We probe further into the meaning of different NLP methods with our empirical demonstration examining the consistency of similarity measures across NLP methods.
Further, the mismatch between NLP methods and labels of similarity dimensions often came with a lack of clarity when connecting construct conceptualization to operationalization. Some studies did not consider or discuss the dimension of textual similarity being measured in the construct, thus potentially resulting in a discrepancy between the actual dimension being measured and the nominal one being captured through the NLP method. We recommend selecting a suitable NLP method based on what the construct aims to capture when comparing the similarity between two texts and explaining the rationale.
Notably, some studies in the initial search did not specify the NLP methods used or lacked relevant details about their NLP analyses, leading to their exclusion (Angus, 2019; Catalini et al., 2015; Lawrence & Poliquin, 2023). The absence of detailed descriptions of the NLP operationalizations raises concerns about methodological transparency and reproducibility. Hence, we advocate for transparent reporting of NLP methods in future research, including data preprocessing approaches, specific NLP methods used, and (hyper)parameter settings. Further, researchers should share their NLP code so that, even if manuscript details are unclear, readers and reviewers can determine what was done and adapt the code. Standardizing such practices will enhance methodological clarity and promote collective knowledge.
Approaches to Validating Textual Similarity
For textual similarity measures, alternative similarity measures and NLP methods were occasionally utilized as robustness checks to provide evidence of reliability. Two studies found that Euclidean distance yielded findings consistent with the cosine similarity (Garg et al., 2018; Rule et al., 2015). One study replicated cosine similarity scores by using word mover's distance (WMD), which measures the dissimilarity between two text documents as the minimum distance that words from one document need to “travel” to match words in another document (Kusner et al., 2015). Four studies used multiple word embedding models (e.g., word2vec and GloVe) to compute textual similarity scores (Bhatia & Walasek, 2023; Garg et al., 2018; Rastelli et al., 2022; Toubia et al., 2021). Four studies exhibited general consistency of results by applying at least one additional NLP method (Charlesworth et al., 2022; Guo et al., 2021; Lewis et al., 2023; Sajjadiani et al., 2024) to compute cosine similarity. One study applied distinct pairs of NLP methods and similarity measures to ensure consistent results (Doxas et al., 2010).
However, when papers did not explicitly consider whether the similarity measure and NLP method were aligned with the targeted construct, the value of their robustness checks is diminished. For example, it might be less valid to use transformer embeddings as an alternative NLP method when the construct reflects content similarity. In such cases, even if consistent results are obtained, the robustness check may not constitute a valid check on the results.
Several papers also reported evidence of textual similarity measures’ convergent, discriminant, face, and external evidence of validity. For example, Hasan et al. (2015) collected human ratings of task overlap on a subset of their data and found that cosine similarity correlated more strongly with these ratings than with another measure (i.e., task coordination), thus demonstrating convergent and discriminant evidence of validity. Schweisfurth et al. (2023) corroborated the face validity of their cosine similarity measure for idea novelty by consulting real-world experts (i.e., company managers). Toubia et al. (2021) and Berger and Toubia (2024) collected human perceptions of semantic speed to demonstrate face validity of their word2vec-based measure. Frésard et al. (2020) assessed the external validity 11 of the cosine similarity measure (i.e., vertical relatedness between upstream and downstream companies) by performing two analyses: they examined whether the measure predicted actual vertical relationships between firm pairs, and whether firm pairs identified as vertically related exhibited expected accounting properties. Similarly, Arts et al. (2018) evaluated the external validity of their patent similarity measure by comparing it with expert ratings.
Tests of Consistency of Similarity Measures Across NLP Methods
Method
To investigate the consistency of different similarity measures across a variety of NLP methods, we applied these methods to the dataset described by Dahlander and Piezunka (2014), Piezunka and Dahlander (2015, 2019), and Park et al. (2024). This dataset consists of crowdsourced suggestions made to a variety of companies on an online platform. Previous work using the dataset focused on operationalizing constructs such as idea variety and content distance between pairs of suggestions (Park et al., 2024; Piezunka & Dahlander, 2015), as well as linguistic and content match between a suggestion and its corresponding rejection explanation (Piezunka & Dahlander, 2019). In the article, we focused on a subset (N = 232,676) of the suggestions that received replies from company representatives. Then, we applied NLP to both the suggestions and the replies and estimated the similarity between each suggestion and its corresponding reply.
NLP Methods
The NLP methods applied in the present study are summarized in Table 3 and described in further detail below.
Additionally, we used the GIST-Embedding-v0 model (Solatorio, 2024), which is a BERT variant fine-tuned for semantic similarity tasks. Because this model's hidden layers have 768 dimensions, it converts each text into a 768-dimensional vector capturing its semantics. These document embedding models are applied to the entire input text.
Similarity Measures
In the article, we focus on cosine similarity, Euclidean distance, Manhattan distance, and JS divergence. Note that JS divergence requires inputs to be probability distributions and is therefore applicable only to transformed LIWC outputs, transformed binary unigrams, or transformed n-gram count vectors. Additional results for Minkowski distance (parameter p = 3), adjusted cosine similarity, Jaccard coefficient, and KL divergence are reported in Supplemental Material C. These additional measures were included for completeness but showed high redundancy with other measures or could be applied to only a subset of the NLP methods. Minkowski distance (except when applied to TF-IDF n-gram counts) correlated strongly with Euclidean and Manhattan distances (rs ≥ .84), indicating substantial overlap. Adjusted cosine similarity correlated almost perfectly with cosine similarity (rs ≥ .99), suggesting that adjustment is unnecessary in typical NLP contexts. The Jaccard coefficient is only applicable to binary n-grams, limiting its generalizability. KL divergence is only applicable to outputs of transformed LIWC and n-gram methods. It is also asymmetrical and requires a designated focal distribution (i.e., either the suggestion or the reply), making it less suitable than JS divergence in this context.
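The sketch below condenses the four focal measures into a single helper applied to a hypothetical suggestion-reply vector pair; it is illustrative only, not the exact code used in our analyses (which is available in the OSF repository):

```python
import numpy as np
from scipy.spatial import distance
from scipy.stats import entropy

def all_measures(u: np.ndarray, v: np.ndarray) -> dict:
    out = {
        "cosine": 1 - distance.cosine(u, v),   # SciPy returns cosine distance
        "euclidean": distance.euclidean(u, v),
        "manhattan": distance.cityblock(u, v),
    }
    # JS divergence applies only to vectors that can be normalized into
    # probability distributions (non-negative, nonzero sum).
    if u.min() >= 0 and v.min() >= 0 and u.sum() > 0 and v.sum() > 0:
        p, q = u / u.sum(), v / v.sum()
        m = 0.5 * (p + q)
        out["js_divergence"] = (0.5 * entropy(p, m, base=2)
                                + 0.5 * entropy(q, m, base=2))
    return out

suggestion = np.array([2.0, 0.0, 1.0, 3.0])  # hypothetical n-gram counts
reply = np.array([1.0, 1.0, 0.0, 2.0])
print(all_measures(suggestion, reply))
```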
Results
Descriptive Statistics
Table 6 presents the descriptive statistics of similarity measures generated from the seven NLP methods. Assessing distributional normality serves as a descriptive tool for understanding how similarity values are distributed across text pairs, which in turn reflects the underlying characteristics of each similarity measure or NLP method. In particular, the degree of skewness and kurtosis provides insight into whether a method produces broadly distributed similarity scores or concentrates similarity values at one end of the distribution. In terms of the different similarity measures, cosine similarity and JS divergence provided values most closely resembling normal distributions, as skewness and kurtosis were both, on average, lower than for Euclidean or Manhattan distance. Skewness and kurtosis were largest, on average, for Euclidean distance.
Descriptive Statistics of Similarity/Dissimilarity Measures.
Note: N = 232,676 for cosine similarity and distance measures. N = 225,453 for Jensen–Shannon divergence, given that it has a stricter requirement for computation (i.e., a pair of zero vector representations could cause the divergence measure to be meaningless) and we deleted the relevant observations. The kurtosis provided is excess kurtosis and typically compared to a value of 0. A univariate normal distribution has an excess kurtosis of 0. We used n-grams with n = 1 and 2 to calculate similarities based on n-gram counts and TF-IDF n-gram counts and with n = 1 for binary unigrams. RoBERTa embeddings were generated using “all-roberta-large-v1” (https://huggingface.co/sentence-transformers/all-roberta-large-v1). GIST embeddings were generated using “avsolatorio/GIST-Embedding-v0” (https://huggingface.co/avsolatorio/GIST-Embedding-v0). Jensen–Shannon divergence was only calculable in the case of LIWC, binary unigrams, and n-gram counts, and we used base 2 for the logarithm.
In terms of NLP methods, n-gram counts exhibited highly right-skewed and sharply peaked distance distributions (skewness around 5 and kurtosis higher than 50). This type of distribution indicates that most text pairs exhibited minimal overlap, with a small number of pairs generating disproportionately small distance values. This distributional pattern reflects the inherent sparsity of exact n-gram matching in textual similarity assessment. A similar, though less extreme, distributional pattern was observed for binary unigrams (skewness ranging from 1 to 3 and kurtosis from 3 to 24). Binary unigrams reduce sensitivity to word frequency by encoding only whether a word appears, rather than how often it appears. However, they remain fundamentally limited to surface-level overlap and therefore fail to capture synonymy or semantic similarity, because words not overlapping in surface form are treated as completely different. TF-IDF n-gram counts showed similarly skewed and peaked distance distributions when used with Euclidean distance, indicating that the method remains primarily driven by surface-level word overlap despite differential weighting of words.
Similarities based on LIWC and averaged word2vec embeddings exhibited highly similar distributions, with substantial skewness and kurtosis. Similarities from document embeddings (GIST and RoBERTa) tended to have more normal distributions, as skewness and kurtosis tended to be small (absolute values < 1). This suggests that document embedding-based similarities offer more nuanced comparisons because semantically similar texts (even with no word overlap) can have moderate similarity scores and consistently scaled distances.
Correlations
We conducted correlation analyses to investigate consistency across similarity measures and NLP methods. Table 7 shows the Pearson's correlations within and between similarity measures across NLP methods. Strong correlations indicate reliability and suggest that different methods are capturing similar information. 12
Correlation Matrix of Similarity/Dissimilarity Results Calculated by Various NLP Methods.
Note: N = 232,676 for correlations among all measures other than JS divergence. N = 225,453 for correlations between JS divergence and other measures, given that JS divergence has a stricter requirement for computation (i.e., a pair of zero vector representations could cause the divergence measure to be meaningless) and we deleted the relevant observations. CS = Cosine similarity, ED = Euclidean distance, ManD = Manhattan distance, JSD = JS divergence.
First, the NLP document embeddings return highly consistent (although not identical) results. Specifically, the RoBERTa-based and GIST-based similarity measures exhibit an absolute correlation of r = .88, demonstrating that both capture a similar dimension (i.e., semantics) of texts. This suggests that, at least within the BERT family of models, alternative modern embedding methods will likely return highly reliable results as robustness checks.
Second, similarities based on LIWC and averaged word2vec embeddings also exhibit consistent results, with absolute correlations ranging from .61 to .83. Although word2vec was trained to capture the semantics of individual words, this suggests that LIWC and averaged word2vec embeddings capture similar information (i.e., linguistic style) when applied to sentences/documents.
Third, the n-gram methods usually returned consistent results if the same similarity measure was used (most rs ≥ .81). This means that overall, they measure the same dimension (i.e., content) of texts, despite differing in how and whether they count the frequency of word use. The exception is that Euclidean distance based on TF-IDF transformed n-grams correlated minimally with Euclidean distances based on binary unigrams or n-gram counts (rs < .06).
Fourth, the binary unigram results exhibit the lowest consistency across similarity measures. Binary unigrams exhibited an average absolute correlation of .34 across the four similarity measures. This is followed by n-gram counts and TF-IDF n-gram counts, which demonstrate average absolute correlations of .35 and .42, respectively (although JS divergence was not calculated for TF-IDF).
Fifth, within a given NLP method, geometric distance measures (Euclidean distance and Manhattan distance) tend to be negatively correlated with cosine similarity. This is expected, given that similarity is the inverse of distance. However, this was not always the case: cosine similarity and the distance measures were independent for binary unigrams (i.e., |rs| ≤ .02) and slightly positively correlated for n-gram counts. 13,14 This means that for n-gram counts, on average, the vectors representing the suggestion and the reply move slightly farther apart (e.g., the magnitude of one vector increases relative to the other) as the angle between the two vectors decreases.
Similarly, geometric distance measures (Euclidean, Manhattan, and Minkowski distances with parameter p = 3) tended to be positively correlated, as expected. However, for TF-IDF transformed n-gram counts, Manhattan distance and Minkowski distance were negatively correlated (r = −.48; Supplemental Table S11). Further, Euclidean and Manhattan distances correlated only r = .23. 15
Sixth, JS divergence correlated almost perfectly with cosine similarity (i.e., |rs| ≥ .96) across LIWC, binary unigrams, and n-gram counts. It was independent of Euclidean and Manhattan distances (i.e., |rs| ≤ .04) for the two n-gram methods.
Finally, NLP document embeddings return essentially the same results across cosine similarity, Euclidean distance, and Manhattan distance. Within each document embedding method (i.e., RoBERTa and GIST), the absolute correlations between all similarity measures ranged from .99 to 1.00. These high correlations among different similarity measures indicate high reliability for document embedding-based similarity measures.
General Discussion
Textual similarity is increasingly used in a variety of ways in organizational research. Our review of organizational research revealed a jingle-jangle fallacy: The same label is sometimes applied to different NLP methods (e.g., content similarity used to refer to similarity based on n-grams, word2vec, or document embeddings) while different labels are sometimes used for the same method (e.g., both content and semantic similarity used to describe the n-gram-based approach). In our empirical investigations, we found notable differences in the similarities obtained from different NLP methods. These differences arise not because particular similarity methods are inherently flawed, but because different NLP methods encode conceptually distinct properties of natural language text.
Importantly, n-gram methods—the most common NLP methods used in organizational research—often produced inconsistent results across similarity measures, raising concerns about conclusions based on analyses that rely on n-gram-based textual similarity. Addressing these issues requires alignment between the construct of interest, the NLP method, and the similarity measure. Accordingly, we provide best practice recommendations for using textual similarity in future research below and in Table 8, as well as suggestions for where NLP and textual similarity could be applied in future research (Table 9).
A Checklist of Construct Operationalization Based on Textual Similarity in Organizational Research Practice.
Recommendations on Future Organizational Research Using Textual Similarity Measures.
Note: Table 8 provides guidance for which similarity measure is appropriate for each NLP method.
Best Practice Recommendations for Textual Similarity
Table 8 provides step-by-step guidance for using NLP-based textual similarity in future research. We suggest that the first, key step to rigorously conducting such research is clearly specifying the similarity construct of interest and how it will be used in the study. Our review uncovered three conceptually distinct types of similarity mentioned in the literature: content (used interchangeably with lexical), linguistic style, and semantics. These distinctions are conceptual, but closely tied to NLP methods, with important implications for how similarity should be operationalized. Further, because results of different NLP methods are not interchangeable, each type of similarity is best operationalized using different NLP methods, as done in some prior research (e.g., Piezunka & Dahlander, 2019).
n-gram-based similarities reflect content similarity (e.g., Piezunka & Dahlander, 2019). Concerningly, our results suggest a lack of consistency when estimating textual similarity with different n-gram methods or with different similarity measures within each n-gram method. In particular, when using any n-gram approach, Euclidean, Manhattan, and Minkowski distances generally showed weak (and occasionally even positive) correlations with cosine similarity, and sometimes negative correlations with each other. Early guidance on textual similarity claimed that Euclidean distance is less appropriate for non-normally distributed data (Manning & Schütze, 1999), yet recent work has applied it to n-grams (e.g., Poschmann et al., 2024). Since n-gram features—whether binary, based on raw counts, or TF-IDF weighted—are inherently non-normal, this limitation raises concerns about the suitability of applying distance measures to n-grams.
A key issue with n-gram-based similarity arises in datasets where text pairs share limited vocabulary, which is more common with (a) shorter texts and (b) larger or more heterogeneous datasets. Without proper preprocessing, n-grams might conflate lexical overlap with content similarity, such as cases where documents share many surface-level n-grams (e.g., function words) despite being unrelated (Hickman et al., 2022). This reflects a conceptual limitation of the n-gram-based representation, rather than a problem inherent to similarity measures. When some n-gram overlap exists, however, the choice of similarity measure can exacerbate or attenuate these representational limitations. Distance-based measures such as Euclidean or Minkowski distance apply squaring (or higher-order) operations that disproportionately amplify large dimension-wise differences (Thant et al., 2020; Xia et al., 2015). This amplification emphasizes surface-level overlap and text-length differences, potentially obscuring more substantive aspects of similarity. In contrast, cosine similarity normalizes vector magnitude and depends solely on the angle between vectors, making it less sensitive to such amplified dimensional differences and therefore more suitable for high-dimensional, sparse n-gram representations. Its relative robustness also extends to binary n-gram vectors due to its strong correlation with alternative measures tailored to binary data—such as Jaccard coefficient (as shown in Supplemental Table S11).
We recommend using LIWC or word2vec to capture linguistic style similarity for sentences, paragraphs, and documents. Our results show LIWC and averaged word2vec embeddings correlate highly, suggesting that they capture similar information about text. LIWC has been described as capturing linguistic style similarity in previous research (e.g., Piezunka & Dahlander, 2019). It seems that word2vec is also well suited to capturing linguistic style similarity and is an open-source alternative to LIWC for multi-word inputs.
We recommend using transformer-based embeddings to capture semantic similarity. These models are particularly effective because they capture the semantics of sentences and documents. Transformer models such as BERT, GPT, and RoBERTa are pre-trained on vast amounts of text and designed to capture meaning (Li et al., 2020). Our empirical results further support this, showing that embeddings generated by sentence transformers are consistent between models and across different similarity measures.
word2vec is also useful for capturing the semantic similarity of individual words. Transformer-based embeddings are designed to capture the semantics of longer input sequences; although they can be applied to individual words, the models are larger and more expensive to train from scratch. Thus, word2vec gives researchers the opportunity not only to understand relationships between individual words but also to track how those relationships change in response to continuous or discrete events (e.g., Kozlowski et al., 2019; Lawson et al., 2022). This makes word-level embeddings particularly useful for studying semantic shifts in topics or groups (e.g., gender and social class) over time or across contexts.
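At the word level, pretrained or custom-trained word2vec vectors support direct similarity queries, as in the sketch below; the model and query words are illustrative, and studying semantic shifts would additionally involve training separate models per period or group and aligning their vector spaces.

```python
# Sketch: word-level semantic similarity with pretrained word2vec
# vectors (the model and the query words are illustrative).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
print(wv.similarity("manager", "leader"))     # cosine between word vectors
print(wv.most_similar("innovation", topn=5))  # nearest semantic neighbors
```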
Notably, similarities in content, linguistic style, and semantics can be used to operationalize other, downstream constructs. Researchers have used similarity measures to capture the similarity of products (e.g., Jung et al., 2024; Testoni, 2022) and cultural fit (e.g., Goldberg et al., 2016). In specifying the construct to be measured, the key is to consider which type of similarity (i.e., content, linguistic style, or semantic) indicates that construct. For example, Goldberg et al. (2016) used LIWC-based similarities, which reflect linguistic style, to measure changes over time in individuals’ cultural embeddedness. Product similarity has often been measured with binary unigrams (Testoni, 2022). Such downstream constructs should be defined in ways that make explicit which type of textual similarity they rely on, thereby clarifying the link between theory and operationalization.
After specifying the construct of interest and selecting an NLP method, we suggest preregistering the study design. Such pre-analysis plans help ensure that results are not obtained by fishing across multiple NLP methods or similarity measures, reducing the odds of spurious results and strengthening confidence in research findings.
We recommend selecting similarity measures that align with the NLP method used. While some similarity measures, such as the Jaccard coefficient, are restricted to specific data types (i.e., binary n-grams), cosine similarity is broadly applicable to n-grams, embeddings, and probabilistic vectors. When applied to data types for which specialized measures exist, cosine similarity generally yields results consistent with those measures (Supplemental Tables S11 and S12). Additionally, JS divergence can be used for probabilistic vectors generated by LIWC, topic modeling, or n-gram methods. Euclidean distance can be considered for averaged word2vec and transformer-based embeddings, as it aligns with the conceptualization of certain constructs (e.g., semantic speed of words; Toubia et al., 2021) and supports a geometric interpretation of semantic change in embedding space.
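For probabilistic vectors, JS divergence is available in SciPy; a minimal sketch with illustrative topic distributions follows. Note that SciPy's function returns the JS distance, the square root of the divergence.

```python
# Sketch: JS divergence between two illustrative topic distributions.
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.6, 0.3, 0.1])  # topic distribution, document A
q = np.array([0.2, 0.5, 0.3])  # topic distribution, document B

js_distance = jensenshannon(p, q, base=2)  # sqrt of the JS divergence
js_divergence = js_distance ** 2
print(f"JS distance={js_distance:.3f}, JS divergence={js_divergence:.3f}")
```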
After primary analyses are complete, it is important to conduct robustness checks. These can take three forms: applying alternative NLP methods, testing different similarity measures, and validating with subject matter experts. Researchers should consider applying multiple NLP operationalizations and assessing the consistency of findings across methods, as different approaches can yield varying outcomes; yet over 75% of the reviewed studies reported only a single operationalization of textual similarity. Each similarity type can be implemented with multiple NLP techniques. For instance, semantic similarity can be assessed using different document embeddings (e.g., RoBERTa vs. GIST), and replicating results across NLP methods helps ensure findings are not artifacts of a specific methodological choice. Moreover, researchers should seek to confirm the consistency of their results using alternative similarity measures when appropriate; for example, the Jaccard coefficient can be a suitable alternative to cosine similarity for binary n-grams.
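A simple way to implement such a robustness check is to compute similarity for the same text pairs under two operationalizations and correlate the resulting estimates. In the sketch below, the similarity scores are illustrative placeholders, not values from our analyses.

```python
# Sketch: robustness check across two NLP operationalizations.
# The arrays are placeholder similarity scores for the same text pairs.
import numpy as np
from scipy.stats import spearmanr

sims_method_a = np.array([0.81, 0.42, 0.67, 0.15, 0.73])  # e.g., RoBERTa
sims_method_b = np.array([0.78, 0.45, 0.70, 0.12, 0.69])  # e.g., GIST

rho, pval = spearmanr(sims_method_a, sims_method_b)
print(f"rank-order consistency: rho={rho:.2f} (p={pval:.3f})")
```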
Last, as Kobayashi et al. (2018) emphasized, providing validity evidence, rather than assuming validity, will increase confidence in the results. This validity evidence can come in multiple forms. Convergent evidence can be provided by asking subject matter experts to judge the similarity of texts on the same focal type of similarity and then correlating those judgments with the textual similarity results (e.g., Hasan et al., 2015). It can also be provided through data triangulation, by comparing the similarity measure with a theoretically related measure operationalized using external data (Kobayashi et al., 2018; e.g., Lu et al., 2024). Although face validity is not generally considered a type of validity evidence, it can be useful to check whether subject matter experts agree that texts identified as (dis)similar on the focal dimension are indeed (dis)similar (Kobayashi et al., 2018; e.g., Schweisfurth et al., 2023).
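Computationally, convergent evidence of this kind reduces to correlating expert judgments with NLP-based similarity scores, as in the sketch below; the ratings and scores are hypothetical.

```python
# Sketch: convergent validity evidence. Hypothetical mean expert
# ratings (1-5 scale) and NLP similarity scores for the same text pairs.
import numpy as np
from scipy.stats import pearsonr

expert_ratings = np.array([4.5, 2.0, 3.5, 1.0, 4.0])
nlp_similarity = np.array([0.82, 0.35, 0.61, 0.10, 0.74])

r, pval = pearsonr(expert_ratings, nlp_similarity)
print(f"convergent validity: r={r:.2f} (p={pval:.3f})")
```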
Limitations and Future Work
Our empirical demonstration focused on a single dataset. While that dataset is large, its properties shape how specific similarity measures perform, which matters for interpreting our results. For example, the n-gram results may differ on longer texts, and future work could more thoroughly investigate the reliability and validity of n-gram-based similarity in a variety of texts. More broadly, our evidence suggests that the stability of n-gram similarity is most at risk in corpora with sparse overlap (short texts, heterogeneous corpora, and large vocabularies), where many text pairs share few n-grams and similarity distributions become highly skewed. Future work could test this directly by systematically varying text length, corpus heterogeneity, and vocabulary overlap to map the boundary conditions under which n-gram similarity measures are interchangeable versus when they diverge.
A deeper understanding of whether, when, and how NLP embedding dimensions matter is needed to improve the operationalization of textual similarity. The dimensionality of embeddings can influence cosine similarity results (Elekes et al., 2018), although prior work that varied the dimensionality of custom-trained word2vec embeddings found consistent results (e.g., Lawson et al., 2022). On the one hand, higher dimensionality tends to capture richer word representations (Chiu et al., 2016); on the other hand, effective dimensionality reduction algorithms achieve performance similar to the original word embeddings, suggesting potential redundancy (Raunak et al., 2019). Theoretically, meaningful representations require sufficient, but not necessarily maximal, dimensionality to accommodate the complexity of the underlying semantic relationships (Aceves & Evans, 2024). We found highly consistent results across two document embedding methods with different dimensionalities, but this may not hold in all settings. We also do not adjudicate whether dimensionality interacts with domain shift, corpus size, or construct complexity, questions that matter when researchers move from general benchmarks to specialized organizational corpora.
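Researchers can probe dimensionality effects in their own corpora by reducing embedding dimensionality and checking whether pairwise similarities are preserved. The sketch below uses PCA on simulated embeddings purely as a stand-in; the reduction method, dimensions, and data are assumptions.

```python
# Sketch: do pairwise cosine similarities survive dimensionality
# reduction? Simulated embeddings and PCA are stand-in assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 384))  # stand-in for document embeddings

reduced = PCA(n_components=50).fit_transform(emb)

idx = np.triu_indices(len(emb), k=1)  # indices of unique document pairs
rho, _ = spearmanr(cosine_similarity(emb)[idx],
                   cosine_similarity(reduced)[idx])
print(f"similarity preserved after reduction: rho={rho:.2f}")
```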
There is a lack of consensus on standardized validity assessments for constructs operationalized using textual similarity. While some studies have addressed construct validity concerns in NLP-based research, these efforts have largely focused on validating models or representations rather than extending construct validity frameworks to similarity-based measurement approaches. For example, Short et al. (2009) outlined a general framework for validity testing in computerized text analysis. Aceves and Evans (2024) proposed a set of validation procedures designed to assess the embedding space as a representation of conceptual structure. At the same time, a well-established gold standard, such as expert manual coding (i.e., ground truth; Kobayashi et al., 2018), can be difficult to obtain, as many organizational studies introduce novel constructs that capture nuanced concepts in texts. Our recommendations provide some guidance on how to validate constructs derived from NLP-based similarity measures, but future work could provide further guidance, such as construct-specific validation guidelines.
To spur future organizational research utilizing these methods, Table 9 presents some possible research questions. Some of these research questions have received prior investigation (e.g., Lawson et al., 2022), but many are novel. These suggestions span micro- and meso-level organizational research topics. Some of them require textual datasets that are difficult to acquire, but they could deepen our understanding of important organizational phenomena including organizational socialization, teamwork, leadership, job attitudes, and performance. The main implication of our findings is practical: when the substantive claim hinges on “similarity,” researchers need to show that the claim is not an artifact of one particular NLP-method × similarity-measure combination.
Although not the primary focus of the present article, the choice of similarity measures is also relevant to text clustering. Text clustering has gained increasing attention as a valuable methodological tool in organizational research, particularly for conducting systematic literature reviews (Simonetti et al., 2025) and for construct operationalization, such as identifying patterns of corporate communication through email analysis (Wu & Kane, 2021). Because similarity measures serve as a critical input to clustering algorithms (Jain et al., 1999), their selection can influence the resulting cluster structures. For instance, topic modeling techniques such as LDA and LSA are widely used for text clustering (Schmiedel et al., 2019). However, Niraula et al. (2013) showed that the alignment between LDA-generated topics and expert-labeled categories varied substantially depending on the similarity measure employed. This reinforces our broader point: similarity measures are not interchangeable, and in clustering, they can change the substantive story by changing the structure of the solution.
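This sensitivity can be checked directly by re-running the same clustering algorithm under different metrics and comparing the solutions. The sketch below uses scikit-learn (version 1.2 or later, where the distance parameter is named metric) on simulated embeddings; the data and parameter choices are illustrative.

```python
# Sketch: the same clustering algorithm under different distance
# metrics can yield different solutions. Embeddings are simulated.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
emb = rng.normal(size=(100, 50))  # stand-in for document embeddings

labels_cos = AgglomerativeClustering(
    n_clusters=4, metric="cosine", linkage="average").fit_predict(emb)
labels_euc = AgglomerativeClustering(
    n_clusters=4, metric="euclidean", linkage="average").fit_predict(emb)

# Agreement between the two cluster solutions (1.0 = identical)
print(f"ARI: {adjusted_rand_score(labels_cos, labels_euc):.2f}")
```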
Conclusion
As organizational scholars increasingly use NLP methods to operationalize constructs through textual similarity, it becomes crucial to improve our understanding of the nuances and best practices of these methods. We offer best practice recommendations for employing textual similarity measures in organizational studies based on the findings of our organizational literature review and empirical analyses. Our hope is that these recommendations will support more robust and reliable research that harnesses the significant potential of textual data. Two implications follow directly from our review and results. First, the field needs cleaner construct language: we document a jingle-jangle problem in which the same labels are used for different NLP pipelines, and different labels are used for the same pipeline. Second, researchers should treat similarity as a design choice rather than a plug-in statistic: the NLP representation and the similarity measure jointly define what “similarity” means. Table 8 is meant to make that choice explicit: define the similarity construct, choose the method that matches it, default to measures that behave consistently with that representation, and then present robustness and validity evidence. Doing this reduces avoidable researcher degrees of freedom, limits method-driven conclusions, and makes empirical claims that rely on “similarity” easier to evaluate, replicate, and build on.
Acknowledgments
Not applicable.
Ethical Considerations
No ethical approval was required.
Consent to Participate
Not applicable.
Consent to Publication
Not applicable.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data is not available due to a confidentiality contract with the third-party data provider.
Supplemental Material
Supplemental material for this article is available online.