
Abstract

In this article, we develop a methodological approach for organizational research regarding the construction of multidimensional and relational similarity measures by using the vector space model in natural language processing (NLP). Our vector space approach draws on the well-established premise in organizational research that texts provide a window into social reality and allow measuring theory-based constructs (e.g., organizations’ self-representations). Using a vector space approach allows capturing the multidimensionality of these theory-based constructs and computing relational similarities between organizational entities (e.g., organizations, their members, and subunits) in social spaces and with their environments, such as the organization itself, industries, or countries. Thus, our methodological approach contributes to the recent trend in organizational research to use the potential inherent in big (textual) data by using NLP. In an example, we provide guidance for organizational scholars by illustrating how they can ensure validity when applying our methodological contribution in concrete research practice.

Keywords

organization theory, text analysis, big (textual) data, natural language processing, vector space model, meaning analysis, similarity measure

In this article, we build on the tradition of formal text analysis and develop a methodological approach for constructing multidimensional and relational similarity measures for organizational research. Our approach follows the notion that the investigation of texts provides a window into social reality and is useful for theory building and testing (Duriau et al., 2007; Evans & Aceves, 2016; Mohr et al., 2020; Pollach, 2012).

The backbone of our methodological approach is the vector space model (Salton, 1965; Salton et al., 1975), a well-established natural language processing (NLP) technique (Karlgren & Kanerva, 2021). The basic idea of the vector space model in NLP is that texts can be converted into high-dimensional mathematical representations (e.g., an integrated representation of all words occurring within a text). These high-dimensional mathematical representations in vector spaces enable a formal analysis of texts while still conserving their complex linguistic meaning. In this way, the vector space model in NLP differs from other approaches that process texts as sets of unrelated individual components (e.g., individual words) or classify texts into distinct categories (e.g., Bail, 2016; Fligstein et al., 2017; Goldenstein & Poschmann, 2019a; Jockers & Mimno, 2013; Kaplan & Vakili, 2015; Kobayashi et al., 2018a; Rona-Tas et al., 2019).

Our vector space approach enriches the methodological toolkit of organizational research for the various perspectives that acknowledge the meaningful connection of organizational entities (e.g., organizations, their members, and subunits) through multidimensional relationality within social spaces and with their environments, including the organization itself, industries, markets, or countries (e.g., Cattani et al., 2017; Dhalla & Oliver, 2013; Kennedy, 2008; Kostova et al., 2008; Kristof-Brown et al., 2005; Zietsma et al., 2017). Relationality acknowledges that social meaning depends on similarity patterns between organizational entities and can be revealed through comparing texts as windows into social reality (Mohr et al., 2020). For example, a degree of distinctiveness can be revealed through considering the similarity of organizations’ self-representations within the same market (Goldenstein, Hundolt, et al., 2019; Zuckerman, 2016). Moreover, relationality regularly involves multidimensionality, which implies that similarity patterns manifest in multidimensional meaning profiles (Kozlowski et al., 2019; Mohr, 1998). For example, the similarity of organizations’ self-representations encompasses multiple theory-based dimensions (e.g., all themes referring to corporate identity, products, or employees) along which organizations can communicatively position themselves within markets (Haans, 2019).

In this sense, our vector space approach allows capturing the multidimensionality of theory-based constructs (e.g., self-representations) at the level of organizational entities as units of analysis (e.g., organizations) and the relationality (e.g., distinctiveness) of these entities through computing their degrees of relational similarity. Accordingly, the construction of multidimensional and relational similarity measures in organizational research offers plenty of new opportunities for hypothesis generation and theory testing through an advanced exploitation of the potential of formal text analysis and the vast volumes of texts in and around organizations in the digital age (George et al., 2014, 2016; Kobayashi et al., 2018a).

By developing a vector space approach for constructing multidimensional and relational similarity measures in organizational research, our article makes a twofold contribution: (1) We lay the foundation for a systematic adaptation of the vector space model in NLP to organizational research and we illustrate its general face validity. We further point to the potential of vector spaces to draw on the output of various formal text analysis approaches (e.g., word counting and word embeddings) as input for vectoral representations. (2) We provide detailed guidance for organizational scholars by using an example that illustrates how to ensure theoretical validity when applying our vector space approach in concrete research practice.

In the following sections, we expound our methodological contribution to organizational research. To this end, we first provide an overview of how organizational studies to date have used formal text analysis to study similarities. Then, we introduce the notion of the vector space model in NLP. Next, by developing a vector space approach for organizational research, we demonstrate the operationalization of theory-based constructs and the usage of their theoretical dimensions in vectors. Finally, we construct relational similarity measures and illustrate their application in the field of organizational research.

Methodological Considerations: The Relationality and Multidimensionality of Meaning

Organizational researchers have recently begun to acknowledge that analyzing large (textual) data using formal text analysis can yield valuable theoretical insights into organizational phenomena (Braun et al., 2018; George et al., 2016; Hannigan et al., 2019; Kang et al., 2020; Tonidandel et al., 2018; Wenzel & Van Quaquebeke, 2018). In this context, a common analytic focus of organizational research has been studying similarities between organizational entities and their environments across social contexts or time.

One major research stream using formal text analysis has descriptively studied the similarity of texts based on individual dimensions of various theory-based constructs (Bonilla & Grimmer, 2013; Jockers & Mimno, 2013; Kobayashi et al., 2018b, 2018a; Nelson et al., 2021; Rona-Tas et al., 2019). For example, DiMaggio and colleagues (2013) inductively discovered the differences in framings with which media corporations communicated their orientation toward government assistance for artists and arts organizations. Poschmann and Goldenstein (2022) classified semantic types of named entities (i.e., persons and organizations) with demographic characteristics and compared their occurrence in public discourses over time.

Other applications of formal text analysis have applied structural techniques such as network analysis (Mohr et al., 2013; Oberg et al., 2017), multidimensional scaling (Hannigan & Casasnovas, 2021), and cluster analysis (Goldenstein & Poschmann, 2019a), focusing on drawing descriptive two-dimensional maps of the co-occurrence of selected dimensions of theory-based constructs and using these maps for further qualitative inspection. For example, Mohr et al. (2013) constructed a visual illustration of the co-occurrences of various named entities and verb phrases to map the association of actors and actions in the U.S. National Security Strategy reports of various U.S. presidents. Hannigan and Casasnovas (2021) constructed a map with the co-occurrence of themes and actors in public discourse. Rule et al. (2015) analyzed the U.S. State of the Union addresses and mapped the co-occurrence of nominal phrases to represent how the similarity of using these terms and categories in discourse has changed with the national and international political situation. Finally, Goldenstein and Poschmann (2019a) performed a combined analysis of syntactic structures and themes by mapping the similarity of the thematic embedding of various subject–verb–object combinations in the context of corporations’ responsibility in public discourse over time.

All the approaches outlined thus far are well suited to detect the extent to which (selected) dimensions of theory-based constructs occur in texts, to map their (co-)occurrences, and to compare the similarity of these co-occurrences across contexts or time. Therefore, as illustrated, organizational scholars use valuable text analytic tools in descriptive studies or for research that aims at detecting similarity patterns in large amounts of text for further in-depth analysis.

Another goal of formal text analysis in combination with large amounts of text is the quantitative analysis of meaning to generate hypotheses and test theories (Goldenstein & Poschmann, 2019b; Kobayashi et al., 2018a; Short et al., 2010). For example, Kobayashi et al. (2018a, p. 792) stated that, “if the objective is to explain how patterns lead to categorization or how structure and form lead to meaning, then an explanatory model is required.” We argue that organizational researchers can use the potential of formal text analysis to capture similarities quantitatively based on NLP-inspired relational similarity measures. These measures follow the established methodological insight that social meaning is multidimensional and built from similarity patterns within and across social spaces (Kogut & Singh, 1988; Mohr, 1998; Scott & Davis, 2007).

Considering similarity patterns is important theoretically and methodologically. First, in line with organizational research, studying dynamics within and across social spaces builds heavily on the assumption—often only implicitly articulated—that social meaning is established based on the relationships between organizational entities and their environments (e.g., Cattani et al., 2017; Kostova et al., 2008; Kristof-Brown et al., 2005; Zietsma et al., 2017). In this context, the examination of theory-based constructs is only meaningful when considering their (co-)occurrence in relation to other organizational entities and their environments (Mohr et al., 2020). For example, the finding that organizations address themes in their self-representation at a certain percentage only becomes significant when compared with relevant public discourses or with other organizations.

Second, methodological insights emphasize that social meaning is constructed compositionally, that is, via a relationality among organizational entities (e.g., organizations and their members and subunits) and their environments and various theory-based dimensions that constitute multidimensional profiles of meaning (Kozlowski et al., 2019; Mohr, 2005). Only a few articles relevant to organizational research have included the construction of relational similarity measures based on texts. Haans (2019) investigated the market entry of new ventures and captured their self-representation by incorporating all themes that new ventures used to describe themselves when entering the market (for a similar approach on the product level, see Barlow et al., 2019). Goldenstein, Poschmann, et al. (2019) analyzed similarities in how organizations from different countries used words grammatically associated with the term “responsibility.” Finally, Kozlowski et al. (2019) captured the latent meaning facets of social class characteristics (e.g., wealth and gender). The authors then used the totality of these meaning facets to calculate their similarity with latent meaning facets of cultural products, such as types of sports.

Despite these early attempts, there is a lack of a coherent and generic methodological approach that enables the broad consideration of relational similarity measures in organizational research. In the following sections, we demonstrate how the vector space model from NLP lays the formal foundation for a vector space approach in organizational research that allows capturing the multidimensionality and relationality of social meaning. We provide detailed guidance on how organizational scholars can validly operationalize the multidimensionality and relationality of theory-based constructs and create vector spaces, which in turn, constitute the basis for computing relational similarity measures.

Constructing Relational and Multidimensional Similarity Measures

The Vector Space Model in NLP

The vector space model in NLP was invented as a means for the quantitative representation of texts in a high-dimensional mathematical space and dates back to Gerard Salton's seminal work on information retrieval (Salton, 1965; Salton et al., 1975). The author proposed mathematically representing a given set of text documents as n-dimensional vectors, with n being the number of terms in the documents. The value of each vector component represents the frequency of the respective term. The idea of vectors advanced long-standing traditions in NLP at that time and has become an important methodological step toward quantifying linguistic meaning (Karlgren & Kanerva, 2021; Turney & Pantel, 2010).

In detail, the central intuition behind the vector space model in NLP is to bundle the complexity of linguistic meaning as components within high-dimensional vectors. In the vector space model's basic form, linguistic meaning consists of dimensions representing all the terms in a text corpus. Formally, a vector $\vec{x} = (x_{1}, x_{2}, \dots, x_{n})$ is a real-numbered sequence of components with the dimensionality $n$. In this context, $x_{i}$ represents the $i$-th component's value in the vector.

Further, in this manuscript, we use example sentences to provide an intuitive sense of how mathematical representations of texts capture linguistic meaning. To make this illustration as accessible as possible, we elaborate on the most basic form of the vector space model, namely the representation of a text corpus as a document–term matrix. Such a matrix has a size of m × n, where m denotes the number of documents in the text corpus (i.e., the number of vectors), and n denotes the number of distinct terms occurring in the entire corpus (i.e., the number of dimensions). As a typical representation, the terms in such a matrix are represented by their (relative) frequency of occurrence.
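The structure of such a matrix can be written out as a short formula. This is only a sketch: the symbols $X$ and $f_{ij}$ are introduced here for illustration and do not appear in the original text, and dividing by document length is assumed as one common way to obtain relative frequencies.

$$
X = (x_{ij}) \in \mathbb{R}^{m \times n}, \qquad
x_{ij} = f_{ij} \quad \text{(absolute frequency)} \qquad \text{or} \qquad
x_{ij} = \frac{f_{ij}}{\sum_{k=1}^{n} f_{ik}} \quad \text{(relative frequency)},
$$

where $f_{ij}$ is the number of times term $j$ occurs in document $i$. Each row of $X$ is then the vector $\vec{x}$ of one document.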

We assume that the following sentences can be found in four text documents. In line with the preprocessing conventions in NLP (Manning et al., 2008; Manning & Schütze, 2000), we also assume that only the bold-marked words (i.e., adjectives, nouns, and verbs) are used for the document–term matrix:

Document d₁: “The company has a corporate responsibility for its employees and their training.”

Document d₂: “The corporate responsibility of our company is to target sustainability.”

Document d₃: “The management has the experience for a proper performance evaluation.”

Document d₄: “The management properly evaluates the performance of employees.”

In Table 1, we depict the document–term matrix for this basic example with four documents. The values in the vectors indicate the terms’ frequency per document. This basic example illustrates that documents in a vector space are represented by multiple vector dimensions at once instead of being classified into single distinct categories, such as a document being about responsibility (i.e., d₁ and d₂) or performance (i.e., d₃ and d₄). Although this example only includes a few dimensions, vectors can be scaled to any length, thus capturing higher dimensionality (e.g., more terms).

Table 1.
Vector Space Resulting From the Example Sentences in the Basic Application (Shaded Cells Mark the Use of Terms across Documents).

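| Terms | Document d₁ | Document d₂ | Document d₃ | Document d₄ |
|---|---|---|---|---|
| company | 1 | 1 | 0 | 0 |
| corporate | 1 | 1 | 0 | 0 |
| responsibility | 1 | 1 | 0 | 0 |
| employees | 1 | 0 | 0 | 1 |
| training | 1 | 0 | 0 | 0 |
| target | 0 | 1 | 0 | 0 |
| sustainability | 0 | 1 | 0 | 0 |
| management | 0 | 0 | 1 | 1 |
| experience | 0 | 0 | 1 | 0 |
| proper | 0 | 0 | 1 | 0 |
| performance | 0 | 0 | 1 | 1 |
| evaluation | 0 | 0 | 1 | 0 |
| properly | 0 | 0 | 0 | 1 |
| evaluates | 0 | 0 | 0 | 1 |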
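The construction of Table 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the token lists are copied by hand from the terms in Table 1 instead of being produced by a part-of-speech tagger, and the names `documents`, `vocabulary`, `term_vector`, and `dtm` are chosen here for illustration only.

```python
from collections import Counter

# Content words per document. ASSUMPTION: the lists are copied by hand
# from the terms in Table 1, not produced by an NLP preprocessing step.
documents = {
    "d1": ["company", "corporate", "responsibility", "employees", "training"],
    "d2": ["corporate", "responsibility", "company", "target", "sustainability"],
    "d3": ["management", "experience", "proper", "performance", "evaluation"],
    "d4": ["management", "properly", "evaluates", "performance", "employees"],
}

# Vocabulary in order of first appearance across the corpus (n terms).
vocabulary = []
for tokens in documents.values():
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)


def term_vector(tokens, vocabulary):
    """Return absolute term frequencies, one entry per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# One n-dimensional vector per document (m vectors in total).
dtm = {name: term_vector(tokens, vocabulary) for name, tokens in documents.items()}

# Print the matrix with terms as rows and documents as columns, as in Table 1.
for term_index, term in enumerate(vocabulary):
    row = [dtm[name][term_index] for name in documents]
    print(f"{term:<15}", *row)
```

Each document ends up as one 14-dimensional vector, so the corpus is described by all terms at once rather than by a single category label.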

In Figure 1, we visualize the four documents (i.e., d₁, d₂, d₃, and d₄) as vectors. Following the intuitive spatial interpretation in Figure 1, vectors capture relational similarity because they can be conceived as leading to specific points in a geometric space, with the values of the vector components determining their positions (Salton et al., 1975). Two vectors with similar values will be positioned closely to each other in the geometric space. That is, in the most basic form, the more similar the occurrence of terms in texts, the higher the proximity between the corresponding vectors in the geometric space. From an NLP perspective, the geometric proximity between vectors indicates a similarity of the linguistic meaning of the documents they represent (Manning & Schütze, 2000; Schütze, 1998).

Figure 1.
An illustration of the geometric interpretation of the vector space model in natural language processing (NLP) based on a simple model consisting of four documents. For visualization purposes, the 14 dimensions derived from the sentences were reduced to two dimensions using principal component analysis.

Figure 1 illustrates this geometric interpretation—the vector of document d₁ is (much) closer to the vector of document d₂ than to the vectors of documents d₃ and d₄. Similarly, the vector of document d₃ is (much) closer to the vector of document d₄ than to the vectors of documents d₁ and d₂. In general, documents are higher in similarity if the vector dimensions are characterized by the same values. Accordingly, vectors’ geometric proximity points to the degree of documents’ linguistic similarity, and thus, captures their relational similarity in this example.

The geometric proximity capturing the linguistic relational similarity of documents can also be quantified by using similarity metrics (see Table 2). Note that we show similarity values for illustrative purposes, and we will discuss the application of similarity metrics for organizational research in detail in the next sections. The similarity values in this basic application are produced by the cosine similarity and the inverted Euclidean distance (i.e., Euclidean similarity) because these similarity metrics have become the standard in NLP (Cassisi et al., 2012; Ljubesic et al., 2008; Manning & Schütze, 2000).
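For reference, the two metrics can be written as follows. The cosine similarity and the Euclidean distance are given in their standard forms. The specific inversion $1/(1+d)$ is an assumption made here, since the text only speaks of an "inverted Euclidean distance"; it is one common choice.

$$
\cos(\vec{x}, \vec{y}) = \frac{\sum_{i=1}^{n} x_{i} y_{i}}{\sqrt{\sum_{i=1}^{n} x_{i}^{2}} \, \sqrt{\sum_{i=1}^{n} y_{i}^{2}}}, \qquad
d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{n} (x_{i} - y_{i})^{2}}, \qquad
\operatorname{sim}_{E}(\vec{x}, \vec{y}) = \frac{1}{1 + d(\vec{x}, \vec{y})}.
$$

As a worked example based on Table 1, documents d₁ and d₂ each contain five terms and share three of them, so $\cos = 3 / (\sqrt{5} \cdot \sqrt{5}) = 0.60$. They differ in four components, so $d = \sqrt{4} = 2$ and, under the assumed inversion, $\operatorname{sim}_{E} = 1/3 \approx 0.33$.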

Table 2.
Relational Similarity Measures Based on the Basic Application (Cosine/Euclidean; Shaded Cells Mark the Ranking of the Similarity Values).

| | Document d₁ | Document d₂ | Document d₃ | Document d₄ |
|---|---|---|---|---|
| Document d₁ | 1.00 / 1.00 | | | |
| Document d₂ | 0.60 / 0.33 | 1.00 / 1.00 | | |
| Document d₃ | 0.00 / 0.24 | 0.00 / 0.24 | 1.00 / 1.00 | |
| Document d₄ | 0.20 / 0.26 | 0.00 / 0.24 | 0.40 / 0.29 | 1.00 / 1.00 |
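The values in Table 2 can be recomputed from the vectors in Table 1 with a short sketch. Two points are assumptions: the Euclidean distance is inverted as 1 / (1 + distance), which the text does not spell out, and the function names are chosen here for illustration.

```python
import math

# Term-frequency vectors copied from Table 1 (14 dimensions each).
vectors = {
    "d1": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "d2": [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "d3": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    "d4": [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
}


def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def euclidean_similarity(x, y):
    """Inverted Euclidean distance.

    ASSUMPTION: the inversion 1 / (1 + distance) is one common choice;
    the article does not state the exact formula.
    """
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + distance)


# Print the lower triangle in the "cosine / Euclidean" layout of Table 2.
names = list(vectors)
for i, row in enumerate(names):
    cells = [
        f"{cosine_similarity(vectors[row], vectors[col]):.2f} / "
        f"{euclidean_similarity(vectors[row], vectors[col]):.2f}"
        for col in names[: i + 1]
    ]
    print(row, *cells, sep="   ")
```

Under the assumed inversion, the rounded results agree with the entries of Table 2, which suggests, without confirming, that this was the formula used.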

The similarity values reported in Table 2 correspond to the visualized proximity of vectors in Figure 1. In this context, it is important to emphasize that the sizes of similarity values depend on the dimensionality of the vector space (i.e., number of vector components), the value range of the vector dimensions (e.g., the use of absolute or relative frequencies), and the similarity metric applied (i.e., cosine or Euclidean). This also implies that the sizes of similarity values are not readily interpretable and comparable across vector spaces that have been constructed differently. Instead, NLP uses the ranking of similarity values when interpreting and comparing vector spaces (Manning & Schütze, 2000). For instance, our example reveals that the documents that human readers intuitively would consider similar (e.g., d₁ and d₂) also receive the highest similarity values in vector spaces. In other words, even if the absolute sizes of similarity values differ, the ranking of the similarity values in our example remains the same, regardless of the similarity metric applied.
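The claim about rankings can be checked directly for the four example documents. The sketch below ranks all six document pairs under both metrics. The inversion 1 / (1 + distance) and all function names are assumptions made for illustration.

```python
import math
from itertools import combinations

# Term-frequency vectors copied from Table 1.
vectors = {
    "d1": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "d2": [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "d3": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    "d4": [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
}


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def inverted_euclidean(x, y):
    # ASSUMPTION: inversion as 1 / (1 + distance); not stated in the article.
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))


pairs = list(combinations(vectors, 2))  # all six unordered document pairs


def ranking(metric):
    """Pairs sorted from most to least similar.

    sorted() is stable, so pairs with tied values keep their input order.
    """
    return sorted(pairs, key=lambda p: -round(metric(vectors[p[0]], vectors[p[1]]), 6))


cosine_ranking = ranking(cosine)
euclidean_ranking = ranking(inverted_euclidean)

for rank, pair in enumerate(cosine_ranking, start=1):
    print(rank, pair)
print("Same ranking under both metrics:", cosine_ranking == euclidean_ranking)
```

The agreement in this example is not guaranteed in general. All four vectors here have the same length, which makes the Euclidean distance a monotonic function of the cosine similarity; for vectors of differing lengths the two rankings can diverge.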

In this section, we have provided an intuitive introduction to the vector space model in NLP and its most basic form (i.e., document–term matrix). However, note that (1) the vector space model can be used as the methodological backbone for various NLP applications, and thus, is by no means limited to use with terms as vector dimensions (Karlgren & Kanerva, 2021; Turney & Pantel, 2010). For example, texts can be compared based on their thematic similarity (what content do texts convey?), stylistic similarity (how is content arranged?), the similarity of the words (what terms are used?), or the similarity of word semantics (what latent semantics do terms convey?). In our basic application, we focused on comparing the occurrence of the words used in the four documents, making counting the terms a reasonable approach. However, if the goal were to compare the latent semantics of the words used in the example sentences, applying word embeddings would be more appropriate (for an application of word embeddings to this basic application, see Appendix C). We will illustrate possible ways to construct vectors and vector spaces in the following sections. (2) Furthermore, up to this point, we have illustrated how the vector space model is regularly applied in the NLP context to quantify linguistic meaning. To adapt the vector space model to the requirements of organizational research, we will provide guidance on how to ensure validity when applying vector spaces to calculate relational similarity measures to investigate organizational phenomena.

Outline of a Vector Space Approach for Organizational Research

Our methodological goal is to develop a vector space approach for constructing similarity measures that account for the multidimensionality of theory-based constructs and the relationality of organizational entities (e.g., organizations, their members, and subunits). To this end, we provide guidance on how organizational scholars can ensure validity when constructing relational similarity measures based on vectoral representations. Our validity guide is organized around (1) the definition of the dimensions of theory-based constructs and their linguistic reflection, (2) the selection of approaches of formal text analysis for operationalizing the multidimensionality of theory-based constructs, (3) the selection of texts and their processing, and (4) the theoretical definition of relationality and the selection of similarity metrics (see Table 3).

Table 3.
Considerations for the Valid Construction of Relational Similarity Measures.

| Aspects of validity | Considerations |
|---|---|
| 1. Definition of the theory-based construct and its dimensionality | Ground the definition of a theory-based construct in extant organizational research |
| | Identify the dimensions of the theory-based construct |
| | – Deductive identification of dimensions |
| | – Inductive identification of dimensions |
| | Consider the linguistic reflection of the theory-based construct in text documents |
| 2. Selecting an appropriate formal text analysis approach | Consider the fit of approaches of formal text analysis with the theory-based construct and its linguistic reflection |
| | Note the linguistic unit (i.e., document, sentence, and word) the formal text analysis approach annotates |
| 3. Selecting and processing text documents | Select relevant text documents representing the unit of analysis |
| | – Consider the preprocessing of text documents |
| | – Consider the aggregation of the output of formal text analysis |
| 4. Relationality and calculating relational similarity measures | Ground the definition of relationality in extant organizational research |
| | Consider the theoretical appropriateness of vector comparisons |
| | Select the appropriate similarity metric |

We illustrate the different validity aspects of our vector space approach using a running example. We use the case of a cross-sectional study and investigate the degree of similarity (i.e., shared understanding) with which organizations from the same industries ascribe meaning to an organizational issue (Loewenstein et al., 2012). Organizational issues are developments that emerge exogenously to industries (Bansal et al., 2018) and can have important consequences for organizational life (Hoffman, 1999). In this regard, we focus on the latent semantics of the vocabulary (i.e., a set of words) found in organizational statements on an issue. In our running example, we focus on the issue of organizations’ responsibilities, which modern societies and organizations around the globe continuously debate (Aguinis & Glavas, 2012; Bondy et al., 2012; Chen & Bouvain, 2009; Lim & Tsutsui, 2012; Pope, 2015; Pope & Lim, 2020; Wickert et al., 2016). This running example builds on the open-system perspective in organizational research because the theoretical premises of this perspective are highly consistent with our methodological approach (for an overview, see Scott & Davis, 2007). Organization theories within the open-system perspective are based on conceptualizations of organizational entities as complex, loosely coupled systems with porous boundaries that are heavily constructed and constrained through their multidimensional relationality within social spaces and their environments (e.g., organizations, industries, markets, or countries; Cattani et al., 2017; DiMaggio & Powell, 1983; Zietsma et al., 2017; Zuckerman, 1999).

Considerations for Constructing Relational Similarity Measures

Definition of the Theory-Based Construct and Its Dimensionality

The first consideration in validly constructing relational similarity measures is to define the theory-based construct of interest (e.g., the meaning of an organizational issue). A theory-based definition informs the construction of vectors directly because it specifies the nature of theoretical dimensions of the construct and guides the selection of a formal text analysis approach for its operationalization. In this sense, the definition of theory-based constructs and the selection of a formal text analysis approach resemble the procedure of common quantitative empirical research, namely that variables should validly measure the theoretical construct of interest (Klein & Kozlowski, 2000).

Organizational scholars can define the dimensions of theory-based constructs in two ways. One is to derive deductively the theoretical dimensions of a theory-based construct from extant organizational research (McKenny et al., 2012; Short et al., 2010). For example, the literature defines organizational culture as consisting of six dimensions, with each dimension having its own sub-definition (Pandey & Pandey, 2019). The second way is to derive dimensions inductively. For instance, Haans (2019) built on a multidimensional definition of organizational self-representation. However, because organizations vary in how they present themselves to audiences, there is no a priori knowledge about the concrete dimensions that represent organizational self-representations in the empirical case at hand. Accordingly, the author captured organizational self-representations through the totality of various themes that the studied organizations communicated publicly to audiences.

In this context, a critical aspect for organizational scholars is to clarify the way a theory-based construct of interest is linguistically reflected in texts. This aspect is important to justify selecting a formal text analysis approach. For example, organizational self-representations can be operationalized with the topics (i.e., themes) that organizations communicate publicly about themselves (Haans, 2019). In contrast, theory-based constructs, such as psychological capital or organizational culture, tend to be reflected linguistically in the words used in communication (McKenny et al., 2012; Pandey & Pandey, 2019).

Example: In our running example, the theory-based construct under consideration is the meaning assigned to an organizational issue, referring to the contentual understanding of an issue. The dimensions of our theory-based construct are the different facets of meaning that together constitute an issue's contentual understanding. As organizational issues are ambiguous and there is no a priori knowledge about the concrete facets of meaning, we derive the dimensions inductively.

We follow previous research and assume that the facets that give rise to the meaning of a concept are reflected linguistically in the latent semantics of the words (i.e., vocabulary) that organizations use to describe the issue's various aspects (Loewenstein et al., 2012; Nelson, 2021). Accordingly, the vector space in this running example consists of vectors representing the meaning that organizations assign to an organizational issue, with vector dimensions representing the different facets of meaning conveyed by the vocabulary that organizations use.

Selecting Appropriate Formal Text Analysis Approaches

Due to the general nature of vectors, various formal text analysis approaches can be applied to construct vector spaces. In selecting approaches of formal text analysis, organizational scholars need to ensure the fit between the use case of the formal approach, the definition of the theory-based construct, and its linguistic reflection in texts. In this section, we refer to some formal text analysis approaches (word counting, topic modeling, grammatical parsing, and word embeddings) to illustrate how they have been applied to operationalize theory-based constructs in organizational contexts. We draw on only a selection of formal text analysis approaches because a comprehensive overview is beyond our article's scope (for a broader overview, see classical textbooks, such as Baeza-Yates & Ribeiro, 2004; Eisenstein, 2019; Jurafsky & Martin, 2009; Manning et al., 2008; Manning & Schütze, 2000; Zhang & Teng, 2021).¹

Inductive word counting has been applied in organizational research, for example, to operationalize management's focus of attention when making sense of poor business performances (Pollach, 2012). In contrast, a deductive application of word counting is the measurement of constructs such as psychological capital by counting the frequency of word collections (i.e., dictionaries; McKenny et al., 2012). With word counting, organizational scholars can use (collections of) terms as the dimensions of vectors and the (relative) term frequencies as the values of these dimensions (forming a document–term matrix).
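A deductive, dictionary-based variant of word counting can be sketched as follows. The two dictionaries and their word lists are purely hypothetical and are not validated instruments from the cited studies. Dividing by document length is assumed as the way to obtain relative frequencies.

```python
from collections import Counter

# HYPOTHETICAL dictionaries, invented for illustration only; real studies
# would use validated word collections for the construct of interest.
dictionaries = {
    "responsibility": {"responsibility", "sustainability", "training"},
    "performance": {"performance", "evaluation", "evaluates"},
}


def dictionary_vector(tokens, dictionaries):
    """Relative frequency of each dictionary's words in one document.

    The result has one dimension per dictionary, so several dictionaries
    together describe a document along several dimensions at once.
    """
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return [0.0 for _ in dictionaries]
    return [sum(counts[word] for word in words) / total
            for words in dictionaries.values()]


# Tokens of documents d1 and d4 from the basic example.
tokens_d1 = ["company", "corporate", "responsibility", "employees", "training"]
tokens_d4 = ["management", "properly", "evaluates", "performance", "employees"]

print("d1:", dictionary_vector(tokens_d1, dictionaries))
print("d4:", dictionary_vector(tokens_d4, dictionaries))
```

Stacking such vectors for all documents yields a matrix with documents as rows and dictionaries as columns, which can be compared with the same similarity metrics as a document–term matrix.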

Topic modeling offers a sophisticated opportunity to describe texts through word clusters (for overviews, see Boyd-Graber et al., 2017; Dieng et al., 2020; Mohr & Bogdanov, 2013; Vayanskya & Kumarb, 2020). Topic modeling has been used in organizational research, for example, to operationalize organizations’ self-representation inductively by capturing themes organizations present to their audiences (Haans, 2019). In other examples, topic modeling has been used to measure the prevalence of predefined communicative frames deductively (Fligstein et al., 2017; Kaplan & Vakili, 2015). Using topic modeling, the dimensions of vectors relate to topics and the values indicate the relative prevalence of topics in documents (forming a document–topic matrix).

Other approaches of formal text analysis such as grammatical parsing focus on the structure of texts. Grammatical parsing classifies words according to their grammatical function in sentences (Ágel & Fischer, 2015; Kübler et al., 2009). For example, organizational researchers have applied grammatical parsing to capture actor–action–object triplets inductively in public discourse (Goldenstein & Poschmann, 2019a; Sudhahar et al., 2013) or to capture words at grammatical positions deductively that linguistically reflect organizations’ actions on others’ behalf (Goldenstein, Poschmann, et al., 2019). Using grammatical parsing, organizational scholars can select words based on their grammatical roles as the dimensions of vectors and their (relative) frequencies as values of these dimensions.

As a final example, we refer to word embeddings (for an overview, see Wang et al., 2020). Word embeddings draw the latent semantics of individual words from patterns of word co-occurrences in plain text corpora (for a more detailed description, see Kiela et al., 2015). As a result, word embeddings represent latent word semantics as high-dimensional vectors of real numbers (embedding vectors). Put simply, the dimensions of embedding vectors mathematically represent the latent dimensions of words’ semantics. Even if the representation of word semantics by real numbers may appear uncommon, embedding vectors make sense computationally because they capture the similarity of word semantics astonishingly well (Bojanowski et al., 2017; Pennington et al., 2014). For example, words with similar semantics (e.g., “cat” and “tiger”) are represented by more similar real numbers in their embedding vectors than words with dissimilar semantics (e.g., “cat” and “oxygen”). Word embeddings have already been applied to study divergent latent meanings associated with cultural categories inductively and deductively, such as “institutions” or “socio-economic situation.” These studies have applied word embeddings because the researchers aimed to capture social meaning via words’ latent semantics (Kozlowski et al., 2019; Nelson, 2021; Stoltz & Taylor, 2021). Organizational scholars can use the dimensions of word embeddings as dimensions of vectors and the real numbers of these word embeddings as the values of these dimensions.

At the end of this section, we would like to encourage organizational scholars to explore approaches other than those mentioned here. In particular, formal text analysis approaches such as named entity recognition (Li et al., 2020), part-of-speech tagging (Schmid, 2008), or sentiment and emotion analysis (Alswaidan & Menai, 2020) may be suitable for interesting vector space applications (see Appendices A and B).

Example: To operationalize the meaning assigned to an organizational issue, we draw on the latent semantics of the vocabulary used to describe it. Accordingly, we selected a word embedding algorithm. In our running example, we applied the word embedding algorithm by fastText² (Bojanowski et al., 2017), because it comes with a pre-trained model that builds on the widely used Common Crawl data set. This pre-trained model is trained on a large text corpus covering a broad array of text sources and has repeatedly been applied successfully in contexts comparable to our running example (Bojanowski et al., 2017; Büchel et al., 2018; Sedoc et al., 2020), namely to words occurring within similar textual contexts (e.g., statements from organizations about their responsibilities).³ The basic assumption here is that the semantics of individual words is relatively stable within similar textual contexts, which makes the application of one pre-trained word embedding model to all documents in a text corpus suitable (Stoltz & Taylor, 2021).⁴

Selecting and Processing Text Documents

A key advantage of formal text analysis and vector spaces is their ability to analyze large collections of texts that reflect the unit of analysis directly (for example, see McKenny et al., 2012; Short et al., 2010). The unit of analysis (usually the organization, its members, and its subunits) can be specified based on the definition of the theory-based construct under consideration. Then, organizational scholars can select an appropriate text collection reflecting this specific unit of analysis (Braun et al., 2018; Krippendorff, 2004).⁵ For example, if a study is focused on communication from specific organizational subunits, the units of analysis are the subunits of organizations and will be represented by a specific text corpus, such as newsletters these subunits published. In contrast, a research focus on the overarching self-representation of organizations and their positioning toward the wider public might change the text selection profoundly, such that the organization as a unit of analysis will be represented by its communication in various texts at the organizational level (e.g., annual reports and corporate websites).

Example: For our running example, we used the Jena Organization Corpus (JOCo), a freely downloadable data source.⁶ JOCo comprises the English annual reports of 270 corporations from Germany, the United Kingdom, and the United States covering the time span from 2000 to 2015. It contains the reports of the 30 most intensively traded and most highly valued corporations from each stock index: the DAX, MDAX, and TecDAX for Germany, the FTSE, FTSE AIM 100, and FTSE 250 for the United Kingdom, and the Dow Jones, S&P 500, and NASDAQ 100 for the United States (for a more detailed description of JOCo, see Händschke et al., 2018). The corporations are assigned to industries based on the first two digits of the SIC codes of their major segment as reported in the Orbis database by Bureau van Dijk⁷: (1) financial services and real estate, (2) construction, (3) mining, (4) services, (5) manufacturing, (6) trade, and (7) public utilities. These corporations’ text data encompassed 3,102 annual reports. We selected annual reports because they represent a common way of studying phenomena at the organizational level (Pollach, 2012; Short et al., 2010), such as how organizations interpret issues. Due to our focus on responsibility, we only consider sentences in the annual reports that contain the key word “responsibility.”⁸

After selecting an appropriate text collection representing the unit of analysis, formal text analysis can be applied. The output of formal text analysis approaches usually needs to be aggregated by moving from “lower” to “higher” linguistic units (e.g., by applying basic statistical operations such as counting and subsequent averaging), which is a common task in formal text analysis (e.g., McKenny et al., 2012; Pandey & Pandey, 2019).⁹ However, organizational scholars must ensure that the aggregation of the output of formal text analysis approaches validly captures a theory-based construct. Thus, selecting an approach of formal text analysis and aggregating its output resembles the common quantitative empirical research procedure, namely that variables should validly operationalize the theoretical construct of interest. For example, counting word collections can be used to measure the prevalence of theory-based constructs, such as psychological capital. The prevalence hereby is operationalized by summing the frequency with which words in the word collections appear in organizational text documents (McKenny et al., 2012).¹⁰

Example: In our running example, we capture the latent semantics of the words that organizations use. To this end, we applied fastText word embeddings (Bojanowski et al., 2017) to the sentences in the annual reports in the JOCo that contain the key word “responsibility.” FastText captures latent word semantics as vectors of real numbers (i.e., embedding vectors) with 300 dimensions. Since word embeddings represent words’ latent semantics, and organizations are likely to use multiple sentences in annual reports to describe the issue of responsibility, we needed to aggregate the output of fastText to the document level first. To avoid complexity, we limited the preprocessing steps to a minimum. We removed function words (i.e., articles, conjunctions, prepositions, and pronouns) because these words do not add content to sentences (Hickman et al., 2022).

We followed previous research on organizational contexts that has suggested combining the embedding vectors of words occurring in the same text (e.g., sentence, paragraph, and document) into a representative average embedding vector (Arseniev-Koehler, 2022; Stoltz & Taylor, 2021). For example, Nanni and Fallin (2021) summed the embedding vectors of words per abstract in scientific articles (i.e., one document in this research endeavor) to capture the latent semantics of the vocabulary used within this text section. Accordingly, we aggregated the embedding vectors of words from all sentences containing the key word “responsibility” in the same annual report to an average embedding vector (forming the document–embedding matrix). That is, our application of fastText word embeddings reveals the latent semantics of the vocabulary that organizations use in one annual report to describe the issue of responsibility (for an intuitive illustration of the application of word embeddings, see Appendix C).
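The aggregation of word-level embedding vectors into one average vector per document can be sketched as follows. The three-dimensional vectors in `toy_embeddings` are invented for illustration (fastText vectors have 300 dimensions), the function name `average_embedding` is hypothetical, and skipping tokens without an entry in the toy lookup table is a simplifying assumption of this sketch, not a description of how fastText treats unknown words.

```python
# Toy 3-dimensional "embeddings"; the numbers are invented for illustration.
toy_embeddings = {
    "responsibility": [0.2, 0.4, 0.0],
    "employees": [0.6, 0.0, 0.2],
    "training": [0.4, 0.2, 0.4],
}


def average_embedding(tokens, embeddings):
    """Average the embedding vectors of all tokens found in the lookup table."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        raise ValueError("none of the tokens has an embedding vector")
    dimensions = len(vectors[0])
    # Average dimension by dimension across all word vectors.
    return [
        sum(vector[i] for vector in vectors) / len(vectors)
        for i in range(dimensions)
    ]


# "the" has no entry in the toy table and is therefore ignored.
sentence_tokens = ["responsibility", "employees", "training", "the"]
document_vector = average_embedding(sentence_tokens, toy_embeddings)
```

Stacking one such `document_vector` per annual report would give the rows of a document–embedding matrix.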

When more than one text document represents a unit of analysis, the same aggregation process can be applied. By aggregating information derived from individual documents of one organization (e.g., annual reports), a collection of texts (e.g., all annual reports) can also represent the organization as a unit of analysis. However, aggregating information across individual text documents is only useful when organizational scholars are not interested in comparing these documents. To illustrate aggregation across individual text documents, consider a case in which words are counted to investigate the similarity of the vocabulary that organizations use in six documents (e.g., annual reports). Accordingly, the procedure results in six vectors representing the six documents (forming the document–term matrix). However, to construct vectors at the organizational level (modeled as an organization–term matrix), positioning individual documents in vector space is not sufficient. Instead, computing the average of the values of the dimensions of the document vectors for each organization produces vectors that reflect the organizational level (see org₁ and org₂ in Figure 2).

Figure 2.
An illustration of the aggregation of single document vectors to an organization vector in a simple vector space consisting of six documents and two words as dimensions. Gray arrows represent the document vectors; the black dotted arrows represent the organization vectors.
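The step from document vectors to organization vectors can be sketched in a few lines. The word counts below are hypothetical and do not reproduce the values behind Figure 2; `mean_vector` is an illustrative helper name.

```python
def mean_vector(vectors):
    """Average a list of equally long vectors dimension by dimension."""
    count = len(vectors)
    return [sum(vector[i] for vector in vectors) / count
            for i in range(len(vectors[0]))]


# Hypothetical counts of two words in six documents (three per organization).
document_vectors = {
    "org1": [[4, 1], [6, 2], [5, 3]],
    "org2": [[1, 4], [2, 6], [3, 5]],
}

# One averaged vector per organization (rows of an organization-term matrix).
organization_vectors = {
    org: mean_vector(vectors) for org, vectors in document_vectors.items()
}
```

Averaging rather than summing keeps organizations with different numbers of documents on a comparable scale, which is one reason the mean is a natural choice here.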

Example: In our running example, we are interested in constructing a vector space that allows comparing organizations cross-sectionally (modeled as the organization–embedding matrix). Therefore, we calculated the vector of an organization as the aggregation of all document vectors (i.e., all sentences from annual reports containing the key word “responsibility”). The dimensionality of the resulting vector space is defined by the 300 embedding dimensions fastText provides. Accordingly, we computed the average of the values of the document vectors’ dimensions for each organization to produce vectors at the organizational level. The averages of the real numbers of the embedding vectors represent the values for the dimensions of the organizational vectors. Conceptually, the average of embedding vectors for all documents allows a meaningful representation of the meaning that organizations ascribe to the issue of responsibility, with vector dimensions representing the different meaning facets that the vocabulary used in the annual reports provides.

Relationality and Calculating Relational Similarity Measures

A precondition for the geometric interpretation of vectors is a specification of what the relationality between vectors means theoretically. This specification is connected to the theory-based construct under consideration. For example, if subunits communicate their organizational culture in similar ways, relationality can capture the coherence of values within an organization (Kristof-Brown et al., 2005; Pandey & Pandey, 2019). If organizations present the features of their products dissimilarly, relationality can capture their degree of distinctiveness within a market (Goldenstein, Hunoldt, et al., 2019; Zuckerman, 2016). After specifying the theoretical meaning of relationality, organizational scholars can utilize the geometric interpretation of vectors to calculate the relational similarity between organizational entities or to their environments. However, quantifying and analyzing the similarity of units of analysis requires decisions about applying arithmetic operations.

The first decision concerns specifying the type of comparison between vectors. Organizational scholars can calculate similarities from dyadic comparisons between vectors (i.e., the respective comparison of two units of analysis) or from the average of multiple dyadic comparisons (i.e., the average calculated from the comparison of a unit of analysis with multiple others). For example, the first type of comparison captures the degree of distinctiveness of organizations regarding an outstanding exemplar within the same industry (Younger & Fisher, 2020), whereas the second one is well suited to measure the degree to which organizations from the same country align on a shared understanding of an organizational issue (Goldenstein, Poschmann, et al., 2019).

The second decision regarding applying arithmetic operations concerns selecting an appropriate metric to quantify similarities between vectors. We illustrate the selection of similarity metrics by focusing on the cosine similarity and Euclidean similarity, because these metrics have become the standard in NLP (Cassisi et al., 2012; Ljubesic et al., 2008; Manning & Schütze, 2000).

The cosine similarity is calculated as follows ($x_{i}$ and $y_{i}$ are the $n$ components of the vectors $\vec{x}$ and $\vec{y}$, respectively):
$\cos (\vec{x}, \vec{y}) := \frac{\sum_{i = 1}^{n} x_{i} y_{i}}{\sqrt{\sum_{i = 1}^{n} x_{i}^{2}} \sqrt{\sum_{i = 1}^{n} y_{i}^{2}}}$
Thus, cosine similarity does not rely on the coordinates of the vectors’ end points in vector space to calculate similarities but on the angles between vectors. Hence, it is insensitive to the magnitude of the values of the vector dimensions (i.e., length of the vector).

The calculation of Euclidean similarity is based on the Euclidean distance formula:
$d (\vec{x}, \vec{y}) := \sqrt{{(x_{1} - y_{1})}^{2} + \dots + {(x_{n} - y_{n})}^{2}} = \sqrt{\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}}$
In a two-dimensional vector space, the Euclidean distance between two vectors is the length of the line segment connecting the vectors’ end points (say, point x with coordinates x₁ and x₂ and point y with coordinates y₁ and y₂). This value corresponds to the geometric distance between these two points. In a vector space with n dimensions, the calculation of this distance extends to all n dimensions of the vectors. The distance measure can easily be converted into a measure of similarity that takes values between 0 and 1 and equals 1 for identical vectors:
$\mathrm{sim} (\vec{x}, \vec{y}) := \frac{1}{1 + d (\vec{x}, \vec{y})}$
Different from cosine similarity, Euclidean similarity is sensitive to the magnitude of the values of vector dimensions and the length of the vectors, because longer vectors can have end points farther apart than shorter vectors do.
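Both metrics can be sketched in plain Python. The four binary vectors encode documents d₁ to d₄ from the term-by-document table reproduced toward the end of this text (14 terms, in the order noted in the comment); the function names are illustrative choices, and the handling of zero vectors, for which the cosine is undefined, is deliberately left out of this sketch.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Length of the line segment between the end points of x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def euclidean_similarity(x, y):
    """Distance converted to a similarity: 1 / (1 + d)."""
    return 1.0 / (1.0 + euclidean_distance(x, y))


# Term order: company, corporate, responsibility, employees, training, target,
# sustainability, management, experience, proper, performance, evaluation,
# properly, evaluates
d1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
d2 = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
d3 = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
d4 = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

cos_d1_d2 = round(cosine_similarity(d1, d2), 2)
euc_d1_d2 = round(euclidean_similarity(d1, d2), 2)
```

Rounded to two decimals, the results agree with the cosine / Euclidean pairs in the accompanying similarity table, for instance 0.60 / 0.33 for d₁ and d₂.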

Selecting a similarity metric hinges on the methodological consideration of whether organizational scholars want to capture similarities between vectors based on (1) the distribution of values or (2) the magnitude of values in these vectors. That is, cosine similarity captures which vector dimensions organizational entities primarily use. In contrast, Euclidean similarity considers the frequency with which organizational entities use the dimensions of the vectors representing these entities. For example, for a study investigating the frequency with which organizations use specific words, rather than focusing only on the distribution of these words, selecting Euclidean similarity as a similarity metric is appropriate. Figures 3a and b illustrate the importance of this consideration because applying Euclidean similarity reveals a higher similarity between org₂ and org₃, whereas cosine similarity captures a higher similarity between org₁ and org₂. To clarify, both similarity results are meaningful, but the selection of a specific similarity metric depends on whether the distribution of values or the magnitude of values is important for the specific organizational research endeavor.

Figure 3.
Illustration of Euclidean distance and cosine similarity in a simple normalized and unnormalized two-dimensional vector space consisting of three organizations as units of analysis and two words as dimensions.
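A small numerical sketch can make the contrast between the two metrics in unnormalized vector spaces tangible. The three organization vectors are hypothetical word counts chosen for illustration and are not the values behind Figure 3: org1 and org2 use the two words in the same proportion but with different frequency, whereas org3 resembles org2 in frequency but not in proportion.

```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))


def euclidean_similarity(x, y):
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + distance)


# Hypothetical, unnormalized counts of two words per organization.
org1 = [8, 2]  # same word proportions as org2, but twice the frequency
org2 = [4, 1]
org3 = [3, 2]  # similar frequency to org2, different proportions

cos_12 = cosine_similarity(org1, org2)
cos_23 = cosine_similarity(org2, org3)
euc_12 = euclidean_similarity(org1, org2)
euc_23 = euclidean_similarity(org2, org3)

# Cosine ranks org1 as the nearest neighbor of org2; Euclidean ranks org3.
cosine_prefers_org1 = cos_12 > cos_23
euclidean_prefers_org3 = euc_23 > euc_12
```

In this constructed case, the cosine similarity treats org1 and org2 as nearly identical because their vectors point in the same direction, while the Euclidean similarity rates org2 and org3 as closer because their end points lie nearer to each other.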

However, the difference between cosine similarity and Euclidean similarity dissolves in applications in which the values of the vector dimensions are normalized. Normalization means that the value differences between vectors are aligned to the same value range (i.e., the sum of the values of the vector dimensions does not differ across units of analysis). Consequently, the rankings of similarity between vectors calculated with the two similarity metrics are the same (Manning & Schütze, 2000). That is, these similarity metrics come to the same estimation regarding the nearest, and thus, most similar neighbors of a particular vector. Figures 3c and d illustrate that both cosine similarity and Euclidean similarity reveal the same ranking of similarity among org₁, org₂, and org₃.
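Why the rankings coincide can be made explicit with a short derivation. It assumes that normalization rescales every vector to unit Euclidean length (i.e., $\sum_{i = 1}^{n} x_{i}^{2} = \sum_{i = 1}^{n} y_{i}^{2} = 1$), which is an assumption of this sketch; under other normalizations, such as rescaling to an equal sum of values, the relationship below need not hold exactly. Under this assumption, the cosine similarity reduces to $\sum_{i = 1}^{n} x_{i} y_{i}$, and the squared Euclidean distance becomes:
$d (\vec{x}, \vec{y})^{2} = \sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2} = \sum_{i = 1}^{n} x_{i}^{2} + \sum_{i = 1}^{n} y_{i}^{2} - 2 \sum_{i = 1}^{n} x_{i} y_{i} = 2 - 2 \cos (\vec{x}, \vec{y})$
The distance therefore decreases strictly as the cosine increases, and because $\mathrm{sim} (\vec{x}, \vec{y}) = 1 / (1 + d (\vec{x}, \vec{y}))$ decreases strictly with the distance, both metrics order the neighbors of a given vector identically, even though the similarity values themselves differ.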

Example: As the investigation of texts provides a window into social reality, the relational similarity of organizations regarding the meaning they ascribe to the issue of responsibility can be theoretically conceptualized as the degree of a shared understanding that organizations communicatively display (Hoffman, 1999). From an open-system perspective, displaying a shared understanding within social spaces is important for organizations to ensure their prosperity and legitimacy (Zietsma et al., 2017). We are interested in the similarities of organizations within the same industry. Therefore, we calculated the average similarity of multiple dyadic comparisons (i.e., the average calculated from the comparison of an organization with others in the same industry). Our vector dimensions have normalized values, which is reasonable because we are interested in the focus organizations emphasize regarding the meaning they communicate in the context of their responsibilities. Accordingly, applying the cosine similarity or the Euclidean similarity will produce similar results. For illustration purposes, we applied both similarity metrics. We calculated the similarity between the vector of one organization ($\vec{x}$) and all the vectors of the other organizations from the same industry ($\vec{y}$) for this cross-sectional analysis, and we obtained a measure for shared understanding: $\mathrm{sim} (\vec{x}, \vec{y})$.
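The averaging of multiple dyadic comparisons within an industry can be sketched as follows. The organization names, the two-dimensional vectors, and the industry assignments are hypothetical, and `shared_understanding` is an illustrative function name; the sketch uses cosine similarity, but any similarity function with the same signature could be substituted.

```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))


def shared_understanding(org_vectors, industry_of):
    """Average similarity of each organization to its industry peers."""
    scores = {}
    for org, vector in org_vectors.items():
        peers = [other for other in org_vectors
                 if other != org and industry_of[other] == industry_of[org]]
        if peers:  # organizations without peers receive no score
            total = sum(cosine_similarity(vector, org_vectors[peer])
                        for peer in peers)
            scores[org] = total / len(peers)
    return scores


# Hypothetical organization vectors and industry assignments.
org_vectors = {
    "org_a": [1, 0], "org_b": [1, 1], "org_c": [0, 1],
    "org_d": [2, 0], "org_e": [1, 0],
}
industry_of = {
    "org_a": "mining", "org_b": "mining", "org_c": "mining",
    "org_d": "trade", "org_e": "trade",
}

scores = shared_understanding(org_vectors, industry_of)
```

The resulting dictionary holds one value per organization and could serve as the dependent variable in a subsequent cross-sectional analysis.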

Exemplary Regression Analysis

Once the relational similarity measures have been calculated, they need to be converted into a format suitable for subsequent statistical analysis. Depending on the research focus and the statistical method to be applied, appropriate formats include data frames, panel data, or network data. Although it is not possible here to provide a comprehensive overview of how to transform data and how to handle relational similarity measures statistically, we use our running example to illustrate a possible cross-sectional analysis by applying linear mixed-effects regression models.

Variables: We used the shared understanding of organizations from the same industry as the dependent variable and investigated its association with variables at the organizational level commonly used in organizational research. The variables were collected from the Orbis database of Bureau van Dijk. As a measure for the size of organizations, we used the average number of employees from 2015 to 2019 because the number of employees was relatively stable over time (average coefficient of variation = 19.8%). To reduce the skewness of the variable, we applied a log transformation to the measure. We also used the age of an organization, again applying a log transformation. To measure the density of industries (i.e., industry size), we counted the number of all corporations that populate these industries according to the Orbis database (Carroll & Hannan, 1989; Goldenstein, Hunoldt, et al., 2019a; Mendoza-Abarca & Gras, 2019). We determined the diversification of organizations based on the first two digits of all SIC codes, counting the number of different industries in which the organizations operated (Davis et al., 1994). We operationalized the internationalization of organizations by counting the number of foreign subsidiaries normalized by the total number of subsidiaries (Schwens et al., 2018).

Results: Table 4 presents the descriptive statistics and correlations of the variables used in the cross-sectional analyses. In Table 5, we present our cross-sectional analysis testing for the association between organizational variables and the degree of similarity of organizations in the same industry. We used the country as the grouping variable. The models indicate that the relational similarity of organizations from the same industries is associated with an organization's internationalization and size. That is, large organizations and organizations with a focus on national business activities tend to signal a higher degree of similarity to other organizations in their industry.

Table 4.
Descriptive Statistics and Pearson Correlation Coefficients.

		Mean	SD	(1)	(2)	(3)	(4)	(5)	(6)
(1)	Shared understanding (Cosine)	.89	.04
(2)	Shared understanding (Euclidean)	.88	.02	.94
(3)	Industry size (100,000)	24.09	27.87	.03	−.05
(4)	Internationalization	.57	.30	.11	.18	−.07
(5)	Age (log)	3.40	1.05	−.03	.03	−.24	.08
(6)	Diversification	1.65	.98	.15	.19	−.09	−.13	.07
(7)	Size (log)	9.49	2.37	−.14	−.17	−.18	.09	.09	−.28

Table 5.
Linear Mixed-Effects Models Explaining the Degree of a Shared Understanding Within the Same Industry (Standardized Coefficients).

	Shared understanding (Cosine similarity)	Shared understanding (Euclidean similarity)
Industry size (100,000)	.01	−.06
	(.06)	(.06)
Internationalization	−.16*	−.14*
	(.07)	(.06)
Age (log)	−.02	.04
	(.07)	(.06)
Diversification	.03	.03
	(.06)	(.05)
Size (log)	.16*	.15*
	(.07)	(.06)
Constant	−.13	−.16
	(.55)	(.63)
Log likelihood	−251.93	−225.14
N	196	196

***p < .001; **p < .01; *p < .05.

The results correspond with theoretical assumptions in the open-system perspective. This perspective assumes that an organization's size reflects its public visibility, and thus, may evoke pressure to align with other organizations in the same industry (DiMaggio & Powell, 1983; Fiss & Zajac, 2004). Internationalization is a source of dissimilarity, because organizations operating in many different countries are exposed to a multiplicity of (potentially) diverging requirements and expectations regarding their responsibilities, which they need to consider to maintain legitimacy (Kostova & Zaheer, 1999).

Discussion

Methodological Contribution

In our article, we have followed contemporary calls in organizational research to take advantage of the increasing availability of big collections of (textual) data (Braun et al., 2018; George et al., 2016; Tonidandel et al., 2018; Wenzel & Van Quaquebeke, 2018). The ubiquity of texts, including newspaper articles, press releases, and annual reports, inspired organizational researchers using NLP to quantify meaning and study organizational phenomena on a larger, thus far unprecedented scale (Gentzkow et al., 2019; Hannigan et al., 2019; Kobayashi et al., 2018a, 2018b).

Although NLP can accelerate knowledge discovery, organizational researchers have just begun to exploit the methodological reservoir of NLP. That is, they have applied NLP to cluster and classify textual information (DiMaggio et al., 2013; Poschmann & Goldenstein, 2022), or scale in-depth text analysis (Nelson, 2017). However, even if suggested by various theoretical approaches (e.g., Cattani et al., 2017; Dhalla & Oliver, 2013; Kennedy, 2008; Kostova et al., 2008; Kristof-Brown et al., 2005; Zietsma et al., 2017), organizational research has so far lacked a methodological approach that allows the quantitative analysis of the multidimensional relationality of organizational entities in social spaces and with their environment, such as the organization itself, industries, markets, or countries. With the methodological approach outlined in this article, we expand the application of NLP to “areas where research goals are not only to classify or to cluster but also to explain, using existing knowledge or theory and incorporating this into the analysis from the start” (Kobayashi et al., 2018b, p. 756).

As a step toward this visionary goal, we featured the vector space model in NLP. Vectors make it possible to represent the full multidimensionality of theory-based constructs at once and to transform it into relational similarity measures. This differs, for example, from earlier approaches of computer-aided text analysis that generally calculated the sum of the dimensions of theory-based constructs when using regression analyses (McKenny et al., 2012). Our running example provides an intuitive sense of how organizational scholars can apply vectoral representations in concrete research practice.

Outlining the Broad Applicability of the Vector Space Model

Our running example draws on the open-system perspective in organizational research (for an overview, see Scott & Davis, 2007). The perspective includes theoretical streams such as the resource dependence theory, institutional theory, category research, and organizational ecology (DiMaggio & Powell, 1983; Hannan & Freeman, 1977; Meyer & Rowan, 1977; Pfeffer & Salancik, 1978; Zuckerman, 1999). However, to provide an intuitive sense of the broad applicability of the vector space model, we have emphasized some streams in organizational research that could similarly benefit from our methodological approach when applying text analysis. Note that the research streams we mention can only serve as illustrations.

One theoretical stream concerns organizational cognition and emphasizes the importance of sensemaking for the retrospective interpretation of significant actions or events within organizations and their environments (Brown et al., 2015; Gephart, 1993; Weick, 1995). Our methodological approach can contribute to this research stream by studying the antecedents of shared sensemaking outcomes among employees within and across organizational subunits (Brown et al., 2009). For example, future research could use email communication to study whether and how managerial communication and communication styles influence the degree of similarity of employees’ sensemaking outcomes.

We also see potential to enrich empirical research in the literature on stakeholder management (Cornelissen, 2014). One stream in this literature focuses on the congruence between the identity an organization wishes to convey and the image stakeholders have of this organization (Christensen et al., 2013; Hooghiemstra, 2000). In this context, our methodological approach can assist in conducting large-scale studies to test conditions in which this perceived congruence is particularly significant. For example, researchers could measure the similarity between how organizations portray themselves on websites and how they are characterized on social media or in newspaper articles.

Our methodological approach could also contribute to research on social evaluation (Pollock et al., 2019). One stream within this literature emphasizes the importance for organizations to describe corporate crises in concordance with stakeholders’ perceptions (Bundy & Pfarrer, 2015). Future research could use our methodological approach to investigate the tension between organizational communication and stakeholders’ evaluations in texts such as press releases, social media posts, or newspaper articles. Such analyses could provide new insights into the conditions that lead to greater or lesser congruence between organizational communication and stakeholders’ perceptions. Additionally, studies could investigate the factors that exacerbate or mitigate the consequences of a weak congruence between organizational communication and stakeholders’ perceptions.

Future Developments

In this article, we have provided organizational scholars with high-level guidance on how to ensure validity when constructing relational similarity measures. Nevertheless, we see potential for future research. First, organizational scholars could further elaborate the details of our four considerations (see Table 1) and provide a step-by-step manual for use cases connected to specific streams in organizational research (for similar approaches, see McKenny et al., 2012; Short et al., 2010). Second, as illustrated in our article, the application of vectors is not only connected to theoretical validity, but also to issues of the technical validity of formal text analysis approaches. This technical validity concerns the quest for best practices of how to apply techniques developed in another scientific field to organization studies. Recently, organizational scholars have begun to develop and discuss ways to harness the potential of sophisticated formal text analysis for the needs of organizational research (Kobayashi et al., 2018b). Progress in this direction would be helpful for organizational research, as a technically valid use of NLP approaches would help to ensure a suitable operationalization of theory-based constructs.

As a caveat, our methodological approach does require some technical expertise. However, we believe that the formal foundation of our methodological approach—the vector space model—is comprehensible and adoptable for organizational scholars with less technical expertise. Furthermore, basic approaches of formal text analysis such as word counting and topic modeling are relatively easy to implement. Admittedly, depending on the research focus and NLP approach, some advanced methodologies still require a deeper NLP and machine learning background. However, a benefit of this constraint is that our approach makes it possible for organizational researchers to gather entirely new insights, because it helps address the intricate complexity of social meaning.

Furthermore, formal text analysis has its limitations. It is, for example, not readily equipped to capture various types of non-literal or figurative speech, including irony or metaphors (Hossain et al., 2020; Zayed et al., 2020), as well as uncertain or fuzzy wordings (Engelmann & Hahn, 2014; Štajner et al., 2017), and these aspects of language might be of particular interest for organizational researchers. Finally, some formal approaches are heavily biased toward covering the English language and might be sufficient for analyses of a few other languages (e.g., Chinese, French, or German), but they mostly lack sufficient support for what has been called low-resourced languages.

Terms	Document d₁	Document d₂	Document d₃	Document d₄
company	1	1	0	0
corporate	1	1	0	0
responsibility	1	1	0	0
employees	1	0	0	1
training	1	0	0	0
target	0	1	0	0
sustainability	0	1	0	0
management	0	0	1	1
experience	0	0	1	0
proper	0	0	1	0
performance	0	0	1	1
evaluation	0	0	1	0
properly	0	0	0	1
evaluates	0	0	0	1

	Document d₁	Document d₂	Document d₃	Document d₄
Document d₁	1.00 / 1.00
Document d₂	0.60 / 0.33	1.00 / 1.00
Document d₃	0.00 / 0.24	0.00 / 0.24	1.00 / 1.00
Document d₄	0.20 / 0.26	0.00 / 0.24	0.40 / 0.29	1.00 / 1.00

Aspects of validity	Considerations
1. Definition of the theory-based construct and its dimensionality	Ground the definition of a theory-based construct in extant organizational research; identify the dimensions of the theory-based construct (deductive or inductive identification of dimensions); consider the linguistic reflection of the theory-based construct in text documents
2. Selecting an appropriate formal text analysis approach	Consider the fit of approaches of formal text analysis with the theory-based construct and its linguistic reflection; note the linguistic unit (i.e., document, sentence, or word) that the formal text analysis approach annotates
3. Selecting and processing text documents	Select relevant text documents representing the unit of analysis; consider the preprocessing of text documents; consider the aggregation of the output of formal text analysis
4. Relationality and calculating relational similarity measures	Ground the definition of relationality in extant organizational research; consider the theoretical appropriateness of vector comparisons; select the appropriate similarity metric

		Mean	SD	(1)	(2)	(3)	(4)	(5)	(6)
(1)	Shared understanding (Cosine)	.89	.04
(2)	Shared understanding (Euclidean)	.88	.02	.94
(3)	Industry size (100,000)	24.09	27.87	.03	−.05
(4)	Internationalization	.57	.30	.11	.18	−.07
(5)	Age (log)	3.40	1.05	−.03	.03	−.24	.08
(6)	Diversification	1.65	.98	.15	.19	−.09	−.13	.07
(7)	Size (log)	9.49	2.37	−.14	−.17	−.18	.09	.09	−.28

	Shared understanding (Cosine similarity)	Shared understanding (Euclidean similarity)
Industry size (100,000)	.01	−.06
	(.06)	(.06)
Internationalization	−.16*	−.14*
	(.07)	(.06)
Age (log)	−.02	.04
	(.07)	(.06)
Diversification	.03	.03
	(.06)	(.05)
Size (log)	.16*	.15*
	(.07)	(.06)
Constant	−.13	−.16
	(.55)	(.63)
Log likelihood	−251.93	−225.14
N	196	196

Appendix A: Selected Formal Text Analysis Approaches

Overview of Selected Formal Text Analysis Approaches.

Approach	Use case	Further readings	Previous usages in empirical research
Word counting	Identification of theory-based constructs in text documents by counting words; high frequencies are hypothesized to reflect saliency	Chung and Pennebaker (2007), Lasswell et al. (1952), Nguyen et al. (2020)	Abrahamson and Eisenman (2008), Breiger et al. (2018), Crilly et al. (2016), Pollach (2012), Short et al. (2010)
Topic modeling	Identification of topics in text documents by identifying patterns of theme-indicating word co-occurrences based on probabilistic models	Blei (2012), Boyd-Graber et al. (2017), Dieng et al. (2020), Vayansky and Kumar (2020)	Croidieu and Kim (2018), DiMaggio et al. (2013), Fligstein et al. (2017), Goldenstein and Poschmann (2019a), Hannigan et al. (2019), Kaplan and Vakili (2015)
Named entity recognition	Identification of instances or mentions of relevant semantic classes or types (e.g., person, organization, or location) that occur in a phrase	Li et al. (2020), Nadeau and Sekine (2007)	Hannigan and Casasnovas (2021), Mohr et al. (2013), Poschmann and Goldenstein (2022)
Word embeddings	Identification of word semantics based on a window of co-occurring terms using high-dimensional vectors of real numbers	Camacho-Collados and Pilehvar (2018), Wang et al. (2020)	Nelson (2021), Kozlowski et al. (2019)
Part-of-speech tagging	Identification of syntactic structures by classifying words by their parts of speech (e.g., nouns and verbs) based on pretrained models	Bird et al. (2009), Manning and Schütze (2000), Schmid (2008), Voutilainen (2004)	Mohr et al. (2013), Noguti (2016)
Grammatical parsing	Identification of grammatical structures by grouping words into types that reflect the grammatical role of words in a sentence	Bird et al. (2009), Carroll (2004), Manning and Schütze (2000)	Goldenstein and Poschmann (2019a), Goldenstein, Poschmann, et al. (2019), Sudhahar et al. (2013), Van Atteveldt et al. (2008)
Sentiment and emotion analysis	Identification of sentiment polarity or finer-grained emotion dimensions based on pretrained models or lexicon approaches	Alswaidan and Menai (2020), Birjali et al. (2021), Mohammad (2021), Sailunaz et al. (2018), Taboada et al. (2011), Zhang et al. (2018)	Bermiss et al. (2014), Etter et al. (2018), Li et al. (2018), Ranganathan et al. (2018), Song et al. (2018)

Appendix B: Overview of Natural Language Processing Software

Overview of NLP Software: Libraries (to be Used by Other Programs) vs. Tools (Stand-Alone).

	Type	Word counting	Topic modeling	Word embeddings	Named entity recognition	Part-of-speech tagging	Parsing	Sentiment and emotion analysis
Meaning Extraction Helper	Tool (graphical user interface)	✓
LIWC	Tool (graphical and command line interface)	✓						✓
WordStat	Tool (graphical user interface)	✓	✓
Scikit-Learn	Python library	✓	✓
Mallet	Java library		✓
Tomotopy	Python library		✓
Gensim	Python library		✓	✓
FastText	Tool (command line interface)			✓
GloVe	Tool (command line interface)			✓
NLTK	Python library				✓	✓	✓	✓
spaCy	Python library				✓	✓	✓
Stanford CoreNLP	Java library				✓	✓	✓	✓
TextBlob	Python library					✓	✓
JEmAS	Tool (command line interface)							✓

Appendix C: Using the Vector Space Model in NLP With Word Embeddings

Our basic example illustrating the vector space model in NLP focuses on comparing the vocabulary (i.e., terms) of four documents, each containing one sentence. Due to this focus, we counted the terms in the example sentences.

However, organizational researchers experienced in formal text analysis might consider that, instead of comparing the vocabulary itself, another option is to compare the latent semantics of the vocabulary using a word embedding algorithm. Although word embeddings have rarely been used in organizational research (Arseniev-Koehler, 2022; Kozlowski et al., 2019; Nelson, 2021; Stoltz & Taylor, 2021), we see contemporary interest in this formal text analysis approach (for an overview, see Wang et al., 2020).

Word embeddings capture the latent semantics of individual words (for a more detailed description, see Kiela et al., 2015). They represent word semantics as high-dimensional vectors of real numbers (embedding vectors). Although representing word semantics by real numbers may appear unusual, embedding vectors make sense computationally because they capture the similarity of word semantics astonishingly well (Bojanowski et al., 2017; Pennington et al., 2014).

To provide readers an intuitive sense of word embeddings, we repeat our analysis of the sentences of our basic example with this formal text analysis approach. We again removed function words (i.e., articles, conjunctions, prepositions, and pronouns).

Document d₁: “The corporation has responsibility for its employees and their training.”

Document d₂: “The responsibility of our corporation is to combine sustainability and efficient production.”

Document d₃: “The board has the experience for a proper performance.”

Document d₄: “The board evaluates deviations in production performance.”
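
The removal of function words from these four sentences can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing pipeline. The `FUNCTION_WORDS` set is a hand-written, hypothetical list covering only the articles, conjunctions, prepositions, and pronouns that occur in the example sentences, and `content_terms` is an illustrative helper name. In practice, a curated stop word list or part-of-speech tags would take its place.

```python
import re

# Hypothetical, hand-written list: only the articles, conjunctions,
# prepositions, and pronouns that appear in the four example sentences.
FUNCTION_WORDS = {"the", "a", "and", "for", "of", "to", "in",
                  "its", "their", "our"}

documents = {
    "d1": "The corporation has responsibility for its employees and their training.",
    "d2": "The responsibility of our corporation is to combine sustainability and efficient production.",
    "d3": "The board has the experience for a proper performance.",
    "d4": "The board evaluates deviations in production performance.",
}


def content_terms(sentence):
    """Lowercase the sentence, split it into words, and drop function words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [token for token in tokens if token not in FUNCTION_WORDS]


terms_by_document = {name: content_terms(text) for name, text in documents.items()}

for name, terms in terms_by_document.items():
    print(name, terms)
```

Because the list is restricted to the four word classes named above, verbs such as "has" and "is" remain in the output. Whether such words should also be removed is a preprocessing decision that depends on the theory-based construct.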

For this appendix, we used the fastText word embedding algorithm (Bojanowski et al., 2017). FastText comes with a pre-trained model based on the widely used Common Crawl data set, which has repeatedly been applied successfully in textual contexts similar to our example sentences (Bojanowski et al., 2017; Büchel et al., 2018; Sedoc et al., 2020). In other words, the pre-trained model for fastText is suitable because the latent semantics of individual words is likely to be stable across similar textual contexts (Stoltz & Taylor, 2021). We followed past research and averaged the embedding vectors of the words in each example sentence (Arseniev-Koehler, 2022; Nanni & Fallin, 2021; Stoltz & Taylor, 2021) to capture the latent semantics of the vocabulary used within each document.

In Table 6, we depict the document–embedding matrix for the four documents. The dimensionality of the document–embedding matrix is defined by the 300 embedding dimensions (i.e., dimensions of the embedding vectors) fastText provides.
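
The averaging step that produces a document–embedding matrix can be sketched as follows. To keep the example self-contained, it uses randomly generated four-dimensional vectors (`toy_embeddings`) in place of the 300-dimensional pre-trained fastText vectors, so the resulting numbers carry no semantic meaning. The term lists, `EMBEDDING_DIM`, and `document_vector` are illustrative assumptions rather than the authors' code.

```python
import numpy as np

EMBEDDING_DIM = 4  # hypothetical toy size; the pre-trained fastText vectors have 300

# Content terms per document (assumed result of removing function words).
terms_by_document = {
    "d1": ["corporation", "has", "responsibility", "employees", "training"],
    "d2": ["responsibility", "corporation", "is", "combine",
           "sustainability", "efficient", "production"],
    "d3": ["board", "has", "experience", "proper", "performance"],
    "d4": ["board", "evaluates", "deviations", "production", "performance"],
}

# Stand-in for a pre-trained model: one random vector per vocabulary term.
rng = np.random.default_rng(seed=0)
vocabulary = sorted({term for terms in terms_by_document.values() for term in terms})
toy_embeddings = {term: rng.normal(size=EMBEDDING_DIM) for term in vocabulary}


def document_vector(terms, embeddings):
    """Average the embedding vectors of all terms in one document."""
    return np.mean([embeddings[term] for term in terms], axis=0)


document_names = list(terms_by_document)
document_embedding_matrix = np.vstack([
    document_vector(terms_by_document[name], toy_embeddings)
    for name in document_names
])

print(document_embedding_matrix.shape)  # (4, 4): four documents, four toy dimensions
```

With real embeddings, only the lookup table changes: each term is mapped to its pre-trained vector, and the matrix has one row per document and 300 columns.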

The semantic similarity of documents can be quantified using similarity metrics (see Table 7). The similarity values in this example are produced by the cosine similarity and the inverted Euclidean distance (i.e., Euclidean similarity; Cassisi et al., 2012; Ljubesic et al., 2008; Manning & Schütze, 2000). Note that applying word embeddings reveals subtle nuances in word semantics. The similarity values indicate a ranking of document pairs that a human reader would intuitively expect when considering the semantics of the vocabulary used in the example sentences.

Acknowledgements

We are grateful to Peter Walgenbach for his very supportive comments on a previous draft of this manuscript. Moreover, we appreciate the help of Antonia Lipka and Toni Wengerodt during the data preparation process. Finally, we thank our editor Michael Withers and anonymous reviewers for their great support and recommendations during the review process.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Deutsche Forschungsgemeinschaft (grant number GO 3213/3-1).

ORCID iD

Philipp Poschmann

Author Biographies

Philipp Poschmann is a research associate at the Chair of Organization, Leadership and Human Resource Management at Friedrich Schiller University Jena in Germany and co-founder of the Group for Jena Computational Organizational Research Applications (JenCORA). His work primarily revolves around organizations, technology, and digital transformation, with a specific focus on leveraging natural language processing for social research. Philipp has published empirical and methodological research in scientific journals including Journal of Management Studies, Sociological Methodology, and Sociological Methods & Research.

Jan Goldenstein is a research associate at the Chair of Organization, Leadership, and Human Resource Management at Friedrich Schiller University Jena, Germany. He is also co-founder of the Group for Jena Computational Organizational Research Applications (JenCORA). His research interests include institutional theory, category research, social evaluation, market entry decisions, comparative international studies, and methodological research. Jan is the author of articles in scientific journals like Journal of Business Venturing, Journal of Management Studies, Strategic Organization, Sociological Methodology, and Sociological Methods & Research.

Sven Büchel is a freelance data scientist who previously worked at the Jena University Language and Information Engineering (JULIE) Lab. He received his PhD in computational linguistics in 2022 from the Friedrich Schiller University Jena in Germany. His doctorate research focused on deep learning models to understand human emotion in language. He held visiting positions at the University of Pennsylvania, Amazon Alexa, and the German Institute for Economic Research (DIW Berlin). He has published 20+ articles in different AI venues including the prestigious ACL, EMNLP, NAACL, COLING, and CogSci conferences.

Udo Hahn is a full professor (emeritus) of Computational Linguistics at Friedrich Schiller University Jena, Germany. Previously, he held positions as an associate professor for Language Informatics at the University of Freiburg and as an assistant professor for Computer Science at the Information Systems Chair of the University of Passau. He received his doctoral degree in Information Science from the University of Konstanz, Germany. His research is focused on text analytics, mostly information extraction, text mining, information retrieval, and text summarization, with lots of applications in the life sciences (molecular biology, clinical medicine). His publication record contains roughly 400 publications covering a wide range of prestigious journals such as Computational Linguistics, Artificial Intelligence, IEEE Computer, IEEE Intelligent Systems, Information Processing & Management, Briefings in Bioinformatics, Bioinformatics, Journal of Biomedical Informatics, Artificial Intelligence in Medicine, Methods of Information in Medicine, Cell, Nucleic Acids Research, Nature Communications, etc.

References

Abrahamson & Eisenman (2008). Employee-management techniques: Transient fads or trending fashions? Administrative Science Quarterly, 53(4), 719–744. https://doi.org/10.2189/asqu.53.4.719

Ágel & Fischer (2015). Dependency grammar and valency theory. In Heine & Narrog (Eds.), The Oxford handbook of linguistic analysis (pp. 225–258). Oxford University Press.

Aguinis & Glavas (2012). What we know and don’t know about corporate social responsibility: A review and research agenda. Journal of Management, 38(4), 932–968. https://doi.org/10.1177/0149206311436079

Alswaidan & Menai (2020). A survey of state-of-the-art approaches for emotion recognition in text. Knowledge and Information Systems, 62(1), 2937–2987. https://doi.org/10.1007/s10115-020-01449-0

Arseniev-Koehler (2022). Theoretical foundations and limits of word embeddings: What types of meaning can they capture? Sociological Methods & Research, Advance online publication. https://doi.org/10.1177/00491241221140142

Baeza-Yates & Ribeiro, B. d. A. N. (2004). Modern information retrieval. Pearson Addison-Wesley.

Bail, C. A. (2016). Cultural carrying capacity: Organ donation advocacy, discursive framing, and social media engagement. Social Science and Medicine, 165(1), 280–288. https://doi.org/10.1016/j.socscimed.2016.01.049

Bansal, Kim, & Wood, M. O. (2018). Hidden in plain sight: The importance of scale in organizations’ attention to issues. Academy of Management Review, 43(2), 217–241. https://doi.org/10.5465/amr.2014.0238

Barlow, M. A., Verhaal, J. C., & Angus, R. W. (2019). Optimal distinctiveness, strategic categorization, and product market entry on the Google Play app platform. Strategic Management Journal, 40(8), 1219–1242. https://doi.org/10.1002/smj.3019

Bermiss, Y. S., Zajac, E. J., & King, B. G. (2014). Under construction: How commensuration and management fashion affect corporate reputation rankings. Organization Science, 25(2), 591–608. https://doi.org/10.1287/orsc.2013.0852

Bird, Klein, & Loper (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. O’Reilly Media.

Birjali, Kasri, & Beni-Hssane (2021). A comprehensive survey on sentiment analysis: Approaches, challenges and trends. Knowledge-Based Systems, 226(1), 107134. https://doi.org/10.1016/j.knosys.2021.107134

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84. https://doi.org/10.1145/2133806.2133826

Bojanowski, Grave, Joulin, & Mikolov (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5(1), 135–146. https://doi.org/10.1162/tacl_a_00051

Bondy, Moon, & Matten (2012). An institution of corporate social responsibility (CSR) in multi-national corporations (MNCs): Form and implications. Journal of Business Ethics, 111(2), 281–299. https://doi.org/10.1007/s10551-012-1208-7

Bonilla & Grimmer (2013). Elevated threat levels and decreased expectations: How democracy handles terrorist threats. Poetics, 41(6), 650–669. https://doi.org/10.1016/j.poetic.2013.06.003

Boyd-Graber, Hu, & Mimno, D. M. (2017). Applications of topic models. Foundations and Trends® in Information Retrieval, 11(2–3), 143–296. https://doi.org/10.1561/1500000030

Braun, M. T., Kuljanin, & DeShon, R. P. (2018). Special considerations for the acquisition and wrangling of big data. Organizational Research Methods, 21(3), 633–659. https://doi.org/10.1177/1094428117690235

Breiger, R. L., Wagner-Pacifici, & Mohr, J. W. (2018). Capturing distinctions while mining text data: Toward low-tech formalization for text analysis. Poetics, 68(1), 104–119. https://doi.org/10.1016/j.poetic.2018.02.005

Brown, A. D., Colville, & Pye (2015). Making sense of sensemaking in organization studies. Organization Studies, 36(2), 265–277. https://doi.org/10.1177/0170840614559259

Brown, A. D., Stacey, & Nandhakumar (2009). Making sense of sensemaking narratives. Human Relations, 61(8), 1035–1062. https://doi.org/10.1177/0018726708094858

Büchel, Buffone, Slaff, Ungar, & Sedoc (2018). Modeling empathy and distress in reaction to news stories. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4758–4765. https://doi.org/10.18653/v1/D18-1507

Bundy & Pfarrer, M. D. (2015). A burden of responsibility: The role of social approval at the onset of a crisis. Academy of Management Review, 40(3), 345–369. https://doi.org/10.5465/amr.2013.0027

Camacho-Collados & Pilehvar, M. T. (2018). From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research, 63(1), 743–788. https://doi.org/10.1613/jair.1.11259

Carroll, G. R., & Hannan, M. T. (1989). Density dependence in the evolution of populations of newspaper organizations. American Sociological Review, 54(4), 524–541. https://doi.org/10.2307/2095875

Carroll, J. A. (2004). Parsing. In Mitkov (Ed.), The Oxford handbook of computational linguistics (pp. 233–248). Oxford University Press.

Cassisi, Montalto, Aliotta, Cannata, & Pulvirenti (2012). Similarity measures and dimensionality reduction techniques for time series data mining. In Karahoca (Ed.), Advances in data mining knowledge discovery and applications (pp. 71–96). inTech.

Cattani, Porac, J. F., & Thomas (2017). Categories and competition. Strategic Management Journal, 38(1), 64–92. https://doi.org/10.1002/smj.2591

Chen, Mathieu, J. E., & Bliese, P. D. (2005). A framework for conducting multi-level construct validation. Research in Multi-Level Issues, 3(1), 273–303. https://doi.org/10.1016/S1475-9144(04)03013-9

Chen & Bouvain (2009). Is corporate responsibility converging? A comparison of corporate responsibility reporting in the USA, UK, Australia, and Germany. Journal of Business Ethics, 87(1), 299–317. https://doi.org/10.1007/s10551-008-9794-0

Christensen, L. T., Morsing, & Thyssen (2013). CSR as aspirational talk. Organization, 20(3), 372–393. https://doi.org/10.1177/1350508413478310

Chung & Pennebaker (2007). The psychological functions of function words. In Fiedler (Ed.), Social communication (pp. 343–359). Psychology Press.

Cornelissen, J. P. (2014). Corporate communication. Sage Publications.

Crilly, Hansen, & Zollo (2016). The grammar of decoupling: A cognitive-linguistic perspective on firms’ sustainability claims and stakeholders’ interpretation. Academy of Management Journal, 59(2), 705–729. https://doi.org/10.5465/amj.2015.0171

Croidieu & Kim, P. H. (2018). Labor of love: Amateurs and lay-expertise legitimation in the early U.S. radio field. Administrative Science Quarterly, 63(1), 1–42. https://doi.org/10.1177/0001839216686531

Davis, G. F., Diekmann, K. A., & Tinsley, C. H. (1994). The decline and fall of the conglomerate firm in the 1980s: The deinstitutionalization of an organizational form. American Sociological Review, 59(4), 547–570. https://doi.org/10.2307/2095931

Dhalla & Oliver (2013). Industry identity in an oligopolistic market and firms’ responses to institutional pressures. Organization Studies, 34(12), 1803–1834. https://doi.org/10.1177/0170840613483809

Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8(3), 439–453. https://doi.org/10.1162/tacl_a_00325

DiMaggio, P. J., Nag, & Blei (2013). Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding. Poetics, 41(6), 570–606. https://doi.org/10.1016/j.poetic.2013.08.004

DiMaggio, P. J., & Powell, W. W. (1983). The iron cage revisited: Institutional isomorphism and collective rationality in organizational fields. American Sociological Review, 48(2), 147–160. https://doi.org/10.2307/2095101

Duriau, V. J., Reger, R. K., & Pfarrer, M. D. (2007). A content analysis of the content analysis literature in organization studies: Research themes, data sources, and methodological refinements. Organizational Research Methods, 10(1), 5–34. https://doi.org/10.1177/1094428106289252

Eisenstein (2019). Introduction to natural language processing. MIT Press.

Engelmann & Hahn (2014). An empirically grounded approach to extend the linguistic coverage and lexical diversity of verbal probabilities. Proceedings of the 36th Annual Meeting of the Cognitive Science Society, 36(1), 23–26.

Etter, Colleoni, Illia, Meggiorin, & D’Eugenio (2018). Measuring organizational legitimacy in social media: Assessing citizens’ judgments with sentiment analysis. Business and Society, 57(1), 60–97. https://doi.org/10.1177/0007650316683926

Evans, J. A., & Aceves (2016). Machine translation: Mining text for social theory. Annual Review of Sociology, 42(1), 21–50. https://doi.org/10.1146/annurev-soc-081715-074206

Fiss, P. C., & Zajac, E. J. (2004). The diffusion of ideas over contested terrain: The (non)adoption of a shareholder value orientation among German firms. Administrative Science Quarterly, 49(4), 501–534. https://doi.org/10.2307/4131489

Fligstein, Brundage, J. S., & Schultz (2017). Seeing like the Fed: Culture, cognition, and framing in the failure to anticipate the financial crisis of 2008. American Sociological Review, 82(5), 879–909. https://doi.org/10.1177/0003122417728240

Gentzkow, Kelly, B. T., & Taddy, M. A. (2019). Text as data. Journal of Economic Literature, 57(3), 535–574. https://doi.org/10.1257/jel.20181020

George, Haas, M. R., & Pentland (2014). Big data and management. Academy of Management Journal, 57(2), 321–326. https://doi.org/10.5465/amj.2014.4002

George, Osinga, E. C., Lavie, & Scott, B. A. (2016). Big data and data science methods for management research. Academy of Management Journal, 59(5), 1493–1507. https://doi.org/10.5465/amj.2016.4005

Gephart, R. R. (1993). The textual approach: Risk and blame in disaster sensemaking. Academy of Management Journal, 36(6), 1465–1514. https://doi.org/10.2307/256819

Goldenstein & Poschmann (2019a). Analyzing meaning in big data: Performing a map analysis using grammatical parsing and topic modeling. Sociological Methodology, 49(1), 83–131. https://doi.org/10.1177/0081175019852762

Goldenstein & Poschmann (2019b). A quest for transparent and reproducible text-mining methodologies in computational social science. Sociological Methodology, 49(1), 144–151. https://doi.org/10.1177/0081175019867855

Goldenstein, Hunoldt, & Oertel (2019). How optimal distinctiveness affects new ventures’ failure risk: A contingency perspective. Journal of Business Venturing, 34(3), 477–495. https://doi.org/10.1016/j.jbusvent.2019.01.004

Goldenstein, Poschmann, Händschke, S. G. M., & Walgenbach (2019). Global and local orientation in organizational actorhood: A comparative study of large corporations from Germany, the United Kingdom, and the United States. European Journal of Cultural and Political Sociology, 6(2), 201–236. https://doi.org/10.1080/23254823.2018.1532804

Haans, R. F. J. (2019). What’s the value of being different when everyone is? The effects of distinctiveness on performance in homogeneous versus heterogeneous categories. Strategic Management Journal, 40(1), 3–27. https://doi.org/10.1002/smj.2978

Händschke, S. G. M., Büchel, Goldenstein, Poschmann, Hahn, & Walgenbach (2018). A corpus of corporate annual and social responsibility reports: 280 million tokens of balanced organizational writing. ECONLP 2018: Proceedings of the First Workshop on Economics and Natural Language Processing, 20–31. https://doi.org/10.18653/v1/W18-3103

Hannan, M. T., & Freeman (1977). The population ecology of organizations. American Journal of Sociology, 82(5), 929–964. https://doi.org/10.1086/226424

Hannigan, T. R., & Casasnovas (2021). New structuralism and field emergence: The co-constitution of meanings and actors in the early moments of social impact investing. Research in the Sociology of Organizations, 68(1), 147–183. https://doi.org/10.1108/S0733-558X20200000068008

Hannigan, T. R., Haans, R. F. J., Vakili, Tchalian, Glaser, V. L., Wang, M. S., Kaplan, & Jennings, P. D. (2019). Topic modeling in management research: Rendering new theory from textual data. Academy of Management Annals, 13(2), 586–632. https://doi.org/10.5465/annals.2017.0099

Hickman, Thapa, Tay, Cao, & Srinivasan (2022). Text preprocessing for text mining in organizational research: Review and recommendations. Organizational Research Methods, 25(1), 114–146. https://doi.org/10.1177/1094428120971683

Hoffman, A. J. (1999). Institutional evolution and change: Environmentalism and the U.S. chemical industry. Academy of Management Journal, 42(4), 351–371. https://doi.org/10.2307/257008

Hooghiemstra (2000). Corporate communication and impression management: New perspectives why companies engage in corporate social reporting. Journal of Business Ethics, 27(1–2), 55–68. https://doi.org/10.1023/A:1006400707757

Hossain, Krumm, Gamon, & Kautz, H. A. (2020). SemEval-2020 Task 7: Assessing humor in edited news headlines. Proceedings of the 14th International Workshop on Semantic Evaluation @ COLING, 2020(1), 746–758.

Jockers, M. L., & Mimno (2013). Significant themes in 19th-century literature. Poetics, 41(6), 750–769. https://doi.org/10.1016/j.poetic.2013.08.005

Jurafsky & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics and speech recognition. Prentice Hall.

Kang, Cai, Tan, C.-W., Huang, & Liu (2020). Natural language processing (NLP) in management research: A literature review. Journal of Management Analytics, 7(2), 139–172. https://doi.org/10.1080/23270012.2020.1756939

Kaplan & Vakili (2015). The double-edged sword of recombination in breakthrough innovation. Strategic Management Journal, 36(10), 1435–1457. https://doi.org/10.1002/smj.2294

Karlgren & Kanerva (2021). Semantics in high-dimensional space. Frontiers in Artificial Intelligence, 2021(1), #2021.698809. https://doi.org/10.3389/frai.2021.698809

Kennedy, M. T. (2008). Getting counted: Markets, media, and reality. American Sociological Review, 73(2), 270–295. https://doi.org/10.1177/000312240807300205

Kiela, Hill, & Clark (2015). Specializing word embeddings for similarity or relatedness. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2044–2048. https://doi.org/10.18653/v1/D15-1242

Klein, K. J., & Kozlowski, S. W. (2000). From micro to meso: Critical steps in conceptualizing and conducting multilevel research. Organizational Research Methods, 3(3), 211–236. https://doi.org/10.1177/109442810033001

Kobayashi, V. B., Mol, S. T., Berkers, H. A., Kismihók, & Den Hartog, D. N. (2018a). Text classification for organizational researchers: A tutorial. Organizational Research Methods, 21(3), 766–799. https://doi.org/10.1177/1094428117719322

Kobayashi, V. B., Mol, S. T., Berkers, H. A., Kismihók, & Den Hartog, D. N. (2018b). Text mining in organizational research. Organizational Research Methods, 21(3), 733–765. https://doi.org/10.1177/1094428117722619

Kogut & Singh (1988). The effect of national culture on the choice of entry mode. Journal of International Business Studies, 19(1), 411–432. https://doi.org/10.1057/palgrave.jibs.8490394

Kostova, Roth, & Dacin, M. T. (2008). Institutional theory in the study of multinational corporations: A critique and new directions. Academy of Management Review, 33(4), 994–1006. https://doi.org/10.5465/amr.2008.34422026

Kostova & Zaheer (1999). Organizational legitimacy under conditions of complexity: The case of the multinational enterprise. Academy of Management Review, 24(1), 64–81. https://doi.org/10.2307/259037

Kozlowski, A. C., Taddy, & Evans, J. A. (2019). The geometry of culture: Analyzing the meanings of class through word embeddings. American Sociological Review, 84(5), 905–949. https://doi.org/10.1177/0003122419877135

Krippendorff (2004). Content analysis: An introduction to its methodology (2nd ed.). Sage Publications.

Kristof-Brown, A. L., Zimmerman, R. D., & Johnson, E. C. (2005). Consequences of individual’s fit at work: A meta-analysis of person-job, person-organization, person-group, and person-supervisor fit. Personnel Psychology, 58(2), 281–342. https://doi.org/10.1111/j.1744-6570.2005.00672.x

Kübler, McDonald, R. T., & Nivre (2009). Dependency parsing. Morgan & Claypool Publishers.

Lasswell, H. D., Lerner, & Pool, I. d. S. (1952). The comparative study of symbols. An introduction. Stanford University Press.

Li, Sun, Han, & Li (2020). A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 34(1), 50–70. https://doi.org/10.1109/TKDE.2020.2981314

Li, Zhu (2018). Making sense of organization dynamics using text analysis. Expert Systems With Applications, 111, 107–119. https://doi.org/10.1016/j.eswa.2017.11.009

Lim & Tsutsui (2012). Globalization and commitment in Corporate Social Responsibility: Cross-national analyses of institutional and political-economy effects. American Sociological Review, 77(1), 69–98. https://doi.org/10.1177/0003122411432701

Ljubesic, Boras, Bakaric, & Njavro (2008). Comparing measures of semantic similarity. 30th International Conference on Information Technology Interfaces, 675–682.

Loewenstein, Ocasio, & Jones (2012). Vocabularies and vocabulary structure: A new approach linking categories, practices, and institutions. Academy of Management Annals, 6(1), 41–86. https://doi.org/10.5465/19416520.2012.660763

Manning, C. D., Raghavan, & Schütze (2008). Introduction to information retrieval. Cambridge University Press.

Manning, C. D., & Schütze (2000). Foundations of natural language processing (4th ed.). MIT Press.

90.

McKenny

A. F.

Short

J. C.

Payne

G. T.

(2012). Using computer-aided text analysis to elevate constructs: An illustration using psychological capital. Organizational Research Methods, 16(1), 152–184. https://doi.org/10.1177/1094428112459910

91.

Mendoza-Abarca

K. I.

Gras

(2019). The performance effects of pursuing a diversification strategy by newly founded nonprofit organizations. Journal of Management, 45(3), 984–1008. https://doi.org/10.1177/0149206316685854

92.

Meyer

J. W.

Rowan

(1977). Institutionalized organizations: Formal structure as myth and ceremony. American Journal of Sociology, 83(2), 340–363. https://doi.org/10.1086/226550

93.

Mohammad

S. M.

(2021). Sentiment analysis: Automatically detecting valence, emotions, and other affectual states from text. In Meiselman

H. L.

(Ed.), Emotion measurement (pp. 323–379). Woodhead Publishing.

94.

Mohr

J. W.

(1998). Measuring meaning structures. Annual Review of Sociology, 24(1), 345–370. https://doi.org/10.1146/annurev.soc.24.1.345

95.

Mohr

J. W.

(2005). Implicit terrains: Meaning, measurement, and spatial metaphors in organizational theory. Unpublished manuscript.

96.

Mohr

J. W.

Bail

C. A.

Frye

Lena

J. C.

Lizardo

McDonnell

T. E.

Mische

Wherry

F. F.

(2020). Measuring culture. Columbia University Press.

97.

Mohr

J. W.

Bogdanov

(2013). Introduction topic models: What they are and why they matter. Poetics, 41(6), 545–569. https://doi.org/10.1016/j.poetic.2013.10.001

98.

Mohr

J. W.

Wagner-Pacifici

Breiger

R. L.

Bogdanov

(2013). Graphing the grammar of motives in national security strategies: Cultural interpretation, automated text analysis and the drama of global politics. Poetics, 41(6), 670–700. https://doi.org/10.1016/j.poetic.2013.08.003

99.

Nadeau

Sekine

(2007). A survey of named entity recognition and classification. Lingvisticæ Investigationes, 30(1), 3–26. https://doi.org/10.1075/li.30.1.03nad

100.

Nanni

Fallin

(2021). Earth, wind, (water), and fire: Measuring epistemic boundaries in climate change research. Poetics, 88, 101573. https://doi.org/10.1016/j.poetic.2021.101573

101. Nelson L. K. (2017). Computational grounded theory: A methodological framework. Sociological Methods & Research, 49(1), 3–42. https://doi.org/10.1177/0049124117729703

102. Nelson L. K. (2021). Leveraging the alignment between machine learning and intersectionality: Using word embeddings to measure intersectional experiences of the nineteenth century U.S. South. Poetics, 88, 101539. https://doi.org/10.1016/j.poetic.2021.101539

103. Nelson L. K., Burk, Knudsen, McCall (2021). The future of coding: A comparison of hand-coding and three types of computer-assisted text analysis methods. Sociological Methods & Research, 50(1), 202–237. https://doi.org/10.1177/0049124118769114

104. Nguyen, Liakata, DeDeo, Eisenstein, Mimno D. M., Tromble, Winters (2020). How we do things with words: Analyzing text as social and cultural data. Frontiers in Artificial Intelligence, 3, 1–14. https://doi.org/10.3389/frai.2020.00062

105. Noguti (2016). Post language and user engagement in online content communities. European Journal of Marketing, 50(5/6), 695–723. https://doi.org/10.1108/EJM-12-2014-0785

106. Oberg, Korff V. P., Powell W. W. (2017). Culture and connectivity intertwined: Visualizing organizational fields as relational structures and meaning systems. Research in the Sociology of Organizations, 53(1), 17–47. https://doi.org/10.1108/S0733-558X20170000053001

107. Pandey S. K. (2019). Applying natural language processing capabilities in computerized textual analysis to measure organizational culture. Organizational Research Methods, 22(3), 765–797. https://doi.org/10.1177/1094428117745648

108. Pennington, Socher, Manning C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

109. Pfeffer, Salancik G. R. (1978). The external control of organizations: A resource dependence perspective. Harper & Row.

110. Pollach (2012). Taming textual data: The contribution of corpus linguistics to computer-aided text analysis. Organizational Research Methods, 15(2), 263–287. https://doi.org/10.1177/1094428111417451

111. Pollock T. G., Lashley, Rindova V. P., Han J. H. (2019). Which of these things are not like the others? Comparing the rational, emotional, and moral aspects of reputation, status, celebrity, and stigma. Academy of Management Annals, 13(2), 444–478. https://doi.org/10.5465/annals.2017.0086

112. Pope (2015). Why firms participate in the global corporate social responsibility initiatives, 2000–2010. In Tsutsui, Lim (Eds.), Corporate social responsibility in a globalizing world (pp. 251–285). Cambridge University Press.

113. Pope, Lim (2020). The governance divide in global corporate responsibility: The global structuration of reporting and certification frameworks, 1998–2017. Organization Studies, 41(6), 821–854. https://doi.org/10.1177/0170840619830131

114. Poschmann, Goldenstein (2022). Disambiguating and specifying social actors in big data: Using Wikipedia as a data source for demographic information. Sociological Methods & Research, 51(2), 887–925. https://doi.org/10.1177/0049124119882481

115. Ranganathan, Ghosh, Rosenkopf (2018). Competition–cooperation interplay during multifirm technology coordination: The effect of firm heterogeneity on conflict and consensus in a technology standards organization. Strategic Management Journal, 39(12), 3193–3221. https://doi.org/10.1002/smj.2786

116. Rona-Tas, Cornuéjols, Blanchemanche, Duroy, Martin (2019). Enlisting supervised machine learning in mapping scientific uncertainty expressed in food risk analysis. Sociological Methods & Research, 48(3), 608–641. https://doi.org/10.1177/0049124117729701

117. Rule, Cointet J.-P., Bearman P. S. (2015). Lexical shifts, substantive changes, and continuity in State of the Union discourse, 1790–2014. Proceedings of the National Academy of Sciences, 112(35), 10837–10844. https://doi.org/10.1073/pnas.1512221112

118. Sailunaz, Dhaliwal, Rokne, Alhajj (2018). Emotion detection from text and speech: A survey. Social Network Analysis and Mining, 8(1), 1–26. https://doi.org/10.1007/s13278-017-0479-5

119. Salton (1965). Progress in automatic information retrieval. IEEE Spectrum, 2(8), 90–103. https://doi.org/10.1109/MSPEC.1965.6501325

120. Salton, Wong, Yang C.-S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620. https://doi.org/10.1145/361219.361220

121. Schmid (2008). Tokenizing and part-of-speech tagging. In Lüdeling, Kytö, McEnery (Eds.), Corpus linguistics: An international handbook (pp. 527–551). de Gruyter.

122. Schütze (1998). Automatic word sense discrimination. Computational Linguistics, 24(1), 97–123.

123. Schwens, Zapkau F. B., Bierwerth, Isidor, Knight G. A., Kabst (2018). International entrepreneurship: A meta-analysis on the internationalization and performance relationship. Entrepreneurship: Theory and Practice, 42(5), 734–768. https://doi.org/10.1177/1042258718795346

124. Scott W. R., Davis G. F. (2007). Organizations and organizing: Rational, natural, and open system perspectives. Pearson Education.

125. Sedoc, Büchel, Nachmany, Buffone, Ungar (2020). Learning word ratings for empathy and distress from document-level user responses. Proceedings of the 12th Conference on Language Resources and Evaluation, 1664–1673.

126. Short J. C., Broberg J. C., Brigham K. H. (2010). Construct validation using computer-aided text analysis (CATA). Organizational Research Methods, 13(2), 320–347. https://doi.org/10.1177/1094428109335949

127. Song, Wang, Zhu (2018). Sustainable strategy for corporate governance based on the sentiment analysis of financial reports with CSR. Financial Innovation, 4(2), 1–14. https://doi.org/10.1186/s40854-018-0086-0

128. Štajner, Glavaš, Ponzetto S. P., Stuckenschmidt (2017). Domain adaptation for automatic detection of speculative sentences. Proceedings of the IEEE 11th International Conference on Semantic Computing, 164–171.

129. Stoltz D. S., Taylor M. A. (2021). Cultural cartography with word embeddings. Poetics, 88, 101567. https://doi.org/10.1016/j.poetic.2021.101567

130. Sudhahar, de Fazio, Franzosi, Cristianini (2013). Network analysis of narrative content in large corpora. Natural Language Engineering, 21(1), 81–112. https://doi.org/10.1017/S1351324913000247

131. Taboada, Brooke, Tofiloski, Voll, Stede (2011). Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2), 267–307. https://doi.org/10.1162/COLI_a_00049

132. Tonidandel, King E. B., Cortina J. M. (2018). Big data methods: Leveraging modern data analytic techniques to build organizational science. Organizational Research Methods, 21(3), 525–547. https://doi.org/10.1177/1094428116677299

133. Turney P. D., Pantel (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), 141–188. https://doi.org/10.1613/jair.2934

134. Van Atteveldt, Kleinnijenhuis, Ruigrok (2008). O. Political Analysis, 16(4), 428–446. https://doi.org/10.1093/pan/mpn006

135. Vayansky, Kumar S. A. P. (2020). A review of topic modeling methods. Information Systems, 94(1), 101582. https://doi.org/10.1016/j.is.2020.101582

136. Voutilainen (2004). Part of speech tagging. In Mitkov (Ed.), The Oxford handbook of computational linguistics (pp. 219–232). Oxford University Press.

137. Wang, Zhou, Jiang (2020). A survey of word embeddings based on deep learning. Computing, 102(3), 717–740. https://doi.org/10.1007/s00607-019-00768-7

138. Weick K. E. (1995). Sensemaking in organizations. Sage Publications.

139. Wenzel, Van Quaquebeke (2018). The double-edged sword of big data in organizational and management research: A review of opportunities and risks. Organizational Research Methods, 21(3), 548–591. https://doi.org/10.1177/1094428117718627

140. Wickert, Scherer A. G., Spence L. J. (2016). Walking and talking corporate social responsibility: Implications of firm size and organizational cost. Journal of Management Studies, 53(7), 1169–1196. https://doi.org/10.1111/joms.12209

141. Younger, Fisher (2020). The exemplar enigma: New venture image formation in an emergent organizational category. Journal of Business Venturing, 35(1), 1–35. https://doi.org/10.1016/j.jbusvent.2018.09.002

142. Zayed, McCrae J. P., Buitelaar (2020). Figure me out: A gold standard dataset for metaphor interpretation. Proceedings of the 12th International Conference on Language Resources and Evaluation, 5810–5819.

143. Zhang, Wang, Liu (2018). Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1253. https://doi.org/10.1002/widm.1253

144. Zhang, Teng (2021). Natural language processing: A machine learning perspective. Cambridge University Press.

145. Zietsma, Groenewegen, Logue D. M., Hinings C. R. (2017). Field or fields? Building the scaffolding for cumulation of research on institutional fields. Academy of Management Annals, 11(1), 391–450. https://doi.org/10.5465/annals.2014.0052

146. Zuckerman E. W. (1999). The categorical imperative: Securities analysts and the illegitimacy discount. American Journal of Sociology, 104(5), 1398–1438. https://doi.org/10.1086/210178

147. Zuckerman E. W. (2016). Optimal distinctiveness revisited: An integrative framework for understanding the balance between differentiation and conformity in individual and organizational identities. In M. G. Pratt, M. Schultz, B. E. Ashforth, & D. Ravasi (Eds.), The Oxford handbook of organizational identity (pp. 183–199). Oxford University Press.

A Vector Space Approach for Measuring Relationality and Multidimensionality of Meaning in Large Text Collections
