Abstract
The availability of big data has significantly influenced the possibilities and methodological choices for conducting large-scale behavioural and social science research. In the context of qualitative data analysis, a major challenge is that conventional methods require intensive manual labour and are often impractical to apply to large datasets. One effective way to address this issue is by integrating emerging computational methods to overcome scalability limitations. However, a critical concern for researchers is the trustworthiness of results when machine learning and natural language processing tools are used to analyse such data. We argue that confidence in the credibility and robustness of results depends on adopting a ‘human-in-the-loop’ methodology that provides researchers with control over the analytical process, while retaining the benefits of using machine learning and natural language processing. With this in mind, we propose a novel methodological framework for computational grounded theory that supports the analysis of large qualitative datasets, while maintaining the rigour of established grounded theory methodologies. To illustrate the framework’s value, we present the results of testing it on a dataset collected from Reddit in a study aimed at understanding tutors’ experiences in the gig economy.
Introduction and background
Recent advances in machine learning (ML) and natural language processing (NLP) models have significantly enhanced their performance in tasks such as question-answering and text summarisation. Sometimes, these models even outperform humans (Devlin et al., 2018; He et al., 2020). This, coupled with the proliferation of big data facilitated by the widespread adoption of digital technologies, has led to an abundance of digital traces that can be collected from various sources and analysed to gain insights into human behaviour (Lazer et al., 2009; Salganik, 2019). It enables researchers to investigate new phenomena and reach a new or wider range of research subjects (Agarwal and Dhar, 2014; Johnson et al., 2019). Yet, while advanced ML and NLP are superior for some tasks, human researchers possess qualities such as socially embedded sense-making, reasoning and contextual awareness (Legg and Hutter, 2007) that isolated and individual models trained solely on tokenised input cannot fully replace. Although recent developments have improved context handling, they still fall short of capturing the nuanced understanding inherent to human cognition (Wang et al., 2024). Therefore, there is a need to combine the strengths of both so they complement one another when analysing large qualitative datasets (Herrmann, 2020; Peeters et al., 2021).
Qualitative research methods such as grounded theory (GT) (Glaser and Strauss, 1967) are widely used in the humanities and social sciences. Hand-coding is central to these methods, which involve systematic data segmentation, annotation, grouping and linking. The goal is to examine unstructured text to identify patterns and acquire a thorough understanding of the data (Chen et al., 2018; Rietz and Maedche, 2020). However, since such a procedure requires considerable manual labour and hours of dedicated effort, analysing big datasets reliably can be extremely laborious, if not impossible (Crowston et al., 2012; Müller et al., 2016). To address this, some researchers use strategies such as subsampling (Attard and Coulson, 2012; Kazmer et al., 2014), but this may fail to take advantage of the richness of the full dataset, as a substantial portion of it will remain unexplored (Chen et al., 2018; Jiang et al., 2021). This limitation has prompted researchers to examine ways of using computational techniques to help minimise human coding effort. In a study by Baumer et al. (2017), two researchers independently applied topic modelling (TM) and GT to the same dataset and compared their results. They found that many GT codes were well represented in the TM results; however, the latter had a lower level of abstraction, comparable to middle-level GT codes. They therefore concluded that the two approaches work most effectively when combined. Similarly, Bakharia (2019), Crowston et al. (2012) and Ruis and Lee (2021) argue that while TM may automate parts of the coding process, more work is needed to acquire a deeper qualitative insight into the social phenomena under study. The effort to achieve this has given rise to what is known as computational grounded theory (CGT). Researchers suggest it is a promising methodology, but some reservations remain (Gorra, 2019; Procter, 2022).
Several attempts have been made to develop CGT frameworks, the most prominent being that proposed by Nelson (2020), which has subsequently been quite widely adopted (Ojo and Rizun, 2021; Williams and Greenhalgh, 2022). The framework has three steps. The first detects patterns in the data using structural TM (STM); the highest-weighted (top-N) terms (i.e. words) per topic are used to examine the model and determine whether it produces semantically coherent topics. The second is pattern refinement, where the researcher returns to the classified data to add interpretation through qualitative deep reading of the documents in each TM topic. The final step, pattern confirmation, includes deciding which computational techniques are most suitable for verifying patterns found in the data against the complete dataset. Odacioglu et al. (2022) propose a four-step CGT framework that starts by applying latent Dirichlet allocation (LDA) TM, followed by expert coding, in which topics are labelled based on their associated terms and rated for their relevance to the study. The researchers also engage in applying the codes and writing analytical memos. The third step, focused coding, involves understanding codes, identifying patterns and seeking similarities between them by reading the associated top-N documents. Then, after constantly comparing codes and discovering links, categories are formed. The final step is theory building, which entails aggregating core categories into one higher-level theme that forms a theory.
Mangiò et al.’s (2023) framework starts with experts assigning labels to STM topics based on an examination of the top-N terms and a qualitative interpretation of the top-N documents per topic. The model’s performance is then evaluated by having two researchers manually review a sample of documents for each topic. Following extensive discussions, the topics are grouped into more comprehensive higher-level categories based on content similarity. The researchers then utilise the STM for validation, analysing how topics changed over time and comparing these changes with real-world events. Furthermore, two authors ‘internally validated’ topics by manually coding a sample of documents per topic, ensuring that the model discriminated correctly. In the final step, an interpretative GT analysis of the topics is conducted: a hand-picked sample of the most representative top-N documents is independently hand-coded in three separate stages. The first stage codes the raw data from the topics; the second arranges the resulting first-order codes into second-order codes, which helps to provide more theoretical insights; and the final stage further aggregates these concepts into overarching categories.
While these earlier frameworks have clearly attempted to apply GT principles when utilising text mining techniques, some have oversimplified the process by over-relying on the computational tools’ results for theory development. Yu’s (2019) framework is an example: it starts by applying LDA and then labelling the topics based on a reading of the top-N terms. The LDA topical terms are treated as open coding and the LDA topic labels as axial coding. The LDAvis tool (Sievert and Shirley, 2014) is then used to visualise the LDA topics; based on the topics’ relationships, Yu found that they comprised three independent areas and therefore treated these areas as different categories. By further abstracting and summarising the topical terms contained in these categories, selective codes were obtained, each corresponding to a category.
In summary, the limitations of these frameworks include the absence of an initial stage of data exploration, which is critical given the amount and complexity of the data a social scientist will deal with in big data research. Additionally, no measures were taken to externally verify the validity of the TM models’ results. Aside from this, and more importantly, certain frameworks failed to apply some of GT’s core principles, which is critical when claiming to utilise the methodology (Birks et al., 2013). These failings include not appropriately implementing the constant comparison analysis method, a lack of in-depth engagement with the data, and inadequately addressing the procedures for coding and grouping codes to arrive at the higher-level conceptual categories that lead to the emergence of the theory. Finally, across all the existing frameworks there is a noticeable absence of theoretical sampling, a crucial step for filling gaps in the discovered core categories and contributing to the quality of analysis in big data research. Table 1 presents a summary of the discussed CGT frameworks.
Table 1. Summary of the discussed CGT frameworks.
CGT: computational grounded theory; TM: topic modelling; STM: structural topic modelling; LDA: latent Dirichlet allocation.
Grounded theory
GT is an inductive methodology for analysing qualitative data, originally developed by Glaser and Strauss (1967). It was introduced to address a limitation of the existing methods of social research at that time, which primarily focused on theory verification, while the initial step of discovering relevant concepts and hypotheses inductively from data was usually overlooked. GT is ‘simply a set of integrated conceptual hypotheses systematically generated to produce an inductive theory about a substantive area’ (Glaser and Holton, 2004: 3). This theory reveals the main concern of the research subjects and how they resolve it, that is, their behavioural response. It is called ‘grounded’ because it is grounded in the data: the research subjects’ own explanations or interpretations (Corbin and Strauss, 1990).
The analysis begins with open coding, in which the researcher codes the data line-by-line, trying to conceptualise the patterns they find, since ‘the essential relationship between data and theory is a conceptual code’ (Glaser and Holton, 2004: 12). Initial codes are comparative and provisional (Charmaz, 2006), and it is only through constant comparison – the process of comparing incidents (i.e. the empirical data, the indicators of a category or code) with incidents, incidents with codes, and codes with codes – that categories, followed by higher-level categories, start to appear (Charmaz, 2006). A category represents a group of related codes that share common characteristics or patterns. These codes (or sub-patterns) constitute the ‘properties’ or ‘dimensions’ of that category (Glaser, 2002). Writing analytical memos is another continuous process that elevates data to concepts and an important component of quality (Glaser and Holton, 2004). As the constant comparison continues, core categories begin to emerge. A core category is considered central and may represent the main concern; it appears frequently in the data and is meaningfully related to other categories (Glaser and Strauss, 1967). Then comes selective coding, which entails limiting subsequent data collection and coding to variables related only to the core categories (Glaser and Holton, 2004).
As the analysis progresses, the researcher begins to identify gaps for which theoretical sampling is required to saturate the emerging core categories and provide the remaining necessary insights; here they must make decisions about which sources of data will meet the analytical needs (Birks and Mills, 2015). Only core categories with the most explanatory power should be saturated (Glaser and Strauss, 1967). Then, the final stage of theoretical coding begins, which is defined by Glaser (1978) as conceptualising ‘how the substantive codes may relate to each other as hypotheses to be integrated into a theory’ (p. 72).
Enhancing trustworthiness in CGT research
As big data provides new opportunities for social research, establishing trust in the research tools and methodology and thus in the results is of paramount importance. Some conventional researchers have raised concerns about the trustworthiness of ML models in conducting qualitative analyses (Drouhard et al., 2017; Eickhoff and Neuss, 2017; Gao et al., 2023; Jiang et al., 2021), and the credibility of results due to an over-reliance on computational tools (Dellermann et al., 2019; Nguyen et al., 2021; Ramage et al., 2009). This can be attributed to some weak examples of utilising these tools to explain and build theories about human behaviour without grounding them in social scientific foundations and accepting the results without proper validation. Therefore, we argue that when proposing frameworks to automate GT for big data analysis, trust can first be established by assuring social scientists that the core principles of their well-established methodology are applied and taken into account to maintain its rigour.
Furthermore, trustworthiness, defined as confidence in the quality of a research study, including its methods, data and interpretation (Connelly, 2016), is vital for ensuring credibility (Lincoln, 1985; Nguyen et al., 2021; Shenton, 2004). Here, trust in computational methods must be assessed rather than assumed, given that high statistical performance in validating and evaluating ML models does not necessarily imply that their results are interpretable or practically effective (Chang et al., 2009; Grimmer and King, 2011). Thus, many approaches have been proposed to incorporate human validation and evaluation (Chang et al., 2009; Doogan and Buntine, 2021; Elangovan et al., 2024; Mimno et al., 2011; Ying et al., 2022; Zhang et al., 2025). Within the context of CGT, it is important to ensure that the incorporated techniques effectively replicate the role of the researcher in big data analysis. We argue that when using TM to automate the coding process, researchers should externally validate the resulting topics to ensure the model’s ability to detect reliable patterns in the data, and thereby the reliability of the automated coding process. Researchers then play an important role in evaluating the quality of the chosen models’ outputs. As Grimmer et al. (2021) argue, rather than relying fully on fit statistics, it is important to obtain feedback from human researchers about the quality of the generated topics.
Additionally, the use of computational models in CGT can significantly contribute to fostering trust in the resulting GT by enhancing objectivity and reducing human bias (Shenton, 2004). This approach aligns with the concept of ‘mechanical objectivity’ as described by Porter and Haggerty (1997) and Daston and Galison (2021), emphasising reliance on machines and the trust in numbers to produce unbiased results. At the same time, trust in the developed GT is further reinforced by preserving the researcher’s role in qualitatively interpreting the model’s outputs to derive meaningful theoretical insights. This also reflects Daston and Galison’s (2021) notion of ‘trained judgment’ where objectivity arises from the interplay between mechanical processes and human expertise (Benbouzid, 2023). Personal biases in the interpretation stage can be further mitigated by adhering to the constant comparison process (Glaser and Strauss, 1967), central to conventional GT. Preserving the researcher’s interpretative role ensures that automation does not diminish the value of qualitative insights but instead enhances both trust and the researcher’s ability to engage with big datasets. This is particularly important for elevating the conceptual depth of codes generated by TM and for facilitating the development of higher-level core categories – crucial steps in constructing a GT.
This active researcher engagement in achieving this balance is commonly referred to as the ‘human-in-the-loop’ (HITL) approach. In this approach, computational tools are used to support the analysis, while the researcher retains an active role throughout the process (Kim and Pardo, 2018). Adopting HITL enables social scientists to address issues of trust and credibility when using ML methods by allowing them to intervene in validating, evaluating and interpreting computationally generated results. This also ensures that researchers maintain a sense of closeness and control over their qualitative analyses, which might otherwise be lost in fully automated processes (Jiang et al., 2021).
Thus, this article introduces a novel methodological framework for CGT that incorporates the HITL approach and addresses the limitations of existing CGT frameworks in order to enhance trust. Unlike researchers who emphasise algorithmic failure and ‘data friction’ as a means to inspire creative insights and identify significant cases for interpretation (Madsen et al., 2023; Munk et al., 2022; Rettberg, 2022), our framework takes a different stance. The primary objective is to ensure that the CGT framework and its associated techniques can automate key aspects of the traditional methodology, making it scalable and suitable for big data contexts without compromising the foundational principles of GT.
Methodology
Computational grounded theory: A new approach
CGT frameworks were developed as a way of dealing with large volumes of data when conducting a qualitative analysis. However, preserving the fundamental principles of traditional GT to the greatest extent possible is essential for a claim to have followed the methodology (Birks et al., 2013). Our three-phase framework (see Figure 1) strives to achieve this goal. Using specific computational tools enables us to implement and facilitate the application of GT principles to big data research beyond the coding process, which is the main focus of existing frameworks.

Figure 1. The HITL computational grounded theory (CGT) framework chart.
HITL ensures the broader social and contextual validity of the topics discovered using unsupervised TM models. The involvement of the researcher in ongoing interaction with the results generated by computational tools enhances the reliability of the analysis. The framework also provides a detailed description of how to conduct a human-centred interpretation of the computationally produced results effectively. This will further ensure adherence to GT principles, which will lead to an improved theoretical understanding of the substantive field of study and enhance the trustworthiness of the results.
Phase one: Data exploration
In this phase, the researcher begins by familiarising themselves with the dataset. Given that in big data research the data are not collected by engaging directly with the research subjects, gaining an understanding of the data, its structure and its complexity is crucial. While Goni et al. (2022) suggest reading subsets of data for this purpose, we propose coding a random subset of the data using GT’s initial coding steps. This will result in a list of codes with precise descriptions of the content of each one. These codes should then be used to validate the patterns explored computationally using unsupervised ML methods (i.e. ones that learn without training data) when a larger volume of data is examined in the next step. This systematic approach provides a valuable method of validation. Even though the original GT methodology encourages coming to the data with a ‘blank slate’, that is, not bringing any assumptions about the data into the coding process, in CGT we depend mainly on ML models to code the data, which necessitates ensuring that these models substitute for human researchers effectively and appropriately. Notably, previous frameworks such as Nelson (2020) and Odacioglu et al. (2022) did not incorporate this step.
Next, using LDA TM (see Figure 1), the entire dataset is explored to identify the main patterns. Upon applying LDA, each topic should be represented by a list of the top-20 terms and the top-N documents (N is decided by the researcher based on the nature of their data). To label topics, we emphasise the necessity of reading topical documents, since close reading can provide better qualitative insights into a topic’s content (Brookes and McEnery, 2019). A similar approach is taken by Mangiò et al. (2023) and Nelson (2020), while others used only topical terms to label topics, which has been shown not to reflect topic content accurately. Following this, HITL validation is crucial to ensure that the TM output effectively replaces human reading and judgement on a large dataset, avoiding the blind use of unsupervised, black-boxed ML algorithms (DiMaggio, 2015; Grimmer and Stewart, 2013; Korenčić et al., 2018). Thus, concurrent validation (Boussalis and Coan, 2016) is used, where the researcher compares the LDA topics with the generated GT codes. Due to LDA’s scalability and the size of the data included in this step, new topics might be discovered; this is to be expected and should not impact the validity of the analysis. Here, a high level of alignment between the GT and LDA results will ensure validity. The similarity between the two methods is what enables validation, as both are exploratory, data-driven and iterative (Muller et al., 2016). Then, the resulting GT codes from step one and the LDA topics are both considered; this includes not only the topics that overlap but also those unique to either GT or LDA, which is essential to fully leverage the power of ML models in uncovering patterns in big data.
One of the advantages offered by LDA is the availability of terms associated with each identified topic, an important resource that our framework utilises. In the term extraction step, the researcher refines and curates the set of terms for each of the previously labelled LDA topics; these will form the basis for the data modelling phase. In essence, the term extraction step serves as a critical link between topic identification and modelling. For topics identified in both GT and LDA, or only in LDA, the top-20 terms are selected, and any terms deemed irrelevant to the topic being discussed should be excluded from the list based on the researcher’s understanding of the topic. This ensures that the terms associated with each topic reflect its content. Secondly, terms should be proposed for topics that appeared only in GT.
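For illustration, a minimal sketch of this phase using the gensim library is given below. The corpus variable `tokenised_docs` (a list of token lists) and the parameter settings are our own illustrative assumptions rather than the exact configuration a given study would use.

```python
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

# `tokenised_docs` is a hypothetical pre-processed corpus: a list of token lists
dictionary = corpora.Dictionary(tokenised_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=17,
               passes=10, random_state=42)

# Top-20 terms per topic: the starting point for labelling and term extraction
for topic_id in range(lda.num_topics):
    terms = [term for term, _ in lda.show_topic(topic_id, topn=20)]
    print(topic_id, terms)

# Top-N documents per topic: rank documents by their per-topic probability,
# so the researcher can close-read the most representative posts
doc_topics = np.zeros((len(corpus), lda.num_topics))
for i, bow in enumerate(corpus):
    for t, p in lda.get_document_topics(bow, minimum_probability=0.0):
        doc_topics[i, t] = p
top_docs = {t: np.argsort(doc_topics[:, t])[::-1][:5]
            for t in range(lda.num_topics)}
```

Irrelevant terms can then be dropped by hand from each top-20 list, as described above.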
Phase two: Data modelling
After identifying and validating the main topics of discussion, the query-driven topic model (QDTM) (Fang et al., 2021) is used. It is a semi-supervised TM approach designed to return topics relevant to queries input by the researcher, organised in a tree-like structure. Here, the researcher inputs the terms extracted in the previous step, a set of query terms for each topic, into the model. QDTM then applies term expansion techniques – frequency-based extraction, KL-divergence-based extraction (KLD) and a relevance model with word embeddings – to the input queries, producing a set of concept terms to enhance the topic modelling. These techniques focus on retrieval and relation, rather than generation, in identifying terms. The terms are then fed into a framework based on a variation of the hierarchical Dirichlet process (HDP) to form main topics and a second layer of subtopics, which is considered an effective way to organise and navigate large-scale data (Dumais and Chen, 2000; Johnson, 1967). The HDP is a nonparametric Bayesian model that automatically infers the number of topics in a corpus (Teh et al., 2004).
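QDTM’s implementation is described in Fang et al. (2021). As a rough illustration of the flavour of one of its expansion techniques – the relevance model with word embeddings – the sketch below expands a set of query terms with their nearest neighbours in an embedding space trained on the corpus. The function and its parameters are our own illustrative stand-ins, not QDTM’s actual interface.

```python
from gensim.models import Word2Vec

# Train embeddings on the (hypothetical) tokenised corpus from the previous sketch
w2v = Word2Vec(sentences=tokenised_docs, vector_size=100,
               window=5, min_count=5, workers=4)

def expand_query(query_terms, topn=10):
    """Illustrative expansion: add each query term's nearest embedding neighbours."""
    expanded = set(query_terms)
    for term in query_terms:
        if term in w2v.wv:
            expanded.update(w for w, _ in w2v.wv.most_similar(term, topn=topn))
    return sorted(expanded)

# e.g. a curated query for a hypothetical 'payments' topic
print(expand_query(["payment", "payout", "fee"]))
```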
In CGT, this hierarchically structured TM is particularly helpful in ensuring the effective application of GT principles and in maximising the researcher’s ability to extract the full analytical value from large datasets. First, researchers are presented with groups of related main topics and subtopics together; this greatly supports the intermediate coding steps (Birks and Mills, 2015) and allows for more effective application of the ongoing process of constant comparison during analysis. It also enables inductive reasoning and a focused examination of related documents (Fang et al., 2021). Second, in traditional GT research settings, it is the researcher’s responsibility to revisit the field and collect specific data when encountering a code or category they do not fully comprehend. In the proposed framework, QDTM automates or partially automates this saturation process. In most cases, only terms automatically generated by LDA are used, with no additional terms added, to reduce human bias. The input queries are thus expanded into a comprehensive keyword list, making the output topics more relevant. The term expansion techniques employed are highly valuable as they re-examine the complete dataset and identify the various variations and dimensions within each main topic (Carlsen and Ralund, 2022). By revisiting the corpus to capture overlooked aspects – such as low-frequency terms or niche topics that were not initially captured due to their infrequent occurrence or other factors – QDTM presents the researcher with enough variations and cases for each main topic. This significantly reduces the need for researchers to engage in ‘extensive’ theoretical sampling to saturate their understanding of the identified codes or categories, although there may still be instances where such sampling is necessary. Finally, it is important that the number of subtopics is determined automatically: in real-world large datasets topics are not evenly distributed, and some are more frequent and have a wider range of variation than others. This feature can help researchers gain insights into the main concerns of research subjects and the core categories, since one of the indicators of a core category is its frequency of occurrence in the data (Glaser and Strauss, 1967).
These distinctive features of QDTM go beyond automating the coding process in large datasets. Instead, they actively support the application of other core principles of GT in big data research, making it a more suitable choice compared to single-layered TM models used in existing CGT frameworks.
Human evaluation of QDTM topics
The aim of this step is to ensure that the QDTM is capable of generating high-quality topics. This is critical, as it will enhance the researcher’s ability to interpret these topics more efficiently and confidently at a later stage in the framework. To perform the evaluation tasks reliably, two or more annotators must independently annotate the data (Pustejovsky and Stubbs, 2012). This step further demonstrates the active involvement of the HITL in our framework and how this can contribute to increased trust in the final results. We emphasise the importance of evaluating topics’ intrinsic semantic quality. This is typically assessed using statistical measures such as perplexity; however, negative correlations between these measures and human judgements of topic quality have been observed (Chang et al., 2009). Hence, we turn to HITL evaluation as an alternative – and superior – method. Typically, tasks for human evaluators (annotators) might include examining the quality of the top-N terms for each topic (Mimno et al., 2011) or identifying the intruder within these terms (Chang et al., 2009). However, despite the popularity of measuring coherence based on topical terms, this approach is criticised for only partially reflecting topic content and quality (Brookes and McEnery, 2019; Grimmer and King, 2011; Korenčić et al., 2018).
In our framework, human evaluation is used to assess the QDTM results and is based primarily on topic documents, although a list of terms per topic is provided for use in specific cases. To effectively evaluate and annotate the QDTM’s topics, four distinct tasks should be undertaken: first, rate the quality (‘coherence’) of each topic; second, identify the issues with lower-quality topics; and third, label the topics. Furthermore, since topics are hierarchically classified and these tasks only partly reflect output quality (Belyy et al., 2018; Koltcov et al., 2021), the relationship between main topics and subtopics should also be evaluated. This step will enhance the researcher’s ability to interpret these topics more efficiently and confidently at a later stage. The complete annotation guidelines are presented in Supplemental Appendix A.
Phase three: Human-centred interpretation
This final phase of the proposed CGT framework is dedicated to the in-depth, interpretative analysis of the topics produced by the QDTM. Based on the human evaluation conducted in the previous phase, the researcher will be presented with N groups of related topics, each representing a main topic and a number of subtopics. In the framework, the main topics are considered ‘focused codes’, while the subtopics are considered ‘sub-codes’ (see Figure 1). The labels produced by the annotators for each code are considered initial labels.
The interpretative analysis starts by coding the representative documents for each topic line-by-line. This process of ‘hand coding’ enables researchers to ‘zoom in’ on the studied phenomena for appropriate theoretical interpretations (Aranda et al., 2021; Mangiò et al., 2023) and increases confidence that nothing has been overlooked (Glaser and Strauss, 1967; Glaser and Holton, 2004). As instructed by GT’s originators, hand-coding is vital for stimulating ideas, writing analytical memos and properly applying comparative analysis to create higher-level categories. Here, since computational tools are used to classify the data and annotators to evaluate the resulting topics (codes), researchers are required to do their own coding. A similar approach is taken by Mangiò et al. (2023), whereas in Nelson (2020) and Odacioglu et al. (2022) interpretations are added to the analysis through a simple reading of the documents.
At this point, the length of the documents will influence the number that can feasibly be manually coded and analysed (Amaya et al., 2021). Researchers should try to strike a balance between the need for thorough analysis to draw meaningful insights and the practical constraints on the resources and time available for manual analysis.
The analysis begins by coding the top-N posts of the first focused code. Constant comparison is applied here by comparing incidents in the data against the code label and with each other. During this process, more sub-codes might be created, and the researcher should raise the conceptual level of the annotators’ labels, since these are descriptive of topic content. After analysing the first focused code, the same process is applied to the first sub-code from the same group. The resulting codes should then be compared; the researcher should look for patterns, links or any relationship between the focused code and the sub-code. This may reveal that one code (whether focused or sub-code) is part of the other, or create a higher-level category that connects them. Here, the researcher should allow the analysis to take its course and ensure that memos are written on a regular basis. A similar process is applied for each of the following sub-codes until the analysis of the first group is completed. The researcher then moves on to the next group, until all are analysed.
The resulting codes and categories from each group are then compared to find relationships and to construct higher-level categories. At this point, the core categories and the main concern of the subjects under study should be determined and confirmed. Then, the researcher should engage in theoretical sampling to saturate gaps found in core categories (see next section). At this stage, the theoretical coding step commences, which is the process of defining possible relationships between the developed fully saturated core categories. Glaser (1978) explains that ‘Theoretical codes implicitly conceptualise how the substantive codes will relate to each other as interrelated multivariate hypotheses in accounting for resolving the main concern’ (p. 163). As opposed to the coding in the previous steps, which is data-driven, here we deal with ideas and perspectives which will help the researcher integrate categories and data into a theory.
Supporting theoretical sampling
Theoretical sampling significantly improves the quality of the analysis by allowing researchers to collect additional, selective data to fill in emerging categories and address questions raised during the analysis (Charmaz, 2012), and it is an often-overlooked step in big data research. In conventional GT studies, it is typically performed by following up with participants, observing in new settings and so on. However, in big data research, returning to data collection is not always feasible, for example, when the complete dataset is collected before analysis begins (Birks et al., 2013). According to Glaser and Strauss (1967), theoretical sampling is open and flexible: ‘Theoretical sampling for saturation of a category allows a multi-faceted investigation, in which there are no limits to the techniques of data collection, the way they are used, or the types of data acquired’ (p. 65). Therefore, in the proposed CGT framework, we argue that theoretical sampling can be implemented in big data research by applying different computational techniques to the same sufficiently large and rich dataset. For example, the researcher might use sentiment or time series analysis, depending on the questions raised during the analysis, to obtain quite different but needed results. This means conducting a multi-method analysis of the same dataset previously analysed using TM. Here, researchers have to determine the most suitable NLP technique based on their needs, meaning that sampling methods will vary across studies. As Charmaz (2014) also asserts, theoretical sampling is a strategy that researchers employ and adapt to fit their specific study needs, not a standard data collection method.
Case study
To illustrate the implementation and usefulness of the proposed framework, a case study of gig economy tutors was conducted, with the ultimate goal of developing a substantive theory that could help explain their experiences.
Global economy digitalisation is associated with the rise of the gig economy, in which long-term employment is replaced by one-off transactions based on performing single tasks. This non-standard work, mediated through online platforms, has permitted flexible work arrangements and links independent workers with customers globally. It provides greater job autonomy and expanded employment opportunities (Clark, 2021). Despite that, the shift from market for jobs to market for tasks has also resulted in the loss of protections and rights typically associated with long-term employment (Drahokoupil and Vandaele, 2021). It is linked with job insecurity, income instability and low wages. This leads workers to work longer hours to achieve their earnings goals (Clark, 2021; Duggan et al., 2022). Moreover, there is an ethical concern arising due to, for example, a lack of transparency in platform operations (Jarrahi and Sutherland, 2019), which may create power imbalances between workers and platforms (Koutsimpogiorgos et al., 2020).
The nature of tasks performed in the gig economy has a significant impact on the challenges encountered and the overall experience of workers. This case study examined the experiences of an under-studied group of workers, tutors, where one-to-one online educational sessions are conducted, making them distinctive from other common forms of gig work, for example, in food delivery, transportation services or isolated knowledge-work tasks.
Case study data
CGT frameworks are generalisable to any source of textual big data. In our case study, we collected data from Reddit for illustration purposes and to test the framework. We examined Reddit and identified 18 relevant subreddits (see Table 2) in which our target population shares their experiences and asks questions. Posts were then collected from each of these subreddits using the Reddit API Wrapper.
Table 2. The subreddits included in the study, along with the number of posts.
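For illustration, data collection of this kind can be scripted with PRAW, the Python Reddit API Wrapper; the credentials below are placeholders, and the two subreddits named are examples from Table 2.

```python
import praw

# Placeholder credentials: a Reddit application must be registered to obtain these
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="gig-tutors-study/0.1")

posts = []
for name in ["GoGoKidTeach", "Palfish"]:  # two of the 18 subreddits in Table 2
    for submission in reddit.subreddit(name).new(limit=None):
        posts.append({"subreddit": name,
                      "title": submission.title,
                      "text": submission.selftext,
                      "created_utc": submission.created_utc})
```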
Phase one
To apply the first step to the case study, given that our dataset consists of 18 subreddits, two relatively small subreddits were randomly selected, namely GoGoKidTeach and Palfish. This selection resulted in about 160 posts. The analysis started by coding the data line-by-line and, since GT is a comparative and iterative process, the resulting codes were compared to one another, leading to the development of higher-level codes that group related codes together. After applying this step to this subset of the data, a total of 15 codes were identified (see Table 3). Due to their abstract nature, two of these codes, namely ‘Sharing experiences and feelings’ and ‘Seeking and providing help and advice’, were excluded from the comparison with LDA in the next step. This decision was made because they are contextual and primarily reflect the underlying purpose of the posts; LDA, as a statistical model designed to identify concrete, topic-based patterns rather than abstract themes tied to emotional expressions or interpersonal interactions, is not expected to model them.
Table 3. List of codes derived from the initial exploration step.
Regarding the LDA process, two models, with K values of 13 and 17, were tested (Alqazlan et al., 2021). The value of K was determined firstly by following the approach of Quinn et al. (2010) and Grimmer (2010) in empirically examining the performance of LDA models with different numbers of topics. Then, because this is time-consuming, we used the Tmtoolkit Python package to compute and evaluate several models in parallel using state-of-the-art theoretical approaches; on this basis, 17 topics were eventually selected. The quality of the topics was evaluated, and labels assigned, by examining the top-5 documents per topic.
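As a simplified, sequential stand-in for the parallel model evaluation that Tmtoolkit provides, candidate values of K can be compared on a coherence measure with gensim, reusing the `corpus`, `dictionary` and `tokenised_docs` objects from the earlier sketch:

```python
from gensim.models import LdaModel, CoherenceModel

# Compare candidate topic counts on the c_v coherence measure
for k in (13, 17):
    lda_k = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                     passes=10, random_state=42)
    cm = CoherenceModel(model=lda_k, texts=tokenised_docs,
                        dictionary=dictionary, coherence="c_v")
    print(f"K={k}: c_v coherence = {cm.get_coherence():.3f}")
```

As argued earlier, such statistics complement rather than replace human reading of the topics.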
For validation, considering both LDA models (with 13 and 17 topics), it was found that they were collectively able to detect 12 of the topics identified by the GT analysis. Yet both models failed to model one topic present in the GT codes – ‘Covid-19-related discussions’. Conversely, only one topic – ‘Bank transfers and transaction fees’ – was clearly modelled in both LDA models but did not appear in the GT codes. Since the aim of this step was to validate and further explore the data, and only one new topic was found even after the number of topics was increased to 17, the aim was fulfilled and it was not necessary to examine further models. The obtained codes and topics were both considered, making the final output of this process 14 topics to be taken into account in the next phase. The term extraction results needed for applying QDTM are presented in Supplemental Appendix B.
Phase two
To run QDTM on the gig economy tutors dataset, the model was given a list of terms for each of the 14 topics identified at the term extraction step. After applying the pre-processing steps and training the model, vocabulary terms appearing in fewer than five documents and subtopics with very low prevalence (<0.2%) were excluded. Once we were satisfied with the resulting model, we removed duplicate posts that appeared in the top-10 posts of each topic, and removed identical subtopics within each topic, discarding those with the lower prevalence. This resulted in 76 topics (i.e. 14 main topics and 62 subtopics), each represented as its top-5 posts and top-10 terms for the next step.
The human evaluation task was conducted by three experienced journalists. This group was chosen because they had taken part in similar annotation tasks before, which helped to significantly reduce the training time. Moreover, journalists are generally well-acquainted with many different subjects, making it easier for them to understand the data. In total, each annotator spent about 15 h completing the annotation work. They followed guidelines developed by the research group through several rounds of revision and testing (see Supplemental Appendix A). Overall, the annotation was conducted in two stages: in the first, two main topics and their subtopics were annotated (12 topics in total); the second covered the remaining 12 main topics and their subtopics (64 topics in total). The inter-annotator agreement (IAA) was calculated for each stage. Four tasks were required: evaluating topic quality, identifying issues with topics, labelling topics and evaluating relationships between main topics and subtopics. Finally, the criteria for excluding topics were that main topics or subtopics were found to be incoherent, that subtopics were unrelated to their main topics, or that all three annotators disagreed on one or more tasks.
To measure the IAA of the human evaluation of topics, Fleiss’ kappa (Fleiss, 1971) was used (see Table 4). According to Landis and Koch (1977), all obtained scores represent ‘fair agreement’. However, we found that in 97% of the annotations, at least two of the three annotators agreed on all tasks. This indicated that the results were acceptable (Feinstein and Cicchetti, 1990) and that it was not necessary to take further steps to increase agreement. Moreover, task dependency may have contributed strongly to the obtained results, as disagreements in task one will usually result in disagreements in tasks two and four. For final annotation decisions, we used simple majority voting; in cases of complete disagreement among the annotators (only seven annotations, see Table 4), the topics were excluded. The final category counts after resolving disagreements are given in Table 5. Based on our exclusion criteria, 21 topics (27%) out of 76 were excluded, leaving a total of 55 topics – 14 main topics and 41 subtopics – to be considered in the next phase of the analysis.
Table 4. Inter-annotator agreement (IAA) results.
Table 5. Final category counts after resolving disagreement.
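For reference, Fleiss’ kappa for one annotation task can be computed with the statsmodels package; the ratings matrix below is illustrative rather than our actual annotation data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per topic, one column per annotator; values are category codes for a
# single task (e.g. 0 = low, 1 = medium, 2 = high topic quality) - illustrative only
ratings = np.array([[2, 2, 1],
                    [2, 2, 2],
                    [0, 1, 0]])  # ... one row for each of the 76 topics

table, _ = aggregate_raters(ratings)  # per-topic counts of each category
print("Fleiss' kappa:", round(fleiss_kappa(table, method="fleiss"), 3))
```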
Commencement of phase three analysis
Following the quality evaluation of the QDTM results, it was necessary to determine the length of posts in order to decide how many posts it would be feasible to include in the analysis. Across the 55 topics, statistics for the top-10 posts revealed a minimum length of 11 tokens and an average of 165 tokens, with only 13 posts exceeding 512 tokens. Based on these figures, it was determined that hand-coding 550 posts would be a manageable task. We started the analysis by examining groups of related topics and manually coding the top-10 posts for each code in the first group. We then proceeded to the next group, continuing this process until all 14 groups were examined. Subsequently, after identifying the specific category that necessitated theoretical sampling, we integrated the newly identified data into the coding process and compared it with other data within that category.
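This feasibility check can be reproduced in a few lines; whitespace tokenisation is used here as a simple proxy for whichever tokeniser is actually applied, and `top_posts` is a hypothetical list of the top-10 posts across the 55 topics.

```python
import numpy as np

lengths = np.array([len(post.split()) for post in top_posts])
print("posts:", len(lengths))                              # 550 in our case study
print("min/mean length:", lengths.min(), round(lengths.mean()))
print("posts over 512 tokens:", int((lengths > 512).sum()))
```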
During the analysis, a category, ‘Redditing’, emerged, in which tutors create their own supportive network on Reddit: novices seek help and experts provide assistance. Yet a question arose regarding whether novices truly benefited from this interaction. More empirical evidence was therefore needed, which required theoretical sampling. Here, a sentiment analysis model was identified as useful for answering this question: distilbert-base-uncased-go-emotions-student (Wolf et al., 2020), pre-trained on the GoEmotions dataset (Demszky et al., 2020). This benchmark dataset contains 58K comments manually annotated for 27 emotions; the generalisability of the data across domains was tested via different transfer learning experiments. After applying the model, we purposely selected the ‘Gratitude’ and ‘Realization’ emotions (see Figure 2) for further analysis and randomly sampled 50 posts assigned to each emotion (see Table 6). Our analysis revealed that the data met our needs, leading to the creation of a new code, ‘Realising’, as a result of tutors’ Redditing behaviour.

Figure 2. The frequencies of emotions in our dataset. The bars in red are the two emotions selected for theoretical sampling: gratitude represents about 7.70% and realization about 7.00% of the data.
Table 6. Random samples from the data assigned to ‘Gratitude’ and ‘Realization’ emotions.
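For this form of theoretical sampling, the emotion classifier can be applied using the Hugging Face transformers pipeline. We assume the publicly hosted copy of the model under the `joeddav` namespace, and `posts_texts` is a hypothetical list of post texts:

```python
import random
from transformers import pipeline

# The model's Hub namespace is our assumption
classifier = pipeline("text-classification",
                      model="joeddav/distilbert-base-uncased-go-emotions-student",
                      top_k=None)  # return scores for every GoEmotions label

def top_emotion(text):
    scores = classifier([text], truncation=True)[0]  # list of {label, score} dicts
    return max(scores, key=lambda s: s["score"])["label"]

# Label each post, then randomly sample 50 posts per emotion of interest
labelled = [(p, top_emotion(p)) for p in posts_texts]
gratitude = [p for p, e in labelled if e == "gratitude"]
sample = random.sample(gratitude, min(50, len(gratitude)))
```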
Table 3 in Supplemental Appendix C shows the resulting focused codes and sub-codes after line-by-line coding for each group of topics. We then compared the codes and categories resulting from the analysis of all groups to identify relations and construct higher-level core categories (see Table 4 in Supplemental Appendix C). Quotations from the data supporting each of these core categories and their corresponding codes can be found in Supplemental Appendix D.
Following this, the final set of fully saturated core categories was organised according to patterns of connectivity in the theoretical coding step, which greatly aided the development of our GT. The section below presents the final product of the HITL CGT framework.
The GT of tutors’ experiences in the gig economy
The analysis revealed a grounded theory of tutors’ experiences in the gig economy. The theory proposes that the main concern underpinning most of tutors’ discussions is ‘staying financially afloat’ while working within ‘under-structured organisations’, that is, the tutoring platforms. Tutors face significant ‘financial uncertainty’ due to job insecurity, income instability, fluctuating demand and inadequate payments, as well as difficult-to-attain incentives. Tutors’ ability to stay financially afloat is also challenged by opaque and often error-laden ‘technological systems’. These systems contribute to payment instability when they incorrectly mark tutors as absent and penalise tutors when technical issues prevent task completion. Tutors’ income is also directly affected when they encounter issues arising from the booking system in the event of overlapping classes, student no-shows and cancellations. The vague and often unfair rating system that evaluates tutor performance and influences students’ hiring choices adds to concerns about payment instability. The tutors, however, are largely impotent when trying to address inaccuracies and malfunctions, since the tutoring platforms often fail to comply with the policies they set and lack effective communication channels, in an apparent abuse of power. This category and its corresponding codes are among the most frequently discussed topics (see Table 3 in Supplemental Appendix C, topics 3, 4, 5, 6, 7, 8, 9 and 13).
The inherent ambiguity of platform work, including unclear decision-making processes and poorly defined rules and procedures, results in a state of ‘confusion’ among tutors. This code was found frequently in most of the 14 topics, as shown in Table 3 in Supplemental Appendix C, and was later raised to a category in Table 4 in Supplemental Appendix C. This confusion has a profound impact on how tutors act to resolve their concerns, as does their level of competency in navigating the technological systems. Platform work ‘experts’ are those who have dedicated substantial time and effort to ‘understand’ the job and the system. ‘Novices’, by contrast, who may or may not be task experts but are not familiar with platform work, need more time to reduce their confusion and establish themselves on the platform, especially as the technologies employed do not operate in their favour. Furthermore, tutors’ ‘financial structure’ hugely affects the possibility of pursuing this job full-time: for example, only those with financial support or living in low-cost countries find it worthwhile to absorb the impact of uncertain payments. For the majority, achieving earning goals will negatively impact their work-life balance and job enjoyment (see topic 10, Table 3 in Supplemental Appendix C). The necessity to generate income is what drives them to ‘continue tutoring’ and adapt to challenges.
Tutors ‘persist’ while working on tutoring platforms; this is the core category, observed in nearly all topics, and it illustrates the behaviour through which tutors address their main concern. They seek to establish a supportive network on Reddit, engaging in ‘Redditing’ behaviour as a response to the absence of workplace and colleague support. Within Reddit, tutors play different roles based on their competency levels: novices are the help seekers, who operate in a context of confusion; experts provide explanations and solutions based on their understanding of the platforms and the business models. Tutors ‘resist’ challenges by empowering one another and forming bonds through shared negative experiences and emotions. Tutors also persist in their efforts to ‘solve’ the problems they encounter. This may involve communicating with the platforms’ management despite their non-responsiveness, and educating students about the technology used on the platforms. To stay competitive, tutors develop interpersonal skills and use self-marketing, promoting their expertise and qualifications in their profiles. ‘Strategising’ is another form of tutors’ persistence. To mitigate income instability, tutors establish a base of regular students by maximising availability, working during peak hours and opening slots in advance. Teaching popular classes and being likeable are ways to raise popularity. Working mainly with loyal regular students and asking for high ratings are strategies for protecting reputation. ‘Multi-homing’, that is, working on multiple platforms simultaneously, is another strategy to combat an insufficient and unstable income.
With persistence, and by dedicating time and effort, some tutors were able to resolve their main concern by ‘generating sufficient income’ and achieving their earning goals. Based on our theoretical sampling results, ‘Redditing’ has a significant impact on reducing tutors’ confusion and thus on ‘realising’, that is, gaining a better understanding of unclear aspects of the platform. Lastly, tutors who possess clarity and understanding of the gig economy business model are divided into two types. The first accept the flaws of the business model, set realistic expectations and do not rely on this job as their main source of income. The second refuse to tolerate poor working conditions and low pay, opting instead to leave the tutoring platform. They may employ leaving strategies such as ‘platform hopping’ in order to find better-paying platforms, or they may switch to platforms that let them set their own prices. Alternatively, they may simply choose to work as freelance online tutors not associated with any platform. This can be observed in Table 3, Supplemental Appendix C, topics 14 and 12, and the sub-codes in topics 10 and 5.
Discussion and conclusions
This article proposes a novel framework for applying CGT, offering a practical solution for analysing large qualitative datasets using ML and NLP tools. The trustworthiness and credibility of the results are ensured by adopting a HITL approach, and the rigour of the GT methodology is maintained by adhering to all its fundamental principles. The case study results demonstrate the potential of this novel approach to CGT, as well as making an important contribution to gig economy research by focusing on an understudied group.
Our framework departs from the common practice in CGT frameworks of using single-layered TM for coding big datasets by employing a hierarchically structured TM, the QDTM. The researcher begins by using LDA to identify the main topics of discussion. Then, to ensure the saturation of codes and categories, the QDTM, with its term expansion techniques, extends the discovered (and validated) main topics by revisiting the complete corpus to search for all relevant subtopics, ensuring that no pertinent data or aspects are overlooked. Furthermore, the tree-like structure of QDTM topics presents the researcher with groups of related topics, which makes it easier to analyse and identify relationships between them, thus facilitating the application of GT’s constant comparison analysis. Additionally, QDTM’s capability to model topics using query terms elevates it above other hierarchical TM models, offering researchers the flexibility to formulate topics missing from the LDA results.
Our framework has maintained a balanced approach, leveraging computational tools as supportive aids rather than replacements for human involvement. We contend that building trust in the credibility of CGT is ensured by the active inclusion of researchers. Within our framework, researchers are engaged in each step of the analysis, starting from the discovery of topics, validation, the modelling process, evaluation and ultimately the interpretation. Furthermore, the researcher adheres closely to GT’s core principles in the final phase of ‘zooming in’ on the phenomena under study to add interpretation and theoretical insights into the analysis. This helps to raise the conceptual level of the codes generated by QDTM and facilitates the construction of core categories. We argue that line-by-line coding of the data is what enables the discovery of research subjects’ main concern and helps to reveal how they address that, which are the key goals in any GT study.
Finally, while our research was carried out before large language models (LLMs) became available, our CGT framework is effectively NLP technology agnostic and hence as more powerful NLP tools become available, they can easily be accommodated within it.
Limitations
NLP researchers may find a limitation in the framework regarding the reproducibility of findings derived from manual interpretive line-by-line coding. Unlike quantitative studies, ensuring that two qualitative researchers produce identical analyses is challenging. Glaser (2007) argues that through constant comparison, personal input decreases and the data become more objective. We argue that the computational tools in the framework can further reduce subjectivity and bias: given the same parameters, the algorithms classify texts in the same way. For instance, QDTM generates comparable topics and subtopics if another researcher employs the same terms provided in Table 2, Supplemental Appendix B on the same dataset. Thus, the computational component can contribute to the reproducibility of the analysis.
With regards to validation, from a GT perspective, the theories produced are inherently grounded in the data and offer ‘a valid explanation’ of how research subjects address their concerns. Nevertheless, researchers should strive to maximise the validity of their analysis (Cohen et al., 2001). Our CGT framework ensured the validity of the quantitative analysis carried out in the first phase to identify the main topics. However, the final theory generated through the framework remains firmly situated within the qualitative realm and cannot be validated using common practices applied in quantitative and computational disciplines.
Finally, we acknowledge that the implementation of our framework may be time consuming. However, each step is indispensable for the proper application of core GT principles and to ensure trust, especially with large datasets. Overall, as noted by its originators (Glaser and Holton, 2004), GT’s inductive and iterative nature is complex and demands a great deal of time and effort.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental materials for this article are available online.