Abstract
Social paleontology is a burgeoning field of research that seeks to understand the natural world through the collection, preparation, curation, and study of fossils via online communities. Such a community represents an ideal case for examining scientific practice as the expression of conversation topics in relation to the people who participate. Using Communities of Practice as a theoretical framework, we consider interactions within an egocentric Twitter network over a 397-day period to identify topic archetypes within the community, examine how such topic archetypes act as expressions of behavior that are indicative of community processes, and provide empirical evidence for detecting and indicating the health of an online community. Data were collected continuously and analyzed with a combination of topic modeling and social network analysis. Four unique archetypes were characterized based on the level of activity and longevity of interest. Participants for each were diverse, but not different. Structural differences in each network were noted with high levels of inter-group information flow within certain archetypes. Archetypes were interpreted using the life cycle states for Communities of Practice; sustained conversations and piques of interest indicate healthy online communities. These findings can inform efforts to design, implement, and research online, scientific communities.
Introduction
Approximately 72% of Americans use social networking sites such as Facebook, Twitter, and Instagram (Pew Research Center, 2019a), where diverse users can create original content and reply to one another, information can be aggregated via hashtags, or users can point to additional information through the use of URLs. Sites like Twitter have been shown to encourage diverse communities who are interested in science, regardless of their level of expertise, to communicate about topics of interest within disciplines such as paleontology or ornithology (Bex et al., 2019; Lundgren et al., 2022; Côté & Darling, 2018; Liberatore et al., 2018). In addition to the richness of content that is offered on the platform, Twitter-specific research is fruitful for studying online exchanges around different topics because most of the interactions involve publicly accessible data and users whose profiles are public (Ahmed & Lugovic, 2018).
This study builds on previous studies investigating community response to message design elements and features across Facebook and Twitter based on expressed identity within a paleontological community (Bex et al., 2019; Lundgren et al., 2022). We take the approach that this is a model community as paleontology is a charismatic field of scientific study that draws members with varied interests and who make scientific contributions through the collection, preparation, curation, and study of fossils (Catalani, 2014). Thus, similar to ornithology and other biological fields of study, the online paleontological community represents a wide range of interests and contributions (Tancoigne, 2019). Here, we expand the aims and scope of our original explorations, taking an inductive data science approach to understand this community. We examined data patterns that emerged over time (i.e., 1 year), included relatively robust numbers of social interactions (i.e., ~8,000), and used a form of natural language processing to orient our investigation. We use the Communities of Practice (CoPs) (Wenger, 1998, 2000) theoretical framework to understand the results and to make and assess predictions. Machine learning techniques, such as topic modeling, afford identification of underlying structures and patterns within networks (Nikolenko et al., 2017); however, these structures and patterns need to be contextualized to understand and make interpretations about the network.
The context of this work was an egocentric social network on Twitter. We take the stance that online, social communities such as Twitter function as ecosystems that include organisms (i.e., individuals, organizations, and other entities) that interact with one another and their environment (i.e., the platform of Twitter). Therefore, there is merit in characterizing online ecosystems and applying traditional ecological concepts like diversity indices (Shannon & Weaver, 1962; Simpson, 1949) to online communities, which, to our knowledge, has yet to be done. Our purpose is three-fold: to identify topic archetypes within the communication of an interest-based science community, to examine how such topic archetypes act as expressions of behavior that are indicative of community processes, and to provide empirical evidence for detecting and indicating the health of an online community. Next, we delineate our theoretical perspective, outline relevant literature, and describe the research methodology that guided this study.
Theoretical Framework
The construct of CoP emerged from work done by Lave and Wenger (1991) on legitimate peripheral participation in which learners grow more involved in a community through participating more fully. Within the contexts Lave and Wenger examined, participation was contextually dependent on the learning environment, which was a radical shift that decoupled learning from formal school environments. As a theoretical framework, CoP expands upon and emphasizes the ways that people learn collaboratively (Wenger, 2000). The theory emerged from studies of people who were engaged in work-based learning, particularly focusing on how people were learning collaboratively as they contributed to both their own personal knowledge and to the knowledge development of a company (Wenger, 2000). Multiple interpretations of the CoP framework exist (Kimble et al., 2008a, 2008b; Wenger et al., 2011; Wenger-Trayner et al., 2015); our work emerges from the branch of CoP theory that emphasizes three interrelated elements: community, domain, and practice (Wenger, 1998, 2000). We explore how these elements interact within online spaces (Gunawardena et al., 2009; Lundgren et al., 2022).
The three elements of CoPs allow for particular behaviors to be identified and defined as benchmarks that can then be used analytically to evaluate and understand the unique properties and processes inherent to CoPs. Within this study, we emphasize the interconnected nature of the elements (Wenger, 2000; Wenger et al., 2002): the community is the individuals, groups, and organizations who contributed to Twitter activity; the domain is the field of social paleontology (Crippen et al., 2016); and varied social paleontological topics comprise the practices.
The domain of a CoP encompasses the complex, long-standing, and shared interests of the community at hand. Few studies have specifically addressed the concept of domain; those that do explicate that “developing new knowledge, exchanging relevant information, and/or personal growth” are of import (Britt et al., 2020, p. 2) and that focusing on domains can lead to effective domain-specific interventions (Watkins et al., 2018). Such development occurs through the enactment of practice (Handley et al., 2006), in which members engage fully in “a task, job, or profession” (Brown & Duguid, 2001, p. 203). Practice is qualified as the development of shared elements, both explicit and tacit, with which participation and contribution are identified, including stories, tools, language, documents, shared worldviews, and ways of addressing problems (Wenger et al., 2002). Practice is a well-scrutinized aspect of the CoP theoretical framework; however, most descriptions of practice fall short because practice is ill-defined or given only a cursory examination (Smith et al., 2017).
The element of community can be described as those who interact, learn, create relationships, and develop a sense of belonging and commitment to a domain (Wenger et al., 2002). Studies focused on community building in higher education emphasize that community relationships are key to CoPs devoted to teaching and learning at a university level, yet no empirical measurements support such claims (Bondy et al., 2017; Carroll, 2005; De Cindio, 2012; Nistor et al., 2015). Qualitatively describing the ways people commune as well as their dispositions is valuable, but without robust analytical frames and empirical descriptions, these narratives lose credence.
CoPs are recognized as having a life cycle of development, change, and alteration. Of particular interest is the conceptualization from Wenger and colleagues (2002) who originally proposed five stages, which have been subsequently documented empirically (Knaus & Callcott, 2017; Marques et al., 2016; Pohjola & Puusa, 2016). These stages further explicate CoP theory while also providing evaluation benchmarks that can be used by designers and project staff for decision-making. In developmental order, the stages include: potential, in which shared interest or activity among a core, loosely affiliated group causes them to identify “common knowledge needs” (Wenger et al., 2002, p. 71); coalescing, the phase of most rapid growth in the size of the group, where members “establish the value of sharing knowledge of their domain” (Wenger et al., 2002, p. 82); maturing, a period of less rapid growth, in which the community develops a comprehensive body of knowledge; stewardship, a phase of change in which the community works to maintain their “intellectual focus” (Wenger et al., 2002, p. 104); and transformation, the final stage of alteration, in which a community ends, becomes something else, or is institutionalized. We make use of these five stages of development to orient our interpretation of the community processes that we predicted as likely to have been expressed as topic archetypes within the Twitter network.
Review of Relevant Literature
Borgatti and Ofem (2010) and de Laat and colleagues (2007) have shown that Social Network Analysis (SNA) is well-suited for investigating online virtual spaces and communities since it allows for relationships to be measured across time and space. Digital niches that have been explored with SNA include blogs, forums, wikis, eLearning (Saqr et al., 2018), emails, and social networking sites, which provide researchers easily accessible, digitally documented, and recorded social interaction (Cela et al., 2016; Himelboim et al., 2017). To date, most SNA research has focused on classifying users based on their centrality and the ties between them to understand how information flows (Himelboim et al., 2017). This research has largely addressed topics that are not educative in nature; for instance, ties and information flow have been studied with regard to political topics (Wojcieszak & Mutz, 2009), within branding and marketing literature (Habibi et al., 2014), and in understanding the connections between people with similar medical ailments (Gabarron et al., 2019). While this research has helped to build our understanding of online social communication, there has been limited research that employs SNA methods to examine how people, groups, and entities interact and learn within science communities (Bex et al., 2019). Researchers who have focused on this area employ bibliometric analysis (e.g., Raban & Gordon, 2020), which excludes people who are interested in science but are not a part of the formal academic community. In addition, the literature on the use of online community spaces tends to focus on snapshots in time rather than on extended periods of time (Kimmons & Veletsianos, 2016).
Research pertaining to the use of Twitter spans nearly 15 years and has been approached from multiple perspectives (Gruzd et al., 2011). Much Twitter-specific research describes the use of this social media platform for interest-based activities such as professional development in education and natural sciences (e.g., Bex et al., 2019; Xie & Luo, 2018); such work is often based on snapshots in time that describe users’ interactions with hashtags (e.g., Bert et al., 2016; Bombaci et al., 2016; Britt & Paulus, 2016; Lundgren et al., 2022; Smith et al., 2009). Studies that are snapshots of Twitter users’ activity are useful, but contextualizing these snapshots requires a longer term perspective. Even when longer term quantitative measures are employed to provide meaning within a Twitter network, these studies do not account for the potential diversity of users involved (Greenhalgh et al., 2020).
We take the perspective that an egocentric network on Twitter acts as a community that can foster interest-based conversations among diverse participants which, in turn, can be analyzed to determine the CoP life cycle of development, change, and alteration. While social media niches afford interaction in different ways through their design and use of conventions, we take the approach that individual messages (i.e., tweets) serve as the origin of conversation and as starting points for constructive discourse (Lundgren et al., 2022; Michaels et al., 2008). On Twitter, individuals, groups, and organizations contribute to a rich social world via such conversation. Thus, a conversation on Twitter can be defined as a collection of messages that has the capacity for generating subsequent interest-based interaction. While human coders can tackle the task of analyzing Twitter conversations, when large amounts of data are collected, researchers have turned to machine learning techniques such as topic modeling to aid with analysis. Topic modeling is a statistical technique by which latent patterns in large datasets of texts are identified (Hong & Davison, 2010; Vayansky & Kumar, 2020). Such analysis techniques provide ways to make sense of patterns that may be difficult for human coders to identify as well as allow for researchers to interpret large datasets efficiently (Quinn et al., 2010).
An ecosystem view of community members interacting with each other, groups, and other entities is useful when considering the diversity of communities who engage in conversation. Ecologists query ecosystems, specifically, the interconnected nature of organisms and their associated environments to understand the past, present, and future natural world. Other fields, including education, have appropriated the ecosystem concept (Corin et al., 2015; Hecht & Crowley, 2020) to account for the interconnected nature between people, places, and concepts relevant to their particular field of study. A central concept to the study of ecosystems in the traditional sense is understanding the diversity and abundance of species, which allows for ecologists to make inferences about the past, present, and future health of an environment. SNA draws on these concepts and can be applied to understand how the work of scientists with expertise in ecology and evolution is correlated, how it has developed over time, and how authorship networks form (Borgatti & Ofem, 2010; Borrett et al., 2014; Wasserman & Faust, 1994).
Methodology
This study expands on research into the classification of Twitter topic networks as exemplified by Himelboim et al. (2017). Specifically, Himelboim et al. established a methodology for conceptualizing and classifying Twitter networks based on structures that act as indicators of information flow (i.e., density, modularity, centralization, and isolates); essentially, the flow of information drove network structures. We parallel this methodology but focus on different indicators that relate to the CoP theoretical framework. Establishing such indicators can account for the nature and health of an online, scientific community which in turn drives practice-based archetypes. This single case investigation of an annual cycle (397 days; July 2017–August 2018) of Twitter activity centered on the FOSSIL project (@projectfossil), an effort focused on building connections and shared practice among a diverse community of paleontologists (Crippen et al., 2016). The case was bounded by our intent to understand how people in the network identify with paleontology, the social structures supporting their communication, the resulting patterns of conversation, and how such people, structures, and patterns serve as indicators of the health of an online community. We calculated the established growth trajectory (average 1.95 new followers per day) and pattern of activity (average 6.98 tweets per day) during the annual cycle of activity and determined that such a period of time would provide a sufficient diversity of people, interactions, and topics for answering our research questions. The detailed and consistent social media messaging practice used by @projectfossil, which involved a stated plan and included original practice-based messages that were distributed with a goal of building community to an established group of diverse followers (Lundgren et al., 2022), made it a unique case study as a model system. 
The limited research involving conversation structures in social network-based communities dictated the context-dependent method and combination of data analysis methods. We sought to address the following research questions: (1) What topic archetypes exist in this social network? (2) Which participants are involved with different topic archetypes? (3) What relationships exist among the topic archetypes, the composition of participants, and community network structures?
Data Sources and Preparation
Using Netlytic, a browser-based text and SNA service (https://netlytic.org/), we scheduled a sampling of the Twitter public search API every 15 min from July 2017 to August 2018. This representative sample (Rafail, 2017), which used the @projectFOSSIL account as a proxy, included 7,753 messages that originated from the account or included it as a mention. Records included the text of each message, the author’s account, and for re-tweets, the name of the account that passed the message along, as well as any additional accounts that were mentioned. Attributes such as the biography and number of followers were added to all named accounts. In addition, we prepared the data by classifying each account based on a content analysis (Krippendorff, 2012) of each account’s biography using the Paleontological Identity Taxonomy (PIT), a hierarchical taxonomy based on self-identity with the domain (Lundgren et al., 2018). Accounts were classified by the four authors, who individually coded all data then discussed any discrepancies to consensus (Patton, 2002). The PIT level of category was used as the unit of analysis, which involved four classifications: Public, Scientist, Education and Outreach, and Commercial (reported here as proper nouns). In a round of independent coding that included 10% of the overall sample, interrater reliability was determined to have a significant level of agreement at the PIT category level (Fleiss κ = .9).
Data Analysis
Data analysis involved topic modeling, SNA, and the application of diversity indices to PIT-classified accounts. The content of individual tweets was subjected to topic modeling (Nikolenko et al., 2017) using the Gibbs sampling Dirichlet mixture model, which is a modified Latent Dirichlet allocation (LDA) that assumes that each document (i.e., tweet) consists of exactly one topic. This analysis was conducted within version 8.1.0 of the Text Processor extension in the application Rapidminer (Kotu & Deshpande, 2015). Data pre-processing involved stemming terms and removing numbers, URLs, the name of the account (i.e., projectfossil), stop words, and very short words. Using the maximum log likelihood optimization method (Sbalchiero & Eder, 2020), we determined the number of detectable topics to be 38 (Figure 1). While topic modeling provides evidence for latent patterns, such patterns need interpretation by human coders (Quinn et al., 2010). The first and the fourth author reviewed topics individually to give each topic a descriptive name based on an interpretation of the words and associated tweets (e.g., Fossils from Various States, Women in Paleontology, Microfossils), then these topic names were discussed to consensus. The authors examined topics that were named similarly (e.g., Paleontology Education & News and News Stories about Paleontology) to determine if topics could be merged; however, the distinctions made by the application of topic modeling were honored; thus, topics that included similar naming conventions remained separate for analysis. Following the descriptive naming of topics, we applied the paleontological practice-based post type (P3T) framework, which consisted of five unique types of content: Information, News, Opportunity, Research, and Off-Topic (Lundgren & Crippen, 2017) to further distill the nature of each topic.
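The pre-processing steps named above can be illustrated with a minimal sketch. It is not the Rapidminer workflow used in the study: the stop-word list is a tiny illustrative placeholder, the minimum word length of three characters is an assumed threshold, the function name `preprocess` is hypothetical, and stemming is omitted because it would require an external library.

```python
import re

# Illustrative placeholders, not the lists or thresholds used in the study.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "at", "for", "on", "is", "this"}
ACCOUNT_NAME = "projectfossil"
MIN_LENGTH = 3  # assumed cutoff for "very short" words


def preprocess(tweet):
    """Lowercase a tweet and drop URLs, numbers, the account name,
    stop words, and very short words. Stemming is intentionally omitted."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs before anything else
    text = re.sub(r"\d+", " ", text)           # remove numbers
    tokens = re.findall(r"[a-z]+", text)       # keep alphabetic tokens only
    return [
        t for t in tokens
        if t != ACCOUNT_NAME and t not in STOP_WORDS and len(t) >= MIN_LENGTH
    ]


tokens = preprocess("Join @projectFOSSIL for 3 webinars on microfossils: https://example.org/info")
print(tokens)  # ['join', 'webinars', 'microfossils']
```

The order of the substitutions matters: URLs are stripped first so that digits and word fragments inside a link are not left behind as tokens.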

Figure 1. Maximum likelihood estimation of number of topics.
Each day in the study period was recognized as having the potential for unique tweet activity related to each topic, which allowed for the construction of topic-over-time graphs (Figure 2). Initial review of these graphs identified the volume and regularity of tweets as two distinctive features, which when combined could be used to describe the collection. These features were used to construct two unique metrics. First, the total number of tweets affiliated with a topic was operationalized as the Magnitude of Activity or Magnitude, which for these topics varied from 0 to 173 tweets. A base-10 logarithmic transformation (i.e., logMagnitude) was used to better account for the range of values and to support transferability to other studies at larger scales. Topics were binned into categories based on their location on the graph (Figure 2). Next, an Occurrence was defined as a day in the study period that had tweet activity. Occurrences are also synonymous with individual peaks in the topic-over-time graphs. Percent Occurrence was calculated as the total Occurrences divided by the duration of the study. For example, topic activity on the dates of 21 October 2017 (15 tweets), 22 October 2017 (21 tweets), 24 October 2017 (18 tweets) and again on 28 October 2017 (12 tweets) would result in four Occurrences (1% Occurrence) and a Magnitude of 66 tweets, with a corresponding logMagnitude value of 1.82. Percent Occurrence values were likewise binned into categories based on their location. We interpret the variable of percent Occurrence as a measure of the longevity or duration of interest (high values indicate greater longevity), whereas the Magnitude was interpreted as the strength of interest.
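The two metrics can be reproduced with a short sketch that uses the four daily counts from the example above and the 397-day study duration. The function name `topic_metrics` and the dictionary layout are assumptions made for illustration.

```python
import math
from datetime import date

STUDY_DAYS = 397  # duration of the study period reported in the text


def topic_metrics(daily_counts, study_days=STUDY_DAYS):
    """Return (magnitude, log_magnitude, occurrences, percent_occurrence)
    for one topic, given a mapping of day -> number of tweets."""
    magnitude = sum(daily_counts.values())
    # An Occurrence is any day with at least one tweet on the topic.
    occurrences = sum(1 for n in daily_counts.values() if n > 0)
    log_magnitude = math.log10(magnitude) if magnitude > 0 else float("nan")
    percent_occurrence = 100.0 * occurrences / study_days
    return magnitude, log_magnitude, occurrences, percent_occurrence


example = {
    date(2017, 10, 21): 15,
    date(2017, 10, 22): 21,
    date(2017, 10, 24): 18,
    date(2017, 10, 28): 12,
}
magnitude, log_mag, occ, pct = topic_metrics(example)
print(magnitude, round(log_mag, 2), occ, round(pct, 1))  # 66 1.82 4 1.0
```

The four daily counts sum to 66 tweets, so the base-10 logarithm is approximately 1.82, and four active days out of 397 give a Percent Occurrence of roughly 1%.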

Figure 2. Two example topic-over-time graphs (logMagnitude vs. Date); (a) Paleoart (1.5% Occurrence), and (b) Fossils from Various States (4.5% Occurrence).
Topic archetypes were delineated based on a visual content analysis of the topic-over-time graphs for each type (Rose, 2013) as well as an examination of a graph of logMagnitude versus percent Occurrence where each axis was bifurcated to create quadrants and the potential for four archetypes. Potential archetypes thus represented a combined expression of logMagnitude and percent Occurrence. The potential archetypes consisted of high Magnitude–high percent Occurrence (HH), high Magnitude–low percent Occurrence (HL), low Magnitude–high percent Occurrence (LH), and low Magnitude–low percent Occurrence (LL).
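The quadrant logic can be sketched as a small classifier. The cut points below are hypothetical placeholders chosen only for illustration, because the values at which each axis was bifurcated are not stated here; the function name `classify_archetype` is likewise an assumption.

```python
# Hypothetical cut points; the study's actual bifurcation values are not given here.
LOG_MAGNITUDE_CUT = 1.8
PERCENT_OCCURRENCE_CUT = 5.0


def classify_archetype(log_magnitude, percent_occurrence,
                       magnitude_cut=LOG_MAGNITUDE_CUT,
                       occurrence_cut=PERCENT_OCCURRENCE_CUT):
    """Return 'HH', 'HL', 'LH', or 'LL'. The first letter encodes Magnitude,
    the second encodes percent Occurrence."""
    magnitude_level = "H" if log_magnitude >= magnitude_cut else "L"
    occurrence_level = "H" if percent_occurrence >= occurrence_cut else "L"
    return magnitude_level + occurrence_level


# Made-up topics, one per quadrant, given as (logMagnitude, percent Occurrence).
made_up_topics = {
    "topic_a": (2.2, 12.0),
    "topic_b": (2.0, 1.5),
    "topic_c": (1.2, 9.0),
    "topic_d": (0.9, 0.8),
}
labels = {name: classify_archetype(*values) for name, values in made_up_topics.items()}
print(labels)
```

Treating a value exactly on a cut point as high is an arbitrary convention in this sketch.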
Community ecology diversity indices were utilized to consider the diversity of participants by PIT category in each identified topic. These metrics have been long employed to assess species diversity in a given location, consider topics of biodiversity scales, identify geographic biodiversity hotspots, and consider how biodiversity is accumulated through space and time (Patzkowsky & Holland, 2007; Stigall et al., 2017). These same metrics were used to consider the diversity of individuals contributing to topical Twitter conversations and the diversity of individuals in defined clusters. Shannon (Shannon & Weaver, 1962) and Simpson (Simpson, 1949) diversity indices were calculated in the R package vegan (Oksanen et al., 2019). Indices were plotted against one another to compare results as the Simpson index preferentially weights abundance of individuals and Shannon works to correct for this bias. Assessment of indices visualizations indicated that both tracked one another closely, thus, in the results, we report our findings solely using the Simpson diversity index (SDI). Diversity of topics was analyzed in the context of Magnitude and percent Occurrence as described above. The SDI ranges from zero to one, with one being highest diversity (Simpson, 1949; Table 1). Topics were subsequently binned via statistically significant (p < .05) natural gaps to indicate low (0.00–0.50), medium (0.51–0.63), or high diversity (0.64–1.00).
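As a rough illustration of the two indices, the sketch below computes them from counts of participants per PIT category. It assumes the Gini-Simpson form of the Simpson index (one minus the sum of squared proportions) and the natural-log form of the Shannon index; the study computed its values in the R package vegan, and the counts shown are invented. The binning helper uses the low, medium, and high ranges reported above.

```python
import math


def simpson_index(counts):
    """Gini-Simpson diversity: 1 - sum(p_i^2). Assumed form, for illustration."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts)


def shannon_index(counts):
    """Shannon diversity with natural logarithms: -sum(p_i * ln p_i)."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def diversity_bin(sdi):
    """Bin an SDI value into the low/medium/high ranges reported in the text."""
    value = round(sdi, 2)
    if value <= 0.50:
        return "low"
    if value <= 0.63:
        return "medium"
    return "high"


# Invented participant counts per PIT category:
# Public, Scientist, Education and Outreach, Commercial.
even_topic = [10, 10, 10, 10]
single_group_topic = [40, 0, 0, 0]

print(simpson_index(even_topic), diversity_bin(simpson_index(even_topic)))
print(simpson_index(single_group_topic), diversity_bin(simpson_index(single_group_topic)))
```

Under this form of the index, a topic spread evenly across four categories scores 0.75, which is the largest value four categories can produce; a topic with participants from a single category scores 0.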
Table 1. List of Topics, Descriptions, Number of Associated Tweets, Paleontology Practice-Based Post Types, Magnitude, Percent Occurrence, and Archetype.
HH = high Magnitude–high percent Occurrence; LL = low Magnitude–low percent Occurrence; LH = low Magnitude–high percent Occurrence; HL = high Magnitude–low percent Occurrence.
The network was visualized and further characterized using NodeXL (Hansen et al., 2011), where participants were nodes in the network with tweets, re-tweets, and mentions serving as the links among the nodes. We chose to use NodeXL for its powerful visualization and customization capabilities, whereas Netlytic was only used to collect tweets continuously for the study period. Groups within each archetype network were identified with the Clauset-Newman-Moore (Clauset et al., 2004) clustering algorithm. Graph metrics included number of vertices (i.e., nodes or entities), number of edges (i.e., connections), density, and geodesic distance. Graph density indicates how interconnected vertices in a network are and is measured on a scale of 0–1. Geodesic distance indicates the length of the shortest path between two people within a network.
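Density and geodesic distance can be illustrated on a toy network. This sketch treats the network as an undirected simple graph with no self-loops, which is an assumption and may differ from how NodeXL handles directed or repeated edges; the account names are invented.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (account, account) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def density(adjacency):
    """Share of possible undirected ties that are present (0 to 1)."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    edge_count = sum(len(neighbors) for neighbors in adjacency.values()) / 2
    return 2 * edge_count / (n * (n - 1))


def geodesic_distance(adjacency, source, target):
    """Length of the shortest path between two accounts, found by
    breadth-first search; None if they are not connected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Invented chain of four accounts linked by mentions or re-tweets.
toy_edges = [("account_a", "account_b"), ("account_b", "account_c"), ("account_c", "account_d")]
graph = build_adjacency(toy_edges)
print(density(graph))                                    # 0.5
print(geodesic_distance(graph, "account_a", "account_d"))  # 3
```

In the toy chain, three of the six possible ties exist, so the density is 0.5, and a message would need three hops to travel from the first account to the last.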
Finally, we used a community detection algorithm called InfoMap to visualize the directional flow of information among the groups for each topic archetype (Edler et al., 2016). The networks for each topic archetype were exported to Pajek files and loaded into InfoMap to calculate codelength of the cluster data as well as to visualize the flow of information. InfoMap involves a code-based equation or compression algorithm that is applied to a directional, weighted network to identify the probability of random walks across the regularities that were defined as groups (Rosvall & Bergstrom, 2008).
Findings
Topic modeling resulted in 38 distinct topics, which were subsequently described and characterized (Table 1). The magnitude among the topics was relatively low and positively skewed, ranging from 4 to 244 with a median of 63 tweets. The saturation of tweets during the time frame was relatively low and positively skewed with the percent occurrence ranging from 0.5 to 23.7 with a median of 5.4%. Diversity was negatively skewed, ranging from 0.06 to 0.67, with a median of 0.53. More than half of the topics were coded as Information (52.6%), followed by Opportunity (23.7%), Research (13.2%), News (7.9%), and zero topics coded as Off-Topic.
Topic Archetypes
Four archetypes were identified from the graph of Magnitude versus percent Occurrence. This graph was then overlaid with the additional variables to further characterize the archetypes (Figure 3).

Figure 3. Topics by quadrant illustrating the archetypes with topic descriptions from the P3T framework, represented by different shapes: Information (circles), News (squares), Opportunity (triangles), and Research (diamonds). Topics are colorized by archetype: LL (bright orange), LH (yellow), HL (light orange), and HH (dark orange/brown). Topics examined in depth include their text description.
The number of topics within each archetype was consistent with each archetype containing between seven and nine topics. The topic-over-time graphs for each archetype were distinctive (Figure 4).

Figure 4. View of the four quadrants from the Magnitude versus Percent Occurrence graph with exemplar topic-over-time graphs representing the archetypes. The internal y-axis indicates the logMagnitude associated with each topic per day. Note that topics of the HL archetype (light orange) exhibit extreme, singular spikes of activity, whereas topics of the LH archetype (yellow) follow a pattern of more sustained activity.
The LL archetype is characterized as consisting of topics that did not merit sustained interest or interaction. Many of the topics clustered in the LL quadrant were represented in the P3T framework as Research (Figure 3). Research topics, in which members of the social world reposted or discussed research, included Reposting of Recent Research Paper, Fieldwork, and topics about particular groups of animals (i.e., Sabertooth Cats). Within LL topics, especially those that were Research-specific, members of the social world with specific knowledge provided additional information or a reaction; however, those without the specific knowledge did not participate. Viewed over time, LL-affiliated topics followed a pattern of having few, if any, instances of people tweeting about the topic for any appreciable amount of time; these topics generated limited interest over short periods of time (Figure 4). For example, topics such as Fieldwork or Project Outreach Events generated a singular spike of activity on one day with minimal subsequent activity. These topics, while certainly engaging over a very short duration of time, were likely to thwart community development as they did not entice members of the social world to engage.
The LH archetype is characterized as consisting of topics with frequent tweet activity, but of a low volume, suggesting a low-level conversation. There was a cluster of topics within the LH quadrant that were coded into the P3T Type of News, which included media outlet stories about paleontology (Figure 3). When viewed over time, these LH-affiliated topics exhibited evidence of many instances of members engaging with the topic; however, community member diversity was low. Topics rarely exhibited days of high magnitude during the study period, never showing more than 14 unique instances of activity (Figure 4). We interpret these topics to function in the same way that idle chatter does within an office setting: sustaining some level of basic interest over time creating a sense of community cohesion, but not fomenting new ideas or exhibiting patterns consistent with knowledge generation (Probst & Borzillo, 2008).
The HL archetype is characterized as consisting of topics where people were highly interested, but their interaction was not sustained over time. Topics within the HL quadrant were mostly coded as Information—general resources for paleontology, reports of recent project activity, personal connections to paleontology, or links to blogs (Figure 3). HL topics that were Information-based included the topics of Women in Paleontology, an Individual Network Discussion, and Paleoart. Examining and visualizing HL-affiliated topics over time revealed activity that was nearly identical in nature: activity that peaked with a singular large spike that was brief in nature, usually lasting 1 day (Figure 4). For example, the topic of Microfossils generated 68 tweets and replies on one day in April 2018, but not before or after this date. This topic archetype is an important component of Twitter’s social world as it indicates areas of interest that might encourage participation. However, the ephemeral nature indicative of this archetype reveals that conversations are not sustainable, which in turn, might mean that new participants might not have enough exposure to continue to participate and contribute.
The HH archetype is characterized as consisting of topics that drew both high and sustained activity over a period of time. These topics most often fell into the P3T Type of Opportunity, which are defined as ways for members of the social world to participate in or contribute to paleontology (Figure 3). We interpret these topics, which included Inclusion of Amateur Paleontologists, Citizen Science, and Project Webinars, to be highly important, as they were capable of involving members of the social world in sustained interactions. We interpret this to mean that HH topics need to include ways for people from diverse backgrounds to express their interest in paleontology as well as contribute to or participate in it.
When viewed over time, HH-affiliated topics displayed multiple instances of tweets and replies over consecutive days with occasional spikes of activity. An example of this is the topic Paleontology Resources. Throughout the study period, this topic generated activity consistently, which was visualized as small incremental activity interspersed with four distinct spikes of activity (Figure 4). This suggests that the topic was of sufficient interest to the community to sustain a general level of conversation, and that it could also produce instances of more intense interest that resulted in peaks of activity. Such a pattern can be viewed as productive for a community because the topic produces a need for conversation, maintaining involvement over time, but it also provides instances of invigoration.
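The four quadrants described above can be illustrated with a minimal sketch. It assumes that Magnitude is a topic's largest single-day count of tweets and replies and that percent Occurrence is the share of study days with any activity; both definitions, the function names, and the cut-off values are assumptions made for illustration and are not the thresholds used in the study.

```python
from typing import Sequence

STUDY_DAYS = 397  # length of the study period reported in the abstract


def percent_occurrence(daily_counts: Sequence[int]) -> float:
    """Share of study days (0-100) on which the topic had any activity."""
    active_days = sum(1 for c in daily_counts if c > 0)
    return 100.0 * active_days / len(daily_counts)


def magnitude(daily_counts: Sequence[int]) -> int:
    """Assumed here: the largest single-day count of tweets and replies."""
    return max(daily_counts)


def archetype(daily_counts: Sequence[int], mag_cut: float, occ_cut: float) -> str:
    """Return HH, HL, LH, or LL (first letter Magnitude, second Occurrence)."""
    m = "H" if magnitude(daily_counts) >= mag_cut else "L"
    o = "H" if percent_occurrence(daily_counts) >= occ_cut else "L"
    return m + o


# A single-spike topic shaped like the Microfossils example: 68 posts on one day.
spike = [0] * STUDY_DAYS
spike[100] = 68

# Hypothetical cut-offs, chosen only for illustration.
print(archetype(spike, mag_cut=20, occ_cut=5.0))  # HL
```

Under these assumed cut-offs, a topic with one post every third day would fall into LH instead, because its activity is frequent but never large.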
Diversity Within Archetypes
Our second research question concerned participant involvement within the different topic archetypes, which was answered by applying the PIT to participants and examining which participants were interacting within each archetype and their included topics. The most diverse topics in the network were Events at the Florida Museum (0.67; LH archetype), Project In-Person Events (0.66; HH archetype), and Fieldwork (0.66; LL archetype). The least diverse topics were Reports or Discoveries (0.18; LH archetype), Women in Paleontology (0.06; HL archetype), and Paleoart (0.06; HL archetype). A one-way analysis of variance (ANOVA) found no significant difference in diversity among the archetypes, F(3, 34) = 1.604, p = .21, r² = .12; however, we report our findings concerning raw numbers, percentages (Figure 5), and the range of diversity indices within archetypes (Figure 6) to highlight the potential of our method.
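As a check on the reported statistics, the effect size can be recovered from the F statistic and its degrees of freedom. The sketch below assumes that the reported value of .12 is the proportion of variance explained by archetype (eta squared, which equals R squared in a one-way ANOVA); the function name is illustrative.

```python
def eta_squared(f_stat: float, df_between: int, df_within: int) -> float:
    """Proportion of variance explained, recovered from a one-way ANOVA F.

    Uses the identity eta^2 = (F * df_between) / (F * df_between + df_within).
    """
    explained = f_stat * df_between
    return explained / (explained + df_within)


# Values reported in the text: F(3, 34) = 1.604
effect = eta_squared(1.604, df_between=3, df_within=34)
print(round(effect, 2))  # 0.12
```

The degrees of freedom also imply four groups (the archetypes) and 38 observations in total, which is consistent with diversity being compared at the level of individual topics.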

Figure 5. Simpson’s diversity indices (SDI) and percentages of participant types per topic archetype. No significant differences in diversity were found between archetypes.

Figure 6. Topics are colored by archetype and sized by diversity category; shapes correspond to the associated P3T topic descriptions. LL (bright orange), LH (yellow), HL (light orange), and HH (rust).
Within the LL archetype, the breakdown of participants included 178 Scientists (41%), 156 Education and Outreach (35.9%), and 100 Public participants (23%). The SDI average was 0.55, indicating medium diversity, but also the most diverse of the four archetypes (Figure 5). The two topics with the lowest diversity in the cluster were Sabertooth Cats (0.29) and Exhibits at the NHM London (0.45). The two topics with the highest diversity were Historical Paleontology (0.66) and Fieldwork (0.66) (Figure 6).
Within the LH archetype, the breakdown of participants included 62 Scientists (15.1%), 275 Education and Outreach (70.2%), and 54 Public participants (13.8%). The LH archetype had the highest percentage of Education and Outreach participants. The SDI average was 0.40, indicating low diversity (Figure 5). The two topics in the cluster with the lowest diversity were Reports or Discoveries (0.18) and News About Paleontology (0.22). The two topics with the highest diversity were Time Scavengers (0.60) and Events at the Florida Museum (0.66) (Figure 6).
Within the HL archetype, the breakdown of participants included 470 Scientists (47.4%), 199 Education and Outreach (20.1%), and 321 Public participants (32.4%). The HL archetype had the highest percentage of Scientists and Public participants. The SDI average was 0.43, indicating low diversity; however, individual topic diversity ranged from 0.06 to 0.66, and HL was the only archetype containing topics with diversities below 0.1 (Figure 5). The two topics with the lowest diversity in the cluster were Paleoart (0.06) and Women in Paleontology (0.06). The two topics with the highest diversity were Lists about Paleontology Topics (0.66) and 3D Scanning and Printing (0.66) (Figure 6).
Within the HH archetype, the breakdown of participants included 932 Scientists (24.8%), 1,220 Education and Outreach (60.1%), and 617 Public participants (14.9%). The SDI average was 0.52, indicating medium diversity (Figure 5). The two topics with the lowest diversity were Project Webinars (0.47) and Project Newsletters and Conferences (0.52). The two topics with the highest diversity were Paleontology Resources (0.65) and Project In-Person Events (0.66) (Figure 6).
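The participant breakdowns above can be turned into a diversity index with a short calculation. The sketch assumes the Gini-Simpson form of Simpson's index, 1 - Σp², which is not stated in this section but is consistent with the reported maximum of about 0.66 to 0.67 for three participant types. It uses the LL counts reported above; the function name is illustrative.

```python
def simpson_diversity(counts):
    """Gini-Simpson index, 1 - sum(p_i ** 2), for a list of category counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# LL archetype: Scientists, Education and Outreach, Public (counts from the text)
ll_counts = [178, 156, 100]

shares = [round(100 * c / sum(ll_counts), 1) for c in ll_counts]
print(shares)                                     # [41.0, 35.9, 23.0]
print(round(simpson_diversity(ll_counts), 2))     # 0.65 for the pooled counts
print(round(simpson_diversity([1, 1, 1]), 2))     # 0.67, the ceiling for three types
```

The pooled LL value (about 0.65) is higher than the reported LL average of 0.55. That gap would be expected if the reported average is taken over per-topic indices rather than pooled counts, although this reading is an assumption.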
Community Network Structures
A comparison of network characteristics indicates important differences among the archetypes (Table 2). For three of the four archetypes, participants were mostly connected through a centralized hub, with little connection to one another. The HH archetype included the most vertices (n = 267), while the LL archetype included the fewest (n = 146). In addition, the HH archetype had the highest number of vertices in a connected component. This suggests that within the HH archetype, there were many participants who were connected; however, the graph density was low, at 0.01, suggesting that the vertices were only loosely connected to one another. In comparison, the densest archetype was HL, with a graph density of 0.02. Graph density is calculated on a scale of 0–1; thus, no archetype was densely connected. We interpret this to mean that within an egocentric network that was centered on the science of paleontology, connections were mainly facilitated by the egocentric node and few external conversations occurred outside of those created by the node.
Table 2. Network Graph Characteristics by Topic Archetype.
Across the archetypes, the average geodesic (shortest-path) distance ranged from 2.32 (HH archetype) to 2.59 (LL archetype). This suggests that for all archetypes, nodes within the network were directly connected to one another or connected through a mutually affiliated entity. As this was a study of an egocentric network, @projectfossil is most likely the mutually affiliated entity.
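To show what these metrics look like for a purely hub-centered network, the following sketch builds an idealized hub-and-spoke graph with the same number of vertices as the HH archetype (n = 267) and computes its density and average geodesic distance. The graph is synthetic and undirected, both of which are simplifying assumptions; it is not the study's network, and the function names are illustrative.

```python
from collections import deque


def star_graph(n):
    """Undirected hub-and-spoke graph: vertex 0 is the hub, 1..n-1 are spokes."""
    adj = {v: set() for v in range(n)}
    for v in range(1, n):
        adj[0].add(v)
        adj[v].add(0)
    return adj


def density(adj):
    """Undirected graph density: edges present divided by edges possible."""
    n = len(adj)
    edges = sum(len(nbrs) for nbrs in adj.values()) / 2
    return edges / (n * (n - 1) / 2)


def bfs_distances(adj, source):
    """Shortest-path length from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def average_geodesic(adj):
    """Mean shortest-path length over all connected pairs of distinct vertices."""
    total = pairs = 0
    for v in adj:
        for w, d in bfs_distances(adj, v).items():
            if w > v:  # count each unordered pair once
                total += d
                pairs += 1
    return total / pairs


hub = star_graph(267)
print(round(density(hub), 4))           # 0.0075
print(round(average_geodesic(hub), 2))  # 1.99
```

An idealized star of this size has a density near 0.01 and an average distance just under 2. The reported distances of 2.32 to 2.59 are somewhat longer, which would fit a hub-centered network in which some participants are reached only through intermediaries.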
The network shape and directional flow of the information within the LL archetype indicate a broadcast network (Figure 7). According to Himelboim and colleagues (2017), a distinguishing feature of a broadcast network is one large group acting as the source of information for receiving groups. The LL archetype involved one main group that accounted for most of the information flow to 17 smaller groups. A large amount of information flowed into and out of the main group, with little connection between the other smaller groups, another common feature of broadcast networks (Himelboim et al., 2017). The low density (0.01) is a result of the lack of interconnectedness within the network.

Figure 7. The directional flow of information among groups within each archetype. Each group is represented as a circle; arrows indicate the direction of flow among groups. Size of the circle corresponds to the amount of information flow through the node.
The HH and LH archetypes resulted in networks with the highest number of groups (n = 19 and n = 18, respectively). Based on the shape and the directional flow of information, we interpreted both to also be broadcast networks. Like the LL archetype, HH and LH had the characteristic shape of a broadcast network, with much of the information flowing in and out of one main group. Although HH and LH were broadcast networks, they were both more interconnected than the LL archetype. The HH archetype network had the highest number of connections between groups (n = 39) and LH had the second most interactions between groups (n = 30) of all the archetype networks. While the HH and LH broadcast networks were more interconnected than LL, the high number of groups resulted in a similarly low density (0.01).
The HL archetype network involved far fewer groups (n = 5) than any other. Each of these groups was medium sized, and four of the five had information flowing among them. The size of the groups and the flow of information between them created a structure that is unique to Tight Crowd networks, a situation where participants in groups are tightly connected to each other for ideas, information, and opinions (Himelboim et al., 2017). Because there were so few groups, and most of them were interconnected, the density of the network (0.02) was higher than that of the other archetype networks.
Discussion
While we are certain our study makes substantial contributions to the understanding of topic archetypes as community processes and provides evidence for indicating the health of an online community, this study is not without its limitations. Indeed, the focus on Twitter is itself a limiting factor: only 22% of adults in the United States use Twitter, and those who use it do not use it frequently, meaning that a large segment of people who may be interested in and contributing to paleontology are excluded from our analysis (Pew Research Center, 2019b). In addition, social media’s fluid and dynamic nature can make replication studies challenging because data collection methods shift as algorithms, application programming interfaces (APIs), and scraping methods change. More research that takes into account multiple scientific communities as well as multiple platforms could provide a better understanding of scientific online communities. With these limitations in mind, we situate our results within the larger corpus of literature and describe how our research can illuminate new directions for designing, developing, and evaluating sustainable scientific online communities.
Topic Archetypes as Expressions of Community
Drawing from Himelboim and colleagues’ (2017) typologies of Twitter topic-network structures, we interpret the results of our study and propose archetypes that are related to the states of CoP development from Wenger and colleagues (2002). Specifically, we offer an interpretation and contextual renaming of the four distinct topic archetypes as expressions of behavior that are indicative of community processes occurring within the Twitter network (Table 3). With both piques of interest and sustained conversation, the HH archetype exemplifies Sustainable Stewardship: sustained momentum created by relevant, cutting-edge, domain-specific topics that were both “lively and engaging” (Wenger et al., 2002, p. 104). The HL archetype, characterized by little more than a singular spike of extreme activity, exemplifies Coalescing Community: having similar interests, focus, and knowledge, but scarce energy for assimilation. The lack of piqued activity but consistent conversation of the LH archetype exemplifies Mature Membership: members clarifying the community’s “focus, role, and boundaries” while developing “a comprehensive body of knowledge [that] expands the demands on community members” (Wenger et al., 2002, p. 97). Finally, the LL topic archetype, which lacked both piqued activity and sustained conversation, exemplifies Potential Practice: connecting on common grounds, an online community grappling with uniting their “heartfelt interests” (Wenger et al., 2002, p. 71) into something that aligns with that of the whole. The contribution of this study is to advance the CoP theoretical framework by examining how topic archetypes provide empirical evidence for detecting and indicating the life cycle of development, change, and alteration of an online community.
While these interpretations were appropriate for this online, social world, we foresee future research directions in determining if the CoP stages of development can be utilized to orient interpretations of community processes within other CoPs.
Table 3. Community Interpretation of the Four Archetypes.
HH = high Magnitude–high percent Occurrence; HL = high Magnitude–low percent Occurrence; LH = low Magnitude–high percent Occurrence; LL = low Magnitude–low percent Occurrence.
The Potential Practice (LL) archetype featured topics that engaged limited members, correlating to the CoP development stage of potential (Wenger et al., 2002). We postulate that topics within this archetype limit community development as they do not allow for newcomers to enter easily nor do they encourage widespread activity among other members (Kraut et al., 2012). Such archetypal aspects relate to how social worlds have been utilized to broadcast scientific research (Bex et al., 2019). Many scientists have taken to using Twitter to publicize new research findings (Côté & Darling, 2018; Didegah et al., 2018; Vainio & Holmberg, 2017); however, much of this research is laden with jargon and discussed with targeted groups, thus there is little opportunity for communication of this nature to sustain interest or interaction beyond a narrow few (Carlson & Harris, 2020; Shuai et al., 2012). Our research shows that within online scientific spaces, some topic archetypes are engaged with less frequently and by limited numbers of individuals. Engagement by a diverse community with the scientific, online, social world can be expanded via the employment of targeted topics based on anticipated responses such as those that include opportunities for members to share their interests. The topic archetypes identified here provide a means for understanding the impact of such an effort.
Archetypes that illustrated opportunities to share interests included Sustainable Stewardship (HH) and Mature Membership (LH). These archetypes are directly correlated with two mature stages of CoP development (Wenger et al., 2002), specifically maturing and stewardship. Within these CoP development stages, the focus of the community is shifted toward growth, change, and integration. Topics of the Sustainable Stewardship (HH) and Mature Membership (LH) archetypes highlighted community members interacting with comprehensive bodies of knowledge and taking part in “lively and engaging” sustained activity (Wenger et al., 2002, p. 104). In addition, the presence of such archetypes provides evidence for the health of online, scientific communities. Sustained activity is indicative of both identity- and bond-based attachment within the community (Kraut et al., 2012). These attachments imply commitment to the community’s purpose (identity-based attachment) and to members of the community (bond-based attachment). The description of these archetypes based on the CoP stages of development and their connection to different types of attachments is important, as future researchers can employ such archetypal descriptions when exploring other scientific, online communities. If topic archetypes are indicative of topics engendering different patterns of communication within the community, then we conjecture that targeting the production of certain archetypes through designed strategies offers the potential for growing and sustaining online communities.
The Potential of Using Diversity Indices for Research and Evaluation of Online Communities
The members of @projectfossil’s online social world were varied; through member classification using the PIT, we show that regardless of interest, diverse members of the social world participated in and contributed to various topic archetypes. Previous research has described network participation holistically with a secondary focus on centralized connectors (Brown et al., 2016; Gruzd et al., 2016; Himelboim et al., 2017; Smith et al., 2018) or on the geographic location of where tweets originated (Pruss et al., 2019). Thus, when network participants are considered, they are broadly described as nodes or in terms of their connectivity to one another. Some exceptions exist in which community members are defined based on members’ narrative representations of self-identity with a domain (i.e., their Twitter biographies) (Bex et al., 2019; Kimmons & Veletsianos, 2016; Rosenberg et al., 2020; Vainio & Holmberg, 2017). However, this research is still emergent and little has been done regarding classification. Our method of classifying members via the PIT can be modified and used in future studies as training data for machine-assisted classifications.
Twitter is promoted as an effective science communication pathway where scientists can connect to the public (Bombaci et al., 2016; Van Noorden, 2014). In this online social world centered on the domain of paleontology, it is difficult to argue for Twitter’s effectiveness as a science communication medium as we did not find any topic archetypes that included a majority of Public participants. Within the online, social worlds of scientists, there is evidence for unsustainable activity and filter bubbles (Flaxman et al., 2016). The Potential Practice (LL) and Coalescing Community (HL) topic archetypes included high percentages of Scientists; these topic archetypes did not sustain activity. This finding expands work from Côté and Darling (2018) who found that a majority of scientists who used Twitter had mutual following relationships with other scientists unless their follower counts surpassed 1,000. To circumvent filter bubbles and circular conversation, our research indicates that diverse participants who can fulfill multiple roles need to be included to sustain interest-based activity (Wenger et al., 2002). Furthermore, such diversity of perspectives can help to remove filter bubbles (Min & Wohn, 2020) and increase knowledge generation (Didegah et al., 2018; Lei & Xin, 2011; Rosenberg et al., 2020).
Within the Mature Membership (LH) and Sustainable Stewardship (HH) topic archetypes, Education and Outreach members made up the majority of participants. We infer this to mean that some topics, like those in the Sustainable Stewardship (HH) archetype, allow for diverse participants to contribute to and participate in social paleontology for extended periods of time. This finding aligns with previous research on online paleontological social worlds (Bex et al., 2019) in which Education and Outreach members were able to connect across the network. In this study, increased numbers of Education and Outreach participants corresponded with archetypes in which sustained activity occurred.
Our method of applying diversity indices to a social network allows us to account for community membership. This study has shown that quantifying and qualifying diversity within a community is possible; such diversity affects the longevity and health of Twitter topic archetypes. While evidence for the importance of diversity indices exists in ecological studies across space and time (Patzkowsky & Holland, 2007; Stigall et al., 2017), such indices have, to the best of our knowledge, never been applied to an online social world. Pohjola and Puusa (2016) have suggested that community participation and group dynamics shape CoPs in that members’ interests can become dispersed and growth creates different roles. Additional studies that apply diversity indices to such online social worlds are needed to explore how low, medium, or high diversity of members can affect conversations and activity.
Exploring Topics as a Way to Grow and Sustain the Online Community Life Cycle
The evolution and expression of a conversation over time within an online community is the essence of digital practice. Yet, our capacity to understand this phenomenon has been limited by the availability of tools and techniques for connecting the key characteristics of people with such activity in meaningful ways. Previous research has mainly considered network characteristics and structure as the means for analysis and inference (Himelboim et al., 2017), offering an important but limited perspective. By connecting the expression of conversation topics through topic modeling with the characteristics of participants, we were able to contextualize the network in a way that offers new insight about digital expressions of behavior that are indicative of community processes and provides empirical evidence for detecting and indicating the health of online communities.
In this study, community network structures showed sparse connectivity. Others have argued that sparse network structures facilitate diffusion of ideas among groups (Behfar et al., 2018) and that loosely connected networks benefit from entities that can act as central connectors (Ergün & Usluel, 2016). Networks emerge from basic principles of community, and thus can be explored via the CoP theoretical framework. From a CoP perspective, the egocentric network within this study could be acting as an effective CoP, as a loosely connected network with multiple groups can imply that more members are involved at varied positions of participation (Wenger et al., 2002). Our work has determined the archetypes, composition, and community network structures within an online social world within the domain of paleontology; future research should examine other scientific social networks to see if patterns vary depending on the domain.
Conclusion
Our findings demonstrate that distinct topics featuring a diverse assemblage of members have varied impacts on an online social world that was centered on paleontology. These impacts depended on topic composition—topics with greater magnitude and higher percent occurrence were associated with more diverse member composition and specific P3T post types (i.e., Opportunities and Information). While others have recognized how network structures (Himelboim et al., 2017), topics (Nikolenko et al., 2017), and member composition (Britt & Paulus, 2016; Xie & Luo, 2018) independently can be applied to online social worlds, our study shows that a time series approach, content analysis, and machine learning techniques such as topic modeling can be applied to understand and predict the ways that community members contribute to and participate in interest-based activities in online social worlds. In addition, this study provided a functional conceptual model for understanding and interpreting patterns of behavior as topic archetypes and a way to explore relationships between the nature of participants and the social network; we see the potential to further explore and replicate results within other scientific fields as well as within educational research.
Footnotes
Data Accessibility Statement
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical Approval
All data collected complied with Twitter’s terms of service, as laid out by Twitter’s Developer Agreement and Policy, Section C entitled Respect Users’ Control and Privacy. This research included Institutional Ethics Review and approval (UF-IRB202002652).
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was funded in part by the National Science Foundation under Grant No. DRL-1322725. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
