Abstract
Influencer marketing spending in the United States was expected to surpass $6 billion in 2023. This marketing tactic poses a public health threat, as research suggests it has been utilized to undercut decades of public health progress—such as gains made against tobacco use among adolescents. Public health and public opinion researchers need practical tools to capture influential accounts on social media. Utilizing X (formerly Twitter) little cigar and cigarillo (LCC) data, we compared seven influential account detection metrics to help clarify our understanding of the functions of existing metrics and the nature of social media discussion of tobacco products. Results indicate that existing influential account detection metrics are non-harmonic and time-sensitive, capturing distinctly different users and categorically different user types. Our results also reveal that these metrics capture distinctly different conversations among influential social media accounts. Our findings suggest that public health and public opinion researchers hoping to conduct analyses of influential social media accounts need to understand each metric’s benefits and limitations and utilize more than one influential account detection metric to increase the likelihood of producing valid and reliable research.
Over four billion people globally are social media users, and projections estimate there will be nearly six billion by 2027 (Statista, 2022). Currently, over 80% of the United States population uses social media (Triton Digital, 2021). As a result, marketers now use social media to reach target audiences and potential consumers. Spending on social media marketing, including paid advertising, owned accounts, earned media, and influencer marketing, was expected to exceed well over $50 billion in 2022 (Williamson, 2020). Recent data indicates that influencer marketing consistently outperformed other revenue-boosting tools on social media platforms, such as embedded advertisements and search engine optimization, regarding return on investment (SocialPubli, 2019). While brands often struggle to create engaging social media content, influencers specialize in capturing niche audiences, creating trends, and encouraging followers to buy the products they promote (Campbell & Farrell, 2020). The worldwide influencer marketing industry was expected to surpass $21 billion in 2023 (Geyser, 2023), with an estimated $6.16 billion spent on influencer marketing in the United States alone (Lin, 2023).
A social media influencer is defined as a person (user) who has “the power to affect the purchasing decisions of others because of his or her authority, knowledge, position, or relationship with his or her audience” (Geyser, 2022, para. 2). Social media influencers actively engage with distinct audience niches on social media platforms (Geyser, 2022), and influencers maintain active relationships and interactions with their followers, representing homogeneous consumer segments to marketers (Leung et al., 2022). These follower interactions allow influencers to leverage the network effects of social media platforms by indirectly reaching larger audiences via peer mediation (engagement or sharing of the influencer post), which creates a persuasive heuristic to pay attention to the content (Anspach, 2017; Bene, 2017).
Influencer Marketing in the Tobacco Space
Tobacco companies have been using influencer marketing to undercut the decades of public health progress against tobacco use among adolescents (Johnston et al., 2019), and influencer marketing was a core strategy in JUUL’s meteoric rise in popularity in 2018 (Jackler et al., 2019; Kaplan, 2018). Commercial tobacco interests appear to mobilize on social media platforms (N. A. Silver et al., in press), and many tobacco companies hire young social media influencers to promote tobacco use as normative to an attractive or aspirational lifestyle (Czaplicki, Kostygina et al., 2020; Hejlová et al., 2019; J. Liu et al., 2020; Navarro et al., 2021). Often, these influencers have been discouraged by tobacco companies from disclosing that they were engaged in paid advertising (Campaign for Tobacco Free Kids, 2018), for which the Federal Trade Commission requires #ad or #sponsored labels to ensure youth recognize the content as an advertisement (Klein et al., 2020).
A growing body of research suggests that the exponential growth of JUUL e-cigarette-related influencer content across social media platforms was significantly correlated with the exponential uptake of JUUL product use among U.S. youth (Chu et al., 2018; Czaplicki, Kostygina, et al., 2020; Fadus et al., 2019; Huang et al., 2019; Jackler et al., 2019; Johnston et al., 2019). In 2018, JUUL undertook several voluntary actions putatively aimed at reducing JUUL-related content on social media, including eliminating its own social media accounts and monitoring and removing “inappropriate material from third-party accounts” (Scutti, 2018). Nonetheless, these voluntary measures affected no meaningful change in the amount or content of JUUL-related content on Instagram (Czaplicki, Tulsiani, et al., 2020). In 2019, over 125 public health and other organizations from 48 countries collectively called upon social media companies to prohibit the promotion of tobacco products on their platforms, including promotions by influencers (Campaign for Tobacco Free Kids, 2019). In response, Facebook and Instagram announced they would no longer allow content advertising vaping and tobacco products (Azad, 2019). Most social media platforms banned paid tobacco advertising; however, few platforms have policies that address novel strategies such as sponsored and influencer posts—content where companies collaborate with influencers to endorse products to their followers in exchange for compensation or other incentives—and few require age-gating to prevent youth exposure to tobacco promotional content (Kong et al., 2022). While the Food and Drug Administration (FDA) now requires marketers of newer products, like IQOS, to report using partners, bloggers, brand ambassadors, and influencers to promote their products (U.S. Food and Drug Administration [FDA], Center for Tobacco Products, 2020), the proliferation of international influencer accounts which challenges enforcing this policy by making it more difficult for regulatory authorities to monitor and ensure compliance when influencers are based in different countries with varying regulations and standards.
The amount tobacco companies spend on influencer marketing remains unknown. However, in 2021, the Bureau of Investigative Journalism in the United Kingdom revealed that British American Tobacco—the second-largest tobacco company in the world (Kolmar, 2023)—had recently launched a £1 billion social media influencer campaign to harness influencers on TikTok, Instagram, and Facebook to promote their tobacco products—mainly to youth—in countries including Kenya, Pakistan, Sweden, and Spain (Chapman, 2022).
Operationalizing Influential Account Detection for Tobacco Control Research
To date, tobacco social media researchers have prioritized conducting research to identify social media influencers promoting tobacco use (Gu et al., 2022; Hejlová et al., 2019; Kostygina et al., 2016; Navarro et al., 2021; Vassey et al., 2023; Wu et al., 2022), assess the reach and engagement of influencers in the tobacco space (Gu et al., 2022; Kostygina et al., 2020; Kostygina et al., 2016; Vassey et al., 2023; Wu et al., 2022), characterize the content of pro- and anti-tobacco influencer posts (Gu et al., 2022; Hejlová et al., 2019; Kostygina et al., 2016; Navarro et al., 2021; Ramadhan & Bigwanto, 2022; Vassey et al., 2023; Wu et al., 2022), describe audience engagement with influencer posts about tobacco (Gu et al., 2022; Hejlová et al., 2019; Kostygina et al., 2016, 2020; Vassey et al., 2023), explore the efficacy of potential countermeasures to pro-tobacco influencer posts (Klein et al., 2020, 2022; Vogel et al., 2020), and measure influencers’ influence on the tobacco attitudes and behaviors (Ramadhan & Bigwanto, 2022; Vogel et al., 2020). Researchers have used three broad types of influencer metrics in tobacco work: simple descriptive metrics (i.e., follower count) (Blakemore et al., 2020; Bonnevie et al., 2020, 2021; Borgmann et al., 2016; Drescher et al., 2021; Edgerton et al., 2016; Gough et al., 2017; Harding et al., 2019; MacKay et al., 2022; Patel et al., 2022; Zou et al., 2021), network metrics (i.e., degree centrality) (Jain & Sinha, 2020; I. Kim & Valente, 2021; Lutkenhaus et al., 2019; Moukarzel et al., 2020, 2021; Munoz-Acuna et al., 2021; Zheng et al., 2021), and index metrics (Gallagher et al., 2021).
Simple descriptive metrics serve as a “quick and dirty” way to identify influential accounts on social media. These metrics often exist in the metadata and do not require advanced analytical methods to interpret. Simple descriptive metrics can take form as follower influence (a user’s number of followers, indicating audience size), mention influence (the number of @user mentions, indicating the user’s ability to engage or incite others into conversation, not including mentions in retweets to avoid conflating scores for mention influence and repost influence), and repost influence (the number of times a user’s message is reposted, indicating the ability of the user to create shareable content). Emerging research suggests diverging effects between popularity metrics (i.e., follower counts) and marketing effectiveness. For example, micro-influencers (10k–100k followers) are often considered more trustworthy and authentic by audiences than mega-influencers (> 1 million followers) (Enberg, 2018; Main, 2017). Another significant weakness of simple descriptive metrics is their vulnerability to validity issues resulting from fraudulent or misleading follower and engagement numbers (i.e., influencers paying for followers, likes, or shares) (Zote, 2019).
Social media researchers use network metrics to capture sources and flow of information through social networks. Network metrics allow interpretation of the users’ distinct roles in media conversations. I. Kim and Valente argue that there are three types of influential users on X (formerly Twitter) represented by conventional network metrics: information sources (represented by in-degree centrality), information disseminators (out-degree centrality), and information brokers (betweenness centrality) (I. Kim & Valente, 2021). Thus, network metrics inherently represent varying functions of influential accounts on social media.
Index metrics offer researchers a tool to identify influential social media accounts by detecting sustained amplification of messages or posts from a specific user around a particular topic. Public health scholars have created indexes to operationalize social media influence (Gallagher et al., 2021; Jain & Sinha, 2020). However, just as often, researchers conducting influencer studies lean on third-party tools designed by social media analytic companies for the same purpose (Albalawi & Sixsmith, 2017; Edgerton et al., 2016). Unfortunately, there is a lack of scholarly consensus over which third-party index metric best detects influential accounts on social media, best illustrated by the variation in metrics utilized in the extant literature (Albalawi & Sixsmith, 2017; Edgerton et al., 2016). Further, third-party tools threaten the replicability and transparency of influencer research as the algorithms utilized by these black-box proprietary tools to identify accounts are not shared with researchers (Albalawi & Sixsmith, 2017; Edgerton et al., 2016; Gallagher et al., 2021; Jain & Sinha, 2020).
Researcher-created index metrics, such as T factor (Bornmann and Haunschild, 2015) and h-index (Gallagher et al., 2021), are more transparent influential account detection metrics. The h-index was initially created to measure academic citations (Hirsch, 2005) and has since been adapted by Gallagher et al. to identify influential accounts, or “elites,” who receive sustained amplification in a specific conversation (i.e., COVID-19) on social media (Hirsch, 2005). Retweets are used to calculate the h-index rather than likes or favorites because the original h-index metric was designed to assess the impact and influence of scholarly publications, typically measured through the number of citations received. A higher h-index score represents the consistent production of many posts and notable attention paid to each post by audiences. The emphasis on consistency also minimizes any statistical noise created by spam or outlier viral posts.
The Need to Assess Available Tools to Detect Influential Accounts on Social Media
Public health and public opinion researchers need practical tools to capture influential accounts discussing tobacco on social media and promoting tobacco use among impressionable youth and young adult audiences (Brien et al., 2020; Gu et al., 2022; Hejlová et al., 2019; Hickman, 2021; Navarro et al., 2021; Nedelman & Selig, 2018). However, before we can efficiently and reliably capture influential tobacco accounts on social media, we need to understand what approaches are available for detecting influential accounts on social media. Therefore, this study represents the vital first step in evaluating the existing methodological tools for detecting influential social media accounts. Our evaluation uses tobacco marketing content as a use-case but is generalizable across marketing, social marketing, and public opinion research domains.
As the extant social media influencer literature regularly uses “influencer” as a catch-all term for distinctly different online personas, including celebrities and customer product reviewers(Leung et al., 2022), we aim to help clarify the difference between describing a user account as an “influencer” and being able to detect influential accounts on social media. For example, an influential account on social media vested in manipulating public opinion on tobacco policy is qualitatively different from an entertainer on social media doing a viral dance holding a JUUL e-cigarette product to promote the product to their audience.
Agenda-setting theory (McCombs et al., 2013) posits that the media can shape public opinion—and, by extension, behavior—by influencing what topics get the most attention (McCombs et al., 2013). Using this agenda-setting lens and guided by the work of I. Kim and Valente (2021), we predict that the differing influential account detection metrics will detect influential accounts strategically guiding or participating in differing conversations (i.e., discussing tobacco policy versus promoting tobacco products) within their respective communication spheres. In other words, we predict that the differing analytical metrics (simple, network, and index metrics) will capture distinctly different influential tobacco accounts on X. The extant literature has established that influential accounts on X serve as sources, disseminators, and brokers of information, helping to guide different concurrent conversations (I. Kim & Valente, 2021). Therefore, we predict that our analyses will help to reveal different types of “influencers” that serve as influential accounts in respective communication spheres (e.g., LCC policy, LCC lifestyle, and LCC product promotion discussions). We also predict that the differing influential account metrics will detect influential accounts strategically guiding or participating in differing conversations (i.e., discussing tobacco policy promoting tobacco products) within their respective communication spheres.
This research will help explicate the existing influential account detection metrics using real-world data focusing on a public health issue and, in doing so, clarify our understanding and highlight the many forms that influential accounts can take, such as those in entertainment (i.e., celebrities and social media stars) (Rosenstein, 2017), politics (i.e., commentators and politicians) (Coulter, 2020), or news media (i.e., journalists and news media company accounts) (CNN, 2022). These findings will also help social media researchers delineate which influential account detection metrics are best suited to address the aspects of social media influence they hope to address in their research—including, but not limited to, detecting likely undisclosed tobacco influencers on social media platforms. Given the exponential growth in spending on tobacco influencer marketing (Chapman, 2022; Kaplan, 2018), tobacco control researchers must integrate the extant operational definitions and delineate the differences between “being an influential account on social media” and “being an influencer” to address the conceptual overlap and to advance our understanding of social media influence. As the current social media influencer literature regularly uses “influencer” as a catch-all term for distinctly different online personas, including celebrities and customer product reviewers (Leung et al., 2022), we aim to help clarify the difference between describing a user account as an “influencer” and being able to detect influential accounts on social media.
Therefore, our study aims to address the following hypothesis and research questions:
Research Question 1 (RQ1). Do simple descriptive, network, and index metrics capture distinctly different types of influential accounts on X (formerly Twitter)?
Research Question 2 (RQ2). How dynamic are the influential account detection metrics within the data time frame every week?
Research Question 3 (RQ3). Which categories of influential accounts are most prominent in shaping the discourse in the LCC X space?
Research Question 4 (RQ4). How do content features (use of hashtags, hyperlinks, n-grams) in the social media messages (tweets) posted by accounts captured differ across the seven influential account detection metrics?
Method
Data Collection
To conduct these analyses, we first identified X (formerly Twitter) posts (original tweets and retweets) relevant to LCC products and used posts from the 3 weeks following the FDA’s menthol ban announcement (April 29, 2021, to May 21, 2021) (U.S. FDA, 2022). We selected the FDA menthol ban announcement as a time period for data collection because it was the most recent tobacco control milestone, which would generate public discussion at the time of data collection (American Lung Association, 2023). X was ideal for this influencer detection research because other social media platforms limit researchers’ ability to extract user-level data (i.e., user descriptions). However, we plan to adapt this methodological evaluation design to other popular platforms (i.e., Instagram and TikTok) in future research. We are focusing on LCC posts because previous tobacco research on X has identified prominent celebrities and community accounts in LCC promotion (Kostygina et al., 2016, 2017). Further, we are focusing on LCC posts because research indicates that cigar product flavor bans may reduce tobacco product co-use among young adults (Glasser et al., 2023)—one of the target populations for big tobacco on social media.
We collected 647,465 Tweets from April 29, 2021, through May 21, 2021, that matched any of our LCC search rules via X’s Historical PowerTrack (Twitter, 2023). Our LCC search results and search terms—including terms (e.g., cigar and cigarillo), brands (e.g., Black & Mild and ZigZag), and behaviors (e.g., “smoke a stogie” and “need a blunt”) are available upon request. Historical PowerTrack allowed access to full public tweets that matched the search rules. After we detected relevant posts, utilizing the approach detailed below, we used post metadata to remove the duplicate tweets and then prepared the analytic dataset for qualitative human coding, quantitative summary statistics, and more.
As with any search rules, tweets irrelevant to LCC were collected (Y. Kim et al., 2016). For example, Phillies Cigars, a brand of traditional cigar in the United States, was captured via our keywords; however, the Phillies were also a prominent baseball team. We developed a supervised machine classifier to identify LCC-relevant posts and exclude non-relevant posts using the team’s standard procedures (Czaplicki, Kostygina, et al., 2020; Kostygina et al., 2016, 2020). The development of the classifier consisted of training a model on a human-coded dataset extracted using LCC keywords. Two coders assessed a randomly selected subset of posts for their relevance to LCC, considering both the visual and linguistic aspects of each post. The sampling process was stratified by the week of posting to ensure that the training dataset encompassed the entire duration of the study. The final classifier was the result of stochastic gradient descent learning with modified Huber loss function and elasticnet regularization. This process aimed to ensure the inclusion of only tweets directly relevant to the topic of interest, in this case, little cigars and cigarillos. We removed 287,853 posts (42.7% of all “raw data” posts collected) for being classified as non-relevant, leaving us with a final analytical sample of 386,612 X posts. The classifier achieved an F1 score of 0.93 (precision 0.91, recall 0.94) from five-fold cross-validation on the training set; and an F1 score of 0.92 (precision 0.89, recall 0.95) on the held-out test set, indicating a high-performance classifier that would provide reliable predictions (Y. Kim et al., 2016).
We also considered the presence of bot-generated content. While our primary focus was on classifying and removing non-relevant posts to LCC discussions, it is worth noting that robotic accounts and automated messages could play a significant role in promoting tobacco products and related policies on social media (X. Liu, 2019). These bots contributed to the information flow in this domain, making it important to distinguish their presence. Therefore, we utilized SGBot (Yang et al., 2020, 2022) to differentiate likely X bot accounts from non-bot accounts and to report the proportion of likely bot-generated tweets by each detection method.
Influential Account Detection Metric Construction
The simple descriptive metrics (follower influence, mention influence, and repost influence) were constructed by aggregating totals for each measure across the selected period. These metrics were created using standard X (formerly Twitter) metadata.
The h-index metric was constructed following Gallagher et al. (2021) using the Python package scholarmetrics. The h-index is a measure that identifies the largest number “h” such that there are “h” tweets, each of which has been retweeted at least “h” times. In this context, retweets are considered akin to citations in traditional scholarly impact metrics.
Our network measures were based on a retweet network, which allows for direct comparisons with the other retweet-based metrics (h-index, simple descriptive metrics). The retweet network structure created a directed network with connections (i.e., network edges) made when one user retweets another user’s LCC-relevant post. The retweet network dataset was generated using the Python networkx package (NetworkX, 2023), with each row containing the user retweeting, the user being retweeted, and a timestamp/identifier of the occurrence. We used this data to calculate three network centrality metrics: in-degree (information sources), out-degree (information disseminators), and betweenness centrality (information brokers).
After constructing the seven influential account detection metrics, we ranked the accounts by each metric and identified the top 100 as potential influential accounts for each metric (n = 700). After removing the duplicate influential accounts which appeared across more than one metric, the final sample of influential tobacco accounts was coded (n = 565).
Influential Account Category Coding
As guided by the real-time infoveillance of Twitter health messages framework (Colditz et al., 2018), we utilized three X (formerly Twitter) coders with extensive experience analyzing tobacco conversations on X to code influential account types. Coders reviewed influential accounts using basic textual information from their posts and user metadata. The coders also reviewed the influential accounts on the X website as needed.
The three coders categorized each influential account identified from the seven influential account detection metrics into 1 of 15 mutually exclusive categories describing the type of influential account. We used a grounded theory (Strauss & Corbin, 1997) approach to create and refine these 15 account-type categories. The three coders first determined if each tobacco-related influential account represented an Individual or Non-Individual/Group/Organization to aid the subsequent category coding. Each influential account was then coded as one of the categories in Table 1. It is also worth noting that the X verification process changed since these data were collected, which we considered when categorizing users and interpreting the results. Our findings regarding verified and unverified accounts reflect the verification status before X changed verification to a paid service on November 9, 2022. Our coding manual can be found in Supplemental Online Table 1.
Influential Account Coding Category Descriptions.
Note. LCC = little cigar and cigarillo; CDC = Centers for Disease Control and Prevention.
A random 16% of the total sample (88 of 565 accounts) was coded independently by each coder to test interrater reliability (Riff et al., 2014). Reliability analyses indicated a high level of agreement between coders for the influential account category analysis, with Gwet’s AC1 (Gwet, 2008) exceeding 0.90 across all coding categories. All three coders coded all accounts in the random sample, and disagreements were resolved by group discussion.
Message Content Feature Analysis
Two coders reviewed LCC-relevant tweets made by influential accounts to code content themes. We performed Natural Language Processing (Chowdhary, 2020) to identify the most-occurring features related to a thematic analysis by reviewing top hyperlinks, generated user-level descriptives (e.g., follower, friend, and post counts), engagement statistics, and n-grams. N-grams were used for coding salient themes. N-grams connected sequences of n items (words) from a given sample of text or speech (Broder et al., 1997). We examined top unigrams, bigrams, and trigrams and found that the trigrams were the most interpretable for salient content themes, consistent with previous social media tobacco research (Benson et al., 2020; Czaplicki, Kostygina, et al., 2020; Malik et al., 2021; N. Silver et al., 2022). Salient themes for each of the seven influential account detection metrics were identified using a grounded theory (Strauss & Corbin, 1997) approach from trigram analysis and coded by consensus agreement between two coders.
Statistical Analyses
We evaluated RQ1 by utilizing Jaccard similarity coefficients to compare differences between the top 100 accounts captured by the seven different metrics across the 3 weeks. We evaluated RQ2 by utilizing Jaccard similarity to compare the within-metric differences/changes of top accounts across the 3 weeks by comparing pairs of adjacent weeks (Week 1 and Week 2 vs. Week 2 and Week 3) to evaluate how dynamic each metric was during the three weeks. RQ3 was evaluated by comparing which influential account category type (e.g., entertainer, political commentator, media network, etc.) appeared most often across the seven detection metrics and, in total, across the study period. RQ4 was evaluated by cross-examination of results from both unsupervised (i.e., LDA) and traditional (e.g., n-grams and top tweet features) thematic analysis methods performed on the data.
Ethical Considerations
Human subjects were not involved in this study. Our inclusion of public X messages and metadata (including usernames) in our analyses and publication follows X’s Terms of Service.
Results
RQ1 explored if simple descriptive, network, and index metrics captured distinctly different types of influential accounts on X. The highest similarity coefficient was between the retweet and out-degree centrality (“disseminator”) metrics (Jaccard similarity = 31; n = 47 mutual accounts captured out of a possible 100). The out-degree (“disseminator”) and betweenness (“broker”) centrality metrics (Jaccard similarity = 13; n = 23 mutual accounts captured out of a possible 100) were the only other metrics with a similarity coefficient above .10, indicating that there was minimal overlap between the metrics. There was no overlap in accounts captured between in-degree and retweets, out-degree and followers, and out-degree and retweets. Further, no single influential account was captured by all seven metrics. As seen in Table 2, our results suggest that the extant influential account metrics capture distinctly different accounts.
Pairwise Jaccard Similarity Coefficients Across the Seven Influential Account Detection Methods.
Note. “Sources” = In-Degree Centrality. “Disseminators” = Out-degree Centrality. “Brokers” = Betweenness Centrality.
RQ2 explored how dynamic the influential account detection metrics were across the 3 weeks following the FDA’s announcement of the menthol ban. Within each metric, we assessed the week-by-week dynamics of accounts captured by each metric by comparing the Jaccard similarity between their top 100 accounts over the 3 weeks (comparing Weeks 1 and 2 vs. Weeks 2 and 3). As seen in Table 3, our findings indicated that Jaccard similarity increased slightly for all metrics except the mention metric. The retweet metric had Jaccard similarity coefficients near 40 across both weeks, indicating less turnover in the top 100 accounts, compared to the other metrics of influential accounts captured across this period.
Jaccard Similarity Across the 3 Weeks Following the FDA Menthol Ban Announcement.
Note. FDA = U.S. Food and Drug Administration.
By contrast, the in-degree centrality (“source”) and betweenness centrality (“broker”) metrics were very dynamic, with Jaccard coefficients across this period generally below 10, indicating a lot of turnover in the accounts that each metric was capturing. In contrast, our findings suggest that retweets and mentions changed the least every week. At the same time, there was more turnover among the metrics capturing accounts initiating connections—in-degree centrality (“sources”) and betweenness centrality (“brokers”)—changed the least.
RQ3 explored which categories of influential accounts were most prominent in shaping the discourse in the LCC X space in the three weeks following the FDA menthol ban announcement. Table 4 shows the top 5 account categories by influential account detection metric. As seen in Table 4, across all accounts captured across the index, network, and simple metrics, unverified Other (organic/unclear) (19.6%), unverified NSFW (17.9%), and unverified LCC Users were the most prominent account-type categories. Among all accounts analyzed, 21.8% were deleted, suspended, or inaccessible, and 15.8% of the captured accounts did not post any LCC content during the study period and were thus coded as non-relevant. NSFW accounts were most likely to be classified as bot accounts. About one-quarter (24.6%) of the NSFW accounts were flagged as likely bot accounts, unverified X accounts (using X’s previous verification system) with an SGBot probability score above .80. NSFW was in the top 3 categories for five metrics (h-index, in-degree [sources], out-degree [disseminators], mentions, and retweets), ranging from 14.5% of accounts to 27.1%. Other accounts appeared in the top 3 account types for four metrics (h-index, in-degree [sources], out-degree [disseminators]), and LCC community/user accounts tended to be more prominent in the simple metric samples (mentions = 40.2%; retweets = 23.3%) than in the network metric samples (out-degree [disseminators] = 9.4%; betweenness [brokers] = 6.2%).
Top 5 Account Category Types by Influential Account Detection Metric Among Relevant Active Accounts.
Note. Only top account categories are reflected, so categories do not add up to 100%. There were no Anti-LCC or Pro-LCC accounts captured in the total sample. Based on X (formerly Twitter)’s outlined criteria, accounts that received a blue verified badge before November 9, 2022, are considered authentic, notable, and active. “Total” accounts have been de-duplicated. Relevant active accounts include unverified accounts labeled as likely bots (SGBot probability score above .80). For the “Sources” detection metric, the prevalence of the fifth and sixth most common categories, Other Commercial and News Media, each made up 7.3% of relevant active accounts. For the Follower detection metric, all subsequent account categories (Other Commercial, Political/Government, Political Commentator, Health/Medical, and Unverified Entertainment) comprised 1.8% of relevant active accounts. LCC = little cigar and cigarillo; NSFW = pornographic or adult; “Sources” = In-Degree Centrality; “Disseminators” = Out-degree Centrality; “Brokers” =betweenness Centrality; U.V. = unverified account; V = verified account.
Our findings indicate that the follower metric—the most used metric in most social media research—captured distinctly different influential account types than all other metrics. Most of the accounts that the follower metric captured appeared to be brands or organizations (73.5%) rather than individuals (26.8%), while total relevant accounts across all metrics were primarily individuals (80.0%). The follower metric was also the only metric for which verified News Media accounts were among the top 5 account types. These accounts were the majority captured by the metric (51.8%), while verified News Media accounts do not appear in the top 5 account types for any other metric (0% to 5.1% for all other metrics). The second most prominent category for the follower metric is verified Entertainment accounts (39.3%). While verified Entertainment accounts are the second most prominent account type for the follower metric (39.3%), unverified Entertainment accounts are more prominent among the remaining metrics, particularly network and index measures (11.9% to 26.6%). Unverified Entertainment accounts appear in the top 5 categories for all metrics except mentions and followers, indicating that only network or index metrics were detecting them efficiently. While all other metrics captured NSFW and Other (organic/unclear) accounts in their top 5 categories, the follower metric did not capture any NSFW or Other accounts. The follower metric also had the highest proportion of accounts (44%, n = 44) coded as “non-relevant,” while no other metric had more than 18% coded as non-relevant.
RQ4 explored how content features (use of hashtags, hyperlinks, n-grams) in the tweets differ across the seven influential account detection metrics. We used an inductive approach to analyze trigrams based on their content. The tri gram themes that emerged were casual blunt use, cigar promotion, menthol and flavor ban, and NSFW (adult or pornographic) content. The casual blunt use trigram theme indicated that a given metric mainly captured posts of users supporting other users’ casual use of popular blunt products (e.g., Black and Mild) and blunt preparation methods (e.g., “rolling blunts”). The cigar promotion tri gram theme indicated that a metric mainly captured commercial marketing and advertising posts promoting cigar products on X, where accounts often shared the “latest cigar news” and promoted websites where users could buy cigars from extensive collections. The menthol and flavor ban tri gram theme indicated that a metric mainly captured posts about the FDA’s announcement of menthol and flavored tobacco products. The NSFW tri gram theme indicated that a metric mainly captured posts containing NSFW content, many of which utilized cigar products and smoking as promotional aspects of the pornographic content.
Table 5 shows the content features in tweets posted by accounts captured across the seven influential account detection metrics. The follower metric captured the most accounts sharing hyperlinks (81% of tweets from the top 100 accounts provided a hyperlink). Results indicate some crossover across metrics in the salient themes that emerged based on the top trigrams. The follower and in-degree (“sources”) metrics captured influential accounts discussing the FDA menthol ban (“ban menthol cigarettes” and “menthol cigarettes flavored,” respectively), the retweet and h-index metrics captured influential accounts sharing NSFW content involving cigar culture (“join beautiful girls”), and the mentions and out-degree metrics (“disseminators”) captured accounts promoting cigar and LCC products (“eat drink smoke” and “latest cigars daily,” respectively). The betweenness (“brokers”) metric captured accounts discussing casual blunt use (“smoke pairing enjoy”). The relevant accounts captured by the follower and in-degree (“sources”) metrics used far fewer hashtags than the other metrics.
Content Features in the Social Media Messages (Tweets) Posted by Accounts Captured Across the Seven Influential Account Detection Metrics.
Note. NSFW = pornographic or adult.
Discussion
The findings from this study have broad methodological implications for social media research. Our metric similarity analysis revealed that influential account detection metrics capture distinctly different influential accounts. The analyses of types of captured accounts further illustrated the substantial differences across each metric. While there was some congruence within the network metrics, these metrics were also the most dynamic (had the most turnover in top accounts captured) across the sampled period. Our findings suggested that accounts that typically retweet others’ messages or consistently mention other users (often functions of bots) were consistent across the observed period. It is worth noting that the h-index metric is not designed for weekly analysis as it is intended to capture sustained amplification, so the observation of h-index change over time should be taken with caution.
Given that these metrics are all designed to capture and quantify influential accounts on social media, our findings reveal that influential accounts take on different forms such that social media research requires varying sensitivities when making methodological decisions. To use an earlier example, an influential account on social media vested in manipulating public opinion on tobacco policy is qualitatively different from an entertainer on social media doing a viral dance holding a JUUL e-cigarette product to promote the product to their audience. Thus, a vital takeaway from this study is that influential account detection metrics are non-harmonic and time-sensitive. The results of studies built around measures quantifying and characterizing influential accounts on social media may vary widely based on which influence detection metric the researchers select and what period they use for data collection.
Our results also revealed that these metrics capture distinctly different conversations. Our thematic content analysis of the content posted by the top captured accounts indicated a distinct difference in the types of conversations these influential accounts had. Researchers tapping into the LCC conversation after the FDA ban announcement using “sources” (in-degree centrality) and follower metrics would only be analyzing event-focused discussion during this sample period, while researchers utilizing mention or “disseminator” (out-degree centrality) metrics—designed to capture accounts making “outward” connections—would have mostly detected LCC marketing. During this same study period, researchers using retweet or h-index metrics—metrics designed to capture users whose content is being shared widely—would have discovered NSFW (pornographic or adult) LCC content. Combined with the account similarity and categorical coding findings, these results suggest that researchers studying influential social media accounts need to utilize more than one influential account detection metric to increase the likelihood of producing valid and reliable results. Further, to select the appropriate metric for analysis, researchers need to consider the type of content they are interested in (i.e., marketing vs. regular people vs. policy/advocacy) to make better-informed methodological decisions to meet their research goals.
Our findings also have significant implications for using follower count as a metric for detecting influential accounts going forward. While follower count has often been a default influencer detection measure (Blakemore et al., 2020; Bonnevie et al., 2020, 2021; Borgmann et al., 2016; Drescher et al., 2021; Edgerton et al., 2016; Gough et al., 2017; Harding et al., 2019; MacKay et al., 2022; Patel et al., 2022; Zou et al., 2021), our account-type coding indicates that the Follower metric may be ill-fitted for rigorous influential account social media analyses. The follower metric mostly captured verified News Media accounts, and it was the only metric with this category type among its top 5 account types captured. Further, the follower count metric had the highest proportion of accounts coded as “non-relevant.” Considered in concert with the growing evidence that follower count is not a valid representation of influence (Enberg, 2018; Main, 2017; Zote, 2019), we hope that our results further drive the nail into the coffin of the assertion that follower count can provide valid or reliable results on its own in research attempting to identify influential social media accounts.
We were surprised to observe that there was not much policy-centric discussion in the wake of the FDA menthol and flavor ban announcement among influential accounts discussing LCC products. While the ban was thematically represented by the top trigrams from the followers and “sources” (in-degree centrality) metrics, the general conversation around the ban was sparse. Although there was substantial similarity in the top hashtags captured by the different metrics, the top trigrams indicate that there was also substantial variation in the topics of the discussions. Our results suggest that distinctly different conversations were occurring in the messaging in the LCC tobacco space on X in the three weeks after the announcement of the FDA menthol ban, and researchers looking to explore the conversation on X occurring during this time would identify different conversations depending on which influential account detection metric they used. Unlike previous research exploring e-cigarette conversations on social media (Kostygina et al., 2021, 2022), we found zero users in the sample labeled by human coders as anti-tobacco regulation advocates among this sample. Indeed, we observed only two prominent conversations during this period—an LCC marketing conversation (e.g., promoting cigars) and an LCC behavior (norms) conversation (e.g., rolling blunts, sharing cigar suggestions and recommendations). This observation may be explained by the fact that tobacco-flavored (i.e., unflavored) products comprise most of the LCC product market (estimates indicate 50–80% market share) (Wang et al., 2021), so LCC users were less concerned with the ban than other tobacco-use product groups would be. Nonetheless, analyses exploring the prominence of differing tobacco conversations on social media are warranted.
The prominence of influential users coded as unverified Entertainment accounts in this sample is important. While the intention of this study was not to identify likely undisclosed tobacco influencers, we would like to “flag” this account type as one of interest to researchers. Social media users associate verification with “celebrity” rather than authenticity. Consumers are less likely to trust verified accounts than unverified ones (Dumas and Stough, 2022). Future research should not overlook unverified accounts when examining potential influencer accounts, as our results indicate that these accounts were among the most central accounts to the LCC discussion during our study period. It is important to note that in light of Elon Musk’s changes to the X verification system, which allows users to pay for verification badges (Conger, 2023), ideas about trust on X are likely to shift. However, verification on other platforms that do not use the pay-for-badge system likely still confers meaning.
Acknowledging the prevalence of likely bot accounts sharing and discussing contentious subjects on social media is vital. Tobacco researchers have suggested that the tobacco industry may use bot accounts to create a climate of false consensus and circulate misinformation regarding tobacco products (Jun et al., 2022). Our results indicate that almost 1 in 5 accounts captured by influential account metrics were likely bots (SGBot score > .80). Is it likely not a coincidence that the metrics that captured the most likely bot accounts (> 30% of all accounts captured) were “disseminators” (out-degree centrality) and retweet metrics, metrics specifically designed to capture accounts that are most effectively sharing messages on social media. Further, NSFW (adult or pornographic) accounts had the highest proportion of likely bot scores. The growing prevalence of porn bots on social media threatens to further hamper social media data validity (Tiidenberg, 2021). Therefore, consistent with the findings of previous social media tobacco research (Allem et al., 2017; Allem & Ferrara, 2018; Ferrara, 2018), public health and public opinion researchers conducting social media research would be well advised to utilize bot detection software to supplement their detection strategy.
Some limitations of this study are worth noting. First, while our findings provide valuable insights into how different metrics identify influential accounts discussing tobacco, we acknowledge the potential biases inherent in the data collection process. Our data collection process could potentially introduce selection bias in the sample of accounts analyzed. We collected X data during a specific time frame, from April 29, 2021, to May 21, 2021, in the wake of the FDA’s menthol ban announcement. While this timeframe was chosen to capture the most recent and policy-relevant discussions on LCC products and use, it is possible that the discourse during this period may not fully represent the broader, long-term discussions on X. Moreover, our data collection was limited to publicly available tweets, which may not account for the entire spectrum of conversations on LCC.
In addition, it is worth noting that our study relied on X (formerly Twitter) data, and X has changed drastically since Elon Musk acquired the company. His implemented changes may significantly affect public health research using this platform (Mole, 2022). For example, verification status was a meaningful variable in this research and was previously assigned by X to “notable” accounts, companies, and brands. However, the new X model allows users to pay for verified status. In other words, the platform post-Musk acquisition has changed drastically such that some of our observations and findings using pre-Musk acquisition data may not be relevant to the Twitter/X platform post-Musk acquisition. While problematic at first glance, this realization makes the findings from this work even more valuable as the observed differences across metrics stand up irrespective of the process for achieving verified status. Although our categorical analysis separating verified and unverified status within categories would look different, the results still suggest distinct differences in the types of accounts each influential metric captured. Nonetheless, future research is needed to develop helpful heuristics for identifying tobacco-sharing accounts of interest. The ongoing changes to Twitter X only add to the evidence that the shifting sands of social media platforms’ terms of service, data-sharing policies, and operations will always present challenges to social media researchers.
Our mutually exclusive coding category design also presented challenges. Like other analyses of media (Guo & Li, 2016), we faced the challenge of interpreting influential users’ intended meanings of their self-categorizations as part of our coding process. While we were able to establish a reliable coding process, the clarity and harmony gaine by using a mutually exclusive coding scheme potentially came at the expense of nuanced aspects of influential user accounts. For example, the line between entertainment and commercial accounts was sometimes blurred as accounts would offer services while intending to provide entertainment products (e.g., a self-described blogger selling unrelated products on their website). Future research could utilize a multi-faceted coding scheme to help tease out nuanced differences among influential accounts discussing tobacco on social media.
We hope that these findings constitute vital first steps in improving methodological decision-making for research analyzing influential accounts on social media. Our comparison of social media metrics provides four vital pieces of guidance for public health and public opinion researchers hoping to conduct valid and reliable analyses of influential accounts on social media. First, our findings highlight the importance of considering various metrics for capturing influential accounts. Researchers should take this into account and consider using a combination of these metrics to gain a more comprehensive view of the landscape of influential accounts in tobacco-related discussions on social media. Second, researchers should consider the temporal dimension of influential account detection metrics when analyzing influential accounts and be aware of which metrics are better suited for capturing stable or evolving accounts in their studies. Third, our findings indicate that different metrics capture different types of accounts, including unverified accounts, NSFW content, LCC users, and more. Understanding these categories can help researchers identify the key players in the tobacco-related discourse and tailor their research to focus on specific account types or content. Fourth, our analysis provides insights into the content features associated with influential accounts across different metrics. Public health and public opinion researchers can leverage this information to gain a deeper understanding of the topics and themes that resonate with influential accounts in the tobacco discussion. This knowledge can inform the creation of more targeted content and campaigns to ensure appropriate and more effective strategies for message design and placement among these influential accounts.
Supplemental Material
sj-docx-1-sms-10.1177_20563051231224268 – Supplemental material for Deciphering Influence on Social Media: A Comparative Analysis of Influential Account Detection Metrics in the Context of Tobacco Promotion
Supplemental material, sj-docx-1-sms-10.1177_20563051231224268 for Deciphering Influence on Social Media: A Comparative Analysis of Influential Account Detection Metrics in the Context of Tobacco Promotion by Alex Kresovich, Andrew H. Norris, Chandler C. Carter, Yoonsang Kim, Ganna Kostygina and Sherry L. Emery in Social Media + Society
Footnotes
Acknowledgements
The authors would like to acknowledge Mateusz Borowiecki, Simon Page, and Hy Tran for their contributions to this research.
Authors’ Note
The content is solely the authors’ responsibility and does not necessarily represent the official views of the National Institutes of Health.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA248871.
Supplemental Material
Supplemental material for this article is available online.
Author Biographies
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
