Abstract
Social media dominate today’s information ecosystem and provide valuable information for social research. Market researchers, social scientists, policymakers, government entities, public health researchers, and practitioners recognize the potential for social data to inspire innovation, support products and services, characterize public opinion, and guide decisions. The appeal of mining these rich datasets is clear. However, there is potential risk of data misuse, underscoring an equally huge and fundamental flaw in the research: there are no procedural standards and little transparency. Transparency across the processes of collecting and analyzing social media data is often limited due to proprietary algorithms. Spurious findings and biases introduced by artificial intelligence (AI) demonstrate the challenges this lack of transparency poses for research. Social media research remains a virtual “wild west,” with no clear standards for reporting regarding data retrieval, preprocessing steps, analytic methods, or interpretation. Use of emerging generative AI technologies to augment social media analytics can undermine validity and replicability of findings, potentially turning this research into a “black box” enterprise. Clear guidance for social media analyses and reporting is needed to assure the quality of the resulting research. In this article, we propose criteria for evaluating the quality of studies using social media data, grounded in established scientific practice. We offer clear documentation guidelines to ensure that social data are used properly and transparently in research and applications. A checklist of disclosure elements to meet minimal reporting standards is proposed. These criteria will make it possible for scholars and practitioners to assess the quality, credibility, and comparability of research findings using digital data.
Introduction
Social media are ubiquitous in today’s communications environment. Once considered recreational networks used mainly by youth and younger adults, social media are now used by corporations, news media, advocacy groups, and individuals of various ages and socioeconomic backgrounds. Since each post or upload leaves a digital footprint, social media generate an enormous quantity of data, creating unique opportunities for analyzing important questions about society, policy, and health (Schillinger et al., 2020). Corporations, academic researchers, government, and nonprofit organizations have begun to rely on these data to gauge people’s attitudes toward products, marketing, and proposed policies, and to characterize public opinion and individual behavior (Bruns, 2013; Bruns & Stieglitz, 2014; Cohen & Ruths, 2013; Diakopoulos, 2016; Y. Kim et al., 2016; Kostygina et al., 2016; Tufekci, 2014; Yom-Tov, 2016).
The recent emergence of generative artificial intelligence (AI) tools (e.g., ChatGPT) presents similar opportunities and challenges (Salah et al., 2023). Leveraging the advanced capabilities of these technologies to analyze the multiple streams and extensive volumes of data generated daily on social media with greater efficiency and speed can lead to an unprecedented depth and breadth of understanding of social phenomena, by identifying patterns of information flow on a previously unattainable scale and by modeling social dynamics and social contagion across platforms (Elmas & Gül, 2023; Haluza & Jungwirth, 2023). This can inform and enable significant advancements in social science and public opinion research at every step, from problem definition to data collection, analysis, and interpretation. However, there are no clear guidelines for conducting research with the help of generative AI tools, nor standards for assessing the quality of this research. It remains unclear whether such analyses can be reproducible or replicable, given the lack of transparency of generative AI models and potential undetected algorithmic biases that can compromise the impartiality and validity of research findings, leading to skewed interpretations and inaccurate conclusions (Dwivedi et al., 2023; Mehrabi & Pashaei, 2021). Social media and generative AI are revolutionizing social science and public opinion research, which highlights the need to translate social science transparency and replicability standards for this new media and technological landscape and to update social science data quality assessment guidelines, as well as disclosure standards and requirements.
The rush to take advantage of the bounty that rich social data offer occurs at a time of substantial public distrust of science and technology in general (Desmond, 2022; Kabat, 2017; Winter et al., 2022). This trend follows waves of controversy over suspect or failed experiments using digital data to gauge public opinion formation (Albergotti, 2014; Booth, 2014) and assess health trends (Lazer & Kennedy, 2015), and the harvest of Facebook profile data without user permission during the 2016 US presidential campaign (Rosenberg et al., 2018). According to a 2022 Pew Research Center report, public trust in science also decreased following the COVID-19 pandemic, with only 29% of US adults reporting a great deal of confidence in scientists to act in the public’s best interests in December 2021 (Kennedy et al., 2022). Cynicism or disbelief in science has increased to an extent that the research, government, and business communities interested in promoting scientific and technological progress cannot ignore (Kabat, 2017).
The emergence of new generative AI technologies introduces new problems for social data research. For instance, competition between social media platforms and generative AI systems has resulted in growing restrictions on social media data access and use (e.g., for X, formerly Twitter, and Reddit) for academic, organic, and commercial users, prompted by the unlicensed or unauthorized use of copyrighted proprietary digital data by these systems to train their generative AI models or build algorithms (Vincent, 2023). The capacity of ChatGPT and other generative AI tools to produce simulated social media posts and images can further undermine trust in what constitutes valid social data.
To help regain public confidence, prominent communication scholars have called for efforts to build transparency by establishing a climate of critique and self-correction; fully acknowledging the limitations in data, tools, and methods; accounting for seemingly anomalous data; and clearly and precisely specifying key terms (Hall Jamieson, 2015). Researchers also have to consider privacy and data provenance when using emerging AI technologies for social data analysis and processing.
We believe that the broad principles of transparency articulated previously to enhance credibility of science (Aczel et al., 2020; Hall Jamieson, 2015) can be applied to establish common disclosure requirements for social media and generative AI research. If we set clear reporting guidelines for social data acquisition, management, quality assessment, and analysis, public trust in the scientific findings and integrity of such research may increase, or at the minimum, research findings can be replicated or refuted, increasing scientific integrity.
Even as the number of research studies using digital data rapidly grows, relatively few have transparently outlined their data collection and analysis methods. Gradually, researchers have begun to critically examine the assumptions behind social media data findings, reproducibility, generalizability, and representativeness and to call for greater transparency in documenting methods for such studies (Assenmacher et al., 2022; boyd & Crawford, 2012; Bruns, 2013; Center for Democracy & Technology, n.d.; Cockburn et al., 2020; Council for Big Data, Ethics, and Society, n.d.; Fairness, Accountability, and Transparency in Machine Learning, n.d.; Fineberg et al., 2020; González-Bailón et al., 2014; Goroff, 2015; Graham et al., 2013; Jurgens et al., 2015; Y. Kim et al., 2016; Reed & boyd, 2016; Tufekci, 2014).
Challenges and Limitations of Social Data Research
As with any data source, the way in which social data are collected for research influences the conclusions that can be drawn (Japec et al., 2015). Although each social media platform has different technical constraints and poses unique methodological and programming challenges, there are common decisions that any project must address. Biases and other data quality issues arise from decisions researchers make about the platform selected and how the data are accessed, retrieved, processed, or filtered (or cleaned). In turn, each decision affects data quality and the validity of inferences based on the data analytics.
A number of specific limitations and challenges to conducting social data research have been described in the literature over the 15 years since social media gained popularity. These challenges and limitations may be categorized as related to the data collection, processing, analysis, and interpretation stages of inquiry. At the data collection stage, data-gathering approaches may be opportunistic; for example, studies based on retrieving information using specific hashtags often abstract conversations from a much more complex communications universe. Such analyses risk omitting context and creating and describing new realities that may not reflect lived experience (Bruns, 2013). Furthermore, infrastructure may be unreliable, subject to outages and losses during data collection, and the choice of methods to combine multiple data sources may introduce bias and errors. In addition, platform terms of service restrict data sharing, preventing replication of research using the same dataset. Therefore, data-gathering efforts are often duplicated and uncertainty exists regarding dataset comparability (Bruns, 2013).
During the data preprocessing and analysis stages of inquiry, design decisions for cleaning and interpreting social data—that is, selecting which attributes and variables to count and which to ignore—are inherently subjective (boyd & Crawford, 2012), and there is no known best practice or standard. Tools and methodologies for processing digital data are continuously evolving, and sometimes pieced together from various platforms and technologies, making documentation and replication problematic. Some researchers alternatively turn to commercial analytics services or standardized tools which may operate as black box enterprises, or contain processing steps that lie outside the researcher’s expertise to clarify (Bruns, 2013). Cross-platform analyses pose challenges because the data often appear in different formats that are difficult to combine, for example, text, images, and hyperlinks (Voytek, 2017).
Decision-making during the data collection and analysis stages affects the validity of research findings, interpretations, and conclusions. Managing and interpreting the context in which conversations occur, as well as rigorously evaluating the generated outputs to prevent the inadvertent propagation of biases or inaccuracies, remain ongoing challenges for social data analysis.
Although these challenges and limitations are widely recognized as important, they are often neglected or dismissed in practice (e.g., Bruns & Stieglitz, 2014; Y. Kim et al., 2016). Disclosure of the decisions made during the conduct of social data research, and the reasons behind them, could dramatically enhance transparency and replicability. Without such reporting, evaluating the validity of findings and comparing methods and results across studies become impossible.
Validity Threats in the Social Media and AI Research Pipeline
Like traditional public opinion research, social data research methods—such as choice of platform, sampling strategy, and search filters for data collection—may affect the results and conclusions and have implications for a study’s external, internal, and construct
Hsieh and Murphy (2017) proposed the Total Twitter Error (TTE) framework for social media data quality assessment, which recognizes that population coverage—or generalizing to the population as a whole—may not always be the goal of social media analysis and that topic coverage, that is, representing topics within a corpus of written material, may often be a more appropriate goal (Schober et al., 2016). The TTE approach identifies
Recognizing the value of the TTE framework, we identify connections between the proposed disclosure standards and insight provided for understanding coverage, query, and interpretation error. However, we also note that social media may be used to analyze research questions that are not related to representing individuals within a population (population coverage) or topics within a corpus (topic coverage) and further social media may be used to support or supplement results from other traditional data sources. For example, online marketing efforts for emerging products like e-cigarettes and alternative tobacco products are difficult to fully monitor using traditional data sources because these products are not typically advertised widely at the point of sale or in print or broadcast media. They are typically first promoted on social media, which can provide critically important information to fully measure online marketing efforts (e.g., Huang et al., 2014).
The research standards for a given topic will depend upon the specific research question, and the three error components of the TTE framework may or may not be relevant. Thus, we emphasize that a flexible approach is needed to judge whether the standards of a specific social media analysis achieve the rigor needed for the research question, while noting that the proposed standards here encompass the needs for a broad array of research questions.
Methods
To guide rigorous analysis of social data and report findings using social science epistemology, we reviewed the literature related to data quality and methodological disclosure from biostatistics, computer science, and communications. We attempted to identify common constructs for qualitative and quantitative research methods and map these constructs to social data workflows and to the existing disclosure standards in the fields of opinion research and social sciences.
We drew upon the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) tool for the reporting of systematic review data as a conceptual template, because the data sources for reviews can be heterogeneous, much like the data obtained from social media. We mapped the domains determining data quality in PRISMA to those needed for extraction and analysis from social media sources (Liberati et al., 2009; Page et al., 2021). We synthesized this approach with the American Association for Public Opinion Research (AAPOR) Transparency Initiative guidelines and the American Psychological Association Transparency and Openness Promotion (TOP) guidelines as a framework for social media data collection and quality assessment. The AAPOR Transparency Initiative disclosure elements refer to the disclosure of information on data collection strategy; funding source/sponsor; measurement tools/instruments (e.g., questionnaires or coding schemes); population under study; method used to generate and recruit the sample; method(s) and mode(s) of data collection; dates of data collection; sample sizes; data weighting approach; data processing and validity checks; and acknowledgment of limitations of the design and data collection. The PRISMA reporting guidelines detail reporting recommendations pertaining to the study support sources; availability of data, code, and other materials; data collection process; and data items, among others.
The TOP Guidelines cover eight general domains of research planning and reporting, including citation standards (citation for data and materials disclosures); data transparency (data sharing disclosures, such as posting to a repository); analytics methods transparency (e.g., disclosure of programming code); research materials transparency (materials sharing); design and analysis transparency (e.g., data preprocessing methods; reliability analyses); study design preregistration; analysis plan preregistration; and replication (disclosure of the publication of replication studies) (American Psychological Association, 2023). Thus, there is consensus regarding recommended transparency standards across social science domains which have to do with disclosures of research funding/sponsorship sources, data collection, processing and validation procedures, as well as analytic methods. These key concepts are also consistent with other literature detailing guidelines for evaluation of compliance with the scientific method, for example, Armstrong and Green (2022).
We synthesized and translated these practices and recommendations that are the standard for social science research to research using social media data and generative AI. While some disclosure elements were directly relevant across domains, including the social media data analyses (e.g., disclosure of the funding source), some items require translation or adaptation (e.g., description of the sample frame) or development of an analogous principle (e.g., data access point), or a novel disclosure element (e.g., amount of data decay in social media). Based on our findings, we propose a list of disclosure items as a reporting standard for social media research. We incorporate disclosure consideration regarding use of AI technologies (e.g., generative AI) and natural language processing tools. Our goal is not to direct researchers in their design choices, but to provide a framework and propose measures for evaluating the completeness of reporting and quality of data used in social media studies. Using data quality metrics, we show how selection of sampling and search filters affects the results and conclusions. We do not undertake to prescribe a short list of methods and tools to be used for social and digital media research, but rather to propose standards for how methodologies, procedures, and limitations are documented to increase transparency and replicability and allow consumers to evaluate research rigor.
Proposed Disclosure Items
Our proposed metrics for social data quality assessment and a list of minimal (or immediate) and optional (or preferred) disclosure items are detailed below and summarized in Table 1.
Table 1. Overview of Disclosure Items for Social Data Quality Reporting and Target Error or Bias Prevention.
Biases and errors: T = transparency; R = replicability; C = coverage error; Q = query error; I = interpretation error.
Minimal Disclosure
We propose that the following items should be included as minimal disclosure requirements in any and every report of research results, or made available immediately upon release of such a report.
Data Collection
Scope of the Study
The report should include the rationale for platform selection, description of the target population or topic, point of data access, sample frame coverage, data verification procedures, total participants, or data points (such as number of posts retrieved or number of social media accounts) on which data were collected, as outlined below. Method and dates of data collection (duration of the study, including when data were collected and for what time period) should also be disclosed. Description of the metadata used in the study, if applicable, is also critical to ensure replicability of the analyses. We propose reporting the following sub-items:
(a)
(b)
Rationale: Populations of different demographics are drawn to different platforms; thus users of one platform may be more or less representative of the population at large than another platform. Furthermore, communicative activities on a given platform may not represent the full breadth of the overall public debate because of different functionalities of platforms. In addition, social desirability and self-censorship may be more characteristic of some platforms (e.g., platforms offering less anonymity such as Facebook), compared with others (e.g., X/Twitter or Reddit). All of the above factors are related to coverage of target population or topic and thus may affect the results of the study and interpretation of findings. If social media accounts are analyzed, information on types of social media accounts (e.g., real people, verified accounts, bots, influencers) and whether certain categories are selected or removed should be described. Subgroups of platform users may behave differently on a given platform.
(c)
Rationale: Different data access points may produce data with different records. Data access also changes over time. Until early 2023, X/Twitter’s streaming API provided access to a 1% sample of all tweets, while the PowerTrack API provided access to all public tweets, affecting coverage of the target population and topic (Y. Kim, Nordgren, & Emery, 2020; Morstatter et al., 2013). Subsequent changes to X/Twitter restricted data access to third-party social listening service providers and scraping. Facebook data were fully available before access was restricted in 2016. Currently, CrowdTangle is the best source of Facebook and Instagram data from publicly available accounts. These different access points may produce data with different metadata, which may enhance or limit the scale of search queries (Y. Kim, Nordgren, & Emery, 2020). The same applies to other platforms where multiple ways to access and pull data are available.
(d)
Rationale: A sampling frame is carefully designed to represent a target population and derive representative estimates in survey research. While the universe/census of the target population
(e)
Rationale: The unit of analysis is closely tied to the target subject or topic, and to replicability. Reporting the number of analysis units enables comparability. It is worth noting that the total number of posts, videos, or accounts related to a topic of interest may be reported only in relative terms (e.g., search volume on Google Trends).
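As an illustration only, the scope-of-study items described above could be recorded in a machine-readable form alongside the written report. The following Python sketch assumes a hypothetical `ScopeDisclosure` record; the class name, field names, and example values are ours and are not part of the proposed standard.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class ScopeDisclosure:
    """Hypothetical record of scope-of-study disclosure items."""
    platform: str
    platform_rationale: str
    target_population_or_topic: str
    access_point: str
    collection_start: str      # ISO date, e.g. "2023-01-01"
    collection_end: str
    unit_of_analysis: str      # e.g. "post" or "account"
    n_units: int
    metadata_fields: list

    def missing_items(self):
        """Return the names of fields left empty, so gaps are visible."""
        return [name for name, value in asdict(self).items()
                if value in ("", None, [])]

    def to_json(self):
        """Serialize the record for inclusion in supplementary materials."""
        return json.dumps(asdict(self), indent=2)


# Example values are invented for illustration.
record = ScopeDisclosure(
    platform="ExamplePlatform",
    platform_rationale="Topic is discussed mainly on this platform",
    target_population_or_topic="Posts about an emerging product",
    access_point="",           # deliberately left blank
    collection_start="2023-01-01",
    collection_end="2023-03-31",
    unit_of_analysis="post",
    n_units=12000,
    metadata_fields=["timestamp", "language"],
)
print(record.missing_items())
```

A record of this kind does not replace the narrative rationale; it simply makes omissions easy to spot before a report is released.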
Protocol and Analytic Tools
The software, programming language/scripts, any other analytic tools, and workflow for executing these tools should be described.
Rationale: There are a variety of tools available to analyze social media data, both open-source and commercial software, including emerging generative AI tools such as ChatGPT. Disclosure of computing tools is key to replicability of findings. For instance, social data are often analyzed or processed using Python, R, or other software geared to analyzing large corpora of data. The same machine or statistical learning models are often supported by more than one tool, and default settings for parameters and optimization may differ, resulting in different estimates. Certain software providers do not disclose the module language or the process of module validation. Use of generative AI tools for social media data analysis may augment the efficiency and speed of processing and analysis of large corpora of social data, but may not be compliant with platform or provider terms of service and can have ethical implications (Elmas & Gül, 2023; Salah et al., 2023). Depending on the extent of the contribution of AI systems to the analysis, description, and interpretation of findings, generative AI has been included as a co-author in the published literature, with some systems (e.g., ChatGPT) providing consent to be listed as a co-author (e.g., Haluza & Jungwirth, 2023).
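Because default settings can differ across tools and versions, one practical way to support this disclosure is to capture the software environment programmatically. The sketch below is a minimal example using only the Python standard library; the function name `describe_environment` and the package names passed to it are illustrative assumptions, not a prescribed tool.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Collect interpreter, operating system, and package versions.

    `packages` is a list of distribution names chosen by the researcher;
    packages that are not installed are reported as such.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "packages": versions,
    }


# The package list is a placeholder; substitute the tools actually used.
environment = describe_environment(["numpy", "pandas"])
print(environment)
```

Any non-default model parameters would still need to be reported separately, since version numbers alone do not reveal them.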
Search Query Construction
The keywords selected to develop the search filter and the search rules for a more focused search should be provided. Outline the rationale for initial keyword selection (e.g., expert knowledge, resources/tools/skills used for systematic search, etc.) as well as for selecting or removing certain keywords. For example, report the relevance (precision) and frequency (number of posts retrieved) of the keywords, or the signal-to-noise ratio (relevant to irrelevant data) and the thresholds applied (by search term). Search filter construction is often an iterative process, alternating between keyword addition and removal based on relevance and frequency (Y. Kim et al., 2016). Generative AI technologies can also be used to identify terms relevant to a topic of interest, to generate search rules, and to convert them to regular expressions for search query construction. These tools can also translate or adapt search filters to other languages and cultural contexts to conduct multilingual analyses. The search filter is directly related to query error: a precise yet narrow search filter is likely to miss relevant content (i.e., false negatives), while a comprehensive search filter is likely to retrieve false positive content. The balance between precision and completeness is therefore important.
Rationale: Expressiveness of query languages and choice of keywords in combination with Boolean rules in queries define the resulting datasets. Thus, search term selection can affect the study conclusions. For instance, using “smoking” as a search term for tobacco-related social media data collection could result in retrieval of non-relevant posts containing words like “smoking ribs,” “smoking hot” (Emery et al., 2014).
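To make the role of inclusion and exclusion rules concrete, the following Python sketch applies a small keyword filter of the kind described above. The exclusion phrases come from the example in the text; the inclusion terms, function names, and sample posts are hypothetical and are not the filter used in any cited study.

```python
import re
from collections import Counter

# Hypothetical inclusion terms; exclusion phrases follow the text's example.
INCLUDE_TERMS = ["smoking", "cigarette", "vape"]
EXCLUDE_PHRASES = ["smoking ribs", "smoking hot"]


def compile_terms(terms):
    """Build one case-insensitive pattern matching any term as a whole word."""
    pattern = "|".join(r"\b" + re.escape(term) + r"\b" for term in terms)
    return re.compile(pattern, re.IGNORECASE)


include_re = compile_terms(INCLUDE_TERMS)
exclude_re = compile_terms(EXCLUDE_PHRASES)


def matches_filter(text):
    """Keep a post if it has an inclusion term and no exclusion phrase."""
    return bool(include_re.search(text)) and not exclude_re.search(text)


def keyword_frequency(posts):
    """Count how many posts contain each inclusion term (before exclusion)."""
    counts = Counter()
    for text in posts:
        for term in {match.lower() for match in include_re.findall(text)}:
            counts[term] += 1
    return counts


posts = [
    "Quit smoking today",
    "Smoking ribs all afternoon",
    "That outfit is smoking hot",
    "New vape shop opened",
]
kept = [post for post in posts if matches_filter(post)]
print(kept)
print(keyword_frequency(posts))
```

Note that this simple rule discards any post containing an exclusion phrase, even if the post is otherwise relevant; trade-offs of this kind are what the reported precision and recall are meant to expose.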
Data Processing
Data Handling
Preprocessing and cleaning procedures, including de-duplication, aggregation, de-identification (if applicable), metadata (e.g., user profile, geographic location, time posted, etc.), and feature extraction, should be outlined. Use of software or tools, such as generative AI, for data preprocessing and text mining should also be disclosed.
Rationale: The conversion of data from a raw format to a more manageable format, for instance, unpacking semi-structured data (e.g., JSON) into a structured document-term matrix, should be briefly described. Text mining techniques are often used in preprocessing of social media data (e.g., stop word removal, stemming, and segmenting the language through factorization and part-of-speech tagging), which can affect the subsequent procedures and analyses. In fact, data preprocessing and cleaning often influence the success of machine learning training and results, affecting interpretation error (as noted in Table 1).
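The preprocessing steps named above can be sketched in a few lines of Python. The example below is a simplified illustration that assumes each raw record is a JSON object with hypothetical `id` and `text` fields; the stop word list is a toy list, and the tokenization rule is one of many defensible choices that would need to be disclosed.

```python
import json
import re
from collections import Counter

# Toy stop word list for illustration; real studies should report theirs.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}


def tokenize(text):
    """Lowercase, strip URLs, split into tokens, and drop stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return [tok for tok in re.findall(r"[a-z0-9#@']+", text)
            if tok not in STOP_WORDS]


def build_document_term_matrix(raw_lines):
    """Parse JSON lines, de-duplicate by text, and count terms per document."""
    seen_texts = set()
    doc_ids, doc_counts = [], []
    for line in raw_lines:
        record = json.loads(line)
        text = record["text"].strip()
        key = text.lower()
        if key in seen_texts:      # exact duplicate (case-insensitive)
            continue
        seen_texts.add(key)
        doc_ids.append(record["id"])
        doc_counts.append(Counter(tokenize(text)))
    vocabulary = sorted(set().union(*doc_counts)) if doc_counts else []
    matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]
    return doc_ids, vocabulary, matrix


raw = [
    '{"id": "1", "text": "Vaping is the new smoking"}',
    '{"id": "2", "text": "vaping is the new smoking"}',
    '{"id": "3", "text": "Quit smoking https://example.org/x"}',
]
ids, vocab, dtm = build_document_term_matrix(raw)
print(ids, vocab, dtm)
```

Each choice in this sketch (what counts as a duplicate, which characters form a token, which words are dropped) changes the resulting matrix, which is why the text asks that such steps be reported.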
Data Quality Assessment
The quality of retrieved data should be objectively assessed and quantified by inspecting a sample of data classified by the search filter, for example, via cross-validation of automated coding against a sample of data labeled by multiple trained human coders knowledgeable about the topic of interest to minimize potential error or bias, that is, the “gold standard” of filter quality assessment (Y. Kim et al., 2016). Reporting quality measures of the retrieved data, including retrieval recall (completeness of the search filter; how much of the relevant data is retrieved by the search filter) and retrieval precision (how much of the data retrieved by the search filter is relevant), helps comparability and transparency. The procedure used to assess search filter quality must be disclosed, including the selection of the data sample (e.g., a subset of data based on random sampling stratified by keyword and account type may serve as a representative sample) and the evaluation strategy (e.g., agreement between coding based on human judgment vs. automated search filter selection, inspection of data that do not match the search filter). For example, several existing studies on the amount and content of tobacco-related tweets have included filter retrieval precision and retrieval recall assessments (e.g., Y. Kim, Nordgren, & Emery, 2020; Kostygina et al., 2016).
Thus, calculation of quality measures typically involves human judgment on a sample of data as a gold standard (Y. Kim et al., 2016). The human coding approach should be described as follows:
(a)
(b)
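The retrieval precision and retrieval recall measures described in this subsection can be computed from a human-coded sample as in the minimal Python sketch below. It assumes a simple random sample in which each item records whether the search filter retrieved it and whether human coders judged it relevant; the function name and the example counts are hypothetical, and a stratified sample would additionally require weighting.

```python
def retrieval_quality(coded_sample):
    """Estimate retrieval precision and recall from human-coded items.

    `coded_sample` is a list of (retrieved, relevant) pairs of booleans:
    `retrieved` is the search filter's decision, `relevant` is the
    human-coded gold standard.
    """
    true_pos = sum(1 for ret, rel in coded_sample if ret and rel)
    false_pos = sum(1 for ret, rel in coded_sample if ret and not rel)
    false_neg = sum(1 for ret, rel in coded_sample if not ret and rel)

    retrieved_total = true_pos + false_pos
    relevant_total = true_pos + false_neg
    # Report None rather than dividing by zero when a measure is undefined.
    precision = true_pos / retrieved_total if retrieved_total else None
    recall = true_pos / relevant_total if relevant_total else None
    return {"precision": precision, "recall": recall,
            "n_coded": len(coded_sample)}


# Invented counts for illustration only.
sample = ([(True, True)] * 8 + [(True, False)] * 2
          + [(False, True)] * 2 + [(False, False)] * 8)
quality = retrieval_quality(sample)
print(quality)
```

Reporting the size of the coded sample together with the two measures lets readers judge how much uncertainty surrounds the estimates.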
Data Analysis
Analysis Methods and Measures
Detail the deductive or inductive methods used for data analysis, including statistical techniques, machine learning algorithms, or qualitative analysis (e.g., topic modeling). Explain how the data were categorized, classified, or clustered to answer the study research questions. Specify the metrics and measures used in the analysis, such as engagement metrics, sentiment analysis scores, or content classification criteria.
(a)
(b)
Researchers should disclose if generative AI tools are used for inductive or deductive analyses, for example, to create features for the classification model or to categorize social media data based on learned/ingested training data sample previously labeled by humans or a machine (e.g., to analyze social media posts to extract sentiment toward a particular topic). Since the predictive models built by generative AI are a “black box,” additional methods for validation and accuracy/performance quality assessment should be described (see Supplemental Appendix 1 for an illustration of additional disclosure items that may need to be considered for studies using generative AI; the list was generated via ChatGPT 3.5 query).
Rationale: Data retrieved by comprehensive search filters are likely to include non-relevant content. To reduce the degree of query error, a supervised learning classifier may be trained to further remove non-relevant data. However, since all predictive models make false positive and false negative errors, interpretation error is also likely. Reporting the classifier training procedure and its performance metrics helps comparability and transparency of methods.
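The classifier reporting described above can be illustrated with a short Python sketch that fixes the random seed for the train/test split and summarizes performance on held-out, human-labeled data. The function names, the seed value, the test fraction, and the keyword-based stand-in for a trained classifier are all assumptions made for illustration; any real classifier exposing a predict function could be substituted.

```python
import random


def split_labeled_data(examples, test_fraction=0.2, seed=2024):
    """Shuffle with a fixed, reportable seed and split into train and test."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def evaluate(predict, test_set):
    """Compare predictions with human labels and report standard metrics."""
    tp = fp = fn = tn = 0
    for text, label in test_set:
        predicted = predict(text)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(test_set) if test_set else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "precision": precision,
            "recall": recall, "f1": f1, "accuracy": accuracy}


# Stand-in for a trained model: flags posts mentioning "tobacco".
def toy_predict(text):
    return "tobacco" in text.lower()


held_out = [
    ("tobacco ad campaign", True),
    ("tobacco leaf art print", False),
    ("trying to quit smoking", True),
    ("bbq ribs tonight", False),
]
print(evaluate(toy_predict, held_out))
```

Disclosing the seed and split proportion alongside the confusion counts allows others to reproduce the evaluation rather than only the headline score.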
Funding Source
Disclose who sponsored the research study, who conducted it, and who funded it, including (to the extent known) all original funding sources.
Rationale: Disclosure of sponsors or sources of funding is standard practice in any scientific research study (e.g., American Association for Public Opinion Research, 2021). This is a fundamental requirement because funder involvement in the research question, study design, data analysis, and interpretation of results may bias study findings.
Optional (Preferred) Disclosure Items
Depending on the design and objective of the research study, additional information that can be disclosed to enhance transparency and reproducibility of social media research and minimize error includes the following:
Additional items discussed in the literature that are not shown in the above list of recommended disclosure elements—due to technical and possible contractual constraints—include disclosure of the raw data; procedure for acquiring consent to participate in the research study from social network users (e.g., whether consent was secured by the user checking a checkbox at the time of creating a social media profile vs. consent being obtained specifically for the research project); as well as procedures for participant debriefing upon study completion.
Discussion
Our approach aims to consolidate and map the concerns about lack of transparency, reporting, and documentation standards raised in the literature on social data analysis quality and replicability and take the process a step further to propose a list of specific disclosure elements grounded in social science epistemology. In fact, striking parallels exist between the current state of social data research and early public opinion research. For example, election polling in the early 1900s often relied on information provided by bookies (i.e., betting markets) or “man-on-the-street” interviews (Rhode & Strumpf, 2004). A classic example of poor results in early public opinion polling can be found in the 1936 prediction by
Thus, we proposed that the minimal disclosure standards should include description of funding source, platform, target population, point of data access, sampling strategy (if sampling is used), data verification procedures, protocol and workflow for executing software and analytic tools, data handling, search filter construction and assessment procedures, classifier training, and performance quality assessment, as detailed above. We believe this proposed framework presents a viable and effective method for quality evaluation of social data research. These criteria go beyond the identification of potential limitations and biases related to the use of social data and generative AI in research, to offer documentation guidelines for auditing and mitigating these issues to ensure the maximum validity and replicability of findings.
While there are overlapping threats to validity and similarities in reporting requirements for empirical or survey research and social media data research, important distinctions exist, which warrant discussion and motivate the framework we proposed. For example, surveys are grounded in a statistical framework that accounts for inferential error (e.g., sampling error, coverage error) and measurement error, and that assumes that there are objective measures of the population itself and that the survey items are knowable and measurable. With social media data, however, such assumptions do not hold because the tools used to measure the population and the “items” are generated by the group that is creating the population and messages; that is, the posts themselves comprise the population and items being measured, so there is no objective “ground truth” to compare with. In such a scenario, rather than throw up our hands in defeat, we recommend an approach that entails extreme methodological transparency. While others have proposed quality standards for social media data (Hsieh & Murphy, 2017), we contend that these standards are an important first step but are insufficient because they do not address many of the decisions made during data collection, preprocessing, and analysis, all of which can affect the study conclusions. Thus, disclosure standards for social media data research must be expansive and adaptive to change, as the platforms themselves change access policies rapidly and the public shifts its loyalty and attention as new social media platforms emerge.
Other scholars have cautioned against “too much transparency” in today’s machine learning and statistical research due to intellectual property concerns, the fact that algorithmic logic may not be fully reflected in the source code, and the potential risk of backfiring and increasing distrust among members of the public whose research outcome expectations are violated (Hosanagar & Jair, 2018). These scholars have called for “explainable artificial intelligence (AI)” as a more palatable solution. The explainable AI approach does not open the “black box” of decision-making algorithms or machine learning-based analytics, but instead provides an explanation of the inputs that have the greatest impact on the final decisions or outcomes of algorithm-based analyses. However, emerging AI tool transparency issues call this argument into question (Dwivedi et al., 2023). Explainable AI may lack efficiency as an approach to science communication if the goal is to establish replicability of social data research in the field of opinion research.
Our goal is not to direct researchers in their design choices, but to provide a framework and propose measures for evaluating the completeness of reporting and quality of data used in social media studies. We aim to translate and synthesize practices that are the standard for both computational research and conventional social science research, in an attempt to breach existing “silos” and make each domain more salient to the other. This translation can serve as a resource for manuscript and grant reviewers, journal editors, and funding organizations that enlist technical or subject matter experts to review studies that use social media data and/or AI to address social science or public health research questions. The proposed standards could be relevant to a range of studies that rely on data mining, natural language processing, and machine learning techniques to extract insights from the vast amount of textual and visual information available on social media. Examples include public opinion and sentiment analysis (analyzing the discourse and sentiment of social media posts to understand trends in public opinion and social norms); social network analysis (examining the structure and dynamics of social networks to identify influencers, communities, and connections); and language and linguistics research (studying language evolution, slang, and dialects through social media conversation), among others (e.g., Gallagher et al., 2021; Kozinets, 2020; Yadav & Vishwakarma, 2020). Detailed disclosure of parameters enables study quality evaluation, replication, and advancement across various domains of inquiry and methodologies. Our proposed standards apply whether the study aims to be generalizable to a broad population or focuses on a narrower community or topic, as in a case study or netnographic research.
We do not presume that our proposed framework is the final word. Rather, we propose the framework as a starting point and urge the community of researchers and institutions that are involved in decisions about funding, conducting, and disseminating social media research to open a larger dialogue. The goal of such a dialogue would be broad consensus and ongoing maintenance of a disclosure framework for social data research as a “moving target” in the evolving environment of rapidly changing media and technology use and access by organic, commercial, and academic users. Such a framework would enable funders, journal editors, research consumers, and those making decisions based upon social media research studies to evaluate the validity of a study, compare studies with conflicting results, and make decisions based on known parameters.
Supplemental Material
Supplemental material, sj-docx-1-sms-10.1177_20563051231216947 for Disclosure Standards for Social Media and Generative Artificial Intelligence Research: Toward Transparency and Replicability by Ganna Kostygina, Yoonsang Kim, Zachary Seeskin, Felicia LeClere and Sherry Emery in Social Media + Society
Footnotes
Disclosure
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Awards Nos. R01CA248871 and R01CA234082 and the National Institute on Drug Abuse of the National Institutes of Health under Award No. R01DA051000.