Abstract
Social media data have become a mainstay of social science research since the first application programming interfaces (APIs) debuted in the mid-2000s. Over time, platforms have radically altered their data offerings, substantially determining the kinds of research that can be conducted. This article presents historical and normative analyses of the current state of platform data precarity, defined by Freelon (2018) as the post-API age. We recount a periodized history of social media data access spanning nearly 20 years, characterize the data access options currently offered by six prominent platforms, and make recommendations for improving platform data access. Our primary aim is to help social media researchers understand how access to social media data has evolved over the years and consider how platforms might help them conduct more rigorous research moving forward.
In 2018, the social media research community received a wake-up call. In April of that year, Meta (at the time Facebook, Inc.) shut down academic access to the API (application programming interface) that provided data from public Facebook Groups and Pages, along with Instagram’s API.1 The company took this action in the wake of the Cambridge Analytica scandal, in which the eponymous consulting firm appropriated millions of Facebook users’ private data to influence major elections, including the 2016 American presidential election and the Brexit referendum (Bruns 2019). Between the closing of Facebook’s Pages and Groups API and the summer of 2020, when the CrowdTangle service opened for academic applications, Meta offered researchers very little authorized access to its data except through a small number of preexisting relationships with company employees and other limited opportunities.
The significance of Meta’s action in the history of social media research cannot be overstated. Platforms had altered their data access regimes before, but Meta was the first major social media company to so drastically curtail data access to the research community. Shortly after Meta announced its decision, one of us (Freelon) wrote and published a brief piece of commentary titled “Computational Research in the Post-API Age” in the journal Political Communication (Freelon 2018). In this piece, he defined the “post-API age” as a state of affairs “when companies can restrict or eliminate API access at any time, for any reason” (Freelon 2018, 665). Of course, this had been the case from the beginning, but the termination of Meta’s APIs made the post-API age an undeniable and pressing concern for the social media research community (Bruns 2019; Davidson et al. 2023; Mancosu and Vegetti 2020; Tromble 2021).
Seven years have passed since the publication of Freelon’s initial post-API piece. Social media data access has changed substantially in the interim, with varying implications for those who study platform data. The goal of this article is to update its predecessor in three ways: First, we offer a new, concise history of social media data access, informed by official documentation and the relevant research literature. Second, we sketch the present post-API age of digital communication data access in comparison with past such “ages,” with the goal of characterizing the moment for posterity. Third, we look to the future with a normative approach that advocates for platform access policies that balance researchers’ interests in data usability, the public’s interests in privacy and impactful research, and business interests in transparency and good corporate citizenship.
Past: A History of Social Media Data Access
We build on Jünger’s (2021) history of social media APIs,2 diverging from it in several key respects. The history we present here covers a longer period of time than his, extending into early 2025 (Jünger’s ends in early 2020). Our history is organized around the guiding principle of data access, and its periodization is keyed to major changes in platform data access regimes (i.e., distinct assemblages of policies and technical protocols that enable access to certain types of data). This approach allows us to discuss issues such as data access fees and changing policies around dataset sharing that Jünger either does not cover or mentions only briefly. Also, while Jünger focuses primarily on Facebook, Twitter, and YouTube, secondarily on Instagram, and not at all on Reddit or TikTok, our history more fully incorporates all six of these prominent platforms. Finally, we offer a more detailed timeline that includes key policy shifts and other developments of relevance to social media researchers.
API prehistory (2000–2006)
Briefly, APIs are digital interfaces that supply data from social media platforms for processing by other computer programs rather than for human consumption. Those seeking a thorough explanation of APIs may refer to Bruns and Burgess (2016), Jünger (2021), Lomborg and Bechmann (2014), and van der Vlist et al. (2022). The same data can be, and often are, represented through APIs in dense, highly structured formats, such as XML or JSON, and through web browsers (where they are typically rendered in a format more pleasing to the human eye). While the concept of APIs dates back to the 1960s or possibly earlier (Kazemi 2020), our history begins in 2000, when two foundational events occurred. First, Salesforce published the first commercially viable API on February 7 of that year (Lane 2019), delivering content in XML format exclusively to its customers. Second, University of California, Irvine, doctoral candidate Roy Fielding’s (2000) dissertation introduced the concept of Representational State Transfer (REST). A full discussion of REST’s technical details lies beyond the scope of this article (we refer readers seeking one to Rodriguez [2008]); for our purposes, it suffices to say that REST is a set of technical constraints that facilitates the sending of requests and delivery of responses between platform APIs and their clients. The earliest versions of the first web-accessible APIs (e.g., from eBay [2000], Amazon [2002], and Delicious [2003]) did not use REST, but the first social media APIs did.
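To make the contrast between machine-readable and browser-rendered representations concrete, consider how an API client handles a response. The sketch below parses a JSON payload of the kind a REST API might return for a single public post; the payload and all field names are invented for illustration, not drawn from any actual platform:

```python
import json

# A hypothetical JSON payload of the kind a REST API might return
# for one public post (all field names are invented for illustration).
api_response = """
{
  "id": "12345",
  "author": "example_user",
  "text": "Hello, world!",
  "created_at": "2011-03-01T12:00:00Z",
  "like_count": 42
}
"""

# Parse the JSON string into a Python dictionary.
post = json.loads(api_response)

# Unlike a rendered web page, every field is directly addressable
# by name, which is what makes API data immediately analyzable.
print(post["author"], post["like_count"])
```

Where a browser would render this information as styled text for human eyes, a program can select, count, and aggregate the fields directly, which is precisely what made APIs attractive to computational researchers.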
The laissez-faire period (2006–2011)
Facebook and Twitter published the first social media APIs in August and September 2006, respectively. During this period, the two APIs functioned very differently. The Facebook platform API allowed logged-in developers to instantly obtain API credentials to collect data from any public page, group, or individual profile. Its terms of use for this period did not directly mention research or noncommercial data use (except in the rare case of physical third-party products), but they did prohibit developers from “giv[ing] data you receive from us to any third party, including ad networks” (Facebook 2009). In contrast, between 2006 and 2011 Twitter’s API did not require a login, and datasets with complete metadata (including identifying information) could be shared publicly. A free, no-code Twitter data collection utility called TwapperKeeper allowed users to curate their own keyword-based datasets on a cloud platform such that anyone could anonymously download anyone else’s dataset (Bruns and Burgess 2016). YouTube (2007) and Reddit (2008) also launched the first versions of their APIs during this period.
We call this the laissez-faire period because platform data access policies were relatively hands-off compared to subsequent periods. Given that programming skills were necessary to take advantage of this new era of API data availability (with the notable exception of Twitter data via TwapperKeeper), the first social media researchers came from computer and information sciences. The first peer-reviewed study to use the Twitter API was published in 2007 (Java et al. 2007), and the first studies to use Facebook’s API emerged the following year (Bedrick and Sittig 2008; Nazir et al. 2008). It took several years for API research methods to migrate into the social sciences. While the possibilities APIs offered for research were raised in the literature as early as 2008 (e.g., in Hogan 2008), the first API-based empirical studies in communication (the authors’ home discipline) did not appear until 2011 (Bruns and Burgess 2011; Lotan et al. 2011). Similarly, we found one early study using the Twitter API in political science (Grant et al. 2010) and one each from psychology (Goel et al. 2010) and sociology (Gaby and Caren 2012) that used the Facebook API, but these would remain outliers for several more years. Thus, social science researchers appear not to have taken advantage of the laissez-faire period’s ethos of widely accessible data. Two factors would change this state of affairs: 1) the increasing importance of social media in social, political, and cultural life and 2) the transdisciplinary diffusion of the programming-based computational methods necessary to obtain, analyze, and visualize API data (Ledford 2020).
The authentication period (2011–2018)
The authentication period was a time of tightening restrictions on social media data access. It began in March 2011, when Twitter began requiring users of its API to collect data exclusively using their own accounts and explicitly prohibited them from sharing datasets (Bruns and Burgess 2016). Among other consequences, this decision effectively rendered TwapperKeeper obsolete. Twitter’s data access regime during the authentication period raised the technical bar for data collection: While no-code options still existed, they were either expensive (e.g., social media dashboards like Sysomos and Radian6) or hampered by substantial limitations in data capacity (e.g., the TAGS Google Sheet script and the MS Excel-based NodeXL). Researchers on a budget who wanted to collect tweets in the millions either had to partner with an experienced programmer or learn how to code themselves. These technical changes were occurring at the same time that social media’s role in politics began attracting public attention through events such as the Arab Spring, Occupy Wall Street, and the 2012 U.S. presidential election. Many studies in the first major wave of social media research in the social sciences focused on these events (e.g., Aday et al. 2013; Bode and Epstein 2015; Bruns et al. 2013; Thorson et al. 2013; Tremayne 2014; Vargo et al. 2014).
Twitter’s 2011 changes aligned its data access regime with Facebook’s, which had required individual authentication and prohibited data sharing since its debut. Instagram opened the first version of its API to the public in February 2011, before it was acquired by Facebook the following year. Like Facebook’s and Twitter’s APIs, it required authentication; but unlike theirs, Instagram’s was read-only. Over the course of the authentication period, these platforms (and others) saw an explosion of open-source programs to facilitate data collection, including Tweepy, twarc, rtweet, and python-twitter for Twitter; Facepager, fb_scrape_public, and Netvizz for Facebook; PRAW and PMAW for Reddit; youtube-dl and pytube for YouTube; and Instaloader and instaR for Instagram. Most of these programs were written in the R or Python programming languages and thus required coding skills to use.
One of the major issues during the authentication period was that of prospective versus retrospective data collection methods. The former collect data from a given point in time into the future, while the latter reach backward into the past. Twitter’s streaming API afforded keyword-based prospective data collection while its search API offered a rolling seven-day window for retrospective collection. Both options were free and rate-limited; accessing tweets older than seven days required payment. While prospective data collection is suitable for scheduled events (e.g., elections), unexpected events (e.g., natural disasters and political protests) require retrospective methods. Twitter’s data access regime during this time thus made it difficult for unfunded researchers to study real-time responses to unexpected events. Facebook offered a public search API with a two-week rolling window (Graph API 1.0) until April 2015, when it was shuttered in part over privacy concerns (Constine 2015). Between that time and April 2018, Facebook’s Graph API 2.0 allowed posts and comments to be collected retrospectively from public pages and groups. Prior to 2016, Instagram’s API offered limited retrospective access to public posts, comments, followers, images, and other metadata. But in June of that year, access was severely curtailed such that researchers could collect only 20 posts per user from up to 10 users who had to opt in to data collection. More extensive data permissions were manually granted to some apps that fit a small list of use cases that excluded research (This 2016). YouTube, which updated to the current version of its API in 2015, allows for retrospective collection of videos, metadata, and comments, with the only limit being the number of requests per day. Certain data, including videos, descriptions, and keywords, can also be collected without the API at all, using programs such as pytube and youtube-dl.
Also in 2015, an independent programmer named Jason Baumgartner created a rolling archive of Reddit data called Pushshift, which contained nearly all the platform’s public posts and comments (Baumgartner et al. 2020). Until May 2023, researchers could collect Reddit posts and comments retrospectively and anonymously from Pushshift.
The limited options period (2018–2020)
The authentication period ended in April 2018, when Meta closed Facebook’s and Instagram’s APIs to academics completely. The proximate cause of this decision was the Cambridge Analytica scandal, in which the eponymous consulting firm appropriated the personal data of millions of Facebook users for political marketing without their knowledge or consent. The immediate result of the closure of academic access to the Facebook and Instagram APIs was that between April 2018 and July 2020, Meta’s official academic-accessible options for data access were extremely limited (Mancosu and Vegetti 2020). The main replacement for their APIs was the Social Science One (SS1) initiative, a collaboration between Meta, a consortium of private grant makers, and a committee of academics. SS1’s offerings differed from those of Meta’s former APIs in several key respects:
The two data sources offered fundamentally different types of data. Whereas the APIs delivered metadata for individual posts and comments (e.g., the post’s full text, the user who posted it, the time of posting, numbers of likes, shares, comments, etc.), SS1’s datasets were solely URL-based. The metadata for these URLs was presented in aggregate form—for example, the total number of unique accounts that shared each URL, reported it as hate speech, or shared it without clicking on it, as well as the country whose residents shared it most (DeGregorio et al. 2019). These two different data types supported vastly different research designs.
During the limited options period, SS1 data were available only to academics interested in a narrow set of research questions “that examine the impact of social media and related digital technologies on democracy and elections, generate insights to inform policy at the intersection of media, technology, and democracy, and advance new avenues for future research” (Social Science One 2018). While successful applicants for this first request for proposals received both data and funding, later requests for proposal removed both the topical restriction and the funding (Social Science One 2021).
While API access was granted on demand through an automated system, SS1 required an extensive, manually evaluated application, similar to what grant makers typically request.
Meta’s policy decisions during this period effectively removed official data access for all but a small number of democracy and election researchers who managed to successfully navigate SS1’s application process. Everyone else faced the choice of shifting their research programs to platforms with more congenial data access regimes, such as YouTube and Reddit, or resorting to unofficial methods. Mancosu and Vegetti (2020) proposed a normative justification and technical description of a privacy-preserving, Facebook-scraping application but stopped short of publishing a functioning program to implement it. While Twitter’s search and streaming APIs were available during this period, the lack of free options for historical data led a group of developers to create a program called Twint that could collect publicly available tweets retrospectively without using the platform’s APIs (Poldi and Zacharias 2023). These developers extended a long tradition of open-source development to support academic research, which persists under the perpetual threat of platform policy changes that may render the software obsolete at any moment.3 However, as our history demonstrates, this threat of sudden unavailability afflicts unofficial and official methods alike.
The academic cooperation period (2020–2023)
Two parallel tensions have shaped researchers’ access to social media data over the years included in our history: first, growing academic debate over platforms functioning as data intermediaries in the distribution of news and political information (Klinger et al. 2023; Nielsen and Ganter 2018); and second, platforms’ varied efforts to comply with legal frameworks protecting users’ personal data—such as the European Union’s General Data Protection Regulation (GDPR) and Digital Services Act (DSA)—while safeguarding their proprietary interests. Consequently, platforms established several data access options for academic research in this new period of academic cooperation. Some of the oldest social media platforms—including Facebook, Instagram, and Twitter (now X)—implemented academic-only data sources in response to public scrutiny about their role in society. TikTok introduced a similar data access regime to these platforms, while YouTube and Reddit maintained their preexisting policies.
Meta provided researchers with access to publicly shared Facebook and Instagram content through its CrowdTangle tool starting July 31, 2020, following a pilot program with 250 research teams in 2019 (Shiffman and Silverman 2020). CrowdTangle restored much of the functionality of the Facebook Pages and Groups API that Meta had scuttled in 2018, primarily the ability to collect posts from public groups, pages, and individuals. However, many researchers were dissatisfied with the variability and uncertainty of historical data, since a given query could return anywhere from 4,500 Facebook posts to almost none, depending on the day (Bogle 2022). Additionally, strict rate limits varying by endpoint, combined with the inability to collect comments on posts (which the Pages and Groups API allowed), significantly diminished the utility of these datasets for addressing key research questions (Bogle 2022). Throughout CrowdTangle’s existence, the company struggled to balance platform transparency and its corporate reputation. An X account called “Facebook’s Top 10” used CrowdTangle data to highlight the top-performing posts on Facebook surrounding the 2020 election (Roose 2021). These lists consistently featured conservative and pro-Trump voices, including Ben Shapiro, Sean Hannity, and Fox News, raising questions about whether Facebook disproportionately promoted right-wing echo chambers. Meta suspended new user registrations for CrowdTangle in January 2022 (Reuters 2022) and closed the service permanently on August 14, 2024. We discuss its replacement, the Meta Content Library, in the following section.
Twitter had been one of the most extensively researched social media platforms even before it introduced its Academic Research Product Track on January 26, 2021. This new academic offering provided approved researchers with generous rate limits: up to 10 million tweets per month, much more than its existing free APIs could typically access (Parack 2021). Additionally, academic users could extract data retrospectively from Twitter’s entire historical archive, whereas standard users had to abide by the search and streaming APIs’ preexisting limitations. However, following Elon Musk’s acquisition of the platform in October 2022, Twitter announced that collecting data from its API would henceforth no longer be free (Stokel-Walker 2023).
As Meta and Twitter/X were overhauling their data access regimes, TikTok was growing in popularity in the U.S. and elsewhere. Its predecessor, Musical.ly, was created in 2014 as a platform for sharing short, music-related video clips. It was purchased by the Chinese tech company ByteDance in 2017, and by February 2025, TikTok had attracted an audience of over 135 million U.S. monthly active users (Statista 2025). Before 2023, TikTok offered researchers no official means of collecting data; however, this gap was filled by unofficial open-source programs, including TikTokAPI, Pyktok, and Minet. The platform debuted its research API in February 2023; we discuss it in greater detail in the following section.
Many academics used the Pushshift archive between 2015 and 2023 to collect Reddit data (Baumgartner et al. 2020). Following a policy update on April 18, 2023 (KeyserSosa 2023), the platform removed public access to Pushshift, restricting it to site moderators (lift_ticket83 2023). Researchers can still access historical Reddit data collected by Pushshift via torrent downloads, which allow for the easy sharing of two terabytes of archived Reddit data, with individual download options for popular subreddits. These torrents include all public content contributed prior to May 2023, as well as content from the 40,000 most popular subreddits by membership through the end of 2024 (Watchful1 2025).
The academic cooperation period featured expanding and contracting researcher data access for three main reasons. First, regulatory compliance through legal frameworks, such as the EU’s GDPR and DSA, pressured platforms to implement stricter data controls. Second, corporate reputation emerged as an essential platform consideration, as exemplified by Meta’s reaction to “Facebook’s Top 10,” which revealed potentially uncomfortable truths about Facebook’s user demographics and popular online communities. Third, commercial interests frequently clashed with academic needs, exemplified by X’s elimination of the academic API in favor of paid access tiers. During this period, we observed a pattern where platforms that face intense public scrutiny (Meta, Twitter/X) tend to implement the most restrictive data access controls. In contrast, platforms facing less controversy (YouTube, Reddit) and newer entrants (TikTok) maintained relatively more open data access. The introduction of specialized academic access programs—often followed by their replacement with more limited alternatives—suggests that platforms recognize the value of academic research but struggle to balance transparency with their business interests.
Present: Data Access Regimes in 2025
The history recounted in the preceding section is decidedly nonlinear: Data access privileges ebb and flow over the years depending on the platform, changes in corporate leadership, political exigencies, and other factors. Attempting to characterize the overall state of social media data access in 2025 risks near-instant obsolescence, as data access regimes inevitably evolve. Nevertheless, we believe it is important to capture the moment at a useful level of abstraction, both to provide a practical guide for current researchers and to offer historical context for future ones. To that end, we categorize our six platforms into four data access regimes: laissez-faire API, academic API, academic walled garden, and pay-to-play API. Each regime offers a bundle of features of varying degrees of utility depending on the end user. Nevertheless, they define the limits of what can and cannot be studied on the platforms through their officially sanctioned channels. In addition to these four, we discuss a fifth regime into which we classify unofficial methods of accessing platform data.
Laissez-faire API
This data access regime has the oldest roots of the four: All the platforms that existed during the laissez-faire period adopted it at the time. It offers free, on-demand generation of API credentials with no academic affiliation required. Once authenticated, users can extract structured metadata from any public account. The main limitation is the rate limit, which may be defined per day, hour, or minute depending on the platform. Thus, given sufficient time, one could, in theory, collect all available public data for any given query. Of the six platforms we examine in this article, YouTube and Reddit currently operate according to the laissez-faire API data access regime (Reddit also offers an academic API option, which we discuss below). Over the timespan our history covers, these two platforms have made fewer changes to their API offerings than have the others.
The laissez-faire API regime offers several key advantages to the academic end user. The automated granting of credentials means that anyone can acquire API access instantly, even without a traditional institutional affiliation. Data collection and research app development are free of charge. The only requirement is sufficient knowledge of a programming language, such as R or Python, to interact with the API through an open-source program (e.g., python-youtube for YouTube). Anyone with such knowledge can start generating their own datasets as soon as they familiarize themselves with their chosen program’s syntax.
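As a concrete illustration of this regime, the sketch below constructs a request URL for the YouTube Data API v3 search endpoint, whose `part`, `q`, `maxResults`, and `key` parameters follow Google’s public documentation. The API key shown is a placeholder; anyone can obtain a real one for free through the Google Cloud console:

```python
from urllib.parse import urlencode

def build_youtube_search_url(query, api_key, max_results=50):
    """Construct a YouTube Data API v3 keyword-search request URL.

    Anyone with a free API key can issue such requests, subject only
    to a daily quota -- the hallmark of the laissez-faire regime.
    """
    base = "https://www.googleapis.com/youtube/v3/search"
    params = {
        "part": "snippet",       # return basic video metadata
        "q": query,              # keyword query
        "maxResults": max_results,
        "key": api_key,          # placeholder credential
    }
    return f"{base}?{urlencode(params)}"

url = build_youtube_search_url("election 2024", "YOUR_API_KEY")
```

Fetching this URL (e.g., with the `requests` library) returns a JSON response listing matching videos and their metadata; open-source wrappers such as python-youtube hide this plumbing behind higher-level functions, but the underlying mechanics are no more complicated than this.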
Academic API
The academic API regime emerged well over a decade after the first laissez-faire APIs debuted, as platforms decided to impose more control and discretion over who received data access. Both regimes offer similar kinds of data—generally, structured arrays of metadata linked to individual posts, comments, or videos. The main difference lies in how access is obtained: While laissez-faire APIs offer access to anyone with no advance oversight, academic APIs require lengthy, manually reviewed applications. As the name implies, one prerequisite for access is that the applicant be affiliated with an accredited institution of higher education (or certain types of nonprofit organizations in TikTok’s case). Reddit—which in addition to its laissez-faire option offers more generous data limits to approved academic applicants—inquires about their research questions, the kinds of data they plan on analyzing, and the end goals the data will support, among other details. TikTok, which offers data only through its academic API, asks applicants for a project summary, research design, specific hypotheses, a data protection plan, and more. Both platforms warn that application review will take substantial time (4 weeks for TikTok, 8 to 12 for Reddit), and there is no guarantee of application approval. Neither platform has published any guidelines as to what kinds of research questions may not be permitted, if any exist.
In February 2023, TikTok launched its official research API, following the introduction of several unofficial tools designed for extracting TikTok data. However, the official API has substantial limitations, including restricted access to data and metadata related to comments, users, and videos. Specifically, API users are capped at 1,000 daily requests, with each request restricted to 100 records, meaning users can collect at most 100,000 records (e.g., videos, comments) per day (Ruz et al. 2023). Moreover, the research API issues bearer tokens that expire after two hours and limits each query to a 30-day date range. Therefore, advanced programming skills are necessary to automate data collection efficiently; otherwise, users must manually restart the process every two hours. Beyond this, users must refresh their data every 30 days and remove any data that are no longer accessible via the TikTok research API at the time of each refresh (TikTok 2024).
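These constraints can be scripted around. The sketch below (our own illustration, not TikTok’s official client) splits a long study period into the 30-day windows the research API requires and computes the theoretical daily ceiling implied by the documented request and record limits:

```python
from datetime import date, timedelta

def thirty_day_windows(start, end):
    """Split [start, end] into consecutive windows of at most 30 days,
    matching the research API's per-query date-range limit."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=29), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows

# Documented caps: 1,000 requests per day x 100 records per request.
MAX_RECORDS_PER_DAY = 1_000 * 100  # 100,000 records

# A study period of 2.5 months yields three query windows.
windows = thirty_day_windows(date(2024, 1, 1), date(2024, 3, 15))
```

A full collection pipeline would iterate over these windows, re-authenticating whenever its two-hour bearer token lapses; the point here is simply that the API’s limits dictate the shape of any collection effort before a single record is retrieved.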
As discussed in the preceding section, between 2021 and 2023 Twitter offered an academic API in addition to the free API services it had made available to anyone since 2011. In 2023, several months after Elon Musk completed his purchase of the platform, it switched exclusively to the pay-to-play API regime.
Pay-to-play API
X is the only one of the six platforms discussed here to adopt the pay-to-play API regime, and its offerings have changed substantially since its debut. After announcing in February 2023 that the academic API would soon be shuttered, X initially offered three pricing tiers for its data. The cheapest tier, “Small,” charged $42,000 per month for access to 50 million posts (Stokel-Walker 2023). The next two tiers cost $125,000 and $210,000 per month for up to 100 million and 200 million posts, respectively. The sole free option at that time was write-only, i.e., it only allowed users to create content rather than collecting it (XDevelopers 2023). Within weeks, X added a “Basic” paid option that, for $100 per month, granted access to 10,000 posts per month (a cost of one cent per tweet).
As of this writing, X hosts conflicting information about the current pricing and offerings of its API access tiers. The second Google search result for “x api” (without quotes) as of this writing presents Free and Basic tiers with prices and offerings consistent with the descriptions above (X Developer Platform 2025). However, on the front page of its Developer site, the platform states that its Free tier allows users to collect 100 posts per month (X Corp. 2025b). Further, its Basic tier is listed as costing $200 per month instead of $100 ($175 per month if paid annually). Both pages contain identical details for the Pro tier: one million tweets per month for $5,000 per month. It seems that X has simply failed to standardize its pricing information, yet the discrepancy may confuse users as they attempt to determine how much X data they can afford. We attempted to collect data using the Free tier as a test but were unsuccessful, which may indicate additional discrepancies between what X advertises and what it makes available.
Academic walled garden
Our fourth access regime, the academic walled garden, bears similarities to and differences from two of the preceding three. Adopted by Meta for Facebook and Instagram with the introduction of the Meta Content Library (MCL), it is both free (like the laissez-faire and academic APIs) and available exclusively to academics and nonprofit researchers (like the academic API). The company has outsourced evaluation of applications to the University of Michigan’s Inter-university Consortium for Political and Social Research (ICPSR), in contrast to Reddit and TikTok, which review their applications in house. The MCL offers several advantages over previous Meta data offerings, including the ability to analyze a broad selection of metadata for diverse categories of Facebook and Instagram posts—photos, videos, reels, and stories (Meta for Developers 2025)—and access to full-text posts and comments (Newton 2024). However, it also features significant limitations when compared to CrowdTangle and the Pages and Groups API. The MCL allows data to be exported only above certain thresholds of visibility—25,000 members for Groups and 15,000 members for Pages (SOMAR, n.d.)—and content falling below these thresholds can only be analyzed on the MCL’s “clean room” online platform. These restrictions considerably reduce its utility as a research tool.
Unofficial data access
We classify data acquisition methods not officially sanctioned by the platforms into their own data access regime. Since the pre–social media days of archiving websites using software like WebArchivist.org (Foot et al. 2003) and scraping Usenet with NetScan (Brush et al. 2005), academics have proven resourceful at procuring online communication data by developing and applying methods that are not officially platform-approved. As social media began to enter the mainstream, enterprising programmers—both inside and outside the academy—continued creating software to facilitate the process of extracting data from platforms. Unlike programs that interface with platform APIs, the ones we focus on here collect their data by other means. Some do so by “scraping” web pages intended for standard browser display, while others use data feeds that are not publicly documented but are accessible through a browser’s developer tools interface.
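To make the scraping approach concrete: many platforms embed a JSON blob of page state inside the HTML they serve to browsers, and unofficial tools extract that blob with a regular expression rather than an API call. The miniature sketch below uses an invented page and variable name (`__PAGE_STATE__`); real tools target each platform’s actual embedded-state variable, which differs by site and changes over time:

```python
import json
import re

# A simplified stand-in for a platform page: the interesting data
# ship as JSON assigned to a JavaScript variable. The variable name
# "__PAGE_STATE__" is invented for illustration.
html = """
<html><body>
<script>window.__PAGE_STATE__ = {"video": {"id": "987",
"description": "A cat video", "plays": 1500}};</script>
</body></html>
"""

def extract_embedded_state(page_html):
    """Pull the embedded JSON state out of a page's HTML."""
    match = re.search(r"window\.__PAGE_STATE__ = (\{.*\});",
                      page_html, re.DOTALL)
    if match is None:
        raise ValueError("embedded state not found -- page layout may have changed")
    return json.loads(match.group(1))

state = extract_embedded_state(html)
```

The fragility discussed below is already visible here: if the platform renames the variable or restructures the blob, the pattern silently stops matching and the tool breaks.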
Unofficial methods entail a range of advantages and disadvantages that merit consideration. To begin with the former, one major rationale for creating such software is that no official options exist. This was the case for TikTok prior to February 2023, which spurred one of us (Freelon) to develop the first version of the open-source Python data collection program Pyktok during the summer of 2022. A somewhat different case arises when an official option exists but certain data of interest lie beyond its scope. For example, before 2021, Twitter’s free APIs did not allow collection of data older than seven days with a keyword search. Twint, an unofficial open-source program, allowed users to extract such data from Twitter’s web interface for free. This feature was especially important since in addition to charging for historical data, Twitter reserved the right to reject data purchase requests and is known to have done so. In general, unofficial methods are important to ensure that platforms do not have sole authority to determine what topics can and cannot be studied using their data (Knight First Amendment Institute 2021).
However, unofficial methods also carry distinct disadvantages. For one, they do not always offer the same metadata fields as their official counterparts, as is the case for both Twint and Pyktok, among others. Thus, researchers requiring specific metadata fields not returned by a given unofficial option need to look elsewhere. Further, advanced software development skills are required to create (and sometimes use) data collection programs, unofficial or not. And even those who possess such skills may be dissuaded from applying them by the limited professional incentives for software development in the social sciences. In most such fields, tenure-track academics would satisfy their promotion guidelines more effectively by producing journal articles and books rather than software (although, as our efforts attest, we believe software programs should count as substantial contributions to social science). Finally, and perhaps most crucially, changes to a platform’s data storage format or protocols can render unofficial software obsolete instantly. While this issue also impacts programs that interface with official data endpoints, in such cases users are typically given a generous amount of advance notice to migrate their data acquisition pipelines. The price of building software to extract data without permission is that the fruits of one’s labor, along with all the research that depends on it, may grind to a halt at any moment.
But despite these disadvantages, unofficial methods remain a critical component of the social media researcher’s tool kit. As our history shows, APIs and other official data access methods exist at the platforms’ sole discretion. Unfortunately, platforms face structural incentives to limit or eliminate such methods entirely. Among these is the imperative to protect user privacy, which was the stated reason for terminating Facebook’s and Instagram’s APIs in the wake of the Cambridge Analytica scandal. Another is that certain types of research may generate unfavorable perceptions of the platform, as U.S. Senator Ron Wyden (D-OR) speculated was the case when Meta eliminated New York University’s Ad Observatory’s access to two Facebook data endpoints (Hatmaker 2021). A third justification for tightening control over APIs is to prevent generative AI companies from using them to extract model-training data without adequate permission or compensation (Harding 2023). It is neither desirable nor feasible to relegate all social media research to such tenuous, platform-controlled methods. Thus, apart from the logistical considerations discussed above, we argue that the research community should support and encourage the use of unofficial methods (e.g., in the classroom, the peer review process, the human subjects review process, etc.). Recent relevant case law supports our position (hiQ Labs, Inc. v. LinkedIn Corp. 2019; Sandvig v. Barr 2020; X Corp. v. Bright Data Ltd. 2024).
Future: Building Better Data Access Regimes
Our analyses of the past and present eras of social media data access have been primarily descriptive. It is important to establish a solid understanding of where we have been before considering where we might go in the future. In this section, we will not attempt to predict the future of social media data access; instead, we will articulate a set of normative recommendations that, as we will argue, should shape platform data provision in the future. Our suggestions are informed by the history we have recounted as well as existing normative work on social media data provision. The decisions to implement these suggestions are out of our hands, but we hope that some will garner enough support to at least be considered by those who create platform data access policies.
We divide our suggestions into six data-relevant categories: access, format, management, use, costs, and sharing, each of which we discuss in turn.
Data access
Leaving the question of who can access platform data in the hands of the platforms is not ideal. They have an inherent conflict of interest that incentivizes the denial of data requests that might reveal unpopular, unethical, or even illegal activity. Beyond that, platforms are not qualified to decide which research proposals should and should not be granted access; such decisions should be made by experts in the given field of research. Unlike money, data are not a limited resource, and thus the proposal evaluation process should focus less on identifying the “best” proposals (as Twitter did with its “data grants” program [see Ravindranath 2014]) and more on ensuring a baseline level of technical competency to obtain and analyze the data. This approach would reduce the amount of time and effort required of both the researchers and the evaluators. Meta’s outsourcing of data access proposal evaluation to ICPSR is a model that other platforms should consider, pending the availability of suitable partners.
Outsourcing the task of evaluating data access proposals to qualified partners may not always prove practical, especially for smaller platforms with fewer resources. In such cases, we advise platforms to be as clear as possible about what kinds of proposals they will and will not accept by posting public guidelines. These guidelines should include descriptions and examples of research projects that will not be accepted, in part to allow researchers to push back if and when legitimate scientific inquiry is being stonewalled. Twitter has historically declined to do this: When one of us (Freelon) had a proposal rejected, the company refused to share its internal proposal guidelines upon request (although it did communicate the general reason that the request was denied). Had it shared this information publicly, researchers would not have wasted time preparing proposals that had little or no chance of being accepted. Ultimately, whenever a proposal is rejected, the platform (or academic partner) should provide an explanation for the rejection and indicate what changes would be required to render it acceptable.
Data format
Platforms, including all six discussed in this article, have historically made their data available to researchers in JSON (JavaScript Object Notation), a flexible, text-based format that can accommodate densely layered data structures. These JSON files were typically supplied by the platform’s API endpoint and downloaded to the end user’s local computer for analysis. Four of the current six platforms still follow this basic procedure, but the MCL imposes a different procedure for Facebook and Instagram data. As discussed in the previous section, approved researchers are directed to a cloud-hosted “clean room environment,” from which only some data can be downloaded; all other data must be analyzed online. The clean room’s ostensible purpose is to protect users’ privacy from the kinds of violations committed by Cambridge Analytica, but it poses a set of unique challenges for researchers. The system has been criticized as cumbersome and slow (Gotfredsen and Dowling 2024)—understandably given the substantial technical overhead required to operate it. More fundamentally, the limitations imposed on data downloads drastically restrict the kinds of analyses that can be performed on nondownloadable data. The MCL is built on the open-source JupyterLab data science development environment, and its particular implementation supports only four programming languages: Python, R, Julia, and bash. Thus, only tools compatible with these languages can be applied to content from Pages and Groups that falls below the visibility threshold for downloading. Also, nondownloadable data cannot be merged with relevant data from other sources since the latter cannot be uploaded to Meta’s clean room.
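The layered structure that JSON accommodates is also what makes platform data awkward to analyze in row-and-column tools. A minimal Python sketch, using a made-up record whose field names are purely illustrative, shows the routine step of flattening nested JSON into dotted column names for tabular analysis:

```python
import json

# A minimal, invented example of the kind of nested JSON record an API
# endpoint might return for a single post; field names are illustrative.
record = json.loads("""
{
  "id": "987",
  "text": "example post",
  "author": {"username": "alice", "followers": 1200},
  "metrics": {"likes": 10, "shares": 3}
}
""")

def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted column names for tabular analysis."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

row = flatten(record)
# row now maps "author.followers" -> 1200 and "metrics.likes" -> 10
```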
To maximize the value of platform data to researchers, and by extension the public, data should be downloadable in most circumstances. Data applications should include a provision making the research team responsible for ensuring proper use of the data they request. If an arbitrary threshold must be used to restrict researcher access to extremely low-visibility content, Meta should consider replacing the current 25,000-member limit for Groups and 15,000-member limit for Pages with CrowdTangle’s former 2,000-member minimum. Even then, researchers should be allowed to argue for access to content falling below that threshold based on the public value their research would offer.
Data management
Once researchers have obtained the data they seek, they are bound by the platform’s terms of service to follow certain rules in managing it. Historically, these rules were written primarily with commercial developers in mind and therefore did not always fit researchers’ use cases (Tromble 2021). But even when written specifically for researchers, platform rules around data management do not always prioritize research rigor and, in some cases, may compromise it. For example, the developer agreements of X/Twitter, Reddit, and TikTok all contain provisions that require API users to continuously purge their local datasets of any content that has been deleted from the platforms (Reddit, Inc. 2024; TikTok 2024; X Corp. 2024). In addition to imposing a logistical burden, this provision harms the quality of any research attempting to comply with it. Following the letter of the law, as it were, would generate datasets whose parameters are in constant flux, from initial data collection through analysis and peer review. It is thus possible that, due to such mandatory deletions, a given study’s findings could change as it is being conducted—an outcome that might be especially problematic for studies focusing on highly sensitive topics where content is frequently deleted. Needless to say, such requirements pose an unacceptable threat to scientific research on platform behavior.
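To make the logistical burden concrete, here is a minimal Python sketch of the kind of deletion sync such provisions demand. All names and data are invented; a real pipeline would query the platform's lookup endpoint to learn which IDs still resolve.

```python
# Invented local dataset standing in for data collected from a platform API.
local_dataset = [
    {"id": "a1", "text": "post one"},
    {"id": "a2", "text": "post two"},
    {"id": "a3", "text": "post three"},
]

def purge_deleted(rows, still_live_ids):
    """Keep only rows whose content the platform still serves."""
    return [row for row in rows if row["id"] in still_live_ids]

# First compliance sync: the author of "a2" has deleted it.
local_dataset = purge_deleted(local_dataset, {"a1", "a3"})
# A later sync removes "a3" as well, changing the dataset mid-analysis.
local_dataset = purge_deleted(local_dataset, {"a1"})
# The dataset's parameters have now shifted twice since collection.
```

Each sync silently changes the denominator of every statistic computed from the data, which is the rigor problem the paragraph above describes.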
That said, we acknowledge that these requirements are intended to protect platform users’ privacy and their prerogative to control access to their own data. Our proposed solutions attempt to balance those values with the researcher’s interests in producing rigorous and accurate empirical impressions of online activity. First, with few exceptions, researchers must explicitly be permitted to permanently fix the parameters of their datasets at the time of acquisition. That is, they should not be held responsible for continuously synchronizing their datasets to the platform’s API or web feeds. The privacy risks are likely low for quantitative studies, as they do not typically publish identifying information, such as usernames and direct quotes. The risks are somewhat higher for qualitative research, which has historically published such information, but the risks could be mitigated by requiring researchers to obtain permission from users before publishing identifying content. 4 Alternatively, researchers could publish paraphrased or composite versions of platform posts that convey the same points as the originals while protecting privacy, as Mason and Singh (2022) suggest. Researchers should respond to requests to remove specific content for specific reasons, a protocol that the three aforementioned platform agreements already require.
Data use
Some platforms have attempted to control what kinds of research academics can conduct using their data. For example, X/Twitter officially prohibits developers using its API from “mak[ing] X Content, or information derived from X Content, available for purpose of . . . monitoring sensitive events (including but not limited to protests, rallies, or community organizing meetings); or (d) targeting, segmenting, or profiling individuals based on sensitive personal information, including their health (e.g., pregnancy), negative financial status or condition, political affiliation or beliefs, racial or ethnic origin, religious or philosophical affiliation or beliefs, sex life or sexual orientation, trade union membership” (X Corp. 2024).
The implications of this passage for research depend on how one interprets the verbs “monitoring,” “targeting,” “segmenting,” and “profiling.” An overly risk-averse interpretation might read it as forbidding the analysis of posts about political protest or unrest. Similarly, the mention of “political affiliation or beliefs” could be interpreted as prohibiting the calculation of political ideology scores for users. Substantial bodies of literature exist that do both of these things (for example, see Barberá 2015; Bastos and Mercea 2016; Conover et al. 2011; Freelon et al. 2018; Tremayne 2014), with no apparent attempts by the platform to enforce the rules or prevent future violations. Because these rules are neither practically enforceable nor likely to harm users or the platform, they should be abandoned. The MCL’s terms do not ban specific research topics, but they do prohibit “mak[ing] any attempt to combine Meta Data with any other data or datasets” (Meta 2024). This rule impedes potentially useful research without justification, so we recommend that it, too, be repealed.
Reddit’s Developer Terms offer a researcher-friendly model that other platforms could follow. They forbid a range of problematic uses, including illegal, deceitful, commercial, disruptive, and privacy-violating ones, but do not attempt to dictate which specific topics researchers may study, nor do they restrict researchers from combining Reddit data with other relevant datasets. Our position is that Reddit’s terms strike an ideal balance among the prerogatives of researchers, end users, and the platform, and that the other platforms we discuss in this article are similar enough to Reddit to be governed by the same rules.
Data costs
The only platform of those we consider here that currently charges money for access to its data is X/Twitter, but the others could decide to start charging at any moment. Nothing prevents private companies from charging whatever they want for their services, but for those interested in supporting academic research, we offer the following suggestions. First, while the absence of cost as a barrier to analyzing platform content clearly holds value, we understand that it may be necessary to charge data fees in some instances (for example, for very high data volumes or newer platforms that need the money to help subsidize core operating costs). In such cases, platforms should avoid charging exorbitant rates that most academics cannot afford, as X/Twitter did when it initially shut down its academic API (Stokel-Walker 2023). The company’s current data plans are not cost-effective for academics: Its Basic plan offers too few posts per month (10,000) for many research projects, while its Pro tier would charge $60,000 for 12 million posts over the course of a year, pricing out all but the wealthiest institutions and funded projects. We advise any company interested in selling its data services to academics to negotiate with them to arrive at reasonable prices. Companies might also consider charging more for researchers in higher-income countries and less for those in lower-income countries as defined by the World Bank’s four income tiers (Metreau et al. 2024).
Charging for very large or otherwise exacting data requests is one solution to a long-standing complaint among social media researchers: that comprehensive data are difficult if not impossible to access (González-Bailón et al. 2014; Tromble 2021). Before the X era, Twitter sold datasets containing all public tweets matching specified queries, but this practice was prohibitively expensive (Driscoll and Walker 2014). Lowering prices would obviously help here, but platforms could also consider selling random samples of posts matching high-volume keywords (or even of all platform traffic) at a lower cost than the complete set of matching posts. 5 Such a pricing model offers advantages to both researchers (who gain multiple data access options at different price points) and platforms (which gain another revenue stream). It could facilitate certain types of research that are difficult to conduct under current access regimes, including studies that seek to discover how popular certain users or types of content are proportionally to total platform traffic (McGrady et al. 2023).
Data sharing
Platforms have an understandable interest in controlling who has access to their data: The Cambridge Analytica scandal illustrates the harms that can ensue when the rules are too lax and/or not enforced. All six platforms require an account to obtain data access credentials, and all except YouTube and X/Twitter require a written application explaining how the data will be used. (Twitter’s academic API formerly required this.) All of the platforms’ terms of service prohibit the sharing of datasets with unauthorized parties. This restriction is generally appropriate but conflicts with the push toward open data in research publications. Many academic journals now require data availability statements that presume that researchers will publish their data absent a specific reason not to. Further, in the classroom, platform data can be useful to help students learn and practice data analysis. Current restrictions on platform data sharing offer no exceptions for these important use cases.
Openly sharing complete platform datasets in academic repositories is problematic, as the burden would fall to individual users to petition the repositories to have their data removed—but many, if not most, users would be unaware that their data were publicly available in the first place. Before it became X, Twitter’s solution to this issue was to enable API users to “rehydrate” its data. The rehydration paradigm permitted the sharing of internal Twitter content IDs (e.g., for posts, comments, images, videos, etc.) rather than the actual data. When researchers wanted to access the data, they could computationally request the full data row corresponding to each content ID from the platform. By design, any content removed from the platform between the time when the dataset was first collected and when it was rehydrated did not appear in the rehydrated version. Assenmacher et al. (2023) criticize the rehydration paradigm for the damage it does to reproducibility, damage that generally increases over time. However, their recommended alternative approaches, which include making full datasets available upon request and redacting identifying information, come with their own limitations. For one, data theoretically “available upon request” are not always accessible in practice, as a substantial proportion of data requests are ignored or declined (Tedersoo et al. 2021). Beyond that, Assenmacher et al. offer no guidance as to how platform data should ideally be anonymized other than a mention of differential privacy (Oberski and Kreuter 2020), which is not suitable for all use cases. Their discussion focuses on “harmful online communication,” which, as they demonstrate, suffers disproportionately from attrition as platforms tend to remove it quickly. For other topics less vulnerable to attrition, we argue that the rehydration paradigm should be reinstated to offer another privacy-protective data access option for research.
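The rehydration workflow described above can be sketched in a few lines of Python. The platform lookup is mocked here with a dictionary of invented records; a real implementation would call the platform's content-lookup endpoint for each shared ID.

```python
# Mocked platform state: the content-lookup "endpoint" is a dict here.
# Post "102" has been deleted from the platform since data collection.
PLATFORM = {
    "101": {"id": "101", "text": "still online"},
    "103": {"id": "103", "text": "also online"},
}

# Under the rehydration paradigm, researchers publish only content IDs.
shared_ids = ["101", "102", "103"]

def rehydrate(content_ids):
    """Rebuild full records from shared IDs; deleted content drops out."""
    rows = []
    for cid in content_ids:
        record = PLATFORM.get(cid)  # deleted IDs return nothing
        if record is not None:
            rows.append(record)
    return rows

dataset = rehydrate(shared_ids)
# dataset contains only posts "101" and "103": the deleted post "102"
# is silently absent, which protects users who removed their content
# but degrades reproducibility as attrition accumulates.
```

The sketch makes both sides of the trade-off visible: privacy protection is automatic, while the reproducibility loss that Assenmacher et al. criticize grows with every deletion.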
Conclusion
The overarching lesson of the post-API age is that access to platform data is permanently contingent on factors over which researchers have little or no control. As our history shows, the illusion of stability to which many of us grew accustomed in the early days was not shattered all at once but rather deteriorated gradually in a series of policy shifts caused by technological developments, the passage of new legislation, changes in management, and public scandals. The current empirical reality to which experienced and neophyte researchers alike must adapt entails a variety of methodological approaches, of which APIs constitute only one. The ability to analyze future platform data will depend as much on innovation from the research community—particularly in the field of software development—as on the whims of the platforms’ corporate owners.
We end this article with a brief discussion of the role of government policy in facilitating data access. The prospect of government intervention to ensure data access for scientific research seems remote in the U.S. context, as there is little historical precedent or current legislative momentum for it. The situation is different in Europe, where the European Union’s Digital Services Act (DSA) officially designates digital platforms serving 45 million or more users as Very Large Online Platforms (VLOPs) subject to its regulatory authority. VLOPs operating in the EU are required to make their data available to independent researchers “for the sole purpose of conducting research that contributes to the detection, identification and understanding of systemic risks in the Union” (Digital Services Act 2022). A recent EU report detailing the mechanisms by which platforms make data available to researchers found a variety of approaches similar in breadth to what we document in this article (European Commission 2024). Some such options are available only in EU countries. For example, X allows EU-based researchers to apply for free data access to conduct “qualified research under the Article 40 of the Digital Services Act” (X Corp. 2025a). Other researchers must pay, as we detail above. International differences in data access regimes are inevitable, all but ensuring inequalities in data access into the foreseeable future.
