Abstract
In this article, we analyze whether Twitter can be used to detect reliable reports of issues at polling places. We use 20,322 tweets geolocated to U.S. states that match a series of keywords on the 2010, 2012, 2014, 2016, and 2018 general election days. We fine-tune BERTweet, a pre-trained language model, using a training set of 6,365 tweets labeled as issues or non-issues. We develop a model with an accuracy of 96.9% and a recall of 72.2%, and another model with an accuracy of 90.5% and a recall of 93.5%, far exceeding the performance of baseline models. Based on these results, we argue that these BERTweet-based models are promising methods for detecting reports of polling place issues on U.S. election days. We suggest that outputs from these models can be used to supplement existing voter protection efforts and to research the impact of policies, demographics, and other variables on voting access.
Introduction
In the run-up to the 2018 gubernatorial election in the U.S. state of Georgia, reports of a wrongful purge of 340,134 voters from Georgia’s voter registration rolls emerged on major media platforms (Durkin, 2018). By the day of the election, the governor’s race in Georgia and allegations of voter suppression became a national news story (Anderson, 2018). However, on the election day itself, the best resource for learning about what was happening in Georgia was not national media, election watchers, or the state’s own website. Instead, it was Twitter, where pictures of long lines at polling sites across Georgia had begun to go viral, with thousands of retweets, likes, and comments (Sarmiento, 2018). Activists, journalists, and politicians, but mostly regular voters, found in Twitter a platform not just to report issues at their polling places but also to spark a national conversation about voter suppression, voter turnout, and voting rights.
In 2013, the U.S. Supreme Court ruled that Section 4(b) of the Voting Rights Act of 1965 was unconstitutional (Roberts & Supreme Court, 2012). This ruling invalidated Section 5 of the Voting Rights Act, which required specific areas of the U.S. with a history of voter suppression to receive “preclearance” from the federal government before modifying any of their voting or election laws. Since then, the federal Commission on Civil Rights has found that discriminatory and restrictive voting laws have surged across the U.S. (Bradner, 2018). Despite this research, and its implications for American democracy, the U.S. has no central system for reporting issues at polling places. State hotlines are rarely used or advertised, and there is very little public data counting the number of issues that occur at polling places (Mebane et al., 2017). Thus, in the absence of a better system, Twitter offers a medium for reporting, detecting, and sharing issues. As the 2018 Georgia election demonstrated, many people already share the issues that occur at their polling places on social media. The more general question is whether Twitter can be used to detect issue reports at polling places on U.S. election days reliably and systematically. In this article, we argue in favor of a qualified, positive answer.
The article is structured as follows. In the Background section, we define polling place issues and review past work on complaints posted to social media, on polling place issues, and on large language models. We then describe our data and labeling process, our baseline and BERTweet models, and our evaluation metrics, before presenting our results and concluding with implications for voter protection efforts and future research.
Background
Defining Polling Place Issues
Defining polling place issues is not strictly a legal question. Some issues that occur at polling places, such as long lines, are legally acceptable yet still constitute barriers to voting worth cataloging. For example, while there is no universal legal standard for line times, a report from the United States Presidential Commission on Election Administration (2014) recommends that “voters should not need to wait more than half an hour to vote” (p. 19). Other examples of polling place issues for this article that are not necessarily illegal include polling places facing technical difficulties, late changes to polling place locations, and voters not appearing on registration rolls (USPCEA, 2014).
Other polling place issues occur because of illegal behavior. For example, U.S. federal law forbids voter intimidation, examples of which include aggressive questioning about citizenship or other qualifications to vote, falsely presenting as an election official, spreading false information about voting requirements and laws, harassment, and displaying intimidating or false signs about voter fraud (American Civil Liberties Union, 2021). Many states also have rules against “electioneering,” defined as political activities near or within a polling place. However, these restrictions and the definition of “near” vary significantly from state to state, with some states prohibiting voters from wearing political apparel while others only prohibit political literature (National Conference of State Legislatures, 2021).
In this article, we define polling place issues as lines longer than 30 minutes, technical difficulties with machines, closed polling sites, voters wrongfully barred from voting, electioneering, voter intimidation, harassment, and any illegal behavior. These issues can involve official state policies, poll workers, voters, observers, campaigns, or other people at the polling place. While this is not a comprehensive list of issues that occur at polling places, it broadly covers the types of issues of interest and provides a reliable basis for polling place issue detection.
Complaints on Social Media
Before discussing past work about polling place issues, it is worth briefly discussing the theoretical foundations for this work. This article asks: can Twitter be used to detect issue reports at polling places on U.S. election days? Although there is limited work about polling place issue reports on social media, we look to another body of past work for theory: analyses of complaints posted to social media. This work helps us develop a theoretical foundation for this article’s research question in three ways. First, we can examine whether users are likely to complain about public services on social media; second, we can understand why users complain on social media and examine whether those reasons are likely to exist for polling place issues; and third, we can examine whether it is possible to detect and classify those complaints.
Past work examining differences between tweets directed to New York City (NYC) agencies, tweets geolocated to NYC that complain about agency-relevant issues, and phone calls directed at 311 found that, during the 2013 sample period, there were differences in the number of complaints tweeted to agencies and called in to 311, although the relative volumes of geolocated tweets and 311 calls were similar (Frías-Martínez et al., 2014). They also find that complaints geolocated to NYC but not directed to the agencies are common and are often directed to friends or followers, suggesting that some complaints skip official channels altogether. Additionally, although Heverin and Zach (2010) do not focus specifically on complaints, the researchers find that Twitter users do regularly share information with and ask questions of police departments in major U.S. cities. In Australia, there is evidence that the plurality of tweets directed at government agencies are complaints about services or policies (Alam & Lucas, 2011). Together, this body of research suggests that Twitter users may post about polling place issues, another kind of public service complaint, when they encounter them.
Other work has examined why users decide to complain on social media. Although this work focuses on complaints about private companies, it offers relevant insights about the types of complaints users share, as well as what motivates them to share. For example, there is evidence that people regularly complain about companies on social media, and the majority of those tweets are about products and services rather than company policies or issues with specific employees (Einwiller & Steilen, 2015). We argue that the process of voting at a polling site can be thought of as a service to voters, potentially motivating them to report polling place issues. Research examining complaints on a popular social networking site in Taiwan finds that personal outcome expectations (e.g., this complaint will help resolve the problem) are a significant motivator for online complaints (Yen, 2016). This also applies to voters, who may be facing issues at polling places that they hope to have resolved.
In a comprehensive analysis of corporate complaints posted to Twitter in Indonesia, Kusumawati and Suhandi (2024) identify several significant predictors of complaining behavior. Namely, users complain on Twitter to vent negative emotions, which may also be the case for some kinds of polling place issue reports. They also find that empowerment plays a role in the decision to complain; users who feel like they cannot solve the problem themselves (e.g., in this article’s case, a long line) and users who feel like their complaint will be seen by others are more likely to complain. Finally, users complain on Twitter because it is an easily accessible platform and because they want to share information with others who may be impacted (Kusumawati & Suhandi, 2024). We argue that these motivators are also likely to be present for voters experiencing polling place issues, suggesting that users are likely to tweet about polling place issues, especially in the absence of other easily accessible forms of complaining.
Researchers have also developed models for detecting and classifying complaints posted to Twitter. As the primary goal of this article is to test whether detecting polling place issue reports on Twitter is possible, this work helps us develop a baseline of feasibility. In many cases, the classification models developed by researchers use modeling techniques similar to the baseline models used in this article. Frias-Martinez et al. (2012) develop a classifier that has an accuracy of 87% for the task of detecting issue report tweets about NYC public services. In other countries, researchers have developed models that can reliably detect complaints relevant to public agencies such as water supply and train issues (Akhtar & Beg, 2021; Fatkhurrochman et al., 2018). In the field of traffic incident detection, researchers have also developed models capable of classifying tweets as reporting issues about traffic and roadways (Fu et al., 2015; Gu et al., 2016; Kumar et al., 2014).
Taken together, past research on complaints posted on social media suggests that users are likely to tweet complaints about public services, which in some cases are different from the complaints shared through traditional channels. Users are often motivated to post complaints by a desire to solve the problem for themselves and for others, and because social media is more accessible than traditional channels. Additionally, there is evidence that it is possible to classify these complaints using methods similar to the baseline models presented in this article. For the purposes of this article, then, there is reason to believe that Twitter is likely to be a platform for polling place issue reports, which share many of the same motivations as other complaints posted to social media, and, as we demonstrate in this article, it is possible to develop a model that reliably and systematically detects those reports.
Past Work on Polling Place Issues
Within the limited research on U.S. polling place issues, most work focuses on analyzing lines and wait times. An examination of this past work is useful for both understanding the constraints of existing methods as well as grounding the importance of further research on polling place issue reports.
Historically, research on lines has generally focused on using post-election surveys or in-person observers to examine where, why, and to whom line-related issues occur. In some cases, researchers have also used simulations to examine the impacts of machine choice and resource allocation on line times; this work is primarily useful for understanding the causes of long lines—often a lack of voting machines and election workers—but it is not directly relevant to this article’s proposed method for detecting issue reports during or after elections (Edelstein, 2006; Herron & Smith, 2016; Highton, 2006; Yang et al., 2013).
In terms of surveys, some work has used data from the Cooperative Election Study (CES) or the Survey of the Performance of American Elections (SPAE), national surveys administered after general elections that ask voters similar questions about their experience at polling places (Coll, 2022; Kimball, 2013; Pettigrew, 2017; Stewart, 2013; Stewart & Ansolabehere, 2015). However, it is worth noting that the CES only offers five options for voters when asked how long they waited to vote: “not at all,” “less than 10 minutes,” “10–30 minutes,” “31 minutes to an hour,” and “more than an hour” (Ansolabehere & Schaffner, 2018). This lack of granularity and the fact that the survey relies on self-reports after the election may present reliability issues. Specifically, because at least 20% of respondents to the CES claim to have voted when they actually have not, with SPAE responses having similar issues, respondents may also misreport their wait times (Chen et al., 2020; Stewart, 2023).
Researchers have also used in-person observation to examine lines. This method solves the reliability problem of self-report surveys but introduces a new problem of generalizability, as the polling sites selected are not representative of the country (Herron & Smith, 2016; Spencer & Markovits, 2010). However, some of these studies have captured data in dozens of counties across multiple states, offering us valuable insights into the causes and impacts of long lines (Lamb, 2021; Stein et al., 2020).
Although this past work on polling place lines does not offer us much help in answering this article’s research question, which is largely methodological, it does underscore the potential impact of research on polling place issue report detection. As Highton (2006) notes, polling place wait times may influence voters’ decisions to vote and, as a result, affect the outcome of elections. There is also substantial evidence that line discrepancies are likely to be discriminatory, with minority and low-income voters facing longer wait times (Cottrell et al., 2021; Spencer & Markovits, 2010; Stein et al., 2020; Stewart, 2013). Research has also found that urban voters are more likely than rural voters to face issues at their polling places (Kimball, 2013; Stewart, 2013). Notably, long lines often result in “reneging,” where voters leave before voting, and reneging rates also appear to be disproportionately higher in minority precincts (Lamb, 2021; Spencer & Markovits, 2010). Long lines may also depress future turnout, once again disproportionately for minorities (Pettigrew, 2021).
Given the evidence that polling place issues, such as long lines, are both discriminatory and depressors of turnout, we argue that developing new methods for detecting polling place issue reports, such as the one presented in this article, is crucial. This is especially pertinent because simulations, surveys, and in-person observations all suffer from issues, including generalizability, respondent reliability, granularity, and a primary focus on lines. While one method is unlikely to satisfy all these concerns, introducing new methods may help expand our understanding of the causes and impacts of polling place issues.
In the course of this research, we were unable to find past academic work providing state-by-state counts for polling place issues apart from lines, such as problems with poll workers, incorrect information, and electioneering. Perhaps the largest source of this information comes from national election protection hotlines. The largest of these hotlines, “866-OUR-VOTE,” is led by the Lawyers’ Committee for Civil Rights Under Law, which makes public some of its data about the reports it receives (Grace & Conley, 2018). During the 2018 election season, 75,000 voters, 35,000 of them on election day itself, contacted the hotline for either information or complaints about voting (Grace & Conley, 2018).
In 2018, the Lawyers’ Committee reported the top barriers that hotline callers faced when trying to vote, including long lines, inadequate poll worker training, issues with registration, faulty machines, absentee ballots, and provisional ballots (Grace & Conley, 2018). The Lawyers’ Committee reports offer us essential information about the types of issues voters have at polling places and highlight the importance of further detection. Unfortunately, as not every voter knows or cares to call the hotline with their problems, the Lawyers’ Committee has incomplete information about where and how frequently polling place issues occur.
Big Data Approaches
Recently, there have been a few “Big Data” approaches for detecting issues at polling places. Chen et al. (2020) secured smartphone location data for hundreds of thousands of users in and around the 2016 general election to study the line length problem. Using these data, in combination with polling place addresses obtained from election officials and web scraping, Chen et al. (2020) find that the average voting wait time was nineteen minutes, with 18% of the sample waiting more than 30 minutes to vote and Black voters having longer wait times. Although not directly related to this article, Chen et al. (2020) show the potential of using “Big Data” to study polling place issues.
The other “Big Data” approach, from Mebane et al. (2017), used data from Twitter to detect more general “voting-related incidents.” Since this article also uses Twitter to answer a similar question, it is worth analyzing in some detail. The researchers gathered tweets from Sysomos MAP, a Twitter archiver, and the Twitter API on the 2016 primary and general election dates. Mebane et al. (2017) used over fifty key phrases, including “line to vote,” “vote,” “voter fraud,” and “voter id,” to gather 4,541,097 unique tweet texts, 505,112 of which had enough location information to geolocate to at least a state (p. 6). Human labelers then classified a sample of 19,789 of these tweets as “hit” or “not-hit.” The “hit” tweets were further labeled as one of twelve or fifteen categories, depending on the election, including “Line Length,” “Polling Place Event,” and “Voter Identification” (Mebane et al., 2017, p. 8).

Using an active-learning method, Mebane et al. (2017) trained multiple iterations of a support vector machine, adding to the training set before each iteration by labeling tweets about which the model was least certain. With this model, they achieve a “hit” precision of .81 and a “hit” recall of .72 for the general election, and a “hit” precision of .68 and a “hit” recall of .42 for the primary election. However, when model-classified “hits” were reviewed by human labelers, 41.5% were deemed to be “not-hits,” resulting in a “hit” precision of 58.5%.

Additionally, the classification rules for general election “hits” in Mebane et al. (2017) were much broader than those used to classify “polling place issue” reports in this article. In Mebane et al. (2017), general election tweets were classified as hits not only if the tweet was a report of an issue at a polling place observed by the user or someone the user knows, but also if the tweet provided election information and guidance for other voters. “Hits” thus also included “successes,” meaning that tweets where users said they successfully voted were also considered incidents (Mebane et al., 2017, p. 17). As a result, although Mebane et al. (2017) present us with a model for detecting their definition of “incidents,” it is unclear whether this same method would help exclusively detect reports of barriers to voting that occur at polling places, such as long lines, electioneering, and voter suppression.
Mebane et al. (2017) offer evidence that people regularly post information about voting to Twitter on election days and that an active-learning support vector machine model can be used to classify these tweets with some accuracy according to a broad definition of “incidents.” However, we still have an unanswered, key question: is it possible to create an accurate model that uses Twitter to detect reports of polling place issues specifically?
Large Language Models
One way to approach an answer to the previous question is to look at advances in data science and machine learning. Recently, new language models have made waves in the field of natural language processing. Many of these are “Transformer” models, built on the work of Vaswani et al. (2017), which first proposed the Transformer framework and demonstrated that, because of their superior analysis of context, Transformers could outperform existing models on language translation tasks. Of particular interest for our purposes here is the Bidirectional Encoder Representations from Transformers (BERT) language representation model, which was trained on the entirety of English Wikipedia and the BookCorpus of 800 million words (Devlin et al., 2019). BERT can be “fine-tuned” for use on various natural language processing tasks, classification being the most important for this article.
Devlin et al. (2019) show that BERT drastically outperforms previous models on many classification tasks. However, other researchers have argued that, because BERT was trained on books and Wikipedia, it may not be well suited for tasks such as analyzing social media. Nguyen et al. (2020) sought to overcome this by developing BERTweet, a large-scale language model that uses the same architecture as BERT but is instead trained on a corpus of 850 million English tweets. The primary model of BERTweet used in this article was trained on 845 million English tweets obtained between 2012 and 2019. After fine-tuning, Nguyen et al. (2020) demonstrated that BERTweet outperforms two pre-trained models, RoBERTa and XLM-R, on the two Twitter text classification tasks they attempted, namely sentiment analysis and irony detection (Rosenthal et al., 2017; Van Hee et al., 2018).
Since BERTweet has been pre-trained on hundreds of millions of tweets, it contains contextual information about how English is spoken on Twitter. On Twitter-based tasks, there is evidence that it outperforms existing models. Hettiarachchi and Ranasinghe (2020) used BERTweet to classify COVID-19 tweets as “informative” or “not informative,” finding that the base version of BERTweet, which does not contain any COVID-19 related tweets, outperformed traditional non-Transformer–based models. BERTweet has also been used for research in politics, with work from Ballard et al. (2022) fine-tuning the model to classify tweets from U.S. congresspeople as “uncivil” with an accuracy of 95%, outperforming traditional models for the task. Heseltine and Dorsey (2022) use BERTweet for a similar purpose, fine-tuning the model to classify the tweets of congressional candidates as “uncivil” with an accuracy of 90%, also outperforming baseline models. Based on these promising results, we argue that BERTweet provides an opportunity to test more modern machine learning methods on the task of detecting polling place issues (Nguyen et al., 2020; Vaswani et al., 2017).
Data
[Table: Keywords used for tweets.]
From the dataset of returned tweets, 10,241 tweets were randomly sampled without replacement for labeling to create the main dataset. Of these, 515 tweets were hand-labeled as “issues” and 9,726 as “non-issues” according to the definition of polling place issues discussed earlier. As noted earlier, the primary goal of this article is to develop a method for detecting polling place issue reports.
For the labeling process, tweets were only marked as issues if the tweet authors suggested they observed the issue themselves. Most importantly, tweets describing long lines were only marked as issues if they specifically mention a wait time of over 30 minutes or describe another issue at the polling place, such as a broken scanner or voting machine. Many “non-issue” tweets simply mention long lines or lines of a particular size (e.g., “around the block” or “forty people”). As mentioned earlier, although there is no legal standard for line times, USPCEA guidance recommends that waiting times should not exceed 30 minutes (USPCEA, 2014). Although this definition likely results in the undercounting of line-related issues, it does prevent the over-counting of potentially acceptable line lengths expressed in tweets such as “there was a long line at the polling place today.”
Sentiment was also ignored when labeling tweets. In some cases, tweets express joy at line length (e.g., “It’s so great that there was a two-hour line at my polling place!”), but these tweets were still marked as issues. In other cases, tweets express anger or frustration about non-issues, such as not being able to vote without a voter ID in states where voter ID is mandatory. These tweets were marked as non-issues. Additionally, some tweets describe issues that were already resolved (e.g., a machine had broken down and had just been repaired) and were marked as non-issues. Finally, a small number of tweets describe issues but were obviously from journalists or news outlets. In cases where the tweet describes a first-hand account of an issue, it was marked as an issue. In the cases where the tweet discusses or summarizes the content of an article, it was marked as a non-issue. It is also worth noting that some tweets in the dataset may themselves be considered voter suppression, such as tweets sharing misinformation about voter ID or the date of the election. Although worthy of further analysis, these tweets were not marked as issues.
After some additional duplicate removal, the tweets were then split into training, validation, and test sets of 6,365, 708, and 1,769 tweets, respectively. The full process for deduplicating the tweets is described in the preprocessing & splitting section on page three of the supplemental information. Figure 1 shows the number of tweets gathered from each election day. A breakdown of tweets gathered by state for each election is available in the summary table section on page two of the supplemental information.
[Figure 1: Number of tweets gathered by election.]
Methods
We develop four natural language processing (NLP) models for this article: three baselines and one primary model. The three baseline models are a support vector machine (SVM), a multinomial naïve Bayes (NB) classifier, and a random forest (RF) classifier. The primary model is a fine-tuned version of BERTweet. All model development was conducted in a Google Colab Pro environment. The train, validation, and test datasets are identical for all models to reduce the role of randomness in validation and testing. For the baseline models, the train and validation sets were combined for cross-validation. For the BERTweet model, the validation set was used for hyperparameter tuning, and the best-performing model was retrained on both the training and validation sets. In all cases, the test set was only processed and used after the best model from each category was selected, and test data were not involved in the model development or tuning phases.
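To make this protocol concrete, the following is a minimal sketch of the splitting procedure, under stated assumptions: the labeled tweets live in a hypothetical labeled_tweets.csv with “text” and “label” columns (1 = issue), and the random seed is arbitrary. The split sizes match those reported in the Data section.

```python
# Minimal sketch of the train/validation/test protocol (assumed file and
# column names; split sizes match those reported in the Data section).
import pandas as pd
from sklearn.model_selection import train_test_split

tweets = pd.read_csv("labeled_tweets.csv")  # hypothetical labeled dataset

# Hold out the test set first, then carve out a validation set, stratifying
# so the rare "issue" class appears in every split.
train_val, test = train_test_split(
    tweets, test_size=1769, stratify=tweets["label"], random_state=42
)
train, val = train_test_split(
    train_val, test_size=708, stratify=train_val["label"], random_state=42
)

# For the baseline models, train and validation are combined and
# hyperparameters are selected via cross-validation.
baseline_data = pd.concat([train, val])
```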
Baseline Models
[Table: Hyperparameters tested for baseline models.]
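As an illustration, a baseline might be built as in the sketch below for the SVM; the NB and RF classifiers follow the same pattern. The TF-IDF features, the specific grid values, and the choice of recall as the selection metric are illustrative assumptions rather than the exact configuration used; the grids actually searched appear in the table above.

```python
# Sketch of one baseline: TF-IDF features + SVM, with hyperparameters chosen
# by cross-validation on the combined train+validation data (baseline_data
# from the splitting sketch above).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svm_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC()),
])

param_grid = {  # illustrative values only
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
    "clf__kernel": ["linear", "rbf"],
}

search = GridSearchCV(svm_pipeline, param_grid, scoring="recall", cv=5)
search.fit(baseline_data["text"], baseline_data["label"])
print(search.best_params_)
```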
BERTweet Model
The primary model for this article is a fine-tuned version of BERTweet. To build it, we used the Hugging Face Transformers Python library, which allows users to select from dozens of pre-trained language models and modify them for specific cases (Transformers, 2021). As mentioned earlier, BERTweet is a Transformer-based model trained on 845 million tweets.
As with the baseline models, preprocessing was conducted on the data before fine-tuning. However, because BERTweet is a pre-trained model, this preprocessing must be identical to the preprocessing used during pre-training. Specifically, we used the BERTweet tokenizer, which, in addition to tokenizing the tweets, applies NLTK’s TweetTokenizer and converts usernames and links to special tokens. These tokenized tweets were then combined with their respective labels and turned into PyTorch-compatible datasets (Paszke et al., 2019).
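A minimal sketch of this step, continuing from the splitting sketch above (the train and val data frames are assumptions carried over from it): the normalization flag asks the BERTweet tokenizer to apply its tweet normalization, matching the preprocessing described above.

```python
# Tokenize tweets with BERTweet's own tokenizer and wrap them in a
# PyTorch-compatible dataset for fine-tuning.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

class TweetDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        # Pad/truncate so every tweet becomes a fixed-length token sequence.
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TweetDataset(train["text"].tolist(), train["label"].tolist())
val_dataset = TweetDataset(val["text"].tolist(), val["label"].tolist())
```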
[Table: Hyperparameters tested for BERTweet model.]
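Fine-tuning itself can be expressed with the Transformers Trainer API, as in the sketch below. The hyperparameter values shown are those of the best run reported in the Results section (five epochs, batch size 64, weight decay .3, learning rate 1e-4); the output directory is a hypothetical name, and in practice each combination from the table above would be trained and evaluated against the validation set.

```python
# Fine-tune BERTweet for binary classification (issue vs. non-issue).
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=2
)

args = TrainingArguments(
    output_dir="bertweet-polling-issues",  # hypothetical output directory
    num_train_epochs=5,
    per_device_train_batch_size=64,
    learning_rate=1e-4,
    weight_decay=0.3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()
```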
Metrics
The three primary metrics calculated for evaluating the performance of the models were accuracy, precision, and recall. We also report the F1 score, the harmonic mean of precision and recall. We argue that recall is the most important metric for analyzing the performance of models in this article. Since the data are strongly imbalanced—there are sixteen times more non-issues than issues in the labeled set—high accuracy does not necessarily tell us how successful the model was at its task. For example, imagine a test dataset with 9,000 non-issues and 1,000 issues; a model that simply classifies every tweet as a non-issue would achieve an accuracy of 90% but a recall of 0%. Although this is an exaggerated example, it shows how, for imbalanced datasets, accuracy can be deceiving.
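For reference, these metrics have the standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\mathrm{Precision} = \frac{TP}{TP + FP}, \quad
\mathrm{Recall} = \frac{TP}{TP + FN}, \quad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```

In the all-non-issue example above, TP = 0, so recall is 0% even though accuracy reaches 90%.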
There are two main use cases for an algorithm that detects polling place issue reports from tweets: voter protection efforts and research. In both cases, it is more critical to detect issues than it is to detect non-issues. If a model can offer more true positives at the expense of creating false positives (i.e., higher recall, lower precision/accuracy), a human labeler can comb through the dataset of classified positives to weed out the false positives. Essentially, the model’s goal is to serve as a highly efficient filter of non-issue tweets to help humans find actual issues.
For each model, we also calculated the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUROC). The ROC curve plots the false positive (FP) rate on the x-axis against the true positive (TP) rate on the y-axis at every possible classification threshold.
For the BERTweet model, we used this ROC curve to find an optimal threshold to potentially increase the model’s recall. To find the optimal threshold, we calculated Youden’s J statistic, a commonly used formula for finding the point on the ROC curve that best balances the true and false positive rates (Youden, 1950). The J statistic is simply J = TP rate − FP rate. The point on the ROC curve that corresponds to the largest J statistic is the optimal threshold. Since, by default, BERTweet does not return classification probabilities, we added a softmax layer, which converts the original logit outputs to probabilities, to the end of the fine-tuned model. It is worth noting that this was done using the validation set to prevent biasing the results, so the threshold found is almost certainly different from the optimal threshold of the test set.
For the BERTweet model, we also calculated the precision-recall curve, which simply plots recall on the x-axis against precision on the y-axis at every possible threshold. Because it ignores true negatives, the precision-recall curve is often more informative than the ROC curve for heavily imbalanced datasets such as ours.
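The threshold-selection procedure can be sketched as follows, continuing from the fine-tuning sketch above (variable names are illustrative): validation-set logits are converted to probabilities with a softmax, the ROC curve is traced, and the threshold maximizing Youden’s J is selected.

```python
# Select a decision threshold on the validation set via Youden's J statistic.
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

val_labels = np.array(val["label"])

# Convert the model's logits to probabilities with a numerically stable softmax.
logits = trainer.predict(val_dataset).predictions             # shape (n, 2)
exps = np.exp(logits - logits.max(axis=1, keepdims=True))
issue_probs = (exps / exps.sum(axis=1, keepdims=True))[:, 1]  # P(issue)

# Youden's J = TPR - FPR; the threshold maximizing J is "optimal".
fpr, tpr, thresholds = roc_curve(val_labels, issue_probs)
optimal_threshold = thresholds[np.argmax(tpr - fpr)]

# Precision-recall curve and its AUC for the same probabilities.
precision, recall, _ = precision_recall_curve(val_labels, issue_probs)
print(optimal_threshold, auc(fpr, tpr), auc(recall, precision))

# Classify with the chosen threshold instead of the default .5.
preds = (issue_probs >= optimal_threshold).astype(int)
```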
Finally, we also ran the BERTweet model with the original .5 threshold on the 10,081 tweets gathered but not human labeled. This was done to get estimates for each state’s total number of issues detected during each election.
Results
[Table 4: Performance of tuned models on the test set.]
BERTweet Model
The best performing BERTweet model was trained for five epochs with a training batch size of sixty-four, a weight decay of .3, and a learning rate of 1e-4. After creating an ROC curve for the validation dataset, .003 was chosen as the optimal threshold. Figure 2 shows the ROC curve, with an AUROC of .98, and the optimal threshold. Additionally, the precision-recall curve, which is also not impacted by threshold choice, has an AUC of .77 and is available in Figure 3.
[Figure 2: BERTweet ROC curve.]
[Figure 3: BERTweet precision-recall curve.]

The BERTweet model with a .5 threshold has an accuracy score of .969, a precision score of .772, a recall score of .722, and an F1 score of .746. The confusion matrix for this model is available in Figure 4. The BERTweet model with a threshold of .003 has an accuracy score of .905, a precision score of .393, a recall score of .935, and an F1 score of .553. The confusion matrix for this model is available in Figure 5.
[Figure 4: BERTweet .5 confusion matrix.]
[Figure 5: BERTweet .003 confusion matrix.]

Unsupervised Results
[Table: Incidents by year for labeled and predicted datasets.]
Conclusion
This article was motivated by the example of the 2018 gubernatorial election in the state of Georgia and the virality of tweets that shed light on issues occurring at polling places. We find that a fine-tuned BERTweet model provides high performance on the task of polling place issue report detection, presenting one version of the model with an accuracy of 96.9% and a recall of 72.2%, and another version with an accuracy of 90.5% and a recall of 93.5%. The fine-tuned BERTweet model outperforms baseline models and offers promising performance with a limited training set and only one primary labeler.
Table 4 shows that our BERTweet model offers large improvements over the baseline models in all categories except accuracy and precision for the .003 threshold. The baseline models have high accuracy scores of over .93, but this is largely a consequence of the dataset being greatly imbalanced towards the negative class, as demonstrated by their poor recall scores. A model that simply predicts every tweet as a non-issue would receive an accuracy score of approximately .94.
The BERTweet model at both thresholds achieves excellent results for issue report detection. Additionally, the BERTweet model with the .003 threshold sacrifices only a few points of accuracy to drastically increase the recall score. In most cases, it would make sense to use the lower threshold and then manually review predicted issue reports. While labeling the training set, we averaged about 200 tweets an hour; at that rate, using the lower threshold would add only a few hours of work after running the model to detect almost all the polling place issues fed into it.
We also examined the specific results of the BERTweet .5 threshold model’s performance on the test set. Although, due to the un-interpretability of deep learning models such as BERTweet, we cannot make generalized claims about why the model made the labeling decisions it did, there are some patterns worth discussing. Among the 78 true positives, 67 issue reports were about line- or voting machine-related issues; the model successfully marked many tweets with varying estimates of line length over 30 minutes, ranging from 36 minutes to 4 hours, as well as tweets with positive sentiment about line-related issues. The remaining 11 reports included voters being asked to file provisional ballots, voters being barred from voting, and electioneering.

Of the 23 false positives, 12 were people discussing stories, rumors, or news reports about other people having polling place issues, 6 were referencing a conspiracy theory article about votes being changed, 3 were about issues that the voter caused, such as going to the wrong polling place, and 2 were about lines under 30 minutes. Finally, of the 30 false negatives, 22 were about line length or voting machine issues, generally in longer, less clear posts than the true positives, 5 were about electioneering without mentioning the word, 2 were about voters reneging because of polling site malfunctions, and 1 was about poll workers not knowing about election day registration.

Overall, it appears that most issue reports are about line times over 30 minutes and issues with polling machines, and the model is generally capable of detecting these. The false positives also demonstrate that the model is not perfectly capable of detecting whether an issue report was observed by the author themselves, resulting in positive labels for news articles, conspiracy theories, and hearsay, all of which may distort geolocation statistics. Finally, the false negatives do not represent a clear difference from the true positives and may simply require more training data. In general, a substantial portion of these issues can be mitigated by using the .003 threshold model alongside human oversight of predicted positives.
Overall, although there is no previous research to directly compare the model to, our model’s high accuracy, precision, and recall are encouraging signs for both current applications and future progress. However, there are some limitations. In particular, the COVID-19 pandemic drastically impacted the 2020 Presidential election, which saw a considerable shift towards mail-in and early voting. Estimates show that 46% of voters voted by mail in 2020, compared to 21% in 2016 (Rakich, 2021). At the same time, polling shows that nearly 20% of in-person voters stated they had to wait more than 30 minutes in line to vote in 2020 (Pew Research Center, 2020). Without further work, we do not know whether the model presented in this article would detect polling place issue reports during the 2020 election.
Additionally, as mentioned earlier, the tweets for this article were restricted to ones that contained enough location information to be geolocated to a specific state. Past work suggests that there are statistically significant differences in the demographics of people who enable geolocation and those who do not, making our sample less representative of Twitter as a whole; however, these differences for gender and age distributions are minor (Sloan & Morgan, 2015). Also, the keyword “vote” was left out of our searches to the Twitter API to avoid further imbalancing the dataset with thousands of non-issues. This necessarily means that the model did not detect some issue reports because they were never in the dataset. It is possible that the model would detect those reports if presented with the data, but this article did not test that. Apart from “vote,” there may be dozens more keywords that describe polling place issues. These issues, as well as our strict coding practices, are likely to have resulted in an undercount of polling place issue reports. It is worth re-emphasizing that the model is only capable of detecting issue reports, not verifying that the reported issues actually occurred.
The BERTweet model offered in this article is encouraging for future research on polling place issue report detection. However, it is worth noting that the Twitter Academic API used for this analysis is now defunct, and future research using data directly from Twitter would require paying for enterprise API access, likely to be cost-prohibitive for most researchers. Alternative forms of data gathering may include manually collecting data, which is likely to be time-intensive, or the use of data scrapers, which Twitter’s terms of service now prohibit (James, 2023). Future research may also consider using other, similar platforms that have gained popularity since this analysis was conducted, including Meta’s Threads, if Meta develops an API for research, or Mastodon, a federated social network with open-access APIs (API Guidelines and best practices, 2024). Since these platforms operate on the same principle of mostly text-based posts, this article’s approach could be easily adapted to them. Assuming appropriate data access, we discuss below a few pathways for further research.
One path for future research is to test and improve the model presented in this article. As mentioned earlier, the training set for the model consisted of only 6,365 tweets, and only one researcher labeled most tweets. An easy way to improve model performance would be to increase the size of the training set and have labeling done by more people to further prevent errors. Researchers can also add different keywords to the training set to improve performance and catch more issues. Removing the requirement for geolocated tweets would also drastically increase the data available.
There are also ways to parse tweets for location information available in the text itself. For example, Twitter runs its own algorithms to look for “entities” inside the text of tweets, including references to location, which are available via the API. Additionally, work in other fields such as traffic detection has used data science methods to geocode tweets based on references to street names and landmarks inside the text of tweets (Gu et al., 2016). Finally, there is also an opportunity to fine-tune other large language models, such as the original BERT, RoBERTa, and GPT-3, which, while not explicitly trained on tweets, have much larger training sets and may offer competitive performance (Liu et al., 2019; Sanh et al., 2019).
Other future work may focus on applying this model to questions in political science. For example, do some states regularly have more reported issues than others? Do issue reports on Twitter correlate with surveys or polls of issues done during and after elections? Do issue reports correlate with demographics such as race and socioeconomic status? Since previous work on line length has found disparities in wait times in minority communities, researchers can test if those disparities exist for all types of polling place issue reports (Highton, 2006; Spencer & Markovits, 2010).
The results of this article’s BERTweet model also serve as a proof-of-concept for how machine learning methods can be used to tackle previously unanswered questions in political science. While issues at U.S. polling places have surely been occurring since the beginning of American democracy, the lack of work counting and publishing these issues shows the difficulty of the task. Just as Chen et al. (2020) show that using location data from polling sites can offer estimates of line length that are more generalizable than sending human observers to a few polling sites, this article shows the potential for using social media platforms for understanding where all types of polling place issues occur. In the past, research on polling place issues would have to rely on self-report surveys, a human observer, or data about issues reported to voter protection hotlines. Now, this article’s model can quickly and accurately parse through thousands of tweets to estimate where and how frequently polling place issues occur. Even outside the subject of polling place issues, natural language processing algorithms on social media platforms may offer political scientists an opportunity to study questions that were previously unanswerable due to difficult or impossible data gathering tasks.
Apart from further research, the BERTweet classification approach shown in this article shows great promise as a method to help with existing voter protection efforts. In recent years, groups such as “866-OUR-VOTE” have had teams of volunteer monitors “that tracked voting problems reported on various platforms and flagged misleading or intimidating content” (Grace & Conley, 2018, p. 3). With cooperation from Twitter, only slight modifications to the work presented in this article would be necessary to use this model for voter protection efforts.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Prathm Juneja’s DPhil at the University of Oxford is funded by the Rhodes Scholarship, which had no role in this research.
Supplemental Material
Supplemental material for this article is available online.
