Abstract

Introduction
Twitter users vary widely in their level of activity. Active Twitter users not only tweet more often than others but also tend to mention other users with higher frequency, and the set of time-stamped global positioning system (GPS) locations in their tweets is more complete. The time, location, and connection data, along with the text content in the active users' tweets, render a more complete picture of their social connections, preferences, attitudes, interests, and spatial mobility.
However, the activity levels of users are heavily skewed; most Twitter content is produced by a small fraction of highly active users (measured by the number of followers, tweets, retweets, and mentions), while the vast majority of registered users are passive observers. For example, there were only 302 million monthly active users in the first quarter of 2015 among approximately one billion registered users (Twitter, 2015). According to Twopcharts, a company monitoring Twitter activity, 43% of the 550 million Twitter accounts that had at least one tweet did not create a single tweet in the previous year (Murphy, 2014). In terms of content, 50% of the URLs consumed on Twitter were generated by 0.05% of all users in 2011 (Wu et al., 2011).
This skewed distribution of overall activity poses challenges for inferring unobserved user characteristics (e.g. the representative geographic location of each user). In particular, the methods devised to infer user characteristics rely on and leverage central tendencies in the data, treating highly active users as outliers or aberrations. In effect, the users who emit the most information and who are expected, in principle, to be summarized and categorized most accurately, are paradoxically the ones whose characteristics tend to be discounted or misclassified.
In this study, we offer concrete examples to illustrate this paradox and discuss how methods that do not properly deal with the active users can increase classification error and ultimately distort our understanding of online social relationships at the micro level and structural properties of the communication network at the macro level.
Geo-location inference
Inferring the representative geographic location of social media users is an actively developing area of research with wide applicability in both applied and basic research endeavors that use social media data. Social media-based studies of early detection and prediction, such as seasonal flu surveillance or the prediction of commercial movie success, require knowledge about users' locations. Research in our lab also depends on inference of user location, as in the study of diurnal rhythms of affect using Twitter data (Golder and Macy, 2011) or our ongoing cross-national comparative analysis of communication networks.
The state-of-the-art location inference method based on label propagation has been shown to perform with high coverage (90% of users geotagged) and accuracy (median error of 6.38 km) (Compton et al., 2014). This method relies on the fact that the majority of communication partners, or network neighbors, in the Twitter @user mention network are geographically proximate. If a user's Twitter neighbors turn on their GPS while tweeting, it is possible to estimate the focal user's latitude and longitude based on the distribution of those neighbors' locations. This estimated location could then be used to further estimate the unknown locations of the focal user's other network neighbors who did not enable GPS tracking in their tweets, and so on. In technical terms, the label-propagation algorithm attempts to infer a candidate location of a given user based on one of her neighbors' locations that minimizes the sum of distances to the rest of her neighbors' locations (i.e. L1-median location). To ensure robust results, the algorithm incorporates a dispersion threshold (e.g. 100 km) where the algorithm accepts the candidate L1-median location as its final estimate only if the median distance from that candidate location to all the other known and inferred locations of the network neighbors does not exceed the dispersion threshold. If the candidate location satisfies this condition, the algorithm assigns that location to the focal user as its best estimate and propagates it to estimate other network neighbors’ locations. 
For example, if a user's candidate location inferred by the algorithm happens to be in New York, but her network neighbors are geographically concentrated in both New York and San Francisco, such that the median of the distances from the candidate location in New York to all of her neighbors' locations exceeds 100 km, then the algorithm does not assign that candidate location in New York as the estimated location but instead classifies the user's location as unidentifiable.
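A single inference step of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Compton et al. (2014): the function names, the use of the haversine great-circle distance, and the handling of users with a single located neighbor are our own choices, and the full method repeats this step over many rounds of propagation.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def estimate_location(neighbor_locations, dispersion_threshold_km=100.0):
    """Return the L1-median neighbor location, or None if it is too dispersed.

    neighbor_locations: list of (lat, lon) tuples, known or previously inferred.
    """
    if not neighbor_locations:
        return None
    # Candidate: the neighbor location with the smallest total distance
    # to all neighbor locations (the L1-median restricted to observed points).
    best = min(
        range(len(neighbor_locations)),
        key=lambda i: sum(haversine_km(neighbor_locations[i], other)
                          for other in neighbor_locations),
    )
    candidate = neighbor_locations[best]
    others = [haversine_km(candidate, loc)
              for j, loc in enumerate(neighbor_locations) if j != best]
    # Assumption: with a single located neighbor there is no dispersion to test.
    dispersion = median(others) if others else 0.0
    return candidate if dispersion <= dispersion_threshold_km else None
```

With four neighbors in New York and one in San Francisco, the function returns one of the New York points. With three neighbors in each city, the median distance from any candidate is roughly the distance between the two cities, so the function returns None and the user is left unlabeled.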
Despite the impressive performance in both coverage and accuracy of this new method, we find that prediction error is a U-shaped function of both user activity level, measured as tweets per day (Compton et al., 2014), and network degree. Furthermore, the candidate locations inferred for the highly active users tend to exceed the dispersion threshold, diminishing the proportion of inferable cases (i.e. classification coverage). The paradox of lower coverage and lower predictive accuracy for active high-degree users, who provide disproportionately large amounts of information about themselves, stems from the assumption that there is a single geographical location that best characterizes a user's location. Although this may be true in principle, the network neighbors who are the source of inferring this single location are constantly moving over time, and the time at which each of those neighbors coincided with the focal user in space may be too long ago to be relevant. The example of the user whose network neighbors are concentrated in both New York and San Francisco illustrates this case. This user may work in New York but reside in San Francisco and form geographically segregated personal and professional networks. Alternatively, this user may have grown up in San Francisco but moved to New York for work. If the focal user is active on Twitter, her work/social or past/present relationships will each be given equal weight in the label-propagation algorithm. For an active user who exhibits greater geographic diversity in her network, the candidate location is less likely to satisfy the dispersion threshold. The algorithm might be improved by either assigning multiple plausible locations (i.e. assuming multiple representative locations) or by adding more constraints to what constitutes a representative location (e.g. initializing the label-propagation algorithm to start from GPS locations in tweets created exclusively at night).
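The second suggestion, restricting the initial GPS seeds to night-time tweets, could be sketched as below. The night window (22:00 to 06:00 local time) and the availability of a per-user UTC offset are hypothetical assumptions made for illustration; the text does not specify either.

```python
from datetime import datetime, timedelta


def is_night(local_time, start_hour=22, end_hour=6):
    """True if local_time falls in the (assumed) night window that wraps midnight."""
    return local_time.hour >= start_hour or local_time.hour < end_hour


def night_seed_locations(geotagged_tweets, utc_offset_hours):
    """Keep only the GPS points of tweets posted at night in local time.

    geotagged_tweets: list of (utc_datetime, (lat, lon)) pairs.
    utc_offset_hours: assumed known offset of the user's local time from UTC.
    """
    offset = timedelta(hours=utc_offset_hours)
    return [loc for utc_time, loc in geotagged_tweets
            if is_night(utc_time + offset)]
```

Only the retained points would then serve as known locations in the first round of propagation, on the premise that night-time positions are more likely to reflect a place of residence.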
Individual vs. group account classification
A broad range of prediction applications (Broniatowski et al., 2013), behavioral modeling and social network analyses using social media data of hundreds of millions of users build models with implicit assumptions about the users. An important source of heterogeneity of the users that is often neglected is whether a user is an individual or a group account (e.g. a company's official Twitter account). Often, multiple individuals such as a PR team manage a single group account, leaving quite different behavioral traces from individual accounts managed by single owners. For example, group accounts tend to possess more followers on Twitter and the followers are arguably less related to the group accounts. The communication ties between group accounts and their followers are also not likely to be “social ties” in the conventional sense. Furthermore, the objectives, language, and topical interests of group accounts differ from those of individuals.
Researchers who work primarily with traditional survey data with accurate and well-defined sampling frames do not have to deal with the distinction between groups vs. individuals. However, computational social scientists and data scientists who use social media data inevitably face this problem. Failure to correctly classify and filter out group accounts (or individual accounts, depending on the research objective) could lead to misleading characterizations and conclusions. Imagine a network analysis that does not properly filter out group accounts that tend to have higher connectivity than individuals. Virtually all network metrics, from clustering and degree distribution to mean geodesic, will be affected by the presence of group accounts in the data.
The method we developed to distinguish group and individual accounts is based on the cognitive constraints of individuals in forming and maintaining communication ties, which arguably applies to a lesser extent to group accounts managed by multiple individuals (Park et al., 2015). These constraints are captured, for example, in the ratio of in-degree to out-degree of each node's immediate neighbors in the communication network as well as in the level of concentration of communication volume across one's network neighbors (Saramäki et al., 2014).
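The two kinds of features mentioned above can be illustrated with a short sketch. The data layout, the function names, and the use of a Herfindahl-style index as the concentration measure are assumptions made for this example; the exact measures used in Park et al. (2015) may differ.

```python
from collections import Counter


def degree_counts(mentions):
    """Count distinct outgoing and incoming mention partners per user.

    mentions: dict mapping (sender, receiver) -> number of mentions.
    """
    in_deg, out_deg = Counter(), Counter()
    for sender, receiver in mentions:
        out_deg[sender] += 1
        in_deg[receiver] += 1
    return in_deg, out_deg


def neighbor_degree_ratios(user, mentions):
    """In-degree / out-degree for each immediate neighbor of `user`.

    Neighbors with zero out-degree are skipped because the ratio is undefined.
    """
    in_deg, out_deg = degree_counts(mentions)
    neighbors = ({r for s, r in mentions if s == user}
                 | {s for s, r in mentions if r == user})
    return {n: in_deg[n] / out_deg[n] for n in neighbors if out_deg[n] > 0}


def volume_concentration(user, mentions):
    """Sum of squared shares of the user's outgoing mentions across partners.

    Close to 1 when most mentions go to one partner, 1/k when spread
    evenly over k partners. Returns None if the user sent no mentions.
    """
    volumes = [count for (s, _), count in mentions.items() if s == user]
    total = sum(volumes)
    if total == 0:
        return None
    return sum((v / total) ** 2 for v in volumes)
```

An account that sends eight of its ten mentions to one partner scores 0.66 on this index, whereas an account that spreads eight mentions evenly over four partners scores 0.25. The latter, flatter profile is what one would expect of an account that is not bound by one person's attention.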
This method shares with the methods for geo-location estimation the problem that highly active individuals may be misclassified due to their behavioral and structural similarities to groups. If central individuals (e.g. opinion leaders) are mistakenly filtered out, the network would appear to be less clustered and more fragmented, and to have a longer mean geodesic.
A potential solution is to leverage the temporal constraints on tweeting which individuals, but not organizations, tend to exhibit (Tavares and Faisal, 2013). Each user's inter-tweet delays and the temporal distribution of tweets throughout the day can be used to enhance the overall discriminatory power between organizations and active individuals whose networks look similar.
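Both temporal features are straightforward to compute from tweet timestamps. The sketch below is illustrative only: the function names are ours, and it assumes the timestamps have already been converted to the user's local time.

```python
from collections import Counter
from datetime import datetime


def inter_tweet_delays(timestamps):
    """Seconds elapsed between consecutive tweets, in chronological order."""
    ordered = sorted(timestamps)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]


def hourly_distribution(timestamps):
    """Share of tweets posted in each hour of the day (list of 24 floats)."""
    if not timestamps:
        return [0.0] * 24
    counts = Counter(t.hour for t in timestamps)
    return [counts.get(hour, 0) / len(timestamps) for hour in range(24)]
```

An individual would be expected to show long overnight gaps in the delays and near-zero shares in some hours, while an account run by a team or a scheduling tool may post around the clock. These features could be appended to the network-based features before classification.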
Social vs. coworker vs. acquaintance tie classification
In the absence of respondent surveys or ethnographic observation, the nature of a communication tie (e.g. professional vs. friendship vs. acquaintance) could be inferred from time-location records of mobile phone logs (Eagle et al., 2009; Toole et al., 2015). The intuition behind this method is that coworkers or professionals tend to be co-located during work hours (e.g. in the same office building on a weekday afternoon) while friends are more likely to be co-located during off-work hours (e.g. in a bar on a Friday evening). Acquaintances are likely to have few co-locations regardless of time. This approach, which leverages time-location similarity (using cosine similarity between hourly location occurrence vectors of two individuals), yields accurate and convincing results with mobile phone data that contain detailed time-location records at regular time intervals for each individual (i.e. whenever a mobile device communicates with a cell-phone tower).
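The time-location similarity can be sketched as follows. The record format, the definition of work hours (Monday to Friday, 09:00 to 17:00), and the function names are assumptions made for illustration and are not taken from Eagle et al. (2009) or Toole et al. (2015).

```python
import math
from collections import Counter


def is_work_hour(weekday, hour):
    """Assumed work window: Monday (0) to Friday (4), 09:00 to 16:59."""
    return weekday < 5 and 9 <= hour < 17


def occurrence_vector(records, work):
    """Count occurrences per (weekday, hour, location) in work or off-work hours.

    records: list of (weekday, hour, location_id) tuples for one person.
    """
    return Counter((wd, h, loc) for wd, h, loc in records
                   if is_work_hour(wd, h) == work)


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def colocation_profile(records_a, records_b):
    """Work-hour and off-hour time-location similarity for a pair of people."""
    return {
        "work": cosine(occurrence_vector(records_a, True),
                       occurrence_vector(records_b, True)),
        "off": cosine(occurrence_vector(records_a, False),
                      occurrence_vector(records_b, False)),
    }
```

Under the intuition described above, a pair with high work-hour similarity and low off-hour similarity would be read as coworkers, the reverse pattern as friends, and low similarity on both as acquaintances. Any cut-off values for "high" and "low" would need to be chosen empirically.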
Nevertheless, blindly applying this method to Twitter users with GPS tweet data could potentially lead to biased results. Again, the active users will appear to have high mobility with hundreds of GPS locations captured in the data whereas the low activity users may appear relatively immobile. Therefore, a tie involving an active user is more likely to be classified as either professional or friend than acquaintance.
Conclusion
Individuals traverse multiple locations in both physical and network space. Twitter captures information about these interactions and movements from which researchers can infer attributes of individuals and their social relationships. In this essay, we considered three examples that are relevant to a broad range of academic and practical applications. These examples highlight the paradox of highly active users—those who generate most Twitter content. Because they generate more data points with which to measure their behavior, highly active users are less vulnerable to random measurement error, yet they are more vulnerable to systematic mis-classification when researchers make naive assumptions about the distribution of user activity. The paradox of highly active users can be addressed by developing methods that handle the complexities and multidimensionality of social life represented in the data. The need to do so will intensify as an increasing proportion of the population establishes their online and social media presence with more complete pictures of their lives painted in digital form.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors would like to acknowledge the receipt of research funding from MINERVA Initiative, Department of Defense, National Science Foundation (SES-1357488, SES-1434164 and SES-1226483) and National Research Foundation of Korea (NRF-2013S1A3A2055285).
