“Digital footprints” is an attractive, useful, and increasingly popular metaphor for thinking about Big Data. In this essay, I elaborate on this metaphor to highlight three relatively basic fallacies in the way we tend to think about Big Data: first, that they contain information on complete populations, or “N = all”; second, that they contain recordings of naturalistic behavior; and third, that they can be understood devoid of context.
Let me begin with a confession: When it comes to studying Big Data, I’ve always felt like a black sheep. My programming skills are basic at best. Statistics has never come easily to me (which is perhaps why I’m somewhat decent at teaching it). To be honest, I don’t even really like the Internet. And yet, I have now spent almost a decade studying relatively large quantitative datasets gathered from a variety of online sources—from Facebook (Lewis et al., 2008) to OkCupid (Lewis, 2013) to the Save Darfur online movement (Lewis et al., 2014).
For these reasons, I consider myself both an outsider and an insider to the study of Big Data: It is an analytic space that still feels unfamiliar to me and yet in many ways it is the only space with which I am truly familiar. In this essay, I draw on this vantage point to comment on the three most important lessons that working with Big Data has taught me. These lessons all have to do with ontological fallacies of Big Data vis-à-vis more traditional forms of social science data. In other words, and in terms of the aims of this special issue, they are three ways that Big Data systematically distort our impression of the social.
I should clarify what exactly I mean by “Big Data.” In my experience, this term is often used to refer to one of three concepts: 1) data that are unusually large; 2) data that require sophisticated computational methods; or 3) data that were collected digitally. Datasets that are collected digitally (definition #3) can easily become quite large (definition #1) and large datasets tend to require computational analysis (definition #2)—so the three are often conflated. In this essay, I focus on digital data, whether or not they also happen to be large and regardless of what techniques are used to analyze them. Consequently, Golder and Macy’s metaphor of “digital footprints” (2014) provides a helpful framework for this discussion.
Footprints are not representative
The first fallacy of digital footprints is also the oldest; it is the property of Big Data that we as researchers (and especially as sociologists) have the least excuse for forgetting and yet it is the property we most commonly forget. This property is the “digital divide.” Traditionally, research on the digital divide has focused on socioeconomic inequality in Internet access, and therefore on the distribution of the benefits (or costs) that Internet access affords (DiMaggio et al., 2001). Here, I am broadly referring to the fact that while Big Data are often celebrated for providing access to “complete” populations of participants (or “N = all”), certain kinds of people are more likely than others to turn up in certain digital datasets—potentially biasing our conclusions. Even if we have access to the full universe of tweets, not everyone (thankfully) is on Twitter. Even if we have documented every phone call, every message, or every friendship, such data are usually comprehensive only for a certain website or service provider within a finite window of time.
Results may be biased by differences in activity as much as inclusion. Especially because digital datasets commonly contain events (e.g. transactions or communications) rather than individuals as the primary unit of analysis, some people may “turn up” in our data more or less frequently than others—and it is often unclear what these differences capture and whether it is a distinction we care about. For instance, people who make few cell phone calls may generally rely on other means of communication; people with few Facebook friends may have especially high standards for online “friendship”; and people who send unusually many dating site messages may simply be “spammer” users whose data we should ignore. Moreover, for a variety of network analytic techniques (e.g. blockmodeling; White et al., 1976), the “zeroes” in a relational matrix are just as important as (if not more important than) the “ones.” Not only do many analyses of Big Data elide the meaning of action vs. inaction, but they also disproportionately emphasize the former.
In sum, an alluring quality of many digital datasets is that they contain all “footprints” from everyone on a given “beach.” First, however, different beaches are frequented by different kinds of people, and many of these differences are consequential but challenging to identify. (Some people may also prefer to stay at home—i.e. those people who avoid digital technology altogether.) Second, some beaches are relatively egalitarian, with footprints evenly distributed among visitors. On other beaches, the vast majority of people spend their time sitting under umbrellas while a handful of others run and play—and so the large array of footprints might present a deceptively vibrant portrait of activity (cf. the “1% rule” of the Internet). Certainly this does not mean such footprints are useless. It just means we need to be clear about whom we are actually studying and why findings from such a population are still theoretically valuable—even if they are not statistically generalizable.
Nobody is barefoot
The second fallacy has to do with mediation. In many ways, this aspect of Big Data is unremarkable: All communication is mediated to some degree. Survey researchers have certainly recognized this since the dawn of their art (e.g. by documenting the advantages and limitations of forcing respondents into closed responses); strengths and weaknesses of interview data have also recently been highlighted (Jerolmack and Khan, 2014; Vaisey, 2009). And yet the very notion of “footprints” suggests a naturalism to these data—that they are the direct recordings of unobstructed, in situ behavior that might best be described as “digital ethnography” (see Murthy, 2008). While it is true—and undeniably powerful—that Big Data frequently document actual behaviors (frequently also in real time) rather than reports about these behaviors, it is important to remember their drawbacks as well.
First, digital data are often opaque compared to data collected by traditional means. For instance, in a traditional network survey, respondents might be asked to name the people with whom they have had romantic or sexual interaction (e.g. Bearman et al., 2004). On an online dating site, you can instead observe the actual communication between two people. But how does this communication arise? I might message someone because she was the first person who appeared in my search results; because the website suggested we were compatible based on our responses to personality questionnaires; or because I saw she had previously visited my profile, leading me to infer she was interested. In theory, it might be possible to obtain the entire history of “clicks” that led one person to another. In practice, however, most Big Data are surprisingly thin—and the wall between subjects and analysts precludes follow-up questions that would be possible in offline settings, obscuring differences in meaning.
A second, related issue is the reductionism of many digital interfaces—whether this has to do with text communications limited to 140 characters, best “friendships” that are indistinguishable from acquaintances, or evaluations that are reduced to likes and dislikes, “swipe left” or “swipe right.” In general—and notwithstanding occasional, ingenious online experiments (e.g. Salganik et al., 2006)—these concerns stem from the fact that unlike traditional datasets, Big Datasets are typically collected by people other than researchers with incentives other than the advancement of science. As such, not only are participants’ natural behaviors constrained—they are constrained in ways that are as commonly distortive as illuminating and frequently altogether hidden from the researcher.
In sum, as researchers studying digital footprints, we must recognize that we are not examining natural behaviors—that the people on the beach are not barefoot. Rather, interactions and expressions are constrained by technology in both complex and deceptively simple ways. Very small feet can produce flipper-sized footprints; different people will leave indistinguishable tracks if they are wearing the same sandals; and in general, the relationship between meaning and digital recording can be especially challenging to decipher (cf. Shaw, 2015).
Different beaches have different rules
The third fallacy I want to emphasize draws attention not to representation or mediation—but to the contexts in which digital behaviors are enacted and recorded. Too often we forget that digital contexts—like all social contexts—carry norms that powerfully shape human behavior. This belies the notion that actions and interactions represented in Big Data can be treated at face value. Two illustrations (again, from the kinds of data with which I am most familiar) may be helpful. First, if we are to use social media (e.g. Facebook) to understand processes of network evolution (e.g. Lewis et al., 2012), it is important to remember that friendship dynamics on these websites often operate “opposite” to “real life”: Actual friendships tend to dissolve over time (unless they are renewed by some form of interaction), while “friendships” on Facebook persist until or unless they are actively terminated. Consequently, because termination requires action instead of inaction, dissolving a tie carries a negative social connotation—resulting in a norm that friendships on Facebook are rarely dissolved. Second, several studies of online dating have compared patterns of initiation with patterns of replies (Lewis, 2013; Lin and Lundquist, 2013; Skopek et al., 2011). But sans message contents, how do we distinguish an interested response from a “polite rejection”? The (reassuring) answer, in this case, is that polite rejections are unusual; a norm has developed on most sites that “no response” means “no interest,” such that the vast majority of electronic overtures are unreciprocated (e.g. Lin and Lundquist, 2013).
Both examples above have to do with norms governing interaction on a particular type of website. But the situation with Big Data is more complex still. Digital norms often vary among subpopulations delineated by region, socio-demographic background, or institutional affiliation (and yet in the world of Big Data, the scale is often too massive to attend to such subtleties); these cultural expectations overlap with a nested array of meanings derived from the medium (e.g. Facebook messaging vs. text messaging), role categories (e.g. emailing a parent vs. emailing a professor), and the distinct “relationship culture” between two people (Fuhse, 2009). Finally, interpretation is further complicated by the fact that norms governing digital interaction may be particularly ambiguous and/or ephemeral compared to many norms in offline life; thus, the singular flavor of humor behind When Parents Text (Kaelin and Fraioli, 2011).
In sum, patterns of footprints are influenced by different sets of rules on different beaches (“don’t play too close to the water”; “all children must be accompanied by an adult”); and just as digital footprints cannot be interpreted independently of the technologies through which they are imprinted, neither can they be interpreted without at least some basic understanding of the cultural contexts in which they occur. Certainly it is the case that on the Internet, as on the beach, not all people choose to follow these rules—but whether through compliance or circumvention, our behavior cannot be understood without them.
Conclusions
These fallacies may seem obvious. It is with some chagrin, therefore, that I admit they were not always obvious to me; and in many contemporary analyses of digital footprints, they are still regularly ignored. Of course, ignoring these fallacies will be more or less detrimental depending on the nature of the research question. The most useful contributions will probably be studies that directly problematize one or another of these fallacies—e.g. by asking exactly what type of person has an account on Twitter or exactly what norms govern communication in online dating—while the most important contributions will visit such questions as but one necessary stop en route to higher theoretical aims.
For me, then—a self-professed statistically challenged, Python-ignorant digitalphobe—here lies the attraction of digital footprints—why going to the beach is not just about avoiding stepping on sharp rocks, but also about the joy of relaxing in the sand and playing in the waves. Research in this growing and important field of scholarship will be most intellectually valuable as well as personally fulfilling when we view Big Data less as a solution than as a challenge: a challenge to learn new skills; to think carefully about methodological puzzles; and to return to classic sociological questions—in settings that, like them or not, are increasingly central to the conduct of everyday life.