Abstract
“Would you like another EXTRA BIG ASS FRIES?” (Carl's Jr computer,
Big Data, more than anything else, is…more. Lots of muchness or, as we might say in sociology, “large
Sociologists quickly gave up on the idea that they were going to find numerical laws of behavior, where the precise values of numbers really mattered (such as with
To put it simply, we are used to using tests of statistical significance to determine whether some effect is “significant.” But with Big Data, or “big-ass data,” as we prefer to call it,
We can call this imagination problem the “population problem.” So far, we’ve assumed that the muchness of big-ass data (henceforward, BAD) comes from merely increasing the
Think, if you will, about the apparent obsession that we have with isolating the “average treatment effect,” which has to be one of the least enlightening statistics available. If there is a treatment
Out of sheer laziness, we are projecting a population, which induces a
That everyone is a package of idiosyncrasies and incomparabilities need not undermine the possibility of large-scale statistical analysis, just as each molecule might well have its unique characteristics without invalidating the gas law. In fact, with BAD, we have new tools to overcome the population problem. BAD analysts now expect a power law distribution for pretty much any cultural behavior, the way Queteletians once expected a normal distribution. That’s because as we leave the worlds of biology and organizational regulation (call it “nature”) for that of culture and independent action (call it “freedom”), we are finding that people tend to do their own thing. And so we see many different power law distributions, with individuals who are at the left of one finding themselves on the right of others. Time spent bidding on
Of course, we may want to go beyond simply graphing univariate distributions and move towards relationships among variables, where the population problem is consequential. Suppose you want to know whether people are more likely to attend psychobilly concerts if they also consume lots of Pabst Blue Ribbon (PBR) beer, and that we have data from 500,000,000 people with which to investigate the question. Adhering to traditional techniques, we would log and correlate these variables. And we will almost certainly find a statistically significant but minuscule (say,
But we should suspect that something’s wrong. What does taking the average effect for this whole population even mean when most people in the sample neither attend psychobilly concerts nor drink PBR at all? The “average man” here certainly does not summarize the relationship between music show and beer choice for everyone in the sample; he’s just a half-hearted attempt to find the lowest common denominator among a diverse set of people. And the larger a population is, the more likely it is to be too heterogeneous to characterize through that single average.
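A minimal simulation sketch can make this concrete. Everything in it is an assumption made for illustration and none of it comes from real data: a hypothetical 2% “scene” whose (logged, standardized) beer and concert scores move together, a remaining 98% for whom the two scores are unrelated, and a sample of 200,000 rather than 500,000,000 so that it runs quickly. The function names (`simulate`, `summarize`, `t_statistic`) are likewise invented here.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t-statistic for the test that a correlation r (from n cases) is zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))


def simulate(n=200_000, scene_share=0.02, seed=0):
    """Hypothetical population: a small scene inside a large, unrelated rest."""
    rng = random.Random(seed)
    beer, concerts, in_scene = [], [], []
    for _ in range(n):
        member = rng.random() < scene_share
        b = rng.gauss(0, 1)  # logged, standardized beer score (assumed)
        if member:
            c = b + rng.gauss(0, 0.5)  # scene: concerts track beer closely
        else:
            c = rng.gauss(0, 1)  # everyone else: no relationship at all
        beer.append(b)
        concerts.append(c)
        in_scene.append(member)
    return beer, concerts, in_scene


def summarize(beer, concerts, in_scene):
    """Return pooled r, its t-statistic, and r inside and outside the scene."""
    pooled_r = pearson_r(beer, concerts)
    t = t_statistic(pooled_r, len(beer))
    scene = [(b, c) for b, c, m in zip(beer, concerts, in_scene) if m]
    rest = [(b, c) for b, c, m in zip(beer, concerts, in_scene) if not m]
    scene_r = pearson_r([b for b, _ in scene], [c for _, c in scene])
    rest_r = pearson_r([b for b, _ in rest], [c for _, c in rest])
    return pooled_r, t, scene_r, rest_r


if __name__ == "__main__":
    pooled_r, t, scene_r, rest_r = summarize(*simulate())
    print(f"pooled r = {pooled_r:.3f} (t = {t:.1f})")
    print(f"scene r  = {scene_r:.3f}")
    print(f"rest r   = {rest_r:.3f}")
```

Under these assumptions the pooled correlation is tiny yet comfortably “significant,” while the relationship is strong inside the scene and absent everywhere else, so the pooled figure describes nobody in particular. The same formula shows why sample size alone produces significance: with a purely illustrative r of 0.01 and 500,000,000 cases, `t_statistic` returns roughly 224.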
Now, this is not to say that BAD introduces us to the population problem—it’s the same thing we encounter when we get a bi-modal distribution or see outliers. But BAD gives us a special opportunity to not
Imagine arranging nearly everyone alive in a Blau space: a multidimensional coordinate system in which socio-demographic variables like age, income, education, race, and geographical location are treated as dimensions, and individuals who are close to each other are more similar than those further apart. Consider any variable corresponding to a taste, preference, or action
Why the elaborate setup for what might seem a reproduction of the notion of correlation? It is because we do not necessarily want to collapse all areas with the same numerical values. Instead, we want to inspect the contours to understand the social logic of the distribution. The map allows us to scan our eyes up, down, left, and right, to draw both horizontal and vertical comparisons: how people in the population relate to each other in terms of demographics or any single surface (e.g. psychobilly concert attendance), as well as which factors contribute to concert attendance for each sub-population. We realize, for example, that there are several snake-like shapes of red moving through Blau space, suggesting separable if not also independent “psychobillies” that are mapped onto the same action-space. We find a clumping of blue that suggests the importance of region. And we find, let us say, a set of widely spaced, horizontal planes of green, suggesting that there are different PBR drinkers organized by their position as the dominated fraction of the class they most identify with. Thus the population can be disaggregated and flexibly explored to answer a number of different questions instead of mean-averaged out to answer one poorly posed and unchanging question. Further, rather than
Choosing the variables to be included in any one map then emerges as the primary challenge. In BAD analysis, we often find ourselves in a world in which there is bountiful possibility but not always a natural stopping place. We may be brought towards a system-level view, where we are examining open systems. It is rarely obvious from the start where to close the investigation or, as Latour would say, which things to follow and which not. Less and less often can we say with complacent regret, “we couldn’t do that, because it’s not in our data set.” Like real scientists, we find that our problem isn’t running out of information, but choosing which path to follow.
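As a rough sketch of what choosing the variables for a map involves, the toy function below computes the mean of one behavior within each cell of a small Blau space rather than a single population-wide average. The eight records, the field names (`age_band`, `region`, `attends`), and the function `blau_cell_rates` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def blau_cell_rates(people, dims, outcome):
    """Mean of `outcome` within each cell of the space spanned by `dims`."""
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of outcome, count]
    for person in people:
        cell = tuple(person[d] for d in dims)
        totals[cell][0] += person[outcome]
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}


# Hypothetical records: 1 means the person attends psychobilly concerts.
people = [
    {"age_band": "20s", "region": "South", "attends": 1},
    {"age_band": "20s", "region": "South", "attends": 1},
    {"age_band": "20s", "region": "North", "attends": 0},
    {"age_band": "20s", "region": "North", "attends": 0},
    {"age_band": "50s", "region": "South", "attends": 0},
    {"age_band": "50s", "region": "South", "attends": 1},
    {"age_band": "50s", "region": "North", "attends": 0},
    {"age_band": "50s", "region": "North", "attends": 0},
]

# One population-wide average versus two different "maps" of the same data.
overall_rate = sum(p["attends"] for p in people) / len(people)
by_age_and_region = blau_cell_rates(people, ["age_band", "region"], "attends")
by_region = blau_cell_rates(people, ["region"], "attends")
```

Changing the `dims` argument redraws the map without touching the data, which is the sense in which the analyst's real decision becomes which dimensions to follow and which to leave out.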
It is not that one needs such a jar to begin to think in terms of open systems of mutually interacting elements; theorists have been proposing this for years. It is a source of satisfaction to us that computational techniques now make this tractable for many cases, although sociological implementations are still crude and tend to focus on overly convenient cases. We may find ourselves in a position somewhat like that of meteorology: to determine the most feasible approximations to systems that we recognize as inherently open and overly complex, and to employ sets of models with known biases and blind spots. And while such models are not directly deducible from lower-level theories, it is impossible to construct models that substantially outperform common sense without huge quantities of accurate data and an understanding of both the representation of vector fields and their relation to the fundamental dynamics of social interaction.
In sum, since Durkheim, we in sociology have used convenient fictions (not the least of which is “society”) to justify otherwise bizarre conventions whereby we link aggregate data to claims about classes or persons. These fictions may once have been methodologically necessary, but they are no longer necessary, and they remain false. Big-ass data allows us new ways of finding meaningful patterns in human experience, while we continue to pursue our fundamental theoretical interest in reciprocal patterns of alignment, conflict, competition, affiliation, and influence, which is precisely the theoretical approach of Durkheim’s arch-enemy Tarde. Perhaps his day has dawned at last.
Footnotes
Declaration of conflicting interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
