Abstract
In the past few years, several demographers have pointed out the need to consider big data in population studies. Some are in favour of data-driven approaches, as statistical algorithms could discover novel patterns in the data. This paper examines some of the methods, both old and new, that have been developed for detecting patterns and associations in the data. It concludes with a discussion on how big data and big data analytics can contribute to improving the explanatory power of models in the social sciences and in demography in particular.
Introduction
In the past few years, several demographers have pointed out the need to consider big data in population studies. For example, Stephanie Bohon (2018) has argued that demographers have long collected and analysed big data, but in a small way, focusing only on a subset of the data. She considers that demographers should target big deep data, i.e. population-generalizable data such as censuses, rather than data created for purposes other than research, such as social media data. Bohon is in favour of data-driven approaches that look at the data holistically in order to detect possible patterns. Her proposal is in fact rooted in the inductive tradition of research.
In the recent past, individual-level anonymized data from censuses and registers have become increasingly available, and demographers have taken advantage of this situation. A review of the literature has shown that, in the field of big data, demographers analyse huge amounts of data at the level of the individual, coming mainly from censuses and from various administrative registers (Wunsch et al., 2024). In addition, more and more national institutes are now linking data sources together at the individual level, namely censuses with registers, censuses with censuses, and registers with registers. For demographers, one can truly speak of a (big) data revolution. Of course, in all these sources some persons remain uncovered and thus undocumented. For example, population registers usually include only the legal residents of the country.
The availability of big microdata has considerably expanded the scope of demographic research in all fields of study. For instance, multilevel analyses can now be conducted on a large scale, and much more individual information is available to feed agent-based models. However, the review of the literature cited above has also shown that demographers are not very involved in causal inference and often have recourse to single-equation models that cannot spell out the structure of relationships among the variables. Lately, the author of the present text has co-authored several articles on structural causal modelling in a hypothetico-deductive perspective. However, this approach requires strong background knowledge and theory, which are often lacking. As an alternative, could data-driven approaches based on big data analytics improve the explanatory power of models in the social sciences? Could induction come to the rescue of deduction?
Following Doug Laney (2001), big data are characterised by three Vs: volume, variety, and velocity. Volume relates to the huge amount of data produced by public and private sources; variety to the various sources and formats of the data (such as text, sensor data, satellite imagery, etc.); and velocity to data creation in real time. Big data analytics is the process of extracting information from big data, such as discovering patterns and correlations in the data. It corresponds to an exploratory analysis of the data and, as such, is nothing new. Indeed, for decades social scientists have used various forms of exploratory multivariate analysis in the search for non-random patterns or structures in the data. As pointed out by Brian Everitt (1978), one may not know in advance what the structural characteristics of one’s multivariate data are. In this case, one should rely on exploratory techniques rather than on confirmatory ones. To give but one example, some fifty years ago Michel Loriaux (1971) recommended taking a data-driven approach in demography, as few tried and true theories were available in this field. He proposed exploring the data with segmentation analysis, a stepwise application of the one-way analysis of variance model. Segmentation analysis is actually a special case of survival trees, which are based on the principle of recursive partitioning algorithms. The latter could be used as an alternative to parametric regression in the social sciences (Robette, 2022).
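As a purely illustrative sketch of recursive partitioning, the following Python fragment fits a small regression tree to simulated data; the variables (age group, education, urban residence), the outcome, and the scikit-learn regression tree are hypothetical stand-ins, not the procedure used by Loriaux or Robette. Each split reduces within-group variance, in the spirit of a stepwise one-way analysis of variance.

# Illustrative sketch only: recursive partitioning (a regression tree) as a
# data-driven alternative to a parametric regression; data are simulated.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 1000
# Hypothetical covariates: age group, education level, urban residence.
X = np.column_stack([
    rng.integers(0, 5, n),   # age group (0-4)
    rng.integers(0, 3, n),   # education level (0-2)
    rng.integers(0, 2, n),   # urban residence (0/1)
])
# Hypothetical outcome, e.g. number of children ever born.
y = 3.0 - 0.5 * X[:, 1] - 0.3 * X[:, 2] + rng.normal(0, 0.5, n)

# Each split minimises within-group variance, as in a stepwise one-way ANOVA.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=50).fit(X, y)
print(export_text(tree, feature_names=["age_group", "education", "urban"]))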
This paper examines some of the methods, both old and new, that have been developed for detecting patterns and associations in the data. The focus is firmly on the exploratory analysis of data. Recent methods are based on machine learning (ML) and artificial intelligence (AI). No attempt is made, however, to cover the full field of big data analytics, which is growing larger every year. The choice of a specific method or algorithm depends upon the goal of the study. In particular, one should consider whether the objective is interpretability or predictive accuracy (see Bi et al., 2019). The paper concludes with a discussion on how big data and big data analytics can contribute to improving the explanatory power of models in the social sciences and in demography in particular.
The old and the new
Some classic methods for exploring data can be given as examples (Everitt and Dunn, 2001). Cluster analysis is utilized for classification purposes and searches for distinct groups or classes of individuals (or units) in the data. One can then see on what variables the groups differ. Pattern recognition is usually based on some form of cluster analysis. Cluster analysis requires assessing the distance between individual profiles, and the measurement of distance is a major consideration in this case. Methods such as principal components (continuous variables) or multiple correspondence analysis (categorical variables) aim at reducing the number of dimensions of the data matrix (n individuals or units on p variables), there being as many dimensions as there are variables. For example, as far back as 1971, Hervé Le Bras had recourse to principal components analysis to reduce the number of dimensions in a study of the fertility rates of the French ‘départements’ (Le Bras, 1971). When a population is divided into k groups known a priori, for example according to socio-economic status, discriminant analysis can be used to decide to which group an individual belongs.
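A minimal sketch, assuming a numeric data matrix of n individuals by p variables, of how these classic methods can be chained in practice; the data, the number of components and clusters, and the use of the scikit-learn library are all illustrative choices, not prescriptions.

# Illustrative sketch: principal components, k-means clustering and linear
# discriminant analysis on simulated data (500 units, 10 variables).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))           # hypothetical data matrix

pca = PCA(n_components=2)
scores = pca.fit_transform(X)            # reduce to two dimensions
print(pca.explained_variance_ratio_)     # share of variance accounted for

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# If groups are known a priori (e.g. socio-economic status), discriminant
# analysis assigns an individual to the most likely group; here the k-means
# clusters stand in for such groups.
lda = LinearDiscriminantAnalysis().fit(X, clusters)
print(lda.predict(X[:5]))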
Other classification methods, such as Support Vector Machines, naïve Bayes rule, decision trees, and neural networks, have been developed more recently for dealing with big data. For example, Marucci-Wellman et al. (2015) have used naïve Bayes algorithms to routinely classify injury narratives from large administrative databases. Nigri et al. (2022) have had recourse to a deep neural network model to assign a vector of age-specific death rates to an observed or predicted life expectancy. This method overcomes the linearity assumption and data requirements of past approaches. For these newer methods, the reader is referred, for instance, to Tsai et al. (2015) and Hassani et al. (2018). Closer to demography, in the field of epidemiology, a good survey article is Bi et al. (2019).
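As a hedged illustration only, the following fragment classifies short injury-like narratives with a naïve Bayes rule; the texts and labels are invented, and the sketch does not reproduce the actual procedure of Marucci-Wellman et al. (2015).

# Illustrative sketch: naive Bayes classification of short text narratives.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

narratives = [
    "worker fell from ladder while painting",
    "slipped on wet floor in kitchen",
    "cut finger on machine blade",
    "fell down stairs carrying boxes",
]
labels = ["fall", "fall", "cut", "fall"]          # invented categories

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(narratives, labels)
print(model.predict(["employee fell off scaffolding"]))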
Since Tukey’s seminal book on exploratory data analysis (Tukey, 1977), much attention has been given to the visualization of the possible patterns and dependencies, trends, and outliers or anomalies in the data. For example, if the first two principal components account for a large proportion of the variance in the data, one can project the observations on the plane defined by these two components, showing the possible structure in the data and pointing out any outliers. Other classic visualization techniques rely on non-metric multidimensional scaling, for ordinal variables, or on non-linear mapping applied to distance matrices. These and other traditional data visualization multivariate methods are described, for instance, in Everitt (1978). For more recent approaches, such as Lexis fields or composite lattice plots, see the special collection of the journal Demographic Research on data visualization finalized on 20 April 2021.
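A minimal visualization sketch along these lines, assuming a numeric data matrix and using simulated data with a few artificial outliers; it simply projects the observations on the plane of the first two principal components.

# Illustrative sketch: projection on the first two principal components,
# which can reveal structure in the data and point out outliers.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
X[:5] += 6                                # a few artificial outliers

scores = PCA(n_components=2).fit_transform(X)
plt.scatter(scores[:, 0], scores[:, 1], s=10)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Projection on the first two components")
plt.show()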
Machine learning and artificial intelligence
In recent years, machine learning (ML) has been proposed as a new approach for detecting patterns and associations in the data and for making predictions, especially in the case of big data with many variables. For example, in a data-driven approach, Bonacini et al. (2021) have opted for an ML algorithm to identify structural breaks in the time series of COVID-19 cases in Italy. For the same country, Cerqua et al. (2021) have used ML to build a counterfactual scenario of mortality in the absence of COVID-19. When hundreds or even thousands of variables are considered, as in weather forecasts, the computer beats humans by far.
Most of the newer methods described below rely on automated, iterative ML techniques. Recent approaches now depend on artificial intelligence (AI). The models adapt or ‘learn’ as they are exposed to new data. Often, the models are applied to subsets of the data in order to compare results across models. For example, in a study on unauthorized immigration to the USA, Azizi and Yektansani (2020) split their dataset into two subsets, one for training the model and the other for testing the trained model. The objective is to develop a model that generalizes well to new data, the test set serving here as a proxy for new data. For a simple introduction to ML and AI, see for example Alpaydin (2021). In the following paragraphs, some other methods of big data analysis are briefly discussed, considering their possible usefulness in population studies.
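A minimal sketch of such a training/test split on simulated data; the logistic regression used here is an arbitrary stand-in model, not the one used by Azizi and Yektansani (2020), and all variable names are hypothetical.

# Illustrative sketch: fit on the training subset, assess generalization on
# the held-out test subset (a proxy for new data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 6))                              # hypothetical predictors
y = (X[:, 0] + rng.normal(0, 1, 2000) > 0).astype(int)      # hypothetical binary outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("accuracy on unseen data:", model.score(X_test, y_test))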
Dimensionality-reduction
Some classic methods, such as principal components and factor analysis, are still used for dimensionality-reduction purposes. However, the analysis of big data can require new techniques, traditional methods often being ill-suited for analysing complex large-scale data. A more recent approach in machine learning is feature subset selection, which searches the space of feature (or attribute, or variable) subsets for the optimal subset. The method is based on the relevance and redundancy of the features for the problem at hand, evaluated respectively by entropy and similarity measures (GeeksforGeeks, 2021). Another method is t-distributed stochastic neighbour embedding, a non-linear dimensionality-reduction algorithm that seeks to find patterns in the data by identifying clusters based on the similarity of data points (Schochastics, 2017). It can be used for visualizing the data in a two- or three-dimensional space. One should also point out the regularization approach in machine learning, which penalizes model complexity in order to avoid overfitting the training set. For example, Lasso regression limits the complexity of the model through a penalty on the sum of the absolute values of the model coefficients. In Ridge regression, the penalty is based on the squared magnitude of the coefficients. Another technique, Elastic-Net regression, improves on both Ridge and Lasso by combining their two penalties (see e.g. Nirisha, 2021). Dimensionality reduction can, however, lead to a high bias error.
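The following sketch compares the three penalized regressions on simulated data in which only a few coefficients are truly non-zero; the penalty strengths (alpha, l1_ratio) are arbitrary illustration choices, not recommended values.

# Illustrative sketch: Lasso, Ridge and Elastic-Net on data with 50 variables,
# only 5 of which are truly relevant. Lasso and Elastic-Net can set
# coefficients exactly to zero; Ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 50))                 # many variables relative to n
beta = np.zeros(50)
beta[:5] = 2.0                                 # only 5 truly non-zero coefficients
y = X @ beta + rng.normal(0, 1, 200)

for model in (Lasso(alpha=0.1), Ridge(alpha=1.0), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    kept = np.sum(np.abs(model.coef_) > 1e-6)
    print(type(model).__name__, "non-zero coefficients:", kept)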
Cluster analysis
Clustering is of particular interest for demographers. It groups together low-level data and can uncover hidden similarities among members of a group or highlight differences between groups. Outliers can be detected as falling outside the clusters. Cluster analysis aims at putting units (e.g. individuals) into groups or clusters, minimizing the intra-group differences among units and maximizing the inter-group differences. For example, Duchêne and Thiltgès (1993) clustered the 43 regional sub-divisions (‘arrondissements’) of Belgium into 4 groups with common characteristics of mortality over age 15 in order to examine regional disparities in adult mortality. Clustering could be especially useful in the analysis of big data, such as census microdata, where the very large number of individuals could be grouped into a much smaller number of units sharing common characteristics, which are more convenient to analyse.
Several distance measures are available for the purpose of clustering, the most familiar being Euclidean distance. Other distance measures can be preferred in some applications. A well-known one is the Mahalanobis distance, but recent algorithms also have recourse to the Manhattan distance or the Minkowski distance, among others, depending on the data-mining problem (Tsai et al., 2015). Algorithms usually fall into one of three main categories of clustering: density-based clustering, partitioning clustering, and hierarchical clustering, though others exist, such as grid-based algorithms. For their pros and cons, see Wang (2017). Though not a new approach, cluster analysis has developed considerably in recent years. Some techniques now rely on fuzzy clustering, such as the fuzzy c-means (FCM) algorithm, which generates fuzzy partitions. A major problem in big data clustering is how to reduce the possible complexity of the data, which can be structured (potentially available in tabular form) or unstructured (such as blogs or images), and is often provided by various sources.
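For illustration, these distance measures can be computed as follows on two hypothetical individual profiles; Minkowski with p = 2 reduces to the Euclidean distance and with p = 1 to the Manhattan distance, while the Mahalanobis distance additionally requires the inverse covariance matrix of the data (here estimated from simulated observations).

# Illustrative sketch: common distance measures between two profiles.
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 0.0, 3.0, 2.0])
b = np.array([2.0, 1.0, 1.0, 2.0])

print("Euclidean  :", distance.euclidean(a, b))
print("Manhattan  :", distance.cityblock(a, b))
print("Minkowski  :", distance.minkowski(a, b, p=3))

# Mahalanobis needs the inverse covariance matrix of the data.
X = np.random.default_rng(5).normal(size=(100, 4))
VI = np.linalg.inv(np.cov(X, rowvar=False))
print("Mahalanobis:", distance.mahalanobis(a, b, VI))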
Finally, if the problem is expressed as sequences of events, such as life courses in demography, similar sequences can be clustered together according to their resemblance, in a pattern-search approach, using sequence analysis with optimal matching or alignment algorithms (Abbott, 1995). See Ritschard et al. (2008) for a comparison between sequence analysis and survival trees for mining event histories.
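A minimal sketch of optimal matching between two hypothetical life-course sequences, coded here purely for illustration as S = single, U = union, C = childbearing: the dissimilarity is the cheapest series of insertions, deletions, and substitutions, with unit costs assumed.

# Illustrative sketch: optimal matching distance between two state sequences,
# computed by dynamic programming with user-chosen indel and substitution costs.
def optimal_matching(seq1, seq2, indel=1.0, sub=1.0):
    n, m = len(seq1), len(seq2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if seq1[i - 1] == seq2[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + indel,      # deletion
                          d[i][j - 1] + indel,      # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[n][m]

print(optimal_matching("SSUUC", "SUUCC"))   # small distance: similar life courses
print(optimal_matching("SSUUC", "SSSSS"))   # larger distance: dissimilar life courses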
Association rule mining
The method was first developed for market-basket analysis by Agrawal et al. (1993). The purpose here is to discover the most frequent relationships between the data attributes. This machine-learning method also indicates the strength of association among the co-occurrences in the data. For a clear introduction, see the entry ‘Association rule learning’ in Wikipedia (2022). The method searches for the “if x then y” patterns or item sets among variables that are the most frequent in the data. Several algorithms are available for this purpose. In order to avoid discovering too many such rules, a threshold has to be set. The algorithms rely on two important concepts: Support, or how frequently the item set {x, y} appears in the dataset, and Confidence, or p(y|x), i.e. the conditional probability of y given x. A third concept is Lift, i.e. the ratio of the observed frequency of co-occurrence to the expected frequency if x and y were independent. If x and y are actually independent, Lift = 1. Positive or negative associations lead respectively to Lift > 1 and Lift < 1. In high-dimensional spaces, the method may require preliminary dimensionality reduction. Of course, some associations detected may be spurious and nothing guarantees that the associations are relevant from a causal viewpoint.
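A small worked example may help. With the toy transactions below (invented for illustration), the support, confidence, and lift of the rule “if x then y” are computed directly; in practice, dedicated algorithms such as Apriori enumerate the frequent item sets.

# Illustrative sketch: support, confidence and lift on toy transactions.
transactions = [
    {"x", "y"}, {"x", "y"}, {"x"}, {"y"}, {"x", "y", "z"},
    {"z"}, {"x", "z"}, {"y", "z"}, {"x", "y"}, {"z"},
]
n = len(transactions)

support_x  = sum("x" in t for t in transactions) / n
support_y  = sum("y" in t for t in transactions) / n
support_xy = sum({"x", "y"} <= t for t in transactions) / n   # both items present

confidence = support_xy / support_x             # p(y | x)
lift = support_xy / (support_x * support_y)     # > 1: positive association

print(f"support={support_xy:.2f} confidence={confidence:.2f} lift={lift:.2f}")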
Social network analysis
People have never been so highly connected as now, thanks to social media such as Facebook (Meta) or Twitter (X). The study of these myriad connections has stimulated the development of social network analysis. Social networks are also very important in the study of epidemics and the way contagion spreads (Kucharski, 2021). For example, the structure of the network has an impact on the rate of contagion. If it is fully connected, the infection can spread from a single infected person to everyone else. If the network contains closed loops, transmission can increase owing to the variety of routes available. Moreover, some people have far more contacts than others do. It is therefore important to know who these high-contact persons or central agents in the network are, as they can possibly be high transmitters of the disease. Much has been written by demographers on the recent incidence and lethality of COVID-19. Some countries, such as Belgium, have developed a register of the close contacts of persons affected by this disease that could be used to study the network of transmission.
Social networks are often represented by graphs where agents are vertices or nodes and edges represent non-null relationships or interactions between agents. These networks can be expressed by binary matrices. A network with n nodes is represented by an n x n adjacency matrix A with elements Aij = 1 if i and j are connected and 0 otherwise. Various descriptive statistics of the network can be computed; see O’Malley and Marsden (2008) for a good survey. For example, size is the number of nodes or agents while density is the number of actual direct connections relative to the number of potential ones. An agent’s degree is the number of other agents to which she is directly connected. The degree distribution is the frequency distribution giving the number of agents having particular degrees. The length of a path between agents is the number of edges it contains. Measures of centrality reflect the prominence of agents within a network.
In the case of big data, much of the research relating to social networks has focused on social network topology (such as assortative or disassortative networks) and especially on centrality measures (Ianni et al., 2021). An important measure is degree centrality, which is based on the number of direct connections each node has to other nodes. Highly connected individuals are probably the most popular ones and high transmitters of information, or of contagion in the case of an infectious disease. Eigenvector centrality (EigenCentrality) reflects not only how many links a node has with other nodes but also how many links its connected nodes have, and so on. An agent can acquire high centrality either by being connected to many others or by being connected to others that are themselves highly central. EigenCentrality identifies nodes with influence over the whole network. Edges can also be weighted according to the strength of the ties; the entries of the adjacency matrix are then the weights on the edges. A weighted graph can be mapped onto an unweighted multigraph with multiple direct edges between nodes. See Newman (2004) for a useful paper on this subject.
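As an illustration, the following sketch computes size, density, degrees, degree centrality, and eigenvector centrality on a small invented contact network, using the networkx library; the edges are arbitrary.

# Illustrative sketch: basic network statistics on a hypothetical contact network.
import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (1, 4), (2, 3), (4, 5), (5, 6)])

print("size    :", G.number_of_nodes())
print("density :", nx.density(G))                  # actual vs potential edges
print("degrees :", dict(G.degree()))               # direct connections per agent
print("degree centrality     :", nx.degree_centrality(G))
print("eigenvector centrality:", nx.eigenvector_centrality(G))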
As a complement to this network-oriented approach, one should also consider content-oriented approaches, which focus on the topics and the opinions or sentiments exchanged among network members. Natural Language Processing (NLP), a machine-learning approach relying on neural networks, has become the major tool for analysing content and sentiment in big data. For an example using Twitter (X) data to examine the extent to which social media users report negative or positive sentiments on topics relevant to fertility, see Mencarini et al. (2019). A very good literature review of the network- and content-oriented approaches is presented in Bazzaz et al. (2021).
Analysing satellite imagery
The analysis of spatial data is an important component of research in such fields as geography or ecology, and a wide range of methods have been developed for this purpose (for an overview, see among others Dale et al., 2002). We conclude this non-exhaustive survey of methods of big data analytics by pointing out the importance of satellite imagery for some demographic applications. There are now a variety of high-resolution satellite imagery sources, often freely available, that can provide information on landscapes and infrastructures such as buildings and roads (GISGeography, 2023). These data are used, for example, for census mapping, in combination with geographical information systems and census questionnaires on tablets. AI-powered algorithms can now derive from geospatial big data highly detailed digitized colour maps distinguishing buildings, road networks, green areas, water, etc. (see for instance Ecopia AI at https://www.ecopiatech.com/). Darin et al. (2022) have, for example, used satellite imagery for Burkina Faso to obtain information on areas where census data cannot be collected, by combining observations of buildings in satellite images with complementary demographic data. Another application is the creation of population density maps by age and gender, combining satellite imagery with census information, such as those found on the HDX open platform.
The efficient analysis of aerial digital video data and other large imagery datasets requires reliable segmentation algorithms, which create subsets or segments based on common characteristics among the units of observation. This approach should not be confused with ‘segmentation analysis’ as used by Loriaux (see Introduction). In the present case, image segmentation is the partitioning of an image into connected regions of pixels defined by similar colour or texture. For instance, González-Acuña et al. (2016) compare four clustering-based segmentation methods on aerial data and show that each has its own merits and drawbacks. They propose post-processing in order to improve the segmentation performance of these methods by splitting segments, merging segments, and avoiding islands, i.e. connected clusters of pixels that are surrounded by pixels of another segment.
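A hedged sketch of clustering-based image segmentation: the pixels of a stand-in RGB image (here randomly generated) are grouped by colour with k-means, each segment being the set of pixels assigned to the same colour cluster. This is only one simple variant of the methods compared by González-Acuña et al. (2016).

# Illustrative sketch: segmenting an image by clustering pixel colours.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(float)   # stand-in for real imagery

pixels = image.reshape(-1, 3)                        # one row per pixel (R, G, B)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pixels)
segmented = labels.reshape(64, 64)                   # segment label for each pixel

print("pixels per segment:", np.bincount(labels))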
With climate change and the expected increases in emigration from the more affected areas, the analysis of satellite imagery combined with other data, such as digital traces from mobile phones and field studies, could become a major tool in migration research. For an example relating to migration caused by armed conflict, see Pech and Lakes (2017).
Discussion and conclusion
Big data
Demographers analyse huge amounts of microdata coming from censuses and from various administrative registers. In many studies, the volume of observations reaches hundreds of thousands and even several million. In recent years, individual-level anonymized data from censuses and registers have become increasingly available. Moreover, in high-income countries, many national institutes are now linking data sources together.
Big data can help improve the explanatory power of models in several ways. A large number n of observations increases the precision of the estimates and the power of hypothesis tests. In addition, following Titiunik (2015), a large number of observations can allow for a wider range of estimation methods that would be unreliable with fewer observations. A large n also enables studying small subpopulations that would be overlooked in sample surveys, for instance. On the other hand, if n is very large, even small differences of no theoretical interest will be “statistically significant”, blurring the causal picture. Once again, statistical significance should not be confused with theoretical significance (Bijak, 2019).
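A small numerical illustration of this last point, with simulated data: a substantively negligible difference between two groups becomes “statistically significant” once n is very large.

# Illustrative sketch: with n in the millions, a tiny difference yields a tiny p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 1_000_000
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.01, scale=1.0, size=n)   # tiny, theoretically uninteresting difference

t, p = stats.ttest_ind(group_a, group_b)
print(f"t = {t:.2f}, p = {p:.2e}")                   # p is typically far below 0.05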
Furthermore, a large number p of variables helps in better describing the phenomenon under study and in reducing omitted-variable bias. However, as the number of variables increases, the differences among individuals will also increase, as their profiles will differ more and more. As each individual becomes increasingly unique, will grouping e.g. life courses together still have causal meaning? Theoretical reflection on the plausible subset of causal variables (not too large but not too small either) will be mandatory in these circumstances.
On the other hand, in many cases the representativeness of big datasets is unknown, as is their population of reference, especially in the study of found data that are neither collected nor designed for a specific research purpose, such as those provided by Facebook or Twitter (X). These datasets are therefore often problematic. For instance, social media data cover a particular subset of users that is neither representative nor random.
Big data analytics
The methods of big data analytics, both old and new, are effective at detecting correlations and patterns in the data. Indeed, the newer data-mining methods based on machine learning and artificial intelligence can automatically detect unknown links among variables from the millions of observations available. This bottom-up approach is a useful complement to the traditional top-down, hypothesis-based approach, as it can open the door to proposing new explanatory mechanisms that would then have to be tested. These newer methods are also being used for prediction, with qualified success, as the predictions are based on past knowledge. In the social sciences, causes and causal relations change over time, and nothing guarantees that the past and present determinants of union formation, for example, will remain identical in the future.
Deduction, induction, and abduction
This paper ends with a more general discussion of the scientific method based on deduction and induction in this era of big data. It also recalls the value of abduction.
It was pointed out in the introduction that the hypothetico-deductive approach, where one tests an explanatory hypothesis by way of data, is often hampered in demography by the lack of sound background knowledge and theory. Selecting one hypothesis or theory to test also deters consideration of other possible hypotheses. And if one tests several hypotheses, according to what rule does one choose the winner? On this issue, see for instance Wunsch (1988, chapter 2). Moreover, there are no general laws in the social sciences, and possible explanations are highly context-dependent. For example, the determinants of present-day fertility in France are quite different from those at the time of the Sun King. Even within the same country, behaviours can differ quite drastically from one population group to another. Not only can causes differ among populations, but the mechanisms leading from the causes to their effects can also differ. Causal claims are therefore only valid for a specific time and place, or chronotope.
There are no such difficulties with induction. The latter starts with a series of observations and, if a common pattern is detected, one seeks a general explanation. For example, non-contracepting populations can present different levels of fertility. One observes a strong association between the fertility level and the duration of breastfeeding. As the latter influences the duration of post-partum amenorrhea, one concludes that the fertility differences are caused by different breastfeeding practices. In this case, a causal mechanism can even be proposed. The larger the dataset, the more confident one can be in the conclusion. However, other explanations may also be valid, e.g. separation of couples due to male labour migration, different infertility prevalences, or differential durations of post-partum sexual abstinence. Though induction is a useful companion to deduction when theory is lacking, one can never be sure that induction leads to the right cause.
Demographers have recently suggested also adopting abductive reasoning (Hauer and Bohon, 2020; Bijak, 2022). Abduction seeks to propose the most plausible explanation for a novel pattern observed. In other words, if C is observed, abduction consists in selecting a hypothesis A from one’s background knowledge, considered the most plausible for the case at hand, such that if A is true then C is explained (Cattelin, 2004). Abduction therefore links inductive and deductive approaches. Doctors use it when proposing an explanation of the symptoms observed in a patient, based on their knowledge of the causal relations between diseases and symptoms; scientists use it more generally when, on the basis of their knowledge, they invoke an explanation for novel patterns discovered in an exploratory analysis of the data. However, there can be other and better causes of C than A. Abduction thus requires testing the validity of the proposed explanation A and comparing A to other possible causes.
Scientists use induction, abduction and deduction in their current practice according to their needs; much depends on the availability of theory and data. If theory is unavailable, induction can come to the rescue of deduction by proposing possible new hypotheses. But these have then to be tested, induction thus leading to deduction.
To conclude
In an inductive approach, big data analytics can detect novel patterns and associations in the data. However, as Kitchin (2014) has observed, it is one thing to identify patterns in big data; it is another thing to explain them. In other words, data do not speak for themselves: they have to be interpreted. One can presume that among the various associations that will be detected, only a few will possibly make causal sense. In a data-driven approach, the problem then consists in proposing and testing a suitable mechanism that can explain why a variation observed in one variable produces a variation in another variable, observation bias and confounding being under control. When background knowledge is available, abduction can be used for this purpose.
To bring this article to a close, big data, machine-learning, and artificial intelligence will have a profound effect on scientific discovery but they will not replace human judgment in the construction and testing of explanatory models.
Acknowledgements
The author wishes to thank Catherine Gourbin and Federica Russo, and two anonymous reviewers, for their valuable insights in the preparation of this article.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
