Abstract

In 2021, Wiley published the book “Statistical Learning for Big, Dependent Data”, a comprehensive summary of approaches for analysis, learning, and inference from data with temporal or geo-spatial dependencies. The book is particularly relevant in the context of the ongoing evolution of statistics: the transition from scarce and expensive data collected for a specific statistical purpose to vast, readily available data that comes with less control over its properties and structures, especially spatial, temporal, or other dependencies. When working with data with these dependencies, it is crucial to use methods that account for their characteristics appropriately.
The book is quite accessible as it begins with a clear description of the current context and provides motivation for consideration of the techniques covered in the text. It serves as a useful resource to study and learn about tools and techniques in this setting, which is common in official statistics as well as rigorous statistical analysis in general. As evidenced by the many references to highly cited work from the authors themselves, this is an area where both authors have done significant work and have contributed to the literature.
The topics covered in the text are particularly relevant to official statistics. Some, such as state space models and dynamic factor models, represent newer approaches that are gaining popularity in the discipline, while others, such as ARIMA modeling, represent the foundational building blocks of day-to-day work. The methods can be applied in many situations that are common in official statistics, including modeling to decompose a time series into unobserved components (as detailed in Quenneville and Ladiray (2001) and Findley et al. (1998)), prediction problems such as forecasting and nowcasting (described in Eurostat (2017)), and general inference, all carried out with a high standard of scientific integrity and grounded in underlying statistical theory.
The book comprises nine chapters, each with a comprehensive list of references, recent real-world examples with output from applying the techniques, and exercises for the reader. Each chapter adds new approaches to the toolkit of available methods. A particularly useful feature of the text is the inclusion of R source code that can be used to reproduce the findings, which lets the reader gain hands-on experience and consolidate the material while advancing through the text.
The first chapter introduces big, dependent data through illustrative examples and presents key concepts related to temporal dependence (autocorrelation), including fundamental notions such as stationarity and invertibility. It closes with an illustration of the effects of serial dependence, motivating the reader to recognize these characteristics and treat them appropriately to avoid errors.
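One way to see why serial dependence matters is a small simulation. The sketch below is not taken from the book (which supplies its own R code); it is a minimal Python illustration under assumed settings: an AR(1) process with coefficient 0.8, series of length 200, and 500 replications. The function names are hypothetical. It compares the usual standard error of the sample mean, which assumes independent observations, with the spread of the sample mean actually observed across replications.

```python
import math
import random
import statistics


def simulate_ar1(n, phi, rng, burn_in=100):
    """Simulate n values of x_t = phi * x_{t-1} + e_t with standard normal e_t."""
    x = 0.0
    series = []
    for t in range(n + burn_in):
        x = phi * x + rng.gauss(0.0, 1.0)
        if t >= burn_in:  # discard the burn-in so the start value has little effect
            series.append(x)
    return series


def compare_standard_errors(n=200, phi=0.8, replications=500, seed=12345):
    """Return (empirical sd of the sample mean, average naive i.i.d. standard error)."""
    rng = random.Random(seed)
    means = []
    naive_ses = []
    for _ in range(replications):
        series = simulate_ar1(n, phi, rng)
        means.append(statistics.fmean(series))
        # Naive standard error: treats the observations as independent.
        naive_ses.append(statistics.stdev(series) / math.sqrt(n))
    return statistics.stdev(means), statistics.fmean(naive_ses)


empirical_sd, average_naive_se = compare_standard_errors()
print(f"empirical sd of the mean: {empirical_sd:.3f}")
print(f"average naive std. error: {average_naive_se:.3f}")
```

With positive autocorrelation, the naive standard error understates the true variability of the mean, so confidence intervals built from it would be too narrow. Setting `phi=0.0` brings the two quantities back into rough agreement.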
Chapters 2 to 5 introduce essential tools for modeling and forecasting time series data. The second chapter details ARIMA models and their underlying processes while also introducing key diagnostics such as autocorrelation functions, periodograms, and spectral density functions (also covered in Box and Jenkins (1976) and Brockwell and Davis (1991)). It highlights common time series models, including random walks and the so-called airline model. The chapter then develops structural time series models and state space models (Durbin and Koopman 2001) and outlines the Kalman filter, which is used to filter and smooth a series to generate state estimates at each point in time. Tests for model development and validation follow and serve as a useful reference for practitioners working with real dependent data. Several more complex extensions to the methods are also discussed, including model averaging and shrinkage estimators. The third chapter extends the univariate results to the multivariate setting, modeling multiple series simultaneously with vector autoregressive (VAR) and vector moving average (VMA) models. It offers an interesting comparison of VMA and VAR models, and it introduces practical issues such as spurious regression, over-differencing, and unit roots, together with advice and solutions for dealing with them. Together, these chapters could serve as a text for a graduate-level course on time series modeling.
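To make the filtering idea concrete, the following is a minimal Python sketch of the Kalman filter for the simplest state space model, the local level model, in which an observation equals an unobserved level plus noise and the level follows a random walk. It is an illustration under assumed variances, not the book's R implementation; the function name, the diffuse-style initial variance of 1e6, and the toy data are all assumptions made here.

```python
def local_level_filter(observations, obs_var, state_var,
                       init_mean=0.0, init_var=1e6):
    """Kalman filter for y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t.

    obs_var is Var(eps_t) and state_var is Var(eta_t); both are assumed known.
    Returns the filtered level estimates and the Kalman gains.
    """
    mean, var = init_mean, init_var
    filtered_means, gains = [], []
    for y in observations:
        # Prediction: a random-walk level keeps its mean, its uncertainty grows.
        pred_var = var + state_var
        # Update: weight the new observation by the Kalman gain.
        gain = pred_var / (pred_var + obs_var)
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * pred_var
        filtered_means.append(mean)
        gains.append(gain)
    return filtered_means, gains


# Toy data: a constant series, so the filtered level should settle at 5.
observations = [5.0] * 50
filtered_means, gains = local_level_filter(observations, obs_var=1.0, state_var=0.1)
print(f"final filtered level: {filtered_means[-1]:.6f}")
print(f"final Kalman gain:    {gains[-1]:.3f}")
```

The gain starts close to one because the initial state is treated as almost unknown, then settles at a value determined by the ratio of the two variances. In practice the variances are estimated, typically by maximum likelihood, rather than fixed in advance as they are here.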
Chapter 4 covers a particularly important topic in dependent data analysis and modeling: dealing with extreme events and outliers that lead to heterogeneity in the data. This is especially relevant in light of recent challenges associated with the COVID-19 pandemic and the measures put in place in response. During this period, social and economic indicators experienced unprecedented shocks, and it was challenging to determine whether the shocks were temporary or would have more lasting effects. The chapter outlines options for modeling different effects through predetermined step functions whose magnitude can be estimated within the model (as in Eurostat (2018)), or via correlations with exogenous variables. These techniques are critical for using time series data reliably for prediction and inference.
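As a rough illustration of a step-function intervention, the Python sketch below regresses a series on an intercept and a 0/1 step regressor by ordinary least squares, so the fitted slope is the estimated size of the level shift. This is a deliberate simplification and an assumption of this example: it treats the shift date as known and ignores serial correlation in the errors, whereas in practice the step regressor would sit inside a time series model such as a regression with ARIMA errors. The function name and the data are hypothetical.

```python
def estimate_level_shift(series, shift_index):
    """OLS fit of y_t = baseline + shift * step_t, where step_t = 1 for t >= shift_index.

    Requires 0 < shift_index < len(series) so both regimes contain observations.
    """
    n = len(series)
    step = [0.0 if t < shift_index else 1.0 for t in range(n)]
    mean_y = sum(series) / n
    mean_s = sum(step) / n
    # Closed-form simple regression: slope = cov(step, y) / var(step).
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in zip(step, series))
    sxx = sum((s - mean_s) ** 2 for s in step)
    shift = sxy / sxx
    baseline = mean_y - shift * mean_s
    return baseline, shift


# Made-up series with a level shift starting at the fifth observation.
series = [10.0, 10.2, 9.8, 10.0, 15.1, 14.9, 15.0, 15.0]
baseline, shift = estimate_level_shift(series, shift_index=4)
print(f"baseline level: {baseline:.2f}, estimated shift: {shift:.2f}")
```

With a single step regressor, the least squares estimate reduces to the difference between the mean after the shift and the mean before it.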
Chapter 5 deals with clustering and classification of time series, a topic with many practical applications. The chapter provides useful distance measures and visualizations to summarize similarity and dissimilarity of time series. The proposed measures are based on linear and non-linear properties of the data, which can be used as features for classification with both parametric and non-parametric methods. The chapter also provides technical details such as estimation methods and considerations for structural breaks and hierarchical structures between series. A rich set of references is provided for this topic, which is useful both for gaining insight into large data sets and for grouping similar series so that repeated tasks, such as parameter tuning, can be applied to them.
The remaining chapters turn to more recent advances related to big, dependent data. The sixth chapter covers dynamic factor models, a class of models that has received a great deal of attention of late. The chapter provides valuable insight into special cases of these models and frames them in the broader context. It covers practical issues such as parameterization and determining the number of latent factors, as well as strategies for generating forecasts and dealing with the effects of scaling. These models tend to be somewhat complex, but the text again addresses important issues that arise in application, such as non-stationarity, co-integration, and clustering, with practical advice and insight. These approaches have the potential to produce advance indicators and improve the timeliness of official statistics when suitable quality can be achieved with these powerful predictive models.
Chapter 7 provides a comparison of additional techniques that could be considered for forecasting. It first details the LASSO approach to modeling. Special cases of the LASSO, such as the fused LASSO and the group LASSO, are also covered, as is the generalization to the elastic net. A discussion is given of using the LASSO on dependent data. Other techniques, such as Principal Component Regression and Partial Least Squares, are also discussed as potential solutions to multi-collinearity in high-dimensional problems. Sections are also included on mixed-frequency data, which enable the use of partial information for the current period to improve the quality of forecasts (referred to as nowcasts in this setting). The chapter also highlights areas where open questions remain, such as strongly autocorrelated data and its impact on these methods, along with potential avenues to explore when this situation is encountered.
Chapter 8 covers Machine Learning methods and how they can be adopted for dependent data, specifically Random Forest and Classification and Regression Trees. It details the differences between the models and explains the concept of bagging. A summary discussion of Neural Networks is then provided, distinguishing between recurrent, recursive, and convolutional networks, and providing details on different activation functions with useful examples. These models continue to gain momentum and their use is expected to become increasingly common in the domain of official statistics in the future.
Chapter 9 focuses on data with spatio-temporal dependencies. The chapter illustrates effective visualizations using several data sources and introduces useful frameworks for analyzing such data, including hierarchical and linear mixed models. A discussion of simple, ordinary, and universal Kriging is provided, and parallels are drawn between concepts for data with time dependencies and those for data with spatial dependencies. Approaches for testing spatial dependence are provided, and concepts described earlier in the book, such as factor models and hierarchical models, are extended to spatial models.
In conclusion, the book develops a comprehensive set of tools for the analysis of dependent data, ranging from those that are largely transparent and intuitive to others that are highly complex. It brings together models from classical statistics with more modern approaches and techniques common in data science. It covers topics that are highly applicable to real-world problems in statistics, especially those found in official statistics, and it introduces them in a digestible way, accompanying the theory with ample explanations and examples.
