Abstract
With the dramatic and rapidly growing role of “data” in contemporary societies, there is increasing interest in how best to reflect this reality in countries’ official statistics. This paper suggests that an appropriately broad concept of “data” is essential, one that includes data
Keywords
“The evidence of huge technologically driven change is everywhere in daily life, and almost nowhere in the standard economic statistics.” 1
Introduction
It is no overstatement to say that there are revolutions underway in the volumes and speeds of computerized data flowing around the planet. A major question is how National Statistical Offices (NSOs) should prioritize statistical developments that accurately portray the dramatic economy- and society-wide increases in the roles of “data”, and then reflect these in their statistical programs.
In turn, a key question is how best to conceptualize where statistical information on “data” is most needed. To fix ideas and to address this question, we focus first on two groups of public good purposes, providing statistical evidence to support major areas of public policy, specifically privacy and health. Second, we focus on two further areas where the role of NSOs should include providing statistical portraits enabling the public at large to appreciate and understand better the dramatic trends associated with computerized data, namely inflation and entertainment.
In anatomy, in contrast to motor neurons which signal muscles to move, and sensory neurons which signal pain and other sensations,
What do we mean by “data”
The online Cambridge dictionary defines data as “information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer”. 2 Conventional official statistics such as unemployment and inflation rates presented in tables broken down by time and demographic groups are clearly data in this sense. However, contemporary computer databases increasingly contain “numbers” that are unintelligible on their own. Rather, these can be the often huge bit strings for music videos, genetic sequences, software programs, representations of sub-atomic particle interactions, and web pages. These elements of databases are sometimes referred to as BLOBs, “binary large objects”, which require specialized software to be examined and used. For purposes of this analysis, we include as “data” such bit strings at their higher level as processed for their intended uses, including images and web pages.
With the very recent and dramatic evolution of so-called artificial intelligence software such as LLMs (large language models, e.g., ChatGPT and Perplexity), the boundaries of what constitutes a “data base” are blurring further. More detailed consideration of this evolution of electronic data is, however, generally beyond the scope of this paper.
Why collect data on data
To produce any kind of official statistical portrait of the dramatically increasing roles of data in modern societies, NSOs must collect primary data. This entails collecting data on data. However, doing so will have considerable costs, and require significant innovations in statistical thinking and methods. As a result, there need to be clear motivations for any such statistical program. To this end, we focus on two main reasons – to support public policy, and to provide social proprioception. In turn, two more specific areas within each are developed. While these motivations are likely applicable to most countries, we focus here on Canadian experiences.
Data and privacy policy
For public policy motivations, one of the top contemporary data-related issues is privacy. On the one hand, NSOs have too often been constrained in their access to various government and private sector organizations’ detailed internal microdata. This constraint has been relaxed for government data sets in Canada, e.g., in the areas of tax returns and increasingly for health care records. Still there are powerful vested interests who fear what sophisticated analyses of patients’ health care trajectories might reveal – whether these are “bad apples” among the physician community or underperforming units in hospitals. 3 As a result, data custodians unnecessarily use “protecting privacy” as an excuse for not sharing the data – in effect creating a pervasive “privacy chill”.
Importantly, following the revelation of many serious limitations in data flows related to the recent pandemic, the Public Health Agency of Canada convened an Expert Advisory Group which made strong recommendations to ameliorate the situation, 4 and Health Infoway Inc, a joint federal-provincial-territorial crown corporation shortly thereafter published a “roadmap on (health data) interoperability”. 5 This roadmap makes repeated references to the need to remove “blockages” to bona fide / public good flows of data, including personally identifiable data, not only for high quality patient care but also for improved health sector management and a range of broader health research, including more cost-effective randomized clinical trials. Most recently, the Government of Canada introduced legislation in June 2024 to make “data blockage” by software vendors an offence under the criminal code. 6
Private firms are the sources of detailed microdata for NSOs on a range of characteristics – from surveys of retail sales to employment to R&D to financial statements (though in Canada much of these data now come from various tax returns). However, transaction level microdata from private firms remain difficult for Statistics Canada to access, even though these data are among the largest flows of data in the world, and are of great potential value for key economic indicators like the consumer price index (more on this below).
Nevertheless, concerns about “privacy chill” in the context of statistical and research access to detailed microdata have been vastly overshadowed by concerns in the opposite direction: largely unfettered and massive privacy invasions, especially by the largest multinational social media and related firms. There are also growing concerns about potential or actual overreach by police and national security agencies, including facial recognition software and their underlying databases. There are the beginnings of significant legislative constraints, led by the EU, along with some bipartisan investigations in the US Congress. However, legislation is far behind what is needed for informed consent regarding the sharing of personally identifiable and profitable data with and among private firms.
In Canada, there is considerable support for strengthening the powers of the Office of the Privacy Commissioner (OPC), especially regarding practices in the private sector. One avenue for this would be to grant the OPC stronger investigatory powers. With such powers, the OPC could compel a firm to disclose details of the ranges and kinds of data it collects and shares. However, Canada's OPC generally operates on a complaints basis; without a specific complaint, it has no power to investigate data behaviours among private firms.
Thus, from a public policy perspective, there is a major conundrum regarding privacy. On the one hand, there is far too much “privacy chill” with regard to data flows to support major public goods, including most recently and acutely data on infections, vaccinations, hospitalizations, and compliance with various lock-downs associated with the pandemic. 7 On the other hand, there are extremely serious and growing invasions of privacy via the data collections and individual-level linkages occurring in the private sector, especially in social media firms. 8
Data and health policy
A second major public policy area where data are central is population health and health care. Progress in automating data collection and analysis, e.g., in the forms of electronic health or medical records (EHRs and EMRs), has been painfully slow, not least due to pervasive privacy chills and powerful vested interests. However, the potential benefits in terms of population health and more effective management of health care service provision are tremendous. Canada's constitutional division of powers between the federal and provincial governments remains a major stumbling block.
There are decades of reports and studies outlining the kinds of health data needed to achieve these population health and health care benefits, including the recent Expert Advisory Group 4 and Infoway Roadmap 5 reports. In the 2023 federal budget, over $200 billion was budgeted over the coming decade as fiscal transfers to the provinces for health care, including $500 million earmarked for health data. 9 To monitor progress on this crucial policy initiative, collecting the appropriate “data on data” is essential.
Social proprioception
From the second main perspective, social proprioception, it is illustrative to focus on two major areas for improved data on data: inflation and entertainment. In both cases, one of the fundamental objectives is to shed light on the extents to which (per the Beatles) “things are getting better all the time”. In other words, the objective in these cases is to provide the general public insights regarding social progress, i.e., for social proprioception.
However, before delving into these motivations for greatly improved official statistical programs for data on data, it is first necessary to consider more fully just what should be meant by “data”.
Where is the data base (DB)?
Discussions of “data” are often expressed in terms of data bases (DBs). For example, scattered amongst firms and other organizations there are discrete sets of DBs, each characterized by its size (numbers of records, number of fields per record in conventional DBs, numbers and types of BLOBs (binary large objects) if more complex DBs such as videos and web pages), and the substantive content of the records. However, contemporary electronic data involve not only DBs, but also massive and rapidly evolving flows of data into, out of, and between DBs.
Consider a purchase via credit card in a retail establishment. Details of the transaction flow to both the vendor's and the purchaser's banks (via credit card intermediaries), both of whom add the transaction data to one of their own DBs. The same transaction data likely also flow (somehow) to the vendor's inventory DB so new items can be ordered when stocks on hand fall below some threshold. These same data also flow to the vendor's accounting software, and to the tax authorities for the collection of sales taxes or VAT, two further DBs. On the purchaser's side, the transaction data feed not only her monthly credit card statement, but also possibly other DBs within the bank to support customer relationships including target marketing of other financial services, and beyond the bank or credit card software to credit rating agencies which combine the individual's purchases from all her credit cards, thereby involving several more DBs.
This story becomes even more involved if the purchase is online, via a firm like Amazon. In this case, the online vendor adds the transaction data to its profile of the individual in terms of her favorite products and other tidbits gleaned from cookies and “trackers” on other web sites to which the vendor has access.
For Statistics Canada at present, these myriad transactions are aggregated and arrive from the Canada Revenue Agency via administrative records showing total VAT and total revenue by firm, which are in turn drawn from various DBs housed at the tax authority.
As a result, a single transaction can appear in myriad DBs. This wide-ranging and virtually instantaneous diffusion of the data from a single transaction has become ubiquitous as the marginal cost and time required for making and flowing electronic copies of the transaction data are close to zero.
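The fan-out of one purchase into many DBs can be made concrete with a minimal sketch. The routing list, the field names, and the `fan_out` helper below are hypothetical illustrations and do not describe any actual payment system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    txn_id: str
    amount: float
    item_code: str
    vendor: str
    purchaser: str


# Hypothetical list of DBs that receive a copy of each card purchase,
# loosely following the retail example in the text.
ROUTES = [
    "vendor_bank",
    "purchaser_bank",
    "vendor_inventory",
    "vendor_accounting",
    "tax_authority",
    "credit_rating_agency",
]


def fan_out(txn, routes=ROUTES):
    """Append the same transaction to every downstream DB and return the DBs."""
    dbs = defaultdict(list)
    for name in routes:
        dbs[name].append(txn)
    return dbs


txn = Transaction("t-001", 42.50, "0123456789012", "ShopCo", "Alice")
dbs = fan_out(txn)
copies = sum(len(records) for records in dbs.values())
print(f"{copies} copies of transaction {txn.txn_id} across {len(dbs)} DBs")
```

Each additional route costs one list append, which mirrors the near-zero marginal cost of copying transaction data noted above.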
Thus, in computer science terms, the world of simple one-off DBs is ancient history. Contemporary DB developments and computer science include concerns with the management of truly enormous real-time transaction data flows. 10
Handling these data
Both DBs and data flows (DFs) are increasingly dynamic, so it is challenging to provide simple high-level definitions. To start, we can characterize a DB as a collection of records, where each record has a number of attributes. A simple example is birth and death certificates where each record is for one birth or death, and the attributes can include the person's name, address, and date of the event. At the other extreme, a “record” could be a complex web site, and its attributes could include not only its URL and text strings but also images and its links to other sites. Further, the web site could be dynamic, like a weather page being continually updated with the latest forecast.
Among the most numerous data flows are financial transactions, such as credit card purchases, while the most voluminous data flows include streaming videos. More complicated is the emerging volume of AI applications, such as ChatGPT, where the data flows (e.g., queries) may be used to augment the AI application's data base to respond more appropriately to subsequent user queries. In essence, these become hybrids of DBs and DFs.
Much work will be required to develop the concepts and definitions of DBs and their associated DFs (together, DBDFs), let alone operational methods for collecting data on them. The process will have to be iterative, with much trial and error. The core idea is that, starting simply, the NSO would field surveys for carefully designed samples of organizations, eliciting data about the DBs they hold and the DFs in which each DB is involved.
One place to start is with financial transactions, where the DBs are comparatively straightforward, including for each discrete transaction the dollar amounts, the bar code or type of purchase, from whom, to whom, and a time stamp. A second fairly straightforward starting point is health care encounters, such as the DBs related to lab test and imaging results and drug prescriptions, and more generally the evolving flows of updates from various care providers to patients’ electronic medical records, especially as progress toward health data interoperability is improved and monitored. 11 As experience is gained, both on the methods of data collection and the kinds of uses to which the resulting statistical information is put, these surveys would seek more detailed data on organizations’ DBDFs, and their sample sizes would be increased.
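One way to picture the objects such a survey would collect is a small data model: DBs as nodes, each with a holder, a size, and a list of attributes, and DFs as directed links between them. This is only a sketch. The class names, fields, and example entries are assumptions made for illustration, not definitions proposed by the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DB:
    name: str
    holder: str              # organization holding the DB
    n_records: int
    attributes: List[str]    # fields per record, e.g. amount, time stamp


@dataclass
class DF:
    source: str              # name of the originating DB
    target: str              # name of the receiving DB
    records_per_day: int


@dataclass
class Portrait:
    """A hypothetical 'DB of DBDFs': DBs keyed by name, plus the flows between them."""
    dbs: Dict[str, DB] = field(default_factory=dict)
    dfs: List[DF] = field(default_factory=list)

    def add_db(self, db: DB) -> None:
        self.dbs[db.name] = db

    def add_df(self, df: DF) -> None:
        self.dfs.append(df)

    def flows_for(self, db_name: str) -> Tuple[List[DF], List[DF]]:
        """Return (inflows, outflows) for one DB."""
        inflows = [f for f in self.dfs if f.target == db_name]
        outflows = [f for f in self.dfs if f.source == db_name]
        return inflows, outflows


portrait = Portrait()
portrait.add_db(DB("pos_transactions", "ShopCo", 1_000_000,
                   ["amount", "bar_code", "payer", "payee", "time_stamp"]))
portrait.add_db(DB("card_ledger", "BankOne", 50_000_000,
                   ["amount", "merchant", "card_holder", "time_stamp"]))
portrait.add_db(DB("sales_tax_returns", "TaxAuthority", 200_000,
                   ["firm", "total_revenue", "total_tax"]))
portrait.add_df(DF("pos_transactions", "card_ledger", 3_000))
portrait.add_df(DF("pos_transactions", "sales_tax_returns", 1))

inflows, outflows = portrait.flows_for("pos_transactions")
print(len(inflows), "inflows;", len(outflows), "outflows")
```

As the surveys mature, further fields (for example BLOB types or update frequency) could be added to `DB` and `DF` without changing the overall structure.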
What should NSOs do with DBDFs?
A major challenge, in this context of dynamic, complex, and rapidly expanding DBDFs, is what specific roles NSOs should play. In the following, we consider four areas: we elaborate the two policy areas of privacy and health already noted, and then discuss the two social proprioception areas of inflation and entertainment.
Privacy
Suppose Canada's Office of the Privacy Commissioner (OPC) is granted stronger legislative powers proactively to investigate and act or regulate potential or emerging privacy issues related to DBDFs. Such powers could be analogous to those already available to the tax authorities who, based on their DB inventories of tax returns, deploy various algorithms to analyze patterns in these returns and then select a (typically stratified) sample of taxpayers’ returns for detailed audit. The essential prerequisite for the tax authority is its inventories of tax returns. Analogously, the OPC would need an inventory of DBDFs.
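The audit analogy can be sketched as a stratified random draw from an inventory. The tax authority's actual selection algorithms are not described here, so the strata, sampling rates, and inventory entries below are purely hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, rates, seed=0):
    """Draw a simple random sample within each stratum at that stratum's rate."""
    rng = random.Random(seed)          # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name, members in sorted(strata.items()):
        k = max(1, round(rates[name] * len(members)))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical inventory of 100 DBs; the first 20 share personal data externally.
inventory = [{"db_id": i, "shares_personal_data": i < 20} for i in range(100)]


def risk_stratum(entry):
    return "high" if entry["shares_personal_data"] else "low"


# Assumed design: review half of the high-risk DBs and a tenth of the rest.
selected = stratified_sample(inventory, risk_stratum, {"high": 0.5, "low": 0.1})
n_high = sum(1 for entry in selected if entry["shares_personal_data"])
print(len(selected), "DBs selected, of which", n_high, "are high risk")
```

Oversampling the high-risk stratum concentrates investigative effort where privacy issues are most likely, while the low-risk draw keeps the whole inventory under some scrutiny.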
Such an inventory of DBDFs would have many uses beyond supporting the (potentially expanded) privacy mandate of the OPC, so it would be far more efficient, while complementing any increased powers for the OPC, for Statistics Canada (and other countries’ NSOs more generally) to build and maintain an evergreen (and likely rapidly growing) “portrait” of DBDFs in Canada. Indeed, this portrait, essentially a DB of DBDFs, would form the keystone for much of what is needed for an effective and comprehensive program of collecting data on data, and meeting the specific policy and social proprioception objectives which are the focus here.
Statistics Canada already has a very broad sample frame as a starting point: essentially any organization in Canada that pays sales tax or pays employees (hence administers income tax source withholding) or has individual or corporate income must file at least one kind of tax return at least annually. These tax data subsequently flow routinely to Statistics Canada (subject always to stringent security and confidentiality provisions and practices) where they are used to construct and maintain the “business register”, essentially an ongoing census of all organizations in Canada (n.b. including public sector and non-profit entities as well as private firms), hence providing a near universal sample frame of domestic organizational entities.
In turn, this sample frame is used to elicit data using a variety of focused surveys, ranging from retail trade to R&D. In principle, therefore, it would be possible for Statistics Canada to create a new “DBDF portrait survey” asking (a sample of) these entities to provide basic data on all their DBDFs.
Of course, there are important complexities in designing and implementing such a survey, including:
providing workable and respondent-understandable definitions of a DB and a DF;
having an adequate profile of the entity being surveyed to ensure that the survey itself is sent to an individual within the firm or organization having the knowledge to complete the survey accurately; and
ensuring that all data flows into and from the entity being surveyed include adequate pointers to all the other entities party to the data flows.
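The last of these requirements, that every reported data flow points to an identifiable counterparty, lends itself to a simple consistency check on survey responses. The response layout and the `dangling_counterparties` helper below are assumptions made for illustration. A real survey would need its own schema.

```python
def dangling_counterparties(responses, register):
    """Map each respondent to the counterparties it named that are not in the register.

    A flow with no counterparty at all is reported as '<unspecified>'.
    """
    problems = {}
    for resp in responses:
        named = {flow.get("counterparty") or "<unspecified>" for flow in resp["flows"]}
        unknown = sorted(named - register)
        if unknown:
            problems[resp["entity"]] = unknown
    return problems


# Hypothetical extract of a business register and two survey responses.
register = {"ShopCo", "BankOne", "TaxAuthority"}
responses = [
    {"entity": "ShopCo", "flows": [
        {"db": "pos_transactions", "direction": "out", "counterparty": "BankOne"},
        {"db": "pos_transactions", "direction": "out", "counterparty": "AdTrackerInc"},
    ]},
    {"entity": "BankOne", "flows": [
        {"db": "card_ledger", "direction": "out", "counterparty": None},
        {"db": "card_ledger", "direction": "in", "counterparty": "ShopCo"},
    ]},
]

problems = dangling_counterparties(responses, register)
for entity, unknown in problems.items():
    print(entity, "->", unknown)
```

Counterparties that fail the check are either reporting errors or entities missing from the frame, such as foreign firms with no domestic presence.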
A further major challenge is international entities that may have no “footprint” in Canada. With the internet, it is easy and very common to be able to interact with foreign entities, e.g., Google or Google maps searches, where the web site is collecting data on the individual, but has no formal presence in Canada. In cases like this, strong federal legislation will likely be required to compel such international organizations to provide data on their DBDFs insofar as they involve Canadian residents. In the first instance, the reason would be to support any strengthened mandate for the OPC.
It will also be important for any such legislation to be clear regarding the respective roles and mandates of the OPC vis a vis Statistics Canada (or NSOs more generally), as the data on DBDFs thereby generated would play a foundational role for official statistics. In particular, such an evergreen DBDF portrait could serve as a sample frame for a range of more focused data programs, including those described next.
Health
Canada has the potential to join world leaders in managing its health care sector in the most cost-effective manner, in health research, and in rapidly responding to unforeseen events like the recent pandemic. The simple reason is that each province is effectively a single-payer for a wide range of health care services, so in principle it could manage these services by creating a fully integrated patient-level DBDF. Further, from a pan-Canadian perspective, if these provincial DBDFs used standardized concepts and definitions and were interoperable across provincial boundaries, Canada could rival the likes of England's NHS in terms of providing a population-based laboratory for clinical research including more cost-effective randomized clinical trials, health technology assessment, growing appreciation of the power of “real world evidence”, and linkages to major population health and related surveys (like the UK Biobank 12 ). Unfortunately, this potential is far from being realized. A key reason is the many blockages to the appropriate flows of health and health-related data.
As emphasized in the Expert Advisory Group report, 4 there have been decades of reports and studies outlining what is needed in the area of health data. The challenge is overcoming the privacy chill and vested interest blockages. 13
As a concrete example, one of Canada's leading health services research organizations, the Institute for Clinical Evaluation Sciences, decades ago produced a cardiovascular disease atlas. 14 But the data were not linked: one chapter had data on risk factors like smoking and obesity, another on surgical procedure rates, and another on mortality outcomes. In a unique effort almost two decades ago, Statistics Canada researchers assembled longitudinally linked patient-level microdata on the incidence of heart attacks, surgeries, and mortality outcomes, but only for a few provinces, and only with limited longitudinal follow-up. The results were disturbing: surgical procedure rates varying as much as three-fold with no apparent mortality benefit, hence the obvious implication that there is significant waste in the treatment of cardiovascular disease. 15 However, this analysis has not yet been replicated, let alone updated.
Similar results on the prevalence of unnecessary cardiovascular disease procedures have been found in US studies for sub-populations for decades. 16 While the prevalence of such unnecessary procedures has continued in the US, 17,18 there is still no way of knowing whether this is the case in Canada. The absence of significant change in treatment in the US is indicative of the power of vested interests, in this case involving cardiac surgeons, hospital management, and large health insurance companies. In Canada, the absence even of data with which to monitor these surgical procedures is possibly the result of more subtle use of “privacy chill” arguments by the relevant vested interests.
A further indication of the challenges in assembling the needed kinds of data is a recent survey that paints a gloomy picture of the ability of patients even to access their own already existing electronic health data. 19
The implication from these examples is that NSOs have to proceed gradually and devote significant efforts to the development of countervailing public support, a “social license” for the collection of detailed individual-level sensitive data not only for patients themselves but also for truly important public good purposes.
Over the almost two decades since its creation, Canada Health Infoway's staff have developed a very good understanding of the current landscape of provincial health-related DBDFs. Most recently, Infoway has been charged with developing and leading a “Roadmap” on interoperability 5 which in effect constitutes a DBDF portrait in the area of health care. It is not only assembling this portrait, but also endeavoring to ensure that, e.g., a diagnosis of diabetes or a third line chemo treatment for cancer or the make, model and software version of an MRI machine (say) are each coded using a common standard (if not identically) in all the places where such data fields exist.
This is a massive and long overdue undertaking. But individual patients’ lives depend on the interoperability of these data, as does cost-effective management of health care services. Infoway's foci include hospital and physician encounters, lab tests, diagnostic imaging, vaccinations, and prescription drugs. However, there are many critical kinds of data outside Infoway's scope, including characteristics of the health human resources involved (physicians, nurses, personal care workers – their training, work patterns, age), vital statistics (e.g., causes of death), over-the-counter drugs, home care and nursing home ownership and staffing patterns, etc., and the kinds of data needed to place health care within the context of the broader social determinants of health.
Further, there are already many players in the health data area beyond Health Infoway and Statistics Canada. This complex landscape includes the Public Health Agency of Canada (a federal department), the Canadian Institute for Health Information, 20 Canada's Drug and Health Technology Agency, 21 and some provincial counterparts, various provincial government agencies including health ministries themselves, workers compensation boards, and health quality councils, 22 and various academic health research organizations. 23–25
Important DBDFs are also held by private sector firms, from pharmacies to lab testing firms to primary care physicians’ businesses typically structured as private corporations, to large insurance companies.
As health care comprises just over one-tenth of Canada's economy, it is not surprising that there are myriad entities holding health and health-related DBDFs. Based only on the ad hoc and fragmentary information available, it is clear these DBDFs are largely uncoordinated, unstandardized, not interoperable from an individual patient's perspective, and more often than not useless for contemporary kinds of probing statistical analyses, which involve large highly multivariate longitudinally linkable samples of individuals’ data.
Having regularly updated and readily accessible data on the state of health DBDFs would provide the general public as well as journalists, policy analysts and decision-makers, an essential evergreen snapshot of where the most serious gaps in functioning health-related DBDFs were. While it is unlikely to be decisive, such accessible information would further aid in forcing some accountability on the actors whose support and effort are needed to achieve the desired state of health DBDFs in Canada.
Inflation
The economies of many countries immediately following the pandemic suffered from both inflation and the impacts of increased interest rates as tightened monetary policy was deployed. Importantly, there have been reports in the popular media where individuals claimed they were facing much higher inflation than reported in the official statistics. There are also longer standing concerns in Canada that the official consumer price index (CPI) does not reflect the inflation faced by particular groups including the poor and the elderly. While detailed studies have not supported this claim (Stat Can, unpublished), it remains an open question how much heterogeneity would be found in inflation rates across individuals and households with varying patterns of expenditures. In response to these concerns, NSOs could regularly publish inflation data disaggregated not only by commodity and geographic region as presently, but also by socio-economic group. Even better would be the production of “inflation maps” such as scatter plots of family-level inflation rates by various socio-economic characteristics. Meeting these needs, especially the latter, requires significantly better data on expenditure patterns.
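How household-level inflation rates can differ under identical price changes is easy to show with a fixed-basket (Laspeyres-type) calculation. The sketch below assumes that formula for simplicity. The commodity groups, price relatives, and expenditure patterns are invented for illustration and are not Statistics Canada figures.

```python
def laspeyres_inflation(expenditures, price_relatives):
    """Percentage inflation for one basket: expenditure-share-weighted mean of price relatives."""
    total = sum(expenditures.values())
    index = sum(spend / total * price_relatives[item]
                for item, spend in expenditures.items())
    return (index - 1.0) * 100.0


# Hypothetical price relatives (current price / base price) by commodity group.
price_relatives = {"food": 1.08, "shelter": 1.05, "recreation": 1.01}

# Hypothetical base-period spending for two households with different patterns.
households = {
    "lower_income": {"food": 40.0, "shelter": 50.0, "recreation": 10.0},
    "higher_income": {"food": 20.0, "shelter": 30.0, "recreation": 50.0},
}

rates = {name: laspeyres_inflation(spend, price_relatives)
         for name, spend in households.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
```

Plotting such household-level rates against income or age would give the kind of “inflation map” described above, provided expenditure data are available at that level of detail.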
At the practical level of implementing the CPI, Statistics Canada is facing growing difficulties collecting data from the Survey of Household Spending which is used to determine these expenditure patterns, i.e., the basket of goods and services providing the “principled weights” (a phrase used by Dan Usher, a professor at Queen's University) used to aggregate the sub-indices underlying the CPI.
Another major concern with price indices, including the CPI, is the role of “new goods”. The US Senate-appointed Boskin Commission 26 argued that the US CPI was over-stated by about one percentage point per annum, where half of this overstatement was attributable to the failure to account properly for the appearance of new goods, such as digital cameras, cell phones, and new drugs. Streaming music and videos had not yet become widely available, nor ChatGPT and its like. This new goods problem arises because the volume of sales of such an item only becomes large enough for it to be included in the price index's basket of goods and services well after the largest declines in its price have already occurred, hence the inflation rate is arguably over-stated. Moreover, the theoretical foundations for price indices are even more problematic when standard neo-classical economic assumptions are relaxed, such as allowing increasing returns to scale, disequilibrium trading, income inequalities, and satisficing rather than omniscient utility maximization. 27
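The mechanics of the new goods problem can be illustrated with a toy chained index. All numbers are invented, and the sketch makes two strong simplifying assumptions: every other price is constant, and the new good has a fixed 10% expenditure weight throughout.

```python
def chained_index(price_path, weight, entry_period):
    """Chained price index for a basket holding one new good plus constant-price goods.

    Price changes of the new good are only observed after it enters the basket
    at `entry_period`; earlier changes are treated as if nothing happened.
    """
    index = 1.0
    for t in range(1, len(price_path)):
        if t > entry_period:
            relative = price_path[t] / price_path[t - 1]
        else:
            relative = 1.0  # good not yet in the basket: its price change is unseen
        index *= (1.0 - weight) * 1.0 + weight * relative
    return index


# Hypothetical price path: the steepest fall happens in the first period.
prices = [100.0, 50.0, 40.0, 38.0]

full = chained_index(prices, weight=0.10, entry_period=0)  # included from the start
late = chained_index(prices, weight=0.10, entry_period=1)  # included one period late

print(f"index with early inclusion: {full:.4f}")
print(f"index with late inclusion:  {late:.4f}")
```

The late-inclusion index ends higher than the early-inclusion one because it never records the initial halving of the price, which is the sense in which measured inflation is arguably over-stated.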
Related to new goods, there is widespread recognition that major quality improvements have been ongoing in many commodities, initially most notably in computers, but also in cars, household appliances, and streaming video services. As a result, NSOs have deployed hedonic regression methods to adjust some commodities’ valuations in price index construction to take account of such quality changes. But due to their practical difficulties, hedonic adjustments for quality changes are applied only for a few commodities. As a result, price indices, including the CPI, are missing much of the improvements in quality occurring.
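A stripped-down version of a hedonic adjustment is sketched below. It assumes a single quality characteristic (processor speed) and a log-log relationship between price and speed, fitted by ordinary least squares on base-period models. The data are invented and chosen to fit exactly, and NSO practice involves many more characteristics and considerably more care.

```python
import math


def fit_loglog(speeds, prices):
    """OLS fit of ln(price) = intercept + slope * ln(speed)."""
    xs = [math.log(s) for s in speeds]
    ys = [math.log(p) for p in prices]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_price(intercept, slope, speed):
    """Price the fitted base-period relationship implies for a given speed."""
    return math.exp(intercept + slope * math.log(speed))


# Hypothetical base-period computer models: (speed, price).
base_speeds = [1.0, 2.0, 4.0]
base_prices = [1000.0, 1400.0, 1960.0]
intercept, slope = fit_loglog(base_speeds, base_prices)

# A new, faster model appears in the current period.
new_speed, new_price = 8.0, 2000.0
imputed_base_price = predict_price(intercept, slope, new_speed)

unadjusted = new_price / base_prices[-1]   # compared with the old top model
adjusted = new_price / imputed_base_price  # compared with a like-quality base price

print(f"unadjusted price relative:       {unadjusted:.3f}")
print(f"quality-adjusted price relative: {adjusted:.3f}")
```

In this example the unadjusted comparison shows a small price rise, while the quality-adjusted comparison shows a substantial fall, which is the kind of improvement an index misses when hedonic methods are not applied.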
Also recently, there has been a dramatic growth in “free” goods, such as online search and videos. As their monetary expenditure weights are zero these are completely missing from the CPI.
Perhaps the most dramatic quality improvement is now just emerging, with large language models (LLMs) like ChatGPT. These innovations promise to improve dramatically the qualities of many current software packages from word-processing to image generation to online search to completely new functionalities in fields as diverse as law, accounting, and drug discovery. In all these cases, not only are these missing from the CPI, but DBDFs are also deeply involved.
Given all these factors, it can reasonably be argued that the official CPI may be seriously biased, but in ways that are presently unknowable. Further, it is unknown to what extent inflation, measured taking account of the biases just noted (to the extent feasible), has important distributional consequences, for example varying systematically across different socio-economic groups.
Beyond inflation measured by the CPI
With these challenges and limitations in constructing an appropriately broadened conceptual foundation for the CPI as just outlined, a fundamentally new approach is needed to conceptualize and then measure statistically households’ “progress” in terms of consumption, defined more broadly than simply in terms of price inflation, and then in deflated “real income” growth rates.
A critical step in this reconceptualization is incorporating time use patterns. There has been a growth in the deployment of time use surveys by NSOs, most notably in the US. 28
While Statistics Canada was an early leader in fielding time use surveys, it has not moved beyond a quinquennial focus in its General Social Survey. But time use patterns are essential for obtaining data on the consumption of “free” goods on the internet. These surveys can also provide the basis for moving from periodic
Time use patterns are also essential for understanding consumption of entertainment, discussed in the next section. Consumption of radio, TV, recorded music, and more recently podcasts, is often joint with other activities like household chores and childcare.
Another critical step is broadening the data flows used to construct the consumption basket, especially given the declining response rates to the household surveys that have provided this basis for many decades. With the dramatic growth of electronic rather than cash payments for goods and services, as well as the use of bar coding for differentiating commodities, there already exist myriad DBDFs with potentially useful data – specifically data on expenditures that are more fine-grained in terms of commodity detail, and are linkable to individuals’ and households’ socio-economic status.
Statistics Canada has the legislative authority to collect such data from banks and retailers, but it does not yet have the “social license” to do so, as revealed in a recent controversy. 13 In this case, a more measured and gradual approach would be more likely to succeed. It would start with the construction of the “portrait” of DBDFs already discussed. Next, there could be an exploratory pilot study with a very small sample of individual records to ascertain not only the levels of detail available from various kinds of electronic transactions (e.g., credit cards, point of sale bar codes), but also more information on the kinds of software and DBDF architectures the various retail, financial and other entities were using to handle and store these data.
It would also be critical for the NSO to have its staff engage personally with the relevant decision-makers in these entities, to understand both their sensitivities regarding the disclosure of these very detailed data to the NSO and the response burdens that collecting a sample of these data would impose on organizational entities of various types and sizes.
Entertainment
There is no question that the availability and consumption of many kinds of electronic entertainment have exploded. These include recorded music, streaming videos, sharing photos with friends, sharing hobby interests with individuals around the world (e.g., in Facebook groups), and computer gaming. As recently as a few decades ago, the idea of a “500 channel universe” was still a dream. Today, we are well beyond 500 channels.
Much of this consumption is “free”, without any monetary payments. Much else has essentially zero marginal monetary cost once a subscription has been paid. As it is all electronic, it now involves the flows of digital data, often coupled with data collection by the suppliers on the viewing or usage patterns of each user.
From the perspective of social proprioception, and of understanding societies’ progress, any statistical series based only on monetized market transactions is bound to be seriously biased, most likely understating actual progress. NSOs should be endeavoring to provide their societies with valid and engaging statistical information on how these major aspects of our lives are changing.
The “portrait” of DBDFs already described, together with time use surveys as just mentioned, provides the foundations for such a new statistical program. The content of such a program would be sufficiently diverse that a family (dashboard) of statistical indicators would be needed, along with “drill down” access to the underlying microdata for more in-depth analyses.
As a thought experiment, we can imagine the table of contents for the first publication (or in more contemporary terms, a web site) from this new statistical program on electronic entertainment. At the highest level, it could divide the activities into sectors or domains, analogous to standard industrial classifications, e.g., music, videos (both longer like movies, sports events, and TV shows, and shorter like TikTok), computer games (both solo and multi-player), hobbies, and “friends” (conversing, sharing photos).
In each of these domains, among the key statistics would be how much time individuals were spending engaged in the activity, when during the day or week the activity most often occurred, whether it involved real-time interaction with other individuals, and how it was paid for. Further, all these data elements could be disaggregated by users’ various socio-economic characteristics, not least age, sex, educational attainment, household income group, ethnicity, and geography. As importantly, the trends over time would (eventually) be provided. It is most likely that such a statistical publication would generate considerable headline news.
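To make the kind of tabulation described above concrete, the following is a minimal sketch in Python. The diary episodes, the field names, and the function `mean_minutes_by` are all hypothetical, invented purely for illustration; an actual time use survey would also require survey weights and a far richer set of covariates.

```python
from collections import defaultdict

# Hypothetical time use diary episodes (illustrative values only):
# (respondent_id, age_group, domain, minutes_per_day, how_paid)
episodes = [
    ("r1", "15-24", "computer games", 90, "subscription"),
    ("r1", "15-24", "music", 30, "free"),
    ("r2", "15-24", "videos", 120, "subscription"),
    ("r3", "65+", "videos", 150, "free"),
    ("r3", "65+", "music", 60, "free"),
]

# Hypothetical number of diary respondents in each age group.
respondents = {"15-24": 2, "65+": 1}


def mean_minutes_by(episodes, respondents_per_group):
    """Average daily minutes per person, by (age_group, domain).

    Respondents who did not report a domain count as zero minutes,
    because the divisor is everyone in the age group. The averages
    are unweighted, which is an assumption of this sketch.
    """
    totals = defaultdict(int)
    for _rid, age_group, domain, minutes, _paid in episodes:
        totals[(age_group, domain)] += minutes
    return {
        key: total / respondents_per_group[key[0]]
        for key, total in totals.items()
    }


table = mean_minutes_by(episodes, respondents)
for (age_group, domain), minutes in sorted(table.items()):
    print(f"{age_group:>6}  {domain:<15} {minutes:6.1f} min/day")
```

The same grouping logic would extend to the other breakdowns mentioned, such as time of day, mode of payment, or household income group, by adding the relevant field to the grouping key.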
Beyond its value in terms of social proprioception, other features of the underlying data would be important for various areas of public policy. For example, there are the privacy implications of the data on viewers and game players themselves being collected by the vendors of these electronic entertainment services, 8 possible implications of corporate concentration of these vendors for competition policy, possibly adverse effects on children's mental health or school achievement related to amounts of time spent interacting with social media, and in Canada the longstanding policies involved in encouraging Canadian cultural content.
Including “data” in the system of national accounts (SNA)
The System of National Accounts (SNA), revised periodically since its inception after WWII, is again under revision, this time targeted for 2025. This round of revisions includes new sections on what the SNA refers to as “digitalization”: “A wide variety of digital products and activities have appeared as part of digitalization, and digital assets … have assumed important roles as stores of wealth or inputs in production. The profound impact of digitalization on production, consumption, transacting, investment, prices, finance, and other aspects of the economy, as well as its impact on international trade in goods and services and other cross-border transactions, calls for enhanced visibility of digital activities, products, and transactions in the macroeconomic accounts.” 29
One key reason is that the SNA's primary focus is on aggregates and partial aggregates, while modern statistical and analytical software, indeed DBDFs themselves, are oriented to being able to drill down into the underlying microdata.
There are important opportunity costs to NSOs investing in creating SNA-style aggregates for statistical information about “data”: it will detract and distract from the more fundamental needs for “data on data” as sketched above.
From a more academic perspective, one that has been generally forgotten, it is worth recalling the “Cambridge controversies in capital theory”. Building on the seminal work of Sraffa, 30 this debate successfully showed that the logical foundations for a scalar measure of aggregate capital stock, as is being proposed for the inclusion of “data” in the revision underway for the SNA, were fundamentally flawed. An aggregate capital K index may serve as the basis for parables in theoretical neoclassical economic growth models, but for official statistics it cannot be trusted to tell an unbiased story of economic growth, productivity, or other aspects of social progress. It is far more useful, valid, and practical to build such stories using data collections that are more disaggregated, that pertain directly to real phenomena, that do not embody patently unrealistic or arbitrary assumptions (e.g., on service lives and depreciation rates), and that do reflect myriad real-world heterogeneities.
The focus on aggregation also runs counter to the Stiglitz, Sen, Fitoussi report 31 as well as much earlier work by Richard and Nancy Ruggles, who not only wrote extensively on the need to provide explicit microdata foundations for the aggregates in the SNA, 32 but also developed methods and supported specific efforts to do so. The statistical portraits of DBDFs sketched above are very much in line with these efforts, building in explicit microdata foundations ab initio.
Part of the appeal of the SNA and its aggregate approach is the simplicity of having a single measure, such as GDP per capita. However, a major part of the motivation for the Stiglitz, Sen, Fitoussi report 32 is captured in the 1995 Atlantic cover headline, “If the GDP is up, why is America down?” The cover story introduced an alternative to GDP, the Genuine Progress Indicator. 33 This article reinforced and encouraged a flowering of studies and estimates of indicators proposed as better and more valid alternatives to GDP (and GDP per capita) for assessing social progress. After several years of international meetings convened by the OECD, however, the consensus was that summary indicators were too constraining, and embodied too many implicit but very strong value judgements required to aggregate the diverse sub-indices forming the overall index. Instead, Stiglitz, Sen, and Fitoussi 32 recommended moving away from a single indicator (GDP) to a “dashboard” of indicators. Subsequently, the OECD launched just such a dashboard as the centrepiece of its “Beyond GDP” agenda. 34
Well-meaning individuals too often seek statistical indicators without appreciating the detailed and expensive data collections they require: the malaise of “indicatoritis”. An obvious example is life expectancy, clearly a fundamental indicator. While the concept is relatively straightforward, constructing a high-quality version of this indicator requires average annual expenditures of hundreds of millions of dollars for a population census and a vital statistics program that includes complete death registration.
No country would invest in a census or vital statistics program for the sole reason of producing the life expectancy indicator. The census and vital statistics data collections each serve a multitude of statistical and informational objectives, as well as other areas of government administration and public policy. Further, given their microdata foundations, these very large data sets enable analyses to “drill down” beneath any indicators or substantially aggregated published statistical tables to explore more fully underlying patterns and relationships. It is far more useful for an indicator like life expectancy to reside at the top of a coherent “system of statistics”, 35 with explicit microdata foundations that enable “drill down” capacity to disaggregate by age, cause of death, socio-economic status, geography, and other key covariates. Further, these underlying data should support modern kinds of statistical inference, such as multivariate hazard regressions and microsimulation modeling, in order to provide insights on the factors affecting (in this case) life expectancy.
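The dependence of this indicator on both data collections can be illustrated with a deliberately simplified period life table, sketched here in Python. Death counts (from vital statistics) and population counts (from the census) are combined into age-specific death rates, which are then converted into life expectancy at birth. The function name, the very coarse age groups, all the numbers, and the assumption that deaths occur on average at mid-interval are illustrative assumptions only; official life tables use much finer age detail and more careful methods.

```python
def life_expectancy_at_birth(age_groups):
    """Period life expectancy at birth from a simplified abridged life table.

    age_groups: list of (width_in_years, deaths, population), ordered by age.
    The last group is open-ended and has width None.
    Assumes deaths occur on average at mid-interval (a simplification)
    and a non-zero death rate in the open-ended group.
    """
    survivors = 1.0      # proportion still alive at the start of the interval
    person_years = 0.0   # person-years lived per person born
    for width, deaths, population in age_groups:
        rate = deaths / population  # age-specific death rate
        if width is None:
            # Open-ended group: expected remaining years are 1 / rate.
            person_years += survivors / rate
            break
        # Convert the death rate into a probability of dying in the interval.
        q = width * rate / (1 + 0.5 * width * rate)
        dying = survivors * q
        # Survivors live the whole interval; those dying live half of it.
        person_years += width * (survivors - dying) + 0.5 * width * dying
        survivors -= dying
    return person_years


# Hypothetical "drill down" by socio-economic group (invented numbers):
# ages 0-14, 15-64, and 65+ for each group.
by_group = {
    "lower mortality": [(15, 5, 10_000), (50, 30, 10_000), (None, 600, 10_000)],
    "higher mortality": [(15, 10, 10_000), (50, 60, 10_000), (None, 800, 10_000)],
}
results = {group: life_expectancy_at_birth(rows) for group, rows in by_group.items()}
for group, e0 in results.items():
    print(f"{group}: {e0:.1f} years")
```

The point of the sketch is that the same microdata that yield the headline indicator also support disaggregation: any covariate recorded on both the death and population records can define the groups.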
Similarly, instead of aggregate measures of productivity, it would be far more valuable to make more extensive use of and to expand Statistics Canada's marvelous longitudinal microdata on firms, e.g., to observe births, deaths, mergers and amalgamations, divestitures, and growth, firm by firm, associated with a range of covariates – including the dynamics of firms in relation to their production possibility frontiers. Of course, such analyses are more difficult and time-consuming, typically requiring more careful data preparation, hence are less attractive to researchers given the publish or perish competition in academia at present.
Analogously, a statistical DBDF portrait should be considered as a general purpose statistical activity, designed to meet a wide variety of data, administrative, and policy needs – well beyond the objective of valuing data to form an aggregate sub-index within the framework of the SNA.
Concluding thoughts
Even though the metaphor that “data is the new oil” is somewhat strained, there is no question that data bases and data flows have grown dramatically, and that this growth is reflected in major changes in the ways we spend our time and money, and hence in the economy, as well as in the ways we interact socially. As a result, it is incumbent on NSOs to adapt their statistical programs to encompass and reflect these new realities.
We have proposed that at the centre of NSOs’ adaptation to the dramatic growth of “data”, the focus should be on the micro foundations – collecting data not only on discrete data bases (DBs), but on data bases and their associated data flows (DBDFs). The core should be an evergreen micro statistical “portrait” of the country's DBDFs. In essence, this portrait would be a census of individual DBs plus a census of all the DFs including both the substance of the data elements flowing and the pointers indicating the source and destination DBs for these data flows.
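As a minimal sketch of how such a portrait might be represented, the Python fragment below treats DBs as records and DFs as records that point to their source and destination DBs. The class names, field names, and example entries are hypothetical, chosen only to illustrate the structure; they are not a proposed standard or classification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    db_id: str      # unique identifier within the census of DBs
    holder: str     # organization holding the data base
    content: str    # brief description of what is stored


@dataclass(frozen=True)
class DataFlow:
    source_db: str       # pointer to the originating DB
    destination_db: str  # pointer to the receiving DB
    elements: tuple      # substance of the data elements flowing


# Hypothetical census of DBs, keyed by identifier.
databases = {
    db.db_id: db
    for db in [
        Database("retailer_pos", "Retailer A", "point-of-sale bar code transactions"),
        Database("bank_cards", "Bank B", "credit card payments"),
        Database("nso_prices", "NSO", "price index inputs"),
    ]
}

# Hypothetical census of DFs between those DBs.
flows = [
    DataFlow("retailer_pos", "bank_cards", ("amount", "merchant", "timestamp")),
    DataFlow("retailer_pos", "nso_prices", ("product_code", "price", "quantity")),
]


def outflows(db_id, flows):
    """All data flows that originate in the given DB."""
    return [f for f in flows if f.source_db == db_id]


def dangling_flows(flows, databases):
    """Flows whose source or destination is missing from the DB census."""
    return [
        f for f in flows
        if f.source_db not in databases or f.destination_db not in databases
    ]


print(len(outflows("retailer_pos", flows)), "flows leave retailer_pos")
print(len(dangling_flows(flows, databases)), "flows point outside the census")
```

A check such as `dangling_flows` hints at why the two censuses belong together: a flow whose source or destination is not in the census of DBs signals a coverage gap in the portrait.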
The reasons to build and maintain the DBDF portrait include both major policy areas such as privacy and health, and key areas of social proprioception – areas where there is general interest in understanding how society is evolving. In this paper, two such areas have been discussed: inflation and entertainment.
Further, to provide essential context, the DBDF portrait should be complemented by more extensive and coordinated statistical data on time use patterns, hence time use surveys of adequate frequency, with sufficient detail, including content on the satisfaction derived from various activities, and using concepts and definitions concorded with the DBDF portrait.
This kind of statistical program will require considerably more resources for NSOs than adding a capitalized “data” or “digitalization” stock to the SNA. However, it will provide the bases for an important range of critical public policies and the foundations for many derivative analyses and areas for further statistical developments. Our proposed statistical portraits of DBDFs could include capitalized valuations in SNA terms, but this should not be a top priority for NSOs.
Footnotes
Acknowledgements
An earlier version of this paper was presented to the IARIW-CIGI conference on “The Valuation of Data”, November 2-3, 2023, Waterloo, Ontario. I am indebted to the discussion at that conference, and to a reviewer's comments.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
