Abstract
There has been increasing interest in recent years in training in official statistics with reference to the 2030 Agenda, big data, diversification of data types and sources, and data science. Backgrounds for work in official statistics are becoming more varied than ever. The official statistics community has also become progressively more aware of the importance of statistical literacy in education and of trust in official statistics. Hence foundation and introductory levels of education are of as much interest to official statistics as more specialised training. At the same time, greater access to data and vast technological capabilities have brought much emphasis on, and discussion of, the statistical and data sciences and education therein, including development of educational resources in contexts such as civic data and statistics. Data science provides opportunities to renew the decades-long push for authentic learning that reflects the practice of ‘greater statistics’ and ‘greater data science’, and to examine progress to date in implementing and sustaining the extensive work and advocacy of many. This article discusses what is needed at the foundation and introductory levels to realize this advocacy, with commentary relevant to official statistics.
Introduction
At the United Nations Statistical Commission (UNSC) in March, 2019, the side event of the International Statistical Institute (ISI) was titled
There has been significant development of training for official statistics at the postgraduate level. More than 20 European universities participate in the European Master in Official Statistics. The wide variety of training of National Statistical Office (NSO) staff by global and regional institutions to strengthen statistical capacities was a key motivator in establishing the Global Network of Institutions for Statistical Training (GIST, https://unstats.un.org/GIST). Participants in this type of training, and in training at the postgraduate level, are usually working, or have worked, mostly in NSO’s or closely associated government departments or agencies. Hence we see considerable in-service and postgraduate training developments, and emerging linkages at school level to increase awareness of official statistics, but what of the undergraduate level? Many NSO’s put considerable effort into attracting graduates from suitable quantitative degree programs, including statistics, finance, economics, and mathematics, and some have established significant ongoing research and/or staff links with selected university departments or faculties. But official statistics is just one of innumerable destinations for such graduates, and is very different to many, not just in content but also in type of workplace.
At school level, there has been increasing collaboration between NSO’s and those involved in school education, most notably through competitions such as the ISI’s ISLP (International Statistical Literacy Project) Poster competition (
On the establishment of GIST at the 2018 UNSC, as inaugural chair I discussed with leaders in the UN Statistical Division (UNSD) how to tackle the ambitious and far-reaching aims in the projected strategic action areas. I proposed task forces of GIST members, with the leader selected by each task force sitting on a GIST Board to facilitate sharing and liaison across GIST members and the GIST Stakeholder Advisory Group of NSO’s covering world regions. One of the strategic action areas is “Promote training that enables data producers to improve data literacy and usage within stakeholder communities”, and since GIST’s inception there has been a task team on statistical literacy, now called “Statistical Literacy in the context of the 2030 Agenda”. There was considerable initial discussion within this task team on whether the focus was intended to be on improving statistical literacy in the general community, improving it within government, improving statistical communication by NSO staff, or all three. Indeed, this task team has done substantial work on the first two areas in particular, with significant support and input from UNITAR, as discussed elsewhere in this special issue.
All the above are indicators of the importance of statistics and data education for the world of official statistics, and of the rapid increase of this importance with the 2030 Agenda, with new sources and types of data, new gatherers and sources of information (and mis-information), increasing awareness of and emphasis on data science, and increasing concerns about issues such as trust in official statistics. Specific training of official statisticians is just one part of the entire spectrum of statistical and data science education which is pertinent to the core business of official statistics. Previously perceived precincts within the entirety of the statistical, data and stochastic sciences, including official statistics, have become increasingly indistinct, and are vanishing, as more and more complex real contexts, data and technological capabilities arise across disciplines, business, government and society. Hence sound foundations in statistical and data science education across school and tertiary levels are of consequence to official statistics, from citizen statistical and data literacy right through to training of official statisticians. This article discusses a range of issues, both general and specific, in such foundations, including progress and problems, and the contrasts between extensive advocacy and examples of best practice on the one hand and the practicalities and realities across education on the other. It notes points of particular interest to official statistics, and where official statistics leadership can assist.
Statistical and data literacy
Data literacy descriptions are the same as those for statistical literacy
There have been many descriptions of statistical literacy over many years, with examples including:
Like the above, many have referred to information in media sources. Jane Watson [3] initially developed a view of statistical literacy that centred on media reports and focused on the data consumer. The extensive and prolific work of Jane and her many colleagues, particularly Rosemary Callingham, includes research, guidance and authentic classroom-ready activities developing statistical literacy (see publications under
In this blog, they also make use of the classic five steps of statistical data investigations not as a recipe but as a framework for “
For many years, the Statistics Department at the University of Auckland have successfully run a course
There have been ambitions and attempts over many years to achieve statistical literacy for all at tertiary level. One such university was the University of Wollongong, which initiated a number of statistical literacy courses, for example, for law [5]. Statistics in this university had, and has, close ties with the Australian Bureau of Statistics. In 2017, their program was encouraged for all, with the following description:
Today in 2021, the above is an online section of a set of tertiary literacies run by the university’s Learning Development,
In a number of forums, including editorials for the journal Teaching Statistics, an International Journal for Statistics and Data Science Teaching, I incorporated references to the UN World Data Forums (WDF) of 2017 and 2018 (for example, [6, 7]). Some presentations at the first UN WDF in 2017 not only included long-established descriptions for statistical literacy as ‘data literacy’ but also denial that it was statistical literacy. In addition, a description of the statistical data investigation process (see Section 3) was renamed as “
In a workshop on data literacy
Such descriptions are highly relevant to official statistics and the 2030 Agenda, but it must be recognised and emphasized that these are part of statistical literacy; NSO’s, GIST and UNSD play crucial roles in such emphasis.
In arguing that “
Some of those who have denied that data literacy is statistical literacy have viewed statistical literacy as capabilities such as being able to produce and use graphs and statistical measures, but it is clear from the extensive literature, of which the above are just a few examples, that the common essential element of data and statistical literacy is to gain confidence as a critical consumer of data and statistical information. This naturally merges into one’s statistical and data capabilities, with no clear-cut delineation, and, as always with statistics, is context-dependent. As Moore and Cobb [9] famously said “Data are numbers in a context”. The extensive literature in statistics education also demonstrates that significant effort has been put in by many world-wide to embed statistical literacy in school education, in citizen/adult education and, more rarely, in learning support across disciplines at tertiary level.
CensusAtSchool and ISLP
Official statistics has been involved in such efforts in a number of ways. Following an idea to raise statistical awareness in primary schools in New Zealand by Forbes [10], the online
Since 2010, the International Association for Official Statistics (IAOS) has been formally involved with the International Statistical Literacy Project (ISLP)
Following the 2009 ISI WSC, the IASE president re-organised the ISLP project with a formal and sustainable governance model consisting of an ISLP executive of director and deputy directors, an Advisory Board with IASE and IAOS representation, and country coordinators appointed by the executive, currently representing more than 80 countries. IAOS and IASE members provide significant leadership and support in all ISLP activities, including the ISLP International Poster Competition. Reija Helenius, Statistics Finland, has been ISLP Director since 2010, and Pedro Campos, University of Porto/Statistics Portugal, a deputy director. Sharleen Forbes, Statistics New Zealand (SNZ), was a deputy director until her retirement, and was replaced by Steve MacFeely, UNCTAD and now WHO. It was decided that an international poster competition was more suitable to statistical literacy than examinations. Since 2010, the ISLP International Poster Competition has been held biennially, with international winning posters displayed and announced at ISI WSC’s, and typically involving more than 12,000 students from more than 20 countries.
The
Forbes et al. [12] includes an excellent account of how SNZ worked collaboratively over many years with academics and teachers on developments ranging over statistical literacy resources, school curricula and assessment, to initiatives and eventually courses to raise the statistical capabilities of employees across government. Elsewhere in this special issue, Sharleen Forbes and John Harraway discuss how this led to three free downloadable web apps in official statistics hosted on the ISLP website, and to supplying initial input for the United Nations Institute for Training and Research (UNITAR) e-learning course
Thus we see how official statistics have collaborated, and are collaborating, with educators and academics to help support development of statistical literacy and natural extensions into improving statistical capabilities. Collaboration is key, however, and direct impact such as that of SNZ in school and tertiary courses is rarer. Section 5 includes some comments and ideas on how to effect better influence and achieve greater educational impact.
Statistical data investigations and data science
The statistical and data sciences
At the second UN WDF, leading data science speakers from large organisations in communications, securities, information technologies, and official statistics, discussed data science as “
In recent years, the term ‘big tent’ of statistics has been increasingly used [13] to encompass the full spectrum of statistical science and statistical practice across everything to do with data, variation and uncertainty. This builds on Chambers’ [14] ‘greater statistics’, which in turn builds on Tukey’s [15] view of data analysis, the forerunner of the confluence of the statistical and data sciences. Donoho [16] not only advocates ‘greater statistics’ but also ‘greater data science’, and initiates a vision for the latter that is far more than a ‘mere scaling up to big data’ and big technology, but an ongoing ‘more intellectually productive and lasting’ science. The term data science has been around for a while (see, for example, [17]), including the idea of calling ‘statistics’ ‘data science’ as mooted by Wu in 1986 [18].
The above comments tend to be oriented to professional work, research and higher education, but also apply at the school, foundation and introductory levels. In his exploration of what introductory data science courses should look like, Gould [19] argues “if we are to teach [secondary and introductory level] students to find meaning in data, then most of what students need sits firmly within the boundaries of statistics.”
In the above and other articles and comments, we see that discussion and debate about what data science is and whether it is statistics are continuations of previous discussions on what statistics is. Statistics has always been part of developments in computing power, as contributor, motivator and user, and has fed into, and been fed by, increasing technological power to tackle more and more complex problems in wider contexts with broader as well as ‘larger’ data. Big data, data analytics, data deluges have been impacting on statistics as well as drawing data science ‘out of the back room’.
Discussions and descriptions of what the statistical and data sciences are, are valuable in communicating, but to attempt to divide by internal boundaries is counter-productive. Certainly, the statistical and data sciences span a wide and increasingly broad range and diversity of topics and capabilities, but over the past 50 years, previous internal precincts in statistics have blurred into each other, often combining forces in tackling more complex real contexts and data. Topics are not boundaries. Indeed, it is time to finally remove any previous boundaries within statistics and to cogently advocate avoidance of such in what should now be known as the big tent of the statistical and data sciences. Official statistics is very much within this big tent, and any previous boundaries with other areas of statistics have become well-blurred in the era of big data, diversity of data types, provenance and ownership, and SDG’s.
The statistical data investigation process
The statistical and data sciences are essentially investigative, problem-solving and driven by contexts involving data, variation and hence uncertainty. Over many years, statisticians have advocated that students have authentic experience of the full statistical investigation process. In augmenting Chambers’ [14] description of how statisticians practice ‘
Kenett and Thyregod [21] criticize university teaching that does not include sufficient focus on the first two and last two steps of their statistical consulting cycle described as follows:
These and other descriptions of the full statistical investigation process have been used by leading international statistical educators in advocacy and initiatives across curricula, resources and guidance for statistics learning to reflect the practice of statistics. Those in statistical education will recognise the similarities to Wild and Pfannkuch’s [22] popularization of the PPDAC (Problem, Plan, Data, Analysis, Conclusion) cycle of [23] and the [24] stages of Question, Design, Collection, Analysis, Answer. The description of the data-handling cycle that featured in the UK National School Curriculum in the mid-seventies [25] became the PCPD (Plan, Collect, Process, Discuss) cycle.
Emphasis on embedding authentic experience of the statistical investigation process was part of extensive work during the past 2–3 decades by statisticians and statistics educators worldwide in initiating a variety of changes:
in teaching statistics at university, particularly introductory levels across disciplines, and at school level; in professional development workplaces and communities; and in statistics education research.
This work incorporated advocacy of:
Data-driven concepts and statistical thinking; Real, ‘large’ contexts and data, treating simple concepts and procedures within complex; Statistics in its own right; Technological and data systems know-how; Student ownership and constructivism.
The American Statistical Association’s Guidelines for Assessment and Instruction in Statistical Education (GAISE) at school and college levels,
All of the above, updated to reflect technological and data advances, have found their way into data science advocacy. The International Data Science in Schools Project (IDSSP, www.idssp.org) is a consortium of statisticians and computer scientists whose collaboration has produced curriculum frameworks intended to guide the writing, resourcing and teaching of data science curricula at senior school or introductory tertiary levels. PPDAC has been slightly edited in the IDSSP framework and renamed “the basic cycle of learning from data”, without any referencing. Most of Unit 1 of this framework is familiar to leading statistical educators, and the approach and key elements of Unit 1 can be described as:
• Authentic and embedded learning of what has been called the statistical problem-solving process, the statistical/data investigation cycle, and has been renamed the data science learning cycle in Unit 1.
• Authentic original contexts and data, with the simple embedded in the complex, and collecting/accessing data relevant to students’ lives.
• Use of technology as in the
• Multivariable contexts and data, with (again) the simple embedded in the complex and the emphasis on types of variables, and moving to more than two variables/multivariable data as quickly as possible.
• Visualisation and exploration.
• Student-centred learning.
• Hands-on learning of data acquisition and handling in the data-handling pipeline.
• The above applying in all introductory data science learning across disciplines.
The second last bullet point above reflects embedding more of the data science approach relevant to today. Unit 2 moves on to more statistical and data science sophistication.
Data investigation at foundation and introductory levels across disciplines
A challenge in foundation and introductory data science curricula is to ensure computing does not overshadow, or worse dominate, the learning from data. Because both mathematics and computer science serve statistics and data science, it is essential at foundation and introductory levels to ensure the focus is on development of statistical and data science learning. Burr et al. [26] provide invaluable discussion on how to gradually build computational capabilities by stealth. Gould [19] uses the experiences of developing and delivering a secondary data science course that pre-dates the IDSSP, the Mobilize
In describing the IDS Data Cycle (Ask questions, Consider data, Analyse data, Interpret data) as a template for the statistical investigative process, Gould [19] emphasizes the change from the older ‘Collect data’ of codifications such as PPDAC and PCPD to ‘Consider data’. This change was also made in the 2020 revision of GAISE preK-12 [27]. Such emphasis is of paramount importance in today’s statistics and data science, and especially for official statistics. Almost the whole of Rubin’s [28] excellent discussion of the Data Clubs project for young people aged 12–15 is on the value and interest for students in considering how data are generated, as part of a larger paradigm shift from previous eras to today’s statistics and data science.
Asking “who, how, when, where, and why?”, as well as “can we measure what we want to measure?”, has long been fundamental in the practice of statistics, but takes on new significance in education now that technology enables students to access and explore large and complex data sources as well as collect diverse types of data themselves, as described by Gould [19] and Rubin [28]. This is also of core focus in ProCivicStat (
My own experience with student data investigations illustrates how practice changed as technological power increased and gradually became more accessible to students. In the 1980’s, using statistical software in large introductory classes across disciplines, including for health sciences, technology and engineering students, involved considerable logistic challenges. In the first half of the 1990’s, asking “who, how, when, where, and why?”, as well as “can we measure what we want to measure?”, was core in my teaching across disciplines for the real datasets I provided to students.
From 1995 to 2011, I developed, implemented and embedded, with the increasing support of colleagues, student-based statistical data investigations as semester-long projects in parallel with the statistical knowledge and skills development in large (up to 600) introductory courses across disciplines and for statistics and mathematics majors. This strategy could most likely be classed as inquiry (or enquiry) oriented learning (IOL), but is essentially reflecting the practice of ‘greater statistics’. It had some interesting similarities and contrasts with IOL in science [29]. The impact and value of this strategy reached beyond expectations, not just in student learning and attitudes [30], but also in our curricula, teaching materials and resources, assessment, training of tutors and preparation of future statisticians for the workplace [31]. The emphasis started on students collecting their own data on issues of their choice with staff assistance in planning, but as time went on, there were increasing examples of students wishing to investigate data from other sources. The main problems they found were difficulties in accessing raw data and insufficient information on the “who, how, when, where, and why?”. This is what has changed now, with the vast array of technological capabilities in sourcing, accessing, scraping, wrangling and handling data in the official, scientific and public arena. So courses such as mine have needed to, and must continue to, evolve to reflect the practice of statistics and data science.
What is needed
Embedding authentic learning of data investigations
After reading the above, a reaction might be that, provided statistics and data science are recognised as a ‘big tent’ and work together, foundation and introductory learning looks sound, but unfortunately there is much that is not. The ‘emergence’ of data science is an opportunity for the statistical and data sciences to greatly promote the understanding and advocacy outlined above, but there are also significant lessons from the lack of penetration or sustainability of such advances and advocacy. At both school and university levels there needs to be quality information on the realities.
As just one recent example, [32], an international leader in statistics education, based in the US, spoke of doing some research about how countries are addressing statistics in the school curriculum, and asked for information in the Australian context. She mentioned her significant concerns, including that:
“
At the university level, internships, work and clinical placements have long been embedded in professional programs, and across all programs there is now much good work focussing on work-integrated learning (WIL) and capstone projects in the final undergraduate years, replacing the vacation work experience or final year industry projects which received intermittent attention in past eras. In other disciplines, the roles of statistics and data science in such WIL and capstone courses depend critically on the foundations in those disciplines, and it is in the foundation and introductory levels that both general culture and curricula details need attention. This is important for official statistics not only because graduates of different disciplines go on to careers in official statistics but also, and possibly of greater significance, because official statistics works with all of government and increasingly with business and industry, and needs informed citizens with trust in official statistics.
There needs to be greater value placed on statistical teaching expertise at the introductory level, by statistics, data science and all disciplines, accompanied by genuine sustained collaboration with other disciplines. Although teaching materials, resources and curricula details need nuancing for different disciplines, the essentials of statistics and data science foundations are core to all. No matter where statisticians are located in a university, establishing an active, genuine community of practice in teaching statistics and data science enormously benefits efficiency and effectiveness as well as student learning and staff morale and advancement. All universities should facilitate and support such a community.
The principles and practicalities of experiential learning of data investigations, as discussed in Section 3 above, should be embedded in introductory statistics and data science, alongside well-scaffolded development of the relevant knowledge and skills. Beliefs that ‘students won’t do it right’ and ‘it’s not serious enough’ when students are encouraged to choose contexts and issues of interest to them are both misplaced and counter-productive. Fears of assessment workload are also misplaced. Data investigations are best done in groups because, as in the workplace, such investigations benefit from a group approach. Because data investigations both teach and assess the higher order statistical and data capabilities, thinking and usage, other forms of assessment can focus on knowledge and procedures in more easily-marked formats. In addition, staff involvement in advising students on their investigations during computer laboratory work throughout the course builds a natural familiarity with the various investigations. It is also of interest to observe that multiple choice questions tend to be highly dependent on local culture/conditions and are course-specific, but criteria and standards for data investigations tend to be more universal, with exemplars which can be used across institutions and programs.
Curricula needs and cautions
Scrutiny of many introductory statistics textbooks illustrates that attempts to de-mathematicalize earlier introductory books (which were meant to be in mathematical statistics) without sufficient re-thinking of the purpose of the statistical ‘story’, lead to over-focus on new ways of teaching earlier content and not necessarily appropriate sequencing at the expense of developing data investigation skills, such as:
• Data: What? When? How? Limitations?
• Issues: What are we interested in? What can we investigate? What do we need?
• Sources? Quality? Sufficient information? Access? Collect? Design?
• Turn research questions into statistical questions
• Identify variables and cases/subjects
• Do we need a pilot study/experiment?
• What do we need in data handling and preparation: organising, wrangling, checking, transferring, combining, coding, …, preparatory exploring
When students choose what to investigate, explore and source, they are motivated to find tools, they have ownership of data, context and questions; student ownership is the best motivator for learning.
Indeed there is need for developing statistical concepts and tools for exploration, visualisation, and analysis, but the following are needed:
• Framing of issues, identification of variables and understanding their types;
• Advantages and disadvantages of different visualisations, presentations;
• Understand what numerical codes can and cannot do, in order to prevent long-term incorrect use of types of data (unfortunately far too common in certain disciplines), for example,
  ◦ Cannot turn nominal variables into numerical variables;
  ◦ Cannot turn ordinal variables into continuous variables;
• Understand what aggregation is, its advantages and its limitations, from histograms to maps.
• Understand assumptions and how to evaluate assumptions graphically after models are fitted – far too many researchers in other disciplines ignore assumptions and neither use nor understand how to use graphical diagnostics;
• Learning to bring together findings in reporting using qualifications in language and identifying further issues;
• Real data and real contexts but
  ◦ Contexts must not dominate statistical learning;
  ◦ Contexts must be familiar/readily accessible to students;
  ◦ Beware teacher-centred, top-down or context-complex case studies;
• Most importantly,
The above include cautions and indications of how to avoid foundation problems which have become self-perpetuating, particularly in other disciplines if students’ introductory course does not include enough of the above statistical expertise. Below are some more specific cautions and problems which have not yet been tackled:
• fixation with restrictions to one and two variables, no matter what types of procedures are preferred by the instructor;
• isolated, single purpose clean data and questions and instructor-prescribed answer – the simple can be developed within the larger context or data;
• multiple procedures and forcing into discipline norms
  ◦ the classic is the overuse of t;
• rigid, discipline-embedded approaches, top-down case studies, and too much orientation for research training – in any discipline, including statistics.
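The earlier caution that nominal variables cannot be turned into numerical variables can be made concrete with a small sketch. The responses and both codings below are hypothetical, invented purely for illustration; the point is only that a mean of category codes changes when the arbitrary codes are reassigned, whereas the counts do not.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses on mode of travel (illustrative, not real data).
responses = ["bus", "bus", "walk", "walk", "car", "bus", "walk"]

# Two equally arbitrary numerical codings of the same nominal categories.
coding_a = {"bus": 1, "car": 2, "walk": 3}
coding_b = {"walk": 1, "bus": 2, "car": 3}

# "Averaging" the codes gives a number that depends entirely on the coding chosen.
mean_a = mean([coding_a[r] for r in responses])
mean_b = mean([coding_b[r] for r in responses])

# Counts (and hence proportions and the modal categories) do not depend on the coding.
counts = Counter(responses)

print("mean of codes under coding A:", mean_a)
print("mean of codes under coding B:", mean_b)
print("counts by category:", dict(counts))
```

Under coding A the mean of the codes coincides with the code for “car”, which is the least common response in these made-up data, while under coding B it is a different value altogether. The same exercise can be repeated by students with any coded categorical variable in their own data.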
Scrutiny of many textbooks and introductory courses also highlights a very big problem requiring considerable attention, namely the need to reclaim and reform the teaching of probability and probabilistic thinking as integral to the statistical and data sciences. Probability must embed and be embedded in data, language and visualisation. Extensive student experience of the language of probability builds both familiarity and foundational understanding for statistics and data. This is particularly true of conditioning language in which familiarity is essential for understanding risk both as citizens and professionally. Conditional probability should be introduced before the special case of independence, and developed through data and estimates of conditional probabilities as well as through language. There are many examples in real contexts of misunderstanding of conditional probabilities, using inappropriate data for their estimation, and incorrect multiplication of probabilities, sometimes with appalling consequences. The term ‘multiplication rule’ should be banished forever. Tables are of particular importance in official statistics, and data on two and more categorical variables are ideal settings for engaging and invaluable learning and using conditional probabilities, including Bayes foundations, as well as splitting data, confounding and hidden variables.
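As a minimal sketch of estimating conditional probabilities from a two-way table, consider the hypothetical counts below (invented for illustration, not real data), which cross-classify true status by screening test result. The helper names `total` and `proportion` are assumptions introduced here. The sketch shows that the two directions of conditioning give very different values, that multiplying the marginal probabilities does not give the joint probability when the variables are not independent, and that the Bayes relationship can be read directly from the table.

```python
from fractions import Fraction

# Hypothetical counts (illustrative only): (true status, test result) -> count.
table = {
    ("condition", "positive"): 90,
    ("condition", "negative"): 10,
    ("no condition", "positive"): 495,
    ("no condition", "negative"): 9405,
}

def total(table, status=None, result=None):
    """Sum the counts of all cells matching the given status and/or result."""
    return sum(
        n for (s, r), n in table.items()
        if (status is None or s == status) and (result is None or r == result)
    )

def proportion(numerator, denominator):
    """Exact proportion, kept as a fraction so the comparisons below are exact."""
    return Fraction(numerator, denominator)

n = total(table)

# Conditioning on the row: of those with the condition, what proportion test positive?
p_pos_given_cond = proportion(total(table, "condition", "positive"),
                              total(table, status="condition"))

# Conditioning on the column: of those testing positive, what proportion have the condition?
p_cond_given_pos = proportion(total(table, "condition", "positive"),
                              total(table, result="positive"))

# Marginal and joint proportions.
p_cond = proportion(total(table, status="condition"), n)
p_pos = proportion(total(table, result="positive"), n)
p_joint = proportion(total(table, "condition", "positive"), n)

print("P(positive | condition) =", float(p_pos_given_cond))
print("P(condition | positive) =", float(p_cond_given_pos))
print("P(condition) x P(positive) =", float(p_cond * p_pos))
print("P(condition and positive)  =", float(p_joint))
```

The joint proportion equals the conditional proportion multiplied by the proportion in the conditioning group, not the product of the two marginals, which is why conditional probability is the natural starting point and independence a special case. Any two-way table of counts on categorical variables could be substituted for the hypothetical one used here.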
Some points on school contexts
Much of the above also applies at school, gradually and more simply developed and experienced appropriately for the level, but in a slowly evolving coherent statistical ‘story’, with authentic student learning experiences at every stage, so that students own their foundation in probabilistic, statistical and data thinking for citizenship and further learning. There are many characteristics of schooling different to those of universities which must be taken into account. Firstly, an obvious observation: those who should teach statistics and data science as they should be taught are those who have learnt them this way, which demonstrates the importance of sufficient and appropriate statistical and data science learning in pre-service and in-service teacher education.
Comments on school education are too often generalised from the senior school context, but primary schooling has no specialisations, and middle school only some. Although many excellent resources have been developed for school levels, including extra-curricular activities, the big challenges lie in the ‘parcelling up’ of authentic statistical approaches to be embedded within classroom learning, activities, exercises and a diversity of assessments, both formative and, later, summative. No matter where statistics (and now data science) is placed in curricula, we are speaking of a discipline that combines principles and procedures with the nuances of uncertainty and variation. The danger of viewing data science as merely up-scaling of the technology of data-handling is that coding and programming will have the same distorting effects on statistics and learning from data as viewing them through the eyes of other disciplines has had.
There is an urgent need for more involvement of statistical and statistical teaching expertise in all aspects of schooling: curricula, educational authorities, pre-service, in-service, textbooks, assessment. Involvement of such in curricula may have improved, but implementation and sustainability need the full spectrum of involvement, as curricula interpretations, even when not ignored, depend critically on the user’s background and understanding, especially in a discipline such as statistics. Textbooks are a major challenge, as are approaches which emphasize
Conclusion: How official statistics can help
A question from a number of participants at the UNSC 2019 ISI side event was: should there be bachelor degrees in official statistics? University authorities generally tend to be against ‘boutique’ degrees, and especially against ‘boutique’ courses at the introductory level. Official statistics is also just one of an enormous number of possibilities for well-educated, statistically-trained graduates, and it is currently difficult enough, in the face of competition for well-trained statistical graduates and university preoccupations with high end research, to prepare graduates for careers as practicing statisticians. NSO staff are also recruited from diverse degrees to provide strength in breadth in teams, with their subsequent careers emerging from individual capabilities, workplace experience and training. And, as discussed in Section 1, NSO’s have increasingly strong vested interests in citizen statistical and data literacy, and in sufficient statistical capabilities in their many clients and collaborators in and outside government.
As mentioned above, NSO’s are increasingly contributing valuable direct and support assistance in extra-curricular school and tertiary learning, and aim to improve access to official statistics data for educational purposes. There are topics of particular relevance to official statistics which could be strengthened and augmented at the school and tertiary levels. Tables play a major role in official statistics, and recent experiences in work on UNITAR’s MOOC
Another area which receives much attention from the critiquing viewpoint in statistics is sampling, but there is a need for constructive hands-on learning experiences in designing and using well-designed sampling. Michael Bulmer’s
One of the best contributions official statistics can make to improving foundation, introductory and more specialised education in statistics and data science is to use its influence at the senior level, within the ‘big tent’ of statistics and data science, to advocate, promote and emphasize the extent and importance of the many needs outlined in this article. Collaboration and collaborative leadership across all areas of statistics and data science have never been of greater value and significance.
