Abstract
This paper develops an improved method for estimating the ethnicity of individuals based on individual level pairings of given and family names. It builds upon previous research by using a global database of names from c. 1.7 billion living individuals, supplemented by individual level historical census data. In focusing upon Great Britain, these resources enable, respectively, greater precision in estimating probable global origins and better estimation of self-identification amongst long-established family groups such as the Irish Diaspora. We report on geographic issues in adjusting the weighting of groups that are systematically under- or over-predicted using other methods. Our individual level estimates are evaluated using both small area Great Britain census data for 2011 and individual level data for asylum seekers in Canada between 1995 and 2012. Our conclusions assess the value of such estimates in the conduct of social equity audits and in depicting the social mobility outcomes of residential mobility and migration across Great Britain.
Introduction
Ethnicity is a salient characteristic of individual identity. Of relevance to regional science, it has underpinned research into residential differentiation and social segregation (e.g. Finney and Simpson 2009; Lan et al. 2020), labour market recruitment (Yemane and Fernández-Reino 2021), inter-generational social mobility (Clark and Cummins 2015), innovation processes (Wilson et al. 2018), and health outcomes (Petersen et al. 2021). It is also of policy interest to provide timely inter-census estimates of population characteristics (Office for National Statistics 2017), as demonstrated during the 2020 COVID pandemic and following Brexit. Related work has documented the correspondence between individual naming practices and ethnicity, and consequently, the ways in which given (forename) and family (sur-)names may be used to indicate ethnicity (Mateos et al. 2009; Parameshwaran and Engzell 2015). As such, names-based classification of ethnicity is of wide applicability to many issues of relevance to regional scientists in studies of migration, urban structure and regional functioning – issues that we return to in our conclusions.
Names-based ethnic classification methods typically develop algorithms to identify significant forename-surname associations and assign labels to the resulting cultural, ethnic, and linguistic groups at different levels of aggregation. A recent development of these approaches is the Ethnicity Estimator software (Kandt and Longley, 2018) that was developed in collaboration with the Office for National Statistics (ONS). A novel aspect of this latter approach is the evaluation of estimates with respect to survey respondent self-identifications: such procedures are of particular value where names span different ethnic groups (as with members of the Black Caribbean and White British UK Census groups) or where long-settled groups may no longer identify with their ancestral origins (as with some White Irish individuals in Britain). Kandt and Longley’s (2018) software and the derivative small area estimates of annual changes in local ethnic group composition have been used in circa 60 research projects to date (CDRC, personal communication). The free availability of this classification software for research purposes and the peer-reviewed documentation of its predictive success mark this software as a basis for the further evaluation and improvements developed in this paper.
Comparison of Adult Population (16+) Breakdown by Ethnic Groups Predicted by Applying Kandt and Longley’s (2018) Estimator to the 2011 Linked Consumer Register (LCR) for Great Britain.
Our research objectives are to improve or refine estimates of membership of: (a) the long-established White British majority population that was actually present in the 19th century; (b) the long-established White Irish population that continues to identify with this group; (c) the Black Caribbean population that shares naming conventions with white ethnic groups; (d) groups originating in the Indian sub-continent; and (e) the ‘catch all’ Black African, Black Caribbean, White Other and Other Asian groups, which may be attributed to particular countries that confer quite different circumstances upon migrants from them. Details of the development of the resulting software, Onomap3, and the SQL code used to build it are held on the cdrc.ac.uk website and can be accessed for research purposes upon successful application.
Data Sources
Our approach is to use the near-complete Linked Consumer Register (LCR) of all adult individual names and addresses in Great Britain in 2011 (see Lansley et al. 2019; Van Dijk et al. 2021) as a frame to estimate ethnicities. The 2011 LCR provides an annual snapshot of the UK adult population created and curated by the ESRC Consumer Data Research Centre (CDRC), as part of a corpus of such data initially covering the period 1997-2016. The LCRs are individual level data compiled from the public version of the UK Electoral Register and other consumer data sources. Lansley et al. (2019) describe the data cleaning, triangulation, imputation and validation processes that are intrinsic to their creation: the 2011 LCR is documented to have similar numbers of adults compared with those recorded in the Census across a range of census geographies.
Here, we estimate the ethnicity of every individual on the 2011 LCR. By georeferencing each record we are then able to compare our estimates with Census figures for the same year at the level of the Lower layer Super Output Area (LSOA, a small area geography in England and Wales with a typical population of 1500). We use these initial results to adjust the weights assigned to forenames and surnames for different ethnic groups. For the specific case of the White Irish population, we also refer to individual level 1881 Census records to evaluate the merit of deeming a contemporary bearer to self-identify with the ‘White Irish’ Census category. The digitised versions of the GB Censuses for 1851-1911 are curated by the I-CeM project (Higgs and Schurer 2019), and individual level records including names, addresses and birthplaces were made available to us by the UK Data Service under special licence. We use the individual level data for 1881, based on our exploratory findings that the data capture process for this year appears to have been particularly effective.
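The small area comparison described above can be illustrated with a minimal sketch. It assumes that each georeferenced record has been reduced to an (LSOA code, predicted group) pair and that Census counts are held per LSOA; the function names, LSOA codes and counts are hypothetical and are not taken from the LCR or the Census.

```python
from collections import Counter


def lsoa_counts(records, group):
    """Count the records predicted as `group` in each LSOA."""
    return Counter(lsoa for lsoa, predicted in records if predicted == group)


def compare_with_census(records, census_counts, group):
    """Return {lsoa: (predicted, census, predicted - census)} for one group."""
    predicted = lsoa_counts(records, group)
    return {
        lsoa: (predicted.get(lsoa, 0), census, predicted.get(lsoa, 0) - census)
        for lsoa, census in census_counts.items()
    }


# Hypothetical illustration data, not real LCR or Census figures.
records = [
    ("LSOA_A", "WIR"),
    ("LSOA_A", "WBR"),
    ("LSOA_A", "WIR"),
    ("LSOA_B", "WBR"),
]
census_wir = {"LSOA_A": 3, "LSOA_B": 1}
comparison = compare_with_census(records, census_wir, "WIR")
```

A negative difference flags an LSOA in which the group is under-predicted relative to the Census, which is the kind of signal used to adjust the name weights.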
We also use the WorldNames2 (WN2) database that arises from an ongoing project to assemble a representative range of forenames and surnames for every country of the world. O’Brien and Longley (2018) detail the various sources used, including public electoral registers, telephone directories and professional or school registers. The database currently comprises circa 1.7 billion individuals’ names, or about one fifth of the world’s population (calculated based on 7.9 billion according to the UN estimates as of 2021), each with country attribution. Based on the sampled names in each country and the country's total population, the WN2 database derives the frequency per million (FPM) of each family name and the estimated size of its bearer population.
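The FPM derivation can be sketched as follows. This is a minimal illustration, assuming that the FPM is the sampled count of a name scaled to a sample of one million and that the estimated bearer population is that rate applied to the country's total population; the function names and figures are hypothetical and the WN2 implementation may differ.

```python
def frequency_per_million(name_count, sample_size):
    """Occurrences of a name per million sampled individuals in one country."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return name_count * 1_000_000 / sample_size


def estimated_bearers(fpm, country_population):
    """Scale a frequency per million up to the country's total population."""
    return fpm * country_population / 1_000_000


# Hypothetical figures: 250 bearers in a sample of 500,000 people,
# drawn from a country with a total population of 10 million.
fpm = frequency_per_million(name_count=250, sample_size=500_000)
bearers = estimated_bearers(fpm, country_population=10_000_000)
```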
Aggregate 2011 Census adult population counts classified into 11 ethnicity categories (listed with their abbreviations in Table 1) provide a benchmark for evaluation of the ethnicity estimates developed using the LCRs. The ethnicity categorisations recorded in the 2011 Census questionnaires differ slightly between the different constituent countries of the UK but can be harmonised into the 11 categories. Table 1 also compares the GB population breakdown by ethnic groups estimated by applying Kandt and Longley’s publicly available software to the 2011 LCR and the corresponding 2011 Census figures. Both over-estimation and under-estimation are observed amongst the LCR group assignments.
Methods and Outcomes of Reassignments or Enhancements
The 2011 classifications of ethnicity used by the UK ONS are the outcome of extensive consultation with stakeholders with regard to the end uses of statistical sources so classified (Office for National Statistics 2009), which is reflected in the subtle variations among the ethnic categories adopted by Northern Ireland, Scotland, and England and Wales. The outcome is, inevitably, a snapshot of policy concerns that resonate with the governments of the constituent countries of the United Kingdom. The resultant classes also manifest a long sweep of British history that accommodates Irish and New Commonwealth migration, but not the specific consequences of successive EU enlargements during the UK’s period of EU membership or refugee migration. Our dual purpose is to improve the efficacy of Kandt and Longley’s assignments to the harmonised classes used in Table 1 while also extending it to differentiate between other nations, membership of which might also affect the circumstances of migrants to Britain.
As such, our aim is to extend the granularity of ethnic classification while also retaining sensitivity to the issues of self-identification developed in Kandt and Longley’s (2018) work. We use their Ethnicity Estimator (EE) as a baseline model for our proposed improvements and extensions. The core process of the EE, summarised in equation (1), is to assign each forename-surname pairing a probability of assignment to each of the Census ethnic categories
In developing and extending this approach to classify Great Britain residents, we use additional individuals’ names obtained from the 1881 Great Britain Census and from WN2. We validate the results using aggregate 2011 Census small area statistics for the same year as the 2011 LCR. Ethnicity classification of the 2011 LCR follows a chronology of steps (see Table 1 for abbreviations used), for reasons set out in our discussion below:
1) The EE classifications are assigned as provisional estimates.
2) Family names classified as White British (WBR) but that are not recorded at all in the 1881 Great Britain Census are reassigned to their second highest predicted category amongst the remaining 10 census ethnic groups.
3) Individuals classified as WBR or White Irish (WIR) are then pooled. Reassignments between them are made using Bayes’ Theorem and WN2 data as detailed below.
4) Individuals classified as Asian Indian (AIN), Asian Pakistani (APK) and Other Asian (AAO) are pooled and reassigned using re-weightings as detailed below.
5) Individuals classified as Black Caribbean (BCA), WBR or All Other (OXX) are pooled and reassigned using rules as detailed below.
6) WN2 data are used to assign most probable countries to records assigned to the AAO, BAF, BCA and WAO groups.
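Step 2 of this sequence can be sketched in a few lines. The sketch assumes that the EE output for a name pair is available as a mapping from Census group to probability and that the 1881 Census surnames are held as an upper-case set; the function name, the example surnames and the probabilities are hypothetical.

```python
def reassign_unattested_wbr(surname, probabilities, surnames_1881):
    """Return the group after applying the 1881 Census filter.

    probabilities: dict mapping each Census group to the EE probability
    for this forename-surname pairing (at least two groups).
    surnames_1881: set of upper-case surnames recorded in the 1881 Census.
    """
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    top_group = ranked[0]
    if top_group == "WBR" and surname.upper() not in surnames_1881:
        # The surname is absent from the 1881 Census, so fall back to the
        # highest-ranked group among the remaining Census categories.
        return ranked[1]
    return top_group


# Hypothetical illustration data.
surnames_1881 = {"SMITH", "LONGLEY"}
ee_probabilities = {"WBR": 0.6, "WAO": 0.3, "OXX": 0.1}
kept = reassign_unattested_wbr("Smith", ee_probabilities, surnames_1881)
moved = reassign_unattested_wbr("Kowalczyk", ee_probabilities, surnames_1881)
```

Records whose provisional group is not WBR pass through unchanged, so the filter only ever reduces the size of the White British group.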
The White British and White Irish Groups
Kandt and Longley (2018) identify the WIR group as systematically under-estimated, attributing this to self-identification of descendants of previous generations of Irish migrants with the WBR group. We take the explicit decision to define WIR in terms of being long settled in the Irish Republic and WBR as conveying establishment in the United Kingdom. Our approach to accommodating this tendency is threefold: (a) we constrain WBR assignments by filtering out family names not present in the 1881 Great Britain Census; (b) we adjust the forename and surname relative probabilities
Reassigning White British names
Reassignment of the ‘White British’ Predicted in the Previous Step With Family Names With No Bearers in the 1881 Census.
Adjusting the name-ethnicity lookup tables
Conditional Probability of Belonging to the WBR or WIR Group According to Bayes' Theorem, Using the Name ‘James’ as an Example.
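The two-group calculation can be written out in general form. The expression below is a standard statement of Bayes' Theorem for a pooled WBR/WIR record bearing a name n, not a reproduction of the exact quantities tabulated for ‘James’; reading the likelihoods P(n | g) as name frequencies drawn from WN2 and the priors P(g) as the relative sizes of the two groups is an assumption made here for illustration.

```latex
P(\mathrm{WIR} \mid n)
  = \frac{P(n \mid \mathrm{WIR})\, P(\mathrm{WIR})}
         {P(n \mid \mathrm{WIR})\, P(\mathrm{WIR}) + P(n \mid \mathrm{WBR})\, P(\mathrm{WBR})},
\qquad
P(\mathrm{WBR} \mid n) = 1 - P(\mathrm{WIR} \mid n)
```

Under this reading, a name that is common in both groups moves towards WIR only if its relative frequency among the Irish population outweighs the much larger prior for WBR.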
Tuning the weighting factors
In equation (1), the original EE adopts equally weighted contributions from a forename and a surname (
Figure 1 suggests that surname weight 0.84 gives the closest predictions to the Census. Table 4 presents the transition matrix of the reassignment between the WBR and WIR after the lookup table adjustments with the selected surname weight 0.84. Together with the reassignment to WIR in the previous step, we predict 546,743 White Irish at this stage, which accounts for 99% of the 2011 Census observations. Figure 2 shows the observed and estimated 2011 populations of White Irish by LSOA, where our method correctly picks up the concentration of Irish in urban areas such as London, Birmingham, Liverpool, Manchester, and Glasgow, albeit with modest underestimation. The choice of weight is finely balanced, because a single global value must balance prediction success in rural and urban areas: in particular, the sensitivity analysis shows that rural Scottish WBR names bear more than passing similarities to urban WIR ones.
The predicted numbers of WIR in the 2011 LCR using different surname weights, compared to the 2011 Census observation.
Confusion Matrix of the WBR and WIR Populations From EE Prediction (Rows) and the Outcomes of Reassignment Between White British and White Irish (Columns), Using the Surname Weight 0.84 After the Lookup Table Adjustments.
Distributions of White Irish by LSOAs from the (a) 2011 Census and (b) 2011 LCR.
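The weight selection can be illustrated with a minimal sketch. It assumes that the forename and surname contributions are blended linearly, that a pooled record is assigned to WIR when the blended probability exceeds 0.5, and that the preferred weight is the candidate whose predicted WIR total lies closest to the Census total; these are illustrative assumptions rather than the published procedure, and all names and figures are hypothetical.

```python
def combined_probability(p_forename, p_surname, surname_weight):
    """Linear blend of forename- and surname-based probabilities (assumed form)."""
    return surname_weight * p_surname + (1.0 - surname_weight) * p_forename


def predicted_wir_total(people, surname_weight):
    """Count pooled WBR/WIR records whose blended WIR probability exceeds 0.5.

    people: list of (p_wir_given_forename, p_wir_given_surname) tuples.
    """
    return sum(
        1
        for p_forename, p_surname in people
        if combined_probability(p_forename, p_surname, surname_weight) > 0.5
    )


def best_surname_weight(people, census_total, candidate_weights):
    """Pick the candidate weight whose predicted total is closest to the Census."""
    return min(
        candidate_weights,
        key=lambda w: abs(predicted_wir_total(people, w) - census_total),
    )


# Hypothetical probabilities for four pooled records and a Census total of 1.
people = [(0.2, 0.9), (0.1, 0.6), (0.8, 0.3), (0.4, 0.4)]
best = best_surname_weight(people, census_total=1, candidate_weights=[0.5, 0.7, 0.9])
```

A sweep of this kind matches only the national total; the rural and urban trade-off noted above would require the same comparison to be repeated at LSOA level.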

Indian Sub-continent and Other Asian Groups
Countries and Codes Identified as Belonging to the Any Other Asians Group.
Predicted Populations of the Four Groups Using Different Surname Weights, Compared With the GB Census Totals.
Confusion Matrix Between the Group Populations From EE (Rows) and the Outcomes of Reassignment Among AIN, APK and AAO (Columns).

Distributions of the APK group by LSOAs from the (a) 2011 Census and (b) 2011 LCR.

Distributions of the AAO group by LSOAs from the (a) 2011 Census and (b) 2011 LCR.

Distributions of the AIN group by LSOAs from the (a) 2011 Census and (b) 2011 LCR.
Black Caribbean Groups
Caribbean Countries With British Colonial Histories (including Current British Overseas Territories) Used in the Analysis.
After experimentation and sensitivity analysis, we alight upon a multiplicative index to measure the likelihood of a name being assigned to the BCA group (equation (2)). The first component of the index records how many times more popular a forename is in the Caribbean than in the UK. The second component records the corresponding multiplier for a surname. The product of the two terms is used as an indicator of the likelihood of belonging to the Black Caribbean group. Making use of the index, Figure 6 illustrates the logic of assigning possible ‘WBR’ and ‘OXX’ to ‘BCA’. For those who are classified as WBR, BCA and OXX, their multiplicative indices are calculated and compared with different empirical thresholds: 1.5 for ‘BCA’, 4.9 for ‘WBR’ and 15 for ‘OXX’. The outcomes determine whether the original classifications are retained or they are reassigned to another group among BCA, WBR and OXX.
The workflow of assigning possible ‘WBR’ and ‘OXX’ to ‘BCA’ based on forename and surname index scores.
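The index and threshold logic can be sketched as follows. The thresholds are those quoted above; everything else is an assumption for illustration, including the use of frequencies per million as the popularity measure, the use of "greater than or equal to" in the comparison, and in particular the `fallback_group` applied to provisional BCA records that fall below their threshold, since the workflow figure is not reproduced here.

```python
# Empirical thresholds quoted in the text, keyed by the provisional group.
BCA_THRESHOLDS = {"BCA": 1.5, "WBR": 4.9, "OXX": 15.0}


def popularity_ratio(fpm_caribbean, fpm_uk):
    """How many times more popular a name is in the Caribbean than in the UK.

    Assumes a non-zero UK frequency.
    """
    return fpm_caribbean / fpm_uk


def multiplicative_index(forename_car, forename_uk, surname_car, surname_uk):
    """Product of the forename and surname popularity ratios."""
    return popularity_ratio(forename_car, forename_uk) * popularity_ratio(
        surname_car, surname_uk
    )


def reassign_bca(current_group, index, fallback_group="WBR"):
    """Apply the group-specific threshold to a pooled record.

    fallback_group is a hypothetical choice for provisional BCA records whose
    index falls below the BCA threshold.
    """
    threshold = BCA_THRESHOLDS.get(current_group)
    if threshold is None:
        return current_group  # groups outside the pool are left untouched
    if index >= threshold:
        return "BCA"
    return fallback_group if current_group == "BCA" else current_group


# Hypothetical frequencies per million for one forename-surname pairing.
index = multiplicative_index(30.0, 10.0, 20.0, 10.0)
```

The rising thresholds mean that a provisional WBR or OXX record needs progressively stronger Caribbean evidence before it is moved than a provisional BCA record needs in order to be retained.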

Confusion Matrix Between the Group Populations From EE (Rows) and the Outcomes of Reassignment Among BCA, WBR and OXX (Columns).
*Note: The 1793 WIR estimated by EE are reassigned to WBR in the previous steps but are returned to the BCA group in this step.

Distributions of the BCA group by LSOAs from the (a) 2011 Census and (b) 2011 LCR.
Summary of Reassignments
Confusion Matrix Between the Sizes of the GB Population From the Original EE (Rows) and the Adjusted Estimates With all Adjustments of the WIR, AIN, ABD, APK, BCA, WBR and OXX (Columns).
Comparison of the Predicted Population Sizes Between the EE and Adjusted Estimates, Retaining GB 2011 Census Figures for Comparison.
However, the flows of individuals from over-represented to under-represented groups are very encouraging, as shown in Table 11. The 235,088 increase in the size of the White Irish group improves capture of WIR estimates from 54% to 99% of the recorded Census total, achieved by transfers from the over-represented White British majority group. For the Black Caribbean group, the corresponding ratio increases from 54% to 76%, with most transfers (213,569) from the White British group. Changes in the predictions of the Indian sub-continent groups are more mixed. The underestimated AAO group is improved from 30% to 91%. The overestimation of the Pakistani group is reduced from 151% to 99%, while the overestimation of the Indian group is slightly increased from 115% to 116%. Referring to Table 7, the biggest outflows from APK (291,641) and AIN (166,523) are transferred to the under-estimated Other Asian group – the size of which increases substantially. The improvement of the catch all Other (OXX) is a by-product of other reassignments. Apart from the BCA group, OXX has no outflows but increases in size following other reassignments such as the requirement that WBR names appear in the 1881 Census.
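Flows of this kind can be tabulated directly from paired labels. The sketch below is a minimal illustration, assuming that each record carries its original EE group and its adjusted group; the function names and the five example records are hypothetical.

```python
from collections import Counter


def transition_matrix(before, after):
    """Count records moving from each original group to each adjusted group."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    return Counter(zip(before, after))


def net_change(matrix, group):
    """Inflows to `group` minus outflows from it, ignoring retained records."""
    inflow = sum(n for (src, dst), n in matrix.items() if dst == group and src != group)
    outflow = sum(n for (src, dst), n in matrix.items() if src == group and dst != group)
    return inflow - outflow


# Hypothetical labels for five records before and after adjustment.
before = ["WBR", "WBR", "WBR", "WIR", "APK"]
after = ["WBR", "WIR", "BCA", "WIR", "AAO"]
matrix = transition_matrix(before, after)
wir_change = net_change(matrix, "WIR")
wbr_change = net_change(matrix, "WBR")
```

The off-diagonal cells correspond to the reassignment flows discussed above, and the net change per group is what alters each group's ratio to its Census total.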
Enhanced Estimation of Countries of Origin
Census categories such as the White Other Group (WAO) have been agreed by the ONS over time through consultation for policy purposes and they inevitably cannot include all groups. Blanket categorisation masks within group variation, potentially straining any assumption of within group homogeneity in research applications: for example, study of UK residential segregation (e.g. Lan et al. 2021) would likely benefit were it possible to differentiate between different groups within the ONS ‘catch all’ categories. We therefore use the WN2 data to apportion the WAO, AAO, BAF and BCA categories to probable countries of ancestral origins.
We evaluate each name pair’s relative probabilities of assignment to a specific country using similar procedures to those underpinning equation (1). We replace the name-ethnicity lookup probabilities
Examples of the Largest Populations in the 2011 LCR by Country of Origin Within Each of the AAO, BAF, BCA and WAO Census Groups.

The distribution of Polish residents in London estimated from the 2011 LCR.
Validation and Discussion
Confusion Matrix Between Our Predictions for Canadian Asylum Seeker Data and Manually Coded Ethnicity Groups Based on Stated Country of Origin.
We have mixed reflections on these results. Migration and asylum-seeking are heavily selective, and the phenomenon of chain migration likely renders the dataset very noisy. Asylum seekers may be more likely to be of mixed heritage (best represented by the OXX category), something that names-based classification finds very difficult to discern. Asylum seekers may perceive that identification with white groups increases their chances of success: our prediction of many ‘Other Asian’ group members as ‘White Other’ provides a prominent example. There are also ambiguities in the assignment of countries to EE groups, such as classifying South African asylum seekers uniformly as ‘Black African’.
In some respects, data pertaining to Canadian asylum seekers present an unreasonable challenge: the ONS ethnicity classification is designed to fulfil UK needs and the prominence of the White British and White Irish groups is an irrelevant distraction in this context. In the global context, our enhancements to predictions of origins within the Indian sub-continent appear to be robust. But in other instances, the results confirm global challenges to names classification, with the inherent ambiguity of Black Caribbean names presenting a prominent example. Our own analysis of geographic variation in prediction success within Great Britain also testifies that this problem occurs across different geographic scales, and it may also be affected by changing fashions for particular forenames.
Conclusion
Issues of ethnicity underpin our understanding of population diversity and the regional patterning of population characteristics in the wake of recent and historic waves of migration. Elsewhere (Longley et al. 2021) we have argued that regional origins in ‘Old World’ countries have enduring inter-generational consequences for social mobility outcomes, and one of our motivations for improving the efficacy of names-based classification is to describe and evaluate the relative social circumstances of citizens who can trace their origins through any of a succession of waves of migration to the UK. As such, the creation of Onomap3 has several methodological and substantive touchpoints with research previously reported in this journal, as well as for regional science investigations more generally. Most fundamentally, the work is consistent with the view that data pertaining to human individuals, rather than aggregations of them, provide the most secure foundations for regional analysis. The advent of new sources of georeferenced data at highly disaggregate scales (Longley et al. 2018) enables new methods of conducting migration research that go far beyond early aggregate formulations in regional analysis (Greenwood and Hunt 2003). It also has potential implications for the conduct of input-output analysis (Miller and Blair 1981). Such detail and flexibility enable a much more robust and transparent definition of the urban structures that are arranged in urban hierarchies (Broitman et al. 2020), while names-based classifications enable the variegated social mixing of established populations and more recent migrants to be described and analysed (Lan et al. 2021). Our use of asylum seekers to validate the research is integral to the case for using names to identify and appraise migrant characteristics in regional analysis more generally (e.g. Lozano-Gracia et al. 2010).
In other respects, names-based classification is of strategic importance in synthesising data that are not routinely collected. Ethnicity is a sensitive personal characteristic under the General Data Protection Regulation (GDPR), and our experience is that names classifications become essential when data collection about ethnicity has not been considered proportionate in service delivery, but subsequently becomes essential in unforeseen social equity audits or health care studies. Our own involvement in auditing the rehousing decisions made following the Grenfell Tower disaster and in evaluating hospitalisation outcomes during the COVID-19 pandemic (Thomas et al. 2021) provides prominent examples. In future, the development of trusted research environments (TREs, see Chalstrey 2021) may provide data linkage solutions, but in the meantime, names-based classification provides the only expedient solution, particularly in emergency situations.
In methodological terms, the research reported here provides several lessons to guide this quest. It is widely understood that the heterogeneity of ethnic groups varies geographically, and our work highlights that names-based classification should be cognisant of context: our prediction success is better for Great Britain – the territory for which it was intended – than Canada, yet this focus allows issues of self-assignment in particular cultural contexts to be incorporated, analysed and evaluated. The WN2 data present global evidence of the need to reweight the relative importance of forenames and surnames for some origin jurisdictions and we acknowledge that there is scope for further empirical refinement of the procedures developed here. Our sensitivity analysis and evaluation of results rely upon visual interpretation of mapped results alongside aggregate numerical comparisons. This approach might be supplemented in future research by the use of optimisation criteria and weightings to prioritise assignments (or ‘near misses’) to particular groups of interest. Future research might also address issues arising from transliteration of names (O’Brien and Longley 2018), homonymic family names, the mutation of family names over time and following migration over space, and cultural practices in assembling unique forenames or surnames.
Our approach is guided by the virtue of retaining self-assignments of census respondents in England and Wales while expanding and future-proofing the dictionary of names to include current popular forenames as well as new names imported into Britain from abroad. The classification is thus data led but also guided by GB cultural conventions. Issues of self-assignment may reinforce apparent inequalities of outcome or (as in COVID-19) set researchers on a search for physiological sources to societal problems. Yet our own view is that these issues are best addressed through classifications that are robust, transparent and open to scrutiny and that evaluations such as ours are instructive to minimise risks of misuse or misinterpretation.
Our own motivation for this work is to develop tools to understand the processes that underpin inter-generational inequalities of social mobility outcomes in Great Britain, at geographical and ethnic granularities that range from the effects of local ancestral origins of long-established populations through the inter-generational outcomes experienced by Irish migrants through to the outcomes of global migration in the 20th and 21st centuries. We intend this paper as a contribution to justify the approaches we are taking in this endeavour but hope that it stimulates wider debate about the value and veracity of names-based classification in the widest range of investigations into issues of social equity.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Engineering and Physical Sciences Research Council [EP/M023583/1]; Economic and Social Research Council [ES/L011840/1].
