Abstract
Point of Interest data that is globally available, open access and of good quality is scarce, despite being an important input for research in a number of application areas. New data from the Overture Maps Foundation offers significant potential in this arena, but accessing the data requires computational skills and resources beyond those of the average researcher. In this article, we provide a processed version of the Overture places (POI) dataset for the UK, in a fully queryable format, and provide accompanying code through which to explore the data and generate other national subsets. We describe the construction and characteristics of this new open data product, before evaluating its quality in relation to ISO standards, through direct comparison with Geolytix supermarket data. This dataset can support new and important research projects in a variety of different thematic areas, and foster a network of researchers to further evaluate its advantages and limitations, through validation against other well-established datasets from domains external to retail.
Introduction
Point of Interest (POI) data is an invaluable source of information, acting as a key input to much of the research that has been, and continues to be, generated in urban analytics and city science. These data provide key locational attributes about a broad variety of social, environmental, and economic phenomena, including historical landmarks, parks, hospitals, and retailers, and have been vital sources of data for different applications, including health (Green et al., 2018; Hobbs et al., 2019), urban mobility (Graells-Garrido et al., 2021; Jay et al., 2022), retail and location analysis (Ballantyne et al., 2022a), transportation (Credit, 2018; Owen et al., 2023), and many others. However, a major challenge when working with POI data relates to the coverage and quality of these datasets (Ballantyne et al., 2022a; Zhang and Pfoser, 2019). By this we mean how much the chosen source(s) of POI data restrict the analyses to specific cities or regions (i.e. coverage), and the degree to which they meet well-established criteria of data quality.
Many POI datasets, such as OpenStreetMap, offer a high level of global coverage and availability. However, there are issues when considering the coverage and completeness of OpenStreetMap at finer spatial resolutions. These issues become even more apparent in areas with fewer contributors, in less developed countries, and for economic activities like retail stores (Ballantyne et al., 2022a; Haklay, 2010; Mahabir et al., 2017; Zhang and Pfoser, 2019). Some POI datasets exist which fill this gap, provided by Ordnance Survey, Local Data Company and SafeGraph, but these are often not open access, nor globally or nationally comprehensive (Ballantyne et al., 2022a; Dolega et al., 2021; Haklay, 2010). A useful framework for assessing the quality of these datasets was established in ISO 19157, which provides a mechanism through which to formally assess the quality of spatial data, based on positional and thematic accuracy, completeness, logical consistency, and usability (Fonte, 2017). As a result, there is a clear gap for POI data that can meet these ISO 19157 standards, providing a good quality, openly available source of POIs for the UK. In this article, we introduce readers to a new ‘open data product’ (Arribas-Bel et al., 2021), derived from the processed version of the Overture Maps places (POI) dataset (Overture Maps Foundation, 2023), which arguably provides a strong solution to many of these problems, and can facilitate groundbreaking urban analytics research in a number of different application areas.
Data
The data was accessed through the Overture Maps Foundation, which has developed a number of open data products including Buildings and Places. These have been developed through incorporation of data from multiple sources including Microsoft and Meta, resulting in data products that are available at global scales and contain detailed attributes (Overture Maps Foundation, 2023), including a source attribute which indicates whether each record is sourced from Microsoft or Meta. Users can access the data as parquet files, a ‘column-oriented data format … and modern alternative to CSV files’ (Geoparquet, 2023), which offers greater disk savings and facilitates more efficient querying (Hu et al., 2018). The parquet files can be queried directly from the cloud using Amazon Athena, Microsoft Synapse, or DuckDB, or downloaded locally. However, a specific challenge for urban analytics researchers and city scientists is that the majority will not have the data engineering or ‘quantitative’ skills (Arribas-Bel et al., 2021) to query these datasets from the cloud, and process the attributes in their nested JSON format. Furthermore, for those who want to download the files locally, they can be difficult to work with, as the full global places file is over 200 GB. Therefore, our aim is to provide a processed subset of the Overture places dataset for the UK, which bypasses these issues, and creates an open data product for use in research, which can be updated regularly as newer versions of Overture places become available, and bridges skills and knowledge gaps to open up this dataset to a much wider audience, as in Arribas-Bel et al. (2021).
Overture hosts all data through Amazon Web Services (AWS), which enables a number of query end points to be used to download data subsets. The Overture data schema includes a bounding box structure column to enable efficient spatial SQL queries. To query POI data for the UK, a spatial SQL query was constructed using the DuckDB SQL engine and the UK bounding box, based on EPSG:27700. This query downloaded a GeoPackage file containing all POIs within the UK bounding box, totalling 1.34 GB. This file was then clipped to the administrative boundaries of the United Kingdom, to exclude non-UK places that appeared within the bounding box query. As noted, many of the columns that provide metadata relating to POIs are represented in a nested JSON format (columns containing lists of lists), which are difficult to efficiently parse with traditional tabular data frame libraries. We therefore processed the following columns to ensure the data frame remained two-dimensional: Names, Category, Address and Brand. Following this processing, we spatially joined the 2021 census area geographies for England including Output Areas (OA), Lower layer Super Output Areas (LSOA), Middle layer Super Output Areas (MSOA), and the 2022 Local Authority Districts (LAD). For both Scotland and Northern Ireland, we spatially joined the 2011 Data Zone geographies. We also include the H3 (hexagons) addresses associated with each point for all resolutions between 1 and 9.
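The flattening step described above can be illustrated with a minimal sketch. It is not the processing code used to build the data product: the nested record layout is a simplified assumption about how a single Overture place might look, and all output column names other than category_main and brand_name_value are hypothetical.

```python
def first_value(items, key="value"):
    """Return `key` from the first dict in a list, or None if the list is empty or missing."""
    if not items:
        return None
    return items[0].get(key)


def flatten_place(record):
    """Flatten one nested place record into a single-level dict.

    The nested layout (names/categories/addresses/brand) is an assumed,
    simplified structure used purely for illustration.
    """
    names = record.get("names") or {}
    categories = record.get("categories") or {}
    brand = record.get("brand") or {}
    addresses = record.get("addresses") or []
    address = addresses[0] if addresses else {}
    return {
        "id": record.get("id"),
        "name_value": first_value(names.get("common")),          # hypothetical column name
        "category_main": categories.get("main"),
        "brand_name_value": first_value((brand.get("names") or {}).get("common")),
        "address_freeform": address.get("freeform"),             # hypothetical column name
        "postcode": address.get("postcode"),                     # hypothetical column name
    }


# Illustrative record only; the values are made up for the example.
sample = {
    "id": "example-001",
    "names": {"common": [{"value": "Example Store", "language": "en"}]},
    "categories": {"main": "supermarket", "alternate": ["grocery_store"]},
    "addresses": [{"freeform": "1 Example Street", "postcode": "EX1 1EX"}],
    "brand": {"names": {"common": [{"value": "Example Brand", "language": "en"}]}},
}

flat = flatten_place(sample)
```

Each nested column collapses to one or more scalar columns, so the result can be stored in an ordinary two-dimensional data frame. Records with missing nested fields simply yield None for the corresponding columns.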
The resulting dataset is a 358 MB GeoParquet file, hosted as part of a DagsHub data repository. The repository contains the new open data product which can be downloaded using the link in the supplementary materials (Table i). A list of attributes for the data product can also be found in Table ii, and as a secondary output of this paper, an example workflow for how to extract Overture places for other study areas has also been produced (Table i). Python is the preferred language for utilising our resources, as it enables creation and maintenance of a virtual environment in which to easily replicate our analysis. By providing these workflows and hosting all the materials on DagsHub, this paper enables users to reproduce our analyses, through exploration of the materials stored within our streamlined reproducible research workflow, as in Paez (2021).
Reliability analysis – retail brands
To assess the reliability of our Overture data product, we compared the Overture POIs with the Geolytix Supermarket Retail Points dataset (Geolytix, 2023), which is known to provide reliable information about supermarkets in the UK, through collecting the up-to-date store locations from the retailers themselves (Geolytix, 2023). On this basis, and the fact that this data is used in a wealth of published academic research (e.g. Ilyankou et al., 2023; Long et al., 2023), it was determined that the Geolytix data provides a useful ‘ground-truth’ dataset to validate against. Furthermore, given that Geolytix data is not globally available, and initial comparison with the Overture dataset revealed discrepancies in recording of spatial and non-spatial attributes, it was deemed that the Geolytix dataset represented a suitably independent source to validate against.
In particular, we examined how well Overture represents the Geolytix supermarket data, adopting the key data quality principles outlined in ISO 19157 to formally assess the quality of our new data product. Drawing inspiration from Fonte (2017), we empirically measured the positional and thematic accuracy and completeness of the data product, through consideration of how accurate the POI coordinates were, the presence or absence of thematic tags (i.e. field values), the number of supermarkets absent from our data product, and any biases created by the sourcing of the POIs. The latter is interesting, given existing research into data product biases when derived from different mobile applications (e.g. Ballantyne et al., 2022b), as Overture Maps sources data from different providers including Meta and Microsoft. A detailed description of the methods used to assess the reliability of Overture places can be found in the Supplemental Material (S3).
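To make the completeness and positional accuracy measures concrete, the sketch below shows one way they could be computed. It is not the procedure documented in the Supplemental Material: the nearest-neighbour matching rule, the 100 m threshold, the haversine distance, and the lon/lat field names are all assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def assess(reference, candidates, max_dist_m=100.0):
    """Compare candidate POIs against a reference set of stores.

    A reference store counts as present if its nearest candidate lies within
    `max_dist_m` (an assumed threshold). Returns the share of reference stores
    matched, the mean offset of the matches, and the number left unmatched.
    """
    offsets = []
    for ref in reference:
        nearest = min(
            (haversine_m(ref["lon"], ref["lat"], c["lon"], c["lat"]) for c in candidates),
            default=math.inf,
        )
        if nearest <= max_dist_m:
            offsets.append(nearest)
    completeness = len(offsets) / len(reference) if reference else 0.0
    mean_offset = sum(offsets) / len(offsets) if offsets else None
    return {
        "completeness": completeness,
        "mean_offset_m": mean_offset,
        "missing": len(reference) - len(offsets),
    }
```

In practice the comparison would be restricted to POIs of the same retailer before matching, and a spatial index would replace the brute-force nearest-neighbour search for national-scale data.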
Completeness and positional accuracy of Overture data compared with Geolytix supermarket data.
Completeness and thematic accuracy of Overture compared with Geolytix supermarket data, describing how POIs sourced from different providers (e.g. Meta) exhibit differences in the completeness of the category_main and brand_name_value, when compared across the three retailers. Where values are NA, this indicates that no POIs for that retailer are supplied by Meta or Microsoft.
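A comparison of this kind can be sketched as a per-source share of POIs with a populated value in each thematic field. This is a minimal illustration rather than the method used in the paper; the field name "source" and the rule that empty strings count as missing are assumptions.

```python
from collections import defaultdict


def is_filled(value):
    """Treat only non-empty strings as populated field values."""
    return isinstance(value, str) and value.strip() != ""


def completeness_by_source(pois, fields=("category_main", "brand_name_value")):
    """Share of POIs per source with a populated value in each field.

    `pois` is an iterable of dicts; the "source" key is an assumed name for
    the attribute recording the data provider.
    """
    counts = defaultdict(lambda: {"n": 0, **{f: 0 for f in fields}})
    for poi in pois:
        entry = counts[poi["source"]]
        entry["n"] += 1
        for f in fields:
            if is_filled(poi.get(f)):
                entry[f] += 1
    return {src: {f: e[f] / e["n"] for f in fields} for src, e in counts.items()}
```

A provider that supplies no POIs for a retailer does not appear in the result at all, which corresponds to the NA entries described in the caption.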
Application – mapping supermarkets in the UK
To demonstrate how this dataset can be used, an example workflow has been presented which reads in our new open data product, filters to a specific brand of supermarket, and then maps the distribution of these nationally (Figure 1). The purpose of this workflow is to illustrate how easy it is to work with this dataset, and demonstrate our commitment to reproducible and replicable research (Paez, 2021). Example workflows have been presented for both the Python and R programming languages (Table i), which utilise preferred packages for data manipulation and mapping (e.g. arrow, geopandas).
An example application, mapping Tesco stores across the UK.
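The read, filter and map steps of such a workflow might look like the following in Python. This is a hedged sketch rather than the workflow published in the repository: the file name uk_places.parquet is a placeholder, the choice of brand_name_value as the filter column is an assumption, and geopandas (with matplotlib) is assumed to be installed.

```python
def brand_mask(values, brand):
    """Boolean list marking entries equal to `brand`, ignoring case and missing values."""
    target = brand.casefold()
    return [isinstance(v, str) and v.casefold() == target for v in values]


def map_brand(path, brand, column="brand_name_value"):
    """Read a GeoParquet file, keep one brand, and plot its locations.

    `path` and `column` are assumptions about the local file and its schema.
    """
    import geopandas as gpd  # imported here so brand_mask works without geopandas

    places = gpd.read_parquet(path)
    subset = places[brand_mask(places[column], brand)]
    ax = subset.plot(markersize=1)
    ax.set_axis_off()
    ax.set_title(f"{brand} stores ({len(subset)} POIs)")
    return ax


if __name__ == "__main__":
    # Placeholder path; replace with the downloaded data product.
    map_brand("uk_places.parquet", "Tesco")
```

Keeping the filter in a small pure function makes it easy to check the matching rule before applying it to the full national file.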
Conclusion
This paper presents a new open data product, which represents a processed UK national subset of the Overture places database. This new data product opens up data from Overture to a wider audience, facilitating analysis of new dimensions of human geographical processes, as in Arribas-Bel et al. (2021). The potential applications of this data product in a variety of different fields are highly significant (e.g. urban accessibility), given the evidence and considerations presented about the coverage and quality of this new data product. Furthermore, we are committed to updating this data product every 6 months and hosting these updates as data products on the Consumer Data Research Centre, enabling users to benefit from updates and new POIs that become available from within the higher-level Overture database. At a time when the retail sector is undergoing significant transformations in response to the cost-of-living crisis, such data can provide invaluable insights about the characteristics and performance of the sector (Ballantyne et al., 2022a, 2022b; Dolega et al., 2021), which has historically been a challenge due to the availability of suitable retailer data.
However, there are inherent limitations to our data product, which have been illustrated through direct comparison with Geolytix data. Users need to be cautious about how they are using this data, especially when the POIs they are using are largely sourced from Microsoft. Furthermore, given the ambiguity about how Overture Maps assembles its data, there is scope for data quality assessments using external datasets whose licences forbid their incorporation into Overture Places (e.g. Ordnance Survey), incorporating alternative metrics for ISO criteria like positional accuracy, with the aim of adding further credibility to the quality assessment we present in this paper. Limitations aside, it is our hope that by releasing this data into the open domain, a network of researchers will be fostered who can utilise this data for their own research questions, and critically evaluate how the Overture places database represents a variety of different social, economic, and environmental activities, through rigorous data quality assessments utilising established POI datasets in other (non-retail) domains.
Supplemental Material
Supplemental Material for Overture POI data for the United Kingdom: A comprehensive, queryable open data product, validated against Geolytix supermarket data by Patrick Ballantyne and Cillian Berragan in Environment and Planning B: Urban Analytics and City Science.
Footnotes
Acknowledgements
We would like to thank Geolytix for making the Supermarket Retail Points dataset openly available. We would also like to extend our thanks to the editor and three anonymous reviewers for their careful and considered feedback on the manuscript, and for providing an exciting outlet in which to publish open data products.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Data availability statement
This open data product is available to download from the Consumer Data Research Centre: https://data.cdrc.ac.uk/dataset/point-interest-data-united-kingdom. The version hosted on the CDRC is the most up-to-date version queried directly from Overture, and as such will vary from the statistics and figures presented in the paper. The DagsHub repository, which stores the features used as part of the anonymous peer review process (e.g. data product, code) is also available to view at:
. This version of the open data product matches that discussed in the paper.
Supplemental Material
Supplemental material for this article is available online.
References
