Prediction of residential and non-residential building usage in Germany based on a novel nationwide reference data set

Abstract

Building usage is an important variable in modelling the energetic, material and social properties of a building stock. Gathering this data on a large geographical scale and at the necessary temporal and spatial resolution, that is, at building level, is a challenging task. Machine learning algorithms like Random Forest have proven useful in predicting building-related features in the past, but they often rely on training sets of limited geographic scope, for example, individual cities. This study presents a workflow for predicting the semantic attribute of usage at the level of individual buildings. Based on screening data of the previous ENOB:dataNWG project, a novel building ground-truth data set distributed across Germany, a Random Forest algorithm is used to assess how the German building stock can be classified according to its residential or non-residential use. Different sampling strategies were applied in order to find a robust evaluation metric for the classifier. Furthermore, the relevance of the feature set is highlighted and it is examined whether regional differences in classification quality exist. Results show that a classification of residential and non-residential building footprints has good prospects, with an AUC of up to 0.9.

Keywords

building stock, building usage, classification, machine learning, Random Forest classifier, feature importance, spatial cross-validation

Introduction

Motivation

Knowledge about the structure and dynamics of the building stock is an important prerequisite for taking measures that promote a socially just and resource-efficient design of the built environment (Creutzig et al., 2016; Zhu et al., 2019). Buildings and their spatial distribution have been shown to be an important proxy for estimating population figures and studying urbanization processes (Kunze and Hecht, 2015; Biljecki et al., 2016; Tomás et al., 2016; Sturrock et al., 2018; Wardrop et al., 2018; Jochem et al., 2020; Schug et al., 2021). Furthermore, energetic retrofitting of the building stock, its share of anthropogenic material stocks and its vulnerability to natural hazards are topics that have become more important in recent years. Besides the geometric properties and age of buildings, their use is an important, yet still often unknown, variable for analyses in these fields (Pauliuk and Müller, 2014; Cheng et al., 2018; Dabbeek and Silva, 2020; Haberl et al., 2021).

The amount of geometric data about buildings has grown in the past decades, both in coverage and in level of detail. A variety of data sources (e.g. official statistics like census, cadastre, Volunteered Geographic Information (VGI), remote sensing) and models (tabular, 2D vector, 3D, raster) come into consideration for analysis purposes, and often the merging and integration of these data is necessary (Hecht et al., 2019; Evans et al., 2019; Schug et al., 2021). The challenge is to achieve results at the finest possible resolution, ideally at building level. Appropriate standardization for building modelling has been advanced with CityGML, and National Mapping and Cadastral Agencies offer products in the form of 2D building footprints and 3D building models (e.g. OS MasterMap Topography Layer® (UK), 3D building models LoD1 (Germany)). Nevertheless, the situation is very uneven from country to country, especially at global scale. According to the Global Open Data Index, only 10% of the surveyed countries had freely available small-scale national maps in 2016 (GODI, 2016). It must be assumed that the share of countries that can provide building data is even lower. In order to increase the global coverage of these important data sets, remote sensing projects such as Global Urban Footprint (GUF, 2022), Global Human Settlement Layer (GHSL, 2022), World Settlement Footprint (WSF, 2022) and Microsoft Building Footprints (MBF, 2022), as well as VGI initiatives such as the Missing Maps project (MMP, 2022) and the Humanitarian OpenStreetMap Team (HOTOSM, 2022), have made attempts to derive and map settlement and building footprints.

The growing number of data sources raises the issue of harmonization and standardization, which are needed to make the data ready to use for analysis purposes. With regard to the European Union (EU), the INSPIRE (Infrastructure for Spatial Information in the European Community) Directive was an impulse for transnational standardization, interoperability and availability of official geodata inventories (Vancauwenberghe and Van Loenen, 2018). Consistency and accuracy at the semantic and geometric level are important quality elements of geodata (ISO19157, 2013). The aforementioned VGI sources, such as OpenStreetMap (OSM), have increased accessibility to building data and can be very up-to-date and spatially accurate (Brovelli and Zamboni, 2018). However, this is not necessarily true for the semantics and attribute data (Haklay, 2010; Kunze and Hecht, 2015), especially since the data can be edited improperly (Juhász et al., 2020).

Recently, German federal authorities have increasingly adopted open data strategies, which has enhanced public access to official geo data products (Open Government Germany, 2019). In most federal states, building footprints and 3D building models are freely available. Official geo data in Germany fulfil spatial quality criteria and, in the case of 3D building models, are also provided with standardized semantic attributes (AdV, 2021). However, these object catalogues can be interpreted differently by the federal state authorities, and further standardization is the goal of current efforts. A nationwide assessment of the situation in the building stock is therefore not trivial (BBSR, 2013; Schwarz et al., 2021). Furthermore, collecting and updating attributes for a large number of buildings requires a great deal of effort. As a consequence, attributes like building use or number of storeys are in some cases still unknown or outdated.

Related work

There are numerous studies on the automatic extraction and classification of buildings, or of urban form, using remote sensing or vector data. We limit this literature selection to studies of the last 5 years that we consider most relevant to our study, that is, studies that combine building footprint data with a classification of function (usage) or typology (single-family house, detached, high rise, etc.) (Table 1). The study areas of this literature selection are globally diversified (Europe, Asia, Africa). Different metrics are used for the evaluation, with the Area Under the Curve of the Receiver Operating Characteristic (AUC) (Sturrock et al., 2018; Lloyd et al., 2020) and Accuracy (Vanderhaegen and Canters, 2017; Yan et al., 2019) being the most commonly used. The balance of classes and its effect on classification algorithms is not always explicitly addressed; only Bandam et al. (2022) mention strategies to handle class imbalance. Obtaining reference data is a typical challenge for studies of this kind. In particular, according to the literature, reference data sets of national extent are often not available. Mostly, reference data is the result of manually digitizing remote sensing imagery (Sturrock et al., 2018) or, if available, authoritative data is taken as reference (Wurm et al., 2021).

Table 1.

Related work on the prediction of building usage.

Reference	Prediction task (label, classes, class balance)	Classifier	Study area	Test evaluation metric (values)
Vanderhaegen and Canters (2017)	Urban form and function, multi-class, unbalanced	Decision Tree	Brussels Capital Region (Belgium)	Overall accuracy (scenario based, best scenario: 84.2%)
Sturrock et al. (2018)	Residential/non-residential, binary, unbalanced	Super Learner Ensemble	Botswana [1], Swaziland [2]	AUC ([1]: 0.96 [2]: 0.95)
Yan et al. (2019)	Pattern type, binary, balanced	Graph Convolution Neural network (GCNN), random forest (RF), support vector machine (SVM)	Guangzhou [1], Shanghai [2] (China)	Accuracy ([1]: SVM 92.9%, RF 95.1%, GCNN 98% [2]: SVM 75.6%, RF 87.8%, GCNN 92.6%)
Lloyd et al. (2020)	Residential/non-residential, binary, unbalanced	Super Learner Ensemble	Dem. Rep. Congo [1], Nigeria [2]	AUC ([1]: 0.93 [2]: 0.93)
Wurm et al. (2021)	Building typology, multi-class, unbalanced	Random forest	Münster (Germany)	Overall accuracy: 0.96
Bandam et al. (2022)	Building typology, multi-class, unbalanced	Random forest	Germany	Train F1-Score: 0.9958

Aim of this study

Referring to the challenge of missing or insufficient ground-truth data, we aim to apply a state-of-the-art classifier model to a novel ground-truthed reference data set. For the present study, it was possible to use data from a nationwide representative survey of 89,000 buildings; it is therefore of particular interest how a classifier model can be built on this reference data set. In summary, despite growing amounts and better availability of building-related geo data, semantic attributes at the level of a single building are often not accessible or not of sufficient quality. The contribution that machine learning can make to closing knowledge gaps with regard to the semantic attribute of building use, here residential versus non-residential, is the subject of this paper.

The following research questions arise from the background obtained from the literature review.

(1) How well can residential and non-residential buildings be distinguished by means of Random Forest Classification?

(2) Which features are particularly relevant for the classification?

(3) Are there regional differences in the quality of classification?

The basic hypothesis is that there is a function-form relationship at the level of buildings, which, mediated by numerical characteristics, enables at least a binary classification of the stock. The second section describes the data basis and method. This is followed by the presentation of the results and their discussion. In the last section, conclusions and future perspectives are given.

Data and method

Reference data and study area

The data used is an outcome of a national stratified random sample of 100,000 buildings drawn from 48 million building footprints in the official German real estate cadastre (ENOB, 2020). In a second step, on-site screenings of the sample objects were carried out. Via a mobile application, data on building boundaries, age, types of use, façade condition and owners was collected. Not all sample objects were accessible, but in this way a data set of 84,001 screening objects was created, comprising 20,056 residential buildings (RB) and 63,945 non-residential buildings (NRB). Since imbalanced data sets can lead to classifications biased towards the majority class, counterstrategies can be applied directly to the data set or by adapting the classification model (Santos et al., 2018). Garbasevschi et al. (2021) used a combination of oversampling the minority (Synthetic Minority Oversampling Technique, SMOTE) and undersampling the majority in order to classify building age in cities of the state of North Rhine-Westphalia, with a majority class of approx. 69% in a multi-class data set. In our study, different training class balances are generated within the cross-validation loop in order to assess the impact on classifier performance (see Results).

Model

For the classification problem, we use the tree-based Random Forest as a learning algorithm. In this method, not just one decision tree is grown but many trees, each on a random subset of the data. Their classification results are combined into an overall result for the respective data point by means of a voting procedure (Breiman, 2001). An advantage of the Random Forest method is its ability to assess the importance of the features used (feature importance). In the original version of the method, this is determined by the Gini Importance or the Mean Decrease of Impurity at the nodes (Breiman, 2001). As Random Forest applications became more widespread, it became apparent that impurity-based feature importance tends to be biased when (a) features are categorical or of low cardinality, or (b) correlated features enter the classification. To overcome these issues, aside from different implementations of random forests, Strobl et al. (2008) recommend using permutation importances without bootstrapping in random forest models.
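To make the distinction between the two importance measures concrete, the following minimal Python sketch computes both for a Random Forest trained without bootstrapping. It assumes scikit-learn and uses synthetic data from make_classification as a stand-in for the building features; the number of features, the forest size and the use of ROC-AUC as the permutation score are illustrative assumptions, not the settings of this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a building feature table (NOT the study data).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=1,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# bootstrap=False: every tree sees the full training set, so tree
# diversity comes only from the random feature subset at each split.
forest = RandomForestClassifier(
    n_estimators=100, max_features=2, bootstrap=False, random_state=0
)
forest.fit(X_train, y_train)

# Impurity-based importance (mean decrease in impurity), derived from
# the training data while the trees are grown.
impurity_importance = forest.feature_importances_

# Permutation importance: mean drop in ROC-AUC on held-out data when
# the values of a single feature column are shuffled.
perm = permutation_importance(
    forest, X_test, y_test, scoring="roc_auc", n_repeats=5, random_state=0
)
permutation_scores = perm.importances_mean

for i in range(X.shape[1]):
    print(f"feature_{i}: impurity={impurity_importance[i]:.3f} "
          f"permutation={permutation_scores[i]:.3f}")
```

Because the permutation scores are computed on data the forest has not seen, they reflect the contribution of a feature to generalization rather than to the fit of the training set.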

In the classical implementation of random forests in R libraries and Python packages, there are a number of hyperparameters by which the performance of the model can be influenced. The most important of these are the number of trees in the model (n_estimators or ntree) and the number of features considered per split (max_features). In the software, values of these parameters are often pre-set, and an adjustment to the specific task may be necessary. The most favourable combination of hyperparameters (n_estimators = 500, max_features = 8, bootstrap = False) was determined by a grid search with 3-fold cross-validation. However, the influence of these settings on the F1-score of the grid search was marginal. Only the number of trees was changed in the following runs, with the aim of minimizing run times.
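A grid search of this kind can be sketched as follows, assuming scikit-learn's GridSearchCV. The candidate values in the grid and the synthetic data are assumptions chosen to keep the example fast; only the three tuned parameter names, the 3-fold cross-validation and the F1 scoring are taken from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (NOT the study data).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.25, 0.75], random_state=0,
)

# Illustrative candidate values; the grid of the study is not known here.
param_grid = {
    "n_estimators": [25, 50],
    "max_features": [2, 4, 8],
    "bootstrap": [True, False],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",  # F1-score of the positive class
    cv=3,          # 3-fold cross-validation
)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_
print(best_params, round(best_f1, 3))
```

Inspecting search.cv_results_ shows how much the mean F1-score varies across the grid, which is how a marginal influence of the settings would become visible.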

Feature set

The features are based on the hypothesis of the function-form relationship, according to which individual geometric/morphological properties can be important for the classification. As a consequence of Tobler’s law (Tobler, 1970), it is further assumed that properties of neighbouring objects should also be taken into consideration. Another decisive factor for the selection of a specific feature was its applicability to, and implementability for, vector data. A vector-to-raster conversion, which would have allowed raster-based features to be calculated, was not carried out. The features are calculated for the buildings as defined in the screening phase of the ENOB:dataNWG project. From a structural point of view, these buildings can consist of more than one building footprint in the official building footprint data set HU-DE.

The feature type represents the semantic domain of the feature; these are geometry (e.g. area and perimeter), morphology (shape description) or neighbourhood of the object. In order to calculate neighbour- and block-based metrics, official building footprints with additional building height information were used to fill the street blocks. For the definition of the street blocks, the axes of roads and driveways were extracted from the official German basic landscape model ATKIS (FCG, 2022); low-ranking roads were neglected. These linear features were converted into polygons and spatially intersected with the building objects. On this basis, block-specific statistics were calculated for individual features.

Shape-describing features establish a relationship between the geometric parameters perimeter and area of the footprint polygon. A value for the area alone does not allow a statement about the shape of a geometric figure. However, if the geometric parameters are combined in an index, the similarity between a certain polygon and, for example, a circle or an ellipse can be estimated. In the broadest sense, these so-called ‘shape’ metrics are also measures of the compactness or indirectly the complexity of a building footprint. One assumption for the significance of these characteristics is that simple shapes tend to predominate in the residential building sector. Complex and less compact footprints would therefore predominantly occur in the area of non-residential buildings. In the literature (e.g. Maceachren, 1985; Steiniger et al., 2008; Wurm et al., 2016), a number of other compactness measures for different areas of application are discussed in addition to shape indices.
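As an illustration of how perimeter and area can be combined into a single index, the sketch below computes the isoperimetric quotient 4πA/P², which equals 1 for a circle and decreases for less compact outlines. This particular index is chosen here only as a familiar example and is an assumption; it is not claimed to be one of the shape metrics used in this study, and the two footprints are made up.

```python
import math


def polygon_area_perimeter(vertices):
    """Return (area, perimeter) of a simple polygon given as [(x, y), ...]."""
    twice_area = 0.0
    perimeter = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the ring
        twice_area += x1 * y2 - x2 * y1  # shoelace term
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2.0, perimeter


def compactness(vertices):
    """Isoperimetric quotient 4*pi*A / P**2 (1.0 for a circle)."""
    area, perimeter = polygon_area_perimeter(vertices)
    return 4.0 * math.pi * area / perimeter ** 2


# Hypothetical footprints in metres: a square and an L-shaped outline.
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
l_shape = [(0, 0), (20, 0), (20, 5), (5, 5), (5, 20), (0, 20)]

print(round(compactness(square), 3))   # simple, compact footprint
print(round(compactness(l_shape), 3))  # more complex footprint, lower value
```

The L-shaped outline yields a clearly lower value than the square, which is the kind of contrast the assumption about simple residential footprints relies on.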

As a result of the classification, the influence of the features per scale level will also be examined. Different feature characteristics can be processed at the different scale levels. Beyond the individual building footprint, in the direct geometric neighbourhood or in the building block, the geometry- and form-related features are expanded to include statements on variance and homogeneity. One assumption is that, for example, within a building region in residential use, the building footprints contained therein tend to have a low variance in their geometric and form-related characteristics. Likewise, if the street block has a pronounced industrial or commercial use, possibly with many operational areas (e.g. boiler house, warehouse, office), a greater variance of the features from the domains geometry and morphology is expected. A brief overview of the feature domains is given in Table 2; for a full feature set table, please see Supplementary Material Table S1.

Table 2.

Overview of semantic domains and respective features.

Semantic domain	Feature description
Geometric	For example, area, perimeter, height, volume, orientation
Morphologic	For example, shape index, number of polygon vertices
Neighbourhood	For example, numbers of neighbouring buildings, distance to road, statistics (e.g. min, max, mean) of features for street blocks
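Block-level statistics of the kind listed under the neighbourhood domain could be derived along the following lines. The sketch uses only the Python standard library; the record layout, identifiers and area values are hypothetical and serve only to show the aggregation step after footprints have been assigned to street blocks.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical records: (building_id, block_id, footprint_area_m2).
buildings = [
    ("b1", "blockA", 110.0),
    ("b2", "blockA", 120.0),
    ("b3", "blockA", 115.0),
    ("b4", "blockB", 90.0),
    ("b5", "blockB", 2400.0),
    ("b6", "blockB", 650.0),
]

# Collect the feature values of all footprints per street block.
areas_by_block = defaultdict(list)
for _, block_id, area in buildings:
    areas_by_block[block_id].append(area)

# Summary statistics per block.
block_stats = {
    block_id: {
        "min": min(areas),
        "max": max(areas),
        "mean": mean(areas),
        "variance": pvariance(areas),
    }
    for block_id, areas in areas_by_block.items()
}

# Attach the statistics of the containing block to every building,
# so that they can be used as neighbourhood features.
neighbourhood_features = {
    building_id: block_stats[block_id] for building_id, block_id, _ in buildings
}

print(block_stats["blockA"]["variance"], block_stats["blockB"]["variance"])
```

In this made-up example, the homogeneous block A shows a far smaller area variance than the mixed block B, mirroring the assumption about residential and commercial street blocks.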

The workflow is depicted in Figure 1. Based on the combination of screening data footprints, HU-DE building footprints and ATKIS road data, the feature engineering leads to numeric expressions of properties for the screening data footprints. These are input variables for the random forest training and classification.

Figure 1.

Workflow consisting of data fusion, feature processing, sampling and cross-validation. The evaluation metric of the 10-fold cross-validation is used to interpret the block cross-validation.

Cross-validation and sampling

The generalizability of the model is first tested by means of a random 10-fold cross-validation. Ten folds are generally assumed to be a good compromise between computational effort and validity and are therefore often used (Kohavi, 1995; Marcot, 2021). The objects for the folds are selected pseudo-randomly from the data set, without taking into account their spatial distribution. Within the cross-validation loop (see Santos et al., 2018), the effect of different class ratios, that is, degrees of imbalance, in the training sets is assessed. Using the synthetic generation of minority class instances (Synthetic Minority Oversampling Technique, SMOTE) and Random Undersampling, both part of the Python package imblearn (Lemaitre et al., 2017), different class prevalences are prepared (75:25, 50:50, 25:75). The effect of these imbalances on the prediction of the test data is assessed using the metrics in Table 3. A binary classification can be expressed using a contingency or confusion matrix (see Supplementary Material Table S3), giving the counts of True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) results, with TP + FN (Positives) and TN + FP (Negatives) being the original total class counts. An evaluation metric, or a combined metric, that only uses ratios of true class predictions to the class totals would not be affected by class imbalance. As shown in Table 3, four metrics can be considered insensitive to class imbalance, that is, Sensitivity, Specificity, ROC-AUC and GMEAN.

Table 3.

Evaluation metric definitions based on a confusion matrix with True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN).

Accuracy	$\frac{TP + TN}{TP + TN + FP + FN}$
Sensitivity	$\frac{TP}{TP + FN}$
Specificity	$\frac{TN}{TN + FP}$
F1-score	$\frac{2 \cdot TP}{2 \cdot TP + FP + FN}$
ROC-AUC	Graph of sensitivity vs. 1-specificity and Area Under Curve (AUC) calculated based on the trapezoidal rule
Cohen’s Kappa (Chicco et al., 2021)	$\frac{2 \cdot (TP \cdot TN - FP \cdot FN)}{(TP + FP) \cdot (FP + TN) + (TP + FN) \cdot (FN + TN)}$
GMEAN	$\sqrt{\mathrm{Sensitivity} \cdot \mathrm{Specificity}}$
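The count-based definitions in Table 3 translate directly into code. The following sketch, in plain Python, evaluates them for a made-up confusion matrix and then for the same matrix with ten times as many negatives at unchanged per-class rates; the counts are hypothetical and only illustrate which metrics react to the class ratio.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Evaluate the count-based metrics of Table 3 for one confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "kappa": 2 * (tp * tn - fp * fn)
        / ((tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)),
        "gmean": math.sqrt(sensitivity * specificity),
    }


# Hypothetical test set: 100 positives, 50 negatives.
base = binary_metrics(tp=90, tn=30, fp=20, fn=10)

# Same per-class hit rates, but ten times as many negatives.
shifted = binary_metrics(tp=90, tn=300, fp=200, fn=10)

for name in base:
    print(f"{name}: {base[name]:.3f} -> {shifted[name]:.3f}")
```

Sensitivity, specificity and GMEAN are identical for both matrices, whereas accuracy, F1-score and Cohen’s Kappa change with the class ratio.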

Another method of cross-validation takes the spatial distribution of the folds into account. The motivation is the assumed spatial autocorrelation in the data between different regions. The R library blockCV (Valavi et al., 2019) allows the creation of folds that represent geographically defined blocks within the study area. Polygonal grids, buffers or strips along lines of latitude or longitude can be formed. We did not plan to cross-validate federal states or specific administrative regions in Germany against one another, since this would require a denser coverage of the data. Hence, for the investigation of the model, the folds were created as strips along lines of latitude and longitude. Their number was varied, and two, three and five folds were formed.
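The idea of strip-shaped folds can be sketched in a few lines of Python with NumPy. This is not the blockCV implementation: the helper strip_folds, the equal-width strips and the random longitudes are assumptions made purely for illustration.

```python
import numpy as np


def strip_folds(coords, n_folds):
    """Assign each coordinate to one of n_folds equal-width strips (0-based)."""
    coords = np.asarray(coords, dtype=float)
    edges = np.linspace(coords.min(), coords.max(), n_folds + 1)
    # Only the inner edges are used, so the maximum lands in the last strip.
    return np.digitize(coords, edges[1:-1])


# Hypothetical building longitudes in degrees (WGS84), west to east.
rng = np.random.default_rng(0)
lon = rng.uniform(6.0, 15.0, size=1000)

folds = strip_folds(lon, n_folds=3)

# Each strip is held out once; the model would be trained on the others.
split_sizes = []
for k in range(3):
    test_idx = np.flatnonzero(folds == k)
    train_idx = np.flatnonzero(folds != k)
    split_sizes.append((len(train_idx), len(test_idx)))

print(split_sizes)
```

Passing latitudes instead of longitudes gives the corresponding north-south partition. Strips with equal numbers of buildings could be obtained by replacing the equally spaced edges with quantiles of the coordinates.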

Implementation

ESRI ArcGIS Pro and Python, as well as R (R Core Team, 2020) were used to carry out the analyses. The classification procedures and validations are implemented and documented in the Scikit-Learn Python library (SKL, 2021). The spatial cross-validation is based on the R library blockCV (Valavi et al., 2019) and the corresponding implementations of the Random Forest algorithm (Breiman, 2001). The hardware consisted of a workstation with 64 GB RAM and a 12-core Intel Xeon processor.

Results

The following sections present the results of the cross-validation approaches. A standard 10-fold cross-validation is used to assess model performance under different class balances. In comparison, a spatial cross-validation is used to assess the influence of regional effects on the model. Afterwards, specific use categories and feature importance are examined in more detail.

Cross-validations and sampling

The mean scores of the 10-fold cross-validation under different training class ratios are shown in Table 4. As mentioned, the original ratio of non-residential to residential is 75:25 in the sample data set. A balanced class ratio was created using the standard variant of SMOTE for upsampling. An inverted class ratio was created by applying SMOTE and Random Undersampling (RU) to two separate instances of the original data, with their respective results combined afterwards into a new data set.

Table 4.

Data sampling.

	Mean scores per sampling strategy
—	‘75:25’ (Original)	‘50:50’ (SMOTE)	‘25:75’ (SMOTE&RU)
Accuracy	0.86	0.85	0.78
Sensitivity	0.94	0.89	0.76
Specificity	0.62	0.71	0.87
F1-Score	0.91	0.90	0.84
ROC-AUC	0.90	0.90	0.89
Cohen’s Kappa	0.59	0.60	0.51
GMEAN	0.76	0.80	0.81

Metrics that relate to only one class are strongly affected by the change in class ratios. A higher share of residential buildings in the training set (‘25:75’) leads to a higher specificity, that is, sensitivity for the residential class, in the test. Likewise, sensitivity (for the non-residential class) drops in the setting with the inverted class ratio. Combined metrics, like ROC-AUC (0.89–0.90) and GMEAN (0.76–0.81), remain relatively stable throughout the settings. It is assumed that ROC-AUC is a good representation of the classifier’s ability to separate the classes. Since we aim to separate residential and non-residential buildings, we put equal weight on true positive and true negative predictions of the model. This is also reflected in the ROC-AUC metric, which is therefore used to report the model’s performance under the different geographic settings of the following block cross-validation.
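The principle of changing the class ratio only inside the cross-validation loop can be sketched as follows. For brevity, and as a deliberate simplification, this example replaces SMOTE with plain random undersampling written in NumPy; the helper undersample_to_ratio, the synthetic data and the forest size are assumptions. The essential point is that only the training part of each fold is resampled, while the test part keeps its original class ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def undersample_to_ratio(y, target_pos_share, rng):
    """Indices into y such that class 1 makes up about target_pos_share."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    odds = (1 - target_pos_share) / target_pos_share  # negatives per positive
    n_neg = int(round(len(pos) * odds))
    if n_neg <= len(neg):
        neg = rng.choice(neg, size=n_neg, replace=False)
    else:
        n_pos = min(len(pos), int(round(len(neg) / odds)))
        pos = rng.choice(pos, size=n_pos, replace=False)
    return np.concatenate([pos, neg])


# Synthetic stand-in: class 1 ("non-residential") is about 75% of the data.
X, y = make_classification(
    n_samples=800, n_features=8, n_informative=4,
    weights=[0.25, 0.75], random_state=0,
)

rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_by_share = {}
for share in (0.75, 0.50, 0.25):
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        # Resample the training fold only; the test fold stays untouched.
        keep = undersample_to_ratio(y[train_idx], share, rng)
        fit_idx = train_idx[keep]
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(X[fit_idx], y[fit_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], proba))
    auc_by_share[share] = float(np.mean(scores))

print(auc_by_share)
```

Resampling before splitting would let copies or interpolations of test objects leak into the training data, which is why the resampling step sits inside the loop.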

Block cross-validation

The block cross-validation was performed in the reference system WGS84, with the area of Germany divided into strips along lines of longitude and latitude. Per run, two, three and five folds were created, so that the north-south or west-east partition becomes more detailed with a higher number of folds (Figure 2). The number of partitions was limited to five, as a higher number would be more difficult to interpret with the given pattern.

Figure 2.

Geographic zones used in block cross-validation. The distribution of screening buildings is shown in sub-images (a) and (d).

On average, the scores were in the range of the cross-validation without spatial reference. Table 5 shows the scores of the folds along the longitudes, from west to east. For each fold, the prediction performance of a model trained on the remaining folds is evaluated and scored. That is, in the two-fold partition (see Figure 2(d)), the western fold 1 was predicted with a classifier trained on the eastern fold 2, and vice versa. Increasing the number of folds naturally decreases the number of objects to be tested per fold, and the ratio of training to testing data per fold becomes larger.

Table 5.

Scores in block CV from west to east.

Score	Two folds		Three folds			Five folds
Fold	1	2	1	2	3	1	2	3	4	5
Sensitivity	0.92	0.94	0.94	0.92	0.94	0.95	0.93	0.93	0.94	0.92
Specificity	0.66	0.52	0.63	0.61	0.52	0.61	0.63	0.62	0.55	0.54
ROC-AUC	0.89	0.88	0.90	0.88	0.88	0.91	0.90	0.88	0.88	0.87

The scores show a tendency to become lower towards the east. Considering the example of the three-fold partition, sensitivity is between 0.92 and 0.94, that is, over 90% of the non-residential objects are found. In the western and middle folds, over 60% of the residential class are also classified correctly. But in the easternmost fold, the rate of correctly classified residential objects drops to 0.52. Also, the ROC-AUC of 0.88 indicates that the classification of the eastern portion is less accurate when the classifier was trained with folds in the western part.

Table 6 shows the results of the folds along lines of latitude, that is, in the north-south direction. The mean results are also within the range of the non-spatial cross-validation. But here, too, prediction performance differs between regions. Divided into two folds, the sensitivity scores are 0.95 in the northern fold 1 and 0.90 in the southern fold 2. Also, the rate of correctly classified residential buildings is much lower in the northern fold (0.50 versus 0.66). The general pattern shows that the scores become lower towards the north and higher towards the south. In fold 5 of the latitudinal partition, ROC-AUC reaches its highest value of 0.91.

Table 6.

Scores in block CV from north to south.

Score	Two folds		Three folds			Five folds
Fold	1	2	1	2	3	1	2	3	4	5
Sensitivity	0.95	0.90	0.95	0.92	0.94	0.95	0.94	0.93	0.91	0.95
Specificity	0.50	0.66	0.51	0.59	0.64	0.51	0.55	0.56	0.67	0.64
ROC-AUC	0.88	0.88	0.88	0.88	0.90	0.88	0.88	0.88	0.90	0.91

In summary, a spatial influence on classifier performance is recognizable in both the longitudinal and the latitudinal partitions. The effect appears slightly stronger in the west-east partition, which is discussed further in the following section.

Feature importance

The model presented here works with a comparatively large number of features, some of which are correlated. While this does not necessarily affect the performance of a random forest classifier, it does influence the interpretation of the feature importance (Strobl et al., 2008). Therefore, for the estimation of feature importance, the number of strongly correlated features was reduced by means of a principal component analysis and feature selection using method B4 according to Jolliffe (1972).

The starting point for this methodology is the matrix of Pearson correlations (see Supplementary Material: Figure S1), on the basis of which highly correlated features are determined. The threshold for “high” correlation was set at $|\rho| > 0.7$, which yielded 26 highly correlated features. For this subset of features, 10 principal components were determined, with the first seven components having eigenvalues higher than 1, thus meeting the Kaiser-Guttman criterion. Finally, original features were assigned to each of these seven principal components according to their loading on the component, and the remaining highly correlated features were discarded (see Supplementary Material: Figure S2, Table S2). In this way, a reduced feature set of 18 original features was created for which feature importance is examined, none of them having an absolute Pearson correlation higher than 0.6 (see Supplementary Material: Figure S3). Using a 5-fold cross-validation of the whole data set, impurity-based feature importance (IFI) and permutation importance (PI) were calculated and averaged over folds (Figure 3). The performance scores of this cross-validation are comparable to the ones in Table 4 (Accuracy: 0.85, F1-Score: 0.91, ROC-AUC: 0.89).
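The reduction procedure can be outlined in Python with NumPy as follows. This is a simplified sketch under stated assumptions: it selects exactly one feature per retained component (the feature with the highest absolute loading that has not yet been chosen), works on the correlation matrix of the highly correlated subset, and uses synthetic columns with hypothetical names. The exact assignment rule used in this study may differ.

```python
import numpy as np


def select_by_pca(X, names, corr_threshold=0.7):
    """Keep weakly correlated features plus one feature per retained PC."""
    corr = np.corrcoef(X, rowvar=False)
    off_diag = np.abs(corr) - np.eye(len(names))
    # Features involved in at least one correlation above the threshold.
    high = np.flatnonzero((off_diag > corr_threshold).any(axis=1))
    if high.size == 0:
        return list(names)

    # PCA of the highly correlated subset via its correlation matrix.
    eigvals, eigvecs = np.linalg.eigh(corr[np.ix_(high, high)])
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    n_keep = int((eigvals > 1.0).sum())  # Kaiser-Guttman criterion

    # One feature per retained component, by highest absolute loading.
    chosen = []
    for k in range(n_keep):
        for j in np.argsort(np.abs(eigvecs[:, k]))[::-1]:
            if int(j) not in chosen:
                chosen.append(int(j))
                break

    kept = set(range(len(names))) - set(high.tolist())
    kept |= {int(high[j]) for j in chosen}
    return [names[i] for i in sorted(kept)]


# Synthetic example: two groups of correlated columns, one independent column.
rng = np.random.default_rng(0)
n = 500
g1, g2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([
    g1 + 0.3 * rng.normal(size=n),  # "area"
    g1 + 0.3 * rng.normal(size=n),  # "perimeter"
    g1 + 0.3 * rng.normal(size=n),  # "volume"
    g2 + 0.3 * rng.normal(size=n),  # "height"
    g2 + 0.3 * rng.normal(size=n),  # "eaves_height"
    rng.normal(size=n),             # "n_nodes"
])
names = ["area", "perimeter", "volume", "height", "eaves_height", "n_nodes"]

selected = select_by_pca(X, names)
print(selected)
```

In this example, two components have eigenvalues above 1, so one representative of each correlated group is retained alongside the independent column.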

Figure 3.

Feature importances of the model.

A comparison of IFI and PI shows that both importance measures look similar, with only some positional differences. As expected, features of the morphologic (UFE) and geometric (HEIGHT, Shape_Length) domains occupy the first three positions in both measures. At the following positions the order differs, with the low-cardinality features number of neighbours in a 50 m buffer (NO_NB_50) and number of polygon vertices (N_NODES) being ranked higher in PI, while in IFI the distance to road (NEAR_DIST) is in fourth position. In both measures, the second morphologic feature (SCHUMM) is ranked fifth. The feature PATIO, the number of holes in a building footprint, is ranked lowest in both measures.

In summary, both importance measures reflect a high importance of morphologic, geometric and neighbourhood-related features (either buffer counts of, or distances to, objects), which, considering the computational cost of the latter, is important for feature engineering.

Discussion

The present study investigated the automatic separability of the classes residential and non-residential buildings in geodata. A feature set was built up with the help of footprint data and street networks, which was then used for training a Random Forest classifier. In the following we will discuss our results from different perspectives. Initially, the discussion reflects our basic hypotheses about classifying residential and non-residential buildings (see Introduction), concerning function-form relationship, feature importance and class distributions.

The reference footprint data came from a sample of the ENOB:dataNWG project. The aim of the project was to collect sample data on non-residential buildings and extrapolate it to the total inventory. For this purpose, selected buildings were visited and the data was gathered on site. Due to the distribution of the properties over the entire federal territory, data is also available in federal states where official data would otherwise be patchy (Schwarz et al., 2021). This data set can therefore be considered a ground-truth reference for non-residential buildings. However, it must be noted that the footprints in the resulting sample may have been modified compared to the original HU-DE used before the screening. A sub-goal of the project was to compare the delineation of building footprints from the real estate cadastre with the real building situation. Several cases are conceivable in which the footprints from the cadastre and the situation on site do not correspond:

1. A footprint represents several buildings (1:n).

2. Several footprints represent one building (n:1).

3. Any number of footprints represent any number of buildings (m:n).

4. Footprints without buildings.

It is possible that the characteristics and classifications of this sample cannot be transferred to input data that lacks such verification and post-editing, e.g. OSM data and HU-DE. However, during the preparation, it was found that only a comparatively small proportion of the footprints had been modified in a way that would lead to noticeable differences from the delineation in the original HU-DE. It is therefore assumed that there was no change that would alter any hypothetical function-form relationship.

The selected features have proven to be sufficiently informative with regard to separability. Their selection represented a compromise between the resources required for their calculation and their presumed informative value for the classification. In order to evaluate the importance of features and their respective domains, the original feature set was reduced by discarding highly correlated features. Morphologic and neighbourhood-related features rank high in both importance measures applied, which confirms indications of other studies (Steiniger et al., 2008; Vanderhaegen and Canters, 2017; Sturrock et al., 2018; Rosser et al., 2019). With a view to future continuation and repeatable use cases for use classifications of buildings, it is therefore concluded that at least elevation data and the street network must be available for the respective building footprints in order to enable an acceptable classification.

As described in Data and Method, the prevalence of the classes in the real building stock is, strictly speaking, unknown. Reasonable estimates exist, but these differ in their definition of non-residential buildings. The stock of residential buildings can be delineated comparatively well (Hartmann et al., 2016). These buildings usually have addresses and are located on building block areas to which residential use is assigned; accordingly, buildings to which this does not apply fall into the group of non-residential buildings. With this definition, the prevalence of non-residential buildings in Germany is around 60%. For energy-economic analyses, however, the number of non-residential buildings is often calculated to be significantly lower. In this case, only non-residential buildings to which an address can be assigned are considered; these amount to approx. 2.7 million buildings (DENA, 2019).

Different sampling strategies were used to assess the effect of class imbalances on classifier training. These strategies either create synthetic instances of the minority class (upsampling), randomly discard objects of the majority class (downsampling), or combine both. The synthetic approach might have unwanted effects, e.g. when upsampling overemphasizes a specific feature characteristic that does not reflect the unknown ‘natural’ non-residential building stock. We see the analysis of these sampling strategies and their possibly problematic implications for building classification as an aspect of future research.
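The two basic resampling ideas can be sketched in a few lines. This is a simplified illustration, not the implementation used in the study: downsampling draws a random subset of the majority class, and upsampling creates synthetic minority objects by interpolating between a minority object and its nearest minority neighbour, in the spirit of SMOTE-type methods. All feature vectors are invented.

```python
import random


def downsample(majority, n, rng):
    """Randomly keep n objects of the majority class."""
    return rng.sample(majority, n)


def upsample_synthetic(minority, n_new, rng):
    """Create n_new synthetic minority objects by interpolation.

    Each synthetic object lies on the line between a randomly chosen
    minority object and its nearest minority neighbour (needs >= 2 objects).
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        nearest = min(
            others,
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )
        t = rng.random()  # position between base (0) and neighbour (1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, nearest)))
    return synthetic


rng = random.Random(42)  # fixed seed for reproducibility
majority = [(float(i), float(i % 7)) for i in range(20)]          # toy data
minority = [(100.0, 1.0), (110.0, 2.0), (130.0, 1.5), (105.0, 3.0)]

down = downsample(majority, len(minority), rng)
synth = upsample_synthetic(minority, len(majority) - len(minority), rng)
```

Because every synthetic object is a convex combination of two existing minority objects, upsampling can only densify regions of feature space that the sample already covers. This is the mechanism by which a characteristic of the sample may become overemphasized.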

Through this cross-validation, the ROC-AUC score was found to be a sufficient metric for classifier performance under different sampling settings. Using this metric, a spatial cross-validation showed lower classifier performance in the eastern and northern folds of the longitudinal and latitudinal partitions, respectively. It is possible that in both cases the calculated feature set is not instructive or predictive enough in training to separate the classes. From the distribution of the objects (Figure 2), it is visible that the 3-fold west-east partition divides the data roughly along the boundary between former West and East Germany. An influence of historically different architectural and planning patterns, which would make a single feature set for the whole of Germany less applicable, cannot be ruled out. On this point, a closer look at specific administrative regions during block cross-validation would be necessary. However, the present study was not intended to provide an analysis at federal state or county level, since the distribution of the whole data population would not allow the use of administrative boundaries alone. This is because not all subclasses of non-residential buildings might be represented in such an administrative subset. For example, the non-residential class in some counties or states (think of Berlin or Hamburg) may contain more multi-use office buildings, whereas in others there are more warehouses and agricultural buildings. This would lead to semantically biased learning data. In order to obtain a more general selection of learning data, it was decided to use a geographic block pattern during block cross-validation. A detailed look at the linkages between the structure of the non-residential building stock and its historic and economic background is a field of future research.
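The logic of a longitudinal block cross-validation evaluated by ROC-AUC can be sketched as follows. This is a simplified illustration under stated assumptions: folds are equal-sized west-to-east bands, the "classifier" is a toy one-feature stand-in rather than a Random Forest, and all coordinates, areas and labels are invented. They are not data or results of the study.

```python
def longitudinal_folds(lons, k=3):
    """Assign each object to one of k west-to-east folds of equal size."""
    order = sorted(range(len(lons)), key=lambda i: lons[i])
    folds = [0] * len(lons)
    for rank, i in enumerate(order):
        folds[i] = rank * k // len(lons)
    return folds


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute the AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_toy_scorer(x_train, y_train):
    """Toy stand-in for a classifier: signed distance to the class midpoint."""
    pos = [x for x, l in zip(x_train, y_train) if l == 1]
    neg = [x for x, l in zip(x_train, y_train) if l == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    mid = (mean_pos + mean_neg) / 2
    sign = 1.0 if mean_pos >= mean_neg else -1.0
    return lambda x: sign * (x - mid)


def spatial_cv_auc(lons, x, y, fit, k=3):
    """Train on k-1 longitudinal folds, report the AUC on the held-out fold."""
    folds = longitudinal_folds(lons, k)
    result = {}
    for f in range(k):
        train = [i for i in range(len(y)) if folds[i] != f]
        test = [i for i in range(len(y)) if folds[i] == f]
        score = fit([x[i] for i in train], [y[i] for i in train])
        result[f] = roc_auc([y[i] for i in test], [score(x[i]) for i in test])
    return result


# Invented toy data: longitude, footprint area, label (1 = non-residential).
lons = [6.1, 7.0, 8.2, 9.5, 10.1, 10.9, 12.0, 13.4, 14.2]
area = [90, 800, 120, 1500, 110, 95, 130, 400, 100]
label = [0, 1, 0, 1, 0, 0, 1, 0, 0]
auc_per_fold = spatial_cv_auc(lons, area, label, fit_toy_scorer)
```

In this toy example the relationship between area and use is deliberately different in the easternmost fold, so the held-out AUC drops there. A per-fold report of this kind is what makes regional differences in classification quality visible, which a single pooled score would hide.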

Conclusions and future perspectives

As for the research questions raised, it can be noted that the random forest classifier has demonstrated its ability to both separate classes and determine important features. By limiting the effort of feature engineering, it is easier to extend the analysis to larger areas and datasets with reasonable computing time and classification quality. This does not mean that extensive feature sets should be avoided in general.

In principle, the approach should be transferable with regard to the data and software used. Only data products were used that are either already available in other countries or can be acquired by means of remote sensing. Geometric vector data on building footprints and roads are required as input data, and these are nowadays often freely available in many countries, at least as VGI. Globally, the works mentioned on deriving building footprints from remote sensing data make a general applicability of the approach plausible.

All processing steps can be implemented in open software. The fact that the approach was based on a specially prepared reference database should not limit the transferability to other data sets. Although the reference data were edited geometrically, these edits only affect a small part of the objects in the database. Furthermore, the data is likely to change geometrically in a direction that corresponds to that of the OSM data. How far the classification results differ when using either official data or OSM alone is the subject of further research.

Another important starting point for further development is the refinement of the procedure for further classes. However, it has to be considered which aspects are used for class formation. Often, classes are formed according to morphological characteristics (e.g. single-family house, apartment building, high-rise building). In the present case, however, geometric morphological characteristics were used to classify the building function. The classification of the classes residential and non-residential in this study is based on the main categories of the ENOB:dataNWG project. These main categories were created on the basis of similar energy and functional characteristics collected during the screening, rather than on the basis of similar geometric characteristics. A classifier trained with geometric features inevitably provides mixed results.

The monitoring of the building stock at national level is an important goal in the context of the transformation towards a climate-neutral and resource-efficient building stock. Deriving population distributions in less developed countries is of similar importance, and knowledge of building use is a necessary input parameter. Where data are incomplete, machine learning techniques have proven to be valuable tools for filling gaps. Considering the results of the spatial cross-validation in Germany, further research should deepen the understanding of regional specifics when predictions for the whole building stock are made. The further development of the present approach is seen as a way to achieve a reliable classification with little input data and a small geometric feature set. In this way, it can be integrated into workflows for building-related area, material or population analyses.

Supplemental Material

Supplemental Material for Prediction of residential and non-residential building usage in Germany based on a novel nationwide reference data set by André Hartmann, Martin Behnisch, Robert Hecht and Gotthard Meinel in Environment and Planning B: Urban Analytics and City Science

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was partly supported by the Federal Ministry for Economic Affairs and Climate Action (Grant Number: FKZ 03ET1315B, Forschungsdatenbank Nichtwohngebäude).

Data availability statement

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to research partner restrictions.

ORCID iDs

André Hartmann

Martin Behnisch

Robert Hecht

Supplemental Material

Supplemental material for this article is available online.

André Hartmann received his diploma at TU Dresden in 2015 and is currently a research associate at the Leibniz Institute for Ecological Urban and Regional Development (Research Area ‘Spatial Information and Modelling’). His research interests are in remote sensing, spatio-temporal analysis and monitoring of settlements and open spaces, and building stock research.

Martin Behnisch received his diploma and doctoral degrees at the Department of Architecture, Karlsruhe Institute of Technology. He also received a diploma degree in wood processing technologies (University of Cooperative Education, Dresden, Germany) and a master’s degree in geographical information science (University of Salzburg, Austria) with distinction. He worked in Switzerland as a postdoctoral researcher (2007–2011) at the Institute of Historic Building Research and Conservation (ETH Zurich). He is currently a senior scientist at the Leibniz Institute of Ecological Urban and Regional Development (IOER). His research interests are in spatial analysis and modelling, urban data mining, spatial monitoring, land-use science, as well as building stock research.

Robert Hecht received his doctoral degree at TU Dresden in 2013 (Faculty of Environmental Sciences) and is now a senior researcher at the Leibniz Institute for Ecological Urban and Regional Development (Research Area ‘Spatial Information and Modelling’) in the fields of Geoinformatics/GIScience, Cartography and Remote Sensing. His current research interests are spatio-temporal analysis and monitoring of settlements and open spaces, pattern recognition in geospatial data, machine learning/AI and Big Data applications, quality aspects of spatial data, VGI and crowdsourcing, and location-based services in an urban context.

Gotthard Meinel received his doctoral degree in image processing from Dresden University of Technology in 1987. He was subsequently a postdoctoral researcher in biomathematics and technical informatics. Since 1992, he has been a project leader in the field of geoinformatics, GIS and remote sensing at the Leibniz Institute of Ecological Urban and Regional Development in Dresden and head of the research area ‘Spatial Information and Modelling’. He is a specialist in the fields of land-use monitoring, indicator development, spatial analysis and visualization technologies.

References

AdV (2021) Standards und Produktblätter. Arbeitsgemeinschaft der Vermessungsverwaltungen der Länder der Bundesrepublik Deutschland. https://www.adv-online.de/AdV-Produkte/Standards-und-Produktblaetter/ZSHH/ (15.09.2021).

Bandam, Busari, Syranidou, et al. (2022) Classification of building types in Germany: a data-driven modeling approach. Data 7(4): 45. DOI: 10.3390/data7040045

BBSR (2013) Systematische Datenanalyse im Bereich der Nichtwohngebäude - Erfassung und Quantifizierung von Energieeinspar- und CO2-Minderungspotenzialen. Berlin: BMVBS. Online: https://www.bbsr.bund.de/BBSR/DE/veroeffentlichungen/ministerien/bmvbs/bmvbs-online/2013/ON272013.html

Biljecki, Arroyo Ohori, Ledoux, et al. (2016) Population estimation using a 3D city model: a multi-scale country-wide study in the Netherlands. PLoS ONE 11(6): e0156808. DOI: 10.1371/journal.pone.0156808

Breiman (2001) Random Forests. Machine Learning 45: 5–32. DOI: 10.1023/A:1010933404324

Brovelli and Zamboni (2018) A new method for the assessment of spatial accuracy and completeness of OpenStreetMap building footprints. ISPRS International Journal of Geo-Information 7(8): 289. DOI: 10.3390/ijgi7080289

Cheng K.-L, Hsu S.-C, W.-M, et al. (2018) Quantifying potential anthropogenic resources of buildings through hot spot analysis. Resources, Conservation and Recycling 133: 10–20. DOI: 10.1016/j.resconrec.2018.02.003

Chicco, Warrens M J and Jurman (2021) The Matthews correlation coefficient (MCC) is more informative than Cohen's Kappa and Brier score in binary classification assessment. IEEE Access 9: 78368–78381. DOI: 10.1109/ACCESS.2021.3084050

Creutzig, Agoston, Minx J C, et al. (2016) Urban infrastructure choices structure climate solutions. Nature Climate Change 6(12): 1054–1056. DOI: 10.1038/nclimate3169

Dabbeek and Silva (2020) Modeling the residential building stock in the Middle East for multi-hazard risk assessment. Natural Hazards 100: 781–810. DOI: 10.1007/s11069-019-03842-7

DENA (2019) Statistiken und Analysen zur Energieeffizienz im Gebäudebestand. dena-GEBÄUDEREPORT KOMPAKT. https://www.dena.de/fileadmin/dena/Publikationen/PDFs/2019/dena-GEBAEUDEREPORT_KOMPAKT_2019.pdf

ENOB:dataNWG (2020) E.1.4.5 Stichprobe: Modellierung und Ziehung (in German). https://www.datanwg.de/fileadmin/user/iwu/210506_IWU_E1-4-5_Stichprobe_Modellierung_und_Ziehung.pdf

Evans, Liddiard and Steadman (2019) Modelling a whole building stock: domestic, non-domestic and mixed use. Building Research & Information 47(2): 156–172. DOI: 10.1080/09613218.2017.1410424

FCG, Federal Agency for Cartography and Geodesy (2022) Digital Basic Landscape Model – Basic DLM. Technical Documentation. Online: https://sg.geodatenzentrum.de/web_public/gdz/dokumentation/eng/basis-dlm_eng.pdf

Garbasevschi Oana M., Schmiedt Jacob Estevam, Verma Trivik, et al. (2021) Spatial factors influencing building age prediction and implications for urban residential energy modelling. Computers, Environment and Urban Systems 88. https://doi.org/10.1016/j.compenvurbsys.2021.101637

GHSL (2022) Global Human Settlement Layer. https://ghsl.jrc.ec.europa.eu (12.10.2022).

GODI (2016) Global Open Data Index. https://index.okfn.org/ (23.09.2021).

GUF (2022) Global Urban Footprint. https://www.dlr.de/eoc/en/desktopdefault.aspx/tabid-9628/16557_read-40454 (12.10.2022).

Haberl, Wiedenhofer, Schug, et al. (2021) High-resolution maps of material stocks in buildings and infrastructures in Austria and Germany. Environmental Science & Technology 55(5): 3368–3379. DOI: 10.1021/acs.est.0c05642

Haklay (2010) How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environment and Planning B: Planning and Design 37(4): 682–703. DOI: 10.1068/b35097

Hartmann, Hecht, Behnisch, Meinel, et al. (2016) A Workflow for Automatic Quantification of Structure and Dynamic of the German Building Stock Using Official Spatial Data. ISPRS International Journal of Geo-Information 5. https://doi.org/10.3390/ijgi5080142

Hecht, Herold, Behnisch and Jehling (2019) Mapping Long-Term Dynamics of Population and Dwellings Based on a Multi-Temporal Analysis of Urban Morphologies. ISPRS International Journal of Geo-Information 8(2). https://doi.org/10.3390/ijgi8010002

HOTOSM (2022) Humanitarian OpenStreetMap Team. https://www.hotosm.org (12.10.2022).

ISO 19157 (2013) ISO 19157:2013, Geographic Information - Data Quality. https://www.iso.org/obp/ui/#iso:std:iso:19157:ed-1:v1:en

Jochem, Leasure, Pannell, et al. (2020) Classifying settlement types from multi-scale spatial patterns of building footprints. Environment and Planning B: Urban Analytics and City Science 48(5): 1161–1179. DOI: 10.1177/2399808320921208

Jolliffe (1972) Discarding variables in a principal component analysis. I: Artificial data. Applied Statistics 21(2): 160–173.

Juhász, Novack, Hochmair, et al. (2020) Cartographic vandalism in the era of location-based games—The case of OpenStreetMap and Pokémon GO. ISPRS International Journal of Geo-Information 9(4): 197. DOI: 10.3390/ijgi9040197

Kohavi (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the 14th International Joint Conference on Artificial Intelligence 2: 1137–1143.

Kunze and Hecht (2015) Semantic enrichment of building data with volunteered geographic information to improve mappings of dwelling units and population. Computers, Environment and Urban Systems 53: 4–18. DOI: 10.1016/j.compenvurbsys.2015.04.002

Lemaitre, Nogueira and Aridas (2017) Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research 18(17): 1–5. http://jmlr.org/papers/v18/16-365.html

Lloyd, Sturrock HJW, Leasure, et al. (2020) Using GIS and machine learning to classify residential status of urban buildings in low and middle income settings. Remote Sensing 12(23): 3847. DOI: 10.3390/rs12233847

MacEachren (1985) Compactness of geographic shape: comparison and evaluation of measures. Geografiska Annaler: Series B, Human Geography 67(1): 53–67.

Marcot and Hanea (2021) What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis? Computational Statistics 36: 2009–2031. DOI: 10.1007/s00180-020-00999-9

MBF (2022) Microsoft Building Footprints. https://www.microsoft.com/en-us/maps/building-footprints (12.10.2022).

MMP (2022) Missing Maps Project. https://www.missingmaps.org (12.10.2022).

Open Government Germany (2019) Second National Plan (NAP) 2019-2021 in the Framework of Germany’s Participation in the Open Government Partnership (OGP). Berlin: Federal Chancellery. Online: https://www.opengovpartnership.org/wp-content/uploads/2019/09/Germany_Action-Plan_2019-2021_EN.pdf

Pauliuk and Müller (2014) The role of in‐use stocks in the social metabolism and in climate change mitigation. Global Environmental Change 24: 132–142. DOI: 10.1016/j.gloenvcha.2013.11.006

R Core Team (2020) R: A language and environment for statistical computing. Available online: http://www.r-project.org (accessed on 18 December 2020).

Rosser J S, Boyd D S, Long, et al. (2019) Predicting residential building age from map data. Computers, Environment and Urban Systems 73. https://doi.org/10.1016/j.compenvurbsys.2018.08.004

Santos, Soares J.P, Abreu, et al. (2018) Cross-validation for imbalanced datasets: avoiding overoptimistic and overfitting approaches [Research Frontier]. IEEE Computational Intelligence Magazine 13(4): 59–76. Online: https://ieeexplore.ieee.org/document/8492368

Schug, Frantz, van der Linden, et al. (2021) Gridded population mapping for Germany based on building density, height and type from Earth Observation data using census disaggregation and bottom-up estimates. PLoS ONE 16(3): e0249044. DOI: 10.1371/journal.pone.0249044

Schwarz, Hartmann, Hecht, et al. (2021) Status Quo of Official 3D building models in LoD1: A Metadata Analysis. zfv – Zeitschrift für Geodäsie, Geoinformation und Landmanagement. (In review).

SKL (2021) scikit-learn. https://scikit-learn.org/stable/

Steiniger, Lange, Burghardt, et al. (2008) An approach for the classification of urban building structures based on discriminant analysis techniques. Transactions in GIS 12(1): 31–59. DOI: 10.1111/j.1467-9671.2008.01085.x

Strobl, Boulesteix A.-L, Kneib, et al. (2008) Conditional variable importance for random forests. BMC Bioinformatics 9: 307. DOI: 10.1186/1471-2105-9-307

Sturrock HJW, Woolheater, Bennett, et al. (2018) Predicting residential structures from open source remotely enumerated data using machine learning. PLoS ONE 13(9): e0204399. DOI: 10.1371/journal.pone.0204399

Tobler (1970) A computer movie simulating urban growth in the Detroit region. Economic Geography 46: 234–240.

Tomás, Fonseca, Almeida, et al. (2016) Urban population estimation based on residential buildings volume using IKONOS-2 images and lidar data. International Journal of Remote Sensing 37: 1–28. DOI: 10.1080/01431161.2015.1121301

Valavi, Elith, Lahoz‐Monfort, et al. (2019) blockCV: An R package for generating spatially or environmentally separated folds for k‐fold cross‐validation of species distribution models. Methods in Ecology and Evolution 10: 225–232. DOI: 10.1111/2041-210X.13107

Vancauwenberghe and van Loenen (2018) Exploring the emergence of open spatial data infrastructures: analysis of recent developments and trends in Europe. In: Saeed, Ramayah and Mahmood (eds), User Centric E-Government. Cham: Springer, pp. 23–45. DOI: 10.1007/978-3-319-59442-2_2

Vanderhaegen and Canters (2017) Mapping urban form and function at city block level using spatial metrics. Landscape and Urban Planning 167: 399–409. DOI: 10.1016/j.landurbplan.2017.05.023

Wardrop, Jochem, Bird, et al. (2018) Spatially disaggregated population estimates in the absence of national population and housing census data. Proceedings of the National Academy of Sciences of the United States of America 115: 3529–3537. DOI: 10.1073/pnas.1715305115

WSF (2022) World Settlement Footprint. https://geoservice.dlr.de/web/maps/eoc:wsf2019 (12.10.2022).

Wurm, Droin, Stark, et al. (2021) Deep learning-based generation of building stock data from remote sensing for urban heat demand modeling. ISPRS International Journal of Geo-Information 10: 23. DOI: 10.3390/ijgi10010023

Wurm, Schmitt and Taubenbock (2016) Building types’ classification using shape-based features and linear discriminant functions. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9(5): 1901–1912. DOI: 10.1109/JSTARS.2015.2465131

Yan, Yang, et al. (2019) A graph convolutional neural network for classification of building patterns using spatial vector data. ISPRS Journal of Photogrammetry and Remote Sensing 150. https://doi.org/10.1016/j.isprsjprs.2019.02.010

Zhu, Zhou, Seto, et al. (2019) Understanding an urbanizing planet: Strategic directions for remote sensing. Remote Sensing of Environment 228: 164–182. DOI: 10.1016/j.rse.2019.04.020
