Abstract
Understanding why machine learning algorithms may fail is usually the task of a human expert who uses domain knowledge and contextual information to discover systematic shortcomings in either the data or the algorithm. In this paper, we propose a semantic referee, which is able to extract qualitative features of the errors emerging from deep machine learning frameworks and to suggest corrections. The semantic referee relies on ontological reasoning about spatial knowledge in order to characterize errors in terms of their spatial relations with the environment. Using semantics, the reasoner interacts with the learning algorithm as a supervisor. The proposed interaction between a neural network classifier and a semantic referee is shown to improve the performance of semantic segmentation for satellite imagery data.
Introduction
Machine learning algorithms and Semantic Web technologies have both been widely used in geographic information systems [17,39]. The former are typically applied to geo-data to perform computer vision tasks such as semantic segmentation, land use/cover classification, and object detection/recognition, whereas the latter are used for a number of applications such as navigation, knowledge acquisition, and map query [26]. Recent developments in machine learning, and in particular deep learning methods, have shown large improvements for several tasks in remote sensing. However, these approaches seldom take into account the advantages of the semantics that are associated with geo-data. Instead, the training process of neural-based algorithms typically relies on reducing an error defined by a cost function, adapting the model parameters to minimize this error, with no consideration of whether the found solution makes sense semantically.
In the context of semantic segmentation for geospatial data, a classifier that uses only the RGB (Red, Green, Blue) channels as input is prone to errors caused by the visual similarity between certain classes. For example, the RGB appearance of water is very similar to that of roads covered by shadow, and buildings with gray roofs resemble roads. One possible solution to this problem is to include additional sources of information as part of the input data to the classifier, such as Synthetic-aperture radar (SAR), Light detection and ranging (LIDAR), Digital Surface Models (DSM), hyperspectral bands, near-infrared (NIR) bands, and synthetic spectral bands [8,21]. However, such additional data is not always accessible, e.g. satellite images from Google Maps or several publicly available data sets only contain the RGB channels. Another possible way to increase the performance of the classifier is to change the architecture of the network to increase its capacity, e.g. by using Deep Convolutional Neural Networks (DCNNs) [6,14,38].
In this paper, instead of relying on additional sources of information or taking the ad-hoc approach of experimenting with the architecture of the classifier, we propose a method that focuses on conceptualizing the errors in terms of their spatial relations and surrounding neighborhood. Our method applies a reasoner upon an ontological representation of the context in order to retrieve the spatial and geometrical characteristics of the data. We refer to this process as a semantic referee, since we use knowledge representation and reasoning methods to arbitrate on the errors arising from the misclassifications.
In particular, our representation makes use of RCC-8 spatial relations, as well as extensions thereof, where RCC-8 stands for the language that is formed by the 8 base relations of the Region Connection Calculus [9], viz., disconnected, externally connected, partially overlapping, equal, tangential proper part, non-tangential proper part, tangential proper part inverse, and non-tangential proper part inverse. Notably, RCC-8 has been adopted by the GeoSPARQL standard.
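To make the eight base relations concrete, the following sketch (ours, not part of the paper's implementation) classifies the RCC-8 relation between two axis-aligned rectangles `(x1, y1, x2, y2)`; general polygons require a geometry library, but rectangles suffice to illustrate the distinctions:

```python
from enum import Enum

class RCC8(Enum):
    DC = "disconnected"
    EC = "externally connected"
    PO = "partially overlapping"
    EQ = "equal"
    TPP = "tangential proper part"
    NTPP = "non-tangential proper part"
    TPPI = "tangential proper part inverse"
    NTPPI = "non-tangential proper part inverse"

def rcc8(a, b):
    """RCC-8 relation between two closed axis-aligned rectangles (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if a == b:
        return RCC8.EQ
    # No shared points at all -> disconnected.
    if ax2 < bx1 or bx2 < ax1 or ay2 < by1 or by2 < ay1:
        return RCC8.DC
    # Interiors overlap iff the open intervals intersect on both axes.
    interiors = ax2 > bx1 and bx2 > ax1 and ay2 > by1 and by2 > ay1
    if not interiors:
        return RCC8.EC        # boundaries touch, interiors disjoint
    a_in_b = bx1 <= ax1 and ax2 <= bx2 and by1 <= ay1 and ay2 <= by2
    b_in_a = ax1 <= bx1 and bx2 <= ax2 and ay1 <= by1 and by2 <= ay2
    touches = ax1 == bx1 or ax2 == bx2 or ay1 == by1 or ay2 == by2
    if a_in_b:
        return RCC8.TPP if touches else RCC8.NTPP
    if b_in_a:
        return RCC8.TPPI if touches else RCC8.NTPPI
    return RCC8.PO
```

For example, two adjacent city blocks sharing a border are `EC`, while a courtyard strictly inside a building footprint is `NTPP`.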
In general, one of the key challenges in Artificial Intelligence is the reconciliation of data-driven learning methods with symbolic reasoning [12]. Approaches for integrating low-level and high-level data have been addressed under different names depending on the employed representational models, and include abduction-induction in learning [28], structural alignment [3], and neural-symbolic methods [5,7]. Due to the increasing interest in deep learning methods, the design and development of neural-symbolic systems has recently become the focus of different communities in Artificial Intelligence, as such systems are assumed to provide better insights into the learning process [11].
In this work, we develop an ontology-based reasoning approach, a preliminary version of which can be found in [2], to assist a neural network classifier in a semantic segmentation task. This assistance is used in particular to represent typical errors and extract their features, which eventually helps in correcting misclassifications. Using a specific case on large-scale satellite data, we show how Semantic Web resources interact with deep learning models. This interaction improves the classification performance on a city-wide scale, as well as on a publicly available data set.
Our contribution differs from the neural-symbolic systems discussed in Section 2 in three regards. Firstly, our method plays the role of a semantic referee that conceptualizes the errors of an imagery-data classifier, which, to the best of our knowledge, is the first attempt in the domain of image segmentation to tackle the problem by explaining the features of the errors. Secondly, our model focuses on the misclassifications and uses ontological knowledge together with geometrical processing to explain them; to the best of our knowledge, this combination has not been employed for this purpose before. Finally, our system closes the communication loop between the classifier and the semantic referee.
Structure of the paper
The rest of the paper is structured as follows. Section 2 describes the related work. The method is presented in Section 3, which gives the overview of the approach (Section 3.1), the satellite image data used in this work (Section 3.2), the neural network-based semantic segmentation algorithm (Section 3.3), OntoCity as the ontological knowledge model (Section 3.4), and the semantic augmentation process and how it is used to guide the classifier (Sections 3.5 and 3.6). The experimental evaluation is presented in Section 4, which is followed by a discussion and possible directions for future work in Section 5.
Related work
As discussed by Xie et al., 2017, in neural-symbolic systems where the learning is based on a connectionist learning system, one way of interpreting the learning process is to explain the classification outputs using concepts related to the classifier's decision [36]. However, there is a limited body of work where symbolic techniques are used to explain the conclusions. The work presented by Hendricks et al., 2016 introduces a learning system based on a Long-term Recurrent Convolutional Network (LRCN) [10] that provides explanations of the decisions of the classifier [15]. An explanation takes the form of a justification text. In order to generate the text, the authors propose a loss function over sampled concepts that, by enforcing global sentence constraints, helps the system construct sentences based on discriminating features of the objects found in the scene. However, no specific symbolic representation was provided, and the features related to the objects are taken from the sentences that are already available for each image in the dataset (CUB dataset [34]).
With focus on the knowledge model, Sarker et al., 2016 proposed a system that explains the classifier's outputs based on background knowledge [31]. The key tool of the system, called DL-Learner, works in parallel with the classifier and accepts the same data as input. Using the Suggested Upper Merged Ontology (SUMO)2
as the symbolic knowledge model, the DL-Learner is also able to categorize the images by reasoning upon the objects together with the concepts defined in the ontology. The compatibility between the output of the DL-Learner and the classifier can be seen as a reliability support and at the same time as an interpretation of the classification process. Similarly, Icarte et al., 2017, used a general-purpose knowledge model, the ConceptNet ontology [16]. In this work, the integration of the symbolic model with a sentence-based image retrieval process based on deep learning is used to improve the performance of the learning process. The knowledge about different concepts, such as their affordances and their relations with other objects, is aligned with objects derived from the deep learning method.
The method of enriching the data by providing additional information as extra input channels for training a CNN-based network has been explored before. Liu et al., 2018 and Zhenyi et al., 2018, have explained how to augment the input data with two additional channels that represent the i and j coordinates in the image in order to provide location information [20,35]. Our work instead uses information from a semantic referee as the augmented data.
Although these works emphasize the role of symbolic knowledge represented by ontologies, they are limited in terms of their symbolic representation models. More specifically, the concepts and their relations in the ontologies are simplified, limiting the richness of deliberation in an eventual reasoning process, especially for visual imagery data.
Our approach can also be compared with Explanation-Based Learning (EBL) [23]. EBL refers to a form of machine learning that learns by generalizing examples whose features are formalized as a domain theory. In EBL, the explanations, which consist of the features of the observation, are directly considered and generalized by the learner. In our semantics-based model, by contrast, although the features of the misclassified regions are inferred from the ontology and sent back to the classifier, they are not directly applied to the classification output; rather, they are treated as a new set of data that is sent through the learning process.
Overview of the approach

Overview of applying a semantic referee (top layer), in the form of reasoning upon ontological knowledge, to improve the semantic segmentation produced by the convolutional encoder-decoder classifier (bottom layer). The semantic referee reasons about the mistakes made by the classifier based on ontological concepts and provides additional information back to the classifier that prevents it from repeating the same misclassifications.
An overview of our approach can be seen in Fig. 1, which shows the interaction between the classifier and the semantic referee.
The classifier is a deep convolutional network with an encoder-decoder structure that provides the semantic segmentation of the input image (see Section 3.3).
In order to deal with the misclassifications from the classifier, a semantic referee reasons about the errors, conceptualizing the misclassified regions based on their physical (e.g. geometrical) properties and aligning them with the available ontology to infer the best possible match for each error (see Sections 3.4 and 3.5).
The inferred concept related to the misclassified region is then given back to the classifier, with the referee providing information to be used within the learning process to prevent the classifier from making the same misclassifications. The additional information provided by the reasoner is represented as image channels with the same size as the RGB input, which are concatenated with the original RGB channels as additional color channels. In this work, we use three additional channels from the reasoner, representing shadow estimation, elevation estimation, and other inconsistencies. This results in the classifier using data with 6 channels instead of the original 3 RGB channels (see Section 3.6).
This process is then repeated until the classification accuracy on the validation data converges. During testing, the same procedure is performed using the same number of iterations that was used during training.

The data consists of RGB satellite images from two different cities, Stockholm and Boden. The selected area size for both cities is
Stockholm and Boden
The data used in this work consists of RGB satellite images from two different cities in Sweden, shown in Fig. 2. The first is Stockholm, the largest city and capital of Sweden, and the second is a smaller city in northern Sweden called Boden. The selected area size for both cities is
The 5 categories that are used are vegetation, road, building, water, and railroad. The class distribution for each city can be seen in Table 1. Due to the large imbalance in the data set, the loss function uses median frequency class weighting.
Class distribution for the two cities used in this work. There is a large difference in the amount of vegetation, buildings, and water between the two cities. Both cities have a very small amount of railroad
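The median frequency class weighting mentioned above can be sketched as follows. This is an illustrative Python version (the paper's pipeline is in MATLAB), following the common formulation where each class weight is the median class frequency divided by that class's frequency, so rare classes such as railroad receive weights greater than 1:

```python
import numpy as np

def median_frequency_weights(label_map, num_classes):
    """Class weights via median frequency balancing:
    weight_c = median(freq) / freq_c, computed over pixel frequencies.
    Classes absent from the data receive weight 0."""
    counts = np.bincount(label_map.ravel(), minlength=num_classes).astype(float)
    freq = counts / counts.sum()
    freq[freq == 0] = np.nan            # ignore absent classes in the median
    median = np.nanmedian(freq)
    weights = median / freq
    return np.nan_to_num(weights, nan=0.0)
```

These weights would then scale each class's contribution to the cross-entropy loss.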
For further evaluation, we use the publicly available UC Merced Land Use dataset [37], which consists of 21 land use classes with 100 images for each class with size
New merged classes and old classes from the DLRSD data set, and the class distribution for the new classes. Any images containing the crossed-out classes were omitted
A variation of a Convolutional Auto-encoder (CAE) [22] is used to perform the semantic segmentation of the satellite images, where every pixel in the map is classified. The structure of the network follows the U-net model [29] and was created in MATLAB 2018a with the function creatUnet, with patch size 256, 6 color channels (3 RGB channels + 3 channels of feedback provided by the semantic referee, see Fig. 1), and the 5 classes shown in Table 1.
The U-net model consists of an encoder with 4 layers where each layer performs two convolutions with
The model parameters are trained from scratch and were initialized with Xavier initialization [13] and trained using the Adam optimization method [18] with initial learning rate
OntoCity: The ontological knowledge model
In our approach, the improvement of the data classification relies on an ontological reasoning process. The ontology that we use as the knowledge model is called OntoCity4
and contains domain knowledge about generic spatial constraints in outdoor environments. OntoCity, whose representational details are in part explained by Alirezaie et al., 2017 [1], is an extension of the GeoSPARQL ontology, a standard vocabulary for geospatial data [19]. The main idea behind designing OntoCity was to develop a generalized knowledge model for representing cities in terms of their structural, conceptual, and physical aspects. Fig. 3 illustrates a Protégé [24] snapshot of the hierarchy of concepts defined in OntoCity.
A snapshot of the hierarchy of concepts in OntoCity. The city features are defined as the subclasses of the
The class
The class
Spatial relations in OntoCity include the RCC-8 (Region Connection Calculus) relations defined by Cohn et al., 1997 [9] and adopted by GeoSPARQL, with some extensions. The extensions include the definition of the relation
Spatial relations are used in the form of spatial constraints to provide meaning to the city features. City features are categorized into several types defined as the subclasses of
Furthermore, the two other classes
As shown in Fig. 3, each of the subclasses of the class
For each location in a city (or, in general, on the ground) there are two elevation values, namely the absolute elevation and the relative elevation. The absolute elevation is measured from sea level (height value zero), whereas the relative elevation of a specific location indicates its height w.r.t. the ground level in its vicinity. By a non-flat region we refer to landmarks of a city with a non-zero relative elevation value (see axiom (12)). Due to its height, a non-flat region is also assumed to cast shadows. As shown in axiom (22), the concept of shadow has also been defined in OntoCity (
The textures of regions (i.e., landmarks) are defined as subclasses of the class
Each of these region-type classes can have more than one superclass. For instance, a railroad is defined as a flat (not high) man-made region which is used as a way (i.e., route) in a city. Given this definition, the three main constraints in the definition of the concept
The RCC-8 relations are used to complete the definition of the region types or to describe more specific features (e.g. bridges, shadows, shores) whose definitions rely on their spatial relations with their vicinity. For instance, as defined in axiom (20), a railroad in a city is expected to intersect (more specifically, externally connect) with at least one, and only with, features that are either a vegetation area, dirt, or ground.
Likewise, a bridge is a man-made non-flat region that is partially overlapping (referring to the RCC-8 relation
As one of the non-physical (conceptual) features defined in OntoCity, the concept of shadow is a spatial feature with a dynamic and mobile geometry (i.e., changing depending on the time of the day). Although the exact shape of shadows and their exact positions depend on many quantitative parameters, including the position of the light source and the height of the casting objects, it is still possible to qualitatively describe shadows in the ontology. The definition of the concept shadow in OntoCity is more precise because it also contains a spatial constraint saying that, for a region to be a shadow, it needs to intersect (
The OntoCity axioms mentioned in the previous sections are a subset of general knowledge that always holds regardless of the city under study (e.g. “Water bridges cross water areas”). However, depending on the case study, the background knowledge might be specialized to represent features belonging to a specific environment (e.g. “in the given region there is no building connected to water areas”).
The areas under study, as shown in Section 3.2, comprise the central part of Stockholm and the smaller city of Boden in the north of Sweden. The following spatial constraints are valid for both of these cities, which is why they have been added to the version of OntoCity used in our case:
Buildings are directly connected to at least one road or vegetation area (referring to the connected relation in RCC-8:
Buildings do not intersect with railroads (referring to the negation of the
Buildings are not directly connected to water areas (referring to the negation of the externally connected relation in RCC-8:
Buildings are not directly connected to railroads (referring to the negation of the externally connected relation in RCC-8:
Buildings are not contained by roads (referring to the negation of the tangential proper part relation in RCC-8:
Buildings do not contain roads (referring to the negation of the tangential proper part inverse relation in RCC-8:
Railroads are not directly connected to water areas (referring to the negation of the
The following axiom shows the DL definition of the class
The spatial constraints used in the definition of classes are considered by a reasoner in order to discard the impossible labels (region types) for a region based on its neighborhood.
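As an illustration of how such constraints can discard impossible labels, the following sketch encodes a few of the constraints above as forbidden (region type, neighbour type, RCC-8 relation) triples. The encoding and names are ours for illustration, not OntoCity's actual axioms, and the relations are abbreviated as strings (EC = externally connected, TPP = tangential proper part):

```python
# Hypothetical encoding of a few OntoCity-style constraints:
# each triple forbids an RCC-8 relation between two region types.
FORBIDDEN = {
    ("building", "water",    "EC"),   # buildings not connected to water areas
    ("building", "railroad", "EC"),   # buildings not connected to railroads
    ("building", "road",     "TPP"),  # buildings not contained by roads
    ("railroad", "water",    "EC"),   # railroads not connected to water areas
}

def consistent(label, neighbours):
    """neighbours: iterable of (neighbour_label, rcc8_relation) pairs."""
    return all((label, nl, rel) not in FORBIDDEN for nl, rel in neighbours)

def candidate_labels(labels, neighbours):
    """Keep only the region types that satisfy all neighbourhood constraints."""
    return [l for l in labels if consistent(l, neighbours)]
```

For a region externally connected to water, the label building is discarded while road survives.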
Semantic augmentation of errors
The size of each patch of data (either testing or training data for Boden or Stockholm) is
The output of the classifier is in the form of labeled pixels. Given a set of pixels carrying the same class label, the semantic referee, within a geometrical process,5
The extraction process is done using the built-in MATLAB function bwboundaries, which traces region boundaries in binary images.
To get the discrete classification of each pixel, an argmax is applied to the softmax layer (which outputs class probabilities). The predicted class of a region is the mode (most frequent value) of the per-pixel argmax over the pixels in the region. The classification certainty of the region is then the average, over the region's pixels, of the probability of the predicted region class.
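The per-region class and certainty computation described above can be sketched as follows (an illustrative Python version; the paper's implementation is in MATLAB):

```python
import numpy as np

def region_class_and_certainty(probs, mask):
    """probs: (H, W, C) softmax output; mask: (H, W) boolean region mask.
    Region class = mode of the per-pixel argmax inside the region;
    certainty = mean probability of that class over the region's pixels."""
    pixel_classes = probs.argmax(axis=-1)[mask]
    region_class = np.bincount(pixel_classes).argmax()   # mode of the argmax
    certainty = probs[mask, region_class].mean()
    return int(region_class), float(certainty)
```

A region with certainty below some threshold (the paper uses 70%) would then be treated as likely misclassified.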
Given the output of the classification together with ontological knowledge about city features, the reasoner as a semantic referee semantically augments the errors based on the content of the ontology. The process is composed of several steps which are in brief captured in Algorithm 1.

Error semantic augmentation
The algorithm accepts as input the list of segments (S) and the lists of both classified (R) and misclassified (P) regions in the form of polygons. For each segment, the algorithm extracts all the classified (
Given the two lists of polygons
For each pair
To find a general description indicating why the classifier has been confused, the characteristics of the errors are generalized based on their frequency. If we assume that the pair
By applying the ontological reasoner the query can also be further generalized from type T to its super-classes in OntoCity (see line 13). The concept (C) as a spatial feature (C ⊑
The computational complexity of the algorithm has the order of magnitude
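The frequency-based generalization of error characteristics described above amounts to selecting the most common (relation, neighbour type) pair among all misclassified regions; a minimal sketch (names are illustrative):

```python
from collections import Counter

def most_common_characterization(error_relations):
    """error_relations: list of (rcc8_relation, neighbour_type) pairs, one per
    misclassified region. Returns the most frequent pair and its count, i.e.
    the generalized description of why the classifier is confused."""
    return Counter(error_relations).most_common(1)[0]
```

For the Stockholm errors, this would surface a pair such as (externally connected, building), matching the shadow explanation discussed in Section 4.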
In this work, the reasoner provides feedback to the classifier in the form of additional information that is appended to the original RGB training data. The additional information is represented as images of the same size as the input image, each describing a certain property of a concept for every pixel of the original data. We have selected the following three concepts for the reasoner to give feedback about: shadow estimation, height estimation, and uncertainty information. This means that the input to the classifier has 6 color channels (3 RGB channels + 1 shadow estimation channel + 1 height estimation channel + 1 uncertainty information channel) instead of the original 3 RGB channels. The classifier is then re-trained on this new input and provides a new semantic segmentation for the reasoner to reason about. The three concepts are described in more detail below.
The first channel describes the presence of shadow, which is one major cause behind many of the misclassifications. There is a fair amount of research with a focus on shadow detection in the fields of computer vision and pattern recognition [30]. In order to report the concept of shadow back to the classifier, we first need to localize shadows on the map. Although neither OntoCity nor other available ontologies contain a formal representation for calculating the location of shadows, the semantic referee's explanations provide enough insight for us to develop the reasoner to localize them. The values for this channel are −1 (not shadow), 0 (no opinion), and 1 (shadow). Another property that influences the classification and might cause misclassifications is elevation (second channel). Since the elevation difference between regions is one of the main parameters in casting shadows, we assign each region a relative elevation value equal to the average of its pixels' elevation values. Given this elevation value together with the types and spatial relations of the regions in the neighborhood of each misclassified region, the reasoner localizes the shadows as the group of pixels of the misclassified region with the lowest elevation value with respect to the elevation values of the regions intersecting it. The values for this channel are −1 (uncertain), 0 (low height), 1 (medium height), and 2 (high height). Finally, the third channel is dedicated to the pixels of those uncertain regions whose spatial relations with their neighborhood were found inconsistent w.r.t. OntoCity's constraints. The values for this channel are 0 (no opinion) and 1 (uncertain).
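Constructing the 6-channel input from the RGB image and the three referee channels is a simple depth concatenation; a sketch (ours, not the paper's MATLAB code), using the channel value codings described above:

```python
import numpy as np

def augment_input(rgb, shadow, height, uncertainty):
    """Stack the three referee channels onto the RGB image.
    rgb: (H, W, 3); referee channels: (H, W) arrays with the coded values
    described in the text (e.g. shadow: -1 not shadow, 0 no opinion, 1 shadow).
    Returns a (H, W, 6) float array fed to the classifier."""
    channels = [rgb.astype(np.float32)]
    for c in (shadow, height, uncertainty):
        channels.append(c.astype(np.float32)[..., None])  # add channel axis
    return np.concatenate(channels, axis=-1)
```

In the first training round the three referee channels would simply be all-zero ("no opinion") arrays.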
Empirical evaluation
Error characterization
Since the ground truth is available for our data, it is possible to calculate the certainty of misclassified regions. In this work, we use the classification certainty and select all the regions whose classification certainty is less than 70% and consider them as (likely) misclassified regions. Given both the classified regions and the misclassified regions, as explained in Section 3.5, the reasoner is able to conceptualize the errors. The conceptualization process is based on extracting the spatial relations of the misclassified regions with their segmented neighborhood. This step has been implemented using the open-source JTS Topology Suite,6
A summary of the results for Stockholm and Boden is shown in Table 3 and Table 4. Each cell of the tables gives the number of misclassified regions that are in a given spatial relation (column header) with all regions of a specific type (row header). Given the Stockholm test data classification outputs, the reasoner, in order to find a representative feature of the misclassified regions, considers the pair
Summary of the inconsistent spatial features of errors in classification of
Summary of the inconsistent spatial features in classification of
Given the pair

Two examples of the Stockholm test set classification output along with their input RGB image, classified segmentation and the misclassification. The misclassified regions marked with numbers are in spatial relations with buildings, roads, vegetation, etc. The ontological reasoner can augment the misclassification with the label shadow.
The Description Logic (DL) syntax of the query given to the reasoner is ∃
Fig. 4 illustrates two samples from the Stockholm test set classification output where the misclassified regions are marked in red. In the first row, the areas marked with numbers 1 and 2 are misclassified as water. As the RGB image on the left shows, the misclassified regions (in red) are (externally) connected to buildings that cast shadows. In the second row, the area marked with number 1 is likewise misclassified as water. This area is again (externally) connected to a building. It is also located between (i.e., connected with) at least two regions labeled as roads that are disconnected at the shadow area. This combination explains the second most observed relation listed in Table 3, between the misclassified regions and the region type
Unlike for Stockholm, in the classification of the Boden test data most of the extracted spatial relations between misclassified regions and their vicinity were found inconsistent w.r.t. the constraints defined in OntoCity (see Section 3.4.1). As shown in Table 4, 93 railroads were connected to other (misclassified) regions (e.g. buildings and water areas), a fact which according to OntoCity is inconsistent. Likewise, 164 buildings were connected to other regions, out of which 67 cases were again inconsistent according to OntoCity, while the remaining 97 cases (i.e., the consistent ones) were inferred as shadows. Moreover, 118 misclassified regions (mainly roads) were spatially contained in buildings. The ontological reasoner found these inconsistent, as according to OntoCity roads cannot be contained (surrounded) by buildings.
The following section presents the classification results on Stockholm, Boden, and UC Merced Land Use. The hardware used to train the classifiers was an i7-8700K CPU @ 3.70 GHz with a GeForce GTX 1070 GPU. Training each classifier took around 3 hours for all data sets.

Classification results on the test data for both cities with two classifiers. The classifiers are trained separately on the training data for both cities before and after using the reasoner.
Two separate classifiers were trained, one on the training data of each of the two cities used in this work. Each classifier is then applied to the test data of both cities. The classifiers were first trained using a depth concatenation of the RGB channels and three channels that represent the estimations for elevation, shadow, and uncertain areas from the reasoner, respectively. The additional channels are set to 0 for the first round of training. In subsequent iterations, the classifiers are re-trained with the feedback from the reasoner added to the three additional channels. The process is repeated until the validation accuracy has converged.
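The iterative training loop can be sketched abstractly as follows; `train_step` and `referee` are illustrative stand-ins for the actual MATLAB training and reasoning steps, which the paper does not give as code:

```python
def train_with_referee(train_step, referee, init_extra, max_rounds=10, tol=1e-3):
    """Generic sketch of the classifier/referee loop.
    train_step(extra) -> (model, val_accuracy): re-trains the classifier on
    RGB + the current extra channels and reports validation accuracy.
    referee(model) -> new extra channels (shadow, height, uncertainty).
    Names and signatures are illustrative, not the paper's code."""
    extra, prev_acc = init_extra, float("-inf")
    for rounds in range(1, max_rounds + 1):
        model, acc = train_step(extra)       # re-train with current channels
        if abs(acc - prev_acc) < tol:        # validation accuracy converged
            break
        prev_acc = acc
        extra = referee(model)               # referee updates the 3 channels
    return model, rounds                     # same round count reused at test time
```

The returned round count matters because, as noted in Section 3.1, testing repeats the same number of iterations that training used.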
The per-class and overall classification accuracy on the test sets for both classifiers, before and after re-training with the additional information from the reasoner, can be seen in Fig. 5. The overall accuracy is increased for all combinations of classifier and test data, and for almost all classes individually. The accuracy on the test data is higher if the classifier was trained on the same city and lower if it was trained on the other city. The class with the lowest accuracy when the classifier was trained on the other city is railroad; the reason can be seen by observing that the railroads have different structure and surroundings in the two cities. The reasoner improved the results significantly for the Boden test data with a classifier trained on the same city, see Fig. 5(c).
Some examples of the RGB inputs, predictions, and shadow and height estimations over three rounds for Stockholm can be seen in Fig. 6. The first round of training results in a high number of misclassifications (column 2). When the reasoner has provided shadow estimations (column 5) and elevation information (column 8), the classifier is re-trained and gives an improved classification (column 3). The process is repeated until the validation accuracy has converged and most of the misclassifications have been corrected (column 4).

RGB input (column 1), predictions from the classifier for three rounds (columns 2–4, green = vegetation, gray = road, black = building, blue = water, red = railroad), shadow estimations from the reasoner for three rounds (columns 5–7, gray = undefined, white = not shadow, black = shadow), and height estimations from the reasoner for three rounds (columns 8–10, black = low object, white = tall object).
The confusion matrix for the last round on both test sets for the classifier trained on the Stockholm training data is given in Table 5. The most difficult class is railroad, and the largest confusion is between road and railroad. The semantic referee improved the most for the class road (
Confusion matrix [%] for the test set for the classifier that was trained on Stockholm with the use of the reasoner. The numbers in parenthesis show how the result would change compared to a classifier that did not use a reasoner
Confusion matrix [%] on the test set for the classifier that was trained on Boden with the use of the reasoner. The numbers in parenthesis show how the result would change compared to a classifier that did not use a reasoner
When the classifier was trained on Boden, which has a smaller amount of training data for buildings and water but more vegetation than Stockholm, the use of a reasoner improved the accuracy for buildings by
A new classifier is trained on the UC Merced Land Use dataset. The data was randomly split into
The largest confusion, without using a reasoner, is between vegetation and non-vegetation ground; between building and pavement; water misclassified as airplanes; pavement misclassified as non-vegetation ground, cars, or airplanes; and finally, cars and ships misclassified as airplanes. With the use of a reasoner, the largest improvement is the reduction of the confusion for the vehicle classes airplane, car, and ship. The confusion between pavement and building is slightly decreased for both classes. For pavement, the confusion with non-vegetation ground and airplanes is decreased, but the confusion with cars is increased. The same can be seen for airplanes, where the confusion with pavement is greatly reduced but the confusion with cars increases. However, these increases in confusion are small compared to the largest improvement, the removal of the confusion of cars misclassified as airplanes. Finally, there is the confusion between vegetation and non-vegetation ground, where the accuracy for vegetation is increased but that for non-vegetation ground is decreased: the reasoner increases the confusion of non-vegetation ground classified as vegetation. The reason could be the broad inclusion of previous classes into non-vegetation ground, which was merged from bare soil, sand, and chaparral; these are somewhat semantically similar to grass, but not to the trees that the class vegetation also contains. A different class definition, moving grass into non-vegetation ground and simply calling it ground, could reduce the confusion between these two classes.
Confusion matrix [%] on the test set for the classifier that was trained on the UC Merced Land Use dataset with the use of the reasoner. The numbers in parentheses show how the result would change compared to a classifier that did not use a reasoner
The input image, predictions, shadow estimation, and height estimation for 6 test images from the UC Merced Land Use dataset can be seen in Fig. 7. The second column shows the predictions without a reasoner and the third column shows the predictions when using the reasoner after three iterations of training. The predictions have been averaged within each region to reduce noise. The fourth and fifth columns show the shadow and height estimations from the reasoner. The first four images show that the reasoner helps to predict cars more accurately (a class that should have a small size). It can also be seen that many regions misclassified as airplane (a class that should have a larger size) have been corrected with the use of a reasoner. There is still some confusion between pavement and non-vegetation ground, which seems not to have been captured semantically by the reasoner. The last image shows that the reasoner removes some of the confusion between the gray roofs of buildings and pavement.

Input image (column 1), predictions from classifier without and with using a reasoner (column 2–3, green = vegetation, orange = non-vegetation ground, gray = pavement, black = building, blue = water, yellow = car, purple = ship, light gray = airplane), shadow estimation from reasoner (column 4, gray = undefined, white = not shadow, black = shadow), and height estimation from reasoner (column 5, black = low height, white = large height).
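The per-region averaging mentioned above can be realized as a majority vote over each region. The function below is an illustrative sketch, not the paper's implementation; it assumes a precomputed region-id image (e.g. from an oversegmentation of the input) aligned with the per-pixel predictions:

```python
import numpy as np

def average_within_regions(pred, regions, n_classes):
    """Assign every pixel of a region the region's majority-vote class.
    `pred` and `regions` are integer arrays of the same shape; `regions`
    holds one region id per pixel."""
    out = np.empty_like(pred)
    for rid in np.unique(regions):
        mask = regions == rid
        votes = np.bincount(pred[mask], minlength=n_classes)
        out[mask] = votes.argmax()  # most frequent class in the region
    return out

# Toy example: two regions; the left one is mostly class 0,
# the right one mostly class 2.
pred = np.array([[0, 0, 1, 2],
                 [0, 1, 2, 2]])
regions = np.array([[0, 0, 1, 1],
                    [0, 0, 1, 1]])
smoothed = average_within_regions(pred, regions, n_classes=3)
```

After smoothing, every pixel in a region carries the same label, which removes the salt-and-pepper noise visible in raw per-pixel predictions.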
A summary of the overall classification accuracy for all three data sets with and without the use of a reasoner can be seen in Table 8. The overall classification accuracy is increased for all three data sets when additional information from the reasoner is used.
Overall classification accuracy for both classifiers, trained on only the Stockholm or only the Boden training data, evaluated on the test data from both cities with and without the use of a reasoner
For data sets that already contain some of the information that the reasoner can provide, it is justified to add that information directly as input to the classifier instead of estimating it with the reasoner. One such feature is height information, which could come from a Digital Surface Model (DSM); a DSM was available for the maps of Stockholm and Boden, but not for the UC Merced Land Use data set. Table 9 compares the classification accuracy of a classifier trained on RGB plus elevation from a DSM with that of a classifier trained on RGB plus elevation estimated by the reasoner. The classifier trained on the ground truth DSM gave a slightly better overall classification accuracy. However, the reasoner-provided elevation gives comparable results and is even better for some classes (vegetation and building). This shows that using a reasoner is a viable replacement for data sets that do not contain elevation information. Furthermore, a reasoner can provide other features, such as shadow.
Per-class and mean classification accuracy on the testing set on Stockholm when trained on Stockholm RGB and elevation from reasoner or with ground truth DSM
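Whether the elevation comes from a ground truth DSM or from the reasoner, it enters the classifier the same way: as an extra channel stacked onto the RGB image. A minimal sketch of such input preparation follows; the min-max normalization scheme is our assumption for illustration, not necessarily the paper's preprocessing:

```python
import numpy as np

def stack_elevation(rgb, elevation):
    """Build a 4-channel classifier input of shape (H, W, 4) from an RGB
    image (H, W, 3) and a single-channel elevation map (H, W), e.g. a DSM
    or the height estimate produced by the reasoner."""
    elev = elevation.astype(np.float32)
    # Min-max normalize elevation so its scale matches the RGB channels.
    rng = elev.max() - elev.min()
    if rng > 0:
        elev = (elev - elev.min()) / rng
    return np.concatenate([rgb.astype(np.float32) / 255.0,
                           elev[..., None]], axis=-1)

rgb = np.zeros((4, 4, 3), dtype=np.uint8)   # dummy image
dsm = np.arange(16, dtype=np.float32).reshape(4, 4)  # dummy elevation
x = stack_elevation(rgb, dsm)
```

Because the classifier only sees a generic fourth channel, swapping the DSM for the reasoner's height estimate requires no architectural change.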
So far, the reasoner does not correct misclassifications directly but rather influences the learning process of the classifier. The reason for this approach is twofold. Firstly, our objective is ultimately to learn spatial features of regions, and the reasoner is used to automate the role of a supervisor. Secondly, the reasoner as a referee is itself inherently uncertain, and may thus provide several candidate labels for a region. Therefore, the integration of the reasoning and learning components has been done in the manner described in this paper.
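This interaction can be summarized in a short sketch. The functions `train_classifier` and `run_reasoner` below are hypothetical stand-ins for the actual classifier and semantic referee (stubbed out here so the loop structure is runnable); the point is that the reasoner's estimates are fed back as extra input channels for the next training round rather than overwriting any labels:

```python
import numpy as np

def train_classifier(inputs, labels):
    # Stub: a real implementation would fit a segmentation network.
    return lambda x: labels

def run_reasoner(image, preds):
    # Stub: the real referee inspects misclassified regions and returns
    # refined estimates, e.g. a shadow map and a height map.
    return np.zeros(image.shape[:2] + (2,), dtype=np.float32)

def train_with_referee(rgb, labels, n_iterations=3):
    """Iterate classifier training with reasoner feedback as input channels."""
    extra = np.zeros(rgb.shape[:2] + (2,), dtype=np.float32)
    model = None
    for _ in range(n_iterations):
        x = np.concatenate([rgb, extra], axis=-1)  # RGB + reasoner channels
        model = train_classifier(x, labels)
        extra = run_reasoner(rgb, model(x))        # feedback, not correction
    return model

rgb = np.zeros((8, 8, 3), dtype=np.float32)
labels = np.zeros((8, 8), dtype=np.int64)
model = train_with_referee(rgb, labels)
```

Keeping the referee outside the gradient path is what allows several candidate labels per region to coexist: the classifier remains free to weigh the reasoner's uncertain suggestions against the image evidence.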
Discussion and future work
This paper has proposed a combination of a deep neural network and an ontological reasoning approach that improves the overall classification accuracy for a semantic segmentation task on RGB satellite images. By applying geometrical processing of spatial features and ontological reasoning based on knowledge about cities, the semantic referee is able to provide semantically augmented additional input to the classifier, which reduces the number of misclassifications.
It is worth mentioning that this work relies on the suggestions from the semantic referee, which depend highly on the content of the available ontologies: the richer the ontological knowledge in terms of spatial constraints, the more meaningful the explanations from the reasoner can be expected to be. It is also important to clarify that we do not categorize this work as a neural-symbolic integrated system, since the neural network algorithm is independent of the symbolic reasoning module. However, this independence can be viewed as a strength of the proposed architecture, since it allows different types of classifiers to be coupled to the reasoning system in a straightforward manner, making our approach a generic framework.
A future direction is to look into how the semantic referee could be integrated into the neural network in such a way that the interaction between the two systems is not limited to only the first and last layers but instead is part of the learning process of the hidden layers of the classifier as well. Another interesting future direction is to explore the reverse process, namely how the classifier can enhance the capabilities of the reasoner.
The source code for this work can be found at: https://github.com/marycore/SemanticRobot.
Acknowledgements
This work has been supported by the Swedish Knowledge Foundation under the research profile on Semantic Robots, contract number 20140033. The work is also supported by the Swedish Research Council.
