Abstract
Semantic Web technologies aim to simplify the distribution, sharing and exploitation of information and knowledge across multiple distributed actors on the Web. As with all technologies that manipulate information, there are privacy and security implications, and data policies (e.g., licenses and regulations) that may apply to both data and software artifacts. Additionally, Semantic Web technologies could contribute to the more intelligent and flexible handling of privacy, security and policy issues by supporting information integration and sense-making. In order to better understand the scope of existing work on this topic, we examine 78 articles from dedicated venues, including this special issue, the PrivOn workshop series and two SPOT workshops, as well as the broader literature that connects the Semantic Web research domain with issues relating to privacy, security and/or policies. Specifically, we classify each paper according to three taxonomies (one for each of the aforementioned areas) in order to identify common trends and research gaps. We conclude by summarising the strong focus on relevant topics in Semantic Web research (e.g., information collection, information processing, policies and access control), and by highlighting the need to further explore under-represented topics (e.g., malware detection, fraud detection, and supporting policy validation by data consumers).
Introduction
Privacy, security and the proper handling of data-related policies are topics that affect all technological areas, but have been under-explored in relation to Semantic Web technologies. Indeed, much research in the Semantic Web and Linked Data domain has focused on enabling the sharing of open datasets. However, as Semantic Web technologies and principles are gaining traction both in use cases that deal with sensitive data and in industrial applications, it is necessary to investigate the potential privacy and security issues: for example, how these technologies might cause new or more complex threats to privacy or make the security of deployed systems harder to ensure, and how managing, tracking and enforcing policies associated with data becomes more complex.
Although the widespread use of Semantic Web technologies and Linked Data leads to new security, privacy and policy-related problems, at the same time they can also be seen as part of the solution. For example, more accurate models for detecting security issues can be built through the semantic analysis of the data. Additionally, the meaningful interpretation of personal data exchanged between individuals and various other web entities could be used to empower web users to better control those interactions, and therefore better manage their online privacy. The machine-readable and machine-processable representation of data-related policies can also bring many advantages to companies through the automation of tasks related to policy-management.
The goal of this paper is to provide a brief overview of recent work on security, privacy and policy related challenges associated with Semantic Web technologies. The information presented herein is based on analysing the articles published in this special issue of the Semantic Web Journal, therefore acting as an editorial for it, as well as looking at the five editions of the Society, Privacy and the Semantic Web – Policy and Technology (PrivOn) workshop (which was collocated with the International Semantic Web Conference), two editions of the Trust and Privacy on the Social and Semantic Web (SPOT) workshop (which was collocated with the Extended Semantic Web Conference), and at other related sources. The objective of this literature review is to identify key trends, and especially new challenges that are being investigated from both the problem and the solution angles, as well as the gaps that the community needs to address. We therefore start by looking at existing classifications in the areas of security, privacy and policies. Following on from this we use the aforementioned classifications to frame our discussion of existing work on privacy, security and policies in the Semantic Web domain. Finally, we conclude by highlighting the current trends and, more importantly, the research gaps that present open challenges for privacy, security and policy research in the Semantic Web domain.

Fig. 1. A taxonomy of activities creating privacy problems, from [39].
Privacy, security and policy topics in data and information management are closely related to each other, but each is also complex and multifaceted in its own right. They each represent a wide range of issues and challenges, to which a variety of solutions have been applied in other domains. While not all of those issues and challenges might apply to Semantic Web technologies, it is worth looking at them broadly, in order to understand where works by the Semantic Web community tend to place themselves, and where gaps still exist.
A taxonomy of privacy
One of the most highly cited works used to “classify privacy” is an article entitled “A taxonomy of privacy” by Daniel Solove [39]. In that article, Solove argues (as many authors before him) that privacy is an ambiguous, polysemic and often subjective term that therefore cannot be reduced to a simple concept, and especially cannot be considered purely from the point of view of the law. Instead of proposing a definition for privacy, Solove focuses on privacy threats which, he argues, can be listed and defined in a more robust manner. This taxonomy of privacy problems is depicted in Fig. 1, where information-based activities that are known to create problems are divided into four main categories: information collection, information processing, information dissemination, and invasion.
Classification of security incidents
Security is also a broad term that can be applied to many different areas. However, considering the scope of this article, we focus here on cyber-security, which relates to security issues and challenges associated with computing devices, applications and networks. Various organisations have proposed classifications of the issues and problems associated with cyber-security, including the Software Engineering Institute [5] and the European Union Agency for Network and Information Security (ENISA) [29]. These tend to overlap and cover similar aspects, as they focus on the incidents or problems that might occur in relation to cyber-security. Here we choose to apply the taxonomy from the European Cybercrime Centre (EUROPOL) [14], as it focuses specifically on threats and issues that are related to technological systems. This taxonomy of incidents is reproduced in Table 1. Naturally, only a subset of those threats is expected to be relevant for Semantic Web technologies.
Table 1. Classification of cybersecurity incidents from EUROPOL [14]
There are several types of policies that are related to the present study. Those include privacy and security policies that strongly overlap, in their content, with the two previous classifications. We additionally consider in this category the specific tasks that are associated with the management of and compliance with policies associated with the distribution of intellectual property (IP) assets, especially software and data licenses, as well as terms of use of services and regulatory obligations. As far as we are aware, there does not exist a taxonomy of activities or issues associated with this area. We therefore take inspiration from existing literature, especially in the area of software license management, to devise a simple taxonomy of tasks associated with privacy, security, distribution and usage policies for IP assets. This taxonomy, which is presented in Table 2, is relevant for policies that relate to data or software artifacts, including services.
Collection: Existing works around semantic web security, privacy and policy
Based on the taxonomies described above, our goal is to review the privacy, security and policy research contributions associated with Semantic Web technologies. This includes both the use of Semantic technologies to support the resolution of specific privacy, security and policy issues, as well as works that tackle privacy, security and policy issues emerging from the application of semantic technologies. To do that, we create a corpus of papers and articles that directly address one or more of those aspects. We start with the works published in the Special Issue of the Semantic Web Journal on Security, Privacy and Policies (for which this article acts as editorial), namely:
We also include in this analysis all the papers presented during the PrivOn workshop series, which was co-located with the International Semantic Web Conference (ISWC) from 2013 to 2017, and relevant papers from the SPOT workshop, which was co-located with the Extended Semantic Web Conference (ESWC) in 2009 and 2010. Finally, in order to obtain relevant works outside of those specific venues, we perform several Google Scholar searches, by associating keywords strongly related to Semantic Web technologies (including Semantic Web, semantics, Linked Data, ontology, RDF, OWL, SPARQL)
Table 2. Taxonomy of tasks associated with IP distribution and usage policies
Table 3. Sample paper classification
We analyse the corpus of references collected according to the method described in the previous section by manually annotating each paper using the three taxonomies previously described. In doing so, we do not assume that any paper should only be represented by one category, or one taxonomy, as many works span across several topics with varying levels of generality. For example, the articles included in the special issue of the Semantic Web Journal are classified as depicted in Table 3.
We also add another category to indicate whether the paper or article presents an issue, challenge or problem, or a solution. Unsurprisingly, considering that most works come from computing or other strongly technical disciplines, the large majority of the references relate to works presenting solutions (66 out of 78).
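The multi-label annotation scheme described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the `Annotation` class, its field names and the three placeholder papers are hypothetical and are not taken from the actual corpus. It shows why per-taxonomy counts can sum to more than the number of papers.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Manual annotation of one paper (all field names are illustrative)."""
    paper: str
    kind: str  # "problem" or "solution"
    privacy: set = field(default_factory=set)
    security: set = field(default_factory=set)
    policy: set = field(default_factory=set)


def tally(annotations):
    """Count papers with at least one label per taxonomy, and per kind."""
    counts = Counter()
    for a in annotations:
        for taxonomy in ("privacy", "security", "policy"):
            if getattr(a, taxonomy):  # non-empty label set
                counts[taxonomy] += 1
        counts[a.kind] += 1
    return counts


# Hypothetical placeholder entries, NOT real corpus data.
corpus = [
    Annotation("paper-A", "solution",
               privacy={"Information Processing-Identification"}),
    Annotation("paper-B", "solution",
               security={"Information Security-Unauthorized access"},
               policy={"Policy enforcement"}),
    Annotation("paper-C", "problem",
               privacy={"Information Collection"},
               security={"Gathering of Information"}),
]

counts = tally(corpus)
print(dict(counts))
```

Because a paper may carry labels from several taxonomies, the privacy, security and policy tallies overlap, which is consistent with the figures reported in the following sections adding up to more than 78.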
Works with a strong focus on privacy
Also unsurprisingly, considering the nature of Semantic Web technologies and their purpose, many of the references included in our corpus relate to privacy (37 out of 78), with at least one annotation from the privacy taxonomy. A particularly frequent annotation is Information Processing–Identification. This category is mostly used to annotate works that relate to the general problem of anonymity and the anonymisation of personal data. These include, for example, [35] and [24], which demonstrate how k-anonymity can be applied to RDF datasets, [38], which applies differential privacy to RDF data from social networks, and [30], which looks into the problem of breaking the anonymisation of datasets through record linkage. In contrast, [12] and [2] relate more to Information Processing–Exclusion, which involves empowering citizens with transparency and control over the processing and sharing of personal data that concerns them.
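To make the k-anonymity criterion concrete, the following minimal sketch checks it over a list of (subject, predicate, object) triples. It implements the generic textbook definition (every combination of quasi-identifier values must be shared by at least k subjects), not the specific methods of [35] or [24]. The function name, the `ex:` identifiers and the sample data are all hypothetical, and multi-valued predicates are simplified to a single value per subject.

```python
from collections import Counter, defaultdict


def is_k_anonymous(triples, quasi_identifiers, k):
    """True if each combination of quasi-identifier values is shared
    by at least k subjects (generic k-anonymity definition)."""
    # Collect the quasi-identifier values of each subject.
    profiles = defaultdict(dict)
    for s, p, o in triples:
        if p in quasi_identifiers:
            profiles[s][p] = o  # simplification: last value wins
    # One signature per subject; a missing value becomes None.
    signatures = Counter(
        tuple(profile.get(q) for q in quasi_identifiers)
        for profile in profiles.values()
    )
    return all(n >= k for n in signatures.values())


# Hypothetical toy dataset, for illustration only.
triples = [
    ("ex:p1", "ex:ageRange", "30-39"), ("ex:p1", "ex:postcode", "MK7"),
    ("ex:p2", "ex:ageRange", "30-39"), ("ex:p2", "ex:postcode", "MK7"),
    ("ex:p3", "ex:ageRange", "40-49"), ("ex:p3", "ex:postcode", "MK7"),
    ("ex:p1", "ex:diagnosis", "flu"),  # sensitive value, not a quasi-identifier
]

print(is_k_anonymous(triples, ["ex:ageRange", "ex:postcode"], 2))
print(is_k_anonymous(triples, ["ex:postcode"], 3))
```

In this toy dataset, `ex:p3` is the only subject in the 40-49 age range, so the first check fails, while the postcode alone is shared by all three subjects. Anonymisation approaches typically generalise or suppress values until such a check passes.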
Another common category addressed by works in our corpus is Information Collection. While the two sub-categories Surveillance and Interrogation are rarely mentioned specifically, many works have used Semantic Web technologies to help users of online services understand how and for what purpose data about them is being collected. These include, for example, [22], which describes a tool to keep a record of the trackers encountered in a web user’s everyday browsing, [13], which looks more generally at transparency in data sharing on the web, and [1], which looks specifically at restricting the collection of location data based on semantics and sensitivity.
Besides the aforementioned groups, several works, such as [32] and [6], look at privacy from a broader perspective, especially connecting privacy issues around Information Dissemination–Increased accessibility with the communication or interpretation of privacy policies and privacy preferences.
Works with a strong focus on security
Although security is rarely considered a core topic for Semantic Web research, many of the works in our corpus (46 out of 78) relate, in one way or another, to the topic of security. Most of those, however, focus entirely on the area of Information Security, with strong overlaps with the privacy and policy topics. Indeed, the large majority of the security references are classified under Information Security–Unauthorized access, as they propose access control solutions either for Semantic Web related information [18,27,33] or for other forms of data, using Semantic Web technologies [37]. Access control frameworks defined upon Semantic Web technologies and languages have been proposed to support data producers in protecting their resources from Unauthorized access, allowing for Policy enforcement [8,19,26,34] and Policy communication [17,37,40]. These approaches rely on Semantic Web languages, i.e., RDF and SPARQL, to model their access control policies and to support the enforcement of the policies by the consumers. Additionally, the Information Security category includes works that investigate using existing encryption techniques to restrict access to RDF data [16,20,25].
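The general idea of pattern-based access control over RDF data can be sketched as follows. This is a simplified, plain-Python illustration and not the design of any of the cited frameworks, which express their policies in RDF and SPARQL. The `Rule` class, the role names, the default-deny and deny-overrides conventions, and the sample graph are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical access rule: a triple pattern plus an effect for a role."""
    role: str
    effect: str  # "allow" or "deny"
    subject: Optional[str] = None    # None acts as a wildcard
    predicate: Optional[str] = None
    obj: Optional[str] = None

    def matches(self, triple):
        pattern = (self.subject, self.predicate, self.obj)
        return all(want is None or want == got
                   for want, got in zip(pattern, triple))


def authorised_view(triples, rules, role):
    """Triples the role may read: default deny, and deny overrides allow."""
    visible = []
    for t in triples:
        effects = {r.effect for r in rules if r.role == role and r.matches(t)}
        if "allow" in effects and "deny" not in effects:
            visible.append(t)
    return visible


# Hypothetical toy graph and policy, for illustration only.
graph = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "ex:salary", "50000"),
    ("ex:bob", "foaf:name", "Bob"),
]
rules = [
    Rule("employee", "allow"),                        # everything...
    Rule("employee", "deny", predicate="ex:salary"),  # ...except salaries
    Rule("guest", "allow", predicate="foaf:name"),    # names only
]

print(authorised_view(graph, rules, "employee"))
print(authorised_view(graph, rules, "guest"))
```

Filtering at the level of individual triples is only one possible granularity; frameworks may equally protect named graphs or whole datasets, and the conflict resolution strategy (here, deny overrides) is a design choice rather than a given.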
Another interesting area of Information Security, in which varied works can be found, is the use of ontologies as a basis for modeling, analysing and detecting security issues. In most of those cases, ontologies are used as the knowledge base of an expert system, a representation schema or an annotation vocabulary for a complex, knowledge-intensive security issue such as Infection–Malware detection/analysis [4,42] or Intrusion detection [7,28].
Interestingly, other common security topics related to the Gathering of Information or Abusive Content (and, to an extent, privacy), such as SPAM or phishing, are rarely mentioned, and are considered mostly within problem description papers in relation to Semantic Web technologies, as in [23,31].
Works with a strong focus on policies
With the increasing amount of (creative) content being published online, policies about IP distribution and usage are becoming more and more important, as they allow for the association of constraints relating to use and reuse. In this context, the contribution of Semantic Web technologies and languages is twofold: they may be used to support the Producer in associating machine-readable IP distribution and usage policies with the data that they are publishing on the Web, and they may support Consumers in checking whether the intended use of a certain resource published online is allowed or not. In total 33 out of 78 of the works in our corpus were associated with the policy topic, many of which were also associated either with the Information Processing or Information Security–Unauthorized access topics, indicative of the strong relationship between privacy, access control and policies research.
From the point of view of supporting and easing the activities of Producers, several approaches have been proposed in recent years. Concerning Policy communication, Rodriguez-Doncel et al. [36] proposed a dataset of over 100 licenses written in RDF, making extensive use of ODRL.
Other challenges deal with the Consumer point of view, where issues like compatibility testing and usage monitoring need to be addressed to assist Consumers in gaining a better understanding of the policies, thus supporting the compliant usage of protected resources. Works considering, for example, Usage Monitoring of data artifacts are therefore starting to appear (cf. [9,11]). Those works strongly relate to the idea of making policies understandable to the consumers of data and information services, with Semantic Web technologies having a role to play in the task of Policy Interpretation. Works such as [32] and [6] specifically address this task in the context of privacy policies, while for IP policies, current works remain limited to usage monitoring. In addition to the challenging issue of Usage Monitoring, the problem of Compatibility testing has been addressed by combining deontic logic and Semantic Web technologies and languages [21].
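As a rough illustration of what Compatibility testing involves, the sketch below combines two licenses expressed with the three deontic modalities (permission, prohibition, obligation) and reports clashes. It is a simplified set-based reading and not the method of [21]. The `License` class, the composition rules, the action names and the four licenses are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    """Hypothetical license as three sets of action names."""
    name: str
    permissions: frozenset = frozenset()
    prohibitions: frozenset = frozenset()
    obligations: frozenset = frozenset()


def compose(a, b):
    """Combine two licenses for a resource derived from both.

    Assumed rules: prohibitions and obligations accumulate, an action
    stays permitted only if both licenses permit it and neither
    prohibits it, and a conflict is an action that is both obliged
    and prohibited.
    """
    prohibitions = a.prohibitions | b.prohibitions
    obligations = a.obligations | b.obligations
    permissions = (a.permissions & b.permissions) - prohibitions
    return {
        "permissions": permissions,
        "prohibitions": prohibitions,
        "obligations": obligations,
        "conflicts": obligations & prohibitions,
    }


# Hypothetical licenses, for illustration only.
l1 = License("L1",
             permissions=frozenset({"reproduce", "distribute", "derive",
                                    "commercial_use"}),
             obligations=frozenset({"attribute"}))
l2 = License("L2",
             permissions=frozenset({"reproduce", "distribute", "derive"}),
             prohibitions=frozenset({"commercial_use"}),
             obligations=frozenset({"attribute"}))
l3 = License("L3", obligations=frozenset({"publish_source"}))
l4 = License("L4", prohibitions=frozenset({"publish_source"}))

print(compose(l1, l2))
print(compose(l3, l4)["conflicts"])
```

Under these assumptions, L1 and L2 are compatible, although the combined resource loses the permission for commercial use, whereas L3 and L4 cannot be combined because one obliges what the other prohibits. A real approach would also need to handle relations between actions (for instance, one action implying another), which this sketch ignores.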
As can be seen from the analysis described in the previous sections, and further from the annotated corpus of collected references, research work related to Semantic Web technologies has, at least for privacy and security, focused strongly on a small subset of issues and challenges. Indeed, the strong prominence of references related to controlling data collection mechanisms and access control shows that, as is often the case in primarily technological disciplines, privacy and security are often reduced to those basic issues. In security, while some works have applied Semantic Web technologies to, for example, malware, SPAM or intrusion detection, very few have tackled less computational issues such as fraud detection, and even fewer have looked at the specific security implications of Semantic Web technologies (with notable exceptions that remain, however, at a very high level).
Beyond security, the contrast between the description and study of privacy in the social sciences, which portray the issue as a complex, multifaceted and interdisciplinary notion, and its treatment in the Semantic Web literature is striking. Many of the papers reviewed consider privacy as a single, specific (and often purely technical) challenge, related most often either to identification, or to the control of either data collection or data access. Again, with some exceptions, very few works really consider the potential of Semantic Web technologies to either create or address issues such as appropriation, distortion, or, more broadly, information dissemination, and none has considered the challenges associated with invasion. While this is not necessarily surprising, considering the technological nature of Semantic Web research, its purpose, and the specific issues it tackles, it is disappointing to see that these technologies are not being used more creatively to address other challenges where their sense-making and inferential capability would no doubt have benefits. It is also disappointing that, as far as we could see from the references collected, those technologies are rarely being included in broader, interdisciplinary discussions about their potential privacy implications.
The policy part of our brief analysis stands out from the two others as being somewhat more varied. Unsurprisingly, issues of policy communication have attracted more consideration, as they fall more directly within the remit of the representation languages and formalisms of the Semantic Web. However, a few works have started to appear that use those representational capabilities to support interpreting, monitoring and reasoning upon policies (often related to privacy and access control, but also related to intellectual property management). Those works address issues of rights associated with information assets, and therefore overlap with research in legal informatics, where Semantic Web technologies have made many contributions (which are, however, mostly out of the scope of this article). There is nevertheless much work to be done, from the few starting points we encountered, on the implications of using Semantic Web technologies to support both data producers and consumers (including private individuals) in understanding, combining and interpreting policies in a meaningful and valuable way.
