Abstract
The article reports on the evolution of data.open.ac.uk, the Linked Open Data platform of the Open University, from a research experiment to a data hub for the open content of the University. Entirely based on Semantic Web technologies (RDF and the Linked Data principles), data.open.ac.uk is used to curate, publish and access data about academic degree qualifications, courses, scholarly publications and open educational resources of the University. It exposes a SPARQL endpoint and several other services to support developers, including queries stored server-side and entity lookup using known identifiers such as course codes and YouTube video IDs. The platform is now a key information service at the Open University, with several core systems and websites exploiting linked data through data.open.ac.uk. Through these applications, data.open.ac.uk is now fulfilling a key role in the overall data infrastructure of the university, and in establishing connections with other educational institutions and information providers.
Introduction
The Open University (OU) has a vast presence on the WWW. With distance learning at its core, the University treats the dissemination of open-access learning material as part of its student recruitment strategy as well as its “brand” communication. The University publishes a number of websites that contain open-access content, most notably OpenLearn,1
Over more than 40 years, the OU has developed a vast amount of educational resources, a good deal of which are now online as freely accessible material. The creation of learning and teaching material is part of the regular job of OU faculties, and the production and reuse of media assets are key factors for the effective development of new courses. This activity is spread across several units of the University, each specialised in different types of assets. In particular, the units that disseminate content across social media are different from the ones that produce this content. Faculties do the actual teaching and produce content, while specialised units develop media items or are in charge of dissemination and student recruitment. Knowledge sharing and reuse in this context is a challenge in which Linked Open Data have an obvious role to play.
See the press release from November 2010:
In Section 2 we describe the data included in data.open.ac.uk, while Section 3 focuses on the design aspects and the modelling choices that have been made. The services offered by data.open.ac.uk are described in Section 4, and their usage in applications in Section 5. Technical aspects and maintenance issues are illustrated in Section 6, before concluding the article with future work.
data.open.ac.uk publishes Five-Star open data:6
See 5-star data deployment scheme,
The redirection can be forced using the URL:
Key facts about data.open.ac.uk
data.open.ac.uk includes many graphs at different stages of development. The graphs that are considered stable are officially released, documented on the website, and marked as such in the RDF description. A graph is considered to be stable when:
the process that led to the data acquisition is robust and the data provider is considered reliable (i.e. can guarantee future updates); and
the infrastructure includes a robust update mechanism that guarantees the data are updated regularly, unless they are static data that do not need to be updated.
Eleven stable graphs have been released so far through data.open.ac.uk, but many others are also available with lower degrees of support and warranty.8
The following query lists all graphs:
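A generic way to enumerate the named graphs of a SPARQL endpoint can be sketched in Python as follows. This is only an illustration under stated assumptions: the endpoint URL is a placeholder rather than the address of the data.open.ac.uk service, and the query is the standard SPARQL 1.1 idiom for listing non-empty graphs, not necessarily the one used by the authors.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute the real SPARQL endpoint URL.
ENDPOINT = "http://example.org/sparql"

# Standard SPARQL 1.1: every graph name that holds at least one triple.
LIST_GRAPHS = "SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } } ORDER BY ?g"


def query_url(endpoint: str, sparql: str) -> str:
    """Build a SPARQL Protocol GET request URL carrying the query text."""
    return endpoint + "?" + urlencode({"query": sparql})


url = query_url(ENDPOINT, LIST_GRAPHS)
```

The resulting URL can be fetched with any HTTP client, typically with an Accept header asking for a SPARQL results format.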
The dataset is collected from various sources, which may be public websites or content management systems internal to the University. Data from each source are completely remodelled so as to retain as much of the information in their original form as possible. Remodelled data are then exposed through a single SPARQL endpoint and can be queried as a whole. However, each data portion is identified by a named graph, reflecting its primary source. Graph names are resolvable URIs defined in the namespace
We shall group the graphs under six themes:
Open Educational Resources;
Scientific production;
Social media;
Organisational data;
Research project output; and
Metadata.
Open educational resources
A significant part of the information in this category is metadata about educational resources produced or co-produced by the OU. OpenLearn is the home of free learning from The Open University.9
Other open data that already exist in another form, and that are transformed and linked here, come from the Open Research Online repository (ORO).11
Content is often hosted by third-party organisations, and metadata are extracted from public APIs and aggregated into RDF. The OU publishes media on YouTube (g:youtube) and Audioboo (g:audioboo). Objects are often annotated with the courses, qualifications or OU people they relate to. Playlists and metadata about videos and audio podcasts are extracted from Web APIs, then translated and enriched to interlink with the other entities in data.open.ac.uk.
Organisational data, courses, people, news
In other cases, data are collected from internal repositories and first made public as linked data. This is the case for reference data about courses (g:course) and qualifications (g:qualification) under presentation, as well as the profiles of researchers and academic staff (g:people/profiles). The Key Information Set of the OU is published by HESA12
Higher Education Statistics Agency,
Unistats data
data.open.ac.uk also hosts data produced by research projects. At the moment there are three datasets officially published that come from two projects of the OU Faculty of Arts, namely the Reading Experience dataset15
An important requirement of any RDF database is to specify and document its structure to support external agents in automatically discovering data, detecting the characteristics of the data and possibly configuring their behaviour accordingly. It is also useful for open data to expose and document their schema so that users can make sense of it. Three data spaces are dedicated to metadata: 1) g:meta – Graph metadata using mainly VoID and the SPARQL Service Description; 2) g:ontology – Definitions of terms used, particularly those defined in the data.open.ac.uk domain; 3) g:about – Graph metadata containing links to DBpedia entities that are topics of open educational resources and other entities of the data.open.ac.uk graphs.
Links
Links between entities of different graphs enable data integration with little effort. A key example in data.open.ac.uk is the use of courses as aggregators for similar, related objects across the graphs. Courses are referenced by almost all sources in the University, thus enabling use cases such as content recommendation.17
While this use case is one of the most interesting, and applications using data.open.ac.uk do implement it in different ways, evaluating it is out of the scope of this article.
A dedicated graph includes links to DBpedia entities: they are the topics of media objects, Web pages, courses and other entities in the OU data environment. These topics are generated by DiscOU [2], an application that annotates documents of data.open.ac.uk entities with DBpedia entities. For example, OpenLearn units are made up of a number of Web pages, and video podcasts have transcripts. These are collected by DiscOU and analysed using DBpedia Spotlight21
The graph g:led includes time-related data, with links to related entities in the
Graphs, links and target datasets
Unless otherwise specified, the data released through data.open.ac.uk are licensed under a Creative Commons Attribution 3.0 Unported License.22
Several design choices for the data.open.ac.uk data models are aimed at making the data as reusable as possible. In the following, we address the different aspects arising from Linked Data design.
Design of graphs
All of the data in the repository are obtained from external sources. Triples in the data store are organised by source, so that all triples coming from a source are stored in a dedicated graph. This data management pattern is called “Graph-Per-Source” [4]. One drawback of this approach is that if the same information is contributed by several sources, it will be replicated in each of the corresponding graphs, which might lead to inconsistencies between graphs. However, being able to assess the provenance of each statement (its graph) is a core requirement for a service that publishes integrated data, and it facilitates maintenance.
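The Graph-Per-Source pattern and its trade-off can be illustrated with a minimal in-memory sketch in Python. The graph names, entity URIs and property names below are hypothetical and are not taken from data.open.ac.uk; a real deployment would rely on the named graphs of a triple store rather than on Python sets.

```python
from collections import defaultdict

# Hypothetical graph names, one per source (assumption for illustration).
G_COURSE = "http://example.org/graph/course"
G_YOUTUBE = "http://example.org/graph/youtube"


def add_source(store, graph, triples):
    """Store every triple from one source in that source's own named graph."""
    store[graph].update(triples)


def provenance(store, triple):
    """Return the sorted list of graphs (i.e. sources) asserting the triple."""
    return sorted(g for g, triples in store.items() if triple in triples)


store = defaultdict(set)
title = ("http://example.org/course/x100", "dcterms:title", "Example course")

add_source(store, G_COURSE, [title])
add_source(store, G_YOUTUBE, [
    title,  # the same statement contributed by a second source
    ("http://example.org/video/1", "ex:relatesToCourse",
     "http://example.org/course/x100"),
])

# The duplicated statement is kept once per graph, so its provenance
# can always be traced back to both sources.
sources = provenance(store, title)
```

The duplicated title shows the drawback mentioned above: if one source later changes the value and the other does not, the two graphs disagree, but each statement can still be attributed to its source.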
Design of entity URIs
Identifiers (URIs23
External identifiers are reused when available. This practice follows the “Natural Keys” pattern [4]. This is the case for courses and qualifications, but also for OU accounts and publications in ORO:
This pattern is fundamental to users who are familiar with the organisational structure of the OU, as entity codes often play a key role in the communication flow of large organisations.
A readable type description of the entity is referenced within the path of the URI. This helps classify an item at first sight, reducing the need for additional queries, e.g.:
The second example above is slightly more complex as it reflects the statement: This is a membership of this ID to an organisation that is KMi.
Often, a hierarchical URI is created by using the graph/source name at the beginning of the path, and then replicating the local identifier of the source:
The dataset includes 125 classes and 785 properties from 57 public vocabularies. The choice of terms to be used is based on the following process: (1) identify the concept to be expressed; (2) search for a widespread existing vocabulary to be used; (3) if found, use it, otherwise (4) search for a less-known vocabulary to reuse; (5) if not found, create a new term; (6) in either case, if there is no well-known term to be used, try to generalise the concept and add an additional statement with a well-known term. This approach led to the adoption of a variety of vocabularies. Sometimes information is redundant, being repeated with different properties, such as a generic well-known term and a more specific, less-known (or proprietary) term. These are consequences of the choice to favour the reuse of existing terms and of the wish to choose the best possible terms instead of being restricted to the semantics of only a few widely used ontologies.
For reasons of space, we only mention here some vocabularies that are widely used across many graphs. FOAF, SKOS, SIOC, OWL and Dublin Core are used by almost all graphs. GoodRelations is used by g:course to specify the learning offer of the University. This vocabulary is particularly useful because the OU is a decentralised institution, and students are recruited all over the world, so prices and features of the offer may differ. Media ontologies (video, audio) are also used to describe aspects of media objects. Schema.org24
Courses and qualifications are entities with a special role in strengthening the interlinking between graphs. Their codes are widely used within the University to annotate documents, media objects or Web pages. The opportunity here is to query for all content related to a given course (or qualification), or to restrict the range of values to the population of a specific graph. There is a general property, named
This set of properties allows for easy querying by filtering the source of the linked entities with basic triple patterns, without the need for further constraints on the
VoID27
All entities link to their named graphs with the property
Naming resources globally enables important facilities for data consumption and maintenance. This is why we avoid blank nodes. It is very useful to be able to identify a resource in a graph-independent way: a) any entity from a previously selected result set can be inspected in the current endpoint by only knowing its identifier, and b) it is easier to compare dataset dumps, reducing the operation to a diff on two sorted triple collections. A negative consequence of blank nodes is that the data become redundant on incremental updates: because blank nodes are local identifiers, loading the same information twice adds it twice. Similarly, users who download the same data might end up with different RDF graphs, making it harder to maintain consistency in their applications. Also, we have not so far been able to identify any practical advantage in using blank nodes instead of portable identifiers in data.open.ac.uk.
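The comparison of dumps mentioned in point b) can be sketched in a few lines of Python. The example assumes that dumps are serialised as N-Triples, one statement per line, and uses hypothetical resources; it is meant as an illustration rather than as the tooling used for data.open.ac.uk.

```python
def diff_dumps(old_lines, new_lines):
    """Compare two N-Triples dumps line by line.

    Returns (added, removed) as sorted lists of triple lines.
    """
    old, new = set(old_lines), set(new_lines)
    return sorted(new - old), sorted(old - new)


# Hypothetical data: the same two facts exported twice, in a different order.
named_v1 = [
    '<http://example.org/person/1> <http://xmlns.com/foaf/0.1/name> "Ann" .',
    '<http://example.org/person/2> <http://xmlns.com/foaf/0.1/name> "Bob" .',
]
named_v2 = list(reversed(named_v1))

# With blank nodes, labels are local to each export and may change,
# so the same fact can appear as two different lines.
blank_v1 = ['_:b0 <http://xmlns.com/foaf/0.1/name> "Ann" .']
blank_v2 = ['_:genid7 <http://xmlns.com/foaf/0.1/name> "Ann" .']

named_diff = diff_dumps(named_v1, named_v2)  # no difference reported
blank_diff = diff_dumps(blank_v1, blank_v2)  # spurious add and remove
```

With global identifiers the two exports compare as equal regardless of order, whereas the blank-node variant reports a change although the information is the same; detecting that equivalence would require graph isomorphism checking rather than a plain diff.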
Another design choice was the use of RDF cardinal properties to list the positions of authors of publications. As described in the specification of RDFS: “Container membership properties may be applied to resources other than containers”. In the g:oro graph, container membership properties are used alongside
Finally, all entities have a single, untyped and not lang-tagged
Services
There is a strong commitment to provide dereferenceable, “cool” URIs.29
Other examples can be found on the
Data can be queried with SPARQL through the endpoint provided. Developers can embed a query in their code and execute it at runtime. However, this practice creates a strong dependency between the application and the database. This dependency can cause problems for developers: since they do not control the data source, they cannot know whether the query will continue to work when the data change. One practical solution to this problem was to set up an endpoint for stored queries. Developers can store their queries on the server and use a plain URL to point to the data. Maintainers can then manage the evolution of the database and inform developers of upcoming changes that might affect an existing query. This service is available only to applications developed internally to the University.
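The idea behind stored queries can be sketched with a toy registry in Python. The class, its method names and the base URL are hypothetical and do not describe the interface of the actual service; the sketch only shows how a stable URL decouples an application from the text of the query.

```python
class StoredQueries:
    """Toy registry mapping a stable name (and URL) to SPARQL text."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self._queries = {}

    def store(self, name, sparql):
        """Register or replace the query behind a name; return its URL."""
        self._queries[name] = sparql
        return self.url_for(name)

    def url_for(self, name):
        """Return the plain URL an application would use for this query."""
        return f"{self.base_url}/{name}"

    def resolve(self, url):
        """Return the SPARQL text currently stored behind a URL."""
        name = url[len(self.base_url) + 1:]
        return self._queries[name]


registry = StoredQueries("http://example.org/stored-queries/")
url = registry.store(
    "courses", "SELECT ?c WHERE { ?c a <http://example.org/Course> }")

# A maintainer adapts the query after a data model change; the URL used
# by the application stays the same.
url_after = registry.store(
    "courses", "SELECT ?c WHERE { ?c a <http://example.org/CourseV2> }")
```

The application only ever holds the URL, so a change in the data model is absorbed by editing the stored query on the server side.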
The content of the graphs is archived on a weekly basis, and the versions are made available for download from a section of the website.
The goal of exposing interlinked data on data.open.ac.uk is to make existing public data more accessible, reusable and exploitable. This can only be demonstrated through applications that make use of these data in innovative and/or cost-effective ways. Various production systems use data.open.ac.uk as a source of information. For example, the OpenLearn website queries the SPARQL endpoint to get the list of qualifications under presentation, along with related information. Similarly, a system from the Student Services Unit of the OU scans data.open.ac.uk to update the list of available courses.
An application in the OU YouTube space queries data.open.ac.uk to get related courses and qualifications as well as other open educational content. If a user is interested in, for instance, the OU YouTube video https://www.youtube.com/watch?v=NcFrxXKtoXk, the following query to data.open.ac.uk can retrieve a number of other educational resources, as well as courses on offer:
DiscOU [2] is a recommender system developed by the data.open.ac.uk team to support the discovery of open educational content similar to other online resources, such as a BBC program or a Web page. This system builds an index of the open educational resources catalogued in the data.open.ac.uk dataset, in which each resource is associated with a set of DBpedia entities that are representative of it. This index is then processed by a similarity algorithm. Two positive outcomes of this application have been recorded. First, it boosted the adoption of linked data within the University by providing an exemplary use case that is otherwise very hard to implement using legacy technologies. Second, we used the content generated by the tool to populate the graph of topics g:about, as already described in Section 2.
These are but a few of the applications developed on top of the dataset. Others are described on the data.open.ac.uk website and in an earlier paper [1].
Figure 1 displays the result of an analysis performed on server logs. This historical view displays the number of clients using data.open.ac.uk from the launch of the platform in September 2010 until today (September 2014). It shows that the number of clients has doubled since the launch, with growth particularly marked in the last two years. This gives a promising perspective on the adoption of linked open data in this context.

The diagram above displays the number of clients requesting RDF data (not HTML) on a monthly basis from the launch of data.open.ac.uk in September 2010 to September 2014. There are some visible peaks: the first in 2010-11 (679 clients), followed by 2012-08 (720) and 2013-05 (1057). The first is most probably related to the launch of data.open.ac.uk. We presume the others to be related to the release of new applications consuming data.open.ac.uk data.
The data are basically a snapshot of the status of the related information at a given time. Most of the graphs are updated on a daily basis.
Since the lifecycles of the graphs differ, the infrastructure supports three different update policies:
graph rebuild: the data are rebuilt entirely and a new version replaces the previous one (e.g. g:course);
incremental update: data are never deleted, and new content is added once available (e.g. g:bbc); and
synchronisation: changes in the source are reflected on the RDF graph as soon as possible (e.g. g:people/profile).
The data import activity is performed with a number of dedicated procedures, orchestrated according to the specific case. We can summarise the process as follows, abstracting from the specific cases:
data collection: an item is collected from a data source;
transformation: the item is inspected, the data translated into RDF and enriched, by materialising some inferences and inspecting other data sources where applicable;
update plan: depending on the update policy, the commands to perform the change are prepared (for instance to replace the whole data in a graph, or to perform a delete query and then add the content of the file);
update execution: the related graph is updated.
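The “update plan” step can be illustrated for the three update policies with a short Python sketch that prepares SPARQL 1.1 Update commands. The article does not state which commands are actually issued, so the function names, the use of SPARQL Update and the example URIs are assumptions made for illustration only.

```python
def insert_data(graph, triples):
    """Build an INSERT DATA command adding N-Triples statements to a graph."""
    body = " ".join(triples)
    return f"INSERT DATA {{ GRAPH <{graph}> {{ {body} }} }}"


def plan_update(policy, graph, triples, subject=None):
    """Return the SPARQL Update commands for one graph, in execution order.

    triples: N-Triples statements (each ending with ' .').
    subject: IRI of the changed entity, used only by 'synchronisation'.
    """
    if policy == "rebuild":
        # Replace the whole graph with the newly built version.
        return [f"DROP SILENT GRAPH <{graph}>", insert_data(graph, triples)]
    if policy == "incremental":
        # Never delete: only add the new content.
        return [insert_data(graph, triples)]
    if policy == "synchronisation":
        # Remove the stale description of one entity, then load the new one.
        delete = f"DELETE WHERE {{ GRAPH <{graph}> {{ <{subject}> ?p ?o }} }}"
        return [delete, insert_data(graph, triples)]
    raise ValueError(f"unknown update policy: {policy}")


# Hypothetical graph and data.
graph = "http://example.org/graph/course"
course = "http://example.org/course/x100"
triples = [f'<{course}> <http://purl.org/dc/terms/title> "Example course" .']

rebuild_plan = plan_update("rebuild", graph, triples)
incremental_plan = plan_update("incremental", graph, triples)
sync_plan = plan_update("synchronisation", graph, triples, subject=course)
```

Keeping planning separate from execution means the prepared commands can be inspected or logged before the live graph is touched, in line with the distinction drawn above between the update plan and the update execution.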
The datasets updated daily (either by full replacement or by incremental addition) may be out of step with their sources for up to 24 hours. However, that information changes slowly, and full timeliness is not critical for existing applications. The daily process takes less than one hour to complete, and it does not affect the running system until data replacement occurs, which takes less than 2 minutes. The information published is intended to reflect the sources as closely as possible, and inaccurate information is identified and corrected following feedback from users. A special case is the g:people/profile graph, which is updated in real time from the source content management system, so that it reacts immediately to any change users make to the privacy status of their data. When a profile is updated, the updating procedure is notified and adds the profile to a queue, which is then inspected regularly. Relevant triples are deleted with an ad-hoc query and the new version is loaded. The method guarantees full accuracy and good timeliness despite loading the live system with write transactions.33
In general, write transactions are always sequential and we did not experience significant issues with the stability or efficiency of the live system during updates.
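The queue-based synchronisation of profiles, with writes applied one at a time, can be sketched as follows. This is a simplified in-memory model in Python: the function and variable names are hypothetical, a dictionary stands in for the triple store, and an empty result from the source is used here to model a profile that has been made private.

```python
from collections import deque


def drain(queue, fetch_profile, store):
    """Apply queued profile changes one at a time (sequential writes)."""
    processed = []
    while queue:
        profile_id = queue.popleft()
        # Delete the old triples for this profile, then load the new version.
        store.pop(profile_id, None)
        triples = fetch_profile(profile_id)
        if triples:  # an empty result models a profile made private
            store[profile_id] = triples
        processed.append(profile_id)
    return processed


# Hypothetical state: p1 was edited, p2 was made private.
source = {"p1": ["name: Ann (updated)"], "p2": []}
store = {"p1": ["name: Ann"], "p2": ["name: Bob"]}
pending = deque(["p1", "p2"])

done = drain(pending, lambda pid: source.get(pid, []), store)
```

After the queue is drained, the store holds only the updated description of p1, and nothing remains for the profile that was withdrawn.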
data.open.ac.uk code is mostly written in PHP (front-end services) and Java (data importing and remodelling). The system relies on existing, open source software, especially the Fuseki server from Apache Jena.34
The repository contains more than 3,500,000 RDF triples. While this is a fairly large amount, it is far from causing scalability issues with state-of-the-art triple stores. Indeed, the data.open.ac.uk platform only rarely experiences downtime, and even that is mostly due to planned maintenance on the infrastructure, given that it is supported by a small team (officially amounting to 50% of a single developer).
data.open.ac.uk is today a reliable, constantly monitored service, whose data are updated on a daily basis. The quality of service offered has led to a steady increase in its usage. Applications are using data.open.ac.uk to obtain official information about courses and qualifications, and to discover and link to relevant content spread across the heterogeneous landscape of systems, websites and repositories of the Open University.
While data.open.ac.uk has evolved into a full-grown semantic dataset, some work is still required to make it the reference method for open data integration in the organisation. In particular, there is a need for new tools and services that can make the data we offer easier to explore, understand, query and embed in applications efficiently. Metadata are also an important asset of data.open.ac.uk. An investigation into the ways to provide provenance information for both entity resolution and SPARQL queries is ongoing. We are observing the evolution of linked CSV specifications, and considering a service that provides predefined views over the triple store listing types of objects with their properties in this format. There are plans to include new data, such as the upcoming course description using XCRI 2.0, as well as library data from the OUDL project.35
