Abstract
The catalogue of the Biblioteca Virtual Miguel de Cervantes contains about 200,000 records which were originally created in compliance with the MARC21 standard. The entries in the catalogue have recently been migrated to a new relational database whose data model adheres to the conceptual models promoted by the International Federation of Library Associations and Institutions (IFLA), in particular to the FRBR and FRAD specifications. The database content was then mapped, by means of an automated procedure, to RDF triples which mainly employ the RDA (Resource Description and Access) vocabulary to describe the entities, their properties, and their relationships. This RDF-based semantic description of the catalogue is now accessible online through an interface which supports browsing and searching the information. Due to their open nature, these public data can be easily linked to and reused in new applications created by external developers and institutions. The methods applied to automate the conversion, which build upon open-source software components, are described here.
Introduction
Applying the Resource Description Framework (RDF) to bibliographic catalogues allows libraries to publish their metadata as machine-readable linked data on the semantic web.

Fig. 1. Schematic representation of the migration and conversion process.
In parallel, modern cataloguing standards are emerging as an alternative to traditional ones (such as AACR2 [3]). For example, RDA (Resource Description and Access) is a cataloguing standard [17] for descriptive metadata supporting resource discovery. RDA follows the concepts and terminology of the Functional Requirements for Bibliographic Records (FRBR, [15]) and the Functional Requirements for Authority Data (FRAD, [24]) – and it is working to adopt the Functional Requirements for Subject Authority Data (FRSAD, [20]) – a family of models promoted by the IFLA which define the entities, relationships, and attributes that should be used to describe resources. Recently, a linked data and semantic web representation of the elements and relationships of RDA was published.
This paper describes the steps applied to automate and control the migration from a collection of MARC21 records to a set of RDF triples containing bibliographic metadata in RDA, schematically represented in Fig. 1. The process relies on the creation of a relational database according to the FRBR family of conceptual models and provides controlled generation of linked data in RDA. The implementation builds largely on currently available open-source technology.
Many libraries and organizations are in the process of transforming their legacy metadata into various RDF-based semantic descriptions, mainly FRBR-based. An early survey on FRBRization techniques was prepared by the Online Computer Library Center (OCLC) [9]. A more recent survey [7] provides a taxonomy of semi-automated techniques based on three criteria: type of FRBRization (methods), model expressiveness and specific enhancements to improve quality or performance.
Usually, the FRBRization builds an FRBR catalogue by applying mapping rules between the source bibliographic metadata and the FRBR attributes. For example, the TELPlus prototype developed an FRBR repository for the European Library [12,21] by applying rule-based interpretation of fields enhanced with cluster deduplication and evaluation metrics.
The LC Display Tool provided by the Library of Congress [27] was a simple XSLT template which transformed MARC data into XML and HTML formats. This approach can lead to very large files (due to the rich variety of relationships available in FRBR) which are difficult to visualize.
A different approach based on musical content was implemented at the Indiana University Library within the Variations project [26], where several XML schemas were used to publish FRBR records. Scherzo, a music discovery system built on these records, provides FRBR-based search.
LibFRBR is a toolkit, built on the Koha open-source integrated library system, which converts bibliographic records into FRBR structures and also provides an interface for library cataloguers [6].
FRBR-ML [30] is based on an intermediate XML model designed to ease the export of data in various semantic formats. The tool takes MARC-XML records as input and produces a set of FRBR records and their relationships. The output is semantically enriched by linking to external information sources.
The GLIMIR project [13,31] has developed software to create clusters of records describing the same creation within WorldCat, the OCLC union catalogue.
Some initiatives, such as the RDA Steering Committee (RSC) and the International Working Group on FRBR and CIDOC CRM Harmonisation, are defining metadata according to international models for user-focused linked data applications. In January 2014, the RDA Steering Committee published stable forms of the RDA elements and controlled vocabularies. These vocabularies provide elements, guidelines, and instructions based on FRBR principles: RDA defines elements for each of the FRBR entities as RDF properties and sub-properties, together with a set of RDA value vocabularies that populate specific RDA elements such as carrier type or media type.
FRBRoo, an object-oriented formulation of FRBR harmonised with the CIDOC CRM ontology, is one outcome of this harmonisation work.
An increasing number of cultural institutions are applying semantic web technologies and creating linked data services based on their catalogues; some representative initiatives are summarized below.
The Bibliothèque nationale de France published data.bnf.fr in 2011 by aggregating information about authors, works, and subjects which was previously scattered among various catalogues. These data are published in RDF using a vocabulary based on the FRBR model, where objects are referenced through ARK identifiers.
The British National Bibliography Linked Data Platform (bnb.data.bl.uk/docs) provides access to the British National Bibliography (BNB), implements the SPARQL query language [25] and delivers RDF and JSON outputs. The dataset has been modelled using existing RDF vocabularies, such as Dublin Core, the Bibliographic Ontology (BIBO), and Friend of a Friend (FOAF). Exceptionally – for example, due to insufficient granularity of those vocabularies – a new term was coined and documented. FRBR was not initially used [8], since the identification of the entities in the source MARC records required extensive work. The records were therefore normalized for improved matching and later transformed into RDF using XSLT and Jena Eyeball.
The German National Library supplies its data in the RDF standard via its Linked Data Service (LDS).
The Europeana linked data at data.europeana.eu ensure a high level of consistency and interoperability by abstracting the original data to a common format (the Europeana Data Model). Unfortunately the richness of the original descriptions is partially lost in the homogenization process.
Traditionally, the descriptive metadata of bibliographic content – stored, for example, in MARC records – were created and interpreted by humans. Even if those records followed cataloguing rules such as AACR2 and ISBD [29], the textual descriptions therein could not be easily read and interpreted by computers; see, for instance, the rich description under field 534 in Fig. 2. The FRBR family of conceptual models and the RDA specification provide a modern framework which facilitates the automatic processing of this information. However, the transformation of the old records into the new format is not an easy task and has a significant cost [2], since libraries usually host large catalogues which must be manually revised. Software tools that automate the migration process are therefore called for, and the experience of the Biblioteca Virtual Miguel de Cervantes in their implementation is described below.

Fig. 2. A MARC21 record for a novel in the catalogue.
A MARC21 record describes one entry in the bibliographic catalogue or in the authority file. An authority file compiles the unique terms and possible variations used to describe names, titles, and subjects.
The transformation of MARC records into FRBR is not a simple task [2]. Some issues are common, see [1,21], while others are particular to each library. For example, the 200,000 records in the Biblioteca Virtual Miguel de Cervantes were provided by a large number of institutions in Spain and Latin America, where varying cataloguing practices are applied. Some of the challenges, and the measures taken, are listed below.
The pre-processing applied a set of parsers (implemented in Java on top of the MARC4j library) to normalize the information contained in data fields such as titles, roles or languages.
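As an illustration, a minimal sketch of such a parser is given below. It reads records with MARC4j and trims trailing ISBD punctuation from the main title; the input file name and the specific normalization rules are assumptions made for illustration, not the library's actual code.

import java.io.FileInputStream;
import java.io.InputStream;
import org.marc4j.MarcReader;
import org.marc4j.MarcStreamReader;
import org.marc4j.marc.DataField;
import org.marc4j.marc.Record;
import org.marc4j.marc.Subfield;

public class TitleNormalizer {
    public static void main(String[] args) throws Exception {
        // "catalogue.mrc" is a hypothetical input file of binary MARC21 records.
        try (InputStream in = new FileInputStream("catalogue.mrc")) {
            MarcReader reader = new MarcStreamReader(in);
            while (reader.hasNext()) {
                Record record = reader.next();
                DataField title = (DataField) record.getVariableField("245");
                if (title == null) continue;
                Subfield a = title.getSubfield('a');
                if (a != null) {
                    a.setData(normalize(a.getData()));
                }
            }
        }
    }

    // Illustrative rule: drop trailing ISBD punctuation and collapse whitespace.
    static String normalize(String raw) {
        return raw.replaceAll("\\s*[/:;=]\\s*$", "")
                  .replaceAll("\\s+", " ")
                  .trim();
    }
}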
The FRBR family of conceptual models [15] is intended to be independent of any cataloguing code or implementation; it identifies the principal entities, their attributes and the relationships between them. The FRBR model defines the products of intellectual or artistic endeavour (work, expression, manifestation, and item) and is complemented by the FRAD model, which defines the entities responsible for the content (person, family, and corporate body), and by the FRSAD model, which defines the entities that serve as the subjects of creations (concept, object, event, and place), see Fig. 3.

Fig. 3. Entities defined in FRBR (Work, Expression, Manifestation, Item), FRAD (Person, CorporateBody, Family), and FRSAD (Concept, Object, Event, Place), with their primary relationships.

Fig. 4. Diagram of the Entity-Relationship model of the relational database.

The ontology (concepts and relations) describing the catalogue entries is based on the RDA, RDF, OWL, FOAF and Dublin Core vocabularies. Tag prefixes denote the different namespaces (source ontologies): RDA Class (rdac), Work (rdaw), Expression (rdae), Manifestation (rdam), Item (rdai), and Agent (rdaa); Resource Description Framework (rdf); Dublin Core (dc); Library of Congress Metadata Authority Description Schema (mads); Friend of a Friend (foaf); OWL Time ontology (owl-time).
Traditional data storage systems, in particular relational databases, are much more mature than semantic ones, and they offer reliable, extensively tested implementations. Inspired by the IFLA conceptual models, an Entity-Relationship (ER) model, schematically represented in Fig. 4, was defined to store the descriptive metadata of the Biblioteca Virtual Miguel de Cervantes. Some additional elements were incorporated into the model in order to address the specificities of the catalogue. For example,
As can be seen in Fig. 4, the abstract class
The application of the FRBR model to an existing MARC collection requires identifying, creating and connecting FRBR entities [1]. Once the MARC records had been normalized and enhanced through the application of the actions listed in Section 3.1, the transformation was implemented in three consecutive steps:
1. Identification of FRBR entities.
2. Extraction of relationships between entities.
3. Semi-automatic clustering of entities.
The sequential nature of the migration process allows for simple incremental construction and update.
The identification of FRBR entities required the implementation of a detailed mapping between the original metadata and the FRBR attributes, in particular for those records containing multiple references to persons, subjects or related works. Duplicates were minimized by searching for creators with similar names and compatible dates [5]; a simplified sketch of this matching heuristic is shown below. In parallel, complex subject headings were decomposed into their elementary components to reduce the number of distinct subject entities.
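The sketch below shows one plausible form of such a duplicate test. The similarity threshold, the date rule, and the use of the Apache Commons Text Levenshtein implementation are illustrative assumptions, not the criteria actually used by the library.

import org.apache.commons.text.similarity.LevenshteinDistance;

/** Heuristic duplicate test for creator records (illustrative sketch). */
public class CreatorMatcher {
    private static final LevenshteinDistance DISTANCE = new LevenshteinDistance();

    /** Two creators are merge candidates if their normalized names are close
     *  and their birth dates do not contradict each other. */
    public static boolean isLikelyDuplicate(String name1, Integer birth1,
                                            String name2, Integer birth2) {
        String a = normalize(name1);
        String b = normalize(name2);
        int d = DISTANCE.apply(a, b);
        // Assumed threshold: allow roughly one edit per ten characters.
        boolean similarNames = d <= Math.max(1, Math.min(a.length(), b.length()) / 10);
        // Dates are compatible when either is missing or both agree.
        boolean compatibleDates = birth1 == null || birth2 == null || birth1.equals(birth2);
        return similarNames && compatibleDates;
    }

    static String normalize(String name) {
        return name.toLowerCase()
                   .replaceAll("[^\\p{L}\\p{N} ]", "")
                   .replaceAll("\\s+", " ")
                   .trim();
    }
}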
The extraction of relationships identifies connections between entities, mainly those involving works.
The statement of responsibility (MARC field 245 $c) contains useful information about the persons or bodies contributing to the creation of the content, usually linking persons and expressions. Furthermore, reproduction notes (field 533) often relate a document to the source employed to create the digital version, so that both can be considered expressions of a single work. In order to extract such valuable relationships, these fields were parsed to find keywords signalling a specific role or connection, as in the sketch below.
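A minimal illustration of this keyword-driven parsing follows. The Spanish phrases and the role labels are hypothetical examples only; the actual keyword list used by the library is not reproduced here.

import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Keyword patterns for the statement of responsibility (245 $c); illustrative only. */
public class ResponsibilityParser {
    private static final Map<Pattern, String> PATTERNS = Map.of(
        Pattern.compile("traducci[oó]n de (.+)", Pattern.CASE_INSENSITIVE), "translator",
        Pattern.compile("edici[oó]n de (.+)", Pattern.CASE_INSENSITIVE), "editor",
        Pattern.compile("pr[oó]logo de (.+)", Pattern.CASE_INSENSITIVE), "writerOfPreface"
    );

    /** Returns {role, agentName} if a known pattern matches the statement. */
    public static Optional<String[]> extractRelationship(String statement) {
        for (Map.Entry<Pattern, String> entry : PATTERNS.entrySet()) {
            Matcher m = entry.getKey().matcher(statement);
            if (m.find()) {
                return Optional.of(new String[] { entry.getValue(), m.group(1).trim() });
            }
        }
        return Optional.empty();
    }
}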
Since patterns are not sufficient to interpret the whole variety of relationships between entities, a web cataloguing interface was implemented so that librarians can supervise the transformation and clustering process. The interface allows one to retrieve, modify and create relationships and supports hierarchical navigation through the FRBR structure.
A final step reorganizes the catalogue by grouping manifestations and expressions of the same work, employing data mining techniques for this purpose. Training sets including difficult cases were prepared by the cataloguing department. Preliminary inspection revealed that uniform titles were not suitable for merging expressions or manifestations of a work, since their main purpose is to provide a normalized form of the title and only secondarily to disambiguate works with identical names. As a result, many works sharing their uniform title and author were in fact different creations (for example, many unrelated documents shared the same uniform title).
The clustering process follows instead the principles of the OCLC FRBR Work-Set Algorithm [32], which identifies sets of works based on the information found in bibliographic and authority records: a key is created for every record by combining author and title and, secondarily, by using the uniform title (MARC 130) or the title combined with MARC 7XX fields. Sets then contain the works which share an identical key; a simplified sketch is given below.
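The essence of the key-based grouping can be sketched as follows. The normalization rules are simplified assumptions, and the secondary keys (uniform title, 7XX fields) are omitted for brevity.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Work-set clustering in the spirit of the OCLC FRBR Work-Set Algorithm:
 *  records sharing the same author/title key fall into the same cluster. */
public class WorkSetClustering {

    public record BibRecord(String id, String author, String title) {}

    static String key(BibRecord r) {
        return normalize(r.author()) + "|" + normalize(r.title());
    }

    static String normalize(String s) {
        return s == null ? "" : s.toLowerCase()
                                 .replaceAll("[^\\p{L}\\p{N} ]", "")
                                 .replaceAll("\\s+", " ")
                                 .trim();
    }

    /** Groups records into work sets keyed by the combined author/title string. */
    public static Map<String, List<BibRecord>> cluster(List<BibRecord> records) {
        Map<String, List<BibRecord>> sets = new HashMap<>();
        for (BibRecord r : records) {
            sets.computeIfAbsent(key(r), k -> new ArrayList<>()).add(r);
        }
        return sets;
    }
}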

Overview of the RDF output for a work in the catalogue.
Vocabularies employed in the RDF dataset
Two main approaches have been generally applied to the publication of legacy catalogues as linked data: converting the stored records into RDF on the fly, when they are requested, or generating the complete set of triples in advance and loading them into a triple store.
A parser has been implemented in Java which applies mapping rules between the FRBR database and the RDA vocabulary (classes, properties and relationships), based on the RDA recommendations. Some representative RDA elements are listed below; a minimal code sketch of the mapping step follows the list.
rdaw:titleOfTheWork links a work to the string by which the work is known.
rdae:languageOfExpression contains the language used in a particular expression.
rdam:carrierType records, for a manifestation, the format of the storage medium and the type of device required to access the content.
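As an illustration of the mapping step, the sketch below emits one RDA triple with Apache Jena. The database row, the URI pattern, and the use of the lexical form rdaw:titleOfTheWork as the property IRI (the RDA Registry also publishes canonical numeric IRIs) are assumptions made for readability.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

/** Emits RDA triples for one work row of the FRBR database (illustrative sketch). */
public class RdaTripleWriter {
    static final String DATA = "https://data.cervantesvirtual.com/";
    static final String RDAW = "http://rdaregistry.info/Elements/w/";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("rdaw", RDAW);
        Property titleOfTheWork = model.createProperty(RDAW, "titleOfTheWork");

        // Hypothetical row fetched from the relational database.
        long workId = 12345;
        String title = "La Regenta";

        Resource work = model.createResource(DATA + "work/" + workId);
        work.addProperty(titleOfTheWork, title);

        model.write(System.out, "TURTLE");
    }
}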
RDA also provides additional value vocabularies, which supply controlled terms for specific elements such as carrier type or media type.
Whenever a relationship could not be described using RDA elements, popular vocabularies were applied instead; for example, the OWL-Time ontology was employed to describe temporal information.
The output dataset adheres to established design patterns [10]. For example, the path of a resource URI provides a readable description of the entity, as shown in Table 2; a small URI-building sketch follows the table.
Design patterns followed by Uniform Resource Identifiers
Dots stand for the common prefix data.cervantesvirtual.com and the asterisk for a particular value.
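A possible implementation of such readable URIs is sketched below. The path segments shown are hypothetical; Table 2 documents the actual patterns.

import java.text.Normalizer;

/** Builds readable resource URIs following the dataset's design patterns (illustrative). */
public class UriBuilder {
    private static final String PREFIX = "https://data.cervantesvirtual.com/";

    /** e.g. work 123 titled "La Regenta" -> .../work/123/la-regenta (assumed pattern). */
    public static String workUri(long id, String title) {
        return PREFIX + "work/" + id + "/" + slug(title);
    }

    /** Lower-cases, ASCII-folds and hyphenates a title for use in a URI path. */
    static String slug(String s) {
        String folded = Normalizer.normalize(s, Normalizer.Form.NFD)
                                  .replaceAll("\\p{M}", "");
        return folded.toLowerCase()
                     .replaceAll("[^a-z0-9]+", "-")
                     .replaceAll("(^-|-$)", "");
    }
}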
Finally, the dataset has been semantically enriched by automatically linking objects to terms in other linked open datasets, for example, to DBpedia resources.
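A deliberately simple linking heuristic is sketched below: it queries the public DBpedia SPARQL endpoint for a resource whose Spanish label matches a given name. The matching strategy used in production is not described here, so this exact-label lookup is an illustrative assumption.

import org.apache.jena.query.ParameterizedSparqlString;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

/** Finds a DBpedia resource whose Spanish label matches a name (illustrative sketch). */
public class DbpediaLinker {
    public static String findCandidate(String name) {
        ParameterizedSparqlString query = new ParameterizedSparqlString(
            "SELECT ?s WHERE { ?s rdfs:label ?label } LIMIT 1");
        query.setNsPrefix("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
        query.setLiteral("label", name, "es"); // bind the name as a Spanish literal

        try (QueryExecution exec = QueryExecutionFactory.sparqlService(
                "https://dbpedia.org/sparql", query.asQuery())) {
            ResultSet results = exec.execSelect();
            if (results.hasNext()) {
                QuerySolution row = results.next();
                return row.getResource("s").getURI();
            }
        }
        return null; // no candidate found
    }
}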
The automatic procedure described in Section 3 has been applied to transform over 200,000 bibliographic records and 70,000 authority entries, generating about 15 million RDF triples which are published through the gateway data.cervantesvirtual.com. The main features of the RDF dataset, which provides a high-quality semantic description of the catalogue, are summarized in Table 3.
Some features of the RDF dataset
The RDF dataset has been evaluated using several methods:
Nearly 40 constraints were defined; they are publicly available online.
RDFUnit [19] has been used to test the Dublin Core triples.
Acceptance sampling and manual revision were performed on several hundred records.
A procedure was implemented to test that the numbers of manifestations and creators match those in the original database (a sketch follows this list).
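A minimal form of this consistency check is sketched below. The endpoint URL, the JDBC connection string, the table name, and the RDA class IRI are all assumptions made for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;

/** Checks that the published triples and the relational database agree on
 *  the number of manifestations (names and URLs below are assumed). */
public class CountValidator {
    public static void main(String[] args) throws Exception {
        String endpoint = "https://data.cervantesvirtual.com/sparql"; // assumed endpoint
        String sparql = "SELECT (COUNT(?m) AS ?n) WHERE { ?m a " +
                        "<http://rdaregistry.info/Elements/c/Manifestation> }"; // assumed IRI

        long rdfCount;
        try (QueryExecution exec = QueryExecutionFactory.sparqlService(endpoint, sparql)) {
            rdfCount = exec.execSelect().next().getLiteral("n").getLong();
        }

        long dbCount;
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/frbr");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM manifestation")) {
            rs.next();
            dbCount = rs.getLong(1);
        }

        if (rdfCount != dbCount) {
            System.err.printf("Mismatch: %d triples vs %d rows%n", rdfCount, dbCount);
        }
    }
}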
These validation procedures made it possible to identify and correct inaccuracies in the dataset. For example, the analysis of a random sample of 112 groups created by the clustering of FRBR entities found 8 false positives, where works had been grouped incorrectly. The wrong clusters mainly contained works with rather general or vague titles.
Several options to provide SPARQL access to the RDF storage were evaluated, including OpenLink Virtuoso; for an extensive comparative study of platforms, see [14]. Yasgui, a web-based SPARQL editor, was adopted as the front end for writing queries.
The maintenance of the RDF data generated through the process described above is supported by three automatic procedures for the management of the content:
1. Rebuild of all RDF triples from the database.
2. Incremental addition of new RDF triples.
3. Data backup and restore operations.
Fully rebuilding the dataset may require a few hours, but the incremental construction runs in real time and can be scheduled to run periodically so that the published data remain synchronized with the database content.
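The scheduling of the incremental step can be sketched as follows. The converter callback and the 15-minute period are illustrative assumptions; the paper does not specify the actual mechanism.

import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

/** Periodically converts database rows modified since the last run into RDF. */
public class IncrementalSync {
    private volatile Instant lastRun = Instant.EPOCH;

    /** The converter receives the window (from, to) of modification timestamps to process. */
    public void start(BiConsumer<Instant, Instant> converter) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            Instant now = Instant.now();
            converter.accept(lastRun, now); // convert rows modified in (lastRun, now]
            lastRun = now;
        }, 0, 15, TimeUnit.MINUTES); // 15-minute period is an illustrative choice
    }
}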
The traditional online access to the Biblioteca Virtual Miguel de Cervantes remains available in parallel with the new linked-data interface.
Some work remains to be done. For example, subject headings can be expressed in different languages, depending on the source library; this question has been addressed by a number of projects in the past [11], but a global solution has yet to be found. Further refinements are also needed for the recognition and extraction of implicit relationships expressed in natural language, such as named entities and temporal expressions. The description of subjects can also be enriched by creating a thesaurus based on SKOS, the W3C recommendation for the representation of subject headings. Additionally, the clustering is limited by the fact that records imported from external repositories sometimes lack sufficient metadata or are expressed in foreign languages. Finally, even though the SPARQL interface provides auto-completion for properties and relationships, further work is needed to make SPARQL easier to use for non-expert users.
