
Abstract

An ever-increasing amount of event-centric knowledge is spread over multiple websites, either materialized as calendars of past and upcoming events or illustrated by cross-media items. This opens an opportunity to create an infrastructure unifying event-centric information derived from event directories, media platforms and social networks. To create such an infrastructure, EventMedia relies on Semantic Web technologies that ensure seamless aggregation and integration of disparate data sources, some of which overlap in their coverage. In this paper, we present the EventMedia dataset, composed of event descriptions associated with media and interlinked with the Linked Open Data cloud. We describe how the data has been extracted, converted, interlinked and published following the best practices of the Semantic Web.

Keywords

Events, Linked Data, social media, LODE ontology

1. Introduction

In their daily life, people naturally organize their personal data according to occurring events: holidays, weddings, birthday parties, concerts, etc. Events are indeed a natural way of referring to any observable occurrence grouping persons, places, times and activities [10]. Events are also observable experiences that are often documented by people through different media. Nowadays, social services host a large amount of information about events, illustrative media and social connections between participants. However, this information is often scattered across and locked into these services, which provide limited event coverage and no interoperability of descriptions [3]. Aggregating these heterogeneous sources into one unified platform is the aim of the EventMedia project, which leverages the benefits of Semantic Web technologies.

One vision of the Web of Data is to organize the data silos in a structured way which can be understood by machines and easily exploited by humans. This requires the use of common vocabularies for the integration of fragmentary information into logically coherent knowledge. To achieve this vision, a growing number of RDF datasets have been published in the Web of Data covering diverse domains such as digital libraries, government, health, media or more generally encyclopedic data. In this work, our goal is to introduce an event-domain RDF dataset and to investigate the underlying connections between event-centric data on the Web. While wishing to create such a dataset, we are aware that event web directories already exist, such as Last.fm (http://www.last.fm), Eventful (http://www.eventful.com), Upcoming (http://www.upcoming.org) and Facebook (http://www.facebook.com), to name a few. However, these services provide limited coverage of events and insufficient browsing options for decision support (e.g. lack of location map and media). As a solution, there is a need to create an infrastructure that enhances data exploration with the flexibility and depth afforded by Semantic Web technologies, and allows users to discover meaningful relationships amongst events. Therefore, we have created the EventMedia dataset, which is obtained from four large public event directories, namely Last.fm, Eventful, Upcoming and Lanyrd (http://www.lanyrd.com), and from two large media directories, namely Flickr and Twitter. Our strategy is to select popular sources, but this list is non-exhaustive.

The remainder of this paper is structured as follows. We explain how the data is collected (Section 2) and converted into an RDF model (Section 3). We present an overview of the EventMedia dataset in Section 4, and we describe how we interlinked it with other LOD datasets in Section 5. Then, we describe two web applications in Section 6, and we outline future work in Section 7.

2. Crawling and aggregating data

In this section, we describe how the data has been collected and interlinked either statically using a REST-based crawler or dynamically using a live extractor.

2.1. REST-based data crawler

Crawling data from multiple services is in general a time-consuming task due to the lack of harmonization across the specifications and policies of REST APIs (Application Programming Interfaces). This imposes a need to create tools providing a seamless and flexible way to crawl data from multiple services. Such tools should be able to address many tasks such as policy management, request chaining, data integration or merging response schemas. We propose a framework that supports those tasks and unifies information into a meaningful data model. This framework is composed of two main components: the Unified REST Module and the Scraping Processor, as illustrated in Fig. 1. The first module is based on a RESTful service that allows for the unification of various Web APIs by exploiting their commonality in terms of described methods, inputs and responses. Each source API (e.g. the Eventful API) is associated with a descriptor file which represents the API parameters such as root URL, API key, and a set of query objects. Each query object then represents a mapping between our REST URL pattern and the source API URL pattern, which describes a REST method and its input parameters. In order to manage request chaining, we define two types of query objects: (i) the first type is related to first-order methods used to search for the main objects such as events and media, (ii) the second type is related to other methods used to fetch the descriptions of secondary objects such as artists, locations, attendees, etc. Overall, we have created three REST methods to search for events, photos and videos, respectively. These methods take as input a set of parameters such as the original sources (e.g. last.fm, eventful, etc.) and other additional filters (e.g. category, location, date, etc.). Thus, the user can query multiple Web services in parallel by specifying the list of sources in one request.
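As an illustration of the descriptor idea, the sketch below models one source API as a descriptor holding a root URL, an API key and a set of query objects, each of which translates unified parameter names into source-specific ones. It is a minimal sketch under stated assumptions, not the framework's actual code: the class names, the `example-directory` source, its URL and its parameter names (`where`, `cat`, `when`) are all hypothetical.

```python
from dataclasses import dataclass, field
from string import Template
from urllib.parse import urlencode


@dataclass
class QueryObject:
    """Maps one unified method (e.g. "events") onto a source API URL pattern."""
    method: str               # unified method name exposed by the crawler
    url_pattern: Template     # source-specific path with a $root placeholder
    param_map: dict           # unified parameter name -> source parameter name
    first_order: bool = True  # True: searches main objects; False: secondary objects


@dataclass
class ApiDescriptor:
    """Hypothetical descriptor for one source API."""
    source: str
    root_url: str
    api_key: str
    queries: dict = field(default_factory=dict)

    def build_url(self, method: str, **params) -> str:
        query = self.queries[method]
        # Translate unified parameter names into the source API's own names,
        # silently dropping filters this source does not support.
        translated = {
            query.param_map[name]: value
            for name, value in params.items()
            if name in query.param_map
        }
        translated["api_key"] = self.api_key
        path = query.url_pattern.substitute(root=self.root_url)
        return f"{path}?{urlencode(translated)}"


# Hypothetical source: names and URL are placeholders, not a real API.
descriptor = ApiDescriptor(
    source="example-directory",
    root_url="https://api.example.org",
    api_key="YOUR_KEY",
    queries={
        "events": QueryObject(
            method="events",
            url_pattern=Template("$root/events/search"),
            param_map={"location": "where", "category": "cat", "date": "when"},
        )
    },
)

url = descriptor.build_url("events", location="Paris", category="music")
print(url)
```

Keeping the per-source differences in data (the descriptor) rather than in code is what would make adding a new API a matter of writing one more descriptor.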

Fig. 1.

The REST-based Crawler Architecture.

Besides the RESTful service, the Scraping Processor manages four important tasks. The first task enables multi-threading to reduce the amount of time usually required to query multiple web services. The remaining tasks deal with data processing, from JSON de-serialization to RDF conversion and loading into a triple store. More precisely, the retrieved data is de-serialized and exported into a common schema providing descriptions of a set of objects, namely: event, location, agent, user, photo and video. Then, we use a tag-based mapping that consumes some metadata, not only to establish links between events and media, but also to enrich their descriptions with additional information from external datasets. This framework is meant to ease the addition of new APIs used to collect events and media. It also offers other REST methods to track or stop the scraping processes. Finally, a web dashboard has been developed to offer graphical functionalities that help monitor the scraping task. It provides practical widgets to help build a query by filtering some parameters and to track the scraping process. It is available online at http://eventmedia.eurecom.fr/dashboard.
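The first two tasks of the Scraping Processor (parallel querying, then export into a common schema) can be sketched as follows. This is an illustrative sketch only: `fetch_stub` stands in for the real HTTP calls, and the record shapes and the Eventful identifier are made up for the example; only the Last.fm event identifier 3163952 comes from this paper.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_stub(source: str) -> list:
    """Stand-in for an HTTP request; returns records in a source-specific shape."""
    samples = {
        "lastfm": [{"id": "3163952", "title": "Snow Patrol"}],
        "eventful": [{"event_id": "E0-001", "name": "Snow Patrol live"}],  # made-up record
    }
    return samples.get(source, [])


def to_common_schema(source: str, record: dict) -> dict:
    """Normalize one source-specific record into a shared event description."""
    return {
        "source": source,
        "id": str(record.get("id") or record.get("event_id")),
        "title": record.get("title") or record.get("name"),
    }


def crawl(sources: list) -> list:
    """Query all sources in parallel and merge the results into one list."""
    with ThreadPoolExecutor(max_workers=len(sources) or 1) as pool:
        # pool.map yields results in the same order as `sources`.
        batches = pool.map(fetch_stub, sources)
        return [
            to_common_schema(source, record)
            for source, batch in zip(sources, batches)
            for record in batch
        ]


events = crawl(["lastfm", "eventful"])
for event in events:
    print(event)
```

Threads suit this workload because the time is dominated by waiting on remote services rather than by computation.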

Fig. 2.

The Snow Patrol Concert described with LODE ontology.

2.2. Tag-based mapping

A recent user-centric study [3] highlights the importance of media in providing visual information which supports decision making. This study motivated us to enrich event views with media by exploring the overlap in metadata between four popular websites, namely Flickr as a hosting website for photos and videos, and Last.fm, Eventful and Upcoming as rich documentation of past and upcoming events. Note that explicit relationships between events and photos exist through machine tags such as lastfm:event=XXX. We have been able to convert the descriptions of more than 1.7 million photos which are indexed by nearly 140,000 events. We further leverage machine tags to create links between various directories, such as foursquare:venue=XXX, used to link venue descriptions with the Foursquare directory (https://foursquare.com), a location-based service, and musicbrainz:artist=MBID, used to link artist descriptions with the MusicBrainz directory (http://musicbrainz.org), an open music database. Similarly, we also exploit the existing overlap between Twitter and Lanyrd (a social conference directory), where each conference is associated with a Twitter hashtag. Thus, we have been able to convert the descriptions of more than 530,000 tweets which are indexed by 1,167 conferences.
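Machine tags follow the pattern namespace:predicate=value, so the tag-based mapping can be sketched with a small parser. This is a minimal sketch rather than the code used to build the dataset: the regular expression and the function name are assumptions, and the foursquare value is left as the placeholder XXX used in the text.

```python
import re

# namespace:predicate=value, e.g. "lastfm:event=3163952"
MACHINE_TAG = re.compile(
    r"^(?P<namespace>[a-z][a-z0-9_]*):(?P<predicate>[a-z][a-z0-9_]*)=(?P<value>.+)$",
    re.IGNORECASE,
)


def parse_machine_tags(tags: list) -> dict:
    """Group machine-tag values by (namespace, predicate); ignore plain tags."""
    links = {}
    for tag in tags:
        match = MACHINE_TAG.match(tag.strip())
        if match:
            key = (match["namespace"].lower(), match["predicate"].lower())
            links.setdefault(key, []).append(match["value"])
    return links


photo_tags = ["concert", "lastfm:event=3163952", "foursquare:venue=XXX", "snowpatrol"]
links = parse_machine_tags(photo_tags)
print(links)
```

Each (namespace, predicate) key then decides which kind of link is created: an event-to-media link for `lastfm:event`, or a link to an external directory for `foursquare:venue` and `musicbrainz:artist`.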

2.3. Live data extraction

New events take place every day and people keep sharing an ever-growing amount of media. Such evolution requires real-time processing that retrieves fresh data and updates the triple store. To achieve this, we developed a live extractor which consumes the feeds provided by some Web services. More precisely, we use the Flickr feeds (http://api.flickr.com/services/feeds/photos_public.gne?tags=*:event) which contain the tag “*:event=”. Then, a scheduled process reads the feeds every 10 minutes and triggers the scraping requests accordingly to retrieve the descriptions of events and photos. In an average week, we observe 1,500 new photos and 130 new events added to EventMedia. Similarly, we also use the Lanyrd feeds (http://api.lanyrd.com/conferences), which provide fresh information about conferences, including the main hashtag required to retrieve related tweets.

3. RDF modeling

In this section, we describe our approach to generate RDF triples describing events and media using a variety of existing vocabularies such as the LODE ontology and Media Resources ontology.

3.1. The LODE ontology

The LODE ontology (http://linkedevents.org/ontology/) is a minimal model that encapsulates the most useful properties for describing events. LODE is not yet another “event” ontology per se. It has been designed as an interlingua model that solves an interoperability problem by providing a set of axioms expressing mappings between existing event ontologies. Hence, the ontology contains numerous OWL axioms stating class and property equivalences between models such as MO [9], CIDOC-CRM [2] and DOLCE, to name a few. In addition, LODE can be enhanced with mappings to other vocabularies such as Schema.org and DBpedia. Overall, the goal of LODE is to enable an interoperable modeling of the “factual” aspects of events, where these can be characterized in terms of the four Ws: What happened, Where did it happen, When did it happen, and Who was involved. “Factual” relations within and among events are intended to represent intersubjective “consensus reality” and thus are not necessarily associated with a particular perspective or interpretation. We use the LODE ontology together with properties from FOAF, Dublin Core and VCard. Our strategy is to separate events from their interpretations with an emphasis on factual aspects, a design approach that has not been considered in other event models [10]. Figure 2 depicts the metadata attached to the event identified by 3163952 on Last.fm according to the LODE ontology. More precisely, it indicates that an event of type Concert took place on the 21st of May 2012 at 12:45 PM at the Paramount Theatre featuring the Snow Patrol rock band, and that one of the attendees is the Last.fm user earthcapricor. Using the machine tag of related media, an owl:sameAs link is discovered between this event and a similar one announced on Upcoming.
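To make the shape of such a description concrete, the sketch below emits a few N-Triples for an event in the spirit of Figure 2. It is a hedged illustration, not the dataset's actual output: the property names (`atPlace`, `involvedAgent`) are taken from the LODE namespace and should be checked against the published ontology, and every resource identifier (`example-event`, `snow-patrol`, `paramount-theatre`, the `example.org` URI) is hypothetical. The temporal part of the description is omitted for brevity.

```python
LODE = "http://linkedevents.org/ontology/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
BASE = "http://data.linkedevents.org/"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialize one triple whose three terms are all URIs (N-Triples syntax)."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def event_triples(event_id: str, agent_id: str, place_id: str, same_as: str = "") -> list:
    """Describe the What / Who / Where of an event with LODE properties."""
    event = f"{BASE}event/{event_id}"
    triples = [
        triple(event, RDF_TYPE, LODE + "Event"),                             # What
        triple(event, LODE + "involvedAgent", f"{BASE}agent/{agent_id}"),    # Who
        triple(event, LODE + "atPlace", f"{BASE}location/{place_id}"),       # Where
    ]
    if same_as:
        # Link to the description of the same event in another directory.
        triples.append(triple(event, OWL_SAME_AS, same_as))
    return triples


# All identifiers below are placeholders chosen for the example.
lines = event_triples(
    "example-event",
    "snow-patrol",
    "paramount-theatre",
    same_as="http://example.org/upcoming/event/123",
)
print("\n".join(lines))
```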

3.2. Media modeling

To describe media, we re-use two popular vocabularies, namely: the W3C Ontology for Media Resources (http://www.w3.org/TR/mediaont-10/) to represent photos and videos, and SIOC (http://rdfs.org/sioc/spec/) to represent tweets, status updates, posts and slides. The Ontology for Media Resources is a core vocabulary which covers basic metadata properties to describe media resources. It also contains a formal set of axioms defining mappings between different metadata formats for multimedia. The SIOC Core Ontology provides the main concepts and properties required to describe information from on-line communities (e.g., message boards, wikis, weblogs). We use those ontologies together with properties from SIOC, FOAF and Dublin Core to convert the descriptions of photos, tweets and slides into RDF. The link between the media and the event is realized through the lode:illustrate property. Figure 3 depicts the description of photos, tweets and slides related to the ISWC 2011 conference.

Fig. 3.

RDF modeling of photos, tweets and slides associated with the ISWC 2011 Conference.

3.3. Events taxonomy

Events are generally categorized in lightweight taxonomies that provide facets when browsing event directories. We manually analyzed the taxonomy used in various sites, namely Facebook, Eventful, Upcoming, LinkedIn (http://www.linkedin.com), Eventbrite (http://www.eventbrite.com) and Ticketmaster (http://www.ticketmaster.com). Then, we used card sorting techniques in order to build a rich SKOS thesaurus of event categories. This SKOS thesaurus contains axioms expressing mapping relationships with these taxonomies, while the terms are defined in our own namespace accessible at (http://data.linkedevents.org/category).

4. EventMedia dataset

EventMedia has been a hub (http://ckan.net/package/event-media) of the Linked Data cloud since September 2010. We use the Last.fm, Eventful, Upcoming and Lanyrd APIs to convert each event description into the LODE ontology. We mint new URIs in our own namespace; for example, event URIs are rooted at (http://data.linkedevents.org/event/).

Fig. 4.

Overview of the EventMedia components.

Our dataset consists of more than 30 million RDF triples. All URIs are dereferenceable and served either as static RDF files serialized in N3 or as JSON by a RESTful API. The back-end of EventMedia consists of a Virtuoso SPARQL endpoint available at (http://eventmedia.eurecom.fr/sparql) and a RESTful API available at (http://eventmedia.eurecom.fr/rest/resource), powered by the ELDA implementation of the Linked Data API (http://code.google.com/p/linked-data-api). ELDA provides a configurable way to access RDF data using simple RESTful URLs that are translated into queries to our SPARQL endpoint. The API layer enables associating URIs with processing logic that extracts data from the SPARQL endpoint using one or more SPARQL queries and then serializes the results using the format requested by the client. Figure 4 depicts the architecture of EventMedia, and Table 1 provides an overview of the number of resources per type and source.
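For readers who want to query the endpoint directly, the sketch below builds a SPARQL Protocol GET request for events and their locations. It only constructs the URL and sends nothing; whether the endpoint is still reachable is not guaranteed, and the query assumes the `lode:Event` class and `lode:atPlace` property are used as described in Section 3.1.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://eventmedia.eurecom.fr/sparql"

QUERY = """
PREFIX lode: <http://linkedevents.org/ontology/>
SELECT ?event ?place WHERE {
  ?event a lode:Event ;
         lode:atPlace ?place .
} LIMIT 10
""".strip()


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a query as the `query` parameter defined by the SPARQL Protocol."""
    return f"{endpoint}?{urlencode({'query': query})}"


url = sparql_get_url(ENDPOINT, QUERY)
print(url)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting URL can be fetched with any HTTP client, typically with an `Accept` header naming the desired result format.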

5. Interlinking

Event directories overlap in their coverage, and it is worthwhile to discover similar events so that one description can complement another. However, discovering similar events from these overlapping but heterogeneous directories poses challenges that are well known in instance matching. In addition, we also investigate the enrichment of EventMedia with additional information from open datasets. In our approach, we favour high precision over high recall, since the cost of a missed mapping is lower than the cost of an incorrect match. Statistics about the generated linksets are accessible at (http://eventmedia.eurecom.fr/dashboard/statistics.html).

Table 1
Number of resources per type and source in EventMedia

	Event	Agent	Location	Media
Last.fm	57,258	50,150	16,471	1,425,318
Upcoming	13,114	0	7,330	347,959
Eventful	37,647	6,543	14,576	0
Lanyrd	1,167	0	439	537,091
Total	109,186	56,693	38,816	2,310,368

5.1. Interlinking of event directories

We create owl:sameAs links between events that show a high similarity in terms of their factual properties, namely: title, date, location and involved agents. It is worth noting that EventMedia is a challenging dataset due to structural heterogeneity (e.g. missing properties) and naming variations (e.g. abbreviations, misspellings, different naming conventions). The interlinking was performed using two tools: (i) SILK [4], which draws on a declarative configuration language called Silk-LSL to manually define the linkage rules; (ii) KnoFuss [8], which learns the similarity function based on a semi-supervised genetic algorithm optimizing the precision. We integrated two similarity functions into those tools, namely: a temporal inclusion metric and a string similarity metric described in [6]. The results obtained highlight the time-sensitivity of event reconciliation, since time is described differently across websites. Moreover, we note that KnoFuss achieves better performance than SILK thanks to its learning strategy. On a manually constructed gold standard of 300 matched events, KnoFuss achieves a high precision of about 95%, but only a fair recall of about 75%.
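The exact definitions of the two metrics are given in [6] and are not reproduced here. As a rough illustration of the idea only, the sketch below treats temporal inclusion as "one event starts within the other's time span, up to a tolerance" and uses token overlap (Jaccard) as a stand-in for the string similarity metric. The function names, the thresholds, the tolerance and the second event are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional


def temporal_inclusion(start_a: datetime, end_a: Optional[datetime],
                       start_b: datetime, tolerance_hours: float = 0) -> float:
    """Return 1.0 if event B starts within event A's time span (plus tolerance)."""
    tolerance = timedelta(hours=tolerance_hours)
    end_a = end_a or start_a  # no end time: treat event A as an instant
    return 1.0 if start_a - tolerance <= start_b <= end_a + tolerance else 0.0


def token_overlap(title_a: str, title_b: str) -> float:
    """Jaccard overlap of lower-cased tokens (stand-in string similarity)."""
    tokens_a, tokens_b = set(title_a.lower().split()), set(title_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def is_same_event(event_a: dict, event_b: dict,
                  title_threshold: float = 0.5, tolerance_hours: float = 2) -> bool:
    """Link two descriptions only if both the title and the time agree."""
    return (
        token_overlap(event_a["title"], event_b["title"]) >= title_threshold
        and temporal_inclusion(event_a["start"], event_a.get("end"),
                               event_b["start"], tolerance_hours) == 1.0
    )


lastfm_event = {"title": "Snow Patrol", "start": datetime(2012, 5, 21, 12, 45)}
other_event = {"title": "Snow Patrol live", "start": datetime(2012, 5, 21, 13, 30)}  # hypothetical
later_event = {"title": "Snow Patrol live", "start": datetime(2012, 5, 23, 13, 30)}  # hypothetical

print(is_same_event(lastfm_event, other_event))
print(is_same_event(lastfm_event, later_event))
```

Requiring both conditions to hold mirrors the stated preference for precision over recall: a near-identical title on a different day is not linked.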

5.2. Enrichment with Linked Data

In order to enrich EventMedia, we perform several interlinking processes using SILK to discover connections between agents and locations and Linked Data. In this context, the key challenge is to resolve naming conflicts, which requires additional features beyond the instance name. For example, to reconcile agents, we compare agents’ names and descriptions using the Jaro and Cosine functions respectively, and we set a high threshold to ensure high precision. Several datasets have been considered, such as MusicBrainz, DBpedia, Freebase and Uberblic. Hence, the agent URI which has the label “Radiohead” is interlinked with the DBpedia URI (http://dbpedia.org/page/Radiohead), which provides information about this band such as its complete discography. Similarly, the datasets selected to enrich the locations are DBpedia, Foursquare and Geonames, which host a large amount of geographical information. In the similarity function, we combine the geographical distance and the Jaro function applied to labels.
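The location case, which combines geographical distance with Jaro similarity on labels, can be sketched in plain Python as follows. The actual linkage rules were expressed in SILK and are not shown in this paper, so the thresholds (`max_km`, `min_jaro`), the haversine distance and the sample coordinates are assumptions made for the example.

```python
from math import asin, cos, radians, sin, sqrt


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        # Look for an unmatched identical character within the sliding window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def location_match(label_a, coord_a, label_b, coord_b, max_km=0.5, min_jaro=0.9) -> bool:
    """Accept a link only if the places are close AND their labels are similar."""
    close = haversine_km(*coord_a, *coord_b) <= max_km
    similar = jaro(label_a.lower(), label_b.lower()) >= min_jaro
    return close and similar


# Sample coordinates are arbitrary values chosen for the example.
near = location_match("Paramount Theatre", (47.0, -122.0), "Paramount Theatre", (47.0001, -122.0001))
far = location_match("Paramount Theatre", (47.0, -122.0), "Paramount Theatre", (48.0, -122.0))
print(near, far)
```

Combining the two signals guards against venues that share a common name in different cities, which label similarity alone would wrongly link.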

6. Event-based applications

The EventMedia dataset has been employed in several web applications designed to enable efficient browsing of an event-based space [1,5,7]. For instance, the EventMedia application [7] delivers different event-centric views (what, where, when and who) and allows users to relive experiences based on media. In fact, people wish to discover events either through invitations and recommendations, or by filtering available events according to their interests [3]. Therefore, the interface allows constraining different event properties (e.g. time, place, category) using, for example, a timeline slider control and a map that groups markers. Once an event is selected, media are presented to convey the event experience, along with social information to provide better decision support. The application is available online at http://eventmedia.eurecom.fr. Another application called Confomaton follows the same perspective with a focus on conference events. Its goal is to provide a visual summary of a scientific conference including microposts, presentation slides, photos and videos, so that attendees can catch up with what they may have missed. Confomaton is available online at http://eventmedia.eurecom.fr/confomaton.

7. Conclusion and future work

The integration of event-centric information from social services using Semantic Web technologies has given rise to EventMedia, an open dataset continuously synchronized with recent updates. Several improvements could enhance its quality and usability. Further vocabularies could be incorporated, such as the Ticket Ontology to add meaningful relationships between events and related tickets, or Allen’s vocabulary to express the temporal relationships between events at a fine-grained level. Another improvement is to enrich EventMedia using other services such as YouTube, Google+ or Facebook, so that the dataset coverage increases and more connections can be explored. Finally, we aim to develop a live interlinking framework that aligns every incoming stream of events with Linked Data in real time.

References

[1] Buschbeck, Jameson, Troncy, Khrouf, Suominen and Spirescu, A demonstrator for parallel faceted browsing, available online at http://imash.leeds.ac.uk/event/pdf/Buschbeck_1.pdf. Presented at the International Workshop on Intelligent Exploration of Semantic Data (IESD), in conjunction with EKAW 2012.

[2] Doerr, The CIDOC conceptual reference module: An ontological approach to semantic interoperability of metadata, in: AI Magazine – Special Issue on Ontology Research, Welty, ed., Vol. 24, Association for the Advancement of Artificial Intelligence, Palo Alto, CA, USA, 2003, pp. 75–92.

[3] Fialho, Troncy, Hardman, Saathoff and Scherp, What’s on this evening? Designing user support for event-based annotation and exploration of media, in: Proc. of the Workshop on Recognising and Tracking Events on the Web and in Real Life, located at the 6th Hellenic Conference on Artificial Intelligence SETN 2010, Athens, Greece, May 04, 2010, Winkler, Artikis, Kompatsiaris and Mylonas, eds, Vol. 624, CEUR Workshop Proceedings, Aachen, Germany, 2010, pp. 40–54.

[4] Jentzsch, Isele and Bizer, Silk – generating RDF links while publishing or consuming Linked Data, in: Proc. of the ISWC 2010 Posters & Demonstrations Track: Collected Abstracts, Shanghai, China, November 9, 2010, Polleres and Chen, eds, Vol. 658, CEUR Workshop Proceedings, Aachen, Germany, 2010, pp. 53–56.

[5] Khrouf, Atemezing, Rizzo, Troncy and Steiner, Aggregating social media for enhancing conference experience, in: AAAI Technical Report WS-12-02 on Real-Time Analysis and Mining of Social Stream, Zubiaga, Spina, de Rijke, Strohmaier and Naaman, eds, Association for the Advancement of Artificial Intelligence, Palo Alto, CA, USA, 2012, pp. 34–37.

[6] Khrouf and Troncy, EventMedia live: Reconciliating events descriptions in the web of data, in: Proc. of the 6th International Workshop on Ontology Matching (OM-2011) in Conjunction with the International Semantic Web Conference (ISWC2011), Bonn, Germany, October 24, 2011, Shvaiko, Euzenat, Heath, Quix, Mao and Cruz, eds, Vol. 814, CEUR Workshop Proceedings, Aachen, Germany, 2011, pp. 250–251.

[7] Khrouf and Troncy, EventMedia: Visualizing events and associated media, available online at http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/PostersDemos/iswc11pd_submission_78.pdf. Presented at the Posters & Demonstrations Track of the 10th International Semantic Web Conference (ISWC’11), Bonn, Germany, October 25, 2011.

[8] Nikolov, Uren, Motta and A.D. Roeck, Handling instance coreferencing in the KnoFuss architecture, in: Proc. of the 1st IRSW2008 International Workshop on Identity and Reference on the Semantic Web, Tenerife, Spain, June 2, 2008, Bouquet, Halpin, Stoermer and Tummarello, eds, Vol. 422, CEUR Workshop Proceedings, Aachen, Germany, 2008, pp. 53–56.

[9] Raimond, Abdallah, Sandler and Giasson, The music ontology, in: Proc. of the 8th International Conference on Music Information Retrieval, Dixon, Bainbridge and Typke, eds, Österreichische Computer Gesellschaft, Vienna, Austria, 2007, pp. 417–422.

[10] Shaw, Troncy and Hardman, LODE: Linking open descriptions of events, in: Proc. of the Semantic Web: 4th Asian Conference, ASWC 2009, Shanghai, China, December 6–9, 2009, Gómez-Pérez, Yu and Ding, eds, Lecture Notes in Computer Science, Vol. 5926, Springer Verlag, Berlin, Heidelberg, 2009, pp. 153–167.

EventMedia: A LOD dataset of events illustrated with media
