Abstract
Advances in sensor technology, together with cloud computing and big data, are generating large-scale, heterogeneous, real-time IOT (Internet of Things) data. To make full use of this data, the development and deployment of ubiquitous IOT-based applications in many aspects of daily life are urgently needed. However, the characteristics of IOT sensor data, including heterogeneity, variety, volume, and real-time arrival, make it challenging to process the data effectively. Semantic Web technologies are viewed as a key to the development of IOT. While most existing efforts focus on the modeling, annotation, and representation of IOT data, little work has addressed the background processing of large-scale streaming IOT data. In this paper, we present a large-scale real-time semantic processing framework and implement an elastic distributed streaming engine for IOT applications. The proposed engine captures and models the scenarios of different kinds of IOT applications on top of the popular distributed computing platform SPARK. Based on the engine, a typical use case on home environment monitoring is given to illustrate its efficiency. The results show that our system scales to a large number of sensor streams with different types of IOT applications.
1. Introduction
With the rapid advances in wireless sensor data collection and communication, the volume of IOT data is growing explosively, and many massive wireless sensor networks (WSNs) are being built. It is predicted that within the next decade billions of devices (Cisco predicts that the number of Internet-connected devices will be around 50 billion by 2020) [1] will generate a myriad of real-world data for applications and services in areas such as smart grids, smart homes, e-health, automotive, transport, and environmental monitoring. Such massive and widespread data can help us observe our surroundings, learn patterns, and better understand the world, which in turn will make the world more intelligent.
To make full use of sensed data and achieve deeper web intelligence, a natural next step is to unify and process IOT data with the existing mature web infrastructure and protocols. Many efforts are currently devoted to bridging the Internet of Things (IOT) and the Web. The W3C founded the Web of Things Community Group, which aims to accelerate the adoption of Web technologies, such as semantic technologies, as a basis for enabling services that combine IOT with rich descriptions of web data and the context in which they are used. Industry has also initiated the oneM2M standards (http://www.onem2m.org/), whose goal is to develop technical specifications that address the need for a common IOT service layer by reusing existing web standards and protocols, including RDF, HTTP, and REST.
However, the characteristics of IOT data, including heterogeneity, variety, volume, and real-time arrival, pose a series of challenges to organizing, publishing, and processing the sensor data effectively. Semantic Web technologies are viewed as a key to the development of IOT. Figure 1 shows the generic functional model of oneM2M for supporting semantics, as given in the oneM2M study on abstraction and semantics enablement [2]. Specifically, it serves the following purposes. First, the abstraction and semantics layer provides a way to resolve the problems of interoperability and integration within the heterogeneous world of IOT data by defining and reusing standard semantic concepts. Second, the Semantic Web provides a seamless interface that facilitates the interaction of IOT data with the existing web of data, such as Linked Data [3], DBpedia [4], LinkedGeodata [5], and various kinds of data from Web Services. Finally, the service layer provides an interface for various IOT applications through semantic processing technologies, including semantic mash-up, query, and reasoning.

Generic functional semantic model for supporting IOT applications.
Currently, most existing work aims at the annotation and definition of various WSNs by providing a corresponding description ontology. For example, ontologies such as the W3C's Semantic Sensor Network (SSN) ontology have been developed, offering a number of constructs to formally describe not only the sensor resources but also the sensor observation and measurement data [6]. However, little work has focused on the background processing of large-scale streaming IOT data. In this paper, we present a large-scale real-time semantic processing framework for IOT applications. The proposed framework captures and models the scenarios of various IOT applications. We have implemented an elastic streaming engine based on the popular large-scale distributed computing platform SPARK [7]. Based on the engine, a typical use case on home environment monitoring is given to illustrate its efficiency. The results show that our system scales to a large number of sensor streams with different types of IOT applications.
The remainder of this paper is organized as follows. Section 2 outlines related work. In Section 3, we introduce the semantic processing framework and processing engine for IOT applications. Section 4 describes a typical use case in home environment monitoring. Section 5 presents our experiments and results. Finally, we conclude the work in Section 6.
2. Related Work
To the best of our knowledge, our framework and system are the first to address the full range of semantic processing tasks for large-scale streaming IOT data, including IOT semantic mash-up, semantic query, and semantic reasoning. Related work is summarized below.
2.1. IOT Modeling and Ontology
One key research topic in IOT is how to represent the “things” with standard vocabularies and schemas. Semantic Sensor Web (SSW) is a technology in which sensor data is semantically annotated for interoperability and which also provides contextual information for situational knowledge [8]. Many works have proposed semantic models for representing sensors and their data. Ontologies such as the W3C's SSN ontology have been developed [6]. These ontologies provide metadata for numerical, spatial, temporal, and other semantic objects. Similar works on sensor metadata description include the Sensor Data Ontology (SDO) [9] and SensorML [10].
These works mainly focus on semantic annotation for the interoperability of IOT by defining a unified, standard ontology, and pay no attention to high-level semantic IOT applications.
2.2. Semantic IOT Applications
Gyrard proposes a semantic-based Machine-to-Machine Measurement approach (M3) to automatically combine, enrich, and reason about IOT data in order to provide cross-domain IOT applications, such as a naturopathy application based on multiple datasets [11]. The approach also presents a hub for cross-domain ontologies and datasets. References [12, 13] apply IOT to generic context management in agriculture and healthcare. SSEO [14] is developed to enable semantic indexing, machine-processable event detection, and data exchange for smart space modeling. Other applications include CONON [15], CoOL [16], and CoBrA [17].
Most of this work aims at building IOT applications in a specific domain and does not provide a generic semantic IOT processing framework. It also does not deal with important challenges of IOT data such as real-time processing and scalability.
2.3. Stream Processing for Semantic Data
There are several semantic stream data processing engines, including Streaming SPARQL [18], C-SPARQL [19], and CQELS [20]. Streaming SPARQL extends SPARQL to process data streams. C-SPARQL defines an extension of SPARQL whose distinguishing feature is the support of continuous queries, that is, queries registered over RDF data streams and then continuously executed. CQELS is a native and adaptive query processor for unified query processing over Linked Stream Data and Linked Data.
However, most of these works focus only on semantic queries over Linked Stream Data and ignore other common demands of IOT applications, including semantic mash-up and reasoning. Moreover, these systems are designed to run on a single machine, whereas our system goes beyond that and specifically addresses the scalability issues common to IOT applications.
3. Proposed Framework and Elastic Processing Engine
In this section, we propose the large-scale real-time semantic processing framework for IOT applications and elaborate the elastic processing engine to explain how it provides the capabilities for performing various IOT applications.
3.1. Framework
Figure 2 shows the architecture of our semantic processing framework. In general, it consists of five parts: physical entities layer, abstract entities layer, window-based data stream layer, virtual entities layer, and elastic semantic engine layer.

Large-scale real-time semantic processing framework for IOT applications.
3.1.1. Physical Entities Layer
The physical entities layer is the lowest layer of the framework and is responsible for collecting raw sensor data in real time. Every physical entity represents a tangible element that can be sensed by sensors deployed in the oneM2M Field Domain environment and that is not specific to a particular IOT application in this environment. According to the oneM2M standardization, every kind of sensor is organized by an application entity (AE) and a common services entity (CSE), which provide application logic and common services, respectively.
3.1.2. Abstract Entities Layer
The abstract entities layer is responsible for receiving data from the physical devices and abstracting them through the semantic annotation performed by proxy software. The abstraction layer aims at hiding the complexity of devices and environments by providing a standard format to represent devices. From the viewpoint of the upper layers, all the heterogeneous physical sensors can therefore be seen as unified data streams.
3.1.3. Window-Based Data Stream Layer
This layer extracts the relevant data streams into windows according to the demands of the upper-layer applications. In real-world IOT applications, data takes the form of continuous streams rather than finite datasets stored in a traditional repository. This is the case for traffic monitoring, environment monitoring, disaster management, telecommunication management, manufacturing, and many other domains. Every sensor corresponds to a window of a certain size.
3.1.4. Virtual Entities Layer
The virtual entities layer aggregates the window data required by every virtual entity. A virtual entity is a new resource created from multiple window data streams in order to accomplish an application service. In our later use case, if a user in a home requests the Discomfort Index (DI) service, a new virtual entity is generated by aggregating the corresponding home appliance sensors (such as temperature, heater, and air cleaner sensors). The DI service is then provided through this virtual entity.
3.1.5. Elastic Semantic Engine Layer
The elastic semantic engine layer is the key part of the architecture. It is responsible for receiving external requests, creating the corresponding virtual entities, interacting with the static web of data, and returning continuous results in real time. Our work mainly focuses on this layer, which we discuss in detail in the next section.
3.2. Elastic Streaming Processing Engine
Our elastic semantic streaming processing engine provides various common IOT services, including IOT semantic mash-up, query, and reasoning, by constructing the corresponding virtual entities. Every virtual entity aggregates the RDF streams from several corresponding windows. A window extracts the latest elements from a sensor stream. Besides the streaming IOT data, some applications need auxiliary background knowledge, such as Linked Open Data, DBpedia, and LinkedGeodata.
In this part, the related definitions are given first; then we elaborate on the three functional modules of the engine.
Definition 1 (RDF stream (S)).
The basic data unit for RDF stream is a quad (
Definition 2 (window (W)).
A window is a subset of the RDF streams given a time range t.
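As an illustration of Definitions 1 and 2, the following minimal Python sketch extracts a time-range window from a stream of time-stamped statements. The quad layout (subject, predicate, object, timestamp), the `ex:` URIs, and the half-open range convention are assumptions made for this sketch; they are not taken from the engine's implementation.

```python
from typing import List, NamedTuple


class Quad(NamedTuple):
    """A time-stamped RDF statement (assumed layout: triple plus timestamp)."""
    subject: str
    predicate: str
    obj: str
    timestamp: float  # seconds


def window(stream: List[Quad], now: float, t: float) -> List[Quad]:
    """Return the quads whose timestamp lies in the range (now - t, now]."""
    return [q for q in stream if now - t < q.timestamp <= now]


# Hypothetical readings from one sensor.
stream = [
    Quad("ex:sensor1", "p:value", "21.5", 1.0),
    Quad("ex:sensor1", "p:value", "21.7", 6.0),
    Quad("ex:sensor1", "p:value", "21.9", 11.0),
]

latest = window(stream, now=11.0, t=6.0)
print([q.obj for q in latest])
```

With a time range of 6 seconds evaluated at time 11, only the two most recent readings fall inside the window.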
Definition 3 (virtual window (V)).
Virtual window aggregates the related data needed by a virtual entity, which includes the corresponding window and background knowledge
3.2.1. IOT Semantic Mash-Up
Semantic mash-up is one of the most basic demands in the IOT domain, since many IOT applications rely on it. It supports new services by aggregating multiple dispersed resources. For example, “compute the indoor air quality index (AQI) of a room” is a typical mash-up application, which combines various window data sources including PM2.5, O3, and CO.
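The AQI example can be sketched as a filter step followed by a recombination step. The Python sketch below is only illustrative: the breakpoint table is hypothetical, and the piecewise-linear sub-index and the "take the worst pollutant" rule are assumptions borrowed from common air quality index conventions, not necessarily the formulas used by the engine.

```python
from typing import Dict, List, Tuple

# (conc_low, conc_high, index_low, index_high); HYPOTHETICAL values for illustration only.
Breakpoints = List[Tuple[float, float, float, float]]

HYPOTHETICAL_BREAKPOINTS: Dict[str, Breakpoints] = {
    "PM2.5": [(0.0, 50.0, 0.0, 50.0), (50.0, 150.0, 50.0, 200.0)],
    "O3": [(0.0, 100.0, 0.0, 50.0), (100.0, 300.0, 50.0, 200.0)],
    "CO": [(0.0, 5.0, 0.0, 50.0), (5.0, 30.0, 50.0, 200.0)],
}


def sub_index(pollutant: str, concentration: float) -> float:
    """Piecewise-linear interpolation of a concentration onto an index scale."""
    for c_lo, c_hi, i_lo, i_hi in HYPOTHETICAL_BREAKPOINTS[pollutant]:
        if c_lo <= concentration <= c_hi:
            return (i_hi - i_lo) / (c_hi - c_lo) * (concentration - c_lo) + i_lo
    raise ValueError(f"{pollutant} concentration {concentration} is out of range")


def mash_up(windows: Dict[str, List[float]]) -> Tuple[float, str]:
    """Filter the windows to the pollutants needed, then recombine them into one index."""
    # Filtering: keep only non-empty windows of pollutant types the index needs.
    relevant = {k: v for k, v in windows.items() if k in HYPOTHETICAL_BREAKPOINTS and v}
    # Recombination: average each window, convert to a sub-index, report the worst one.
    scores = {k: sub_index(k, sum(v) / len(v)) for k, v in relevant.items()}
    top = max(scores, key=scores.get)
    return scores[top], top


# Hypothetical window contents for one room; the temperature window is filtered out.
room_windows = {"PM2.5": [40.0, 60.0], "O3": [80.0], "CO": [2.0], "temperature": [22.0]}
print(mash_up(room_windows))
```

Returning the dominant pollutant alongside the index mirrors the kind of "top air pollutant" information a monitoring service would report.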
The mash-up task is formalized via the concepts of filtering and recombination. Given a set of RDF streams
3.2.2. IOT Semantic Query
IOT semantic query is a common function for IOT applications. It enhances the IOT discovery mechanism, to allow locating and linking resources or services based on their semantic information, such as “get the temperature of the
The query task is formalized via the concept of mapping. We denote as I, B, L, and V, respectively, the domains of IRIs, blank nodes, literals, and variables which are all disjoint. We also define
3.2.3. IOT Semantic Reasoning
Reasoning is a mechanism for deriving new implicit knowledge from semantically annotated data and for answering complex user queries. It can be implemented as a piece of software that infers logical consequences from a set of asserted facts or axioms. Many IOT services belong to this application type. For example, we can infer the Human Comfort Index from temperature and humidity, or the danger level of a gas leak.
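To make the idea of deriving new facts from asserted ones concrete, the following Python sketch applies rules repeatedly until no new fact appears (forward chaining). The rule contents, the CO threshold, and the `ex:` vocabulary are all hypothetical and serve only as an illustration; they are not the rules of Appendix B.

```python
from typing import Callable, Iterable, Set, Tuple

Fact = Tuple[str, str, object]          # (subject, predicate, object)
Rule = Callable[[Set[Fact]], Set[Fact]]  # maps known facts to derived facts

HYPOTHETICAL_CO_LIMIT = 10.0  # illustrative threshold, not a real safety limit


def co_warning(facts: Set[Fact]) -> Set[Fact]:
    """If a room's CO reading exceeds the limit, assert a warning status."""
    return {
        (s, "ex:status", "ex:COWarning")
        for (s, p, o) in facts
        if p == "ex:co" and isinstance(o, (int, float)) and o > HYPOTHETICAL_CO_LIMIT
    }


def ventilate(facts: Set[Fact]) -> Set[Fact]:
    """If a room has a CO warning, recommend ventilation."""
    return {
        (s, "ex:action", "ex:Ventilate")
        for (s, p, o) in facts
        if p == "ex:status" and o == "ex:COWarning"
    }


def infer(facts: Iterable[Fact], rules: Iterable[Rule]) -> Set[Fact]:
    """Apply all rules until a fixpoint is reached and return every known fact."""
    known = set(facts)
    rules = list(rules)
    while True:
        derived: Set[Fact] = set()
        for rule in rules:
            derived |= rule(known)
        new = derived - known
        if not new:
            return known
        known |= new


window_facts = {("ex:kitchen", "ex:co", 12.5), ("ex:bedroom", "ex:co", 1.0)}
result = infer(window_facts, [co_warning, ventilate])
print(sorted(f for f in result if f not in window_facts))
```

The second rule fires only on a fact produced by the first, which is why the rules are applied in a loop rather than once.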
The reasoning task for IOT applications can be seen as the process of applying the reasoning rules in IOT data to derive new facts. We denote as W and B, respectively, the windows of sensor streams and background knowledge. F represents a set of facts that we want to γ contains a set of rules
4. Use Case: Home Environment Monitoring
In this section, we present a common IOT use case, designed to support smart real-time monitoring of the home environment. We first give an overview of the use case and then present the data model. Finally, three concrete application examples are introduced.
4.1. Overview
People are paying increasing attention to environmental problems, since we face serious forms of environmental pollution such as smog and water pollution. To deal with these challenges, governments deploy many outdoor monitoring stations to capture real-time environmental information and publish it to the public. However, there is limited work on indoor environmental monitoring due to a lack of sensor devices and processing infrastructure.
With the popularity of smart home appliances (e.g., heater, air conditioner, humidifier, and air cleaner) equipped with environment sensors (e.g., sensors for temperature, humidity, CO, CO2, and VOC), large volumes of data on all aspects of the indoor environment are becoming available, which makes it possible to implement various home monitoring applications, including emergency detection and an indoor air quality index. Presently, many commercial companies such as Huawei, Cisco, Intel, and Telecom are planning to develop and deploy related hardware and software infrastructure to provide similar services.
Our use case considers the scenario: suppose a number of households in a city have installed relevant smart appliances; they want to get a series of environment monitoring services from the supplier, including Indoor Air Pollution Index (API), Indoor Sensor Discovery, and Human Comfort Index (
4.2. Data Model
As this paper mainly focuses on the background streaming data processing for IOT applications, we do not create a complex ontology to semantically annotate all the various IOT data. Instead, we design a simple concept model for the home environment monitoring scenario (see Figure 3). The model captures three types of resources: home, room, and sensor. The label under each resource denotes its URI. “p” is the namespace of the properties. Every sensor entity has three properties: type, value, and time.

The simple concept model for home environment monitoring.
Figure 4 shows a snapshot of the stream knowledge graph based on the concept model. Every home contains multiple rooms (living room, bedroom, kitchen, and so on). Every room is equipped with 15 kinds of sensors, including temperature, humidity, illumination, volume, PM10, PM2.5, O3, CO, SO2, and NO2.

A snapshot of the stream knowledge graph.
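As a rough illustration of the concept model, the Python sketch below produces the statements for a single sensor reading. The three sensor properties (type, value, time) and the `p` namespace follow the description above; the URI scheme under `ex:` and the linking properties `p:hasRoom` and `p:hasSensor` are hypothetical names introduced only for this sketch.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def observation_triples(home: int, room: str, sensor_type: str,
                        value: float, time: int) -> List[Triple]:
    """Build the home/room/sensor statements for one reading (hypothetical URIs)."""
    home_uri = f"ex:home{home}"
    room_uri = f"{home_uri}/{room}"
    sensor_uri = f"{room_uri}/{sensor_type}"
    return [
        (home_uri, "p:hasRoom", room_uri),       # hypothetical linking property
        (room_uri, "p:hasSensor", sensor_uri),   # hypothetical linking property
        (sensor_uri, "p:type", sensor_type),     # sensor properties named in the model
        (sensor_uri, "p:value", str(value)),
        (sensor_uri, "p:time", str(time)),
    ]


reading = observation_triples(home=1, room="kitchen", sensor_type="CO", value=3.2, time=1000)
for triple in reading:
    print(triple)
```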
4.3. Scenarios
4.3.1. Streaming Semantic Mash-Up
Outdoor AQI has been available for years, while little attention has been paid to indoor AQI, which is also important for both customers and device suppliers. For customers, the indoor AQI helps them keep track of the latest condition of the house and of the top air pollutants. For suppliers, the AQI data helps them monitor their devices and better understand the needs of different customers.
The indoor AQI task is a typical IOT semantic mash-up application since it needs to integrate multiple window data sources and combine them to complete a specific work. The indoor AQI task is composed of two phases. The first step is to compute the individual AQI (IAQI) for a pollutant. Equation (1) gives the computing formula for
4.3.2. Streaming Semantic Query
The streaming semantic query provides a basic function for discovering and querying IOT resources in real time. We implement the following 6 query examples to illustrate its applications (Table 1). All streaming queries are shown in Appendix A.
IOT streaming query examples.
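A streaming query of this kind can be read as a search for variable bindings that make a set of triple patterns match the statements in the current window. The Python sketch below is a minimal, unoptimized illustration of that idea; the window contents, the `ex:` URIs, and the function names are hypothetical, and the sketch does not reproduce the queries of Appendix A.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def match_pattern(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend a binding so that the pattern matches the triple, or return None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable must keep the value it was bound to earlier, if any.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def evaluate(patterns: Sequence[Triple], window: Sequence[Triple]) -> List[Binding]:
    """Return all bindings that satisfy every pattern over the window."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        extended: List[Binding] = []
        for binding in solutions:
            for triple in window:
                merged = match_pattern(pattern, triple, binding)
                if merged is not None:
                    extended.append(merged)
        solutions = extended
    return solutions


# Hypothetical window contents.
window_data = [
    ("ex:s1", "p:type", "temperature"),
    ("ex:s1", "p:value", "21.5"),
    ("ex:s2", "p:type", "humidity"),
    ("ex:s2", "p:value", "40"),
]

# "Find every temperature sensor and its current value."
answers = evaluate([("?s", "p:type", "temperature"), ("?s", "p:value", "?v")], window_data)
print(answers)
```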
4.3.3. Streaming Semantic Reasoning
Reasoning is ubiquitous in the IOT environment. We can derive conclusions whenever certain data streams trigger reasoning rules. For example, we can raise a warning if a measurement such as the temperature or the CO concentration exceeds its normal range. Other reasoning examples include inferring the health status and sleep quality of a person from wearable devices such as a smart band.
Here we give the example of inferring the human comfort level. The application helps us keep track of indoor conditions and automatically adjust the indoor environment to prevent heatstroke or colds in time. Specifically, two kinds of sensor streams (temperature and humidity) are first integrated to compute the Human Comfort Index according to (3). Then, based on the reasoning rules in Appendix B, the level of human comfort is derived. Consider
5. Experiment and Evaluation
In this section, we describe the experiments and evaluation. We first briefly present the experimental environment, including the configuration and data. We then perform extensive experiments to evaluate the system's functionality and scalability.
5.1. Experiment Setup
Configuration. The experiments are run on a SPARK cluster of three machines. Each node has 16 GB DDR3 RAM, 8-core Intel Xeon E5606 CPUs at 2.13 GHz, and a 1.5 TB disk. The nodes are connected by a network with a bandwidth of 1000 M/s. All nodes run CentOS 6.4 with JDK 1.7.0, Scala 2.10.1, and SPARK 0.9.0.
Data. The experimental data is generated by our stream data generator whose schema is based on the concept model in Figure 3. The main parameters of the generator are R and T, denoting the number of homes and sampling time, respectively. The number of homes is in proportion to the number of sensors denoted by
5.2. Functional Evaluation
For the functional evaluation, we first briefly introduce the implementation of the elastic streaming semantic engine. Then, based on the proposed engine, three kinds of representative IOT applications are presented to illustrate the functional characteristics of our system.
We built our elastic streaming processing engine on the efficient in-memory cluster computing framework SPARK, which provides rich data and operation abstractions to meet the needs of various IOT applications. For the data model, we use the DStream (Discretized Stream) to model the window streams. For the processing model, operators provided by SPARK such as “filter” and “map” are used as the processing primitives with which the different IOT scenarios are implemented.
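The discretized-stream model with “filter” and “map” primitives can be illustrated without a cluster. The following plain-Python sketch only emulates the idea: it does not use SPARK, and the record layout, function names, and sample readings are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[float, str, float]  # (timestamp in seconds, sensor type, value)


def discretize(records: Iterable[Record], batch_seconds: float) -> List[List[Record]]:
    """Cut a stream into consecutive fixed-length time slices (micro-batches)."""
    batches: Dict[int, List[Record]] = defaultdict(list)
    for record in records:
        batches[int(record[0] // batch_seconds)].append(record)
    return [batches[key] for key in sorted(batches)]


def process(batches: List[List[Record]],
            keep: Callable[[Record], bool],
            transform: Callable[[Record], float]) -> List[List[float]]:
    """Apply a filter and then a map to every micro-batch independently."""
    return [[transform(r) for r in batch if keep(r)] for batch in batches]


# Hypothetical readings.
readings = [
    (0.5, "temperature", 21.0),
    (1.2, "CO", 3.0),
    (5.1, "temperature", 22.0),
    (9.9, "temperature", 23.0),
]

slices = discretize(readings, batch_seconds=5)
temperatures = process(slices, keep=lambda r: r[1] == "temperature", transform=lambda r: r[2])
print(temperatures)
```

Because each slice is processed independently, the same filter and map functions could be applied to slices in parallel, which is the property a distributed engine exploits.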
Presently, we have preliminarily implemented the IOT mash-up subsystem, query subsystem, and reasoning subsystem. Every subsystem acts as a module of our elastic semantic processing engine. Once an IOT application requests a service, the corresponding subsystem is activated to run a SPARK job that returns continuous results. Figure 5 shows partial running results of the 3 use cases above. For more results, readers can access (https://github.com/hualichenxi/Semantic-IOT-Engine/tree/master/Experiment).

The snapshot of indoor environment monitoring scenarios.
5.3. Scalability Evaluation
For the scalability experiment, we use the AQI use case to illustrate the performance of our engine as the number of sensor streams increases. During SPARK streaming execution, we write a total delay (TD) into the log file after the data in a time slice has been processed completely. This parameter records the total time from receiving the window data to outputting the final results. In our experiment, the time slice (D) is set to 5 seconds and the program runs for 300 seconds. That is to say, 60 TD values are written into the log file.
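The bookkeeping behind these numbers is simple arithmetic, sketched below in Python: a 300-second run with 5-second slices yields 60 TD records, and averaging over the last 250 seconds (the stable phase used in this evaluation) keeps the final 50 of them. The TD values in the example are hypothetical placeholders, not measurements.

```python
from typing import List


def td_count(run_seconds: int, slice_seconds: int) -> int:
    """Number of TD records written during a run (one per completed time slice)."""
    return run_seconds // slice_seconds


def steady_state_mean(td_values: List[float], slice_seconds: int,
                      keep_last_seconds: int) -> float:
    """Average the TD records that fall in the last `keep_last_seconds` of the run."""
    n = keep_last_seconds // slice_seconds
    if n <= 0 or n > len(td_values):
        raise ValueError("the requested interval does not fit the recorded run")
    tail = td_values[-n:]
    return sum(tail) / len(tail)


records = td_count(run_seconds=300, slice_seconds=5)

# HYPOTHETICAL log: an unstable warm-up phase followed by a stable phase.
td_log = [4.0] * 10 + [2.0] * 50
average_td = steady_state_mean(td_log, slice_seconds=5, keep_last_seconds=250)
print(records, average_td)
```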
Figures 6 and 7 show the trend of the processing time (TD) on a single node and on the cluster for varied numbers of sensors. For the single-node experiment, the number of sensors is varied from 15,000 to 150,000. For the cluster experiment, the number of sensors is varied from 75,000 to 750,000. The two figures show the processing time for only a subset of the sensor counts so that the individual lines remain distinguishable. In both figures, TD is not stable at the beginning of a SPARK job (0~50 s) for all sensor counts. After a while, TD stays at a comparatively stable level. Here we choose the TD values in the last 250 s and compute their average value denoted as

Processing time (TD) with increased sensors in single node.

Processing time (TD) with increased sensors in cluster.
Equation (4) computes the system's throughput Q (MB/s): 0.5 is the average data size generated by a sensor in a sampling time,
Tables 2 and 3 show the execution results on a single node and on the cluster. Figures 8 and 9 show the throughput and processing time with increased sensors. We can draw the following conclusions from the tables.
Throughput and sizeup in single node.
Throughput and sizeup in cluster.
Firstly, we can get the correlation among relevant variables:

Throughput and TD with increased sensors in single node.

Throughput and TD with increased sensors in cluster.
Secondly, Tables 2 and 3 show that our system achieves high throughput: more than 53 MB/s on a single node and more than 175 MB/s on the cluster. Benefiting from its elastic processing ability, the system can concurrently process more than 300,000 sensor streams efficiently.
Finally, the tables show that our system achieves excellent scalability. For both the single-node and cluster configurations, the sizeup (see (5)) for m times the input is much less than m. Particularly for the cluster, when the input stream increases by 6 times (
To sum up, the results demonstrate excellent scalability regarding both the size of input stream and number of nodes.
6. Conclusion and Future Work
To process massive streaming IOT data effectively, this paper presents a large-scale real-time semantic processing framework for various IOT applications. Following the framework, we have implemented an elastic streaming engine based on the popular large-scale distributed computing platform SPARK. Based on the engine, a typical use case on home environment monitoring is given to illustrate its efficiency. The results show that our system scales to a large number of sensor streams with different types of IOT applications. For future work, we plan to add spatial semantic support to our distributed semantic engine in order to process location-based IOT applications such as taxi and parking services.
Appendices
We provide the SPARQL queries used in the experimental section of streaming semantic query.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is funded by NSFC 61473260 and National Key S&T Special Projects 2015ZX03003012-004 and YB2013120143 of Huawei.
