Abstract
The data prioritization problem is paramount for distributed publish/subscribe infrastructures to deliver real-time events on time, since a large number of low-priority events may clog the channel and thereby delay high-priority events. The challenge for event-based middleware in large-scale distributed systems such as vehicular ad hoc networks is that the event priority determination engine must be efficient and scalable in terms of priority rule set size and event throughput. This paper proposes an innovative approach based on Bloom filters and event discretization. A Bloom filter data structure stores the rule instances and their priorities, reducing complex rule evaluation to set membership queries on Bloom filters. The time complexity of data prioritization is constant and independent of the number of priority rules. Because event discretization signatures can be cached, the approach is cache friendly in nature: previous computation results can be cached in overlay network nodes and reused to improve system throughput and determination time. We have evaluated the proposed approach, and the results show a significant performance improvement.
1. Introduction
With the advent of ubiquitous sensor-rich environments and location-based services, distributed event-based systems built on the publish/subscribe communication paradigm have been gaining popularity [1, 2]. For example, in vehicular ad hoc networks (VANETs), application logic is triggered by various events from geographically distributed sources. In an expressway monitoring system of a VANET, the sensing data of vehicles are published continuously, and the vehicle information system may subscribe to different data based on the vehicle's location.
With the increasing popularity of distributed event-based systems (especially publish/subscribe systems) and their adoption in mission-critical areas, performance and scalability issues are becoming a major concern [3, 4]. The performance and scalability of the event-based middleware that processes real-time event data will be crucial for the successful adoption of such applications. Flexible and efficient event routing mechanisms are paramount for improving the user experience. Publish/subscribe systems must support a large number of geographically distributed publishers and subscribers, and efficient communication between brokers is paramount.
Data of different importance are transported over the same communication infrastructure. A large number of low-priority events may occupy much of the event brokers' bandwidth and delay the delivery of time-sensitive data. We propose the event delivery on-time rate (EDOR) metric to measure system performance. A prioritized multiqueue approach is a natural choice to improve system performance under given system resources. However, the effectiveness of this approach depends on the performance and scalability of the event priority determination engine (PDE).
A naïve implementation of priority rule matching might check each rule against the event instance values. However, this naïve approach performs poorly in large-scale systems [5]: the performance of the priority determination engine depends on the number of rules in the system. Since each condition of each rule must be checked on the fly, the approach is cache-unfriendly and may perform poorly in geographically distributed environments. A cache-friendly approach may dramatically reduce the load on the PDE and achieve a significant improvement in system performance in terms of priority determination speed and event delivery on-time rate (EDOR).
Another naïve priority policy is to let event producers determine the priority of their events. Under this policy, priority is assigned by each producer independently; if most events are labeled as high priority, the system cannot benefit much from the priority mechanism. A global policy is therefore needed for resource scheduling in the overlay network.
This paper addresses the design of an efficient and scalable priority determination engine (PDE). We present an innovative PDE design based on Bloom filters and event discretization. First, the speed of priority determination is independent of the number of rules in the system. Second, the approach is cache friendly, so the system can handle a large number of events in geographically distributed deployments.
The results in this paper are an improved and extended version of our conference paper in IEEE SCC 2012 [6]. The major extensions are the following. First, the model is refined and expressed more accurately. Second, more related work is surveyed. Third, the discretization algorithm is detailed, providing a much more thorough description than the preliminary results in [6]. Finally, the evaluation methods and results are newly introduced in this paper.
2. Related Work
Data Prioritization. Internet services impose soft real-time constraints, for example, 300 ms latency [7, 8]. For some applications, a 100 ms increase in latency can degrade the user experience significantly. Experiments by Amazon and Google [9] demonstrated that latencies of hundreds of milliseconds can already result in significant financial loss. The absence of traffic prioritization causes latency-sensitive data streams to wait behind latency-insensitive ones. Events are useful if and only if they are delivered within their deadlines. Recent research [7, 8, 10] addressed this issue in datacenter environments, with solutions mostly focused on transport-layer protocols. The cross-layer approach DeTail depends on applications to properly specify data priorities based on how latency sensitive they are [7].
Data prioritization can alleviate this issue significantly. Our system [11] introduces data prioritization into the application layer, that is, the publish/subscribe overlay network, where the prioritization of application data is handled by the publish/subscribe infrastructure. However, if the overhead of prioritization is too high, the solution is not affordable for most soft real-time applications. These online applications require fast data prioritization services: the data prioritization engine must offer low latency and high throughput in geographically distributed environments while remaining scalable in terms of priority rule set size.
Rule Matching. Rule matching engines have been studied intensively in the past decades. The most famous algorithm is Rete, proposed by Charles L. Forgy at Carnegie Mellon University in the 1970s [5, 12]. Rete has become very widely used; it is the basis of OPS5, CLIPS, and numerous commercial rule-based tools.
Techniques used in expert and rule-based systems support expressive predicate languages [12] but are unable to scale up to millions of Boolean expressions. Most traditional rule-based systems used in expert systems focus on language expressiveness, and their expected sizes are assumed to be under a few thousand Boolean expressions. The latest Rete implementation claims to scale up to 100 K rules with millions of objects and to be at least 500 times faster than the original Rete [13, 14]. Although advances in the implementation of knowledge-based expert systems have provided substantial performance improvements, the rule matching speed in large-scale systems with millions of Boolean expressions under severe time constraints, for example, sub-millisecond, is still an open issue.
Many innovative approaches have recently been proposed for fast rule matching against millions of Boolean expressions [15–18], and rule matching performance has improved significantly. However, all of these algorithms scale linearly with the number of matched Boolean expressions [15], and they focus mainly on top-k matching of Boolean expressions [15, 17, 18].
Compared with Rete and the aforementioned works, our approach focuses on scalability and provides an innovative rule matching engine design for distributed computing environments. It achieves scalable online query speed for event priority rule matching at the cost of offline maintenance of a large rule instance database and cache management on broker nodes.
3. Model Description
3.1. System Model
Publish/subscribe systems can be classified by architecture as centralized or distributed [19]. With the increasing scale of event-based systems, distributed publish/subscribe systems attract growing attention from both industry [20] and academia [21].
A generic publish/subscribe system (often referred to in the literature as event service or notification service) is composed of a set of broker nodes distributed over a communication network. These nodes form an overlay network, which is a logical network built on the physical network. The links between nodes are paths in the physical network.
Formally [22] the distributed publish/subscribe system can be represented as a 5-tuple
The overlay topology of the publish/subscribe network is shown in Figure 1. A client can be a publisher and/or a subscriber. Each client is connected to exactly one broker in the system. The broker a client connects to is called the access broker from the network view, or the home broker with respect to that client. Brokers that route events between brokers are called event routers or inner brokers.

System architecture of distributed publish/subscribe overlay network.
3.2. Event Model
Publish/subscribe based event models were first introduced in the data and business domain as complex event processing [1]. Each event is described by a set of attributes,
The tuple
Let E be the set of all events published in the system. An event
For each event schema, the attribute vector that determines the event priority is called the priority signature vector. Let
We distinguish two types of attributes: continuous and discrete. In our approach, continuous-valued attributes must be discretized with proper granularity. Discrete-valued attributes may also be refined to a proper granularity per application requirements. The discretization procedures can be defined by applications according to their business requirements.
3.3. Priority Rule Model
The triple consisting of attribute, operator, and set of values is referred to as a Boolean predicate. A conjunction of Boolean predicates is a Boolean expression. A priority rule can be modeled as a set of Boolean expressions. The rule is expressed as a disjunction of Boolean expressions. For a given priority, there may be a set of priority rules specified by applications.
The sets of values in all Boolean predicates compose the metadata of the priority rule. An expressive set of operators is supported: relational operators (
For example, given event schema of coal mine monitoring data, which is defined as
Boolean predicates
The rule set for given priority can be modeled as a set of Boolean expressions, which are the union of the Boolean expression sets for the priority rules of the given priority. The general format of priority Boolean function for a set of rules can be formalized as
We define the normal model of priority rule as a disjunction of Boolean expressions and each Boolean expression is defined as a conjunction of Boolean predicates.
The transformation from natural language rule specifications to normal expressions is a separate research topic in requirements engineering and is not addressed in this paper.
Given an event instance,
In this simple example, we observe that the four condition tests denoted by Boolean predicates reduce to two attributes in the event priority signature vector. We also observe that in condition tests on Boolean predicates
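To make the normal model concrete, the following Python sketch evaluates a rule in disjunctive normal form against an event instance. The attribute names, operators, and threshold values are illustrative assumptions, not the paper's actual rule set.

```python
# Hypothetical sketch: naive evaluation of a priority rule in disjunctive
# normal form (DNF). Attribute names and thresholds are illustrative only.

def eval_predicate(event, pred):
    """A Boolean predicate is a triple (attribute, operator, values)."""
    attr, op, values = pred
    v = event[attr]
    if op == "in":
        return v in values
    if op == "<":
        return v < values
    if op == ">":
        return v > values
    raise ValueError(f"unsupported operator: {op}")

def eval_rule(event, rule):
    """A rule is a disjunction of Boolean expressions; each expression
    is a conjunction of Boolean predicates."""
    return any(all(eval_predicate(event, p) for p in expr) for expr in rule)

# Example: high priority if (gas > 4.0 AND location in {"A", "B"}).
rule = [[("gas", ">", 4.0), ("location", "in", {"A", "B"})]]
print(eval_rule({"gas": 4.5, "location": "A"}, rule))   # True
print(eval_rule({"gas": 3.0, "location": "A"}, rule))   # False
```

This is exactly the naive per-rule scan whose cost grows with the rule set size, motivating the summary instance approach that follows.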
3.4. Assumptions and Design Goals
Our design has been guided by assumptions that offer both challenges and opportunities.
The system shall support large-scale operation in geographically distributed environments. The speed of priority determination is paramount to ensure real-time events are delivered in time. Performance shall scale with the number of condition tests in rules and the event traffic in the overlay network. The priority determination problem can tolerate false positives if the false-positive rate is kept below an acceptable level. Large numbers of condition tests in priority rules are actually defined over a small number of event signature attributes. The condition tests in rules can be expressed as set membership queries: priority rules mostly state that some attribute satisfies some condition (membership in a particular set, or being less than or greater than a specified threshold). The condition tests are unlikely to be as complex as the pattern matching problems of artificial intelligence. The set of event types is known in advance.
Our design goals on PDE focus on the following aspects.
The PDE should strive to maximize the number of events that meet their deadlines, contributing to application throughput. The PDE should also accommodate bursts to improve the peak load, that is, to redefine the peak load at which the publish/subscribe system can operate without impacting the user experience.
To achieve these design goals, we propose our approach: summary instance.
4. Solution
Our approach rests on two main principles. First, make the online query on an event instance as simple as possible; time-consuming procedures should be performed offline. Second, exploit the power of caching on each broker node to reduce network round trips, which leaves much room for performance improvement.
4.1. Overview
To follow these principles, the query computation should be simple and independent of the number of condition tests in the rules, and queries should be answerable from the local cache whenever possible. The key ideas of our approach are rule instantiation, event (attribute) discretization, and a cache-friendly, signature-based rule matching mechanism for distributed event environments.
As shown in Figure 2, the rule matching engine is decoupled into two parts: offline rule instantiation and online query on event instance matching. The offline part is named the rule instantiation engine (RIE); the online part is named the priority determination engine (PDE).

Architecture of event priority rule matching engine.
The rule instantiation process represents the event priority rule set as a set of instances. If event priority were queried on this set directly, the query time would depend on the number of elements in set R. To reduce the computation time, the rule instance set R is stored in a Bloom filter data structure. The query time on a Bloom filter is independent of the size of R. Furthermore, the amount of storage the Bloom filter requires per element of R is independent of the element's length. By employing Bloom filters, the online query of event priority reduces to a two-hash-function computation on the event signature; see the next section on Bloom filter theory.
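The offline/online split described above can be sketched as follows, using a plain Python set in place of the Bloom filter so that the reduction of rule evaluation to membership testing is easy to see. The signature format and instance strings are hypothetical.

```python
# Sketch of the offline/online split. A plain set stands in for the
# Bloom filter; the signature format "type|attr:range|..." is assumed.

# Offline (RIE): enumerate discretized rule instances per priority.
HIGH_PRIORITY_INSTANCES = {
    "ETID0001|gas:(4,5]|loc:A",
    "ETID0001|gas:(4,5]|loc:B",
}

def determine_priority(signature):
    # Online (PDE): priority lookup is a constant-time membership test,
    # independent of how many rules produced the instances.
    return "high" if signature in HIGH_PRIORITY_INSTANCES else "normal"

print(determine_priority("ETID0001|gas:(4,5]|loc:A"))   # high
print(determine_priority("ETID0001|gas:(0,4]|loc:C"))   # normal
```

Replacing the set with a Bloom filter keeps the constant-time lookup while making the per-element storage independent of the instance string length, at the cost of a controlled false-positive rate.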
4.2. Preliminary
In order to keep this paper self-contained, this subsection presents a concise introduction on Bloom filter.
After the Bloom filter was proposed in 1970 [25], it was first used in the database community. The technique has gained popularity in network applications with the emergence of the Internet [26].
A Bloom filter is a simple, space-efficient randomized data structure that represents a set of strings compactly for efficient membership querying. It outperforms other efficient data structures such as binary search trees and tries, as the time needed to add an item or to check whether an item belongs to the set is constant irrespective of the cardinality of the set.
At first, we present the mathematics behind Bloom filters concisely. A standard Bloom filter for representing a set
The query process is similar to the programming (insertion) process. To check whether an item y is in S, we generate k hash values with
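A minimal sketch of the standard Bloom filter's programming and query operations follows. Deriving the k hash values from a salted SHA-256 digest is an implementation choice of this sketch, not the paper's.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m positions, k hash values per item."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per bit, for simplicity

    def _hashes(self, item):
        # k hash values derived from salted SHA-256 digests (a sketch choice).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # "Programming" the filter: set all k positions for the item.
        for h in self._hashes(item):
            self.bits[h] = 1

    def __contains__(self, item):
        # May report a false positive, but never a false negative.
        return all(self.bits[h] for h in self._hashes(item))

bf = BloomFilter(m=1024, k=4)
bf.add("rule-instance-42")
print("rule-instance-42" in bf)   # True
```

Both `add` and the membership test touch exactly k positions, so their cost is constant regardless of how many items the filter summarizes.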
To accommodate deletion on Bloom filters, Fan et al. proposed counting Bloom filters [28]. In a counting Bloom filter, each entry is not a single bit but a small counter. When an item is inserted, the corresponding counters are incremented; when an item is deleted, they are decremented. To avoid counter overflow, sufficiently large counters are chosen [26, 27]; the analysis in [26, 28] reveals that 4 bits per counter suffice for most applications.
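A counting Bloom filter can be sketched by replacing the bit vector with counters; the hashing scheme and counter representation below are illustrative assumptions.

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: counters instead of bits enable deletion."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m           # 4-bit counters in a real system

    def _hashes(self, item):
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item):
        for h in self._hashes(item):
            self.counters[h] += 1         # increment on insertion

    def remove(self, item):
        for h in self._hashes(item):
            if self.counters[h] > 0:
                self.counters[h] -= 1     # decrement on deletion

    def __contains__(self, item):
        return all(self.counters[h] > 0 for h in self._hashes(item))
```

This is the structure that lets the rule instance database absorb rule deletions without rebuilding the whole summary.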
To accommodate membership queries of dynamic sets, Guo et al. proposed dynamic Bloom filters (DBF) [29]. Further improvements on scalability problem of Bloom filter are addressed by scalable Bloom filter (SBF) [30].
To reduce the need to compute a possibly large number of different hash functions, the authors of [31] showed that only two hash functions are necessary to implement a Bloom filter effectively without any loss in the asymptotic false-positive probability.
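The two-hash-function construction of [31] derives all k indices as g_i(x) = h1(x) + i · h2(x) mod m. The sketch below is one hypothetical realization, obtaining both base hashes from a single SHA-256 digest.

```python
import hashlib

def double_hash_indices(item, m, k):
    """Derive k Bloom filter indices from only two base hash values,
    following the g_i(x) = h1(x) + i * h2(x) (mod m) construction."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1   # keep h2 odd
    return [(h1 + i * h2) % m for i in range(k)]

print(double_hash_indices("event-signature", m=1024, k=4))
```

Only one digest is computed per query, which is why the paper later counts the online query cost as a two-hash-function computation on the signature string.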
4.3. Rule Instantiation Engine
The RIE (rule instantiation engine) is designed to program the event priority determination rules into a set of Bloom filter structures. It transforms abstract priority determination rules into concrete instances and generates Bloom filter based summaries of the large rule instance data set for each priority. In this section, we show how the RIE works.
4.3.1. Rule Instantiation Process
First, we explain how RIE transforms rules into a set of rule instances with a simple example.
Consider event schema and rule description as follows.
Event schema is defined as Rule set
The first rule means that if the incoming event data are generated from locations in set
The second rule means that if the incoming event data are generated from locations in set
Event priority signature vector can be inferred as
The attribute
The set
Each rule instance is an element in rule instance space (IS), which is defined as
We can deduct the instance representation format for rule R as follows.
The duplicated instance
It is obvious that
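The instantiation of a rule as the cross product of the discretized value sets of its signature attributes can be sketched as follows; the attribute names and discretized values are illustrative.

```python
from itertools import product

# Hypothetical sketch: instantiating a rule as the cross product of the
# discretized value sets of its signature attributes.

def instantiate(rule_sets):
    """rule_sets maps each signature attribute to its discretized value set;
    returns the set of rule instance strings (attribute order is fixed)."""
    attrs = sorted(rule_sets)
    return {"|".join(f"{a}:{v}" for a, v in zip(attrs, combo))
            for combo in product(*(rule_sets[a] for a in attrs))}

instances = instantiate({"location": {"L1", "L2"}, "gas": {"(4,5]", "(5,6]"}})
print(len(instances))   # 4 instances for 2 x 2 discretized values
```

The instance set size is the product of the discretized set sizes, which is why the discretization granularity directly bounds the offline database size.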
The key data structures are described in Figure 3 and Algorithm 1.
(1) Predicate: 〈attribute, operator, value〉.
(2) Operators: OperatorEnumerator.
(3) BE: a list of predicates, ordered by event attributes. For example, shall be formalized as
(4) Rule: a list of BEs.
(5) RuleSet: a list of rules.
(6) VecSig: the signature vector is logically a list of the attributes involved in priority rules. In this basic design schema, there is only one signature vector per priority per event schema. The vector is initialized as a zero bit vector; if an attribute appears in a Boolean expression, the corresponding bit is set to 1.
(7) Discretizer: two types of hash table are defined.
(8) Signature for event instance: a byte block structured as 〈event type ID, discretizer ID, array of discretized attribute values in the event instance〉.

Data structure of signature vector and attribute discretizer.
The rule instantiation process is described in Algorithm 2. First, transform the condition tests into set membership determinations with proper granularity,
(1) generate BE list for RuleSet
(2)
(3) Get the attribute ID in predicate
(4) Set the bit in signature vector to 1
(5)
(6)
(7) predicates table 〈attribute ID,
(8)
(9)
(10) Add set
(11)
(12) Merge discrete sets on the same attribute by set intersection in the predicates table
(13) Generate the rule instance set
(14) Store the rule instances in set
(15)
(1)
(2)
(3) ContinousDiscretizer cd = p.attribute.discretizer;
(4)
(5)
(6)
(7)
(8) ; //do nothing
(9)
(10) S.insert(item.id); //false-positive rules are introduced
(11)
(12) S.insert(item.id);
(13)
(14)
(15)
(16)
(17)
(18) S.insert(item.id);
(19)
(20) S.insert(item.id); //false-positive rules are introduced
(21)
(22) ; //do nothing
(23)
(24)
(25) … //other operators
(26)
(27)
(28) DiscreteDiscretizer dd = p.attribute.discretizer;
(29)
(30)
(31)
(32)
(33) S.insert(item.id); //false-positive rules are introduced
(34)
(35)
(36)
(37)
(38)
(39) ; //do nothing
(40)
(41) S.insert(item.id);
(42)
(43)
(44) … //other operators
(45)
Given
Our framework provides a schema for the discretizer, which can be customized by applications. The discretizer divides the domain of the attribute value into several ranges or discrete sets, and these ranges or sets shall be mutually disjoint; that is, for all
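A continuous discretizer over disjoint ranges can be sketched with binary search over sorted cut points; the boundary values below are hypothetical.

```python
import bisect

class ContinuousDiscretizer:
    """Maps a continuous value to the id of the (disjoint) range
    containing it. Boundaries and ids are illustrative assumptions."""
    def __init__(self, boundaries):
        self.boundaries = boundaries          # sorted cut points

    def discretize(self, value):
        # Range id = index of the half-open interval containing value;
        # binary search makes this O(log n) in the number of cut points.
        return bisect.bisect_right(self.boundaries, value)

d = ContinuousDiscretizer([0.0, 4.0, 5.0, 10.0])
print(d.discretize(4.5))   # value in (4.0, 5.0] -> range id 2
```

Because the ranges are disjoint, each value maps to exactly one range id, so the same discretizer produces identical signatures for event instances and rule instances.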
The signature vectors for event schema are generated in Algorithm 2. The signature vectors are used for event signature generation procedure in Section 4.4. The structure of signature vector is shown in Figure 3.
4.3.2. Hash Computation on Instance Set
For each element in
If event types share the same set of Bloom filters, the event type identifier shall be encoded into the signature to ensure the uniqueness of each signature in the Bloom filter. The event type identifier is a unique string that distinguishes event types. An alternative design choice is for each event type to have its own set of Bloom filters for priority determination.
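A signature that encodes the event type ID alongside the discretizer ID and the discretized values might look like the following; the field separator and example IDs are assumptions for illustration.

```python
# Hypothetical sketch of signature composition. The field layout
# <event type ID, discretizer ID, discretized values> follows the text;
# the "|" separator and the example IDs are assumptions.

def make_signature(event_type_id, discretizer_id, discretized_values):
    return "|".join([event_type_id, discretizer_id] +
                    [str(v) for v in discretized_values])

sig = make_signature("ETID0001", "DISC01", [2, "L1"])
print(sig)   # ETID0001|DISC01|2|L1
```

Prefixing the event type ID keeps signatures from different schemas from colliding when one set of Bloom filters is shared across event types.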
Different event discretizers produce different rule instances even for the same event schema. If multiple discretizers are defined by various applications, the discretizer identifier shall be included in the rule instance structure. An example is shown in Table 1.
Example for rule instance structure.
4.3.3. Update the Computation Results into Bloom Filters
For each priority, one bit vector is dedicated for the summary of rule instance set
4.4. Priority Determination Engine
The PDE discretizes event instance values to generate the priority signature and determines the event priority by querying the rule database, which is represented by a group of Bloom filters. When multiple rules match the same event, the engine shall choose the highest priority result.
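The highest-priority-wins policy can be sketched as a query over per-priority filters ordered from highest to lowest. Plain sets stand in for the Bloom filters here, and the priority values are illustrative.

```python
def query_priority(signature, priority_filters, default=0):
    """priority_filters: (priority, filter) pairs sorted by priority,
    highest first. Each filter supports membership testing (a Bloom
    filter in the real system; plain sets here for illustration)."""
    for priority, bf in priority_filters:
        if signature in bf:
            return priority          # highest matching priority wins
    return default                   # no rule matched

filters = [(2, {"sigA"}), (1, {"sigA", "sigB"})]
print(query_priority("sigA", filters))   # 2: both match, highest wins
print(query_priority("sigB", filters))   # 1
print(query_priority("sigC", filters))   # 0
```

Scanning the filters from highest to lowest priority means the first hit is the answer, so conflicts between overlapping rules resolve without evaluating every filter.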
4.4.1. Event Priority Signature Generation in Access Brokers
The generation of event priority signature is based on the same priority signature vector and corresponding discretizer. The interaction procedure between broker node and PDE service is shown in Figure 4.

Event priority determination procedure.
Consider signature and discretizer as follows:
When the access broker receives a published event, it needs to generate the event signature. The signature generation algorithm is shown in Algorithm 4 (lines (1)–(7)). The signature is initialized in line (1). Assuming that the event type ID is “ETID0001” and the discretizer ID is “DISC01,” the signature is represented as
(1) Initialize signature with event type ID and discretizer ID
(2)
(3)
(4)
(5) add id into signature
(6)
(7)
(8) Query BF-based rule DB with signature to determine the event priority
Assume that event instance
For a discrete attribute, the discretizer does nothing by default. In this example,
(1)
(2) traverse discretizer
(3)
(4)
(5) traverse discretizer
(6)
For a continuous attribute, the discretizer performs lines (1) to (3) in Algorithm 5. In this example,
The third attribute is not in the signature vector, so it has no effect on signature generation (line (3) in Algorithm 4).
In this example, the final signature is represented by
4.4.2. Query Bloom Filter with Signature
Based on hash computations on the event instance signature string, the PDE queries the BF-based rule database to determine the event priority and returns the priority flag to the access broker.
4.4.3. Caching Query Result in Access Brokers
Since a large number of event instances may share the same signature, network round trip time can be saved by caching hot signatures in the local broker. Main memory access time is typically less than 100 ns, whereas even a round trip within the same datacenter takes about 500,000 ns, and a round trip over a wide area network may exceed 100 ms, about 6 orders of magnitude more than a main memory reference. Saving these network round trips can speed up event priority determination significantly.
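A signature-to-priority cache at the access broker could be sketched as a small LRU map; the eviction policy and capacity here are our assumptions, not prescribed by the system.

```python
from collections import OrderedDict

class SignatureCache:
    """LRU cache of signature -> priority at the access broker; a hit
    avoids a network round trip to the PDE. Capacity is illustrative."""
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, signature):
        if signature not in self.entries:
            return None                        # miss: must query the PDE
        self.entries.move_to_end(signature)    # mark as recently used
        return self.entries[signature]

    def put(self, signature, priority):
        self.entries[signature] = priority
        self.entries.move_to_end(signature)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used

cache = SignatureCache(capacity=2)
cache.put("sigA", 1)
cache.put("sigB", 2)
cache.get("sigA")          # sigA is now most recently used
cache.put("sigC", 3)       # evicts sigB
print(cache.get("sigB"))   # None
```

Because rule updates can invalidate cached priorities, a real deployment would pair such a cache with the update notifications described in Section 4.6.1.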
4.5. Discussion on Discretization
The discretization can be flexibly defined by applications per business requirements. The basic principle concerns false positives.
False Positive. For numeric value attribute a, the corresponding discretization set is defined as
The definition of computation granularity on specific attribute
For discrete attributes, the discretizer is optional; it can consist of dummy (do-nothing) functions, depending on application requirements. If the original granularity of the attribute value set is too fine, applications can plug in a customized discretizer to achieve a proper granularity.
Performance. The traversal of a continuous discretizer can be improved with binary search. As this is trivial, we do not discuss it in detail.
The traversal of a discrete discretizer can be avoided in most cases, since there is no discretizer for discrete attributes by default. Employing a discrete discretizer can reduce the rule instance space size at the cost of extra computation in Algorithms 3 and 5, which increases the time of the signature generation procedure.
4.6. Analysis on System Performance
System performance analysis is divided into online query (interactions between the access broker and the PDE) and offline rule instance summary building (the RIE module).
4.6.1. Online Query Performance
The computation is broken into two parts as shown in Algorithm 4.
The first part is signature generation. Its computational complexity depends on two system parameters: the size of the priority signature vector and the corresponding attribute discretizers. Assume that the priority signature vector is
The discretization result of
The second stage computation is query on rule set Bloom filters. The query computation complexity is
Therefore, the computation complexity of online query is
The computation of signature generation depends on the size of the priority signature vector and the corresponding attribute discretizers. Since these discretizers can work in parallel, the speed of signature generation depends on the slowest discretizer; it would not be a bottleneck in practice.
In the PDE, the main part of the query computation time is the two-hash-function computation on the signature string [31].
To keep the cache fresh, updates to rule instances shall be propagated to access brokers. The cache management procedure has no impact on online query speed.
4.6.2. Offline Building and Maintenance of Rule Instance Database Based on Bloom Filters
Although the offline work is not time sensitive, we still need to evaluate the effort of building the rule instance database and how to minimize it.
The basic idea of rule instantiation process is presented in Algorithm 2.
The upper bound on the rule instance set size is the cardinality of the rule instance space. Applications shall choose the priority signature vector to make the size m as small as possible. The attribute discretizer shall choose a proper computation granularity to make the size of
Minimizing the offline computation at the middleware layer is the subject of ongoing work. A more efficient implementation requires further exploration; the goal is to dramatically reduce the rule instance space size without significantly impacting online query performance.
For rule maintenance efficiency, delta rule changes shall be processed efficiently, and the cache shall be managed efficiently. These issues will be addressed in future work.
5. Evaluations
In this section, we evaluate the query performance and scalability of summary instance (SI) approach with simulations. The experiments were run on an Intel Xeon Dual-core E5645 2.4 GHz machine with 8 GB of memory, of which 6 GB is allocated to the JVM.
5.1. Data Set
To evaluate the performance of the summary instance based priority determination engine, we generated rule sets with 100 K to 1000 K Boolean expressions. Lacking benchmarks and real application data, the rule data set and event data set were generated by a workload generator that produces data randomly by selecting values from given ranges. The value ranges can be specified in the configuration of the data generator application.
5.2. Matching Algorithm
The brute-force approach is an exhaustive algorithm that scans and evaluates all BEs one by one for each assignment. We call this approach SF in the following experiments and compare our SI approach against it.
5.3. Experiment Results
In this section, we explore the impacts on matching time from workload size, workload distribution, and matching rate of event data set. Then, we evaluate the false-positive issue in SI algorithm.
Workload Size. We evaluate the impact of workload size on the matching algorithms. Figure 5 compares the SF and SI algorithms under rule set sizes varying from 100 K to 1 M. The matching time of the SF algorithm increases linearly with the workload size, while the matching time of the SI algorithm is nearly constant, as shown in Figures 5(a) and 5(b). The SI algorithm shows impressively scalable performance; SF performance degrades as the workload increases. The simulation results are consistent with our theoretical analysis.

Varying workload size.
Workload Distribution. Table 2 shows the effects of workload distribution by comparing event matching performance under uniform and Zipf workloads. The SF algorithm is sensitive to the workload distribution: from the results in Table 2(a), the Zipf workload increases SF matching time by about 50% compared with the uniform workload. The SI algorithm is robust to the workload distribution, with no significant increase in matching time. Since the timer precision of the computer system is 100 milliseconds and the event data set contains 10,000 events, the precision of the matching time per event is about 0.01 milliseconds. The raw SI performance data are shown in Table 2(b); the variance can be ignored given the timer precision of our experimental environment. The final results in Table 2(c) show that the variance of performance is nearly zero.
Workload distribution impacts.
Event Set Matching Rate. We consider the effect of the matching rate of the event data set. If an event instance does not match any rule in the rule set, the SF algorithm must scan the entire rule set. Intuitively, the average event matching time will increase as the matching rate of the event data set decreases.
From Figure 6, we can see that the performance of the SF algorithm is sensitive to the matching rate of the event set: as the matching rate increases, the average matching time per event decreases, and the decrease is linear in the matching rate of the event data set.

Varying matching rate of event set.
The performance of the SI approach is robust to varying matching rates under different workloads. The results under the uniform workload are shown in Figure 6(a), and those under the Zipf workload in Figure 6(b). The matching time is nearly constant under varying matching rates and workload distributions.
False-Positive Issue Evaluation. An important property of SI algorithm is the false-positive rate. We explore the false-positive issue in this experiment.
There are two sources of false positives: the Bloom filter query process and the discretization process. The discretization process is defined by applications and can be adjusted at the application layer; this paper focuses on the platform layer, so discretizer design and optimization are out of its scope. An automatic adaptive mechanism is a promising way to balance the false-positive rate against computation effort; we leave this optimization to an independent paper.
We set up controlled experiments to evaluate the impact of the Bloom filter configuration on the false-positive rate. We also verify that no false negatives occur, in agreement with the theoretical analysis.
The test data sets are designed as follows. False positives from the discretization process are avoided by generating the event data set and rule data set from predefined ranges derived from the discretizer definitions.
We use a simple example to illustrate the data set construction principles: each attribute discretizer is defined over a fixed set of value ranges, and events and rule predicates are drawn from those ranges.
In this experiment, the event data set and rule data set are generated randomly under these constraints, so the discretization process introduces no false positives. The event data set contains 10 K instances, uniformly distributed over the given ranges; the rule set contains 10 K rules whose parameters are randomly selected from the given ranges.
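The construction constraint above can be sketched as follows. The bin edges are hypothetical values for a single attribute: events are drawn strictly inside bins and rule bounds coincide with bin edges, so discretization alone can never misclassify an event against a rule.

```python
import random

# Hypothetical discretizer bin edges for one attribute.
EDGES = [0, 30, 80, 120]

def random_event():
    # Uniformly pick a bin, then a value strictly inside it,
    # so the event never straddles a bin boundary.
    i = random.randrange(len(EDGES) - 1)
    return random.uniform(EDGES[i] + 1e-6, EDGES[i + 1] - 1e-6)

def random_rule():
    # Rule bounds coincide with bin edges, so a rule always
    # covers whole bins and discretization introduces no error.
    lo, hi = sorted(random.sample(EDGES, 2))
    return lo, hi
```

Under this scheme any residual mismatch between SF and SI results must come from the Bloom filter, which is exactly what the experiment isolates.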
The experiment results are shown in Table 3. The Bloom filter false-positive rate varies from 0.001 to 0.5, as shown in the BF-FPR (Bloom filter false-positive rate) column. The matched and unmatched columns are produced by the SF algorithm and give the exact rule matching results. The fourth column (SI result) gives the approximate matching results of the SI algorithm. The impact of the Bloom filter parameters on the observed false-positive rate (FPR) of rule matching is shown in the fifth column of Table 3.
False-positive rate evaluation in SI algorithm.
False-positive rate in the SI algorithm with different Bloom filter parameters.
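The BF-FPR settings in Table 3 can be obtained from the standard Bloom filter sizing relations m = -n ln p / (ln 2)^2 and k = (m/n) ln 2. A small sketch using these textbook formulas (not the paper's implementation):

```python
import math

def bloom_params(n, p):
    """Optimal bit-array size m and hash count k for n items and target FPR p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# For the 10 K rule set used in the experiment:
for p in (0.001, 0.01, 0.1, 0.5):
    m, k = bloom_params(10_000, p)
    print(f"target FPR {p}: m = {m} bits, k = {k} hashes")
```

Lower target rates cost more memory and more hash evaluations per query, which is the trade-off the table explores.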
Since the priority determination problem tolerates a low false-positive rate, the SI algorithm is well suited to this class of applications.
5.3.1. Summary of Evaluation
The SI algorithm outperforms the SF algorithm by 2–5 orders of magnitude. It demonstrates strong scalability with workload size and stable performance across workload distributions and event data sets with varying matching rates, while maintaining an acceptable false-positive rate. It is therefore a suitable approach for providing a scalable and robust priority determination service.
6. Conclusion
Information representation and query processing are two core problems of event-based distributed systems such as VANETs. In the design of an event priority rule matching engine, these translate into rule representation and event instance priority determination. Rule representation organizes rule policy information in a format and mechanism that make it operable by the matching method. Query processing decides whether an event instance with given attribute values belongs to a given set.
To speed up online queries in distributed event-based systems, we introduce a rule storage schema based on the rule instantiation method with the Bloom filter technique. This approach invests offline effort to increase online query speed. This paper lays out a fundamental framework for the approach.
The key features of our approach are the following:
Scalability. The performance of rule matching is independent of the number of rules in the system: a key property of the Bloom filter is that query time is independent of the number of strings stored, provided the memory used by the data structure scales linearly with the number of stored strings.
Efficiency. The signature approach is cache friendly and works efficiently in large-scale distributed environments; large volumes of event instances need not consume the bandwidth of the rule matching engine.
Acceptable false-positive rate. The false-positive rate of rule matching can be kept acceptable by adjusting the Bloom filter parameters.
Our approach promises an efficient and scalable design for the event priority determination problem in large-scale distributed event-based systems. It is also applicable to many rule matching scenarios with severe time constraints and large rule sets.
Footnotes
Disclosure
A preliminary version of this paper appeared in IEEE SCC 2012, June 24–29, Honolulu, Hawaii, USA.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
We thank the anonymous reviewers for their insightful and constructive comments. This work was supported by National Grand Fundamental Research 973 Program of China under Grant no. 2013CB329605; National Natural Science Foundation of China under Grant no. 91124002; Chinese Universities Scientific Fund (BUPT2014RC0701); Transformation Project of Scientific and Technological Achievements in Henan Province (2014) no. 142201210009; and Key Project of Science and Technology in Henan Province (2014) no. 144300510001.
