Abstract
Data exploration and visualization systems are of great importance in the Big Data era, in which the volume and heterogeneity of available information make it difficult for humans to manually explore and analyse data. Most traditional systems operate in an offline way, limited to accessing preprocessed (static) sets of data. They also restrict themselves to dealing with small dataset sizes, which can be easily handled with conventional techniques. However, the Big Data era has brought about a great amount and variety of big datasets that are dynamic in nature; most of them offer API or query endpoints for online access, or the data is received in a stream fashion. Therefore, modern systems must address the challenge of on-the-fly scalable visualizations over large dynamic sets of data, offering efficient exploration techniques, as well as mechanisms for information abstraction and summarization. Further, they must take into account different user-defined exploration scenarios and user preferences. In this work, we present a generic model for personalized multilevel exploration and analysis over large dynamic sets of numeric and temporal data. Our model is built on top of a lightweight tree-based structure which can be efficiently constructed on-the-fly for a given set of data. This tree structure aggregates input objects into a hierarchical multiscale model. We define two versions of this structure, which adopt different data organization approaches well-suited to the exploration and analysis context. In the proposed structure, statistical computations can be efficiently performed on-the-fly. Considering different exploration scenarios over large datasets, the proposed model enables efficient multilevel exploration, offering incremental construction and prefetching via user interaction, and dynamic adaptation of the hierarchies based on user preferences. A thorough theoretical analysis is presented, illustrating the efficiency of the proposed methods.
The presented model is realized in a web-based prototype tool, called SynopsViz, which offers multilevel visual exploration and analysis over Linked Data datasets. Finally, we provide a performance evaluation and an empirical user study employing real datasets.
Introduction
Exploring, visualizing and analysing data is a core task for data scientists and analysts in numerous applications. Data exploration and visualization enable users to identify interesting patterns, infer correlations and causalities, and support sense-making activities over data that are not always possible with traditional data mining techniques [29,54]. This is of great importance in the Big Data era, where the volume and heterogeneity of available information make it difficult for humans to manually explore and analyse large datasets.
One of the major challenges in visual exploration is related to the large size that characterizes many datasets nowadays. Considering the visual information seeking mantra: “overview first, zoom and filter, then details on demand” [94], gaining overview is a crucial task in the visual exploration scenario. However, offering an overview of a large dataset is an extremely challenging task. Information overloading is a common issue in large dataset visualization; a basic requirement for the proposed approaches is to offer mechanisms for information abstraction and summarization.
The above challenges can be overcome by adopting hierarchical aggregation approaches (for simplicity we also refer to them as hierarchical) [36]. Hierarchical approaches allow the visual exploration of very large datasets in a multilevel fashion, offering an overview of a dataset, as well as an intuitive and usable way for finding specific parts within a dataset. Particularly, in hierarchical approaches, the user first obtains an overview of the dataset (both structure and a summary of its content) before proceeding to data exploration operations, such as roll-up and drill-down, filtering out a specific part of it and finally retrieving details about the data. Therefore, hierarchical approaches directly support the visual information seeking mantra. Also, hierarchical approaches can effectively address the problem of information overloading, as they provide information abstraction and summarization.
A second challenge is related to the availability of API and query endpoints (e.g., SPARQL) for online data access, as well as the cases where that data is received in a stream fashion. The latter pose the challenge of handling large sets of data in a dynamic setting, and as a result, a preprocessing phase, such as traditional indexing, is prevented. In this respect, modern techniques must offer scalability and efficient processing for on-the-fly analysis and visualization of dynamic datasets.
Finally, the requirement for on-the-fly visualization must be coupled with the diversity of preferences and requirements posed by different users and tasks. Therefore, the proposed approaches should provide users with the ability to customize the exploration experience, allowing them to organize data in different ways according to the type of information or the level of detail they wish to explore.
Considering the general problem of exploring big data [18,43,49,54,80,95], most approaches aim at providing appropriate summaries and abstractions over the enormous number of available data objects. In this respect, a large number of systems adopt approximation techniques (a.k.a. data reduction techniques) in which partial results are computed. Existing approaches are mostly based on: (1) sampling and filtering [2,13,39,55,66,82] and/or (2) aggregation (e.g., binning, clustering) [1,12,36,44,57,58,76,77,89,113]. Similarly, some modern database-oriented systems adopt approximation techniques using query-based approaches (e.g., query translation, query rewriting) [13,57,58,108,114]. Recently, incremental approximation techniques have been adopted; in these approaches, approximate answers are computed over progressively larger samples of the data [2,39,55]. In a different context, an adaptive indexing approach is used in [118], where the indexes are created incrementally and adaptively throughout exploration. Further, in order to improve performance, many systems exploit caching and prefetching techniques [12,25,32,56,60,65,101]. Finally, in other approaches, parallel architectures are adopted [35,55,61,62].
Addressing the aforementioned challenges, in this work, we introduce a generic model that combines personalized multilevel exploration with online analysis of numeric and temporal data. At the core lies a lightweight hierarchical aggregation model, constructed on-the-fly for a given set of data. The proposed model is a tree-based structure that aggregates data objects into multiple levels of hierarchically related groups based on numerical or temporal values of the objects. Our model also enriches groups (i.e., aggregations/summaries) with statistical information regarding their content, offering richer overviews and insights into the detailed data. An additional feature is that it allows users to organize data exploration in different ways, by parameterizing the number of groups, the range and cardinality of their contents, the number of hierarchy levels, and so on. On top of this model, we propose three user exploration scenarios and present two methods for efficient exploration over large datasets: the first one achieves the incremental construction of the model based on user interaction, whereas the second one enables dynamic and efficient adaptation of the model to the user’s preferences. The efficiency of the proposed model is illustrated through a thorough theoretical analysis, as well as an experimental evaluation. Finally, the proposed model is realized in a web-based tool, called SynopsViz that offers a variety of visualization techniques (e.g., charts, timelines) for multilevel visual exploration and analysis over Linked Data (LD) datasets.
Our main contributions are as follows.
– We introduce a generic model for organizing, exploring, and analysing numeric and temporal data in a multilevel fashion.
– We implement our model as a lightweight, main-memory tree-based structure, which can be efficiently constructed on-the-fly.
– We propose two tree structure versions, which adopt different approaches for the data organization.
– We describe a simple method to estimate the tree construction parameters, when no user preferences are available.
– We define different exploration scenarios assuming various user exploration preferences.
– We introduce a method that incrementally constructs and prefetches the hierarchy tree via user interaction.
– We propose an efficient method that dynamically adapts an existing hierarchy to a new one, considering the user’s preferences.
– We present a thorough theoretical analysis, illustrating the efficiency of the proposed model.
– We develop a prototype system which implements the presented model, offering multilevel visual exploration and analysis over LD.
– We conduct a thorough performance evaluation and an empirical user study, using the DBpedia 2014 dataset.
The HETree model
In this section we present HETree (
In what follows, we present some basic aspects of our working scenario (i.e., the visual exploration and analysis scenario) and highlight the main assumptions and requirements employed in the construction of our model. First, the input data in our scenario can be retrieved directly from a database, but may also be produced dynamically; e.g., from a query or from data filtering (e.g., faceted browsing). Thus, we consider that data visualization is performed online; i.e., we do not assume an offline preprocessing phase in the construction of the visualization model. Second, users can specify different requirements or preferences with respect to the data organization. For example, a user may prefer to organize the data as a deep hierarchy for one task, while for another task a flat hierarchical organization is more appropriate. Therefore, even if the data is not dynamically produced, the data organization is dynamically adapted to the user preferences. The same also holds for any additional information (e.g., statistical information) that is computed for each group of objects. This information must be recomputed when the groups of objects (i.e., the data organization) are modified.
From the above, a basic requirement is that the model must be constructed on-the-fly for any given data and user preferences. Therefore, we implement our model as a lightweight, main-memory tree structure, which can be efficiently constructed on-the-fly. We define two versions of this tree structure, following data organization approaches well-suited to the visual exploration and analysis context: the first version considers fixed-range groups of data objects, whereas the second considers fixed-size groups. Finally, our structure allows efficient on-the-fly statistical computations, which are extremely valuable for the hierarchical exploration and analysis scenario.
The basic idea of our model is to hierarchically group data objects based on values of one of their properties. Input data objects are stored at the leaves, while internal nodes aggregate their child nodes. The root of the tree represents (i.e., aggregates) the whole dataset. The basic concepts of our model can be considered similar to a simplified version of a static 1D R-Tree [45].
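As a rough illustration of this idea, the following minimal Python sketch (hypothetical; not the paper’s implementation) models a node that covers an interval and either holds data objects (leaf) or aggregates child nodes (internal):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # A node covers the interval [low, high]. Leaves hold the data objects;
    # internal nodes aggregate their children (cf. a static 1D R-Tree).
    low: float
    high: float
    children: list = field(default_factory=list)  # empty at leaves
    objects: list = field(default_factory=list)   # non-empty only at leaves

    def is_leaf(self) -> bool:
        return not self.children
```

The root then simply covers the interval spanning the whole dataset.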
Regarding the visual representation of the model and data exploration, we consider that both data objects sets (leaf nodes contents) and entities representing groups of objects (leaf or internal nodes) are visually represented enabling the user to explore the data in a hierarchical manner. Note that our tree structure organizes data in a hierarchical model, without setting any constraints on the way the user interacts with these hierarchies. As such, it is possible that different strategies can be adopted, regarding the traversal policy, as well as the nodes of the tree that are rendered in each visualization stage.
In the rest of this section, preliminaries are presented in Section 2.1. In Section 2.2, we introduce the proposed tree structure. Sections 2.3 and 2.4 present the two versions of the structure. Finally, Section 2.5 discusses the specification of the parameters required for the tree construction, and Section 2.6 presents how statistics computations can be performed over the tree.
Preliminaries
In this work we formalize data objects as RDF triples. However, the presented methods are generic and can be applied to any data objects with numeric or temporal attributes. Hence, in the following, the terms triple and (data) object will be used interchangeably.
We consider an RDF dataset R consisting of a set of RDF triples. As input data, we assume a set of RDF triples D, where
Given input data D, S is an ordered set of RDF triples, produced from D, where triples are sorted based on objects’ values, in ascending order. Assume that
Figure 1 presents a set of 10 RDF triples, representing persons and their ages. In Fig. 1, we assume that the subjects

Running example input data (data objects).
In Fig. 1, given the RDF triple
Assume an interval
In this work we assume rooted trees. The number of children of a node is its degree. Nodes with degree 0 are called leaf nodes. Moreover, any non-leaf node is called an internal node. Sibling nodes are nodes that have the same parent. The level of a node is defined by letting the root node be at level zero. Additionally, the height of a node is the length of the longest path from the node to a leaf. A leaf node has a height of 0.
The height of a tree is the maximum level of any node in the tree. The degree of a tree is the maximum degree of a node in the tree. An ordered tree is a tree where the children of each node are ordered. A tree is called an m-ary tree if every internal node has no more than m children. A full m-ary tree is a tree where every internal node has exactly m children. A perfect m-ary tree is a full m-ary tree in which all leaves are at the same level.
In this section, we present the HETree structure in more detail. HETree hierarchically organizes numeric and temporal data into groups; intervals are used to represent these groups.1
Note that our structure handles numeric and temporal data in a similar manner. Also, other types of one-dimensional data may be supported, with the requirement that a total order can be defined over the data.
Note that following a similar approach, the HETree can also be defined by specifying the tree height instead of degree or number of leaves.
Given a set of data objects (RDF triples) D, a positive integer ℓ denoting the number of leaf nodes, and a positive integer d denoting the tree degree, an HETree
The tree has exactly ℓ number of leaf nodes.
All leaf nodes appear in the same level.
Each leaf node contains a set of data objects, sorted in ascending order based on their values. Given a leaf node n,
Each internal node has at most d children nodes. Let n be an internal node,
Each node corresponds to an interval. Given a node n,
At each level, all nodes are sorted based on the lower bounds of their intervals. That is, let n be an internal node, for any
For a leaf node, its interval is bounded by the values of the objects included in this leaf node. Let n be the leftmost leaf node; assume that n contains x objects from D. Then, we have that
For an internal node, its interval is bounded by the union of the intervals of its children. That is, let n be an internal node, having k child nodes; then, we have
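The two interval rules above can be sketched as follows; the helper names are ours, and leaf objects are assumed sorted in ascending order:

```python
def leaf_interval(values):
    """A leaf's interval is bounded by the values of the objects it
    contains (values sorted in ascending order)."""
    return (values[0], values[-1])

def internal_interval(child_intervals):
    """An internal node's interval is bounded by the union of its
    children's intervals."""
    lows, highs = zip(*child_intervals)
    return (min(lows), max(highs))
```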
Furthermore, we present two different approaches for organizing the data in the HETree. Assume a scenario in which a user wishes to (visually) explore and analyse the historic events from DBpedia [8], per decade. In this case, the user orders historic events by their date and organizes them into groups of equal ranges (i.e., decades). In a second scenario, assume that a user wishes to analyse the gross domestic product (GDP) in the Eurostat dataset, organized into fixed-size groups of countries. In this case, the user is interested in finding information such as the range and the variance of the GDP values over the top-10 countries with the highest GDP factor. In this scenario, the user orders countries by their GDP and organizes them into groups of equal sizes (i.e., 10 countries per group).
In the first approach, we organize data objects into groups, where the object values of each group cover an equal range of values. In the second approach, we organize objects into groups, where each group contains the same number of objects. In the following sections, we present the two approaches for organizing the data in the HETree in detail.
In this section we introduce a version of the HETree, named HETree-C (Content-based HETree). This HETree version organizes data into equally sized groups. The basic property of the HETree-C is that each leaf node contains approximately the same number of objects and the content (i.e., objects) of a leaf node specifies its interval. For the tree construction, the objects are first assigned to the leaves and then the intervals are defined.
An HETree-C
We assume that the number of objects is at least as large as the number of leaves; i.e.,
As an alternative, we can construct the HETree-C so that each leaf contains λ objects, except the rightmost leaf, which will contain between 1 and λ objects.
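This fixed-size grouping can be sketched as follows, assuming the alternative just noted (every leaf holds λ = ⌈|D|/ℓ⌉ objects, except possibly the rightmost):

```python
import math

def partition_content(sorted_values, num_leaves):
    """HETree-C leaf contents: split the sorted objects into num_leaves
    fixed-size groups of lam = ceil(n / num_leaves) objects each; the
    rightmost group may hold fewer."""
    lam = math.ceil(len(sorted_values) / num_leaves)
    return [sorted_values[i:i + lam]
            for i in range(0, len(sorted_values), lam)]
```

For the running example of 10 ages and ℓ = 4, this yields three full leaves of 3 objects and a rightmost leaf of 1.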

A Content-based HETree (HETree-C).
Figure 2 presents an HETree-C constructed by considering the set of objects D from Fig. 1,
We construct the HETree-C in a bottom-up way. Algorithm 1 describes the HETree-C construction. Initially, the algorithm sorts the object set D in ascending order, based on object values (line 1). Then, the algorithm uses two procedures to construct the tree nodes. Finally, the root node of the constructed tree is returned (line 4).

createHETree-C/R (D, ℓ, d)
The

constrLeaves-C(S, ℓ)
The
In the complexity computations presented through the paper, terms that are dominated by others (i.e., having lower growth rate) are omitted.

constrInterNodes(H, d)
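The bottom-up construction of the internal levels can be sketched as follows; nodes are modeled as plain (low, high, children) tuples, and the paper’s exact pseudocode details are not reproduced:

```python
def build_internal_levels(leaves, d):
    """Bottom-up construction: repeatedly group up to d adjacent sibling
    nodes under a new parent whose interval is the union of its children's,
    until a single root remains. Nodes are (low, high, children) tuples."""
    level = leaves
    while len(level) > 1:
        level = [(min(n[0] for n in grp),
                  max(n[1] for n in grp),
                  grp)
                 for grp in (level[i:i + d] for i in range(0, len(level), d))]
    return level[0]
```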
The second version of the HETree is called HETree-R (Range-based HETree). HETree-R organizes data into equally ranged groups. The basic property of the HETree-R is that each leaf node covers an equal range of values. Therefore, in HETree-R, the data space defined by the objects values is equally divided over the leaves. As opposed to HETree-C, in HETree-R the interval of a leaf specifies its content. Therefore, for the HETree-R construction, the intervals of all leaves are first defined and then objects are inserted.
An HETree-R
We assume here that there is at least one object in D with a value different from the rest of the objects.
Figure 3 presents an HETree-R tree constructed by considering the set of objects D (Fig. 1),

A Range-based HETree (HETree-R).
This section studies the construction of the HETree-R structure. The HETree-R is also constructed in a bottom-up fashion.
Similarly to the HETree-C version, Algorithm 1 is used for the HETree-R construction. The only difference is the
The procedure constructs ℓ leaf nodes (lines 2–9) and assigns their intervals (lines 4–8); then, it traverses all objects in S (lines 10–12) and places each one in the appropriate leaf node (line 12). Finally, it returns the set of created leaves (line 13).

constrLeaves-R(S, ℓ)
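The range-based leaf assignment can be sketched as follows (a hypothetical sketch of the idea, not the paper’s procedure): the value domain is split into ℓ equal-width ranges, and each object goes to the leaf whose range covers its value, with the domain maximum clamped into the rightmost leaf:

```python
def assign_range_leaves(values, num_leaves):
    """HETree-R leaf contents: the value domain [lo, hi] is divided into
    num_leaves equal-width ranges; each object is placed in the leaf whose
    range covers its value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_leaves
    leaves = [[] for _ in range(num_leaves)]
    for v in values:
        # Clamp the maximum value into the last leaf.
        idx = min(int((v - lo) / width), num_leaves - 1)
        leaves[idx].append(v)
    return leaves
```

Note that, unlike HETree-C, some leaves may end up empty if the values are skewed.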
In our working scenario, the user specifies the parameters required for the HETree construction (e.g., the number of leaves ℓ). In this section, we describe our approach for automatically calculating the HETree parameters based on the input data, when no user preferences are provided. Our goal is to derive the parameters from the input data, such that the resulting HETree addresses some basic guidelines set by the visualization environment. In what follows, we discuss the proposed approach in detail.
An important parameter in hierarchical visualizations is the minimum and maximum number of objects that can be effectively rendered in the most detailed level.7
Similar bounds can also be defined for other tree levels.
Therefore, in HETree construction, our approach considers the minimum and the maximum number of objects per leaf node, denoted as
Assume that based on an adopted visualization technique, the ideal number of data objects to be rendered on a specific screen is between 25 and 50. Hence, we have that
Now, let’s assume that we want to visualize the object set
Hence, our HETree-C should have between
Now, let’s assume that we want to visualize the object set
In the case where more than one setting satisfies the considered guideline, we select the preferable one according to the following set of rules. From the candidate settings, we prefer the setting which results in the highest tree (1st Criterion).8
Depending on user preferences and the examined task, the shortest tree may be preferable. For example, starting from the root, the user wishes to access the data objects (i.e., lowest level) by performing the smallest amount of drill-down operations possible.
In our example, from the candidate settings, setting S1 is selected, since it will construct the highest tree (i.e.,
Now, assume a scenario where only S2 and S3 are candidates. In this case, since both settings result in trees with equal heights, the 2nd Criterion is considered. Hence, for S2 we have
In case of HETree-R, a similar approach is followed, assuming normal distribution over the values of the objects.
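The parameter estimation described above can be sketched as follows for HETree-C. This is a simplified, hypothetical enumeration of perfect d-ary settings whose average number of objects per leaf falls within the screen-driven bounds; applying the tie-breaking criteria among candidates is left to the caller:

```python
def candidate_settings(n, per_leaf_min, per_leaf_max, degrees=(2, 3, 4)):
    """Enumerate perfect d-ary tree settings (degree, height, num_leaves),
    with num_leaves = degree ** height, whose average number of objects per
    leaf lies within [per_leaf_min, per_leaf_max]."""
    candidates = []
    for d in degrees:
        height, leaves = 1, d
        while n / leaves >= per_leaf_min:
            if n / leaves <= per_leaf_max:
                candidates.append((d, height, leaves))
            height, leaves = height + 1, leaves * d
    return candidates
```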
Number of leaf nodes for perfect m-ary trees
Data statistics are a crucial aspect in the context of hierarchical visual exploration and analysis. Statistical information over groups of objects (i.e., aggregations) offers rich insights into the underlying (i.e., aggregated) data. In this way, useful information regarding different sets of objects with common characteristics is provided. Additionally, this information may also guide the users through their navigation over the hierarchy.
In this section, we present how statistics computation is performed over the nodes of the HETree. Statistics computations exploit two main aspects of the HETree structure: (1) the internal nodes aggregate their child nodes; and (2) the tree is constructed in a bottom-up fashion. Statistics computation is performed during the tree construction; for the leaf nodes, we gather statistics from the objects they contain, whereas for the internal nodes we aggregate the statistics of their children.
For simplicity, here, we assume that each node contains the following extra fields, used for simple statistics computations, although more complex or RDF-related (e.g., most common subject, subject with the minimum value, etc.) statistics can be computed. Assume a node n, as
Statistics computations can be easily performed in the construction algorithms (Algorithm 1) without any modifications. The following example illustrates these computations. In this example we assume the HETree-C presented in Fig. 2. Figure 4 shows the HETree-C with the computed statistics in each node. When all the leaf nodes have been constructed, the statistics for each leaf are computed. For instance, we can see from Fig. 4 that for the rightmost leaf h we have:
Statistics computation over HETree.
Then, for each parent node we construct, we compute its statistics using the computed statistics of its child nodes. Considering the internal node c, with the child nodes g and h, we have that
A similar approach is also followed for the case of HETree-R.
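The two-step computation (leaf statistics gathered from the raw objects, internal-node statistics aggregated from the children) can be sketched as follows, for a hypothetical set of simple statistics:

```python
def leaf_stats(values):
    # Leaf statistics are gathered directly from the contained objects.
    n = len(values)
    return {"count": n, "min": min(values), "max": max(values),
            "mean": sum(values) / n}

def merge_stats(child_stats):
    # Internal-node statistics are aggregated from the children's statistics
    # alone, without revisiting the raw objects (mean via weighted average).
    total = sum(s["count"] for s in child_stats)
    return {
        "count": total,
        "min": min(s["min"] for s in child_stats),
        "max": max(s["max"] for s in child_stats),
        "mean": sum(s["mean"] * s["count"] for s in child_stats) / total,
    }
```

This is why the computation fits naturally into the bottom-up construction: each internal node only touches its children’s summaries, never the underlying objects.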

In this section, we exploit the HETree structure in order to efficiently handle different multilevel exploration scenarios. Essentially, we propose two methods for efficient hierarchical exploration over large datasets. The first method incrementally constructs the hierarchy via user interaction; the second one achieves dynamic adaptation of the data organization based on user’s preferences.
Exploration scenarios
In a typical multilevel exploration scenario, referred here as Basic exploration scenario (BSC), the user explores a dataset in a top-down fashion. The user first obtains an overview of the data through the root level, and then drills down to more fine-grained contents for accessing the actual data objects at the leaves. In BSC, the root of the hierarchy is the starting point of the exploration and, thus, the first element to be presented (i.e., rendered).
The described scenario offers basic exploration capabilities; however, it does not cover use cases with user-specified starting points other than the root, such as starting the exploration from a specific resource, or from a specific range of values.
Consider the following example, in which the user wishes to explore the DBpedia infoboxes dataset to find places with very large population. Initially, she selects the populationTotal property and starts her exploration from the root node, moves down the right part of the tree and ends up at the rightmost leaf that contains the highly populated places. Then, she is interested in viewing the area size (i.e., areaTotal property) for one of the highly populated places and, also, in exploring places with similar area size. Finally, she decides to explore places based on the water area size (i.e., areaWater) they contain. In this case, she prefers to start her exploration by considering places whose water area size is within a given range of values.
In this example, besides the BSC scenario, we consider two additional exploration scenarios. In the Resource-based exploration scenario (RES), the user specifies a resource of interest (e.g., an IRI) and a specific property; the exploration starts from the leaf containing the specific resource and proceeds in a bottom-up fashion. Thus, in RES the data objects contained in the same leaf with the resource of interest are presented first. We refer to that leaf as the leaf of interest.
The third scenario, named Range-based exploration scenario (RAN) enables the user to start her exploration from an arbitrary point in the hierarchy providing a range of values; the user starts from a set of internal nodes and she can then move up or down the hierarchy. The RAN scenario begins by rendering all sibling nodes that are children of the node covering the specified range of interest; we refer to these nodes as nodes of interest.
Note that, regarding the adopted rendering policy for all scenarios, we only consider nodes belonging to the same level. That is, either sibling nodes, or data objects contained in the same leaf, are rendered.
Regarding the “navigation-related” operations, the user can move down or up the hierarchy by performing a drill-down or a roll-up operation, respectively. A drill-down operation over a node n enables the user to focus on n and render its child nodes. If n is a leaf node, the set of data objects contained in n is rendered. On the other hand, the user can perform a roll-up operation on a set of sibling nodes S. The parent node of S along with the parent’s sibling nodes are rendered. Finally, the roll-up operation, when applied to a set of data objects O, will render the leaf node that contains O along with its sibling leaves, whereas a drill-down operation is not applied to a data object.
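These navigation operations can be sketched as follows, using plain dictionaries with parent links (a hypothetical representation, not the paper’s implementation):

```python
def drill_down(node):
    """Focus on a node: render its children, or its data objects
    if it is a leaf."""
    return node["children"] if node["children"] else node["objects"]

def roll_up(siblings):
    """From a set of sibling nodes, render their parent along with
    the parent's siblings."""
    parent = siblings[0]["parent"]
    grandparent = parent["parent"]
    # At the top level the parent is the root, which has no siblings.
    return [parent] if grandparent is None else grandparent["children"]
```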
Incremental HETree construction
In the Web of Data, the dataset might be dynamically retrieved from a remote site (e.g., via a SPARQL endpoint); as a result, in all exploration scenarios, we have assumed that the HETree is constructed on-the-fly at the time the user starts her exploration. In the previous DBpedia example, the user explores three different properties; although only a small part of their hierarchy is accessed, the whole hierarchies are constructed and the statistics of all nodes are computed. Considering the recommended HETree parameters for the employed properties, this scenario requires that 29.5K nodes be constructed for the populationTotal property, 9.8K nodes for areaTotal and 3.3K nodes for areaWater, amounting to a total of 42.6K nodes. However, the construction of the hierarchies for large datasets poses a time overhead (as shown in the experimental section) and, consequently, an increased response time in user exploration.
In this section, we introduce ICO (

Incremental HETree construction example.
Resource-based (RES) exploration scenario;
Range-based (RAN) exploration scenario.
Employing the ICO method in the DBpedia example, only 76 nodes are constructed for the populationTotal hierarchy (the root along with its child nodes, and 9 nodes in each of the lower tree levels), and only 3 nodes for the areaTotal hierarchy, corresponding to the leaf node containing the requested resource and its siblings. Finally, the areaWater hierarchy will initially contain either 6 or 15 nodes, depending on whether the user’s input range corresponds to a set of sibling leaf nodes, or to a set of sibling internal nodes, respectively.
We demonstrate the functionality of ICO through the following example. Assume the dataset used in our running examples, describing persons and their ages. Figure 5 presents the incremental construction of the HETree presented in Fig. 3 for the RES and RAN exploration scenarios. Blue color is used to indicate the HETree elements that are presented (rendered) to the user, in each exploration stage.
In the RES scenario (upper flow in Fig. 5), the user specifies “
In the RAN scenario (lower flow in Fig. 5), the user specifies
In the beginning of each exploration scenario, ICO constructs a set of initial nodes, which are the nodes initially presented, as well as the nodes potentially reached by the user’s first operation (i.e., required HETree elements). The required HETree elements of an exploration step are nodes that can be reached by the user by performing one exploration operation. Hence, in the RES scenario, the initial nodes are the leaf of interest and its sibling leaves. In the RAN, the initial nodes are the nodes of interest, their children, and their parent node along with its siblings. Finally, in the BSC scenario the initial nodes are the root node and its children.
In what follows we describe the construction rules adopted by ICO through the user exploration process. These rules provide the correspondences between the types of elements presented in each exploration step and the elements that ICO constructs. Note that these rules are applied after the construction of the initial nodes, in all three exploration scenarios. The correctness of these rules is verified later in Proposition 1.
If a set of internal sibling nodes C is presented, ICO constructs: (i) the parent node of C along with the parent’s siblings, and (ii) the children of each node in C.
If a set of leaf sibling nodes L is presented, ICO does not construct anything (the required nodes have been previously constructed).
If a set of data objects O is presented, ICO does not construct anything (the required nodes have been previously constructed).
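The first rule (the only one that constructs new elements) can be sketched as follows; node construction is simulated by flipping a hypothetical built flag, and the helper name is ours:

```python
def ico_expand(presented):
    """ICO rule for a presented set of internal siblings C: construct
    (i) the parent of C along with the parent's siblings (reachable by
    one roll-up) and (ii) the children of every node in C (reachable by
    one drill-down). Returns the nodes newly constructed."""
    to_build = []
    parent = presented[0]["parent"]
    if parent is not None:
        grandparent = parent["parent"]
        upper = [parent] if grandparent is None else grandparent["children"]
        to_build += [n for n in upper if not n["built"]]
    for node in presented:
        to_build += [c for c in node["children"] if not c["built"]]
    for node in to_build:
        node["built"] = True
    return to_build
```

Only elements one operation away from the presented level are touched, which is the intuition behind the minimality result below.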
The following proposition shows that, in all cases, the required HETree elements have been constructed earlier by ICO.9
Proofs are included in Appendix A.
In any exploration scenario, the HETree elements a user can reach by performing one operation (i.e., required elements), have been previously constructed by ICO.
Also, the following theorem shows that, in any exploration scenario, ICO constructs only the required HETree elements.
ICO constructs the minimum number of HETree elements in any exploration scenario.
In this section, we present the incremental HETree construction algorithm. Note that here we include the pseudocode only for the HETree-R version, since the only differences from the HETree-C version are in the way that the nodes’ intervals are computed and that the dataset is initially sorted. In the analysis of the algorithms, both versions are studied.
Here, we assume that each node n contains the following extra fields. Let a node n,
The algorithm

ICO-R(D, ℓ, d, U,
The
Based on
After the first call, in each ICO execution, the algorithm initially checks if the parent node of the currently presented elements is already constructed, or if all the nodes that enclose data objects10
Note that in the HETree-R version, we may have nodes that do not enclose any data objects.
Here we analyse the incremental construction for both HETree versions.
Regarding the maximum number of nodes constructed in each operation in RES and RAN scenarios: (1) A roll-up operation constructs at most
Summary of Adaptive HETree Construction⋆
In a (visual) exploration scenario, users wish to modify the organization of the data by providing user-specific preferences for the whole hierarchy or part of it. The user can select a specific subtree and alter the number of groups presented in each level (i.e., the tree degree) or the size of the groups (i.e., number of leaves). In this case, a new tree (or a part of it) pertaining to the new parameters provided by the user should be constructed on-the-fly.
For example, consider the HETree-C of Fig. 6 representing ages of persons.
For simplicity, Fig. 6 presents only the values of the objects.

Adaptive HETree example.
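The reorganization described above, i.e., rebuilding the internal levels for a new degree while reusing the existing leaves, can be sketched as follows (the helper name and the list-of-lists representation are our own assumptions, not the paper's algorithm):

```python
# Illustrative sketch: rebuild the internal levels of the tree for a new
# degree, keeping the leaf level fixed.

def rebuild_internal_levels(leaves, degree):
    """Return the levels of a tree of the given degree over fixed leaves.
    Each level is a list of nodes; a node is the list of its children."""
    levels = [leaves]
    level = leaves
    while len(level) > 1:
        # group every `degree` consecutive nodes under a new parent
        level = [level[i:i + degree] for i in range(0, len(level), degree)]
        levels.append(level)
    return levels
```

For 9 leaves and degree 3, this yields a height-2 internal structure over the unchanged leaves; only the internal nodes (and their statistics) need to be recomputed.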
In this section, we introduce ADA (
Let
Then, ADA identifies the following elements of
Consequently, we consider that an element (i.e., node or node’s statistics) in
Note that it is possible for a from scratch constructed node in
Table 2 summarizes the ADA reconstruction process. Particularly, the table includes: (1) the computational complexity for constructing
The following example demonstrates the ADA results, considering a DBpedia exploration scenario.
The user explores the populationTotal property of the DBpedia dataset. The default system organization for this property is a hierarchy with degree 3. The user modifies the tree parameters to obtain better visualization results, as follows. First, she decides to render more groups in each hierarchy level and increases the degree from 3 to 9 (1st Modification). Then, she observes that the results overflow the visualization area and that a smaller degree fits better; thus she re-adjusts the tree degree to 6 (2nd Modification). Finally, she navigates through the data values and decides to increase the groups' size by a factor of three (i.e., dividing the number of leaves by three) (3rd Modification). Again, she corrects her decision and re-adjusts the final group size to twice the default size (4th Modification).
Table 3 summarizes the number of nodes, constructed by a Full Construction and ADA in each modification, along with the required statistics computations. Considering the whole set of modifications, ADA constructs only the 22% (15.4K vs. 70.2K) of the nodes that are created in the case of the full construction. Also, ADA computes the statistics for only 8% (5.6K vs. 70.2K) of the nodes.
Full Construction vs. ADA over DBpedia Exploration Scenario (cell values: Full / ADA)
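The savings percentages quoted above can be reproduced directly from the reported node counts:

```python
# Reproducing the savings percentages from the node counts reported above.
full_nodes = 70.2e3   # nodes created by a full construction
ada_nodes = 15.4e3    # nodes constructed by ADA
ada_stats = 5.6e3     # nodes whose statistics ADA computes

print(round(100 * ada_nodes / full_nodes))  # -> 22 (% of nodes constructed)
print(round(100 * ada_stats / full_nodes))  # -> 8  (% of statistics computed)
```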
In the next sections, we present in detail the reconstruction process through the example trees of Fig. 7. Figure 7(a) presents the initial tree

Adaptive HETree construction examples.
Regarding the modification of the degree parameter, we distinguish the following cases:
In general, except for the leaves, we construct all internal nodes from scratch. For the internal nodes of height 1, we compute their statistics by aggregating the statistics of
elsewhere: In any other case where the user increases the tree degree, all internal nodes in
elsewhere: This case is the same as the previous case (3) where the user increases the tree degree.
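The statistics aggregation used in the cases above can be sketched as follows; the field names are our own assumptions:

```python
# Sketch of computing a parent node's statistics by aggregating the
# already computed statistics of its children.

def aggregate_stats(children):
    """Each child's stats is a dict with 'count', 'sum', 'min', 'max'."""
    return {
        "count": sum(c["count"] for c in children),
        "sum":   sum(c["sum"] for c in children),
        "min":   min(c["min"] for c in children),
        "max":   max(c["max"] for c in children),
    }
```

The mean of a node then follows as stats["sum"] / stats["count"], so no pass over the raw data objects is needed when only the tree shape changes.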
The user modifies the number of leaves
Regarding the modification of the number of leaves parameter, we distinguish the following cases:
In this case, constructing
Therefore, in this case,
An example of this case is shown in Fig. 7(e), which depicts a reconstructed
Due to this partial containment, we have to construct all leaves and internal nodes from scratch and recalculate their statistics. Still, the statistics of the fully contained leaves of

System architecture.
Based on the proposed hierarchical model, we have developed a web-based prototype called SynopsViz.
The key features of SynopsViz are summarized as follows: (1) It supports the aforementioned hierarchical model for RDF data visualization, browsing and analysis. (2) It offers automatic on-the-fly hierarchy construction, as well as user-defined hierarchy construction based on users' preferences. (3) It provides faceted browsing and filtering over classes and properties. (4) It integrates statistics with visualization; visualizations have been enriched with useful statistics and data information. (5) It offers several visualization techniques (e.g., timeline, chart, treemap). (6) It provides a large number of dataset statistics regarding the data level (e.g., number of sameAs triples), the schema level (e.g., most common classes/properties), and the structure level (e.g., entities with the largest in-degree). (7) It provides numerous metadata related to the dataset: licensing, provenance, linking, availability, undesirability, etc. The latter can be considered useful for assessing data quality [115]. In the rest of this section, Section 4.1 describes the system architecture, Section 4.2 demonstrates the basic functionality of SynopsViz, and Section 4.3 provides technical information about the implementation.
The architecture of SynopsViz is presented in Fig. 8. It involves three main parts: the Client UI, the SynopsViz back-end, and the Input data. The Client part corresponds to the system's front-end, offering several functionalities to the end-users, such as hierarchical visual exploration and faceted search (see Section 4.2 for more details). SynopsViz consumes RDF data as Input data; optionally, OWL-RDF/S vocabularies/ontologies describing the input data can be loaded. Next, we describe the basic components of SynopsViz.
In the preprocessing phase, the Data and Schema Handler parses the input data and infers schema information (e.g., property domain(s)/range(s), class/property hierarchy, type of instances, type of properties, etc.). Facet Generator generates class and property facets over the input data. Statistics Generator computes several statistics regarding the schema, instances and graph structure of the input dataset. Metadata Extractor collects dataset metadata. Note that the model construction does not require any preprocessing; it is performed online, according to user interaction.
During runtime the following components are involved. Hierarchy Specifier is responsible for managing the configuration parameters of our hierarchy model, e.g., the number of hierarchy levels and the number of nodes per level, and for providing this information to the Hierarchy Constructor. Hierarchy Constructor implements our tree structure. Based on the selected facets and the hierarchy configuration, it determines the hierarchy of groups and the contained triples. Statistics Processor computes statistics about the groups included in the hierarchy. Visualization Module enables the interaction between the user and the back-end, allowing several operations (e.g., navigation, filtering, hierarchy specification) over the visualized data. Finally, the Hierarchical Model Module maintains the in-memory tree structure for our model and communicates with the Hierarchy Constructor for the model construction, the Hierarchy Specifier for the model customization, the Statistics Processor for the statistics computations, and the Visualization Module for the visual representation of the model.

Web user interface.
In this section we outline the basic functionality of the SynopsViz prototype. Figure 9 presents the web user interface of the main window. The SynopsViz UI consists of the following main panels: Facets panel: presents and manages facets on classes and properties; Input data control panel: enables the user to import and manage input datasets; Visualization panel: the main area where interactive charts and statistics are presented; Configuration panel: handles visualization settings.
Initially, users are able to select a dataset from a number of offered real-world LD datasets (e.g., DBpedia, Eurostat) or upload their own. Then, for the selected dataset, users are able to examine several of the dataset's metadata and explore several of its statistics.
Using the facets panel, users are able to navigate and filter data based on classes, numeric and date properties. In addition, the facets panel provides information about the classes and properties (e.g., number of instances, domain(s), range(s), IRI) through the UI.
Users are able to visually explore data by considering properties' values. Particularly, area charts and timeline-based area charts are used to visualize the resources considering the user's selected properties. Classes' facets can also be used to filter the visualized data. Initially, the top level of the hierarchy is presented, providing an overview of the data organized into top-level groups; the user can interactively drill down (i.e., zoom in) and roll up (i.e., zoom out) over the group of interest, down to the actual values of the input data (i.e., LD resources). At the same time, statistical information concerning the hierarchy groups as well as their contents (e.g., mean value, variance, sample data, range) is presented through the UI (Fig. 10). Regarding the most detailed level (i.e., LD resources), several visualization types are offered, i.e., area, column, line, spline and areaspline (Fig. 10).
In addition, users are able to visually explore data, through class hierarchy. Selecting one or more classes, users can interactively navigate over the class hierarchy using treemaps (Fig. 10) or pie charts (Fig. 10). Properties’ facets can also be used to filter the visualized data. In SynopsViz the treemap visualization has been enriched with schema and statistical information. For each class, schema metadata (e.g., number of instances, subclasses, datatype/object properties) and statistical information (e.g., the cardinality of each property, min, max value for datatype properties) are provided.
Finally, users can interactively modify the hierarchy specifications. Particularly, they are able to increase or decrease the level of abstraction/detail presented, by modifying both the number of hierarchy levels, and number of nodes per level.
A video presenting the basic functionality of our prototype is available at youtu.be/n2ctdH5PKA0. Also, a demonstration of the SynopsViz tool is presented in [19].
Implementation
SynopsViz is implemented on top of several open source tools and libraries. The back-end of our system is developed in Java; the Jena framework is used for RDF data handling and Jena TDB for disk-based RDF storage. The front-end prototype is developed using HTML and JavaScript. Regarding visualization libraries, we use Highcharts for the area, column, line, spline, areaspline and timeline-based charts, and Google Charts for the treemap and pie charts.

Numeric data & class hierarchy visualization examples.
In this section we present the evaluation of our approach. In Section 5.1, we present the dataset and the experimental setting. Then, in Section 5.2 we present the performance results and in Section 5.3 the user evaluation we performed.
Experimental setting
In our evaluation, we use the well-known DBpedia 2014 LD dataset. Particularly, we use the Mapping-based Properties (cleaned) dataset,
which contains high-quality data extracted from Wikipedia Infoboxes. This dataset contains 33.1M triples and includes a large number of numeric and temporal properties of varying sizes. The largest numeric property in this dataset has 534K triples, whereas the largest temporal property has 762K. Regarding the methods used in our evaluation, we consider our HETree hierarchical approaches, as well as a simple non-hierarchical visualization approach, referred to as FLAT. FLAT is considered a competitive method against our hierarchical approaches. It provides single-level visualizations, rendering only the actual data objects; i.e., it is the same as the visualization provided by SynopsViz at the most detailed level. In more detail, the FLAT approach corresponds to a column chart in which the resources are sorted in ascending order based on their object values, the horizontal axis contains the resources' names (i.e., triples' subjects), and the vertical axis corresponds to the objects' values. By hovering over a resource, a tooltip appears including the resource's name and object value.
Regarding the HETree approaches, the tree parameters (i.e., number of leaves, degree and height) are automatically computed following the approach described in Section 2.5. In our experiments, the lower and the upper bound for the objects rendered at the most detailed level have been set to
Finally, our back-end system is hosted on a server with a quad-core CPU at 2 GHz and 8 GB of RAM running Windows Server 2008. As client, we used a laptop with an i5 CPU at 2.5 GHz and 4 GB of RAM, running Windows 7 and Firefox 38.0.1 over an ADSL2+ internet connection. Additionally, in the user evaluation, the client is equipped with a 24” (
In this section, we study the performance of the proposed model, as well as the behaviour of our tool, in terms of construction and response time, respectively. Section 5.2.1 describes the setting of our performance evaluation, and Section 5.2.2 presents the evaluation results.
Setup
In order to study the performance, a number of numeric and temporal properties from the employed dataset are visualized using the two hierarchical approaches (i.e., HETree-C/R), as well as the FLAT approach. We select one set from each type of properties; each set contains 15 properties with varying sizes, starting from small properties having 50–100 triples up to the largest properties.
In our experiment, for each of the three approaches, we measure the tool response time. Additionally, for the two hierarchical approaches we also measure the time required for the HETree construction.
Note that in the hierarchical approaches, through user interaction, the server sends to the browser only the data required for rendering the current visualization level (although the whole tree is constructed at the back-end). Hence, when a user requests to generate a visualization, we have the following workflow. Initially, our system constructs the tree. Then, the data regarding the top-level groups (i.e., the root node's children) are sent to the browser, which renders the result. Afterwards, based on user interactions (i.e., drill-down, roll-up), the server retrieves the required data from the tree and sends it to the browser. Thus, the tree is constructed the first time a visualization is requested for the given input dataset; for any further user navigation over the hierarchy, the response time does not include the construction time. Therefore, in our experiments, for the hierarchical approaches, as response time we measure the time required by our tool to provide the first response (i.e., render the top-level groups), which corresponds to the slowest response in our visual exploration scenario. Thus, we consider the following measures in our experiments:
Construction Time: the time required to build the HETree structure. This time includes (1) the time for sorting the triples; (2) the time for building the tree; and (3) the time for the statistics computations.
Response Time: the time required to render the charts, starting from the time the client sends the request. This time includes (1) the time required by the server to compute and build the response. In the hierarchical approaches, this time corresponds to the Construction Time, plus the time required by the server to build the JSON object sent to the client. In the FLAT approach, it corresponds to the time spent in sorting the triples plus the time for the JSON construction; (2) the time spent in the client-sever communication; and (3) the time required by the visualization library to render the charts on the browser.
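The request workflow described above can be sketched as follows (an illustrative Python sketch; the cache and all names are our assumptions, not the SynopsViz implementation):

```python
# Sketch of the request workflow: the tree is built on the first
# visualization request and cached; later drill-downs only serialize
# the requested node's children.

_tree_cache = {}

def first_response(dataset_id, data, build_tree):
    """First request: construct (and cache) the tree, then return the
    top-level groups (the root node's children)."""
    if dataset_id not in _tree_cache:
        _tree_cache[dataset_id] = build_tree(data)   # construction cost paid once
    root = _tree_cache[dataset_id]
    return root["children"]       # only the top level is sent to the client

def drill_down(node):
    """Later interactions reuse the cached tree; no reconstruction cost."""
    return node["children"]
```

This is why, in the measurements above, only the first response includes the Construction Time.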
Performance Results for Numeric & Temporal Properties
Table 4 presents the evaluation results regarding the numeric (upper half) and the temporal properties (lower half). The properties are sorted in ascending order of the number of triples. For each property, the table contains the number of triples, the characteristics of the constructed HETree structures (i.e., number of leaves, degree, height, and number of nodes), as well as the construction and the response time for each approach. The presented time measurements are the average values from 50 executions.
Regarding the comparison between the HETree and FLAT approaches, FLAT cannot provide results for properties having more than 305K triples, indicated by “—” in the FLAT response time in the last rows for both numeric and temporal properties. For the remaining properties, we can observe that the HETree approaches clearly outperform FLAT in all cases, even for the smallest property (i.e., rankingWin, 50 triples). As the size of the properties increases, the difference between the HETree approaches and FLAT increases as well. In more detail, for large properties having more than 53K triples (i.e., the numeric properties larger than populationDensity -12th row-, and the temporal properties larger than added -11th row-), the HETree approaches outperform FLAT by one order of magnitude.
Regarding the time required for the construction of the HETree structure, from Table 4 we can observe the following: The performance of both HETree structures is very close for most of the examined properties, with HETree-R performing slightly better than HETree-C (especially for the relatively small numeric properties). Furthermore, we can observe that the response time follows a similar trend to the construction time. This is expected, since the communication cost, as well as the time required for constructing and rendering the JSON object, is almost the same in all cases.
Regarding the comparison between the construction and the response time in the HETree approaches, from Table 4 we can observe the following. For properties having up to 5.5K triples (i.e., the numeric properties smaller than width -8th row-, and the temporal properties smaller than decommissioningDate -7th row-), the response time is dominated by the communication cost and the time required for the JSON construction and rendering. For a property with only a small number of triples (i.e., waistSize, 241 triples), only 1.5% of the response time is spent on constructing the HETree. Moreover, for a property with a larger number of triples (i.e., buildingStartData, 1,415 triples), 18% of the time is spent on constructing the HETree. Finally, for the largest property for which the time spent on communication, JSON construction and rendering exceeds the construction time (i.e., powerOutput, 5,453 triples), 42% of the time is spent on constructing the HETree.
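As a hedged illustration, the Construction Time discussed above decomposes into the three parts listed earlier (sorting, tree building, statistics); the helper names below are our own assumptions:

```python
# Sketch of measuring the Construction Time as the sum of its three parts.
import time

def timed_construction(triples, build_tree, compute_stats):
    """Return the tree plus the Construction Time, covering:
    (1) sorting the triples, (2) building the tree, (3) statistics."""
    t0 = time.perf_counter()
    triples = sorted(triples, key=lambda t: t[1])   # (1) sort by object value
    tree = build_tree(triples)                      # (2) build the HETree
    compute_stats(tree)                             # (3) per-node statistics
    return tree, time.perf_counter() - t0
```

The response time then adds the JSON construction, the client-server communication, and the browser rendering on top of this value.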

Response Time w.r.t. the number of triples.
Figure 11 summarizes the results from Table 4, presenting the response time for all approaches w.r.t. the number of triples. Particularly, Fig. 11(a) includes all property sizes (i.e., 50 to 762K). Further, in order to have a precise observation over small property sizes (Small properties), in which the difference between the FLAT and the HETree approaches is smaller, we report properties with less than 20K triples separately in Fig. 11(b). Once again, we observe that HETree-R performs slightly better than HETree-C. Additionally, from Fig. 11(b) we can see that for up to 10K triples the performance of the two HETree approaches is almost the same. We can also observe the significant difference between the FLAT and the HETree approaches.
Although our method clearly outperforms the non-hierarchical method, as we can observe from the above results, the construction of the whole hierarchy cannot provide an efficient solution for datasets containing more than 10K objects. As discussed in Section 3.2, efficient exploration over large datasets requires an incremental hierarchy construction. In the incremental exploration scenario, the number of hierarchy nodes that have to be processed and constructed is significantly smaller compared to the non-incremental one.
For example, adopting a non-incremental construction for populationTotal (305K triples), 29.6K nodes have to be constructed initially (along with their statistics). On the other hand, with the incremental approach (as analysed in Section 3.2), at the beginning of each exploration scenario only the initial nodes are constructed. The initial nodes are the nodes initially presented, as well as the nodes potentially reached by the user's first operation.
In the RES scenario, the initial nodes are the leaf of interest (1 node) and its sibling leaves (at most
In this section we present the user evaluation of our tool, where we have employed three approaches: the two hierarchical and the FLAT. Section 5.3.1 describes the user tasks, Section 5.3.2 outlines the evaluation procedure and setup, Section 5.3.3 summarizes the evaluation results, and Section 5.3.4 discusses issues related to the evaluation process.
Tasks
In this section we describe the different types of tasks that are used in the user evaluation process.
Average Task Completion Time (sec)

Error Rate (%)

In order to study the effect of the property size on the selected tasks, we selected two properties of different sizes from the employed dataset (Section 5.1). The hsvCoordinateHue numeric property, containing 970 triples, is referred to as Small, and the maximumElevation numeric property, containing 37,936 triples, is referred to as Large. The first corresponds to a hierarchy of height 4 and degree 3, and the latter to a hierarchy of height 7 and degree 3. We should note here that throughout the user evaluation, the hierarchy parameters were fixed for all the tasks, and the participants were not allowed to modify them, so that the setting was the same for everyone.
In our evaluation, 10 participants took part. The participants were computer science graduate students and researchers. At the beginning of the evaluation, each participant was introduced to the system by an instructor, who provided a brief tutorial over the features required for the tasks. After the instructions, the participants familiarized themselves with the system. Note that we integrated the FLAT approach into SynopsViz alongside the HETree approaches.
During the evaluation, each participant performed the previously described four tasks, using all approaches (i.e., HETree-C/R and FLAT), over both the small and large properties. In order to reduce the learning effects and fatigue we defined three groups. In the first group, the participants start their tasks with the HETree-C approach, in the second with HETree-R, and in the third with FLAT. Finally, the property (i.e., small, large) first used in each task was counterbalanced among the participants and the tasks. The entire evaluation did not exceed 75 minutes.
Furthermore, for each task (e.g., T2.1, T3), three task instances were specified by slightly modifying the task parameters. As a result, given a task, a participant has to solve a different instance of this task with each approach.
For example, in task T2.1, for the HETree-R, the selected v corresponds to a solution of 11 resources, in HETree-C, to 9 resources, whereas for FLAT v corresponded to a solution of 8 resources. The task instance assigned to each approach varied among the participants.
During the evaluation the instructor measured the time required for each participant to complete a task, as well as the number of incorrect answers. Table 5 presents the average time required for the participants to complete each task. The table contains the measurements for all approaches, and for both properties. Although we acknowledge that the number of participants in our evaluation is small, we have computed the statistical significance of the results. Essentially, for each property, the p-value of each task is presented in the last column. The p-value is computed using one-way repeated measures ANOVA.
In addition, the results regarding the number of tasks that were not correctly answered are presented in Table 6. Particularly, the table presents the percentage of incorrect answers for each task and property, referred to as error rate. Additionally, for each task and property, the table includes the p-value. Here, the p-value has been computed using Fisher’s exact test.
Results
As expected, all approaches require more time for the Large property compared to the Small one. This overhead in FLAT is caused by the larger number of resources that the participants have to scroll over and examine, until they indicate the requested resource’s value. On the other hand, in HETree, the overhead is caused by the larger number of levels that the Large property hierarchy has. Hence, the participants have to perform more drill-down operations and examine more groups of objects, until they reach the LD resources.
We can also observe that in this task, the HETree-R performs slightly better than the HETree-C in both property sizes. This is due to the fact that, in HETree-R structure, resources having the same value are always contained in the same leaf. As a result, the participants had to inspect only one leaf. On the other hand, in HETree-C this does not always hold, hence the participants could have explored more than one leaf.
Finally, as we can observe from Table 6, in all cases only correct answers have been provided. However, none of those results are statistically significant (
The poor performance of the HETree approaches in this task can be explained by the small set of resources requested and the HETree parameters adopted in the user evaluation. In this setting, the resources contained in the task solution are distributed over more than one leaf. Hence, the participants had to perform several roll-up and drill-down operations in order to find all the resources. On the other hand, in FLAT, once the participants had identified one of the requested resources, it was very easy for them to find the rest of the solution's resources. To sum up, in FLAT, most of the time is spent on identifying the first of the resources, while in HETree the first resource is identified very quickly. Regarding the difference in performance between the HETree approaches, we have the following. In HETree-C, due to the fixed number of objects in each leaf, the participants had to visit at most one or two leaves in order to solve this task. On the other hand, in HETree-R, the number of objects in each leaf varies, so most times the participants had to inspect more than two leaves in order to solve the task. Finally, also in this case only correct answers were given (Table 6).
In the FLAT approach a considerable time was spent on identifying and navigating over a large number of resources. On the other hand, due to the large number of resources involved in the task's solution, there are groups in the hierarchy that exclusively contain resources of the solution (i.e., they do not contain resources not included in the solution). As a result, the participants in HETree could easily identify and compute the whole solution by combining the information related to the groups (i.e., number of enclosed resources) and individual resources. For the same reasons stated in the previous task (i.e., T2.1), in T2.2 HETree-C likewise performs slightly better than HETree-R. Finally, we can observe from Table 6 (though without statistical significance) that it was more difficult for participants to solve this task correctly with FLAT than with HETree.
Regarding the Large property, as expected, it was impossible for participants to solve this task with FLAT, since this required parsing over and counting about 19K resources. As a result, none of the participants completed this task using FLAT (indicated with “—” in Table 5), given the 5-minute time limit used in this task.
Discussion
The user evaluation showed that the hierarchical approaches can be efficient (i.e., require short time in solving tasks) and effective (i.e., have lower error rate) in several cases. In more detail, the HETree approaches performed very well on indicating specific values over a dataset, and given the appropriate parameter setting are marginally affected by the dataset size. Also note that due to the “vertical-based” exploration, the position (e.g., towards the end) of the requested value in the dataset does not affect the efficiency of the approach. Furthermore, it is shown that the hierarchical approaches can efficiently and effectively handle visual exploration tasks that involve large numbers of objects.
At the end of the evaluation, the participants gave us valuable feedback on possible improvements of our tool. Most of the participants criticized several aspects of the interface, since our tool is an early prototype. Also, several participants mentioned difficulties in keeping track of their “position” (e.g., which range of values is currently visualized, or which range was previously visualized) during the exploration. Finally, some participants mentioned that some hierarchies contained more levels than needed. As previously mentioned, the adopted parameters are not well suited for the evaluation, since hierarchies with a degree larger than 3 (and, as a result, fewer levels) are required.
Finally, additional tasks for demonstrating the capabilities of our model can be considered. However, most of these tasks were not selected in this evaluation, because it was not possible for the participants to perform them with the FLAT approach. An indicative set includes: (1) Find the number of resources (and/or statistics) in the 1st and 3rd quartile; (2) Find statistics (e.g., mean value, variance) for the top-10 or 50 resources; (3) Find the decade (i.e., temporal data) in which most events take place.
Visualization Systems Overview
Visualization Systems Overview
This section reviews works related to our approach on visualization and exploration in the Web of Data (WoD). Section 6.1 presents systems and techniques for WoD visualization and exploration, Section 6.2 discusses techniques for WoD statistical analysis, Section 6.3 presents hierarchical data visualization techniques, and finally, Section 6.4 discusses works on data structures & processing related to our HETree data structure.
In Table 7 we provide an overview of, and compare, several visualization systems that offer similar features to our SynopsViz. The WoD column indicates systems that target the Semantic Web and Linked Data area (i.e., RDF, RDF/S, OWL). The Hierarchical column indicates systems that provide hierarchical visualization of non-hierarchical data. The Statistics column captures the provision of statistics about the visualized data. The Recomm. column indicates systems which offer recommendation mechanisms for visualization settings (e.g., appropriate visualization type, visualization parameters, etc.). The Incr. column indicates systems that provide incremental visualizations. Finally, the Preferences column captures the ability of the users to apply data (e.g., aggregate) or visual (e.g., increase abstraction) operations.
Exploration & visualization in the Web of Data
A large number of works studying issues related to WoD visual exploration and analysis have been proposed in the literature [3,18,30,79]. In what follows, we classify these works into the following categories: (1) Generic visualization systems, (2) Domain, vocabulary & device-specific visualization systems, and (3) Graph-based visualization systems.
Generic visualization systems
In the context of WoD visual exploration, there is a large number of generic visualization frameworks that offer a wide range of visualization types and operations. Next, we outline the best-known systems in this category.
Rhizomer [21] provides WoD exploration based on an overview, zoom and filter workflow. Rhizomer offers various types of visualizations such as maps, timelines, treemaps and charts. VizBoard [109,110] is an information visualization workbench for WoD built on top of a mashup platform. VizBoard presents datasets in a dashboard-like, composite, and interactive visualization. Additionally, the system provides visualization recommendations. Payola [67] is a generic framework for WoD visualization and analysis. The framework offers a variety of domain-specific (e.g., public procurement) analysis plugins (i.e., analyzers), as well as several visualization techniques (e.g., graphs, tables, etc.). In addition, Payola offers collaborative features for users to create and share analyzers. In Payola the visualizations can be customized according to the ontologies used in the resulting data.
The Linked Data Visualization Model (LDVM) [20] provides an abstract visualization process for WoD datasets. LDVM enables the connection of different datasets with various kinds of visualizations in a dynamic way. The visualization process follows a four-stage workflow: Source data, Analytical abstraction, Visualization abstraction, and View. A prototype based on LDVM considers several visualization techniques, e.g., circle, sunburst, treemap, etc. Finally, the LDVM has been adopted in several use cases [68]. Vis Wizard [105] is a Web-based visualization system, which exploits data semantics to simplify the process of setting up visualizations. Vis Wizard is able to analyse multiple datasets using brushing and linking methods. Similarly, Linked Data Visualization Wizard (LDVizWiz) [6] provides a semi-automatic way for the production of possible visualizations for WoD datasets. In the same context, LinkDaViz [103] finds suitable visualizations for a given part of a dataset. The framework uses heuristic data analysis and a visualization model in order to facilitate automatic binding between data and visualization options.
Balloon Synopsis [91] provides a WoD visualizer based on HTML and JavaScript. It adopts a node-centric visualization approach in a tile design. Additionally, it supports automatic information enhancement of the local RDF data by accessing either remote SPARQL endpoints or performing federated queries over endpoints using the Balloon Fusion service. Balloon Synopsis offers customizable filters, namely ontology templates, for the users to handle and transform (e.g., filter, merge) input data. SemLens [51] is a visual system that combines scatter plots and semantic lenses, offering visual discovery of correlations and patterns in data. Objects are arranged in a scatter plot and are analysed using user-defined semantic lenses. LODeX [15] is a system that generates a representative summary of a WoD source. The system takes as input a SPARQL endpoint and generates a visual (graph-based) summary of the WoD source, accompanied by statistical and structural information of the source. LODWheel [99] is a Web-based visualization system which combines JavaScript libraries (e.g., MooWheel, JQPlot) in order to visualize RDF data in charts and graphs. Hide the stack [31] proposes an approach for visualizing WoD for mainstream end-users. Underlying Semantic Web technologies (e.g., RDF, SPARQL) are utilized, but are “hidden” from the end-users. Particularly, a template-based visualization approach is adopted, where the information for each resource is presented based on its rdf:type.
Domain, vocabulary & device-specific visualization systems
In this section, we present systems that target visualization needs for specific types of data and domains, RDF vocabularies or devices.
Several systems focus on visualizing and exploring geo-spatial data. Map4rdf [73] is a faceted browsing tool that enables RDF datasets to be visualized on an OSM or Google Map. Facete [97] is an exploration and visualization system for SPARQL-accessible data, offering faceted filtering functionalities. SexTant [81] and Spacetime [107] focus on visualizing and exploring time-evolving geo-spatial data. The LinkedGeoData Browser [96] is a faceted browser and editor which is developed in the context of the LinkedGeoData project. Finally, in the same context, DBpedia Atlas [106] offers exploration over the DBpedia dataset by exploiting the dataset’s spatial data. Furthermore, in the context of linked university data, VISUalization Playground (VISU) [4] is an interactive tool for specifying and creating visualizations using the contents of the linked university data cloud. Particularly, VISU offers a novel SPARQL interface for creating data visualizations. Query results from selected SPARQL endpoints are visualized with Google Charts.
A variety of systems target multidimensional WoD modelled with the Data Cube vocabulary. CubeViz [37,90] is a faceted browser for exploring statistical data. The system provides data visualizations using different types of charts (i.e., line, bar, column, area and pie). The Payola Data Cube Vocabulary [52] adopts the LDVM stages [20] in order to visualize RDF data described by the Data Cube vocabulary. The same types of charts as in CubeViz are provided in this system. The OpenCube Toolkit [59] offers several systems related to statistical WoD. For example, OpenCube Browser explores RDF data cubes by presenting a two-dimensional table. Additionally, the OpenCube Map View offers interactive map-based visualizations of RDF data cubes based on their geo-spatial dimension. The Linked Data Cubes Explorer (LDCE) [63] allows users to explore and analyse statistical datasets. Finally, [84] offers several map and chart visualizations of demographic, social and statistical linked cube data.
Regarding device-specific systems, DBpedia Mobile [14] is a location-aware mobile application for exploring and visualizing DBpedia resources. Who’s Who [23] is an application for exploring and visualizing information focusing on several issues that appear in the mobile environment. For example, the application considers the usability and data processing challenges related to the small display size and limited resources of the mobile devices.
A large number of systems visualize WoD datasets adopting a graph-based (a.k.a., node-link) approach. RelFinder [50] is a Web-based tool that offers interactive discovery and visualization of relationships (i.e., connections) between selected WoD resources. Fenfire [48] and Lodlive [22] are exploratory systems that allow users to browse WoD using interactive graphs. Starting from a given URI, the user can explore WoD by following the links. IsaViz [86] allows users to zoom and navigate over the RDF graph, and also offers several “edit” operations (e.g., delete/add/rename nodes and edges). In the same context, graphVizdb [16,17] is built on top of spatial and database techniques, offering interactive visualization over very large (RDF) graphs. A different approach has been adopted in [100], where sampling techniques have been exploited. Finally, ZoomRDF [116] employs a space-optimized visualization algorithm in order to increase the number of resources which are displayed.
Discussion
In contrast to the aforementioned approaches, our work does not focus solely on proposing techniques for WoD visualization. Instead, we introduce a generic model for organizing, exploring and analysing numeric and temporal data in a multilevel fashion. The underlying model is not bound to any specific type of visualization (e.g., chart); rather it can be adopted by several “flat” techniques and offer multilevel visualizations over non-hierarchical data. Also, we present a prototype system that employs the introduced hierarchical model and offers efficient multilevel visual exploration over WoD datasets, using charts and timelines.
Statistical analysis in the Web of Data
A second area related to the analysis features of the proposed model deals with WoD statistical analysis. RDFStats [71] calculates statistical information about RDF datasets. LODstats [9] is an extensible framework, offering scalable statistical analysis of WoD datasets. RapidMiner LOD Extension [83,87] is an extension of the data mining platform RapidMiner, offering sophisticated data analysis operations over WoD. SparqlR is a package for the R statistical analysis platform. SparqlR executes SPARQL queries over SPARQL endpoints and provides statistical analysis and visualization over SPARQL results. Finally, ViCoMap [88] combines WoD statistical analysis and visualization in a Web-based tool, which offers correlation analysis and data visualization on maps.
In comparison with these systems, our work does not focus on new techniques for WoD statistics computation and analysis. We are primarily interested in enhancing the visualization and user exploration functionality by providing statistical properties of the visualized datasets and objects, making use of existing computation techniques. Also, we demonstrate how, in the proposed structure, computations can be efficiently performed on-the-fly and enrich our hierarchical model. The presence of statistics provides quantifiable overviews of the underlying WoD resources at each exploration step. This is particularly important in tasks that involve exploring a large number of numeric or temporal data objects. Users can examine the characteristics of the next levels at a glance, and are thus not forced to drill down into lower hierarchy levels. Finally, the statistics over the different hierarchy levels enable analysis over different granularity levels.
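To illustrate this kind of on-the-fly computation, the following sketch shows how per-node statistics can be aggregated bottom-up in a hierarchy: a parent node's statistics are derived from its children's, without revisiting the raw data objects. This is a simplified illustration (the function name and the dictionary representation are ours, and the paper's model also covers further statistics such as variance):

```python
def merge_stats(children):
    """Combine per-child statistics (count, mean, min, max) into the
    parent's statistics without touching the raw data objects."""
    total = sum(c["count"] for c in children)
    # The parent's mean is the count-weighted mean of the children's means.
    mean = sum(c["count"] * c["mean"] for c in children) / total
    return {"count": total,
            "mean": mean,
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}
```

Applying this function level by level, from the leaves up to the root, yields statistics for every node in a single bottom-up pass.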
Hierarchical visual exploration
The wider area of data and information visualization has provided a variety of approaches for hierarchical analysis and presentation.
Treemaps [93] visualize tree structures using a space-filling layout algorithm based on recursive subdivision of space. Rectangles are used to represent tree nodes; the size of each node is proportional to the cumulative size of its descendant nodes. Finally, a large number of treemap variations have been proposed (e.g., Cushion Treemaps, Squarified Treemaps, Ordered Treemaps, etc.).
Moreover, hierarchical visualization techniques have been extensively employed to visualize very large graphs using the node-link paradigm. In these techniques the graph is recursively decomposed into smaller sub-graphs that form a hierarchy of abstraction layers. In most cases, the hierarchy is constructed by exploiting clustering and partitioning methods [1,7,11,74,89,104]. In other works, the hierarchy is defined with hub-based [75] and density-based [117] techniques. GrouseFlocks [5] supports ad-hoc hierarchies which are manually defined by the users. Finally, there are also edge bundling techniques that merge graph edges into bundles. The edges are often aggregated based on clustering techniques [38,41,85], a mesh [28,70], or explicitly by a hierarchy [53].
In the context of data warehousing and online analytical processing (OLAP), several approaches provide hierarchical visual exploration, by exploiting the predefined hierarchies in the dimension space. [78] proposes a class of OLAP-aware hierarchical visual layouts; similarly, [102] uses OLAP-based hierarchical stacked bars. Polaris [98] offers visual exploratory analysis of data warehouses with rich hierarchical structure.
Several hierarchical techniques have been proposed in the context of ontology visualization and exploration [34,40,46,72]. CropCircles [111] adopts a hierarchical geometric containment approach, representing the class hierarchy as a set of concentric circles. Knoocks [69] combines containment-based and node-link approaches. In this work, ontologies are visualized as nested blocks where each block is depicted as a rectangle containing a sub-branch shown as a treemap. A different approach is followed by OntoTrix [10], which combines graphs with adjacency matrices.
Finally, in the context of hierarchical navigation, [64] organizes query results using the MeSH concept hierarchy. In [24] a hierarchical structure is dynamically constructed to categorize numeric and categorical query results. Similarly, [26] constructs personalized hierarchies by considering diverse user preferences.
Discussion
In contrast to the above approaches, which target graph-based or hierarchically-organized data, our work focuses on handling arbitrary numeric and temporal data, without requiring it to be described by a hierarchical schema. As examples of hierarchically-organized data, consider class hierarchies, or multidimensional data organized in multilevel hierarchical dimensions (e.g., in the OLAP context, temporal data is hierarchically organized based on years, months, etc.). In contrast to the aforementioned approaches, our work dynamically constructs the hierarchies from raw numeric and temporal data. Thus, the proposed model can be combined with “flat” visualization techniques (e.g., chart, timeline) in order to provide multilevel visualizations over non-hierarchical data. In that sense, our approach can be considered more flexible compared to techniques that rely on predefined hierarchies, as it can enable exploratory functionality on dynamically retrieved datasets, by (incrementally) constructing hierarchies on-the-fly, and by allowing users to modify these hierarchies.
Data structures & data processing
In this section we present the data structures and the data (pre-)processing techniques which are the most relevant to our approach.
R-Tree [45] is a disk-based multi-dimensional indexing structure, which has been widely used to efficiently handle spatial queries. R-Tree adopts the notion of minimum bounding rectangles (MBRs) in order to hierarchically organize multi-dimensional objects.
Data discretization [33,42] is a process in which continuous attributes are transformed into discrete ones. A large number of methods (e.g., supervised, unsupervised, univariate, multivariate) for data discretization have been proposed. Binning is a simple unsupervised discretization method in which a predefined number of bins is created. Widely known binning methods are equal-width and equal-frequency binning. In the equal-width approach, the range of an attribute is divided into intervals of equal width, and each interval represents a bin. In the equal-frequency approach, an equal number of values is placed in each bin.
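The two binning methods can be sketched as follows (a minimal illustration; the function names are ours, and corner cases such as an all-equal value range are ignored):

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value falls into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Place (roughly) the same number of values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]
```

Note that on skewed data the two methods behave very differently: equal-width bins may be nearly empty, while equal-frequency bins may span very wide value ranges.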
By recursively applying discretization techniques, a hierarchical discretization of an attribute's values can be produced (a.k.a. concept/generalization hierarchies). In [92] a dynamic programming algorithm for generating numeric concept hierarchies is proposed. The algorithm attempts to maximize both the similarity between the objects stored in the same hierarchy node, as well as the dissimilarity between the objects stored in different nodes. The generated hierarchy is a balanced tree in which different nodes may have different numbers of children. Similarly, [47] constructs hierarchies based on the data distribution. Essentially, both the leaf and the internal nodes are created in such a way that an even distribution is achieved. The hierarchy construction also considers a threshold specifying the maximum number of distinct values enclosed by nodes in each hierarchy level. Finally, binary concept hierarchies (with degree equal to two) are generated in [27]. Starting from the whole dataset, this method performs a recursive binary partitioning over the dataset's values; the recursion is terminated when the number of distinct values in the resulting partitions is less than a pre-specified threshold.
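The recursive binary partitioning of [27] can be sketched as follows. This is a simplified illustration under our reading of the method; the function name and the nested-list representation of the hierarchy are ours:

```python
def binary_hierarchy(values, threshold):
    """Recursively split a multiset of values in two until a partition
    contains fewer distinct values than the threshold.  Each node is
    either a leaf (a sorted list of values) or a pair of child nodes."""
    distinct = sorted(set(values))
    if len(distinct) < threshold:
        return sorted(values)               # leaf: few enough distinct values
    cut = distinct[len(distinct) // 2]      # split point over the value domain
    left = [v for v in values if v < cut]
    right = [v for v in values if v >= cut]
    return [binary_hierarchy(left, threshold),
            binary_hierarchy(right, threshold)]
```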
Using the data objects from our running example (Fig. 1), Fig. 12 shows the hierarchies generated by the aforementioned approaches. Figure 12(a) presents the hierarchy resulting from [27] and Fig. 12(b) depicts the result using the method from [47]. The parameters in each method are set so that the resulting hierarchies are as similar as possible to our hierarchies (Figs 2 and 3). Hence, the threshold in (a) is set to 3, and in (b) it is set to 2.
Discussion
The basic concepts of the HETree structure can be considered similar to a simplified version of a static 1D R-Tree. However, in order to provide efficient query processing in a disk-based environment, R-Tree considers a large number of I/O-related issues (e.g., space coverage, node overlaps, fill guarantees, etc.). On the other hand, we introduce a lightweight, main-memory structure that is efficiently constructed on-the-fly. Also, the proposed structure aims at organizing the data in a practical manner for a (visual) exploration scenario, rather than for disk-based indexing and querying efficiency.
Compared to discretization techniques, our tree model exhibits several similarities: the HETree-C version can be considered a hierarchical version of equal-frequency binning, and the HETree-R version a hierarchical version of equal-width binning. However, the goal of data organization in HETree is to enable visualization and hierarchical exploration capabilities over dynamically retrieved non-hierarchical data. Hence, compared to the binning methods, we note the following basic differences. First, in contrast with binning methods that require the user to specify some parameters (e.g., the number/size of the bins, the number of distinct values in each bin, etc.), our approach is able to automatically estimate the hierarchy parameters and adjust the visualization results by considering the characteristics of the visualization environment. Second, in hierarchical approaches the user is not always allowed to specify the hierarchy characteristics (e.g., degree). For example, the hierarchies in [27] always have degree equal to two (Fig. 12(a)), while in [47] the nodes have varying degrees (Fig. 12(b)). On the other hand, in our approach the hierarchy characteristics can be specified precisely. In addition, when no specific hierarchy characteristics are requested, our approach generates perfect trees (Section 2.5), offering a “uniform” hierarchy structure. Third, the computational complexity of some of the hierarchical approaches (e.g., [92]) is prohibitive (i.e., at least cubic) for using them in practice, especially in settings where the hierarchies have to be constructed on-the-fly. Fourth, the proposed tree structure is exploited in order to allow efficient statistics computations over different groups of data; the statistics are then used to enhance the overall exploration functionality.
Finally, the construction of the model is tailored to user interaction and preferences; our model offers incremental construction driven by user interaction, as well as efficient adaptation to user preferences.
Conclusions
In this paper we have presented HETree, a generic model that combines personalized multilevel exploration with online analysis of numeric and temporal data. Our model is built on top of a lightweight tree-based structure, which can be efficiently constructed on-the-fly for a given set of data. We have presented two variations for constructing our model: the HETree-C structure organizes input data into fixed-size groups, whereas the HETree-R structure organizes input data into fixed-range groups. In that way, the users can customize the exploration experience, organizing the data in different ways by parameterizing the number of groups, the range and cardinality of their contents, the number of hierarchy levels, and so on. We have also provided a way for efficiently computing statistics over the tree, as well as a method for automatically deriving from the input dataset the best-fit parameters for the construction of the model. Regarding the performance of multilevel exploration over large datasets, our model offers incremental HETree construction and prefetching, as well as efficient HETree adaptation based on user preferences. Based on the introduced model, a Web-based prototype system, called SynopsViz, has been developed. Finally, the efficiency and the effectiveness of the presented approach are demonstrated via a thorough performance evaluation and an empirical user study.
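The distinction between the two variations can be made concrete at the leaf level (a minimal sketch; the helper names are ours, and in the full model the upper levels are formed by recursively grouping nodes according to the tree degree):

```python
def hetree_c_leaves(values, group_size):
    """HETree-C: fixed-size groups.  Sort the values, then chunk them
    into leaves of `group_size` objects each (the hierarchical analogue
    of equal-frequency binning)."""
    ordered = sorted(values)
    return [ordered[i:i + group_size]
            for i in range(0, len(ordered), group_size)]

def hetree_r_leaves(values, num_groups):
    """HETree-R: fixed-range groups.  Divide the value domain into
    `num_groups` equal-width intervals (the hierarchical analogue of
    equal-width binning)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_groups
    leaves = [[] for _ in range(num_groups)]
    for v in sorted(values):
        # The maximum value falls into the last leaf.
        idx = min(int((v - lo) / width), num_groups - 1)
        leaves[idx].append(v)
    return leaves
```

With HETree-C every leaf holds the same number of objects but covers a varying value range, while with HETree-R every leaf covers the same range but may hold a varying (possibly zero) number of objects.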
Some insights for future work include the support of more sophisticated methods for data organization, in order to effectively handle skewed data distributions and outliers. Particularly, we are currently working on hybrid HETree versions that integrate concepts from both the HETree-C and HETree-R versions. For example, a hybrid HETree-C considers a threshold on the maximum range of a group; similarly, a threshold on the maximum number of objects in a group is considered in the hybrid HETree-R version. Regarding the SynopsViz tool, we are planning to redesign and extend the graphical user interface so that our tool can use data resulting from SPARQL endpoints, as well as offer more sophisticated filtering techniques (e.g., SPARQL-enabled browsing over the data). Finally, we are interested in including more visual techniques and libraries.
Acknowledgements
We would like to thank the editors and the three reviewers for their hard work in reviewing our article; their comments helped us to significantly improve our work. Further, we thank Giorgos Giannopoulos and Marios Meimaris for many helpful comments on earlier versions of this article. This work was partially supported by the EU/Greece funded KRIPIS: MEDA Project and the EU project “SlideWiki” (688095).
Incremental HETree construction
ICO algorithm
Incremental HETree construction analysis
In this section, we analyse in detail the worst case of the ICO algorithm, i.e., the case in which the construction cost is maximized.
