
Abstract

Ghosh, Kamara, and Tamassia (GKT) (ASIA CCS 2021) proposed a graph encryption scheme supporting shortest path queries. This work presents a query recovery attack against the scheme when the adversary is given the original graph and the leakage of certain subsets of queries. The attack falls within the security model used by GKT, and is the first targeting schemes supporting shortest path queries. The attack uses classical graph algorithms to compute the canonical names of the single-destination shortest path spanning trees of the underlying graph and uses these canonical names to precompute the set of candidate queries that match each response. When all shortest path queries to a single node have been observed, the canonical names for the corresponding query tree are computed, and the responses are matched to the candidate queries from the offline phase. The output is guaranteed to contain the correct query. For a graph on $n$ vertices, the attack runs in time $O (n^{3})$ and matches the time complexity of the GKT scheme’s setup. The attack’s practicality is demonstrated through an implementation and evaluation on the real-world datasets used in the original paper and on random graphs.

Keywords

searchable encryption, leakage-abuse attacks, cryptanalysis

1. Introduction

Graphs are a powerful tool that can be used to model many problems related to social networks, biological networks, geographic relationships, etc. Plaintext graph database systems have already received much attention in both industry (e.g. Amazon Neptune,¹ Facebook TAO,² Neo4j,³ and GraphDB⁴) and academia (e.g. Pregel,⁵ GraphLab,⁶ and Trinity⁷).

With the rise of data storage outsourcing, there is an increased interest in graph encryption schemes (GESs). A GES enables a client to encrypt a graph, outsource the storage of the encrypted graph to an untrusted server, and later make certain types of graph queries to the server. Current GESs typically support only one type of query, for example, adjacency queries,⁸ neighbor queries,⁸ approximate shortest distance queries,⁹ or exact shortest path queries.¹⁰,¹¹

This article extends the work presented by Falzon and Paterson¹² and takes a closer look at the security of the GES of Ghosh, Kamara, and Tamassia (GKT) from ASIA CCS 2021.¹⁰ This scheme will henceforth be referred to as the GKT scheme. The GKT scheme encrypts a graph $G$ such that when a shortest path query $(u, v)$ is issued for some vertices $u$ and $v$ of $G$ , the server returns information allowing the client to quickly recover the shortest path between $u$ and $v$ in $G$ . The scheme precomputes a matrix called the SP-matrix from which shortest paths can be efficiently computed, and then creates an encrypted version of this matrix, which we refer to as the encrypted database ( $E D B$ ). $E D B$ is sent to the server. At query time, the client computes a search token for the query $(u, v)$ ; this token is sent to the server and is used to start a sequence of look-ups to $E D B$ . Each look-up results in a new token and a ciphertext encrypting the next vertex on the shortest path from $u$ to $v$ . The concatenation of these ciphertexts is returned to the client, and decrypting this sequence reveals the vertices in the shortest path.

The GKT scheme of Ghosh et al.¹⁰ is very elegant and efficient. For a graph on $n$ vertices, computing the SP-matrix takes time $O (n^{3})$ and dominates the setup time. Building a search token involves computing a pseudo-random function. Processing a query $(u, v)$ at the server requires $t$ look-ups in $E D B$ , where $t$ is the length of the shortest path from $u$ to $v$ . Importantly, thanks to the design of the scheme, query processing can be done without interaction with the client, except to receive the initial search token and to return the result. This results in $E D B$ revealing—at query time—the sequence of labels (tokens) needed for the recursive look-up and the sequence of (encrypted) vertices that is eventually returned to the client.

Ghosh et al.¹⁰ provide a security proof of the GKT scheme in a simulation-based security model that assumes an honest-but-curious (semi-honest) server. The approach identifies a leakage profile for the GKT scheme and formally proves that the scheme leaks nothing more than this. The leakage profile comes in two parts: setup leakage (available to the server upon receipt of the encrypted data structure) and query leakage (that becomes available to the server as it processes each query). Specifically, the query leakage leaks when two queries are equal, that is, the query pattern, the length of the queried path, and how two paths with the same destination intersect.

This work exploits the query leakage of the GKT scheme to mount a query recovery (QR) attack against the scheme. This attack can be mounted by the honest-but-curious server and requires knowledge of the graph $G$. This may appear to be a strong requirement, but it is, in fact, weaker than is permitted in the security model of Ghosh et al.,¹⁰ where the adversary can even choose $G$. Assuming that the graph $G$ is public is a standard assumption for many schemes that support private graph queries.¹⁰,¹³,¹⁴ There are many settings in which the graph may be known but the edge weights are private, or in which one only wishes to protect the privacy of the client’s queries. This model is ideal for routing and navigation systems in which the road network may easily be obtained online via Google Maps or Waze, but the client may wish to keep its queries private. In such a scenario, the map and traffic information are widely available, but the routing information of individual users is sensitive.

The attack has two phases. First, it has an offline, preprocessing phase that is carried out on the graph $G$. In this phase, a plaintext description of all the shortest path trees in $G$ is extracted. These trees are then processed, and the candidate queries for each issued query are computed using each tree’s canonical name. A canonical name is an encoding of a graph that can be used to decide if two graphs are isomorphic; a canonical name of a rooted tree can be computed efficiently using the Aho, Hopcroft, and Ullman (AHU) algorithm.¹⁵ This concludes the offline phase of the attack. Its time complexity is $O (n^{3})$, where $n$ is the number of vertices in $G$, and matches the run time of our overall attack and the run time of the GKT scheme’s setup. Both the attack and the setup run time are lower bounded by the time to compute the all-pairs shortest paths (APSP), which takes $O (n^{3})$ time for general graphs.¹⁶

The second phase of the attack is online: as queries are issued, the adversary constructs a second set of trees that correspond to the sequence of labels computed by the server when processing each query, that is, the per-query leakage of the scheme. That leakage is uniquely determined by the search token that initiates the look-up. This description uses the labels of $E D B$ (which are search tokens) as vertices; two labels are connected if the first points to the second in $E D B$ . When an entire tree has been constructed, the adversary can then run the AHU algorithm again to compute the canonical names associated with this query tree. An entire query tree $Q$ can be built when all queries to a particular destination have been issued. In practice, this is a realistic routing scenario where many trips may share a common popular destination (e.g. an airport, school, or distribution center).

By correctness of the scheme, there exists a collection of isomorphisms mapping $Q$ to at least one tree computed in the offline phase. Such isomorphisms also map shortest paths to shortest paths. A matching between paths in the trees from the online phase to the trees in the offline phase is thus performed. This can be done efficiently using a novel extension of the AHU algorithm,¹⁵ which decides when one path can be mapped to another by an isomorphism of trees. This yields two look-up tables which, when composed, map each path in the first set of trees to a set of candidate paths in the second set. The search token of the queries associated with $Q$ is then used to look-up the possible candidate queries in the tables computed in the online phase. These candidate queries are then returned. The run time of this phase is $O (n^{'} \cdot n^{2}),$ where $n^{'} \leq n$ is the number of complete query trees computed in the online phase. The output is guaranteed to contain the correct query.

In general, the leakage from a query can be consistent with many candidates, and the correct candidate cannot be uniquely determined. Graph theoretically, this is because there can be many isomorphisms between pairs of trees in the two sets. In the chosen graph setting, it is easy to construct a graph $G$ where, given any query tree $Q$ of $G$ , its isomorphism is uniquely determined and there is a unique candidate for each query of $Q$ , that is, one can achieve what is called full QR (FQR). For such graphs, the GKT scheme offers almost no protection to queries. In other cases, the query leakage may result in one or only a few possible query candidates, which may be damaging in practice. In order to explore the effectiveness of the attack, the theoretical results are supported with experiments on eight real-world graphs (six of which were used in Ghosh et al.¹⁰) and on random graphs with varying graph sizes and edge probabilities. The results show that for the real-world graphs, as many as 21.9% of all queries can be uniquely recovered, and as many as half of all queries can be mapped to at most three candidate queries. The experimental results show that QR tends to result in smaller sets of candidate queries when the graphs are less dense, and that dense graphs tend to have more symmetries and hence result in larger sets of candidate queries. Note also that the attack is the best possible: it always outputs a minimal set of candidates consistent with the query leakage, and the correct query is always included in the set.

This work extends that of Falzon and Paterson¹² along numerous dimensions, including formal proofs of their claims (Section 4), supporting examples (Section 4), additional experiments on random graphs that better demonstrate the relationship between attack success and graph density (Section 5), and a new variation of the attack from the perspective of a network adversary (Appendix A).

The contributions can be summarized as follows:

(1) This work better formalizes the attack in Falzon and Paterson¹² against a GES that supports shortest path queries. This attack works in a passive-persistent server-side adversarial setting. A novel extension of the attack in a network-adversarial setting is also presented.
(2) This work leverages the GKT scheme’s leakage to mount an efficient QR attack against the scheme. In particular, for the real-world datasets used, the set of all query trees can be recovered with as few as 68.1% of the queries.
(3) This work makes use of the classical AHU algorithm for the graph isomorphism problem for rooted trees. A new algorithm for deciding when a path in one tree can be mapped onto a path in another tree under an isomorphism is presented.
(4) This work also reports on an implementation of the attack in Python and a thorough evaluation against real-world datasets and random graphs.

Looking ahead toward building new schemes, it is important to better understand how leakage can be exploited. This attack demonstrates that leaking the topology of subtrees of the encrypted graph is detrimental, and that a noninteractive scheme that relies on chaining search tokens may leak too much information to the server. This is true in part because many problems that are not known to be polynomial-time solvable on general graphs can be solved in polynomial time on trees (e.g. the graph isomorphism problem vs. the tree isomorphism problem). The characterization of the GKT scheme’s leakage may thus help inform the construction of more secure GESs.

1.1. Prior and related work

This section describes prior and related work concerning GESs and leakage-abuse attacks on structured encryption.

1.1.1. Graph encryption

Chase and Kamara⁸ present the first GES that supports both adjacency queries and focused subgraph queries. Poh et al.¹⁷ give a scheme for encrypting conceptual graphs. Meng et al.⁹ present three schemes that support approximate shortest path queries on encrypted graphs, each with a slightly different leakage profile. To reduce storage overhead, their solution leverages sketch-based oracles that select seed vertices and store the exact shortest distance from all vertices to the seeds; these distances are then used to estimate shortest paths between any two vertices in the graph. Ghosh et al.¹⁰ and Wang et al.¹¹ present schemes that support exact shortest path queries on encrypted graphs.

Other solutions for privacy-preserving graph structures use other techniques such as secure multiparty computation and private information retrieval (e.g. Wu et al.¹⁸ and Lai et al.¹⁹) and differential privacy (e.g. Sala et al.²⁰). These approaches have different security goals from EDB schemes that are built on symmetric encryption.

1.1.2. Attacks

While many schemes fitting the EDB paradigm have been developed, the security that these schemes offer is not yet fully understood. Security analysis is often done by developing attacks that reconstruct either queries or the database from the leakage functions. Leakage analysis of searchable symmetric encryption (SSE) schemes has been studied in a number of settings including both active²¹–²³ and passive²¹,²²,²⁴–²⁶ adversarial settings. Recently, a number of works have analyzed the leakage stemming from schemes that support more complex query types such as range queries on one-attribute²⁷–³⁴ and multiattribute³⁵,³⁶ data, and $k$-nearest neighbor queries.³⁷ The leakage of GESs was first analyzed by Goetschmann.³⁸ The author considers schemes that support approximate shortest path queries using sketch-based distance oracles (e.g. Meng et al.⁹), presents two methods for estimating distances between nodes, and gives a QR attack that aims to recover the vertices in an encrypted query; the experimental evaluation demonstrates that, with auxiliary knowledge of some queries, the adversary can distinguish which of the candidate vertices was queried. This work also presents a QR attack, but uses knowledge of the graph $G$ rather than partial knowledge of some queries.

2. Preliminaries

2.1. Notation

For an integer $n$, let $[n] = \{1, 2, \dots, n\}$. The concatenation of two strings $a$ and $b$ is denoted as $a \| b$.

A dictionary $D$ is a map from some label space $L$ to a value space $V$ ; $D [lab] = val$ indicates that $lab \mapsto val$ . A multimap $M$ is a generalization of a dictionary that maps labels to sets of values.

2.2. Graphs

A graph is a pair $G = (V, E)$ consisting of a vertex set $V$ of size $n$ and an edge set $E$ of size $m$ . A graph is directed if the edges specify a direction from one vertex to another. Two vertices $u, v \in V$ are connected if there exists a path from $u$ to $v$ in $G$ . In this paper, it is assumed that all graphs $G$ are connected for simplicity. However, the attack and its constituent algorithms directly apply to multicomponent graphs.

A tree is a connected, acyclic graph. A rooted tree $T = (V, E, r)$ is a tree in which one vertex $r$ has been designated the root. For some rooted tree $T = (V, E, r)$ and vertex $v \in V$ , $T [v]$ denotes the subtree of $T$ induced by $v$ and all its descendants.

Given a graph $G = (V, E)$ and some vertex $v \in V$, a single-destination shortest path (SDSP) tree for $v$ is a directed spanning tree $T$ such that $T$ is a subgraph of $G$, $v$ is the only sink in $T$, and each path from $u \in V \setminus \{v\}$ to $v$ in $T$ is a shortest path from $u$ to $v$ in $G$. An example of an SDSP tree can be found in Figure 1(b).

Figure 1.

(a) Original graph $G$, (b) its corresponding single-destination shortest path (SDSP) tree for vertex $1$ in $G$ with the canonical names labeling all the vertices of the tree, and (c) the matching query tree that is leaked during setup (without any vertex labels).

This work makes use of two binary operations on graphs. Given two graphs $G = (V, E)$ and $H = (V', E')$, the union of $G$ and $H$ is defined as $G \cup H = (V \cup V', E \cup E')$. Given a graph $G = (V, E)$ and a subgraph $H = (V', E')$ such that $V' \subseteq V$ and $E' \subseteq E$, the graph subtraction of $H$ from $G$ is defined as $G \setminus H = (V \setminus V', E \setminus E')$.

2.3. Hash functions

A set $H$ of functions $U \to [M]$ is a universal hash function family if, for every distinct $x, y \in U$ , the hash function family $H$ satisfies the following constraint:

$$\Pr_{h \leftarrow H} [h (x) = h (y)] \leq 1 / M .$$

In Section 5, the universal hash function is instantiated using a cryptographic hash function.

2.4. Graph isomorphisms

The attack makes heavy use of graph isomorphisms and automorphisms. In particular, because the GKT scheme leaks the topology of spanning subtrees of the original graph $G$, recovery is information-theoretically possible only up to graphs with the same topology.

Definition 2.1
An isomorphism of graphs $G_{1} = (V_{1}, E_{1})$ and $G_{2} = (V_{2}, E_{2})$ is a bijection between vertex sets $φ : V_{1} \to V_{2}$ such that for all $u, v \in V_{1}, (u, v) \in E_{1}$ if and only if $(φ (u), φ (v)) \in E_{2} .$ This can be succinctly denoted as $G_{1} ≅ G_{2}$ .
Definition 2.2
An isomorphism of rooted trees $T_{1} = (V_{1}, E_{1}, r_{1})$ and $T_{2} = (V_{2}, E_{2}, r_{2})$ is an isomorphism $φ$ from $T_{1}$ to $T_{2}$ (as graphs) such that $φ (r_{1}) = r_{2}$.

2.5. Canonical names

A canonical name $Name(\cdot)$ is an encoding mapping graphs to bit-strings such that, for any two graphs $H$ and $G$, $Name(G) = Name(H)$ if and only if $G ≅ H$. For rooted trees, AHU¹⁵ describe an algorithm for computing a specific canonical name in $O (n)$ time. We refer to this as the canonical name and describe it next.

This work makes use of a modified AHU algorithm, denoted as $ComputeNames$, to compute the canonical names of rooted trees (and their subtrees) and determine if they are isomorphic. $ComputeNames$ takes as input a rooted tree $T = (V, E, r)$, a vertex $v \in V$, and an empty dictionary $Names$. It outputs the canonical name of the subtree $T [v]$ (which is also referred to as the canonical name of $v$) and a dictionary $Names$ that maps each descendant $u$ of $v$ to the canonical name of $T [u]$. The algorithm proceeds from the leaves to the root. It assigns the name “$10$” to all leaves of the tree. It then recursively visits each descendant $u$ of $v$ and assigns $u$ a name by sorting the names of its children in increasing lexicographic order, concatenating them into an intermediate name $children\_names$, and assigning the name “$1 \| children\_names \| 0$” to $u$ (see Figure 1(b) for an example). The canonical name of $T$, $Name(T)$, is the name assigned to the root $r$ by this algorithm.

$ComputeNames$ takes time and space $O (n^{2})$ where $| V | = n$ . Note that the original AHU algorithm can be modified to run in $O (n)$ time by only considering one level at a time and reassigning integers to the vertices at that level.¹⁵ In contrast, the attack must assign names to each vertex in the tree in order to later compute the path names, so we are forced to make use of the $O (n^{2})$ version.

The pseudocode of $ComputeNames$ can be found in Algorithm 1.
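The naming procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' Algorithm 1: the function name `compute_names`, the representation of a rooted tree as a dictionary from each vertex to its list of children, and the example tree (the SDSP tree for vertex $1$ as described for Figure 1(b)) are assumptions made for the example.

```python
from typing import Dict, Hashable, List

# Assumed representation: each vertex maps to the list of its children;
# vertices without children (leaves) may be omitted from the dictionary.
Tree = Dict[Hashable, List[Hashable]]


def compute_names(children: Tree, v: Hashable, names: Dict[Hashable, str]) -> str:
    """Return the canonical name of the subtree rooted at v.

    As a side effect, `names` is filled with the canonical name of the
    subtree rooted at v and at each of its descendants.
    Recursive for clarity; very deep trees would need an iterative version.
    """
    # Name every child first, then sort the child names lexicographically.
    child_names = sorted(compute_names(children, c, names)
                         for c in children.get(v, []))
    # A leaf has no children and therefore receives the name "10".
    names[v] = "1" + "".join(child_names) + "0"
    return names[v]


# Tree in which 2, 3, 5, and 6 point to the root 1, and 4 points to 5.
tree = {1: [2, 3, 5, 6], 5: [4]}
names: Dict[Hashable, str] = {}
root_name = compute_names(tree, 1, names)

# A relabeled copy of the same tree receives the same canonical name.
relabeled = {"r": ["a", "b", "c", "d"], "a": ["e"]}
relabeled_root_name = compute_names(relabeled, "r", {})
```

Because a name is stored for every vertex and a name can be as long as twice the size of its subtree, this version uses quadratic time and space in the worst case, in line with the $O (n^{2})$ bound stated above.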

3. The GKT GES

This section gives an overview of the GES of Ghosh et al.¹⁰ and its leakage.

3.1. GKT scheme overview

The GKT scheme supports single pair shortest path (SPSP) queries. The graphs may be directed or undirected, and the edges may be weighted or unweighted. An SPSP query on a graph $G = (V, E)$ takes as input a pair of vertices $(u, v) \in V \times V$, and outputs a path $p_{u, v} = (u, w_{1}, \dots, w_{t - 1}, v)$ such that $(u, w_{1}), (w_{1}, w_{2}), \dots, (w_{t - 1}, v) \in E$. This path must be of minimal length in $G$, that is, there does not exist a sequence of edges $(u, w'_{1}), (w'_{1}, w'_{2}), \dots, (w'_{t' - 1}, v) \in E$ such that $t' < t$.

SPSP queries may be answered using a number of different data structures. The GKT scheme makes use of the SP-matrix.³⁹ For a graph $G = (V, E)$ , the SP-matrix $M$ is a $| V | \times | V |$ matrix defined as follows. Entry $M [i, j]$ stores the second vertex along the shortest path from vertex $v_{i}$ to $v_{j}$ ; if no such path exists, then it stores $⊥$ . An SPSP query $(v_{i}, v_{j})$ is answered by computing $M [i, j] = v_{k}$ to obtain the next vertex along the path and then recursing on $(v_{k}, v_{j})$ until $⊥$ is returned.
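As an illustration of the SP-matrix, the following sketch builds a next-hop table with the Floyd-Warshall algorithm and answers an SPSP query by repeated look-ups. It is not taken from the GKT implementation: the function names, the dictionary-based matrix, the use of `None` for $⊥$, the assumption of non-negative edge weights, and the example graph (a hypothetical 6-vertex graph, not necessarily the graph of Figure 1(a)) are all choices made for the example.

```python
from math import inf
from typing import Dict, Hashable, List, Optional, Tuple

Vertex = Hashable
SPMatrix = Dict[Tuple[Vertex, Vertex], Optional[Vertex]]


def sp_matrix(vertices: List[Vertex],
              weights: Dict[Tuple[Vertex, Vertex], float]) -> SPMatrix:
    """Floyd-Warshall with next-hop tracking; assumes non-negative weights.

    M[u, v] is the second vertex on a shortest path from u to v, or None
    (standing in for the symbol ⊥) if u == v or v is unreachable from u.
    """
    dist = {(u, v): (0 if u == v else inf) for u in vertices for v in vertices}
    M: SPMatrix = {(u, v): None for u in vertices for v in vertices}
    for (u, v), w in weights.items():
        if u != v and w < dist[u, v]:
            dist[u, v], M[u, v] = w, v
    for k in vertices:
        for u in vertices:
            for v in vertices:
                if dist[u, k] + dist[k, v] < dist[u, v]:
                    dist[u, v] = dist[u, k] + dist[k, v]
                    M[u, v] = M[u, k]  # first hop toward k is first hop toward v
    return M


def spsp_query(M: SPMatrix, u: Vertex, v: Vertex) -> List[Vertex]:
    """Follow next hops from u toward v until None (⊥) is returned."""
    path = [u]
    while M[u, v] is not None:
        u = M[u, v]
        path.append(u)
    return path  # equals [u] alone when u == v or when v is unreachable


# Hypothetical undirected, unit-weight graph on vertices 1..6.
undirected = [(1, 2), (1, 3), (1, 5), (1, 6), (4, 5), (2, 3)]
weights = {}
for a, b in undirected:
    weights[a, b] = weights[b, a] = 1
M = sp_matrix([1, 2, 3, 4, 5, 6], weights)
path_4_to_1 = spsp_query(M, 4, 1)
```

The triple loop makes the $O (n^{3})$ cost of building the SP-matrix explicit, while a query only touches as many entries as there are vertices on the returned path.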

At a high level, the GKT scheme proceeds by computing an SP-matrix for the query graph and then using this matrix to compute a dictionary ${SPDX}^{'}$ . This dictionary is then encrypted using a dictionary encryption scheme (DES) such as Cash et al.⁴⁰ and Chase and Kamara.⁸ To ensure that the GKT scheme is noninteractive, the underlying DES must be response-revealing. Since it is germane to this work, the syntax of a DES is described next.

Definition 3.1
A DES is a tuple of four algorithms $DES = (DES.Gen, DES.Encrypt, DES.Token, DES.Get)$ with the following syntax:

$DES.Gen$ is probabilistic and takes as input a security parameter $λ$, and outputs a secret key $sk$.

$DES.Encrypt$ takes as input a secret key $sk$ and dictionary $D$, and outputs an encrypted dictionary ($ED$).

$DES.Token$ takes as input a secret key $sk$ and a label $lab$, and outputs a search token $tk$.

$DES.Get$ takes as input a search token $tk$ and an $ED$, and returns a plaintext value $val$.

Correctness for a $DES$ states that for all dictionaries $D$, for all keys $sk$ output by $DES.Gen$, and for all pairs $(lab, val)$ in $D$, executing $DES.Get$ on input $tk = DES.Token (sk, lab)$ and dictionary $ED = DES.Encrypt (sk, D)$ results in output $val$.

Note that while the GKT scheme itself is response-hiding (i.e. the shortest path is not returned in plaintext to the client), the underlying DES used in the scheme is response-revealing, that is, the values in its $E D$ are revealed at query time. The response-revealing property of the DES is necessary to enable the GKT scheme to operate in a noninteractive manner.

A more detailed description of the GKT scheme will now be provided. At setup, the client generates two secret keys: one for a symmetric encryption scheme $SKE$, and one for a $DES$. It takes the input graph $G$ and computes the SP-matrix $M$. It then computes a dictionary $SPDX$ such that for each pair of vertices $(v_{i}, v_{j}) \in V \times V$, it sets $SPDX [(v_{i}, v_{j})] = (w, v_{j})$ if $i \neq j$ and $M [i, j] = w$ for some vertex $w$.

The client then computes a second dictionary ${SPDX}^{'}$ as follows. For each label-value pair $(lab, val)$ in $SPDX$ , the following steps are carried out. A search token $tk$ is computed from $val$ using algorithm $D E S . T o k e n$ and a ciphertext $c$ is computed by encrypting $val$ using $S K E . E n c r y p t$ . Then ${SPDX}^{'} [lab]$ is set to $(tk, c)$ .

The resulting dictionary ${SPDX}^{'}$ is then encrypted using $D E S . E n c r y p t$ to produce an output $E D B$ , which is given to the server.

Now the client can issue an SPSP query for a vertex pair $(u, v)$ by generating a search token $tk$ for $(u, v)$ and sending it to the server. The server initializes an empty string $resp$ and uses $tk$ to search $EDB$ and obtain a response $a$. If $a = ⊥$, then it returns $resp$. Otherwise, it parses $a$ as $({tk}^{'}, c)$, updates $resp = resp \| c$, and recurses on ${tk}^{'}$ until $⊥$ is reached on look-up. The server returns $resp$, a concatenation of ciphertexts (or $⊥$), to the client. The client then uses its secret key to decrypt $resp$, obtaining a sequence of pairs $val = (w_{k}, v)$ from which the shortest path from $u$ to $v$ can be constructed.

Complexity. The GKT scheme’s setup takes time $O (n^{3})$ and is dominated by the cost of computing the SP-matrix. Token generation takes time $O (1)$ (assuming use of an efficient DES) and querying $EDB$ takes time $O (t)$, where $t$ is the maximum length of a shortest path in $G$. The server storage is $O (n^{2})$.

3.2. Leakage of the GKT scheme

Ghosh et al.¹⁰ provide a formal specification of their scheme’s leakage. Informally, the setup leakage of their scheme is the number of vertex pairs in $G$ that are connected by a path, while the query leakage consists of the query pattern (which pairs of queries are equal), the path intersection pattern (the overlap between pairs of shortest paths seen in queries), and the lengths of the shortest paths arising in queries. See Section 4.1 in Ghosh et al.¹⁰ for more details.

In the GKT scheme, the server receives $EDB$, which the client produces by encrypting the underlying dictionary ${SPDX}^{'}$ using a DES. The labels of ${SPDX}^{'}$ are of the form $lab = (v_{i}, v_{j})$ and its values are of the form $val = (tk, c)$. Here $tk$ is a search token obtained by running $DES.Token$ on a pair $(w, v_{j})$ and $c$ is obtained by running $SKE.Encrypt$ also on $(w, v_{j})$. Since $EDB$ is obtained by running $DES.Encrypt$ on ${SPDX}^{'}$, the labels in $EDB$ are derived from tokens obtained by running $DES.Token$ on inputs $lab = (v_{i}, v_{j})$. Moreover, these tokens also appear in the values in $EDB$ that are revealed to the server at query time, that is, in the entries $(tk, c)$.

In turn, the query leakage reveals to the server the token used to initiate a search, as well as all the subsequent pairs $(tk, c)$ that are obtained by recursively processing such a query. Let us denote the sequence of search tokens associated with the processing of some (unknown) query $q$ for a shortest path of length $t$ as $s = {tk}_{1} \| {tk}_{2} \| \dots \| {tk}_{t + 1} \in \{0, 1\}^{*}$. This string is referred to as the token sequence of $q$. Since the search tokens correspond to the sequence of vertices in the queried path, there are as many tokens in the sequence as there are vertices in the shortest path. By correctness of the $DES$ used in the construction of $EDB$, no two distinct queries can result in the same token sequence (in fact, no two distinct queries can produce the same first token ${tk}_{1}$, since each such first token must be used to derive a unique label in $EDB$ identifying the beginning of a specific shortest path).

Notice also that token sequences for different queries can be overlapping; indeed, since the tokens are computed by running $DES.Token$ on inputs $lab = (v_{i}, v)$, where $v$ is the final vertex of a shortest path, two token sequences are overlapping if and only if they correspond to queries (and shortest paths) having the same end vertex. Hence, given the query leakage of a set of queries, the adversary can compute all the token sequences and construct from them $n' \leq n$ directed trees, $\{Q_{i}\}_{i \in [n']}$, each tree having at most $n$ vertices and a single root vertex. The vertices across all $n'$ trees are labeled with the search tokens in $EDB$ and there is a directed edge from $tk$ to ${tk}^{'}$ if and only if $tk$ and ${tk}^{'}$ are adjacent in some token sequence. (Each tree has at most $n$ vertices because of our assumption about $G$ being connected.)

This set of trees is called the query trees. Each query tree corresponds to the set of queries having the same end vertex. Each tree has a single sink (root) that corresponds to a unique vertex $v \in V$. The tree paths correspond to the shortest paths from vertices $w \in V \setminus \{v\}$ to $v$, such that $w$ and $v$ are connected in $G$. Note that Ghosh et al.¹⁰ also discuss these trees, but they do not analyze the theoretical limits of what can be inferred from them.
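The construction of query trees from observed token sequences can be sketched as follows. The function name `build_query_trees`, the parent-map representation, and the short strings standing in for search tokens are assumptions made for the example; real tokens would be opaque byte strings.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

Token = Hashable


def build_query_trees(
    token_sequences: Sequence[Sequence[Token]],
) -> Dict[Token, Dict[Token, Token]]:
    """Group observed token sequences into query trees.

    Returns a dictionary mapping each root token to a parent map
    {token: next token toward the root} for the tree with that root.
    """
    # Adjacent tokens in a sequence give a directed edge tk -> tk'.
    parent: Dict[Token, Token] = {}
    for seq in token_sequences:
        for tk, nxt in zip(seq, seq[1:]):
            parent[tk] = nxt

    def root_of(tk: Token) -> Token:
        # Follow edges until a token with no outgoing edge (the sink).
        while tk in parent:
            tk = parent[tk]
        return tk

    trees: Dict[Token, Dict[Token, Token]] = defaultdict(dict)
    for tk, nxt in parent.items():
        trees[root_of(tk)][tk] = nxt
    return dict(trees)


# Placeholder tokens: three observed queries, two distinct destinations.
observed: List[List[str]] = [
    ["t41", "t51", "t11"],   # a path of length 2
    ["t21", "t11"],          # shares its last token with the first sequence
    ["t32", "t22"],          # a different destination, hence a different tree
]
trees = build_query_trees(observed)
```

Under the connectivity assumption, a query tree is complete once its parent map contains $n - 1$ entries, that is, once every vertex other than the destination appears in it.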

The leakage of the GKT scheme on a graph $G$ after issuing a set of SPSP queries $Q$ is denoted as $L (G, Q)$ . For a formal proof of security that establishes the leakage profile of the GKT scheme, please refer to Ghosh et al.¹⁰ The attacks in this work are based only on the leakage of the scheme, as established above, and not on breaking the underlying cryptographic primitives of the scheme.

3.3. Implications of leakage

Suppose that all queries have been issued and that we have constructed all $n$ query trees $\{Q_{i}\}_{i \in [n]}$, each tree having $n$ vertices. Observe that there exists a one-to-one matching between the query trees $\{Q_{i}\}_{i \in [n]}$ and the SDSP trees $\{T_{v}\}_{v \in V}$ of $G$ such that each matched pair of trees is isomorphic. The reason is that the query trees are just differently labeled versions of the SDSP trees; in turn, this stems from the fact that paths in the query trees are in 1–1 correspondence with the shortest paths in $G$.

This now reveals the core of the QR attack, developed in detail in Section 4 below. The server with access to $G$ first computes all the SDSP trees offline. As queries are issued, it then constructs the query trees one path at a time. Once a complete query tree $Q$ is computed (recall that each query tree must have $n$ vertices since $G$ is connected), the server finds all possible isomorphisms between $Q$ and the SDSP trees. Then, for each token sequence in $Q$ , it computes the set of paths in the SDSP trees to which that token sequence can be mapped under the possible isomorphisms. This set of paths yields the set of possible queries to which the token sequence can correspond. This information is stored in a pair of dictionaries, which can be used to look-up the candidate queries.

To illustrate the core attack idea, Figure 1 depicts (a) a graph $G$, (b) its SDSP tree for vertex $1$ (with vertex labels and canonical names), and (c) the matching query tree (without vertex labels). The leakage from the unique shortest path of length 2 in Figure 1(c) can only be mapped to the corresponding path with edges $(4, 5)$, $(5, 1)$ in Figure 1(b) under isomorphisms, and similarly the shortest path of length 1 that is a subpath of that path of length 2 can only be mapped to path $(5, 1)$. On the other hand, the three remaining paths of length 1 can be mapped under isomorphisms to any of the length 1 paths $(2, 1)$, $(3, 1)$, or $(6, 1)$ and so cannot be uniquely recovered.
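The example can be reproduced with a short sketch. It encodes each path by the tuple of canonical names of the subtrees met along the path from its start vertex to the root; this tuple encoding is an assumption made for the illustration and is not necessarily the path name used by the attack. Two paths can be mapped onto each other by an isomorphism of rooted trees exactly when these tuples agree. The token labels `"r"`, `"a"`, ..., `"e"` are placeholders for the unlabeled vertices of the query tree.

```python
from collections import defaultdict


def compute_names(children, v, names):
    """AHU-style canonical names for the subtree rooted at v."""
    kids = sorted(compute_names(children, c, names) for c in children.get(v, []))
    names[v] = "1" + "".join(kids) + "0"
    return names[v]


def path_names(children, root):
    """Map each non-root vertex u to the tuple of canonical names of the
    subtrees rooted at the vertices on the path from u to the root."""
    names = {}
    compute_names(children, root, names)
    result = {}
    stack = [(root, (names[root],))]
    while stack:
        v, suffix = stack.pop()
        for c in children.get(v, []):
            result[c] = (names[c],) + suffix
            stack.append((c, result[c]))
    return result


# Offline: SDSP tree for vertex 1 (2, 3, 5, 6 point to 1; 4 points to 5).
sdsp = {1: [2, 3, 5, 6], 5: [4]}
queries_by_path_name = defaultdict(set)
for u, pn in path_names(sdsp, 1).items():
    queries_by_path_name[pn].add((u, 1))

# Online: query tree with the same shape but opaque token labels.
query_tree = {"r": ["a", "b", "c", "d"], "c": ["e"]}
candidates = {tk: queries_by_path_name.get(pn, set())
              for tk, pn in path_names(query_tree, "r").items()}
```

Here `candidates["e"]` contains only the query $(4, 1)$ and `candidates["c"]` only $(5, 1)$, whereas the tokens `"a"`, `"b"`, and `"d"` each map to the three candidates $(2, 1)$, $(3, 1)$, and $(6, 1)$. Because every tuple ends with the canonical name of the whole tree, the same look-up table can hold the paths of all SDSP trees at once.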

Since the adversary only learns the query trees and token sequences from the leakage, the degree of QR that can be achieved based on that leakage is limited. In particular, without auxiliary information, the adversary can only recover the candidate queries up to symmetries arising from the isomorphisms between the query trees and the SDSP trees. In practice, this is often not an issue since many queries result in only a very small number of candidate queries (see Section 5 for more details).

4. Query recovery (QR)

4.1. Threat model and assumptions

This work considers a passive, persistent, honest-but-curious adversary that has compromised the server and can observe the initial search token issued, all subsequent search tokens revealed during the query processing, and the response. In particular, the adversary could be the server itself. Appendix A outlines a modified version of the attack in which the adversary is assumed to have only compromised the communication channels between the client and server; this adversary can thus only see the search tokens used to initiate the recursive look-up and the server responses.

The adversary is assumed to know the graph $G$ that has been encrypted to create $E D B$ . As noted previously, this is a strong assumption, but it fits within the security model used in Ghosh et al.¹⁰ (where $G$ can even be chosen) and is realistic in many routing/navigation scenarios. It is further assumed that the adversary sees enough queries to construct a subset of the $n$ query trees. Computing all $n$ trees does not require observing all possible queries; in the real-world datasets tested, it was possible to construct all query trees with as few as $68.1 %$ of the possible queries. This is because constructing a query tree that corresponds to $T_{v}$ only requires observing the queries that start at the leaf nodes of $T_{v}$ and end at $v$ . In SDSP trees with few leaves, only a small fraction of queries is needed.
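The observation about leaves can be illustrated with a minimal sketch. It assumes a hypothetical representation in which a rooted tree is a child-to-parent dictionary (the root maps to `None`) and uses the SDSP tree for vertex 1 described for Figure 1(b); the function names are illustrative and do not come from the original implementation.

```python
def leaf_starts(parent):
    """Return the leaves of a rooted tree given as a child -> parent map."""
    internal = {p for p in parent.values() if p is not None}
    return sorted(v for v in parent if v not in internal)

def covered_by(parent, starts):
    """Vertices lying on the tree paths from the given start vertices to the root."""
    seen = set()
    for v in starts:
        while v is not None and v not in seen:
            seen.add(v)
            v = parent[v]
    return seen

# SDSP tree for vertex 1 with edges (4,5), (5,1), (2,1), (3,1), (6,1).
T1 = {1: None, 2: 1, 3: 1, 6: 1, 5: 1, 4: 5}

leaves = leaf_starts(T1)
# Share of the n - 1 queries ending at the root that suffice to rebuild the tree.
fraction = len(leaves) / (len(T1) - 1)
```

In this toy tree, the queries starting at the four leaves already traverse every vertex, so four of the five queries to vertex 1 are enough to rebuild the query tree.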

It is assumed that the APSP algorithm used in constructing the SP-matrix from $G$ during setup is deterministic. Moreover, it is assumed that this algorithm is known to the adversary. Such an assumption is reasonable as the adversary knows $G$ and many shortest path algorithms are deterministic, including Floyd-Warshall¹⁶ and many of its adaptations.

4.2. Formalizing QR attacks

QR in general is the goal of determining the plaintext value of queries that have been issued by the client. The notion of QR was introduced by Islam et al.²⁴ in the context of leakage-abuse attacks on SSE schemes and has been extensively studied in the context of SSE and related schemes since.

This work studies the problem of QR in the context of GESs, specifically, the GKT scheme: given $G$ , the setup leakage of the GKT scheme and the query leakage from a set of SPSP queries, the adversary’s goal is to match the leakage for each SPSP query with the corresponding start and end vertices $(u, v)$ of a path in $G$ . As noted above, there may be a number of candidate queries that can be assigned to the leakage from each query. The adversary’s goals are formally described below.

Definition 4.1
(Consistency) Let $G = (V, E)$ be a graph, $Q = {q_{1}, \dots, q_{k}}$ be the set of SPSP queries that are issued, and $S = {s_{1}, s_{2}, \dots, s_{k}}$ be the set of token sequences of the queries issued. An assignment $π : S \to V \times V$ is a mapping from token sequences to SPSP queries. An assignment $π$ is said to be consistent with the leakage $L (G, Q)$ if it satisfies $L (G, Q) = L (G, π (S))$ .

Informally, consistency requires that, for each $s_{i} \in S$ , the query $π (s_{i})$ specified by assignment $π$ could feasibly result in the observed leakage $L (G, Q)$ .
Definition 4.2
(QR) Let $G = (V, E)$ be a graph, $Q = {q_{1}, \dots, q_{k}}$ be a set of SPSP queries, and $S$ the corresponding set of token sequences. Let $Π$ be the set of all assignments consistent with $L (G, Q)$ . The adversary achieves QR when it computes and outputs a mapping: $s \mapsto {π (s) : π \in Π}$ for all $s \in S$ .

The adversary achieves QR if, for each $s \in S$ (a set of token sequences resulting from queries in $Q$ ), it outputs a set of query candidates ${π (s) : π \in Π}$ containing every query that is consistent with the leakage. Note that this implies that the output always contains the correct query (and possibly more). This is the best the adversary can do, given the available leakage.

There is some information not conveyed in this mapping. In particular, by fixing an assignment for a given token sequence, it may be possible to fix or reduce the possible assignments for other query responses. Such an example is given below.
Example 4.3
Suppose one observes the set of token sequences ${s_{i} : i \in [5]}$ such that $s_{1}, s_{2}, s_{3}, s_{4}$ correspond to paths of length 1 and $s_{5}$ corresponds to a path of length 2, with $s_{4}$ a subsequence of $s_{5}$ , and which allows one to construct the query tree in Figure 1(c). Further, suppose that the resulting query tree is not isomorphic to any other query tree. Thus, it is possible to infer that all queries in $S$ are rooted at 1. An adversary achieving QR must output the following mappings:
$\begin{aligned} { & s_{1} : {(6, 1), (3, 1), (2, 1)}, s_{2} : {(6, 1), (3, 1), (2, 1)}, \\ s_{3} : {(6, 1), (3, 1), (2, 1)}, s_{4} : {(5, 1)}, s_{5} : {(4, 1)}} . \end{aligned}$
However, if the adversary could fix the assignment of $s_{1}$ to $(6, 1)$ (e.g. by using auxiliary information), then $s_{2}$ could only be mapped to either $(3, 1)$ or $(2, 1)$ .
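The example can be replayed with a short sketch that enumerates the consistent assignments explicitly. It assumes, as in the example, that the three length-1 leaf queries are interchangeable while $s_{4}$ and $s_{5}$ are determined, and that distinct token sequences correspond to distinct queries; the variable names are illustrative only.

```python
from itertools import permutations

leaf_sequences = ["s1", "s2", "s3"]
leaf_queries = [(6, 1), (3, 1), (2, 1)]   # queries written as (start, end)
fixed = {"s4": (5, 1), "s5": (4, 1)}

# Each consistent assignment pairs the interchangeable leaves bijectively.
assignments = [
    {**dict(zip(leaf_sequences, perm)), **fixed}
    for perm in permutations(leaf_queries)
]

def candidates(assignments, s):
    """The set {pi(s) : pi in Pi} from Definition 4.2."""
    return {pi[s] for pi in assignments}

qr_output = {s: candidates(assignments, s) for s in assignments[0]}

# Auxiliary information pins s1 to (6, 1); the candidates for s2 shrink.
pinned = [pi for pi in assignments if pi["s1"] == (6, 1)]
s2_after_pinning = candidates(pinned, "s2")
```

The per-sequence sets in `qr_output` are exactly what QR outputs, while the list `assignments` carries the extra joint information that the mapping alone does not convey.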

A special type of QR when there exists only one assignment consistent with the query leakage is now defined, that is, the case when all queries can be uniquely recovered.
Definition 4.4
(FQR) Let $G = (V, E)$ be a graph, $Q = {q_{1}, \dots, q_{k}}$ be a set of SPSP queries, and $S$ the corresponding set of token sequences. Let $Π$ be the set of assignments consistent with $L (G, Q)$ . An adversary is said to achieve FQR when it (a) achieves QR and (b) $| Π | = 1$ .

That is, there is a unique assignment of token sequences to queries consistent with the leakage. Whether FQR is always possible (i.e. for every possible set of queries $Q$ ) depends on the graph $G$ . Specifically, FQR is always possible if and only if the SDSP trees arising in $G$ are pairwise nonisomorphic and every path in each SDSP tree is fixed by all automorphisms of the tree. It is easy to construct graphs for which these conditions hold (see Section 4.10). For such graphs, the QR attack always achieves FQR.
4.3. Technical results

This section develops some technical results concerning isomorphisms of trees and the behavior of paths under those isomorphisms that will be needed in the remainder of the paper.

For any rooted tree $T = (V, E, r)$ and any $u \in V$ , let $T [u] \subseteq T$ denote the subtree induced by $u$ and all its descendants in $T$ .

Lemma 4.5
Let $T = (V, E, r)$ and $T^{'} = (V^{'}, E^{'}, r^{'})$ be rooted trees. Let $p_{u, r} = (u, w_{1}, \dots, w_{t}, r)$ and $p_{v, r^{'}} = (v, w_{1}^{'}, \dots, w_{ℓ}^{'}, r^{'})$ be paths in $T$ and $T^{'}$ , respectively. If there exists an isomorphism $φ : T \to T^{'}$ such that $φ (u) = v$ , then $t = ℓ$ and $φ (w_{i}) = w_{i}^{'}$ for all $i \in [t]$ .
Proof.
By assumption $φ (u) = v$ and by definition of isomorphism of rooted trees $φ (r) = r^{'}$ . Since $T$ is a tree, there exists a unique path between $u$ and $r$ , and between $v$ and $r^{'}$ . Isomorphisms of graphs must be edge preserving, and so $φ$ must map the subgraph $p_{u, r}$ to $p_{v, r^{'}}$ . These two paths can only be isomorphic if they are the same length and thus $t = ℓ$ . Putting together these two facts implies that
$(φ (u), φ (w_{1})) = (v, w_{1}^{'}), (φ (w_{1}), φ (w_{2})) = (w_{1}^{'}, w_{2}^{'}), \dots, (φ (w_{t}), φ (r)) = (w_{t}^{'}, r^{'})$
which concludes the proof.

Given a rooted tree $T = (V, E, r)$ and any $u \in V$ , let ${P a t h N a m e}_{T} (u)$ denote the concatenation of the canonical names of vertices along the path from $u$ to $r$ in $T$ , separated by semicolons:
$\begin{aligned} {P a t h N a m e}_{T} (u) = N a m e (T [u]) ‖ ``;'' ‖ N a m e ( & T [w_{1}]) ‖ ``;'' ‖ \dots ‖ ``;'' ‖ N a m e (T [w_{t}]) ‖ ``;'' ‖ N a m e (T [r]) . \end{aligned}$
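As an illustration of this definition, the sketch below computes unhashed path names for the SDSP tree of Figure 1(b). It assumes an AHU-style encoding for canonical names (a leaf is `10`, and an internal vertex wraps the sorted names of its children in `1…0`); the paper's Algorithm 1 may differ in its exact encoding, and the child-to-parent dictionary is a hypothetical representation.

```python
from collections import defaultdict

def canonical_names(parent):
    """Map each vertex v to an AHU-style name of T[v]: '1' + sorted child names + '0'."""
    kids = defaultdict(list)
    for v, p in parent.items():
        if p is not None:
            kids[p].append(v)
    root = next(v for v, p in parent.items() if p is None)
    names = {}

    def visit(v):
        names[v] = "1" + "".join(sorted(visit(c) for c in kids[v])) + "0"
        return names[v]

    visit(root)
    return names

def path_name(parent, names, u):
    """Join the names of T[u], T[w_1], ..., T[r] along the path from u to the root."""
    parts = []
    while u is not None:
        parts.append(names[u])
        u = parent[u]
    return ";".join(parts)

# SDSP tree for vertex 1 with edges (4,5), (5,1), (2,1), (3,1), (6,1).
T1 = {1: None, 2: 1, 3: 1, 6: 1, 5: 1, 4: 5}
names = canonical_names(T1)
path_names = {u: path_name(T1, names, u) for u in T1}
```

Vertices 2, 3, and 6 share a path name, whereas vertices 4 and 5 each have a distinct one, matching the discussion of Figure 1.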

Computing path names will form the core of the QR attack. A sequence of results about the relationship between path names and isomorphisms will now be proven. Section 4.5 explains how to apply a universal hash function to the path names to compress their length from $O (n^{2})$ to $O (\log n)$ bits, thereby reducing storage and run time complexity.
Proposition 4.6
Let $T = (V, E, r)$ and $T^{'} = (V^{'}, E^{'}, r^{'})$ be isomorphic rooted trees and let $C$ and $C^{'}$ denote the set of children of $r$ and $r^{'}$ , respectively. There is an isomorphism from $T$ to $T^{'}$ if and only if there is a perfect matching from $C$ to $C^{'}$ such that for each matched pair $c_{i} \in C, c_{i}^{'} \in C^{'}$ , there exists an isomorphism $φ_{i} : T [c_{i}] \to T [c_{i}^{'}]$ .
Proof.
To see the forwards direction, let $φ$ denote an isomorphism from $T$ to $T^{'}$ and note that if $φ (c) = c^{'}$ for $c \in C$ , then by the edge-preservation property of isomorphisms, $φ$ must map the vertices of $T [c]$ to the vertices of $T [c^{'}]$ , and thus $T [c] ≅ T [c^{'}]$ . For the backwards direction, construct an isomorphism $φ$ from $T$ to $T^{'}$ as follows. Let $φ_{r}$ be the trivial isomorphism that takes $r$ to $r^{'}$ and let
$φ = φ_{1} \cup φ_{2} \cup \dots \cup φ_{k} \cup φ_{r} .$
Let $(a, b) \in E$ . If $(a, b)$ is an edge in $T [c_{i}]$ for some $c_{i} \in C$ then it is easy to see that by restricting $φ$ to the vertices in $T [c_{i}]$ , we have that $(φ (a), φ (b))$ is an edge in $T^{'} [c_{i}^{'}] \subseteq T^{'}$ . If $(a, b) = (c_{i}, r)$ for some $c_{i} \in C$ , then $(φ (c_{i}), φ (r)) = (c_{i}^{'}, r^{'})$ . Since $c_{i}^{'}$ is a child of $r^{'}$ then $(φ (a), φ (b)) \in E^{'}$ . A similar argument holds for showing that if $(a, b) \in E^{'}$ , then $(φ^{- 1} (a), φ^{- 1} (b)) \in E$ .
Lemma 4.7
Let $T = (V, E, r)$ and $T^{'} = (V^{'}, E^{'}, r^{'})$ be isomorphic rooted trees. Let $u$ and $v$ be children of $r$ and $r^{'}$ , respectively. Suppose that $σ$ is an isomorphism from $T [u]$ to $T^{'} [v]$ . Then there exists an isomorphism $φ$ from $T$ to $T^{'}$ such that $φ |_{T [u]} = σ$ and $φ (u) = v$ .
Proof.
Let $C$ and $C^{'}$ denote the set of children of $r$ and $r^{'}$ , respectively. Since any isomorphism from $T$ to $T^{'}$ is edge preserving, it must map $C$ to $C^{'}$ , and we necessarily have that $k = | C | = | C^{'} |$ .

Proposition 4.6 can now be used to prove the lemma. Let $\hat{φ}$ be any isomorphism from $T$ to $T^{'}$ . If $\hat{φ}$ maps $u$ to $v$ then the lemma holds. Otherwise, $\hat{φ} (u) = c^{'} \neq v$ and ${\hat{φ}}^{- 1} (v) = c \neq u$ for some $c^{'} \in C^{'}$ and $c \in C$ . By Proposition 4.6, $T [u] ≅ T^{'} [c^{'}]$ and $T [c] ≅ T^{'} [v]$ , and by assumption $T [u] ≅ T^{'} [v]$ . Thus, by transitivity, it follows that $T [c] ≅ T^{'} [c^{'}]$ . Let $W$ be the vertices in $T ∖ (T [u] \cup T [c])$ and let $π$ be an isomorphism from $T [c]$ to $T^{'} [c^{'}]$ . Then $φ = \hat{φ} |_{W} \cup σ \cup π$ is a collection of isomorphisms on all the trees rooted at the children of the roots. Thus, $φ$ is an isomorphism from $T$ to $T^{'}$ that maps $u$ to $v$ .

The main technical result can now be introduced:
Theorem 4.8
Let $T = (V, E, r)$ and $T^{'} = (V^{'}, E^{'}, r^{'})$ be rooted trees and let $u \in V$ and $v \in V^{'}$ . There exists an isomorphism $φ : T \to T^{'}$ mapping $u$ to $v$ if and only if ${P a t h N a m e}_{T} (u) = {P a t h N a m e}_{T^{'}} (v)$ .
Proof.
The forward direction follows from Lemma 4.5.

For the backward direction, suppose that ${P a t h N a m e}_{T} (u) = {P a t h N a m e}_{T^{'}} (v)$ . Since a path name includes the canonical name of the entire tree, we deduce that $N a m e (T [r]) = N a m e (T^{'} [r^{'}])$ ; it follows that $T ≅ T^{'}$ . Similarly, one can deduce that $T [u] ≅ T^{'} [v]$ . More generally, let $p_{u, r} = (u, w_{1}, \dots, w_{t - 1}, r)$ and $p_{v, r^{'}} = (v, w_{1}^{'}, \dots, w_{t - 1}^{'}, r^{'})$ be paths in $T$ and $T^{'}$ , respectively. Then for all $i \in [t - 1]$ it must be that $N a m e (T [w_{i}]) = N a m e (T^{'} [w_{i}^{'}])$ .

The result will now be proven inductively on the vertices along the path from $u$ to $r$ . For the base case, take any isomorphism $φ_{0}$ from $T [u]$ to $T^{'} [v]$ and note that this must necessarily map $u$ to $v$ .

This reasoning can be extended level-by-level upwards, at each stage using the equalities of components of the two path names to extend the isomorphism. Suppose that for $k \leq t - 1$ there exists an isomorphism $φ_{k}$ from $T [w_{k}]$ to $T^{'} [w_{k}^{'}]$ such that $φ_{k} (u) = v$ . By equality of path names, we have that $T [w_{k + 1}] ≅ T^{'} [w_{k + 1}^{'}]$ . Note also that $w_{k}$ and $w_{k}^{'}$ are children of $w_{k + 1}$ and $w_{k + 1}^{'}$ , respectively. Applying Lemma 4.7, there exists an isomorphism $φ_{k + 1}$ from $T [w_{k + 1}]$ to $T^{'} [w_{k + 1}^{'}]$ such that $φ_{k + 1} |_{T [w_{k}]} = φ_{k}$ . Since $u$ is a vertex in $T [w_{k}]$ , it follows that $φ_{k + 1} (u) = φ_{k} (u) = v$ . This completes the induction and with it the proof.
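The statement of Theorem 4.8 can be checked by brute force on a small instance. The sketch below is only a sanity check under stated assumptions: trees are hypothetical child-to-parent dictionaries, canonical names use an AHU-style encoding, and isomorphisms are enumerated exhaustively, which is feasible only for very small trees.

```python
from collections import defaultdict
from itertools import permutations

def path_names(parent):
    """Unhashed path names: canonical names along the path to the root, joined by ';'."""
    kids = defaultdict(list)
    for v, p in parent.items():
        if p is not None:
            kids[p].append(v)
    root = next(v for v, p in parent.items() if p is None)
    names = {}

    def name(v):  # AHU-style encoding (assumed): '1' + sorted child names + '0'
        names[v] = "1" + "".join(sorted(name(c) for c in kids[v])) + "0"
        return names[v]

    name(root)
    out, stack = {root: names[root]}, [root]
    while stack:
        u = stack.pop()
        for v in kids[u]:
            out[v] = names[v] + ";" + out[u]
            stack.append(v)
    return out

def isomorphisms(parent_a, parent_b):
    """All bijections that preserve the parent relation (and hence the root)."""
    a_nodes, b_nodes = sorted(parent_a), sorted(parent_b)
    if len(a_nodes) != len(b_nodes):
        return
    for perm in permutations(b_nodes):
        phi = dict(zip(a_nodes, perm))
        if all(parent_b[phi[v]] == (phi[p] if p is not None else None)
               for v, p in parent_a.items()):
            yield phi

T = {1: None, 2: 1, 3: 1, 6: 1, 5: 1, 4: 5}                           # Figure 1(b)
T_prime = {"r": None, "a": "r", "b": "r", "c": "r", "d": "r", "e": "d"}  # relabeled copy

isos = list(isomorphisms(T, T_prime))
mapped = {(u, phi[u]) for phi in isos for u in T}
pn, pn_prime = path_names(T), path_names(T_prime)
same_name = {(u, v) for u in T for v in T_prime if pn[u] == pn_prime[v]}
```

On this instance the pairs $(u, v)$ related by some isomorphism coincide with the pairs having equal path names, as the theorem predicts.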

Theorem 4.8 also gives a method for identifying when there exists only a single isomorphism between two rooted trees. Suppose that $T = (V, E, r)$ and $T^{'} = (V^{'}, E^{'}, r^{'})$ are isomorphic rooted trees and that every vertex $v \in V$ has a distinct path name; then there exists exactly one isomorphism from $T$ to $T^{'}$ . Intuitively, a vertex in $T$ can only be mapped to a vertex in $T^{'}$ with the same path name. So if path names are unique, then each vertex in $T$ can only be mapped to a single vertex in $T^{'}$ , meaning there is only a single isomorphism available. The converse also holds: if there exists exactly one isomorphism from $T$ to $T^{'}$ , then every vertex $v \in V$ necessarily has a distinct path name. This observation will be useful in characterizing when query reconstruction results in FQR. We summarize with:
Corollary 4.9
Let $T = (V, E, r)$ and $T^{'} = (V^{'}, E^{'}, r^{'})$ be isomorphic rooted trees. Every vertex $v \in V$ has a unique path name in $T$ if and only if there exists a single isomorphism from $T$ to $T^{'}$ .
4.4. Overview of the QR attack

Our QR attack takes as input the graph $G$ , a set of token sequences corresponding to the set of issued queries, and comprises the following steps:

(1) Preprocess the graph offline (Algorithm 3). Compute the SDSP trees ${T_{v}}_{v \in V}$ of graph $G$ . Then construct a multimap $M$ such that $M$ maps each path name arising in the $T_{v}$ to the set of SPSP queries whose start vertices have the same path name.
(2) Compute the query trees online. Construct the query trees from the token sequences as the queries are issued.
(3) Process the query trees (Algorithm 4). Compute a dictionary $D$ that maps each token sequence to the path name of the start vertex of the path.
Note that steps (1) and (3) are trivially parallelizable. In the case that the APSP algorithm is randomized, the adversary can simply run the attack multiple times to account for different shortest path trees.

In practice, the attack can output a single large table $T$ matching token sequences $s$ to sets of queries. However, storing this large table will be more expensive than storing $D$ and $M$ when $G$ has high symmetry. Moreover, $D$ can be indexed by the first token $tk$ in each token sequence $s$ (since $tk$ uniquely determines the sequence).

The following subsections expand upon the steps in the above overview.
4.5. Computing the path names

Before diving into the attack, the algorithm for computing path names, which is used as a subroutine of the attack, is described. Algorithm 2 ( $ComputePathNames$ ) takes as input a rooted tree $T = (V, E, r)$ and outputs a dictionary mapping each vertex $v \in V$ to its path name. First, Algorithm 1 ( $ComputeNames$ ) is called on tree $T$ , its root $r$ , and an empty dictionary $N a m e s$ , to obtain a dictionary $N a m e s$ that maps each vertex $v \in V$ to the canonical name of subtree $T [v]$ .

A function $h$ drawn from a universal hash function family $H$ is used to compress the path names from $O (n^{2})$ to $O (\log n)$ . An empty dictionary $P a t h N a m e s$ is initialized and updated to include $P a t h N a m e s [r] = h (N a m e s [r])$ . $T$ is then traversed in a depth-first search manner; when a new vertex $v$ is discovered during traversal, $P a t h N a m e s [v]$ is set to the hash of the concatenation of the name of $v$ and the path name of its parent $u$ , that is,

$P a t h N a m e s [v] = h (N a m e s [v] ‖ P a t h N a m e s [u]) . \qquad (1)$

When all vertices have been explored, $P a t h N a m e s$ is returned. The pseudocode for $ComputePathNames$ can be found in Algorithm 2.
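The traversal just described can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 2: the child-to-parent dictionary and the AHU-style name encoding are assumptions, and SHA-256 truncated to 128 bits stands in for the universal hash function $h$ .

```python
import hashlib
from collections import defaultdict

def h(data):
    """Stand-in for the universal hash: SHA-256 truncated to 128 bits (32 hex chars)."""
    return hashlib.sha256(data.encode()).hexdigest()[:32]

def compute_path_names(parent):
    """Map every vertex to the hash of its path name via a stack-based DFS."""
    kids = defaultdict(list)
    for v, p in parent.items():
        if p is not None:
            kids[p].append(v)
    root = next(v for v, p in parent.items() if p is None)

    names = {}

    def name(v):  # AHU-style canonical name of T[v] (assumed encoding)
        names[v] = "1" + "".join(sorted(name(c) for c in kids[v])) + "0"
        return names[v]

    name(root)

    path_names = {root: h(names[root])}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in kids[u]:
            # The parent's hash has fixed length, so plain concatenation is unambiguous.
            path_names[v] = h(names[v] + path_names[u])
            stack.append(v)
    return path_names

# SDSP tree for vertex 1 with edges (4,5), (5,1), (2,1), (3,1), (6,1).
T1 = {1: None, 2: 1, 3: 1, 6: 1, 5: 1, 4: 5}
hashed = compute_path_names(T1)
```

Because each vertex stores only a fixed-length digest, the output has one short entry per vertex regardless of how long the unhashed path name would be.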

Theorem 4.10

Let $T = (V, E, r)$ be a rooted tree and $P a t h N a m e s$ be the output of running Algorithm 2 on $T$ . Let $H$ be a universal hash function family mapping ${0, 1}^{*} \to {0, 1}^{6 \log n}$ . Then, for randomly sampled $h \leftarrow H$ , the expected number of collisions in $P a t h N a m e s$ is at most $O (1 / n^{3})$ .

Proof.

Let $u, v \in V$ be distinct and let $(u, w_{1}, \dots, w_{k}, r = w_{k + 1})$ and $(v, w_{1}^{'}, \dots, w_{k^{'}}^{'}, r = w_{k^{'} + 1}^{'})$ be paths in $T$ . Thus

$P a t h N a m e s [u] = h (N a m e s [u] ‖ h (N a m e s [w_{1}] ‖ h (N a m e s [w_{2}] ‖ \dots)))$

and similarly for $P a t h N a m e s [v]$ . For there to be a collision between their path names, either (1) $N a m e s [u] ‖ P a t h N a m e s [w_{1}]$ and $N a m e s [v] ‖ P a t h N a m e s [w_{1}^{'}]$ collide, or (2) for some $i \in [\min \{k, k^{'}\}]$ , where $k, k^{'} \leq n - 2$ , $N a m e s [w_{i}] ‖ P a t h N a m e s [w_{i + 1}]$ and $N a m e s [w_{i}^{'}] ‖ P a t h N a m e s [w_{i + 1}^{'}]$ collide. Recall that a canonical name is unique up to isomorphism of the rooted tree.

Let $C_{u v}$ denote the event that the path names of $u$ and $v$ collide, and let $C_{u v}^{j}$ denote the event that the $j$ th nested hash of $u$ ’s and $v$ ’s path names collide. A collision on the path names occurs when any of the at most $n$ pairs of nested hash values (used to compute the path names of $u$ and $v$ ) collide. By definition of universal hash function, $E [C_{u v}^{j}] < 1 / n^{6}$ and thus by linearity of expectation,

$E [C_{u v}] = \sum_{j = 1}^{\min \{k + 1, k^{'} + 1\}} E [C_{u v}^{j}] < \frac{n}{n^{6}} = \frac{1}{n^{5}} .$

Let $C$ denote the event of any collision of path names in $T$ . Then, by linearity of expectation, the expected number of collisions is

$E [C] = \sum_{u} \sum_{v} E [C_{u v}] < \frac{n^{2}}{n^{5}} = \frac{1}{n^{3}} .$

Corollary 4.11

Let $G = (V, E)$ be a graph and let ${T_{r}}_{r \in V}$ be the set of SDSP trees of $G$ . Let $P a t h N a m e s$ be the union of the outputs of running Algorithm 2 on each tree in ${T_{r}}_{r \in V}$ . Let $H$ be a universal hash function family mapping ${0, 1}^{*} \to {0, 1}^{6 \log n}$ . Then for randomly sampled $h \leftarrow H$ , the expected number of collisions in $P a t h N a m e s$ is at most $O (1 / n)$ .
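
One way to see this bound, sketched along the lines of the counting in the proof of Theorem 4.10 (an illustration rather than the authors' own argument): the $n$ trees contain $n^{2}$ vertices in total, hence at most $n^{4}$ ordered pairs of path names, and each pair collides with probability less than $1 / n^{5}$ , so that

$E [C] < n^{4} \cdot \frac{1}{n^{5}} = \frac{1}{n} .$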

To achieve a smaller probability of collision, one can choose a hash function family $H$ whose output length is $c \log n$ , where $c > 6$ . For simplicity, the universal hash function is instantiated using SHA-256 truncated to 128 bits.

Lemma 4.12

Let $T = (V, E, r)$ be a rooted tree on $n$ vertices and $H$ be a universal hash function family mapping ${0, 1}^{*} \to {0, 1}^{6 \log n}$ . Upon input of $T$ , Algorithm 2 returns a dictionary of size $O (n \log n)$ mapping each $v \in V$ to a hash of its path name in time $O (n^{2})$ .

Proof.

Correctness follows easily from Theorem 4.10 and by a recursive argument.

Calling $ComputeNames$ (Algorithm 1) takes $O (n^{2})$ time. Reading the name of the root $r$ and assigning the hash of its name takes time at most $O (n)$ . Every node is pushed onto the stack once, and thus the while loop on line 12 iterates $n$ times. Assigning a new path name on line 16 takes time $O (n)$ since $N a m e s [v]$ is $O (n)$ bits, $P a t h N a m e s [u]$ is $O (\log n)$ bits, and computing the hash takes constant time. Pushing the children of a given vertex onto the stack takes time $O (n)$ for a total run time of $O (n^{2})$ . $P a t h N a m e s$ maps the vertices to the hash of their path names. Each vertex and its hashed path name can be encoded with $O (\log n)$ bits, yielding a dictionary of size $O (n \log n)$ .

4.6. Preprocess the graph

The original graph $G = (V, E)$ is first preprocessed into the $n$ SDSP trees. Since the adversary is assumed to have knowledge of $G$ , this step can be done offline. The same APSP algorithm used at setup is also used on $G$ to compute the $n$ SDSP trees ${T_{v}}_{v \in V}$ , where tree $T_{v}$ is rooted at vertex $v$ . For unweighted, undirected graphs, one can use breadth-first search for a total run time of $O (n^{2} + n m),$ where $m = | E |$ . For general weighted graphs, this step has a run time of $O (n^{3})$ .¹⁶
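For the unweighted, undirected case, the computation of the SDSP trees can be sketched with one breadth-first search per destination. The example graph below is hypothetical, and the tie-breaking rule (visiting neighbors in sorted order) is an assumption made so that the sketch is deterministic; a real attack would have to reproduce the tie-breaking of the APSP algorithm used at setup.

```python
from collections import deque

def sdsp_tree(adj, r):
    """BFS from destination r; parent[v] is the next hop on the chosen path from v to r."""
    parent = {r: None}
    queue = deque([r])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):  # deterministic tie-breaking (assumed)
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def all_sdsp_trees(adj):
    """One SDSP tree per destination vertex: O(n^2 + nm) time overall."""
    return {r: sdsp_tree(adj, r) for r in adj}

# Hypothetical connected graph on six vertices, given as an adjacency dictionary.
adj = {
    1: {2, 3, 5, 6},
    2: {1, 3},
    3: {1, 2},
    4: {5},
    5: {1, 4},
    6: {1},
}
trees = all_sdsp_trees(adj)
```

For this graph, the tree rooted at vertex 1 has the edges $(4, 5)$ , $(5, 1)$ , $(2, 1)$ , $(3, 1)$ , and $(6, 1)$ , the same shape as the tree discussed for Figure 1(b).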

Next, the path names of each vertex in ${T_{r}}_{r \in V}$ are computed and a multimap $M$ is constructed. $M$ maps the (hashed) path name of each vertex in ${T_{r}}_{r \in V}$ to the set of SPSP queries whose start vertices have the same path name. Theorem 4.8 is leveraged to construct this map as described below.

An empty multimap $M$ is initialized. For each $r \in V$ , $P a t h N a m e s$ is computed by running Algorithm 2 ( $ComputePathNames$ ) on tree $T_{r}$ . For each vertex $v$ in $T_{r}$ , $path\_name \leftarrow P a t h N a m e s [v]$ is computed and the label $path\_name$ is looked up in $M$ . If the label exists, then $M [path\_name] \leftarrow M [path\_name] \cup {(v, r)}$ . Otherwise $M [path\_name] \leftarrow {(v, r)}$ . The pseudocode for computing $M$ can be found in Algorithm 3.
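The construction of $M$ can be sketched as follows, assuming the SDSP trees are already available as hypothetical child-to-parent dictionaries. The helper `path_names` is a compact stand-in for Algorithm 2 (AHU-style names, SHA-256 truncated to 128 bits), and the three trees are those of the path graph 1 - 2 - 3, chosen purely for illustration.

```python
import hashlib
from collections import defaultdict

def h(data):
    """Stand-in for the universal hash: SHA-256 truncated to 128 bits."""
    return hashlib.sha256(data.encode()).hexdigest()[:32]

def path_names(parent):
    """Hashed path name of every vertex of a rooted tree (child -> parent map)."""
    kids = defaultdict(list)
    for v, p in parent.items():
        if p is not None:
            kids[p].append(v)
    root = next(v for v, p in parent.items() if p is None)
    names = {}

    def name(v):  # AHU-style canonical name (assumed encoding)
        names[v] = "1" + "".join(sorted(name(c) for c in kids[v])) + "0"
        return names[v]

    name(root)
    out, stack = {root: h(names[root])}, [root]
    while stack:
        u = stack.pop()
        for v in kids[u]:
            out[v] = h(names[v] + out[u])
            stack.append(v)
    return out

def preprocess(trees):
    """Multimap M: hashed path name -> set of queries (v, r) sharing that path name."""
    M = defaultdict(set)
    for r, parent in trees.items():
        for v, pn in path_names(parent).items():
            M[pn].add((v, r))
    return dict(M)

# SDSP trees of the path graph 1 - 2 - 3, indexed by destination.
trees = {
    1: {1: None, 2: 1, 3: 2},
    2: {2: None, 1: 2, 3: 2},
    3: {3: None, 2: 3, 1: 2},
}
M = preprocess(trees)
groups = {frozenset(s) for s in M.values()}
```

Here the trees rooted at 1 and 3 are isomorphic, so their queries are grouped in pairs, while the two leaf queries into vertex 2 share a single entry.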

Lemma 4.13
Let $G = (V, E)$ be a graph on $n$ vertices. Upon input of $G$ , Algorithm 3 returns a multimap of size $O (n^{2} \log n)$ mapping each (hashed) path name to the set of SPSP queries whose start vertices have that path name, in time $O (n^{3})$ .
Proof.
For each vertex $r \in V$ , a dictionary mapping each vertex in $T_{r}$ to its respective path names is computed. The correctness of path names follows from Lemma 4.12.

The run time is now analyzed. Computing the all-pairs shortest paths takes time $O (n^{3})$ . The for loop on line 5 iterates through $n$ vertices. For each vertex in $V$ , running Algorithm 2 ( $ComputePathNames$ ) takes $O (n^{2})$ time and the inner for loop on line 9 takes $O (n)$ time. Thus, the for loop on line 5 takes a total time of $O (n^{3})$ .

The multimap maps hashes of the path names to a list of candidate queries. The hashed path names have size $O (\log n)$ and there are at most $n^{2}$ distinct path names; each query corresponds to only one path name and is $O (\log n)$ bits long. The multimap thus has a total size $O (n^{2} \log n)$ .

4.7. Process the search tokens

The tokens revealed at query time must now be processed. Recall that the tokens are revealed such that the response to any shortest path query can be computed noninteractively. When a search token $tk$ is sent to the server, the server recursively looks up each of the encrypted vertices along the path. The adversary can thus compute the query trees using the search tokens revealed at query time. First, it initializes an empty graph $F$ .

As label–value pairs $(lab, val)$ are revealed in $E D B$ , the adversary parses ${tk}_{curr} \leftarrow lab$ and $({tk}_{next}, c) \leftarrow val$ , and adds $({tk}_{curr}, {tk}_{next})$ as a directed edge to $F$ . At any given time, $F$ will be a forest comprised of $n^{'} \leq n$ trees, ${Q_{i}}_{i \in [n^{'}]}$ , such that each $Q_{i}$ has at most $n$ nodes. Identifying the individual trees in the forest can be done in time $O (n^{2})$ . The adversary can compute the query trees online, and the final step of the attack can be run on any set of complete query trees. A complete query tree corresponds to the set of all queries to some fixed destination vertex. For ease of explanation, Algorithm 4 ( $QueryMapping$ ) takes as input the set of all complete query trees that have been constructed from the leakage.
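The bookkeeping for the forest $F$ can be sketched as below. The token strings are made up, and the sketch assumes that the root of each query tree is the token that only ever appears as a ${tk}_{next}$ value; how the final look-up of a path is actually encoded in $E D B$ is not modeled here.

```python
def build_forest(pairs):
    """Record each revealed edge (tk_curr -> tk_next) as a child -> parent entry."""
    parent = {}
    for tk_curr, tk_next in pairs:
        parent[tk_curr] = tk_next
        parent.setdefault(tk_next, None)  # provisional root until seen as tk_curr
    return parent

def split_trees(parent):
    """Group the forest into trees, indexed by their root token."""
    def root_of(tk):
        while parent[tk] is not None:
            tk = parent[tk]
        return tk

    trees = {}
    for tk in parent:
        trees.setdefault(root_of(tk), {})[tk] = parent[tk]
    return trees

def complete_trees(trees, n):
    """A query tree is complete once it contains all n vertices of the graph."""
    return {root: t for root, t in trees.items() if len(t) == n}

# Hypothetical leakage: edges revealed while answering queries on a 4-vertex graph.
pairs = [("a", "b"), ("b", "r1"), ("c", "r1"), ("x", "r2")]
forest = build_forest(pairs)
trees = split_trees(forest)
done = complete_trees(trees, n=4)
```

Only the tree rooted at `r1` has four nodes, so it is the only one passed on to the final step in this toy run.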

4.8. Map the token sequences to SPSP queries

In the last step, the set of complete query trees ${Q_{i}}_{i \in [n^{'}]}$ is used as input. The path names of each vertex in the ${Q_{i}}_{i \in [n^{'}]}$ are used to construct a dictionary $D$ that maps each token sequence $s$ to the path name of the starting vertex of the corresponding path in its respective query tree. An empty dictionary $D$ is initialized. For each complete query tree $Q_{i}$ , $P a t h N a m e s \leftarrow ComputePathNames (Q_{i})$ is computed and added to $D$ . The pseudocode for computing $D$ can be found in Algorithm 4.

Theorem 4.14
Let $G = (V, E)$ be a graph and $E D B$ be an encryption of $G$ using the GKT scheme. Let ${Q_{i}}_{i \in [n^{'}]}$ be the query trees constructed from the leakage of queries issued to $E D B$ . Upon input of $G$ , Algorithm 3 returns a dictionary $M$ mapping each path name to a set of SPSP queries in time $O (n^{3})$ . Upon input of $G$ and ${Q_{i}}_{i \in [n^{'}]}$ , Algorithm 4 returns a dictionary $D$ mapping token sequences to path names in time $O (n^{3})$ . Moreover, the outputs $D$ and $M$ have the property that, for any token sequence $s$ corresponding to a path $(v, r)$ in a query tree and for every query $(v^{'}, r^{'}) \in M [D [s]]$ , there exists an isomorphism $φ$ from $Q$ to $T_{r^{'}}$ such that $φ (v) = v^{'}$ and $φ (r) = r^{'}$ .
Proof.
The correctness of ${Q_{i}}_{i \in [n^{'}]}$ follows from the correctness of the GKT scheme. Dictionary $D$ contains a map of each vertex in $\cup_{i \in [n^{'}]} Q_{i}$ to its path name. The correctness of $M$ and $D$ follows from Lemmas 4.13 and 4.12, respectively.

Let $(v, r)$ be a pair comprised of a nonroot vertex $v$ and a root vertex $r$ in a complete query tree $Q$ , and let $s$ be the token sequence corresponding to $(v, r)$ . Let $(v^{'}, r^{'}) \in M [D [s]]$ . By composition of $D$ and $M$ , it must be that ${P a t h N a m e}_{Q} (v) = {P a t h N a m e}_{T_{r^{'}}} (v^{'})$ . Applying Theorem 4.8, there is thus an isomorphism from $Q$ to $T_{r^{'}}$ that maps $v$ to $v^{'}$ and $r$ to $r^{'}$ .

With regard to run time, preprocessing $G$ (Algorithm 3) takes time $O (n^{3})$ . Computing the path names (Algorithm 2) of $n^{'} \leq n$ trees takes $O (n^{3})$ time, which is an upper bound on the run time of the whole attack.

4.9. Recover the queries

Once the map between each node (token) in a query tree and its corresponding path name has been computed, the attacker can use $M$ and $D$ to compute the candidate queries of all queries in the complete query trees. Given $M$ and $D$ (outputs of Algorithms 3 and 4, respectively) and an observed token sequence $s$ that belongs to one of the complete query trees and corresponds to some unknown query, the adversary can find the set of queries consistent with $s$ by simply computing $M [D [s]]$ .
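The whole pipeline, from $M$ and $D$ to the candidate sets, can be sketched end to end on a toy instance. Everything below is illustrative: the SDSP trees are those of the path graph 1 - 2 - 3, the query trees use made-up token strings, $D$ is indexed by the first token of each sequence, and `path_names` is a compact stand-in for Algorithm 2 with an assumed AHU-style encoding and SHA-256 truncated to 128 bits.

```python
import hashlib
from collections import defaultdict

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()[:32]  # 128-bit stand-in for h

def path_names(parent):
    """Hashed path name of every node of a rooted tree (child -> parent map)."""
    kids = defaultdict(list)
    for v, p in parent.items():
        if p is not None:
            kids[p].append(v)
    root = next(v for v, p in parent.items() if p is None)
    names = {}

    def name(v):  # AHU-style canonical name (assumed encoding)
        names[v] = "1" + "".join(sorted(name(c) for c in kids[v])) + "0"
        return names[v]

    name(root)
    out, stack = {root: h(names[root])}, [root]
    while stack:
        u = stack.pop()
        for v in kids[u]:
            out[v] = h(names[v] + out[u])
            stack.append(v)
    return out

def preprocess(trees):
    """Offline: path name -> set of plaintext queries (v, r)."""
    M = defaultdict(set)
    for r, parent in trees.items():
        for v, pn in path_names(parent).items():
            M[pn].add((v, r))
    return dict(M)

def query_mapping(query_trees):
    """Online: first token of each sequence -> path name of its start node."""
    D = {}
    for tree in query_trees:
        D.update(path_names(tree))
    return D

def recover(M, D, first_token):
    """Candidate queries for a token sequence, i.e. M[D[s]]."""
    return M.get(D[first_token], set())

trees = {
    1: {1: None, 2: 1, 3: 2},
    2: {2: None, 1: 2, 3: 2},
    3: {3: None, 2: 3, 1: 2},
}
query_trees = [
    {"tk_r": None, "tk_a": "tk_r", "tk_b": "tk_r"},  # queries to one unknown destination
    {"tk_z": None, "tk_y": "tk_z", "tk_x": "tk_y"},  # queries to another destination
]
M = preprocess(trees)
D = query_mapping(query_trees)
```

In this toy run, `recover(M, D, "tk_a")` yields the two leaf queries into vertex 2, and `recover(M, D, "tk_x")` yields the two end-to-end queries, so each query is recovered up to the symmetry of the path graph.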

4.10. Full query recovery (FQR)

This section concludes with a discussion of when FQR is possible. By the correctness of the attack, this is the case for a graph $G$ , a set of complete query trees ${Q_{i}}_{i \in [n^{'}]}$ , and associated token sequences $S$ when for $M \leftarrow PreprocessGraph (G)$ , $D \leftarrow QueryMapping (G, {Q_{i}}_{i \in [n^{'}]})$ , and all $s \in S$ , $| M [D [s]] | = 1$ .

A condition for FQR feasibility can also be described in graph-theoretic terms. Recall Corollary 4.9, which states that given two isomorphic rooted trees $T$ and $T^{'}$ , if each vertex in $T$ has a unique path name, then there exists only one isomorphism from $T$ to $T^{'}$ . FQR is thus always achievable for any set of complete query trees, when all $n^{2}$ vertices in the SDSP trees have unique path names. More formally:

Corollary 4.15
Let $G = (V, E)$ be a graph and let ${T_{v}}_{v \in V}$ be the set of SDSP trees of $G$ . Suppose every vertex in $⋃_{v \in V} T_{v}$ has a unique path name (and in particular, each $T \in {T_{v}}_{v \in V}$ has a unique canonical name). Then, FQR can always be achieved on any complete query tree(s). The converse is also true.

By correctness, the attack achieves FQR whenever it is possible. Figure 2 depicts a graph for which FQR is always possible. Indeed, each tree ${T_{v}}_{v \in [7]}$ has a unique canonical name, and for all $v \in [7]$ , each vertex $u$ in $T_{v}$ has a unique path name. More generally, let $\mathcal{G}$ be the family of graphs having one central vertex $c$ and any number of paths all of distinct lengths appended to $c$ . It is easy to see that the attack achieves FQR for all graphs $G \in \mathcal{G}$ .

Figure 2.
An example graph for which FQR is always possible, no matter which set of SPSP queries is issued. FQR: full query recovery; SPSP: single pair shortest path.
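The condition of Corollary 4.15 can be tested mechanically on small graphs. The sketch below is an illustration under stated assumptions: it builds a hypothetical member of the family described above (a central vertex 0 with appended paths of the given lengths), computes SDSP trees by breadth-first search with sorted tie-breaking, and uses unhashed path names with an AHU-style encoding.

```python
from collections import defaultdict, deque

def spider(leg_lengths):
    """Central vertex 0 with one appended path per entry of leg_lengths."""
    adj, nxt = defaultdict(set), 1
    for length in leg_lengths:
        prev = 0
        for _ in range(length):
            adj[prev].add(nxt)
            adj[nxt].add(prev)
            prev, nxt = nxt, nxt + 1
    return dict(adj)

def sdsp_tree(adj, r):
    """BFS tree toward destination r, as a child -> parent map."""
    parent, queue = {r: None}, deque([r])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def path_names(parent):
    """Unhashed path names, components joined by ';'."""
    kids = defaultdict(list)
    for v, p in parent.items():
        if p is not None:
            kids[p].append(v)
    root = next(v for v, p in parent.items() if p is None)
    names = {}

    def name(v):  # AHU-style canonical name (assumed encoding)
        names[v] = "1" + "".join(sorted(name(c) for c in kids[v])) + "0"
        return names[v]

    name(root)
    out, stack = {root: names[root]}, [root]
    while stack:
        u = stack.pop()
        for v in kids[u]:
            out[v] = names[v] + ";" + out[u]
            stack.append(v)
    return out

def fqr_always_possible(adj):
    """True iff all n^2 vertices of the SDSP trees have pairwise distinct path names."""
    seen = set()
    for r in sorted(adj):
        for pn in path_names(sdsp_tree(adj, r)).values():
            if pn in seen:
                return False
            seen.add(pn)
    return True

distinct_legs = fqr_always_possible(spider([1, 2, 3]))  # 7 vertices, legs of length 1, 2, 3
equal_legs = fqr_always_possible(spider([1, 1]))        # two legs of the same length
```

With legs of distinct lengths all 49 path names differ, whereas two legs of equal length already introduce a symmetry that rules out FQR.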
5. Experiments

The theoretical results are supported by experiments on both real-world datasets and random graphs.

5.1. Implementation details

The attack was implemented in Python 3.7.6 and run on a computing cluster with a $2 \times 28$ Core Intel Xeon Gold 6258R 2.7 GHz Processor (Turbo up to 4 GHz/AVX512 Support), and 384 GB DDR4 2933 MHz error correcting code memory. To generate the leakage, the GES from Ghosh et al.¹⁰ was implemented, and the same machine was used for both the client and the server. The cryptographic primitives were implemented using the PyCryptodome library version 3.10.1⁴¹; AES-CBC with a 16B key was used for symmetric encryption, and SHA-256 was used as the collision-resistant hash function. For the DES, $Π_{b a s}$ from Cash et al.⁴⁰ was implemented, and the tokens were generated using hash-based message authentication code with SHA-256 truncated to 128 bits. The shortest paths of the graphs were computed using the single_source_shortest_path algorithm from the NetworkX library version 2.6.2.⁴²

The QR attack was implemented using the same shortest path algorithm from NetworkX as in the scheme implementation. An implementation of the AHU algorithm (Algorithm 1) was used to compute canonical names. As mentioned previously, the attack is highly parallelizable, and this property was exploited when implementing the attack.

5.2. Graph datasets

The attack was evaluated on six of the same datasets as Ghosh et al.¹⁰; in addition, the InternetRouting dataset from the University of Oregon Route Views Project (collected on 2 January 2000) and the facebook-combined dataset were also used. All eight of these datasets were obtained from Leskovec and Krevl.⁴³ The InternetRouting and CA-GrQc datasets were extracted from the original datasets using the dense subset extraction algorithm by Charikar⁴⁴ as implemented by Ambavi et al.⁴⁵ Details about these datasets can be found in Table 1, and a summary of the attack results can be found in Table 2.

Table 1.
A description of the real-world datasets used in the experimental evaluation; $n$ denotes the number of vertices; $m$ denotes the number of edges of the graph dataset; $d = 2 m / (n \cdot (n - 1))$ denotes the density of the graph; # Comp denotes the number of connected components.

Dataset	$n$	$m$	$d$	# Comp
InternetRouting	35	323	0.543	1
CA-GrQc	46	1030	0.995	1
Email-Eu-core	1005	16,706	0.0331	20
Facebook-combined	4039	88,234	0.011	1
p2p-Gnutella08	6301	20,777	0.001	2
p2p-Gnutella04	10,876	39,994	0.0006	1
p2p-Gnutella25	22,687	54,705	0.0002	13
p2p-Gnutella30	36,682	88,328	0.0001	12

Table 2.

A comparison of the attack results for the real-world datasets.

					Percentile
Dataset	$\frac{\# Unique}{Total}$	% Unique	$\frac{\# Leaves in SDSP trees}{\# Nodes in SDSP trees}$	% Min	50	90	99
InternetRouting	$\frac{28}{1190}$	2.353	$\frac{1120}{1190}$	94.1	40	84	90
CA-GrQc	$\frac{3}{2070}$	0.145	$\frac{2065}{2070}$	99.8	1845	1845	1845
Email-EU-core	$\frac{65, 659}{1, 009, 020}$	6.507	$\frac{787, 486}{1, 009, 020}$	78.0	16	69	190
Facebook-combined	$\frac{33, 634}{16, 309, 482}$	0.206	$\frac{16, 194, 084}{16, 309, 482}$	99.3	1826	11,424	20,480
p2p-Gnutella08	$\frac{8, 519, 868}{39, 696, 300}$	21.463	$\frac{27, 663, 800}{39, 696, 300}$	69.7	4	12	64
p2p-Gnutella04	$\frac{25, 915, 785}{118, 276, 500}$	21.911	$\frac{80, 580, 827}{118, 276, 500}$	68.1	3	9	32
p2p-Gnutella25	$\frac{82, 736, 533}{514, 677, 282}$	16.075	$\frac{379, 383, 168}{514, 677, 282}$	73.7	5	18	54
p2p-Gnutella30	$\frac{197, 413, 906}{1, 345, 532, 442}$	14.671	$\frac{1, 003, 317, 663}{1, 345, 532, 442}$	74.6	5	24	60

SDSP: single-destination shortest path; QR: query recovery.

The third column (% Unique) denotes the percentage of queries that are recoverable up to one candidate. The fourth and fifth columns denote the smallest fraction of queries needed to reconstruct all query trees, as a ratio and as a percentage (% Min), respectively. The last three columns show the 50th, 90th, and 99th percentiles obtained for QR on the eight real-world datasets.

In addition to the real-world datasets, the attack was also deployed on random graphs for $n = 100, 250, 500, 1000$ and edge probabilities $p = 0.2, 0.4, 0.6, 0.8$ . The graphs were generated using the fast_gnp_random_graph function from NetworkX.⁴²
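For readers without NetworkX at hand, the parameter grid above can be reproduced with a naive $G(n, p)$ sampler. This is a minimal standard-library sketch, not the routine used in the experiments: `fast_gnp_random_graph` uses a faster edge-skipping method, whereas the hypothetical `gnp_graph` below simply tests every vertex pair.

```python
import random
from itertools import combinations


def gnp_graph(n, p, seed=None):
    """Sample G(n, p) as an adjacency dict {vertex: set of neighbours}."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:  # include each edge independently with prob. p
            adj[u].add(v)
            adj[v].add(u)
    return adj


# A small slice of the (n, p) grid described above.
graphs = {(n, p): gnp_graph(n, p, seed=0)
          for n in (100, 250) for p in (0.2, 0.8)}
```

The quadratic loop is acceptable for the graph sizes considered here but would be a poor choice for much larger sparse graphs.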

5.3. Query reconstruction results

Real-world datasets. The attack was carried out on the InternetRouting, CA-GrQc, Email-Eu-core, Facebook-combined, and p2p-Gnutella08 datasets; given all queries, the online portion of the attack (Algorithm 4) ran in 0.087 s, 0.093 s, 5.807 s, 102.670 s, and 339.957 s, respectively. For the first four datasets, the attack was also run given only 75% and 90% of the queries. The results were averaged over 10 runs, and the queries were sampled as follows: the start vertex was chosen uniformly at random, and the end vertex was chosen with probability linearly proportional to its out-degree in the original graph. This simulates a more realistic setting in which certain “highly connected” destinations are chosen with higher frequency. The results of these experiments can be found in Table 3. Queries can be reconstructed after observing just 75% of the queries. In fact, for the Facebook-combined dataset, complete query trees can be observed with high probability after observing only 20% of the queries.
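The query-sampling procedure described above can be sketched as follows. This is a hedged illustration: the function name `sample_queries`, the adjacency-dict representation, sampling with replacement, and the choice to skip queries whose start and end coincide are all assumptions, since the text does not specify them.

```python
import random


def sample_queries(adj, k, seed=None):
    """Draw k (start, end) queries: uniform start, degree-weighted end."""
    rng = random.Random(seed)
    vertices = sorted(adj)
    weights = [len(adj[v]) for v in vertices]  # degree of each vertex
    queries = []
    while len(queries) < k:
        start = rng.choice(vertices)  # uniform over all vertices
        end = rng.choices(vertices, weights=weights, k=1)[0]
        if start != end:  # assumption: trivial queries are skipped
            queries.append((start, end))
    return queries


# Toy graph: vertex 0 is a hub, vertex 3 is isolated.
toy = {0: {1, 2}, 1: {0}, 2: {0}, 3: set()}
print(sample_queries(toy, 5, seed=0))
```

Because the end vertex is weighted by degree, the isolated vertex is never chosen as a destination, while the hub is chosen most often.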

For the remaining datasets, simulations were run to demonstrate the success that an adversary could achieve given 100% of the queries. The simulations were carried out as follows. Given $G$ , the SDSP trees and the path names for each vertex in these trees were computed, and then a dictionary mapping each query in $G$ to the set of candidate queries was constructed by identifying queries whose starting vertices have the same path name. The simulations only used the plaintext graph, and the results show the success that an adversary would achieve in an end-to-end attack. Simulations were used for the larger graphs since storing all responses is memory-intensive; in practice, the attack can be run on larger datasets by writing the map out to a back-end key-value store. These results can be found in the bottom row of Table 3.
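The simulation just described can be sketched on the plaintext graph alone. The code below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes an unweighted graph, BFS with a fixed tie-breaking rule to pick each SDSP tree, canonical names encoded as balanced bit strings (leaf = `10`, internal node = `1` + sorted child names + `0`), and path names formed by concatenating canonical names along the path from a vertex to the root. The function names are hypothetical.

```python
from collections import defaultdict, deque


def sdsp_tree(adj, root):
    """BFS tree toward root; parent[u] is the next hop from u to root."""
    parent = {root: None}
    order = [root]  # BFS order: parents always precede children
    queue = deque([root])
    while queue:
        w = queue.popleft()
        for u in sorted(adj[w]):  # fixed tie-breaking (an assumption)
            if u not in parent:
                parent[u] = w
                order.append(u)
                queue.append(u)
    return parent, order


def path_names(parent, order):
    """Path name of each vertex: canonical names from the vertex to the root."""
    children = defaultdict(list)
    for u in order[1:]:
        children[parent[u]].append(u)
    name = {}
    for u in reversed(order):  # children are named before their parents
        name[u] = "1" + "".join(sorted(name[c] for c in children[u])) + "0"
    pname = {}
    for u in order:  # parents get path names before their children
        p = parent[u]
        pname[u] = name[u] + (pname[p] if p is not None else "")
    return pname


def candidate_map(adj):
    """Map each query (start, end) to all queries sharing its path name."""
    multimap = defaultdict(set)
    for v in adj:
        parent, order = sdsp_tree(adj, v)
        pname = path_names(parent, order)
        for u in order[1:]:  # skip the root: (v, v) is not a query
            multimap[pname[u]].add((u, v))
    return {q: cands for cands in multimap.values() for q in cands}


path_graph = {0: [1], 1: [0, 2], 2: [1]}
cmap = candidate_map(path_graph)
print(sorted(cmap[(0, 2)]))
```

On the three-vertex path graph, every query shares its path name with exactly one other query, so each candidate set has size two. Only vertices reachable from the destination are considered, so a disconnected graph yields fewer than $n \cdot (n - 1)$ queries in this sketch.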

Table 3.

CDFs for QR of the real-world datasets after observing (row 1) 75%, (row 2) 90%, and (rows 3 and 4) 100% of the queries.

Table 2 reports the percentage of uniquely recoverable queries when the attack is run on the set of all query trees. Uniquely recoverable queries are queries whose responses result in only one candidate. CA-GrQc had the smallest percentage of uniquely recoverable queries (0.145%) and p2p-Gnutella04 had the largest (21.911%). The small percentage for CA-GrQc can be attributed to its high density ( $d = 0.995$ ), where density is defined as $d = 2 m / (n \cdot (n - 1))$ . The CA-GrQc graph is nearly complete, and its SDSP trees display a high degree of symmetry. In fact, many of the query trees are isomorphic to the majority of SDSP trees, and the majority of SDSP trees have a star shape (i.e. $n - 1$ of the vertices in the tree are adjacent to the root). Each nonroot vertex in a star tree has the same path name, resulting in a large number of possible candidates per token sequence.

Table 3 depicts the cumulative distribution functions (CDFs) resulting from the experiments. The four Gnutella datasets exhibit a high recovery rate that can be explained by asymmetry and low density. Fifty percent of all queries for the p2p-Gnutella08, p2p-Gnutella04, p2p-Gnutella25, and p2p-Gnutella30 datasets result in at most 4, 3, 5, and 5 candidate query values, respectively. Details of the 50th, 90th, and 99th percentiles can be found in Table 2. The histograms of the results for QR on the real-world datasets (assuming 100% of the queries have been observed) are depicted in Table 4.
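The summary statistics discussed here (percentage of uniquely recoverable queries and percentiles of the candidate-set size) can be computed from a list holding one candidate-set size per query. The sketch below is illustrative: the function name `summarize` is hypothetical, and the nearest-rank percentile definition is an assumption, since the text does not state which definition was used.

```python
def summarize(sizes, percentiles=(50, 90, 99)):
    """Return (% of queries with one candidate, nearest-rank percentiles)."""
    ordered = sorted(sizes)
    total = len(ordered)
    pct_unique = 100 * sum(1 for s in ordered if s == 1) / total
    ranks = {}
    for q in percentiles:
        rank = max(1, -(-q * total // 100))  # integer ceil(q * total / 100)
        ranks[q] = ordered[rank - 1]
    return pct_unique, ranks


# Toy input: ten queries with candidate sets of size 1, 2, or 4.
pct_unique, ranks = summarize([1, 1, 2, 2, 2, 2, 4, 4, 4, 4])
print(pct_unique, ranks)
```

A percentile value of $c$ at level $q$ then reads as "$q$% of queries are recovered up to at most $c$ candidates".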

Random graphs. The attack was also deployed on random graphs, varying the number of nodes ( $n = 100, 250, 500, 1000$ ) and the edge probability ( $p = 0.2, 0.4, 0.6, 0.8$ ). QR was carried out after all queries had been issued. For each $(n, p)$ pair, 50 random graphs were generated and encrypted using the GKT scheme. The leakage for all possible SPSP queries was then generated, and the server-side QR attack was deployed on the responses. For each recovered multimap, the average number of candidate queries across all 50 graphs was computed. The CDFs of these results can be found in Table 5. The attack executed very quickly, with runtimes ranging from 0.471 s on graphs with $n = 100$ to $46.2$ s on graphs with $n = 1000$ .

Table 4.

Histograms for query recovery (QR) of the real-world datasets after observing 100% of the queries.

The number of candidate queries output by QR is plotted on the $x$-axis, and the number of queries is plotted on the $y$-axis. The red dotted lines indicate the 50th, 90th, and 99th percentiles. An asterisk next to the dataset indicates that results were obtained via simulation; see the discussion for details.

Table 5.

CDFs for QR of random graphs for $n = 100, 250, 500, 1000$ and $p = 0.2, 0.4, 0.6, 0.8$ after observing 100% of the queries.

CDF: cumulative distribution function; QR: query recovery.

The number of candidate queries output by the QR attack is plotted on the $x$-axis, and the percentage of total queries is plotted on the $y$-axis. For each $(n, p)$ , 50 graphs were generated, and the number of vertices with each given candidate-set size was averaged over these graphs. As the edge probability increases, the number of symmetries, and hence the number of candidate queries output, tends to increase.

Table 6.

PDFs for QR of random graphs after observing 100% of the queries.

PDF: probability density function; QR: query recovery.

The number of candidate queries output by the QR attack is plotted on the $x$ -axis, and the percentage of total queries is plotted on the $y$ -axis.

In general, increases in $p$ and in $n$ both result in larger maximum candidate query sets. For example, for $n = 100, p = 0.2$ , $6.3\%$ of all queries are uniquely recoverable and $50\%$ of all queries are recoverable to at most 10 candidate queries (representing $0.101\%$ of all queries). For $n = 1000, p = 0.2$ , $1.5\%$ of all queries are uniquely recoverable and $50\%$ of all queries are recoverable to at most 107 candidate queries (representing $0.0107\%$ of all queries).

As $p$ increases from $0.2$ to $0.8$ , the graphs become denser, and a trend similar to that seen in the real-world datasets can be observed. Denser graphs are closer to complete, and result in more symmetries and larger candidate query sets. As $p$ increases, more “waves” in the CDFs can also be observed; in the graphs showing the probability density functions (Table 6), these correspond to large clusters of candidate queries, all of which have the same path name and hence cannot be distinguished.
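The percentages in parentheses can be checked by hand, assuming that the total number of SPSP queries on a connected graph with $n$ vertices is $n \cdot (n - 1)$ , as in the "Total" column of Table 2:

```latex
\frac{10}{100 \cdot 99} = \frac{10}{9900} \approx 0.101\%,
\qquad
\frac{107}{1000 \cdot 999} = \frac{107}{999000} \approx 0.0107\%.
```

In absolute terms the median candidate set grows with $n$ , but as a fraction of the query space it shrinks by roughly an order of magnitude.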

6. Discussion

This work describes a QR attack against the GKT GES from Ghosh et al.¹⁰ The attack model considered is strong, but fits within the model used in Ghosh et al.¹⁰ The attack begins with an offline preprocessing phase of the graph. In the online phase, the attack waits until it has observed all queries to at least one destination vertex and then outputs a list of candidates for each of these queries. The attack has the property that the output contains everything consistent with the leakage (and nothing more), and always contains the correct query. The attack was supported with a precise characterization of when FQR is possible, and evaluated against real-world and random graphs.

An alternative setting is to consider query reconstruction when arbitrary subsets of queries have been issued: then the adversary can construct partial query trees and attempt to identify isomorphic embeddings of them into the SDSP trees. It is an interesting open problem to develop an efficient attack for this setting. Yet another variant of the attack, for a network adversary, is described in Appendix A.

This paper highlights the need for detailed cryptanalysis of GESs. The value of such analysis was recognized in Ghosh et al.,¹⁰ but omitted on the grounds that the impact of the leakage is application-specific and can only be assessed in the context of particular use cases at the time of deployment. Such analysis, however, should be done in tandem with security proofs (establishing leakage profiles) at the same time as schemes are developed. Of course, attacks should be assessed with respect to real-world datasets whenever possible, as done here.

This work leaves open the question of whether other GESs can be similarly attacked. On the constructive side, the question of whether more secure schemes can be built that utilize chaining in a noninteractive manner and which support shortest path queries remains open. Another interesting line of research includes constructing practical interactive graph database schemes that minimize the communication overhead and number of rounds between the client and the server. Moreover, there is still much work to be done regarding the design of encrypted graph database schemes that can support a variety of queries—an important property for schemes in practical settings.

Footnotes

Acknowledgments

Work by F.F. was performed in part while visiting Brown University.

ORCID iDs

Francesca Falzon

Kenneth G. Paterson

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by the ThinkSwiss Research Scholarship, the U.S. National Science Foundation, and Armasuisse Science and Technology.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Appendix A. Attack for a network adversary

This section introduces a novel variant of the QR attack from the perspective of a network adversary. In practice, a secure communication protocol such as Transport Layer Security would be used to encrypt the communication between the client and the server, thereby mitigating such an attack. However, this attack is an interesting proof of concept that demonstrates how a slight change in the leakage profile can greatly impact the adversary’s ability to recover information.

The network adversary is assumed to know $G$ , but is able to only observe the communication between the client and server, that is, the initial search token and its response (which is a sequence of the encrypted vertices in the path, not including the start vertex). This is in contrast to a server-side adversary that is able to observe the complete sequence of search tokens and encrypted vertices during look-up. The network-adversary attack can be summarized as follows:

(1) Preprocess the graph offline (Algorithm 3). Compute the SDSP trees $\{T_v\}_{v \in V}$ of graph $G$ . Then construct a multimap $M$ such that $M$ maps each path name arising in the $T_v$ to the set of SPSP queries whose start vertices have the same path name.

(2) Compute the query trees online. Construct the query trees from the sequence of ciphertexts contained in the responses. The nodes in the trees are labeled by the issued search tokens.

(3) Process the query trees (Algorithm 5). Construct a dictionary $D$ such that $D$ maps each node in the query tree to the path name of the start vertex of the path. Additionally, compute a multimap $\hat{M}$ that maps search tokens to the possible nodes in the query tree that each token could correspond to.

Note that Step (1) is the same as before, but Steps (2) and (3) must be adapted to this new setting. The latter two steps are described in detail below.

References

1. Amazon. Amazon Neptune. https://aws.amazon.com/neptune/ (2021, accessed 27 October 2021).

2. Bronson, Amsden, Cabrera, et al. TAO: Facebook’s distributed data store for the social graph. In: 2013 USENIX annual technical conference (USENIX ATC 13). San Jose, CA: USENIX Association, 2013, pp.49–60.

3. I. Neo4j. Neo4j. https://neo4j.com/ (2021, accessed 27 October 2021).

4. Ontotext. GraphDB. https://graphdb.ontotext.com/ (2021, accessed 27 October 2021).

5. Malewicz, Austern, Bik AJC, et al. Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, Indianapolis, IN, USA, 2010, pp.135–146. New York, NY, USA: Association for Computing Machinery.

6. Low, Bickson, Gonzalez, et al. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc VLDB Endow 2012; 5: 716–727.

7. Shao, Wang. Trinity: a distributed graph engine on a memory cloud. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, SIGMOD’13, 2013, pp.505–516. New York, NY, USA: Association for Computing Machinery.

8. Chase, Kamara. Structured encryption and controlled disclosure. In: Advances in cryptology – ASIACRYPT 2010 – 16th international conference on the theory and application of cryptology and information security, 2010, pp.577–594, Lecture notes in computer science, Vol. 6477. Singapore: Springer, Cham, Switzerland.

9. Meng, Kamara, Nissim, et al. GRECS: graph encryption for approximate shortest distance queries. In: Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, Denver, Colorado, 2015, pp.504–517. New York, NY, USA: Association for Computing Machinery.

10. Ghosh, Kamara, Tamassia. Efficient graph encryption scheme for shortest path queries. In: Proceedings of the 2021 ACM Asia conference on computer and communications security, ASIA CCS’21, Virtual, Hong Kong, 2021, pp.516–525. New York, NY, USA: Association for Computing Machinery.

11. Wang, Ren, et al. SecGDB: graph encryption for exact shortest distance queries with efficient updates. In: Financial cryptography and data security – 21st international conference, FC 2017, Sliema, Malta, April 3–7, 2017, revised selected papers (ed A Kiayias), Lecture notes in computer science, Vol. 10322, 2017, pp.79–97. Cham, Switzerland: Springer.

12. Falzon, Paterson. An efficient query recovery attack against a graph encryption scheme. In: Computer security – ESORICS 2022 – 27th European symposium on research in computer security, Copenhagen, Denmark, September 26–30, 2022, proceedings, part I (eds V Atluri, RD Pietro, CD Jensen and W Meng), lecture notes in computer science, Vol. 13554, 2022, pp.325–345. Berlin: Springer.

13. Sealfon. Shortest paths and distances with differential privacy. In: Proceedings of the 35th ACM SIGMOD–SIGACT–SIGAI symposium on principles of database systems, PODS’16. New York, NY, USA: Association for Computing Machinery, 2016, pp.29–41.

14. Mouratidis, Yiu. Shortest path computation with no information leakage. Proc VLDB Endow 2012; 5: 692–703.

15. Aho, Hopcroft, Ullman. Data structures and algorithms, 1st edn. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1983.

16. Floyd. Algorithm 97: shortest path. Commun ACM 1962; 5: 345.

17. Poh, Mohamad, Z’aba. Structured encryption for conceptual graphs. In: Hanaoka G and Yamauchi T (eds) Advances in information and computer security. Berlin: Springer, 2012, pp.105–122.

18. Zimmerman, Planul, et al. Privacy-preserving shortest path computation. In: 23rd annual network and distributed system security symposium, NDSS 2016, San Diego, California, USA, February 21–24, 2016. San Diego, CA, USA: The Internet Society, 2016. http://wp.internetsociety.org/ndss/wp-content/uploads/sites/25/2017/09/privacy-preserving-shortest-path-computation.pdf.

19. Lai, Yuan, Sun S-F, et al. GraphSE2: an encrypted graph database for privacy-preserving social search. In: Proceedings of the 2019 ACM Asia conference on computer and communications security, Asia CCS’19. New York, NY, USA: Association for Computing Machinery, 2019, pp.41–54.

20. Sala, Zhao, Wilson, et al. Sharing graphs using differentially private graph models. In: Proceedings of the 2011 ACM SIGCOMM conference on internet measurement conference, IMC’11. New York, NY, USA: Association for Computing Machinery, 2011, pp.81–98.

21. Blackstone, Kamara, Moataz. Revisiting leakage abuse attacks. In: 27th annual network and distributed system security symposium, NDSS 2020, San Diego, California, USA, February 23–26, 2020. San Diego, CA, USA: The Internet Society, 2020.

22. Cash, Grubbs, Perry, et al. Leakage-abuse attacks against searchable encryption. In: Proceedings of the 22nd ACM SIGSAC conference on computer and communications security (CCS '15), Denver, Colorado, pp.668–679. New York, NY, USA: Association for Computing Machinery.

23. Zhang, Katz, Papamanthou. All your queries are belong to us: the power of file-injection attacks on searchable encryption. In: 25th USENIX security symposium (USENIX Security 16). Austin, TX: USENIX Association, 2016, pp.707–720.

24. Islam, Kuzu, Kantarcioglu. Access pattern disclosure on searchable encryption: ramification, attack and mitigation. In: 19th annual network and distributed system security symposium, NDSS 2012. San Diego, CA, USA: The Internet Society, 2012.

25. Pouliot, Wright. The shadow nemesis: inference attacks on efficiently deployable, efficiently searchable encryption. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security (CCS '16), Vienna, Austria, 2016, pp.1341–1352. New York, NY, USA: Association for Computing Machinery.

26. Gui, Paterson, Patranabis. Rethinking searchable symmetric encryption. In: 44th IEEE symposium on security and privacy, SP 2023, San Francisco, CA, USA, May 21–25, 2023, pp.1401–1418. NY, USA: The Institute of Electrical and Electronics Engineers (IEEE).

27. Kellaris, Kollios, Nissim, et al. Generic attacks on secure outsourced databases. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security (CCS '16), Vienna, Austria, 2016, pp.1329–1340. New York, NY, USA: Association for Computing Machinery.

28. Lacharité M-S, Minaud, Paterson. Improved reconstruction attacks on encrypted data using range query leakage. In: IEEE symposium on security and privacy (SP), San Francisco, CA, USA, 2018, pp.297–314. NY, USA: The Institute of Electrical and Electronics Engineers (IEEE).

29. Grubbs, Lacharité, Minaud, et al. Pump up the volume: practical database reconstruction from volume leakage on range queries. In: Proceedings of the 2018 ACM SIGSAC conference on computer and communications security, CCS 2018, Toronto, ON, Canada, October 15–19, 2018 (eds D Lie, M Mannan, M Backes and X Wang), 2018, pp.315–331. New York, NY, USA: Association for Computing Machinery.

30. Grubbs, Lacharité M-S, Minaud, et al. Learning to reconstruct: statistical learning theory and encrypted database attacks. In: Proceedings of IEEE symposium on security and privacy (SP), San Francisco, CA, USA, 2019, pp.1067–1083. NY, USA: Institute of Electrical and Electronics Engineers.

31. Gui, Johnson, Warinschi. Encrypted databases: new volume attacks against range queries. In: Proceedings of the 2019 ACM SIGSAC conference on computer and communications security, CCS 2019, London, UK, November 11–15, 2019 (eds L Cavallaro, J Kinder, X Wang and J Katz), 2019, pp.361–378. New York, NY, USA: Association for Computing Machinery.

32. Kornaropoulos, Papamanthou, Tamassia. The state of the uniform: attacks on encrypted databases beyond the uniform query distribution. In: Proceedings of IEEE symposium on security and privacy (SP), San Francisco, CA, USA, 2018, pp.297–314. NY, USA: Institute of Electrical and Electronics Engineers, 2020.

33. Kornaropoulos, Papamanthou, Tamassia. Response-hiding encrypted ranges: revisiting security via parametrized leakage-abuse attacks. In: Proceedings of IEEE symposium on security and privacy, San Francisco, CA, USA, 2021, pp.1502–1519. NY, USA: Institute of Electrical and Electronics Engineers.

34. Markatou, Tamassia. Full database reconstruction with access and search pattern leakage. In: Information security – 22nd international conference, ISC 2019, New York City, NY, USA, September 16–18, 2019, proceedings, lecture notes in computer science, Vol. 11723, 2019, pp.25–43. Cham, Switzerland: Springer.

35. Falzon, Markatou, Cash, et al. Full database reconstruction in two dimensions. In: Proceedings of the 2020 ACM SIGSAC conference on computer and communications security (CCS '20), Virtual, 2020, pp.443–460. New York, NY, USA: Association for Computing Machinery.

36. Markatou, Falzon, Tamassia, et al. Reconstructing with less: leakage abuse attacks in two dimensions. In: Proceedings of the 2021 ACM SIGSAC conference on computer and communications security (CCS '21), Virtual, 2021, pp.2243–2261. New York, NY, USA: Association for Computing Machinery.

37. Kornaropoulos, Papamanthou, Tamassia. Data recovery on encrypted databases with $k$-nearest neighbor query leakage. In: Proceedings of IEEE symposium on security and privacy 2019, (S&P 2019), San Francisco, CA, USA, 2019, pp.1033–1050. NY, USA: Institute of Electrical and Electronics Engineers (IEEE).

38. Goetschmann. Design and analysis of graph encryption schemes. Master’s Thesis, ETH Zürich, 2020.

39. Cormen, Leiserson, Rivest, et al. Introduction to algorithms, 3rd edn. The MIT Press, 2009.

40. Cash, Jaeger, Jarecki, et al. Dynamic searchable encryption in very-large databases: data structures and implementation. In: 21st annual network and distributed system security symposium 2014, NDSS 2014. San Diego, CA, USA: The Internet Society, 2014.

41. Developers. PyCryptodome, 2021, version 3.10.1. https://www.pycryptodome.org/.

42. Developers. NetworkX, 2021, version 2.6.2. https://networkx.org/.

43. Leskovec, Krevl. SNAP datasets: Stanford large network dataset collection, 2014.

44. Charikar. Greedy approximation algorithms for finding dense components in a graph. In: Jansen K and Khuller S (eds) Approximation algorithms for combinatorial optimization. Berlin: Springer, 2000, pp.84–95.

45. Ambavi, Sharma, Gohil. Densest-subgraph-discovery. GitHub, 2020.
