On the Selection of Information Sources for Gossip Spreading

Abstract

Information diffusion is efficient via gossip or rumor spreading in many of the next generation networks. It is of great importance to select some seed nodes as information sources in a network so as to maximize the gossip spreading. In this paper, we deal with the issue of the selection of information sources, which are initially informed nodes (i.e., seed nodes) in a network, for pull-based gossip protocol. We prove that the gossip spreading maximization problem (GSMP) is NP-hard. We establish a temporal mapping of the gossip spreading process using virtual coupon collectors by leveraging the concept of temporal network, further prove that the gossip spreading process has the property of submodularity, and consequently propose a greedy algorithm for selecting the information sources, which yields a suboptimal solution within $(1 - 1 / e)$ of the optimal value for GSMP. Experiments are carried out to study the spreading performance, illustrating the significant superiority of the greedy algorithm over heuristic and random algorithms.

1. Introduction

Information dissemination through networks is ubiquitous in the modern world [1–6], and gossip or rumor spreading is an efficient way of information diffusion: imagine that a rumor arises in a town and spreads like an epidemic through the whole population [7]. There are two atomic types of gossip protocols: “pull,” in which an uninformed node requests a message it does not possess from a randomly chosen neighbor, and “push,” in which an informed node sends the message it possesses to a randomly chosen neighbor. Gossip-based algorithms are simple, robust, flexible, and scalable and hence are promising for many of the next generation networks [8]. Existing applications are numerous, such as consensus and averaging problems in sensor networks [9, 10], ad hoc message routing [11], peer-to-peer (P2P) file distribution [12], and information dissemination in social networks [13].

Most existing analytical works on gossip spreading have dealt with (high probability) upper bounds of the completion time, disregarding the choice of information sources [7, 8, 12–15]. Given a static connected network graph, the completion time of a gossip protocol is the first time at which all the nodes are informed. Recently, the issue of selecting information sources for gossip spreading has been treated in our previous work [16]. However, in [16], only the push-based gossip protocol was considered, and the complexity of the gossip spreading maximization problem was left open.

In this paper, we consider the problem of selecting information sources for gossip spreading, focusing on the pull-based gossip protocol. The information sources are initially informed nodes (i.e., seed nodes) in a network. We ask the question: given a general network graph, the budgeted size of a seed set, and a constrained deadline, how should we pick the elements of the seed set, which will be endowed with an identical message to be propagated to the rest of the network, such that the expected number of informed nodes within the deadline is maximized? The gossip spreading maximization problem (GSMP) frequently arises in many scenarios, especially when a popular content is demanded by a group of nodes. In sensor networks, one needs to decide the deployment of key sensors, capable of detecting emergencies and issuing alarms that the other hearing-and-forwarding sensors relay, so as to maximize the alarmed area as quickly as possible. In P2P networks, one needs to decide the best choice of seeds so as to maximize the file distribution within the tolerated delay. Besides, similar research topics can be found in viral marketing (a.k.a. influence maximization), detection of disease outbreak, and opportunistic cellular traffic offloading [17–19].

Our contributions are as follows. $(1)$ We prove that GSMP is NP-hard, which means that suboptimal solutions with performance guarantee should be exploited because polynomial-time algorithms have not yet been discovered for the class of NP-hard problems [20]. $(2)$ We establish a temporal mapping of the pull-based gossip spreading process using virtual coupon collectors by leveraging the concept of temporal network and prove the submodularity in gossip spreading based on this temporal mapping method. Consequently, we propose a greedy algorithm for GSMP that yields a solution within $(1 - 1 / e)$ of the optimal value. $(3)$ We carry out extensive experiments to study the spreading performance, demonstrating that the greedy algorithm outperforms heuristic and random algorithms significantly.

Our work differs from previous works in several aspects. First, we deal with GSMP by selecting influential information sources, different from the analytical works on high-probability completion time (see, e.g., [8]). Second, we study GSMP in the area of gossip spreading and treat it using the coupon collecting and temporal mapping method, while the influence maximization problem is related to the social influence diffusion models and is treated with the coin flipping and equivalent view method (see, e.g., [17]). Third, beyond our previous work [16], we focus on the “pull” model, rigorously prove the NP-hardness of GSMP, and leverage the new temporal mapping method to analyze the submodularity of gossip spreading.

The rest of this paper is organized as follows. Section 2 describes the pull-based gossip protocol and formulates the gossip spreading maximization problem. We analyze the complexity of GSMP and show its NP-hardness in Section 3. We recognize the submodularity in gossip spreading via a temporal mapping method and propose a suboptimal solution with performance guarantee for GSMP in Section 4. We carry out simulation experiments to study the spreading performance of our approach in Section 5. Finally, Section 6 concludes this paper.

2. System Model and Problem Formulation

In general, a directed network graph $G = (V, E)$ consists of a set of nodes V and a set of directed edges E. Each node is said to be either informed or uninformed; a node is informed if and only if it possesses its desired message. For any pair of nodes $u, v \in V$ , u may send its possessed message to v if and only if the edge $(u, v) \in E$ .

Time is slotted, and a message can be transferred from a sender to a receiver within one time slot, which is called a round throughout this paper. Regardless of the content carried by the information flow over the underlying network, we focus on only a single message in our model.

2.1. Gossip Protocol

The type of gossip spreading considered in this paper can be called multiple sources with single message [16]: initially the nodes in a seed set S become informed with an identical message, and all the other nodes wish to receive a copy of that message. Any uninformed node v can pull the message from one informed neighbor, which is connected to v by an incoming edge in E. That informed neighbor can belong to either the seed nodes in S or the other initially uninformed nodes that have already obtained the message.

In each round, the nodes in the network G contact their neighbors in the following manner: each of the uninformed nodes picks a partner uniformly at random from the set of all its neighbors (connected by incoming edges), oblivious of any state, history, or other nodes' choices. Once a partner is chosen, the uninformed node pulls its desired message from the chosen partner.

In any given round, an uninformed node can pull from only one partner; it becomes informed if its chosen partner already possesses the desired message and remains uninformed otherwise. Once a node gets informed in some round, it stays informed forever. All communications are assumed to be error-free, with error control coding and protocol overhead encapsulated by the physical-layer design, which are not considered herein.
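The protocol described above can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `pull_gossip` and the dictionary `in_neighbors` (mapping each node to the tails of its incoming edges) are hypothetical, and rounds are assumed to be synchronous, so a node informed in round t can serve the message only from round t + 1 onward.

```python
import random


def pull_gossip(in_neighbors, seeds, deadline, rng=None):
    """Simulate one realization of pull-based gossip (illustrative sketch).

    in_neighbors: dict mapping each node to the list of nodes it can pull
                  from (the tails of its incoming edges).
    seeds:        iterable of initially informed nodes (the seed set S).
    deadline:     number of rounds D to simulate.
    Returns the set of nodes informed by the end of round D.
    """
    rng = rng or random.Random()
    informed = set(seeds)
    for _ in range(deadline):
        newly_informed = set()
        for v, nbrs in in_neighbors.items():
            if v in informed or not nbrs:
                continue
            # Uniform choice, oblivious of state, history, or other choices.
            partner = rng.choice(nbrs)
            # Success only if the partner held the message before this round.
            if partner in informed:
                newly_informed.add(v)
        # Informed nodes stay informed forever.
        informed |= newly_informed
    return informed
```

On a directed path a → b → c with seed a, every node has a single possible partner, so the outcome is deterministic: b is informed after one round and c after two.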

2.2. Problem Formulation

Consider the sample space Ω, where each sample specifies one possible realization of the gossip spreading process. Let X denote one sample in Ω, and let $\Pr [X]$ denote its occurrence probability. We are interested in the case where the gossip spreading process runs until a constrained deadline. Given a seed set S of k nodes and a deadline of D rounds, the number of informed nodes under one sample $X \in Ω$ by deadline D is $I_{D} (S | X)$ . So the expected number of informed nodes (within D rounds) is

\begin{matrix} σ_{D} (S) := E [I_{D} (S)] = \sum_{X \in Ω} \Pr [X] \cdot I_{D} (S | X) . \end{matrix}

(1)

Given a directed network graph $G = (V, E)$ , the budgeted size k of a seed set S, and a constrained deadline D, we wish to select the seed nodes in S such that $σ_{D} (S)$ is maximized. This is called the gossip spreading maximization problem (GSMP) and is formally given by

\begin{array}{ll} \max_{S \subseteq V} & σ_{D} (S) \\ \text{subject to} & |S| \leq k . \end{array}

(2)
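As a small worked instance of (1) and (2), consider an assumed three-node graph that is not part of the experiments: let $V = \{s, a, v\}$ and $E = \{(s, v), (a, v)\}$ , with $k = 1$ . Neither s nor a has an incoming edge, so each can be informed only as a seed, while v picks s or a with probability $1 / 2$ in every round. With $S = \{s\}$ , node v stays uninformed for D rounds only if it picks a every time, so

\begin{matrix} σ_{D} (\{s\}) = 1 + (1 - 2^{- D}) = 2 - 2^{- D} . \end{matrix}

By symmetry $σ_{D} (\{a\}) = 2 - 2^{- D}$ , whereas $σ_{D} (\{v\}) = 1$ because v has no outgoing edge. Hence either s or a solves (2) in this instance; for example, $D = 3$ gives $σ_{3} (\{s\}) = 1.875$ .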

3. Complexity Analysis

GSMP belongs to the field of stochastic programming, and we show that it is NP-hard in this section.

3.1. Preliminary

First of all, we consider the decision version of GSMP.

Problem 1 (gossip spreading decision problem).

Given a network graph $G = (V, E)$ , a constrained deadline D, a seed budget l, and a utility quota q, we wish to determine whether there exists a seed set S of l nodes such that the expected number of informed nodes $σ_{D} (S)$ is at least q. Let an instance of Problem 1 be denoted by $G S P (V, E, D, l, q)$ .

We see that $G S P (V, E, D, l, q)$ belongs to the class NP [20], since it can be validated in polynomial time given any solution of l seed nodes. In order to argue the NP-hardness of GSMP, we will show that its decision version (i.e., $G S P (V, E, D, l, q)$ ) can be reduced from the following problem.

Problem 2 (partial set cover problem).

Given a ground set $U = \{u_{1}, u_{2}, \dots, u_{n}\}$ and a collection of U's subsets $C = \{S_{1}, S_{2}, \dots, S_{m}\}$ , we wish to determine whether there exist h of the subsets such that the cardinality of their union is at least p. Let an instance of Problem 2 be denoted by $S C P (U, C, h, p)$ . As the partial set cover problem generalizes the NP-complete set cover problem [20], it must be NP-complete.

3.2. GSMP is NP-Hard

Next we show that GSMP is NP-hard using a reduction from the NP-complete $S C P (U, C, h, p)$ to $G S P (V, E, D, l, q)$ . Note that a bipartite graph is leveraged in the following proof, and similar techniques had been widely used in the complexity analysis literature, such as [17, 21, 22].

Theorem 3.

The gossip spreading maximization problem (GSMP) for the pull-based gossip protocol is NP-hard.

Proof.

Consider an arbitrary instance of the partial set cover problem $S C P (U, C, h, p)$ with n elements $U = \{u_{1}, u_{2}, \dots, u_{n}\}$ and m subsets $C = \{S_{1}, S_{2}, \dots, S_{m}\}$ of U, and construct a directed bipartite graph $G^{*} = (V^{*}, E^{*})$ as follows. The node set $V^{*}$ contains $n + m$ nodes, in which a node $v_{i}^{c} (i = 1, \dots, m)$ corresponds to a subset $S_{i}$ and a node $v_{j}^{u} (j = 1, \dots, n)$ corresponds to an element $u_{j}$ . Each $v_{i}^{c}$ is called a subset node and each $v_{j}^{u}$ is called an element node hereafter. There is a directed edge from a subset node $v_{i}^{c}$ to an element node $v_{j}^{u}$ if $u_{j} \in S_{i}$ ; for example, see Figure 1. In the following, we will see that solving an arbitrary instance $S C P (U, C, h, p)$ of the partial set cover problem is equivalent to solving a special-case instance $G S P (V^{*}, E^{*}, \infty, h, h + p)$ of the gossip spreading decision problem, and we assume $h < m$ without loss of generality.

If we can find h of the subsets in C such that the cardinality of their union is at least p, then we will show that h nodes can be found in $G^{*}$ for the seed set S such that $σ_{\infty} (S) \geq h + p$ with the deadline D being infinity. For each of these h selected subsets for $S C P (U, C, h, p)$ , we use the corresponding subset node in $G^{*}$ as a seed node; eventually, at least p element nodes can pull the desired message from their subset nodes via gossip spreading given the infinite deadline; that is, $σ_{\infty} (S) \geq h + p$ .

Conversely, if we can find h of the nodes in $G^{*}$ for the seed set S such that $σ_{\infty} (S) \geq h + p$ , then we will show that h subsets can be found in C such that the cardinality of their union is at least p. For each v of these h selected seed nodes for $G S P (V^{*}, E^{*}, \infty, h, h + p)$ , if v is an element node, then we replace it with a subset node as follows. If the subset node $v^{c}$ that points to v either has already been selected as a seed node or has already replaced another seed node, then we replace v with any other available subset node; otherwise, we replace v with this subset node $v^{c}$ . After all of these replacements, we have obtained h subset nodes in $G^{*}$ as seed nodes, and $σ_{\infty} (S) \geq h + p$ is clearly still satisfied. Therefore, at least p elements can be covered using h subsets in C, which correspond exactly to these h subset nodes in $G^{*}$ .

In summary, if the gossip spreading decision problem can be solved, then the partial set cover problem can be solved as well; that is, the decision version of GSMP is at least as hard as the NP-complete partial set cover problem.

Figure 1

Illustration of a directed bipartite graph $G^{*} = (V^{*}, E^{*})$ constructed from an instance of the partial set cover problem.
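The construction of $G^{*}$ in the proof can be sketched in code. The snippet below is an illustration only: the helper names and the tuple labels `("c", i)` and `("u", x)` for subset and element nodes are hypothetical. It relies on two facts that hold in $G^{*}$ : subset nodes have no incoming edges, and with $D = \infty$ an element node that has at least one seeded subset node among its in-neighbors is eventually informed with probability 1.

```python
def build_reduction_graph(universe, subsets):
    """Build G* as a dict of in-neighbors (illustrative sketch).

    Subset node ("c", i) stands for subsets[i]; element node ("u", x)
    stands for element x. There is an edge ("c", i) -> ("u", x) iff
    x is in subsets[i], so ("c", i) is an in-neighbor of ("u", x).
    """
    in_neighbors = {("c", i): [] for i in range(len(subsets))}
    for x in universe:
        in_neighbors[("u", x)] = [
            ("c", i) for i, s in enumerate(subsets) if x in s
        ]
    return in_neighbors


def sigma_infinity(in_neighbors, seeds):
    """Expected number of informed nodes in G* when D is infinite.

    Every seed is informed; a non-seed element node is informed (with
    probability 1) iff at least one of its in-neighbors is a seed.
    Non-seed subset nodes have no in-neighbors and stay uninformed.
    """
    seeds = set(seeds)
    covered = sum(
        1
        for v, nbrs in in_neighbors.items()
        if v not in seeds and any(n in seeds for n in nbrs)
    )
    return len(seeds) + covered
```

For instance, with the assumed instance U = {1, ..., 5} and subsets {1, 2}, {2, 3}, {4, 5}, seeding the first and third subset nodes covers four elements, so the spread equals h + p with h = 2 and p = 4.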

Remark 4.

The above argument can also be applied to analyze the complexity of GSMP under the “push” model. Since GSMP is NP-hard and polynomial-time algorithms have not yet been discovered for the class of NP-hard problems [20], we should exploit suboptimal solutions with performance guarantee.

4. Submodularity and Greedy Algorithm

In this section, we establish a temporal mapping of the gossip spreading process using virtual coupon collectors by leveraging the concept of temporal network [23]. This treatment provides a tractable way to recognize the submodularity in gossip spreading and leads to a greedy algorithm which yields a solution to GSMP within $(1 - 1 / e)$ of the optimal value.

4.1. Preliminary

Before the analysis, we introduce the preliminaries on the temporal network, the shortest time-respecting path, and the live diffusion path.

A temporal network embodies the information of when events occur in dynamic systems [23]. For the case of gossip spreading, the edge between any two interacting nodes is endowed with the contact times at which these two nodes share the message. For example, in Figure 2, the weights on the edge from node $v_{a}$ to node $v_{b}$ indicate that $v_{a}$ sends data to $v_{b}$ in rounds $t = 6, 7, 11$ . The key is to consider the causality constraints on the time sequences of nodes' contacts [23]; for example, in Figure 2, $v_{a}$ cannot transmit data to $v_{d}$ even though there are contacts between $v_{a}$ and $v_{b}$ as well as between $v_{b}$ and $v_{d}$ , since the contacts of $v_{b}$ and $v_{d}$ occur before those of $v_{a}$ and $v_{b}$ .

Figure 2

Illustration of a temporal network with a timeline indicating the information of when events occur.

Consider a directed temporal graph $G^{⊤} = (V, E^{⊤}, W^{⊤})$ for the gossip spreading process over a network $G = (V, E)$ , where $W^{⊤}$ is the set of weights on the set of directed edges $E^{⊤}$ , indicating the time information of nodes' contacts. According to the causality constraints of time sequences, a time-respecting path ${P^{⊤}}_{u v}$ from u to v is given by

\begin{matrix} 〈{P^{⊤}}_{u v} : v_{0} = u, v_{1}, \dots, v_{M} = v | w_{v_{i}, v_{i + 1}}, 0 \leq i \leq M - 1〉, \end{matrix}

(3)

where the weight $w_{v_{i}, v_{i + 1}}$ is the time at which $v_{i}$ sends data to $v_{i + 1}$ , and the weights of successive edges on the path ${P^{⊤}}_{u v}$ must be strictly increasing; that is, $w_{v_{i}, v_{i + 1}} < w_{v_{i + 1}, v_{i + 2}}$ for all $0 \leq i \leq M - 2$ . Let $T_{u}$ be the informed time of user u, that is, the first time at which u becomes informed; then for ${P^{⊤}}_{u v}$ defined in (3), its length (i.e., the distance) is defined as

\begin{matrix} d (u, v) := \mathrm{distance} ({P^{⊤}}_{u v}) = w_{v_{M - 1}, v_{M}} - T_{u} . \end{matrix}

(4)

In particular, the shortest time-respecting path ${P_{*}^{⊤}}_{u v}$ from u to v is given by

\begin{matrix} {P_{*}^{⊤}}_{u v} = \underset{{P^{⊤}}_{uv}}{\arg \min} w_{v_{M - 1}, v_{M}} - T_{u} . \end{matrix}

(5)

Given a node set S, we say a node v is reachable if either $v \in S$ or there exists a time-respecting path from one node in S to v on $G^{⊤}$ ; otherwise, it is unreachable. The distance $d (S, v)$ from S to a reachable node v on $G^{⊤}$ is

\begin{matrix} d (S, v) = \min_{u \in S} d (u, v), \end{matrix}

(6)

in which $d (u, v)$ is the length of the shortest time-respecting path from the node u to the node v, and $d (v, v) \equiv 0$ . For an unreachable node v, $d (S, v) = \infty$ .
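The distance $d (S, v)$ can be computed with a single pass over the contacts sorted by time. The following is a hedged sketch rather than part of the original method: the function name `temporal_distances` and the representation of contacts as triples `(u, v, t)` are assumptions, and all seeds are assumed to be informed at time 0 (so $T_{u} = 0$ for $u \in S$ ).

```python
import math


def temporal_distances(contacts, seeds, nodes):
    """Earliest-arrival distances d(S, v) on a temporal graph (sketch).

    contacts: iterable of (u, v, t), meaning u sends data to v in round
              t, with t >= 1.
    seeds:    the node set S, informed at time 0.
    nodes:    all nodes of the graph.
    Returns a dict v -> d(S, v), with math.inf for unreachable nodes.
    """
    arrival = {v: math.inf for v in nodes}
    for s in seeds:
        arrival[s] = 0
    # Scan contacts in chronological order. A contact (u, v, t) extends a
    # time-respecting path only if u was reached strictly before t, which
    # enforces strictly increasing weights along the path.
    for u, v, t in sorted(contacts, key=lambda c: c[2]):
        if arrival[u] < t < arrival[v]:
            arrival[v] = t
    return arrival
```

With the Figure 2 contacts from $v_{a}$ to $v_{b}$ in rounds 6, 7, and 11, and assuming (hypothetically) that the contacts from $v_{b}$ to $v_{d}$ fall in rounds 2 and 4, the node $v_{d}$ is unreachable from $v_{a}$ ; a later contact from $v_{b}$ to $v_{d}$ , say in round 9, would make it reachable at distance 9.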

For an ordinary path $〈P_{u v} : v_{0} = u, v_{1}, \dots, v_{M} = v〉$ on the network G, we say it is a live diffusion path, if $u \in S$ becomes informed initially in round $t = 0$ , $v_{i + 1}$ succeeds in pulling its desired message from its neighbor $v_{i}$ for each $0 \leq i \leq M - 1$ , and the considered node v is finally informed. Note that v is reachable from S via the seed node u.

4.2. A Temporal Mapping

In the following, we establish a temporal mapping of the pull-based gossip spreading process by constructing a directed temporal graph. Note that all the weights on each edge in the temporal graph are absolute times measured from the start of the gossip spreading process, and the temporal mapping method used in this paper is different from the equivalent view method used in our previous work [16]. These two methods are not simply coupled, and the “pull” model brings in new ingredients to GSMP.

Consider an arbitrary node v, which attempts to pull its desired message from its neighbors in each round since the beginning of round $t = 1$ , as long as it is uninformed. Note that the pulling process of v from its neighbors $N_{v}$ is exactly a coupon collecting process [24], and denote this process using $CC (v)$ . In $CC (v)$ , v has $| N_{v} |$ different coupons to collect, and in each round each of these coupons is collected uniformly and independently at random with replacement. For the node v, let $Z_{t} (v)$ denote the stochastic process indicating the coupon collected in its $CC (v)$ in round $t (t \geq 1)$ . The event that a certain coupon u is collected in round t means that the corresponding neighbor $u = Z_{t} (v)$ is pulled from in round t by the node v. Note that in the above described $CC (v)$ , we do not care whether the message-pulling node v has already possessed the message or not.

Given the constrained deadline D, for each node v, we independently run $CC (v)$ till the deadline D is reached and record all the time stamps for each collected coupon when v collects it every time. Therefore, the set $τ_{u, v}$ of time stamps for a neighbor u which has been contacted by v can be written as

\begin{matrix} τ_{u, v} = \{t : Z_{t} (v) = u; t \geq 1\} . \end{matrix}

(7)

After all the coupon collecting processes $\{CC (v), v \in V\}$ are completed, a directed temporal graph $G^{⊤} = (V, E^{⊤}, W^{⊤})$ is thus constructed; for example, see Figure 3. In $G^{⊤}$ , the set $E^{⊤}$ of directed edges contains just the incoming edges from those nodes in $Z_{t} (v)$ with $1 \leq t \leq D$ for each $v \in V$ , and the set $W^{⊤}$ of edge weights is given by

\begin{matrix} \forall \text{ directed edge } (u, v) \in E^{⊤}, \quad \text{weight set } w_{u, v} = τ_{u, v} . \end{matrix}

(8)

Figure 3

Illustration of the temporal mapping, in which, the red and green paths are the live diffusion paths from the seed nodes $v_{a}$ and $v_{b}$ , respectively, and the black arrow indicates the edge direction. Left: the underlying network. Right: the constructed temporal graph $G^{⊤} = (V, E^{⊤}, W^{⊤})$ .
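The construction of $G^{⊤}$ from the coupon collecting processes can be sketched as follows. This is an illustrative sketch under assumptions: the function name `temporal_mapping` and the `in_neighbors` dictionary are hypothetical, and nodes without in-neighbors simply collect nothing.

```python
import random
from collections import defaultdict


def temporal_mapping(in_neighbors, deadline, rng=None):
    """Run CC(v) for every node and record the time stamps (sketch).

    in_neighbors: dict mapping each node v to the list N_v of nodes it
                  can pull from.
    deadline:     the deadline D; CC(v) runs for rounds t = 1, ..., D.
    Returns a dict (u, v) -> sorted list of rounds t with Z_t(v) = u,
    i.e., the weight set tau_{u,v} of each edge of the temporal graph.
    """
    rng = rng or random.Random()
    tau = defaultdict(list)
    for v, nbrs in in_neighbors.items():
        if not nbrs:
            continue
        for t in range(1, deadline + 1):
            # Z_t(v): coupon collected (neighbor contacted) in round t,
            # drawn uniformly at random with replacement.
            u = rng.choice(nbrs)
            tau[(u, v)].append(t)
    return dict(tau)
```

Each node contributes exactly D time stamps in total, spread over its incoming edges; only the edges that were actually contacted appear in the result, matching the definition of $E^{⊤}$ .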

Leveraging the constructed temporal graph $G^{⊤}$ as above, we have Theorem 5. Note that the informed time $T_{v}$ of a reachable node v is the first time in which v becomes informed. In addition, the above-assumed pulling process $CC (v)$ of a node v after it becomes informed is no longer effective; that is, the attempts of v to pull its desired message from its neighbors will no longer take place in the actual gossip spreading process.

Theorem 5.

Given a directed network $G = (V, E)$ , an arbitrary seed set S of nodes, and a constrained deadline D, the expectation of the informed time of each reachable node is equal to the expectation of the length of the shortest time-respecting path (i.e., the distance) from S to the considered node on $G^{⊤} = (V, E^{⊤}, W^{⊤})$ .

Proof.

For each node $v \in V$ , consider its $CC (v)$ . From the memoryless property of the pull-based gossip spreading process, the $CC (v)$ can be started up from the very beginning of the spreading process till a certain time (i.e., the deadline D) even after v becomes informed. The attempts of v to pull its desired message from its neighbors are no longer effective after its informed time $t = T_{v}$ . Consequently, for each node v, we can let the $CC (v)$ be run at the very beginning and independently of the coupon collecting processes of all the other nodes.

With all the coupon collecting processes $\{CC (v), v \in V\}$ run till the deadline D is reached, their results are then recorded and can be later used for revealing the (absolute) time stamps of the events that v succeeds in pulling its desired message from its neighbors for the first time within D rounds. Therefore, a temporal graph $G^{⊤} = (V, E^{⊤}, W^{⊤})$ can be constructed from $\{CC (v), v \in V\}$ , containing all the information about one sample realization of the spreading process over the network $G = (V, E)$ within D rounds.

In particular, given an arbitrary seed set S of nodes, the informed time $T_{v}$ of each reachable node $v \in V$ (i.e., the first time at which it becomes informed) is equal to the length of the shortest time-respecting path from S to v on $G^{⊤}$ for every realization; taking expectations over all possible realizations of the gossip spreading process within D rounds, their expectations are therefore also equal.

Remark 6.

For any sample realization of the gossip spreading process, each of the resulting live diffusion paths from S to all the other reachable nodes on G is equivalent to the shortest time-respecting path from S to the considered node on $G^{⊤}$ .
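The equivalence stated in Theorem 5 and Remark 6 can be checked numerically on small graphs. The sketch below is an assumption-laden illustration, not the authors' code: it draws the choices $Z_{t} (v)$ once, then computes informed times in two ways, by replaying the gossip rounds directly and by an earliest-arrival scan over the resulting temporal graph. All function names are hypothetical.

```python
import math
import random


def draw_choices(in_neighbors, deadline, rng):
    """Run CC(v) for every node: choices[v][t - 1] is Z_t(v)."""
    return {
        v: [rng.choice(nbrs) for _ in range(deadline)] if nbrs else []
        for v, nbrs in in_neighbors.items()
    }


def times_by_simulation(choices, seeds, deadline):
    """Informed times obtained by replaying the pull rounds directly."""
    informed_at = {v: math.inf for v in choices}
    for s in seeds:
        informed_at[s] = 0
    for t in range(1, deadline + 1):
        for v, picks in choices.items():
            # v pulls only while uninformed; the partner must have been
            # informed strictly before round t.
            if picks and informed_at[v] == math.inf:
                if informed_at[picks[t - 1]] < t:
                    informed_at[v] = t
    return informed_at


def times_by_temporal_graph(choices, seeds):
    """Informed times as shortest time-respecting path lengths."""
    contacts = [
        (picks[t - 1], v, t)
        for v, picks in choices.items()
        for t in range(1, len(picks) + 1)
    ]
    arrival = {v: math.inf for v in choices}
    for s in seeds:
        arrival[s] = 0
    for u, v, t in sorted(contacts, key=lambda c: c[2]):
        if arrival[u] < t < arrival[v]:
            arrival[v] = t
    return arrival
```

The temporal graph keeps the contacts made by a node after it is informed, but these never shorten any path into that node, which mirrors the remark that such pulls are "no longer effective".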

4.3. Submodularity in Gossip Spreading

The following arguments lead to a greedy algorithm that yields a solution within $(1 - 1 / e)$ of the optimal value for GSMP. Given a finite ground set $U = \{u_{1}, u_{2}, \dots, u_{n}\}$ of n elements, consider an arbitrary set function $f (\cdot) : 2^{U} \to \mathbb{R}$ , which maps subsets of U to real numbers. Formally, $f (\cdot)$ is called a submodular function if it satisfies

\begin{matrix} f (A_{1} \cup \{v\}) - f (A_{1}) \geq f (A_{2} \cup \{v\}) - f (A_{2}), \end{matrix}

(9)

for all pairs of subsets $A_{1} \subseteq A_{2}$ and all elements $v \in U ∖ A_{2}$ [25]. The quantity $f (A \cup \{v\}) - f (A)$ is called the marginal increase obtained by adding a new element v into a given subset A. Besides, $f (\cdot)$ is called a monotone function if it satisfies

\begin{matrix} f (A \cup \{v\}) - f (A) \geq 0, \end{matrix}

(10)

for all subsets $A \subseteq U$ and all elements $v \in U ∖ A$ .
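Definitions (9) and (10) can be verified by brute force on small ground sets. The checker below is a minimal sketch with hypothetical names; it enumerates all pairs $A_{1} \subseteq A_{2}$ , so it is only practical for a handful of elements. The coverage function in the usage note has the same form as the union-of-reachable-sets count used later in the proof of Theorem 7.

```python
from itertools import combinations


def all_subsets(ground):
    """Yield every subset of the ground set as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_monotone_submodular(f, ground):
    """Check (9) and (10) exhaustively for a set function f (sketch)."""
    ground = frozenset(ground)
    subsets = list(all_subsets(ground))
    for a1 in subsets:
        for a2 in subsets:
            if not a1 <= a2:
                continue
            for v in ground - a2:
                gain_small = f(a1 | {v}) - f(a1)
                gain_large = f(a2 | {v}) - f(a2)
                # (9): gains shrink as the set grows; (10): gains >= 0.
                if gain_small < gain_large or gain_large < 0:
                    return False
    return True


def coverage(reach):
    """Return f(A) = size of the union of reach[s] over s in A."""
    return lambda a: len(set().union(*(reach[s] for s in a)))
```

For example, with the assumed reach sets a: {1, 2}, b: {2, 3}, c: {3}, the coverage function passes the check, whereas $f (A) = |A|^{2}$ fails it because its marginal increases grow with the set.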

Leveraging the temporal mapping of the gossip spreading process established in Theorem 5, we have Theorem 7. Note that $σ_{D} (\cdot)$ is called the gossip spreading function, and $σ_{D} (A)$ is defined in (1) for any given seed set $A \subseteq V$ in the network $G = (V, E)$ ; that is, the ground set is $U = V$ .

Theorem 7.

Given a directed network $G = (V, E)$ and a constrained deadline D, the gossip spreading function $σ_{D} (\cdot)$ is submodular for the pull-based gossip protocol.

Proof.

Recall Theorem 5 and consider the sample space Ω, where each sample specifies one possible realization of $\{CC (v), v \in V\}$ . Conditioned upon $X \in Ω$ , define $I_{D} (A | X)$ as the number of informed nodes within D rounds using A as the seed set. Let $R (s, X)$ denote the set of nodes that are reachable from a node s on $G^{⊤} = (V, E^{⊤}, W^{⊤})$ with the length of the shortest time-respecting path no larger than D. Therefore, $I_{D} (A | X)$ is equal to the cardinality of the union $⋃_{s \in A} ‍ R (s, X)$ ; that is,

\begin{matrix} I_{D} (A | X) = |⋃_{s \in A} ‍ R (s, X)| . \end{matrix}

(11)

We now prove that the function $I_{D} (\cdot | X)$ is submodular for each sample X, similar to [16, 17]. Let $A_{1}$ and $A_{2}$ denote two seed sets with $A_{1} \subseteq A_{2}$ . For a node v, consider the following quantity:

\begin{matrix} I_{D} (A_{1} \cup \{v\} | X) - I_{D} (A_{1} | X) = |R (v, X) ∖ ⋃_{s \in A_{1}} ‍ R (s, X)|, \end{matrix}

(12)

which is the number of elements in $R (v, X)$ that are not already in the union $⋃_{s \in A_{1}} R (s, X)$ . Therefore, we have

\begin{array}{l} I_{D} (A_{1} \cup \{v\} | X) - I_{D} (A_{1} | X) \\ = |R (v, X) ∖ ⋃_{s \in A_{1}} ‍ R (s, X)| \\ \geq |R (v, X) ∖ ⋃_{s \in A_{2}} ‍ R (s, X)| \\ = I_{D} (A_{2} \cup \{v\} | X) - I_{D} (A_{2} | X) . \end{array}

(13)

According to the defining property of submodularity in (9), we see that $I_{D} (\cdot | X)$ is submodular from (13). To complete the proof, we have

\begin{matrix} σ_{D} (A) = \sum_{X \in Ω} ‍ \Pr [X] \cdot I_{D} (A | X), \end{matrix}

(14)

which means that within D rounds the expected number of informed nodes is just the weighted average over all the sample realizations in Ω of the gossip spreading process. Since a nonnegative linear combination of submodular functions is still submodular [25], $σ_{D} (\cdot)$ is submodular.

Remark 8.

Submodularity is a widely applied mathematical tool in tackling a class of nonconvex combinatorial optimization problems, such as social influence maximization, maximum facility location, sensor placement, and optimization design of cellular networks [17, 26–28].

4.4. Greedy Algorithm with Performance Guarantee

We invoke the following result from [25]. Note that the greedy algorithm, presented in “Algorithm 1,” selects in each iteration the element with the largest marginal increase in the gossip spreading function $σ_{D} (\cdot)$ until the seed set S is filled with k nodes. In “Algorithm 1,” $|Gossip (A)|$ is the number of finally informed users when the gossip spreading process runs until the delay-tolerant deadline D using the seed set A, and the Monte Carlo method is leveraged to estimate its average value over R repeated runs. The greedy algorithm thus requires $O (k R |V|)$ runs of the gossip spreading process. In addition, the lazy evaluation method in [18] can be leveraged to accelerate the greedy algorithm.

Algorithm 1: Greedy $(G; k; D)$ .

Initialize $S \leftarrow \emptyset$ and the number of Monte Carlo runs $R$

for $i = 1 \to k$ do

for each node $v \in V ∖ S$ do

$η_{v} \leftarrow 0$

for $j = 1 \to R$ do

$η_{v} \leftarrow η_{v} + | Gossip (S \cup \{v\}) |$

end for

$η_{v} \leftarrow η_{v} / R$

end for

$S \leftarrow S \cup \{\arg \max_{v \in V ∖ S} η_{v}\}$

end for

Output S
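Algorithm 1 can be rendered as a short runnable sketch. The code below is an illustration under assumptions, not the authors' implementation: the names `informed_count` and `greedy_seed_selection` are hypothetical, rounds are assumed synchronous, ties are broken in favor of the first node examined, and k is assumed not to exceed the number of nodes. It does not include the lazy evaluation speed-up.

```python
import random


def informed_count(in_neighbors, seeds, deadline, rng):
    """|Gossip(A)|: informed nodes after one simulated run of D rounds."""
    informed = set(seeds)
    for _ in range(deadline):
        # Each uninformed node pulls from one uniformly chosen in-neighbor;
        # the set `informed` is not modified while this round is evaluated.
        newly = {
            v
            for v, nbrs in in_neighbors.items()
            if v not in informed and nbrs and rng.choice(nbrs) in informed
        }
        informed |= newly
    return len(informed)


def greedy_seed_selection(in_neighbors, k, deadline, runs=200, seed=0):
    """Greedy(G; k; D) with Monte Carlo estimates over `runs` repetitions."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        best_node, best_value = None, -1.0
        for v in in_neighbors:
            if v in chosen:
                continue
            total = sum(
                informed_count(in_neighbors, chosen + [v], deadline, rng)
                for _ in range(runs)
            )
            estimate = total / runs  # Monte Carlo estimate of sigma_D
            if estimate > best_value:
                best_node, best_value = v, estimate
        chosen.append(best_node)
    return chosen
```

On two disjoint directed stars, a hub with three leaves and a hub with two leaves, every leaf has a single in-neighbor, so the estimates are exact and the sketch selects the larger hub first and the smaller hub second.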

Theorem 9 (see [25]).

For a nonnegative monotone submodular function $f (\cdot)$ , let A be the solution of size k returned by the greedy algorithm and let $A^{*}$ be an arbitrary solution of size at most k; then

\begin{matrix} f (A) \geq (1 - \frac{1}{e}) \cdot f (A^{*}) . \end{matrix}

(15)
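For intuition, the standard argument behind (15) can be sketched briefly; the notation $A_{i}$ (the greedy set after i selections) and $δ_{i}$ is introduced here only for illustration. By monotonicity and submodularity, the best single addition to $A_{i}$ recovers at least a $1 / k$ fraction of the remaining gap $δ_{i} := f (A^{*}) - f (A_{i})$ , so

\begin{matrix} δ_{i + 1} \leq (1 - \frac{1}{k}) \cdot δ_{i}, \quad \text{and hence} \quad δ_{k} \leq (1 - \frac{1}{k})^{k} \cdot δ_{0} \leq \frac{1}{e} \cdot f (A^{*}), \end{matrix}

where the last step uses $δ_{0} \leq f (A^{*})$ (as f is nonnegative) and $(1 - 1 / k)^{k} \leq 1 / e$ . The resulting factor $1 - (1 - 1 / k)^{k}$ equals 1 for $k = 1$ , $0.75$ for $k = 2$ , and about $0.651$ for $k = 10$ , and it decreases toward $1 - 1 / e \approx 0.632$ as k grows.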

Remark 10.

Since $σ_{D} (\cdot)$ is submodular and clearly nonnegative monotone as well, the greedy algorithm yields a solution within $(1 - 1 / e)$ of the optimal value for GSMP. Note that this $(1 - 1 / e)$ -factor is the best reported one for the class of submodular maximization problems and the infeasibility to improve this factor is argued in [29].

5. Experiments

In this section, we carry out simulation experiments to study the spreading performance of the greedy algorithm. For comparison, we also implement two heuristic algorithms and a random algorithm.

5.1. Setup

5.1.1. Random Geometric Network with Hotspots

We leverage the widely used random geometric graph to generate a network graph with hotspots. There are $n_{1} = 350$ nodes uniformly distributed on the ${[0,1]}^{2}$ square; moreover, additional $n_{2} = 40$ , $n_{3} = 50$ , and $n_{4} = 60$ nodes are densely clustered and uniformly distributed in three separate hotspots on this square.

In total, there are $n = 500$ nodes in the network; and any pair of nodes is connected with two edges in both directions if their Euclidean distance is no larger than the connectivity radius $r_{n} = \sqrt{(\log n) / n}$ . Note that this $r_{n}$ guarantees that a random geometric graph of n nodes is connected with high probability [30]. Besides, each hotspot is restricted in a rectangular region within the offset ranges ${[- r_{n}, + r_{n}]}^{2}$ from its hotspot center. In Figure 4, the network layout is illustrated.
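The network generation described above can be sketched as follows. Several details are assumptions made only for illustration: the hotspot centers are hypothetical (the text does not give them), the logarithm is taken to be the natural logarithm, and the function name `geometric_network_with_hotspots` is not from the original work.

```python
import math
import random


def geometric_network_with_hotspots(
    n_uniform=350,
    hotspot_sizes=(40, 50, 60),
    hotspot_centers=((0.25, 0.25), (0.75, 0.30), (0.50, 0.75)),  # assumed
    seed=0,
):
    """Generate a random geometric graph with hotspots (sketch).

    Returns (points, in_neighbors, radius). Edges are placed in both
    directions between nodes whose Euclidean distance is at most r_n.
    """
    rng = random.Random(seed)
    n = n_uniform + sum(hotspot_sizes)
    radius = math.sqrt(math.log(n) / n)  # r_n, natural log assumed

    # Background nodes: uniform on the unit square.
    points = [(rng.random(), rng.random()) for _ in range(n_uniform)]
    # Hotspot nodes: uniform within [-r_n, +r_n]^2 around each center.
    for size, (cx, cy) in zip(hotspot_sizes, hotspot_centers):
        for _ in range(size):
            dx = rng.uniform(-radius, radius)
            dy = rng.uniform(-radius, radius)
            points.append((cx + dx, cy + dy))

    in_neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= radius:
                in_neighbors[i].append(j)  # edge j -> i
                in_neighbors[j].append(i)  # edge i -> j
    return points, in_neighbors, radius
```

With the default sizes, $n = 500$ and $r_{n} \approx 0.11$ , so the assumed hotspot regions stay inside the unit square. The quadratic pairwise loop is adequate at this scale; a spatial grid would be preferable for much larger n.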

Figure 4

Network layout of a random geometric network with hotspots on the ${[0,1]}^{2}$ square.

5.1.2. Four Typical Social Networks

We further implement our approach in four typical social networks to evaluate the spreading performance: the small-world network [31], the scale-free network [32], the scientific-collaboration network [33], and the autonomous-system network [34]. Their main network statistics are presented in Table 1.

Table 1

Node and edge numbers of four social networks.

Network	Node number	Edge number
Small-world network	5000	90000
Scale-free network	5000	49970
Scientific-collaboration network	13861	89238
Autonomous-system network	11157	61886

5.2. Heuristic and Random Algorithms

The degree centrality and distance centrality-based heuristic algorithms are widely used in social network analysis [35]; and we can select the seed nodes from the network, using their degree centralities and distance centralities as the decision criteria.

The degree centrality $de g_{u}$ of each node u is equal to the number of its neighbors (connected by outgoing edges). The degree-centrality algorithm selects each new element with the largest degree centrality till the seed set S is filled in with k nodes.

In particular, we assume the network is connected for the distance-centrality algorithm. The distance centrality $dis t_{u}$ of each node u is equal to the sum of the distances from u to all the other nodes in the network. The distance from u to each of the remaining nodes is measured by the number of hops on the shortest path from u to that node. The distance-centrality algorithm selects each new element with the smallest distance centrality until the seed set S is filled with k nodes. If the network is not connected, we can use the sum of the reciprocals of these distances, instead of the distances themselves, to measure the distance centrality.

The random algorithm is the baseline, selecting k distinct seed nodes uniformly at random from all the network nodes.
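The two centrality-based heuristics can be sketched as follows. This is an illustrative sketch, assuming a connected network given as a dictionary `out_neighbors` (a hypothetical name) that maps each node to the heads of its outgoing edges; ties are broken by the order in which nodes appear in the dictionary.

```python
from collections import deque


def degree_centrality_seeds(out_neighbors, k):
    """Pick the k nodes with the largest number of outgoing neighbors."""
    return sorted(
        out_neighbors, key=lambda u: len(out_neighbors[u]), reverse=True
    )[:k]


def hop_distances(out_neighbors, source):
    """Breadth-first search: hop counts from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in out_neighbors[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def distance_centrality_seeds(out_neighbors, k):
    """Pick the k nodes with the smallest sum of hop distances.

    Assumes every node can reach every other node; otherwise the sum
    silently ignores unreachable nodes.
    """
    def total_distance(u):
        return sum(hop_distances(out_neighbors, u).values())

    return sorted(out_neighbors, key=total_distance)[:k]
```

Both heuristics look only at the structure of the graph, so neighboring high-centrality nodes can all be selected even when their spreading regions overlap.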

5.3. Simulation Results

5.3.1. Spreading Performance in Random Geometric Network with Hotspots

We evaluate the spreading performance of the greedy algorithm using different deadlines and seed set sizes, and the results are presented in Figure 5. It is shown that the informed population (i.e., the expected number of informed nodes) grows larger as more seed nodes are selected. In addition, if the network nodes are willing to tolerate longer deadline, more of them can successfully receive their desired information.

Figure 5

Spreading performance of the greedy algorithm.

Next, we present the spreading performance of different seed-selecting algorithms in Figure 6. When the seed set size is 10, the greedy algorithm outperforms the degree-centrality algorithm by $76\%$ , the distance-centrality algorithm by $426\%$ , and the random algorithm by $37\%$ .

Figure 6

Spreading performance of the greedy, heuristic, and random algorithms.

The greedy algorithm, which leverages the dynamics of the gossip spreading process in the network, performs much better than the centrality-based heuristic algorithms, which rely only on the network's structural properties. In fact, many of the most central nodes (e.g., those with high degree centralities or low distance centralities) are clustered, and thus it is unnecessary to select all of them. This clustering effect of both centrality-based algorithms is illustrated in Figure 7.

Figure 7: Seed nodes picked by the greedy and heuristic algorithms.

We also see that the random algorithm outperforms both heuristic algorithms. The reason is that the underlying network is generated from a random geometric graph, so the random algorithm may often select seed nodes that are able to trigger a large gossip spreading in different regions of the ${[0,1]}^{2}$ square.

5.3.2. Spreading Performance in Four Social Networks

We further evaluate the spreading performance of the greedy algorithm in Figures 8, 9, 10, and 11, with the random algorithm as a baseline for comparison. The deadline is 10 in Figures 8 and 10 and 5 in Figures 9 and 11. The spreading performances under different deadlines and with the other algorithms in these four social networks show trends similar to those in the random geometric network with hotspots and are hence omitted.

Figure 8: Spreading performance in the small-world network.

Figure 9: Spreading performance in the scale-free network.

Figure 10: Spreading performance in the scientific-authorship network.

Figure 11: Spreading performance in the autonomous-system network.

From Figures 8, 9, 10, and 11, we see that, under the greedy algorithm, a large portion of the nodes in each network are informed when the seed set size is around 20. The greedy algorithm also significantly outperforms the random algorithm, especially in the scale-free network, the scientific-collaboration network, and the autonomous-system network.

6. Conclusions

In this paper, we have investigated the problem of selecting information sources for pull-based gossip spreading. We have proved the NP-hardness of the gossip spreading maximization problem and proposed a suboptimal solution (i.e., the greedy algorithm) that selects seed nodes as information sources. To analyze this problem, we have established a temporal mapping of the dynamic gossip spreading process via virtual coupon collectors by leveraging the concept of temporal network. In addition, the temporal mapping method helps to connect graph-theoretic problems to submodularity and further leads to the greedy algorithm with a performance guarantee. In the future, it would be interesting to leverage the methods developed in the treatment of GSMP to deal with gossip spreading problems that arise when implementing gossip-based algorithms in real-world networks.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The work has been supported by the National Basic Research Program of China (973 Program) through Grant 2012CB316004, the SRFDP and RGC ERG Joint Research Scheme through Grant 20133402140001, the National Natural Science Foundation of China through Grant 61379003, and the 100 Talents Program of the Chinese Academy of Sciences.

References

1. Selvakennedy and Sinnappan, "An adaptive data dissemination strategy for wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 3, no. 1, pp. 23-40, 2007. doi:10.1080/15501320601067725. Scopus: 2-s2.0-34248163113.

2. Boldrini, Conti, and Passarella, "Contentplace: social-aware data dissemination in opportunistic networks," in Proceedings of the 11th ACM International Conference on Modeling, Analysis, and Simulation of Wireless and Mobile Systems (MSWiM '08), pp. 203-210, October 2008. doi:10.1145/1454503.1454541. Scopus: 2-s2.0-63449110577.

3. Gao and Cao, "User-centric data dissemination in disruption tolerant networks," in Proceedings of the 30th IEEE International Conference on Computer Communications (INFOCOM '11), pp. 3119-3127, April 2011. doi:10.1109/infcom.2011.5935157. Scopus: 2-s2.0-79960866603.

4. Xie and Hwang, "Churn-resilient protocol for massive data dissemination in P2P networks," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 8, pp. 1342-1349, 2011. doi:10.1109/TPDS.2011.15. Scopus: 2-s2.0-79959702172.

5. Zhao and Zhu, "Efficient data dissemination in urban VANETs: parked vehicles are natural infrastructures," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 151795, 11 pages, 2012. doi:10.1155/2012/151795. Scopus: 2-s2.0-84872818749.

6. Dong, Yang, and Zhang, "Gossiping with message splitting on structured networks," International Journal of Distributed Sensor Networks, vol. 2015, Article ID 504581, 8 pages, 2015. doi:10.1155/2015/504581.

7. Karp, Schindelhauer, Shenker, and Vocking, "Randomized rumor spreading," in Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 565-574, Redondo Beach, Calif, USA, November 2000. doi:10.1109/SFCS.2000.892324.

8. Shah, Gossip Algorithms, Now Publishers, 2009.

9. Tang and Dai, "Gossip-based scalable directed diffusion for wireless sensor networks," International Journal of Communication Systems, vol. 24, no. 11, pp. 1418-1430, 2011. doi:10.1002/dac.1224. Scopus: 2-s2.0-81255134321.

10. Huang, "To reach consensus using uninorm aggregation operator: a gossip-based protocol," International Journal of Intelligent Systems, vol. 27, no. 4, pp. 375-395, 2012. doi:10.1002/int.21528. Scopus: 2-s2.0-84857636905.

11. Vahdat and Becker, "Epidemic routing for partially connected Ad hoc networks," Tech. Rep. CS-200006, Duke University, 2000.

12. Sanghavi, Hajek, and Massoulie, "Gossiping with multiple messages," IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4640-4654, 2007. doi:10.1109/tit.2007.909171. MR2446928. Scopus: 2-s2.0-51549090753.

13. Chierichetti, Lattanzi, and Panconesi, "Rumor spreading in social networks," Theoretical Computer Science, vol. 412, no. 24, pp. 2602-2610, 2011. doi:10.1016/j.tcs.2010.11.001. MR2828337. Scopus: 2-s2.0-79954897828.

14. Sauerwald, "On mixing and edge expansion properties in randomized broadcasting," Algorithmica, vol. 56, no. 1, pp. 51-88, 2010. doi:10.1007/s00453-008-9245-4. MR2576534. Scopus: 2-s2.0-73349116553.

15. Giakkoupis and Sauerwald, "Rumor spreading and vertex expansion," in Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '12), pp. 1623-1641, 2012.

16. Dong, Zhang, and Wei, "Extracting influential information sources for gossiping," in Proceedings of the 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton '12), pp. 1438-1444, Monticello, Ill, USA, October 2012. doi:10.1109/allerton.2012.6483387. Scopus: 2-s2.0-84875752419.

17. Kempe, Kleinberg, and Tardos, "Maximizing the spread of influence through a social network," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 137-146, Washington, DC, USA, August 2003. doi:10.1145/956750.956769.

18. Leskovec, Krause, Guestrin, Faloutsos, Vanbriesen, and Glance, "Cost-effective outbreak detection in networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 420-429, August 2007. doi:10.1145/1281192.1281239. Scopus: 2-s2.0-36849083014.

19. Han, Hui, V. S. A. Kumar, M. V. Marathe, Pei, and Srinivasan, "Cellular traffic offloading through opportunistic communications: a case study," in Proceedings of the 5th ACM Workshop on Challenged Networks (CHANTS '10), pp. 31-38, ACM, September 2010. doi:10.1145/1859934.1859943. Scopus: 2-s2.0-78649313781.

20. R. M. Karp, Reducibility Among Combinatorial Problems, Springer, Berlin, Germany, 1972.

21. Wilfong and Winkler, "Ring routing and wavelength translation," in Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '98), pp. 333-341, 1998.

22. Golrezaei, Shanmugam, A. G. Dimakis, A. F. Molisch, and Caire, "Femtocaching: wireless video content delivery through distributed caching helpers," in Proceedings of the 31st IEEE International Conference on Computer Communications (INFOCOM '12), pp. 1107-1115, 2012.

23. Holme and Saramäki, "Temporal networks," Physics Reports, vol. 519, no. 3, pp. 97-125, 2012. doi:10.1016/j.physrep.2012.03.001. Scopus: 2-s2.0-84866939709.

24. Mitzenmacher and Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge University Press, Cambridge, UK, 2005.

25. G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions-I," Mathematical Programming, vol. 14, no. 1, pp. 265-294, 1978. doi:10.1007/bf01588971. MR0503866. Scopus: 2-s2.0-0000095809.

26. Drezner and H. W. Hamacher, Facility Location: Applications and Theory, Springer, 2004. MR1933965.

27. Krause, Singh, and Guestrin, "Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies," Journal of Machine Learning Research, vol. 9, no. 1, pp. 235-284, 2008. Scopus: 2-s2.0-41549146576.

28. Son, Kim, and Krishnamachari, "Base station operation and user association mechanisms for energy-delay tradeoffs in green cellular networks," IEEE Journal on Selected Areas in Communications, vol. 29, no. 8, pp. 1525-1536, 2011. doi:10.1109/JSAC.2011.110903. Scopus: 2-s2.0-80052046766.

29. Feige, "A threshold of ln n for approximating set cover," Journal of the ACM, vol. 45, no. 4, pp. 634-652, 1998. doi:10.1145/285055.285059. MR1675095. Scopus: 2-s2.0-0032108328.

30. Gupta and P. R. Kumar, "The capacity of wireless networks," IEEE Transactions on Information Theory, vol. 46, no. 2, pp. 388-404, 2000. doi:10.1109/18.825799. MR1748976. Scopus: 2-s2.0-33747142749.

31. D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, no. 6684, pp. 440-442, 1998. doi:10.1038/30918. Scopus: 2-s2.0-0032482432.

32. A.-L. Barabási and Albert, "Emergence of scaling in random networks," Science, vol. 286, no. 5439, pp. 509-512, 1999. doi:10.1126/science.286.5439.509. MR2091634. Scopus: 2-s2.0-0038483826.

33. M. E. Newman, "The structure of scientific collaboration networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 2, pp. 404-409, 2001. doi:10.1073/pnas.021544898. MR1812610. Scopus: 2-s2.0-0035895239.

34. Leskovec, Kleinberg, and Faloutsos, "Graphs over time: densification laws, shrinking diameters and possible explanations," in Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '05), pp. 177-187, Chicago, Ill, USA, August 2005. doi:10.1145/1081870.1081893. Scopus: 2-s2.0-32344436210.

35. Wasserman and Faust, Social Network Analysis: Methods and Applications, Cambridge University Press, Cambridge, UK, 1994.
