Diffusing information for mobile social networks under consideration of dynamic influence

Abstract

With the development of new technologies, mobile social networks have become widespread. To obtain and spread information over mobile social networks efficiently, the influence maximization problem asks for a seed node set of limited size that influences as many nodes as possible. Previous works ignore the dynamic influence phenomenon that arises when information diffuses over mobile social networks. In this article, we propose a new model that describes information diffusion in the presence of dynamic influence. Theoretical analysis shows that the influence maximization problem under the new model is NP-hard (non-deterministic polynomial-time hard), and an efficient approximation algorithm is proposed. Experimental studies on real data sets show that the new model handles dynamic influence well during information diffusion and that the proposed algorithms solve the influence maximization problem under the new model efficiently.

Keywords

Mobile social networks, information diffusion, dynamic influence

Introduction

Recently, with the development of communication and computing technologies, smart phones have grown rapidly in number and popularity worldwide. According to reports, by 2015 the number of mobile phone users in the world had reached 4.45 billion, and 42.9% of them (about 2 billion) were using smart phones. According to reports from China, the number of mobile smart phone users reached 5 billion in 2014. In addition, with the emergence of further devices such as tablets and smart watches, more and more smart mobile devices are coming into use. Owing to advances in embedded computing, sensing, and communication techniques, smart mobile devices have become increasingly capable and are no longer only tools for communication. In many applications, they have proved to be powerful tools for mobile sensing, computing, and so on. For example, the iPhone 6 integrates at least eight types of sensors, such as an ALS (ambient light sensor), a PS (proximity sensor), and GPS (Global Positioning System); besides the common telecommunication modules, it has a 2.6 GHz 64-bit processor as well as powerful WiFi and Bluetooth devices.

As smart mobile devices grow in number and capability, more and more online social applications are used in mobile environments, and traditional social networks have evolved into mobile social networks. Typical mobile social network applications include Facebook, Twitter, Google+, and Sina Weibo. According to reports, by 2014 the number of monthly active users (MAU for short) of Facebook was about 1.3 billion, and in 2015 the MAU of Sina Weibo reached 212 million in China alone. As more mobile applications come into use, more mobile social networks are naturally being created.

The emergence of mobile social networks changes the way information diffuses and provides opportunities for viral marketing, as shown in the work by Chen et al.¹ Unlike traditional marketing methods, viral marketing can exploit the “word-of-mouth” advantages of mobile social networks and diffuse advertising information more efficiently. It has attracted considerable research interest from both the mobile and social computing areas. The influence maximization problem is one of the most popular topics in the area of mobile social networks. It was formally investigated by Kempe et al.² and has received much attention from many researchers.^3–5 However, important challenges remain unsolved when the influence maximization problem is applied to more complex real-world scenarios. One of them, which is also the motivation of this article, is that the influence ability between nodes in the real world often changes dynamically, while most current research on the influence maximization problem assumes that the influence ability is static.

Like general networks, social networks consist at least of nodes and edges. Usually, each node represents one social actor (e.g. one person in the physical world or one user of Facebook), and the edges represent the social interactions between social actors (e.g. two persons are friends, or one user is following the other on Facebook). When the influence maximization problem is considered on mobile social networks, the edges usually carry a meaning related to influence (e.g. one user accepts the advice of another). The most popular approach in previous works is as follows. First, a model is built to describe the information diffusion procedure on mobile social networks; then, algorithms are designed to find a seed set with maximum influence. In classic models, whether user A is influenced by user B depends on the influence probability between them. In detail, if B has accepted some advice and the influence probability from B to A is 0.9, A will accept the same advice with probability 0.9 in the next step. The influence probability is a simple and clear way to describe the ability of influence between two nodes, and it is usually assumed to be constant once a particular social network is considered. However, in mobile social networks, the influence probability may change dynamically.

Let us consider a real-life example of the dynamic influence phenomenon during information diffusion, which has been observed by only a few previous works.^6–8 Assume that A is a user of Twitter, which can be treated as a directed graph. Before lunch, A is browsing the tweets of the accounts he follows. Using the GPS or WiFi devices of A’s smart phone, the Twitter app may know the location of A and push an advertisement about a nearby restaurant X. After seeing it, A may have his lunch at X and post tweets recommending X to his friends. Assume A has two friends B and C, where B is also near X but C is not. Then, B is more likely than C to view the advertisement and go for a taste, although A may have the same influence probability on both B and C for general information. Next, let us compare two similar scenarios. In the first case, A receives a tweet from his friend B, and B is the original author of that tweet. In the second case, A still receives the tweet from B, but B reposted it from C. Intuitively, the influence probability from B to A in the first case should be higher than in the second case. In both examples, the influence probability is not static for two given nodes and can change during information diffusion. To the best of our knowledge, only a few previous works consider such challenges, and none of them directly considers the influence maximization problem under dynamic influence.

The examples above show two important sources of dynamic influence: location and time delay. In a mobile social network, if two nodes have the same or similar locations, the influence probability between them tends to become larger. If a piece of information arrives after a long time delay or has been reposted several times by a sequence of users, the influence probability between two users for that information tends to become smaller. Here, the location need not be a geographic position; it can be any profile information of nodes that enhances the social interactions between them. For example, when the nodes represent academic authors, the affiliation or research area can serve as the “location.” Likewise, the time need not be real time; it can be any value associated with the information that affects the nodes’ interest in it. For example, the number of times a user has received the same information can be used to measure the “time.” Only a few works focus on dynamic influence caused by these factors. Zhou et al.⁶ considered only the “location” factor but not the “time” factor. Goyal et al.⁷ and Leskovec et al.⁸ observed dynamic influence caused by the “time” factor. Leskovec et al.⁸ summarized a series of interesting scenarios, including dynamic influence in social networks, through experiments on real data sets. Goyal et al.⁷ studied how to learn such dynamic influence efficiently. Neither of them directly focuses on the influence maximization problem under dynamic influence. Moreover, when dynamic influence is considered, the algorithms designed for the classical influence maximization problem cannot be applied. This article studies the influence maximization problem under dynamic influence caused by the location and time factors.

In this article, we address the problem of diffusing information under dynamic influence in mobile social networks. Following previous works, we use the influence maximization problem to describe the procedure of diffusing information over mobile social networks. To handle dynamic influence, we modify the classic models so that they can describe changes in influence caused by the location and time factors. To solve the influence maximization problem efficiently, we study its computational complexity and design efficient algorithms. The main contributions can be summarized as follows:

We identify dynamic influence as a new challenge for information diffusion on mobile social networks. To address it, we propose a new information diffusion (ID) model that supports dynamic influence and formulate the new influence maximization problem based on this model.

We show the hardness of the influence maximization problem under dynamic influence by proving that the classic influence maximization problem is a special case of the new problem.

We design efficient approximation algorithms for the new influence maximization problem. By showing that the objective of the new problem is monotone and submodular, a (1 − 1/e)-approximation algorithm is obtained.

More efficient approximation algorithms for the new influence maximization problem are designed. Also, several optimization strategies are discussed.

The experimental results on real data sets show that the proposed method can solve the information diffusion problem with dynamic influence on mobile social networks.

The rest of the article is organized as follows. Section “Related work” discusses related work. Section “Notations and definitions” introduces preliminaries and new definitions. Section “Approximation algorithm for IMD problem” presents the theoretical analysis and approximation algorithms for the influence maximization problem. An improved algorithm is given in section “Improving Algorithm GreedyIMD.” Experimental results are shown in section “Experimental evaluation.” The final section concludes the article.

Related work

Influence maximization is an important problem in the research area of online social networking, with many applications such as viral marketing and computational advertising. It was first studied by Domingos and Richardson,^9,10 and the formal definitions and a comprehensive theoretical analysis were given by Kempe et al.² The standard formal definition of influence maximization is as follows: given a graph that represents the “influence” relationships between nodes and the constraint that at most k nodes can be selected, the problem is to compute a set of k nodes such that the number of nodes influenced by them is maximized. Different models have been formally defined to simulate information propagation processes with different characteristics, and the two most popular are the independent cascade (IC for short) and linear threshold (LT for short) models. In the work by Kempe et al.,² the influence maximization problems under both the IC and LT models are shown to be NP-hard (non-deterministic polynomial-time hard), and the problem of computing the exact influence of a given node set is shown to be #P-hard in the work by Chen et al.¹

Many research efforts have been devoted to the problem of finding the node set with maximum influence. Kempe et al.² proposed a greedy algorithm for influence maximization with a constant approximation ratio of (1 − 1/e). The time complexity of this greedy approximation algorithm is O(n²(m + n)), which is too high for large-scale social networks. To overcome the shortcomings of greedy-based algorithms, many researchers have worked on more efficient algorithms for influence maximization. By studying the submodularity of influence functions, Leskovec et al.¹¹ proposed the CELF (cost-effective lazy-forward) algorithm. CELF improves the performance of greedy-based algorithms by reducing the number of times the influence of a given seed set has to be evaluated; however, its performance on large-scale data is still not satisfactory. Using similar ideas, Goyal et al.¹² proposed CELF++ to further improve performance. In the work by Chen et al.,³ the degree-discount algorithm is proposed to improve the performance of greedy-based influence maximization algorithms. By assuming that all influence probabilities are the same in the IC model, Chen et al.³ reduce the complexity of the influence maximization problem and give better algorithms based on the resulting model. Utilizing the structural properties of communities in social networks, Chen et al.¹³ proposed new algorithms that merge similar nodes and reduce the cost of computing the influence set. Goyal et al.¹⁴ proposed the SIMPATH algorithm, which improves the performance of greedy-based influence maximization in the LT model. Jiang et al.¹⁵ proposed simulated annealing-based influence maximization algorithms.

Kimura and Saito¹⁶ proposed new models of information propagation based on the idea of finding shortest paths, which assume that information is mainly transferred through shortest paths, and designed new heuristic algorithms for influence maximization. Building on this idea, Chen et al.¹ proposed heuristic algorithms based on maximum broadcast paths, which assume that the information propagated on the network is transferred not along shortest paths but along maximum broadcast paths. Based on the influence probabilities between users, an influence tree is built for each node by computing the maximum broadcast paths, and this tree can be used to estimate the influence range of each user. By assigning a threshold to each user, the influence tree can be pruned to ignore nodes that contribute little to the influence set, which reduces the number of nodes involved in the influence computation. Chen et al.¹ also proved the submodularity of influence functions defined on maximum broadcast paths and designed approximation algorithms with a 1 − 1/e approximation ratio. In the work by Han et al.,¹⁷ timeliness networks with opportunistic selection are investigated and the influence maximization model is extended to those applications. In the work by Shi et al.,¹⁸ a maximal time bound is considered to limit the ability of diffusing information in social networks, and efficient algorithms are proposed for computing the maximal time-bounded positive influence set. In the work by Chen et al.,¹³ similarities between nodes of communities in social networks are utilized to reduce the number of nodes involved in the influence computation. Kim et al.¹⁹ proposed efficient influence maximization algorithms in parallel computing settings. Cai et al.²⁰ extend the influence maximization models to crowd-sourced data-based social networks. Han et al.²¹ consider the communities in social networks and study the influence maximization problem over such networks.

Many works also try to extend the classic influence maximization methods to other application settings. Li et al.²² study the problem of influence maximization in location-based social networks. In those networks, one node can be influenced by another if and only if they are neighbors according to their location information, and Li et al.²² focus on the problem of finding k users who can influence the maximum number of users in the location-based social network. Tang et al.²³ identify the relation types involved in propagating information and formally define the problem of influence maximization by considering different types of relationships between nodes. A key idea is that, given the information to be propagated, the influence set of a node set can be computed more efficiently by removing the edges belonging to certain types. Chen et al.²⁴ study the problem of influence maximization in topic-aware applications. Cai et al.²⁵ use the idea of information diffusion to prevent sensitive information in social networks. More related work on applications in social networks can be found in the works of Han et al.²⁶ and Bi et al.²⁷ Although extensions of classic influence maximization methods have been studied by many works as shown above, we are not aware of any efforts on influence maximization in the dynamic setting.

Influence models are usually defined in terms of several parameters that describe key properties of real applications, so the parameter selection problem is essential to influence maximization methods. Tang et al.⁵ propose topic factor graph (TFG) models to determine different parameters between users and topics. Liu et al.⁴ determine different influence parameters among users using probabilistic models to analyze the relationships of distributions between topics of users and influence relations of users. Weng et al.²⁸ utilize latent Dirichlet allocation (LDA) models to describe the topic distributions of users and propose the TwitterRank method to determine the influence probabilities between users and topics.

Notations and definitions

In this section, classical information diffusion models are introduced first; then, to capture the two challenging aspects of diffusing information on mobile social networks, a new model is proposed by modifying the classical one. Finally, based on the new ID model, we give the new definition of the influence maximization problem. The influence maximization problem depends on the definition of the diffusion model: for all diffusion models, the corresponding influence maximization problems are based on the same idea, but they differ in computational hardness.

Traditional ID models

In this article, information diffusion is described as the propagation of information over a network. A network is usually denoted by a graph G(V, E). Here, V is the node set, where each node represents one person or entity, and E is the edge set, where each edge represents the relation (cooperation, friendship, enmity, and so on) between two nodes. Each node is in either the active or the inactive state. Intuitively, the active state means that the node has been affected. Active nodes may affect inactive nodes, and the influence ratio describes the strength of that effect. If an inactive node is affected by active nodes so much that it becomes active, the process is called activation. Intuitively, for a node v, the more neighbors of v are activated, the more likely v is to be activated. After that, v will in turn affect more nodes. As such procedures repeat, more and more nodes become active. Activation cannot be reversed: a node can change from the inactive state to the active state, but not vice versa. To design a proper theoretical model of information diffusion in the real world, the key is to explain how the interactions between nodes work. Next, we introduce two popular ID models.

LT model

Given a network G(V, E), let N(v) be the set {u | (u, v) ∈ E}. For each (u, v) ∈ E, a weight b_uv represents the degree of influence from u to v. For each node v, it holds that $0 \leq \sum_{u \in N (v)} b_{uv} \leq 1$. During information diffusion, a threshold value θ_v associated with each node v controls the diffusion. In detail, at some time instant, let A(v) be the set of v’s neighbors that are already active. If $\sum_{u \in A (v)} b_{uv} \geq \theta_{v}$, v becomes active. In this model, when node u tries to activate its neighbor v and fails, the influence b_uv is remembered and accumulated in the following activation steps. In other words, the influence from u to v is not discarded even if the activation fails. As we will see below, influence is treated differently in other models. The whole procedure of information diffusion in the LT model is as follows. First, an initial node set S₀ is activated. Then, in the ith step, based on the active nodes in S_{i−1}, the influence on each node in V \ S_{i−1} is computed. According to the computed influence and the threshold θ_v of each node v, all nodes satisfying $\sum_{u \in A (v)} b_{uv} \geq \theta_{v}$ are put into S_i. These steps repeat until no more nodes can become active.
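The LT procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the article: the function name `simulate_lt`, the dictionary-based inputs, and the toy weights and thresholds are all assumptions made for the example.

```python
def simulate_lt(weights, thresholds, seeds):
    """Run the LT diffusion and return the final active node set.

    weights:    dict mapping edge (u, v) to b_uv (weights into v sum to <= 1)
    thresholds: dict mapping every node v to theta_v
    seeds:      initial active set S_0
    """
    incoming = {}
    for (u, v), b in weights.items():
        incoming.setdefault(v, []).append((u, b))

    active = set(seeds)
    while True:
        newly_active = set()
        for v in thresholds:
            if v in active:
                continue
            # Accumulated influence from all currently active in-neighbors.
            total = sum(b for u, b in incoming.get(v, []) if u in active)
            if total >= thresholds[v]:
                newly_active.add(v)
        if not newly_active:
            return active
        active |= newly_active


# Hypothetical toy network: c needs the combined weight of a and b.
weights = {("a", "b"): 0.6, ("a", "c"): 0.3, ("b", "c"): 0.3}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.5}
print(simulate_lt(weights, thresholds, {"a"}))
```

In the toy network, node c is not activated by a alone (0.3 < 0.5) but becomes active once b is also active, which illustrates how failed influence is accumulated rather than discarded.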

IC model

The IC model is a probabilistic model. Instead of b_uv in the LT model, it uses p_uv to describe the probability that u activates v in a single activation attempt. The whole procedure of information diffusion under the IC model is as follows. First, an initial node set S₀ is set to be active. Then, in the ith step, every newly active node tries to activate its neighbors. In detail, for each node u ∈ S_{i−1} and node v ∈ V \ S_{i−1}, if (u, v) ∈ E, u tries once to activate v with probability p_uv. If v indeed becomes active, it is added to S_i and is not considered further in the current step. This procedure repeats until no new nodes are added. Note that p_uv is determined only by u and v and is independent of other node pairs. In this model, each edge (u, v) is considered only once: if the attempt fails, the edge is never considered again. In the work by Kempe et al.,² an extended model in which p_uv decreases as time goes by is also considered.
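One run of the IC procedure can be sketched as follows. The function name `simulate_ic`, the edge-keyed dictionary, and the sample probabilities are assumptions for illustration only; the sketch simply lets each newly activated node make one attempt per outgoing edge.

```python
import random


def simulate_ic(P, seeds, rng=None):
    """Run one IC diffusion; P maps edge (u, v) to p_uv."""
    rng = rng or random.Random()
    out = {}
    for (u, v), p in P.items():
        out.setdefault(u, []).append((v, p))

    active = set(seeds)
    frontier = list(seeds)  # nodes activated in the previous step
    while frontier:
        newly_active = []
        for u in frontier:
            for v, p in out.get(u, []):
                # Each edge is tried once; already active nodes are skipped.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active


# Hypothetical probabilities on a three-node chain.
P = {("a", "b"): 0.9, ("b", "c"): 0.5}
print(simulate_ic(P, {"a"}, random.Random(1)))
```

Because only nodes activated in the previous step appear in the frontier, every edge is examined at most once, which matches the one-chance rule of the model.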

New model for diffusing information

None of the previous models for information diffusion considers dynamic influence; this part proposes a new model that integrates both the “location” and “time” factors, which are major sources of dynamic influence in mobile social networks.

ID model for dynamic influence

In the ID model, the mobile social network is represented by a graph G = (V, E). Here, V is the node set and E is the directed edge set, which represents the influence relationships between nodes in the network. Intuitively, an edge (u, v) ∈ E says that v can be influenced by u. That is, if u has been influenced, v may also be influenced through the edge (u, v). We use a function $P : E \to [0, 1]$ to represent the influence probability of each edge, which describes how much influence one node has on another. For an edge (u, v), we denote its value of P by p_uv. The parameters introduced so far are the same as in the IC model; to describe dynamic influence during information diffusion, more parameters are needed.

To integrate the dynamic influence caused by “location” factors, we use the function $L : V \times V \to [0, 1]$ defined over V×V to describe the location relationship between nodes. Intuitively, the more likely that two nodes have the same locations, the more likely that they influence each other and the higher the corresponding value of L is. In the following, we use l_uv to denote the L value of node pair (u, v).

To integrate the dynamic influence caused by “time” factors, we use the function $C : E \to [0, 1]$ to describe the change of influence caused by “time delay.” In the following, we will use c_uv to denote the C value on edge (u, v).

Therefore, in ID model, to describe the information diffusion on some mobile social network, we need one four-tuple 〈G, P, L, C〉 .

In the ID model, given the network 〈G = (V, E), P, L, C〉, a seed node set A, and a threshold 0 ≤ θ ≤ 1, the information diffusion process works in discrete time as follows. We use t₀, t₁, … to denote the discrete times. Initially, at time t₀, all nodes in A become active and are inserted into the set Z, and all other nodes are initialized to be inactive. At time t_i, every active node tries to activate its “new” neighbors, that is, the neighbors it meets for the first time at t_i. In detail, suppose node u is active and v is an inactive, newly met neighbor of u, that is, u and v did not meet before t_i. If l_uv ≥ θ, node u tries to activate v in two steps. First, a random value x₁ between 0 and 1 is generated. If x₁ ≤ l_uv, node v is activated with probability $c_{uv}^{i - 1}$; otherwise, v becomes active with probability $p_{uv} \cdot c_{uv}^{i - 1}$. If l_uv < θ, node v becomes active with probability $p_{uv} \cdot c_{uv}^{i - 1}$. Note that each edge (u, v) is used only once, that is, u has only one chance to activate v. The procedure iterates until no new nodes can be added to Z. Finally, Z is the influenced set of A under network G. Throughout the procedure, a node can change from inactive to active, but not vice versa. Moreover, a node may receive activation attempts from several neighbors, but at most one attempt from each of them.
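The diffusion process just described can be sketched as a single simulation run. This is an illustrative reading of the model, not the authors' implementation: the name `simulate_id`, the use of `frozenset` keys for L (which assumes L is symmetric), and the assumption that x₁ is drawn uniformly are choices made for the example. The sample values are those of the edges in Example 1.

```python
import random


def simulate_id(P, L, C, seeds, theta, rng=None):
    """Run one ID-model diffusion and return the influenced set Z.

    P, C: dicts mapping directed edge (u, v) to p_uv and c_uv
    L:    dict mapping frozenset({u, v}) to l_uv (assumed symmetric)
    """
    rng = rng or random.Random()
    out = {}
    for u, v in P:
        out.setdefault(u, []).append(v)

    active = set(seeds)
    frontier = list(seeds)  # nodes that became active at time t_{i-1}
    i = 1
    while frontier:
        newly_active = []
        for u in frontier:
            for v in out.get(u, []):
                if v in active:
                    continue  # edge is not processed: v is already active
                l_uv = L[frozenset((u, v))]
                decay = C[(u, v)] ** (i - 1)
                # Two-step attempt only when the location value reaches theta.
                if l_uv >= theta and rng.random() <= l_uv:
                    prob = decay
                else:
                    prob = P[(u, v)] * decay
                if rng.random() < prob:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
        i += 1
    return active


P = {("A", "C"): 0.6, ("A", "B"): 0.9, ("B", "C"): 0.6, ("C", "E"): 0.7,
     ("C", "D"): 0.9, ("D", "C"): 0.2, ("D", "B"): 0.3}
C = {("A", "C"): 0.9, ("A", "B"): 0.95, ("B", "C"): 0.8, ("C", "E"): 0.7,
     ("C", "D"): 0.8, ("D", "C"): 0.95, ("D", "B"): 0.5}
L = {frozenset(pair): value for pair, value in [
    (("A", "B"), 0.3), (("A", "C"), 0.7), (("B", "C"), 0.3),
    (("B", "D"), 0.3), (("C", "D"), 0.3), (("C", "E"), 0.9)]}
print(simulate_id(P, L, C, {"A"}, theta=0.6, rng=random.Random(7)))
```

Tracking only the nodes activated in the previous step is one way to enforce that each edge is used once and that the exponent i − 1 grows with the number of hops from the seed set.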

The idea of integrating dynamic influence into the information diffusion procedure can be explained as follows. The function L measures the location similarity between two nodes. If they are similar, v becomes active with probability higher than the original p_uv. This describes the case in which a node tends to be influenced by nodes “nearby.” The function C measures the decrease in influence caused by time delay. When the “time” factor is considered, node v is activated with probability lower than p_uv. In real applications, using this model requires choosing proper values for its parameters, which can be done with sophisticated learning methods. The problem of choosing parameter values is beyond the scope of this article. Furthermore, we have the following observations.

Observation 1. Without function C, if l_uv ≥ θ, node v will be activated by u with probability l_uv + (1 − l_uv)p_uv, which is at least p_uv.

Observation 2. Without function L, node v will be activated by u with probability $p_{uv} \cdot c_{uv}^{i - 1}$, which is at most p_uv.
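Taken together, the two observations suggest a closed form for a single activation attempt. The following short derivation assumes that the random value x₁ is drawn uniformly from [0, 1] and independently of the activation draw; the notation Pr(u → v | t_i) is shorthand introduced here. For l_uv ≥ θ, the probability that u activates v at time t_i is

\Pr (u \to v \mid t_{i}) = l_{uv} \cdot c_{uv}^{i - 1} + (1 - l_{uv}) \cdot p_{uv} \cdot c_{uv}^{i - 1} = c_{uv}^{i - 1} \left( l_{uv} + (1 - l_{uv}) p_{uv} \right)

Setting c_uv = 1 gives the expression in Observation 1, and for l_uv < θ the attempt reduces to the expression in Observation 2. For instance, with l_uv = 0.9, p_uv = 0.7, c_uv = 0.7, and i = 2, the probability is 0.7 · (0.9 + 0.1 · 0.7) = 0.679.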

Let us consider a concrete example of the ID model shown in Figure 1.

Figure 1.

ID model for mobile social network and its possible worlds: (a) original graph G, (b) possible world G₁, (c) possible world G₂, and (d) possible world G₃.

Example 1. Assume that we have an ID model 〈G, P, L, C〉 for some social network. The network G is shown in Figure 1(a), which includes five nodes and seven edges. The definition of P can be represented as follows

\begin{matrix} P (A, C) = 0.6; P (A, B) = 0.9; \\ P (B, C) = 0.6; P (C, E) = 0.7; \\ P (C, D) = 0.9; P (D, C) = 0.2; \\ P (D, B) = 0.3 \end{matrix}

The definition of function L is as follows

\begin{matrix} L (A, B) = 0.3; L (A, C) = 0.7; L (A, D) = 0.3; \\ L (A, E) = 0.3; L (B, C) = 0.3; L (B, D) = 0.3; \\ L (B, E) = 0.3; L (C, D) = 0.3; L (C, E) = 0.9; \\ L (D, E) = 0.3 \end{matrix}

The definition of function C is as follows

\begin{matrix} C (A, C) = 0.9; C (A, B) = 0.95; C (B, C) = 0.8; \\ C (C, E) = 0.7; C (C, D) = 0.8; C (D, C) = 0.95; \\ C (D, B) = 0.5 \end{matrix}

Given {A} as the seed node set and the threshold θ = 0.6, an example of the information diffusion procedure is shown in Figure 2. In the first step t₀, node A is initialized to be active and no edges are processed. In the second step t₁, the two edges leaving A are processed. For (A, C), since l_AC = L(A, C) = 0.7 > θ = 0.6, the attempt to activate C takes two steps. Assume that the random value generated is 0.85. Because 0.85 > l_AC = 0.7, C is activated with probability $p_{AC} \cdot c_{AC}^{i - 1} = p_{AC} \cdot c_{AC}^{0} = p_{AC} = 0.6$. For the edge (A, B), because l_AB = 0.3 < θ, B is activated with probability $p_{AB} \cdot c_{AB}^{i - 1} = p_{AB} = 0.9$. Since the diffusion procedure is probabilistic, nodes with high probabilities may still remain inactive and nodes with low probabilities may become active. Let us assume that node C becomes active but node B does not, as shown in Figure 2. The following steps shown in Figure 2 can be summarized as follows:

t₂: {(C, D): $p_{CD} \cdot c_{CD} = 0.72$}, {(C, E): l_CE > θ, random value = 0.85 ≤ l_CE = 0.9, so E is activated with probability c_CE = 0.7}

t₃: {(D, C): not processed}, {(D, B): $p_{DB} \cdot c_{DB}^{2} = 0.075$}

t₄: {(B, C): not processed}

t₅: {no edges left}

Figure 2.

Information diffusion procedure on ID model.

In steps t₃ and t₄, the edges (D, C) and (B, C) are not processed because C has become active in t₁. In t₅, all edges have been processed and the procedure of diffusing information is finished. Finally, the nodes {A, B, C, D, E} are influenced by the seed set {A}.□

Influence maximization problem on ID model

Based on the observations about the ID model above, the result set Z can be partitioned into several disjoint subsets {Z₀, Z₁, …} according to the time at which the nodes in Z become active. Also, if we view the procedure of information diffusion on the ID model as a breadth-first traversal of the directed graph G, {Z_i} is obtained by partitioning Z according to the depth of each node.

The objective of the general influence maximization problem is to maximize the set of nodes influenced by the seed node set. In the ID model, information diffusion is a probabilistic process, so which nodes become active during the procedure is uncertain. Therefore, we need to interpret the procedure using possible world semantics.

Let Ω be the set of all different possible worlds of a given ID model. Each possible world X ∈ Ω is determined uniquely by an assignment to all probabilistic variables in the information diffusion. For a given G = (V, E), if we do not consider the seed node set, the number of possible worlds is 2^|E|. Moreover, for a given seed set A, let G_A be the subgraph of G induced by the node set A and E_A be the set of edges in G_A; then the number of all possible worlds is $2^{| E | - | E_{A} |}$. In the following, we introduce how to compute the probability of a diffusion process and define the influence maximization problem based on possible world semantics.

Example 2. As shown in Figure 1, three possible worlds of given graph G in the ID model are listed. Different possible worlds have different edge selections of G. For example, in G₁, the edges (A, C), (C, D), (D, C), and (D, B) are not selected by the information diffusion procedure.□

It should be noted that in the ID model, two different processes may reach the same possible world. Therefore, the probability of a single process and that of a possible world are different. Formally, given an information diffusion process M, let G_M be the possible world reached by M and S be the set of edges processed during M; then the probability Pr(G_M) can be computed by $\prod_{e \in G_{M} \cap S} \Pr (e) \cdot \prod_{e \in S \setminus G_{M}} (1 - \Pr (e))$. Since the edges considered by our information diffusion model are independent of each other, the main idea of Pr(G_M) is to compute the probability of the whole graph by combining the probabilities of all edges processed during information diffusion. In the following example, we explain intuitively how to compute these probabilities.

Example 3. Consider the information diffusion process M described in Example 1, let the final possible world be G_M. The probability Pr(M) can be computed by combining the probabilities of all possible choices. That is

\Pr (M) = (1 - 0.7) \cdot 0.6 \cdot (1 - 0.9) \cdot 0.72 \cdot 0.9 \cdot 0.7 \cdot 0.075

Let Pr(s, t) represent the probability that the edge (s, t) exists in the final possible world. Then, we have

\begin{matrix} \Pr (A, C) = (1 - 0.7) \cdot 0.6 + 0.7 \cdot 1.0 = 0.88 \\ \Pr (A, B) = 0.9 \\ \Pr (C, D) = 0.72 \\ \Pr (C, E) = (1 - 0.9) \cdot 0.49 + 0.9 \cdot 0.7 = 0.679 \\ \Pr (D, B) = 0.075 \end{matrix}

Therefore, by combining those probabilities, we have

\Pr (G_{M}) = \Pr (A, C) \cdot (1 - \Pr (A, B)) \cdot \Pr (C, D) \cdot \Pr (C, E) \cdot \Pr (D, B)

Note that not all edges in G need to be considered, because some edges are never visited during the information diffusion procedure owing to the topology of G.□
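The arithmetic of Example 3 can be checked with a few lines of Python. This is only a sketch that re-evaluates the factors exactly as they are printed in the example; the variable names are illustrative and not part of the original article.

```python
from math import prod

# Factors of Pr(M), copied in order from Example 3.
process_factors = [1 - 0.7, 0.6, 1 - 0.9, 0.72, 0.9, 0.7, 0.075]
pr_m = prod(process_factors)

# Probability that each edge exists in the final possible world.
pr_edge = {
    ("A", "C"): (1 - 0.7) * 0.6 + 0.7 * 1.0,
    ("A", "B"): 0.9,
    ("C", "D"): 0.72,
    ("C", "E"): (1 - 0.9) * 0.49 + 0.9 * 0.7,
    ("D", "B"): 0.075,
}

# Pr(G_M) as combined in Example 3: edge (A, B) enters with factor 1 - Pr(A, B).
pr_gm = (
    pr_edge[("A", "C")]
    * (1 - pr_edge[("A", "B")])
    * pr_edge[("C", "D")]
    * pr_edge[("C", "E")]
    * pr_edge[("D", "B")]
)

print(f"Pr(M)   = {pr_m:.8f}")
print(f"Pr(G_M) = {pr_gm:.8f}")
```

The two values differ, which illustrates the remark that the probability of a single process and the probability of the possible world it reaches are not the same quantity.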

Usually, the function σ(·) is used to represent the influence range of a given seed node set. That is, given a seed set A, σ(A) is the set of nodes that become active after diffusing the information from A. For each single process of the information diffusion procedure above, we have σ(A) = Z. However, because the diffusion process is probabilistic, we need a definition based on possible world semantics.

Definition 1. Influence function. Given an ID model 〈G, P, C, L〉, a seed node set A, and a threshold θ, let {G₁,…,G_m} be the set of possible worlds. The influence function δ measures the expected influence of A on G. For a given A, δ(G,A,θ) is defined as $\sum_{i = 1}^{m} \Pr (G_{i}) \cdot | V_{G_{i}} |$ , and it is also denoted by δ(A) for simplicity.

Note that for each possible world G_i, its node set is exactly the set of nodes that can be influenced by A.
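As a small illustration of Definition 1, the following sketch evaluates the sum of Pr(G_i) · |V_{G_i}| for an explicitly listed distribution over possible worlds. The three worlds and their probabilities are hypothetical numbers chosen for illustration, and the function name expected_influence is an assumption rather than a name used in the article.

```python
from math import isclose

def expected_influence(worlds):
    """Return sum_i Pr(G_i) * |V_{G_i}| for (probability, node_set) pairs."""
    total_prob = sum(prob for prob, _ in worlds)
    if not isclose(total_prob, 1.0, abs_tol=1e-9):
        raise ValueError("possible-world probabilities must sum to 1")
    return sum(prob * len(nodes) for prob, nodes in worlds)

# Hypothetical possible worlds for the seed set {A}; each node set is the
# set of nodes influenced by the seed set in that world.
worlds = [
    (0.5, {"A"}),
    (0.3, {"A", "B"}),
    (0.2, {"A", "B", "C"}),
]
delta = expected_influence(worlds)
print(delta)
```

In practice the number of possible worlds is exponential in the number of edges, so such an explicit enumeration is only feasible for very small instances.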

Definition 2. Influence maximization on ID model. Given an ID model 〈G, P, C, L〉, a threshold θ, and an integer k > 0, the problem is to find a subset A ⊆ V with |A| = k such that δ(A) is maximized.

In the following parts, we will use IMD (influence maximization under dynamic influence) to represent the influence maximization problem on ID model.

Approximation algorithm for IMD problem

In this section, we first study the computational complexity of the IMD problem and show that it is intractable, so it should be solved by approximation or randomized methods. To design approximation algorithms for the IMD problem, the ID model proposed above is then simplified to reduce the number of possible worlds while still integrating dynamic influence. Finally, an approximation algorithm is proposed, and formal analysis shows that its approximation ratio can be bounded.

Hardness of solving IMD problem

Since the influence maximization problem on classic models is usually NP-hard, so that approximation and heuristic algorithms are often needed, we first consider the computational complexity of the IMD problem.

Theorem 1. IMD problem is NP-hard.

Proof. The theorem follows from the observation that the classical influence maximization problem on the IC model studied by Kempe et al.² is a special case of the IMD problem. The details are as follows.

We give a direct reduction from the classical influence maximization problem on the IC model.² Given an instance I of the classical problem, setting c_uv = 1 and l_uv = 0 for every edge (u, v) and θ = 0 yields the corresponding instance I′ of the IMD problem. With c_uv = 1, the success probability $p_{uv} \cdot c_{uv}^{i - 1}$ no longer depends on the iteration i, so diffusion on I′ behaves exactly as on I, and it can be easily verified that there is a bijection between the solutions of I and I′. Therefore, the IMD problem is NP-hard.

Simplifying the ID model

As Example 3 shows, computing the probability of a possible world is intricate, which makes it hard to handle when solving the IMD problem. In this part, we propose a method to simplify the ID model by integrating the L function into the edge probabilities.

Given an instance I = 〈G, P, L, C〉 of ID model, we can build another instance I′ = 〈G, P′, L′, C〉 as follows:

Let the threshold be θ.

For each edge (u, v) ∈ E_G, if l_uv ≥ θ, let $p_{uv}^{'} = l_{uv} + (1 - l_{uv}) \cdot p_{uv}$ .

For each edge (u, v) ∈ E_G, if l_uv < θ, let $p_{uv}^{'} = p_{uv}$ .

For each edge (u, v) ∈ E_G, let $l_{uv}^{'} = 0$ .

After this transformation, the L function of the new instance is zero everywhere; in fact, we can delete the L function from I′.
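The transformation can be sketched in Python as follows. The sketch assumes that the edge probabilities and L values are stored in dictionaries keyed by edge, and that the rule p′_uv = p_uv applies when l_uv < θ, which matches the "otherwise" case used in the proof of Lemma 1. The function name simplify_instance and the sample values are hypothetical.

```python
def simplify_instance(p, l, theta):
    """Fold the L function into the edge probabilities.

    p, l: dicts mapping an edge (u, v) to p_uv and l_uv.
    Returns (p_prime, l_prime), where l_prime is zero on every edge.
    """
    p_prime = {}
    for edge, p_uv in p.items():
        l_uv = l.get(edge, 0.0)
        if l_uv >= theta:
            # Edge passes the threshold: combine l_uv with p_uv.
            p_prime[edge] = l_uv + (1 - l_uv) * p_uv
        else:
            # Edge below the threshold: probability is unchanged.
            p_prime[edge] = p_uv
    l_prime = {edge: 0.0 for edge in p}
    return p_prime, l_prime

# Hypothetical two-edge instance with threshold 0.3.
p = {("a", "b"): 0.5, ("b", "c"): 0.4}
l = {("a", "b"): 0.6, ("b", "c"): 0.1}
p_prime, l_prime = simplify_instance(p, l, theta=0.3)
print(p_prime, l_prime)
```

The loop touches every edge once, which is consistent with the statement that the transformation takes polynomial time.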

Lemma 1. For each possible world G₁ which is produced by information diffusion process over I or I′, we have $P r_{I} (G_{1}) = P r_{I^{'}} (G_{1})$ .

Proof. Consider one possible world G₁; we show that the probabilities of G₁ under I and I′ are the same. Because G₁ is a deterministic graph, a breadth-first traversal determines the edges that will not be visited during the information diffusion. For the remaining edges, we consider the following cases. For a given edge (u, v) in I, if l_uv ≥ θ, $\Pr (u, v) = l_{uv} \cdot c_{uv}^{i - 1} + (1 - l_{uv}) \cdot p_{uv} \cdot c_{uv}^{i - 1}$ . Otherwise, $\Pr (u, v) = p_{uv} \cdot c_{uv}^{i - 1}$ . The computation of Pr(u, v) follows from the definition of the information diffusion procedure.

In I′, since the value of L′ is 0, for any edge (u, v) we have $\Pr (u, v) = p_{uv}^{'} \cdot c_{uv}^{i - 1}$ . For any edge (u, v) satisfying l_uv < θ in I, we have $\Pr (u, v) = p_{uv} \cdot c_{uv}^{i - 1}$ according to the definition of the transformation above. For any edge (u, v) satisfying l_uv ≥ θ in I, we have $\Pr (u, v) = p_{uv}^{'} \cdot c_{uv}^{i - 1} = l_{uv} \cdot c_{uv}^{i - 1} + (1 - l_{uv}) \cdot p_{uv} \cdot c_{uv}^{i - 1}$ .

Hence every edge in the possible world has the same probability of being chosen in I and in I′. Therefore, each possible world has the same probability in I and I′, and the distributions of possible worlds of I and I′ are the same.

As shown above, the ID instance after transformation has a zero-valued L function; therefore, we ignore the L function in the following discussions and use only 〈G, P, C〉 to represent an instance of the ID model. Moreover, given any instance I = 〈G, P, L, C〉 and threshold θ, the transformation can be finished in polynomial time.

Efficient approximation algorithm

According to Theorem 1, the IMD problem cannot be solved exactly in polynomial time unless P = NP; therefore, in the following parts, our aim is to find efficient approximation algorithms with a performance guarantee. As shown by Kempe et al.,² the monotone and submodular properties allow a greedy algorithm to achieve a (1 − 1/e − ε) approximation ratio. Here, a function δ(·): 2^V→R is called monotone iff δ(S₁) ≤ δ(S₂) for any S₁ ⊆ S₂, and it is called submodular iff δ(S₁⋃x) − δ(S₁) ≥ δ(S₂⋃x) − δ(S₂) for any S₁ ⊆ S₂ ⊆ V and x ∈ V\S₂. Therefore, in the following parts, we use these properties to design efficient approximation algorithms for the influence maximization problems in this article.
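The two definitions can be made concrete with a brute-force checker for small ground sets. The sketch below is not part of the article's algorithms; the coverage function and the squared-cardinality function are standard textbook examples introduced here only to show one function that satisfies both properties and one that is monotone but not submodular.

```python
from itertools import combinations

def subsets(ground):
    """Yield every subset of ground as a frozenset."""
    items = list(ground)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def is_monotone(f, ground):
    """Check f(S1) <= f(S2) for all S1 subset of S2."""
    return all(
        f(s1) <= f(s2)
        for s1 in subsets(ground)
        for s2 in subsets(ground)
        if s1 <= s2
    )

def is_submodular(f, ground, tol=1e-12):
    """Check diminishing returns for all S1 subset of S2 and x outside S2."""
    for s1 in subsets(ground):
        for s2 in subsets(ground):
            if not s1 <= s2:
                continue
            for x in ground - s2:
                gain_small = f(s1 | {x}) - f(s1)
                gain_large = f(s2 | {x}) - f(s2)
                if gain_small < gain_large - tol:
                    return False
    return True

# Hypothetical coverage function: each element covers a set of items.
covers = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}}
ground = frozenset(covers)

def coverage(s):
    return len(set().union(*(covers[i] for i in s)))

def squared_size(s):
    return len(s) ** 2

print(is_monotone(coverage, ground), is_submodular(coverage, ground))
print(is_monotone(squared_size, ground), is_submodular(squared_size, ground))
```

The checker enumerates all pairs of subsets, so it is exponential in the size of the ground set and is meant only for sanity checks on toy examples.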

We propose a greedy algorithm which, by the result of Fisher et al.,²⁹ achieves an approximation ratio of 1 − 1/e. The algorithm is shown in Figure 3. Algorithm GreedyIMD takes I = 〈G = (V, E), P, C〉 and an integer k > 0 as input. First, the set S for storing the selected seed nodes is initialized to be empty (lines 1–2). Then, considering the nodes one by one, in each round the algorithm chooses a node v with maximum Δ_v and inserts it into S (lines 3–11). Here Δ_v is the influence gain obtained by adding v to S, that is, δ_IMD(S⋃v) − δ_IMD(S). The value of δ_IMD(·) is computed by invoking the procedure GetInfluence (line 7).

Figure 3. Algorithm GreedyIMD.

In the GetInfluence procedure, the inputs are the seed node set S and the instance I of the ID model, and the goal is to return δ_IMD(S). As shown by Chen et al.,¹ computing δ(·) is already #P-hard under the classic IC model. Therefore, the GetInfluence procedure uses sampling to estimate the influence of a given seed node set. First, the variable storing the final influence result is initialized to zero (line 1). Then, the sampling is run n times (lines 1–17) (the value of n can be determined according to the results of Kempe et al.²), and the average of all sampled influences is returned (line 18). In each sampling iteration, all temporary variables are initialized first (lines 3–6). Then, the seed nodes are inserted into a queue Q (line 7), which is used to perform a breadth-first traversal of G. For each node, the probability of becoming active is calculated as $p_{uv} \cdot c_{uv}^{i - 1}$ (line 12).

Finally, because the main procedure of Algorithm GreedyIMD iterates over all nodes and the GetInfluence procedure only enumerates the edges of G, both can be finished in polynomial time. In GetInfluence, the time cost of lines 3–17 can be bounded by O(|E|²); therefore, the time cost of GetInfluence can be bounded by O(n · |E|²). Combining GetInfluence with GreedyIMD, the total time cost of GreedyIMD can be bounded by O(k · n · |V|·|E|²).
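Since the pseudocode of Figure 3 is not reproduced in the text, the following Python sketch shows one way the greedy selection and the sampling-based influence estimate described above could fit together. It is written under stated assumptions and is not the authors' implementation: the graph is a dictionary of out-neighbour lists, seeds are processed at iteration i = 1, each active node gets one attempt per out-edge with success probability p_uv · c_uv^(i−1), and all function and parameter names are hypothetical.

```python
import random
from collections import deque

def get_influence(graph, p, c, seeds, n_samples, rng):
    """Monte Carlo estimate of the expected number of activated nodes.

    graph: dict node -> list of out-neighbours
    p, c:  dicts mapping an edge (u, v) to p_uv and c_uv
    """
    if not seeds:
        return 0.0
    total = 0
    for _ in range(n_samples):
        active = set(seeds)
        # Each queue entry is (node, iteration i at which its out-edges are tried).
        queue = deque((s, 1) for s in seeds)
        while queue:
            u, i = queue.popleft()
            for v in graph.get(u, ()):
                if v in active:
                    continue
                if rng.random() < p[(u, v)] * c[(u, v)] ** (i - 1):
                    active.add(v)
                    queue.append((v, i + 1))
        total += len(active)
    return total / n_samples

def greedy_imd(graph, p, c, k, n_samples=200, seed=0):
    """Pick k seeds greedily by estimated marginal influence gain."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    chosen = set()
    for _ in range(min(k, len(nodes))):
        base = get_influence(graph, p, c, chosen, n_samples, rng)
        best_node, best_gain = None, float("-inf")
        # Sorting gives a deterministic tie-break; it assumes comparable node labels.
        for v in sorted(nodes - chosen):
            gain = get_influence(graph, p, c, chosen | {v}, n_samples, rng) - base
            if gain > best_gain:
                best_node, best_gain = v, gain
        chosen.add(best_node)
    return chosen

# Hypothetical star graph in which every edge succeeds with certainty.
star = {"h": ["a", "b", "c"], "a": [], "b": [], "c": []}
ones = {("h", leaf): 1.0 for leaf in "abc"}
print(greedy_imd(star, ones, ones, k=1, n_samples=5))
```

Because the estimates are random, two nodes with nearly equal true gains may be ranked differently from run to run; increasing n_samples reduces this effect at a proportional cost in running time.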

Analysis of Algorithm GreedyIMD

In this part, we show that Algorithm GreedyIMD has a performance guarantee on the approximation ratio. The main idea is to show that the influence function δ_IMD is monotone and submodular.

First, we introduce another view of information diffusion on the ID model. According to the results of Kempe et al.,² an equivalent view of the information diffusion process on the IC model is as follows: each edge (u, v) of G is declared live independently with probability p_uv and blocked otherwise. Therefore, different assignments of live and blocked states to the edges represent different outcomes of information diffusion on the IC model. Moreover, Kempe et al.² have shown that the two views induce the same distribution. We can use a similar view of the information diffusion process on the ID model. Each edge (u, v) of G is declared live with probability $p_{uv} \cdot c_{uv}^{i - 1}$ and blocked otherwise. It should be noted that these probabilities are no longer independent, because they depend on the depth i, which is affected by whether other edges have become live. This difference makes the formal analysis harder, and we show how to handle it in the following.

Second, we introduce a new representation for the influence function δ_IMD(·). According to the definition of δ_IMD, we have

\begin{aligned} δ_{IMD} (A) & = E (F (G, A)) \\ & = \sum_{v \in V} E (f (G, A, v)) \\ & = \sum_{v \in V} \sum_{x \in X_{G}} g (G, x, A, v) \cdot P r_{v} (x, G, A) \end{aligned}

(1)

where F(G, A) represents the size of the node set influenced by A on network G with fixed choices of P and C, and f(G, A, v) is 1 or 0 depending on whether node v is in the node set influenced by A on G.

In equation (1), given network G, we use X_G to represent the set of all different live-blocked assignments on E_G and x to represent a particular assignment of E_G. Given x ∈ X_G, we use G_x to represent the graph obtained from G by deleting blocked edges. It should be noted that, even for the same x, the probabilities of the assignment of E_G under different seed sets A are different. We use Pr_v(x, G, A) to represent those probabilities. The function g(G, x, A, v) is the value of f(G, A, v) on a particular x. The value of g is determined as follows. First, according to x, a subgraph G′ of G is obtained by taking only live edges into $E_{G^{'}}$ . Then, given seed set A, the value of g is 1 if v can be influenced by A on G′ and 0 otherwise. Finally, equation (1) follows directly from the definition of δ_IMD.

In the following, we show that δ_IMD is monotone and submodular.

Lemma 2. The function g in equation (1) is monotone.

Proof. We prove that g is monotone by analyzing the connectivity between node v and the seed set A. We need to show that, given S₁ ⊆ S₂ ⊆ V, g(G, x, S₁, v) ≤ g(G, x, S₂, v). Since the value of g can only be 1 or 0, it suffices to show that g(G, x, S₁, v) = 1 and g(G, x, S₂, v) = 0 cannot hold at the same time. Assume that g(G, x, S₁, v) = 1 and g(G, x, S₂, v) = 0. Let G′ be the subgraph of G obtained by choosing only the edges of E_G marked live by x. Because g(G, x, S₁, v) = 1, there must be some node u ∈ S₁ such that v is reachable from u in G′. Since S₁ ⊆ S₂, we have u ∈ S₂. Then, v is reachable from some node in S₂, that is, g(G, x, S₂, v) = 1, which is a contradiction. Therefore, g(G, x, S₁, v) ≤ g(G, x, S₂, v).

Naturally, one would hope that Pr_v(x, G, A) is also monotone. If so, the monotonicity of δ_IMD(·) would follow directly from equation (1). Unfortunately, we have the following lemma.
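For a fixed live-blocked assignment x, the function g is a reachability test on the live edges. The following sketch shows this under the assumption that edges are given as a list of directed pairs and x as a dictionary from edges to booleans; the sample graph is hypothetical.

```python
from collections import deque

def g(edges, x, seeds, v):
    """Return 1 if v is reachable from some seed using only live edges, else 0.

    edges: list of directed edges (u, w)
    x:     dict mapping each edge to True (live) or False (blocked)
    """
    live = {}
    for (u, w) in edges:
        if x[(u, w)]:
            live.setdefault(u, []).append(w)
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        if u == v:
            return 1
        for w in live.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return 0

# Hypothetical assignment: edge (b, c) is blocked, the others are live.
edges = [("a", "b"), ("b", "c"), ("d", "c")]
x = {("a", "b"): True, ("b", "c"): False, ("d", "c"): True}
print(g(edges, x, {"a"}, "c"), g(edges, x, {"a", "d"}, "c"))
```

Enlarging the seed set only adds starting points to the traversal, so the returned value can never drop from 1 to 0, which is the content of Lemma 2.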

Lemma 3. The function Pr_v(x, G, A) is not monotone.

Proof. It can be understood by the example in Figure 4. Assume that c_uv = 0.5 for every edge (u, v) ∈ G, S₁ = {v_a}, and S₂ = {v_a, v_b}. For S₁, the edges (v_a, v_b) and (v_a, v_c) will be visited in the first iteration, and the edge (v_b, v_c) will be visited in the second iteration. Therefore, Pr_v(x, G, S₁) = 1 · 0.9 · (1 − 0.1 · 0.5) = 0.855. For S₂, the edge (v_a, v_b) will not be processed and the other two edges will be visited in the first iteration. Therefore, Pr_v(x, G, S₂) = 0.9 · (1 − 0.1) = 0.81. Finally, we have

P r_{v} (x, G, S_{1}) > P r_{v} (x, G, S_{2})

(2)

Figure 4. An example of Pr_v(·).

For another example, let us modify the original graph G into H as shown in Figure 4. Similarly, we have Pr_v(x, H, S₁) = 0.9 · 0.9 = 0.81 and Pr_v(x, H, S₂) = 0.9. Therefore

P r_{v} (x, H, S_{1}) < P r_{v} (x, H, S_{2})

(3)

Finally, we can obtain the result that Pr_v(x, G, ·) is not monotone.
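The numbers in this proof can be re-evaluated directly. The sketch below only repeats the arithmetic stated in the proof; since Figure 4 is not reproduced here, reading the factor 0.1 as the base probability of the edge (v_b, v_c) is an interpretation of the stated expressions, not something taken from the figure.

```python
c = 0.5  # value of c_uv assumed in the proof

# Graph G with S1 = {v_a}: the edge (v_b, v_c) is visited in the second
# iteration, so its base probability 0.1 is multiplied by c once.
pr_g_s1 = 1 * 0.9 * (1 - 0.1 * c)

# Graph G with S2 = {v_a, v_b}: the two remaining edges are visited in
# the first iteration, so no attenuation applies.
pr_g_s2 = 0.9 * (1 - 0.1)

# Graph H, using the values stated in the proof.
pr_h_s1 = 0.9 * 0.9
pr_h_s2 = 0.9

print(pr_g_s1 > pr_g_s2, pr_h_s1 < pr_h_s2)
```

Adding v_b to the seed set lowers the probability on G but raises it on H, so Pr_v moves in both directions as the seed set grows.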

Since Pr_v(x, G, ·) is not monotone, it is hard to prove the theorem directly based on equation (1) and we need an alternative view of δ_IMD(·).

Theorem 2. The influence function δ_IMD(·) is monotone.

Proof. To prove that the influence function δ_IMD is monotone, let us consider the given network G = (V, E, P, C) and two fixed seed node sets S₁ and S₂ satisfying S₁ ⊆ S₂. We want to prove that δ_IMD(S₁) ≤ δ_IMD(S₂).

Observing that the combinations of v and x with g(G, x, A, v) = 0 contribute nothing to the final value of δ_IMD(A), we can rewrite δ_IMD(·) as follows

δ_{IMD} (A) = \sum_{v \in V} Q_{v} (G, A, \bar{p})

(4)

Here, we use $Q_{v} (G, A, \bar{p})$ to represent the probability that node v can be influenced by A on the network obtained from G by selecting live edges with the probabilities defined in the information diffusion procedure. The parameter $\bar{p}$ is a subset of P = {p_uv|(u,v) ∈ E}. In equation (1), the parameter $\bar{p}$ is implicitly contained in G. In equation (4), the aim of separating $\bar{p}$ from G is to show which variables in P are essential to the value of Q_v. Therefore, among the variables in P, we assume that the function Q_v only involves $\bar{p}$ . For Q_v, we have the following observations:

The set $\bar{p}$ only includes the variables { $p_{mn} | \exists w \in A, (m, n)$ appears on a path between w and v}. We prove this by showing that changing the value of any p_mn outside $\bar{p}$ does not affect the value of Q_v. Comparing equation (1) with (4), Q_v is the sum of several terms Pr_v, each of which is the probability of some assignment in X_G. Assume that edge (s, t) does not appear on any path between v and nodes in A. Given some x ∈ X_G such that Pr_v(x, G, A) is included in Q_v, first assume that (s, t) is labeled “blocked” in x, and let x′ be the assignment obtained from x by changing the state of (s,t) to “live”. The term Pr_v(x′, G, A) must also appear in Q_v, because adding the edge (s,t) to G_x does not disconnect any path between A and v, so v is still influenced by A in $G_{x^{'}}$ . Conversely, assume that (s,t) is labeled “live” in x, and let x′ be obtained from x by changing the state of (s,t) to “blocked”. Because (s,t) does not appear on any path between A and v, deleting (s,t) does not destroy the connectivity between v and A. That is, v is still influenced by A in $G_{x^{'}}$ and Pr_v(x′,G,A) appears in Q_v. Since both Pr_v(x,G,A) and Pr_v(x′,G,A) appear in Q_v and the only difference between x and x′ is the state of edge (s,t), the sum of Pr_v(x,G,A) and Pr_v(x′,G,A) eliminates the variable p_st. Thus, p_st does not appear in the expression of Q_v, and the value of Q_v does not change no matter what value p_st is assigned. Finally, Q_v can be written as an expression without any variables outside $\bar{p}$ .

Q_v is monotone with respect to each variable p_mn in $\bar{p}$ . For any assignment x ∈ X_G and its corresponding graph G_x, if the value of p_mn in $\bar{p}$ increases, the edge (m,n) is more likely to appear in G_x. If no path through (m,n) connects A and v, the change in p_mn does not affect the probability that v is influenced in G_x. Otherwise, if there is a path passing through (m,n) and connecting A and v, the probability that v is influenced in G_x becomes larger. Therefore, an increase in p_mn does not reduce the value of Q_v, and Q_v is indeed monotone with respect to the variables in $\bar{p}$ .

Q_v is monotone with respect to A. Given S ⊆ V, let y ∈ V\S and S′ = S⋃y. We show that $Q_{v} (G, S, \bar{p}) \leq Q_{v} (G, S^{'}, \bar{p})$ . We divide all edges involved in the information diffusion procedure into several parts according to the number of iteration steps needed before they are used. For example, in the information diffusion procedure shown in Figure 2, every edge can be labeled with the number of steps it requires; in Figure 2, node A is labeled 0, node B is labeled 3, and node C is labeled 2. According to the definition of δ_IMD, the success probability of each edge (u,v) is computed from this label. The smaller the label, the less the influence is reduced and the more likely the edge is used successfully. After inserting some node y into A, the labels of some edges decrease during the information diffusion procedure, and the probabilities corresponding to those edges increase. Since we have shown that Q_v is monotone with respect to the variables in $\bar{p}$ , the value of Q_v does not decrease when node y is inserted. Hence, Q_v is monotone with respect to A.

Based on the above observations and equation (4), δ_IMD(A) is monotone.

Theorem 3. The influence function δ_IMD(·) is submodular.

Proof. To prove that the influence function δ_IMD is submodular, let us consider a given network G = (V,E,P,C), two fixed seed node sets S₁ and S₂ satisfying S₁ ⊆ S₂, and one node u ∈ V\S₂. We need to show that δ_IMD(S₁⋃u) − δ_IMD(S₁) ≥ δ_IMD(S₂⋃u) − δ_IMD(S₂).

Following the idea of the proof of Theorem 2, we consider equation (4), base the proof on the function Q_v, and show the submodular property for each Q_v. For a fixed node v, we consider the relationship between u and v. For simplicity, we use Q_v(S) to represent $Q_{v} (G, S, \bar{p})$ :

The first case is that u cannot reach v, that is, there is no path from u to v in G. The addition of u does not add any path between the seed set and node v, so u does not directly increase the Q_v values defined over S₁ and S₂. The other way of changing the value of Q_v is that u affects the diffusion structures rooted at S₁ and S₂. Suppose there is a path p between some node w ∈ S₁ and v and the addition of u changes the iteration level of some edge (y,z) of p; then the value of Q_v changes because the influence probability of (y,z) changes. However, in that case, there would also be a path from u to v through y, which contradicts the assumption. Therefore, when u cannot reach v in G, we have Q_v(S₁⋃u) − Q_v(S₁) ≥ Q_v(S₂⋃u) − Q_v(S₂).

The second case is that there are paths between u and v. It should be noted that the structure constructed by the information diffusion procedure is in fact a tree, and the edges between trees are eliminated by the mechanism that every node can be activated at most once. Therefore, we can divide all paths into three types: paths starting from S₁, paths starting from S₂, and paths starting from u. We use P₁, P₂, and P_u to denote them, respectively. The value of Q_v changes because edges move from one set to another when u is added. Consider a particular edge e = (y,z). We ignore the trivial case in which e does not move between sets, since then the value of Q_v does not change. We have the following two observations. (1) If e moves from P₁ to P_u after inserting u into S₁, it also moves from P₂ to P_u because S₁ ⊆ S₂. (2) If e moves from P₁ to P_u or from P₂ to P_u, u can reach v through the edge e. Therefore, for node v, we have Q_v(S₁⋃u) − Q_v(S₁) ≥ Q_v(S₂⋃u) − Q_v(S₂).

By combining the above two results and equation (4), we have δ_IMD(S₁⋃u) − δ_IMD(S₁) ≥ δ_IMD(S₂⋃u) − δ_IMD(S₂).

Theorem 4. Algorithm GreedyIMD can solve the IMD problem with (1 − 1/e) approximation ratio.

Proof. According to the result in the work by Kempe et al.,² since we have Theorems 2 and 3, it is easy to verify that Algorithm GreedyIMD can solve the IMD problem with (1 − 1/e) approximation ratio.
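For reference, the guarantee invoked here is the classical bound for greedy maximization of a monotone submodular function with δ_IMD(∅) = 0 under a cardinality constraint. Writing S_g for the set returned by the greedy algorithm and S^* for an optimal seed set of size k (both symbols are introduced here for illustration only), the bound reads:

```latex
\delta_{IMD}(S_{g}) \;\geq\; \left( 1 - \left( 1 - \frac{1}{k} \right)^{k} \right) \delta_{IMD}(S^{*}) \;\geq\; \left( 1 - \frac{1}{e} \right) \delta_{IMD}(S^{*})
```

The bound assumes that δ_IMD is evaluated exactly; when it is only estimated by sampling, as in GetInfluence, the ratio degrades to the (1 − 1/e − ε) form mentioned earlier.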

Improving Algorithm GreedyIMD

In Algorithm GreedyIMD, many redundant operations are still performed while simulating the information diffusion procedure, and we can improve the algorithm by merging and eliminating those operations. The main idea can be explained by the following extreme case. Suppose that in G, the original graph of the ID model, there is a subset V′ ⊆ V such that V′ forms a connected component and the nodes in V′ have no edges to the nodes outside. If the seed set A satisfies A ⊆ V′, the algorithm for solving the influence maximization problem need not process the edges outside V′, so we can improve the performance of GreedyIMD by removing such edges. The optimization comes from the observation that sparse subgraphs are common in real applications. In those cases, if some node is specified as the start of the diffusion, some part of the graph will never be visited because of the sparse connections. Therefore, for a given node a, finding and eliminating the part that will never be visited when a is the starting node helps improve the efficiency of the algorithm.

Based on the idea above, we propose an improved procedure, GetInfluenceImproved, for estimating influence under the ID model. The improved algorithm is shown in Figure 5. First, given the input I and S, to simplify the computation and increase the chances of removing redundant edges, GetInfluenceImproved considers the nodes one by one and estimates their expected influence (line 2). For each node v, a graph G′ with the edges of G reversed is built (line 3). Then, by a traversal over G′ from v, we find which edges are related to v during the information diffusion procedure (lines 4–5). If the node v has no relation to the given seed set S, this is detected and v is ignored in this step. Also, a traversal from v need not reach all nodes in S; therefore, the unrelated seed nodes are filtered out in this step as well. By extracting the edges among the obtained node set, we get a graph G_D which is much smaller than G, and the simulation is performed only on G_D (line 6). For each graph G_D, the simulation process is run multiple times to get a precise estimate (lines 8–29). In each simulation, the edges E₁ in G_D are first extracted and added to a queue (lines 10–14). The subsequent operations are similar to GetInfluence. The average influence size obtained over multiple simulations is added to the final result (lines 29–30). Finally, the result is returned (line 31). The optimization is thus implemented by preprocessing the original graph and labeling nodes that are not useful. The time cost of GetInfluenceImproved can be bounded by O(|V| · n · |E|²). Compared with GetInfluence, the worst-case time cost is increased because there are cases in which no subgraph can be eliminated by the optimization steps.
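The pruning step can be illustrated by the following sketch, which is written under assumptions and does not reproduce Figure 5: it performs a breadth-first traversal on the reversed graph from a target node v, keeps the seeds that can reach v, and keeps the edges whose endpoints can both reach v. The function name relevant_subgraph and the sample graph are hypothetical.

```python
from collections import deque

def relevant_subgraph(edges, seeds, v):
    """Restrict the graph to the part that can carry influence to v.

    edges: list of directed edges (u, w)
    Returns (kept_edges, kept_seeds). An edge is kept when both of its
    endpoints can reach v; a seed is kept when it can reach v.
    """
    # Build the reversed adjacency lists: w -> nodes u with an edge (u, w).
    reverse = {}
    for (u, w) in edges:
        reverse.setdefault(w, []).append(u)
    # Traverse the reversed graph from v to find every node that can reach v.
    can_reach_v = {v}
    queue = deque([v])
    while queue:
        w = queue.popleft()
        for u in reverse.get(w, ()):
            if u not in can_reach_v:
                can_reach_v.add(u)
                queue.append(u)
    kept_seeds = set(seeds) & can_reach_v
    kept_edges = [
        (u, w) for (u, w) in edges if u in can_reach_v and w in can_reach_v
    ]
    return kept_edges, kept_seeds

# Hypothetical graph: seed s2 and its branch cannot reach the target v.
edges = [("s1", "a"), ("a", "v"), ("s2", "b"), ("b", "z"), ("v", "w")]
kept_edges, kept_seeds = relevant_subgraph(edges, {"s1", "s2"}, "v")
print(kept_edges, kept_seeds)
```

If kept_seeds is empty, the target node cannot be influenced by the seed set and the simulation for it can be skipped entirely, which corresponds to ignoring v in the description above.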

Figure 5. Algorithm for getting influence.

Experimental evaluation

Based on real data sets, we evaluate the performance of Algorithm GreedyIMD and GetInfluenceImproved and compare them with some current influence maximization algorithms. All code is implemented in C++, and all experiments are run on a personal computer with an Intel Quad CPU at 2.33 GHz and 8 GB of main memory. All running-time experiments are run five times and the average values are reported.

Experiment setup

We ran our experiments on four real data sets, whose summary information is shown in Table 1. The digital bibliography & library project (DBLP) data set is a large network of research collaboration maintained by Michael Ley. In the DBLP network, the nodes represent the authors of academic papers, and there is an edge between two nodes if and only if the two corresponding authors have collaborated. For DBLP, we use the coauthor relationships to compute the influence probability between two authors. Twitter is the network composed of Twitter users and the tweets posted by them; Twitter is the most popular micro-blogging system in the world. In this network, the nodes represent the users and the edges represent the “following” relations between them. For Twitter, we use the repost actions to compute the influence probability. Epinions is a network built from who-trusts-whom relations. In this network, the nodes represent the users and the edges between them represent the trust relation. We use the metadata about the trust relation to compute the influence probability. Wikivote is the data about administrator votes on Wiki. In this network, the nodes represent the wiki users and the edges represent the voting relation. We use the metadata about voting actions to compute the influence probability.

Table 1. Statistics of real data sets.

Data set	Node size	Edge size	Average degree
DBLP³⁰	980,562	7,324,579	15.92
Twitter³¹	112,044	468,238	8.36
Epinions³²	131,828	841,372	12.8
Wikivote³³	7,115	103,689	26.6

Experimental results and analysis

We compare the algorithms proposed in this article on qualities of seed sets and running time costs based on the four real data sets.

We use different parameters of L and C to run the experiments. Here, given a constant c, the value of function C on each edge is generated randomly following a Poisson distribution. Similarly, L values are generated randomly for each pair of nodes. In the following parts, we use MID-A to represent the GreedyIMD algorithm running with L and C values generated by (l = 0.2, c = 0.2). Also, we use MID-B, MID-C, and MID-D to represent the GreedyIMD algorithm with parameters (l = 0.2, c = 0.8), (l = 0.8, c = 0.2), and (l = 0.8, c = 0.8), respectively. When comparing running times, for clarity, we use IMD-A and IMD-D to represent the GreedyIMD algorithm with the GetInfluence procedure, and IMD-Ax and IMD-Dx to represent the GreedyIMD algorithm with the GetInfluenceImproved procedure.

Effects of seed sets

The effect of a given seed set can be evaluated by the number of influenced nodes. On the four data sets, we compare four algorithms with different parameters on seed node sets of different sizes. The results are shown in Figure 6. As the size of the seed node set increases, the number of influenced nodes increases almost linearly. This result is expected, since in a large enough network all nodes tend to behave uniformly during the information diffusion. Also, when the values of the L and C functions increase, the influenced node set gets larger. When L becomes larger, it essentially increases the probability of influence, so the influenced set gets larger. When C becomes larger, it slows down the decay caused by the time delay, and the influenced set also gets larger. This is because the two parameters control the influence abilities of the seed nodes, which will become clearer in the following experimental results.

Figure 6. Influence spread against seed set size on four data sets: (a) DBLP, (b) Epinions, (c) Wikivote, and (d) Twitter.

Effects of function L

The effect of function L is evaluated by the number of influenced nodes. On the four data sets, fixing the number of seed nodes to 20, we compare four algorithms with different L function values. The results are shown in Figure 7. As the value of L increases, the number of influenced nodes increases, but the growth of the influenced node set slows down. The effect of the L value is to raise the original influence probability to a certain level; therefore, when the L value becomes much larger, the growth of the influenced set size slows down. Also, for a fixed L value, when we scale C from 0.2 to 0.8, the influenced node set gets larger.

Figure 7. Influence spread against value of L on four data sets when seed set size is 20: (a) DBLP, (b) Epinions, (c) Wikivote, and (d) Twitter.

Effects of function C

The effect of function C is evaluated by the number of influenced nodes. On the four data sets, fixing the number of seed nodes to 20, we compare four algorithms with different C function values. The results are shown in Figure 8. The results are similar to those for the L function. It should be noted that when the value of C is relatively small, the size of the influenced node set grows more sharply than in the results for the L function. This follows from the working mechanisms of L and C: L enters the information diffusion procedure linearly, while C enters exponentially.

Figure 8. Influence spread against value of C on four data sets when seed set size is 20: (a) DBLP, (b) Epinions, (c) Wikivote, and (d) Twitter.

Running time

For two data sets, we ran the original greedy algorithm and the improved greedy algorithm proposed in this article for different seed set sizes. The running time results are shown in Figure 9. As the size of the seed set increases, the running time also increases; when the seed set becomes larger, the growth of the running time slows down. Also, the improved algorithms are on average about three times faster than the original algorithms. This is because of the reduction in computation cost achieved by the optimization strategies.

Figure 9. Running time against seed set size on two data sets: (a) DBLP and (b) Twitter.

Conclusion

In this article, based on observations of the information diffusion process on mobile social networks, the ID model for diffusing information under dynamic influence is proposed. By theoretical analysis, we determine the complexity of solving influence maximization on the new model and design efficient algorithms with an approximation performance guarantee. By experiments on real data sets, the performance of the ID model and the proposed algorithms is verified. One possible further question is how to design more efficient algorithms for dynamic influence in social networks. Another question comes from the methods of modeling dynamic influence in this article. Our methods cannot cover all possibilities of dynamic influence, and we need to investigate more typical representations of dynamic influence and study how to design algorithms for the related influence maximization problems.

Footnotes

Academic Editor: Zhipeng Cai

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Basic Research Program (973 Program) of China via grant 2014CB340503, the National Natural Science Foundation of China (NSFC) via grants 61472107 and 61632011, the National Social Science Foundation of China via grant 14CXW045, the Research Project for Humanities and Social Science of the Ministry of Education of China via grant 13YJC860013, and the Heilongjiang Social Science Foundation via grant 12D062.
