A Shared Interest Discovery Model for Coauthor Relationship in SNS

Abstract

A social network service (SNS) is a platform for building social networks or social relations among people. Many users access SNS from smart devices, most of which are equipped with sensors. The sensitive information produced by these sensors is uploaded to SNS, which may raise many potential risks. In order to share one's sensitive data with random people without security and privacy concerns, this paper proposes a shared interest discovery model for coauthor relationships in SNS, named the coauthor topic (coAT) model, which identifies users with similar interests in social networks; a collapsed Gibbs sampling method is used to infer the model parameters. In this way, one can reduce the possibility that recommended users are attackers rather than friends. Finally, extensive experimental results on the NIPS dataset indicate that the coAT model is feasible and efficient.

1. Introduction

A social network service (SNS) [1] is a platform for building social networks or social relations among people who, for example, share interests, activities, backgrounds, or real-life connections. An SNS consists of a representation of each user (often a profile), his/her social links, and a variety of additional services, such as a potential-friend recommendation service. Most social network services, such as Facebook, LinkedIn, and Twitter, are web based and provide means for users to interact over the Internet.

Recently, people have come to use SNS on smart devices such as phones and tablets, most of which are equipped with sensors such as a global positioning system (GPS) receiver and a camera. The information produced by these sensors is often privacy sensitive. Nevertheless, many users upload such sensitive information to SNS, which raises many potential risks [2]. For example, an attacker may pretend to act on behalf of a trusted SNS and post a comment containing a malicious URL. A user who follows the instructions may disclose sensitive information or compromise the security of his/her system.

This paper addresses the following problem: how to share one's sensitive data with random people without security and privacy concerns. Our main idea is to identify users with similar interests in social networks, so that recommended friends are indeed friends rather than attackers. This idea is motivated by the following observation. To avoid being recognized, an attacker often posts a comment related to an article that the user has posted. The attacker does not care about the article but wants to entice the user to click the malicious URL hidden in the comment. If one can analyze in advance whether the two users actually share interests, the SNS may filter such comments directly.

In this paper, an academic social network, which reflects the interests of researchers, is taken as the research object. For example, given the coauthors of Koch_C in Figure 1 and the coauthored papers in Table 1, the model described in this paper can discover the interests shared within this social network.

Table 1

Papers coauthored by Bair_W, Horiuchi_T, Luo_T, Douglas_R, and Manwani_A with Koch_C in NIPS dataset.

Bair_W	1	Real-time computer vision and robotics using analog VLSI circuits
	2	An analog VLSI chip for finding edges from zero-crossings
	3	Visual motion computation in analog VLSI using pulses
	4	Correlated neuronal response: time scales and mechanisms

Horiuchi_T	1	Real-time computer vision and robotics using analog VLSI circuits
	2	A delay-line based motion detection chip
	3	An analog VLSI saccadic eye movement system
	4	Analog VLSI circuits for attention-based, visual tracking

Luo_T	1	Computing motion using resistive networks
	2	Real-time computer vision and robotics using analog VLSI circuits
	3	Object-based analog VLSI vision circuits

Douglas_R	1	Network activity determines spatiotemporal integration in single cells
	2	Amplifying and linearizing apical synaptic inputs to cortical pyramidal cells
	3	Direction selectivity in primary visual cortex using massive intracortical connections

Manwani_A	1	Synaptic transmission: an information-theoretic perspective
	2	Multielectrode spike sorting by clustering transfer functions
	3	Memory capacity of linear versus nonlinear models of dendritic integration

Figure 1

Coauthor social network of Koch_C in NIPS dataset.

The rest of this paper is organized as follows. Section 2 briefly discusses the author topic (AT) model and then introduces in detail our proposed coauthor topic (coAT) model. Section 3 describes the collapsed Gibbs sampling method used for inferring the model parameters. In Section 4, extensive experimental evaluations are conducted, and Section 5 concludes this work.

2. Generative Models for Documents

Before presenting our coauthor topic (coAT) model, we first describe the author topic (AT) model. The notations are summarized in Table 2.

Table 2

Notations used in the Generative Models.

Symbol	Description
K	Number of topics
M	Number of documents
V	Number of unique words
A	Number of unique authors
$N_{m}$	Number of word tokens in document m
$A_{m}$	Number of authors in document m (usually $A_{m} > 1$ )
$a_{m}$	Authors in document m
$ϑ_{i, j}$	Multinomial distribution of topics specific to the coauthor relationship $(i, j)$ in coAT model
$ϑ_{a}$	Multinomial distribution of topics specific to the author a in AT model
$φ_{k}$	Multinomial distribution of words specific to the topic k
$z_{m, n}$	Topic assignment associated with the nth token in the document m
$w_{m, n}$	nth token in document m
$x_{m, n}$	One chosen author associated with the word token $w_{m, n}$
$y_{m, n}$	Another chosen author associated with the word token $w_{m, n}$
$α$	Dirichlet priors (hyperparameter) to the multinomial distribution $ϑ$
$β$	Dirichlet priors (hyperparameter) to the multinomial distribution $φ$

2.1. Author Topic (AT) Model

Rosen-Zvi et al. [3–5] propose an author topic (AT) model for extracting information about authors and topics from large text collections. Rosen-Zvi et al. model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multiauthor paper is a mixture of the distributions associated with the authors.

The graphical model representation of the AT model is shown in Figure 2. The AT model can be viewed as a generative process, which can be described as follows.

(1) For each topic $k \in [1, K]$ :

$(i)$ draw a multinomial $φ_{k}$ from $Dirichlet (β)$ ;

(2) for each author $a \in [1, A]$ :

$(i)$ draw a multinomial $ϑ_{a}$ from $Dirichlet (α)$ ;

(3) for each word $n \in [1, N_{m}]$ in document $m \in [1, M]$ :

$(i)$ draw an author assignment $x_{m, n}$ uniformly from the group of authors $a_{m}$ ;

$(ii)$ draw a topic assignment $z_{m, n}$ from $Multinomial (ϑ_{x_{m, n}})$ ;

$(iii)$ draw a word $w_{m, n}$ from $Multinomial (φ_{z_{m, n}})$ .

Figure 2

The graphical model representation of the AT model.

2.2. Coauthor Topic (coAT) Model

The graphical model representation of the coAT model is shown in Figure 3(a). The coAT model can be viewed as a generative process, which can be described as follows.

(1) For each topic $k \in [1, K]$ :

$(i)$ draw a multinomial $φ_{k}$ from $Dirichlet (β)$ ;

(2) for each author pair $(i, j)$ with $i \in [1, A - 1], j \in [i + 1, A]$ :

$(i)$ draw a multinomial $ϑ_{i, j}$ from $Dirichlet (α)$ ;

(3) for each word $n \in [1, N_{m}]$ in document $m \in [1, M]$ :

$(i)$ draw an author $x_{m, n}$ uniformly from the group of authors $a_{m}$ ;

$(ii)$ draw another author $y_{m, n}$ uniformly from the group of authors $a_{m} ∖ x_{m, n}$ ;

$(iii)$ if $x_{m, n} > y_{m, n}$ , swap $x_{m, n}$ with $y_{m, n}$ ;

$(iv)$ draw a topic assignment $z_{m, n}$ from $Multinomial (ϑ_{x_{m, n}, y_{m, n}})$ ;

$(v)$ draw a word $w_{m, n}$ from $Multinomial (φ_{z_{m, n}})$ .
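The generative process above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sample_dirichlet` and `generate_document`, the toy sizes, the integer author ids, and the lazy creation of the pair distributions are all choices made for the example, and every document is assumed to have at least two authors.

```python
import random

def sample_dirichlet(alpha, dim, rng):
    """Draw from a symmetric Dirichlet(alpha) of dimension dim via Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(authors, n_words, phi, theta, alpha, K, rng):
    """Generate (x, y, z, w) tuples for one document under the coAT process."""
    tokens = []
    for _ in range(n_words):
        x = rng.choice(authors)                          # step (3)(i)
        y = rng.choice([a for a in authors if a != x])   # step (3)(ii)
        if x > y:                                        # step (3)(iii)
            x, y = y, x
        if (x, y) not in theta:                          # step (2), drawn lazily
            theta[(x, y)] = sample_dirichlet(alpha, K, rng)
        z = rng.choices(range(K), weights=theta[(x, y)])[0]      # step (3)(iv)
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]   # step (3)(v)
        tokens.append((x, y, z, w))
    return tokens

rng = random.Random(0)
K, V, alpha, beta = 3, 8, 0.5, 0.1   # toy sizes, not the paper's settings
phi = [sample_dirichlet(beta, V, rng) for _ in range(K)]   # step (1)
theta = {}
doc = generate_document([4, 1, 7], 20, phi, theta, alpha, K, rng)
```

Drawing the pair distributions lazily only avoids allocating all $A (A - 1) / 2$ distributions up front; it does not change the distribution of the generated tokens.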

Figure 3

The graphical model representation of the coAT Model.

As shown in the above process, the posterior distribution of topics depends on the information from the text and authors. The parameterization of the coAT model is

\begin{array}{l} ϑ_{i, j} \mid α \sim Dirichlet (α), \\ φ_{k} \mid β \sim Dirichlet (β), \\ z_{m, n} \mid ϑ_{x_{m, n}, y_{m, n}} \sim Multinomial (ϑ_{x_{m, n}, y_{m, n}}), \\ w_{m, n} \mid φ_{z_{m, n}} \sim Multinomial (φ_{z_{m, n}}), \\ (x_{m, n}, y_{m, n}) \mid a_{m} \sim Multinomial (\frac{2}{A_{m} (A_{m} - 1)}) . \end{array}

(1)

In fact, if we do not draw one author $x_{m, n}$ from $a_{m}$ and then another one $y_{m, n}$ from $a_{m} ∖ x_{m, n}$ in Figure 3(a) but draw simultaneously an author pair $(x_{m, n}, y_{m, n})$ from $a_{m}$ in Figure 3(b), coAT model is equivalent to AT model in principle. One can verify this point by comparing Figure 3(b) with Figure 2.
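The pair probability in the last line of (1) follows from steps (3)(i) to (3)(iii) of the generative process. As a short check, added here for clarity, consider a fixed pair of authors $i < j$ in $a_{m}$; the ordered pair $(i, j)$ results whichever of the two authors is drawn first:

```latex
\begin{aligned}
P\big((x_{m,n}, y_{m,n}) = (i, j) \mid a_{m}\big)
&= \underbrace{\frac{1}{A_{m}} \cdot \frac{1}{A_{m} - 1}}_{\text{draw } i \text{, then } j}
 + \underbrace{\frac{1}{A_{m}} \cdot \frac{1}{A_{m} - 1}}_{\text{draw } j \text{, then } i} \\
&= \frac{2}{A_{m} (A_{m} - 1)} .
\end{aligned}
```

Since a document with $A_{m}$ authors has $A_{m} (A_{m} - 1) / 2$ unordered pairs, these probabilities sum to one. For example, with $A_{m} = 3$ each of the three pairs has probability $1/3$.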

3. Inference Algorithm

For inference, the task is to estimate the following sets of unknown parameters in the coAT model: (1) $Φ = \{φ_{k}\}_{k = 1}^{K}$ and $Θ = \{ϑ_{i, j}\}_{1 \le i < j \le A}$ ; (2) the corresponding topic and author pair assignments $z_{m, n}$ and $(x_{m, n}, y_{m, n})$ for each word token $w_{m, n}$ . Exact inference is intractable in this model. A variety of algorithms have been used to estimate the parameters of topic models, such as variational EM (expectation maximization) [6, 7], expectation propagation [8, 9], belief propagation [10], and Gibbs sampling [11–15]. In this work, the collapsed Gibbs sampling algorithm [11–15] is used, since it provides a simple method for obtaining parameter estimates under Dirichlet priors and allows combination of estimates from several local maxima of the posterior distribution.

In the Gibbs sampling procedure, we need to calculate the full conditional distribution $P (z_{m, n}, x_{m, n}, y_{m, n} | w, z_{\neg (m, n)}, x_{\neg (m, n)}, y_{\neg (m, n)}, a, α, β)$ , where $z_{\neg (m, n)}$ and $(x_{\neg (m, n)}, y_{\neg (m, n)})$ represent the topic and author pair assignments for all tokens except $w_{m, n}$ , respectively. We begin with the joint distribution $P (w, z, x, y | a, α, β)$ of a dataset, and using the chain rule, we can get the conditional probability conveniently as

\begin{array}{l} P (z_{m, n}, x_{m, n}, y_{m, n} | w, z_{\neg (m, n)}, x_{\neg (m, n)}, y_{\neg (m, n)}, a, α, β) \\ \propto \frac{n_{z_{m, n}}^{(w_{m, n})} + β_{w_{m, n}} - 1}{\sum_{v = 1}^{V} (n_{z_{m, n}}^{(v)} + β_{v}) - 1} \times \frac{n_{x_{m, n}, y_{m, n}}^{(z_{m, n})} + α_{z_{m, n}} - 1}{\sum_{k = 1}^{K} (n_{x_{m, n}, y_{m, n}}^{(k)} + α_{k}) - 1}, \end{array}

(2)

where $n_{k}^{(v)}$ is the number of times tokens of word v are assigned to topic k and $n_{i, j}^{(k)}$ represents the number of times author pair $(i, j)$ is assigned to topic k. A detailed derivation of the Gibbs sampler for coAT is provided in the Appendix.

If one further manipulates equation (2), one can turn it into separate update equations for the author pair and the topic of each token, suitable for random or systematic scan updates:

\begin{array}{l} P (x_{m, n}, y_{m, n} | x_{\neg (m, n)}, y_{\neg (m, n)}, z, a, α) \\ \propto \frac{n_{x_{m, n}, y_{m, n}}^{(z_{m, n})} + α_{z_{m, n}} - 1}{\sum_{k = 1}^{K} (n_{x_{m, n}, y_{m, n}}^{(k)} + α_{k}) - 1}, \end{array}

(3)

\begin{array}{l} P (z_{m, n} | w, z_{\neg (m, n)}, x, y, α, β) \\ \propto \frac{n_{z_{m, n}}^{(w_{m, n})} + β_{w_{m, n}} - 1}{\sum_{v = 1}^{V} (n_{z_{m, n}}^{(v)} + β_{v}) - 1} \times \frac{n_{x_{m, n}, y_{m, n}}^{(z_{m, n})} + α_{z_{m, n}} - 1}{\sum_{k = 1}^{K} (n_{x_{m, n}, y_{m, n}}^{(k)} + α_{k}) - 1} . \end{array}

(4)
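The two update equations (3) and (4) can be illustrated for a single token as follows. This is a hedged sketch rather than the authors' code: it assumes symmetric priors (scalar `alpha` and `beta`), counts stored in plain Python lists and a dictionary keyed by author pair, and hypothetical names such as `resample_token`. The token's current assignment is removed from the counts first, so the "− 1" terms of (3) and (4) no longer appear explicitly.

```python
import random
from itertools import combinations

def resample_token(w, authors, x, y, z, n_pair_topic, n_pair,
                   n_topic_word, n_topic, alpha, beta, rng):
    """Resample the author pair and topic of one token of word w."""
    K, V = len(n_topic), len(n_topic_word[0])
    # Remove the token's current assignment from all counts.
    n_pair_topic[(x, y)][z] -= 1
    n_pair[(x, y)] -= 1
    n_topic_word[z][w] -= 1
    n_topic[z] -= 1

    # Equation (3): new author pair given the current topic z.
    pairs = list(combinations(sorted(authors), 2))
    pair_weights = [(n_pair_topic[p][z] + alpha) / (n_pair[p] + K * alpha)
                    for p in pairs]
    x, y = rng.choices(pairs, weights=pair_weights)[0]

    # Equation (4): new topic given the new author pair.
    topic_weights = [(n_topic_word[k][w] + beta) / (n_topic[k] + V * beta)
                     * (n_pair_topic[(x, y)][k] + alpha) / (n_pair[(x, y)] + K * alpha)
                     for k in range(K)]
    z = rng.choices(range(K), weights=topic_weights)[0]

    # Add the token back under its new assignment.
    n_pair_topic[(x, y)][z] += 1
    n_pair[(x, y)] += 1
    n_topic_word[z][w] += 1
    n_topic[z] += 1
    return x, y, z

# Toy state: one token of word 3, assigned to pair (0, 1) and topic 0.
rng = random.Random(1)
K, V = 2, 4
authors = [0, 1, 2]
n_pair_topic = {p: [0] * K for p in combinations(authors, 2)}
n_pair = {p: 0 for p in combinations(authors, 2)}
n_topic_word = [[0] * V for _ in range(K)]
n_topic = [0] * K
n_pair_topic[(0, 1)][0] += 1
n_pair[(0, 1)] += 1
n_topic_word[0][3] += 1
n_topic[0] += 1
x, y, z = resample_token(3, authors, 0, 1, 0, n_pair_topic, n_pair,
                         n_topic_word, n_topic, 0.5, 0.1, rng)
```

The total number of counted tokens is unchanged by an update; only the cell that holds the token moves.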

During parameter estimation, the algorithm keeps track of two large data structures: an $(A (A - 1) / 2) \times K$ count matrix $n_{i, j}^{(k)}$ and a $K \times V$ count matrix $n_{k}^{(v)}$ . From these data structures, one can easily estimate Φ and Θ as follows:

\begin{matrix} φ_{k, v} = \frac{n_{k}^{(v)} + β_{v}}{\sum_{v = 1}^{V} (n_{k}^{(v)} + β_{v})}, \end{matrix}

(5)

\begin{matrix} ϑ_{i, j, k} = \frac{n_{i, j}^{(k)} + α_{k}}{\sum_{k = 1}^{K} (n_{i, j}^{(k)} + α_{k})} . \end{matrix}

(6)
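As a small worked instance of (5) and (6), the following sketch turns toy count matrices into estimates of Φ and Θ. It assumes symmetric priors (a single scalar for $α$ and for $β$); the function names and the toy counts are illustrative only.

```python
def estimate_phi(n_topic_word, beta):
    """Equation (5) with a symmetric prior: one word distribution per topic."""
    V = len(n_topic_word[0])
    return [[(c + beta) / (sum(row) + V * beta) for c in row]
            for row in n_topic_word]

def estimate_theta(n_pair_topic, alpha):
    """Equation (6) with a symmetric prior: one topic distribution per author pair."""
    theta = {}
    for pair, row in n_pair_topic.items():
        denom = sum(row) + len(row) * alpha
        theta[pair] = [(c + alpha) / denom for c in row]
    return theta

# Toy counts: K = 2 topics, V = 3 words, two author pairs.
n_topic_word = [[3, 0, 1], [0, 2, 2]]
n_pair_topic = {(0, 1): [4, 0], (0, 2): [1, 3]}
phi = estimate_phi(n_topic_word, beta=0.1)
theta = estimate_theta(n_pair_topic, alpha=0.5)
```

For the pair (0, 1), for instance, (6) gives $(4 + 0.5) / (4 + 2 \times 0.5) = 0.9$ for the first topic and $0.1$ for the second.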

With (3)–(6), the Gibbs sampling algorithm for the coAT model is summarized in Algorithm 1. The procedure itself uses only seven large data structures: the count variables $n_{i, j}^{(k)}$ and $n_{k}^{(v)}$ , which have dimension $(A (A - 1) / 2) \times K$ and $K \times V$ , respectively, and their row sums $n_{i, j}$ and $n_{k}$ with dimension $A (A - 1) / 2$ and K, as well as the state variables $x_{m, n}$ , $y_{m, n}$ , and $z_{m, n}$ , each with dimension $W = \sum_{m = 1}^{M} N_{m}$ .
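One way to lay out a count matrix with $A (A - 1) / 2$ rows is to map each author pair to a row index. The sketch below shows one such mapping for 0-based author ids; the helper names `pair_index` and `num_pairs` are hypothetical, and the paper does not state which layout its implementation uses.

```python
def num_pairs(A):
    """Number of unordered author pairs among A authors."""
    return A * (A - 1) // 2

def pair_index(i, j, A):
    """Row index of the unordered pair {i, j}, with 0 <= i, j < A and i != j."""
    if i > j:
        i, j = j, i
    if i == j or i < 0 or j >= A:
        raise ValueError("need two distinct author ids in [0, A)")
    # Rows are ordered (0,1), (0,2), ..., (0,A-1), (1,2), ..., (A-2,A-1).
    return i * A - i * (i + 1) // 2 + (j - i - 1)

indices = [pair_index(i, j, 4) for i in range(4) for j in range(i + 1, 4)]
```

With the 2,037 authors of the NIPS dataset this gives 2,073,666 possible pairs, most of which never coauthor a paper, so a sparse structure keyed by observed pairs may be preferable to a dense matrix in practice.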

Algorithm 1: Gibbs sampling algorithm for coAT model.

Algorithm coATGibbs( $\{w\}, \{a\}, α, β, K$ )

Input: word vectors $\{w\}$ , author vectors $\{a\}$ , hyperparameters $α, β$ , topic number K

Global data: count statistics $\{n_{i, j}^{(k)}\}, \{n_{k}^{(v)}\}$ and their sums $\{n_{i, j}\}, \{n_{k}\}$

Output: topic associations $\{z\}$ , author pair associations $\{x\}$ and $\{y\}$ , multinomial parameters Φ and Θ, hyperparameter estimates $α, β$

// initialization
zero all count variables: $n_{i, j}^{(k)}, n_{i, j}, n_{k}^{(v)}, n_{k}$
for all documents $m \in [1, M]$ do
    for all words $n \in [1, N_{m}]$ in document m do
        sample topic index $z_{m, n} \sim Multinomial (1 / K)$
        sample one author index $x_{m, n} \sim Multinomial (p)$ with $p_{i} = \begin{cases} \frac{1}{A_{m}}, & i \in a_{m} \\ 0, & otherwise \end{cases}$
        sample another author index $y_{m, n} \sim Multinomial (p)$ with $p_{i} = \begin{cases} \frac{1}{A_{m} - 1}, & i \in a_{m} ∖ x_{m, n} \\ 0, & otherwise \end{cases}$
        if $x_{m, n} > y_{m, n}$ then
            swap $x_{m, n}$ with $y_{m, n}$
        // increment counts and sums
        $n_{x_{m, n}, y_{m, n}}^{(z_{m, n})}$ += 1; $n_{x_{m, n}, y_{m, n}}$ += 1; $n_{z_{m, n}}^{(w_{m, n})}$ += 1; $n_{z_{m, n}}$ += 1

// Gibbs sampling over burn-in period and sampling period
while not finished do
    for all documents $m \in [1, M]$ do
        for all words $n \in [1, N_{m}]$ in document m do
            // decrement counts and sums
            $n_{x_{m, n}, y_{m, n}}^{(z_{m, n})}$ −= 1; $n_{x_{m, n}, y_{m, n}}$ −= 1; $n_{z_{m, n}}^{(w_{m, n})}$ −= 1; $n_{z_{m, n}}$ −= 1
            sample an author pair index $(\tilde{i}, \tilde{j})$ according to (3)
            sample a topic index $\tilde{k}$ according to (4)
            set $(x_{m, n}, y_{m, n}) = (\tilde{i}, \tilde{j})$ and $z_{m, n} = \tilde{k}$
            // increment counts and sums
            $n_{\tilde{i}, \tilde{j}}^{(\tilde{k})}$ += 1; $n_{\tilde{i}, \tilde{j}}$ += 1; $n_{\tilde{k}}^{(w_{m, n})}$ += 1; $n_{\tilde{k}}$ += 1
    if converged and L sampling iterations since last read out then
        // the different parameter read outs are averaged
        read out parameter set Φ according to (5)
        read out parameter set Θ according to (6)

4. Experimental Results and Discussions

The NIPS proceedings dataset is used to evaluate the performance of our model. It consists of the full text of 13 years of proceedings, from 1987 to 1999, of the Neural Information Processing Systems (NIPS) Conferences. The dataset contains 1,740 research papers and 2,037 unique authors. The distribution of the number of papers over the number of authors is shown in Table 3, which indicates that the percentages of papers with at most 3, 4, and 5 authors are 87.8736%, 95.8046%, and 97.9885%, respectively.

Table 3

Distribution of #papers over #authors in NIPS dataset.

#Authors	1	2	3	4	5	6	7	8	9	10
#Papers	414 (23.7931%)	743 (42.7011%)	372 (21.3793%)	138 (7.9310%)	38 (2.1839%)	23 (1.3218%)	7 (0.4023%)	1 (0.0575%)	3 (0.1724%)	1 (0.0575%)
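The cumulative percentages quoted above can be reproduced directly from the counts in Table 3. The short check below uses only those counts; the variable and function names are illustrative.

```python
# Number of papers by number of authors, copied from Table 3.
papers_by_author_count = {1: 414, 2: 743, 3: 372, 4: 138, 5: 38,
                          6: 23, 7: 7, 8: 1, 9: 3, 10: 1}
total_papers = sum(papers_by_author_count.values())

def cumulative_percentage(max_authors):
    """Percentage of papers with at most max_authors authors."""
    covered = sum(count for n_authors, count in papers_by_author_count.items()
                  if n_authors <= max_authors)
    return 100.0 * covered / total_papers

shares = {n: round(cumulative_percentage(n), 4) for n in (3, 4, 5)}
```

The counts add up to the 1,740 papers of the dataset, and the three shares agree with the percentages stated in the text.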

In addition to downcasing and removing stopwords and numbers, we also remove the words appearing fewer than five times in the corpus. After the preprocessing, the dataset contains 13,649 unique words and 2,301,375 word tokens in total. Each document's timestamp is determined by the year of the proceedings. In our experiments, K is fixed at 100, and the symmetric Dirichlet priors α and β are set at 0.5 and 0.1, respectively. Gibbs sampling is run for 2000 iterations.
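The preprocessing steps described above could look roughly as follows. This is only a sketch: the tokenization rule (runs of ASCII letters) and the tiny `STOPWORDS` set are assumptions made for the example, since the text does not specify the tokenizer or the stopword list that were used.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "a", "and", "in", "is", "to"}  # illustrative subset only

def tokenize(text):
    """Lowercase and keep alphabetic runs, which drops numbers and punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def preprocess(docs, min_count=5):
    """Remove stopwords, then words occurring fewer than min_count times in the corpus."""
    tokenized = [[t for t in tokenize(d) if t not in STOPWORDS] for d in docs]
    freq = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if freq[t] >= min_count] for doc in tokenized]

toy_docs = ["The analog VLSI chip of 1990",
            "Analog circuits and VLSI vision",
            "A motion chip"]
cleaned = preprocess(toy_docs, min_count=2)  # small threshold for the toy corpus
```

On the real corpus the threshold would be the value five stated in the text; the toy example lowers it to two only so that some words survive.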

4.1. Examples of Topic, Coauthor Relationship Distributions

Table 4 illustrates examples of 8 topics learned by coAT model. The topics are extracted from a single sample at the 2000th iteration of the Gibbs sampler. Each topic is illustrated with (a) the top 10 words most likely to be generated conditioned on the topic; (b) the top 10 coauthor relationships which have the highest probability conditioned on the topic.

Table 4

An illustration of 8 topics from 100-topic solutions for the NIPS collection. The titles are our own interpretation of the topics. Each topic is shown with the 10 words and coauthor relationships that have the highest probability conditioned on that topic.

Topic 44		Topic 20		Topic 97		Topic 50		Topic 56		Topic 3		Topic 42		Topic 7
Analog circuit		Eye visual positioning		Synaptic cell membrane		Noisy signal filtering		Spike firing rate		Image, vision, and scene		Motion direction detection		Receptive visual field
Word	Prop.	Word	Prop.	Word	Prop.	Word	Prop.	Word	Prop.	Word	Prop.	Word	Prop.	Word	Prop.
Circuit	0.03586	Eye	0.03845	Membrane	0.02053	Noise	0.17760	Spike	0.05751	Image	0.06158	Motion	0.07899	Cells	0.03696
Current	0.02235	Visual	0.02489	Current	0.01840	Signal	0.07218	Firing	0.03778	Images	0.02596	Direction	0.03866	Receptive	0.03226
Analog	0.02086	Position	0.02108	Voltage	0.01809	Noisy	0.01839	Time	0.03028	Local	0.01403	Velocity	0.02732	Visual	0.02666
Chip	0.01945	Head	0.01508	Synaptic	0.01577	Filter	0.01379	Rate	0.02959	Pixel	0.01291	Flow	0.01822	Field	0.02438
Voltage	0.01876	Target	0.01413	Potential	0.01546	Information	0.01358	Neuron	0.02310	Vision	0.01245	Field	0.01598	Cell	0.02229
Figure	0.01876	Model	0.01323	Cell	0.01415	Signals	0.01239	Spikes	0.02013	Scene	0.01175	Moving	0.01581	Orientation	0.02211
Vlsi	0.01594	Gain	0.01228	Dendritic	0.01308	Gaussian	0.01103	Neurons	0.01713	Pixels	0.01134	Figure	0.01179	Response	0.02171
Circuits	0.01182	Movements	0.01099	Conductance	0.01283	Power	0.01044	Temporal	0.01451	Figure	0.01115	Spatial	0.01146	Spatial	0.01951
Retina	0.01106	Location	0.00976	Channel	0.01189	Snr	0.01022	Information	0.01373	Color	0.01110	Directions	0.00956	Cortex	0.01733
Output	0.00985	Motor	0.00942	Synapses	0.01177	Optimal	0.00968	Stimulus	0.01116	Surface	0.00952	Time	0.00945	Stimulus	0.01555

Coauthor	Prop.	Coauthor	Prop.	Coauthor	Prop.	Coauthor	Prop.	Coauthor	Prop.	Coauthor	Prop.	Coauthor	Prop.	Coauthor	Prop.

(Vittoz_E, vanSchaik_A)	0.59925	(Coenen_O, Sejnowski_T)	0.55963	(Manwani_A, Steinmetz_P)	0.64314	(Bialek_W, Owen_W)	0.34075	(Berry_M, Meister_M)	0.62719	(Madarasmi_S, Pong_T)	0.53566	(Higgins_C, Koch_C)	0.53386	(Chance_F, Nelson_S)	0.61231
(Boahen_K, Liu_S)	0.57789	(Anastasio_T, Dow_F)	0.49081	(Claiborne_B, Zador_A)	0.53110	(Beckman_P, Lippmann_R)	0.31923	(Bartels_A, Tang_A)	0.54080	(Pessoa_L, Ross_W)	0.52481	(Koch_C, Kramer_J)	0.40017	(Abbott_L, Chance_F)	0.52244
(Koch_C, Kruger_W)	0.56696	(Muller_K, Weston_J)	0.46146	(Douglas_R, Koch_C)	0.49358	(Bulsara_A, Moss_Moss)	0.31805	(Stevens_C, Zador_A)	0.53387	(Hurlbert_A, Poggio_T)	0.51580	(Mathur_B, Wang_H)	0.35330	(Archie_K, Mel_B)	0.50652
(Delbruck_T, Mead_C)	0.51665	(Langdon_P, Mayhew_J)	0.44305	(Brown_T, Tsai_K)	0.45795	(Jacobs_E, Moss_Moss)	0.28987	(Bartels_A, Sejnowski_T)	0.45256	(Allman_J, Moore_A)	0.50492	(Marshall_J, Martin_K)	0.33484	(Somers_D, Todorov_E)	0.50000
(Diorio_C, Hasler_P)	0.48269	(Pouget_A, Sejnowski_T)	0.43626	(Segev_I, Tishby_N)	0.39906	(Owen_W, Rieke_F)	0.28139	(Gerstner_W, vanHemmen_J)	0.45098	(Diamantaras_K, Geiger_D)	0.42916	(Schuling_F, Zaagman_W)	0.31607	(Abbott_L, Nelson_S)	0.48106
(Vittoz_E, vanSchaik_A)	0.48188	(Dean_P, Mayhew_J)	0.43463	(Claiborne_B, Tsai_K)	0.37184	(Douglass_J, Longtin_A)	0.28052	(Sydorenko_M, Young_E)	0.41291	(Liu_X, Wang_D)	0.42188	(Finkel_L, Yen_S)	0.30495	(Mel_B, Ruderman_D)	0.45446
(Hasler_P, Kruger_W)	0.48046	(DeWeerth_S, Mead_C)	0.43144	(Chklovskii_D, Stevens_C)	0.37069	(Bulsara_A, Jacobs_E)	0.27984	(Rieke_F, de-Ruyter-van-Steveninck_R)	0.39749	(Kersten_D, Pong_T)	0.40456	(Thomber_K, Williams_L)	0.29713	(Gallant_J, Vinje_W)	0.44902
(Andreou_A, Boahen_K)	0.47946	(Lisberger_S, Sejnowski_T)	0.40071	(Mainen_Z, Zador_A)	0.34411	(Bialek_W, Rieke_F)	0.26612	(Schneidman_E, Tishby_N)	0.39286	(Allman_J, Goodman_R)	0.37915	(Koch_C, Sarpeshkar_R)	0.28247	(Sabatini_S, Solari_F)	0.44010
(Mahowald_M, Ryckebusch_S)	0.47611	(Deffayet_C, Sejnowski_T)	0.35550	(Brown_T, Zador_A)	0.34367	(Douglass_J, Moss_F)	0.25947	(Gerstner_W, Wagner_H)	0.38530	(Pavel_M, Sharma_R)	0.37520	(Shashua_A, Ullman_S)	0.28220	(Dimitrov_A, Mundel_T)	0.43216
(Bair_W, Sarpeshkar_R)	0.47151	(Coenen_O, Lisberger_S)	0.32875	(Koch_C, Manwani_A)	0.31870	(Koch_C, Manwani_A)	0.22405	(Gerstner_W, Kempter_R)	0.38333	(Allman_J, Fox_G)	0.37324	(Mueller_P, Takahashi_N)	0.28030	(Hochstein_S, Vaadia_E)	0.42882

4.2. Shared Interest Discovery

To analyze further the interests shared by an author pair, one can read Table 4 from the viewpoint of author pairs. Table 5 shows the shared interests with Koch_C in the NIPS dataset. In Table 5, the meaning of each topic is given in Table 4, and × means that the resulting strength is very low. From Table 5, one can see that the interests shared by Koch_C and Manwani_A include Topic 97, Topic 50, and Topic 56 with strengths of 0.31870, 0.22405, and 0.10003, respectively. Comparing Table 5 with Table 1 shows that the discovered shared interests make sense.

Table 5

Shared interests with Koch_C in NIPS dataset.

	Topic 44	Topic 20	Topic 97	Topic 50	Topic 56	Topic 3	Topic 42	Topic 7
Bair_W	0.39822	×	×	×	0.07898	×	0.08691	×
Horiuchi_T	0.17956	0.24486	×	×	×	×	0.05381	×
Luo_T	0.21363	×	×	×	×	0.10046	0.08891	×
Douglas_R	×	×	0.49358	×	×	×	×	0.13284
Manwani_A	×	×	0.31870	0.22405	0.10003	×	×	×
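Reading the shared interests of one author pair off its topic distribution amounts to keeping the topics whose strength exceeds some cutoff. The sketch below uses the strengths reported in Table 5 for Koch_C and Manwani_A; the function name and the cutoff values are assumptions for illustration, since the text does not state the threshold below which a strength is marked ×.

```python
def shared_interests(topic_strengths, threshold=0.05):
    """Return (topic, strength) pairs at or above threshold, strongest first."""
    kept = [(k, p) for k, p in topic_strengths.items() if p >= threshold]
    return sorted(kept, key=lambda kp: kp[1], reverse=True)

# Strengths for (Koch_C, Manwani_A) from Table 5; topics marked x are omitted.
koch_manwani = {97: 0.31870, 50: 0.22405, 56: 0.10003}
all_shared = shared_interests(koch_manwani)
strongest = shared_interests(koch_manwani, threshold=0.2)
```

Raising the cutoff to 0.2 keeps only Topic 97 and Topic 50 for this pair.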

4.3. Predictive Power Analysis

In order to compare the performance of the AT and coAT models, we further divide the NIPS papers into a training set $𝒟^{train}$ of 1,557 papers and a test set $𝒟^{test}$ of 183 papers. Each author in $𝒟^{test}$ must have authored at least one of the training papers. The perplexity, originally used in language modeling [16], is a standard measure for estimating the performance of a probabilistic model. The perplexity is defined as the reciprocal geometric mean of the token likelihoods in the test set $𝒟^{test} = \{{\tilde{w}}_{m, \cdot}, {\tilde{a}}_{m, \cdot}\}$ under the AT or coAT model:

\begin{array}{l} perplexity^{AT} (\{{\tilde{w}}_{m, \cdot}\} \mid \{{\tilde{a}}_{m, \cdot}\}, ℳ) \\ = \exp \left[- \frac{\sum_{m} \log P^{AT} ({\tilde{w}}_{m, \cdot} \mid {\tilde{a}}_{m, \cdot}, ℳ)}{\sum_{m} (N_{m} \times A_{m})}\right], \\ perplexity^{coAT} (\{{\tilde{w}}_{m, \cdot}\} \mid \{{\tilde{a}}_{m, \cdot}\}, ℳ) \\ = \exp \left[- \frac{\sum_{m} \log P^{coAT} ({\tilde{w}}_{m, \cdot} \mid {\tilde{a}}_{m, \cdot}, ℳ)}{\sum_{m} (N_{m} \times \frac{A_{m} (A_{m} - 1)}{2})}\right], \end{array}

(7)

with

\begin{array}{l} P^{AT} ({\tilde{w}}_{m, \cdot} \mid {\tilde{a}}_{m, \cdot}, ℳ) = \prod_{n = 1}^{N_{m}} P ({\tilde{w}}_{m, n} \mid {\tilde{a}}_{m, \cdot}, ℳ) \\ = \prod_{n = 1}^{N_{m}} \sum_{k = 1}^{K} \sum_{a \in {\tilde{a}}_{m}} P ({\tilde{w}}_{m, n} \mid {\tilde{z}}_{m, n} = k) P ({\tilde{z}}_{m, n} = k \mid {\tilde{x}}_{m, n} = a) \\ = \prod_{n = 1}^{N_{m}} \sum_{k = 1}^{K} \sum_{a \in {\tilde{a}}_{m}} φ_{k, {\tilde{w}}_{m, n}} ϑ_{a, k}, \\ P^{coAT} ({\tilde{w}}_{m, \cdot} \mid {\tilde{a}}_{m, \cdot}, ℳ) = \prod_{n = 1}^{N_{m}} P ({\tilde{w}}_{m, n} \mid {\tilde{a}}_{m, \cdot}, ℳ) \\ = \prod_{n = 1}^{N_{m}} \sum_{k = 1}^{K} \sum_{i \in {\tilde{a}}_{m}, j \in {\tilde{a}}_{m}, i < j} P ({\tilde{w}}_{m, n} \mid {\tilde{z}}_{m, n} = k) \\ \times P ({\tilde{z}}_{m, n} = k \mid {\tilde{x}}_{m, n} = i, {\tilde{y}}_{m, n} = j) \\ = \prod_{n = 1}^{N_{m}} \sum_{k = 1}^{K} \sum_{i \in {\tilde{a}}_{m}, j \in {\tilde{a}}_{m}, i < j} φ_{k, {\tilde{w}}_{m, n}} ϑ_{i, j, k} . \end{array}

(8)
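The coAT part of (7) and (8) can be transcribed into code as follows. This is a literal, hedged transcription of the formulas as printed, not the authors' evaluation script: the function names are hypothetical, Φ is assumed to be a list of per-topic word distributions, Θ a dictionary keyed by ordered author pairs $(i, j)$ with $i < j$, and every test document is assumed to have at least two authors, because the pair sum in (8) is empty otherwise.

```python
import math
from itertools import combinations

def coat_log_likelihood(words, authors, phi, theta):
    """log P^coAT of one document, following equation (8)."""
    pairs = list(combinations(sorted(authors), 2))
    if not pairs:
        raise ValueError("coAT needs at least two authors per document")
    K = len(phi)
    total = 0.0
    for w in words:
        # Sum over topics k and author pairs (i, j) with i < j.
        p = sum(phi[k][w] * theta[pair][k] for pair in pairs for k in range(K))
        total += math.log(p)
    return total

def coat_perplexity(docs, phi, theta):
    """Equation (7); docs is a list of (word_ids, author_ids) tuples."""
    log_lik = 0.0
    denom = 0
    for words, authors in docs:
        log_lik += coat_log_likelihood(words, authors, phi, theta)
        A_m = len(authors)
        denom += len(words) * (A_m * (A_m - 1) // 2)
    return math.exp(-log_lik / denom)

# Toy model: two topics with identical uniform word distributions over two words.
phi = [[0.5, 0.5], [0.5, 0.5]]
theta = {(0, 1): [0.3, 0.7]}
ppl = coat_perplexity([([0, 1, 1], [0, 1])], phi, theta)
```

In this toy case every token has likelihood 0.5 and there is a single author pair, so the perplexity is 2, which matches the intuition of a uniform choice between two words.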

Figure 4 shows the results for the AT model and coAT model on the test set $𝒟^{test}$ . It is not difficult to see that the perplexity of coAT model is smaller than that of AT model, which indicates that coAT model outperforms AT model.

Figure 4

Perplexity of the test set $𝒟^{test}$ .

5. Conclusions

A social network service (SNS) is a way for people to connect and share information with each other online. However, the information shared is often sensitive with respect to people's privacy and raises security concerns. Some of this information is even used by commercial companies to provide user-customized or context-aware services.

To address this problem, this paper proposes a shared interest discovery model for coauthor relationships in SNS, named the coauthor topic (coAT) model, to identify users with similar interests in social networks; a collapsed Gibbs sampling method is used to infer the model parameters. In this way, one can reduce the possibility that recommended users are attackers rather than friends. The results on the NIPS dataset show that the discovered shared interests make sense.

The relative simplicity of the approach in the work provides advantages for injecting these ideas into other topic models. For example, in ongoing work, we are finding dynamic shared interest patterns in SNS over time, with a coAT model over time, similar to our previous work on AT over time (AToT) [13, 14]. Additionally, the collapsed Gibbs sampling method following the main idea of MapReduce [15, 17] is also utilized for inferring the coAT model parameters. Many other extensions are possible.

Appendix

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was funded partially by Fundamental Research Funds for the Central Universities: Research on Forest Property Circulation Mechanism in Collective Forest Area under Grant no. JGTD2014-04, Beijing Forestry University Young Scientist Fund: Research on Econometric Methods of Auction with their Applications in the Circulation of Collective Forest Right under Grant no. BLX2011028, and Key Technologies R&D Program of Chinese 12th Five-Year Plan (2011–2015): Key Technologies Research on Large Scale Semantic Computation for Foreign Scientific & Technical Knowledge Organization System, and Key Technologies Research on Data Mining from the Multiple Electric Vehicle Information Sources under Grant no. 2011BAH10B04 and 2013BAG06B01, respectively.

References

1. C. C. Aggarwal, Social Network Data Analysis, Springer, Berlin, Germany, 2011.

2. McDowell and Morda, "Socializing securely: using social networking services," United States Computer Emergency Readiness Team (US-CERT), 2011.

3. Rosen-Zvi, Griffiths, Steyvers, and Smyth, "The author-topic model for authors and documents," in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494, AUAI Press, Arlington, Va, USA, 2004.

4. Steyvers, Smyth, Rosen-Zvi, and Griffiths, "Probabilistic author-topic models for information discovery," in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 306–315, ACM, New York, NY, USA, August 2004.

5. Rosen-Zvi, Chemudugunta, Griffiths, Smyth, and Steyvers, "Learning author-topic models from text corpora," ACM Transactions on Information Systems, vol. 28, no. 1, pp. 1–38, 2010. doi:10.1145/1658377.1658381

6. J. M. Winn, Variational message passing and its applications [Ph.D. thesis], University of Cambridge, 2004.

7. D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993–1022, 2003.

8. T. P. Minka, "Expectation propagation for approximate Bayesian inference," in Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pp. 362–369, Morgan Kaufmann Publishers, San Francisco, Calif, USA, 2001.

9. Minka and Lafferty, "Expectation-propagation for the generative aspect model," in Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pp. 352–359, 2002.

10. Zeng, "A topic modeling toolbox using belief propagation," Journal of Machine Learning Research, pp. 2233–2236, 2012.

11. T. L. Griffiths and Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, supplement 1, pp. 5228–5235, 2004. doi:10.1073/pnas.0307752101

12. Zhu, Xiaodong, Qingwei, and Jie, "Topic linkages between papers and patents," in Proceedings of the 4th International Conference on Advanced Science and Technology, pp. 176–183, Science and Engineering Research Support soCiety (SERSC), Daejeon, South Korea, 2012.

13. Shi, Qiao, Zhu, Jung, Lee, and S. P. Choi, "Author-topic over time (AToT): a dynamic users' interest model," in Proceedings of the 2nd International Conference on Ubiquitous Context-Awareness and Wireless Sensor Network, pp. 227–233, Springer, 2013.

14. Shi, Qiao, and Nong, "Author-topic evolution model and its application in analysis of research interests evolution," Journal of the China Society for Scientific and Technical Information, vol. 32, no. 9, pp. 912–919, 2013.

15. Shuo, Shi, Qingwei, Qiao, Xiaodong, Zhu, Zhang, Jung, Lee, and S. P. Choi, "A dynamic users' interest discovery model with distributed inference algorithm," International Journal of Distributed Sensor Networks, vol. 2014, Article ID 280892, 11 pages, 2014. doi:10.1155/2014/280892

16. Azzopardi, Girolami, and van Rijsbergen, "Investigating the relationship between language model perplexity and IR precision-recall measures," in Proceedings of the 26th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 369–370, ACM, New York, NY, USA, 2003.

17. Dean and Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008. doi:10.1145/1327452.1327492