Sage Journals: Discover world-class research

Abstract

Online social networks are an important part of people’s lives and have also become a platform where spammers use suspicious accounts to spread malicious URLs. Researchers have made considerable efforts to detect suspicious accounts in online social networks. Most existing works rely on feature-based machine learning. However, once spammers disguise the key features, such detection methods quickly fail, and they cannot cope with variable or unknown features. Graph-based works mainly use the location and social relationships of spammers, but they need to build a huge social graph, which incurs a high computing cost. Thus, a lightweight algorithm that is hard to evade is needed. In this article, we propose a lightweight algorithm, GroupFound, which focuses on the structure of the local graph. As the bi-followers come from different social communities, we divide these accounts into groups and compute the average number of accounts per group. We evaluate GroupFound on a Sina Weibo dataset and find an appropriate threshold to identify suspicious accounts. Experimental results demonstrate that our algorithm achieves a high detection rate of 86.27% at a low false positive rate of 8.54%.

Keywords

Online social networks, community, suspicious account, graph-based algorithm, threshold

Introduction

In recent years, online social networks (OSNs) such as Twitter, Facebook, Sina Weibo, and RenRen have provided people with a platform that brings real-world relationships into the virtual world. While OSNs enrich people’s lives, they also bring many network security problems. Spammers propagate a variety of attacks to other users through OSNs, such as phishing, drive-by download, malicious code injection, and hosting botnets. These malicious behaviors threaten the security of users’ information and property. In the first half of 2010, about 1.67% of new Twitter accounts were closed due to malicious behaviors, suspicious behaviors, or other account abuse.¹

Malicious URLs are one of the most common means by which spammers carry out attacks.² Spammers usually share interesting videos, stories, photographs, and discount information, but this content actually contains links to malicious websites. With the launch of short-link services in OSNs, spammers can use short URLs to hide the domain names of malicious websites. Some works show that spammers can identify whether detection tools are present in the current network environment. These factors make it difficult to detect malicious URLs in OSNs. Since spammers mainly spread malicious URLs through spam accounts, we aim to detect spam accounts in order to curb, or even eliminate, the spread of malicious URLs in OSNs.

Spammers attempt to follow many users in the hope of being followed back, thereby increasing their followers.²⁻⁴ A study shows that a considerable share of Sina Weibo users follow back strangers who follow them.⁵ The following strategies of normal users and spammers differ. Evidence shows that most spam accounts are programmed to run automatically, which is easy to achieve at low cost. Normal users, by contrast, decide whether to follow an account according to their own judgment. Therefore, spammers cannot be sure who will follow them back, and the followers of spam accounts often have no obvious relationships with each other.

Researchers have made great efforts to protect users from phishing attacks and from URLs that lead to malicious websites. Current research falls into three categories. In the first category, studies aim to establish a detection model based on social network behaviors;⁶⁻¹⁰ they set up behavior models by analyzing the different behavior characteristics of normal and malicious accounts in OSNs. An account is then judged malicious or not according to whether it matches the model. However, spammers often evade detection by constructing a new attack model that imitates the behavior of real users. In the second category, studies use machine learning methods to discover features of malicious accounts, such as behaviors and content.¹¹⁻¹⁷ They extract the features and train a classifier to identify whether accounts are malicious or normal. However, once spammers modify key features, such a detection method easily becomes obsolete. In the last category, researchers construct a graph from the connections between accounts, such as friend, following, and follower relationships.¹⁸⁻²³ They design an algorithm that detects malicious accounts through the different positions of normal and malicious accounts in the social graph. However, the accuracy of graph-based detection algorithms is relatively low, and different social networks have their own characteristics.

To spread malicious URLs widely, spammers follow many users to attract them to follow back. Currently, most methods use a bi-directional links ratio to detect this feature. However, spam accounts can use various techniques to evade most existing methods, such as following a large number of accounts that follow their followers back in order to balance the ratio. Normal users decide whether to follow a malicious account according to their own judgment, so spammers cannot control the composition of their own followers. To find a robust and lightweight detection method, we look for features that are difficult for spammers to forge in order to confuse detection. Thus, we propose an algorithm based on the local graph to detect suspicious accounts in OSNs.

In this article, we propose a lightweight algorithm which focuses on the structure of the local graph. First, we collect 246,898 accounts from Sina Weibo, which include 22,417 spammers. Then, we establish an undirected graph based on the bi-follow relationships among accounts. Next, we design an algorithm called GroupFound. As the bi-followers come from different social communities, we divide these accounts into groups and compute the average number of accounts per group. We evaluate GroupFound on the Sina Weibo dataset and use the receiver operating characteristic (ROC) curve and the area under the curve (AUC) value to analyze and validate it. We find the appropriate threshold to identify suspicious accounts through the ROC curve. The experimental results indicate that the accuracy rate of GroupFound is 86.27%, the false positive rate (FPR) is 8.54%, and the average time to detect a single account is 0.032 s. Thus, GroupFound detects suspicious accounts in OSNs effectively and at low cost. In summary, we make three contributions:

We propose the GroupFound algorithm to detect suspicious accounts in OSNs. GroupFound uses the difference in location characteristics between normal and suspicious accounts on the graph to identify suspicious accounts. It is difficult for spammers to evade.

GroupFound is a lightweight detection algorithm. Unlike previous works, it focuses on the structure of the local graph and can therefore check a large number of accounts in a shorter time.

We use the ROC curve and AUC value to demonstrate the effectiveness of the algorithm and to find the best threshold for identifying suspicious accounts.

The rest of the article is organized as follows. In section “Related work”, the related work is discussed. The experimental dataset and the method used to collect the dataset are described in section “Dataset collection and analysis”. In addition, motivation and observations are explained, and the proofs of the observations are also expounded in this part. Section “Algorithm” introduces the GroupFound algorithm. Experiments are conducted in section “Evaluation” to detect spam accounts in Sina Weibo and evaluate the result of experiments. Related methods are compared to evaluate the speed of the GroupFound algorithm. The conclusion is in section “Conclusion”.

Related work

Spammers usually carry out attacks with malicious URLs, which cause trouble for many people and companies and often lead to huge economic losses. Researchers and enterprise engineers have made considerable efforts to detect spammers in OSNs. We discuss these related works from three perspectives: models based on social behavior, machine learning classifiers based on social features, and algorithms based on the social graph.

Social behavior–based models

The key of the detection methods based on the model is to select appropriate behavior characteristics to differentiate normal accounts and suspicious accounts. Then, they judge accounts by whether they match with the model. Le et al. proposed a novel scoring model for signature-based detection that uses static features to pre-filter potential malicious web pages. Their model can combine knowledge from various information sources of web pages very effectively in order to filter potential malicious web pages.⁶ Viswanath et al. proposed using principal component analysis (PCA) to detect anomalous users’ behavior in OSNs. They experimentally validate that normal users’ behavior is limited to a low-dimensional subspace amenable to the PCA technique.⁷ Jiang et al. presented a study of detecting suspicious following behaviors. They discovered that the zombie followers exhibit behaviors that are synchronized and abnormal.⁸ Wang et al. built a practical system for detecting fake identities using server-side clickstream models. They develop a detection approach that groups similar user clickstreams into behavioral clusters, by partitioning a similarity graph that captures distances between clickstream sequences.⁹ Cao et al. designed and implemented a malicious account detection system called SynchroTrap. They observed that malicious accounts usually perform loosely synchronized actions in a variety of social network contexts.¹⁰

Social feature–based machine learning classifiers

This kind of detection scheme uses machine learning methods to train a classifier according to social behaviors, accounts, and content characteristics. Eshete et al. developed an approach that uses machine learning to detect whether a given URL is hosting an exploit kit. Central to their approach is the design of distinguishing features that are drawn from the analysis of attack-centric and self-defense behaviors of exploit kits.¹¹ Yang et al. made an empirical analysis of the evasion tactics utilized by Twitter spammers and then designed several new and robust features to detect Twitter spammers. They formalized the robustness of 24 detection features that are commonly utilized in the literature as well as their proposed ones.¹² Aggarwal et al. used Twitter specific features along with URL features to detect whether a tweet posted with a URL is phishing or not. They use machine learning classification techniques and detect phishing tweets.¹³ Eshete et al.¹⁴ presented a holistic and at the same time lightweight approach, called BINSPECT, that leverages a combination of static analysis and minimalistic emulation to apply supervised learning techniques in detecting malicious web pages pertinent to drive-by download, phishing, injection, and malware distribution by introducing new features that can effectively discriminate malicious and benign web pages. Almaatouq et al.¹⁵ presented a unique look at spam accounts in OSNs through the lens of the behavioral characteristics and spammer techniques for reaching victims. Zheng et al. proposed a supervised machine learning–based solution for an effective spammer detection. The solution considers the users’ content and behavior features and apply them into support vector machine (SVM)-based algorithm for spammer classification.¹⁶ Based on the spam policy of Twitter, novel content-based and graph-based features are proposed to facilitate spam detection by Wang.¹⁷

Social graph–based algorithm

There are many graph structures in OSNs. In the graph, normal accounts and suspicious accounts have different location characteristics. Yu et al. proposed a decentralized algorithm, SybilLimit, to determine whether a suspect node is a Sybil or not; it is an improved version of SybilGuard. It relies on the assumption that social networks are fast mixing and that the number of attack edges is limited. SybilLimit uses random routes, in which each node uses a randomized routing table to choose the next hop. Compared to their previous SybilGuard, which accepted $O(\sqrt{n} \log n)$ Sybil nodes per attack edge, SybilLimit accepts only $O(\log n)$ Sybil nodes per attack edge.²⁴,²⁵ Cao et al. introduced a new tool in the hands of OSN operators, which they called SybilRank. It relies on social graph properties to rank users according to their perceived likelihood of being fake (Sybils).¹⁸ Wei et al. presented SybilDefender, a Sybil defense mechanism that leverages the network topologies to defend against Sybil attacks in social networks. Based on performing a limited number of random walks within the social graphs, SybilDefender is efficient and scalable to large social networks.¹⁹ Xue et al. presented VoteTrust, a Sybil detection system that further leverages user interactions of initiating and accepting links. VoteTrust uses the techniques of trust-based vote assignment and global vote aggregation to evaluate the probability that the user is a Sybil.²⁰ Tan et al. designed a Sybil defense–based spam detection scheme, UNIK. It differs from existing schemes in that it detects non-spam URL patterns from the social network instead of spam URLs directly, because the non-spammer patterns are relatively more stable than those of spammers.²¹ Beutel et al.²² proposed CopyCatch, which detects lockstep Page Like patterns on Facebook by analyzing only the social graph between users and Pages and the times at which the edges in the graph (the Likes) were created. Gong et al. introduced SybilBelief, a semi-supervised learning framework, to detect Sybil nodes. SybilBelief takes a social network of the nodes in the system, a small set of known benign nodes, and, optionally, a small set of known Sybils as input. Then, SybilBelief propagates the label information from the known benign and/or Sybil nodes to the remaining nodes in the system.²³

Compared with the above methods, GroupFound is a lightweight algorithm that does not need training data in advance. If an attacker wants to spread malicious information, he needs to establish relationships with normal users. When normal accounts establish relationships in OSNs, they make their own judgment: normal users do not establish relationships with accounts that they are not interested in or that behave maliciously, and if they suspect that an account is abnormal, they will not follow it back. It is therefore difficult for attackers to change the graph structure around their accounts. Unlike most existing graph-based detection methods, we do not need to set up a huge graph. GroupFound focuses on the local structure of the graph and detects one account at a time, so when only a single node needs to be checked, there is no need to build a complete social graph.

Dataset collection and analysis

Before introducing our dataset, we give some definitions for an account in Sina Weibo. The followings_count is the number of accounts that an account follows, and the followers_count is the number of accounts that follow it. A very important relationship in our research is bi-follow, which is established between two accounts that follow each other. For an account, the accounts that have a bi-follow relationship with it are called its bi-followers.

Dataset

Our Sina Weibo dataset consists of 246,898 accounts, collected from January 2015 using the Sina Weibo public open platform API. For each account, we collect the user’s latest personal information, followings, Weibo content, labels, education, profession, location, system recommendations, followers, and other data items. The data are stored in JSON format.

To distinguish normal accounts from malicious accounts in the dataset, we use Python scripts to label each account. As shown in Figure 1, we enter http://weibo.com/u followed by the account ID in the browser address bar, like http://weibo.com/u1234567890 . If the account is a normal one, the browser redirects the visitor to the homepage of the account, like http://weibo.com/u1234567890?is_all=1?is_all=1 . If the account is not a normal one, the browser simply redirects the visitor to http://weibo.com/ . All accounts are labeled with this method. In the end, we identified 224,481 normal accounts and 22,417 spam accounts.

Figure 1.

Mark data.

We began to conduct experiments on this dataset in June 2016. During the experiments, we revisited the homepages of these accounts several times by crawling. Each time we found newly suspended malicious accounts, we marked those accounts as suspended in our dataset accordingly. To date, a long time has passed since the last malicious account was officially suspended, so we conclude that Sina Weibo has closed essentially all the spam accounts in the dataset. Even if some spam accounts have escaped the official spam detection mechanism, their number is so small that they exert little influence on our experiments. Therefore, we decided to verify our experiments using the officially suspended accounts.

Because of official privacy policies, collecting data from OSNs remains a challenge for researchers. OSNs protect their APIs carefully; for example, Twitter’s API applies different rate limits depending on the type of request. Basic user information can be requested 180 times every 15 min, but a user’s followers can be requested only 15 times every 15 min. We are trying to collect more data through the official APIs for our future research on OSNs.

Motivation

To spread malicious URLs widely, spammers try to establish relationships with a large number of normal accounts in a short time. Evidence shows that most spam accounts are programmed to run automatically. In contrast, the relationships that normal accounts establish are usually rooted in real life, such as friends, family members, and favorite celebrities. Besides, spammers can make spam accounts follow other accounts, but they cannot determine whether those accounts follow back. Therefore, the bi-followers of spam accounts usually come from different social communities. Figure 2(a) shows a typical graph structure of a spam node: there is no relationship between its neighbors. Some spammers control a large number of fake accounts and can make them follow each other to construct communities similar to those of normal accounts. As shown in Figure 2(b), the gray nodes are fake ones. When these fake nodes form a group, the center of the graph structure shifts toward this group, and the distance between the other nodes and the center becomes larger.

Figure 2.

Use fake accounts: (a) spam accounts and (b) spam accounts with fake accounts.

To deal with these problems, we use hierarchical clustering to obtain the clusters of an account. In the local social graph, we initially set each node as a cluster and repeatedly merge the two closest clusters. We use the error sum of squares (ESS) to compute the distance between two clusters and define an irrelevant distance in section “The details of GroupFound.” When the distance between two clusters is larger than the irrelevant distance, we consider the two clusters unrelated and terminate the clustering process early. When the algorithm stops, we take the number of clusters and call it groups. Attackers may be able to control fake nodes to form social groups, but each normal node that has established a relationship with a spam node forms a cluster containing only itself. We use Group_avg to represent the average number of nodes in each group for further analysis.

Observations

To explain our algorithm, we first state the observations it relies on. The algorithm is based on three observations, each of which we verify on our dataset.

Observation 1: spam accounts follow more accounts with an aim to widely propagate spam information

If an account has no followers, it becomes an isolated point in the social graph and cannot propagate information to others in OSNs. To propagate spam information widely, a spam account needs to become part of social communities and establish social relationships with other accounts, even if it is not a real or valuable account. Spammers cannot make other accounts follow them, so they need to follow as many normal users as possible to increase their bi-followers. We analyzed 22,448 normal accounts and 22,417 spam accounts in the dataset. In Figure 3, the x-axis represents the followings-count of accounts, and the y-axis is the percentage of total accounts.

In Figure 3, the blue line is the distribution of the followings-count of normal accounts and the purple line is that of spam accounts. Along the blue line, the percentage of total accounts grows rapidly as the followings-count increases, whereas the purple line rises more gradually. For a closer look, we take x = 50. At x = 50, the corresponding y value is 0.89804 for normal accounts, which means that 89.804% of normal accounts have a followings-count of less than 50. For spam accounts, the y value is 0.057215, which means that only 5.7215% of spam accounts have a followings-count of less than 50. Clearly, spammers follow more accounts in order to propagate spam information widely.

Figure 3.

Number of followings of accounts.

Observation 2: spam accounts cannot control their followings to follow them

Normal accounts correspond to people in the real world. When normal accounts establish relationships in OSNs, they make their own judgment. Normal users do not follow accounts that they are not interested in or that behave maliciously; if they suspect that an account is abnormal, they will not follow it. Users with good judgment are therefore not easily attacked by spammers. As a result, spam accounts follow a large number of normal accounts, but only a few of those accounts follow them back. We can verify this by computing the Bi-directional Links Ratio with the following formula

$$\text{Bi-directional Links Ratio} = \frac{\text{bi-followers}}{\text{followings\_count}} \tag{1}$$

We randomly choose about 5000 normal accounts and about 5000 spam accounts from the Sina Weibo dataset. Applying formula (1) to these accounts gives the Bi-directional Links Ratio of both normal accounts and spam accounts. In Figure 4, the x-axis represents the account number, ranging from 1 to 5000, and the y-axis represents the value of the Bi-directional Links Ratio for each account. Figure 4(a) shows the Bi-directional Links Ratio of normal accounts and Figure 4(b) shows that of spam accounts. Comparing the two figures, we can clearly see that the Bi-directional Links Ratios of normal accounts center around 0.5, whereas those of spam accounts center around 0.3. It is obvious that the average Bi-directional Links Ratio of normal accounts is larger than that of spam accounts. We conclude that spam accounts follow many accounts but few of these accounts follow them back, because spam accounts cannot control their followings to follow them.

Figure 4.

Bi-directional links ratio of accounts: (a) normal accounts and (b) spam accounts.
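Formula (1) can be illustrated with a minimal Python sketch. The function name, the set-based inputs, and the toy account below are assumptions made for illustration; they are not taken from the dataset or from the authors’ scripts.

```python
def bi_directional_links_ratio(followings, followers):
    """Formula (1): share of followed accounts that follow back."""
    followings = set(followings)
    if not followings:
        # Assumption: define the ratio as 0 when the account follows nobody.
        return 0.0
    bi_followers = followings & set(followers)
    return len(bi_followers) / len(followings)


# Hypothetical toy account: follows 4 accounts, 2 of which follow back.
ratio = bi_directional_links_ratio({"a", "b", "c", "d"}, {"a", "b", "x"})
print(ratio)  # 0.5
```

A low ratio alone is only a weak signal, which is why the article goes on to look at the structure among the bi-followers.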

Observation 3: the bi-followers of spam accounts usually come from different social communities

To propagate spam information widely, a spam account needs to follow many normal accounts to attract them to follow back. On the one hand, spammers do not care who their followings are or whether there are relationships between them. On the other hand, spammers cannot control their followings to follow them. Usually, spammers prefer to follow accounts from different social communities: if their followings came from only a few social communities, it would be difficult to draw more victims and spread spam information widely. Thus, the bi-followers of spam accounts usually come from different social communities, and there is little interaction between them. To verify this, we design an algorithm based on the local clustering coefficient to measure the interaction strength among an account’s bi-followers.

Local clustering coefficient: the data of OSNs can be abstracted as a graph (G), where nodes represent accounts and edges represent the relationships among them. In our study, we use bi-follow relationships as edges. When a node is being examined, we call it the target-node and call its bi-followers neighbor-nodes. In G, nodes tend to establish tightly knit relationships and form a community. The local clustering coefficient (LLC) of a node reflects how closely knit its community is. For a target-node, we use edges_num to represent the total number of edges among its neighbor-nodes and sum to represent the total number of its neighbor-nodes. The LLC is then computed by the following formulas:

Undirected graph

$$\mathit{LLC} = \frac{2 \times \mathit{edges\_num}}{\mathit{sum} \times (\mathit{sum} - 1)} \tag{2}$$

Directed graph

$$\mathit{LLC} = \frac{\mathit{edges\_num}}{\mathit{sum} \times (\mathit{sum} - 1)} \tag{3}$$
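The undirected case, formula (2), can be sketched in a few lines of Python. The adjacency dictionary, the node names, and the convention of returning 0 for fewer than two neighbor-nodes are assumptions made for this illustration.

```python
from itertools import combinations


def local_clustering_coefficient(target, adjacency):
    """Formula (2): 2 * edges_num / (sum * (sum - 1)) on an undirected graph."""
    neighbors = adjacency.get(target, set())
    total = len(neighbors)  # "sum" in the text
    if total < 2:
        # Assumption: with fewer than two neighbor-nodes no edge is possible.
        return 0.0
    edges_num = sum(
        1 for u, v in combinations(neighbors, 2) if v in adjacency.get(u, set())
    )
    return 2 * edges_num / (total * (total - 1))


# Hypothetical bi-follow graph: t has neighbors a, b, c; only a and b are linked.
graph = {"t": {"a", "b", "c"}, "a": {"t", "b"}, "b": {"t", "a"}, "c": {"t"}}
llc = local_clustering_coefficient("t", graph)

# A star-shaped neighborhood, as described for a typical spam node.
star = {"s": {"x", "y", "z"}, "x": {"s"}, "y": {"s"}, "z": {"s"}}
llc_star = local_clustering_coefficient("s", star)
```

In the first toy graph one of the three possible neighbor pairs is connected, so the coefficient is 1/3; in the star-shaped graph it is 0.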

We use the bi-follow relationships of a target-node (i.e. similar to the friend relationships in Facebook) to establish an undirected graph. Then, we calculate the LLC of each node (Figure 5). In the figure, the blue line is the LLC distribution of normal accounts and the purple line represents the spam accounts. As we can see, the LLC of a considerable number of spam accounts is 0. When the LLC of a target-node is 0, there is no interaction among its neighbor-nodes, which supports the third observation. Although the local clustering coefficient measures the degree of closeness among the neighbor-nodes of a target-node, it cannot clearly distinguish spam accounts from normal accounts because different target-nodes have different numbers of neighbor-nodes. In the rest of this article, we introduce our algorithm GroupFound.

Figure 5.

Local clustering coefficient of accounts.

Algorithm

In this section, we state the problem and then present the details of our algorithm.

Problem statement

To detect suspicious accounts in OSNs, we propose a lightweight algorithm, GroupFound, which focuses on the structure of the local graph. We use groups to represent the communities from which the neighbor-nodes of a target-node come. For an account $A_{i}$ ($i$ is the account number), we use breadth-first search (BFS) to find its neighbor-nodes and insert all these nodes into a collection $C_{i}$. Then, for each node $N_{j} \in C_{i}$, we find its neighbor-nodes and set up the corresponding collection $C_{j}$. After that, we have two layers of bi-follow relationships of $A_{i}$, which can be abstracted as an undirected graph $G_{i}$.

The fundamentals of the algorithm

For an account $A_{i}$, we create a graph $G_{i}$. After that, we compute the similarity of every pair of nodes in $C_{i}$. For nodes $N_{j}, N_{k} \in C_{i}$, we get $C_{j}$ and $C_{k}$. We then use the Jaccard index, a statistic for comparing the similarity of two collections. It measures the similarity between $C_{j}$ and $C_{k}$ as the size of their intersection divided by the size of their union

$$\mathrm{Jaccard}(C_{j}, C_{k}) = \frac{|C_{j} \cap C_{k}|}{|C_{j} \cup C_{k}|} \tag{4}$$

We use formula (4) to compute the similarity of every pair of nodes in $C_{i}$ and set up a matrix $M_{i_{n \times n}}$, where $n$ is the size of $C_{i}$. Each value in the matrix represents the similarity between two collections. When the value is equal to 1, the two collections are identical. Conversely, when the value is equal to $1 / m$, where $m$ is the total number of distinct nodes in the two collections, the two collections share only one element. Each row of $M_{i_{n \times n}}$ is an entity and each column can be seen as a feature. For instance, suppose we get a matrix $M$ as follows. For $N_{1}$ in $M$, the first row $\{1, 0.189825, \dots, 0.0204082\}$ gives the similarity between $N_{1}$ and each of $\{N_{1}, N_{2}, \dots, N_{n}\}$. The closer two nodes are, the larger the corresponding values are

$$M = \begin{array}{c|cccc} & N_{1} & N_{2} & \dots & N_{n} \\ \hline N_{1} & 1 & 0.189825 & \dots & 0.0204082 \\ N_{2} & 0.189825 & 1 & \dots & \dots \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ N_{n} & 0.0204082 & \dots & \dots & 1 \end{array}$$

To obtain the groups of $A_{i}$, we use $M_{i_{n \times n}}$ as input and identify groups by hierarchical clustering. We first set each node $N_{j} \in C_{i}$ as its own cluster, which gives $n$ clusters, and then repeatedly merge the two closest clusters. When the algorithm stops, the number of remaining clusters is called groups, and we use Group_avg to represent the average number of nodes per group. Some spammers control a large number of Sybil accounts and can make them bi-follow each other to construct communities similar to those of normal accounts. Hierarchical clustering handles this case: whenever an account establishes a bi-follow relationship with a normal user, the number of groups increases by 1, so to keep Group_avg unchanged, spammers must increase the number of Sybil nodes proportionally, which raises their costs and the risk of being detected. Moreover, if spammers use Sybil nodes to construct a community, the variance of the spam nodes becomes larger, since other nodes probably do not have bi-follow relationships with those Sybil nodes. For all rows in $M_{i_{n \times n}}$, we calculate a mean value $\mathit{Mean}_{1 \times n}$

$$\mathit{Mean}_{1 \times n} = \left(\frac{\sum_{i = 1}^{n} M_{1 i}}{n}, \frac{\sum_{i = 1}^{n} M_{2 i}}{n}, \dots, \frac{\sum_{i = 1}^{n} M_{n i}}{n}\right) \tag{5}$$

To get the distance of each node $N_{i}$ in $C_{i}$, we use the ESS,²⁶ calculated by formula (6) (formula (7) is another form of it). The formula can be derived from the $Variance$, which describes the degree of dispersion of a random variable; its formula, taken from the literature,²⁷ is formula (8). Comparing formulas (7) and (8) shows that $ESS = Variance \times n$. We compute an $ESS_{i}$ for each $N_{i}$ and get $ESS_{total} = \sum_{i = 1}^{n} ESS_{i}$. When we merge two clusters, we compute an $ESS$ for the new cluster and add it to $ESS_{total}$, choosing the merge that minimizes the increment of $ESS_{total}$

$$ESS_{i} = \sum_{j = 1}^{n} (X_{ij} - \mathit{Mean}_{1 j})^{2}, \quad X_{ij} \in C_{i} \tag{6}$$

$$ESS_{i} = \sum_{j = 1}^{n} X_{ij}^{2} - \frac{1}{n} \left(\sum_{j = 1}^{n} X_{ij}\right)^{2}, \quad X_{ij} \in C_{i} \tag{7}$$

$$\mathrm{Var}(X) = \frac{1}{n} \sum_{i = 1}^{n} X_{i}^{2} - \left(\frac{\sum_{i = 1}^{n} X_{i}}{n}\right)^{2} \tag{8}$$
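The relation $ESS = Variance \times n$ can be checked with a short derivation. It assumes that the mean subtracted in formula (6) is the mean $\bar{X} = \frac{1}{n} \sum_{j = 1}^{n} X_{j}$ of the same $n$ values being summed; the symbol $\bar{X}$ is introduced here only for this sketch.

$$\sum_{j = 1}^{n} (X_{j} - \bar{X})^{2} = \sum_{j = 1}^{n} X_{j}^{2} - 2 \bar{X} \sum_{j = 1}^{n} X_{j} + n \bar{X}^{2} = \sum_{j = 1}^{n} X_{j}^{2} - \frac{1}{n} \left(\sum_{j = 1}^{n} X_{j}\right)^{2} = n \, \mathrm{Var}(X)$$

The middle step uses $\bar{X} \sum_{j} X_{j} = n \bar{X}^{2} = \frac{1}{n} (\sum_{j} X_{j})^{2}$, and the last step is formula (8) multiplied by $n$.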

For $M_{i_{n \times n}}$, if we merge nodes $N_{1}$ and $N_{2}$, we put node $N_{2}$ into the cluster of $N_{1}$ and remove the cluster of $N_{2}$. We use $\mu_{C_{1} \cup C_{2}}$ to represent the mean point of $C_{1}$ and $C_{2}$ and calculate it by formula (9). The $ESS$ of the new cluster is calculated by formula (10) and added to $ESS_{total}$ (formula (11) is another form of formula (10), where $D(x, \mu_{C_{1} \cup C_{2}})$ is the distance between $x$ and $\mu_{C_{1} \cup C_{2}}$). The algorithm stops when $ESS_{total}$ reaches the limit value

μ_{C_{1} \cup C_{2}} = (\frac{N_{11} + N_{21}}{2}, \frac{N_{12} + N_{22}}{2}, \dots, \frac{N_{1 n} + N_{2 n}}{2})

(9)

ESS_{1_2} = \sum_{i = 1}^{2} \sum_{j = 1}^{n} {(X_{ij} - μ_{C_{1} \cup C_{2}})}^{2}

(10)

ESS_{1_2} = \sum_{x \in C_{1} \cup C_{2}} D {(x, μ_{C_{1} \cup C_{2}})}^{2}

(11)
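As a small illustration of formulas (9) to (11), the sketch below merges two single-node clusters, each given as one row of the similarity matrix. The function name `merged_ess` and the example rows are hypothetical and chosen only for this illustration.

```python
def merged_ess(row_a, row_b):
    """ESS of the cluster formed by merging two single-node clusters.

    row_a and row_b are the similarity-matrix rows of the two nodes.
    """
    # Formula (9): the mean point is the component-wise midpoint.
    mu = [(a + b) / 2 for a, b in zip(row_a, row_b)]
    # Formulas (10)/(11): sum of squared distances of both rows to the mean point.
    return sum((x - m) ** 2 for row in (row_a, row_b) for x, m in zip(row, mu))


n1 = [1.0, 0.5]  # illustrative row for N_1
n2 = [0.5, 1.0]  # illustrative row for N_2
ess = merged_ess(n1, n2)
```

For two single nodes, this value equals half of the squared Euclidean distance between the two rows, which is why merging the two closest nodes also gives the smallest increment of the total ESS.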

The details of GroupFound

This part describes the details of GroupFound, which mainly consists of three procedures. First, Procedure 1 collects the two layers of neighbor-nodes of the target-node. In this procedure, we set $depth$ to 3 and the initial value of $deep$ to 1.

Procedure 1. GroupFound-getNeighbors.

function getNeighbors(target-node, deep)
    for node in neighbor-nodes of target-node do
        $C_{target - node}$.insert(node)
    end for
    if deep < depth then
        for node in $C_{target - node}$ do
            getNeighbors(node, deep + 1)
        end for
    end if
end function
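The neighbor collection of Procedure 1 can be sketched in Python as a breadth-first traversal. This is a minimal sketch, not the authors' implementation: the graph is assumed to be a dictionary mapping each account to the set of its bi-followers, and the names `collect_neighbors` and `layers` are hypothetical. The number of layers is exposed as a parameter because the text speaks of two layers while Procedure 1 sets $depth$ to 3.

```python
from collections import deque


def collect_neighbors(graph, target, layers=2):
    """Return all nodes within `layers` hops of `target`, excluding `target`.

    graph: dict mapping a node to an iterable of its bi-followers (assumed format).
    """
    seen = {target}
    frontier = deque([(target, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == layers:
            continue  # do not expand beyond the requested number of layers
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    seen.discard(target)
    return seen


# Toy bi-follow graph, purely illustrative.
toy_graph = {
    "t": {"a", "b"},
    "a": {"t", "c"},
    "b": {"t"},
    "c": {"a", "d"},
    "d": {"c"},
}
two_layers = collect_neighbors(toy_graph, "t", layers=2)
```

An iterative traversal with a visited set is used here instead of the recursion of Procedure 1 so that nodes reachable by several paths are expanded only once.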

Then, in Procedure 2, we compute the similarity of every pair of nodes among the neighbor-nodes of a target-node using the Jaccard index, a statistic for comparing the similarity of two collections. The result is a similarity matrix, which is used to calculate the groups of the target-node.

Procedure 2. GroupFound-getSimilarity.

function getSimilarity(i)
    for $N_{j} \in C_{i}$ do
        for $N_{k} = N_{j} \to Next$ , $N_{k} \in C_{i}$ do
            Jaccard($C_{j}$ , $C_{k}$)
            move to the next $N_{k}$
        end for
        move to the next $N_{j}$
    end for
end function
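A minimal Python sketch of Procedure 2 follows. It assumes that the collection compared for each neighbor-node is the set of that node's own bi-followers and that the diagonal of the matrix is set to 1; both are assumptions made for illustration, and the names `jaccard` and `similarity_matrix` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |a ∩ b| / |a ∪ b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(graph, nodes):
    """Build the symmetric n x n similarity matrix for the given nodes.

    graph: dict mapping a node to an iterable of its bi-followers (assumed format).
    Returns the node order used for rows/columns and the matrix itself.
    """
    order = sorted(nodes)
    neighbor_sets = [set(graph.get(v, ())) for v in order]
    n = len(order)
    matrix = [[1.0] * n for _ in range(n)]  # diagonal stays 1.0 by convention
    for j in range(n):
        for k in range(j + 1, n):  # each unordered pair is visited once
            matrix[j][k] = matrix[k][j] = jaccard(neighbor_sets[j], neighbor_sets[k])
    return order, matrix


toy_graph = {"a": {"t", "c"}, "b": {"t"}, "c": {"a", "d"}}
order, m = similarity_matrix(toy_graph, {"a", "b", "c"})
```

Visiting only pairs with $k > j$ mirrors the inner loop of Procedure 2, which starts from the node after $N_{j}$.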

Finally, we use the similarity matrix $M_{i_{n \times n}}$ as input. Each row of $M_{i_{n \times n}}$ is an entity and each column can be seen as a feature. In Procedure 3, we set up a cluster for each node $N \in C_{target - node}$ . For all rows in $M_{i_{n \times n}}$ , we calculate a mean value vector $Mean_{1 \times n}$ and the distance from $N_{i}$ to $Mean_{1 \times n}$ . The distance metric is the $ESS$ described above. Our algorithm is mainly based on hierarchical clustering: each time we merge two clusters, we compute the $ESS$ of the new cluster and choose the merge that minimizes the increment of $ESS_{total}$ . When the distance $\geq (n - 1)^{3} / n^{3}$ , GroupFound stops and we obtain the groups of the target-node.

Procedure 3. GroupFound-getGroup_avg.

function getGroups(target-node)
    for node $\in C_{target - node}$ do
        set up a cluster for node
    end for
    while $distance < \frac{(n - 1)^{3}}{n^{3}}$ do
        clustering(similar)
        nodes += 1
    end while
    groups = clusters
    Group_avg = nodes/groups
end function
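The clustering loop of Procedure 3 can be sketched as follows. This is one possible reading, not the authors' code: the distance between two clusters is taken to be the increase in ESS caused by merging them (Ward's criterion), and merging stops as soon as the smallest such increase reaches $(n - 1)^{3} / n^{3}$ . The function names and the toy matrices are hypothetical.

```python
from itertools import combinations


def centroid(rows):
    """Component-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def ess_increase(rows_a, rows_b):
    """Increase in total ESS if the two clusters are merged (Ward's criterion)."""
    ca, cb = centroid(rows_a), centroid(rows_b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    na, nb = len(rows_a), len(rows_b)
    return na * nb / (na + nb) * sq_dist


def group_avg(similarity):
    """Return (groups, Group_avg) for an n x n similarity matrix."""
    n = len(similarity)
    if n == 0:
        return 0, 0.0
    limit = (n - 1) ** 3 / n ** 3  # the irrelevant distance
    clusters = [[row] for row in similarity]  # every node starts as a cluster
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ess_increase(clusters[pair[0]], clusters[pair[1]]),
        )
        if ess_increase(clusters[i], clusters[j]) >= limit:
            break  # remaining clusters are treated as irrelevant to each other
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return len(clusters), n / len(clusters)


# Two tightly connected blocks of three nodes each (illustrative values).
blocks = [
    [1.0, 0.9, 0.9, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.9, 0.0, 0.0, 0.0],
    [0.9, 0.9, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.9, 0.9],
    [0.0, 0.0, 0.0, 0.9, 1.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.9, 1.0],
]
groups, avg = group_avg(blocks)

# Worst case: 1 on the diagonal and 1/n elsewhere.
n = 5
worst = [[1.0 if r == c else 1.0 / n for c in range(n)] for r in range(n)]
worst_groups, worst_avg = group_avg(worst)
```

Under this reading, the block example yields two groups of three nodes, while the worst-case matrix produces no merge at all, so every neighbor-node forms its own group and Group_avg is 1.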

The irrelevant distance $(n - 1)^{3} / n^{3}$ : suppose there is a spam node $S$ whose number of neighbor-nodes is $n$ . In the worst case, there is no relationship between the neighbor-nodes of $S$ , and we set up the similarity matrix as follows. We then calculate a mean value $Mean_{s}$ using formula (5), which gives formula (12). We arbitrarily choose a neighbor-node and compute its distance to $Mean_{s}$ using formula (6), which gives formula (13); simplifying formula (13) yields formula (14). We use this value as the irrelevant distance: when the distance between clusters is greater than $(n - 1)^{3} / n^{3}$ , the remaining clusters are irrelevant to each other and the algorithm stops

\begin{array}{c|cccc} & N_{1} & N_{2} & \dots & N_{n} \\ \hline N_{1} & 1 & \frac{1}{n} & \dots & \frac{1}{n} \\ N_{2} & \frac{1}{n} & 1 & \dots & \frac{1}{n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ N_{n} & \frac{1}{n} & \frac{1}{n} & \dots & 1 \end{array}

Mean_{s} = (\frac{2 n - 1}{n^{2}}, \frac{2 n - 1}{n^{2}}, \dots, \frac{2 n - 1}{n^{2}})

(12)

distance = {(1 - \frac{2 n - 1}{n^{2}})}^{2} + {(\frac{1}{n} - \frac{2 n - 1}{n^{2}})}^{2} (n - 1)

(13)

distance = \frac{{(n - 1)}^{3}}{n^{3}}

(14)
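The simplification from formula (13) to formula (14) can be checked term by term. The worked steps below use only the quantities already defined in formulas (12) and (13).

```latex
\begin{align}
1 - \frac{2n - 1}{n^{2}} &= \frac{n^{2} - 2n + 1}{n^{2}} = \frac{(n - 1)^{2}}{n^{2}} \\
\frac{1}{n} - \frac{2n - 1}{n^{2}} &= \frac{n - (2n - 1)}{n^{2}} = -\frac{n - 1}{n^{2}} \\
distance &= \frac{(n - 1)^{4}}{n^{4}} + (n - 1)\,\frac{(n - 1)^{2}}{n^{4}}
          = \frac{(n - 1)^{3}\,\bigl[(n - 1) + 1\bigr]}{n^{4}}
          = \frac{(n - 1)^{3}}{n^{3}}
\end{align}
```

For $n = 30$ this gives $29^{3} / 30^{3} = 24389 / 27000 \approx 0.903296$ .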

We use the irrelevant distance as a unified metric. Figure 6 shows a hierarchical clustering dendrogram, in which the x-axis represents the distance between clusters and the y-axis is the order of nodes. The formation of the tree structure corresponds to the process of clustering. GroupFound stops when the distance $\geq 0.903296 ((30 - 1)^{3} / 30^{3})$ . There are three groups and 30 nodes in Figure 5, so $Group_avg = 30 / 3 = 10$ .

Figure 6.

Hierarchical clustering dendrogram.

To detect suspicious accounts in OSNs, we propose a lightweight algorithm, GroupFound, which focuses on the structure of the local graph. We leverage the local location features of a node in the bi-follow relationship graph to get the groups of the target-node, and we use Group_avg to represent the average number of nodes in each group. As the bi-followers of spam accounts usually come from different social communities, the Group_avg of spam accounts is very small. Each account yields one Group_avg, and by comparing the Group_avg of an account with a threshold, we can identify suspicious accounts. GroupFound can therefore be applied to a single node. To identify suspicious accounts, we still need further experiments to find a threshold that clearly distinguishes normal accounts from suspicious accounts.

Our algorithm is based on hierarchical clustering. It takes $C_{i}$ (the collection of neighbor-nodes of a target-node) as input and builds a similarity matrix. When the algorithm begins, every node is a cluster, so the initial number of clusters is the number of neighbor-nodes. We then calculate the distance between these clusters according to the similarity matrix and merge the two closest clusters in each step, so the number of clusters decreases during this process. In addition, we use the irrelevant distance ${(n - 1)}^{3} / n^{3}$ as the termination condition of our algorithm. When the algorithm terminates, the number of remaining clusters is the number of groups of the target-node.

Difference between our algorithm and community detection technologies: most local community detection approaches start with one (or more) seed node(s) and greedily add neighboring nodes until a sufficiently strong community is found.²⁸ A group in our algorithm differs from a community in that some groups contain only one node. In particular, spam nodes have a large number of groups formed by a single node. Such a division is meaningless for community detection, but our algorithm makes use of these single-node groups of spam nodes (which reduce the average number of nodes over all groups) to detect suspicious nodes.

Difference between our algorithm and graph clustering technologies: graph-based clustering methods usually place normal nodes in one cluster according to the different characteristics of normal nodes and spam nodes, and then treat the nodes outside that cluster as suspicious. Sometimes these methods cluster normal nodes and suspicious nodes separately and sample nodes in the two clusters to determine whether the other nodes in a cluster are suspicious. Our algorithm instead uses clustering to merge any two clusters whose distance is smaller than the irrelevant distance $(n - 1)^{3} / n^{3}$ . When the algorithm terminates, the nodes in the remaining clusters may be normal or suspicious. These clusters form only because of the location of the nodes in the graph. We do not distinguish between these clusters; the final purpose of our algorithm is to get the number of groups of the target-node (the number of remaining clusters) for further analysis.

Evaluation

In this section, we evaluate our algorithm GroupFound. First, we introduce the confusion matrix, which is used to evaluate the accuracy of our algorithm. Then, we run GroupFound on the Sina Weibo dataset and get the $Group_avg$ of each account. Next, we use the ROC curve and the AUC value to determine the effectiveness of GroupFound. From these, we obtain an optimal threshold to distinguish normal accounts from suspicious accounts, and we compare this threshold with the $Group_avg$ of each account to decide whether the account is suspicious. Finally, we compute the accuracy and speed of GroupFound.

Confusion matrix

We use the confusion matrix to evaluate the accuracy of our algorithm. Each column of the matrix represents a predicted class and each row represents an actual class. As shown in Table 1, the confusion matrix relates the true positive rate (TPR), false positive rate (FPR), false negative rate (FNR), and true negative rate (TNR).

Table 1.

Confusion matrix.

Actual	Predicted
	Spam accounts	Normal accounts
Spam accounts	TP	FN
Normal accounts	FP	TN

TP: true positive; FN: false negative; FP: false positive; TN: true negative.

An effective detection system has to balance these metrics. Our algorithm aims to detect spammers in OSNs, so it needs to identify as many spammers as possible while reducing the misjudgment of normal accounts. Based on these considerations, we use formula (15) to calculate the accuracy of our algorithm

Accuracy = \frac{TP + TN}{FP + FN + TP + TN}

(15)

Experiment

Our evaluation environment is an IBM System X3100 M4. This server is populated with eight 3.30-GHz Intel Xeon CPU E3-1230 V2, 32 GB memory, a 3000.0-GB hard-disk and connected by a 1000-Mbit Ethernet.

Data preprocessing: before starting the experiment, we preprocessed the experimental data. To reduce the FPR of the algorithm, we removed accounts that have no bi-follow relationships with other accounts in the dataset. These accounts are isolated points in the social graph and therefore cannot spread malicious information to other accounts, so we consider them harmless. Through preprocessing, we find 9270 harmless accounts.

After data preprocessing, we run our algorithm on the Sina Weibo dataset, which contains 215,211 normal accounts and 22,417 spam accounts. We get the $Group_avg$ of each account and draw its distribution, where the X-axis represents the value of $Group_avg$ and the Y-axis represents the number of corresponding accounts. Figure 7 shows the $Group_avg$ distribution of normal accounts and spam accounts. In Figure 7(b), for spam accounts, the number of accounts drops rapidly as $Group_avg$ increases. In Figure 7(a), for normal accounts, the number of accounts decreases more slowly as $Group_avg$ rises, and a small inflection point appears when $Group_avg$ equals 10. Through analysis, we find that the $Group_avg$ of almost all spam accounts is less than 6, while the $Group_avg$ of normal accounts is concentrated around 10. To spread malicious URLs widely, spammers need to follow a lot of normal accounts; however, only a few normal accounts follow spam accounts back. Spammers cannot control normal accounts to follow them; thus, their bi-followers usually come from different social communities. In contrast, normal accounts establish bi-follow relationships that correspond to relationships in the real world, such as friends and colleagues. These accounts probably have relationships with one another, so the bi-followers may come from the same social community. Our algorithm calculates the number of $groups$ , which represents how many communities the neighbor-nodes of a target-node come from, and uses $Group_avg$ to indicate the average number of accounts in each group. The larger the $Group_avg$ is, the more likely the target-node is normal. Conversely, when the $Group_avg$ is very small, the target-node is likely to be a malicious node.

Figure 7.

Group_avg of account: (a) normal accounts and (b) spam accounts.

ROC curve and AUC value

To determine the effectiveness of the algorithm, we further analyze the experimental results. Almost all accounts have a $Group_avg$ of less than 100. We look for a threshold, denoted $α$ , that clearly distinguishes suspicious accounts from normal accounts. Since $Group_avg$ is at least 1, the range of $α$ is from 1 to 100, and we increase it in steps of 0.01 ( $1, 1.01, 1.02, \dots$ ). For each threshold, we compute the FPR and TPR of our algorithm. Figure 8 shows the resulting ROC curve, in which the X-axis represents FPR and the Y-axis represents TPR.

Figure 8.

ROC curve of GroupFound.
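The threshold sweep described above can be sketched in a few lines of Python. The score lists below are toy values invented for illustration, not data from the experiment, and the function name `roc_points` is hypothetical. An account is flagged as suspicious when its Group_avg is less than the threshold.

```python
def roc_points(spam_scores, normal_scores, start=1.0, stop=100.0, step=0.01):
    """Return a list of (alpha, FPR, TPR) for every threshold in [start, stop]."""
    steps = round((stop - start) / step)
    points = []
    for k in range(steps + 1):
        # Recompute alpha from k to avoid accumulating floating-point drift.
        alpha = round(start + k * step, 2)
        tpr = sum(s < alpha for s in spam_scores) / len(spam_scores)
        fpr = sum(s < alpha for s in normal_scores) / len(normal_scores)
        points.append((alpha, fpr, tpr))
    return points


# Toy Group_avg values (hypothetical).
spam = [1.0, 1.5, 2.0, 4.0]
normal = [2.5, 8.0, 10.0, 12.0]
points = roc_points(spam, normal)
by_alpha = {alpha: (fpr, tpr) for alpha, fpr, tpr in points}
```

With these toy values, the threshold 3.01 flags three of the four spam accounts and one of the four normal accounts.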

The value of AUC

AUC is a probability value: when we randomly select a positive sample and a negative sample, the AUC is the probability that the positive sample is ranked before the negative sample. The larger the AUC value is, the better spammers can be identified. To compute the AUC value, we divide the area under the ROC curve into small rectangles and sum their areas.

For each rectangle, we use the TPR value of the point on its left side, so the computed AUC value is slightly smaller than the actual one. The resulting AUC value of the ROC curve is 0.887003. This means that if we randomly select one spam account and one normal account, the probability that GroupFound ranks the spam account before the normal one is at least 0.887003. Therefore, our algorithm can distinguish spam accounts from normal accounts effectively.
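The rectangle rule described above can be sketched as follows. The function name `auc_left_rectangles` and the three sample points are hypothetical and serve only to show why the left-side rule gives a lower estimate on a non-decreasing curve.

```python
def auc_left_rectangles(points):
    """Approximate the area under an ROC curve with left-side rectangles.

    points: iterable of (FPR, TPR) pairs; they are sorted by FPR first.
    """
    pts = sorted(points)
    # Each rectangle spans [x0, x1] and takes the TPR of its left point.
    return sum((x1 - x0) * y0 for (x0, y0), (x1, _) in zip(pts, pts[1:]))


def auc_trapezoids(points):
    """Trapezoid rule on the same points, shown only for comparison."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


sample = [(0.0, 0.0), (0.5, 0.8), (1.0, 1.0)]  # illustrative ROC points
left = auc_left_rectangles(sample)
trap = auc_trapezoids(sample)
```

On a curve where TPR does not decrease as FPR grows, the left-rectangle value never exceeds the trapezoid value, which is consistent with the statement that the reported AUC is a slight underestimate.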

Threshold

On the ROC curve, we consider the point closest to the upper left corner to be the most appropriate threshold. To obtain this point, we use formula (16)

S_{min} = \min_{α = 1}^{α = 100} {[(X_{α - 0.01} + X_{α} + X_{α + 0.01}) - (X_{α - 0.01} Y_{α - 0.01} + X_{α} Y_{α} + X_{α + 0.01} Y_{α + 0.01})] / 3}

(16)

When the threshold is $α$ , $X_{α}$ represents the corresponding X-axis value on the ROC curve and $Y_{α}$ represents the Y-axis value. $S_{min}$ represents the area enclosed by the points on the ROC curve and the straight lines $X = 0$ and $Y = 1$ . The $α$ value at which this area reaches its minimum is the optimal threshold, which balances FPR and TPR. In our experiment, $α = 3.01$ .
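The idea of picking the ROC point closest to the upper left corner can be illustrated with a short sketch. Note that this sketch swaps in the plain Euclidean distance to the corner $(0, 1)$ as the selection rule; it is not an implementation of formula (16). The function name and the sample points are hypothetical.

```python
import math


def closest_to_corner(points):
    """Return the (alpha, FPR, TPR) triple nearest to the corner FPR=0, TPR=1.

    points: iterable of (alpha, FPR, TPR) triples.
    Assumption: Euclidean distance is used instead of formula (16).
    """
    return min(points, key=lambda p: math.hypot(p[1], 1.0 - p[2]))


# Illustrative ROC points, not taken from the experiment.
sample_points = [
    (1.0, 0.0, 0.0),
    (2.0, 0.1, 0.6),
    (3.0, 0.2, 0.9),
    (4.0, 0.6, 1.0),
]
best = closest_to_corner(sample_points)
```

In this toy example the point at threshold 3.0 is selected, because it trades a small FPR for a high TPR.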

Performance

In this section, we evaluate the performance of our algorithm in two aspects: accuracy and speed.

Accuracy

We use $α = 3.01$ to evaluate the accuracy of our algorithm. After running GroupFound on the Sina Weibo dataset, we get the $Group_avg$ of each account. If an account's $Group_avg$ is greater than $α$ , we consider it a normal account; if its $Group_avg$ is less than $α$ , we consider it a suspicious one. Our experimental dataset includes 237,628 accounts, among which 215,211 are normal accounts and 22,417 are spam accounts. Table 2 shows the statistical result: 18,176 of the 22,417 spam accounts are identified as suspicious, and 18,378 of the 215,211 normal accounts are identified as suspicious. We use the two parameters TPR and FPR to evaluate our algorithm. At the optimal threshold value $α = 3.01$ , the TPR is $81.08 %$ and the FPR is $8.54 %$ . According to formula (15), the accuracy of GroupFound is $86.27 %$ .

Table 2.

Statistical data of evaluation on Sina Weibo.

	Normal accounts	Spam accounts
Total number	215,211	22,417
Identified normal	196,833	4241
Identified spam	18,378	18,176
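The rates can be recomputed directly from the counts in Table 2. The sketch below is only a check on the arithmetic; the function name `rates` is hypothetical. On these counts, TPR and FPR reproduce the reported 81.08% and 8.54%. Formula (15) evaluates to about 90.48%, whereas the mean of TPR and TNR comes to about 86.27%, which coincides with the reported accuracy figure; the text does not state which definition was applied, so this is noted only as an observation.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, accuracy) from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # formula (15)
    return tpr, fpr, accuracy


# Counts from Table 2.
tp, fn = 18176, 4241      # spam accounts identified as spam / as normal
fp, tn = 18378, 196833    # normal accounts identified as spam / as normal

tpr, fpr, accuracy = rates(tp, fn, fp, tn)
mean_tpr_tnr = (tpr + (1.0 - fpr)) / 2  # average of TPR and TNR
```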

Speed

To evaluate the running speed of our algorithm, we use GroupFound to detect 237,628 accounts and record the elapsed time after every 5000 detected accounts. Table 3 shows the resulting distribution of the running time (RT). The first row is the timestamp, where each timestamp corresponds to the moment at which another 5000 accounts have been detected. The second row is the actual RT of GroupFound. In the last row, $t$ represents the time cost for detecting each batch of 5000 accounts. The value of $t$ differs between timestamps because the number of bi-followers differs between accounts: the more bi-followers an account has, the more time it consumes. In total, GroupFound uses 7604 s to detect 237,628 accounts, so the average time for detecting each account is around 0.032 s.

Table 3.

Running time of GroupFound.

Timestamp	1	2	3	4	5
RT (s)	221.1	495.6	753.7	855.3	1122.4
$t (s)$	221.1	274.4	240.1	101.6	267.1
Timestamp	6	7	8	9	10
RT (s)	1303.3	1457.3	1762.5	2073.5	2217.0
$t (s)$	180.9	154.0	305.2	311.0	143.5

RT: running time.

Most existing detection methods based on machine learning train a classifier according to social behaviors, account characteristics, and content characteristics. Once trained, the classifier can detect suspicious accounts quickly. Compared with these methods, GroupFound is a lightweight algorithm that does not need training data in advance. Compared with other social graph-based algorithms, the time complexity of GroupFound is higher than that of most of them. However, unlike most existing graph-based detection methods, we do not need to set up a huge graph. When only one node needs to be detected, GroupFound performs better. The uniqueness of GroupFound is that it focuses on the local structure of the graph and can detect one account at a time.

Analysis

First, we analyze the FPR of the algorithm. The experimental results show that the $Group_avg$ of some normal accounts is very small, in some cases equal to 1. We analyzed these accounts and found two kinds of accounts that lead to this result. The first kind is inactive accounts, most of which have established few bi-follow relationships. The second kind is accounts with a very wide range of social friends, whose users are likely to establish social relationships with strangers. Both cases result in a $Group_avg$ value close to 1. For FNR, we focus on the spam accounts whose $Group_avg$ is large. We found that the structure of these accounts in the bi-followers graph is similar to that of normal ones. We think these are compromised accounts, whose bi-follow relationships were built when they were normal accounts. Our algorithm is based on the location characteristics of accounts in the graph, so it cannot detect compromised accounts effectively. To make our detection algorithm more effective, we need to collect more experimental data and analyze it further.

Comparison

In this section, we will compare GroupFound against other approaches in terms of its effectiveness in detecting suspicious accounts.

PageRank:²⁹ we implement the PageRank method and use it to detect suspicious nodes in the social graph. Evidence shows that most spam accounts are programmed and run automatically. To propagate spam information widely, a spam account needs to follow a lot of normal accounts to attract them to follow it back. Therefore, the out-degree of a spam account is very large and its in-degree is very small. Eventually, each node gets a rank value; the smaller the value, the more suspicious the node is.
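For readers unfamiliar with the baseline, a minimal power-iteration sketch of PageRank on a follow graph is given below. It is not the implementation used in the comparison: the damping factor 0.85, the fixed number of iterations, the uniform handling of nodes without outgoing links, and the toy graph are all assumptions made for illustration. An edge from `u` to `v` means that `u` follows `v`.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Return a dict node -> rank computed by power iteration.

    out_links: dict mapping a node to a list of distinct nodes it follows.
    """
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in out_links.items():
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


# Toy follow graph: "c" follows others but nobody follows "c".
toy_follows = {"a": ["b"], "b": ["a"], "c": ["b"]}
ranks = pagerank(toy_follows)
```

In this toy graph the node that only follows others and has no followers receives the lowest rank, which matches the intuition that a low rank value indicates a more suspicious node.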

SybilDefender:¹⁹ it consists of three components: a Sybil identification algorithm, a Sybil community detection algorithm, and two supporting approaches to limiting the number of attack edges. Following the pseudo-code of SybilDefender, we implement one part of it, the Sybil identification algorithm, which has two phases. In the first phase, the algorithm takes the social graph $G (V, E)$ and an honest node $h$ as input and outputs the thresholds used by the second phase to identify Sybil nodes. For each suspicious node $u$ , the algorithm performs random walks originating from $u$ and calculates the number of nodes $m$ whose frequency is no smaller than $t$ . We set the value of $t$ to 5, consistent with the original experimental parameter. In the second phase, the algorithm compares $m$ with the thresholds to determine whether $u$ is a Sybil node.

We run PageRank and SybilDefender on our dataset and show the ROC curve for each method. In Figure 9, the black curve is the ROC curve of GroupFound, the red one is the ROC curve of SybilDefender, and the green one is the ROC curve of PageRank. We then calculate the AUC value of each curve; the AUC values of the black, red, and green curves are 0.887003, 0.858296, and 0.829042, respectively. Compared with PageRank and SybilDefender, our algorithm distinguishes spam accounts from normal accounts more effectively. To compare with more existing studies, we still need to contact other authors and ask for the related sources. In addition, most graph-based algorithms only detect malicious nodes at a single instant in a social network. As the node structure in a social graph changes over time, we need to keep crawling more data and consider the time factor to improve our algorithm in future research.

Figure 9.

ROC curve.

Conclusion

OSNs have become an important part of people's lives and are also the platform on which spammers use suspicious accounts to spread malicious URLs. Most existing works mainly utilize machine learning methods based on features. However, once spammers disguise the key features, such detection methods become invalid. They also perform unreliably when detecting suspicious accounts with unknown features. In this article, we propose a lightweight algorithm, GroupFound, which focuses on the structure of the local graph. As the bi-followers of an account come from different social communities, we divide all accounts into different groups and compute the average number of accounts in these groups. The experimental results indicate that the time cost of our algorithm is relatively low. In future work, we will continue to improve the performance of our algorithm. We also welcome communication with other researchers working in this field.

Footnotes

Academic Editor: Michele Amoretti

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China under grant no. 61472162.

References

1. Barracuda labs 2010 annual security report. Technical report, March 2010, https://barracudalabs.com/downloads/2010EndyearSecurityReportFINAL.pdf

2. Liu, Zhang, Xiang. Statistical detection of online drifting Twitter spam: invited paper. In: Proceedings of the 11th ACM Asia conference on computer and communications security, Xi’an, China, 30 May–3 June 2016, pp.1–10. New York: ACM.

3. Wang, Zheng. Man vs machine: practical adversarial detection of malicious crowdsourcing workers. In: Proceedings of the 23rd USENIX security symposium (USENIX security 14), San Diego, CA, 20–22 August 2014, pp.239–254. USENIX Association.

4. Zheng, Wang, Jie. Two phase based spammer detection in weibo. In: Proceedings of the 2015 IEEE international conference on data mining workshop (ICDMW), Atlantic City, NJ, 14–17 November 2015, pp.932–939. New York: IEEE.

5. Zhou, Chen. Observation on spammers in sina weibo. In: Proceedings of the 2nd international conference on computer science and electronics engineering, Hangzhou, China, 22–23 March 2013. Paris: Atlantis Press.

6. Welch, Gao. A novel scoring model to detect potential malicious web pages. In: Proceedings of the 2012 IEEE 11th international conference on trust security and privacy in computing and communications (TrustCom), Liverpool, 25–27 June 2012, pp.254–263. New York: IEEE.

7. Viswanath, Bashir, Crovella. Towards detecting anomalous user behavior in online social networks. In: Proceedings of the 23rd USENIX security symposium (USENIX security 14), San Diego, CA, 20–22 August 2014, pp.223–238. USENIX Association.

8. Jiang, Cui, Beutel. Detecting suspicious following behavior in multimillion-node social networks. In: Proceedings of the companion publication of the 23rd international conference on world wide web companion, international world wide web conferences steering committee, Seoul, Korea, 7–11 April 2014, pp.305–306. New York: ACM.

9. Wang, Konolige, Wilson. You are how you click: clickstream analysis for sybil detection. In: Proceedings of the 22nd USENIX security symposium, Washington, DC, 15 August 2013, pp.1–15.

10. Cao, Yang. Uncovering large groups of active malicious accounts in online social networks. In: Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, Scottsdale, AZ, 3–7 November 2014, pp.477–488. New York: ACM.

11. Eshete, Venkatakrishnan. Webwinnow: leveraging exploit kit workflows to detect malicious URLs. In: Proceedings of the 4th ACM conference on data and application security and privacy, San Antonio, TX, 3–5 March 2014, pp.305–312. New York: ACM.

12. Yang, Harkreader. Die free or live hard? empirical evaluation and new design for fighting evolving Twitter spammers. In: Proceeding of the 14th International conference on recent advances in intrusion detection, Menlo Park, CA, 20–21 September 2011, pp.318–337. Berlin, Heidelberg: Springer.

13. Aggarwal, Rajadesingan, Kumaraguru. Phishari: Automatic realtime phishing detection on Twitter. In: Proceedings of the eCrime researchers summit (eCrime), Las Croabas, Puerto Rico, 23–24 October 2012, pp.1–12. New York: IEEE.

14. Eshete, Villafiorita, Weldemariam. Binspect: holistic analysis and detection of malicious web pages. In: Proceeding of the international conference on security and privacy in communication networks, Padua, Italy, 3–5 September 2012, pp.149–166. Berlin, Heidelberg: Springer.

15. Almaatouq, Shmueli, Nouh. If it looks like a spammer and behaves like a spammer, it must be a spammer: analysis and detection of microblogging spam accounts. Int J Inf Secur 2016; 15(5): 475–491.

16. Zheng, Zeng, Chen. Detecting spammers on social networks. Neurocomputing 2015; 159: 27–34.

17. Wang. Don’t follow me: spam detection in Twitter. In: Proceedings of the 2010 international conference on security and cryptography (SECRYPT), Athens, 26–28 July 2010, pp.1–10. New York: IEEE.

18. Cao, Sirivianos, Yang. Aiding the detection of fake accounts in large scale social online services. In: Proceedings of the 9th USENIX symposium on networked systems design and implementation (NSDI 12), San Jose, CA, 25–27 April 2012, pp.197–210. USENIX Association.

19. Wei, Tan. Sybildefender: defend against sybil attacks in large social networks. In: Proceedings of the IEEE INFOCOM, Orlando, FL, 25–30 March 2012, pp.1951–1959. New York: IEEE.

20. Xue, Yang. Votetrust: Leveraging friend invitation graph to defend against social network sybils. In: Proceedings of the IEEE INFOCOM, Turin, 14–19 April 2013, pp.2400–2408. New York: IEEE.

21. Tan, Guo, Chen. Unik: unsupervised social network spam detection. In: Proceedings of the 22nd ACM international conference on information & knowledge management, San Francisco, CA, 1 November 2013, pp.479–488. New York: ACM.

22. Beutel, Guruswami. Copycatch: stopping group attacks by spotting lockstep behavior in social networks. In: Proceedings of the 22nd international conference on world wide web, international world wide web conferences steering committee, Rio de Janeiro, Brazil, 13–17 May 2013, pp.119–130. New York: ACM.

23. Gong, Frank, Mittal. Sybilbelief: A semi-supervised learning approach for structure-based sybil detection. IEEE T Inf Foren Sec 2014; 9(6): 976–987.

24. Gibbons, Kaminsky. Sybillimit: a near-optimal social network defense against sybil attacks. In: Proceedings of the 2008 IEEE symposium on security and privacy, Washington, DC, 18–21 May 2008, pp.3–17. New York: IEEE.

25. Kaminsky, Gibbons. Sybilguard: defending against sybil attacks via social networks. ACM SIGCOMM Comp Com 2006; 36: 267–278.

26. Anderberg MR. Cluster analysis for applications: probability and mathematical statistics: a series of monographs and textbooks, vol. 19. Cambridge, MA: Academic Press, 2014.

27. Variance, https://en.wikipedia.org/wiki/Variance

28. Viswanath, Post, Gummadi. An analysis of social network-based sybil defenses. ACM SIGCOMM Comp Com 2010; 40(4): 363–374.

29. Page, Brin, Motwani. The pagerank citation ranking: bringing order to the web. Technical report, Stanford InfoLab, Stanford, CA, 1999.

GroupFound: An effective approach to detect suspicious accounts in online social networks
