J Supercomput
DOI 10.1007/s11227-015-1437-5
ELM-based spammer detection in social networks
Xianghan Zheng1,2 · Xueying Zhang1,2 · Yuanlong Yu1,2 ·
Tahar Kechadi3 · Chunming Rong4
© Springer Science+Business Media New York 2015
Abstract Online social networks such as Facebook, Twitter, and Weibo play an important role in people’s daily lives. Most existing social network platforms, however, face the challenge of dealing with undesirable users and their malicious spam activities, which disseminate unwanted content, malware, viruses, etc. to the legitimate users of the service. The spread of spam degrades the user experience and also negatively impacts server-side functions such as data mining, user behavior analysis, and resource recommendation. In this paper, an extreme learning machine (ELM)-based supervised machine learning approach is proposed for effective spammer detection. The work first constructs a labeled dataset by crawling Sina Weibo data and manually classifying the corresponding users into spammer and non-spammer categories. A set of features is then extracted from message content and user behavior and applied to the ELM-based spammer classification algorithm. The experiments and evaluation show that the proposed solution provides excellent performance, with true positive rates for spammers and non-spammers reaching 99 and 99.95 %, respectively. As the results suggest, the proposed solution could achieve better reliability and feasibility than existing SVM-based approaches.
Yuanlong Yu ([email protected])
1 College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, China
2 Fujian Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou 350108, China
3 School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland
4 Department of Electrical Engineering and Computer Science, University of Stavanger, 4036 Stavanger, Norway
X. Zheng et al.
Keywords Social network · Spammer · Machine learning · Extreme learning machine
1 Introduction
With the development of science and technology, social networking sites such as Facebook, Twitter, and Weibo (previously Sina Weibo) have become important platforms for users to interact with their friends, post messages, discuss hot topics, and share views. According to a Statista report [4], the number of social network users had reached 2.75 billion by June 2014 and is estimated to remain around 2.33 billion users globally until the end of 2017.
However, online social platforms also attract huge interest from spammers, who use them to spread advertisements, disseminate pornography and viruses, conduct phishing, and so on. The spread of spam degrades the user experience and also negatively impacts server-side functions such as data mining, user behavior analysis, and resource recommendation [2,3]. According to Nexgate’s report [1], social spam grew by 355 % during the first half of 2013, much faster than the growth rate of authentic accounts and messages on most branded social networks.
Since spammers typically behave like legitimate users, detecting and discriminating spam is difficult. It is therefore highly desirable to develop techniques and methods for identifying spammers and their behavior in online social networks. A few proposals from industry and academia discuss possible solutions for spammer detection and analysis. These solutions, however, are either ineffective or rely on too many conditions (large numbers of content and behavior features, etc.). This paper investigates social spammer content and behavior and proposes an effective extreme learning machine (ELM)-based machine learning model for spammer detection. In summary, the paper makes the following main contributions:
1. The paper uses spammer features to detect spammers and tests the results on Sina Weibo, the biggest social network site in China. On top of the Weibo API, a dedicated dataset crawler is developed to extract the public messages of any user on the Weibo platform, including users who have not authorized the crawler. This is the first step of the data analysis.
2. The major novelty of the paper is to study a set of the most important features related
to message content and user behaviors and then apply them to the ELM-based classification algorithm for spammer detection. The experiment and comparison work
shows that the proposed solution provides higher spammer detection accuracy.
3. The paper compares several aspects of the ELM-based approach to the SVM-based
approach, including training and testing time and the sensitivity of parameters. The
performance comparison further validates the better feasibility, stability and strong
generalization ability of ELM algorithm.
It is worth mentioning that although the proposed approach is currently tested
specifically in the Sina Weibo social network, it is applicable to all other existing
social sites with minor revisions. The rest of the paper is organized as follows: Sect. 2
introduces background information related to social networks, social networking platforms, and surveys existing work on social spammer detection. Section 3 illustrates
the dataset collection and feature extraction related to content and behavior. Section 4
describes the ELM-based spammer detection model, the experiments, and corresponding evaluation. Finally, the conclusion is given in Sect. 5.
2 Background and existing works
2.1 The social network
Sina Weibo is one of the largest social networks in China and attracts millions of users online every day. Weibo is a platform built on user relationships and the instant sharing of information through short posts of no more than 140 characters, sent from a computer or mobile phone [5]. Specifically, Weibo provides the following functions:
“Follower” and “Following”: each user can choose to follow another user to receive that user’s latest messages and statuses. The user who is followed can then choose whether or not to follow back.
Post and Repost: short messages of no more than 140 characters, including punctuation. Posted messages are delivered to followers immediately and are public for anyone to read.
Mention: represented as @username, meaning that the message sender wants to share something with the mentioned user. When a mention is used, a notification automatically informs the mentioned user that a message has been sent and is available on his/her homepage.
Label: users can post messages containing labels (#…#) to identify a specific topic. If enough users pick up the topic, it appears in the list of trending topics.
2.2 Machine learning techniques
In the field of machine learning, a series of traditional algorithms have been improved to meet growing data processing needs. For instance, recent models such as SVM, Least Squares SVM, and Limited Newton LSVM [6,7] reduce the difficulty of the solution to a certain extent and improve the solution speed. However, two problems remain: (1) the solution speed cannot satisfy the processing needs of large data; (2) SVM-related models require frequent manual adjustment of the parameters (C, γ) and repeated training to obtain the optimal solution, which is a tedious, time-consuming process with poor generalization ability.
Under these circumstances, the extreme learning machine offers a new way to solve these problems. The extreme learning machine (ELM) is a machine learning model proposed by Huang [8] as a least squares-based learning algorithm for single hidden layer feedforward neural networks (SLFNs). In comparison with traditional neural networks, which usually employ the back propagation (BP) algorithm [9] to train the connection weights, ELM eliminates the tedious process of iterative parameter tuning and avoids the slow convergence and local minimum problems. ELM has become an important research topic due to its high efficiency, ease of implementation, and unified treatment of classification and regression, and it may therefore be well suited to social spammer detection [10].
2.3 Existing works
In the past ten years, email spam detection and filtering mechanisms have been widely implemented. The main work can be summarized in two categories: content-based models and identity-based models. In the content-based model, a series of machine learning approaches [11,12] parse content for keywords and patterns that potentially indicate spam. In the identity-based model, the most commonly used approach is for each user to maintain a whitelist and a blacklist of email addresses whose emails should not and should, respectively, be blocked by the anti-spam mechanism [13,14]. More recent work leverages social networks for email spam identification based on Bayesian probability [15]. The concept is to use the social relationship between a sender and a receiver to decide the closeness and trust level of the relationship, and then to increase or decrease the Bayesian probability according to these values.
With the rapid development of social networks, social spam has attracted a lot of attention from both industry and academia. In industry, Facebook proposes the EdgeRank algorithm [16], which assigns each post a score generated from a few features (e.g., number of likes, number of comments, number of reposts, etc.). The higher the EdgeRank score, the less likely the poster is to be a spammer. The disadvantage of this solution is that spammers could join each other’s networks and continuously like and comment on each other’s posts to achieve a high EdgeRank score.
In academia, Wang [17] proposes a naïve Bayesian-based spammer classification algorithm to distinguish suspicious behaviors from normal ones on Twitter, with a reported result (F-measure) of 89 %. Yardi et al. [18] study the behavior features of a small sample of spammers on Twitter and find that the behavior of spammers differs from that of legitimate users in regard to posting tweets, followers, friends, and so on. Stringhini et al. [19] further investigate spammer features by creating a number of honey-profiles on three large social network sites (Facebook, Twitter, and Myspace) and identify five common features (followee-to-follower ratio, URL ratio, message similarity, messages sent, and friend number) that may help detect potential spammer activity. Gao et al. [20] adopt a set of novel features for effectively reconstructing spam messages into campaigns rather than examining them individually (with a precision of over 80 %). Benevenuto et al. [21] collect a large dataset from Twitter and identify 62 features related to tweet content and user social behaviors. These characteristics are used as attributes in a machine learning process that classifies users as either spammers or non-spammers. Zheng et al. [22] apply a set of features to an SVM classifier to detect spammers and obtain a better classification result; however, this approach incurs a longer training time and requires manual tuning to select the optimal parameters.
Besides, several researchers suggest a mechanism that sets up active honeypots, which run without human inspection and log information about their fans [23,24]; they also propose a spammer feature analysis mechanism and compare these features. Furthermore, Miller et al. [25] modify two stream clustering algorithms, StreamKM++ and DenStream, to facilitate spam identification.
In summary, existing social spam detection work extracts a set of features that distinguish normal users from spammers and feeds them into different classifier models to detect suspicious behavior. Because the data sources and features considered differ, different classifiers may achieve different performance. This paper follows the same general concept, but with two distinct points:
1. Our proposed ELM-based classification model considers only 18 feature items and achieves an F-measure of over 99 %. To our knowledge this is the best result reported so far, although different approaches are not directly comparable because the collected datasets differ.
2. As verified by the experiment results, ELM-based classification tends to achieve better generalization performance than SVM-based solutions. The proposed solution is also less sensitive to user-specified parameters and is easy to implement.
3 Dataset and feature analysis
3.1 Dataset collection
While Sina Weibo provides a relatively complete API for developers, the data collection process is still subject to many constraints. Accordingly, a dedicated data crawler and feature collection mechanism are developed to solve this problem.
Figure 1 describes the basic framework of the data collection and feature extraction. Firstly, we randomly selected 100 normal users from the Weibo social network. Because most normal users are unlikely to follow spammers in reality, we crawl the lists of users whom these legitimate users follow to enlarge the sample set of normal users. Similarly, those who follow spam accounts are probably also spammers, since following one another raises their degree of mutual following. Therefore, the sample set of spammers could be obtained from 50 original spammers.
For each user, we crawl the corresponding information from the 500 most recent messages (although the number of microblogs actually returned may be less than 500). The Weibo API converts each Weibo ID into the detailed message.
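The seed-expansion idea described in this subsection can be sketched as a breadth-first walk over follow links. This is only an illustration under stated assumptions: the function `expand_seeds`, the callable `get_neighbors`, and the toy graph are hypothetical names and data, and no real Weibo API call is shown. In the paper, candidate users are still labeled manually after crawling.

```python
from collections import deque

def expand_seeds(seeds, get_neighbors, max_users):
    """Breadth-first expansion of a seed set along follow links.

    get_neighbors is a caller-supplied function (hypothetical here) that
    returns the accounts linked to a user: followees when expanding normal
    users, followers when expanding spammers.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(seen) < max_users:
        user = queue.popleft()
        for neighbor in get_neighbors(user):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
                if len(seen) >= max_users:
                    break
    return seen

# Toy in-memory follow graph standing in for crawled data (made-up users).
toy_graph = {"u1": ["u2", "u3"], "u2": ["u4"], "u3": [], "u4": ["u1"]}
candidates = expand_seeds(["u1"], lambda u: toy_graph.get(u, []), max_users=10)
```

The `max_users` cap is an assumption added to keep the walk bounded; the paper does not state how far the expansion was carried.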
3.2 Feature analysis
Spammers usually have a commercial intent, such as spreading advertisements. In this paper, we randomly select 500 spam messages and 500 normal messages from the collected dataset and assign each message a random integer from 1 to 500. We also cap the number of reposts, comments, and likes at 100.
Figure 2a shows the difference between normal users and spammers in the proportion of original messages posted. Most legitimate users post messages to share personal knowledge and feelings with their friends. Most spammers, on the other hand, repost messages from others, so their proportion of original messages is less than 10 %.
Figure 2b shows the proportion of messages containing URLs (the number of messages containing a URL divided by the total number of messages). The figure shows that most spammers have at least one URL in each message.
Fig. 1 Dataset crawling and feature extraction. The framework comprises a data source (spammers and non-spammers), a data crawler (message crawler and followee crawler) built on the Weibo API, feature extraction (number of followees, number of followers, created days, reposts, comments, likes, …), and ELM-based feature learning that labels users as spammers or non-spammers
Figure 2c displays the difference in the average number of friends mentioned. Because most spammers focus on advertising and spend little time interacting with “friends”, their message content consists mostly of advertising words and pictures. Legitimate users, however, frequently mention their friends and share funny things.
To offer a more specific description, this paper also uses the cumulative distribution function (CDF) to illustrate the distribution of users’ behavioral characteristics. The cumulative distribution function (Eq. 1) describes the probability that a sample of a random variable X will be less than or equal to a value x, where x is a real value. If X is a continuous random variable, then F is a continuous function.
$$F(x) = P(X \le x) \tag{1}$$
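For a finite sample, Eq. (1) is usually estimated by the empirical CDF, the fraction of observations that do not exceed x. The sketch below assumes NumPy; the function names and the followee counts are made up for illustration and are not taken from the paper’s dataset.

```python
import numpy as np

def empirical_cdf(samples):
    """Return sorted sample values and the empirical CDF evaluated at each."""
    x = np.sort(np.asarray(samples, dtype=float))
    F = np.arange(1, x.size + 1) / x.size   # F(x_(i)) = i / n
    return x, F

def cdf_at(samples, value):
    """Estimate F(value) = P(X <= value) as a sample fraction."""
    samples = np.asarray(samples, dtype=float)
    return np.mean(samples <= value)

# Hypothetical followee counts for five users.
followees = [12, 50, 50, 300, 1800]
x, F = empirical_cdf(followees)
p = cdf_at(followees, 50)   # 3 of the 5 samples are <= 50
```

Plotting `F` against `x` separately for spammers and non-spammers gives curves of the kind shown in the CDF panels of Fig. 2.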
Fig. 2 Distribution and cumulative distribution function (CDF) of features for spammers and non-spammers, a the proportion of original messages, b the fraction of messages containing URL, c the average number of friends mentioned, d the number of followees, e the fraction of followees per followers, f the number of created days
Figure 2d analyzes the number of followees of each user, that is, the number of accounts the user follows. Normally, spammers try to follow a multitude of legitimate users so as to be followed back. However, this does not work most of the time. As a result, the fraction of followees per follower is very large for spammers in comparison with non-spammers, as illustrated in Fig. 2e.
Figure 2f reveals the difference in the number of created days, i.e., the age of the account. Compared with normal users, most spammers have fewer created days because anti-spam mechanisms eventually detect and automatically remove spammer accounts.
4 Spammer detection
Based on the dataset and feature collection described in the previous section, a supervised machine learning model is introduced for spammer identification. Supervised
learning is the machine learning task of inferring a function from labeled training data
that consist of a set of training examples. In supervised learning, each example is a
pair consisting of an input object (typically a vector) and a desired output value (also
called supervisory signal). Through analysis of the training data, supervised learning
produces a classification model for predicting new examples.
4.1 Extreme learning machine
Extreme learning machine (ELM) [26] is based on empirical risk minimization theory and is designed for training single hidden layer feedforward neural networks (SLFNs) (as illustrated in Fig. 3). The learning process needs only a single pass, which avoids multiple iterations and local minima. Compared with conventional neural network algorithms, ELMs are capable of achieving faster training speeds and can help reduce overfitting.
Let $D = \{(x_i, o_i)\},\ i = 1, \dots, P$, be a set of $P$ different samples, where $x_i \in \mathbb{R}^m$ and $o_i \in \mathbb{R}^n$. The goal is to find a relationship between $\{x_i\}$ and $\{o_i\}$. A standard single hidden layer feedforward network (SLFN) with $N$ hidden nodes can be mathematically modeled by:

$$y_j = \sum_{k=1}^{N} h_k\, f(w_k, x_j) \tag{2}$$

where $1 \le j \le P$, $w_k$ stands for the parameters of the $k$-th element of the hidden layer, $h_k$ refers to the weight that connects the $k$-th hidden element with the output layer, and $f$ represents the function that gives the output of the hidden layer.

Fig. 3 Feedforward neural network with single hidden layer (the hidden outputs $f(w_1, x), \dots, f(w_N, x)$ are weighted by $h_1, \dots, h_N$ and summed to give the output $y$)

Equation (2) can be expressed in matrix notation as $y = G \cdot h$, where $h$ is the vector of weights of the output layer and $G$ is given by:

$$G = \begin{pmatrix} f(w_1, x_1) & \cdots & f(w_N, x_1) \\ \vdots & \ddots & \vdots \\ f(w_1, x_P) & \cdots & f(w_N, x_P) \end{pmatrix} \tag{3}$$

where $N$ is the number of hidden nodes. As mentioned above, ELM initializes the hidden-layer parameters $w_k$ randomly, and the weights of the output layer are obtained with the Moore–Penrose generalized inverse [27] according to the expression $h = G^{+} \cdot o$, where $G^{+} = (G^{T} \cdot G)^{-1} \cdot G^{T}$ is the pseudo-inverse matrix (superscript $T$ means matrix transposition).
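The training rule above fits in a few lines of NumPy. The following is a minimal sketch, not the authors’ MATLAB implementation. It assumes a sigmoid hidden function, normally distributed random hidden parameters, an explicit bias term, and a made-up two-cluster toy dataset. It also uses `np.linalg.pinv`, which computes the Moore–Penrose inverse via SVD and is numerically safer than forming $(G^{T} G)^{-1} G^{T}$ directly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, O, n_hidden, rng):
    """Draw random hidden parameters, then solve G h = O in the least-squares sense."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights w_k, never trained
    b = rng.normal(size=n_hidden)                # random biases (assumption; Eq. 2 folds them into w_k)
    G = sigmoid(X @ W + b)                       # P x N hidden-layer output matrix, Eq. (3)
    h = np.linalg.pinv(G) @ O                    # output weights h = G^+ o
    return W, b, h

def elm_predict(X, W, b, h):
    return sigmoid(X @ W + b) @ h                # y = G h

rng = np.random.default_rng(0)
# Toy two-class data (made up): two Gaussian clusters in the plane.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
O = np.eye(2)[labels]                            # one-hot targets o_i

W, b, h = elm_train(X, O, n_hidden=20, rng=rng)
pred = elm_predict(X, W, b, h).argmax(axis=1)
accuracy = (pred == labels).mean()
```

Only `h` is fitted; `W` and `b` stay at their random values, which is why training reduces to a single linear solve.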
4.2 ELM-based spammer detection model
Figure 4 illustrates the basic concept of the proposed spammer detection model. In this solution, training data are converted into a series of feature vectors, each consisting of a set of formulated attribute values. These vectors form the input of a supervised machine learning algorithm. After training, the classification model is applied to determine whether a specific user is a normal user or a spammer.
Because spammers and non-spammers have different social behaviors, it is possible to distinguish abnormal behaviors from legitimate ones. In this paper, we use a model based on the following 18 features: the number of followees, the number of followers, the number of messages, the number of friends following each other, the number of favorites, the number of created days, the fraction of followees per followers, the fraction of original messages, the number of messages per day, the average number of reposts, the average number of comments, the average number of likes, the average number of URLs, the average number of pictures, the average number of hashtags, the average number of user mentions, the fraction of messages containing URLs, and the fraction of messages containing pictures.

Fig. 4 Spammer detection model (components: social network, web crawler, feature extraction, data standardization, feature vectors, extreme learning machine, classifier model, detection results)

Table 1 Example of confusion matrix
                      Predicted spammer   Predicted non-spammer
True spammer          TP                  FN
True non-spammer      FP                  TN
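Several of the 18 features are ratios derived from raw profile counts. The sketch below shows how a few of them might be computed. The dictionary keys, the function name, the zero-division guards, and the example profile are all assumptions made for illustration; the paper does not specify its data schema.

```python
def derived_features(profile):
    """Compute a few of the ratio features from raw counts (hypothetical schema)."""
    followers = max(profile["followers"], 1)   # guard against division by zero (assumption)
    days = max(profile["created_days"], 1)
    messages = max(profile["messages"], 1)
    return {
        "followees_per_follower": profile["followees"] / followers,
        "messages_per_day": profile["messages"] / days,
        "fraction_original": profile["original_messages"] / messages,
        "fraction_with_url": profile["messages_with_url"] / messages,
    }

# Made-up profile showing the spammer-like pattern described in Sect. 3.2:
# many followees, few followers, a young account, mostly reposts with URLs.
example = {
    "followees": 1500, "followers": 30, "created_days": 60,
    "messages": 480, "original_messages": 24, "messages_with_url": 456,
}
features = derived_features(example)
```

For this made-up profile the account follows 50 users per follower and posts 8 messages per day, of which 5 % are original and 95 % contain a URL.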
To evaluate the effectiveness of the experiment results, we consider the confusion matrix illustrated in Table 1, where true positive (TP) represents the number of spammers correctly classified, false negative (FN) refers to the number of spammers misclassified as non-spammers, false positive (FP) expresses the number of non-spammers misclassified as spammers, and true negative (TN) is the number of non-spammers classified correctly. Based on the confusion matrix, a set of metrics commonly used in the machine learning field is introduced: precision (P), recall (R), and F-measure (F).
P is the ratio of the number of instances correctly classified as belonging to a class to the total number of instances predicted to belong to that class, and is expressed by the formula:

$$P = \frac{TP}{TP + FP} \tag{4}$$

R is the ratio of the number of instances correctly classified as belonging to a class to the total number of instances that actually belong to that class, and is expressed by the formula:

$$R = \frac{TP}{TP + FN} \tag{5}$$

The F-measure is the harmonic mean of precision and recall, and is defined as:

$$F = \frac{2RP}{R + P} \tag{6}$$

For evaluating a classifier’s performance, the F-measure is more informative than either metric alone because it summarizes both precision and recall in a single value.
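As a worked example of Eqs. (4) to (6), the sketch below computes the three metrics for both classes from one confusion matrix. The counts are invented for illustration and are not the paper’s results; the function name is likewise hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure as defined in Eqs. (4)-(6)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

# Made-up confusion-matrix counts with "spammer" as the positive class.
tp, fn, fp, tn = 99, 1, 2, 198

# Spammer class: use the counts as given.
p_spam, r_spam, f_spam = precision_recall_f(tp, fp, fn)

# Non-spammer class: swap the roles, so TN acts as the "true positive" count,
# FN as the false positives, and FP as the false negatives.
p_norm, r_norm, f_norm = precision_recall_f(tn, fn, fp)
```

Reporting the metrics per class in this way matches the two-row layout used in Tables 3 and 4.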
4.3 Classification result and comparison
The simulation of the ELM algorithm is carried out in a MATLAB environment running on a Core i5-3470 CPU at 3.20 GHz. Table 2 shows the confusion matrix obtained by the ELM classifier. It shows that our proposed solution is quite effective, with 99.1 % of spammers and 99.9 % of non-spammers classified correctly, leaving only a small fraction of spammers and non-spammers misclassified. Table 3 shows the evaluation metrics, in which precision, recall, and F-measure are calculated for spammers and non-spammers, respectively.
Table 2 Confusion matrix
                      Predicted spammer (%)   Predicted non-spammer (%)
True spammer          99.9                    0.1
True non-spammer      0.05                    99.95

Table 3 Classification evaluation
               Precision   Recall   F-measure
Spammer        0.999       0.990    0.995
Non-spammer    0.994       0.999    0.997

Table 4 Comparison between ELM and other classifiers
                 Precision                Recall                   F-measure
Classifier       Spammer   Non-spammer    Spammer   Non-spammer    Spammer   Non-spammer
ELM              0.999     0.994          0.990     0.999          0.995     0.997
SVM              0.999     0.995          0.991     0.999          0.995     0.997
Decision tree    0.942     0.95           0.953     0.958          0.947     0.954
Naïve Bayes      0.939     0.96           0.922     0.966          0.93      0.963
Bayes network    0.946     0.915          0.907     0.956          0.926     0.935

Table 5 Comparison between ELM and SVM
Classifier   Training time (s)   Testing time (s)
ELM          0.4375              0.0625
SVM          3.029               0.499
We also compare the proposed solution with other classifiers, including Decision Tree, Naïve Bayes, and Bayes Network, using the implementations provided by Weka, a Java data mining package. For each classifier, the same evaluation metrics (precision, recall, and F-measure) are calculated for both spammers and non-spammers. With the results shown in Table 4, it is clear that both the ELM and SVM classifiers achieve high accuracy. This observation indicates that the ELM- and SVM-based approaches can clearly separate the training data into two parts with maximum margin. The three other classifiers also achieve good accuracy, because suitable features (covering content and user behavior) are selected that effectively distinguish spammers from non-spammers.
Furthermore, we compare the training and testing time of the SVM-based and ELM-based solutions; the results are shown in Table 5. The ELM-based solution is roughly seven times faster in training (0.4375 s versus 3.029 s) and eight times faster in testing (0.0625 s versus 0.499 s) than the SVM-based solution, and is therefore more efficient.
Finally, to further assess the effectiveness of the proposed spammer detection model, we consider two scenarios: standardized data and non-standardized data. The paper compares the training time and testing accuracy under different activation functions (Sin, Sig, and Hardlim) and different numbers of hidden nodes (L). The evaluation is illustrated in Figs. 5, 6, and 7.
Fig. 5 Comparison of training time and testing accuracy under the Sig function for standardized (sig-zscore) and non-standardized (sig-non-zscore) data, a training time on Sig function with different number of hidden nodes, b testing accuracy on Sig function with different number of hidden nodes
Fig. 6 Comparison of training time and testing accuracy under the Sin function for standardized (sin-zscore) and non-standardized (sin-non-zscore) data, a training time on Sin function with different number of hidden nodes, b testing accuracy on Sin function with different number of hidden nodes
Fig. 7 Comparison of training time and testing accuracy under the Hardlim function for standardized (hardlim-zscore) and non-standardized (hardlim-non-zscore) data, a training time on Hardlim function with different number of hidden nodes, b testing accuracy on Hardlim function with different number of hidden nodes
Figures 5a, 6a, and 7a show that training time is not significantly affected by the choice of activation function, whether the data are standardized or not. Testing accuracy, however, is greatly improved by standardization in the case of the sin activation function (as shown in Fig. 6b). Therefore, we suggest that the formulated dataset be standardized before classification.
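The standardization step and the three activation functions compared above can be sketched as follows. This assumes z-score standardization (as the “zscore” legend labels in Figs. 5 to 7 suggest) and the usual definitions of the sigmoid, sine, and hard-limit functions. The feature matrix, the small `eps` guard, and the fixed all-ones hidden weights are made up for illustration only.

```python
import numpy as np

def zscore(X, eps=1e-12):
    """Standardize each feature column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std   # eps avoids division by zero (assumption)

# The three hidden-layer activation functions compared in Figs. 5-7.
activations = {
    "sig": lambda z: 1.0 / (1.0 + np.exp(-z)),
    "sin": np.sin,
    "hardlim": lambda z: (z >= 0).astype(float),  # 1 where z >= 0, else 0
}

# Made-up feature matrix whose two columns have very different scales
# (e.g. a raw followee count next to a fraction in [0, 1]).
X = np.array([[1500.0, 0.05], [200.0, 0.60], [90.0, 0.85], [3000.0, 0.02]])
Z, mean, std = zscore(X)

# Hidden-layer outputs for a fixed illustrative weight matrix.
hidden = {name: f(Z @ np.ones((2, 3))) for name, f in activations.items()}
```

When classifying new users, the `mean` and `std` computed on the training data would normally be reused rather than recomputed. One plausible reason, not stated in the paper, for the sensitivity of the sine function is that unscaled features with large magnitudes make $\sin(w \cdot x)$ oscillate rapidly.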
4.4 Stability enhancement
To achieve good generalization performance, the cost parameter C and the kernel parameter γ of SVM [28,29] need to be chosen appropriately. ELM likewise has an adjustable parameter, the number of hidden nodes L. Accordingly, Fig. 8 compares the classification performance of the ELM and SVM solutions under different parameter values to evaluate their stability.
We used 9 different values of C and 9 different values of γ, resulting in a total of 81 parameter pairs. The result in Fig. 8a shows that the generalization performance of SVM depends greatly on the combination of (C, γ). Therefore, the SVM-based approach might require tedious and time-consuming parameter tuning in a real implementation. On the other hand, the generalization performance of ELM tends to increase monotonically with the number of hidden nodes L and remains stable when L is larger than 150 (see Fig. 8b). Therefore, from the implementation point of view, another advantage of the ELM-based approach is its greater stability.
Fig. 8 Stability of performance (testing accuracy) under different parameters, a the performance of SVM is sensitive to the parameters (C, γ), b the performance of ELM is not sensitive to the parameter (L)
5 Conclusions and future works
The paper presents an ELM-based spammer detection method for social network platforms. Using data crawled from Sina Weibo, a set of content and behavior features is extracted and applied to an ELM-based classification algorithm. Through a set of experiments and evaluation work, our proposed solution is shown to be feasible, efficient, and more stable with respect to parameter choice than existing SVM-based models.
However, any amount of labeled data might not be enough in a social network environment with a huge quantity of highly diverse characteristics. Therefore, further work on the subject might include the investigation of a collaborative training-based semi-supervised learning model that is capable of training itself automatically from a small amount of labeled data.
On the other hand, the features used in our proposed solution (and in other existing approaches) are based on statistical analysis and manual selection. In the era of big data, with huge data volumes and convenient access, the feature extraction mechanisms in our solution might be low in adaptability and somewhat costly. Therefore, how to bring machine learning techniques (e.g., deep learning algorithms [30–33]) into automatic feature learning and extraction has become an important question.
Acknowledgments This paper is supported by the National Natural Science Foundation of China under Grant No. 61103175 and No. 11271002; the Key Project of Chinese Ministry of Education under Grant No. 212086; the Technology Innovation Platform Project of Fujian Province under Grant No. 2009J1007, No. 2013H6011 and No. 2013J01228; and the Key Project Development Foundation of Education Committee of Fujian province under Grant No. JA11011 and No. JA12016.
References
1. Nexgate (2013) State of social media spam. http://nexgate.com/wp-content/uploads/2013/09/
Nexgate-2013-State-of-Social-Media-Spam-Research-Report.pdf
2. Bhat SY, Abulaish M (2013) Community-based features for identifying spammers in online social
networks. In: Proceedings of the 2013 IEEE/ACM international conference on advances in social
networks analysis and mining. ACM, pp 100–107
3. Grier C, Thomas K, Paxson V et al (2010) @spam: the underground on 140 characters or less. In:
Proceedings of the 17th ACM conference on computer and communications security. ACM, pp 27–37
4. http://www.statista.com/
5. Liu Y, Wu B, Wang B et al (2014) SDHM: a hybrid model for spammer detection in Weibo. Advances
in Social networks analysis and mining (ASONAM), 2014 IEEE/ACM international conference on.
IEEE, pp 942–947
6. Rong HJ, Ong YS, Tan AH et al (2008) A fast pruned-extreme learning machine for classification
problem. Neurocomputing 72(1):359–366
7. Hsu C-W, Lin C-J (2002) A comparison of methods for multiclass support vector machines. IEEE
Trans Neural Netw 13(2):415–425
8. Huang GB, Zhu QY, Siew CK (2004) Extreme learning machine: a new learning scheme of feedforward
neural networks. Neural Networks 2004. In: Proceedings 2004 IEEE international joint conference on.
IEEE, vol 2, pp 985–990
9. Hirose Y, Yamashita K, Hijiya S (1991) Back-propagation algorithm which varies the number of hidden
units. Neural Netw 4(1):61–66
10. Shen H, Li Z (2014) Leveraging social networks for effective spam filtering. IEEE Trans Comput
11:2743–2759
11. Uemura M, Tabata T (2008) Design and evaluation of a Bayesian-filter-based image spam filtering
method, international conference on information security and assurance (ISA), IEEE, pp 46–51
12. Zhou B, Yao Y, Luo J (2013) Cost-sensitive three-way email spam filtering. J Intell Inf Syst 42(1):19–45
13. Jung J, Sit E (2004) An empirical study of spam traffic and the use of DNS black Lists. In: Proceedings
of the 4th ACM SIGCOMM conference on Internet measurement, ACM, pp 370–375
14. Antonakakis M, Perdisci R, Dagon D, Lee W, Feamster N (2010) Building a dynamic reputation system
for DNS, In: Proceedings of the third USENIX workshop on large-scale exploits and emergent threats
(LEET)
15. Xu L, Zheng X, Rong C (2013) Trust evaluation based content filtering in social interactive data. In:
Cloud computing and big data (CloudCom-Asia), 2013 international conference on. IEEE, pp 538–542
16. Kincaid J (2010) EdgeRank: the secret sauce that makes Facebook’s news feed tick. TechCrunch
17. Wang AH (2010) Don’t follow me: Spam detection in twitter. Security and cryptography (SECRYPT),
Proceedings of the 2010 international conference on. IEEE, pp 1–10
18. Yardi S, Romero D, Schoenebeck G (2009) Detecting spam in a twitter network. First Monday 15(1)
19. Stringhini G, Kruegel C, Vigna G (2010) Detecting spammers on social networks. In: Proceedings of
the 26th annual computer security applications conference. ACM, pp 1–9
20. Gao H, Chen Y, Lee K et al (2012) Towards online spam filtering in social networks, NDSS
21. Benevenuto F, Magno G, Rodrigues T et al (2010) Detecting spammers on twitter. Collab, Elect Messag
Anti Abuse Spam Conf (CEAS), 6:12
22. Zheng X, Zeng Z, Chen Z et al (2015) Detecting spammers on social networks. Neurocomputing
159:27–34
23. Lee K, Caverlee J, Webb S (2010) The social honeypot project: protecting online communities from
spammers. In: Proceedings of the 19th international conference on World wide web. ACM, pp 1139–
1140
24. Zhou Y, Chen K, Song L et al (2012) Feature analysis of spammers in social networks with active
honeypots: a case study of Chinese microblogging networks. In: Proceedings of the 2012 international
conference on advances in social networks analysis and mining (ASONAM 2012). IEEE Computer
Society, pp 728–729
25. Miller Z, Dickinson B, Deitrick W et al (2014) Twitter spammer detection using data stream clustering.
Inf Sci 260:64–73
26. Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1):489–501
27. Rao CR, Mitra SK (1971) Generalized inverse of matrices and its applications. Wiley, New York
28. Ghanty P, Paul S, Pal NR (2009) NEUROSVM: an architecture to reduce the effect of the choice of
kernel on the performance of SVM. J Mach Learn Res 10:591–622
29. Huang GB, Ding X, Zhou H (2010) Optimization method based extreme learning machine for classification. Neurocomputing 74(1):155–163
30. Zheng XH, Chen N, Chen Z et al (2014) Mobile cloud based framework for remote-resident multimedia
discovery and access. J Intern Technol 15(6):1043–1050
31. Hinton GE (2007) Learning multiple layers of representation. Trends Cogn Sci 11(10):428–434
32. Bengio Y (2014) Scaling up deep learning. In: Proceedings of the 20th ACM SIGKDD international
conference on knowledge discovery and data mining, ACM, p 1966.1
33. Zhou S, Chen Q, Wang X (2013) Active deep learning method for semi-supervised sentiment classification. Neurocomputing 120:536–546