Po-Chun CHANG
10832158
TRM sec1, cht 1
01/08/2012
Spam detection with data mining methods:
Ensemble learning with multiple SVM based classifiers to optimize
generalization ability of e-mail spam classification
Keywords: ensemble learning, SVM classifier, multiple classifiers, generalization ability, e-mail spam, email spam classification
By
Po-Chun CHANG
A Dissertation Submitted To the Advanced Analytics Institute
For the degree of
Master by research
In
The University of Technology, Sydney
2012
Word count: 1299
Page 1
Acknowledgements
TABLE OF CONTENTS
Chapter 1. Introduction .......................................................................... 4
Chapter 2. Literature review ..................................................................... 6
    a. Background ................................................................................ 6
    b. Text Analytics ............................................................................ 6
    c. SVM based classifiers ..................................................................... 6
        i. SVM overview .......................................................................... 6
        ii. Kernel trick ......................................................................... 6
    d. Optimization .............................................................................. 6
    e. Incremental learning ...................................................................... 6
    f. Ensemble learning ......................................................................... 6
Chapter 3. Methodology ........................................................................... 6
Chapter 4. Results ............................................................................... 6
Chapter 5. Discussion ............................................................................ 6
Chapter 6. Conclusion ............................................................................ 7
Reference: ....................................................................................... 8
Appendix A: ...................................................................................... 8
Chapter 1. INTRODUCTION
[e-mail spam] – same as the previous assignment in TRP
E-mail has become a popular medium for spreading spam messages because of its fast transmission, low cost, and global accessibility. Spam email, also known as junk email, unsolicited bulk email, or unsolicited commercial email, has become a serious problem. One problem caused by spam email is companies' financial losses, because servers require more storage space and computational power to deal with large amounts of e-mail (Siponen & Stucke 2006). Another problem is that spam e-mails are received and stored in users' mailboxes without their agreement, so users need to spend extra time checking and deleting junk mail from their mailboxes (Guzella & Caminhas 2009). In addition, because spam emails may contain malicious software (e.g. phishing software), illegal advertising such as pyramid schemes, or sensitive information, spam has become a serious security issue on the internet (Kumar, Poonkuzhali & Sudhakar 2012).
[classification] – same as the previous assignment in TRP
One solution to the spam problem is data mining with machine learning techniques. According to Witten, Frank & Hall (2011, pp. 4-9), data mining is the automatic or semi-automatic process of discovering structural patterns in data, which extracts knowledge from existing information. Machine learning refers to the algorithms, formulas or models that computers can apply to efficiently perform pattern recognition on data, and then use the discovered patterns to predict possible outcomes on a new dataset. The machine learning principle is to find the similarity between new incoming emails and the existing mails labelled as spam (Amayri & Bouguila 2010, p. 76). If the matching result is positive, the email is classified as spam; otherwise it is a legitimate e-mail.
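The matching principle described above can be sketched in a few lines of plain Python. The word lists, the set-overlap similarity measure, and the 0.3 threshold below are illustrative assumptions for this sketch, not part of the proposed method:

```python
# A minimal sketch of similarity matching: a new mail is compared
# against words seen in known spam but not in legitimate mail.

def tokenize(text):
    """Lowercase a message and split it into a set of words."""
    return set(text.lower().split())

def train(spam_messages, ham_messages):
    """Collect words that appear in spam but not in legitimate mail."""
    spam_words = set()
    for msg in spam_messages:
        spam_words |= tokenize(msg)
    ham_words = set()
    for msg in ham_messages:
        ham_words |= tokenize(msg)
    return spam_words - ham_words

def classify(message, spam_words, threshold=0.3):
    """Label a message spam when enough of its words match known spam."""
    words = tokenize(message)
    if not words:
        return "legitimate"
    overlap = len(words & spam_words) / len(words)
    return "spam" if overlap >= threshold else "legitimate"

spam_words = train(
    spam_messages=["win free money now", "cheap pills free offer"],
    ham_messages=["meeting moved to monday", "draft of the report"],
)
print(classify("free money offer inside", spam_words))   # spam
print(classify("report draft for monday", spam_words))   # legitimate
```

A real spam filter would learn weighted features rather than a bare word set, but the positive/negative matching decision at the end mirrors the principle stated above.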
[generalization ability]
Based on the concepts of data mining and machine learning, a key property of a learning algorithm is generalization. As mentioned in the previous paragraph, data mining is a method for discovering patterns in existing data, and there is no guarantee that the discovered patterns will achieve good results on new incoming information. By the definition of the Vapnik-Chervonenkis (VC) dimension in statistical learning theory, a small training error does not guarantee a small generalization error (Burges 1998; Vapnik 1995, 2000). For email spam classification, generalization ability means the learning algorithm can still maintain its detection rate when the training data is reduced or new forms of spam messages are added.
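The gap between training error and generalization error can be made concrete with a deliberately extreme toy example. A look-up "classifier" that simply memorises the training set achieves zero training error yet learns no pattern at all; the three-message datasets below are assumptions invented for this sketch:

```python
# Illustrates that a small training error does not imply a small
# generalization error: a memorising model is perfect on seen data
# and falls back to one default class on unseen data.

def train_memoriser(examples):
    """Memorise exact (message, label) pairs; no pattern is learned."""
    return dict(examples)

def predict(model, message):
    """Return the memorised label, or default to 'ham' for unseen mail."""
    return model.get(message, "ham")

def error_rate(model, examples):
    """Fraction of examples the model labels incorrectly."""
    return sum(predict(model, m) != y for m, y in examples) / len(examples)

train_set = [("buy now", "spam"), ("free offer", "spam"), ("team lunch", "ham")]
test_set  = [("buy cheap", "spam"), ("offer free", "spam"), ("lunch plan", "ham")]

model = train_memoriser(train_set)
print(error_rate(model, train_set))  # 0.0 -> zero training error
print(error_rate(model, test_set))   # ~0.67 -> large generalization error
```

The memoriser corresponds to a hypothesis class of very high VC dimension: it can fit any training set exactly, which is precisely why its training error says nothing about its behaviour on new mail.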
Word count: 1299
Page 4
Po-Chun CHANG
10832158
TRM sec1, cht 1
01/08/2012
[SVM based classifiers]
Many learning algorithms have been proposed for data classification and categorization. The support vector machine (SVM) (Vapnik 1995) is one of the preferred supervised learning algorithms because of its solid theoretical background, its theoretically good classification accuracy without the overfitting problem, and its reasonable time consumption (Diao, Yang & Wang 2012). SVM is a linear learning algorithm that trains the classifier to find the best separating hyperplane dividing the data into two groups, based on the maximum margin training algorithm (Boser, Guyon & Vapnik 1992). Moreover, for datasets that are not linearly separable, SVM uses the kernel trick to implicitly project the data instances into a virtual feature space, usually of higher dimension, where nonlinearly separable data may become linearly separable (Schölkopf & Smola 2002). The SVM algorithm will be discussed in depth in the literature review chapter.
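The projection idea behind the kernel trick can be shown with an explicit feature map. In the sketch below, the map phi(x) = (x, x squared) and the one-dimensional toy data are assumptions standing in for the implicit mapping a real SVM kernel performs:

```python
# Data that is not linearly separable in its original space can become
# linearly separable after mapping into a higher-dimensional space.

def phi(x):
    """Explicit feature map from 1-D input to 2-D feature space."""
    return (x, x * x)

# In 1-D the positives (|x| large) surround the negatives (|x| small),
# so no single threshold on x separates the two classes.
positives = [-2.0, -1.5, 1.5, 2.0]
negatives = [-0.5, 0.0, 0.5]

def separable_by(points_pos, points_neg, boundary=1.0):
    """Check that the second feature coordinate (x^2) splits the classes
    with the linear boundary x^2 = boundary."""
    return (all(phi(x)[1] > boundary for x in points_pos)
            and all(phi(x)[1] < boundary for x in points_neg))

print(separable_by(positives, negatives))  # True
```

An actual SVM never computes phi explicitly; a kernel function supplies the inner products in the feature space directly, which is what makes high-dimensional (even infinite-dimensional) projections computationally feasible.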
[multiple classifiers]
Based on empirical observations of machine learning applications, one learning algorithm might achieve better results than others on a given problem, but it is not realistic for a single classifier to achieve the best results over the whole problem domain. Moreover, many learning algorithms use optimization techniques to achieve highly accurate results, but they may become stuck in local optima (Valentini & Masulli 2002). In addition, it is not practical for one single inducer, the trained model or classifier built from a specific training set, to achieve 100% prediction accuracy on new incoming data. For these reasons, instead of relying on one classifier, integrating the outcomes of multiple classifiers, on the premise that their results do not compromise one another, would improve the accuracy rate. As the old saying goes, “Two heads are better than one”.
[Ensemble learning]
Ensemble learning is a technique which combines multiple classifiers into one synthesized classifier to improve the prediction accuracy as well as the generalization ability (Dietterich 2000). The generalization ability of an ensemble of multiple classifiers is usually much stronger than that of a single classifier. The methodology of ensemble learning is to weigh several individual classifiers and combine them to generate the final decision. The ensemble learning algorithm will be discussed in detail in the literature review chapter.
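The combining step described above can be sketched with a majority vote. The three rule-based classifiers and the test messages below are illustrative assumptions; the ensemble proposed in this paper would instead combine trained SVM classifiers:

```python
# A minimal ensemble: three weak classifiers vote, majority decides.

def rule_keywords(msg):
    """Flag messages containing typical spam words."""
    return "spam" if "free" in msg or "win" in msg else "ham"

def rule_shouting(msg):
    """Flag messages written entirely in capital letters."""
    return "spam" if msg.isupper() else "ham"

def rule_links(msg):
    """Flag messages containing a plain-HTTP link."""
    return "spam" if "http://" in msg else "ham"

def ensemble_predict(msg, classifiers):
    """Majority vote over the individual classifiers' decisions."""
    votes = [clf(msg) for clf in classifiers]
    return max(set(votes), key=votes.count)

classifiers = [rule_keywords, rule_shouting, rule_links]
print(ensemble_predict("WIN free prizes at http://x", classifiers))  # spam
print(ensemble_predict("minutes from the meeting", classifiers))     # ham
```

Here the vote is unweighted; the weighting mentioned above would replace `votes.count` with per-classifier weights so that more reliable classifiers contribute more to the final decision.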
This paper is organized as follows. In the chapter 2 literature review, section (b) will briefly describe how text messages are translated into a clean dataset. Section (c) will introduce one of the most widely used learning algorithms, the support vector machine (SVM), together with the kernel trick. Sections (d), (e) and (f) will talk
about the existing optimization techniques for overcoming the SVM's vulnerabilities. The methodology proposed in this paper will be discussed in chapter 3. The experimental results will be shown in chapter 4. The discussion will be provided in chapter 5, and chapter 6 is the conclusion.
Chapter 2. LITERATURE REVIEW
a. BACKGROUND
b. TEXT ANALYTICS
c. SVM BASED CLASSIFIERS
i. SVM OVERVIEW
ii. KERNEL TRICK
d. OPTIMIZATION
e. INCREMENTAL LEARNING
f. ENSEMBLE LEARNING
Chapter 3. METHODOLOGY
Chapter 4. RESULTS
Chapter 5. DISCUSSION
Chapter 6. CONCLUSION
REFERENCE:
Amayri, O. & Bouguila, N. 2010, 'A study of spam filtering using support vector machines', Artificial
Intelligence Review, vol. 34, no. 1, pp. 73-108.
Boser, B.E., Guyon, I.M. & Vapnik, V.N. 1992, 'A training algorithm for optimal margin classifiers', paper
presented to the Proceedings of the fifth annual workshop on Computational learning theory,
Pittsburgh, Pennsylvania, United States.
Burges, C.J.C. 1998, 'A tutorial on support vector machines for pattern recognition', Data Mining and
Knowledge Discovery, vol. 2, no. 2, pp. 121-67.
Diao, L., Yang, C. & Wang, H. 2012, 'Training SVM email classifiers using very large imbalanced dataset',
Journal of Experimental and Theoretical Artificial Intelligence, vol. 24, no. 2, pp. 193-210.
Dietterich, T. 2000, 'Ensemble methods in machine learning', Multiple classifier systems, pp. 1-15.
Guzella, T.S. & Caminhas, W.M. 2009, 'A review of machine learning approaches to Spam filtering', Expert
Systems with Applications, vol. 36, no. 7, pp. 10206-22.
Kumar, R.K., Poonkuzhali, G. & Sudhakar, P. 2012, 'Comparative Study on Email Spam Classifier using Data
Mining Techniques', Proceedings of the International MultiConference of Engineers and Computer
Scientists, vol. 1.
Schölkopf, B. & Smola, A.J. 2002, Learning with kernels: Support vector machines, regularization,
optimization, and beyond, MIT Press.
Siponen, M. & Stucke, C. 2006, 'Effective Anti-Spam Strategies in Companies: An International Study',
System Sciences, 2006. HICSS '06. Proceedings of the 39th Annual Hawaii International Conference
on, vol. 6, pp. 127c-c.
Valentini, G. & Masulli, F. 2002, 'Ensembles of learning machines', Neural Nets, pp. 3-20.
Vapnik, V.N. 1995, The Nature of Statistical Learning Theory, Springer Verlag, NY.
Vapnik, V.N. 2000, The nature of statistical learning theory, Springer-Verlag New York Inc.
Witten, I.H., Frank, E. & Hall, M.A. 2011, Data Mining: Practical machine learning tools and techniques,
Morgan Kaufmann.
APPENDIX A: