Po-Chun CHANG 10832158 TRM sec1, cht 1 01/08/2012

Spam detection with data mining methods: ensemble learning with multiple SVM-based classifiers to optimize the generalization ability of e-mail spam classification

Keywords: ensemble learning, SVM classifier, multiple classifiers, generalization ability, e-mail spam, e-mail spam classification

By Po-Chun CHANG
A dissertation submitted to the Advanced Analytics Institute for the degree of Master by Research
The University of Technology, Sydney
2012

Word count: 1299

Acknowledgements

TABLE OF CONTENTS
Chapter 1. Introduction
Chapter 2. Literature review
  a. Background
  b. Text analytics
  c. SVM based classifiers
    i. SVM overview
    ii. Kernel trick
  d. Optimization
  e. Incremental learning
  f. Ensemble learning
Chapter 3. Methodology
Chapter 4. Results
Chapter 5. Discussion
Chapter 6. Conclusion
References
Appendix A

Chapter 1. INTRODUCTION
[e-mail spam] – same as the previous assignment in TRP

E-mail has become a popular medium for spreading spam messages because of its fast transmission, low cost, and global accessibility. Spam e-mail, also known as junk e-mail, unsolicited bulk e-mail, or unsolicited commercial e-mail, has become a serious problem. One problem caused by spam e-mail is financial loss to companies, because servers require more storage space and computational power to deal with large volumes of e-mail (Siponen & Stucke 2006). Another problem is that spam e-mails are received and stored in users' mailboxes without their consent, so users must spend more time checking and deleting junk mail from their mailboxes (Guzella & Caminhas 2009). In addition, because spam e-mails may contain malicious software (e.g. phishing software), illegal advertising such as pyramid schemes, or attempts to obtain sensitive information, spam has become a serious security issue on the internet (Kumar, Poonkuzhali & Sudhakar 2012).

[classification] – same as the previous assignment in TRP

One solution to the spam problem is to use data mining with machine learning techniques. According to Witten, Frank & Hall (2011, pp. 4-9), data mining is the automatic or semi-automatic process of discovering structural patterns in data, that is, of extracting knowledge from existing information. Machine learning refers to the algorithms, formulas or models that computers can apply to efficiently perform pattern recognition on data and then use to predict outcomes on new datasets. The machine learning principle is to measure the similarity between new incoming e-mails and existing e-mails labelled as spam (Amayri & Bouguila 2010, p. 76). If the match is positive, the e-mail is classified as spam; otherwise it is classified as legitimate.

[generalization ability]

Based on these concepts of data mining and machine learning, a key property of a learning algorithm is its generalization ability. As mentioned in the previous paragraph, data mining discovers patterns in existing data, so there is no guarantee that the discovered patterns will perform well on new incoming information. By the definition of the Vapnik-Chervonenkis (VC) dimension in statistical learning theory, a small training error does not guarantee a small generalization error (Burges 1998; Vapnik 1995, 2000). For e-mail spam classification, generalization ability means that the learning algorithm can maintain its detection rate when the training data are reduced or when new forms of spam message appear.

[SVM based classifiers]

Many learning algorithms have been proposed for data classification and categorization. The support vector machine (SVM) (Vapnik 1995) is one of the preferred supervised learning algorithms because of its solid theoretical foundation, its good classification accuracy without overfitting, and its reasonable time consumption (Diao, Yang & Wang 2012). SVM is a linear learning algorithm that trains a classifier to find the best separating hyperplane dividing the data into two groups, based on the maximum-margin training algorithm (Boser, Guyon & Vapnik 1992). For datasets that are not linearly separable, SVM uses the kernel trick to implicitly project the data instances into another feature space, usually of higher dimension, in which the data become linearly separable (Schölkopf & Smola 2002).
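To tie these ideas together, the short Python sketch below is a minimal illustration only, assuming the open-source scikit-learn library; the tiny toy corpus and all names in it are invented for illustration and are not the dataset or method of this thesis. It turns raw e-mail text into bag-of-words features, trains a linear SVM, and estimates generalization error on a held-out split rather than on the training data.

# Minimal illustrative sketch (not the method proposed in this thesis):
# bag-of-words features + a linear SVM, evaluated on held-out data so the
# reported error reflects generalization, not memorisation of the training set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Toy corpus, invented for illustration; 1 = spam, 0 = legitimate.
emails = [
    "Win a free prize now, click here",
    "Cheap meds, limited time offer",
    "Meeting moved to 3pm tomorrow",
    "Please review the attached draft chapter",
]
labels = [1, 1, 0, 0]

# Translate raw text into a numeric feature matrix (word counts).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Hold out part of the data to estimate generalization error.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)

# A linear SVM finds the maximum-margin separating hyperplane;
# kernel="rbf" would instead apply the kernel trick for data that is
# not linearly separable in the original feature space.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))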
The SVM algorithm will be discussed in depth in the literature review chapter.

[multiple classifiers]

Empirical observations of machine learning applications show that one learning algorithm may achieve better results than others on a particular problem, but it is not realistic to expect a single classifier to achieve the best results over the whole problem domain. Moreover, many learning algorithms use optimization techniques to achieve highly accurate results, but these may become stuck in local optima (Valentini & Masulli 2002). In addition, it is not practical for a single inducer, that is, a model or classifier trained on a specific training set, to achieve 100% prediction accuracy on new incoming data. For these reasons, instead of relying on one classifier, integrating the outcomes of multiple classifiers, provided their results do not compromise one another, can improve the accuracy rate. As the old saying goes, "Two heads are better than one".

[Ensemble learning]

Ensemble learning is a technique that combines multiple classifiers into one synthesized classifier to improve prediction accuracy as well as generalization (Dietterich 2000). The generalization ability of an ensemble of multiple classifiers is usually much stronger than that of a single classifier. The methodology of ensemble learning is to weigh several individual classifiers and combine them to generate the final decision. Ensemble learning algorithms will be discussed in detail in the literature review chapter.
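As a concrete illustration of the general idea, the sketch below assumes the scikit-learn library and synthetic data standing in for vectorised e-mail features; it shows plain majority voting over several SVM base classifiers and is not the specific ensemble method proposed in this thesis.

# Minimal sketch of ensemble learning by majority voting (illustration only):
# three SVM base classifiers with different kernels are combined into one
# synthesized classifier whose prediction is the majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for vectorised e-mail features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

base_classifiers = [
    ("linear_svm", SVC(kernel="linear")),
    ("rbf_svm", SVC(kernel="rbf")),
    ("poly_svm", SVC(kernel="poly", degree=3)),
]

# Hard voting: each base SVM casts one vote and the majority label wins.
ensemble = VotingClassifier(estimators=base_classifiers, voting="hard")
ensemble.fit(X_train, y_train)

print("individual accuracies:",
      [clf.fit(X_train, y_train).score(X_test, y_test)
       for _, clf in base_classifiers])
print("ensemble accuracy:", ensemble.score(X_test, y_test))

VotingClassifier also accepts a weights argument, which corresponds to the step of weighting individual classifiers mentioned above rather than counting their votes equally.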
This paper is organized as follows. In the chapter 2 literature review, section (b) will briefly describe how text messages are translated into a clean dataset. Section (c) will introduce the widely used support vector machine (SVM) learning algorithm and the kernel trick. Sections (d), (e) and (f) will discuss the existing optimization, incremental learning and ensemble learning techniques for overcoming the weaknesses of SVM. The methodology proposed in this paper will be discussed in chapter 3, the experimental results will be shown in chapter 4, the discussion will be provided in chapter 5, and chapter 6 is the conclusion.

Chapter 2. LITERATURE REVIEW
a. BACKGROUND
b. TEXT ANALYTICS
c. SVM BASED CLASSIFIERS
i. SVM OVERVIEW
ii. KERNEL TRICK
d. OPTIMIZATION
e. INCREMENTAL LEARNING
f. ENSEMBLE LEARNING

Chapter 3. METHODOLOGY

Chapter 4. RESULTS

Chapter 5. DISCUSSION

Chapter 6. CONCLUSION

REFERENCES

Amayri, O. & Bouguila, N. 2010, 'A study of spam filtering using support vector machines', Artificial Intelligence Review, vol. 34, no. 1, pp. 73-108.

Boser, B.E., Guyon, I.M. & Vapnik, V.N. 1992, 'A training algorithm for optimal margin classifiers', Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, United States.

Burges, C.J.C. 1998, 'A tutorial on support vector machines for pattern recognition', Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-67.

Diao, L., Yang, C. & Wang, H. 2012, 'Training SVM email classifiers using very large imbalanced dataset', Journal of Experimental and Theoretical Artificial Intelligence, vol. 24, no. 2, pp. 193-210.

Dietterich, T. 2000, 'Ensemble methods in machine learning', Multiple Classifier Systems, pp. 1-15.

Guzella, T.S. & Caminhas, W.M. 2009, 'A review of machine learning approaches to spam filtering', Expert Systems with Applications, vol. 36, no. 7, pp. 10206-22.

Kumar, R.K., Poonkuzhali, G. & Sudhakar, P. 2012, 'Comparative study on email spam classifier using data mining techniques', Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1.

Schölkopf, B. & Smola, A.J. 2002, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press.

Siponen, M. & Stucke, C. 2006, 'Effective anti-spam strategies in companies: an international study', Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS '06), vol. 6, pp. 127c-c.

Valentini, G. & Masulli, F. 2002, 'Ensembles of learning machines', Neural Nets, pp. 3-20.

Vapnik, V.N. 1995, The Nature of Statistical Learning Theory, Springer-Verlag, New York.

Vapnik, V.N. 2000, The Nature of Statistical Learning Theory, Springer-Verlag New York Inc.

Witten, I.H., Frank, E. & Hall, M.A. 2011, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann.

APPENDIX A: