Reduction in Over-Fitting for Classification without Compromising on Accuracy and Effectiveness

Dr. Abdul Aziz, Associate Dean, Faculty of Computer Sciences, Riphah International University, Islamabad, Pakistan ([email protected])
Dr. Nazir A. Zafar, Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, Islamabad, Pakistan ([email protected])

Machine Learning

Machine learning covers the following main types of learning:
• Classification learning: learn to put instances into pre-defined classes based on other attributes
• Association learning: learn relationships between the attributes
• Clustering: discover classes of instances that belong together
• Regression: learn to predict a numeric quantity instead of a class

Roots of Classification

1. Classification draws on the concepts of three major paradigms:
• Database technology
• Statistics
• Machine learning
2. Domain knowledge, i.e. the expertise of the end-user.

Knowledge Discovery in Databases

1. The KDD process typically generates a model from past records with known target classes (outputs); these models are then used to predict the outputs of future records (new cases).
2. Applications include fraud detection, marketing, investment analysis and insurance.

Marketing example

The goal is to predict whether a customer will buy a product given gender, country and age.

Gender  Country  Age  Buy?
M       France   25   Yes
M       England  21   Yes
F       France   23   Yes
F       England  34   Yes
F       France   30   No
M       Germany  21   No
M       Germany  20   No
F       Germany  18   No
F       France   34   No
M       France   55   No

Freitas and Lavington (1998), Data Mining, CEC99.

This is the decision tree induced from the Marketing example data. The first branch ("country?") is the root of the tree; "age?" is an internal branching node, and the yes/no outcomes are leaf nodes.

country?
  Germany -> no
  England -> yes
  France  -> age?
    <= 25 -> yes
    > 25  -> no

Tree induction

1. The tree is built by selecting one attribute at a time, the one that 'best' separates the classes.
2. The set of examples is then partitioned according to the value of the selected attribute.
3. This is repeated at each branch node until segmentation is complete.

The same tree annotated with class counts. Notice that in this simple example the leaf nodes contain records of one class only, and the number of yes and no examples is conserved as you move up and down the tree.

(4Y 6N) country?
  Germany -> no  (0Y 3N)
  England -> yes (2Y 0N)
  France (2Y 3N) -> age?
    <= 25 -> yes (2Y 0N)
    > 25  -> no  (0Y 3N)

Rule derivation

Rules can be extracted directly from induction trees, one per root-to-leaf path (see the code sketch after the Expertise required slide), e.g.

If (country = Germany) then (Buy? = No)

What are the other rules?

Heart Disease Dataset

What is needed?

1. With databases of enormous size, the user needs help to analyse the data more effectively than simply querying and reporting.
2. Semi-automatic methods that extract useful, unknown (higher-level) information in a concise format will help the user make more sense of their data.

The KDD roadmap

1. KDD may be divided into the following stages: Problem Specification, Resourcing, Data Cleansing, Pre-processing, Data Mining, Evaluation, Interpretation, Exploitation.
2. Note the iterative nature of the process.

Expertise required

1. Any organisation that undertakes a project in KDD will require much expert input to ensure that the results produced are
• of high quality,
• valid,
• interesting/useful/novel/surprising, and
• comprehensible by the human user.
2. "If patient is pregnant then gender is female" is very accurate, but is neither useful nor surprising.
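The following is a minimal Python sketch of the greedy induction procedure described in the Tree induction and Rule derivation slides, run on the marketing example data. It assumes information gain as the criterion for the attribute that 'best' separates the classes (the slides do not name a criterion) and uses the same <= 25 / > 25 age binning as the tree above; each printed root-to-leaf path can be read off as a rule such as "If (country = Germany) then (Buy? = No)".

```python
# Minimal sketch of greedy decision-tree induction on the marketing example.
# Assumption: the "best" attribute is chosen by information gain, and age is
# binned into "<= 25" / "> 25" exactly as in the slides.
import math
from collections import Counter

# Marketing example records: (gender, country, age, buy?)
DATA = [
    ("M", "France",  25, "Yes"), ("M", "England", 21, "Yes"),
    ("F", "France",  23, "Yes"), ("F", "England", 34, "Yes"),
    ("F", "France",  30, "No"),  ("M", "Germany", 21, "No"),
    ("M", "Germany", 20, "No"),  ("F", "Germany", 18, "No"),
    ("F", "France",  34, "No"),  ("M", "France",  55, "No"),
]
ATTRS = ["gender", "country", "age"]

def value(record, attr):
    """Attribute value of a record; age is discretised as in the slides."""
    gender, country, age, _ = record
    if attr == "age":
        return "<= 25" if age <= 25 else "> 25"
    return gender if attr == "gender" else country

def entropy(records):
    """Class entropy (in bits) of the Buy? labels."""
    counts = Counter(r[3] for r in records)
    return -sum(c / len(records) * math.log2(c / len(records))
                for c in counts.values())

def split(records, attr):
    """Partition the records by their value of the given attribute."""
    groups = {}
    for r in records:
        groups.setdefault(value(r, attr), []).append(r)
    return groups

def information_gain(records, attr):
    groups = split(records, attr)
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(records) - remainder

def induce(records, attrs, branch="root", depth=0):
    """Grow the tree until each node is pure (or attributes run out)."""
    counts = Counter(r[3] for r in records)
    tally = f"({counts['Yes']}Y {counts['No']}N)"
    indent = "  " * depth
    if len(counts) == 1 or not attrs:
        # Leaf node: predict the majority class; this is a rule's conclusion.
        print(f"{indent}{branch} -> {counts.most_common(1)[0][0]} {tally}")
        return
    best = max(attrs, key=lambda a: information_gain(records, a))
    print(f"{indent}{branch}: split on {best}? {tally}")
    for val, group in sorted(split(records, best).items()):
        induce(group, [a for a in attrs if a != best], val, depth + 1)

induce(DATA, ATTRS)
```

Running this prints the same country/age tree as the slides, with the (xY yN) counts at each node.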
Classification

[Diagram: Data is split into training data and test data; the classifier algorithm learns a model from the training data.]

Validation

[Diagram: the model built from the training data is applied to the held-out test data to validate its classifications.]

Relative error reduction

The SRS and SDS columns give classification accuracy (%); ER % is the relative reduction in error obtained by moving from SRS to SDS. (A code sketch of the underlying train/test evaluation appears after the closing slides.)

S.No.  Data Set                   SRS    SDS    ER %
1      Heart Disease (Cleveland)  77.37  82.12   20.99
2      Credit-A                   84.52  84.93    2.65
3      Diabetes (PIMA)            72.45  73.96    5.48
4      Liver Disorder (BUPA)      64.86  65.90    2.96
5      Breast Cancer (Wisconsin)  94.63  94.86    4.28
6      Hepatitis                  78.20  88.31   46.38
7      Ionosphere                 89.09  90.91   16.68
8      Boston Housing             82.53  83.79    7.21
9      Credit (German)            71.64  72.20    1.97
10     Iris                       91.93  98.67   83.52
11     Sonar                      73.46  70.19  -12.32
       Overall average            80.06  82.35   16.35

SRS: Simple Random Sampling    SDS: Systematic Distribution Sampling

Comparison

Accuracy and over-fitting under the two sampling schemes:

Data Set    SRS Accuracy  SRS Over-fitting  SIS Accuracy  SIS Over-fitting
Heart-C     77.37          6.14             79.58         3.11
Credit-A    84.52          5.25             87.08         2.93
Diabetes    72.45          7.83             72.66         3.58
Liver       64.86         10.47             67.17         4.79
Cancer      94.63          2.58             94.62         1.26
Hepatitis   78.20          6.16             83.12         3.07
Ionosphere  89.09          4.94             89.41         2.19
Housing     82.53          3.68             83.00         2.53
Credit-G    71.64          8.23             72.56         3.97
Iris        91.93          2.91             96.00         1.18
Sonar       73.46          8.26             76.73         3.95
Average     80.06          6.04             81.99         2.96

SRS: Simple Random Sampling    SIS: Stratified Induction Sampling

Conclusion

In this study, we have shown that partitioning the original data sets into training and test sets using the stratified induction approach reduces over-fitting significantly without compromising on accuracy.

Supporting Texts

Data Warehousing, Data Mining and OLAP, Alex Berson & Stephen Smith, McGraw-Hill (1997), ISBN 0-07-006272-2
Predictive Data Mining, Sholom Weiss & Nitin Indurkhya, Morgan Kaufmann (1998), ISBN 1-55860-403-0
Data Mining, Ian Witten & Eibe Frank, Morgan Kaufmann (1999), ISBN 1-55860-552-5

Useful URLs

1. University of East Anglia School of Computing Sciences, UK: http://www.cmp.uea.ac.uk/research/groups/mag/kdd/
2. UCI ML Repository, USA: http://www.ics.uci.edu/~mlearn/MLRepository.html
3. KD Nuggets, USA: http://www.kdnuggets.com/

Questions and Answers

Discussion

THANK YOU
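As a companion to the evaluation slides, the following is a minimal sketch of the train/test protocol behind the accuracy, over-fitting and relative-error-reduction figures. It is not the authors' Systematic Distribution / Stratified Induction Sampling: scikit-learn's class-stratified split is used as a stand-in for a stratified partition, a simple random split plays the role of SRS, "over-fitting" is taken to be the train-minus-test accuracy gap of an unpruned decision tree, and the Iris data set (one of the eleven benchmarks) is used for illustration.

```python
# Sketch of the train/test evaluation behind the result tables, on Iris.
# Assumptions: scikit-learn's class-stratified split stands in for the
# paper's stratified sampling scheme, a simple random split plays the role
# of SRS, and "over-fitting" is measured as the train-minus-test accuracy
# gap of an unpruned decision tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def evaluate(X, y, stratified, seed=0):
    """Return (test accuracy, over-fitting gap) for one train/test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed,
        stratify=y if stratified else None)
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    return test_acc, train_acc - test_acc

X, y = load_iris(return_X_y=True)
acc_srs, gap_srs = evaluate(X, y, stratified=False)  # simple random split
acc_str, gap_str = evaluate(X, y, stratified=True)   # class-stratified split

# Relative error reduction, as in the first table: the share of the
# random-split error removed by the stratified split.
srs_err = 1 - acc_srs
err_reduction = 100 * (srs_err - (1 - acc_str)) / srs_err if srs_err else 0.0

print(f"random split:     accuracy {acc_srs:.4f}, over-fitting gap {gap_srs:.4f}")
print(f"stratified split: accuracy {acc_str:.4f}, over-fitting gap {gap_str:.4f}")
print(f"relative error reduction: {err_reduction:.2f} %")
```

The relative error reduction is computed exactly as implied by the first table: (error_SRS - error_stratified) / error_SRS x 100, i.e. the fraction of the random-split error removed by the stratified split.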