Knowledge Discovery via Data Mining

Enrico Tronci
Dipartimento di Informatica, Università di Roma “La Sapienza”, Via Salaria 113, 00198 Roma, Italy
[email protected], http://www.dsi.uniroma1.it/~tronci

Workshop ENEA: I Sistemi di Supporto alle Decisioni
Centro Ricerche ENEA Casaccia, Roma, October 28, 2003

Data Mining

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. A data miner is a computer program that sifts through data seeking regularities or patterns. Obstructions: noise and computational complexity.

Some Applications

•Decisions involving judgments, e.g. loans.
•Screening images. Example: detection of oil slicks from satellite images, warning of ecological disasters, illegal dumping.
•Load forecasting in the electricity supply industry.
•Diagnosis, e.g. for preventive maintenance of electromechanical devices.
•Marketing and sales. … On Thursdays customers often purchase beer and diapers together …
•Stock market analysis.
•Anomaly detection.

Data

Each row of the table is an instance; each column is an attribute; the last attribute is the goal.

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lens (goal)
young           myope                   no           reduced               none
young           myope                   no           normal                soft
young           myope                   yes          reduced               none
young           myope                   yes          normal                hard
young           hypermetrope            no           reduced               none
young           hypermetrope            no           normal                soft
young           hypermetrope            yes          reduced               none
young           hypermetrope            yes          normal                hard
pre-presbyopic  myope                   no           reduced               none
pre-presbyopic  myope                   no           normal                soft
............................................................

Classification

Assume instances have n attributes A1, …, An-1, An. Let attribute An be our goal. A classifier is a function f from (A1 x … x An-1) to An. That is, f looks at the values of the first (n-1) attributes and returns the (estimated) value of the goal. In other words, f classifies each instance w.r.t. the goal attribute. The problem of computing a classifier from a set of instances is called the classification problem.
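As a concrete illustration, a classifier really is just a function from the non-goal attributes to the goal. A minimal Python sketch for the contact lens data above (the function name and the simplified rule set are our own, consistent with the table rows shown, not the complete data set):

```python
def recommend_lens(age, prescription, astigmatism, tear_rate):
    """Toy classifier for the contact lens data: maps the four
    non-goal attributes to the goal attribute (recommended lens)."""
    if tear_rate == "reduced":
        return "none"
    # tear_rate == "normal" from here on
    if astigmatism == "no":
        return "soft"
    return "hard"

# A classifier works on any possible instance, including unseen ones.
print(recommend_lens("young", "myope", "yes", "normal"))  # hard
```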
Note that in a classification problem the set of classes (i.e. the possible goal values) is known in advance. Note also that a classifier works on any possible instance, that is, also on instances that were not present in our data set. This is why classification is a form of machine learning.

Clustering

Assume instances have n attributes A1, …, An. A clustering function is a function f from the set (A1 x … x An) to some small subset of the natural numbers. That is, f splits the set of instances into a small number of classes. The problem of computing a clustering function from our data set is called the clustering problem. Note that, unlike in a classification problem, in a clustering problem the set of classes is not known in advance. Note also that a clustering function works on any possible instance, that is, also on instances that were not present in our data set. This is why clustering is a form of machine learning.

In the following we will focus on classification.

Rules for Contact Lens Data (an example of classification)

if (<tear production rate> = <reduced>) then <recommendation> = <none>;
if (<age> = <young> and <astigmatic> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;
if (<age> = <pre-presbyopic> and <astigmatic> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;
....

Attribute recommendation is the attribute we would like to predict. Such an attribute is usually called the goal and is typically written in the last column. A possible way of defining a classifier is by a set of rules as above.

Labor Negotiations Data

Attribute                  Type                   1     2     3     ...  40
Duration                   years                  1     2     3     ...  2
Wage increase first year   percentage             2%    4%    4.3%  ...  4.5%
Wage increase second year  percentage             ?     ?     ?     ...  ?
Working hours per week     number of hours        28    35    38    ...  40
Pension                    {none, r, c}           none  ?     ?     ...  ?
Education allowance        {yes, no}              yes   ?     ?     ...  ?
Statutory holidays         number of days         11    15    12    ...  12
Vacation                   {below-avg, avg, gen}  avg   gen   gen   ...  avg
...
Acceptability of contract  {good, bad}            bad   good  good  ...  good

Classification using Decision Trees (The Labor Negotiations Data Example (1))

Wage increase first year
  <= 2.5: bad
  > 2.5:
    Statutory holidays
      > 10: good
      <= 10:
        Wage increase first year
          <= 4: bad
          > 4: good

Classification using Decision Trees (The Labor Negotiations Data Example (2))

Wage increase first year
  <= 2.5:
    Working hours per week
      <= 36: good
      > 36:
        Health plan contribution
          none: bad
          half: good
          full: bad
  > 2.5:
    Statutory holidays
      > 10: good
      <= 10:
        Wage increase first year
          <= 4: bad
          > 4: good

Which classifier is good for me?

From the same data set we may get many classifiers with different properties. Here are some of the properties usually considered for a classifier. Note that, depending on the problem under consideration, some properties may or may not be relevant.

•Success rate, that is the percentage of instances classified correctly.
•Ease of computation.
•Readability. There are cases in which the definition of the classifier must be read by a human being. In such cases the readability of the classifier definition is an important parameter to judge the goodness of a classifier.

Finally, we should note that, starting from the same data set, different classification algorithms may return different classifiers. Usually deciding which one to use requires running some testing experiments.

A Classification Algorithm: Decision Trees

Decision trees are among the most used and most effective classifiers. We will show the decision tree classification algorithm with an example: the weather data.
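A decision tree is evaluated by walking from the root to a leaf, testing one attribute per node. A minimal Python sketch of the first labor negotiations tree above (the function and parameter names are our own):

```python
def acceptability(wage_incr_1st_year, statutory_holidays):
    """Walks the first labor-negotiations decision tree:
    root tests wage increase first year, then statutory holidays."""
    if wage_incr_1st_year <= 2.5:
        return "bad"
    # wage increase first year > 2.5
    if statutory_holidays > 10:
        return "good"
    # statutory holidays <= 10: test wage increase first year again
    return "bad" if wage_incr_1st_year <= 4 else "good"

print(acceptability(4.5, 12))  # good
```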
Weather Data

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Constructing a decision tree for the weather data (1)

Information gain of a split on each attribute:
Gain(outlook) = 0.247; Gain(temperature) = 0.029; Gain(humidity) = 0.152; Gain(windy) = 0.048.

H(p1, …, pn) = -p1*log(p1) - … - pn*log(pn)
H(p, q, r) = H(p, q + r) + (q + r)*H(q/(q + r), r/(q + r))

Splitting on outlook gives (sunny: 2 yes, 3 no), (overcast: 4 yes, 0 no), (rainy: 3 yes, 2 no):
H([2, 3]) = -(2/5)*log(2/5) - (3/5)*log(3/5) = 0.971 bits; H([4, 0]) = 0 bits; H([3, 2]) = 0.971 bits
H([2, 3], [4, 0], [3, 2]) = (5/14)*H([2, 3]) + (4/14)*H([4, 0]) + (5/14)*H([3, 2]) = 0.693 bits

Info before any decision tree was created (9 yes, 5 no): H([9, 5]) = 0.940 bits.
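The entropy figures above can be reproduced with a few lines of Python (base-2 logarithm, since entropy is measured in bits; the helper names `H` and `split_info` are our own):

```python
from math import log2

def H(counts):
    """Entropy in bits of a class distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def split_info(subsets):
    """Weighted average entropy of the subsets produced by a split."""
    n = sum(sum(s) for s in subsets)
    return sum(sum(s) / n * H(s) for s in subsets)

before = H([9, 5])                            # ≈ 0.940 bits
after = split_info([[2, 3], [4, 0], [3, 2]])  # ≈ 0.693 bits (outlook split)
print(before, after, before - after)          # gain ≈ 0.247 bits
```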
Gain(outlook) = H([9, 5]) - H([2, 3], [4, 0], [3, 2]) = 0.940 - 0.693 = 0.247

Constructing a decision tree for the weather data (2)

After splitting on outlook, the sunny branch (2 yes, 3 no) is split again; the gains for the candidate attributes on that branch are:
Gain(humidity) = 0.971; Gain(temperature) = 0.571; Gain(windy) = 0.020.

Constructing a decision tree for the weather data (3)

The resulting decision tree:
Outlook
  sunny:
    Humidity
      high: no
      normal: yes
  overcast: yes
  rainy:
    Windy
      false: yes
      true: no

Computational cost of decision tree construction for a data set with m attributes and n instances: O(mn(log n)) + O(n(log n)^2)

Naive Bayes

Counts from the weather data (yes | no):

Outlook:      sunny 2 | 3      overcast 4 | 0      rainy 3 | 2
Temperature:  hot 2 | 2        mild 4 | 2          cool 3 | 1
Humidity:     high 3 | 4       normal 6 | 1
Windy:        false 6 | 2      true 3 | 3
Play:         9 | 5

Relative frequencies (yes | no):

Outlook:      sunny 2/9 | 3/5  overcast 4/9 | 0/5  rainy 3/9 | 2/5
Temperature:  hot 2/9 | 2/5    mild 4/9 | 2/5      cool 3/9 | 1/5
Humidity:     high 3/9 | 4/5   normal 6/9 | 1/5
Windy:        false 6/9 | 2/5  true 3/9 | 3/5
Play:         9/14 | 5/14

Naive Bayes (2)

A new day:

Outlook  Temperature  Humidity  Windy  Play
sunny    cool         high      true   ?

E = (sunny and cool and high and true)

Bayes: P(yes | E) = (P(E | yes) P(yes)) / P(E).

Assuming the attributes are statistically independent:
P(yes | E) = (P(sunny | yes) * P(cool | yes) * P(high | yes) * P(true | yes) * P(yes)) / P(E) = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) / P(E) = 0.0053 / P(E).
P(no | E) = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) / P(E) = 0.0206 / P(E).
Since P(yes | E) + P(no | E) = 1 we have that P(E) = 0.0053 + 0.0206 = 0.0259. Thus:
P(yes | E) = 0.205; P(no | E) = 0.795. Thus we answer: NO.

Obstruction: usually attributes are not statistically independent. However, naive Bayes works quite well in practice.

Performance Evaluation

Split the data set into two parts: a training set and a test set. Use the training set to compute the classifier. Use the test set to evaluate the classifier. Note: the test set data have not been used in the training process. This allows us to compute the following quantities (on the test set). For the sake of simplicity we refer to a two-class prediction.
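The Naive Bayes computation above is easy to replicate; a minimal sketch using the weather-data frequencies (no smoothing, as in the worked example; the function name is our own):

```python
def naive_bayes_score(likelihoods, prior):
    """Product of the per-attribute likelihoods times the class prior,
    i.e. P(E | class) * P(class) under the independence assumption."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

# E = (outlook=sunny, temperature=cool, humidity=high, windy=true)
yes = naive_bayes_score([2/9, 3/9, 3/9, 3/9], 9/14)  # ≈ 0.0053
no = naive_bayes_score([3/5, 1/5, 4/5, 3/5], 5/14)   # ≈ 0.0206
p_yes = yes / (yes + no)  # normalizing replaces dividing by P(E)
print(p_yes, 1 - p_yes)   # ≈ 0.205 and 0.795, so we answer: no
```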
                Predicted class
Actual class    yes                  no
yes             TP (true positive)   FN (false negative)
no              FP (false positive)  TN (true negative)

Lift Chart

A lift chart plots the number of true positives (y-axis) against the predicted positive subset size = (TP + FP)/(TP + FP + TN + FN) (x-axis). Lift charts are typically used in marketing applications.

Receiver Operating Characteristic (ROC) Curve

A ROC curve plots the TP rate = TP/(TP + FN) (y-axis) against the FP rate = FP/(FP + TN) (x-axis). ROC curves are typically used in communication applications.

A glimpse of data mining in Safeguard

We outline our use of data mining techniques in the Safeguard project.

On-line schema

TCP packets on port 2506 are captured with tcpdump; the preprocessed TCP payload is passed through format filters that extract:
  - the sequence of payload bytes;
  - the distribution of payload bytes;
  - conditional probabilities of chars and words in the payload;
  - statistics (avg, var, dev) on the payload bytes.
These feed Classifier 1 (hash table based), Classifier 2 (Hidden Markov Models), and a Cluster Analyzer; the resulting alarm levels are combined by a Supervisor.

Training schema

In the training schema the preprocessed TCP payload is logged, and the analysis components are synthesized off line from the logged data: a hash-table classifier synthesizer produces Classifier 1, an HMM synthesizer produces Classifier 2, and the WEKA data mining tool produces the Cluster Analyzer.
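The quantities behind lift charts and ROC curves derive directly from the confusion matrix; a minimal sketch (the counts below are made-up illustrative values):

```python
def rates(tp, fn, fp, tn):
    """TP rate (y-axis of a ROC curve), FP rate (x-axis of a ROC
    curve), and the predicted-positive subset size used as the
    x-axis of a lift chart."""
    tp_rate = tp / (tp + fn)
    fp_rate = fp / (fp + tn)
    subset_size = (tp + fp) / (tp + fp + tn + fn)
    return tp_rate, fp_rate, subset_size

print(rates(40, 10, 20, 30))  # (0.8, 0.4, 0.6)
```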