Data Mining Methods as an Artificial Intelligence Tool
Agnieszka Nowak-Brzezinska
Lecture 4: Decision Trees, k-Nearest Neighbors, and Basket Analysis

BASKET ANALYSIS
• Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD) is an interdisciplinary subfield of computer science: the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
• The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami in the context of frequent itemsets and association rule mining
• Motivation: finding inherent regularities in data
– What products are often purchased together? Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Support and Confidence – Example
• What are the support and confidence of the following rules?
• {Beer} → {Bread}
• {Bread, PeanutButter} → {Jelly}
Support(X → Y) = support(X ∪ Y)
Confidence(X → Y) = support(X ∪ Y) / support(X)

Association Rule Mining – Problem Definition
• Given a set of transactions T = {t1, t2, …, tn} and two thresholds, minsup and minconf,
• find all association rules X → Y with support ≥ minsup and confidence ≥ minconf.
• That is, we want rules with high confidence and support; we call these rules interesting.
• We would like to:
– design an efficient algorithm for mining association rules in large data sets;
– develop an effective approach for distinguishing interesting rules from spurious ones.

Basic Concepts: Frequent Patterns

Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk

• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (Absolute) support, or support count, of X: the frequency (number of occurrences) of itemset X
• (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold

Basic Concepts: Association Rules
(using the same transaction table as above)
• Find all rules X → Y with minimum support and confidence:
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction containing X also contains Y
• Let minsup = 50%, minconf = 50%
• Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
• Association rules (among many more):
– Beer → Diaper (support 60%, confidence 100%)
– Diaper → Beer (support 60%, confidence 75%)

Measures of Predictive Ability
Support refers to the percentage of baskets where the rule is true (both the left- and right-hand-side products are present).
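The support and confidence definitions can be checked directly on the Beer/Diaper transactions. The sketch below is illustrative, not part of the original slides; the function names are mine:

```python
# Minimal sketch: support and confidence over the example transactions.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence(X -> Y) = support(X u Y) / support(X)."""
    return support(set(lhs) | set(rhs)) / support(set(lhs))

print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> Beer -> Diaper (100%)
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> Diaper -> Beer (75%)
```

Both rules reach the 60% support and 50% confidence thresholds used in the example.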
Confidence measures what percentage of baskets that contain the left-hand-side product also contain the right-hand-side product. Lift measures how many times larger the Confidence is than the expected (baseline) Confidence; a lift value greater than 1 is desirable.

Support and Confidence: An Illustration
Transactions: {A, B, C}, {A, C, D}, {B, C, D}, {A, D, E}, {B, C, E}

Rule | Support | Confidence | Lift
A → D | 2/5 | 2/3 | (2/3)/(3/5) = 1.11
C → A | 2/5 | 2/4 | (2/4)/(3/5) = 0.83
A → C | 2/5 | 2/3 | (2/3)/(4/5) = 0.83
B & C → D | 1/5 | 1/3 | (1/3)/(3/5) = 0.56

Problem Decomposition
1. Find all sets of items that have minimum support (frequent itemsets).
2. Use the frequent itemsets to generate the desired rules.

Problem Decomposition – Example

Transaction ID | Items Bought
1 | Shoes, Shirt, Jacket
2 | Shoes, Jacket
3 | Shoes, Jeans
4 | Shirt, Sweatshirt

For min support = 50% (= 2 transactions) and min confidence = 50%:

Frequent Itemset | Support
{Shoes} | 75%
{Shirt} | 50%
{Jacket} | 50%
{Shoes, Jacket} | 50%

For the rule Shoes → Jacket:
• Support = Sup({Shoes, Jacket}) = 50%
• Confidence = Sup({Shoes, Jacket}) / Sup({Shoes}) = 50 / 75 = 66.6%
The rule Jacket → Shoes has 50% support and 100% confidence.

The Apriori Algorithm — Example
Min support = 50% (= 2 transactions)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (frequent 1-itemsets):
{1}: 2, {2}: 3, {3}: 3, {5}: 3

Generate C2 from L1, scan D → C2: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

Generate C3 from L2, scan D → C3: {2 3 5}

L3: {2 3 5}: 2

KNN

Examples of Classification Tasks
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.

KNN – Definition
KNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure.

KNN – Different Names
• k-Nearest Neighbors
• Memory-Based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Case-Based Reasoning
• Lazy Learning

KNN – Short History
• Nearest neighbors have been used in statistical estimation and pattern recognition since the beginning of the 1970s (non-parametric techniques).
• People reason by remembering and learn by doing.
• Thinking is reminding, making analogies.
• The k-Nearest Neighbors (kNN) method provides a simple approach to calculating predictions for unknown observations.
• It calculates a prediction by looking at similar observations and uses some function of their response values, such as an average, to make the prediction.
• Like all prediction methods, it starts with a training set, but instead of producing a mathematical model it determines the optimal number of similar observations to use in making the prediction.
• During the learning phase, the best number of similar observations (k) is chosen.

+/- of kNN
+:
• Noise: kNN is relatively insensitive to errors or outliers in the data.
• Large sets: kNN can be used with large training sets.
-:
• Speed: kNN can be computationally slow when applied to a new data set, since a similarity score must be computed between each presented observation and every member of the training set.

• A kNN model uses the k most similar neighbors to the observation to calculate a prediction.
• Where the response variable is continuous, the prediction is the mean of the nearest neighbors.
• Where the response variable is categorical, the prediction could be presented as a mean, or a voting scheme could be used, that is, selecting the most common classification.
http://people.revoledu.com/kardi/tutorial/KNN/index.html

K Nearest Neighbor (KNN):
• The training set includes classes.
• Examine the K items nearest to the item to be classified.
• The new item is placed in the class with the largest number of close items.
• O(q) for each tuple to be classified (here q is the size of the training set).

KNN
The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3, it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5, it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).

K-Nearest Neighbor Algorithm
Assumptions:
• We have a training set of observations, each of which belongs to one of a given set of classes (Y).
• We have some new observation whose class we do not know, and we want to find it using the kNN algorithm.
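The two prediction modes described above (majority vote for a categorical response, mean for a continuous one) can be sketched in a few lines. This is an illustrative sketch with invented toy data, not the lecture's code:

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k, categorical=True):
    """train: list of (point, response) pairs; query: a point; k: #neighbors."""
    # Sort the training set by Euclidean distance to the query, keep k nearest
    neighbors = sorted(train, key=lambda pr: dist(pr[0], query))[:k]
    responses = [r for _, r in neighbors]
    if categorical:
        return Counter(responses).most_common(1)[0][0]  # voting scheme
    return sum(responses) / len(responses)              # mean of neighbors

# Toy 2-D training set with two classes
train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(train, (5.5, 5.5), k=3))  # "B"
```

Note that there is no training step beyond storing the data, which is why kNN is called a lazy learner.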
To calculate the distance from A(2, 3) to B(7, 8):
D(A, B) = sqrt((7-2)² + (8-3)²) = sqrt(25 + 25) = sqrt(50) = 7.07

If we have 3 points A(2, 3), B(7, 8), and C(5, 1):
• D(A, B) = sqrt((7-2)² + (8-3)²) = sqrt(25 + 25) = sqrt(50) = 7.07
• D(A, C) = sqrt((5-2)² + (3-1)²) = sqrt(9 + 4) = sqrt(13) = 3.61
• D(B, C) = sqrt((7-5)² + (8-1)²) = sqrt(4 + 49) = sqrt(53) = 7.28

K-NN
• Step 1: find the k nearest neighbors of a given object.
• Step 2: choose the class from the neighbors (choose the class which is most frequent among them).

What if we have more dimensions?

| V1 | V2 | V3 | V4 | V5
A | 0.7 | 0.8 | 0.4 | 0.5 | 0.2
B | 0.6 | 0.8 | 0.5 | 0.4 | 0.2
C | 0.8 | 0.9 | 0.7 | 0.8 | 0.9

D(A, B) = sqrt((0.7-0.6)² + (0.8-0.8)² + (0.4-0.5)² + (0.5-0.4)² + (0.2-0.2)²) = sqrt(0.01 + 0.01 + 0.01) = sqrt(0.03) = 0.17
D(A, C) = sqrt((0.7-0.8)² + (0.8-0.9)² + (0.4-0.7)² + (0.5-0.8)² + (0.2-0.9)²) = sqrt(0.01 + 0.01 + 0.09 + 0.09 + 0.49) = sqrt(0.69) = 0.83
D(B, C) = sqrt((0.6-0.8)² + (0.8-0.9)² + (0.5-0.7)² + (0.4-0.8)² + (0.2-0.9)²) = sqrt(0.04 + 0.01 + 0.04 + 0.16 + 0.49) = sqrt(0.74) = 0.86

We are looking for the smallest distance: A and B are closest.

kNN advantages and disadvantages:
Advantages:
• Noise: kNN is relatively insensitive to errors or outliers in the data.
• Large sets: kNN can be used with large training sets.
Disadvantage:
• Speed: kNN can be computationally slow when applied to a new data set, since a similarity score must be computed between each presented observation and every member of the training set.

SSE
Smaller SSE values indicate that the predictions are closer to the actual values. The SSE evaluation criterion is used to assess the quality of each model.
To assess different values for k, the sum of squared errors (SSE) evaluation criterion is used (see the table for detecting the best values for k).
• The Euclidean distance calculation was selected to represent the distance between observations. To calculate an optimal value for k, different values of k were tested between 2 and 20. In this example, the value of k with the lowest SSE is 6, and this value is selected for use with the kNN model.

Observation to be predicted
• To illustrate, a data set of cars is used and a model built to predict car fuel efficiency (MPG).
• The following variables are used as descriptors within the model: Cylinders, Displacement, Horsepower, Weight, Acceleration, Model Year, and Origin.

Predicting
• Once a value for k has been set in the training phase, the model can be used to make predictions.
• For example, an observation x has values for the descriptor variables but not for the response. Using the same technique for determining similarity as in the model-building phase, observation x is compared against all observations in the training set.
• A distance is computed between x and each training set observation. The closest k observations are selected and a prediction is made, for example, using their average value.

The observation (Dodge Aspen) was presented to the kNN model built to predict car fuel efficiency (MPG). The Dodge Aspen observation was compared to all observations in the training set and a Euclidean distance was computed. The six observations with the smallest distance scores are selected, as shown in the table. The prediction is the average of these top six observations, that is, 19.5. The cross-validated prediction is shown alongside the actual value.
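The k-selection procedure just described can be sketched as a leave-one-out loop: for each candidate k, predict every training observation from its k nearest neighbors and sum the squared errors. The data below are an invented 1-D toy set, not the cars data set:

```python
# Illustrative sketch of choosing k by the SSE criterion (hypothetical data).
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def sse_for_k(data, k):
    """data: list of (point, response); leave-one-out kNN regression SSE."""
    total = 0.0
    for i, (x, y) in enumerate(data):
        others = [pr for j, pr in enumerate(data) if j != i]
        nn = sorted(others, key=lambda pr: euclidean(pr[0], x))[:k]
        pred = sum(r for _, r in nn) / k      # mean of the k neighbors
        total += (y - pred) ** 2
    return total

data = [((v,), 2.0 * v) for v in range(10)]   # toy 1-D training set
best_k = min(range(2, 6), key=lambda k: sse_for_k(data, k))
```

The value of k with the lowest SSE is kept; in the lecture's car example that value happened to be 6.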
Nearest Neighbor Classification
• Input:
– a set of stored records
– k: the number of nearest neighbors
• Output: the class label of the unknown record
• Compute the distance: d(p, q) = sqrt(Σi (pi − qi)²)
• Identify the k nearest neighbors
• Determine the class label of the unknown record based on the class labels of its nearest neighbors (e.g., by taking a majority vote)

K Nearest Neighbors
• Advantages: simple; powerful; requires no training time
• Disadvantages: memory intensive; classification/estimation is slow

KNN = k nearest neighbors
[Scatter plot of Gene 1 vs. Gene 2 with a query point. If red = brain tumor and yellow = healthy, do I have a brain tumor?]
KNN is another method for classification. For each point it looks at its k nearest neighbors. For example, with k = 3 the method looks at a point's 3 nearest neighbors to decide how to classify it. If the majority are "red", it classifies the point as red.

KNN – exercise
[Same Gene 1 / Gene 2 plot.] In the above example, how will the point be classified by KNN with k = 1?

KNN Classification
[Scatter plot: Age vs. Loan $ ($0 to $250,000), with Default and Non-Default classes.]

KNN Classification – Distance
D = sqrt((x1 − x2)² + (y1 − y2)²)

Age | Loan | Default | Distance
25 | $40,000 | N | 102,000
35 | $60,000 | N | 82,000
45 | $80,000 | N | 62,000
20 | $20,000 | N | 122,000
35 | $120,000 | N | 22,000
52 | $18,000 | N | 124,000
23 | $95,000 | Y | 47,000
40 | $62,000 | Y | 80,000
60 | $100,000 | Y | 42,000
48 | $220,000 | Y | 78,000
33 | $150,000 | Y | 8,000
48 | $142,000 | ? |

KNN Classification – Standardized Distance

Age | Loan | Default
0.125 | 0.11 | N
0.375 | 0.21 | N
0.625 | 0.31 | N
0 | 0.01 | N
0.375 | 0.50 | N
0.8 | 0.00 | N
0.075 | 0.38 | Y
0.5 | 0.22 | Y
1 | 0.41 | Y
0.7 | 1.00 | Y
0.325 | 0.65 | Y
0.7 | 0.61 | ?
Xs = (X − Min) / (Max − Min)
Distances from the standardized observations to the standardized new case (0.7, 0.61): 0.7652, 0.5200, 0.3160, 0.9245, 0.3428, 0.6220, 0.6669, 0.4437, 0.3650, 0.3861, 0.3771

KNN Regression – Distance
D = sqrt((x1 − x2)² + (y1 − y2)²)

Age | Loan | House Price Index | Distance
25 | $40,000 | 135 | 102,000
35 | $60,000 | 256 | 82,000
45 | $80,000 | 231 | 62,000
20 | $20,000 | 267 | 122,000
35 | $120,000 | 139 | 22,000
52 | $18,000 | 150 | 124,000
23 | $95,000 | 127 | 47,000
40 | $62,000 | 216 | 80,000
60 | $100,000 | 139 | 42,000
48 | $220,000 | 250 | 78,000
33 | $150,000 | 264 | 8,000
48 | $142,000 | ? |

KNN Regression – Standardized Distance
Xs = (X − Min) / (Max − Min)

Age | Loan | House Price Index | Distance
0.125 | 0.11 | 135 | 0.7652
0.375 | 0.21 | 256 | 0.5200
0.625 | 0.31 | 231 | 0.3160
0 | 0.01 | 267 | 0.9245
0.375 | 0.50 | 139 | 0.3428
0.8 | 0.00 | 150 | 0.6220
0.075 | 0.38 | 127 | 0.6669
0.5 | 0.22 | 216 | 0.4437
1 | 0.41 | 139 | 0.3650
0.7 | 1.00 | 250 | 0.3861
0.325 | 0.65 | 264 | 0.3771
0.7 | 0.61 | ? |

KNN – Number of Neighbors
• If K = 1, select the nearest neighbor.
• If K > 1:
– for classification, select the most frequent class among the neighbors;
– for regression, calculate the average of the K neighbors.

Distance – Categorical Variables
X | Y | Distance
Male | Male | 0
Male | Female | 1
If x = y, then D = 0; if x ≠ y, then D = 1.

DECISION TREES

Example of a Decision Tree

Training data:
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

Model (decision tree), with Refund, MarSt, and TaxInc as splitting attributes:
Refund = Yes → NO
Refund = No → MarSt:
– MarSt = Married → NO
– MarSt = Single or Divorced → TaxInc:
– TaxInc < 80K → NO
– TaxInc > 80K → YES

Another Example of a Decision Tree (same training data):
MarSt = Married → NO
MarSt = Single or Divorced → Refund:
– Refund = Yes → NO
– Refund = No → TaxInc:
– TaxInc < 80K → NO
– TaxInc > 80K → YES

There could be more than one tree that fits the same data!
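The first tree above reduces to a few nested conditions. The sketch below is illustrative (the function name is mine, and income is in thousands, matching the "80K" threshold); it is not the lecture's code:

```python
# Sketch: applying the first decision tree above to a single record.
def classify(refund, marital_status, taxable_income):
    """Returns the predicted Cheat label for one record (income in K)."""
    if refund == "Yes":
        return "No"                 # left branch of the root
    if marital_status == "Married":
        return "No"                 # MarSt split
    # Single or Divorced: split on taxable income at 80K
    return "Yes" if taxable_income > 80 else "No"

print(classify("No", "Single", 90))   # "Yes" (matches training record 10)
```

Checking every training record against this function reproduces the Cheat column, which is exactly what "the tree fits the data" means.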
Decision Tree Classification Task

Training set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes

Test set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?

Induction: the tree induction algorithm learns a model (the decision tree) from the training set. Deduction: the model is applied to the test set.

Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree: Refund = No, so follow the No branch to MarSt; MarSt = Married, so assign Cheat = "No".
Decision tree
[Example: a small tree asking "distance < 20 km?" (yes/no) and "weather?" (sunny/rainy).]
Concepts: root, inner node, leaf, edges.

Decision tree construction
[Figure: a two-dimensional feature space (y1, y2) of class-1 and class-2 points, partitioned by thresholds a1, a2, a3 into rectangles, with the corresponding tree of tests y2 < a1, y1 < a2, y1 < a3.]

Partition for a node:
1. Quantitative data: comparison with some threshold value (yi > θi, yes/no).
2. Qualitative data: each possible value has to be used (one edge per value yi1, yi2, …, yik).

The partition of a node for qualitative data:
1. For each attribute yi, calculate the value of some given measure.
2. Choose the attribute which is optimal in the sense of the chosen measure.
3. From the given node, create a number of edges equal to the number of values of attribute yi.

• Decision trees are often generated by hand to precisely and consistently define a decision-making process.
• However, they can also be generated automatically from the data.
• They consist of a series of decision points based on certain variables.

Splitting Criteria – Dividing Observations
• It is common for the split at each level to be a two-way split.
• There are methods that split more than two ways.
• However, care should be taken using these methods, since splitting the set many ways early in the construction of the tree may result in missing interesting relationships that become exposed as the tree-growing process continues.

Any variable type can be split using a two-way split:
• Dichotomous: Variables with two values are the most straightforward to split, since each branch represents a specific value. For example, a variable Temperature may have only two values, hot and cold. Observations will be split into those with hot and those with cold temperature values.
• Nominal: Since nominal values are discrete values with no order, a two-way split is accomplished with one subset comprising the observations that equal a certain value and the other subset comprising those that do not. For example, a variable Color that can take the values red, green, blue, and black may be split two ways: observations with Color equal to red generate one subset, and those not equal to red (that is, green, blue, and black) create the other.
• Ordinal: Where a variable's discrete values are ordered, the resulting subsets may be made up of more than one value, as long as the ordering is retained. For example, a variable Quality with possible values low, medium, high, and excellent may be split two ways in three order-preserving manners: observations equaling low or medium in one subset and observations equaling high or excellent in the other; low in one set and medium, high, and excellent in the other; or low, medium, and high in one set and excellent in the other.
• Continuous: For variables with continuous values to be split two ways, a specific cutoff value must be determined, where one side of the split has values less than the cutoff and the other side has values greater than or equal to it. For example, a variable Weight may take any value between 0 and 1,000, with a selected cutoff of 200: the first subset contains observations where Weight is below 200, and the other subset those where Weight is greater than or equal to 200.

A splitting criterion has two components:
• (1) the variable to split on, and
• (2) the values of the variable to split on.
To determine the best split, all possible splits of all variables must be considered. Since it is necessary to rank the splits, a score should be calculated for each split. There are many ways to rank a split.
The following describes two approaches for prioritizing splits, based on whether the response is categorical or continuous.
• The objective of an optimal split is to create subsets whose observations have a single response value. In this example, there are 20 observations prior to splitting.
• The response variable (Temperature) has two possible values, hot and cold. Prior to the split, the response has an even distribution: the number of observations where Temperature equals hot is ten, and the number where Temperature equals cold is also ten.
• Different splitting criteria are considered for these observations, resulting in different distributions of the response variable in each subset (N2 and N3):
• Split a: Each subset contains ten observations. All ten observations in N2 have hot temperature values, whereas the ten observations in node N3 are all cold.
• Split b: Again each subset (N2 and N3) contains ten observations. However, in this case there is an even distribution of hot and cold values in each subset.
• Split c: Here the splitting criterion results in two subsets where node N2 has nine observations (one hot and eight cold) and node N3 has eleven observations (nine hot and two cold).
• Split a is the best split, since each node contains observations where the response is one category or the other.
• Split b results in the same even split of hot and cold values (50% hot, 50% cold) in each of the resulting nodes (N2 and N3) and would not be considered a good split.
• Split c is a good split; however, it is not as clean as split a, since there are both hot and cold values in both subsets.
• The proportion of hot and cold values is biased towards cold values in node N2 and towards hot values in N3. When determining the best splitting criterion, it is important to determine how clean each split is, based on the proportion of the different categories of the response variable (its impurity).
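The impurity comparison of splits a, b, and c can be made concrete with the entropy measure and the Gain formula introduced on the following slides. This is an illustrative sketch, not the lecture's code:

```python
from math import log2

# Sketch: entropy impurity and information gain for two-class nodes.
def entropy(pos, neg):
    """Entropy(S) = -p*log2(p) - (1-p)*log2(1-p), with 0*log2(0) = 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

def gain(parent, children):
    """parent and children are (pos, neg) counts; weighted impurity drop."""
    n = sum(parent)
    return entropy(*parent) - sum(sum(c) / n * entropy(*c) for c in children)

print(entropy(10, 0))                      # 0.0: a pure node
print(entropy(5, 5))                       # 1.0: an even split
print(gain((10, 10), [(10, 0), (0, 10)]))  # 1.0: split a, perfectly clean
print(gain((10, 10), [(5, 5), (5, 5)]))    # 0.0: split b, no improvement
```

Split c, with child counts (1, 8) and (9, 2), scores strictly between these two extremes, matching the qualitative ranking above.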
• S is a sample of training examples.
• p is the proportion of positive examples in S.
• Entropy measures the impurity of S:
Entropy(S) = −p log2 p − (1 − p) log2(1 − p)

Misclassification, Gini, and Entropy
• There are three primary methods for calculating impurity: misclassification, Gini, and entropy.
• In scenario 1, all ten observations have the value cold, whereas in scenario 2, one observation has the value hot and nine observations have the value cold.
• For each scenario, an entropy score is calculated.
• Cleaner splits result in lower scores.
• In scenarios 1 and 11, the split cleanly breaks the set into observations with only one value; the score for these scenarios is 0. In scenario 6, the observations are split evenly across the two values, and this is reflected in a score of 1. In the other cases, the score reflects how well the two values are separated.
• In order to determine the best split, we now need to calculate a ranking based on how cleanly each split separates the response data.
• This is calculated on the basis of the impurity before and after the split.
• The formula for this calculation, Gain, is shown below:
Gain = Entropy(parent) − Σj (N(vj) / N) × Entropy(vj), for j = 1, …, k
• N is the number of observations in the parent node,
• k is the number of possible resulting nodes,
• N(vj) is the number of observations in the jth child node, and
• vj is the set of observations for the jth node.
• It should be noted that the Gain formula can be used with other impurity methods by replacing the entropy calculation.

ID3 Algorithm
• The ID3 algorithm is considered a very simple decision tree algorithm (Quinlan, 1986).
• ID3 uses information gain as its splitting criterion.
• Growing stops when all instances belong to a single value of the target feature or when the best information gain is not greater than zero.
• ID3 does not apply any pruning procedures, nor does it handle numeric attributes or missing values.

C4.5 Algorithm
• C4.5 is an evolution of ID3, presented by the same author (Quinlan, 1993).
• It uses gain ratio as its splitting criterion.
• Splitting ceases when the number of instances to be split is below a certain threshold.
• Error-based pruning is performed after the growing phase.
• C4.5 can handle numeric attributes.
• It can induce from a training set that incorporates missing values, by using a corrected gain ratio criterion as presented above.

Example: Decision Tree for PlayTennis
[Slides: the PlayTennis training data and the resulting decision tree.]

3.4 The Basic Decision Tree Learning Algorithm
• Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort training examples to leaf nodes
5. If training examples are perfectly classified, then STOP; else iterate over new leaf nodes
• Which attribute is best?

Entropy
• S is a sample of training examples
• p⊕ is the proportion of positive examples in S
• p⊖ is the proportion of negative examples in S
• Entropy measures the impurity of S:
Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

Information Gain
Gain(S, A) = expected reduction in entropy due to sorting on A

Training Examples
[Table of the PlayTennis training examples D1–D14.]

Selecting the Next Attribute
Which attribute is the best classifier?
Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(Ssunny, Temperature) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019

Converting a Tree to Rules
IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
…

Factors Affecting Sunburn

Name | Hair | Height | Weight | Lotion | Result
Sarah | blonde | average | light | no | positive
Dana | blonde | tall | average | yes | negative
Alex | brown | short | average | yes | negative
Annie | blonde | short | average | no | positive
Emily | red | average | heavy | no | positive
Peter | brown | tall | heavy | no | negative
John | brown | average | heavy | no | negative
Katie | blonde | short | light | yes | negative

Phase 1: From Data to Tree
Perform average entropy calculations on the complete data set for each of the four attributes:
• Hair (blonde / red / brown): average entropy = 0.50
• Height (short / average / tall): average entropy = 0.69
• Weight (light / average / heavy): average entropy = 0.94
• Lotion (no / yes): average entropy = 0.61
The attribute "hair color" is selected as the first test because it minimizes the entropy.

Hair color = blonde:
Name | Height | Weight | Lotion | Result
Sarah | average | light | no | positive
Dana | tall | average | yes | negative
Annie | short | average | no | positive
Katie | short | light | yes | negative

Hair color = red:
Emily | average | heavy | no | positive

Hair color = brown:
Alex | short | average | yes | negative
Peter | tall | heavy | no | negative
John | average | heavy | no | negative
• Similarly, we now choose another test to separate the sunburned individuals from the inhomogeneous blonde-haired subset {Sarah, Dana, Annie, Katie}.
• The attribute "lotion" is selected because it minimizes the entropy in the blonde-hair subset.
• Thus, using the "hair color" and "lotion" tests together ensures the proper identification of all the samples.
[The completed decision tree.]

Decision tree

Age | Gene1 | Gene2 | Smoker | Operation
<20 | high | high | yes | yes
<20 | high | high | yes | yes
<20 | low | low | no | no
[20,40] | low | high | yes | yes
[20,40] | high | high | no | yes
[20,40] | high | low | yes | no
>40 | low | low | yes | no
>40 | high | low | no | no
>40 | low | high | no | no

Decision tree built from this table:
Gene2 = high → Age > 40? Yes → Operation = no; No → Operation = yes
Gene2 = low → Operation = no

Decision trees are automatically built from training data and are used for classification. They also tell us which features are most important.

Reduced Error Pruning – Example

Training table:
| colour | age | type | y | class
1 | black | 11 | B | yes | +
2 | white | 23 | B | yes | −
3 | black | 22 | A | no | −
4 | black | 18 | B | no | +
5 | black | 15 | B | yes | −
6 | white | 27 | B | no | +

[Figures: a decision tree over the attributes age (≥ 20 / < 20), y (yes/no), colour (white/black), and type (A/B), annotated with counts of + and − examples at each node (e.g., +6 −0, +6 −1, +0 −4). Reduced error pruning considers subtrees bottom-up and replaces a subtree with a leaf labeled by its majority class whenever this does not increase the error on a pruning set; the successive slides show the tree shrinking step by step.]
Advantages and Disadvantages of Decision Trees
• 1. Decision trees are self-explanatory and, when compacted, are also easy to follow. Furthermore, decision trees can be converted to a set of rules.
• 2. Decision trees can handle both nominal and numeric input attributes.
• 3. Decision trees are capable of handling data sets that may have errors.
• 4. Decision trees are capable of handling data sets that may have missing values.

QUALITY OF THE CLASSIFICATION

Training and Testing
Each data set to which we apply some data mining algorithm has to be tested. That is why each data set is divided into two subsets:
• a train set
• a test set
All data are randomly partitioned: the train set is used for decision tree induction, and the test set for testing the accuracy.
The classification algorithm builds a classifier from the train set, e.g.:
If age < 31 or car type = "sports", then risk = high.
The classifier is then applied to the test set (Age, Car Type → Risk) to measure accuracy, and afterwards to new data.

Sensitivity vs. Specificity
• Sensitivity (also called the true positive rate, or the recall rate in some fields) measures the proportion of actual positives which are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).
• Specificity (sometimes called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
• These two measures are closely related to the concepts of type I and type II errors.

Confusion matrix:
| Actual true | Actual false
Predicted true | 8 | 1
Predicted false | 2 | 7

Sensitivity = TP / (TP + FN) = 8 / (8 + 2) = 0.8
Specificity = TN / (FP + TN) = 7 / (1 + 7) = 0.88
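The two quality measures can be read straight off the confusion matrix above (TP = 8, FN = 2, FP = 1, TN = 7). A minimal sketch, not part of the original slides:

```python
# Sketch: sensitivity and specificity from confusion-matrix counts.
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

print(sensitivity(8, 2))            # 0.8
print(round(specificity(7, 1), 2))  # 0.88
```

Both values match the worked example: 8 of the 10 actual positives and 7 of the 8 actual negatives are identified correctly.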