DSCI 4520/5240 Data Mining - Lecture Notes
Lecture 5: Decision Trees II
Some slide material taken from: Witten & Frank 2000, Olson & Shi 2007, de Ville 2006, SAS Education 2005

A simple example: Weather Data

Outlook    Temp   Humidity   Windy   Play?
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

Decision tree for the weather data
The tree for these data splits first on Outlook:
• Outlook = sunny → split on Humidity: high → no, normal → yes
• Outlook = overcast → yes
• Outlook = rainy → split on Windy: false → yes, true → no

Pseudo-code for 1R
For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule: assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate

Let's apply 1R to the weather data: consider the first of the 4 attributes (Outlook, Temp, Humidity, Windy). Take all of its values (sunny, overcast, rainy) and make 3 corresponding rules. Continue until you have all 4 sets of rules.

Evaluating the Weather Attributes in 1R
(Table of 1R rules and error counts for each attribute; * indicates a random choice between two equally likely outcomes.)

Discretization in 1R
Consider continuous Temperature data, after sorting them in ascending order:
65  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
One way to discretize temperature is to place breakpoints wherever the class changes:
Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No
To avoid overfitting, 1R instead requires a minimum number of observations of the majority class in each partition (here, at least 3), extending the partition when a "run" of the same class continues past the breakpoint:
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
If adjacent partitions have the same majority class, the partitions are merged:
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No
The final discretization (77.5 is the midpoint between 75 and 80, where the remaining break falls) leads to the rule set:
IF temperature <= 77.5 THEN Yes
IF temperature > 77.5 THEN No
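The 1R procedure above is simple enough to implement directly. Below is a minimal Python sketch, not part of the original slides: it assumes the weather data are stored as tuples in the column order Outlook, Temp, Humidity, Windy, Play, and the helper name one_r is my own. Note that ties between attributes are broken here by attribute order, whereas the slides break ties between equally good rule sets randomly.

```python
from collections import Counter

# Weather data as (outlook, temp, humidity, windy, play) tuples -- column order assumed
data = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]
attributes = ["outlook", "temp", "humidity", "windy"]

def one_r(rows, attr_index):
    """Build the 1R rule set for one attribute; return (rules, number of errors)."""
    # For each value of the attribute, count how often each class appears.
    counts = {}
    for row in rows:
        counts.setdefault(row[attr_index], Counter())[row[-1]] += 1
    rules, errors = {}, 0
    for value, class_counts in counts.items():
        majority_class, majority_count = class_counts.most_common(1)[0]
        rules[value] = majority_class                           # rule: value -> most frequent class
        errors += sum(class_counts.values()) - majority_count   # rows this rule misclassifies
    return rules, errors

# Choose the attribute whose rule set has the smallest error rate.
best = min(range(len(attributes)), key=lambda i: one_r(data, i)[1])
rules, errors = one_r(data, best)
print(f"1R picks '{attributes[best]}' with rules {rules} and {errors}/{len(data)} errors")
```

On these data, Outlook and Humidity tie at 4/14 errors; the sketch simply keeps the first of the two (Outlook, with rules sunny → no, overcast → yes, rainy → yes). A numeric attribute such as the continuous Temperature values would first be discretized as described above.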
Which attribute to select?
Splitting the 14 training examples on each attribute yields these class distributions (yes/no) at the branches:
• Outlook: sunny 2/3, overcast 4/0, rainy 3/2
• Temperature: hot 2/2, mild 4/2, cool 3/1
• Humidity: high 3/4, normal 6/1
• Windy: false 6/2, true 3/3

Information
• Information is measured in bits.
• Given a probability distribution, the information required to predict an event is the distribution's entropy.
• Entropy gives the additional required information (i.e., the information deficit) in bits.
• This can involve fractions of bits!
• The negative sign in the entropy formula is needed to convert all the negative logs back to positive values.
Formula for computing the entropy:
Entropy(p1, p2, …, pn) = –p1 log p1 – p2 log p2 – … – pn log pn

A criterion for attribute selection
• Which is the best attribute? The one that will result in the smallest tree.
• Heuristic: choose the attribute that produces the "purest" nodes.
• A popular impurity criterion is Information: the extra information needed to classify an instance. It takes a low value for pure nodes and a high value for impure nodes.
• We can then compare the tree before and after the split using Information Gain = Info(before) – Info(after).
• Information Gain increases with the average purity of the subsets that an attribute produces.
• Strategy: choose the attribute that results in the greatest information gain.

Weather example: attribute "outlook"
• Outlook = "Sunny": Info([2,3]) = entropy(2/5, 3/5) = –(2/5)log(2/5) – (3/5)log(3/5) = 0.971 bits
• Outlook = "Overcast": Info([4,0]) = entropy(1, 0) = –1 log(1) – 0 log(0) = 0 bits (taking 0 log(0) = 0 by definition)
• Outlook = "Rainy": Info([3,2]) = entropy(3/5, 2/5) = –(3/5)log(3/5) – (2/5)log(2/5) = 0.971 bits
• Expected information for attribute Outlook:
  Info([2,3], [4,0], [3,2]) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.693 bits

Computing the Information Gain
• Information Gain = Information(before) – Information(after)
  Gain(Outlook) = Info([9,5]) – Info([2,3], [4,0], [3,2]) = 0.940 – 0.693 = 0.247 bits
• Information Gain for the attributes of the Weather Data:
  Gain(Outlook) = 0.247 bits
  Gain(Temperature) = 0.029 bits
  Gain(Humidity) = 0.152 bits
  Gain(Windy) = 0.048 bits
Outlook has the largest gain, so it becomes the root split.

Continuing to split
Within the Outlook = sunny subset (5 examples; yes/no counts: Temperature hot 0/2, mild 1/1, cool 1/0; Humidity high 0/3, normal 2/0; Windy false 1/2, true 1/1), recompute the gains for the remaining attributes:
  Gain(Temperature) = 0.571 bits
  Gain(Humidity) = 0.971 bits
  Gain(Windy) = 0.020 bits
Humidity has the largest gain, so it becomes the split under the sunny branch, matching the decision tree shown earlier.
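To make the entropy and information-gain arithmetic above concrete, here is a small Python sketch. It is not from the original slides; the data layout and the function names entropy and info_gain are my own. It reproduces the gains reported for the weather data and for the sunny subset.

```python
from collections import Counter
from math import log2

# Same weather data as before: (outlook, temp, humidity, windy, play) tuples
data = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]
attributes = ["outlook", "temp", "humidity", "windy"]

def entropy(counts):
    """Entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn); 0*log(0) is treated as 0."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(rows, attr_index):
    """Information Gain = Info(before split) - expected Info(after splitting on the attribute)."""
    info_before = entropy(Counter(row[-1] for row in rows).values())
    # Class counts within each branch created by the attribute.
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr_index], Counter())[row[-1]] += 1
    info_after = sum(
        (sum(c.values()) / len(rows)) * entropy(c.values())   # weight each branch by its size
        for c in subsets.values()
    )
    return info_before - info_after

for i, name in enumerate(attributes):
    print(f"Gain({name}) = {info_gain(data, i):.3f} bits")
# Prints 0.247, 0.029, 0.152, 0.048 bits, matching the slide values.

# Gains within the Outlook = sunny subset ("Continuing to split"):
sunny = [row for row in data if row[0] == "sunny"]
for i, name in enumerate(attributes[1:], start=1):
    print(f"sunny subset: Gain({name}) = {info_gain(sunny, i):.3f} bits")
# Prints 0.571, 0.971, 0.020 bits; Humidity is the best second-level split.
```

Using base-2 logarithms is what makes the answers come out in bits; the weighting of each branch by its share of the examples is the "expected information" step shown in the Outlook calculation above.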