CSCI6405 Fall 2003: Data Mining and Data Warehousing
Instructor: Qigang Gao, Office: CS219, Tel: 494-3356, Email: [email protected]
Teaching Assistant: Christopher Jordan, Email: [email protected]
Office Hours: TR, 1:30 - 3:00 PM
17 October 2003

Lecture Outline
Part I: Overview of DM and DW
  1. Introduction (Ch1)
  2. Data preprocessing (Ch3)
Part II: DW and OLAP
  3. Data warehousing and OLAP (Ch2)
Part III: Data Mining Methods/Algorithms
  4. Data mining primitives (Ch4, optional)
  5. Classification data mining (Ch7)
  6. Association data mining (Ch6)
  7. Characterization data mining (Ch5)
  8. Clustering data mining (Ch8)
Part IV: Mining Complex Types of Data
  9. Mining the Web (Ch9)
  10. Mining spatial data (Ch9)
Project Presentations
Ass1 due: Sep 23 (Tue)
Ass2: Sep 23 – Oct 16
Ass3: Oct (14) 16 – Oct 30
Ass4: Oct 30 – Nov 13
Project due: Dec 8 (see ~prof6405/Doc/proj.guide)

5. CLASSIFICATION AND PREDICTION (Ch7)
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Other classification methods
- Summary

Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (continuous-valued attributes are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping the partitioning:
- All samples at a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is then used to label the leaf.
- There are no samples left.
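Below is a minimal Python sketch of the top-down induction procedure just outlined. It is not part of the course material: the representation of examples as dictionaries and the helper names entropy, information_gain, majority_class, and build_tree are illustrative assumptions.

```python
# Minimal ID3-style sketch of top-down, recursive, divide-and-conquer induction.
# Examples are dictionaries, e.g. {"Outlook": "sunny", ..., "PlayTennis": "no"}.
import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the class distribution of `examples` for attribute `target`."""
    total = len(examples)
    counts = Counter(ex[target] for ex in examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attr, target):
    """Expected reduction in entropy from partitioning `examples` on `attr`."""
    total = len(examples)
    expected = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [ex for ex in examples if ex[attr] == value]
        expected += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - expected

def majority_class(examples, target):
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def build_tree(examples, attributes, target):
    classes = set(ex[target] for ex in examples)
    if len(classes) == 1:           # stop: all samples belong to the same class
        return classes.pop()
    if not attributes:              # stop: no attributes left -> majority vote
        return majority_class(examples, target)
    # Greedy step: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, target)
    # With a fixed attribute domain, a value with no samples left would get a
    # leaf labelled by majority vote (the third stopping condition above).
    return tree
```

Representing each example as a dictionary keyed by attribute name keeps the recursion simple; the stopping conditions listed above appear as the leaf cases of build_tree.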
Induction by sorting examples
Training examples for the target PlayTennis, with attributes {Outlook, Temperature, Humidity, Wind}:
  +: X3,X4,X5,X7,X9,X10,X11,X12,X13  (Yes for PlayTennis)
  -: X1,X2,X6,X8,X14                 (No for PlayTennis)
Sorting the examples on Outlook, and then on the remaining attributes within each branch, gives:
  Outlook = sunny:    +: X9,X11        -: X1,X2,X8    [2+,3-]
    test Humidity (from {Temperature, Humidity, Wind}):
      high:   -: X1,X2,X8    [0+,3-]  -> No
      normal: +: X9,X11      [2+,0-]  -> Yes
  Outlook = overcast: +: X3,X7,X12,X13                [4+,0-]  -> Yes
  Outlook = rain:     +: X4,X5,X10     -: X6,X14      [3+,2-]
    test Wind (from {Temperature, Humidity, Wind}):
      strong: -: X6,X14      [0+,2-]  -> No
      weak:   +: X4,X5,X10   [3+,0-]  -> Yes

Attribute selection measure: Information Gain
Which attribute is the best classifier?
- We need a measure of how well a given attribute separates the training examples according to their target classification; we want to select the attribute that is most useful for classifying the current data set.
- What is a good quantitative measure of the worth of an attribute? Information gain (a probability-based property).
- ID3 uses information gain to select among the candidate attributes at each step while growing the tree.

Information measure
Concept of quantity of information: the quantity of information in a message may be formally defined in terms of the number of bits necessary to encode the factor by which the probability of the particular outcome has been increased.
E.g., classification of a specified target:
  Q(message) = log_2 P2(outcome_after) - log_2 P1(outcome_before)
where
  Q is the quantity of information contained in a message,
  P1 is the probability of the outcome before receiving the message,
  P2 is the probability of the outcome after receiving the message.

Information measure (cont)
E.g., suppose a cow has strayed into a pasture represented as an 8 x 8 array of "cells". The outcome of interest is finding the cow, i.e., learning where the cow is.
Case 1: Without knowing which cell it is in, the probability of finding the cow is P1(find-the-cow_before) = 1/64.
Case 2: If someone sends a message telling which cell the cow is in, e.g., cell (4,7), then we have acquired some information. How much information did we receive about finding the cow? P2(find-the-cow_after) = 1.
The probability has gone from 1/64 to 1, a factor-of-64 increase. The quantity of information in the message is the number of bits needed to encode that factor: log_2(64) = 6 bits.

Information measure (cont)
The quantity of information in the message "The cow is in cell (4,7)", with respect to finding the cow, is therefore:
  Information received = log_2 P2 - log_2 P1 = log_2(P2 / P1) = log_2(64) = 6 bits
i.e.,
  Information = log_2 [P2(outcome_after) / P1(outcome_before)]
This formula works provided that 1) the probability before receiving the message is greater than zero, and 2) the probability after receiving the message is greater than or equal to the probability before receiving it.

Information measure (cont)
E.g., classification for the concept PlayTennis with the value (outcome) "Yes":
- P1(message_before) = 9/14: the probability that an unknown example is a "Yes" example without asking any question.
- P2(message_after) = ? A question may be asked, such as "What is Outlook?" or "What is Temperature?", from the attribute list (Outlook, Temperature, Humidity, Wind).
- For "What is Outlook?", the answer yields P2_1(Outlook=sunny) = 2/5, P2_2(Outlook=overcast) = 4/4, P2_3(Outlook=rain) = 3/5.
How do we quantify the information about both the "Yes" and "No" classes? Entropy.

Information measure: Entropy
Measuring impurity: entropy characterizes the impurity of an arbitrary collection of examples. The concept of entropy used in information theory is closely related to the entropy studied in thermodynamics, which measures the disorder in a system of particles. Entropy is a measure of how much confusion, uncertainty, or disorder there is in a situation.

Entropy (cont)
Suppose p_1, p_2, ..., p_n is a set of probabilities such that p_1 + p_2 + ... + p_n = 1 (e.g., 9/14 + 5/14 = 1 for PlayTennis). The entropy for this set of probabilities is defined as:
  Entropy(S) = - Σ_{i=1}^{n} p_i * log_2(p_i)
where S is the data set, p_i is the proportion of S belonging to the i-th class of the target, and n is the number of target classes.
E.g., for PlayTennis:
  Entropy([9+,5-]) = -(9/14)log_2(9/14) - (5/14)log_2(5/14) = 0.940
The entropy is maximized when p_1 = p_2 = ... = p_n = 1/n, and minimized (zero) if p_i = 1 for some i and p_j = 0 for all j ≠ i. (Graph illustration.)
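As a quick check of these numbers, here is a small Python sketch (illustrative, not part of the slides; the helper name entropy is an assumption) that evaluates the entropy formula on class counts:

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))   # PlayTennis set [9+,5-]     -> 0.940...
print(entropy([7, 7]))   # uniform split, p_i = 1/n   -> 1.0 (maximum for two classes)
print(entropy([14, 0]))  # pure set, one p_i = 1      -> 0.0 (minimum)
```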
Entropy (cont)
E.g., for the PlayTennis examples partitioned by Outlook:
  sunny:    +: X9,X11  -: X1,X2,X8   [2+,3-]
  overcast: +: X3,X7,X12,X13         [4+,0-]  -> Yes
  rain:     +: X4,X5,X10  -: X6,X14  [3+,2-]
  Entropy(S) = Entropy([9+,5-]) = -(9/14)log_2(9/14) - (5/14)log_2(5/14) = 0.940
  Entropy(S_sunny) = Entropy([2+,3-]) = -(2/5)log_2(2/5) - (3/5)log_2(3/5) = 0.971
  Entropy(S_overcast) = Entropy([4+,0-]) = -(4/4)log_2(4/4) - (0/4)log_2(0/4) = 0
  Entropy(S_rain) = Entropy([3+,2-]) = -(3/5)log_2(3/5) - (2/5)log_2(2/5) = 0.971
Expected entropy (measuring the overall result of a partition):
  E(A) = Σ_{j=1}^{v} [(s_1j + ... + s_mj) / s] * Entropy(s_j)

Information Gain
Information gain measures the expected reduction in entropy:
  Gain(S, A) = Entropy(S) - Σ_{j=1}^{k} (|S_j| / |S|) * Entropy(S_j)
where k is the number of different values of the attribute A currently under consideration, j ranges from 1 to k, and S_j is the subset of S for which A takes its j-th value. Gain(S, A) measures the change in purity of the examples with respect to the target classification when S is partitioned by A. A good partition produces a small expected entropy and, in turn, a larger gain.

Information Gain (cont)
E.g., the set S has 14 examples, i.e., |S| = 14 with class distribution [9+,5-] for the target PlayTennis = {yes, no}, so we have Entropy(S) = 0.940. When we consider the attribute Outlook, the information about the set S on the target PlayTennis receives a gain:
The original set: S = [9+,5-], |S| = 14
Outlook = {sunny, overcast, rain} partitions the set into subsets:
  S_sunny = [2+,3-], |S_sunny| = 5
  S_overcast = [4+,0-], |S_overcast| = 4
  S_rain = [3+,2-], |S_rain| = 5
Gain(S, Outlook) = Entropy(S) - (5/14)Entropy(S_sunny) - (4/14)Entropy(S_overcast) - (5/14)Entropy(S_rain) = 0.246
where Entropy(S) = 0.940, Entropy(S_sunny) = 0.971, Entropy(S_overcast) = 0, and Entropy(S_rain) = 0.971, as computed above.

Summary of Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- S contains s_i tuples of class C_i, for i = 1, ..., m.
- Entropy measures the information required to classify an arbitrary tuple:
    E(s_1, s_2, ..., s_m) = - Σ_{i=1}^{m} (s_i / s) * log_2(s_i / s)
- Expected entropy of attribute A with values {a_1, a_2, ..., a_v}:
    EE(A) = Σ_{j=1}^{v} [(s_1j + ... + s_mj) / s] * E(s_1j, ..., s_mj)
- Information gained by branching on attribute A:
    Gain(A) = E(s_1, s_2, ..., s_m) - EE(A)
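These formulas can be applied directly to class-count vectors. The short sketch below (illustrative, not from the slides; the helper names are assumptions) reproduces Gain(S, Outlook) from the counts [9+,5-] and [2+,3-], [4+,0-], [3+,2-]:

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, child_counts):
    """Information gain of a partition: parent class counts and one
    class-count vector per child subset."""
    total = sum(parent_counts)
    expected = sum((sum(ch) / total) * entropy(ch) for ch in child_counts)
    return entropy(parent_counts) - expected

# PlayTennis partitioned by Outlook: sunny [2+,3-], overcast [4+,0-], rain [3+,2-]
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))
# -> 0.2467..., i.e. the slides' 0.246, which rounds the intermediate entropies
```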
Example of attribute selection
E.g., the set S has 14 examples, i.e., |S| = 14 with class distribution [9+,5-] for the target PlayTennis = {yes, no}, so Entropy(S) = 0.940. Considering the attribute Outlook = {sunny, overcast, rain}, which splits S into S_sunny = [2+,3-] (5 examples), S_overcast = [4+,0-] (4 examples), and S_rain = [3+,2-] (5 examples), the calculation shown above gives
  Gain(S, Outlook) = Entropy(S) - (5/14)Entropy(S_sunny) - (4/14)Entropy(S_overcast) - (5/14)Entropy(S_rain) = 0.246

Example of attribute selection (cont)
The information gain calculations for all four attributes:
  Gain(S, Outlook) = 0.246
  Gain(S, Humidity) = 0.151
  Gain(S, Wind) = 0.048
  Gain(S, Temperature) = 0.029
The attribute Outlook leads to the best partition of S in terms of the target PlayTennis: its subsets are overall purer (more homogeneous) than those produced by the other attributes, so Outlook is the best classifier at the root.

Example of attribute selection (cont)
Illustration: the partially learned decision tree resulting from the first step of ID3:
  S = {X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14}  [9+,5-]
  Root test: Outlook
    sunny:    {X1,X2,X8,X9,X11}   [2+,3-]  -> ?  Which attribute should be tested here?
    overcast: {X3,X7,X12,X13}     [4+,0-]  -> Yes
    rain:     {X4,X5,X6,X10,X14}  [3+,2-]  -> ?  Which attribute should be tested here?
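The tests shown earlier in the "Induction by sorting examples" tree (Humidity under sunny, Wind under rain) can be checked with the same count-based gain. The sketch below (illustrative, not from the slides) shows that each of those tests gains the entire remaining entropy of its branch, the maximum any attribute could achieve, so both branches end in pure Yes/No leaves:

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    total = sum(parent)
    return entropy(parent) - sum((sum(ch) / total) * entropy(ch) for ch in children)

# Outlook = sunny, [2+,3-]: test Humidity -> high [0+,3-], normal [2+,0-]
print(gain([2, 3], [[0, 3], [2, 0]]))  # -> 0.971 = Entropy([2+,3-]); both children pure
# Outlook = rain, [3+,2-]: test Wind -> strong [0+,2-], weak [3+,0-]
print(gain([3, 2], [[0, 2], [3, 0]]))  # -> 0.971 = Entropy([3+,2-]); both children pure
```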