CSCI6405 Fall 2003
Data Mining and Data Warehousing

Instructor: Qigang Gao, Office: CS219, Tel: 494-3356, Email: [email protected]
Teaching Assistant: Christopher Jordan, Email: [email protected]
Office Hours: TR, 1:30 - 3:00 PM

Lectures Outline

Part III: Data Mining Methods/Algorithms
4. Data mining primitives (Ch4, optional)
5. Classification data mining (Ch7)
6. Association data mining (Ch6)
7. Characterization data mining (Ch5)
8. Clustering data mining (Ch8)

Part IV: Mining Complex Types of Data
9. Mining the Web (Ch9)
10. Mining spatial data (Ch9)

Project Presentations
Ass3: Oct (14) 16 - Oct 30
Ass4: Oct 30 - Nov 13
Project Due: Dec 8
~prof6405/Doc/proj.guide

5. CLASSIFICATION AND PREDICTION (Ch7)

What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Other classification methods
Summary

Induction by sorting examples

+: X3,X4,X5,X7,X9,X11,X12,X13   (Yes for PlayTennis)
-: X1,X2,X6,X8,X14              (No for PlayTennis)
Candidate attributes: {Outlook, Temperature, Humidity, Wind}

Outlook
|- sunny:    +: X9,X11  -: X1,X2,X8            [2+,3-]
|            Humidity   (remaining: {Tem, Hum, Win})
|            |- high:   +: none  -: X1,X2,X8   [0+,3-]  No
|            |- normal: +: X9,X11  -: none     [2+,0-]  Yes
|- overcast: +: X3,X7,X12,X13  -: none         [4+,0-]  Yes
|- rain:     +: X4,X5,X10  -: X6,X14           [3+,2-]
             Wind       (remaining: {Tem, Hum, Win})
             |- strong:  +: none  -: X6,X14    [0+,2-]  No
             |- weak:    +: X4,X5,X10  -: none [3+,0-]  Yes

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
---------------------------------------------------------
 1   sunny     hot          high      weak    no
 2   sunny     hot          high      strong  no
 3   overcast  hot          high      weak    yes
 4   rain      mild         high      weak    yes
 5   rain      cool         normal    weak    yes
 6   rain      cool         normal    strong  no
 7   overcast  cool         normal    strong  yes
 8   sunny     mild         high      weak    no
 9   sunny     cool         normal    weak    yes
10   rain      mild         normal    weak    yes
11   sunny     mild         normal    strong  yes
12   overcast  mild         high      strong  yes
13   overcast  hot          normal    weak    yes
14   rain      mild         high      strong  no

Review: Information Measure

The quantity of information in the message "The cow is in cell (4,7)" about finding the cow is:

  Information received = log_2 P2(outcome_after) - log_2 P1(outcome_before)
                       = log_2 (P2 / P1)
                       = log_2 64 = 6 bits

  Information = log_2 ( P2(outcome_after) / P1(outcome_before) )

This formula works provided that 1) the probability before receiving the answer is greater than zero, and 2) the probability after receiving the answer is greater than or equal to the probability before receiving the answer.
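As a quick numeric check of the information measure above, here is a minimal C sketch (purely illustrative; the function name info_bits is invented for this note, it is not part of the course code). It evaluates log_2(P2/P1) for the cow example, where the probability of knowing the cow's cell rises from 1/64 to 1:

#include <math.h>
#include <stdio.h>

/* Information (in bits) received when the probability of the outcome
   rises from p_before to p_after: log2(p_after / p_before).          */
static double info_bits(double p_before, double p_after)
{
    return log2(p_after / p_before);
}

int main(void)
{
    /* Cow example: 64 equally likely cells, then the exact cell is named. */
    printf("%.1f bits\n", info_bits(1.0 / 64.0, 1.0));   /* prints 6.0 */
    return 0;
}

Compile with a C99 compiler and link the math library (e.g., cc -std=c99 example.c -lm).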
E.g., classification for the concept PlayTennis with the value (outcome) "Yes":

* P1(message_before) = 9/14
  The probability of an unknown example being a "Yes" example without asking any question.
* P2(message_after) = ?
  A question may be asked, such as "What is Outlook?" or "What is Temperature?", from the list (Outlook, Temperature, Humidity, Wind).
* What is Outlook? Then the message can be from:
  P2_1(Outlook = sunny) = 2/5,  P2_2(Outlook = overcast) = 4/4,  P2_3(Outlook = rain) = 3/5.

How do we quantify the information about both the "Yes" and the "No" classes? Entropy.

Entropy: measuring the class impurity of a data set

Entropy characterizes the impurity of an arbitrary collection of examples. The concept of entropy, used in information theory, is closely related to the entropy studied in thermodynamics, which measures the disorder in a system of particles. Entropy is a measure of how much confusion, uncertainty, or disorder there is in a situation.

Suppose p1, p2, ..., pn is a set of probabilities such that p1 + p2 + ... + pn = 1.
E.g., 9/14 + 5/14 = 1 for PlayTennis.

The entropy for this set of probabilities is defined as:

  Entropy(S) = - ∑_{i=1}^{n} p_i * log_2(p_i)

where S is the data set, p_i is the proportion of S belonging to class i of the target, and n is the number of target classes.

E.g., for PlayTennis:

  Entropy([9+,5-]) = -(9/14)log_2(9/14) - (5/14)log_2(5/14) = 0.940

The entropy is maximized when p1 = p2 = ... = pn = 1/n.
The entropy is minimized (zero) if p_i = 1 for some i and p_j = 0 for all j != i.

If the target attribute T has n values (the more general case):

  Entropy(S) = ∑_{i=1}^{n} -p_i log_2(p_i)

where i ranges over the values of T.
E.g., if the target attribute were Outlook = {sunny, overcast, rain}, then n = 3.

Entropy (cont.)

E.g.,  +: X3,X4,X5,X7,X9,X11,X12,X13   (Yes for PlayTennis)
       -: X1,X2,X6,X8,X14              (No for PlayTennis)

Outlook
|- sunny:    +: X9,X11  -: X1,X2,X8   [2+,3-]
|- overcast: X3,X7,X12,X13            [4+,0-]  Yes
|- rain:     X4,X5,X6,X10,X14         [3+,2-]

  Entropy(S)          = Entropy([9+,5-]) = -(9/14)log_2(9/14) - (5/14)log_2(5/14) = 0.940
  Entropy(S_sunny)    = Entropy([2+,3-]) = -(2/5)log_2(2/5) - (3/5)log_2(3/5)     = 0.971
  Entropy(S_overcast) = Entropy([4+,0-]) = -(4/4)log_2(4/4) - (0/4)log_2(0/4)     = 0
  Entropy(S_rain)     = Entropy([3+,2-]) = -(3/5)log_2(3/5) - (2/5)log_2(2/5)     = 0.971

Expected entropy (measuring the overall result of a partition on attribute A with values {a_1, ..., a_v}):

  E(A) = ∑_{j=1}^{v} ((s_1j + ... + s_mj) / s) * Entropy(S_j)

Review: from information to information gain

  Outcome                  Question                Message
  Find the cow             Where is the cow?       Cell (5,7), P = 1
  Play tennis (Yes, No)    What is the outlook?    {sunny, overcast, rain}
                           Temperature?            {hot, mild, cool}
                           Humidity?               {high, normal}
                           Windy?                  {weak, strong}

* A decision tree represents such a series of questions. The answer to the first question determines which follow-up question is asked next. If the questions are well chosen, a surprisingly short series is enough to classify an incoming record accurately.

* The basic idea behind the decision tree algorithm is to test the most important attribute first. By "most important," we mean the one that makes the most difference to the classification of an example. This way, we hope to get the correct classification with a small number of tests, so that all paths in the tree will be short and the tree as a whole will be small.

Summary of Attribute Selection Measure: select the attribute with the highest information gain.

S contains s_i tuples of class C_i, for i = 1, ..., m.

Entropy measures the information required to classify an arbitrary tuple:

  E(s_1, s_2, ..., s_m) = - ∑_{i=1}^{m} (s_i / s) log_2 (s_i / s)

Expected entropy of attribute A with values {a_1, a_2, ..., a_v}:

  EE(A) = ∑_{j=1}^{v} ((s_1j + ... + s_mj) / s) * E(s_1j, ..., s_mj)

Information gained by branching on attribute A:

  Gain(A) = E(s_1, s_2, ..., s_m) - EE(A)
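The entropy formula above translates directly into code. A minimal C sketch (illustrative only; not taken from the course's ID3 program) that reproduces the PlayTennis entropy values:

#include <math.h>
#include <stdio.h>

/* Entropy(S) = -sum_i p_i log2 p_i, for a set described by its per-class counts. */
static double entropy(const int counts[], int n_classes)
{
    int s = 0;
    for (int i = 0; i < n_classes; i++) s += counts[i];
    double e = 0.0;
    for (int i = 0; i < n_classes; i++) {
        if (counts[i] == 0) continue;        /* treat 0 * log2(0) as 0 */
        double p = (double)counts[i] / s;
        e -= p * log2(p);
    }
    return e;
}

int main(void)
{
    int S[2]       = {9, 5};   /* PlayTennis: [9+,5-] */
    int S_sunny[2] = {2, 3};
    int S_over[2]  = {4, 0};
    int S_rain[2]  = {3, 2};
    printf("Entropy(S)          = %.3f\n", entropy(S, 2));        /* 0.940 */
    printf("Entropy(S_sunny)    = %.3f\n", entropy(S_sunny, 2));  /* 0.971 */
    printf("Entropy(S_overcast) = %.3f\n", entropy(S_over, 2));   /* 0.000 */
    printf("Entropy(S_rain)     = %.3f\n", entropy(S_rain, 2));   /* 0.971 */
    return 0;
}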
E.g., the set S has 14 examples, i.e. |S| = 14 = [9+,5-] for the target PlayTennis = {yes, no}, so Entropy(S) = 0.940. When we consider the attribute Outlook, the information about the set S with respect to the target PlayTennis receives a gain. Outlook = {sunny, overcast, rain} partitions the set into subsets:

  S_sunny    = [2+,3-], |S_sunny|    = 5
  S_overcast = [4+,0-], |S_overcast| = 4
  S_rain     = [3+,2-], |S_rain|     = 5

  Gain(S,Outlook) = Entropy(S) - (5/14)Entropy(S_sunny) - (4/14)Entropy(S_overcast) - (5/14)Entropy(S_rain)
                  = 0.246

where
  Entropy(S)          = Entropy([9+,5-]) = -(9/14)log_2(9/14) - (5/14)log_2(5/14) = 0.940
  Entropy(S_sunny)    = Entropy([2+,3-]) = -(2/5)log_2(2/5) - (3/5)log_2(3/5)     = 0.971
  Entropy(S_overcast) = Entropy([4+,0-]) = -(4/4)log_2(4/4) - (0/4)log_2(0/4)     = 0
  Entropy(S_rain)     = Entropy([3+,2-]) = -(3/5)log_2(3/5) - (2/5)log_2(2/5)     = 0.971

The information gain calculations for all four attributes:

  Gain(S,Outlook)     = 0.246
  Gain(S,Humidity)    = 0.151
  Gain(S,Windy)       = 0.048
  Gain(S,Temperature) = 0.029

The attribute Outlook leads to the best partition of S in terms of the target PlayTennis. In other words, its subsets are overall more pure (more homogeneous) than those produced by the other attributes, so Outlook is the best classifier at this node.

* Illustration: the partially learned decision tree resulting from the first step of ID3:

  {X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14}   S: [9+,5-]

  Outlook
  |- sunny:    {X1,X2,X8,X9,X11}   [2+,3-]  ?  Which attribute should be tested here?
  |- overcast: {X3,X7,X12,X13}     [4+,0-]  Yes
  |- rain:     {X4,X5,X6,X10,X14}  [3+,2-]  ?  Which attribute should be tested here?

How to build a decision tree

The representative algorithms:
* ID3 (Iterative Dichotomiser 3): Quinlan, J.R., "Induction of Decision Trees", Machine Learning, Vol. 1, No. 1, pp. 81-106, 1986.
* C4.5/C5.0: Quinlan, J.R., "C4.5: Programs for Machine Learning", San Francisco: Morgan Kaufmann, 1993.
* Other popular algorithms: CART, CHAID, Chi-squared, ...

Main features of the ID3 algorithm:
* ID3 is specialized to learning boolean-valued functions.
* ID3 is a greedy search algorithm that grows the tree top-down, at each node selecting the attribute that best classifies the local training examples. This process continues until the tree perfectly classifies the training examples or until all attributes have been used.

General procedure for building a decision tree:

1) At each decision level, decide which attribute is the best classifier. The available attributes are evaluated to find the best classifier based on how much information each provides for classifying the current data set with respect to the target. The evaluation is a statistical test of how well an attribute alone classifies the training examples. E.g., at level 0 (the root), the available attributes are all attributes except the target, and the current data set is the original input data (a code sketch of this root-level evaluation follows this list).

2) The data set is divided into subsets according to the values of the selected classifier.

3) If a subset is a leaf node (i.e., all examples in the subset belong to the same class label), that branch ends. Otherwise, the node becomes the root of a new subtree.

4) The process is repeated using the data set associated with each descendant node to select the best attribute to test at that point in the tree. The learning process is accomplished through the construction of the decision tree.
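The root-level evaluation in step 1 is exactly the gain comparison listed above. The following self-contained C sketch (illustrative only; the per-value [+,-] counts are read off the PlayTennis table) reproduces the four root gains:

#include <math.h>
#include <stdio.h>

/* Entropy of a binary class distribution given the [+,-] counts. */
static double entropy2(int pos, int neg)
{
    double e = 0.0, s = pos + neg;
    if (pos) e -= (pos / s) * log2(pos / s);
    if (neg) e -= (neg / s) * log2(neg / s);
    return e;
}

/* Gain(S,A) = Entropy(S) - sum_j (|S_j|/|S|) * Entropy(S_j), binary target. */
static double gain(int pos, int neg, const int subsets[][2], int v)
{
    double s = pos + neg, ee = 0.0;
    for (int j = 0; j < v; j++)
        ee += ((subsets[j][0] + subsets[j][1]) / s)
              * entropy2(subsets[j][0], subsets[j][1]);
    return entropy2(pos, neg) - ee;
}

int main(void)
{
    /* [+,-] counts of PlayTennis within each attribute value, from the table. */
    const int outlook[3][2]     = {{2,3},{4,0},{3,2}};   /* sunny, overcast, rain */
    const int temperature[3][2] = {{2,2},{4,2},{3,1}};   /* hot, mild, cool       */
    const int humidity[2][2]    = {{3,4},{6,1}};         /* high, normal          */
    const int wind[2][2]        = {{6,2},{3,3}};         /* weak, strong          */

    printf("Gain(S,Outlook)     = %.3f\n", gain(9, 5, outlook, 3));
    printf("Gain(S,Humidity)    = %.3f\n", gain(9, 5, humidity, 2));
    printf("Gain(S,Windy)       = %.3f\n", gain(9, 5, wind, 2));
    printf("Gain(S,Temperature) = %.3f\n", gain(9, 5, temperature, 3));
    return 0;
}

The printed values agree with the slide figures up to rounding in the third decimal place (for example, the exact Gain(S,Outlook) is about 0.2467, which the slides round to 0.246); Outlook still comes out clearly highest.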
ID3 Algorithm

ID3(Examples, Target, Attributes)
/* Examples are the training examples. Target is the attribute whose value is to be
   predicted by the tree. Attributes is a list of other attributes that may be tested
   by the learned decision tree. Returns a decision tree that correctly classifies
   the given Examples. */

* Create a Root node for the tree.
* If all Examples are positive, return the single-node tree Root with label = "+".
* If all Examples are negative, return the single-node tree Root with label = "-".
* If the Attributes list is empty, return the single-node tree Root with label = the most common value of Target in Examples (majority voting).
  /* You may provide a confidence measure (#%) on the classification, or give a warning message if two classes tie. */
* Otherwise Begin
  - A <- the attribute from Attributes that best classifies Examples.
    /* The best attribute is the one with the highest information gain, measured by Gain(S, A), where S is the set of Examples. */
  - The decision attribute for Root <- A.
  - For each possible value v_i of A:
    - Add a new tree branch below Root, corresponding to the test A = v_i.
    - Let Examples_(v_i) be the subset of Examples that have value v_i for A.
    - If Examples_(v_i) is empty,
      - then below this new branch add a leaf node with label = the most common value of Target in Examples;
      - else below this new branch add the subtree ID3(Examples_(v_i), Target, Attributes - {A}).
* End
* Return Root
  (A compact C sketch of this recursion appears after the worked example below.)

Implementation of the ID3 algorithm

Two key components: greedy search and the information gain measure.

1. Decide the root attribute A, the attribute of S that best classifies Examples. The best attribute is the one with the highest information gain (i.e., with the lowest expected entropy). Root = Outlook after evaluating:

     Gain(S,Outlook)     = 0.246
     Gain(S,Humidity)    = 0.151
     Gain(S,Windy)       = 0.048
     Gain(S,Temperature) = 0.029

   The branches of Outlook are created below the root for each of its possible values: sunny, overcast, and rain.

2. For each possible value v_i of A:
   - Add a new tree branch below Root, corresponding to the test A = v_i.
   - Let Examples_(v_i) be the subset of Examples that have value v_i for A.
   - If a subset is not a leaf, then call ID3(Examples_(v_i), Target, Attributes - {A}).

* Illustration: the partially learned decision tree resulting from the first step of ID3:

  {X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14}   S: [9+,5-]

  Outlook
  |- sunny:    {X1,X2,X8,X9,X11}   [2+,3-]  ?  Which attribute should be tested here?
  |- overcast: {X3,X7,X12,X13}     [4+,0-]  Yes
  |- rain:     {X4,X5,X6,X10,X14}  [3+,2-]  ?  Which attribute should be tested here?

Note:
1) The training examples are sorted to the corresponding descendant nodes.
2) The overcast descendant has only positive examples and therefore becomes a leaf node with classification Yes.
3) The sunny and rain nodes will be further expanded, by selecting the attribute with the highest information gain relative to the new subsets of examples. E.g., for S_sunny = {X1,X2,X8,X9,X11}:

     Gain(S_sunny,Humidity)    = .970 - (3/5)0.0 - (2/5)0.0            = .970
     Gain(S_sunny,Temperature) = .970 - (2/5)0.0 - (2/5)1.0 - (1/5)0.0 = .570
     Gain(S_sunny,Windy)       = .970 - (3/5).918 - (2/5)1.0           = .019

  The tree after expanding the sunny node with Humidity:

  Outlook
  |- sunny:    {X1,X2,X8,X9,X11}  [2+,3-]
  |            Humidity
  |            |- high:   {X1,X2,X8}  [0+,3-]  No
  |            |- normal: {X9,X11}    [2+,0-]  Yes
  |- overcast: {X3,X7,X12,X13}    [4+,0-]  Yes
  |- rain:     {X4,X5,X6,X10,X14} [3+,2-]  ?  Which attribute should be tested here?
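Putting the pseudocode and the gain computation together, here is a compact, self-contained C sketch of the ID3 recursion on the PlayTennis table. It is illustrative only, not the course's ass3-demo program; the integer encodings, array layout, and function names are choices made for this sketch.

#include <math.h>
#include <stdio.h>

#define N 14   /* number of training examples       */
#define A 4    /* candidate (non-target) attributes */

/* PlayTennis table encoded as small integers.
   Outlook: 0=sunny 1=overcast 2=rain;  Temperature: 0=hot 1=mild 2=cool;
   Humidity: 0=high 1=normal;  Wind: 0=weak 1=strong;  last column: PlayTennis (1=yes). */
static const int data[N][A + 1] = {
    {0,0,0,0,0},{0,0,0,1,0},{1,0,0,0,1},{2,1,0,0,1},{2,2,1,0,1},
    {2,2,1,1,0},{1,2,1,1,1},{0,1,0,0,0},{0,2,1,0,1},{2,1,1,0,1},
    {0,1,1,1,1},{1,1,0,1,1},{1,0,1,0,1},{2,1,0,1,0}
};
static const char *attr_name[A]   = {"Outlook", "Temperature", "Humidity", "Wind"};
static const int   attr_vals[A]   = {3, 3, 2, 2};
static const char *val_name[A][3] = {{"sunny","overcast","rain"}, {"hot","mild","cool"},
                                     {"high","normal",""}, {"weak","strong",""}};

static double entropy(int pos, int neg)
{
    double e = 0.0, s = pos + neg;
    if (pos) e -= (pos / s) * log2(pos / s);
    if (neg) e -= (neg / s) * log2(neg / s);
    return e;
}

/* Gain of splitting the example subset idx[0..n-1] on attribute a. */
static double gain(const int idx[], int n, int a)
{
    int pos = 0;
    for (int k = 0; k < n; k++) pos += data[idx[k]][A];
    double ee = 0.0;
    for (int v = 0; v < attr_vals[a]; v++) {
        int p = 0, q = 0;
        for (int k = 0; k < n; k++) {
            if (data[idx[k]][a] != v) continue;
            if (data[idx[k]][A]) p++; else q++;
        }
        if (p + q) ee += ((double)(p + q) / n) * entropy(p, q);
    }
    return entropy(pos, n - pos) - ee;
}

/* Recursive ID3: grow and print the tree for the subset idx; `used` is a bitmask
   of the attributes already tested on this path.                                */
static void id3(const int idx[], int n, int used, int depth)
{
    int pos = 0;
    for (int k = 0; k < n; k++) pos += data[idx[k]][A];
    if (pos == n) { printf("%*s-> Yes\n", 2 * depth, ""); return; }   /* all positive */
    if (pos == 0) { printf("%*s-> No\n",  2 * depth, ""); return; }   /* all negative */

    int best = -1;
    double bestg = -1.0;
    for (int a = 0; a < A; a++) {            /* unused attribute with highest gain */
        if (used & (1 << a)) continue;
        double g = gain(idx, n, a);
        if (g > bestg) { bestg = g; best = a; }
    }
    if (best < 0) {                          /* no attribute left: majority vote */
        printf("%*s-> %s\n", 2 * depth, "", 2 * pos >= n ? "Yes" : "No");
        return;
    }
    for (int v = 0; v < attr_vals[best]; v++) {
        int sub[N], m = 0;
        for (int k = 0; k < n; k++)
            if (data[idx[k]][best] == v) sub[m++] = idx[k];
        printf("%*s%s = %s\n", 2 * depth, "", attr_name[best], val_name[best][v]);
        if (m == 0) printf("%*s-> %s\n", 2 * depth + 2, "", 2 * pos >= n ? "Yes" : "No");
        else id3(sub, m, used | (1 << best), depth + 1);
    }
}

int main(void)
{
    int all[N];
    for (int i = 0; i < N; i++) all[i] = i;
    id3(all, N, 0, 0);   /* Outlook at the root, Humidity under sunny, Wind under rain */
    return 0;
}

Compiled with a C99 compiler (and -lm), the sketch prints the same tree as the illustrations: Outlook at the root, Humidity under sunny, Wind under rain, with Yes/No leaves.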
Implementation Example

An example ID3 implementation (executable code): ~prof6405Ass/ass3-demo/

To run the program:

  $ ID3

You then have the following interface:

  What is the name of the file containing your data?
  data1                    /* do not include any data files in your assign3 submission */
  Please choose an attribute (by number):
  1. Humidity
  2. Windy
  3. PlayTennis
  Attribute: 3
  Target attribute is: PlayTennis

  If outlook is sunny, then
     if Humidity is high, then PlayTennis is N.
     if Humidity is normal, then PlayTennis is P.
  If outlook is overcast, then PlayTennis is P.
  If outlook is rain, then
     if Windy is false, then PlayTennis is P.
     if Windy is true, then PlayTennis is N.

* The code structure of the implementation

The program uses a top-down, modular approach. Overview of the program code:

  ID3.c:        main()
  readdata.c:   readdata()
  getattrib.c:  getattrib()
  maketree.c:   maketree()
  functions.c:  binary(), sameresults(), oneattrib()
  gain.c:       choose(), gain(), entropy()
  printtree.c:  printtree()

The call structure:

  main() --> readdata --> getattrib --> binary --> maketree --> sameresults --> oneattrib
         --> choose --> gain --> entropy --> printtree ...

sameresults(): This function loops through all tuples, setting a counter if a second value is encountered. If only one value is present, a leaf node is created.

oneattrib(): This function is called if the only attribute left is the target attribute. As the target attribute is binary, the number of occurrences of each value is counted, with the leaf node being assigned the value which occurs most often.

Summary of decision tree induction

- A decision tree is a practical method for classification mining.

* Efficient heuristic search strategy
  Greedy search (or divide and conquer): it is a hill-climbing search without backtracking over other attributes. It chooses the test (attribute) that best discriminates among the target classes and builds branches upon it, inferring the decision tree by growing it from the root downward. This process is repeated until each record arrives at a leaf node.

* Transparent representation of mined decision rules
  Classification rule: there is a unique path from the root to each leaf, and that path is an expression of the rule used to classify the records. The attractiveness of tree-based methods is due in large part to the fact that, in contrast to neural networks, decision trees represent rules. Rules can readily be expressed in English so that we humans can understand them.

* The key issues shared by different algorithms:
  - How to select the best attribute to test (attribute-selection measures).
  - How to prune the decision tree (the different pruning strategies).
  - How to handle noise or missing data.
  - The ability to handle different target-attribute domains.

* Effectiveness measure:
  - We measure the effectiveness of a decision tree, taken as a whole, by applying it to a collection of previously unseen records and observing the percentage classified correctly.
  - We must also pay attention to the quality of the individual branches of the tree. Each path through the tree represents a rule, and some rules are better than others. Sometimes the predictive power of the whole tree can be improved by pruning back some of its weaker branches.

* At each node in the tree, we can measure:
  - The number of records entering the node.
  - The way those records would be classified if this were a leaf node.
  - The percentage of records classified correctly at this node.
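Measuring effectiveness on unseen records means running each record down the tree, i.e. following one rule path from the root to a leaf. A minimal C sketch (illustrative only, not part of the course program) that hard-codes the rules printed by the demo above:

#include <stdio.h>
#include <string.h>

/* The rules printed by the program, transcribed as nested if-statements. */
static const char *classify(const char *outlook, const char *humidity, const char *windy)
{
    if (strcmp(outlook, "sunny") == 0)
        return strcmp(humidity, "high") == 0 ? "N" : "P";
    if (strcmp(outlook, "overcast") == 0)
        return "P";
    return strcmp(windy, "true") == 0 ? "N" : "P";   /* outlook == rain */
}

int main(void)
{
    /* Score one previously unseen record, as the effectiveness measure requires. */
    printf("PlayTennis = %s\n", classify("rain", "high", "false"));   /* P */
    return 0;
}

Counting how often classify() agrees with the true label over a held-out set gives the percentage-correct measure described above.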
Properties of the ID3 search strategy

* ID3's search space:
  - It is the hypothesis space of all decision trees: a complete space of finite discrete-valued functions, relative to the available attributes.
  - Because every finite discrete-valued function can be represented by some decision tree, ID3 avoids one of the major risks of methods that search incomplete hypothesis spaces (such as methods that consider only conjunctive hypotheses): that the hypothesis space might not contain the target function.

* As with other inductive learning methods, ID3 can be characterized as searching a space of hypotheses for one that best fits the training examples.

E.g., the search space for PlayTennis classification:

  Search-tree-root (set S)
  |- Outlook:     sunny | overcast | rain   -> remaining {Temperature, Humidity, Windy} for S_sunny, ...
  |- Temperature: hot | mild | cool         -> remaining {Outlook, Humidity, Windy} for S_hot, ...
  |- Humidity:    high | normal             -> remaining {Outlook, Temperature, Windy} for S_high, ...
  |- Windy:       strong | weak             -> remaining {Outlook, Temperature, Humidity} for S_strong, ...
  ...

The goal is the shortest decision tree.

The best tree based on the information gain heuristic:

  Outlook
  |- sunny:    Humidity
  |            |- high:   No
  |            |- normal: Yes
  |- overcast: Yes
  |- rain:     Wind
               |- strong: No
               |- weak:   Yes