Chapter 6. Classification and Prediction

• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian classification
• Classification by backpropagation
• Classification based on concepts from association rule mining
• Other classification methods
• Prediction
• Classification accuracy
• Summary
Classification by Decision Tree Induction



• Decision tree
  – A flow-chart-like tree structure
  – Internal nodes denote a test on an attribute
  – Branches represent outcomes of the test
  – Leaf nodes represent class labels or class distributions
• Decision tree generation consists of two phases
  – Tree construction
    • At the start, all the training examples are at the root
    • Partition the examples recursively based on selected attributes
  – Tree pruning
    • Identify and remove branches that reflect noise or outliers
• Use of a decision tree: classifying an unknown sample
  – Test the attribute values of the sample against the decision tree
Terminology
A Simple Decision Tree Example: Predicting Responses to a Credit Card Marketing Campaign

    Income?
    ├─ Low  → Debts?
    │         ├─ Low  → Non-Responder
    │         └─ High → Responder
    └─ High → Gender?
              ├─ Male   → Children?
              │           ├─ Many → Responder
              │           └─ Few  → Non-Responder
              └─ Female → Non-Responder

Root node: Income. The internal nodes (Income, Debts, Gender, Children) each test an attribute; each branch (arc) corresponds to one outcome of the test; the leaf nodes carry the class labels (Responder / Non-Responder).

This decision tree says that people with low income and high debts, and high-income males with many children, are likely responders.
Decision Tree (Formally)
Given:
• D = {t1, …, tn} where ti = <ti1, …, tih>
• A database schema containing the attributes {A1, A2, …, Ah}
• Classes C = {C1, …, Cm}
A decision (or classification) tree is a tree associated with D such that:
• each internal node is labeled with an attribute Ai,
• each arc is labeled with a predicate that can be applied to the attribute at its parent, and
• each leaf node is labeled with a class Cj.
Using a Decision Tree
Classifying a New Example for our Credit Card Marketing Campaign
Feed the new example into the root of the tree and follow the relevant path towards a leaf, based on the attribute values of the example (using the credit-card tree from the previous slide).

Assume Sarah has high income. Starting at the root (Income = High) and following the branches that match her remaining attribute values leads to a Non-Responder leaf: the tree predicts that Sarah will not respond to our campaign.
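A short sketch of this lookup in Python (not from the slides): the nested-dict encoding of the credit-card tree and the attribute names are illustrative assumptions, but they mirror the tree drawn above.

    # Hypothetical encoding: internal node = ("attribute", {value: subtree}), leaf = class label.
    credit_tree = ("Income", {
        "Low":  ("Debts", {"Low": "Non-Responder", "High": "Responder"}),
        "High": ("Gender", {
            "Male":   ("Children", {"Many": "Responder", "Few": "Non-Responder"}),
            "Female": "Non-Responder",
        }),
    })

    def classify(node, example):
        """Follow the branch matching the example's value for each tested attribute."""
        while isinstance(node, tuple):           # still at an internal node
            attribute, branches = node
            node = branches[example[attribute]]  # take the matching arc
        return node                              # leaf = predicted class

    # A high-income female example is routed High -> Female -> Non-Responder.
    print(classify(credit_tree, {"Income": "High", "Gender": "Female"}))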
Algorithms for Decision Trees
• ID3, ID4, ID5, C4.0, C4.5, C5.0, ACLS, and ASSISTANT:
  use information gain or gain ratio as the splitting criterion.
• CART (Classification And Regression Trees):
  uses the Gini diversity index or the Twoing criterion as the measure of impurity when deciding on a split.
• CHAID:
  a statistical approach that uses the Chi-squared test (of correlation/association, dealt with in an earlier lecture) when deciding on the best split. Other statistical approaches, such as those by Goodman and Kruskal, and Zhou and Dillon, use the Asymmetrical Tau or Symmetrical Tau to choose the best discriminator.
• Hunt's Concept Learning System (CLS), and MINIMAX:
  minimize the cost of classifying examples correctly or incorrectly.
(See Sestito & Dillon, Chapter 3; Witten & Frank, Section 4.3; or Dunham, Section 4.4, if you're interested in learning more.)
(If interested see also: http://www.kdnuggets.com/software/classification-tree-rules.html)
Algorithm for Decision Tree Induction


• Basic algorithm (a greedy algorithm); see the sketch below
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (continuous-valued attributes are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping the partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is employed to label the leaf)
  – There are no samples left
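A minimal Python sketch of this greedy, recursive procedure, assuming categorical attributes and a caller-supplied scoring function (e.g., information gain); the function and variable names are illustrative, not taken from any particular system.

    from collections import Counter

    def build_tree(examples, attributes, score):
        """examples: list of (features_dict, label); score(examples, attr) ranks candidate splits."""
        labels = [label for _, label in examples]
        if not examples:                                  # no samples left
            return None
        if len(set(labels)) == 1:                         # all samples in one class
            return labels[0]
        if not attributes:                                # no attributes left: majority vote
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: score(examples, a))
        branches = {}
        for value in {feats[best] for feats, _ in examples}:
            subset = [(f, l) for f, l in examples if f[best] == value]
            branches[value] = build_tree(subset, [a for a in attributes if a != best], score)
        return (best, branches)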
DT Induction
(Figure: illustration of decision-tree induction.)
DT Issues
• How to choose splitting attributes: the performance of the decision tree depends heavily on the choice of the splitting attributes. This includes
  – choosing the attributes that will be used as splitting attributes, and
  – ordering the splitting attributes.
• Number of splits: the number of branches at each internal node.
• Tree structure: in many cases, a balanced tree is desirable.
• Stopping criteria: when to stop splitting? There is an accuracy vs. performance trade-off; stopping earlier also helps prevent overfitting.
• Training data: too small a training set yields an inaccurate tree; too large a set can lead to overfitting.
• Pruning: to improve performance during classification.
Training Set
Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

Which attribute to select?
Height Example Data
Name       Gender  Height  Output1  Output2
Kristina   F       1.6m    Short    Medium
Jim        M       2m      Tall     Medium
Maggie     F       1.9m    Medium   Tall
Martha     F       1.88m   Medium   Tall
Stephanie  F       1.7m    Short    Medium
Bob        M       1.85m   Medium   Medium
Kathy      F       1.6m    Short    Medium
Dave       M       1.7m    Short    Medium
Worth      M       2.2m    Tall     Tall
Steven     M       2.1m    Tall     Tall
Debbie     F       1.8m    Medium   Medium
Todd       M       1.95m   Medium   Medium
Kim        F       1.9m    Medium   Tall
Amy        F       1.8m    Medium   Medium
Wynette    F       1.75m   Medium   Medium
Comparing DTs

• Tree structure: in many cases, a balanced tree is desirable.
(Figure: two decision trees for the same data, one balanced and one deep.)
Building a compact tree



• The key to building a decision tree is which attribute to choose in order to branch (choosing the splitting attributes).
• The heuristic is to choose the attribute with the maximum information gain, based on information theory.
• Decision tree induction is therefore often based on information theory.
Information theory


• Information theory provides a mathematical basis for measuring information content.
• To understand the notion of information, think of it as providing the answer to a question, for example, whether a coin will come up heads.
  – If one already has a good guess about the answer, then the actual answer is less informative.
  – If one already knows that the coin is unbalanced, so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin.
Information theory (cont …)


• For a fair (honest) coin, you have no information, and you are willing to pay more (say, in dollars) for advance information: the less you know, the more valuable the information.
• Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits. One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.
Information theory

• In general, if the possible answers v_i have probabilities P(v_i), then the information content I (entropy) of the actual answer is given by

  I = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)

• For example, for the tossing of a fair coin we get

  I(\text{coin toss}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}

• If the coin is loaded to give 99% heads, we get I ≈ 0.08 bits, and as the probability of heads goes to 1, the information content of the actual answer goes to 0.
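A one-function Python sketch (not from the slides) that evaluates this formula and reproduces the two figures above; the function name is illustrative.

    import math

    def entropy(probs):
        """I = -sum p*log2(p), skipping zero-probability outcomes."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))    # fair coin   -> 1.0 bit
    print(entropy([0.99, 0.01]))  # loaded coin -> about 0.08 bits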
Binary entropy function
If X can assume the values 0 and 1, the entropy of X is defined as H(X) = -Pr(X=0) log2 Pr(X=0) - Pr(X=1) log2 Pr(X=1). It has the value 0 if Pr(X=0)=1 or Pr(X=1)=1, and it reaches its maximum when Pr(X=0)=Pr(X=1)=1/2 (the entropy is then 1).
When p = 1/2, the uncertainty is at a maximum; if one were to place a fair bet on the outcome in this case, there is no advantage to be gained from prior knowledge of the probabilities. In this case, the entropy is at its maximum value of 1 bit.
Attribute Selection Measure: Information Gain
(ID3/C4.5)



• Select the attribute with the highest information gain.
• Suppose S contains s_i tuples of class C_i, for i = 1, …, m.
• The expected information needed to classify an arbitrary tuple is

  I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s}\log_2\frac{s_i}{s}

• The entropy of attribute A with values {a_1, a_2, …, a_v} is

  E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj})

• The information gained by branching on attribute A is

  Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)
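A hedged Python sketch of these three quantities for categorical attributes; the data layout (a list of (features_dict, label) pairs) and the function names are assumptions made for illustration.

    import math
    from collections import Counter

    def info(labels):
        """I(s1,...,sm): entropy of a list of class labels."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def expected_info(examples, attribute):
        """E(A): entropy after partitioning the examples on `attribute`."""
        parts = {}
        for feats, label in examples:
            parts.setdefault(feats[attribute], []).append(label)
        return sum(len(p) / len(examples) * info(p) for p in parts.values())

    def gain(examples, attribute):
        """Gain(A) = I(...) - E(A)."""
        return info([label for _, label in examples]) - expected_info(examples, attribute)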
Building a Tree : Choosing a Split Example
Example: applicants who defaulted on loans (i.e., had payment difficulties):

ApplicantID  City    Children  Income  Status
1            Philly  Many      Medium  DEFAULTS
2            Philly  Many      Low     DEFAULTS
3            Philly  Few       Medium  PAYS
4            Philly  Few       High    PAYS

Try a split on the Children attribute: the Many branch receives the two DEFAULTS applicants, and the Few branch receives the two PAYS applicants.

Try a split on the Income attribute: the Low branch receives one DEFAULTS applicant, the Medium branch receives one DEFAULTS and one PAYS applicant, and the High branch receives one PAYS applicant.

Notice how the split on the Children attribute gives purer partitions. It is therefore chosen as the first (and in this case only) split.
Building a Tree
Choosing a Split: Worked Example (Entropy)

Split on the Children attribute. The Few subnode contains the two PAYS examples; the Many subnode contains the two DEFAULTS examples. (In each formula the two terms correspond to the probabilities of the two classes at that node.)

(1) Entropy of each subnode:

  \text{Entropy of subnode 1 (Few)} = -\tfrac{0}{2}\log_2\tfrac{0}{2} - \tfrac{2}{2}\log_2\tfrac{2}{2} = 0

  \text{Entropy of subnode 2 (Many)} = -\tfrac{2}{2}\log_2\tfrac{2}{2} - \tfrac{0}{2}\log_2\tfrac{0}{2} = 0

(2) Weighted entropy of this split option (each subnode's entropy weighted by the proportion of examples it contains):

  \text{Weighted entropy of split on Children} = \tfrac{2}{4}\cdot 0 + \tfrac{2}{4}\cdot 0 = 0
Building a Tree
Choosing a Split: Worked Example (Entropy)

Split on the Income attribute. The Low subnode contains one DEFAULTS example, the Medium subnode one DEFAULTS and one PAYS example, and the High subnode one PAYS example.

(1) Entropy of each subnode:

  \text{Entropy of subnode 1 (Low)} = -\tfrac{0}{1}\log_2\tfrac{0}{1} - \tfrac{1}{1}\log_2\tfrac{1}{1} = 0

  \text{Entropy of subnode 2 (Medium)} = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1

  \text{Entropy of subnode 3 (High)} = -\tfrac{1}{1}\log_2\tfrac{1}{1} - \tfrac{0}{1}\log_2\tfrac{0}{1} = 0

(2) Weighted entropy of this split option (each subnode's entropy weighted by its proportion of the examples):

  \text{Weighted entropy of split on Income} = \tfrac{1}{4}\cdot 0 + \tfrac{2}{4}\cdot 1 + \tfrac{1}{4}\cdot 0 = 0.5
Building a Tree
Choosing a Split: Worked Example

(3) Information gain of the split options:

  \text{Entropy of parent node} = -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} = 1

• Split on Children: weighted entropy of this split option = 0 (calculated earlier), so
  Information gain = 1 - 0 = 1
• Split on Income: weighted entropy of this split option = 0.5 (calculated earlier), so
  Information gain = 1 - 0.5 = 0.5

Notice how the split on the Children attribute gives the higher information gain (purer partitions) and is therefore the preferred split.
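These numbers can be reproduced with a few lines of Python; the code below is self-contained for the four-applicant example (the attribute names follow the table above, everything else is illustrative).

    import math
    from collections import Counter

    def info(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain(examples, attribute):
        parts = {}
        for feats, label in examples:
            parts.setdefault(feats[attribute], []).append(label)
        e = sum(len(p) / len(examples) * info(p) for p in parts.values())
        return info([label for _, label in examples]) - e

    applicants = [
        ({"Children": "Many", "Income": "Medium"}, "DEFAULTS"),
        ({"Children": "Many", "Income": "Low"},    "DEFAULTS"),
        ({"Children": "Few",  "Income": "Medium"}, "PAYS"),
        ({"Children": "Few",  "Income": "High"},   "PAYS"),
    ]
    print(gain(applicants, "Children"))  # 1.0
    print(gain(applicants, "Income"))    # 0.5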
Building a decision tree: an example training dataset (ID3)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Output: A Decision Tree for “buys_computer”
    age?
    ├─ <=30  → student?
    │          ├─ no  → no
    │          └─ yes → yes
    ├─ 31…40 → yes
    └─ >40   → credit_rating?
               ├─ excellent → no
               └─ fair      → yes
Attribute Selection by Information Gain
Computation (ID3)

• Class P: buys_computer = "yes"
• Class N: buys_computer = "no"
• I(p, n) = I(9, 5) = 0.940
Attribute Selection by Information Gain Computation
• Class P: buys_computer = "yes"
• Class N: buys_computer = "no"
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age (using the training data shown earlier):

  age     p_i  n_i  I(p_i, n_i)
  <=30    2    3    0.971
  31…40   4    0    0
  >40     3    2    0.971

  E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

  Here \frac{5}{14} I(2,3) means that "age <= 30" covers 5 of the 14 samples, with 2 yes's and 3 no's. Hence

  Gain(age) = I(p, n) - E(age) = 0.246

• Similarly,

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
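The same gains can be checked directly against the 14-tuple training set; the short Python script below is self-contained (the row layout and helper names are illustrative).

    import math
    from collections import Counter

    # (age, income, student, credit_rating, buys_computer)
    rows = [
        ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
        ("31…40", "high", "no", "fair", "yes"),        (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
        ("31…40", "low", "yes", "excellent", "yes"),   ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
        ("31…40", "high", "yes", "fair", "yes"),       (">40", "medium", "no", "excellent", "no"),
    ]

    def info(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain(col):
        labels = [r[-1] for r in rows]
        values = Counter(r[col] for r in rows)
        e = sum(cnt / len(rows) * info([r[-1] for r in rows if r[col] == v])
                for v, cnt in values.items())
        return info(labels) - e

    for name, col in [("age", 0), ("income", 1), ("student", 2), ("credit_rating", 3)]:
        # prints roughly: age 0.247, income 0.029, student 0.152, credit_rating 0.048
        # (the slide's 0.246 and 0.151 come from rounding intermediate values)
        print(name, round(gain(col), 3))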
Attribute Selection: Age Selected
(Figure: the training data partitioned on age, the attribute with the highest information gain; the 31…40 partition is pure, while the <=30 and >40 partitions are split further.)
Example 2: ID3 (Output1)
(Training data: the height example data shown earlier, with columns Name, Gender, Height, Output1, Output2.)
Example 2: ID3 Example (Output1)

• Starting state entropy (base-10 logarithms are used throughout this example):
  4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
• Gain using gender:
  Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
  Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
  Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
  Gain: 0.4384 - 0.34152 = 0.09688
• Gain using height:
  – Divide the continuous data into ranges:
    (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
  – Two values fall in (1.9, 2.0], with entropy (1/2)(0.301) + (1/2)(0.301) = 0.301
  – The entropies of the other ranges are zero
  – Gain: 0.4384 - (2/15)(0.301) = 0.3983
• Choose height as the first splitting attribute.
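A self-contained Python check of these figures (note the base-10 logarithms and the height ranges taken from the slide; the variable names are illustrative):

    import math
    from collections import Counter

    heights = [1.6, 2.0, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 2.2, 2.1, 1.8, 1.95, 1.9, 1.8, 1.75]
    genders = ["F", "M", "F", "F", "F", "M", "F", "M", "M", "M", "F", "M", "F", "F", "F"]
    output1 = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short", "Short",
               "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]

    def info10(labels):
        n = len(labels)
        return -sum(c / n * math.log10(c / n) for c in Counter(labels).values())

    def gain(keys, labels):
        parts = {}
        for k, lab in zip(keys, labels):
            parts.setdefault(k, []).append(lab)
        return info10(labels) - sum(len(p) / len(labels) * info10(p) for p in parts.values())

    bins = [(0, 1.6), (1.6, 1.7), (1.7, 1.8), (1.8, 1.9), (1.9, 2.0), (2.0, float("inf"))]
    height_bin = [next(b for b in bins if b[0] < h <= b[1]) for h in heights]

    print(round(info10(output1), 4))            # ~0.4385 (slide rounds to 0.4384)
    print(round(gain(genders, output1), 4))     # ~0.0969 (slide: 0.09688)
    print(round(gain(height_bin, output1), 4))  # ~0.3983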
Example 3: ID3 on the Weather Data Set
(Figures: ID3 applied step by step to the weather data set, worked through over several slides.)
Attribute Selection Measures



• Information gain (ID3)
  – All attributes are assumed to be categorical
  – Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
  – All attributes are assumed to be continuous-valued
  – Assumes there exist several possible split values for each attribute
  – May need other tools, such as clustering, to get the possible split values
  – Can be modified for categorical attributes
• Information gain ratio (C4.5)
Gini Index (IBM IntelligentMiner)

• If a data set T contains examples from n classes, the gini index gini(T) is defined as

  gini(T) = 1 - \sum_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in T.
• If the data set T is split into two subsets T1 and T2, with sizes N1 and N2 respectively, then the gini index of the split data is defined as

  gini_{split}(T) = \frac{N_1}{N}\,gini(T_1) + \frac{N_2}{N}\,gini(T_2)

• The attribute that provides the smallest gini_{split}(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
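A brief Python sketch of the two formulas above (function and variable names are illustrative):

    from collections import Counter

    def gini(labels):
        """gini(T) = 1 - sum of squared relative class frequencies."""
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(left, right):
        """Size-weighted gini index of a binary split, given the labels on each side."""
        n = len(left) + len(right)
        return len(left) / n * gini(left) + len(right) / n * gini(right)

    print(gini_split(["A", "A"], ["B", "B"]))  # 0.0  (pure partitions)
    print(gini_split(["A", "B"], ["A", "B"]))  # 0.5  (no improvement over the parent)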
C4.5 Algorithm (Gain Ratio) - Dunham, page 102
• An improvement over the ID3 decision tree algorithm.
  – Splitting: the gain value used in ID3 favors attributes with high cardinality. C4.5 takes the cardinality of the attribute into consideration and uses the gain ratio rather than the gain (see below).
  – Missing data: during tree construction, training records with missing values are simply ignored; only records that have attribute values are used to calculate the gain ratio. (Note: the formula for computing the gain ratio changes slightly to accommodate the missing data.) During classification, if the splitting attribute of the query is missing, we descend into every child node; the final class assigned to the query depends on the probabilities associated with the classes at the leaf nodes.
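For reference, the gain ratio is usually defined as the information gain normalized by the "split information" of the attribute (a sketch of the standard definition, not quoted from the slides):

  GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}, \qquad SplitInfo(A) = -\sum_{j=1}^{v} \frac{|S_j|}{|S|}\log_2\frac{|S_j|}{|S|}

where S_1, …, S_v are the subsets produced by splitting S on the v values of attribute A.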
C4.5 Algorithm (Gain Ratio)
• Continuous data: break the continuous attribute values into ranges.
• Pruning: two strategies, applied when the resulting increase in classification error stays within a limit:
  – Subtree replacement: a subtree is replaced by a leaf node (bottom-up).
  – Subtree raising: a subtree is replaced by its most used subtree.
• Rules: C4.5 allows classification directly via the decision trees or via rules generated from them. In addition, there are techniques to simplify complex or redundant rules.
CART Algorithm (Dunham, page 102)





• Acronym for Classification And Regression Trees.
• Binary splits only (creates a binary tree).
• Uses entropy.
• Uses a goodness measure to choose the split point s for a node t (see the formula below), where P_L and P_R are the probabilities that a tuple in the training set falls on the left or right side of the split.
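The goodness-of-split formula referred to above is commonly given (e.g., in Dunham) in essentially this form, reproduced here as a reference sketch rather than a quotation from the slide:

  \Phi(s \mid t) = 2\,P_L\,P_R \sum_{j=1}^{m} \bigl|\,P(C_j \mid t_L) - P(C_j \mid t_R)\,\bigr|

where t_L and t_R are the left and right children produced by split s, the sum runs over the m classes, and the split with the largest Φ(s | t) is chosen.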
Extracting Classification Rules from Trees






• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example (from the buys_computer tree; see the code sketch below):
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31…40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
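A small sketch of this path-to-rule idea in Python, reusing the nested-dict tree encoding assumed earlier (the encoding and names are illustrative):

    def extract_rules(node, conditions=()):
        """Yield one IF-THEN rule per root-to-leaf path of a tree encoded as
        ("attribute", {value: subtree_or_leaf_label, ...})."""
        if isinstance(node, tuple):                  # internal node: follow every arc
            attribute, branches = node
            for value, child in branches.items():
                yield from extract_rules(child, conditions + ((attribute, value),))
        else:                                        # leaf: emit the accumulated conjunction
            body = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
            yield f'IF {body} THEN class = "{node}"'

    buys_tree = ("age", {
        "<=30": ("student", {"no": "no", "yes": "yes"}),
        "31…40": "yes",
        ">40": ("credit_rating", {"excellent": "no", "fair": "yes"}),
    })
    for rule in extract_rules(buys_tree):
        print(rule)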
Extracting Classification Rules from Trees
for Credit Card Marketing Campaign
For each leaf in the tree, read the rule from the root to that leaf. You will arrive at a set of rules (here for the credit-card tree shown earlier):

IF Income=Low AND Debts=Low THEN Non-Responder
IF Income=Low AND Debts=High THEN Responder
IF Income=High AND Gender=Male AND Children=Many THEN Responder
IF Income=High AND Gender=Male AND Children=Few THEN Non-Responder
IF Income=High AND Gender=Female THEN Non-Responder
Avoid Overfitting in Classification


• The generated tree may overfit the training data
  – Too many branches, some of which may reflect anomalies due to noise or outliers
  – The result is poor accuracy on unseen samples
• Two approaches to avoid overfitting
  – Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    • It is difficult to choose an appropriate threshold
  – Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the "best pruned tree"
Approaches to Determine the Final Tree Size

• Separate training (2/3) and testing (1/3) sets
• Use cross-validation, e.g., 10-fold cross-validation
• Use all the data for training,
  – but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
• Use the minimum description length (MDL) principle:
  – halt growth of the tree when the encoding is minimized
Enhancements to basic decision tree induction



• Allow for continuous-valued attributes
  – Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values
  – Assign the most common value of the attribute
  – Assign a probability to each of the possible values
• Attribute construction
  – Create new attributes based on existing ones that are sparsely represented
  – This reduces fragmentation, repetition, and replication
Classification in Large Databases



• Classification: a classical problem extensively studied by statisticians and machine learning researchers
• Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
• Why decision tree induction in data mining?
  – relatively fast learning speed (compared with other classification methods)
  – convertible to simple and easy-to-understand classification rules
  – can use SQL queries for accessing databases
  – classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies
• SLIQ (EDBT'96 — Mehta et al.)
  – builds an index for each attribute; only the class list and the current attribute list reside in memory
• SPRINT (VLDB'96 — J. Shafer et al.)
  – constructs an attribute-list data structure
• PUBLIC (VLDB'98 — Rastogi & Shim)
  – integrates tree splitting and tree pruning: stops growing the tree earlier
• RainForest (VLDB'98 — Gehrke, Ramakrishnan & Ganti)
  – separates the scalability aspects from the criteria that determine the quality of the tree
  – builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction



• Integration of generalization with decision-tree induction (Kamber et al., 1997)
• Classification at primitive concept levels
  – e.g., precise temperature, humidity, outlook, etc.
  – low-level concepts, scattered classes, bushy classification trees
  – semantic interpretation problems
• Cube-based multi-level classification
  – relevance analysis at multiple levels
  – information-gain analysis with dimension + level
Presentation of Classification Results
(Figure: example presentation of classification results.)