Decision Trees and Association Rules
Prof. Sin-Min Lee, Department of Computer Science

Data Mining: A KDD Process
• Data mining is the core of the knowledge discovery process.
• The process runs from databases through data cleaning and data integration into a data warehouse, then selection of task-relevant data, data mining, and finally pattern evaluation.

Search in State Spaces: Decision Trees
• A decision tree is a special case of a state-space graph.
• It is a rooted tree in which each internal node corresponds to a decision, with a subtree at that node for each possible outcome of the decision.
• Decision trees can be used to model problems in which a series of decisions leads to a solution.
• The possible solutions of the problem correspond to the paths from the root to the leaves of the decision tree.

Decision Trees: The n-Queens Problem
• How can we place n queens on an n×n chessboard so that no two queens can capture each other? A queen can move any number of squares horizontally, vertically, or diagonally, so the squares it attacks are exactly those in its row, its column, and its diagonals.
[Board diagram: the squares attacked by queen Q are marked with an x.]
• Let us consider the 4-queens problem.
• Question: How many possible configurations of a 4×4 chessboard containing 4 queens are there?
• Answer: There are 16!/(12!·4!) = (13·14·15·16)/(2·3·4) = 1820 possible configurations.
• Shall we simply try them out one by one until we encounter a solution?
• No. It is generally useful to think about a search problem more carefully and discover constraints on the problem's solutions; such constraints can dramatically reduce the size of the relevant state space.
• Obviously, in any solution of the n-queens problem there must be exactly one queen in each column of the board; otherwise, two queens in the same column could capture each other.
• Therefore, we can describe the solution of this problem as a sequence of n decisions:
  Decision 1: Place a queen in the first column.
  Decision 2: Place a queen in the second column.
  ...
  Decision n: Place a queen in the n-th column.

Backtracking in Decision Trees
• Starting from the empty board, place the 1st, 2nd, 3rd, and 4th queens column by column.
• Whenever a partial placement cannot be extended to the next column, backtrack to the previous decision and try its next alternative.
[Diagram: the 4-queens decision tree, showing partial boards as queens are placed and backtracking at dead ends.]

Neural Networks vs. Decision Trees
• Neural network: many inputs and a single output; trained on signal and background samples; well understood and mostly accepted in HEP.
• Decision tree: many inputs and a single output; trained on signal and background samples; used mostly in the life sciences and in business.

Decision Tree: Basic Algorithm
• Initialize the top node to all examples.
• While impure leaves are available:
  – select the next impure leaf L;
  – find the splitting attribute A with maximal information gain;
  – for each value of A, add a child to L.

Decision Tree: Finding a Good Split
• The count matrix (class counts per attribute value) is a sufficient statistic for computing information gain.
• Weather data, 14 examples, class play = {yes, no}:

  outlook   temperature  humidity  windy  play
  sunny     hot          high      FALSE  no
  sunny     hot          high      TRUE   no
  overcast  hot          high      FALSE  yes
  rainy     mild         high      FALSE  yes
  rainy     cool         normal    FALSE  yes
  rainy     cool         normal    TRUE   no
  overcast  cool         normal    TRUE   yes
  sunny     mild         high      FALSE  no
  sunny     cool         normal    FALSE  yes
  rainy     mild         normal    FALSE  yes
  sunny     mild         normal    TRUE   yes
  overcast  mild         high      TRUE   yes
  overcast  hot          normal    FALSE  yes
  rainy     mild         high      TRUE   no

• Count matrices (play / don't play) and information gains:
  – outlook: sunny (2, 3), overcast (4, 0), rainy (3, 2) — gain 0.25 bits
  – temperature: hot (2, 2), mild (4, 2), cool (3, 1) — gain 0.03 bits
  – humidity: high (3, 4), normal (6, 1) — gain 0.15 bits
  – windy: FALSE (6, 2), TRUE (3, 3) — gain 0.05 bits
• Outlook has the maximal gain and is chosen as the first splitting attribute.

Decision Trees: Scaling
• Simple depth-first construction.
• Needs the entire data set to fit in memory.
• Unsuitable for large data sets: the algorithm needs to be "scaled up".

Decision Trees as a Planning Tool
• Enable a business to quantify decision making.
• Useful when the outcomes are uncertain.
• Place a numerical value on likely or potential outcomes.
• Allow different possible decisions to be compared.

Decision Trees: Limitations
– How accurate is the data used in the construction of the tree?
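The information gains quoted for the weather data can be recomputed from the count matrices alone. The sketch below (variable and function names are my own) derives the entropy before and after each candidate split:

```python
from math import log2

# Class counts (play, don't play) per attribute value, taken from the
# count matrices for the 14-example weather data above (9 yes, 5 no).
counts = {
    "outlook":     {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)},
    "temperature": {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)},
    "humidity":    {"high": (3, 4), "normal": (6, 1)},
    "windy":       {"FALSE": (6, 2), "TRUE": (3, 3)},
}

def entropy(yes, no):
    """Entropy in bits of a (yes, no) class distribution."""
    total = yes + no
    h = 0.0
    for c in (yes, no):
        if c:
            p = c / total
            h -= p * log2(p)
    return h

def info_gain(split):
    """Information gain of splitting the examples on one attribute."""
    n = sum(y + d for y, d in split.values())
    before = entropy(sum(y for y, _ in split.values()),
                     sum(d for _, d in split.values()))
    after = sum((y + d) / n * entropy(y, d) for y, d in split.values())
    return before - after

for attr, split in counts.items():
    print(f"{attr}: {info_gain(split):.2f} bits")
```

Outlook's roughly 0.25 bits is the maximal gain, which is why the basic algorithm selects outlook as the first splitting attribute.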
– How reliable are the estimates of the probabilities?
– The data may be historical: does it relate to real time?
– Qualitative factors still have to be factored in: human resources, motivation, reactions, relations with suppliers and other stakeholders.

Decision Trees from a Database
  Ex | Size  | Colour | Shape  | Concept satisfied
   1 | med   | blue   | brick  | yes
   2 | small | red    | wedge  | no
   3 | small | red    | sphere | yes
   4 | large | red    | wedge  | no
   5 | large | green  | pillar | yes
   6 | large | red    | pillar | no
   7 | large | green  | sphere | yes
• Chosen target: Concept satisfied. Use all attributes except Ex.

Rules from the Tree
IF (SIZE = large AND (SHAPE = wedge OR (SHAPE = pillar AND COLOUR = red)))
   OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR SHAPE = sphere))
   OR (SIZE = small AND SHAPE = sphere)
   OR (SIZE = med)
THEN YES

Association Rules
• Used to find all rules in basket data (also called transaction data).
• Analyse how items purchased by customers in a shop are related.
• Discover all rules that have support greater than a user-specified minsup and confidence greater than a user-specified minconf.
• Example of transaction data:
  – CD player, music CD, music book
  – CD player, music CD
  – music CD, music book
  – CD player

Association Rule: Definitions
• Let I = {i1, i2, …, im} be the total set of items and D a set of transactions, where each transaction d is a set of items, d ⊆ I.
• An association rule X => Y requires X ⊆ I, Y ⊆ I and X ∩ Y = ∅.
• support(X => Y) = (# of transactions containing X ∪ Y) / |D|
• confidence(X => Y) = (# of transactions containing X ∪ Y) / (# of transactions containing X)

Association Rule: Example
• With the transaction data above: I = {CD player, music CD, music book}, |D| = 4.
• # of transactions containing both CD player and music CD = 2.
• # of transactions containing CD player = 3.
• CD player => music CD
  (support = 2/4, confidence = 2/3).

Mining Association Rules
• How are association rules mined from large databases? A two-step process:
  – find all frequent itemsets;
  – generate strong association rules from the frequent itemsets.

Association Rules: Antecedent and Consequent
• A rule has the form "if antecedent then consequent":
  – beer => diapers (Walmart)
  – bad economy => higher unemployment
  – higher unemployment => higher cost of unemployment benefits
• Rules are associated with a population, a support, and a confidence.
• Population: instances such as grocery-store purchases.
• Support: the % of the population satisfying both antecedent and consequent.
• Confidence: the % of cases in which the consequent is true when the antecedent is true.

2. Association Rules: Support
• Every association rule has a support and a confidence.
• "The support is the percentage of transactions that demonstrate the rule."
• Example: database with transactions (customer_# : item_a1, item_a2, …)
  1: 1, 3, 5.
  2: 1, 8, 14, 17, 12.
  3: 4, 6, 8, 12, 9, 104.
  4: 2, 1, 8.
• support {8, 12} = 2 (or 50%: 2 of 4 customers)
• support {1, 5} = 1 (or 25%: 1 of 4 customers)
• support {1} = 3 (or 75%: 3 of 4 customers)
• An itemset is called frequent if its support is equal to or greater than an agreed minimal value, the support threshold. In the example above, with a threshold of 50%, the itemsets {8, 12} and {1} are frequent.

2. Association Rules: Confidence
• An association rule has the form X => Y: if someone buys X, he also buys Y.
• The confidence is the conditional probability that, given X present in a transaction, Y will also be present.
• By definition: confidence(X => Y) = support(X, Y) / support(X).
• We should only consider rules derived from itemsets with high support that also have high confidence; "a rule with low confidence is not meaningful."
• Rules don't explain anything; they just point out hard facts in data volumes.
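The support counts and the confidence formula above reduce to simple set-containment counting. A minimal sketch over the four-customer database (function names are my own):

```python
# Four-customer transaction database from the support example above.
transactions = [
    {1, 3, 5},
    {1, 8, 14, 17, 12},
    {4, 6, 8, 12, 9, 104},
    {2, 1, 8},
]

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def confidence(x, y, db):
    """confidence(X => Y) = supp(X union Y) / supp(X)."""
    return support_count(x | y, db) / support_count(x, db)

print(support_count({8, 12}, transactions))  # 2  (50%: 2 of 4 customers)
print(support_count({1, 5}, transactions))   # 1  (25%)
print(support_count({1}, transactions))      # 3  (75%)
print(confidence({8}, {12}, transactions))   # supp{8,12}/supp{8} = 2/3
```

With a 50% support threshold, only itemsets contained in at least 2 of the 4 transactions, such as {8, 12} and {1}, count as frequent.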
3. Example
• Database with transactions (customer_# : item_a1, item_a2, …)
  1: 3, 5, 8.
  2: 2, 6, 8.
  3: 1, 4, 7, 10.
  4: 3, 8, 10.
  5: 2, 5, 8.
  6: 1, 5, 6.
  7: 4, 5, 6, 8.
  8: 2, 3, 4.
  9: 1, 5, 7, 8.
  10: 3, 8, 9, 10.
• conf({5} => {8})? supp({5}) = 5, supp({8}) = 7, supp({5, 8}) = 4, so conf({5} => {8}) = 4/5 = 0.8, or 80%.
• conf({8} => {5})? With the same supports, conf({8} => {5}) = 4/7 ≈ 0.57, or 57%.
• Rule {5} => {8} is therefore more meaningful than rule {8} => {5}.
• conf({9} => {3})? supp({9}) = 1 and supp({3, 9}) = 1, so conf({9} => {3}) = 1/1 = 1.0, or 100%. OK?
• Notice: high confidence, but low support. Rule {9} => {3} is not meaningful.

Association Rules: Support vs. Confidence
• Population: the six purchases {MS, MSA, MSB, MA, MB, BA}, where M = milk, S = soda, A = apple, B = beer.
• support(M => S) = 3/6: three of the six purchases (MS, MSA, MSB) contain both M and S.
• confidence(M => S) = 3/5: of the five purchases containing M (MS, MSA, MSB, MA, MB), three also contain S.
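The three confidence calculations in the ten-customer example can be verified mechanically with the same containment-counting idea (a sketch; names are my own):

```python
# Ten-customer transaction database from the example above.
db = [
    {3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
    {1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10},
]

def supp(itemset):
    """Number of transactions containing all items of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def conf(x, y):
    """conf(X => Y) = supp(X union Y) / supp(X)."""
    return supp(x | y) / supp(x)

print(conf({5}, {8}))  # 4/5 = 0.8  -> high support and high confidence
print(conf({8}, {5}))  # 4/7 ~ 0.57 -> weaker in this direction
print(conf({9}, {3}))  # 1/1 = 1.0  -> 100% confidence, but supp({3, 9}) = 1
```

The last rule illustrates why confidence alone is not enough: a 100% confidence rule supported by a single transaction out of ten is discarded as not meaningful.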