Chapter 3: Basic Data Mining Techniques
Jason C. H. Chen, Ph.D.
Professor of MIS, School of Business Administration
Gonzaga University, Spokane, WA 99223
[email protected]

Objectives
• The chapter introduces several common data mining techniques.
• Section 3.1 focuses on supervised learning by presenting a standard algorithm for creating decision trees.
• Section 3.2 presents an efficient technique for generating association rules.
• Section 3.3 illustrates unsupervised clustering with the K-Means algorithm.
• Section 3.4 shows how genetic algorithms can perform supervised learning and unsupervised clustering. (to be skipped)

3.1 Decision Trees
• Decision trees are constructed using only those attributes best able to differentiate the concepts to be learned.
• A decision tree is built by initially selecting a subset of instances from a training set.
• This subset is then used by the algorithm to construct a decision tree, and the remaining training instances test the accuracy of the constructed tree.
  – If the decision tree classifies the instances correctly, the procedure terminates.
  – If an instance is incorrectly classified, the instance is added to the selected subset of training instances and a new tree is constructed.

An Algorithm for Building Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   – Create child links from this node, where each link represents a unique value of the chosen attribute.
   – Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   a. If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   b. If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
(A code sketch of this procedure appears below.)
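The algorithm leaves the attribute-selection measure open ("best differentiates"). The following is a minimal Python sketch of the loop above, not the book's exact method: the helper names are illustrative, a simple majority-class "goodness" score stands in for step 2, and the binary numeric splits the figures use for Age (e.g., <= 43) are omitted for brevity.

```python
from collections import Counter

def majority(labels):
    # Most common class label among the given instances.
    return Counter(labels).most_common(1)[0][0]

def goodness(instances, attr, target):
    # Fraction of instances classified correctly if each branch of a
    # split on attr predicts its own majority class (a stand-in measure).
    correct = 0
    for v in set(row[attr] for row in instances):
        branch = [row[target] for row in instances if row[attr] == v]
        correct += Counter(branch).most_common(1)[0][1]
    return correct / len(instances)

def build_tree(instances, attributes, target):
    labels = [row[target] for row in instances]
    # Step 4a: stop when the subclass is pure or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return majority(labels)
    # Step 2: pick the attribute that best differentiates the instances.
    best = max(attributes, key=lambda a: goodness(instances, a, target))
    # Step 3: one child link per unique value of the chosen attribute;
    # step 4b: recurse on each resulting subclass.
    return {(best, v): build_tree([r for r in instances if r[best] == v],
                                  [a for a in attributes if a != best],
                                  target)
            for v in set(row[best] for row in instances)}

# e.g., with Table 3.1 rows stored as dicts (hypothetical keys):
# build_tree(rows, ["IncomeRange", "Sex", "CCIns"], "LifeIns")
```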
Table 3.1 • The Credit Card Promotion Database (same as Table 2.3)

Income     Magazine   Watch      Life Insurance  Credit Card
Range ($)  Promotion  Promotion  Promotion       Insurance    Sex     Age
40–50K     Yes        No         No              No           Male     45
30–40K     Yes        Yes        Yes             No           Female   40
40–50K     No         No         No              No           Male     42
30–40K     Yes        Yes        Yes             Yes          Male     43
50–60K     Yes        No         Yes             No           Female   38
20–30K     No         No         No              No           Female   55
30–40K     Yes        No         Yes             Yes          Male     35
20–30K     No         Yes        No              No           Male     27
30–40K     Yes        No         No              No           Male     43
30–40K     Yes        Yes        Yes             No           Female   41
40–50K     No         Yes        Yes             No           Female   43
20–30K     No         Yes        Yes             No           Male     29
50–60K     Yes        Yes        Yes             No           Female   39
40–50K     No         Yes        No              No           Male     55
20–30K     No         No         Yes             Yes          Female   19

Figure 3.1 A partial decision tree with root node = income range
(life insurance promotion counts per branch: 20–30K: 2 Yes, 2 No; 30–40K: 4 Yes, 1 No; 40–50K: 1 Yes, 3 No; 50–60K: 2 Yes, 0 No)

Figure 3.2 A partial decision tree with root node = credit card insurance
(No branch: 6 Yes, 6 No; Yes branch: 3 Yes, 0 No)

Figure 3.3 A partial decision tree with root node = age
(Age <= 43: 9 Yes, 3 No; Age > 43: 0 Yes, 3 No)

Decision Trees for the Credit Card Promotion Database

Figure 3.4 A three-node decision tree for the credit card database
(root: Age. Age > 43: No (3/0). Age <= 43: split on Sex; Female: Yes (6/0); Male: split on Credit Card Insurance; Yes: Yes (2/0); No: No (4/1))

Figure 3.5 A two-node decision tree for the credit card database
(root: Credit Card Insurance. Yes: Yes (3/0). No: split on Sex; Female: Yes (6/1); Male: No (6/1))

Table 3.2 • Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No

Income   Life Insurance  Credit Card
Range    Promotion       Insurance    Sex    Age
40–50K   No              No           Male   42
20–30K   No              No           Male   27
30–40K   No              No           Male   43
20–30K   Yes             No           Male   29

Decision Tree Rules

A Rule for the Tree in Figure 3.4:
IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

A Simplified Rule Obtained by Removing Attribute Age:
IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

Other Methods for Building Decision Trees
• CART
• CHAID

Advantages of Decision Trees
• Easy to understand.
• Map nicely to a set of production rules.
• Have been applied to real-world problems.
• Make no prior assumptions about the data.
• Able to process both numerical and categorical data.

Disadvantages of Decision Trees
• The output attribute must be categorical.
• Limited to one output attribute.
• Decision tree algorithms are unstable.
• Trees created from numeric datasets can be complex.

3.2 Generating Association Rules

Confidence and Support

Rule Confidence: Given a rule of the form "If A then B", rule confidence is the conditional probability that B is true when A is known to be true.

Rule Support: The minimum percentage of instances in the database that contain all items listed in a given association rule.

(Both measures are computed in the code sketch that follows Table 3.5.)

Mining Association Rules: An Example

Table 3.3 • A Subset of the Credit Card Promotion Database

Magazine   Watch      Life Insurance  Credit Card
Promotion  Promotion  Promotion       Insurance    Sex
Yes        No         No              No           Male
Yes        Yes        Yes             No           Female
No         No         No              No           Male
Yes        Yes        Yes             Yes          Male
Yes        No         Yes             No           Female
No         No         No              No           Female
Yes        No         Yes             Yes          Male
No         Yes        No              No           Male
Yes        No         No              No           Male
Yes        Yes        Yes             No           Female

Table 3.4 • Single-Item Sets

Single-Item Sets                  Number of Items
Magazine Promotion = Yes          7
Watch Promotion = Yes             4
Watch Promotion = No              6
Life Insurance Promotion = Yes    5
Life Insurance Promotion = No     5
Credit Card Insurance = No        8
Sex = Male                        6
Sex = Female                      4

Table 3.5 • Two-Item Sets

Two-Item Sets                                                Number of Items
Magazine Promotion = Yes & Watch Promotion = No              4
Magazine Promotion = Yes & Life Insurance Promotion = Yes    5
Magazine Promotion = Yes & Credit Card Insurance = No        5
Magazine Promotion = Yes & Sex = Male                        4
Watch Promotion = No & Life Insurance Promotion = No         4
Watch Promotion = No & Credit Card Insurance = No            5
Watch Promotion = No & Sex = Male                            4
Life Insurance Promotion = No & Credit Card Insurance = No   5
Life Insurance Promotion = No & Sex = Male                   4
Credit Card Insurance = No & Sex = Male                      4
Credit Card Insurance = No & Sex = Female                    4
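To make the two measures concrete, here is a minimal Python sketch that computes support and confidence directly over the Table 3.3 subset. The attribute keys and helper names are illustrative, not from the book, and no item-set generation (the Apriori step) is attempted here.

```python
# One dict per Table 3.3 row: Magazine, Watch, LifeIns, CCIns, Sex.
rows = [
    {"Magazine": "Yes", "Watch": "No",  "LifeIns": "No",  "CCIns": "No",  "Sex": "Male"},
    {"Magazine": "Yes", "Watch": "Yes", "LifeIns": "Yes", "CCIns": "No",  "Sex": "Female"},
    {"Magazine": "No",  "Watch": "No",  "LifeIns": "No",  "CCIns": "No",  "Sex": "Male"},
    {"Magazine": "Yes", "Watch": "Yes", "LifeIns": "Yes", "CCIns": "Yes", "Sex": "Male"},
    {"Magazine": "Yes", "Watch": "No",  "LifeIns": "Yes", "CCIns": "No",  "Sex": "Female"},
    {"Magazine": "No",  "Watch": "No",  "LifeIns": "No",  "CCIns": "No",  "Sex": "Female"},
    {"Magazine": "Yes", "Watch": "No",  "LifeIns": "Yes", "CCIns": "Yes", "Sex": "Male"},
    {"Magazine": "No",  "Watch": "Yes", "LifeIns": "No",  "CCIns": "No",  "Sex": "Male"},
    {"Magazine": "Yes", "Watch": "No",  "LifeIns": "No",  "CCIns": "No",  "Sex": "Male"},
    {"Magazine": "Yes", "Watch": "Yes", "LifeIns": "Yes", "CCIns": "No",  "Sex": "Female"},
]

def covers(row, items):
    # items maps attributes to required values, e.g. {"Sex": "Male"}.
    return all(row[a] == v for a, v in items.items())

def support(rows, items):
    # Percentage of instances that contain all items in the rule.
    return 100.0 * sum(covers(r, items) for r in rows) / len(rows)

def confidence(rows, antecedent, consequent):
    # Conditional probability that the consequent holds when the
    # antecedent is known to hold.
    matched = [r for r in rows if covers(r, antecedent)]
    hits = sum(covers(r, consequent) for r in matched)
    return hits / len(matched) if matched else 0.0

# IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes:
# the two-item set appears in 5 of 10 rows (support 50%, per Table 3.5),
# and 5 of the 7 magazine buyers took the promotion (confidence 5/7).
print(support(rows, {"Magazine": "Yes", "LifeIns": "Yes"}),
      confidence(rows, {"Magazine": "Yes"}, {"LifeIns": "Yes"}))
```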
General Considerations
• We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
• We are also interested in association rules that show a lower-than-expected confidence for a particular association.

3.3 The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
(A code sketch of these steps appears at the end of this section.)

An Example Using K-Means

Table 3.6 • K-Means Input Values

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0

Figure 3.6 A coordinate mapping of the data in Table 3.6

Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome   Cluster Centers   Cluster Points   Squared Error
1         (2.67, 4.67)      2, 4, 6          14.50
          (2.00, 1.83)      1, 3, 5
2         (1.5, 1.5)        1, 3             15.94
          (2.75, 4.125)     2, 4, 5, 6
3         (1.8, 2.7)        1, 2, 3, 4, 5     9.60
          (5, 6)            6

Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2)

General Considerations
• Requires real-valued data.
• We must select the number of clusters present in the data.
• Works best when the clusters in the data are of approximately equal size.
• Attribute significance cannot be determined.
• Lacks explanation capabilities.
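Below is a minimal Python sketch of the five steps above, run on the Table 3.6 points. The seeds are one illustrative choice (step 2 picks centers at random), and the sketch assumes no cluster ever becomes empty; with these seeds the loop converges to Outcome 1 of Table 3.7.

```python
import math

# The six instances from Table 3.6 as (X, Y) pairs.
points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def dist(p, q):
    # Euclidean distance between two points.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def kmeans(points, centers):
    while True:
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[i].append(p)
        # Step 4: recompute each center as the mean of its cluster
        # (assumes no cluster is empty, which holds for this data).
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            for c in clusters
        ]
        # Step 5: stop once the centers no longer change.
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

# K = 2, seeded with instances 1 and 2; converges to centers
# (2.00, 1.83) and (2.67, 4.67), i.e., Outcome 1 in Table 3.7.
centers, clusters = kmeans(points, [(1.0, 1.5), (1.0, 4.5)])
```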
3.4 Genetic Learning

Genetic Learning Operators
• Crossover
• Mutation
• Selection

Genetic Algorithms and Supervised Learning

Figure 3.8 Supervised genetic learning
(training data drive a fitness function that decides which population elements to keep and which to throw out; surviving elements become candidates for crossover and mutation)

Table 3.8 • An Initial Population for Supervised Genetic Learning

Population   Income   Life Insurance   Credit Card
Element      Range    Promotion        Insurance     Sex      Age
1            20–30K   No               Yes           Male     30–39
2            30–40K   Yes              No            Female   50–59
3            ?        No               No            Male     40–49
4            30–40K   Yes              Yes           Male     40–49

Table 3.9 • Training Data for Genetic Learning

Training   Income   Life Insurance   Credit Card
Instance   Range    Promotion        Insurance     Sex      Age
1          30–40K   Yes              Yes           Male     30–39
2          30–40K   Yes              No            Female   40–49
3          50–60K   Yes              No            Female   30–39
4          20–30K   No               No            Female   50–59
5          20–30K   No               No            Male     20–29
6          30–40K   No               No            Male     40–49

Figure 3.9 A crossover operation
(population elements #1 = [20–30K, No, Yes, Male, 30–39] and #2 = [30–40K, Yes, No, Female, 50–59] exchange their last three attribute values, producing the new #2 = [30–40K, Yes, Yes, Male, 30–39] and the new #1 = [20–30K, No, No, Female, 50–59]; a code sketch of this operator appears at the end of the chapter)

Table 3.10 • A Second-Generation Population

Population   Income   Life Insurance   Credit Card
Element      Range    Promotion        Insurance     Sex      Age
1            20–30K   No               No            Female   50–59
2            30–40K   Yes              Yes           Male     30–39
3            ?        No               No            Male     40–49
4            30–40K   Yes              Yes           Male     40–49

Genetic Algorithms and Unsupervised Clustering

Figure 3.10 Unsupervised genetic clustering
(P instances I1 … Ip, each with attributes a1 … an, are evaluated against K solutions S1 … SK, where each solution Sk holds a set of cluster-center elements Ek1, Ek2, …)

Table 3.11 • A First-Generation Population for Unsupervised Clustering

                                         S1           S2           S3
Solution elements (initial population)   (1.0, 1.0)   (3.0, 2.0)   (4.0, 3.0)
                                         (5.0, 5.0)   (3.0, 5.0)   (5.0, 1.0)
Fitness score                            11.31        9.78         15.55
Solution elements (second generation)    (5.0, 1.0)   (3.0, 2.0)   (4.0, 3.0)
                                         (5.0, 5.0)   (3.0, 5.0)   (1.0, 1.0)
Fitness score                            17.96        9.78         11.34
Solution elements (third generation)     (5.0, 5.0)   (3.0, 2.0)   (4.0, 3.0)
                                         (1.0, 5.0)   (3.0, 5.0)   (1.0, 1.0)
Fitness score                            13.64        9.78         11.34

General Considerations
• Global optimization is not guaranteed.
• The fitness function determines the computational complexity of the algorithm.
• Genetic algorithms can explain their results, provided the fitness function is understandable.
• Transforming the data to a form suitable for genetic learning can be a challenge.

3.5 Choosing a Data Mining Technique

Initial Considerations
• Is learning supervised or unsupervised?
• Is explanation required?
• What is the interaction between input and output attributes?
• What are the data types of the input and output attributes?

Further Considerations
• Do We Know the Distribution of the Data?
• Do We Know Which Attributes Best Define the Data?
• Does the Data Contain Missing Values?
• Is Time an Issue?
• Which Technique Is Most Likely to Give the Best Test Set Accuracy?
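Although Section 3.4 is marked to be skipped, the crossover operation of Figure 3.9 is simple enough to close with in code. This sketch assumes each population element is a flat list of attribute values (an illustrative representation, not the book's) and fixes the cut point after the second attribute, which reproduces the figure's exchange of the last three values; the mutation helper is likewise only illustrative.

```python
import random

def crossover(parent1, parent2, point=2):
    # Single-point crossover: children keep the genes before the cut
    # point and exchange everything after it, as in Figure 3.9.
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def mutate(element, domains, rate=0.1):
    # With probability `rate`, replace an attribute value with another
    # legal value from its domain (an illustrative mutation operator).
    return [random.choice(domains[i]) if random.random() < rate else v
            for i, v in enumerate(element)]

# Elements #1 and #2 from Table 3.8
# (income range, life ins. promotion, card insurance, sex, age group).
e1 = ["20-30K", "No",  "Yes", "Male",   "30-39"]
e2 = ["30-40K", "Yes", "No",  "Female", "50-59"]
domains = [["20-30K", "30-40K", "40-50K", "50-60K"],
           ["Yes", "No"], ["Yes", "No"],
           ["Male", "Female"],
           ["20-29", "30-39", "40-49", "50-59"]]

c1, c2 = crossover(e1, e2)
# c1 == ["20-30K", "No", "No", "Female", "50-59"]   (Figure 3.9, new #1)
# c2 == ["30-40K", "Yes", "Yes", "Male", "30-39"]   (Figure 3.9, new #2)
c1 = mutate(c1, domains)
```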