Basic Data Mining Techniques
Chapter 3

3.1 Decision Trees

An Algorithm for Building Decision Trees

1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node, where each link represents a unique value for the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   - If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.

Table 3.1 • The Credit Card Promotion Database

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
40–50K         No                         No                      Male     45
30–40K         Yes                        No                      Female   40
40–50K         No                         No                      Male     42
30–40K         Yes                        Yes                     Male     43
50–60K         Yes                        No                      Female   38
20–30K         No                         No                      Female   55
30–40K         Yes                        Yes                     Male     35
20–30K         No                         No                      Male     27
30–40K         No                         No                      Male     43
30–40K         Yes                        No                      Female   41
40–50K         Yes                        No                      Female   43
20–30K         Yes                        No                      Male     29
50–60K         Yes                        No                      Female   39
40–50K         No                         No                      Male     55
20–30K         Yes                        Yes                     Female   19

Figure 3.1 A partial decision tree with root node = income range
(20–30K: 2 Yes, 2 No; 30–40K: 4 Yes, 1 No; 40–50K: 1 Yes, 3 No; 50–60K: 2 Yes, 0 No)

Figure 3.2 A partial decision tree with root node = credit card insurance
(No: 6 Yes, 6 No; Yes: 3 Yes, 0 No)

Figure 3.3 A partial decision tree with root node = age
(<= 43: 9 Yes, 3 No; > 43: 0 Yes, 3 No)
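To make step 2 of the algorithm concrete, here is a minimal sketch in plain Python (the data encoding and helper names are illustrative, not from the text) that tallies the life insurance promotion class counts produced by splitting Table 3.1 on each candidate root attribute, reproducing the branch counts shown in Figures 3.1–3.3.

```python
from collections import Counter, defaultdict

# The 15 instances of Table 3.1 as (income, life_ins_promo, cc_ins, sex, age).
TRAINING = [
    ("40-50K", "No",  "No",  "Male",   45), ("30-40K", "Yes", "No",  "Female", 40),
    ("40-50K", "No",  "No",  "Male",   42), ("30-40K", "Yes", "Yes", "Male",   43),
    ("50-60K", "Yes", "No",  "Female", 38), ("20-30K", "No",  "No",  "Female", 55),
    ("30-40K", "Yes", "Yes", "Male",   35), ("20-30K", "No",  "No",  "Male",   27),
    ("30-40K", "No",  "No",  "Male",   43), ("30-40K", "Yes", "No",  "Female", 41),
    ("40-50K", "Yes", "No",  "Female", 43), ("20-30K", "Yes", "No",  "Male",   29),
    ("50-60K", "Yes", "No",  "Female", 39), ("40-50K", "No",  "No",  "Male",   55),
    ("20-30K", "Yes", "Yes", "Female", 19),
]

def split_counts(attribute_of, data):
    """Class counts of the output attribute (life insurance promotion)
    within each branch created by splitting on the given attribute."""
    branches = defaultdict(Counter)
    for row in data:
        branches[attribute_of(row)][row[1]] += 1   # row[1] = life ins. promotion
    return dict(branches)

# Reproduce the branch counts of Figures 3.1-3.3.
print(split_counts(lambda r: r[0], TRAINING))   # income range (Figure 3.1)
print(split_counts(lambda r: r[2], TRAINING))   # credit card insurance (Figure 3.2)
print(split_counts(lambda r: "<=43" if r[4] <= 43 else ">43", TRAINING))  # age (Figure 3.3)
```

The attribute whose split yields the purest branches (for example, credit card insurance = Yes gives 3 Yes, 0 No) is the strongest candidate for the node.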
Decision Trees for the Credit Card Promotion Database

Figure 3.4 A three-node decision tree for the credit card database
(Age <= 43 leads to Sex: Female gives Yes (6/0); Male leads to Credit Card Insurance: No gives No (4/1), Yes gives Yes (2/0). Age > 43 gives No (3/0))

Figure 3.5 A two-node decision tree for the credit card database
(Credit Card Insurance: Yes gives Yes (3/0); No leads to Sex: Female gives Yes (6/1), Male gives No (6/1))

Table 3.2 • Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex    Age
40–50K         No                         No                      Male   42
20–30K         No                         No                      Male   27
30–40K         No                         No                      Male   43
20–30K         Yes                        No                      Male   29

Decision Tree Rules

A Rule for the Tree in Figure 3.4:

IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

A Simplified Rule Obtained by Removing Attribute Age:

IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

Other Methods for Building Decision Trees
• CART
• CHAID

Advantages of Decision Trees
• Easy to understand.
• Map nicely to a set of production rules.
• Applied to real problems.
• Make no prior assumptions about the data.
• Able to process both numerical and categorical data.

Disadvantages of Decision Trees
• Output attribute must be categorical.
• Limited to one output attribute.
• Decision tree algorithms are unstable.
• Trees created from numeric datasets can be complex.

3.2 Generating Association Rules

Confidence and Support

Rule Confidence: Given a rule of the form "If A then B", rule confidence is the conditional probability that B is true when A is known to be true.

Rule Support: The minimum percentage of instances in the database that contain all items listed in a given association rule.
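As a worked example of these two measures, the sketch below scores the simplified rule above against the 15 instances of Table 3.1. It reuses the TRAINING list from the earlier decision tree sketch; the function names are illustrative, not from the text.

```python
def confidence(data, antecedent, consequent):
    """Conditional probability that the consequent holds when the antecedent does."""
    covered = [row for row in data if antecedent(row)]
    return sum(consequent(row) for row in covered) / len(covered)

def support(data, antecedent, consequent):
    """Percentage of all instances that contain every item listed in the rule."""
    hits = sum(antecedent(row) and consequent(row) for row in data)
    return 100.0 * hits / len(data)

# IF Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No
if_part   = lambda r: r[3] == "Male" and r[2] == "No"
then_part = lambda r: r[1] == "No"

print(confidence(TRAINING, if_part, then_part))  # 5/6, about 0.83
print(support(TRAINING, if_part, then_part))     # 5/15, about 33.3%
```

Six instances satisfy the antecedent and five of those also satisfy the consequent, so the rule has confidence 5/6 and support 5/15 of the database.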
Mining Association Rules: An Example

Table 3.3 • A Subset of the Credit Card Promotion Database

Magazine Promotion   Watch Promotion   Life Insurance Promotion   Credit Card Insurance   Sex
Yes                  No                No                         No                      Male
Yes                  Yes               Yes                        No                      Female
No                   No                No                         No                      Male
Yes                  Yes               Yes                        Yes                     Male
Yes                  No                Yes                        No                      Female
No                   No                No                         No                      Female
Yes                  No                Yes                        Yes                     Male
No                   Yes               No                         No                      Male
Yes                  No                No                         No                      Male
Yes                  Yes               Yes                        No                      Female

Table 3.4 • Single-Item Sets

Single-Item Sets                   Number of Items
Magazine Promotion = Yes           7
Watch Promotion = Yes              4
Watch Promotion = No               6
Life Insurance Promotion = Yes     5
Life Insurance Promotion = No      5
Credit Card Insurance = No         8
Sex = Male                         6
Sex = Female                       4

Table 3.5 • Two-Item Sets

Two-Item Sets                                                 Number of Items
Magazine Promotion = Yes & Watch Promotion = No               4
Magazine Promotion = Yes & Life Insurance Promotion = Yes     5
Magazine Promotion = Yes & Credit Card Insurance = No         5
Magazine Promotion = Yes & Sex = Male                         4
Watch Promotion = No & Life Insurance Promotion = No          4
Watch Promotion = No & Credit Card Insurance = No             5
Watch Promotion = No & Sex = Male                             4
Life Insurance Promotion = No & Credit Card Insurance = No    5
Life Insurance Promotion = No & Sex = Male                    4
Credit Card Insurance = No & Sex = Male                       4
Credit Card Insurance = No & Sex = Female                     4

Two Possible Two-Item Set Rules

IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)

IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)

Three-Item Set Rules

IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)

IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)

General Considerations
• We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
• We are also interested in association rules that show a lower than expected confidence for a particular association.

3.3 The K-Means Algorithm

1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3–5 until the cluster centers do not change.

An Example Using K-Means

Table 3.6 • K-Means Input Values

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0

Figure 3.6 A coordinate mapping of the data in Table 3.6

Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome   Cluster Centers   Cluster Points    Squared Error
1         (2.67, 4.67)      2, 4, 6           14.50
          (2.00, 1.83)      1, 3, 5
2         (1.5, 1.5)        1, 3              15.94
          (2.75, 4.125)     2, 4, 5, 6
3         (1.8, 2.7)        1, 2, 3, 4, 5     9.60
          (5, 6)            6

Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2)

General Considerations
• Requires real-valued data.
• We must select the number of clusters present in the data.
• Works best when the clusters in the data are of approximately equal size.
• Attribute significance cannot be determined.
• Lacks explanation capabilities.
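Below is a minimal sketch of the five-step procedure above in plain Python, run on the six points of Table 3.6. Because step 2 picks initial centers at random, different runs converge to different outcomes, which is how Table 3.7 obtains three different squared-error scores for the same data.

```python
import random

# The six instances of Table 3.6 as (x, y) pairs.
POINTS = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def k_means(points, k, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)                 # step 2: random initial centers
    while True:
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center as the mean of its cluster's points.
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        # Step 5: stop once the centers no longer change.
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

centers, clusters = k_means(POINTS, k=2)
error = sum(dist2(p, centers[i]) for i, c in enumerate(clusters) for p in c)
print(centers, clusters)
print(round(error, 2))   # compare with the squared-error column of Table 3.7
```

Rerunning with different seeds illustrates the instability noted in Table 3.7: the outcome with the smallest squared error (outcome 3) is the preferred clustering.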
3.4 Genetic Learning

Genetic Learning Operators
• Crossover
• Mutation
• Selection

Genetic Algorithms and Supervised Learning

Figure 3.8 Supervised genetic learning
(A fitness function, driven by the training data, evaluates each population element; kept elements become candidates for crossover and mutation, while the rest are thrown out)

Table 3.8 • An Initial Population for Supervised Genetic Learning

Population Element   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
1                    20–30K         No                         Yes                     Male     30–39
2                    30–40K         Yes                        No                      Female   50–59
3                    ?              No                         No                      Male     40–49
4                    30–40K         Yes                        Yes                     Male     40–49

Table 3.9 • Training Data for Genetic Learning

Training Instance   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
1                   30–40K         Yes                        Yes                     Male     30–39
2                   30–40K         Yes                        No                      Female   40–49
3                   50–60K         Yes                        No                      Female   30–39
4                   20–30K         No                         No                      Female   50–59
5                   20–30K         No                         No                      Male     20–29
6                   30–40K         No                         No                      Male     40–49

Figure 3.9 A crossover operation
(Elements #1 and #2 of Table 3.8 exchange their credit card insurance, sex, and age values: #1 becomes 20–30K, No, No, Female, 50–59 and #2 becomes 30–40K, Yes, Yes, Male, 30–39; a code sketch of this operator appears at the end of these notes)

Table 3.10 • A Second-Generation Population

Population Element   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
1                    20–30K         No                         No                      Female   50–59
2                    30–40K         Yes                        Yes                     Male     30–39
3                    ?              No                         No                      Male     40–49
4                    30–40K         Yes                        Yes                     Male     40–49

Genetic Algorithms and Unsupervised Clustering

Figure 3.10 Unsupervised genetic clustering
(P instances I1, I2, ..., Ip, each with attributes a1, a2, a3, ..., an, are distributed among K solutions S1, S2, ..., SK, where each solution Sk holds solution elements Ek1 and Ek2)

Table 3.11 • A First-Generation Population for Unsupervised Clustering

Solution                                 S1                    S2                    S3
Solution elements (initial population)   (1.0,1.0) (5.0,5.0)   (3.0,2.0) (3.0,5.0)   (4.0,3.0) (5.0,1.0)
Fitness score                            11.31                 9.78                  15.55
Solution elements (second generation)    (5.0,1.0) (5.0,5.0)   (3.0,2.0) (3.0,5.0)   (4.0,3.0) (1.0,1.0)
Fitness score                            17.96                 9.78                  11.34
Solution elements (third generation)     (5.0,5.0) (1.0,5.0)   (3.0,2.0) (3.0,5.0)   (4.0,3.0) (1.0,1.0)
Fitness score                            13.64                 9.78                  11.34

General Considerations
• Global optimization is not a guarantee.
• The fitness function determines the complexity of the algorithm.
• Genetic algorithms can explain their results provided the fitness function is understandable.
• Transforming the data to a form suitable for genetic learning can be a challenge.

3.5 Choosing a Data Mining Technique

Initial Considerations
• Is learning supervised or unsupervised?
• Is explanation required?
• What is the interaction between input and output attributes?
• What are the data types of the input and output attributes?

Further Considerations
• Do we know the distribution of the data?
• Do we know which attributes best define the data?
• Does the data contain missing values?
• Is time an issue?
• Which technique is most likely to give a best test set accuracy?
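As promised at Figure 3.9, here is a minimal sketch of a single-point crossover operator in plain Python. The names are illustrative and the position of the crossover point is inferred from the figure; the text does not give an implementation.

```python
import random

def crossover(parent_a, parent_b, point=None):
    """Single-point crossover: swap the attribute segments after the cut point."""
    if point is None:
        point = random.randint(1, len(parent_a) - 1)   # random cut between attributes
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

# Elements #1 and #2 of Table 3.8: [income, life ins., credit card ins., sex, age].
e1 = ["20-30K", "No",  "Yes", "Male",   "30-39"]
e2 = ["30-40K", "Yes", "No",  "Female", "50-59"]

# A cut after the second attribute reproduces the exchange in Figure 3.9,
# yielding elements 1 and 2 of the second-generation population (Table 3.10).
print(crossover(e1, e2, point=2))
```

In supervised genetic learning (Figure 3.8), offspring produced this way replace low-fitness population elements, and the keep/throw cycle repeats until the population meets a stopping criterion.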