Data Mining: Basic Data Mining Techniques
2008.4.10 Database Lab 김성원

Contents
3.1 Decision Tree
3.2 Generating Association Rules
3.3 The K-Means Algorithm
3.4 Genetic Learning
3.5 Choosing a Data Mining Technique
3.6 Chapter Summary
3.7 Key Terms
3.8 Exercises

3.1 Decision Tree

Designing a decision tree algorithm:
1. Create the set of training instances used to build the tree.
2. Choose the attribute that best differentiates the instances contained in the tree.
3. Create a tree node for the chosen attribute, with one branch for each of its values, and use the branches to divide the instances into subclasses.
4. For each subclass created in step 3: if the instances in the subclass satisfy the predefined criteria, or if the set of remaining attribute choices for this path is null, label this decision path with the classification to be given to new instances that follow it; if the subclass does not satisfy the criteria, choose another attribute to further subdivide the current subclass instances and return to step 2.

Table 3.1 • The Credit Card Promotion Database

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
40–50K         No                         No                      Male     45
30–40K         Yes                        No                      Female   40
40–50K         No                         No                      Male     42
30–40K         Yes                        Yes                     Male     43
50–60K         Yes                        No                      Female   38
20–30K         No                         No                      Female   55
30–40K         Yes                        Yes                     Male     35
20–30K         No                         No                      Male     27
30–40K         No                         No                      Male     43
30–40K         Yes                        No                      Female   41
40–50K         Yes                        No                      Female   43
20–30K         Yes                        No                      Male     29
50–60K         Yes                        No                      Female   39
40–50K         No                         No                      Male     55
20–30K         Yes                        Yes                     Female   19

Partial decision trees built from Table 3.1 (each branch shows the life insurance promotion Yes/No counts):

Figure 3.1 A partial decision tree with root node = income range
  20–30K: 2 Yes / 2 No   30–40K: 4 Yes / 1 No   40–50K: 1 Yes / 3 No   50–60K: 2 Yes / 0 No

Figure 3.2 A partial decision tree with root node = credit card insurance
  Yes: 3 Yes / 0 No   No: 6 Yes / 6 No

Figure 3.3 A partial decision tree with root node = age
  <= 43: 9 Yes / 3 No   > 43: 0 Yes / 3 No

Exercise: Computational Questions

The three-node decision tree (Figure 3.4) built from Table 3.1 has root node = age. Instances with age > 43 form the leaf No (3/0); instances with age <= 43 are split by sex. Females form the leaf Yes (6/0); males are split by credit card insurance, giving the leaves No (4/1) and Yes (2/0). Leaf notation is (instances reaching the leaf / instances misclassified.) The classification error at each node:
  Node = Age: C.E. = 1 - max(12/15, 3/15) = 0.2
  Node = Sex: C.E. = 1 - max(6/12, 6/12) = 0.5
  Node = Credit Card Insurance: C.E. = 1 - max(2/6, 4/6) = 0.33

Rules read from this tree:

IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
Accuracy = 75%

Table 3.2 • Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex    Age
40–50K         No                         No                      Male   42
20–30K         No                         No                      Male   27
30–40K         No                         No                      Male   43
20–30K         Yes                        No                      Male   29

IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
Accuracy = 83.3%

Advantages of Decision Trees
- Easy to understand, and classification is straightforward.
- Applicable to real-world problems.
- No assumptions about the data (such as linearity or homoscedasticity) are required.
- Both numerical and categorical data can be handled.

Disadvantages of Decision Trees
- The output attribute must be categorical.
- Decision tree algorithms become unstable as the tree grows deeper: predictive power degrades and the tree is harder to interpret.
- Training can be computationally expensive.
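As a complement to Figures 3.1–3.3 and the classification-error exercise above, the following is a minimal Python sketch. The data encoding, the function names, the fixed age split at 43, and the majority-class error measure are my own illustrative assumptions, not the textbook's exact goodness score. The sketch reproduces the Yes/No branch counts of the three partial trees and scores each candidate root attribute by the error made when every branch predicts its majority class; the age split comes out best at 0.2, in line with the value shown for the age node above.

```python
from collections import Counter, defaultdict

# Table 3.1 encoded as (income range, life insurance promotion,
# credit card insurance, sex, age); this encoding is mine, not the book's.
data = [
    ("40-50K", "No",  "No",  "Male",   45), ("30-40K", "Yes", "No",  "Female", 40),
    ("40-50K", "No",  "No",  "Male",   42), ("30-40K", "Yes", "Yes", "Male",   43),
    ("50-60K", "Yes", "No",  "Female", 38), ("20-30K", "No",  "No",  "Female", 55),
    ("30-40K", "Yes", "Yes", "Male",   35), ("20-30K", "No",  "No",  "Male",   27),
    ("30-40K", "No",  "No",  "Male",   43), ("30-40K", "Yes", "No",  "Female", 41),
    ("40-50K", "Yes", "No",  "Female", 43), ("20-30K", "Yes", "No",  "Male",   29),
    ("50-60K", "Yes", "No",  "Female", 39), ("40-50K", "No",  "No",  "Male",   55),
    ("20-30K", "Yes", "Yes", "Female", 19),
]

def branch_counts(split):
    """Life insurance promotion Yes/No counts for each branch of a candidate split."""
    counts = defaultdict(Counter)
    for row in data:
        counts[split(row)][row[1]] += 1   # row[1] = life insurance promotion
    return counts

def classification_error(counts):
    """Error made when every branch predicts its majority class."""
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return 1 - correct / len(data)

candidate_roots = {
    "income range":          lambda r: r[0],
    "credit card insurance": lambda r: r[2],
    "age <= 43":             lambda r: r[4] <= 43,
}
for name, split in candidate_roots.items():
    counts = branch_counts(split)
    print(f"{name}: error = {classification_error(counts):.2f}",
          {branch: dict(c) for branch, c in counts.items()})
```

Running the sketch prints the branch counts behind Figures 3.1–3.3 together with errors of roughly 0.27 (income range), 0.40 (credit card insurance), and 0.20 (age <= 43), which is why age is the most attractive root attribute under this simple measure.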
3.2 Generating Association Rules

Confidence and Support

For the rule Milk -> Bread:
  Support(milk, bread) = Pattern(milk, bread) / total number of transactions = 5000 / 10000 = 50%
  Confidence(milk, bread) = Pattern(milk, bread) / Pattern(milk) = 5000 / 8000 = 62.5%

Mining Association Rules: An Example

Apriori algorithm:
1. Generate the item sets.
2. Use the generated item sets to create association rules.

Table 3.3 • A Subset of the Credit Card Promotion Database

Magazine Promotion   Watch Promotion   Life Insurance Promotion   Credit Card Insurance   Sex
Yes                  No                No                         No                      Male
Yes                  Yes               Yes                        No                      Female
No                   No                No                         No                      Male
Yes                  Yes               Yes                        Yes                     Male
Yes                  No                Yes                        No                      Female
No                   No                No                         No                      Female
Yes                  No                Yes                        Yes                     Male
No                   Yes               No                         No                      Male
Yes                  No                No                         No                      Male
Yes                  Yes               Yes                        No                      Female

Table 3.4 • Single-Item Sets

Single-Item Sets                 Number of Items
Magazine Promotion = Yes         7
Watch Promotion = Yes            4
Watch Promotion = No             6
Life Insurance Promotion = Yes   5
Life Insurance Promotion = No    5
Credit Card Insurance = No       8
Sex = Male                       6
Sex = Female                     4

Table 3.5 • Two-Item Sets

Two-Item Sets                                                Number of Items
Magazine Promotion = Yes & Watch Promotion = No              4
Magazine Promotion = Yes & Life Insurance Promotion = Yes    5
Magazine Promotion = Yes & Credit Card Insurance = No        5
Magazine Promotion = Yes & Sex = Male                        4
Watch Promotion = No & Life Insurance Promotion = No         4
Watch Promotion = No & Credit Card Insurance = No            5
Watch Promotion = No & Sex = Male                            4
Life Insurance Promotion = No & Credit Card Insurance = No   5
Life Insurance Promotion = No & Sex = Male                   4
Credit Card Insurance = No & Sex = Male                      4
Credit Card Insurance = No & Sex = Female                    4

* Three-item set: Watch Promotion = No & Life Insurance Promotion = No & Credit Card Insurance = No

Two-item set rules:
IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes (5/7)
IF Life Insurance Promotion = Yes THEN Magazine Promotion = Yes (5/5)

Three-item set rules:
IF Watch Promotion = No & Life Insurance Promotion = No THEN Credit Card Insurance = No (4/4)
IF Watch Promotion = No THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)
IF Credit Card Insurance = No THEN Watch Promotion = No & Life Insurance Promotion = No (4/8)

General Considerations
- Association rules can reveal interesting results: when a customer buys a given product, one or more associated products can be sold along with it.
- It is also interesting when a particular association shows a confidence value lower than expected.
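To make the item-set generation and rule-confidence steps above concrete, here is a small Python sketch of the two Apriori steps applied to Table 3.3. It is a brute-force illustration under my own assumptions (attribute = value strings used as items, a minimum item set coverage of 4, and helper names such as frequent_itemsets and confidence), not a full Apriori implementation with candidate pruning.

```python
from collections import Counter
from itertools import combinations

# Table 3.3, one tuple per instance: (magazine promotion, watch promotion,
# life insurance promotion, credit card insurance, sex).
columns = ["Magazine Promotion", "Watch Promotion",
           "Life Insurance Promotion", "Credit Card Insurance", "Sex"]
rows = [
    ("Yes", "No",  "No",  "No",  "Male"),   ("Yes", "Yes", "Yes", "No",  "Female"),
    ("No",  "No",  "No",  "No",  "Male"),   ("Yes", "Yes", "Yes", "Yes", "Male"),
    ("Yes", "No",  "Yes", "No",  "Female"), ("No",  "No",  "No",  "No",  "Female"),
    ("Yes", "No",  "Yes", "Yes", "Male"),   ("No",  "Yes", "No",  "No",  "Male"),
    ("Yes", "No",  "No",  "No",  "Male"),   ("Yes", "Yes", "Yes", "No",  "Female"),
]
# Treat every attribute = value pair as an "item".
transactions = [frozenset(f"{c} = {v}" for c, v in zip(columns, row)) for row in rows]

MIN_COVERAGE = 4   # assumed minimum item set coverage

def frequent_itemsets(size):
    """Count all item sets of a given size and keep those covering
    at least MIN_COVERAGE instances (Apriori step 1, done by brute force)."""
    counts = Counter()
    for t in transactions:
        for combo in combinations(sorted(t), size):
            counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= MIN_COVERAGE}

def confidence(antecedent, consequent):
    """Confidence of the rule 'IF antecedent THEN consequent' (Apriori step 2)."""
    covered = [t for t in transactions if antecedent <= t]
    return sum(consequent <= t for t in covered) / len(covered)

print(len(frequent_itemsets(1)), "single-item sets,",
      len(frequent_itemsets(2)), "two-item sets")
print(confidence({"Magazine Promotion = Yes"}, {"Life Insurance Promotion = Yes"}))  # 5/7
print(confidence({"Life Insurance Promotion = Yes"}, {"Magazine Promotion = Yes"}))  # 5/5
```

With these settings the sketch finds the 8 single-item sets of Table 3.4 and the 11 two-item sets of Table 3.5, and the two printed confidences correspond to the rule accuracies 5/7 and 5/5 listed for the two-item set rules above.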