Chapter 2
Data Mining: A Closer Look

Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223
[email protected]

A/W & Dr. Chen, Data Mining

2.1 Data Mining Strategies

Figure 2.1 - A hierarchy of data mining strategies

Data Mining Strategies
• Unsupervised Clustering: no output attributes
• Supervised Learning
– Classification: categorical/discrete output (current behavior)
– Estimation: numeric output
– Prediction: future outcome (categorical/numeric)
• Market Basket Analysis

Classification
• Learning is supervised.
• The dependent variable is categorical.
• Well-defined classes.
• Current rather than future behavior.

Estimation
• Learning is supervised.
• The dependent variable is numeric.
• Well-defined classes.
• Current rather than future behavior.

Prediction
• The emphasis is on predicting future rather than current outcomes.
• The output attribute may be categorical or numeric.

The Cardiology Patient Dataset

Table 2.1 • Cardiology Patient Data

| Attribute Name | Mixed Values | Numeric Values | Comments |
|---|---|---|---|
| Age | Numeric | Numeric | Age in years |
| Sex | Male, Female | 1, 0 | Patient gender |
| Chest Pain Type | Angina, Abnormal Angina, NoTang, Asymptomatic | 1–4 | NoTang = Nonanginal pain |
| Blood Pressure | Numeric | Numeric | Resting blood pressure upon hospital admission |
| Cholesterol | Numeric | Numeric | Serum cholesterol |
| Fasting Blood Sugar < 120 | True, False | 1, 0 | Is fasting blood sugar less than 120? |
| Resting ECG | Normal, Abnormal, Hyp | 0, 1, 2 | Hyp = Left ventricular hypertrophy |
| Maximum Heart Rate | Numeric | Numeric | Maximum heart rate achieved |
| Induced Angina? | True, False | 1, 0 | Does the patient experience angina as a result of exercise? |
| Old Peak | Numeric | Numeric | ST depression induced by exercise relative to rest |
| Slope | Up, flat, down | 1–3 | Slope of the peak exercise ST segment |
| Number Colored Vessels | 0, 1, 2, 3 | 0, 1, 2, 3 | Number of major vessels colored by fluoroscopy |
| Thal | Normal, fix, rev | 3, 6, 7 | Normal, fixed defect, reversible defect |
| Concept Class | Healthy, Sick | 1, 0 | Angiographic disease status |

Table 2.2 • Most and Least Typical Instances from the Cardiology Domain

| Attribute Name | Most Typical Healthy Class | Least Typical Healthy Class | Most Typical Sick Class | Least Typical Sick Class |
|---|---|---|---|---|
| Age | 52 | 63 | 60 | 62 |
| Sex | Male | Male | Male | Female |
| Chest Pain Type | NoTang | Angina | Asymptomatic | Asymptomatic |
| Blood Pressure | 138 | 145 | 125 | 160 |
| Cholesterol | 223 | 233 | 258 | 164 |
| Fasting Blood Sugar < 120 | FALSE | TRUE | FALSE | FALSE |
| Resting ECG | Normal | Hyp | Hyp | Hyp |
| Maximum Heart Rate | 169 | 150 | 141 | 145 |
| Induced Angina? | FALSE | FALSE | TRUE | FALSE |
| Old Peak | 0 | 2.3 | 2.8 | 6.2 |
| Slope | Up | Down | Flat | Down |
| Number of Colored Vessels | 0 | 0 | 1 | 3 |
| Thal | Normal | Fix | Rev | Rev |

A Healthy Class Rule for the Cardiology Patient Dataset
IF 169 <= Maximum Heart Rate <= 202
THEN Concept Class = Healthy
Rule accuracy: 85.07%
Rule coverage: 34.55%

A Sick Class Rule for the Cardiology Patient Dataset
IF Thal = Rev & Chest Pain Type = Asymptomatic
THEN Concept Class = Sick
Rule accuracy: 91.14%
Rule coverage: 52.17%

Unsupervised Clustering
• Determine if concepts can be found in the data.
• Evaluate the likely performance of a supervised model.
• Determine a best set of input attributes for supervised learning.
• Detect outliers.

Market Basket Analysis
• Find interesting relationships among retail products.
• Uses association rule algorithms.
• The results of a market basket analysis help retailers
– design promotions,
– arrange shelf or catalog items, and
– develop cross-marketing strategies.

2.2 Supervised Data Mining Techniques

The Credit Card Promotion Database

Table 2.3 • The Credit Card Promotion Database

| Income Range ($) | Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex | Age |
|---|---|---|---|---|---|---|
| 40–50K | Yes | No | No | No | Male | 45 |
| 30–40K | Yes | Yes | Yes | No | Female | 40 |
| 40–50K | No | No | No | No | Male | 42 |
| 30–40K | Yes | Yes | Yes | Yes | Male | 43 |
| 50–60K | Yes | No | Yes | No | Female | 38 |
| 20–30K | No | No | No | No | Female | 55 |
| 30–40K | Yes | No | Yes | Yes | Male | 35 |
| 20–30K | No | Yes | No | No | Male | 27 |
| 30–40K | Yes | No | No | No | Male | 43 |
| 30–40K | Yes | Yes | Yes | No | Female | 41 |
| 40–50K | No | Yes | Yes | No | Female | 43 |
| 20–30K | No | Yes | Yes | No | Male | 29 |
| 50–60K | Yes | Yes | Yes | No | Female | 39 |
| 40–50K | No | Yes | No | No | Male | 55 |
| 20–30K | No | No | Yes | Yes | Female | 19 |

Data file: CreditCardPromotion.xls
The file holds the same 15 instances as Table 2.3. It adds two header rows below the attribute names: a type row (C for every attribute except Age, which is R) and a usage row (I for every attribute).

A Hypothesis for the Credit Card Promotion Database
A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.

A Production Rule for the Credit Card Promotion Database
IF Sex = Female & 19 <= Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 66.67%

Question: Can we assume that two-thirds of all females in the specified age range will take advantage of the promotion?

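The accuracy and coverage figures of the production rule above can be checked against Table 2.3. The following is a minimal Python sketch, assuming the usual definitions: rule accuracy is the fraction of instances satisfying the rule's conditions that belong to the concluded class, and rule coverage is the fraction of instances in the concluded class that satisfy the rule's conditions. The tuple layout and the names `rule_stats` and `female_19_to_43` are illustrative choices, not part of the original slides.

```python
# (sex, age, life_insurance_promotion) for the 15 instances of Table 2.3.
INSTANCES = [
    ("Male", 45, "No"), ("Female", 40, "Yes"), ("Male", 42, "No"),
    ("Male", 43, "Yes"), ("Female", 38, "Yes"), ("Female", 55, "No"),
    ("Male", 35, "Yes"), ("Male", 27, "No"), ("Male", 43, "No"),
    ("Female", 41, "Yes"), ("Female", 43, "Yes"), ("Male", 29, "Yes"),
    ("Female", 39, "Yes"), ("Male", 55, "No"), ("Female", 19, "Yes"),
]


def rule_stats(instances, antecedent, target="Yes"):
    """Return (accuracy, coverage) of 'IF antecedent THEN class = target'.

    Hypothetical helper: assumes at least one instance satisfies the
    antecedent and at least one instance belongs to the target class.
    """
    fired = [row for row in instances if antecedent(row)]
    in_class = [row for row in instances if row[2] == target]
    correct = [row for row in fired if row[2] == target]
    accuracy = len(correct) / len(fired)      # measured over instances the rule fires on
    coverage = len(correct) / len(in_class)   # measured over instances in the class
    return accuracy, coverage


def female_19_to_43(row):
    sex, age, _ = row
    return sex == "Female" and 19 <= age <= 43


accuracy, coverage = rule_stats(INSTANCES, female_19_to_43)
print(f"Rule accuracy: {accuracy:.2%}")
print(f"Rule coverage: {coverage:.2%}")
```

In this sketch, accuracy is computed over the six instances that satisfy the rule's conditions, while coverage is computed over the nine instances that accepted the life insurance promotion.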
• Rule accuracy is a between-class measure.
• Rule coverage is a within-class measure.

Neural Networks
• A neural network is a set of interconnected nodes designed to imitate the functioning of the human brain.
• Two phases of operation:
– Learning phase: training instances are presented at the input layer repeatedly until the network reaches a predetermined minimum error rate.
– Second phase: with the weights fixed, the network computes output values for new instances.
• A major shortcoming of the neural network approach is a lack of explanation about what has been learned.

[Figure: a neural network with an input layer, a hidden layer, and an output layer]

Table 2.3 • The Credit Card Promotion Database (first two instances)

| Income Range ($) | Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex | Age |
|---|---|---|---|---|---|---|
| 40–50K | Yes | No | No | No | Male | 45 |
| 30–40K | Yes | Yes | Yes | No | Female | 40 |

(Note: on the slide, input attributes are shown in blue and output attributes in red.) The network therefore has four input nodes and one output node; five hidden-layer nodes were chosen.

Table 2.4 - Neural Network Training: Actual and Computed Output

| Instance Number | Life Insurance Promotion | Computed Output |
|---|---|---|
| 1 | 0 | 0.024 |
| 2 | 1 | 0.998 |
| 3 | 0 | 0.023 |
| 4 | 1 | 0.986 |
| 5 | 1 | 0.999 |
| 6 | 0 | 0.05 |
| 7 | 1 | 0.999 |
| 8 | 0 | 0.262 |
| 9 | 0 | 0.06 |
| 10 | 1 | 0.997 |
| 11 | 1 | 0.999 |
| 12 | 1 | 0.776 |
| 13 | 1 | 0.999 |
| 14 | 0 | 0.023 |
| 15 | 1 | 0.999 |

Statistical Regression
• Statistical regression is a supervised learning technique that generalizes a set of numeric data by creating a mathematical equation relating one or more input attributes to a single numeric output attribute.
• A linear regression model is characterized by an output attribute whose value is determined by a linear sum of weighted input attribute values.

Statistical Regression
Life insurance promotion = 0.5909 (credit card insurance) – 0.5455 (sex) + 0.7727

Example: a female (sex = 0) who does not have credit card insurance (credit card insurance = 0) is a likely candidate for the life insurance promotion.

Life insurance promotion = 0.5909 (0) – 0.5455 (0) + 0.7727 = 0.7727

Because the value 0.7727 is close to 1.0, we conclude that the individual is likely to take advantage of the promotional offer.

2.3 Association Rules
• Association rule mining techniques are used to discover interesting associations between attributes contained in a database.
• Unlike traditional production rules, association rules can have one or several output attributes.
• An output attribute for one rule can be an input attribute for another rule.
• A popular technique for market basket analysis.
• Problem: we may have several rules with little value.

An Association Rule for the Credit Card Promotion Database
IF Sex = Female & Age = over40 & Credit Card Insurance = No
THEN Life Insurance Promotion = Yes

2.4 Clustering Techniques
• By applying unsupervised clustering to the instances of the Acme Credit Card Company database, we will find a subset of input attributes that differentiate card holders who have taken advantage of the life insurance promotion from those card holders who have not accepted the promotional offer.

Cluster 1
# Instances: 3
Sex: Male => 3, Female => 0
Age: 43.3
Credit Card Insurance: Yes => 0, No => 3
Life Insurance Promotion: Yes => 0, No => 3

Cluster 2
# Instances: 5
Sex: Male => 3, Female => 2
Age: 37.0
Credit Card Insurance: Yes => 1, No => 4
Life Insurance Promotion: Yes => 2, No => 3

Cluster 3
# Instances: 7
Sex: Male => 2, Female => 5
Age: 39.9
Credit Card Insurance: Yes => 2, No => 5
Life Insurance Promotion: Yes => 7, No => 0

IF Sex = Female & 43 >= Age >= 35 & Credit Card Insurance = No
THEN Class = 3
Rule Accuracy: 100%
Rule Coverage: 66.67%

2.5 Evaluating Performance
• Performance evaluation is probably the most critical of all the steps in the data mining process. Three general questions:
– 1. Will the benefits received from a data mining project more than offset the cost of the data mining process?
– 2. How do we interpret the results of a data mining session?
– 3. Can we use the results of a data mining process with confidence?

Evaluating Supervised Learner Models
• Supervised learner models are designed to classify, estimate, and/or predict future outcomes.
• Applications that depend on classification correctness:
– Develop a model to accept or reject credit card applications
– Develop a model to accept or reject home mortgage applicants
– Develop a model to decide whether or not to drill for oil

Confusion Matrix
• A matrix used to summarize the results of a supervised classification.
• Entries along the main diagonal are correct classifications.
• Entries other than those on the main diagonal are classification errors.

Classification correctness is best calculated by presenting previously unseen data, in the form of a test set, to the model being evaluated.

A confusion matrix is of little use for evaluating supervised learner models offering numeric output.

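How a confusion matrix yields a classification error rate can be sketched in a few lines of Python. The sketch assumes that rows hold the actual class and columns the computed class, and it borrows the counts of Models A and B from Table 2.7; the function name `error_rate` is an illustrative choice.

```python
def error_rate(matrix):
    """Fraction of instances that fall off the main diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


# Rows: actual Accept, actual Reject. Columns: computed Accept, computed Reject.
model_a = [[600, 25],
           [75, 300]]
model_b = [[600, 75],
           [25, 300]]

for name, m in (("Model A", model_a), ("Model B", model_b)):
    false_reject = m[0][1]  # actual Accept, computed Reject
    false_accept = m[1][0]  # actual Reject, computed Accept
    print(f"{name}: error rate {error_rate(m):.1%}, "
          f"false accepts {false_accept}, false rejects {false_reject}")
```

Both matrices give the same 10% error rate, but the off-diagonal counts differ, so the error rate alone does not show which kind of mistake each model tends to make.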
Table 2.5 • A Three-Class Confusion Matrix

| Actual \ Computed Decision | C1 | C2 | C3 |
|---|---|---|---|
| C1 | C11 | C12 | C13 |
| C2 | C21 | C22 | C23 |
| C3 | C31 | C32 | C33 |

Rule 1: Values along the main diagonal represent correct classifications; e.g., C11 represents the total number of class C1 instances correctly classified by the model.
Rule 2: For C2, the instances counted in C21, C22, and C23 are all actually members of C2, but those in C21 and C23 are incorrectly classified as members of another class.
Rule 3: For C2, the instances counted in C12 and C32 are incorrectly classified as members of class C2.

Computation questions: #1

Two-Class Error Analysis

Table 2.6 • A Simple Confusion Matrix

| | Computed Accept | Computed Reject |
|---|---|---|
| Accept | True Accept | False Reject |
| Reject | False Accept | True Reject |

True Accept and True Reject: the more the better. False Accept and False Reject: the less the better.

Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate

| Model A | Computed Accept | Computed Reject |
|---|---|---|
| Accept | 600 | 25 |
| Reject | 75 | 300 |

| Model B | Computed Accept | Computed Reject |
|---|---|---|
| Accept | 600 | 75 |
| Reject | 25 | 300 |

Which model is better?

Evaluating Numeric Output
• Mean absolute error
• Mean squared error
• Root mean squared error

Comparing Models by Measuring Lift
• Marketing applications that focus on response rates from mass mailings are less concerned with test set classification error and more interested in building models able to extract biased samples from large populations.
• Supervised learner models designed for extracting biased samples from a general population are often evaluated by a measure that comes directly from marketing, known as lift.

Computing Lift

$$\text{Lift} = \frac{P(C_i \mid \text{Sample})}{P(C_i \mid \text{Population})}$$

Ci is the class of all zero-balance customers who, given the opportunity, will take advantage of the promotional offer.

Figure 2.4 – A lift chart (targeted vs. mass mailing). Vertical axis: Number Responding (0 to 1200). Horizontal axis: % Sampled (0 to 100).

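The lift formula above can be evaluated directly from a two-class confusion matrix. The following Python sketch assumes that the sample is the set of instances the model computes as Accept, and it uses the Model X and Model Y counts from Table 2.9; the function name `lift` and the matrix layout are illustrative assumptions.

```python
def lift(matrix):
    """Lift = P(Ci | sample) / P(Ci | population) for a 2x2 confusion matrix.

    Rows: actual Accept, actual Reject. Columns: computed Accept, computed Reject.
    Assumption: the sample is every instance the model computes as Accept.
    """
    (true_accept, false_reject), (false_accept, true_reject) = matrix
    population = true_accept + false_reject + false_accept + true_reject
    sample = true_accept + false_accept
    p_sample = true_accept / sample                            # P(Ci | sample)
    p_population = (true_accept + false_reject) / population   # P(Ci | population)
    return p_sample / p_population


model_x = [[540, 460], [23_460, 75_540]]
model_y = [[450, 550], [19_550, 79_450]]

print(f"Model X lift: {lift(model_x):.2f}")
print(f"Model Y lift: {lift(model_y):.2f}")
```

Both models reach a lift of 2.25, yet they select different samples: Model X picks 24,000 instances containing 540 responders, while Model Y picks 20,000 instances containing 450 responders.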
Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model

| No Model | Computed Accept | Computed Reject |
|---|---|---|
| Accept | 1,000 | 0 |
| Reject | 99,000 | 0 |

| Ideal Model | Computed Accept | Computed Reject |
|---|---|---|
| Accept | 1,000 | 0 |
| Reject | 0 | 99,000 |

Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25

| Model X | Computed Accept | Computed Reject |
|---|---|---|
| Accept | 540 | 460 |
| Reject | 23,460 | 75,540 |

| Model Y | Computed Accept | Computed Reject |
|---|---|---|
| Accept | 450 | 550 |
| Reject | 19,550 | 79,450 |

Unsupervised Model Evaluation
• Evaluating unsupervised data mining is, in general, a more difficult task than supervised evaluation. This is true because the goals of an unsupervised data mining session are frequently not as clear as the goals for supervised learning.

Data Mining Strategies
• Unsupervised Clustering
• Supervised Learning
– Classification
– Estimation
– Prediction
• Market Basket Analysis
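
As a closing illustration, the mean absolute error, mean squared error, and root mean squared error listed in Section 2.5 can be computed for the actual and computed outputs of Table 2.4. This is a minimal Python sketch using the standard definitions of the three measures. It evaluates the 15 training instances only because those are the values the slides provide; the slides themselves recommend previously unseen test data for evaluation.

```python
import math

# Actual life insurance promotion values and neural network outputs from Table 2.4.
actual = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1]
computed = [0.024, 0.998, 0.023, 0.986, 0.999, 0.05, 0.999, 0.262,
            0.06, 0.997, 0.999, 0.776, 0.999, 0.023, 0.999]

errors = [a - c for a, c in zip(actual, computed)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e * e for e in errors) / len(errors)    # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error

print(f"MAE:  {mae:.3f}")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.3f}")
```

Squaring weights large errors more heavily, so instances 8 and 12, whose computed outputs are farthest from the actual values, contribute most of the mean squared error.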