Download Data Mining: A Closer Look

Data Mining: A Closer Look Chapter 2 2.1 Data Mining Strategies Data Mining Strategies Unsupervised Clustering Supervised Learning Market Basket Analysis Classification Estimation Prediction Figure 2.1 A hierarchy of data mining strategies Classification • Learning is supervised. • The dependent variable is categorical. • Well-defined classes. • Current rather than future behavior. Estimation • • • • Learning is supervised. The dependent variable is numeric. Well-defined classes. Current rather than future behavior. Prediction • The emphasis is on predicting future rather than current outcomes. • The output attribute may be categorical or numeric. The Cardiology Patient Dataset Table 2.1 • Cardiology Patient Data Attribute Name Mixed Values Numeric Values Comments Age Numeric Numeric Age in years Sex Male, Female 1, 0 Patient gender Chest Pain Type Angina, Abnormal Angina, NoTang, Asympt omatic 1–4 NoTang = Nonanginal pain Blood Pressure Numeric Numeric Resting blood pressure upon hospital admission Cholesterol Numeric Numeric Serum cholesterol Fasting Blood Sugar < 120 True, False 1, 0 Is fasting blood sugar less than 120? Resting ECG Normal, Abnormal, Hyp 0, 1, 2 Hyp = Left ventricular hypertrophy Maximum Heart Rate Numeric Numeric Maximum heart rate achieved Induced Angina? True, False 1, 0 Does t he patient experience angina as a result of exercise? Old Peak Numeric Numeric ST depression induced by exercise relative to rest Slope Up, flat, dow n 1–3 Slope of t he peak exercise ST segment Number Colored Vessels 0, 1, 2, 3 0, 1, 2, 3 Number of major vessels colored by fluorosopy Thal Normal fix, rev 3, 6, 7 Normal, fixed defect, reversible defect Concept Class Healthy, Sick 1, 0 Angiographic disease stat us Table 2.2 • Most and Least Typical Instances from the Cardiology Domain Attribute Name Age Sex Chest Pain Type Blood Pressure Cholesterol Fasting Blood Sugar < 120 Resting ECG Maximum Heart Rate Induced Angina? Old Peak Slope Number of Colored Vessels Thal Most Typical Healthy Class Least Typical Healthy Class Most Typical Sick Class Least Typical Sick Class 52 Male NoTang 138 223 False Normal 169 False 0 Up 0 Normal 63 Male Angina 145 233 True Hyp 150 False 2.3 Down 0 Fix 60 Male Asymptomatic 125 258 False Hyp 141 True 2.8 Flat 1 Rev 62 Female Asymptomatic 160 164 False Hyp 145 False 6.2 Down 3 Rev A Healthy Class Rule for the Cardiology Patient Dataset • IF 169 <= Maximum Heart Rate <=202 • THEN Concept Class = Healthy • Rule accuracy: 85.07% • Rule coverage: 34.55% A Sick Class Rule for the Cardiology Patient Dataset • IF Thal = Rev & Chest Pain Type = Asymptomatic • THEN Concept Class = Sick • Rule accuracy: 91.14% • Rule coverage: 52.17% Unsupervised Clustering • Determine if concepts can be found in the data. • Evaluate the likely performance of a supervised model. • Determine a best set of input attributes for supervised learning. • Detect Outliers. Market Basket Analysis • Find interesting relationships among retail products. • Uses association rule algorithms. 2.2 Supervised Data Mining Techniques The Credit Card Promotion Database Table 2.3 • The Credit Card Promotion Database Income Range ($) Magazine Promotion 40–50K 30–40K 40–50K 30–40K 50–60K 20–30K 30–40K 20–30K 30–40K 30–40K 40–50K 20–30K 50–60K 40–50K 20–30K Yes Yes No Yes Yes No Yes No Yes Yes No No Yes No No Watch Life Insurance Promotion Promotion No Yes No Yes No No No Yes No Yes Yes Yes Yes Yes No No Yes No Yes Yes No Yes No No Yes Yes Yes Yes No Yes Credit Card Insurance Sex Age No No No Yes No No Yes No No No No No No No Yes Male Female Male Male Female Female Male Male Male Female Female Male Female Male Female 45 40 42 43 38 55 35 27 43 41 43 29 39 55 19 A Hypothesis for the Credit Card Promotion Database • A combination of one or more of the dataset attributes differentiate Acme Credit Card Company card holders who have taken advantage of the life insurance promotion and those card holders who have chosen not to participate in the promotional offer. A Production Rule for the Credit Card Promotion Database • IF Sex = Female & 19 <=Age <= 43 • THEN Life Insurance Promotion = Yes • Rule Accuracy: 100.00% • Rule Coverage: 66.67% Production Rules • Rule accuracy is a between-class measure. • Rule coverage is a within-class measure. Neural Networks Input Layer Hidden Layer Figure 2.2 A multilayer fully connected neural network Output Layer Table 2.4 • Neural Network Training: Actual and Computed Output Instance Number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Life Insurance Promotion 0 1 0 1 1 0 1 0 0 1 1 1 1 0 1 Computed Output 0.024 0.998 0.023 0.986 0.999 0.050 0.999 0.262 0.060 0.997 0.999 0.776 0.999 0.023 0.999 Statistical Regression • Life insurance promotion = • 0.5909 (credit card insurance) • 0.5455 (sex) + 0.7727 2.3 Association Rules An Association Rule for the Credit Card Promotion Database • IF Sex = Female & Age = over40 & • Credit Card Insurance = No • THEN Life Insurance Promotion = Yes 2.4 Clustering Techniques Cluster 1 # Instances: 3 Sex: Male => 3 Female => 0 Age: 43.3 Credit Card Insurance: Yes => 0 No => 3 Life Insurance Promotion: Yes => 0 No => 3 Cluster 2 # Instances: 5 Sex: Male => 3 Female => 2 Age: 37.0 Credit Card Insurance: Yes => 1 No => 4 Life Insurance Promotion: Yes => 2 No => 3 Cluster 3 # Instances: 7 Sex: Male => 2 Female => 5 Age: 39.9 Credit Card Insurance: Yes => 2 No => 5 Life Insurance Promotion: Yes => 7 No => 0 Figure 2.3 An unsupervised cluster of the credit card 2.5 Evaluating Performance Evaluating Supervised Learner Models Confusion Matrix (混亂矩陣） • A matrix used to summarize the results of a supervised classification. • Entries along the main diagonal are correct classifications. • Entries other than those on the main diagonal are classification errors. Table 2.5 • A Three-Class Confusion Matrix Computed Decision 計算結果 C 1 2 C 3 C C 11 C 12 C 13 C C 21 C 22 C 23 C C 31 C 32 C 33 1 事實 C 2 3 Two-Class Error Analysis （二類別錯誤分析） Table 2.6 • A Simple Confusion Matrix Computed Accept Computed Reject Accept True Accept False Reject Reject False Accept True Reject Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate Model A Accept Reject Computed Accept 600 75 Computed Reject Model B Computed Accept Computed Reject 25 300 Accept Reject 600 25 75 300 Evaluating Numeric Output • Mean absolute error • Mean squared error • Root mean squared error Comparing Models by Measuring Lift （增益值） 1200 1000 Number Responding 800 600 400 200 0 0 10 20 30 40 50 60 70 % Sampled Figure 2.4 Targeted vs. mass mailing 80 90 100 Computing Lift Lift  P (Ci | Sample) P (Ci | Population) Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model No Model Computed Accept Computed Reject Ideal Model Computed Accept Computed Reject Accept Reject 1,000 99,000 0 0 Accept Reject 1,000 0 0 99,000 Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25 Model X Computed Accept Computed Reject Model Y Computed Accept Computed Reject Accept Reject 540 23,460 460 75,540 Accept Reject 450 19,550 550 79,450 Unsupervised Model Evaluation

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Data Mining: A Closer Look