Introduction to Data Mining

Example
• Training set for computer purchase
  – 16 records
  – 5 attributes
• Goal
  – Predict whether an individual will purchase a computer

Data Preprocessing
Anything strange?

Case  Age    Income   Student  Credit     Gender  Buy?
1     31-40  High     No       Fair       Male    Yes
2     >40    Medium   No       Fair       Female  Yes
3     >40    Low      Yes      Fair       Female  Yes
4     31-40  Low      Yes      Excellent  Female  Yes
5     ≤30    Low      Yes      Fair       Female  Yes
6     >40    Medium   Yes      Fair       Male    Yes
7     ≤30    Medium   Yes      Excellent  Male    Yes
8     31-40  Medium   No       Excellent  Male    Yes
9     31-40  High     Yes      Fair       Male    Yes
10    ≤30    High     No       Fair       Male    No
11    ≤30    High     No       Excellent  Female  No
12    >40    Low      Yes      Excellent  Female  No
13    ≤30    Medium   No       Fair       Male    No
14    >40    Medium   No       Excellent  Male    No
15    ≤30    Unknown  No       Fair       Male    Yes
16    >40    Medium   No       N/A        Female  No
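One way to approach the "Anything strange?" question is to tabulate the distinct values of each attribute and look for labels that do not belong. The following is a minimal Python sketch, assuming the 16 records are transcribed by hand into tuples; the names COLUMNS, RECORDS, and value_counts are illustrative and do not come from the slides.

```python
from collections import Counter

# Column names and the 16 training records, transcribed from the slide.
COLUMNS = ["Age", "Income", "Student", "Credit", "Gender", "Buy?"]
RECORDS = [
    ("31-40", "High", "No", "Fair", "Male", "Yes"),
    (">40", "Medium", "No", "Fair", "Female", "Yes"),
    (">40", "Low", "Yes", "Fair", "Female", "Yes"),
    ("31-40", "Low", "Yes", "Excellent", "Female", "Yes"),
    ("≤30", "Low", "Yes", "Fair", "Female", "Yes"),
    (">40", "Medium", "Yes", "Fair", "Male", "Yes"),
    ("≤30", "Medium", "Yes", "Excellent", "Male", "Yes"),
    ("31-40", "Medium", "No", "Excellent", "Male", "Yes"),
    ("31-40", "High", "Yes", "Fair", "Male", "Yes"),
    ("≤30", "High", "No", "Fair", "Male", "No"),
    ("≤30", "High", "No", "Excellent", "Female", "No"),
    (">40", "Low", "Yes", "Excellent", "Female", "No"),
    ("≤30", "Medium", "No", "Fair", "Male", "No"),
    (">40", "Medium", "No", "Excellent", "Male", "No"),
    ("≤30", "Unknown", "No", "Fair", "Male", "Yes"),
    (">40", "Medium", "No", "N/A", "Female", "No"),
]


def value_counts(records, columns):
    """Return {column name: Counter of the values seen in that column}."""
    return {
        col: Counter(row[i] for row in records)
        for i, col in enumerate(columns)
    }


counts = value_counts(RECORDS, COLUMNS)
for col, counter in counts.items():
    # Rare or unexpected labels in a column are candidates for noisy data.
    print(col, dict(counter))
```

Scanning the printed counts for values that appear only once, or that are not valid categories for the attribute, is a quick way to spot records worth a closer look.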
Data Preprocessing
• Drop the noisy cases: case 15 (Income = Unknown) and case 16 (Credit = N/A)
• Can Excel handle these labels? No, so the data needs transformation:
  – Age: ≤30 = 3, 31-40 = 2, >40 = 1
  – Income: High = 3, Medium = 2, Low = 1
  – Student: Yes = 1, No = 2
  – Credit: Excellent = 2, Fair = 1
  – Gender: Male = 1, Female = 2
  – Buy?: Yes = 1, No = 0

Data Selection
• Data -> Data Analysis -> Correlation (input range B1:G15)
• Which variables are strongly related to purchase likelihood? Check the correlation matrix.

Case  Age  Income  Student  Credit  Gender  Buy?
1     2    3       2        1       1       1
2     1    2       2        1       2       1
3     1    1       1        1       2       1
4     2    1       1        2       2       1
5     3    1       1        1       2       1
6     1    2       1        1       1       1
7     3    2       1        2       1       1
8     2    2       2        2       1       1
9     2    3       1        1       1       1
10    3    3       2        1       1       0
11    3    3       2        2       2       0
12    1    1       1        2       2       0
13    3    2       2        1       1       0
14    1    2       2        2       1       0

• Selected attributes? All except gender.

Data Mining
• What are we trying to accomplish? Create a model to predict whether the customer buys or not.
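The preprocessing, transformation, and selection steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the slides use Excel's Data Analysis -> Correlation tool, and a hand-written Pearson coefficient stands in for it here. The names RAW, CODES, encode, pearson, and CORR_WITH_BUY are hypothetical; the codes themselves are the ones given on the slide.

```python
from math import sqrt

COLUMNS = ["Age", "Income", "Student", "Credit", "Gender", "Buy?"]
# The 16 raw records from the training set slide.
RAW = [
    ("31-40", "High", "No", "Fair", "Male", "Yes"),
    (">40", "Medium", "No", "Fair", "Female", "Yes"),
    (">40", "Low", "Yes", "Fair", "Female", "Yes"),
    ("31-40", "Low", "Yes", "Excellent", "Female", "Yes"),
    ("≤30", "Low", "Yes", "Fair", "Female", "Yes"),
    (">40", "Medium", "Yes", "Fair", "Male", "Yes"),
    ("≤30", "Medium", "Yes", "Excellent", "Male", "Yes"),
    ("31-40", "Medium", "No", "Excellent", "Male", "Yes"),
    ("31-40", "High", "Yes", "Fair", "Male", "Yes"),
    ("≤30", "High", "No", "Fair", "Male", "No"),
    ("≤30", "High", "No", "Excellent", "Female", "No"),
    (">40", "Low", "Yes", "Excellent", "Female", "No"),
    ("≤30", "Medium", "No", "Fair", "Male", "No"),
    (">40", "Medium", "No", "Excellent", "Male", "No"),
    ("≤30", "Unknown", "No", "Fair", "Male", "Yes"),
    (">40", "Medium", "No", "N/A", "Female", "No"),
]
# Label-to-number codes as listed on the transformation slide.
CODES = {
    "Age": {"≤30": 3, "31-40": 2, ">40": 1},
    "Income": {"High": 3, "Medium": 2, "Low": 1},
    "Student": {"Yes": 1, "No": 2},
    "Credit": {"Excellent": 2, "Fair": 1},
    "Gender": {"Male": 1, "Female": 2},
    "Buy?": {"Yes": 1, "No": 0},
}


def encode(rows):
    """Drop rows holding a label with no code (Unknown, N/A); encode the rest."""
    encoded = []
    for row in rows:
        pairs = list(zip(COLUMNS, row))
        if all(value in CODES[col] for col, value in pairs):
            encoded.append([CODES[col][value] for col, value in pairs])
    return encoded


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


DATA = encode(RAW)  # 14 encoded cases remain after dropping the noisy ones
buy = [row[-1] for row in DATA]
CORR_WITH_BUY = {
    col: pearson([row[i] for row in DATA], buy)
    for i, col in enumerate(COLUMNS[:-1])
}
for col, r in CORR_WITH_BUY.items():
    print(f"{col:8s} {r:+.3f}")
```

On these 14 cases, Gender has the correlation with Buy? that is closest to zero, which is consistent with selecting all attributes except gender.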
Data Mining
Suppose we build a model that predicts:

Case  Age  Income  Student  Credit  Gender  Buy?  Prediction from model
1     2    3       2        1       1       1     1
2     1    2       2        1       2       1     1
3     1    1       1        1       2       1     1
4     2    1       1        2       2       1     1
5     3    1       1        1       2       1     1
6     1    2       1        1       1       1     1
7     3    2       1        2       1       1     1
8     2    2       2        2       1       1     0
9     2    3       1        1       1       1     1
10    3    3       2        1       1       0     1
11    3    3       2        2       2       0     1
12    1    1       1        2       2       0     0
13    3    2       2        1       1       0     1
14    1    2       2        2       1       0     1

Confusion Matrix

            Model Buy  Model Not  Totals
Actual Buy  8          1          9
Actual Not  4          1          5
Totals      12         2          14

• How many of the customers that bought were predicted correctly? Incorrectly?
• How many of the customers that did not buy were predicted correctly? Incorrectly?
• Note: we tested the model against the data that was used to create it.

Data Interpretation
• In real life we need to:
  1. build model (i.e., classification rules) with one data set – Training Set
  2. test model with another (independent) data set – Validation Set

Test Data Set
You just used your model to classify ten more people… Here is what the customers actually did…

Case  Model  Actual Buy?
17    Yes    Yes
18    Yes    Yes
19    Yes    Yes
20    Yes    Yes
21    Yes    Yes
22    Yes    Yes
23    No     No
24    Yes    No
25    No     No
26    No     No

Confusion matrix?

            Model Buy  Model Not
Actual Buy  ?          ?
Actual Not  ?          ?

Measures
• Correct classification rate = # correct / total # classified = 9 / 10 = 0.90
• Suppose you incurred costs (marketing $$) each time:
  – model predicted buy, but customer didn't: $200
  – model predicted no buy, but customer bought: $20
• Cost of error? = 1 x $200 + 0 x $20 = $200

Goals
• Avoid being too broad, i.e., don't say…
  – "gain insight"
  – "discover meaningful patterns"
  – "learn interesting things",…
• Instead be specific
• We want to…
  – identify customers likely to renew
  – rank order by propensity to…
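The confusion matrix, correct classification rate, and cost of error from the Data Mining and Measures slides can be checked with a few lines of Python. This is a minimal sketch, assuming Buy is coded 1 and Not is coded 0; the function names and the default cost parameters are illustrative, with the $200 and $20 figures taken from the slide.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs; missing pairs read as 0."""
    return Counter(zip(actual, predicted))


def correct_rate(cm):
    """Share of cases where the model's label equals the actual label."""
    correct = sum(n for (a, p), n in cm.items() if a == p)
    return correct / sum(cm.values())


def error_cost(cm, cost_false_buy=200, cost_missed_buy=20):
    """Cost of errors: predicted buy but did not, plus predicted no buy but bought."""
    return cm[(0, 1)] * cost_false_buy + cm[(1, 0)] * cost_missed_buy


# Training cases 1-14: actual Buy? column and the model's predictions.
train_actual = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
train_model = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
train_cm = confusion_matrix(train_actual, train_model)

# Test cases 17-26 from the Test Data Set slide (Yes = 1, No = 0).
test_actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
test_model = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0]
test_cm = confusion_matrix(test_actual, test_model)

print("train (actual, model) counts:", dict(train_cm))
print("test  (actual, model) counts:", dict(test_cm))
print("test correct classification rate:", correct_rate(test_cm))
print("test cost of error: $", error_cost(test_cm))
```

Keying the counts by (actual, predicted) keeps the two kinds of error separate, which matters here because the slide prices them differently.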