Data Mining Processes

Introduction to Data Mining
Example
• Training set for computer purchase
– 16 records
– 5 attributes
• Goal
– Predict whether an individual will purchase a computer
Data Preprocessing
Anything strange?
Case  Age    Income   Student  Credit     Gender  Buy?
1     31-40  High     No       Fair       Male    Yes
2     >40    Medium   No       Fair       Female  Yes
3     >40    Low      Yes      Fair       Female  Yes
4     31-40  Low      Yes      Excellent  Female  Yes
5     ≤30    Low      Yes      Fair       Female  Yes
6     >40    Medium   Yes      Fair       Male    Yes
7     ≤30    Medium   Yes      Excellent  Male    Yes
8     31-40  Medium   No       Excellent  Male    Yes
9     31-40  High     Yes      Fair       Male    Yes
10    ≤30    High     No       Fair       Male    No
11    ≤30    High     No       Excellent  Female  No
12    >40    Low      Yes      Excellent  Female  No
13    ≤30    Medium   No       Fair       Male    No
14    >40    Medium   No       Excellent  Male    No
15    ≤30    Unknown  No       Fair       Male    Yes
16    >40    Medium   No       N/A        Female  No
Data Preprocessing
Anything strange? Case 15 has an Unknown income and case 16 has an N/A credit rating: drop these noisy cases.
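The dropping step can be sketched in a few lines of Python. This is only an illustration: the names `RECORDS`, `MISSING` and `drop_noisy` are hypothetical, and treating the labels "Unknown" and "N/A" as the missing-value markers is an assumption based on the table.

```python
# Training records: (case, age, income, student, credit, gender, buy)
RECORDS = [
    (1, "31-40", "High", "No", "Fair", "Male", "Yes"),
    (2, ">40", "Medium", "No", "Fair", "Female", "Yes"),
    (3, ">40", "Low", "Yes", "Fair", "Female", "Yes"),
    (4, "31-40", "Low", "Yes", "Excellent", "Female", "Yes"),
    (5, "≤30", "Low", "Yes", "Fair", "Female", "Yes"),
    (6, ">40", "Medium", "Yes", "Fair", "Male", "Yes"),
    (7, "≤30", "Medium", "Yes", "Excellent", "Male", "Yes"),
    (8, "31-40", "Medium", "No", "Excellent", "Male", "Yes"),
    (9, "31-40", "High", "Yes", "Fair", "Male", "Yes"),
    (10, "≤30", "High", "No", "Fair", "Male", "No"),
    (11, "≤30", "High", "No", "Excellent", "Female", "No"),
    (12, ">40", "Low", "Yes", "Excellent", "Female", "No"),
    (13, "≤30", "Medium", "No", "Fair", "Male", "No"),
    (14, ">40", "Medium", "No", "Excellent", "Male", "No"),
    (15, "≤30", "Unknown", "No", "Fair", "Male", "Yes"),
    (16, ">40", "Medium", "No", "N/A", "Female", "No"),
]

# Assumption: these two labels are the only missing-value markers.
MISSING = {"Unknown", "N/A"}


def drop_noisy(records):
    """Keep only records whose attribute values are all known."""
    return [r for r in records if not any(v in MISSING for v in r[1:])]


clean = drop_noisy(RECORDS)
dropped = [r[0] for r in RECORDS if r not in clean]
print(len(clean), "cases kept; dropped cases:", dropped)
```

Dropping is the simplest treatment and the one the slide chooses; with only 16 records it costs two of them, so filling in the missing values would be an alternative worth weighing.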
Data Preprocessing
Can Excel handle these labels? No, so the labels need a data transformation into numeric codes:
Attribute  Coding
Age        ≤30 = 3, 31-40 = 2, >40 = 1
Income     High = 3, Medium = 2, Low = 1
Student    Yes = 1, No = 2
Credit     Excellent = 2, Fair = 1
Gender     Male = 1, Female = 2
Buy?       Yes = 1, No = 0
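As a minimal sketch of this transformation, the coding scheme can be held in a lookup table and applied record by record. The names `CODES` and `encode` are hypothetical; the numeric codes themselves are the ones given on the slide.

```python
# Label-to-number coding scheme from the slide.
CODES = {
    "Age": {"≤30": 3, "31-40": 2, ">40": 1},
    "Income": {"High": 3, "Medium": 2, "Low": 1},
    "Student": {"Yes": 1, "No": 2},
    "Credit": {"Excellent": 2, "Fair": 1},
    "Gender": {"Male": 1, "Female": 2},
    "Buy?": {"Yes": 1, "No": 0},
}


def encode(record):
    """Replace every label in a record with its numeric code."""
    return {attr: CODES[attr][label] for attr, label in record.items()}


# Case 1 from the training set.
case_1 = {
    "Age": "31-40",
    "Income": "High",
    "Student": "No",
    "Credit": "Fair",
    "Gender": "Male",
    "Buy?": "Yes",
}
encoded_1 = encode(case_1)
print(encoded_1)
```

A label that is not in the scheme (for example "Unknown") raises a `KeyError`, which is one reason to drop or repair noisy cases before encoding.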
Data Selection
Excel: Data -> Data Analysis -> Correlation (input range B1:G15)
Case  Age  Income  Student  Credit  Gender  Buy?
1     2    3       2        1       1       1
2     1    2       2        1       2       1
3     1    1       1        1       2       1
4     2    1       1        2       2       1
5     3    1       1        1       2       1
6     1    2       1        1       1       1
7     3    2       1        2       1       1
8     2    2       2        2       1       1
9     2    3       1        1       1       1
10    3    3       2        1       1       0
11    3    3       2        2       2       0
12    1    1       1        2       2       0
13    3    2       2        1       1       0
14    1    2       2        2       1       0
Which variables are strongly related to purchase likelihood? Check the correlation matrix: not gender.
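For readers without Excel, the relevant column of the correlation matrix can be approximated in Python. This is a sketch under assumptions: it computes only the Pearson correlation of each attribute with Buy? (not the full matrix), the names `COLUMNS`, `BUY` and `pearson` are hypothetical, and it treats the numeric codes as ordinary numbers, as the slide does.

```python
from math import sqrt

# Coded training data for cases 1-14, one list per attribute.
COLUMNS = {
    "Age":     [2, 1, 1, 2, 3, 1, 3, 2, 2, 3, 3, 1, 3, 1],
    "Income":  [3, 2, 1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 2],
    "Student": [2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 2],
    "Credit":  [1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2],
    "Gender":  [1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1],
}
BUY = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


correlations = {name: pearson(values, BUY) for name, values in COLUMNS.items()}
for name, r in sorted(correlations.items(), key=lambda item: abs(item[1])):
    print(f"{name:8s} {r:+.3f}")
```

On these 14 cases Gender has the smallest absolute correlation with Buy?, which is consistent with the "not gender" remark. The signs of the other coefficients depend on the direction of the coding (for example Student: Yes = 1, No = 2), so the magnitude is what matters here.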
Data Selection
Selected attributes? All except gender.
Data Mining
What are we trying to accomplish? Create a model to predict whether the customer buys or not.
Suppose we build a model that predicts:
Case  Age  Income  Student  Credit  Gender  Buy?  Prediction from model
1     2    3       2        1       1       1     1
2     1    2       2        1       2       1     1
3     1    1       1        1       2       1     1
4     2    1       1        2       2       1     1
5     3    1       1        1       2       1     1
6     1    2       1        1       1       1     1
7     3    2       1        2       1       1     1
8     2    2       2        2       1       1     0
9     2    3       1        1       1       1     1
10    3    3       2        1       1       0     1
11    3    3       2        2       2       0     1
12    1    1       1        2       2       0     0
13    3    2       2        1       1       0     1
14    1    2       2        2       1       0     1
Data Mining
Confusion Matrix (model predictions for cases 1-14)
            Model Buy  Model Not  Totals
Actual Buy  8          1          9
Actual Not  4          1          5
Totals      12         2          14
How many of the customers that bought were predicted correctly? incorrectly?
How many of the customers that did not buy were predicted correctly? incorrectly?
Note: we tested the model against the data that was used to create it.
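The confusion matrix counts can be reproduced from the Buy? column and the model's predictions for cases 1-14. This is a minimal sketch; `confusion_matrix` is a hypothetical helper written for this example, not a library function.

```python
# Actual outcomes and model predictions for cases 1-14 (1 = buy, 0 = not).
actual    = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]


def confusion_matrix(actual, predicted):
    """Count cases for every (actual, predicted) combination."""
    counts = {(a, p): 0 for a in (1, 0) for p in (1, 0)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


matrix = confusion_matrix(actual, predicted)
print("Actual Buy:", matrix[(1, 1)], "predicted buy,", matrix[(1, 0)], "predicted not")
print("Actual Not:", matrix[(0, 1)], "predicted buy,", matrix[(0, 0)], "predicted not")
```

Because the predictions were made on the same cases used to build the model, these counts probably overstate how well it would do on new customers.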
Data Interpretation
• In real life we need to:
1. build model (i.e., classification rules) with one data set – Training Set
2. test model with another (independent) data set – Validation Set
You just used your model to classify ten more people…
Test Data Set
Here is what the customers actually did…
Case  Model  Actual Buy?
17    Yes    Yes
18    Yes    Yes
19    Yes    Yes
20    Yes    Yes
21    Yes    Yes
22    Yes    Yes
23    No     No
24    Yes    No
25    No     No
26    No     No
confusion matrix?
            Model Buy  Model Not
Actual Buy  ?          ?
Actual Not  ?          ?
Measures
• Correct classification rate = # correct / total # classified = 9 / 10 = 0.90
marketing $$
• Suppose you incurred costs each time:
model predicted buy, but customer didn’t: $200
model predicted no buy, but customer bought: $20
• Cost of error? = 1 x $200 + 0 x $20 = $200
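Both measures can be checked against the test data set (cases 17-26) with a short Python sketch. The variable names are hypothetical; the $200 and $20 costs are the ones assumed on the slide.

```python
# Model predictions and actual outcomes for test cases 17-26.
model  = ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "No"]
actual = ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

COST_FALSE_BUY = 200   # model predicted buy, but customer didn't
COST_MISSED_BUY = 20   # model predicted no buy, but customer bought

pairs = list(zip(model, actual))
correct = sum(m == a for m, a in pairs)
rate = correct / len(pairs)

false_buys = sum(m == "Yes" and a == "No" for m, a in pairs)
missed_buys = sum(m == "No" and a == "Yes" for m, a in pairs)
cost = false_buys * COST_FALSE_BUY + missed_buys * COST_MISSED_BUY

print(f"correct classification rate = {correct}/{len(pairs)} = {rate:.2f}")
print(f"cost of error = {false_buys} x ${COST_FALSE_BUY} + "
      f"{missed_buys} x ${COST_MISSED_BUY} = ${cost}")
```

Weighting the two error types separately shows why the classification rate alone can mislead: a model with the same rate but different kinds of errors would have a different cost.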
Goals
• Avoid being too broad, i.e., don’t say…
• “gain insight”
• “discover meaningful patterns”
• “learn interesting things”,…
• Instead be specific
• We want to…
• identify customers likely to renew
• rank order by propensity to…