Example Application of Hunt's Algorithm
Academic Year 2014/15
Data Mining
DATASET (Fig. 4.6 of the textbook):
TID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1    Yes         Single          125K           No
2    No          Married         100K           No
3    No          Single          70K            No
4    Yes         Married         120K           No
5    No          Divorced        95K            Yes
6    No          Married         60K            No
7    Yes         Divorced        220K           No
8    No          Single          85K            Yes
9    No          Married         75K            No
10   No          Single          90K            Yes
Impurity measure: Gini
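As a minimal sketch of the impurity measure used throughout, the function below computes the Gini index of a node as 1 minus the sum of squared class frequencies. The function name `gini` and the convention of returning 0 for an empty node are assumptions made for illustration; the labels are copied from the dataset above.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a collection of class labels: 1 - sum of p_c squared."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumed convention for an empty node
    counts = Counter(labels)
    return 1.0 - sum((k / n) ** 2 for k in counts.values())


# Defaulted Borrower labels for TIDs 1..10, copied from the dataset
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Impurity of the root node E (3 Yes, 7 No) before any split
root_gini = gini(labels)
print(round(root_gini, 3))
```

A pure node (all records in one class) has impurity 0, which is why a node containing only class No needs no further split.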
Choosing the first split (E → E1, E2): attribute Home Owner
E = {1,2,…,10}
H.O. = Yes: E1 = {1,4,7}
H.O. = No:  E2 = {2,3,5,6,8,9,10}
Gini(E1) = 1-(0+1) = 0
Gini(E2) = 1-(9/49+16/49) = 24/49
TID  Home Owner  Defaulted Borrower
1    Yes         No
2    No          No
3    No          No
4    Yes         No
5    No          Yes
6    No          No
7    Yes         No
8    No          Yes
9    No          No
10   No          Yes
Gini(E1,E2): (3/10)*0 + (7/10)*(24/49) = 12/35 ≅ 0.343
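The weighted Gini of a split can be checked with exact fractions. The sketch below is illustrative only: the helper names `gini_counts` and `weighted_gini` are assumptions, and each subset is represented by its (Yes, No) class counts read off the table above.

```python
from fractions import Fraction


def gini_counts(counts):
    """Gini impurity of a node given its per-class record counts."""
    n = sum(counts)
    return 1 - sum(Fraction(c, n) ** 2 for c in counts)


def weighted_gini(partition):
    """Size-weighted average of the Gini impurities of the subsets."""
    total = sum(sum(counts) for counts in partition)
    return sum(Fraction(sum(counts), total) * gini_counts(counts)
               for counts in partition)


# (Yes, No) counts: E1 = {1,4,7} -> (0, 3); E2 = {2,3,5,6,8,9,10} -> (3, 4)
home_owner = weighted_gini([(0, 3), (3, 4)])
print(home_owner, float(home_owner))
```

Working with `Fraction` reproduces the value 12/35 exactly rather than a rounded decimal.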
Choosing the first split (E → E1, E2): attribute Marital Status (1)
E = {1,2,…,10}
M.S. ∈ {S}:   E1 = {1,3,8,10}
M.S. ∈ {M,D}: E2 = {2,4,5,6,7,9}
Gini(E1) = 1-(1/4+1/4) = 1/2
Gini(E2) = 1-(25/36+1/36) = 5/18
TID  Marital Status  Defaulted Borrower
1    Single          No
2    Married         No
3    Single          No
4    Married         No
5    Divorced        Yes
6    Married         No
7    Divorced        No
8    Single          Yes
9    Married         No
10   Single          Yes
Gini(E1,E2): (4/10)*(1/2) + (6/10)*(5/18) ≅ 0.367
Choosing the first split (E → E1, E2): attribute Marital Status (2)
E = {1,2,…,10}
M.S. ∈ {M}:   E1 = {2,4,6,9}
M.S. ∈ {S,D}: E2 = {1,3,5,7,8,10}
Gini(E1) = 1-(0+1) = 0
Gini(E2) = 1-(1/4+1/4) = 1/2
Gini(E1,E2): (4/10)*0 + (6/10)*(1/2) = 3/10 = 0.3
Choosing the first split (E → E1, E2): attribute Marital Status (3)
E = {1,2,…,10}
M.S. ∈ {D}:   E1 = {5,7}
M.S. ∈ {S,M}: E2 = {1,2,3,4,6,8,9,10}
Gini(E1) = 1-(1/4+1/4) = 1/2
Gini(E2) = 1-(4/64+36/64) = 3/8
Gini(E1,E2): (2/10)*(1/2) + (8/10)*(3/8) = 2/5 = 0.4
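The three binary groupings of Marital Status can be compared programmatically. This is a sketch under stated assumptions: the names `records`, `gini` and `split_gini` are hypothetical, and since the attribute has only three values, isolating each single value against the other two covers every binary grouping.

```python
records = [  # (TID, Marital Status, Defaulted Borrower), copied from the dataset
    (1, "Single", "No"), (2, "Married", "No"), (3, "Single", "No"),
    (4, "Married", "No"), (5, "Divorced", "Yes"), (6, "Married", "No"),
    (7, "Divorced", "No"), (8, "Single", "Yes"), (9, "Married", "No"),
    (10, "Single", "Yes"),
]


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_gini(left_values):
    """Weighted Gini of the split: value in left_values vs. all other values."""
    left = [y for _, v, y in records if v in left_values]
    right = [y for _, v, y in records if v not in left_values]
    n = len(records)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


values = sorted({v for _, v, _ in records})
# Score each grouping {v} vs. the remaining two values
scores = {v: split_gini({v}) for v in values}
best = min(scores, key=scores.get)
for v in values:
    print(v, round(scores[v], 3))
```

For an attribute with more values, the number of binary groupings grows exponentially, so this exhaustive enumeration is only practical for small domains.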
Choosing the first split (E → E1, E2): attribute Annual Income
Records sorted by Annual Income
Annual Income:  60   70   75   85   90   95   100  120  125  220
Class:          No   No   No   Yes  Yes  Yes  No   No   No   No

Threshold  Yes ≤  Yes >  No ≤  No >  Gini
55         0      3      0     7     0.42
65         0      3      1     6     0.4
72         0      3      2     5     0.375
80         0      3      3     4     0.343
87         1      2      3     4     0.417
92         2      1      3     4     0.4
97         3      0      3     4     0.3
110        3      0      4     3     0.343
122        3      0      5     2     0.375
172        3      0      6     1     0.4
230        3      0      7     0     0.42
Observation: once the records are sorted by the values of the
attribute, all candidate splits on that attribute can be
evaluated with a single scan of the records
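The single-scan idea can be sketched as follows: after sorting, moving the threshold past one record only changes two running counters, so each candidate is scored in constant time. The function names are assumptions for illustration. This sketch uses exact midpoints between consecutive values (for example 97.5 where the slides show 97), which induce the same partitions, and it skips the two trivial thresholds below the minimum and above the maximum.

```python
def gini_from_counts(yes, no):
    """Gini impurity of a node holding `yes` and `no` records."""
    n = yes + no
    if n == 0:
        return 0.0  # assumed convention for an empty side
    return 1.0 - (yes / n) ** 2 - (no / n) ** 2


def scan_thresholds(values, labels):
    """Return (threshold, weighted Gini) pairs from one pass over sorted records."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_yes = sum(1 for _, y in pairs if y == "Yes")
    total_no = n - total_yes
    left_yes = left_no = 0
    results = []
    for i in range(n - 1):
        value, label = pairs[i]
        # Moving the threshold past record i only updates one counter
        if label == "Yes":
            left_yes += 1
        else:
            left_no += 1
        next_value = pairs[i + 1][0]
        if next_value == value:
            continue  # no threshold fits between equal values
        left_n = left_yes + left_no
        right_n = n - left_n
        score = (left_n / n) * gini_from_counts(left_yes, left_no) \
            + (right_n / n) * gini_from_counts(total_yes - left_yes,
                                               total_no - left_no)
        results.append(((value + next_value) / 2, score))
    return results


# Annual Income (in K) and class for TIDs 1..10, copied from the dataset
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
candidates = scan_thresholds(incomes, labels)
best_threshold, best_gini = min(candidates, key=lambda t: t[1])
print(best_threshold, round(best_gini, 3))
```

Sorting costs O(n log n) and the scan itself is linear, whereas recomputing the class counts from scratch for every threshold would be quadratic.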
Best splits
Annual Income ≤ 97 and Marital Status ∈ {M} (weighted Gini 0.3 for both)
We choose: Annual Income ≤ 97
E = {1,2,…,10}
A.I. ≤ 97: E1 = {3,5,6,8,9,10}
A.I. > 97: E2 = {1,2,4,7}
Gini(E1,E2) = 0.3
TID  Annual Income  Defaulted Borrower
1    125K           No
2    100K           No
3    70K            No
4    120K           No
5    95K            Yes
6    60K            No
7    220K           No
8    85K            Yes
9    75K            No
10   90K            Yes
• The set E2 needs no further partitioning, since it contains
only records of class No
• For the set E1, the best split turns out to be:
Annual Income ≤ 80
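Both claims above can be checked with a short sketch: a purity test decides whether a subset needs further partitioning, and a threshold search restricted to E1 finds its best split. The names `is_pure` and `best_threshold` are hypothetical, and the search uses exact midpoints between consecutive income values.

```python
# Annual Income (in K) and class per TID, copied from the dataset
income = {1: 125, 2: 100, 3: 70, 4: 120, 5: 95,
          6: 60, 7: 220, 8: 85, 9: 75, 10: 90}
label = {1: "No", 2: "No", 3: "No", 4: "No", 5: "Yes",
         6: "No", 7: "No", 8: "Yes", 9: "No", 10: "Yes"}

E1 = [3, 5, 6, 8, 9, 10]  # records with Annual Income <= 97
E2 = [1, 2, 4, 7]         # records with Annual Income > 97


def is_pure(tids):
    """True when every record in the subset has the same class."""
    return len({label[t] for t in tids}) == 1


def gini(tids):
    """Gini impurity of a non-empty subset of records."""
    classes = [label[t] for t in tids]
    n = len(classes)
    return 1.0 - sum((classes.count(c) / n) ** 2 for c in set(classes))


def best_threshold(tids):
    """Best (threshold, weighted Gini) on Annual Income within the subset."""
    ordered = sorted(tids, key=income.get)
    best = None
    for a, b in zip(ordered, ordered[1:]):
        thr = (income[a] + income[b]) / 2
        left = [t for t in tids if income[t] <= thr]
        right = [t for t in tids if income[t] > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(tids)
        if best is None or score < best[1]:
            best = (thr, score)
    return best


print(is_pure(E2), is_pure(E1), best_threshold(E1))
```

A weighted Gini of 0 for the chosen threshold means both resulting subsets are pure, so the recursion stops there.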
FINAL TREE
A.I. ≤ 97
    A.I. ≤ 80: NO  {3,6,9}
    A.I. > 80: YES {5,8,10}
A.I. > 97: NO  {1,2,4,7}
Note: an equivalent tree could have been obtained with a single
ternary split (A.I. ≤ 80, 80 < A.I. ≤ 97, A.I. > 97)
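The equivalence between the two nested binary tests and one three-way test can be illustrated with a small sketch. The function names are assumptions, income is expressed in thousands, and the check over integer incomes from 0 to 300 is only a spot check, not a proof.

```python
def predict_binary(income_k):
    """Two nested binary tests, as in the final tree (income in thousands)."""
    if income_k <= 97:
        return "No" if income_k <= 80 else "Yes"
    return "No"


def predict_ternary(income_k):
    """One three-way test on the same attribute with the same two cut points."""
    if income_k <= 80:
        return "No"
    if income_k <= 97:
        return "Yes"
    return "No"


# (Annual Income in K, Defaulted Borrower) for TIDs 1..10, from the dataset
data = [(125, "No"), (100, "No"), (70, "No"), (120, "No"), (95, "Yes"),
        (60, "No"), (220, "No"), (85, "Yes"), (75, "No"), (90, "Yes")]

# Both formulations agree on every integer income in the checked range
agree = all(predict_binary(x) == predict_ternary(x) for x in range(0, 301))
# The tree classifies all ten training records correctly
training_errors = sum(predict_binary(x) != y for x, y in data)
print(agree, training_errors)
```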
Choosing Marital Status ∈ {M} instead, one can arrive at the
following final tree (exercise)
M.S. ∈ {M}: NO  {2,4,6,9}
M.S. ∈ {S,D}
    H.O. = Yes: NO  {1,7}
    H.O. = No
        A.I. ≤ 77: NO  {3}
        A.I. > 77: YES {5,8,10}
Note: this tree is more complex than the previous one (and therefore
more prone to overfitting), which highlights the limits of the greedy approach
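To make the comparison concrete, the sketch below encodes both trees as plain functions and checks them against the ten training records. The function names `tree_income` and `tree_marital` are hypothetical. Both trees fit the training data perfectly, but the second needs three tests on three different attributes while the first needs two tests on a single attribute.

```python
records = [  # (TID, Home Owner, Marital Status, Annual Income in K, class)
    (1, "Yes", "Single", 125, "No"), (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"), (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"), (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"), (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"), (10, "No", "Single", 90, "Yes"),
]


def tree_income(record):
    """Tree obtained by first splitting on Annual Income <= 97."""
    _, _, _, income, _ = record
    if income > 97:
        return "No"
    return "No" if income <= 80 else "Yes"


def tree_marital(record):
    """Tree obtained by first splitting on Marital Status in {Married}."""
    _, home_owner, marital, income, _ = record
    if marital == "Married":
        return "No"
    if home_owner == "Yes":
        return "No"
    return "No" if income <= 77 else "Yes"


errors_income = sum(tree_income(r) != r[4] for r in records)
errors_marital = sum(tree_marital(r) != r[4] for r in records)
print(errors_income, errors_marital)
```

Since both first splits had the same weighted Gini of 0.3, the greedy criterion alone could not tell which choice would lead to the simpler tree.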