Example Application of Hunt's Algorithm
A.Y. 14/15, Data Mining

DATASET (Fig. 4.6 of the textbook)

TID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1    Yes         Single          125K           No
2    No          Married         100K           No
3    No          Single          70K            No
4    Yes         Married         120K           No
5    No          Divorced        95K            Yes
6    No          Married         60K            No
7    Yes         Divorced        220K           No
8    No          Single          85K            Yes
9    No          Married         75K            No
10   No          Single          90K            Yes

Impurity measure: Gini


Choosing the first split (E → E1, E2): attribute Home Owner

E = {1,2,…,10}
H.O. = Yes: E1 = {1,4,7}
H.O. = No:  E2 = {2,3,5,6,8,9,10}

Gini(E1) = 1-(0+1) = 0
Gini(E2) = 1-(9/49+16/49) = 24/49
Gini(E1,E2) = (3/10)*0 + (7/10)*(24/49) = 12/35 ≅ 0.343


Choosing the first split (E → E1, E2): attribute Marital Status (1)

E = {1,2,…,10}
M.S. ∈ {S}:   E1 = {1,3,8,10}
M.S. ∈ {M,D}: E2 = {2,4,5,6,7,9}

Gini(E1) = 1-(1/4+1/4) = 1/2
Gini(E2) = 1-(25/36+1/36) = 5/18
Gini(E1,E2) = (4/10)*(1/2) + (6/10)*(5/18) ≅ 0.367


Choosing the first split (E → E1, E2): attribute Marital Status (2)

E = {1,2,…,10}
M.S. ∈ {M}:   E1 = {2,4,6,9}
M.S. ∈ {S,D}: E2 = {1,3,5,7,8,10}

Gini(E1) = 1-(0+1) = 0
Gini(E2) = 1-(1/4+1/4) = 1/2
Gini(E1,E2) = (4/10)*0 + (6/10)*(1/2) = 3/10 = 0.3


Choosing the first split (E → E1, E2): attribute Marital Status (3)

E = {1,2,…,10}
M.S. ∈ {D}:   E1 = {5,7}
M.S. ∈ {S,M}: E2 = {1,2,3,4,6,8,9,10}

Gini(E1) = 1-(1/4+1/4) = 1/2
Gini(E2) = 1-(4/64+36/64) = 3/8
Gini(E1,E2) = (2/10)*(1/2) + (8/10)*(3/8) = 2/5 = 0.4


Choosing the first split (E → E1, E2): attribute Annual Income

Records sorted by Annual Income:

Class          No   No   No   Yes  Yes  Yes  No   No   No   No
Annual Income  60   70   75   85   90   95   100  120  125  220

Class counts and Gini for each candidate threshold v (split A.I. ≤ v versus A.I. > v):

v     YES (≤)  NO (≤)  YES (>)  NO (>)  Gini
55    0        0       3        7       0.42
65    0        1       3        6       0.4
72    0        2       3        5       0.375
80    0        3       3        4       0.343
87    1        3       2        4       0.417
92    2        3       1        4       0.4
97    3        3       0        4       0.3
110   3        4       0        3       0.343
122   3        5       0        2       0.375
172   3        6       0        1       0.4
230   3        7       0        0       0.42

Note: once the records are sorted by the values of the attribute, all possible splits on that attribute can be evaluated with a single scan of the records.


Best splits

Annual Income ≤ 97 and Marital Status ∈ {M}, both with Gini(E1,E2) = 0.3.


We choose: Annual Income ≤ 97

E = {1,2,…,10}
A.I. ≤ 97: E1 = {3,5,6,8,9,10}
A.I. > 97: E2 = {1,2,4,7}
Gini(E1,E2) = 0.3

• The set E2 does not need to be partitioned further, since it contains only records of class No.
• For the set E1, the best split is Annual Income ≤ 80, which separates {3,6,9} (all No) from {5,8,10} (all Yes).


FINAL TREE

A.I. ≤ 97
    A.I. ≤ 80: NO   {3,6,9}
    A.I. > 80: YES  {5,8,10}
A.I. > 97: NO   {1,2,4,7}

Note: an equivalent tree could have been obtained with a single ternary split.


Choosing Marital Status ∈ {M} instead, one can arrive at the following final tree (exercise):

M.S. ∈ {M}: NO   {2,4,6,9}
M.S. ∈ {S,D}
    H.O. = Yes: NO   {1,7}
    H.O. = No
        A.I. ≤ 77: NO   {3}
        A.I. > 77: YES  {5,8,10}

Note: this tree is more complex than the previous one (and therefore more prone to overfitting), and it highlights the limits of the greedy approach.
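The Gini values worked out by hand in these slides can be checked with a short script. The following is a minimal sketch, not part of the original material: the names `RECORDS`, `gini`, `split_gini` and `income_scan` are illustrative assumptions, and incomes are stored as plain integers in thousands. It recomputes the weighted Gini of each candidate split from scratch, which is simpler than the single-scan evaluation the slides describe but yields the same numbers. Exact fractions are used so the results can be compared directly with 12/35 and 3/10.

```python
from fractions import Fraction

# Dataset from the slides: (TID, home_owner, marital_status, income_in_K, defaulted)
RECORDS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def gini(labels):
    """Gini impurity 1 - sum_c p_c^2; taken to be 0 for an empty partition."""
    n = len(labels)
    if n == 0:
        return Fraction(0)
    return 1 - sum(Fraction(labels.count(c), n) ** 2 for c in set(labels))


def split_gini(records, goes_left):
    """Size-weighted Gini of the binary split induced by the predicate goes_left."""
    left = [r[4] for r in records if goes_left(r)]
    right = [r[4] for r in records if not goes_left(r)]
    n = len(records)
    return Fraction(len(left), n) * gini(left) + Fraction(len(right), n) * gini(right)


# Categorical splits considered in the slides
home_owner = split_gini(RECORDS, lambda r: r[1] == "Yes")
married = split_gini(RECORDS, lambda r: r[2] == "Married")

# Candidate thresholds for Annual Income, as listed in the slides
THRESHOLDS = [55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230]
income_scan = {t: split_gini(RECORDS, lambda r, t=t: r[3] <= t) for t in THRESHOLDS}
best_threshold = min(income_scan, key=income_scan.get)

print("Home Owner:", home_owner)
print("Marital Status in {M}:", married)
print("Best income threshold:", best_threshold, "with Gini", income_scan[best_threshold])
```

Because Annual Income ≤ 97 and Marital Status ∈ {M} tie at 3/10, the choice between them is arbitrary for a greedy learner, which is exactly the situation that leads to the two different final trees.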