Classification (supplemental)
Scalable Decision Tree Induction
Methods in Data Mining Studies
• SLIQ (EDBT’96 — Mehta et al.)
– builds an index for each attribute; only the class list and the
current attribute list reside in memory
• SPRINT (VLDB’96 — J. Shafer et al.)
– constructs an attribute list data structure
• PUBLIC (VLDB’98 — Rastogi & Shim)
– integrates tree splitting and tree pruning: stops growing the
tree earlier
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
– separates the scalability aspects from the criteria that
determine the quality of the tree
– builds an AVC-list (attribute, value, class label)
SPRINT
For large data sets.

Training set:

Age  Car Type  Risk
23   Family    H
17   Sports    H
43   Sports    H
68   Family    L
32   Truck     L
20   Family    H

Resulting decision tree: the root tests Age < 25 (yes → Risk H);
otherwise Car Type = Sports is tested (yes → H, no → L).
Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, gini index, gini(T)
n
is defined as gini(T ) 1 
p 2j
j 1
where pj is the relative frequency of class j in T.
• If a data set T is split into two subsets T1 and T2 with sizes N1 and
N2 respectively, the gini index of the split data contains examples
from n classes, the gini index gini(T) is defined as
N 1 gini( )  N 2 gini( )
(
T
)

gini split
T1
T2
N
N
• The attribute provides the smallest ginisplit(T) is chosen to split the
node (need to enumerate all possible splitting points for each
attribute).
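As a concrete check of the two formulas, here is a minimal sketch in Python (the helper names `gini` and `gini_split` are ours, not IntelligentMiner's):

```python
# Minimal sketch of the two formulas above:
#   gini(T) = 1 - sum_j p_j^2
#   gini_split(T) = (N1/N)*gini(T1) + (N2/N)*gini(T2)
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(t1, t2):
    """Weighted Gini index of a binary split into label lists t1 and t2."""
    n = len(t1) + len(t2)
    return len(t1) / n * gini(t1) + len(t2) / n * gini(t2)

# Splitting the slide's Risk column {H,H,H} vs {L,H,L}:
print(round(gini_split(["H", "H", "H"], ["L", "H", "L"]), 3))  # 0.222
```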
SPRINT

Partition(S)
    if all points of S are in the same class then
        return
    else
        for each attribute A do
            evaluate splits on A
        use the best split found to partition S into S1 and S2
        Partition(S1)
        Partition(S2)
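The pseudocode above can be rendered as a toy recursion over a single numeric attribute; `best_split` here is a naive stand-in for SPRINT's attribute-list scan, and all names are ours:

```python
# Toy rendering of the Partition pseudocode for one numeric attribute (Age).
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(points):
    """points: list of (value, label); returns (gini_split, threshold)."""
    best = None
    values = sorted({v for v, _ in points})
    for lo, hi in zip(values, values[1:]):          # candidate midpoints
        t = (lo + hi) / 2
        s1 = [y for v, y in points if v < t]
        s2 = [y for v, y in points if v >= t]
        g = len(s1) / len(points) * gini(s1) + len(s2) / len(points) * gini(s2)
        if best is None or g < best[0]:
            best = (g, t)
    return best

def partition(points, depth=0):
    labels = {y for _, y in points}
    if len(labels) == 1:                            # all points in the same class
        return "  " * depth + "leaf " + labels.pop() + "\n"
    _, t = best_split(points)
    return ("  " * depth + f"Age < {t}?\n"
            + partition([p for p in points if p[0] < t], depth + 1)
            + partition([p for p in points if p[0] >= t], depth + 1))

data = [(23, "H"), (17, "H"), (43, "H"), (68, "L"), (32, "L"), (20, "H")]
print(partition(data))  # root split is Age < 27.5
```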
SPRINT Data Structures

Training set:

Tuple  Age  Car Type  Risk
0      23   Family    H
1      17   Sports    H
2      43   Sports    H
3      68   Family    L
4      32   Truck     L
5      20   Family    H

Attribute lists:

Age list (sorted by value):
Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     2
68   L     3

Car Type list:
Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5
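The attribute lists above can be built with a single sort per numeric attribute; a small sketch (variable names are ours):

```python
# Building SPRINT attribute lists from the slide's training set: each entry
# is (attribute value, class label, tuple id); numeric lists are sorted.
training = [  # (Age, Car Type, Risk), tuple ids 0..5
    (23, "Family", "H"), (17, "Sports", "H"), (43, "Sports", "H"),
    (68, "Family", "L"), (32, "Truck", "L"), (20, "Family", "H"),
]

age_list = sorted((age, risk, rid) for rid, (age, _, risk) in enumerate(training))
car_list = [(car, risk, rid) for rid, (_, car, risk) in enumerate(training)]

print(age_list[0])  # (17, 'H', 1): youngest driver is tuple 1, class H
```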
Splits

The split Age < 27.5 partitions both attribute lists:

Group 1 (Age < 27.5):

Age list:
Age  Risk  Tuple
17   H     1
20   H     5
23   H     0

Car Type list:
Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Family    H     5

Group 2 (Age >= 27.5):

Age list:
Age  Risk  Tuple
32   L     4
43   H     2
68   L     3

Car Type list:
Car Type  Risk  Tuple
Sports    H     2
Family    L     3
Truck     L     4
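Once the winning split is known, SPRINT routes every other attribute list in a single pass using the tuple ids of the winning list (the real algorithm keeps the rids in a hash table). A sketch of the Age < 27.5 case, with names of our own choosing:

```python
# Partition both attribute lists after the split Age < 27.5: collect the
# tuple ids that go left from the Age list, then route every list by rid.
age_list = [(17, "H", 1), (20, "H", 5), (23, "H", 0),
            (32, "L", 4), (43, "H", 2), (68, "L", 3)]
car_list = [("Family", "H", 0), ("Sports", "H", 1), ("Sports", "H", 2),
            ("Family", "L", 3), ("Truck", "L", 4), ("Family", "H", 5)]

left_rids = {rid for age, _, rid in age_list if age < 27.5}  # {0, 1, 5}

def split_list(attr_list, rids):
    """One pass over an attribute list, routing entries by tuple id."""
    group1 = [e for e in attr_list if e[2] in rids]
    group2 = [e for e in attr_list if e[2] not in rids]
    return group1, group2

car1, car2 = split_list(car_list, left_rids)
print(car1)  # [('Family', 'H', 0), ('Sports', 'H', 1), ('Family', 'H', 5)]
```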
Histograms

For continuous attributes, two class histograms are associated with each
node: Cbelow for the tuples already processed and Cabove for the tuples
still to process. A single scan over the sorted attribute list updates the
histograms and evaluates every candidate split point.

Example (Age list):
Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     2
68   L     3

After the first three records have been processed, Cbelow = {H: 3, L: 0}
and Cabove = {H: 1, L: 2}.

Candidate splits and their gini values (gini_split_i puts the first i
records in S1):

gini_split0 = 0/6·gini(S1) + 6/6·(1 − [(4/6)² + (2/6)²]) = 0.444
gini_split1 = 1/6·0 + 5/6·(1 − [(3/5)² + (2/5)²]) = 0.400
gini_split2 = 2/6·0 + 4/6·(1 − [(2/4)² + (2/4)²]) = 0.333
gini_split3 = 3/6·0 + 3/6·(1 − [(1/3)² + (2/3)²]) = 0.222
gini_split4 = 4/6·(1 − [(3/4)² + (1/4)²]) + 2/6·(1 − [(1/2)² + (1/2)²]) = 0.417
gini_split5 = 5/6·(1 − [(4/5)² + (1/5)²]) + 1/6·0 = 0.267
gini_split6 = 6/6·(1 − [(4/6)² + (2/6)²]) + 0/6·0 = 0.444

The minimum is gini_split3 = 0.222, i.e. the split Age < 27.5 (the
midpoint between 23 and 32).
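The scan above can be written out directly; `c_below`/`c_above` mirror the slide's Cbelow/Cabove histograms (variable names are ours):

```python
# One scan over the sorted Age list, maintaining the two class histograms;
# every gap between consecutive records is a candidate split point.
from collections import Counter

age_list = [(17, "H"), (20, "H"), (23, "H"), (32, "L"), (43, "H"), (68, "L")]

def gini(hist, n):
    return 1 - sum((c / n) ** 2 for c in hist.values())

c_below = Counter()                        # classes already processed
c_above = Counter(y for _, y in age_list)  # classes still to process
n = len(age_list)
best = None
for i, (age, y) in enumerate(age_list[:-1]):
    c_below[y] += 1
    c_above[y] -= 1
    g = ((i + 1) / n * gini(c_below, i + 1)
         + (n - i - 1) / n * gini(c_above, n - i - 1))
    threshold = (age + age_list[i + 1][0]) / 2   # midpoint candidate
    if best is None or g < best[0]:
        best = (g, threshold)

print(best)  # best split: Age < 27.5, gini 0.222
```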
Splitting categorical attributes
A single scan through the attribute list collects counts in a count
matrix, one cell per combination of class label and attribute value.
Example

Car Type list:
Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5

Count matrix:
         H  L
Family   2  1
Sports   2  0
Truck    0  1

gini_split(Family) = 3/6·gini(S1) + 3/6·gini(S2)
  gini(S1) = 1 − [(2/3)² + (1/3)²] = 4/9,  gini(S2) = 1 − [(2/3)² + (1/3)²] = 4/9
  gini_split(Family) = 0.444

gini_split(Sports) = 2/6·gini(S1) + 4/6·gini(S2)
  gini(S1) = 1 − [(2/2)²] = 0,  gini(S2) = 1 − [(2/4)² + (2/4)²] = 0.5
  gini_split(Sports) = 0.333

gini_split(Truck) = 1/6·gini(S1) + 5/6·gini(S2)
  gini(S1) = 1 − [(1/1)²] = 0,  gini(S2) = 1 − [(4/5)² + (1/5)²] = 0.32
  gini_split(Truck) = 0.267

The best categorical split is Car Type = Truck.
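The count matrix and the three subset tests can be reproduced in a few lines (names are ours):

```python
# One scan over the Car Type list builds the count matrix; each test
# "Car Type = v" is then scored from the matrix without rescanning data.
from collections import Counter, defaultdict

car_list = [("Family", "H"), ("Sports", "H"), ("Sports", "H"),
            ("Family", "L"), ("Truck", "L"), ("Family", "H")]

counts = defaultdict(Counter)            # count matrix: counts[value][class]
for value, label in car_list:
    counts[value][label] += 1

def gini(hist):
    n = sum(hist.values())
    return 1 - sum((c / n) ** 2 for c in hist.values())

n = len(car_list)
total = Counter(label for _, label in car_list)
results = {}
for value, inside in counts.items():
    outside = total - inside             # Counter subtraction drops zeros
    n1 = sum(inside.values())
    results[value] = n1 / n * gini(inside) + (n - n1) / n * gini(outside)

for value, g in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"gini_split(Car Type = {value}) = {g:.3f}")
```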
Example (2 attributes)

Age list:
Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     2
68   L     3

Car Type list:
Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5

The winner is Age < 27.5 (gini_split = 0.222, versus 0.267 for the best
categorical test, Car Type = Truck).

Y branch: tuples 1, 5, 0 are all class H, so it becomes a leaf
labelled H.

N branch: the remaining entries of both attribute lists are:

Age list:
Age  Risk  Tuple
32   L     4
43   H     2
68   L     3

Car Type list:
Car Type  Risk  Tuple
Sports    H     2
Family    L     3
Truck     L     4
Example for Bayes' Rule
• The patient either has cancer or does not.
• Prior knowledge: over the entire population,
0.008 have cancer.
• The lab test result (+ or −) is imperfect. It returns
– a correct positive result in only 98% of the cases in
which the cancer is actually present
– a correct negative result in only 97% of the cases
in which the cancer is not present
• What should we conclude if the lab test for a new
patient returns +?
Example for Bayes' Rule (cont.)
Pr(cancer) = 0.008    Pr(not cancer) = 0.992
Pr(+|cancer) = 0.98   Pr(−|cancer) = 0.02
Pr(+|not cancer) = 0.03   Pr(−|not cancer) = 0.97
Pr(+|cancer)·Pr(cancer) = 0.98 · 0.008 = 0.00784
Pr(+|not cancer)·Pr(not cancer) = 0.03 · 0.992 = 0.02976
Hence, Pr(cancer|+) = 0.00784 / (0.00784 + 0.02976) ≈ 0.21
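The arithmetic can be checked directly:

```python
# The slide's posterior computation, Pr(cancer | +), via Bayes' rule.
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_not = 0.03

joint_cancer = p_pos_given_cancer * p_cancer   # 0.00784
joint_not = p_pos_given_not * (1 - p_cancer)   # 0.02976
posterior = joint_cancer / (joint_cancer + joint_not)
print(round(posterior, 2))  # 0.21
```

Even with a positive test, the posterior probability of cancer is only about 21%, because the disease is so rare in the population.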