Classification supplemental

Scalable Decision Tree Induction Methods in Data Mining Studies

• SLIQ (EDBT'96, Mehta et al.)
  – builds an index for each attribute; only the class list and the current attribute list reside in memory
• SPRINT (VLDB'96, J. Shafer et al.)
  – constructs an attribute-list data structure
• PUBLIC (VLDB'98, Rastogi & Shim)
  – integrates tree splitting and tree pruning: stops growing the tree earlier
• RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
  – separates the scalability aspects from the criteria that determine the quality of the tree
  – builds an AVC-list (attribute, value, class label)

SPRINT

Designed for large data sets. Running example (training set):

Tuple  Age  Car Type  Risk
0      23   Family    H
1      17   Sports    H
2      43   Sports    H
3      68   Family    L
4      32   Truck     L
5      20   Family    H

Candidate splits are predicates such as Age < 25 or Car Type = Sports; each one partitions the Risk labels (H/L) into two groups.

Gini Index (IBM IntelligentMiner)

• If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 - Σ_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in T.
• If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively (N = N1 + N2), the gini index of the split data is defined as

    ginisplit(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)

• The attribute that provides the smallest ginisplit(T) is chosen to split the node (all possible split points must be enumerated for each attribute).
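The definitions above can be checked with a short sketch (Python; the function names are mine, not from any of the cited papers):

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies p_j."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted gini of a binary split: (N1/N)*gini(T1) + (N2/N)*gini(T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Risk labels of the six training tuples from the running example.
risk = ["H", "H", "H", "L", "L", "H"]
print(round(gini(risk), 3))  # 0.444

# Split on Car Type = Sports: tuples 1 and 2 vs. the rest.
print(round(gini_split(["H", "H"], ["H", "L", "L", "H"]), 3))  # 0.333
```

A pure set (all one class) has gini 0; the 4-H/2-L root set has gini 1 - (4/6)^2 - (2/6)^2 = 0.444.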
SPRINT

Partition(S):
    if all points of S are in the same class then
        return
    for each attribute A do
        evaluate_splits on A
    use the best split to partition S into S1 and S2
    Partition(S1)
    Partition(S2)

SPRINT Data Structures

SPRINT builds one attribute list per attribute; each entry holds (attribute value, class label, tuple id). Lists for continuous attributes are kept sorted by value.

Age list (sorted)        Car Type list
Age  Risk  Tuple         Car Type  Risk  Tuple
17   H     1             Family    H     0
20   H     5             Sports    H     1
23   H     0             Sports    H     2
32   L     4             Family    L     3
43   H     2             Truck     L     4
68   L     3             Family    H     5

When a node is split (here on Age < 27.5), every attribute list is partitioned between the two children:

Group 1 (Age < 27.5)
Age  Risk  Tuple         Car Type  Risk  Tuple
17   H     1             Family    H     0
20   H     5             Sports    H     1
23   H     0             Family    H     5

Group 2 (Age >= 27.5)
Age  Risk  Tuple         Car Type  Risk  Tuple
32   L     4             Sports    H     2
43   H     2             Family    L     3
68   L     3             Truck     L     4

Histograms

For continuous attributes, two class histograms are associated with each node: Cbelow (counts for tuples already processed) and Cabove (counts for tuples still to process). A single scan over the sorted attribute list moves each tuple from Cabove to Cbelow and evaluates the gini index at every candidate split point (the midpoint between consecutive attribute values).

Example (Age list; Risk values in sorted order are H H H L H L):

ginisplit0 = 0.444   (S1 = {}, S2 = all six tuples)
ginisplit1 = 0.400   (Age <= 18.5)
ginisplit2 = 0.333   (Age <= 21.5)
ginisplit3 = 0.222   (Age <= 27.5)
ginisplit4 = 0.417   (Age <= 37.5)
ginisplit5 = 0.267   (Age <= 55.5)
ginisplit6 = 0.444   (S2 = {}, no split)

Worked computation for the best split, Age <= 27.5:

ginisplit3 = (3/6)·gini(S1) + (3/6)·gini(S2)
gini(S1) = 1 - [(3/3)^2] = 0
gini(S2) = 1 - [(1/3)^2 + (2/3)^2] = 4/9
ginisplit3 = (3/6)·0 + (3/6)·(4/9) = 0.222

Splitting categorical attributes

A single scan through the attribute list collects counts in a count matrix, one cell per combination of class label and attribute value.

Example (Car Type list):

          H  L
Family    2  1
Sports    2  0
Truck     0  1

ginisplit(Family) = 0.444
ginisplit(Sports) = 0.333
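The histogram scan over a sorted attribute list can be sketched as follows. This is a simplified in-memory version (SPRINT itself keeps attribute lists on disk and one histogram pair per tree node); the function name is mine:

```python
from collections import Counter

def gini_from_counts(counts, n):
    """Gini index computed from a class histogram with n tuples total."""
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_continuous_split(attr_list):
    """One pass over a sorted (value, class_label) list.

    Cbelow holds counts for tuples already processed, Cabove for the
    rest; the gini index is evaluated at each midpoint split point.
    """
    n = len(attr_list)
    c_below = Counter()                              # Cbelow: processed
    c_above = Counter(lbl for _, lbl in attr_list)   # Cabove: to process
    best = (None, float("inf"))
    for i, (value, label) in enumerate(attr_list[:-1]):
        c_below[label] += 1                          # move tuple below
        c_above[label] -= 1
        n1 = i + 1
        g = (n1 / n) * gini_from_counts(c_below, n1) \
            + ((n - n1) / n) * gini_from_counts(c_above, n - n1)
        midpoint = (value + attr_list[i + 1][0]) / 2
        if g < best[1]:
            best = (midpoint, g)
    return best

age_list = [(17, "H"), (20, "H"), (23, "H"), (32, "L"), (43, "H"), (68, "L")]
print(best_continuous_split(age_list))  # best split point 27.5, gini 0.222...
```

On the example Age list this reproduces the minimum ginisplit of 2/9 ≈ 0.222 at the split point 27.5.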
ginisplit(Truck) = 0.267

Worked computations (S1 = tuples with the tested value, S2 = the rest):

ginisplit(Family) = (3/6)·gini(S1) + (3/6)·gini(S2)
  gini(S1) = 1 - [(2/3)^2 + (1/3)^2] = 4/9
  gini(S2) = 1 - [(2/3)^2 + (1/3)^2] = 4/9
  ginisplit(Family) = 4/9 = 0.444
ginisplit(Sports) = (2/6)·gini(S1) + (4/6)·gini(S2)
  gini(S1) = 1 - [(2/2)^2] = 0
  gini(S2) = 1 - [(2/4)^2 + (2/4)^2] = 0.5
  ginisplit(Sports) = (4/6)·0.5 = 0.333
ginisplit(Truck) = (1/6)·gini(S1) + (5/6)·gini(S2)
  gini(S1) = 1 - [(1/1)^2] = 0
  gini(S2) = 1 - [(4/5)^2 + (1/5)^2] = 0.32
  ginisplit(Truck) = (5/6)·0.32 = 0.267

Example (2 attributes)

Comparing all candidate splits on both attributes, the winner is Age <= 27.5 with ginisplit = 0.222. The split partitions the attribute lists between the two children:

Age <= 27.5?
  Y: tuples 1, 5, 0 (Risk = H, H, H); pure, so this child becomes a leaf labeled H
  N: tuples 4, 2, 3 (Risk = L, H, L); partitioned further, with attribute lists
     Age       32/L/4, 43/H/2, 68/L/3
     Car Type  Sports/H/2, Family/L/3, Truck/L/4

Example for Bayes Rules

• The patient either has cancer or does not.
• Prior knowledge: over the entire population, 0.008 have cancer.
• The lab test result (+ or -) is imperfect. It returns
  – a correct positive result in only 98% of the cases in which the cancer is actually present
  – a correct negative result in only 97% of the cases in which the cancer is not present
• What happens if the lab test returns + for a new patient?

Pr(cancer) = 0.008        Pr(not cancer) = 0.992
Pr(+|cancer) = 0.98       Pr(-|cancer) = 0.02
Pr(+|not cancer) = 0.03   Pr(-|not cancer) = 0.97

Pr(+|cancer)·Pr(cancer) = 0.98 · 0.008 = 0.0078
Pr(+|not cancer)·Pr(not cancer) = 0.03 · 0.992 = 0.0298

Hence, Pr(cancer|+) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21
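The posterior computation above can be reproduced directly, keeping the unrounded products (0.00784 and 0.02976):

```python
# Bayes rule for the cancer-test example: the posterior is proportional
# to likelihood times prior, normalized over the two hypotheses.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not_cancer = 0.98, 0.03

num_cancer = p_pos_given_cancer * p_cancer              # 0.00784
num_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.02976
p_cancer_given_pos = num_cancer / (num_cancer + num_not_cancer)
print(round(p_cancer_given_pos, 2))  # 0.21
```

Even after a positive test, the probability of cancer is only about 21%, because the low prior (0.8%) outweighs the test's high accuracy.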