PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning
Rajeev Rastogi, Kyuseok Shim
Presented by: Alon Keinan

Presentation layout
• Introduction: Classification and Decision Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions

Introduction: Classification
• Classification in data mining:
– Learn a model from a training sample set
– Use it to classify future records
• Techniques: Bayesian, neural networks, genetic algorithms, decision trees, …

Introduction: Decision Trees
• (Diagram: a decision tree is built from the training records and then used to classify new records.)

Decision Tree Building Algorithms
• Two phases:
– The building phase
– The pruning phase
• The building phase constructs a "perfect" tree that classifies the training records exactly.
• The pruning phase prevents "overfitting" of the training data.

Building Phase Algorithms
• Differ in the selection of the test criterion for partitioning:
– CLS
– ID3 & C4.5
– CART, SLIQ & SPRINT
• Differ in their ability to handle large training sets.
• All consider "guillotine-cut" (single-attribute, axis-parallel) splits only.

Pruning Phase Algorithms
• MDL: Minimum Description Length
• Cost-Complexity Pruning

SPRINT
• Initialize the root node.
• Initialize a queue Q to contain the root node.
• While Q is not empty:
– Dequeue the first node N in Q.
– If N is not pure:
• For each attribute, evaluate candidate splits.
• Use the least-entropy split to split node N into N1 and N2.
• Append N1 and N2 to Q.

Entropy
• E(S) = −Σi pi log2 pi, where pi is the fraction of records in S belonging to class i.
• The split minimizing the weighted entropy of the resulting partitions is chosen.

MDL
• The best tree is the one that can be encoded using the fewest bits.
• Cost of encoding the data records in the leaves.
• Cost of encoding the tree:
– The structure of the tree
– The splits
– The classes in the leaves

Pruning Algorithm
• computeCost&Prune(Node N):
– If N is a leaf, return C(S) + 1.
– minCostLeft := computeCost&Prune(Nleft)
– minCostRight := computeCost&Prune(Nright)
– minCost := min{ C(S) + 1, Csplit(N) + 1 + minCostLeft + minCostRight }
– If minCost = C(S) + 1, prune the child nodes Nleft and Nright.
– Return minCost.

PUBLIC
• PUBLIC = PrUning and BuiLding Integrated in Classification
• Uses SPRINT for building.
• Prunes periodically, during building.
• Essentially uses MDL for pruning.
• Distinguishes three types of leaves:
– "not expandable"
– "pruned"
– "yet to be expanded"
• Produces exactly the same tree as building followed by pruning.

Lower Bound Computation
• PUBLIC(1): the bound is the constant 1.
• PUBLIC(S): incorporates split costs.
• PUBLIC(V): incorporates split values.

PUBLIC(S)
• Calculates a lower bound on the subtree cost for each number of splits s = 0, …, k−1 (k = number of classes):
– For s = 0: C(S) + 1
– For s > 0: a bound that also counts the cost of encoding s splits
• Takes the minimum of the k bounds.
• Computes the bounds incrementally, by iterative addition.
• Complexity: O(k log k)

PUBLIC(V)
• PUBLIC(S) estimates the cost of each split as log a (a = number of attributes).
• PUBLIC(V) estimates each split as log a plus the cost of encoding the splitting value(s).
• Complexity: O(k(log k + a))

Lower Bound Computation: Summary
• PUBLIC(1): fixed bound of 1, O(1)
• PUBLIC(S): incorporates split costs, O(k log k)
• PUBLIC(V): incorporates split value costs, O(k(log k + a))

Performance Comparison
• Algorithms:
– SPRINT
– PUBLIC(1)
– PUBLIC(S)
– PUBLIC(V)
• Data sets:
– Real-life
– Synthetic

Real-life Data Sets
• (Chart: execution times of SPRINT, PUBLIC(1), PUBLIC(S) and PUBLIC(V) on the real-life data sets.)

Synthetic Data Sets
• (Charts: execution times of SPRINT, PUBLIC(1), PUBLIC(S) and PUBLIC(V) on the synthetic data sets, including the effect of noise.)

Other Parameters
• Number of attributes
• Number of classes
• Size of the training set

Conclusions
• Pruning is integrated into the building phase.
• Lower bounds on the cost of "yet to be expanded" leaves are computed.
• Improved performance over building followed by pruning.
• Open issues:
– How often to invoke the pruning procedure?
– Extending other building algorithms …
– Developing a tighter lower bound …
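The split-selection step of the SPRINT loop above can be sketched as follows. The function names `entropy` and `split_entropy` are illustrative, and class distributions are passed as plain lists of counts; this is a sketch of the criterion, not the paper's attribute-list data structures.

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            ent -= p * math.log2(p)
    return ent

def split_entropy(left_counts, right_counts):
    """Weighted entropy of a binary split; a SPRINT-style builder evaluates
    every candidate split and keeps the one with the smallest value."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * entropy(left_counts) + (n_right / n) * entropy(right_counts)
```

A split that separates the classes perfectly has weighted entropy 0 and is therefore always preferred over any impure split.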
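The computeCost&Prune recursion on the pruning slide can be sketched in the same spirit. The leaf-encoding cost `C` used here (label count times class-distribution entropy) and the constant `C_split` are stand-ins for the paper's exact MDL encoding, so treat this as a sketch of the recursion's shape, not the precise cost model.

```python
import math
from collections import Counter

class TreeNode:
    def __init__(self, labels, left=None, right=None):
        self.labels = labels            # class labels of the records reaching this node
        self.left, self.right = left, right

def C(labels):
    """Bits to encode the class labels in a leaf. Stand-in: n * H(distribution)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-c * math.log2(c / n) for c in Counter(labels).values())

def C_split(node):
    """Bits to encode the split test at an internal node; assumed constant here."""
    return 1.0

def compute_cost_and_prune(node):
    """MDL pruning as on the slide: keep the subtree only if encoding the split
    and both children is strictly cheaper than encoding the node as a leaf."""
    if node.left is None and node.right is None:
        return C(node.labels) + 1
    min_left = compute_cost_and_prune(node.left)
    min_right = compute_cost_and_prune(node.right)
    leaf_cost = C(node.labels) + 1
    subtree_cost = C_split(node) + 1 + min_left + min_right
    if leaf_cost <= subtree_cost:
        node.left = node.right = None   # prune child nodes
        return leaf_cost
    return subtree_cost
```

A pure node whose children carry the same single class gets pruned (the leaf encoding is cheaper), while a node whose split perfectly separates two classes keeps its subtree.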
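The idea behind pruning during building can be illustrated with the PUBLIC(1) bound. The `Node` class, its `kind` field, and the numeric costs below are hypothetical simplifications; the sketch only assumes that any subtree a "yet to be expanded" leaf might grow into costs at least 1 bit to encode, which is what makes pruning with the bound safe.

```python
class Node:
    """Minimal node for illustration. `kind` is one of the leaf types from
    the slides ('not_expandable', 'pruned', 'yet_to_be_expanded') or
    'internal'; leaf_cost plays the role of C(S) + 1 and split_cost the
    role of Csplit(N) + 1, supplied here as plain numbers."""
    def __init__(self, kind, leaf_cost=None, split_cost=None, left=None, right=None):
        self.kind = kind
        self.leaf_cost = leaf_cost
        self.split_cost = split_cost
        self.left, self.right = left, right

def lower_bound(node):
    """PUBLIC(1): a 'yet to be expanded' leaf may still grow an arbitrary
    subtree, but that subtree costs at least 1 bit, so 1 is a safe lower
    bound on its final MDL cost."""
    if node.kind == 'yet_to_be_expanded':
        return 1
    if node.left is None:               # settled leaf
        return node.leaf_cost
    return min(node.leaf_cost,
               node.split_cost + lower_bound(node.left) + lower_bound(node.right))

def can_prune(node):
    """Safe to prune during building: if encoding `node` as a leaf is no more
    expensive than a LOWER BOUND on its subtree cost, the final MDL pruning
    pass would also prune it, so the resulting tree is unchanged."""
    return (node.left is not None and
            node.leaf_cost <= node.split_cost
                + lower_bound(node.left) + lower_bound(node.right))
```

Because the bound underestimates the true subtree cost, every node pruned this way would also have been pruned by MDL after a full build, which is why PUBLIC produces exactly the same tree while doing less work.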