“BOF” Trees Diagram as a Visual Way to Improve Interpretability of Tree Ensembles

Vesna Luzar-Stiffler, Ph.D., University Computing Centre and CAIR Research Centre, Zagreb, Croatia
Charles Stiffler, Ph.D., CAIR Research Centre, Zagreb, Croatia
[email protected], [email protected]

BOF Trees Visualization, Zagreb, June 12, 2004

Outline
* Introduction/Background
* Trees
* Ensemble Trees
* Visualization Tools
* Simulation Results
* Web Survey Results
* Conclusions/Recommendations

Introduction / Background
Classification / Decision Trees
* Data mining (statistical learning) method for classification
* Invented twice:
  * Statistical community: Breiman, Friedman, et al. (1984)
  * Machine learning community: Quinlan (1986)
* Many positive features: interpretability, ability to handle data of mixed type and missing values, robustness to outliers, etc.
* Disadvantages:
  * Unstable vis-à-vis seemingly minor data perturbations
  * Low predictive power

Introduction / Background
Possible improvements: ensembles
* Bagging, i.e., bootstrapping trees (Breiman, 1996)
* Boosting, e.g., AdaBoost (Freund & Schapire, 1997)
* Random Forests (Breiman, 2001)
* Stacking, randomized trees, etc.
Advantage: improved prediction
Disadvantage: loss of interpretability (“black box”)

Classification Tree
Let $\hat{f}(x)$ be the classification tree prediction at input $x$ obtained from the full “training” data $Z = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$.

Bagging Classification Tree
Let $\hat{f}^{*b}(x)$ be the classification tree prediction at input $x$ obtained from the bootstrap sample $Z^{*b}$, $b = 1, 2, \dots, B$.
Bagging estimate:
$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$

Visualization tools
Graphs based on the predictor “importances” ($B \times p$) matrix F ($p$ = # of predictors)
* For bagged trees, we take the average: $\hat{I}_k^2 = \frac{1}{B} \sum_{b=1}^{B} \hat{I}_k^2(T_b)$
* Diagram 1 is the importance mean bar chart
* Diagram 2 (“BOF Clusters”) is the cluster means chart (NEW)
* Diagram 3 (“BOF MDPREF”) is the multidimensional preference bi-plot (NEW)

Visualization tools
Graphs based on the proximity ($n \times n$) matrix P ($n$ = # of cases)
* Diagram 4 (“Proximity Clusters”) is the cluster means chart (Breiman, 2002)
* Diagram 5 (“Proximity MDS”) is the multidimensional scaling plot of “similar” cases (Breiman, 2002)

Simulation experiments
* S1: Generate a sample of size n=30, two classes, and p=5 variables (x1-x5), with a standard normal distribution and pair-wise correlation 0.95. The responses are generated according to Pr(Y=1|x1≤0.5) = 0.2, Pr(Y=1|x1>0.5) = 0.8.
* S2: Generate a sample of size n=30, two classes, and p=5 variables (x1-x5), with a standard normal distribution and pair-wise correlation 0.95 between x1 and x2, and 0 among other predictors. The responses are generated according to Pr(Y=1|x1≤0.5) = 0.2, Pr(Y=1|x1>0.5) = 0.8.
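The simulation setups and the two matrices behind the diagrams can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' software.
* Depth-1 trees (stumps) stand in for full classification trees to keep the example short.
* Importance is taken to be the Gini impurity decrease of the single split.
* Proximity is taken to be the share of trees in which two cases fall in the same terminal node.
* All function names (`simulate`, `fit_stump`, `bag`, `bagged_predict`, `mean_importance`, `proximity`) are hypothetical.

```python
import math
import random


def simulate(scenario, n=30, p=5, rho=0.95, seed=0):
    """Draw (X, y) for scenario "S1" or "S2" as described on the slides."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        e = [rng.gauss(0.0, 1.0) for _ in range(p + 1)]  # independent N(0, 1) draws
        if scenario == "S1":
            # Shared factor e[p] gives every pair of predictors correlation rho.
            row = [math.sqrt(rho) * e[p] + math.sqrt(1 - rho) * e[k] for k in range(p)]
        elif scenario == "S2":
            # Only x1 and x2 are correlated (rho); the other predictors stay independent.
            row = e[:p]
            row[1] = rho * e[0] + math.sqrt(1 - rho ** 2) * e[1]
        else:
            raise ValueError("scenario must be 'S1' or 'S2'")
        X.append(row)
        # Pr(Y=1 | x1 <= 0.5) = 0.2 and Pr(Y=1 | x1 > 0.5) = 0.8
        y.append(int(rng.random() < (0.2 if row[0] <= 0.5 else 0.8)))
    return X, y


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    q = sum(labels) / len(labels) if labels else 0.0
    return 2 * q * (1 - q)


def majority(labels):
    """Majority class of a list of 0/1 labels (ties go to class 1)."""
    return int(2 * sum(labels) >= len(labels))


def fit_stump(X, y):
    """Best single split as (variable, threshold, left class, right class, Gini decrease)."""
    n, p = len(X), len(X[0])
    best = (0, float("inf"), majority(y), majority(y), 0.0)  # fallback: no split
    for k in range(p):
        for t in sorted({row[k] for row in X})[:-1]:  # candidate thresholds
            left = [yi for row, yi in zip(X, y) if row[k] <= t]
            right = [yi for row, yi in zip(X, y) if row[k] > t]
            gain = gini(y) - (len(left) * gini(left) + len(right) * gini(right)) / n
            if gain > best[4]:
                best = (k, t, majority(left), majority(right), gain)
    return best


def predict(stump, x):
    k, t, left, right, _ = stump
    return left if x[k] <= t else right


def bag(X, y, B=50, seed=1):
    """Fit B stumps on bootstrap samples; return them with the B x p importance matrix F."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    stumps, F = [], []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample Z*b
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        row = [0.0] * p
        row[stump[0]] = stump[4]  # a stump credits only its split variable
        stumps.append(stump)
        F.append(row)
    return stumps, F


def bagged_predict(stumps, x):
    """Bagging estimate: average of the per-tree predictions at x."""
    return sum(predict(s, x) for s in stumps) / len(stumps)


def mean_importance(F):
    """Column means of F, i.e. the quantity plotted in a mean importance bar chart."""
    return [sum(row[k] for row in F) / len(F) for k in range(len(F[0]))]


def proximity(stumps, X):
    """n x n matrix P: share of trees in which cases i and j share a terminal node."""
    n = len(X)
    leaves = [[x[k] <= t for x in X] for k, t, *_ in stumps]
    return [[sum(leaf[i] == leaf[j] for leaf in leaves) / len(stumps)
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    for scenario in ("S1", "S2"):
        X, y = simulate(scenario)
        stumps, F = bag(X, y)
        print(scenario, [round(v, 3) for v in mean_importance(F)])
```

The rows of F (one per bootstrap tree) are what a clustering or preference bi-plot would take as input, and 1 - P can serve as a dissimilarity matrix for multidimensional scaling. Both downstream steps are left out here.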
[Diagram 1, Mean importance: panels S1 and S2]
[Diagram 2, “BOF Clusters”: panels S1 and S2]
[Diagram 3, “BOF MDPREF”: panels S1 and S2]
[Diagram 4, “Proximity Clusters”: panels S1 and S2]

Web Survey data
* ICT infrastructure/usage in Croatian primary and secondary schools
* 25,000+ teachers (cases)
* 200+ variables
* Response: “classroom use of a computer by educators” (yes/no)
* Partition: 50% training, 25% validation, 25% test

[Figure: Initial tree (before bagging)]
[Diagram 1, “Mean importance”]
[Diagram 2, “BOF Clusters”]
[Diagram 3, “BOF MDPREF”]
[Figure: Bootstrap tree 11]
[Figure: Bootstrap tree 22]
[Figure: Bootstrap tree 12]
[Figure: Clustering trees]
[Diagram 5, “Proximity MDS”]

Conclusions / Recommendations
* There is software for trees
* There is some software for tree ensembles
* There are some visualization tools (old and new)
* The problem is that they are not “interfaced” (integrated)