Random Forests for Scientific Discovery
Leo Breiman, UC Berkeley
Adele Cutler, Utah State University
2.10.

The Data Avalanche
We can gather and store larger amounts of data than ever before:
  Satellite data
  Web data
  EPOS
  Microarrays, etc.
Text mining and image recognition.
Who is trying to extract meaningful information from these data?
  Academic statisticians
  Machine learning specialists
  People in the application areas!

CART (Breiman, Friedman, Olshen, Stone 1984)
Arguably one of the most successful tools of the last 20 years. Why?
1. Universally applicable to both classification and regression problems, with no assumptions on the data structure.
2. Can be applied to large datasets.
3. Computational requirements are of order MN log N, where N is the number of cases and M is the number of variables.
4. Handles missing data effectively.
5. Deals with categorical variables efficiently.

Example: UCSD Heart Disease Study*
Goal: to predict who is at risk of a second heart attack and early death within 30 days, and to determine who should be sent to intensive care treatment.
Number of subjects: 215
Outcome variable: high/low risk, determined by the PI after 30 days of follow-up
Number of variables available: 100
19 noninvasive clinical and lab variables were used as the predictors.
*Gilpin, Olshen, Henning and Ross (1983)

Drawbacks of CART
Accuracy: current methods, such as support vector machines and ensemble classifiers, often have 30% lower error rates than CART.
Instability: if we change the data a little, the tree picture can change a lot, so the interpretation is not as straightforward as it appears.
Today, we can do better!

What do we want in a tool for the sciences?
Minimum:
  Universally applicable for classification
  Unexcelled accuracy
  Capable of handling large datasets
  Effective handling of missing values
Variable importance
Interactions
What is the shape of the data? Are there clusters?
Are there novel cases or outliers?
How does the multivariate action of the variables separate the classes?

Random Forests
General-purpose tool for classification and regression
Unexcelled accuracy: about as accurate as support vector machines (see later)
Capable of handling large datasets
Effectively handles missing values
Gives a wealth of scientifically important insights
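
A note on the CART cost of order MN log N quoted on the CART slide: one way to see where that order comes from is the split search at a single tree node, sketched below. For each of the M variables the N cases are sorted (N log N) and then swept once to score every candidate cut point. This is an illustrative sketch only, not the authors' implementation. The function names gini and best_split, the use of Gini impurity as the split criterion, and the toy data are assumptions introduced for this example.

```python
from collections import Counter


def gini(counts, n):
    """Gini impurity of a node holding n cases with the given class counts."""
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(X, y):
    """Return (weighted_gini, variable_index, threshold) for the best split.

    X is a list of N rows with M numeric values each; y is a list of N labels.
    Hypothetical helper: the Gini criterion is an assumption of this sketch.
    """
    n = len(y)
    best = None
    for j in range(len(X[0])):                           # loop over the M variables
        order = sorted(range(n), key=lambda i: X[i][j])  # O(N log N) sort
        left, right = Counter(), Counter(y)
        for pos in range(1, n):                          # O(N) sweep over cut points
            label = y[order[pos - 1]]
            left[label] += 1                             # move one case to the left side
            right[label] -= 1
            lo, hi = X[order[pos - 1]][j], X[order[pos]][j]
            if lo == hi:
                continue                                 # cannot cut between tied values
            score = (pos * gini(left, pos)
                     + (n - pos) * gini(right, n - pos)) / n
            if best is None or score < best[0]:
                best = (score, j, (lo + hi) / 2)
    return best


# Toy data (made up for illustration): variable 0 separates the classes cleanly.
X = [[2.0, 7.0], [3.0, 1.0], [10.0, 6.0], [11.0, 2.0]]
y = ["low", "low", "high", "high"]
score, var, thr = best_split(X, y)
print(score, var, thr)  # prints: 0.0 0 6.5
```

The sort dominates the work per variable, which is why the search at one node costs on the order of MN log N. The total cost of growing a full tree also depends on its depth and on how the cases divide at each split.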