Random Forests for Scientific Discovery
Leo Breiman, UC Berkeley
Adele Cutler, Utah State University
The Data Avalanche
We can gather and store larger amounts of data than ever before:
• Satellite data
• Web data
• EPOS (electronic point-of-sale) data
• Microarrays, etc.
• Text mining and image recognition
Who is trying to extract meaningful information from these data?
• Academic statisticians
• Machine learning specialists
• People in the application areas!
CART (Breiman, Friedman, Olshen, Stone 1984)
Arguably one of the most successful tools of the last 20 years. Why?
1. Universally applicable to both classification and regression problems, with no assumptions on the data structure.
2. Can be applied to large datasets: computational requirements are of order MN log N, where N is the number of cases and M is the number of variables.
3. Handles missing data effectively.
4. Deals with categorical variables efficiently.
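A minimal sketch of the first two properties, using scikit-learn's CART-style decision trees. This is my choice of library, not the talk's; the talk predates scikit-learn, but its trees implement essentially the CART algorithm:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Fit a classification tree; CART makes no distributional
    # assumptions about the predictors.
    data = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(data.data, data.target)

    # The fitted tree reads as a set of if/else rules.
    print(export_text(tree, feature_names=list(data.feature_names)))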
Example: UCSD Heart Disease Study*
Goal: to predict who is at risk of a second heart attack and early death within 30 days, and to determine who should be sent to intensive care treatment.
Number of subjects: 215
Outcome variable: High/Low risk, determined by the PI after 30 days of follow-up
Number of variables available: 100
19 noninvasive clinical and lab variables were used as the predictors.
* Gilpin, Olshen, Henning and Ross (1983)
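A hypothetical sketch of this setup in scikit-learn. The arrays below are random stand-ins, not the UCSD data; only the shape of the problem is taken from the study: 215 subjects, 19 predictors, a binary High/Low-risk outcome.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(215, 19))    # stand-in for 19 clinical/lab variables
    y = rng.integers(0, 2, size=215)  # stand-in for high/low risk labels

    # Cross-validated accuracy of a single CART-style tree on this task.
    tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0)
    print(cross_val_score(tree, X, y, cv=5).mean())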
Drawbacks of CART
• Accuracy: current methods, such as support vector machines and ensemble classifiers, often have 30% lower error rates than CART.
• Instability: if we change the data a little, the tree picture can change a lot, so the interpretation is not as straightforward as it appears.
Today, we can do better!
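A small sketch of the instability point: refitting a tree on bootstrap resamples of the same data can change its structure, visible here in the variables chosen at the first few splits. The dataset and library are my choices for illustration, not the talk's:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(0)

    for i in range(5):
        # Perturb the data slightly via a bootstrap resample.
        idx = rng.integers(0, len(y), size=len(y))
        tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        # tree_.feature holds, in node order, the variable split on at
        # each node; compare the first few across resamples.
        print("resample", i, "-> first splits on features", tree.tree_.feature[:3])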
What do we want in a tool for the sciences?
• Universally applicable for classification
• Unexcelled accuracy
• Capable of handling large datasets
• Effective handling of missing values
(the four items above are the minimum)
• Variable importance
• Interactions
• What is the shape of the data?
• Are there clusters?
• Are there novel cases or outliers?
• How does the multivariate action of the variables separate the classes?
Random Forests
• General-purpose tool for classification and regression
• Unexcelled accuracy: about as accurate as support vector machines (see later)
• Capable of handling large datasets
• Effectively handles missing values
• Gives a wealth of scientifically important insights
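A minimal sketch of these points with scikit-learn's RandomForestClassifier, again my choice of library rather than Breiman and Cutler's original Fortran code: out-of-bag accuracy comes without a holdout set, and per-variable importances speak to the wish list above.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Many trees grown on bootstrap samples, each split chosen from a
    # random subset of variables; oob_score gives a built-in accuracy
    # estimate from the cases each tree did not see.
    rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
    rf.fit(X, y)

    print("out-of-bag accuracy:", round(rf.oob_score_, 3))
    print("largest variable importances:",
          sorted(rf.feature_importances_, reverse=True)[:3])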