DEMYSTIFYING RANDOM FORESTS
ANTONI DZIECIOLOWSKI SAS CANADA
Copyright © 2016, SAS Institute Inc. All rights reserved.
RANDOM FOREST MOTIVATION
“With excellent performance on all eight metrics, calibrated boosted trees were
the best learning algorithm overall. Random forests are close second, followed
by uncalibrated bagged trees, calibrated SVMs, and uncalibrated neural nets.”
Rich Caruana, Alexandru Niculescu-Mizil.
An Empirical Comparison of Supervised Learning Algorithms. ICML 2006
DECISION TREE DEFINITION
Decision tree: a schematic, tree-shaped diagram used to
determine a course of action or to show a statistical
probability. Each branch of the decision tree represents a
possible decision, occurrence, or reaction. The tree is
structured to show how and why one choice may lead to
the next, and the branches indicate that the options at each
step are mutually exclusive.
DECISION TREE DEFINITION
[Figure: example decision tree; the only surviving node label is X1 = 2]
DECISION TREE BINARY SPLIT EXAMPLE
Splitting Criteria:
• Information Gain
• Variance
• Gini Index (Binary only)
• Chi Square
• Etc.
Julie Grisanti - Decision Trees: An Overview
http://www.aunalytics.com/decision-trees-an-overview/
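As a rough illustration of two of the splitting criteria listed above, the sketch below scores one binary split with the Gini index and with information gain (entropy reduction). The function names and the toy labels are assumptions made for this example; they do not come from the slides or from any SAS procedure.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_gain(parent, left, right, impurity):
    """Impurity decrease of a binary split, weighting each child by its size."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical node with four cases of each class, split into two children.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

gini_gain = split_gain(parent, left, right, gini)      # Gini decrease
info_gain = split_gain(parent, left, right, entropy)   # information gain
```

A tree builder would evaluate every candidate split this way and keep the one with the largest gain; the two criteria usually, but not always, rank splits the same way.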
RANDOM FORESTS
RANDOM FOREST LEO BREIMAN
1928 - 2005
• Responsible in part for bridging the gap between statistics and
computer science in machine learning.
• Contributed to the work on classification and regression trees
and on ensembles of trees fit to bootstrap samples (bagging).
• Focused on computationally intensive multivariate analysis,
especially the use of nonlinear methods for pattern recognition
and prediction in high-dimensional spaces.
• Developed decision trees (random forest) as computationally
efficient alternatives to neural nets.
https://www.stat.berkeley.edu/~breiman/
WHAT IS A RANDOM FOREST?
“Random forests are a combination of tree predictors such that each tree
depends on the values of a random vector sampled independently and
with the same distribution for all trees in the forest.”
Leo Breiman. Random Forests. Statistics Department,
University of California, Berkeley, 2001
RANDOM FOREST
D = ((x1,y1),…,(xN,yN)) (observed data points)
m < M features (variables)
Algorithm: Random Forest for Regression or Classification.
1. For b = 1 to B: (construct B trees)
   (a) Choose a bootstrap sample Db of size N from the training data D.
   (b) Grow a random-forest tree Tb on the bootstrapped data by recursively repeating the following steps for
       each leaf node of the tree, until the minimum node size nmin is reached:
       i. Select m variables at random from the M variables.
       ii. Pick the best variable/split-point among the m.
       iii. Split the node into two daughter nodes.
2. Output the ensemble of trees {Tb}, b = 1,…,B.
[Hastie, Tibshirani, Friedman. The Elements of Statistical Learning]
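The algorithm above can be sketched in a few dozen lines of plain Python. This is a minimal illustration for classification only, assuming numeric features, integer class labels, the Gini index as the split criterion, and majority voting at prediction time; none of these choices are fixed by the slide, and every function name and the toy data set are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Step ii: best (feature, threshold) among the candidates, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:          # keep only splits that reduce impurity
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, m, n_min, rng):
    """Step 1(b): leaves are class labels, internal nodes are tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(labels) <= n_min or len(set(labels)) == 1:
        return majority
    features = rng.sample(range(len(rows[0])), m)    # step i: m of the M variables
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    f, t = split                                     # step iii: two daughter nodes
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, n_min, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, n_min, rng))


def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def random_forest(rows, labels, B, m, n_min, seed=0):
    """Steps 1 and 2: grow B trees, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # step 1(a): size N, with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, n_min, rng))
    return forest


def predict_forest(forest, row):
    """Classification: majority vote over the trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Hypothetical data: two informative features and one uninformative one (M = 3).
rows = [(i, 2 * i, (i * 7) % 10) for i in range(20)]
labels = [int(r[0] >= 10) for r in rows]
forest = random_forest(rows, labels, B=25, m=2, n_min=1)
low = predict_forest(forest, (0, 0, 5))
high = predict_forest(forest, (19, 38, 5))
```

For regression, the same structure would apply with a variance-based split criterion and with tree predictions averaged rather than voted on.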
VISUALIZATION OF BAGGING
HOW TO BUILD A RANDOM TREE (BOOTSTRAPPING)
[Figure: a data space (inputs) laid out as a table of observations Obs 1 … Obs N by features Feat 1 … Feat M, next to a response space (outputs) holding one binary target per observation, Target 1 … Target N]
Pick m features from M and n observations from N at random
[Figure: the features selected in the example are Feat 1 and Feat 3]
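The sampling step on this slide can be sketched as follows: observations are drawn with replacement (the bootstrap) and features without replacement. The helper name, the seed, and the small data table are hypothetical and are not a reconstruction of the figure. Note that the slide describes one feature draw per tree, whereas the algorithm quoted from Hastie, Tibshirani and Friedman draws m features again at every split.

```python
import random


def draw_tree_sample(N, M, n, m, rng):
    """Pick n of N observations with replacement and m of M features without."""
    obs = [rng.randrange(N) for _ in range(n)]      # bootstrap: repeats allowed
    feats = sorted(rng.sample(range(M), m))         # distinct feature indices
    return obs, feats


# Hypothetical table of 5 observations by 4 features (values are made up).
data = [
    [2, 6, 3, 5],
    [0, 3, 1, 5],
    [7, 8, 5, 4],
    [9, 8, 2, 3],
    [4, 5, 8, 2],
]

rng = random.Random(42)
obs, feats = draw_tree_sample(N=5, M=4, n=5, m=2, rng=rng)

# The rows and columns this tree would be trained on.
subset = [[data[i][j] for j in feats] for i in obs]
```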
BAGGING OR BOOTSTRAP AGGREGATION
Average many noisy but approximately unbiased models to reduce the
variance of the estimated prediction function.
[Hastie, Tibshirani, Friedman. The Elements of Statistical Learning]
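The variance reduction can be seen in a small simulation. The sketch below assumes an idealized setting in which every "model" is the true value plus independent Gaussian noise; the true value, the noise level, and the function names are all made up for illustration. Averaging B = 25 such models should cut the variance by roughly a factor of 25. Trees grown on bootstrap samples of the same data are correlated, so the reduction in a real forest is smaller than in this idealized case.

```python
import random
import statistics


def noisy_model(rng, truth=3.0, sigma=1.0):
    """One unbiased but noisy prediction of the same quantity."""
    return truth + rng.gauss(0.0, sigma)


def bagged_model(rng, B):
    """Average of B independent noisy predictions."""
    return statistics.fmean(noisy_model(rng) for _ in range(B))


rng = random.Random(1)
single = [noisy_model(rng) for _ in range(5000)]
bagged = [bagged_model(rng, B=25) for _ in range(5000)]

var_single = statistics.variance(single)   # close to sigma**2 = 1
var_bagged = statistics.variance(bagged)   # close to sigma**2 / B = 0.04
mean_bagged = statistics.fmean(bagged)     # averaging leaves the mean near 3.0
```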
BUILDING A FOREST (ENSEMBLE)
RANDOM FOREST ADVANTAGES
• Can solve both types of problems, classification and regression
• Random forests generalize well to new data
• It is unexcelled in accuracy among current algorithms*
• It runs efficiently on large databases and can handle thousands of input variables without variable
deletion
• It gives estimates of which variables are important in the classification
• It generates an internal unbiased estimate of the generalization error as the forest building progresses
• It has an effective method for estimating missing data and maintains accuracy when a large proportion
of the data are missing
• It computes proximities between pairs of cases that can be used in clustering, in locating outliers, or to give
interesting views of the data
• The out-of-bag error estimate removes the need for a set-aside test set
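The out-of-bag estimate works because each bootstrap sample leaves some observations out, and those can be used to test the tree that never saw them. The sketch below only counts how many observations are left out; the sample size, seed, and function name are assumptions for illustration. For a bootstrap sample of size N drawn from N observations, the expected out-of-bag share is (1 - 1/N)^N, which approaches 1/e, about 37%, for large N.

```python
import random


def out_of_bag_indices(n, rng):
    """Indices never drawn in one bootstrap sample of size n."""
    in_bag = {rng.randrange(n) for _ in range(n)}
    return [i for i in range(n) if i not in in_bag]


rng = random.Random(7)
n = 2000

# Out-of-bag share for 50 independent bootstrap samples.
fractions = [len(out_of_bag_indices(n, rng)) / n for _ in range(50)]
mean_oob = sum(fractions) / len(fractions)

# Theoretical expectation for comparison.
expected_oob = (1 - 1 / n) ** n
```

To turn this into an error estimate, each observation would be predicted using only the trees for which it was out of bag, and those predictions compared with the observed targets.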
DISADVANTAGES
• The results are less actionable because forests are not easily interpreted.
They are considered a black-box approach that gives statistical modelers
little control over what the model does, similar to a neural network.
• Random forests do a good job at classification but are less well suited to
regression problems, because they do not produce truly continuous
predictions. In regression, they do not predict beyond the range of the
training data, and they may overfit data sets that are particularly noisy.
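The range limitation in regression follows from how a tree predicts: each leaf returns the mean of the training targets that fell into it, and the forest averages those means. The sketch below is a minimal one-feature regression forest with squared-error splits and a hypothetical linear data set, y = 2x; all names and settings are assumptions for illustration. Asked for a prediction far outside the training range, it cannot exceed the largest training target.

```python
import random
import statistics


def fit_tree(xs, ys, n_min=2):
    """Grow a 1-D regression tree; each leaf stores the mean target of its node."""
    if len(ys) <= n_min or len(set(xs)) == 1:
        return statistics.fmean(ys)
    best_sse, best_t = None, None
    for t in sorted(set(xs))[:-1]:          # dropping the largest x keeps the right child non-empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best_sse is None or sse < best_sse:
            best_sse, best_t = sse, t
    lo = [(x, y) for x, y in zip(xs, ys) if x <= best_t]
    hi = [(x, y) for x, y in zip(xs, ys) if x > best_t]
    return (best_t,
            fit_tree([x for x, _ in lo], [y for _, y in lo], n_min),
            fit_tree([x for x, _ in hi], [y for _, y in hi], n_min))


def predict_tree(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree


def fit_forest(xs, ys, B, seed=0):
    """Bagged regression trees, one bootstrap sample per tree."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Regression: average the tree predictions."""
    return statistics.fmean(predict_tree(t, x) for t in forest)


# Hypothetical training data: y = 2x for x = 0..19, so targets span 0 to 38.
xs = list(range(20))
ys = [2.0 * x for x in xs]
forest = fit_forest(xs, ys, B=30)

inside = predict_forest(forest, 10)     # within the training range
outside = predict_forest(forest, 100)   # true value would be 200
```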
RANDOM FOREST SAS HPFOREST (SAS ENTERPRISE MINER)
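PROC HPFOREST;
   /* targetname and typeoftarget are placeholders for the target variable and its level */
   target targetname / level=typeoftarget;
   /* replace (categorical variables) with the list of categorical inputs */
   input (categorical variables) / level=nominal;
   /* replace (numerical variables) with the list of numerical inputs */
   input (numerical variables) / level=interval;
RUN;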
OUTPUT OF PROC HPFOREST
THANK YOU!