Modeling in R
Sanna Härkönen
Model fitting: simple linear model
[Figure: scatter plots against H (m) with fitted trendlines: D (cm) vs H, y = 0.4009x + 15.519, R² = 0.1356; D (cm) vs H, y = 1.5619x - 8.4348, R² = 0.8998; N trees/ha vs H, y = -67.646x + 1818.8, R² = 0.7707]
• Important measures:
  • Correlation r
  • Coefficient of determination R²
  • p-values
  • Residuals (examining their distribution)
PEARSON CORRELATION r
• Measures the linear relationship between two variables
• Even if the correlation is low, there can be a strong (non-linear) relationship between the variables
• Can be positive or negative depending on the relationship (range -1..1)
• Equation: r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² )
EXAMPLE: SAME CORRELATION (0.816), BUT DIFFERENT RELATIONSHIP
A linear fit is OK in only one of the four panels (Anscombe's quartet):
http://en.wikipedia.org/wiki/File:Anscombe%27s_quartet_3.svg
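This can be reproduced directly in R with the built-in anscombe data set (not part of the original slides): all four x/y pairs have correlation of about 0.816, but only one relationship is genuinely linear.

  data(anscombe)
  # correlation of each pair x1..x4 with y1..y4: all roughly 0.816
  sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
  plot(anscombe$x2, anscombe$y2)   # clearly curved even though r = 0.816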
REGRESSION ANALYSIS
• Examines the relationships between variables
• Dependent variable: the variable that is explained by the independent variable(s)
• Coefficient of determination R² = r², where r is the correlation
• For example, if D is expressed as a function of H -> D is the dependent and H the independent variable.
[Figure: scatter plot of D against H]
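A minimal sketch (assuming a data frame a with columns H and D, as in the later examples) showing that R² is the squared correlation in simple linear regression:

  r <- cor(a$H, a$D)
  r^2                                      # coefficient of determination
  summary(lm(D ~ H, data = a))$r.squared   # same value as r^2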
SIMPLE LINEAR REGRESSION
• Fits a linear regression line between two variables.
• y = β0 + β1*x + ε
• (y is the dependent (= response) variable, x is the independent (= predictor) variable, β0 is the constant, β1 is the slope and ε is the random error)
• Method: least squares regression, where the regression line is fitted so that the sum of squared model residuals, Σ(measured y - modeled y)², is minimized
[Figure: example scatter plot of Y against X with fitted line y = 1.2x + 1.2, R² = 0.6923]
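As a minimal illustration (not from the slides; the data values are made up), the least-squares slope and intercept can be computed from their closed-form formulas and compared with lm():

  x <- c(1, 2, 3, 4, 5)            # hypothetical predictor values
  y <- c(2.1, 3.9, 5.2, 6.8, 7.1)  # hypothetical response values
  b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
  b0 <- mean(y) - b1 * mean(x)                                     # intercept
  fit <- lm(y ~ x)
  coef(fit)          # same intercept and slope as b0 and b1
  sum(resid(fit)^2)  # the sum of squared residuals that least squares minimizes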
INTERPRETATION: r and R²
• |r| = 0.0 (R² = 0.0): relationship non-significant
• |r| = 0.4 (R² = 0.16): moderate
• |r| = 0.6 (R² = 0.36): remarkable
• |r| = 0.8 (R² = 0.64): strong
• |r| = 1 (R² = 1)
[Figure: two D (cm) vs H (m) scatter plots with fitted lines y = 0.4009x + 15.519 (R² = 0.1356) and y = 1.5619x - 8.4348 (R² = 0.8998)]
In the plot with R² = 0.1356, H explains ~14% of the variation in D: poor fit. In the plot with R² = 0.8998, H explains ~90% of the variation in D: very good fit.
FITTING A SIMPLE LINEAR MODEL
1. Import the data to R (command read.csv())
2. Examine summary statistics of your variables (summary() command in R)
3. Examine the relationships of the variables by plotting them (plot() command in R)
4. If you see a linear relationship between the dependent variable and the explanatory variables -> you can fit a linear model
   • If the relationship is not linear, you can try to linearize it first by transforming the variable(s) (e.g. logarithm, exponential, ...) and then apply linear regression to the transformed values
5. Fit the linear model in R: command lm(y~x), where y is the dependent and x the independent variable
6. Examine the results of the regression (significance of variables, R² etc.) using the summary() command
7. Examine the residuals
(A minimal R sketch of this workflow is shown after the figure below.)
[Figure: linear relationship; non-linear relationship of X and Y; linear relationship of X and exp(Y)]
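A minimal sketch of the steps above, assuming the data come from a hypothetical file trees.csv and using the column names D and TOTAL_VOLUME that appear in the later slides:

  a <- read.csv("trees.csv")              # 1. import the data (file name is an assumption)
  summary(a)                              # 2. summary statistics of the variables
  plot(a$D, a$TOTAL_VOLUME)               # 3. plot the relationship
  fit <- lm(TOTAL_VOLUME ~ D, data = a)   # 5. fit the linear model
  summary(fit)                            # 6. coefficients, p-values, R-squared
  plot(fit)                               # 7. residual diagnostics
  # If the relationship is not linear (step 4), transform first, e.g.:
  # fit_log <- lm(log(TOTAL_VOLUME) ~ D, data = a)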
Summary statistics
• Dataset "a": summary(a)
Plotting
• plot(a$D, a$TOTAL_VOLUME)
• plot(a$BA, a$TOTAL_VOLUME)
Need for linearizing?
R example: BUILDING A LINEAR MODEL in R
• Building a linear model for basal area (BA1) as a function of height (H1)
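A minimal sketch, assuming the data frame a contains the columns BA1 and H1:

  ba_model <- lm(BA1 ~ H1, data = a)   # basal area as a function of height
  summary(ba_model)                    # intercept, slope, p-values, R-squared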
RESULTS OF REGRESSION ANALYSIS: R
• Summary statistics of the residuals (= original_y - modeled_y)
• Intercept and slope of the model -> Y = 0.126937 + 0.117584 X
• Standard error of the estimates
• t-test values (estimate/SE) and their p-values: show whether the variable is significant at a given significance level
• F-test value and its p-value: show whether the independent variables in the model are able to explain the dependent variable
• Residual standard error: sqrt(sum((orig_y - mod_y)^2) / (n - 2))
• Degrees of freedom: sample size - number of estimated parameters in the model
• R-squared: R²
• Adjusted R-squared: takes the number of variables in the model into account; used when comparing regression models with different numbers of variables
How to interpret a p-value:
• < 0.01: very significant (with > 99% probability)
• < 0.05: significant (with > 95% probability)
• > 0.05: not significant
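A minimal sketch (ba_model is the hypothetical model from the previous example) of reading the same quantities programmatically:

  s <- summary(ba_model)
  coef(s)            # estimates, standard errors, t values, p-values
  s$r.squared        # R-squared
  s$adj.r.squared    # adjusted R-squared
  s$sigma            # residual standard error
  # Residual standard error by hand (n - 2 holds for one predictor plus intercept):
  n <- length(resid(ba_model))
  sqrt(sum(resid(ba_model)^2) / (n - 2))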
Residuals
• Important to check after model fitting
• Residuals: measured Y - modeled Y
Interpreting Residual Plots
[Figure: example residual plots, labelled: residuals should look like this; variable transformation required; outliers; non-constant variance and outliers; variable Xj should be included in the model]
[1] From: VANCLAY, J. 1994. Modelling Forest Growth and Yield: Applications to Mixed Tropical Forests. CAB International.
BLAS MOLA’s SLIDES
Residuals: Y_measured - Y_modeled
If the model is good, the residuals should
• be homoscedastic, i.e. no trend with x should be present in the residuals
• follow a normal distribution
• The R command plot.lm(your_model) can be used for examining residuals:
  • Upper figure: the residuals should be evenly distributed around the 0-line. In the example figure, however, there seems to be a decreasing trend in the residuals -> not good.
  • Lower figure: all the residuals would lie on the straight line if they followed a normal distribution -> in the example figure they do not seem to follow a normal distribution completely.
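A minimal sketch of these checks, using the hypothetical ba_model from earlier:

  plot(ba_model)                             # built-in diagnostic plots (plot.lm)
  # Or build the two plots described above by hand:
  r <- resid(ba_model)
  plot(fitted(ba_model), r); abline(h = 0)   # residuals vs fitted: no trend expected
  qqnorm(r); qqline(r)                       # normal Q-Q plot: points near the line if residuals are normal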
EXAMPLE
Exercises in GROUPS: which is the best model? Which is the worst? WHY?
[Figure: scatter plots of h1, h2 and h3 with fitted lines: h2: y = 0.8875x - 6.8854, R² = 0.845; the other two panels: y = 0.8434x + 3.5693, R² = 0.4441 and y = 1.0691x - 0.6607, R² = 0.9116]
R examples
Multiple regression:
• lm(volume ~ height + diameter + basal_area)
Using dummy variables (categorical, e.g. species, forest type etc. categories):
• lm(volume ~ height + factor(tree_species))
Total volume as a function of H
Total volume as a function of H and BA
Total volume as a function of H, BA and forest type (dummy)
Interpretation of the output if a dummy variable is used (forest types 1-7 present):
• Forest type 1 is the "base" category (no multipliers).
• If the forest type is 2 -> the factor(a$FOREST_TYPE)2 coefficient is 1 and is multiplied by the estimate value 13.097745; all other forest type coefficients are then 0.
• Similarly for the other forest types.
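A minimal sketch of the dummy-variable model, assuming a data frame a with columns TOTAL_VOLUME, H, BA and FOREST_TYPE (H is an assumed column name; the others appear in the earlier examples):

  vol_model <- lm(TOTAL_VOLUME ~ H + BA + factor(FOREST_TYPE), data = a)
  summary(vol_model)
  # Forest type 1 is the base level; each factor(FOREST_TYPE)k coefficient is
  # added to the prediction only for observations with FOREST_TYPE == k.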
• Interpret these R summaries of the model fits.
• Write down the equations (y = a + b*x) of both models.
• Which model is better?
• Are the intercept and slope significant in both models?
• Are both models capable of estimating the desired variable?
• What else would you need to check when considering the model goodness?