Modeling in R
Sanna Härkönen
Model fitting: simple linear model
[Figure: scatter plots against H (m) with fitted trendlines: D (cm) vs H, y = 0.4009x + 15.519, R² = 0.1356; D (cm) vs H, y = 1.5619x - 8.4348, R² = 0.8998; N trees/ha vs H, y = -67.646x + 1818.8, R² = 0.7707]
• Important measures:
  • Correlation r
  • Coefficient of determination R²
  • p-values
  • Residuals (examining their distribution)
PEARSON CORRELATION r
• Measures the linear relationship between two variables
• Even if the correlation is low, there can be a strong (non-linear) relationship between the variables
• Can be positive or negative depending on the relationship (range -1..1)
• Equation: r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² )
EXAMPLE: SAME CORRELATION (0.816), BUT DIFFERENT RELATIONSHIP
A linear fit is OK in only one of the four panels (Anscombe's quartet):
http://en.wikipedia.org/wiki/File:Anscombe%27s_quartet_3.svg
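This can be reproduced directly in R with the built-in anscombe data set (not part of the original slides): all four x/y pairs have correlation of about 0.816, but only one relationship is genuinely linear.

  data(anscombe)
  # correlation of each pair x1..x4 with y1..y4: all roughly 0.816
  sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
  plot(anscombe$x2, anscombe$y2)   # clearly curved even though r = 0.816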
REGRESSION ANALYSIS
• Examines the relationships between variables
• Dependent variable: the variable that is explained by the independent variable(s)
• Coefficient of determination R² = r², where r is the correlation
• For example, if D is expressed as a function of H -> D is the dependent and H the independent variable.
[Figure: scatter plot of D against H]
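A minimal sketch (assuming a data frame a with columns H and D, as in the later examples) showing that R² is the squared correlation in simple linear regression:

  r <- cor(a$H, a$D)
  r^2                                      # coefficient of determination
  summary(lm(D ~ H, data = a))$r.squared   # same value as r^2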
SIMPLE LINEAR REGRESSION
• Fits a linear regression line between two variables.
• y = β0 + β1*x + ε
• (y is the dependent (= response) variable, x is the independent (= predictor) variable, β0 is the constant, β1 is the slope and ε is the random error)
• Method: least squares regression, where the regression line is fitted so that the sum of squared model residuals, Σ(measured y - modeled y)², is minimized
[Figure: example scatter plot of Y against X with fitted line y = 1.2x + 1.2, R² = 0.6923]
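As a minimal illustration (not from the slides; the data values are made up), the least-squares slope and intercept can be computed from their closed-form formulas and compared with lm():

  x <- c(1, 2, 3, 4, 5)            # hypothetical predictor values
  y <- c(2.1, 3.9, 5.2, 6.8, 7.1)  # hypothetical response values
  b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
  b0 <- mean(y) - b1 * mean(x)                                     # intercept
  fit <- lm(y ~ x)
  coef(fit)          # same intercept and slope as b0 and b1
  sum(resid(fit)^2)  # the sum of squared residuals that least squares minimizes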
INTERPRETATION: r and R²
• |r| = 0.0 (R² = 0.0): relationship non-significant
• |r| = 0.4 (R² = 0.16): moderate
• |r| = 0.6 (R² = 0.36): remarkable
• |r| = 0.8 (R² = 0.64): strong
• |r| = 1 (R² = 1)
[Figure: two D (cm) vs H (m) scatter plots with fitted lines y = 0.4009x + 15.519 (R² = 0.1356) and y = 1.5619x - 8.4348 (R² = 0.8998)]
In the plot with R² = 0.1356, H explains ~14% of the variation in D: poor fit. In the plot with R² = 0.8998, H explains ~90% of the variation in D: very good fit.
FITTING A SIMPLE LINEAR MODEL
1. Import the data to R (command read.csv())
2. Examine summary statistics of your variables (summary() command in R)
3. Examine the relationships of the variables by plotting them (plot() command in R)
4. If you see a linear relationship between the dependent variable and the explanatory variables -> you can fit a linear model
   • If the relationship is not linear, you can try to linearize it first by transforming the variable(s) (e.g. logarithm, exponential, ...) and then apply linear regression to the transformed values
5. Fit the linear model in R: command lm(y~x), where y is the dependent and x the independent variable
6. Examine the results of the regression (significance of variables, R² etc.) using the summary() command
7. Examine the residuals
(A minimal R sketch of this workflow is shown after the figure below.)
[Figure: linear relationship; non-linear relationship of X and Y; linear relationship of X and exp(Y)]
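A minimal sketch of the steps above, assuming the data come from a hypothetical file trees.csv and using the column names D and TOTAL_VOLUME that appear in the later slides:

  a <- read.csv("trees.csv")              # 1. import the data (file name is an assumption)
  summary(a)                              # 2. summary statistics of the variables
  plot(a$D, a$TOTAL_VOLUME)               # 3. plot the relationship
  fit <- lm(TOTAL_VOLUME ~ D, data = a)   # 5. fit the linear model
  summary(fit)                            # 6. coefficients, p-values, R-squared
  plot(fit)                               # 7. residual diagnostics
  # If the relationship is not linear (step 4), transform first, e.g.:
  # fit_log <- lm(log(TOTAL_VOLUME) ~ D, data = a)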
Summary statistics
• Dataset "a": summary(a)
Plotting
• plot(a$D, a$TOTAL_VOLUME)
• plot(a$BA, a$TOTAL_VOLUME)
Need for linearizing?
R example: BUILDING A LINEAR MODEL in R
• Building a linear model for basal area (BA1) as a function of height (H1)
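A minimal sketch, assuming the data frame a contains the columns BA1 and H1:

  ba_model <- lm(BA1 ~ H1, data = a)   # basal area as a function of height
  summary(ba_model)                    # intercept, slope, p-values, R-squared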
RESULTS OF REGRESSION ANALYSIS: R
• Summary statistics of the residuals (= original_y - modeled_y)
• Intercept and slope of the model -> Y = 0.126937 + 0.117584 X
• Standard error of the estimates
• t-test values (estimate/SE) and their p-values: show whether the variable is significant at a given significance level
• F-test value and its p-value: show whether the independent variables in the model are able to explain the dependent variable
• Residual standard error: sqrt(sum((orig_y - mod_y)^2) / (n - 2))
• Degrees of freedom: sample size - number of estimated parameters in the model
• R-squared: R²
• Adjusted R-squared: takes the number of variables in the model into account; used when comparing regression models with different numbers of variables
How to interpret a p-value:
• < 0.01: very significant (with > 99% probability)
• < 0.05: significant (with > 95% probability)
• > 0.05: not significant
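A minimal sketch (ba_model is the hypothetical model from the previous example) of reading the same quantities programmatically:

  s <- summary(ba_model)
  coef(s)            # estimates, standard errors, t values, p-values
  s$r.squared        # R-squared
  s$adj.r.squared    # adjusted R-squared
  s$sigma            # residual standard error
  # Residual standard error by hand (n - 2 holds for one predictor plus intercept):
  n <- length(resid(ba_model))
  sqrt(sum(resid(ba_model)^2) / (n - 2))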
Residuals
• Important to check after model fitting
• Residuals: measured Y - modeled Y
Interpreting Residual Plots
[Figure: example residual plots, labelled: residuals should look like this; variable transformation required; outliers; non-constant variance and outliers; variable Xj should be included in the model]
[1] From: VANCLAY, J. 1994. Modelling Forest Growth and Yield: Applications to Mixed Tropical Forests. CAB International.
BLAS MOLA’s SLIDES
Residuals: Y_measured - Y_modeled
If the model is good, the residuals should
• be homoscedastic, i.e. no trend with x should be present in the residuals
• follow a normal distribution
• The R command plot.lm(your_model) can be used for examining residuals:
  • Upper figure: the residuals should be evenly distributed around the 0-line. In the example figure, however, there seems to be a decreasing trend in the residuals -> not good.
  • Lower figure: all the residuals would lie on the straight line if they followed a normal distribution -> in the example figure they do not seem to follow a normal distribution completely.
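A minimal sketch of these checks, using the hypothetical ba_model from earlier:

  plot(ba_model)                             # built-in diagnostic plots (plot.lm)
  # Or build the two plots described above by hand:
  r <- resid(ba_model)
  plot(fitted(ba_model), r); abline(h = 0)   # residuals vs fitted: no trend expected
  qqnorm(r); qqline(r)                       # normal Q-Q plot: points near the line if residuals are normal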
EXAMPLE
Exercises in GROUPS: which is the best model? Which is the worst? WHY?
[Figure: scatter plots of h1, h2 and h3 with fitted lines: h2: y = 0.8875x - 6.8854, R² = 0.845; the other two panels: y = 0.8434x + 3.5693, R² = 0.4441 and y = 1.0691x - 0.6607, R² = 0.9116]
R examples
Multiple regression:
• lm(volume ~ height + diameter + basal_area)
Using dummy variables (categorical, e.g. species, forest type etc. categories):
• lm(volume ~ height + factor(tree_species))
Total volume as a function of H
Total volume as a function of H and BA
Total volume as a function of H, BA and forest type (dummy)
Interpretation of the output if a dummy variable is used (forest types 1-7 present):
• Forest type 1 is the "base" category (no multipliers).
• If the forest type is 2 -> the factor(a$FOREST_TYPE)2 coefficient is 1 and is multiplied by the estimate value 13.097745; all other forest type coefficients are then 0.
• Similarly for the other forest types.
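A minimal sketch of the dummy-variable model, assuming a data frame a with columns TOTAL_VOLUME, H, BA and FOREST_TYPE (H is an assumed column name; the others appear in the earlier examples):

  vol_model <- lm(TOTAL_VOLUME ~ H + BA + factor(FOREST_TYPE), data = a)
  summary(vol_model)
  # Forest type 1 is the base level; each factor(FOREST_TYPE)k coefficient is
  # added to the prediction only for observations with FOREST_TYPE == k.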
• Interpret these R summaries of the model fits.
• Write down the equations (y = a + b*x) of both models.
• Which model is better?
• Are the intercept and slope significant in both models?
• Are both models capable of estimating the desired variable?
• What else would you need to check when considering the model goodness?