Download Week 7 - Massey University

Week 11: Regression Models and Inference

Generalising from data
- Data from 15 lakes in central Ontario.
- Zinc concentrations in the aquatic plant Eriocaulon septangulare (mg per g dry weight) and in the lake sediment (mg per g dry weight).

Generalising from data
- No interest in the specific lakes.
- How are plant and sediment Zn related in general?
- How accurately can you predict plant Zn from sediment Zn?

Model for regression data
- The sample 'represents' a larger 'population'.
- Distinguish between the regression lines for the sample and for the population.
- The sample regression line (least squares) is an estimate of the population regression line.
- How do you model randomness in the sample?

Sample regression line (revision)
Least squares line: $\hat{y} = b_0 + b_1 x$
- $b_0$: intercept, the predicted y when x = 0.
- $b_1$: slope, the increase (or decrease) expected for y when x increases by one unit.
- $\hat{y}$: predicted y or estimated y.
- $\hat{y}_i$: fitted value for the ith individual.

Height and handspan
- Heights (inches) and handspans (cm) of 167 college students.
- Handspan = -3 + 0.35 Height
- Handspan increases by 0.35 cm, on average, for each increase of 1 inch in height.

Residuals (revision)
$e_i = y_i - \hat{y}_i$
- Vertical distance from the data point to the least squares line.
- Person 70 in tall with handspan 23 cm: $\hat{y} = -3 + 0.35(70) = 21.5$, so residual $= y_i - \hat{y}_i = 23 - 21.5 = 1.5$ cm.

Model: population regression line
$E(Y) = \beta_0 + \beta_1 x$
- $E(Y)$: mean or expected value of y for individuals in the population who all have the same x.
- $\beta_0$: intercept of the line in the population.
- $\beta_1$: slope of the line in the population.
- $\beta_1 = 0$ means no linear relationship.
- $\beta_0$ and $\beta_1$ are estimated by the sample least squares values $b_0$ and $b_1$.
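The sample least squares line and the residual calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used in the course software: the function names (`least_squares`, `fitted_value`, `residual`) and the toy data are assumptions made for this example. Only the handspan line (intercept -3, slope 0.35) and the 70 in / 23 cm person come from the slides.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)                       # sum of squared x deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # sum of cross products
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def fitted_value(b0, b1, x):
    """Predicted y on the line at the given x."""
    return b0 + b1 * x


def residual(y, y_hat):
    """Vertical distance from the observed y to the fitted value."""
    return y - y_hat


# Handspan example from the slides: Handspan = -3 + 0.35 Height.
y_hat = fitted_value(-3.0, 0.35, 70)
e = residual(23, y_hat)
print(f"fitted = {y_hat:.1f} cm, residual = {e:.1f} cm")  # fitted = 21.5 cm, residual = 1.5 cm

# Made-up toy data lying exactly on y = -1 + 2x (illustration only, not course data).
toy_b0, toy_b1 = least_squares([1, 2, 3, 4], [1, 3, 5, 7])
```

Because the toy points lie exactly on a line, the fit recovers intercept -1 and slope 2, which gives a quick sanity check of the formulas.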
Model: distribution of 'errors'
- Error = vertical distance of a value from the population regression line: $\varepsilon = Y - (\beta_0 + \beta_1 x)$
- Assume the errors all have normal(0, $\sigma$) distributions.
- Constant standard deviation.

Linear regression model
y = Mean + Error
- Error is the population equivalent of a residual.
- Error is called "Deviation" in the textbook.
$Y = \beta_0 + \beta_1 x + \varepsilon$
Error distribution: $\varepsilon \sim$ normal(0, $\sigma$)

[Figure: Understanding parameters]

Model assumptions
- Linear relationship
  - No curvature
  - No outliers
- Constant error standard deviation
- Normal errors

Checking assumptions
- Data should be in a symmetric band of constant width round a straight line.
- Example: prices of Mazda cars in a Melbourne paper.

Transformations
- Transformation of Y (or X) may help.
- Model regression line: $\log(\text{price}) = \beta_0 + \beta_1 \times \text{age}$

Parameter estimates
- The least squares estimates $b_0$ and $b_1$ are estimates of $\beta_0$ and $\beta_1$.
- The best estimate of the error s.d., $\sigma$, is
$$s = \sqrt{\frac{\text{Sum of Squared Residuals}}{n-2}} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$
- 'Typical' size of residuals.

Minitab estimates
- Data: x = height (inches), y = weight (pounds) of n = 43 male students.
- Standard deviation s = 24.00 (pounds): roughly measures, for any given height, the general size of the deviations of individual weights from the mean weight for that height.

Interpreting s
- About 95% of crosses lie in a band ± 2s on each side of the least squares line.
- s = 24, so the band is ± 48.

Inference about regression slope
- The regression slope, $\beta_1$, is usually the most important parameter.
- Expected increase in Y for a unit increase in x.
- The point estimate is the least squares slope, $b_1$.
- How variable is it? What is the standard error of the estimate?
$$\text{s.e.}(b_1) = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} \quad \text{where} \quad s = \sqrt{\frac{SSE}{n-2}}$$

Inference: 95% C.I. for slope
- Same pattern as earlier C.I.s: estimate ± t* × std. error
- Value of t*:
  - approx 2 for large n
  - bigger for small n
  - use t-tables with (n − 2) degrees of freedom

Example
- Driver age and maximum legibility distance of a new highway sign.
- Average Distance = 577 − 3.01 × Age

95% C.I. from Minitab
- Point estimate: reading distance decreases by 3.01 ft per year of age.
- n = 30 points.
- 95% confidence interval, with t* = 2.05 (28 d.f.):
$$b_1 \pm t^* \times \text{s.e.}(b_1) = -3.01 \pm 2.05 \times 0.4243 = -3.01 \pm 0.87 = -3.88 \text{ to } -2.14 \text{ ft}$$

Interpretation
With 95% confidence, we estimate that in the population of drivers represented by this sample, the mean sign-reading distance decreases by between 2.14 and 3.88 ft per 1-year increase in age.

Importance of zero slope
$Y = \beta_0 + \beta_1 x + \varepsilon$
- If the slope is $\beta_1 = 0$, Y is normal with mean $\beta_0$ and standard deviation $\sigma$.
- The response distribution then does not depend on x.
- It is therefore important to test whether $\beta_1 = 0$.

Test for zero slope
- Hypotheses: $H_0: \beta_1 = 0$, $H_A: \beta_1 \neq 0$
- Test statistic:
$$t = \frac{\text{estimate} - \text{null value}}{\text{std error}} = \frac{b_1 - 0}{\text{s.e.}(b_1)}$$
- p-value: tail area of the t-distribution (n − 2 d.f.).

Minitab: Age vs reading distance
$$t = \frac{b_1 - 0}{\text{s.e.}(b_1)} = \frac{-3.0068 - 0}{0.4243} = -7.09 \quad \text{and p-value} \approx 0.000$$
- The probability is virtually 0 that the observed slope could be as far from 0 or farther if there were no linear relationship in the population.
- Extremely strong evidence that distance and age are related.

Testing zero correlation
- $H_0: \rho = 0$ (x and y are not correlated.)
- $H_A: \rho \neq 0$ (x and y are correlated.)
- where $\rho$ = population correlation.
- Same test as for zero regression slope.
- Can be performed even when a regression relationship makes no sense, e.g. leaf length and width.

Significance and Importance
- With very large n, weak relationships (low correlation) can be statistically significant.
- Moral: with a large sample size, saying two variables are significantly related may only mean the correlation is not precisely 0. Look at a scatterplot of the data and examine the correlation coefficient, r.

Prediction of new Y at x
If you knew the values of $\beta_0$, $\beta_1$ and $\sigma$:
$\hat{y} = \beta_0 + \beta_1 x$
- Prediction error: $\varepsilon \sim$ normal(0, $\sigma$)
- The new value has s.d. $\sigma$.
- 95% prediction interval: $\beta_0 + \beta_1 x \pm 1.96\sigma$

Prediction of new Y at x
In practice, you must use estimates: $\hat{y} = b_0 + b_1 x$
- The prediction error has two components:
  - The new value still has s.d. $\sigma$, estimated by $s$.
  - Also, the prediction itself is random:
$$\text{s.d.}(\text{prediction}) = s\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
- Combining these,
$$\text{s.d.}(\text{prediction error}) = \sqrt{s^2 + \left[\text{s.d.}(\text{prediction})\right]^2}$$

Prediction of new Y at x
- Prediction interval:
$$\hat{y} \pm t^* \sqrt{s^2 + \left[\text{s.d.}(\text{prediction})\right]^2} \quad \text{where} \quad \text{s.d.}(\text{prediction}) = s\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
- t* is from t tables with (n − 2) d.f.
- Narrowest when x is near $\bar{x}$.

Reading distance and age
- Minitab output.
- We are 95% confident that a 21-year-old will read the sign at between 407 and 620 ft.

Estimating mean Y at x
- Different from estimating a new individual's Y.
- Only takes into account the variability in $\hat{y}$:
$$\text{s.d.}(\text{prediction}) = s\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
- 95% CI for mean Y at x: $\hat{y} \pm t^* \times \text{s.d.}(\text{prediction})$, where t* is from t tables with (n − 2) d.f.

Height and weight
- 95% CI: for the average of all college men of height x.
- 95% PI: for one new college man of height x.
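The interval and test formulas above can be checked numerically with a short Python sketch. It is an illustration under stated assumptions, not a reproduction of the Minitab output. The function names are invented for this example, and t* is passed in as a number read from t tables, as on the slides. The slope values (-3.0068, 0.4243, t* = 2.05) are quoted from the age and reading distance example. In the second part only s = 24 and n = 43 come from the slides; the fitted value, the mean height, the sum of squares and t* = 2.02 are hypothetical.

```python
import math


def slope_ci(b1, se_b1, t_star):
    """Confidence interval for the slope: estimate ± t* × standard error."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin


def slope_t_statistic(b1, se_b1, null_value=0.0):
    """t statistic for H0: slope = null_value."""
    return (b1 - null_value) / se_b1


def sd_prediction(s, n, x, x_bar, sxx):
    """s.d. of the fitted value at x; sxx is the sum of (x_i - x_bar)^2."""
    return s * math.sqrt(1.0 / n + (x - x_bar) ** 2 / sxx)


def mean_ci(y_hat, sd_pred, t_star):
    """Confidence interval for the mean of Y at x."""
    margin = t_star * sd_pred
    return y_hat - margin, y_hat + margin


def prediction_interval(y_hat, s, sd_pred, t_star):
    """Prediction interval for one new Y at x (adds s^2 for the new value)."""
    margin = t_star * math.sqrt(s ** 2 + sd_pred ** 2)
    return y_hat - margin, y_hat + margin


# Slope inference with the values quoted on the slides (n = 30, 28 d.f.).
ci_low, ci_high = slope_ci(-3.0068, 0.4243, 2.05)
t_stat = slope_t_statistic(-3.0068, 0.4243)
print(f"95% CI for slope: {ci_low:.2f} to {ci_high:.2f}")  # -3.88 to -2.14
print(f"t = {t_stat:.2f}")                                 # t = -7.09

# HYPOTHETICAL inputs, for illustration only: s and n match the height/weight
# slide, but y_hat, x, x_bar, sxx and t_star are made up.
s, n = 24.0, 43
sd_pred = sd_prediction(s, n, x=72.0, x_bar=70.0, sxx=400.0)
ci = mean_ci(180.0, sd_pred, t_star=2.02)
pi = prediction_interval(180.0, s, sd_pred, t_star=2.02)
print(f"CI width = {ci[1] - ci[0]:.1f}, PI width = {pi[1] - pi[0]:.1f}")
```

With any inputs the prediction interval comes out wider than the confidence interval for the mean, because it adds the s² term for the new individual's own variability.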
