Week 11: Regression Models and Inference

Generalising from data
Data from 15 lakes in central Ontario.
Zinc concentrations in the aquatic plant Eriocaulon septangulare (mg per g dry weight) and zinc concentrations in the lake sediment (mg per g dry weight).

Generalising from data
No interest in the specific lakes.
How are plant and sediment Zn related in general?
How accurately can you predict plant Zn from sediment Zn?

Model for regression data
The sample ‘represents’ a larger ‘population’.
Distinguish between the regression lines for the sample and the population.
The sample regression line (least squares) is an estimate of the population regression line.
How do you model randomness in the sample?

Sample regression line (revision)
Least squares line: $\hat{y} = b_0 + b_1 x$
$b_0$: intercept, the predicted y when x = 0.
$b_1$: slope, the increase (or decrease) expected for y when x increases by one unit.
$\hat{y}$: predicted y or estimated y.
$\hat{y}_i$: fitted value for the ith individual.

Height and handspan
Heights (inches) and handspans (cm) of 167 college students.
Handspan = -3 + 0.35 Height
Handspan increases by 0.35 cm, on average, for each increase of 1 inch in height.

Residuals (revision)
$e_i = y_i - \hat{y}_i$
Vertical distance from the data point to the LS line.
Person 70 in tall with handspan 23 cm:
$\hat{y} = -3 + 0.35(70) = 21.5$
residual $= y_i - \hat{y}_i = 23 - 21.5 = 1.5$ cm

Model: population regression line
$E(Y) = \beta_0 + \beta_1 x$
$E(Y)$: mean or expected value of y for individuals in the population who all have the same x.
$\beta_0$: intercept of the line in the population.
$\beta_1$: slope of the line in the population.
$\beta_1 = 0$ means no linear relationship.
$\beta_0$ and $\beta_1$ are estimated by the sample LS values $b_0$ and $b_1$.
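The fitted value and residual in the handspan example can be checked with a short Python sketch. The function names here are illustrative, not from the slides, and the three points passed to `least_squares` are made up to lie exactly on the line Handspan = -3 + 0.35 Height; they are not the students' data.

```python
def fitted_value(b0, b1, x):
    """Predicted y on the least squares line: y-hat = b0 + b1 * x."""
    return b0 + b1 * x


def residual(y, y_hat):
    """Vertical distance from the data point to the line: e = y - y-hat."""
    return y - y_hat


def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line through the points (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Person 70 in tall with handspan 23 cm, using the line from the slides.
y_hat = fitted_value(-3.0, 0.35, 70)
e = residual(23.0, y_hat)
print(round(y_hat, 2), round(e, 2))

# Made-up points lying exactly on the line, so the fit should recover
# an intercept of -3 and a slope of 0.35.
b0, b1 = least_squares([60, 70, 80], [18.0, 21.5, 25.0])
print(round(b0, 2), round(b1, 2))
```

With real data the points scatter around the line, so the residuals are non-zero and `least_squares` returns the line that minimises their sum of squares.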
Model: distribution of ‘errors’
Error = vertical distance of a value from the population regression line
error $= Y - (\beta_0 + \beta_1 x)$
Assume the errors all have normal(0, $\sigma$) distributions.
Constant standard deviation.

Linear regression model
y = Mean + Error
Error is the population equivalent of the residual.
Error is called “Deviation” in the textbook.
$Y = \beta_0 + \beta_1 x + \text{Error}$
Error distribution ~ normal(0, $\sigma$)

Understanding parameters

Model assumptions
Linear relationship: no curvature, no outliers.
Constant error standard deviation.
Normal errors.

Checking assumptions
Data should be in a symmetric band of constant width round a straight line.
Example: prices of Mazda cars in a Melbourne paper.

Transformations
Transformation of Y (or X) may help.
Model regression line: $\log(\text{price}) = \beta_0 + \beta_1 \, \text{age}$

Parameter estimates
The least squares estimates, $b_0$ and $b_1$, are estimates of $\beta_0$ and $\beta_1$.
The best estimate of the error s.d., $\sigma$, is
$s = \sqrt{\dfrac{\text{Sum of Squared Residuals}}{n-2}} = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}}$
‘Typical’ size of residuals.

Minitab estimates
Data: x = height (inches), y = weight (pounds) of n = 43 male students.
Standard deviation s = 24.00 (pounds): roughly measures, for any given height, the general size of the deviations of individual weights from the mean weight for the height.

Interpreting s
About 95% of crosses lie in a band ± 2s on each side of the least squares line.
s = 24, so the band is ± 48.

Inference about regression slope
The regression slope, $\beta_1$, is usually the most important parameter.
Expected increase in Y for a unit increase in x.
Point estimate is the LS slope, $b_1$.
How variable? What is the std error of the estimate?
$\text{s.e.}(b_1) = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$ where $s = \sqrt{\dfrac{SSE}{n-2}}$

Inference: 95% C.I. for slope
Same pattern as earlier C.I.s: estimate ± $t^*$ × std. error
Value of $t^*$: approx 2 for large n, bigger for small n.
Use t-tables with (n – 2) degrees of freedom.

Example
Driver age and maximum legibility distance of a new highway sign.
Average Distance = 577 – 3.01 × Age

95% C.I. from Minitab
Point estimate: reading distance decreases by 3.01 ft per year of age.
n = 30 points
95% confidence interval: $t^*$ (28 d.f.) = 2.05
$b_1 \pm t^* \times \text{s.e.}(b_1) = -3.01 \pm 2.05 \times 0.4243 = -3.01 \pm 0.87$, i.e. –3.88 to –2.14 ft

Interpretation
With 95% confidence, we estimate that, in the population of drivers represented by this sample, the mean sign-reading distance decreases by between 2.14 and 3.88 ft per 1-year increase in age.

Importance of zero slope
$Y = \beta_0 + \beta_1 x + \text{Error}$
If the slope is $\beta_1 = 0$, Y is normal with mean $\beta_0$ and standard deviation $\sigma$.
The response distribution does not depend on x.
It is therefore important to test whether $\beta_1 = 0$.

Test for zero slope
Hypotheses: $H_0$: $\beta_1 = 0$, $H_A$: $\beta_1 \neq 0$
Test statistic: $t = \dfrac{\text{estimate} - \text{null value}}{\text{std error}} = \dfrac{b_1 - 0}{\text{s.e.}(b_1)}$
p-value: tail area of the t-distribution (n – 2 d.f.)

Minitab: Age vs reading distance
$t = \dfrac{b_1 - 0}{\text{s.e.}(b_1)} = \dfrac{-3.0068 - 0}{0.4243} = -7.09$ and p-value = 0.000
The probability is virtually 0 that the observed slope could be as far from 0 or farther if there was no linear relationship in the population.
Extremely strong evidence that distance and age are related.

Testing zero correlation
$H_0$: $\rho = 0$ (x and y are not correlated.)
$H_A$: $\rho \neq 0$ (x and y are correlated.)
where $\rho$ = population correlation.
Same test as for zero regression slope.
Can be performed even when a regression relationship makes no sense, e.g. leaf length and width.

Significance and Importance
With very large n, weak relationships (low correlation) can be statistically significant.
Moral: with a large sample size, saying two variables are significantly related may only mean the correlation is not precisely 0.
Look at a scatterplot of the data and examine the correlation coefficient, r.

Prediction of new Y at x
If you knew the values of $\beta_0$, $\beta_1$ and $\sigma$:
$\hat{y} = \beta_0 + \beta_1 x$
Prediction error ~ normal(0, $\sigma$)
The new value has s.d. $\sigma$.
95% prediction interval: $\beta_0 + \beta_1 x \pm 1.96\,\sigma$

Prediction of new Y at x
In practice, you must use estimates: $\hat{y} = b_0 + b_1 x$
The prediction error has two components.
The new value still has s.d. $\sigma$, estimated by $s$.
Also, the prediction itself is random:
$\text{s.d.}(\text{prediction}) = s \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
Combining these,
$\text{s.d.}(\text{prediction error}) = \sqrt{s^2 + \text{s.d.}(\text{prediction})^2}$

Prediction of new Y at x
Prediction interval: $\hat{y} \pm t^* \sqrt{s^2 + \text{s.d.}(\text{prediction})^2}$
$\text{s.d.}(\text{prediction})^2 = s^2 \left( \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)$
$t^*$ is from t tables with (n – 2) d.f.
Narrowest when x is near $\bar{x}$.

Reading distance and age
Minitab output: 95% confident that a 21-year-old will read the sign between 407 and 620 ft.

Estimating mean Y at x
Different from estimating a new individual’s Y.
Only takes into account the variability in $\hat{y}$:
$\text{s.d.}(\text{prediction}) = s \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
95% CI for mean Y at x: $\hat{y} \pm t^* \times \text{s.d.}(\text{prediction})$
$t^*$ is from t tables with (n – 2) d.f.

Height and weight
95% CI: for the average of all college men of height x.
95% PI: for one new college man of height x.
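The slope interval, the test statistic, and the two kinds of interval at a given x can be reproduced with a small Python sketch. It uses the figures quoted from Minitab for the driver-age example (slope –3.01, or –3.0068 unrounded, std error 0.4243, t* = 2.05 with 28 d.f.). The function names are illustrative, and the values of `y_hat` and `sxx` in the last part are hypothetical placeholders, not values from the height and weight data.

```python
import math


def slope_confidence_interval(b1, se_b1, t_star):
    """Confidence interval for the slope: b1 +/- t* x s.e.(b1)."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin


def slope_t_statistic(b1, se_b1, null_value=0.0):
    """Test statistic: (estimate - null value) / std error."""
    return (b1 - null_value) / se_b1


def sd_of_prediction(s, n, x, x_bar, sxx):
    """s.d. of y-hat at x, where sxx is the sum of (x_i - x_bar)^2."""
    return s * math.sqrt(1.0 / n + (x - x_bar) ** 2 / sxx)


def mean_confidence_interval(y_hat, t_star, sd_pred):
    """CI for the mean of Y at x: y-hat +/- t* x s.d.(prediction)."""
    margin = t_star * sd_pred
    return y_hat - margin, y_hat + margin


def prediction_interval(y_hat, t_star, s, sd_pred):
    """PI for a new Y at x: y-hat +/- t* x sqrt(s^2 + s.d.(prediction)^2)."""
    margin = t_star * math.sqrt(s ** 2 + sd_pred ** 2)
    return y_hat - margin, y_hat + margin


# Driver age and sign-reading distance, numbers as quoted from Minitab.
low, high = slope_confidence_interval(-3.01, 0.4243, 2.05)
t = slope_t_statistic(-3.0068, 0.4243)
print(round(low, 2), round(high, 2), round(t, 2))

# Hypothetical illustration: s = 24 and n = 43 match the height and weight
# slide, but y_hat = 170 and sxx = 100 are made-up placeholder values.
sd_pred = sd_of_prediction(s=24.0, n=43, x=70.0, x_bar=70.0, sxx=100.0)
ci = mean_confidence_interval(170.0, 2.0, sd_pred)
pi = prediction_interval(170.0, 2.0, 24.0, sd_pred)
print(ci, pi)
```

The prediction interval is always wider than the confidence interval for the mean at the same x, because it adds the variance of a new individual, s², to the variance of the fitted value.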