Stat 112: Notes 2

• Today’s class: Section 3.3.
  – Full description of simple linear regression model.
  – Checking the assumptions of the simple linear regression model.
  – Inferences for simple linear regression model.

Wages and Education
• A random sample of 100 men (ages 18-70) was surveyed about their weekly wages in 1988 and their education (part of the 1988 March U.S. Current Population Survey) (in file wagedatasubset.JMP).
• How much more on average do men with one extra year of education make?
• If a man has a high school diploma but no further education, what’s the best prediction of his earnings?
• Regression addresses these two questions.

[Figure: Bivariate Fit of wage By educ. X = Education (educ), Y = Weekly Wage (wage).]

Simple Linear Regression Model

[Figure: Bivariate Fit of wage By educ, with Linear Fit.]

Linear Fit
wage = -89.74965 + 51.225264*educ

Summary of Fit
RSquare                   0.139941
RSquare Adj               0.131165
Root Mean Square Error    331.48

The mean of weekly wages is estimated to increase by $b_1 = 51.23$ dollars for each extra year of education. The typical size of the error from using a man’s education to predict his wages is about RMSE = 331.48 dollars.

Sample vs. Population
• We can view the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ as a sample from a population.
• Our goal is to learn about the relationship between X and Y in the population:
  – We don’t care about the particular 100 men sampled but about the population of US men ages 18-70.
  – From Notes 1, we don’t care about the relationship between tracks counted and the density of deer for the particular sample, but about the relationship in the population of all tracks; this enables us to predict in the future the density of deer from the number of tracks counted.

Simple Linear Regression Model
The simple linear regression model:
$E(Y_i \mid X = X_i) = \beta_0 + \beta_1 X_i$,
$Y_i = \beta_0 + \beta_1 X_i + e_i$, $e_i \sim N(0, \sigma_e^2)$.
The $e_i$ are called disturbances and represent the deviation of $Y_i$ from its mean given $X_i$.
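To make the model concrete, here is a minimal sketch that simulates data from $Y_i = \beta_0 + \beta_1 X_i + e_i$ and fits a least squares line. The parameter values (-20, 50, 330), the range of X, and the function names are illustrative assumptions, not values taken from the wage data.

```python
import random

def least_squares(xs, ys):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def simulate(n, beta0, beta1, sigma_e, rng):
    """Draw n points from the simple linear regression model."""
    xs = [rng.uniform(8, 18) for _ in range(n)]  # hypothetical X range
    ys = [beta0 + beta1 * x + rng.gauss(0, sigma_e) for x in xs]
    return xs, ys

rng = random.Random(0)
# Hypothetical population parameters, chosen only for illustration.
xs, ys = simulate(100, -20.0, 50.0, 330.0, rng)
b0, b1 = least_squares(xs, ys)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

The fitted b0 and b1 will generally differ from the values used to generate the data, because the disturbances move each sample away from the population line.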
The disturbances are estimated by the residuals $\hat{e}_i = Y_i - (b_0 + b_1 X_i)$.

Assumptions of the Simple Linear Regression Model
For each value of the explanatory variable X=x, there is a subpopulation of outcomes (responses) Y for units with X=x.
Assumptions of the simple linear regression model:
1. Linearity: The means of the subpopulations fall on a straight line function of the explanatory variable.
2. Constant variance: The subpopulation standard deviations are all equal (to $\sigma_e$).
3. Normality: The subpopulations are all normally distributed.
4. Independence: The selection of an outcome from any of the subpopulations is independent of the selection of any other outcomes.

Checking the Assumptions
Simple Linear Regression Model for Population: $Y_i = \beta_0 + \beta_1 x_i + e_i$.
Before making any inferences using the simple linear regression model, we need to check the assumptions. Based on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$:
1. We estimate $\beta_0$ and $\beta_1$ by the least squares estimates $b_0$ and $b_1$.
2. We estimate the disturbances $e_i$ by the residuals $\hat{e}_i = Y_i - \hat{E}(Y \mid X_i) = Y_i - (b_0 + b_1 X_i)$.
3. We check if the residuals approximately satisfy
   (1) Linearity: $E(\hat{e}_i) = 0$ over the whole range of $X_i$.
   (2) Constant Variance: $Var(\hat{e}_i)$ is constant over the whole range of $X_i$.
   (3) Normality: the $\hat{e}_i$ are approximately normally distributed.
   (4) Independence: the $\hat{e}_i$ are independent (only a concern for time series data).

Residual Plot
A useful tool for checking the assumptions is the residual plot.
Residual for observation i: $\hat{e}_i = y_i - \hat{E}(y_i \mid x_i) = y_i - (b_0 + b_1 x_i)$.
The residual plot is a plot of the residuals $\hat{e}_i$ versus $x_i$. It is constructed in JMP by fitting the least squares line, then clicking the red triangle next to Linear Fit and clicking Plot Residuals.

[Figure: Residual plot, Residual versus educ.]

Checking Linearity Assumption
To check if the linearity assumption holds (i.e., the model for the mean is correct), check if $E(\hat{e}_i)$ is zero for each range of $X_i$.
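A rough numerical companion to reading the residual plot is to average the residuals within a few ranges of X. The sketch below assumes a hypothetical helper `binned_residual_means` and uses made-up curved data (y = x^2) so that the violation is visible; it is not the wage data.

```python
def least_squares(xs, ys):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def residuals(xs, ys):
    """Residuals e_hat_i = y_i - (b0 + b1 * x_i) from the least squares fit."""
    b0, b1 = least_squares(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def binned_residual_means(xs, res, n_bins=3):
    """Mean residual within n_bins consecutive ranges of x (needs n >= n_bins)."""
    pairs = sorted(zip(xs, res))
    n = len(pairs)
    means = []
    for k in range(n_bins):
        chunk = pairs[k * n // n_bins:(k + 1) * n // n_bins]
        means.append(sum(r for _, r in chunk) / len(chunk))
    return means

# Made-up curved data: a straight line is the wrong model for the mean.
xs = list(range(10))
ys = [x ** 2 for x in xs]
res = residuals(xs, ys)
means = binned_residual_means(xs, res)
print([round(m, 2) for m in means])
```

For this curved example the bin means are positive at both ends and negative in the middle, which is the kind of pattern that signals a linearity violation. Bin means close to zero everywhere are consistent with linearity.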
[Figure: Residual plot, Residual versus educ.]

The linearity assumption appears reasonable, but it appears that very high education individuals and low education individuals earn more than expected (most residuals are positive). (We will consider a nonlinear model for this data in Chapter 5; for now we’ll assume linearity is okay.)

Violation of Linearity
For a sample of McDonald’s restaurants:
Y = Revenue of Restaurant
X = Mean Age of Children in Neighborhood of Restaurant

[Figure: Bivariate Fit of Revenue By Age, with the residual plot (Residual versus Age).]

The mean of the residuals is negative for small and large ages and positive for intermediate ages, so linearity appears to be violated (we will see what to do when linearity is violated in Chapter 5).

Checking Constant Variance
To check that the constant variance assumption holds, check that there is no pattern in the spread of the residuals as X varies.

[Figure: Residual plot, Residual versus educ.]

Constant variance appears reasonable.

Checking Normality
For checking normality, we can look at whether the overall distribution of the residuals looks approximately normal by making a histogram of the residuals. Save the residuals by clicking the red triangle next to Linear Fit after Fit Line and then clicking Save Residuals. Then click Analyze, Distribution and put the saved residuals column into Y, Columns. The histogram should be approximately bell shaped if the normality assumption holds.

[Figure: Distributions, histogram of Residuals wage.]

The residuals from the wage data have an approximately bell shaped histogram, although there is some indication of skewness to the right. The normality assumption seems roughly reasonable. We will look at more formal tools for assessing normality in Chapter 6.

Checking Assumptions
• It is important to check the assumptions of a regression model because the inferences depend on the assumptions approximately holding.
  The assumptions don’t need to hold exactly but only approximately.
• We will study more about checking assumptions and how to deal with violations of the assumptions in Chapters 5 and 6.

Inferences
Simple Linear Regression Model for Population:
$E(Y_i \mid X = X_i) = \beta_0 + \beta_1 X_i$,
$Y_i = \beta_0 + \beta_1 X_i + e_i$, $e_i \sim N(0, \sigma_e^2)$.
Data: $(X_1, Y_1), \ldots, (X_n, Y_n)$.
The least squares estimates $b_0$ and $b_1$ will typically not be exactly equal to the true $\beta_0$ and $\beta_1$.
Inferences: Draw conclusions about $\beta_0$ and $\beta_1$ based on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$.
1. Point Estimates: Best estimates of $\beta_0$ and $\beta_1$.
2. Confidence intervals: Ranges of plausible values of $\beta_0$ and $\beta_1$.
3. Hypothesis tests: Test whether it is plausible that $\beta_0$ and $\beta_1$ equal certain values.

Sampling Distribution of b0, b1
• The sampling distribution of $b_0, b_1$ describes the probability distribution of the estimates over repeated samples $(x_1, y_1), \ldots, (x_n, y_n)$ from the simple linear regression model.
• Understanding the sampling distribution is the key to drawing inferences from the sample to the population.

Sampling distribution in wage data
• To see how the least squares estimates can differ over different samples from the population, we consider the “population” to be all 25,632 men surveyed in the March 1988 Current Population Survey in wagedata1988.JMP and the samples to be random samples of size 100 like the one in wagedatasubset.JMP.

[Figure: “Population”: Bivariate Fit of wage By educ, with Linear Fit.]

Linear Fit
wage = -19.06983 + 50.414381*educ
$\beta_0 = -19.07$, $\beta_1 = 50.41$

Samples of wage data
• To take samples in JMP, click the Tables menu, then click Subset and then click the circle next to Random Sample Size and set the sample size. JMP will create a new data subset which is a random sample of the original data set.
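The same repeated-sampling idea can be sketched outside JMP. Since wagedata1988.JMP is not reproduced here, the code below builds a synthetic "population" of 25,632 (educ, wage) pairs from assumed parameters (-19, 50, disturbance SD 330); the function names and the seed are also assumptions made for illustration.

```python
import random
import statistics

def least_squares(xs, ys):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def sample_fits(population, sample_size, n_samples, rng):
    """Fit a least squares line to each of n_samples random subsets."""
    fits = []
    for _ in range(n_samples):
        sample = rng.sample(population, sample_size)  # without replacement
        xs = [x for x, _ in sample]
        ys = [y for _, y in sample]
        fits.append(least_squares(xs, ys))
    return fits

rng = random.Random(112)
# Synthetic stand-in for the population; all parameters are hypothetical.
population = []
for _ in range(25632):
    educ = rng.randint(6, 18)
    wage = -19.0 + 50.0 * educ + rng.gauss(0, 330.0)
    population.append((educ, wage))

fits = sample_fits(population, 100, 200, rng)
slopes = [b1 for _, b1 in fits]
print(f"mean of b1 over samples: {statistics.mean(slopes):.2f}")
print(f"SD of b1 over samples:   {statistics.stdev(slopes):.2f}")
```

The spread of the 200 slopes is an empirical picture of the sampling distribution of b1 for samples of size 100 from this synthetic population.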
[Figure: Sample 1 and Sample 2, each a Bivariate Fit of wage By educ with Linear Fit.]

Sample 1: Linear Fit: wage = -288.6577 + 71.530586*educ, so $b_0 = -288.66$, $b_1 = 71.53$.
Sample 2: Linear Fit: wage = 188.82961 + 38.453459*educ, so $b_0 = 188.83$, $b_1 = 38.45$.

Sampling distributions
• Only a sample, not the population, is usually available, so we need to understand the sampling distribution.
• Sampling distribution of $b_1$:
  – $E(b_1) = \beta_1$
  – $Var(b_1) = \dfrac{\sigma_e^2}{(n-1) s_x^2}$, where $s_x^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$.
  – The sampling distribution is normal.
  – Even if the normality assumption fails, the sampling distribution of $b_1$ is still approximately normal if n>30.

Properties of $b_0$ and $b_1$ as estimators of $\beta_0$ and $\beta_1$
• Unbiased Estimators: Mean of the sampling distribution is equal to the population parameter being estimated.
• Consistent Estimators: As the sample size n increases, the probability that the estimator will become as close as you specify to the true parameter converges to 1.
• Minimum Variance Estimator: The variance of the estimator $b_1$ is smaller than the variance of any other linear unbiased estimator of $\beta_1$, say $b_1^*$.

Confidence Intervals
• Point Estimate: $b_1$
• Confidence interval: range of plausible values for the true slope $\beta_1$
• $(1-\alpha)100\%$ Confidence Interval: $b_1 \pm t_{\alpha/2,\,n-2}\, s_{b_1}$, where $s_{b_1} = s_e \sqrt{\dfrac{1}{(n-1) s_x^2}}$ is an estimate of the standard deviation of $b_1$ ($s_e$ = RMSE). Typically we use a 95% CI.
• 95% CI is approximately $b_1 \pm 2 s_{b_1}$.
95% CIs for a parameter are usually approximately
point estimate ± 2*Standard Error(point estimate),
where the standard error of the point estimate is an estimate of the standard deviation of the point estimate.

Computing Confidence Interval with JMP
In the Fit Line output in JMP, information for computing the confidence interval for $\beta_1$ is given under Parameter Estimates.
Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    -89.74965    173.4267     -0.52      0.6060
educ         51.225264    12.82813     3.99       0.0001

Std Error of slope for educ = $s_{b_1}$.
Approximate 95% confidence interval for $\beta_1$: $b_1 \pm 2 s_{b_1} = 51.225 \pm 2 \times 12.828 = (25.57, 76.88)$.

The exact 95% confidence interval can be computed by moving the mouse to the Parameter Estimates, right clicking, clicking Columns and then clicking Lower 95% and Upper 95%.

Parameter Estimates
Term         Lower 95%    Upper 95%
Intercept    -433.9092    254.40995
educ         25.768251    76.682276

Exact 95% confidence interval for $\beta_1$: (25.77, 76.68).
Interpretation: The increase in mean wages for one extra year of education is likely to be between 25.77 and 76.68 dollars, based on the sample in wagedatasubset.JMP.

Summary
• We have described the assumptions of the simple linear regression model and how to check them.
• We have come up with a method of describing the uncertainty in our estimates of the slope and the intercept via confidence intervals.
• Note: These confidence intervals are only accurate if the assumptions of the simple linear regression model are approximately correct.
• Next class: Hypothesis tests.
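As a check on the confidence interval arithmetic for the slope, the sketch below recomputes the approximate interval from the reported estimate and standard error, and backs out the multiplier implied by JMP’s exact limits. The variable names are assumptions; the numbers are the ones reported in the Parameter Estimates output.

```python
# Values copied from the JMP Parameter Estimates output for educ.
b1 = 51.225264        # slope estimate
s_b1 = 12.82813       # standard error of the slope
lower_exact = 25.768251
upper_exact = 76.682276

# Approximate 95% CI: point estimate +/- 2 * standard error.
approx_ci = (b1 - 2 * s_b1, b1 + 2 * s_b1)

# Multiplier implied by the exact interval: half-width / standard error.
implied_multiplier = (upper_exact - lower_exact) / (2 * s_b1)

print(f"approximate 95% CI: ({approx_ci[0]:.2f}, {approx_ci[1]:.2f})")
print(f"implied multiplier: {implied_multiplier:.4f}")
```

The implied multiplier comes out slightly below 2, which is consistent with a t critical value on n - 2 = 98 degrees of freedom being used for the exact interval; that is why the exact interval is a little narrower than the approximate one.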