Regression and Correlation
		                    
		                    
Simple Linear Regression

When we discussed contingency tables, we were introduced to the idea that a categorical variable can depend on the levels of another variable. We now extend this idea to predicting a continuous response variable from the values of another variable. We say that Y is the response variable, which depends on an explanatory (predictor) variable, X.

There are many examples of this situation in the life sciences: height to predict weight, dose of an algaecide to predict algae growth, skin fold measurements to predict total body fat, and so on. Often several predictors are used to predict one variable (e.g., height, weight, age, smoking status, and gender can all be used to predict blood pressure). We focus on the special case of one predictor variable for a response, where the relationship is linear.

Example 12.3

In a study of a free-living population of the snake Vipera bertis, researchers caught and measured nine adult females. The goal is to predict weight (Y) from length (X). The data and a scatterplot of the data are below. Notice that the data come in pairs. For example, $(x_1, y_1) = (60, 136)$.

Snake   Length (cm)   Weight (g)
1       60            136
2       69            198
3       66            194
4       64            140
5       54            93
6       67            172
7       59            116
8       65            174
9       63            145

[Figure: Scatterplot of Weight vs Length, female Vipera bertis. Horizontal axis: Length (cm). Vertical axis: Weight (g).]

First, we look at a scatterplot of the data. (The picture above was created in Minitab, but we can create one using R.) In R, enter the two data sets and then use the command plot(length,weight).

We'd like to fit a straight line to the data. Why linear? Does fitting a straight line seem reasonable?
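As a minimal sketch of the R steps just described, the snake data from Example 12.3 can be entered as two vectors and passed to plot(). The vector names length and weight follow the plot(length,weight) command in these notes; the axis labels and title are choices made here to mirror the Minitab figure, not something the notes specify.

```r
# Snake data from Example 12.3 (nine adult female Vipera bertis)
length <- c(60, 69, 66, 64, 54, 67, 59, 65, 63)          # cm
weight <- c(136, 198, 194, 140, 93, 172, 116, 174, 145)  # g

# Scatterplot with the explanatory variable (length) on the horizontal axis
plot(length, weight,
     xlab = "Length (cm)", ylab = "Weight (g)",
     main = "Scatterplot of Weight vs Length")
```

Naming a vector length does not break R's built-in length() function, because R skips non-function objects when it looks up a function call.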
Simple Linear Model (Regression Equation)

The simple linear model relating Y and X is

$$Y = b_0 + b_1 X$$

$b_0$ is the intercept, the point where the line crosses the Y axis.
$b_1$ is the slope, the change in Y over the change in X (rise over run).

Definition: A predicted value (or fitted value) is the predicted value of $y_i$ for a given $x_i$ based on the regression equation, $b_0 + b_1 x_i$.
Notation: $\hat{y}_i = b_0 + b_1 x_i$

Definition: A residual is the departure of an observed value of Y from its fitted value.
Notation: $\mathrm{resid}_i = y_i - \hat{y}_i$

Which line do we fit? We will fit a line that goes through the data in the best way possible, based on the least squares criterion.

Definition: The residual sum of squares (a.k.a. SS(resid) or SSE) is

$$\mathrm{SS(resid)} = \mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The least squares criterion states that the optimal fit of a model to data occurs when SS(resid) is as small as possible. Note that under our model

$$\mathrm{SS(resid)} = \mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$

Refer to the applet at http://www.nctm.org/standards/content.aspx?id=26787

Using calculus to minimize the SSE, we find the coefficients for the regression equation.

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad b_0 = \bar{y} - b_1 \bar{x}$$

Example 12.3: Find the linear regression of weight (Y) on length (X).

    fit <- lm(weight ~ length)
    summary(fit)

[Figure: Scatterplot of Weight vs Length with Fitted Regression Line. Horizontal axis: Length (cm). Vertical axis: Weight (g).]

Interpret the slope ($b_1$) in the context of the setting.

Can we interpret the meaning of the Y intercept ($b_0$) in this setting?

Definition: An extrapolation occurs when one uses the model to predict a y value corresponding to an x value that is not within the range of the observed x's.

A Measure of Variability: $s_{Y|X}$

Once we fit a line to our data and use it to make predictions, it is natural to ask how far off our predictions are in general.
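A first look at how far off the predictions are comes from the residuals themselves. The sketch below assumes the snake data from Example 12.3 are entered as vectors named length and weight. It computes $b_1$ and $b_0$ from the least squares formulas, compares them with lm(), and then sums the squared residuals. The names xbar, ybar, yhat, resid_i, and ss_resid are illustrative and do not come from the notes.

```r
# Snake data from Example 12.3
length <- c(60, 69, 66, 64, 54, 67, 59, 65, 63)          # cm
weight <- c(136, 198, 194, 140, 93, 172, 116, 174, 145)  # g

# Least squares coefficients computed directly from the formulas
xbar <- mean(length)
ybar <- mean(weight)
b1 <- sum((length - xbar) * (weight - ybar)) / sum((length - xbar)^2)
b0 <- ybar - b1 * xbar

# The same fit using lm(), as in the notes
fit <- lm(weight ~ length)

# Fitted values, residuals, and the residual sum of squares
yhat     <- b0 + b1 * length
resid_i  <- weight - yhat        # agrees with residuals(fit)
ss_resid <- sum(resid_i^2)

print(c(b0 = b0, b1 = b1))       # hand-computed coefficients
print(coef(fit))                 # coefficients reported by lm()
print(ss_resid)
```

Each residual is measured in grams, so SS(resid) is in squared grams; this is one reason the notes go on to define a standard deviation based on it.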
Definition: The residual standard deviation is

$$s_{Y|X} = \sqrt{\frac{\mathrm{SS(resid)}}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}}$$

Caution: This is not to be confused with $s_Y$! Recall,

$$s_Y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$$

[Figure: Scatterplot of Weight vs Length with Fitted Regression Line. Horizontal axis: Length (cm). Vertical axis: Weight (g).]

Determine and interpret $s_{Y|X}$ for the regression of female Vipera bertis weight on length.

The Linear Statistical Model

Definition: A conditional mean is the expected value of a variable conditional on another variable.
Notation: $\mu_{Y|X}$

Definition: A conditional standard deviation is the standard deviation of a variable conditional on another variable.
Notation: $\sigma_{Y|X}$

The linear regression model of Y on X assumes $Y = \mu_{Y|X} + \varepsilon$, where the conditional mean is linear with $\mu_{Y|X} = \beta_0 + \beta_1 X$, and $\mu_\varepsilon = 0$ and $\sigma_\varepsilon = \sigma_{Y|X}$.

We use ______ to estimate $\beta_0$, ______ to estimate $\beta_1$, and __________ to estimate $\sigma_{Y|X}$.

Then, we can estimate (or predict) $\mu_{Y|X=x}$ at any x so that $\hat{\mu}_{Y|X=x} = b_0 + b_1 x$.

Assuming the linear model is appropriate here, find estimates of the mean and standard deviation of female Vipera bertis weight at a length of 65 cm.

Should we estimate female Vipera bertis weight at a length of 75 cm? Why or why not?

Inference on $\beta_1$

Normal Error Model

In our discussion of the linear statistical model, we stated that the linear regression model of Y on X assumes a linear conditional mean, with the errors having mean 0 and standard deviation $\sigma_{Y|X}$. To make inference on $\beta_1$, we need to update the conditions on this model to include a normal distribution on the errors.
$$Y = \mu_{Y|X} + \varepsilon \qquad \mu_{Y|X} = \beta_0 + \beta_1 X \qquad \varepsilon \sim N(0, \sigma^2_{Y|X})$$

Assumptions to check
• $\varepsilon_i$ must be independent
• $\varepsilon_i$ must be normally distributed
• $\varepsilon_i$ must have equal variance
• $\varepsilon_i$ must have mean zero

Confidence Interval for $\beta_1$

Under the normal error model, $b_1$ is unbiased for $\beta_1$ with

$$SE(b_1) = \frac{s_{Y|X}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

This confidence interval uses a t critical point, $t_{\alpha/2,\,df=n-2}$. Then the CI is

$$b_1 \pm t_{\alpha/2,\,df=n-2}\, SE(b_1)$$

R will do this for us using the confint(…) function.

Compute and interpret the 95% confidence interval for $\beta_1$ in the female Vipera bertis regression. Use confint(fit).

Hypothesis Test for $\beta_1$

Similar to the development of the confidence interval for $\beta_1$, we can use the t distribution to conduct a hypothesis test for $\beta_1$ with

$$t_s = \frac{b_1}{SE(b_1)}$$

Under $H_0$, $t_s \sim t_{df=n-2}$.

We'll test
$H_0$: _________________________________________________________________
$H_A$: _________________________________________________________________

Using the output from the R functions we used previously in this lecture, we get the P-value for the two-sided test. If you have a directional alternative, check that the data go in the direction of $H_A$. If so, cut the P-value in half. If not, just say P > 0.5.

Conduct a test of hypothesis to see whether $\beta_1 > 0$.
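The interval and test above can be sketched in R as follows, assuming the snake data from Example 12.3 are entered as vectors named length and weight and that a 95% level and the directional alternative $\beta_1 > 0$ are wanted. The hand computation is shown next to confint() and summary() only to make the formulas concrete; the names s_yx, se_b1, t_crit, ci, t_s, p_two, and p_one are illustrative.

```r
# Snake data from Example 12.3
length <- c(60, 69, 66, 64, 54, 67, 59, 65, 63)          # cm
weight <- c(136, 198, 194, 140, 93, 172, 116, 174, 145)  # g

fit <- lm(weight ~ length)
n   <- nobs(fit)                     # number of (x, y) pairs
b1  <- coef(fit)["length"]

# Residual standard deviation and the standard error of the slope
s_yx  <- summary(fit)$sigma          # sqrt(SS(resid) / (n - 2))
se_b1 <- s_yx / sqrt(sum((length - mean(length))^2))

# 95% confidence interval for beta_1: b1 +/- t * SE(b1)
t_crit <- qt(0.975, df = n - 2)
ci     <- b1 + c(-1, 1) * t_crit * se_b1
print(ci)
print(confint(fit, "length", level = 0.95))   # same interval from R

# Test statistic and two-sided P-value
t_s   <- b1 / se_b1
p_two <- 2 * pt(abs(t_s), df = n - 2, lower.tail = FALSE)

# Directional alternative H_A: beta_1 > 0.
# Halve the P-value only if the estimate points in the direction of H_A;
# otherwise the one-sided P-value exceeds 0.5.
p_one <- if (t_s > 0) p_two / 2 else 1 - p_two / 2
print(c(t_s = unname(t_s), p_two = unname(p_two), p_one = unname(p_one)))
```

The two-sided P-value computed here is the one reported in the Pr(>|t|) column of summary(fit) for the length row.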
Coefficient of Determination ($r^2$)

Recall SS(resid) is a measure of the unexplained variability in Y (the variation in Y not explained by X through the regression model) and is given by

$$\mathrm{SS(resid)} = \mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

[Figure: Scatterplot of Weight vs Length with Fitted Regression Line. Horizontal axis: Length (cm). Vertical axis: Weight (g).]

Definition: SS(total) measures the total variability in Y and is given by

$$\mathrm{SS(total)} = \mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Definition: SS(reg) measures the variability in Y that is explained by X through the regression model and is given by

$$\mathrm{SS(reg)} = \mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

Then SS(total) = SS(reg) + SS(resid): the total variability in Y is the variability explained by the regression model plus the unexplained residual variation.

Definition: The coefficient of determination is the ratio of SS(reg) to SS(total) and is given by

$$r^2 = \frac{\left[\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\mathrm{SS(reg)}}{\mathrm{SS(total)}}$$

and is interpreted as the proportion (or percentage) of variability in Y that is explained by the linear regression of Y on X.

Find and interpret $r^2$ for the regression of female Vipera bertis weight on length.

Correlation

The linear regression model assumes the X's are measured with negligible error. Think about the snake data here: the researcher measured length to predict weight! Why not the other way around? I mean, if we go out to collect snake measurements, I am not volunteering to get the length; I'm volunteering to hook the snake and throw it in a bag to weigh it. But if we tried to use weight to predict length, the variability in weight due to eating, pregnancy, etc. could lead to bad predictions of length. For instance, a snake in our data set that just ate a mouse would be shorter than the length predicted for a snake that really weighed that much (little snake + food = big snake weight). In other words, $\mu_{Y|X}$ is the mean of Y given X.
We use this type of model to make predictions of Y for a given value of X. For the situation where we'd like to make statements about the joint relationship of X and Y, we need X and Y both to be random. When we're interested in examining the joint relationship of two random variables, we are interested in their joint distribution (the joint distribution of two random variables is called a bivariate distribution).

Definition: The bivariate random sampling model views the pairs $(X_i, Y_i)$ as joint random variables, with population means $\mu_X$, $\mu_Y$, population standard deviations $\sigma_X$, $\sigma_Y$, and a correlation parameter $\rho$.

In this model, $\rho$ measures the level of (linear) dependence between the two random variables X and Y.
• $-1 \le \rho \le 1$
• $\rho \to \pm 1$: X & Y become more correlated
• $\rho \to 0$: X & Y become uncorrelated

We'll measure the sample correlation coefficient, called r.

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Notice what is hiding inside of r.

Properties of r
• $r = \pm\sqrt{r^2}$ (r takes the sign of $b_1$)
• as $n \to \infty$, $E[r] \to \rho$
• related to LS regression coefficients: $b_1 = r \, \dfrac{s_Y}{s_X}$
• test of $H_0$: $\beta_1 = 0$ is numerically equivalent to test of $H_0$: $\rho = 0$

$$t_s = \frac{b_1}{SE(b_1)} = r \sqrt{\frac{n-2}{1-r^2}}$$

[Figure 12.14: Blood pressure and platelet calcium for 38 persons with normal blood pressure]

Example 12.19: Is calcium in blood related to blood pressure?
Y = calcium concentration in blood platelets
X = blood pressure (average of systolic and diastolic)
What do you think r is for these data?
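The calcium data are not reproduced in these notes, but the properties of r listed above can be checked numerically on the snake data from Example 12.3. The sketch below assumes the data are entered as vectors named length and weight. It compares r with the coefficient of determination, the slope, and the t statistic from the regression. The names r, r2, ss_total, ss_resid, ss_reg, t_from_r, and t_from_b1 are illustrative.

```r
# Snake data from Example 12.3
length <- c(60, 69, 66, 64, 54, 67, 59, 65, 63)          # cm
weight <- c(136, 198, 194, 140, 93, 172, 116, 174, 145)  # g

fit <- lm(weight ~ length)
n   <- nobs(fit)
b1  <- unname(coef(fit)["length"])

# Sample correlation coefficient
r <- cor(length, weight)

# Sums of squares and the coefficient of determination
ss_total <- sum((weight - mean(weight))^2)
ss_resid <- sum(residuals(fit)^2)
ss_reg   <- sum((fitted(fit) - mean(weight))^2)
r2       <- ss_reg / ss_total

# Property checks: each pair should agree up to rounding error
print(c(ss_total = ss_total, reg_plus_resid = ss_reg + ss_resid))
print(c(r_squared = r^2, ss_ratio = r2, from_lm = summary(fit)$r.squared))
print(c(b1 = b1, r_sy_over_sx = r * sd(weight) / sd(length)))

# The t statistic for H0: rho = 0 matches the one for H0: beta_1 = 0
t_from_r  <- r * sqrt((n - 2) / (1 - r^2))
t_from_b1 <- summary(fit)$coefficients["length", "t value"]
print(c(t_from_r = t_from_r, t_from_b1 = t_from_b1))
```

The built-in cor.test(length, weight) reports the same t statistic along with a P-value for the test of $\rho = 0$.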
[Figure: Plots Depicting the Sensitivity of r to Outliers]

Some Final Notes on Regression and Correlation
• use (conditional) regression analysis when prediction of Y from X is desired
  o random sampling from the conditional distribution of Y|X is required if $b_0$, $b_1$, and $s_{Y|X}$ are to be viewed as estimates of the parameters $\beta_0$, $\beta_1$, and $\sigma_{Y|X}$
  o Y must be random and X need not be random
• use correlation analysis when association between X and Y is under study
  o bivariate random sampling model is required if r is to be viewed as an estimate of the population parameter $\rho$
  o X and Y both must be random
• Always plot the data! Why? Because
• r is very sensitive to extreme observations and outliers, so BE CAREFUL!
• r is also known as the Pearson Product-Moment Correlation Coefficient
• a distribution-free version of r exists, known as Spearman's Rank Correlation Coefficient
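To make the last points concrete, here is a small sketch of how one extreme observation can move r, using the snake data from Example 12.3. The extreme value is hypothetical: it pretends that the weight of snake 2 was mistyped as 1980 g instead of 198 g. That error is invented here purely for illustration and is not part of the study. Spearman's coefficient is computed with cor(..., method = "spearman").

```r
# Snake data from Example 12.3
length <- c(60, 69, 66, 64, 54, 67, 59, 65, 63)          # cm
weight <- c(136, 198, 194, 140, 93, 172, 116, 174, 145)  # g

# HYPOTHETICAL data-entry error: snake 2 recorded as 1980 g, not 198 g
weight_bad    <- weight
weight_bad[2] <- 1980

# Pearson r before and after the error
r_clean <- cor(length, weight)
r_bad   <- cor(length, weight_bad)

# Spearman's rank correlation before and after the error
rho_clean <- cor(length, weight, method = "spearman")
rho_bad   <- cor(length, weight_bad, method = "spearman")

print(c(pearson_clean = r_clean, pearson_bad = r_bad))
print(c(spearman_clean = rho_clean, spearman_bad = rho_bad))

# Always plot the data: the mistyped point stands out immediately
plot(length, weight_bad, xlab = "Length (cm)", ylab = "Weight (g)")
```

In this particular case the mistyped snake was already the heaviest, so its rank does not change and Spearman's coefficient is unaffected, while Pearson's r drops noticeably. An outlier that changes the rank ordering would move Spearman's coefficient as well.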