STAT 651 Lecture #19
Copyright (c) Bani K. Mallick

Topics in Lecture #19
- Are Y and X related?
- Inference about a population slope
- Residual plots to test for normality

Book Chapters in Lecture #19
- Chapter 11.3

Relevant SPSS Tutorials
- Simple linear regression
- Residual plots (they do it slightly differently from what I do, but their method is OK as well)

Lecture 18 Review: Linear Regression and Correlation
- Linear regression and correlation are aimed at understanding how two variables are related.
- The variables are called Y and X: Y is called the dependent variable, and X is called the independent variable. We want to know how, and whether, X influences Y.
- Example: let Y = GPA and X = height. A linear prediction equation is a line, such as $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$. The intercept of the line is $\hat{\beta}_0$, and the slope of the line is $\hat{\beta}_1$.
- The basic tool of regression is the scatterplot. This simply plots the data in a graph: X is along the horizontal (or X) axis, and Y is along the vertical (or Y) axis.
- The usual method, called least squares, tries to make the squared distance between the line and the actual data as small as possible.
- The data are $(X_i, Y_i)$ for $i = 1, \ldots, n$, and any line is $\hat{\beta}_0 + \hat{\beta}_1 X_i$. The total squared distance is $\sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2$, and the slope and intercept are chosen to minimize this total squared distance.
- The slope is $\hat{\beta}_1 = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \big/ \sum_{i=1}^{n} (X_i - \bar{X})^2$.
- The intercept is $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$. This is algebra! The estimates are called the least squares estimates, and SPSS calculates them automatically.
- The population parameters $\beta_0$ and $\beta_1$ are simply the least squares estimates computed on all the members of the population, not just the sample. Population parameters: $\beta_0$ and $\beta_1$. Sample statistics: $\hat{\beta}_0$ and $\hat{\beta}_1$.
- Formally speaking, the linear regression model says that Y and X are related: $Y = \beta_0 + \beta_1 X + \epsilon$. The meaning of the line $\beta_0 + \beta_1 X$ is: take the (sub)population all of whom have independent variable value X; the mean of this (sub)population is $\beta_0 + \beta_1 X$.
- Assumption #1: a straight line really fits the data. Assumption #2: the errors are at least vaguely normally distributed. Assumption #3: the errors have somewhat the same variances.

Inference About the Population Slope and Intercept
- In the model $Y = \beta_0 + \beta_1 X + \epsilon$, if $\beta_1 \neq 0$ we have a sloped line. [Graph: the line $\beta_0 + \beta_1 X$ slopes with X.] This line is the mean of Y for those whose independent variable is X.
- If $\beta_1 = 0$, the line is flat. [Graph: the line $\beta_0 + \beta_1 X$ is horizontal.] Note how the mean of Y does not depend on X: Y and X are independent.
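The least squares formulas above are easy to compute directly. A minimal Python sketch, using made-up illustrative numbers rather than the course data:

```python
def least_squares(x, y):
    """Least squares slope and intercept, exactly as in the lecture formulas."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Slope: sum((X_i - Xbar)(Y_i - Ybar)) / sum((X_i - Xbar)^2)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
         / sum((xi - xbar) ** 2 for xi in x)
    # Intercept: Ybar - b1 * Xbar
    b0 = ybar - b1 * xbar
    return b0, b1

# Data lying exactly on the line y = 2 + 3x recovers b0 = 2, b1 = 3:
b0, b1 = least_squares([1, 2, 3, 4], [5, 8, 11, 14])
```

This is what SPSS does behind the scenes when it fills in the "B" column of its coefficients table.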
Linear Regression and Correlation
- If $\beta_1 = 0$, then Y and X are independent. So we can test the null hypothesis $H_0$: Y and X are independent by testing $H_0: \beta_1 = 0$. The p-value in regression tables tests this hypothesis.

GPA and Height
[Scatterplot of Grade Point Average (GPA) against Height in inches: note how GPAs generally get lower as height increases; the data do not fall exactly on a line.]

Linear Regression and Correlation
- There is an ANOVA table summarizing the model (Dependent Variable: Grade Point Average (GPA); Predictors: (Constant), Height in inches):

              Sum of Squares    df   Mean Square       F    Sig.
  Regression           1.954     1         1.954   7.881    .006
  Residual            24.294    98          .248
  Total               26.247    99

- The intercept = 5.529 and slope = -0.0372 are in the "B" column of the coefficients table (Dependent Variable: Grade Point Average (GPA)):

                          B    Std. Error    Beta        t    Sig.
  (Constant)          5.529          .897             6.165   .000
  Height in inches   -.0372          .013    -.273   -2.807   .006

- The standard deviation of the errors is called $\sigma$. This means that every subpopulation who share the same value of X has mean $\beta_0 + \beta_1 X$ and standard deviation $\sigma$.
- The least squares estimate $\hat{\beta}_1$ is a random variable. What does this mean? If you do another experiment, you will get another least squares estimate $\hat{\beta}_1$. We have to quantify its variability. Sound familiar: remember the mean (not the Maine).
- Its standard deviation is $\text{s.e.}(\hat{\beta}_1) = \sigma \big/ \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}$.
- Also recall that the sample variance of the X's is $\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$. Thus $\text{s.e.}(\hat{\beta}_1) = \sigma \big/ \sqrt{(n-1) \times (\text{sample variance of the } X\text{'s})}$.
- Thus, you can make the sample slope a more precise estimate of the population slope in two ways: increase the sample size (like, duh), or make the X's more variable.

Linear Regression
- In order to compute a standard error for the least squares slope, we have to provide an estimate of the common standard deviation $\sigma$.
- Define the residuals to be the differences between the actual data and the predicted line: $r_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$. These can be calculated automatically in SPSS.
- The Sum of Squares due to Error (SSE), or Residual Sum of Squares (RSS), is $\text{SSE} = \sum_{i=1}^{n} r_i^2$.
- The Mean Squared Error (MSE) has n - 2 degrees of freedom and equals $\text{MSE} = \sum_{i=1}^{n} r_i^2 / (n-2)$.
- The estimated standard deviation is $s = \sqrt{\text{MSE}}$.
- Remember, each value of X gives rise to a subpopulation, and each subpopulation has (sub)population standard deviation $\sigma$. The estimate of this standard deviation is s.
- The estimated standard deviation of the least squares estimate $\hat{\beta}_1$ is $\text{s.e.}(\hat{\beta}_1) = s \big/ \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}$.
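The chain residuals → SSE → MSE → s → standard error can also be sketched in code. A minimal illustration (the data and the fitted line y = 0.1 + 0.6x are made up for the example, not taken from the slides):

```python
import math

def slope_se(x, y, b0, b1):
    """Estimated standard error of the least squares slope, per the lecture formulas."""
    n = len(x)
    # Residuals: r_i = Y_i - (b0 + b1 * X_i)
    r = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(ri ** 2 for ri in r)            # SSE = sum of squared residuals
    mse = sse / (n - 2)                       # MSE has n - 2 degrees of freedom
    s = math.sqrt(mse)                        # estimate of sigma
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)   # sum of (X_i - Xbar)^2
    return s / math.sqrt(sxx)                 # s.e.(b1-hat) = s / sqrt(Sxx)

# Small made-up data set whose least squares line is y = 0.1 + 0.6x:
se = slope_se([0, 1, 2, 3], [0, 1, 1, 2], 0.1, 0.6)  # -> about 0.1414
```

Note the n - 2 divisor: two degrees of freedom are used up estimating the slope and intercept.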
Linear Regression and Correlation
- The $(1-\alpha)100\%$ confidence interval for the population slope is $\hat{\beta}_1 \pm t_{\alpha/2}(n-2)\,\text{s.e.}(\hat{\beta}_1)$.
- If the 95% confidence interval is from 3 to 6, what does this mean? The population slope is between 3 and 6 with 95% probability.
- Are Y and X independent? No, since we have ruled out that $\beta_1 = 0$.
- You can test the null hypothesis that Y and X are independent using the t-test. The t-statistic in SPSS output is $t = \hat{\beta}_1 / \text{s.e.}(\hat{\beta}_1)$. The Type I error is $\alpha$. You reject the null hypothesis if $|t| > t_{\alpha/2}(n-2)$.
- In SPSS, you can get the interval as follows: "Analyze", "Regression", "Linear"; ask for the "Statistics" option and accept "Confidence Intervals". While you are at it, accept the "Save" option and save the predicted values, residuals, Cook's distances, and leverages.

GPA and Height
- The ANOVA table (Dependent Variable: Grade Point Average (GPA); Predictors: (Constant), Height in inches):

              Sum of Squares    df   Mean Square       F    Sig.
  Regression           1.954     1         1.954   7.881    .006
  Residual            24.294    98          .248
  Total               26.247    99

- What is the sample size? 100, since the df for residual is n - 2 = 98.
- The coefficients table (Dependent Variable: Grade Point Average (GPA)), now with 95% confidence intervals for B:

                          B    Std. Error    Beta        t    Sig.    95% CI Lower    95% CI Upper
  (Constant)          5.529          .897             6.165   .000           3.749           7.309
  Height in inches   -.0372          .013    -.273   -2.807   .006           -.064           -.011

- The 95% CI for the slope is -0.064 to -0.011. What does this mean? The population slope is between these limits with 95% probability.
- Are GPA and Height independent? NO: the population slope is negative.

t-test
- There are n - 2 = 98 degrees of freedom for residual (error).
- The t-statistic for the slope is t = -2.807, and the p-value is 0.006.
- Take $\alpha = 0.05$ and look up $t_{\alpha/2}(n-2) = t_{.025}(98) \approx 1.99$. Is $|t| > t_{\alpha/2}(n-2)$? Yes, since p = 0.006!!!

Residuals
- You can check the assumption that the errors are normally distributed by constructing a q-q plot of the residuals.

Residuals of GPA on Height: Normal?
[Normal q-q plot of the regression residuals, all data, Y = GPA, X = Height: Expected Cum Prob against Observed Cum Prob.]

Stenosis Data, Healthy Kids
- Are Y = log(1 + AVA) and X = BSA independent? No: p = .000.
- The ANOVA table (Dependent Variable: Log(1 + Aortic Valve Area); Predictors: (Constant), Body Surface Area):

              Sum of Squares    df   Mean Square         F    Sig.
  Regression           6.786     1         6.786   217.543    .000
  Residual             2.121    68         .0312
  Total                8.907    69

Stenosis Data, Healthy Kids: Normal?
[Normal q-q plot of the residuals, Y = log(1 + AVA), X = BSA: Expected Cum Prob against Observed Cum Prob.]

Stenosis Data, Healthy Kids: CI for the Slope?
- The coefficients table (Dependent Variable: Log(1 + Aortic Valve Area)):

                          B    Std. Error    Beta         t    Sig.    95% CI Lower    95% CI Upper
  (Constant)           .167          .039              4.247    .000            .088            .245
  Body Surface Area    .615          .042    .873     14.749    .000            .532            .698
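The GPA/height slope interval and t-test can be reproduced by hand from the rounded numbers in the coefficients table; a sketch (small discrepancies with the SPSS output come from rounding the slope and standard error):

```python
b1_hat = -0.0372   # slope from the "B" column of the coefficients table
se_b1 = 0.013      # its standard error
t_crit = 1.99      # t_{.025}(98), the critical value quoted above

# 95% CI: slope estimate +/- t_crit * standard error
lower = b1_hat - t_crit * se_b1   # about -0.063 (SPSS reports -.064)
upper = b1_hat + t_crit * se_b1   # about -0.011 (SPSS reports -.011)

# t-statistic for H0: beta_1 = 0
t_stat = b1_hat / se_b1           # about -2.86 (SPSS, from unrounded values: -2.807)

# The interval excludes 0 and |t| > t_crit, so reject H0 at alpha = 0.05:
reject = abs(t_stat) > t_crit
```

This mirrors the equivalence noted above: the 95% interval excluding 0 and the t-test rejecting $H_0: \beta_1 = 0$ at $\alpha = 0.05$ are the same conclusion.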