Chapter 14
Multiple Regression Analysis

What is the general purpose of regression?
• To model the relationship between the dependent variable $y$ (response) and one or more independent variables (predictors or explanatory variables).

Multiple regression can be used to fit models to data with two or more independent variables. For example, some variation in the price of a house in a large city can be attributed to the size of the house, but other variables also contribute to the price, such as age, lot size, and the number of bedrooms and bathrooms.

Consider a school district in which teachers with no prior teaching experience and no college credits beyond a bachelor's degree start at an annual salary of $38,000. Suppose that for each year of teaching experience up to 20 years, the teacher receives an additional $800 per year, and that each unit of postgraduate credit up to 75 credits results in an additional $60 per year.

Let:
$y$ = salary of a teacher with at most 20 years of experience and at most 75 postgraduate units
$x_1$ = number of years of experience
$x_2$ = number of postgraduate units

The equation to determine salary is
$$y = 38{,}000 + 800x_1 + 60x_2$$

Since $y$ is determined entirely by $x_1$ and $x_2$, this is a deterministic model. In a simple regression model, $x_1$ and $x_2$ would represent two observations of a single variable; in multiple regression, $x_1$ and $x_2$ represent two independent variables. What if $y$ is not entirely determined by the two (or more) independent variables? How can that scenario be modeled using multiple regression?

General Additive Multiple Regression Model
A general additive multiple regression model, which relates a dependent variable $y$ to $k$ predictor variables $x_1, x_2, \ldots, x_k$, is given by the model equation
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$$
The $\beta_i$'s are the population regression coefficients.
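As a rough illustration of the model equation, the sketch below evaluates the non-random part $\alpha + \beta_1 x_1 + \cdots + \beta_k x_k$ and then adds a normally distributed deviation $e$. The function names `mean_response` and `simulate_y` are hypothetical, and the value `sigma=500.0` is an assumption chosen only for the demonstration; the salary example in the slides has no random deviation at all.

```python
import random


def mean_response(alpha, betas, xs):
    """Non-random part of the model: alpha + beta_1*x_1 + ... + beta_k*x_k."""
    if len(betas) != len(xs):
        raise ValueError("need exactly one coefficient per predictor")
    return alpha + sum(b * x for b, x in zip(betas, xs))


def simulate_y(alpha, betas, xs, sigma, rng):
    """One observation y = (mean response) + e, with e ~ Normal(0, sigma)."""
    return mean_response(alpha, betas, xs) + rng.gauss(0.0, sigma)


# Deterministic salary example: 10 years of experience, 30 postgraduate units.
salary = mean_response(38_000, [800, 60], [10, 30])
print(salary)  # 38000 + 8000 + 1800 = 47800

# Same coefficients with a made-up random deviation (sigma is an assumption).
rng = random.Random(0)
print(simulate_y(38_000, [800, 60], [10, 30], sigma=500.0, rng=rng))
```

With `sigma=0.0` the simulated value equals the mean response, which is the deterministic case; any positive `sigma` scatters observations around that mean.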
Each $\beta_i$ can be interpreted as the mean change in $y$ when the predictor $x_i$ increases by 1 unit and the values of all other predictors remain fixed.

The random deviation $e$ is assumed to be normally distributed with mean value 0 and standard deviation $\sigma$ for any particular values of $x_1, \ldots, x_k$. Remember, $e$ is the amount by which a point randomly deviates from the regression model. This implies that for fixed $x_1, x_2, \ldots, x_k$ values, $y$ has a normal distribution with standard deviation $\sigma$ and
$$(\text{mean } y \text{ value for fixed } x_1, x_2, \ldots, x_k \text{ values}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
This is called the population regression function.

Data collected in a survey of approximately 1000 second-year college students suggest that GPA at the end of the second year is related to the student's level of interaction with the faculty and staff and to the student's commitment to his or her major.

Let:
$y$ = GPA at end of sophomore year
$x_1$ = level of faculty and staff interaction (measured on a scale of 1 to 5)
$x_2$ = level of commitment to major (measured on a scale of 1 to 5)

One possible population model might be
$$y = 1.4 + .33x_1 + .16x_2 + e \quad \text{with } \sigma = 0.15$$

For sophomores whose level of interaction with faculty and staff is rated at 4.2 and whose level of commitment to major is rated at 2.1,
$$(\text{mean value of GPA}) = 1.4 + .33(4.2) + .16(2.1) = 3.12$$
It is likely that a $y$ value will be within $2\sigma$ (.30) of this mean value (3.12 ± .30). This interval is from 2.82 to 3.42.

Polynomial Regression
Suppose a scatterplot has the following appearance:

[Figure: scatterplot of y versus x showing a curved pattern]

Would a line be a good fit for these data? Explain. It looks like a parabola (quadratic function) would provide a good fit for the data.

Polynomial Regression
The $k$th degree polynomial regression model is
$$y = \alpha + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + e$$
Note that we include the random deviation $e$ since this is a probabilistic model. The population regression function (mean $y$ value for fixed values of the predictors) is $\alpha + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k$.

Note also that this is a special case of the general multiple regression model with $x_1 = x$, $x_2 = x^2$, $x_3 = x^3$, ..., $x_k = x^k$. The $\beta_i$ cannot be interpreted as in the general model, since all the predictor values are functions of a single variable.

The most important special case (other than the simple regression model when $k = 1$) is the quadratic regression model
$$y = \alpha + \beta_1 x + \beta_2 x^2 + e$$

Many researchers have examined factors that are believed to contribute to the risk of heart attacks. One study found that hip-to-waist ratio was a better predictor of heart attacks than body-mass index. A plot of data from this study of a measure of heart-attack risk ($y$) versus hip-to-waist ratio ($x$) exhibited a curved relationship. A model consistent with summary values given in the paper is
$$y = 1.023 + 0.024x + 0.060x^2 + e$$

Suppose the hip-to-waist ratio is 1.3. What are the possible values of the heart-attack risk measure (if $\sigma = 0.25$)?
$$(\text{mean } y \text{ value}) = 1.023 + .024(1.3) + .060(1.3)^2 = 1.16$$
It is likely that the heart-attack risk measure for a person with a hip-to-waist ratio of 1.3 is within $2\sigma$ (.50) of this mean, that is, between .66 and 1.66.

Suppose that an industrial chemist is interested in the relationship between product yield ($y$) from a certain chemical reaction and two independent variables, $x_1$ = reaction temperature and $x_2$ = pressure at which the reaction is carried out. The chemist initially suggests that for temperatures between 80 and 110 in combination with pressure values ranging from 50 to 70, the relationship can be modeled by
$$y = 1200 + 15x_1 - 35x_2 + e$$

Consider the mean $y$ value for three different particular temperature values:
$x_1 = 90$: mean $y$ value $= 1200 + 15(90) - 35x_2 = 2550 - 35x_2$
$x_1 = 95$: mean $y$ value $= 1200 + 15(95) - 35x_2 = 2625 - 35x_2$
$x_1 = 100$: mean $y$ value $= 1200 + 15(100) - 35x_2 = 2700 - 35x_2$

Notice that each is a straight line with a slope of -35.

[Figure: plot of mean y value versus x2 for the three temperature values; the lines are parallel]

Because chemical theory suggests that the decline in average yield when pressure $x_2$ increases should be more rapid for a high temperature than for a low temperature, the chemist now has reason to doubt the appropriateness of the proposed model.

Chemical Reaction Continued . . .
A better model would include a third predictor variable, $x_1 x_2$. This third predictor variable is an interaction term. One such model is
$$y = -4500 + 75x_1 + 60x_2 - x_1 x_2 + e$$

Consider the mean $y$ value for three different particular temperature values:
$x_1 = 90$: mean $y$ value $= -4500 + 75(90) + 60x_2 - 90x_2 = 2250 - 30x_2$
$x_1 = 95$: mean $y$ value $= -4500 + 75(95) + 60x_2 - 95x_2 = 2625 - 35x_2$
$x_1 = 100$: mean $y$ value $= -4500 + 75(100) + 60x_2 - 100x_2 = 3000 - 40x_2$

Notice that these all have different slopes, as seen in the plot of these lines.

[Figure: plot of mean y value versus x2 for the three temperature values; the lines are not parallel]

Interaction Between Variables
If the change in the mean $y$ (the slope) associated with a 1-unit increase in one independent variable depends on the value of a second independent variable, there is interaction between these two variables. When the variables are denoted by $x_1$ and $x_2$, such interaction can be modeled by including $x_1 x_2$, the product of the variables that interact, as a predictor variable.

The general equation for a multiple regression model based on two independent variables $x_1$ and $x_2$ that also includes an interaction predictor is
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$$

More than one interaction predictor can be included in the model when more than two independent variables are available. In quadratic regression, the full quadratic, or complete second-order, model is
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + e$$

Qualitative Predictor Variables
Qualitative or categorical variables can also be incorporated into a multiple regression model through the use of an indicator variable or dummy variable.
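Returning briefly to the chemical reaction example, the effect of the interaction term can be checked numerically. The sketch below evaluates the mean yield under both models from the slides and computes the change in mean yield for a 1-unit increase in pressure at each temperature. The function names and the starting pressure `x2=50` are illustrative choices, not part of the original example.

```python
def mean_yield_additive(x1, x2):
    """Mean yield under the chemist's first model (no interaction)."""
    return 1200 + 15 * x1 - 35 * x2


def mean_yield_interaction(x1, x2):
    """Mean yield under the model with the interaction term x1*x2."""
    return -4500 + 75 * x1 + 60 * x2 - x1 * x2


def pressure_slope(mean_fn, x1, x2=50):
    """Change in mean yield for a 1-unit increase in pressure at temperature x1."""
    return mean_fn(x1, x2 + 1) - mean_fn(x1, x2)


for temperature in (90, 95, 100):
    print(
        temperature,
        pressure_slope(mean_yield_additive, temperature),
        pressure_slope(mean_yield_interaction, temperature),
    )
# 90 -35 -30
# 95 -35 -35
# 100 -35 -40
```

Under the first model the slope in pressure is -35 at every temperature, while under the interaction model it steepens from -30 to -40 as temperature rises, which is the behavior chemical theory suggests.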
In general, incorporating a categorical variable with $c$ possible categories into a regression model requires the use of $c - 1$ indicator variables. So if a qualitative variable has three or more categories, multiple indicator variables are needed. An indicator variable uses the values 0 and 1 to indicate the different categories.

Example: gender of students
$$x_1 = \begin{cases} 0 & \text{if male} \\ 1 & \text{if female} \end{cases}$$

Example: location of houses in a Californian beach resort
One of the factors that has an effect on the price of a house is location. We might want to incorporate location, as well as numerical predictors such as size and age, into a multiple regression model for predicting house price. California beach community houses can be classified by location into three categories: ocean view and beachfront, ocean view but not beachfront, and no ocean view.

Let:
$$x_1 = \begin{cases} 1 & \text{if ocean view and beachfront} \\ 0 & \text{otherwise} \end{cases}$$
$$x_2 = \begin{cases} 1 & \text{if ocean view but not beachfront} \\ 0 & \text{otherwise} \end{cases}$$
$x_3$ = house size
$x_4$ = house age

What category for location would be represented by $x_1 = 0$ and $x_2 = 0$? (The remaining category, no ocean view.)

We could then consider a multiple regression model of the form
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + e$$

One way colleges measure success is by graduation rates. The Education Trust publishes graduation rates along with other college characteristics. Let's consider the following variables:
$y$ = 6-year graduation rate
$x_1$ = median SAT score of students accepted to the college
$x_2$ = student-related expense per full-time student (in dollars)
$$x_3 = \begin{cases} 1 & \text{if college has only female students or only male students} \\ 0 & \text{if college has both male and female students} \end{cases}$$

Note that two of these predictors are numerical variables and one is categorical.

One possible model that would be considered to describe the relationship between $y$ and these three predictors is
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e$$

As in simple regression, we will need to estimate the regression coefficients $\alpha$, $\beta_1$, $\beta_2$, and $\beta_3$ by calculating $a$, $b_1$, $b_2$, and $b_3$.

In simple regression, an observation is an $(x, y)$ pair. In multiple regression, an observation consists of the $k$ independent variables and the dependent variable, so it has $k + 1$ terms. In this example, any observation would be $(x_1, x_2, x_3, y)$.

Least-Squares Estimates
According to the principle of least squares, the fit of a particular estimated regression function $a + b_1 x_1 + \cdots + b_k x_k$ to the observed data is measured by the sum of the squared deviations between the observed $y$ values and the $y$ values predicted by the estimated regression function:
$$\sum \left[ y - (a + b_1 x_1 + \cdots + b_k x_k) \right]^2$$
The least-squares estimates of $\alpha, \beta_1, \ldots, \beta_k$ are those values of $a, b_1, \ldots, b_k$ that make this sum of squared deviations as small as possible.

The least-squares estimates for a given data set are obtained by solving a system of $k + 1$ equations in the $k + 1$ unknowns $a, b_1, \ldots, b_k$ (called the normal equations). This is difficult to do by hand, but all the commonly used statistical software packages have been programmed to solve them.

Graduation Rates Continued . . .
Minitab output from a regression command requesting that the model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e$ be fit to the small college data (found on pages 815-816 of the textbook) is given below:

The regression equation is
y = -0.391 + 0.000760 x1 + 0.000007 x2 + 0.125 x3

Predictor   Coef        SE Coef     T       P
Constant    -0.3906     0.1976      -1.98   0.064
x1          0.0007602   0.0002300   3.30    0.004
x2          0.0000069   0.0000045   1.55    0.139
x3          0.12495     0.05943     2.10    0.050

S = 0.0844346   R-Sq = 86.1%   R-Sq(adj) = 83.8%

Analysis of Variance
Source           DF   SS        MS        F       P
Regression       3    0.79486   0.26495   37.16   0.000
Residual Error   18   0.12833   0.00713
Total            21   0.92318

The values in the Coef column are the estimates of the regression coefficients. The coefficient of x1, 0.0007602, is interpreted as the average change in 6-year graduation rate for a 1-unit increase in median SAT score for enrolling students while the type of institution and the expenditures remain fixed. What are the interpretations of the coefficients of the predictor variables $x_2$ and $x_3$?
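The normal equations mentioned above can be solved directly for a small data set. The following is a minimal pure-Python sketch, not how Minitab or any particular package does it: it builds the normal equations $(X^T X)\,b = X^T y$ and solves them by Gaussian elimination. The data are made up and generated exactly from $y = 2 + 3x_1 - x_2 + 5x_3$ (with $x_3$ a 0/1 indicator) so the recovered coefficients can be checked; they are not the textbook's college data, and the function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def least_squares(rows, ys):
    """Return [a, b1, ..., bk] minimizing the sum of squared deviations."""
    # Leading 1.0 in each row corresponds to the intercept a.
    design = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Made-up observations (x1, x2, x3); x3 is a 0/1 indicator variable.
rows = [(1, 2, 0), (2, 1, 1), (3, 5, 0), (4, 3, 1), (5, 8, 0), (6, 2, 1)]
ys = [2 + 3 * x1 - x2 + 5 * x3 for x1, x2, x3 in rows]

coef = least_squares(rows, ys)
print([round(c, 6) for c in coef])  # close to [2.0, 3.0, -1.0, 5.0]
```

Because the made-up data contain no random deviation, the estimates reproduce the generating coefficients up to rounding error. With real data the same system is solved, but the estimates differ from the population coefficients.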
Three values in the Minitab output for the small college data help describe how well the model fits:
• R-Sq = 86.1% is the coefficient of multiple determination. It is the proportion of the variation in 6-year graduation rates that can be explained by the multiple regression model.
• S = 0.0844346 is $s_e$, the estimated standard deviation of the random deviation $e$.
• R-Sq(adj) = 83.8% is the adjusted $R^2$.

Is the model useful?
We use $s_e$, $R^2$, and the adjusted $R^2$ to determine how useful the multiple regression model is.

• The estimate for the random deviation variance $\sigma^2$ is given by
$$s_e^2 = \frac{SSResid}{n - (k + 1)}$$
Recall that SSResid is the sum of the squared residuals. Residuals are the differences between the observed $y$ values and the predicted $y$ values. The df is $n - (k + 1)$ because $(k + 1)$ df are lost in estimating the $k + 1$ coefficients $\alpha, \beta_1, \ldots, \beta_k$.

• The coefficient of multiple determination is
$$R^2 = 1 - \frac{SSResid}{SSTo}$$
Recall that SSTo is the sum of the squared deviations of the observed $y$ values from the mean of $y$; it is a measure of the total variability in the $y$ values.

Is the model useful? Continued . . .
• The adjusted $R^2$ is computed using
$$\text{adjusted } R^2 = 1 - \left[ \frac{n - 1}{n - (k + 1)} \right] \frac{SSResid}{SSTo}$$
The adjusted $R^2$ takes into account the number of predictor variables. This is important because, if you use a large number of predictors, you can account for most of the variability in $y$ even if no real relationship exists. Because the value in the square brackets exceeds 1, the value of adjusted $R^2$ is always smaller than $R^2$. On rare occasions, the adjusted $R^2$ may be negative.

Graduation Rates Continued . . .
Is this model useful? Let's look at these three values again: S = 0.0844346, R-Sq = 86.1%, and R-Sq(adj) = 83.8%. The value of $s_e$ is small and the value of $R^2$ is large. This means that most of the variation is accounted for by the model and the observations have little deviation from the predicted $y$ values. Also, the values of $R^2$ and the adjusted $R^2$ are close, which suggests that we haven't used too many predictors in our model.

F Distributions
• The model utility test for multiple regression is based on a probability distribution called the F distribution.
• Like the $t$ and $\chi^2$ distributions, the F distributions are based on df. However, an F distribution depends on two df values: $df_1$ for the numerator of the test statistic and $df_2$ for the denominator of the test statistic.
• Each different combination of $df_1$ and $df_2$ produces a different F distribution.

F Distributions Continued . . .
• Here are some graphs of different F curves.

[Figure: F curve for df1 = 3 and df2 = 18; F curve for df1 = 18 and df2 = 3]

All F tests in this textbook are upper-tailed. The P-value is the area under the associated F curve to the right of the calculated F value. Most statistical software packages and graphing calculators will compute this P-value.
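To make "the area under the F curve to the right of the calculated F value" concrete, here is a rough standard-library sketch of that upper-tail area. It is not how statistical packages compute it. It relies on the identity $P(F > f) = I_x(df_2/2,\ df_1/2)$ with $x = df_2 / (df_2 + df_1 f)$, where $I_x$ is the regularized incomplete beta function, and approximates the integral with Simpson's rule. The function name `f_upper_tail` and the `steps` parameter are hypothetical, and the sketch assumes $f > 0$ and $df_2 \ge 2$ so that the integrand stays bounded.

```python
import math


def f_upper_tail(f_value, df1, df2, steps=2000):
    """Approximate P(F > f_value) for an F distribution with (df1, df2) df."""
    if f_value <= 0 or df2 < 2:
        raise ValueError("this sketch only handles f_value > 0 and df2 >= 2")
    if steps % 2:
        raise ValueError("Simpson's rule needs an even number of steps")
    a, b = df2 / 2.0, df1 / 2.0
    x = df2 / (df2 + df1 * f_value)  # upper limit of the beta integral, 0 < x < 1

    def integrand(t):
        return t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)

    # Composite Simpson's rule on [0, x].
    h = x / steps
    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    incomplete_beta = total * h / 3.0

    # Normalize by the complete beta function B(a, b).
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return incomplete_beta / math.exp(log_beta)


# Tabled 5% critical value for df1 = 3, df2 = 18 is about 3.16.
print(f_upper_tail(3.16, 3, 18))  # close to 0.05
# A large F value leaves very little area to its right.
print(f_upper_tail(37.16, 3, 18))
```

The first call should land near 0.05 because 3.16 is the tabled upper-tail 5% point for 3 and 18 df. The second call corresponds to an F value far out in the tail, so the area is very small.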
F Test for Model Utility
Null Hypothesis: $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$
(There is no useful linear relationship between $y$ and any of the predictors.)

Alternative Hypothesis: $H_a$: At least one of $\beta_1, \ldots, \beta_k$ is not 0
(There is a useful linear relationship between $y$ and at least one of the predictors.)

Test Statistic:
$$F = \frac{SSRegr / k}{SSResid / (n - (k + 1))}$$
where $SSRegr = SSTo - SSResid$.

Assumptions: For any combination of predictor variable values, the distribution of $e$ is normal with mean 0 and constant variance $\sigma^2$.

Graduation Rates Continued . . .
The model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e$ was fitted to the small college data (found on pages 815-816 of the textbook).

$H_0$: $\beta_1 = \beta_2 = \beta_3 = 0$
$H_a$: at least one of the three $\beta$'s is not 0

Assumptions: A normal probability plot of the standardized residuals is quite straight, indicating that the assumption of normality of the random deviation distribution is reasonable.

Test Statistic:
$$F = \frac{0.79486 / 3}{0.12833 / 18} = \frac{0.26495}{0.00713} = 37.16$$
$df_1 = 3$, $df_2 = 18$, significance level $\alpha = .05$, P-value ≈ 0

Since P-value < $\alpha$, we reject $H_0$. There is evidence to confirm the usefulness of the multiple regression model.

Graduation Rates Continued . . .
Notice that the sums of squares are given in the Analysis of Variance table of the Minitab output. Dividing SSRegr (0.79486) by its df (3) produces the numerator of the F test statistic, MS = 0.26495. Similarly, dividing SSResid (0.12833) by its df (18) produces the denominator of the F test statistic, MS = 0.00713. Dividing these two MS terms produces the F test statistic, 37.16.

What factors contribute to the price of energy bars?
Minitab output for data (found on page 825 of the textbook) based on the following variables is shown below.
$y$ = price
$x_1$ = calorie content
$x_2$ = protein content
$x_3$ = fat content

The regression equation is
Price = 0.252 + 0.00125 Calories + 0.0485 Protein + 0.0444 Fat

Predictor   Coef       SE Coef    T      P
Constant    0.2511     0.3524     0.71   0.487
Calories    0.001254   0.001724   0.73   0.478
Protein     0.04849    0.01353    3.58   0.003
Fat         0.04445    0.03648    1.22   0.242

S = 0.2789   R-Sq = 74.7%   R-Sq(adj) = 69.6%

Analysis of Variance
Source           DF   SS       MS       F       P
Regression       3    3.4453   1.1484   14.76   0.000
Residual Error   15   1.1670   0.0778
Total            18   4.6122

According to the F test for model utility, the fitted multiple regression model is useful in predicting the price of the energy bars. However, looking at the t tests for each predictor, it appears that only the protein content variable is useful. Let's redo our model to include only the protein predictor variable.

What factors contribute to the price of energy bars?
Minitab output for data (found on page 825 of the textbook) based on the following variable is shown below.
$y$ = price
$x_2$ = protein content

The regression equation is
Price = 0.607 + 0.0623 Protein

Predictor   Coef       SE Coef    T      P
Constant    0.6072     0.1419     4.28   0.001
Protein     0.062256   0.009618   6.47   0.000

S = 0.279843   R-Sq = 71.1%   R-Sq(adj) = 69.4%

Analysis of Variance
Source           DF   SS       MS       F       P
Regression       1    3.2809   3.2809   41.90   0.000
Residual Error   17   1.3313   0.0783
Total            18   4.6122

According to the F test for model utility, the fitted simple regression model is also useful in predicting the price of the energy bars. Since the model with just one predictor has almost as large an adjusted $R^2$ (69.4%) as the multiple regression model (69.6%), it is preferable to use the simpler model.
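The F statistics and adjusted $R^2$ values quoted in this section can be reproduced from the sums of squares in the Analysis of Variance tables. The sketch below applies the two formulas from the slides. The function names are hypothetical, and the sample sizes $n = 22$ and $n = 19$ are inferred from the Total df (21 and 18) rather than stated in the output.

```python
def model_utility_f(ss_regr, ss_resid, n, k):
    """F = (SSRegr / k) / (SSResid / (n - (k + 1)))."""
    return (ss_regr / k) / (ss_resid / (n - (k + 1)))


def adjusted_r2(ss_resid, ss_to, n, k):
    """Adjusted R^2 = 1 - [(n - 1) / (n - (k + 1))] * (SSResid / SSTo)."""
    return 1 - ((n - 1) / (n - (k + 1))) * (ss_resid / ss_to)


# Sums of squares copied from the Minitab output; n is inferred as Total df + 1.
models = {
    "graduation rate, 3 predictors": dict(
        ss_regr=0.79486, ss_resid=0.12833, ss_to=0.92318, n=22, k=3),
    "energy bars, 3 predictors": dict(
        ss_regr=3.4453, ss_resid=1.1670, ss_to=4.6122, n=19, k=3),
    "energy bars, protein only": dict(
        ss_regr=3.2809, ss_resid=1.3313, ss_to=4.6122, n=19, k=1),
}

for name, m in models.items():
    f_stat = model_utility_f(m["ss_regr"], m["ss_resid"], m["n"], m["k"])
    adj = adjusted_r2(m["ss_resid"], m["ss_to"], m["n"], m["k"])
    print(f"{name}: F = {f_stat:.2f}, adjusted R^2 = {100 * adj:.1f}%")
# graduation rate, 3 predictors: F = 37.16, adjusted R^2 = 83.8%
# energy bars, 3 predictors: F = 14.76, adjusted R^2 = 69.6%
# energy bars, protein only: F = 41.90, adjusted R^2 = 69.4%
```

The last two lines show the comparison made in the energy bar example: dropping calories and fat barely changes the adjusted $R^2$, while the F statistic for the one-predictor model is larger.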