Regress Lecture 1 (transcript)
I. Introduction: Simple Linear Regression

As discussed last semester, what are the basic differences between correlation & regression? What vulnerabilities do correlation & regression share in common? What are the conceptual challenges regarding causality?

Linear regression is a statistical method for examining how an outcome variable y depends on one or more explanatory variables x. E.g., what is the relationship of the per capita earnings of households to their numbers of members & their members' ages, years of higher education, race-ethnicity, gender & employment statuses? What is the relationship of the fertility rates of countries to their levels of GDP per capita, urbanization, education, & so on? Linear regression is used extensively in the social, policy, & other sciences.

Multiple regression—i.e. linear regression with more than one explanatory variable—makes it possible to: combine many explanatory variables for optimal understanding &/or prediction; & examine the unique contribution of each explanatory variable, holding the levels of the other variables constant. Hence, multiple regression enables us to perform, in a setting of observational research, a rough approximation to experimental analysis. Why, though, is experimental control better than statistical control?

So, to some degree multiple regression enables us to isolate the independent relationships of particular explanatory variables with an outcome variable. Concerning the relationship of the per capita earnings of households to their numbers of members & their members' ages, years of education, race-ethnicity, gender & employment statuses: what is the independent effect of years of education on per capita household earnings, holding the other variables constant?

Regression is linear because it's based on a linear (i.e. straight-line) equation. E.g., for every one-year increase in a family member's higher education (an explanatory variable), household per capita earnings increase by $3127 on average, holding the other variables fixed. But such a statistical finding raises questions: e.g., is a year of college equivalent to a year of graduate school with regard to household earnings? We'll see that multiple regression can accommodate nonlinear as well as linear y/x relationships. And again, always question whether the relationship is causal.

Before proceeding, let's do a brief review of basic statistics. A variable is a feature that differs from one observation (i.e. individual or subject) to another. What are the basic kinds of variables? How do we describe them in, first, univariate terms, & second, bivariate terms? Why do we need to describe them both graphically & numerically? What's the fundamental problem with the mean as a measure of central tendency & the standard deviation as a measure of spread? When should we use them? Despite their problems, why are the mean & standard deviation used so commonly? What's a density curve? A normal distribution? What statistics describe a normal distribution? Why is it important? What's a standard normal distribution? What does it mean to standardize a variable, & how is it done? Are all symmetric distributions normal? What's a population? A sample? What's a parameter? A statistic? What are the two basic probability problems of samples, & how most basically do we try to mitigate them? Why is a sample mean typically used to estimate a parameter? What's an expected value? What's sampling variability? A sampling distribution? A population distribution?
What's the sampling distribution of a sample mean? The law of large numbers? The central limit theorem? Why's the central limit theorem crucial to inferential statistics? What's the difference between a standard deviation & a standard error? How do their formulas differ? What's the difference between the z- & t-distributions? Why do we typically use the latter? What's a confidence interval? What's its purpose? Its premises, formula, interpretation, & problems? How do we make it narrower? What's a hypothesis test? What's its purpose? Its premise & general formula? How is it stated? What's its interpretation? What are the typical standards for judging statistical significance? To what extent are they defensible or not? What's the difference between statistical & practical significance? What are Type I & Type II errors? What is the Bonferroni (or other such) adjustment? What are the possible reasons for a finding of statistical insignificance?

True or false, & why:
- Large samples are bad.
- To obtain roughly equal variability, we must take a much bigger sample in a big city than in a small city.
- You have data for an entire population. Next step: construct confidence intervals & conduct hypothesis tests for the variables. (Source: Freedman et al., Statistics.)
- (true-false continued) To fulfill the statistical assumptions of correlation or regression, what definitively matters for each variable is that its univariate distribution is linear & normal.

__________________________

Define the following: association; causation; lurking variables; Simpson's Paradox; spurious non-association; ecological correlation; restricted-range data; non-sampling errors.

__________________________

Regarding variables, ask: How are they defined & measured? In what ways are their definition & measurement valid or not? And what are the implications of the above for the social construction of reality? See King et al., Designing Social Inquiry; & Ragin, Constructing Social Research.

Remember the following overarching principles concerning statistics & social/policy research from last semester's course: (1) Anecdotal versus systematic evidence (including the importance of theories in guiding research). (2) Social construction of reality. (3) Experimental versus observational evidence. (4) Beware of lurking variables. (5) Variability is everywhere. (6) All conclusions are uncertain.

Recall the relative strengths & weaknesses of large-n, multivariate quantitative research versus small-n, comparative research & case-study research.

"Not everything worthwhile can be measured, and not everything measured is worthwhile." —Albert Einstein

And always question presumed notions of causality.

Finally, here are some more or less equivalent terms for variables:
- e.g., dependent, outcome, response, criterion, left-hand side
- e.g., independent, explanatory, predictor, regressor, control, right-hand side

__________________________

Let's return to the topic of linear regression. The dean of students wants to predict the grades of all students at the end of their freshman year. After taking a random sample, she could use the following equation:

y = E(y) + e

where y = freshman GPA, E(y) = expected value of freshman GPA, & e = random error.

Since the dean doesn't know the value of the random error for a particular student, this equation could be reduced to using the sample mean of freshman GPA to estimate a particular student's GPA:

ŷ = ȳ

That is, a student's predicted y (i.e. yhat) is estimated as equal to the sample mean of y. But what does that mini-model overlook?
That a more accurate model—& thus more precise predictions—can be obtained by using explanatory variables (e.g., SAT score, major, hours of study, gender, social class, race-ethnicity) to estimate freshman GPA.

Here we see a major advantage of regression versus correlation: regression permits y/x directionality* (including multiple explanatory variables). In addition, regression coefficients are expressed in the units in which the variables are measured.

* Recall from last semester: What are the 'two regression lines'? What questions are raised about causality?

We use a six-step procedure to create a regression model (as defined in a moment):
(1) Hypothesize the form of the model for E(y).
(2) Collect the sample data on outcome variable y & one or more explanatory variables x: a random sample, with data on all the regression variables collected for the same subjects.
(3) Use the sample data to estimate unknown parameters in the model.
(4) Specify the probability distribution of the random error term (i.e. the variability in the predicted values of outcome variable y), & estimate any unknown parameters of this distribution.
(5) Statistically check the usefulness of the model.
(6) When satisfied that the model is useful, use it for prediction, estimation, & so on.

We'll be following this six-step procedure for building regression models throughout the semester. Our emphasis, then, will be on how to build useful models: i.e. useful sets of explanatory variables x's & forms of their relationship to outcome variable y.

"A model is a simplification of, and approximation to, some aspect of the world. Models are never literally 'true' or 'false,' although good models abstract only the 'right' features of the reality they represent" (King et al., Designing Social Inquiry, page 49). Models both reflect & shape the social construction of reality.

We'll focus, then, on modeling: trying to describe how sets of explanatory variables x's are related to outcome variable y. Integral to this focus will be an emphasis on the interconnections of theory & empirical research (including questions of causality). We'll be thinking about how theory informs empirical research, & vice versa. See King et al., Designing Social Inquiry; Ragin, Constructing Social Research; McClendon, Multiple Regression and Causal Analysis; Berk, Regression: A Constructive Critique.

"A social science theory is a reasoned and precise speculation about the answer to a research question, including a statement about why the proposed answer is correct." "Theories usually imply several or more specific descriptive or causal hypotheses" (King et al., page 19). And to repeat: a model is "a simplification of, and approximation to, some aspect of reality" (King et al., page 49).

One more item before we delve into regression analysis. Regarding graphic assessment of the variables, keep the following points in mind:
- Use graphs to check distributions & outliers before describing or estimating variables & models, & after estimating models as well.
- The univariate distributions of the variables for regression analysis need not be normal! But the usual caveats concerning extreme outliers must be heeded.
- It's not the univariate graphs but the y/x bivariate scatterplots that provide the key evidence on these concerns.

Even so, let's anticipate a fundamental feature of multiple regression: the characteristics of bivariate scatterplots & correlations do not necessarily predict whether explanatory variables will be significant or not in a multiple regression model.
Moreover, bivariate relationships don't necessarily indicate whether a y/x relationship will be positive or negative within a multivariate framework. This is because multiple regression expresses the joint, linear effects of a set of explanatory variables on an outcome variable. See Agresti/Finlay, chapter 10; and McClendon, chapter 1 (and other chapters).

Let's start our examination of regression analysis, however, with a simple (i.e. one explanatory variable) regression model:

y = β0 + β1x + e

where:
  y = outcome variable
  x = explanatory variable
  E(y) = β0 + β1x = deterministic component
  e (epsilon) = random error component
  β0 (beta zero, or constant) = y-intercept
  β1 (beta one) = slope of the line, i.e. the amount of change in the mean of y for every one-unit change in x

[Kernel density plots of science score & math score, each with a normal density overlay]

. su science math
. corr science math

. scatter science math || qfit science math
[Scatterplot of science score against math score with quadratic fitted values]

. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |    19507.50   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

Interpretation? For every one-unit increase in x, y increases (or decreases) by ... units, on average. For every one-unit increase in math score, science score increases by 0.67, on average. Questions of causal order?

What's the standard deviation interpretation, based on the formulation for b, the regression coefficient?

. su science math
. corr science math

Or easier:

. listcoef, help

regress (N=200): Unstandardized and Standardized Estimates

  Observed SD: 9.9008908
  SD of Error: 7.7024665

--------------------------------------------------------------------------
 science |        b        t    P>|t|    bStdX    bStdY   bStdXY    SDofX
---------+----------------------------------------------------------------
    math |   0.66658   11.437   0.000   6.2448   0.0673   0.6307   9.3684
--------------------------------------------------------------------------
  b = raw coefficient
  t = t-score for test of b=0
  P>|t| = p-value for t-test
  bStdX = x-standardized coefficient
  bStdY = y-standardized coefficient
  bStdXY = fully standardized coefficient
  SDofX = standard deviation of X

What would happen if we reversed the equation?

. reg science math
[same output as above]   With science as the outcome variable.

. reg math science

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  6948.31801     1  6948.31801           Prob > F      =  0.0000
    Residual |   10517.477   198  53.1185707           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |   17465.795   199  87.7678141           Root MSE      =  7.2882

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     science |    .596814   .0521822    11.44   0.000     .4939098    .6997183
       _cons |   21.70019   2.754291     7.88   0.000     16.26868     27.1317
------------------------------------------------------------------------------

With math as the outcome variable.
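Before moving on, here is a minimal Stata check of the listcoef output above: the x-standardized coefficient is just the raw slope rescaled by the SD of math, & the fully standardized coefficient (which in simple regression equals the science/math correlation) rescales it by both SDs. This is only a sketch, assuming the hsb2 data are in memory; the scalar names sd_x & sd_y are illustrative, not from the lecture.

. quietly summarize math
. scalar sd_x = r(sd)
. quietly summarize science
. scalar sd_y = r(sd)
. quietly regress science math
. display "x-standardized b    = " _b[math]*sd_x        // ~6.2448 (bStdX)
. display "fully standardized b = " _b[math]*sd_x/sd_y  // ~0.6307 (bStdXY = r)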
What would be risky in saying that 'every one-unit increase in math scores causes a 0.67 increase in predicted science score'? Because: (1) Computer software will accept variables in any order & churn out regression y/x results—even if the order makes no sense. (2) Association does not necessarily signify causation. (3) Beware of lurking variables. (4) There's always the likelihood of non-sampling error. (5) It's much easier to disprove than prove causation. So be cautious! See McClendon (pp. 4-7) on issues of causal inference. How do we establish causality? Can regression analysis be worthwhile even if causality is ambiguous? See also Berk, Regression Analysis: A Constructive Critique.

Why is a regression model probabilistic rather than deterministic? Because the model is estimated from sample data & thus will include some variation due to random phenomena that can't be modeled or explained. That is, the random error component represents all unexplained variation in outcome variable y caused by important but omitted variables or by unexplainable random phenomena. Examples of a random error component for this model (i.e. using math scores to predict science scores)?

There are three basic sources of error in regression analysis: (1) sampling error; (2) measurement error (including non-sampling error); (3) omitted variables. See Allison, Multiple Regression: A Primer.

Examine the type & quality of the sample. Based on your knowledge of the topic: What variables are relevant? How should they be defined & measured? How actually are they defined & measured? Examine the diagnostics for the model's residuals (i.e. its probabilistic, or 'error', component).

After estimating a regression equation, we estimate the value of e associated with each y value using the corresponding residual, i.e. the deviation between the observed & predicted value of y. The model's random error component consists of deviations between the observed & predicted values of y. These are the residuals (which, to repeat, are estimates of the model's error component for each value of y):

eᵢ = yᵢ − ŷᵢ

Each observed science score minus each predicted science score.

. reg science math
[output as above]

. predict yhat                    [predicted values of y]
(option xb assumed; fitted values)
. predict e, resid                [residuals]
. sort science                    [to order its values from lowest to highest]
. su science yhat e
. list science yhat e in 1/10
. list science yhat e in 100/110
. list science yhat e in -10/l    ('l' indicates 'last')

. reg science math
[output as above]

SS Residuals (i.e. SSE) = 11746.94

The least squares line, or regression line, has two properties: (1) The sum of the errors (i.e. deviations, or residuals), SE, equals 0. (2) The sum of the squared errors, SSE, is smaller than for any other straight-line model with SE = 0.
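A minimal Stata check of these two properties, using the yhat & e variables created by the predict commands above (the generated variable esq is illustrative): the residuals should sum to essentially zero, & their squares should sum to SSE = 11746.94, which regress also stores as e(rss).

. quietly regress science math
. quietly summarize e
. display "sum of residuals = " r(sum)        // ~0, apart from rounding
. generate double esq = e^2
. quietly summarize esq
. display "SSE = " r(sum)                     // ~11746.94
. display "SSE stored by regress: " e(rss)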
The regression line is called the least squares line because it minimizes the distance between the equation's y-predictions & the data's y-observations (i.e. it minimizes the sum of squared errors, SSE). The better the model fits the data, the smaller the distance between the y-predictions & the y-observations.

Here are the values of the regression model's estimated beta (i.e. slope or regression) coefficient & y-intercept (i.e. constant) that minimize SSE:

Slope:        β̂1 = SSxy / SSxx
y-intercept:  β̂0 = ȳ − β̂1·x̄

where SSxy = Σ(xᵢ − x̄)(yᵢ − ȳ) and SSxx = Σ(xᵢ − x̄)².

Compute the y-intercept: β̂0 = ȳ − β̂1·x̄

. su science math

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     science |       200       51.85    9.900891         26         74
        math |       200      52.645    9.368448         33         75

. display 51.85 - (.66658*52.645)
16.757896

Note: math slope coefficient = .66658 (see the regression output).

. reg science math
[output as above]

The y-intercept (i.e. the constant) matches our calculation: 16.75789.

Compute math's slope coefficient:

β̂1 = SSxy / SSxx

That is: the sum of each math deviation from its mean times the corresponding science deviation from its mean, divided by the sum of the squared math deviations.

. reg science math
[output as above]

We'll eventually see that the probability distribution of e determines how well the model describes the population relationship between outcome variable y & explanatory variable x. In this context, there are four basic assumptions about the probability distribution of e. These are important (1) to minimize bias & (2) to make confidence intervals & hypothesis tests valid.

The Four Assumptions
(1) The expected value of e over all possible samples is 0. That is, the mean of e does not vary with the levels of x.
(2) The variance of the probability distribution of e is constant for all levels of x. That is, the variance of e does not vary with the levels of x.
(3) The errors associated with any two different y observations are independent. That is, the errors are uncorrelated: the errors associated with one value of y have no effect on the errors associated with other y values.
(4) The probability distribution of e is normal.

These assumptions of the regression model are commonly summarized as I.I.D.: independently & identically distributed errors.

As we'll come to understand, the assumptions make the estimated least squares line an unbiased estimator of the population value of the y-intercept & the slope coefficient—i.e. of the population value of y. Plus they make the standard errors of the estimated least squares line as small as possible & unbiased, so that confidence intervals & hypothesis tests are valid. Checking these vital assumptions—which need not hold exactly—is a basic part of post-estimation diagnostics.
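The lecture computes the y-intercept by hand above; here is a comparable by-hand computation of the slope from the SSxy/SSxx formula. This is only a minimal sketch, assuming the hsb2 data are in memory; the variable & scalar names (mdev, sdev, xy, xx, SSxy, SSxx) are illustrative.

. quietly summarize math
. scalar xbar = r(mean)
. quietly summarize science
. scalar ybar = r(mean)
. generate double mdev = math - xbar          // deviations of x from its mean
. generate double sdev = science - ybar       // deviations of y from its mean
. generate double xy = mdev*sdev
. generate double xx = mdev^2
. quietly summarize xy
. scalar SSxy = r(sum)
. quietly summarize xx
. scalar SSxx = r(sum)
. display "slope     = " SSxy/SSxx                     // ~.66658
. display "intercept = " ybar - (SSxy/SSxx)*xbar       // ~16.75789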
How do we estimate the variability of the random error e (which means variability in the predicted values of outcome variable y)? We do so by estimating the variance of e (i.e. the variance of the predicted values of outcome variable y). Why must we be concerned with the variance of e? Because the greater the variance of e, the greater will be the errors in the estimates of the y-intercept & slope coefficient. Thus the greater the variance of e, the more inaccurate will be the predicted value of y for any given value of x.

Since we don't know the population error variance, σ², we estimate it with sample data as follows:

s² = SSE / df for error (here, n − 2 = 198),   where SSE = Σ(yᵢ − ŷᵢ)²

That is, s² = the sum of (each observed science score minus each predicted science score)², divided by the df for error.

Standard error of e:  s = √s²

. reg science math
[output as above]

s² (yhat variance) = 59.33
s (yhat standard error) = 7.70

Interpretation of s (yhat's standard error): we are 95% certain that yhat's values fall within an interval of roughly +/- 2*7.70 (i.e. +/- two standard deviations). To display other confidence levels for this & the other regression output in STATA: reg y x1 x2, level(90)

Assessing the usefulness of the regression model: making inferences about slope β1.

Ho: β1 = 0.  Ha: β1 ≠ 0 (or a one-tailed Ha in either direction).

. reg science math
[output as above]

.66658/.0582822 = 11.44, p-value = 0.0000

Hypothesis test & conclusion? Depending on the selected alpha (i.e. test criterion) & on the test's p-value, either reject or fail to reject Ho. The hypothesis test's assumptions: a probability sample, & the previously discussed four assumptions about e.

. reg science math
[output as above]

How to compute a slope coefficient's confidence interval? Compute math's slope-coefficient confidence interval (.95):

. di invttail(199, .05/2)       [t, .95, df 199]
1.9719565

. di .66658 - (1.972*.0582822)
.5516475    = low side of CI

. di .66658 + (1.972*.0582822)
.7815125    = high side of CI

Note: math slope coefficient = .66658; math slope coefficient standard error = .0582822.

Conclusion for the confidence interval: we can say with 90% or 95% or 99% confidence that for every one-unit increase/decrease in x, y changes by +/- ...... units, on average. But remember: there are non-sampling sources of error, too.
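The same quantities (s², s, the t-statistic, & the slope's confidence interval) can also be pulled from the results Stata stores after regress. This is only a sketch; the scalar names s2, s, & tcrit are illustrative, & it uses the error df stored in e(df_r), i.e. 198.

. quietly regress science math
. scalar s2 = e(rss)/e(df_r)                   // 11746.9421/198 = 59.33
. scalar s  = sqrt(s2)                         // 7.70 = Root MSE
. display "t for math = " _b[math]/_se[math]   // 11.44
. scalar tcrit = invttail(e(df_r), .05/2)
. display "95% CI: " _b[math] - tcrit*_se[math] " to " _b[math] + tcrit*_se[math]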
Let's next discuss correlation.

Correlation: a linear relationship between two quantitative variables (though recall from last semester that 'spearman' & other such procedures compute correlation involving categorical variables, or when assumptions for correlation between two quantitative variables are violated). Beware of outliers & non-linearity: graph a bivariate scatterplot in order to conclude whether conducting a correlation test makes sense or not (& thus whether an alternative measure should be used).

Correlation assesses the degree of bivariate cluster along a straight line: the strength of a linear relationship. Regression examines the degree of y/x slope of a straight line: the extent to which y varies in response to changes in x. Regarding correlation, remember that association does not necessarily imply causation. And beware of lurking variables. Other limitations of correlation analysis?

Formula for the correlation coefficient: standardize each x observation & each y observation; cross-multiply each pair of x & y observations; divide the sum of the cross-products by n − 1. In short, the correlation coefficient is the average of the cross-products of the standardized x & y values. Here's the equivalent, sum-of-squares formula:

r = SSxy / √(SSxx·SSyy)

Hypothesis test for correlation:  Ho: ρxy = 0.  Ha: ρxy ≠ 0 (or a one-sided Ha in either direction).

Depending on the selected alpha & on the test's p-value, either reject or fail to reject Ho. The hypothesis test's assumptions?

Before estimating a correlation, of course, first graph the univariate & bivariate distributions. Look for overall patterns & striking deviations, especially outliers. Is the bivariate scatterplot approximately linear? Are there extreme outliers?

. hist science, norm
. hist math, norm
[Histograms of science score & math score with normal density overlays]

. scatter science math
[Scatterplot of science score against math score]  Approximately linear, no extreme outliers.

. scatter science math || lfit science math
. scatter science math || qfit science math
[Scatterplots of science against math with linear & quadratic fitted values]

Hypothesis test:  Ho: ρxy = 0.  Ha: ρxy ≠ 0.

. pwcorr science math, sig star(.05)

             |  science     math
-------------+------------------
     science |   1.0000
             |
        math |   0.6307*   1.0000
             |   0.0000

Hypothesis test conclusion?

Coefficient of determination, r²: r² (in simple but not multiple regression, just square the correlation coefficient) represents the proportion of the sum of squares of deviations of the y values about their mean that can be attributed to a linear relationship between y & x. Interpretation: about 100(r²)% of the sample variation in y can be attributed to the use of x to predict y in the straight-line model. Higher r² signifies better fit: greater cluster along the y/x straight line.

Formula for r² in simple & multiple regression:

r² = (SSyy − SSE) / SSyy
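A minimal Stata sketch of both formulas, assuming the hsb2 data are in memory (the variable names zs, zm, & cross are illustrative): the average cross-product of the standardized scores reproduces r, & the sum-of-squares version of r² reproduces the regression's R-squared.

. quietly summarize science
. generate double zs = (science - r(mean))/r(sd)      // standardized science
. quietly summarize math
. generate double zm = (math - r(mean))/r(sd)         // standardized math
. generate double cross = zs*zm
. quietly summarize cross
. display "r = " r(sum)/(r(N) - 1)                              // ~0.6307
. quietly regress science math
. display "r2 = (SSyy - SSE)/SSyy = " e(mss)/(e(mss) + e(rss))  // ~0.3978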
How would this be computed for the regression of science on math?

. reg science math
[output as above]

r² = Model SS / Total SS = 7760.56/19507.5

Let's step back for a moment & review the matter of explained versus unexplained variation in an estimated regression model.

DATA = FIT + RESIDUAL

What does this mean? Why does it matter?

DATA: total variation in outcome variable y; measured by the total sum of squares.
FIT: variation in outcome variable y attributed to the explanatory variable x (i.e. by the model); measured by the model sum of squares.
RESIDUAL: variation in outcome variable y attributed to the estimated errors; measured by the residual (or error) sum of squares.

DATA = FIT + RESIDUAL
SST = SSM + SSE

Sum of Squares Total (SST): subtract the mean of y from each observed y; square each deviation; sum the squared values.
Sum of Squares for Model (SSM): subtract the mean of y from each predicted y; square each deviation; sum the squared values.
Sum of Squares for Errors (SSE): subtract each predicted y from its observed y; square each deviation; sum the squared values.

. reg science math
[output as above]

Next step: compute the variance for each component by dividing its sum of squares by its degrees of freedom—its Mean Square: the Mean Square for Total, the Mean Square for Model, & the Mean Square for Errors (Residuals). (Unlike the sums of squares, the mean squares themselves do not add up.)

s²: Mean Square for Errors (Residuals)
s: Root Mean Square (se of yhat)

. reg science math
[output as above]

Root MSE = square root of 59.3279904

Analysis of Variance (ANOVA) Table: the regression output displaying the sums of squares & mean squares for model, residual (error), & total.
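Here is a minimal Stata sketch of the SST = SSM + SSE decomposition, computed directly from the definitions just given. It assumes the yhat variable created by predict earlier; the generated variable & scalar names (sst_i, ssm_i, sse_i, SST, SSM, SSE_check) are illustrative.

. quietly summarize science
. scalar ybar = r(mean)
. generate double sst_i = (science - ybar)^2     // squared total deviations
. generate double ssm_i = (yhat - ybar)^2        // squared model deviations
. generate double sse_i = (science - yhat)^2     // squared residuals
. quietly summarize sst_i
. scalar SST = r(sum)
. quietly summarize ssm_i
. scalar SSM = r(sum)
. quietly summarize sse_i
. scalar SSE_check = r(sum)
. display "SST = " SST "   SSM = " SSM "   SSE = " SSE_check
. display "SSM + SSE = " SSM + SSE_check          // equals SST = 19507.5
. display "F  = " (SSM/1)/(SSE_check/198)         // ~130.81
. display "r2 = " SSM/SST                         // ~0.3978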
How do we compute F & r² from the ANOVA table?

F = Mean Square Model / Mean Square Residual
r² = Sum of Squares Model / Sum of Squares Total

. reg science math
[output as above]

F = MSM/MSR = 7760.55791/59.3279904 = 130.81
r² = MSS/TSS = 7760.55791/19507.50 = 0.3978

DATA = FIT + RESIDUAL
SST = SSM + SSE

Sum of Squares Total (SST): subtract the mean of y from each observed y; square each deviation; sum the squared values.
Sum of Squares for Model (SSM): subtract the mean of y from each predicted y; square each deviation; sum the squared values.
Sum of Squares for Errors (SSE): subtract each predicted y from its observed y; square each deviation; sum the squared values.

Using the regression model for estimation & prediction. Fundamental point: never make predictions beyond the range of the sampled (i.e. observed) x values. That is, while the model may provide a good fit for the sampled range of values, it could give a poor fit outside the sampled x-value range. Another point in making predictions: the standard error for the estimated mean of y will be less than that for an estimated individual y observation. That is, there's more uncertainty in predicting individual y values than mean y values. Why is this so?

Let's review how STATA reports the indicators of how a regression model fits the sampled data.

. reg science math
[output as above]

Software regression output typically refers to the residual terms more or less as follows:
s² = Mean Square Error (MSE: variance of predicted y / d.f.)
s = Root Mean Square Error (Root MSE: standard error of predicted y, which equals the square root of MSE)

Stata labels the Residuals, Mean Square Error & Root Mean Square Error as follows:
Top-left table
  SS for Residual: sum of squared errors
  MS for Residual: variance of predicted y / d.f.
Top-right column
  Root MSE: standard error of predicted y
& moreover there's R² (as well as F & other indicators that we'll examine next week).

. reg science math
[output as above]

SS Residual / df Residual = MS Residual: variance of yhat.
Root MSE = sqrt(MS Residual): standard error of yhat.
r² = SS Model / SS Total

. reg science math
[output as above]

F = MSM/MSR

The most basic ways to make a linear prediction of y (i.e. yhat) after estimating a simple regression model?

. display 16.75789 + .66658
. display 16.75789 + .66658*45
. lincom _cons + math
. lincom _cons + math*45

(lincom: linear combination; provides a confidence interval for the prediction)

In summary, we use a six-step procedure to create a regression model:
(1) Hypothesize the form of the model for E(y).
(2) Collect the sample data: a random sample, with data for the regression variables collected on the same subjects.
(3) Use the sample data to estimate unknown parameters in the model.
(4) Specify the probability distribution of the random error term, & estimate any unknown parameters of this distribution.
(5) Statistically check the usefulness of the model.
(6) When satisfied that the model is useful, use it for prediction, estimation, & so on.
See King et al.

Finally, the four fundamental assumptions of regression analysis involve the probability distribution of e (the model's random component, which consists of the residuals). These assumptions can be summarized as I.I.D.

The univariate distributions of the variables for regression analysis need not be normal! But the usual caveats concerning extreme outliers are important. It's not the univariate graphs but the y/x bivariate scatterplots that provide the key evidence on these concerns. We'll nonetheless see that the characteristics of bivariate relationships do not necessarily predict whether explanatory variables will test significant, or the direction of their coefficients, in a multiple regression model. We'll see, rather, that a multiple regression model expresses the joint, linear effects of a set of explanatory variables on an outcome variable.

Review: Regress science achievement scores on math achievement scores.

. use hsb2, clear

Note: recall that these are not randomly sampled data.

. hist science, norm
. hist math, norm
[Histograms of science score & math score with normal density overlays]

. su science, detail

                       science score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           30             26
 5%           34             29
10%           39             31       Obs                 200
25%           44             31       Sum of Wgt.         200

50%           53                      Mean              51.85
                        Largest       Std. Dev.      9.900891
75%           58             69
90%         64.5             72       Variance       98.02764
95%         66.5             72       Skewness      -.1872277
99%           72             74       Kurtosis       2.428308

. su math, d

                         math score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           36             33
 5%           39             35
10%           40             37       Obs                 200
25%           45             38       Sum of Wgt.         200

50%           52                      Mean             52.645
                        Largest       Std. Dev.      9.368448
75%           59             72
90%         65.5             73       Variance       87.76781
95%         70.5             75       Skewness       .2844115
99%           74             75       Kurtosis       2.337319
. scatter science math || qfit science math
[Scatterplot of science score against math score with quadratic fitted values]

Conclusion about approximate linearity & outliers?

. pwcorr science math, obs bonf sig star(.05)

             |  science     math
-------------+------------------
     science |   1.0000
             |
             |      200
             |
        math |   0.6307*   1.0000
             |   0.0000
             |      200       200

Formula for the correlation coefficient? Hypothesis test & conclusion?

. reg science math
[output as above]

# observations? df? residuals, formula? yhat variance, formula? yhat standard error, formula? F, formula? r², formula? y-intercept, CI, formula? slope coefficient, CI, formula? slope hypothesis test?

Graph the linear prediction for yhat with a confidence interval:

. twoway qfitci science math, blc(blue)
[Fitted values of science score on math score with a 95% CI band]

Predictions of yhat using STATA's calculator:

. display 16.75789 + .66658*45
46.75399

. di 16.75789 + .66658*65
60.08559

Predictions for yhat using 'lincom':

. lincom _cons + math*45

 ( 1)  45.0 math + _cons = 0.0

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |     46.754   .7036832    66.44   0.000     45.36632    48.14167
------------------------------------------------------------------------------

. lincom _cons + math*65

 ( 1)  65.0 math + _cons = 0.0

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    60.0856   .9028563    66.55   0.000     58.30515    61.86604
------------------------------------------------------------------------------
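One last sketch, picking up the earlier point that the standard error for an estimated mean of y is smaller than that for an individual y prediction: after regress, Stata's predict can return both kinds of standard error. The variable names se_mean & se_indiv are illustrative.

. quietly regress science math
. predict double se_mean, stdp      // standard error of the predicted mean of y at each x
. predict double se_indiv, stdf     // standard error for forecasting an individual y
. summarize se_mean se_indiv        // se_indiv exceeds se_mean at every x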