I. Introduction: Simple Linear Regression
• As discussed last semester, what are the basic differences between correlation & regression?
• What vulnerabilities do correlation & regression share in common?
• What are the conceptual challenges regarding causality?
• Linear regression is a statistical method for examining how an outcome variable y depends on one or more explanatory variables x.
• E.g., what is the relationship of the per capita earnings of households to their numbers of members & their members' ages, years of higher education, race-ethnicity, gender & employment statuses?
• What is the relationship of the fertility rates of countries to their levels of GDP per capita, urbanization, education, & so on?
• Linear regression is used extensively in the social, policy, & other sciences.
Multiple regression—i.e. linear regression with more than one explanatory variable—makes it possible to:
• Combine many explanatory variables for optimal understanding &/or prediction; &
• Examine the unique contribution of each explanatory variable, holding the levels of the other variables constant.
• Hence, multiple regression enables us to perform, in a setting of observational research, a rough approximation to experimental analysis.
• Why, though, is experimental control better than statistical control?
• So, to some degree multiple regression enables us to isolate the independent relationships of particular explanatory variables with an outcome variable.
So, concerning the relationship of the per capita earnings of households to their numbers of members & their members' ages, years of education, race-ethnicity, gender & employment statuses:
• What is the independent effect of years of education on per capita household earnings, holding the other variables constant?
• Regression is linear because it's based on a linear (i.e. straight-line) equation.
• E.g., for every one-year increase in a family member's higher education (an explanatory variable), household per capita earnings increase by $3127 on average, holding the other variables fixed.
• But such a statistical finding raises questions: e.g., is a year of college equivalent to a year of graduate school with regard to household earnings?
• We'll see that multiple regression can accommodate nonlinear as well as linear y/x relationships.
• And again, always question whether the relationship is causal.
Before proceeding, let's do a brief review of basic statistics.
• A variable is a feature that differs from one observation (i.e. individual or subject) to another.
• What are the basic kinds of variables?
• How do we describe them in, first, univariate terms, & second, bivariate terms?
• Why do we need to describe them both graphically & numerically?
• What's the fundamental problem with the mean as a measure of central tendency & the standard deviation as a measure of spread? When should we use them?
• Despite their problems, why are the mean & standard deviation used so commonly?
• What's a density curve? A normal distribution? What statistics describe a normal distribution? Why is it important?
• What's a standard normal distribution? What does it mean to standardize a variable, & how is it done? (See the sketch after this list.)
• Are all symmetric distributions normal?
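As a quick Stata reminder, here is a minimal sketch of standardizing a variable (the variable names x, zx & zx2 are hypothetical):

. quietly summarize x
. generate zx = (x - r(mean)) / r(sd)   // z = (value - mean)/sd
. egen zx2 = std(x)                     // equivalent one-line alternative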
• What's a population? A sample? What's a parameter? A statistic? What are the two basic probability problems of samples, & how most basically do we try to mitigate them?
• Why is a sample mean typically used to estimate a parameter? What's an expected value?
• What's sampling variability? A sampling distribution? A population distribution?
• What's the sampling distribution of a sample mean? The law of large numbers? The central limit theorem?
• Why's the central limit theorem crucial to inferential statistics?
• What's the difference between a standard deviation & a standard error? How do their formulas differ?
• What's the difference between the z- & t-distributions? Why do we typically use the latter?
• What's a confidence interval? What's its purpose? Its premises, formula, interpretation, & problems? How do we make it narrower?
• What's a hypothesis test? What's its purpose? Its premise & general formula? How is it stated? What's its interpretation?
• What are the typical standards for judging statistical significance? To what extent are they defensible or not?
• What's the difference between statistical & practical significance?
• What are Type I & Type II errors? What is the Bonferroni (or other such) adjustment?
• What are the possible reasons for a finding of statistical insignificance?
True or false, & why:
• Large samples are bad.
• To obtain roughly equal variability, we must take a much bigger sample in a big city than in a small city.
• You have data for an entire population. Next step: construct confidence intervals & conduct hypothesis tests for the variables.
Source: Freedman et al., Statistics.
(true-false continued)
• To fulfill the statistical assumptions of correlation or regression, what definitively matters for each variable is that its univariate distribution is linear & normal.
__________________________
Define the following:
• Association
• Causation
• Lurking variables
• Simpson's Paradox
• Spurious non-association
• Ecological correlation
• Restricted-range data
• Non-sampling errors
_________________________
Regarding variables, ask:
• How are they defined & measured?
• In what ways are their definition & measurement valid or not?
• & what are the implications of the above for the social construction of reality?
• See King et al., Designing Social Inquiry; & Ragin, Constructing Social Research.
Remember the following overarching principles concerning statistics & social/policy research from last semester's course:
(1) Anecdotal versus systematic evidence (including the importance of theories in guiding research).
(2) Social construction of reality.
(3) Experimental versus observational evidence.
(4) Beware of lurking variables.
(5) Variability is everywhere.
(6) All conclusions are uncertain.
• Recall the relative strengths & weaknesses of large-n, multivariate quantitative research versus small-n, comparative research & case-study research.
• "Not everything worthwhile can be measured, and not everything measured is worthwhile." Albert Einstein
• And always question presumed notions of causality.
Finally, here are some more or less equivalent terms for variables:
• e.g., dependent, outcome, response, criterion, left-hand side
• e.g., independent, explanatory, predictor, regressor, control, right-hand side
__________________________
Let's return to the topic of linear regression.
• The dean of students wants to predict the grades of all students at the end of their freshman year. After taking a random sample, she could use the following equation:

$y = E(y) + e$

where $y$ = freshman GPA, $E(y)$ = the expected value of freshman GPA, & $e$ = random error.

• Since the dean doesn't know the value of the random error for a particular student, this equation could be reduced to using the sample mean of freshman GPA to estimate a particular student's GPA:

$\hat{y} = \bar{y}$

That is, a student's predicted y (i.e. y-hat) is estimated as equal to the sample mean of y.
• But what does that mini-model overlook?
• That a more accurate model—& thus more precise predictions—can be obtained by using explanatory variables (e.g., SAT score, major, hours of study, gender, social class, race-ethnicity) to estimate freshman GPA.
• Here we see a major advantage of regression versus correlation: regression permits y/x directionality* (including multiple explanatory variables).
• In addition, regression coefficients are expressed in the units in which the variables are measured.
* Recall from last semester: What are the 'two regression lines'? What questions are raised about causality?
• We use a six-step procedure to create a regression model (as defined in a moment):
(1) Hypothesize the form of the model for E(y).
(2) Collect the sample data on outcome variable y & one or more explanatory variables x: a random sample, with data on all the regression variables collected for the same subjects.
(3) Use the sample data to estimate the unknown parameters in the model.
(4) Specify the probability distribution of the random error term (i.e. the variability of outcome variable y around its predicted values), & estimate any unknown parameters of this distribution.
(5) Statistically check the usefulness of the model.
(6) When satisfied that the model is useful, use it for prediction, estimation, & so on.
• We'll be following this six-step procedure for building regression models throughout the semester.
• Our emphasis, then, will be on how to build useful models: i.e. useful sets of explanatory variables x & forms of their relationship to outcome variable y.
• "A model is a simplification of, and approximation to, some aspect of the world. Models are never literally 'true' or 'false,' although good models abstract only the 'right' features of the reality they represent" (King et al., Designing Social Inquiry, page 49).
• Models both reflect & shape the social construction of reality.
• We'll focus, then, on modeling: trying to describe how sets of explanatory variables x are related to outcome variable y.
• Integral to this focus will be an emphasis on the interconnections of theory & empirical research (including questions of causality).
• We'll be thinking about how theory informs empirical research, & vice versa.
• See King et al., Designing Social Inquiry; Ragin, Constructing Social Research; McClendon, Multiple Regression and Causal Analysis; Berk, Regression: A Constructive Critique.
• "A social science theory is a reasoned and precise speculation about the answer to a research question, including a statement about why the proposed answer is correct."
• "Theories usually imply several or more specific descriptive or causal hypotheses" (King et al., page 19).
• And to repeat: a model is "a simplification of, and approximation to, some aspect of reality" (King et al., page 49).
• One more item before we delve into regression analysis: regarding graphic assessment of the variables, keep the following points in mind:
• Use graphs to check distributions & outliers before describing or estimating variables & models; & after estimating models as well.
• The univariate distributions of the variables for regression analysis need not be normal!
• But the usual caveats concerning extreme outliers must be heeded.
• It's not the univariate graphs but the y/x bivariate scatterplots that provide the key evidence on these concerns.
Even so, let's anticipate a fundamental feature of multiple regression:
• The characteristics of bivariate scatterplots & correlations do not necessarily predict whether explanatory variables will be significant or not in a multiple regression model.
• Moreover, bivariate relationships don't necessarily indicate whether a y/x relationship will be positive or negative within a multivariate framework.
• This is because multiple regression expresses the joint, linear effects of a set of explanatory variables on an outcome variable.
• See Agresti/Finlay, chapter 10; and McClendon, chapter 1 (and other chapters).
• Let's start our examination of regression analysis, however, with a simple (i.e. one explanatory variable) regression model:

$y = \beta_0 + \beta_1 x + \varepsilon$

where:
$y$ = outcome variable
$x$ = explanatory variable
$E(y) = \beta_0 + \beta_1 x$ = deterministic component
$\varepsilon$ (epsilon) = random error component
$\beta_0$ (beta zero, or constant) = y-intercept
$\beta_1$ (beta one) = slope of the line, i.e., the amount of change in the mean of y for every one-unit change in x
[Figure: kernel density estimate of science score with normal density overlay]

[Figure: kernel density estimate of math score with normal density overlay]
. su science math
. corr science math
. scatter science math || qfit science math

[Figure: scatterplot of science score vs. math score with fitted values]
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------
Interpretation?
• For every one-unit increase in x, y increases (or decreases) by … units, on average.
• For every one-unit increase in math score, science score increases by 0.67, on average.
• Questions of causal order?
• What's the standard deviation interpretation, based on the formulation for b, the regression coefficient?
. su science math
. corr science math
• Or easier:
. listcoef, help
regress (N=200): Unstandardized and Standardized Estimates

  Observed SD: 9.9008908
  SD of Error: 7.7024665

---------------------------------------------------------------------------
     science |       b        t     P>|t|    bStdX    bStdY   bStdXY   SDofX
-------------+-------------------------------------------------------------
        math | 0.66658   11.437    0.000   6.2448   0.0673   0.6307  9.3684
---------------------------------------------------------------------------
       b = raw coefficient
       t = t-score for test of b=0
   P>|t| = p-value for t-test
   bStdX = x-standardized coefficient
   bStdY = y-standardized coefficient
  bStdXY = fully standardized coefficient
   SDofX = standard deviation of X
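To see where the standardized coefficients come from, here is a minimal sketch using the SDs reported above (SD of math = 9.368448; SD of science = 9.900891):

. di .66658 * 9.368448              // bStdX = b*SD(x) = 6.2448
. di .66658 / 9.900891              // bStdY = b/SD(y) = 0.0673
. di .66658 * 9.368448 / 9.900891   // bStdXY = b*SD(x)/SD(y) = 0.6307, which in simple regression equals r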
• What would happen if we reversed the equation?
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

• With science as the outcome variable.
. reg math science

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  6948.31801     1  6948.31801           Prob > F      =  0.0000
    Residual |   10517.477   198  53.1185707           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |   17465.795   199  87.7678141           Root MSE      =  7.2882

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     science |    .596814   .0521822    11.44   0.000     .4939098    .6997183
       _cons |   21.70019   2.754291     7.88   0.000     16.26868     27.1317
------------------------------------------------------------------------------

• With math as the outcome variable.
• What would be risky in saying that 'every one-unit increase in math scores causes a 0.67 increase in predicted science score'?
• Because:
(1) Computer software will accept variables in any order & churn out regression y/x results—even if the order makes no sense.
(2) Association does not necessarily signify causation.
(3) Beware of lurking variables.
(4) There's always the likelihood of non-sampling error.
(5) It's much easier to disprove than to prove causation.
• So be cautious!
• See McClendon (pp. 4-7) on issues of causal inference.
• How do we establish causality?
• Can regression analysis be worthwhile even if causality is ambiguous?
• See also Berk, Regression Analysis: A Constructive Critique.
• Why is a regression model probabilistic rather than deterministic?
• Because the model is estimated from sample data & thus will include some variation due to random phenomena that can't be modeled or explained.
• That is, the random error component represents all unexplained variation in outcome variable y caused by important but omitted variables or by unexplainable random phenomena.
• Examples of a random error component for this model (i.e. using science scores to predict math scores)?
• There are three basic sources of error in regression analysis:
(1) Sampling error
(2) Measurement error (including non-sampling error)
(3) Omitted variables
See Allison, Multiple Regression: A Primer.
• Examine the type & quality of the sample.
• Based on your knowledge of the topic: what variables are relevant? How should they be defined & measured? How actually are they defined & measured?
• Examine the diagnostics for the model's residuals (i.e. its probabilistic, or 'error', component).
• After estimating a regression equation, we estimate the value of e associated with each y value using the corresponding residual, i.e. the deviation between the observed & predicted value of y.
• The model's random error component consists of deviations between the observed & predicted values of y. These are the residuals (which, to repeat, are estimates of the model's error component for each value of y).
$e_i = y_i - \hat{y}_i$

• Each observed science score minus each predicted science score.
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------
. predict yhat   [predicted values of y]
(option xb assumed; fitted values)
. predict e, resid   [residuals]
. sort science   [to order its values from lowest to highest]
. su science yhat e
. list science yhat e in 1/10
. list science yhat e in 100/110
. list science yhat e in -10/l   ('l' indicates 'last')
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

• SS Residual (i.e. SSE) = 11746.94
• The least squares line, or regression line, has two properties:
(1) The sum of the errors (i.e. deviations, or residuals), SE, equals 0.
(2) The sum of the squared errors, SSE, is smaller than for any other straight-line model with SE = 0.
• The regression line is called the least squares line because it minimizes the distance between the equation's y-predictions & the data's y-observations (i.e. it minimizes the sum of squared errors, SSE).
• The better the model fits the data, the smaller the distance between the y-predictions & the y-observations. (A quick check of property (1) appears below.)
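A minimal Stata check of property (1) on our fitted model (the variable name res is hypothetical):

. quietly reg science math
. predict double res, resid   // res holds the residuals
. quietly su res
. di r(sum)                   // approximately 0, up to rounding error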
• Here are the values of the regression model's estimated beta (i.e. slope or regression) coefficient & y-intercept (i.e. constant) that minimize SSE:

Slope: $\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}}$

y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where:

$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$

$SS_{xx} = \sum (x_i - \bar{x})^2$

• Compute the y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
. su science math

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     science |       200       51.85    9.900891         26         74
        math |       200      52.645    9.368448         33         75

. display 51.85 - (.66658*52.645)
16.757896

Note: math slope coefficient = .66658 (see regression output).
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

• The y-intercept (i.e. the constant) matches our calculation: 16.75789.
• Compute math's slope coefficient:

$\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}}$

That is, the sum of the products of each math value's deviation from the math mean times the corresponding science value's deviation from the science mean, divided by the sum of the squared deviations of the math values from the math mean. (A hand computation is sketched below.)
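Here is a minimal Stata sketch of that hand computation (the generated variable & scalar names are hypothetical):

. quietly su math
. gen double devx = math - r(mean)      // math deviations from the math mean
. quietly su science
. gen double devy = science - r(mean)   // science deviations from the science mean
. gen double xy = devx*devy
. gen double xx = devx^2
. quietly su xy
. scalar ssxy = r(sum)                  // SSxy
. quietly su xx
. scalar ssxx = r(sum)                  // SSxx
. di ssxy/ssxx                          // should reproduce the slope, .66658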
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------
• We'll eventually see that the probability distribution of e determines how well the model describes the population relationship between outcome variable y & explanatory variable x.
• In this context, there are four basic assumptions about the probability distribution of e.
• These are important (1) to minimize bias & (2) to make confidence intervals & hypothesis tests valid.
The Four Assumptions
(1) The expected value of e over all possible samples is 0. That is, the mean of e does not vary with the levels of x.
(2) The variance of the probability distribution of e is constant for all levels of x. That is, the variance of e does not vary with the levels of x.
(3) The covariance of the errors associated with any two different y observations is 0. That is, the errors are uncorrelated: the errors associated with one value of y have no effect on the errors associated with other y values.
(4) The probability distribution of e is normal.
• These assumptions of the regression model are commonly summarized as I.I.D.: independently & identically distributed errors.
• As we'll come to understand, the assumptions make the estimated least squares line an unbiased estimator of the population y-intercept & slope coefficient—i.e. of the population regression line for y.
• Plus they make the standard errors of the estimated least squares line as small as possible & unbiased, so that confidence intervals & hypothesis tests are valid.
• Checking these vital assumptions—which need not hold exactly—is a basic part of post-estimation diagnostics.
• How do we estimate the variability of the random error e (i.e. the variability of outcome variable y around its predicted values)?
• We do so by estimating the variance of e.
• Why must we be concerned with the variance of e?
• Because the greater the variance of e, the greater will be the errors in the estimates of the y-intercept & slope coefficient.
• Thus the greater the variance of e, the more inaccurate will be the predicted value of y for any given value of x.
• Since we don't know the population error variance, $\sigma^2$, we estimate it with sample data as follows:

$s^2 = \dfrac{SSE}{\text{df for error}}$, where $SSE = \sum (y_i - \hat{y}_i)^2$

That is, s2 = the sum of (each observed science score minus its predicted science score)², divided by the df for error. (A sketch of this computation follows.)

• Standard error of the estimate: $s = \sqrt{s^2}$
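A minimal sketch of this computation in Stata, run after reg science math (the variable names ehat & esq are hypothetical):

. predict double ehat, resid
. gen double esq = ehat^2
. quietly su esq
. di r(sum)/(r(N) - 2)          // s2 = SSE/(n - 2) = 59.33, the MS Residual
. di sqrt(r(sum)/(r(N) - 2))    // s = 7.70, the Root MSE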
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------
• s2 (the MS Residual, i.e. the estimated error variance) = 59.33; s (the Root MSE, i.e. the standard error of the estimate) = 7.70.
• Interpretation of s: roughly 95% of the observed y values fall within about +/- 2s (i.e. +/- 2*7.70) of their predicted values.
• To display other confidence levels for this & the other regression output in STATA: reg y x1 x2, level(90)
• Assessing the usefulness of the regression model: making inferences about slope $\beta_1$.

$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$ (or a one-tailed $H_a$ in either direction)
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------
 .66658/.0582822=11.44, p-value=0.0000
 Hypothesis test & conclusion?
 Depending on the selected alpha
(i.e. test criterion) & on the test’s
p-value, either reject or fail to
reject Ho.
 The hypothesis test’s
assumptions: probability sample; &
the previously discussed four
assumptions about e.
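A minimal sketch of the t statistic & its two-tailed p-value in Stata (ttail() returns the upper-tail probability; the error df here is n − 2 = 198):

. di .66658/.0582822       // t = 11.44
. di 2*ttail(198, 11.44)   // two-tailed p-value: effectively 0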
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------
• How to compute a slope coefficient's confidence interval?
• Compute math's slope-coefficient confidence interval (.95):

. di invttail(199, .05/2)
1.9719565

(This is t.95 with df = 199; strictly, the error df is n − 2 = 198, which yields essentially the same critical value.)

. di .66658 - (1.972*.0582822)
.5516475   = low side of CI

. di .66658 + (1.972*.0582822)
.7815125   = high side of CI

Note: math slope coefficient = .66658; math slope-coefficient standard error = .0582822.
Conclusion for the confidence interval:
• We can say with 90% or 95% or 99% confidence that for every one-unit increase/decrease in x, y changes by +/- …… units, on average.
• But remember: there are non-sampling sources of error, too.
• Let's next discuss correlation.
• Correlation: a linear relationship between two quantitative variables (though recall from last semester that 'spearman' & other such procedures compute correlations involving categorical variables, or when assumptions for correlation between two quantitative variables are violated).
• Beware of outliers & non-linearity: graph a bivariate scatterplot in order to conclude whether conducting a correlation test makes sense or not (& thus whether an alternative measure should be used).
• Correlation assesses the degree of bivariate clustering along a straight line: the strength of a linear relationship.
• Regression examines the degree of y/x slope of a straight line: the extent to which y varies in response to changes in x.
• Regarding correlation, remember that association does not necessarily imply causation.
• And beware of lurking variables.
• Other limitations of correlation analysis?
Formula for the correlation coefficient:
• Standardize each x observation & each y observation.
• Cross-multiply each pair of standardized x & y observations.
• Divide the sum of the cross-products by n − 1.
• In short, the correlation coefficient is the average of the cross-products of the standardized x & y values. (See the sketch below.)
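A minimal Stata sketch of this recipe (the variable names zx, zy & cross are hypothetical):

. egen zx = std(math)
. egen zy = std(science)
. gen double cross = zx*zy
. quietly su cross
. di r(sum)/(r(N) - 1)   // r: the average cross-product, using n - 1; here 0.6307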
• Here's the equivalent, sum-of-squares formula:

$r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$
• Hypothesis test for correlation:

$H_0: \rho_{xy} = 0$
$H_a: \rho_{xy} \neq 0$ (or a one-sided $H_a$ in either direction)

• Depending on the selected alpha & on the test's p-value, either reject or fail to reject H0.
• The hypothesis test's assumptions?
• Before estimating a correlation, of course, first graph the univariate & bivariate distributions.
• Look for overall patterns & striking deviations, especially outliers.
• Is the bivariate scatterplot approximately linear? Are there extreme outliers?
. hist science, norm

[Figure: histogram of science score with normal curve overlay]

. hist math, norm

[Figure: histogram of math score with normal curve overlay]

. scatter science math

[Figure: scatterplot of science score vs. math score]

• Approximately linear, no extreme outliers.

. scatter science math || lfit science math

[Figure: scatterplot of science vs. math with linear fitted values]

. scatter science math || qfit science math

[Figure: scatterplot of science vs. math with quadratic fitted values]
• Hypothesis test:

$H_0: \rho_{xy} = 0$
$H_a: \rho_{xy} \neq 0$

. pwcorr science math, sig star(.05)

             |  science     math
-------------+------------------
     science |   1.0000
             |
        math |   0.6307*   1.0000
             |   0.0000

• Hypothesis test conclusion?
Coefficient of determination, r2:
• r2 (in simple but not multiple regression, just the square of the correlation coefficient) represents the proportion of the sum of squares of deviations of the y values about their mean that can be attributed to a linear relationship between y & x.
• Interpretation: about 100(r2)% of the sample variation in y can be attributed to the use of x to predict y in the straight-line model.
• Higher r2 signifies better fit: greater clustering along the y/x straight line.
• Formula for r2 in simple & multiple regression:

$r^2 = \dfrac{SS_{yy} - SSE}{SS_{yy}}$

• How would this be computed for the regression of science on math?
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

• r2 = Model SS/Total SS = 7760.56/19507.5 = 0.3978. (A sketch using Stata's stored results appears below.)
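A minimal sketch using the results Stata stores after regress (e(mss) is the model SS; e(rss) is the residual SS):

. quietly reg science math
. di e(mss)/(e(mss) + e(rss))   // r2 = 7760.56/19507.5 = .3978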
Let's step back for a moment & review the matter of explained versus unexplained variation in an estimated regression model.

DATA = FIT + RESIDUAL

• What does this mean? Why does it matter?
• DATA: the total variation in outcome variable y; measured by the total sum of squares.
• FIT: the variation in outcome variable y attributed to the explanatory variable x (i.e. to the model); measured by the model sum of squares.
• RESIDUAL: the variation in outcome variable y attributed to the estimated errors; measured by the residual (or error) sum of squares.
DATA = FIT + RESIDUAL
SST = SSM + SSE
• Sum of Squares Total (SST): take each observed y minus the mean of y; square each deviation; then sum the squared values.
• Sum of Squares for Model (SSM): take each predicted y minus the mean of y; square each deviation; then sum the squared values.
• Sum of Squares for Errors (SSE): take each observed y minus its predicted y; square each deviation; then sum the squared values. (A hand computation is sketched below.)
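A minimal Stata sketch of these hand computations, run after reg science math & predict yhat (the generated variable names are hypothetical; note that the mean of yhat equals the mean of y):

quietly su science
gen double sst_i = (science - r(mean))^2   // squared deviation of observed y from its mean
gen double ssm_i = (yhat - r(mean))^2      // squared deviation of predicted y from the mean of y
gen double sse_i = (science - yhat)^2      // squared residual
foreach s in sst ssm sse {
    quietly su `s'_i
    display "`s' = " r(sum)                // SST = 19507.5; SSM = 7760.56; SSE = 11746.94
}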
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------
Next step: compute the variance for each component by dividing its sum of squares by its degrees of freedom—its Mean Square. (Note: the sums of squares add up—SST = SSM + SSE—but the mean squares themselves do not.)
• s2: Mean Square for Errors (Residuals)
• s: Root Mean Square Error (the standard error of the estimate)
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

• Root MSE = square root of 59.3279904.
• Analysis of Variance (ANOVA) table: the regression output displaying the sums of squares & mean squares for the model, the residual (error), & the total.
• How do we compute F & r2 from the ANOVA table?

F = Mean Square Model/Mean Square Residual
r2 = Sum of Squares Model/Sum of Squares Total
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

• F = MSM/MSR = 7760.55791/59.3279904 = 130.81
• r2 = SSM/SST = 7760.55791/19507.50 = 0.3978
DATA = FIT + RESIDUAL
SST = SSM + SSE
• Sum of Squares Total (SST): take each observed y minus the mean of y; square each deviation; then sum the squared values.
• Sum of Squares for Model (SSM): take each predicted y minus the mean of y; square each deviation; then sum the squared values.
• Sum of Squares for Errors (SSE): take each observed y minus its predicted y; square each deviation; then sum the squared values.
Using the regression model for estimation & prediction:
• Fundamental point: never make predictions beyond the range of the sampled (i.e. observed) x values.
• That is, while the model may provide a good fit for the sampled range of values, it could give a poor fit outside the sampled x-value range.
• Another point in making predictions: the standard error for the estimated mean of y will be less than that for an estimated individual y observation.
• That is, there's more uncertainty in predicting individual y values than mean y values. (See the sketch below.)
• Why is this so?
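A minimal Stata sketch of the two standard errors, using predict's stdp & stdf options after regress (the new variable names are hypothetical):

. quietly reg science math
. predict se_mean, stdp     // SE of the estimated mean of y at each x
. predict se_indiv, stdf    // SE of an individual y forecast; always larger
. list math se_mean se_indiv in 1/5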
• Let's review how STATA reports the indicators of how a regression model fits the sampled data.

. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------
Software regression output typically refers to the residual terms more or less as follows:
• s2 = Mean Square Error (MSE = SSE/df for error: the estimated error variance)
• s = Root Mean Square Error (Root MSE = the square root of MSE: the standard error of the estimate)

Stata labels the residuals, Mean Square Error & Root Mean Square Error as follows:

Top-left table
• SS for Residual: the sum of squared errors
• MS for Residual: SS Residual/df for error, the estimated error variance

Top-right column
• Root MSE: the standard error of the estimate
• & moreover there's R2 (as well as F & other indicators that we'll examine next week).
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

• SS Residual/df Residual = MS Residual: the estimated error variance.
• Root MSE = sqrt(MS Residual): the standard error of the estimate.
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

• r2 = SS Model/SS Total
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

• F = MSM/MSR
• The most basic ways to make a linear prediction of y (i.e. yhat) after estimating a simple regression model?

. display 16.75789 + .66658
. display 16.75789 + .66658*45
. lincom _cons + math
. lincom _cons + math*45

(lincom: linear combination; it provides a confidence interval for the prediction)
• In summary, we use a six-step procedure to create a regression model:
(1) Hypothesize the form of the model for E(y).
(2) Collect the sample data: a random sample, with data for the regression variables collected on the same subjects.
(3) Use the sample data to estimate the unknown parameters in the model.
(4) Specify the probability distribution of the random error term, & estimate any unknown parameters of this distribution.
(5) Statistically check the usefulness of the model.
(6) When satisfied that the model is useful, use it for prediction, estimation, & so on.
• See King et al.
• Finally, the four fundamental assumptions of regression analysis involve the probability distribution of e (the model's random component, which is estimated by the residuals).
• These assumptions can be summarized as I.I.D.
• The univariate distributions of the variables for regression analysis need not be normal!
• But the usual caveats concerning extreme outliers are important.
• It's not the univariate graphs but the y/x bivariate scatterplots that provide the key evidence on these concerns.
• We'll nonetheless see that the characteristics of bivariate relationships do not necessarily predict whether explanatory variables will test significant, or the direction of their coefficients, in a multiple regression model.
• We'll see, rather, that a multiple regression model expresses the joint, linear effects of a set of explanatory variables on an outcome variable.
Review:
Regress science achievement scores on math achievement scores.

. use hsb2, clear

Note: recall that these are not randomly sampled data.
. hist science, norm

[Figure: histogram of science score with normal curve overlay]

. hist math, norm

[Figure: histogram of math score with normal curve overlay]
. su science, detail

                        science score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           30             26
 5%           34             29
10%           39             31       Obs                 200
25%           44             31       Sum of Wgt.         200

50%           53                      Mean              51.85
                        Largest       Std. Dev.      9.900891
75%           58             69
90%         64.5             72       Variance       98.02764
95%         66.5             72       Skewness      -.1872277
99%           72             74       Kurtosis       2.428308

. su math, d

                         math score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           36             33
 5%           39             35
10%           40             37       Obs                 200
25%           45             38       Sum of Wgt.         200

50%           52                      Mean             52.645
                        Largest       Std. Dev.      9.368448
75%           59             72
90%         65.5             73       Variance       87.76781
95%         70.5             75       Skewness       .2844115
99%           74             75       Kurtosis       2.337319
. scatter science math || qfit science math

[Figure: scatterplot of science score vs. math score with quadratic fitted values]

• Conclusion about approximate linearity & outliers?
. pwcorr science math, obs bonf sig star(.05)

             |  science     math
-------------+------------------
     science |   1.0000
             |
             |      200
             |
        math |   0.6307*   1.0000
             |   0.0000
             |      200      200

• Formula for the correlation coefficient?
• Hypothesis test & conclusion?
. reg science math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  130.81
       Model |  7760.55791     1  7760.55791           Prob > F      =  0.0000
    Residual |  11746.9421   198  59.3279904           R-squared     =  0.3978
-------------+------------------------------           Adj R-squared =  0.3948
       Total |     19507.5   199  98.0276382           Root MSE      =  7.7025

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

• # observations? df? residuals, formula? s2 (MS Residual), formula? s (Root MSE), formula? F, formula? r2, formula? y-intercept, CI, formula? slope coefficient, CI, formula? slope hypothesis test?
• Graph the linear prediction for yhat with a confidence interval:

. twoway qfitci science math, blc(blue)

[Figure: fitted values for science score vs. math score with 95% CI band]
• Predictions of yhat using STATA's calculator:

. display 16.75789 + .66658*45
46.75399

. di 16.75789 + .66658*65
60.08559
• Predictions for yhat using 'lincom':

. lincom _cons + math*45

 ( 1)  45.0 math + _cons = 0.0

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |     46.754   .7036832    66.44   0.000     45.36632    48.14167
------------------------------------------------------------------------------

. lincom _cons + math*65

 ( 1)  65.0 math + _cons = 0.0

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    60.0856   .9028563    66.55   0.000     58.30515    61.86604
------------------------------------------------------------------------------