Statistics 11/Economics 40
Lecture 20
(Chapter 10.1)
1. Least-Squares Regression Recap
In Chapter 2 a line was used to summarize the relationship between a variable designated as the explanatory
or independent variable (called x) and a response/outcome/dependent variable (called y). The slope was
identified as b and the intercept as a. The equation for the line which summarizes the relationship is:
y = a + bx
where b = r*(Sy/Sx) and a = (average of the y variable) - (slope * average of the x variable)
The idea here is that there is a line which "best" fits the data, and its criterion for "fit" is that it minimizes the
squared deviations in the y-direction from the line. This line is used because it has good mathematical
properties.
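To make the recap concrete, here is a minimal sketch in Python (the data values are made up for illustration) that computes b and a from the formulas above:

```python
import math

# Hypothetical example data (made up for illustration)
x = [10, 20, 30, 40, 50]
y = [25, 30, 28, 40, 44]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample standard deviations (divisor n - 1, as in Chapter 1)
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Correlation r
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b = r * (s_y / s_x)      # slope: b = r * (Sy / Sx)
a = y_bar - b * x_bar    # intercept: a = y-bar - b * x-bar
print(f"slope b = {b:.4f}, intercept a = {a:.4f}")
```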
2. The population regression line (10.1)
The interest here is to examine how y changes as x changes for all possible x and y out there. In
chapter 2, there was no discussion of "sample" versus "population", but in Chapter 10, there is. The
notation changes a little bit:
µy = β0 + β1x
This is the population regression line: a perfect representation of the relationship between x and y. β0
and β1 are parameters; they are fixed and unchanging. The concern here is how the mean value of y for
a given x value changes as x changes. The actual y values will vary around µy, and this variability is
measured by σy (this is also thought to be unchanging, that is, the same for every value of x).
See Figure 10.2. This is "the truth" as it applies to a relationship between an X variable and a Y
variable. We do not observe this line in practice; the lines we have been looking at come from
samples. In other words, in practice, this is what we are "modeling" or "regressing" in a single
sample:
yi = β0 + β1xi + εi (where the yi are individual observed values)
or, to think of it another way,
ŷ = b0 + b1x (the y-hat is a predicted y; the observed value is the predicted value plus the residual)
And the εi are the error terms. They should be (sometimes they are not, and this is a bad thing) independently
distributed with mean 0 and standard deviation σy. b0 and b1 are estimates (from samples) of β0 and
β1 (population parameters). The estimation/calculation of b0 and b1 is just as in Chapter 2.
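The distinction between population parameters and sample estimates can be shown with a small simulation (the parameter values below are invented for illustration): generate data from a known population line yi = β0 + β1xi + εi, then estimate b0 and b1 from the sample.

```python
import random

random.seed(42)
beta0, beta1, sigma = 22.0, 0.36, 5.0   # "true" population parameters (hypothetical)

n = 200
x = [random.uniform(0, 100) for _ in range(n)]
# y_i = beta0 + beta1 * x_i + eps_i, with eps_i ~ Normal(0, sigma)
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]

# Least-squares estimates, as in Chapter 2
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
print(b0, b1)   # sample estimates: close to, but not equal to, beta0 and beta1
```

Each new sample would give slightly different values of b0 and b1; the parameters β0 and β1 never change.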
3. The residuals, again
The residuals as calculated from the sample regression (observed y - predicted y) are represented by ei, and
they are an estimate of εi: they are just the difference between the actual y value in the sample and the
predicted y value. The residuals should sum to zero.
The importance of residuals in this context is that they are used to estimate σy -- the variability of the
y variable around each level of x.
σy has a slightly different formula than you might expect:

s² = Σ (yi - ŷi)² / (n - 2), summing over i = 1, ..., n

(see the bottom of page 666; this is called the Residual Mean Square Error)
And the square root of s² is s, and that is used as our estimate of σy. Since two quantities (the intercept b0 and
the slope b1) are estimated from the data when calculating y-hat, the divisor is n - 2. Compare this with your
calculation of the standard deviation of y or of just x from Chapter 1, where the divisor is n - 1.
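A short sketch of this calculation, reusing made-up data (b and a are the least-squares slope and intercept for that data):

```python
import math

# Hypothetical data and its least-squares fit
x = [10, 20, 30, 40, 50]
y = [25, 30, 28, 40, 44]
b, a = 0.48, 19.0                            # slope and intercept for this data

y_hat = [a + b * xi for xi in x]             # predicted values
e = [yi - yhi for yi, yhi in zip(y, y_hat)]  # residuals e_i

n = len(x)
s2 = sum(ei ** 2 for ei in e) / (n - 2)      # Residual Mean Square Error
s = math.sqrt(s2)                            # estimate of sigma_y

print(sum(e))   # residuals sum to (essentially) zero
print(s)
```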
Don't worry, you don't need to know how to calculate this, but you should know how to interpret a Stata output
(like Lab 6) on regression if you see one.
4. Interpreting the Stata Output for Regression Analysis
A term you need to know: standard error, s/√n (see page 504). When you do not have sigma and you must
use s (the sample standard deviation) to estimate it, Moore and McCabe call the standard deviation of a
statistic (such as a mean) the standard error. In the regression framework, one usually does not have sigma
and so one is almost always working with s. And if one also has small samples (under size 15), the
conditions for using the standard normal distribution (i.e. sigma known, population normal, sample size
large; Chapter 5.2) are violated, so the standard normal can't be used; instead something called the
t-distribution is used. So you can no longer use a Z-test; you must use a t-test (Chapter 7.1; note that as the
sample size approaches infinity, z and t are distributed in the same manner). The t statistic looks a lot like
the z statistic for the distribution of a sample mean:

t = (x̄ - µ) / (s/√n)
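As a quick sketch of the formula above (the sample values are invented for illustration):

```python
import math
import statistics

sample = [98.2, 101.5, 99.7, 100.4, 97.9, 102.1]  # hypothetical measurements
mu_0 = 100.0                                      # hypothesized population mean

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)       # sample standard deviation (divisor n - 1)
se = s / math.sqrt(n)              # standard error of the mean
t = (x_bar - mu_0) / se
print(t)   # compare against the t distribution with n - 1 degrees of freedom
```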
Vocabulary words of interest (from the Stata output)
R Squared
_cons
Coefficients
Standard Error
t
P>|t|
95% Confidence Interval
Let's look again at the relationship between advertising expenditures and name recognition:
. regress recognit spending

      Source |       SS       df       MS              Number of obs =      21
-------------+------------------------------           F(  1,    19) =   13.98
       Model |  7723.27815     1  7723.27815           Prob > F      =  0.0014
    Residual |  10494.1085    19  552.321499           R-squared     =  0.4240
-------------+------------------------------           Adj R-squared =  0.3936
       Total |  18217.3866    20  910.869332           Root MSE      =  23.502

------------------------------------------------------------------------------
    recognit |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
    spending |   .3631741   .0971203    3.739   0.001      .159899    .5664492
       _cons |   22.16269   7.089478    3.126   0.006     7.324244    37.00114
------------------------------------------------------------------------------
Remember, our goal is to predict the number of people who recognize the brand (recognit) from the
amount of money spent advertising the brand (spending).
The estimates b0 and b1 of the parameters β0 and β1 are given in the column labeled Coef. The intercept is
labeled _cons and its value is 22.16269. The interpretation is that if one had no advertising spending,
the predicted recognition is 22.16269 million people. b1 is the estimated slope of the line (or the rate of
change in recognition for each unit change in spending) and its value is .3631741, so the least-squares line
for the regression of recognition on spending is:

ŷ = 22.16269 + .3631741(xi)
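Using the fitted line is then just arithmetic; for instance (the spending level here is chosen arbitrarily):

```python
b0, b1 = 22.16269, 0.3631741   # intercept and slope from the Stata output above

def predict_recognition(spending):
    """Predicted recognition (in millions of people) for a given spending level."""
    return b0 + b1 * spending

print(predict_recognition(0))    # 22.16269 -- just the intercept
print(predict_recognition(50))   # 22.16269 + 0.3631741 * 50
```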
5. Confidence Intervals and Significance Tests in Regression Analysis
You can also produce confidence intervals and significance tests for the coefficients in a linear
regression model. The confidence intervals are very similar to the ones you have seen in Chapter 6,
but they use a t instead of a Z as the multiplier:

Estimate ± t*(Standard Error of the Estimate)

where t* is the critical value from the t distribution (Table D).
Generally, statistical software provides you with the upper and lower bounds. For the x variable
spending, the 95% confidence interval is (.159899, .5664492). As before, the interpretation is
that you are 95% confident that the true parameter β1 is covered by the interval given above. You
could also give a 95% confidence interval around the intercept: (7.324244, 37.00114).
To test the hypothesis β1 = 0, look at the value of t given by the output (3.739 for spending); it is just
the coefficient divided by its standard error, .3631741/.0971203, and a p-value is given to you.
Since the p-value given is .001, your interpretation would be that the coefficient is highly statistically
significant (P < .05 or P < .01). Note that it is testing a two-sided alternative.
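Both calculations can be reproduced directly from the output; here t* = 2.093 is the 95% critical value from Table D for n - 2 = 19 degrees of freedom:

```python
coef = 0.3631741    # spending coefficient from the Stata output
se = 0.0971203      # its standard error
t_star = 2.093      # t critical value, 95% confidence, df = 19

t_stat = coef / se                 # tests H0: beta1 = 0 (two-sided)
lower = coef - t_star * se
upper = coef + t_star * se

print(round(t_stat, 3))                   # matches the t column: 3.739
print(round(lower, 4), round(upper, 4))   # matches the 95% CI, up to rounding of t*
```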
Your textbook delves into some other estimates generated by statistical packages, but what has been
detailed above is sufficient.