Simple Linear Regression
Farideh Dehkordi-Vakil
Simple Regression

• Simple regression analysis is a statistical tool that gives us the ability to estimate the mathematical relationship between a dependent variable (usually called y) and an independent variable (usually called x).
• The dependent variable is the variable for which we want to make a prediction.
• While various non-linear forms may be used, simple linear regression models are the most common.
Introduction

• The primary goal of quantitative analysis is to use current information about a phenomenon to predict its future behavior.
• Current information is usually in the form of a set of data.
• In a simple case, when the data form a set of pairs of numbers, we may interpret them as representing the observed values of an independent (or predictor) variable X and a dependent (or response) variable Y.
  Lot size   Man-hours
  30         73
  20         50
  60         128
  80         170
  40         87
  50         108
  60         135
  30         69
  70         148
  60         132
Introduction

The goal of the analyst who studies the data is to find a functional relation
  y ≈ f(x)
between the response variable y and the predictor variable x.
[Figure: Scatter plot of the statistical relation between Lot size (x-axis, 0 to 90) and Man-Hours (y-axis, 0 to 180).]
Regression Function

The statement that the relation between X and Y is statistical should be interpreted as providing the following guidelines:
1. Regard Y as a random variable.
2. For each X, take f(x) to be the expected value (i.e., mean value) of Y.
3. Given that E(Y) denotes the expected value of Y, call the equation
   E(Y) = f(x)
   the regression function.
Historical Origin of Regression

• Regression analysis was first developed by Sir Francis Galton, who studied the relation between heights of sons and fathers.
• Heights of sons of both tall and short fathers appeared to "revert" or "regress" to the mean of the group.
Basic Assumptions of a Regression Model

A regression model is based on the following assumptions:
1. There is a probability distribution of Y for each level of X.
2. Given that f(x) is the mean value of Y, the standard form of the model is
   Y = f(x) + ε
   where ε is a random variable with a normal distribution.
Statistical relation between Lot Size and number of Man-Hours - Westwood Company Example

[Figure: Scatter plot of Lot size (x-axis, 0 to 90) versus number of Man-Hours (y-axis, 0 to 180).]
Pictorial Presentation of Linear Regression Model

[Figure: Pictorial presentation of the linear regression model.]
Construction of Regression Models

Selection of independent variables
• Since reality must be reduced to manageable proportions whenever we construct models, only a limited number of independent or predictor variables can or should be included in a regression model. Therefore a central problem is choosing the most important predictor variables.

Functional form of regression relation
• Sometimes, relevant theory may indicate the appropriate functional form. More frequently, however, the functional form is not known in advance and must be decided once the data have been collected and analyzed.

Scope of model
• In formulating a regression model, we usually need to restrict the coverage of the model to some interval or region of values of the independent variables.
Uses of Regression Analysis

• Regression analysis serves three major purposes:
  1. Description
  2. Control
  3. Prediction
• The several purposes of regression analysis frequently overlap in practice.
Formal Statement of the Model

• General regression model:
  Y = β0 + β1 X + ε
  1. β0 and β1 are parameters.
  2. X is a known constant.
  3. The deviations ε are independent N(0, σ²).
Meaning of Regression Coefficients

• The values of the regression parameters β0 and β1 are not known. We estimate them from data.
• β1 indicates the change in the mean response per unit increase in X.
Regression Line

• If the scatter plot of our sample data suggests a linear relationship between the two variables, i.e.,
  y = β0 + β1 x
  we can summarize the relationship by drawing a straight line on the plot.
• The least squares method gives us the "best" estimated line for our set of sample data.
Regression Line

• We will write an estimated regression line based on sample data as
  ŷ = b0 + b1 x
• The method of least squares chooses the values of b0 and b1 that minimize the sum of squared errors:
  SSE = Σ (yi − ŷi)² = Σ (yi − b0 − b1 xi)²
  where the sums run over i = 1, …, n.
Regression Line

• Using calculus, we obtain the estimating formulas:
  b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
  or, equivalently,
  b1 = [n Σxy − (Σx)(Σy)] / [n Σx² − (Σx)²]
  b0 = ȳ − b1 x̄
Estimation of Mean Response

• The fitted regression line can be used to estimate the mean value of y for a given value of x.
Example

• The weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.

  y      x
  1250   41
  1380   54
  1425   63
  1425   54
  1450   48
  1300   46
  1400   62
  1510   61
  1575   64
  1650   71
Point Estimation of Mean Response

• From the previous table we have:
  Σx = 564,  Σx² = 32604,  Σy = 14365,  Σxy = 818755,  n = 10
• The least squares estimates of the regression coefficients are:
  b1 = [n Σxy − (Σx)(Σy)] / [n Σx² − (Σx)²]
     = [10(818755) − (564)(14365)] / [10(32604) − (564)²] = 10.8
  b0 = 1436.5 − 10.8(56.4) = 828
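As a quick numerical check (not part of the original slides), the same estimates can be reproduced from the raw data in Python; the variable names below are illustrative.

```python
# Minimal sketch: least squares estimates for the advertising/sales example.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]                      # weekly advertising expenditure
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]  # weekly sales

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# b1 = [n*Sum(xy) - Sum(x)*Sum(y)] / [n*Sum(x^2) - (Sum(x))^2]
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# b0 = y-bar - b1 * x-bar
b0 = sum_y / n - b1 * sum_x / n

print(round(b1, 4), round(b0, 1))   # about 10.7868 and 828.1
```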
Point Estimation of Mean Response

• The estimated regression function is:
  ŷ = 828 + 10.8x
  Sales = 828 + 10.8 (Expenditure)
• This means that if the weekly advertising expenditure is increased by $1, we would expect the weekly sales to increase by $10.80.
Point Estimation of Mean Response

• Fitted values for the sample data are obtained by substituting the x value into the estimated regression function.
• For example, if the advertising expenditure is $50, then the estimated sales are:
  Sales = 828 + 10.8(50) = 1368
• This is called the point estimate (forecast) of the mean response (sales).
Example: Retail sales and floor space

• It is customary in retail operations to assess the performance of stores partly in terms of their annual sales relative to their floor area (square feet). We might expect sales to increase linearly as stores get larger, with of course individual variation among stores of the same size. The regression model for a population of stores says that
  SALES = β0 + β1 AREA + ε
Example: Retail sales and floor space

• The slope β1 is as usual a rate of change: it is the expected increase in annual sales associated with each additional square foot of floor space.
• The intercept β0 is needed to describe the line but has no statistical importance because no stores have an area close to zero.
• Floor space does not completely determine sales. The term ε in the model accounts for differences among individual stores with the same floor space. A store's location, for example, is important.
Residual

• The residual is the difference between the observed value yi and the corresponding fitted value ŷi:
  ei = yi − ŷi
• Residuals are highly useful for studying whether a given regression model is appropriate for the data at hand.
Example: weekly advertising expenditure

  y      x    ŷ        Residual (e)
  1250   41   1270.8    -20.8
  1380   54   1411.2    -31.2
  1425   63   1508.4    -83.4
  1425   54   1411.2     13.8
  1450   48   1346.4    103.6
  1300   46   1324.8    -24.8
  1400   62   1497.6    -97.6
  1510   61   1486.8     23.2
  1575   64   1519.2     55.8
  1650   71   1594.8     55.2
Estimation of the variance of the error terms, σ²

• The variance σ² of the error terms εi in the regression model needs to be estimated for a variety of purposes:
  – It gives an indication of the variability of the probability distributions of y.
  – It is needed for making inferences concerning the regression function and the prediction of y.
Regression Standard Error

• To estimate σ we work with the variance and take the square root to obtain the standard deviation.
• For simple linear regression the estimate of σ² is the average squared residual:
  s²_y.x = Σ ei² / (n − 2) = Σ (yi − ŷi)² / (n − 2)
• To estimate σ, use
  s_y.x = √(s²_y.x)
• s_y.x estimates the standard deviation σ of the error term ε in the statistical model for simple linear regression.
Regression Standard Error

  y      x    ŷ = 828 + 10.8x   Residual (e)   e²
  1250   41   1270.8             -20.8          432.64
  1380   54   1411.2             -31.2          973.44
  1425   63   1508.4             -83.4          6955.56
  1425   54   1411.2              13.8          190.44
  1450   48   1346.4             103.6          10732.96
  1300   46   1324.8             -24.8          615.04
  1400   62   1497.6             -97.6          9525.76
  1510   61   1486.8              23.2          538.24
  1575   64   1519.2              55.8          3113.64
  1650   71   1594.8              55.2          3047.04

  Total Σe² = 36124.76
  s_y.x = √(36124.76 / 8) = 67.198
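The residuals and the regression standard error above can be reproduced with a short Python sketch; it uses the rounded fitted line ŷ = 828 + 10.8x from the slides, so the results match the rounded values in the table.

```python
import math

# Sketch: residuals and regression standard error s_y.x for the example.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]

y_hat = [828 + 10.8 * xi for xi in x]             # rounded fitted line from the slides
resid = [yi - yhi for yi, yhi in zip(y, y_hat)]

sse = sum(e ** 2 for e in resid)                  # about 36124.76
s_yx = math.sqrt(sse / (len(y) - 2))              # about 67.198

print(round(sse, 2), round(s_yx, 3))
```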
Analysis of Residual

• Inference based on the regression model can be misleading if the assumptions are violated.
• Assumptions for the simple linear regression model are:
  – The underlying relation is linear.
  – The errors are independent.
  – The errors have constant variance.
  – The errors are normally distributed.
Analysis of Residual

• To examine whether the regression model is appropriate for the data being analyzed, we can check residual plots.
• Useful residual plots are (see the sketch after this list):
  – A histogram of the residuals.
  – Residuals against the fitted values.
  – Residuals against the independent variable.
  – Residuals over time if the data are chronological.
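A minimal matplotlib sketch of these four plots might look like the following; it assumes the residuals have been computed as in the earlier snippets and is only an illustration, not part of the original slides.

```python
import matplotlib.pyplot as plt

# Sketch: the four residual plots listed above for the advertising/sales example.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
y_hat = [828 + 10.8 * xi for xi in x]
resid = [yi - yhi for yi, yhi in zip(y, y_hat)]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(resid)                                        # histogram of residuals
axes[0, 0].set_title("Histogram of residuals")
axes[0, 1].scatter(y_hat, resid)                              # residuals vs fitted values
axes[0, 1].set_title("Residuals vs fitted values")
axes[1, 0].scatter(x, resid)                                  # residuals vs independent variable
axes[1, 0].set_title("Residuals vs x")
axes[1, 1].plot(range(1, len(resid) + 1), resid, marker="o")  # residuals in observation order
axes[1, 1].set_title("Residuals over time")
plt.tight_layout()
plt.show()
```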
Analysis of Residual

• A histogram of the residuals provides a check on the normality assumption. A Normal quantile plot of the residuals can also be used to check the normality assumption.
• Moderate departures from a bell-shaped curve do not impair the conclusions from tests or prediction intervals.
• A plot of residuals against the fitted values or the independent variable can be used to check the assumption of constant variance and the aptness of the model.
Analysis of Residual

• A plot of residuals against time provides a check on the assumption that the error terms are independent.
• The assumption of independence is the most critical one.
Variable transformations

• If the residual plot suggests that the variance is not constant, a transformation can be used to stabilize the variance.
• If the residual plot suggests a non-linear relationship between x and y, a transformation may reduce it to one that is approximately linear.
• Common linearizing transformations are:
  1/x, log(x)
• Variance stabilizing transformations are:
  1/y, log(y), √y, y²
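As an illustration only (the slides do not work a transformation example), applying these transformations in Python is a one-liner per variable; the data below are made up.

```python
import numpy as np

# Sketch: common linearizing and variance-stabilizing transformations on illustrative data.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([2.0, 3.0, 5.0, 9.0, 17.0])

x_recip, x_log = 1 / x, np.log(x)        # linearizing transformations of x
y_log, y_sqrt = np.log(y), np.sqrt(y)    # variance-stabilizing transformations of y
print(x_log, y_sqrt)
```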
Inference in Regression Analysis

• The simple linear regression model imposes several conditions. We should verify these conditions before proceeding to inference.
• These conditions concern the population, but we can observe only our sample.
• In doing inference we act as if:
  – The sample is an SRS from the population.
  – There is a linear relationship in the population.
  – The standard deviation of the responses about the population line is the same for all values of the explanatory variable.
  – The response varies Normally about the population regression line.
Inference in Regression Analysis

• Plotting the residuals against the explanatory variable is helpful in checking these conditions because a residual plot magnifies patterns.
Confidence Intervals and Significance Tests

• In our previous lectures we presented confidence intervals and significance tests for means and differences in means. In each case, inference rested on the standard errors of the estimates and on t or z distributions.
• Inference for the slope and intercept in linear regression is similar in principle, although the recipes are more complicated.
• All confidence intervals, for example, have the form
  estimate ± t* SE(estimate)
  where t* is a critical value of a t distribution.
Confidence Intervals and Significance Tests

• Confidence intervals and tests for the slope and intercept are based on the sampling distributions of the estimates b1 and b0.
• Here are the facts:
  – If the simple linear regression model is true, each of b0 and b1 has a Normal distribution.
  – The mean of b0 is β0 and the mean of b1 is β1.
  – The standard deviations of b0 and b1 are multiples of the model standard deviation σ.
  SE(b1) = S(b1) = s_y.x / √Σ(x − x̄)²
Example: Weekly Advertising Expenditure

• Let us return to the weekly advertising expenditure and weekly sales example. Management is interested in testing whether or not there is a linear association between advertising expenditure and weekly sales, using the regression model. Use α = .05.
Example: Weekly Advertising Expenditure

• Hypotheses:
  H0: β1 = 0
  Ha: β1 ≠ 0
• Decision rule: reject H0 if
  t > t(.025; 8) = 2.306  or  t < −t(.025; 8) = −2.306
Example: Weekly Advertising Expenditure

• Test statistic:
  t = b1 / S(b1)
  S(b1) = s_y.x / √Σ(x − x̄)² = 67.2 / √794.4 = 2.38
  b1 = 10.8
  t = 10.8 / 2.38 = 4.5
Example: Weekly Advertising Expenditure

• Conclusion:
  Since t = 4.5 > 2.306, we reject H0. There is a linear association between advertising expenditure and weekly sales.
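The standard error S(b1) and the t statistic can be reproduced numerically; the sketch below uses the rounded values s_y.x = 67.2 and b1 = 10.8 from the slides.

```python
import math

# Sketch: t test for the slope in the advertising/sales example.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
x_bar = sum(x) / len(x)                          # 56.4
sxx = sum((xi - x_bar) ** 2 for xi in x)         # 794.4

s_yx, b1 = 67.2, 10.8                            # rounded values from the slides
se_b1 = s_yx / math.sqrt(sxx)                    # about 2.38
t = b1 / se_b1                                   # about 4.5

print(round(se_b1, 2), round(t, 2))              # reject H0 since |t| > t(.025; 8) = 2.306
```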
Confidence interval for β1

  b1 ± t(α/2; n−2) S(b1)

• Now that our test has shown that there is a linear association between advertising expenditure and weekly sales, the management wishes an estimate of β1 with a 95% confidence coefficient.
Confidence interval for β1

• For a 95 percent confidence coefficient, we require t(.025; 8). From table B in appendix III, we find t(.025; 8) = 2.306.
• The 95% confidence interval is:
  b1 ± t(α/2; n−2) S(b1)
  10.8 ± 2.306(2.38)
  10.8 ± 5.49 = (5.31, 16.3)
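The same interval can be checked in Python; scipy supplies the critical value t(.025; 8) that the slides read from a table. This is only a sketch.

```python
from scipy import stats

# Sketch: 95% confidence interval for the slope beta_1.
b1, se_b1, df = 10.8, 2.38, 8

t_crit = stats.t.ppf(1 - 0.025, df)              # about 2.306
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(round(t_crit, 3), round(lower, 2), round(upper, 2))   # roughly (5.31, 16.29)
```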
Example: Do wages rise with experience?

• Many factors affect the wages of workers: the industry they work in, their type of job, their education and their experience, and changes in general levels of wages. We will look at a sample of 59 married women who hold customer service jobs in Indiana banks. The following table gives their weekly wages at a specific point in time and also their length of service with their employer, in months. The size of the place of work is recorded simply as "large" (100 or more workers) or "small." Because industry, job type, and the time of measurement are the same for all 59 subjects, we expect to see a clear relationship between wages and length of service.
Example: Do wages rise with experience?

• Do wages rise with experience?
• The hypotheses are:
  H0: β1 = 0
  Ha: β1 > 0
• The t statistic for the significance of regression is:
  t = b1 / SE(b1) = 0.5905 / 0.20697 = 2.85
• The P-value is P(t > 2.85) < .005.
• The t distribution for this problem has n − 2 = 57 degrees of freedom.
• Conclusion: reject H0. There is strong evidence that mean wages increase as length of service increases.
Example: Do wages rise with experience?

• A 95% confidence interval for the slope β1 of the regression line in the population of all married female customer service workers in Indiana banks is
  b1 ± t* SE(b1) = 0.5905 ± (2.00)(0.20697)
                 = 0.5905 ± 0.4139
                 = (0.177, 1.00)
• The t distribution for this problem has n − 2 = 57 degrees of freedom.
Inference about Correlation

• The correlation between wages and length of service for the 59 bank workers is r = 0.3535. This appears in the Excel output, where it is labeled "Multiple R."
• We expect a positive correlation between length of service and wages in the population of all married female bank workers. Is the sample result convincing evidence that this is true?
• This question concerns a new population parameter, the population correlation: the correlation between length of service and wages when we measure these variables for every member of the population.
Inference about Correlation

• We will call the population correlation ρ.
• To assess the evidence that ρ > 0 in the bank worker population, we must test the hypotheses
  H0: ρ = 0
  Ha: ρ > 0
• It is natural to base the test on the sample correlation r.
• There is a link between correlation and regression slope: the population correlation ρ is zero, positive, or negative exactly when the slope β1 of the population regression line is zero, positive, or negative.
Correlation Coefficient

• Recall that the algebraic expression for the correlation coefficient is:
  r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² Σ(y − ȳ)²]
  r = [n Σxy − (Σx)(Σy)] / {√[n Σx² − (Σx)²] √[n Σy² − (Σy)²]}
Example: Do wages rise with experience?

• The sample correlation between wages and length of service is r = 0.3535 from a sample of n = 59.
• To test
  H0: ρ = 0
  Ha: ρ > 0
  use the t statistic
  t = r √(n − 2) / √(1 − r²) = 0.3535 √(59 − 2) / √(1 − (0.3535)²) = 2.853
Example: Do wages rise with experience?

• Compare t = 2.853 with critical values from the t table with n − 2 = 57 degrees of freedom.
• Conclusion: P(t > 2.853) < .005, therefore we reject H0. There is a positive correlation between wages and length of service.
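A quick check of this t statistic, using only the sample correlation and n as on the slides:

```python
import math

# Sketch: t statistic for testing H0: rho = 0 against Ha: rho > 0.
r, n = 0.3535, 59
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 3))   # about 2.853, compared with a t distribution on n - 2 = 57 df
```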
Prediction of a new response (ŷ)

• We now consider the prediction of a new observation y corresponding to a given level x of the independent variable.
• In our advertising expenditure and weekly sales example, the management wishes to predict the weekly sales corresponding to an advertising expenditure of x = $50.
Interval Estimation of a new response (ŷ)

• The following formula gives us the point estimator (forecast) for y:
  ŷ = b0 + b1 x
• A (1 − α)100% prediction interval for a new observation is:
  ŷ ± t(α/2; n−2) Sf
  where
  Sf = s_y.x √[1 + 1/n + (x − x̄)² / Σ(x − x̄)²]
Example

• In our advertising expenditure and weekly sales example, the management wishes to predict the weekly sales if the advertising expenditure is $50, with a 90% prediction interval.
  ŷ = 828 + 10.8(50) = 1368
  Sf = s_y.x √[1 + 1/n + (x − x̄)² / Σ(x − x̄)²]
  Sf = 67.2 √[1 + 1/10 + (50 − 56.4)² / 794.4] = 72.11
• We require t(.05; 8) = 1.860.
Example

• The 90% prediction interval is:
  ŷ ± t(.05; 8) Sf
  1368 ± 1.860(72.11)
  = (1233.9, 1502.1)
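The prediction interval can be reproduced as follows; the sketch uses the rounded quantities (s_y.x = 67.2, t(.05; 8) = 1.860) from the slides.

```python
import math

# Sketch: 90% prediction interval for weekly sales at x = 50.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
n = len(x)
x_bar = sum(x) / n                               # 56.4
sxx = sum((xi - x_bar) ** 2 for xi in x)         # 794.4

s_yx, x_new, t_crit = 67.2, 50, 1.860            # rounded values from the slides
y_hat = 828 + 10.8 * x_new                       # 1368

s_f = s_yx * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)          # about 72.11
print(round(y_hat - t_crit * s_f, 1), round(y_hat + t_crit * s_f, 1))   # about (1233.9, 1502.1)
```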
Analysis of variance approach to Regression analysis

• The analysis of variance approach is based on the partitioning of sums of squares and degrees of freedom associated with the response variable.
• Consider the weekly advertising expenditure and the weekly sales example. There is variation in the amount ($) of weekly sales, as in all statistical data. The variation of the yi is conventionally measured in terms of the deviations:
  yi − ȳ
Analysis of variance approach to Regression analysis

• The measure of total variation, denoted by SST, is the sum of the squared deviations:
  SST = Σ(yi − ȳ)²
• If SST = 0, all observations are the same (no variability). The greater SST is, the greater the variation among the y values.
• When we use the regression model, the measure of variation is the variability of the y observations around the fitted line:
  yi − ŷi
Analysis of variance approach to Regression analysis

• The measure of variation in the data around the fitted regression line is the sum of squared deviations (error), denoted SSE:
  SSE = Σ(yi − ŷi)²
• For our weekly expenditure example:
  SSE = 36124.76
  SST = 128552.5
• What accounts for the substantial difference between these two sums of squares?
Analysis of variance approach to Regression analysis

• The difference is another sum of squares:
  SSR = Σ(ŷi − ȳ)²
• SSR stands for the regression sum of squares.
• SSR may be considered a measure of the variability of the yi that is associated with the regression line.
• The larger SSR is relative to SST, the greater the role of the regression line in explaining the total variability in the y observations.
Analysis of variance approach to Regression analysis

• In our example:
  SSR = SST − SSE = 128552.5 − 36124.76 = 92427.74
• This indicates that most of the variability in weekly sales can be explained by the relation between the weekly advertising expenditure and the weekly sales.
Formal Development of the Partitioning

• We can decompose the total variability in the observations yi as follows:
  yi − ȳ = (ŷi − ȳ) + (yi − ŷi)
• The total deviation yi − ȳ can be viewed as the sum of two components:
  – The deviation of the fitted value ŷi around the mean ȳ.
  – The deviation of yi around the fitted regression line.
Formal Development of the Partitioning

• The sums of these squared deviations have the same relationship:
  Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
• Breakdown of degrees of freedom:
  n − 1 = 1 + (n − 2)
Mean squares

• A sum of squares divided by its degrees of freedom is called a mean square (MS).
• Regression mean square (MSR):
  MSR = SSR / 1
• Error mean square (MSE):
  MSE = SSE / (n − 2)
• Note: mean squares are not additive.
Mean squares

• In our example:
  MSR = SSR / 1 = 92427.74 / 1 = 92427.74
  MSE = SSE / (n − 2) = 36124.76 / 8 = 4515.6
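Computing the sums of squares and mean squares directly from the data gives essentially the same breakdown; the sketch below uses the rounded fitted line ŷ = 828 + 10.8x, which is why the numbers are approximate.

```python
# Sketch: SST, SSE, SSR and the mean squares for the advertising/sales example.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
n = len(y)
y_bar = sum(y) / n

y_hat = [828 + 10.8 * xi for xi in x]                        # rounded fitted line from the slides
sst = sum((yi - y_bar) ** 2 for yi in y)                     # 128552.5
sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))      # about 36124.76
ssr = sst - sse                                              # about 92427.74

msr = ssr / 1                                                # regression mean square
mse = sse / (n - 2)                                          # error mean square, about 4515.6
print(round(sst, 1), round(sse, 2), round(ssr, 2), round(msr, 2), round(mse, 1))
```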
Analysis of Variance Table

• The breakdowns of the total sum of squares and the associated degrees of freedom are displayed in a table called the analysis of variance (ANOVA) table:

  Source of Variation   SS    df      MS                F-Test
  Regression            SSR   1       MSR = SSR/1       MSR/MSE
  Error                 SSE   n − 2   MSE = SSE/(n−2)
  Total                 SST   n − 1
Analysis of Variance Table

• In our weekly advertising expenditure and weekly sales example the ANOVA table is:

  Source of Variation   SS         df   MS
  Regression            92427.74   1    92427.74
  Error                 36124.76   8    4515.6
  Total                 128552.5   9
F-Test for β1 = 0 versus β1 ≠ 0

• The general analysis of variance approach provides us with a battery of highly useful tests for regression models. For the simple linear regression case considered here, the analysis of variance provides us with a test for:
  H0: β1 = 0
  Ha: β1 ≠ 0
F-Test for β1 = 0 versus β1 ≠ 0

• Test statistic:
  F = MSR / MSE
• In order to construct a statistical decision rule, we need to know the distribution of our test statistic F.
• When H0 is true, our test statistic F follows the F-distribution with 1 and n − 2 degrees of freedom.
• Table C on page 622 of your text gives the critical values of the F-distribution at α = 0.10, 0.05, and 0.01.
F-Test for β1 = 0 versus β1 ≠ 0

• Construction of decision rule:
  At the α = 5% level, reject H0 if
  F > F(α; 1, n − 2)
• Large values of F support Ha, and values of F near 1 support H0.
F-Test for β1 = 0 versus β1 ≠ 0

• Using our example again, let us repeat the earlier test on β1, this time using the F-test. The null and alternative hypotheses are:
  H0: β1 = 0
  Ha: β1 ≠ 0
• Let α = .05. Since n = 10, we require F(.05; 1, 8). From table 5-3 we find that F(.05; 1, 8) = 5.32. Therefore the decision rule is:
  Reject H0 if F > 5.32
F-Test for β1 = 0 versus β1 ≠ 0

• From the ANOVA table we have MSR = 92427.74 and MSE = 4515.6.
• Our test statistic F is:
  F = 92427.74 / 4515.6 = 20.47
• Decision: Since 20.47 > 5.32, we reject H0; that is, there is a linear association between weekly advertising expenditure and weekly sales.
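The F statistic, its critical value, and the corresponding P-value can also be checked with scipy (the slides read the critical value from a table); this is a sketch only.

```python
from scipy import stats

# Sketch: F test for H0: beta_1 = 0 in the advertising/sales example.
msr, mse = 92427.74, 4515.6
F = msr / mse                                    # about 20.47

f_crit = stats.f.ppf(0.95, dfn=1, dfd=8)         # about 5.32
p_value = stats.f.sf(F, dfn=1, dfd=8)            # about 0.002
print(round(F, 2), round(f_crit, 2), round(p_value, 4))
```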
F-Test for β1 = 0 versus β1 ≠ 0

• Equivalence of the F test and the t test:
  – For a given α level, the F test of β1 = 0 versus β1 ≠ 0 is algebraically equivalent to the two-sided t-test.
  – Thus, at a given level, we can use either the t-test or the F-test for testing β1 = 0 versus β1 ≠ 0.
  – The t-test is more flexible since it can also be used for a one-sided test.
Analysis of Variance Table

• The complete ANOVA table for our example is:

  Source of Variation   SS         df   MS         F-Test
  Regression            92427.74   1    92427.74   20.47
  Error                 36124.76   8    4515.6
  Total                 128552.5   9
Computer Output

• The EXCEL output for our example is:

  SUMMARY OUTPUT

  Regression Statistics
  Multiple R           0.847950033
  R Square             0.719019259
  Adjusted R Square    0.683896667
  Standard Error       67.19447214
  Observations         10

  ANOVA
                df   SS            MS         F         Significance F
  Regression    1    92431.72331   92431.72   20.4717   0.0019382
  Residual      8    36120.77669   4515.097
  Total         9    128552.5

                 Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
  Intercept      828.1268882    136.1285978      6.083416   0.000295   514.2135758   1142.0402
  AD-Expen (X)   10.7867573     2.384042146      4.524567   0.001938   5.289142698   16.2843719
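Output equivalent to this Excel summary can be produced in Python with statsmodels; this is a sketch of one common way to do it, not part of the original slides.

```python
import numpy as np
import statsmodels.api as sm

# Sketch: ordinary least squares fit reproducing the Excel summary output.
x = np.array([41, 54, 63, 54, 48, 46, 62, 61, 64, 71])
y = np.array([1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650])

X = sm.add_constant(x)            # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())            # R-squared, F statistic, coefficients, confidence limits
```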
Coefficient of Determination

• Recall that SST measures the total variation in the yi when no account of the independent variable x is taken.
• SSE measures the variation in the yi when a regression model with the independent variable x is used.
• A natural measure of the effect of x in reducing the variation in y can be defined as:
  R² = (SST − SSE) / SST = SSR / SST = 1 − SSE / SST
Coefficient of Determination

• R² is called the coefficient of determination.
• Since 0 ≤ SSE ≤ SST, it follows that 0 ≤ R² ≤ 1.
• We may interpret R² as the proportionate reduction of the total variability in y associated with the use of the independent variable x.
• The larger R² is, the more the total variation of y is reduced by including the variable x in the model.
Coefficient of Determination

• If all the observations fall on the fitted regression line, SSE = 0 and R² = 1.
• If the slope of the fitted regression line is b1 = 0, so that ŷi = ȳ, then SSE = SST and R² = 0.
• The closer R² is to 1, the greater is said to be the degree of linear association between x and y.
• The square root of R² is called the coefficient of correlation:
  r = ±√R²
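For the advertising/sales example, R² and r follow directly from the ANOVA quantities; a quick check:

```python
import math

# Sketch: coefficient of determination and correlation for the example.
ssr, sst = 92427.74, 128552.5
r_squared = ssr / sst                    # about 0.719
r = math.sqrt(r_squared)                 # about 0.848; positive because the slope b1 is positive
print(round(r_squared, 3), round(r, 3))
```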