12  Simple Linear Regression and Correlation
Here, we have two quantitative
variables for each of 16
students.
1) How many beers they
drank, and
2) Their blood alcohol level
(BAC)
We are interested in the
relationship between the two
variables: How is one affected
by changes in the other one?
Student   Beers   Blood Alcohol
1         5       0.1
2         2       0.03
3         9       0.19
4         8       0.12
5         3       0.04
6         7       0.095
7         3       0.07
8         5       0.06
9         3       0.02
10        5       0.05
11        4       0.07
12        6       0.1
13        5       0.085
14        7       0.09
15        1       0.01
16        4       0.05
Associations Between Variables
When you examine the relationship between two variables, a new question becomes important:
1. Is your purpose simply to explore the nature of the relationship?
2. Do you wish to show that one of the variables can explain variation in the other?
A response variable measures an outcome of a study.
An explanatory variable explains or causes changes in the response variable.
Looking at relationships
• Start with a graph
• Look for an overall pattern and deviations from the pattern
• Use numerical descriptions of the data and overall pattern (if appropriate)
Scatterplots
In a scatterplot, one axis is used to represent each of the variables,
and the data are plotted as points on the graph.
[Scatterplot of the 16-student beers/blood-alcohol data from the table above, with each student plotted as a point.]
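For readers who want to reproduce the plot, here is a minimal sketch in Python with matplotlib (a library choice assumed here, not part of the slides), using the 16 beers/BAC pairs from the table above.

```python
# Scatterplot of blood alcohol content against number of beers,
# using the 16 (beers, BAC) pairs from the table above.
import matplotlib.pyplot as plt

beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]      # students 1..16
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

plt.scatter(beers, bac)
plt.xlabel("Number of Beers")            # explanatory variable on x
plt.ylabel("Blood Alcohol Content")      # response variable on y
plt.title("Blood Alcohol as a function of Number of Beers")
plt.show()
```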
Interpreting scatterplots
After plotting two variables on a scatterplot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern …
• Form: linear, curved, clusters, no pattern
• Direction: positive, negative, no direction
• Strength: how closely the points fit the “form”
Form and direction of an association
[Example scatterplots: linear, nonlinear, no relationship.]
Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.
No relationship: X and Y vary independently. Knowing X tells you nothing about Y.
Strength of the association
The strength of the relationship between the two variables can be
seen by how much variation, or scatter, there is around the main form.
With a strong relationship, you
can get a pretty good estimate
of y if you know x.
With a weak relationship, for any
x you might get a wide range of
y values.
This is a weak relationship. For a
particular state median household
income, you can’t predict the state
per capita income very well.
This is a very strong relationship.
The daily amount of gas consumed
can be predicted quite accurately for
a given temperature value.
The correlation coefficient "r"
• The correlation coefficient is a measure of the direction and strength of a linear relationship.
• It is calculated using the mean and the standard deviation of both the x and y variables.
• Correlation can only be used to describe quantitative variables. Categorical variables don’t have means and standard deviations.
The correlation coefficient "r"
r = (1/(n − 1)) Σ [(xi − x̄)/sx] [(yi − ȳ)/sy], summing over i = 1, …, n
Example: time to swim, x̄ = 35, sx = 0.7; pulse rate, ȳ = 140, sy = 9.5.
You DON'T want to do this by hand. Make sure you learn how to use your calculator or software.
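As the slide suggests, software handles this. Here is a minimal sketch in Python (numpy assumed, not part of the slides) that computes r both from the formula above and with numpy.corrcoef, using the beers/BAC data.

```python
# Correlation r computed two ways: directly from the formula above, and
# with numpy.corrcoef. Data are the beers/BAC values from the earlier table.
import numpy as np

x = np.array([5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4])          # beers
y = np.array([0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
              0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05])         # BAC

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)      # standardized x values
zy = (y - y.mean()) / y.std(ddof=1)      # standardized y values
r = np.sum(zx * zy) / (n - 1)            # r = (1/(n-1)) * sum of zx*zy

print(r)                                 # from the formula
print(np.corrcoef(x, y)[0, 1])           # same value from numpy
```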
"r" ranges
from -1 to +1
"r" quantifies the strength
and direction of a linear
relationship between 2
quantitative variables.
Strength: how closely the points
follow a straight line.
Direction: is positive when
individuals with higher X values
tend to have higher values of Y.
Correlation only describes linear relationships
No matter how strong the association,
r does not describe curved relationships.
Note: You can sometimes transform a non-linear association to a linear form,
for instance by taking the logarithm. You can then calculate a correlation using
the transformed data.
Explanatory and response variables
A response variable measures or records an outcome of a study. An
explanatory variable explains changes in the response variable.
Typically, the explanatory or independent variable is plotted on the x
axis, and the response or dependent variable is plotted on the y axis.
[Scatterplot: Blood Alcohol as a function of Number of Beers. The response (dependent) variable, blood alcohol content (y), is plotted on the vertical axis in mg/ml; the explanatory (independent) variable, number of beers (x), is plotted on the horizontal axis.]
Correlation tells us about
strength (scatter) and direction
of the linear relationship
between two quantitative
variables.
In addition, we would like to have a numerical description of how both
variables vary together. For instance, is one variable increasing faster
than the other one? And we would like to make predictions based on that
numerical description.
But which line best
describes our data?
The regression line
• A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
• We often use a regression line to predict the value of y for a given value of x.
• In regression, the distinction between explanatory and response variables is important.
The regression line
The least-squares regression line is the unique line such that the sum
of the squared vertical (y) distances between the data points and the
line is as small as possible.
Distances between the points and
line are squared so all are positive
values. This is done so that
distances can be properly added
(Pythagoras).
Properties
The least-squares regression line can be shown to have this equation:
ŷ = b0 + b1x
ŷ is the predicted y value ("y hat")
b1 is the slope
b0 is the y-intercept
How to:
First we calculate the slope of the line, b1; from statistics we already know:
b1 = r (sy / sx)
r is the correlation.
sy is the standard deviation of the response variable y.
sx is the standard deviation of the explanatory variable x.
Once we know b1, the slope, we can calculate b0, the y-intercept:
b0 = ȳ − b1 x̄
where x̄ and ȳ are the sample means of the x and y variables.
The equation completely describes the regression line.
To plot the regression line you only need to plug two x values into the equation, get y, and draw the line that goes through those points.
Hint: The regression line always passes through the mean of x and y.
The points you use for
drawing the regression
line are derived from the
equation.
They are NOT points from
your sample data (except
by pure coincidence).
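Here is a minimal sketch, in Python with numpy (an assumption, not part of the slides), of the two formulas above applied to the beers/BAC data.

```python
# Slope and intercept from the formulas above: b1 = r*(sy/sx), b0 = ybar - b1*xbar.
import numpy as np

x = np.array([5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4])          # beers
y = np.array([0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
              0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05])         # BAC

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope
b0 = y.mean() - b1 * x.mean()            # intercept; line passes through (xbar, ybar)

print(b1, b0)                            # fitted slope and intercept
```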
Making predictions
The equation of the least-squares regression allows you to predict y for any x within the range studied.
ŷ = 0.0144x + 0.0008
Nobody in the study drank 6.5 beers, but by finding the value of ŷ from the regression line for x = 6.5 we would expect a blood alcohol content of 0.094 mg/ml.
ŷ = 0.0144 × 6.5 + 0.0008
ŷ = 0.0936 + 0.0008 = 0.0944 mg/ml
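The same arithmetic as a tiny Python check (the helper name predict_bac is hypothetical), using the slide's fitted equation.

```python
# A quick check of the prediction above, using the slide's fitted equation.
def predict_bac(beers):
    """Predicted blood alcohol content for a given number of beers (hypothetical helper)."""
    return 0.0144 * beers + 0.0008

print(predict_bac(6.5))   # 0.0944 mg/ml, as computed above
```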
(Another example of a fitted regression line: ŷ = 0.125x + 41.4.)
The data in a scatterplot are a random sample from a population that may exhibit a linear relationship between x and y. Different sample → different plot.
Now we want to describe the population mean response μy as a function of the explanatory variable x: μy = β0 + β1x.
And we want to assess whether the observed relationship is statistically significant (not entirely explained by chance events due to random sampling).
Statistical model for linear regression
In the population, the linear regression equation is μy = β0 + β1x.
Sample data then fit the model:
Data = fit + residual
yi = (β0 + β1xi) + (εi)
where the εi are independent and Normally distributed N(0, σ).
Linear regression assumes equal variance of y (σ is the same for all values of x).
Estimating the parameters
μy = β0 + β1x
The intercept β0, the slope β1, and the standard deviation σ of y are the unknown parameters of the regression model. We rely on the random sample data to provide unbiased estimates of these parameters.
• The value of ŷ from the least-squares regression line is really a prediction of the mean value of y (μy) for a given value of x.
• The least-squares regression line (ŷ = b0 + b1x) obtained from sample data is the best estimate of the true population regression line (μy = β0 + β1x).
ŷ is an unbiased estimate for the mean response μy
b0 is an unbiased estimate for the intercept β0
b1 is an unbiased estimate for the slope β1
The population standard deviation σ for y at any given value of x represents the spread of the normal distribution of the εi around the mean μy.
The regression standard error, s, for n sample data points is calculated from the residuals (yi − ŷi):
s = √[ Σ residual² / (n − 2) ] = √[ Σ (yi − ŷi)² / (n − 2) ]
s is an unbiased estimate of the regression standard deviation σ.
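A minimal sketch of this computation in Python with numpy (an assumption; the slides rely on software output), for generic x and y arrays.

```python
# Regression standard error s computed from the residuals of a least-squares fit.
import numpy as np

def regression_standard_error(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # least-squares slope
    b0 = y.mean() - b1 * x.mean()                          # intercept
    residuals = y - (b0 + b1 * x)
    return np.sqrt(np.sum(residuals ** 2) / (n - 2))       # s = sqrt(sum(residual^2)/(n-2))
```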
Conditions for inference
• The observations are independent.
• The relationship is indeed linear.
• The standard deviation of y, σ, is the same for all values of x.
• The response y varies normally around its mean.
Confidence interval for regression parameters
Estimating the regression parameter β1 is a case of one-sample inference with unknown population variance.
• We rely on the t distribution, with n − 2 degrees of freedom.
A level C confidence interval for the slope, β1, is centered on b1, with width proportional to the standard error of the least-squares slope:
b1 ± t* SEb1
t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.
Significance test for the slope
We can test the hypothesis H0: β1 = 0 versus a one- or two-sided alternative.
We calculate t = b1 / SEb1, which has the t(n − 2) distribution, to find the p-value of the test.
Testing the hypothesis of no relationship
We may look for evidence of a significant relationship between variables x and y in the population from which our data were drawn.
For that, we can test the hypothesis that the regression slope parameter β1 is equal to zero.
H0: β1 = 0 vs. Ha: β1 ≠ 0
Because the slope satisfies b1 = r (sy / sx), testing H0: β1 = 0 also allows us to test the hypothesis of no correlation between x and y in the population.
Calculations for regression inference
To estimate the parameters of the regression, we calculate the standard errors for the estimated regression coefficients.
The standard error of the least-squares slope b1 is:
SEb1 = s / √[ Σ (xi − x̄)² ]
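Putting the last few slides together, here is a minimal Python sketch (numpy and scipy assumed, not part of the slides; the function name is illustrative) that computes SEb1, the t statistic and its p-value, and a level C confidence interval for the slope.

```python
# Inference for the slope: SE_b1, t statistic, two-sided p-value, and a
# level C confidence interval.
import numpy as np
from scipy import stats

def slope_inference(x, y, conf=0.95):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)    # least-squares slope
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))              # regression standard error
    se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))       # SE of the slope
    t = b1 / se_b1                                         # test of H0: beta1 = 0
    p = 2 * stats.t.sf(abs(t), df=n - 2)                   # two-sided p-value
    tstar = stats.t.ppf((1 + conf) / 2, df=n - 2)          # critical value t*
    ci = (b1 - tstar * se_b1, b1 + tstar * se_b1)          # b1 +/- t* SE_b1
    return b1, se_b1, t, p, ci
```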
What is the relationship between
the average speed a car is
driven and its fuel efficiency?
We plot fuel efficiency (in miles
per gallon, MPG) against average
speed (in miles per hour, MPH)
for a random sample of 60 cars.
The relationship is curved.
When speed is log transformed
(log of miles per hour, LOGMPH)
the new scatterplot shows a
positive, linear relationship.
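A minimal sketch of this transformation step in Python with scipy; the mph/mpg arrays below are hypothetical placeholders, not the 60-car sample from the slides.

```python
# Log-transforming the explanatory variable before fitting a straight line.
import numpy as np
from scipy import stats

mph = np.array([20, 30, 40, 50, 60, 70])        # hypothetical average speeds
mpg = np.array([24, 28, 30, 32, 33, 34])        # hypothetical fuel efficiencies

result = stats.linregress(np.log(mph), mpg)     # fit MPG against LOGMPH
print(result.slope, result.intercept, result.rvalue)
```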
Using technology
Computer software runs all the computations for regression analysis.
Here is some software output for the car speed/gas efficiency example.
[JMP output, with the slope, intercept, standard error, and p-values for tests of significance labeled.]
The t-test for the regression slope is highly significant (p < 0.0001). There is a significant relationship between average car speed and gas efficiency.
13.4  Multiple Regression Analysis
Copyright © Cengage Learning. All rights reserved.
Population multiple regression equation
• Up to this point we have considered, in detail, the linear regression model in one explanatory variable x: ŷ = b0 + b1x.
• Usually more complex linear models are needed in practical situations.
• There are many problems in which knowledge of more than one explanatory variable is necessary in order to obtain a better understanding and better prediction of a particular response.
• In multiple regression, the response variable y depends on p explanatory variables x1, x2, …, xp:
ŷ = b0 + b1x1 + b2x2 + … + bpxp
Data for multiple regression
• The data for a simple linear regression problem consist of n observations (xi, yi) of the two variables.
• Data for multiple linear regression consist of the value of a response variable y and p explanatory variables (x1, x2, …, xp) on n cases.
• We write the data and enter them in the form:

Case   x1    x2    …   xp    y
1      x11   x12   …   x1p   y1
2      x21   x22   …   x2p   y2
…      …     …     …   …     …
n      xn1   xn2   …   xnp   yn
We have data on 224 first-year computer science majors at a large
university in a given year. The data for each student include:
* Cumulative GPA after 2 semesters at the university (y, response variable)
* SAT math score (SATM, x1, explanatory variable)
* SAT verbal score (SATV, x2, explanatory variable)
* Average high school grade in math (HSM, x3, explanatory variable)
* Average high school grade in science (HSS, x4, explanatory variable)
* Average high school grade in English (HSE, x5, explanatory variable)
Case   SATM   SATV   …   HSE   GPA
1      720    700    …   9     3.8
2      590    350    …   6     2.6
224    550    490    …   7     3.0
Multiple linear regression model
For p explanatory variables, we can express the population mean response (μy) as a linear equation:
μy = β0 + β1x1 + … + βpxp
The statistical model for n sample data points (i = 1, 2, …, n) is then:
Data = fit + residual
yi = (β0 + β1x1i + … + βpxpi) + (εi)
where the εi are independent and normally distributed N(0, σ).
Multiple linear regression assumes equal variance σ² of y. The parameters of the model are β0, β1, …, βp.
Estimation of the parameters
We select a random sample of n individuals for which p + 1 variables were measured (x1, …, xp, y). The least-squares regression method minimizes the sum of squared deviations ei (= yi − ŷi) to express y as a linear function of the p explanatory variables:
ŷi = b0 + b1x1i + … + bpxpi
As with simple linear regression, the constant b0 is the y intercept.
ŷ, b0, …, bp are unbiased estimates of the population parameters μy, β0, …, βp.
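A minimal sketch of least-squares estimation with several explanatory variables in Python with numpy (an assumption; the slides use statistical software, and the small arrays below are hypothetical, not the GPA data set).

```python
# Least-squares fit for a multiple regression: build a design matrix with a
# column of ones (intercept) plus the explanatory variables, then solve.
import numpy as np

x1 = np.array([720, 590, 550, 640, 600, 690], float)   # e.g. SATM (hypothetical)
x2 = np.array([700, 350, 490, 530, 610, 560], float)   # e.g. SATV (hypothetical)
y = np.array([3.8, 2.6, 3.0, 3.2, 2.9, 3.5])           # e.g. GPA (hypothetical)

X = np.column_stack([np.ones(len(y)), x1, x2])          # design matrix [1, x1, x2]
b, *_ = np.linalg.lstsq(X, y, rcond=None)               # b = (b0, b1, b2)
y_hat = X @ b                                           # fitted values
print(b)
```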
Confidence interval for βj
Estimating the regression parameters β0, …, βj, …, βp is a case of one-sample inference with unknown population variance.
• We rely on the t distribution, with n − p − 1 degrees of freedom.
A level C confidence interval for βj is:
bj ± t* SEbj
- SEbj is the standard error of bj; we rely on software to obtain SEbj.
- t* is the t critical value for the t(n − p − 1) distribution with area C between −t* and +t*.
Significance test for βj
To test the hypothesis H0: βj = 0 versus a one- or two-sided alternative, we calculate the t statistic t = bj / SEbj, which has the t(n − p − 1) distribution, to find the p-value of the test.
ANOVA F-test for multiple regression
For a multiple linear relationship the ANOVA tests the hypotheses
H 0: β1 = β2 = … = βp = 0
versus Ha: H0 not true
by computing the F statistic: F = MSM / MSE
When H0 is true, F follows
the F(p, n − p − 1) distribution.
The p-value is P(F > f ).
A significant p-value doesn’t mean that all p explanatory variables
have a significant influence on y—only that at least one does.
ANOVA table for multiple regression

Source   df          Sum of squares SS   Mean square MS   F         P-value
Model    p           Σ(ŷi − ȳ)²          SSM/DFM          MSM/MSE   Tail area above F
Error    n − p − 1   Σ(yi − ŷi)²         SSE/DFE
Total    n − 1       Σ(yi − ȳ)²

SST = SSM + SSE
DFT = DFM + DFE
The sample standard error, s, for n sample data points is calculated from the residuals ei = yi − ŷi:
s² = Σ ei² / (n − p − 1) = Σ (yi − ŷi)² / (n − p − 1) = SSE/DFE = MSE
s is an unbiased estimate of the regression standard deviation σ.
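Tying the multiple-regression formulas together, here is a minimal Python sketch (numpy and scipy assumed; the function name is hypothetical) that computes the coefficients, s, R², the ANOVA F test, and the t tests for the individual coefficients.

```python
# Fit y on p explanatory variables, then compute s, the ANOVA sums of squares,
# the F statistic, R^2, and t statistics for each coefficient.
# X is an n-by-p array (no intercept column), y an n-vector.
import numpy as np
from scipy import stats

def multiple_regression_summary(X, y):
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape
    D = np.column_stack([np.ones(n), X])                 # design matrix with intercept
    b, *_ = np.linalg.lstsq(D, y, rcond=None)            # b0, b1, ..., bp
    y_hat = D @ b
    sse = np.sum((y - y_hat) ** 2)                       # error sum of squares
    ssm = np.sum((y_hat - y.mean()) ** 2)                # model sum of squares
    mse = sse / (n - p - 1)                              # = s^2
    f = (ssm / p) / mse                                  # ANOVA F statistic
    f_pval = stats.f.sf(f, p, n - p - 1)                 # P(F > f)
    r2 = ssm / (ssm + sse)                               # R^2 = SSM/SST
    se_b = np.sqrt(mse * np.diag(np.linalg.inv(D.T @ D)))  # SEs of b0..bp
    t = b / se_b                                         # t statistics
    t_pvals = 2 * stats.t.sf(np.abs(t), n - p - 1)       # two-sided p-values
    return {"b": b, "s": np.sqrt(mse), "R2": r2,
            "F": f, "F p-value": f_pval, "t": t, "t p-values": t_pvals}
```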
Recall the data on 224 first-year computer science majors at a large university, with cumulative GPA after 2 semesters as the response and SATM, SATV, HSM, HSS, and HSE as explanatory variables.
Here are the summary statistics for these data:
We finally run a multiple regression model with all the variables together.
[Software output: the overall P-value is very significant; R² is fairly small (22%); HSM is significant.]
The overall test is significant, but only the average high school math score (HSM) makes a significant contribution in this model to predicting the cumulative GPA.
This conclusion applies to computer science majors at this large university.
The United Nations Development Reports provide data on a large number of human development variables for 182 OECD (Organization for Economic Cooperation and Development) countries. The variables examined here from the 2009 report HDR_2009 are:
* HDI - United Nations human development index rank
* LEB - life expectancy at birth (years) in 2007
* ALR - adult literacy rate (% aged 15 and above)
* GDP - gross domestic product per capita (purchasing power parity in US$)
* URB - % urban population in 2010
* PEH - public expenditure on health (as % of total government expenditure)
Here are the summary statistics for a sample of twenty countries:
Here are the data:
Here is the data:
HDI
4
9
13
18
38
46
50
75
92
93
98
99
105
113
117
124
127
150
172
178
Country
Canada
Switzerland
United States
Italy
Malta
Lithuania
Uruguay
Brazil
China
Belize
Tunisia
Tonga
Phillipines
Bolivia
Moldova
Nicaragua
Tajikistan
Sudan
Mozambique
Mali
LEB
80.6
81.7
79.1
81.1
79.6
71.8
76.1
72.2
72.9
76.0
73.8
71.7
71.6
65.4
68.3
72.7
66.4
57.9
47.8
48.1
ALR
99.0
99.0
99.0
98.9
92.4
99.7
97.9
90.0
93.3
75.1
77.7
99.2
93.4
90.7
99.2
78.0
99.6
60.9
44.4
26.2
GDP
35,812
40,658
45,592
30,353
2,308
17,575
11,216
9,567
5,383
6,734
752
3,748
3,406
4,206
2,551
2,570
1,753
2,086
802
1,083
URB
80.6
73.6
82.3
68.4
94.7
67.2
89.0
86.5
44.9
52.7
67.3
25.3
66.4
66.5
43.8
57.3
26.5
45.2
38.4
33.3
PEH
17.9
19.6
19.1
14.2
14.7
13.3
9.2
7.2
9.9
10.9
6.5
11.1
6.4
11.6
11.8
16.0
5.5
6.3
12.6
12.2
The first step in multiple linear regression is to study all pair-wise
relationships between the p + 1 variables. Here is the output for all
pair-wise correlations.
Scatterplots for all 10 pair-wise relationships are also necessary to
understand the data.
Note that the relationship between GDP and the other variables appears to be non-linear.
Let’s first run two simple linear regressions: one to predict LEB using ALR alone and another using GDP alone.
Note: R² = 1218.41/1809.228 = 0.673 for ALR alone, and R² = 609.535/1809.228 = 0.337 for GDP alone,
- the proportion of variation in LEB explained by each variable separately.
- When ALR or GDP is used alone, both P-values are very significant.
Now, let’s run a multiple linear regression using ALR and GDP together.
Note:
R² = 1333.465/1809.228 = 0.737
- a slight increase from using ALR alone
- b1 = 0.335 with SE = 0.066; when used alone, we had b1 = 0.392 with SE = 0.064
- b2 = 0.00019 with SE = 0.00009; when used alone, we had bGDP = 0.00039 with SEGDP = 0.00013
Now consider a multiple linear regression using all four explanatory variables:
R² = 0.807
R = 0.899
R is the correlation between y and ŷ.
[Software output: the overall P-value is very significant, so at least one regression coefficient is different from 0. ALR is significant; URB is significant; GDP and PEH are not significant.]
We now drop the least significant variable from the previous model: GDP.
[Software output: R² is almost the same; the overall P-value is very significant; ALR is significant; URB is significant; PEH is not.]
The conclusions are about the same. But notice that the actual regression coefficients have changed:
predicted LEB = 32.10 + 0.31 ALR + 0.14 URB + 0.26 PEH + 0.00005 GDP
predicted LEB = 30.31 + 0.32 ALR + 0.14 URB + 0.37 PEH
Let’s run a multiple linear regression with ALR and URB only.
[Software output: R² is marginally lower; the overall P-value is very significant; ALR is significant; URB is significant.]
The ANOVA test for βALR and βURB is very significant → at least one is not zero.
The t tests for βALR and βURB are very significant → each one is not zero.
When taken together, only ALR and URB are significant predictors of LEB.
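As a closing illustration, here is a minimal sketch of this final two-predictor model in Python with statsmodels (a library assumption; the slides use other software), using the LEB, ALR, and URB columns from the 20-country table above. The printed R², coefficients, and p-values can be compared against the software output described in the slides.

```python
# Fit LEB on ALR and URB with statsmodels, using the 20-country data above.
import numpy as np
import statsmodels.api as sm

leb = np.array([80.6, 81.7, 79.1, 81.1, 79.6, 71.8, 76.1, 72.2, 72.9, 76.0,
                73.8, 71.7, 71.6, 65.4, 68.3, 72.7, 66.4, 57.9, 47.8, 48.1])
alr = np.array([99.0, 99.0, 99.0, 98.9, 92.4, 99.7, 97.9, 90.0, 93.3, 75.1,
                77.7, 99.2, 93.4, 90.7, 99.2, 78.0, 99.6, 60.9, 44.4, 26.2])
urb = np.array([80.6, 73.6, 82.3, 68.4, 94.7, 67.2, 89.0, 86.5, 44.9, 52.7,
                67.3, 25.3, 66.4, 66.5, 43.8, 57.3, 26.5, 45.2, 38.4, 33.3])

X = sm.add_constant(np.column_stack([alr, urb]))   # intercept, ALR, URB
fit = sm.OLS(leb, X).fit()
print(fit.params)      # b0, b_ALR, b_URB
print(fit.rsquared)    # R^2 for the two-predictor model
print(fit.pvalues)     # t-test p-values for each coefficient
```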