Correlation and Regression
Suppose we have two random variables X and Y that have a joint bivariate normal distribution with correlation coefficient \rho. A joint normal distribution has p.d.f.

f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left( \frac{x-\mu_X}{\sigma_X} \right)^2 - 2\rho \left( \frac{x-\mu_X}{\sigma_X} \right)\left( \frac{y-\mu_Y}{\sigma_Y} \right) + \left( \frac{y-\mu_Y}{\sigma_Y} \right)^2 \right] \right\} .

A graph of a joint normal density is shown below for \rho = -0.9:

[Figure: surface plot of a joint normal density with \rho = -0.9]
If we have selected a random sample of size n from this population, we may estimate \rho with the sample correlation coefficient:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}} .
The sample correlation coefficient may also be found by taking the square root of the coefficient of determination and attaching the sign indicated by the scatterplot of Y v. X: a positive sign if the relationship is increasing, or a negative sign if the relationship is decreasing.
Example: In the stainless steel stress fracture example, we found that R^2 = 0.632518266, and the scatterplot showed a decreasing relationship between tensile stress and time to fracture. Hence, the sample correlation coefficient would be r = -\sqrt{0.632518266} = -0.7953, indicating a fairly strong negative linear relationship between applied tensile stress and time to fracture.
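The computations in these notes are done in Excel; purely as an illustration, the same arithmetic can be sketched in Python with numpy. The x and y values below are made up for the sketch (the stainless steel data are not reproduced here); only the final line uses the R^2 actually reported above.

import numpy as np

# Made-up stress (x) and time-to-fracture (y) values, for illustration only
x = np.array([30.0, 32.5, 35.0, 37.5, 40.0, 42.5])
y = np.array([63.0, 58.0, 55.0, 46.0, 38.0, 29.0])

# Sample correlation coefficient computed directly from the definition
r_direct = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# Equivalent built-in computation
r_builtin = np.corrcoef(x, y)[0, 1]

# Recovering r from a reported R^2 when the scatterplot shows a decreasing trend,
# as in the stainless steel example
r_from_R2 = -np.sqrt(0.632518266)   # about -0.7953

print(r_direct, r_builtin, r_from_R2)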
Multiple Regression
Sometimes we have several possible predictor variables, with no
single variable giving good prediction by itself. In such a situation,
we may use a multiple regression model:
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i = \beta_0 + \sum_{j=1}^{k} \beta_j X_{ij} + \varepsilon_i .
Here X_{ij} is the value of the jth predictor variable for the ith member of the sample.
We assume that the relationship between Y and any one of the
predictor variables is linear. We also assume that
 1 ,  2 , ,  n ~ Normal 0,  2 .
i .i .d .
The first step in the data analysis would be to do scatterplots of Y v.
each X, to check the assumption of linearity of the relationships.
A cautionary note is in order here. Some of the variables in the pool
of predictors may be only weakly related to Y. We don’t necessarily
discard these variables from the pool of predictors, however, since
we are looking for a model in which a collection of predictors together gives good prediction.
Estimation of the parameters is accomplished using the Least
Squares method. The quantity to be minimized is the sum of
squared errors:
Q = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( Y_i - \beta_0 - \sum_{j=1}^{k} \beta_j X_{ij} \right)^2 .
We take the partial derivative of Q with respect to each of the
parameters and set the result equal to zero, obtaining a set of k + 1
equations in k + 1 unknowns, the normal equations:
\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( Y_i - \beta_0 - \sum_{j=1}^{k} \beta_j X_{ij} \right) \stackrel{set}{=} 0 ; and

\frac{\partial Q}{\partial \beta_l} = -2 \sum_{i=1}^{n} X_{il} \left( Y_i - \beta_0 - \sum_{j=1}^{k} \beta_j X_{ij} \right) \stackrel{set}{=} 0 , for l = 1, 2, …, k.
The solutions to the normal equations are the least squares
estimators of the parameters.
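The notes carry out this computation with Excel's LINEST function in the example that follows. As a rough sketch only, the same normal equations can be written in matrix form as (X'X)b = X'Y and solved with a few lines of Python; the numbers used here are simply the first five observations of the adsorption data given below.

import numpy as np

# First five observations of the adsorption example below, just to have numbers
y  = np.array([4.0, 18.0, 14.0, 18.0, 26.0])
x1 = np.array([61.0, 175.0, 111.0, 124.0, 130.0])
x2 = np.array([13.0, 21.0, 24.0, 23.0, 64.0])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# The normal equations in matrix form are (X'X) b = X'Y;
# solving them gives the least squares estimates of beta_0, beta_1, beta_2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)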
Example: Soil and sediment adsorption, the extent to which
chemicals collect in a condensed form on the surface, is an
important characteristic influencing the effectiveness of pesticides
and various agricultural chemicals. The paper “Adsorption of
phosphate, arsenate, methanearsonate, and cacodylate by lake and
stream sediments: comparisons with soils” (Journal of
Environmental Quality, 1984, pp. 499-504), gave the accompanying
data on Y = phosphate adsorption index, X1 = amount of extractable
iron, and X2 = amount of extractable aluminum. (from Probability
and Statistics for Engineering and the Sciences, by Jay L. Devore)
Observation    X1     X2     Y
1               61     13     4
2              175     21    18
3              111     24    14
4              124     23    18
5              130     64    26
6              173     38    26
7              169     33    21
8              169     61    30
9              160     39    28
10             244     71    36
11             257    112    65
12             333     88    62
13             199     54    40
The paper proposed the model:
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i .
We want to try to fit this model to the data, to see whether the two
predictor variables give good prediction of the Adsorption Index.
First, we want to check whether the relationship between the
Adsorption Index and each of the predictors seems to be linear.
From the scatterplots below, it appears that the relationship between
Adsorption Index and each of the predictors is a fairly strong
positive linear relationship.
[Scatterplot of Adsorption Index v. Amount of Extractable Fe]

[Scatterplot of Adsorption Index v. Amount of Extractable Al]
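The charts themselves did not survive in this transcript. As an optional sketch (the original notes use Excel charts, not Python), the two scatterplots could be redrawn with matplotlib from the data in the table above:

import numpy as np
import matplotlib.pyplot as plt

# Adsorption data from the table above
x1 = np.array([61, 175, 111, 124, 130, 173, 169, 169, 160, 244, 257, 333, 199], dtype=float)
x2 = np.array([13, 21, 24, 23, 64, 38, 33, 61, 39, 71, 112, 88, 54], dtype=float)
y  = np.array([4, 18, 14, 18, 26, 26, 21, 30, 28, 36, 65, 62, 40], dtype=float)

# One scatterplot of the response against each predictor, to check linearity
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x1, y)
axes[0].set_xlabel("Amount of Extractable Fe")
axes[0].set_ylabel("Adsorption Index")
axes[1].scatter(x2, y)
axes[1].set_xlabel("Amount of Extractable Al")
axes[1].set_ylabel("Adsorption Index")
plt.tight_layout()
plt.show()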
To fit a multiple regression model using Excel, we must use the
LINEST function, which is an array function. We enter the data,
being sure to put the predictor variables in adjacent columns. Then
we choose an empty cell, and highlight an array containing 5 rows
and k + 1 columns. We then enter
=LINEST(A1:A13, B1:C13, TRUE, TRUE),
followed by Ctrl-Shift-Enter.
The first entry in parentheses is the column listing the values of Y.
The second entry is the rectangular array of predictor variables. The
third entry is an indicator that we want Excel to estimate the
intercept, instead of simply assuming that it is zero. The fourth
entry is an indicator that we want not only the parameter estimates and their standard errors, but also additional regression statistics such as the coefficient of determination and the sums of squares. The
output is shown below:
The first row of the table gives the parameter estimates; note that LINEST reports the slope estimates in reverse order of the predictor columns, so the coefficient of X2 appears first, then the coefficient of X1, and finally the intercept. The second row gives the corresponding standard errors. The third row gives the coefficient of determination and the standard error of the estimate (the square root of the MSE). The fourth row gives the value of the F statistic and the error degrees of freedom. The fifth row gives the regression sum of squares, followed by the error sum of squares.
   0.349       0.112733    -7.35066
   0.071306    0.029691     3.484668
   0.948467    4.379375     #N/A
  92.02558    10            #N/A
3529.903     191.7892       #N/A
The fitted regression equation is
\hat{Y} = -7.35066 + 0.112733 X_1 + 0.349 X_2 ,
and 94.8467% of the variability in the Adsorption Index is explained by the linear relationship between the Adsorption Index and the two predictors.
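As a cross-check of the LINEST results (an illustration added here, not part of the original analysis), the same least squares fit can be reproduced in Python from the data table above:

import numpy as np

# Adsorption data from the table above
x1 = np.array([61, 175, 111, 124, 130, 173, 169, 169, 160, 244, 257, 333, 199], dtype=float)
x2 = np.array([13, 21, 24, 23, 64, 38, 33, 61, 39, 71, 112, 88, 54], dtype=float)
y  = np.array([4, 18, 14, 18, 26, 26, 21, 30, 28, 36, 65, 62, 40], dtype=float)

# Least squares fit of Y = beta0 + beta1*X1 + beta2*X2 + error
X = np.column_stack([np.ones_like(x1), x1, x2])
beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values, error sum of squares, and coefficient of determination
y_hat = X @ beta_hat
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared = 1 - sse / sst

print(beta_hat)    # intercept, X1 (Fe) coefficient, X2 (Al) coefficient
print(r_squared)   # should agree with the LINEST value of about 0.9485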
We can construct the ANOVA table from the above information:
Source         SS           d.f.    MS           F
Regression     3529.903       2     1764.9515    92.02559
Residual        191.7892     10       19.17892
Total          3721.6922     12
Hence, if we want to test for a linear relationship between the Adsorption Index and the predictor variables, we proceed as follows:
Step 1: H_0: \beta_1 = \beta_2 = 0   v.   H_a: not both 0.
Step 2: n = 13, \alpha = 0.05.
Step 3: The test statistic is F = MSR / MSE, which under the null hypothesis has an F(2, 10) distribution.
Step 4: The critical value is F(0.95, 2, 10) = 4.10. If the calculated value of the test statistic is greater than 4.10, we will reject the null hypothesis.
Step 5: From the ANOVA table, we have F = 92.02559.
Step 6: We reject the null hypothesis at the 0.05 level of significance. We have sufficient evidence to conclude that at least one of the slope coefficients is not 0.
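As an added illustration (not part of the original notes), the F statistic, the critical value, and a p-value for this test can be computed with scipy:

from scipy import stats

# ANOVA quantities from the table above
ssr, sse = 3529.903, 191.7892
df_reg, df_err = 2, 10

msr, mse = ssr / df_reg, sse / df_err
f_stat = msr / mse                              # about 92.03

# Critical value and p-value for the F(2, 10) reference distribution
f_crit = stats.f.ppf(0.95, df_reg, df_err)      # about 4.10
p_value = stats.f.sf(f_stat, df_reg, df_err)

print(f_stat, f_crit, p_value)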
If we have a soil type for which the amount of extractable iron is 150 and the amount of extractable aluminum is 50, then the predicted Adsorption Index is \hat{Y} = -7.35066 + 0.112733(150) + 0.349(50) \approx 27.01.
Sometimes we have several predictors, and one or more of them is
only weakly related to the response variable. After including some
of the stronger predictors in the model, we want to know whether it
would make sense to include any of the weaker predictors as well.
Anytime we include another predictor in the model, we will increase the value of SSR and decrease the value of SSE, while the total SST = SSR + SSE remains the same. Since R^2 = SSR / SST, adding another predictor will always increase the explained variation in Y by some amount.
We want to know whether the increase in R2 due to adding a
relatively weak predictor is sufficiently large to offset the decrease
in the error degrees of freedom. To do this, we will look at the
adjusted coefficient of multiple determination.
Defn: The adjusted coefficient of multiple determination for a multiple regression model with k predictor variables is

R^2_{adj} = 1 - \frac{SSE / (n - p)}{SST / (n - 1)} = 1 - \left( \frac{n - 1}{n - p} \right) \frac{SSE}{SST} ,

where p = k + 1 is the number of estimated parameters (the intercept and the k slopes) after the additional predictors are added.
If the decrease in SSE from adding another predictor is not sufficient
to offset the loss of an error degree of freedom, then the adjusted
coefficient of multiple determination may actually decrease, and we
would decide not to add the additional predictor.
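As a small added illustration, the adjusted coefficient of multiple determination can be computed with a short Python helper; here p is counted as the number of estimated parameters (intercept plus slopes), the convention used in the definition above. Applying it to the adsorption model fitted earlier gives about 0.938.

def adjusted_r_squared(sse, sst, n, p):
    # n is the sample size and p is the number of estimated parameters
    # (intercept plus slopes), so the error degrees of freedom are n - p
    return 1 - (sse / (n - p)) / (sst / (n - 1))

# Adsorption model fitted earlier: n = 13, k = 2 predictors, so p = 3
print(adjusted_r_squared(sse=191.7892, sst=3721.6922, n=13, p=3))   # about 0.938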
Example: Data were collected on three variables in an observational
study at a semiconductor manufacturing plant. The finished
semiconductor is wire-bonded to a frame. The three variables
measured, for a sample of 25 units, are the Pull Strength (the amount
of force necessary to break the bond), the Wire Length, and the Die
Height. The data are shown in the table on page 288.
If we include only Wire Length in the model, we find that
R2 = 0.963954368.
If we include both variables in the model, the table of results from LINEST is shown below. We see that R^2 = 0.979905735, an increase of 0.015951367. Is this amount of increase worth the loss of an error degree of freedom? With only Wire Length in the model the adjusted coefficient of multiple determination is about 0.962; recomputing it with both predictors (SSE = 122.694469 on 22 error degrees of freedom, from the output below) gives about 0.978. In this case the adjusted value increases, so the reduction in SSE does offset the lost error degree of freedom and Die Height appears worth keeping in the model; had the adjusted value decreased instead, we would have left Die Height out.
   0.011815307    2.72668868     2.820283576
   0.002827293    0.09878057     1.032732335
   0.979905735    2.36157179     #N/A
 536.4198808     22              #N/A
5983.250235     122.694469       #N/A