ST 380
Probability and Statistics for the Physical Sciences
Regression Models
The value of a response Y may be influenced by the levels of one or
more factors.
Some factors are qualitative: type of solvent, or brand of gasoline.
Others are quantitative: temperature, or pressure.
Regression modeling deals largely with quantitative factors, also
called explanatory variables, or predictors.
When only two levels of a quantitative factor are used, we can detect
whether it influences the response, but not how.
When several levels of a factor are used, we can begin to describe the
way in which it influences the response.
In observational data, we do not control the levels, and many
different levels may be observed.
The simplest type of influence is linear dependence.
Example 12.2
In a particular process for removing arsenic from drinking water, the
percentage removed (Y ) is affected by the pH (x) of the water.
In R
arsenic <- read.table("Data/Example-12-02.txt", header = TRUE)
plot(arsenic)
Clearly the percentage of arsenic removed is influenced by the pH.
The percentage removed declines roughly linearly as the pH increases.
The Regression Model
The general idea of a regression model is that the distribution of the
response Y depends on the level of the quantitative factor x.
In the simple linear regression model, we assume that:
The expected value of Y is a linear function of x:
E(Y) = β0 + β1 x.
The variance of Y is constant:
V(Y) = σ².
Equivalently, if we write ε = Y − E(Y) for the random error in Y,
then
Y = β0 + β1 x + ε,
where
E(ε) = 0
and
V(ε) = σ².
We also assume that the Y's (or ε's) are independent.
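To make these assumptions concrete, here is a minimal simulation sketch (not from the slides; the parameter values are hypothetical, loosely echoing the arsenic example):
In R
set.seed(1)
n     <- 18
beta0 <- 190                             # hypothetical intercept
beta1 <- -18                             # hypothetical slope
sigma <- 6                               # hypothetical error sd
x     <- runif(n, 6, 10)                 # predictor levels
eps   <- rnorm(n, mean = 0, sd = sigma)  # independent errors: E(eps) = 0, V(eps) = sigma^2
y     <- beta0 + beta1 * x + eps         # Y = beta0 + beta1 x + eps
plot(x, y)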
The assumptions appear to be valid for the arsenic example.
In other situations, any of the three assumptions might be violated:
The expected value E (Y ) is often a nonlinear function of x.
The variance V (Y ) may not be constant; it often increases with
E (Y ).
The responses may be correlated; measurements of the same
variable collected over time are usually correlated with each
other.
Also, the response is typically influenced by more than one factor.
For now, we ignore these complications.
Statistical Inference
The parameters of the simple linear regression model are:
β0, the intercept of the line;
β1, the slope of the line;
σ², the error variance.
As always, we want point estimators and interval estimators of them,
and we want to test hypotheses about them.
Least Squares
The most commonly used estimators of β0 and β1 are the least
squares estimators.
They are also maximum likelihood estimators, if Y has a normal
distribution.
Suppose that we observe n pairs (x1, y1), (x2, y2), …, (xn, yn).
For any trial values b0 and b1, we can use the line y = b0 + b1 x to
predict what the value of Y should have been at each xi:
ŷi(b0, b1) = b0 + b1 xi.
Since we observed the actual value yi, we can also calculate the
prediction error, or residual:
ri(b0, b1) = yi − ŷi(b0, b1) = yi − (b0 + b1 xi).
Good values of b0 and b1 should make good predictions, with small
residuals.
The best values, in the least squares sense, give the smallest value of
the sum of squared residuals,
S(b0, b1) = Σ ri(b0, b1)² = Σ [yi − (b0 + b1 xi)]²,
where the sums run over i = 1, …, n.
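As a sketch (assuming the arsenic data frame from Example 12.2, with columns pH and Percent), the criterion can be written as an R function and minimized numerically; the result should approximately match the least squares estimates derived next:
In R
S <- function(b0, b1, x, y) {
  yhat <- b0 + b1 * x   # predicted values on the trial line
  r    <- y - yhat      # residuals r_i(b0, b1)
  sum(r^2)              # sum of squared residuals
}
optim(c(0, 0), function(b) S(b[1], b[2], arsenic$pH, arsenic$Percent))$par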
These are the least squares estimators:
β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²  (sums over i = 1, …, n)
and
β̂0 = ȳ − β̂1 x̄.
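As a check, a minimal sketch computing these closed-form estimates directly for the arsenic data; the results should agree with coef(lm(Percent ~ pH, arsenic)):
In R
x <- arsenic$pH
y <- arsenic$Percent
beta1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)
c(beta0.hat, beta1.hat)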
Estimating σ²
The error sum of squares, or residual sum of squares (for the least
squares estimates), is SSE = S(β̂0, β̂1).
Under the regression model,
E(SSE) = (n − 2)σ²,
so
s² = SSE / (n − 2)
is an unbiased estimator of σ².
The maximum likelihood estimator has a divisor of n instead of n − 2.
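A minimal sketch of both estimates for the arsenic data; the square root of s² is the "Residual standard error" reported by summary(), shown below:
In R
fit <- lm(Percent ~ pH, arsenic)
n   <- nrow(arsenic)
SSE <- sum(resid(fit)^2)   # residual sum of squares
SSE / (n - 2)              # unbiased estimate s^2
SSE / n                    # maximum likelihood estimate
sqrt(SSE / (n - 2))        # residual standard error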
Example 12.2 continued
In R, the lm() function produces least squares estimates of the
parameters, and much more:
summary(lm(Percent ~ pH, arsenic))
Output
Call:
lm(formula = Percent ~ pH, data = arsenic)
Residuals:
    Min      1Q  Median      3Q     Max
-9.0421 -4.5110 -0.7635  3.8326 11.1382

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  190.268     12.587   15.12 6.81e-11 ***
pH           -18.034      1.474  -12.23 1.55e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.126 on 16 degrees of freedom
Multiple R-squared: 0.9034, Adjusted R-squared: 0.8974
F-statistic: 149.7 on 1 and 16 DF, p-value: 1.552e-09
As before, we need to translate the output to the regression model;
the “Coefficients” are the estimated parameters:
(Intercept)   β̂0
pH            β̂1
Each line has
the parameter estimate,
its standard error,
the t-statistic for testing the null hypothesis H0 : βi = 0,
the P-value for the statistic (see the sketch below).
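A minimal sketch of extracting those columns and reproducing the t-statistic for the slope as estimate divided by standard error:
In R
fit  <- lm(Percent ~ pH, arsenic)
ctab <- coef(summary(fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)
ctab
ctab["pH", "Estimate"] / ctab["pH", "Std. Error"]   # reproduces the t value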
Note that β0 is E(Y) when x = 0; in this case, no drinking water has
a pH of zero, so β0 has no physical meaning, and testing hypotheses
about it is a waste of time (and an invitation to Type I errors).
β1 is the slope of the line, and measures how strongly pH affects the
removal of arsenic; in particular, if β1 = 0, pH has no effect, so
H0 : β1 = 0 is of substantive interest.
The output shows that the association of pH with the removal of
arsenic is highly significant.
Coefficient of Determination
The coefficient of determination, denoted R², measures how much of
the variation in Y is explained by the regression model.
It is defined as
R² = 1 − SSE / SST,
where
SST = Σ (yi − ȳ)²
is the sum of squares of the residuals around a horizontal line
(β1 = 0) at height ȳ (β0 = ȳ).
In the R output, the coefficient of determination is labeled
Multiple R-squared.
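A minimal sketch computing R² directly from this definition for the arsenic data:
In R
fit <- lm(Percent ~ pH, arsenic)
y   <- arsenic$Percent
SSE <- sum(resid(fit)^2)      # squared residuals around the fitted line
SST <- sum((y - mean(y))^2)   # squared residuals around the horizontal line at ybar
1 - SSE / SST                 # matches Multiple R-squared: 0.9034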
In the arsenic removal example, R² = 0.9034, which would often be
stated as
90.34% of the variance of Y is explained by the linear
regression on x
or more loosely as
90.34% of the variation in the effectiveness of removal of
arsenic is explained by the effect of pH.