BIOINF 2118
Linear Regression
The least squares principle.
We have data pairs (x_i, Y_i), i = 1, \dots, n, with
x = a PREDICTOR (covariate, feature, attribute, independent variable), and
Y = a TARGET (outcome, dependent variable).
Example: See GAGurine.R.
The simple linear regression model:
    Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i , \qquad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \dots, n .
In this model, the x values are assumed to be fixed and known. The Y values are random.
Under this model, the Y values are independent, and the distribution of each Y_i is normal:
    Y_i \sim N(\beta_0 + \beta_1 x_i, \; \sigma^2) .
When there is one predictor, maximizing the likelihood is the same as minimizing the sum of squares:
    Q(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2 .
“Least Squares” finds the parameter values that minimize Q.
Rearranging (setting the partial derivatives of Q with respect to \beta_0 and \beta_1 equal to zero gives the normal equations), the solution is
    \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(Y_i - \bar Y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} , \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar x .
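As a quick check of these closed-form estimates, here is a minimal R sketch (the simulated data and variable names are my own, not from the handout); the hand-computed values should match lm():

    set.seed(1)
    n <- 50
    x <- runif(n, 0, 10)
    Y <- 2 + 0.5 * x + rnorm(n, sd = 1)          # "true" beta0 = 2, beta1 = 0.5
    beta1.hat <- sum((x - mean(x)) * (Y - mean(Y))) / sum((x - mean(x))^2)
    beta0.hat <- mean(Y) - beta1.hat * mean(x)
    c(beta0.hat, beta1.hat)
    coef(lm(Y ~ x))                              # should agree with the line above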
Multiple predictors:
The model is
    Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i .
It turns out to simplify things if we define
    x_{i0} = 1
for all observations. Now we write the "design matrix"
    X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix} .
Then
    Y_i = x_i^\top \beta + \varepsilon_i ,
where the multiplication is the dot product of two vectors. So \beta = (\beta_0, \beta_1, \dots, \beta_k)^\top is the vector of parameters, and x_i^\top = (1, x_{i1}, \dots, x_{ik}) is the i-th ROW vector of the design matrix. Stacking all the outcomes into a vector Y, we get
    Y = X\beta + \varepsilon    (matrix multiplication).
In matrix notation the least-squares solution is a lot simpler:
    \hat\beta = (X^\top X)^{-1} X^\top Y .
See regression.R : solve(t(X01) %*% X01) %*% t(X01) %*% Y
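A minimal sketch of that computation (the simulated data are my own; X01 is assumed, as the name in regression.R suggests, to be the design matrix with a leading column of ones):

    set.seed(2)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- rnorm(n)
    Y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
    X01 <- cbind(1, x1, x2)                        # design matrix, intercept column first
    beta.hat <- solve(t(X01) %*% X01) %*% t(X01) %*% Y
    cbind(beta.hat, coef(lm(Y ~ x1 + x2)))         # the two columns should agree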
(The general multiple-variable version of the normal distribution has density
    f(y) = (2\pi)^{-n/2} \, |\Sigma|^{-1/2} \exp\!\left\{ -\tfrac{1}{2} (y - \mu)^\top \Sigma^{-1} (y - \mu) \right\} .
Here, observations are independent with equal variance, so \Sigma = \sigma^2 I, and the likelihood function is
    L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left\{ -\frac{(Y - X\beta)^\top (Y - X\beta)}{2\sigma^2} \right\} = (2\pi\sigma^2)^{-n/2} \exp\!\left\{ -\frac{Q(\beta)}{2\sigma^2} \right\} . )
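As a sanity check on this likelihood (my own sketch, reusing the simulated x1, x2, Y from above): the log of the expression, evaluated at the least-squares coefficients and the ML variance estimate, should reproduce what logLik() reports for an lm fit.

    fit    <- lm(Y ~ x1 + x2)
    n      <- length(Y)
    Q.hat  <- sum(resid(fit)^2)                    # Q at the least-squares estimates
    s2.mle <- Q.hat / n                            # ML variance estimate (next section)
    -n/2 * log(2 * pi * s2.mle) - Q.hat / (2 * s2.mle)
    logLik(fit)                                    # should agree with the line above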
Estimating the variance
The maximum likelihood estimate for the variance is
    \hat\sigma^2 = \frac{Q(\hat\beta)}{n} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat Y_i)^2 .
This will be biased (too small, from overfitting). The unbiased estimate is
    S^2 = \frac{Q(\hat\beta)}{n - (k+1)} ,
because we have fitted k + 1 parameters.
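A small simulation sketch of this bias (my own illustration, not from the handout): with one predictor (k = 1), dividing Q by n systematically underestimates \sigma^2, while dividing by n - 2 is approximately unbiased.

    set.seed(3)
    sigma2 <- 4
    n <- 20
    one.fit <- function() {
      x <- rnorm(n)
      Y <- 1 + 2 * x + rnorm(n, sd = sqrt(sigma2))
      Q <- sum(resid(lm(Y ~ x))^2)
      c(mle = Q / n, unbiased = Q / (n - 2))       # k + 1 = 2 fitted parameters
    }
    rowMeans(replicate(5000, one.fit()))           # MLE well below 4; unbiased close to 4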
The key assumptions:
E(Y) is linear in the predictors.
The "errors" are i.i.d. normal.
The error variance is fixed (homoscedasticity); it does not change with the predictors.
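A quick, informal way to eyeball these assumptions in R (my own suggestion, not part of the handout) is the default diagnostic plots of a fitted lm object:

    fit <- lm(Y ~ x1 + x2)        # reusing the simulated data from the sketch above
    par(mfrow = c(2, 2))
    plot(fit)                     # residuals vs fitted (linearity, constant variance),
                                  # normal Q-Q (normality), scale-location, leverage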
Distribution of the estimators
For the coefficients: we first concentrate on the case k = 1 (just one predictor in addition to the intercept).
Let
    s_x^2 = \sum_{i=1}^{n} (x_i - \bar x)^2 .
Then
    \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x) Y_i}{s_x^2} .
Conditional on the x's,
    E(\hat\beta_1) = \beta_1 ,
so the estimator is unbiased, and
    \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{s_x^2} .
In summary,
    \hat\beta_1 \sim N\!\left( \beta_1, \; \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2} \right) .
You can check whether this makes sense. First, the dimensions of both sides are (units of Y per unit of x)^2. Second, as the variance of Y increases, the estimate gets less precise. Third, as the variance of the x's increases, the estimate gets MORE precise.
So, how can you increase the precision of the estimate?
What are the STUDY DESIGN IMPLICATIONS?
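One answer, illustrated with a small simulation (my own sketch, not from the handout): for a fixed sample size and error variance, spreading the x values out increases \sum (x_i - \bar x)^2 and so shrinks the sampling variance of \hat\beta_1.

    set.seed(4)
    sigma <- 2
    n <- 30
    sim.slope <- function(x) {
      Y <- 1 + 0.5 * x + rnorm(n, sd = sigma)
      coef(lm(Y ~ x))[2]                           # fitted slope
    }
    x.narrow <- seq(4, 6, length.out = n)          # x's bunched together
    x.wide   <- seq(0, 10, length.out = n)         # x's spread out
    var(replicate(2000, sim.slope(x.narrow)))      # large sampling variance
    var(replicate(2000, sim.slope(x.wide)))        # much smaller
    sigma^2 / sum((x.wide - mean(x.wide))^2)       # theoretical value, for comparison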
Multivariate view
Again, it's much simpler in matrix notation, and more general. Let the vector of parameter estimates be
    \hat\beta = (X^\top X)^{-1} X^\top Y .
Then
    \hat\beta \sim N\!\left( \beta, \; \sigma^2 (X^\top X)^{-1} \right) .
This is the basis for the Wald test produced by lots of statistical software. For testing the null hypothesis that \beta_j = 0, the approximate 95% confidence interval is
    \hat\beta_j \pm 1.96 \, \mathrm{SE}(\hat\beta_j) , \qquad \text{where } \mathrm{SE}(\hat\beta_j) = \hat\sigma \sqrt{ \left[ (X^\top X)^{-1} \right]_{jj} } .
The subscript jj means the entry in the inverse matrix for the j-th column and j-th row. The P-value for the two-sided test is
    P = 2 \left( 1 - \Phi\!\left( |\hat\beta_j| \, / \, \mathrm{SE}(\hat\beta_j) \right) \right) ,
where \Phi is the standard normal c.d.f.
But one can do better. This pretends that the estimate \hat\sigma of \sigma is correct. Of course it is not. We replace the standard normal distribution with the t distribution on n - (k+1) degrees of freedom, to account for the k + 1 parameters fitted when we estimate \sigma.
 See Dalgaard Chapter 6, Section 6.1-6.2, regression-Dalgaard.Rmd, GAGurine.R.
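A sketch of the Wald calculation by hand (my own illustration, reusing the simulated x1, x2, Y, and X01 from above), compared with R's t-based output:

    fit <- lm(Y ~ x1 + x2)
    s   <- summary(fit)$sigma                      # estimate of sigma (unbiased version)
    se  <- s * sqrt(diag(solve(t(X01) %*% X01)))   # SE = sigma.hat * sqrt([(X'X)^-1]_jj)
    b   <- coef(fit)
    cbind(estimate = b, se = se,
          lower = b - 1.96 * se, upper = b + 1.96 * se,
          p.normal = 2 * (1 - pnorm(abs(b / se))))
    summary(fit)$coefficients                      # same SEs, but P-values from the t distribution
    confint(fit)                                   # likewise t-based intervals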
For the variance estimate:
If we knew the regression line exactly, we'd have
    S_Y^2 / \sigma^2 \sim \chi^2_n , \qquad \text{where } S_Y^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2 .
Then we could get a confidence interval for \sigma^2 easily. Just set this equal to the lower and upper 2.5% quantiles of \chi^2_n, and solve for \sigma^2:
    Lower bound for \sigma^2 = S_Y^2 / qchisq(0.975, n)
    Upper bound for \sigma^2 = S_Y^2 / qchisq(0.025, n)
But we have to estimate the regression parameters, so we reduce the degrees of freedom: with S_Y^2 now the sum of squared residuals around the fitted line, S_Y^2 / \sigma^2 \sim \chi^2_{n-(k+1)}, and we get the confidence interval
    Lower bound for \sigma^2 = S_Y^2 / qchisq(0.975, n - (k+1))
    Upper bound for \sigma^2 = S_Y^2 / qchisq(0.025, n - (k+1))
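In R, continuing the sketch above (my own illustration, reusing the simulated data), the chi-square interval for \sigma^2 can be computed directly from a fitted model:

    fit <- lm(Y ~ x1 + x2)
    SS  <- sum(resid(fit)^2)                       # S_Y^2: residual sum of squares
    df  <- df.residual(fit)                        # n - (k + 1)
    c(lower = SS / qchisq(0.975, df),
      upper = SS / qchisq(0.025, df))              # 95% confidence interval for sigma^2
    summary(fit)$sigma^2                           # point estimate S^2 = SS / df, for comparison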