1. Given a set of data (x_i, y_i), 1 ≤ i ≤ N, we seek to find a representation of the data ŷ_i such that
$$ S = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $$
is minimized. Geometrically this minimizes the sum of the squares of the distances between the observed points and the proposed model points ŷ_i.
2. Developed in the fields of Astronomy and Geodesy:
a) Combining different observations as the best estimate of the true value, with errors
decreasing on aggregation - first expressed by Roger Cotes in 1722.
b) Method of averages - combining different observations under the same
conditions. Used by Tobias Mayer while studying librations of the moon
in 1750 and by Laplace in explaining the differences in motion of Jupiter
and Saturn in 1788.
c) Combination of different observations taken under different conditions - Roger Joseph Boscovich in 1757 and Laplace in 1788.
d) Development of a criterion that can be evaluated to determine when the
solution with the minimum error has been achieved - Laplace.
e) First clear and precise explanation by Legendre in 1805, though in 1809 Gauss
published a method for calculating the orbits of celestial bodies and claimed to
have known about the method since 1795. However, Gauss went beyond Legendre and
invented the Gaussian or normal distribution. He used the method to predict the
future location of the newly discovered asteroid Ceres.
f) In 1810, after reading Gauss's work and having proved the central limit theorem,
Laplace gave a large-sample justification for the method of least squares and the
normal distribution.
g) In 1822, Gauss showed that the least squares approach to regression analysis is
optimal in the sense that in a linear model where the errors have mean zero, are
uncorrelated, and have equal variances, the best linear unbiased estimator of the
coefficients is the least squares estimator - the Gauss-Markov theorem.
3. Linear least squares: for the data above, fit a linear model y = a + bx. Typically,
we have to specify an error distribution: y = a + bx + ǫ, where ǫ ∼ N(0, σ²). Here,
x is the independent variable and y is the dependent or response variable.
a) Minimize the error
$$ S = \sum_{i=1}^{N} \left( y_i - (a + b x_i) \right)^2. $$
b) So find the a, b such that
$$ \frac{\partial S}{\partial a} = 0, \qquad \frac{\partial S}{\partial b} = 0. $$
c) Find the theoretical values for the least squares estimates for â, b̂.
d) Define
$$ SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, $$
$$ \hat{b} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad
   \hat{a} = \bar{y} - \hat{b}\bar{x} = \frac{\sum_{i=1}^{N} y_i}{N} - \hat{b}\,\frac{\sum_{i=1}^{N} x_i}{N}. $$
Note some other definitions:
$$ SS_{xx} = \sum_{i=1}^{N} (x_i - \bar{x})^2 = \sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}, $$
$$ SS_{xy} = \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{N} x_i y_i - \frac{\left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)}{N}, $$
$$ SS_{yy} = \sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} y_i^2 - \frac{\left(\sum_{i=1}^{N} y_i\right)^2}{N}. $$
Note,
$$ \hat{b} = \frac{SS_{xy}}{SS_{xx}}, \qquad
   \mathrm{cov}(x, y) = \frac{SS_{xy}}{n - 1}, \qquad
   s_x^2 = \frac{SS_{xx}}{n - 1}, \qquad
   \hat{b} = \frac{\mathrm{cov}(x, y)}{s_x^2}. $$
The standard error of the regression estimates σ and is given by
$$ S_{YX} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}} = \sqrt{\frac{SSE}{n - 2}}. $$
The standard error on the estimate of b, b̂, is
$$ s_{\hat{b}} = \sqrt{\frac{\frac{1}{n-2} \sum_{i=1}^{n} \hat{\epsilon}_i^{\,2}}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, $$
where ǫ̂_i = y_i − ŷ_i. The standard error on the intercept a, â, is
$$ s_{\hat{a}} = s_{\hat{b}} \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}. $$
(These estimates are illustrated in the short numerical sketch at the end of this list.)
e) Here you are estimating a, b with â, b̂ - a case of parameter estimation.
Note we assume the errors for each observation are independent of each other
and have constant variance (homoskedasticity).
f) The residual sum of squares is
$$ SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. $$
The regression sum of squares (SS_R) and the total sum of squares (SS_T) are
$$ SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2. $$
g) Also (in some cases - simple linear regression)
$$ SS_T = SS_R + SS_{res}. $$
h) The coefficient of determination is defined as
$$ R^2 = 1 - \frac{SS_{res}}{SS_T}. $$
Therefore in many cases a value of R² close to 1 means that the regression
does a good job of explaining the variance in the data. In linear least squares
with an intercept term, R² equals the square of the Pearson correlation
coefficient between the observed and predicted values of the dependent variable.
i) R² gives some information about the goodness of fit of a model. In regression,
it is a measure of how well the regression line approximates the real data
points. R² = 1 suggests the regression line fits the data perfectly.
j) In many cases R² increases as we increase the number of variables in the
model. Obviously N data points can be explained with a model with N parameters.
Hence we have the adjusted R². This is almost the same as before but penalizes
the statistic as extra variables are added. The adjusted R² is denoted by R̄² and is
$$ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}, $$
where p is the total number of regressors or independent variables in the
model. R̄² can be negative. An unbiased estimate of σ² is
$$ \hat{\sigma}^2 = \frac{SS_{res}}{n - 2} = MSE_{res}. $$
k) So far we have not made any use of the normality assumption for the error
ǫ. Hence up to now we only need homoskedasticity, or constant variance.
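To make the formulas in part d) concrete, here is a minimal Python sketch (illustrative, not part of the original notes) that computes â, b̂, their standard errors, the standard error of the regression, and R² directly from the defining sums; the sample data at the bottom are assumed purely for demonstration.

```python
import numpy as np

def simple_least_squares(x, y):
    """Least squares fit of y = a + b*x using the SSxx/SSxy sums."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()

    SSxx = np.sum((x - xbar) ** 2)
    SSxy = np.sum((x - xbar) * (y - ybar))
    SSyy = np.sum((y - ybar) ** 2)

    b_hat = SSxy / SSxx                      # slope
    a_hat = ybar - b_hat * xbar              # intercept

    y_hat = a_hat + b_hat * x
    SSE = np.sum((y - y_hat) ** 2)           # residual sum of squares

    s_yx = np.sqrt(SSE / (n - 2))            # standard error of the regression
    s_b = s_yx / np.sqrt(SSxx)               # standard error of b_hat
    s_a = s_b * np.sqrt(np.sum(x ** 2) / n)  # standard error of a_hat

    r2 = 1.0 - SSE / SSyy                    # coefficient of determination
    return a_hat, b_hat, s_a, s_b, s_yx, r2

# Illustrative data (assumed, not from the notes)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(simple_least_squares(x, y))
```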
8. Using the normality assumption, we have
$$ t = \frac{\hat{b} - b}{s_{\hat{b}}} \sim t_{n-2}, $$
which is a Student's t distribution with n − 2 degrees of freedom. We can
construct a confidence interval for the slope b as
$$ b \in [\hat{b} - s_{\hat{b}}\, t^*_{n-2},\; \hat{b} + s_{\hat{b}}\, t^*_{n-2}] $$
at confidence level (1 − γ), where t*_{n−2} is the (1 − γ/2) quantile of the t_{n−2}
distribution. Similarly, a confidence interval for the intercept a is
$$ a \in [\hat{a} - s_{\hat{a}}\, t^*_{n-2},\; \hat{a} + s_{\hat{a}}\, t^*_{n-2}] $$
at confidence level (1 − γ).
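A minimal sketch of this confidence interval construction, assuming the slope estimate b_hat and its standard error s_b have already been computed (for example with the sketch after item 3); scipy.stats.t.ppf supplies the t*_{n−2} quantile, and the numbers in the example call are illustrative.

```python
from scipy import stats

def slope_confidence_interval(b_hat, s_b, n, gamma=0.05):
    """(1 - gamma) confidence interval for the slope b."""
    t_star = stats.t.ppf(1.0 - gamma / 2.0, df=n - 2)  # (1 - gamma/2) quantile of t_{n-2}
    return b_hat - s_b * t_star, b_hat + s_b * t_star

# Example (illustrative numbers): 95% interval for a slope estimate
print(slope_confidence_interval(b_hat=1.97, s_b=0.08, n=5))
```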
12. In the more general case with more than 2 parameters, we specify the model as
$$ y_i = \sum_{j=1}^{n} X_{ij} \beta_j, \qquad (i = 1, \ldots, m). $$
That is, we have m linear equations in n unknown coefficients β_1, β_2, ..., β_n. Written
in matrix form this is
$$ X\beta = y. $$
Minimizing the sum of squares leads to the normal equations for the least
squares estimate β̂,
$$ (X^T X)\hat{\beta} = X^T y. $$
The solution is
$$ \hat{\beta} = (X^T X)^{-1} X^T y. $$
There is a large literature on the solution of these equations. This is sometimes
referred to as the general linear model. Maximum likelihood estimation
with normally distributed errors is equivalent to least squares.
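As an illustration of the normal equations above, a short NumPy sketch follows; the design matrix and response are assumed for demonstration, and np.linalg.lstsq is shown as the numerically preferable alternative to forming (X^T X)^{-1} explicitly.

```python
import numpy as np

# Illustrative design matrix X (m observations, n parameters) and response y
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(10), rng.uniform(0, 1, 10)])  # intercept + one regressor
y = X @ np.array([2.0, 3.0]) + rng.normal(0, 0.1, 10)

# Solve the normal equations (X^T X) beta_hat = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically more stable alternative to inverting X^T X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal, beta_lstsq)
```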
13. In cases where the variance differs from observation to observation, the
residual sum of squares can be expressed as
$$ S = \sum_{i=1}^{n} W_{ii}\, r_i^2, $$
where
$$ W_{ii} = \frac{1}{\sigma_i^2} $$
and r_i = y_i − ŷ_i. If the weight matrix W_{ij} is diagonal (observational errors are
uncorrelated), then the normal equations become
$$ (X^T W X)\hat{\beta} = X^T W y. $$
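A sketch of these weighted normal equations, assuming uncorrelated errors so that W is diagonal with W_ii = 1/σ_i²; the data, uncertainties, and straight-line design matrix are illustrative assumptions.

```python
import numpy as np

def weighted_least_squares(X, y, sigma):
    """Solve (X^T W X) beta = X^T W y with W = diag(1/sigma_i^2)."""
    W = np.diag(1.0 / np.asarray(sigma) ** 2)
    XtW = X.T @ W
    return np.linalg.solve(XtW @ X, XtW @ y)

# Illustrative straight-line fit with per-point uncertainties
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
sigma = np.array([0.1, 0.2, 0.1, 0.3])
X = np.column_stack([np.ones_like(x), x])   # columns: intercept, slope
print(weighted_least_squares(X, y, sigma))  # estimates of a, b
```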
14. If we call the uncertainty on a given observation σ_i, then the method of
least squares amounts to minimizing
$$ \chi^2 = \sum_i \left[ \frac{y_i - a - b x_i}{\sigma_i} \right]^2. $$
Under appropriate assumptions, this χ² is distributed as a χ² variable with
n − 2 degrees of freedom (what assumptions are these?). So in some cases least
squares is equivalent to minimizing χ². In these situations, we can use the
value of χ² as a measure of goodness of fit. In higher dimensional problems,
one can plot "contours of constant χ²".
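The sketch below evaluates this χ² for a straight-line fit and, under the assumptions referred to above, converts it to a tail probability with n − 2 degrees of freedom using scipy.stats.chi2.sf; the data, uncertainties, and fitted parameters are illustrative.

```python
import numpy as np
from scipy import stats

def chi2_of_fit(x, y, sigma, a, b):
    """chi^2 = sum_i [(y_i - a - b*x_i) / sigma_i]^2 for a fitted line."""
    residuals = (y - (a + b * x)) / sigma
    return np.sum(residuals ** 2)

# Illustrative data, uncertainties, and fitted parameters
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 7.8, 10.2])
sigma = np.full_like(x, 0.2)
a, b = 0.1, 2.0

chi2 = chi2_of_fit(x, y, sigma, a, b)
dof = len(x) - 2                       # two fitted parameters
p_value = stats.chi2.sf(chi2, dof)     # probability of a chi^2 this large or larger
print(chi2, p_value)
```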
15. Least Squares fit to a Polynomial
a) Suppose we want to fit
$$ y(x) = a_1 + a_2 x + a_3 x^2 + \cdots + a_m x^{m-1}, $$
or more generally
$$ y(x) = \sum_{k=1}^{m} a_k f_k(x), $$
where the functions f_k(x) could be powers of x but do not involve the
parameters a_i. Under normality assumptions, we have
$$ P(a_1, \ldots, a_m) = \prod_i \left( \frac{1}{\sigma_i \sqrt{2\pi}} \right)
   \exp\left[ -\frac{1}{2} \sum_i \frac{1}{\sigma_i^2} \Big[ y_i - \sum_{k=1}^{m} a_k f_k(x_i) \Big]^2 \right], $$
so that
$$ \chi^2 = \sum_i \frac{1}{\sigma_i^2} \Big[ y_i - \sum_{k=1}^{m} a_k f_k(x_i) \Big]^2, $$
and the method of least squares amounts to minimizing this expression. Problem:
show that under the normal distribution, the maximum likelihood estimator is the
least squares estimator. Consider a model
$$ y_i = a_1 + a_2 x_i + a_3 x_i^2, $$
for 2l measurements (x_i, y_i), i = 1, ..., 2l, with each measurement having
standard deviation σ_i, and the observations being normally distributed.
Formulate the least squares equations and develop a matrix equation for
the unknown parameters a_1, a_2, a_3.
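For the problem just stated, the following sketch shows one way to set up and solve the matrix equation for the parameters: the design matrix has columns f_k(x) = x^(k−1), and weighting each row by 1/σ_i makes ordinary least squares minimize the χ² above. The quadratic test data are assumed for illustration.

```python
import numpy as np

def polynomial_chi2_fit(x, y, sigma, degree=2):
    """Minimize chi^2 for y(x) = a_1 + a_2 x + ... by solving the weighted least squares problem."""
    x, y, sigma = map(np.asarray, (x, y, sigma))
    # Design matrix with basis functions f_k(x) = x^(k-1)
    A = np.vander(x, degree + 1, increasing=True)
    # Divide each row by sigma_i so ordinary least squares minimizes chi^2
    Aw = A / sigma[:, None]
    yw = y / sigma
    coeffs, *_ = np.linalg.lstsq(Aw, yw, rcond=None)
    return coeffs  # [a_1, a_2, ..., a_{degree+1}]

# Illustrative quadratic data with constant uncertainty
x = np.linspace(0.0, 4.0, 9)
y = 1.0 + 0.5 * x + 0.25 * x ** 2
sigma = np.full_like(x, 0.1)
print(polynomial_chi2_fit(x, y, sigma, degree=2))
```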
16. Examples:
a) Look at the following data representing the potential difference as a function of position along a current-carrying wire: (position(cm), voltage(V))
(10, 0.37), (20.0, 0.58), (30.0, 0.83), (40.0, 1.15), (50.0, 1.36),
(60.0, 1.62), (70.0, 1.90), (80.0, 2.18), (90.0, 2.45).
Is the voltage linearly related to the position in the wire?
b) Number of counts detected in 7.5 min. intervals as a function of distance
from source: (distance(m), Counts),
(0.2, 901), (0.25, 652), (0.30, 443), (0.35, 339), (0.40, 283),
(0.45, 281), (0.50, 240), (0.60, 220), (0.75, 180), (1.0, 154).
Is the number of counts linearly related to the inverse square of the distance?
d) Derive a formula for making a linear fit to data with an intercept at the
origin so that y = bx. Apply your method to fit a straight line through
the origin to the following coordinate pairs, assuming uniform uncertainties
(σ_i = 1.5) in y_i. Find χ² of the fit and the uncertainty in b (see the sketch after this list).
e) A student measures the temperature (T) of water in an insulated flask at
times (t) separated by 1 minute and obtains (t(min), T(°C)),
(0, 98.51), (1, 98.50), (2, 98.50), (3, 98.49), (4, 98.52),
(5, 98.49), (6, 98.52), (7, 98.45), (8, 98.47).
f) Calculate the mean temperature and its standard error.
g) Plot a graph of temperature vs. time and make a least squares fit of a straight
line to the data. Is there a statistically significant slope to the graph?
h) The intercept is not equal to the mean value of the temperature you calculated. Now shift the time coordinate so that the mean time is 0. Refit
the data with the new values of t. Is the intercept now identical to the
mean value of T?
i) Show that, if the mean value of x is equal to zero, then the intercept â
calculated from least squares is identical to the mean value of y.
j) Example: Cepheid Period-Luminosity Relation.
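As referenced in example d), here is a sketch of the fit through the origin, y = bx: minimizing the sum of squares with uniform uncertainty σ gives b̂ = Σ x_i y_i / Σ x_i², with uncertainty σ_b = σ/√(Σ x_i²). The coordinate pairs below are placeholders, since the example's data are not reproduced in this transcript.

```python
import numpy as np

def fit_through_origin(x, y, sigma=1.5):
    """Fit y = b*x with uniform uncertainty sigma on each y_i."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b_hat = np.sum(x * y) / np.sum(x ** 2)          # least squares slope
    s_b = sigma / np.sqrt(np.sum(x ** 2))           # uncertainty on b_hat
    chi2 = np.sum(((y - b_hat * x) / sigma) ** 2)   # goodness of fit
    return b_hat, s_b, chi2

# Placeholder data (the example's coordinate pairs are not listed in the transcript)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.3, 4.1, 6.2, 7.9])
print(fit_through_origin(x, y, sigma=1.5))
```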