Lecture 9
Sections 3.3
Objectives:
• Bivariate and Multivariate Data and Distributions
− Fitting a Straight Line
− Assessing the fit
− Residuals
Fitting a Line to Bivariate Data
We first draw a scatter plot of the two quantitative variables for a visual
inspection of the relationship between them. If the scatter plot shows a
linear relationship, we measure its strength by computing a correlation
coefficient. One may then want to fit a line on the scatter plot to
summarize the overall pattern. This is done using regression analysis.
Fitting a straight line to data
(Regression line)
We want to draw a line
1) to summarize the relationship between x and y,
2) to describe how a response variable y changes as an explanatory variable x changes, and
3) to predict the value of y for a given value of x.
Least Squares Regression Line
The least-squares regression line is the unique line such that the sum
of the squared vertical (y) distances between the data points and the
line is as small as possible.
The vertical distances between the points and the line are squared so that
all are positive values; this way the distances can be properly added
(Pythagoras).
Least Squares Regression Line
The least-squares line is the regression line that minimizes the sum of
squared vertical distances between the observed points and the line. That
is, we find the a and b that minimize
$$Q(a, b) = \sum_i \bigl( y_i - (a + b x_i) \bigr)^2 .$$
To find the minimum of this function, take the derivatives of Q with respect
to a and b and set them equal to zero. These two equations (called the
normal equations) give
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \qquad \text{(estimated slope)}$$
$$a = \bar{y} - b\,\bar{x} \qquad \text{(estimated y-intercept)}$$
The equation of the least-squares line is often written as
$$\hat{y} = a + b x ,$$
where the hat above y emphasizes that ŷ is a prediction of y that results
from substituting any particular x value into the equation.
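As a quick illustration of these formulas, here is a minimal Python sketch that computes the slope and intercept directly; the small x and y arrays are made up for illustration and are not data from the lecture.

```python
# Minimal sketch of the least-squares formulas above.
# The x and y values are made up for illustration only.
x = [2, 4, 5, 7, 8]
y = [3, 7, 8, 12, 13]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# S_xy = sum of (x_i - x_bar)(y_i - y_bar), S_xx = sum of (x_i - x_bar)^2
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
S_xx = sum((xi - x_bar) ** 2 for xi in x)

b = S_xy / S_xx        # estimated slope
a = y_bar - b * x_bar  # estimated y-intercept

print(f"y-hat = {a:.3f} + {b:.3f} x")
print("prediction at x = 6:", a + b * 6)
```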
Example
A sample of pizza restaurants located near college campuses.
a. Find the least-squares line.
b. Predict the quarterly sales when the student populations are 18,000
and 30,000, respectively.
Extrapolation
Extrapolation is the use of a regression line for predictions outside the
range of x values used to obtain the line. This is not recommended, because
the linear pattern may not continue beyond the observed data.
Regression and correlation coefficient
The estimated slope b and the correlation coefficient r are related by
$$b = \frac{S_{xy}}{S_{xx}} = r \sqrt{\frac{S_{yy}}{S_{xx}}} .$$
The coefficient of determination, denoted by r², is the proportion of variation
in the observed y values that can be explained by the regression line:
$$r^2 = \frac{S_{xy}^2}{S_{xx}\,S_{yy}} .$$
Note that
1) 0 ≤ r² ≤ 1,
2) the closer this percentage is to 100%, the more successful the regression
line is in explaining the variation in y,
3) (correlation coefficient)² = coefficient of determination.
Example. Revisit the Pizza sales example. What percent of the variation in the
quarterly sales is explained by the regression line?
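The pizza data themselves are not reproduced in this transcript, so the sketch below uses placeholder arrays; it only illustrates how r² would be computed with the S-notation above.

```python
# Sketch: coefficient of determination r^2 = S_xy^2 / (S_xx * S_yy).
# The x and y arrays below are placeholders, not the pizza data.
x = [3, 5, 7, 9, 11, 13]
y = [40, 55, 61, 70, 84, 95]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_yy = sum((yi - y_bar) ** 2 for yi in y)
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r_squared = S_xy ** 2 / (S_xx * S_yy)
print(f"r^2 = {r_squared:.3f}")  # proportion of variation in y explained by the line
```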
Residuals
Predicted value (or fitted value) = $\hat{y}_i$
Residual = observed value − predicted value = $y_i - \hat{y}_i$
Residual Plot
A residual plot is a scatter plot of the regression residuals against the
predicted value or the explanatory variable. Residual plots help us
assess the fit of a regression line.
If the residuals are scattered randomly around 0, chances are your data fit a
linear model, are roughly normally distributed, and contain no serious outliers.
Residual Plot
• Residuals are randomly scattered around 0: good.
• A curved pattern means the relationship you are looking at is not linear.
• A change in variability across the plot is a warning sign. You need to find
out why it occurs, and remember that predictions made in areas of larger
variability will not be as good.
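A minimal matplotlib sketch of a residual plot (residuals against predicted values); the data are made up for illustration, and the fitted line is computed with the formulas from earlier in the lecture.

```python
# Sketch of a residual plot: residuals vs. predicted values.
# The x and y values are made up for illustration only.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

fitted = [a + b * xi for xi in x]                   # predicted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # observed - predicted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")   # reference line at residual = 0
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.title("Residual plot")
plt.show()
```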
Example
Consider the following data on x=height (in) and y=average weight (lb)
for American females aged 30-39 (taken from The World Almanac and
Book of Facts).
x: 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
y: 113 115 118 121 124 128 131 134 137 141 145 150 153 159 164
Draw the scatter plot and the residual plots. Find the least-squares
regression line.
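A sketch of how this example could be worked through in Python, using the 15 data pairs listed above; the residual plot can then be drawn as in the earlier residual-plot sketch.

```python
# Fit the least-squares line to the height/weight data from this example
# and compute the residuals.
x = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]   # height (in)
y = [113, 115, 118, 121, 124, 128, 131, 134, 137, 141, 145, 150,
     153, 159, 164]                                                 # weight (lb)
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
S_xx = sum((xi - x_bar) ** 2 for xi in x)

b = S_xy / S_xx        # estimated slope (lb per inch)
a = y_bar - b * x_bar  # estimated y-intercept
print(f"least-squares line: y-hat = {a:.2f} + {b:.2f} x")

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print("residuals:", [round(r, 2) for r in residuals])
```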