Lecture 9: Section 3.3

Objectives:
• Bivariate and Multivariate Data and Distributions
  − Fitting a Straight Line
  − Assessing the Fit
  − Residuals

Fitting a Line to Bivariate Data

We first draw a scatter plot of two quantitative variables to inspect the relationship between them visually. If the scatter plot shows a linear relationship, we then measure its strength by computing a correlation coefficient. Next, we may want to fit a line to the scatter plot to summarize the overall pattern. This is done using regression analysis.

Fitting a Straight Line to Data (Regression Line)

We want to draw a line
1) to summarize the relationship between x and y,
2) to describe how a response variable y changes as an explanatory variable x changes,
3) to predict the value of y for a given value of x.

Least Squares Regression Line

The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is as small as possible. Distances between the points and the line are squared so that all are positive values. This is done so that distances can be properly added (Pythagoras).

That is, we find the a and b that minimize

$$Q(a, b) = \sum_i \bigl(y_i - (a + b x_i)\bigr)^2$$

To find the minimum of this function, take the partial derivatives of Q with respect to a and b and set them equal to zero. Those two equations (called the normal equations) give

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \quad \text{(estimated slope)}$$

$$a = \bar{y} - b\,\bar{x} \quad \text{(estimated y-intercept)}$$

The equation of the least squares line is often written as

$$\hat{y} = a + b x$$

where the "hat" above y emphasizes that $\hat{y}$ is a prediction of y that results from substituting any particular x value into the equation.

Example

A sample of pizza restaurants located near college campuses.
a. Find the least squares line.
b. Predict the quarterly sales when the student populations are 18,000 and 30,000, respectively.

Extrapolation

Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line. This is not recommended.

[Figure: LS regression line]

Regression and Correlation Coefficient

$$b = \frac{S_{xy}}{S_{xx}} = r \sqrt{\frac{S_{yy}}{S_{xx}}}$$

The coefficient of determination, denoted by $r^2$, is the proportion of variation in the observed y values that can be explained by the regression line.

Note that
1) $0 \le r^2 \le 1$,
2) the closer this percentage is to 100%, the more successful the relationship is in explaining variation in y,
3) (correlation coefficient)$^2$ = coefficient of determination.

Example

Revisit the pizza sales example. What percent of the variation in the quarterly sales is explained by the regression line?

Residuals

Predicted value (or fitted value) = $\hat{y}_i$
Residual = observed value − predicted value = $y_i - \hat{y}_i$

Residual Plot

A residual plot is a scatter plot of the regression residuals against the predicted value or the explanatory variable. Residual plots help us assess the fit of a regression line. If the residuals are scattered randomly around 0, chances are that your data fit a linear model, were normally distributed, and had no outliers.

• Residuals randomly scattered: good!
• Curved pattern: the relationship you are looking at is not linear.
• A change in variability across the plot is a warning sign. You need to find out why it occurs, and remember that predictions made in areas of larger variability will not be as good.

Example

Consider the following data on x = height (in) and y = average weight (lb) for American females aged 30-39 (taken from The World Almanac and Book of Facts).

x:  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
y: 113 115 118 121 124 128 131 134 137 141 145 150 153 159 164

Draw the scatter plot and residual plots. Find the least squares regression line.
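
The height and weight example above can be worked through numerically. The following is a minimal sketch in plain Python, not part of the original lecture: the function name `least_squares_fit` and the variable names are illustrative choices. The slope and intercept follow the lecture's formulas b = S_xy / S_xx and a = ȳ − b·x̄. The expression r² = S_xy² / (S_xx · S_yy) is an assumption derived from the lecture's relation between b and r.

```python
# Height (in) and average weight (lb) data from the lecture example.
x = list(range(58, 73))
y = [113, 115, 118, 121, 124, 128, 131, 134, 137, 141, 145, 150, 153, 159, 164]


def least_squares_fit(xs, ys):
    """Return (a, b, r2) for the least-squares line y-hat = a + b*x.

    Illustrative helper: uses b = S_xy / S_xx and a = y_bar - b * x_bar.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    s_yy = sum((yi - y_bar) ** 2 for yi in ys)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    b = s_xy / s_xx                     # estimated slope
    a = y_bar - b * x_bar               # estimated y-intercept
    r2 = s_xy ** 2 / (s_xx * s_yy)      # coefficient of determination
    return a, b, r2


a, b, r2 = least_squares_fit(x, y)

# Fitted values and residuals (observed minus predicted).
fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"y-hat = {a:.3f} + {b:.3f} x, r^2 = {r2:.4f}")
for xi, ei in zip(x, residuals):
    print(f"x = {xi}: residual = {ei:+.2f}")
```

For these data r² is close to 1, yet the residuals tend to be positive at both ends of the height range and negative in the middle. This is the curved pattern that a residual plot is meant to reveal. Plotting `residuals` against `x` with any plotting library gives the residual plot requested in the example.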