STATISTICS
Chapter 5: Regression
MVS 250: V. Katch

Regression

Definitions
- Regression Equation: given a collection of paired data, the regression equation $\hat{y} = mx + b$ algebraically describes the relationship between the two variables.
- Regression Line (line of best fit, or least-squares line): the graph of the regression equation.

Regression Line Plotted on Scatter Plot

Regression Line

Two different lines, one to predict X and one to predict Y.

The Regression Equation
- x is the independent variable (predictor variable).
- $\hat{y}$ is the dependent variable (response variable).
- $\hat{y} = mx + b$, where m is the slope and b is the y-intercept.

Assumptions
1. We are investigating only linear relationships.
2. For each x value, y is a random variable having a normal (bell-shaped) distribution. All of these y distributions have the same variance. Also, for a given value of x, the distribution of y-values has a mean that lies on the regression line. (Results are not seriously affected if departures from normal distributions and equal variances are not too extreme.)

Formula for y-intercept and slope

Formula 1 (y-intercept):
$$b = \frac{(\sum y/n)(\sum x^2/n) - (\sum x/n)(\sum xy/n)}{(\sum x^2/n) - (\sum x/n)^2}$$

Formula 2 (slope):
$$m = \frac{(\sum xy/n) - (\sum x/n)(\sum y/n)}{(\sum x^2/n) - (\sum x/n)^2}$$

The denominator in both formulas is $SD_x^2$.

If you find r, then

Formula 3: slope $= m = r \, s_y / s_x$, where $s_y$ is the standard deviation of the y-values and $s_x$ is the standard deviation of the x-values.

Formula 4: intercept $= b = \bar{y} - m\bar{x}$, where $\bar{y}$ is the mean of the y-values, $\bar{x}$ is the mean of the x-values, and m is the slope.

Rounding the y-intercept and the slope
- Round to three significant digits.
- If you use Formulas 1, 2, and 3, try not to round intermediate values.

The regression line fits the sample points best.

Residuals and the Least-Squares Property

Definitions

Residual: for a sample of paired (x, y) data, the difference $(y - \hat{y})$ between an observed sample y-value and the value of $\hat{y}$, which is the value of y that is predicted by using the regression equation.
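As a rough check on Formulas 1 through 4 and the residual definition above, the sketch below computes the slope and intercept both ways and then lists the residuals. The function names and the five data points are hypothetical and chosen only for illustration; they do not come from the slides.

```python
from math import sqrt


def fit_from_sums(xs, ys):
    """Slope m and intercept b via Formulas 1 and 2 (averaged sums)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    var_x = mean_x2 - mean_x ** 2  # SD_x^2, the shared denominator
    b = (mean_y * mean_x2 - mean_x * mean_xy) / var_x  # Formula 1
    m = (mean_xy - mean_x * mean_y) / var_x            # Formula 2
    return m, b


def fit_from_r(xs, ys):
    """Slope m and intercept b via Formulas 3 and 4 (r, s_x, s_y, means)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)      # linear correlation coefficient
    s_x = sqrt(sxx / (n - 1))      # sample standard deviation of x
    s_y = sqrt(syy / (n - 1))      # sample standard deviation of y
    m = r * s_y / s_x              # Formula 3
    b = mean_y - m * mean_x        # Formula 4
    return m, b


def residuals(xs, ys, m, b):
    """Observed y minus predicted y-hat for each sample point."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]


# Hypothetical paired data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, b = fit_from_sums(xs, ys)
m_r, b_r = fit_from_r(xs, ys)
res = residuals(xs, ys, m, b)
print(round(m, 3), round(b, 3))  # 0.6 2.2
```

Both routes should agree up to floating-point error, and the residuals of a least-squares line sum to zero, which makes a convenient sanity check.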
Least-Squares Property: a straight line satisfies this property if the sum of the squares of the residuals is the smallest sum possible.

Residuals and the Least-Squares Property

Regression equation: $\hat{y} = 5 + 4x$

x          1     2     4     5
y          4    24     8    32
Residual  -5    11   -13     7

Predictions

In predicting a value of y based on some given value of x ...
1. If there is not a significant linear correlation, the best predicted y-value is $\bar{y}$.
2. If there is a significant linear correlation, the best predicted y-value is found by substituting the x-value into the regression equation.

Predicting the Value of a Variable
1. Calculate the value of r and test the hypothesis that $\rho = 0$.
2. Is there a significant linear correlation?
   - Yes: use the regression equation to make predictions. Substitute the given value in the regression equation.
   - No: given any value of one variable, the best predicted value of the other variable is its sample mean.

Guidelines for Using the Regression Equation
1. If there is no significant linear correlation, don't use the regression equation to make predictions.
2. When using the regression equation for predictions, stay within the scope of the available sample data.
3. A regression equation based on old data is not necessarily valid now.
4. Don't make predictions about a population that is different from the population from which the sample data was drawn.

Example

X    1    2    3    4    5   10   15   18   20   30
Y   34   36   37   39   41   50   59   64   68   86

Compute r, the slope, the intercept, and the regression equation. What is this equation used for?

What is the best predicted size of a household that discards 0.50 lb of plastic?

Data from the Garbage Project

x  Plastic (lb)    0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y  Household          2      3      3      6      4      2      1      5
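The prediction procedure above (test r first, then use either the regression equation or the sample mean) can be sketched on the Garbage Project data as follows. The slides do not say how the significance test is carried out, so this sketch assumes the caller supplies a critical value of r from a table; the value 0.707 used here is the two-tailed 0.05 critical value for n = 8 and is an assumption of the example, as is the function name `best_prediction`.

```python
from math import sqrt


def best_prediction(xs, ys, x_new, critical_r):
    """Return (prediction, r) following the flowchart:
    use the regression line if |r| exceeds critical_r, else the mean of y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)

    if abs(r) > critical_r:
        # Significant linear correlation: substitute into y-hat = m*x + b.
        m = sxy / sxx
        b = mean_y - m * mean_x
        return m * x_new + b, r
    # No significant linear correlation: best prediction is the sample mean.
    return mean_y, r


# Garbage Project data from the slide: plastic discarded (lb), household size.
plastic = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]
household = [2, 3, 3, 6, 4, 2, 1, 5]

# critical_r = 0.707 is an assumed table value (n = 8, alpha = 0.05).
predicted, r = best_prediction(plastic, household, 0.50, critical_r=0.707)
print(round(predicted, 1), round(r, 3))

# With an artificially strict threshold the function falls back to the mean.
fallback, _ = best_prediction(plastic, household, 0.50, critical_r=0.99)
```

Passing the critical value in as a parameter keeps the sketch free of any statistics library; a fuller version might compute a t statistic or a P-value instead.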
Using a calculator on the Garbage Project data:

b = 0.549
m = 1.48
$\hat{y} = 0.549 + 1.48(0.50)$
$\hat{y} = 1.3$

A household that discards 0.50 lb of plastic has approximately one person.

Definitions
- Marginal Change: the amount a variable changes when the other variable changes by exactly one unit.
- Outlier: a point lying far away from the other data points.
- Influential Points: points which strongly affect the graph of the regression line.

Example 5.4 Height and Foot Length (cont.)

Three outliers were data entry errors.

Regression equation
- uncorrected data: 15.4 + 0.13 height
- corrected data: -3.2 + 0.42 height

Correlation
- uncorrected data: r = 0.28
- corrected data: r = 0.69

Example 5.10 Earthquakes in US

San Francisco earthquake of 1906.

Correlation
- all data: r = 0.73
- w/o SF: r = -0.96

Example: Predict the quiz score of a student who spends 30 hours a week watching television.

One more step ...

Compute the Standard Error of the Estimate

$$S_{Y \cdot X} = SD_Y \sqrt{1 - r^2}$$
$$S_{Y \cdot X} = 13.83 \sqrt{1 - (-0.817)^2}$$
$$S_{Y \cdot X} = \pm 7.978$$

The predicted score is 56.56 points ± 7.978 points.

Multiple Regression

Definition

Multiple Regression Equation: a linear relationship between a dependent variable y and two or more independent variables $(x_1, x_2, x_3, \ldots, x_k)$:

$$\hat{y} = m_0 + m_1 x_1 + m_2 x_2 + \cdots + m_k x_k$$

Generic Models
- Linear: $y = a + bx$
- Quadratic: $y = ax^2 + bx + c$
- Logarithmic: $y = a + b \ln x$
- Exponential: $y = ab^x$
- Power: $y = ax^b$
- Logistic: $y = \dfrac{c}{1 + ae^{-bx}}$

Development of a Good Mathematics Model
- Look for a Pattern in the Graph: examine the graph of the plotted points and compare the basic pattern to the known generic graphs.
- Find and Compare Values of $R^2$: select functions that result in larger values of $R^2$, because such larger values correspond to functions that better fit the observed points.
- Think: use common sense. Don't use a model that leads to predicted values known to be totally unrealistic.
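The "find and compare values of $R^2$" step can be sketched as below for four of the generic models. This is only an illustration under stated assumptions: the exponential and power models are fitted by least squares on log-transformed data (which requires positive x and y and is not necessarily how a given calculator fits them), the quadratic and logistic models are left out because they need a different fitting method, and the data and function names are hypothetical.

```python
from math import exp, log


def least_squares(us, vs):
    """Return (intercept, slope) of the least-squares line v = intercept + slope*u."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    slope = suv / suu
    return mean_v - slope * mean_u, slope


def r_squared(ys, preds):
    """1 - SS_res / SS_tot, computed on the original y scale."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def compare_models(xs, ys):
    """R^2 for linear, logarithmic, exponential, and power fits (x, y > 0)."""
    ln_x = [log(x) for x in xs]
    ln_y = [log(y) for y in ys]
    scores = {}

    a, b = least_squares(xs, ys)                 # y = a + b*x
    scores["linear"] = r_squared(ys, [a + b * x for x in xs])

    a, b = least_squares(ln_x, ys)               # y = a + b*ln(x)
    scores["logarithmic"] = r_squared(ys, [a + b * lx for lx in ln_x])

    ln_a, ln_b = least_squares(xs, ln_y)         # ln y = ln a + x*ln b
    scores["exponential"] = r_squared(
        ys, [exp(ln_a) * exp(ln_b) ** x for x in xs])

    ln_a, b = least_squares(ln_x, ln_y)          # ln y = ln a + b*ln x
    scores["power"] = r_squared(ys, [exp(ln_a) * x ** b for x in xs])

    return scores


# Hypothetical data that roughly doubles at each step.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 8.2, 15.8, 32.5, 63.0]

scores = compare_models(xs, ys)
best = max(scores, key=scores.get)
for name, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} R^2 = {value:.4f}")
```

A larger $R^2$ only says the curve passes closer to the observed points; the "Think" guideline still applies, since a model can fit the sample well and give unrealistic predictions outside it.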