Chapters 8 & 9: Linear Regression & Regression Wisdom

Price of Homes Based on Size (in Square Feet)
• Homes sold in Ames between Sep. 2004 and Oct. 2005
• r = 0.8718945
[Figure: scatterplot of home price vs. size in square feet]

Statistical Modeling
• Statistical model: an equation that fits the pattern between a response variable and possible explanatory variables, accounting for deviations from the model. (Simplest case: one quantitative response variable and one quantitative explanatory variable.)
• Response variable (Y): the quantitative outcome of a study.
• Explanatory variable (X): a quantitative variable that may explain or predict the response variable.
• What is the best model for predicting weight (Y) from height (X)?
• What is the best model for predicting blood pressure (Y) from age (X)?

Correlation and the Line
• Price of Homes Based on Square Feet
• Price = -90.2458 + 0.1598 SQFT
• r = 0.8718945
[Figure: scatterplot of home price vs. square feet with the fitted line]

Regression line
• Explains how the response variable (y) changes in relation to the explanatory variable (x)
• Use the line to predict the value of y for a given value of x

Regression line
• Need a mathematical formula
• We want to predict y from x
• The predicted values are called $\hat{y}$. The observed values are called y.

Which Line is Best?
• What are some ways we can determine which model, out of all the possible models, is the "best" one?
• What are some ways that we can numerically rank the different models (i.e., the different lines)?

Which Model is Best?
• Price = -90.2458 + 0.1598 SQFT (red)
• Price = -300 + 0.3 SQFT (blue)
• Price = 0 + 0.1 SQFT (green)
[Figure: scatterplot of home price vs. square feet with the red, blue, and green candidate lines]

Regression line
• "Putting a hat on it" means we have predicted something from the model
• Look at the vertical distance $y - \hat{y}$
• This is the amount of error in the regression line
• The goal is to find the line so that these errors are minimized.
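One way to rank the candidate lines numerically is to add up the squared vertical distances $y - \hat{y}$ for each line and prefer the smallest total. The sketch below does this for the red, blue, and green lines from the slide. The four data points are made up purely for illustration (they are not the Ames sales data), and the helper name `sse` is our own.

```python
def sse(b0, b1, xs, ys):
    """Sum of squared vertical distances (y - y_hat) for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data for illustration only (NOT the Ames home sales).
sqft = [1000, 1500, 2000, 2500]
price = [75, 150, 225, 310]

# The three candidate lines from the slide: (intercept, slope).
candidates = {
    "red": (-90.2458, 0.1598),
    "blue": (-300.0, 0.3),
    "green": (0.0, 0.1),
}

# Score every candidate; a smaller sum of squared errors means a closer fit.
scores = {name: sse(b0, b1, sqft, price) for name, (b0, b1) in candidates.items()}
best = min(scores, key=scores.get)

for name, value in scores.items():
    print(f"{name}: SSE = {value:.2f}")
print("smallest SSE:", best)
```

On these made-up points the red line has the smallest total, but the ranking depends entirely on the data used.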
Least squares regression
• Most commonly used regression line
• Makes the sum of the squared errors as small as possible
• Based on the statistics $\bar{x}$, $\bar{y}$, $s_x$, $s_y$, $r$

Regression line equation
• $\hat{y} = b_0 + b_1 x$, where $b_1 = r \frac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$

Regression line equation
• b1 = slope of line. For every unit increase in x, y changes by the amount of the slope.
• Interpreting b1 (slope): For every one unit increase in the explanatory variable, there will be, on average, a b1 unit(s) increase/decrease in the response variable.
• For example: For every one square foot increase in size, on average, there will be a $159.80 increase in home price.
• MEMORIZE THIS!!!!!

Regression line equation
• b0 = y-intercept of line. The value of y when x = 0.
• Interpreting b0 (y-intercept): When the explanatory variable = 0, on average, the value of the response variable = b0.
• For example: When the sq. ft. of a home is 0, the price of the home will be -$90,245.80 on average.
• MEMORIZE THIS!!!!!
• BE CAREFUL. The interpretation of the intercept does not always make sense. When interpreting, be sure to mention if the interpretation does not make sense.

Example: Kobe's Shooting
• I visited cnnsi's website and checked out some of Kobe Bryant's personal scoring numbers.
• I looked at the number of times he shot the ball and his point total for each game so far this year.
• Let's come up with the regression equation for this data.

Kobe's Shooting
• r = 0.7293762
• Form: Linear
• Strength: Moderate to Strong
• Direction: Positive
[Figure: scatterplot of points vs. number of shots]

Calculating the regression line
• Remember that our explanatory variable (x) is the number of shots and our response variable (y) is the number of points.
• So the five numbers needed are: $\bar{x} = 27.04$, $s_x = 7.41$, $\bar{y} = 35.71$, $s_y = 12.13$, $r = 0.7293762$

Calculating the Regression Line
• Find the slope: $b_1 = r \frac{s_y}{s_x} = (0.7293762)\frac{12.13}{7.41} \approx 1.19$
• Find the intercept: $b_0 = \bar{y} - b_1 \bar{x} = 35.71 - b_1(27.04) = 3.436$ (the value carried through the rest of these slides, presumably computed from unrounded values; plugging in the rounded slope 1.19 gives about 3.53)

Calculating the regression line
• Don't forget to write the equation.
• $\hat{y} = 3.436 + 1.19x$
• DON'T FORGET TO WRITE THE EQUATION IN THE CONTEXT OF THE PROBLEM.
• pts = 3.436 + 1.19(number of shots)

Interpretation
• How would we interpret b1? For each additional shot Kobe Bryant takes, on average we would expect him to score 1.19 more points.
• How would we interpret b0? If Kobe Bryant took no shots, then on average we would expect him to score 3.436 points.

Prediction
• Use the regression equation to predict y from x.
• Ex. What is the predicted number of points when Kobe shoots 30 times in a game? $\hat{y} = 3.436 + 1.19(30) = 39.136$
• Ex. What is the predicted number of points when Kobe shoots 55 times in a game? $\hat{y} = 3.436 + 1.19(55) = 68.886$

Plotting the regression line
• Find two points on the line. Ex. x = 30, y = 39 and x = 55, y = 69
• If you are plotting by hand, it is OK to round values
• Plot these two points on the graph
• Connect the points
• This is the regression line

Plotting the Regression Line
[Figure: scatterplot of points vs. number of shots with the regression line drawn]

Properties of regression line
• r is related to the value of b1
• r has the same sign as b1
• A change of one standard deviation in x corresponds to a change of r standard deviations in the predicted y
• The regression line always goes through the point $(\bar{x}, \bar{y})$

Properties of regression line
• $r^2$: percent of variation in y that is explained by the least squares regression of y on x
• The higher the value of $r^2$, the more the regression line explains the changes that occur in the y variable
• The higher the value of $r^2$, the better the regression line fits the data

Properties of regression line
• $0 \le r^2 \le 1$ since $-1 \le r \le 1$

Interpretation of $r^2$
• $r^2$ is the percent of variation in the response variable that can be explained by the least squares regression of the response variable on the explanatory variable.
• For Kobe's example: 53.1% of the variability in the number of points Kobe Bryant scores in a game can be explained by the LS regression of points per game on number of shots per game.
• MEMORIZE THIS!!!!
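The Kobe calculation above can be sketched in a few lines: slope and intercept from the five summary statistics, then a prediction and $r^2$. The function names `line_from_summary` and `predict_points` are our own. Because the code keeps full precision, its intercept comes out slightly different from the rounded 3.436 used on the slides.

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Least squares intercept and slope from the five summary statistics."""
    b1 = r * s_y / s_x          # slope: b1 = r * (s_y / s_x)
    b0 = y_bar - b1 * x_bar     # intercept: b0 = y_bar - b1 * x_bar
    return b0, b1


# Summary statistics from the slides (x = number of shots, y = points).
r = 0.7293762
b0, b1 = line_from_summary(x_bar=27.04, s_x=7.41, y_bar=35.71, s_y=12.13, r=r)
r_squared = r ** 2


def predict_points(shots):
    """Predicted points (y-hat) for a game with the given number of shots."""
    return b0 + b1 * shots


print(f"pts = {b0:.3f} + {b1:.3f}(number of shots)")
print(f"predicted points at 30 shots: {predict_points(30):.1f}")
print(f"r^2 = {r_squared:.3f}")
```

Note that 55 shots may lie outside the range of the observed games, in which case that prediction would be an extrapolation.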
Residuals
• Amount of variation in y not taken into account by the regression line
• Formula: residual $= y - \hat{y}$
• There is a residual for each data point
• Mean of the residuals is zero

Calculating Residuals: Kobe
• $\hat{y} = 3.436 + 1.19x$, i.e. pts = 3.436 + 1.19(number of shots)
• Find the residual for the point (46, 81)
• First find the predicted number of points for a game with 46 shots: $\hat{y} = 3.436 + 1.19(46) = 58.176$
• Now find the residual: residual $= y - \hat{y} = 81 - 58.176 = 22.824$

Calculating Residuals: Kobe
• Find the residual for the point (26, 35)
• $\hat{y} = 3.436 + 1.19(26) = 34.376$
• residual $= y - \hat{y} = 35 - 34.376 = 0.624$

Residual Plots
• Scatterplot of residuals
• Explanatory variable on horizontal axis
• Residuals on vertical axis
• Horizontal line at residual = 0

Residual Plots
[Figure: residual plot]

Interpreting Residual Plots
• Is there a curved pattern? This could mean a non-linear relationship.
• Is there increasing spread about the line as x increases? This could mean non-constant variance.
• Is there decreasing spread about the line as x increases? This could mean non-constant variance.

Interpreting Residual Plots
• Points with large residuals
  These are probably outliers in the y direction
  These will pull the regression line in the direction of the outlier (up or down)
• Extreme points in the x direction
  These are called influential points
  They do not always show up in residuals because the residual could be small
  Removing the outlier could markedly change the regression line

Reading JMP Data
[Figure: Bivariate Fit of BAC by # of Beers; scatterplot with BAC on the vertical axis and Beers on the horizontal axis]

Reading JMP Data: Linear Fit
• BAC = -0.011654 + 0.0180112 # of Beers
• This is the regression line for the data.
• Slope is 0.0180112.
• y-intercept is -0.011654.
• The response variable is the BAC.
• The explanatory variable is the # of Beers.

Reading JMP Data: Summary of Fit
• This gives some summary of the data. The report lists RSquare, RSquare Adj, Root Mean Square Error, Mean of Response, and Observations (or Sum Wgts).
• RSquare = $r^2$ = (correlation)$^2$ = 0.803536
• RSquare Adj = 0.788424
• Root Mean Square Error = s = 0.020920
• Mean of Response = $\bar{y}$ = 0.076000
• Observations = n = 15

Reading JMP Data: Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       1   0.02327041       0.023270      53.1700   <.0001
Error      13   0.00568959       0.000438
C. Total   14   0.02896000
• This is called the ANOVA table. This is another way to analyze the data. We aren't going to discuss this in this class.

Reading JMP Data: Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   -0.011654    0.013179    -0.88     0.3926
#beers      0.0180112    0.002470    7.29      <.0001
• This tells you what the y-intercept and slope are. It also gives the standard error for each of the estimates.
• If you were to form confidence intervals for the parameter estimates, you would need these values. We won't discuss that in this class.

Reading JMP Data
[Figure: residual plot; Residual on the vertical axis, Beers on the horizontal axis]
• Here is your residual plot. Check it to see if there are any problems with linearity of the data and constant variance.

Example
[Figure: scatterplot of Gesell score vs. age at first word]

Example
• Age at first word vs. Gesell score.
• Scatterplot: Weak negative linear relationship between two variables. Possible outliers at (42,57) and (17,121).
• Regression: r = -0.64, $r^2$ = 40.96%.
• $\hat{y} = 109.87 - 1.13x$

Example
[Figure: scatterplot of Gesell score vs. age at first word]

Example
• Age at first word vs. Gesell score.
• Prediction:
  When x = 17
  When x = 42
• Residuals:
  point (17,121)
  point (42,57)

Example
[Figure: residual plot; Residual on the vertical axis, age at first word on the horizontal axis]

Example
• Residual plot: outliers at x = 17 and x = 42
• Small residual for x = 42
  Could be influential
• Remove (42,57) from data. Regression line changes markedly.
• r = -0.33, $r^2$ = 10.89%.

Example
[Figure: scatterplot of Gesell score vs. age at first word]

Outliers--What should you do?
• Make sure data points have been recorded correctly
• Collect more data
• Remove the outlier
• Examine collection techniques
• Examine outside influences

Cautions about regression
• Linear relationship only
• Not resistant
• Using averaged data
  Makes relationship appear stronger
  Taking average removes variation
• Extrapolation
  Predicting y when the x value is outside the original data

Cautions about Regression: Extrapolation
• Remember the data about home prices vs. the amount of sq. footage in the home.
• The regression line we found based on data collected from homes with 900 to 3,000 sq. ft. is price = -75.47 + 0.69(sq. ft.)
• This would mean that if my home has no square footage, then I pay -$75,470.
• If you must extrapolate, at least don't expect that your prediction will come true.

Cautions about regression
• ASSOCIATION IS NOT CAUSATION!
• Strong association between explanatory and response variables does not mean that the explanatory variable causes the response variable.

Proving Causation
• Experiment
  Change the values of x and control for lurking variables.
• Not all problems can be solved by experiment
  Does smoking cause lung cancer?
  Does living near power lines cause leukemia?

Proving Causation
• Lurking variable
  Important effect on variables, but not included in study.
• Example: Do taller people make more money? What do you think a lurking variable might be?
Proving Causation
• Proving smoking causes lung cancer
  Association is strong
  Association is consistent
  High doses are associated with stronger response
  Cause precedes the effect in time
  Cause is plausible

Review
[Figure: Number of Calories by Sugar Content (g) for 13 Cereals; scatterplot with cals on the vertical axis and sugar (g) on the horizontal axis]
• Let's calculate the formula for this regression line.

Review
• Let's review all the formulas we need:
  $\hat{y} = b_0 + b_1 x$
  $b_1 = r \frac{s_y}{s_x}$
  $b_0 = \bar{y} - b_1 \bar{x}$
  $r = \frac{1}{n-1} \sum \frac{(x - \bar{x})(y - \bar{y})}{s_x s_y}$
  $s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n-1}}$
  $\bar{y} = \frac{1}{n} \sum y$

Review
• Here are all the numbers you need:
  $n = 13$
  $\sum x = 94$
  $\sum y = 1280$
  $\sum (x - \bar{x})(y - \bar{y}) = 1014.66$
  $\sum (y - \bar{y})^2 = 6169.21$
  $\sum (x - \bar{x})^2 = 301.97$

Review
• First, calculate $s_x$ and $s_y$:
  $s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}} = \sqrt{\frac{301.97}{12}} = 5.02$
  $s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n-1}} = \sqrt{\frac{6169.21}{12}} = 22.67$

Review
• Second, calculate r:
  $r = \frac{1014.66}{(13-1)(22.67)(5.02)} = \frac{1014.66}{1365.64} = 0.743$
• Third, calculate b1:
  $b_1 = (0.743)\frac{22.67}{5.02} = 3.36$

Review
• Fourth, calculate $\bar{x}$ and $\bar{y}$:
  $\bar{x} = \frac{94}{13} = 7.23$
  $\bar{y} = \frac{1280}{13} = 98.46$
• Fifth, calculate b0 (we're almost done!!):
  $b_0 = 98.46 - 3.36(7.23) = 74.17$

Review
• Last, but definitely the most important, WRITE DOWN THE EQUATION IN THE CONTEXT OF THE PROBLEM:
  Calories = 74.17 + 3.36(sugar)

Review
• Interpret b1: For every one gram increase in sugar, the number of calories will increase by 3.36, on average.
• Interpret $r^2$: About 55% of the variability in the number of calories in cereal can be explained by the LS regression of calories on sugar content.
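The five review steps can be checked with a short script that starts from the sums given on the slides. This is a minimal sketch using only the standard library, and the variable names are our own. It keeps full precision at every step, so the results match the hand calculation only up to rounding.

```python
import math

# Sums given on the review slides for the 13 cereals (x = sugar in g, y = calories).
n = 13
sum_x, sum_y = 94, 1280
sum_xy_dev = 1014.66   # sum of (x - x_bar)(y - y_bar)
sum_yy_dev = 6169.21   # sum of (y - y_bar)^2
sum_xx_dev = 301.97    # sum of (x - x_bar)^2

# Step 1: standard deviations.
s_x = math.sqrt(sum_xx_dev / (n - 1))
s_y = math.sqrt(sum_yy_dev / (n - 1))

# Step 2: correlation.
r = sum_xy_dev / ((n - 1) * s_x * s_y)

# Step 3: slope.
b1 = r * s_y / s_x

# Step 4: means.
x_bar = sum_x / n
y_bar = sum_y / n

# Step 5: intercept.
b0 = y_bar - b1 * x_bar

print(f"s_x = {s_x:.2f}, s_y = {s_y:.2f}, r = {r:.3f}")
print(f"Calories = {b0:.2f} + {b1:.2f}(sugar)")
print(f"r^2 = {r ** 2:.3f}")
```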