Tuesday, September 25: 5.1-5.2 Describing Bivariate Data with Scatterplots (review ch. 3-4)

The first step when analyzing the relationship between two _______________ variables is to graph the data using a _______________. In a scatterplot, the ________________ variable should be on the x-axis and the ___________ variable should be on the y-axis. The explanatory variable seeks to explain or predict changes in the response variable. Usually the explanatory variable comes first chronologically.

Each axis should be clearly __________ with the variable's name and unit. Each axis should also have a well-marked and uniform ___________; however, the scales do not need to be the same. The axes often intersect at (0, 0), but this can change depending on the range of the data sets. Patterns are often more visible when there is less "empty space" and the data are more spread out.

In general, when describing a scatterplot:
1. Look for the ______________ of the association.
   __________________________: higher values of one variable are associated with higher values of the other variable
   __________________________: higher values of one variable are associated with lower values of the other variable
   __________________________: higher values of one variable do not give any information about the values of the other variable
2. Look for the _______ of the data.
3. Look for the ______________. If there is not much scatter, we say there is a "strong" association between the variables.
4. Look for ____________________: _____________ that fall outside the pattern of the rest of the data and __________ of points that are isolated from the rest of the data. Always investigate these values!

For the following sets of bivariate data, which would be the explanatory variable?
1. Scuba diving: depth and visibility
2. World population vs. year
3. Amount of rain vs. crop growth
4. Height vs.
GPA

Using the TI-83 to make scatterplots: Zoomstat, Window
Note: to sort bivariate data and keep the ordered pairs together, enter SortA(L1, L2). This will sort the data by L1 and keep the pairs together.

HW #1: 5.1-5.9 odd

Thursday, September 27: 5.3 Fitting a Line to Bivariate Data (chapter 3-4 test)

When the form in a scatterplot is linear, we can use an equation in the form ŷ = a + bx to model the relationship between the explanatory variable (x) and the response variable (y).
a = y-intercept (constant)
b = slope
ŷ ("y hat") signifies that the value of ŷ is an estimate or prediction.
Statisticians prefer the form ŷ = a + bx instead of ŷ = mx + b, but they are equivalent. Some books use the notation ŷ = b₀ + b₁x.

How can we find the best linear model? Since our goal is to make good predictions, we want to minimize the vertical deviations from the observations to the line. These vertical deviations are called __________________.

residual = observed y value - predicted y value = y - ŷ

The best-fitting line is the line which minimizes the sum of the squared residuals, Σ(y - ŷ)². This line is called the __________________________________________________ (LSRL).

JMP-Intro Script: LeastSquaresDemo
Applet: http://mathforum.org/dynamic/java_gsp/squares.html (try this at home)

Using the TI-83 to calculate the LSRL: enter the data in L1 and L2, then stat: calc: 8: LinReg(a+bx) L1,L2. (Note: 4 and 8 are the same, just different forms.) You should always use the TI-83 to find the LSRL. Ignore any directions that say otherwise.

Consider the following data describing the age (in months) vs. height (in inches) of infants:

age  height
1    20.5
1    19
2    21
4    22
4    23.5
6    22.5
7    23
7    24
9    26

a. sketch the scatterplot and describe what you see
b. since the scatterplot shows a linear form, calculate the LSRL and graph it on the plot
c. interpret the slope in the context of the problem
d. interpret the y-intercept in the context of the problem
e. if a child is 5 months old, how tall should he be, based on the model?
In other words, how tall is an average 5 month old?
f. would you be willing to predict the height of a 10-year-old child with this model? Why not?

Def: _____________________ is using your model to make predictions outside of the range of the data. It is very unreliable since the pattern of the data may not stay the same.

HW #2: 5.26, 27, 29, 30

Monday, October 1: 5.4 Assessing the Fit of a Line

After we find the least squares line, we should examine how well the model fits the data. Important questions to consider are:
1. Is a linear model really appropriate, or would a curved model be better?
2. If we make predictions with the model, how accurate will our predictions be?
3. Are there any unusual aspects of the data set we need to consider before we make predictions with the model?

Question 1: Is the linear model appropriate?

Remember that a residual is the vertical distance from the point to the line: y - ŷ

Def: a ____________________ gives us a closer look at the pattern of the residuals. It plots the x-values on the x-axis and the residuals on the y-axis. What kind of pattern will we see in a residual plot if we fit a line to linear data?

Making residual plots on the TI-83:
L1 = x
L2 = y
L3 = ŷ
L4 = y - ŷ
Scatterplot L1, L4

For the following data sets, sketch the original data with the least squares line. Then, sketch the residual plot and use it to decide if the linear model is appropriate.

Data set 1:        Data set 2:
x  y               x  y
1  1               1  .1
2  5               2  .8
3  7               3  2
4  8               4  3.3
5  12              5  5
6  12              6  7.3
7  17              7  9.9

For the second data set, the data are close to the line (not much scatter) even though there is an obvious curve in the residual plot. The residual plot indicates that a line is not the best way to model these data. However, the lack of scatter means that predictions using the linear model will still be fairly accurate within the range of our data, though not as accurate as with a curved model.
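As a side illustration (not part of the original notes), the TI-83 steps above (fit the LSRL, store ŷ, and compute the residuals y - ŷ) can be sketched in Python. This assumes NumPy is available and uses the first practice data set; the variable names are mine:

```python
import numpy as np

# First practice data set from above
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([1, 5, 7, 8, 12, 12, 17], dtype=float)

# Least squares slope and intercept: b = Sxy/Sxx, a = ybar - b*xbar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x          # predicted values (L3 on the TI-83)
residuals = y - y_hat      # residuals (L4 on the TI-83)

print(round(b, 3), round(a, 3))
```

A residual plot is then just a scatterplot of x against `residuals`; note that the residuals from a least squares line always sum to zero, so the plot is centered on the horizontal axis.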
In conclusion, a residual plot will tell you if a linear model is the right type of model (has the right form) or if we should consider fitting a non-linear model.

NOTE: Sometimes a residual plot is made with the predicted values (ŷ) on the x-axis and the residuals on the y-axis. This is because computer software packages are built for multiple regression, which uses many different x-variables to predict y. Instead of using one of the x-variables and ignoring the others, they use the predicted values since they are a function of all the x's. However, the plot will still show the same characteristics and should be interpreted in the same way.

HW #3: 5.47, 5.48

Tuesday, October 2: 5.4 How accurate will our predictions be?

Suppose that I randomly selected 10 students from this class and recorded their weight (in pounds):
{103, 201, 125, 179, 150, 138, 181, 220, 113, 126}
If I were to randomly select one more student, what would be a good prediction for his or her weight? Of course, this prediction is not likely to be correct. Typically, how far are the observations from the mean? In other words, how far off should we expect to be?

Is there any way to improve our prediction? In other words, is there a way I can reduce the standard deviation? Here are the heights (in inches) for the original 10 students:
{61, 68, 65, 69, 65, 61, 64, 72, 63, 62}
Sketch the scatterplot and calculate the LSRL. Of course, the predictions using the regression line aren't perfect either.

Standard Deviation about the Least Squares Regression Line: To get a sense of how close the points are to the line, we can calculate the standard deviation about the least squares regression line, which gives an estimate of the average distance each observation is from the line (in other words, the typical size of a residual):

se = sqrt( Σ(y - ŷ)² / (n - 2) ) = sqrt( SSError / (n - 2) ) = sqrt( SSResid / (n - 2) )

Note: "SS" = "Sum of Squares", so SSResid is the sum of squared residuals.
Note: se is also called "root mean square error" (RMSE) or simply s.
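The formula for se can be checked in Python on the height/weight data above. This is an illustrative sketch (not part of the original notes), assuming NumPy is available:

```python
import numpy as np

# Heights (x, inches) and weights (y, pounds) for the 10 students above
x = np.array([61, 68, 65, 69, 65, 61, 64, 72, 63, 62], dtype=float)
y = np.array([103, 201, 125, 179, 150, 138, 181, 220, 113, 126], dtype=float)
n = len(x)

# Least squares fit: b = Sxy/Sxx, a = ybar - b*xbar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Standard deviation about the regression line: se = sqrt(SSResid / (n - 2))
ss_resid = np.sum((y - (a + b * x)) ** 2)
se = np.sqrt(ss_resid / (n - 2))

print(round(b, 2), round(a, 2), round(se, 2))
```

Here se comes out near 22 pounds, noticeably smaller than the ordinary standard deviation of the weights (about 40 pounds): using height shrinks the typical prediction error.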
Calculate the standard deviation about the regression line for this data:

The coefficient of determination, r², is a measure of the proportion of variability in the y variable that can be "explained" by the linear relationship between x and y. For example, suppose we open a pizza parlor, selling pizzas for $8 plus $1.50 per topping. If we were to plot the points (0, 8.00), (1, 9.50), (2, 11.00), they would fall exactly on a line. In this case, the number of toppings explains 100% (all) of the variability in price. Thus, r² = 1, or 100%.

Calculate the coefficient of determination for the height and weight data (look at the r-squared program again):

To measure the total variability in the y variable (weight), we measure the variability of y from its mean. This is the total sum of squares:
SSTotal = Σ(y - ȳ)² =
Note: We do not consider the x variable at all when we calculate SSTotal.
Note: This is the same quantity that we use when we calculate s, the sample standard deviation for one variable.

We can also consider the variability in y (weight) that still remains after we factor in x (height). This is called the residual sum of squares:
SSResid = SSError = Σ(y - ŷ)² =
Note: this is the same quantity that we used when we calculated se, the standard deviation about the LSRL.

The difference between SSTotal and SSError is called SSModel. SSModel is the variability in y (weight) that is explained by x (height).
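The sum-of-squares decomposition can be verified numerically on the same height/weight data. A sketch (not from the original notes), assuming NumPy is available:

```python
import numpy as np

# Height (x, inches) and weight (y, pounds) data from above
x = np.array([61, 68, 65, 69, 65, 61, 64, 72, 63, 62], dtype=float)
y = np.array([103, 201, 125, 179, 150, 138, 181, 220, 113, 126], dtype=float)

# Fit the LSRL
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)   # variability of y, ignoring x
ss_resid = np.sum((y - y_hat) ** 2)      # variability left over after using x
ss_model = ss_total - ss_resid           # variability explained by x

r_squared = ss_model / ss_total
print(round(ss_total, 1), round(ss_resid, 1), round(r_squared, 2))
```

Note that `ss_total == ss_model + ss_resid` holds exactly (up to rounding), which is the identity developed below.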
[Diagram: a point above the line ŷ = a + bx, with the total deviation y - ȳ split into y - ŷ (point to line) and ŷ - ȳ (line to mean).]

SSTotal = SSModel + SSResid:  Σ(y - ȳ)² = Σ(ŷ - ȳ)² + Σ(y - ŷ)²

SSTotal is the variability in the response variable (considered by itself).
SSModel is the variability in the response variable that is accounted for by the explanatory variable.
SSResid is the variability in the response variable that is not accounted for by the explanatory variable.

Thus, the coefficient of determination can be computed as:

r² = SSModel/SSTotal = (SSTotal - SSResid)/SSTotal = 1 - SSResid/SSTotal = 1 - (variability of residuals)/(variability of weights)

Thus, we can say that ____% of the variability in weight can be explained by height (the total variability in weight has been reduced by ____%). This also means that ____% of the variability in weight remains unexplained (it is due to other factors). We can also say that height accounts for ____% of the variation in weight.

Using the TI-83 to calculate r².

What is the relationship between r² and se? Both measure how well a line models the data.
r² has no units and is usually expressed as a percent between 0% and 100%.
se is expressed in the same units as the response (y) variable.

If n = 10, se = 8, and sy = 25, calculate r².

AP Question

HW #4: 5.38, 39, 40, 44, 49

Thursday, October 4: 5.2 Correlation / Understanding Regression Output

Another way to help us quantify the amount of scatter in a scatterplot is to calculate the _____________________________________, which measures the strength of the linear relationship between two quantitative variables.

For example, consider the following scatterplot showing fat (grams) vs. calories in 5 ounces of various types of pizza (Statistics in Action, page 134). How would the relationship change if I converted the values on the x-axis into milligrams instead of grams? Since the units of measure do not matter, we can standardize them by calculating the z-score for each observation.
For fat: z_x = (x - x̄)/s_x = (x - 12.17)/3.70
For calories: z_y = (y - ȳ)/s_y = (y - 331.92)/29.15

Now, we can look at a scatterplot of the (z_x, z_y) ordered pairs. Which points give evidence that there is a positive association? Which points count against a positive association?

Since the points in quadrants I and III indicate a positive association, we will use the products z_x·z_y, which are always positive in QI and QIII and negative in QII and QIV. To get an overall sense of the relationship, we add up these products: Σz_x·z_y. If the sum is positive, we have a positive association. If the sum is negative, we have a negative association. If there is no association, the sum should be close to 0. Also, since the size of this sum will get bigger the more data we have, we divide the sum by n - 1 to find the correlation coefficient:

r = Σz_x·z_y / (n - 1) = 18.95 / (24 - 1) ≈ .82

Note: r = ±√(r²)
Note: When we use the word "correlation" in statistics, we are referring to the correlation coefficient. If you want to describe a relationship in a more casual way, use the word "association".
Note: r is often called Pearson's correlation coefficient.

Properties of the Correlation Coefficient:
1. The value of r does not depend on the unit of measure since r is based on z-scores, which have no units. For example, the relationship between height and weight is equally strong if we use inches and pounds or centimeters and kilograms.
2. r has no units.
3. The value of r does not depend on which variable is x and which is y. The product z_x·z_y is the same as z_y·z_x. Note: this isn't true for the slope, intercept, or SD.
4. -1 ≤ r ≤ 1. When r > 0, the relationship is positive. When r < 0, the relationship is negative. As r → ±1, the relationship is stronger and has less scatter. As r → 0, the relationship is weaker and has more scatter.
5. r = ±1 only when the data are in a perfect line. This is the only case where the values of one variable can be completely determined by the values of the other variable.
6.
The value of r is a measure of the strength of a linear relationship. It measures how closely the data fall to a straight line. An r value near 0, however, does not imply that there is no relationship, only no linear relationship. For example, quadratic or sinusoidal data can have an r close to 0, even though there may be a strong relationship present. Also, even though r measures the strength of a linear relationship, it does NOT tell us if a linear model is appropriate. Only ___________________ can do that. The correlation coefficient just measures how much scatter there is about the line on a scale from -1 to 1.

Don't confuse correlation with causation: For example, in the last 10 years, the number of students taking the AP Statistics exam has grown from 7,500 to over 75,000. At the same time, the national murder rate has been decreasing, so there is a negative correlation. Does this mean that we should make more people take AP Statistics so there will be fewer murders? Also, there is a strong positive association between monthly ice cream sales at Baskin Robbins and monthly drowning deaths. Should we close Baskin Robbins to save people from drowning? You can never prove cause-and-effect from a scatterplot!!

Understanding Computer Output: On the AP Exam, questions about regression frequently include output from computer software, such as JMP-Intro.

Average High Temperature  Annual Precipitation
70  15
71  13
73  9
74  12
76  8
76  10
77  7
72  11
72  12

In the "Analyze" menu, choose "Fit Y by X" and enter annual precipitation for Y and average high temperature for X. Click OK. You will see a scatterplot of the data.
Click on the red arrow and choose "fit line." You should see the following output:

Linear Fit
Annual Precipitation = 76.437788 - 0.8940092 Average High Temp

Summary of Fit
RSquare                     0.747577
RSquare Adj                 0.711516
Root Mean Square Error      1.363495
Mean of Response           10.77778
Observations (or Sum Wgts)  9

Analysis of Variance
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      1       38.541731       38.5417    20.7312     0.0026
Error      7       13.013825        1.8591
C. Total   8       51.555556

Parameter Estimates
Term               Estimate    Std Error   t Ratio   Prob>|t|
Intercept          76.437788   14.42794     5.30      0.0011
Average High Temp  -0.894009    0.19635    -4.55      0.0026

The least squares regression line is given at the top, but on many AP questions, they will only give you the bottom table. Make sure you can find the equation from the Parameter Estimates table only!

Other notes:
To find the correlation coefficient (r), take the square root of RSquare (+ or -, matching the sign of the slope).
You can ignore RSquare Adj (this is a multiple regression topic).
Root Mean Square Error is the standard deviation about the regression line, se. Other software uses the symbol "S" for this quantity.
Mean of Response is the average y-value (ȳ).
The Analysis of Variance table has all of the Sums of Squares needed to calculate r² and se.
To make a residual plot, click on the red arrow by "Linear Fit" and choose "plot residuals".

HW #5: 5.10-12, 14, 16-18, JMP problem (use the TI-83 to calculate r)

JMP: A random sample of cars was selected and the horsepower (HP) and miles per gallon (MPG) were recorded. Let x = HP and y = MPG.
a. Make a scatterplot with the least squares regression line.
b. State the equation of the line.
c. Interpret the slope in the context of the problem.
d. Make a residual plot and comment.
e. State and interpret the value of the coefficient of determination.
f. State and interpret the value of the standard deviation about the regression line.
g. Show how parts e and f could be calculated without using the Summary of Fit section.
h.
Calculate the correlation coefficient.
i. Predict the MPG for a car with 175 HP.
j. Suppose that we added a new point to the data set (HP = 250, MPG = 35). Add this point to your scatterplot by hand. How do you think this will affect the slope, r, r², and se?
k. Add this point in JMP to check your answers. Include the new output.

HP   MPG
110  36
170  23
165  20
93   29
142  21
214  28
114  30
124  29
225  23
255  25
155  31
200  27
70   43
81   33
168  28

Monday, October 15: 5.3 Regression

Earlier we learned why the Least Squares Regression Line includes the words "Least Squares." Today, we will learn why it includes the word "Regression." Thinking about the pizza example from yesterday: if we had a pizza with an average amount of fat, it would be reasonable to predict that it would also have an average number of calories. This illustrates an important property of the LSRL: It will always go through the point ________. If we think about the scatterplot of (z_x, z_y), this means that the line will go through (0, 0). In other words, the y-intercept will be 0. Furthermore, it can be proven that the slope of the LSRL on the standardized plane is equal to the correlation coefficient (r). Thus, on the standardized plane, the LSRL is:

ẑ_y = 0 + r·z_x

Note: Regression slopes don't tell us the strength of an association since they are dependent on units. For example, if y = height in meters, converting to cm increases the slope 100 times. Correlation is the standardized version of the slope.

Here is the scatterplot of z-scores for the pizza data along with the line ẑ_y = .82·z_x.

So, if an x-value is one standard deviation above the mean of x (z_x = 1), the predicted y-value will only be r standard deviations above the mean of y (ẑ_y = r·1 = r).
If an x-value is two standard deviations below the mean of x (z_x = -2), the predicted y-value will only be 2r standard deviations below the mean of y (ẑ_y = r·(-2) = -2r).

Thus, since -1 ≤ r ≤ 1, the predicted value of y will almost always be closer to its mean than the x-value is to its mean (in terms of standard deviations). This illustrates the concept of "regression to the mean."

This concept was first publicized by Francis Galton, who noticed that tall fathers had tall sons, but not quite as tall on average, and that short fathers had short sons, but not as short on average. For example, suppose we were using the height of a father (x) to predict the height of his son (y) and that r = 0.7. Then, if a father was 2 standard deviations below average in height, we would predict his son's height to be only ________ standard deviations below average.

Also, as the correlation gets closer to 0, our predicted values will be closer to the mean value of y. This seems reasonable since as the correlation gets weaker, we will have less confidence making predictions that are far from the mean.

So far, we have been working with standardized values. What if we wanted to use the original units? Start with ẑ_y = r·z_x and rewrite each z-score:

(ŷ - ȳ)/s_y = r·(x - x̄)/s_x
ŷ - ȳ = r·(s_y/s_x)·(x - x̄)
ŷ = r·(s_y/s_x)·x + (ȳ - r·(s_y/s_x)·x̄)

Since the slope of the regression line is the coefficient of x, b = r·(s_y/s_x). Once we know the slope, we can use the fact that (x̄, ȳ) is on the line to find the y-intercept: ȳ = a + b·x̄, so a = ȳ - b·x̄.

For example, find the LSRL if x̄ = 65, s_x = 4, ȳ = 150, s_y = 30, r = .62.

Note: Unlike the correlation coefficient, x and y are not interchangeable when calculating the LSRL. Therefore, we should never use a y-value to try to predict an x-value.
Note: if r = 0, then the slope = 0 and the LSRL is horizontal: ŷ = ȳ. If there is no linear association, knowing x won't help us predict y!
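The relationships r = Σz_x·z_y/(n - 1) and b = r·(s_y/s_x) can be checked numerically. A sketch (not part of the original notes), assuming NumPy and reusing the infant age/height data from the earlier example:

```python
import numpy as np

# Infant age (months) and height (inches) data from the earlier example
x = np.array([1, 1, 2, 4, 4, 6, 7, 7, 9], dtype=float)
y = np.array([20.5, 19, 21, 22, 23.5, 22.5, 23, 24, 26], dtype=float)
n = len(x)

sx = x.std(ddof=1)          # sample standard deviations
sy = y.std(ddof=1)

# Correlation as the average product of z-scores: r = sum(zx*zy)/(n-1)
zx = (x - x.mean()) / sx
zy = (y - y.mean()) / sy
r = np.sum(zx * zy) / (n - 1)

# Slope and intercept from r: b = r*sy/sx, a = ybar - b*xbar
b = r * sy / sx
a = y.mean() - b * x.mean()

# Same slope as the direct least squares formula b = Sxy/Sxx
b_direct = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(round(r, 3), round(b, 3), round(a, 3))
```

The two slope formulas agree exactly, since Sxx = (n - 1)·s_x² and r = Sxy/((n - 1)·s_x·s_y).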
HW #6: 5.31-35, 42

Tuesday, October 16: 5.4 Unusual and Influential Points

The following data show the weight (in pounds) and cost (in dollars) of a sample of 11 stand mixers (from Consumer Reports 11-05).

Weight  Price
23      180
28      250
19      300
17      150
25      300
26      370
21      400
32      350
16      200
17      150
8       30

1. Sketch a scatterplot of the data and find the least squares regression line and correlation.
2. What do you think will happen if we remove the outlier (the Walmart brand)? Sketch a scatterplot of the remaining data and find the least squares regression line and correlation. How do they compare?
3. What if the outlier was made from an expensive lightweight alloy so that the observation was (8, 700)?
4. What if the first mixer in the data set went on clearance sale for $25? How will this change the regression line?

Summary: Any point that stands apart from the others is called an ____________. Since the LSRL must pass through the point (x̄, ȳ), points that are separated in the x-direction can be particularly ________________. We say they have high _______________. When a point with high leverage lines up with the rest of the data, such as #1 above, the line won't change very much, but the correlation will be stronger. When a point with high leverage does not line up with the rest of the data, such as #3 above, it can have a large effect on both the line and the correlation.
Note: Points with high leverage often do not have large residuals, since they pull the line close to them. Points that are near x̄ will usually not be influential, such as #4 above. Why not?

Regression Wisdom: Always graph the data and investigate outliers!

Applet: http://www.stat.uiuc.edu/courses/stat100//java/guess/PPApplet.html
This allows you to add/subtract points and see the effect on the correlation, LSRL, and RMSE.
Applet: http://statweb.calpoly.edu/chance/applets/LRApplet.html
Allows you to move points around to see changes in r and the LSRL. Dynamic!
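The effect of the high-leverage point can also be seen numerically. A sketch (not from the original notes), assuming NumPy, computing the correlation for the mixer data with and without the (8, 30) Walmart point:

```python
import numpy as np

# Stand mixer data from above: weight (lb) and price ($); the last point
# (8, 30) is the low-priced, high-leverage Walmart mixer
weight = np.array([23, 28, 19, 17, 25, 26, 21, 32, 16, 17, 8], dtype=float)
price = np.array([180, 250, 300, 150, 300, 370, 400, 350, 200, 150, 30], dtype=float)

r_with = np.corrcoef(weight, price)[0, 1]

# Drop the high-leverage point and recompute
r_without = np.corrcoef(weight[:-1], price[:-1])[0, 1]

print(round(r_with, 2), round(r_without, 2))
```

Since the (8, 30) point roughly lines up with the trend of the other ten mixers, it strengthens the correlation; removing it makes r noticeably smaller, as the summary above predicts.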
AP Question

HW #7: See next page

HW #7: When will the Cherry Blossoms Appear? (from Dan Teague, NCSSM)

The anticipation of the first blooms of spring flowers is one of the joys of April. One of the most beautiful is that of the Japanese cherry tree. Experience has taught us that, if the spring has been a warm one, the trees will bloom early, but if the spring has been cool, then the blossoms will appear later. Mr. Yamada is a gardener who has observed the date in April when the blossoms first appear for the last 24 years. His son, Hiro, went on the internet and found the average March temperature (in degrees Celsius) in his area for those years. The data are below. To verify that you entered the data correctly, the mean temperature is 4.321 and the mean days is 12.875.

1. Why should temperature be the explanatory variable? Explain.
2. Draw a scatterplot and discuss the noticeable features. Is one variable completely dependent on the other?
3. Calculate the least squares line and graph it on the scatterplot.
4. Interpret the slope in the context of the problem.
5. Interpret the x- and y-intercepts in the context of the problem.
6. Find the value of the correlation coefficient.
7. If the temperature was measured in degrees Fahrenheit, how would this value change?
8. If r is high, can we conclude that a change in temperature causes the blooms to appear at different times? Explain.
9. Calculate and interpret the residual for the first point in the data set.
10. Sketch the residual plot. What does it tell you?
11. Calculate and interpret the values of r² and se in the context of the problem.
12. If you were to use number of hours instead of number of days, how would the values of r² and se change?
13. Predict the date of first bloom for an average March temperature of 3.5˚.
14. Which observation do you think is most influential? Explain.
15. Which observation had the biggest residual? Is it unusually large?
Temp  Days
4.0   14
5.4   8
3.2   11
2.6   19
4.2   14
4.7   14
4.9   14
4.0   21
4.9   9
3.8   14
4.0   13
5.1   11
4.3   13
1.5   28
3.7   17
3.8   19
4.5   10
4.1   17
6.1   3
6.2   3
5.1   11
5.0   6
4.6   9
4.0   11

Thursday, October 18: Review chapter 5
Distribute Projects: Proposal due Thursday
Review for test
AP Question

Monday, October 22: Test Chapter 5

Tuesday, October 23: Review for Midterm
AP Questions
Work on proposals

Thursday, October 25: Review for Midterm
Data exploration: The Cootie problem
Proposals due!

Monday, October 29: Midterm
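As an aside (not part of the original course schedule), the HW #7 cherry-blossom data entry can be verified against the stated means, and the regression computed, with a short Python sketch assuming NumPy:

```python
import numpy as np

# Cherry blossom data from HW #7: average March temperature (Celsius)
# and April day of first bloom, for 24 years
temp = np.array([4.0, 5.4, 3.2, 2.6, 4.2, 4.7, 4.9, 4.0, 4.9, 3.8, 4.0, 5.1,
                 4.3, 1.5, 3.7, 3.8, 4.5, 4.1, 6.1, 6.2, 5.1, 5.0, 4.6, 4.0])
days = np.array([14, 8, 11, 19, 14, 14, 14, 21, 9, 14, 13, 11, 13, 28, 17,
                 19, 10, 17, 3, 3, 11, 6, 9, 11], dtype=float)

# Check the data-entry means stated in the problem (4.321 and 12.875)
print(round(temp.mean(), 3), round(days.mean(), 3))

# Least squares line for predicting bloom day from temperature
b = np.sum((temp - temp.mean()) * (days - days.mean())) / np.sum((temp - temp.mean()) ** 2)
a = days.mean() - b * temp.mean()
r = np.corrcoef(temp, days)[0, 1]
print(round(b, 2), round(a, 2), round(r, 2))
```

The slope comes out negative, matching the story: warmer Marches go with earlier blooms.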