The Whopper
One Double Whopper with cheese provides 53 grams of protein – all the protein you need in a day. It also supplies 1020 calories and 65 grams of fat; the Daily Value (based on a 2000-calorie diet) for fat is 65 grams. How are fat and protein related on the entire BK menu? The scatterplot of Fat (grams) vs. Protein (grams) shows a positive, moderately strong, linear relationship.

Association
So, if you want 25 grams of protein in your lunch, how much fat should you expect to consume at Burger King? The correlation between fat and protein is 0.83, a sign that the linear association in the scatterplot is fairly strong. However, strength of the relationship is only part of the picture. The correlation says, "The linear association between these two variables is fairly strong," but it doesn't tell us what the line is.

Let's Say More
Yes, the relationship is strong, but let's say something more: we can model the relationship with a line and give its equation. This equation will let us predict the fat content for any Burger King food, given its amount of protein. How is this like what we do with a Normal model?

Linear Model
A linear model is an equation of a straight line through the data. Of course, no line can go through all the points, but a linear model can summarize the general pattern with only a couple of parameters. Like all models of the real world, the model will be wrong – wrong in the sense that it cannot match reality exactly. But it can help us understand how these variables are associated.

Residuals
We want to find a line through our data that comes closer to all the points than any other line. It may turn out that this line doesn't even hit a single point! But it does minimize the error between the line and each data point. For example, our line might predict that the BK Broiled Chicken Sandwich, with 30 grams of protein, should have 36 grams of fat, when in fact it has only 25 grams of fat. We call the estimate made from a model the predicted value and write it as ŷ (pronounced "y-hat") to distinguish it from the true value, y.

The difference between the observed value, y, and the predicted value, ŷ, is called the residual. The BK Broiled Chicken residual would be y − ŷ = 25 − 36 = −11 g of fat. The residual tells us how far off the model's prediction is at that point. To find residuals we always subtract the predicted value from the observed one. A negative residual means the predicted value is too big – an overestimate. A positive residual shows that the model made an underestimate.

"Best Fit" Means Least Squares
When we draw a line through our scatterplot, some residuals are positive and some are negative. We can't assess how well the line fits by adding up all the residuals – the positive and negative ones will cancel each other out. This is the same issue we faced when calculating the standard deviation. So what did we do? We are going to square the residuals! (Emphasis added.) Squaring makes all the values positive and emphasizes the large residuals. The line of best fit is the line for which the sum of the squared residuals is smallest: the least squares line.

Finding That Line
What we know about correlation can lead us to the equation of the linear model. Let's look specifically at a scatterplot of standardized variables.
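To see the least-squares idea in action, here is a minimal Python sketch using a small made-up protein/fat data set (not the actual BK menu values). It computes the sum of squared residuals for a candidate line and confirms that the line found by np.polyfit does better than nearby alternatives.

```python
# A minimal sketch of residuals and the least-squares criterion,
# using a small made-up data set (NOT the actual BK menu values).
import numpy as np

protein = np.array([17, 25, 30, 31, 39, 46], dtype=float)   # grams (hypothetical)
fat     = np.array([21, 31, 25, 37, 40, 53], dtype=float)   # grams (hypothetical)

def sum_squared_residuals(b0, b1):
    """Sum of squared residuals (observed - predicted) for the line y = b0 + b1*x."""
    predicted = b0 + b1 * protein
    residuals = fat - predicted          # residual = y - y-hat
    return np.sum(residuals ** 2)

# np.polyfit finds the slope and intercept that minimize the squared residuals.
b1_ls, b0_ls = np.polyfit(protein, fat, deg=1)

# Any other line we try should do no better than the least-squares line.
print(sum_squared_residuals(b0_ls, b1_ls))          # smallest possible
print(sum_squared_residuals(b0_ls + 2, b1_ls))      # larger
print(sum_squared_residuals(b0_ls, b1_ls + 0.1))    # larger
```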
Finding the Line
Let's start in the center of the plot – how much protein and fat does the typical Burger King item provide? The typical amount of protein is the mean, x̄. What is the fat content at this average protein content? The answer, as you might guess, is about average: ȳ. So our best-fit line must go through the point (x̄, ȳ). In the plot of z-scores, then, the line passes through the origin (0, 0).

A linear equation can be written in the form y = mx + b. If the line passes through the origin, b = 0, so the line can be expressed as y = mx, where m is the slope. Our coordinates here are z-scores, so our points are written (z_x, z_y), and we need to indicate that the point on the line corresponding to a particular z_x is the predicted value ẑ_y:

ẑ_y = m · z_x

Now, many lines pass through the origin, but which one fits our data best? That is, which slope determines the line that minimizes the sum of squared residuals? It turns out that the slope that minimizes the squared residuals is r itself! (Once again, emphasis added.)

Wow! The equation for the line is about as simple as we could ever hope for:

ẑ_y = r · z_x

What does it tell us? It says that moving one standard deviation from the mean in x, we can expect to move r standard deviations away from the mean in y. (A small numerical check of this appears after the guessing examples below.)

Let's get specific. For the sandwiches, the correlation is 0.83. If we standardize both protein and fat, we can write

ẑ_Fat = 0.83 · z_Protein

This model tells us that for every standard deviation above (or below) the mean a sandwich is in protein, we'll predict its fat content to be 0.83 standard deviations above (or below) the mean fat content. A double hamburger has 31 grams of protein, about 1 SD above the mean. Putting 1.0 in for z_Protein in the model gives a ẑ_Fat value of 0.83. If you trust the model, you'd expect the fat content to be about 0.83 fat SDs above the mean fat level. In general, moving one standard deviation away from the mean in x moves our estimate r standard deviations away from the mean in y.

r = 0, 1, or −1
For r = 0, there is no linear relationship. The line is horizontal, and no matter how many standard deviations you move in x, the predicted value for y doesn't change. On the other hand, if r = 1.0 or −1.0, there is a perfect linear association; in that case, moving one SD in x moves exactly one SD in y.

How Big Can Predicted Values Get?
A new student is about to join the class and you have to guess his height. A reasonable guess would be the mean height of male students in the class. Now suppose you are told he is 2 SDs above the mean height in centimeters; how tall would you guess he is in inches? Height in inches and height in centimeters are perfectly correlated, so you would guess 2 SDs above the mean height in inches.

Now suppose instead you are told his GPA is 2 SDs above the mean. What would you guess his height to be? There is little to no correlation between height and GPA, so you would still guess the mean height of male students.
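Here is a minimal Python sketch of the standardized regression line ẑ_y = r · z_x from above, again with made-up numbers rather than the real BK data. It standardizes both variables and checks that the least-squares line through the standardized points passes through the origin with slope equal to r.

```python
# A minimal sketch of the standardized regression line z_y-hat = r * z_x,
# using made-up numbers rather than the real BK data.
import numpy as np

protein = np.array([17, 25, 30, 31, 39, 46], dtype=float)   # hypothetical
fat     = np.array([21, 31, 25, 37, 40, 53], dtype=float)   # hypothetical

z_protein = (protein - protein.mean()) / protein.std(ddof=1)
z_fat     = (fat - fat.mean()) / fat.std(ddof=1)

r = np.corrcoef(protein, fat)[0, 1]

# The least-squares line through the standardized points passes through (0, 0),
# and its slope equals the correlation r.
slope_z, intercept_z = np.polyfit(z_protein, z_fat, deg=1)
print(round(slope_z, 4), round(r, 4))    # same value
print(round(intercept_z, 10))            # essentially 0

# An item 1 SD above the mean in x is predicted to be about r SDs above the mean in y.
print(r * 1.0)
```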
Finally, suppose you are told he is 2 SDs above the mean in shoe size. Now what would you guess his height to be? There is a positive correlation between shoe size and height, but it is not a perfect correlation, so your guess should be less than the 2 SDs from the height-in-centimeters example, though certainly higher than the 0 SDs from the GPA example.

The height example illustrates a general rule: each predicted y value tends to be closer to its mean (in standard deviations) than its corresponding x value was. This property of the linear model is called regression to the mean, and the line is called the regression line.

Just Checking…
A scatterplot of house Price (in thousands of dollars) versus Size (in thousands of square feet) for houses sold recently in Saratoga Springs, NY shows a relationship that is straight, with only moderate scatter and no outliers. The correlation between house Price and Size is 0.77.
You go to an open house and find that the house is 1 SD above the mean in size. What would you expect about its price?
You read an ad for a house priced 2 SDs below the mean. What would you guess about its size?
A friend tells you about a house whose size in square meters is 1.5 SDs above the mean. What would you guess about its size in square feet?
Page 192, # 1, 3, 5, 13, 15

The Regression Line in Real Units
We don't always think in terms of z-scores; in fact, most real-world scenarios require you to keep thinking in the original units (though it is vital that you understand the significance of the z-scores as well). How much fat would you predict for a double hamburger with 31 grams of protein? The mean for protein is near 17 grams and the SD is 14 grams, so that item is about 1 SD above the mean. Since r = 0.83, we predict the fat content will be 0.83 SD above the mean fat content. Mean fat content is 23.5 grams and the SD for fat content is 16.4 grams, so we predict the double hamburger will have about 23.5 + 0.83 × 16.4 = 37.1 grams of fat. We can always convert both x and y to z-scores, find the correlation, use ẑ_y = r · z_x, and then convert ẑ_y back to its original units to understand the prediction (see the sketch below). But can this be done more simply?

Let's rewrite the equation of the line for protein and fat in the original units:

ŷ = b0 + b1 · x

Here b0 is the y-intercept, the value of the line where it crosses the y-axis, and b1 is the slope. We find the slope using the formula developed on pages 175-176 of your book:

b1 = r · s_y / s_x = 0.83 × 16.4 g fat / 14 g protein ≈ 0.97 grams of fat per gram of protein
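Here is a minimal Python sketch of the "convert to z-scores, multiply by r, convert back" prediction described above, using the rounded summary statistics quoted in these slides (protein mean ≈ 17 g, SD 14 g; fat mean 23.5 g, SD 16.4 g; r = 0.83), so the result is approximate.

```python
# A minimal sketch: standardize protein, predict z_Fat-hat = r * z_Protein,
# then convert back to grams of fat. Summary statistics are the rounded
# values quoted in the slides.
protein_mean, protein_sd = 17.0, 14.0    # grams of protein
fat_mean, fat_sd = 23.5, 16.4            # grams of fat
r = 0.83

def predict_fat(protein_grams):
    """Predict fat (grams) via the z-score round trip."""
    z_protein = (protein_grams - protein_mean) / protein_sd
    z_fat_hat = r * z_protein                 # z_y-hat = r * z_x
    return fat_mean + z_fat_hat * fat_sd      # back to grams of fat

# A double hamburger with 31 g of protein is about 1 SD above the mean:
print(round(predict_fat(31), 1))              # about 37.1 g of fat
```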
The Regression Line in Real Units
Next, how do we find the y-intercept b0? Remember that the line has to go through the mean-mean point, (x̄, ȳ). That is, the model predicts ȳ to be the value that corresponds to x̄. We can put the means into the equation and write

ȳ = b0 + b1 · x̄

Solving for b0 gives us

b0 = ȳ − b1 · x̄

For our Burger King example this comes out to be

b0 = 23.5 g fat − 0.97 (g fat per g protein) × 17.2 g protein ≈ 6.8 g fat

Putting this back into the regression equation gives:

predicted Fat = 6.8 + 0.97 · Protein

The slope of 0.97 means that an additional gram of protein is associated with an additional 0.97 grams of fat, on average. Less formally, we might say that BK sandwiches pack about 0.97 grams of fat per gram of protein. Keep in mind that for slope, units matter!

Slope and Units
The units of slope are always the units of y per unit of x. Changing units doesn't change the correlation, but it does change the standard deviations. The slope introduces the units into the equation by multiplying the correlation by the ratio of s_y to s_x. For example, if children grow an average of 3 inches per year, that is the same as growing about 0.21 millimeters per day.

The Intercept
What is the significance of the intercept of the BK regression line, 6.8? It is the value of y when x is zero. So, for BK items, the model predicts 6.8 grams of fat even when an item contains no protein.

Note!
When using a regression model, it is vital that we check the same conditions for regression as we did for correlation: the Quantitative Variables Condition, the Straight Enough Condition, and the Outlier Condition.
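Putting the pieces together, here is a minimal Python sketch that builds the real-units regression equation from the rounded summary statistics quoted in these slides and shows that changing units rescales the slope but leaves the correlation alone; the outputs are approximate, not exact menu values.

```python
# A minimal sketch of the real-units regression line built from summary
# statistics (means, SDs, and r) quoted in the slides, plus a check of how
# a change of units rescales the slope.
protein_mean, protein_sd = 17.2, 14.0    # grams of protein
fat_mean, fat_sd = 23.5, 16.4            # grams of fat
r = 0.83

b1 = r * fat_sd / protein_sd             # slope: g fat per g protein
b0 = fat_mean - b1 * protein_mean        # intercept: g fat when protein = 0
print(round(b1, 2), round(b0, 1))        # about 0.97 and 6.8

def predicted_fat(protein_grams):
    """predicted Fat = b0 + b1 * Protein, in grams."""
    return b0 + b1 * protein_grams

print(round(predicted_fat(31), 1))       # about 36.9 g, close to the rounded 37.1 g above

# Re-expressing fat in milligrams multiplies s_y (and therefore the slope)
# by 1000, but the correlation r is unchanged.
b1_mg = r * (fat_sd * 1000) / protein_sd
print(round(b1_mg, 0))                   # about 972 mg of fat per g of protein
```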