Chi-square Test for Goodness of Fit (GOF)
Copyright © 2009 Pearson Education, Inc.

What would it have been like to be a crew member on the Titanic? The crew made up 40% of the people on board, but 45% of the deaths were crew members (685 of the 1517 deaths). Did the crew pay a heavier price?

One-Proportion z-Test
Population: deaths on the Titanic. Parameter of interest: p, the proportion of the deaths that were crew.
H0: p = 0.40
Ha: p > 0.40
Independence: population > 10n? Normality: np > 10 (yes) and nq > 10 (yes).

z = (0.4516 - 0.40) / sqrt((0.40)(0.60)/1517) ≈ 4.10
P(z ≥ 4.10) ≈ 0.0000208

Because the P-value is so low, we reject H0. We believe that the proportion of crew members among the deaths was higher than expected.

Class          On board   Percent   Observed (lost)   Expected (lost)      O-E   (O-E)^2/E
First class         329       15%               130            224.51   -94.51       39.79
Second class        285       13%               166            194.49   -28.49        4.17
Third class         710       32%               536            484.51    51.49        5.47
Crew                899       40%               685            613.49    71.51        8.34
Total              2223      100%              1517                       0.00       57.77

Chi-square Goodness of Fit
Population: Titanic deaths. Model: 15% first class, 13% second class, 32% third class, 40% crew.
H0: The model is a good fit.
Ha: The model is not a good fit.
Bias: not an SRS; proceed with caution. Independence: population > 10n? No expected counts are less than 1, and no more than 20% of the expected counts are less than 5.
χ² ≈ 57.77, df = 3, so P(χ² ≥ 57.77) is essentially 0.
Because the P-value is so low, we reject H0. The deaths on the Titanic were not distributed according to the model.

What's new for chi-square GOF
No parameter of interest; rather, a model describing the distribution among several categories.
H0: the model is good. Ha: the model is not good.
No Normality check; rather, no expected counts lower than 1 and no more than 20% of the expected counts lower than 5.
df = number of categories - 1
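The arithmetic in the Titanic table can be reproduced directly from the counts. A minimal sketch in plain Python (no libraries; the observed and on-board counts come from the table above):

```python
# Counts from the slide: deaths observed per class, and people on board per class
observed = {"First class": 130, "Second class": 166, "Third class": 536, "Crew": 685}
on_board = {"First class": 329, "Second class": 285, "Third class": 710, "Crew": 899}

total_people = sum(on_board.values())   # 2223
total_deaths = sum(observed.values())   # 1517

chi_sq = 0.0
for group, obs in observed.items():
    # Expected deaths if deaths were spread in proportion to people on board
    expected = total_deaths * on_board[group] / total_people
    chi_sq += (obs - expected) ** 2 / expected

print(round(chi_sq, 2))  # 57.77, matching the table's total; df = 4 - 1 = 3
```

Note that the expected counts use the exact on-board proportions (e.g., 329/2223), not the rounded percentages, which is why the table shows 224.51 rather than 0.15 × 1517 = 227.55.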
The statistic for goodness of fit
The chi-square test statistic measures how closely the observed data match what is expected under a particular model:
χ² = Σ (O-E)^2/E,  df = number of categories - 1

Step 3: Test Statistic
χ² = some value; P-value = P(χ² ≥ some value)

AP Statistics, Section 13.1
Color     Observed count   Expected percentage   Expected count   (O-E)^2/E
Blue                                       24%
Brown                                      13%
Green                                      16%
Orange                                     20%
Red                                        13%
Yellow                                     14%

Chapter 7: Scatterplots, Association, and Correlation

Looking at Scatterplots
Scatterplots may be the most common and most effective display for data. In a scatterplot, you can see patterns, trends, relationships, and even the occasional extraordinary value sitting apart from the others. Scatterplots are the best way to start observing the relationship and the ideal way to picture associations between two quantitative variables.

Looking at Scatterplots (cont.)
When looking at scatterplots, we will look for direction, form, strength, and unusual features.
Direction: A pattern that runs from the upper left to the lower right is said to have a negative direction. A trend running the other way has a positive direction.

Looking at Scatterplots (cont.)
The figure shows a negative direction between the years since 1970 and the prediction errors made by NOAA. As the years have passed, the predictions have improved (errors have decreased).

Looking at Scatterplots (cont.)
The example in the text shows a negative association between central pressure and maximum wind speed: as the central pressure increases, the maximum wind speed decreases.

Looking at Scatterplots (cont.)
Form: If there is a straight line (linear) relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form.

Looking at Scatterplots (cont.)
Form: If the relationship curves sharply, the methods of this book cannot really help us.

Looking at Scatterplots (cont.)
Strength: At one extreme, the points appear to follow a single stream (whether straight, curved, or bending all over the place).

Looking at Scatterplots (cont.)
Strength: At the other extreme, the points appear as a vague cloud with no discernible trend or pattern. Note: we will quantify the amount of scatter soon.

Looking at Scatterplots (cont.)
Unusual features: Look for the unexpected. Often the most interesting thing to see in a scatterplot is the thing you never thought to look for. One example of such a surprise is an outlier standing away from the overall pattern of the scatterplot. Clusters or subgroups should also raise questions.

Roles for Variables
It is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis. This determination is made based on the roles played by the variables. When the roles are clear, the explanatory or predictor variable goes on the x-axis, and the response variable goes on the y-axis.

Roles for Variables (cont.)
The roles that we choose for variables are more about how we think about them than about the variables themselves. Just placing a variable on the x-axis doesn't necessarily mean that it explains or predicts anything. And the variable on the y-axis may not respond to it in any way.
Correlation
Data collected from students in Statistics classes included their heights (in inches) and weights (in pounds). Here we see a positive association and a fairly straight form, although there seems to be a high outlier.

Correlation (cont.)
How strong is the association between weight and height of Statistics students? If we had to put a number on the strength, we would not want it to depend on the units we used. A scatterplot of heights (in centimeters) and weights (in kilograms) doesn't change the shape of the pattern.

Correlation (cont.)
The correlation coefficient (r) gives us a numerical measurement of the strength of the linear relationship between the explanatory and response variables:
r = Σ(z_x · z_y) / (n - 1)

Correlation (cont.)
For the students' heights and weights, the correlation is 0.644. What does this mean in terms of strength? We'll address this shortly.

Correlation Conditions
Correlation measures the strength of the linear association between two quantitative variables. Before you use correlation, you must check several conditions:
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition

Correlation Conditions (cont.)
Quantitative Variables Condition: Correlation applies only to quantitative variables. Don't apply correlation to categorical data masquerading as quantitative. Check that you know the variables' units and what they measure.

Correlation Conditions (cont.)
Straight Enough Condition: You can calculate a correlation coefficient for any pair of variables. But correlation measures the strength only of the linear association, and will be misleading if the relationship is not linear.
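The correlation formula, r as the average product of z-scores, is easy to sketch in code. The five-point data set below is made up for illustration (it is not the students' heights and weights):

```python
import math

def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in xs) / (n - 1))
    sy = math.sqrt(sum((v - my) ** 2 for v in ys) / (n - 1))
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Small made-up data set for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(correlation(x, y), 4))  # 0.7746: a moderately strong positive association
```

Because r is built entirely from z-scores, it always lands between -1 and +1.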
Correlation Conditions (cont.)
Outlier Condition: Outliers can distort the correlation dramatically. An outlier can make an otherwise small correlation look big, or hide a large correlation. It can even give an otherwise positive association a negative correlation coefficient (and vice versa). When you see an outlier, it's often a good idea to report the correlations with and without that point.

Correlation Properties
The sign of a correlation coefficient gives the direction of the association. Correlation is always between -1 and +1. Correlation can be exactly equal to -1 or +1, but these values are unusual in real data because they mean that all the data points fall exactly on a single straight line. A correlation near zero corresponds to a weak linear association.

Correlation Properties (cont.)
Correlation treats x and y symmetrically: the correlation of x with y is the same as the correlation of y with x. Correlation has no units. Correlation is not affected by changes in the center or scale of either variable. Correlation depends only on the z-scores, and they are unaffected by changes in center or scale.

Correlation Properties (cont.)
Correlation measures the strength of the linear association between the two variables. Variables can have a strong association but still have a small correlation if the association isn't linear. Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.

Correlation ≠ Causation
Whenever we have a strong correlation, it is tempting to explain it by imagining that the predictor variable has caused the response. Scatterplots and correlation coefficients never prove causation.
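One property above, that correlation has no units, can be demonstrated directly: rescaling either variable (say, converting inches to centimeters and pounds to kilograms) leaves r unchanged, because the z-scores are unchanged. The height/weight numbers below are made up for illustration:

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in xs) / (n - 1))
    sy = math.sqrt(sum((v - my) ** 2 for v in ys) / (n - 1))
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / ((n - 1) * sx * sy)

# Hypothetical heights (inches) and weights (pounds), made up for this demo
heights_in = [63, 66, 68, 70, 72, 75]
weights_lb = [120, 140, 155, 160, 175, 210]

r_original = correlation(heights_in, weights_lb)
# Change of scale only: inches -> centimeters, pounds -> kilograms
r_metric = correlation([h * 2.54 for h in heights_in],
                       [w * 0.453592 for w in weights_lb])
print(abs(r_original - r_metric) < 1e-9)  # True: r has no units
```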
A hidden variable that stands behind a relationship and determines it by simultaneously affecting the other two variables is called a lurking variable.

What Can Go Wrong?
Don't confuse "correlation" with "causation." Scatterplots and correlations never demonstrate causation. These statistical tools can only demonstrate an association between variables.

What Can Go Wrong? (cont.)
Don't correlate categorical variables. Be sure to check the Quantitative Variables Condition.
Be sure the association is linear. There may be a strong association between two variables that have a nonlinear association.

What Can Go Wrong? (cont.)
Don't assume the relationship is linear just because the correlation coefficient is high. Here the correlation is 0.979, but the relationship is actually bent.

What Can Go Wrong? (cont.)
Beware of outliers. Even a single outlier can dominate the correlation value. Make sure to check the Outlier Condition.

What have we learned?
We examine scatterplots for direction, form, strength, and unusual features. Although not every relationship is linear, when the scatterplot is straight enough, the correlation coefficient is a useful numerical summary. The sign of the correlation tells us the direction of the association. The magnitude of the correlation tells us the strength of a linear association. Correlation has no units, so shifting or scaling the data, standardizing, or swapping the variables has no effect on the numerical value.

What have we learned? (cont.)
Doing Statistics right means that we have to Think about whether our choice of methods is appropriate.
Before finding or talking about a correlation, check the Straight Enough Condition. Watch out for outliers! Don't assume that a high correlation or strong association is evidence of a cause-and-effect relationship; beware of lurking variables!

Chapter 8: Linear Regression

Fat Versus Protein: An Example
The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu.

Residuals
The model won't be perfect, regardless of the line we draw. Some points will be above the line and some will be below. The estimate made from a model is the predicted value (denoted ŷ).

Residuals (cont.)
The difference between the observed value and its associated predicted value is called the residual. To find the residuals, we always subtract the predicted value from the observed one:
residual = observed - predicted = y - ŷ

Residuals (cont.)
A negative residual means the predicted value is too big (an overestimate). A positive residual means the predicted value is too small (an underestimate).

"Best Fit" Means Least Squares
Some residuals are positive, others are negative, and, on average, they cancel each other out. So, we can't assess how well the line fits by adding up all the residuals. Similar to what we did with deviations, we square the residuals and add the squares. The smaller the sum, the better the fit. The line of best fit is the line for which the sum of the squared residuals is smallest.
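"Smallest sum of squared residuals" can be seen directly in code. In this sketch the data set is made up, and the coefficients 2.2 and 0.6 are the least-squares intercept and slope for these five points (found from the standard formulas); any nearby line has a larger sum:

```python
def sse(xs, ys, b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Small made-up data set for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = 2.2, 0.6          # least-squares coefficients for these points
best = sse(x, y, b0, b1)   # 2.4

# The least-squares line beats nearby alternatives:
print(best < sse(x, y, 2.2, 0.7))  # True (steeper slope does worse)
print(best < sse(x, y, 2.0, 0.6))  # True (shifted intercept does worse)
```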
The Linear Model
Remember from Algebra that a straight line can be written as y = mx + b. In Statistics we use a slightly different notation:
ŷ = b0 + b1·x
We write ŷ to emphasize that the points that satisfy this equation are just our predicted values, not the actual data values.

The Linear Model (cont.)
We write b1 and b0 for the slope and intercept of the line. The b's are called the coefficients of the linear model. The coefficient b1 is the slope, which tells us how rapidly ŷ changes with respect to x. The coefficient b0 is the intercept, which tells where the line hits (intercepts) the y-axis.

The Least Squares Line
In our model, we have a slope (b1). The slope is built from the correlation and the standard deviations:
b1 = r · (sy / sx)
Our slope is always in units of y per unit of x.

The Least Squares Line (cont.)
In our model, we also have an intercept (b0). The intercept is built from the means and the slope:
b0 = ȳ - b1·x̄
Our intercept is always in units of y.

Fat Versus Protein: An Example
The regression line for the Burger King data fits the data well. The equation is ŷ = 6.8 + 0.97(protein). The predicted fat content for a BK Broiler chicken sandwich is 6.8 + 0.97(30) = 35.9 grams of fat.

The Least Squares Line (cont.)
Since regression and correlation are closely related, we need to check the same conditions for regression as we did for correlation:
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition

Correlation and the Line
Moving one standard deviation away from the mean in x moves us r standard deviations away from the mean in y.
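The slope and intercept formulas can be checked numerically. A sketch using the same kind of small made-up data set as before (not the Burger King data):

```python
import math

# Small made-up data set for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((v - mx) ** 2 for v in x) / (n - 1))
sy = math.sqrt(sum((v - my) ** 2 for v in y) / (n - 1))
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((n - 1) * sx * sy)

b1 = r * sy / sx    # slope: units of y per unit of x
b0 = my - b1 * mx   # intercept: forces the line through (x-bar, y-bar)

print(round(b1, 4), round(b0, 4))   # 0.6 2.2
print(round(b0 + b1 * 4, 4))        # 4.6: the prediction y-hat at x = 4
```

Because b0 = ȳ - b1·x̄, the least-squares line always passes through the point of means (x̄, ȳ).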
This relationship is shown in a scatterplot of z-scores for fat and protein.

Correlation and the Line (cont.)
Put generally, moving any number of standard deviations away from the mean in x moves us r times that number of standard deviations away from the mean in y.

How Big Can Predicted Values Get?
r cannot be bigger than 1 (in absolute value), so each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was. This property of the linear model is called regression to the mean; the line is called the regression line.

Residuals Revisited
The linear model assumes that the relationship between the two variables is a perfect straight line. The residuals are the part of the data that hasn't been modeled.
Data = Model + Residual, or (equivalently) Residual = Data - Model. In symbols: e = y - ŷ

Residuals Revisited (cont.)
Residuals help us to see whether the model makes sense. When a regression model is appropriate, nothing interesting should be left behind. After we fit a regression model, we usually plot the residuals in the hope of finding... nothing.

Residuals Revisited (cont.)
The residuals for the BK menu regression look appropriately boring.

Regression Assumptions and Conditions
Quantitative Variables Condition: Regression can only be done on two quantitative variables, so make sure to check this condition.
Straight Enough Condition: The linear model assumes that the relationship between the variables is linear. A scatterplot will let you check that the assumption is reasonable.

Regression Assumptions and Conditions (cont.)
It's a good idea to check linearity again after computing the regression, when we can examine the residuals. You should also check for outliers, which could change the regression. If the data seem to clump or cluster in the scatterplot, that could be a sign of trouble worth looking into further.

Regression Assumptions and Conditions (cont.)
If the scatterplot is not straight enough, stop here. You can't use a linear model for any two variables, even if they are related. They must have a linear association or the model won't mean a thing. Some nonlinear relationships can be saved by re-expressing the data to make the scatterplot more linear.

Regression Assumptions and Conditions (cont.)
Outlier Condition: Watch out for outliers. Outlying points can dramatically change a regression model. Outliers can even change the sign of the slope, misleading us about the underlying relationship between the variables.

Reality Check: Is the Regression Reasonable?
Statistics don't come out of nowhere. They are based on data. The results of a statistical analysis should reinforce your common sense, not fly in its face. If the results are surprising, then either you've learned something new about the world or your analysis is wrong. When you perform a regression, think about the coefficients and ask yourself whether they make sense.

What Can Go Wrong?
Don't fit a straight line to a nonlinear relationship. Beware of extraordinary points (y-values that stand off from the linear pattern or extreme x-values). Don't invert the regression: to swap the predictor-response roles of the variables, we must fit a new regression equation. Don't extrapolate beyond the data; the linear model may no longer hold outside of the range of the data.
Don't infer that x causes y just because there is a good linear model for their relationship; association is not causation. Don't choose a model based on R² alone.

What have we learned?
When the relationship between two quantitative variables is fairly straight, a linear model can help summarize that relationship. The regression line doesn't pass through all the points, but it is the best compromise in the sense that it has the smallest sum of squared residuals.

What have we learned? (cont.)
The correlation tells us several things about the regression: The slope of the line is based on the correlation, adjusted for the units of x and y. For each SD in x that we are away from the x mean, we expect to be r SDs in y away from the y mean. Since r is always between -1 and +1, each predicted y is fewer SDs away from its mean than the corresponding x was (regression to the mean).

What have we learned? (cont.)
The residuals also reveal how well the model works. If a plot of the residuals against predicted values shows a pattern, we should re-examine the data to see why. The standard deviation of the residuals quantifies the amount of scatter around the line.

What have we learned? (cont.)
The linear model makes no sense unless the Linear Relationship Assumption is satisfied. Also, we need to check the Straight Enough Condition and Outlier Condition with a scatterplot. For the standard deviation of the residuals, we must make the Equal Variance Assumption. We check it by looking at both the original scatterplot and the residual plot for the Does the Plot Thicken? Condition.
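As a numerical footnote to the residual facts summarized above, here is a sketch using a small made-up data set and its least-squares line (coefficients 2.2 and 0.6 for these points): the residuals always sum to zero, and their standard deviation summarizes the scatter around the line.

```python
import math

# Small made-up data set and its least-squares coefficients
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

# e = y - y-hat for each point
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(e, 2) for e in residuals])   # [-0.8, 0.6, 1.0, -0.6, -0.2]
print(abs(round(sum(residuals), 10)))     # 0.0: least-squares residuals cancel out

# Standard deviation of the residuals (n - 2 degrees of freedom in regression)
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))
print(round(s_e, 3))                      # 0.894: typical scatter around the line
```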