AP Statistics Chapter 3
Chapter 3 – Examining Relationships
Scatterplots, Correlation, and Least-Squares Regression

Section 3.1: Variables, Scatterplots, Interpreting Scatterplots, Correlation
• Define vocabulary and types of variables
• Construct a scatterplot using Excel and/or your graphing calculator
• Interpret various scatterplots
• Add categorical data to a scatterplot on your calculator

Scatterplot = displays the relationship between two quantitative variables measured on the same individuals.
• Can examine the nature of a relationship between two variables.
• Examine if a variable can explain changes in the other variable.
• If predicting an outcome, an explanatory and a response variable must be identified.
• Naming variables in this way does not mean one variable causes the changes in the other…remember…causation is very difficult to prove.

Explanatory Variable
• Attempts to explain the observed outcomes
• Any factor that can influence the response
• The variable that attempts to explain the relationship with the other variable.
• If there is one, it will be the variable on the x-axis

Response Variable
• Measures an outcome of a study
• Variable that responds to the other variable being used.
• If there is one, it will be the variable on the y-axis

Each individual in the data appears as the point in the plot fixed by the values of both variables for that particular individual.

Explanatory/Response Variable Process
• Look at the data you have been given with a graph
• Calculate any numerical summary you can for the data
  o 2-Var Stats
• Summarize the shape, overall patterns, outliers, etc. (see correct statistical vocabulary)
• If conditions are met, find a mathematical model for the data.

To construct a scatterplot:
• Use graphing calculator.
• Use Excel – not all that useful

Interpreting Scatterplots
• Describe the overall pattern of the data
• Look for any deviations from the pattern.
• Form (clusters, linear, curved)
• Direction (negatively associated, positively associated, no association)
• Strength
  o How closely the points follow a clear form (how closely together the points are and how closely they resemble a line)
  o Strong, moderate, weak, or a combination of these words.
• Outlier (observation that falls outside the overall pattern of the relationship)
• Clusters (often when looking at the subject of the data, a possible explanation can come about)

Positively Associated
• High values of one tend to accompany high values of the other variable
• Low values of one tend to accompany low values of the other variable

Negatively Associated
• High values of one tend to accompany low values of the other variable.
• Low values of one tend to accompany high values of the other variable

Nonlinear data
• Data does not have to have a linear relationship
• Certain data can be transformed to show a linear scatterplot (Chapter 12)

Activity - Height vs. Shoe Size
Is there a relationship between your height and your shoe size?
• On the slip of paper provided, label and complete like the following using your OWN scores:
  o Height: 67 inches
  o Shoe size: 8 (men’s)
• Go to L1 and L2 on your calculator, copy the data from my calculator
• Create a scatterplot on your calculator using the data.
Is there a relationship between your height and your shoe size? Interpret the scatterplot.

Correlation
• Calculate the correlation using your graphing calculator.
• Interpret data using a scatterplot and the numeric summary

1. Interpreting Scatterplots – Graph Analysis (Visual)
• Form – clusters, linear, curved
• Direction – positive, negative, no association
• Strength – weak, moderate, strong relationship
• Discuss outliers (if present)
• If no outliers, STATE there are no outliers.

2. Numeric Analysis
• Correlation
  o Numerical measure of the direction and strength of a linear relationship between two variables.
  o “r”, where

$$r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

  o Knowing the formula helps in understanding the important properties of correlation.
  o Correlation is calculated on your graphing calculator.
  o Turn “DiagnosticOn” (under Catalog) in order to see it.
  o Average of the products of the standardized variables.

Facts about Correlation
1. Makes no distinction between explanatory and response variables
2. Requires both variables be quantitative (no categorical)
3. It has no unit of measurement
4. A positive value indicates a positive association and a negative value indicates a negative association
5. The value will always be a number between “-1” and “1”
6. ONLY measures strength of linear relationships
7. It is NOT a resistant measure; therefore, it is strongly affected by outliers

Correlation is NOT a complete summary of two data sets.
• When reporting a numerical analysis, ALSO report the following:
  § Mean of data set #1
  § Standard Deviation of data set #1
  § Mean of data set #2
  § Standard Deviation of data set #2

Interpreting Correlation
Correct vocabulary to use when interpreting correlation (the strength):
• Very Weak
• Weak
• Moderately Weak
• Moderate
• Moderately Strong
• Strong
• Very Strong

Section 3.2: Least-Squares Regression and $r^2$
• Calculate a least-squares regression line
• Calculate the coefficient of determination.
• Interpret your findings.

Least-Squares Regression Line (LSRL)
• A mathematical model of the data.
• Method for finding a line that summarizes the relationship between two variables.
• A line that is as close as possible to the points in the vertical direction.
• The line should make the vertical distances of the points as small as possible since we are predicting “y”.
• Prediction error occurs in the “y” value.
• Round values to 4 decimal places when writing it down.
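Optional check away from the calculator: the correlation formula from Section 3.1 (the average of the products of the standardized values) can be worked out step by step. The Python sketch below is only an illustration. The function names and the height/shoe-size numbers are made up for this example and are not the class data.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    """r = (1 / (n - 1)) * sum of (standardized x) * (standardized y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y)]
    return sum(products) / (len(x) - 1)


# Made-up illustration data (NOT the class data).
heights = [62, 64, 65, 67, 68, 70, 72]        # inches
shoe_sizes = [6, 7, 7.5, 8, 9, 10, 11.5]

r = correlation(heights, shoe_sizes)
print("r =", round(r, 4))

# Fact 3 (no unit of measurement): converting inches to centimeters
# leaves r unchanged, apart from floating-point rounding.
heights_cm = [h * 2.54 for h in heights]
print("r (cm) =", round(correlation(heights_cm, shoe_sizes), 4))
```

Because each value is standardized before multiplying, the units cancel, which is why r has no unit of measurement and does not care which variable is called “x”.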
Least-Squares Regression Line
• The least-squares regression line of “y” on “x” is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

Regression Line
• Straight line that describes how a response variable “y” changes as an explanatory variable “x” changes.
• Used to make predictions of “y” values given certain “x” values
• Requires explanatory and response variables to be assigned.
• Calculate a linear regression on your calculator using “LinReg(a+bx)” (number 8 in the list under “Stat”, “Calc”)
  o Keeps consistent with the notation used on the AP and in our textbook.

Equation for Least-Squares Regression Line

$$\hat{y} = a + bx$$

where
slope: $b = r \dfrac{s_y}{s_x}$
intercept: $a = \bar{y} - b\bar{x}$

• Important note:
  o y = observed value of the variable
  o ŷ (y-hat) = predicted value of the variable
• Slope
  o Rate of change
  o The amount of change in ŷ when “x” increases by 1.
• Intercept
  o Needed for drawing the LSRL
  o Only statistically meaningful when the explanatory variable “x” can take a value close to zero.

Coefficient of Determination - $r^2$
• The fraction of variation in the values of “y” that is explained by the least-squares regression of “y” on “x”.
• The proportion of the total sample variability that is explained by the least-squares regression of “y” on “x”.
• The percent variation in “y” that can be explained by the least-squares regression of “y” on “x”.
• To have a large percentage, the SSE (sum of squares for error), $\sum (y - \hat{y})^2$, must be small.
• Example Interpretation: “83% of the variation in (response variable) is explained by the least-squares regression of (response variable) on (explanatory variable).”

Important concepts about Least-Squares Regression/Coefficient of Determination
1. The distinction between the explanatory and response variables is essential in regression.
• A line that is as close as possible to the points in the vertical direction.
• Choosing a different response variable (different y-values) will change the least-squares regression line.
2. There is a close connection between correlation and slope of a LSRL.
• Along the regression line, a change of one standard deviation in “x” corresponds to a change of r standard deviations in “y”.
• $b = r \dfrac{s_y}{s_x}$ or $b = \dfrac{r \cdot s_y}{s_x}$
• “r” is rarely, if ever, “1” or “-1”. Therefore, the standard deviation of “y” will be multiplied by a fraction between “-1” and “1”.
3. The least-squares regression line will always pass through the point $(\bar{x}, \bar{y})$ on the graph of “y” against “x”.
4. The coefficient of determination ($r^2$) is the fraction of the variation in values of “y” that can be explained by the least-squares regression of “y” on “x”.

Section 3.2 (again) – Residuals and Residual Plots
• Calculate residuals using a graphing calculator
• Construct a residual plot using your graphing calculator.
• Interpret your findings.

Coefficient of Determination - $r^2$
• Percent or proportion of the variation in the response variable that is explained by the regression line on the explanatory variable.

Linear Regression Line
• Mathematical model for the overall pattern of a linear relationship between an explanatory variable and a response variable.
• Least-squares regression line makes the vertical distances from the observations to the LSRL as small as possible.

Residuals:
• The small vertical distances between observations and the LSRL represent left-over variation in the response variable.
• Difference between the response variable’s observed value and the predicted value from the regression line.
• Residual = $y - \hat{y}$ for each data point
• Show how far the data fall from the regression line.
• Positive values mean the data point lies above the regression line
• Negative values mean the data point lies below the regression line
• The mean of the least-squares residuals is always zero. (You will find it is almost equal to zero, due to rounding errors.)

Residual Plots:
• A scatterplot of the regression residuals against the explanatory variable.
• The straight line at zero corresponds to the regression line.
• Assesses the fit of a regression line

Interpretation of Residual Plots
• IF the residual plot shows no systematic pattern,
  o THEN the LSRL captures the overall relationship between the explanatory and the response variables.

Other Interpretations of Residual Plots
• A curved pattern shows that the relationship is NOT linear.
• This means that an LSRL is not the correct method for this data.
• Increasing or decreasing spread about the line as “x” increases indicates the prediction of “y” will be less accurate for those “x” where the increased spread is occurring.

NECESSARY STEP:
• Prior to calculating residuals and a residual plot, you need to complete a scatterplot and a least-squares regression
• If you do not complete the above step, there will be NO residual plot for the data you are using.

Residuals and Residual Plots
Example: Does the age at which a child begins to talk predict a later score on a test of mental ability?
a) Enter the data into your graphing calculator and construct a scatterplot ONLY. Interpret your findings from the scatterplot.
b) In the context of the problem, describe the relationship you observe from the scatterplot.
c) Calculate the correlation and interpret your findings.
d) Find the least-squares regression line for the data (round to 4 decimal places)
e) Calculate the coefficient of determination and interpret your findings.
f) Use the least-squares regression equation to predict the scores of Child #7, #13, and #20.
g) Using your graphing calculator, find all the residuals for the data set. (Done under the “Edit List” feature) Write down the first 2 residual values.
h) Make a residual plot. Interpret your residual plot. What does it tell you about your least-squares regression line for the data?

3.69 The mean height of American women in their early twenties is about 64.5 inches and the standard deviation is about 2.5 inches.
The mean height of men the same age is about 68.5 inches, with standard deviation about 2.7 inches. If the correlation between the heights of husbands and wives is about r = 0.5, what is the slope of the regression line of the husband’s height on the wife’s height in young couples? Use your calculator to get a regression line. Predict the height of the husband of a woman who is 67 inches tall.

3.57 In Professor Friedman’s economics course the correlation between the students’ total scores prior to the final examination and their final examination scores is r = 0.6. The pre-exam totals for all students in the course have mean 280 and standard deviation 30. The final exam scores have mean 75 and standard deviation 8. Professor Friedman has lost Julie’s final exam but knows that her total before the exam was 300. He decides to predict her final exam score from her pre-exam total.
(a) What is the slope of the least-squares regression line of final exam scores on pre-exam total scores in this course? What is the intercept?
(b) Use the regression line to predict Julie’s final exam score.
(c) Julie doesn’t think this method accurately predicts how well she did on the final exam. What is $r^2$? Use the value you get to argue that her actual score could have been much higher (or much lower) than the predicted value.

Section 3.2 (still) – Influential Observations
• Define influential points
• Determine if a data point is an outlier or an influential point
• Interpret your findings.

Influential Points
• An observation that, when removed from the data, makes a distinct change in the result of the calculation(s).
  § Slope
  § Correlation
  § Coefficient of Determination
• Influential Point - an outlier with a small residual value.
• Often, not always, they are outliers in the “x” direction of a scatterplot.
• Located in an extreme position on the explanatory scale.

Is it an Influential Point?
To Test Influential Points
• Calculate correlation, least-squares regression, etc. WITH the observation (potential influential point) included in the data.
• Then, REMOVE the potential influential point and run the calculations again.
• COMPARE results to see if it made a strong influence on the calculations.
• Graph the data without the potential influential point on a scatterplot with the LSRL. Is it a better fit of the data?

Continuation From Section 3.3 (again) – Residuals (from yesterday)
• Use the example about the age at which children speak and their test score on the Gesell exam to complete the following.
• You may have to retype the data or rerun the information if you deleted it.
j) ONLY remove Child 19 from the data. Construct:
• A scatterplot
• Least-squares regression (into Y2) (keep original LSRL in Y1)
• Correlation.
On the scatterplot, compare your original least-squares regression (Y1) and the new one (Y2). Write down the new regression line and correlation.
k) Put Child 19 back into your data lists.
l) ONLY remove Child 18 from the data. Construct:
• A scatterplot
• Least-squares regression (into Y2) (keep original LSRL in Y1)
• Correlation.
On the scatterplot, compare your original least-squares regression (Y1) and the new one (Y2). Write down the new regression line and correlation.
m) Compare the regression line in part j (when Child 19 is removed) with the original regression line from yesterday. Does Child 19 appear to be an influential point? If so, why?
n) Compare the regression line in part l (when Child 18 is removed) with the original regression line from yesterday. Does Child 18 appear to be an influential point? If so, why?

Minitab
• We will not use the program, but we will interpret information and outputs provided in the book.
• For the AP, you need to be able to read the information provided in a Minitab output.
• This output is on Page 156 in the textbook.
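Optional illustration of the “To Test Influential Points” steps (fit WITH the point, REMOVE it, refit, COMPARE): the Python sketch below uses the notes’ formulas b = r·s_y/s_x and a = ȳ − b·x̄. The helper names `lsrl` and `drop` and the eight data points are made up for illustration. They are NOT the Gesell data from the textbook, and the last point was deliberately placed far out in the “x” direction.

```python
from math import sqrt


def lsrl(x, y):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    # Correlation: average product of the standardized values.
    r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(x, y)) / (n - 1)
    b = r * s_y / s_x          # slope
    a = y_bar - b * x_bar      # intercept
    return a, b, r


def drop(values, index):
    """Copy of the list with the observation at `index` removed."""
    return values[:index] + values[index + 1:]


# Made-up illustration data (NOT the class data).
x = [8, 9, 10, 11, 12, 13, 14, 30]
y = [110, 104, 102, 100, 97, 95, 93, 100]
suspect = 7  # list position of the potential influential point (30, 100)

# Step 1: calculations WITH the point included.
a_all, b_all, r_all = lsrl(x, y)
# Step 2: REMOVE the point and run the calculations again.
a_out, b_out, r_out = lsrl(drop(x, suspect), drop(y, suspect))

# Step 3: COMPARE slope, correlation, and coefficient of determination.
print(f"with point:    y-hat = {a_all:.4f} + {b_all:.4f}x, "
      f"r = {r_all:.4f}, r^2 = {r_all ** 2:.4f}")
print(f"without point: y-hat = {a_out:.4f} + {b_out:.4f}x, "
      f"r = {r_out:.4f}, r^2 = {r_out ** 2:.4f}")
```

If the slope, r, and r² change distinctly between the two printed lines, the removed observation is behaving like an influential point. The last step in the notes, graphing both lines (Y1 and Y2) on the scatterplot, is still done on the calculator.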