Linear Regression and Correlation
February 11, 2009

The Big Ideas
- To understand a set of data, start with a graph or graphs.
- If the data concern the relationship between two quantitative variables measured on the same individuals, use a scatterplot. If the variables have an explanatory-response relationship, be sure to put the explanatory variable on the horizontal (x) axis of the plot.
- When you look at a statistical graph, look for the overall pattern and for striking deviations from that pattern.
- When you look at a scatterplot, describe the overall pattern by its form, direction, and strength. Outliers are an important kind of deviation from the pattern.
- If the overall pattern is roughly linear (straight-line), the correlation r measures its direction and strength.
Thanks to David S. Moore and W.H. Freeman and Company for this summary.
Linear Equations
y = a + bx
- These describe straight lines.
- The constant a is the y-intercept.
- The constant b is the slope.
- The variable x is the independent or explanatory variable.
- The variable y is the dependent or response variable.

Example 1
- African peoples often eat “bushmeat”, the meat of wild animals.
- The explanatory variable is fish supply per person, in kilograms.
- The response is the percent change in the total wildlife “biomass” (weight in tons).
Thanks to David S. Moore and W.H. Freeman and Company for this example.

Analyzing Relationships
- Form: There is a general linear (straight-line) pattern.
- Direction: The plot has a clear lower-left to upper-right pattern, so that more positive (or less negative) changes in wildlife go with higher fish supply. This is a positive association between the variables.
- Strength: The strength of a relationship is how closely the points follow the overall pattern. A “perfectly strong” linear relationship means that all points fall exactly on a straight line. The relationship here is moderately strong.
Line of Best Fit
- We can make predictions about the response variable based on the explanatory variable by finding points on a line that best fits the data.
- We use software to find this line of best fit.
- Our responsibility is to check that the line is sensible and then interpret the value of the predictions.
- The regression equation is the equation of the line of best fit, y = a + bx.
- A y value on the line is a predicted value of the response variable for a given explanatory value.
- We use the notation $\hat{y}$ for predicted values and $y$ for observed values. Thus $\hat{y}_0 = a + b x_0$ is the value of the response variable that is predicted by the regression equation for the explanatory value $x_0$.

Errors
- The error or residual measures the distance between the predicted response to a given explanatory value and the actual response for that explanatory value, i.e. the vertical distance between the line and the data.
- We use the notation $\epsilon$ for the error. So, for a data point $(x_i, y_i)$ we have $\epsilon_i = |y_i - \hat{y}_i|$.

Example 2
Life expectancy and the percent of people who can read.
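The prediction and error definitions above can be sketched in a few lines of Python. This is only an illustration: the intercept `a`, the slope `b`, and the three data points are made-up values, not data from the slides, and the function names `predict` and `error` are hypothetical.

```python
# Hypothetical illustration values; these are NOT data from the slides.
a = 1.0  # assumed y-intercept
b = 2.0  # assumed slope


def predict(x, a, b):
    """Return the predicted response y-hat = a + b*x."""
    return a + b * x


def error(x, y, a, b):
    """Return the error as defined above: |y - y_hat|, the vertical distance."""
    return abs(y - predict(x, a, b))


# Made-up observed (x, y) pairs.
points = [(0.0, 1.5), (1.0, 2.5), (2.0, 5.5)]
errors = [error(x, y, a, b) for x, y in points]

for (x, y), e in zip(points, errors):
    print(f"x={x}, observed y={y}, predicted y={predict(x, a, b)}, error={e}")
```

Because the error is defined here as an absolute value, it does not record whether a point lies above or below the line; it records only how far away the point is.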
Sum of the Squared Errors
- SSE stands for the Sum of the Squared Errors.
- In symbols we have
  $$SSE = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
- The line of best fit is the line that minimizes the SSE.

The Correlation Coefficient r
- The correlation coefficient r measures how strong the linear relationship is between the variables.
- We always have r bounded by ±1: −1 ≤ r ≤ 1.
- The bigger the absolute value of r, the stronger the relationship.
  - r = 1 means there is a perfect positive relationship between the variables.
  - r = 0 means there is no linear relationship between the variables.
  - r = −1 means there is a perfect negative relationship between the variables.

Textbook
- See page 484 of the text for pictures about positive, negative and zero correlation.
- See page 491 of the text for strength guidelines.

Example: Exercise vs. Screen Time
- What kind of correlation do we expect from the data in our class?

Outliers
- If an error is more than 1.9s, then we consider the corresponding data point to be a potential outlier.
- The value of the standard deviation s is
  $$s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} \epsilon_i^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}}$$
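As a rough sketch of what the software does, the quantities above (the line of best fit, SSE, r, s, and the 1.9s outlier check) can be computed in plain Python. The slides do not say how the software finds the line, so the closed-form least-squares formulas used here (slope = Sxy / Sxx, intercept = ȳ − b·x̄) are an assumption on my part. The function name `regression_summary` and the sample data are made up for illustration.

```python
import math


def regression_summary(xs, ys):
    """Fit y = a + b*x by least squares and report SSE, r, s, and potential outliers.

    Uses the standard closed-form least-squares formulas (an assumption;
    the slides only say that software finds the line).
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 paired observations")

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each take more than one distinct value")

    b = sxy / sxx                   # slope of the line of best fit
    a = y_bar - b * x_bar           # y-intercept
    r = sxy / math.sqrt(sxx * syy)  # correlation coefficient, between -1 and 1

    # Errors are the vertical distances |y_i - y_hat_i|.
    errors = [abs(y - (a + b * x)) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in errors)
    s = math.sqrt(sse / (n - 2))

    # Flag points whose error exceeds 1.9 * s as potential outliers.
    outliers = [(x, y) for x, y, e in zip(xs, ys, errors) if e > 1.9 * s]

    return {"a": a, "b": b, "r": r, "sse": sse, "s": s, "outliers": outliers}


# Made-up data: points near y = 1 + 2x, with one point pushed far off the line.
xs = list(range(10))
ys = [1 + 2 * x for x in xs]
ys[5] += 10

summary = regression_summary(xs, ys)
print(summary)
```

A flagged point is only a potential outlier. The check marks it for a closer look; it does not decide whether the point should be removed.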