Chapter 7: Scatterplots, Association, and Correlation

Examining Relationships
Relationship between two variables. Examples:
• Height and Weight
• Alcohol and Body Temperature
• SAT Verbal Score and SAT Math Score
• High School GPA and College GPA

Two Types of Variables
• Response Variable (Dependent): measures an outcome of the study.
• Explanatory Variable (Independent): used to explain the response variable.
Example: Alcohol and Body Temp
• Explanatory Variable: Alcohol
• Response Variable: Body Temperature
These labels do not mean that the explanatory variable causes the response variable. The explanatory variable only helps explain the response.
Sometimes there are no true response or explanatory variables. Examples:
• Height and Weight
• SAT Verbal and SAT Math Scores

Graphing Two Variables
• Plot of explanatory variable vs. response variable
• Explanatory variable goes on the horizontal axis (x)
• Response variable goes on the vertical axis (y)
• If response and explanatory variables do not exist, you can plot the variables on either axis.
• This plot is called a scatterplot.
• A scatterplot can only be used if both variables are quantitative.

Scatterplots
Scatterplots show patterns, trends, and relationships. When interpreting a scatterplot (i.e., describing the relationship between two variables), always look at the following:
Overall Pattern
• Form
• Direction
• Strength
Deviations from the Pattern
• Outliers

Interpreting Scatterplots
Form: Is the plot linear or is it curved?
Strength: Does the plot follow the form very closely, or is there a lot of scatter (variation)?
Direction: Is the plot increasing or is it decreasing?
• Positively Associated: Above (below) average in one variable tends to be associated with above (below) average in the other variable.
• Negatively Associated: Above (below) average in one variable tends to be associated with below (above) average in the other variable.

Example: Scatterplot
The following survey was conducted in the U.S. and in 10 countries of Western Europe to determine the percentage of teenagers who had used marijuana and other drugs.

Percent who have used:
Country             Marijuana   Other Drugs
Czech Republic      22          4
Denmark             17          3
England             40          21
Finland             5           1
Ireland             37          16
Italy               19          8
Northern Ireland    23          14
Norway              6           3
Portugal            7           3
Scotland            53          31
United States       34          24

[Figure: scatterplot of percent who have used Marijuana (x-axis) vs. Other Drugs (y-axis)]

The variables are interchangeable in this example. Here, Percent of Marijuana is being used as the explanatory variable (since it is on the x-axis), and Percent of Other Drugs is being used as the response (since it is on the y-axis).
• The form is linear.
• The strength is fairly strong.
• The direction is positive, since larger values on the x-axis go with larger values on the y-axis.

Example: Scatterplot
Negative association: outside temperature and amount of natural gas used.
[Figure: scatterplot of Gas (y-axis) vs. Temp (x-axis)]

Correlation
• The strength of the linear relationship between two quantitative variables can be described numerically.
• This numerical method is called correlation.
• Correlation is denoted by $r$.
Correlation is a way to measure the strength of the linear relationship between two quantitative variables:
$$r = \frac{1}{n-1} \sum \frac{(x-\bar{x})(y-\bar{y})}{s_x s_y}$$
Steps to calculate correlation:
• Calculate the mean of x and y.
• Calculate the standard deviation for x and y.
• Calculate $\sum (x-\bar{x})(y-\bar{y})$.
• Plug all numbers into the formula.

Correlation
[Figure: scatterplot of Femur (x-axis) vs. Humerus (y-axis)]

Calculating r
Femur (x):    38  56  59  63  74
Humerus (y):  41  63  70  72  84
Set up a table with columns for $x$, $y$, $x-\bar{x}$, $y-\bar{y}$, $(x-\bar{x})^2$, $(y-\bar{y})^2$, and $(x-\bar{x})(y-\bar{y})$.
        $x$    $y$    $x-\bar{x}$    $y-\bar{y}$    $(x-\bar{x})^2$    $(y-\bar{y})^2$    $(x-\bar{x})(y-\bar{y})$
        38     41     -20            -25            400                625                500
        56     63     -2             -3             4                  9                  6
        59     70     1              4              1                  16                 4
        63     72     5              6              25                 36                 30
        74     84     16             18             256                324                288
Sum     290    330    0              0              686                1010               828

Calculating r
Recall: $\bar{y} = \frac{\sum y}{n}$
So, $\bar{x} = \frac{290}{5} = 58$ and $\bar{y} = \frac{330}{5} = 66$.

Recall: $s_y = \sqrt{\frac{\sum (y-\bar{y})^2}{n-1}}$
So, $s_x = \sqrt{\frac{686}{4}} = 13.1$ and $s_y = \sqrt{\frac{1010}{4}} = 15.9$.

Put everything into the formula:
$$r = \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)\, s_x s_y} = \frac{828}{(5-1)(13.1)(15.9)} = 0.994$$

Properties of r
• r has no units (i.e., it is just a number).
• Measures the strength of a LINEAR association between two quantitative variables.
• If the data have a curvilinear relationship, the correlation may not be strong even if the data follow the curve very closely.
• r always ranges in values from -1 to 1.
• r = 1 indicates a straight increasing line.
• r = -1 indicates a straight decreasing line.
• r = 0 indicates no LINEAR relationship.
• As r moves away from 0, the linear relationship between variables is stronger.
• Changing the scale of x or y will not change the value of r.
• Not resistant to outliers.
• Strong correlation ≠ Causation. A strong linear relationship between two variables is NOT proof of a causal relationship!

Reading JMP Output
The following is some output from JMP where I considered Blood Alcohol Content and Number of Beers. The explanatory variable is the number of beers. Blood alcohol content is the response variable.

[Figure: Bivariate Fit of BAC By Beers; scatterplot of BAC (y-axis) vs. Beers (x-axis)]

Summary of Fit
RSquare                       0.803536
RSquare Adj                   0.788424
Root Mean Square Error        0.02092
Mean of Response              0.076
Observations (or Sum Wgts)    15

RSquare = $r^2$. This means
$$r = \sqrt{\text{RSquare}} = \sqrt{0.803536} = 0.896$$
I know this is positive because the scatterplot has a positive direction.
The Mean of the Response is the mean of the y's, or $\bar{y}$.
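
The hand calculation of r for the femur and humerus data can be checked with a short script. The following is a minimal Python sketch, not part of the original slides; the helper names `mean`, `sample_sd`, and `correlation` are illustrative assumptions. It follows the same steps: means, standard deviations, the sum of cross-products, then the formula.

```python
import math

# Femur (x) and humerus (y) values from the worked example.
femur = [38, 56, 59, 63, 74]
humerus = [41, 63, 70, 72, 84]


def mean(values):
    """Arithmetic mean: sum of the values divided by n."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation, dividing by n - 1."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    """Correlation r = sum((x - x_bar)(y - y_bar)) / ((n - 1) * s_x * s_y)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)


r = correlation(femur, humerus)
print(round(r, 3))

# Changing the scale of x (here, multiplying by 10) should leave r unchanged.
r_rescaled = correlation([10 * v for v in femur], humerus)
print(abs(r - r_rescaled) < 1e-9)
```

Carrying full precision gives r of about 0.995. The 0.994 in the hand calculation comes from rounding s_x and s_y to one decimal place before plugging them into the formula.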