STP 420
INTRODUCTION TO APPLIED STATISTICS
NOTES
PART 1 - DATA
CHAPTER 2
LOOKING AT DATA - RELATIONSHIPS
Introduction
Association between variables
Two variables measured on the same individuals are associated if some values of one
variable tend to occur more often with some values of the second variable than with other
values of that variable.
Eg. height and weight: as height increases, weight tends to increase as well.
Or smoking and life expectancy: smokers tend to have a shorter life expectancy
(an inverse relationship, but still an association)
response variable – measures an outcome of a study (dependent variable)
explanatory variable – explains or causes changes in the response variable
(independent variable)
2.1
Scatterplots
A scatterplot shows the relationship between two quantitative variables measured on the
same individuals. The values of one variable appear on the horizontal axis (explanatory
variable x) and the other on the vertical axis (response variable y). Each individual
appears as a point.
Examining a scatterplot
In any graph of data, look for the overall pattern and for striking deviations/outliers.
Describe the overall pattern of the scatterplot by the form, direction (positive or
negative), and strength (how close the points are to a straight line) of the relationship.
Outlier (important kind of deviation) – falls outside the overall pattern of the relationship
Positive association – points in scatterplot seem to increase from left to right
Negative association – points in scatterplot seem to decrease from left to right
Linear relationship – points follow a straight line approximately
Categorical variables – use different color or symbol for each category
Categorical explanatory variables with a quantitative response variable
Make a graph that compares the distributions of the response for each category
of the explanatory variable.
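A minimal sketch of such a scatterplot in Python with matplotlib, using a different color and symbol for each category; the height/weight numbers and the sex labels are hypothetical, chosen only to illustrate:

```python
import matplotlib.pyplot as plt

# Hypothetical data: explanatory variable on x, response on y.
height = [160, 165, 170, 175, 180, 185]   # x: explanatory
weight = [55, 62, 68, 74, 80, 88]         # y: response
sex    = ["F", "F", "M", "F", "M", "M"]   # categorical variable

# Use a different color/symbol for each category.
for cat, color, marker in [("F", "tab:red", "o"), ("M", "tab:blue", "s")]:
    xs = [h for h, s in zip(height, sex) if s == cat]
    ys = [w for w, s in zip(weight, sex) if s == cat]
    plt.scatter(xs, ys, c=color, marker=marker, label=cat)

plt.xlabel("Height (cm)")   # explanatory variable x
plt.ylabel("Weight (kg)")   # response variable y
plt.legend(title="Sex")
plt.show()
```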
2.2
Correlation – r
Correlation - measures the direction and strength of the linear relationship between two
quantitative variables; its numeric value ranges from –1 to 1.
r = -1 implies a perfect negative linear relation; all points fall on a negatively sloped straight line
r = 0 implies no linear relationship
r = 1 implies a perfect positive linear relation; all points fall on a positively sloped straight line
r = [1/(n−1)] Σ [(xi − x̄)/sx][(yi − ȳ)/sy]
where
n - # of individuals
xi – observations for variable X
x̄ - mean of variable X
sx – standard deviation for variable X
yi – observations for variable Y
ȳ - mean of variable Y
sy – standard deviation for variable Y
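As a quick check of this formula, here is a sketch in plain Python that computes r from standardized values. The data reused here are the age/price pairs from the Section 2.3 example; the last line anticipates property 3 below by showing that changing units (age in months) leaves r unchanged.

```python
import math

def correlation(x, y):
    """r = (1/(n-1)) * sum of products of standardized values."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Sample standard deviations (divide by n - 1).
    sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]               # age (yr)
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]  # price ($100)
print(round(correlation(x, y), 3))                  # -0.924: strong negative

# Property 3: a unit change (age in months) does not change r.
print(round(correlation([12 * xi for xi in x], y), 3))
```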
Properties of correlation
1. Makes no distinction between explanatory and response variables; it makes no
difference which variable is x and which is y.
2. Both variables must be quantitative; correlation is not appropriate for categorical
variables.
3. r is computed from standardized values, so it is not affected if the units of
measurement for x, y, or both are changed.
4. Positive r implies a positive association between the variables; negative r implies
a negative association.
5. -1 ≤ r ≤ 1; r close to 0 implies a weak linear relationship.
6. Correlation measures the strength of linear relationships only (not curves).
7. Like s, r is not resistant and is affected by outliers (be careful).
Correlation is not a complete description of two-variable data; the means and
standard deviations should also be given.
2.3
Least-squares regression
Example.
Age (yr), x:      5    4    6    5    5    5    6    6    2    7    7
Price ($100), y: 85  103   70   82   89   98   66   95  169   70   48
Plot y against x; if the points seem to follow a straight line, then a straight line can be
used to approximate the relationship between x and y.
A regression line is a straight line that describes how a response variable y changes as
an explanatory variable x changes. It can be used to predict y given x. You must know
which variable is explanatory and which is the response.
y = a + bx
where
b is the slope and tells how much y changes as x changes one unit
a is the intercept, the value of y when x = 0
Least-squares regression line of y on x is the line that makes the sum of the squares of
the vertical distances of the data points from the line as small as possible.
Extrapolation – the use of a regression line to predict far outside the range of values of
the explanatory variable x. Such predictions may be inaccurate.
Equation of the least-squares regression line
ŷ = a + bx
with slope
b = r(sy/sx)
and intercept
a = ȳ − b·x̄
where
x – explanatory variable
y – response variable
r – correlation between x and y
x̄ – sample mean of x
ȳ – sample mean of y
sx – sample standard deviation of x
sy – sample standard deviation of y
Example, continued.
Regression equation – the equation of the regression line:
ŷ = 195.47 − 20.26x
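A short sketch in plain Python that recovers this equation from the data above, using the slope and intercept formulas just given:

```python
import math

x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]               # age (yr)
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]  # price ($100)
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
r = sum((xi - xbar) * (yi - ybar)
        for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

b = r * sy / sx        # slope
a = ybar - b * xbar    # intercept
print(f"yhat = {a:.2f} + ({b:.2f})x")   # yhat = 195.47 + (-20.26)x
```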
Computational formulas in regression (for by-hand computation)

Definition                     Computational
Sxx = Σ(x − x̄)²               Sxx = Σx² − (Σx)²/n
Sxy = Σ(x − x̄)(y − ȳ)         Sxy = Σxy − (Σx)(Σy)/n
Syy = Σ(y − ȳ)²               Syy = Σy² − (Σy)²/n

b = Sxy/Sxx
and
a = (1/n)(Σy − b·Σx) = ȳ − b·x̄
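A sketch of the same fit via the computational formulas, again assuming the age/price data from the example; only running sums are needed, which is why these forms suit hand computation:

```python
x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]
n = len(x)

# Computational forms: sums only, no deviations from the mean.
Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = Sxy / Sxx
a = (sum(y) - b * sum(x)) / n
print(round(b, 2), round(a, 2))   # -20.26 195.47
```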
Coefficient of determination, r², is the square of the correlation r – it is the fraction of
the variation in the observed values of y that is explained by the least-squares regression
of y on x.
Computational formula:
r = Sxy / √(Sxx·Syy)
so
r² = Sxy² / (Sxx·Syy)
ie. r² varies from 0 to 1:
0 ≤ r² ≤ 1
r2 close to 0 implies the least-squares regression explains very little of the variation in y
r2 close to 1 implies the least-squares regression explains most of the variation in y
r2(yˆ/) (y y)2
x
y
ŷ
5
4
6
5
5
5
6
6
2
7
7
85
103
70
82
89
98
66
95
169
70
48
94.16
114.42
73.90
94.16
94.16
94.16
73.90
73.90
154.95
53.64
53.64
(yˆ
y)2 = 8285.0 ,
(y
y
y
-3.64
14.36
-18.64
-6.64
0.36
9.36
-22.64
6.36
80.36
-18.64
-40.64
ŷ
y
5.53
25.79
-14.74
5.53
5.53
5.53
-14.74
-14.74
66.31
-35.00
-35.00
y
yˆ
-9.16
-11.42
-3.90
-12.16
-5.16
3.84
-7.90
21.10
14.05
16.36
-5.64
y)2= 9708.5
r2 = 8285.0/9708.5 = 0.853 (85.3%)
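A sketch in plain Python confirming r² both ways for the example data: via Sxy²/(Sxx·Syy) and via the explained-over-total variation ratio from the table above.

```python
x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]
n = len(x)

Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n

# Way 1: r^2 = Sxy^2 / (Sxx * Syy)
print(round(Sxy ** 2 / (Sxx * Syy), 3))            # 0.853

# Way 2: explained variation over total variation.
b = Sxy / Sxx
a = (sum(y) - b * sum(x)) / n
ybar = sum(y) / n
yhat = [a + b * xi for xi in x]
print(round(sum((yh - ybar) ** 2 for yh in yhat) /
            sum((yi - ybar) ** 2 for yi in y), 3))  # 0.853
```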
2.4
Cautions about regression and correlation
Correlation and regression, together with the scatterplot, allow us to study the
relationship between variables considered in pairs.
Residual – the difference between an observed value of the response variable and the
value predicted by the regression line
Residual = observed y − predicted y = y − ŷ
Residual plot – a scatterplot of the regression residuals against the explanatory variable
(see the sketch below).
- It helps us assess the fit of the regression line.
- If the plot is unstructured and centered about 0, there is no major problem.
- If the plot shows a curve, then a straight line is not the best fit for the data.
- If the residuals get bigger as you go from left to right, predictions are more
precise on the left than on the right.
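A minimal residual-plot sketch with matplotlib, assuming the fitted line ŷ = 195.47 − 20.26x from the Section 2.3 example; the dashed reference line makes the "centered about 0" check easy to see:

```python
import matplotlib.pyplot as plt

x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

# Residuals from the fitted line of the Section 2.3 example.
residuals = [yi - (195.47 - 20.26 * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")      # residuals should center about 0
plt.xlabel("Age (yr)")              # explanatory variable
plt.ylabel("Residual (y - yhat)")
plt.show()
```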
Lurking variable – variable that has an important effect on the relationship among
variables in a study but is not included among the variables studied.
Outlier – an observation that lies outside the overall pattern of the other observations.
Points that are outliers in the y direction of a scatterplot have large regression residuals;
other outliers need not have large residuals.
Influential observation – an observation that, if removed, would change the result of
some statistical calculation. Points that are outliers in the x direction of a scatterplot are
often influential for the least-squares regression line.
Difference between fitted values (DFFITS) - Find the predicted response ŷi for the
ith individual with this individual in the data and with it left out of the data, take the
difference, and standardize it (divide by an estimate of its standard deviation). Doing this
for all individuals gives the DFFITS.
Studentized residuals – residuals standardized using the standard deviation estimated
from the data with the individual omitted (this avoids an outlying individual inflating the
standard deviation used to judge it)
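Both diagnostics are tedious by hand; as a sketch, the third-party statsmodels library exposes them through its OLS influence object. The attribute names below are as I understand recent statsmodels versions, so treat the exact API as an assumption to check against your installed version:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7])
y = np.array([85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48])

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

# Externally studentized residuals: each residual scaled by the
# standard deviation estimated with that individual omitted.
print(influence.resid_studentized_external)

# DFFITS: standardized change in the fitted value when the
# individual is left out (returned with a rule-of-thumb threshold).
dffits, threshold = influence.dffits
print(dffits)
```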
Beware of lurking variables
Correlation measures only linear association.
Extrapolation can be inaccurate.
Correlation and least-squares regression are not resistant measures.
Lurking variables can make correlation or regression misleading.
Association does not imply causation
An association between an explanatory variable x and a response variable y, even if it is
very strong, is not by itself good evidence that changes in x actually cause changes in y.
A correlation based on averages over many individuals is usually higher than the
correlation between the same variables based on the data for individuals.
Prediction does not require a cause-and-effect relationship. (Eg. height and weight.)