STP 420 SUMMER 2002
STP 420
INTRODUCTION TO APPLIED STATISTICS
NOTES
PART 1 - DATA
CHAPTER 2
LOOKING AT DATA - RELATIONSHIPS
Introduction
Association between variables
Two variables measured on the same individuals are associated if some values of one
variable tend to occur more often with some values of the second variable than with other
values of that variable.
E.g. height and weight: as height increases, weight also tends to increase.
Or smoking and life expectancy: smokers tend to have shorter life expectancy
(an inverse relationship, but still an association).
response variable – measures an outcome of a study (dependent variable)
explanatory variable – explains or causes changes in the response variable
(independent variable)
2.1 Scatterplots
A scatterplot shows the relationship between two quantitative variables measured on the
same individuals. The values of one variable appear on the horizontal axis (explanatory
variable x) and the other on the vertical axis (response variable y). Each individual
appears as a point.
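Such a plot is straightforward to produce in software. Below is a minimal sketch in Python with matplotlib, using the age/price data from the example in section 2.3 (the axis labels are illustrative):

    import matplotlib.pyplot as plt

    # Explanatory variable on the horizontal axis, response on the vertical axis
    age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]                 # x: age (yr)
    price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]  # y: price ($100)

    plt.scatter(age, price)   # each individual appears as one point
    plt.xlabel("Age (yr)")
    plt.ylabel("Price ($100)")
    plt.show()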
Examining a scatterplot
In any graph of data, look for the overall pattern and for striking deviations/outliers.
Describe the overall pattern of a scatterplot by the form, direction (positive or
negative), and strength (how close the points lie to a straight line) of the relationship.
Outlier (important kind of deviation) – falls outside the overall pattern of the relationship
Positive association – points in scatterplot seem to increase from left to right
Negative association – points in scatterplot seem to decrease from left to right
Linear relationship – points follow a straight line approximately
Categorical variables – use different color or symbol for each category
Categorical explanatory variables with a quantitative response variable
Make a graph that compares the distributions of the response for each category
of the explanatory variable.
2.2 Correlation – r
Correlation – measures the direction and strength of the linear relationship between two
quantitative variables; its numeric value ranges from −1 to 1.
r = −1 implies a perfect negative linear relation: all points fall on a negatively sloped straight line
r = 0 implies no linear relationship
r = +1 implies a perfect positive linear relation: all points fall on a positively sloped straight line
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
where
n – # of individuals
x_i – observations for variable X
x̄ – mean of variable X
s_x – standard deviation for variable X
y_i – observations for variable Y
ȳ – mean of variable Y
s_y – standard deviation for variable Y
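As a sketch of how this formula translates into code (plain Python; assumes at least two paired observations and nonzero spread in each variable):

    from math import sqrt

    def correlation(xs, ys):
        """r = average (over n - 1) of the products of standardized values."""
        n = len(xs)
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sx = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))  # sample sd of X
        sy = sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))  # sample sd of Y
        return sum(((x - xbar) / sx) * ((y - ybar) / sy)
                   for x, y in zip(xs, ys)) / (n - 1)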
Properties of correlation
1. Makes no distinction between explanatory and response variables; it makes no
difference which variable is x and which is y.
2. The two variables must be quantitative; correlation is not appropriate for
categorical variables.
3. r is computed from standardized values, so it is not affected if the units of
measurement for x, y, or both are changed (see the sketch after this list).
4. Positive r implies positive association between the variables; negative r implies
negative association.
5. −1 ≤ r ≤ 1; r close to 0 implies a weak linear relationship.
6. Correlation measures the strength of linear relationships only (not curves).
7. Like s, r is not resistant and is affected by outliers (be careful).
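A quick check of properties 1 and 3 (a sketch with numpy; the height/weight numbers below are made up for illustration):

    import numpy as np

    height_in = np.array([60, 62, 65, 68, 70, 72])        # inches
    weight_lb = np.array([115, 120, 140, 155, 165, 180])  # pounds

    r = np.corrcoef(height_in, weight_lb)[0, 1]
    # Property 3: changing units (inches -> cm, pounds -> kg) does not change r
    r_metric = np.corrcoef(height_in * 2.54, weight_lb * 0.454)[0, 1]
    # Property 1: swapping which variable is x and which is y does not change r
    r_swapped = np.corrcoef(weight_lb, height_in)[0, 1]
    print(r, r_metric, r_swapped)  # all three values agree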
Correlation is not a complete description of two-variable data; the means and standard
deviations should also be reported.
2.3 Least-squares regression
Example.

Age (yr), x:      5    4    6    5    5    5    6    6    2    7    7
Price ($100), y: 85  103   70   82   89   98   66   95  169   70   48
Plot y against x; if the points seem to follow a straight line, then a straight line can be
used to approximate the relationship between x and y.
A regression line is a straight line that describes how a response variable y changes as
an explanatory variable x changes. It can be used to predict y given x. You must know
which variable is explanatory and which is the response.
y = a + bx
where
b is the slope and tells how much y changes as x changes one unit
a is the intercept, the value of y when x = 0
Least-squares regression line of y on x is the line that makes the sum of the squares of
the vertical distances of the data points from the line as small as possible.
Extrapolation – use of a regression line to predict far outside the range of values of the
explanatory variable x; such predictions may be inaccurate.
Equation of the least-squares regression line
\hat{y} = a + bx
br
with slope
x
y
r
x
y
sx
sy
sy
sx
and intercept
a  y  bx
explanatory variable
response variable
correlation between x and y
sample mean of x
sample mean of y
sample standard deviation of x
sample deviation of y
Example, continued.
Regression equation – the equation of the regression line:

\hat{y} = 195.47 - 20.26x
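These coefficients can be reproduced from the formulas b = r·s_y/s_x and a = ȳ − b·x̄ (a sketch with numpy, using the example data):

    import numpy as np

    x = np.array([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7])               # age (yr)
    y = np.array([85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48])  # price ($100)

    r = np.corrcoef(x, y)[0, 1]
    b = r * y.std(ddof=1) / x.std(ddof=1)  # slope: b = r * sy / sx
    a = y.mean() - b * x.mean()            # intercept: a = ybar - b * xbar
    print(round(a, 2), round(b, 2))        # 195.47 -20.26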
Computational formulas in regression (for by hand computations)
Definition                                    Computational

S_{xx} = \sum (x - \bar{x})^2                 S_{xx} = \sum x^2 - (\sum x)^2 / n

S_{xy} = \sum (x - \bar{x})(y - \bar{y})      S_{xy} = \sum xy - (\sum x)(\sum y) / n

S_{yy} = \sum (y - \bar{y})^2                 S_{yy} = \sum y^2 - (\sum y)^2 / n

b = \frac{S_{xy}}{S_{xx}}   and   a = \frac{1}{n}\left(\sum y - b \sum x\right) = \bar{y} - b \bar{x}
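The shortcut formulas translate directly into code (a plain-Python sketch; same example data as above):

    def least_squares(xs, ys):
        """Slope b and intercept a via the computational formulas."""
        n = len(xs)
        sum_x, sum_y = sum(xs), sum(ys)
        Sxx = sum(x * x for x in xs) - sum_x ** 2 / n
        Sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
        b = Sxy / Sxx                 # slope
        a = (sum_y - b * sum_x) / n   # intercept: (1/n)(sum y - b sum x)
        return a, b

    a, b = least_squares([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7],
                         [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48])
    print(round(a, 2), round(b, 2))   # 195.47 -20.26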
Coefficient of determination, r² – the square of the correlation r; it is the fraction of
the variation in the observed values of y that is explained by the least-squares regression
of y on x.
Computational formulas:

r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}   and   r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}}
0 ≤ r² ≤ 1, i.e. r² varies from 0 to 1
r2 close to 0 implies the least-squares regression explains very little of the variation in y
r2 close to 1 implies the least-squares regression explains most of the variation in y
r^2 = \sum (\hat{y} - \bar{y})^2 / \sum (y - \bar{y})^2
x     y      ŷ       y − ȳ    ŷ − ȳ    y − ŷ
5    85    94.16    -3.64     5.53    -9.16
4   103   114.42    14.36    25.79   -11.42
6    70    73.90   -18.64   -14.74    -3.90
5    82    94.16    -6.64     5.53   -12.16
5    89    94.16     0.36     5.53    -5.16
5    98    94.16     9.36     5.53     3.84
6    66    73.90   -22.64   -14.74    -7.90
6    95    73.90     6.36   -14.74    21.10
2   169   154.95    80.36    66.31    14.05
7    70    53.64   -18.64   -35.00    16.36
7    48    53.64   -40.64   -35.00    -5.64
\sum (\hat{y} - \bar{y})^2 = 8285.0,   \sum (y - \bar{y})^2 = 9708.5

r² = 8285.0 / 9708.5 = 0.853 (85.3%)
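The same value can be checked in code both ways (a sketch with numpy; the fitted values use the regression equation above):

    import numpy as np

    x = np.array([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7])
    y = np.array([85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48])
    yhat = 195.47 - 20.26 * x   # predicted values from the regression line

    # Fraction of the variation in y explained by the regression
    r2 = ((yhat - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()

    # Shortcut form: r^2 = Sxy^2 / (Sxx * Syy)
    Sxx = ((x - x.mean()) ** 2).sum()
    Syy = ((y - y.mean()) ** 2).sum()
    Sxy = ((x - x.mean()) * (y - y.mean())).sum()
    print(round(r2, 3), round(Sxy ** 2 / (Sxx * Syy), 3))  # both about 0.853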
2.4 Cautions about regression and correlation
Correlation and regression, along with the scatterplot, allow us to study the
relationship between variables considered in pairs.
Residual – the difference between an observed value of the response variable and the
value predicted by the regression line
Residual = observed y − predicted y = y − ŷ
Residual plot – a scatterplot of the regression residuals against the explanatory variable.
- It helps us assess the fit of the regression line.
- If the plot is unstructured and centered about 0, there is no major problem.
- If the plot shows a curve, then a straight line is not the best fit to the data.
- If the residuals get bigger as you go from left to right, predictions are more
precise on the left than on the right (see the sketch below).
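A residual plot takes only a few lines once the line is fitted (a sketch with numpy and matplotlib, continuing the same example):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7])
    y = np.array([85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48])

    b, a = np.polyfit(x, y, 1)     # least-squares slope and intercept
    residuals = y - (a + b * x)    # observed y - predicted y

    plt.scatter(x, residuals)      # residuals against the explanatory variable
    plt.axhline(0)                 # an unstructured plot centers about this line
    plt.xlabel("Age (yr)")
    plt.ylabel("Residual")
    plt.show()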
Lurking variable – variable that has an important effect on the relationship among
variables in a study but is not included among the variables studied.
Outlier – observation that lies outside the overall pattern of the other observations. Points
that are outliers in the y direction of a scatterplot have large regression residuals; other
outliers need not have large residuals.
Influential observation – if removed there would be a change in the result of some
statistical calculation. Points that are outliers in the x direction of a scatterplot are often
called influential points for the least-squares regression line.
Difference between fitted values (DFFITS) – find the predicted response (ŷ_i) for the
ith individual with this individual in the data and with it removed, take the difference, and
standardize it (subtract the mean and divide by the standard deviation). Do this for all
individuals to give the DFFITS.
Studentized residuals – standardizing the residuals using the standard deviation of the
data with the individual omitted from the data (helps to avoid having too big a standard
deviation)
Beware of lurking variables
Correlation measures only linear association.
Extrapolation can be inaccurate.
Correlation and least-squares regression are not resistant measures.
Lurking variables can make correlation or regression misleading.
Association does not imply causation
An association between an explanatory variable x and a response variable y, even if it is
very strong, is not by itself good evidence that changes in x actually cause changes in y.
A correlation based on averages over many individuals is usually higher than the
correlation between the same variables based on the data for individuals.
Prediction does not require a cause-and-effect relationship. (e.g. height & weight)
2.6 Relations in categorical data (case of response variable being quantitative)
Relationships are described using counts (frequencies) or percents (relative
frequencies) of each category.
Two-way table – presents data for two variables
Row variable - education
Column variable - age
                          Age group
Education        25–34    35–54    >= 55     Total
< HS             5,325    9,152   16,035    30,152
= HS            14,061   24,070   18,320    56,451
College 1–3     11,659   19,926    9,662    41,247
College >= 4    10,342   19,878    8,005    38,225
Total           41,388   73,028   52,022   166,438
Roundoff error – values rounded to nearest thousand.
Education alone and age alone are the marginal distributions.

E.g.

Education         Total
< HS             30,152
= HS             56,451
College 1–3      41,247
College >= 4     38,225
Total           166,438

Age group        25–34    35–54    >= 55     Total
Total           41,388   73,028   52,022   166,438
Conditional distribution of education given an age group (25–34)

Education        25–34
< HS             5,325
= HS            14,061
College 1–3     11,659
College >= 4    10,342
Total           41,388
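Marginal and conditional distributions can be computed directly from the two-way table (a sketch with pandas, using the counts above; a conditional distribution is normally reported as percents):

    import pandas as pd

    table = pd.DataFrame(
        {"25-34": [5325, 14061, 11659, 10342],
         "35-54": [9152, 24070, 19926, 19878],
         ">=55":  [16035, 18320, 9662, 8005]},
        index=["< HS", "= HS", "College 1-3", "College >= 4"])

    print(table.sum(axis=1))  # marginal distribution of education alone
    print(table.sum(axis=0))  # marginal distribution of age alone

    # Conditional distribution of education given age 25-34, as percents
    cond = table["25-34"] / table["25-34"].sum() * 100
    print(cond.round(1))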
Simpson’s paradox – an association or comparison that holds for all of several groups
can reverse direction when the data are combined to form a single group.
- reversal of direction by aggregation of data
Example of three-way table – presenting information on three variables, one two-way
table for each level (value) of the third variable.
Good condition
            Hosp. A   Hosp. B
Died              6         8
Survived        594       592
Total           600       600

Poor condition
            Hosp. A   Hosp. B
Died             57         8
Survived       1443       192
Total          1500       200
Condition variable – good and poor
Hospital variable – A and B
Survival variable – Died and survived
Aggregation of data – adding up across one variable (elimination of one variable)
E.g. eliminating condition (ignoring condition):

            Hosp. A   Hosp. B
Died             63        16
Survived       2037       784
Total          2100       800
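The reversal is easy to verify by computing death rates within each condition and overall (a plain-Python sketch using the hospital counts above):

    # (deaths, total patients) by hospital, within each condition
    good = {"A": (6, 600), "B": (8, 600)}
    poor = {"A": (57, 1500), "B": (8, 200)}

    for label, table in (("good", good), ("poor", poor)):
        for hosp, (died, total) in table.items():
            print(f"{label} condition, hospital {hosp}: {died / total:.1%} died")

    # Aggregated over condition, the comparison reverses (Simpson's paradox):
    # A has the lower death rate in each condition but the higher rate overall.
    for hosp in ("A", "B"):
        died = good[hosp][0] + poor[hosp][0]
        total = good[hosp][1] + poor[hosp][1]
        print(f"overall, hospital {hosp}: {died / total:.1%} died")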
2.7 The question of causation
Two variables are often associated, even strongly associated, but this does not by itself
mean that one causes the other (i.e. that the explanatory variable causes the response variable).
Explaining association - causation
One variable causes the other
[Diagram: three causal structures that can explain an observed association.]
Causation:        x → y
Common response:  x ← z → y (z drives both x and y)
Confounding:      x → y ← z, with x and z associated, so the x–y link is uncertain
x, y – observed variables
z – lurking variable
arrows show cause-and-effect relationships
Explaining association – common response
Observed association between x and y is explained by a lurking variable z; both x and y
change when z changes.
Explaining association - confounding
The effects of more than one variable on the response are mixed together (x and z may
each be explanatory or lurking variables). We cannot distinguish the influence of x on y
from the influence of z on y.