Chapter 4: Regression
5/23/2017

Regression
• Like correlation, regression addresses linear relationships between quantitative variables X and Y
• Objective of correlation: quantify the direction and strength of the linear association
• Objective of regression: derive the best-fitting line that describes the association
• We are especially interested in the slope of the line

Same illustrative data as Ch 3
Country        Per Capita GDP (X)   Life Expectancy (Y)
Austria        21.4                 77.48
Belgium        23.2                 77.53
Finland        20.0                 77.32
France         22.7                 78.63
Germany        20.8                 77.17
Ireland        18.6                 76.39
Italy          21.5                 78.51
Netherlands    22.0                 78.15
Switzerland    23.8                 78.99
UK             21.2                 77.37
Enter data into calculator

Algebraic equation for a line
• y = a + b∙X, where
• b ≡ slope ≡ change in Y per unit X
• a ≡ intercept ≡ value of Y when X = 0

Statistical Equation for a Line
ŷ = a + b∙X, where:
• ŷ ≡ predicted average of Y at a given level of X
• a ≡ intercept
• b ≡ slope
• a and b are called regression coefficients

How do we find the equation for the best-fitting line through the scatter cloud?
Ans: We use the “least squares method”
[Scatterplot: life expectancy (yrs) vs. per capita GDP]

These formulas derive the coefficients for the least squares regression line
• Slope: b = r ∙ (s_y / s_x)
• Intercept: a = ȳ − b∙x̄

Illustrative Example (GDP & Life Expectancy)
Statistics for illustrative data (calculated with TI-30XIIS):
x̄ = 21.52, ȳ = 77.754, s_x = 1.532, s_y = 0.795, r = 0.809
Calculation of regression coefficients by hand:
b = r ∙ (s_y / s_x) = (0.809)(0.795 / 1.532) = 0.420
a = ȳ − b∙x̄ = 77.754 − (0.420)(21.52) = 68.716

“Least Squares” Regression Coefficients via TI-30XIIS
STAT > 2-VAR > DATA > STATVAR
BEWARE! The TI-30XIIS mislabels the slope & intercept. The slope is mislabeled as a and the intercept is mislabeled as b. It should be the other way around!
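
The hand calculation of the regression coefficients can be cross-checked from the raw data. The following is a minimal sketch in Python using only the standard library. The variable names are illustrative and not part of the original slides, and the pairing of GDP and life-expectancy values assumes both lists are given in the same country order.

```python
from statistics import mean, stdev

# Illustrative data from the slides (assumed to be listed in the same country order).
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]            # X
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]  # Y

n = len(gdp)
x_bar, y_bar = mean(gdp), mean(life)
s_x, s_y = stdev(gdp), stdev(life)  # sample standard deviations (n - 1 denominator)

# Pearson correlation coefficient r from the sum of cross-products.
cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(gdp, life))
r = cross / ((n - 1) * s_x * s_y)

# Least squares coefficients, using the same formulas as the slides.
b = r * s_y / s_x      # slope
a = y_bar - b * x_bar  # intercept

print(f"r = {r:.3f}, slope b = {b:.3f}, intercept a = {a:.3f}")
```

The printed values should agree with the calculator statistics up to rounding.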

Interpretation of Slope (GDP & Life Expectancy)
ŷ = 68.7 + 0.42∙X
• Each $1K increase in per capita GDP is associated with a 0.42-year increase in life expectancy
• b = increase in Y per unit of X = 0.42 years per 1 unit of X

Interpretation of Intercept
• Mathematically: the predicted value of Y when X = 0
• In the real world: has no interpretation unless a value of X = 0 is plausible

Regression Line for Prediction
• The regression line will always go through (x̄, ȳ), which in this case is (21.5, 77.8)
• To draw the regression line, connect any two points on the line
• Example: What is the predicted life expectancy of a country with a GDP of 20?
• ŷ at X = 20: 68.7 + (0.42)(20) = 77.1
[Scatterplot: Case Study (Life Expectancy); life expectancy (yrs) vs. per capita GDP]

Coefficient of Determination r²
Interpretation: proportion of the variability in Y mathematically explained by X
Our example: r = 0.809, so r² ≈ 0.66 (using the unrounded r)
Interpretation: 66% of the variability in Y (life expectancy) is mathematically explained* by X (GDP)
* mathematically explained ≠ causally explained

Cautions about linear regression
1. Applies to linear relationships only
2. Strongly influenced by outliers, especially when the outlier is in the X direction
3. Do not extrapolate!
4. Association ≠ causation (Beware of lurking variables.)

Outliers / Influential Points
• Outliers in the X direction have strong influence (they tip the line)
• Example: Child 18 is an outlier in the X direction
  – Changes the slope substantially
[Figure: regression lines fitted without and with the outlier]

Do Not Extrapolate!
• Example: Sarah’s height from age 3 to 5
• Least squares regression line: ŷ = 2.32 + 0.159∙X
• Predict height at age 30: ŷ = 2.32 + 0.159(30) = 7.09 ft (ridiculous)
• Do NOT extrapolate beyond the range of X
[Plot: height (feet) vs. age (years)]

Association ≠ Causation
• “Association” is not the same as “causation”
• Lurking variable ≡ an extraneous factor (Z) that is associated with both X and Y
• Lurking variables can confound an association

Example of Confounding by a Lurking Variable
• Explanatory variable X ≡ number of prior children
• Response variable Y ≡ the risk of Down’s syndrome
• Lurking variable Z ≡ advanced age of mother
• X is associated with Y, but does not cause Y in this example
• Z does cause Y
[Diagram: older mother (Z) linked to both number of children (X) and Down’s syndrome (Y)]

Criteria used to establish causality, with examples about smoking (X) and lung cancer (Y)
• Strength of association – X & Y strongly correlated
• Consistency of findings – Many studies have shown X & Y correlated
• Dose-response relationship – The more you smoke, the more you increase risk
• Temporality (time relation) – Lung cancer occurs after 10–20 years of smoking
• Biological plausibility – Chemicals in cigarette smoke are mutagenic
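
Returning to the prediction and extrapolation slides, the two ideas can be combined in a small helper that predicts from the fitted line but refuses to extrapolate. This is only a sketch: the function and constant names are hypothetical, the coefficients are the rounded values from the GDP example (ŷ = 68.7 + 0.42∙X), and the allowed range is assumed to be the smallest and largest GDP values in the illustrative data.

```python
# Hypothetical helper, not part of the original slides.
INTERCEPT = 68.7    # a, rounded intercept from the GDP example
SLOPE = 0.42        # b, rounded slope from the GDP example
X_MIN, X_MAX = 18.6, 23.8  # assumed observed range of per capita GDP


def predict_life_expectancy(gdp):
    """Return the predicted life expectancy, refusing to extrapolate."""
    if not X_MIN <= gdp <= X_MAX:
        raise ValueError(
            f"GDP {gdp} is outside the observed range [{X_MIN}, {X_MAX}]; "
            "do not extrapolate"
        )
    return INTERCEPT + SLOPE * gdp


print(round(predict_life_expectancy(20), 2))  # inside the observed range

try:
    predict_life_expectancy(30)  # outside the observed range
except ValueError as err:
    print(err)
```

Raising an error is one design choice among several; a warning would also work, but an error makes it harder to use an extrapolated value by accident.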