Lecture 19: Tues., Nov. 11th
• R-squared (8.6.1)
• Review
• Midterm II on Thursday in class: calculator and two double-sided pages of notes allowed
• Office hours: today after class; Wednesday, 1:30-2:30; by appointment (I will be around Wed. morning and Thurs. morning before 10:30).

R-Squared
• The R-squared statistic, also called the coefficient of determination, is the percentage of response variation explained by the explanatory variable:
  R² = 100 × (Total sum of squares − Residual sum of squares) / (Total sum of squares) %
• Total sum of squares = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)². This is the best sum of squared prediction errors without using x.
• Residual sum of squares = Σᵢ₌₁ⁿ resᵢ² = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)².

R-Squared example
• JMP output: Bivariate Fit of Neuron activity index By Years playing (scatterplot with fitted line; figure omitted).
  Linear Fit: Neuron activity index = 7.9715909 + 1.0268308 × Years playing
  Summary of Fit: RSquare 0.866986; RSquare Adj 0.855902; Root Mean Square Error 3.025101; Mean of Response 15.89286; Observations (or Sum Wgts) 14
• R² = 86.69%. Read as "86.69 percent of the variation in neuron activity was explained by linear regression on years played."

Interpreting R²
• R² takes on values between 0 and 1 (equivalently, 0% and 100%), with higher R² indicating a stronger linear association.
• If the residuals are all zero (a perfect fit), then R² is 100%. If the least squares line has slope 0, R² will be 0%.
• R² is useful as a unitless summary of the strength of linear association.

Caveats about R²
– R² is not useful for assessing model adequacy (e.g., linearity) or whether or not there is an association.
– A good R² depends on the context. In precise laboratory work, R² values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of the variation in the response, R² values of 50% may be considered remarkably good.

Coverage of Second Midterm
• Transformations of the data for two-group problems (Ch. 3.5)
• Welch t-test (Ch. 4.3.2)
• Comparisons Among Several Samples (5.1-5.3, 5.5.1)
• Multiple Comparisons (6.3-6.4)
• Simple Linear Regression (Ch. 7.1-7.4, 7.5.3)
• Assumptions for Simple Linear Regression and Diagnostics (Ch. 8.1-8.4, 8.6.1, 8.6.3)

Transformations for two-group problem
• Goal: find a transformation so that the two distributions have approximately equal spread.
• A log transformation might work when the distributions are skewed and the spread is greater in the distribution with the larger median.
• Interpretation of the log transformation:
– For causal inference: let δ be the additive treatment effect on the log scale (log Y* = log Y + δ). Then the effect of the treatment is to multiply the control outcome by e^δ (Y* = Y·e^δ).
– For population inference: let μ₁ and μ₂ be the means of the logged values of populations 1 and 2 respectively. If the logged values of the populations are symmetric, then e^(μ₂−μ₁) equals the ratio of the median of population 2 to the median of population 1.

Review of One-way layout
• Assumptions of the ideal model:
– All populations have the same standard deviation.
– Each population is normal.
– Observations are independent.
• Planned comparisons: usual t-test, but use all groups to estimate σ. If there are many planned comparisons, use Bonferroni to adjust for multiple comparisons.
• Test of H₀: μ₁ = μ₂ = ⋯ = μ_I vs. the alternative that at least two means differ: one-way ANOVA F-test.
• Unplanned comparisons: use the Tukey-Kramer procedure to adjust for multiple comparisons.

Regression
• Goal of regression: estimate the mean response of Y for subpopulations X = x, that is, μ{Y | X}.
• Applications: (i) description of the association between X and Y; (ii) passive prediction of Y given X; (iii) control: predict what Y will be if x is changed. Application (iii) requires the x's to be randomly assigned.
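The least squares and R-squared formulas above can be sketched in a few lines of code. This is a minimal illustration with made-up numbers, not the neuron-activity data from the slides:

```python
def least_squares(x, y):
    """Fit y = b0 + b1*x by least squares; return (b0, b1, R-squared in %)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                     # least squares slope
    b0 = ybar - b1 * xbar              # least squares intercept
    # Total SS: best sum of squared prediction errors without using x
    total_ss = sum((yi - ybar) ** 2 for yi in y)
    # Residual SS: sum of squared prediction errors from the fitted line
    resid_ss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    r_squared = 100 * (total_ss - resid_ss) / total_ss
    return b0, b1, r_squared

# Hypothetical data for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
b0, b1, r2 = least_squares(x, y)
```

A perfect fit gives R² = 100%, and a fitted slope of 0 gives R² = 0%, matching the interpretation bullets above.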
• Simple linear regression model: μ{Y | X} = β₀ + β₁X.
• Estimate β₀ and β₁ by least squares: choose β̂₀, β̂₁ to minimize the sum of squared residuals (prediction errors).

Ideal Model
• Assumptions of the ideal simple linear regression model:
– There is a normally distributed subpopulation of responses for each value of the explanatory variable.
– The means of the subpopulations fall on a straight-line function of the explanatory variable.
– The subpopulation standard deviations are all equal (to σ).
– The selection of an observation from any of the subpopulations is independent of the selection of any other observation.

The standard deviation
• σ is the standard deviation in each subpopulation; it measures the accuracy of predictions from the regression.
• σ̂ = sqrt( (sum of all squared residuals) / (n − 2) )
• If the simple linear regression model holds, then approximately:
– 68% of the observations will fall within σ̂ of the least squares line;
– 95% of the observations will fall within 2σ̂ of the least squares line.

Inference for Simple Linear Regression
• Inference is based on the ideal simple linear regression model holding.
• Inference is based on taking repeated random samples (y₁, …, y_n) from the same subpopulations (x₁, …, x_n) as in the observed data.
• Types of inference:
– Hypothesis tests for the intercept and slope
– Confidence intervals for the intercept and slope
– Confidence interval for the mean of Y at X = X₀
– Prediction interval for a future Y for which X = X₀

Tools for model checking
1. Scatterplot of Y vs. X (see Display 8.6)
2. Scatterplot of residuals vs. fits (see Display 8.12): look for nonlinearity, non-constant variance, and outliers
3. Normal probability plot (Section 8.6.3): for checking the normality assumption

Outliers and Influential Observations
• An outlier is an observation that lies outside the overall pattern of the other observations. A point can be an outlier in the x direction, the y direction, or in the direction of the scatterplot. For regression, the outliers of concern are those in the x direction and in the direction of the scatterplot. A point that is an outlier in the direction of the scatterplot will have a large residual.
• An observation is influential if removing it markedly changes the least squares regression line. A point that is an outlier in the x direction will often be influential.
• The least squares method is not resistant to outliers. Follow the outlier examination strategy in Display 3.6 for dealing with outliers in the x direction and outliers in the direction of the scatterplot.

Transformations
• Goal: find transformations f(y) and g(x) such that the simple linear regression model approximately describes the relationship between f(y) and g(x).
• Tukey's Bulging Rule can be used to find candidate transformations.
• Prediction after transformation
• Interpreting log transformations
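The log-transformation interpretation from the two-group review (e^(μ₂−μ₁) equals the ratio of population medians when the logged values are symmetric) can be checked by simulation. This is a sketch with hypothetical parameters, not data from the course:

```python
import math
import random

random.seed(0)

# Two hypothetical lognormal populations: logged values are normal with
# means mu1 and mu2 and a common standard deviation sigma.
mu1, mu2, sigma, n = 1.0, 1.5, 0.4, 50000
pop1 = [math.exp(random.gauss(mu1, sigma)) for _ in range(n)]
pop2 = [math.exp(random.gauss(mu2, sigma)) for _ in range(n)]

def median(values):
    s = sorted(values)
    m = len(s)
    return s[m // 2] if m % 2 else (s[m // 2 - 1] + s[m // 2]) / 2

# Means of the logged values (estimates of mu1 and mu2)
m1 = sum(math.log(v) for v in pop1) / n
m2 = sum(math.log(v) for v in pop2) / n

ratio_of_medians = median(pop2) / median(pop1)
back_transformed = math.exp(m2 - m1)
# Up to simulation noise, both quantities are close to e^(mu2 - mu1) = e^0.5.
```

The same back-transformation logic underlies "prediction after transformation": exponentiating a predicted mean on the log scale estimates a median, not a mean, on the original scale.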