Induction on Regression (Ch 15)
AP Statistics Inference – Chapter 14

Hypothesis Tests: Slopes
• Given: Observed slope relating Education to Job Prestige = 2.47
• Question: Can we generalize this to the population of all Americans?
– How likely is it that this observed slope was actually drawn from a population with slope = 0?
• Solution: Conduct a hypothesis test
• Notation: sample slope = b, population slope = β
• H0: Population slope β = 0
• H1: Population slope β ≠ 0 (two-tailed test)

Review: Slope Hypothesis Tests
• What information lets us do a hypothesis test?
• Answer: Estimates of a slope (b) have a sampling distribution, like any other statistic
– It is the distribution of every value of the slope, based on all possible samples (of size N)
• If certain assumptions are met, the sampling distribution approximates the t-distribution
– Thus, we can assess the probability that a given value of b would be observed, if β = 0
– If that probability is low (below alpha), we reject H0

Review: Slope Hypothesis Tests
• Visually: If the population slope (β) is zero, then the sampling distribution would center at zero
– Since the sampling distribution is a probability distribution, we can identify the likely values of b if the population slope is zero
[Figure: sampling distribution of the slope b, centered at 0]
• If β = 0, observed slopes should commonly fall near zero, too
• If the observed slope falls very far from 0, it is improbable that β is really equal to zero. Thus, we can reject H0.

Bivariate Regression Assumptions
• Assumptions for bivariate regression hypothesis tests:
• 1. Random sample
– Ideally N > 20
– But different rules of thumb exist (10, 30, etc.)
• 2. Variables are linearly related
– i.e., the mean of Y increases linearly with X
– Check scatter plot for general linear trend
– Watch out for non-linear relationships (e.g., U-shaped)

Bivariate Regression Assumptions
• 3. Y is normally distributed for every value of X in the population
– “Conditional normality”
• Ex: Years of Education = X, Job Prestige = Y
• Suppose we look only at a sub-sample: X = 12 years of education
– Is a histogram of Job Prestige approximately normal?
– What about for people with X = 4? X = 16?
• If all are roughly normal, the assumption is met

Bivariate Regression Assumptions
• Normality: Examine sub-samples at different values of X. Make histograms and check for normality.
[Figure: scatterplot of HAPPY against INCOME with histograms of HAPPY for two sub-samples (N = 60 each; Mean = 3.84, Std. Dev = 1.51 and Mean = 4.58, Std. Dev = 3.06). Panel labels: “Good” and “Not very good”.]

Bivariate Regression Assumptions
• 4. The variances of prediction errors are identical at different values of X
– Recall: Error is the deviation from the regression line
– Is dispersion of error consistent across values of X?
– Definition: “homoskedasticity” = error dispersion is consistent across values of X
– Opposite: “heteroskedasticity” = error dispersion varies with X
• Test: Compare errors for X = 12 years of education with errors for X = 2, X = 8, etc.
– Are the errors around the line similar? Or different?

Bivariate Regression Assumptions
• Homoskedasticity: Equal Error Variance
• Examine error at different values of X. Is it roughly equal?
[Figure: scatterplot against INCOME (0 to 100000). Annotation: “Here, things look pretty good.”]

Bivariate Regression Assumptions
• Heteroskedasticity: Unequal Error Variance
• At higher values of X, error variance increases a lot.
[Figure: scatterplot against INCOME (0 to 100000). Annotation: “This looks pretty bad.”]

Bivariate Regression Assumptions
• Notes/Comments:
• 1. Overall, regression is robust to violations of assumptions
– It often gives fairly reasonable results, even when assumptions aren’t perfectly met
• 2. Variations of regression can handle situations where assumptions aren’t met
• 3. But, there are also further diagnostics to help ensure that results are meaningful…

Regression Hypothesis Tests
• If assumptions are met, the sampling distribution of the slope (b) approximates a t-distribution
• The standard deviation of the sampling distribution is called the standard error of the slope ($\sigma_b$)
• Population formula of standard error:
$\sigma_b = \sqrt{\dfrac{\sigma_e^2}{\sum_{i=1}^{N} (X_i - \bar{X})^2}}$
• Where $\sigma_e^2$ is the variance of the regression error

Regression Hypothesis Tests
• Estimating $\sigma_e^2$ lets us estimate the standard error:
$\hat{\sigma}_e^2 = \dfrac{\sum_{i=1}^{N} e_i^2}{N-2} = \dfrac{SS_{ERROR}}{N-2} = MS_{ERROR}$
• Now we can estimate the S.E. of the slope:
$\hat{\sigma}_b = \sqrt{\dfrac{MS_{ERROR}}{\sum_{i=1}^{N} (X_i - \bar{X})^2}}$

Regression Hypothesis Tests
• Finally: A t-value can be calculated:
– It is the slope divided by the standard error
$t_{N-2} = \dfrac{b_{YX}}{s_b} = \dfrac{b_{YX}}{\sqrt{\dfrac{MS_{ERROR}}{s_X^2 (N-1)}}}$
• Where $s_b$ is the sample point estimate of the standard error ($\hat{\sigma}_b$ in the previous formula), and $s_X^2 (N-1) = \sum_{i=1}^{N} (X_i - \bar{X})^2$
• The t-value is based on N-2 degrees of freedom

Regression Confidence Intervals
• You can also use the standard error of the slope to estimate confidence intervals:
$C.I. = b \pm s_b (t_{N-2})$
• Where $t_{N-2}$ is the t-value for a two-tailed test given a desired α-level
• Example: Observed slope = 2.5, S.E. = .10
• 95% t-value for 102 d.f. is approximately 2
• 95% C.I. = 2.5 +/- 2(.10)
• Confidence Interval: 2.3 to 2.7

Regression Hypothesis Tests
• You can also use a t-test to determine if the constant (a) is significantly different from zero
– But, this is typically less useful to do
$t_{N-2} = \dfrac{a_{YX}}{s_a}$
• Where $s_a$ is the standard error of the constant, estimated from $MS_{ERROR}$ and N
• Hypotheses (α = population parameter of a):
• H0: α = 0, H1: α ≠ 0
• But, most research focuses on slopes

Regression: Outliers
• Note: Even if regression assumptions are met, slope estimates can have problems
• Example: Outliers, i.e., cases with extreme values that differ greatly from the rest of your sample
• Outliers can result from:
– Errors in coding or data entry
– Highly unusual cases
– Or, sometimes they reflect important “real” variation
• Even a few outliers can dramatically change estimates of the slope (b)

Regression: Outliers
• Outlier Example:
[Figure: scatterplot (axes from -4 to 4) with two regression lines. Annotations: “Extreme case that pulls regression line up”; “Regression line with extreme case removed from sample”.]

Regression: Outliers
• Strategy for dealing with outliers:
• 1. Identify them
• Look at scatterplots for extreme values
• Or, have computer software compute outlier diagnostic statistics
– There are several statistics to identify cases that are affecting the regression slope a lot
– Examples: “Leverage”, Cook’s D, DFBETA
– Computer software can even identify “problematic” cases for you… but it is preferable to do it yourself.

Regression: Outliers
• 2. Depending on the circumstances, either:
• A) Drop cases from sample and re-do regression
– Especially for coding errors, very extreme outliers
– Or if there is a theoretical reason to drop cases
– Example: In analysis of economic activity, communist countries differ a lot…
• B) Or, sometimes it is reasonable to leave outliers in the analysis
– e.g., if there are several that represent an important minority group in your data
• When writing papers, state whether outliers were excluded (and the effect that had on the analysis).
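
The slope test, the confidence interval, and the effect of a single extreme case described in the slides can be sketched in plain Python. This is a minimal illustration, not part of the original deck: the function names `slope_inference` and `confidence_interval` and the toy data are made up, and the critical t-values are assumed to be looked up in a standard t-table for N-2 degrees of freedom.

```python
import math


def confidence_interval(b, se_b, t_crit):
    """C.I. = b +/- s_b * t, with t the two-tailed critical value for N-2 d.f."""
    return (b - t_crit * se_b, b + t_crit * se_b)


def slope_inference(x, y, t_crit):
    """Fit Y = a + bX by least squares and test H0: population slope = 0.

    t_crit is an assumption supplied by the caller: the two-tailed critical
    t-value for N-2 degrees of freedom at the chosen alpha level.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of squared deviations of X, equal to s_X^2 * (N - 1)
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    b = sp_xy / ss_x            # sample slope
    a = mean_y - b * mean_x     # sample constant

    # MS_ERROR = SS_ERROR / (N - 2) estimates the error variance
    ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ms_error = ss_error / (n - 2)

    se_b = math.sqrt(ms_error / ss_x)   # estimated standard error of the slope
    t = b / se_b                        # t-value with N - 2 degrees of freedom

    return {
        "a": a,
        "b": b,
        "se_b": se_b,
        "t": t,
        "df": n - 2,
        "ci": confidence_interval(b, se_b, t_crit),
        "reject_h0": abs(t) > t_crit,
    }


# Made-up toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# 3.182 is the two-tailed 95% critical t-value for 3 d.f.
clean = slope_inference(x, y, t_crit=3.182)

# Add one extreme case and refit; 2.776 is the 95% value for 4 d.f.
with_outlier = slope_inference(x + [10], y + [0], t_crit=2.776)

print("slope without extreme case:", round(clean["b"], 3))
print("slope with extreme case:   ", round(with_outlier["b"], 3))
print("slide example C.I.:", confidence_interval(2.5, 0.10, 2.0))
```

In this toy sample the single added case is enough to flip the sign of the estimated slope, which is the kind of sensitivity the outlier slides warn about. Comparing the fit with and without a suspect case is only a rough check; diagnostics such as leverage, Cook's D, or DFBETA from statistical software give a more systematic view.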