AP Statistics Inference – Chapter 14

Hypothesis Tests: Slopes
• Given: Observed slope relating Education to Job Prestige = 2.47
• Question: Can we generalize this to the population of all Americans?
  – How likely is it that this observed slope was actually drawn from a population with slope = 0?
• Solution: Conduct a hypothesis test
• Notation: sample slope = b, population slope = β
• H0: Population slope β = 0
• H1: Population slope β ≠ 0 (two-tailed test)

Review: Slope Hypothesis Tests
• What information lets us do a hypothesis test?
• Answer: Estimates of a slope (b) have a sampling distribution, like any other statistic
  – It is the distribution of every value of the slope, based on all possible samples (of size N)
• If certain assumptions are met, the sampling distribution approximates the t-distribution
  – Thus, we can assess the probability that a given value of b would be observed if β = 0
  – If that probability is low – below alpha – we reject H0

Review: Slope Hypothesis Tests
• Visually: If the population slope (β) is zero, then the sampling distribution would center at zero
  – Since the sampling distribution is a probability distribution, we can identify the likely values of b if the population slope is zero
[Figure: sampling distribution of the slope b, centered at 0. If β = 0, observed slopes should commonly fall near zero, too. If an observed slope falls very far from 0, it is improbable that β is really equal to zero; thus, we can reject H0.]

Bivariate Regression Assumptions
• Assumptions for bivariate regression hypothesis tests:
• 1. Random sample
  – Ideally N > 20
  – But different rules of thumb exist (10, 30, etc.)
• 2. Variables are linearly related
  – i.e., the mean of Y increases linearly with X
  – Check a scatter plot for a general linear trend
  – Watch out for non-linear relationships (e.g., U-shaped)
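The idea of a sampling distribution for b can be illustrated with a short simulation. This is a minimal sketch (not from the slides): the sample size, the zero effect, and the noise level are all made-up numbers, chosen only to show that when β = 0, fitted slopes cluster around zero with a spread equal to the standard error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the sampling distribution of the slope b when the true
# population slope (beta) is zero: draw many samples, fit a line to
# each, and collect the estimated slopes.
N = 50        # sample size per draw (illustrative)
DRAWS = 2000  # number of simulated samples

slopes = []
for _ in range(DRAWS):
    x = rng.normal(12, 3, size=N)          # e.g., years of education
    y = 40 + 0 * x + rng.normal(0, 10, N)  # beta = 0: X does not affect Y
    b, a = np.polyfit(x, y, 1)             # fitted slope and intercept
    slopes.append(b)

slopes = np.array(slopes)
print(f"mean of sampled slopes: {slopes.mean():.3f}")  # centers near 0
print(f"s.d. of sampled slopes: {slopes.std():.3f}")   # the standard error
```

A histogram of `slopes` would look like the bell-shaped figure above: most fitted slopes land near zero, and an observed slope far out in the tails would be grounds to reject H0.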
Bivariate Regression Assumptions
• 3. Y is normally distributed for every outcome of X in the population
  – "Conditional normality"
• Example: Years of Education = X, Job Prestige = Y
• Suppose we look only at a sub-sample: X = 12 years of education
  – Is a histogram of Job Prestige approximately normal?
  – What about for people with X = 4? X = 16?
• If all are roughly normal, the assumption is met

Bivariate Regression Assumptions
• Normality: Examine sub-samples at different values of X. Make histograms and check for normality.
[Figure: two histograms of HAPPY for sub-samples, plotted alongside INCOME (0–100,000). One (Std. Dev = 1.51, Mean = 3.84, N = 60) is roughly normal: "Good". The other (Std. Dev = 3.06, Mean = 4.58, N = 60) is not: "Not very good".]

Bivariate Regression Assumptions
• 4. The variances of prediction errors are identical at different values of X
  – Recall: Error is the deviation from the regression line
  – Is the dispersion of error consistent across values of X?
  – Definition: "homoskedasticity" = error dispersion is consistent across values of X
  – Opposite: "heteroskedasticity" = errors vary with X
• Test: Compare errors for X = 12 years of education with errors for X = 2, X = 8, etc.
  – Are the errors around the line similar? Or different?

Bivariate Regression Assumptions
• Homoskedasticity: Equal error variance
• Examine error at different values of X. Is it roughly equal?
[Figure: scatterplot of Y against INCOME (0–100,000) with a consistent error spread around the line. "Here, things look pretty good."]

Bivariate Regression Assumptions
• Heteroskedasticity: Unequal error variance
• At higher values of X, error variance increases a lot.
[Figure: scatterplot of Y against INCOME (0–100,000) in which the error spread fans out as INCOME rises. "This looks pretty bad."]
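The "compare errors at different values of X" check can be done numerically as well as by eye. The sketch below is an illustration on simulated data (the income range, noise levels, and the 20,000 scaling factor are all arbitrary choices, not from the slides): fit the line, split the residuals at the median of X, and compare their spreads. Roughly equal spreads suggest homoskedasticity; a large ratio suggests heteroskedasticity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Rough homoskedasticity check: fit a line, then compare the spread of
# the residuals in the lower and upper halves of X. Heteroskedastic
# errors grow (or shrink) with X, so the two spreads differ a lot.
def error_spread_ratio(x, y):
    b, a = np.polyfit(x, y, 1)
    resid = y - (a + b * x)
    low = resid[x <= np.median(x)]
    high = resid[x > np.median(x)]
    return high.std() / low.std()

x = rng.uniform(0, 100_000, 500)  # e.g., INCOME (illustrative)

# Homoskedastic: constant error variance across X
y_good = 5 + 0.00003 * x + rng.normal(0, 1, 500)
# Heteroskedastic: error variance increases with X
y_bad = 5 + 0.00003 * x + rng.normal(0, 1, 500) * (x / 20_000)

print(round(error_spread_ratio(x, y_good), 2))  # close to 1
print(round(error_spread_ratio(x, y_bad), 2))   # well above 1
```

This is only a coarse two-bin version of the "X = 2, X = 8, X = 12" comparison above; plotting residuals against X remains the most informative diagnostic.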
Bivariate Regression Assumptions
• Notes/Comments:
• 1. Overall, regression is robust to violations of assumptions
  – It often gives fairly reasonable results, even when assumptions aren't perfectly met
• 2. Variations of regression can handle situations where assumptions aren't met
• 3. But there are also further diagnostics to help ensure that results are meaningful…

Regression Hypothesis Tests
• If assumptions are met, the sampling distribution of the slope (b) approximates a t-distribution
• The standard deviation of the sampling distribution is called the standard error of the slope (s_b)
• Population formula for the standard error:

  s_b = √( s_e² / Σᵢ (Xᵢ − X̄)² )   (sum over i = 1, …, N)

• Where s_e² is the variance of the regression error

Regression Hypothesis Tests
• Estimating s_e² lets us estimate the standard error:

  ŝ_e² = Σᵢ eᵢ² / (N − 2) = SS_ERROR / (N − 2) = MS_ERROR

• Now we can estimate the S.E. of the slope:

  ŝ_b = √( MS_ERROR / Σᵢ (Xᵢ − X̄)² )

Regression Hypothesis Tests
• Finally, a t-value can be calculated:
  – It is the slope divided by the standard error

  t(N−2) = b_YX / s_b = b_YX / √( MS_ERROR / (s_X² (N − 1)) )

• Where s_b is the sample point estimate of the standard error
• The t-value is based on N − 2 degrees of freedom

Regression Confidence Intervals
• You can also use the standard error of the slope to estimate confidence intervals:

  C.I. = b ± s_b · t(N−2)

• Where t(N−2) is the t-value for a two-tailed test given a desired α-level
• Example: Observed slope = 2.5, S.E. = .10
• The 95% t-value for 102 d.f. is approximately 2
• 95% C.I. = 2.5 ± 2(.10)
• Confidence interval: 2.3 to 2.7

Regression Hypothesis Tests
• You can also use a t-test to determine if the constant (a) is significantly different from zero
  – But this is typically less useful to do

  t(N−2) = a_YX / √( MS_ERROR / (N − 1) )

• Hypotheses (α = population parameter of a):
  – H0: α = 0, H1: α ≠ 0
• But most research focuses on slopes

Regression: Outliers
• Note: Even if regression assumptions are met, slope estimates can have problems
• Example: Outliers – cases with extreme values that differ greatly from the rest of your sample
• Outliers can result from:
  – Errors in coding or data entry
  – Highly unusual cases
  – Or, sometimes they reflect important "real" variation
• Even a few outliers can dramatically change estimates of the slope (b)

Regression: Outliers
• Outlier example: An extreme case pulls the regression line up
[Figure: scatterplot (both axes roughly −4 to 4) showing a regression line pulled upward by one extreme case, alongside the regression line with the extreme case removed from the sample.]

Regression: Outliers
• Strategy for dealing with outliers:
• 1. Identify them
  – Look at scatterplots for extreme values
  – Or, have computer software compute outlier diagnostic statistics
  – There are several statistics to identify cases that are strongly affecting the regression slope
  – Examples: "leverage", Cook's D, DFBETA
  – Computer software can even identify "problematic" cases for you… but it is preferable to do it yourself

Regression: Outliers
• 2. Depending on the circumstances, either:
• A) Drop cases from the sample and re-do the regression
  – Especially for coding errors and very extreme outliers
  – Or if there is a theoretical reason to drop cases
  – Example: In an analysis of economic activity, communist countries differ a lot…
• B) Or, sometimes it is reasonable to leave outliers in the analysis
  – e.g., if there are several that represent an important minority group in your data
• When writing papers, identify whether outliers were excluded (and the effect that had on the analysis).
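The standard-error, t-value, and confidence-interval formulas can be traced end to end in code. This is a sketch on simulated data (the variable names, the true slope of 2.5, and the noise level are illustrative; N = 104 is chosen so the t-statistic has the 102 degrees of freedom used in the example above, and 1.984 is the two-tailed 95% critical value for 102 d.f., the "approximately 2" in the slide):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated education/prestige data with a true slope of 2.5 (illustrative)
N = 104
educ = rng.normal(13, 3, N)
prestige = 10 + 2.5 * educ + rng.normal(0, 8, N)

b, a = np.polyfit(educ, prestige, 1)   # fitted slope and intercept
resid = prestige - (a + b * educ)      # regression errors e_i

ms_error = np.sum(resid**2) / (N - 2)  # SS_ERROR / (N - 2) = MS_ERROR
se_b = np.sqrt(ms_error / np.sum((educ - educ.mean())**2))  # est. S.E. of b
t = b / se_b                           # t with N - 2 = 102 d.f.

t_crit = 1.984  # two-tailed 95% critical value for 102 d.f. (approx. 2)
ci_low, ci_high = b - t_crit * se_b, b + t_crit * se_b
print(f"b = {b:.2f}, s.e. = {se_b:.2f}, t = {t:.1f}")
print(f"95% C.I.: {ci_low:.2f} to {ci_high:.2f}")
```

In practice a library routine such as `scipy.stats.linregress` reports the slope, standard error, and p-value directly; the hand computation above is useful mainly for seeing where those numbers come from.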