STA 220H1F LEC0201 Week 10 Statistical Inference Continued: More on Statistical Testing and a New Probability Distribution

The Connection Between Confidence Intervals and Hypothesis Tests

Suppose you have a sample of n = 25 observations from a normal distribution with σ = 20, and from the sample you calculate x̄ = 100. Suppose you want to test H₀: μ = 95 versus Hₐ: μ ≠ 95 at the 10% significance level.

The test statistic is

 z = (x̄ − μ₀) / (σ/√n) = (100 − 95) / (20/√25) = 1.25.

For a two-sided z-test at the 10% significance level, the cut-off value for the test statistic to be significant is 1.645. So here we fail to reject H₀.

Construct a 90% confidence interval for μ:

 100 ± 1.645 × 20/√25 = 100 ± 6.58 = (93.42, 106.58).

μ = 95 is in the confidence interval. Every number inside the 90% confidence interval for μ is a value of μ₀ for which you would not reject H₀: μ = μ₀ versus Hₐ: μ ≠ μ₀ at the 0.10 significance level.

Note that this relationship between hypothesis tests and confidence intervals holds for two-sided tests with two-sided intervals. (For a duality with one-sided tests, it is necessary to construct one-sided confidence intervals.)
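The duality can be checked numerically. Here is a minimal sketch (not part of the original notes) that reproduces the example above with scipy; all numbers come from the example itself:

```python
import numpy as np
from scipy import stats

# z test of H0: mu = 95 vs Ha: mu != 95, sigma known, alpha = 0.10
n, sigma, xbar, mu0, alpha = 25, 20, 100, 95, 0.10

se = sigma / np.sqrt(n)                 # 20 / sqrt(25) = 4
z = (xbar - mu0) / se                   # 1.25
z_crit = stats.norm.ppf(1 - alpha / 2)  # 1.645

# 90% confidence interval for mu
lo, hi = xbar - z_crit * se, xbar + z_crit * se

print(f"|z| = {abs(z):.2f} < {z_crit:.3f}: fail to reject H0")
print(f"90% CI = ({lo:.2f}, {hi:.2f}); contains mu0 = 95: {lo <= mu0 <= hi}")
```

Failing to reject H₀ at the 10% level and having μ₀ fall inside the 90% interval are the same statement, which is what the printout confirms.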
Use and Abuse of Hypothesis Tests

• Treating hypothesis tests as a decision problem:
 o Want to choose between "reject H₀" and "fail to reject H₀."
 o Reject H₀ if the P-value is ≤ α, the significance level.
 o Choosing the significance level α: How much evidence is required to reject H₀?
 o The smaller α is, the less likely it is that H₀ is rejected. A very small α may be necessary if rejecting H₀ is a costly decision, or if H₀ is an established theory.
 o In practice, always report the P-value. What's the difference between 0.049 and 0.051?
• Statistical significance is not the same thing as practical significance. To see this, consider the test H₀: μ = 40 versus Hₐ: μ ≠ 40, and suppose σ = 5, n = 50, x̄ = 41, α = 0.05. Then the test statistic is

 z = (x̄ − μ₀) / (σ/√n) = (41 − 40) / (5/√50) = 1.41,  P-value = 2P(Z ≥ 1.41) = 0.157.

Conclude that the data do not give evidence against H₀. Now suppose n = 300. Then

 z = (41 − 40) / (5/√300) = 3.46,  P-value = 2P(Z ≥ 3.46) = 0.0006.

Conclude that the data give strong evidence against H₀.
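A quick numerical check of these two P-values (a sketch, not from the notes; it only re-runs the arithmetic above):

```python
import numpy as np
from scipy import stats

# Same one-point effect (xbar = 41 vs mu0 = 40, sigma = 5), two sample sizes
for n in (50, 300):
    z = (41 - 40) / (5 / np.sqrt(n))
    p = 2 * stats.norm.sf(z)   # two-sided P-value
    print(f"n = {n:3d}: z = {z:.2f}, P-value = {p:.4f}")
```

The identical one-point difference flips from "not significant" to "highly significant" purely through the sample size, which is the point of the bullets that follow.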
 o Very small effects can be highly statistically significant when a test is based on a large sample.
 o But a statistically significant effect need not be practically important.
 o Lack of significance does not imply that H₀ is true. Maybe the sample size is too small or there is too much variability in the data to detect a difference.
• Small P-values may occur:
 o By chance.
 o Because H₀ is false.
 o Because of problems related to data collection (e.g., lack of randomization).
 o Because of violations of the assumptions underlying the particular significance test.
• Testing cannot correct flaws in the data collection design (bias, confounding, lack of randomization).
• Test results are not reliable if the statements of the hypotheses are suggested by the data. This is called data snooping.
• If multiple tests are carried out, some are likely to be significant by chance alone.
 o For example, if α = 0.05, we expect significant results 5% of the time, even when no difference exists.
 o Be suspicious when you see a few significant results after many tests have been carried out.
• Statistical tests should always be preceded by plots and summary statistics, which may indicate problems with the data (e.g., skew, outliers) and illustrate the effect you are seeking (if you can't visualize it with your data, is it practically important?). Make a picture, make a picture, make a picture.
• Hypothesis testing works well if you know what effect you are seeking, design a study to search for that effect, and use a hypothesis test to assess the evidence.

Power of a Test

• The significance level α of a test shows how the method performs in repeated sampling.
• If H₀ is true and you use a 1% significance test repeatedly with a different sample each time, you will reject H₀ (a wrong conclusion!) 1% of the time and not reject H₀ 99% of the time.
• If α is too small, H₀ may never be rejected, even if the true parameter value is far from the hypothesized value.
• We want a test with the ability to detect a false H₀. This ability is measured by the probability that the test will reject H₀ when an alternative is true.

The probability that a fixed significance level α test will reject H₀ when a particular alternative value of the parameter is true is called the power of the test against that alternative. In other words, the power of a test is the probability of making a correct decision when the null hypothesis is false. The higher the power, the more sensitive the test is in detecting a false H₀.

How to calculate the power of a test for the mean:
1. State H₀ and Hₐ.
2. Find the values of x̄ (or p̂) that will lead to the conclusion "reject H₀" at the level α.
3. Find the probability of these values of x̄ (or p̂) under a particular value of the parameter in Hₐ. This probability is the power of the test against that alternative.

Example: The mean score on an aptitude test is reported to be 500 in Ontario. Consider a level 10% test of H₀: μ = 500 versus Hₐ: μ ≠ 500. Suppose that the population s.d. of the scores is σ = 75, and suppose that we will use a SRS of 100 scores to conduct the test.

H₀ will be rejected if the test statistic z = (x̄ − 500)/(σ/√n) is ≥ 1.645 or ≤ −1.645, where σ/√n = 75/√100 = 7.5.

If z ≥ 1.645, then x̄ ≥ 500 + 1.645 × 7.5 = 512.338.
If z ≤ −1.645, then x̄ ≤ 500 − 1.645 × 7.5 = 487.662.

What is the power of this test if μ is not 500 but is 505?

 Power = P(reject H₀ | μ = 505)
  = P(x̄ ≥ 512.338 | μ = 505) + P(x̄ ≤ 487.662 | μ = 505)
  = P(Z ≥ (512.338 − 505)/7.5) + P(Z ≤ (487.662 − 505)/7.5)
  = P(Z ≥ 0.98) + P(Z ≤ −2.31)
  = 0.1635 + 0.0104 = 0.1739.
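The same power calculation in code, as a sketch mirroring the hand computation above (the tiny discrepancy from 0.1739 comes from the rounding of z values in the normal table):

```python
import numpy as np
from scipy import stats

sigma, n, mu0, mu_alt = 75, 100, 500, 505
se = sigma / np.sqrt(n)    # 7.5
upper = mu0 + 1.645 * se   # 512.338
lower = mu0 - 1.645 * se   # 487.662

# Probability of landing in the rejection region when in fact mu = 505
power = (stats.norm.sf((upper - mu_alt) / se)     # P(xbar >= 512.338 | mu = 505)
         + stats.norm.cdf((lower - mu_alt) / se))  # P(xbar <= 487.662 | mu = 505)
print(f"power at mu = 505: {power:.4f}")           # about 0.174
```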
How to increase power
1. Power is larger the further the alternative value at which it is evaluated is from the value proposed in H₀.
2. Higher α gives higher power.
3. Less variability gives higher power.
4. The larger the sample size, the greater the power.

When planning studies, the necessary sample size is determined by choosing a significance level and the desired power, then determining how many observations are necessary to achieve that power for a particular difference between H₀ and Hₐ of interest. This "particular difference" should be chosen for practical considerations; it is often called the effect size.

If Hypothesis Testing is a Decision, Could I Make the Wrong One?

There are two types of decision error:
1. Type I error: Reject H₀ when in fact it is true. This happens with probability α.
2. Type II error: Fail to reject H₀ when in fact Hₐ is true. This happens with probability β = 1 − power.
Example: Aptitude test continued
P(Type I error) = α = 0.1.
The power at μ = 505 was calculated to be 0.1739.
So P(Type II error) = β = 1 − 0.1739 = 0.8261.
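The bullets below note the trade-off between the two error probabilities; this sketch (illustrative, not from the notes) recomputes β for the aptitude test under a different α and a larger n:

```python
import numpy as np
from scipy import stats

def beta(alpha, n, sigma=75, mu0=500, mu_alt=505):
    """Type II error of the two-sided z test of H0: mu = mu0 when mu = mu_alt."""
    se = sigma / np.sqrt(n)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    power = (stats.norm.sf((mu0 + z_crit * se - mu_alt) / se)
             + stats.norm.cdf((mu0 - z_crit * se - mu_alt) / se))
    return 1 - power

for alpha in (0.10, 0.05):
    for n in (100, 400):
        print(f"alpha = {alpha:.2f}, n = {n:3d}: beta = {beta(alpha, n):.3f}")
# Making alpha smaller raises beta; making n larger lowers it.
```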
• In practice, we want to have low probability of both types of error.
• But lowering the probability of one raises the other.
• Increasing the sample size decreases the probability of both types of error.
• It is necessary to decide which type of error is more serious based on the context.

Example: Disease testing
H₀: the patient doesn't have the disease.
Hₐ: the patient does have the disease.
Type I error: the test says the patient has the disease when he doesn't.
Type II error: the test says the patient doesn't have the disease when he does.
Which type of error is more serious?

The Court Case – Statistical Testing Comparison

Courtroom: Innocent until proven guilty. "The defendant is innocent" versus "The defendant is guilty."
Hypothesis test: Null hypothesis versus alternative hypothesis.

Courtroom: The case goes to trial because the prosecution believes that the assumption of innocence is incorrect. The prosecution collects evidence.
Hypothesis test: The researcher believes the alternative hypothesis may be correct. The researcher collects data.

Courtroom: The hope is that the jurors will be convinced that the evidence would be extremely unlikely if the assumption of innocence were true.
Hypothesis test: The intent is that the data (summarized in a test statistic) would be extremely unlikely if the null hypothesis were true.

Choices for the jury:
(1) The evidence is not strong enough to convincingly rule out that the defendant is innocent. Conclusion: "Not guilty."
(2) The evidence was strong enough that we are willing to rule out the possibility that an innocent person produced it. We reject that the defendant is innocent, and conclude that he or she is guilty.

Choices for the researcher:
(1) We cannot reject the null hypothesis based on the data; the P-value is not small enough. Conclusion: "Fail to reject H₀."
(2) The data were unusual enough that we are willing to rule out that the null hypothesis is true and produced the observed data; the P-value is small. We reject the null hypothesis and conclude that the alternative hypothesis is true.

"Beyond a reasonable doubt" corresponds to asking whether the P-value is large or small: conviction requires the P-value to be small (≤ α).

Potential error I: An innocent person is falsely convicted. (Type I error: Reject the null hypothesis when in fact it is true.)
Potential error II: A criminal is erroneously freed. (Type II error: Fail to reject the null hypothesis when in fact it is false.)

What if you want to carry out a test or construct a confidence interval for the population mean μ and σ is not known?
Estimate σ by the sample standard deviation

 s = √( Σ(xᵢ − x̄)² / (n − 1) ).

The standard error of the mean, s/√n, is the resulting estimate of the standard deviation of x̄.

If n is very large, (x̄ − μ)/(s/√n) has approximately a standard normal distribution. But, in general, even if the distribution of the variable in the population is exactly normal, (x̄ − μ)/(s/√n) is not normally distributed.

The number of degrees of freedom of a statistic refers to the number of "free-to-vary" values entering its calculation. For s, the number of degrees of freedom is n − 1.

The (Student's) t Distribution

If the data are independent and generated from a SRS or a randomized experiment from a population with a normal distribution with mean μ, then

 t = (x̄ − μ) / (s/√n)

has a t distribution with n − 1 degrees of freedom.

What does the t distribution look like?
• Symmetric
• Bell-shaped
• Centred at 0
• Heavier tails than the normal distribution
• The smaller the degrees of freedom, the heavier the tails
• As the degrees of freedom get large, the t distribution gets close to the standard normal distribution.
• We would need a probability distribution table for the t distribution for every possible number of degrees of freedom, so the tables show only a few quantiles.
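Software sidesteps the limited printed tables. This sketch (illustrative, not from the notes) compares the upper 2.5% critical values of the t distribution with the standard normal value as the degrees of freedom grow:

```python
from scipy import stats

print(f"standard normal: z* = {stats.norm.ppf(0.975):.3f}")
for df in (2, 6, 24, 100):
    # Upper 2.5% critical value of the t distribution with df degrees of freedom
    print(f"t, df = {df:3d}: t* = {stats.t.ppf(0.975, df):.3f}")
# Heavier tails mean larger critical values at small df,
# shrinking toward the normal value 1.960 as df grows.
```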
Inference using the t distribution

A 100(1 − α)% confidence interval for μ:

 x̄ ± t*(α/2, n−1) × s/√n,

where t*(α/2, n−1) is the upper α/2 critical value of the t distribution with n − 1 degrees of freedom.

Testing H₀: μ = μ₀: the test statistic

 t = (x̄ − μ₀) / (s/√n)

is an observation from a t distribution with n − 1 degrees of freedom if H₀ is true.

This confidence interval and the P-value for this test are exact when the distribution of the population is normal, and are approximate for large n (by the Central Limit Theorem). (See the textbook for some guidelines on how large n needs to be depending on how much the data distribution deviates from the normal distribution.)

Example: Body Temperatures
According to Wunderlich's "axioms on clinical thermometry" (1868), the normal mean body temperature of healthy adults is 98.6 degrees Fahrenheit (37.0 degrees Celsius). A study reported in JAMA (1992) examined whether this claim was out-of-date by looking at oral temperatures. We have body temperature measurements in Fahrenheit for 65 men and 65 women. Is there evidence that Wunderlich was wrong?
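The notes do not reproduce the raw JAMA data, so the sketch below runs the one-sample t test on simulated temperatures; the mean (98.25), s.d. (0.73), and seed are placeholders, purely for illustration, not the study's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
temps = rng.normal(loc=98.25, scale=0.73, size=130)  # hypothetical data

# Two-sided one-sample t test of H0: mu = 98.6
t_stat, p_value = stats.ttest_1samp(temps, popmean=98.6)

# 95% t confidence interval: xbar +/- t* s/sqrt(n)
n, xbar, s = len(temps), temps.mean(), temps.std(ddof=1)
t_crit = stats.t.ppf(0.975, n - 1)
lo, hi = xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n)

print(f"t = {t_stat:.2f} (df = {n - 1}), P-value = {p_value:.2g}")
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
```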
100 versus :
100 and we’re interested in the alternative value Suppose that the sample size is 7, 110. 0.01, and 10. 16 Calculate |
|
1 From the power curves with df=6, is approximately 60%. So the power of the test at 110 is only 0.4. What if we switch to 0.05? What sample size would we need for 90% power? What if we were interested in the alternative value 105? 17