Let's further explore the principles of statistical inference: using samples to make estimates about a population.

Remember that fundamental to statistical inference are probability principles that allow us to answer the question: what would happen if we repeated this random sample many times? According to the laws of probability, each independent, random sample of size-n from the same population yields the following:

true value +/- random error

The procedure, to repeat, must be a random sample or a randomized experiment in order for probability to operate. If not, the use of statistical inference is invalid.

Remember also that sample means are unbiased estimates of the population mean, & that the standard deviation of sample means can be made smaller by increasing the size-n of the random samples. Further: remember that means are less variable & more normally distributed than individual observations. If the underlying population distribution is normal, then the sampling distribution of the mean will also be normal. There's also the Law of Large Numbers. And last but perhaps most important, there's the Central Limit Theorem.

The Point of Departure for Inferential Statistics

Here, now, is the most basic problem in inferential statistics: you've drawn a random sample & estimated a sample mean. How reliable is this estimate? After all, repeated random samples of the same size-n in the same population would be unlikely to give the same sample mean. How do you know, then, where the obtained sample mean would be located in the variable's sampling distribution, i.e. on the histogram displaying the sample means for all possible random samples of the same size-n in the same population?

Can't we simply rely on the fact that the sample mean is an unbiased estimator of the population mean? No, we can't: that only means that the sample mean of a random sample has no systematic tendency to undershoot or overshoot the population mean. We still don't know whether, e.g., the sample mean we obtained is at the very low end or the very high end of the histogram of the sampling distribution, or somewhere around the center. In other words, a sample estimate without an indication of variability is of little value.

In fact, what's the worst thing about a sample of just one observation? Answer: a sample of one observation doesn't allow us to estimate the variability of the statistic over repeated random samples of the same size in the same population. To repeat, a sample estimate without an indication of variability is of little value. What must we do?

Introduction to Confidence Intervals

The solution has to do with the mean's standard deviation, anchored in the square root of the sample size-n. The first thing we do is compute the sample's standard deviation divided by the square root of the sample size-n: the standard error of the mean. What does the result allow us to do? It allows us to situate the mean's variability within the sampling distribution of the sample mean: the distribution of means for all possible random samples of the same size in the same population. And to do so in terms of what rule? The 68-95-99.7 rule: the probability is 68% that x-mean lies within +/- one standard deviation of the population mean (i.e. the true value); 95% that x-mean lies within +/- two standard deviations of the population mean; & 99.7% that x-mean lies within +/- three standard deviations of the population mean.
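In symbols, with mu the population mean, sigma the population standard deviation, & n the sample size, this is just the 68-95-99.7 rule applied to the sampling distribution of the mean:

$$\bar{x} \sim N\!\left(\mu,\ \tfrac{\sigma}{\sqrt{n}}\right), \qquad P\!\left(|\bar{x}-\mu| \le k\,\tfrac{\sigma}{\sqrt{n}}\right) \approx 0.68,\ 0.95,\ 0.997 \quad \text{for } k = 1, 2, 3.$$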
A common practice in statistics is to use the benchmark of +/- two standard deviations: i.e. a range likely to capture 95% of sample means obtained by repeated random samples of the same size-n in the same population. We can therefore conclude: we're 95% certain that this sample mean falls within +/- two standard deviations of the population mean, i.e. of the true population value. Unfortunately, that also means we still have room for worry: 5% of such samples will not obtain a sample mean within this range, i.e. will not capture the true population value.

The interval either captures the parameter (i.e. the population mean) or it doesn't. What's worse: we never know whether the confidence interval captures the parameter or not. As Freedman et al. put it, a 95% confidence interval is "like buying a used car. About 5% turn out to be lemons." Recall that conclusions are always uncertain.

In any event, we've used our understanding of how the laws of probability work in the long run (with repeated random samples of size-n from the same population) to express a specified degree of confidence in the results of this one sample. In sum: the language of statistical inference uses facts about what would happen in the long run to express our confidence in the results of any one random sample of independent observations.

If things are done right, this is how we interpret a 95% confidence interval: "We are 95% confident that the true population value lies between [low-end value] and [high-end value]." Or: "This number was calculated by a method that captures the true population value in 95% of all possible samples." To repeat: the confidence interval either captures the parameter (i.e. the true population value) or it doesn't; there's no in between.

More on Confidence Intervals

Confidence intervals take the following form:

estimate +/- margin of error

Margin of error: how accurate we believe our estimate is, based on the variability of the sample mean in repeated random sampling of the same size & in the same population.

The confidence interval is based on the sampling distribution of sample means: N(mu, standard deviation/square root of n). It is also based on the Central Limit Theorem, which says that the sampling distribution of sample means is approximately normal for large random samples, whatever the underlying population distribution may be. Besides the sampling distribution of sample means & the Central Limit Theorem, the computation of the confidence interval involves two other components:

C-level: the confidence level, which we've already considered. It defines the probability that the confidence interval captures the parameter.

z-score (z*): the standard score defined in terms of the C-level. It is the value on the standard normal curve with area C between -z* & +z*.

Here's how z-scores & C-levels are related to each other:

C-level    z-score (z*)
90%        1.645
95%        1.960
99%        2.576

Any normal curve has probability C between the point z* standard deviations below the mean & the point z* standard deviations above the mean. To compute the confidence interval, choose the z-score that corresponds to the desired level of confidence (1.645 for 90%; 1.960 for 95%; 2.576 for 99%), then multiply the z-score by the standard deviation divided by the square root of the sample size-n. This anchors the range of the estimated values of the confidence interval within the specified probability interval of the sampling distribution of sample means. In a graph of the sampling distribution, the darkened areas at the extremes of the tails on the horizontal axis demarcate the critical areas that are integral to significance tests, which we'll soon discuss.
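Putting these pieces together in symbols, the confidence interval for the population mean is:

$$\bar{x} \pm z^{*}\,\frac{\sigma}{\sqrt{n}}, \qquad \text{margin of error} = z^{*}\,\frac{\sigma}{\sqrt{n}}$$

(In practice the sample standard deviation stands in for sigma, as in the Stata output below.)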
How to do it in Stata:

. ci write

    Variable |        Obs        Mean    Std. Err.    [95% Conf. Interval]
-------------+------------------------------------------------------------
       write |        200      52.775    .6702372     51.45332    54.09668

If the data aren't in memory:

. cii 200 63.1 7.8

    Variable |        Obs        Mean    Std. Err.    [95% Conf. Interval]
-------------+------------------------------------------------------------
             |        200        63.1    .5515433     62.01238    64.18762

Review: Confidence Intervals

The chances are in the sampling procedure, not in the parameter: confidence intervals, & inferential statistics in general, are premised on random sampling & the long-run laws of probability. If there's no random sample or randomized experiment, the use of a confidence interval is invalid. What if you have data for an entire population? Then there's no need for a confidence interval: congratulations!

The confidence interval either captures the parameter or it doesn't: it's an either/or matter. We're saying that we calculated the numbers according to a method that, according to the laws of probability, will capture the parameter in [90% or 95% or 99%] of all possible random samples of the same size. That means, though, that a certain percent of the time (10%, 5% or 1%) the confidence interval does not capture the parameter. And we won't know when it doesn't capture the parameter.

By means of their influence on the sample mean, extreme outliers & strong skewness can have a serious effect on the confidence interval. Consider correcting or removing extreme outliers (if doing so can be justified) &/or transforming the data, or else consider using relatively resistant procedures. So, always graph a variable's data, checking for pronounced skewness & outliers, before computing a confidence interval. To make an informed decision, compute the confidence interval both before & after modifying the data. Compounding the problem are nonprobability (i.e. non-sampling) causes of bias. So we have two sources of uncertainty: probabilistic (sampling) & non-probabilistic (non-sampling). All conclusions are uncertain.

How to reduce a confidence interval's margin of error? (See the sketch below.)
- Use a lower level of confidence (smaller C, i.e. a narrower confidence interval).
- Increase the sample size (larger n).
- Reduce the standard deviation (via more precise measurement of variables or a more precise sample design).
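Here is a minimal Stata sketch of that margin-of-error arithmetic, using an illustrative standard deviation of 10 (the numbers are hypothetical, not from any dataset used in these notes):

* margin of error = z* x (standard deviation / square root of n)
* invnormal() returns the z-score for a given cumulative probability
display invnormal(.975) * 10/sqrt(100)    // 95% confidence, n = 100: about 1.96
display invnormal(.975) * 10/sqrt(400)    // quadrupling n halves the margin: about 0.98
display invnormal(.950) * 10/sqrt(100)    // dropping to 90% confidence: about 1.64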
Significance Tests

What is significance testing? How do confidence intervals pertain to significance testing?

Variability is everywhere. "... variation itself is nature's only irreducible essence." (Stephen Jay Gould) E.g., weighing the same item repeatedly. E.g., measuring blood pressure, cholesterol, estrogen, or testosterone levels at various times. E.g., performance on standardized tests or in sports events at various times. For any given unbiased measurement:

measured value = true value +/- random error

How do we distinguish an outcome potentially worth paying attention to from an outcome based on mere random variability? That is, how do we distinguish an outcome potentially worth paying attention to from an outcome based on mere chance? We do so by using probability to establish that an outcome of a particular magnitude or level could rarely have occurred by mere chance. Is the magnitude of the sample mean large enough, relative to its standard deviation divided by the square root of sample size-n, to have rarely occurred by chance alone? The scientific method tries to make it hard to establish that such an outcome occurred for reasons other than chance. It makes us start out by asserting that there's no effect, or no difference: the null hypothesis. Hence significance tests, like confidence intervals, are premised on a variable's sampling distribution, i.e. on what would happen with repeated random samples of the same size in the same population over the very long run.

Significance Tests: Stating Hypotheses

The null hypothesis, the starting point for a significance test, asserts that any observed effect is due simply to chance: that there's no effect, or no difference, beyond random noise. By contrast, the alternative hypothesis asserts that the observed effect is big enough, relative to the variable's standard deviation divided by the square root of n, that it would rarely have occurred by sheer chance: that the observed effect, or difference, needs to be taken seriously.

The statement being tested in a significance test is called the null hypothesis. The significance test is designed to assess the strength of the evidence against the null hypothesis. Usually the null hypothesis is a statement of "no effect" or "no difference." The alternative to the null hypothesis is the alternative hypothesis. It is what we hope to find evidence to support. The alternative hypothesis may be one-sided or two-sided. A one-sided example? A two-sided example?

The null hypothesis expresses the idea that the observed difference is due merely to chance: that it's a fluke. The alternative hypothesis expresses the idea that the observed difference is due to reasons beyond mere chance. Hypotheses always refer to some population or model, not to a particular outcome. Thus the null hypothesis & the alternative hypothesis must always be stated in terms of population parameters. Put differently, a statistical hypothesis is a claim about a population. Is the evidence from the sample statistically consistent with the population claim or not? E.g., the null & alternative hypotheses for a study on possible cancer risks of living near power lines. E.g., the null & alternative hypotheses concerning the relationship of gender to standardized math test scores.

Depending on the test results, we either fail to reject the null hypothesis or reject the null hypothesis. We never accept the null hypothesis. Why not?

Test Statistic

A test statistic measures the compatibility between the null hypothesis & the data. It is computed as a z-score, but with the standard deviation divided by the square root of sample size-n. This adjustment reflects the fact that the data are drawn from a sample.

Statistical Significance: the P-Value

The probability that the test statistic would take a value as extreme as or more extreme than the value actually observed is called the P-value of the test. The smaller the P-value, the stronger the data's evidence against the null hypothesis. That is, the smaller the P-value, the stronger the data's evidence in favor of the alternative hypothesis. Why? The P-value is small enough to be statistically significant if the magnitude of the sample mean is sufficiently large in relation to its standard deviation divided by the square root of sample size-n. The P-value, to repeat, is the observed significance level. The P-value is based on the sampling variability of the sample mean: it lies in the extreme tail areas of the sampling distribution, beyond the boundaries defined by the sample mean +/- the z-score times (the standard deviation/square root of the sample size-n). Depending on the form of the alternative hypothesis, the significance test may be two-tailed or one-tailed.
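In symbols, for a hypothesized population mean mu_0, the test statistic & its P-value are:

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}, \qquad P\text{-value} = \begin{cases} P(Z \ge z) & \text{one-tailed, upper} \\ P(Z \le z) & \text{one-tailed, lower} \\ 2\,P(Z \ge |z|) & \text{two-tailed} \end{cases}$$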
If the P-value is as small as or smaller than a specified significance level (conventionally .10 or .05 or .01), we say that the data are statistically significant at that level (P-value = ...).

How to do it in Stata:

. ttest math = 55

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    math |     200      52.645    .6624493    9.368448    51.33868    53.95132
------------------------------------------------------------------------------
Degrees of freedom: 199

                        Ho: mean(math) = 55

  Ha: mean < 55            Ha: mean ~= 55            Ha: mean > 55
    t =  -3.5550             t =  -3.5550              t =  -3.5550
  P < t =  0.0002         P > |t| =  0.0005          P > t =  0.9998

For example, the two-tailed test: reject the null hypothesis (P-value=.0005, df=199) in favor of the alternative hypothesis.

. ttest read = math

Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    read |     200       52.23    .7249921    10.25294    50.80035    53.65965
    math |     200      52.645    .6624493    9.368448    51.33868    53.95132
---------+--------------------------------------------------------------------
    diff |     200       -.415    .5729794    8.103152    -1.54489    .7148905
------------------------------------------------------------------------------

               Ho: mean(read - math) = mean(diff) = 0

  Ha: mean(diff) < 0       Ha: mean(diff) ~= 0       Ha: mean(diff) > 0
    t =  -0.7243             t =  -0.7243              t =  -0.7243
  P < t =  0.2349         P > |t| =  0.4697          P > t =  0.7651

Two-tailed test: fail to reject the null hypothesis (P-value=.470). One-tailed (lower): fail to reject the null hypothesis (P-value=.235). One-tailed (upper): fail to reject the null hypothesis (P-value=.765).
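A minimal sketch of where the one-sample results come from, computed by hand in Stata (the summary numbers are taken from the output above):

* t statistic = (sample mean - hypothesized mean) / standard error of the mean
display (52.645 - 55)/.6624493                        // -3.5550, matching the output
* two-tailed P-value: the area in both tails beyond |t|, with 199 degrees of freedom
display 2*ttail(199, abs((52.645 - 55)/.6624493))     // about 0.0005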
What Statistical Significance Isn't, & What It Is

Statistical significance does not mean theoretical, substantive or practical significance. In fact, statistical significance may accompany a trivial substantive or practical finding. Statistical significance means that if the null hypothesis were actually true, then a finding of the observed magnitude would be likely to occur by chance no more than (depending on the specified significance level) ten times, or five times, or one time in 100 random samples. Depending on the test results, either we fail to reject the null hypothesis or we reject the null hypothesis. We never accept the null hypothesis: why not?

Regarding statistical significance, it's useful to think (more or less) in terms of the following scale:
- .10 or less: moderate statistical significance
- .05 or less: strong statistical significance
- .01 or less: very strong statistical significance

These levels are cultural conventions in statistics & research methods. There's really no rigid line between statistical significance & lack of such significance, or between any of the levels of significance. Listing the P-value provides a more precise statement of the evidence. E.g.: fail to reject the null hypothesis at the .10 level (P-value=.142). Let's remember, moreover: statistical significance does not mean theoretical, substantive, or practical significance.

In any event, statistical significance, as conventionally defined, is much easier to obtain in a large sample than in a small sample. Why? Because, according to the formula, a sample statistic's standard deviation divided by the square root of n decreases as the sample size increases.

What does it take to obtain statistical significance? A strong linear relationship between two variables, & a large enough sample size to minimize the role of chance in determining the finding. Consequently, lack of statistical significance may mean that the relationship is nonlinear; or it may simply mean that the sample size is not large enough to downplay the role of chance in determining the finding. It might also mean there are data errors, that the sample is badly designed or executed, or that there are other problems with the study's design. Of course, it may indeed mean that the linear relationship between the two variables simply is not strong enough to minimize the role of chance in causing the observed finding.

Statistical significance does not necessarily mean substantive or practical significance. And remember: significance tests are premised on a random sample of independent observations. Hence statistical significance tests are invalid if the sample cannot reasonably be defended as random, or if measurements are obtained for an entire population (the latter being a very good thing, however). Without a random sample, the laws of probability can't operate. With measurements on an entire population, there is no sampling-based uncertainty to test (or worry about).

Confidence intervals & two-sided tests: the two-sided test can be directly computed from the confidence interval. For a two-sided test, if the hypothesized value falls outside the confidence interval, then we can reject the null hypothesis. Why?

Ho: the mean of math equals 55
Ha: the mean of math does not equal 55

. ci math

    Variable |        Obs        Mean    Std. Err.    [95% Conf. Interval]
-------------+------------------------------------------------------------
        math |        200      52.645    .6624493     51.33868    53.95132

Reject the null hypothesis at the .05 level. In general, P-values are preferable to fixed significance levels: why?

Review: Significance Testing

Significance testing is premised on a random sample of independent observations: if this premise does not hold, then significance tests are invalid. Statistical significance does not mean theoretical, substantive or practical significance. Any finding of statistical significance may be an artifact of large sample size. Any finding of statistical insignificance may be an artifact of small sample size. Moreover, statistical significance or insignificance may in any case be an artifact of chance.

What does a significance test mean? What does a significance test not mean? What is the procedure for conducting a significance test? What is the P-value? Why is the P-value preferable to a fixed significance level? What are the possible reasons why a finding does not attain statistical significance? What are the possible reasons why findings are statistically significant? Depending on the test results, we either fail to reject the null hypothesis or reject the null hypothesis. We never accept the null hypothesis.

Beware! There is no sharp border between "significant" & "insignificant", only increasingly strong evidence as the P-value gets smaller. There is no intrinsic reason why the conventional standards of statistical significance must be .10 or .05 or .01. Don't ignore lack of statistical significance: it may yield important insights. Beware of searching for significance: by chance alone, a certain percentage of findings will indeed attain statistical significance. There's always uncertainty in assessing statistical significance.

Another Problem: Two Types of Error in Significance Tests

If a finding tests significant, the null hypothesis may be wrongly rejected: a Type I error. If a finding tests insignificant, the null hypothesis may be wrongly accepted: a Type II error. Power: the power of a significance test measures its ability to detect an alternative hypothesis. Power against a specific alternative is calculated as the probability that the test will reject the null hypothesis when that alternative is true.
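In symbols, with alpha the significance level:

$$P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ true}) = \alpha, \qquad P(\text{Type II error}) = P(\text{fail to reject } H_0 \mid H_a \text{ true}) = \beta, \qquad \text{power} = 1 - \beta$$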
Type I/II Errors & Power in Stata

See Stata "help" &/or the documentation manual for the command "sampsi."

Bonferroni Adjustment

When there are multiple hypothesis tests, the Bonferroni adjustment makes it tougher to obtain statistical significance. What's the reason for doing so? Divide the selected critical level by the number of hypothesis tests:

Selected critical level: .05
Five tests: .05/5 = .01

Thus each test will be judged statistically significant only at P-value=.01 or less. There are other such "multiple comparison" adjustments, such as Scheffe, Sidak, & Tukey.

Review Again

What's the premise of significance tests? What if the premise doesn't hold? What is the procedure for conducting a significance test? What do significance tests mean? What don't they mean? What conditions yield a statistically significant finding? What conditions don't yield such a finding? Why is a P-value preferable to a fixed significance level? Why are the .10, .05 & .01 significance levels so commonly used? How should we treat statistically insignificant findings? Why shouldn't we search for statistical significance? Why is a finding of statistical significance uncertain? Why is a finding of statistical insignificance uncertain? What are Type I errors? What are Type II errors? What's a Bonferroni adjustment (or other "multiple comparison" adjustment)? Why is it used? For what various reasons are conclusions inherently uncertain?

Significance Testing: Questions

True or false, & explain:
- A difference that is highly significant must be very important.
- Big samples are bad.
- If the null hypothesis is rejected, the difference isn't trivial: it is bigger than what would occur by chance.

For one year in one graduate major at a university, 825 men applied & 62% were admitted; 108 women applied & 82% were admitted. Is the difference statistically significant?

The masses of the inner planets average 0.43 versus 74.0 for the outer planets. Is the difference statistically significant? Does this question make sense?

A P-value of .047 means something quite different from one of .052: true or false?

According to the U.S. Census, in 1950 13.4% of the U.S. population lived in the West; in 1990 21.2% lived in the West. Is the difference statistically significant? Practically significant?

Morals of the Stories

Statistical significance says nothing about:
- practical significance
- the adequacy of the study's design
- whether or not the study is based on a random sample of independent observations

Professional standards of statistical significance are cultural conventions: there's no intrinsic, hard line between statistical significance & insignificance. Findings of statistical insignificance may be more insightful than those of statistical significance. Finally, confidence intervals & significance tests are based on a random variable's sampling distribution: over all possible random samples of the same size in the same population.