Confidence intervals, effect size, power
Looking beyond the p-value

What we will cover
• Confidence intervals: what they are, how to calculate them, how to interpret them
• Effect size: the role of the sample size in finding an effect; Cohen's d
• Introducing the concept of power
• An R library for basic power calculations

Point estimates & intervals
• If you have a sample mean and you wish to make a guess as to what the population mean is, you can make two kinds of estimates:
• Point estimates are guesses that specify an exact number. When you make a point estimate using the sample mean, it is likely your guess is near the true population parameter, but it is very unlikely that it will be exactly the same as the parameter. For example, you might make a point estimate that μ is 4 when really it is 3.55.
• Interval estimates are guesses that specify a range of numbers. With interval estimates, you guess that the true population parameter falls somewhere between two numbers. For example, you might make an interval estimate that μ is between 2 and 6.

Confidence interval
• In practice we often want to report a confidence interval as well as a test statistic and significance test.
• Most statistical packages report confidence intervals by default.

Confidence interval
• Say we were to take a random sample from a population and calculate the mean.
• A confidence interval is a range of values around the mean, with the following meaning: if we drew an infinite number of samples of the same size from the same population, x% of the time the true population mean would be included in the confidence interval calculated from the samples.
• If we compute a 95% confidence interval (the most common type), x = 95, so we can say that 95% of the confidence intervals calculated from an infinite number of samples of the same size, drawn from the same population, can be expected to contain the true population mean.
Confidence interval
• More generally, a confidence interval gives us information about the precision of a point estimate such as the sample mean.
• A wide confidence interval tells us that if we had drawn a different sample, we might have got a quite different sample mean, whereas a narrow confidence interval suggests that a different sample would probably have given a sample mean fairly close to the one we obtained.
• Using an appropriate formula, we can calculate the upper and lower boundaries of our confidence interval:
• estimate ± margin of error

Confidence interval
• Let's think about this in terms of a normal distribution first, before looking at how to calculate it for the independent samples t-test (which is more practical).
• Any normal distribution has probability of about 0.95 within ±2 standard deviations of its mean.

Confidence interval
• To construct a confidence interval we need to know more about the area C under the curve.
• That is, we must find the number z* such that any normal distribution has probability C within ±z* standard deviations of its mean.
• Because all normal distributions have the same standardized form, we can obtain everything we need from the standard normal curve.
• The sample mean x̄ has a normal distribution with mean μ and standard deviation σ/√n.
• With confidence level C, the unknown population mean μ lies between x̄ − z*σ/√n and x̄ + z*σ/√n.

Example
• The National Student Loan Survey collects data to examine questions related to the amount of money that borrowers owe. The survey selected a sample of 1280 borrowers who began repayment of their loans between four and six months prior to the study. The mean debt for undergraduate study was $18,900 and the standard deviation was about $49,000.
• This distribution is clearly skewed, but because our sample size is quite large we can rely on the central limit theorem to assure us that a confidence interval based on the normal distribution will be a good approximation.
• Let's compute a 95% confidence interval for the true mean debt for all borrowers. (Although the standard deviation is estimated from the data collected, we will treat it as a known quantity for our calculations here.)

Example
• Calculations: margin of error = z*σ/√n = 1.96 × 49,000/√1280 ≈ 2684.
• We'll round 2684 to 2700 for the purposes of this example, giving the interval 18,900 ± 2700.

Example
• Suppose the researchers who designed the National Student Loan Survey had used a different sample size. How would this affect the confidence interval?
• We can answer this question by changing the sample size in our calculations and assuming that the mean and standard deviation are the same.
• Let's assume that the sample mean of the debt for undergraduate study is $18,900 and the standard deviation is about $49,000, as in the previous example, but suppose that the sample size is only 320.
• The margin of error for 95% confidence is then 1.96 × 49,000/√320 ≈ 5369, which we round to 5400.

Example: 5400 vs 2700
• Notice that the margin of error for this example is twice as large as the margin of error that we just computed.
• The only change that we made was to assume that the sample size is 320 rather than 1280. This sample size is exactly one-fourth of the original 1280.
• Thus, we approximately double the margin of error when we reduce the sample size to one-fourth of the original value (the margin of error scales as 1/√n).

Confidence interval
• One thing to note: by increasing the confidence level from 95% to 99%, we make the interval bigger, not smaller!
• This may seem strange at first, but to capture the true mean more often, the interval must be wider.
• Suppose that for the student loan data in our example we wanted 99% confidence. For 99% confidence, z* = 2.576, so the margin of error based on 1280 observations is 2.576 × 49,000/√1280 ≈ 3528.

Confidence interval
• Formula for the independent samples t-test (equal variances): the confidence interval for the difference in means is
• (x̄1 − x̄2) ± t(α/2, df) × sp√(1/n1 + 1/n2)
• where sp² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2) is the pooled variance and df = n1 + n2 − 2.

Confidence interval
• If we had calculated the t-test by hand we would have calculated most of this anyway.
• There are several points worth noting about this formula:
• It is actually a confidence interval for the difference in the means of the two populations.
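The margin-of-error calculations above can be reproduced in a few lines of base R. This is a minimal sketch using the figures from the survey example; the helper name `moe` is ours, not part of any package:

```r
s <- 49000  # sample standard deviation, treated as known

# Margin of error z* * s / sqrt(n); qnorm() gives the critical value z*
moe <- function(n, conf = 0.95) qnorm(1 - (1 - conf) / 2) * s / sqrt(n)

moe(1280)               # ~2684: the 95% margin of error in the example
moe(320)                # ~5369: quartering n roughly doubles the margin
moe(1280, conf = 0.99)  # ~3528: higher confidence widens the interval
```

Note how the 99% interval is wider than the 95% one for the same data, as discussed above.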
• We use the upper critical t-value for the df, and half the specified alpha level, from a standard t-table.
• Let's use R for the moment:

> ballet <- c(89.2,78.2,89.3,88.3,87.3,90.1,95.2,94.3,78.3,89.3)
> football <- c(79.3,78.3,85.3,79.3,88.9,91.2,87.2,89.2,93.3,79.9)
> spool_numerator <- (var(ballet) * (length(ballet) - 1)) + (var(football) * (length(football) - 1))
> spool_denominator <- (length(ballet) - 1) + (length(football) - 1)

Confidence intervals
> spool_fraction <- spool_numerator / spool_denominator   # pooled variance
> spool_fraction
[1] 31.78189
> spool_rhs <- sqrt(spool_fraction * (1/length(ballet) + 1/length(football)))   # SE of the difference
> spool_rhs
[1] 2.521186
> alpha <- .05
> t_alpha_half <- qt(1 - alpha/2, df = (length(ballet) - 1) + (length(football) - 1))
> t_alpha_half
[1] 2.100922
> difference_in_mean <- mean(ballet) - mean(football)
> difference_in_mean
[1] 2.76

• Given what we have just entered and the formula you just saw, can you work out the confidence interval now? (Using t rounded to 2.10:)
• Upper limit: 2.76 + (2.10 × 2.521186) = 8.054491
• Lower limit: 2.76 − (2.10 × 2.521186) = −2.534491

Confidence intervals
• Note that this confidence interval includes 0, which is our null value (the value we posited for the difference in means in our null hypothesis). This result is expected because for this data set we did not find significant results and thus did not reject the null hypothesis.
• Because the sample mean is not resistant, outliers can have a large effect on the confidence interval. You should search for outliers and try to correct them, or justify their removal, before computing the interval.

Caution!
• The most important caution concerning confidence intervals is that the margin of error in a confidence interval covers only random sampling errors.
• The margin of error is obtained from the sampling distribution and indicates how much error can be expected because of chance variation in randomized data production.
• Practical difficulties such as nonresponse in a sample survey cause additional errors.
• These errors can be larger than the random sampling error.
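The hand calculation can also be checked against R's built-in `t.test()`, which reports the same equal-variance confidence interval directly (the data vectors are those from the slide above):

```r
ballet <- c(89.2,78.2,89.3,88.3,87.3,90.1,95.2,94.3,78.3,89.3)
football <- c(79.3,78.3,85.3,79.3,88.9,91.2,87.2,89.2,93.3,79.9)

# var.equal = TRUE requests the pooled-variance (equal variance) test;
# the conf.int element holds the 95% CI for the difference in means
res <- t.test(ballet, football, var.equal = TRUE)
res$conf.int  # approximately (-2.54, 8.06); includes 0, matching the
              # non-significant test result
```

The tiny difference from the hand-computed limits comes from rounding the critical t-value to 2.10 on the slide.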
This often happens when the sample size is large (so that σ/√n is small).
• Remember this unpleasant fact when reading the results of an opinion poll or other sample survey: the practical conduct of the survey influences the trustworthiness of its results in ways that are not included in the announced margin of error.

Effect size
• As an indication of the importance of a result in quantitative research, statistical significance has enjoyed a rather privileged position for decades. Social scientists have long given the "p < .05" rule a sort of magical quality, with any result carrying a probability greater than .05 being quickly discarded into the trash heap of "nonsignificant" results.
• Recently, however, researchers and journal editors have begun to view statistical significance in a slightly less flattering light, recognizing one of its major shortcomings: it is perhaps too heavily influenced by sample size.
• As a result, more and more researchers are becoming aware of the importance of effect size and increasingly are including reports of effect size in their work.

Statistical inferences
• To determine whether a statistic is statistically significant, we follow the same general sequence regardless of the statistic (z scores, t values, F values, correlation coefficients, etc.):
• First, we find the difference between a sample statistic and a population parameter (either the actual parameter or, if this is not known, a hypothesized value for the parameter).
• Next, we divide that difference by the standard error.
• Finally, we determine the probability of getting a ratio of that size due to chance, or random sampling error.
• The problem with this process is that when we divide the numerator (the difference between the sample statistic and the population parameter) by the denominator (the standard error), the sample size plays a large role.
• In all of the formulas that we use for standard error, the larger the sample size, the smaller the standard error. When we plug the standard error into the formulas for t values, F values, and z scores, we see that the smaller the standard error, the larger these values become, and the more likely they are to be considered statistically significant.

Effect size & sample
• Because of this effect of sample size, we sometimes find that even very small differences between the sample statistic and the population parameter can be statistically significant if the sample size is large.
• The left side of the graph shows a fairly large difference between the sample mean and population mean, but this difference is not statistically significant with a small sample.

Effect size & sample
• Suppose we know that the average IQ score for the population of adults in the United States is 100.
• Now suppose that I randomly select two samples of adults. One of my samples contains 25 adults, the other 1600.
• Each of these two samples produces an average IQ score of 105 and a standard deviation of 15. Is the difference between 105 and 100 statistically significant?
• To answer this question, we calculate a z score for each sample:
• Standard error (n = 25): 15/√25 = 3
• Standard error (n = 1600): 15/√1600 = 0.375
• z (n = 25): (105 − 100)/3 ≈ 1.67
• z (n = 1600): (105 − 100)/0.375 ≈ 13.33
• Our critical z for an alpha of 0.05 (two-tailed) is 1.96.

Implications
• If we are using an alpha level of .05, then a difference of 5 points on the IQ test would not be considered statistically significant if we only had a sample size of 25, but would be highly statistically significant if our sample size were 1600.
• Because sample size plays such a big role in determining statistical significance, many statistics textbooks make a distinction between statistical significance and practical significance.
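The IQ example above can be worked through in R. A minimal sketch using the slide's figures (the helper name `z_score` is ours):

```r
mu <- 100   # population mean IQ
xbar <- 105 # sample mean IQ
s <- 15     # standard deviation

# z = (sample mean - population mean) / standard error, SE = s / sqrt(n)
z_score <- function(n) (xbar - mu) / (s / sqrt(n))

z_score(25)    # ~1.67: below the critical value 1.96, not significant
z_score(1600)  # ~13.33: far beyond 1.96, highly significant
```

The same 5-point difference is non-significant or overwhelmingly significant depending only on n.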
• With a sample size of 1600, a difference of even 1 point on the IQ test would produce a statistically significant result (z = 1/0.375 ≈ 2.67, p < .01). However, if we had a very small sample size of 4, even a 10-point difference in average IQ scores would not be statistically significant (z = 10/7.50 ≈ 1.33, p > .10).
• But is a difference of 1 point on a test with a range of over 150 points really important in the real world? And is a difference of 10 points not meaningful?
• In other words, is it a significant difference in the practical sense of the word "significant"?

Practical significance
• Very small effects can be highly significant (small P), especially when a test is based on a large sample. A statistically significant effect need not be practically important.
• Plot the data to display the effect you are seeking, and use confidence intervals to estimate the actual values of parameters.
• P-values are more informative than the reject-or-not result of a fixed level α test.
• Beware of placing too much weight on traditional values of α, such as α = 0.05.
• Significance tests are not always valid. Faulty data collection, outliers in the data, and testing a hypothesis on the same data that suggested the hypothesis can invalidate a test.
• Many tests run at once will probably produce some significant results by chance alone, even if all the null hypotheses are true.

Calculating effect size
• There are different formulas for calculating the effect sizes of different statistics, but these formulas share common features. The formulas for calculating most inferential statistics involve a ratio of a numerator divided by a standard error. Similarly, most effect size formulas use the same numerator, but divide this numerator by a standard deviation rather than a standard error.
• We'll look at an example for the t-test.
• The general name for the effect size formula we will look at is Cohen's d.
• (We can also use this in another formula to give us an idea of what sample size we need!)

Cohen's d
• Standard measure for the independent samples t-test:
• d = (x̄1 − x̄2) / sp
• Cohen initially suggested we could use either sample's standard deviation, since they should both be equal according to our assumptions (homogeneity of variance).
• In practice, however, researchers use the pooled standard deviation.

Glass's Δ
• For studies with control groups, we use the control group standard deviation in our formula:
• Δ = (x̄1 − x̄2) / s_control
• This does not assume equal variances.

Comparison
• Note that r ranges from 0 to 1 in magnitude, while d is measured in standard deviation units. There are general guidelines on what is small, medium, and large.

Cohen's rules of thumb for effect size:
Effect size               | Small effect | Medium effect | Large effect
Correlation coefficient   | r = 0.1      | r = 0.3       | r = 0.5
Difference between means  | d = 0.2 SD   | d = 0.5 SD    | d = 0.8 SD

Statistical power
• Statistical power refers to the probability of finding a particular sized effect.
• Specifically, it is 1 − (the Type II error rate): the probability of rejecting the null hypothesis if it is false.
• It is a function of the Type I error rate, sample size, and effect size.
• Its utility lies in helping us determine the sample size needed to find an effect size of a certain magnitude.

Two kinds of power analysis
• A priori: used when planning your study. What sample size is needed to obtain a certain level of power?
• Post hoc: used when evaluating a study. What chance did you have of significant results? Not really useful: if you did the power analysis and conducted your analysis accordingly, then you did what you could. To say afterwards, "I would have found a difference but didn't have enough power" isn't going to impress anyone.

A priori effect size?
• Figure out an effect size before I run my experiment?
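As a sketch, Cohen's d with a pooled standard deviation can be computed in a few lines of base R; here it is applied to the ballet/football data from the earlier confidence interval example (the helper name `cohens_d` is ours):

```r
# Cohen's d for two independent samples, using the pooled SD
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  # pooled standard deviation, weighting each sample variance by its df
  sp <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / sp
}

ballet <- c(89.2,78.2,89.3,88.3,87.3,90.1,95.2,94.3,78.3,89.3)
football <- c(79.3,78.3,85.3,79.3,88.9,91.2,87.2,89.2,93.3,79.9)
cohens_d(ballet, football)  # ~0.49: a "medium" effect by Cohen's guidelines
```

Note that the effect is of medium size even though the t-test on these data was not significant, which is exactly the sample-size issue discussed above.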
• There are several ways to do this:
• Base it on substantive knowledge: what you know about the situation and the scale of measurement.
• Base it on previous research.
• Use conventions.

An acceptable level of power?
• Why not set power at .99? Practicalities.
• Howell shows how, for a one-sample t-test and an effect size d of 0.33:
• Power = .80 requires n = 72
• Power = .95 requires n = 119
• Power = .99 requires n = 162
• The cost of increasing power (usually done through increasing n) can be high.

Howell's general rule
• Look for big effects, or use big samples.
• You may now start to understand how little power many studies have, considering they are often looking for small effects.
• Many seem to think that if they use the central limit theorem rule of thumb (n = 30) then power is solved too. This is clearly not the case.
• Effects are there, but if they are small it will be very unlikely that an experiment with a small sample will "discover" them.

Post hoc power: the power of the actual study
• If you fail to reject the null hypothesis, you might want to know what chance you had of finding a significant result, to defend the failure. As many point out, this is a little dubious.
• One thing we can understand about the power of a particular study is that it can be affected by a number of issues, such as:
• Reliability of measurement (an increase in reliability can result in power increasing or decreasing, though here we stress the decrease due to unreliable measures)
• Outliers
• Skewness
• Unequal N for group comparisons
• The analysis chosen

Something to consider
• Doing a sample size calculation is nice in that it gives a sense of what to shoot for, but rarely if ever do the data or circumstances bear out such that it provides a perfect estimate for our needs.
• Rule of thumb sample size calculation for all studies: the sample size needed is the largest N you can obtain based on practical considerations (e.g.
time, money).
• Also, even the useful form of power analysis (for sample size calculation) involves statistical significance as its focus.
• While it gives you something to shoot for, our real interest regards the effect size itself and how comfortable we are with its estimation.
• Emphasizing effect size over statistical significance in a sense de-emphasizes the power problem.

Errors in null hypothesis tests

                        True state of affairs
Your decision           No effect (H0 true)        Some effect (H0 false)
Reject null             Type I error (α)           Correct (power = 1 − β)
Fail to reject null     Correct                    Type II error (β)

α and β
• The simplest way to consider the relationship between α and β is to think of α in terms of the null hypothesis and β in terms of the alternative hypothesis (a different distribution). One has an effect on the other, but it isn't a straight linear relationship.

The definition of statistical power
• Statistical power is the probability of not missing an effect, due to sampling error, when there really is an effect there to be found.
• Power is the probability (1 − β) of correctly rejecting H0 when it really is false.
• It depends on:
• Effect size: how large is the effect in the population?
• Sample size (N): you are using a sample to make inferences about the population; how large is the sample?
• Decision criteria (α): how do you define "significant", and why?

Conventions and decisions about statistical power
• The acceptable risk of a Type II error is often set at 1 in 5, i.e. a probability of 0.2.
• The conventionally uncontroversial value for "adequate" statistical power is therefore set at 1 − 0.2 = 0.8.
• People often regard the minimum acceptable statistical power for a proposed study as an 80% chance of an effect that really exists showing up as a significant finding.
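Howell's one-sample figures from earlier (n = 72, 119, 162 for power .80, .95, .99 at d = 0.33) can be sanity-checked with a normal-approximation power function in base R. This is only a sketch: the exact calculation uses the noncentral t distribution, so the approximation will not match exactly for small n.

```r
# Approximate power of a two-sided one-sample test at level alpha,
# for effect size d and sample size n (normal approximation)
approx_power <- function(n, d, alpha = 0.05) {
  pnorm(d * sqrt(n) - qnorm(1 - alpha / 2))
}

approx_power(72, 0.33)   # ~0.80
approx_power(119, 0.33)  # ~0.95
approx_power(162, 0.33)  # ~0.99
```

The approximation recovers the quoted power levels to two decimal places, and makes visible how expensive the last few points of power are.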
• http://homepage.stat.uiowa.edu/~rlenth/Power/

Power
• What you should know:
• Power is the probability of correctly rejecting the null hypothesis that a sample estimate (e.g. mean, proportion, odds, correlation coefficient) does not differ between study groups in the underlying population.
• Large values of power are desirable: at least 80%, given the available resources and ethical considerations.
• Power increases as the sample size for the study increases.
• Accordingly, an investigator can control the study power by adjusting the sample size, and vice versa.

Sample size
• Let's assume we want to calculate how many subjects per group we need to conduct a two-tailed independent samples t-test with acceptable power:
• n per group = 2(z[1 − α/2] + z[1 − β])² / d²
• The denominator here is the squared effect size.
• We need z-values for both α and β to use this formula. We will stick with the 95% confidence level for a two-tailed test, so the z-value for 1 − α/2 is 1.96.
• We will compute the sample size required for 80% power, so the z-value for 1 − β is 0.84. (Note that if we were doing a one-tailed test, z[α] would be 1.645, and if we were calculating 90% power, z[1 − β] would be 1.28.)
• The effect size d is the difference between the two population means divided by the appropriate measure of spread (here, the common standard deviation).

Calculating sample size
• If μ1 = 25, μ2 = 20, and σ = 10, the effect size is (25 − 20)/10 = 0.5. We can plug these numbers into the sample size formula: n = 2(1.96 + 0.84)² / 0.5² = 2(7.84)/0.25 ≈ 62.7.
• We round fractional results up to the next whole number, so we need at least 63 subjects per group to have an 80% probability of finding a significant difference between two groups when the effect size is 0.5.

Power analysis in R
• See the 'pwr' package:
> pwr.t.test(n = 64, d = 0.5, sig.level = .05, type = "two.sample", alternative = "two.sided")
• Our previous calculations are close, but it is generally a good idea to let the stats package calculate these figures.
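The per-group calculation above can be written out in base R. A minimal sketch (the helper name `n_per_group` is ours); for real planning prefer `pwr::pwr.t.test`, which uses the exact noncentral t distribution and therefore gives a slightly larger n (64 here):

```r
# Per-group n for a two-sided, two-sample t-test (normal approximation):
# n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2, rounded up
n_per_group <- function(d, power = 0.80, alpha = 0.05) {
  ceiling(2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / d^2)
}

n_per_group(0.5)  # 63, matching the hand calculation
```

Halving the effect size quadruples the required n (try `n_per_group(0.25)`), which is why studies chasing small effects need big samples.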