Making Inferences, AKA Hypothesis Testing

Assignment 2 and 3
• You should have received feedback on assignment 2.
– Great job, everyone.
• Please send everything to both me and Lamiya.
• Mofiz Haque, please stop by. I have a question about your email address.
• Assignment 3 is assigned today.

So Far
• You know how to describe variables:
– Conceptually with a taxonomy
– Graphically
– Numerically
• You know how to describe some distributions:
– Empirically
– Theoretically
• You have been exposed to two statistical packages to help you do these tasks:
– R with Rcmdr
– SAS Enterprise Guide

So Far
• Probability is scored between 0 and 1: 0 is impossible, 0.5 is as likely as not, and 1 is certain. Values below 0.5 are unlikely to occur and values above 0.5 are likely to occur.
• Area under a curve or heights of bars represent probability.

From Last Time
• I talked about how, when a variable (really its distribution) is (theoretically) normally distributed, it is described by only two parameters, the mean and the standard deviation (they come from the first two moments of the distribution).
• When you take sample means (with more than one observation in each mean) and you plot the means, the density looks normally distributed. The fact that the sampling distribution of means looks approximately normal (irrespective of the original distribution) is called the Central Limit Theorem.

Moving On
• The next steps are to describe other types of distributions and figure out how to quantify just how unusual a weird statistic from your sample actually is.
• You are not always going to be making generalizations about comparing means.
– Comparing variability (variance) is hugely important.

Variability of Sample Means
• Recall that the number of people (observations) in each sample mattered a lot in determining whether the sampling distribution looked normal.
– If you have a decent-size sample (the number of people in each sample), it is hard to get very extreme values out of a normal sampling distribution because the extremely big values tend to cancel out the extremely small values.

[Figure: three frequency histograms. "Actual Scores" (x axis: scores), "Bunch of Means sample N = 5" (x axis: bunchOfMeans), and "Bunch of Means sample N = 20" (x axis: bunchOfMeans20).]
The distribution of the means from a sample size of 5 is narrower than the original values (and bell-shaped).
The distribution of the means from a sample size of 20 is narrower still (and bell-shaped).

Variability Between Samples
• The width of the sampling distribution of the means got narrower and narrower as the size of each of the samples increased.
• The variability of individual observations (samples of size 1) is called the standard deviation.
• The variability across the means when you have samples bigger than size 1 is called the standard error.

Standard Error
• The formula for the standard error of the means is just the sample standard deviation formula with a tweak to indicate the impact of the sample size.
• The SE plays a huge role in all inferences. You need it to determine what is an odd sample.

$SE_{Mean} = \dfrac{SD}{\sqrt{\text{Sample Size}}}$

Standard Error Formula
• As you move through the year, you will meet many formulas for standard errors.
– If you are testing to see if there is a difference between two groups, you use a slightly different formula.
– If you are working with the distribution of counts of events happening or not happening in many trials (yes/no getting pregnant on many attempts), the SE formula is different but it plays the same role. It helps you determine what is an unusual value.

Probability Functions
• Some people are entertained while others are horrified at the prospect of having to do calculus to figure out the area under the curve corresponding to what makes an unusual sample. Happily, you don’t have to.
You can use the probability functions in a language like SAS or R.

Quantiles
• Say you want to know what quantile (z value) corresponds to a given probability on the standard normal.
• The standard normal is where you have rescaled your values so they are measured with a mean of 0 and standard deviation of 1.
$\text{thingy} \sim N(0, 1)$
• For example, you may want to know what value cuts off the most extremely large 5% of a standard normal curve.

Z-scores for Percentages

What percentiles?
• You are far more likely to want to know what percentile your actual scores correspond to. To get those values, you will use the CDF (Cumulative Distribution Function).

[Figures: plots with the axes Quantile (Z), Probability (p), z, Probability density, and frequency.]

Null Hypothesis
• When you design an experiment, you typically propose a hypothesis indicating that nothing interesting is going on.
– For example, if you expect a drug and a placebo to act the same way, your null hypothesis is that the average difference is 0.
– You reject the null hypothesis if your sample is too far out in the tails of the null distribution.
– You typically set up this target (dummy hypothesis) and hope your data does not look like this.

Hoe Hoe Hoe
• The null hypothesis is typically written H0. That is pronounced H-zero or H-naught. Don’t call it “hoe”.
• The alternative hypothesis is typically written H1 or HA.

What could possibly go wrong?
• When you do an experiment, you come up with a hypothetical population mean and SD and have a computer calculate the sampling distribution of the means (for your sample size). You can then test to see if your data is compatible or weird given the population mean and standard error.
• Call this distribution “the null distribution” because it is what you expect, and nothing interesting is going on if you find it is true.
• What could possibly go wrong?
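The quantile and percentile lookups from the Quantiles slides, and the "compatible or weird" check against the null distribution, can be sketched without any calculus. The slides name SAS and R for this; the sketch below swaps in Python's standard library `statistics.NormalDist` instead. The function names and the numbers in the usage lines (a population mean of 500, SD of 100, samples of 25) are made up for illustration and are not from the lecture.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal: values rescaled to mean 0 and standard deviation 1.
STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def quantile(probability):
    """z value with the given fraction of the standard normal at or below it."""
    return STD_NORMAL.inv_cdf(probability)


def percentile_of_score(score, mean, sd):
    """Percentile (0 to 100) of a raw score under a normal curve (the CDF)."""
    return 100.0 * NormalDist(mean, sd).cdf(score)


def two_sided_tail_probability(sample_mean, pop_mean, pop_sd, n):
    """How often the null distribution gives a mean at least this far out."""
    se = pop_sd / sqrt(n)                 # standard error of the mean
    z = (sample_mean - pop_mean) / se     # distance from the null mean in SEs
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


if __name__ == "__main__":
    # Value that cuts off the most extremely large 5% (about 1.645).
    print("upper 5% cut:", round(quantile(0.95), 3))
    # Hypothetical test score of 600 on a scale with mean 500 and SD 100.
    print("percentile:", round(percentile_of_score(600, 500, 100), 1))
    # Hypothetical sample of 25 with mean 540 against a null mean of 500.
    print("tail prob:", round(two_sided_tail_probability(540, 500, 100, 25), 4))
```

A small tail probability says the sample sits far out in the tails of the null distribution; how small counts as "weird" is the alpha level discussed later in the lecture.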
What could possibly go wrong?
• Your guess at the population mean was right, but you could get a sample by chance (poor luck) that was from way out in the tails of the distribution.
– The first thing that could go wrong is called the Type I (one) Error.
• Things could be really bad: your guess about the population mean was wrong, but you get a sample that is compatible with your original hypothesis even though that hypothesis is not in agreement with reality (this 2nd thing that could go wrong is called the Type II (two) Error).
– You won’t notice that the distribution is actually centered around an alternative mean and has an alternative distribution.

Think of… Pascal’s Wager

                  The TRUTH
Your Decision     God Exists               God Doesn’t Exist
Reject God        BIG MISTAKE              Correct
Accept God        Correct (big payoff)     MINOR MISTAKE

Type I and Type II Error in a Box

                             True State of Null Hypothesis
Your Statistical Decision    H0 True               H0 False
Reject H0                    Type I error (α)      Correct
Do not reject H0             Correct               Type II error (β)

Analogy to Quality Control
• In my humble opinion, people typically worry too much about the Type I error. The probability that this error happens is set in advance and is called the α (alpha) level; the p-value you calculate from your data is compared against it.
• Failing to realize that the data should be described by an alternative distribution is called the β (beta) error.
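One way to get a feel for the two errors in the box above is to simulate many experiments and count how often the null hypothesis gets rejected. This is only a sketch under made-up assumptions: the null mean of 100, SD of 15, sample size of 25, alternative mean of 106, and the helper name `rejection_rate` are all hypothetical, and the test is a two-sided z test at alpha = .05.

```python
import random
from statistics import NormalDist, fmean


def rejection_rate(true_mean, null_mean, sd, n, alpha=0.05, trials=4000, seed=1):
    """Fraction of simulated samples whose mean lands in the rejection tails."""
    rng = random.Random(seed)                      # fixed seed so runs repeat
    se = sd / n ** 0.5                             # standard error of the mean
    z_cut = NormalDist().inv_cdf(1 - alpha / 2)    # two-sided cut, about 1.96
    rejections = 0
    for _ in range(trials):
        # Draw one sample of size n from the TRUE population and average it.
        sample_mean = fmean(rng.gauss(true_mean, sd) for _ in range(n))
        z = (sample_mean - null_mean) / se
        if abs(z) > z_cut:
            rejections += 1
    return rejections / trials


# H0 is true: every rejection is a Type I error, so the rate is near alpha.
type_1_rate = rejection_rate(true_mean=100, null_mean=100, sd=15, n=25)

# H0 is false (hypothetical true mean of 106): rejections are correct,
# and every failure to reject is a Type II error.
power = rejection_rate(true_mean=106, null_mean=100, sd=15, n=25)
type_2_rate = 1 - power

if __name__ == "__main__":
    print("Type I rate:", type_1_rate)
    print("Power:", power, "Type II rate:", type_2_rate)
```

With these made-up numbers the Type I rate stays near .05 by construction, while the Type II rate is large, which is one reason not to fixate only on alpha.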
Hypothesis Testing Analogies

Hypothesis test (slide labels: Power = 1 − β, β)
                   Is a real difference        Is no real difference
Reject null        No error (true positive)    Type 1 error
Fail to reject     Type 2 error                No error (true negative)

PSA screening, low metastasis potential (slide labels: Sensitivity, 1 − α)
                   Is really cancer            No cancer
High PSA           No error (true positive)    False positive
Normal PSA         False negative              No error (true negative)

Imaging, highly aggressive breast cancer (slide labels: Specificity, α)
                   Is really cancer            No cancer
Positive image     No error (true positive)    False positive
Negative           False negative              No error (true negative)

A Tale About Two Tails
• If you want to test to see if your data is incompatible with a null hypothesis, you specify just how weird it needs to be to be called weird. That is, you specify the alpha level. Typically you say a sample statistic that could happen only 1 in 20 times is too uncommon to say it happened by chance alone.
• For example, you have a hypothetical mean and if your sample mean is very high or very low relative to it, you say it is too odd and you reject the null hypothesis.
• Using the code from earlier in the lecture, you could figure out the probability of a value.

One-Tailed
• Typically you want to know if your value differs from the population value. In other cases (very rarely), you may be interested if and only if the value is greater than the population value. In yet other cases (very rarely), you may be interested if and only if the value is less than the population value.
• The test of a difference is a two-tailed test because the value could be unusually high or low. The test of “more than” (as opposed to “different”) is a one-tailed test. The test of “less than” is also a one-tailed test.

Splitting Tails
• If you do a two-sided test and you say a sample is odd if it occurs only 1/20 times, you need to split that .05 (5 percent) of the weirdness across both tails. So you cut the distribution such that a sample which is in the upper .025 or lower .025 of the distribution is grounds for rejecting the null hypothesis.
But if you say that you are only interested in whether this sample is greater than the hypothetical mean, you can shove all .05 into one tail, and it is relatively easy to find a weird sample.

Some Moron Tails…
• The inexact use of Fisher's Exact Test in six major medical journals, by McKinney et al., JAMA Vol. 261 No. 23, June 16, 1989
– We reviewed the use of Fisher's Exact Test in 71 articles published between 1983 and 1987 in six medical journals. Thirty-three of 56 selected articles did not specify use of a one- or two-tailed test, and 12 (36%) of these actually used the one-tailed test. Five (42%) of these 12 articles contained at least one table in which the standard significance level of P less than .05 was no longer met when a two-tailed analysis was run instead.

Extreme Caution
• If an outcome could biologically be either above or below a population mean, do the two-sided test. There are terrifying scenarios that begin with a standard of care that is thought to be so good that a new (less invasive) treatment could only be worse. So a researcher does a one-sided test to see if the new treatment is worse. In reality, the gold standard is harmful (pure oxygen to neonates). Because the test only looks in one direction, you do not see a statistically significant difference. In other words, the researcher would fail to see the harmful effect of the gold standard treatment as statistically significant.

What could possibly go wrong?
• Recall that in addition to the Type I error caused by having an unusual sample that really came from the null distribution, you could get a value from the alternate distribution that was compatible with the null hypothesis.

Alpha and Beta
• Alpha and Beta errors are intimately connected in testing hypotheses, and van Belle does not make this clear enough. An alternate presentation can be found in Norman and Streiner’s Biostatistics: The Bare Essentials. If you are math phobic, I highly recommend the book.
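The Splitting Tails point (all .05 in one tail versus .025 in each) can be checked numerically. This is a sketch using Python's `statistics.NormalDist` rather than the SAS or R functions the lecture refers to; the observed z of 1.8 and the helper names are hypothetical, chosen only to show a result that clears the one-tailed bar but not the two-tailed one.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()   # mean 0, standard deviation 1
ALPHA = 0.05                # "1 in 20 is too uncommon"

# Two-tailed: .025 in each tail, so the cut is further out (about 1.96).
two_tailed_cut = STD_NORMAL.inv_cdf(1 - ALPHA / 2)

# One-tailed ("greater than" only): all .05 in the upper tail (about 1.645).
one_tailed_cut = STD_NORMAL.inv_cdf(1 - ALPHA)


def p_two_tailed(z):
    """Probability of a z at least this extreme in either direction."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def p_upper_tail(z):
    """Probability of a z at least this large (upper tail only)."""
    return 1 - STD_NORMAL.cdf(z)


if __name__ == "__main__":
    z_observed = 1.8  # hypothetical sample result, in standard errors
    print("one-tailed cut:", round(one_tailed_cut, 3))
    print("two-tailed cut:", round(two_tailed_cut, 3))
    print("one-tailed p:", round(p_upper_tail(z_observed), 4))
    print("two-tailed p:", round(p_two_tailed(z_observed), 4))
```

For any z in the upper tail, the two-tailed probability is exactly double the one-tailed one, which is how a table can lose its P less than .05 result when the analysis is rerun two-tailed.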
Blood Sodium Example
• The story begins with a measure of blood sodium with a known population mean of 140 mmol/L and a standard deviation of 2. In the study, blood measures are taken on 25 people and the mean is 137.5. The question is “does it look unlikely that the sample mean came from a population with a mean of 140 or do you want to conclude that the true population mean is different?”
• What do you do?

Steps to the Comparison
• What is the standard error?
• How many standard errors away from the mean is this sample?
• If testing for a difference between the groups at the alpha .05 level, what is the cut point in z units?
• What is the cut point in the original units?
• What is the power?
• What is the beta error?
• What happens if you use a smaller sample?

The SE
• The Standard Error of the mean: $SE_{Mean} = \dfrac{SD}{\sqrt{\text{Sample Size}}}$
• The Z score: $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

Calculating a Z Score
With the numbers above, $SE_{Mean} = 2 / \sqrt{25} = 0.4$ and $z = (137.5 - 140) / 0.4 = -6.25$.
It is a darn unusual sample if the population mean is 140.

The Actual Cut Point
[Figure: density of the sampling distribution of the mean, x axis from 136 to 142, sample size = 25.]
• What happens when your sample size is smaller?
[Figure: the same plot for sample size = 4.]

Running the Analyst
Pick Your Study Design
Fill in the Blanks
Get Results as a Table
…or as a picture

Other Software Packages
• S-Plus can easily produce information on power:

Best Guesses
• So far I talked about making judgments regarding when a sample is compatible with a distribution. Another very important task is making a guess about a population value and specifying the precision of your guess.
• This is the process of building confidence intervals.

100% Confidence Intervals
• Say I take a sample of ages of Stanford undergraduates. My mean from the sample is 20 years old. That is my point estimate of the population mean. I know that the true mean is not exactly 20. So I give myself some wiggle room by saying 20 plus or minus something. That range is called the confidence interval.
• I want to be 100% certain that my guesstimated range includes the true population mean, so I say age 20 +/- 90 years. Can I do better than that?
• The true population mean is going to be within the range of 0 and 110 years old. So, I have built a 100% confidence interval.
• Say I get a sample of 25 undergrads, calculate their mean age, add +/- 10 years and call that the confidence interval. The population mean will or will not be inside of the range. So reality is either yes or no. How do I specify a probability here?
• You want to specify a range such that, when you do the sampling experiment many times, you will usually capture the true value within the guesstimated range. That is the typical definition of a confidence interval.
• You use the sampling distribution we have been talking about to calculate those values.
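To tie the pieces together, here is a rough sketch that walks the blood sodium steps (standard error, z score, cut points, power, beta) and then builds a 95% confidence interval from the same sampling distribution. The population mean of 140, SD of 2, sample mean of 137.5, and sample sizes of 25 and 4 come from the slides. The alternative mean of 137.5 used for power is an assumption made here, since the slides do not state one, and the function name `sodium_steps` is made up.

```python
from math import sqrt
from statistics import NormalDist

POP_MEAN = 140.0     # known population mean (mmol/L), from the slides
POP_SD = 2.0         # known population standard deviation, from the slides
SAMPLE_MEAN = 137.5  # observed sample mean, from the slides
ALPHA = 0.05         # two-sided alpha level
ALT_MEAN = 137.5     # ASSUMPTION: alternative true mean used only for power


def sodium_steps(n, alt_mean=ALT_MEAN):
    """Walk the comparison steps for a sample of size n."""
    se = POP_SD / sqrt(n)                          # standard error
    z = (SAMPLE_MEAN - POP_MEAN) / se              # SEs away from the null mean
    z_cut = NormalDist().inv_cdf(1 - ALPHA / 2)    # cut point in z units
    lower_cut = POP_MEAN - z_cut * se              # cut points in mmol/L
    upper_cut = POP_MEAN + z_cut * se
    # Power: chance a sample mean falls past either cut point when the
    # sampling distribution is really centered on the alternative mean.
    alt_dist = NormalDist(alt_mean, se)
    power = alt_dist.cdf(lower_cut) + (1 - alt_dist.cdf(upper_cut))
    # 95% confidence interval around the observed sample mean.
    ci = (SAMPLE_MEAN - z_cut * se, SAMPLE_MEAN + z_cut * se)
    return {
        "se": se, "z": z, "z_cut": z_cut,
        "lower_cut": lower_cut, "upper_cut": upper_cut,
        "power": power, "beta": 1 - power, "ci": ci,
    }


if __name__ == "__main__":
    for n in (25, 4):
        print("n =", n)
        for name, value in sodium_steps(n).items():
            print("  ", name, "=", value)
```

Under these assumptions, dropping from 25 people to 4 widens the standard error from 0.4 to 1, pushes the cut points further from 140, and lowers the power, so the beta error grows.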