2/27/12 BMS 617 Lecture 10
webpages.marshall.edu/~denvir/BMS617/Spring2012/Lecture10_2012-02-13.html

BMS 617 Statistical Techniques for the Biomedical Sciences
Lecture 10: Normality Tests and Outliers

The assumption of normality
* The next section of the course discusses statistical tests
  * Essentially a collection of techniques for calculating p-values for common scenarios
* Many of these tests rely on the assumption that data were sampled from a Gaussian (Normal) distribution
  * T-tests, ANOVA, regression
* Sensible to question whether this assumption is reasonable
  * In most cases, data cannot be sampled from an ideal Gaussian distribution
    * E.g. in many cases, data values cannot be negative
  * Many simulations, and some theory, have shown that statistical tests still work well enough if the data are sampled from a population that is approximately normally distributed
* So the question really becomes "were the data sampled from a distribution close enough to the normal distribution?"
  * This is, of course, impossible to answer…

What normally distributed data looks like
* With very large data sets, the distribution of values sampled from a normal distribution will look like the standard bell curve
* However, with smaller data sets, the distribution will vary widely
  * Even if the population is perfectly normally distributed
* The next image shows five samples of size 15 and five samples of size 100 from a normally distributed population with mean=10 and sd=3.

[Figure: Samples from a normally distributed population]

Measuring the deviation from normality
* Normal distributions have two important properties:
  * They are symmetric about the mean
  * The spread about the mean is defined: about 68% of the values lie within one standard deviation of the mean, etc.
* Two measures quantify how far a data set deviates from this ideal
* Skewness measures the deviation from symmetry.
  * A skewness of 0 represents exact symmetry.
  * Positive skewness represents a heavier right tail
  * Negative skewness represents a heavier left tail
* Kurtosis measures the "peakiness" of the distribution
  * A Gaussian distribution has a kurtosis of zero (here "kurtosis" means excess kurtosis, i.e. kurtosis minus 3).
  * Positive kurtosis indicates a sharper peak
  * Negative kurtosis indicates a more rounded peak

[Figure: Skewness]

[Figure: Kurtosis]

Tests for Normality
* A number of tests are available that will test a data set to see if it looks like a sample from a Normal distribution
  * Some of these tests combine the skewness and kurtosis of the data set and compute a p-value; others compare the ordered or cumulative data values with what a normal distribution would predict.
* Tests include
  * Shapiro-Wilk test
  * Kolmogorov-Smirnov test
  * Anderson-Darling test
  * D'Agostino-Pearson omnibus K2 normality test (based on skewness and kurtosis)
* In all cases, the null hypothesis is that the data were drawn from a normally distributed population
  * The p-value is the probability of getting a deviation from normality at least this large (for example, this much skewness and/or kurtosis) under this assumption

Interpreting Normality Tests
* If a normality test gives a small p-value, we can reject the null hypothesis (that the data were sampled from a normally distributed population).
* The usual cautions about small p-values apply
  * If the sample size is large, the p-value might be small even though the deviation from normality is small.
  * For kurtosis and skewness in the range ±0.5 (or even ±1.0) most tests will typically work ok for large samples
* If there is strong evidence the population data is far from normally distributed, must account for this:
  * Do the data come from another distribution?
    * If so, transforming the data (e.g. by taking logs) may result in data that is sampled from a normal distribution.
  * Are there outliers that are causing the normality test to fail?
    * Run an outlier test (discussed soon).
  * Consider using a non-parametric test, which does not rely on normality.
    * Beware of these as they have less power.
    * We will discuss these later in the course.

Interpreting Normality Tests: large p-values
* If a normality test produces a large p-value, there is no evidence that the data are not from a normally distributed population
  * This does not mean they are sampled from a normally distributed population!
  * If the sample size is small, the normality test may simply not have enough power to detect deviations from the normal distribution
* There is no good way to prove that the population data is normally distributed!

Guidelines for the Assumption of Normality
* Remember, if there are many sources of variation, and those sources are additive, then data are likely to be approximately normally distributed
  * If you can justify this, then this is a better argument than using normality tests
* If you are running a series of experiments, they should all use the same test.
  * Do not use non-parametric tests for some and parametric tests for others.
* If it is standard in the literature to assume data from a particular assay are normally distributed, then it is probably safe to work with this assumption

Outliers
* An outlier is a value which is far from all other values
  * Appears to have come from a different population
* Outliers can invalidate many statistical analyses
* Can come from many causes:
  * Invalid data entry
  * Experimental error
  * Incorrect assumptions about the distribution
    * Typically, assuming a normal distribution instead of a lognormal distribution
  * Random sampling chance
  * Biological diversity
    * The sample truly is different to the others and the value really does come from a different distribution
    * In this case, you may have an exciting discovery

Detecting outliers by eye
* Which of these 20 data sets have outliers?
  * None of them!
  * All the data were computer-generated samples from the same normal distribution
* Detecting outliers in an ad hoc manner is dangerous

If you suspect an outlier
* These are things to consider if you suspect an outlier:
  * Check the data entry for errors.
    * Fix any errors.
    * Make sure you understand output from software. Many programs will use "special values" for missing data.
  * Was there an error observed during the experiment?
    * If so, simply eliminate the corresponding value
    * No need for an outlier test
  * Are you sure the distribution is normal?
    * If not, a transformation might remove the outlier
    * Example shortly
  * Is the extreme value due to biological variability?
    * If so, can you identify how the sample is different and what causes the extreme value?
    * This may lead to an important discovery.

Outlier Tests
* If the questions on the previous slide do not resolve the outlier, then there are two possibilities:
  * The value came from the same distribution, but due to random sampling just happened to be bigger or smaller than the others
  * The value was the result of some experimental error
* No computation can tell for sure which is the case, but an outlier test can answer this question:
  * Assume the data all came from the same normal distribution. What is the probability of getting a value at least this far from all the others?
* If an outlier test results in a low p-value, then there is strong evidence the value is the result of an error, and it can be excluded.
* If an outlier test results in a high p-value, then there is no strong evidence the value is the result of an error.
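
The question an outlier test answers can be approximated by simulation. The sketch below is only an illustration under stated assumptions, not the method used in the lecture: it measures "how far from the others" as the largest absolute distance from the sample mean in units of the sample standard deviation, then counts how often samples of the same size drawn from a normal distribution produce a value at least that extreme. The function names, the number of simulations, and both data sets are hypothetical.

```python
import math
import random


def max_studentized_deviate(values):
    """Largest absolute distance from the mean, in sample standard deviations."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return max(abs(v - mean) for v in values) / sd


def simulated_outlier_p_value(values, n_sim=10_000, seed=0):
    """Estimate P(max deviate >= observed) if all n values are normal.

    The statistic does not change when the data are shifted or rescaled,
    so simulating from a standard normal distribution is sufficient.
    """
    rng = random.Random(seed)
    n = len(values)
    observed = max_studentized_deviate(values)
    hits = 0
    for _ in range(n_sim):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if max_studentized_deviate(sample) >= observed:
            hits += 1
    return hits / n_sim


# Hypothetical measurements, invented for illustration only.
suspect = [9.8, 10.1, 10.4, 9.6, 10.0, 10.3, 9.9, 10.2, 9.7, 14.5]
typical = [9.8, 10.1, 10.4, 9.6, 10.0, 10.3, 9.9, 10.2, 9.7, 10.6]

p_suspect = simulated_outlier_p_value(suspect)
p_typical = simulated_outlier_p_value(typical)
print(f"suspect sample: estimated p = {p_suspect:.4f}")
print(f"typical sample: estimated p = {p_typical:.4f}")
```

A small estimated p-value for the first data set says only that such an extreme value would be rare if all ten values came from one normal distribution. The checks listed earlier (data entry, experimental error, choice of distribution, biological variability) still come first.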
Lognormal distributions
* The left image shows the raw data
  * An outlier test detects significant outliers in the middle three data sets
* The right image shows the data after log transforms: there are no outliers

Grubbs' Test
* Grubbs' test is a test for outliers
* Works by computing the absolute distance from each data point to the mean, divided by the standard deviation
* The maximum value of this studentized deviate is the statistic for the test
* Compared against a table of critical values
  * Critical value depends on the sample size

Summary: Tests for normality
* Many statistical tests rely on the population being normally distributed
  * But they can tolerate small deviations from the ideal normal distribution
* Tests for normality exist which test the null hypothesis that the data were sampled from a normal distribution
* Skewness measures the symmetry of the data set
* Kurtosis measures the sharpness of the peak
* Both values are zero for a perfect normal distribution
* If normality tests indicate the data are sampled from a distribution far from normal then some statistical tests may be invalid
  * Do not immediately jump to a non-parametric test without careful consideration

Summary: Outlier tests
* Outliers are hard, if not impossible, to detect "by hand"
* If you suspect an outlier:
  * Check for errors first
  * Consider the possibility that the sample really is different
  * Consider that the data may be from a different distribution
  * Only then perform an outlier test
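
As a companion to the summary of skewness and kurtosis, here is a minimal sketch of how the two measures can be computed. It assumes the simple moment-based definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3, where mk is the k-th central moment). The lecture does not state which formula it uses, and statistics packages often apply small-sample corrections, so their values may differ slightly. The function name and both data sets are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from the central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (n denominator)
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # zero for a normal distribution
    return skewness, excess_kurtosis


# Hypothetical data sets, invented for illustration only.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 12]

for name, data in [("symmetric", symmetric), ("right-tailed", right_tailed)]:
    skew, kurt = skewness_and_excess_kurtosis(data)
    print(f"{name}: skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

The evenly spaced data set is exactly symmetric, so its skewness is zero, and its flat shape gives a negative excess kurtosis. The data set with one large value has positive skewness, matching the "heavier right tail" reading.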