BMS 617
Statistical Techniques for the Biomedical Sciences
Lecture 10: Normality Tests and Outliers
The assumption of normality
The next section of the course discusses statistical tests
Essentially a collection of techniques for calculating p-values for common scenarios
Many of these tests rely on the assumption that data were sampled from a Gaussian (Normal) Distribution
T-tests, ANOVA, regression
Sensible to question whether this assumption is reasonable
In most cases, data cannot be sampled from an ideal Gaussian distribution
E.g. in many cases, data values cannot be negative
Many simulations, and some theory, have shown that statistical tests still work well enough if the data are sampled from
a population that is approximately normally distributed
So the question really becomes "were the data sampled from a distribution close enough to the normal
distribution?"
This is, of course, impossible to answer…
What normally distributed data looks like
With very large data sets, the distribution of values sampled from a normal distribution will look like the standard bell
curve
However, with smaller data sets, the distribution will vary widely
Even if the population is perfectly normally distributed
The next image shows five samples of size 15 and five samples of size 100 from a normally distributed population with
mean = 10 and sd = 3.
Samples from a normally distributed population
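As a rough illustration, here is a minimal Python sketch (assumed, not the lecture's original code; it uses NumPy and Matplotlib) that reproduces this kind of figure: five samples of size 15 and five of size 100 drawn from a normal population with mean 10 and standard deviation 3. The small samples look far less like a bell curve from one draw to the next, even though every sample comes from the same normal population.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(617)  # arbitrary seed so the figure is reproducible

fig, axes = plt.subplots(2, 5, figsize=(12, 5), sharex=True)
for col in range(5):
    small = rng.normal(loc=10, scale=3, size=15)   # one sample of size 15
    large = rng.normal(loc=10, scale=3, size=100)  # one sample of size 100
    axes[0, col].hist(small, bins=8)
    axes[0, col].set_title("n = 15")
    axes[1, col].hist(large, bins=15)
    axes[1, col].set_title("n = 100")
plt.tight_layout()
plt.show()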
Measuring the deviation from normality
Normal distributions have two important properties:
They are symmetric about the mean
The spread about the mean is fixed:
About 68% of the values lie within one standard deviation of the mean, about 95% within two standard deviations, and so on
Two measures quantify how much a data set deviates from this ideal
Skewness measures the deviation from symmetry.
A skewness of 0 represents exact symmetry.
Positive skewness represents a heavier right tail
Negative skewness represents a heavier left tail
Kurtosis measures the "peakiness" of the distribution
A Gaussian distribution has a kurtosis of zero (using the excess-kurtosis convention)
Positive kurtosis indicates a sharper peak
Negative kurtosis indicates a more rounded peak
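As a concrete illustration, here is a minimal sketch (assumed, not from the lecture) that computes skewness and excess kurtosis with SciPy. For normally distributed data both values should be close to zero; right-skewed (lognormal) data shows clearly positive skewness and kurtosis.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=10, scale=3, size=1000)        # roughly symmetric
skewed_data = rng.lognormal(mean=0, sigma=0.8, size=1000)   # heavier right tail

for name, data in [("normal", normal_data), ("lognormal", skewed_data)]:
    print(name,
          "skewness =", round(stats.skew(data), 2),
          "excess kurtosis =", round(stats.kurtosis(data, fisher=True), 2))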
Skewness
Kurtosis
Tests for Normality
A number of tests are available that check whether a data set looks like a sample from a Normal distribution
Typically these tests quantify the deviation from normality (for example, by combining the skewness and kurtosis of the data set) and compute a p-value.
Tests include
Shapiro-Wilk test
Kolmogorov-Smirnov test
Anderson-Darling test
D'Agostino-Pearson omnibus K² normality test
In all cases, the null hypothesis is that the data were drawn from a normally distributed population
The p-value is the probability of observing this much skewness and/or kurtosis under this assumption
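For example, here is a minimal sketch (assumed; the lecture does not prescribe any particular software) of running two of these tests with SciPy. In each case the null hypothesis is that the data were drawn from a normally distributed population.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=10, scale=3, size=50)

w_stat, w_p = stats.shapiro(data)        # Shapiro-Wilk test
k2_stat, k2_p = stats.normaltest(data)   # D'Agostino-Pearson omnibus K² test
print(f"Shapiro-Wilk:       W  = {w_stat:.3f}, p = {w_p:.3f}")
print(f"D'Agostino-Pearson: K² = {k2_stat:.3f}, p = {k2_p:.3f}")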
Interpreting Normality Tests
If a normality test gives a small p-value, we can reject the null hypothesis (the data were sampled from a normally
distributed population).
The usual cautions about small p-values apply
If the sample size is large, the p-value might be small even though the deviation from normality is small.
For kurtosis and skewness in the range ±0.5 (or even ±1.0), most statistical tests will typically work well enough for
large samples (a simulation sketch follows this list)
If there is strong evidence that the population is far from normally distributed, you must account for this:
Do the data come from another distribution? If so, transforming the data (e.g. by taking logs) may result in
data that are sampled from a normal distribution.
Are there outliers that are causing the normality test to fail? Run an outlier test (discussed soon).
Consider using a non-parametric test, which does not rely on normality. Beware of these as they have less
power. We will discuss these later in the course.
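To illustrate the large-sample caution above, here is a simulation sketch (assumed, not from the lecture): a very large sample from a mildly skewed population yields a tiny p-value from a normality test, even though the skewness is well inside the ±0.5 range where most parametric tests behave well.

from scipy import stats

# Mildly skewed population: a skew-normal distribution with shape a=1 has a
# theoretical skewness of roughly 0.14, well within the +/-0.5 guideline above.
data = stats.skewnorm.rvs(a=1, size=100_000, random_state=42)

print("sample skewness:", round(stats.skew(data), 3))
print("normality test p:", stats.normaltest(data).pvalue)   # typically tiny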
Interpreting Normality Tests: large p-values
If a normality test produces a large p-value, there is no evidence that the data are not from a normally distributed
population
This does not mean they are sampled from a normally distributed population!
If the sample size is small, the normality test may simply not have enough power to detect deviations from the
normal distribution (see the sketch below)
There is no good way to prove that the population data is normally distributed!
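Conversely, a small simulation sketch (assumed) illustrates the limited power with small samples: even when the population is clearly skewed (lognormal), a normality test on samples of size 10 often fails to reject.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_trials, n = 1000, 10
rejections = 0
for _ in range(n_trials):
    sample = rng.lognormal(mean=0, sigma=0.5, size=n)   # clearly skewed population
    if stats.shapiro(sample).pvalue < 0.05:
        rejections += 1

print(f"rejected normality in {rejections} of {n_trials} samples of size {n}")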
Guidelines for the Assumption of Normality
Remember, if there are many sources of variation, and those sources are additive, then data are likely to be
approximately normally distributed
If you can justify this, then this is a better argument than using normality tests
If you are running a series of experiments, they should all use the same test.
Do not use non-parametric tests for some and parametric tests for others.
If it is standard in the literature to assume data from a particular assay are normally distributed, then it is probably safe to
work with this assumption
Outliers
An outlier is a value which is far from all other values
Appears to have come from a different population
Outliers can invalidate many statistical analyses
Can come from many causes:
Invalid data entry
Experimental error
Incorrect assumptions about the distribution
Typically, assuming a normal distribution instead of a lognormal distribution
Random sampling chance
Biological diversity
The sample truly is different from the others and the value really does come from a different distribution
In this case, you may have an exciting discovery
Detecting outliers by eye
Which of these 20 data sets have outliers?
None of them! All the data were computer-generated samples from the same normal distribution
Detecting outliers in an ad hoc manner is dangerous
If you suspect an outlier
These are things to consider if you suspect an outlier:
Check the data entry for errors. Fix any errors.
Make sure you understand output from software. Many programs will use "special values" for missing data.
Was there an error observed during the experiment?
If so, simply eliminate the corresponding value
No need for an outlier test
Are you sure the distribution is normal? If not, a transformation might remove the outlier
Example shortly
Is the extreme value due to biological variability?
If so, can you identify how the sample is different and what causes the extreme value?
This may lead to an important discovery.
Outlier Tests
If the questions on the previous slide do not resolve the outlier, then there are two possibilities:
The value came from the same distribution, but due to random sampling just happened to be bigger or smaller
than the others
The value was the result of some experimental error
No computation can tell for sure which is the case, but an outlier test can answer this question:
Assume the data all came from the same normal distribution. What is the probability of getting a value at least
this far from all the others?
If an outlier test results in a low p-value, then there is strong evidence the value is the result of an error, and it can be
excluded.
If an outlier test results in a high p-value, then there is no strong evidence the value is the result of an error.
Lognormal distributions
The left image shows the raw data
An outlier test detects significant outliers in the middle three data sets
The right image shows the data after log transforms: there are no outliers
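Here is a minimal sketch (assumed; the figures themselves are not reproduced here) of the same idea: lognormal raw data can fail a normality test and appear to contain outliers, while the log-transformed values typically do not.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
raw = rng.lognormal(mean=2.0, sigma=0.6, size=30)   # right-skewed raw data
logged = np.log(raw)                                # log-transformed data

print("raw data:        Shapiro-Wilk p =", round(stats.shapiro(raw).pvalue, 4))
print("log-transformed: Shapiro-Wilk p =", round(stats.shapiro(logged).pvalue, 4))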
Grubbs' Test
Grubbs' test is a test for outliers
Works by computing the distance from each data point to the mean, divided by the standard deviation
The maximum value of this studentized deviate is the statistic for the test
Compared against a table of critical values
Critical value depends on the sample size
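Below is a minimal sketch (assumed, not the lecture's code) of a two-sided Grubbs' test: it computes the largest studentized deviate, G = max|xᵢ − x̄| / s, and compares it with a critical value derived from the t distribution rather than a printed table. The data values are made up for illustration only.

import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)          # sample mean and SD
    g = np.max(np.abs(x - mean)) / sd           # largest studentized deviate

    # Critical value for the two-sided test at significance level alpha
    t_crit = stats.t.ppf(1 - alpha / (2 * n), df=n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

    return g, g_crit, g > g_crit

data = [9.8, 10.1, 10.4, 9.6, 10.0, 10.2, 9.9, 15.7]   # made-up values; 15.7 is suspect
g, g_crit, is_outlier = grubbs_test(data)
print(f"G = {g:.2f}, critical value = {g_crit:.2f}, outlier detected: {is_outlier}")

For these made-up values, G exceeds the critical value (about 2.13 for n = 8), so 15.7 is flagged as an outlier at the 5% level.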
Summary: Tests for normality
Many statistical tests rely on the population being normally distributed
But they can tolerate small deviations from the ideal normal distribution
Tests for normality exist which test the null hypothesis that the data were sampled from a normal distribution
Skewness measures the asymmetry of the data set
Kurtosis measures the sharpness of the peak
Both values are zero for a perfect normal distribution
If normality tests indicate the data are sampled from a distribution far from normal, then some statistical tests may be
invalid
Do not immediately jump to a non-parametric test without careful consideration
Summary: Outlier tests
Outliers are hard, if not impossible, to detect "by hand"
If you suspect an outlier:
Check for errors first
Consider the possibility that the sample really is different
Consider that the data may be from a different distribution
Only then perform an outlier test