Analytical Tools for High Throughput Sequencing Data

Marshall University School of Medicine
Department of Biochemistry and Microbiology
BMS 617
Lecture 14: Non-parametric tests
Marshall University Genomics Core Facility
Parametric Tests
• The following tests all assume that the data are sampled from
populations in which the values are normally distributed:
– Unpaired t-test
– Paired t-test
• Assumes that the differences within pairs are samples of a normally
distributed population
– ANOVA
• Data which is normally distributed can be completely summarized
by the mean and standard deviation
– These two values completely determine the distribution
• These are the parameters for the distribution
• These tests work by estimating one or both of these parameters
– Known as parametric tests
Non-parametric tests
• Tests which make no assumptions about the
distribution are known as non-parametric tests
• Most commonly-used forms are based on ranking
(ordering) the values in the data set and analyzing
only the ranks
• This form of test is extremely robust to outliers
• The following are non-parametric versions of
parametric tests
• They are used in similar situations to their
parametric versions, but make no assumptions
about normality.
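The robustness to outliers mentioned above can be seen in a small sketch. The data values and the `ranks` helper are made up for illustration, and the helper assumes there are no tied values.

```python
def ranks(values):
    """Return the rank of each value (1 = smallest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

# Hypothetical measurements: the second list replaces 5.0 with an extreme value.
clean = [4.1, 3.7, 5.0, 4.6]
with_outlier = [4.1, 3.7, 50.0, 4.6]

print(ranks(clean), ranks(with_outlier))            # the ranks are identical
print(sum(clean) / 4, sum(with_outlier) / 4)        # the means are very different
```

Because a rank-based test only sees the ranks, the outlier changes nothing in its calculation, while the mean used by a parametric test shifts substantially.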
Non-parametric analogs of parametric tests
Scenario | Parametric test | Non-parametric test
Comparing two groups, with no pairing | Unpaired t-test | Mann-Whitney test (a.k.a. Wilcoxon rank-sum test)
Comparing two paired groups | Paired t-test | Wilcoxon matched-pairs signed-rank test
Comparing two ordinal variables for correlation | Pearson correlation (variables must be interval variables) | Spearman’s rank correlation
Comparing more than two groups | One-way ANOVA | Kruskal-Wallis test
The Mann-Whitney Test
• The Mann-Whitney test is the non-parametric
equivalent of the unpaired t-test
• Use when you want to compare a variable
between two groups, but you have reason to
believe the data are not sampled from a
normally-distributed population
How the Mann-Whitney Test works
• The Mann-Whitney test works as follows:
• Compute the rank for all values, regardless of which group
they come from
– Smallest value has a rank of 1, next smallest has a rank of 2, etc.
• Choose one group: for each data point in that group, count
the number of data points in the other group which are
smaller
– Sum these values, and call the sum U1
• Similarly compute U2, or use the fact that U1 + U2 = n1 × n2, where n1 and
n2 are the sizes of the two groups
• Let U = min(U1, U2)
• The distribution of U under the null hypothesis is known, so
software can compute a p-value
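The counting procedure above can be sketched as follows. The function names and data are illustrative, and counting a tied pair as 0.5 is a common convention assumed here rather than something stated on the slide. The second function uses the pooled ranks instead, via the identity U = R - n(n+1)/2 (R is the group's rank sum, n its size), which gives the same value when there are no ties.

```python
def u_by_counting(group, other):
    """For each point in `group`, count points in `other` that are smaller.
    Ties count as 0.5 (an assumed convention)."""
    return sum(1.0 if y < x else 0.5 if y == x else 0.0
               for x in group for y in other)

def u_by_ranks(group, other):
    """Same statistic from the pooled ranks (assumes no tied values)."""
    pooled = sorted(list(group) + list(other))
    rank_sum = sum(pooled.index(x) + 1 for x in group)  # 1-based ranks
    n = len(group)
    return rank_sum - n * (n + 1) / 2

# Hypothetical data for two unpaired groups.
group_a = [1.1, 2.3, 3.8]
group_b = [2.9, 4.5, 5.0, 6.2]

u1 = u_by_counting(group_a, group_b)
u2 = u_by_counting(group_b, group_a)
u = min(u1, u2)
print(u1, u2, u)
```

Turning U into a p-value requires the null distribution of U, which statistical software supplies.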
The Wilcoxon matched-pairs signed-rank test
• The Wilcoxon matched-pairs signed-rank test is used for
paired data
– Before and after treatment, etc.
• Unlike the paired t-test, it does not assume the differences
are samples from a normally-distributed population
• Basic procedure:
– Compute all signed differences
– Rank the differences by their absolute value
– Sum the ranks for the positive differences, and sum the ranks
for the negative differences.
– Compute the difference between these two sums of ranks
– The distribution of this value under the null hypothesis is known
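A minimal sketch of the basic procedure above, using made-up before/after values. Two conventions are assumed here that the slide does not specify: pairs with a zero difference are dropped, and tied absolute differences share the average of their ranks.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def signed_rank_sums(before, after):
    """Return (sum of positive ranks, sum of negative ranks, their difference)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus, w_plus - w_minus

# Hypothetical paired measurements (same subjects, before and after treatment).
before = [10, 12, 9, 14, 11]
after = [12, 11, 13, 19, 14]
print(signed_rank_sums(before, after))
```

A large imbalance between the two rank sums suggests a consistent shift in one direction; software compares the statistic with its known null distribution to obtain a p-value.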
Spearman’s Rank Correlation
• Spearman's Rank Correlation is used to test for
dependence in the ordering of two variables
– The variables only need be ordinal
– Do not need to be interval variables (no scale)
• Works by computing the ranks of each variable
• Then just compute the Pearson correlation coefficient
of the ranks
• Does not assume normally distributed populations
• Does not test for a linear relationship
– Just a monotonic (increasing/decreasing) one
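The two-step recipe above (rank each variable, then compute the Pearson coefficient of the ranks) can be sketched as below. The data are invented, and giving tied values their average rank is an assumed convention.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical data: a monotonic but non-linear relationship.
dose = [1, 2, 3, 4, 5]
response = [1, 4, 9, 16, 25]
print(pearson(dose, response), spearman(dose, response))
```

Here the relationship is perfectly monotonic, so the rank correlation reaches its maximum even though the Pearson coefficient of the raw values does not.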
Pros and cons of non-parametric tests
• Pros of non-parametric tests:
– Since non-parametric tests do not rely on the assumption of normally-distributed populations, they can be used when that assumption fails, or cannot be verified
• Cons of non-parametric tests:
– If the data really do come from normally-distributed populations, the non-parametric tests are less powerful than their parametric counterparts
• i.e. they will tend to give higher p-values
– For small sample sizes, they are much less powerful:
• Two-sided Mann-Whitney p-values are always greater than 0.05 if the total sample size (both groups combined) is 7 or fewer
– Non-parametric tests typically do not compute confidence intervals
• Can sometimes be computed, but often require additional assumptions
– Non-parametric tests are not related to regression models
• Cannot be extended to account for confounding variables using multiple regression
techniques
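The small-sample limit for the Mann-Whitney test can be checked by counting. This sketch assumes an exact two-sided test with no tied values, and takes "sample size" to mean the total across both groups. Under the null hypothesis every assignment of the pooled ranks to the two groups is equally likely, so even when the groups do not overlap at all the p-value cannot be smaller than 2 divided by the number of possible assignments. The function names are illustrative.

```python
from math import comb

def smallest_two_sided_p(n1, n2):
    """Smallest exact two-sided p-value for group sizes n1 and n2:
    the two most extreme arrangements out of C(n1 + n2, n1)."""
    return min(1.0, 2 / comb(n1 + n2, n1))

def best_case_p(total):
    """Smallest achievable p-value over every way of splitting `total` points."""
    return min(smallest_two_sided_p(n1, total - n1) for n1 in range(1, total))

for total in range(4, 10):
    print(total, round(best_case_p(total), 4))
```

With 7 observations the best split is 3 and 4, giving 2/35, which is just above 0.05; with 8 observations split 4 and 4, 2/70 falls below it.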
Choosing between parametric and non-parametric tests
• The choice between parametric and non-parametric tests is not
straightforward
• A common, but invalid, approach is to use normality tests to automate the
choice
– The choice is most important for small data sets, for which normality tests are
of limited use
– Using the data set to determine the statistical analysis will underestimate p-values
– If data fail normality tests, a transformation may be appropriate
• The most "honest" approach is to perform an independent experiment
with a large sample to test for normality, and then design the experiment
in hand based on the results of this
– This is almost always impractical
– For well-used experimental designs, an almost-equivalent approach is to
follow customary procedure
• Essentially assuming this has been carried out in some way already
How much difference does it make?
• The central limit theorem ensures that parametric tests work well with
non-normal distributions if the sample is large enough
– How large is large enough?
– Depends on the distribution!
– For most distributions, sample sizes in the range of dozens will remove any
issues with normality
• You will still increase your statistical power by using a transformation if appropriate
• Conversely, if the data really come from a normally-distributed population
and you choose a non-parametric test, you will lose statistical power
– For large samples, however, the difference is minimal
• Small samples present problems:
– Non-parametric tests have very little power for small samples
– Parametric tests can give misleading results for small samples if the population
data are non-normal
– Tests for normality are not helpful for small samples
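The effect of sample size described above can be illustrated with a small simulation. It is only a sketch: it assumes an exponential population (strongly right-skewed) as the example of a non-normal distribution, and uses the skewness of the simulated sample means as a rough measure of how far they are from normal (a normal distribution has skewness 0). The function names, seed and repetition count are arbitrary choices.

```python
import random
import statistics

def skewness(values):
    """Population skewness: mean cubed deviation in standard-deviation units."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_mean_skewness(n, reps=4000, seed=0):
    """Skewness of the means of `reps` samples of size n from an exponential population."""
    rng = random.Random(seed)
    means = [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
             for _ in range(reps)]
    return skewness(means)

for n in (2, 10, 50):
    print(n, round(sample_mean_skewness(n), 2))
```

The skewness of the sample means shrinks as n grows, which is the central limit theorem at work; how quickly it shrinks depends on the population distribution.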
Conclusions
• The bottom-line conclusion is that large samples are better than
small samples
– In general, the larger the better
• Of course, it can be prohibitively time consuming and/or expensive
to analyze large samples
• If your experimental design is going to use a small sample, you need
to be able to justify the assumption that the data come from a normally
distributed population
– If this is a common experimental design that is conventionally
analyzed this way, that may be good enough
– For a new methodology, you should really perform an independent
experiment with a large sample to test for normality first
• Use the results of this to guide the data analysis for future experiments
Computationally-intensive non-parametric methods
• The non-parametric methods we examined worked by analyzing the
ranks of the data
• Another class of non-parametric tests is the class of
computationally-intensive methods
• There are two subclasses:
– Permutation or randomization tests:
• Simulate the null distribution by repeatedly randomly reassigning group labels
• Compare the "real" data to the generated null distribution
– Bootstrapping techniques:
• Effectively generate many samples from the population by resampling from
the original sample
• Look at the distribution of summary data from the generated samples
• These techniques still require a reasonable sample size to begin
with
– Big enough to generate enough distinct permutations or bootstraps
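Both subclasses can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the test statistic is the difference in group means, the bootstrap summary is the sample mean with a simple percentile interval, and the data, function names, seed and number of resamples are all invented for the example.

```python
import random
from statistics import fmean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

def bootstrap_mean_interval(sample, n_bootstraps=2000, level=0.95, seed=0):
    """Percentile interval for the mean from resampling with replacement."""
    rng = random.Random(seed)
    means = sorted(fmean(rng.choices(sample, k=len(sample)))
                   for _ in range(n_bootstraps))
    tail = (1 - level) / 2
    lower = means[int(tail * n_bootstraps)]
    upper = means[int((1 - tail) * n_bootstraps) - 1]
    return lower, upper

# Hypothetical measurements for two groups of eight.
control = [5.1, 4.8, 6.0, 5.5, 5.9, 6.3, 5.2, 5.7]
treated = [7.9, 8.4, 7.6, 8.8, 8.1, 7.5, 8.6, 8.0]

print(permutation_p_value(control, treated))
print(bootstrap_mean_interval(control))
```

With very small groups there are too few distinct relabelings or resamples for either method to be informative, which is why a reasonable starting sample size is still needed.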
Summary
• Rank-based non-parametric tests are available as
replacements for parametric tests
• Less powerful than parametric counterparts but work
when the data are not sampled from a normal
distribution
• Choice of test should not be automated
– Should be part of experimental design and not depend on
the data
• Choice is less important for large data sets
– But non-parametric tests lose the most power for small data sets
• Permutation and bootstrapping techniques also
provide alternatives to parametric tests