W01 Notes: Inference and hypothesis testing

Wednesday #1
10m
Announcements and questions
Email Carla ([email protected]) to register for the course site with a blank email, subject "BIO508"
All accounts emailed so far are created, username = password = email user ID (lower case)
Note that Problem Set #1 is posted and due by end-of-day next Monday
First lab tomorrow, Thursday 4:30-6:30 in Kresge LL6
Office hour location update from Carla
15m
Probability: basic definitions
Statistics describe data; probability provides an underlying mathematical theory for manipulating them
Experiment: anything that produces a non-deterministic (random or stochastic) result
Coin flip, die roll, item count, concentration measurement, distance measurement...
Sample space: the set of all possible outcomes for a particular experiment, finite or infinite, discrete or continuous
{H, T}, {1, 2, 3, 4, 5, 6}, {0, 1, 2, 3, ...}, {0, 0.1, 0.001, 0.02, 3.14159, ...}
Event: any subset of a sample space
{}, {H}, {1, 3, 5}, {0, 1, 2}, [0, 3)
Probability: for an event E, the limit of n(E)/n as n grows large (at least if you're a frequentist)
Thus many symbolic proofs of probability relationships are based on integrals or limit theory
Note that all of these are defined in terms of sets and set notation:
{1, 2, 3} is a set of unordered unique elements, {} is the empty set
AB is the union of two sets (set of all elements in either set A or B)
AB is the intersection of two sets (set of all elements in both sets A and B)
~A is the complement of a set (all elements from some universe not in set A)
(Kolmogorov) axioms: one definition of "probability" that matches reality
For any event E, P(E)≥0
"Probability" is a non-negative real number
For any sample space S, P(S)=1
The "probability" of all outcomes for an experiment must total 1
For disjoint events E∩F={}, P(E∪F)=P(E)+P(F)
For two mutually exclusive events that share no outcomes...
The "probability" of either happening equals the summed "probability" of each happening independently
These represent one set of three assumptions from which intuitive rules about probability can be derived
0≤P(E)≤1 for any event E
P({})=0, i.e. every experiment must have some outcome
P(E)≤P(F) if E⊆F, i.e. an event containing more outcomes must be at least as probable as one it includes (see the sketch below)
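A minimal Python sketch of these set-based definitions, assuming a fair six-sided die so that every outcome is equally likely and P(E) = |E|/|S|; the die and the specific events are illustrative assumptions, not examples from the lecture:

```python
# Sketch: events as sets for a fair six-sided die (equally likely outcomes),
# so P(E) = |E| / |S|. The example and names are illustrative.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}        # sample space
E = {1, 3, 5}                 # event: odd roll
F = {2, 4, 6}                 # event: even roll (disjoint from E)

def P(event, space=S):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & space), len(space))

assert P(S) == 1                        # axiom: P(S) = 1
assert P(E) >= 0 and P(set()) == 0      # axiom: non-negativity; derived: P({}) = 0
assert E & F == set()                   # E and F are disjoint...
assert P(E | F) == P(E) + P(F)          # ...so P(E∪F) = P(E) + P(F)
assert P(E) <= P(E | {2})               # E ⊆ E∪{2} implies P(E) ≤ P(E∪{2})
print(P(E), P(E | F), P(S - E))         # 1/2 1 1/2  (S - E is the complement ~E)
```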
20m
Conditional probabilities and Bayes' theorem
Note that we'll be covering this only briefly in class
Feel free to refer to 1) the notes here, 2) the full notes online, 3) suggested reading below, or 4) the tutorials at:
http://brilliant.org/assessment/techniques-trainer/conditional-probability-and-bayes-theorem/
http://www.math.umass.edu/~lr7q/ps_files/teaching/math456/Week2.pdf
Conditional probability
The probability of an event given that another event has already occurred
The probability of event F given that the sample space S has been reduced to E⊆S
Notated P(F|E) and defined as P(E∩F)/P(E)
Bayes' theorem
P(F|E) = P(E|F)P(F)/P(E)
True since P(F|E) = P(E∩F)/P(E) and P(E∩F) = P(F∩E) = P(E|F)P(F)
Provides a means of calculating a conditional probability based on the inverse of its condition
Typically described in terms of prior, posterior, and support
P(F) is the prior probability of F occurring at all in the first place, "before" anything else
P(F|E) is the posterior probability of F occurring "after" E has occurred
P(E|F)/P(E) is the support E provides for F
Some examples from poker
Pick a card, any card!
Probability of drawing a jack given that you've drawn a face card?
P(jack|face) = P(jack∩face)/P(face) = P(jack)/P(face) = (4/52)/(12/52) = 1/3
P(jack|face) = P(face|jack)P(jack)/P(face) = 1*(4/52)/(12/52) = 1/3
Probability of drawing a face card given that you've drawn a jack?
P(face|jack) = P(jack∩face)/P(jack) = P(jack)/P(jack) = 1
P(face|jack) = P(jack|face)P(face)/P(jack) = (1/3)(12/52)/(4/52) = 1
Pick three cards: probability of drawing (exactly) a pair of aces given that you've drawn (exactly) a pair?
P(2A|pair) = P(2A∩pair)/P(pair) = P(2A)/P(pair) = ((4/52)(3/51)(48/50)/3!)/(1*(3/51)(48/50)/3!) = 1/13
P(2A|pair) = P(pair|2A)P(2A)/P(pair) = 1*((4/52)(3/51)(48/50)/3!)/(1*(3/51)(48/50)/3!) = 1/13
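A small Python sketch that checks the single-card calculations above by enumerating a standard 52-card deck; the deck representation and variable names are illustrative:

```python
# Sketch: verify P(jack|face) and P(face|jack) by enumeration of a 52-card deck.
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))

jack = {card for card in deck if card[0] == "J"}
face = {card for card in deck if card[0] in {"J", "Q", "K"}}

def P(event):
    return Fraction(len(event), len(deck))

# Conditional probability: P(F|E) = P(E∩F)/P(E)
p_jack_given_face = P(jack & face) / P(face)    # (4/52)/(12/52) = 1/3
p_face_given_jack = P(jack & face) / P(jack)    # (4/52)/(4/52)  = 1
# Bayes' theorem: P(F|E) = P(E|F)P(F)/P(E)
assert p_jack_given_face == p_face_given_jack * P(jack) / P(face)
print(p_jack_given_face, p_face_given_jack)     # 1/3 1
```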
20m
Hypothesis testing
These are useful for proofs involving counting (and gambling), but more so for comparing results to chance
Test statistics and null hypotheses
Test statistic: any numerical summary of input data with known sampling distribution for null (random) data
Example: flip a coin four times; data are four H/T binary categorical values
One test statistic might be the % of heads (H) results (or % of tails (T) results)
Example: repeat a measurement of gel band intensity in three lanes each for two proteins
How different are the two proteins' expression levels?
One test statistic might be the difference between the means of the two sets of three intensities
Another might be the difference in medians
Another might be the difference in minima; or the difference in maxima
Simply a number that reduces (summarizes) the entire dataset to a single value
Null distribution: expected distribution of test statistic under assumption of no effect/bias/etc.
A specific definition of "no effect" is the null hypothesis
And when used, the alternate hypothesis is just everything else (e.g. "some effect")
Example: a fair coin summarized as % Hs is expected to produce 50% H
"Null hypothesis" is that P(H) = P(T) = 0.5, i.e. the coin is fair
But any given experiment has some noise; the null distribution is not that %Hs equals exactly 0.5
Instead if the coin is fair, the null distribution has some "shape" around 0.5
The shape of that distribution depends on the experiment being performed (e.g. binomial for coin flip)
Example: the difference in means between proteins of equal expression is expected to be zero
But how surprised are we if it's not identically zero?
The "shape" of the "typical" variation around zero is dictated by the amount of noise in our experiment
If we can measure protein expression really well, we're surprised if the difference is >>0
But if our gels are noisy, the null distribution might be wide
So we can't differentiate expression changes due to biology from those due to "chance" or noise
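A hedged Python simulation of this idea: two proteins with equal true expression are "measured" three times each under Gaussian noise, and repeating the experiment many times traces out the null distribution of the difference in means (the expression level and noise SDs are made-up values):

```python
# Sketch: null distribution of the difference in means for two proteins with
# equal true expression, three lanes each. The sigma values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_expression = 10.0

for sigma in (0.1, 2.0):                  # precise gels vs. noisy gels
    diffs = []
    for _ in range(10_000):               # simulate many null experiments
        a = true_expression + rng.normal(0, sigma, size=3)
        b = true_expression + rng.normal(0, sigma, size=3)
        diffs.append(a.mean() - b.mean())
    diffs = np.array(diffs)
    # Centered on zero either way, but the width scales with the noise level:
    print(f"sigma={sigma}: SD of null difference-of-means ~ {diffs.std():.2f}")
```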
Parametric versus nonparametric null distributions
It is absolutely critical to distinguish between two types of null distributions:
Those for which we can analyze the expected probability distribution mathematically: parametric
Those for which we can compute the expected probability distribution numerically: nonparametric
Parametric distributions mean the test statistic under the null hypothesis has some analytically described shape
Implies some more or less specific assumptions about the behavior of the experiment generating your data
Example: a gene has known baseline expression μ, and you measure it once under a new condition
A good test statistic is your measurement x-μ; how surprised are you if this is ≠0?
If experimental noise is normally distributed with standard deviation σ, x-μ will be normally distributed
Referred to as a z-statistic
Example: a gene has known baseline expression μ, and you measure it three times
How surprised are you if |μ̂ − μ| ≠ 0, where μ̂ is the mean of the three measurements?
What if you don't know your experimental noise beforehand, but do know it's normally distributed?
A useful test statistic in this case is the t-statistic t = |μ̂ − μ|/(σ̂/√n)
This uses the sample's standard deviation, but is less certain about it
Has an analytically defined t-distribution, thus leading to the popularity of the t-test
But parametric tests often make very strong assumptions about your experiment and data!
And fortunately, using computers, you rarely need to use them any more
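To make the parametric examples above concrete, here is a hedged Python sketch (using scipy.stats) computing a z-statistic with a known noise SD and a t-statistic with an estimated one, for three made-up measurements against a known baseline μ; all numeric values are illustrative assumptions:

```python
# Sketch: z- and t-statistics against a known baseline mu; numbers are made up.
import numpy as np
from scipy import stats

mu = 10.0                                  # known baseline expression
x = np.array([10.8, 11.2, 10.5])           # three measurements under a new condition

# z-statistic: usable only if the noise SD (sigma) is known beforehand
sigma = 0.5
z = (x.mean() - mu) / (sigma / np.sqrt(len(x)))
p_z = 2 * stats.norm.sf(abs(z))            # two-sided tail of the normal null

# t-statistic: uses the sample SD instead, with a wider (t-distributed) null
t = (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(len(x)))
p_t = 2 * stats.t.sf(abs(t), df=len(x) - 1)

print(f"z={z:.2f} (p={p_z:.3g}), t={t:.2f} (p={p_t:.3g})")
print(stats.ttest_1samp(x, popmean=mu))    # scipy's one-sample t-test: same t and p
```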
Nonparametric distributions mean the shape of the null distribution is calculated either by:
Transforming the test statistic to rank values only (and thus ignoring its shape) or
Simulating it directly from truly randomized data using a computer
Referred to as the bootstrap or permutation testing depending on precisely how it's done
Take your data, shuffle it a bunch of times, see how often you get a test statistic as extreme as real data
Incredibly useful: inherently bakes structure of data into significance calculations
e.g. population structure for GWAS, coexpression structure for microarrays, etc.
Example: comparing your 2x3 protein measurements
Your data start out ordered into two groups: [A B C, X Y Z]
You can then use the difference of means [A B C] and [X Y Z] as a test statistic
Shuffle the order many times so the groups are random
How often is the difference between positions [1 2 3] and [4 5 6] of a random ordering as big as the real difference? (see the permutation sketch below)
If we assume normality etc., we can calculate this using formulas for the difference in means
But what about the difference in medians? Or minima/maxima?
Nonparametric tests can be extremely robust to the quirks encountered in real data
They typically involve few or no assumptions about the shape or properties of your experiment
They can also provide a good summary of how well your results do fit a specific assumption
e.g. how close is a permuted distribution to normal?
Q-Q plots do this for the general case (scatter plot quantiles of any two distributions/vectors)
The costs are:
Decreased sensitivity (particularly for rank-based tests)
Increased computational costs (all that shuffling/permuting/randomization takes time!)
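A compact Python sketch of such a permutation test for the 2x3 gel example, using the difference of means as the test statistic; the six intensity values are made up:

```python
# Sketch: nonparametric permutation test for two groups of three measurements.
import numpy as np

rng = np.random.default_rng(0)
data = np.array([4.1, 3.8, 4.4,     # group 1: lanes A B C (made-up values)
                 5.0, 5.3, 4.9])    # group 2: lanes X Y Z

def diff_of_means(values):
    return values[:3].mean() - values[3:].mean()

observed = diff_of_means(data)

n_perm = 10_000
null = np.array([diff_of_means(rng.permutation(data)) for _ in range(n_perm)])

# Two-sided permutation p-value: how often is a shuffled difference as extreme?
p = (np.abs(null) >= abs(observed)).mean()
print(f"observed difference = {observed:.2f}, permutation p ~ {p:.3f}")
# Swapping diff_of_means for a difference of medians (or minima/maxima) needs
# no new theory: just change the test statistic.
```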
15m
p-values
Given some real data with a test statistic, and a null hypothesis with some resulting null distribution...
A p-value is the probability of observing a "comparable" result if the null hypothesis is true
Formally, for test statistic T with some value t for your data, P(T≥t|H0)
Example: flip a coin a bunch of times
Your test statistic is %Hs, and your null hypothesis is the coin's fair so P(H)=0.5
What's the probability of observing ≥90% heads?
Example: given a plate full of yeast colonies, count how many of them are petites (non-respiratory)
Your test statistic is %petites, and your null hypothesis is that the wild type P(petite)=0.15
What's the probability of observing ≥50% petite colonies on a plate?
Note that we've been stating these p-values in terms of extreme values, e.g. "at least 90% or more"
Sometimes you care only about large values, e.g. ≥90%
Sometimes you care about any extreme values, e.g. ≥90% or ≤10%
Often the case for deviations from zero/mean, e.g. |μ̂ − μ|/(σ̂/√n) ≥ 3
This is the difference between one-sided and two-sided hypothesis tests/p-values
H0:0, HA:>0
H0:=0, HA:0
We won't dwell on the theoretical implications of this, but it has two important practical effects:
Calculate without an absolute value when you care about direction, with when you don't
Double your p-value (equivalently, halve your significance threshold) when you don't care about direction, since you're testing twice the area
Often only one of these (one- or two-sided) will make sense for a particular situation
In cases where you can choose, it's almost always more correct to make the more conservative choice
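A short Python sketch (scipy.stats) of these tail probabilities; the sample sizes (20 flips, 40 colonies) are made-up assumptions since the notes don't fix them:

```python
# Sketch: one- and two-sided binomial p-values for the coin and colony examples.
from scipy import stats

# Coin: null P(H) = 0.5; chance of >= 90% heads in 20 flips (i.e. >= 18 heads)?
n_flips, k_heads = 20, 18
p_one_sided = stats.binom.sf(k_heads - 1, n_flips, 0.5)    # P(X >= 18 | fair coin)
p_two_sided = 2 * p_one_sided                              # roughly, for a symmetric null
print(f"coin: one-sided p = {p_one_sided:.2e}, two-sided p ~ {p_two_sided:.2e}")

# Colonies: null P(petite) = 0.15; chance of >= 50% petites among 40 colonies?
n_col, k_pet = 40, 20
p_petite = stats.binom.sf(k_pet - 1, n_col, 0.15)          # P(X >= 20 | null)
print(f"petites: one-sided p = {p_petite:.2e}")
```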
20m
Performance evaluation
We can (and should!) formalize these concepts of tests being "right" or "wrong"
How do you assess the accuracy of a hypothesis test with respect to a ground truth?
Gold standard: list of correct outcomes for a test, also standard or answer set or reference
Often categorical binary outcomes (0/1), also sometimes true numerical values
In the former (very common) case, answers are referred to as:
Positives: outcomes expected to be "true" or "1" in the gold standard, drawn from the Ha distribution
Negatives: outcomes expected to be "false" or "0" in the standard, drawn from the H0 distribution
Predictions or results are the outcomes of a test or inference process
Often whether or not a hypothesis test falls above or below a critical value
Example: is a p-value <0.05
For any test with a reference outcome in the gold standard, four results are possible:
The feature was drawn from H0 and your test accepts H0: true negative
The feature was drawn from H0 and your test rejects H0: false positive (also type I error)
The feature was drawn from Ha and your test rejects H0: true positive
The feature was drawn from Ha and your test accepts H0: false negative (also type II error)
                H0 True                     H0 False
H0 Accepted     True Negative               False Negative (Type II)
H0 Rejected     False Positive (Type I)     True Positive
The rates or probabilities with which incorrect outcomes occur indicate the performance or accuracy of a test
False positive rate = FPR = fraction of successful tests for uninteresting features = P(reject H0|H0)
False negative rate = FNR = fraction of failed tests for interesting features = P(accept H0|Ha)
What type of "performance" you care about depends a lot on how the test's to be used
Are you running expensive validations in the lab? False positives can hurt!
Are you trying to detect dangerous contaminants? False negatives can hurt!
There are thus a slew of complementary performance measures based on different aspects of these quantities
Power: probability of detecting a true effect = TP/P = TP/(TP+FN) = P(reject H0|Ha)
Also called recall, true positive rate (TPR), or sensitivity
Precision: probability a detected effect is true = TP/(TP+FP) = P(Ha|reject H0)
Specificity: probability a truly null feature is not detected = TN/(TN+FP) = P(accept H0|H0)
Also called true negative rate (TNR) and equivalent to 1 - false positive rate (FPR)
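A tiny Python sketch computing these measures from made-up confusion-matrix counts, mainly to pin down which quantity conditions on what:

```python
# Sketch: performance measures from a confusion matrix (counts are made up).
TP, FP, TN, FN = 40, 10, 85, 15

power = recall = sensitivity = TP / (TP + FN)    # P(reject H0 | Ha)
precision = TP / (TP + FP)                       # P(Ha | reject H0)
specificity = TN / (TN + FP)                     # P(accept H0 | H0)
fpr = FP / (FP + TN)                             # = 1 - specificity
fnr = FN / (FN + TP)                             # = 1 - power

print(f"recall={recall:.2f} precision={precision:.2f} "
      f"specificity={specificity:.2f} FPR={fpr:.2f} FNR={fnr:.2f}")
```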
Most hypothesis tests provide at least one tunable parameter that will trade off between aspects of performance
High-precision tests are typically low-recall (few false positives, more false negatives)
Highly sensitive tests are typically less specific (few false negatives, more false positives)
These tradeoffs are commonly visualized to provide a sense of prediction accuracy over a range of thresholds
Precision/recall plots: recall (x) vs. precision (y), upper right corner is "good"
Receiver Operating Characteristic (ROC) or sensitivity/specificity plots: FPR (x) vs. TPR (y), upper left good
Both feature recall, but since precision ≠ specificity, can provide complementary views of the same data
Entire curves can be further reduced to summary statistics of overall prediction performance
Area Under the Curve (AUC): area under a ROC curve
Random = 0.5, Perfect = 1.0, Perfectly wrong = 0.0
Area under precision/recall curve (AUPRC): exactly what it sounds like
Not as commonly used, since its "random" baseline isn't fixed to a particular value (it depends on the fraction of positives in the standard)
Computational predictions are very often evaluated using AUC, as are diagnostic tests and risk scores
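A hedged Python sketch of an ROC curve and its AUC: the labels and scores are made up, with scores standing in for a test statistic or risk score, and the decision threshold swept one item at a time:

```python
# Sketch: sweep a score threshold to trace an ROC curve, then integrate for AUC.
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([1] * 50 + [0] * 50)               # gold standard: 50 pos, 50 neg
scores = np.concatenate([rng.normal(1.0, 1.0, 50),   # positives tend to score higher
                         rng.normal(0.0, 1.0, 50)])

order = np.argsort(-scores)                          # lower the threshold stepwise
tpr = np.concatenate([[0.0], np.cumsum(labels[order] == 1) / 50])  # sensitivity
fpr = np.concatenate([[0.0], np.cumsum(labels[order] == 0) / 50])  # 1 - specificity

auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal area under curve
print(f"AUC ~ {auc:.2f}   (random ~ 0.5, perfect = 1.0, perfectly wrong = 0.0)")
```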
???m
Repeated tests and high-dimensional data
Suppose you run not just one six-lane gel for one protein, but six microarrays for 20,000 transcripts?
If you generate p-values for all 20,000 tests of difference in means, what should they look like?
A p-value is "the probability of a result occurring by chance"
Thus, if your assumptions are correct and the null hypothesis is true, every value between 0 and 1 is equally likely!
You expect 10% of them to fall in the range 0-0.1, or 0.4-0.5, or 0.9-1.0
Thus well-behaved p-values should have a uniform distribution between 0 and 1 if the null hypothesis is true
5% of them will fall below 0.05 even if there's no true effect
It's common to visualize p-values for multiple tests to determine how biased they are
How far from the uniform distribution expected under the null hypothesis
This can be done using histograms of p-values (roughly flat under the null)
Or by comparing rank-ordered p-values to a straight diagonal line (a form of Q-Q plot)
If you t-test 20,000 genes, you expect p<0.05 for 1,000 of them even if there's no biological effect
This is one aspect of what makes high-dimensional data so difficult to deal with
Occurs when you take many measurements or features (p) under only a few conditions or samples (n)
Thus sometimes referred to as the p>>n problem
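A quick Python simulation of this behavior: 20,000 "genes" measured in 3 vs. 3 samples, all drawn from the same normal distribution so the null hypothesis is true for every gene (the normality and sample sizes are illustrative assumptions):

```python
# Sketch: p-values from many tests with no true effect are uniform on [0, 1].
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
null_data = rng.normal(size=(20_000, 6))       # no difference between the two groups
p = stats.ttest_ind(null_data[:, :3], null_data[:, 3:], axis=1).pvalue

print(f"fraction with p < 0.05: {(p < 0.05).mean():.3f}")    # ~0.05, i.e. ~1,000 genes
counts, _ = np.histogram(p, bins=10, range=(0, 1))
print("p-value histogram (10 bins):", counts)                # roughly flat
```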
What can you do?
One option is to use nonparametric permutation testing:
Parametric null hypothesis is, "Each individual gene's difference is normally distributed with mean zero"
Thus you're testing, "How surprised am I by the magnitude of each individual gene's difference?"
Can instead determine the nonparametric null hypothesis that, "All genes' differences are zero"
And thus the null distribution, "Differences of all genes after randomizing the data"
And test, "How often do I observe differences of this magnitude in my entire randomized dataset?"
This is often advisable but not always possible (e.g. when the computation's expensive)
You can instead adjust your p-value in various analytical ways to account for multiple hypothesis testing
Bonferroni correction: very strict correction that multiplies each p-value by the number of tests
Controls the Family-Wise Error Rate (FWER), the probability that any feature "passes" by chance
Thus it's sensitive to the number of "input" features: how many measurements you're making
Good, quick-and-easy, very conservative way to account for high-dimensional data
But it lowers sensitivity: it's easy to "miss" good results because they're "not significant enough"
An alternative is to control the False Discovery Rate (FDR), the % of passing results expected to be "wrong" by chance
FDR q-value = (total # tests) * (p-value) / (rank of that p-value), as in the Benjamini-Hochberg procedure (see the sketch at the end of this section)
Makes control depend on number of output (passing) features, not total number of input features
Example: test your 20,000 genes
With no control, you expect 1,000 to achieve p<0.05 even in the absence of a true effect
5% of what you test will be "wrong" by chance
With Bonferroni FWER control, you expect 0 to achieve p<0.05 even in the absence of a true effect
0% of what you test will be "wrong" by chance
With FDR control, you expect 5% of tests that achieve q<0.05 to do so even without a true effect
5% of your results will be "wrong" by chance (doesn't depend on how many tests you run!)
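A short Python sketch of both corrections on a made-up p-value vector; the Benjamini-Hochberg step implements the q-value formula above plus the usual monotonicity adjustment:

```python
# Sketch: Bonferroni (FWER) vs. Benjamini-Hochberg (FDR) adjustment of p-values.
import numpy as np

p = np.array([1e-6, 0.0004, 0.003, 0.012, 0.049, 0.2, 0.7])   # made-up p-values
m = len(p)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
bonferroni = np.minimum(p * m, 1.0)

# Benjamini-Hochberg: q = p * (total # tests) / (rank of p), then enforce q_i <= q_(i+1)
order = np.argsort(p)
q_sorted = p[order] * m / np.arange(1, m + 1)
q_sorted = np.minimum.accumulate(q_sorted[::-1])[::-1]
q = np.empty(m)
q[order] = np.minimum(q_sorted, 1.0)

print("Bonferroni-adjusted p < 0.05:", (bonferroni < 0.05).sum(), "of", m)
print("BH q-value < 0.05:          ", (q < 0.05).sum(), "of", m)
```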
Reading
Hypothesis testing:
Pagano and Gauvreau, 10.1-5
T-tests:
Pagano and Gauvreau, 11.1-2
Wilcoxon:
Pagano and Gauvreau, 13.2-4
ANOVA:
Pagano and Gauvreau, 12.1-2
Performance evaluation:
Pagano and Gauvreau, 6.4