Wednesday #1
10m
Announcements and questions
To register for the course web site, please send the TAs a blank email with the subject "BIO508"
Note that Problem Set #1 is posted and due by end-of-day next Tuesday
First lab will be held next Tuesday 10:30-12:20 in Kresge 209
My office hours are Monday 9:00-10:00 in SPH1 413; TAs' locations?
20m
Probability: basic definitions
Statistics describe data; probability provides an underlying mathematical theory for manipulating them
Experiment: anything that produces a non-deterministic (random or stochastic) result
Coin flip, die roll, item count, concentration measurement, distance measurement...
Sample space: the set of all possible outcomes for a particular experiment, finite or infinite, discrete or continuous
{H, T}, {1, 2, 3, 4, 5, 6}, {0, 1, 2, 3, ...}, {0, 0.1, 0.001, 0.02, 3.14159, ...}
Event: any subset of a sample space
{}, {H}, {1, 3, 5}, {0, 1, 2}, [0, 3)
Probability: for an event E, the limit of n(E)/n as n grows large (at least if you're a frequentist)
Thus many symbolic proofs of probability relationships are based on integrals or limit theory
Note that all of these are defined in terms of sets and set notation:
{1, 2, 3} is a set of unordered unique elements, {} is the empty set
AB is the union of two sets (set of all elements in either set A or B)
AB is the intersection of two sets (set of all elements in both sets A and B)
~A is the complement of a set (all elements from some universe not in set A)
(Kolmogorov) axioms: one definition of "probability" that matches reality
For any event E, P(E)≥0
"Probability" is a non-negative real number
For any sample space S, P(S)=1
The "probability" of all outcomes for an experiment must total 1
For disjoint events E∩F={}, P(E∪F)=P(E)+P(F)
For two mutually exclusive events that share no outcomes...
The "probability" of either happening equals the summed "probability" of each happening independently
These represent one set of three assumptions from which intuitive rules about probability can be derived
0≤P(E)≤1 for any event E
P({})=0, i.e. every experiment must have some outcome
P(E)≤P(F) if EF, i.e. the probability of more events happening must be at least as great as fewer events
15m
Conditional probabilities and Bayes' theorem
Conditional probability
The probability of an event given that another event has already occurred
The probability of event F given that the sample space S has been reduced to E⊆S
Notated P(F|E) and defined as P(E∩F)/P(E)
Bayes' theorem
P(F|E) = P(E|F)P(F)/P(E)
True since P(F|E) = P(E∩F)/P(E) and P(E∩F) = P(F∩E) = P(E|F)P(F)
Provides a means of calculating a conditional probability based on the inverse of its condition
Typically described in terms of prior, posterior, and support
P(F) is the prior probability of F occurring at all in the first place, "before" anything else
P(F|E) is the posterior probability of F occurring "after" E has occurred
P(E|F)/P(E) is the support E provides for F
Some examples from poker
Pick a card, any card!
Probability of drawing a jack given that you've drawn a face card?
P(jack|face) = P(jackface)/P(face) = P(jack)/P(face) = (4/52)/(12/52) = 1/3
P(jack|face) = P(face|jack)P(jack)/P(face) = 1*(4/52)/(12/52) = 1/3
Probability of drawing a face card given that you've drawn a jack?
P(face|jack) = P(jackface)/P(jack) = P(jack)/P(jack) = 1
P(face|jack) = P(jack|face)P(face)/P(jack) = (1/3)(12/52)/(4/52) = 1
Pick three cards: probability of drawing (exactly) a pair of aces given that you've drawn (exactly) a pair?
P(2A|pair) = P(2Apair)/P(pair) = P(2A)/P(pair) = ((4/52)(3/51)(48/50)/3!)/(1*(3/51)(48/50)/3!) = 1/13
P(2A|pair) = P(pair|2A)P(2A)/P(pair) = 1*((4/52)(3/51)(48/50)/3!)/(1*(3/51)(48/50)/3!) = 1/13
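These counting arguments are easy to check by brute force; a minimal Python sketch (standard 52-card deck assumed, enumerating hands rather than using the closed forms above):
```python
from itertools import combinations

ranks = list(range(13)) * 4                    # 52 cards; only rank matters here
jack, queen, king = 10, 11, 12                 # arbitrary rank labels for the face cards
faces = {jack, queen, king}

# P(jack | face) for a single draw: count within the reduced sample space
p_jack_given_face = sum(r == jack for r in ranks) / sum(r in faces for r in ranks)
print(p_jack_given_face)                       # 4/12 = 1/3

# P(exactly a pair of aces | exactly a pair) for a three-card draw
ace = 0
hands = list(combinations(range(52), 3))       # C(52,3) = 22,100 unordered hands

def is_pair(hand):
    rs = [ranks[i] for i in hand]
    return len(set(rs)) == 2                   # exactly two distinct ranks = one pair

def is_ace_pair(hand):
    rs = [ranks[i] for i in hand]
    return is_pair(hand) and rs.count(ace) == 2

n_pair = sum(is_pair(h) for h in hands)
n_ace_pair = sum(is_ace_pair(h) for h in hands)
print(n_ace_pair / n_pair)                     # 1/13 ≈ 0.0769
```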
20m
Hypothesis testing
These methods are useful for counting arguments (and cards), but more so for comparing results to chance
Test statistics and null hypotheses
Test statistic: any numerical summary of input data with known sampling distribution for null (random) data
Example: flip a coin four times; data are four H/T binary categorical values
One test statistic might be the % of heads (H) results (or % of tails (T) results)
Example: repeat a measurement of gel band intensity in three lanes each for two proteins
How different are the two proteins' expression levels?
One test statistic might be the difference between the two means (each taken over three intensities)
Another might be the difference in medians
Another might be the difference in minima; or the difference in maxima
Simply a number that reduces (summarizes) the entire dataset to a single value
Null distribution: expected distribution of test statistic under assumption of no effect/bias/etc.
A specific definition of "no effect" is the null hypothesis
And when used, the alternate hypothesis is just everything else (e.g. "some effect")
Example: a fair coin summarized as % Hs is expected to produce 50% H
"Null hypothesis" is that P(H) = P(T) = 0.5, i.e. the coin is fair
But any given experiment has some noise; the null distribution is not that %Hs equals exactly 0.5
Instead if the coin is fair, the null distribution has some "shape" around 0.5
The shape of that distribution depends on the experiment being performed (e.g. binomial for coin flip)
Example: the difference in means between proteins of equal expression is expected to be zero
But how surprised are we if it's not identically zero?
The "shape" of the "typical" variation around zero is dictated by the amount of noise in our experiment
If we can measure protein expression really well, we're surprised if the difference is >>0
But if our gels are noisy, the null distribution might be wide
So we can't differentiate expression changes due to biology from those due to "chance" or noise
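A minimal Python sketch (numpy assumed) of what this null distribution looks like for the %-heads statistic of a fair coin:
```python
import numpy as np

rng = np.random.default_rng(0)

n_flips, n_experiments = 20, 10_000
# Simulate many experiments under the null hypothesis P(H) = 0.5
flips = rng.random((n_experiments, n_flips)) < 0.5
pct_heads = flips.mean(axis=1)                 # test statistic for each experiment

# The null distribution has a "shape" around 0.5 (binomial/n), not a point mass at 0.5
print(pct_heads.mean(), pct_heads.std())       # ~0.5, ~sqrt(0.25/20) ≈ 0.11
# How surprising is observing 80% heads in 20 flips?
print((pct_heads >= 0.8).mean())               # empirical tail probability, ~0.006
```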
Parametric versus nonparametric null distributions
It is absolutely critical to distinguish between two types of null distributions:
Those for which we can analyze the expected probability distribution mathematically: parametric
Those for which we can compute the expected probability distribution numerically: nonparametric
Parametric distributions mean the test statistic under the null hypothesis has some analytically described shape
Implies some more or less specific assumptions about the behavior of the experiment generating your data
Example: a gene has known baseline expression μ, and you measure it once under a new condition
A good test statistic is your measurement x-μ; how surprised are you if this is ≠0?
If experimental noise is normally distributed with standard deviation σ, x-μ will be normally distributed
Referred to as a z-statistic
Example: a gene has known baseline expression μ, and you measure it three times
How surprised are you if |μ̂−μ| > 0, where μ̂ is the mean of the three measurements?
What if you don't know your experimental noise beforehand, but do know it's normally distributed?
A useful test statistic in this case is the t-statistic t = |μ̂−μ|/(σ̂/√n)
This uses the sample's standard deviation, but is less certain about it
Has an analytically defined t-distribution, thus leading to the popularity of the t-test
But parametric tests often make very strong assumptions about your experiment and data!
And fortunately, using computers, you rarely need to use them any more
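For reference, a minimal sketch of the parametric route (scipy assumed; the baseline and measurements are made-up numbers):
```python
import numpy as np
from scipy import stats

mu_baseline = 10.0                       # known baseline expression (hypothetical value)
x = np.array([11.2, 10.8, 11.5])         # three measurements under the new condition (made up)

# t = |mean(x) - mu| / (sd(x) / sqrt(n)); the sd is estimated from the sample itself
n = len(x)
t = (x.mean() - mu_baseline) / (x.std(ddof=1) / np.sqrt(n))
p_two_sided = 2 * stats.t.sf(abs(t), df=n - 1)
print(t, p_two_sided)

# Same thing via the built-in one-sample t-test
print(stats.ttest_1samp(x, popmean=mu_baseline))
```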
Nonparametric distributions mean the shape of the null distribution is calculated either by:
Transforming the test statistic to rank values only (and thus ignoring its shape) or
Simulating it directly from truly randomized data using a computer
Referred to as the bootstrap or permutation testing depending on precisely how it's done
Take your data, shuffle it a bunch of times, see how often you get a test statistic as extreme as real data
Incredibly useful: inherently bakes structure of data into significance calculations
e.g. population structure for GWAS, coexpression structure for microarrays, etc.
Example: comparing your 2x3 protein measurements
Your data start out ordered into two groups: [A B C, X Y Z]
You can then use the difference of means [A B C] and [X Y Z] as a test statistic
Shuffle the order many times so the groups are random
How many times is the difference between the values in positions [1 2 3] and [4 5 6] of the shuffled data as big as the real difference? (See the sketch at the end of this section)
If we assume normality etc., we can calculate this using formulas for the difference in means
But what about the difference in medians? Or minima/maxima?
Nonparametric tests can be extremely robust to the quirks encountered in real data
They typically involve few or no assumptions about the shape or properties of your experiment
They can also provide a good summary of how well your results do fit a specific assumption
e.g. how close is a permuted distribution to normal?
Q-Q plots do this for the general case (scatter plot quantiles of any two distributions/vectors)
The costs are:
Decreased sensitivity (particularly for rank-based tests)
Increased computational costs (all that shuffling/permuting/randomization takes time!)
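A minimal sketch of the permutation test just described for the 2x3 protein example (numpy assumed; the intensities are made up):
```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups of three gel-band intensities (hypothetical numbers)
data = np.array([4.1, 3.8, 4.5,    # protein 1: positions [1 2 3]
                 2.9, 3.1, 2.7])   # protein 2: positions [4 5 6]

def diff_of_means(values):
    return values[:3].mean() - values[3:].mean()

observed = diff_of_means(data)

# Shuffle the values many times and recompute the test statistic on the random groupings
n_perm = 10_000
null = np.array([diff_of_means(rng.permutation(data)) for _ in range(n_perm)])

# Two-sided permutation p-value: how often is a random difference as extreme as the real one?
# (With only six values there are just C(6,3) = 20 distinct splits, so this p-value
#  can't go below about 0.1 two-sided.)
p = (np.abs(null) >= abs(observed)).mean()
print(observed, p)
```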
20m
p-values
Given some real data with a test statistic, and a null hypothesis with some resulting null distribution...
A p-value is probability of observing a "comparable" result if the null hypothesis is true
Formally, for test statistic T with some value t for your data, P(T≥t|H0)
Example: flip a coin a bunch of times
Your test statistic is %Hs, and your null hypothesis is the coin's fair so P(H)=0.5
What's the probability of observing ≥90% heads?
Example: given a plate full of yeast colonies, count how many of them are petites (non-respiratory)
Your test statistic is %petites, and your null hypothesis is the wild-type rate P(petite)=0.15
What's the probability of observing ≥50% petite colonies on a plate?
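Both examples are one-sided binomial tail probabilities; a minimal sketch with scipy (the counts below are hypothetical):
```python
from scipy import stats

# Coin: 20 flips, null P(H) = 0.5; probability of observing >= 90% heads (>= 18 of 20)
p_coin = stats.binom.sf(17, 20, 0.5)       # sf(k, n, p) = P(X > k), so use 17 for ">= 18"
print(p_coin)                              # ~2e-4

# Petites: 40 colonies, null P(petite) = 0.15; probability of >= 50% petites (>= 20 of 40)
p_petite = stats.binom.sf(19, 40, 0.15)
print(p_petite)                            # very small
```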
Note that we've been stating these p-values in terms of extreme values, e.g. "90% or more"
Sometimes you care only about large values, e.g. ≥90%
Sometimes you care about any extreme values, e.g. ≥90% or ≤10%
Often the case for deviations from zero/mean, e.g. |μ̂−μ|/(σ̂/√n) ≥ 3
This is the difference between one-sided and two-sided hypothesis tests/p-values
H0:0, HA:>0
H0:=0, HA:0
We won't dwell on the theoretical implications of this, but it has two important practical effects:
Calculate without an absolute value when you care about direction, with when you don't
Double your p-value when you don't care about direction, since you're testing twice the area
Often only one of these (one- or two-sided) will make sense for a particular situation
In cases where you can choose, it's almost always more correct to make the stricter choice
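A small numerical illustration of that factor of two, using a symmetric (standard normal) null for the test statistic (scipy assumed; the observed value is made up):
```python
from scipy import stats

t_obs = 2.1                                    # hypothetical observed test statistic
p_one_sided = stats.norm.sf(t_obs)             # P(T >= t): you care about direction
p_two_sided = 2 * stats.norm.sf(abs(t_obs))    # P(|T| >= |t|): you don't
print(p_one_sided, p_two_sided)                # ~0.018 vs. ~0.036
```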
Repeated tests and high-dimensional data
Suppose you run not just one six-lane gel for one protein, but six microarrays for 20,000 transcripts?
If you generate p-values for all 20,000 tests of difference in means, what should they look like?
A p-value is "the probability of a result occurring by chance"
Thus every value between 0 and 1 is, if your assumptions are correct, equally likely!
You expect 10% of them to fall in the range 0-0.1, or 0.4-0.5, or 0.9-1.0
Thus well-behaved p-values should have a uniform distribution between 0 and 1 if the null hypothesis is true
5% of them will fall below 0.05 even if there's no true effect
It's common to visualize p-values for multiple tests to determine how biased they are
How far from the uniform distribution expected under the null hypothesis
This can be done using histograms of p-values
Or you can compare rank-ordered p-values to a straight diagonal line (a form of Q-Q plot)
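A minimal sketch of that check (numpy/scipy assumed): simulate 20,000 features with no true effect, t-test each one, and inspect the p-value distribution:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 20,000 "genes", 3 vs. 3 samples, no true difference anywhere
n_genes = 20_000
group1 = rng.normal(0, 1, size=(n_genes, 3))
group2 = rng.normal(0, 1, size=(n_genes, 3))
pvals = stats.ttest_ind(group1, group2, axis=1).pvalue

# Well-behaved null p-values are ~uniform on [0, 1]
hist, _ = np.histogram(pvals, bins=10, range=(0, 1))
print(hist / n_genes)                    # each bin ~0.10
print((pvals < 0.05).mean())             # ~0.05: ~1,000 genes "significant" by chance alone

# Q-Q-style check: sorted p-values against the diagonal i/n
expected = np.arange(1, n_genes + 1) / n_genes
print(np.abs(np.sort(pvals) - expected).max())   # small if uniform
```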
If you t-test 20,000 genes, you expect p<0.05 for 1,000 of them even if there's no biological effect
This is one aspect of what makes high-dimensional data so difficult to deal with
Occurs when you take many measurements or features (p) under only a few conditions or samples (n)
Thus sometimes referred to as the p>>n problem
What can you do?
One option is to use nonparametric permutation testing:
Parametric null hypothesis is, "Each individual gene's difference is normally distributed with mean zero"
Thus you're testing, "How surprised am I by the magnitude of each individual gene's difference?"
Can instead determine the nonparametric null hypothesis that, "All genes' differences are zero"
And thus the null distribution, "Differences of all genes after randomizing the data"
And test, "How often do I observe differences of this magnitude in my entire randomized dataset?"
This is often advisable but not always possible (e.g. when the computation's expensive)
You can instead adjust your p-value in various analytical ways to account for multiple hypothesis testing
Bonferroni correction: very strict correction that multiplies p-value by the number of tests
Controls the Family-Wise Error Rate (FWER), the probability that any feature "passes" by chance
Thus it's sensitive to the number of "input" features: how many measurements you're making
Good, quick-and-easy, very conservative way to account for high-dimensional data
But it lowers sensitivity: it's easy to "miss" good results because they're "not significant enough"
An alternative is to control the False Discovery Rate (FDR), the % of passing results expected to be "wrong" by chance
FDR q-value = (total # tests) * (p-value) / (rank of p-value)
Makes control depend on number of output (passing) features, not total number of input features
Example: test your 20,000 genes
With no control, you expect 1,000 to achieve p<0.05 even in the absence of a true effect
5% of what you test will be "wrong" by chance
With Bonferroni FWER control, you expect 0 to achieve p<0.05 even in the absence of a true effect
0% of what you test will be "wrong" by chance
With FDR control, you expect 5% of tests that achieve q<0.05 to do so even without a true effect
5% of your results will be "wrong" by chance (doesn't depend on how many tests you run!)
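A minimal sketch of both corrections implemented directly from the definitions above (numpy assumed; in practice a library routine would do the same thing):
```python
import numpy as np

def bonferroni(pvals):
    """FWER control: multiply each p-value by the number of tests (capped at 1)."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * len(p), 1.0)

def bh_fdr(pvals):
    """Benjamini-Hochberg q-values: q = p * (total # tests) / (rank of p-value),
    then enforce monotonicity from the largest p-value down."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]    # running min from the right
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

pvals = np.array([0.0002, 0.003, 0.01, 0.04, 0.2, 0.7])     # made-up p-values
print(bonferroni(pvals))
print(bh_fdr(pvals))
```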
20m
Performance evaluation
We can (and should!) formalize these concepts of tests being "right" or "wrong"
How do you assess the accuracy of a hypothesis test with respect to a ground truth?
Gold standard: list of correct outcomes for a test, also standard or answer set or reference
Often categorical binary outcomes (0/1), also sometimes true numerical values
In the former (very common) case, answers are referred to as:
Positives: outcomes expected to be "true" or "1" in the gold standard, drawn from the Ha distribution
Negatives: outcomes expected to be "false" or "0" in the standard, drawn from the H0 distribution
Predictions or results are the outcomes of a test or inference process
Often whether or not a hypothesis test falls above or below a critical value
Example: is a p-value <0.05
For any test with a reference outcome in the gold standard, four results are possible:
The feature was drawn from H0 and your test accepts H0: true negative
The feature was drawn from H0 and your test rejects H0: false positive (also type I error)
The feature was drawn from Ha and your test accepts Ha: true positive
The feature was drawn from Ha and your test rejects Ha: false negative (also type II error)
                 H0 True                      H0 False
H0 Accepted      True Negative                False Negative (Type II)
H0 Rejected      False Positive (Type I)      True Positive
The rates or probabilities with which incorrect outcomes occur indicate the performance or accuracy of a test
False positive rate = FPR = fraction of successful tests for uninteresting features = P(reject H0|H0)
False negative rate = FNR = fraction of failed tests for interesting features = P(accept H0|Ha)
What type of "performance" you care about depends a lot on how the test's to be used
Are you running expensive validations in the lab? False positives can hurt!
Are you trying to detect dangerous contaminants? False negatives can hurt!
There are thus a slew of complementary performance measures based on different aspects of these quantities
Power: probability of detecting a true effect, P(reject H0|Ha)
Precision: probability a detected effect is true = TP/(TP+FP) = P(Ha|reject H0)
Recall: probability of detecting true effects = TP/P = TP/(TP+FN) = P(reject H0|Ha)
Also called true positive rate (TPR) or sensitivity
Specificity: probability a rejected effect is false = TN/(TN+FP) = P(accept H0|H0)
Also called true negative rate (TNR) and equivalent to 1 - false positive rate (FPR)
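A minimal sketch computing these measures from hypothetical confusion-matrix counts (plain Python):
```python
# Hypothetical counts from comparing test calls to a gold standard
TP, FP, TN, FN = 80, 20, 880, 20

precision   = TP / (TP + FP)        # P(Ha | reject H0)
recall      = TP / (TP + FN)        # sensitivity / TPR / power, P(reject H0 | Ha)
specificity = TN / (TN + FP)        # TNR, P(accept H0 | H0)
fpr         = FP / (FP + TN)        # 1 - specificity, P(reject H0 | H0)
fnr         = FN / (FN + TP)        # 1 - recall, P(accept H0 | Ha)

print(precision, recall, specificity, fpr, fnr)   # 0.8, 0.8, ~0.978, ~0.022, 0.2
```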
Most hypothesis tests provide at least one tunable parameter that will trade off between aspects of performance
High-precision tests are typically low-recall (few false positives, more false negatives)
Highly sensitive tests are typically less specific (few false negatives, more false positives)
These tradeoffs are commonly visualized to provide a sense of prediction accuracy over a range of thresholds
Precision/recall plots: recall (x) vs. precision (y), upper right corner is "good"
Receiver Operating Characteristic (ROC) or sensitivity/specificity plots: FPR (1-specificity, x) vs. TPR (sensitivity, y), upper left corner is "good"
Both feature recall, but since precision ≠ specificity, can provide complementary views of the same data
Entire curves can be further reduced to summary statistics of overall prediction performance
Area Under the Curve (AUC): area under a ROC curve
Random = 0.5, Perfect = 1.0, Perfectly wrong = 0.0
Area under precision/recall curve (AUPRC): exactly what it sounds like
Not commonly used, since there's no baseline; "random" isn't fixed to a particular value
Computational predictions are very often evaluated using AUC, as are diagnostic tests and risk scores
Reading
Hypothesis testing:
Pagano and Gauvreau, 10.1-5
Performance evaluation:
Pagano and Gauvreau, 6.4
The following are notes on how these topics are typically presented for biostatistics applications
We'll be covering them for computational applications over the next few weeks
Please review them now; you don't need to be familiar with how they work, just the language and concepts
Common one/two-sample hypothesis tests
Parametric t-tests assuming normality, independence, and homoscedasticity
T-test: assesses either one or two normally distributed homoscedastic independent sets of values
Homoscedastic: equal variances, i.e. no relationship between measurement value and error
Used to compare samples assumed to be from underlying normal distributions
T-distribution is heavier tailed than normal distribution to take population uncertainty into account
Contrast with the normal distribution of a z-test, where one mean/stdev is known with certainty
One-sample t-test: does the mean of a normally distributed sample differ from zero?
t = /(/n)
Two-sample unpaired t-test: do the means of two normal samples with equal variance differ?
t = (1-2)/(S(1/n1+1/n2)), S = (((n1-1)12+(n2-1)22)/(n1+n2-2)) estimator of common/pooled stdev
Two-sample paired t-test: do the means of two matched normal samples of equal variance differ?
Tests whether a distribution shifts, since the paired data points are non-independent
Can be equivalently stated: is the mean of the pairs' differences different from zero?
Welch's t-test: do the means of two normal samples with unequal variances differ?
Identical save that S = √(σ̂1²/n1+σ̂2²/n2) and the degrees of freedom for the t-distribution are complicated
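A minimal sketch of the pooled two-sample formula, checked against scipy's built-in t-tests (the values are made up):
```python
import numpy as np
from scipy import stats

x1 = np.array([4.1, 3.8, 4.5, 4.0])
x2 = np.array([2.9, 3.1, 2.7, 3.3, 3.0])
n1, n2 = len(x1), len(x2)

# Pooled standard deviation and t statistic from the formulas above
S = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2))
t = (x1.mean() - x2.mean()) / (S * np.sqrt(1 / n1 + 1 / n2))
print(t)

print(stats.ttest_ind(x1, x2, equal_var=True))    # same t; Welch's test via equal_var=False
```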
Nonparametric rank-based tests
Mann-Whitney U/ Wilcoxon test: are two independent sets of values drawn from identical distributions?
Test statistic is U1 = R1 - n1(n1+1)/2, where R1 = the sum of the ranks of sample 1's values within the pooled data
Two-sample unpaired: tests whether two distributions differ
Comparable to the two-sample t-tests
Two-sample paired: tests whether the distribution of matched samples shifts
Equivalent to paired t-test
Performed by rank-transforming data, comparable to relationship between Pearson and Spearman correlation
Asks whether the order of samples' ranks is "surprisingly different"
Example: S1=[0.2, 0.5, 0.9] and S2=[0.1, 0.3, 0.4]
Pooled = [0.1 (S2), 0.2 (S1), 0.3 (S2), 0.4 (S2), 0.5 (S1), 0.9 (S1)], Ranks = [1 (S2), 2 (S1), 3 (S2), 4 (S2), 5 (S1), 6 (S1)]
R1 = 2+5+6 = 13, U1 = 13 - 3*4/2 = 7
The MWW test is equivalent to the AUC of one sample as a "test" compared to another as a "standard"
AUC = U1/(n1n2)
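The worked example above is easy to reproduce (numpy/scipy assumed):
```python
import numpy as np
from scipy import stats

s1 = np.array([0.2, 0.5, 0.9])
s2 = np.array([0.1, 0.3, 0.4])

# Rank the pooled data, then sum the ranks belonging to sample 1
pooled = np.concatenate([s1, s2])
ranks = stats.rankdata(pooled)           # [2, 5, 6, 1, 3, 4]
R1 = ranks[:len(s1)].sum()               # 13
U1 = R1 - len(s1) * (len(s1) + 1) / 2    # 7
print(R1, U1, U1 / (len(s1) * len(s2)))  # 13, 7, AUC = 7/9

# Built-in test; in recent scipy versions the reported statistic is U1 for the first sample
print(stats.mannwhitneyu(s1, s2, alternative='two-sided'))
```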
Kolmogorov-Smirnov test: one or two independent sets of values
One-sample: distribution differs from chosen reference
Two-sample unpaired: two distributions differ
Performed by transforming data to cumulative distribution
Asks whether CDFs are surprisingly different
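A correspondingly minimal sketch (scipy assumed; the samples are simulated):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=100)
y = rng.normal(0.5, 1, size=100)

# Two-sample KS: maximum distance between the two empirical CDFs
print(stats.ks_2samp(x, y))
# One-sample KS: compare one sample against a chosen reference CDF
print(stats.kstest(x, 'norm'))
```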
Inference: linear regression
Correlation etc. make non-causal statements about the relationship between two samples
What if you want to predict an output as a function of one or more inputs?
This inherently makes a distinction between a response and one or more explanatory variables
y = a + bx + e
Assumes that e is normally distributed with mean zero and fixed variance, so y is normal around a+bx
Also again assumes independence of measurements x/y
Linear regression fits a and b parameters with respect to a constraint, typically least squares
That is, minimize Σ(y−(a+bx))², the sum of squared residuals
This is an excellent, simple way to make predictions about an output variable
Goodness of fit can be given a p-value (taking normality/independence assumptions into account)
Efficient, can accommodate many x's, can transform x nonlinearly as desired
Provides several good views of data to validate goodness of fit, homoscedasticity (residuals plot)
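A minimal sketch of a least-squares fit and residual check (numpy assumed; the data are simulated from the model):
```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y = a + b*x + e with normally distributed noise e
x = np.linspace(0, 10, 50)
y = 2.0 + 0.7 * x + rng.normal(0, 1, size=x.size)

# Least squares: minimize the sum over i of (y_i - (a + b*x_i))^2
X = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
(a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
print(a_hat, b_hat)                                # close to 2.0 and 0.7

# Residuals should look like structureless noise if the model and assumptions fit
residuals = y - (a_hat + b_hat * x)
print(residuals.mean(), residuals.std())
```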
Multiple regression: bakes in additional coefficients, y = b0 + b1x1 + b2x2 + ...
Significance of each b coefficient can be determined (H0: b = 0)
Logistic regression: transforms output y to accommodate binary (0/1) values
Treat y like a probability ranging between 0 and 1
ln(p/(1-p)) = a + bx
Equivalent to y = exp(a + bx)/(1 + exp(a + bx)), which is the logistic function
Constrained (or penalized) regression: least squares isn't the only constraint!
Also known as regularization: constraints on model parameters during fitting
Typically introduces a bias/variance tradeoff: more penalization lowers variance at the cost of some bias
Penalized regression (ridge): limits Σb², making the total (squared) size of all coefficients "small"
Sparse regression (lasso): limits Σ|b|, making the number of nonzero coefficients "small"
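A minimal sketch of ridge regression via its closed form (numpy assumed; data simulated), showing how the penalty shrinks the coefficients relative to ordinary least squares:
```python
import numpy as np

rng = np.random.default_rng(0)

# 10 correlated predictors, 30 samples: a setting where plain least squares gets wobbly
n, p = 30, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)       # make two columns nearly collinear
y = X[:, 0] + rng.normal(0, 1, size=n)

lam = 5.0                                          # penalty strength (tuning parameter)
I = np.eye(p)
b_ols   = np.linalg.solve(X.T @ X,           X.T @ y)
b_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

print(np.sum(b_ols ** 2), np.sum(b_ridge ** 2))    # ridge keeps the total of b² smaller
```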
Important part is that you have lots of options for teaching a computer to predict outputs from inputs
Overfitting: easy to create a model of noise in your data (makes constraints important)
Often discuss model accuracy in terms of training and test data/accuracy/error
Typical cross-validation procedure:
Split your data into a ~2/3 training set and a ~1/3 test set
Learn model parameters by optimizing in the training set
Evaluate model generalization/accuracy by evaluating on the test set
In the best case, repeat this many times and average
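A minimal sketch of that cross-validation procedure for a simple linear fit (numpy assumed; data simulated):
```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(0, 10, 90)
y = 2.0 + 0.7 * x + rng.normal(0, 1, size=x.size)

def cv_error(x, y, n_repeats=100):
    errors = []
    for _ in range(n_repeats):
        idx = rng.permutation(x.size)
        train, test = idx[:2 * x.size // 3], idx[2 * x.size // 3:]   # ~2/3 train, ~1/3 test
        # Fit on the training set only
        b, a = np.polyfit(x[train], y[train], deg=1)   # returns [slope, intercept]
        # Evaluate generalization on the held-out test set
        errors.append(np.mean((y[test] - (a + b * x[test])) ** 2))
    return np.mean(errors)

print(cv_error(x, y))          # average held-out mean squared error, ~1 here
```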
ANOVA: >2 normally distributed homoscedastic independent sets of values
Extension of t-test to more than two groups; compares one continuous with one categorical
Could just repeat a t-test many times between pairs of groups, but that quickly becomes a multiple-testing problem
Instead tests whether between-groups variance is surprisingly different from within-groups
Within-groups calculated as weighted (by group size) average of sample variances
Between-groups calculated from the variance of the group means around the overall (grand) mean
Only tells you that the groups do (or don't) differ, not which or how!
Post-hoc pairwise tests (multiple hypothesis corrected)
Kruskal-Wallis: nonparametric alternative to ANOVA
Replaces data with ranks; exactly the same relationship as Spearman/Pearson or MWW/t-test
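A minimal sketch of both tests (scipy assumed; the three groups are made up):
```python
import numpy as np
from scipy import stats

g1 = np.array([4.1, 3.8, 4.5, 4.0])
g2 = np.array([2.9, 3.1, 2.7, 3.3])
g3 = np.array([3.6, 3.9, 3.5, 3.8])

# One-way ANOVA: is between-group variation surprisingly large relative to within-group?
print(stats.f_oneway(g1, g2, g3))

# Kruskal-Wallis: the same question on rank-transformed data (nonparametric)
print(stats.kruskal(g1, g2, g3))
```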
Chi-square: assesses whether histogram (multinomial) agrees with theoretical distribution (e.g. uniform)
Closely related to Fisher's Exact test, which compares two multinomials (histograms)
I rarely find either of these things useful because genomic data rarely pairs categoricals
Reading
T-tests:
Pagano and Gauvreau, 11.1-2
Wilcoxon:
Pagano and Gauvreau, 13.2-4
ANOVA:
Pagano and Gauvreau, 12.1-2
Linear regression:
Pagano and Gauvreau, 18.1-3
Logistic regression:
Pagano and Gauvreau, 20.1-2