W01 Notes: Inference and hypothesis testing

Wednesday #1
10m
Announcements and questions
Email Carla ([email protected]) to register for the course site with a blank email, subject "BIO508"
All accounts emailed so far are created, username = password = email user ID (lower case)
Note that Problem Set #1 is posted and due by end-of-day next Monday
First lab tomorrow, Thursday 4:30-6:30 in Kresge LL6
Office hour location update from Carla
15m
Probability: basic definitions
Statistics describe data; probability provides an underlying mathematical theory for manipulating them
Experiment: anything that produces a non-deterministic (random or stochastic) result
Coin flip, die roll, item count, concentration measurement, distance measurement...
Sample space: the set of all possible outcomes for a particular experiment, finite or infinite, discrete or continuous
{H, T}, {1, 2, 3, 4, 5, 6}, {0, 1, 2, 3, ...}, {0, 0.1, 0.001, 0.02, 3.14159, ...}
Event: any subset of a sample space
{}, {H}, {1, 3, 5}, {0, 1, 2}, [0, 3)
Probability: for an event E, the limit of n(E)/n as n grows large (at least if you're a frequentist)
Thus many symbolic proofs of probability relationships are based on integrals or limit theory
Note that all of these are defined in terms of sets and set notation:
{1, 2, 3} is a set of unordered unique elements, {} is the empty set
AB is the union of two sets (set of all elements in either set A or B)
AB is the intersection of two sets (set of all elements in both sets A and B)
~A is the complement of a set (all elements from some universe not in set A)
(Kolmogorov) axioms: one definition of "probability" that matches reality
For any event E, P(E)≥0
"Probability" is a non-negative real number
For any sample space S, P(S)=1
The "probability" of all outcomes for an experiment must total 1
For disjoint events E∩F={}, P(E∪F)=P(E)+P(F)
For two mutually exclusive events that share no outcomes...
The "probability" of either happening equals the summed "probability" of each happening independently
These represent one set of three assumptions from which intuitive rules about probability can be derived
0≤P(E)≤1 for any event E
P({})=0, i.e. every experiment must have some outcome
P(E)≤P(F) if E⊆F, i.e. an event containing more outcomes must be at least as probable as one it includes (see the sketch below)
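A minimal Python sketch of these set-based definitions, assuming a fair six-sided die so that every outcome is equally likely and P(E) = |E|/|S|; the die and the specific events are illustrative assumptions, not examples from the lecture:

```python
# Sketch: events as sets for a fair six-sided die (equally likely outcomes),
# so P(E) = |E| / |S|. The example and names are illustrative.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}        # sample space
E = {1, 3, 5}                 # event: odd roll
F = {2, 4, 6}                 # event: even roll (disjoint from E)

def P(event, space=S):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & space), len(space))

assert P(S) == 1                        # axiom: P(S) = 1
assert P(E) >= 0 and P(set()) == 0      # axiom: non-negativity; derived: P({}) = 0
assert E & F == set()                   # E and F are disjoint...
assert P(E | F) == P(E) + P(F)          # ...so P(E∪F) = P(E) + P(F)
assert P(E) <= P(E | {2})               # E ⊆ E∪{2} implies P(E) ≤ P(E∪{2})
print(P(E), P(E | F), P(S - E))         # 1/2 1 1/2  (S - E is the complement ~E)
```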
20m
Conditional probabilities and Bayes' theorem
Note that we'll be covering this only briefly in class
Feel free to refer to 1) the notes here, 2) the full notes online, 3) suggested reading below, or 4) the tutorials at:
http://brilliant.org/assessment/techniques-trainer/conditional-probability-and-bayes-theorem/
http://www.math.umass.edu/~lr7q/ps_files/teaching/math456/Week2.pdf
Conditional probability
The probability of an event given that another event has already occurred
The probability of event F given that the sample space S has been reduced to E⊆S
Notated P(F|E) and defined as P(E∩F)/P(E)
Bayes' theorem
P(F|E) = P(E|F)P(F)/P(E)
True since P(F|E) = P(E∩F)/P(E) and P(E∩F) = P(F∩E) = P(E|F)P(F)
Provides a means of calculating a conditional probability based on the inverse of its condition
Typically described in terms of prior, posterior, and support
P(F) is the prior probability of F occurring at all in the first place, "before" anything else
P(F|E) is the posterior probability of F occurring "after" E has occurred
P(E|F)/P(E) is the support E provides for F
Some examples from poker
Pick a card, any card!
Probability of drawing a jack given that you've drawn a face card?
P(jack|face) = P(jack∩face)/P(face) = P(jack)/P(face) = (4/52)/(12/52) = 1/3
P(jack|face) = P(face|jack)P(jack)/P(face) = 1*(4/52)/(12/52) = 1/3
Probability of drawing a face card given that you've drawn a jack?
P(face|jack) = P(jack∩face)/P(jack) = P(jack)/P(jack) = 1
P(face|jack) = P(jack|face)P(face)/P(jack) = (1/3)(12/52)/(4/52) = 1
Pick three cards: probability of drawing (exactly) a pair of aces given that you've drawn (exactly) a pair?
P(2A|pair) = P(2A∩pair)/P(pair) = P(2A)/P(pair) = ((4/52)(3/51)(48/50)/3!)/(1*(3/51)(48/50)/3!) = 1/13
P(2A|pair) = P(pair|2A)P(2A)/P(pair) = 1*((4/52)(3/51)(48/50)/3!)/(1*(3/51)(48/50)/3!) = 1/13
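A small Python sketch that checks the single-card calculations above by enumerating a standard 52-card deck; the deck representation and variable names are illustrative:

```python
# Sketch: verify P(jack|face) and P(face|jack) by enumeration of a 52-card deck.
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))

jack = {card for card in deck if card[0] == "J"}
face = {card for card in deck if card[0] in {"J", "Q", "K"}}

def P(event):
    return Fraction(len(event), len(deck))

# Conditional probability: P(F|E) = P(E∩F)/P(E)
p_jack_given_face = P(jack & face) / P(face)    # (4/52)/(12/52) = 1/3
p_face_given_jack = P(jack & face) / P(jack)    # (4/52)/(4/52)  = 1
# Bayes' theorem: P(F|E) = P(E|F)P(F)/P(E)
assert p_jack_given_face == p_face_given_jack * P(jack) / P(face)
print(p_jack_given_face, p_face_given_jack)     # 1/3 1
```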
20m
Hypothesis testing
These are useful for proofs involving counting (and gambling), but more so for comparing results to chance
Test statistics and null hypotheses
Test statistic: any numerical summary of input data with known sampling distribution for null (random) data
Example: flip a coin four times; data are four H/T binary categorical values
One test statistic might be the % of heads (H) results (or % of tails (T) results)
Example: repeat a measurement of gel band intensity in three lanes each for two proteins
How different are the two proteins' expression levels?
One test statistic might be the difference between the means of the two sets of three intensities
Another might be the difference in medians
Another might be the difference in minima; or the difference in maxima
Simply a number that reduces (summarizes) the entire dataset to a single value
Null distribution: expected distribution of test statistic under assumption of no effect/bias/etc.
A specific definition of "no effect" is the null hypothesis
And when used, the alternate hypothesis is just everything else (e.g. "some effect")
Example: a fair coin summarized as % Hs is expected to produce 50% H
"Null hypothesis" is that P(H) = P(T) = 0.5, i.e. the coin is fair
But any given experiment has some noise; the null distribution is not that %Hs equals exactly 0.5
Instead if the coin is fair, the null distribution has some "shape" around 0.5
The shape of that distribution depends on the experiment being performed (e.g. binomial for coin flip)
Example: the difference in means between proteins of equal expression is expected to be zero
But how surprised are we if it's not identically zero?
The "shape" of the "typical" variation around zero is dictated by the amount of noise in our experiment
If we can measure protein expression really well, we're surprised if the difference is >>0
But if our gels are noisy, the null distribution might be wide
So we can't differentiate expression changes due to biology from those due to "chance" or noise
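A hedged Python simulation of this idea: two proteins with equal true expression are "measured" three times each under Gaussian noise, and repeating the experiment many times traces out the null distribution of the difference in means (the expression level and noise SDs are made-up values):

```python
# Sketch: null distribution of the difference in means for two proteins with
# equal true expression, three lanes each. The sigma values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_expression = 10.0

for sigma in (0.1, 2.0):                  # precise gels vs. noisy gels
    diffs = []
    for _ in range(10_000):               # simulate many null experiments
        a = true_expression + rng.normal(0, sigma, size=3)
        b = true_expression + rng.normal(0, sigma, size=3)
        diffs.append(a.mean() - b.mean())
    diffs = np.array(diffs)
    # Centered on zero either way, but the width scales with the noise level:
    print(f"sigma={sigma}: SD of null difference-of-means ~ {diffs.std():.2f}")
```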
Parametric versus nonparametric null distributions
It is absolutely critical to distinguish between two types of null distributions:
Those for which we can analyze the expected probability distribution mathematically: parametric
Those for which we can compute the expected probability distribution numerically: nonparametric
Parametric distributions mean the test statistic under the null hypothesis has some analytically described shape
Implies some more or less specific assumptions about the behavior of the experiment generating your data
Example: a gene has known baseline expression μ, and you measure it once under a new condition
A good test statistic is your measurement x-μ; how surprised are you if this is ≠0?
If experimental noise is normally distributed with standard deviation σ, x-μ will be normally distributed
Referred to as a z-statistic
Example: a gene has known baseline expression μ, and you measure it three times
How surprised are you if |μ̂ − μ| ≠ 0, where μ̂ is the mean of the three measurements?
What if you don't know your experimental noise beforehand, but do know it's normally distributed?
A useful test statistic in this case is the t-statistic t = |μ̂ − μ|/(σ̂/√n)
This uses the sample's standard deviation, but is less certain about it
Has an analytically defined t-distribution, thus leading to the popularity of the t-test
But parametric tests often make very strong assumptions about your experiment and data!
And fortunately, using computers, you rarely need to use them any more
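To make the parametric examples above concrete, here is a hedged Python sketch (using scipy.stats) computing a z-statistic with a known noise SD and a t-statistic with an estimated one, for three made-up measurements against a known baseline μ; all numeric values are illustrative assumptions:

```python
# Sketch: z- and t-statistics against a known baseline mu; numbers are made up.
import numpy as np
from scipy import stats

mu = 10.0                                  # known baseline expression
x = np.array([10.8, 11.2, 10.5])           # three measurements under a new condition

# z-statistic: usable only if the noise SD (sigma) is known beforehand
sigma = 0.5
z = (x.mean() - mu) / (sigma / np.sqrt(len(x)))
p_z = 2 * stats.norm.sf(abs(z))            # two-sided tail of the normal null

# t-statistic: uses the sample SD instead, with a wider (t-distributed) null
t = (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(len(x)))
p_t = 2 * stats.t.sf(abs(t), df=len(x) - 1)

print(f"z={z:.2f} (p={p_z:.3g}), t={t:.2f} (p={p_t:.3g})")
print(stats.ttest_1samp(x, popmean=mu))    # scipy's one-sample t-test: same t and p
```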
Nonparametric distributions mean the shape of the null distribution is calculated either by:
Transforming the test statistic to rank values only (and thus ignoring its shape) or
Simulating it directly from truly randomized data using a computer
Referred to as the bootstrap or permutation testing depending on precisely how it's done
Take your data, shuffle it a bunch of times, see how often you get a test statistic as extreme as real data
Incredibly useful: inherently bakes structure of data into significance calculations
e.g. population structure for GWAS, coexpression structure for microarrays, etc.
Example: comparing your 2x3 protein measurements
Your data start out ordered into two groups: [A B C, X Y Z]
You can then use the difference of means [A B C] and [X Y Z] as a test statistic
Shuffle the order many times so the groups are random
How often is the difference between positions [1 2 3] and [4 5 6] of a random ordering as big as the real difference? (see the permutation sketch below)
If we assume normality etc., we can calculate this using formulas for the difference in means
But what about the difference in medians? Or minima/maxima?
Nonparametric tests can be extremely robust to the quirks encountered in real data
They typically involve few or no assumptions about the shape or properties of your experiment
They can also provide a good summary of how well your results do fit a specific assumption
e.g. how close is a permuted distribution to normal?
Q-Q plots do this for the general case (scatter plot quantiles of any two distributions/vectors)
The costs are:
Decreased sensitivity (particularly for rank-based tests)
Increased computational costs (all that shuffling/permuting/randomization takes time!)
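A compact Python sketch of such a permutation test for the 2x3 gel example, using the difference of means as the test statistic; the six intensity values are made up:

```python
# Sketch: nonparametric permutation test for two groups of three measurements.
import numpy as np

rng = np.random.default_rng(0)
data = np.array([4.1, 3.8, 4.4,     # group 1: lanes A B C (made-up values)
                 5.0, 5.3, 4.9])    # group 2: lanes X Y Z

def diff_of_means(values):
    return values[:3].mean() - values[3:].mean()

observed = diff_of_means(data)

n_perm = 10_000
null = np.array([diff_of_means(rng.permutation(data)) for _ in range(n_perm)])

# Two-sided permutation p-value: how often is a shuffled difference as extreme?
p = (np.abs(null) >= abs(observed)).mean()
print(f"observed difference = {observed:.2f}, permutation p ~ {p:.3f}")
# Swapping diff_of_means for a difference of medians (or minima/maxima) needs
# no new theory: just change the test statistic.
```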
15m
p-values
Given some real data with a test statistic, and a null hypothesis with some resulting null distribution...
A p-value is the probability of observing a "comparable" result if the null hypothesis is true
Formally, for test statistic T with some value t for your data, P(T≥t|H0)
Example: flip a coin a bunch of times
Your test statistic is %Hs, and your null hypothesis is the coin's fair so P(H)=0.5
What's the probability of observing ≥90% heads?
Example: given a plate full of yeast colonies, count how many of them are petites (non-respiratory)
Your test statistic is %petites, and your null hypothesis is that the wild type P(petite)=0.15
What's the probability of observing ≥50% petite colonies on a plate?
Note that we've been stating these p-values in terms of extreme values, e.g. "at least 90% or more"
Sometimes you care only about large values, e.g. ≥90%
Sometimes you care about any extreme values, e.g. ≥90% or ≤10%
Often the case for deviations from zero/mean, e.g. |μ̂ − μ|/(σ̂/√n) ≥ 3
This is the difference between one-sided and two-sided hypothesis tests/p-values
H0:0, HA:>0
H0:=0, HA:0
We won't dwell on the theoretical implications of this, but it has two important practical effects:
Calculate without an absolute value when you care about direction, with when you don't
Double your p-value (equivalently, halve your significance threshold) when you don't care about direction, since you're testing twice the area
Often only one of these (one- or two-sided) will make sense for a particular situation
In cases where you can choose, it's almost always more correct to make the more conservative choice
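A short Python sketch (scipy.stats) of these tail probabilities; the sample sizes (20 flips, 40 colonies) are made-up assumptions since the notes don't fix them:

```python
# Sketch: one- and two-sided binomial p-values for the coin and colony examples.
from scipy import stats

# Coin: null P(H) = 0.5; chance of >= 90% heads in 20 flips (i.e. >= 18 heads)?
n_flips, k_heads = 20, 18
p_one_sided = stats.binom.sf(k_heads - 1, n_flips, 0.5)    # P(X >= 18 | fair coin)
p_two_sided = 2 * p_one_sided                              # roughly, for a symmetric null
print(f"coin: one-sided p = {p_one_sided:.2e}, two-sided p ~ {p_two_sided:.2e}")

# Colonies: null P(petite) = 0.15; chance of >= 50% petites among 40 colonies?
n_col, k_pet = 40, 20
p_petite = stats.binom.sf(k_pet - 1, n_col, 0.15)          # P(X >= 20 | null)
print(f"petites: one-sided p = {p_petite:.2e}")
```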
20m
Performance evaluation
We can (and should!) formalize these concepts of tests being "right" or "wrong"
How do you assess the accuracy of a hypothesis test with respect to a ground truth?
Gold standard: list of correct outcomes for a test, also standard or answer set or reference
Often categorical binary outcomes (0/1), also sometimes true numerical values
In the former (very common) case, answers are referred to as:
Positives: outcomes expected to be "true" or "1" in the gold standard, drawn from the Ha distribution
Negatives: outcomes expected to be "false" or "0" in the standard, drawn from the H0 distribution
Predictions or results are the outcomes of a test or inference process
Often whether or not a hypothesis test falls above or below a critical value
Example: is a p-value <0.05
For any test with a reference outcome in the gold standard, four results are possible:
The feature was drawn from H0 and your test accepts H0: true negative
The feature was drawn from H0 and your test rejects H0: false positive (also type I error)
The feature was drawn from Ha and your test rejects H0: true positive
The feature was drawn from Ha and your test accepts H0: false negative (also type II error)
                H0 True                     H0 False
H0 Accepted     True Negative               False Negative (Type II)
H0 Rejected     False Positive (Type I)     True Positive
The rates or probabilities with which incorrect outcomes occur indicate the performance or accuracy of a test
False positive rate = FPR = fraction of successful tests for uninteresting features = P(reject H0|H0)
False negative rate = FNR = fraction of failed tests for interesting features = P(accept H0|Ha)
What type of "performance" you care about depends a lot on how the test's to be used
Are you running expensive validations in the lab? False positives can hurt!
Are you trying to detect dangerous contaminants? False negatives can hurt!
There are thus a slew of complementary performance measures based on different aspects of these quantities
Power: probability of detecting a true effect = TP/P = TP/(TP+FN) = P(reject H0|Ha)
Also called recall, true positive rate (TPR), or sensitivity
Precision: probability a detected effect is true = TP/(TP+FP) = P(Ha|reject H0)
Specificity: probability a truly null feature is not detected = TN/(TN+FP) = P(accept H0|H0)
Also called true negative rate (TNR) and equivalent to 1 - false positive rate (FPR)
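A tiny Python sketch computing these measures from made-up confusion-matrix counts, mainly to pin down which quantity conditions on what:

```python
# Sketch: performance measures from a confusion matrix (counts are made up).
TP, FP, TN, FN = 40, 10, 85, 15

power = recall = sensitivity = TP / (TP + FN)    # P(reject H0 | Ha)
precision = TP / (TP + FP)                       # P(Ha | reject H0)
specificity = TN / (TN + FP)                     # P(accept H0 | H0)
fpr = FP / (FP + TN)                             # = 1 - specificity
fnr = FN / (FN + TP)                             # = 1 - power

print(f"recall={recall:.2f} precision={precision:.2f} "
      f"specificity={specificity:.2f} FPR={fpr:.2f} FNR={fnr:.2f}")
```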
Most hypothesis tests provide at least one tunable parameter that will trade off between aspects of performance
High-precision tests are typically low-recall (few false positives, more false negatives)
Highly sensitive tests are typically less specific (few false negatives, more false positives)
These tradeoffs are commonly visualized to provide a sense of prediction accuracy over a range of thresholds
Precision/recall plots: recall (x) vs. precision (y), upper right corner is "good"
Receiver Operating Characteristic (ROC) or sensitivity/specificity plots: FPR (x) vs. TPR (y), upper left good
Both feature recall, but since precision ≠ specificity, can provide complementary views of the same data
Entire curves can be further reduced to summary statistics of overall prediction performance
Area Under the Curve (AUC): area under a ROC curve
Random = 0.5, Perfect = 1.0, Perfectly wrong = 0.0
Area under precision/recall curve (AUPRC): exactly what it sounds like
Not as commonly used, since its "random" baseline isn't fixed to a particular value (it depends on the fraction of positives in the standard)
Computational predictions are very often evaluated using AUC, as are diagnostic tests and risk scores
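A hedged Python sketch of an ROC curve and its AUC: the labels and scores are made up, with scores standing in for a test statistic or risk score, and the decision threshold swept one item at a time:

```python
# Sketch: sweep a score threshold to trace an ROC curve, then integrate for AUC.
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([1] * 50 + [0] * 50)               # gold standard: 50 pos, 50 neg
scores = np.concatenate([rng.normal(1.0, 1.0, 50),   # positives tend to score higher
                         rng.normal(0.0, 1.0, 50)])

order = np.argsort(-scores)                          # lower the threshold stepwise
tpr = np.concatenate([[0.0], np.cumsum(labels[order] == 1) / 50])  # sensitivity
fpr = np.concatenate([[0.0], np.cumsum(labels[order] == 0) / 50])  # 1 - specificity

auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal area under curve
print(f"AUC ~ {auc:.2f}   (random ~ 0.5, perfect = 1.0, perfectly wrong = 0.0)")
```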
???m
Repeated tests and high-dimensional data
Suppose you run not just one six-lane gel for one protein, but six microarrays for 20,000 transcripts?
If you generate p-values for all 20,000 tests of difference in means, what should they look like?
A p-value is "the probability of a result occurring by chance"
Thus, if your assumptions are correct and the null hypothesis is true, every value between 0 and 1 is equally likely!
You expect 10% of them to fall in the range 0-0.1, or 0.4-0.5, or 0.9-1.0
Thus well-behaved p-values should have a uniform distribution between 0 and 1 if the null hypothesis is true
5% of them will fall below 0.05 even if there's no true effect
It's common to visualize p-values for multiple tests to determine how biased they are
How far from the uniform distribution expected under the null hypothesis
This can be done using histograms of p-values (roughly flat under the null)
Or by comparing rank-ordered p-values to a straight diagonal line (a form of Q-Q plot)
If you t-test 20,000 genes, you expect p<0.05 for 1,000 of them even if there's no biological effect
This is one aspect of what makes high-dimensional data so difficult to deal with
Occurs when you take many measurements or features (p) under only a few conditions or samples (n)
Thus sometimes referred to as the p>>n problem
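A quick Python simulation of this behavior: 20,000 "genes" measured in 3 vs. 3 samples, all drawn from the same normal distribution so the null hypothesis is true for every gene (the normality and sample sizes are illustrative assumptions):

```python
# Sketch: p-values from many tests with no true effect are uniform on [0, 1].
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
null_data = rng.normal(size=(20_000, 6))       # no difference between the two groups
p = stats.ttest_ind(null_data[:, :3], null_data[:, 3:], axis=1).pvalue

print(f"fraction with p < 0.05: {(p < 0.05).mean():.3f}")    # ~0.05, i.e. ~1,000 genes
counts, _ = np.histogram(p, bins=10, range=(0, 1))
print("p-value histogram (10 bins):", counts)                # roughly flat
```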
What can you do?
One option is to use nonparametric permutation testing:
Parametric null hypothesis is, "Each individual gene's difference is normally distributed with mean zero"
Thus you're testing, "How surprised am I by the magnitude of each individual gene's difference?"
Can instead determine the nonparametric null hypothesis that, "All genes' differences are zero"
And thus the null distribution, "Differences of all genes after randomizing the data"
And test, "How often do I observe differences of this magnitude in my entire randomized dataset?"
This is often advisable but not always possible (e.g. when the computation's expensive)
You can instead adjust your p-value in various analytical ways to account for multiple hypothesis testing
Bonferroni correction: very strict correction that multiplies each p-value by the number of tests
Controls the Family-Wise Error Rate (FWER), the probability that any feature "passes" by chance
Thus it's sensitive to the number of "input" features: how many measurements you're making
Good, quick-and-easy, very conservative way to account for high-dimensional data
But it lowers sensitivity: it's easy to "miss" good results because they're "not significant enough"
An alternative is to control the False Discovery Rate (FDR), the % of passing results expected to be "wrong" by chance
FDR q-value = (total # tests) * (p-value) / (rank of that p-value), as in the Benjamini-Hochberg procedure (see the sketch at the end of this section)
Makes control depend on number of output (passing) features, not total number of input features
Example: test your 20,000 genes
With no control, you expect 1,000 to achieve p<0.05 even in the absence of a true effect
5% of what you test will be "wrong" by chance
With Bonferroni FWER control, you expect 0 to achieve p<0.05 even in the absence of a true effect
0% of what you test will be "wrong" by chance
With FDR control, you expect 5% of tests that achieve q<0.05 to do so even without a true effect
5% of your results will be "wrong" by chance (doesn't depend on how many tests you run!)
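A short Python sketch of both corrections on a made-up p-value vector; the Benjamini-Hochberg step implements the q-value formula above plus the usual monotonicity adjustment:

```python
# Sketch: Bonferroni (FWER) vs. Benjamini-Hochberg (FDR) adjustment of p-values.
import numpy as np

p = np.array([1e-6, 0.0004, 0.003, 0.012, 0.049, 0.2, 0.7])   # made-up p-values
m = len(p)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
bonferroni = np.minimum(p * m, 1.0)

# Benjamini-Hochberg: q = p * (total # tests) / (rank of p), then enforce q_i <= q_(i+1)
order = np.argsort(p)
q_sorted = p[order] * m / np.arange(1, m + 1)
q_sorted = np.minimum.accumulate(q_sorted[::-1])[::-1]
q = np.empty(m)
q[order] = np.minimum(q_sorted, 1.0)

print("Bonferroni-adjusted p < 0.05:", (bonferroni < 0.05).sum(), "of", m)
print("BH q-value < 0.05:          ", (q < 0.05).sum(), "of", m)
```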
Reading
Hypothesis testing:
Pagano and Gauvreau, 10.1-5
T-tests:
Pagano and Gauvreau, 11.1-2
Wilcoxon:
Pagano and Gauvreau, 13.2-4
ANOVA:
Pagano and Gauvreau, 12.1-2
Performance evaluation:
Pagano and Gauvreau, 6.4