Wednesday #1

10m Announcements and questions
- Please email Siyuan to register for the course web site with a blank email, subject "BIO508"
  - All accounts emailed so far are created, username = password = email user ID
- Note that Problem Set #1 is posted and due by end-of-day next Monday
- First lab tomorrow, Thursday 4:30-6:30 in Kresge LL6
- Office hours

20m Comparing data: Manhattan, Euclidean, Pearson, Cosine, and Spearman correlations and distances
- Useful to have summary statistics of paired data
  - Measurements consisting of two correspondingly ordered vectors
- Motivations for different paired summary statistics are diverse
  - Some describe properties of probability distributions, like correlation
  - Others describe properties of "real" space, like Euclidean distance
  - In this class we'll focus on simply cataloging and building an intuition for these
- Note that we'll avoid the term metric, which has a specific mathematical definition
  - Slightly safer to refer to similarity or dissimilarity measures
- Distances: larger means less similar
  - Euclidean: e = sqrt(Σ(x_i - y_i)²), also the L2 norm
    - Straight-line distance, which amplifies outliers; in the range [0, ∞)
  - Manhattan: m = Σ|x_i - y_i|, also the L1 norm
    - Grid or absolute distance; in the range [0, ∞)
- Correlations: larger means more similar
  - Pearson: r = Σ(x_i - x̄)(y_i - ȳ) / sqrt(Σ(x_i - x̄)² · Σ(y_i - ȳ)²)
    - Also the Euclidean distance of z-scored data, i.e.
      normalized by mean and standard deviation
    - Thus location and scale invariant, but assumes a normal distribution; in the range [-1, 1]
  - Cosine: c = Σ(x_i · y_i) / sqrt(Σx_i² · Σy_i²)
    - Also uncentered Pearson, normalized by standard deviation but not mean
    - Thus scale but not location invariant; in the range [-1, 1]
  - Spearman: the Pearson correlation of ranks (with ties ranked identically)
    - Assesses two datasets' monotonicity
    - Nonparametric measure of similarity of trend, thus location and scale invariant; in the range [-1, 1]
- Beware of summary statistics: Anscombe's quartet
  - Manually constructed in 1973 by Francis Anscombe at Yale as a demonstration
  - Four x/y dataset pairs with:
    - Equal mean (9) and variance (11) of x, and equal mean (7.5) and variance (4.1) of y
    - Identical Pearson correlation (0.816) and regression line (y = 3 + 0.5x)
    - But completely different relationships
  - Understand (and visualize) your data before summarizing them!

15m Probability: basic definitions
- Statistics describe data; probability provides an underlying mathematical theory for manipulating them
- Experiment: anything that produces a non-deterministic (random or stochastic) result
  - Coin flip, die roll, item count, concentration measurement, distance measurement...
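Such experiments are easy to simulate. As a minimal sketch (the trial counts and the choice of a die-roll experiment are arbitrary), the relative frequency of an event stabilizes as the number of trials grows:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Simulate a die-roll experiment and track the relative frequency n(E)/n
# of the event E = {roll is even} as the number of trials n grows
for n in (100, 10_000, 100_000):
    n_event = sum(1 for _ in range(n) if random.randint(1, 6) % 2 == 0)
    print(n, n_event / n)  # the ratio stabilizes near P(E) = 1/2 as n grows
```

This stabilizing ratio is exactly the frequentist notion of probability defined next.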
- Sample space: the set of all possible outcomes for a particular experiment, finite or infinite, discrete or continuous
  - {H, T}, {1, 2, 3, 4, 5, 6}, {0, 1, 2, 3, ...}, {0, 0.1, 0.001, 0.02, 3.14159, ...}
- Event: any subset of a sample space
  - {}, {H}, {1, 3, 5}, {0, 1, 2}, [0, 3)
- Probability: for an event E, the limit of n(E)/n as n grows large (at least if you're a frequentist)
  - Thus many symbolic proofs of probability relationships are based on integrals or limit theory
- Note that all of these are defined in terms of sets and set notation:
  - {1, 2, 3} is a set of unordered unique elements, {} is the empty set
  - A∪B is the union of two sets (the set of all elements in either set A or B)
  - A∩B is the intersection of two sets (the set of all elements in both sets A and B)
  - ~A is the complement of a set (all elements from some universe not in set A)
- (Kolmogorov) axioms: one definition of "probability" that matches reality
  - For any event E, P(E) ≥ 0
    - "Probability" is a non-negative real number
  - For any sample space S, P(S) = 1
    - The "probability" of all outcomes for an experiment must total 1
  - For disjoint events E∩F = {}, P(E∪F) = P(E) + P(F)
    - For two mutually exclusive events that share no outcomes...
    - The "probability" of either happening equals the summed "probability" of each happening independently
- These represent one set of three assumptions from which intuitive rules about probability can be derived
  - 0 ≤ P(E) ≤ 1 for any event E
  - P({}) = 0, i.e. every experiment must have some outcome
  - P(E) ≤ P(F) if E⊆F, i.e.
    the probability of an event containing more outcomes must be at least as great as that of one containing fewer

20m Conditional probabilities and Bayes' theorem
- Note that we'll be covering this only briefly in class
- Feel free to refer to 1) the notes here, 2) the full notes online, 3) suggested reading below, or 4) the tutorials at:
  - http://brilliant.org/assessment/techniques-trainer/conditional-probability-and-bayes-theorem/
  - http://www.math.umass.edu/~lr7q/ps_files/teaching/math456/Week2.pdf
- Conditional probability
  - The probability of an event given that another event has already occurred
  - The probability of event F given that the sample space S has been reduced to E⊆S
  - Notated P(F|E) and defined as P(E∩F)/P(E)
- Bayes' theorem
  - P(F|E) = P(E|F)P(F)/P(E)
  - True since P(F|E) = P(E∩F)/P(E) and P(E∩F) = P(F∩E) = P(E|F)P(F)
  - Provides a means of calculating a conditional probability based on the inverse of its condition
  - Typically described in terms of prior, posterior, and support
    - P(F) is the prior probability of F occurring at all in the first place, "before" anything else
    - P(F|E) is the posterior probability of F occurring "after" E has occurred
    - P(E|F)/P(E) is the support E provides for F
- Some examples from poker
  - Pick a card, any card!
    - Probability of drawing a jack given that you've drawn a face card?
      - P(jack|face) = P(jack∩face)/P(face) = P(jack)/P(face) = (4/52)/(12/52) = 1/3
      - P(jack|face) = P(face|jack)P(jack)/P(face) = 1·(4/52)/(12/52) = 1/3
    - Probability of drawing a face card given that you've drawn a jack?
      - P(face|jack) = P(jack∩face)/P(jack) = P(jack)/P(jack) = 1
      - P(face|jack) = P(jack|face)P(face)/P(jack) = (1/3)(12/52)/(4/52) = 1
  - Pick three cards: probability of drawing (exactly) a pair of aces given that you've drawn (exactly) a pair?
    - (The factor of 3 below counts which of the three draws is the unpaired card)
    - P(2A|pair) = P(2A∩pair)/P(pair) = P(2A)/P(pair) = (3·(4/52)(3/51)(48/50))/(3·1·(3/51)(48/50)) = 1/13
    - P(2A|pair) = P(pair|2A)P(2A)/P(pair) = 1·(3·(4/52)(3/51)(48/50))/(3·1·(3/51)(48/50))
      = 1/13

20m Hypothesis testing
- These are useful for proofs involving counting (and gambling), but more so for comparing results to chance
- Test statistics and null hypotheses
  - Test statistic: any numerical summary of input data with a known sampling distribution for null (random) data
    - Example: flip a coin four times; data are four H/T binary categorical values
      - One test statistic might be the % of heads (H) results (or % of tails (T) results)
    - Example: repeat a measurement of gel band intensity in three lanes each for two proteins
      - How different are the two proteins' expression levels?
      - One test statistic might be the difference in means of the two sets of three intensities
      - Another might be the difference in medians
      - Another might be the difference in minima; or the difference in maxima
    - Simply a number that reduces (summarizes) the entire dataset to a single value
  - Null distribution: the expected distribution of the test statistic under the assumption of no effect/bias/etc.
    - A specific definition of "no effect" is the null hypothesis
    - And when used, the alternate hypothesis is just everything else (e.g. "some effect")
    - Example: a fair coin summarized as %Hs is expected to produce 50% H
      - "Null hypothesis" is that P(H) = P(T) = 0.5, i.e. the coin is fair
      - But any given experiment has some noise; the null distribution is not that %Hs equals exactly 0.5
      - Instead, if the coin is fair, the null distribution has some "shape" around 0.5
      - The shape of that distribution depends on the experiment being performed (e.g. binomial for coin flips)
    - Example: the difference in means between proteins of equal expression is expected to be zero
      - But how surprised are we if it's not identically zero?
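The coin-flip null distribution above can be simulated directly. A minimal sketch, assuming 20 flips per experiment and 10,000 simulated experiments (both arbitrary choices):

```python
import random

random.seed(0)

# Null distribution of the %heads test statistic for a fair coin:
# repeat the 20-flip experiment many times under H0 (P(H) = 0.5)
n_flips, n_experiments = 20, 10_000
null_stats = [
    sum(random.random() < 0.5 for _ in range(n_flips)) / n_flips
    for _ in range(n_experiments)
]

# The distribution is centered at 0.5 but has a binomial "shape" around it
mean_stat = sum(null_stats) / n_experiments
sd_stat = (sum((s - mean_stat) ** 2 for s in null_stats) / n_experiments) ** 0.5
print(mean_stat, sd_stat)  # near 0.5, and near sqrt(0.5*0.5/20) ≈ 0.11
```

Any single experiment's %Hs can then be compared against this shape rather than against exactly 0.5.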
      The "shape" of the "typical" variation around zero is dictated by the amount of noise in our experiment
      - If we can measure protein expression really well, we're surprised if the difference is >>0
      - But if our gels are noisy, the null distribution might be wide
      - So we can't differentiate expression changes due to biology from those due to "chance" or noise
- Parametric versus nonparametric null distributions
  - It is absolutely critical to distinguish between two types of null distributions:
    - Those for which we can analyze the expected probability distribution mathematically: parametric
    - Those for which we can compute the expected probability distribution numerically: nonparametric
  - Parametric distributions mean the test statistic under the null hypothesis has some analytically described shape
    - Implies some more or less specific assumptions about the behavior of the experiment generating your data
    - Example: a gene has known baseline expression μ, and you measure it once under a new condition
      - A good test statistic is your measurement x-μ; how surprised are you if this is ≠0?
      - If experimental noise is normally distributed with standard deviation σ, x-μ will be normally distributed
      - Referred to as a z-statistic
    - Example: a gene has known baseline expression μ, and you measure it three times
      - How surprised are you if |x̄ - μ| ≠ 0?
      - What if you don't know your experimental noise beforehand, but do know it's normally distributed?
      - A useful test statistic in this case is the t-statistic t = |x̄ - μ|/(σ̂/√n)
      - This uses the sample's standard deviation, but is less certain about it
      - Has an analytically defined t-distribution, thus leading to the popularity of the t-test
    - But parametric tests often make very strong assumptions about your experiment and data!
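A minimal sketch of this t-statistic (the baseline μ and the three measurements are made-up values; scipy is assumed available for the cross-check):

```python
import math
from scipy import stats

mu = 10.0               # hypothetical known baseline expression
x = [11.2, 12.1, 10.9]  # three hypothetical repeated measurements

n = len(x)
xbar = sum(x) / n
s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))  # sample standard deviation
t = (xbar - mu) / (s / math.sqrt(n))  # t-statistic (the notes take its absolute value)

# scipy computes the same statistic, plus a two-sided p-value from the t-distribution
t_scipy, p = stats.ttest_1samp(x, mu)
print(t, t_scipy, p)
```

With only n = 3 measurements the t-distribution has heavy tails (2 degrees of freedom), reflecting the extra uncertainty about σ̂.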
    - And fortunately, using computers, you rarely need to use them any more
  - Nonparametric distributions mean the shape of the null distribution is calculated either by:
    - Transforming the test statistic to rank values only (and thus ignoring its shape), or
    - Simulating it directly from truly randomized data using a computer
      - Referred to as the bootstrap or permutation testing, depending on precisely how it's done
      - Take your data, shuffle it a bunch of times, and see how often you get a test statistic as extreme as the real data
      - Incredibly useful: inherently bakes the structure of the data into significance calculations
        - e.g. population structure for GWAS, coexpression structure for microarrays, etc.
    - Example: comparing your 2x3 protein measurements
      - Your data start out ordered into two groups: [A B C, X Y Z]
      - You can then use the difference of means of [A B C] and [X Y Z] as a test statistic
      - Shuffle the order many times so the groups are random
      - e.g. for data [1 2 3] and [4 5 6], how many random orders give a difference of means as big as the real one?
      - If we assume normality etc., we can calculate this using formulas for the difference in means
      - But what about the difference in medians? Or minima/maxima?
  - Nonparametric tests can be extremely robust to the quirks encountered in real data
    - They typically involve few or no assumptions about the shape or properties of your experiment
    - They can also provide a good summary of how well your results do fit a specific assumption
      - e.g. how close is a permuted distribution to normal?
      - Q-Q plots do this for the general case (scatter plot quantiles of any two distributions/vectors)
  - The costs are:
    - Decreased sensitivity (particularly for rank-based tests)
    - Increased computational cost (all that shuffling/permuting/randomization takes time!)

15m p-values
- Given some real data with a test statistic, and a null hypothesis with some resulting null distribution...
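The shuffle-and-compare recipe above can be sketched as follows, and it yields exactly an empirical p-value (the six measurements are made up, and 10,000 shuffles is an arbitrary choice):

```python
import random

random.seed(0)

# Hypothetical 2x3 protein measurements: groups [A B C] and [X Y Z]
group1 = [4.0, 3.75, 4.5]
group2 = [5.25, 5.0, 5.5]

def abs_diff_of_means(data, n1):
    """Absolute difference between the means of data[:n1] and data[n1:]."""
    return abs(sum(data[:n1]) / n1 - sum(data[n1:]) / (len(data) - n1))

data = group1 + group2
observed = abs_diff_of_means(data, len(group1))

# Shuffle the group labels many times and count how often the shuffled
# statistic is at least as extreme as the real one
n_shuffles = 10_000
extreme = 0
for _ in range(n_shuffles):
    random.shuffle(data)
    extreme += abs_diff_of_means(data, len(group1)) >= observed

p_perm = extreme / n_shuffles
print(p_perm)  # empirical p-value; 2 of the 20 equally likely splits are this extreme
```

With these values the shuffled statistic matches the observed one only when the two groups are exactly recovered (in either order), so the empirical p-value approaches 2/20 = 0.1; swapping in medians or minima requires changing only the test-statistic function.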
- A p-value is the probability of observing a "comparable" (as or more extreme) result if the null hypothesis is true
  - Formally, for a test statistic T with some value t for your data, P(T≥t|H0)
- Example: flip a coin a bunch of times
  - Your test statistic is %Hs, and your null hypothesis is that the coin's fair, so P(H)=0.5
  - What's the probability of observing ≥90% heads?
- Example: given a plate full of yeast colonies, count how many of them are petites (non-respiratory)
  - Your test statistic is %petites, and your null hypothesis is the wild-type rate P(petite)=0.15
  - What's the probability of observing ≥50% petite colonies on a plate?
- Note that we've been stating these p-values in terms of extreme values, e.g. "at least 90% or more"
  - Sometimes you care only about large values, e.g. ≥90%
  - Sometimes you care about any extreme values, e.g. ≥90% or ≤10%
    - Often the case for deviations from zero/mean, e.g. |x̄ - μ| large
  - This is the difference between one-sided and two-sided hypothesis tests/p-values
    - One-sided: H0: μ≤0, HA: μ>0
    - Two-sided: H0: μ=0, HA: μ≠0
  - We won't dwell on the theoretical implications of this, but it has two important practical effects:
    - Calculate without an absolute value when you care about direction, with one when you don't
    - Double your p-value when you don't care about direction, since you're testing twice the area
  - Often only one of these (one- or two-sided) will make sense for a particular situation
    - In cases where you can choose, it's almost always more correct to make the more conservative choice

Reading
- Hypothesis testing: Pagano and Gauvreau, 10.1-5
- T-tests: Pagano and Gauvreau, 11.1-2
- Wilcoxon: Pagano and Gauvreau, 13.2-4
- ANOVA: Pagano and Gauvreau, 12.1-2