W01 Notes: Inference and hypothesis testing
Wednesday #1
10m
Announcements and questions
Please email Siyuan to register for the course web site with a blank email, subject "BIO508"
All accounts emailed so far are created, username = password = email user ID
Note that Problem Set #1 is posted and due by end-of-day next Monday
First lab tomorrow, Thursday 4:30-6:30 in Kresge LL6
Office hours
20m
Comparing data: Manhattan, Euclidean, Pearson, Cosine, and Spearman correlations and distances
Useful to have summary statistics of paired data
Measurements consisting of two correspondingly ordered vectors
Motivations for different paired summary statistics are diverse
Some describe properties of probability distributions, like correlation
Others describe properties of "real" space, like Euclidean distance
In this class we'll focus on simply cataloging and having an intuition for these
Note that we'll avoid the term metric, which has a specific mathematical meaning (nonnegativity, symmetry, and the triangle inequality)
Slightly safer to refer to similarity or dissimilarity measures
Distances: larger means less similar
Euclidean = e = √(Σ(xᵢ-yᵢ)²), also the L2 norm
Straight-line distance, which amplifies outliers; in the range [0, ∞)
Manhattan = m = Σ|xᵢ-yᵢ|, also the L1 norm
Grid or absolute distance; in the range [0, ∞)
Correlations: larger means more similar
Pearson = ρ = Σ(xᵢ-x̄)(yᵢ-ȳ)/√(Σ(xᵢ-x̄)²·Σ(yᵢ-ȳ)²)
Also recoverable from the Euclidean distance of z-scored data, i.e. normalized by mean and standard deviation: e² = 2n(1-ρ) for z-scored vectors of length n
Thus location and scale invariant, but assumes normal distribution; in the range [-1, 1]
Cosine = c = Σxᵢyᵢ/√(Σxᵢ²·Σyᵢ²)
Also "uncentered Pearson," normalized by magnitude (root mean square) but not centered by the mean
Thus scale but not location invariant; in the range [-1, 1]
Spearman = r = Pearson correlation of ranks (with ties assigned the average of their ranks)
Assesses two datasets' monotonicity
Nonparametric measure of similarity of trend, thus location and scale invariant; in the range [-1, 1]
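As a quick sketch, all five measures can be computed with numpy/scipy; the vectors below are made-up values chosen so that y is a pure rescaling of x:

```python
import numpy as np
from scipy import stats

# Hypothetical paired data: y = 2x, a pure rescaling of x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x

# Distances: larger means less similar
euclidean = np.sqrt(np.sum((x - y) ** 2))  # L2 norm of the difference
manhattan = np.sum(np.abs(x - y))          # L1 norm of the difference

# Correlations: larger means more similar
pearson = np.corrcoef(x, y)[0, 1]
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
spearman = stats.spearmanr(x, y)[0]

# All three correlations equal 1 here (scale invariance), while the
# distances are nonzero -- they see y as far from x
print(euclidean, manhattan, pearson, cosine, spearman)
```

Adding a constant to y (a location shift) would leave Pearson and Spearman at 1 but lower the cosine, illustrating which measures are location invariant.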
Beware of summary statistics: Anscombe's quartet
Manually constructed in 1973 by Francis Anscombe at Yale as a demonstration
Four pairs of datasets with:
Equal mean (9) and variance (11) of x, and equal mean (7.5) and variance (4.12) of y
Identical Pearson correlation (0.816) and regression (y = 3+0.5x)
But completely different relationships
Understand (and visualize) your data before summarizing them!
15m
Probability: basic definitions
Statistics describe data; probability provides an underlying mathematical theory for manipulating them
Experiment: anything that produces a non-deterministic (random or stochastic) result
Coin flip, die roll, item count, concentration measurement, distance measurement...
Sample space: the set of all possible outcomes for a particular experiment, finite or infinite, discrete or continuous
{H, T}, {1, 2, 3, 4, 5, 6}, {0, 1, 2, 3, ...}, {0, 0.1, 0.001, 0.02, 3.14159, ...}
Event: any subset of a sample space
{}, {H}, {1, 3, 5}, {0, 1, 2}, [0, 3)
Probability: for an event E, the limit of n(E)/n as n grows large (at least if you're a frequentist)
Thus many symbolic proofs of probability relationships are based on integrals or limit theory
Note that all of these are defined in terms of sets and set notation:
{1, 2, 3} is a set of unordered unique elements, {} is the empty set
AB is the union of two sets (set of all elements in either set A or B)
AB is the intersection of two sets (set of all elements in both sets A and B)
~A is the complement of a set (all elements from some universe not in set A)
(Kolmogorov) axioms: one definition of "probability" that matches reality
For any event E, P(E)≥0
"Probability" is a non-negative real number
For any sample space S, P(S)=1
The "probability" of all outcomes for an experiment must total 1
For disjoint events E∩F={}, P(E∪F)=P(E)+P(F)
For two mutually exclusive events that share no outcomes...
The "probability" of either happening equals the summed "probability" of each happening independently
These represent one set of three assumptions from which intuitive rules about probability can be derived
0≤P(E)≤1 for any event E
P({})=0, i.e. every experiment must have some outcome
P(E)≤P(F) if E⊆F, i.e. an event containing more outcomes must be at least as probable as one containing fewer
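The axioms and the derived rules can be sanity-checked on a finite sample space; here is a minimal sketch for a fair six-sided die (equal outcome probabilities are the assumption):

```python
from fractions import Fraction

# Sample space for a fair die; each outcome has probability 1/6
S = frozenset({1, 2, 3, 4, 5, 6})

def P(event):
    # Probability of an event (a subset of S) under equally likely outcomes
    return Fraction(len(event & S), len(S))

E = {1, 3, 5}   # odd rolls
F = {2, 4}      # disjoint from E: E & F is empty

assert P(E) >= 0                    # axiom 1: non-negativity
assert P(S) == 1                    # axiom 2: the whole sample space has probability 1
assert P(E | F) == P(E) + P(F)      # axiom 3: additivity for disjoint events
assert P(set()) == 0                # derived: the empty event is impossible
assert P({1}) <= P({1, 2})          # derived: subsets are no more probable
```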
20m
Conditional probabilities and Bayes' theorem
Note that we'll be covering this only briefly in class
Feel free to refer to 1) the notes here, 2) the full notes online, 3) suggested reading below, or 4) the tutorials at:
http://brilliant.org/assessment/techniques-trainer/conditional-probability-and-bayes-theorem/
http://www.math.umass.edu/~lr7q/ps_files/teaching/math456/Week2.pdf
Conditional probability
The probability of an event given that another event has already occurred
The probability of event F given that the sample space S has been reduced to E⊆S
Notated P(F|E) and defined as P(E∩F)/P(E)
Bayes' theorem
P(F|E) = P(E|F)P(F)/P(E)
True since P(F|E) = P(E∩F)/P(E) and P(E∩F) = P(F∩E) = P(E|F)P(F)
Provides a means of calculating a conditional probability based on the inverse of its condition
Typically described in terms of prior, posterior, and support
P(F) is the prior probability of F occurring at all in the first place, "before" anything else
P(F|E) is the posterior probability of F occurring "after" E has occurred
P(E|F)/P(E) is the support E provides for F
Some examples from poker
Pick a card, any card!
Probability of drawing a jack given that you've drawn a face card?
P(jack|face) = P(jack∩face)/P(face) = P(jack)/P(face) = (4/52)/(12/52) = 1/3
P(jack|face) = P(face|jack)P(jack)/P(face) = 1·(4/52)/(12/52) = 1/3
Probability of drawing a face card given that you've drawn a jack?
P(face|jack) = P(jack∩face)/P(jack) = P(jack)/P(jack) = 1
P(face|jack) = P(jack|face)P(face)/P(jack) = (1/3)(12/52)/(4/52) = 1
Pick three cards: probability of drawing (exactly) a pair of aces given that you've drawn (exactly) a pair?
P(2A|pair) = P(2A∩pair)/P(pair) = P(2A)/P(pair) = (3·(4/52)(3/51)(48/50))/(13·3·(4/52)(3/51)(48/50)) = 1/13
P(2A|pair) = P(pair|2A)P(2A)/P(pair) = 1·(3·(4/52)(3/51)(48/50))/(13·3·(4/52)(3/51)(48/50)) = 1/13
(The factor of 3 counts the possible positions of the non-paired card, and the 13 sums over the possible paired ranks)
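The single-card examples can be verified with exact rational arithmetic; a small sketch:

```python
from fractions import Fraction

# One draw from a standard 52-card deck
p_jack = Fraction(4, 52)           # four jacks
p_face = Fraction(12, 52)          # J, Q, K in each of four suits
p_face_given_jack = Fraction(1)    # every jack is a face card

# Bayes' theorem: P(jack|face) = P(face|jack) * P(jack) / P(face)
p_jack_given_face = p_face_given_jack * p_jack / p_face

assert p_jack_given_face == Fraction(1, 3)
# And the inverse direction recovers P(face|jack) = 1
assert p_jack_given_face * p_face / p_jack == 1
```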
20m
Hypothesis testing
These are useful for proofs involving counting (and gambling), but more so for comparing results to chance
Test statistics and null hypotheses
Test statistic: any numerical summary of input data with known sampling distribution for null (random) data
Example: flip a coin four times; data are four H/T binary categorical values
One test statistic might be the % of heads (H) results (or % of tails (T) results)
Example: repeat a measurement of gel band intensity in three lanes each for two proteins
How different are the two proteins' expression levels?
One test statistic might be the difference in means of each three intensities
Another might be the difference in medians
Another might be the difference in minima; or the difference in maxima
Simply a number that reduces (summarizes) the entire dataset to a single value
Null distribution: expected distribution of test statistic under assumption of no effect/bias/etc.
A specific definition of "no effect" is the null hypothesis
And when used, the alternate hypothesis is just everything else (e.g. "some effect")
Example: a fair coin summarized as % Hs is expected to produce 50% H
"Null hypothesis" is that P(H) = P(T) = 0.5, i.e. the coin is fair
But any given experiment has some noise; the null distribution is not that %Hs equals exactly 0.5
Instead if the coin is fair, the null distribution has some "shape" around 0.5
The shape of that distribution depends on the experiment being performed (e.g. binomial for coin flip)
Example: the difference in means between proteins of equal expression is expected to be zero
But how surprised are we if it's not identically zero?
The "shape" of the "typical" variation around zero is dictated by the amount of noise in our experiment
If we can measure protein expression really well, we're surprised if the difference is >>0
But if our gels are noisy, the null distribution might be wide
So we can't differentiate expression changes due to biology from those due to "chance" or noise
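The coin-flip null distribution above can be computed exactly from the binomial distribution; a sketch (the flip count n = 10 is an arbitrary illustrative choice):

```python
from math import comb

# Null distribution of the %H test statistic for a fair coin, n = 10 flips
n = 10
null = {k / n: comb(n, k) * 0.5 ** n for k in range(n + 1)}

# The distribution is centered on 0.5 but has spread: exactly 50% heads
# occurs only about a quarter of the time, while 40-60% is common
p_exactly_half = null[0.5]
p_near_half = sum(p for frac, p in null.items() if 0.4 <= frac <= 0.6)
print(p_exactly_half, p_near_half)
```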
Parametric versus nonparametric null distributions
It is absolutely critical to distinguish between two types of null distributions:
Those for which we can analyze the expected probability distribution mathematically: parametric
Those for which we can compute the expected probability distribution numerically: nonparametric
Parametric distributions mean the test statistic under the null hypothesis has some analytically described shape
Implies some more or less specific assumptions about the behavior of the experiment generating your data
Example: a gene has known baseline expression μ, and you measure it once under a new condition
A good test statistic is your measurement x-μ; how surprised are you if this is ≠0?
If experimental noise is normally distributed with standard deviation σ, x-μ will be normally distributed
Referred to as a z-statistic
Example: a gene has known baseline expression μ, and you measure it three times
How surprised are you if |μ̂ − μ| > 0?
What if you don't know your experimental noise beforehand, but do know it's normally distributed?
A useful test statistic in this case is the t-statistic t = |μ̂ − μ|/(σ̂/√n)
This uses the sample's standard deviation, but is less certain about it
Has an analytically defined t-distribution, thus leading to the popularity of the t-test
But parametric tests often make very strong assumptions about your experiment and data!
And fortunately, using computers, you rarely need to use them any more
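A sketch of the t-statistic in Python (the baseline μ and the three measurements are made-up values; scipy's one-sample t-test computes the same statistic plus its p-value):

```python
import numpy as np
from scipy import stats

mu = 10.0                            # hypothetical known baseline expression
x = np.array([10.8, 11.2, 10.5])     # three made-up measurements

# t = |mean(x) - mu| / (sample SD / sqrt(n))
n = len(x)
t = abs(x.mean() - mu) / (x.std(ddof=1) / np.sqrt(n))

# scipy computes the same (signed) statistic and a p-value from the
# analytically defined t-distribution with n - 1 degrees of freedom
t_scipy, p = stats.ttest_1samp(x, mu)
print(t, t_scipy, p)
```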
Nonparametric distributions mean the shape of the null distribution is calculated either by:
Transforming the test statistic to rank values only (and thus ignoring its shape) or
Simulating it directly from truly randomized data using a computer
Referred to as the bootstrap or permutation testing depending on precisely how it's done
Take your data, shuffle it a bunch of times, see how often you get a test statistic as extreme as real data
Incredibly useful: inherently bakes structure of data into significance calculations
e.g. population structure for GWAS, coexpression structure for microarrays, etc.
Example: comparing your 2x3 protein measurements
Your data start out ordered into two groups: [A B C, X Y Z]
You can then use the difference of means [A B C] and [X Y Z] as a test statistic
Shuffle the order many times so the groups are random
How many times is the difference of means between the two shuffled groups as big as the real difference?
If we assume normality etc., we can calculate this using formulas for the difference in means
But what about the difference in medians? Or minima/maxima?
Nonparametric tests can be extremely robust to the quirks encountered in real data
They typically involve few or no assumptions about the shape or properties of your experiment
They can also provide a good summary of how well your results do fit a specific assumption
e.g. how close is a permuted distribution to normal?
Q-Q plots do this for the general case (scatter plot quantiles of any two distributions/vectors)
The costs are:
Decreased sensitivity (particularly for rank-based tests)
Increased computational costs (all that shuffling/permuting/randomization takes time!)
15m
p-values
Given some real data with a test statistic, and a null hypothesis with some resulting null distribution...
A p-value is probability of observing a "comparable" result if the null hypothesis is true
Formally, for test statistic T with some value t for your data, P(T≥t|H0)
Example: flip a coin a bunch of times
Your test statistic is %Hs, and your null hypothesis is the coin's fair so P(H)=0.5
What's the probability of observing ≥90% heads?
Example: given a plate full of yeast colonies, count how many of them are petites (non-respiratory)
Your test statistic is %petites, and your null hypothesis is the wild-type rate, P(petite)=0.15
What's the probability of observing ≥50% petite colonies on a plate?
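The coin example's p-value can be computed exactly from the binomial null distribution; a sketch assuming n = 10 flips for illustration:

```python
from math import comb

# Null: fair coin, P(H) = 0.5. Test statistic: %H in n = 10 flips
n = 10
# p-value for >= 90% heads: P(9 or 10 heads) under the null
p_value = sum(comb(n, k) for k in range(9, n + 1)) * 0.5 ** n
print(p_value)  # 11/1024, about 0.011
```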
Note that we've been stating these p-values in terms of extreme values, e.g. "90% or more"
Sometimes you care only about large values, e.g. ≥90%
Sometimes you care about any extreme values, e.g. ≥90% or ≤10%
Often the case for deviations from zero/mean, e.g. |μ̂ − μ|·√n/σ̂ ≥ 3
This is the difference between one-sided and two-sided hypothesis tests/p-values
H0:0, HA:>0
H0:=0, HA:0
We won't dwell on the theoretical implications of this, but it has two important practical effects:
Calculate without an absolute value when you care about direction, with one when you don't
Double your p-value when you don't care about direction, since you're testing twice the area
Often only one of these (one- or two-sided) will make sense for a particular situation
In cases where you can choose, it's almost always more correct to make the more conservative choice
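The doubling rule is easy to see numerically; a sketch using a hypothetical z-statistic of 2.0 and the normal null distribution:

```python
from scipy import stats

z = 2.0                                  # hypothetical z-statistic, for illustration
one_sided = stats.norm.sf(z)             # P(Z >= z): direction matters (HA: mu > 0)
two_sided = 2 * stats.norm.sf(abs(z))    # P(|Z| >= z): either direction (HA: mu != 0)

# Two-sided tests twice the tail area, so the p-value doubles
assert two_sided == 2 * one_sided
print(one_sided, two_sided)
```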
Reading
Hypothesis testing:
Pagano and Gauvreau, 10.1-5
T-tests:
Pagano and Gauvreau, 11.1-2
Wilcoxon:
Pagano and Gauvreau, 13.2-4
ANOVA:
Pagano and Gauvreau, 12.1-2