W01 Notes
Wednesday #1
Problem set questions (10 min)
Hypothesis testing (20 min)
Test statistic: numerical summary of input data with known sampling distribution for null (random) data
Example: flip a coin 100 times; data is 100 0/1 values, test statistic is fraction heads
Reduces the entire dataset to a single value with expected binomial distribution around 0.5
Some common statistics are Z (with a normal distribution), T (with t-distribution), etc.
In these days of computational simulations, can be anything: mean, sum, etc.
Null hypothesis
Statement of expected distribution of test statistic under assumption of no effect/bias/etc.
Often stated as equivalence of means (or other test statistic)
Alternative hypothesis: everything else
p-value: probability of observing a test statistic at least as extreme as the one observed if the null hypothesis is true
Example: probability of observing 90%+ heads if coin is fair
Example: probability of counting 50%+ petite yeast colonies if expected WT is 15%
Note that p-values are typically stated in terms of values at least as extreme as observed, e.g. "90% heads or more"
One versus two-sided
H0:=0, HA:0
H0:0, HA:>0
Note that switching between one- and two-sided halves/doubles the p-value
Often only one- or two-sided will make sense for a particular situation
In cases where you can choose, almost always more correct to make the stricter choice
Parametric versus nonparametric
Parametric: distribution of test statistic under null hypothesis has some analytically described shape
Normal distribution, t-distribution, etc.
Nonparametric: distribution of test statistic calculated directly by simulation/randomization
Sometimes referred to as bootstrap (not to be confused with machine learning bootstrap)
Take your data, shuffle it N times, see how often you get a test statistic as extreme as the real data (see the sketch just below this list)
Incredibly useful: inherently bakes structure of data into significance calculations
e.g. population structure for GWAS, coexpression structure for microarrays, etc.
Can also provide a good summary of shape of data relative to expectation
Q-Q plots do this for the general case (scatter plot quantiles of any two distributions/vectors)
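A minimal sketch of this shuffle-based procedure in Python/numpy; the two groups and the difference-in-means test statistic are made-up assumptions for illustration, not from the notes.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical example data: two small groups of measurements
    group_a = np.array([2.1, 2.5, 3.0, 2.8, 3.2])
    group_b = np.array([1.9, 2.0, 2.3, 2.2, 2.6])

    # Test statistic: difference in group means
    observed = group_a.mean() - group_b.mean()

    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    n_perm = 10000

    # Shuffle the pooled data many times and recompute the statistic each time
    null_stats = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(pooled)
        null_stats[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

    # Two-sided p-value: how often is a shuffled statistic as extreme as the real one?
    p_value = np.mean(np.abs(null_stats) >= abs(observed))
    print(f"observed difference = {observed:.3f}, permutation p = {p_value:.4f}")

Because the group labels are shuffled rather than modeled, whatever structure the data share is carried into the null distribution automatically.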
Repeated tests and high-dimensional data
If you t-test 20,000 genes at p=0.05, you expect ~1,000 to pass the threshold by chance even when the null is true for all of them
Very hard to have sufficient samples (n) for large numbers of features (p)
Bonferroni correction: adjusts p-value for multiple hypothesis testing
Set =p for p features (hypotheses being tested)
Can be too strict for large p
Controls the Family-Wise Error Rate (FWER), the probability that any of the p input features "passes" by chance
An alternative controls the False Discovery Rate (FDR), the % of output (passing) features expected to have passed by chance
FDR q-value = (p-value) * (total # tests) / (rank of the p-value); see the sketch below
Makes control depend on number of output (passing) features, not total number of input features
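A sketch of both corrections on simulated p-values; the simulated data and the 0.05 cutoff are assumptions, and the q-value line follows the formula above plus the usual cumulative-minimum step that keeps q-values monotone.

    import numpy as np

    rng = np.random.default_rng(1)
    alpha = 0.05

    # Simulated p-values: mostly null (uniform) plus a few strong signals
    pvals = np.concatenate([rng.uniform(size=95), rng.uniform(0, 1e-4, size=5)])
    m = len(pvals)

    # Bonferroni: compare each p-value to alpha / m
    bonferroni_pass = pvals < alpha / m

    # Benjamini-Hochberg FDR: q = p * m / rank, then enforce monotonicity
    order = np.argsort(pvals)
    ranks = np.arange(1, m + 1)
    q_sorted = pvals[order] * m / ranks
    q_sorted = np.minimum.accumulate(q_sorted[::-1])[::-1]  # running min from the largest p
    qvals = np.empty(m)
    qvals[order] = q_sorted
    fdr_pass = qvals < alpha

    print(f"Bonferroni passes: {bonferroni_pass.sum()}, FDR passes: {fdr_pass.sum()}")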
Common one/two-sample hypothesis tests (20 min)
T-test: one or two normally distributed homoscedastic independent sets of values
Used to compare two samples assumed to be from underlying normal distributions
T-distribution is heavier tailed than the normal to account for the extra uncertainty of estimating the variance as well as the mean
Contrast with the normal distribution of the z-test, used when the variance is known with certainty
One-sample: mean of normally distributed sample differs from zero
t = /(/n)
Two-sample unpaired: means of two normal samples of equal variance differ
t = (1-2)/(S(1/n1+1/n2)), S = (((n1-1)12+(n2-1)22)/(n1+n2-2)) estimator of common/pooled stdev
Two-sample paired: means of two matched normal samples of equal variance differ
Tests whether a distribution shifts, since before/after points are non-independent
Equivalent statement: does the mean of the sample of differences differ from zero
Welch's t-test: unequal variance
Identical save that S = √(s1²/n1 + s2²/n2) and the degrees of freedom for the t-distribution are funky (the Welch-Satterthwaite approximation)
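The scipy.stats calls below cover the variants above (one-sample, pooled two-sample, Welch, and paired); the example arrays are simulated purely for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.normal(loc=0.5, scale=1.0, size=30)            # hypothetical sample 1
    y = rng.normal(loc=0.0, scale=1.5, size=25)            # hypothetical sample 2
    before = rng.normal(loc=10, scale=2, size=20)
    after = before + rng.normal(loc=1, scale=1, size=20)   # matched before/after pairs

    print(stats.ttest_1samp(x, popmean=0))         # one-sample: mean differs from zero?
    print(stats.ttest_ind(x, y))                   # two-sample unpaired, pooled variance
    print(stats.ttest_ind(x, y, equal_var=False))  # Welch's t-test, unequal variance
    print(stats.ttest_rel(before, after))          # paired t-test on matched samples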
Mann-Whitney U/Mann-Whitney Wilcoxon: two independent sets of values
U1 = R1 - n1(n1+1)/2, where R1 = sum of the ranks of sample 1's values within the pooled data
Two-sample unpaired: distributions differ
Equivalent to two-sample t-test
Two-sample paired: distributions of matched samples differ (the Wilcoxon signed-rank test)
Equivalent to paired t-test
Performed by rank-transforming data (like Pearson -> Spearman)
Asks whether order of ranks is surprisingly different
Example: S1=[0.2, 0.5, 0.9] and S2=[0.1, 0.3, 0.4]
Pooled = [0.1 (S2), 0.2 (S1), 0.3 (S2), 0.4 (S2), 0.5 (S1), 0.9 (S1)]; Ranks = [1 (S2), 2 (S1), 3 (S2), 4 (S2), 5 (S1), 6 (S1)]
R1 = 2+5+6 = 13, U1 = 13 - 3*4/2 = 7
Of note below: AUC = U1/(n1n2)
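The worked example above can be checked directly; the ranks and U1 are recomputed by hand below, with scipy's mannwhitneyu as a cross-check (recent scipy versions report U for the first sample, so treat that call as version-dependent).

    import numpy as np
    from scipy import stats

    s1 = np.array([0.2, 0.5, 0.9])
    s2 = np.array([0.1, 0.3, 0.4])

    # Rank the pooled values, then sum the ranks belonging to sample 1
    pooled = np.concatenate([s1, s2])
    ranks = stats.rankdata(pooled)
    r1 = ranks[:len(s1)].sum()               # 2 + 5 + 6 = 13
    u1 = r1 - len(s1) * (len(s1) + 1) / 2    # 13 - 6 = 7
    auc = u1 / (len(s1) * len(s2))           # U1 / (n1*n2) = 7/9
    print(r1, u1, auc)

    # Built-in version for comparison
    print(stats.mannwhitneyu(s1, s2, alternative="two-sided"))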
Kolmogorov-Smirnov: one or two independent sets of values
One-sample: distribution differs from chosen reference
Two-sample unpaired: two distributions differ
Performed by transforming data to cumulative distribution
Asks whether CDFs are surprisingly different
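A short scipy sketch of both forms; the samples and the standard-normal reference are assumptions for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.normal(size=200)             # hypothetical sample
    y = rng.normal(loc=0.3, size=150)    # hypothetical second sample

    # One-sample: does x differ from a chosen reference CDF (standard normal here)?
    print(stats.kstest(x, "norm"))

    # Two-sample unpaired: do the two empirical CDFs differ?
    print(stats.ks_2samp(x, y))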
ANOVA: >2 normally distributed homoscedastic independent sets of values (20 min)
Extension of t-test to more than two groups; compares one continuous with one categorical
Could just repeat a t-test many times between pairs of groups; boom, multiple-testing problem!
Instead tests whether between-groups variance is surprisingly different from within-groups
Within-groups calculated as weighted (by group size) average of sample variances
Between-groups calculated from the variance of the group means around the grand mean (weighted by group size)
Only tells you that the groups do (or don't) differ, not which or how!
Post-hoc pairwise tests (multiple hypothesis corrected)
Kruskal-Wallis: nonparametric alternative to ANOVA
Replaces data with ranks; exactly the same relationship as Spearman/Pearson or MWW/t-test
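A sketch of both tests on three simulated groups (the group means are arbitrary assumptions); each call returns the test statistic and its p-value.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    g1 = rng.normal(loc=0.0, size=20)    # hypothetical group 1
    g2 = rng.normal(loc=0.5, size=20)    # hypothetical group 2
    g3 = rng.normal(loc=1.0, size=20)    # hypothetical group 3

    # One-way ANOVA: is between-groups variance surprisingly large vs within-groups?
    print(stats.f_oneway(g1, g2, g3))

    # Kruskal-Wallis: the rank-based (nonparametric) analogue
    print(stats.kruskal(g1, g2, g3))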
Chi-square: assesses whether histogram (multinomial) agrees with theoretical distribution (e.g. uniform)
Closely related to Fisher's Exact test, which compares two multinomials (histograms)
I rarely find either of these things useful because genomic data rarely pairs categoricals
Performance evaluation (20 min)
Gold standard: list of correct answers, often discrete 0/1 labels or true numerical values
Positives, one or true in the gold standard; negatives, zero or false in the gold standard
Probabilistically, positives have null H0 false, negatives have H0 true
                 H0 Accepted                  H0 Rejected
H0 True          True Negative                False Positive (Type I)
H0 False         False Negative (Type II)     True Positive
Predictions: true positives, false positives, true negatives, false negatives
Values above/below a cutoff confidence/probability threshold that agree/disagree with standard
Type I error: false positive, P(reject H0|H0)
Type II error: false negative, P(accept H0|~H0)
Power: P(reject H0|~H0)
Performance evaluation: precision/recall and sensitivity/specificity
Precision: TP/(TP+FP) = P(~H0|reject H0)
Recall: TP/P = TP/(TP+FN) = P(reject H0|~H0) = true positive rate (TPR)
Sensitivity: recall
Specificity: TN/(TN+FP) = P(accept H0|H0) = true negative rate = 1 - false positive rate (FPR)
Receiver Operating Characteristic (ROC) plot: FPR (x) vs TPR (y)
Area Under the Curve (AUC): area under ROC
Random = 0.5, Perfect = 1.0, Perfectly wrong = 0.0
AUPRC: area under precision/recall curve
No fixed baseline; a random predictor's AUPRC equals the fraction of positives rather than a particular value like 0.5
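A sketch connecting these metrics back to the Mann-Whitney U noted earlier; the gold standard and prediction scores are simulated, and scikit-learn is an assumed dependency for the metric functions.

    import numpy as np
    from scipy import stats
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(5)
    # Hypothetical gold standard (30 positives, 70 negatives) and prediction scores
    y_true = np.concatenate([np.ones(30), np.zeros(70)])
    scores = np.concatenate([rng.normal(1.0, 1.0, 30), rng.normal(0.0, 1.0, 70)])

    # ROC AUC, and the same quantity via AUC = U1 / (n1 * n2)
    auc = roc_auc_score(y_true, scores)
    u1 = stats.mannwhitneyu(scores[y_true == 1], scores[y_true == 0],
                            alternative="two-sided").statistic
    print(auc, u1 / (30 * 70))

    # AUPRC: its random baseline is the positive fraction (0.3 here), not 0.5
    print(average_precision_score(y_true, scores))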
Inference: linear regression (20 min)
Correlation etc. make non-causal statements about the relationship between two samples
What if you want to predict an output as a function of one or more inputs?
This inherently makes a distinction between a response and one or more explanatory variables
y = a + bx + e
Assumes that e is normally distributed around zero (so y is distributed around a+bx) with fixed variance
Also again assumes independence of measurements x/y
Linear regression fits a and b parameters with respect to a constraint, typically least squares
That is, minimize Σ(y - f(x))², the sum of squared residuals
This is an excellent, simple way to make predictions about an output variable
Goodness of fit can be given a p-value (taking the normality/independence assumptions into account)
Efficient, can accommodate many x's, can transform x nonlinearly as desired
Provides several good views of data to validate goodness of fit, homoscedasticity (residuals plot)
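A minimal least-squares fit with numpy matching y = a + bx + e above; the simulated data and the true coefficients (2.0 and 0.7) are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.uniform(0, 10, size=50)
    y = 2.0 + 0.7 * x + rng.normal(scale=1.0, size=50)   # y = a + b*x + noise

    # Least-squares fit of a and b via a design matrix with an intercept column
    X = np.column_stack([np.ones_like(x), x])
    (a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

    # Residuals: plot these against the fitted values to check homoscedasticity
    residuals = y - (a_hat + b_hat * x)
    print(a_hat, b_hat, residuals.std())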
Multiple regression: bakes in additional coefficients, y = b0 + b1x1 + b2x2 + ... (b·x)
Significance of each b coefficient can be determined (H0: b = 0)
Logistic regression: transforms output y to accommodate binary (0/1) values
Treat y like a probability ranging between 0 and 1
ln(p/(1-p)) = a + bx
Equivalent to y = exp(a + bx)/(1 + exp(a + bx)), which is the logistic function
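A sketch of fitting the logistic model with scikit-learn (an assumed dependency); the simulated x and the outcome-generating coefficients are made up, and note that sklearn applies a mild L2 penalty by default, which anticipates the regularization discussion below.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)
    x = rng.normal(size=(200, 1))
    # Hypothetical 0/1 outcome generated from ln(p/(1-p)) = a + b*x with a=-0.5, b=2
    p = 1.0 / (1.0 + np.exp(-(-0.5 + 2.0 * x[:, 0])))
    y = rng.binomial(1, p)

    model = LogisticRegression().fit(x, y)   # default includes a small L2 penalty
    print(model.intercept_, model.coef_)     # estimates of a and b
    print(model.predict_proba(x[:3]))        # predicted probabilities between 0 and 1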
Constrained (or penalized) regression: least squares isn't the only constraint!
Also known as regularization: constraints on model parameters during fitting
Typically introduces a bias/variance tradeoff: penalization reduces variance at some cost in accuracy (bias)
Penalized regression (ridge): limits Σb² (the L2 norm), making the total magnitude of all coefficients "small"
Sparse regression (lasso): limits Σ|b| (the L1 norm), making the number of nonzero coefficients "small"
Important part is that you have lots of options for teaching a computer to predict outputs from inputs
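A sketch contrasting ordinary, ridge, and lasso fits with scikit-learn on simulated data in which only a few coefficients are truly nonzero; the penalty strengths (the alpha values) are arbitrary assumptions.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    rng = np.random.default_rng(8)
    # Hypothetical data: 50 samples, 20 features, only 3 truly nonzero coefficients
    X = rng.normal(size=(50, 20))
    true_b = np.zeros(20)
    true_b[:3] = [3.0, -2.0, 1.5]
    y = X @ true_b + rng.normal(scale=0.5, size=50)

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)   # penalizes sum of b^2: coefficients shrink
    lasso = Lasso(alpha=0.1).fit(X, y)   # penalizes sum of |b|: many coefficients hit zero

    print("nonzero OLS coefficients:  ", np.sum(np.abs(ols.coef_) > 1e-8))
    print("nonzero ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-8))
    print("nonzero lasso coefficients:", np.sum(np.abs(lasso.coef_) > 1e-8))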
Overfitting: easy to create a model of noise in your data (makes constraints important)
Often discuss model accuracy in terms of training and test data/accuracy/error
Typical cross-validation procedure:
Split your data into a ~2/3 training set and a ~1/3 test set
Learn model parameters by optimizing in the training set
Evaluate model generalization/accuracy by evaluating on the test set
In the best case, repeat this many times and average
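A sketch of the procedure above using scikit-learn helpers (an assumed dependency), repeated over several random train/test splits and averaged.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(9)
    X = rng.normal(size=(150, 10))
    y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=150)

    test_errors = []
    for repeat in range(10):   # repeat the split many times and average
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=repeat)
        model = Ridge(alpha=1.0).fit(X_tr, y_tr)   # learn parameters on the training 2/3
        test_errors.append(mean_squared_error(y_te, model.predict(X_te)))  # evaluate on the held-out 1/3

    print("mean held-out MSE:", np.mean(test_errors))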
(Reading)
Hypothesis testing: Pagano and Gauvreau, 10.1-5
T-tests: Pagano and Gauvreau, 11.1-2
Wilcoxon: Pagano and Gauvreau, 13.2-4
ANOVA: Pagano and Gauvreau, 12.1-2
Performance evaluation: Pagano and Gauvreau, 6.4
Linear regression: Pagano and Gauvreau, 18.1-3
Logistic regression: Pagano and Gauvreau, 20.1-2