Wednesday #1

10  Problem set questions

20  Hypothesis testing
    - Test statistic: numerical summary of input data with a known sampling distribution for null (random) data
        - Example: flip a coin 100 times; the data are 100 0/1 values, and the test statistic is the fraction of heads
        - Reduces the entire dataset to a single value with an expected binomial distribution around 0.5
        - Some common statistics are Z (with a normal distribution), T (with a t-distribution), etc.
        - In these days of computational simulation, can be anything: mean, sum, etc.
    - Null hypothesis: statement of the expected distribution of the test statistic under the assumption of no effect/bias/etc.
        - Often stated as equivalence of means (or of some other test statistic)
        - Alternative hypothesis: everything else
    - p-value: probability of observing a test statistic at least as extreme as the one observed if the null hypothesis is true
        - Example: probability of observing 90%+ heads if the coin is fair (see the binomial sketch after this list)
        - Example: probability of counting 50%+ petite yeast colonies if the expected WT rate is 15%
        - Note that p-values are typically stated in terms of extreme values, e.g. "at least 90% or more"
    - One- versus two-sided
        - Two-sided: H0: μ = 0, HA: μ ≠ 0; one-sided: H0: μ ≤ 0, HA: μ > 0
        - Note that directionality halves/doubles p-values
        - Often only one- or two-sided will make sense for a particular situation
        - In cases where you can choose, it is almost always more correct to make the stricter choice
    - Parametric versus nonparametric
        - Parametric: distribution of the test statistic under the null hypothesis has some analytically described shape
            - Normal distribution, t-distribution, etc.
        - Nonparametric: distribution of the test statistic calculated directly by simulation/randomization
            - Sometimes referred to as bootstrap (not to be confused with the machine learning bootstrap)
            - Take your data, shuffle it N times, and see how often you get a test statistic as extreme as the real data (see the permutation sketch after this list)
            - Incredibly useful: inherently bakes the structure of the data into significance calculations
                - e.g. population structure for GWAS, coexpression structure for microarrays, etc.
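To make the coin-flip example above concrete, here is a minimal sketch (assuming SciPy is available) of the one-sided p-value for observing 90 or more heads in 100 flips of a fair coin:

```python
# One-sided binomial test for the coin-flip example: P(>= 90 heads | fair coin).
from scipy.stats import binomtest

result = binomtest(k=90, n=100, p=0.5, alternative="greater")
print(result.pvalue)  # probability of 90 or more heads under the null (fair coin)
```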
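And a minimal sketch of the nonparametric/randomization idea: shuffle the pooled data many times and ask how often the shuffled test statistic is as extreme as the real one. The two samples here are simulated purely for illustration.

```python
# Permutation (randomization) test: compare the observed difference in means
# to the distribution obtained by shuffling the pooled data N times.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.5, 1.0, size=30)   # hypothetical sample 1
b = rng.normal(0.0, 1.0, size=30)   # hypothetical sample 2

observed = abs(a.mean() - b.mean())
pooled = np.concatenate([a, b])
n_perm = 10_000
hits = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = abs(pooled[:a.size].mean() - pooled[a.size:].mean())
    hits += diff >= observed
print("permutation p-value:", (hits + 1) / (n_perm + 1))
```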
    - Can also provide a good summary of the shape of the data relative to expectation
        - Q-Q plots do this for the general case (scatter plot of the quantiles of any two distributions/vectors)
    - Repeated tests and high-dimensional data
        - If you t-test 20,000 genes at p=0.05, you expect ~1,000 to pass just by chance even when the null is true for all of them
        - Very hard to have sufficient samples (n) for large numbers of features (p)
        - Bonferroni correction: adjusts the significance threshold for multiple hypothesis testing
            - Set α' = α/p for p features (hypotheses being tested)
            - Can be too strict for large p
            - Controls the Family-Wise Error Rate (FWER): % of input features (p) expected to "pass" by chance
        - Alternative: control the False Discovery Rate (FDR), the % of output (passing) features expected to be false by chance
            - FDR q-value = (p-value) × (total # tests) / (rank of p-value); see the Benjamini-Hochberg sketch below
            - Makes the control depend on the number of output (passing) features, not the total number of input features

20  Common one/two-sample hypothesis tests
    - T-test: one or two normally distributed, homoscedastic, independent sets of values
        - Used to compare two samples assumed to be drawn from underlying normal distributions
        - The t-distribution is heavier-tailed than the normal to take the doubled variance uncertainty into account
            - Contrast with the normal distribution of the z-test, used when one mean is known with certainty
        - One-sample: mean of a normally distributed sample differs from zero
            - t = x̄ / (s/√n)
        - Two-sample unpaired: means of two normal samples of equal variance differ
            - t = (x̄₁ − x̄₂) / (S·√(1/n₁ + 1/n₂)), S = √(((n₁−1)s₁² + (n₂−1)s₂²) / (n₁+n₂−2)), an estimator of the common/pooled stdev
        - Two-sample paired: means of two matched normal samples of equal variance differ
            - Tests whether a distribution shifts, since before/after points are non-independent
            - Equivalent statement: does the mean of the sample of differences differ from zero
        - Welch's t-test: unequal variance
            - Identical save that S = √(s₁²/n₁ + s₂²/n₂) and the degrees of freedom for the t-distribution are funky
    - Mann-Whitney U / Mann-Whitney-Wilcoxon: two independent sets of values
        - U₁ = R₁ − n₁(n₁+1)/2, where R₁ = sum of the ranks of sample 1's values within the pooled values
        - Two-sample unpaired: distributions differ (rank-based counterpart of the two-sample t-test)
        - Two-sample paired: distributions of matched samples differ (rank-based counterpart of the paired t-test)
        - Performed by rank-transforming the data (like Pearson -> Spearman)
        - Asks whether the ordering of ranks is surprisingly different
        - Example: S1 = [0.2, 0.5, 0.9] and S2 = [0.1, 0.3, 0.4] (subscripts mark the sample of origin)
            - Pooled = [0.1₂, 0.2₁, 0.3₂, 0.4₂, 0.5₁, 0.9₁], Ranks = [1₂, 2₁, 3₂, 4₂, 5₁, 6₁]
            - R₁ = 2+5+6 = 13, U₁ = 13 − 3·4/2 = 7
        - Of note below: AUC = U₁/(n₁·n₂)
    - Kolmogorov-Smirnov: one or two independent sets of values
        - One-sample: distribution differs from a chosen reference
        - Two-sample unpaired: two distributions differ
        - Performed by transforming the data to cumulative distributions
        - Asks whether the CDFs are surprisingly different

20  ANOVA: >2 normally distributed, homoscedastic, independent sets of values
    - Extension of the t-test to more than two groups; compares one continuous variable with one categorical
    - Could just repeat a t-test many times between pairs of groups; boom, multiple testing!
    - Instead tests whether the between-groups variance is surprisingly different from the within-groups variance
        - Within-groups calculated as the weighted (by group size) average of the sample variances
        - Between-groups calculated from the variance of the group means around the overall mean
    - Only tells you that the groups do (or don't) differ, not which ones or how!
        - Post-hoc pairwise tests (multiple hypothesis corrected)
    - Kruskal-Wallis: nonparametric alternative to ANOVA
        - Replaces data with ranks; exactly the same relationship as Spearman/Pearson or MWW/t-test
    - Chi-square: assesses whether a histogram (multinomial) agrees with a theoretical distribution (e.g. uniform)
        - Closely related to Fisher's exact test, which compares two multinomials (histograms)
        - I rarely find either of these useful because genomic data rarely pair categoricals
    - SciPy implementations of these tests are sketched below
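A minimal sketch of the FDR q-value calculation described under "Repeated tests" above (the Benjamini-Hochberg procedure: q = p × total tests / rank, then made monotone); the example p-values are hypothetical.

```python
# Benjamini-Hochberg q-values: q = p * (total # tests) / (rank of p),
# then enforced to be monotone from the largest p downward.
import numpy as np

def bh_qvalues(pvals):
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    q = p[order] * n / np.arange(1, n + 1)
    q = np.minimum.accumulate(q[::-1])[::-1]   # make q-values monotone
    out = np.empty(n)
    out[order] = np.clip(q, 0, 1)
    return out

print(bh_qvalues([0.001, 0.008, 0.039, 0.041, 0.27, 0.75]))
```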
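The tests in the two blocks above all have off-the-shelf SciPy implementations; a quick sketch on small made-up samples (which test is appropriate still depends on the normality, equal-variance, and pairing assumptions discussed above):

```python
# Common one/two-sample and >2-group tests via SciPy; data are hypothetical.
from scipy import stats

s1 = [0.2, 0.5, 0.9, 0.7, 0.4]
s2 = [0.1, 0.3, 0.4, 0.2, 0.6]
s3 = [0.8, 0.6, 0.7, 0.9, 0.5]

print(stats.ttest_1samp(s1, 0.0))                # one-sample t-test vs. mean 0
print(stats.ttest_ind(s1, s2))                   # two-sample t-test, pooled variance
print(stats.ttest_ind(s1, s2, equal_var=False))  # Welch's t-test, unequal variance
print(stats.ttest_rel(s1, s2))                   # paired t-test
print(stats.mannwhitneyu(s1, s2))                # Mann-Whitney U
print(stats.ks_2samp(s1, s2))                    # two-sample Kolmogorov-Smirnov
print(stats.f_oneway(s1, s2, s3))                # one-way ANOVA, >2 groups
print(stats.kruskal(s1, s2, s3))                 # Kruskal-Wallis
print(stats.chisquare([18, 22, 20, 40]))         # chi-square goodness of fit vs. uniform
```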
20  Performance evaluation
    - Gold standard: list of correct answers, often discrete 0/1 probabilities or true numerical values
        - Positives are one/true in the gold standard; negatives are zero/false in the gold standard
        - Probabilistically, positives have the null H0 false; negatives have H0 true
    - Confusion matrix:
                        H0 Accepted                  H0 Rejected
        H0 True         True Negative                False Positive (Type I)
        H0 False        False Negative (Type II)     True Positive
    - Predictions: true positives, false positives, true negatives, false negatives
        - Values above/below a cutoff confidence/probability threshold that agree/disagree with the standard
    - Type I error: false positive, P(reject H0 | H0)
    - Type II error: false negative, P(accept H0 | ~H0)
    - Power: P(reject H0 | ~H0)
    - Performance evaluation: precision/recall and sensitivity/specificity
        - Precision: TP/(TP+FP) = P(~H0 | reject H0)
        - Recall: TP/P = TP/(TP+FN) = P(reject H0 | ~H0) = true positive rate (TPR)
        - Sensitivity: recall
        - Specificity: TN/(TN+FP) = P(accept H0 | H0) = true negative rate = 1 - false positive rate (FPR)
    - Receiver Operating Characteristic (ROC) plot: FPR (x) vs. TPR (y)
        - Area Under the Curve (AUC): area under the ROC
        - Random = 0.5, perfect = 1.0, perfectly wrong = 0.0
    - AUPRC: area under the precision/recall curve
        - No baseline; random performance is not fixed to a particular value! (a scikit-learn sketch follows the regression notes below)

20  Inference: linear regression
    - Correlation etc. make non-causal statements about the relationship between two samples
    - What if you want to predict an output as a function of one or more inputs?
        - This inherently makes a distinction between a response and one or more explanatory variables
    - y = a + bx + e
        - Assumes that y is normally distributed around a + bx with fixed variance (i.e. e ~ N(0, σ²))
        - Also again assumes independence of the x/y measurements
    - Linear regression fits the a and b parameters with respect to a constraint, typically least squares
        - That is, minimize Σᵢ (yᵢ − f(xᵢ))²
    - This is an excellent, simple way to make predictions about an output variable
        - Goodness of fit can be given a p-value (taking the normality/independence assumptions into account)
        - Efficient, can accommodate many x's, and can transform the x's nonlinearly as desired
        - Provides several good views of the data to validate goodness of fit and homoscedasticity (residuals plot)
    - Multiple regression: bakes in additional coefficients
        - y = b₀ + b₁x₁ + b₂x₂ + … = b₀ + b·x
        - Significance of each b coefficient can be determined (H0: bⱼ = 0)
    - Logistic regression: transforms the output y to accommodate binary (0/1) values
        - Treat y like a probability ranging between 0 and 1
        - ln(p/(1−p)) = a + bx
        - Equivalent to y = exp(a + bx)/(1 + exp(a + bx)), which is the logistic function
        - (see the regression sketch below)
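A minimal sketch of the performance-evaluation quantities above using scikit-learn; the gold-standard labels and prediction scores here are made up for illustration.

```python
# ROC AUC and area under the precision/recall curve for a toy gold standard.
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

y_true  = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]                        # gold standard (1 = positive)
y_score = [0.9, 0.8, 0.75, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]  # predicted confidences

print("AUC (ROC):", roc_auc_score(y_true, y_score))
precision, recall, _ = precision_recall_curve(y_true, y_score)
print("AUPRC:", auc(recall, precision))
```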
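And a quick sketch of fitting the regressions described above: scipy.stats.linregress for the simple linear case (slope, intercept, and a p-value for H0: b = 0) and scikit-learn for the logistic case. All data are simulated for illustration.

```python
# Simple linear regression (least squares) and logistic regression on toy data.
import numpy as np
from scipy.stats import linregress
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=50)       # y = a + b*x + e

fit = linregress(x, y)
print("a =", fit.intercept, "b =", fit.slope, "p-value for H0: b = 0:", fit.pvalue)

# Logistic regression: binary outcome modeled as ln(p/(1-p)) = a + b*x
y_bin = (y > y.mean()).astype(int)
logit = LogisticRegression().fit(x.reshape(-1, 1), y_bin)
print("logistic a =", logit.intercept_[0], "b =", logit.coef_[0, 0])
```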
    - Constrained (or penalized) regression: least squares isn't the only constraint!
        - Also known as regularization: constraints on the model parameters during fitting
        - Typically introduces a tradeoff between penalization (variance) and accuracy (bias)
        - Penalized regression (ridge): limits Σbⱼ², making the total size of all coefficients "small"
        - Sparse regression (lasso): limits Σ|bⱼ|, making the number of nonzero coefficients "small"
        - The important part is that you have lots of options for teaching a computer to predict outputs from inputs
    - Overfitting: it is easy to create a model of the noise in your data (which makes constraints important)
        - Model accuracy is often discussed in terms of training and test data/accuracy/error
        - Typical cross-validation procedure (sketched after the reading list):
            - Split your data into a ~2/3 training set and a ~1/3 test set
            - Learn model parameters by optimizing on the training set
            - Evaluate model generalization/accuracy by evaluating on the test set
            - In the best case, repeat this many times and average

0   (Reading)
    Hypothesis testing:       Pagano and Gauvreau, 10.1-5
    T-tests:                  Pagano and Gauvreau, 11.1-2
    Wilcoxon:                 Pagano and Gauvreau, 13.2-4
    ANOVA:                    Pagano and Gauvreau, 12.1-2
    Performance evaluation:   Pagano and Gauvreau, 6.4
    Linear regression:        Pagano and Gauvreau, 18.1-3
    Logistic regression:      Pagano and Gauvreau, 20.1-2
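Circling back to the constrained-regression and cross-validation notes above (before the reading list), here is a minimal scikit-learn sketch of ridge vs. lasso with a ~2/3 train / ~1/3 test split; the data are simulated for illustration.

```python
# Ridge (sum of b^2 penalized) vs. lasso (sum of |b| penalized) regression,
# evaluated on a held-out test set to gauge generalization rather than fit.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 20))                 # 20 input features
b = np.zeros(20); b[:3] = [2.0, -1.0, 0.5]     # only 3 truly nonzero coefficients
y = X @ b + rng.normal(0, 1, size=150)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__,
          "test R^2 =", round(model.score(X_te, y_te), 3),
          "nonzero coefficients =", int(np.sum(model.coef_ != 0)))
```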