Statistical Comparison of Two Learning Algorithms
Presented by: Payam Refaeilzadeh

Overview
How can we tell whether one algorithm learns better than another?
– Design an experiment to measure the accuracy of the two algorithms.
– Run multiple trials.
– Compare the samples, not just their means.
– Perform a statistically sound test on the two samples.
Is any observed difference significant? Is it due to a true difference between the algorithms, or to natural variation in the measurements?

Statistical Hypothesis Testing
Statistical hypothesis: a statement about the parameters of one or more populations.
Hypothesis testing: a procedure for deciding whether to accept or reject the hypothesis:
– Identify the parameter of interest.
– State a null hypothesis, H0.
– Specify an alternate hypothesis, H1.
– Choose a significance level α.
– State an appropriate test statistic.

Statistical Hypothesis Testing (cont.)
Null hypothesis (H0): a statement presumed to be true until statistical evidence shows otherwise. It usually specifies an exact value for a parameter, e.g. H0: µ = 30 kg.
Alternate hypothesis (H1): accepted if the null hypothesis is rejected.
Test statistic: a particular statistic calculated from the measurements of a random sample or experiment.
– A test statistic is assumed to follow a particular distribution (normal, t, chi-square, etc.).
– That distribution is used to judge the significance of the calculated test statistic.

Errors in Hypothesis Testing
A Type I error occurs when H0 is rejected even though it is in fact true.
– P(Type I error) = α, the significance level.
A Type II error occurs when we fail to reject H0 even though it is in fact false.
– P(Type II error) = β.
– Power = 1 − β = probability of correctly rejecting H0.
– Power measures the ability to distinguish between the two populations.

Paired t-Test
Collect the data in pairs: suppose n paired measurements have been made.
– Example: given a training set DTrain and a test set DTest, train both learning algorithms on DTrain and then measure their accuracies on DTest.
Assume:
– The measurements are independent.
– The measurements for each algorithm follow a normal distribution.
Then the test statistic T0 follows a t-distribution with n − 1 degrees of freedom.

Paired t-Test (cont.)

Trial   Algorithm 1 accuracy (X1)   Algorithm 2 accuracy (X2)
1       X_11                        X_21
2       X_12                        X_22
…       …                           …
n       X_1n                        X_2n

Assume X1 follows N(µ1, σ1) and X2 follows N(µ2, σ2). Let µD = µ1 − µ2 and Di = X_1i − X_2i for i = 1, 2, …, n. Then:
– D̄ = (1/n) Σ_{i=1..n} (X_1i − X_2i)
– S_D = STDEV(X_1i − X_2i), the sample standard deviation of the differences
Null hypothesis: H0: µD = Δ0
Test statistic: T0 = (D̄ − Δ0) / (S_D / √n)
Rejection criteria:
– H1: µD ≠ Δ0  →  reject H0 if |t0| > t_{α/2, n−1}
– H1: µD > Δ0  →  reject H0 if t0 > t_{α, n−1}
– H1: µD < Δ0  →  reject H0 if t0 < −t_{α, n−1}

Cross-Validated t-Test
A paired t-test on the 10 paired accuracies obtained from 10-fold cross-validation.
Advantages:
– Large training set size.
– Most powerful (Dietterich, 1998).
Disadvantages:
– The accuracy results are not independent (the training sets overlap).
– Somewhat elevated probability of Type I error (Dietterich, 1998).

5x2 Cross-Validated t-Test
Run 2-fold cross-validation 5 times.
– Use the results from the first of the five replications to estimate the mean difference.
– Use the results from all folds to estimate the variance.
Advantage:
– Lowest Type I error (Dietterich, 1998).
Disadvantage:
– Not as powerful as the 10-fold cross-validated t-test (Dietterich, 1998).

Re-sampled t-Test
Randomly divide the data into training and test sets (usually 2/3 – 1/3).
Run multiple trials (usually 30).
Perform a paired t-test on the trial accuracies.
This test has a very high probability of Type I error and should never be used.
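The paired t-test formulas above translate directly into a few lines of code. The sketch below (Python, assuming NumPy and SciPy are available) applies them in the cross-validated t-test setting: a paired t-test on the 10 fold accuracies from a single 10-fold cross-validation. The helper name paired_t_test and the accuracy values are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch of the paired t-test from the slides, applied to the
# 10 paired accuracies of a cross-validated t-test.
# Assumes acc1 and acc2 were measured on the *same* folds; the numbers
# below are made up purely for illustration.
import numpy as np
from scipy import stats

def paired_t_test(acc1, acc2, alpha=0.05, delta0=0.0):
    """Two-sided paired t-test: H0: mu_D = delta0 vs H1: mu_D != delta0."""
    d = np.asarray(acc1) - np.asarray(acc2)        # D_i = X_1i - X_2i
    n = len(d)
    d_bar = d.mean()                               # D-bar
    s_d = d.std(ddof=1)                            # S_D, sample std. dev. of the D_i
    t0 = (d_bar - delta0) / (s_d / np.sqrt(n))     # T_0, t-distributed with n-1 d.o.f. under H0
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t_{alpha/2, n-1}
    reject = abs(t0) > t_crit                      # rejection criterion for H1: mu_D != delta0
    return t0, t_crit, reject

# Hypothetical fold accuracies from one 10-fold cross-validation run.
acc_alg1 = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
acc_alg2 = [0.78, 0.80, 0.81, 0.77, 0.80, 0.76, 0.79, 0.78, 0.82, 0.77]
t0, t_crit, reject = paired_t_test(acc_alg1, acc_alg2)
print(f"t0 = {t0:.3f}, critical value = {t_crit:.3f}, reject H0: {reject}")
```

For routine use, scipy.stats.ttest_rel computes the same statistic together with a p-value; the explicit version above simply mirrors the slide notation.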
Calibrated Tests
Bouckaert (ICML 2003):
– It is very difficult to estimate the true degrees of freedom because the independence assumptions are violated.
– Instead of correcting the mean-difference statistic, calibrate the degrees of freedom.
– Recommendation: use 10 times repeated 10-fold cross-validation with 10 degrees of freedom.

References
R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. ICML 2003, pp. 51–58.
T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
D. C. Montgomery et al. Engineering Statistics. 2nd edition, Wiley, 2001.
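As a closing example, here is a rough sketch that follows the wording of the Calibrated Tests slide: run 10 times repeated 10-fold cross-validation and evaluate the resulting paired t statistic against 10 degrees of freedom. It is not a reimplementation of Bouckaert's calibrated test; the choice of classifiers (GaussianNB vs. DecisionTreeClassifier), the breast-cancer dataset, and the pooling of all 100 fold differences into one t statistic are assumptions made purely for illustration.

```python
# Sketch of the slide's recommendation: 10x repeated 10-fold cross-validation,
# with the paired t statistic evaluated against 10 degrees of freedom.
# Follows the slide's wording; see Bouckaert (2003) for the exact procedure.
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

# The seeded CV splitter yields identical folds for both algorithms,
# which is what makes the comparison paired.
acc1 = cross_val_score(GaussianNB(), X, y, cv=cv)
acc2 = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

d = acc1 - acc2                      # 100 paired fold differences
t0 = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
df = 10                              # calibrated degrees of freedom, per the slide
p_value = 2 * stats.t.sf(abs(t0), df=df)
print(f"t0 = {t0:.3f}, p-value with {df} d.o.f. = {p_value:.3f}")
```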