Statistical Methods for Computer Science
Marie desJardins ([email protected])
CMSC 601
April 9, 2012
Material adapted from October 1999 slides by Tom Dietterich, with permission
Statistical Analysis of Data
 Given a set of measurements of a value, how certain
can we be of the value?
 Given a set of measurements of two values, how
certain can we be that the two values are different?
 Given a measured outcome, along with several
condition or treatment values, how can we remove
the effect of unwanted conditions or treatments on
the outcome?
Measuring CPU Time
 Here are 37 measurements of the CPU time required
to compute C(10000, 500):
0.27 0.25 0.23 0.24 0.26 0.24 0.26 0.25 0.24 0.25
0.25 0.24 0.25 0.24 0.25 0.26 0.24 0.25 0.25 0.25
0.25 0.25 0.24 0.25 0.24 0.25 0.25 0.24 0.25 0.25
0.24 0.25 0.24 0.24 0.25 0.25 0.26
 What is the “true” CPU cost of this computation?
 Before doing any calculations, look at the distribution of the data
CPU Data Distribution
Kernel Density Estimate
 Kernel density: place a small Gaussian distribution (“kernel”) around each data point, and sum them
 Useful for visualization; also often used as a regression technique
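A sketch of the idea in Python (the function name and the bandwidth value are illustrative, not from the slides):

    import math

    def kde(x, data, bandwidth=0.005):
        """Kernel density estimate at x: an average of Gaussian kernels,
        one centered on each data point, each with standard deviation
        `bandwidth`."""
        norm = 1.0 / (math.sqrt(2 * math.pi) * bandwidth * len(data))
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)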
Sample Mean
 The data seem reasonably close to a normal (Gaussian, or bell-curve) distribution
 Given this assumption, we can compute a sample mean:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = 0.248
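For instance, computing the sample mean of the 37 measurements in Python:

    from statistics import mean

    # The 37 CPU-time measurements from the earlier slide
    cpu_times = [0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                 0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                 0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                 0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26]
    print(round(mean(cpu_times), 3))  # 0.248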
 How certain can we be that this is the true value?
 Confidence interval [min, max]:
 Suppose we drew many random samples of size n=37, and computed the sample means
 95% of the time, the sample mean would lie between min and max
Confidence Intervals via Resampling
 We can simulate this process algorithmically
 Draw 1000 random subsamples (with replacement) from the original 37 points
 This process makes no assumption about a Gaussian distribution!
 Sort the means of these subsamples
 Choose the 26th and 975th values as min and max of a 95%
confidence interval (includes 95% of the sample means!)
 Result: The resampled confidence interval is [0.245, 0.251]
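A minimal sketch of this procedure in Python (the function name and its `stat` parameter are illustrative; `cpu_times` is the list defined above):

    import random
    from statistics import mean

    def bootstrap_ci(data, stat=mean, n_resamples=1000, alpha=0.05):
        """Percentile bootstrap: resample with replacement, sort the
        resampled statistics, and read off the alpha/2 and 1-alpha/2 values."""
        values = sorted(stat(random.choices(data, k=len(data)))
                        for _ in range(n_resamples))
        lo = values[int(n_resamples * alpha / 2)]        # ~2.5th percentile
        hi = values[int(n_resamples * (1 - alpha / 2))]  # ~97.5th percentile
        return lo, hi

    print(bootstrap_ci(cpu_times))  # roughly (0.245, 0.251)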
Confidence Intervals via
Distributional Theory
 The Central Limit Theorem says that the distribution of the sample means is (approximately) normal
 If the original data is normally distributed with mean μ and standard deviation σ, then the sample means will be normally distributed with mean μ and standard deviation σ′ = σ/√n (but we don’t know the original μ and σ...):
p(\bar{x}) = \frac{1}{\sqrt{2\pi}\,\sigma'} \, e^{-\frac{1}{2}\left(\frac{\bar{x}-\mu}{\sigma'}\right)^{2}}
 Note that it isn’t important to remember this formula, since Matlab, R, etc. will do this for you. But it is very important to understand why you are computing it!
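As a sketch, the resulting normal-theory interval can be computed with just the Python standard library (NormalDist needs Python 3.8+; `cpu_times` as above; the sample standard deviation stands in for the unknown σ):

    from statistics import NormalDist, mean, stdev

    def normal_ci(data, conf=0.95):
        """CLT-based interval for the mean: x-bar +/- z * (s / sqrt(n)),
        using the sample standard deviation s as an estimate of sigma."""
        n = len(data)
        z = NormalDist().inv_cdf(0.5 + conf / 2)  # ~1.96 for 95%
        half = z * stdev(data) / n ** 0.5
        return mean(data) - half, mean(data) + half

    print(normal_ci(cpu_times))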
t Distribution
 Instead of assuming a normal distribution, we can use a t distribution (sometimes called a “Student’s t distribution”), which has three parameters: μ, σ, and the degrees of freedom (d.f. = n−1)
 The probability density function looks somewhat like a normal distribution, but with heavier tails (and a correspondingly lower peak); as n increases, it approaches the normal
 For small samples, this distribution yields slightly wider (more conservative) confidence limits than a direct application of the central limit theorem:
Distributional Confidence Intervals
 We can use the mathematical formula for the t
distribution to compute a p (typically, p=0.95)
confidence interval:
 The 0.025 t-value, t0.025, is the value such that the probability that μ−μ′ < t0.025 is 0.975
 The 95% confidence interval is then [μ′−t0.025, μ′+t0.025]
 For the CPU example, t0.025 is 0.028, so the distributional confidence interval is [0.220, 0.276] – wider than the bootstrapped CI of [0.245, 0.251]
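A sketch of the textbook t-based interval, x̄ ± t·s/√n (assuming SciPy is available; `stats.t.ppf` is the t distribution's inverse CDF):

    from statistics import mean, stdev
    from scipy import stats

    def t_ci(data, conf=0.95):
        """t-based interval for the mean, with n-1 degrees of freedom."""
        n = len(data)
        t_crit = stats.t.ppf(0.5 + conf / 2, df=n - 1)
        half = t_crit * stdev(data) / n ** 0.5
        return mean(data) - half, mean(data) + half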
Bootstrap Computations of
Other Statistics
 The bootstrap method can be used to compute other sample statistics for which the distributional method isn’t appropriate (see the sketch at the end of this slide):
 median
 mode
 variance
 Because the tails and outlying values may not be well
represented in a sample, the bootstrap method is not
as useful for statistics involving the “ends” of the
distribution:
 minimum
 maximum
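With the `bootstrap_ci` sketch from earlier (its `stat` parameter exists for exactly this reason), these are one-line changes:

    from statistics import median, variance

    print(bootstrap_ci(cpu_times, stat=median))    # bootstrap CI for the median
    print(bootstrap_ci(cpu_times, stat=variance))  # bootstrap CI for the variance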
Measuring the Number of Occurrences
of Events
 In CS, we often want to know how often something
occurs:
 How many times does a process complete successfully?
 How many times do we correctly predict membership in a class?
 How many times do we find the top search result?
 Again, the sample rate θ’ is what we have observed,
but we would like to know the “true” rate θ
Bootstrap Confidence Intervals
for Rates
 Suppose we have observed 100 predictions of a decision tree,
and 88 of them were correct
 Draw many (say, 1000) samples of size n, with replacement,
from the n observed predictions (here, n=100), and compute the
sample classification rate
 Sort the sample rates θi in increasing order
 Choose the 26th and 975th values as the ends of the confidence
interval: here, the confidence interval is [0.81, 0.94]
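A sketch in Python, reusing the percentile idea (the function name is illustrative):

    import random

    def bootstrap_rate_ci(successes, n, n_resamples=1000):
        """Bootstrap CI for a success rate, e.g. 88 correct out of 100."""
        outcomes = [1] * successes + [0] * (n - successes)
        rates = sorted(sum(random.choices(outcomes, k=n)) / n
                       for _ in range(n_resamples))
        return rates[int(n_resamples * 0.025)], rates[int(n_resamples * 0.975)]

    print(bootstrap_rate_ci(88, 100))  # roughly (0.81, 0.94)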
Binomial Distributional
Confidence
 If we assume that the classifier is a “biased coin” with
probability θ of coming up heads, then we can use
the binomial distribution to analytically compute the
confidence interval
 This requires a small correction because the binomial distribution is actually discrete, but we want a continuous estimate
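One common form of that correction widens the normal approximation by 0.5/n; the slides don't specify which correction they use, so this sketch is an assumption:

    from statistics import NormalDist

    def binomial_ci(successes, n, conf=0.95):
        """Normal approximation to the binomial CI, widened by a
        0.5/n continuity correction for the discrete distribution."""
        p = successes / n
        z = NormalDist().inv_cdf(0.5 + conf / 2)
        half = z * (p * (1 - p) / n) ** 0.5 + 0.5 / n
        return max(0.0, p - half), min(1.0, p + half)

    print(binomial_ci(88, 100))  # roughly (0.81, 0.95)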
Comparing Two Measurements
 Consider the CPU measurements of the earlier
example, and suppose we have performed the same
computation on a different machine, yielding these
CPU times:
0.21 0.20 0.20 0.19 0.20 0.19 0.18 0.20 0.19 0.19
0.19 0.19 0.20 0.18 0.19 0.20 0.22 0.20 0.20 0.20
0.19 0.20 0.18 0.19 0.19 0.20 0.20 0.22 0.18 0.29
0.21 0.23 0.20
 These times certainly seem faster than the first
machine, which yielded a distributional confidence
interval of [0.220, 0.276] – but how can we be sure?
Kernel Density Comparison
 Visually, the second machine (Shark3) is much faster
than the first (Darwin):
Difference Estimation
 Bootstrap testing: repeat many times:
 Draw a bootstrap sample from each of the machines, and compute the sample means
 If Shark3 is faster than Darwin more than 95% of the time, we can be 95% confident that it really is faster
 We can also compute a 95% bootstrap confidence interval
on the difference between the means – this turns out to be
[0.0461, 0.0553]
 If the samples are drawn from t distributions, then the
difference between the sample means also has a t
distribution
 Confidence interval on this difference: [0.0463, 0.0555]
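A sketch of the bootstrap comparison (assuming `darwin` and `shark3` hold the two lists of CPU times from the earlier slides):

    import random
    from statistics import mean

    def bootstrap_diff(slow, fast, n_resamples=1000):
        """Bootstrap the difference of sample means between two machines."""
        diffs = sorted(mean(random.choices(slow, k=len(slow)))
                       - mean(random.choices(fast, k=len(fast)))
                       for _ in range(n_resamples))
        frac = sum(d > 0 for d in diffs) / n_resamples  # how often slow > fast
        ci = (diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)])
        return frac, ci

    frac, ci = bootstrap_diff(darwin, shark3)  # frac near 1.0; ci ~ (0.046, 0.055)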
Hypothesis Testing
 Is the true difference zero, or more than zero?
 Use classical statistical rejection testing
 Null hypothesis: The two machines have the same speed (i.e., μ, the true difference in mean CPU time, is equal to zero)
 Can we reject this hypothesis, based on the observed data?
 If the null hypothesis were true, what is the probability we
would have observed this data?
 We can measure this probability using the t distribution
 In this case, the computed t value = (μ1 – μ2) / σ’ = 21.69
 The probability of seeing this t value if μ were actually zero is vanishingly small: the 99.999% confidence interval (under the null hypothesis) is [-4.59, 4.59], so the probability of this t value is (much) less than 0.00001
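In SciPy this is a two-sample t-test (assuming SciPy and the `darwin`/`shark3` lists from above):

    from scipy import stats

    t_stat, p_value = stats.ttest_ind(darwin, shark3)
    # A tiny p_value means the observed difference would be very unlikely
    # under the null hypothesis of equal speeds, so we reject it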
Paired Differences
 Suppose we had a set of 10 different benchmark
programs that we ran on the two machines, yielding
these CPU times:
 Obviously, we don’t
want to just compare
the means, since the
programs have such
different running times
Kernel Density Visualization
 CPU1 seems to be systematically faster (offset to the left) than CPU2
Scatterplot Visualization
 CPU1 is always faster than CPU2 (i.e., above the
diagonal line that corresponds to equal speed)
Sequential Visualization
 The correlation of program “difficulty” across the two machines (and the consistently faster speed of CPU1) is even more obvious in this line plot, ordered by program number:
Distribution Analysis I
 If the differences are in the same “units,” we can subtract the
CPU times for the “paired” tests and assume a t distribution on
these differences
 The probability of observing a sample mean difference as large as 0.02779, given a null hypothesis of the machines having the same speed, is 0.0000466 – we can reject the null hypothesis
 If we have no prior belief about which machine is faster, we
should use a “two-tailed test”
 The probability of observing a sample mean difference this large in either direction is 0.0000932 – slightly larger, but still sufficiently improbable that we can be sure the machines have different speeds
 Note that we can also use a bootstrap analysis on this type of
paired data
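With SciPy, the paired version is `ttest_rel` (assuming `cpu1` and `cpu2` are the paired lists of benchmark times):

    from scipy import stats

    t_stat, p_two = stats.ttest_rel(cpu1, cpu2)  # two-tailed by default
    p_one = p_two / 2  # one-tailed, given a prior belief about the direction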
Paired vs. Non-Paired
 If we don’t pair the data (just compare the overall
mean, not the differences for paired tests):
 Distributional analysis doesn’t let us reject the null hypothesis
 Bootstrap analysis doesn’t let us reject the null hypothesis
Sign Tests
 I mentioned before that the paired t-test is
appropriate if the measurements are in the same
“units”
 If the magnitude of the difference is not important, or
not meaningful, we still can compare performance
 Look at the sign of the difference (here, CPU1 is
faster 10 out of 10 times; but in another case, it might
only be faster 9 out of 10 times)
 Use the binomial distribution (flip a coin to get the
sign) to compute a confidence interval for the
probability that CPU1 is faster
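A sketch of the sign test via the exact binomial tail (`math.comb` needs Python 3.8+):

    from math import comb

    def sign_test_p(wins, n):
        """Two-sided sign test: probability of a split at least this
        lopsided if each trial were a fair coin flip."""
        k = max(wins, n - wins)
        tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    print(sign_test_p(10, 10))  # ~0.002
    print(sign_test_p(9, 10))   # ~0.021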
Other Important Topics
 Regression analysis
 Cross-validation
 Human subjects analysis and user study design
 Analysis of Variance (ANOVA)
 For your particular investigation, you need to know
which of these topics are relevant, and to learn about
them!
Statistically Valid Experimental
Design
 Make sure you understand the nuances before you
design your experiments...
 ...and definitely before you analyze your experimental data!
 Designing the statistical methods (and hypotheses)
after the fact is not valid!
 You can often find a hypothesis and an associated statistical method ex post facto – i.e., design an experiment to fit the data instead of the other way around
 In the worst case, doing this is downright unethical
 In the best case, it shows a lack of clear research objectives and may not be reproducible or meaningful