Statistical Methods for Computer Science
Marie desJardins ([email protected])
CMSC 601
April 9, 2012
Material adapted from October 1999 slides by Tom Dietterich, with permission
Statistical Analysis of Data
 Given a set of measurements of a value, how certain
can we be of the value?
 Given a set of measurements of two values, how
certain can we be that the two values are different?
 Given a measured outcome, along with several
condition or treatment values, how can we remove
the effect of unwanted conditions or treatments on
the outcome?
Measuring CPU Time
 Here are 37 measurements of the CPU time required
to compute C(10000, 500):
0.27 0.25 0.23 0.24 0.26 0.24 0.26 0.25 0.24 0.25
0.25 0.24 0.25 0.24 0.25 0.26 0.24 0.25 0.25 0.25
0.25 0.25 0.24 0.25 0.24 0.25 0.25 0.24 0.25 0.25
0.24 0.25 0.24 0.24 0.25 0.25 0.26
 What is the “true” CPU cost of this computation?
 Before doing any calculations, look at the distribution of the data
CPU Data Distribution
Kernel Density Estimate
 Kernel density: place a small Gaussian distribution (“kernel”) around each data point, and sum them
 Useful for visualization; also often used as a regression technique
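A sketch of the idea in Python (the function name and the bandwidth value are illustrative, not from the slides):

    import math

    def kde(x, data, bandwidth=0.005):
        """Kernel density estimate at x: an average of Gaussian kernels,
        one centered on each data point, each with standard deviation
        `bandwidth`."""
        norm = 1.0 / (math.sqrt(2 * math.pi) * bandwidth * len(data))
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)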
Sample Mean
 The data seem reasonably close to a normal (Gaussian, or bell-curve) distribution
 Given this assumption, we can compute a sample mean:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = 0.248
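For instance, computing the sample mean of the 37 measurements in Python:

    from statistics import mean

    # The 37 CPU-time measurements from the earlier slide
    cpu_times = [0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                 0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                 0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                 0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26]
    print(round(mean(cpu_times), 3))  # 0.248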
 How certain can we be that this is the true value?
 Confidence interval [min, max]:
 Suppose we drew many random samples of size n=37, and computed the sample means
 95% of the time, the sample mean would lie between min and max
Confidence Intervals via Resampling
 We can simulate this process algorithmically
 Draw 1000 random subsamples (with replacement) from the original 37 points
 This process makes no assumption about a Gaussian distribution!
 Sort the means of these subsamples
 Choose the 26th and 975th values as min and max of a 95%
confidence interval (includes 95% of the sample means!)
 Result: The resampled confidence interval is [0.245, 0.251]
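A minimal sketch of this procedure in Python (the function name and its `stat` parameter are illustrative; `cpu_times` is the list defined above):

    import random
    from statistics import mean

    def bootstrap_ci(data, stat=mean, n_resamples=1000, alpha=0.05):
        """Percentile bootstrap: resample with replacement, sort the
        resampled statistics, and read off the alpha/2 and 1-alpha/2 values."""
        values = sorted(stat(random.choices(data, k=len(data)))
                        for _ in range(n_resamples))
        lo = values[int(n_resamples * alpha / 2)]        # ~2.5th percentile
        hi = values[int(n_resamples * (1 - alpha / 2))]  # ~97.5th percentile
        return lo, hi

    print(bootstrap_ci(cpu_times))  # roughly (0.245, 0.251)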
Confidence Intervals via
Distributional Theory
 The Central Limit Theorem says that the distribution of the sample means is (approximately) normal
 If the original data is normally distributed with mean μ and standard deviation σ, then the sample means will be normally distributed with mean μ and standard deviation σ′ = σ/√n (but we don’t know the original μ and σ...):
p(\bar{x}) = \frac{1}{\sqrt{2\pi}\,\sigma'} \, e^{-\frac{1}{2}\left(\frac{\bar{x}-\mu}{\sigma'}\right)^{2}}
 Note that it isn’t important to remember this formula, since Matlab, R, etc. will do this for you. But it is very important to understand why you are computing it!
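As a sketch, the resulting normal-theory interval can be computed with just the Python standard library (NormalDist needs Python 3.8+; `cpu_times` as above; the sample standard deviation stands in for the unknown σ):

    from statistics import NormalDist, mean, stdev

    def normal_ci(data, conf=0.95):
        """CLT-based interval for the mean: x-bar +/- z * (s / sqrt(n)),
        using the sample standard deviation s as an estimate of sigma."""
        n = len(data)
        z = NormalDist().inv_cdf(0.5 + conf / 2)  # ~1.96 for 95%
        half = z * stdev(data) / n ** 0.5
        return mean(data) - half, mean(data) + half

    print(normal_ci(cpu_times))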
t Distribution
 Instead of assuming a normal distribution, we can use a t distribution (sometimes called a “Student’s t distribution”), which has three parameters: μ, σ, and the degrees of freedom (d.f. = n−1)
 The probability density function looks somewhat like a normal distribution, but with heavier tails (and a correspondingly lower peak); as n increases, it approaches the normal
 For small samples, this distribution yields slightly wider (more conservative) confidence limits than a direct application of the central limit theorem:
Distributional Confidence Intervals
 We can use the mathematical formula for the t
distribution to compute a p (typically, p=0.95)
confidence interval:
 The 0.025 t-value, t0.025, is the value such that the probability that μ−μ′ < t0.025 is 0.975
 The 95% confidence interval is then [μ′−t0.025, μ′+t0.025]
 For the CPU example, t0.025 is 0.028, so the distributional confidence interval is [0.220, 0.276] – wider than the bootstrapped CI of [0.245, 0.251]
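A sketch of the textbook t-based interval, x̄ ± t·s/√n (assuming SciPy is available; `stats.t.ppf` is the t distribution's inverse CDF):

    from statistics import mean, stdev
    from scipy import stats

    def t_ci(data, conf=0.95):
        """t-based interval for the mean, with n-1 degrees of freedom."""
        n = len(data)
        t_crit = stats.t.ppf(0.5 + conf / 2, df=n - 1)
        half = t_crit * stdev(data) / n ** 0.5
        return mean(data) - half, mean(data) + half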
Bootstrap Computations of
Other Statistics
 The bootstrap method can be used to compute other sample statistics for which the distributional method isn’t appropriate (see the sketch at the end of this slide):
 median
 mode
 variance
 Because the tails and outlying values may not be well
represented in a sample, the bootstrap method is not
as useful for statistics involving the “ends” of the
distribution:
 minimum
 maximum
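With the `bootstrap_ci` sketch from earlier (its `stat` parameter exists for exactly this reason), these are one-line changes:

    from statistics import median, variance

    print(bootstrap_ci(cpu_times, stat=median))    # bootstrap CI for the median
    print(bootstrap_ci(cpu_times, stat=variance))  # bootstrap CI for the variance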
Measuring the Number of Occurrences
of Events
 In CS, we often want to know how often something
occurs:
 How many times does a process complete successfully?
 How many times do we correctly predict membership in a class?
 How many times do we find the top search result?
 Again, the sample rate θ’ is what we have observed,
but we would like to know the “true” rate θ
Bootstrap Confidence Intervals
for Rates
 Suppose we have observed 100 predictions of a decision tree,
and 88 of them were correct
 Draw many (say, 1000) samples of size n, with replacement,
from the n observed predictions (here, n=100), and compute the
sample classification rate
 Sort the sample rates θi in increasing order
 Choose the 26th and 975th values as the ends of the confidence
interval: here, the confidence interval is [0.81, 0.94]
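A sketch in Python, reusing the percentile idea (the function name is illustrative):

    import random

    def bootstrap_rate_ci(successes, n, n_resamples=1000):
        """Bootstrap CI for a success rate, e.g. 88 correct out of 100."""
        outcomes = [1] * successes + [0] * (n - successes)
        rates = sorted(sum(random.choices(outcomes, k=n)) / n
                       for _ in range(n_resamples))
        return rates[int(n_resamples * 0.025)], rates[int(n_resamples * 0.975)]

    print(bootstrap_rate_ci(88, 100))  # roughly (0.81, 0.94)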
Binomial Distributional
Confidence
 If we assume that the classifier is a “biased coin” with
probability θ of coming up heads, then we can use
the binomial distribution to analytically compute the
confidence interval
 This requires a small correction because the binomial distribution is actually discrete, but we want a continuous estimate
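One common form of that correction widens the normal approximation by 0.5/n; the slides don't specify which correction they use, so this sketch is an assumption:

    from statistics import NormalDist

    def binomial_ci(successes, n, conf=0.95):
        """Normal approximation to the binomial CI, widened by a
        0.5/n continuity correction for the discrete distribution."""
        p = successes / n
        z = NormalDist().inv_cdf(0.5 + conf / 2)
        half = z * (p * (1 - p) / n) ** 0.5 + 0.5 / n
        return max(0.0, p - half), min(1.0, p + half)

    print(binomial_ci(88, 100))  # roughly (0.81, 0.95)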
Comparing Two Measurements
 Consider the CPU measurements of the earlier
example, and suppose we have performed the same
computation on a different machine, yielding these
CPU times:
0.21 0.20 0.20 0.19 0.20 0.19 0.18 0.20 0.19 0.19
0.19 0.19 0.20 0.18 0.19 0.20 0.22 0.20 0.20 0.20
0.19 0.20 0.18 0.19 0.19 0.20 0.20 0.22 0.18 0.29
0.21 0.23 0.20
 These times certainly seem faster than the first
machine, which yielded a distributional confidence
interval of [0.220, 0.276] – but how can we be sure?
Kernel Density Comparison
 Visually, the second machine (Shark3) is much faster
than the first (Darwin):
Difference Estimation
 Bootstrap testing: repeat many times:
 Draw a bootstrap sample from each of the machines, and compute the sample means
 If Shark3 is faster than Darwin more than 95% of the time, we can be 95% confident that it really is faster
 We can also compute a 95% bootstrap confidence interval
on the difference between the means – this turns out to be
[0.0461, 0.0553]
 If the samples are drawn from t distributions, then the
difference between the sample means also has a t
distribution
 Confidence interval on this difference: [0.0463, 0.0555]
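A sketch of the bootstrap comparison (assuming `darwin` and `shark3` hold the two lists of CPU times from the earlier slides):

    import random
    from statistics import mean

    def bootstrap_diff(slow, fast, n_resamples=1000):
        """Bootstrap the difference of sample means between two machines."""
        diffs = sorted(mean(random.choices(slow, k=len(slow)))
                       - mean(random.choices(fast, k=len(fast)))
                       for _ in range(n_resamples))
        frac = sum(d > 0 for d in diffs) / n_resamples  # how often slow > fast
        ci = (diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)])
        return frac, ci

    frac, ci = bootstrap_diff(darwin, shark3)  # frac near 1.0; ci ~ (0.046, 0.055)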
Hypothesis Testing
 Is the true difference zero, or more than zero?
 Use classical statistical rejection testing
 Null hypothesis: The two machines have the same speed (i.e., μ, the true difference in mean CPU time, is equal to zero)
 Can we reject this hypothesis, based on the observed data?
 If the null hypothesis were true, what is the probability we
would have observed this data?
 We can measure this probability using the t distribution
 In this case, the computed t value = (μ1 – μ2) / σ’ = 21.69
 The probability of seeing this t value if μ were actually zero is vanishingly small: the 99.999% confidence interval (under the null hypothesis) is [-4.59, 4.59], so the probability of this t value is (much) less than 0.00001
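In SciPy this is a two-sample t-test (assuming SciPy and the `darwin`/`shark3` lists from above):

    from scipy import stats

    t_stat, p_value = stats.ttest_ind(darwin, shark3)
    # A tiny p_value means the observed difference would be very unlikely
    # under the null hypothesis of equal speeds, so we reject it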
Paired Differences
 Suppose we had a set of 10 different benchmark
programs that we ran on the two machines, yielding
these CPU times:
 Obviously, we don’t
want to just compare
the means, since the
programs have such
different running times
Kernel Density Visualization
 CPU1 seems to be systematically faster (offset to the left) than CPU2
Scatterplot Visualization
 CPU1 is always faster than CPU2 (i.e., above the
diagonal line that corresponds to equal speed)
Sequential Visualization
 The correlation of program “difficulty” across the two machines (and the consistently faster speed of CPU1) is even more obvious in this line plot, ordered by program number:
Distribution Analysis I
 If the differences are in the same “units,” we can subtract the
CPU times for the “paired” tests and assume a t distribution on
these differences
 The probability of observing a sample mean difference as large as 0.02779, given a null hypothesis of the machines having the same speed, is 0.0000466 – we can reject the null hypothesis
 If we have no prior belief about which machine is faster, we
should use a “two-tailed test”
 The probability of observing a sample mean difference this large in either direction is 0.0000932 – slightly larger, but still sufficiently improbable that we can be sure the machines have different speeds
 Note that we can also use a bootstrap analysis on this type of
paired data
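With SciPy, the paired version is `ttest_rel` (assuming `cpu1` and `cpu2` are the paired lists of benchmark times):

    from scipy import stats

    t_stat, p_two = stats.ttest_rel(cpu1, cpu2)  # two-tailed by default
    p_one = p_two / 2  # one-tailed, given a prior belief about the direction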
Paired vs. Non-Paired
 If we don’t pair the data (just compare the overall
mean, not the differences for paired tests):
 Distributional analysis doesn’t let us reject the null hypothesis
 Bootstrap analysis doesn’t let us reject the null hypothesis
Sign Tests
 I mentioned before that the paired t-test is
appropriate if the measurements are in the same
“units”
 If the magnitude of the difference is not important, or
not meaningful, we still can compare performance
 Look at the sign of the difference (here, CPU1 is
faster 10 out of 10 times; but in another case, it might
only be faster 9 out of 10 times)
 Use the binomial distribution (flip a coin to get the
sign) to compute a confidence interval for the
probability that CPU1 is faster
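A sketch of the sign test via the exact binomial tail (`math.comb` needs Python 3.8+):

    from math import comb

    def sign_test_p(wins, n):
        """Two-sided sign test: probability of a split at least this
        lopsided if each trial were a fair coin flip."""
        k = max(wins, n - wins)
        tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    print(sign_test_p(10, 10))  # ~0.002
    print(sign_test_p(9, 10))   # ~0.021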
Other Important Topics
 Regression analysis
 Cross-validation
 Human subjects analysis and user study design
 Analysis of Variance (ANOVA)
 For your particular investigation, you need to know
which of these topics are relevant, and to learn about
them!
Statistically Valid Experimental
Design
 Make sure you understand the nuances before you
design your experiments...
 ...and definitely before you analyze your experimental data!
 Designing the statistical methods (and hypotheses)
after the fact is not valid!
 You can often find a hypothesis and an associated statistical method ex post facto – i.e., design an experiment to fit the data instead of the other way around
 In the worst case, doing this is downright unethical
 In the best case, it shows a lack of clear research objectives and may not be reproducible or meaningful