Measuring errors
Hypothesis test
Let’s start with an example:
with the current interface for a system the average time to complete a task is t = 5sec
a new interface is proposed that should improve on this
to test the new interface, it was agreed that n = 200 users would try out the new system
let’s say that Y is the average time in which the 200 users completed the same task
we establish that if Y < 5 then the new interface is better than the old
BUT: we must consider that, even if we have observed Y < 5 for these 200 users, the
real average time could still be 5sec or more!
OR: even if we have observed Y > 5 for these 200 users, the real average time could
be less than 5sec
in the first case, we accept the new interface, while we should
have rejected it (this is called a Type I error)
in the second case, we reject the new interface, while we
should have accepted it (this is called a Type II error)
we want to measure the probability that these errors occur
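To get a feeling for how often these two errors can occur, the example can be simulated. This is only a sketch: the normal model for task times, the standard deviation of 1.5 sec and the "truly better" mean of 4.9 sec are assumptions made for illustration, not values from the example.

```python
import random
import statistics


def simulate_sample_mean(true_mean, sd, n, rng):
    """Average task time observed for one group of n simulated users."""
    return statistics.fmean(rng.gauss(true_mean, sd) for _ in range(n))


def fraction_below(threshold, true_mean, sd=1.5, n=200, trials=2000, seed=0):
    """Fraction of simulated experiments whose sample mean Y is below threshold."""
    rng = random.Random(seed)
    below = sum(
        simulate_sample_mean(true_mean, sd, n, rng) < threshold
        for _ in range(trials)
    )
    return below / trials


# Real average is still 5 sec (new interface is no better), yet Y < 5 in
# roughly half of the experiments: the naive rule accepts it by mistake.
type_1_rate = fraction_below(5.0, true_mean=5.0)

# Real average is 4.9 sec (assumed; new interface is better), yet Y >= 5
# in some experiments: the naive rule rejects it by mistake.
type_2_rate = 1 - fraction_below(5.0, true_mean=4.9)

print(type_1_rate, type_2_rate)
```

Under these assumptions the rule "Y < 5" alone is wrong far too often, which is why the probability of each error has to be measured and controlled.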
A hypothesis test is a method for establishing whether a claim
about an experiment is reasonable
the first hypothesis we want to consider is that nothing has
changed (i.e. the new and the old interfaces have exactly the
same performance)
this is called the null hypothesis and is indicated with H0
in the example H0 : t = 5sec
the second hypothesis is that there has in fact been a change in
performance
this is called the alternative hypothesis and is indicated
with H1
in the example H1 : t < 5sec
but one could also test H1 : t > 5sec or even H1 : t ≠ 5sec
Hypothesis test (2)
the hypothesis we want to test is therefore:
we reject H0 and accept H1 if Y < 5sec
and the errors are then:
Type I error: rejecting H0 and accepting H1 when H0 is true
Type II error: accepting H0 and rejecting H1 when H0 is false
of course we will never be able to tell for sure if an error has
been made
but we can estimate this with a certain "degree of certainty"
Significance Level
the degree of certainty one requires in order to reject the null
hypothesis in favor of the alternative is called the significance
level and is indicated with α
this is a probability measure (so it’s a number from 0 to 1, or
from 0% to 100%)
for instance, we can say that the significance level must be
α = 5% = 0.05
that is, we request that the probability p that a Type I error
occurred must be less than 0.05
the smaller α, the more you are "protected" against a Type I error
so, the rule would be that we reject H0 if:
Y < 5 and
p < α, that is p < 0.05
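The rule above can be sketched in a few lines. The sketch reads p as the usual p-value (the probability, if H0 is true, of observing an average at least as small as Y) and assumes a normal model with a known standard deviation. The value sd = 1.5 and the two observed averages are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(sample_mean, h0_mean, sd, n):
    """p-value for H1: mean < h0_mean, assuming a normal model with known sd."""
    standard_error = sd / sqrt(n)
    z = (sample_mean - h0_mean) / standard_error
    return NormalDist().cdf(z)  # P(Z <= z) when H0 is true


def reject_h0(sample_mean, h0_mean, sd, n, alpha=0.05):
    """Apply the rule: reject H0 only if Y < h0_mean and p < alpha."""
    p = one_sided_p_value(sample_mean, h0_mean, sd, n)
    return sample_mean < h0_mean and p < alpha


# Hypothetical averages for the n = 200 users (sd = 1.5 is an assumption)
print(reject_h0(4.95, 5.0, sd=1.5, n=200))  # False: Y < 5 but p is about 0.32
print(reject_h0(4.70, 5.0, sd=1.5, n=200))  # True: p is about 0.002
```

Both averages are below 5 sec, but only the second one is far enough below to satisfy p < α.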
Power of the test
the probability of rejecting the null hypothesis when it is in
fact false is called the power of the test and is denoted by 1 − β
the notation is based on the probability value that we accept
the null hypothesis when it is in fact false, which is indicated
with β
the more powerful a test is, the better
β cannot be chosen by the user
but it can be calculated, although only in some cases
a decrease in α leads to an increase in β and vice versa
but both α and β depend on the sample size
intuitively: the more users we contact for our experiment,
the better the test
Confidence intervals
another output that can be given for a hypothesis test is the
confidence interval
we give an estimate of whether the value is included in a given
range, or interval
that is, instead of giving the probability that the true value
differs from the observed one
this is more useful when you repeat the experiment many
times, so the observed value can vary
the interval estimate gives an indication of how much
uncertainty there is in our estimate of the true value
the narrower the interval, the more precise is our estimate
confidence intervals are again given with an associated
level α (the confidence level is then 1 − α)
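As an illustration, here is a minimal sketch of an interval estimate for a mean, together with a small simulation of what happens when the experiment is repeated many times. It assumes a normal model with a known standard deviation. The values sd = 1.5, the observed average 4.8 and the true mean 5.0 are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def confidence_interval(sample_mean, sd, n, alpha=0.05):
    """Two-sided (1 - alpha) interval for the mean, known sd, normal model."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    margin = z * sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


def coverage(true_mean=5.0, sd=1.5, n=200, alpha=0.05, trials=2000, seed=1):
    """Fraction of repeated experiments whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = fmean(rng.gauss(true_mean, sd) for _ in range(n))
        low, high = confidence_interval(y, sd, n, alpha)
        hits += low <= true_mean <= high
    return hits / trials


low, high = confidence_interval(4.8, sd=1.5, n=200)
print(round(low, 2), round(high, 2))  # 4.59 5.01
print(coverage())                     # close to 0.95
```

A larger n shrinks the margin, so the interval becomes narrower and the estimate more precise.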
Confidence intervals (2)
so, if α = 5%, we say that we have a 95% confidence interval
for the value
in the example, this means that:
if we repeat the experiment a sufficiently large number of
times, each time with 200 users
and each time we calculate the average time to perform a
task
and each time we calculate a confidence interval for the
true average
in the long run, about 95% of these confidence intervals
will in fact contain the true average
note: so, a 95% confidence interval does not mean that there
is a 95% probability that the interval contains the true
average
Yes, but... how do I go about it?
we need to make some assumptions to make life easier
for instance, suppose we want to concentrate on the value of
the mean (as in the example before)
we assume that the data comes from a normal distribution, for
which we know the standard deviation
this is a reasonable assumption, by the Central Limit Theorem, provided that the
sample size is big enough (5 users would not really do...)
the null hypothesis is of the form H0 : Y = Y0
the alternative hypothesis is one of:
H1 : Y > Y0 (the mean has increased)
H1 : Y < Y0 (the mean has decreased)
H1 : Y ≠ Y0 (the mean has changed, we don’t know if it has increased or decreased)
the first two hypotheses are said to be single tailed, while the
third one is said to be double tailed
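The practical difference between the three alternatives is where the probability α is placed. A single tailed test puts all of α in one tail of the normal distribution, while a double tailed test splits it into α/2 per tail. The sketch below shows this with a standard normal test statistic z. The function name and the labels "greater", "less" and "two-sided" are made up for illustration.

```python
from statistics import NormalDist


def rejects_h0(z, alpha=0.05, alternative="less"):
    """Decide whether z falls in the rejection zone for the chosen H1.

    alternative: "greater" (H1: Y > Y0), "less" (H1: Y < Y0),
                 or "two-sided" (H1: Y != Y0).
    """
    std_normal = NormalDist()
    if alternative == "greater":
        return z >= std_normal.inv_cdf(1 - alpha)
    if alternative == "less":
        return z <= -std_normal.inv_cdf(1 - alpha)
    if alternative == "two-sided":
        return abs(z) >= std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError(f"unknown alternative: {alternative!r}")


z_one_tail = NormalDist().inv_cdf(0.95)   # about 1.645
z_two_tail = NormalDist().inv_cdf(0.975)  # about 1.960
print(round(z_one_tail, 3), round(z_two_tail, 3))
```

For the same α the double tailed threshold is further out, so a value such as z = 1.8 is rejected by the "greater" test but not by the double tailed one.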
Steps in testing the hypothesis
1. we specify H0 and H1
e.g. H0 : Y = 0.5sec;
and H1 : Y < 0.5sec
2. we choose a significance level
e.g. α = 0.05
3. we perform a randomised experiment, with a sample of users
e.g. n = 100
4. we calculate the mean of this sample
$Y_{sample} = \frac{1}{n} \sum_{i=1}^{n} Y_i$
let’s suppose Ysample = 0.45
Steps in testing the hypothesis (2)
5. we calculate the standard error of the mean
suppose the average time comes from a normal (bell
shaped) distribution:
mean Y = 0.5
standard deviation sY = 1.5
the standard error of the mean is given by the formula:
$SE = \frac{s_Y}{\sqrt{n}}$
in our case:
$SE = \frac{1.5}{\sqrt{100}} = 0.15$
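A quick sketch of the standard error of the mean, using the usual formula sY / √n with sY = 1.5 as above. The sample sizes other than n = 100 are only there to show the effect of contacting more users.

```python
from math import sqrt


def standard_error(sd, n):
    """Standard error of the mean for n independent observations."""
    return sd / sqrt(n)


# Quadrupling the number of users halves the standard error
for n in (25, 100, 400):
    print(n, standard_error(1.5, n))
```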
Steps in testing the hypothesis (3)
6. we calculate z and compare it with the normal value for z
according to the level of significance chosen, zα
$z = \frac{Y_{sample} - Y}{s_Y / \sqrt{n}} = \frac{0.45 - 0.5}{0.15} = -0.33$
the values of zα can be found in given tables, for instance
z0.05 = 1.645
if z ≤ −zα we reject H0
in our case −0.33 > −1.645 so we have to accept H0
for H1 : Y > Y0 the rejection zone is z ≥ zα
for H1 : Y ≠ Y0 the rejection zone is |z| ≥ zα/2, where |z| is
the absolute value of z
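The steps can be put together in a short sketch that reproduces the worked example (Ysample = 0.45, Y = 0.5, sY = 1.5, n = 100, α = 0.05). The function name is made up. Instead of a printed table, zα is taken from Python's standard normal distribution.

```python
from math import sqrt
from statistics import NormalDist


def left_tailed_z_test(sample_mean, h0_mean, sd, n, alpha=0.05):
    """Test H0: mean = h0_mean against H1: mean < h0_mean (known sd)."""
    standard_error = sd / sqrt(n)                 # step 5
    z = (sample_mean - h0_mean) / standard_error  # test statistic
    z_alpha = NormalDist().inv_cdf(1 - alpha)     # table value, step 6
    return z, z_alpha, z <= -z_alpha              # True means "reject H0"


z, z_alpha, reject = left_tailed_z_test(0.45, 0.5, sd=1.5, n=100)
print(f"z = {z:.2f}, z_alpha = {z_alpha:.3f}, reject H0: {reject}")
# prints: z = -0.33, z_alpha = 1.645, reject H0: False
```

With these numbers the observed improvement of 0.05 sec is only a third of a standard error, so H0 cannot be rejected at the 5% level.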
