Can five be enough?
Sample sizes in usability tests
Paul Cairns and Caroline Jarrett
Problem: usability studies have small samples

• Good experiments: 30+ Ps
• Typical usability studies: ~5 Ps
• Moving to 3 Ps!
• How?! What?!
  – Typically, I have conniptions
  – CJ asked me to solve it!

8th March, 2012
UX people like small samples

• Common practice (CJ)
• Krug (2010): 3 (“a morning a month”)
• Tullis & Albert: 6 to 8 (formative)
• Virzi (1992)
• Nielsen (1993)
  – 7 experts ≈ 5 experts
Use probabilities to suggest sample sizes

• Total number of problems: K
• Probability of problem discovery: p
• Find n so that K(1 − (1 − p)^n) is x% of K
  – Binomial distribution
• n is our sample size
• For p = 0.16, 0.22, 0.41, 0.6: n ≈ 5
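The binomial model above can be sketched numerically. This is a minimal illustration, not from the slides; the function name and the example value p = 0.31 (a commonly cited average discovery rate in this literature) are my assumptions.

```python
import math

def sample_size(p, coverage):
    """Smallest n with 1 - (1 - p)**n >= coverage.

    p: per-participant probability of discovering a given problem
    coverage: target fraction of the K problems to be discovered
    """
    # Solve 1 - (1-p)^n >= coverage  =>  n >= log(1 - coverage) / log(1 - p)
    return math.ceil(math.log(1 - coverage) / math.log(1 - p))

# With p around 0.31, about five participants reach 80% coverage:
print(sample_size(0.31, 0.80))  # -> 5
```

Note how sensitive n is to p: re-running with p = 0.16 or p = 0.6 gives very different answers for the same coverage target, which is why the "n ≈ 5" heuristic depends on the assumed discovery rate.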
The models can be refined

• What is p for your system?
  – p can be small (Spool & Schroeder, 2001)
  – Bootstrap
• Is p constant for all problems?
  – More complex models
• Are all participants equally good?
• Tend to increase n
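The bootstrap mentioned above can be sketched as resampling participants from an observed discovery matrix to put an uncertainty interval around the estimated p. This is an illustrative sketch under my own assumptions; the function name and the example matrix are hypothetical, not from the slides.

```python
import random

def bootstrap_p(discovery, n_boot=2000, seed=0):
    """Bootstrap the mean problem-discovery probability p.

    discovery: list of per-participant lists of 0/1 flags, where
               discovery[i][j] == 1 if participant i found problem j.
    Returns (point estimate, (2.5%, 97.5%) percentile interval).
    """
    rng = random.Random(seed)
    n = len(discovery)

    def mean_p(rows):
        cells = [v for row in rows for v in row]
        return sum(cells) / len(cells)

    # Resample participants (rows) with replacement, n_boot times.
    estimates = sorted(
        mean_p([discovery[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    return mean_p(discovery), (estimates[int(0.025 * n_boot)],
                               estimates[int(0.975 * n_boot)])

# Hypothetical 5-participant, 4-problem study:
matrix = [[1, 0, 1, 0],
          [1, 1, 0, 0],
          [0, 0, 1, 0],
          [1, 0, 0, 1],
          [1, 1, 1, 0]]
p_hat, ci = bootstrap_p(matrix)
```

Even this toy example shows the point of the refinement: with only five participants the interval around p is wide, and plugging the interval endpoints into the binomial formula yields very different sample sizes.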
The models have conceptual flaws

• Is p meaningful?
  – Independence of discovery
  – Discovery is probabilistic
  – What’s the probability space?
• Problem classification
A usability test can be an experiment

• Conduct it like an experiment
• Need an alternative hypothesis
• Measure (quantitatively) one thing
• Carefully defined tasks
• Manipulate the interface
• Use statistics to identify true variation
Example questions

• Is the task quicker on the new design?
• Does the design increase click-throughs?
• Are errors below a threshold rate?
• Is performance comparable in the new design?
• Can you prove this design is worth it?
Why use an experiment?

Good for…
• When reasoning is not enough
• Good beliefs for improvements
• Finessing

Not for…
• Show-stoppers
• Large effects
• Anything but the alternative hypothesis
Usability tests are more about better designs

• Move to new technology
• Design well
• Reach a point of plausibility
• Competing considerations
• Test!
There are different argument styles

• Deduction: X causes Y; X, hence Y
• Induction: from instances of X and Y, when I see X, I infer Y
• Abduction: X causes Y; Y, hence?
  – Explanation-seeking
• Peirce: “matted felt of pure hypothesis”
• Sherlock Holmes does abduction!
Solutions arise from abduction

• Users act in response to the system
• Features cause good/bad outcomes
• Abduce explanations
  – More experience, better explanations
So what should be the sample size?

• H = “X is good”
• Null hypothesis: p(H) = 0.5
• Five people are enough:
  – H does not hold for 5 people
  – (0.5)^5 = 1/32 < 1/20, hence significant
• This is false!
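The arithmetic on this slide checks out (it is the logic of a one-sided sign test), even though, as the slide says, the inference drawn from it is false. A quick check of the numbers only:

```python
# Under the null hypothesis p(H) = 0.5, the chance that all five
# participants show the same outcome is (0.5)**5 -- a one-sided
# sign test with n = 5.
p_value = 0.5 ** 5

print(p_value)         # 0.03125, i.e. 1/32
print(p_value < 0.05)  # True: below the conventional 1/20 threshold
```

The deck's point is that passing this significance threshold does not rescue the argument, because a usability test sits in a cloud of auxiliary hypotheses rather than testing a single well-defined one.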
Usability tests sit in a cloud of hypotheses

• Usability as a privative
• Every feature is contingently usable
  – Any falsification forces revision (Popper)
  – Kuhnian resistance
• Neo-Popperian (Deutsch)
  – Falsification + narratives (explanations)
Sample size depends on the explanation

• Plausible sample sizes:
  – Show-stopper: 1
  – Unexpected but plausible: 3–5
  – No explanation: many
• Different behaviour, same explanation
• ROI
Why usability tests might look like experiments

• Control in experiments
  – Causal attribution
• Coverage
• Observation
• Piggy-backing
Questions?