Can five be enough? Sample sizes in usability tests
Paul Cairns and Caroline Jarrett
8th March, 2012

Problem: usability studies have small samples
Good experiments: 30+ participants
Typical usability studies: ~5 participants
Moving to 3 participants! How?! What?!
– Typically, I have conniptions
– CJ asked me to solve it!

UX people like small samples
Common practice (CJ)
Krug (2010): 3 ("a morning a month")
Tullis & Albert: 6 to 8 (formative)
Virzi (1992), Nielsen (1993)
– 7 experts ≈ 5 experts

Use probabilities to suggest sample sizes
Total number of problems: K
Probability of problem discovery: p
Find n so that 1 − (1−p)^n covers x% of K
– Binomial distribution
n is our sample size
p = 0.16, 0.22, 0.41, 0.6 gives n ≈ 5

The models can be refined
What is p for your system?
– p can be small (Spool & Schroeder, 2001)
– Bootstrap
Is p constant for all problems?
– More complex models
Are all participants equally good?
These refinements tend to increase n

The models have conceptual flaws
Is p meaningful?
– Independence of discovery
– Discovery is probabilistic
– What's the probability space?
Problem classification

A usability test can be an experiment
Conduct it like an experiment:
– Need an alternative hypothesis
– Measure (quantitatively) one thing
– Carefully defined tasks
– Manipulate the interface
– Use statistics to identify true variation

Example questions
Is the task quicker on the new design?
Does the design increase click-throughs?
Are errors below a threshold rate?
Is performance comparable in the new design?
Can you prove this design is worth it?

Why use an experiment?
Good for…
– When reasoning is not enough
– Good beliefs for improvements
– Finessing
Not for…
– Show-stoppers
– Large effects
– Anything but the alternative hypothesis

Usability tests are more about better designs
Move to new technology
Design well
Reach a point of plausibility
Competing considerations
Test!
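The binomial discovery model from the "Use probabilities" slide (find n so that 1 − (1−p)^n reaches a target fraction of the K problems) can be sketched in a few lines. This is a minimal illustration, not the speakers' code: the function name `sample_size` and the 85% target are my assumptions, chosen to match the p values quoted on the slide.

```python
import math

def sample_size(p, target=0.85):
    """Smallest n with 1 - (1 - p)**n >= target: the number of participants
    needed to expect to see the target fraction of the K problems, when each
    participant discovers each problem independently with probability p."""
    # Solving 1 - (1 - p)**n >= target for n gives
    # n >= ln(1 - target) / ln(1 - p); round up to a whole participant.
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# The p values quoted on the slide, with an assumed 85% discovery target:
for p in (0.16, 0.22, 0.41, 0.6):
    print(f"p = {p}: n = {sample_size(p)}")
```

Note how sensitive n is to p: small discovery probabilities push the required sample well past five, which is why the slides stress asking "what is p for your system?".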
There are different argument styles
Deduction: X causes Y; X, hence Y
Induction: from instances of X and Y, when I see X, I infer Y
Abduction: X causes Y; Y, hence?
– Explanation seeking
Peirce: "matted felt of pure hypothesis"
Sherlock Holmes does abduction!

Solutions arise from abduction
Users act in response to the system
Features cause good/bad outcomes
Abduce explanations
– More experience, better explanations

So what should the sample size be?
H = "X is good"
Null: p(H) = 0.5
Five people are enough:
– H does not hold for 5 people
– (0.5)^5 = 1/32 < 1/20, hence significant
This is false!

Usability tests sit in a cloud of hypotheses
Usability as a privative
Every feature is contingently usable
– Any falsification forces revision (Popper)
– Kuhnian resistance
Neo-Popperian (Deutsch)
– Falsification + narratives (explanations)

Sample size depends on the explanation
Plausible sample sizes:
– Show-stopper: 1
– Unexpected but plausible: 3–5
– No explanation: many
Different behaviour, same explanation
ROI

Why usability tests might look like experiments
Control in experiments
– Causal attribution
Coverage
Observation
Piggy-backing

Questions?
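The arithmetic behind the "five people are enough" sign-test argument on the earlier slide (which the slide itself then rejects as false) is a one-line check. This framing is mine, shown only to make the 1/32 < 1/20 step concrete:

```python
# Under the null hypothesis p(H) = 0.5, the probability that all five
# participants come out the same way is (0.5)**5 = 1/32 = 0.03125,
# which is below the conventional 1/20 = 0.05 significance threshold.
p_value = 0.5 ** 5
print(p_value)  # 0.03125
print(p_value < 1 / 20)  # True
```

The arithmetic is valid; as the slides argue, it is the framing (one tidy hypothesis rather than a cloud of contingent ones) that makes the conclusion false.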