Evaluating Visualizations
cs5764: Information Visualization
Chris North

Evaluating Visualizations
• Usability Test
  • Observation, problem identification
• Controlled Experiment
  • Formal controlled scientific experiment
  • Comparisons, statistical analysis
• Expert Review
  • Examination by a visualization expert
  • Heuristic evaluation
  • Principles, guidelines
• Algorithmic

Projects
• Implementation projects:
  • Small usability test of the implementation
  • Short usability report
• Experiment projects:
  • Main controlled experiment
  • Experiment materials and raw data
  • Then data analysis

Usability test vs. Controlled Expm.
• Usability test:
  • Formative: helps guide design
  • Single UI, early in the design process
  • Few users
  • Usability problems, incidents
  • Qualitative feedback from users
• Controlled experiment:
  • Summative: measures the final result
  • Compares multiple UIs
  • Many users, strict protocol
  • Independent & dependent variables
  • Quantitative results, statistical significance

Controlled Experiments

What is Science?
• Measurement
• Modeling

Scientific Method
1. Form hypothesis
2. Collect data
3. Analyze
4. Accept/reject hypothesis
• How to “prove” a hypothesis in science?
  • It is easier to disprove things, by counterexample
  • Null hypothesis = the opposite of the hypothesis
  • Disprove the null hypothesis
  • Hence, the hypothesis is “proved”

Empirical Experiment
• Typical question:
  • Which visualization is better in which situations?
  • Spotfire vs. TableLens

Cause and Effect
• Goal: determine “cause and effect”
  • Cause = visualization tool (Spotfire vs. TableLens)
  • Effect = user performance time on task T
• Procedure:
  • Vary the cause
  • Measure the effect
• Problem: random variation
  • Real world » collected data » uncertain conclusions, with random variation entering at each step
  • Cause = vis tool OR random variation?

Stats to the Rescue
• Goal:
  • Show that the measured effect is unlikely to result from random variation
• Hypothesis:
  • Cause = visualization tool (e.g. Spotfire ≠ TableLens)
• Null hypothesis:
  • The visualization tool has no effect (e.g. Spotfire = TableLens)
  • Hence: cause = random variation
• Stats:
  • If the null hypothesis were true, the measured effect would occur with probability < 5%
  • But the measured effect did occur! (e.g. measured effect >> random variation)
• Hence:
  • The null hypothesis is unlikely to be true
  • Hence, the hypothesis is likely to be true

Variables
• Independent variables (what you vary) and treatments (the variable values):
  • Visualization tool » Spotfire, TableLens, Excel
  • Task type » find, count, pattern, compare
  • Data size (# of items) » 100, 1000, 1000000
• Dependent variables (what you measure):
  • User performance time
  • Errors
  • Subjective satisfaction (survey)
  • HCI metrics

Example: 2 x 3 design

                           Ind Var 2: Task Type
                           Task1    Task2    Task3
  Ind Var 1:   Spotfire
  Vis. Tool    TableLens

• n users per cell
• Measured user performance times (dep var); a sketch of enumerating these cells follows below
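A minimal sketch of enumerating the cells of this 2 x 3 design when assigning users to conditions; the treatment lists come from the Variables slide, everything else is hypothetical:

```python
from itertools import product

# Independent variables and their treatments (from the Variables slide)
vis_tools = ["Spotfire", "TableLens"]      # Ind Var 1: Vis. Tool
task_types = ["Task1", "Task2", "Task3"]   # Ind Var 2: Task Type

# Each (tool, task) pair is one cell of the 2 x 3 design;
# n users are measured per cell on the dependent variable.
for tool, task in product(vis_tools, task_types):
    print(f"cell: {tool} / {task}")
```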
Groups
• “Between-subjects” variable
  • 1 group of users for each variable treatment
  • Group 1: 20 users, Spotfire
  • Group 2: 20 users, TableLens
  • Total: 40 users, 20 per cell
• “Within-subjects” (repeated) variable
  • All users perform all treatments
  • Counterbalancing cancels out the order effect
  • Group 1: 20 users, Spotfire then TableLens
  • Group 2: 20 users, TableLens then Spotfire
  • Total: 40 users, 40 per cell

Issues
• Eliminate or measure extraneous factors
  • Randomize
• Fairness
  • Identical procedures, …
• Bias
• User privacy, data security
• IRB (Institutional Review Board)

Procedure
• For each user:
  • Sign legal forms
  • Pre-survey: demographics
  • Instructions » do not reveal the true purpose of the experiment
  • Training runs
  • Actual runs » give a task, measure performance
  • Post-survey: subjective measures
• × n users

Data
• Measured dependent variables
• Spreadsheet:

              Spotfire                 TableLens
  User    task1  task2  task3      task1  task2  task3

Step 1: Visualize it
• Dig out interesting facts
• Qualitative conclusions
• Guide the stats
• Guide future experiments

Step 2: Stats

                 Task1   Task2   Task3
  Spotfire        37.2    54.5   103.7
  TableLens       29.8    53.2   145.4

  Average user performance times in seconds (dep var)

TableLens better than Spotfire?
[Bar chart: average perf time (secs), Spotfire vs. TableLens]
• Problem with averages: they are lossy
  • Compares only 2 numbers
  • What about the 40 data values? (Show me the data!)

The real picture
[Chart: all 40 individual perf times (secs), Spotfire vs. TableLens, with overlapping spreads]
• Need stats that compare all the data

Statistics
• t-test
  • Compares 1 dep var on 2 treatments of 1 ind var
• ANOVA: Analysis of Variance
  • Compares 1 dep var on n treatments of m ind vars
• Result:
  • p = probability that the difference between treatments is due to random variation (the null hypothesis)
  • “statistical significance” level
  • Typical cut-off: p < 0.05
  • Hypothesis confidence = 1 - p

Excel
• t-tests and ANOVA can also be run directly in Excel (a SciPy sketch follows below)
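A minimal sketch of both tests using SciPy, with hypothetical per-user timing data standing in for the measured dependent variable:

```python
from scipy import stats

# Hypothetical performance times (secs) per user on one task;
# in a real experiment these come from the spreadsheet above.
spotfire  = [37.1, 35.4, 40.2, 36.8, 39.5, 38.0, 34.9, 41.3]
tablelens = [29.9, 31.2, 28.4, 30.7, 27.9, 32.1, 29.0, 30.3]
excel     = [44.0, 47.2, 43.1, 46.5, 45.8, 42.7, 48.0, 44.9]

# t-test: 1 dep var, 2 treatments of 1 ind var
t_stat, p = stats.ttest_ind(spotfire, tablelens)
print(f"t-test: p = {p:.4f}")   # p < 0.05 => statistically significant

# One-way ANOVA: 1 dep var, n treatments of 1 ind var
f_stat, p = stats.f_oneway(spotfire, tablelens, excel)
print(f"ANOVA:  p = {p:.4f}")
```

The two calls mirror the slide's distinction: `ttest_ind` compares exactly two treatments, `f_oneway` handles n treatments at once.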
p < 0.05
• Woohoo!
• Found a “statistically significant” difference
• The averages determine which is ‘better’
• Conclusion:
  • Cause = visualization tool (e.g. Spotfire ≠ TableLens)
  • The vis tool has an effect on user performance for task T …
  • “95% confident that TableLens is better than Spotfire …”
  • NOT “TableLens beats Spotfire 95% of the time”
  • 5% chance of being wrong!
  • Be careful about generalizing

p > 0.05
• Hence, no difference?
  • The vis tool has no effect on user performance for task T…?
  • Spotfire = TableLens?
• NOT!
  • Did not detect a difference, but the tools could still differ
  • A potential real effect did not overcome the random variation
  • Provides evidence for Spotfire = TableLens, but not proof
  • Boring; basically found nothing
• How does this happen?
  • Not enough users
  • Need better tasks, data, …

Data Mountain
• Robertson, “Data Mountain” (Microsoft)

Comparison of Info Vis Systems
• Kobsa

Cleveland’s Rules for Secondary Tasks
• Chewar et al.

Usability Testing

Usability test vs. Controlled Expm.
• Usability test:
  • Formative: helps guide design
  • Single UI, early in the design process
  • Few users
  • Usability problems, incidents
  • Qualitative feedback from users
• Controlled experiment:
  • Summative: measures the final result
  • Compares multiple UIs
  • Many users, strict protocol
  • Independent & dependent variables
  • Quantitative results, statistical significance

Usability Specification Table

  Scenario task                           Worst case   Planned target   Best case (expert)   Observed
  Find most expensive house for sale?     1 min.       10 sec.          3 sec.               ??? sec
  …

Usability Test Setup
• Set of benchmark tasks
  • Easy to hard, specific to open-ended
  • Coverage of different UI features
  • E.g. “find the 5 most expensive houses for sale”
• Consent forms
  • Not needed unless video-taping the user’s face (new rule)
• Experimenters:
  • Facilitator: instructs the user
  • Observers: take notes, collect data, video-tape the screen
  • Executor: runs the prototype if it is faked
• Users
  • 3-5 users; quality, not quantity

Usability Test Procedure
• Goal: mimic real life
  • Do not cheat by showing them how to use the UI!
• Initial instructions
  • “We are evaluating the system, not you.”
• Repeat:
  • Give the user a task
  • Ask the user to “think aloud”
  • Observe; note mistakes and problems
  • Avoid interfering; hint only if the user is completely stuck
• Interview
  • Verbal feedback
• Questionnaire
• ~1 hour per user

Usability Lab
• E.g. McBryde 102

Data
• Note taking
  • E.g. “&%$#@ user keeps clicking on the wrong button…”
• Verbal protocol: think aloud
  • E.g. the user expects that button to do something else…
• Rough quantitative measures
  • HCI metrics: e.g. task completion time, …
• Interview feedback and surveys
• Video-tape of screen & mouse
• Eye tracking, biometrics?

Analyze
• Initial reaction:
  • “Stupid user!”, “that’s developer X’s fault!”, “this sucks”
• Mature reaction:
  • “How can we redesign the UI to solve that usability problem?”
  • The user is always right
• Identify usability problems
  • Learning issues: e.g. can’t figure out or didn’t notice a feature
  • Performance issues: e.g. arduous, tiring to solve tasks
  • Subjective issues: e.g. annoying, ugly
• Problem severity: critical vs. minor

Cost-Importance Analysis

  Problem   Importance   Solutions   Cost   Ratio I/C

• Importance 1-5 (task effect, frequency):
  • 5 = critical, major impact on the user, frequent occurrence
  • 3 = user can complete the task, but with difficulty
  • 1 = minor problem, small speed bump, infrequent
• Ratio = importance / cost
  • Sort by this (see the sketch at the end of these notes)
• 3 categories: must fix, next version, ignored

Refine UI
• Simple solutions vs. major redesigns
• Solve problems in order of importance/cost
• Example:
  • Problem: the user didn’t know he could zoom in to see more…
  • Potential solutions:
    – Better zoom button icon, tooltip
    – Add a zoom bar slider (like Moosburg)
    – Icons for different zoom levels: boundaries, roads, buildings
    – NOT more “help” documentation!!! You can do better.
• Iterate
  • Test, refine, test, refine, test, refine, …
  • Until? It meets the usability specification

Project revisited
• For implementation projects:
  • Informal test
  • A few users
    – Not (tainted) info vis students
  • The 102 lab is not required
  • Simple data collection
    – Biometrics optional!
  • 1 iteration
  • Exploit this opportunity to improve your design
• For experiment projects:
  • See controlled experiments
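As referenced in the Cost-Importance Analysis slide above, a minimal sketch of ranking usability problems by the importance/cost ratio; the problems, numbers, and category thresholds are all hypothetical:

```python
# Hypothetical usability problems: (description, importance 1-5, cost to fix)
problems = [
    ("Can't find the zoom feature",  5, 2),
    ("Ugly color scheme",            1, 2),
    ("Confusing axis labels",        3, 1),
    ("Slow redraw on large data",    4, 8),
]

# Ratio = importance / cost; fix the highest ratios first
ranked = sorted(problems, key=lambda p: p[1] / p[2], reverse=True)

for desc, importance, cost in ranked:
    ratio = importance / cost
    # Bucket into the slide's three categories (thresholds are made up)
    category = ("must fix" if ratio >= 2.0
                else "next version" if ratio >= 1.0
                else "ignored")
    print(f"{desc:30s} I/C = {ratio:4.2f} -> {category}")
```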