Evaluating HCI Systems
Qualitative Evaluation
Lecture Outline


Evaluation objectives
Evaluation methods

Human Subjects



“Think Aloud”
Wizard of Oz
No Human Subjects



Heuristic evaluation
Cognitive walkthroughs
GOMS
Huh?
I’ll be dead before …
Evaluation objectives


Anticipate what will happen when real users
start using your system.
Give test users some tasks to attempt, and
keep track of whether they can complete
them.
Two axes


Human – non-human
Qualitative – Quantitative
[Figure: two-axis chart (Human to Non-Human, Qualitative to Quantitative) positioning GOMS, Heuristic Evaluation, Think Aloud, Wizard of Oz, and Cognitive Walkthrough]
Non-human subject methods


Heuristic evaluation
Cognitive walkthroughs
Heuristic Evaluation (1)




A small set of HCI experts independently assess
the interface (in two passes) for adherence to usability principles
(heuristics).
Evaluators rate the severity of each violation to prioritize key
fixes and explain why the interface violates the heuristic.
Evaluators communicate afterwards to aggregate
findings, but not during the evaluation.
Since the evaluators are not using the system as such
(to perform a real task), it is possible to perform
heuristic evaluation of user interfaces that exist on
paper only and have not yet been implemented.
Heuristic Evaluation (2)

10 Usability Heuristics (by Jakob Nielsen)

Visibility of system status
The system should always keep users informed about what is going on, through appropriate feedback within reasonable time.

Match between system and the real world
The system should speak the users' language, with words, phrases and concepts familiar to the user, rather than system-oriented terms. Follow
real-world conventions, making information appear in a natural and logical order.

User control and freedom
Users often choose system functions by mistake and will need a clearly marked "emergency exit" to leave the unwanted state without having to
go through an extended dialogue. Support undo and redo.

Consistency and standards
Users should not have to wonder whether different words, situations, or actions mean the same thing. Follow platform conventions.

Error prevention
Even better than good error messages is a careful design which prevents a problem from occurring in the first place.

Recognition rather than recall
Make objects, actions, and options visible. The user should not have to remember information from one part of the dialogue to another.
Instructions for use of the system should be visible or easily retrievable whenever appropriate.

Flexibility and efficiency of use
Accelerators -- unseen by the novice user -- may often speed up the interaction for the expert user such that the system can cater to both
inexperienced and experienced users. Allow users to tailor frequent actions.

Aesthetic and minimalist design
Dialogues should not contain information which is irrelevant or rarely needed. Every extra unit of information in a dialogue competes with the
relevant units of information and diminishes their relative visibility.

Help users recognize, diagnose, and recover from errors
Error messages should be expressed in plain language (no codes), precisely indicate the problem, and constructively suggest a solution.

Help and documentation
Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such
information should be easy to search, focused on the user's task, list concrete steps to be carried out, and not be too large.
Heuristic Evaluation (3)

Severity rating





0 = no problem
1 = cosmetic problem
2 = minor usability problem
3 = major usability problem; should fix
4 = catastrophe; must fix
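As a rough illustration of how the severity scale above might be used to prioritize fixes, the sketch below averages each problem's ratings across evaluators and sorts the problems from most to least severe. The problem names, the ratings, and the choice of a simple mean are all assumptions made for the example, not part of the lecture.

```python
from statistics import mean

# Hypothetical ratings on the 0-4 scale, one rating per evaluator.
ratings = {
    "No undo after delete": [4, 3, 4],
    "Inconsistent button labels": [2, 2, 1],
    "Low-contrast footer text": [1, 0, 1],
}


def prioritize(ratings_by_problem):
    """Return (problem, mean severity) pairs, most severe first."""
    averaged = [
        (problem, mean(scores))
        for problem, scores in ratings_by_problem.items()
    ]
    return sorted(averaged, key=lambda pair: pair[1], reverse=True)


ranked = prioritize(ratings)
for problem, severity in ranked:
    print(f"{severity:.2f}  {problem}")
```

Other aggregation rules (for example, taking the maximum rating) are equally plausible; the mean is used here only because it is simple.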
Heuristic Evaluation (4)

Usability matrix




Each row represents one
evaluator.
Each column represents
one of the usability
problems.
A black square shows
that the evaluator
represented by the row
found the usability
problem.
The more rows blacked
out within a column, the
more obvious the
usability problem.
Heuristic Evaluation (5)


Use 3-5 evaluators; any
more and you get
diminishing returns.
Using more than 5
evaluators also costs
more money!
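The diminishing returns can be illustrated with the model usually attributed to Nielsen and Landauer, in which the share of problems found by i independent evaluators is 1 - (1 - L)^i. The sketch below assumes a per-evaluator detection rate L of 0.31, a commonly quoted figure; the real rate varies by project and evaluator, so treat the numbers as illustrative only.

```python
def proportion_found(evaluators, detection_rate=0.31):
    """Share of usability problems found by a number of independent evaluators.

    detection_rate is an assumed per-evaluator probability of finding any
    given problem; 0.31 is a commonly quoted value, not a measured one.
    """
    return 1 - (1 - detection_rate) ** evaluators


# Marginal gain from each additional evaluator shrinks as the panel grows.
for count in range(1, 9):
    gain = proportion_found(count) - proportion_found(count - 1)
    print(f"{count} evaluators: {proportion_found(count):.0%} found (+{gain:.0%})")
```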
Cognitive Walkthroughs (1)





Cognitive walkthrough is a formalized way of
imagining people’s thoughts and actions when they
use an interface for the first time.
Start with a prototype or a detailed design
description of the interface and known end-users.
Try to tell a believable story about each action a user
has to take to do the task.
If you can’t tell a believable story about an action,
then you've located a problem with the interface.
Walkthroughs focus most clearly on problems that
users will have when they first use an interface.
Cognitive Walkthroughs (2)
1. You need a description or a prototype of the interface. It
doesn’t have to be complete, but it should be fairly detailed.
Details such as exactly what words are in a menu can make a
big difference.
2. You need a task description. The task should usually be one of
the representative tasks you’re using for task-centered design,
or some piece of that task.
3. You need a complete, written list of the actions needed to
complete the task with the interface.
4. You need an idea of who the users will be and what kind of
experience they’ll bring to the job. This is an understanding
you should have developed through your task and user
analysis. Ideally, you have developed detailed user personas
either through customer surveys or ethnographic studies.
Cognitive Walkthroughs (3)
1. Will users be trying to produce whatever effect the
action has? (Example: safely remove hardware in
Windows)
2. Will users see the control (button, menu, switch,
etc.) for the action? (Example: hidden cascading
icons in Windows menus and Taskbar)
3. Once users find the control, will they recognize that
it produces the effect they want?
4. After the action is taken, will users understand the
feedback they get, so they can go on to the next
action with confidence?
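One way to keep a walkthrough organized is to record the answers to the four questions above for every action in the task, then list the actions where any answer is "no". The sketch below is only an illustration: the class name, field names, and the two example actions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepReview:
    """Answers to the four walkthrough questions for one user action."""
    action: str
    wants_effect: bool          # Q1: will users be trying to produce this effect?
    sees_control: bool          # Q2: will users see the control?
    recognizes_control: bool    # Q3: will users recognize what the control does?
    understands_feedback: bool  # Q4: will users understand the feedback?

    def failed_questions(self):
        """Return the question numbers (1-4) answered 'no' for this action."""
        answers = [self.wants_effect, self.sees_control,
                   self.recognizes_control, self.understands_feedback]
        return [number for number, ok in enumerate(answers, start=1) if not ok]


# Hypothetical walkthrough of a two-step task.
walkthrough = [
    StepReview("Open the device menu", True, False, True, True),
    StepReview("Choose 'Eject'", True, True, True, True),
]

# Any action with a failed question marks a likely interface problem.
problems = {step.action: step.failed_questions()
            for step in walkthrough if step.failed_questions()}
print(problems)
```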
Human subject


Wizard of Oz
Think aloud
Human subjects (1)



Best test users will be people who are representative
of the people you expect to have as users.
Voluntary, informed consent for testing.
If you are working in an organization that receives
federal research funds, you are obligated to comply
with formal rules and regulations that govern the
conduct of tests, including getting approval from a
review committee for any study that involves human
participants.
Human subjects (2)



Train test users as they are likely to receive
training in the field
You should always do a pilot study as part of
any usability test. Do this twice, once with
colleagues, to get out the biggest bugs, and
then with real users.
Keep variability to a minimum. Do not
provide one user more guidance or “Help”
than another.
Human subjects (3)

During the test




Make clear to test users that they are free to stop
participating at any time. Avoid putting any
pressure on them to continue.
Monitor the attitude of your test users carefully,
especially if they get upset with themselves when
things don’t go well.
Stress that it is your system, not the users, that is
being tested.
You cannot provide any help beyond what they
would receive in the field!
Collecting Data (1)

Process Data


Qualitative observations of what the test users are
doing and thinking as they work through the
tasks.
Bottom-Line Data

Quantitative data on how long the user spent on
the experiment, how many mistakes, how many
questions, etc.
“Think Aloud” (1)





“Tell me what you are thinking about as you work.”
Encourage the user to talk while working, to voice
what they’re thinking, what they are trying to do,
questions that arise as they work, things they read.
Tell the user that you are not interested in their secret
thoughts but only in what they are thinking about
their task.
Record (videotape, tape, written notes) their
comments.
Convert the words and actions into data about your
prototype using a coding sheet
“Think Aloud” – Coding (2)
Time  | Action / Statement | Error state | Type | Comment
00:00 | Start | | Start | Given task to create a menu for two kids aged 6 and 8.
00:10 | “I see multiple ways in which I can start the ordering process like suggested menu or low-fat menu. Ok, I’ll start at low-fat.” | No | Deciding goal; Selecting action |
00:12 | Press Low-Fat. | No | Interface action |
00:15 | “Oh this is low-fat for adults. My kids wouldn’t eat steamed broccoli and fish.” | Yes | Interpreting system state |
Think-Aloud – Coding (3)

Coding Scheme

Cognitive Ergonomics Issues
Searching, Learning, Interpreting, Recalling, Memorizing, Selecting

Physical Ergonomics Issues
Screen resolution, audio amplitude, text size, icon size

Affective Issues
Emotion

Content Issues
Relevance of content
Information design preference
Color and Font choice

Computer Interaction Activity
Mouse movement
Mouse selection
Keyboard action
Spoken command
Getting “hard data”










Time to task completion
% of tasks completed
% of tasks completed per unit time (speed)
Ratio of successes to failures
Time spent in error state
Time spent recovering from errors
% or number of errors per number of actions
Frequency of getting help
Number of times user loses control of system
Number of times user expresses frustration
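A few of the measures listed above can be computed directly from per-session test records. The sketch below assumes a hypothetical record layout (the field names and all numbers are invented for illustration) and derives the percentage of tasks completed, mean time to completion, errors per action, and frequency of getting help.

```python
# Hypothetical records, one per test session; every value is made up.
sessions = [
    {"completed": True,  "seconds": 95,  "errors": 1, "actions": 20, "help_requests": 0},
    {"completed": True,  "seconds": 140, "errors": 3, "actions": 30, "help_requests": 1},
    {"completed": False, "seconds": 300, "errors": 6, "actions": 25, "help_requests": 2},
    {"completed": True,  "seconds": 110, "errors": 0, "actions": 25, "help_requests": 0},
]


def summarize(records):
    """Compute a few bottom-line measures from session records."""
    completed = [r for r in records if r["completed"]]
    total_actions = sum(r["actions"] for r in records)
    return {
        "percent_completed": 100 * len(completed) / len(records),
        # Time to task completion only makes sense for completed tasks.
        "mean_seconds_to_complete": (
            sum(r["seconds"] for r in completed) / len(completed)
            if completed else None
        ),
        "errors_per_action": sum(r["errors"] for r in records) / total_actions,
        "help_requests": sum(r["help_requests"] for r in records),
    }


summary = summarize(sessions)
for name, value in summary.items():
    print(f"{name}: {value}")
```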
Time | Statement | Error | Code
1 | “I am supposed to find out how many restaurants there are.” | No | Searching
2 | Selected Restaurant menu item. | |
Error
Comment
Wizard of Oz



“Faking the implementation”
You emulate and simulate unimplemented functions
and generate the feedback users should see.
Uses




Testing needs to respond to unpredictable user input.
Testing which input techniques and sensing mechanisms
best represent the interaction
Find out the kinds of problems people will have with the
devices and techniques
Very early stage testing (and quite useful for intelligent
room)
Quantitative Evaluation
When to progress to quantitative


Qualitative methods are best for formative
assessments
Quantitative methods are best for summative
assessments
GOMS (1)

GOMS means




Goals
Operators
Methods
Selection rules
GOMS (2)

Goal

Go from North Sydney to University of Sydney

Operators

Locate train station, board correct train, alight at Central

Methods

Walk, take bus, take ferry, take train, bike, drive

Selection rules

Example: Walking is cheap but slow
Example: Taking a bus is subject to uncertain road
conditions
GOMS (3)




Goals = something the user wants to do; may
have subgoals which are ordered
hierarchically
Operators = specific actions performed in
service of a goal; no sub-operators
Methods = sequence of operators to
accomplish goals
Selection rules = how to choose among methods
GOMS (4)

Keystroke-Level-Model (KLM)


To estimate execution time for a task, list the sequence of
operators and then total the execution times for the individual
operators. In particular, specify the method used to accomplish
each particular task instance
Six Operators






K to press a key or button
P to point with a mouse to a target on a display
H to home hands on the keyboard or other device
D to draw a line segment on a grid
M to mentally prepare to do an action or a closely related series
of primitive actions
R to represent the system response time during which the user
has to wait for the system
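The estimation procedure can be sketched as a lookup and a sum. The operator times below are the values commonly quoted for the Keystroke-Level Model (K about 0.2 s for an average skilled typist, P 1.1 s, H 0.4 s, M 1.35 s); they are assumptions here, since the slides give no numbers. D and R depend on the drawing and on the system, so they must be supplied per task. The operator sequences are hypothetical.

```python
# Commonly quoted KLM operator times in seconds (assumed, not from the slides).
DEFAULT_SECONDS = {"K": 0.20, "P": 1.10, "H": 0.40, "M": 1.35}


def estimate_seconds(operators, extra_seconds=None):
    """Total the execution time for a sequence of KLM operators.

    operators: iterable of operator letters, e.g. "HMPK".
    extra_seconds: optional times for task-dependent operators such as D or R,
                   or overrides for the defaults.
    """
    times = dict(DEFAULT_SECONDS)
    times.update(extra_seconds or {})
    missing = sorted(set(operators) - set(times))
    if missing:
        raise ValueError(f"no time given for operators: {missing}")
    return sum(times[op] for op in operators)


# Hypothetical method: home on the mouse, then twice think, point, and click.
print(round(estimate_seconds("HMPKMPK"), 2))
# Same idea with an assumed 0.5 s system response at the end.
print(round(estimate_seconds("MPKR", {"R": 0.5}), 2))
```

Comparing two interfaces then amounts to writing one operator sequence per interface for the same task and comparing the totals.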
GOMS (5)
GOMS (6)

Card, Moran, and Newell GOMS (CMN-GOMS)


Like GOMS, CMN-GOMS has a strict goal
hierarchy, but methods are represented in an
informal program form that can include
submethods and conditionals.
Used to predict operator sequences.
GOMS (7)

Natural GOMS Language (NGOMSL)


An NGOMSL model is constructed by performing a
top-down, breadth-first expansion of the user’s
top-level goals into methods, until the methods
contain only primitive operators, typically
keystroke-level operators. Like CMN-GOMS,
NGOMSL models explicitly represent the goal
structure, and so they can represent high-level
goals.
NGOMSL provides learning time as well as
execution time predictions.
GOMS (8)

Comparative Example



Goal = remove a directory
Comparison = Apple Macintosh MacOS X and
Windows XP
K-L-M Method
Hypothesis Testing (1)

Stating and testing a hypothesis allows the
designer



To provide data about cognitive process and
human performance limitations
To compare systems and fine-tune interaction
By


Controlling variables and conditions in the test
Removing experimenter bias
Hypothesis Testing (2)

A hypothesis IS


A proposed explanation for a natural or artificial
phenomenon
A hypothesis IS NOT

A tautology (i.e., could not possibly be disproved)
Hypothesis Writing (1)

A good hypothesis


(Interactive Menu Project) There is no difference
in the time to complete a meal order between a
dialog driven interface and a menu driven
interface regardless of the expertise level of the
subject.
A bad hypothesis

(Interactive Menu Project) The meal order entry
system is easy to use.
Hypothesis Writing (2)

A good hypothesis includes

Independent variables that are to be altered



Aspects of the testing environment that you
manipulate independent of a subject’s behaviour
Classifying the subjects into different categories
(novice, expert)
Example from Interactive Menu Project


UI Genre: Dialog driven; Menu driven
User Type: Expert, Novice
Hypothesis Writing (3)

A good hypothesis also includes

Dependent variables that you will measure


Quantitative measurements and observations of
phenomena that depend on the subject’s
interaction with the system and on the
independent variables
Example

Interactive Menu Project



Order entry time
Number of selection errors made
Count of interaction methods
Methods of Quantitative Analysis



Mean, Median and Standard Deviation
Correlation
ANOVA (analysis of variance)
Mean, Median and Standard Deviation



The mean is the expected
value of a measured
quantity.
The median is defined as
the middle of a
distribution: half the values
are above the median and
half are below the median.
The standard deviation tells
you how tightly clustered
the values are around the
mean.
\[ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \]
\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \]
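A minimal sketch of these three statistics using Python's standard library is shown below. The order entry times are invented for illustration, and `pstdev` is used because it divides by N, which corresponds to the population form of the standard deviation.

```python
from statistics import mean, median, pstdev

# Hypothetical order entry times in seconds for five test users.
order_times = [30, 42, 38, 51, 39]

average = mean(order_times)    # sum of the values divided by N
middle = median(order_times)   # middle value of the sorted data
spread = pstdev(order_times)   # population standard deviation (divides by N)

print(f"mean={average}, median={middle}, standard deviation={spread:.2f}")
```

With a small sample drawn from a larger population, `statistics.stdev` (which divides by N - 1) is often preferred.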
Correlation (1)


Used when you want to find a relationship between two
variables, where one is usually considered the independent
variable and the other is the dependent variable
The correlation may be




Up to +1 when there is a direct relationship
0 where there is no relationship
-1 when there is an inverse relationship
Notes

A correlation does not imply causality – there may be a bias in your
sample set or you do not have a large enough sample set
Correlation (2)



Example – Is there a correlation between the
number of words people say while playing
Monopoly and how much fun they’re having?
Independent variable: Number of words
Dependent variable: Fun
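As an illustration of how such a correlation could be computed, the sketch below implements the Pearson correlation coefficient by hand. The word counts and fun scores are invented, and Pearson's r is an assumption about which coefficient is intended.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical data: words spoken per game and self-reported fun (%).
word_counts = [500, 1000, 1500, 2000, 2500]
fun_percent = [20, 35, 30, 60, 55]

r = pearson_r(word_counts, fun_percent)
print(f"r = {r:.2f}")
```

A value near +1 or -1 indicates a strong linear relationship, but as noted earlier it says nothing about which variable, if either, causes the other.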
Correlation (3)
[Figure: scatter plot "Games Heuristic" of Word Count (0 to 4000) versus Fun % (0% to 100%), series "Games" with a polynomial trend line; R = 0.64]
ANOVA (1)



ANOVA is ANalysis Of VAriance.
Used when you want to find out whether there is a
statistically significant difference among the
means of a measured
quantity (observation) taken from different test
cases (factor levels)
The number of replicates (observations per
factor level) must be the same in each factor
level. This is called a balanced ANOVA.
ANOVA (2)

Example




Suppose you want to test the completion time for ordering
a meal with the Interactive Menu.
You decide to classify your users by age group: 5-12, 13-18, and 19-25.
Then, you measure the amount of time it takes to
complete the order entry.
There is likely to be a different mean time to order among
the three age groups. What you want to know is whether
in fact the groups really are different. That is, is there
statistical evidence that age causes the difference between
the mean order entry time?
ANOVA (3)


The null hypothesis is that there is no real
effect of age on order entry time;
the groups are likely to
have different mean order completion
times simply by chance.
The standard deviation of the
expected mean estimates this
likely variation. In the equation,
σ is the standard deviation
of the completion times for all
groups and N is the number of
people per group (must be the
same).
\[ \frac{\sigma}{N^{1/2}} \]
ANOVA (4)



Collect the completion time for each group.
Calculate the mean completion time and
standard deviation for each group.
If the standard deviation of those means is
“significantly” larger than the standard
deviation of the expected mean, we have
evidence that the null hypothesis is not correct
and instead age has an effect.
ANOVA (5)

You of course would typically use a statistical
analysis package! This is what the package
does.
ANOVA (6)






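Here K is the number of groups (factor levels), N is the number of
observations per group, x_ij is observation j in group i, and μ_i is
the mean of group i.
Calculates sum square of
averages (SSA)
\[ SSA = \frac{1}{N} \sum_{i=1}^{K} \left( \sum_{j=1}^{N} x_{ij} \right)^2 - \frac{1}{KN} \left( \sum_{i=1}^{K} \sum_{j=1}^{N} x_{ij} \right)^2 \]
Calculates sum square of
errors (SSE)
\[ SSE = \sum_{i=1}^{K} \sum_{j=1}^{N} (x_{ij} - \mu_i)^2 \]
Calculates the mean
squared average (MSA)
\[ MSA = \frac{SSA}{K - 1} \]
Calculates the mean
squared error (MSE)
\[ MSE = \frac{SSE}{K(N - 1)} \]
Calculates the F-ratio
\[ F = \frac{MSA}{MSE} \]
Look up the P-value to see
if the F-ratio is greater than
or equal to what would
have been found by chance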
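The calculation can be sketched in plain Python for the balanced case, following the SSA, SSE, MSA, MSE, and F steps listed on this slide. The completion times are invented for illustration, and the sketch stops at the F-ratio: looking up the P-value needs an F-distribution table or a statistics package, which is not shown here.

```python
def balanced_one_way_anova(groups):
    """Compute SSA, SSE, MSA, MSE and the F-ratio for equally sized groups."""
    k = len(groups)        # number of groups (factor levels)
    n = len(groups[0])     # observations per group
    if any(len(group) != n for group in groups):
        raise ValueError("balanced ANOVA needs the same number of observations per group")

    grand_total = sum(sum(group) for group in groups)
    # Sum of squares among the group averages.
    ssa = sum(sum(group) ** 2 for group in groups) / n - grand_total ** 2 / (k * n)
    # Sum of squared deviations of each observation from its own group mean.
    sse = sum((x - sum(group) / n) ** 2 for group in groups for x in group)

    msa = ssa / (k - 1)
    mse = sse / (k * (n - 1))
    return {"SSA": ssa, "SSE": sse, "MSA": msa, "MSE": mse, "F": msa / mse}


# Hypothetical order entry times (seconds) for three age groups.
times_by_age_group = [
    [61, 62, 63],   # ages 5-12
    [62, 63, 64],   # ages 13-18
    [63, 64, 65],   # ages 19-25
]

result = balanced_one_way_anova(times_by_age_group)
print(result)
```

A larger F-ratio means the group means differ more than the within-group scatter alone would suggest.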
Usability Evaluation Summary

Select appropriate evaluation technique based on
availability of human subjects and fidelity of
prototype

Wizard of Oz suitable for early stage, “think aloud” is not
Heuristic evaluation and cognitive walkthroughs are good
for mid-stage reviews, GOMS is overkill until details of
interface have been finalized

Set clear metrics and objectives for evaluation

Everyone should agree what is being tested.
If the results of human subject tests are ambiguous, you
either need a larger sample set (more time and money) or
your testing procedure was too variable.
Agree how to use results of feedback before testing
Quantitative Analysis


Controlled experiments that test a hypothesis
can provide convincing evidence on specific
usability issues
In practice, often used in




Extremely complex interfaces (aviation)
High-risk (medical instruments)
High-use (manufacturing)
Academic research when developing new
interface genres