159.410/710 User Interface Design
© Paul Lyons 2010
Epistemology
Approaches to knowledge
    Engineers: holistic, constructionist
    Scientists: reductionist, analyst/synthesists
    Arts: complexity, subjectivists

I hear and I forget.
I see and I remember.
I do and I understand.
    Confucius (attributed)
Epistemology
Types of HCI research
Development of interaction widgets
Usability – efficiency, enjoyability
Internet & web
Social applications
Mobile applications – shoehorning complex applications onto tiny screens
Characteristics of HCI research
HCI research has a focus on people
what computers can do is not the main point
what computers can help people do is
variety of contributing disciplines
    sociology – observation techniques
    psychology – controlled experiments
    statistics – handling noisy data
    computer science – developing (genuinely) new interface paradigms
rigorous research methodologies are required
it isn’t enough to develop a new interface or a new interface component
does the new interface make things better?
how do you know?
are you sure?
Characteristics of HCI research
Things to measure
performance measures (largely industry-driven)
    time to complete a task
    number of tasks completed in a standard time
    accuracy of performing a task
enjoyment
emotional wellbeing
why people choose to spend discretionary time using computers
    difficult to measure in a laboratory setting – e.g. contributing to Wikipedia
why people choose to stop using applications
people's usage patterns of mobile computing devices and social apps
Characteristics of HCI research
Replication of results
Multiple studies that reach the same or similar conclusion
Triangulation by different research methods
if a single method produces identical results repeatedly, the reason may be a flawed method
Results may change over time
reasons for using a computer 1980s vs. 2000s
finding information – searching and tagging vs. hierarchical directories
Characteristics of HCI research
Tradeoffs
speed vs. accuracy (Fitts's Law)
better interface vs. familiar interface
more efficient keyboard vs. QWERTY keyboard
iPad is cool and new
it’s the coolth that persuades people to adopt it
how do you measure that?
security vs. usability
eye-scans and fingerprints?
a revolutionary, undeniably better computer vs. environmental costs of computer disposal
Characteristics of HCI research
HCI is an interdisciplinary discipline
in the past: human factors, engineering, psychology
    all suit experimental design
    reductionist; widely accepted (statistical tests, control groups); reliable
in the present: add library science, information science, art and design
    these suit competition with judges (cf. architecture)?
    more holistic, more subjective
    less trusted (not less trustworthy)
in the future: ubiquitous? Virtual Reality? mind-activated?
Epistemology
Types of model
generative
produce principles and guidelines or actual systems (e.g. Colour Harmoniser)
prescriptive
suggest ways of building things (e.g. patterns)
predictive
allow us to plan for the future
explanatory
explain what causes the data that have been observed
descriptive
generalisations about the data – allow us to see order amidst chaos
Experimental Research
Usability Testing
The goal of usability testing is simply to find flaws in a specific interface
A small number of users may take part
… it can be structured or unstructured.
… there is no claim that the results can be generalised.
The goal is simply to find flaws and help the developers improve the interface.
If that involves jumping in and helping a user or changing the task mid-process, that is acceptable.
Lazar, Feng and Hochheiser
Research Methods in HCI
2010
Experimental Research
HCI research – 57 varieties
a spectrum of methods, from rich but not reproducible to reproducible but reductionist:
    observations, field studies, surveys, usability studies, interviews, focus groups, controlled experiments
descriptive research: observations – may be quantitative and accurate
relational research: establishes correlations between factors – does not establish causality
    typing speed correlated with hours spent gaming
    does time spent gaming improve typing? are good typists successful gamers?
experimental research: can establish causality
    allocate users to two groups randomly
    expose one group to games, the other not
    measure typing ability of both groups after a suitable interval
Experimental Research
Null and alternative hypotheses
H0: the new widget/treatment causes no effect
    no change in speed, and no change in user satisfaction
H1: the new widget/treatment causes an effect
    some change in speed, and also some change in user satisfaction
H0 and H1 are mutually exclusive – a seesaw: exactly one of them holds
However… testing multiple hypotheses can complicate controls and variables
a good hypothesis
    is clear and unambiguous
    clearly distinguishes between independent and dependent variables
    is testable in a single experiment
    clearly identifies the control groups and conditions of the experiment
    generally derives from preliminary observational studies
each combination of independent variables is a condition
Experimental Research
Independent and dependent variables
the independent variable is the "cause": variations in its value are under the experimenter's control
the dependent variable is the "effect": variations in its value are observed
Null hypothesis: there is no speed change between the original widget and the new widget
    the experimenter makes the choice of widget – it's the independent variable
    the experimenter measures the speed – it's the dependent variable
if the experimental results are plotted on a graph
    the independent variable goes on the x-axis
    the dependent variable goes on the y-axis
Experimental Research
Typical independent variables
    Technology: typing vs. speech; mouse vs. joystick, touchpad etc.
    Design: pull-down vs. pop-up menu; colour scheme; layout
    Demographic: gender, age, experience, education
    Context: lighting, noise, seated vs standing, other people in the vicinity
Typical dependent variables
    Efficiency: time to complete a task, speed
    Accuracy: error rate
    Subjective satisfaction: Likert scale ratings
    Ease of learning and retention rate: time to learn, loss after a week, a month
    Cognitive demand: time before onset of fatigue
Experimental Research
Components of an experiment
Treatments: the things we want to compare (cf. medical treatments)
    e.g. compare two splines A and B for a CAD tool
    using a within-subjects design: measure time-to-complete task with A, then B
    flaw: subjects learnt the task, so B looks best
    solution: randomise the order of tasks
Units: the "things" that a treatment is applied to (normally human subjects)
Assignment method: how subjects are assigned to treatments
    e.g. comparing two treatments using a between-subjects design:
    allocate subjects to treatment A until there are enough, then allocate subjects to treatment B
    flaw: A is applied to early birds, B to late sleepers
    solution: randomise allocation to the treatments
randomisation is often necessary; a sketch follows below
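As a concrete illustration of randomised assignment, here is a minimal sketch in Python; the subject names, function name and seed are our own, not from the slides:

```python
import random

def assign_treatments(subjects, treatments, seed=None):
    """Randomly assign each subject to a treatment, keeping group sizes balanced."""
    rng = random.Random(seed)
    # Repeat the treatment labels until there is one per subject, then shuffle.
    slots = (treatments * (len(subjects) // len(treatments) + 1))[:len(subjects)]
    rng.shuffle(slots)
    return dict(zip(subjects, slots))

subjects = [f"subject{i}" for i in range(1, 9)]
print(assign_treatments(subjects, ["A", "B"], seed=42))
```

Shuffling a balanced list of labels, rather than flipping a coin per subject, guarantees equal group sizes as well as random allocation.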
Experimental Research
Significance tests
this approach depends on being able to distinguish between an effect and no effect
how do we decide whether or not an effect is real?
we measure the probability that it occurred by chance
if that probability is sufficiently low, we say that there is a significant effect.
we generally use p < 0.05
    p < 0.05 says that the probability that the observed behaviour occurred by chance is less than 5%,
    or that the probability that the effect is real exceeds 95%
    (p < 0.005 tightens these to 0.5% and 99.5%)
whether that's good enough depends on the application
for a new drug, a significance level of p < 0.05 is not good enough
    if the null hypothesis is "the standard dose is not fatal"
Experimental Research
Type I errors & Type II errors
a Type I error (aka "false positive"): the study concludes the widget is better when it is no different
    – a gullibility error
a Type II error (aka "false negative"): the study concludes the widget is no better when it is better
    – a blindness error
probability of a Type I error = α
    α is the probability that the effect occurred by chance, i.e. the p-value
    generally aim for p < 0.05
probability of a Type II error = β
    the statistical power of a test = 1 – β
    = the probability of correctly rejecting an incorrect null hypothesis
    = the probability of finding an effect that does exist
α and β are related: the less gullible you are, the more likely you are to be blind to improvements
keep β low by using large sample sizes (see the simulation sketch below)
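The relationship between α, β and sample size can be explored by simulation. A minimal sketch, assuming Python with NumPy and SciPy; the effect sizes and sample sizes are illustrative, not from the slides:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejection_rate(effect, n, trials=2000, alpha=0.05):
    """Estimate how often a t-test rejects H0 for a given true effect size."""
    rejections = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)        # control group
        b = rng.normal(effect, 1.0, n)     # treatment group
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / trials

print("Type I rate (no real effect):", rejection_rate(effect=0.0, n=20))  # ~ alpha
print("power with n=20:", rejection_rate(effect=0.5, n=20))               # = 1 - beta
print("power with n=80:", rejection_rate(effect=0.5, n=80))               # larger n, lower beta
```

With no real effect the rejection rate hovers near α; with a real effect, increasing n visibly raises the power (1 – β).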
Experimental Research
Limitations of experimental research
controlled experiments are a very powerful technique
but
hypothesis must be well-defined
number of variables must be limited, preferably orthogonal
HCI problems can be difficult to define
many, interrelated factors may be involved
factors other than the independent variables must not affect the dependent variables
    e.g. it is difficult to factor out familiarity with technology in an age-related study
    prescreen to ensure homogeneity between subject groups
    use statistical techniques designed to filter out confounding factors (analysis of covariables)
subjects’ behaviour in a lab differs from behaviour in real world
Experimental Design
True Experiments require
a testable hypothesis
    "<x> is an intuitive interface" is not testable
    "subjects will be able to use <x> correctly in under 1 minute" – all of them? 50%? > 75%?
two conditions (one treatment, one control), sometimes more
random assignment of subjects
    no random assignment? then it's a quasi-experiment
    e.g. it is not ethical to randomly assign children to parents
    to study the effect of single-parent upbringing
    (no control group at all makes it a non-experiment)
quantitative measurements
significance tests
attention to bias elimination
replicability
Experimental Design
Other types of experiment
quasi-experiments (subjects not randomly assigned)
may be necessary for practical or ethical reasons
can still produce useful results but more susceptible to confounding factors
non-experiments (no control group)
insufficient subjects – use what’s available
researcher lacks influence (modified Word interface)
may be necessary for practical or ethical reasons
can still produce useful results but even more susceptible to confounding factors
e.g. usability trials – aim is to detect problems
formal experiments are designed to detect subtle effects and to factor out researcher bias
researcher’s specialist knowledge may trump population’s preferences
(e.g. user surveys for Xerox showed little demand for such a device)
is demonstrating that it is possible to build something a valid experiment?
engineering research often stops at this point
Experimental Design
Important considerations
Hypothesis: There is no difference between target selection speed when using a mouse, a joystick,
or a trackball to select icons of different sizes (small, medium, large)
    How many independent variables? 2
        • type of pointing device
        • icon size
    number of conditions: 3 x 3 = 9
    number of dependent variables: 1
measurement may need careful thought
    e.g. is typing speed wpm or error-free wpm?
    is the speech recogniser error rate definitive?
Experimental Design
Structure of an experiment
basic design: 1 independent variable
factorial design: >1 independent variable

within-group design
    only one group, but subjects experience multiple conditions
    eliminates individual differences; smaller population required; suits small subject pools
    learning and fatigue may cause effects
        randomise the order of conditions and/or provide preliminary training
        order tasks using a Latin square to factor out fatigue:
            Subj1  1 2 3
            Subj2  2 3 1
            Subj3  3 1 2
    suits tasks with large differences between individuals, e.g. cognitively complex tasks
between-group design
    one group per condition; each subject experiences only one condition
    no learning effect, less fatigue effect
    susceptible to differences between groups; requires big groups, randomly selected
    suits cognitively simple tasks with no learning effect (inter-subject differences increase with complexity)
split-plot design
    a mix of within-group and between-group
    suits tasks where a subject difference is an independent variable
    e.g. the effect of using GPS (binary, within-group) on three age-groups (between-groups)
(a sketch for generating the Latin square follows below)
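A cyclic Latin square like the one above is easy to generate programmatically. A minimal sketch in Python; the function name is our own:

```python
def latin_square(n):
    """Cyclic Latin square: row i is the condition sequence rotated by i.
    Row k gives the task order for subject k."""
    return [[(i + j) % n + 1 for j in range(n)] for i in range(n)]

for subj, order in enumerate(latin_square(3), start=1):
    print(f"Subj{subj}", *order)
# Subj1 1 2 3
# Subj2 2 3 1
# Subj3 3 1 2
```

Each condition appears exactly once per row and once per column, so order effects are spread evenly across subjects.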
Experimental Design
Watch out for interaction effects
if the effect of one variable depends on the value of the other variable, the variables interact
the variables are (should be) independent, but their effects interact
[figure: task duration for a simple task and a complex task, plotted for Office 2003 vs. Office 2007;
parallel lines indicate genuinely independent variables]
Experimental Design
Reliability of experimental results
random errors
    research involving human subjects is noisy
    observed value = actual value + random error (noise)
    sample size: with an increased sample size the actual values add, and the relative size of the noise tends to 0
systematic errors
    the same each time – not cancelled by a large sample
    more deleterious than noise
Experimental Design
Systematic errors
instrumentation errors
    can often replace instruments (e.g. a stopwatch) with software
experimental procedure
    a non-random task/condition order allows learning & fatigue to produce opposite systematic errors
        randomise conditions and tasks when using a within-group design
    instructions may introduce errors
        "complete the task as fast as possible" vs. "take your time, no rush" produced different results
        (subjects under time stress were slower!)
        instructions from different members of the research team may differ
        use identical instructions for all participants – written or recorded
    trivial details matter
        data entry on a PDA – holding the PDA in the hand produced different results from sitting it on a table
    run pilot studies beforehand to detect potential biases
        you don't want to realise half-way through the experiment that all the results are compromised
        because you have overlooked something
    use real participants from the target population
Experimental Design
Systematic errors: participants
    age bias
    education bias (particularly prevalent in university studies)
    interest in the product (or its domain)
    recruit a set of participants representative of the target population
        which may be quite skewed – e.g. for elder-care systems
    don't stress the participants
        explain that the system is under test, not them; any result they produce is good
        organise the schedule conservatively so participants aren't inconvenienced
        it's polite, and it produces better results!
Experimental Design
Systematic errors: experimenter behaviour
    express no opinion about the system
    maintain noncommittal body language
    be ready to start on time
    use the same experimenter each time, if possible, or a recorded protocol
    if multiple experimenters are necessary, require them to follow a written experimental protocol
Experimental Design
Systematic errors: environmental factors
    physical environment: noise, temperature, humidity, lighting, vibration
    social environment: people nearby, power relationships between participants and people nearby, interruptions
    use a quiet room, suitable lighting, comfortable furniture and a non-distracting environment
    observe by CCTV or from behind a 1-way mirror, if possible
    for field studies, visit the location beforehand to check for problems
Experimental Design
Experimental Procedures
1. Identify a research hypothesis
2. Design the study
3. Run a pilot
4. Recruit participants
5. Run data collection sessions
6. Analyse the data
7. Report the conclusions

Within each data collection session:
1. Set up the experimental environment/equipment
2. Greet participants
3. Outline the purpose of the study and the procedures
4. Obtain participants' consent
5. Assign participants to an experimental condition
6. Participants complete the pre-survey (if any)
7. Participants complete the training task
8. Participants complete the study task
9. Participants complete the post-survey (if any)
10. Debrief (can be more useful than a formal survey)
Analysing the Data
There are many analytical tools
t-test
    independent samples
    paired samples (e.g. before/after tests)
ANOVA
    one-way
    factorial
    repeated measures
correlation
regression
Chi-square
Analysing the Data
Data Preparation
error checking and correction
    incorrect grouping of survey forms
    impossible values, e.g. computing experience > age, or an impossible age
    inconsistent formats, e.g. age given as 23½, 23 years and 7 months, nearly twenty-four
    paper forms need checking; survey software (or Excel) could check at data collection time
    if data can't be corrected, it may need to be thrown away
        e.g. because subjects are anonymous
data may need pre-processing
    coding text as numbers (e.g. 1, 2, 3 for no degree, bachelors, P/G)
    extracting general themes from individual interviews
    coding interaction events (e.g. click(100, 250) → "select book")
    consistency may need to be verified if there is > 1 coder
analysis may require data to be restructured
    related information in different surveys (pre & post trial, for example)
    analysis software requires specific formatting
        SPSS independent-samples and paired-samples t-tests take the same data in 1 column and in 2 parallel columns respectively
Analysing the Data
Start with an exploratory analysis
Get a feel for the data
measures of central tendency: mean, median, mode
    the median is the 50th percentile – good for skewed (e.g. Pareto) distributions and when outliers may be errors
measures of spread: range, variance, standard deviation
    range = data_max – data_min (crude – likely to increase with sample size, and sensitive to outliers)
    variance estimates the spread around the mean: sign-independent, but has different units from the samples
        $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$
    standard deviation – the square root of the variance – has the same units as the samples
        $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$
        measures the mean deviation of the samples from the mean of the samples
Box-and-whisker plots – good for initial comparison between groups
Histograms – the most popular
more complex measures of spread assume a normal distribution
    it may be necessary to modify the data to conform
(a computational sketch follows below)
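These summary statistics are straightforward to compute. A minimal sketch using NumPy; the task times are illustrative, not study data (note ddof=1 to get the (n – 1) denominator used above):

```python
import numpy as np

times = np.array([245, 236, 321, 246, 213, 265, 178, 289])  # illustrative task times

print("mean:", times.mean())
print("median:", np.median(times))
print("range:", times.max() - times.min())
print("variance:", times.var(ddof=1))  # ddof=1 gives the (n - 1) denominator
print("std dev:", times.std(ddof=1))   # same units as the samples
```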
Analysing the Data
Mean differences do not, by themselves, show that treatments are different
    two groups, same task (comparing two search engines): Δ = 5 – are these means different?
    one group, two tasks: Δ = 5 – are these means different?
significance tests are necessary to determine the probability that a difference is due to chance

which test?
    design          IVs   conditions per IV   test
    between-groups  1     2                   independent-samples t-test
    between-groups  1     ≥3                  1-way ANOVA
    between-groups  ≥2    ≥2                  factorial ANOVA
    within-group    1     2                   paired-samples t-test
    within-group    1     ≥3                  repeated measures ANOVA
    within-group    ≥2    ≥2                  repeated measures ANOVA
    mixed           ≥2    ≥2                  split-plot ANOVA
Analysing the Data
To compare 2 means use a t-test
null hypothesis: task completion times for subjects using word-prediction software
do not differ from task-completion times for subjects who do not use the software

$t = \frac{\text{signal}}{\text{noise}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$  (for 2 groups; $s^2$ is, of course, the variance)

generally say there's a significant effect if α ≤ 0.05 (remember: α is the same as the p-value)
however, the significance of a particular t depends on the size of the subject groups
    specifically the degrees of freedom, df = total participants – number of groups = n1 + n2 – 2
consult published tables showing the α value for particular (t, df) combinations
    statistical software usually outputs α from built-in tables
Analysing the Data
To compare 2 means use a t-test
null hypothesis: task completion times for subjects using word-prediction software
do not differ from task-completion times for subjects who do not use the software

for unrelated samples, use an independent-samples t-test
    times for the group using word-prediction software
    vs. times for the group using conventional software
    SPSS t-test data comprises the times and the group membership
for a single group, use a paired-samples t-test
    times for each subject using word-prediction software
    and for the same subject using conventional software
    SPSS t-test data comprises times with the software and times without the software
either way the result is a t-value: a high t-value means a high P(null hypothesis false)
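Both t-test variants are available in SciPy. A minimal sketch; the task times below are made up for illustration:

```python
import numpy as np
from scipy import stats

# Illustrative task-completion times (seconds); not the study's real data.
with_prediction = np.array([245, 236, 321, 212, 246])
without_prediction = np.array([278, 289, 322, 261, 301])

# Unrelated samples: independent-samples t-test.
t_ind, p_ind = stats.ttest_ind(with_prediction, without_prediction)

# Same subjects measured twice: paired-samples t-test.
t_rel, p_rel = stats.ttest_rel(with_prediction, without_prediction)

print(f"independent: t = {t_ind:.3f}, p = {p_ind:.3f}")
print(f"paired:      t = {t_rel:.3f}, p = {p_rel:.3f}")
```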
Analysing the Data
What if the hypothesis predicts the sign of the difference?
if we know that the sign of the effect will be + or –
    instruct the analysis software to use a 1-tailed t-test
    α = 0.1 then indicates the same level of confidence as α = 0.05 for a 2-tailed test
Do NOT switch to a one-tailed t-test because the 2-tailed test indicates no significance
    the test should be hypothesis-driven, not data-driven!
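In SciPy (1.6 or later), the tail is selected with the alternative parameter of ttest_ind; a sketch with hypothetical data:

```python
from scipy import stats

fast = [198, 205, 187, 214, 192]   # hypothetical times with the new widget
slow = [231, 224, 246, 218, 237]   # hypothetical times with the old widget

# Hypothesis made in advance: the new widget is FASTER (smaller times).
one_tailed = stats.ttest_ind(fast, slow, alternative="less")
two_tailed = stats.ttest_ind(fast, slow)

print("one-tailed p:", one_tailed.pvalue)   # roughly half the two-tailed p
print("two-tailed p:", two_tailed.pvalue)
```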
Analysing the Data
ANOVA: within-group variances vs. population variance
null hypothesis: sample sets A, B, C & D belong to 1 population
if $s$ for the means of sample sets A–D exceeds the $s$ predicted from the combined population,
there is more than 1 population
F = found variation in the averages / expected variation in the averages
    F = 1 supports the null hypothesis
within-groups variability is the error variance; variability due to differences between means is the effect
if the effect variance is large w.r.t. the error variance,
the treated group and the untreated group act as different populations (the treatment has an effect) – a MAIN EFFECT

a worked example, with much larger differences between the means than within the groups:

                          Group 1   Group 2
    Observation 1         2         6
    Observation 2         3         7
    Observation 3         1         5
    Mean                  2         6
    Sums of Squares (SS)  2         2

    Overall Mean = 4      Total Sums of Squares = 28

ANOVA determines p, taking df into account:

              SS     df   MS     F      p
    Effect    24.0   1    24.0   24.0   .008
    Error     4.0    4    1.0
Analysing the Data
Use ANOVA (aka the F-test) to compare the means of ≥ 2 groups
F is the parameter actually generated by the calculation (cf. the t-test's t)
we've already seen the special case of ANOVA for comparing 2 means: the t-test (F = t²)

                             design                            IVs   conditions
    1-way ANOVA              between-group                     1     ≥3
    factorial ANOVA          between-group                     ≥2
    repeated measures ANOVA  within-group
    split-plot ANOVA         between-group and within-group
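The F = t² relationship for two groups can be checked numerically. A sketch with hypothetical data, using SciPy:

```python
from scipy import stats

a = [245, 236, 321, 212, 246]   # hypothetical times, group A
b = [278, 289, 322, 261, 301]   # hypothetical times, group B

t, p_t = stats.ttest_ind(a, b)
f, p_f = stats.f_oneway(a, b)

print(f"t = {t:.4f}, t^2 = {t*t:.4f}")
print(f"F = {f:.4f}")                     # equals t^2 for two groups
print(f"p-values agree: {p_t:.4f} vs {p_f:.4f}")
```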
Analysing the Data
Use ANOVA (aka the F-test) to compare the means of ≥ 2 groups
1-way ANOVA: between-group design, 1 IV, ≥3 conditions

SPSS data input for a 1-way ANOVA (pared down to the minimum) is two columns, task duration and group code:
    code 0 – standard text entry (the control group): 245, 236, 321, …
    code 1 – text-prediction: 246, 213, 265, …
    code 2 – dictation: 178, 289, 222, …

SPSS output from the analysis (significance obtained by table lookup):

                    sum of sqs   df   Mean sq    F       significance
    between-group   7842.250     2    3921.125   2.174   0.139
    within-group    37880.375    21   1803.827

significance greater than 0.05 → the F-value is not significant

So, how would we summarise this in a thesis or report?
    "A 1-way ANOVA analysis
    with text-entry method as independent variable
    and task completion time as dependent variable
    suggests there is no significant difference between the three conditions
    (F(2, 21) = 2.174, n.s.)"
    24 samples in 3 groups gives df = 21
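A 1-way ANOVA equivalent to the SPSS run above can be done with scipy.stats.f_oneway. Only the first three durations per group are visible in the transcript, so this sketch uses those and will not reproduce F(2, 21) = 2.174:

```python
from scipy import stats

# First three task durations per group, as shown on the slide; the rest are elided.
standard   = [245, 236, 321]
prediction = [246, 213, 265]
dictation  = [178, 289, 222]

f, p = stats.f_oneway(standard, prediction, dictation)
print(f"F = {f:.3f}, p = {p:.3f}")   # compare p with 0.05
```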
Analysing the Data
Use ANOVA (aka the F-test) to compare the means of ≥ 2 groups
factorial ANOVA: between-group design, ≥2 IVs

Q: does the nature of the task (composition or transcription) affect performance?
a 3 (data entry method) x 2 (task type) between-groups design needs six groups:

                              task type 0   task type 1
    method 2 (dictation)      gp1           gp2
    method 1 (predictive)     gp3           gp4
    method 0 (standard)       gp5           gp6

SPSS data entry format: three columns – task time, task type (0 or 1) and data entry method (0, 1, 2),
one row per participant (the SPSS function is called Univariate analysis)
Analysing the Data
Use ANOVA (aka the F-test) to compare the means of ≥ 2 groups
factorial ANOVA: between-group design, ≥2 IVs
Q: does the nature of the task (composition or transcription) affect performance?

SPSS output:

    IVs                        sum of sq   df   mean square   F       significance
    task type                  2745.188    1    2745.188      1.410   0.242
    entry method               17564.625   2    8782.313      4.512   0.017
    interaction task * entry   114.875     2    57.437        0.030   0.971
    error                      81751.625   42   1946.467

task type caused no significant effect: F(1, 42) = 1.41, n.s.
entry method had a significant effect: F(2, 42) = 4.51, p < 0.05
there is no significant interaction between task and entry method
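A factorial ANOVA like this SPSS Univariate analysis can be sketched with statsmodels. The data frame below is hypothetical long-format data (two participants per cell, not the study's full data set, whose error df of 42 implies 48 participants):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical long-format data: one row per participant.
df = pd.DataFrame({
    "time":   [245, 236, 246, 213, 178, 289, 256, 269, 265, 232, 189, 321],
    "task":   ["transcription"] * 6 + ["composition"] * 6,
    "method": ["standard", "standard", "predictive", "predictive",
               "dictation", "dictation"] * 2,
})

# Fit a linear model with both main effects and their interaction.
model = ols("time ~ C(task) * C(method)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # SS, df, F and p per effect and interaction
```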
Analysing the Data
Use repeated measures ANOVA for within-group studies
the previous between-groups design requires lots of participants (72, if 12 subjects/group)
what about a within-groups design?
    especially worthwhile if only some candidates are eligible – e.g. disabled users
    a within-groups design is faster, produces less fatigue, can control for learning, and needs a smaller sample
to study the effect of 1 IV, use a 1-way repeated measures ANOVA
    3 data points from each participant, all in the same row:
        245  236  321
        246  213  265
        278  289  222
to study the effect of >1 IV, use a multi-level repeated measures ANOVA
    for a 3 x 2 factorial study, 6 data points per participant per row:

                      transcription                      composition
                      standard  predictive  dictation    standard  predictive  dictation
        participant1  245       246         178          256       265         189
        participant2  236       213         289          269       232         321
Analysing the Data
Assumptions of t tests and F tests
no systematic errors
    e.g. with different instructors giving different sets of instructions,
    correlation between the errors of the participants in each instructor's group will systematically skew the results
homogeneity of variance (identical distribution of errors)
    the populations should have comparable variances
    do two distributions with means x̄1 and x̄2 have significantly different means?
    not easy to say, either for people or for software, if their variances differ greatly
normal distribution of errors
    may be violated if the data is highly skewed (a non-normal distribution)
Analysing the Data
Use Pearson's r to identify correlations
is factor_a related to factor_b?
determine Pearson's product moment correlation coefficient, r
r varies from -1 to 1
    -1: a perfect negative linear relationship
    0: no relationship
    +1: a perfect positive linear relationship

example data (one row per subject):

    computer     time with          time with
    experience   standard software  predictive software
    12           245                246
    6            236                213
    3            321                265
    19           212                189

we can determine r values for experience * standard s/ware, experience * predictive s/ware, and standard * predictive:

                              r        significance
    experience * timestd     -0.723    0.043
    experience * timepred    -0.468    0.243
    timestd * timepred        0.325    0.432

(experience, timestd) has a significant negative correlation:
time with the standard software decreases with computer experience
there are no other significant correlations
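Pearson's r and its significance can be computed with scipy.stats.pearsonr. A sketch using the four visible rows of the example (the slide's matrix was presumably computed from more data, so the numbers will differ):

```python
from scipy import stats

# The slide's example columns (one row per subject).
experience = [12, 6, 3, 19]
time_std   = [245, 236, 321, 212]
time_pred  = [246, 213, 265, 189]

for label, a, b in [("experience * timestd", experience, time_std),
                    ("experience * timepred", experience, time_pred),
                    ("timestd * timepred", time_std, time_pred)]:
    r, p = stats.pearsonr(a, b)
    print(f"{label}: r = {r:.3f}, p = {p:.3f}")
```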
Analysing the Data
Use Pearson’s r to identify correlations
r² represents the percentage of variance in Y that can be explained by variable X
(equivalently, the percentage of variance in X that can be explained by variable Y)
but beware: correlation does not imply causation
    e.g. a negative correlation between income and speed of internet search
    does earning more make you worse at using the internet?
    or does higher income imply greater age and less familiarity with the internet?
    higher income → greater age → less internet experience → lower performance