
- Non-parametric tests (examples)
- Some repetition of key concepts (time permitting)
- Free experiment status
- Exercise:
  - Group tasks on non-parametric tests (worked examples will be provided!)
  - Free experiment supervision/help
Did you get the compendium?

Remember: For week 12 (regression and correlation) there are 100+ pages in the compendium. No need to read all of it - read the introductions to each chapter and get a feel for the first simple examples; multiple regression and multiple correlation are for future reference.
Two types of statistical test:

Parametric tests:
- Based on the assumption that the data have certain characteristics, or "parameters".
- Results are only valid if:
  (a) the data are normally distributed;
  (b) the data show homogeneity of variance;
  (c) the data are measurements on an interval or ratio scale.



[Bar chart of two group means: Group 1: M = 8.19 (SD = 1.33); Group 2: M = 11.46 (SD = 9.18) - note the very unequal SDs.]
Non-parametric tests:
- Make no assumptions about the data's characteristics.
- Use if any of the three properties below are true:
  (a) the data are not normally distributed (e.g. skewed);
  (b) the data show inhomogeneity of variance;
  (c) the data are measurements on an ordinal scale (ranks).
- Non-parametric tests are used when we do not have ratio/interval data, or when the assumptions of parametric tests are broken.
- Just like parametric tests, which non-parametric test to use depends on the experimental design (repeated measures or independent groups) and the number of levels of the IV.
- Non-parametric tests are minimally affected by outliers, because scores are converted to ranks.
Examples of parametric tests and their non-parametric equivalents:

Parametric test:                       Non-parametric counterpart:
Pearson correlation                    Spearman's correlation
(No equivalent test)                   Chi-Square test
Independent-means t-test               Mann-Whitney test
Dependent-means t-test                 Wilcoxon test
One-way Independent-Measures ANOVA     Kruskal-Wallis test
One-way Repeated-Measures ANOVA        Friedman's test
- Non-parametric tests make few assumptions about the distribution of the data being analyzed.
- They get around this by not using the raw scores, but by ranking them: the lowest score gets rank 1, the next lowest rank 2, etc.
  - How the ranking is carried out differs from test to test, but the principle is the same.
- The analysis is carried out on the ranks, not the raw data.
- Ranking data means we lose information - we do not know the distance between the ranks.
- This means that non-parametric tests are less powerful than parametric tests, and that non-parametric tests are less likely to discover an effect in our data than parametric tests (increased chance of a Type II error).

The Mann-Whitney test:
- The non-parametric equivalent of the independent t-test.
- Used when you have two conditions, each performed by a separate group of subjects.
- Each subject produces one score. Tests whether there is a statistically significant difference between the two groups.
Example: Difference between men and dogs.
- We count the number of "doglike" behaviors in a group of 20 men and 20 dogs over 24 hours.
- The result is a table with 2 groups and their numbers of doglike behaviors.
- We run a Kolmogorov-Smirnov test (the "vodka test") to see if the data are normally distributed. The test is significant (p = .009), so we need a non-parametric test to analyze the data.

The Mann-Whitney (MW) test looks for differences in the ranked positions of scores in the two groups (samples).

Example ...

Mann-Whitney test, step-by-step:

Does it make any difference to students' comprehension of statistics whether the lectures are in English or in Serbo-Croat?
- Group 1: Statistics lectures in English.
- Group 2: Statistics lectures in Serbo-Croat.
- DV: Lecturer intelligibility ratings by students (0 = "unintelligible", 100 = "highly intelligible").
- Ratings - so Mann-Whitney is appropriate.
English (raw)   English (rank)   Serbo-Croat (raw)   Serbo-Croat (rank)
18              17               17                  15
15              10.5             13                  8
17              15               12                  5.5
13              8                16                  12.5
11              3.5              10                  1.5
16              12.5             15                  10.5
10              1.5              11                  3.5
17              15               13                  8
-               -                12                  5.5

English:     Mean = 14.63, S.D. = 2.97, Median = 15.5
Serbo-Croat: Mean = 13.22, S.D. = 2.33, Median = 13
Step 1:
Rank all the scores together, regardless of group.

How to rank scores:
(a) The lowest score gets a rank of "1"; the next lowest gets "2"; and so on.
(b) If two or more scores with the same value are "tied":
  (i) Give each tied score the rank it would have had, had it been different from the other scores.
  (ii) Add the ranks for the tied scores, and divide by the number of tied scores. Each of the ties gets this average rank.
  (iii) The next score after the set of ties gets the rank it would have obtained, had there been no tied scores.

Example:
Raw score:        6    34    34    48
"Original" rank:  1    2     3     4
"Actual" rank:    1    2.5   2.5   4
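This average-rank rule for ties is exactly what standard software does. A minimal sketch (assuming Python with SciPy is available; the code is not part of the original lecture):

```python
# Ties get the average of the ranks they would otherwise occupy.
from scipy.stats import rankdata

print(rankdata([6, 34, 34, 48]))  # [1.  2.5 2.5 4. ]
```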

Formula for the Mann-Whitney test statistic U:

U = N1*N2 + Nx(Nx + 1)/2 - Tx

where:
- T1 and T2 = sum of ranks for groups 1 and 2;
- N1 and N2 = number of subjects in groups 1 and 2;
- Tx = the larger of the two rank totals;
- Nx = number of subjects in the Tx group.
Step 2:
Add up the ranks for group 1, to get T1. Here, T1 = 83.
Add up the ranks for group 2, to get T2. Here, T2 = 70.

Step 3:
N1 is the number of subjects in group 1; N2 is the number of subjects in group 2. Here, N1 = 8 and N2 = 9.

Step 4:
Call the larger of these two rank totals Tx. Here, Tx = 83.
Nx is the number of subjects in this group; here, Nx = 8.

Step 5: Find U:

U = N1*N2 + Nx(Nx + 1)/2 - Tx

In our example:
U = 8*9 + 8*(8 + 1)/2 - 83 = 72 + 36 - 83 = 25

If there are unequal numbers of subjects - as in the present case - calculate U for both rank totals and then use the smaller U. In the present example, for T1, U = 25, and for T2, U = 47. Therefore, use 25 as U.
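The whole calculation can be checked in software. A small sketch (assuming Python with SciPy; not part of the original lecture) that reproduces T1, T2 and U for the data above:

```python
from scipy.stats import mannwhitneyu, rankdata

english = [18, 15, 17, 13, 11, 16, 10, 17]
serbo_croat = [17, 13, 12, 16, 10, 15, 11, 13, 12]

# Step 1: rank all 17 scores together (ties get average ranks).
ranks = rankdata(english + serbo_croat)
t1 = sum(ranks[:len(english)])          # 83.0
t2 = sum(ranks[len(english):])          # 70.0

# Steps 2-5: compute U for both rank totals; report the smaller one.
n1, n2 = len(english), len(serbo_croat)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1   # 25.0
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t2   # 47.0
print(min(u1, u2))                      # 25.0

# SciPy runs the same test directly; recent versions return U for the
# first sample (47 here), and min(47, n1*n2 - 47) = 25.
u, p = mannwhitneyu(english, serbo_croat, alternative='two-sided')
print(u, p)  # p is around .3, in line with the SPSS output below
```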
Step 6:
Look up the critical value of U in a table, taking into account N1 and N2. If our obtained U is smaller than the critical value of U, we reject the null hypothesis and conclude that our two groups differ significantly.
Critical values of U (p < .05):

          N2:  5    6    7    8    9   10
N1 =  5:       2    3    5    6    7    8
N1 =  6:       3    5    6    8   10   11
N1 =  7:       5    6    8   10   12   14
N1 =  8:       6    8   10   13   15   17
N1 =  9:       7   10   12   15   17   20
N1 = 10:       8   11   14   17   20   23
Here, the critical value of U for N1 = 8 and N2 = 9 is 15. Our obtained U of 25 is larger than this, and so we conclude that there is no significant difference between our two groups.

Conclusion: Ratings of lecturer intelligibility are unaffected by whether the lectures are given in English or in Serbo-Croat.
Mann-Whitney using SPSS - procedure:
Mann-Whitney using SPSS - output:
SPSS gives us two boxes as the output:

Ranks
Intelligibility   Language      N    Mean Rank   Sum of Ranks
                  English       8    10.38       83.00
                  Serbo-Croat   9    7.78        70.00
                  Total         17

Test Statistics(b)                  Intelligibility
Mann-Whitney U                      25.000    <- the U statistic
Wilcoxon W                          70.000
Z                                   -1.067
Asymp. Sig. (2-tailed)              .286      <- significance value of the test
                                              (can halve this for a one-way hypothesis)
Exact Sig. [2*(1-tailed Sig.)]      .321(a)
a. Not corrected for ties.
b. Grouping Variable: Language
The Wilcoxon test:
- Used when you have two conditions, both performed by the same subjects.
- Each subject produces two scores, one for each condition.
- Tests whether there is a statistically significant difference between the two conditions.
Wilcoxon test, step-by-step:

Does background music affect the mood of factory workers?
- Eight workers: each tested twice.
  - Condition A: Background music.
  - Condition B: Silence.
- DV: Worker's mood rating (0 = "extremely miserable", 100 = "euphoric").
- Ratings data, so use the Wilcoxon test.
Worker:   Silence   Music   Difference   Rank
1         15        10      5            4.5
2         12        14      -2           2.5
3         11        11      0            (ignore)
4         16        11      5            4.5
5         14        4       10           6
6         13        1       12           7
7         11        12      -1           1
8         8         10      -2           2.5

Silence: Mean = 12.5, SD = 2.56, Median = 12.5
Music:   Mean = 9.13, SD = 4.36, Median = 10.5
Step 1:
Find the difference between each pair of scores, keeping track of the sign (+ or -) of the difference. (Different from the Mann-Whitney test, where the data themselves are ranked!)

Step 2:
Rank the differences, ignoring their sign. Lowest = 1.
Tied scores are dealt with as before.
Ignore zero difference-scores.

Step 3:
Add together the positive-signed ranks: 22.
Add together the negative-signed ranks: 6.

Step 4:
"W" is the smaller sum of ranks: W = 6.
N is the number of differences, omitting zero differences: N = 8 - 1 = 7.

Step 5:
Use a table of critical W-values to find the critical value of W for your N. Your obtained W has to be smaller than this critical value for it to be statistically significant.
Critical values of W:

        One-tailed significance level:   0.025   0.01   0.005
        Two-tailed significance level:   0.05    0.02   0.01
N = 6:                                   0       -      -
N = 7:                                   2       0      -
N = 8:                                   4       2      0
N = 9:                                   6       3      2
N = 10:                                  8       5      3
- The critical value of W (for an N of 7) is 2.
- Our obtained W of 6 is bigger than this.
- Our two conditions are not significantly different.

Conclusion: Workers' mood appears to be unaffected by the presence or absence of background music.
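For reference, the same test in software - a minimal sketch (assuming Python with SciPy; not part of the original lecture):

```python
from scipy.stats import wilcoxon

silence = [15, 12, 11, 16, 14, 13, 11, 8]
music   = [10, 14, 11, 11,  4,  1, 12, 10]

# Two-sided test on the paired differences; the zero difference
# (worker 3) is dropped, as in Step 2.
stat, p = wilcoxon(silence, music)
print(stat, p)  # W = 6.0 (the smaller rank sum), p ~ .175
```

This matches the SPSS output below (Z = -1.357, p = .175).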
Wilcoxon using SPSS - procedure:
Wilcoxon using SPSS - output:

Ranks
silence - music    N      Mean Rank   Sum of Ranks
Negative Ranks     4(a)   5.50        22.00
Positive Ranks     3(b)   2.00        6.00
Ties               1(c)
Total              8
a. silence < music (what the negative ranks refer to: silence score lower than with music)
b. silence > music (what the positive ranks refer to: silence score higher than with music)
c. silence = music (ties = no change in score with/without music)

Test Statistics(b)         silence - music
Z                          -1.357(a)   <- number of SDs from the mean
Asymp. Sig. (2-tailed)     .175        <- significance value
a. Based on positive ranks.
b. Wilcoxon Signed Ranks Test

As for the Mann-Whitney test, the z-score becomes more accurate with higher sample size.
Non-parametric tests for comparing three or more groups or conditions:

Kruskal-Wallis test:
- Similar to the Mann-Whitney test, except that it enables you to compare three or more groups rather than just two.
- Different subjects are used for each group.

Friedman's test (Friedman's ANOVA):
- Similar to the Wilcoxon test, except that you can use it with three or more conditions (for one group).
- Each subject does all of the experimental conditions.
- One IV, with multiple levels.
- Levels can differ:
  (a) qualitatively/categorically, e.g.:
    - effects of managerial style (laissez-faire, authoritarian, egalitarian) on worker satisfaction;
    - effects of mood (happy, sad, neutral) on memory;
    - effects of location (Scotland, England or Wales) on happiness ratings.
  (b) quantitatively, e.g.:
    - effects of age (20 vs 40 vs 60 year olds) on optimism ratings;
    - effects of study time (1, 5 or 10 minutes) before being tested on recall of faces;
    - effects of class size on 10-year-olds' literacy;
    - effects of temperature (60, 100 and 120 deg.) on mood.
Why have experiments with more than two levels of the IV?
(1) Increases the generality of the conclusions:
  - e.g. comparing young (20) and old (70) subjects tells you nothing about the behaviour of intermediate age-groups.
(2) Economy:
  - Getting subjects is expensive - may as well get as much data as possible from them, i.e. use more levels of the IV (or more IVs).
(3) Can look for trends:
  - What are the effects on performance of increasingly large doses of cannabis (e.g. 100 mg, 200 mg, 300 mg)?
Kruskal-Wallis test, step-by-step:

Does it make any difference to students' comprehension of statistics whether the lectures are given in English, Serbo-Croat - or Cantonese?
(Similar case to the Mann-Whitney example, just one more language, i.e. one more group of people.)
- Group A - 4 people: Lectures in English.
- Group B - 4 people: Lectures in Serbo-Croat.
- Group C - 4 people: Lectures in Cantonese.
- DV: Student rating of the lecturer's intelligibility on a 100-point scale ("0" = "incomprehensible").
- Ratings - so use a non-parametric test. Three groups - so the Kruskal-Wallis test.
English   English   Serbo-Croat   Serbo-Croat   Cantonese   Cantonese
(raw)     (rank)    (raw)         (rank)        (raw)       (rank)
20        3.5       25            7.5           19          1.5
27        9         33            10            20          3.5
19        1.5       35            11            25          7.5
23        6         36            12            22          5
Step 1:
- Rank the scores, ignoring which group they belong to.
- The lowest score gets the lowest rank.
- Tied scores get the average of the ranks they would otherwise have obtained (note the difference from the Wilcoxon test, where it is the difference scores that are ranked!)
Formula:

H = [12 / (N(N + 1))] * sum(Tc^2 / nc) - 3(N + 1)

where:
- N is the total number of subjects;
- Tc is the rank total for each group;
- nc is the number of subjects in each group;
- H is the test statistic.
Step 2:
Find Tc, the total of the ranks for each group.
- Tc1 (the total for the English group) is 20.
- Tc2 (for the Serbo-Croat group) is 40.5.
- Tc3 (for the Cantonese group) is 17.5.
Step 3: Find H.

H = [12 / (N(N + 1))] * sum(Tc^2 / nc) - 3(N + 1)

(N is the total number of subjects; Tc is the rank total for each group; nc is the number of subjects in each group.)

sum(Tc^2 / nc) = 20^2/4 + 40.5^2/4 + 17.5^2/4
               = 100 + 410.06 + 76.56
               = 586.62

H = [12 / (12 * 13)] * 586.62 - 3 * 13 = 45.12 - 39 = 6.12
Step 4: In the Kruskal-Wallis test, we use degrees of freedom:
Degrees of freedom = the number of groups minus one: d.f. = 3 - 1 = 2.

Step 5:
H is statistically significant if it is larger than the critical value of Chi-Square for this many d.f. (Chi-Square is the test-statistic distribution we use here.)
- Here, H is 6.12. This is larger than 5.99, the critical value of Chi-Square for 2 d.f. (SPSS gives us this - no need to look it up in a table, but we could do so.)
- So: the three groups differ significantly. The language in which statistics is taught does make a difference to the lecturer's intelligibility.
- NB: the test merely tells you that the three groups differ; inspect the group medians to decide how they differ.
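The same analysis in software - a minimal sketch (assuming Python with SciPy; not part of the original lecture):

```python
from scipy.stats import kruskal

english     = [20, 27, 19, 23]
serbo_croat = [25, 33, 35, 36]
cantonese   = [19, 20, 25, 22]

h, p = kruskal(english, serbo_croat, cantonese)
print(h, p)  # H = 6.19, p = .045
```

SciPy (like SPSS) applies a correction for tied ranks, which is why it reports H = 6.19 rather than the 6.12 worked out by hand.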
Using SPSS for the Kruskal-Wallis test:
- Code the groups: "1" for "English", "2" for "Serbo-Croat", "3" for "Cantonese".
- Independent-measures test type: one column gives the scores (the "scores" column); another column identifies which group each score belongs to (the "group" column).
- Analyze > Nonparametric tests > k independent samples
- In the dialog: choose the test variable and identify the groups.
Ranks
intelligibility   language      N    Mean Rank
                  English       4    5.00
                  Serbo-Croat   4    10.13
                  Cantonese     4    4.38
                  Total         12

Test Statistics(a,b)   intelligibility
Chi-Square             6.190   <- test statistic (H)
df                     2       <- DF
Asymp. Sig.            .045    <- significance
a. Kruskal Wallis Test
b. Grouping Variable: language
How do we find out how the three groups differed?
- One way is to construct a box-and-whisker plot and look at the median values.
- What we really need are contrasts and post-hoc tests, as for ANOVA.
- One solution is to run a series of Mann-Whitney tests, controlling for the build-up of Type I error:
  - We need several MW tests, each with a 5% chance of a Type I error; when running them in series this chance builds up (language 1 vs. language 2, language 1 vs. 3, etc.).
  - We therefore apply a Bonferroni correction: use p < 0.05 divided by the number of MW tests conducted.
  - We can get away with comparing only against the control condition - an MW test for each of the other languages compared to the control group.
  - We then see if any differences are significant. (A sketch of the pairwise approach follows below.)
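A minimal sketch of Bonferroni-corrected pairwise Mann-Whitney tests (assuming Python with SciPy; the data are from the worked example above, and running all three pairs rather than only control comparisons is just one possible choice):

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

groups = {
    "English":     [20, 27, 19, 23],
    "Serbo-Croat": [25, 33, 35, 36],
    "Cantonese":   [19, 20, 25, 22],
}

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)  # Bonferroni correction: 0.05 / 3 tests
for a, b in pairs:
    u, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    verdict = "significant" if p < alpha else "n.s."
    print(f"{a} vs {b}: U = {u}, p = {p:.3f} ({verdict})")
```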
Friedman's test (Friedman's ANOVA):
- Similar to the Wilcoxon test, except that you can use it with three or more conditions (for one group).
- Each subject does all of the experimental conditions.
Friedman's test, step-by-step:

Effects on worker mood of different types of music:
- Five workers. Each is tested three times, once under each of the following conditions:
  - Condition 1: Silence.
  - Condition 2: "Easy-listening" music.
  - Condition 3: Marching-band music.
- DV: Mood rating ("0" = unhappy, "100" = euphoric).
- Ratings - so use a non-parametric test.
- NB: To avoid practice and fatigue effects, the order of presentation of the conditions is varied/randomized across subjects.
          Silence   Silence   Easy    Easy     Band    Band
          (raw)     (rank)    (raw)   (rank)   (raw)   (rank)
Wkr 1:    4         1         5       2        6       3
Wkr 2:    2         1         7       2.5      7       2.5
Wkr 3:    6         1.5       6       1.5      8       3
Wkr 4:    3         1         7       3        5       2
Wkr 5:    3         1         8       2        9       3
Step 1:
Rank each subject's scores individually.
Worker 1's scores are 4, 5, 6: these get ranks of 1, 2, 3.
Worker 4's scores are 3, 7, 5: these get ranks of 1, 3, 2.
Step 2:
Find the rank total for each condition, using the ranks from all subjects within that condition.
- Rank total for the "Silence" condition: 1 + 1 + 1.5 + 1 + 1 = 5.5.
- Rank total for the "Easy Listening" condition = 11.
- Rank total for the "Marching Band" condition = 13.5.
Step 3:
Work out chi_r^2 (the test statistic for Friedman's ANOVA):

chi_r^2 = [12 / (N*C*(C + 1))] * sum(Tc^2) - 3*N*(C + 1)

where:
- C is the number of conditions (here 3 types of music);
- N is the number of subjects (here 5 workers);
- sum(Tc^2) is the sum of the squared rank totals for the conditions (rank totals 5.5, 11 and 13.5 respectively for the three types of music).

To get sum(Tc^2):
(1) Square each rank total: 5.5^2 = 30.25; 11^2 = 121; 13.5^2 = 182.25.
(2) Add together these squared totals: 30.25 + 121 + 182.25 = 333.5.

In our example:
chi_r^2 = [12 / (5*3*4)] * 333.5 - 3*5*4 = 66.7 - 60 = 6.7
Step 4:
Degrees of freedom = the number of conditions minus one: d.f. = 3 - 1 = 2.

Step 5:
Assessing the statistical significance of chi_r^2 depends on the number of subjects and the number of groups:
(a) Fewer than 9 subjects: use a special table of critical values for chi_r^2.
(b) 9 or more subjects: use a Chi-Square table for critical values.
- Compare your obtained chi_r^2 value to the critical value of Chi-Square for your number of d.f.
- If your obtained chi_r^2 is bigger than the critical Chi-Square value, your conditions are significantly different.
- The test only tells you that some kind of difference exists; look at the median score for each condition to see where the difference comes from.

We have 5 subjects and 3 conditions, so use the Friedman table for small sample sizes:
- Our obtained chi_r^2 is 6.7.
- For N = 5, a chi_r^2 value of 6.4 would occur by chance with a probability of 0.039.
- Our obtained value is bigger than 6.4, so p < 0.039.

Conclusion: The conditions are significantly different. Music does affect worker mood.
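The same result in software - a minimal sketch (assuming Python with SciPy; not part of the original lecture):

```python
from scipy.stats import friedmanchisquare

silence = [4, 2, 6, 3, 3]
easy    = [5, 7, 6, 7, 8]
band    = [6, 7, 8, 5, 9]

stat, p = friedmanchisquare(silence, easy, band)
print(stat, p)  # chi_r^2 = 7.44, p = .024
```

SciPy (like the SPSS output below) corrects for tied ranks, which is why it reports 7.44 rather than the uncorrected 6.7 worked out by hand.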
Using SPSS to perform Friedman's ANOVA:
- Repeated measures: each row is one participant's data, just as for the Wilcoxon and other repeated-measures tests.
- Analyze > Nonparametric Tests > k related samples
- Note: in this menu you can also select a Kolmogorov-Smirnov test for checking whether your sample data are normally distributed.
- Drag over the variables to be included in the test.
Output from Friedman's ANOVA:

Descriptive Statistics
            N   Mean     Std. Deviation   Minimum   Maximum
silence     5   3.6000   1.51658          2.00      6.00
easy        5   6.6000   1.14018          5.00      8.00
marching    5   7.0000   1.58114          5.00      9.00

Ranks
            Mean Rank
silence     1.10
easy        2.20
marching    2.70

Test Statistics(a)
N             5
Chi-Square    7.444   <- test statistic (chi_r^2); NB: slightly different value from the 6.7 worked out by hand
df            2
Asymp. Sig.   .024    <- significance
a. Friedman Test
- Mann-Whitney: two conditions, two groups, each participant gives one score.
- Wilcoxon: two conditions, one group, each participant gives two scores (one per condition).
- Kruskal-Wallis: 3+ conditions, different people in all conditions, each participant gives one score.
- Friedman's ANOVA: 3+ conditions, one group, each participant gives 3+ scores.
Which non-parametric test?
1. Differences in fear ratings for 3, 5 and 7-year-olds in response to sinister noises from under their bed.
2. Effects of cheese, brussels sprouts, wine and curry on the vividness of a person's dreams.
3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone.
4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners.

Consider: How many groups? How many levels of the IV/conditions?
1. Differences in fear ratings for 3, 5 and 7-year-olds in response to sinister noises from under their bed [3 groups, one score each - Kruskal-Wallis].
2. Effects of cheese, brussels sprouts, wine and curry on the vividness of a person's dreams [one group, 4 scores each, 4 conditions - Friedman's ANOVA].
3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone [one group, 4 scores each, 4 conditions - Friedman's ANOVA].
4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners [4 groups, one score each - Kruskal-Wallis].
Repetition of key concepts:
- What is a "population"?
- Types of measure
- Normal distribution
- Standard Error
- Effect size
- The term does not necessarily refer to a set of individuals or items (e.g. cars). Rather, it refers to a state of individuals or items.
- Example: After a major earthquake in a city (in which no one died), the actual set of individuals remains the same. But the anxiety level, for example, may change. The anxiety levels of the individuals before and after the quake define them as two populations.
- "Population" is an abstract term we use in statistics.
- Scientists are interested in how variables change, and in what causes the change.
- Anything that we can measure and which changes is called a variable.
- "Why do people like the color red?"
  - Variable: preference for the color red.
- Variables can take many forms, i.e. numbers, abstract values, etc.
- Values are measurable.
- Measuring the size of variables is important for comparing results between studies/projects.
- Different measures provide different quality of data:
  - Nominal (categorical) data }
  - Ordinal data               } non-parametric
  - Interval data }
  - Ratio data    } parametric
Nominal data (categorical, frequency data):
- When numbers are used as names.
- No relationship between the size of the number and what is being measured.
- Two things with the same number are equivalent; two things with different numbers are different.
- E.g. the numbers on the shirts of soccer players.
- Nominal data are only used for frequencies:
  - How many times "3" occurs in a sample.
  - How often player 3 scores compared to player 1.
Ordinal data:
- Provides information about the ordering of the data.
- Does not tell us about the relative differences between values.
- For example: the order of people who complete a race - from the winner to the last to cross the finish line.
- The typical scale for questionnaire data.
Interval data:
- When measurements are made on a scale with equal intervals between points on the scale, but the scale has no true zero point.
- Examples:
  - Celsius temperature scale: 100 is water's boiling point; 0 is an arbitrary zero-point (when water freezes), not a true absence of temperature.
  - Equal intervals represent equal amounts, but ratio statements are meaningless - e.g., 60 deg C is not twice as hot as 30 deg!

[Number-line illustration of an interval scale with an arbitrary zero point.]
Ratio data:
- When measurements are made on a scale with equal intervals between points on the scale, and the scale has a true zero point.
- E.g. height, weight, time, distance.
- Measurements of relevance here include: reaction times, numbers of correct answers, error scores in usability tests.
- If we take repeated samples, each sample has a mean height, a standard deviation (s), and a shape/distribution.

[Diagram: repeated samples drawn from a parent population, each with its own mean (e.g. 30, 25, 33, 30, 29) and SD (s1, s2, s3, ...).]

- Due to random fluctuations, each sample is different - from other samples and from the parent population.
- These differences are predictable - we can use samples to make inferences about their parent populations.
- Often we have more than one sample from a population.
- This permits the calculation of different sample means, whose values will vary, giving us a sampling distribution.

[Histogram: a sampling distribution of the mean - sample means (M = 8 to 12) drawn from a population with mu = 10; the distribution of sample means has Mean = 10 and SD = 1.22.]
- The sampling distribution tells us about the behavior of samples from the population.
- We can calculate the SD of the sampling distribution: this is called the Standard Error of the Mean (SE).
- The SE shows how much variation there is within a set of sample means.
- Therefore it also shows how likely a specific sample mean is to be erroneous, as an estimate of the true population mean.

[Diagram: means of different samples scattered around the actual population mean.]
- SE = the SD of the distribution of sample means.
- We can estimate the SE from a single sample:

  SE = s / sqrt(n)

  i.e. the SD of the sample divided by the square root of the sample size (n).
- If the SE is small, our obtained sample mean is more likely to be similar to the true population mean than if the SE is large.
- Increasing n reduces the size of the SE:
  - A sample mean based on 100 scores is probably closer to the population mean than a sample mean based on 10 scores(!)
- Variation between samples decreases as sample size increases, because extreme scores become less important to the mean.
- Example, with s = 2:
  - n = 100: SE = 2 / sqrt(100) = 2/10 = 0.20
  - Suppose n = 16 instead of 100: SE = 2 / sqrt(16) = 2/4 = 0.50
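A quick simulation sketch (assuming Python with NumPy; not part of the original lecture) showing that the SD of a set of sample means does behave like s / sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 2.0, 100

# Draw 10,000 samples of size n and compute each sample's mean.
sample_means = rng.normal(loc=10, scale=sigma, size=(10_000, n)).mean(axis=1)

print(sample_means.std())   # ~0.20: the SD of the sample means...
print(sigma / np.sqrt(n))   # ...matches SE = sigma / sqrt(n) = 0.20
```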
[Histogram: frequency of errors made (x-axis: number of errors made, y-axis: frequency) - a roughly bell-shaped distribution.]

- The Normal curve is a mathematical abstraction which conveniently describes ("models") many frequency distributions of scores in real life.
- e.g. the length of time before someone looks away in a staring contest; the length of pickled gherkins.
Francis Galton (1876), 'On the height and weight of boys aged 14, in town and country public schools', Journal of the Anthropological Institute, 5, 174-180:

[Histogram: height of 14-year-old children - x-axis: height (inches, 51-70); y-axis: frequency (%); separate distributions for country and town schools, both roughly normal.]
Properties of the Normal Distribution:
1. It is bell-shaped and asymptotic at the extremes.
2. It is symmetrical around the mean.
3. The mean, median and mode all have the same value.
4. It can be specified completely, once the mean and SD are known.
5. The area under the curve is directly proportional to the relative frequency of observations:
   - e.g. 50% of scores fall below the mean, as does 50% of the area under the curve;
   - e.g. if 85% of scores fall below score X, this corresponds to 85% of the area under the curve.
Relationship between the normal curve and the standard deviation (SD):

All normal curves share this property: the SD cuts off a constant proportion of the distribution of scores:

[Curve: normal distribution with 68%, 95% and 99.7% of the area falling within 1, 2 and 3 SDs either side of the mean.]

- About 68% of scores will fall in the range of the mean plus and minus 1 SD;
- 95% in the range of the mean +/- 2 SDs;
- 99.7% in the range of the mean +/- 3 SDs.
e.g.: I.Q. is normally distributed, with a mean of 100 and SD of 15. Therefore:
- 68% of people have I.Q.s between 85 and 115 (100 +/- 15);
- 95% have I.Q.s between 70 and 130 (100 +/- 2*15);
- 99.7% have I.Q.s between 55 and 145 (100 +/- 3*15).

[Curve: normal distribution of I.Q., with 68% of the area between 85 (mean - 1 SD) and 115 (mean + 1 SD).]
- Just by knowing the mean, the SD, and that scores are normally distributed, we can tell a lot about a population.
- If we encounter someone with a particular score, we can assess how they stand in relation to the rest of their group.
- e.g.: someone with an I.Q. of 145 is quite unusual: this is 3 SDs above the mean. I.Q.s of 3 SDs or more above the mean occur in only 0.15% of the population [(100 - 99.7) / 2].
  - Note: divide by 2, as there are two tails to the normal distribution!
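The same tail area can be read off the normal distribution in software - a minimal sketch (assuming Python with SciPy; not part of the original lecture):

```python
from scipy.stats import norm

# P(IQ >= 145) for mean = 100, SD = 15, i.e. the area beyond z = +3.
z = (145 - 100) / 15
print(norm.sf(z))  # 0.00135: about 0.135% (the slide's 0.15% is rounded)
```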
Conclusions:
- Many psychological/biological properties are normally distributed.
- This is very important for statistical inference (extrapolating from samples to populations).
- Just because the test statistic is significant does not mean that the effect measured is important - it may account for only a very small part of the variance in the dataset, even though it is bigger than the random variance.
- So we calculate effect sizes - a measure of the magnitude of an observed effect.
- A common effect size is Pearson's correlation coefficient, "r" - normally used to measure the strength of the relationship between two variables.
- r reflects how much of the total variance in the dataset can be explained by the experiment (r^2 gives the proportion of variance explained).
- It falls between 0 (the experiment explains no variance at all; effect size = zero) and 1 (the experiment explains all the variance; a perfect effect size).
Three conventional levels of r:
- r = 0.1: small effect, 1% of total variance explained.
- r = 0.3: medium effect, 9% of total variance explained.
- r = 0.5: large effect, 25% of variance explained.

- Note: this is not a linear scale - an r of 0.2 is not twice the effect of an r of 0.1.
- r is standardized - we can compare across studies.
- Effect sizes are objective measures of the importance of a measured effect.
- The bigger the effect size of something, the easier it is to find experimentally, i.e. if the IV manipulation has a major effect on the DV, the effect size is large.
- r can be calculated from a lot of test statistics, notably z-scores:

  r = Z / sqrt(N)

  (the z-score divided by the square root of the sample size).
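As a closing sketch (assuming Python; the Z and N values are taken from the Mann-Whitney SPSS output earlier), r = Z / sqrt(N) gives:

```python
import math

# From the Mann-Whitney example: Z = -1.067, N = 17 scores in total.
r = abs(-1.067) / math.sqrt(17)
print(round(r, 2))  # 0.26: a small-to-medium effect
```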