Statistics for Linguistics Students
Michaelmas 2004
Week 4
Bettina Braun
www.phon.ox.ac.uk/~bettina
Overview
• Discussion of last assignment
• z-distribution vs. t-distribution
• Between-subjects design vs. Within-subjects design
• t-tests
– for independent samples
– for dependent samples
Exercise z-scores
1) The mean pause duration in a read text is 200ms with a
standard deviation of 50ms. For the calculations please specify
how you reached your conclusion!
a) Is this a statistic or a parameter?
If we are interested in describing this particular read text,
then it’s a parameter. If we use this text to draw inferences
about pause duration in any text then it’s a statistic.
b) What proportion of the data is above 70ms?
z = (70 − 200)/50 = −2.6
0.47% of the data lie below 70ms
99.53% of the data lie above 70ms
c) What proportion of the data falls between 100ms and
300ms?
z = (100 − 200)/50 = −2 and z = (300 − 200)/50 = +2
2.28% lie below 100ms and 2.28% lie above 300ms
95.44% lie between 100ms and 300ms
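The table look-ups in b) and c) can be cross-checked numerically. This is a minimal sketch, assuming pause durations are normally distributed with the stated mean and standard deviation. It uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative only.

```python
from statistics import NormalDist

# Assumption: pause durations are normally distributed (mean 200 ms, sd 50 ms)
pauses = NormalDist(mu=200, sigma=50)

z_70 = (70 - 200) / 50                                # z-score of 70 ms: -2.6
above_70 = 1 - pauses.cdf(70)                         # proportion above 70 ms
between_100_300 = pauses.cdf(300) - pauses.cdf(100)   # proportion within +/- 2 sd

print(f"z = {z_70:.1f}")
print(f"above 70 ms: {above_70:.2%}")
print(f"between 100 and 300 ms: {between_100_300:.2%}")
```

The computed proportions can differ from the table-based answers in the last digit, because printed z-tables are rounded.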
Exercise sampling distribution
2) If we have a sample size of 50, what does the
sampling distribution of the means look like if the
population is
a) U-shaped
b) skewed-left, and
c) normally distributed?
Because of the central limit theorem, the sampling
distribution of the mean will be approximately normally
distributed for a sample of this size, irrespective of
the form of the parent distribution
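A small simulation can make this answer concrete. The sketch below is illustrative, not part of the exercise. It assumes a hypothetical skewed population (exponential, which is right-skewed), and it checks only the centre and spread of the sample means, not their shape. The function name `sample_means` and the number of repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

def sample_means(draw, n=50, repetitions=2000):
    """Draw `repetitions` samples of size n and return their means."""
    return [statistics.fmean(draw() for _ in range(n)) for _ in range(repetitions)]

# Hypothetical skewed population: exponential with mean 1 and sd 1
means = sample_means(lambda: random.expovariate(1.0))

centre = statistics.fmean(means)   # should be close to the population mean (1.0)
spread = statistics.stdev(means)   # should be close to 1 / sqrt(50), about 0.14
print(f"mean of sample means: {centre:.3f}, sd of sample means: {spread:.3f}")
```

Plotting a histogram of `means` would show the roughly bell-shaped form, even though the parent distribution is strongly skewed.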
Exercise central limit theorem, standard error
3) What happens to the following statistics if the
sample size increases?
– Does the estimated mean increase, decrease, or stay
approximately the same? Why?
Stays approximately the same, as the sample mean is an
adequate estimate for the population mean (central limit
theorem)
– Does the standard error increase, decrease, or stay
approximately the same? Why?
Decreases: the standard error is the standard deviation
divided by the square root of the sample size (see formula
for standard error)
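The effect of sample size on the standard error can be tabulated in a few lines. This is a sketch only: the standard deviation of 50 ms is borrowed from the pause-duration exercise and the sample sizes are made up for illustration.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

# sd = 50 ms is borrowed from the pause-duration exercise; sample sizes are illustrative
for n in (25, 100, 400):
    print(n, standard_error(50, n))
```

Quadrupling the sample size halves the standard error, which is what dividing by the square root of n implies.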
What are frequency data?
• Number of subjects/events in a given
category
• You can then test whether the observed
frequencies deviate from your expected
frequencies
• E.g. In an election, there is an a priori
chance of 50-50 for each candidate.
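χ²-test
• Null-hypothesis: there is no difference between
expected and observed frequency
• Data: observed and expected frequencies of Kerry
supporters and Bush supporters (table values not preserved)
• Calculation: χ² = Σ (observed − expected)² / expected,
summed over all categories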
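The calculation for a test of this kind is usually Pearson's χ² statistic: the sum of (observed − expected)² / expected over the categories. The sketch below shows that arithmetic. The counts are hypothetical and are not the values from the original slide; the function name `chi_square` is illustrative.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts (NOT the values from the original slide):
# 100 voters, 60 for one candidate and 40 for the other, 50-50 expected
observed = [60, 40]
expected = [50, 50]

statistic = chi_square(observed, expected)
df = len(observed) - 1          # one independent variable with a = 2 levels
critical_value = 3.84           # tabulated chi-square value for df = 1, alpha = 0.05

print(statistic, df, statistic > critical_value)
```

With these made-up counts the statistic exceeds the tabulated critical value, so the null-hypothesis of a 50-50 split would be rejected at the 5% level.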
χ²-test
• Limitations:
– All raw data for χ² must be frequencies
– Each subject or event is counted only once
(if we wish to find out whether boys or girls are more
likely to pass or fail a test, we might observe the
performance of 100 children on a test. We may not
observe the performance of 25 children on 4 tests,
however)
– The total number of observations should be greater
than 20
– The expected frequency in any cell should be greater
than 5
Looking up the p-value
Degrees of freedom:
• If there is one independent variable
(with a levels):
df = (a – 1)
• If there are two independent variables
(with a and b levels):
df = (a – 1)(b – 1)
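The two degrees-of-freedom rules can be written as one small helper. This is a sketch; the function name `chi_square_df` is hypothetical and it covers only the one- and two-variable cases given here.

```python
def chi_square_df(a, b=None):
    """Degrees of freedom for one variable (a levels) or two variables (a and b levels)."""
    if b is None:
        return a - 1
    return (a - 1) * (b - 1)

print(chi_square_df(2))      # one variable with 2 levels
print(chi_square_df(2, 3))   # two variables with 2 and 3 levels
```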
Exercise dependent and
independent variables
• Generally, in hypothesis testing, the independent
variable is hardly ever interval. Mostly it is
nominal, or ordinal
• Differentiate between
– Number of independent variables (e.g. gender and
exam year for score example => 2)
– Levels of an independent variable are the number of
values it can take (e.g. gender: generally 2)
• The null-hypothesis is formulated to deny a
relation between dependent and independent
variable
Exercise dependent and
independent variables
Imagine you have a text-to-speech synthesis
system. You are interested to find out whether the
acceptability (from 1 to 5) is increased if you model
short pauses at syntactic phrases.
• dependent variable: acceptability (ordinal data)
• independent variable: TTS with/without pause
model (2 levels)
• Null-Hypothesis: Pause model does not
influence acceptability rating
Exercise dependent and
independent variables
Subjects learned 20 nonsense-words presented visually. 30
minutes later they were tested for retention. The next day,
the same subjects learned another 20 nonsense-words,
this time in a combined visual and auditory presentation.
Again, after 30 minutes they were tested for retention. The
researcher measured the number of correct nonsense-words.
• dependent variable: number of correct responses
(interval data)
• independent variable: kind of presentation (2 levels)
• null hypothesis: The number of correct responses will be
the same in the two conditions
Further influencing factors
• Besides the independent variable, there might
be further factors that influence your dependent
variable.
• Other factors might be confounded with our
independent variable (e.g. in the nonword
retention task, the combined visual and auditory
presentation was on a different day than the visual
presentation. Presentation kind can thus be
confounded with presentation time)
• Systematic error
Counterbalancing
• To avoid confounding variables, the conditions
have to be counterbalanced. Example:
– Half the subjects do the visual presentation first
and the combined visual and auditory presentation second
– Half the subjects do the task in the opposite order
• We often have a group of subjects to perform
the task (not just one subject)
• Also, in linguistic research, we often use multiple
repetitions or different lexicalisations for a given
condition (e.g. different words that all have a
CVCV structure)
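Counterbalancing of two conditions can be sketched as a simple rotation over the possible orders. The subject labels are hypothetical, and "A" and "B" are placeholders for the two presentation kinds. In practice the subject list would normally be shuffled first.

```python
def counterbalance(subjects, orders=(("A", "B"), ("B", "A"))):
    """Assign presentation orders in rotation so each order is used equally often."""
    return {s: orders[i % len(orders)] for i, s in enumerate(subjects)}

# Hypothetical subject labels; "A" and "B" stand for the two presentation kinds
subjects = [f"S{i:02d}" for i in range(1, 21)]
assignment = counterbalance(subjects)

n_ab = sum(1 for order in assignment.values() if order == ("A", "B"))
n_ba = sum(1 for order in assignment.values() if order == ("B", "A"))
print(n_ab, n_ba)
```

With an even number of subjects, each order is assigned to exactly half of them.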
Exercise drawing error-bars
• Variables need to have the correct type!
• Error bars show the 95% confidence
interval for the mean (i.e. the range around the sample
mean that is expected to contain the population mean
in 95% of repeated samples)
• One independent variable
– Simple error bar for groups of variables
• Two independent variables
– Clustered error bar for groups of variables
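The interval that such an error bar displays can be approximated by hand. The sketch below uses the normal approximation (mean ± about 1.96 standard errors). That is an assumption made here for simplicity: for small samples a statistics package would use a t-based critical value and give a slightly wider interval. The scores are hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev

def confidence_interval(scores, level=0.95):
    """Normal-approximation confidence interval for the mean of `scores`."""
    m = fmean(scores)
    se = stdev(scores) / math.sqrt(len(scores))   # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)     # about 1.96 for 95%
    return m - z * se, m + z * se

# Hypothetical reaction times in ms
scores = [512, 530, 498, 545, 520, 507, 533, 525]
low, high = confidence_interval(scores)
print(f"{low:.1f} .. {high:.1f}")
```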
Exercise drawing error-bars
(Figure: clustered error bars for two independent variables)
Example: testing if a sample is
drawn from a given population
• A lecturer at Oxford University expects that
students at this university have a higher
IQ-score than the average British
population.
• Since records are taken, he knows that the
mean IQ-score in Britain is 200 with a
standard deviation of 32
Experimental Procedure
• The Null-hypothesis H0 is that the IQ of Oxford
students is no different from the general public.
• He randomly selects 40 students and gives them
the standard IQ test.
• This results in an IQ-score of 210
• Questions:
– Can he conclude that Oxford students have a higher
IQ?
– Can he compare his sample to the population?
Comparison to population
• The sample mean cannot directly be
compared to the whole population, but to
the sampling distribution of the sample
mean (with samples of size n=40).
• The sampling distribution has the same
mean as the population (200) and the
standard error of σ/√n = 32/√40 ≈ 5.06
Calculating z-score
• Since the sampling distribution will be
normally distributed (for n > 30), we can
calculate the z-score to see how likely a
mean of 210 is, given the null-hypothesis
were true: z = (210 − 200)/(32/√40) ≈ 1.98
• There is a chance of 2.4% of obtaining a
sample mean of 210 or higher if the
null-hypothesis were true
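The steps of this example can be reproduced as follows. This is a sketch using the numbers given in the example and a one-tailed probability, since the lecturer expects a higher score. The variable names are illustrative.

```python
import math
from statistics import NormalDist

population_mean = 200   # mean IQ-score given in the example
population_sd = 32      # standard deviation given in the example
n = 40                  # sample size
sample_mean = 210       # observed mean of the 40 students

standard_error = population_sd / math.sqrt(n)          # about 5.06
z = (sample_mean - population_mean) / standard_error   # about 1.98
p_one_tailed = 1 - NormalDist().cdf(z)                 # chance of a mean >= 210 under H0

print(f"SE = {standard_error:.2f}, z = {z:.2f}, p = {p_one_tailed:.3f}")
```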
What if the population is unknown?
• Often, we compare two different samples
and we do not know the population
parameters
(e.g. are exam scores of the year 1990
and 2000 from the same distribution?)
• Independent variable (# levels?):
exam year (2 levels)
• Dependent variable (type?):
exam score (interval data)
Hypothesis
• Null-hypothesis: The scores in the 2 exam
years were drawn from the same
distribution
• Comparison of the means of the two
populations (estimated from two
representative samples)
• What statistical test do we have to
perform?
Between-subjects design
(completely randomised)
• All comparisons between the different conditions
are based on comparisons between different
(groups of) subjects
• Each subject provides data for only one
research condition
• Example:
You want to test whether the pitch of children
under the age of 10 is dependent on their
gender (a given child is either male or female!)
Within-subjects design
(repeated measures)
• All comparisons between different conditions
are based on comparisons within the same
group of subjects
• Each subject provides data for all experimental
conditions (as many scores as experimental
conditions)
• Example:
You want to test whether the number of reading
errors is higher when a subject is sober or
slightly drunk.
Why is this difference important?
• On average, two scores from P1 and two scores
from P2 will be more alike than two scores, one
from P1 and one from P2
• Scores from one person on the same task will be
correlated; this is taken into account by within-subjects tests.
• If a between-subjects test is used for a within-subjects
design, we may fail to find an effect
(type II error)
• If a within-subjects test is used for a between-subjects
design, we might find an effect that is
actually not there (type I error)
Example
• You want to test whether the precontext has an
effect on the prosodic realisation of sentence-initial accents.
• You construct 20 sentences, which can appear
in two different contexts, say contrastive and
non-contrastive.
• Then you ask 20 subjects to read the 40 short
paragraphs and measure the pitch height of the
initial accent and the duration of the initial word.
• You want to know if accents are realised
differently in contrastive and non-contrastive
context.
Difficult cases
• Different classes of dependent variables
– If you are interested in articulatory precision at
two different speech rates, you might measure
the formant values of the vowels and the
number of sound elisions
– These two dependent variables are taken
from the same speaker but this is not a within-subjects design
Difficult cases
• More than one measurement per subject,
combined to give one score
– You are interested in the formant values of
male and female /a/. You have a list of 20
words, containing an /a/. Each group of 10
speakers reads the 20 words and you
measure the formant values. Then you build
the mean formant value of /a/ for every
speaker
– Since the analysis is performed on only one
score per subject, this is not a within-subjects design
Which statistical test, when you’ve score data (parametric tests)?
Between, within, mixed? | Number of independent variables? | Significance test
Between | One | Indep. t-Test (2 levels), One-way ANoVA
Between | More than one | Two-/Three-way ANoVA
Within | One | Paired t-Test (2 levels), a x s ANoVA
Within | More than one |
Mixed | | b x b (x c) x s ANoVA
Assumptions for statistical tests on
score data (parametric tests)
• The scores must be from an interval scale
• The scores must be normally distributed in
the population
• The variances in the conditions must be
homogeneous
Note: You can perform parametric tests only
if these assumptions are met!
T-Test
• Student’s T-test
• How likely is it that
two samples are
taken from the same
population?
• T-test looks at the
ratio of the difference
in group means to the
variability of the scores
(the standard error of
the difference)
Figure (Sample 1, Sample 2) taken from http://esa21.kennesaw.edu/modules/basics/exercise3/3-8.htm
T-Tests
• Calculating t-statistic
• Comparable to z-statistic, but dependent
on the degrees of freedom (df)
• Degrees of freedom (df)
– Independent t-test: N1+N2-2
– Paired t-test: N-1
• The critical t-value for α = 0.05 (5% risk of
finding an effect that is not actually there)
is dependent on df
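To make the two t-statistics and their degrees of freedom concrete, here is a sketch that computes both by hand on the same made-up data. It assumes the pooled-variance (Student's) form for the independent test. The reaction times are hypothetical, and the function names `independent_t` and `paired_t` are illustrative rather than taken from any package.

```python
import math
from statistics import fmean, variance

def independent_t(x, y):
    """Pooled-variance t statistic and df (N1 + N2 - 2) for two independent samples."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (fmean(x) - fmean(y)) / se, n1 + n2 - 2

def paired_t(x, y):
    """t statistic and df (N - 1) for paired scores, computed on the differences."""
    d = [a - b for a, b in zip(x, y)]
    se = math.sqrt(variance(d) / len(d))
    return fmean(d) / se, len(d) - 1

# Hypothetical reaction times (ms) of five subjects in two conditions
cond_a = [510, 620, 480, 700, 560]
cond_b = [480, 600, 455, 660, 540]

t_ind, df_ind = independent_t(cond_a, cond_b)
t_pair, df_pair = paired_t(cond_a, cond_b)
print(f"independent: t = {t_ind:.2f}, df = {df_ind}")
print(f"paired:      t = {t_pair:.2f}, df = {df_pair}")
```

On these made-up scores the paired statistic is far larger than the independent one. The large differences between subjects cancel out when each subject is compared with themselves, which is why a between-subjects test applied to a within-subjects design can miss an effect.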
T-distribution
• The more degrees of freedom, the closer
the t-distribution is to the
normal distribution
T-Table
One-tailed vs. two-tailed
predictions
• If we predict a direction of the difference,
we are making a one-tailed prediction
• If we predict that there is a difference
(irrespective of direction), we are making a
two-tailed prediction
• If there is not enough evidence for a
directional difference, a two-tailed test is
safe.
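The practical difference between the two kinds of prediction shows up in the p-value. The sketch below uses the normal approximation, which is an assumption that holds only for large df (for small df the t-distribution would be needed), and a hypothetical test statistic.

```python
from statistics import NormalDist

def p_values(z):
    """One- and two-tailed p-values for a z statistic (normal approximation)."""
    one_tailed = 1 - NormalDist().cdf(abs(z))
    return one_tailed, 2 * one_tailed

# Hypothetical test statistic
one_tailed, two_tailed = p_values(1.8)
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

For this value the one-tailed p is below 0.05 while the two-tailed p is not. The one-tailed figure is only meaningful if the observed difference lies in the predicted direction.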
Example
• Hypothesis: reaction
time in cond a is
significantly different
from cond b
• Null-hypothesis: the
reaction times are
not different in
conditions a and b
Independent t-test in SPSS
Organise
independent
and dependent
variables in
separate
columns!
Independent t-test in SPSS
• Dependent variable(s):
Test variable(s)
• Independent variable:
Grouping variable
You have to specify the
levels of the independent
variable
(can only have two!)
How to interpret the output?
• Descriptive statistics
• If p > 0.05, variances are homogeneous
• There is an effect of condition on rt
How to interpret the output?
• Group statistics (descriptive statistics for
the conditions)
• Independent samples test
– Levene’s test for equality of variances
(if p > 0.05, then variances are homogeneous)
– t-test for equality of means
  • t-value
  • df (N-2)
  • Significance level (2-tailed)
  • mean difference (difference between the means)
What do we report?
• There is a significant effect of condition on
reaction time. The average reaction time in
condition a was 238.7ms longer than in
condition b (t = 6.12, df = 62, p < 0.001).
• Interpretation?
Paired t-test in SPSS
• Variables of different conditions have to be
in parallel columns.
• Click on variables to compare and then
How to interpret the output?
• Paired samples statistic (descriptive
statistics)
• Paired samples correlation
(naturally, there should be a rather strong
correlation. Subjects with a low rt will have
a low one in both conditions)
• Paired samples t-test
(t, df (N-1), significance level)
What if the basic assumptions are
not met
• For example
– if the distributions are very skewed
– if you have ordinal data instead of interval
data
• You have to use non-parametric tests
• There is a whole range of non-parametric
tests; I’ll only show the most common
ones
Non-parametric statistical tests
(for one independent variable only)
Between, within, mixed? | Number of levels of independent variable? | Significance test
Between | Two | Mann-Whitney Test
Between | More than two | Kruskal-Wallis Test
Within | Two | Wilcoxon Signed Ranks Test
Within | More than two | Friedman Test
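As an illustration of how such a rank-based test works, the U statistic behind the Mann-Whitney Test can be computed by counting, for every pair of scores from the two groups, which group has the higher score. This is a sketch with hypothetical ordinal ratings. It computes the statistic only, not the p-value, and the function name `mann_whitney_u` is illustrative.

```python
def mann_whitney_u(x, y):
    """U statistic for x: pairs where x beats y count 1, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

# Hypothetical acceptability ratings (1 to 5) from two independent groups
with_pauses = [3, 4, 5, 5]
without_pauses = [1, 2, 3, 4]

u_with = mann_whitney_u(with_pauses, without_pauses)
u_without = mann_whitney_u(without_pauses, with_pauses)
print(u_with, u_without)   # the two U values always sum to len(x) * len(y)
```

Because only the ordering of the scores matters, the statistic is suitable for ordinal data such as ratings, where differences between scale points cannot be assumed to be equal.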