Download Eisenbeiss (2013): Introduction to Statistics

Introduction to Statistical Analysis
Sonja Eisenbeiss ([email protected]),
Note: This introduction is aimed at researchers without statistical background. It should
enable them to read result sections of research articles and to understand terms like
"p-value", "repeated-measures design" or "Latin Square Design". For a list of introductions
to the use of test statistics and the use of the software package R, see:
http://experimentalfieldlinguistics.wordpress.com/readings/statistics/
Variables
• Variables are properties of participants, situations, materials, etc. whose value can vary.
• An Independent (Experimental) Variable (IV) is a variable whose values are
manipulated by the researcher. The values of this variable are set up independently by
the researcher, i.e. before the experiment begins.
• An IV can have several levels. An experiment can have several IVs. Conditions
result from the combination of IVs.
• The Dependent Variable (DV) measures the effects that result from the researcher's
manipulation. The values of this variable are seen as dependent on the values of the
independent variable.
Example:
o Language: English
o Population: adult native speakers
o Constructions: s-possessives and prepositional of-possessives
the lady's leg vs. the leg of the lady; the table's leg vs. the leg of the table
o Research Question: Does animacy affect the choice of possessive
construction?
o Hypotheses: Phrases with animate referents are more easily encoded than
phrases with inanimate referents. Hence, phrases with animate referents tend
to be encoded before phrases with inanimate referents. For possessive
constructions, this means that speakers should prefer to realize an animate
possessor (PR) phrase before an inanimate possessum (PM) phrase, i.e. in a
PR-initial s-possessive (the lady's leg vs. ?the leg of the lady). Inanimate PRs
should not show such a preference (the leg of the table vs. ?the table's leg).
o Literature: Rosenbach (2008) in Lingua and references cited there (use Google
Scholar to find more recent publications referring to this article). Use
http://linguistlist.org/, Glottopedia
(http://www.glottopedia.de/index.php/Main_Page), http://www.wikipedia.org/ and
http://academia.edu/ to find further references and to follow researchers,
journals, topics, etc.
=> IV: animacy of PR; two levels (animate vs. inanimate)
DV: percentage of s-possessives chosen
Types of Measurements
Variables can be categorical (or nominal), ordinal, or interval.



Categorical
A categorical variable has two or more categories, but does not involve any intrinsic
ordering of the categories. Examples: animacy (animate vs. inanimate), gender (two
unordered categories: male and female), first language of second-language learners
(e.g. two unordered categories "French" and "German" in a study comparing French
and German learners of Hindi).
Ordinal
The levels of an ordinal variable are clearly ordered. E.g., second language learners
can be assigned to groups with low, intermediate and high proficiency in their second
language, resulting in a variable PROFICIENCY, with three levels. However, the
spacing between the different levels of this variable (low, intermediate and high) may
not necessarily be consistent. For instance, there might be a bigger difference between
low and intermediate levels than between intermediate and high proficiency. Thus,
one cannot treat these categories as being on a scale with fixed intervals. Similarly,
one can assume an animacy "scale" where inanimate objects like stones are
considered less animate than plants, which are considered less animate than animals
and humans. Such a scale involves an ordering, but no fixed regular intervals.
Interval
An interval variable also involves ordered categories, but here the intervals between
the levels are equally spaced. For instance, if you have reaction times of
500 milliseconds, 1000 milliseconds, 1500 milliseconds and 2000 milliseconds, you
can be assured that the intervals between the four reaction times are equally spaced, as
you are measuring on a millisecond scale with fixed intervals.
Descriptive Statistics for Nominal Variables
For categorical/nominal variables, one can provide
• absolute frequencies (total numbers) and
• relative frequencies (i.e. percentages such as 90% or ratios such as 9/10) for the
different categories of responses.
For instance, if one compares how frequently a speaker uses s-possessives vs.
of-possessives for animate vs. inanimate possessors in an elicited production experiment
with picture descriptions, one should report
• the number of s-possessives that were produced for animate possessors
• the number of s-possessives that were produced for inanimate possessors
• the percentage of pictures with animate possessors that elicited an s-possessive out of
all the pictures with animate possessors that elicited a response
• the percentage of pictures with inanimate possessors that elicited an s-possessive out
of all the pictures with inanimate possessors that elicited a response
You need to provide this information, plus numbers and percentages of non-responses.
Participants could produce low numbers of s-possessives for animate possessors, but this
could be due to high rates of non-responses. Hence, it is important to also provide the
percentage of s-possessives out of the total responses and the number and percentage of
non-responses.
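The reporting scheme above can be sketched in a few lines of Python. This is only an illustration: the response counts are invented, and the coding labels ("s", "of", "none") and the function name summarise are assumptions made for this example, not part of any particular software package.

```python
from collections import Counter

# Hypothetical coded responses: (possessor animacy, response type).
# "none" marks a picture that elicited no response.
responses = (
    [("animate", "s")] * 14 + [("animate", "of")] * 4 + [("animate", "none")] * 2
    + [("inanimate", "s")] * 3 + [("inanimate", "of")] * 15 + [("inanimate", "none")] * 2
)

def summarise(responses, animacy):
    """Return absolute and relative frequencies for one animacy condition."""
    counts = Counter(resp for cond, resp in responses if cond == animacy)
    total = sum(counts.values())        # all pictures shown in this condition
    answered = total - counts["none"]   # pictures that elicited a response
    return {
        "s_count": counts["s"],
        "nonresponse_count": counts["none"],
        # s-possessives out of all pictures that elicited a response
        "s_percent_of_responses": 100 * counts["s"] / answered,
        # non-responses out of all pictures shown
        "nonresponse_percent": 100 * counts["none"] / total,
    }

for animacy in ("animate", "inanimate"):
    print(animacy, summarise(responses, animacy))
```

Reporting the non-response counts next to the percentages lets readers see whether a low number of s-possessives reflects a real dispreference or simply missing data.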
Descriptive Statistics for Ordinal and Interval Variables
Descriptive statistics for ordinal and interval variables provide information about the
central tendencies and the variation in your data.
Central tendencies show you the average or typical behaviour of your participants:
• Mean (average): the sum of all scores/measurements divided by the number of
scores/measurements. This measure is only appropriate for interval scales. It does
not make sense to calculate a mean for "low", "intermediate" and "high", even if you
use numbers like 1, 2, 3 to code these categories and the statistical programme will
let you calculate a mean.
• Mode: the score/measurement obtained by the largest number of participants.
This measure is appropriate for ordinal variables.
• Median: the "middle" score, i.e. the score that divides the group into two (so that
half of the scores are above the median and half of the scores are below the
median). This measure is appropriate when you have an interval scale and you are
worried that there are some extremely high or low values that could distort the
picture. For instance, if one participant misses the button press and there is no
time-out set, you could have one reaction time of 1000ms in an experiment where
all other measurements are between 200ms and 450ms. If you calculated a mean
including the 1000ms measurement, the resulting high mean would not correctly
reflect the overall performance of the group. If you calculate the median, the
1000ms measurement only counts as one value above the middle; its actual size is
ignored, which makes the median less prone to problems with extreme values.
The standard deviation (s.d.) provides information about the variation in your data.
Lower s.d.s indicate comparatively homogeneous behaviour in your participant group.
Higher s.d.s show that the group is heterogeneous with respect to your measurements, i.e.
the participants behave very differently from one another.
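As a small illustration of these measures, the following sketch uses Python's standard statistics module on an invented set of reaction times that contains one missed button press. The data are hypothetical and only chosen to mirror the 1000ms example above.

```python
import statistics

# Hypothetical reaction times in ms: nine ordinary presses and one missed press.
reaction_times = [200, 250, 250, 300, 320, 350, 380, 400, 450, 1000]

mean_rt = statistics.mean(reaction_times)      # pulled upwards by the 1000 ms value
median_rt = statistics.median(reaction_times)  # middle of the sorted values
mode_rt = statistics.mode(reaction_times)      # most frequent value
sd_rt = statistics.stdev(reaction_times)       # sample standard deviation (n - 1)

# The same measures without the extreme value, for comparison.
trimmed = reaction_times[:-1]
print("with outlier:   ", mean_rt, median_rt, mode_rt, round(sd_rt, 1))
print("without outlier:", round(statistics.mean(trimmed), 1),
      statistics.median(trimmed), round(statistics.stdev(trimmed), 1))
```

With these numbers, removing the single extreme value changes the mean from 390 to about 322, while the median only moves from 335 to 320.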
Exercise:
Provide the mean, mode and median for the scores of the following three tests. How do they
differ? How large is the variation in the individual tests?
Table 1: Example Data Set for Descriptive Statistics
Test  score 1  score 2  score 3  score 4  score 5  score 6  score 7  score 8  score 9  s.d.
1     3        4        6        6        6        6        6        8        8        1.62
2     1        2        2        2        6        10       10       10       10       4.1
3     1        2        6        6        6        6        8        9        9        2.8
P-Values and the Purpose of Inferential (Test) Statistics
• Statistical tests are used to determine whether the results obtained in quantitative
analyses should be interpreted or whether they might simply have come about by
chance.
• Statistical tests provide a p-value. The p-value tells you how likely it would be to
obtain an effect at least as large as the one observed (for instance a difference between
two groups) by chance alone, i.e. if the null-hypothesis of "no effect" were true. A small
p-value thus indicates that the observed effect would be unlikely as a pure chance result.
• Two types of errors have to be avoided in the interpretation of results:
   • Type I error: A true null-hypothesis is rejected. I.e., you interpret an effect as
   meaningful when it is not.
   • Type II error: A false null-hypothesis is not rejected. I.e., you interpret
   an effect as a pure chance result when it is a "real" effect that should be
   interpreted.
• Alpha is the probability of a type I error. For each analysis, we have to determine an
alpha-level, often also called "significance level". This is the probability with which
we are willing to reject the null-hypothesis when it might in fact be correct, i.e. the
probability of interpreting a chance result as meaningful. In linguistics and
psychology, a result is typically interpreted as significant if the probability of a
chance result is less than 5% (i.e. p<.05). For medical experiments, where more risks
for participants and patients are involved, one might only accept a result if the
probability of basing decisions on a chance result is smaller than 1% (p<.01). Thus,
for your studies, you should use .05 as your alpha-value, but you should be prepared
to find studies with a lower alpha-level, especially in medical research.
• Your statistics-software outputs may contain two different p-values:
   • two-tailed: This value should be taken if you have an undirected experimental
   hypothesis (e.g.: There is a frequency difference for a construction X between
   speaker A and B).
   • one-tailed: This value should only be taken if you have a directed
   experimental hypothesis (e.g.: Speaker A produces more constructions of type
   X than B). Even then, many people report two-tailed values as this is more
   conservative.
For your write-ups, tell the reader whether you have selected a two-tailed or a
one-tailed analysis, e.g. "p<.05; two-tailed" or "p<.05; one-tailed". If you select a
one-tailed analysis, you should make it clear which directed experimental hypothesis you
are testing.
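To make the difference between one-tailed and two-tailed p-values concrete, here is a minimal sketch of an exact binomial (sign) test written with the Python standard library. The scenario is invented: 16 out of 20 participants produce more s-possessives for animate than for inanimate possessors, and the null-hypothesis is that either outcome is equally likely. The function name binomial_p_values is an assumption made for this illustration; statistics packages offer their own, more general routines.

```python
from math import comb

def binomial_p_values(successes, n):
    """Exact p-values for a sign (binomial) test against a 50:50 null-hypothesis."""
    # Probability of exactly k successes out of n if both outcomes are equally likely.
    def prob(k):
        return comb(n, k) / 2 ** n
    # One-tailed: results at least as large as the one observed.
    one_tailed = sum(prob(k) for k in range(successes, n + 1))
    # Two-tailed: results at least as far from n/2 in either direction.
    distance = abs(successes - n / 2)
    two_tailed = sum(prob(k) for k in range(n + 1) if abs(k - n / 2) >= distance)
    return one_tailed, two_tailed

one_tailed, two_tailed = binomial_p_values(16, 20)
alpha = 0.05
print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
print("significant (two-tailed)" if two_tailed < alpha else "not significant (two-tailed)")
```

In this example the two-tailed value is exactly twice the one-tailed value, because the distribution under a 50:50 null-hypothesis is symmetric. This is why reporting the two-tailed value is the more conservative choice.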
Correlational Designs
• Two different measures are obtained for each participant and one tries to determine
whether there is a relationship between the measurements.
• Prediction: There is a positive correlation between the measurements (the higher the
score for variable X, the higher the score for variable Y) OR there is a negative
correlation between the measurements (the higher X, the lower Y).
Table 2: Correlational Design
Participant  Measurement 1  Measurement 2
1.
2.
3.
4.
…
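A data file laid out as in Table 2 can be analysed with a correlation coefficient. The following sketch computes Pearson's r by hand so that each step of the formula is visible. The scores and the variable labels in the comments are invented for illustration, and pearson_r is a hypothetical helper name.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Co-variation of the two measurements around their means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: one entry per participant, two measurements each.
measurement_1 = [2, 4, 5, 7, 9]            # e.g. a proficiency score
measurement_2 = [10, 14, 15, 21, 24]       # rises with measurement 1
measurement_2_neg = [24, 21, 15, 14, 10]   # falls as measurement 1 rises

r_pos = pearson_r(measurement_1, measurement_2)
r_neg = pearson_r(measurement_1, measurement_2_neg)
print(round(r_pos, 3), round(r_neg, 3))
```

Values of r close to +1 correspond to the "higher X, higher Y" prediction, values close to -1 to the "higher X, lower Y" prediction, and values near 0 to no linear relationship.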
Repeated Measure Designs/Variables
• Other names: within group design, same subject design, related design
• The same participants are measured several times.
• Prediction: There is a difference between the measurements.
Table 3: Repeated Measure Design
Participant  Measurement 1  Measurement 2  Measurement 3  …
1.
2.
3.
4.
5.
…
Independent Group Designs/Variables
• Other names: between-group design, different subject design, unrelated design
• Two groups of participants are measured and the measurements of the two groups are
compared.
• Groups can differ with respect to one variable, e.g. age, proficiency level, or L1.
Then, there is one IV.
• Groups can also differ with respect to several of these variables. Then, there is more
than one IV.
• Prediction: There is a difference between the measurements.
Table 4: Independent Group Design
Participant  Group  Measurement
1.           1
2.           1
3.           1
4.           1
5.           1
…            1
6.           2
7.           2
8.           2
9.           2
…
Repeated Measures vs. Independent Groups in a
Participant/Subject Analysis and in an Item-Analysis
In psycholinguistic experiments, there are typically at least 8-10 items for each condition.
Moreover, there are typically at least 10-20 participants per group. In our example, a
study on the use of s-possessives in L2-acquisition, we have 6 Japanese learners of
English and 6 German learners (ideally, one should also have English native speaker
controls). German has a distinction between s-possessives and prepositional possessives,
but animacy plays a limited role in construction choice. Japanese does not have such a
distinction, but only a construction that is more similar to the English s-possessive with
respect to word order. Each participant has seen filler items that disguise the purpose of
the experiment, plus:
• 8 sentences with an s-possessive in which the PR has an animate referent
• 8 sentences with an s-possessive in which the PR has an inanimate referent
Thus, we have a so-called 2 x 2 design with two IVs that each have two levels
(LANGUAGE, with the two levels JAPANESE and GERMAN; ANIMACY, with the
two levels ANIMATE PR vs. INANIMATE PR). The DV is an acceptability rating on a
scale from 1 to 5 (completely unacceptable – completely acceptable).
Statistical tests like t-tests and ANOVAs are often not based on the raw data, but on (i)
means for individual participants for your participant analysis and (ii) on means for
individual items for your item analysis.
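The step from raw ratings to participant means and item means can be sketched as follows. The ratings, the participant and item labels, and the helper name cell_means are all invented for this illustration. A real data set would contain all 12 participants and 16 items.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw ratings in long format:
# (participant, language, item, animacy, rating on the 1-5 scale)
raw = [
    ("P1", "Japanese", "I1", "animate", 5), ("P1", "Japanese", "I2", "animate", 4),
    ("P1", "Japanese", "I3", "inanimate", 4), ("P1", "Japanese", "I4", "inanimate", 3),
    ("P2", "German", "I1", "animate", 5), ("P2", "German", "I2", "animate", 5),
    ("P2", "German", "I3", "inanimate", 2), ("P2", "German", "I4", "inanimate", 1),
]

def cell_means(rows, key):
    """Average the ratings of all rows that share the same key."""
    cells = defaultdict(list)
    for row in rows:
        cells[key(row)].append(row[4])
    return {k: mean(v) for k, v in cells.items()}

# Participant analysis: one mean per participant and animacy condition.
by_participant = cell_means(raw, key=lambda r: (r[0], r[3]))
# Item analysis: one mean per item and language group.
by_item = cell_means(raw, key=lambda r: (r[2], r[1]))

print(by_participant)
print(by_item)
```

The two dictionaries correspond to the two kinds of data files described in this section: one row per participant for the participant analysis and one row per item for the item analysis.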
Table 5: Participant Analysis
LANGUAGE  PARTICIPANT  ANIMATE PR  INANIMATE PR
1         1
1         2
1         3
1         4
1         5
1         6
2         1
2         2
2         3
2         4
2         5
2         6
Table 6a: Item-Analysis
Variant 1: Different PM-word for possessives with animate and inanimate PR-referent
PR-ANIMACY  Item  Japanese  German
1           1
1           2
1           3
1           4
1           5
1           6
1           7
1           8
2           1
2           2
2           3
2           4
2           5
2           6
2           7
2           8
What was a between-group variable in the participant/subject analysis may be a
within-group (repeated measures) variable in the item analysis. But this is not always the case:
• The LANGUAGE of participants is a between-groups variable in the
participant/subject analysis because each participant only has one measurement for
language: each participant is either JAPANESE or GERMAN. However,
LANGUAGE is a repeated measures variable in the item-analysis because we have
two measures for each possessive (i.e. item): one for the Japanese learners and one for
the German learners.
• In our example for Table 6a, the possessives with animate PR-referents and the
possessives with inanimate PR-referents contained different PM words (e.g. the lady's
arm vs. the table's leg). Thus, they are different items (though they are matched for
sentence length, familiarity of vocabulary etc.). For the participant analysis,
PR-ANIMACY is a repeated measure variable because each participant is measured for
possessives with animate PR-referents and for possessives with inanimate
PR-referents. For your item analysis, ANIMACY is a between-group variable because
each possessive only has one measurement for ANIMACY: The possessive either
involves an animate PR-referent or an inanimate PR-referent, and the possessives
with animate PR-referents involve different PM-nouns than the possessives with
inanimate PR-referents.
• If you use a Latin square design, each PM-noun is presented with two types of PRs:
once with an animate PR-referent (to one group of participants) and once with an
inanimate PR-referent (to another group of participants); e.g. the lady's leg vs. the
table's leg (see Table 6b). Thus, you obtain two measurements for each PM-noun:
one for the version with the animate PR-referent and one for the version with the
inanimate PR-referent. Then, your data file for the item analysis should look like this:
Table 6b: Item-Analysis
Variant 2: The same PM-words for possessives with animate and inanimate PR-referent
(Latin Square Design):
Item Japanese: Animate German: Animate Japanese: Inanimate German: Inanimate
1
2
3
4
5
6
7
8
1
2
3
4
5
6
7
8
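The rotation behind such a Latin square design can be sketched in Python. This is a minimal illustration under assumptions: the PM-nouns are invented, the function name latin_square_lists is hypothetical, and a real experiment would also randomise the presentation order and add fillers.

```python
def latin_square_lists(pm_nouns, conditions=("animate", "inanimate")):
    """Rotate conditions over items so each list shows every noun exactly once."""
    n = len(conditions)
    lists = []
    for list_index in range(n):
        # Shift the condition assignment by one position for each new list.
        lists.append({
            noun: conditions[(item_index + list_index) % n]
            for item_index, noun in enumerate(pm_nouns)
        })
    return lists

pm_nouns = ["leg", "arm", "foot", "back"]  # hypothetical PM-nouns
list_a, list_b = latin_square_lists(pm_nouns)
for noun in pm_nouns:
    print(noun, "| list A:", list_a[noun], "| list B:", list_b[noun])
```

Each participant group sees only one list, so no participant meets the same PM-noun twice, while across the two lists every PM-noun appears once with an animate and once with an inanimate PR-referent.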
Table 7: Terminology and Synonyms
independent group      repeated measurement
unmatched              matched
unrelated              related
between group          within group
between subject        within subject
independent samples    paired samples
Normal Distribution
In order to select a statistical test, you have to determine whether your observations come
from a normally distributed population. If you have measured on a scale, you can plot a
histogram with the measurements on the X-axis and the number of participants that had
the respective score on the Y-axis. If your data are normally distributed, the scores of
the individual cases should spread around the average in a particular bell-shaped pattern
(the Gaussian curve) that you see illustrated in all statistics books and in Figure 1 below.
Many statistical tests (the ones called "parametric") should only be used for normally
distributed data. In order to determine whether your data are normally distributed, you
have to run a KS-test (Kolmogorov-Smirnov test). If this test is significant, your data
distribution significantly deviates from the normal distribution. I.e., your data are NOT
normally distributed. If the test is not significant, your data are either normally
distributed or your data set is so small that your KS-test does not become significant
even though your data are not normally distributed.
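The logic of the KS-test can be sketched with the Python standard library. The sketch below compares the empirical distribution of a sample with a normal distribution that has the sample's own mean and standard deviation, and then converts the largest gap D into a p-value using Kolmogorov's asymptotic distribution. Two caveats apply: the function name ks_normal and the data are invented for this illustration, and the p-value is only a rough approximation, because it is asymptotic and tends to be conservative when mean and s.d. are estimated from the same sample. Statistics packages handle these details for you and may report somewhat different values.

```python
from math import erf, exp, sqrt
from statistics import mean, stdev

def ks_normal(sample):
    """KS statistic D and asymptotic p-value against a fitted normal distribution."""
    data = sorted(sample)
    n = len(data)
    mu, sigma = mean(data), stdev(data)
    d = 0.0
    for i, x in enumerate(data, start=1):
        cdf = 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))  # normal CDF at x
        # Largest gap between the empirical and the theoretical distribution.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    lam = sqrt(n) * d
    # Kolmogorov's limiting distribution (alternating series, truncated).
    p = 2 * sum((-1) ** (k - 1) * exp(-2 * k * k * lam * lam) for k in range(1, 101))
    return d, min(1.0, max(0.0, p))

symmetric = [1, 2, 3, 4, 5]                # hypothetical, roughly symmetric scores
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 2, 20]   # hypothetical, one extreme score

d_sym, p_sym = ks_normal(symmetric)
d_skew, p_skew = ks_normal(skewed)
print(f"symmetric: D = {d_sym:.3f}, p = {p_sym:.3f}")
print(f"skewed:    D = {d_skew:.3f}, p = {p_skew:.3f}")
```

Note that the non-significant result for the five symmetric scores illustrates the warning above: with so few observations, the test has little chance of detecting a deviation from normality.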
Figure 1: Normal Distribution
[Histogram with bell-shaped curve. X-axis: scores (1.0 to 5.0); Y-axis: number of participants (0 to 10). Std. Dev = 1.02, Mean = 3.3, N = 20]
The Basis for the Choice of Statistical Test
The following criteria are used to determine which test to use for experiments in which
differences between groups of participants or between different types of stimuli are
investigated and where the DV is measured on a scale:
• Are you investigating correlations or differences?
• How many IVs does the design for the current part of the analysis involve?
• How many levels do your IVs have?
• Which types of IVs are involved:
o repeated measures,
o independent groups?
• Could observations be from a normally distributed population (assumption of
normality)?
Figure 2: Choice of Statistical Test for Studies on Differences
1 variable
   repeated measures
      2 levels
         parametric: t-test (related)
         nonparametric: Wilcoxon
      more than two levels
         parametric: 1-way ANOVA (related)
         nonparametric: Friedman
   indep. group
      2 levels
         parametric: t-test (unrelated)
         nonparametric: Mann-Whitney
      more than two levels
         parametric: 1-way ANOVA (unrelated)
         nonparametric: Kruskal-Wallis
2 or more variables
   repeated measures: 2 (3, ...) way ANOVA (related)
   mixed: 2 (3, ...) way ANOVA (mixed)
   independent group: 2 (3, ...) way ANOVA (unrelated)
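The decision tree in Figure 2 can also be written down as a small lookup function. This is only a sketch of the figure's logic: the function name choose_test and its parameter names are invented for this illustration. Since the figure lists no non-parametric option for designs with two or more IVs, the parametric flag is ignored in that branch.

```python
def choose_test(n_ivs, design, levels=2, parametric=True):
    """Look up a test for a study on differences, following Figure 2.

    design: "repeated", "independent" or "mixed" (mixed needs two or more IVs).
    levels: number of levels of the IV (only used when there is one IV).
    """
    if n_ivs >= 2:
        label = {"repeated": "related", "independent": "unrelated", "mixed": "mixed"}[design]
        return f"{n_ivs}-way ANOVA ({label})"
    if design == "mixed":
        raise ValueError("a mixed design needs at least two IVs")
    # Keys: (design, exactly two levels?, parametric?)
    table = {
        ("repeated", True, True): "t-test (related)",
        ("repeated", True, False): "Wilcoxon",
        ("repeated", False, True): "1-way ANOVA (related)",
        ("repeated", False, False): "Friedman",
        ("independent", True, True): "t-test (unrelated)",
        ("independent", True, False): "Mann-Whitney",
        ("independent", False, True): "1-way ANOVA (unrelated)",
        ("independent", False, False): "Kruskal-Wallis",
    }
    return table[(design, levels == 2, parametric)]

print(choose_test(1, "repeated", levels=2, parametric=False))
print(choose_test(1, "independent", levels=3))
print(choose_test(2, "mixed"))
```

For instance, the 2 x 2 example above with LANGUAGE as an independent-groups IV and ANIMACY as a repeated-measures IV in the participant analysis would be looked up as choose_test(2, "mixed").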
Choice of Statistical Test for Correlation Designs
parametric: Pearson, non-parametric: Spearman
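The non-parametric Spearman correlation works on ranks rather than on the raw scores. The following sketch uses the textbook shortcut formula, which is only valid when there are no tied values. The data and the helper names ranks and spearman_rho are invented for this illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n*(n^2-1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data: y always rises with x, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 40, 200]
rho = spearman_rho(x, y)                      # ranks agree perfectly, so rho is 1.0
rho_mixed = spearman_rho(x, [3, 1, 2, 5, 4])  # partly matching ranks

print(rho, rho_mixed)
```

Because only the order of the scores matters, Spearman's rho reaches 1.0 here even though the relationship between x and y is far from linear.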
Some Memory Aids for Statistical Tests
• T-test: "tea for two", as this test simply compares two means (for two groups or two
conditions).
• The non-parametric tests for repeated measures (within-group comparisons) are
named after one single person (Wilcoxon or Friedman).
• The non-parametric tests for comparisons of independent groups are named after two
independent people (Mann-Whitney or Kruskal-Wallis).
• Parametric Correlation: Pearson
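To show what "comparing two means" involves, here is a minimal sketch of the t statistic for the related and the unrelated case, written with the Python standard library. The ratings are invented, the function names t_related and t_unrelated are hypothetical, and the unrelated version assumes equal variances in both groups (Student's t-test with pooled variance). Turning t into a p-value additionally requires the t distribution with the appropriate degrees of freedom, which statistics packages provide.

```python
from math import sqrt
from statistics import mean, stdev

def t_related(a, b):
    """t statistic for two measurements of the same participants (paired)."""
    diffs = [x - y for x, y in zip(a, b)]
    # Mean difference divided by its standard error.
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def t_unrelated(a, b):
    """Student's t statistic for two independent groups (pooled variance)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))

# Hypothetical mean ratings for six participants (or two groups of six).
animate = [5, 4, 5, 4, 5, 4]
inanimate = [3, 3, 4, 2, 4, 3]

print(round(t_related(animate, inanimate), 3))    # same participants in both conditions
print(round(t_unrelated(animate, inanimate), 3))  # two different groups
```

With the same numbers, the related version yields a larger t value here because it removes the variation between participants and only looks at each participant's own difference.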
Exercise
Discuss the following examples: What should your files for the means of the participants
look like? Select appropriate tests for your analysis.
1. A study compares reading speed scores in two groups of learners, one taught with
teaching method 1, the other one taught with teaching method 2. The two groups were
matched on English proficiency (TOEFL scores) before the teaching; and the
measurements took place after the teaching.
2. A study compares reading speed scores in three groups of learners, each taught with a
different teaching method. All three groups were matched on TOEFL scores before
the teaching; and the measurements took place after the teaching.
3. A study compares reading speed scores in two groups of learners, one taught with
teaching method 1, the other one taught with teaching method 2. Both groups are
measured before and after teaching. In addition, there is a control group without any
teaching. This group is measured twice as well, with the same time interval as the
other two groups.
4. A reaction time study compares how fast participants can recognize one-syllable
words, two-syllable words and three-syllable words.
5. A reaction time study compares how fast participants can recognize high-frequency
regularly inflected word forms, low-frequency regularly inflected word forms,
high-frequency irregularly inflected word forms, and low-frequency irregularly inflected
word forms.