Introduction to Statistical Analysis
Sonja Eisenbeiss

Note: This introduction is aimed at researchers without a statistical background. It should enable them to read the result sections of research articles and to understand terms like "p-value", "repeated-measures design" or "Latin Square Design". For a list of introductions to the use of test statistics and the use of the software package R, see: http://experimentalfieldlinguistics.wordpress.com/readings/statistics/

Variables

Variables are properties of participants, situations, materials, etc. whose value can vary.

An Independent (Experimental) Variable (IV) is a variable whose values are manipulated by the researcher. The values of this variable are set up independently by the researcher, i.e. before the experiment begins. An IV can have several levels. An experiment can have several IVs. Conditions result from the combination of the levels of the IVs.

The Dependent Variable (DV) measures the effects that result from the researcher's manipulation. The values of this variable are seen as dependent on the values of the independent variable.

Example:
o Language: English
o Population: adult native speakers
o Constructions: s-possessives and prepositional of-possessives: the lady's leg vs. the leg of the lady; the table's leg vs. the leg of the table
o Research Question: Does animacy affect the choice of possessive construction?
o Hypotheses: Phrases with animate referents are more easily encoded than phrases with inanimate referents. Hence, phrases with animate referents tend to be encoded before phrases with inanimate referents. For possessive constructions, this means that speakers should prefer to realize an animate possessor (PR) phrase before an inanimate possessum (PM), i.e. in a PR-initial s-possessive (the lady's leg vs. ?the leg of the lady). Inanimate PRs should not show such a preference (the leg of the table vs. ?the table's leg).
o Literature: Rosenbach (2008) in Lingua and references cited there (use Google Scholar to find more recent publications referring to this article). Use http://linguistlist.org/, Glottopedia (http://www.glottopedia.de/index.php/Main_Page), http://www.wikipedia.org/ and http://academia.edu/ to find further references and to follow researchers, journals, topics, etc.

=> IV: animacy of PR; two levels (animate vs. inanimate)
   DV: percentage of s-choice

Types of Measurements

Variables can be categorical (or nominal), ordinal, or interval.

Categorical
A categorical variable has two or more categories, but does not involve any intrinsic ordering of the categories. Examples: animacy (animate vs. inanimate), gender (two unordered categories: male and female), first language of second-language learners (e.g. two unordered categories "French" and "German" in a study comparing French and German learners of Hindi).

Ordinal
The levels of an ordinal variable are clearly ordered. E.g., second language learners can be assigned to groups with low, intermediate and high proficiency in their second language, resulting in a variable PROFICIENCY with three levels. However, the spacing between the different levels of this variable (low, intermediate and high) is not necessarily consistent. For instance, there might be a bigger difference between low and intermediate proficiency than between intermediate and high proficiency. Thus, one cannot treat these categories as being on a scale with fixed intervals. Similarly, one can assume an animacy "scale" where inanimate objects like stones are considered less animate than plants, which are considered less animate than animals and humans. Such a scale involves an ordering, but no fixed regular intervals.

Interval
An interval variable also involves ordered categories, but here the intervals between the levels of the variable are equally spaced.
For instance, if you have reaction times of 500 milliseconds, 1000 milliseconds, 1500 milliseconds and 2000 milliseconds, you can be assured that the intervals between the four reaction times are equally spaced, as you are measuring on a millisecond scale with fixed intervals.

Descriptive Statistics for Nominal Variables

For categorical/nominal variables, one can provide absolute frequencies (total numbers) and relative frequencies (i.e. percentages such as 90% or ratios such as 9/10) for the different categories of responses.

For instance, if one compares how frequently a speaker uses s-possessives vs. of-possessives for animate vs. inanimate possessors in an elicited production experiment with picture descriptions, one should report:
• the number of s-possessives that were produced for animate possessors
• the number of s-possessives that were produced for inanimate possessors
• the percentage of pictures with animate possessors that elicited an s-possessive, out of all the pictures with animate possessors that elicited a response
• the percentage of pictures with inanimate possessors that elicited an s-possessive, out of all the pictures with inanimate possessors that elicited a response

You need to provide this information, plus the numbers and percentages of non-responses. Participants could produce low numbers of s-possessives for animate possessors, but this could be due to high rates of non-responses. Hence, it is important to also provide the percentage of s-possessives out of the total responses and the number and percentage of non-responses.

Descriptive Statistics for Ordinal and Interval Variables

Descriptive statistics for ordinal and interval variables provide information about the central tendencies and the variation in your data. Central tendencies show you the average or typical behaviour of your participants:

• Mean (average): the sum of all scores/measurements divided by the number of scores/measurements (only applicable for scale variables).
This measure is only appropriate for interval scales. It does not make sense to calculate a mean for "low", "intermediate" and "high", even if you use numbers like 1, 2, 3 to code these categories and the statistical programme will let you calculate a mean.

• Mode: the score/measurement obtained by the largest number of participants. This measure is appropriate for ordinal variables.

• Median: the "middle" score, i.e. the score that divides the group into two (so that half of the scores are above the median and half of the scores are below the median). This measure is appropriate when you have an interval scale and you are worried that there are some extremely high or low values that could distort the picture. For instance, if one participant misses the button press and there is no time-out set, you could have one reaction time of 1000ms in an experiment where all other measurements are between 200ms and 450ms. If you calculated a mean including the 1000ms measurement, the resulting high mean would not correctly reflect the overall performance of the group. If you calculate the median, the 1000ms measurement only counts as one value above the middle; its actual size is ignored. This makes the median less prone to problems with extreme values.

The standard deviation (s.d.) provides information about the variation in your data. A lower s.d. indicates comparatively homogeneous behaviour of your participant group. A higher s.d. shows that the group is heterogeneous with respect to your measurements, i.e. the participants behave very differently from one another.

Exercise: Provide the mean, mode and median for the scores of the following three tests. How do they differ? How large is the variation in the individual tests?

Table 1: Example Data Set for Descriptive Statistics

Test   score 1   score 2   score 3   score 4   score 5   score 6   score 7   score 8   score 9   s.d.
1      3         4         6         6         6         6         6         8         8         1.62
2      1         2         2         2         6         10        10        10        10        4.1
3      1         2         6         6         6         6         8         9         9         2.8

P-Values and the Purpose of Inferential (Test) Statistics

• Statistical tests are used to determine whether the results obtained in quantitative analyses should be interpreted or whether they might simply have come about by chance.
• Statistical tests provide a p-value: the probability of obtaining results at least as extreme as the observed ones if only chance were at work, i.e. if the null-hypothesis were true. A small p-value thus indicates that the observed effects, for instance a difference between two groups, would be unlikely to arise by chance alone.
• Two types of errors have to be avoided in the interpretation of results:
   • Type I error: A true null-hypothesis is rejected. I.e., you interpret an effect as meaningful when it is not.
   • Type II error: A false null-hypothesis is not rejected. I.e., you interpret an effect as a pure chance result when it is a "real" effect that should be interpreted.
• Alpha is the probability of a type I error. For each analysis, we have to determine an alpha-level, often also called "significance level". This is the probability with which we are willing to reject the null-hypothesis when it might in fact be correct, i.e. the probability of interpreting a chance result as meaningful.
• In linguistics and psychology, a result is typically interpreted as significant if the probability of a chance result is less than 5% (i.e. p<.05). For medical experiments, where more risks for participants and patients are involved, one might only accept a result if the probability of basing decisions on a chance result is smaller than 1% (p<.01). Thus, for your studies, you should use .05 as your alpha-level, but you should be prepared to find studies with a lower alpha-level, especially in medical research.

Your statistics software output may contain two different p-values:
• two-tailed: This value should be taken if you have an undirected experimental hypothesis (e.g.: There is a frequency difference for a construction X between speaker A and B).
• one-tailed: This value should only be taken if you have a directed experimental hypothesis (e.g.: Speaker A produces more constructions of type X than B). Even then, many people report two-tailed values as this is more conservative.

For your write-ups, tell the reader whether you have selected a two-tailed or a one-tailed analysis, e.g. "p<.05; two-tailed" or "p<.05; one-tailed". If you select a one-tailed analysis, you should make it clear which directed experimental hypothesis you are testing.

Correlational Designs

Two different measures are obtained for each participant and one tries to determine whether there is a relationship between the measurements.

Prediction: There is a positive correlation between the measurements (the higher the score for variable X, the higher the score for variable Y) OR there is a negative correlation between the measurements (the higher X, the lower Y).

Table 2: Correlational Design

Participant   Measurement 1   Measurement 2
1.
2.
3.
4.
…

Repeated Measure Designs/Variables

Other names: within group design, same subject design, related design

The same participants are measured several times.

Prediction: There is a difference between the measurements.

Table 3: Repeated Measure Design

Participant   Measurement 1   Measurement 2   Measurement 3   …
1.
2.
3.
4.
5.
…

Independent Group Designs/Variables

Other names: between-group design, different subject design, unrelated design

Two groups of participants are measured and the measurements of the two groups are compared. Groups can differ with respect to one variable, e.g. age, proficiency level, or L1. Then, there is one IV. Groups can also differ with respect to several of these variables. Then, there is more than one IV.

Prediction: There is a difference between the measurements.

Table 4: Independent Group Design

Participant   Group   Measurement
1.            1
2.            1
3.            1
4.            1
5.            1
…             1
6.            2
7.            2
8.            2
9.            2
…

Repeated Measures vs. Independent Groups in a Participant/Subject Analysis and in an Item-Analysis

In psycholinguistic experiments, there are typically at least 8-10 items for each condition. Moreover, there are typically at least 10-20 participants per group.

In our example, a study on the use of s-possessives in L2-acquisition, you have 6 Japanese learners of English and 6 German learners (ideally, you should also have English native speaker controls). German has a distinction between s-possessives and prepositional possessives, but animacy plays a limited role in construction choice. Japanese does not have such a distinction, but only a construction that is more similar to the English s-possessive with respect to word order.

Each participant has seen filler items that disguise the purpose of the experiment, plus:
• 8 sentences with an s-PR that has an animate referent
• 8 sentences with an s-PR that has an inanimate referent

Thus, we have a so-called 2 x 2 design with two IVs that each have two levels (LANGUAGE, with the two levels JAPANESE and GERMAN; ANIMACY, with the two levels ANIMATE PR vs. INANIMATE PR). The DV is an acceptability rating on a scale from 1 (completely unacceptable) to 5 (completely acceptable).

Statistical tests like t-tests and ANOVAs are often not based on the raw data, but on (i) means for individual participants for your participant analysis and (ii) means for individual items for your item analysis.

Table 5: Participant Analysis

LANGUAGE   PARTICIPANT   ANIMATE PR   INANIMATE PR
1          1
1          2
1          3
1          4
1          5
1          6
2          1
2          2
2          3
2          4
2          5
2          6

Table 6a: Item-Analysis Variant 1: Different PM-word for possessives with animate and inanimate PR-referent

PR-ANIMACY   Item   Japanese   German
1            1
1            2
1            3
1            4
1            5
1            6
1            7
1            8
2            1
2            2
2            3
2            4
2            5
2            6
2            7
2            8

What was a between-group variable in the participant/subject analysis may be a within-group (repeated measures) variable in the item analysis.
But this is not always the case:

The LANGUAGE of participants is a between-groups variable in the participant/subject analysis because each participant only has one measurement for language: each participant is either JAPANESE or GERMAN. However, LANGUAGE is a repeated measures variable in the item analysis, because we have two measures for each possessive (i.e. item): one for the Japanese learners and one for the German learners.

In our example for Table 6a, the possessives with animate PR-referents and the possessives with inanimate PR-referents contained different PM words (e.g. the lady's arm vs. the table's leg). Thus, they are different items (though they are matched for sentence length, familiarity of vocabulary etc.). For the participant analysis, PR-ANIMACY is a repeated measures variable because each participant is measured for possessives with animate PR-referents and for possessives with inanimate PR-referents. For your item analysis, ANIMACY is a between-group variable because each possessive only has one measurement for ANIMACY: the possessive either involves an animate PR-referent or an inanimate PR-referent, and the possessives with animate PR-referents involve different PM-nouns than the possessives with inanimate PR-referents.

If you use a Latin Square Design, each PM-noun is presented with two types of PRs: once with an animate PR-referent (to one group of participants) and once with an inanimate PR-referent (to another group of participants); e.g. the lady's leg vs. the table's leg (see Table 6b). Thus, you obtain two measurements for each PM-noun: one for the version with the animate PR-referent and one for the version with the inanimate PR-referent.
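The counterbalancing behind such a Latin Square Design can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not part of the original materials: the function name `latin_square_lists` and the PM-nouns in the usage example are hypothetical. Each presentation list contains every item exactly once, and the condition rotates from list to list, so that every item occurs in every condition across the lists.

```python
def latin_square_lists(items, conditions):
    """Build one presentation list per condition (hypothetical helper).

    The item at position i in list number `shift` is shown in condition
    (i + shift) mod n. Each item therefore appears once in every list
    and once in every condition across all lists.
    """
    n = len(conditions)
    return [
        [(item, conditions[(i + shift) % n]) for i, item in enumerate(items)]
        for shift in range(n)
    ]


# Hypothetical PM-nouns; a real study would use its own matched materials.
pm_nouns = ["leg", "arm", "back", "foot", "head", "neck", "face", "side"]
lists = latin_square_lists(pm_nouns, ["animate PR", "inanimate PR"])
for number, presentation_list in enumerate(lists, start=1):
    print("List", number, presentation_list)
```

With 8 items and 2 conditions, each of the two lists contains 4 items per condition, and each participant group would see only one of the lists.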
With a Latin Square Design, your data file for the item analysis should look like this:

Table 6b: Item-Analysis Variant 2: The same PM-words for possessives with animate and inanimate PR-referent (Latin Square Design)

Item   Japanese: Animate   German: Animate   Japanese: Inanimate   German: Inanimate
1
2
3
4
5
6
7
8

Table 7: Terminology and Synonyms

independent group      repeated measurement
unmatched              matched
unrelated              related
between group          within group
between subject        within subject
independent samples    paired samples

Normal Distribution

In order to select a statistical test, you have to determine whether your observations come from a normally distributed population. If you have measured on a scale, you can plot a histogram with the measurements on the X-axis and the number of participants that had the respective score on the Y-axis. If your data is normally distributed, the scores of all the individual cases should spread round the average in a particular bell-shaped pattern (the Gaussian curve) that you see illustrated in all statistics books and below. Many statistical tests (the ones called "parametric") can only be used for normally distributed data.

In order to determine whether your data is normally distributed, you can run a Kolmogorov-Smirnov test (KS-test). If this test is significant, your data distribution significantly deviates from the normal distribution. I.e., your data is NOT normally distributed. If the test is not significant, your data is either normally distributed or your data set is so small that your KS-test does not become significant even though your data is not normally distributed.

Figure 1: Normal Distribution (histogram with scores from 1.0 to 5.0 on the X-axis and frequencies on the Y-axis; Std. Dev = 1.02, Mean = 3.3, N = 20)

The Basis for the Choice of Statistical Test

The following criteria are used to determine which test to use for experiments in which differences between groups of participants or between different types of stimuli are investigated and where the DV is measured on a scale:
• Are you investigating correlations or differences?
• How many IVs does the design for the current part of the analysis involve?
• How many levels do your IVs have?
• Which types of IVs are involved: repeated measures, independent groups?
• Could observations be from a normally distributed population (assumption of normality)?

Figure 2: Choice of Statistical Test for Studies on Differences

1 variable
   repeated measures
      2 levels: parametric: t-test (related); non-parametric: Wilcoxon
      more than two levels: parametric: 1-way ANOVA (related); non-parametric: Friedman
   independent group
      2 levels: parametric: t-test (unrelated); non-parametric: Mann-Whitney
      more than two levels: parametric: 1-way ANOVA (unrelated); non-parametric: Kruskal-Wallis
2 or more variables
   repeated measures: 2 (3, ...) way ANOVA (related)
   mixed: 2 (3, ...) way ANOVA (mixed)
   independent group: 2 (3, ...) way ANOVA (unrelated)

Choice of Statistical Test for Correlation Designs

parametric: Pearson; non-parametric: Spearman

Some Memory Aids for Statistical Tests

• T-test: "tea for two", as this test simply compares two means (for two groups or two conditions).
• The non-parametric tests for repeated measures (within-group comparisons) are named after one single person (Wilcoxon or Friedman).
• The non-parametric tests for comparisons of independent groups are named after two independent people (Mann-Whitney or Kruskal-Wallis).
• Parametric correlation: Pearson

Exercise

Discuss the following examples: What should your files for the participants' means look like? Select appropriate tests for your analysis.

1. A study compares reading speed scores in two groups of learners, one taught with teaching method 1, the other one taught with teaching method 2. The two groups were matched on English proficiency (TOEFL scores) before the teaching, and the measurements took place after the teaching.
2. A study compares reading speed scores in three groups of learners, each taught with a different teaching method. All three groups were matched on TOEFL scores before the teaching, and the measurements took place after the teaching.
3. A study compares reading speed scores in two groups of learners, one taught with teaching method 1, the other one taught with teaching method 2. Both groups are measured before and after teaching. In addition, there is a control group without any teaching. This group is measured twice as well, with the same time interval as the other two groups.
4. A reaction time study compares how fast participants can recognize one-syllable words, two-syllable words and three-syllable words.
5. A reaction time study compares how fast participants can recognize high-frequency regularly inflected word forms, low-frequency regularly inflected word forms, high-frequency irregularly inflected word forms, and low-frequency irregularly inflected word forms.
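
The selection criteria for studies on differences (number of IVs, number of levels, type of design, assumption of normality) can also be written down as a small lookup function. The following Python sketch is only an illustration of that decision tree: the function name `choose_test` and its parameter names are hypothetical, and it does nothing more than return the test names listed for each branch. It does not run any statistical test.

```python
def choose_test(n_ivs, n_levels, design, parametric=True):
    """Look up a test name for a study on differences (hypothetical helper).

    n_ivs:      number of independent variables (IVs)
    n_levels:   number of levels of the IV (only used when n_ivs == 1)
    design:     "repeated", "independent" or "mixed"
    parametric: True if the assumption of normality seems tenable
    """
    label = {"repeated": "related", "independent": "unrelated", "mixed": "mixed"}
    if design not in label:
        raise ValueError("design must be 'repeated', 'independent' or 'mixed'")
    if n_ivs < 1:
        raise ValueError("at least one IV is needed")

    if n_ivs >= 2:
        # Assumption: the decision tree names only ANOVAs for several IVs,
        # so no non-parametric alternative is offered here.
        if not parametric:
            raise ValueError("no non-parametric test listed for several IVs")
        return f"{n_ivs}-way ANOVA ({label[design]})"

    # From here on there is exactly one IV.
    if design == "mixed":
        raise ValueError("a mixed design needs at least two IVs")
    if n_levels < 2:
        raise ValueError("an IV needs at least two levels")
    if n_levels == 2:
        if parametric:
            return f"t-test ({label[design]})"
        return "Wilcoxon" if design == "repeated" else "Mann-Whitney"
    if parametric:
        return f"1-way ANOVA ({label[design]})"
    return "Friedman" if design == "repeated" else "Kruskal-Wallis"


# The 2 x 2 participant analysis with LANGUAGE (independent groups)
# and ANIMACY (repeated measures) is a mixed design with two IVs.
print(choose_test(n_ivs=2, n_levels=2, design="mixed"))  # 2-way ANOVA (mixed)
```

Raising an error for unlisted combinations, rather than guessing a test, keeps the sketch faithful to the branches that are actually given.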