Quantitative methods and R – (2)
LING115
December 2, 2009

Two sample t-test
• Check if the means of two samples are different
– Calculate the difference between the two means
– Normalize it by the standard error
• Which standard error to use becomes an issue since there are two samples

Two sample t-test – (2)
• If the two sample variances are roughly the same, pool the two sample variances and then estimate SE
– The pooled variance is a weighted average of the two variances, each weighted by its degrees of freedom (n-1)

\[ s_p^2 = \frac{(n_a - 1)\, s_a^2 + (n_b - 1)\, s_b^2}{(n_a - 1) + (n_b - 1)} \]
\[ SE = \sqrt{ s_p^2 \cdot \frac{n_a + n_b}{n_a n_b} } \]
\[ t = \frac{\bar{x}_a - \bar{x}_b}{SE} \]
\[ df = n_a + n_b - 2 \]

Two sample t-test – (3)
• If the two sample variances are not the same, SE is estimated by the following formula:

\[ SE = \sqrt{ \frac{s_a^2}{n_a} + \frac{s_b^2}{n_b} } \]

• Degrees of freedom are calculated differently depending on the sample size
– If both samples consist of more than 30 data points, we can use the normal distribution as the distribution of t-scores
– If not, estimate the degrees of freedom by the following formula:

\[ df = \frac{U^2}{V^2 / (n_a - 1) + W^2 / (n_b - 1)} \]
\[ V = s_a^2 / n_a, \quad W = s_b^2 / n_b, \quad U = V + W \]

Two sample t-test in R
$ cd /home/ling115/r
$ R

Two sample t-test in R – (2)

Paired t-test
• Some data make more sense when paired
– F1 of a set of vowels from males and females
– Difference in frequency of the same set of words between two corpora
– Pairing lets us control for the variation due to the factor by which observations are paired (e.g. what the vowel is)
• Calculate the difference in score for each pair
• Run a one sample t-test to see if the mean difference is different from zero

Paired t-test in R

Parametric vs. Nonparametric
• Parametric test (of statistical significance)
– Assumes normal distribution
– Data are measured on an interval scale
– Makes use of parameters such as mean, variance
• Nonparametric test
– Does not assume normal distribution
– Knowledge of parameters is not necessary
– e.g. Wilcoxon test instead of t-test
• Shapiro-Wilk test for normality

ANOVA
• Data points are grouped by a factor with more than two levels
– F1 of a set of vowels produced by speakers from five different dialect groups
– Difference in frequency of the same set of words among ten corpora
• The goal of Analysis of Variance is to check if the differences among the means of different groups are greater than the differences among the observations in the data set generally

Variance among groups
• Calculate the mean for each group
• Calculate the overall mean of the data pooled from all groups
• Calculate the squared deviation of each group mean from the overall mean
• Multiply the squared deviation by the number of data points in each group so that the amount reflects the size of each group
• Add up the values (SS_group)
• Normalize SS_group by the degrees of freedom
– df = number of groups minus one

Variance within the entire data
• On first thought, this may be the sample variance
• But this variance includes variance due to group difference
• So we want to get the variance of the whole with the variance due to group difference removed

Variance within the entire data – (2)
• Sum of squares of error (SS_error)
– Method 1
  • Calculate the sum of squared deviations over all data points (SS_total)
  • Subtract the sum of squared deviations due to group difference (SS_group), i.e. SS_error = SS_total - SS_group
– Method 2
  • Calculate the sum of squared deviations within each group
  • Add up the SS
• Normalize SS_error by the degrees of freedom
– df = number of data points minus the number of groups

ANOVA and F-ratio
• If variance among groups is different from variance within the entire data, we assume the group means are different
• F = VAR_group / VAR_error
– F = 1 if the two variances are exactly the same
– The farther F is from 1, the less likely it is that the two variances are the same

F-distribution
• Probability distribution of the ratio of two variances
– Note that each variance will have its own degrees of freedom
• F = 1 if the two variances are the same
• The farther F is from 1, the less likely it is that the two variances are the same
• F-distribution is sensitive to whether the population distribution is normal

F-distribution graph (from Wikimedia)

Comparison of variance in R
• var.test(x,y)
• We can also calculate the F-ratio by var(x)/var(y)

Two factors
• The ANOVA we discussed so far assumed there is a single factor which divides the data points into multiple groups
• There may be more than one factor
– Number of adjectives in each sentence
– Length of each sentence

Interaction
• With two factors, assuming they are meaningful, there are three ways the factors can affect the value of an observation
– Factor 1
– Factor 2
– Interaction of factor 1 and factor 2

Repeated measures
• The ANOVA discussed so far assumes the data points are independent of each other
• As in the case of the paired t-test, some data make more sense when observations are matched
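
The one-way ANOVA steps listed in the slides above (SS_group, SS_error by both methods, and the F-ratio) can be sketched in R roughly as follows. This is a minimal illustration, not the course's own script: the F1 values, the three group labels, and all variable names are made up for the example. Only base R functions (tapply, pf, aov) are used, and the hand-computed F is compared against what aov() reports.

```r
# Hypothetical data: F1 values (Hz) for three made-up dialect groups.
f1 <- c(310, 325, 298, 340, 352, 336, 361, 290, 305, 287, 300)
group <- factor(c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"))

# Variance among groups: size-weighted squared deviations of the
# group means from the overall mean, normalized by (groups - 1).
group_means <- tapply(f1, group, mean)
group_sizes <- tapply(f1, group, length)
grand_mean <- mean(f1)
ss_group <- sum(group_sizes * (group_means - grand_mean)^2)
df_group <- nlevels(group) - 1
var_group <- ss_group / df_group

# SS_error, method 1: SS_total minus SS_group.
ss_total <- sum((f1 - grand_mean)^2)
ss_error_1 <- ss_total - ss_group

# SS_error, method 2: add up the within-group sums of squares.
ss_error_2 <- sum(tapply(f1, group, function(v) sum((v - mean(v))^2)))

# Normalize by (data points - groups).
df_error <- length(f1) - nlevels(group)
var_error <- ss_error_2 / df_error

# F-ratio and its upper-tail probability under the F-distribution.
f_ratio <- var_group / var_error
p_value <- pf(f_ratio, df_group, df_error, lower.tail = FALSE)

# Cross-check against R's built-in one-way ANOVA table.
fit <- summary(aov(f1 ~ group))[[1]]
print(c(by_hand = f_ratio, aov = fit[["F value"]][1]))
print(c(by_hand = p_value, aov = fit[["Pr(>F)"]][1]))
```

The two methods for SS_error should agree up to floating-point rounding, and the hand-computed F and p-value should match the first row of the aov() table. In practice one would call aov() directly; the step-by-step version is only meant to mirror the slides.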