COLLOCATIONS II: HYPOTHESIS TESTING

John Fry, Boise State University
Linguistics 497: Corpus Linguistics, Spring 2011, Boise State University

Branches of statistics
• Descriptive statistics is for organizing and summarizing data
  – Histograms, mean, median, standard deviation
• Inferential statistics is used to draw conclusions about a population based on a sample (subset) of that population
  – Point estimation, hypothesis testing, confidence intervals
• Probability theory is the basis of, and in some sense the "inverse" of, inferential statistics
  – Probability reasons from the population to a sample
  – Inferential statistics reasons from the sample to the population

Sample mean x̄
• The sample mean x̄ of a set of numbers x1, x2, . . . , xn is

      x̄ = (x1 + x2 + · · · + xn) / n = (1/n) Σ(i=1..n) xi

• Examples in R:

  > mean(c(2, 3, 4, 5, 3, 2, 5))
  [1] 3.428571
  > mean(c(2.1, 0.4, 4.2, 3.6, 4.2))
  [1] 2.9

Sample variance s²
• The sample variance of a set of numbers x1, x2, . . . , xn is

      s² = Σ(i=1..n) (xi − x̄)² / (n − 1)

• The sample standard deviation s is the positive square root of the sample variance: s = √s²

  > var(c(2, 3, 4, 5, 3, 2, 5))
  [1] 1.619048
  > sd(c(2, 3, 4, 5, 3, 2, 5))
  [1] 1.272418
  > sqrt(var(c(2, 3, 4, 5, 3, 2, 5)))
  [1] 1.272418

Standard deviations in the Normal distribution
  (figure)

Mean value µ
• The mean value of a random variable X, denoted µX or just µ, is the same as the expectation E(X):

      µX = E(X) = Σ(x∈X) x p(x)

• Example: the mean value µ of a single throw of a die is

      µ = Σ(x=1..6) x p(x) = (1/6) Σ(x=1..6) x = 21/6 = 3.5

Statistical hypothesis testing
• Informal examples of statistical
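The R values above can be cross-checked with Python's standard library; the `statistics` module also uses the n − 1 (sample) denominator for variance, so the numbers come out the same. A minimal sketch:

```python
import statistics

data = [2, 3, 4, 5, 3, 2, 5]

# Sample mean: x̄ = (x1 + ... + xn) / n
mean = statistics.mean(data)        # matches R's mean(): 3.428571...

# Sample variance with the n - 1 denominator, matching R's var()
var = statistics.variance(data)     # 1.619048...

# Sample standard deviation: positive square root of the variance
sd = statistics.stdev(data)         # matches R's sd(): 1.272418...

print(round(mean, 6), round(var, 6), round(sd, 6))  # → 3.428571 1.619048 1.272418

# Mean value (expectation) of one die throw: Σ x·p(x) with p(x) = 1/6
die_mean = sum(range(1, 7)) / 6
print(die_mean)                     # → 3.5
```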
hypotheses:
  – that a given die or coin is unfair (loaded)
  – that a new drug is more effective than a placebo
• Formally, hypotheses are claims about the characteristics of the sampled population(s) (e.g., that p = 0.5 or µ1 = µ2)
• A hypothesis test is usually framed in terms of two competing hypotheses, the null hypothesis and the alternative hypothesis
• The null hypothesis is the default, 'status quo' hypothesis (e.g., 'innocent until proven guilty')
• We reject the null hypothesis only in the face of strong evidence for the alternative

Examples of null hypotheses
• Deciding whether a coin is fair or loaded
  – Null hypothesis: p = 0.5
  – Alternative hypothesis: p ≠ 0.5
• Deciding whether a treatment is better than a placebo
  – Null hypothesis: no difference between the mean response to the treatment and to the placebo: µ1 = µ2
  – Alternative hypothesis: the mean response to the treatment is better than to the placebo: µ1 > µ2

Steps in statistical hypothesis testing
1. Collect the data
2. Determine a null and alternative hypothesis
3. Calculate a test statistic for that null hypothesis (t, χ², etc.)
4.
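The fair-coin example can be walked through end to end with an exact binomial test. This is a sketch, not part of the original slides: the data (60 and 70 heads in 100 flips) and the significance level α = 0.05 are assumed for illustration.

```python
from math import comb

def binomial_two_sided_p(heads: int, flips: int) -> float:
    """Exact two-sided p-value for the null hypothesis H0: p = 0.5 (fair coin)."""
    # Probability of exactly k heads under H0
    prob = lambda k: comb(flips, k) / 2 ** flips
    p_observed = prob(heads)
    # Sum the probabilities of all outcomes at least as extreme as the observed one
    return sum(prob(k) for k in range(flips + 1) if prob(k) <= p_observed)

alpha = 0.05  # assumed significance level
print(binomial_two_sided_p(60, 100))  # ≈ 0.057: not extreme enough to reject H0
print(binomial_two_sided_p(70, 100))  # far below alpha: reject H0, coin looks loaded
```

This mirrors the four steps above: collect the flips, state H0 (p = 0.5) and H1 (p ≠ 0.5), compute a statistic (here, the exact tail probability), and reject H0 only if the result is sufficiently extreme.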
Determine how surprising this test statistic would be if the null hypothesis were true:
  (a) If the statistic is sufficiently extreme relative to the null hypothesis, then we reject the null hypothesis
  (b) If the statistic is not sufficiently extreme relative to the null hypothesis, then we do not reject the null hypothesis

Hypothesis testing of collocations
• Hypothesis testing can be used to identify collocations
• If words x and y are independent, then P(xy) = P(x)P(y)
• In a true collocation, words x and y are highly dependent, so we expect P(xy) ≫ P(x)P(y)
• We can frame this question as a hypothesis test
  – Null hypothesis: P(xy) = P(x)P(y)
  – Alternative: P(xy) > P(x)P(y)
• We reject the null hypothesis only in the face of strong evidence for the alternative

The one-sample t test
• One common statistic for hypothesis testing is the t statistic

      t = (x̄ − µ) / √(s²/N)

• The t test looks at the mean x̄ and variance s² of a sample of size N
• The null hypothesis is that the sample is drawn from a population with mean µ (that is, we expect x̄ ≈ µ)
• If t is high enough, we can reject the null hypothesis and conclude that the sample is not drawn from that population

Finding collocations with the t test
• The t statistic can be used to test whether a bigram xy is a collocation, i.e., whether P(xy) ≫ P(x)P(y)
• For this task we set x̄ = P(xy) and µ = P(x)P(y), and then approximate s² = P(xy)
• If t is large enough, we can reject the null hypothesis that x and y are independent and accept xy as a collocation
• Specifically, statistical tables show that when t > 2.576 we can reject the null hypothesis with 99.5% confidence (p = 0.005)
• The
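The one-sample t statistic is simple enough to compute directly. The sketch below reuses the sample from the R examples earlier; the population mean µ = 3 is an assumed value chosen for illustration.

```python
import statistics
from math import sqrt

def t_statistic(sample, mu):
    """One-sample t statistic: t = (x̄ - µ) / sqrt(s²/N)."""
    x_bar = statistics.mean(sample)
    s2 = statistics.variance(sample)   # sample variance (n - 1 denominator)
    n = len(sample)
    return (x_bar - mu) / sqrt(s2 / n)

# Sample from the R examples; µ = 3 is an assumed population mean
t = t_statistic([2, 3, 4, 5, 3, 2, 5], mu=3.0)
print(round(t, 3))   # → 0.891: far below 2.576, so we do not reject H0
```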
Church et al. (1991) approximation of t is

      t ≈ (c(xy) − (1/N)·c(x)·c(y)) / √c(xy)

• Unfortunately, accepting every bigram whose t exceeds the critical value forces us to accept virtually all bigrams xy as collocations, since it is almost always true that P(xy) > P(x)P(y)
• For this reason, the t score is typically used to rank or compare collocations, rather than to accept or reject them based on a critical value

Bible bigrams ranked by t score

  T        C(x)   C(y)   C(xy)  bigram
  81.3465  34671  64023  11543  of the
  76.1978  64023   7964   7035  the lord
  56.4903  12667  64023   5031  in the
  47.8521   9838   7013   2461  shall be
  42.8622   8854   3836   1922  i will
  39.9544  51696  10419   2791  and he
  39.4890   3999   8997   1649  said unto
  38.4577  34671   2575   1697  of israel
  37.4083   2392  34671   1602  son of
  36.4899   5620  64023   2144  all the
  35.6710  64023   2542   1658  the king
  35.6205   2775  34671   1502  out of
  35.1866   1821  34671   1393  children of
  35.1293  51696   7376   2086  and they
  35.0394   5474   1616   1250  thou shalt

  These are not collocations!

Bible bigrams ranked by t score where c = 10

  T        C(x)  C(y)  C(xy)  bigram
  3.16197    19    40     10  committeth adultery
  3.16196    66    12     10  golden spoon
  3.16191    29    32     10  solemn feasts
  3.16182    10   115     10  dearly beloved
  3.16158    31    56     10  due season
  3.16117    39    71     10  molten images
  3.16108   300    10     10  young pigeons
  3.16096    21   157     10  fir trees
  3.16015    65    82     10  deep sleep
  3.15979   520    12     10  thousand footmen
  3.15920   151    51     10  new moon
  3.15918   287    27     10  much less
  3.15871   194    46     10  unclean spirits
  3.15772    33   346     10  looketh toward
  3.15752   202    59     10  six months
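The Church et al. approximation is easy to sketch in Python. The slides never state the corpus size N, so N = 790,000 tokens (roughly the size of the KJV text) is an assumption here; with it the scores land close to, though not exactly on, the tabulated values.

```python
from math import sqrt

def t_score(c_x, c_y, c_xy, n):
    """Church et al. (1991) approximation: t ≈ (c(xy) - c(x)c(y)/N) / sqrt(c(xy))."""
    return (c_xy - c_x * c_y / n) / sqrt(c_xy)

N = 790_000  # assumed KJV token count; the slides do not state N

# "of the": hugely frequent, so t is enormous and the bigram is "accepted"
print(t_score(34671, 64023, 11543, N))   # ≈ 81, far above the 2.576 cutoff

# "committeth adultery" with c(xy) = 10: the expected count c(x)c(y)/N is
# nearly zero, so t ≈ sqrt(c(xy)) = sqrt(10) -- the score is driven almost
# entirely by the bigram's raw frequency
print(round(t_score(19, 40, 10, N), 2))  # → 3.16
```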
The problem: the t score favors frequent bigrams. When the probabilities involved are tiny, t ≈ √c(xy), so the score is essentially a function of raw bigram frequency: every bigram with c(xy) = 10 scores t ≈ √10 ≈ 3.16, and every bigram with c(xy) = 20 scores t ≈ √20 ≈ 4.47.

Bible bigrams ranked by t score where c = 20

  T        C(x)  C(y)  C(xy)  bigram
  4.47147   113    21     20  fine twined
  4.47103    55    71     20  graven images
  4.47086    37   122     20  inner court
  4.47054    28   202     20  anointing oil
  4.46943    41   234     20  fierce anger
  4.46874   501    24     20  cast lots
  4.46873    33   366     20  continual burnt
  4.46862    76   164     20  simon peter
  4.46852   328    39     20  four corners
  4.46734    38   447     20  innocent blood
  4.46668   157   123     20  east wind
  4.46638   283    72     20  turn aside
  4.46103   283   139     20  chief captain
  4.46053    78   527     20  bowed himself
  4.45666  1405    39     20  made manifest

Strong bigrams ranked by t score

  T         C(x)     C(y)     C(xy)  bigram
  106.2800  564688    561524  11562  strong enough
   97.1442  564688    835693   9832  strong support
   85.5205  564688    350085   7480  strong demand
   85.3466  564688    583264   7560  strong growth
   81.7163  564688    728249   7021  strong dollar
   81.6797  564688     76401   6708  strong winds
   77.4850  564688    426675   6206  strong buy
   77.1576  564688    592486   6233  strong opposition
   71.7290  564688    727631   5487  strong sales
   71.2304  564688    179242   5159  strong showing
   70.5829  564688    291093   5120  strong performance
   70.4699  564688     81809   5005  strong earthquake
   65.9995  564688   1133059   4882  strong economic
   62.5705  564688    331883   4072  strong earnings
   61.1211  564688    756389   4089  strong economy

Powerful bigrams ranked by t score

  T        C(x)     C(y)      C(xy)  bigram
  56.8366  196842     81809    3244  powerful earthquake
  51.1367  196842    289928    2663  powerful bomb
  47.4809  196842   3541187    2813  powerful than
  45.8005  196842   1388113    2323  powerful military
  43.5410  196842    145743    1920  powerful explosion
  41.9370  196842    561524    1851  powerful enough
  38.4901  196842   1147361    1667  powerful political
  37.5722  196842    875579    1554  powerful man
  36.1926  196842    774963    1436  powerful force
  34.8137  196842   6497319    2158  powerful new
  34.3074  196842    538219    1265  powerful shot
  33.2658  196842    141226    1130  powerful blast
  32.6545  196842    474899    1144  powerful lower
  32.4674  196842  51439655    6997  powerful and
  30.5478  196842    533268    1020  powerful car

Likelihood ratios
• Another approach to hypothesis testing is the likelihood ratio (Dunning 1993)
• A likelihood ratio is a number that tells us how much more likely one hypothesis H1 is than the alternative hypothesis H2
• Compared to the t and χ² statistics, likelihood ratios are:
  – more interpretable
  – more robust under sparse data
  – more difficult to compute (the math is hairier)

Finding collocations with likelihood ratios
• We want to determine whether bigram xy is a collocation, or whether x and y are independent of each other
• Hypothesis H1 (independence): P(y|x) = p = P(y|¬x)
• Hypothesis H2 (collocation): P(y|x) = p1 ≠ p2 = P(y|¬x)
• We can use MLE to estimate p, p1, and p2:

      p = c(y)/N      p1 = c(xy)/c(x)      p2 = (c(y) − c(xy)) / (N − c(x))

• Letting x = c(x), y = c(y), and xy = c(xy), we compute

      log [L(H1)/L(H2)] = log p^xy (1−p)^(x−xy) + log p^(y−xy) (1−p)^((N−x)−(y−xy))
                        − log p1^xy (1−p1)^(x−xy) − log p2^(y−xy) (1−p2)^((N−x)−(y−xy))

Powerful collocations ranked by likelihood ratio

  −2 log [L(H1)/L(H2)]  c(x)   c(y)  c(xy)  x             y
  1291.42               12593   932    150  most          powerful
    99.31                 379   932     10  politically   powerful
    82.96                 932   934     10  powerful      computers
    80.39                 932  3424     13  powerful      force
    57.27                 932   291      6  powerful      symbol
    51.66                 932    40      4  powerful      lobbies
    51.52                 171   932      5  economically  powerful
    51.05                 932    43      4  powerful      magnet

  Source: Manning & Schütze (1999), p. 163

Collocations: summary
• Collocations, idioms, and other MWEs are groups of words that co-occur more often than chance, exhibit limited semantic compositionality, and vary from language to language
• Collocations are useful for lexicography, language learning, and applications like the 'key phrases' feature at amazon.com
• Methods for automatically extracting collocations:
  1. Frequency counts with POS filters
  2. Relative frequency ratios
  3. Mutual information (favors rare collocations)
  4. t scores (favors frequent collocations)
  5. Likelihood ratios
  6. χ² test (next time!)
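The log likelihood ratio formula above can be sketched directly. The slides do not state the corpus size; N = 14,307,668 (the New York Times corpus size used in Manning & Schütze's chapter 5) is an assumption here, and the counts come from the "most powerful" and "powerful computers" rows of the table.

```python
from math import log

def log_likelihood_ratio(c_x, c_y, c_xy, n):
    """log(L(H1)/L(H2)) for the independence vs. collocation hypotheses."""
    # MLE estimates, as on the slide
    p = c_y / n                       # H1: P(y|x) = P(y|not-x) = p
    p1 = c_xy / c_x                   # H2: P(y|x) = p1
    p2 = (c_y - c_xy) / (n - c_x)     # H2: P(y|not-x) = p2

    def log_binom(k, m, prob):
        """log of prob^k * (1 - prob)^(m - k), the binomial kernel."""
        return k * log(prob) + (m - k) * log(1 - prob)

    return (log_binom(c_xy, c_x, p) + log_binom(c_y - c_xy, n - c_x, p)
            - log_binom(c_xy, c_x, p1) - log_binom(c_y - c_xy, n - c_x, p2))

N = 14_307_668  # assumed corpus size; not stated on the slides

most_powerful = -2 * log_likelihood_ratio(12593, 932, 150, N)
powerful_computers = -2 * log_likelihood_ratio(932, 934, 10, N)
print(most_powerful, powerful_computers)  # ≈ 1291.42 and ≈ 82.96, matching the table
```

Working in log space avoids underflow: the raw likelihoods involve factors like (1 − p) raised to powers of millions, which would round to zero as ordinary floats.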