Lecture 5: Collocations (Chapter 5 of Manning and Schütze)
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering, National Cheng Kung University
2014/03/10
(Slides from Dr. Mary P. Harper, http://min.ecn.purdue.edu/~ee669/)

What is a Word?
• A word form is a particular configuration of letters. Each individual occurrence of a word form is called a token.
– E.g., 'water' has several related word forms: water, waters, watered, watering, watery.
– This set of word forms is called a lemma.

Definition of Collocation
• A collocation is
– a sequence of two or more consecutive words
– that has the characteristics of a syntactic and semantic unit
– whose exact meaning or connotation cannot be derived directly from the meaning or connotation of its components [Choueka, 1988]

Word Collocations
• The ways 'water' can be used are subtly linked to very specific situations, and to other words. An example:
– His mouth watered.
– His eyes watered.
• But this pattern does not extend uniformly to the form watering:
– The roast was mouth-watering.
– *The [smoky] nightclub was eye-watering.

Word Collocations
• Collocations show
– non-compositionality of meaning
• the meaning cannot be derived directly from the parts (heavy rain)
– non-substitutability in context
• near-synonyms cannot be substituted for the parts (red light)
– non-modifiability (and non-transformability)
• *kick the yellow bucket; *take exceptions to

Association and Co-occurrence: Terms
• There is considerable overlap between the concepts of collocations and terms. (Terms in IR refer to both words and phrases.)
• Terms appear together, or in the same (or similar) context:
– (doctors, nurses)
– (hardware, software)
– (gas, fuel)
– (hammer, nail)
– (communism, free speech)
• Collocations sometimes reflect attitudes
– e.g., strong cigarettes, tea, coffee versus powerful drug (e.g., heroin)

Linguistic Subclasses of Collocations
• Light verbs: verbs with little semantic content, like make, take, do
• Terminological expressions: concepts and objects in technical domains (e.g., hard drive)
• Idioms: fixed phrases
– kick the bucket, birds-of-a-feather, run for office
• Proper names: difficult to recognize even with lists
– Tuesday (person's name), May, Winston Churchill, IBM, Inc.
• Numerical expressions containing "ordinary" words
– Monday Oct 04 1999, two thousand seven hundred fifty
• Verb particle constructions or phrasal verbs, with separable parts:
– look up, take off, tell off

Collocations
• Collocations are not necessarily adjacent.
• Collocations cannot be directly translated into other languages.
• It may be better to use the term collocation in the narrower sense of grammatically bound elements that occur in a particular order, and use association or co-occurrence for words that appear together in context.

Collocation Detection Techniques
• Select a span within which words are considered to co-occur.
– Significant relationships generally fall within a span of plus or minus four words (a sketch of span-based pair extraction follows this list).
• Selection methods for collocations:
– Frequency
– Mean and Variance
– Hypothesis Testing
• t test
• Chi-square test
• Likelihood ratio
– Pointwise Mutual Information
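As a rough illustration of span-based extraction (not code from the textbook), the Python sketch below collects word pairs that co-occur within a window of four tokens, recording the offset so that the mean-and-variance method later in the lecture can be applied. The function name and the toy sentence are illustrative assumptions.

```python
from collections import Counter

def span_pairs(tokens, span=4):
    # Count co-occurring word pairs together with their offset, looking at most
    # `span` tokens ahead of each position (offsets therefore run from 1 to span).
    pairs = Counter()
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + span + 1, len(tokens))):
            pairs[(w1, tokens[j], j - i)] += 1
    return pairs

# Toy usage on a single lower-cased sentence:
print(span_pairs("she knocked on his door".split(), span=4))
```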
Using Frequency to Hunt for Collocations
• The most frequent n-grams are not always collocations; many involve function words or common names.
• Simple heuristic methods help to improve the collocation yield of the n-grams:
– Use knowledge of stop words, i.e., words/forms that cannot alone make up a collocation
• a, the, and, or, but, not, …
– Use part-of-speech patterns to filter the n-grams (Justeson and Katz, 1995)
• Adj Noun (cold feet)
• Noun Noun (oil prices)
• Noun Preposition Noun (out of sight)

Mean and Variance (Smadja et al., 1993)
• Frequency-based search works well for fixed phrases. However, many collocations consist of two words in more flexible relationships. For example,
– knock and door may not occur at a fixed distance from each other.
• One method of detecting these flexible relationships uses the mean and variance of the offset between the two words in the corpus.
• If the offsets are randomly distributed (i.e., no collocation), then the variance will be high.

Mean, Sample Variance, and Standard Deviation
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad s = \sqrt{s^2}$$

Example: Knock and Door
1. She knocked on his door.
2. They knocked at the door.
3. 100 women knocked on the big red door.
4. A man knocked on the metal front door.
• Average offset between knock and door: $(3 + 3 + 5 + 5)/4 = 4$
• Sample variance: $((3-4)^2 + (3-4)^2 + (5-4)^2 + (5-4)^2)/(4-1) = 4/3$
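These statistics are easy to compute directly. The sketch below (illustrative only, not from the textbook) reproduces the knock/door numbers above: mean 4 and sample variance 4/3.

```python
import math

def offset_stats(offsets):
    # Sample mean, sample variance (n - 1 denominator), and standard deviation.
    n = len(offsets)
    mean = sum(offsets) / n
    var = sum((d - mean) ** 2 for d in offsets) / (n - 1)
    return mean, var, math.sqrt(var)

# Offsets of "door" relative to "knocked" in the four example sentences above.
print(offset_stats([3, 3, 5, 5]))   # (4.0, 1.333..., 1.154...)
```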
Hypothesis Testing: Overview
• We want to determine whether a co-occurrence is random or whether it occurs more often than chance. This is a classical problem in statistics called hypothesis testing.
• We formulate a null hypothesis H0 (the association occurs by chance, i.e., there is no association between the words). Assuming H0, we calculate the probability that the observed co-occurrence would arise if H0 were true. If that probability is very low (e.g., p < 0.05), suggesting that something "interesting" is happening, we reject H0; otherwise we retain it as possible.
• In this setting, we assume that two words do not form a collocation if they occur independently.

Hypothesis Testing: The t test
• The t test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean $\mu$.
• The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely it is to obtain a sample with that mean and variance assuming the sample is drawn from a normal distribution with mean $\mu$.

The Student's t test
• To determine the probability of getting a certain sample, we compute the t statistic, where $\bar{X}$ is the sample mean and $s^2$ is the sample variance, and look up its significance with respect to the normal distribution:
$$t = \frac{\bar{X} - \mu}{\sqrt{s^2/N}} \qquad\qquad z = \frac{\bar{X} - \mu}{\sigma/\sqrt{N}}$$
– The z statistic assumes a normal distribution with known $\sigma$; the t statistic is used when $\sigma$ is unknown and is estimated by $s$, and for large samples (N ≥ 30) it is compared against the normal distribution.

The t test
• Significance of the difference:
– Compare with the normal distribution (mean $\mu$).
– Using real-world data, compute t.
– Look up the result in tables (see Manning and Schütze, p. 609):
• d.f. = degrees of freedom (parameters that are not determined by other parameters; related to the sample size)
• percentile level p = 0.05 (or lower)
– The bigger the t statistic:
• the better the chance that it is an interesting combination (i.e., we can reject the null hypothesis of no association)
• t must exceed the critical value at the chosen significance level from the t table.

The t test on Collocations
• Null hypothesis: independence, with mean $\mu = p(w_1)\,p(w_2)$.
• Data estimates:
– $\bar{x}$ = MLE of the joint (bigram) probability from the data
– $s^2 = p(1-p)$, i.e., almost p for small p; N is the data size
• Example: compute the t value for new companies (a code sketch follows below)
– C(new) = 15,828; C(companies) = 4,675; N = 14,307,668
– H0: $p(\text{new companies}) = \frac{15{,}828}{14{,}307{,}668} \times \frac{4{,}675}{14{,}307{,}668} = 3.615 \times 10^{-7}$
– MLE: $p(\text{new companies}) = 8/14{,}307{,}668 = 5.591 \times 10^{-7}$, and $s^2 = p(1-p) \approx p \approx 5.591 \times 10^{-7}$
– $t = \dfrac{5.591 \times 10^{-7} - 3.615 \times 10^{-7}}{\sqrt{5.591 \times 10^{-7} / 14{,}307{,}668}} \approx 0.999932$
– For $\alpha = 0.05$ we would need a t value of 1.645, so the null hypothesis is not rejected.

Hypothesis Testing of Differences (Church & Hanks, 1989)
• We may also want to find words whose co-occurrence patterns best distinguish between two words (e.g., strong versus powerful). This application can be useful in lexicography.
• The t test is extended to the comparison of two means, under the assumption that they are normally distributed.
• The null hypothesis is that the average difference is 0.

The t test for Comparing Two Populations
• This t test compares the means of two normal populations. The variances of the two populations are added, since the variance of the difference of two random variables is the sum of their variances:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

Collocation Testing
• t values are calculated assuming a Bernoulli distribution: w is the collocate of interest, $v_1$ and $v_2$ are the words to compare, and we assume $s^2 \approx p$:
$$t = \frac{P(v_1 w) - P(v_2 w)}{\sqrt{\dfrac{P(v_1 w) + P(v_2 w)}{N}}}$$
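Returning to the one-sample test on collocations above, the following minimal sketch (the function name and signature are illustrative assumptions) computes the t statistic for a bigram from raw counts and reproduces the new companies example.

```python
import math

def bigram_t(c1, c2, c12, n):
    # One-sample t statistic for the bigram w1 w2 under the independence null:
    # mu = P(w1) P(w2), x_bar = MLE bigram probability, and s^2 ~ p for small p.
    mu = (c1 / n) * (c2 / n)
    x_bar = c12 / n
    return (x_bar - mu) / math.sqrt(x_bar / n)

# Counts from the "new companies" example: t ~ 0.9999 < 1.645, so H0 is retained.
print(bigram_t(15828, 4675, 8, 14307668))
```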
Pearson's Chi-Square Test
• Use of the t test has been criticized by Church and Mercer (1993) because it assumes that probabilities are approximately normally distributed (which is generally not true).
• The chi-square test does not make this assumption.
• The essence of the test is to compare observed frequencies with the frequencies expected under independence. If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence.
• General formula:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
– where $O_{ij}$ and $E_{ij}$ are the observed and expected counts of events i, j.

Pearson's Chi-Square Test
• Example of a two-outcome event (counts for the bigram eat stone):

                 w2 = stone    w2 ≠ stone
  w1 = eat            9           1,770
  w1 ≠ eat           75         219,243

Pearson's Chi-Square Test
• The expected frequencies are computed from the marginal probabilities. Under independence, $P(w_1 w_2) = P(w_1)P(w_2) = E_{11}/N$, so $E_{11} = P(w_1)P(w_2)N$:
$$E_{11} = \frac{O_{11} + O_{12}}{N} \cdot \frac{O_{11} + O_{21}}{N} \cdot N$$
– where N is the number of bigrams and the observed counts form the 2×2 table

                 w2      ¬w2
  w1            O11      O12
  ¬w1           O21      O22

• For a 2×2 table the statistic reduces to
$$\chi^2 = \frac{N\,(O_{11}O_{22} - O_{12}O_{21})^2}{(O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})}$$
• For the eat/stone table above:
$$\chi^2 = \frac{221{,}097\,(219{,}243 \cdot 9 - 75 \cdot 1{,}770)^2}{1{,}779 \cdot 84 \cdot 221{,}013 \cdot 219{,}318} = 103.39 > 7.88$$
– 7.88 is the critical value at the 0.005 level (one degree of freedom), so we can reject the independence assumption.

Pearson's Chi-Square: Applications
• One of the early uses of the chi-square test in statistical NLP was the identification of translation pairs in aligned corpora (Church & Gale, 1991).
• A more recent application is the use of chi-square as a metric for corpus similarity (Kilgarriff and Rose, 1998).
• Note that the chi-square test should not be used for small counts.

Likelihood Ratios Within a Single Corpus (Dunning, 1993)
• Likelihood ratios are more appropriate for sparse data than the chi-square test.
• They are easier to interpret than the chi-square statistic.
• In applying the likelihood ratio test to collocation discovery, we consider two alternative explanations for the occurrence frequency of a bigram w1 w2:
– H1: The occurrence of w2 is independent of the previous occurrence of w1: $P(w_2 \mid w_1) = P(w_2 \mid \neg w_1) = p$
– H2: The occurrence of w2 is dependent on the previous occurrence of w1: $p_1 = P(w_2 \mid w_1) \neq P(w_2 \mid \neg w_1) = p_2$

Likelihood Ratios Within a Single Corpus
• Binomial distribution:
$$b(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}$$
• Use the MLEs for the probabilities p, $p_1$, and $p_2$ and assume a binomial distribution:
– Under H1: $P(w_2 \mid w_1) = P(w_2 \mid \neg w_1) = p = c_2/N$
– Under H2: $P(w_2 \mid w_1) = c_{12}/c_1 = p_1$ and $P(w_2 \mid \neg w_1) = (c_2 - c_{12})/(N - c_1) = p_2$
• Counts ($c_1$, $c_2$ are the counts of $w_1$, $w_2$; $c_{12}$ is the bigram count; N is the number of bigrams):

                 w2          Total
  w1            c12           c1
  ¬w1        c2 − c12       N − c1
  Total         c2

– Under H1: $b(c_{12}; c_1, p)$ is the probability that $c_{12}$ of the $c_1$ bigrams beginning with w1 end with w2, and $b(c_2 - c_{12}; N - c_1, p)$ is the probability that $c_2 - c_{12}$ of the $N - c_1$ bigrams not beginning with w1 end with w2.
– Under H2: the same two terms are computed with $p_1$ and $p_2$ respectively: $b(c_{12}; c_1, p_1)$ and $b(c_2 - c_{12}; N - c_1, p_2)$.

Likelihood Ratios Within a Single Corpus
• The likelihood of H1 (independence):
– $L(H_1) = b(c_{12}; c_1, p)\, b(c_2 - c_{12}; N - c_1, p)$
• The likelihood of H2 (dependence):
– $L(H_2) = b(c_{12}; c_1, p_1)\, b(c_2 - c_{12}; N - c_1, p_2)$
• The log likelihood ratio:
– $\log \lambda = \log \dfrac{L(H_1)}{L(H_2)} = \log b(c_{12}; c_1, p) + \log b(c_2 - c_{12}; N - c_1, p) - \log b(c_{12}; c_1, p_1) - \log b(c_2 - c_{12}; N - c_1, p_2)$
• The quantity $-2 \log \lambda$ is asymptotically $\chi^2$ distributed, so we can test for significance.

Pointwise Mutual Information
• An information-theoretic measure for discovering collocations is pointwise mutual information (Church et al., 1989, 1991).
• This is NOT mutual information as defined in information theory (MI is defined between random variables, not between values of random variables).
• $I'(a, b) = \log_2 \dfrac{P(a, b)}{P(a)P(b)} = \log_2 \dfrac{P(a \mid b)}{P(a)}$
– Example: $I'(\text{eat}, \text{stone}) = \log_2 \dfrac{4.1 \times 10^{-5}}{3.8 \times 10^{-4} \cdot 8.0 \times 10^{-3}} = 3.74$
• Pointwise mutual information is roughly a measure of how much one word tells us about the other.
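A minimal sketch of the pointwise-MI computation (the helper is illustrative, not from the textbook), using the eat/stone probabilities quoted above:

```python
import math

def pmi(p_ab, p_a, p_b):
    # Pointwise mutual information I'(a, b) in bits.
    return math.log2(p_ab / (p_a * p_b))

# Probabilities from the eat/stone example on this slide.
print(pmi(4.1e-5, 3.8e-4, 8.0e-3))   # roughly 3.7 bits
```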
Pointwise Mutual Information
• Pointwise mutual information works particularly badly in sparse environments (it favors low-frequency events).
• It may not be a good measure of what an interesting correspondence between two events is (Church and Gale, 1995).

Homework 3
• Please collect 100 web news articles, and then find 50 useful collocations from the collection using the chi-square test and pointwise mutual information. Also, compare the performance of the two methods based on precision. (A sketch of the two scoring functions is given below.)
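One possible starting point for the homework is sketched below: both measures are computed from corpus counts over adjacent bigrams. The function names, the min_count cutoff, and the use of unigram counts as approximations of the marginal bigram counts are illustrative assumptions, not a prescribed solution.

```python
import math
from collections import Counter

def chi_square_2x2(c12, c1, c2, n):
    # 2x2 chi-square statistic using the shortcut formula from the slides.
    o11 = c12                  # w1 w2
    o12 = c1 - c12             # w1 followed by a word other than w2
    o21 = c2 - c12             # w2 preceded by a word other than w1
    o22 = n - c1 - c2 + c12    # neither
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

def pmi_score(c12, c1, c2, n):
    # Pointwise mutual information estimated from counts (MLE probabilities).
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))

def score_bigrams(tokens, min_count=3):
    # Score every adjacent bigram with both measures; unigram counts are used
    # as an approximation of the marginal bigram counts c1 and c2.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return {bg: (chi_square_2x2(c12, unigrams[bg[0]], unigrams[bg[1]], n),
                 pmi_score(c12, unigrams[bg[0]], unigrams[bg[1]], n))
            for bg, c12 in bigrams.items() if c12 >= min_count}
```

Ranking the returned dictionary by each score separately gives two candidate lists whose top entries can then be judged by hand to estimate precision, as the assignment asks.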