Probability for linguists
John Goldsmith

What is probability theory? It is the quantitative theory of evidence. It derives, historically, from the study of games of chance – gambling – and of the insurance business. To this day, the basic examples of probability are often presented in terms that grow out of gambling: the toss of a die, the draw of a card. Alas.

Knowledge of language? Knowledge of language is the long-term knowledge that speakers bring to bear in the analysis, and the generation, of specific utterances in particular contexts.

Long-term knowledge? Understanding a specific utterance is a task that combines knowledge of the sounds that were uttered and long-term knowledge of the language in question. But before we can get to that…

The basic game plan of all of probability theory is to define a universe (a "sample space", or "universe of possible outcomes") and to define a method of dividing up "probability mass" (whose total weight is 1.0) over all of the sample space. That's it!

Again: for this discrete case, probability theory obliges us to:
1. Establish the universe of outcomes that we are interested in;
2. Establish a means for distributing probability mass over the outcomes, in such a way that the total amount of probability distributed sums to 1.0 – no more, and no less.

The role of a model. In every interesting case, the probability assigned to the basic members of the sample space is calculated according to some model that we devise – and therein lies our contribution. Let's turn to some examples.

Letters. Let our sample space be the 26 letters of the Latin alphabet. To make a probabilistic model, we need an assignment of probability to each letter a…z: pr(a) ≥ 0, and the sum of all of these probabilities is 1:

$$\sum_{i=1}^{26} \mathrm{pr}(l_i) = 1$$

Why? Probabilities are our eventual contact with reality, and with observations. Probabilities are our model's way of making quantitative predictions. We will (eventually) evaluate our model by seeing how good its probabilities are.

Counts, frequencies, and probabilities. Counts are integral – at least they are for now; eventually we drop that assumption. For now, they are the result of empirical measurement. Frequency (relative or proportional) is the ratio of a count of a subset to the count of the whole universe. Probability is a parameter in a probabilistic model.

Distribution. A distribution is a set of non-negative numbers which sum to 1.

Brief digression: some niceties. The sample space Ω need not be finite or discrete, and we actually care about a large set F of subsets of Ω. The sets in F are called events, and we require that F be a σ-algebra (so F is closed under complementation, countable union and intersection). A probability measure P on (Ω, F) is a function P such that P(Ω) = 1, P(∅) = 0, and the probability of a countable union of pairwise disjoint members of F is the sum of their probabilities:

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

Uniform distribution. A uniform distribution is one in which all elements in the sample space have the same probability. If there are 26 letters in our sample space, each letter is assigned probability 1/26 by the uniform distribution.

Uniform distribution (2). If the sample space is rolls of a die, and the distribution is uniform, then each die-roll has probability 1/6. The sample space can be complex – e.g., sequences of 3 rolls of a die. In that case, there are 6³ = 216 members, and each has probability 1/216 if the distribution is uniform.
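To make the "sample space plus distribution" recipe concrete, here is a minimal sketch in Python (the language choice and the variable names are my own illustration, not the author's): it builds the uniform distribution over the 26 letters and checks that it really is a distribution, i.e. non-negative and summing to 1.

```python
import string
from fractions import Fraction

# Sample space: the 26 letters of the Latin alphabet.
letters = list(string.ascii_lowercase)

# Uniform distribution: the same probability mass (1/26) on every outcome.
uniform = {letter: Fraction(1, len(letters)) for letter in letters}

# A distribution is a set of non-negative numbers that sum to 1.
assert all(p >= 0 for p in uniform.values())
assert sum(uniform.values()) == 1

print(uniform["a"])   # 1/26
```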
If the sample space is N rolls of a die, then there are 6^N members, each with probability 1/6^N.

More examples. If the sample space is all sequences of 1, 2 or 3 rolls of a die, then the size of the sample space is 6 + 6² + 6³ = 258, and if the distribution is uniform, each member has probability 1/258.

Words in the sample space. Suppose our sample space has 1,000 elements, and each is associated with one of the 1,000 most frequent words in the Brown corpus; each is also associated with a number (between 1 and 1,000) giving the rank of that word.

Probability distribution. We could associate a uniform distribution with that sample space; or we could assign a probability to each word based on its frequency in a corpus. Let N = the total number of words in the corpus (1,160,322 in the Brown corpus). Then

$$\mathrm{frequency}(word_i) = \frac{\mathrm{count}(word_i)}{1{,}160{,}322}$$

E.g., frequency("the") = 69,681 / 1,160,322 = 0.0600…

Is it clear that the sum of all these frequencies is necessarily 1.0?

$$\frac{\mathrm{count}(word_1)}{\text{Total count}} + \frac{\mathrm{count}(word_2)}{\text{Total count}} + \dots + \frac{\mathrm{count}(word_V)}{\text{Total count}} = \sum_{i=1}^{V} \frac{\mathrm{count}(word_i)}{\text{Total count}} = 1$$

Sequences. When we think about sequences, we see that the probability assigned to each element (= each sequence) is assigned through a model, not directly. In the simplest case, the probability of a sequence is the product of the probabilities of the individual elements.

Random variable (r.v.). A random variable is a function from our probability space (the sample space, endowed with a probability distribution) to the real numbers. Example: our sample space has 1,000 elements, and each is associated with one of the 1,000 most frequent words in the Brown corpus; our r.v. R maps each word to the number (between 1 and 1,000) giving the rank of that word.

Sample space: words. Suppose we replace our usual die with one with 1,000 faces, and each face has a number (1–1,000) and one of the 1,000 most frequent words from the Brown corpus on it (1 = the, 2 = of, etc.). Now we choose numbers at random, for example: 320 990 646 94 756, which translates into: whether designed passed must southern. For the moment, assume a uniform distribution…

We're more interested in the probability our model assigns to sentences that already exist. For example,

(1) In the beginning was the word

Since these words are in the Brown corpus, each has probability 1/1000, and the probability of (1) is

$$\left(\frac{1}{1000}\right)^6 = \frac{1}{10^{18}} = 10^{-18}$$

Any sequence of 6 words in the top 1,000 words from the Brown Corpus will have this probability. Is this good, or bad?

We could increase our vocabulary to the entire 47,885 words of the Brown Corpus. Then sentence (1) would have probability

$$\left(\frac{1}{47{,}885}\right)^6 = (2.1 \times 10^{-5})^6 \approx 8.6 \times 10^{-29}$$

Or: use frequency as our probability. pr(the) = 0.0600, as above; pr(leaders) = 107 / 1,160,322 ≈ 0.000092 (ranked 1,000 in the corpus). Be clear on the difference between frequencies and probabilities.

Word Count Frequency
the 69681 0.060053
of 36372 0.031346
and 28506 0.024567
to 26080 0.022477
a 23170 0.019969
in 21261 0.018323
' 10716 0.009235
that 10704 0.009225
is 10085 0.008692
was 9786 0.008434
he 9729 0.008385

What is the probability of the sentence "The woman arrived"? What is the sample space? Strings of 3 words from the Brown Corpus. The probability of each word is its frequency in the Brown Corpus.
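As a sketch of the frequency-based unigram model just described, the following Python fragment turns counts into frequencies and multiplies word probabilities to score a sequence. The three counts shown are the Brown-corpus figures quoted above; the function names are illustrative assumptions, and a real model would carry an entry for every word in the vocabulary.

```python
# Unigram (frequency-based) model: pr(word) = count(word) / total count.
counts = {"the": 69681, "of": 36372, "and": 28506}   # Brown-corpus counts quoted above
total_count = 1_160_322                              # total word tokens in the Brown corpus

def frequency(word):
    return counts[word] / total_count

def unigram_probability(words):
    # Probability of a sequence = product of the word probabilities.
    p = 1.0
    for w in words:
        p *= frequency(w)
    return p

print(frequency("the"))                      # ~0.0600
print(unigram_probability(["of", "the"]))    # ~0.0313 * 0.0600
```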
pr ("the“) = 0.060 080; pr (“woman”) = 0.000 212 030; pr (“arrived”) = .000 053 482 Some notation S = “the woman arrived” S[1] = “the”, S[2] = “woman”, S[3] = “arrived” pr (S) = pr (S[1] = “the” & S[2] = “woman” & S[3] = “arrived”) Stationary model For all sentences S, all words w and all positions i and j: prob ( S[i] = wn ) = prob ( S[j] = wn ). Is this true? Back to the sentence Pr (“the woman arrived”) = 6.008 020 * 10-2 2.122 030 * 10-5 * 5.348 207 * 10-5 = 6.818 x 10-11 “in the beginning was the word” We calculated that this sentence’s probability was 1 in 1018 in the sample space of all sentences of exactly 6 words, selected from the 1,000 most frequent words in the Brown Corpus. 8.6 * 10-29 in the sample space of all six-word sentences built from the Brown vocabulary. And in frequency-based distribution: 1.84 x 10-14 in 0.018322785 the 0.060080206 beginning 0.000139743 was 0.008439816 the 0.060080206 word 0.000236356 PRODUCT 1.84368E-14 We prefer a model that scores better (by assigning a higher probability) to sentences that actually and already exist – we prefer that model to any other model that assigns a lower probability to the actual corpus. Some win, some lose In order for a model to assign higher probability to actual and existing sentences, it must assign less probability mass to other sentences (since the total amount of probability mass that it has at its disposal to assign totals up to 1.000, and no more). So of course it assigns lower probability to a lot of unobserved strings. Word order? Alas, word order counts for nothing, so far. Probability mass It is sometimes helpful to think of a distribution as a way of sharing an abstract goo called probability mass around all of the members of the universe of basic outcomes (also known as the sample space). Think of there being 1 kilogram of goo, and it is cut up and assigned to the various members of the universe. None can have more than 1.0 kg, and none can have a negative amount, and the total amount must add up to 1.0 kg. And we can modify the model by moving probability mass from one outcome to another if we so choose. Conditional probability Sometimes we want to shift the universe of discussion to a more restricted sub-universe – this is always a case of having additional information, or at least of acting as if we had additional information. Universe = sample space. We look at our universe of outcomes, with its probability mass spread out over the set of outcomes, and we say, let us consider only a sub-universe, and ignore all possibilities outside of that sub-universe. We then must ask: how do we have to change the probabilities inside that sub-universe so as to ensure that the probabilities inside it add up to 1.0 (to make it a distribution)? Answer: Divide probabilities by total amount of probability mass in the sub-universe. How do we get the new information? Peek and see the color of a card. Maybe we know the word will be a noun. Maybe we know what the last word was. Generally: we have to pick an outcome, and we have some case-particular information that bears on the choice. Cards: pr (Queen of Hearts|red) Pr (Queen Hearts) = 1/52; Pr(red card)= ½; Pr (Queen Hearts, given Red) = 1 52 2 1 1 52 26 2 Definition of probability of A, given B: prob( A and B) prob( A | B) prob( B) Guessing a word, given knowledge of previous word: pk( S[i] = wj given that S[i-1] = wk ), which is usually written in this way: pk( S[i] = wj | S[i-1] = wk ) Pr (“the”) is high, but not if the preceding word is “a”. 
Almost nothing about language or about English in particular has crept in. The fact that we have considered conditioning our probabilities of a word based on what word preceded is entirely arbitrary; we could just as well look at the conditional probability of words conditioned on what word follows, or even conditioned on what the word was two words to the left.

Table 2: Top of the Brown Corpus for words following "the".

Word Count Frequency
first 664 0.009494
same 629 0.008994
other 419 0.005991
most 419 0.005991
new 398 0.005691
world 393 0.005619
united 385 0.005505
state 292 0.004175
two 267 0.003818
only 260 0.003718
time 250 0.003575
way 239 0.003417
old 234 0.003346
last 223 0.003189
house 216 0.003089
man 214 0.003060
next 210 0.003003
end 206 0.002946
fact 194 0.002774
whole 190 0.002717

One of the most striking things is how few nouns, and how many adjectives, there are among the most frequent words here -- that's probably not what you would have guessed. None of them are very high in frequency; none places as high as 1 percent of the total.

Table 3: Top of the Brown Corpus for words following "of". Total count: 36,388.

Rank Word Count Count / 36,388
1 the 9724 0.26723
2 a 1473 0.04048
3 his 810 0.02226
4 this 553 0.0152
5 their 342 0.0094
6 course 324 0.0089
7 these 306 0.00841
8 them 292 0.00802
9 an 276 0.00758
10 all 256 0.00704
11 her 252 0.00693
12 our 251 0.0069
13 its 229 0.00629
14 it 205 0.00563
15 that 156 0.00429
16 such 140 0.00385
17 those 135 0.00371
18 my 128 0.00352
19 which 124 0.00341

Among the words after "of", one word is over 25%: "the". So not all words are equally helpful in guessing what the next word is.

Table 4: Top of the Brown Corpus for words preceding "the". Total count: 69,936.

Rank Word Count Count / 69,936
1 of 9724 0.13904
2 . 6201 0.08867
3 in 6027 0.08618
4 , 3836 0.05485
5 to 3485 0.04983
6 on 2469 0.0353
7 and 2254 0.03223
8 for 1850 0.02645
9 at 1657 0.02369
10 with 1536 0.02196
11 from 1415 0.02023
12 that 1397 0.01998
13 by 1349 0.01929
14 is 799 0.01142
15 as 766 0.01095
16 into 675 0.00965
17 was 533 0.00762
18 all 430 0.00615
19 when 418 0.00598
20 but 389 0.00556

Note the prepositions.

Table 5: Top of the Brown Corpus for words 2 to the right of "the". Total count: 69,936.

Rank Word Count Count / 69,936
1 of 10861 0.1553
2 . 4578 0.06546
3 , 4437 0.06344
4 and 2473 0.03536
5 to 1188 0.01699
6 ' 1106 0.01581
7 in 1082 0.01547
8 is 1049 0.015
9 was 950 0.01358
10 that 888 0.0127
11 for 598 0.00855
12 were 386 0.00552
13 with 370 0.00529
14 on 368 0.00526
15 states 366 0.00523
16 had 340 0.00486
17 are 330 0.00472
18 as 299 0.00428
19 at 287 0.0041
20 or 284 0.00406

If you know a word is "the", then the probability that the word-after-next is "of" is greater than 15% -- which is quite a bit.

Exercise: What do you think the probability distribution is for the 10th word after "the"? What are the two most likely words? Why?

Bayes' Rule: "the", given that the preceding word is "of". How do we calculate the probability that the nth word of a sentence is "the" if the (n-1)st word is "of"? We count the number of occurrences of "the" that follow "of", and divide by the total number of "of"s. Total number of "of": 36,341. Total number of "of the": 9,724. So

p(S[i] = the | S[i-1] = of) = 9724 / 36341 = 0.267

"Of", given that the following word is "the": What is the probability that the nth word is "of", if the (n+1)st word is "the"?
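The "count the bigram, divide by the count of the conditioning word" recipe behind Tables 2–5 and behind p(S[i] = the | S[i-1] = of) can be sketched as follows. The toy corpus and the helper name conditional_next are illustrative assumptions of mine, not the author's code; any tokenized corpus could stand in for the Brown Corpus.

```python
from collections import Counter

def conditional_next(tokens, given):
    """Estimate p(S[i] = w | S[i-1] = given): count the bigrams (given, w)
    and divide by the number of occurrences of `given`."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    given_count = Counter(tokens)[given]
    return {w: c / given_count
            for (prev, w), c in bigram_counts.items() if prev == given}

# Toy corpus standing in for the Brown Corpus:
tokens = "the cat sat on the mat near the cat".split()
print(conditional_next(tokens, "the"))   # {'cat': 0.666..., 'mat': 0.333...}
```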
We count the number of occurrences of "of the", and divide by the total number of "the": that is,

p(S[i] = "of" | S[i+1] = "the") = 9,724 / 69,903 = 0.139

Bayes' Rule: the relationship between pr(A | B), "the probability of A given B", and pr(B | A), "the probability of B given A":

$$\mathrm{pr}(A \mid B) = \mathrm{pr}(B \mid A)\,\frac{\mathrm{pr}(A)}{\mathrm{pr}(B)}$$

This follows from the definition of conditional probability: pr(A and B) = pr(A | B) · pr(B), and pr(A and B) = pr(B | A) · pr(A); therefore pr(A | B) · pr(B) = pr(B | A) · pr(A), which gives the formula above. Applied to our case:

$$p(S[i] = \text{"of"} \mid S[i+1] = \text{"the"}) = \frac{p(S[i+1] = \text{"the"} \mid S[i] = \text{"of"}) \cdot p(S[i] = \text{"of"})}{p(S[i+1] = \text{"the"})}$$

The joy of logarithms. Probabilities get too small too fast; logs help. Log probabilities are very big and negative, so we make them positive. Positive log probabilities (a.k.a. inverse log probabilities, ILPs) have a meaning in terms of the ideal compressed length of a symbol in a compression scheme. Notation: {p} = -1 · log p. A more standard notation: ~p.

It's important to be comfortable with the notation, so that you see easily that the log probability of a sentence (here, of four words) can be written as follows:

$$\log \prod_{i=1}^{4} \mathrm{prob}(S[i]) = \sum_{i=1}^{4} \log \mathrm{prob}(S[i])$$

Base-2 logs. If 2^x = y, then log₂(y) = x. 2³ = 8, so log₂(8) = 3; 2⁻³ = 1/8, so log₂(1/8) = -3.

Adding log probs in the unigram model. The probability of a sentence S in the unigram model is the product of the probabilities of its words, so the log probability of a sentence in the unigram model is the sum of the log probabilities of its words. Thus the longer the sentence gets, the larger its (inverse) log probability gets. In a sense that is reasonable -- the longer the sentence, the less likely it is. But we might also be interested in the average log probability of the sentence: the total log probability of the sentence divided by the number of words. This average inverse log probability per word,

$$\frac{1}{N}\sum_{i=1}^{N}\{S[i]\},$$

is an entropy -- especially if we're talking about averaging over not just one sentence, but a large, representative sample, so that we can say it's (approximately) the entropy of the language, not just of some particular sentence.

Observe carefully that

$$\sum_{i=1}^{N}\{S[i]\}$$

is a sum in which we sum over the successive words of the sentence. When i is 1, we are considering the first word, which might be "the", and when i is 10, the tenth word might be "the" as well. It makes sense to re-order the summing of the log probabilities -- because the sum is the same regardless of the order in which you add numbers -- so that all the identical words are together. This means that we can rewrite the sum of the log probabilities as a sum over words in the vocabulary (or the dictionary -- a list where each distinct word occurs only once), and multiply each log probability by the number of times it is present in the entire sum. Thus:

$$\sum_{i=1}^{N}\{S[i]\} \;(\text{sum over words in the string}) \;=\; \sum_{j=1}^{V}\mathrm{count}(word_j)\,\{word_j\} \;(\text{sum over the vocabulary})$$

If we've kept track all along of how many words there are in this corpus (calling this "N"), then if we divide this calculation by N, we get, on the left, the average inverse log probability, and, on the right:

$$\sum_{j=1}^{V}\frac{\mathrm{count}(word_j)}{N}\,\{word_j\}$$

where count(word_j)/N is the frequency of word_j. So we can rewrite this as:

$$\sum_{j=1}^{V}\mathrm{prob}(word_j)\,\{word_j\} = -\sum_{j=1}^{V}\mathrm{prob}(word_j)\,\log\mathrm{prob}(word_j)$$

The fundamental goal of analysis is to maximize the probability of the observed data.
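Here is a small sketch of the average inverse log probability (the plug-in entropy) per word, computed by the vocabulary-weighted sum just derived; the function name and the toy sentence are illustrative choices of mine.

```python
import math
from collections import Counter

def average_ilp(tokens):
    """Average inverse (positive) log probability per word, in bits, under the
    unigram model whose word probabilities are the observed frequencies:
    sum over the vocabulary of freq(word) * -log2(freq(word))."""
    n = len(tokens)
    freqs = {w: c / n for w, c in Counter(tokens).items()}
    return sum(f * -math.log2(f) for f in freqs.values())

tokens = "in the beginning was the word".split()
print(average_ilp(tokens))   # bits per word for this (tiny) sample
```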
If we restrict ourselves at first to the unigram model, then it is not difficult to prove – and it is important to recognize – that the maximum probability that can be obtained for a given corpus is obtained by the distribution whose word-probabilities coincide precisely with the observed frequencies.

We can assign a probability to a corpus with any distribution. If there are words in the corpus which do not get a positive value in the distribution, then the corpus will receive a total probability of zero, but that is not an impossible situation. We refer to the set of items which get a non-zero probability as the support of the distribution. Computational linguists may say that they are concerned with making sure that all words are in the support of their probability distribution.

Suppose we built a distribution for the words of a corpus randomly. To make this slightly more concrete, let's say that these probabilities form the distribution Q, composed of a set of values q(word_i), for each word in the corpus (and possibly other words as well). Even this randomly assigned distribution would (mathematically) assign a probability to the corpus:

$$\prod_{i=1}^{N} q(S[i]) \;(\text{product over words in the string}) \;=\; \prod_{j=1}^{V} q(word_j)^{\mathrm{count}(word_j)} \;(\text{product over the vocabulary})$$

If we now switch to thinking about the log probability, any particular word which occurs k times in the corpus will contribute k times its log probability to the entire sum which gives us the (positive, or inverse) log probability:

$$\sum_{j=1}^{V} \mathrm{count}(word_j)\,\{q(word_j)\} \;(\text{sum over the vocabulary})$$

What should be clear by now is that we can use any distribution to assign a probability to a corpus. We could even use the uniform distribution, which assigns the same probability to each word.

Now we can better understand the idea that we may use a distribution for a given corpus whose probabilities are defined exactly by the frequencies of the words in that corpus. It is a mathematical fact that this "empirical distribution" assigns the highest probability to the corpus, and this turns out to be an extremely important property. Important: you should convince yourself now that if this is true, then the empirical distribution also assigns the lowest entropy to the corpus.

It follows from what we have just said that if there is a "true" probability distribution for English, it will assign a lower probability to any given corpus than the empirical distribution based on that corpus, and that the empirical distribution based on one corpus C1 will assign a lower probability to a different corpus C2 than C2's own empirical distribution. Putting that in terms of entropy (that is, taking the positive log of the probabilities that we have just mentioned, and dividing by N, the number of words in the corpus), we may say that the "true" probability distribution for English assigns a larger entropy to a corpus C than C's own empirical distribution, and C1's empirical distribution assigns a higher entropy to a different corpus C2 than C2's own empirical distribution does.

When we calculate this formula, weighting one distribution (like an observed frequency distribution) by the log probabilities of some other distribution D2, we call that the cross-entropy; and if we calculate the difference between the cross-entropy and the usual (self) entropy, we also say that we are calculating the Kullback-Leibler (or "KL") divergence between the two distributions.
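One way to see the claim that the empirical distribution scores a corpus at least as well as any other distribution is to compare cross-entropies. This sketch (illustrative names and toy corpus, assumed for the example) computes the cross-entropy in bits per word and checks the inequality against the uniform distribution.

```python
import math
from collections import Counter

def cross_entropy(tokens, q):
    """Cross-entropy in bits per word: the corpus's observed frequencies
    weighted by the inverse log probabilities of the distribution q.
    Every word of the corpus must be in the support of q."""
    n = len(tokens)
    return sum((c / n) * -math.log2(q[w]) for w, c in Counter(tokens).items())

tokens = "the cat sat on the mat".split()
empirical = {w: c / len(tokens) for w, c in Counter(tokens).items()}
uniform = {w: 1 / len(set(tokens)) for w in set(tokens)}

# The empirical distribution never assigns a higher cross-entropy (i.e. a lower
# probability) to the corpus than another distribution does:
print(cross_entropy(tokens, empirical) <= cross_entropy(tokens, uniform))   # True
```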
Mathematically, if the probability assigned to word_i by D1 is expressed as D1(word_i) (and likewise for D2), then the KL divergence is

$$D_{KL}(D_1 \,\|\, D_2) = \sum_{j=1}^{V} D_1(word_j)\log D_1(word_j) - \sum_{j=1}^{V} D_1(word_j)\log D_2(word_j) = \sum_{j=1}^{V} D_1(word_j)\log\frac{D_1(word_j)}{D_2(word_j)}$$

The quantity -log₂ freq(·) is of great importance; we can refer to it as the information content, or the optimal compressed length, because it is always possible to develop a compression scheme by which a symbol w, emitted with probability prob(w), is represented by a code of length -1 · log₂ prob(w) bits.

Probability of a corpus on a uniletter model. Write [l_i] for the number of times letter l_i occurs in the corpus, and N = Σ_i [l_i] for the total number of letters. Then the probability of the corpus under the uniletter (letter-frequency) model is

$$\prod_{i=1}^{26}\left(\frac{[l_i]}{N}\right)^{[l_i]}$$

And if we add 10 e's, the new probability is

$$\left(\frac{[e]+10}{N+10}\right)^{[e]+10}\prod_{\substack{i=1 \\ i\neq e}}^{26}\left(\frac{[l_i]}{N+10}\right)^{[l_i]}$$

Let's calculate the difference of these two log probabilities, or the log of their ratio:

$$\frac{\text{new probability}}{\text{old probability}} = \frac{\left(\dfrac{[e]+10}{N+10}\right)^{[e]+10}\displaystyle\prod_{\substack{i=1\\ i\neq e}}^{26}\left(\frac{[l_i]}{N+10}\right)^{[l_i]}}{\displaystyle\prod_{i=1}^{26}\left(\frac{[l_i]}{N}\right)^{[l_i]}} = \frac{([e]+10)^{[e]+10}\,N^{N}}{(N+10)^{N+10}\,[e]^{[e]}} = \left(\frac{N}{N+10}\right)^{N}\left(\frac{[e]+10}{N+10}\right)^{10}\left(\frac{[e]+10}{[e]}\right)^{[e]}$$

Now we take the inverse logarithm (the negative log), which is the difference of their information content:

$$-\log\frac{\text{new probability}}{\text{old probability}} = N\log\frac{N_{new}}{N_{old}} \;-\; 10\log\mathrm{prob}_{new}(e) \;-\; [e]\log\frac{[e]_{new}}{[e]_{old}}$$

The first term is the global change in the information content of all of the old symbols; it is positive. The second term is 10 times the information content of an [e]; we're subtracting a negative number. The third term is the change in the information content of each of the old e's; we're subtracting a positive number.

Mutual information. The mutual information between two variables is a measure of the relationship between their joint occurrence and their individual occurrences. In the case we're looking at, we might be interested in the variables specifying the letters at adjacent (ith and i+1th) locations in a string. The mutual information between two letters is the log of the ratio of the observed frequency to the frequency that we would predict if the data had no structure, which is the product of the frequencies of the letters -- the log of the found-to-expected-if-no-structure ratio:

$$MI(X,Y) = \log\frac{\mathrm{freq}(XY)}{\mathrm{freq}(X)\,\mathrm{freq}(Y)} = \log\mathrm{freq}(XY) - \log\mathrm{freq}(X) - \log\mathrm{freq}(Y)$$

This is a measure of the stickiness between X and Y in the data: they can attract (MI > 0) or repel (MI < 0).
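Here is a sketch of the adjacent-letter mutual information just defined, using bigram (pair) frequencies as freq(XY); the helper name letter_mi and the example string are my own illustrative choices.

```python
import math
from collections import Counter

def letter_mi(text, x, y):
    """Pointwise mutual information, in bits, between letters x and y at
    adjacent positions: log2( freq(xy) / (freq(x) * freq(y)) ).
    MI > 0 means the letters attract; MI < 0 means they repel."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    pairs = list(zip(letters, letters[1:]))
    freq_x = Counter(letters)[x] / len(letters)
    freq_y = Counter(letters)[y] / len(letters)
    freq_xy = Counter(pairs)[(x, y)] / len(pairs)
    return math.log2(freq_xy / (freq_x * freq_y))

print(letter_mi("the quick brown fox jumps over the lazy dog", "t", "h"))  # positive: t and h attract
```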