Lecture 3: Basic Probability (Chapter 2 of Manning and Schütze)
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering, National Cheng Kung University
2014/02/24

Motivation
• Statistical NLP aims to do statistical inference.
• Statistical inference consists of taking some data (generated according to some unknown probability distribution) and then making inferences about this distribution.
• An example of statistical inference is the task of language modeling, namely predicting the next word given a window of previous words. To do this, we need a model of the language.
• Probability theory helps us to find such a model.

Probability Terminology
• Probability theory deals with predicting how likely it is that something will happen.
• The process by which an observation is made is called an experiment or a trial (e.g., tossing a coin twice).
• The collection of basic outcomes (or sample points) for an experiment is called the sample space Ω.
• An event is a subset of the sample space.
• Probabilities are numbers between 0 and 1, where 0 indicates impossibility and 1 indicates certainty.
• A probability function/distribution distributes a probability mass of 1 over the sample space.

Experiments & Sample Spaces
• The set of possible basic outcomes of an experiment is the sample space Ω:
  – coin toss: Ω = {head, tail}
  – tossing a coin twice: Ω = {HH, HT, TH, TT}
  – dice roll: Ω = {1, 2, 3, 4, 5, 6}
  – missing word: |Ω| = vocabulary size
• Sample spaces may be discrete (countable) or continuous (uncountable).
• Every observation/trial yields a basic outcome (sample point).
  – An event A is a set of basic outcomes with A ⊆ Ω, and A ∈ 2^Ω (the event space).
  – Ω is then the certain event; ∅ is the impossible event.

Events and Probability
• The probability of event A is denoted P(A).
  – Also called the prior probability: the probability before we consider any additional knowledge.
• Example experiment: toss a coin three times.
  – Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
• Cases with two or more tails:
  – A = {HTT, THT, TTH, TTT}
  – P(A) = |A|/|Ω| = 4/8 = 1/2 (assuming a uniform distribution)
• All heads:
  – A = {HHH}
  – P(A) = |A|/|Ω| = 1/8

Probability Properties
• Basic properties:
  – P: 2^Ω → [0, 1]
  – P(Ω) = 1
  – For disjoint events: P(∪i Ai) = Σi P(Ai)
• [NB: the axiomatic definition of probability takes the above three conditions as axioms.]
• Immediate consequences:
  – P(∅) = 0
  – P(Ā) = 1 − P(A)
  – A ⊆ B implies P(A) ≤ P(B)
  – Σ_{a∈Ω} P(a) = 1

Joint Probability
• The joint probability of A and B: P(A, B) = P(A ∩ B).
• It can be pictured as a 2-dimensional table (|A| × |B|) with a value in every cell giving the probability of that specific pair occurring.

Conditional Probability
• Sometimes we have partial knowledge about the outcome of an experiment; then the conditional (or posterior) probability can be helpful. If we know that event B is true, then the probability that A is true given this knowledge is:
  – P(A|B) = P(A, B) / P(B)

Conditional and Joint Probabilities
• P(A|B) = P(A, B)/P(B), so P(A, B) = P(B) P(A|B)
• P(B|A) = P(A, B)/P(A), so P(A, B) = P(A) P(B|A)
• Chain rule:
  – P(A1, ..., An) = P(A1) P(A2|A1) P(A3|A1, A2) ... P(An|A1, ..., An−1)
  – The chain rule is used in language modeling.

Bayes Rule
• Since P(A, B) = P(B, A), P(A ∩ B) = P(B ∩ A), and P(A, B) = P(A|B) P(B) = P(B|A) P(A):
  – P(A|B) = P(A, B)/P(B) = [P(B|A) P(A)] / P(B)
  – P(B|A) = P(A, B)/P(A) = [P(A|B) P(B)] / P(A)

Example
• S: have a stiff neck; M: have meningitis (腦膜炎).
• P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20.
• I have a stiff neck; should I worry?
  – P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
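As a sanity check on the Bayes-rule example above, here is a minimal Python sketch (the function and variable names are illustrative, not from the slides) that computes the posterior P(M|S):

```python
def bayes_posterior(p_evidence_given_h, p_h, p_evidence):
    """Posterior P(H|E) = P(E|H) * P(H) / P(E)."""
    return p_evidence_given_h * p_h / p_evidence

# Stiff-neck / meningitis example from the slide:
p_s_given_m = 0.5        # P(S|M)
p_m = 1 / 50_000         # P(M), prior probability of meningitis
p_s = 1 / 20             # P(S), prior probability of a stiff neck

p_m_given_s = bayes_posterior(p_s_given_m, p_m, p_s)
print(p_m_given_s)       # 0.0002 -- a stiff neck alone is weak evidence
```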
Independence
• Two events A and B are independent of each other if P(A|B) = P(A).
• Examples:
  – two coin tosses
  – the weather today and the weather on March 4th, 2012
• If A and B are independent, then we can compute P(A, B) from P(A) and P(B) as:
  – P(A, B) = P(A|B) P(B) = P(A) P(B)
• Two events A and B are conditionally independent of each other given C if:
  – P(A|B, C) = P(A|C)

A Golden Rule (of Statistical NLP)
• If we are interested in which event B is most likely given A, we can use Bayes rule and maximize over all B:
  – argmaxB P(B|A) = argmaxB P(A, B) / P(A) = argmaxB [P(A|B) P(B)] / P(A) = argmaxB P(A|B) P(B)
  – P(A) is the same for each candidate B, so it can be dropped.

Random Variables (RV)
• Random variables allow us to talk about the probabilities of numerical values that are related to the event space (with a specific numeric range).
• An RV is a function X: Ω → Q (events → values).
  – In general Q ⊆ R^n, typically Q ⊆ R; it is easier to handle real numbers than real-world events.
  – An RV is discrete if Q is a countable subset of R; it is an indicator RV (or Bernoulli trial) if Q = {0, 1}.
• Define a probability mass function (pmf) for RV X:
  – pX(x) = P(X = x) = P(Ax), where Ax = {ω ∈ Ω : X(ω) = x}
  – often written simply as p(x)

Example
• Suppose we define a discrete RV X that is the sum of the faces of two dice; then Q = {2, ..., 11, 12} with the pmf:
  – P(X=2) = 1/36, P(X=3) = 2/36, P(X=4) = 3/36, P(X=5) = 4/36, P(X=6) = 5/36, P(X=7) = 6/36,
    P(X=8) = 5/36, P(X=9) = 4/36, P(X=10) = 3/36, P(X=11) = 2/36, P(X=12) = 1/36

Expectation and Variance
• The expectation is the mean or average of an RV, defined as:
  – E(X) = Σx x p(x)
• The variance of an RV is a measure of whether the values of the RV tend to vary over trials:
  – Var(X) = E((X − E(X))²) = E(X²) − E²(X)
• The standard deviation (σ) is the square root of the variance.

Examples
• What is the expectation of the sum of the numbers on two dice?
  – 2 · P(X=2) = 2 · 1/36 = 1/18,  3 · P(X=3) = 3 · 2/36 = 3/18,  4 · P(X=4) = 6/18,  5 · P(X=5) = 10/18,
    6 · P(X=6) = 15/18,  7 · P(X=7) = 21/18,  8 · P(X=8) = 20/18,  9 · P(X=9) = 18/18,
    10 · P(X=10) = 15/18,  11 · P(X=11) = 11/18,  12 · P(X=12) = 6/18
  – E(SUM) = 126/18 = 7
• Or more simply:
  – E(SUM) = E(D1 + D2) = E(D1) + E(D2)
  – E(D1) = E(D2) = 1 · 1/6 + 2 · 1/6 + ... + 6 · 1/6 = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6
  – Hence E(SUM) = 21/6 + 21/6 = 7

Examples
• Var(X) = E((X − E(X))²) = E(X² − 2X E(X) + E²(X)) = E(X²) − 2E²(X) + E²(X) = E(X²) − E²(X)
• E(SUM²) = 329/6 and E²(SUM) = 7² = 49 = 294/6
• Hence Var(SUM) = 329/6 − 294/6 = 35/6

Joint and Conditional Distributions for RVs
• The joint pmf for two RVs X and Y is p(x, y) = P(X = x, Y = y).
• The marginal pmfs are calculated as pX(x) = Σy p(x, y) and pY(y) = Σx p(x, y).
• If X and Y are independent, then p(x, y) = pX(x) · pY(y).
• The conditional pmf is defined using the joint distribution: pX|Y(x|y) = p(x, y)/pY(y) if pY(y) > 0.
• Chain rule:
  – p(w, x, y, z) = p(w) p(x|w) p(y|w, x) p(z|w, x, y)

Estimating Probability Functions
• What is the probability that the sentence "The cow chewed its cud" will be uttered? Unknown, so P must be estimated from a sample of data.
• An important measure for estimating P is the relative frequency of the outcome, i.e., the proportion of times an outcome u occurs:
  – fu = C(u) / N
  – C(u) is the number of times u occurs in N trials.
  – As N → ∞, the relative frequency tends to stabilize around some number: the probability estimate.
• Two different approaches:
  – Parametric (assume a distribution)
  – Non-parametric (distribution-free)
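Returning to the two-dice RV example above, here is a short Python sketch (not from the lecture) that builds the pmf of the sum and checks the expectation and variance computed on the slides:

```python
from fractions import Fraction
from itertools import product

# pmf of the sum of two fair dice
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    s = d1 + d2
    pmf[s] = pmf.get(s, Fraction(0)) + Fraction(1, 36)

expectation = sum(x * p for x, p in pmf.items())          # E(X)
second_moment = sum(x * x * p for x, p in pmf.items())    # E(X^2)
variance = second_moment - expectation ** 2               # Var(X) = E(X^2) - E^2(X)

print(expectation)    # 7
print(second_moment)  # 329/6
print(variance)       # 35/6
```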
Parametric Methods
• Assume that the language phenomenon is acceptably modeled by one of the well-known standard distributions (e.g., binomial, normal).
• By assuming an explicit probabilistic model of the process by which the data were generated, determining a particular probability distribution within the family requires only the specification of a few parameters, which requires less training data (only a small number of parameters need to be estimated).

Non-parametric Methods
• No assumption is made about the underlying distribution of the data.
• For example, simply estimating P empirically by counting a large number of random events is a distribution-free method.
• Given less prior information, more training data is needed.

Estimating Probability
• Example: toss a coin three times.
  – Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  – Count cases with exactly two tails: A = {HTT, THT, TTH}
• Run the experiment 1000 times (i.e., 3000 tosses).
• Counted 386 cases with two tails (HTT, THT, or TTH).
• Estimate: p(A) = 386 / 1000 = .386
  – Further runs: 373, 399, 382, 355, 372, 406, 359
  – p(A) = .379 (weighted average), or simply 3032 / 8000
• Under the uniform distribution assumption: p(A) = 3/8 = .375

Standard Probabilistic Distributions
• Discrete distributions:
  – Binomial: P(X = k | p, n) = C(n, k) p^k (1 − p)^(n−k)
  – Multinomial: P(X1 = k1, ..., Xm = km) = [n! / (k1! ... km!)] p1^k1 ... pm^km
  – Geometric: P(X = k) = p (1 − p)^(k−1)
  – Poisson: P(X = k | λ) = e^(−λ) λ^k / k!
• Continuous distributions:
  – Normal: N(x | μ, σ) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²))
  – Exponential: f(x | λ) = λ e^(−λx)
  – Gamma: f(x | α, β) = (1 / (Γ(α) β^α)) x^(α−1) e^(−x/β)

Standard Distributions: Binomial
• A series of trials with only two outcomes, 0 or 1, with each trial being independent of all the others.
• The number r of successes out of n trials, given that the probability of success in any trial is p:
  – b(r; n, p) = C(n, r) p^r (1 − p)^(n−r)
• C(n, r) counts how many possibilities there are for choosing r objects out of n, i.e., n! / ((n − r)! r!).

Binomial Distribution
• Works well for tossing a coin. However, for linguistic corpora one never has complete independence from one sentence to the next, so it is an approximation.
• Use it when counting whether something has a certain property or not (assuming independence).
• Actually quite common in Statistical NLP: e.g., look through a corpus to estimate the percentage of sentences that contain a certain word, or how often a verb is used as transitive versus intransitive.
• The expectation is np and the variance is np(1 − p).

Standard Distributions: Normal
• The normal (Gaussian) distribution is a continuous distribution with two parameters: mean μ and standard deviation σ.
  – n(x; μ, σ) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²))
  – It is the standard normal if μ = 0 and σ = 1.
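The binomial formula can be checked against the relative-frequency estimate from the "Estimating Probability" slide. A minimal Python sketch (the simulation setup is illustrative and not part of the lecture):

```python
import math
import random

def binomial_pmf(r, n, p):
    """b(r; n, p) = C(n, r) * p^r * (1-p)^(n-r)."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# Exact probability of exactly two tails in three fair-coin tosses:
print(binomial_pmf(2, 3, 0.5))   # 0.375, i.e. 3/8

# Relative-frequency estimate from simulated experiments, as on the
# "Estimating Probability" slide (the estimate stabilizes around 0.375):
random.seed(0)
runs = 8000
hits = sum(1 for _ in range(runs)
           if sum(random.randint(0, 1) for _ in range(3)) == 2)
print(hits / runs)               # roughly 0.37 to 0.38
```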
Basic Formula
• Marginalizing over a hidden variable h (in the figure, the sample space Ω is partitioned into h1, ..., h6 and X overlaps several of them):
  – P(x) = Σ_h P(x, h)
  – P(x|y) = Σ_h P(x, h | y)
• Since P(x, h | y) = P(h | y) P(x | y, h):
  – (*) P(x|y) = Σ_h P(h|y) P(x|y, h)
• If x and y are conditionally independent given h:
  – P(x|y) = Σ_h P(h|y) P(x|h)

Homework 2
• Based on the following formula, please give an example of a research idea or application:
  – P(x|y) = Σ_h P(h|y) P(x|y, h)

Zipf's Law
• Zipf's law, an empirical law formulated using mathematical statistics, refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power-law probability distributions.
• The law is named after the linguist George Kingsley Zipf (pronounced /ˈzɪf/), who first proposed it (Zipf 1935, 1949), though J.B. Estoup appears to have noticed the regularity before Zipf.
• [Figure: word frequency plotted against rank.]

Zipf's Law (cont.)
• Zipf's law is most easily observed by plotting the data on a log-log graph, with the axes being log(rank order) and log(frequency).
  – For example, if the most frequent word is "the" with 69,971 occurrences, it would appear at x = log(1), y = log(69971). The data conform to Zipf's law to the extent that the plot is linear.
• Formally, let:
  – N be the number of elements;
  – k be their rank;
  – s be the value of the exponent characterizing the distribution.
• Then the Zipfian pmf is f(k; s, N) = (1/k^s) / Σ_{n=1..N} (1/n^s).
• [Figure: Zipf pmf for N = 10 on a log-log scale; the horizontal axis is the index k. The function is only defined at integer values of k; the connecting lines do not indicate continuity.]

Frequentist Statistics
• s: a sequence of observations; M: a model (a distribution plus parameters).
• The maximum likelihood estimate picks the model that makes the data most probable:
  – M* = argmax_M P(s | M)
• Probability expresses something about the world, with no prior belief.

Bayesian Statistics I: Bayesian Updating
• Assume that the data come in sequentially and are independent.
• Given an a-priori probability distribution, we can update our beliefs when a new datum comes in by calculating the Maximum A Posteriori (MAP) distribution.
• The MAP probability becomes the new prior, and the process repeats on each new datum.
  – argmax_M P(M | s) = argmax_M [P(s | M) P(M)] / P(s) = argmax_M P(s | M) P(M)
  – P(s) is a normalizing constant.

Bayesian Statistics II: Bayesian Decision Theory
• Bayesian statistics can be used to evaluate which model or family of models better explains some data.
• We define two different models of the event and calculate the likelihood ratio between these two models.

Bayesian Decision Theory
• Suppose we have two models M1 and M2 and want to evaluate which model better explains some new data s:
  – P(M1 | s) / P(M2 | s) = [P(s | M1) P(M1)] / [P(s | M2) P(M2)]
  – If this ratio is greater than 1, i.e., P(M1 | s) > P(M2 | s), then M1 is the more likely model; otherwise M2.
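To make the likelihood-ratio comparison concrete, here is a toy Python sketch (the two coin-bias models, the equal priors, and the data are all invented for illustration, not taken from the lecture):

```python
from math import prod

def seq_likelihood(seq, p_heads):
    """P(s | M) for an i.i.d. coin model with P(heads) = p_heads."""
    return prod(p_heads if c == 'H' else 1 - p_heads for c in seq)

s = "HHTHHHTHHH"            # observed data: 8 heads, 2 tails
p_m1, p_m2 = 0.5, 0.5       # priors P(M1), P(M2), assumed equal here

# M1: fair coin (p = 0.5); M2: biased coin (p = 0.8)
ratio = (seq_likelihood(s, 0.5) * p_m1) / (seq_likelihood(s, 0.8) * p_m2)
print(ratio)                # < 1, so M2 (the biased coin) explains s better
```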
Essential Information Theory
• Developed by Shannon in the 1940s.
• The goal is to maximize the amount of information that can be transmitted over an imperfect communication channel.
• Shannon wished to determine the theoretical maxima for data compression (the entropy H) and the transmission rate (the channel capacity C).
• If a message is transmitted at a rate slower than C, then the probability of transmission errors can be made as small as desired.

The Noisy Channel Model
• Goal: encode the message in such a way that it occupies minimal space while still containing enough redundancy to be able to detect and correct errors.
• [Diagram:] W (message) → Encoder → X (input to channel) → Channel p(y|x) → Y (output from channel) → Decoder → Ŵ (attempt to reconstruct the message based on the output)

The Noisy Channel Model (cont.)
• We want to optimize communication across a channel in terms of throughput and accuracy: the communication of messages in the presence of noise in the channel.
• There is a duality between compression (achieved by removing all redundancy) and transmission accuracy (achieved by adding controlled redundancy so that the input can be recovered in the presence of noise).

The Noisy Channel Model (cont.)
• Channel capacity: the rate at which one can transmit information through the channel with an arbitrarily low probability of being unable to recover the input from the output. For a memoryless channel:
  – C = max_{p(X)} I(X; Y), where I(X; Y) is the mutual information.
• We reach a channel's capacity if we manage to design an input code X whose distribution p(X) maximizes the mutual information between input and output.

Mutual Information
• I(X; Y) is called the mutual information between X and Y.
• I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X), where H(X) is the entropy of the discrete RV X with distribution p(X).
• It is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another.

Entropy
• Entropy (or self-information) is the average uncertainty of a single RV.
• Let p(x) = P(X = x), where x ∈ X; then:
  – H(p) = H(X) = −Σ_{x∈X} p(x) log2 p(x) = Σ_{x∈X} p(x) log2 (1/p(x))   [measured in bits (length)]
• In other words, entropy measures the amount of information in a random variable, measured in bits. It is also the average length of the message needed to transmit an outcome of that variable using the optimal code.
• An optimal code sends a message of probability p(x) in −log2 p(x) bits.

Entropy (cont.)
• H(X) = −Σ_{x∈X} p(x) log2 p(x) = Σ_{x∈X} p(x) log2 (1/p(x)) = E[log2 (1/p(X))]
• Entropy is non-negative: H(X) ≥ 0, and H(X) = 0 iff p(X) = 1 for some value.
  – When the value of X is determinate, it provides no new information.
• [Figure: entropy of a weighted coin (one toss) as a function of the head probability.]

The Limits
• When is H(X) = 0?
  – If the result of the experiment is known ahead of time.
  – Necessarily: there is an x ∈ Ω with p(x) = 1, and p(y) = 0 for all y ≠ x.
• Upper bound?
  – For |Ω| = n: H(X) ≤ log2 n, attained by the uniform distribution p(x) = 1/n.
  – Nothing can be more uncertain than the uniform distribution.
• Entropy increases with message length.

"Coding" Interpretation of Entropy
• The least (average) number of bits needed to encode a message (string, sequence, series, ...) is given by H(X).
• Compression algorithms:
  – do well on data with repeating patterns (easily predictable = low entropy);
  – their output has high entropy, so compressing already-compressed data achieves nothing.

Coding: Example
• Experience: some characters are more common, some are (very) rare.
  – Use more bits for the rare characters and fewer bits for the frequent ones.
  – Suppose p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, and for the rest p(x) ≈ .0004.
  – Code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, rest: 11 followed by b1b2b3b4b5b6b7b8 (an 8-bit code).
  – Encoding "acbbécbaac" gives 0010010111000011111001000010
    (a=00, c=10, b=01, b=01, é=11 00001111, c=10, b=01, a=00, a=00, c=10).
  – Number of bits used: 28 (vs. 80 using a "naive" 8-bit uniform coding).

Joint Entropy
• The joint entropy of a pair of discrete random variables X, Y with joint pmf p(x, y) is the amount of information needed on average to specify both their values:
  – H(X, Y) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(x, y)

Conditional Entropy
• The conditional entropy of a discrete random variable Y given another X, for X, Y with joint pmf p(x, y), expresses how much extra information needs to be supplied on average for communicating Y given that the other party knows X. (Recall that H(X) = E[log2 (1/p(X))]; here the weights are not conditional.)
  – H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)
           = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(y|x)
           = −E[log2 p(Y|X)]

Entropy and Language
• Entropy is a measure of uncertainty. The more we know about something, the lower the entropy.
• If one language model captures more of the structure of the language than another model, then its entropy should be lower.
• Entropy can be thought of as a matter of how surprised we will be to see the next word given the previous words we have already seen.

Perplexity
• A measure related to the notion of cross entropy and used in the speech recognition community is called perplexity:
  – Perplexity(x1n, m) = 2^H(x1n, m), where H(x1n, m) is the cross entropy of model m on the word sequence x1 ... xn.
• A perplexity of k means that you are as surprised on average as you would have been if you had to guess between k equiprobable choices at each step.

Chain Rule for Entropy
• H(X, Y) = H(X) + H(Y|X)
• H(X1, ..., Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1, ..., Xn−1)
• The product of probabilities becomes a sum of entropies because of the log.
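A small Python sketch (the distributions are illustrative, not from the slides) that computes H(X) = −Σ p(x) log2 p(x) and checks the limiting cases discussed above:

```python
from math import log2

def entropy(pmf):
    """H(X) = -sum p(x) * log2 p(x), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

fair_coin   = {'H': 0.5, 'T': 0.5}
biased_coin = {'H': 0.9, 'T': 0.1}
certain     = {'H': 1.0, 'T': 0.0}
fair_die    = {k: 1/6 for k in range(1, 7)}

print(entropy(fair_coin))    # 1.0 bit: the upper bound log2(2) for two outcomes
print(entropy(biased_coin))  # ~0.469 bits: less uncertain than a fair coin
print(entropy(certain))      # 0.0: a determinate outcome carries no information
print(entropy(fair_die))     # ~2.585 bits = log2(6), the uniform upper bound
```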
Mutual Information
• By the chain rule for entropy, H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).
• Therefore, H(X) − H(X|Y) = H(Y) − H(Y|X) = I(X; Y).
• I(X; Y) is called the mutual information between X and Y.
• It is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another.

Relationship between I and H
• [Diagram: H(X, Y) drawn as the union of H(X) and H(Y); H(X|Y) and H(Y|X) are the non-overlapping parts and I(X; Y) is the overlap.]
• I(X; Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y)
          = Σ_x p(x) log2 (1/p(x)) + Σ_y p(y) log2 (1/p(y)) + Σ_{x,y} p(x, y) log2 p(x, y)
          = Σ_{x,y} p(x, y) log2 [p(x, y) / (p(x) p(y))]

Mutual Information (cont.)
• I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
• I(X; Y) is a symmetric, non-negative measure of the common information of two variables.
• Some see it as a measure of dependence between two variables, but it is better to think of it as a measure of independence:
  – I(X; Y) is 0 only when X and Y are independent: H(X|Y) = H(X).
• H(X) = H(X) − H(X|X) = I(X; X), which is why entropy is also called self-information.

Mutual Information (cont.)
• Do not confuse it with pointwise mutual information, log2 [p(x, y) / (p(x) p(y))], which has some problems of its own.

Language and the Noisy Channel Model
• Recall the noisy channel model: an input I passes through a noisy channel p(o|i) and we observe the output O.
• In language applications we cannot control the encoding phase; we can only decode the output to give the most likely input. Determine the most likely input given the output:
  – Î = argmax_i p(i|o) = argmax_i [p(i) p(o|i)] / p(o) = argmax_i p(i) p(o|i)

The Noisy Channel Model (cont.)
• p(i) is the language model: the distribution over sequences of "words" in the input language.
• p(o|i) is the channel model (probability).
• This decomposition is used in machine translation, optical character recognition, speech recognition, etc.

Relative Entropy: Kullback-Leibler Divergence
• For two probability functions p(x) and q(x), their relative entropy is:
  – D(p || q) = Σ_{x∈X} p(x) log2 [p(x) / q(x)] = E_p[log2 (p(X) / q(X))]
• The relative entropy, also called the Kullback-Leibler (KL) divergence, is a measure of how different two probability distributions are (over the same event space).
• The KL divergence between p and q can also be seen as the average number of bits that are wasted by encoding events from a distribution p with a code based on the not-quite-right distribution q.
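A short Python sketch computing D(p || q) as defined above (the two coin distributions are made up for illustration), including the coding interpretation:

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x)/q(x)); requires q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

# "True" distribution p vs. an approximate model q over the same event space
p = {'H': 0.5, 'T': 0.5}
q = {'H': 0.9, 'T': 0.1}

print(kl_divergence(p, q))  # ~0.74 bits wasted by coding p-events with q's code
print(kl_divergence(q, p))  # ~0.53 bits: D(p||q) != D(q||p), so it is not a true distance
```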
Comments on Relative Entropy
• Goal: minimize the relative entropy D(p || q) so that the probabilistic model q is as accurate as possible.
• Conventions:
  – 0 log 0 = 0
  – p log (p/0) = ∞ (for p > 0)
• Is it a distance?
  – It is not symmetric: D(p || q) ≠ D(q || p).
  – It does not satisfy the triangle inequality.
  – But it can still be useful to think of it as a distance.

Mutual Information and Relative Entropy
• For random variables X, Y with joint pmf pXY(x, y) and marginals pX(x), pY(y), mutual information is a relative entropy:
  – I(X; Y) = D(p(x, y) || p(x) p(y)) = Σ_{x,y} p(x, y) log2 [p(x, y) / (p(x) p(y))]
• I(X; Y) measures how much (our knowledge of) Y contributes, on average, to easing the prediction of X.
• Equivalently, it measures how far p(x, y) deviates from independence (p(x) p(y)).

The Entropy of English
• We can model English using n-gram models (also known as Markov chains).
• These models assume limited memory, i.e., that the next word depends only on the previous k words (a kth-order Markov approximation).
• What is the entropy of English?
  – First-order model: 4.03 bits
  – Second-order model: 2.8 bits
  – Shannon's human-prediction experiment: 1.3 bits
• Recall that a lower entropy means the model captures more of the structure of the language, and that perplexity = 2^H gives the equivalent number of equiprobable choices per word.

Homework 3
• Please collect three web news articles, and then compute the entropies of the word distributions for one, two, and three articles, respectively.
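As a closing illustration of the entropy and perplexity definitions, here is a minimal Python sketch of the word-distribution entropy computation (the toy sentence below is invented and merely stands in for a collected article):

```python
from collections import Counter
from math import log2

def word_entropy(text):
    """Entropy (in bits per word) of the empirical word distribution of a text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy text standing in for a real article (illustrative only)
article = "the cat sat on the mat and the dog sat on the rug"
h = word_entropy(article)
print(h)        # entropy of the unigram word distribution, in bits per word
print(2 ** h)   # the corresponding perplexity: 2^H equiprobable choices per word
```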