Corpora and Statistical Methods
Albert Gatt
Probability distributions
Example 1: Book publishing
 Case:
 a publishing house considers whether to publish a new textbook on statistical NLP
 considerations include: production cost, expected sales, net
profits (given cost)
 Problem:
 to publish or not to publish?
 depends on expected sales and profits
 if published, how many copies?
 depends on demand and cost
Example 1: Demand & cost figures
 Suppose:
 book costs €35, of which:
 publisher gets €25
 bookstore gets €6
 author gets €4
 To make a decision, the publisher needs to estimate profits as a function of the probability of selling n books, for different values of n.
 profit = (€25 * n) – overall production cost
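For concreteness, a minimal Python sketch of this profit formula (the function name and the example cost are illustrative):

```python
def profit(copies_sold, production_cost):
    """Publisher's profit in €: €25 per copy sold, minus the overall production cost."""
    return 25 * copies_sold - production_cost

# e.g. 20,000 copies sold at a production cost of €350,000
print(profit(20_000, 350_000))  # 150000
```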
Terminology
 Random variable
 In this example, the profit from selling n books is our random variable
 It takes on different values, depending on n
 We use uppercase (e.g. X) to denote the random variable
 Distribution
 The different values of X (denoted x) form a distribution.
 If each value x can be assigned a probability (the probability of making
a given profit), then we can plot each value x against its likelihood.
Definitions
 Random variable
 A variable whose numerical value is determined by chance. Formally, a
function that returns a unique numerical value determined by the outcome
of an uncertain situation.
 Can be discrete (our exclusive focus) or continuous
 Probability distribution
 For a discrete random variable X, the probability distribution p(x) gives the
probabilities for each value x of X.
 The probabilities p(x) of all possible values of X sum to 1.
 The distribution tells us how much of the overall probability space (the “probability mass”) each value of x takes up.
Tabulated probability distribution
No. copies sold | Prod. cost | Profits (X) | Probability P(x)
5,000           | €275,000   | -€150,000   | .20
10,000          | €300,000   | -€50,000    | .40
20,000          | €350,000   | €150,000    | .25
30,000          | €400,000   | €350,000    | .10
40,000          | €450,000   | €550,000    | .05
Plotting the distribution
Uses of a probability distribution
 Computation of:
 mean: the expected value of X in the long run
 based on the specific values of X, and their probability
 NB: NOT interpreted as a value in a sample of data, but as the expected (future) value based on the sample.
 standard deviation & variance: the extent to which actual values of X
will differ from the mean
 skewness: the extent to which our distribution is “balanced”, i.e.
whether it’s symmetrical
In graphics…
 Mean: expected value in the long run
 SD & variance: how much actual values deviate from the mean overall
 Skewness: symmetry or “tail” of our distribution
Measures of expectation and variation
The expected value (mean)
 The expected value of a discrete random variable X, denoted
E[X] or μ, is a weighted average of the values of X
 weighted, because not all values x will have the same probability
 estimated by summing, for all values of X, the product of x and its
probability p(x)
 E[X] = Σ_x x · p(x)
More on expected value
 The mean or expected value tells us that, in the long run, we
can expect X to have the value μ.
 E.g. in our example, our book publisher can expect long-term profits of:
(-150,000 * .2) + (-50,000 * .4) + (150,000 * .25) +
(350,000 * .1) +
(550,000 * .05)
= €50,000
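As an illustration, a minimal Python sketch of the same expected-value calculation, using the profit/probability pairs from the table above (variable names are illustrative):

```python
# profit values x (in €) and their probabilities p(x), from the tabulated distribution
profit_dist = {
    -150_000: 0.20,
    -50_000: 0.40,
    150_000: 0.25,
    350_000: 0.10,
    550_000: 0.05,
}

assert abs(sum(profit_dist.values()) - 1.0) < 1e-9  # the p(x) values sum to 1

expected_profit = sum(x * p for x, p in profit_dist.items())
print(expected_profit)  # ≈ 50000: the long-run expected profit in €
```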
Variance
 Mean is the expected value of X, E[X]
 Variance (σ2) reflects the extent to which the actual
outcomes deviate from expectation (i.e. from E[X])
 σ² = E[(X − μ)²] = Σ_x (x − μ)² p(x)
 i.e. the weighted sum of squared deviations
 Reasons for squaring:
 eliminates the distinction between +ve and –ve deviations
 gives larger deviations disproportionately more weight
 e.g. one deviation of 10 counts as much as four deviations of 5
Standard deviation
 Variance gives overall dispersion or variation
 Standard deviation (σ) is the dispersion of possible outcomes;
it indicates how spread out the distribution is.
 estimated as square root of variance
 σ = √σ² = √( Σ_x (x − μ)² p(x) )
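Continuing the sketch above, variance and standard deviation can be computed directly from a distribution represented as a dict {x: p(x)} (illustrative helper functions):

```python
def mean(dist):
    """E[X]: the probability-weighted average of the values."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """σ²: the weighted sum of squared deviations from the mean."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def std_dev(dist):
    """σ: the square root of the variance."""
    return variance(dist) ** 0.5
```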
The book publishing example again
 Recall that for our new book on stat NLP, the expected profit is €50,000
 What’s the standard deviation?
 estimate (x − 50,000)² for all x
 multiply each by p(x) and sum
 take the square root of the result
 This is left as an exercise…
Skewness
 The mean gives us the “centre” of a distribution.
 Standard deviation gives us dispersion.
 Skewness (denoted γ “gamma”) is a measure of the symmetry
of the outcomes.
 γ = Σ_x (x − μ)³ p(x) / σ³
Skewness, continued
 The formula divides the average cubed deviation by the standard deviation cubed.
 Why cubed?
 The cube of a positive deviation is positive; the cube of a negative deviation is negative. We want both, as we want to capture deviations to the left (–ve) and to the right (+ve) of the mean.
 Like the variance, this gives large deviations in either direction more weight.
 If the outcomes are symmetrical around the mean, then +ve and –ve
deviations are balanced, and skewness is 0.
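A corresponding sketch for skewness, following the formula above (again assuming a dict mapping each value x to p(x)):

```python
def skewness(dist):
    """γ: the weighted average of cubed deviations, divided by σ cubed."""
    mu = sum(x * p for x, p in dist.items())
    sigma = sum((x - mu) ** 2 * p for x, p in dist.items()) ** 0.5
    return sum((x - mu) ** 3 * p for x, p in dist.items()) / sigma ** 3

# a distribution that is symmetrical around its mean has skewness 0
print(skewness({-1: 0.5, 1: 0.5}))  # 0.0
```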
Graphical display of skewness
 Positive skewness: tail going right
 Negative skewness: tail going left
Skewness and language
 By Zipf’s law (next week), word frequencies do not cluster
around the mean.
 There are a few highly frequent words (making up a large proportion
of overall word frequency)
 There are many highly infrequent words (f = 1 or f = 2)
 So the Zipfian distribution is highly skewed.
 We will hear more on the Zipfian distribution in the next lecture.
The concept of information
What is information?
 Main ingredient:
 an information source, which “transmits” symbols from a finite
alphabet S
 every symbol is denoted si
 we call a sequence of such symbols a text
 assume a probability distribution s.t. every si has probability p(si)
 Example:
 a die is an information source; every throw yields a symbol from the
alphabet {1,2,3,4,5,6}
 6 successive throws yield a text of 6 symbols
Quantifying information
 Intuition:
 the more probable a symbol is, the less information it yields
 “something seen very often is not very surprising”
 So information is the log of the inverse probability of the symbol:
 I(s_i) = log_b (1 / p(s_i)) = −log_b p(s_i)
 for some base b > 1; usually we use base 2
 Another term for I(s) is surprisal
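A one-line Python sketch of surprisal in base 2 (illustrative):

```python
from math import log2

def surprisal(p):
    """I(s) = -log2 p(s), in bits."""
    return -log2(p)

print(surprisal(1 / 6))  # ≈ 2.58 bits: one throw of a fair six-sided die
print(surprisal(1.0))    # 0.0 bits: a certain symbol carries no information
```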
Properties of I
1. Non-negative
2. If p(s) = 1, then I(s) = 0
3. If two events s1, s2 are independent, then:
   I(s1, s2) = I(p(s1) · p(s2)) = I(s1) + I(s2)
4. Monotonic: slight changes in probability result in slight changes in I
Aggregate measure of information
 What is the information content of a text (sequence of symbols)?
 this is the same as finding the average information of a random variable
 the measure is called Entropy, denoted H
1. Define X as a random variable over the symbols in our alphabet, with P(s) = P(X=s) for all s in the alphabet
2. Estimate H(P)
Entropy
 The entropy (or information) of a probability distribution is
H_b(P) = Σ_{s∈S} P(s) log_b (1 / P(s)) = −Σ_{s∈S} P(s) log_b P(s)
 entropy is the expected value (mean) of the surprisal
 the value is interpreted as the number of “bits” of information
Entropy example
 Source = an 8-sided die
 Alphabet S = {1,2,3,4,5,6,7,8}
 every si has p = 1/8
H(P) = −Σ_{i=1}^{8} p(s_i) log p(s_i) = −Σ_{i=1}^{8} (1/8) log (1/8) = −log (1/8) = log 8 = 3
Interpretation of entropy
 The information contained in the distribution P (the more
unpredictable the outcomes, the higher the entropy)
 The message length if the message was generated according to
P and coded optimally
Interpretation cont/d
 For the 8-sided die example, the result H(P)=3 tells us we need 3
bits on average to “transmit” the result of rolling an 8-sided die:
Outcome:  1    2    3    4    5    6    7    8
Code:     001  010  011  100  101  110  111  000
 We can’t do it in less than 3 bits
Entropy for multiple variables
 So far we have dealt with a single random variable
 The joint entropy of a pair of RVs:
H(X, Y) = Σ_{x∈X} Σ_{y∈Y} P(x, y) log_b (1 / P(x, y)) = −Σ_{x∈X} Σ_{y∈Y} P(x, y) log_b P(x, y)
Conditional Entropy
 Given X and Y, how much uncertainty about Y remains once we know X?
 a version of entropy using conditional probability: H(Y|X)
H(Y|X) = Σ_{x∈X} P(x) H(Y | X = x)
       = Σ_{x∈X} P(x) [ −Σ_{y∈Y} P(y|x) log P(y|x) ]
       = −Σ_{x∈X} Σ_{y∈Y} P(x) P(y|x) log P(y|x)
Mutual information
 Just as probability can change based on posterior knowledge,
so can information.
 Suppose our distribution gives us the probability P(a) of
observing the symbol a.
 Suppose we first observe the symbol b.
 If a and b are not independent, this should alter our
information state with respect to the probability of observing
a.
 i.e. we can compute p(a|b)
Mutual info between two symbols
 The change in our information about a on observing b is:
I(a; b) = log (1 / P(a)) − log (1 / P(a|b)) = log ( P(a|b) / P(a) )
 If a and b are completely independent, I(a;b)=0.
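A sketch of this pointwise quantity (illustrative):

```python
from math import log2

def pmi(p_a, p_a_given_b):
    """I(a; b) = log2( P(a|b) / P(a) )."""
    return log2(p_a_given_b / p_a)

print(pmi(0.25, 0.25))  # 0.0: observing b leaves the probability of a unchanged
print(pmi(0.25, 0.50))  # 1.0: b makes a twice as likely, a gain of one bit
```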
Averaging mutual information
 We want to average mutual information between all values of a
random variable A and those of a random variable B.
I(A; b) = Σ_i P(a_i|b) · I(a_i; b) = Σ_i P(a_i|b) log ( P(a_i|b) / P(a_i) )
 And similarly:
I(a; B) = Σ_j P(b_j|a) log ( P(b_j|a) / P(b_j) )
Combining the two…
I(A; B) = Σ_i P(a_i) · I(a_i; B)
        = Σ_i Σ_j P(a_i, b_j) log ( P(a_i, b_j) / (P(a_i) P(b_j)) )
        = I(B; A)
 Thus, mutual info involves taking the joint probability and
dividing by the individual probabilities.
 I.e. a comparison of the likelihood of observing a, b together
vs. separately.
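A sketch of the averaged quantity, computed directly from a joint distribution {(a, b): P(a, b)} (illustrative; the marginals are derived inside the function):

```python
from math import log2

def mutual_information(joint):
    """I(A; B) = sum of P(a, b) * log2( P(a, b) / (P(a) P(b)) ) over all pairs (a, b)."""
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(p * log2(p / (p_a[a] * p_b[b])) for (a, b), p in joint.items() if p > 0)

# independent variables share no information: I(A; B) = 0
print(mutual_information({(a, b): 0.25 for a in "xy" for b in "uv"}))  # 0.0
```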
Mutual Information: summary
 Gives a measure of reduction in uncertainty about a random
variable X, given knowledge of Y
 quantifies how much information about X is contained in Y
Some more on I(X;Y)
 In statistical NLP, we often calculate pointwise mutual
information
 this is the mutual information between two points on a
distribution
 I(x;y) rather than I(X;Y)
 used for some applications in lexical acquisition
Mutual Information -- example
 Suppose we’re interested in the collocational strength of two
words x and y
 e.g. bread and butter
 mutual information quantifies the likelihood of observing x and y
together (in some window)
 If there is no interesting relationship, knowing about bread
tells us nothing about the likelihood of encountering butter
 Here, P(x,y) = P(x)P(y) and I(x;y) = 0
 This is the Church and Hanks (1990) approach.
 NB. The approach uses pointwise MI
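As an illustration of the pointwise approach (a hedged sketch: the counts are invented and the windowing is simplified), PMI can be estimated from corpus counts as follows:

```python
from math import log2

def pointwise_mi(count_xy, count_x, count_y, n):
    """I(x; y) = log2( P(x, y) / (P(x) P(y)) ), with probabilities estimated
    from corpus counts; n is the total number of words (or windows)."""
    return log2((count_xy / n) / ((count_x / n) * (count_y / n)))

# hypothetical counts: "bread" and "butter" co-occur far more often than chance predicts
print(pointwise_mi(count_xy=20, count_x=500, count_y=300, n=1_000_000))  # ≈ 7.06 bits
```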