Association Measures
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien

Reminder: Contingency Tables
[contingency table figure not reproduced]

General Remarks
• we will only use data from contingency tables
• we will consider each pair type on its own, independently of all other pair types (→ no distributional information)
• we won't distinguish between relational and positional cooccurrences

Association Measures (AMs)
• goal: assign association score to each pair type = strength of association between components
• high score = strong association
• association in a statistical sense, but there is no precise definition
• positive vs. negative association ("colourless green ideas")

Using Association Scores
• absolute values (cut-off threshold)
• input for higher-order statistics (AMs are first-order statistics)
→ scores should be meaningful
• ranking of collocation candidates
→ only relative scores matter
• rank collocates of given base
→ one marginal frequency fixed
→ only two free parameters

First Steps: Proportions
• Workshop on Mechanized Documentation (Washington, 1964)
$P_1 = \frac{O_{11}}{R_1}, \qquad P_2 = \frac{O_{11}}{C_1}$
• proportions between 0 and 1
• high proportion = strong (directional) association
• need to combine two proportions into a single association score
• average (P1 + P2) / 2 is not useful
  • f=1, f1=1, f2=1000 → avg. = 0.5005
  • f=50, f1=100, f2=100 → avg. = 0.5
→ more "conservative" weighting
• harmonic mean
• geometric mean
• minimum
• Jaccard
• coefficients range from 0 to 1
• 1 = total (positive) association
• interpretation of lower scores is less clear
  • positive vs. negative association?
  • which score for no association?
  • what is "no association"?
→ random combinations

Expected Frequencies
• assume that types u and v cooccur only by chance
• f1(u) occs. of u and f2(v) occs. of v spread randomly over N tokens
• each instance of u has a chance of f2(v)/N to cooccur with a v
→ expected # of cooccurrences:
$f_1(u) \cdot \frac{f_2(v)}{N} = \frac{R_1 C_1}{N} =: E_{11}$
• expected frequencies for all cells of the contingency table
• assuming random combinations (→ statistical independence)
• comparison of expected against observed frequencies
• note that row and column sums are the same for both tables

Mutual Information
• compares O11 with E11
• ratio O11/E11 ranges from 0 to ∞
• 1 = no association (O11 = E11)
• usually logarithmic values
  • range: −∞ to +∞
  • 0 = no assoc., < 0 neg., > 0 pos.
• used in English lexicography

Low-Frequency Pairs & Random Variation
• large amount of low-frequency data (consequence of Zipf's law)
• a simple (invented) example
  • A: f=50, f1=100, f2=100, N=1000 → O11=50, E11=10, MI = log 5
  • B: f=1, f1=1, f2=1, N=1000 → O11=1, E11=.001, MI = log 1000
• three problems with case B
  • how meaningful is a single example? (not very much, actually)
  • could well be a spelling mistake or noise from automatic processing
  • we want to make generalisations (from particular corpus to "language")
→ this is the domain of statistics: draw inferences about population (= language) from a sample (= corpus)

The Statistical Model: Random Sample
• assumption: corpus data is a random sample from the language
→ base data is a random sample from all coocs. in the language
• random sample of size N is described by random variables Ui and Vi (i = 1..N), representing the labels of the i-th bigram token
• notation: U and V as "prototypes"
• for a given pair type (u,v), contingency table can be computed from Ui and Vi
→ random variables X11, X12, X21, X22
• population parameters τ11, τ12, τ21, τ22 for pair type (u,v)
• observed frequencies O11, O12, O21, O22 represent one particular realisation of the sample
• theory of random samples predicts distribution of X11, X12, X21, X22 from assumptions about the population parameters τ11, τ12, τ21, τ22

Two Footnotes
• vector notation for cont. tables
$\vec{X} = (X_{11}, X_{12}, X_{21}, X_{22}), \quad \vec{O} = (O_{11}, O_{12}, O_{21}, O_{22}), \quad \vec{k} = (k_{11}, k_{12}, k_{21}, k_{22})$
• population ≠ general language
  • restricted to domain(s), genre(s), ... covered by source corpus
  • e.g. black box in computer science vs. newspapers vs. cooking

The Sampling Distribution
• multinomial sampling distribution
• each individual cell count Xij has a binomial distribution (but these are not independent)
• given assumptions about the population parameters, we can compute the likelihood of the observed contingency table
• relatively high likelihood = consistent with assumptions
• relatively low likelihood = evidence against assumptions (inversely proportional to likelihood)

Adequacy of the Statistical Model
• particular sequence of pair tokens is irrelevant, only the overall frequencies matter (→ sufficiency)
• randomness assumption (random sample from fixed population)
  • independence of pair tokens
  • constancy of population parameters
• violations problematic only when they affect sampling distribution
• three causes of non-randomness
  • local dependencies (e.g. syntax) → usually not problematic
  • inhomogeneity of source corpus (speakers, domains, topics, ...) → mixture population
  • repetition / clustering of bigrams → can be a serious problem (does not affect segment-based data if clustered within segments)

Making Assumptions about the Population Parameters
• population parameters (π, π1, π2) are unknown
• best guess from observation: MLE = maximum-likelihood estimate
• conditional probabilities with MLE
$P(V = v \mid U = u) = \frac{\pi}{\pi_1} \approx \frac{O_{11}}{R_1}, \qquad P(U = u \mid V = v) = \frac{\pi}{\pi_2} \approx \frac{O_{11}}{C_1}$
• Dice coefficient etc. are MLE for population characteristics
• MI is MLE for log(π / (π1 · π2))
→ unreliable for small frequencies

The Null Hypothesis
• null hypothesis H0: no association = independence of instances, i.e. P(U=u ∧ V=v) = P(U=u) · P(V=v)
• not all parameters determined
• MLE maximise probability of observed data under H0

Likelihood Measures
• probability of observed data under H0 (with MLE)
• probability of single cell: X11 should be most "informative"
• small likelihood values = strong association
• computed probabilities are often extremely small
• use negative base-10 logarithm
→ more convenient scale
→ high scores indicate strong association

Problems of Likelihood Measures
• three reasons for low likelihood
  • observed data is inconsistent with the null hypothesis because of strong association
  • association may also be negative (fewer coocs. than expected)
  • observed data is consistent, but probability mass is spread across many similar contingency tables
• high frequency = low likelihood
• e.g. binomial likelihood
  • O11=1, E11=1 → L = 0.3679
  • O11=1000, E11=1000 → L = 0.0126
  • O11=4, E11=1 → L ≈ 0.0126
• need to "normalise" likelihood
• NB: likelihood association measures often have good empirical results nonetheless

Likelihood Ratios
• simplest normalisation technique
• divide maximum probability of data under H0 by unconstrained maximum probability
• suggested by Dunning (1993)

Statistical Hypothesis Tests
• compute probability of group of outcomes instead of single one
• observed contingency table is grouped with all tables that provide at least the same amount of evidence against H0
• total probability is known as the p-value or significance
• problem: ranking of cont. tables

Asymptotic Tests
• asymptotic tests define the ranking of contingency tables explicitly
• compute test statistic from data
• higher values = more evidence against H0
• can use test statistic as an AM
• theory: approximation of p-value associated with test statistic (accurate in the limit N → ∞)
• standard test for independence is Pearson's chi-squared test
• limiting distribution = χ² distribution with df=1
• number of degrees of freedom was subject of a long debate

Two-Sided Tests
• chi-squared test is two-sided, i.e. no difference between positive and negative association
• ignore small number of pairs with (non-total) negative association
• or convert to one-sided test: reject H0 only when O11 > E11
• p-value is usually divided by 2

Yates' Continuity Correction
• Pearson's chi-squared test approximates discrete binomial distributions of each cell by continuous normal distribution (→ "normal theory")
• estimating probabilities P(Xij ≥ k) from normal distribution introduces systematic errors
[two figure slides not reproduced]
• generic form of Yates' continuity correction for contingency tables
• usefulness is still controversial (criticised as too conservative)
• applicability for chi-squared test is generally accepted

Asymptotic Tests
• different form of chi-squared test (comparison of two binomials) is equivalent to independence test
• special equation with Yates' correction
• can also use log-likelihood ratio as a test statistic (two-sided)
• limiting distribution is found to be χ² distribution with df=1
• more conservative than Pearson's chi-squared test
• Dunning (1993) showed that Pearson's test over-estimates evidence against H0 (simulation)

Something I'd Rather Not Mention
• Church & Hanks: O11 and E11 are both random variables
• H0: expected values are equal
• assume normal distribution with unknown variance
• compare O11 and E11 with Student's t-test, estimating unknown variance from the observed data
• one-sided test
• statistical model is questionable
• limiting distribution: t-distribution with df ≈ N
• even more conservative than log-likelihood (low-frequency data)

Exact Tests
• problem: how to establish ranking of contingency tables
• solution: reduce set of alternatives
• if we consider only the cell X11, the difference X11 − E11 gives a sensible ranking: binomial test
• another solution: marginal frequencies do not provide evidence for or against H0 (→ "ancillary" statistics)
• condition on fixed row and column sums R1, R2, C1, C2
• conditional hypergeometric distribution does not depend on parameters π1 and π2
• X11 is the only free parameter
• we can use X11 − E11 for ranking
• Fisher's exact test (Pedersen 1996)
  • computationally expensive
  • numerical difficulties

Comparing Hypothesis Tests
• Fisher's test is now widely accepted as most appropriate
  • tends to be conservative
• log-likelihood gives good approximation of "correct" p-values (slightly less conservative)
• chi-squared over-estimates
• t-score far too conservative

Other Approaches to Measuring Association
• information-theoretic (MI, entropy) → equivalent to log-likelihood
• combined measures ("boosting")
• conservative estimates instead of MLE (confidence intervals)
• hypothesis tests with different null hypothesis: π = C · π1 · π2
• mixture of conservative estimates and hypothesis tests?

Implementation
• one-sided vs. two-sided tests
• need special software to obtain p-values for asymptotic tests
• numerical accuracy
• beware of zero frequencies!

Errr.... Help!? Software?
• Ted Pedersen's N-gram Statistics Package (NSP) [Perl, portable, easy to use]
• UCS Toolkit will be available soon from www.collocations.de [Perl/Linux, some prerequisites, for the more ambitious :o) ]

More Association Measures
• lots of association measures
• will be updated
• references
• slides from this course
• under construction

Comparing Association Measures
• mathematical discussion
  • very complex
  • results only for special cases
• numerical simulation
  • computationally expensive
  • Dunning (1993, 1998)
• lazy man's approach
  • construct mock data set where frequencies vary systematically
[plots for N = 100,000 and N = 10,000,000 not reproduced]
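
The proportion-based coefficients, expected frequencies and MI from the slides can be tried out with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, the formulas for the harmonic mean, geometric mean and Jaccard coefficient are the standard textbook ones (the slides list the names only), the base-2 logarithm is an assumption because the slides leave the base open, and N = 100000 in the first usage example is invented since the slide gives no sample size there.

```python
import math

def contingency(f, f1, f2, n):
    """Observed table (O11, O12, O21, O22) from joint frequency f,
    marginal frequencies f1, f2 and sample size n."""
    return (f, f1 - f, f2 - f, n - f1 - f2 + f)

def expected(o):
    """Expected frequencies under independence: E_ij = R_i * C_j / N."""
    o11, o12, o21, o22 = o
    r1, r2 = o11 + o12, o21 + o22   # row sums
    c1, c2 = o11 + o21, o12 + o22   # column sums
    n = r1 + r2
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)

def simple_coefficients(o):
    """Ways of combining the proportions P1 = O11/R1 and P2 = O11/C1.
    Assumes O11 > 0 (beware of zero frequencies)."""
    o11, o12, o21, _ = o
    p1 = o11 / (o11 + o12)
    p2 = o11 / (o11 + o21)
    return {
        "average": (p1 + p2) / 2,
        "harmonic": 2 * p1 * p2 / (p1 + p2),   # equals the Dice coefficient
        "geometric": math.sqrt(p1 * p2),
        "minimum": min(p1, p2),
        "jaccard": o11 / (o11 + o12 + o21),
    }

def mutual_information(o, base=2):
    """log(O11 / E11); the base is an assumption. Undefined for O11 = 0."""
    return math.log(o[0] / expected(o)[0], base)

# The average is misleading for a single cooccurrence with one rare word:
print(simple_coefficients(contingency(1, 1, 1000, 100000)))
# Cases A and B from the low-frequency slide:
case_a = contingency(50, 100, 100, 1000)
case_b = contingency(1, 1, 1, 1000)
print(mutual_information(case_a), mutual_information(case_b))
```

Case B gets the higher MI score although it rests on a single observation, which is the problem the low-frequency slide points out.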

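
The test-based measures can be sketched in the same way. The code below assumes the usual textbook forms: Pearson's statistic as the sum of (O − E)² / E over all four cells, Yates' correction as subtracting 0.5 from each |O − E|, the log-likelihood statistic as 2 Σ O · ln(O / E), and a one-sided Fisher p-value as the hypergeometric probability of X11 ≥ O11 with fixed margins. The slides do not spell these formulas out, so treat the function names and details as illustrative assumptions. Exact integer arithmetic via `math.comb` (Python 3.8+) sidesteps some of the numerical difficulties mentioned for Fisher's test, at the price of speed.

```python
import math

def margins(o):
    """Row and column sums (R1, R2, C1, C2) of the table (O11, O12, O21, O22)."""
    o11, o12, o21, o22 = o
    return o11 + o12, o21 + o22, o11 + o21, o12 + o22

def expected(o):
    """Expected frequencies under H0 (independence)."""
    r1, r2, c1, c2 = margins(o)
    n = r1 + r2
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)

def chi_squared(o, yates=False):
    """Pearson's chi-squared statistic, optionally with Yates' correction.
    Assumes all four margins are non-zero (otherwise some E_ij is 0)."""
    corr = 0.5 if yates else 0.0
    return sum(max(abs(oi - ei) - corr, 0.0) ** 2 / ei
               for oi, ei in zip(o, expected(o)))

def log_likelihood(o):
    """Log-likelihood ratio statistic; cells with O = 0 contribute nothing."""
    return 2 * sum(oi * math.log(oi / ei)
                   for oi, ei in zip(o, expected(o)) if oi > 0)

def fisher_one_sided(o):
    """P(X11 >= O11) under the hypergeometric distribution with fixed margins."""
    r1, r2, c1, c2 = margins(o)
    n = r1 + r2
    numerator = sum(math.comb(c1, k) * math.comb(c2, r1 - k)
                    for k in range(o[0], min(r1, c1) + 1))
    return numerator / math.comb(n, r1)

# Cases A and B from the low-frequency slide (N = 1000):
case_a = (50, 50, 50, 850)
case_b = (1, 0, 0, 999)
for name, table in (("A", case_a), ("B", case_b)):
    print(name, chi_squared(table), chi_squared(table, yates=True),
          log_likelihood(table), fisher_one_sided(table))
```

Both statistics are two-sided, so a ranking based on them should first check whether O11 > E11, as the slide on two-sided tests suggests.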