IMS
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien
Association Measures
Reminder: Contingency Tables
General Remarks
• we will only use data from
contingency tables
• we will consider each pair type
on its own, independently from all
other pair types
(→ no distributional information)
• we won't distinguish between
relational and positional
cooccurrences
Association Measures (AMs)
• goal: assign association score to
each pair type = strength of
association between components
• high score = strong association
• association in a statistical sense,
but there is no precise definition
• positive vs. negative association
("colourless green ideas")
Using Association Scores
• absolute values (cut-off threshold)
• input for higher-order statistics
(AMs are first-order statistics)
→ scores should be meaningful
• ranking of collocation candidates
→ only relative scores matter
• rank collocates of given base
→ one marginal frequency fixed
→ only two free parameters
First Steps: Proportions
• Workshop on
Mechanized Documentation
(Washington, 1964)
P1 = O11 / R1
P2 = O11 / C1
• proportions between 0 and 1
• high proportion =
strong (directional) association
• need to combine two proportions
into a single association score
• average (P1 + P2) / 2 is not useful
• f=1, f1=1, f2=1000 → avg. = 0.5005
• f=50, f1=100, f2=100 → avg. = 0.5
→ more "conservative" weighting
• harmonic mean
• geometric mean
• minimum
• Jaccard
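The formulas for these four coefficients did not survive in the transcript. The sketch below uses their standard textbook definitions in terms of O11, R1 and C1, which is an assumption about what the slide showed; the function name proportion_scores is hypothetical.

```python
def proportion_scores(o11, r1, c1):
    """Combine P1 = O11/R1 and P2 = O11/C1 into single scores in [0, 1].

    Standard definitions, assumed here because the slide formulas are lost.
    """
    p1 = o11 / r1
    p2 = o11 / c1
    return {
        "average": (p1 + p2) / 2,
        "harmonic": 2 * o11 / (r1 + c1),        # harmonic mean of P1 and P2
        "geometric": o11 / (r1 * c1) ** 0.5,    # sqrt(P1 * P2)
        "minimum": min(p1, p2),
        "jaccard": o11 / (r1 + c1 - o11),       # O11 / (O11 + O12 + O21)
    }


# The two pairs from the averaging example: f=1, f1=1, f2=1000 and f=50, f1=100, f2=100
rare = proportion_scores(1, 1, 1000)
solid = proportion_scores(50, 100, 100)
```

For the first pair the plain average is 0.5005, while harmonic mean, geometric mean, minimum and Jaccard all stay far below the scores of the second pair, which is the more "conservative" weighting asked for above.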
• coefficients range from 0 to 1
• 1 = total (positive) association
• interpretation of lower scores
is less clear
• positive vs. negative association?
• which score for no association?
• what is "no association"??
→ random combinations
Expected Frequencies
• assume that types u and v
cooccur only by chance
• f1(u) occs. of u and f2(v) occs. of v
spread randomly over N tokens
• each instance of u has a chance
of f2(v)/N to cooccur with a v
→ expected # of cooccurrences:
f1(u) · f2(v) / N = R1 · C1 / N =: E11
• expected frequencies for all cells
of the contingency table
• assuming random combinations
(→ statistical independence)
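A minimal sketch of the expected-frequency table under independence, computed from the row and column sums of an observed table. The function name and the example counts are illustrative assumptions, not taken from the slides.

```python
def expected_frequencies(o11, o12, o21, o22):
    """Return (E11, E12, E21, E22) for a 2x2 table, assuming independence."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22   # row sums
    c1, c2 = o11 + o21, o12 + o22   # column sums
    # Eij = Ri * Cj / N
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)


# Hypothetical table with R1 = C1 = 100 and N = 1000
expected = expected_frequencies(50, 50, 50, 850)
```

The expected table has the same row and column sums as the observed one; only the distribution of counts over the four cells differs.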
• comparison of expected against
observed frequencies
• note that row and column sums
are the same for both tables
Mutual Information
• compares O11 with E11
• ratio O11/E11 ranges from 0 to ∞
• 1 = no association (O11=E11)
• usually logarithmic values
• range: -∞ to +∞
• 0 = no assoc., < 0 neg., > 0 pos.
• used in English lexicography
Low-Frequency Pairs
& Random Variation
• large amount of low-frequency
data (consequence of Zipf's law)
• a simple (invented) example
• A: f=50, f1=100, f2=100, N=1000
→ O11=50, E11=10, MI = log 5
• B: f=1, f1=1, f2=1, N=1000
→ O11=1, E11=0.001, MI = log 1000
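The two cases can be checked with a short sketch. The slides do not state the base of the logarithm; base 2 is assumed here as a default, and the function name is hypothetical.

```python
import math


def mutual_information(f, f1, f2, n, base=2):
    """MI = log(O11 / E11) with E11 = f1 * f2 / N.

    The base-2 default is an assumption. Requires f > 0, since
    the logarithm of a zero frequency is undefined.
    """
    e11 = f1 * f2 / n
    return math.log(f / e11, base)


mi_a = mutual_information(50, 100, 100, 1000)   # log2(5), roughly 2.32
mi_b = mutual_information(1, 1, 1, 1000)        # log2(1000), roughly 9.97
```

The single occurrence in case B outranks the fifty cooccurrences in case A by a wide margin, which is the problem discussed next.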
• three problems with case B
• how meaningful is a single example?
(not very much, actually)
• could well be a spelling mistake or
noise from automatic processing
• we want to make generalisations
(from particular corpus to "language")
→ this is the domain of statistics:
draw inferences about population
(=language) from a sample (=corpus)
The Statistical Model:
Random Sample
• assumption: corpus data is a
random sample from the language
→ base data is a random sample
from all coocs. in the language
• random sample of size N is
described by random variables
Ui and Vi (i = 1..N), representing
the labels of the i-th bigram token
• notation: U and V as "prototypes"
• for a given pair type (u,v),
contingency table can be
computed from Ui and Vi
→ random variables X11, X12, X21, X22
• population parameters
τ11, τ12, τ21, τ22 for pair type (u,v)
• observed frequencies
O11, O12, O21, O22 represent one
particular realisation of the sample
• theory of random samples predicts
distribution of X11, X12, X21, X22 from
assumptions about the population
parameters τ11, τ12, τ21, τ22
Two Footnotes
• vector notation for cont. tables
X = (X11, X12, X21, X22)
O = (O11, O12, O21, O22)
k = (k11, k12, k21, k22)
• population ≠ general language
• restricted to domain(s), genre(s), ...
covered by source corpus
• e.g. black box in computer science
vs. newspapers vs. cooking
The Sampling Distribution
• multinomial sampling distribution
• each individual cell count Xij has
a binomial distribution
(but these are not independent)
• given assumptions about the
population parameters, we can
compute the likelihood of the
observed contingency table
• relatively high likelihood
= consistent with assumptions
• relatively low likelihood
= evidence against assumptions
(inversely proportional to likelihood)
Adequacy of the Statistical Model
• particular sequence of pair tokens
is irrelevant, only the overall
frequencies matter (→ sufficiency)
• randomness assumption (random
sample from fixed population)
• independence of pair tokens
• constancy of population parameters
• violations problematic only when
they affect sampling distribution
• three causes of non-randomness
• local dependencies (e.g. syntax)
→ usually not problematic
• inhomogeneity of source corpus
(speakers, domains, topics, ...)
→ mixture population
• repetition / clustering of bigrams
→ can be a serious problem
(does not affect segment-based
data if clustered within segments)
Making Assumptions about the
Population Parameters
• population parameters (π, π1, π2)
are unknown
• best guess from observation: MLE
= maximum-likelihood estimate
• conditional probabilities with MLE
P(V = v | U = u) = π / π1 ≈ O11 / R1
P(U = u | V = v) = π / π2 ≈ O11 / C1
• Dice coefficient etc. are MLE for
population characteristics
• MI is MLE for log(π / (π1 · π2))
→ unreliable for small frequencies
The Null Hypothesis
• null hypothesis H0: no association
= independence of instances, i.e.
P(U=u ∧ V=v) = P(U=u) · P(V=v)
• not all parameters determined
• MLE maximise probability of
observed data under H0
Likelihood Measures
• probability of observed data
under H0 (with MLE)
• probability of single cell: X11
should be most "informative"
• small likelihood values
= strong association
• computed probabilities are often
extremely small
• use negative base-10 logarithm
→ more convenient scale
→ high scores indicate strong
association
Problems of
Likelihood Measures
• three reasons for low likelihood
• observed data is inconsistent with
the null hypothesis because of
strong association
• association may also be negative
(fewer coocs. than expected)
• observed data is consistent, but
probability mass is spread across
many similar contingency tables
• high frequency = low likelihood
• e.g. binomial likelihood
• O11=1, E11=1 → L = 0.3679
• O11=1000, E11=1000 → L = 0.0126
• O11=4, E11=1 → L ≈ 0.0126
• need to "normalise" likelihood
• NB: likelihood association
measures often have good
empirical results nonetheless
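The likelihood values above can be approximated with a small sketch. It swaps the binomial distribution for its Poisson limit (mean E11), which is an assumption that holds when N is large compared to E11; the function names are hypothetical. The computation stays in log space because the probabilities are often extremely small.

```python
import math


def poisson_log10_likelihood(o11, e11):
    """log10 of the Poisson probability P(X11 = o11) with mean e11 > 0.

    Poisson approximation to the binomial likelihood (an assumption).
    """
    ln_p = o11 * math.log(e11) - e11 - math.lgamma(o11 + 1)
    return ln_p / math.log(10)


def poisson_likelihood_score(o11, e11):
    """Negative base-10 logarithm: high scores = low likelihood."""
    return -poisson_log10_likelihood(o11, e11)


likelihoods = [10 ** poisson_log10_likelihood(o, e)
               for o, e in [(1, 1), (1000, 1000), (4, 1)]]
```

Under this approximation the first two values come out at about 0.3679 and 0.0126, and the third near 0.015, the same order of magnitude as the second. A pair that matches its expected frequency exactly thus gets roughly the same score as a pair with four times the expected count, which is why some normalisation is needed.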
Likelihood Ratios
• simplest normalisation technique
• divide maximum probability of
data under H0 by unconstrained
maximum probability
• suggested by Dunning (1993)
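A sketch of the resulting statistic in its usual form, -2 log λ, where λ is the ratio described above. For the multinomial model, λ is the product of (Eij/Oij)^Oij over the four cells. The function name is hypothetical, and cells with Oij = 0 are skipped because they contribute nothing to the sum.

```python
import math


def log_likelihood_ratio(o11, o12, o21, o22):
    """G2 = -2 log(lambda) = 2 * sum of Oij * log(Oij / Eij) over all cells."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    observed = (o11, o12, o21, o22)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    total = 0.0
    for o, e in zip(observed, expected):
        if o > 0:                      # zero cells contribute 0 (limit of x log x)
            total += o * math.log(o / e)
    return 2 * total


g2_assoc = log_likelihood_ratio(50, 50, 50, 850)   # hypothetical associated pair
g2_indep = log_likelihood_ratio(10, 90, 90, 810)   # observed = expected
```

A table that matches its expected frequencies exactly scores 0, and the score grows as the observed table moves away from independence in either direction.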
Statistical Hypothesis Tests
• compute probability of group of
outcomes instead of single one
• observed contingency table is
grouped with all tables that
provide at least the same amount
of evidence against H0
• total probability is known as the
p-value or significance
• problem: ranking of cont. tables
Asymptotic Tests
• asymptotic tests define the ranking
of contingency tables explicitly
• compute test statistic from data
• higher values =
more evidence against H0
• can use test statistic as an AM
• theory: approximation of p-value
associated with test statistic
(accurate in the limit N → ∞)
• standard test for independence is
Pearson's chi-squared test
• limiting distribution =
χ² distribution with df=1
• number of degrees of freedom
was subject of a long debate
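A minimal sketch of Pearson's test statistic for a 2x2 table, written as the sum of (Oij - Eij)² / Eij over the four cells. The function name and the example table are illustrative assumptions.

```python
def pearson_chi_squared(o11, o12, o21, o22):
    """Pearson's X2 = sum over cells of (Oij - Eij)^2 / Eij."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    observed = (o11, o12, o21, o22)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


x2 = pearson_chi_squared(50, 50, 50, 850)   # hypothetical table, N = 1000
```

The function assumes that all row and column sums are non-zero; otherwise an expected frequency of zero leads to a division by zero.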
Two-Sided Tests
• chi-squared test is two-sided,
i.e. no difference between positive
and negative association
• ignore small number of pairs with
(non-total) negative association
• or convert to one-sided test:
reject H0 only when O11 > E11
• p-value is usually divided by 2
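For df=1 the p-value of a chi-squared statistic can be obtained from the complementary error function in the Python standard library, since P(χ² ≥ x) = erfc(√(x/2)). The sketch below also applies the one-sided conversion described above; the function names are hypothetical.

```python
import math


def chi2_p_value_df1(statistic):
    """Two-sided p-value: upper tail of the chi-squared distribution, df=1."""
    return math.erfc(math.sqrt(statistic / 2))


def one_sided_p_value(statistic, o11, e11):
    """Halve the two-sided p-value when O11 > E11 (positive association).

    For O11 <= E11 the evidence points the other way, so the one-sided
    p-value is the complement of the halved two-sided value.
    """
    p_two = chi2_p_value_df1(statistic)
    return p_two / 2 if o11 > e11 else 1 - p_two / 2


p_two = chi2_p_value_df1(3.841458820694124)   # the familiar 5% critical value
p_one = one_sided_p_value(3.841458820694124, 50, 10)
```

Because erfc keeps its precision in the far tail, this also works for the very large test statistics that are typical of cooccurrence data, until the result underflows to zero.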
Yates' Continuity Correction
• Pearson's chi-squared test
approximates discrete binomial
distributions of each cell by
continuous normal distribution
(→ "normal theory")
• estimating probabilities P(Xij ≥ k)
from normal distribution
introduces systematic errors
• generic form of Yates' continuity
correction for contingency tables
• usefulness is still controversial
(criticised as too conservative)
• applicability for chi-squared test
is generally accepted
Asymptotic Tests
• different form of chi-squared test
(comparison of two binomials) is
equivalent to independence test
• special eq. with Yates' correction
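The equation itself is not preserved in the transcript. The sketch below uses the standard shortcut form of the chi-squared statistic for a 2x2 table with Yates' correction applied, which is an assumption about what the slide showed; the function name is hypothetical.

```python
def chi_squared_yates(o11, o12, o21, o22):
    """X2 = N * (|O11*O22 - O12*O21| - N/2)^2 / (R1 * R2 * C1 * C2).

    Standard textbook form (assumed). The corrected difference is
    clamped at zero so that the correction never increases the statistic.
    """
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    diff = max(abs(o11 * o22 - o12 * o21) - n / 2, 0.0)
    return n * diff ** 2 / (r1 * r2 * c1 * c2)


x2_yates = chi_squared_yates(50, 50, 50, 850)   # hypothetical table, N = 1000
```

For this table the corrected value is slightly below the uncorrected statistic of about 197.53, in line with the correction being the more conservative choice.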
• can also use log-likelihood ratio
as a test statistic (two-sided)
• limiting distribution is found to be
χ² distribution with df=1
• more conservative than
Pearson's chi-squared test
• Dunning (1993) showed that
Pearson's test over-estimates
evidence against H0 (simulation)
Something I'd Rather Not Mention
• Church & Hanks: O11 and E11
are both random variables
• H0: expected values are equal
• assume normal distribution with
unknown variance
• compare O11 and E11 with
Student's t-test, estimating
unknown variance from the
observed data
• one-sided test
• statistical model is questionable
• limiting distribution:
t-distribution with df ≈ N
• even more conservative than
log-likelihood (low-frequency data)
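The slides give no formula for the t-score. The sketch below uses the form commonly found in the collocation literature, t = (O11 - E11) / √O11, where the variance of O11 is estimated by O11 itself; treating this as the intended formula is an assumption, and the function name is hypothetical.

```python
import math


def t_score(f, f1, f2, n):
    """t = (O11 - E11) / sqrt(O11), with E11 = f1 * f2 / N. Requires f > 0."""
    e11 = f1 * f2 / n
    return (f - e11) / math.sqrt(f)


t_a = t_score(50, 100, 100, 1000)   # 40 / sqrt(50), roughly 5.66
t_b = t_score(1, 1, 1, 1000)        # 0.999
```

With these invented frequencies the single cooccurrence scores below 1 while the frequent pair scores above 5, which illustrates how strongly the t-score discounts low-frequency data.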
Exact Tests
• problem: how to establish
ranking of contingency tables
• solution: reduce set of
alternatives
• if we consider only the cell X11,
the difference X11 – E11 gives a
sensible ranking: binomial test
• another solution: marginal
frequencies do not provide
evidence for or against H0
(→ "ancillary" statistics)
• condition on fixed row and
column sums R1, R2, C1, C2
• conditional hypergeometric
distribution does not depend on
parameters π1 and π2
• X11 is the only free parameter
• we can use X11 – E11 for ranking
• Fisher's exact test (Pedersen 1996)
• computationally expensive
• numerical difficulties
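A minimal sketch of the one-sided version of Fisher's test: with row and column sums fixed, the p-value is the hypergeometric probability of observing O11 or more cooccurrences. It uses exact integer arithmetic from the standard library (math.comb, Python 3.8 or later), which sidesteps the numerical difficulties but becomes slow for large counts. The function name is hypothetical.

```python
from math import comb


def fisher_one_sided(o11, o12, o21, o22):
    """P(X11 >= o11) under the hypergeometric distribution with fixed margins."""
    n = o11 + o12 + o21 + o22
    r1 = o11 + o12
    c1 = o11 + o21
    c2 = n - c1
    total = comb(n, r1)   # ways to choose the R1 tokens of the first row
    tail = sum(comb(c1, k) * comb(c2, r1 - k)
               for k in range(o11, min(r1, c1) + 1))
    return tail / total


p_single = fisher_one_sided(1, 0, 0, 999)    # f=1, f1=1, f2=1, N=1000
p_strong = fisher_one_sided(50, 50, 50, 850)
```

For the single cooccurrence the p-value is exactly 1/1000, which is weak evidence compared to the vanishingly small p-value of the frequent pair.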
Comparing Hypothesis Tests
• Fisher's test is now widely
accepted as most appropriate
• tends to be conservative
• log-likelihood gives good
approximation of "correct" p-values
(slightly less conservative)
• chi-squared over-estimates
• t-score far too conservative
Other Approaches to
Measuring Association
• information-theoretic (MI, entropy)
→ equivalent to log-likelihood
• combined measures ("boosting")
• conservative estimates instead of
MLE (confidence intervals)
• hypothesis tests with different null
hypothesis: π = C · π1 · π2
• mixture of conservative estimates
and hypothesis tests?
Implementation
• one-sided vs. two-sided tests
• need special software to obtain
p-values for asymptotic tests
• numerical accuracy
• beware of zero frequencies!
Errr.... Help!? Software?
• Ted Pedersen's
N-gram Statistics Package (NSP)
[Perl, portable, easy to use]
• UCS Toolkit will be available
soon from www.collocations.de
[Perl/Linux, some prerequisites,
for the more ambitious :o) ]
More Association Measures
• lots of association measures
• will be updated
• references
• slides from this course
• under construction
Comparing Association Measures
• mathematical discussion
• very complex
• results only for special cases
• numerical simulation
• computationally expensive
• Dunning (1993, 1998)
• lazy man's approach
• construct mock data set where
frequencies vary systematically
[Plots: comparison of association measures on the mock data set, for N = 100,000 and N = 10,000,000]