COLLOCATIONS II: HYPOTHESIS TESTING

John Fry
Boise State University

Linguistics 497: Corpus Linguistics, Spring 2011, Boise State University

Branches of statistics
• Descriptive statistics is for organizing and summarizing data
  – Histograms, mean, median, standard deviation
• Inferential statistics is used to draw conclusions about a population based on a sample (subset) of that population
  – Point estimation, hypothesis testing, confidence intervals
• Probability theory is the basis of, and in some sense the “inverse” of, inferential statistics
  – Probability reasons from the population to a sample
  – Inferential statistics reasons from the sample to the population
Sample mean x̄
• The sample mean x̄ of a set of numbers x1, x2, . . . , xn is

      x̄ = (x1 + x2 + · · · + xn)/n = (1/n) Σ_{i=1}^{n} xi

• Examples in R
> mean(c(2, 3, 4, 5, 3, 2, 5))
[1] 3.428571
> mean(c(2.1, 0.4, 4.2, 3.6, 4.2))
[1] 2.9

Sample variance s²
• The sample variance of a set of numbers x1, x2, . . . , xn is

      s² = Σ_{i=1}^{n} (xi − x̄)² / (n − 1)

• The sample standard deviation s is the positive square root of the sample variance: s = √s²
• Examples in R
> var(c(2, 3, 4, 5, 3, 2, 5))
[1] 1.619048
> sd(c(2, 3, 4, 5, 3, 2, 5))
[1] 1.272418
> sqrt(var(c(2, 3, 4, 5, 3, 2, 5)))
[1] 1.272418
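The same computations can be sketched outside R; a minimal Python version of the formulas above, using the data vector from the slide's R examples:

```python
import math

def sample_mean(xs):
    """x-bar = (x1 + ... + xn) / n"""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """s^2 = sum of (xi - x-bar)^2, divided by n - 1 (the form R's var() uses)."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    """s is the positive square root of the sample variance, like R's sd()."""
    return math.sqrt(sample_variance(xs))

data = [2, 3, 4, 5, 3, 2, 5]           # the vector from the R examples
print(round(sample_mean(data), 6))     # 3.428571
print(round(sample_variance(data), 6)) # 1.619048
print(round(sample_sd(data), 6))       # 1.272418
```

Note the n − 1 denominator: using n here would give a biased estimate, which is why the slides (and R) use the sample form.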
Mean value µ
• The mean value of a random variable X, denoted µX or just µ, is the same as the expectation E(X):

      µX = E(X) = Σ_{x∈X} p(x)·x

• Example: the mean value µ of a single throw of a die is

      µ = Σ_{x=1}^{6} p(x)·x = (1/6) Σ_{x=1}^{6} x = 21/6 = 3.5

Standard deviations in the Normal distribution
[figure]

Statistical hypothesis testing
• Informal examples of statistical hypotheses
  – That a given die or coin is unfair (loaded)
  – That a new drug is more effective than a placebo
• Formally, hypotheses are claims about the characteristics of the sampled population(s) (e.g., that p = 0.5 or µ1 = µ2)
• A hypothesis test is usually framed in terms of two competing hypotheses, the null hypothesis and the alternative hypothesis
• The null hypothesis is the default, ‘status quo’ hypothesis (e.g., ‘innocent until proven guilty’)
• We reject the null hypothesis only in the face of strong evidence for the alternative

Examples of null hypotheses
• Deciding whether a coin is fair or loaded
  – Null hypothesis: p = 0.5
  – Alternative hypothesis: p ≠ 0.5
• Deciding whether a treatment is better than a placebo
  – Null hypothesis: no difference between the mean response to treatment and to the placebo: µ1 = µ2
  – Alternative hypothesis: mean response to treatment is better than for the placebo: µ1 > µ2
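The die example can be checked mechanically; a small Python sketch of E(X) = Σ p(x)·x for a fair die, using exact fractions:

```python
from fractions import Fraction

p = Fraction(1, 6)                    # fair die: p(x) = 1/6 for every face
mu = sum(p * x for x in range(1, 7))  # E(X) = sum over x of p(x) * x
print(mu)                             # 7/2
print(float(mu))                      # 3.5
```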
Steps in statistical hypothesis testing
1. Collect the data
2. Determine a null and alternative hypothesis
3. Calculate a test statistic for that null hypothesis (t, χ², etc.)
4. Determine how surprising this test statistic is under the null hypothesis
   (a) If the statistic is sufficiently extreme relative to the null hypothesis, then we reject the null hypothesis
   (b) If the statistic is not sufficiently extreme relative to the null hypothesis, then we do not reject the null hypothesis

Evaluating a test statistic
[figure]

Hypothesis testing of collocations
• Hypothesis testing can be used to identify collocations
• If words x and y are independent, then P(xy) = P(x)P(y)
• In a true collocation, words x and y are highly dependent, so we expect P(xy) ≫ P(x)P(y)
• We can frame this question as a hypothesis test
  – Null hypothesis: P(xy) = P(x)P(y)
  – Alternative: P(xy) > P(x)P(y)
• We reject the null hypothesis only in the face of strong evidence for the alternative

The one-sample t test
• One common statistic for hypothesis testing is the t statistic

      t = (x̄ − µ) / √(s²/N)

• The t test looks at the mean x̄ and variance s² of a sample
• The null hypothesis is that the sample is drawn from a population with mean µ (that is, we expect x̄ ≈ µ)
• If t is high enough, we can reject the null hypothesis and conclude that the sample is not drawn from that population
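A minimal sketch of the one-sample t statistic in Python; the sample and the hypothesized population mean µ here are invented for illustration:

```python
import math

def t_statistic(xs, mu):
    """One-sample t = (xbar - mu) / sqrt(s^2 / N)."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # sample variance
    return (xbar - mu) / math.sqrt(s2 / n)

# Invented example: is this sample drawn from a population with mean mu = 3?
sample = [2, 3, 4, 5, 3, 2, 5]
print(round(t_statistic(sample, 3.0), 4))  # 0.8911 -- not at all extreme
```

With so small a t value we would not reject the null hypothesis for this sample.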
Finding collocations with the t test
• The t statistic can be used to test whether a bigram xy is a collocation; i.e., whether P(xy) ≫ P(x)P(y)

      t = (x̄ − µ) / √(s²/N)

• For this task we set x̄ = P(xy) and µ = P(x)P(y), and then approximate s² = P(xy)
• The Church et al. (1991) approximation of t is

      t ≈ (c(xy) − (1/N) c(x) c(y)) / √c(xy)

• If t is large enough, we can reject the null hypothesis that x and y are independent and accept xy as a collocation
• Specifically, statistical tables show that when t > 2.576 we can reject the null hypothesis with 99.5% confidence (p = 0.005)
• In practice, the t score is typically used to rank or compare collocations, rather than to accept or reject them based on a critical value
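The Church et al. approximation is easy to compute from raw counts; a sketch in Python, with all counts invented for illustration:

```python
import math

def t_score(c_xy, c_x, c_y, n):
    """Church et al. (1991): t ~ (c(xy) - c(x)c(y)/N) / sqrt(c(xy))."""
    return (c_xy - c_x * c_y / n) / math.sqrt(c_xy)

# Invented counts: bigram seen 20 times, its words 30 and 40 times, N = 100,000
print(round(t_score(20, 30, 40, 100_000), 4))  # 4.4694 -- above the 2.576 cutoff
```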
• Unfortunately, this approach forces us to accept virtually all bigrams xy as collocations, since it is almost always true that P(xy) ≫ P(x)P(y)

Bible bigrams ranked by t score

  T        C(x)   C(y)   C(xy)  bigram
  81.3465  34671  64023  11543  of the
  76.1978  64023   7964   7035  the lord
  56.4903  12667  64023   5031  in the
  47.8521   9838   7013   2461  shall be
  42.8622   8854   3836   1922  i will
  39.9544  51696  10419   2791  and he
  39.489    3999   8997   1649  said unto
  38.4577  34671   2575   1697  of israel
  37.4083   2392  34671   1602  son of
  36.4899   5620  64023   2144  all the
  35.671   64023   2542   1658  the king
  35.6205   2775  34671   1502  out of
  35.1866   1821  34671   1393  children of
  35.1293  51696   7376   2086  and they
  35.0394   5474   1616   1250  thou shalt

These are not collocations! The problem is that t favors frequent bigrams.

Bible bigrams ranked by t score where c(xy) = 10

  T        C(x)  C(y)  C(xy)  bigram
  3.16197    19    40     10  committeth adultery
  3.16196    66    12     10  golden spoon
  3.16191    29    32     10  solemn feasts
  3.16182    10   115     10  dearly beloved
  3.16158    31    56     10  due season
  3.16117    39    71     10  molten images
  3.16108   300    10     10  young pigeons
  3.16096    21   157     10  fir trees
  3.16015    65    82     10  deep sleep
  3.15979   520    12     10  thousand footmen
  3.1592    151    51     10  new moon
  3.15918   287    27     10  much less
  3.15871   194    46     10  unclean spirits
  3.15772    33   346     10  looketh toward
  3.15752   202    59     10  six months
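The near-identical scores in the c(xy) = 10 table have a simple explanation: when the expected count c(x)c(y)/N is tiny, t collapses to roughly √c(xy) = √10 ≈ 3.162, so the score carries almost no information beyond raw bigram frequency. A sketch, with counts modeled on the table and an arbitrarily chosen N:

```python
import math

def t_score(c_xy, c_x, c_y, n):
    """Church et al. (1991) approximation of t from raw counts."""
    return (c_xy - c_x * c_y / n) / math.sqrt(c_xy)

n = 800_000                     # arbitrary corpus size for illustration
print(round(math.sqrt(10), 5))  # 3.16228
for c_x, c_y in [(19, 40), (66, 12), (520, 12)]:
    # expected counts c(x)c(y)/N are all below 0.01, so t ~ sqrt(10)
    print(round(t_score(10, c_x, c_y, n), 5))
```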
Bible bigrams ranked by t score where c(xy) = 20

  T        C(x)   C(y)  C(xy)  bigram
  4.47147   113     21     20  fine twined
  4.47103    55     71     20  graven images
  4.47086    37    122     20  inner court
  4.47054    28    202     20  anointing oil
  4.46943    41    234     20  fierce anger
  4.46874   501     24     20  cast lots
  4.46873    33    366     20  continual burnt
  4.46862    76    164     20  simon peter
  4.46852   328     39     20  four corners
  4.46734    38    447     20  innocent blood
  4.46668   157    123     20  east wind
  4.46638   283     72     20  turn aside
  4.46103   283    139     20  chief captain
  4.46053    78    527     20  bowed himself
  4.45666  1405     39     20  made manifest

Strong bigrams ranked by t score

  T         C(x)      C(y)     C(xy)  bigram
  106.28    564688    561524   11562  strong enough
  97.1442   564688    835693    9832  strong support
  85.5205   564688    350085    7480  strong demand
  85.3466   564688    583264    7560  strong growth
  81.7163   564688    728249    7021  strong dollar
  81.6797   564688     76401    6708  strong winds
  77.485    564688    426675    6206  strong buy
  77.1576   564688    592486    6233  strong opposition
  71.729    564688    727631    5487  strong sales
  71.2304   564688    179242    5159  strong showing
  70.5829   564688    291093    5120  strong performance
  70.4699   564688     81809    5005  strong earthquake
  65.9995   564688   1133059    4882  strong economic
  62.5705   564688    331883    4072  strong earnings
  61.1211   564688    756389    4089  strong economy

Powerful bigrams ranked by t score

  T         C(x)      C(y)      C(xy)  bigram
  56.8366   196842      81809    3244  powerful earthquake
  51.1367   196842     289928    2663  powerful bomb
  47.4809   196842    3541187    2813  powerful than
  45.8005   196842    1388113    2323  powerful military
  43.541    196842     145743    1920  powerful explosion
  41.937    196842     561524    1851  powerful enough
  38.4901   196842    1147361    1667  powerful political
  37.5722   196842     875579    1554  powerful man
  36.1926   196842     774963    1436  powerful force
  34.8137   196842    6497319    2158  powerful new
  34.3074   196842     538219    1265  powerful shot
  33.2658   196842     141226    1130  powerful blast
  32.6545   196842     474899    1144  powerful lower
  32.4674   196842   51439655    6997  powerful and
  30.5478   196842     533268    1020  powerful car

Likelihood ratios
• Another approach to hypothesis testing is the likelihood ratio (Dunning 1993)
• A likelihood ratio is a number that tells us how much more likely one hypothesis H1 is than the alternative hypothesis H2
• Compared to the t and χ² statistics, likelihood ratios are:
  – More interpretable
  – More robust under sparse data
  – More difficult to compute (the math is hairier)
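The ratio can be sketched directly from bigram counts, using the MLE estimates p = c(y)/N, p1 = c(xy)/c(x), and p2 = (c(y) − c(xy))/(N − c(x)); all counts below are invented for illustration:

```python
import math

def log_l(k, n, p):
    """log of the binomial likelihood p^k * (1 - p)^(n - k), assuming 0 < p < 1."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def neg2_log_lambda(c_xy, c_x, c_y, n):
    """-2 log of the H1/H2 likelihood ratio for bigram xy (Dunning 1993)."""
    p = c_y / n                    # H1 (independence): P(y|x) = P(y|~x) = p
    p1 = c_xy / c_x                # H2 (collocation): P(y|x) = p1 ...
    p2 = (c_y - c_xy) / (n - c_x)  # ... and P(y|~x) = p2
    log_lambda = (log_l(c_xy, c_x, p) + log_l(c_y - c_xy, n - c_x, p)
                  - log_l(c_xy, c_x, p1) - log_l(c_y - c_xy, n - c_x, p2))
    return -2 * log_lambda

# Invented counts: a strongly associated bigram scores high ...
print(neg2_log_lambda(20, 100, 50, 1000))
# ... while counts exactly matching independence (p1 = p2 = p) score ~0
print(neg2_log_lambda(5, 100, 50, 1000))
```

The score grows with the evidence against independence, which is what makes it usable as a ranking criterion even when counts are small.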
Finding collocations with likelihood ratios
• We want to determine whether bigram xy is a collocation, or whether x and y are independent of each other
• Hypothesis H1 (independence): P(y|x) = p = P(y|¬x)
• Hypothesis H2 (collocation): P(y|x) = p1 ≠ p2 = P(y|¬x)
• We can use MLE to estimate p, p1, and p2:

      p = c(y)/N      p1 = c(xy)/c(x)      p2 = (c(y) − c(xy)) / (N − c(x))

• Letting x = c(x), y = c(y), and xy = c(xy), we compute

      log (H1/H2) = log p^xy (1 − p)^(x−xy) + log p^(y−xy) (1 − p)^((N−x)−(y−xy))
                  − log p1^xy (1 − p1)^(x−xy) − log p2^(y−xy) (1 − p2)^((N−x)−(y−xy))

Powerful collocations ranked by likelihood ratios

  −2 log (H1/H2)   c(x)   c(y)  c(xy)  bigram
  1291.42         12593    932    150  most powerful
    99.31           379    932     10  politically powerful
    82.96           932    934     10  powerful computers
    80.39           932   3424     13  powerful force
    57.27           932    291      6  powerful symbol
    51.66           932     40      4  powerful lobbies
    51.52           171    932      5  economically powerful
    51.05           932     43      4  powerful magnet

Source: Manning & Schütze (1999), p. 163

Collocations: summary
• Collocations, idioms, and other MWEs are groups of words that co-occur more often than chance, exhibit limited semantic compositionality, and vary from language to language
• Collocations are useful for lexicography, language learning, and applications like the ‘key phrases’ feature at amazon.com
• Methods for automatically extracting collocations:
  1. Frequency counts with POS filters
  2. Relative frequency ratios
  3. Mutual information (favors rare collocations)
  4. t scores (favors frequent collocations)
  5. Likelihood ratios
  6. χ² test (next time!)