Statistics
M. P. Oakes (1998). Statistics for Corpus Linguistics. Edinburgh University Press
Rock bottom basics
• Central tendency
– With any set of numerical scores (eg frequency
counts of word types, lengths of sentences in a
corpus)
– mode (most frequently obtained score)
• Easily affected by chance scores
– median (the score nearest the middle of the range
of scores)
• Will be close to mean if data evenly distributed
– mean (average)
• Written x̄ in equations
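As a quick illustration, Python's standard `statistics` module computes all three measures directly. The sentence lengths below are invented for the example; note how the single outlier (40) pulls the mean well above the median:

```python
from statistics import mean, median, mode

# Hypothetical sentence lengths (in words) from a small corpus sample.
lengths = [7, 9, 9, 12, 15, 9, 21, 12, 9, 40]

print(mode(lengths))    # most frequently obtained score
print(median(lengths))  # score nearest the middle of the range
print(mean(lengths))    # average; the outlier 40 inflates this
```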
Rock bottom basics
• Probability of an event a, usually written P(a)
– For a set of alternative events, total of all probabilities is 1
– Events assumed to be independent
• This can be counter-intuitive, but (in coin toss) the chance of
heads is always 1/2, whatever the preceding tosses were
• Probability of an event a, given some other condition
b is written P(a|b)
– Notice that P(a|b) does not depend on P(b) - eg P(skelter|helter) is very high even though helter itself is rare
• Not to be confused with the probability of two events
co-occurring
– written P(a,b)
– which is not in general the same as the product P(a)·P(b) (they are equal only if a and b are independent)
Simple word counts
• A simple frequency count on its own might
not tell you anything
• Need to compare it with something else
– Frequency counts of other similar things
– Or the frequency count that you might expect
on average
• Then need to see if the measured
difference is significant
Statistical significance
• Probably the most commonly used statistic in all of social science is the t-test
• It is understood that any result could be due to random chance
• Statistical significance tells you the probability that random chance alone would be responsible for the result you get
• Usually involves looking something up in a table
– Level of certainty
– Number of variables or degrees of freedom
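The t statistic itself is easy to compute by hand; only the critical value needs a table. A minimal sketch of the equal-variance two-sample form, using invented per-text frequency counts from two corpora:

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(a, b):
    """Student's t statistic for two independent samples (equal-variance form)."""
    na, nb = len(a), len(b)
    # Pooled sample variance across both samples.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2  # t statistic and degrees of freedom

# Hypothetical counts of a word in texts from two corpora.
t, df = two_sample_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
# Compare |t| against the tabulated critical value for df degrees of freedom.
```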
Correlation
• Frequency counts might provide an ordered list
• You might want to compare counts of two things
to see if they are correlated, eg word length in
English and number of characters in Chinese (Xu
1996)
• Pearson's r:

  r = (NΣxy − ΣxΣy) / √[(NΣx² − (Σx)²)(NΣy² − (Σy)²)]
• There's also a formula for rank correlation (Spearman's rho)
Xu (1996)
X    Y    X²   Y²   XY
1    2    1    4    2
2    1    4    1    2
2    2    4    4    4
3    1    9    1    3
3    2    9    4    6
4    2    16   4    8
6    2    36   4    12
6    3    36   9    18
7    1    49   1    7
7    2    49   4    14
8    2    64   4    16
9    2    81   4    18
10   2    100  4    20
11   2    121  4    22
11   3    121  9    33

TOTAL (N = 15): ΣX = 90, ΣY = 29, ΣX² = 700, ΣY² = 61, ΣXY = 185

r = (NΣXY − ΣXΣY) / √[(NΣX² − (ΣX)²)(NΣY² − (ΣY)²)]
  = (15×185 − 90×29) / √[(15×700 − 90²)(15×61 − 29²)]
  = 165 / √(2400 × 74)
  ≈ 0.39

Critical value for 15 pairs of observations at 5% level of confidence is 0.441, so the result is not statistically significant (it is at the 10% level though)
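The calculation can be checked directly from the column totals; a short sketch in Python:

```python
from math import sqrt

# Column totals from the Xu (1996) table: N, ΣX, ΣY, ΣX², ΣY², ΣXY.
N, sx, sy, sxx, syy, sxy = 15, 90, 29, 700, 61, 185

# Pearson's r from the totals.
r = (N * sxy - sx * sy) / sqrt((N * sxx - sx**2) * (N * syy - sy**2))
print(round(r, 2))  # 0.39, below the 0.441 critical value at the 5% level
```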
Comparison with expected values
• We might want to compare relative frequencies
of a range of features
• Chi-square test shows if frequency differences
are significant
  χ² = Σ (O − E)² / E

• where O is the observed value and E is the expected value

  E = (row total × column total) / grand total
Yamamoto (1996)
• Frequencies of types of 3rd-person reference in
English and Japanese
                       Japanese  English  TOTALS   E(J)     E(E)     χ²(J)   χ²(E)
Ellipsis               104       0        104      48.60    55.40    63.14   55.40
Central pronouns       73        314      387      180.86   206.14   64.32   56.43
Non-central pronouns   12        28       40       18.69    21.31    2.40    2.10
Names                  314       291      605      282.73   322.27   3.46    3.03
Common NPs             205       174      379      177.12   201.88   4.39    3.85
TOTAL                  708       807      1515

• Sum = 258.8, significant for (5−1)×(2−1) = 4 degrees of freedom at the 0.1% level
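The expected values and the chi-square sum can be recomputed from the observed counts alone. A sketch in Python using the Japanese/English counts above; the total comes out close to the 258.8 quoted, with small differences due to rounding in the table:

```python
def chi_square(table):
    """Chi-square statistic for a rows x columns frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # E = row total x column total / grand total
            exp = row_totals[i] * col_totals[j] / grand
            chi2 += (obs - exp) ** 2 / exp
    dfs = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dfs

# Observed Japanese / English counts from the Yamamoto (1996) table.
table = [[104, 0], [73, 314], [12, 28], [314, 291], [205, 174]]
chi2, dfs = chi_square(table)
```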
Co-occurrence
• Is distribution of two
things correlated?
• Contingency table
– eg sentences where the two words co-occur or not:

            W1      not W1
  W2        a       b
  not W2    c       d

• Phi coefficient:

  φ = (ad − bc) / √[(a+b)(c+d)(a+c)(b+d)]

• Dice's coefficient:

  s = 2a / (2a + b + c)

• Several variants exist
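Both coefficients follow directly from the four cell counts. A sketch with invented counts (a = sentences containing both words, d = sentences containing neither):

```python
from math import sqrt

def phi(a, b, c, d):
    """Phi coefficient from a 2x2 contingency table."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def dice(a, b, c):
    """Dice's coefficient: 2a / (2a + b + c); cell d is not used."""
    return 2 * a / (2 * a + b + c)

# Hypothetical counts for two words over 100 sentences.
print(phi(8, 2, 2, 88))
print(dice(8, 2, 2))
```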
Co-occurrence
• Scores such as Dice’s
coefficient need to be
turned into something
like a t score, so that
significance can be
measured
  t = (f(x,y) − f(x)·f(y)/N) / √f(x,y)

In terms of the contingency table, f(x,y) = a, f(x) = a+b, f(y) = a+c, N = a+b+c+d:

  t = (a − (a+b)(a+c)/(a+b+c+d)) / √a
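A sketch of the collocation t score, comparing the observed joint frequency with the frequency expected if the two words were independent. All the counts here (f(x), f(y), f(x,y), corpus size N) are invented:

```python
from math import sqrt

def t_score(fx, fy, fxy, n):
    """Collocation t score: observed joint frequency vs the joint
    frequency expected under independence (fx * fy / n)."""
    expected = fx * fy / n
    return (fxy - expected) / sqrt(fxy)

# Hypothetical corpus counts.
t = t_score(fx=100, fy=50, fxy=20, n=10_000)
# A t well above roughly 2 suggests a significant association.
```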
Co-occurrence
Mutual information
• Measures the
relatedness of two
variables
• compares the joint probability with the product of the separate probabilities
• I ≈ 0: chance association
• I >> 0: strong association
• I << 0: complementary distribution

  I(x;y) = log₂ [ P(x,y) / (P(x)·P(y)) ]

In terms of the contingency matrix:

  P(x) = (a+b) / (a+b+c+d)
  P(y) = (a+c) / (a+b+c+d)
  P(x,y) = a / (a+b+c+d)
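Mutual information also follows directly from the four cell counts; a sketch with invented counts:

```python
from math import log2

def mutual_information(a, b, c, d):
    """Pointwise MI from a 2x2 contingency table:
    log2( P(x,y) / (P(x) * P(y)) )."""
    n = a + b + c + d
    # P(x,y) = a/n, P(x) = (a+b)/n, P(y) = (a+c)/n; the n's partly cancel.
    return log2(a * n / ((a + b) * (a + c)))

# Hypothetical counts: joint frequency is 8x the chance expectation,
# so I = log2(8) = 3 bits.
mi = mutual_information(8, 2, 2, 88)
```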
Church & Hanks (1990)
• Used MI to show word associations
– Eg doctors + {dentists, nurses, treating, treat, examine, bills, hospitals}
– In contrast with doctors + {with, a, is}
– Identify phrasal verbs, eg set + {up, off, out, in} but not about
– Using a parser to separate N and V readings,
most likely objects of verb drink
– What you can do to a telephone (sit by,
disconnect, answer, …)
Church et al. (1991)
• strong vs powerful
experiment
MI      word pair           MI      word pair
10.47   strong northerly    8.66    powerful legacy
9.76    strong showings     8.58    powerful tool
9.30    strong believer     8.35    powerful storms