Statistics

MP Oakes (1998) Statistics for Corpus Linguistics. Edinburgh University Press

Rock bottom basics

• Central tendency
  – With any set of numerical scores (eg frequency counts of word types, lengths of sentences in a corpus)
  – mode (most frequently obtained score)
    • Easily affected by chance scores
  – median (the score nearest the middle of the range of scores)
    • Will be close to the mean if the data are evenly distributed
  – mean (average), written x̄ in equations

Rock bottom basics

• Probability of an event a, usually written P(a)
  – For a set of alternative events, the total of all probabilities is 1
  – Events assumed to be independent
    • This can be counter-intuitive, but (in a coin toss) the chance of heads is always 1/2, whatever the preceding tosses were
• Probability of an event a, given some other condition b, is written P(a|b)
  – Notice that P(a|b) is independent of P(b) – eg P(skelter|helter)
  – Not to be confused with the probability of two events co-occurring – written P(a,b) – which is not the same as the product of the individual probabilities P(a)P(b)

Simple word counts

• A simple frequency count on its own might not tell you anything
• Need to compare it with something else
  – Frequency counts of other similar things
  – Or the frequency count that you might expect on average
• Then need to see whether the measured difference is significant

Statistical significance

• Probably the most commonly used statistic in all of social science is the t-test
• It is understood that any result could be due to random chance
• Statistical significance tells you how likely it is that random chance alone would be responsible for the result you get
• Usually involves looking something up in a table
  – Level of certainty
  – Number of variables or degrees of freedom

Correlation

• Frequency counts might provide an ordered list
• You might want to compare counts of two things to see if they are correlated, eg word length in English and number of characters in Chinese (Xu 1996)
• Pearson's correlation coefficient:

  r = \frac{N\sum xy - \sum x \sum y}{\sqrt{(N\sum x^2 - (\sum x)^2)\,(N\sum y^2 - (\sum y)^2)}}

• There's also a formula for rank correlation

Xu (1996)

   X    Y    X²    Y²    XY
   1    2    1     4     2
   2    1    4     1     2
   2    2    4     4     4
   3    1    9     1     3
   3    2    9     4     6
   4    2    16    4     8
   6    2    36    4     12
   6    3    36    9     18
   7    1    49    1     7
   7    2    49    4     14
   8    2    64    4     16
   9    2    81    4     18
  10    2    100   4     20
  11    2    121   4     22
  11    3    121   9     33
  TOTAL (N = 15): ΣX = 90, ΣY = 29, ΣX² = 700, ΣY² = 61, ΣXY = 185

  r = \frac{15 \times 185 - 90 \times 29}{\sqrt{(15 \times 700 - 90^2)(15 \times 61 - 29^2)}} = \frac{165}{\sqrt{2400 \times 74}} \approx 0.39

• The critical value for 15 pairs of observations at the 5% significance level is 0.441, so the result is not statistically significant (it is at the 10% level though)

Comparison with expected values

• We might want to compare the relative frequencies of a range of features
• The chi-square test shows whether frequency differences are significant:

  \chi^2 = \sum \frac{(O - E)^2}{E}

• where O is the observed value and E is the expected value:

  E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}

Yamamoto (1996)

• Frequencies of types of 3rd-person reference in English and Japanese

                         Japanese  English  TOTALS   E(J)     E(E)     χ²(J)   χ²(E)
  Ellipsis               104       0        104      48.60    55.40    63.14   55.40
  Central pronouns       73        314      387      180.86   206.14   64.32   56.43
  Non-central pronouns   12        28       40       18.69    21.31    2.40    2.10
  Names                  314       291      605      282.73   322.27   3.46    3.03
  Common NPs             205       174      379      177.12   201.88   4.39    3.85
  TOTAL                  708       807      1515

• Sum = 258.8, significant for (5−1)×(2−1) = 4 degrees of freedom at the 0.1% level

Co-occurrence

• Is the distribution of two things correlated?
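The two worked examples above – Pearson's r on Xu's (1996) length data and chi-square on Yamamoto's (1996) reference counts – can be reproduced with a short standard-library Python sketch (the data are taken directly from the slides):

```python
import math

# Xu (1996): English word length (X) vs number of Chinese characters (Y)
xs = [1, 2, 2, 3, 3, 4, 6, 6, 7, 7, 8, 9, 10, 11, 11]
ys = [2, 1, 2, 1, 2, 2, 2, 3, 1, 2, 2, 2, 2, 2, 3]
n = len(xs)

sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
syy = sum(y * y for y in ys)
sxy = sum(x * y for x, y in zip(xs, ys))

# Pearson's r: (N*Sxy - Sx*Sy) / sqrt((N*Sxx - Sx^2)(N*Syy - Sy^2))
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 2))  # 0.39 -- below the 5% critical value of 0.441

# Yamamoto (1996): observed counts of 3rd-person reference (Japanese, English)
observed = {
    "Ellipsis": (104, 0),
    "Central pronouns": (73, 314),
    "Non-central pronouns": (12, 28),
    "Names": (314, 291),
    "Common NPs": (205, 174),
}
col_j = sum(j for j, e in observed.values())   # 708
col_e = sum(e for j, e in observed.values())   # 807
grand = col_j + col_e                          # 1515

chi2 = 0.0
for j, e in observed.values():
    row = j + e
    ej = row * col_j / grand   # expected = row total * column total / grand total
    ee = row * col_e / grand
    chi2 += (j - ej) ** 2 / ej + (e - ee) ** 2 / ee
print(round(chi2, 1))  # ~258.5, significant at the 0.1% level for 4 df
```

Recomputing the chi-square sum from the raw counts gives ≈ 258.5 rather than the 258.8 quoted on the slide; the difference comes from rounding the per-cell values before summing, and does not affect the significance conclusion.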
• Contingency table – eg sentences where the two words co-occur or not

            W1    not W1
  W2        a     b
  not W2    c     d

• Phi coefficient:

  \phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}

• Dice's coefficient:

  s = \frac{2a}{2a + b + c}

• Several variants exist

Co-occurrence

• Scores such as Dice's coefficient need to be turned into something like a t score, so that significance can be measured:

  t = \frac{f(x,y) - \frac{f(x)\,f(y)}{N}}{\sqrt{f(x,y)}}

• In terms of the contingency table, with f(x,y) = a, f(x) = a+b, f(y) = a+c and N = a+b+c+d:

  t = \frac{a - \frac{(a+b)(a+c)}{a+b+c+d}}{\sqrt{a}}

Co-occurrence: Mutual information

• Measures the relatedness of two variables
• Compares the joint probability P(x,y) with the product of the individual probabilities P(x)P(y):

  I(x;y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}

• I ≈ 0: chance association
• I >> 0: strong association
• I << 0: complementary distribution
• In terms of the contingency matrix:

  P(x) = \frac{a+b}{a+b+c+d},  P(y) = \frac{a+c}{a+b+c+d},  P(x,y) = \frac{a}{a+b+c+d}

Church & Hanks (1990)

• Used MI to show word associations
  – eg doctors + {dentists, nurses, treating, treat, examine, bills, hospitals}
  – In contrast with doctors + {with, a, is}
  – Identify phrasal verbs, eg set + {up, off, out, in} but not about
  – Using a parser to separate N and V readings, most likely objects of the verb drink
  – What you can do to a telephone (sit by, disconnect, answer, …)

Church et al. (1991)

• strong vs powerful experiment

  MI     word pair          MI     word pair
  10.47  strong northerly   8.66   powerful legacy
  9.76   strong showings    8.58   powerful tool
  9.30   strong believer    8.35   powerful storms
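The co-occurrence measures above (phi, Dice, the t score, and mutual information) can all be computed from the same 2×2 contingency table. A minimal sketch, using invented counts a, b, c, d purely for illustration:

```python
import math

# Hypothetical contingency table for words W1 and W2 over N sentences
# (counts are made up for illustration):
#              W1     not W1
#   W2         a=20   b=30
#   not W2     c=10   d=940
a, b, c, d = 20, 30, 10, 940
n = a + b + c + d

# Phi coefficient: (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Dice's coefficient: 2a / (2a + b + c)
dice = 2 * a / (2 * a + b + c)

# t score: (f(x,y) - f(x)f(y)/N) / sqrt(f(x,y)),
# with f(x,y) = a, f(x) = a+b, f(y) = a+c
fx, fy, fxy = a + b, a + c, a
t = (fxy - fx * fy / n) / math.sqrt(fxy)

# Mutual information: log2( P(x,y) / (P(x)P(y)) )
px, py, pxy = fx / n, fy / n, fxy / n
mi = math.log2(pxy / (px * py))

# For these made-up counts: phi ~ 0.50, Dice = 0.500, t ~ 4.14, MI ~ 3.74
print(f"phi = {phi:.3f}, Dice = {dice:.3f}, t = {t:.2f}, MI = {mi:.2f}")
```

Here MI comes out well above 0, i.e. a strong association: W1 and W2 co-occur in 20 sentences where chance alone would predict about 1.5.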