Corpora and Statistical Methods
Albert Gatt

Zipf's law and the Zipfian distribution

Identifying words

Words
- Levels of identification:
  - Graphical word (a token): dependent on surface properties of the text.
  - Underlying word (stem, root, ...): dependent on some form of morphological analysis.
- Practical definition: a word ...
  - is an indivisible (!) sequence of characters;
  - carries elementary meaning;
  - is reusable in different contexts.

Indivisibility
- Words can have compositional meaning, built from parts that are either words themselves, or prefixes and suffixes:
  - colour + -less = colourless (derivation)
  - data + base = database (compounding)
- The notion of "atomicity" or "indivisibility" is a matter of degree.

Problems with indivisibility
- Definite article in Maltese: il-kelb (DEF-dog) "the dog"; the article is phonologically dependent on the word it attaches to.
- German compounding: Lebensversicherungsgesellschaftsangestellter "life insurance company employee".
- Arabic conjunctions: waliy, one possible gloss being "and I follow" (w- is "and").

Reusability
- Words become part of the lexicon of a language, and can be reused.
- But some words can be formed on the fly using productive morphological processes.
- Many words are used very rarely:
  - a large majority of the lexicon is inaccessible to native speakers;
  - approximately 50% of the words in a novel will be used only once within that novel (hapax legomena).

The graphic definition
- Many corpora, starting with Brown, use the definition of a graphic word: a sequence of letters/numbers (and possibly some other symbols) separated by whitespace or punctuation.
- But even here, there are exceptions.
- Not much use for tokenisation of languages like Arabic.

Non-alphanumeric characters
- Numbers such as 22.5: in word frequency counts, these are typically all mapped to a single type (e.g. ##).
- Other characters:
  - abbreviations: U.S.A.
  - apostrophes: O'Hara vs. John's
  - whitespace inside names: New Delhi (a problem for tokenisation)
  - hyphenated compounds: so-called, A-1-plus vs. the aluminum-export industry. How many words do we have here?

Tokenisation
- The task of breaking up running text into component words.
- Crucial for most NLP tasks, as parameters are typically estimated based on words.
- Can be statistical or rule-based. Often, simple regular expressions will go a long way (see the Python sketch below).
- Some practical problems:
  - Whitespace: very useful in Indo-European languages; in others (e.g. East Asian languages, ancient Greek) no space is used.
  - Non-alphanumeric symbols: we need to decide whether these are part of a word or not.

Types and tokens

Running example
- Throughout this lecture, data is taken from a corpus of Maltese texts: ca. 51,000 words, all from Maltese-language newspapers, covering various topics and article types.
- Compared to data from English corpora taken from Baroni (2007).

Definitions (I)
- token = any word in the corpus (also counting words that occur more than once)
- type = each individual, different word in the corpus (grouping repeated words together as instances of a single type)
- Example: "I spoke to the chap who spoke to the child"
  - 10 tokens
  - 7 types (I, spoke, to, the, chap, who, child)

Definitions (II)
- The number of tokens in the corpus is an estimate of overall corpus size. Maltese corpus: 51,000 tokens.
- The number of types is an estimate of vocabulary size, and gives an idea of the lexical richness of the corpus. Maltese corpus: 8,193 types.

Relative measures of frequency
- Relative frequency of a type: the number of occurrences of the type divided by corpus size.
- In very large corpora, this is typically multiplied by a constant, e.g. multiplying by 1 million gives frequency per million.
- A related relative measure is the type/token ratio (TTR): the number of types divided by the number of tokens.
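The following Python sketch ties these definitions together; it is an added illustration, not part of the original lecture material. It tokenises a toy string (the example sentence above) with a simple graphic-word regular expression and computes the token count, type count, relative frequency per million, and type/token ratio. The regular expression is only a rough approximation of the graphic-word definition, and the choice of "spoke" as the type to measure is arbitrary.

```python
import re
from collections import Counter

# Toy running text (illustrative only; any corpus string would do).
text = "I spoke to the chap who spoke to the child"

# A simple graphic-word tokeniser: lowercase, then pick out runs of
# letters/digits/apostrophes/hyphens. Real corpora need more care
# (abbreviations, hyphenated compounds, clitics, etc.).
tokens = re.findall(r"[\w'-]+", text.lower())

freqs = Counter(tokens)      # frequency list: type -> frequency
n_tokens = len(tokens)       # corpus size
n_types = len(freqs)         # vocabulary size
print(n_tokens, n_types)     # 10 tokens, 7 types for this example

# Relative frequency of an arbitrarily chosen type, and frequency per million.
rel_freq = freqs["spoke"] / n_tokens
per_million = rel_freq * 1_000_000

# Type/token ratio (TTR): vocabulary size relative to corpus size.
ttr = n_types / n_tokens
print(rel_freq, per_million, ttr)
```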
Type/token ratio
- The ratio varies enormously depending on corpus size!
- If the corpus is 1,000 words, it is easy to see a TTR of, say, 40%. With 4 million words, it is more likely to be in the region of 2%.
- Reason: vocabulary size grows with corpus size, but much more slowly, because a large corpus contains many types that each occur many times.

Frequency lists (BNC)
- A simple list, pairing each word with its frequency:

  type               frequency
  the                6,054,231
  in                 1,931,797
  time                 149,487
  year                  73,167
  man                   57,699
  ...
  monarch                  744
  cumin                     51
  prestidigitation           3

Frequency lists (MT)

  type                      frequency
  aħħar ("last")                   97
  jkun ("be.IMPERF.3SG")           96
  ukoll ("also")                   93
  bħala ("as")                     91
  dak ("that.SGM")                 86
  tat- ("of.DEF")                  86

Frequency ranks
- Word counts can get very big: the most frequent word in the Maltese corpus occurs 2,195 times, and the corpus is small.
- Raw frequency lists can be hard to process, so it is useful to represent words in terms of rank:
  - count the words;
  - sort by frequency (most frequent first);
  - assign a rank to each word: rank 1 = most frequent, rank 2 = next most frequent, and so on (a short code sketch further below shows how to compute such a profile and the corresponding frequency spectrum).

Rank/frequency profile (BNC)
- Rank 1 goes to the most frequent type; all ranks are unique; ties in frequency are given arbitrary ranks.

  rank (r)    freq (f)
  1          6,054,231
  2          1,931,797
  3            149,487
  ...

- Note the large differences in frequency from one rank to the next.

Rank/frequency profile (MT)

  rank (r)    freq (f)
  1               2195
  2               2080
  3               1277
  4               1264

- Differences in frequency from one rank to the next are smaller than in the BNC.

Frequency spectrum (MT)
- A representation that shows, for each frequency value, the number of different types that occur with that frequency.

  frequency    types
  1             4382
  2             1253
  3              661
  4              356

Word distributions (few giants, many midgets)

Non-linguistic case study
- Suppose we are interested in measuring people's height.
  - Population: adult, male/female, European.
  - Sample: N people from the relevant population; measure the height of each person in the sample.
- Results: person 1: 1.6 m; person 2: 1.5 m; ...

Measures of central tendency
- Given the heights of the individuals in our sample, we can calculate some summary statistics:
  - mean ("average"): the sum of all heights in the sample, divided by N;
  - mode: the most frequent value.
- What are your expectations? Will most people be extremely tall? Extremely short? More or less average?

Plotting height against frequency
- Observations:
  1. Extreme values are less frequent.
  2. Most people fall close to the mean.
  3. The mode is approximately the same as the mean.
  4. The curve is bell-shaped (a "normal" distribution).

Distributions of words
- Out of 51,000 tokens in the Maltese corpus, 8,016 tokens belong to just the 5 most frequent types (the types at ranks 1 to 5): ca. 15% of the corpus is made up of only 5 different words!
- Out of 8,193 types, 4,382 are hapax legomena, occurring only once (the bottom ranks); 1,253 occur only twice; and so on.
- In this data, the mean won't tell us very much: it hides huge variations.

Ranks and frequencies (MT)

  rank    frequency
  1            2195
  2            2080
  3            1277
  ...
  2298            1
  2299            1
  ...

- Among the top ranks, frequency drops very dramatically (though this depends on corpus size); among the bottom ranks, frequency drops very gradually.

General observations
- There are always a few very high-frequency words, and many low-frequency words.
- Among the top ranks, frequency differences are big; among the bottom ranks, frequency differences are very small.

So what are the high-frequency words?
- The top-ranked words in the Maltese data: li ("that"), l- (DEF), il- (DEF), u ("and"), ta' ("of"), tal- ("of the").
- Bottom-ranked words, all with f = 1: żona ("zone"), yankee, żwieten ("Zejtun residents"), xortih ("luck.3SGM"), widnejhom ("ear.POSS.3PL").
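As an added illustration (not part of the original slides), the Python sketch below computes a rank/frequency profile and a frequency spectrum from a frequency list. The toy frequency list is hypothetical: it reuses some of the Maltese frequency values quoted above, but the pairing of particular words with particular counts is invented for the example.

```python
from collections import Counter

# Hypothetical frequency list (type -> frequency); in practice this would
# come from a tokenised corpus, e.g. Counter(tokens) as in the earlier sketch.
freqs = Counter({"il-": 2195, "li": 2080, "u": 1277, "ta'": 1264,
                 "żona": 1, "yankee": 1, "xortih": 1})

# Rank/frequency profile: sort types by frequency (most frequent first)
# and assign ranks 1, 2, 3, ...; ties receive arbitrary but unique ranks.
for rank, (word, f) in enumerate(freqs.most_common(), start=1):
    print(rank, word, f)

# Frequency spectrum: for each frequency value, the number of types
# occurring with exactly that frequency.
spectrum = Counter(freqs.values())
print(sorted(spectrum.items()))   # [(1, 3), (1264, 1), (1277, 1), (2080, 1), (2195, 1)]

# Hapax legomena: the number of types that occur exactly once.
print("hapax legomena:", spectrum[1])
```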
Frequency distributions in corpora
- The top few frequency ranks are taken up by function words.
- In the Brown corpus, the 10 top-ranked words make up 23% of total corpus size (Baroni, 2007).
- Bottom-ranked words display lots of ties in frequency.
- Lots of words occur only once (hapax legomena): in Brown, ca. half of the vocabulary is made up of words that occur only once.

Implications
- The mean or average frequency hides huge deviations.
- In Brown, the average frequency of a type is 19 tokens. But the mean is inflated by a few very frequent types; most words have a frequency well below the mean.
- The mean will therefore be higher than the median (the middle value), and is not a very meaningful indicator of central tendency.
- The mode (the most frequent frequency value) is usually 1.
- This is typical of most large corpora, and the same happens if we look at n-grams rather than words.

Typical shape of a rank/frequency curve: actual example (MT)
[Plot of frequency (y-axis, up to 2,500) against rank (x-axis, up to 9,000) for the Maltese corpus: a few high-frequency words at the low ranks, and hundreds of low-frequency words at the high ranks.]

Zipf's law
- Observation: frequency decreases non-linearly with rank.

  f(w) = C / r(w)^a

  where C is a constant, determined from the data, roughly the frequency of the most frequent word, and a is another constant, also determined from the data.
- Suppose a = 1 and C = 60,000. The model predicts:
  - 2nd most frequent word: C/2 = 30,000
  - 3rd most frequent word: C/3 = 20,000
  - 20th most frequent word: C/20 = 3,000
- So frequency decreases very rapidly as rank increases.

Things to note
- The law doesn't predict frequency ties: there are no ties among ranks.
- The law is a power law: frequency is a function of a negative power of rank.
- Taking the log of both sides gives a linear function:

  log f(w) = log C - a log r(w)

  which is basically a straight-line plot.

Log-log plot for MT data (a = 1)
[Log-log plot of the Maltese data against the Zipfian prediction: the curve deviates from the prediction at the high-frequency end and at the low-frequency end.]

Log-log plot for data from Baroni (2007)
[Figure: the corresponding log-log plot for the data from Baroni (2007).]

Some observations
- Empirical work has shown that the law doesn't perfectly predict frequencies:
  - at the bottom ranks (low frequencies), actual frequency drops more rapidly than predicted;
  - at the top ranks (high frequencies), the model predicts higher frequencies than are actually attested.

Mandelbrot's law
- Mandelbrot proposed a generalised version of Zipf's law:

  f(w) = C / (r(w) + b)^a

- Note: Zipf's original law is Mandelbrot's law with b = 0.
- If b is a small positive value, it makes the predicted frequency of items ranked at the top (ranks 1, 2, etc.) significantly smaller, but barely affects the lower ranks.

Comparison
- Let C = 60,000, a = 1 and b = 1.
- For a word of rank 1:
  - Zipf's law predicts f(w) = 60,000 / 1 = 60,000
  - Mandelbrot's law predicts f(w) = 60,000 / (1 + 1) = 30,000
- For a word of rank 1,000:
  - Zipf: f(w) = 60,000 / 1,000 = 60
  - Mandelbrot: f(w) = 60,000 / 1,001 = 59.94
- So the differences are bigger at the top than at the bottom (a quick check of these numbers appears in the sketch below).

Linear version of Mandelbrot

  log f(w) = log C - a log(r(w) + b)

- Note: in log-log space this is no longer a straight line, so it should fit our data better.
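To make this comparison concrete, here is a short Python sketch (added as an illustration, not from the original slides) that evaluates both formulas with the constants used above; the values printed for ranks 1 and 1,000 reproduce the figures in the comparison.

```python
# Zipf's law vs. Mandelbrot's correction, using the constants from the
# slides (C = 60,000, a = 1, b = 1); nothing here is fitted to real data.

def zipf(rank, C=60_000, a=1):
    """Frequency predicted by Zipf's law: f = C / rank**a."""
    return C / rank ** a

def mandelbrot(rank, C=60_000, a=1, b=1):
    """Frequency predicted by Mandelbrot's law: f = C / (rank + b)**a."""
    return C / (rank + b) ** a

for r in (1, 2, 3, 20, 1000):
    print(r, zipf(r), round(mandelbrot(r), 2))

# rank 1:    Zipf 60000.0, Mandelbrot 30000.0  (big difference at the top)
# rank 1000: Zipf 60.0,    Mandelbrot 59.94    (almost no difference at the bottom)
```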
Consequences of the law
- Data sparseness: no matter how big your corpus, most of the words in it will be of very low frequency.
- You can't exhaust the vocabulary of a language: new words will keep cropping up as corpus size increases.
- Implication: you can't directly compare the vocabulary richness of corpora of different sizes.

Explanation for Zipfian distributions
- Zipf's own explanation (the "least effort" principle):
  - the speaker's goal is to minimise effort by using a few distinct words as frequently as possible;
  - the hearer's goal is to maximise clarity by having as large a vocabulary as possible.

Other Zipfian distributions
- Zipf's law crops up in other domains (e.g. the distribution of incomes).
- Even randomly generated character strings show the same pattern (see the simulation sketch below): there are only a few possible short strings, but each is likely to crop up by chance; there are many more possible long strings, but each individual one is less likely.
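This last claim is easy to check with a quick simulation. The sketch below is an added illustration, not from the lecture; the alphabet, text length, and random seed are arbitrary choices. It generates a long string of random characters in which the space is just another symbol, splits it into "words", and prints the top of the resulting rank/frequency profile.

```python
import random
from collections import Counter

random.seed(0)                               # arbitrary seed, for repeatability
alphabet = "abcde "                          # five letters plus the space symbol
chars = random.choices(alphabet, k=200_000)  # "monkey typing": uniform random characters
words = "".join(chars).split()               # whatever falls between spaces is a "word"

freqs = Counter(words)
for rank, (w, f) in enumerate(freqs.most_common(10), start=1):
    print(rank, w, f)

# There are only a few possible short strings, but each turns up often;
# there are many more possible long strings, but each one is individually
# rare, so the profile shows the same few-giants-many-midgets shape.
```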