Corpus-based generation of suggestions for correcting student errors
Paper presented at AsiaLex, August 2009
Richard Watson Todd, KMUTT
©2009 Richard Watson Todd

Self-correction of writing
Language learning or language use
Resources for writers:
– Dictionaries (e.g. COBUILD)
  • Common syntactic patterns but needs awareness
– Lists of common errors
  • Limited number of errors covered
– Grammar checkers
  • Potentially useful if designed for non-native writers

Principles of grammar checker design
Pattern matching
– e.g. common phrases
– limited (like lists of common errors)
Parsing and rule-based
– e.g. subject-verb agreement
– useful for syntax but limited application
Corpus-based probabilistic analysis
– lexically-based on co-occurrence of words
– very local errors only

Conducting a corpus-based probabilistic analysis
Construct a large corpus (100 million words)
For most common 6,700 words, identify all possible bigrams (44 million)
Calculate z-scores of bigrams to identify errors
40 million bigram errors

The problem
Identifying errors is relatively easy
Providing good suggestions for correcting errors is more difficult
Is it possible to provide correct suggestions for word-word co-occurrence errors through analysis of a large corpus?

The approach
Collect 200 sentences from student writing containing word-word errors
Generate multiple methods of correcting the errors
Evaluate the methods
Produce algorithms based on common patterns

An example
He drives a red colour car.
– A. Delete “red”?
– B. Delete “colour”?
– C. Switch “red” and “colour”?
– D. Replace “red” with another word?
– E. Replace “colour” with another word?
– F. Insert a word between “red” and “colour”?

Checking deleting and switching
He drives a red colour car.
– A. Delete “red”
– Result: He drives a colour car.
– Check z-score of co-occurrence of a + colour + car
– If z-score is high, possible method
– Do the same for:
  • B. Delete “colour”
  • C. Switch “red” and “colour”

Finding words to replace or insert
He drives a red colour car.
– D. Replace “red” with another word
– He drives a red colour car.
– Search for trigram: a X colour
– Identify trigram with highest z-score for:
  • a + X + colour
– Do the same for:
  • E. Replace “colour” with another word [red + X + car]
  • F. Insert a word between “red” and “colour” [red + X + colour]

Evaluating methods and producing algorithms
For each error, up to 6 methods of generating suggestions possible
Evaluations based on judgments of appropriacy of suggestion by a native speaker
Patterns identified for parts of speech (there are 12,000 POS-POS-POS trigrams but 300 billion word-word-word trigrams)
8 algorithms produced
Sample algorithm:
– Replace first word (i.e. method D) when the second word is (noun OR verb OR preposition) and first word is adjective preceded by adverb

Validation of algorithms
Procedures applied to further sentences from student writing
Applying algorithms provides correct suggestions for 45% of errors identified
– Pattern matching and rule-based algorithms provide correct suggestions for 90% of errors
– Corpus-based sections cover a greater number of less predictable errors

Implications for lexicography
Growth in use of electronic dictionaries
Growth in number of aspects covered by dictionaries
– originally only spelling and meaning
– now examples of use, syntactic patterns, register, variants, synonyms etc.
– in the future suggestions for correcting errors?
In 20 years’ time, integration of dictionaries and grammar checkers?
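
Illustrative sketch (not from the slides)
The checking step for methods A to C ("Checking deleting and switching") might be sketched in Python roughly as follows. The slides do not say which z-score formula was used, so the formula here, z = (observed - expected) / sqrt(expected), is an assumption, as is scoring each candidate by its weakest adjacent word pair. The names `BigramModel`, `edit_candidates` and `weakest_link` are hypothetical, and the twelve-word corpus is a toy stand-in for the 100-million-word corpus.

```python
import math
from collections import Counter


class BigramModel:
    """Unigram and bigram counts from a flat token list (toy corpus)."""

    def __init__(self, tokens):
        self.n = len(tokens)
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))

    def z(self, w1, w2):
        # ASSUMPTION: z = (observed - expected) / sqrt(expected), where
        # expected = count(w1) * count(w2) / corpus size. The slides do not
        # specify the formula actually used.
        expected = self.unigrams[w1] * self.unigrams[w2] / self.n
        if expected == 0:
            return 0.0  # at least one word never occurs in the corpus
        return (self.bigrams[(w1, w2)] - expected) / math.sqrt(expected)


def edit_candidates(words, i):
    """Methods A to C applied to the suspect pair words[i], words[i + 1]."""
    return {
        "A": words[:i] + words[i + 1:],  # delete the first word
        "B": words[:i + 1] + words[i + 2:],  # delete the second word
        "C": words[:i] + [words[i + 1], words[i]] + words[i + 2:],  # switch
    }


def weakest_link(model, words):
    # ASSUMPTION: a candidate is only as good as its lowest-scoring
    # adjacent pair; the slides only say "if z-score is high".
    return min(model.z(a, b) for a, b in zip(words, words[1:]))


# Toy data, invented for illustration only.
corpus = "a red car a red car a blue car the colour red".split()
model = BigramModel(corpus)

# Local window around the error in "He drives a red colour car."
candidates = edit_candidates("a red colour car".split(), 1)
best = max(candidates, key=lambda m: weakest_link(model, candidates[m]))
print(best, " ".join(candidates[best]))
```

On this toy corpus the sketch prefers method B (delete “colour”), giving "a red car". Methods D to F would additionally need a search over candidate words X for the highest-scoring trigram, which is omitted here because the slides do not define the trigram z-score.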