Corpus-based generation of suggestions for correcting student errors
Paper presented at AsiaLex August 2009
Richard Watson Todd
KMUTT
©2009 Richard Watson Todd
Self-correction of writing


Language learning or language use
Resources for writers:
– Dictionaries (e.g. COBUILD)
• Common syntactic patterns but needs awareness
– Lists of common errors
• Limited number of errors covered
– Grammar checkers
• Potentially useful if designed for non-native writers
Principles of grammar checker design

Pattern matching
– e.g. common phrases
– limited (like lists of common errors)

Parsing and rule-based
– e.g. subject-verb agreement
– useful for syntax but limited application

Corpus-based probabilistic analysis
– lexically-based on co-occurrence of words
– very local errors only
Conducting a corpus-based probabilistic analysis




Construct a large corpus (100 million words)
For the 6,700 most common words, identify all possible bigrams (44 million)
Calculate z-scores of bigrams to identify errors
40 million bigrams classed as errors
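The slides do not say which z-score formula was used. As a minimal sketch, assuming the common collocation z-score that compares the observed bigram count with the count expected if the two words occurred independently, the calculation could look like this in Python. The function names and the toy corpus are illustrative assumptions, not the author's implementation.

```python
import math
from collections import Counter

def count_ngrams(tokens):
    """Count unigrams and adjacent-word bigrams in a token list."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def bigram_z_score(w1, w2, unigrams, bigrams, total):
    """z = (observed - expected) / sd under an independence assumption.

    Assumed formula: the slides only say "z-scores of bigrams".
    """
    p = (unigrams[w1] / total) * (unigrams[w2] / total)
    if p == 0 or p == 1:
        return 0.0  # word absent from the corpus (or degenerate corpus): no evidence
    expected = total * p
    sd = math.sqrt(total * p * (1 - p))
    return (bigrams[(w1, w2)] - expected) / sd

# Toy corpus for illustration only; the talk used about 100 million words.
tokens = ("he drives a red car she drives a blue car "
          "the colour of the car is red").split()
unigrams, bigrams = count_ngrams(tokens)
total = len(tokens)

attested = bigram_z_score("red", "car", unigrams, bigrams, total)
unattested = bigram_z_score("red", "colour", unigrams, bigrams, total)
print(attested > 0, unattested < 0)
```

In this sketch an attested pairing such as red + car scores above zero, while a pairing of two known words that never occur together, such as red + colour, scores below zero and would be a candidate error. Where to set the cut-off is not stated in the slides.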
The problem



Identifying errors is relatively easy
Providing good suggestions for correcting errors is more difficult
Is it possible to provide correct suggestions for word-word co-occurrence errors through analysis of a large corpus?
The approach




Collect 200 sentences from student writing containing word-word errors
Generate multiple methods of correcting the errors
Evaluate the methods
Produce algorithms based on common patterns
An example

He drives a red colour car.
– A. Delete “red”?
– B. Delete “colour”?
– C. Switch “red” and “colour”?
– D. Replace “red” with another word?
– E. Replace “colour” with another word?
– F. Insert a word between “red” and “colour”?
Checking deleting and switching

He drives a red colour car.
– A. Delete “red”
– Result: He drives a colour car.
– Check z-score of co-occurrence of a + colour + car
– If z-score is high, possible method
– Do the same for:
• B. Delete “colour”
• C. Switch “red” and “colour”
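Methods A to C can be sketched as a function that builds each candidate sentence together with the word sequence whose co-occurrence score would then be checked. This is an illustration under assumptions: the function name, the use of one context word on each side, and the whitespace tokenisation are not taken from the slides, and the scoring step itself is left out.

```python
def delete_and_switch_candidates(tokens, i):
    """Build candidates for methods A, B and C.

    tokens[i] and tokens[i + 1] form the flagged word pair.
    Returns {method: (candidate_tokens, ngram_to_check)}.
    """
    w1, w2 = tokens[i], tokens[i + 1]
    before, after = tokens[:i], tokens[i + 2:]
    left = tokens[i - 1:i] if i > 0 else []   # one word of left context, if any
    right = tokens[i + 2:i + 3]               # one word of right context, if any
    return {
        "A": (before + [w2] + after, tuple(left + [w2] + right)),          # delete first word
        "B": (before + [w1] + after, tuple(left + [w1] + right)),          # delete second word
        "C": (before + [w2, w1] + after, tuple(left + [w2, w1] + right)),  # switch the pair
    }

sentence = "He drives a red colour car".split()
for method, (candidate, ngram) in delete_and_switch_candidates(sentence, 3).items():
    print(method, " ".join(candidate), "| check:", " + ".join(ngram))
```

For method A the sequence to check comes out as a + colour + car, matching the slide. A method would be kept as a possible correction only if that sequence scores highly in the corpus.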
Finding words to replace or insert

He drives a red colour car.
– D. Replace “red” with another word
– Search for trigram: a X colour
– Identify trigram with highest z-score for:
• a + X + colour
– Do the same for:
• E. Replace “colour” with another word [red + X + car]
• F. Insert a word between “red” and “colour” [red + X + colour]
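The wildcard search for methods D to F can be sketched as follows. In all three methods the unknown word X sits in the middle of a trigram, so each pattern reduces to a (first word, last word) pair. The function names and the `trigram_scores` mapping (trigram to z-score, assumed to have been computed from the corpus beforehand) are hypothetical and not part of the original talk.

```python
def slot_patterns(tokens, i):
    """Return {method: (first, last)} for the trigram patterns first + X + last.

    tokens[i] and tokens[i + 1] form the flagged word pair.
    """
    w1, w2 = tokens[i], tokens[i + 1]
    patterns = {"F": (w1, w2)}                  # insert X between the pair
    if i > 0:
        patterns["D"] = (tokens[i - 1], w2)     # replace the first word
    if i + 2 < len(tokens):
        patterns["E"] = (w1, tokens[i + 2])     # replace the second word
    return patterns

def best_filler(first, last, trigram_scores):
    """Pick the middle word of the highest-scoring trigram first + X + last.

    trigram_scores maps (w1, w2, w3) tuples to scores; returns None when
    the corpus has no trigram matching the pattern.
    """
    fillers = {mid: score
               for (a, mid, b), score in trigram_scores.items()
               if a == first and b == last}
    return max(fillers, key=fillers.get) if fillers else None

sentence = "He drives a red colour car".split()
for method, (first, last) in sorted(slot_patterns(sentence, 3).items()):
    print(method, f"{first} + X + {last}")
```

For the example sentence this yields a + X + colour for D, red + X + car for E, and red + X + colour for F, as on the slide. A linear scan over all trigrams is used here for brevity; with a corpus of the size described, an index keyed on (first, last) would be the more practical choice.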
Evaluating methods and producing algorithms





For each error, up to 6 methods of generating suggestions possible
Evaluations based on judgments of appropriacy of suggestion by a native speaker
Patterns identified for parts of speech (there are 12,000 POS-POS-POS trigrams but 300 billion word-word-word trigrams)
8 algorithms produced
Sample algorithm:
– Replace first word (i.e. method D) when the second word is (noun OR verb OR preposition) and first word is adjective preceded by adverb
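The sample algorithm can be read as a rule over part-of-speech tags. A minimal sketch follows, assuming a simple tag set ("ADV", "ADJ", "NOUN", "VERB", "PREP") chosen for illustration; the slides name the word classes but not the tagger or tag labels, and the other seven algorithms are not given, so only this one rule is encoded.

```python
# Tag labels are assumptions for this sketch, not the tag set used in the study.
SECOND_WORD_TAGS = {"NOUN", "VERB", "PREP"}

def choose_method(prev_pos, first_pos, second_pos):
    """Return "D" (replace the first word) when the sample rule applies.

    prev_pos   -- POS tag of the word before the flagged pair
    first_pos  -- POS tag of the first word of the flagged pair
    second_pos -- POS tag of the second word of the flagged pair
    Returns None when the rule does not apply.
    """
    if (prev_pos == "ADV"
            and first_pos == "ADJ"
            and second_pos in SECOND_WORD_TAGS):
        return "D"
    return None

print(choose_method("ADV", "ADJ", "NOUN"))
print(choose_method("DET", "ADJ", "NOUN"))
```

Working at the level of tags rather than words is what keeps the rule set small: a rule like this covers every adverb + adjective + noun sequence at once instead of one word trigram at a time.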
Validation of algorithms


Procedures applied to further sentences from student writing
Applying algorithms provides correct suggestions for 45% of errors identified
– Pattern matching and rule-based algorithms provide correct suggestions for 90% of errors
– Corpus-based sections cover a greater number of less predictable errors
Implications for lexicography


Growth in use of electronic dictionaries
Growth in number of aspects covered by dictionaries
– originally only spelling and meaning
– now examples of use, syntactic patterns, register, variants, synonyms etc.
– in the future suggestions for correcting errors?

In 20 years’ time, integration of dictionaries and grammar checkers?