Improving Finite-State Spell-Checker Suggestions with Part of

T. A PIRINEN, M. SILFVERBERG, KRISTER LINDÉN
form of spelling errors is word forms that do not belong to the given
context under certain syntactic or semantic requirements, such as writing
‘their’ instead of ‘there’. This form is correspondingly called real-word
(spelling) errors. The non-word type of spelling errors can easily be detected using a dictionary, whereas the detection of the latter type of errors
typically requires syntactic analysis or probabilistic methods [3]. For the
purpose of this article we do not distinguish between them, as the same
correction methods can be applied to both.
The correction of spelling errors usually means generating a list of
word forms belonging to the language for a user to choose from. The
mechanism for generating correction suggestions for the erroneous word-forms is an error-model. The purpose of an error-model is to act as a filter to revert the mistakes the user typing the erroneous word-form has
made. The simplest and most traditional model for making such corrections is the Levenshtein-Damerau edit distance algorithm, attributed initially to [4] and especially in the context of spell-checking to [1]. The
Levenshtein-Damerau edit distance assumes that a spelling error is one
of insertion, deletion or substitution of a single character, or
a swap of two adjacent characters, which models well the spelling errors caused by an accidental slip of the finger on a keyboard. It was originally found that for most languages this simple
method already covers 80 % of all spelling errors [1]. This model is also
language-independent, ignoring the differences in character repertoires of
a given language. Various other error models have also been developed,
ranging from confusion sets to phonemic folding [5].
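As an illustration of the edit distance error model described above, the following is a minimal Python sketch, not the finite-state implementation evaluated in this article. It computes the restricted (optimal string alignment) variant of the Levenshtein-Damerau distance with unit costs and filters a word list by that distance. The names `edit_distance`, `suggest` and `TOY_LEXICON`, as well as the word list itself, are hypothetical and chosen only for this example.

```python
def edit_distance(source: str, target: str) -> int:
    """Restricted Levenshtein-Damerau distance with unit costs.

    Counts insertions, deletions, substitutions and swaps of two
    adjacent characters (optimal string alignment variant).
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            swapped = (
                i > 1 and j > 1
                and source[i - 1] == target[j - 2]
                and source[i - 2] == target[j - 1]
            )
            if swapped:
                # swap of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]


# Hypothetical toy lexicon, standing in for a real dictionary.
TOY_LEXICON = ["the", "their", "there", "three", "here"]


def suggest(word: str, lexicon: list[str], max_distance: int = 1) -> list[str]:
    """Return lexicon entries within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, entry), entry) for entry in lexicon)
    return [entry for distance, entry in scored if distance <= max_distance]


print(suggest("thier", TOY_LEXICON))
```

Comparing the input against every lexicon entry is only practical for small word lists; it is used here to keep the sketch short.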
In this paper, we evaluate the use of context for further fine-tuning
of the correction suggestions. The context is still not commonly used in
spell-checkers. According to [5], it was lacking in the majority of spell-checkers, and while the situation may have improved slightly for some
commercial office suite products, the main spell-checkers for open source
environments, such as hunspell¹, which is widely used in the open source world, are still primarily context-ignorant. For English, the surface word-form trigram model has been demonstrated to be reasonably efficient
both for non-word cases [6] and for real-word cases [7]. As an additional way to improve the set of suggestions, we propose to use morphosyntactically relevant analyses in context. In this article, we evaluate
a model with a statistical morphological tagger [8]. The resulting system
is in effect similar to the one described in [9] for Spanish².
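To make the idea of context-based re-ranking concrete, the following is a minimal Python sketch of ordering correction suggestions by surface word-form trigram counts. It is an assumption-laden illustration only: it is not the tagger-based model evaluated in this article, it uses raw counts without smoothing, and the names `trigram_counts`, `rerank` and `TOY_CORPUS`, together with the corpus sentences, are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus; a real model would be trained on much more text.
TOY_CORPUS = [
    "the cat sat over there",
    "they lost their keys",
    "we went over there",
]


def trigram_counts(sentences: list[str]) -> Counter:
    """Count word-form trigrams, padding each sentence start with <s>."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split()
        for i in range(2, len(tokens)):
            counts[tuple(tokens[i - 2:i + 1])] += 1
    return counts


def rerank(candidates: list[str], left_context: list[str],
           counts: Counter) -> list[str]:
    """Order candidates by the count of (previous two words, candidate).

    The sort is stable, so candidates with equal counts keep the order
    given by the error model.
    """
    padded = ["<s>", "<s>"] + list(left_context)
    first, second = padded[-2], padded[-1]
    return sorted(candidates, key=lambda c: -counts[(first, second, c)])


counts = trigram_counts(TOY_CORPUS)
print(rerank(["their", "there"], ["went", "over"], counts))
```

With this toy corpus, the left context "went over" moves "there" ahead of "their", while the context "they lost" would prefer "their".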
¹ http://hunspell.sf.net