Spell Checking
Expected format of the student talks
• For every talk, a 5-7-slide presentation must be prepared
(e.g., ppt/pdf)
– The presentation must be handed in no later than the first
minutes of the talk
• The slides must cover the whole paper, from the problem
statement through the proposed solutions
• It is important that the talk does not exceed the available
5 minutes
– No one will be allowed to present for more than 5 minutes; if the
presentation does not convey understandable content within that
time, I will consider the assignment not completed!
Where we are going
• We’ve looked at some more abstract ideas of language
processing, and now we’re going to focus on a real-world
application, detecting and correcting errors in spelling in a
document
• The general techniques (e.g., probabilistic methods) will
be applicable in later units, e.g., POS-tagging
• Running through such a practical application will make you
more aware of the kinds of things you need to know in
order to combat a problem
• The more you know about the kinds of errors people
make, the more likely you are to find them
Who cares about spelling?
• Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t
mttaer in waht oredr the ltteers in a wrod are, the olny
iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit
pclae. The rset can be a toatl mses and you can sitll raed it
wouthit porbelm. Tihs is bcuseae the huamn mnid deos not
raed ervey lteter by istlef, but the wrod as a wlohe.
• (See
http://www.mrccbu.cam.ac.uk/personal/matt.davis/Cmabrigde/ for the story
behind this supposed research report.)
Detection vs. Correction
• There are two distinct tasks:
– error detection = simply find the misspelled words
– error correction = correct the misspelled words
• e.g., It might be easy to tell that ater is a misspelled
word, but what is the correct word? water? later?
after?
• So, what causes errors?
Spelling Tasks
• Spelling Error Detection
• Spelling Error Correction:
– Autocorrect
• htethe
– Suggest a correction
– Suggestion lists
Types of spelling errors
• Non-word Errors
– graffe -> giraffe
• Real-word Errors
– Typographical errors
• three there
– Cognitive Errors (homophones)
• piecepeace,
• too  two
Non-word spelling errors
• Non-word spelling error detection:
– Any word not in a dictionary is an error
– The larger the dictionary the better
• Non-word spelling error correction:
– Generate candidates: real words that are similar
to error
– Choose the one which is best:
• Shortest weighted edit distance
• Highest noisy channel probability
Real word spelling errors
• For each word w, generate candidate set:
– Find candidate words with similar pronunciations
– Find candidate words with similar spelling
– Include w in candidate set
• Choose best candidate
– Noisy Channel
Keyboard mistyping
• Space bar issues
– run-on errors = two separate words become one
• e.g., the fuzz becomes thefuzz
– split errors = one word becomes two separate words
• e.g., equalization becomes equali zation
• Keyboard proximity
– e.g., Jack becomes Hack since h, j are next to each other on a typical
American keyboard
• Physical similarity
– similarity of shape, e.g., mistaking two physically similar letters when
typing up something handwritten
• e.g., tight for fight
Phonetic errors
• phonetic errors = errors based on the sounds of a language
(not necessarily on the letters)
– homophones = two words which sound the same
• e.g., red/read (past tense), cite/site/sight, they're/their/there
– Spoonerisms = switching two letters/sounds around
– letter substitution = replacing a letter (or sequence of letters) with a
similar-sounding one
• e.g., John kracked his nuckles. instead of John cracked his knuckles
• e.g., I study sikologee.
Knowledge problems
• And then there are simply cases of not
knowing how to spell:
– not knowing a word and guessing its spelling (can
be phonetic)
• e.g., sientist
– not knowing a rule and guessing it
• e.g., Do we double a consonant for ing words? Jog ->
joging ?
What makes spelling correction
difficult?
• Tokenization: What is a word?
– Definition is difficult with contractions, multi-token words, hyphens,
abbreviations
• Inflection: How are some words related?
– How do we store rules and exceptions?
• Productivity of language: How many words are there?
– Words entering and exiting the lexicon
• How we handle these issues determines how we build a
dictionary.
Techniques used for spell checking
• Non-word error detection
• Isolated-word error correction
• Context-dependent word error detection and
correction
– grammar correction.
• The exact techniques used will differ depending on whether
we are looking for spelling errors in human typing or in
optical character recognition (OCR) output
Non-word error detection
• non-word error detection is essentially the
same thing as word recognition = splitting up
“words” into true words and non-words.
• How is non-word error detection done?
– Using a dictionary: the most common way to find non-word errors
– N-gram analysis:
• fast and simple technique, but most typing errors are
still valid n-grams
• used with OCR more than typing
Dictionaries
• Intuition:
– Have a complete list of words and check the input words
against this list.
– If it’s not in the dictionary, it’s not a word.
• Two aspects:
– Dictionary construction = build the dictionary (what do
you put in it?)
– Dictionary lookup = lookup a potential word in the
dictionary (how do you do this quickly?)
Dictionary construction
• Do we include inflected words? i.e., words with prefixes
and suffixes already attached
– Pro: lookup can be faster
– Con: takes much more space, doesn’t account for new
formations
• We want the dictionary to contain only the words relevant
for the user -> domain-specificity
• Foreign words, hyphenations, derived words, proper
nouns, and new words will always be problems for
dictionaries since we cannot predict these words until
humans have made them words.
• Dictionary should probably be dialectally consistent.
– e.g., include only color or colour but not both
Dictionary lookup
• Several issues arise when trying to look up a
word:
– Have to make lookup fast by using efficient lookup
techniques, such as a hash table
– Have to strip off prefixes and suffixes if the word
isn’t an entry by itself.
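To make the lookup idea concrete, here is a minimal sketch in Python: a set gives hash-table lookup, and a naive suffix-stripping pass handles inflected forms. The word list and suffix list are toy examples, not a real dictionary.

```python
# Minimal sketch of dictionary lookup: a Python set gives hash-table
# lookup, and naive suffix stripping handles inflected forms that are
# not stored as entries themselves. Word and suffix lists are toy data.
DICTIONARY = {"jog", "walk", "cat", "quick"}
SUFFIXES = ["ing", "ed", "s"]

def is_word(token: str) -> bool:
    token = token.lower()
    if token in DICTIONARY:                  # fast hash-table lookup
        return True
    for suffix in SUFFIXES:                  # strip a suffix and retry
        if token.endswith(suffix) and token[:-len(suffix)] in DICTIONARY:
            return True
    return False

print(is_word("walked"))   # True  (walk + ed)
print(is_word("jogging"))  # False -- the doubled g defeats naive stripping
print(is_word("hamd"))     # False (a non-word error)
```

Note how jogging already fails here: consonant doubling is exactly the kind of rule/exception problem mentioned above.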
Isolated-word error correction
• Having discussed how errors can be detected, we
want to know how to correct these misspelled
words:
– The most common method is isolated-word error
correction = correcting words without taking context into
account.
– Note: This technique can only handle errors that result in
non-words.
• Knowledge about what is a typical error helps in
finding correct word.
Knowledge about typical errors
• Word length effects: most misspellings are within
two characters in length of original
– When searching for the correct spelling, we do not
usually need to look at words with greater length
differences
• First-position error effects: the first letter of a
word is rarely erroneous
– When searching for the correct spelling, the process is
sped up by being able to look only at words with the
same first letter.
Isolated-word error correction methods
• Many different methods are used; we will briefly look at
four methods:
– rule-based methods
– similarity key techniques
– minimum edit distance
– probabilistic methods
• The methods play a role in one of the three basic steps:
– 1. Detection of an error (discussed above)
– 2. Generation of candidate corrections
• rule-based methods
• similarity key techniques
– 3. Ranking of candidate corrections
• probabilistic methods
• minimum edit distance
Rule-based methods
• One can generate correct spellings by writing
rules:
– Common misspelling rewritten as correct word:
• e.g., hte -> the
– Rules
• based on inflections:
– e.g., V+C+ing -> V+CC+ing
(where V = vowel and C = consonant)
• based on other common spelling errors (such as
keyboard effects or common transpositions):
– e.g., Cie -> Cei
Similarity key techniques
• Problem: How can we find a list of possible corrections?
• Solution: Store words in different boxes in a way that puts
the similar words together.
• Example:
– 1. Start by storing words by their first letter (first letter effect),
• e.g., punc starts with the code P.
– 2. Then assign numbers to each letter
• e.g., 0 for vowels, 1 for b, p, f, v (all bilabials), and so forth,
• e.g., punc -> P052
– 3. Then throw out all zeros and repeated letters,
• e.g., P052 -> P52.
– 4. Look for real words within the same box,
• e.g., punk is also in the P52 box.
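A minimal sketch of this similarity-key scheme: only the vowel -> 0 and bilabial -> 1 classes come from the example above, so the remaining letter groups below are borrowed from classic Soundex as a plausible stand-in.

```python
# Minimal sketch of the similarity-key idea above (a Soundex-style code).
# Only vowels -> 0 and b,p,f,v -> 1 are given on the slide; the other
# letter groups are the classic Soundex classes, used as an assumption.
SOUND_CLASSES = {
    **dict.fromkeys("aeiouhwy", "0"),   # vowels (plus h, w, y) -> 0
    **dict.fromkeys("bpfv", "1"),       # bilabials/labiodentals -> 1
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def similarity_key(word: str) -> str:
    word = word.lower()
    digits = [SOUND_CLASSES.get(ch, "0") for ch in word[1:]]
    # collapse adjacent repeats, then throw out the zeros (step 3 above)
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    return word[0].upper() + "".join(d for d in collapsed if d != "0")

print(similarity_key("punc"))   # P52
print(similarity_key("punk"))   # P52 -- same box, so punk is a candidate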
How is a mistyped word related to the
intended?
• Types of errors
– insertion = a letter is added to a word
– deletion = a letter is deleted from a word
– substitution = a letter is put in place of another one
– transposition = two adjacent letters are switched
• Note that the first two alter the length of the word,
whereas the last two preserve it.
• General properties
– single-error misspellings = only one instance of an error
– multi-error misspellings = multiple instances of errors (harder
to identify)
Probabilistic methods
• Two main probabilities are taken into account:
– transition probabilities = probability (chance) of going
from one letter to the next.
• e.g., What is the chance that a will follow p in English? That u will
follow q?
– confusion probabilities = probability of one letter being
mistaken (substituted) for another (can be derived from a
confusion matrix)
• e.g., What is the chance that q is confused with p?
• Useful to combine probabilistic techniques with
dictionary methods
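As a sketch, transition probabilities can be estimated by maximum likelihood from letter-bigram counts over a corpus; the one-sentence corpus here is just for illustration.

```python
# Minimal sketch of estimating letter transition probabilities
# P(next letter | current letter) from bigram counts (toy corpus).
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog"
letters = [c for c in corpus if c.isalpha()]   # ignore spaces
bigrams = Counter(zip(letters, letters[1:]))
starts = Counter(letters[:-1])

def p_transition(a: str, b: str) -> float:
    return bigrams[(a, b)] / starts[a] if starts[a] else 0.0

print(p_transition("t", "h"))   # 1.0 here: every t is followed by h
print(p_transition("q", "u"))   # 1.0 here: q is always followed by u
```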
Confusion probabilities
• For the various reasons discussed above (keyboard layout,
phonetic similarity, etc.) people type other letters than the ones
they intended.
• It is impossible to fully investigate all possible error causes and
how they interact, but we can learn from watching how often
people make errors and where.
• One way of doing so is to build a confusion matrix = a table
indicating how often one letter is mistyped for another (this is a
substitution matrix)
The Noisy Channel Model
• We can view the setup like this:
– SOURCE: word -> NOISY CHANNEL -> noisy word
– We need to decode the noisy word to figure out what the
original was
• The noisy channel model has been very popular in
speech recognition, among other fields
– Noisy word: O = observation (incorrect spelling)
– To guess at the original word, we want to find the word (w)
which maximizes: P(w|O), i.e., the probability of w, given
that O has been seen
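A minimal sketch of that decision rule: by Bayes' rule, maximizing P(w|O) is the same as maximizing P(O|w) * P(w), since P(O) is constant across candidates. The numbers below are toy values for the earlier ater example, not real estimates.

```python
# Minimal sketch of the noisy-channel decision rule:
# argmax_w P(w|O) = argmax_w P(O|w) * P(w), because P(O) is constant.
# All probabilities below are made up for the "ater" example.
channel = {("ater", "water"): 0.0002,   # P(O = ater | w), toy values
           ("ater", "later"): 0.0003,
           ("ater", "after"): 0.0004}
prior = {"water": 0.0008, "later": 0.0010, "after": 0.0020}   # P(w), toy

def best_correction(observed: str, candidates: list[str]) -> str:
    return max(candidates, key=lambda w: channel[(observed, w)] * prior[w])

print(best_correction("ater", ["water", "later", "after"]))   # after
```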
Non-word spelling error example
acress
Candidate generation
• Words with similar spelling
– Small edit distance to error
• Words with similar pronunciation
– Small edit distance of pronunciation to error
Damerau-Levenshtein edit distance
• Minimal edit distance between two strings,
where edits are:
– Insertion
– Deletion
– Substitution
– Transposition of two adjacent letters
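A minimal sketch of this distance (the "optimal string alignment" variant, where each of the four edits costs 1):

```python
# Minimal sketch of Damerau-Levenshtein distance (optimal string
# alignment variant): insertion, deletion, substitution, and adjacent
# transposition all cost 1.
def dl_distance(s: str, t: str) -> int:
    m, n = len(s), len(t)
    # d[i][j] = distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(dl_distance("acress", "actress"))  # 1 (t deleted from "actress")
print(dl_distance("acress", "caress"))   # 1 (ac/ca transposition)
```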
Candidate generation
• 80% of errors are within edit distance 1
• Almost all errors within edit distance 2
• Also allow insertion of space or hyphen
– thisidea -> this idea
– inlaw -> in-law
Words within 1 of acress
Error    Candidate    Correct    Error    Type
         Correction   Letter     Letter
acress   actress      t          -        deletion
acress   cress        -          a        insertion
acress   caress       ca         ac       transposition
acress   access       c          r        substitution
acress   across       o          e        substitution
acress   acres        -          s        insertion
acress   acres        -          s        insertion
Wait, how do you generate the
candidates?
1. Run through dictionary, check edit distance with each
word
2. Generate all words within edit distance ≤ k (e.g., k = 1
or 2) and then intersect them with dictionary
3. Use a character k-gram index and find dictionary
words that share “most” k-grams with word (e.g., by
Jaccard coefficient)
4. Compute them fast with a Levenshtein finite state
transducer
5. Have a precomputed hash of words to possible
corrections
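A minimal sketch of option 2: generate every string within edit distance 1 and intersect with the dictionary. This is the scheme popularized by Peter Norvig's spelling-corrector essay; the tiny dictionary here is just for illustration.

```python
# Minimal sketch of candidate generation by enumerating all strings
# within edit distance 1, then intersecting with a (toy) dictionary.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word: str) -> set[str]:
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

dictionary = {"actress", "cress", "caress", "access", "across", "acres"}
print(sorted(edits1("acress") & dictionary))
# ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```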
Computing error probability: confusion
matrix
del[x,y]: count(xy typed as x)
ins[x,y]: count(x typed as xy)
sub[x,y]: count(x typed as y)
trans[x,y]: count(xy typed as yx)
Insertion and deletion are conditioned on the previous character.
Channel model
• Kernighan, Church, Gale 1990: estimate P(x|w) from the confusion
matrices above:
– deletion: del[w(i-1), w(i)] / count[w(i-1) w(i)]
– insertion: ins[w(i-1), x(i)] / count[w(i-1)]
– substitution: sub[x(i), w(i)] / count[w(i)]
– transposition: trans[w(i), w(i+1)] / count[w(i) w(i+1)]
Smoothing probabilities: Add-1
smoothing
• But if we use the last slide, unseen errors are
impossible!
• They’ll make the overall probability 0. That
seems too harsh
– e.g., in Kernighan’s chart qa and aq are both
0, even though they’re adjacent on the keyboard!
• A simple solution is to add one to all counts and then, if there
is an |A|-character alphabet, to normalize appropriately:
• If substitution: P(x | w) = (sub[x, w] + 1) / (count[w] + |A|)
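A minimal sketch of this smoothed substitution probability; the counts are invented for illustration, not Kernighan et al.'s real matrices.

```python
# Minimal sketch of the add-1-smoothed substitution probability above.
# All counts are hypothetical, for illustration only.
from collections import defaultdict

sub = defaultdict(int)     # sub[(x, w)]: how often intended w was typed as x
count = defaultdict(int)   # count[w]: how often letter w occurred in the corpus
sub[("m", "n")] = 180      # hypothetical count
count["n"] = 50_000        # hypothetical count
A = 26                     # alphabet size |A|

def p_sub(x: str, w: str) -> float:
    """P(x | w) with add-1 smoothing."""
    return (sub[(x, w)] + 1) / (count[w] + A)

print(p_sub("m", "n"))   # seen substitution: ~0.0036
print(p_sub("q", "n"))   # unseen substitution: small but no longer zero
```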
Noisy channel probability for
acress
Candidate    Correct    Error    x|w      P(x|word)     P(word)      10^9 * P(x|w)P(w)
Correction   Letter     Letter
actress      t          -        c|ct     .000117       .0000231     2.7
cress        -          a        a|#      .00000144     .000000544   .00078
caress       ca         ac       ac|ca    .00000164     .00000170    .0028
access       c          r        r|c      .000000209    .0000916     .019
across       o          e        e|o      .0000093      .000299      2.8
acres        -          s        es|e     .0000321      .0000318     1.0
acres        -          s        ss|s     .0000342      .0000318     1.0
Using a bigram language model
• “a stellar and versatile acress
whose combination of sass and
glamour…”
• Counts from the Corpus of Contemporary
American English with add-1 smoothing
• P(actress|versatile) = .000021    P(whose|actress) = .0010
• P(across|versatile) = .000021     P(whose|across) = .000006
• P("versatile actress whose") = .000021 * .0010 = 210 x 10^-10
• P("versatile across whose") = .000021 * .000006 = 1 x 10^-10
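A minimal sketch of scoring the two candidates with the bigram probabilities quoted above:

```python
# Minimal sketch of ranking candidates with a bigram language model,
# using the smoothed probabilities quoted on the slide.
p = {
    ("versatile", "actress"): 0.000021, ("actress", "whose"): 0.0010,
    ("versatile", "across"):  0.000021, ("across", "whose"):  0.000006,
}

def score(candidate: str) -> float:
    return p[("versatile", candidate)] * p[(candidate, "whose")]

print(score("actress"))  # 2.1e-08  (= 210 x 10^-10)
print(score("across"))   # 1.26e-10 (~ 1 x 10^-10)
```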
Real-word spelling errors
• …leaving in about fifteen minuets to go to her house.
• The design an construction of the system…
• Can they lave him my messages?
• The study was conducted mainly be John Black.
• 25-40% of spelling errors are real words (Kukich 1992)
Noisy channel for real-word spell
correction
• For the observed sentence two of thew, each word gets a
candidate set (including the word itself):
– two -> {two, to, tao, too}
– of -> {of, off, on}
– thew -> {thew, threw, thaw, the}
• The best-scoring path through this lattice (e.g., two of the) is
chosen as the correction
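A minimal sketch of picking a path through such a lattice with a bigram language model; the candidate sets match the example above, while the probabilities are made up (a full system would also multiply in channel probabilities).

```python
# Minimal sketch of real-word correction over the lattice above: every
# word keeps itself plus its neighbors as candidates, and a toy bigram
# LM picks the best sequence. Probabilities are invented.
from itertools import product

candidates = {
    "two":  ["two", "to", "tao", "too"],
    "of":   ["of", "off", "on"],
    "thew": ["thew", "threw", "thaw", "the"],
}
bigram = {("two", "of"): 1e-3, ("of", "the"): 5e-2}  # toy LM

def p(a: str, b: str) -> float:
    return bigram.get((a, b), 1e-8)   # tiny default for unseen bigrams

sentence = ["two", "of", "thew"]
paths = product(*(candidates[w] for w in sentence))
best = max(paths, key=lambda path: p(path[0], path[1]) * p(path[1], path[2]))
print(best)   # ('two', 'of', 'the')
```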
Categorization Problem: Spelling
1. Field
2. Wield
3. Shield
4. Deceive
5. Receive
6. Ceiling
Rule-based Approach
“I before E except after C”
-- an example of a linguistic insight
Probabilistic Statistical Model:
• Count the occurrences of ‘ie’ and ‘ei’ and ‘cie’
and ‘cei’ in a large corpus
P(IE) = 0.0177
P(EI) = 0.0046
P(CIE) = 0.0014
P(CEI) = 0.0005
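A sketch of how such estimates might be computed; the slide does not state the normalization, so dividing the number of words containing each pattern by the total word count is just one plausible choice, and the five-word list is toy data.

```python
# Sketch of estimating the pattern probabilities from a word list.
# Per-word normalization is an assumption; the word list is toy data.
def pattern_rates(words: list[str]) -> dict[str, float]:
    patterns = ["ie", "ei", "cie", "cei"]
    return {p: sum(p in w for w in words) / len(words) for p in patterns}

print(pattern_rates(["field", "receive", "ceiling", "science", "their"]))
# {'ie': 0.4, 'ei': 0.6, 'cie': 0.2, 'cei': 0.4}
```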
Words where ie occurs after c
• science
• society
• ancient
• species
Chomsky’s Argument
• A completely new sentence must have a
probability of 0, since it is an outcome that
has not been seen.
• Since novel sentences are in fact generated all
the time, there is a contradiction.
Probabilistic language model
• Probabilities should broadly indicate likelihood
of sentences
– P(I saw a van) >> P(eyes awe of an)
Courtesy Dan Klein
Spell Checking
Let’s say the word hand is mistyped
Hand --- *Hamd
There you have an unknown word!
Spell Checking
• *Hamd is an out-of-vocabulary error
• Candidate corrections: Hand, Hard
What we need to find …
• P(hard|hamd)
• P(hand|hamd)
• Whichever is greater is the right one!
What we need to find …
• But how do you find the probability
P(hard|hamd) when ‘hamd’ is an *unknown
word*?
BAYES RULE
• P(E|F) = P(F|E) * P(E) / P(F)
• P(hand|hamd) = P(hamd|hand) * P(hand) / P(hamd)
– P(hamd) is the same for every candidate, so we can drop the denominator
• P(hamd|hand) = P(h|h) P(a|a) P(m|n) P(d|d)
– The factors P(h|h), P(a|a), and P(d|d) are all approximately 1, so ignore them!
• P(hand|hamd) ≈ P(m|n) * P(hand)
So we have what we needed to find …
• P(hand|hamd) ≈ P(m|n) * P(hand)
• P(hard|hamd) ≈ P(m|r) * P(hard)
• Whichever is greater is the right one!
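A minimal sketch of this comparison; the channel and prior values are invented for illustration.

```python
# Minimal sketch of the hand/hard comparison above.
# Channel and prior values are made up for illustration.
channel = {("m", "n"): 2e-3,    # P(typed m | intended n), toy value
           ("m", "r"): 5e-4}    # P(typed m | intended r), toy value
prior = {"hand": 2e-4, "hard": 3e-4}   # unigram P(word), toy values

scores = {"hand": channel[("m", "n")] * prior["hand"],
          "hard": channel[("m", "r")] * prior["hard"]}
print(max(scores, key=scores.get))   # hand, with these toy numbers
```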
Note: Use of Unigram Probabilities
• The P(m|n) part is called the channel
probability
• The P(hand) part is called the prior probability
• Kernighan’s experiment showed that P(hand)
is very important (causes a 7% improvement)!
Spell Checking
• How do you tell whether an unknown word is
a new word or a spelling mistake?
• How can you tell that “Shivaswamy” is not a wrong
spelling of “Shuddering”?
• You can use statistics
T-test
• Calculate the probability of “Shivaswamy” having
come about by errors made on “Shuddering” …
that is, 8 errors happening in 10 letters.
• P(error) = 0.8
• Say the average letter error rate is 0.1.
• Verify using the t-test that Shivaswamy is not
likely to have come about from Shuddering by a
0.1 error-per-letter process.
• Use the word length (10) as n.
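A worked sketch of that t-test: a one-sample test of the observed error rate 0.8 against the hypothesized 0.1, with n = 10, using the approximation s^2 ≈ p(1 - p) for a 0/1 error variable (an assumption about the intended variance estimate).

```python
# Worked sketch of the t-test above: observed error rate 0.8 over
# n = 10 letters vs. a hypothesized 0.1 error-per-letter process.
from math import sqrt

p_hat, p0, n = 0.8, 0.1, 10
s = sqrt(p_hat * (1 - p_hat))       # ~0.4, using s^2 ~ p(1 - p)
t = (p_hat - p0) / (s / sqrt(n))
print(round(t, 2))   # 5.53 -- far above ~2.26 (5% two-tailed, 9 df),
                     # so "Shivaswamy" is very unlikely to be a typo
                     # of "Shuddering"
```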