國立雲林科技大學
National Yunlin University of Science and Technology
Phonetic String Matching: Lessons from Information Retrieval
Advisor: Dr. Hsu
Graduate: Chih-Ling Wang
Authors: Justin Zobel, Philip Dart
2003 IEEE
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline









Motivation
Objective
Introduction
Phonetic matching versus information retrieval
Phonetic matching techniques
Performance assessment
Combination of evidence
Conclusions
Personal Opinion
Motivation

We explore the accuracy of phonetic string matching.
Objective

In this paper we propose new phonetic matching
techniques and describe the results of a new
comparative investigation of phonetic matching.
Introduction

Phonetic matching is used to identify strings that may
be of similar pronunciation, regardless of their actual
spelling.
Introduction(cont.)

There are two pragmatic issues that must be addressed
in such a phonetic matching system.
    One is speed: answers should be found reasonably quickly.
    The other is accuracy.
The parallels between information retrieval and
phonetic matching mean that:
    Both can be measured by the same kinds of techniques.
    Methods for improving information retrieval performance may
    also apply to phonetic matching.
Phonetic matching versus information
retrieval

In information retrieval, ranking is the process of
identifying which of a set of documents are most
likely to be similar in content to a given query.

Phonetic matching is the process of identifying which
of a set of strings are most likely to be similar in
sound to a given query string.
Phonetic matching versus information
retrieval(cont.)



In both cases the matching process is fundamentally
inexact, since human judgment is required to tell
whether the process's guess is correct.
Similarity is relative: it is not possible to determine in
isolation whether a query and a potential answer match.
It is difficult to give an accurate definition of
relevance.
Phonetic matching versus information
retrieval(cont.)

We consider phonetic matching to be the process of
identifying strings that, after elimination of possible
transmission or cognition errors, may sound the same.


Transmission errors include sound-alike mistakes in data
entry and mishearing of a spoken name over an imperfect
transmission medium.
Cognition errors include mistaking a pronunciation for an
expected word.
Phonetic matching techniques
Soundex
Soundex uses codes based on the sound of each letter
to translate a string into a canonical form of at most
four characters, preserving the first letter.
Soundex makes the error of transforming dissimilar-sounding
strings to the same code, and of transforming
similar-sounding strings to different codes.
There is no ranking of matches: strings are either
similar or not similar.
Phonetic matching techniques(cont.)
Example: reynold (r005043) => r543
         renauld (r050043) => r543
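The coding in this example can be reproduced with a minimal Python sketch. It assumes the standard Soundex letter-to-digit table (not shown on the slide) and the simplified procedure the example implies: code every letter after the first, collapse adjacent repeated codes, drop the 0 codes, and keep at most four characters. The names SOUNDEX_CODES, soundex_raw and soundex are illustrative only.

```python
# Assumed letter groups: the standard Soundex table, not taken from the slide.
SOUNDEX_CODES = {
    **dict.fromkeys("aehiouwy", "0"),
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex_raw(name):
    """Keep the first letter and replace every later letter by its code."""
    name = name.lower()
    return name[0] + "".join(SOUNDEX_CODES[c] for c in name[1:] if c in SOUNDEX_CODES)


def soundex(name):
    """Collapse adjacent repeated codes, drop the 0 codes, keep at most 4 characters."""
    raw = soundex_raw(name)
    collapsed = [raw[0]]
    for code in raw[1:]:
        if code != collapsed[-1]:
            collapsed.append(code)
    return "".join(c for c in collapsed if c != "0")[:4]


for name in ("reynold", "renauld"):
    print(name, soundex_raw(name), soundex(name))
```

Because the result is a single code per string, two names either share a code or do not, which is why Soundex cannot rank its matches.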
Phonetic matching techniques(cont.)
Phonix
Phonix is a Soundex variant.
Letters are mapped to a set of codes using the same
algorithm, but a slightly different set of codes is used,
and prior to mapping about 160 letter-group
transformations are used to standardise the string.
For example, the sequence tjv (v being any vowel) is mapped
to chv if it occurs at the start of a string, and x is
transformed to ecs.
These transformations provide context for the phonetic coding
and allow c and s to be distinguished.
Phonetic matching techniques(cont.)
Example: reynold (r005043) => r543
         renauld (r050043) => r543
Phonetic matching techniques(cont.)

In our experiments we consider a variant of Phonix, here
called Phonix+, in which truncation is not applied and a
minimal edit distance is used to compare the resulting
strings.
Phonetic matching techniques(cont.)
Q-gram methods

A q-gram of string s is any substring of s of some
fixed length q.

Simply counting the q-grams that two strings have in
common does not allow for length differences.
Phonetic matching techniques(cont.)
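For strings s and t with q-gram sets G_s and G_t, the distance is

\[ |G_s| + |G_t| - 2\,|G_s \cap G_t| \]

Example: rhodes; rod
We have used this q-gram method with q = 2.
rhodes: G_s = {rh, ho, od, de, es}
rod:    G_t = {ro, od}

\[ |G_s| + |G_t| - 2\,|G_s \cap G_t| = 5 + 2 - 2 \times 1 = 5 \]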
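A short Python sketch of this q-gram distance follows. It treats the q-grams of a string as a set, as the example does; the function names qgrams and qgram_distance are illustrative, not from the paper.

```python
def qgrams(s, q=2):
    """Return the set of all substrings of s with length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_distance(s, t, q=2):
    """|G_s| + |G_t| - 2 |G_s intersect G_t|, with q-grams taken as sets."""
    gs, gt = qgrams(s, q), qgrams(t, q)
    return len(gs) + len(gt) - 2 * len(gs & gt)


print(sorted(qgrams("rhodes")))         # ['de', 'es', 'ho', 'od', 'rh']
print(qgram_distance("rhodes", "rod"))  # 5
```

Subtracting twice the overlap means that q-grams present in only one of the two strings add to the distance, so a long string is penalised against a short query even when it contains all of the query's q-grams.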
Phonetic matching techniques(cont.)
Agrep


Agrep is a utility that embodies a fast algorithm for
identifying strings that contain a substring which is
identical to a query but for at most k insertions,
deletions, or replacements, where k is a predefined
constant.
Agrep was not designed for the task of phonetic
matching, but rather for fast searching of large files.
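The matching criterion described here (a substring within k insertions, deletions, or replacements of the query) can be illustrated with a plain dynamic-programming sketch in Python. This is not agrep's own algorithm, which is much faster; it only shows the same criterion. The function name contains_within_k is hypothetical.

```python
def contains_within_k(query, text, k):
    """True if some substring of text is within k edits of query.

    Simple O(len(query) * len(text)) dynamic programme; agrep itself
    uses a faster method that is not reproduced here.
    """
    # Row for the empty query prefix: a match may start anywhere in text,
    # so every entry is 0.
    prev = [0] * (len(text) + 1)
    for i, qc in enumerate(query, start=1):
        cur = [i]  # i query characters matched against nothing costs i
        for j, tc in enumerate(text, start=1):
            cur.append(min(prev[j] + 1,                # skip a query character
                           cur[j - 1] + 1,             # skip a text character
                           prev[j - 1] + (qc != tc)))  # match or replace
        prev = cur
    # The match may end anywhere in text.
    return min(prev) <= k


print(contains_within_k("rod", "rhodes", 1))  # True: "rhod" is "rod" plus one inserted letter
print(contains_within_k("rod", "rhodes", 0))  # False: "rod" is not a substring of "rhodes"
```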
Phonetic matching techniques(cont.)
Edit distances

A simple edit distance, which counts the minimal
number of single-character insertions, deletions, and
replacements needed to transform one string into
another, could be used for phonetic matching since
similar-sounding words are often spelled similarly.
Phonetic matching techniques(cont.)


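For two strings s and t of length i and j respectively,
this edit distance can be computed with the recurrence
relation edit(i, j):

\[
\begin{aligned}
\mathrm{edit}(0,0) &= 0 \\
\mathrm{edit}(i,0) &= i \\
\mathrm{edit}(0,j) &= j \\
\mathrm{edit}(i,j) &= \min[\,\mathrm{edit}(i-1,j)+1,\ \mathrm{edit}(i,j-1)+1,\ \mathrm{edit}(i-1,j-1)+r(s_i,t_j)\,]
\end{aligned}
\]

The function r(s_i, t_j) returns 0 if s_i and t_j are identical,
and 1 otherwise.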
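The recurrence edit(i, j) can be evaluated bottom-up with a table, as in the Python sketch below. The function name edit_distance and the table layout are illustrative choices; only the unit costs and the 0/1 function r come from the slide.

```python
def edit_distance(s, t):
    """Minimal number of single-character insertions, deletions and replacements."""
    m, n = len(s), len(t)
    # table[i][j] holds edit(i, j) for the first i characters of s
    # and the first j characters of t.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        table[i][0] = i  # delete all i characters
    for j in range(n + 1):
        table[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r = 0 if s[i - 1] == t[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,      # deletion
                              table[i][j - 1] + 1,      # insertion
                              table[i - 1][j - 1] + r)  # match or replacement
    return table[m][n]


print(edit_distance("rhodes", "rod"))  # 3 (delete h, e and s)
```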
Phonetic matching techniques(cont.)
Example: rhodes; rod (s = rhodes, t = rod)
edit(6,3) = min[edit(5,3)+1, edit(6,2)+1, edit(5,2)+1] = min[3, 5, 4] = 3
edit(5,3) = min[edit(4,3)+1, edit(5,2)+1, edit(4,2)+1] = min[2, 4, 3] = 2
edit(4,3) = min[edit(3,3)+1, edit(4,2)+1, edit(3,2)+0] = min[3, 3, 1] = 1
edit(3,3) = min[edit(2,3)+1, edit(3,2)+1, edit(2,2)+1] = min[3, 2, 2] = 2
edit(2,3) = min[edit(1,3)+1, edit(2,2)+1, edit(1,2)+1] = min[3, 2, 2] = 2
edit(1,3) = min[edit(0,3)+1, edit(1,2)+1, edit(0,2)+1] = min[4, 2, 3] = 2
edit(1,2) = min[edit(0,2)+1, edit(1,1)+1, edit(0,1)+1] = min[3, 1, 2] = 1
edit(1,1) = min[edit(0,1)+1, edit(1,0)+1, edit(0,0)+0] = min[2, 2, 0] = 0
edit(2,2) = min[edit(1,2)+1, edit(2,1)+1, edit(1,1)+1] = min[2, 2, 1] = 1
edit(2,1) = min[edit(1,1)+1, edit(2,0)+1, edit(1,0)+1] = min[1, 3, 2] = 1
edit(3,2) = min[edit(2,2)+1, edit(3,1)+1, edit(2,1)+0] = min[2, 3, 1] = 1
edit(3,1) = min[edit(2,1)+1, edit(3,0)+1, edit(2,0)+1] = min[2, 4, 3] = 2
edit(4,2) = min[edit(3,2)+1, edit(4,1)+1, edit(3,1)+1] = min[2, 4, 3] = 2
edit(4,1) = min[edit(3,1)+1, edit(4,0)+1, edit(3,0)+1] = min[3, 5, 4] = 3
edit(5,2) = min[edit(4,2)+1, edit(5,1)+1, edit(4,1)+1] = min[3, 5, 4] = 3
edit(5,1) = min[edit(4,1)+1, edit(5,0)+1, edit(4,0)+1] = min[4, 6, 5] = 4
edit(6,2) = min[edit(5,2)+1, edit(6,1)+1, edit(5,1)+1] = min[4, 6, 5] = 4
edit(6,1) = min[edit(5,1)+1, edit(6,0)+1, edit(5,0)+1] = min[5, 7, 6] = 5
The edit distance between rhodes and rod is therefore edit(6,3) = 3.
Phonetic matching techniques(cont.)
Editex

Editex is a phonetic distance measure that combines
the properties of edit distances with the letter-grouping
strategy used by Soundex and Phonix.

Editex also groups letters that can result in similar
pronunciations, but does not require that the groups be disjoint,
and can thus reflect the correspondences between letters and
possible similar pronunciations more accurately.
Phonetic matching techniques(cont.)

Editex is defined by the edit distance recurrence relation with
a redefined function r(s_i, t_j), which returns 0 if s_i and t_j are
identical, 1 if s_i and t_j both occur in the same group, and 2 otherwise.
The function d(a, b) is identical to r(a, b), except that if a is h or w
and a ≠ b then d(a, b) is 1.
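The following Python sketch shows one way these definitions could fit together. It rests on assumptions that the slide does not state and that should be checked against the paper: the letter groups in GROUPS are recalled from the published description of Editex, d(previous letter, current letter) is assumed to replace the unit cost of an insertion or deletion, and a blank is placed before each string so that the first letter has a predecessor. All names are illustrative.

```python
# Assumed Editex letter groups (not on the slide; verify against the paper).
# Groups may overlap, e.g. c appears in both "ckq" and "csz".
GROUPS = ["aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fpv", "sxz", "csz"]


def r(a, b):
    """0 if identical, 1 if the letters share a group, 2 otherwise."""
    if a == b:
        return 0
    if any(a in group and b in group for group in GROUPS):
        return 1
    return 2


def d(a, b):
    """Same as r, except that h or w followed by a different letter costs 1."""
    if a != b and a in ("h", "w"):
        return 1
    return r(a, b)


def editex(s, t):
    # Assumption: a leading blank gives the first letter a predecessor for d.
    s, t = " " + s.lower(), " " + t.lower()
    m, n = len(s) - 1, len(t) - 1
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        table[i][0] = table[i - 1][0] + d(s[i - 1], s[i])
    for j in range(1, n + 1):
        table[0][j] = table[0][j - 1] + d(t[j - 1], t[j])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = min(table[i - 1][j] + d(s[i - 1], s[i]),    # deletion
                              table[i][j - 1] + d(t[j - 1], t[j]),    # insertion
                              table[i - 1][j - 1] + r(s[i], t[j]))    # match or replacement
    return table[m][n]


print(editex("rod", "rot"))  # 1: d and t share a group
```

Unlike Soundex, this gives a graded distance, so candidate strings can be ranked rather than only accepted or rejected.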
Phonetic matching techniques(cont.)
Phonometric methods

Our algorithms for phonometric matching consist of
two stages:



First, the string of letters is converted into a string of
phonemes by a string-to-pronunciation conversion
algorithm.
The second stage is comparison of strings of phonemes.
The distance between pronunciations as represented
by strings of phonemes can be measured more
precisely than the distance between strings of letters.
Phonetic matching techniques(cont.)
Tapering


Tapering is a refinement to the edit distance
techniques based on a human-factors property:
differences at the start of a pronunciation can be
more significant than differences at the end.
A tapered edit distance of particular interest is one in
which the maximum penalty for replacement or
deletion at the start of a string just exceeds twice the
minimum penalty for replacement or deletion at the end
of the string.
Performance assessment


We can now compare the various approaches to phonetic
matching. Results are shown in Table 1, which gives 11-point
recall-precision results.
For many of the techniques tested, only a few distinct ranks are
possible, and some techniques only return two ranks, match and
not-match.
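As a reminder of how an 11-point recall-precision figure is usually obtained for one query, here is a minimal Python sketch. It assumes the conventional interpolated form (at each recall level 0.0, 0.1, ..., 1.0 take the best precision reached at that recall or higher, then average the 11 values); the slides do not spell out the exact variant used, and the function name and sample data are hypothetical.

```python
def eleven_point_average(ranked, relevant):
    """Average interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    points = []  # (recall, precision) after each rank position
    hits = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for k in range(11):
        level = k / 10
        # Interpolated precision: best precision at this recall level or beyond.
        total += max((p for rec, p in points if rec >= level), default=0.0)
    return total / 11


# Hypothetical ranking for one query, with two strings judged correct.
ranking = ["rod", "rhodes", "road", "rode"]
print(eleven_point_average(ranking, {"rod", "road"}))
```

A technique that returns only two ranks (match and not-match) gives no meaningful order within each rank, which makes figures of this kind sensitive to how ties are broken.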
Performance assessment(cont.)

The least effective methods such as Phonix and Soundex
only return a small number of answers for most queries.

Phonix and Soundex are not only finding many wrong
answers but also missing many right ones.

The "baseline" results are for a trivial phonetic matching
method: find all strings that differ from the query by at most
one character (an insertion, deletion, or replacement).
Performance assessment(cont.)

A particular problem with "best agrep" is the tiny number of
correct answers returned (fewer than one per query), but we
stress that agrep was not designed for phonetic matching.
Performance assessment(cont.)


An interesting discovery is that even the most successful of
the methods fetch rather different sets of answers, sometimes
almost without overlap.
As in information retrieval, it seems, two methods can
perform well without finding the same answers.
Combination of evidence

Phonetic matching has strong parallels with information
retrieval.

Matching techniques fetch a ranked list of matches in which
each entry has a weight attached to it; this weight is the
likelihood that the entry is a good match.

Combining the ranked lists produced by different retrieval
mechanisms can improve performance.
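One simple way to combine ranked lists is to add up each candidate's rank across the lists and re-sort, as in the Python sketch below. This is only an illustration: the slides do not say how the lists were merged in the paper, and the function name, the optional weights, the penalty rank for missing candidates, and the sample lists are all assumptions.

```python
def combine_ranked(lists, weights=None):
    """Merge ranked lists by (optionally weighted) rank sum; lower is better.

    A candidate missing from a list is given the rank just past that
    list's end. This penalty is an assumption of the sketch.
    """
    weights = weights or [1.0] * len(lists)
    candidates = {item for ranked in lists for item in ranked}
    scores = {}
    for item in candidates:
        total = 0.0
        for ranked, weight in zip(lists, weights):
            rank = ranked.index(item) + 1 if item in ranked else len(ranked) + 1
            total += weight * rank
        scores[item] = total
    # Break ties alphabetically so that the result is deterministic.
    return sorted(candidates, key=lambda item: (scores[item], item))


# Hypothetical outputs of two matching techniques for the query "rod".
list_a = ["rod", "rhodes", "road"]
list_b = ["rod", "rode", "rhodes"]
print(combine_ranked([list_a, list_b]))  # ['rod', 'rhodes', 'rode', 'road']
```

The same function accepts more than two lists, and unequal weights let one technique count for more than another.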
Combination of evidence(cont.)


The “(none)” lines are the results of running the methods
individually.
The best performance of all is given by the combination of
Phonix+ and the q-gram method, neither of which works
particularly well alone.
Combination of evidence(cont.)

More sophisticated techniques for combination could be used:
    Weighting the ranks from the different techniques.
    Combining more than two methods.

Combination of evidence is successful in this context.
Conclusions

Two of our proposals – the Ipadist and Editex methods – do
indeed lead to improved performance, whereas the third –
tapering – was not successful.

We showed that combination of evidence, which has been
successfully applied to information retrieval, consistently
improves performance.

Our new methods are substantially more effective than
existing methods such as edit distances, and combination
of evidence is as valuable in this domain as it is in
information retrieval.
Personal Opinion

The concepts in this paper may be useful in our
research, but I do not yet have a clear idea of
how to implement them. I need more time to
think.