National Yunlin University of Science and Technology
Chinese Named Entity Recognition using Lexicalized HMMs
Advisor: Dr. Hsu
Presenter: Zih-Hui Lin
Authors: Guohong Fu, Kang-Kwong Luke
Outline
- Motivation
- Objective
- Introduction
- NER as Known Word Tagging
- Known Word Segmentation
- Lexicalized HMM Tagger
- Experiments
- Conclusions
- Future Research
Motivation
- An HMM-based tagger does not provide contextual word information, which sometimes gives strong evidence for NER.
- Some learning methods, such as maximum entropy (ME) and SVMs, usually need much more time for training and tagging.
Objective
- Our system is able to integrate both the internal formation patterns and the surrounding contextual clues for NER under the framework of HMMs.
- As a result, the performance of the system can be improved without losing its efficiency in training and tagging.
Introduction
- We develop a two-stage NER system for Chinese.
  ─ Given a sentence, a known-word bigram model is first applied to segment it into a meaningful sequence of known words.
  ─ Then, a lexicalized HMM tagger is used to assign each known word a proper hybrid tag that indicates its pattern in forming an entity and the category of the formed entity.
Introduction (cont.)
- Our system is able to explore three types of features:
  ─ entity-internal formation patterns
  ─ contextual word evidence
  ─ contextual category information
- As a consequence, the system's performance can be improved without losing its efficiency in training and processing.
NER as Known Word Tagging - Categorization of Entities
- We use the same named entity tag set as defined in the IEER-99 Mandarin named entity task; these entity categories are further encoded using twelve different abbreviated SGML tags.
- Our system also assigns each common word in the input sentence a proper POS tag (from the Peking University POS tag set).
NER as Known Word Tagging - Patterns of Known Words in NER
- A known word w may take one of the following four patterns to present itself during NER:
  ─ w is an independent named entity (ISE)
  ─ w is the beginning component of a named entity (BOE)
  ─ w is in the middle of a named entity (MOE)
  ─ w is at the end of a named entity (EOE)
- Example: "温家宝/总理/" (Wen Jiabao / Premier) is tagged as
  "<BOE>温</BOE><MOE>家</MOE><EOE>宝</EOE><ISE>总理</ISE>".
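To make the pattern tags concrete, here is a minimal sketch (not from the paper; the function name and data layout are illustrative assumptions) that decodes a sequence of (word, pattern-tag) pairs into entity strings:

```python
# Minimal sketch (not from the paper): decode ISE/BOE/MOE/EOE pattern tags
# into entity strings. Common words outside any entity would carry POS tags
# instead; they are ignored here for simplicity.
def decode_entities(tagged_words):
    """tagged_words: list of (word, tag) pairs, e.g. [("温", "BOE"), ...]."""
    entities, buffer = [], []
    for word, tag in tagged_words:
        if tag == "ISE":            # the word is an entity by itself
            entities.append(word)
        elif tag == "BOE":          # beginning of a multi-word entity
            buffer = [word]
        elif tag == "MOE":          # middle component
            buffer.append(word)
        elif tag == "EOE":          # end component: close the entity
            buffer.append(word)
            entities.append("".join(buffer))
            buffer = []
    return entities

# The example from the slide above:
print(decode_entities([("温", "BOE"), ("家", "MOE"), ("宝", "EOE"), ("总理", "ISE")]))
# -> ['温家宝', '总理']
```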
NER as Known Word Tagging - NER as Known Word Tagging
- We define a hybrid tag set by merging the category tags and the pattern tags (see the sketch below).
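A minimal sketch of what this merging could look like; only CPN (which appears in the inconsistent-tagging example later) is taken from the slides, and the other category abbreviations are placeholders, not the actual twelve IEER-99 tags:

```python
# Minimal sketch (assumed tag names): a hybrid tag pairs an entity category
# with a pattern tag, e.g. "CPN-BOE" for a word beginning a Chinese person name.
# "CPN" is taken from the slides; the other categories are placeholders.
categories = ["CPN", "LOC", "ORG"]        # placeholder category tags
patterns = ["ISE", "BOE", "MOE", "EOE"]   # pattern tags from the slides

hybrid_tags = [f"{c}-{p}" for c in categories for p in patterns]
print(hybrid_tags[:4])   # ['CPN-ISE', 'CPN-BOE', 'CPN-MOE', 'CPN-EOE']
```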
Known Word Segmentation
- The aim is to find the most appropriate segmentation that maximizes the conditional probability P(W | C) (see the sketch below).
  ─ P(w_t | w_{t-1}) denotes the known-word bigram probability, which can be estimated from a segmented corpus using maximum likelihood estimation (MLE).
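The formula on this slide is not reproduced in the transcript; a standard bigram formulation consistent with the description above would be:

```latex
W^{*} = \arg\max_{W} P(W \mid C)
      \;\approx\; \arg\max_{w_1 \cdots w_n} \prod_{t=1}^{n} P(w_t \mid w_{t-1}),
\qquad
P(w_t \mid w_{t-1}) \approx \frac{\mathrm{Count}(w_{t-1}\,w_t)}{\mathrm{Count}(w_{t-1})}
\;\;\text{(MLE)}
```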
Lexicalized HMM Tagger - Lexicalized HMMs
- We employ uniformly lexicalized models to perform the tagging of known words for Chinese NER.
- The aim is to find an appropriate sequence of hybrid tags that maximizes the conditional probability P(T | W), namely the objective sketched below.
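The slide's formula is not reproduced in the transcript; the tagging objective described above can be written as:

```latex
T^{*} = \arg\max_{T} P(T \mid W),
\qquad T = t_1 t_2 \cdots t_n,\;\; W = w_1 w_2 \cdots w_n
```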
Lexicalized HMM Tagger - Lexicalized HMMs (cont.)
- Two types of approximation are employed to simplify the general model:
  ─ The first approximation is based on the independence hypothesis used in standard HMMs.
  ─ The second approximation follows the notion of the lexicalization technique.
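The exact factorizations on this slide are not reproduced in the transcript. As a rough sketch, a first-order standard HMM factors the model as below, and lexicalization then adds preceding words to the conditioning context; the precise context window used by the authors may differ:

```latex
% Standard first-order HMM approximation (sketch):
P(T \mid W) \;\propto\; \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)

% Lexicalized variant (sketch): also condition on the preceding word.
P(T \mid W) \;\propto\; \prod_{i=1}^{n} P(t_i \mid t_{i-1}, w_{i-1})\, P(w_i \mid t_i, w_{i-1})
```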
Lexicalized HMM Tagger - Lexicalized HMMs (cont.)
- MLE will yield zero probabilities for any cases that are not observed in the training data. To solve this problem, we employ the linear interpolation smoothing technique.
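A common form of linear interpolation smoothing, shown here for the bigram case, mixes the higher-order MLE estimate with a lower-order one (illustrative; the slide's exact formula and the estimation of the interpolation weight are not shown in the transcript):

```latex
\hat{P}(w_t \mid w_{t-1}) \;=\; \lambda\, P_{\mathrm{MLE}}(w_t \mid w_{t-1})
\;+\; (1-\lambda)\, P_{\mathrm{MLE}}(w_t),
\qquad 0 \le \lambda \le 1
```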
Lexicalized HMM Tagger - Lattice-Based Tagging
- In our implementation, we employ the classical Viterbi algorithm to find the most probable sequence of hybrid tags for a given sequence of known words. It works in three major steps (see the sketch below):
  ─ generation of candidate tags
  ─ decoding of the best tag sequence
  ─ conversion of the results
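A compact Viterbi sketch over a per-word tag lattice, illustrative only: the paper's candidate-tag generation and lexicalized probabilities are abstracted into a `score` callback, which is an assumption of this sketch rather than the authors' implementation.

```python
# Minimal Viterbi sketch over a candidate-tag lattice.
# candidates[i]: list of candidate hybrid tags for word i.
# score(prev_tag, tag, i): log-probability contribution of assigning `tag`
# to word i after `prev_tag` (a stand-in for the lexicalized HMM terms).
def viterbi(candidates, score):
    # Initialise the first lattice column: (best log-score, best path) per tag.
    best = [{t: (score(None, t, 0), [t]) for t in candidates[0]}]
    for i in range(1, len(candidates)):
        column = {}
        for t in candidates[i]:
            # Choose the best predecessor for candidate tag t.
            p, path = max(
                (best[i - 1][prev][0] + score(prev, t, i), best[i - 1][prev][1])
                for prev in candidates[i - 1]
            )
            column[t] = (p, path + [t])
        best.append(column)
    # Return the highest-scoring complete tag sequence.
    return max(best[-1].values())[1]

# Toy usage with a uniform score function:
print(viterbi([["CPN-BOE"], ["CPN-EOE"]], lambda prev, t, i: 0.0))
# -> ['CPN-BOE', 'CPN-EOE']
```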
Lexicalized HMM Tagger - Inconsistent Tagging
- Our system may yield two types of inconsistent tagging:
  ─ Pattern inconsistency: two adjacent known words are assigned inconsistent pattern tags, such as "ISE:MOE" or "ISE:EOE".
  ─ Class inconsistency: two adjacent known words are labeled with different category tags while at the same time their pattern tags indicate that they form a single entity, e.g.
    "<CPN-BOE>张</CPN-BOE><Vg-MOE>晓</Vg-MOE><CPN-EOE>华</CPN-EOE>".
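A minimal sketch of detecting pattern inconsistencies between adjacent tags; the set of legal transitions is inferred from the ISE/BOE/MOE/EOE definitions above, not quoted from the paper:

```python
# Legal pattern transitions between adjacent known words, inferred from the
# pattern definitions (an assumption, not the paper's own table): an entity
# opens with BOE, continues with MOE, closes with EOE; ISE stands alone.
LEGAL = {
    "ISE": {"ISE", "BOE"},
    "BOE": {"MOE", "EOE"},
    "MOE": {"MOE", "EOE"},
    "EOE": {"ISE", "BOE"},
}

def pattern_inconsistencies(tags):
    """Return positions of inconsistent adjacent pairs such as 'ISE:MOE'."""
    return [i for i in range(len(tags) - 1) if tags[i + 1] not in LEGAL[tags[i]]]

print(pattern_inconsistencies(["ISE", "MOE", "EOE"]))   # -> [0]  (ISE:MOE)
```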
Experiments - Experimental Measures
- The F-measure is a weighted harmonic mean of precision and recall:
  ─ R = (correctly recognized NEs) / (NEs in the manually annotated corpus)
  ─ P = (correctly recognized NEs) / (NEs recognized by the system)
  ─ β is the weighting coefficient.
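With these definitions, the weighted F-measure takes its standard form:

```latex
F_{\beta} \;=\; \frac{(\beta^{2} + 1)\, P\, R}{\beta^{2} P + R}
```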
Experiments - Data
Experiments - Experimental Results
- Our first aim is to examine how the use of the lexicalization technique affects the performance of our system.
Experiments - Experimental Results (cont.)
- To investigate whether the word-level model or the character-level model is more effective for Chinese NER.
Experiments - Experimental Results (cont.)
- To compare our system with other public systems for Chinese NER.
Conclusions
- The experimental results on different public corpora show that NER performance can be significantly enhanced using lexicalization techniques.
- Character-level tagging is comparable to, and may even outperform, known-word-based tagging when a lexicalized method is applied.
Future Directions
- Our current tagger is a purely statistical system; it will inevitably suffer from the problem of data sparseness, particularly in open-domain applications.
- Our system usually fails to yield correct results for some complicated NEs, such as nested organization names.