A Survey of Approaches on Mining the
Structure from Unstructured Data
Frederik Hogenboom
Flavius Frasincar
Uzay Kaymak
Econometric Institute
Erasmus University Rotterdam
PO Box 1738, NL-3000 DR
Rotterdam, the Netherlands
Nov. 30, 2009
Dutch-Belgian Database Day 2009 (DBDBD 2009)
Introduction
• A lot of data is generated every day
• Difficult to find information that meets one’s needs
• There is a need to mine the structure of data as a first step
towards understanding it
• Part of the effort to make the Web machine-understandable
• Solution: employ NLP techniques to extract knowledge from
unstructured text written in natural language
Which Technique to Choose?
Statistics-Based NLP (1)
• Utilize statistics and mathematical models based on probability
theory
• Refers to all non-symbolic and non-logical work on NLP, i.e., it
encompasses all quantitative approaches to automated language
processing, including:
– Probabilistic modeling
– Information theory
– Linear algebra
• Phrases extracted from text written in an arbitrary natural
language are analyzed in order to find (statistical) relations
Statistics-Based NLP (2)
• Word-based:
– Statistics collection on words
– Frequency counting and ranking generation (e.g., TF-IDF)
– Collocations (cliff-hanger, eye candy, take care, profit
announcement, etc.)
– Word Sense Disambiguation (WSD)
– Inference models: n-grams
– Clustering
• Grammar-based:
– Part-Of-Speech (POS) tagging
– Stochastic Context-Free Grammars (SCFG)
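As an illustration of the word-based frequency counting and ranking listed above, the following is a minimal TF-IDF sketch. The three toy documents, the whitespace tokenizer, and the function name `tf_idf` are assumptions made for this example; the weighting shown (relative term frequency times the logarithm of the inverse document frequency) is one common variant among several.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(number of documents / documents containing the term)
    """
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus.
docs = [
    "profit announcement lifts shares",
    "profit warning hits shares",
    "merger announcement expected",
]
scores = tf_idf(docs)
# Rank the terms of the first document by descending score.
ranking = sorted(scores[0], key=scores[0].get, reverse=True)
```

Terms that occur in only one document (such as "lifts") rank above terms shared across documents (such as "profit"), which is the ranking behavior the technique is meant to produce.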
Statistics-Based NLP (3)
• Advantages:
– Not knowledge-based: they require neither linguistic resources
nor expert knowledge
– Issues regarding leaking grammars, inconsistencies among
humans, dialects, etc. are alleviated
• Disadvantages:
– Often need a large amount of data
– Approaches do not deal with meaning explicitly, i.e., statistical
methods discover relations in corpora without considering semantics
Statistics-Based NLP (4)
• Examples:
– (Bannard et al., 2003) discuss several techniques for using
statistical models acquired from corpus data to infer the meaning of
verb-particle constructions:
• Collocation-like approach, frequency counting
• Focus on mining relations between words
– (Taira and Soderland, 1999) implement a statistical natural
language processor:
• Based on resonance probabilities between word pairs
• Uses word affinity knowledge from training sentences
• Focus on acquiring knowledge from radiology reports
Pattern-Based NLP (1)
• Use linguistic patterns to extract data from texts
• Patterns can be:
– Predefined
– Discovered (learned)
• Knowledge used:
– Lexical knowledge
– Syntactic knowledge
– Semantic knowledge
Pattern-Based NLP (2)
• Lexico-syntactic patterns:
– Combine lexical and syntactic elements with regular expressions
– E.g., “{NNP, }* NNP{,}? and NNP {(announce | discuss)}
collaboration {with NNP}?” mines a corpus for information on
mergers and collaborations of companies and/or persons
• Lexico-semantic patterns:
– Enrich lexico-syntactic patterns through the addition of semantics
– Gazetteers (simple typing):
• Use linguistic meaning of text
• E.g., “[sub:company] announces collaboration with
[obj:company]”
– Ontologies (complex typing):
• Include also relationships
• E.g., “[kb:Company] kb:collaborates
[kb:Company]”
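The lexico-syntactic pattern and the gazetteer-based typing above can be sketched as follows. This is a rough rendering under stated assumptions, not the authors' implementation: the sentence is hand-tagged with Penn Treebank tags rather than produced by a POS tagger, the company names and the `COMPANIES` gazetteer are hypothetical, and the tokens are serialized as `word/TAG` so that an ordinary regular expression can combine lexical and syntactic constraints.

```python
import re

# Hypothetical, hand-tagged sentence as (word, Penn Treebank tag) pairs.
# A real pipeline would obtain the tags from a POS tagger.
tagged = [
    ("Acme", "NNP"), (",", ","), ("Globex", "NNP"), ("and", "CC"),
    ("Initech", "NNP"), ("announce", "VBP"), ("collaboration", "NN"),
    ("with", "IN"), ("Umbrella", "NNP"),
]

# Serialize to "word/TAG" tokens so that one regular expression can mix
# lexical constraints (literal words) with syntactic ones (tags).
text = " ".join(f"{word}/{tag}" for word, tag in tagged)

# Rough rendering of the slide's pattern:
# {NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?
PATTERN = re.compile(
    r"((?:\S+/NNP ,/, )*\S+/NNP(?: ,/,)? and/CC \S+/NNP)"
    r" (?:announce|discuss)/VB\w*"
    r" collaboration/NN"
    r"(?: with/IN (\S+)/NNP)?"
)

match = PATTERN.search(text)
# Group 1 holds the coordinated subject list; pull out each proper noun.
subjects = re.findall(r"(\S+)/NNP", match.group(1))
# Group 2 holds the optional partner after "with" (None if absent).
partner = match.group(2)

# Lexico-semantic step with simple typing: a hypothetical gazetteer that
# lists known company names. Only typed matches become relations.
COMPANIES = {"Acme", "Globex", "Initech", "Umbrella"}
relations = [
    (subject, "collaborates", partner)
    for subject in subjects
    if subject in COMPANIES and partner in COMPANIES
]
```

The gazetteer check is what separates the lexico-semantic variant from the purely lexico-syntactic one: the same surface match is kept only when the matched names carry the required type.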
Pattern-Based NLP (3)
• Advantages:
– Need less training data
– Complex expressions can be defined
– Results are easily interpretable
• Disadvantages:
– Lexical knowledge is required
– Prior expert/domain knowledge might be required (for
lexico-semantic patterns)
– Defining and maintaining patterns is a cumbersome and non-trivial
task
Pattern-Based NLP (4)
• Examples:
– CAFETIERE (Black et al., 2005):
• Employs extraction rules defined at lexico-semantic level
• Makes use of gazetteering
• Knowledge is stored using Narrative Knowledge Representation
Language (NKRL)
• Knowledge base lacks reasoning support
• Focus on extracting relations from corpora
– Hermes (Frasincar et al., 2009):
• Patterns defined at lexico-semantic level
• Makes use of ontologies and reasoning engines
• Knowledge is based on an OWL domain ontology
• Focus on the use of pattern-based NLP in building personalized news
services
Hybrid NLP (1)
• Combine linguistic knowledge with statistical methods
• In practice, it is often difficult to stay within the boundaries of
a single approach
• Thus, it is convenient to combine the best of both worlds:
– Bootstrapping lexical methods
– Solving lack of expert knowledge by applying statistical methods
– Statistical methods that use some present (lexical) knowledge
Hybrid NLP (2)
• Advantages:
– Solve problems related to scaling and required expert knowledge of
pattern-based approaches
– Do not require as much data as statistical approaches
– Inherit some of the advantages of both statistical and pattern-based
approaches
• Disadvantages:
– By combining different techniques, maintaining completeness and
accuracy of the systems becomes more difficult
– Multidisciplinary aspects
– Inherit some of the disadvantages of both statistical and
pattern-based approaches
Hybrid NLP (3)
• Examples:
– Corpus-Based Statistics-Oriented techniques (Su et al., 1996):
• Mainly statistical learning techniques, guided by high-level linguistic
constructs
• Applications in POS tagging, semantic analysis of corpora, machine
translation, annotation, etc.
• Focus is on extracting inductive knowledge from corpora to support
building large scale NLP systems
– PANKOW (Cimiano et al., 2004):
• Generates instances of lexico-syntactic patterns indicating a certain
semantic or ontological relation
• Counts number of occurrences of patterns
• Statistical distribution of instances of these patterns constitutes the
collective knowledge
• Focus is on supporting annotation
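The PANKOW idea of instantiating lexico-syntactic patterns and counting their occurrences can be sketched as below. This is a simplified illustration under stated assumptions: the three templates are illustrative Hearst-style patterns rather than the system's actual pattern set, plurals are formed naively by appending "s", and occurrences are counted in a small hypothetical in-memory corpus instead of on the Web. The function name `categorize` is likewise an assumption.

```python
from collections import Counter

# Illustrative Hearst-style templates (an assumption, not PANKOW's own set).
TEMPLATES = [
    "{concept}s such as {instance}",
    "{instance} is a {concept}",
    "{instance} and other {concept}s",
]


def categorize(instance, concepts, corpus):
    """Count pattern instances per candidate concept for one instance."""
    text = " ".join(corpus).lower()
    counts = Counter()
    for concept in concepts:
        for template in TEMPLATES:
            # Generate a concrete phrase from the pattern, then count it.
            phrase = template.format(concept=concept, instance=instance)
            counts[concept] += text.count(phrase.lower())
    return counts


# Hypothetical toy corpus standing in for a large document collection.
corpus = [
    "Ports such as Rotterdam handle millions of containers.",
    "Rotterdam is a port in the Netherlands.",
    "Rotterdam and other ports compete for cargo.",
    "Some say Rotterdam is a museum of modern architecture.",
]
counts = categorize("Rotterdam", ["port", "museum", "river"], corpus)
# The concept with the most pattern occurrences is the suggested annotation.
best_concept, best_count = counts.most_common(1)[0]
```

The distribution in `counts` plays the role of the collective knowledge mentioned above: no single sentence decides the annotation, the aggregate does.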
Conclusions
• Three main approaches to NLP:
– Statistics-based
– Pattern-based
– Hybrid
• Which techniques to use for your NLP tasks? There is no single
best approach, but consider these rough guidelines:
– Evaluate your problem, preferences, and available resources
– If you are less concerned with semantics and you assume that
knowledge lies within statistical facts on a specific corpus, use a
statistics-based approach
– If you are concerned with the semantics of discovered information,
or you want to be able to easily explain and control the results, use
a pattern-based approach
– If you need to bootstrap a pattern-based approach using statistics
(e.g., insufficient knowledge available) or the other way around (e.g.,
need for a priori knowledge), use a hybrid approach
References
• C. Bannard, T. Baldwin, and A. Lascarides. A statistical approach to the semantics of
verb-particles. In ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and
Treatment, pages 65-72. Association for Computational Linguistics, 2003.
• W. J. Black, J. McNaught, A. Vasilakopoulos, K. Zervanou, B. Theodoulidis, and F. Rinaldi.
CAFETIERE: Conceptual Annotations for Facts, Events, Terms, Individual Entities, and
Relations. Technical Report TR-U4.3.1, Department of Computation, UMIST, Manchester,
2005.
• P. Cimiano, S. Handschuh, and S. Staab. Towards the Self-Annotating Web. In 13th
International Conference on World Wide Web (WWW 2004), pages 462-471. ACM, 2004.
• F. Frasincar, J. Borsje, and L. Levering. A Semantic Web-Based Approach for Building
Personalized News Services. International Journal of E-Business Research, 5(3):35-53,
2009.
• K.-Y. Su, T.-H. Chiang, and J.-S. Chang. An Overview of Corpus-Based Statistics-Oriented
(CBSO) Techniques for Natural Language Processing. Computational Linguistics and
Chinese Language Processing, 1(1):101-157, 1996.
• R. K. Taira and S. G. Soderland. A statistical natural language processor for medical
reports. In AMIA Symposium 1999, pages 970-974. American Medical Informatics
Association, 1999.