Survey
A Survey of Approaches on Mining the Structure from Unstructured Data

Frederik Hogenboom, Flavius Frasincar, Uzay Kaymak
[email protected] [email protected] [email protected]
Econometric Institute, Erasmus University Rotterdam
PO Box 1738, NL-3000 DR Rotterdam, the Netherlands
Nov. 30, 2009, Dutch-Belgian Database Day 2009 (DBDBD 2009)

Introduction
• A lot of data is generated every day
• Difficult to find information that meets one’s needs
• There is a need to mine the structure of data as a first step towards understanding it
• Part of the effort to make the Web machine-understandable
• Solution: employ NLP techniques to extract knowledge from unstructured text written in natural language

Which Technique to Choose?

Statistics-Based NLP (1)
• Utilize statistics and mathematical models based on probability theory
• Refers to all non-symbolic and non-logical work on NLP, i.e., it encompasses all quantitative approaches to automated language processing, including:
  – Probabilistic modeling
  – Information theory
  – Linear algebra
• Phrases extracted from text written in an arbitrary natural language are analyzed in order to find (statistical) relations

Statistics-Based NLP (2)
• Word-based:
  – Statistics collection on words
  – Frequency counting and ranking generation (e.g., TF-IDF)
  – Collocations (cliff-hanger, eye candy, take care, profit announcement, etc.)
  – Word Sense Disambiguation (WSD)
  – Inference models: n-grams
  – Clustering
• Grammar-based:
  – Part-Of-Speech (POS) tagging
  – Stochastic Context-Free Grammars (SCFG)
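
The word-based statistics listed above (frequency counting, TF-IDF ranking, and n-gram collection) can be illustrated with a minimal sketch. It is not taken from the surveyed work: the function names, the whitespace tokenization, and the three toy documents are assumptions made for illustration, and the weighting used (term frequency times log(N/df)) is only one of several common TF-IDF variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document (illustrative helper)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


def bigram_counts(docs):
    """Count adjacent word pairs, a simple basis for collocations and n-grams."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical toy corpus, invented for this example.
docs = [
    "the company announced a profit",
    "the company announced a merger",
    "the merger boosted the profit announcement",
]

scores = tf_idf(docs)
bigrams = bigram_counts(docs)
```

A term that occurs in every document (here "the") receives a score of zero, while a term confined to one document scores highest there. No linguistic knowledge is involved, only counts.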

Statistics-Based NLP (3)
• Advantages:
  – Not based on knowledge, thus they require neither linguistic resources nor expert knowledge
  – Issues regarding leaking grammars, inconsistencies among humans, dialects, etc. are alleviated
• Disadvantages:
  – Often need a large amount of data
  – Approaches do not deal with meaning explicitly, i.e., statistical methods discover relations in corpora without considering semantics

Statistics-Based NLP (4)
• Examples:
  – (Bannard et al., 2003) discuss several techniques for using statistical models acquired from corpus data to infer the meaning of verb-particle constructions:
    • Collocation-like approach, frequency counting
    • Focus on mining relations between words
  – (Taira and Soderland, 1999) implement a statistical natural language processor:
    • Based on resonance probabilities between word pairs
    • Uses word affinity knowledge from training sentences
    • Focus on acquiring knowledge from radiology reports

Pattern-Based NLP (1)
• Use linguistic patterns to extract data from texts
• Patterns can be:
  – Predefined
  – Discovered (learned)
• Knowledge used:
  – Lexical knowledge
  – Syntactic knowledge
  – Semantic knowledge

Pattern-Based NLP (2)
• Lexico-syntactic patterns:
  – Combine lexical and syntactic elements with regular expressions
  – E.g., “{NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?” mines a corpus for information on mergers and collaborations of companies and/or persons
• Lexico-semantic patterns:
  – Enrich lexico-syntactic patterns through the addition of semantics
  – Gazetteers (simple typing):
    • Use linguistic meaning of text
    • E.g., “[sub:company] announces collaboration with [obj:company]”
  – Ontologies (complex typing):
    • Include also relationships
    • E.g., “[kb:Company] kb:collaborates [kb:Company]”

Pattern-Based NLP (3)
• Advantages:
  – Need less training data
  – Complex expressions can be defined
  – Results are easily interpretable
• Disadvantages:
  – Lexical knowledge is required
  – Prior expert/domain knowledge might be required (for lexico-semantic patterns)
  – Defining and maintaining patterns is a cumbersome and non-trivial task

Pattern-Based NLP (4)
• Examples:
  – CAFETIERE (Black et al., 2005):
    • Employs extraction rules defined at lexico-semantic level
    • Makes use of gazetteering
    • Knowledge is stored using Narrative Knowledge Representation Language (NKRL)
    • Knowledge base lacks reasoning support
    • Focus on extracting relations from corpora
  – Hermes (Frasincar et al., 2009):
    • Patterns defined at lexico-semantic level
    • Makes use of ontologies and reasoning engines
    • Knowledge is based on an OWL domain ontology
    • Focus on the use of pattern-based NLP in building personalized news services

Hybrid NLP (1)
• Combine linguistic knowledge with statistical methods
• In practice, it is often difficult to stay within the boundaries of a single approach
• Thus, it is convenient to combine the best of both worlds:
  – Bootstrapping lexical methods
  – Solving lack of expert knowledge by applying statistical methods
  – Statistical methods that use some present (lexical) knowledge
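
One way to picture such a combination is to let a hand-written pattern propose candidate relations and let simple counts rank them. The sketch below is only an illustration under stated assumptions. The regular expression is a much-simplified stand-in for the collaboration pattern shown under Pattern-Based NLP (2): a capitalized word approximates an NNP tag, since no POS tagger is used. The function name, the company names, and the sentences are hypothetical.

```python
import re
from collections import Counter

# Simplified stand-in for "NNP and NNP {(announce | discuss)} collaboration".
# Assumption: a capitalized word approximates a proper noun (NNP).
PATTERN = re.compile(
    r"\b([A-Z][a-z]+) and ([A-Z][a-z]+) (?:announce|discuss) collaboration"
)


def collaboration_counts(sentences):
    """Count how often each unordered pair of names matches the pattern."""
    counts = Counter()
    for sentence in sentences:
        for first, second in PATTERN.findall(sentence):
            # Sort the pair so that (A, B) and (B, A) are counted together.
            counts[tuple(sorted((first, second)))] += 1
    return counts


# Hypothetical sentences, invented for this example.
sentences = [
    "Acme and Globex announce collaboration.",
    "Globex and Acme discuss collaboration with Initech.",
    "Acme and Initech announce collaboration.",
    "Acme and Globex went bankrupt.",
]

counts = collaboration_counts(sentences)
ranked = counts.most_common()
```

The pattern supplies the (lexical) knowledge and the counts supply the statistical evidence: a pair that matches repeatedly is ranked above a pair seen only once, and sentences that do not match the pattern contribute nothing.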

Hybrid NLP (2)
• Advantages:
  – Solve problems related to scaling and required expert knowledge of pattern-based approaches
  – Do not require as much data as statistical approaches
  – Inherit some of the advantages of both statistical and pattern-based approaches
• Disadvantages:
  – By combining different techniques, maintaining completeness and accuracy of the systems becomes more difficult
  – Multidisciplinary aspects
  – Inherit some of the disadvantages of both statistical and pattern-based approaches

Hybrid NLP (3)
• Examples:
  – Corpus-Based Statistics-Oriented techniques (Su et al., 1996):
    • Mainly statistical learning techniques, guided by high-level linguistic constructs
    • Applications in POS tagging, semantic analysis of corpora, machine translation, annotation, etc.
    • Focus is on extracting inductive knowledge from corpora to support building large scale NLP systems
  – PANKOW (Cimiano et al., 2004):
    • Generates instances of lexico-syntactic patterns indicating a certain semantic or ontological relation
    • Counts number of occurrences of patterns
    • Statistical distribution of instances of these patterns constitutes the collective knowledge
    • Focus is on supporting annotation

Conclusions
• Three main approaches to NLP:
  – Statistics-based
  – Pattern-based
  – Hybrid
• Which techniques to use for your NLP tasks? There is no single best approach, but consider these rough guidelines:
  – Evaluate your problem, preferences, and available resources
  – If you are less concerned with semantics and you assume that knowledge lies within statistical facts on a specific corpus, use a statistics-based approach
  – If you are concerned with the semantics of discovered information, or you want to be able to easily explain and control the results, use a pattern-based approach
  – If you need to bootstrap a pattern-based approach using statistics (e.g., insufficient knowledge available) or the other way around (e.g., need for a priori knowledge), use a hybrid approach

References
• C. Bannard, T. Baldwin, and A. Lascarides. A statistical approach to the semantics of verb-particles. In ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 65-72. Association for Computational Linguistics, 2003.
• W. J. Black, J. McNaught, A. Vasilakopoulos, K. Zervanou, B. Theodoulidis, and F. Rinaldi. CAFETIERE: Conceptual Annotations for Facts, Events, Terms, Individual Entities, and Relations. Technical Report TR-U4.3.1, Department of Computation, UMIST, Manchester, 2005.
• P. Cimiano, S. Handschuh, and S. Staab. Towards the Self-Annotating Web. In 13th International Conference on World Wide Web (WWW 2004), pages 462-471. ACM, 2004.
• F. Frasincar, J. Borsje, and L. Levering. A Semantic Web-Based Approach for Building Personalized News Services. International Journal of E-Business Research, 5(3):35-53, 2009.
• K.-Y. Su, T.-H. Chiang, and J.-S. Chang. An Overview of Corpus-Based Statistics-Oriented (CBSO) Techniques for Natural Language Processing. Computational Linguistics and Chinese Language Processing, 1(1):101-157, 1996.
• R. K. Taira and S. G. Soderland. A statistical natural language processor for medical reports. In AMIA Symposium 1999, pages 970-974. American Medical Informatics Association, 1999.