GLOSSARY COMPILATION
Alex Kotov (akotov2), Hanna Zhong (hzhong), Hoa Nguyen (hnguyen4), Zhenyu Yang (zyang2)

Roadmap
  Problem definition
  Motivation
  Solution framework
  Demo
  Conclusion

Problem definition
  The purpose of an automatic glossary compiler is to aid in the construction of a list of definitions across a large collection of documents.
  A definition is a concise description of what an entity is.
  Challenges:
    Multiple ways to phrase a definition
    A single term can have multiple definitions
      Need clustering

Motivation
  Benefit for everyone:
    Construct a glossary without marking index words by hand
    Quickly look up the definition of a term in a book, journal articles, a set of books, or a collection of papers on a particular topic
  No similar tool currently exists.

Solution framework
  Query processing: Yahoo API
  Definition extraction: Minipar
  Clustering algorithm: K-means
  Technology: IE Toolbar

Page processing
  Goals
    Fetch pages for a given query
    Convert multiple formats (e.g., PDF files) into text format
    Use multi-threading to accelerate these steps
  Filter
    Remove HTML tags, incomplete tokens…
    Detect sentence boundaries
    Remove garbage

Page processing (cont.)
  Process (flowchart):
    Query (query string) -> Yahoo API (result set) -> Fetch URL
    HTML? -> Remove Tag; PDF? -> Convert to TXT
    .TXT query pages -> Sentence Segmentation -> Garbage Cleaning

Definition extraction
  Dependency parser (MINIPAR):
    Based on the theory of dependency grammars
    Broad-coverage parser
    Output is a parse tree representing head-modifier relations
  Generic definition patterns:
    Use generic semantic patterns to overcome syntactic variability (expressing the same meaning with the same set of words through different syntactic structures of a sentence)
    Extensible, easily coded in XML, require minimal knowledge of linguistics
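
To make the idea of generic definition patterns concrete, here is a minimal sketch in Python. It is not the system described on the slides: that system matches patterns against MINIPAR dependency parses and stores them in XML, whereas this stand-in uses plain regular expressions over the raw sentence. The pattern list and the name `extract_definition` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical surface patterns. They stand in for the XML-coded semantic
# patterns; the original system matches on dependency parses, not raw text.
PATTERNS = [
    # "<term>, also known as <alias>, is <definition>"
    re.compile(r"^(?P<term>[^,]+?),\s+also known as [^,]+,\s+is\s+(?P<definition>.+)$", re.I),
    # "<term> can be defined as <definition>"
    re.compile(r"^(?P<term>.+?)\s+can be defined as\s+(?P<definition>.+)$", re.I),
    # "<term> is/are a|an|the <definition>"
    re.compile(r"^(?P<term>.+?)\s+(?:is|are)\s+(?P<definition>(?:a|an|the)\s+.+)$", re.I),
]


def extract_definition(sentence):
    """Return (term, definition) for the first matching pattern, else None."""
    text = sentence.strip()
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            term = match.group("term").strip()
            # Trim surrounding spaces, periods, and straight quotes.
            definition = match.group("definition").strip(' ."')
            return term, definition
    return None
```

Calling `extract_definition` on a sentence returns a `(term, definition)` pair, or `None` when no pattern applies. Surface patterns of this kind are brittle on complex sentences, which is the syntactic variability that parser-based patterns are meant to absorb.
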
Definition extraction
  "Data Mining, also known as knowledge discovery in databases, is the process of automatically searching large volumes of data for patterns."

Definition extraction
  Simple and complex definitions:
    Although it is usually used in relation to analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of contexts.
    Data Mining can be defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data".
  Simple and complex terms being defined:
    Data Mining
    Core of comparative genome analysis
  Extensible
  High accuracy (limited by the parser)

Clustering
  Algorithm: K-means
  Similarity measure: Vector space model
  Challenges:
    Define k
    Define similarity measure

Demo

Thank you! Questions?
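
As a supplementary sketch for the clustering slide, the following Python code groups candidate definitions of one term with K-means over a vector space model. The slides leave the choice of k and of the similarity measure open, so everything specific here is an assumption: raw term-frequency vectors, squared Euclidean distance, and centroids seeded from the first k vectors. The function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(texts):
    """Map each text to a term-frequency vector over the shared vocabulary."""
    counts = [Counter(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*counts))
    return [[float(c[w]) for w in vocab] for c in counts]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=100):
    """Plain K-means; assumes 1 <= k <= len(vectors).

    The first k vectors seed the centroids (an assumption made here
    to keep the sketch deterministic).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every vector.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # Assignments are stable, so the algorithm has converged.
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

A typical call is `kmeans(vectorize(definitions), k)`, which returns one cluster label per definition. Definitions that share a label can then be shown together as one sense of the term.
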