國立雲林科技大學 National Yunlin University of Science and Technology
Intelligent Database Systems Lab

Information Extraction from Wikipedia: Moving Down the Long Tail
Presenter: Cheng-Feng Weng
Authors: Fei Wu, Raphael Hoffmann, Daniel S. Weld
2008/11/18
KDD.9 (2008)

Outline
* Motivation
* Objective
* Methods and Experiments
* Conclusion
* Comments

Introduction
KYLIN automatically constructs and completes infoboxes for Wikipedia articles.

Motivation
* The number of article instances per infobox class has a long-tailed distribution.
* Many articles simply do not have much information to extract.

Objective
This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes:
* Shrinkage over an automatically learned subsumption taxonomy
* A retraining technique for improving the training data
* Supplementing results by extracting from the broader Web

Shrinkage
The paper uses shrinkage when training an extractor for an instance-sparse infobox class by aggregating data from its parent and children classes.
[Figure: class taxonomy with nodes Person, Scientist, Performer, Actor, and Comedian, and the instance ChungChian Hsu; labels Person.birth_plc=taiwan and Performer.location=?]

Shrinkage using the KOG Ontology
The Kylin Ontology Generator (KOG) is an autonomous system that builds a rich ontology by combining Wikipedia infoboxes with WordNet using statistical-relational machine learning [27].
The overall shrinkage procedure is as follows:
* Collect the related class set
* Query KOG for the mapped attribute
* Assign weights to the training examples
[Figure: the same class taxonomy (Person, Scientist, Performer, ChungChian Hsu, Actor, Comedian)]

Shrinkage Experiments
Three strategies are considered for determining the weights:
* Uniform: W = 1
* Size adjusted: W = min{1, k/(|C|+1)}
* Precision directed: W = p (the extraction precision)

Shrinkage Experiments (cont.)

Retraining
A complementary idea is to harvest additional training data from the outside Web.
* The retrainer uses TextRunner, which extracts relations from a crawl of about 100 million Web pages.
* TextRunner's crawl includes the top ten pages returned by Google.

Using TextRunner for Retraining
The retrainer uses the mapped set (C.a) from TextRunner to augment and clean the training data for C's extractors in two ways:
* Adding positive examples
* Filtering negative examples
[Figure labels: Positive example; Most common]

Retraining Experiments

Extracting From the Web
Extractors are trained on Wikipedia articles and then applied to relevant Web pages. This involves:
* Choosing search engine queries
* Weighting extractions
* Combining Wikipedia and Web extractions

Extracting From the Web (cont.)
* Choosing search engine queries. Example: for the birthday of Andrew Murray, a set of queries is issued: "Andrew Murray"; "Andrew Murray" birth date
* Weighting extractions
* Combining Wikipedia and Web extractions
scoreweb : s* r* c*

Web Experiments

Combining Experiments

Conclusions
This paper describes three powerful methods for increasing recall with respect to the long-tailed challenges above: shrinkage, retraining, and supplementing Wikipedia extractions with those from the Web.

Comments
Advantage
It uses a good idea to overcome the long-tail problem.
Drawback
It is only about improving the performance of Kylin, the system the authors developed.
Application
It can be used to construct a knowledge network.
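
Note on the shrinkage weights
The three weighting strategies from the Shrinkage Experiments slide, and the "assign weights to the training examples" step of the shrinkage procedure, can be sketched in Python as below. This is only an illustrative sketch, not the authors' implementation. The slides do not define k or |C|, so the code assumes k is a tunable constant and |C| is the number of training examples available from the related class. All function and parameter names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def uniform_weight() -> float:
    """Uniform strategy: W = 1 for every related class."""
    return 1.0


def size_adjusted_weight(k: float, class_size: int) -> float:
    """Size-adjusted strategy: W = min{1, k / (|C| + 1)}.

    Assumption: class_size stands for |C| and k is a tunable constant;
    the slides do not define either symbol.
    """
    return min(1.0, k / (class_size + 1))


def precision_directed_weight(precision: float) -> float:
    """Precision-directed strategy: W = p, the extraction precision."""
    if not 0.0 <= precision <= 1.0:
        raise ValueError("precision must lie in [0, 1]")
    return precision


def weighted_training_set(
    target_examples: List[str],
    related_examples: Dict[str, List[str]],
    weight_for: Callable[[str, List[str]], float],
) -> List[Tuple[str, float]]:
    """Pool examples from related (parent/child) classes with a weight.

    Examples from the target class itself keep weight 1.0 (an assumption);
    examples from each related class get the weight chosen by weight_for.
    """
    weighted = [(sentence, 1.0) for sentence in target_examples]
    for class_name, examples in related_examples.items():
        w = weight_for(class_name, examples)
        weighted.extend((sentence, w) for sentence in examples)
    return weighted


if __name__ == "__main__":
    # Hypothetical data: a sparse target class borrowing from its parent.
    pooled = weighted_training_set(
        target_examples=["performer sentence"],
        related_examples={"Person": ["person sentence 1", "person sentence 2"]},
        weight_for=lambda name, ex: size_adjusted_weight(k=1.0, class_size=len(ex)),
    )
    for sentence, weight in pooled:
        print(f"{weight:.2f}  {sentence}")
```

Under these assumptions, the size-adjusted rule gives full weight to small related classes and scales down the influence of large ones, so a large parent class does not swamp the few examples of the sparse target class.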