Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
IJCAI 2003 Workshop on Learning Statistical Models from Relational Data First-Order Probabilistic Models for Information Extraction Bhaskara Marthi, Brian Milch, Stuart Russell Computer Science Div. University of California NIPS 15th, 2003 Identity Uncertainty and Citation Matching Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser Computer Science Div. University of California Advisor: Hsin-His Chen Reporter: Chi-Hsin Yu Date: 2007.06.21 Outlines Introduction Related works Models for the bibliography domain Experiment on model A Desiderata for a FOPL Conclusions 2/18 Introduction – Citation Matching Problem Citation matching: the problem of deciding which citations correspond to the same publication Difficulties Different citation styles An imperfect copy of the book’s title Different ways to refer an object (identity) Ambiguity “Wauchope, K. Eucalyptus: Integrating Natural language Input with a Graphical User Interface” Tasks Author: “Wauchope, K. Eucalyptus” or “Wauchope, K.” ? Parsing Disambiguation Matching 3/18 Introduction – Citation Matching Problem: Examples Journal of Artificial Intelligence Research, or Artificial Intelligence Journal ?? 4/18 Introduction – First-Order Probabilistic Models Logic Propositio nal First-order Probabilistic Model Formula A B P(j m a b e), Bayesian Network Inference/ Algorithms Resolution, Model Checking, Forward chaining, DPLL, WalkSAT … Bayes’ rule, Summing-out, smoothing, prediction, approximation (likely-hood, MCMC …), … Formula x King(x) Greedy(x) Evil(x) x King(x) Greedy(x) Evil(x) = 0.8766 Inference/ Algorithms Unification, Resolution, … Learning, Approximation, … System/ Languages Prolog, Rule Engine (JBoss), … FOPL, RPM 5/18 Introduction – Result of Model B 6/18 Related Works IE Bayesian modeling the Message Understanding Conferences [DARPA,1998] finding stochastically repeated patterns (motifs) in DNA sequences [Xing et al., 2003] Robot localization [Anguelov et al., 2002] FOPL/RPM (Relational Prob. Model) A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000. 7/18 Models for the Bibliography Domain – Model A [Pasula et al. 2003] 8/18 Models for the Bibliography Domain – Model A (Cont.) Suggest a declarative approach to identity uncertainty using a formal language Algorithm Steps Generate objects/instances Parse and fill attributes Inference (Approximation, MCMC) Cluster the identity (publication) 9/18 Models for the Bibliography Domain – Model A (Cont.) Attributes using unconditional probability learn several bigram models using the following resources letter-based models of first names, surnames, and title words the 2000 Census data on US names a large A.I. BibTeX bibliography a hand-parsed collection of 500 citations Attributes using conditional probability Using noise channels for some attributes Citation.parse It keeps track of the segmentation of Citation.text An author segment, a title segment, and three filler segments (one before, one after, and one in between) Citation.text the corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames The parameters of the corruption models are learnt online, using stochastic EM Be constrained by Citation.parse, Paper.pubType, … These models were learned using our pre-segmented file. 10/18 Models for the Bibliography Domain – Model B Publisher Name, City Collection Authors Name, Type, Date, Publisher Name Area+ (Fields) Publication Citation Groups Title Area Type (Book/conf. …) AuthorList Collection Type (Area, Author) Style PublicationList CitationList Citation Publication TitleAsCited AuthorsAsCited Text Parse 11/18 Models for the Bibliography Domain – Model B (Cont.) Generating objects The set of Author objects, and the set of Collection objects are generated independently. the set of Publication objects is generated conditional on the Authors and Collections. CitationGroup objects are generated conditional on the Authors and Collections. Citation objects are generated from the CitationGroups. 12/18 Models for the Bibliography Domain – Model B (Cont.) Fill attributes Author.Name Publications.Title is chosen from a mixture of a letter bigram distribution with a distribution that chooses from a set of commonly occurring names is generated from an n-gram model, conditioned on Publications.area More specific relations and conditions between attributes 13/18 Experiment on model A – Experiment Setting Dataset Citeseer’s hand-matched datasets Each of these datasets contains several hundred citations of machine learning papers Citeseer’s phrase matching algorithm a greedy agglomerative clustering method based on a metric that measures the degrees to which the words and phrases of any two citations overlap half of them in clusters ranging in size from two to twenty-one citations 14/18 Experiment on model A – Experiment Result 15/18 Desiderata for a FOPL Contains A probability distribution over possible worlds The expression power to model the relational structure of the world An efficient inference algorithm A learning procedure which allows priors over the parameters Has the ability to answer queries to make inferences about the existence or nonexistence of objects having particular properties to represent common types of compound objects to represent probabilistic dependencies to incorporate domain knowledge into the inference algorithms 16/18 Conclusions First-order probabilistic models a useful, probably necessary, component of any system that extracts complex relational information from unstructured text data Some of the directions we plan to pursue in the future defining a representation language that allows such models to be specified declaratively, scaling up the inference procedure to handle large knowledge bases 17/18 Thanks!!