Statistical Relational Learning
for Link Prediction
Alexandrin Popescul and Lyle H. Ungar
Presented by Ron Bjarnason
11 November 2003
Link Prediction
• Link Prediction is an important problem
arising in many domains
– Web pages
– Computers
– Scientific publications
– Organizations
– People
• Being able to predict the presence of links or connections in a domain is both important and difficult to do well
Characteristics in Link Prediction
Domains
• Their nature is inherently multi-relational
– This makes the standard “flat” file domain
representation inadequate
• Data is often noisy or partially observed
– e.g. articles may be cited for any number of reasons, and those reasons are not fully observed
Typical Learning Approaches
• Assume one-table “flat” domain
representation
• Process of feature creation is decoupled
from feature selection (and is often
performed manually)
• Relevant features may not be readily apparent to a human analyst
The “Full Join” Approach
• Perform a full join on the entire database
and statistically analyze the entries
– Both impractical and incorrect
• Size is prohibitive
• Notion of an object is lost (stored across multiple
rows)
• Entries will be atomic attribute values, rather than
results from a complex search
• Negates the option to introduce intelligent search heuristics
The Relational Method
• Integrates standard statistical modeling (logistic
regression) with a process for systematically generating
features from relational data
• Feature generation is formulated as search in the space
of relational database queries
• Search-space bias can be controlled by specifying valid query types (see the sketch below):
– Aggregations or statistical operations
– Groupings
– Richer join conditions
– Arg-max based queries
• Allows for discovery of complex, interesting relationships
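To make the idea of a relational feature concrete, here is a minimal sketch (Python with SQLite; not the authors' implementation) of one candidate feature expressed as an aggregate SQL query over the three relations the paper uses: Citation, Author, and PublishedIn. Table and column names are illustrative assumptions.

# A minimal sketch (assumed schema, not the authors' code): one candidate
# relational feature computed as an aggregate SQL query.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Citation(citing TEXT, cited TEXT);
CREATE TABLE Author(doc TEXT, name TEXT);
CREATE TABLE PublishedIn(doc TEXT, venue TEXT);

INSERT INTO Citation VALUES ('d1', 'd2'), ('d1', 'd3'), ('d4', 'd2');
INSERT INTO Author VALUES ('d1', 'Smith'), ('d2', 'Smith'), ('d3', 'Jones');
INSERT INTO PublishedIn VALUES ('d2', 'KDD'), ('d3', 'KDD');
""")

# Candidate feature for a document pair (a, b):
# "number of documents that cite b and share an author with a".
feature_query = """
SELECT COUNT(DISTINCT c.citing)
FROM Citation c
JOIN Author a1 ON a1.doc = c.citing
JOIN Author a2 ON a2.doc = ?          -- document a
WHERE c.cited = ?                     -- document b
  AND a1.name = a2.name;
"""

value = cur.execute(feature_query, ("d1", "d2")).fetchone()[0]
print(value)  # scalar value usable as one column in the statistical model

The scalar returned by such a query becomes one candidate column for the statistical model; the search described on the following slides generates and evaluates many such queries automatically.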
Link Prediction in the Citeseer
Domain
• Can be used as a citation recommendation
service
– User would provide an abstract, author names,
possibly a partial reference list
• Citeseer provides a rich set of relational data
– Texts of titles
– Abstracts and documents
– Citation information
– Author names and affiliations
– Conference or journal names
Methodology
• Couple the two main processes
– Generation of feature candidates from
relational data
– Their selection with statistical model selection
criteria
Relational Feature Generation
• The search formulation is based on the concept of refinement graphs
• Start with the most general clauses and
progress by refining them into more
specialized clauses
Relational Feature Generation –
Refinement Graphs
• Directed acyclic graphs specifying search space
• Constrained by specifying legal clauses
– Negation and recursion disallowed
• Structured by partial ordering of clauses
• A search node is expanded (refined) to produce
the most general specializations
• ILP systems using refinement graph search usually apply two refinement operators (see the sketch below)
– Add a predicate to a clause
– A single variable substitution
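As a rough illustration of these two operators (assumed clause representation and predicate vocabulary; not the authors' code), the sketch below represents a clause as a list of literals over variables and generates its most general specializations.

# A rough sketch, not the authors' implementation: a clause is a list of
# literals such as ("Citation", "X", "Y"); refinement either adds one
# predicate from an assumed vocabulary or performs a single variable
# substitution (replacing one existing variable with another).
from itertools import combinations

PREDICATES = ["Citation", "Author", "PublishedIn"]   # assumed binary predicates

def clause_vars(clause):
    return sorted({arg for literal in clause for arg in literal[1:]})

def refine(clause, fresh="Z"):
    """Yield the most general specializations of a clause."""
    existing = clause_vars(clause)
    # Operator 1: add a predicate, sharing one existing variable and one new one.
    for name in PREDICATES:
        for old in existing:
            yield clause + [(name, old, fresh)]
            yield clause + [(name, fresh, old)]
    # Operator 2: a single variable substitution (unify two existing variables).
    for a, b in combinations(existing, 2):
        yield [(lit[0],) + tuple(b if arg == a else arg for arg in lit[1:])
               for lit in clause]

# Example: refine the most general citation clause.
for specialization in refine([("Citation", "X", "Y")]):
    print(specialization)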
Relational Feature Generation –
Aggregates
• Query results are aggregated to produce scalar
numeric values to be used in statistical learning
• Any statistical aggregate can be valid, but some are expected to be more useful than others (see the sketch below)
– Count
– Average
– Max
– Min
– Mode
– Empty
• Aggregations are considered for inclusion at
each node, but not factored into further search
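As a concrete illustration (assumed helper name; not the paper's code), the sketch below turns the bag of values returned by a relational query into the scalar features listed above.

# A small sketch (assumed helper name, not the paper's code): map the
# values returned by a query to scalar features via the aggregates above.
from statistics import mean, multimode

def aggregate_features(values):
    """Map a query result (a list of numbers) to scalar feature values."""
    if not values:                       # "Empty" indicator handles the no-result case
        return {"count": 0, "average": None, "max": None,
                "min": None, "mode": None, "empty": 1}
    return {
        "count": len(values),
        "average": mean(values),
        "max": max(values),
        "min": min(values),
        "mode": multimode(values)[0],    # one of the most frequent values
        "empty": 0,
    }

print(aggregate_features([3, 3, 7, 1]))
print(aggregate_features([]))

Each such scalar, one per aggregate per query, is a candidate feature passed on to the model selection step.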
Relational Feature Selection
• Logistic Regression is used for binary
classification problems
• Regression coefficients are learned to
maximize the likelihood function
• Stepwise model selection with the Bayesian Information Criterion (BIC) is used to avoid overfitting (see the sketch below)
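A hedged sketch of what such a procedure could look like: forward stepwise selection of feature columns by BIC around a logistic regression model (statsmodels; the synthetic data and all names below are assumptions, not the authors' code).

# A hedged sketch of forward stepwise selection with BIC using
# statsmodels logistic regression; synthetic data for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 500, 6
X = rng.normal(size=(n, p))                      # candidate feature columns
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n) > 0).astype(int)

def bic_of(columns):
    """Fit a logistic regression on the chosen columns and return its BIC."""
    design = sm.add_constant(X[:, columns]) if columns else np.ones((n, 1))
    return sm.Logit(y, design).fit(disp=0).bic

selected, remaining = [], list(range(p))
current_bic = bic_of(selected)
while remaining:
    # Try adding each remaining feature; keep the one that lowers BIC most.
    scores = [(bic_of(selected + [j]), j) for j in remaining]
    best_bic, best_j = min(scores)
    if best_bic >= current_bic:                  # no improvement: stop
        break
    selected.append(best_j)
    remaining.remove(best_j)
    current_bic = best_bic

print("selected feature columns:", selected)

Selection stops when no candidate feature lowers the BIC, which is how the procedure guards against overfitting as the space of generated features grows.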
Tasks and Data – IID Violation
• The relational structure violates the assumption of independent, identically distributed (i.i.d.) observations
• This can be remedied by choosing the
right features
• When the right features are used, the
observations are independent given the
features
Two Prediction Tasks
1. The identity of all objects is known.
Some link structure is known. Predict
unobserved links.
2. New objects arrive. Predict their links.
– What do we know about the objects?
  – Some of their links
  – Some of their attributes
This paper presents results for task 1
The Citeseer Environment
• 271,343 documents
• 1,092,200 citations
• Five data sets defined
– Four data sets consist of links among
documents containing a certain query phrase
(e.g. “artificial intelligence”)
– Fifth data set includes all documents
Learning Methodology
• Populate three relations: Citation, Author, and PublishedIn
• Sample 2,500 examples of each of the following (see the sketch below):
– Positive training examples (from available
links)
– Negative training examples (absence of a
link)
– Positive test examples
– Negative test examples
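A minimal sketch of this sampling step with synthetic data (assumed structures, not the authors' code): positive examples are existing citation links, and negative examples are document pairs with no link between them.

# A minimal sketch with synthetic data (not the authors' code): positives
# are existing citation links, negatives are unlinked document pairs.
import random

random.seed(0)
documents = [f"d{i}" for i in range(1000)]

citations = set()
while len(citations) < 6000:                     # synthetic Citation relation
    a, b = random.choice(documents), random.choice(documents)
    if a != b:
        citations.add((a, b))

def sample_negatives(n):
    """Draw n document pairs that are not linked by a citation."""
    negatives = set()
    while len(negatives) < n:
        pair = (random.choice(documents), random.choice(documents))
        if pair[0] != pair[1] and pair not in citations:
            negatives.add(pair)
    return list(negatives)

positives = random.sample(sorted(citations), 5000)
train_pos, test_pos = positives[:2500], positives[2500:]
negatives = sample_negatives(5000)
train_neg, test_neg = negatives[:2500], negatives[2500:]

# The test citations would then be removed from the background relations,
# as described on the next slide, so the answers are not available to the learner.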
Learning Methodology
• Remove citations from test set (but no
other relevant information)
• Remove citations from training set (so
answers are not contained in background
information)
• Perform learning
– Using citations only
– Using all relevant information (citation,
authors and venue)
Results: Training and Test Set Accuracies (%) – Balanced Priors

Dataset                     BK: Citation          BK: All
                            Train     Test        Train     Test
“artificial intelligence”   90.24     89.68       92.60     92.14
“data mining”               87.40     87.20       89.70     89.18
“information retrieval”     85.98     85.34       88.88     88.82
“machine learning”          89.40     89.14       91.42     91.14
Entire collection           92.80     92.28       93.66     93.22
The End