Machine Learning and the Semantic Web
Hendrik Blockeel
Katholieke Universiteit Leuven, Department of Computer Science
Thanks: Raymond Kosala, Nico Jacobs

Overview
- Machine learning and data mining
- Relationship with semantic web
  - Synergy between both
- Some concrete examples
  - Document classification
  - Information integration
- Conclusions

Machine Learning & Data Mining
- Related technology, different focus
- Machine learning:
  - Programs that improve their performance on certain tasks
  - Focus on adaptive behaviour
- Data mining:
  - Discovering implicit knowledge (regularities) in large amounts of data
  - Focus on handling large amounts of data
- Very useful technology in the context of the Web

Learning Agents
- Programs that learn the user’s preferences
  - Make life for the user as simple as possible
  - E.g., intelligent mail reader
  - E.g., adaptive web pages
    - Move links, create “direct” links, ...
    - Index page synthesis (Perkowitz & Etzioni, IJCAI 1999)
- Programs that learn how to find reliable information
  - E.g., learn which other people have similar preferences to this user, use their opinions to make suggestions
- (Other applications: learning to play games, ...)

Mining the Web
- Analyze data that are available on the Web
- Distinguish 3 types:
  - Web content mining: look in contents of documents (text, ...)
  - Web structure mining: look at links between documents
  - Web usage mining: look at user logs (e.g., who accessed a web page, which links are often used, ...)

Web Content Mining
- Relies on information extraction
  - E.g., in a text: find keywords, ...
  - Techniques from machine learning, statistics, ... are used to guess from context
    - what a word means
    - what its function in the text is
    - ...
  - Fill a schema with specific slots, based on analysis of text
  - Even more complicated: recognise objects in pictures, ...
- Information extraction is a complex matter

Mining for Genes
- Jenssen et al.
  (2001), Nature Genetics 28, “A literature network of human genes”
- Mining the MEDLINE database of abstracts
  - Find names of genes occurring together
  - Construct similarity graph
  - Construct a database with this information
- Database contains knowledge no single individual has, or could obtain without data mining
- Similar techniques could be used on the web
  - One extra problem: uncertainty about reliability

Web Structure Mining
- Analyse structure of the web
  - Which sites have many incoming / outgoing links? Identify “hubs”
  - Find clusters of sites that are strongly interconnected: Web communities
  - ...
- E.g., Google
  - Identifies important pages based on the links that point to them (rather than the contents of the page itself)

Web Usage Mining
- Log user behaviour
  - Which links are often followed, in which order, how long is a page looked at, ...
- Possible at several levels:
  - General usage statistics
  - User-specific statistics
  - Relating behaviour to properties of user, insofar available
- E.g., adaptive web sites
  - Adaplix project: automatic index page creation

Web Mining As It Currently Is
- Machine learning / data mining strongly rely on
  - Data quantity
  - Data quality
- Quantity is usually not a problem on the Web
- Quality is!
  - Much data not in easily processable format
    - E.g., inside text documents: need information extraction
  - Unstructured, poorly structured, heterogeneously structured
  - Lots of noise
  - ...

How Is All This Related to the Semantic Web?
- There can be a synergy:
  - Machine learning can help with building the Semantic Web
  - The Semantic Web will help mining the Web, making Web interfaces and agents more intelligent

What Machine Learning Can Do for the Semantic Web
- Upgrading the current web to a semantic web involves a lot of work
- Can partially be automated!
- Examples:
  - Learning ontologies
  - Automatic document classification
  - Information integration
  - ...
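
The link-based notion of importance mentioned under Web Structure Mining can be illustrated with a small sketch. The code below is a simplified PageRank-style power iteration over a toy link graph; it is an illustration under stated assumptions, not a description of what Google actually runs. The function name `link_scores`, the damping value 0.85, the iteration count, and the page names in `toy_web` are all hypothetical choices made for this example.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scores.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping every page to a score; scores sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives a small baseline score.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split over its outgoing links.
                share = damping * scores[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * scores[page] / n
                for p in pages:
                    new[p] += share
        scores = new
    return scores


# Hypothetical toy web: three pages point to "hub", which points back to "a".
toy_web = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
    "hub": ["a"],
}
scores = link_scores(toy_web)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the page with the most incoming links ends up first in `ranking`, even though nothing about page contents was used.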

Learning Ontologies
- Maedche & Staab (2001), “Ontology learning for the semantic web”
- View:
  - Manually creating ontologies is very labour-intensive
  - Fully automating the creation of ontologies is not feasible
  - Hence: develop a tool that helps building ontologies
- Basic components:
  - Good graphical interface (man-machine interaction)
  - Powerful underlying machine learning techniques
- Text-To-Onto framework:
  - Import / reuse existing ontologies
  - Extract ontology from documents
    - Identify new terms, map onto existing concepts or define new ones
    - Identify relationships between concepts
    - ...
    - Many opportunities for general machine learning techniques
  - Prune ontology
  - Refine ontology

Some Useful Techniques for Learning Ontologies
- Term extraction from texts
  - Identification of concepts
- Hierarchical clustering
  - Clustering: finding groups of “similar” things
  - Hierarchical clustering: clusters of clusters
  - Taxonomy can be constructed through hierarchical clustering of concepts
- Association rules
  - Find sets of terms that often occur together
  - May indicate important relations
  - E.g., events in texts often co-occur with locations

Information Integration
- Doan, Domingos, Halevy: “Reconciling Schemas of Disparate Data Sources”, ACM SIGMOD 2001
- Context: given databases with different schemas:
  - Find similarities in schemas, guess how concepts map onto each other
  - Integrate the schemas
- Essentially the same as mapping ontologies onto each other

Automated Document Classification
- Mitchell et al.
- Based on examples of web pages + what kind of page they are (course page, student page, ...), learn to classify new pages
  - Can be based on contents of page, links pointing to page, typical structure of certain kinds of web sites (e.g., universities), ...
- Note: helps to relate objects to ontology
- Problem: how to get labelled examples
  - Unlimited amount of unlabelled pages available
  - But labelling them manually is labour-intensive!
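
The classification setting on the slide above (labelled example pages in, a label for a new page out) can be sketched with a very small content-based classifier. This is a minimal illustration assuming a word-count model with Laplace smoothing (a multinomial naive Bayes); the slides do not say which learner was used. The function names `train` and `classify`, and the four toy pages in `examples`, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    class_counts = Counter()             # label -> number of pages
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts


def classify(text, word_counts, class_counts):
    """Return the label with the highest smoothed log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_pages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the prior: how common is this kind of page?
        score = math.log(class_counts[label] / total_pages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Laplace smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled pages (contents only, no link information).
examples = [
    ("homework lectures syllabus exam schedule", "course"),
    ("lecture notes assignments exam grading", "course"),
    ("my research interests hobbies and photos", "student"),
    ("phd student research advisor publications", "student"),
]
word_counts, class_counts = train(examples)
label = classify("syllabus and exam dates for the lectures",
                 word_counts, class_counts)
```

Only page contents are used here; link text or site structure, which the slide also mentions, would need additional features.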

Exploiting Unlabelled Data
- A solution: co-training (Blum & Mitchell 1998)
  - Learn separate (imperfect) classifiers from disjoint sets of information, each sufficient on its own
    - E.g., learn to classify pages from
      - Content of page (“Home page of CS 101”)
      - Links pointing to page (“CS 101”)
  - Take the classifications that classifier A is most certain of, add these labels to the training set for B (and vice versa)
  - Repeat multiple times (a kind of bootstrapping process)
- Co-training makes it possible to exploit large amounts of unlabelled data!

What the Semantic Web Can Do for Machine Learning
- Will make mining the web much easier
- Reason 1: removal of ambiguity
  - More precise knowledge of what is meant with certain terms
- Reason 2: structured vs. unstructured data
  - Learning from structured data is much easier than from unstructured data
- Reason 3: availability of background knowledge
  - Can be used to make better decisions when learning

Removal of Ambiguity
- Example: text document classification
  - E.g., given a text, tell in which newsgroups it belongs
- Typical approach: “bag of words”
  - Look only at which words occur in the text, and how often
  - Each time a word occurs that occurs mainly in one particular class, increase the probability for that class
- But words are ambiguous!
- Increased classification accuracy can be expected by removing ambiguity

Mining From (Un)structured Data
- Mining data = intensively querying data
- Answering a query is
  - Easy in structured data (relational database, XML, ...)
  - Harder in semi-structured data (e.g., HTML)
  - Hard in unstructured data
- Information extraction needed
  - Could do this by learning a “wrapper”
  - This involves one extra layer of learning
- Relating this to our text example: taking into account the function of words in the text

Availability of Background Knowledge
- Learning = finding relevant patterns in behaviour
- Important to have the right context to describe these patterns
- Example: making interesting offers to clients
  - “People who bought this book also bought ...”
  - = “Instance-based” learning
    - Estimate profile of user
    - Find users with similar profile
    - Look at behaviour of those users to help current user

Availability of Background Knowledge
- Can work better if more background knowledge is available, e.g., type of book, author, ...
- For instance, for books:
  - “similar profile” = users that up till now bought the same books as this user
    - May not be many people
  - “similar” = often bought books by the same author
    - Probably many more people, allows for a more reasonable guess
  - “similar” = often bought books of the same genre (fiction, ...)
    - May work even better
- Ontologies (among others) provide such background knowledge

Web Mining Revisited
- Semantic Web will change
  - Content mining: clearer view on contents and meaning of documents
  - Structure mining: more relevant structure
  - Usage mining: more relevant information on actions of user
- Will in general improve intelligence of systems
  - E.g., mail filter gets a better view of the contents of mails

Promising Learning Techniques
- Many different learning techniques exist
  - Neural networks, support vector machines, instance-based learning, Bayesian learning, association rules, ...
- Not all equally suitable for every task
  - E.g., SVM for document classification works well
  - E.g.,
    instance-based learning: find other users with the same profile as this user to make predictions
- Intelligent agents will use a mix of them
- Relational learners seem interesting
  - Can handle explicit information on objects and relations between them
  - Classic example: inductive logic programming

Inductive Logic Programming
- Induces rules in first-order logic from examples or other rules
  - Such rules can be used to reason with
  - The reasoning can be explained
    - Cf. example of mail program
- Can use existing background knowledge
  - “Knowledge intensive learning”
  - Currently: good background knowledge has to be engineered manually
  - Will become more easily available with the semantic web
  - Example: mining in chemical domains

Mining in Chemical Domains
- Example problem: relate activity of a molecule to its properties
  - Useful for, e.g., drug development
- Which properties are important?
  - Chemically relevant properties: functional groups, 3D structure, ... ?
  - Has to be encoded manually
- Ideally: get relevant information from some trustworthy data source as and when needed
- Intelligent agents will exploit (“tap”) the common intelligence of the Web

Conclusions
- Machine learning is a promising tool for the Semantic Web
  - For building it
  - For exploiting it
- Clear synergy between Semantic Web efforts and Machine Learning efforts

Some References
- Maedche, “A Machine Learning Perspective for the Semantic Web”, position paper, www.semanticweb.org/SWWS/program/position/soi-maedche.pdf
- Maedche & Staab (2001), “Ontology Learning for the Semantic Web”, IEEE Intelligent Systems 16(2)
- Jenssen et al. (2001), Nature Genetics 28
- Doan et al. (2001), ACM SIGMOD conf.
- Kosala & Blockeel (2000), SIGKDD Explorations 2(1)
- Mitchell (1996), Machine Learning