Machine Learning and the Semantic Web
Hendrik Blockeel
Katholieke Universiteit Leuven
Department of Computer Science
Thanks: Raymond Kosala, Nico Jacobs
Overview

Machine learning and data mining
Relationship with semantic web
  Synergy between both
Some concrete examples
  Document classification
  Information integration
Conclusions
Machine Learning & Data Mining

Related technology, different focus
  Machine learning:
    Programs that improve their performance on certain tasks
    Focus on adaptive behaviour
  Data mining:
    Discovering implicit knowledge (regularities) in large amounts of data
    Focus on handling large amounts of data
Very useful technology in the context of the Web
Learning Agents

Programs that
  Learn the user’s preferences
    Make life for the user as simple as possible
    E.g., intelligent mail reader
    E.g., adaptive web pages
      Move links, create “direct” links, ...
      Index page synthesis (Perkowitz & Etzioni, IJCAI 1999)
  Learn how to find reliable information
    E.g., learn which other people have similar preferences to this user, use their opinions to make suggestions
(other applications: learning to play games, ...)
Mining the Web

Analyze data that are available on the Web
Distinguish 3 types:
  Web content mining
    Look in contents of documents (text, ...)
  Web structure mining
    Look at links between documents
  Web usage mining
    Look at user logs (e.g. who accessed a web page, which links are often used, ...)
Web Content Mining

Relies on information extraction
  E.g., in a text: find keywords, ...
    Techniques from machine learning, statistics, ... used to guess from context
      what a word means
      what its function in the text is
      ...
  Fill a schema with specific slots, based on analysis of text
  Even more complicated: recognise objects in pictures, ...
Information extraction (IE) is a complex matter
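Slot filling can be pictured with a minimal sketch. The schema (speaker, time, room), the sample sentence, and the hand-written patterns are all hypothetical; a learned extractor would induce such patterns from annotated examples instead of having them coded by hand.

```python
import re

# Hypothetical schema: one hand-written pattern per slot.
SLOT_PATTERNS = {
    "speaker": re.compile(r"by (?:Prof\.|Dr\.) ([A-Z][a-z]+ [A-Z][a-z]+)"),
    "time": re.compile(r"\bat (\d{1,2}:\d{2})\b"),
    "room": re.compile(r"\bin room ([A-Z]?\d+)\b"),
}

def fill_slots(text):
    """Return a dict with one entry per slot; None when nothing matched."""
    filled = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        filled[slot] = match.group(1) if match else None
    return filled

# Made-up announcement used only for illustration.
announcement = "Seminar by Prof. Ada Smith at 14:30 in room B12."
slots = fill_slots(announcement)
print(slots)
```

Fixed patterns like these break as soon as the wording changes, which is one reason the slide calls information extraction a complex matter.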
Mining for Genes

Jenssen et al. (2001), Nature Genetics 28, “A literature network of human genes”
Mining MEDLINE database of abstracts
  Find names of genes occurring together
  Construct similarity graph
  Construct a database with this information
Database contains knowledge no single individual has, or could obtain without data mining
Similar techniques could be used on the web
  One extra problem: uncertainty about reliability
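The co-occurrence step can be sketched as follows. This is only an illustration of the idea, not the method of Jenssen et al.: the abstracts are made up, the gene list is assumed to be given, and gene mentions are found by exact token matching.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(abstracts, gene_names):
    """Count, for each pair of genes, the abstracts that mention both."""
    edges = Counter()
    for text in abstracts:
        tokens = set(text.replace(",", " ").replace(".", " ").split())
        present = sorted(tokens & gene_names)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges

# Made-up abstracts; real gene-name recognition is much harder than
# exact token matching (synonyms, spelling variants, ambiguity).
abstracts = [
    "TP53 and MDM2 interact in the stress response.",
    "Loss of TP53 alters MDM2 and BRCA1 expression.",
    "BRCA1 is studied in isolation here.",
]
genes = {"TP53", "MDM2", "BRCA1"}
graph = cooccurrence_graph(abstracts, genes)
print(graph.most_common())
```

The edge weights form the similarity graph: pairs mentioned together more often get a heavier edge.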
Web Structure Mining

Analyse structure of the web
  Which sites have many incoming / outgoing links?
    Identify “hubs”
  Find clusters of sites that are strongly interconnected
    Web communities
    ...
E.g., Google
  Identifies important pages based on links that point to them (rather than contents of the page itself)
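Scoring pages by the links that point to them can be illustrated with a PageRank-style power iteration. This is a sketch under assumptions: the damping factor, the iteration count, and the four-page toy graph are chosen for illustration and do not describe any production search engine.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping page -> list of linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * score[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Page without outgoing links: spread its score evenly.
                for t in pages:
                    new[t] += damping * score[page] / n
        score = new
    return score

# Toy web: three pages link to "c", and "c" links back to "a".
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_scores(web)
print(sorted(scores, key=scores.get, reverse=True))
```

Page "c" ends up with the highest score because most links point to it, regardless of what the page itself contains.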
Web Usage Mining

Log user behaviour
  Which links are often followed, in which order, how long is a page looked at, ...
Possible at several levels:
  General usage statistics
  User-specific statistics
    Relating behaviour to properties of user, insofar available
E.g., adaptive web sites
  Adaplix project
  Automatic index page creation
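A minimal sketch of general usage statistics, assuming the log has already been split into per-visit page sequences. The session data and the `min_support` threshold are hypothetical, and the shortcut heuristic is only one simple way an adaptive site might decide where a direct link could help.

```python
from collections import Counter

def transition_counts(sessions):
    """Count how often page X is directly followed by page Y."""
    counts = Counter()
    for pages in sessions:
        for src, dst in zip(pages, pages[1:]):
            counts[(src, dst)] += 1
    return counts

def suggest_shortcuts(sessions, min_support=2):
    """Return (start, end) pairs seen in at least min_support sessions
    in which the user needed more than one click to get from start to end."""
    ends = Counter((p[0], p[-1]) for p in sessions if len(p) > 2)
    return [pair for pair, n in ends.items() if n >= min_support]

# Made-up sessions: each list is the sequence of pages in one visit.
sessions = [
    ["home", "research", "publications"],
    ["home", "research", "publications"],
    ["home", "teaching"],
]
print(transition_counts(sessions).most_common(1))
print(suggest_shortcuts(sessions))
```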
Web Mining As It Currently Is

Machine learning / data mining strongly rely on
  Data quantity
  Data quality
Quantity is usually not a problem on the Web
Quality is!
  Much data not in easily processable format
    E.g. inside text documents: need information extraction
    Unstructured, poorly structured, heterogeneously structured
  Lots of noise
  ...
How Is All This Related to the Semantic Web?

There can be a synergy:
  Machine learning can help with building the Semantic Web
  The Semantic Web will help mining the Web, making Web interfaces and agents more intelligent
What Machine Learning Can Do for the Semantic Web

Upgrading the current web to a semantic web involves a lot of work
Can partially be automated!
Examples:
  Learning ontologies
  Automatic document classification
  Information integration
  ...
Learning Ontologies

Maedche & Staab (2001), “Ontology learning for the semantic web”
View:
  Manually creating ontologies is very labour-intensive
  Fully automating the creation of ontologies is not feasible
  Hence: develop a tool that helps with building ontologies
Basic components:
  Good graphical interface (man-machine interaction)
  Powerful underlying machine learning techniques
Text-To-Onto

Framework:
  Import / reuse existing ontologies
  Extract ontology from documents
    Identify new terms, map onto existing concepts or define new ones
    Identify relationships between concepts
    ...
    Many opportunities for general machine learning techniques
  Prune ontology
  Refine ontology
Some Useful Techniques for Learning Ontologies

Term extraction from texts
  Identification of concepts
Hierarchical clustering
  Clustering: finding groups of “similar” things
  Hierarchical clustering: clusters of clusters
  Taxonomy can be constructed through hierarchical clustering of concepts
Association rules
  Find sets of terms that often occur together
  May indicate important relations
    E.g., events in texts often co-occur with locations
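How hierarchical clustering yields a taxonomy-like structure can be sketched with a small agglomerative procedure. The choices here are assumptions made for illustration: each term is described by a set of context words, distance is Jaccard distance, and clusters are merged by single linkage. The terms and contexts are invented.

```python
def jaccard_distance(a, b):
    """1 minus the overlap ratio of two sets of context words."""
    return 1.0 - len(a & b) / len(a | b)

def agglomerate(contexts):
    """Single-linkage agglomerative clustering.
    contexts: dict term -> set of context words.
    Returns a nested tuple; each merge becomes one inner node."""
    # Each cluster is (tree, list of member terms).
    clusters = [(term, [term]) for term in sorted(contexts)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(jaccard_distance(contexts[x], contexts[y])
                        for x in clusters[i][1] for y in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Invented terms with the words they tend to appear next to.
contexts = {
    "hotel": {"book", "room", "stay", "night"},
    "hostel": {"book", "room", "stay", "cheap"},
    "train": {"ticket", "ride", "station"},
    "bus": {"ticket", "ride", "stop"},
}
tree = agglomerate(contexts)
print(tree)
```

The two inner nodes of the resulting tree are candidates for new, more general concepts (roughly "accommodation" and "transport" here), which a human ontology engineer would still have to name and validate.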
Information Integration

Doan, Domingos, Halevy: “Reconciling Schemas of Disparate Data Sources”, ACM SIGMOD 2001
Context:
  Given databases with different schemas:
    Find similarities in schemas, guess how concepts map onto each other
    Integrate the schemas
Essentially the same as mapping ontologies onto each other
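Guessing how concepts map onto each other can be illustrated with a very simple heuristic: propose, for each column of one database, the column of the other whose values overlap most. This is a sketch under assumptions, not the learning approach of Doan et al.; the two toy databases and their column names are hypothetical.

```python
def jaccard(a, b):
    """Overlap ratio of two collections of values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_columns(source, target):
    """For each source column, propose the target column whose values
    overlap most. Both arguments map column name -> list of values."""
    mapping = {}
    for s_name, s_values in source.items():
        best = max(target, key=lambda t: jaccard(s_values, target[t]))
        mapping[s_name] = (best, round(jaccard(s_values, target[best]), 2))
    return mapping

# Two hypothetical databases describing the same kind of data.
db_a = {"town": ["Leuven", "Gent", "Brussel"], "phone": ["016-1", "09-2"]}
db_b = {"city": ["Leuven", "Brussel", "Namur"], "tel": ["016-1", "02-3"]}
proposal = match_columns(db_a, db_b)
print(proposal)
```

The score attached to each proposed mapping indicates how much evidence supports it; low-scoring proposals are the ones a person should check.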
Automated Document Classification

Mitchell et al.
  Based on examples of web pages + what kind of page they are (course page, student page, ...), learn to classify new pages
  Can be based on contents of page, links pointing to page, typical structure of certain kinds of web sites (e.g. universities), ...
  Note: helps to relate objects to ontology
Problem: how to get labelled examples
  Unlimited amount of unlabelled pages available
  But labelling them manually is labour-intensive!
Exploiting Unlabelled Data

A solution: co-training (Blum & Mitchell 1998)
  Learn separate (imperfect) classifiers from disjoint sets of sufficient information
    E.g. learn to classify pages from
      Content of page (“Home page of CS 101”)
      Links pointing to page (“CS 101”)
  Take classifications that classifier A is most certain of, add these labels to training set for B (and vice versa)
  Repeat multiple times (kind of bootstrapping process)
Co-training makes it possible to exploit large amounts of unlabelled data!
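The co-training loop described above can be sketched as follows. This is a simplified illustration, not the exact procedure of Blum & Mitchell: the classifier is a small smoothed word-count scorer, confidence is the score margin between the two best classes, each classifier hands over one example per round, and all pages are made up.

```python
import math
from collections import Counter

def train(examples):
    """examples: list of (list_of_words, label). Returns per-class word counts."""
    counts = {}
    for words, label in examples:
        counts.setdefault(label, Counter()).update(words)
    return counts

def predict(counts, words):
    """Return (best_label, margin) using add-one smoothed log scores."""
    vocab = {w for c in counts.values() for w in c}
    scores = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(vocab) + 1
        scores[label] = sum(math.log((c[w] + 1) / total) for w in words)
    ranked = sorted(scores, key=scores.get, reverse=True)
    margin = scores[ranked[0]] - scores[ranked[1]] if len(ranked) > 1 else 0.0
    return ranked[0], margin

def co_train(labelled, unlabelled, rounds=3):
    """labelled: list of ((view_a, view_b), label); unlabelled: list of (view_a, view_b)."""
    train_a = [(a, y) for (a, _), y in labelled]
    train_b = [(b, y) for (_, b), y in labelled]
    pool = list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        model_a, model_b = train(train_a), train(train_b)
        # A labels the example it is most certain of, for B's training set.
        pick = max(pool, key=lambda x: predict(model_a, x[0])[1])
        train_b.append((pick[1], predict(model_a, pick[0])[0]))
        pool.remove(pick)
        if not pool:
            break
        # And vice versa: B labels its most certain example for A.
        pick = max(pool, key=lambda x: predict(model_b, x[1])[1])
        train_a.append((pick[0], predict(model_b, pick[1])[0]))
        pool.remove(pick)
    return train(train_a), train(train_b)

# View A = words on the page, view B = words in links pointing to it.
labelled = [
    ((["course", "syllabus", "exam"], ["cs", "101"]), "course"),
    ((["student", "hobbies", "cv"], ["my", "homepage"]), "student"),
]
unlabelled = [
    (["syllabus", "exam", "course", "lecture"], ["cs", "202"]),
    (["cv", "hobbies", "photos"], ["my", "homepage", "ann"]),
    (["lecture", "notes"], ["cs", "303"]),
    (["student", "cv"], ["ann", "page"]),
]
model_a, model_b = co_train(labelled, unlabelled)
print(predict(model_a, ["lecture", "slides"])[0])
```

In this toy run the content classifier initially knows nothing about the word "lecture"; it learns it only because the link classifier labelled the "lecture notes" page on the strength of its "cs" link text.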
What the Semantic Web Can Do for Machine Learning

Will make mining the web much easier
Reason 1: removal of ambiguity
  More precise knowledge of what is meant by certain terms
Reason 2: structured vs. unstructured data
  Learning from structured data is much easier than from unstructured data
Reason 3: availability of background knowledge
  Can be used to make better decisions when learning
Removal of Ambiguity

Example: text document classification
  E.g., given a text, tell in which newsgroups it belongs
  Typical approaches: “bag of words”
    Look only at which words occur in the text, and how often
    Each time a word occurs that occurs mainly in one particular class, increase probability for that class
  But words are ambiguous!
  Increased classification accuracy can be expected by removing ambiguity
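The effect of ambiguity on a bag-of-words classifier can be made concrete with a small sketch. The training texts, the class labels, and the `#car` / `#animal` sense tags are hypothetical; the tags stand in for the kind of annotation a semantic web could supply.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, label). Returns per-class token counts."""
    counts = {}
    for tokens, label in docs:
        counts.setdefault(label, Counter()).update(tokens)
    return counts

def class_scores(counts, tokens):
    """Add-one smoothed log score of the token bag under each class."""
    vocab = {t for c in counts.values() for t in c}
    result = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(vocab) + 1
        result[label] = sum(math.log((c[t] + 1) / total) for t in tokens)
    return result

# Plain words: "jaguar" appears in both classes.
plain = [
    (["jaguar", "engine", "speed"], "rec.autos"),
    (["jaguar", "jungle", "prey"], "sci.bio"),
]
# Same texts with a hypothetical sense annotation attached to the word.
tagged = [
    (["jaguar#car", "engine", "speed"], "rec.autos"),
    (["jaguar#animal", "jungle", "prey"], "sci.bio"),
]

ambiguous = class_scores(train(plain), ["jaguar"])
resolved = class_scores(train(tagged), ["jaguar#animal"])
print(ambiguous)
print(resolved)
```

With plain words the two classes receive identical scores for "jaguar", so the word carries no evidence; with the sense tag the same occurrence clearly favours one class.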
Mining From (Un)structured Data

Mining data = intensively querying data
Answering a query is
  Easy in structured data
    Relational database, XML, ...
  Harder in semi-structured data (e.g., HTML)
    Information extraction needed
    Could do this by learning a “wrapper”
    This involves one extra layer of learning
  Hard in unstructured data
Relating this to our text example: taking into account function of words in text
Availability of Background Knowledge

Learning = finding relevant patterns in behaviour
Important to have the right context to describe these patterns
Example:
  Making interesting offers to clients
  “People who bought this book also bought ...”
  = “Instance-based” learning
    Estimate profile of user
    Find users with similar profile
    Look at behaviour of those users to help current user
Availability of Background Knowledge

Can work better if more background knowledge is available, e.g., type of book, author, ...
  For instance, for books:
    “similar profile” = users that up till now bought the same books as this user
      May not be many people
    “similar” = often bought books by same author
      Probably many more people, allows for more reasonable guess
    “similar” = often bought books of same genre (fiction, ...)
      May work even better
Ontologies (among others) provide such background knowledge
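The difference between the notions of “similar” can be sketched with a small instance-based recommender. The purchase data and function names are hypothetical; the author table plays the role of background knowledge that an ontology might provide.

```python
def jaccard(a, b):
    """Overlap ratio of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def neighbours(user, purchases, describe=lambda book: book):
    """Rank other users by similarity to `user`.
    describe maps a book to the feature compared (the book itself by default)."""
    profile = {describe(b) for b in purchases[user]}
    ranked = []
    for other, books in purchases.items():
        if other != user:
            sim = jaccard(profile, {describe(b) for b in books})
            ranked.append((sim, other))
    return sorted(ranked, reverse=True)

def recommend(user, purchases, describe=lambda book: book):
    """Suggest books bought by the most similar user, if similarity > 0."""
    sim, best = neighbours(user, purchases, describe)[0]
    if sim == 0.0:
        return []
    return sorted(purchases[best] - purchases[user])

# Hypothetical purchase data and author table (the background knowledge).
purchases = {
    "ann": {"Emma"},
    "bob": {"Persuasion", "Sense and Sensibility"},
    "cat": {"Dracula"},
}
author = {"Emma": "Austen", "Persuasion": "Austen",
          "Sense and Sensibility": "Austen", "Dracula": "Stoker"}

print(recommend("ann", purchases))                       # same books
print(recommend("ann", purchases, describe=author.get))  # same author
```

Comparing exact books finds no similar user for "ann", while comparing authors finds "bob" and yields suggestions.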
Web Mining Revisited

Semantic Web will change
  Content mining
    Clearer view on contents and meaning of documents
  Structure mining
    More relevant structure
  Usage mining
    More relevant information on actions of user
Will in general improve intelligence of systems
  E.g. mail filter gets a better view of contents of mails
Promising Learning Techniques

Many different learning techniques exist
  Neural networks, support vector machines, instance-based learning, Bayesian learning, association rules, ...
Not all equally suitable for every task
  E.g. SVM for document classification works well
  E.g. instance-based learning: find other users with same profile as this user to make predictions
Intelligent agents will use a mix of them
Relational learners seem interesting
  Can handle explicit information on objects and relations between them
  Classic example: inductive logic programming
Inductive Logic Programming

Induces rules in first order logic from examples or other rules
  Such rules can be used to reason with
  The reasoning can be explained
    Cf. example of mail program
Can use existing background knowledge
  “knowledge intensive learning”
  Currently: good background knowledge has to be engineered manually
  Will become more easily available with semantic web
  Example: mining in chemical domains
Mining in Chemical Domains

Example problem: relate activity of molecule to its properties
  Useful for, e.g., drug development
Which properties are important?
  Chemically relevant properties: functional groups, 3D structure, ... ?
  Has to be encoded manually
Ideally: get relevant information from some trustworthy data source as and when needed
Intelligent agents will exploit (“tap”) the common intelligence of the Web
Conclusions

Machine learning is a promising tool for the Semantic Web
  For building it
  For exploiting it
Clear synergy between Semantic Web efforts and Machine Learning efforts
Some References

Maedche, “A Machine Learning Perspective for the Semantic Web”, position paper, www.semanticweb.org/SWWS/program/position/soi-maedche.pdf
Maedche & Staab (2001), “Ontology Learning for the Semantic Web”, IEEE Intelligent Systems 16(2)
Jenssen et al. (2001), “A literature network of human genes”, Nature Genetics 28
Doan, Domingos & Halevy (2001), “Reconciling Schemas of Disparate Data Sources”, ACM SIGMOD conf.
Kosala & Blockeel (2000), SIGKDD Explorations 2(1)
Mitchell (1996), Machine Learning