Interactive Evolution
in Automated Knowledge Discovery
Tomáš Řehořek
March 2011
Knowledge Discovery Automation
• Our goal:
– Given an input dataset, automatically construct a Knowledge Flow (KF) and offer output knowledge that the user is satisfied with
– Creating such a system is a big deal!
Automated Knowledge Discovery
Knowledge Discovery Automation
• What is Knowledge Discovery?
– Transformation of input data into human-interpretable knowledge
– An oriented graph of actions (a Knowledge Flow) is a suitable representation (see the sketch below)
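As a minimal sketch of what such an oriented graph of actions might look like in code (the action names, parameters, and helper function here are purely illustrative, not part of any particular KD system):

```python
# Minimal sketch of a Knowledge Flow as an oriented graph of actions.
# Node names, actions, and parameters are illustrative only.

knowledge_flow = {
    "nodes": {
        "load":      {"action": "LoadDataset",   "params": {"file": "iris.csv"}},
        "normalize": {"action": "Normalize",     "params": {"method": "z-score"}},
        "model":     {"action": "DecisionTree",  "params": {"max_depth": 5}},
        "evaluate":  {"action": "CrossValidate", "params": {"folds": 10}},
    },
    # edges: (source, target) pairs forming the oriented (directed) graph
    "edges": [("load", "normalize"), ("normalize", "model"), ("model", "evaluate")],
}

def execution_order(kf):
    """Topological sort: the order in which the actions must be executed."""
    incoming = {n: 0 for n in kf["nodes"]}
    for _, dst in kf["edges"]:
        incoming[dst] += 1
    ready = [n for n, deg in incoming.items() if deg == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for src, dst in kf["edges"]:
            if src == node:
                incoming[dst] -= 1
                if incoming[dst] == 0:
                    ready.append(dst)
    return order

print(execution_order(knowledge_flow))  # ['load', 'normalize', 'model', 'evaluate']
```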
Knowledge Discovery Ontology
• Ontology (definition)
– Formal representation of a domain
– Specification of entities, their properties and
relations
– Provides a vocabulary, which can be used to
model a domain
• E.g.: dataset, model, testing sample, scatter plot,
confusion matrix, association rule…
Knowledge Discovery Ontology
• Ontology design problems in KD:
– Which KFs are reasonable?
– What should the output report look like?
– Can metadata be helpful?
– Are there categories of users with similar interests?
• Two ideas concerning Ontology:
– Deductive approach
– Inductive approach
Knowledge Discovery Ontology
• Deductive approach:
– The Ontology is given
– Based on the Ontology and the given dataset, try to construct an appropriate KF
Knowledge Discovery Ontology
• Deductive approach:
Taken from: M. Žáková, P. Křemen, F. Železný, N. Lavrač: Automating Knowledge Discovery Workflow Composition Through Ontology-Based Planning (2010)
Knowledge Discovery Ontology
• Inductive approach:
– No prior assumptions about the Ontology
– Learn the Ontology based on a database of
KFs designed by experts
[Diagram: Meta-Knowledge Discovery yields a discovered KD Ontology]
Our Approach:
Revolutionary Reporting
• There may be thousands of useful KFs
– Different datasets may require different
actions
– Different users may require different
knowledge
• Maybe users form clusters:
– "DM Scientist" – may experiment with different algorithms on a given dataset
– "Business Manager" – may appreciate the beer-and-diapers rule
Our Approach:
Revolutionary Reporting
• Let’s design a system capable of learning what users like!
– Adopt Interactive Evolutionary Computation
– Collect feedback to evaluate fitness
• of a given KF,
• for a given user,
• on a given dataset,
– Store the feedback, along with the metadata, in a database
– As the DB grows, offer intelligent KF mutation based on this experience (sketched below)
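A minimal sketch of such an interactive loop, assuming a placeholder ask_user feedback function and a toy mutate operator (neither is the actual system described here):

```python
import random

def mutate(kf):
    """Toy mutation: tweak the numeric parameter of one randomly chosen action."""
    child = {name: dict(action) for name, action in kf.items()}
    node = random.choice(list(child))
    child[node]["param"] *= random.uniform(0.8, 1.2)
    return child

def ask_user(kf):
    """Placeholder for the interactive step: the real system shows the KF's
    output report and collects feedback; here we simply read a rating."""
    return float(input(f"Rate this KF (0-10): {kf}\n> "))

# toy population of extremely simplified KFs (one parametrized action each)
population = [{"projection": {"param": random.random()}} for _ in range(4)]
experience_db = []   # stores (rating, KF) pairs; the real system would also
                     # keep the user and dataset metadata

for generation in range(3):
    rated = [(ask_user(kf), kf) for kf in population]   # human evaluation = fitness
    experience_db.extend(rated)
    rated.sort(key=lambda pair: pair[0], reverse=True)
    elite = rated[0][1]
    # next generation: keep the elite, fill up with mutants of the best-rated KFs
    population = [elite] + [mutate(random.choice(rated[:2])[1]) for _ in range(3)]
```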
Our Approach:
Revolutionary Reporting
• Interactive Evolutionary Computation (IEC)
– Also known as "Aesthetic Selection"
– Evolutionary Computation using human evaluation as the fitness function
• Inspiration: http://picbreeder.org
Interactive Evolution: PicBreeder, by Jimmy Secretan and Kenneth Stanley
[Slides showing evolved images: the next generation, and so on… and after 75 generations you eventually get something interesting]
The technology hidden behind: Neuroevolution
[Diagram: a neural network takes pixel coordinates as input and outputs a grayscale value, thus drawing the image]
By clicking, you increase the fitness of the networks
Next generations inherit fit building patterns
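A rough sketch of the idea (a hypothetical tiny fixed-topology network, not PicBreeder's actual CPPN/NEAT implementation): a network maps each pixel coordinate to a grayscale value, so its weights fully determine the image.

```python
import math, random

def draw_image(weights, size=8):
    """Render an image by querying a tiny 2-3-1 network at every pixel."""
    w_hidden, w_out = weights
    image = []
    for row in range(size):
        line = []
        for col in range(size):
            x, y = col / size, row / size                  # pixel coordinates in [0, 1)
            hidden = [math.tanh(wx * x + wy * y + b) for wx, wy, b in w_hidden]
            value = sum(w * h for w, h in zip(w_out, hidden))
            line.append(1 / (1 + math.exp(-value)))        # grayscale in (0, 1)
        image.append(line)
    return image

# random genome: 3 hidden units, each with (wx, wy, bias), plus 3 output weights
genome = ([tuple(random.uniform(-3, 3) for _ in range(3)) for _ in range(3)],
          [random.uniform(-3, 3) for _ in range(3)])

for line in draw_image(genome):
    print("".join("#" if v > 0.5 else "." for v in line))
```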
Gallery of discovered images
Collaborative evolution
You start your evolution where others finished…
… and when you discover something interesting…
… you store it in the database.
Our Approach:
Revolutionary Reporting
[Diagram: the User provides Feedback to the System core, which stores it in the Experience Database]
First Experiments: Data Projection
• Transform the input dataset to 2D: a mapping $f: \mathbb{R}^n \rightarrow \mathbb{R}^2$ taking examples in n-dimensional space to points in 2D
• Similar to PCA, Sammon projection, etc.
Experiment Setup
[Architecture diagram: the User interacts with a Web Client (Feedback Collection GUI, AJAX, Google API); feedback and the current population are exchanged with a Tomcat Server via jabsorb JSON-RPC (over HTTP); on the server, a Genetic Algorithm runs in RapidMiner 5 with a MySQL database]
Data Projection Experiments
• Linear transformation
– Evolve the coefficient matrix
$\begin{pmatrix} a_1 & a_2 & \cdots & a_n \\ b_1 & b_2 & \cdots & b_n \end{pmatrix}$
– Do the transformation using the formula:
$f(\mathbf{x}) = \left( \sum_{i=1}^{n} a_i x_i,\; \sum_{i=1}^{n} b_i x_i \right)$
… resulting in a point in 2D space
[ Demonstration ]
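A minimal sketch of this linear projection (the example data and coefficient values are made up):

```python
def linear_projection(coeffs, x):
    """Project an n-dimensional point to 2D: (sum a_i*x_i, sum b_i*x_i)."""
    a, b = coeffs
    return (sum(ai * xi for ai, xi in zip(a, x)),
            sum(bi * xi for bi, xi in zip(b, x)))

# evolved coefficient matrix for n = 4 (values are illustrative only)
coeffs = ([0.5, -1.2, 0.0, 2.0],    # a_1 .. a_n
          [1.0,  0.3, -0.7, 0.1])   # b_1 .. b_n

print(linear_projection(coeffs, [1.0, 2.0, 3.0, 4.0]))   # a point in 2D space
```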
Data Projection Experiments
• Sigmoidal transformation
– Evolve the coefficient matrix
$\begin{pmatrix} a_{1,1} & \cdots & a_{1,n} & b_{1,1} & \cdots & b_{1,n} & c_{1,1} & \cdots & c_{1,n} \\ a_{2,1} & \cdots & a_{2,n} & b_{2,1} & \cdots & b_{2,n} & c_{2,1} & \cdots & c_{2,n} \end{pmatrix}$
– Do the transformation using the formula:
$f(\mathbf{x}) = \left( \sum_{i=1}^{n} \frac{a_{1,i}}{1 + e^{-b_{1,i}(x_i - c_{1,i})}},\; \sum_{i=1}^{n} \frac{a_{2,i}}{1 + e^{-b_{2,i}(x_i - c_{2,i})}} \right)$
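And a sketch of the sigmoidal variant, under the assumption that the exponent carries the usual negative sign (coefficient values again made up):

```python
import math

def sigmoidal_projection(coeffs, x):
    """Project x to 2D: coordinate k is sum_i a_k,i / (1 + e^(-b_k,i * (x_i - c_k,i)))."""
    a, b, c = coeffs
    return tuple(
        sum(a[k][i] / (1 + math.exp(-b[k][i] * (x[i] - c[k][i])))
            for i in range(len(x)))
        for k in range(2))

# evolved coefficients for n = 3 (illustrative values); a, b, c are each 2 x n
coeffs = ([[1.0, -0.5, 2.0], [0.3, 1.1, -0.8]],   # a
          [[2.0,  1.0, 0.5], [1.5, 0.2,  3.0]],   # b
          [[0.0,  1.0, 2.0], [0.5, 0.0, -1.0]])   # c

print(sigmoidal_projection(coeffs, [0.2, 1.5, 0.7]))
```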
Interactive Evolution: Issues
• Fitness function is too costly:
– GA requires a lot of evaluations
– User may get annoyed, bored, tired…
• Heuristic approach needed to speed up
the evolution!
– "Hard-wired" estimation of projection quality
• E.g. clustering homogeneity, separability, intra-cluster variability…
• Puts a limitation on what "quality" means!
– Modeling user’s preferences…?
Surrogate Model
• Optimization approach in areas where
evaluation is too expensive
• Builds an approximation model of the
fitness function
• Given training dataset of so-far-known
candidate solutions and their fitness…
• …predicts fitness of newly generated
candidates
Surrogate Model
1. Collect fitness of an initial sample
2. Construct the Surrogate Model
3. Search the Surrogate Model
• the Surrogate Model is cheap to evaluate
• a Genetic Algorithm may be employed
4. Collect fitness at the new locations found in step 3
5. If the solution is not good enough, go to step 2 (a sketch of the loop follows below)
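A minimal sketch of this loop on a toy 1-D problem (the "expensive" function, the crude nearest-sample surrogate, and all constants are stand-ins, not the system described here; in IEC the expensive evaluation would be the user):

```python
import random

def expensive_fitness(x):
    """Stand-in for the costly evaluation."""
    return -(x - 3.0) ** 2

def fit_surrogate(samples):
    """Very crude surrogate: predict the fitness of the nearest evaluated point."""
    def predict(x):
        return min(samples, key=lambda s: abs(s[0] - x))[1]
    return predict

# 1. collect fitness of an initial sample
samples = [(x, expensive_fitness(x)) for x in (random.uniform(-10, 10) for _ in range(5))]

for _ in range(10):
    surrogate = fit_surrogate(samples)                          # 2. construct the surrogate
    candidates = [random.uniform(-10, 10) for _ in range(200)]  # 3. search it (a GA could be
    best = max(candidates, key=surrogate)                       #    used); it is cheap to evaluate
    samples.append((best, expensive_fitness(best)))             # 4. collect true fitness at the new location
    # 5. if the solution is not good enough, loop back to step 2

print(max(samples, key=lambda s: s[1]))   # best point found so far
```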
Evaluating Fitness
• In order to construct fitness-prediction models, a training dataset must be available
• The information about fitness provided by the user is indirect
– Within a single population, a projection rated good is surely better than one rated bad
– However, "better" is a relative term
– Is a good projection in generation #2 better than a bad projection in generation #10…?
Interconnecting generations
• In each generation, the population may be divided into up to 3 categories:
– bad, neutral, good
• Let's copy the best projection into the next-epoch population
– So-called elitism in Evolutionary Computation
– Within the new population, the elite will again fall into one of these 3 categories
– This gives us information about cross-generation progress!
Absolutizing Fitness
[Diagram: Generations #1–#3; within each generation the ratings define equivalence classes (an equivalence relation); across generations the carried-over elites induce a partial order relation]
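A sketch combining the two previous ideas (elitism-based interconnection and absolutized fitness); the data structures, projection names, and ratings are invented for illustration, and it assumes the elite was rated "good" in its own generation:

```python
RANK = {"bad": 0, "neutral": 1, "good": 2}

# Per-generation user ratings; the elite of each generation is copied to the
# next one and rated again (elitism). All names and ratings are invented.
generations = [
    {"p1": "bad", "p2": "neutral", "p3": "good"},        # elite: p3
    {"p3": "neutral", "p4": "bad", "p5": "good"},        # elite: p5
    {"p5": "good", "p6": "bad", "p7": "neutral"},
]
elites = ["p3", "p5"]

# Anchor each generation's scale to the previous one via the shared elite:
# if generation g's elite (rated "good" there) lands in class c of generation
# g+1, then "good in gen g" corresponds to "c in gen g+1".
offsets = [0.0]
for g, elite in enumerate(elites):
    shift = RANK["good"] - RANK[generations[g + 1][elite]]
    offsets.append(offsets[g] + shift)

# Absolutized fitness: class rank plus the generation's offset on a common scale.
absolute = {p: RANK[c] + offsets[g]
            for g, ratings in enumerate(generations)
            for p, c in ratings.items()}
print(absolute)
```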
Fitness Prediction KF in RapidMiner
[Diagram of the RapidMiner process: the training dataset and the current population pass through Normalization; a 3-NN learner (Learning) is trained on the training dataset and applied to the current population for fitness prediction]
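A rough Python equivalent of that flow (not the actual RapidMiner process; the data is random and the 3-NN regressor is written out by hand):

```python
import random

def normalize(rows):
    """Min-max normalize each column to [0, 1]."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(r, lo, hi)]
            for r in rows]

def knn_predict(train_x, train_y, query, k=3):
    """3-NN regression: average fitness of the k nearest training individuals."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))[:k]
    return sum(train_y[i] for i in nearest) / k

# training dataset: projection genomes (here random 4-D vectors) with known user fitness
train_x = [[random.random() for _ in range(4)] for _ in range(20)]
train_y = [random.random() for _ in train_x]

# current population whose fitness we want to predict
population = [[random.random() for _ in range(4)] for _ in range(5)]

all_norm = normalize(train_x + population)                   # Normalization step
train_n, pop_n = all_norm[:len(train_x)], all_norm[len(train_x):]
predictions = [knn_predict(train_n, train_y, ind) for ind in pop_n]   # Learning (3NN) + prediction
print(predictions)
```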
Thank you
for your attention!
Tomáš Řehořek
[email protected]