European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases
Third-Generation Data Mining:
Towards Service-Oriented Knowledge Discovery
SoKD’10
September 24, 2010
Barcelona, Spain
Editors
Melanie Hilario
University of Geneva, Switzerland
Nada Lavrač
Vid Podpečan
Jožef Stefan Institute, Ljubljana, Slovenia
Joost N. Kok
LIACS, Leiden University, The Netherlands
Preface
It might seem paradoxical that third-generation data mining (DM) remains an
open research issue more than a decade after it was first defined.1 First-generation data mining systems were individual research-driven tools for performing generic learning tasks such as classification or clustering. They were aimed
mainly at data analysis experts whose technical know-how allowed them to do
extensive data preprocessing and tool parameter-tuning. Second-generation DM
systems gained in both diversity and scope: they not only offered a variety of
tools for the learning task but also provided support for the full knowledge discovery process, in particular for data cleaning and data transformation prior
to learning. These so-called DM suites remained, however, oriented towards the
DM professional rather than the end user. The idea of third-generation DM systems, as defined in 1997, was to empower the end user by focusing on solutions
rather than tool suites; domain-specific shells were wrapped around a core of
DM tools, and graphical interfaces were designed to hide the intrinsic complexity of the underlying DM methods. Vertical DM systems have been developed for
applications in data-intensive fields such as bioinformatics, banking and finance,
e-commerce, telecommunications, or customer relationship management.
However, driven by the unprecedented growth in the amount and diversity of
available data, advances in data mining and related fields gradually led to a
revised and more ambitious vision of third-generation DM systems. Knowledge
discovery in databases, as it was understood in the 1990s, turned out to be just
one subarea of a much broader field that now includes mining unstructured data
in text and image collections, as well as semi-structured data from the rapidly
expanding Web. With the increased heterogeneity of data types and formats, the
limitations of attribute-value vectors and their associated propositional learning
techniques were acknowledged, then overcome through the development of complex object representations and relational mining techniques.
Outside the data mining community, other areas of computer science rose up
to the challenges of the data explosion. To scale up to tera-order data volumes,
high-performance computers proved to be individually inadequate and had to
be networked into grids in order to divide and conquer computationally intensive tasks. More recently, cloud computing allows for the distribution of data
and computing load to a large number of distant computers, while doing away
with the centralized hardware infrastructure of grid computing. The need to harness multiple computers for a given task gave rise to novel software paradigms,
foremost of which is service-oriented computing.
1 G. Piatetsky-Shapiro. Data mining and knowledge discovery: The third generation. In Foundations of Intelligent Systems: 10th International Symposium, 1997.
As its name suggests, service-oriented computing utilizes services as the basic constructs to enable the composition of applications from software and other resources
distributed across heterogeneous computing environments and communication
networks. The service-oriented paradigm has induced a radical shift in our definition of third-generation data mining. The 1990s vision of a data mining tool
suite encapsulated in a domain-specific shell gives way to a service-oriented architecture with functionality for identifying, accessing and orchestrating local
and remote data/information resources and mining tools into a task-specific
workflow. Thus the major challenge facing third-generation DM systems is the
integration of these distributed and heterogeneous resources and software into
a coherent and effective knowledge discovery process. Semantic Web research
provides the key technologies needed to ensure interoperability of these services;
for instance, the availability of widely accepted task and domain ontologies ensures common semantics for the annotation, search and retrieval of the relevant
data/knowledge/software resources, thus enabling the construction of shareable
and reusable knowledge discovery workflows.
SoKD’10 is the third in a series of workshops that serve as the forum for ongoing research on service-oriented knowledge discovery. The papers selected for
this edition can be grouped under three main topics. Three papers propose novel
techniques for the construction, analysis and re-use of data mining workflows.
A second group of two papers addresses the problem of building ontologies for
knowledge discovery. Finally, two papers describe applications of service-oriented
knowledge discovery in plant biology and predictive toxicology.
Geneva, Ljubljana, Leiden
July 2010
Melanie Hilario
Nada Lavrač
Vid Podpečan
Joost N. Kok
Workshop Organization
Workshop Chairs
Melanie Hilario (University of Geneva)
Nada Lavrač (Jožef Stefan Institute)
Vid Podpečan (Jožef Stefan Institute)
Joost N. Kok (Leiden University)
Program Committee
Abraham Bernstein (University of Zurich, Switzerland)
Michael Berthold (Konstanz University, Germany)
Hendrik Blockeel (Leuven University, Belgium)
Jeroen de Bruin (Leiden University, The Netherlands)
Werner Dubitzky (University of Ulster, UK)
Alexandros Kalousis (University of Geneva, Switzerland)
Igor Mozetič (Jožef Stefan Institute, Slovenia)
Filip Železny (Czech Technical University, Czechia)
Additional Reviewers
Agnieszka Ławrynowicz (Poznan University of Technology, Poland)
Yvan Saeys (Ghent University, Belgium)
Table of Contents

Data Mining Workflows: Creation, Analysis and Re-use

Data Mining Workflow Templates for Intelligent Discovery Assistance and Auto-Experimentation
Jörg-Uwe Kietz, Floarea Serban, Abraham Bernstein, Simon Fischer ... 1

Workflow Analysis Using Graph Kernels
Natalja Friesen, Stefan Rüping ... 13

Re-using Data Mining Workflows
Stefan Rüping, Dennis Wegener, Philipp Bremer ... 25

Ontologies for Knowledge Discovery

Exposé: An Ontology for Data Mining Experiments
Joaquin Vanschoren, Larisa Soldatova ... 31

Foundations of Frequent Concept Mining with Formal Ontologies
Agnieszka Ławrynowicz ... 45

Applications of Service-Oriented Knowledge Discovery

Workflow-based Information Retrieval to Model Plant Defence Response to Pathogen Attacks
Dragana Miljković, Claudiu Mihăilă, Vid Podpečan, Miha Grčar, Kristina Gruden, Tjaša Stare, Nada Lavrač ... 51

OpenTox: A Distributed REST Approach to Predictive Toxicology
Tobias Girschick, Fabian Buchwald, Barry Hardy, Stefan Kramer ... 61
Data Mining Workflow Templates for Intelligent
Discovery Assistance and Auto-Experimentation
Jörg-Uwe Kietz1, Floarea Serban1, Abraham Bernstein1, and Simon Fischer2
1 University of Zurich, Department of Informatics,
Dynamic and Distributed Information Systems Group,
Binzmühlestrasse 14, CH-8050 Zurich, Switzerland
{kietz|serban|bernstein}@ifi.uzh.ch
2 Rapid-I GmbH, Stockumer Str. 475, 44227 Dortmund, Germany
[email protected]
Abstract. Knowledge Discovery in Databases (KDD) has grown considerably in recent years, but providing user support for constructing workflows is still problematic. The large number of operators available in current KDD systems makes it difficult for a user to successfully solve her task. Moreover, workflows can easily comprise hundreds of operators, and parts of a workflow are often applied several times, so it becomes hard for the user to construct them manually. In addition, workflows are not checked for correctness before execution; hence, it frequently happens that the execution of a workflow stops with an error after several hours of runtime.
In this paper3 we present a solution to these problems. We introduce a knowledge-based representation of Data Mining (DM) workflows as a basis for cooperative-interactive planning. Moreover, we discuss workflow templates, i.e. abstract workflows that can mix executable operators and tasks to be refined later into sub-workflows. This new representation helps users to structure and handle workflows, as it constrains the number of operators that need to be considered. Finally, workflows can be grouped into templates, which fosters re-use and further simplifies DM workflow construction.
1 Introduction
One of the challenges of Knowledge Discovery in Databases (KDD) is assisting
the users in creating and executing DM workflows. Existing KDD systems such
as the commercial Clementine4 and Enterprise Miner5 or the open-source Weka6, MiningMart7, KNIME8 and RapidMiner9 support the user with nice graphical user interfaces, where operators can be dropped as nodes onto the working pane and the data-flow is specified by connecting the operator-nodes. This works very well as long as neither the workflow becomes too complicated nor the number of operators becomes too large.

3 This paper reports on work in progress. Refer to http://www.e-lico.eu/eProPlan to see the current state of the Data Mining ontology for WorkFlow planning (DMWF), the IDA-API, and the eProPlan Protégé plug-ins we built to model the DMWF. The RapidMiner IDA-wizard will be part of a future release of RapidMiner; check http://www.rapidminer.com/ for it.
4 http://www.spss.com/software/modeling/modeler-pro/
5 http://www.sas.com/technologies/analytics/datamining/miner/
6 http://www.cs.waikato.ac.nz/ml/weka/
7 http://mmart.cs.uni-dortmund.de/
8 http://www.knime.org/
9 http://rapid-i.com/content/view/181/190/
The number of operators in such systems, however, has been growing fast. All of them contain over 100 operators, and RapidMiner, which includes Weka, even over 600. It can be expected that the incorporation of text-, image-, and multimedia-mining, as well as the transition from closed systems with a fixed set of operators to open systems that can also use Web services as operators (which is especially interesting for domain-specific data access and transformations), will further accelerate this growth, resulting in total confusion for most users.
Not only the number of operators but also the size of the workflows is growing. Today's workflows can easily contain hundreds of operators. Parts of a workflow are applied several times (e.g. the preprocessing sub-workflow has to be applied on training, testing, and application data), implying that users either need to copy/paste or even design a new sub-workflow10 several times. None of the systems maintains this "copy" relationship; it is left to the user to maintain it in the light of changes.
Another weak point is that workflows are not checked for correctness before execution: it frequently happens that the execution of a workflow stops with an error after several hours of runtime because of small syntactic incompatibilities between an operator and the data it should be applied on.
To address these problems several authors [1, 12, 4, 13] propose the use of planning techniques to automatically build such workflows. However, all these approaches are limited in several ways. First, they only model a very small set of operations, and they work on very short workflows (fewer than 10 operators). Second, none of them models operations that work on individual columns of a data set; they only model operations that process all columns of a data set uniformly. Lastly, the approaches cannot scale to large numbers of operators and large workflows: the planning approaches they use necessarily get lost in the overly large space of "correct" (but most often unwanted) solutions. In [6] we reused the idea of hierarchical task decomposition (from the
manual support system CITRUS [11]) and knowledge available in Data Mining (e.g. CRISP-DM) for hierarchical task network (HTN) planning [9]. This significantly reduces the number of generated unwanted correct workflows. Unfortunately, since it covers only generic DM knowledge, it still does not capture the most important knowledge a DM engineer uses to judge workflows and models useful: understanding the meaning of the data.11

10 Several operators must be exchanged and cannot simply be reapplied. Consider for example training data (with labels) and application data (without labels). Label-directed operations like feature selection or discretization by entropy used on the training data cannot work on the application data. But even if there is a label, as on separate test data, redoing feature selection/discretization may result in selecting/building different features/bins. To apply and test the model, exactly the same features/bins have to be selected/built.
Formalizing the meaning of the data requires a large amount of domain
knowledge. Eliciting all the possibly needed background information about the data from the user would probably be more demanding for her than designing useful workflows manually. Therefore, the completely automatic planning of useful workflows is not feasible. The approach of enumerating all correct workflows and then letting the user choose the useful one(s) will likely fail due to the large number of correct workflows (infinite, without a limit on the number of operations in the workflow). Only cooperative-interactive planning of workflows seems
to be feasible. In this scenario the planner ensures the correctness of the state of
planning and can propose a small number of possible intermediate refinements
of the current plan to the user. The user can use her knowledge about the data
to choose useful refinements, can make manual additions/corrections, and use
the planner again for tasks that can be routinely solved without knowledge
about the data. Furthermore, the planner can be used to generate all correct
sub-workflows to optimize the workflow by experimentation.
In this paper we present a knowledge-based representation of DM workflows,
understandable to both planner and user, as the foundation for cooperative-interactive planning. To be able to represent the intermediate states of planning,
we generalize this to “workflow templates”, i.e. abstract workflows that can
mix executable operators and tasks to be refined later into sub-workflows (or
sub-workflow-templates). Our workflows follow the structure of a Data Mining
Ontology for Workflows (DMWF). It has a hierarchical structure consisting
of a task/method decomposition into tasks, methods or operators. Therefore,
workflows can be grouped based on the structure decomposition and can be
simplified by using abstract nodes. This new representation helps the users since, akin to structured programming, the number of elements (operators, tasks, and methods) of a workflow actively under consideration is reduced significantly. Furthermore, this approach allows grouping certain sequences of operators into templates to be reused later. All this simplifies and improves the design of a DM workflow, reducing both the time needed to construct workflows and the workflow's size.
This paper is organized as follows: Section 2 describes workflows and their representation as well as workflow templates, Section 3 shows the advantages of workflow templates, Section 4 presents the current state and future steps, and finally Section 5 concludes our paper.
11 Consider a binary attribute "address invalid": just by looking at the data it is almost impossible to infer that it does not make sense to send advertisements to people with this flag set at the moment. In fact, they may have responded to previous advertisements very well.
-3-
2 DM Workflow
DM workflows generally represent a set of DM operators, which are executed and applied on data or models. In most DM tools users only work with operators and set their parameters (values). Data is implicit, hidden in the connectors: the user provides the data and applies the operators, but after each step new data is produced. In our approach we distinguish between all the components of the DM workflow: operators, data, and parameters. To enable the system and user to cooperatively design workflows, we developed a formalization of DM workflows in terms of an ontology.
To be able to define a DM workflow we first need to describe the DMWF ontology, since workflows are stored and represented in DMWF format. This ontology encodes rules from the KDD domain on how to solve DM tasks, for example the CRISP-DM [2] steps, in the form of concepts and relations (TBox – terminology). The DMWF has several classes that contribute to describing the DM world: IOObjects, MetaData, Operators, Goals, Tasks and Methods. The most important ones are shown in Table 1.
Class    | Description                         | Examples
IOObject | Input and output used by operators  | Data, Model, Report
MetaData | Characteristics of the IOObjects    | Attribute, AttributeType, DataColumn, DataFormat
Operator | DM operators                        | DataTableProcessing, ModelProcessing, Modeling, MethodEvaluation
Goal     | A DM goal that the user could solve | DescriptiveModelling, PatternDiscovery, PredictiveModelling, RetrievalByContent
Task     | A task is used to achieve a goal    | CleanMV, CategorialToScalar, DiscretizeAll, PredictTarget
Method   | A method is used to solve a task    | CategorialToScalarRecursive, CleanMVRecursive, DiscretizeAllRecursive, DoPrediction

Table 1: Main classes from the DMWF ontology
Properties                                 | Domain     | Range         | Description
uses (– usesData, – usesModel)             | Operator   | IOObject      | defines input for an operator
produces (– producesData, – producesModel) | Operator   | IOObject      | defines output for an operator
parameter                                  | Operator   | MetaData      | defines other parameters for operators
simpleParameter                            | Operator   | data type     | defines other parameters for operators
solvedBy                                   | Task       | Method        | A task is solved by a method
worksOn (– inputData, – outputData)        | TaskMethod | IOObject      | The IOObject elements the Task or Method works on
worksWith                                  | TaskMethod | MetaData      | The MetaData elements the Task or Method works with
decomposedTo                               | Method     | Operator/Task | A Method is decomposed into a set of steps

Table 2: Main roles from the DMWF ontology
The classes from the DMWF ontology are connected through properties as shown in Table 2. The parameters of operators as well as some basic characteristics of data are values (integer, double, string, etc.) in terms of data properties,12 e.g. the number of records for each data table, the number of missing values for each column, the mean value and standard deviation for each scalar column, the number of different values for nominal columns, etc. Having them modeled in the ontology enables the planner to use them for planning.

12 Later on we use usesProp, producesProp, simpleParamProp, etc. to denote the subproperties of uses, produces, simpleParameter, etc.
2.1 What is a workflow?
In our approach a workflow constitutes an instantiation of the DM classes; more precisely, it is a set of ontological individuals (ABox – assertions). It is mainly composed of several basic operators, which can be executed or applied with the given parameters. The workflow follows the structure illustrated in Fig. 1. A workflow consists of several operator applications, i.e. instances of operators, as well as their inputs and outputs – instances of IOObject, simple parameters (values which can have different data types like integer, string, etc.), or parameters – instances of MetaData. The flow itself is rather implicit: it is represented by shared IOObjects used and produced by Operators. The reasoner can ensure that every IOObject has only one producer and that every IOObject is either given as input to the workflow or produced before it can be used.
Operator[usesProp1 {1,1}⇒ IOObject, . . . , usesPropn {1,1}⇒IOObject,
producesProp1 {1,1}⇒ IOObject, . . . , producesPropn {1,1}⇒IOObject,
parameterProp1 {1,1}⇒ MetaData, . . . , parameterPropn {1,1}⇒ MetaData,
simpleParamProp1 {1,1}⇒ dataType, . . . , simpleParamPropn {1,1}⇒ dataType].
Fig. 1: Tbox for operator applications and workflows
Fig. 2 illustrates an example of a real workflow. It is not a linear sequence, since models are shared between subprocesses, so the workflow produced is a DAG (Directed Acyclic Graph). The workflow consists of two subprocesses, training and testing, which share the models. We have a set of basic operator individuals (FillMissingValues1, DiscretizeAll1, etc.) which use individuals of IOObject (TrainingData, TestData, DataTable1, etc.) as input and produce individuals of IOObject (PreprocessingModel1, Model1, etc.) as output. The example does not display the parameters and simple parameters of operators, but each operator could have several such parameters.

Fig. 2: A basic workflow example
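For illustration, the structure above can be made concrete with a small, purely illustrative sketch (this is not the DMWF API; the individual names merely echo the example of Fig. 2): operator applications reference the IOObjects they use and produce, and the two constraints stated above, a single producer per IOObject and production before use, can be checked directly.

```python
# Illustrative sketch only (not the DMWF API): operator applications as
# ABox-like individuals with uses/produces links, loosely following Fig. 2.
from dataclasses import dataclass, field

@dataclass
class OperatorApplication:
    name: str                                     # e.g. "FillMissingValues1"
    uses: list = field(default_factory=list)      # names of IOObjects consumed
    produces: list = field(default_factory=list)  # names of IOObjects produced

workflow = [
    OperatorApplication("FillMissingValues1", ["TrainingData"], ["DataTable1"]),
    OperatorApplication("DiscretizeAll1", ["DataTable1"], ["DataTable2", "PreprocessingModel1"]),
    OperatorApplication("Modeling1", ["DataTable2"], ["Model1"]),
]

def check_workflow(ops, workflow_inputs):
    """Check the two constraints stated above: every IOObject has at most one
    producer, and every used IOObject is a workflow input or was produced by
    an earlier operator application."""
    available = set(workflow_inputs)
    produced_by = {}
    for op in ops:
        for obj in op.uses:
            assert obj in available, f"{obj} is used by {op.name} before being produced"
        for obj in op.produces:
            assert obj not in produced_by, f"{obj} has more than one producer"
            produced_by[obj] = op.name
            available.add(obj)

check_workflow(workflow, workflow_inputs={"TrainingData"})  # passes silently
```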
2.2 Workflow templates
Very often DM workflows have a large number of operators (hundreds); moreover, some sequences of operators may repeat and be executed several
times in the same workflow. This becomes a real problem since the users need
to construct and maintain the workflows manually. To overcome this problem
we introduce the notion of workflow templates.
When the planner generates a workflow it follows a set of task/method decomposition rules encoded in the DMWF ontology. Every task has a set of
methods able to solve it. The task solved by a method is called the head of
the method. Each method is decomposed into a sequence of steps which can be
-5-
Training
uses
Data
Data
produces
Data
Discretize
All1
Data
uses
Data
Table1
produces
Model
Data
produces
Data
ApplyPrepro
cessing
Model1
uses
Data
Modeling1
produces
Model
Preprocess
ingModel2
uses
Model
Test
Data
Table2
produces
Model
Preprocess
ingModel1
uses
Data
produces
Data
FillMissing
Values1
Model1
uses
Model
uses
Data
Data
uses
Model
produces
Data
ApplyPrepro
cessing
Model2
Table3
uses
Data
Data
Table4
Apply
Model1
produces
Data
uses
Data
Data
Table5
Report
Accuracy1
produces
Data
Report1
Fig. 2: A basic workflow example
either tasks or operators as shown in the specification in Fig. 3. The matching
between the current and the next step is done based on operators’ conditions
and effects as well as methods’ conditions and contributions as described in
[6]. Such a set of task/method decompositions works similarly to a context-free
grammar: tasks are the non-terminal symbols of the grammar, operators are
the terminal symbols (or alphabet), and the methods for a task are the grammar rules that specify how a non-terminal can be replaced by a sequence of (simpler)
tasks and operators. In this analogy the workflows are words of the language
specified by the task/method decomposition grammar. To be able to generate
not only operator sequences but also operator DAGs14, it additionally contains a specification for passing parameter constraints between methods, tasks and operators15. In the decomposition process the properties of the method's head
(the task) or one of the steps can be bound to the same variable as the properties
of other steps.
TaskMethod[worksOnProp1 ⇒ IOObject, . . . , worksOnPropn ⇒ IOObject,
worksWithProp1 ⇒ MetaData, . . . , worksWithPropn ⇒ MetaData]
{Task, Method} :: TaskMethod.
Task[solvedBy ⇒ Method].
{step1 , . . . , stepn } :: decomposedTo.
Method[step1 ⇒{Operator|Task}, . . . , stepn ⇒ {Operator|Task}].
Method.{head|stepi }.prop = Method.{head|stepi }.prop
prop := worksOnProp | worksWithProp | usesProp | producesProp | parameterProp | simpleParamProp
Fig. 3: TBox for task/method decomposition and parameter passing constraints
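As a rough illustration of the grammar analogy above (the task and operator names are simplified stand-ins for the DMWF vocabulary, and the intermediate abstract operator RM DiscretizeAll is flattened away), the following sketch enumerates the operator sequences derivable from a task via its methods:

```python
# Minimal sketch of the grammar view: tasks are non-terminals, operators are
# terminals, and each method rewrites its head task into a sequence of steps.
methods = {
    # task -> list of alternative methods, each a sequence of steps
    "DiscretizeAll": [["RM_Discretize_All_by_Frequency"],
                      ["RM_Discretize_All_by_Size"]],
    "PreprocessData": [["CleanMV", "DiscretizeAll"]],
    "CleanMV": [["FillMissingValues"]],
}

def expand(step):
    """Enumerate all operator sequences (plans) reachable from a task or operator."""
    if step not in methods:            # terminal symbol: an executable operator
        return [[step]]
    plans = []
    for method in methods[step]:       # alternative methods for this task
        partial = [[]]
        for sub in method:             # decompose each step in order
            partial = [p + q for p in partial for q in expand(sub)]
        plans.extend(partial)
    return plans

print(expand("PreprocessData"))
# [['FillMissingValues', 'RM_Discretize_All_by_Frequency'],
#  ['FillMissingValues', 'RM_Discretize_All_by_Size']]
```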
A workflow template represents the upper (abstract) nodes from the generated decomposition, which in fact are either tasks, methods or abstract operators. If we look at the example in Fig. 2 none of the nodes are basic operators.
Indeed, they are all tasks as place-holders for several possible basic operators.
14 The planning process is still sequential, but the resulting structure may have a non-linear flow of objects.
15 Giving it the expressive power of a first-order logic Horn-clause grammar.
For example, DiscretizeAll has different discretization methods as described in Section 3; therefore DiscretizeAll represents a task which can be solved by the DiscretizeAllAtOnce method. The method can have several steps, e.g., the first step is an abstract operator RM DiscretizeAll, which in turn has several basic operators like RM Discretize All by Size and RM Discretize All by Frequency.
The workflows are produced by an HTN planner [9] based on the DMWF ontology as background knowledge (domain) and on the goal and data description
(problem). In fact, a workflow is equivalent to a generated plan.
The planner generates only valid workflows since it checks the preconditions of every operator present in the workflow; an operator's effects establish the preconditions of the next operator in the workflow. In most existing DM tools the user can design a workflow, start executing it, and after some time discover that some operator was applied on data with missing values or on nominals whilst, in fact, it can handle only missing-value-free data and scalars. Our approach avoids such annoying and time-consuming problems by using the conditions and effects of operators. An operator is applicable only when its preconditions are satisfied; therefore the generated workflows are semantically correct.
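The following toy sketch (which stands in for the ontology-based reasoning with plain Python sets; the operator and property names are illustrative) shows how checking preconditions against accumulated effects rejects an invalid workflow before execution:

```python
# Illustrative precondition/effect checking; the real planner reasons over the
# DMWF ontology, not over Python sets.
operators = {
    "FillMissingValues": {"pre": set(),                 "add": {"no_missing_values"}},
    "DiscretizeAll":     {"pre": {"no_missing_values"}, "add": {"all_scalars_discretized"}},
    "DecisionTree":      {"pre": {"no_missing_values", "all_scalars_discretized"},
                          "add": {"model"}},
}

def validate(plan, initial_state):
    """Reject a workflow before execution if some operator's preconditions are
    not established by the workflow input or by an earlier operator's effects."""
    state = set(initial_state)
    for name in plan:
        op = operators[name]
        missing = op["pre"] - state
        if missing:
            return False, f"{name} is not applicable: missing {missing}"
        state |= op["add"]
    return True, "workflow is semantically correct"

print(validate(["DiscretizeAll", "DecisionTree"], set()))                       # rejected
print(validate(["FillMissingValues", "DiscretizeAll", "DecisionTree"], set()))  # accepted
```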
3 Workflow Templates for auto-experimentation
To illustrate the usefulness of our approach, consider the following common
scenario. Given a data table containing numerical data, a modelling algorithm
should be applied that is not capable of processing numerical values, e.g., a
simple decision tree induction algorithm. In order to still utilize this algorithm,
attributes must first be discretized. To discretize a numerical attribute, its range
of possible numerical values is partitioned, and each numerical value is replaced
by the generated name of the partition it falls into. The data miner has multiple
options to compute this partition, e.g., RapidMiner [8] contains five different
algorithms to discretize data:
– Discretize by Binning. The numerical values are divided into k ranges of equal
size. The resulting bins can be arbitrarily unbalanced.
– Discretize by Frequency. The numerical values are inserted into k bins divided
at thresholds computed such that an equal number of examples is assigned
to each bin. The ranges of the resulting bins may be arbitrarily unbalanced.
– Discretize by Entropy. Bin boundaries are chosen as to minimize the entropy
in the induced partitions. The entropy is computed with respect to the label
attribute.
– Discretize by Size. Here, the user specifies the number of examples that should
be assigned to each bin. Consequently, the number of bins will vary.
– Discretize by User Specification. Here, the user can manually specify the
boundaries of the partition. This is typically only useful if meaningful boundaries are implied by the application domain.
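For illustration, a toy version of the first two options (not RapidMiner's implementation) shows how equal-width and equal-frequency binning behave on skewed values:

```python
# Toy illustration of Discretize by Binning vs. Discretize by Frequency.
def discretize_by_binning(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0
    return [min(int((v - lo) / width), k - 1) for v in values]

def discretize_by_frequency(values, k):
    """Choose thresholds so that roughly the same number of examples falls into each bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

skewed = [1, 1, 1, 2, 2, 3, 100]
print(discretize_by_binning(skewed, 2))    # [0, 0, 0, 0, 0, 0, 1] -> very unbalanced bins
print(discretize_by_frequency(skewed, 2))  # [0, 0, 0, 0, 1, 1, 1] -> balanced counts
```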
-7-
Each of these operators has its advantages and disadvantages. However, there
is no universal rule of thumb as to which of the options should be used depending
on the characteristics or domain of the data. Still, some of the options can be
excluded in some cases. For example, the entropy can only be computed if a
nominal label exists. There are also soft rules, e.g., it is not advisable to choose
any discretization algorithm with fixed partition boundaries if the attribute
values are skewed. Then, one might end up with bins that contain only very few
examples.
Though no such rule of thumb exists, it is also evident that the choice of
discretization operator can have a huge impact on the result of the data mining
process. To support this statement, we have performed experiments on some
standard data sets. We have executed all combinations of five discretization settings (Discretize by Binning with two and four bins, Discretize by Frequency with two and four bins, and Discretize by Entropy) on the 4 numerical attributes of the well-known UCI data set Iris. Following the discretization, a decision tree was generated and evaluated using ten-fold cross-validation16. We can observe
that the resulting accuracy varies significantly, between 64.0% and 94.7% (see
Table 3). Notably, the best performance is not achieved by selecting a single
method for all attributes, but by choosing a particular combination. This shows
that finding the right combination can actually be worth the effort.
Dataset | #numerical attr. | #total attr. | min. accuracy | max. accuracy
Iris    | 4                | 4            | 64.0%         | 94.7%
Adult   | 6                | 14           | 82.6%         | 86.3%

Table 3: The table shows that optimizing the discretization method can be a huge gain for some tables, whereas it is negligible for others.
Consider the number of different combinations possible for k discretization operators and m numeric attributes. This makes for a total of k^m combinations. If we want to try i different values for the number of bins, we even have (k·i)^m different combinations. In the case of our above example, this makes for a total of 1,296 combinations. Even knowing that the choice of discretization operator can make a huge difference, most data miners will not be willing to perform such a huge number of experiments.
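The arithmetic can be spelled out briefly; the values of k, i and m below are one assignment, consistent with the (k·i)^m formula, that reproduces the 1,296 figure quoted above for the Iris experiment.

```python
# Illustrative arithmetic only: k operators, i candidate bin counts per
# operator, m numeric attributes.
k, i, m = 3, 2, 4
print(k ** m)        # 81: each attribute picks one of k operators
print((k * i) ** m)  # 1296: bin counts are varied as well
```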
In principle, it is possible to execute all combinations in an automated fashion using standard RapidMiner operators. However, such a process must be
custom-made for the data set at hand. Furthermore, discretization is only one
out of numerous typical preprocessing steps. If we take into consideration other
steps like the replacement of missing values, normalization, etc., the complexity of such a task grows beyond any reasonable bound.
This is where workflow templates come into play. In a workflow template,
it is merely specified that at some point in the workflow all attributes must be
discretized, missing values be replaced or imputed, or a similar goal be achieved.
The planner can then create a collection of plans satisfying these constraints.
16 The process used to generate these results is available on the myExperiment platform [3]: http://www.myexperiment.org/workflows/1344
Clearly, simply enumerating all plans only helps if there is enough computational
power to try all possible combinations. Where this is not possible, the number
of plans must be reduced. Several options exist:
– Where fixed rules of thumb like the two rules mentioned above exist, this is
expressed in the ontological description of the operators. Thus, the search
space can be reduced, and less promising plans can be excluded from the
resulting collection of plans.
– The search space can be restricted by allowing only a subset of possible
combinations. For example, we can force the planner to apply the same
discretization operator to all attributes (but still allow any combination
with other preprocessing steps).
– The ontology is enriched by resource consumption annotations describing
the projected execution time and memory consumption of the individual
operators. This can be used to rank the retrieved plans.
– Where none of the above rules exist, meta mining from systematic experimentation can help to rank plans and test their execution in a sensible order.
This is ongoing work within the e-Lico project.
– Optimizing the discretization step does not necessarily yield such a huge gain
as presented above for all data sets. We executed a similar optimization as
the one presented above for the numerical attributes of the Adult data set.
Here, the accuracy only varies between 82.6% and 86.3% (see Table 3). In
hindsight, the reason for this is clear: Whereas all of the attributes of the
Iris data set are numerical, only 6 out of 14 attributes of the Adult dataset
are. Hence, the expected gain for Iris is much larger. A clever planner can
spot this fact, removing possible plans where no large gain can be expected.
Findings like these can also be supported by meta mining.
All these approaches help the data miner to optimize steps where this is promising and to generate and execute the necessary processes to be evaluated.
4 Current state
The current state and some of the future development plans of our project are shown in Fig. 4. The system consists of a modeling environment called eProPlan (e-Lico Protégé-based Planner) in which the ontology that defines the behavior of the Intelligent Discovery Assistant (IDA) is modeled. eProPlan comprises several Protégé 4 plug-ins [7] that add the modeling of operators with their conditions and effects and the task-method decomposition to the base ontology modeling. It allows analyzing workflow inputs and setting up the goals to be reached in the workflow. It also adds a reasoner interface to our reasoner/planner, such that the applicability of operators to IO-Objects can be tested (i.e. the correct modeling of the condition of an operator), a single operator can be applied with an applicable parameter setting (i.e. the correct modeling of the effect of an operator can be tested), and the planner can be asked to
-9-
Modeling &
testing
Workflow
generation
Va
li
Pl date
an Exp
lain
Plan Repair Plan
nd
pa
Ex Task
IDA-API
Exp Task
an sio
n
e riev
Ret n
Pla
s
N Plans for Task
DMO
Fig. 4: (a) eProPlan architecture
st Be od
th
Me
Reasoning &
planning
B
Op est era
tor
N Best Plans
A
Op pply
era tor
ble
lica s
App rator
e
Op
(b) The services of the planner
generate a whole plan for a specified task (i.e. the task-method decomposition
can be tested).
Using eProPlan we modeled the DMWF ontology, which currently consists of 64 Modeling (DM) Operators, covering supervised learning, clustering, and association rule generation, of which 53 are leaves, i.e. executable RapidMiner Operators. We also have 78 executable Preprocessing Operators from RapidMiner and 30 abstract Groups categorizing them, as well as 5 Reporting (e.g. a data audit, ROC curve), 5 Model evaluation (e.g. cross-validation) and Model application operators from RapidMiner. The domain model, which describes the IO-Objects of operators (i.e. data tables, models, reports, text collections, image collections), consists of 43 classes. With that, the DMWF is by far the largest collection of real operators modeled for any planner-IDA in the related work.
A main innovation of our domain model over all previous planner-based IDAs
is that we did not stop with the IO-Objects, but modeled their parts as well,
i.e. we modeled the attributes and the relevant properties a data table consists
of. With this model we are able to capture the conditions and effects of all
these operators not only on the table-level but also on the column-level. This
important improvement was illustrated on the example of discretization in the
last section. On the Task/Method decomposition side we modeled a CRISP-DM
top-level HTN. Its behavior can be modified by currently 15 (sub-) Goals that
are used as further hints for the HTN planner. We also have several bottom-level tasks like the DiscretizeAll task described in the last section, e.g. for Missing Value imputation and Normalization.
To access our planner IDA in a data mining environment we are currently developing an IDA-API (Intelligent Discovery Assistant Application Programming Interface). The first version of the API will offer the "AI-Planner" services in Fig. 4(b), but we are also working to extend our planner with the case-based planner services shown there, and our partner is working to integrate the probabilistic planner services [5]. The integration of the API into RapidMiner as a wizard is displayed in Fig. 5, and it will be integrated into Taverna [10] as well.
Fig. 5: A screenshot of the IDA planner integrated as a Wizard into RapidMiner.
5 Conclusion and future work
In this paper we introduced a knowledge-based representation of DM workflows
as a basis for cooperative-interactive workflow planning. Based on that we presented the main contribution of this paper: the definition of workflow templates,
i.e. abstract workflows that can mix executable operators and tasks to be refined
later into sub-workflows. We argued that these workflow templates serve very
well as a common workspace for user and system to cooperatively design workflows. Due to their hierarchical task structure they help to make large workflows
neat. We experimentally showed on the example of discretization that they help
to optimize the performance of workflows by auto-experimentation. Future work
will try to meta-learn from these workflow-optimization experiments, such that
a probabilistic extension of the planner can rank the plans based on their expected success. We argued that knowledge about the content of the data (which
cannot be extracted from the data) has a strong influence on the design of useful
workflows. Therefore, previously designed workflows for similar data and goals
likely contain an implicit encoding of this knowledge. This means an extension to case-based planning is a promising direction for future work as well. We expect workflow templates to help us in case adaptation, too, because they show what a sub-workflow is intended to achieve on the data.
Acknowledgements: This work is supported by the European Community
7th framework ICT-2007.4.4 (No 231519) “e-Lico: An e-Laboratory for Interdisciplinary Collaborative Research in Data Mining and Data-Intensive Science”.
References
1. A. Bernstein, F. Provost, and S. Hill. Towards Intelligent Assistance for a Data
Mining Process: An Ontology-based Approach for Cost-sensitive Classification.
IEEE Transactions on Knowledge and Data Engineering, 17(4):503–518, April
2005.
2. P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and
R. Wirth. CRISP-DM 1.0: Step-by-step data mining guide. Technical report, The
CRISP–DM Consortium, 2000.
3. D. De Roure, C. Goble, and R. Stevens. The design and realisation of the myExperiment virtual research environment for social sharing of workflows. In Future
Generation Computer Systems 25, pages 561–567, 2009.
4. C. Diamantini, D. Potena, and E. Storti. KDDONTO: An Ontology for Discovery
and Composition of KDD Algorithms. In Service-oriented Knowledge Discovery
(SoKD-09) Workshop at ECML/PKDD09, 2009.
5. M. Hilario, A. Kalousis, P. Nguyen, and A. Woznica. A data mining ontology for
algorithm selection and meta-learning. In Service-oriented Knowledge Discovery
(SoKD-09) Workshop at ECML/PKDD09, 2009.
6. J.-U. Kietz, F. Serban, A. Bernstein, and S. Fischer. Towards cooperative planning
of data mining workflows. In Service-oriented Knowledge Discovery (SoKD-09)
Workshop at ECML/PKDD09, 2009.
7. H. Knublauch, R. Fergerson, N. Noy, and M. Musen. The Protégé OWL plugin:
An open development environment for semantic web applications. Lecture notes
in computer science, pages 229–243, 2004.
8. I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler. Yale: Rapid
prototyping for complex data mining tasks. In KDD ’06: Proceedings of the 12th
ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 935–940. ACM, 2006.
9. D. Nau, T.-C. Au, O. Ilghami, U. Kuter, W. Murdock, D. Wu, and F. Yaman.
SHOP2: An HTN planning system. JAIR, 20:379–404, 2003.
10. T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Greenwood, T. Carver, M. Pocock,
A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 2004.
11. R. Wirth, C. Shearer, U. Grimmer, T. P. Reinartz, J. Schlösser, C. Breitner,
R. Engels, and G. Lindner. Towards process-oriented tool support for knowledge
discovery in databases. In PKDD ’97: Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, pages 243–253,
London, UK, 1997. Springer-Verlag.
12. M. Žáková, P. Křemen, F. Železný, and N. Lavrač. Planning to learn with a
knowledge discovery ontology. In Planning to Learn Workshop (PlanLearn 2008)
at ICML 2008, 2008.
13. M. Žáková, V. Podpečan, F. Železný, and N. Lavrač. Advancing data mining
workflow construction: A framework and cases using the orange toolkit. In Serviceoriented Knowledge Discovery (SoKD-09) Workshop at ECML/PKDD09, 2009.
Workflow Analysis using Graph Kernels
Natalja Friesen and Stefan Rüping1
Fraunhofer IAIS, 53754 St. Augustin, Germany,
{natalja.friesen,stefan.rueping}@iais.fraunhofer.de,
WWW home page: http://www.iais.fraunhofer.de
Abstract. Workflow enacting systems are a popular technology in business and e-science alike to flexibly define and enact complex data processing tasks. Since the construction of a workflow for a specific task can
become quite complex, efforts are currently underway to increase the
re-use of workflows through the implementation of specialized workflow
repositories. While existing methods to exploit the knowledge in these
repositories usually consider workflows as an atomic entity, our work is
based on the fact that workflows can naturally be viewed as graphs.
Hence, in this paper we investigate the use of graph kernels for the problems of workflow discovery, workflow recommendation, and workflow pattern extraction, paying special attention to the typical situation of few
labeled and many unlabeled workflows. To empirically demonstrate the
feasibility of our approach we investigate a dataset of bioinformatics
workflows retrieved from the website myexperiment.org.
Key words: Workflow analysis, graph mining
1 Introduction
Workflow enacting systems are a popular technology in business and e-science
alike to flexibly define and enact complex data processing tasks. A workflow is
basically a description of the order in which a set of services have to be called with
which input in order to solve a given task. Since the construction of a workflow
for a specific task can become quite complex, efforts are currently underway to
increase the re-use of workflows through the implementation of specialized workflow repositories. Driven by specific applications, a large collection of workflow
systems have been prototyped such as Taverna [12] or Triana [15].
As high numbers of workflows can be generated and stored relatively easily, it becomes increasingly hard to keep an overview of the available workflows. Workflow repositories and websites such as myexperiment.org tackle
this problem by offering the research community the possibility to publish and
exchange complete workflows. An even higher degree of integration has been
described in the idea of developing a Virtual Research Environment (VRE, [2]).
Due to the complexity of managing a large repository of workflows, data
mining approaches are needed to support the user in making good use of the
knowledge that is encoded in these workflows. In order to improve the flexibility
of a workflow system, a number of data mining tasks can be defined:
Workflow recommendation: Compute a ranking of the available workflows with
respect to their interestingness to the user for a given task. As it is hard to
formally model the user’s task and his interest in a workflow, one can also
define the task of finding a measure of similarity on workflows. Given a
(partial) workflow for the task the user is interested in, the most similar
workflows are then recommended to the user.
Metadata extraction: Given a workflow (and possibly partial metadata), infer the metadata that describes the workflow best. As most approaches for
searching and organizing workflows are based on descriptive metadata, this
task can be seen as the automation of the extraction of workflow semantics.
Pattern extraction: Given a set of workflows, extract a set of sub-patterns that are characteristic for this set. A practical purpose of these patterns is to
serve as building blocks for new workflows. In particular, given several sets of
workflows, one can also define the task of extracting the most discriminative
patterns, i.e. patterns that are characteristic for one group but not the others.
Workflow construction: Given a description of the task, automatically construct a workflow solving the task from scratch. An approach to workflow
construction, based on cooperative planning, is proposed in [11]. However,
this approach requires a detailed ontology of services [8], which in practice
is often not available. Hence, we do not investigate this task in this paper.
In existing approaches to the retrieval and discovery of workflows, workflows are
usually considered as an atomic entity, using workflow meta data such as its
usage history, textual descriptions (in particular tags), or user-generated quality
labels as descriptive attributes. While these approaches can deliver high quality
results, they are limited by the fact that all these attributes require either a
high user effort to describe the workflow (to use text mining techniques), or a
frequent use of each workflow by many different users (to mine for correlations).
We restrict our investigations to the second approach, considering the case where a large collection of working workflows is available.
In this paper we are interested in supporting the user in constructing the
workflow and reducing the manual effort of workflow tagging. The reason for the
focus on the early phases of workflow construction is that in practice it can be
observed that often users are reluctant to put too much effort into describing
a workflow; they are usually only interested in using the workflow system as a
means to get their work done. A second aspect to be considered is that without
proper means to discover existing workflows for re-use, it will be hard to receive
enough usage information on a new workflow to start up a correlation-based
recommendation in the first place.
To address these problems, we have opted to investigate solutions to the
previously described data mining tasks that can be applied in the common situation of many unlabeled workflows, using only the workflow description itself
and no meta data. Our work is based on the fact that workflows can be viewed
as graphs. We will demonstrate that by the use of graph kernels it is possible to
effectively extract workflow semantics and use this knowledge for the problems of
workflow recommendation and metadata extraction. The purpose of this paper
is to answer the following questions:
Q1: How good are graph kernels at performing the task of workflow recommendation without explicit user input? We will present an approach that is
based on exploiting workflow similarity.
Q2: Can appropriate meta data about a workflow be extracted from the workflow itself? What can we infer about the semantics of a workflow and its
key characteristics? In particular, we will investigate the task of tagging a
workflow with a set of user-defined keywords.
Q3: How well does graph mining perform at a descriptive approach to workflow analysis, namely the extraction of meaningful graph patterns?
The remainder of the paper is structured as follows: Next, we will discuss related
work in the area of workflow systems. In Section 3, we give a detailed discussion of the representation of workflows and the associated metadata. Section 4 will present
the approach of using graph kernels for workflow analysis. The approach will be
evaluated on four distinct learning tasks on a dataset of bioinformatics workflows
retrieved from the website http://myexperiment.org in Section 5. Section 6
concludes.
2 Related Work
Since workflow systems are getting more complicated, the development of effective discovery techniques for this field has been addressed by many researchers in recent years. Public repositories that enable sharing of workflows are widely used both in business and scientific communities. While first
steps toward supporting the user have been made, there is still a need to improve the effectiveness of discovery methods and support the user in navigating
the space of available workflows. A detailed overview of different approaches for
workflow discovery is given by Goderis [4].
Most approaches are based on simple search functionalities and consider a
workflow as an atomic entity. Searching over workflow annotations like titles and textual descriptions, or discovery on the basis of user profiles, belongs to the basic capabilities of repositories such as myExperiment [14], BioWep1, Kepler2 or commercial systems like Infosense and Pipeline Pilot.
In [5] a detailed study about current practices in workflow sharing, re-using
and retrieval is presented. To summarize, the need to take into account structural
properties of workflows in the retrieval process was underlined by several users.
The authors demonstrate that existing techniques are not sufficient and there is
still a need for effective discovery tools. In [6] retrieval techniques and methods
for ranking discovered workflows based on graph-subisomorphism matching are
presented. Corrales [1] proposes a method for calculating the structural similarity
of two BPEL (Business Process Execution Language) workflows represented by graphs. It is based on error-correcting graph subisomorphism detection.

1 http://bioinformatics.istge.it/biowep/
2 https://kepler-project.org/
Apart from workflow sharing and retrieval, the design of new workflows is an
immense challenge to users of workflow systems. It is both time-consuming and
error-prone, as there is a great diversity of choices regarding services, parameters,
and their interconnections. It requires the researcher to have specific knowledge
in both his research area and in the use of the workflow system. Consequently, it
is preferable for a researcher to not start from scratch, but to receive assistance
in the creation of a new workflow.
A good way to implement this assistance is to reuse or re-purpose existing
workflows or workflow patterns (i.e. more generic fragments of workflows). An
example of workflow re-use is given in [7], where a workflow to identify genes
involved in tolerance to Trypanosomiasis in East African cattle was reused successfully by another scientist to identify the biological pathways implicated in
the ability of mice to expel the Trichuris Muris parasite.
In [7] it is argued that designing new workflows by reusing and re-purposing
previous workflows or workflows patterns has the following advantages:
– Reduction of workflow authoring time
– Improved quality through shared workflow development
– Improved experimental provenance through reuse of established and validated workflows
– Avoidance of workflow redundancy
While there has been some research comparing workflow patterns in a number
of commercially available workflow management systems [17] or identifying patterns that describe the behavior of business processes [18], to the best of our
knowledge there exists no work on automatically extracting patterns. A pattern
mining method for business workflows based on calculation of support values
is presented in [16]. However, the set of patterns that was used was derived
manually based on an extensive literature study.
3 Workflows
A workflow is a way to formalize and structure complex data analysis experiments. Scientific workflows can be described as a sequence of computation steps, together with predefined inputs and outputs, that arise in scientific problem-solving. Such a definition of workflows enables sharing analysis knowledge within scientific communities in a convenient way.
We consider the discovery of similar workflows in the context of a specific
VRE called myExperiment [13]. MyExperiment has been developed to support
sharing of scientific objects associated with an experiment. It is a collaborative
environment where scientists can publish their workflows. Each stored workflow
is created by a specific user, is associated with a workflow graph, and contains
metadata and certain statistics such as the number of downloads or the average
rating given by the users. We split all available information about a workflow
into four different groups: the workflow graph, textual data, user information,
and workflow statistics. Next we will characterize each group in more detail.
Textual Data: Each workflow in myExperiment has a title and a description text and contains information about the creator and date of creation. Furthermore, the associated tags annotate the workflow with several keywords that facilitate searching for workflows and provide more precise results.
User Information: MyExperiment was also conceived as a social infrastructure for researchers. The social component is realized by the registration of users and allows them to create profiles with different kinds of personal information,
details about their work and professional life. The members of myExperiment
can form complex relationships with other members, such as creating or
joining user groups or giving credit to others. All this information can be
used in order to find the groups of users having similar research interests or
working in related projects. In the end, this type of information can be used
to generate the well known correlation-based recommendations of the type
“users who liked this workflow also liked the following workflows...”.
Workflow Statistics: As statistical data we consider information that changes over time, such as the number of views or downloads or the average rating. Statistical data can be very useful for providing a user with a workflow
he is likely to be interested in. As we do not have direct information about
user preferences, some of the statistics data, e.g. number of downloads or
rating, can be considered as a kind of quality measure.
4 A Graph Mining Approach to Workflow Analysis
The characterization of a workflow by metadata alone is challenging because none of these features gives an insight into the underlying sub-structures of the
workflow. It is clear that users do not always create a new workflow from scratch,
but most likely re-use old components and sub-workflows. Hence, knowledge of
sub-structures is important information to characterize a workflow completely.
The common approach to represent objects for a learning problem is to describe them as vectors in a feature space. However, when we handle objects that
have important sub-structures, such as workflows, the design of a suitable feature
space is not trivial. For this reason, we opt to follow a graph mining approach.
4.1 Frequent Subgraphs
Frequent subgraph discovery has received a lot of attention, since it has a wide range of application areas. Frequently occurring subgraphs in a large set of graphs can represent important motifs in the data. Given a set of graphs G, the support S(g) of a graph g is defined as the fraction of graphs in G in which g occurs. The problem of finding frequent patterns is defined as follows: given a set of graphs G and a minimum support Smin, we want to find all connected subgraphs g that occur frequently enough (i.e. S(g) >= Smin) over the entire set of graphs. The output of the discovery process may contain a large number of such patterns.
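The support computation itself is straightforward once a containment test is fixed; a minimal sketch follows, with the containment predicate left abstract (Section 4.2 will use label walks as the concrete patterns).

```python
# Sketch of the support definition above; `contains(graph, pattern)` is any
# containment test, e.g. one based on label walks or subgraph isomorphism.
def support(pattern, graphs, contains):
    """Fraction of graphs in `graphs` in which `pattern` occurs."""
    return sum(1 for g in graphs if contains(g, pattern)) / len(graphs)

def frequent_patterns(candidates, graphs, contains, s_min):
    """Keep the candidate patterns whose support reaches the threshold s_min."""
    return [p for p in candidates if support(p, graphs, contains) >= s_min]
```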
4.2 Graph Kernels
Graph kernels, as originally proposed by [3,10], provide a general framework
for handling graph data structures by kernel methods. Different approaches for
defining graph kernels exist. A popular representation of graphs, used for example in protein modeling and drug screening, are kernels based on cyclic patterns [9]. However, these are not applicable to workflow data, as workflows
are by definition acyclic (because an edge between services A and B represents
the relation “A must finish before B can start”).
To adequately represent the decomposition of workflows into functional substructures, we follow a third approach: the set of graphs is searched for substructures (in this case paths) that occur in at least a given percentage (support) of
all graphs. Then, the feature vector is composed of the weighted counts of such
paths. The substructures are sequences of labeled vertices that were produced by
graph traversal. The length of a substructure is equal to the number of vertices
in it. This family of kernels is called Label Sequence Kernels. The main difference among the kernels lies in how graphs are traversed and how weights are
involved in computing a kernel. According to the extracted substructures, these are kernels based on walks, trees or cycles. In our work we used the walk-based exponential kernels proposed by Gärtner et al. [3]. Since workflows are directed
acyclic graphs, in our special case the hardness results of [3] no longer hold and
we actually can enumerate all walks. This allows us to explicitly generate the
feature space representation of the kernels by defining the attribute values for
every substructure (walk). For each substructure s in the set of graphs, let k be
the length of the substructure. Then, the attribute λs is defined as:
λs = β^k / k!    (1)
if the graph contains the substructure s, and λs = 0 otherwise. Here β is a parameter that can be optimized, e.g. by cross-validation. A very important advantage of the graph kernel approach for the discovery task is that distinct substructures can provide insight into the specific behavior of the workflow.
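The explicit feature-space construction can be sketched as follows (a simplified illustration under our own assumptions: workflows are DAGs given as adjacency dictionaries with node labels, and walks are enumerated up to a maximum length; this is not the authors' code).

```python
from math import factorial


def enumerate_walks(graph, labels, max_len=4):
    """Yield the label sequences of all directed walks with at most `max_len` vertices."""
    def walk(node, seq):
        yield tuple(seq)
        if len(seq) < max_len:
            for succ in graph.get(node, ()):
                yield from walk(succ, seq + [labels[succ]])
    for n in labels:
        yield from walk(n, [labels[n]])


def feature_vector(graph, labels, beta=0.5, max_len=4):
    """Map a workflow graph to the attributes of Eq. (1): beta**k / k! per occurring walk."""
    return {s: beta ** len(s) / factorial(len(s))
            for s in set(enumerate_walks(graph, labels, max_len))}


def kernel(phi1, phi2):
    """Linear kernel (dot product) between two explicit feature vectors."""
    return sum(v * phi2[s] for s, v in phi1.items() if s in phi2)


graph = {"a": ["b"], "b": ["c"]}
labels = {"a": "fetch", "b": "BLAST", "c": "report"}
phi = feature_vector(graph, labels)
print(kernel(phi, phi))  # self-similarity of the toy workflow
```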
4.3 Graph Representation of Workflows
A workflow can be formalized as a directed acyclic labeled graph. The workflow graph has two kinds of nodes: regular nodes representing computation operations and nodes defining the input/output data structure. A set of edges represents the information and control flow between the nodes. More formally, a workflow graph can be defined as a tuple W = (N, T), where:
N = C ∪ I ∪ O is the set of nodes, with
C = finite set of computation operations,
I/O = finite sets of inputs and outputs,
T ⊆ N × N = finite set of transitions defining the control flow.
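For illustration, a workflow graph in the sense of this definition could be encoded as follows (a minimal sketch; node names and labels are hypothetical, and the label set would in practice come from the component signatures discussed below).

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowGraph:
    computations: set = field(default_factory=set)    # C: computation operations
    inputs_outputs: set = field(default_factory=set)  # I/O: data nodes
    transitions: set = field(default_factory=set)     # T ⊆ N x N: control flow edges
    labels: dict = field(default_factory=dict)        # node -> abstract label

    def add_transition(self, src, dst):
        nodes = self.computations | self.inputs_outputs  # N = C ∪ I ∪ O
        assert src in nodes and dst in nodes, "transitions must connect known nodes"
        self.transitions.add((src, dst))


w = WorkflowGraph(computations={"blast", "parse"}, inputs_outputs={"seq_in", "report"})
w.labels = {"seq_in": "Input", "blast": "BLAST_service",
            "parse": "XML_parser", "report": "Output"}
w.add_transition("seq_in", "blast")
w.add_transition("blast", "parse")
w.add_transition("parse", "report")
```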
Labeled graphs contain an additional source of information. There are several alternatives for obtaining node labels. On the one hand, users often annotate single workflow components by a combination of words or abbreviations. On the other hand, each component within a workflow system has a signature and an identifier associated with it, e.g. in the web-service WSDL format. User-created labels suffer from subjectivity and diversity, e.g. the same node representing the same computational operation can be labeled in very different ways. Since the first alternative again assumes some type of user input, we opt to use the second alternative. An exemplary case where this choice makes a clear difference will be presented later in Section 5.2.
Figure 1 shows an example of such a transformation for a Taverna workflow [12]. While the left picture shows the user-annotated components, the right picture presents the workflow graph at the next abstraction level. Obviously, the choice of the right abstraction level is crucial. In this paper, we use a hand-crafted abstraction that was developed especially for the myExperiment data. In general, the use of data mining ontologies [8] may be preferable.
Fig. 1. Transformation of Taverna workflow to the workflow graph.
Group | Size | Most frequent tags | Description
1 | 30% | localworker, example, mygrid | Workflows using local scripts.
2 | 29% | bioinformatics, sequence, protein, BLAST, alignment, similarity, structure, search, retrieval | Sequence similarity search using the BLAST algorithm.
3 | 24% | benchmarks | Benchmarks WFs.
4 | 6.7% | AIDA, BioAID, text mining, bioassist, demo, biorange | Text mining on biomedical texts using the AIDA toolbox and BioAID web services.
5 | 6.3% | Pathway, microarray, kegg | Molecular pathway analysis using the Kyoto Encyclopedia of Genes and Genomes (KEGG).
Table 1. Characterization of workflow groups derived by clustering.
5 Evaluation
In this section we illustrate the use of workflow structure and graph kernels in
particular for workflow discovery and pattern extraction. We evaluate results on
a real-world dataset of Taverna workflows. However, the same approach can be
applied to other workflow systems, as long as the workflows can be transformed
to a graph in a consistent way.
5.1 Dataset
For the purposes of this evaluation we used a corpus of 300 real-world bioinformatics workflows retrieved from myExperiment [13]. We chose to restrict ourselves to workflows that were created in the Taverna workbench [12] in order to simplify the formatting of workflows. Since the application area of myExperiment is restricted to bioinformatics, it is likely that sets of similar workflows exist. In the data, user feedback about the similarity of workflow pairs is missing. Hence, we use semantic information to obtain workflow similarity. We made
the assumption that workflows targeting the same tasks are similar. Under this
assumption we used the cosine similarity of the vector of tags assigned to the
workflow as a proxy for the true similarity. An optimization over the number of
clusters resulted in five groups shown in Table 1. These tags indeed impose a
clear structuring with few overlaps on the workflows.
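As an illustration of this proxy, the following sketch (our own, with invented tag sets) computes the cosine similarity of two binary tag vectors represented as tag sets.

```python
from math import sqrt


def tag_cosine_similarity(tags_a, tags_b):
    """Cosine similarity of two binary tag vectors, given as sets of tags."""
    if not tags_a or not tags_b:
        return 0.0
    return len(tags_a & tags_b) / (sqrt(len(tags_a)) * sqrt(len(tags_b)))


wf1 = {"bioinformatics", "sequence", "BLAST", "alignment"}
wf2 = {"bioinformatics", "BLAST", "similarity", "search"}
print(tag_cosine_similarity(wf1, wf2))  # 0.5
```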
5.2 Workflow Recommendation
In this section, we address Question Q1: "How good are graph kernels at performing the task of workflow recommendation without explicit user input?" The goal is to retrieve workflows that are "close enough" to a user's context. To do this, we need to be able to compare the workflows available in existing VREs with the user's own workflow. As similarity measure we use the graph kernel from Section 4.2.
We compare our approach based on graph kernels to the following techniques representing the current state of the art [6]: matching of workflow graphs based on the size of the maximal common subgraph (MCS), and a method that considers a workflow as a bag of services. In addition to these techniques we also consider a standard text mining approach, whose main idea is to treat workflows as documents in XML format. The similarity of a workflow pair is then calculated as the cosine similarity between the respective word vectors.
In our experiment we predict whether two workflows belong to the same cluster. Table 2 summarizes the average performance of a leave-one-out evaluation for the four approaches. It can be seen that graph kernels clearly outperform all other approaches in accuracy and recall. For precision, MCS performs best, however, at the cost of a very low recall. The precision of graph kernels ranks second and is close to the value of MCS.
Method | Accuracy | Precision | Recall
Graph Kernels | 81.2 ± 10.0 | 71.9 ± 22.0 | 38.3 ± 21.1
MCS | 73.9 ± 9.3 | 73.5 ± 24.7 | 4.8 ± 27.4
Bags of services | 73.5 ± 10.3 | 15.5 ± 20.6 | 3.4 ± 30.1
Text Mining | 77.8 ± 8.31 | 67.2 ± 21.5 | 31.2 ± 25.8
Table 2. Performance of workflow discovery.
We conclude that graph kernels are very promising for the task of workflow
recommendation based only on graph structure without explicit user input.
5.3 Workflow Tagging
We are now interested in Question Q2, the extraction of appropriate metadata from workflows. As a prototypical piece of metadata, we investigate user-defined tags.
We selected 20 tags that occur in at least 3% of all workflows. We use tags as proxies that represent the real-world task that a workflow can perform. For each tag we would like to predict whether it describes a given workflow. To do that we utilize graph kernels. We tested two algorithms: SVM and k-Nearest Neighbor. Table 3 shows the results of tag prediction evaluated by 2-fold cross-validation over the 20 keywords. It can be seen that an SVM with graph kernels can predict the
selected tags with high AUC and precision, while a Nearest Neighbor approach
using graph kernels to define the distance achieves a higher recall.
We can conclude that the graph representation of a workflow contains enough information to predict appropriate metadata.
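For readers who want to reproduce this kind of setup, the following sketch shows one way to train an SVM on a precomputed workflow kernel matrix with scikit-learn; the Gram matrix and tag labels here are synthetic stand-ins, not the myExperiment data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))            # stand-in for explicit walk features
K = X @ X.T                              # Gram matrix: K[i, j] = <phi_i, phi_j>
y = (X[:, 0] > 0).astype(int)            # stand-in for "workflow carries tag t"

train, test = np.arange(0, 40), np.arange(40, 60)
svm = SVC(kernel="precomputed").fit(K[np.ix_(train, train)], y[train])
scores = svm.decision_function(K[np.ix_(test, train)])  # kernel values to the training set
print("AUC:", roc_auc_score(y[test], scores))
```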
5.4 Pattern extraction
Finally, we investigate question Q4, which deals with the more descriptive task
of extracting meaningful patterns from sets of workflows that are helpful in the
construction of new workflows.
Method | AUC | Precision | Recall
Nearest Neighbors | 0.54 ± 0.18 | 0.51 ± 0.21 | 0.58 ± 0.19
SVM | 0.85 ± 0.10 | 0.84 ± 0.24 | 0.38 ± 0.29
Table 3. Accuracy of workflow tagging based on graph kernels, averaged over all 20 tasks.
We address the issue of extracting patterns that are particularly important within a group of similar workflows in several steps. First, we use an SVM to build a classification model based on the graph kernels. This model separates all workflows belonging to one group from the workflows of the other groups. Then we search for features with high weight values, which the model considers important. We performed such pattern extraction in turn for each workflow group. A 10-fold cross-validation shows that this classification can be achieved with high accuracy, with values ranging between 81.3% and 94.7%, depending on the class. However, we are more interested in the most significant patterns, which we determine based on the weight assigned by the SVM (taking the standard deviation into account).
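In the explicit feature-space representation, this weight-based selection can be sketched as follows (synthetic data and hypothetical feature names; we use a linear SVM so that one weight per substructure is exposed).

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
feature_names = [f"walk_{i}" for i in range(50)]     # hypothetical walk patterns
X = rng.random((80, 50))                             # explicit walk-feature vectors
y = (X[:, 3] + X[:, 17] > 1.0).astype(int)           # stand-in for "belongs to group 2"

svm = LinearSVC(C=1.0, dual=False).fit(X, y)
weights = svm.coef_.ravel()
top = np.argsort(-np.abs(weights))[:5]               # substructures with the highest weight
for i in top:
    print(f"{feature_names[i]}: weight = {weights[i]:+.3f}")
```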
Figure 2 shows an example of a workflow pattern and the same pattern inside a workflow in which it occurs. It was considered important for classifying workflows from group 2, which consists of workflows using the BLAST algorithm to calculate sequence similarity. The presented pattern is a sequence of components that are needed to run a BLAST service.
This example shows that graph kernels can be used to extract useful patterns, which can then be recommended to the user during the creation of a new workflow.
6 Conclusions
Workflow enacting systems have become a popular tool for the easy orchestration of complex data processing tasks. However, the design and management of workflows is a complex task. Machine learning techniques have the potential to significantly simplify this work for the user.
In this paper, we have discussed the usage of graph kernels for the analysis of workflow data. We argue that graph kernels are very useful in the practically important situation where no metadata is available. This is due to the fact that the graph kernel approach takes the decomposition of a workflow into its substructures into account, while allowing a flexible integration of the information contained in these substructures into several learning algorithms.
We have evaluated the use of graph kernels for workflow similarity prediction, metadata extraction, and pattern extraction. A comparison of graph-based workflow analysis with metadata-based workflow analysis in the field of workflow quality modeling showed that metadata-based approaches outperform graph-based approaches in this application. However, it is important to recognize that the goal of the graph-based approach is not to replace the metadata-based approaches, but to serve as an extension when little or no metadata is available.
Fig. 2. Example of workflow graph.
The next step in our work will be to evaluate our approach in a more realistic scenario. Future research will investigate several alternatives for creating a workflow representation from a workflow graph in order to provide an appropriate representation at different levels of abstraction. One possibility is to obtain the labels of graph nodes using an ontology that describes the services and key components of a workflow, such as in [8].
References
1. Juan Carlos Corrales, Daniela Grigori, and Mokrane Bouzeghoub. Bpel processes
matchmaking for service discovery. In In Proc. CoopIS 2006, Lecture Notes in
Computer Science 4275, pages 237–254. Springer, 2006.
2. M. Fraser. Virtual Research Environments: Overview and Activity. Ariadne, 2005.
3. Thomas Gaertner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness
results and efficient alternatives. In Proceedings of the 16th Annual Conference
on Computational Learning Theory and 7th Kernel Workshop, pages 129–143.
Springer-Verlag, August 2003.
4. Antoon Goderis. Workflow re-use and discovery in bioinformatics. PhD thesis,
School of Computer Science, The University of Manchester, 2008.
5. Antoon Goderis, Paul Fisher, Andrew Gibson, Franck Tanoh, Katy Wolstencroft,
David De Roure, and Carole Goble. Benchmarking workflow discovery: a case study
from bioinformatics. Concurr. Comput. : Pract. Exper., (16):2052–2069, 2009.
6. Antoon Goderis, Peter Li, and Carole Goble. Workflow discovery: the problem,
a case study from e-science and a graph-based solution. In ICWS ’06: Proceedings of the IEEE International Conference on Web Services, pages 312–319. IEEE
Computer Society, 2006.
7. Antoon Goderis, Ulrike Sattler, Phillip Lord, and Carole Goble. Seven bottlenecks to workflow reuse and repurposing. In The Semantic Web – ISWC 2005, pages 323–337, 2005.
8. Melanie Hilario, Alexandros Kalousis, Phong Nguyen, and Adam Woznica. A data mining ontology for algorithm selection and meta-learning. In Proc. of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09), Bled, Slovenia, pages 76–87, 2009.
9. Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for
predictive graph mining. In KDD ’04: Proc. of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 158–167. ACM,
2004.
10. Hisashi Kashima and Teruo Koyanagi. Kernels for semi-structured data. In ICML
’02: Proceedings of the Nineteenth International Conference on Machine Learning,
pages 291–298, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
11. Jörg-Uwe Kietz, Floarea Serban, Abraham Bernstein, and Simon Fischer. Towards cooperative planning of data mining workflows. In Proc. of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09), Bled, Slovenia, pages 1–12, September 2009.
12. T Oinn, M.J. Addis, J. Ferris, D.J. Marvin, M. Senger, T. Carver, M. Greenwood,
K Glover, M. R. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054,
June 2004.
13. David De Roure, Carole Goble, Jiten Bhagat, Don Cruickshank, Antoon Goderis, Danius Michaelides, and David Newman. myExperiment: Defining the social virtual research environment. In 4th IEEE International Conference on e-Science, pages 182–189. IEEE Press, December 2008.
14. Robert Stevens and David De Roure. The design and realisation of the myExperiment virtual research environment for social sharing of workflows, 2009.
15. Ian J. Taylor, Ewa Deelman, Dennis B. Gannon, and Matthew Shields. Workflows
for e-Science: Scientific Workflows for Grids. Springer-Verlag New York, Inc.,
Secaucus, NJ, USA, 2006.
16. Lucineia Thom, Cirano Iochpe, and Manfred Reichert. Workflow patterns for business process modeling. In Proc. of the CAiSE’06 Workshops - 8th Int’l Workshop
on Business Process Modeling, Development, and Support (BPMDS’07), page Vol.
1. Trondheim, Norway, 2007.
17. W. M. P. Van Der Aalst, A. H. M. Ter Hofstede, B. Kiepuszewski, and A. P. Barros.
Workflow patterns. Distrib. Parallel Databases, 14(1):5–51, 2003.
18. Stephen A. White. Business process trends. In Business Process Trends, 2004.
Re-using Data Mining Workflows
Stefan Rüping, Dennis Wegener, and Philipp Bremer
Fraunhofer IAIS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
http://www.iais.fraunhofer.de
Abstract. Setting up and reusing data mining processes is a complex task. Based on our experience from a project on the analysis of clinico-genomic data, we will make the point that supporting setup and reuse by building large workflow repositories may not be realistic in practice. We describe an approach for automatically collecting workflow information and metadata, and introduce data mining patterns as an approach for formally describing the information necessary for workflow reuse.
Key words: Data Mining, Workflow Reuse, Data Mining Patterns
1 Introduction
Workflow enacting systems are a popular technology in business and e-science alike for flexibly defining and enacting complex data processing tasks. A workflow is basically a description of the order in which a set of services have to be called, and with which input, in order to solve a given task. Driven by specific applications, a large collection of workflow systems has been prototyped, such as Taverna1 or Triana2. The next generation of workflow systems is marked by workflow repositories such as MyExperiment.org, which tackle the problem of organizing workflows by offering the research community the possibility to publish, exchange and discuss individual workflows.
However, the more powerful these environments become, the more important it is to guide the user in the complex task of constructing appropriate workflows. This is particularly true for workflows which encode a data mining task, as these are typically much more complex and subject to more frequent change than workflows in business applications.
In this paper, we are particularly interested in the question of reusing successful data mining applications. As the construction of a good data mining process invariably requires encoding a significant amount of domain knowledge, this is a process which cannot be fully automated. By reusing and adapting existing processes that have proven successful in practical use, we hope to save much of this manual work in a new application and thereby increase the efficiency of setting up data mining workflows.
1 http://www.taverna.org.uk
2 http://www.trianacode.org
We report our experiences in designing a system which is targeted at supporting scientists, in this case bioinformaticians, with a workflow system for the
analysis of clinico-genomic data. We will make the case that:
– For practical reasons it is already a difficult task to gather a non-trivial
database of workflows which can form the basis of workflow reuse.
– In order to be able to meaningfully reuse data mining workflows, a formal
notation is needed that bridges the gap between a description of the workflows at implementation level and a high-level textual description for the
workflow designer.
The paper is structured as follows: In the next section, we introduce the
ACGT project, in the context of which our work was developed. Section 3 describes an approach for automatically collecting workflow information and appropriate meta data. Section 4 presents data mining patterns which formally
describe all information that is necessary for workflow reuse. Section 5 concludes.
2 The ACGT Project
The work in this paper is based on our experiences in the ACGT project3 , which
has the goal of implementing a secure, semantically enhanced end-to-end system
in support of large multi-centric clinico-genomic trials, meaning that it strives
to integrate all steps from the collection and management of various kinds of
data in a trial up to the statistical analysis by the researcher. In the current
version, the various elements of the data mining environment can be integrated
into complex analysis pipelines through the ACGT workflow editor and enactor.
With respect to workflow reuse, we gained the following experiences from setting up and running an initial version of the ACGT environment:
– The construction of data mining workflows is an inherently complex problem when it is based on input data with complex semantics, as is the case with clinical and genomic data.
– Because of the complex data dependencies, copy and paste is not an appropriate technique for workflow reuse.
– Standardization and reuse of approaches and algorithms works very well on the level of services, but not on the level of workflows. While it is relatively easy to select the right parameterization of a service, making the right connections and changes to a workflow template quickly becomes quite complex, such that users find it easier to construct a new workflow from scratch.
– Workflow reuse only occurs when the initial creator of a workflow describes the internal logic of the workflow in detail. However, most workflow creators avoid this effort because they simply want to "solve the task at hand".
In summary, the situation of having a large repository of workflows from which to choose the appropriate one, which is often assumed in existing approaches for workflow recommendation systems, may not be very realistic in practice.
3 http://eu-acgt.org
3 Collecting Workflow Information
To obtain information about the human creation of data mining workflows, it is necessary to design a system which collects realistic data mining workflows out of the production cycle. We developed a system which collects data mining workflows based on plug-ins that were integrated into the data mining software used for production [1]. In particular, we developed plug-ins for RapidMiner, an open source data mining tool, and Obwious, a self-developed text mining tool. Every time the user executes a workflow, the workflow definition is sent to a repository and stored in a shared abstract representation. The shared abstract representation is necessary because we want to compare workflows of different formats and types and to extract the interesting information from a wide range of workflows in order to obtain high diversity.
As we do not only want to observe the final version of a manually created workflow but also the whole chain of workflows that were created in the process of finding this final version, we also need a representation of this chain of workflows. We call the collection of connected workflows from the workflow life cycle which solve the same data mining problem on the same database a workflow design sequence.
The shared abstract representation of the workflows is based on the CRISP phases and their common tasks, as described in [2]. Based on this we created the following seven classes: (1) data preparation: select data, (2) data preparation: clean data, (3) data preparation: construct data, (4) data preparation: integrate data, (5) data preparation: format data, (6) modeling, and (7) other. Of course, it would also be of interest to use more detailed structures, such as the data mining ontology presented in [3]. The operators of the data mining software that was used are classified using these classes, and the workflows are transferred to the shared abstract representation. The abstract information itself records whether any operator of the first five classes - the operators which perform data preparation tasks - is used in the workflow, and whether the operators themselves or their parameter settings changed in comparison to the predecessor in this sequence. Furthermore, this representation notes which operators of the class Modeling are used and whether there are any changes in these operators or in their parameter settings in comparison to the predecessor in the design sequence. An example of this representation is shown in Figure 1.
Fig. 1. Visualization of a workflow design sequence in the abstract representation
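One possible encoding of this abstract representation (our reading of the description above, not the plug-ins' actual data model) is sketched below.

```python
from dataclasses import dataclass, field

DATA_PREP_CLASSES = ("select data", "clean data", "construct data",
                     "integrate data", "format data")


@dataclass
class AbstractWorkflow:
    """One element of a workflow design sequence in the shared abstract representation."""
    uses: dict = field(default_factory=dict)               # class -> operator of this class used?
    operator_changed: dict = field(default_factory=dict)   # class -> operator changed vs. predecessor?
    parameter_changed: dict = field(default_factory=dict)  # class -> parameters changed vs. predecessor?
    modeling_operators: list = field(default_factory=list)
    modeling_operator_changed: bool = False
    modeling_parameter_changed: bool = False


# Hypothetical step: preprocessing kept unchanged, learner parameters adjusted.
step = AbstractWorkflow(
    uses={c: c in ("select data", "clean data") for c in DATA_PREP_CLASSES},
    operator_changed={c: False for c in DATA_PREP_CLASSES},
    parameter_changed={c: False for c in DATA_PREP_CLASSES},
    modeling_operators=["SVM"],
    modeling_parameter_changed=True,
)
design_sequence = [step]  # a workflow design sequence is an ordered list of such steps
```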
At the end of the first collection phase, which lasted six months, we had collected 2520 workflows in our database, created by 16 different users. These workflows were processed into 133 workflow design sequences. According to our assumption, this would mean that there are about 33 real workflow design sequences in our database. There was an imbalance in the distribution of workflows and workflow design sequences over the two software sources: because of the heavy usage of RapidMiner and the early development state of Obwious, about 85% of the workflows and over 90% of the workflow design sequences were created using RapidMiner.
Although the system needs to collect data for a much longer period, the derived data already contains some interesting information. In Figure 2 one can see that in the workflow creation process the adjustment and modification of the data preparation operators is as important as the adjustment and modification of the learner operators. This contradicts the common assumption that the focus lies only on the modeling phase and the learner operators. The average length of a computed workflow design sequence is about 18 workflows.
In summary, our study shows that a human workflow creator produces many somewhat similar workflows until the final version is found; these mainly differ in the operators and parameters of the CRISP phases data preparation and modeling.
CRISP phase | Change type | Absolute occurrences | Relative occurrences¹
Data preparation | Change | 609 | 24.17%
Data preparation | Parameter change | 405 | 16.07%
Data preparation | Sum of all changes | 1014 | 40.24%
Learner | Change | 215 | 8.53%
Learner | Parameter change | 801 | 31.79%
Learner | Sum of all changes | 1016 | 40.32%
¹ Relative to the absolute count of all 2520 workflows.
Fig. 2. Occurrences of changes in CRISP-phases
4 Data Mining Patterns
In the area of data mining there exist a lot of scenarios where existing solutions are reusable, especially when no research on new algorithms is necessary. Many examples and ready-to-use algorithms are available as toolkits or services, which only have to be integrated. However, the reuse and integration of existing solutions is rarely, or only informally, done in practice due to a lack of formal support, which leads to a lot of unnecessary repetitive work. In the following we present our approach to the reuse of data mining workflows by formally encoding both the technical and the high-level semantics of these workflows.
In this work, we aim at a formal representation of data mining processes to facilitate their semi-automatic reuse in business processes. As visualized in Fig. 3, the approach should bridge the gap between a high-level description of the process as found in written documentation and scientific papers (which is too general to lead to an automation of work), and a fine-grained technical description in the form of an executable workflow (which is too specific to be re-used in slightly different cases).
Fig. 3. Different strategies of reusing data mining.
In [4] we presented a new process model for the easy reuse and integration of data mining in different business processes. The aim of this work was to allow for reusing existing data mining processes that have proven to be successful. Thus, we aimed at the development of a formal and concrete definition of the steps that are involved in the data mining process and of the steps that are necessary to reuse it in new business processes. In the following we briefly describe the steps that are necessary to allow for the reuse of data mining.
Our approach is based on CRISP [2]. The basic idea is that when a data mining solution is re-used, several parts of the CRISP process can be seen as pre-defined, and one only needs to execute those parts of CRISP where the original and the re-used process differ. Hence, we define Data Mining Patterns to describe those parts that are pre-defined, and introduce a meta-process to model those steps of CRISP which need to be executed when re-using a pattern on a concrete data mining problem. Data Mining Patterns are defined such that the CRISP process (more correctly, those parts of CRISP that can be pre-defined) is the most general Data Mining Pattern, and that we can derive a more specialized Data Mining Pattern from a more general one by replacing a task by a more specialized one (according to a hierarchy of tasks that we define).
CRISP is a standard process model for data mining which describes the life
cycle of a data mining project in the following 6 phases: Business Understanding,
Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
The CRISP model includes a four-level breakdown including phases, generic
tasks, specialized tasks and process instances for specifying different levels of
abstraction. In the end, the data mining patterns correspond most closely to the process instance level of CRISP. In our approach we need to take into account that reuse may in some cases only be possible at a general or conceptual level. We allow for
the specification of different levels of abstraction by the following hierarchy of
tasks: conceptual (only textual description is available), configurable (code
is available but parameters need to be specified), and executable (code and
parameters are specified).
The idea of our approach is to be able to describe all data mining processes. The description needs to be as detailed as is adequate for the given scenario. Thus, we consider the tasks of the CRISP process as the most general data mining pattern. Every concretion of this process for a specific application is also a data mining pattern. The generic CRISP tasks can be transformed into the following components:
– Check tasks in the pattern, e.g. checking if the data quality is acceptable;
– Configurable tasks in the pattern, e.g. setting a certain service parameter by hand;
– Executable tasks or gateways in the pattern, which can be executed without further specification;
– Tasks in the meta-process that are independent of a certain pattern, e.g. checking if the business objective of the original data mining process and the new process are identical;
– Empty tasks, where the task is obsolete due to the pattern approach, e.g. producing a final report.
We defined a data mining pattern as follows: the pattern representing the extended CRISP model is a Data Mining Pattern. Each concretion of this pattern according to the presented hierarchy is also a Data Mining Pattern.
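The task hierarchy and the specialization relation between patterns could be encoded, for example, as follows (a sketch of our interpretation, not the formal definition from [4]).

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    CONCEPTUAL = 1    # only a textual description is available
    CONFIGURABLE = 2  # code is available, parameters still need to be specified
    EXECUTABLE = 3    # code and parameters are fully specified


@dataclass(frozen=True)
class Task:
    name: str             # e.g. a generic CRISP task such as "clean data"
    level: Level
    description: str = ""


@dataclass
class DataMiningPattern:
    tasks: dict  # task name -> Task

    def specialize(self, task_name, new_task):
        """Derive a more specialized pattern by replacing one task with a more concrete one."""
        assert new_task.level >= self.tasks[task_name].level, "only specialization is allowed"
        tasks = dict(self.tasks)
        tasks[task_name] = new_task
        return DataMiningPattern(tasks)


# The (extended) CRISP process as the most general pattern, and one concretion of it.
crisp = DataMiningPattern({"clean data": Task("clean data", Level.CONCEPTUAL)})
concrete = crisp.specialize(
    "clean data", Task("clean data", Level.CONFIGURABLE, "missing-value imputation service"))
```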
5 Discussion and Future Work
Setting up and reusing data mining workflows is a complex task. When many dependencies on complex data exist, the situation found in workflow reuse is fundamentally different from the one found in reusing services. In this paper, we have given a short insight into the nature of this problem, based on our experience in a project dealing with the analysis of clinico-genomic data. We have proposed two approaches to improve the possibility of reusing workflows: the automated collection of a metadata-rich workflow repository, and the definition of data mining patterns to formally encode both the technical and the high-level semantics of workflows.
References
1. Bremer, P.: Erstellung einer Datenbasis von Workflowreihen aus realen Anwendungen (in German), Diploma Thesis, University of Bonn (2010)
2. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth,
R.: CRISP-DM 1.0 Step-by-step data mining guide, CRISP-DM consortium (2000)
3. Hilario, M., Kalousis, A., Nguyen, P., Woznica, A.: A Data Mining Ontology for
Algorithm Selection and Meta-Learning. Proc. ECML/PKDD09 Workshop on Third
Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD09), Bled, Slovenia, pp. 76–87 (2009)
4. Wegener, D., Rüping, S.: On Reusing Data Mining in Business Processes - A
Pattern-based Approach. BPM 2010 Workshops - Proceedings of the 1st International Workshop on Reuse in Business Process Management, Hoboken, NJ (2010)
Exposé:
An Ontology for Data Mining Experiments
Joaquin Vanschoren1 and Larisa Soldatova2
1 Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium, [email protected]
2 Aberystwyth University, Llandinum Bldg, Penglais, SY23 3DB Aberystwyth, UK, [email protected]
Abstract. Research in machine learning and data mining can be speeded
up tremendously by moving empirical research results out of people’s
heads and labs, onto the network and into tools that help us structure
and filter the information. This paper presents Exposé, an ontology to
describe machine learning experiments in a standardized fashion and
support a collaborative approach to the analysis of learning algorithms.
Using a common vocabulary, data mining experiments and details of
the used algorithms and datasets can be shared between individual researchers, software agents, and the community at large. It enables open
repositories that collect and organize experiments by many researchers.
As can be learned from recent developments in other sciences, such a
free exchange and reuse of experiments requires a clear representation.
We therefore focus on the design of an ontology to express and share
experiment meta-data with the world.
1 Introduction
Research in machine learning is inherently empirical. Whether the goal is to develop better learning algorithms or to create appropriate data mining workflows
for new sources of data, running the right experiments and correctly interpreting
the results is crucial to build up a thorough understanding of learning processes.
Running those experiments tends to be quite laborious. In the case of evaluating a new algorithm, pictured in Figure 1, one needs to search for datasets,
preprocessing algorithms, (rival) learning algorithm implementations and scripts
for algorithm performance estimation (e.g. cross-validation). Next, one needs to
set up a wide range of experiments: datasets need to be preprocessed and algorithm parameters need to be varied, each of which requires much expertise.
This easily amounts to a large range of experiments representing days, if not
weeks of work, while only averaged results will ever be published. Any other
researcher willing to verify the published results or test additional hypotheses
will have to start again from scratch, repeating the same experiments instead of
simply reusing them.
Fig. 1. A typical experimental workflow in machine learning research.
1.1 Generalizability and Interpretability
Moreover, in order to ensure that results are generally valid, the empirical evaluation also needs to cover many different conditions. These include various parameter settings and various kinds of datasets, e.g. differing in size, skewness,
noisiness, and various workflows of preprocessing techniques. Unfortunately, because of the amount of work involved in empirical evaluation, many studies
will not explore these conditions thoroughly, limiting themselves to algorithm
benchmarking. It has long been recognized that such studies are in fact only
‘case studies’ [1], and should be interpreted with caution.
Sometimes, overly general conclusions can be drawn. In time series analysis research, many studies were shown to be biased toward the datasets being
used, leading to contradictory results [16]. Moreover, it has been shown that
the relative performance of learning algorithms depends heavily on the amount
of sampled training data [23, 29], and is also easily dominated by the effect of
parameter optimization and feature selection [14].
As such, there are good reasons to thoroughly explore different conditions, or
at least to clearly state under which conditions certain conclusions may or may
not hold. Otherwise, it is very hard for other researchers to correctly interpret
the results, thus possibly creating a false sense of progress [11]:
...no method will be universally superior to other methods: relative superiority will depend on the type of data used in the comparisons, the
particular data sets, the performance criterion and a host of other factors. [...] an apparent superiority in classification accuracy, obtained in
laboratory conditions, may not translate to a superiority in real-world
conditions...
1.2 A collaborative approach
In this paper, we advocate a much more dynamic, collaborative approach to experimentation, in which all experiment details can be freely shared in repositories
(see the dashed arrow in Fig. 1), linked together with other studies, augmented
with measurable properties of algorithms and datasets, and immediately reused
by researchers all over the world. Any researcher creating empirical meta-data
should thus be able to easily share it with others and in turn reuse any prior
results of interest. Indeed, by reusing prior results we can avoid unnecessary
repetition and speed up scientific research. This enables large-scale, very generalizable machine learning studies which are prohibitively expensive to start
from scratch. Moreover, by bringing the results of many studies together, we can
obtain an increasingly detailed picture of learning algorithm behavior. If this
meta-data is also properly organized, many questions about machine learning
algorithms can be answered on the fly by simply writing a query to a database
[29]. This also drastically facilitates meta-learning studies that analyze the stored
empirical meta-data to find useful patterns in algorithm performance [28].
1.3 Ontologies
The use of such public experiment repositories is common practice in many other
scientific disciplines. To streamline the sharing of experiment data, they created
unambiguous description languages, based on a careful analysis of the concepts
used within a domain and their relationships. This is formally represented in
ontologies [5, 13]: machine manipulable domain models in which each concept
(class) is clearly described. They provide an unambiguous vocabulary that can
be updated and extended by many researchers, thus harnessing the “collective
intelligence” of the scientific community [10]. Moreover, they express scientific
concepts and results in a formalized way that allows software agents to interpret
them correctly, answer queries and automatically organize all results [25].
In this paper, we propose an ontology designed to adequately record machine learning experiments and workflows in a standardized fashion, so they can
be shared, collected and reused. Section 2 first discusses the use of ontologies
in other sciences to share experiment details and then covers previously proposed ontologies for data mining. Next, we present Exposé, a novel ontology for
machine learning experimentation, in Section 3. Section 4 concludes.
2 Previous work
2.1 e-Sciences
Ontologies have proven very successful in bringing together the results of researchers all over the world. For instance, in astronomy, ontologies are used to
build Virtual Observatories [7, 27], combining astronomical data from many different telescopes. Moreover, in bio-informatics, the Open Biomedical Ontology
(OBO) Foundry3 defines a large set of consistent and complementary ontologies
for various subfields, such as microarray data4 , and genes and their products [2].
As such, they create an “open scientific culture where as much information
as possible is moved out of people’s heads and labs, onto the network and into
tools that can help us structure and filter the information” [20].
Ironically, while machine learning and data mining have been very successful
in speeding up scientific progress in these fields by discovering useful patterns in
a myriad of collected experimental results, machine learning experiments themselves are currently not being documented and organized well enough to engender
the same automatic discovery of insightful patterns that may speed up the design
of new data mining algorithms or workflows.
2.2 Data mining ontologies
Recently, the design of ontologies for data mining has attracted quite a bit of attention, resulting in many ontologies for various goals.
OntoDM [22] is a general ontology for data mining with the aim of providing
a unified framework for data mining research. It attempts to cover the full width
of data mining research, containing high-level classes, such as data mining tasks
and algorithms, and more specific classes related to certain subfields, such as
constraints for constraint-based data mining.
EXPO [26] is a top-level ontology that models scientific experiments in general, so that empirical research can be uniformly expressed and automated. It
covers classes such as hypotheses, (un)controlled variables, experimental designs
and experimental equipment.
DAMON (DAta Mining ONtology) [4] is a taxonomy meant to offer domain
experts a way to look up tasks, methods and software tools given a certain goal.
KDDONTO [8] is an OWL-DL ontology also built to discover suitable KD
algorithms and to express workflows of KD processes. It covers the inputs and
outputs of the algorithms and any pre- and postconditions for their use.
KD ontology [31] describes planning-related information about datasets
and KD algorithms. It is used in conjunction with an AI planning algorithm: pre- and postconditions of KD operators are converted into standard PDDL planning
problems [18]. It is used in an extension of the Orange toolkit to automatically
plan KD workflows [32].
The DMWF ontology [17] also describes all KD operators with their in- and
outputs and pre- and postconditions, and is meant to be used in a KD support
system that generates (partial) workflows, checks and repairs workflows built by
users, and retrieves and adapts previous workflows.
DMOP, the Data Mining Ontology for Workflow Optimization [12], models
the internal structure of learning algorithms, and is explicitly designed to support
algorithm selection. It covers classes such as the structure and parameters of
predictive models, the involved cost functions and optimization strategies.
3 http://www.obofoundry.org/
4 http://www.mged.org/ontology
3 The Exposé ontology
In this section, we describe Exposé, an ontology for machine learning experimentation. It is meant to be used in conjunction with experiment databases (ExpDBs) [3, 29, 28]: databases designed to collect the details of these experiments,
and to intelligently organize them in online repositories to enable fast and thorough analysis of a myriad of collected results. In this context, Exposé supports
the accurate recording and exchange of data mining experiments and workflows.
It has been ‘translated’ into an XML-based language, called ExpML, to describe
experiment workflows and results in detail [30]. Moreover, it clearly defines the
semantics of data mining experiments stored in the experiment database, so that
a very wide range of questions on data mining algorithm performance can be answered through querying [29]. Many examples can be found in previous papers
[29, 30]. Finally, although we currently use a relational database, Exposé will
clearly be instrumental in RDF databases, allowing even more powerful queries.
It thus supports reasoning with the data, meta learning, data integration, and
also enables logical consistency checks.
For now, Exposé focuses on supervised classification on propositional datasets.
It is also important to note that, while it has been influenced and adapted by
many researchers, it is a straw-man proposal that is intended to instigate discussion and attract wider involvement from the data mining community. It is
described in the OWL-DL ontology language [13], and can be downloaded from
the experiment database website (http://expdb.cs.kuleuven.be).
We first describe the design guidelines used to develop Exposé, then its top-level classes, and finally the parts covering experiments, experiment contexts,
evaluation metrics, performance estimation techniques, datasets, and algorithms.
3.1 Ontology design
In designing Exposé, we followed existing guidelines for ontology design [21, 15]:
Top-level ontologies It is considered good practice to start from generally
accepted classes and relationships (properties) [22]. We started from the
Basic Formal Ontology (BFO)5 covering top-level scientific classes and the
OBO Relational Ontology (RO)6 offering a predefined set of properties.
Ontology reuse If possible, other ontologies should be reused to build on
prior knowledge and consensus. We directly reuse several general machine
learning related classes from OntoDM [22], experimentation-related classes
from EXPO [26], and classes related to internal algorithm mechanisms from
DMOP [12]. We wish to integrate Exposé with existing ontologies, so that
it will evolve with them as they are extended further.
Design patterns Ontology design patterns7 are reusable, successful solutions
to recurrent modeling problems. For instance, a learning algorithm can sometimes act as a base-learner for an ensemble learner. This is a case of an agent-role pattern, and a predefined property, 'realizes', is used to indicate which entities are able to fulfill a certain role.
Quality criteria General criteria include clarity, consistency, extensibility and minimal commitment. These criteria are rather qualitative, and were only evaluated through discussions with other researchers.
5 http://www.ifomis.org/bfo
6 http://www.obofoundry.org/ro/
7 http://ontologydesignpatterns.org
Fig. 2. An overview of the top-level classes in the Exposé ontology.
3.2 Top-level View
Figure 2 shows the most important top-level classes and properties, many of
which are inherited from the OntoDM ontology [22], which in turn reuses classes
from OBI8 (i.e., planned process) and IAO9 (i.e. information content entity). The
full arrows symbolize an ‘is-a’ property, meaning that the first class is a subclass
of the second, and the dashed arrows symbolize other common properties. Double
arrows indicate one-to-many properties, for instance, an algorithm application
can have many parameter settings.
The three most important categories of classes are information content entity, which covers datasets, models and abstract specifications of objects (e.g.
algorithms), implementation, and planned process, a sequence of actions meant
to achieve a certain goal. When describing experiments, this distinction is very
important. For instance, the class ‘C4.5’ can mean the abstract algorithm, a specific implementation or an execution of that algorithm with specific parameter
settings, and we want to distinguish between all three.
8 http://obi-ontology.org
9 http://code.google.com/p/information-artifact-ontology
Fig. 3. Experiments in the Exposé ontology.
As such, ambiguous classes such as ‘learning algorithm’ are broken up according to different interpretations (indicated by bold ellipses in Fig. 2): an abstract
algorithm specification (e.g. in pseudo-code), a concrete algorithm implementation, code in a certain programming language with a version number, and a
specific algorithm application, a deterministic function with fixed parameter settings, run on a specific machine with an actual input (a dataset) and output (a model); see also Fig. 3. The same distinction is used for other algorithms
(for data preprocessing, evaluation or model refinement), mathematical functions (e.g. the kernel used in an SVM), and parameters, which can have different
names in different implementations and different value settings in different applications. Algorithm and function applications are operators in a KD workflow,
and can even be participants of another algorithm application (e.g., a kernel or
a base-learner), i.e. they can be part of the inner workflow of an algorithm.
Finally, there are also qualities, properties of a specific dataset or algorithm
(see Figs. 6 and 7), and roles indicating that an element assumes a (temporary)
role in another process: an algorithm can act as a base-learner in an ensemble,
a function can act as a distance function in a learning algorithm, and a dataset
can be a training set in one experiment and a test set in the next.
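To make the three-level distinction tangible, the following sketch models it with plain Python classes (this is an illustration, not part of Exposé or ExpML; the Weka/J48 example values are only indicative).

```python
from dataclasses import dataclass, field


@dataclass
class AlgorithmSpecification:
    """Abstract algorithm, e.g. 'C4.5' as described in pseudo-code."""
    name: str


@dataclass
class AlgorithmImplementation:
    """Concrete code in some library, with a version number."""
    specification: AlgorithmSpecification
    library: str
    version: str


@dataclass
class AlgorithmApplication:
    """A deterministic run of an implementation with fixed parameter settings."""
    implementation: AlgorithmImplementation
    parameter_settings: dict = field(default_factory=dict)


c45 = AlgorithmSpecification("C4.5")
j48 = AlgorithmImplementation(c45, library="weka", version="3.6")
run = AlgorithmApplication(j48, parameter_settings={"C": 0.25, "M": 2})
```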
3.3 Experiments
Figure 3 shows the ontological description of experiments, with the top-level
classes from Fig. 2 drawn in filled double ellipses. Experiments are defined as
workflows, which allows the description of many kinds of experiments. Some
(composite) experiments can also consist of many smaller (singular) experiments,
and can use a particular experiment design [19] to investigate the effects of
various experimental variables, e.g. parameter settings.
Fig. 4. Workflow structure and an example experiment workflow.
We will now focus on a particular kind of experiment: a learner evaluation
(indicated by a bold ellipse). This type of experiment applies a specific learning
algorithm (with fixed parameters) on a specific input dataset and evaluates the
produced model by applying one or several model evaluation functions, e.g. predictive accuracy. In predictive tasks, a performance estimation technique, e.g.
10-fold cross-validation, is applied to generate training- and test sets, evaluate
the resulting models and aggregate the results. After it is executed on a specific
machine, it will output a model evaluation result containing the outcomes of all
evaluations and, in the case of predictive algorithms, the (probabilistic) predictions made by the models. Models are also generated by applying the learning
algorithm on the entire dataset.
Finally, more often than not, the dataset will have to be preprocessed first.
Again, by using workflows, we can define how various data processing applications preprocess the data before it is passed on to the learning algorithm.
Figure 4 illustrates such a workflow. The top of the figure shows that it consists of participants (operators), which in turn have inputs and outputs (shown
in ovals): datasets, models and model evaluation results. Workflows themselves
also have inputs and outputs, and we can define specialized sub-workflows. A
data processing workflow is a sequence of data processing steps. The center of
Fig. 4 shows one with three preprocessors. A learner evaluation workflow takes
a dataset as input and applies performance estimation techniques (e.g. 10-fold
cross-validation) and model evaluation functions (e.g. the area under the ROC
curve) to evaluate a specific learner application. Of course, there are other types
of learner evaluations, both finer ones, e.g. a singular train-test experiment, and
more complex ones, e.g. doing an internal model selection to find the optimal
parameter settings.
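As an informal illustration of what such a learner evaluation record might contain (the field names and values below are our own placeholders, not ExpML syntax), consider:

```python
learner_evaluation = {
    "input_dataset": {"name": "iris", "version": "1"},
    "learner_application": {
        "implementation": {"name": "weka.J48", "version": "3.6"},
        "parameter_settings": {"C": 0.25, "M": 2},
    },
    "performance_estimation": {"name": "10-fold cross-validation"},
    "model_evaluation_functions": ["predictive accuracy", "area under the ROC curve"],
    "machine": "hypothetical-node-01",
    "model_evaluation_result": {          # placeholder numbers, filled in after execution
        "predictive accuracy": 0.94,
        "area under the ROC curve": 0.97,
    },
}
```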
Fig. 5. Learner evaluation measures in the Exposé ontology.
3.4 Experiment context
Although outside the scope of this paper, Exposé also models the context in
which scientific investigations are conducted. Many of these classes are originally defined in the EXPO ontology [26]. They include authors, references to
publications and the goal, hypotheses and conclusions of a study. It also defines
(un)controlled or (in)dependent experimental variables, and various experimental
designs [19] defining which values to assign to each of these variables.
3.5 Learner evaluation
To describe algorithm evaluations, Exposé currently covers 96 performance measures used in various learning tasks, some of which are shown in Fig. 5. In some
tasks, all available data is used to build a model, and properties of that model
are measured to evaluate it, e.g., the inter-cluster similarity in clustering. In
binary classification, the predictions of the models are used, e.g., predictive accuracy, precision and recall. In multi-class problems, the same measures can be
used by transforming the multi-class prediction into c binary predictions, and
averaging the results over all classes, weighted by the number of examples in each class. Regression measures, e.g., root mean squared error (RMSE), can also be used in classification by taking the difference between the actual and predicted class probabilities. Finally, graphical evaluation measures, such as precision-recall curves, ROC curves or cost curves, provide a much more detailed evaluation. Many definitions of these metrics exist, so it is important to define them clearly.
Fig. 6. Datasets in the Exposé ontology.
Although not shown here, Exposé also covers several performance estimation
algorithms, such as k-fold or 5x2 cross-validation, and statistical significance
tests, such as the paired t-test (by resampling, 10-fold cross-validation or 5x2
cross-validation) [9] or tests on multiple datasets [6].
3.6 Datasets
Figure 6 shows the most important classes used to describe datasets.
Specification. The data specification (in the top part of Fig. 6) describes the
structure of a dataset. Some subclasses are graphs, sequences and sets of
instances. The latter can have instances of various types, e.g., tuples, in
which case it can have a number of data features and data instances. For
other types of data this specification will have to be extended. Finally, a
dataset has descriptions, such as name, version and download url to make it
easily retrievable.
Roles. A specific dataset can play different roles in different experiments (top
of Fig. 6). For instance, it can be a training set in one evaluation and a test
set in the next.
Data properties. As said before, we wish to link all empirical results to theoretical metadata, called properties, about the underlying datasets to perform
meta-learning studies. These data properties are shown in the bottom half
of Fig. 6, and may concern individual instances, individual features or the
entire dataset. We define both feature properties such as feature skewness
or mutual information with the target feature, as well as general dataset
properties such as the number of attributes and landmarkers [24].
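A few of these dataset properties can be computed with a short sketch like the following (standard formulas under our own naming, assuming NumPy; the "target skewness" here is only a simple class-imbalance proxy).

```python
import numpy as np


def dataset_properties(X, y):
    """Compute a few simple, statistical and information-theoretic dataset properties."""
    classes, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return {
        "number_of_instances": X.shape[0],
        "number_of_features": X.shape[1],
        "number_of_missing_values": int(np.isnan(X).sum()),
        "class_entropy": float(-(p * np.log2(p)).sum()),
        "target_skewness": float(p.max() - 1.0 / len(classes)),  # crude imbalance proxy
    }


X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
y = np.array([0, 0, 0, 1])
print(dataset_properties(X, y))
```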
3.7 Algorithms
Algorithms can perform very differently under different configurations and parameter settings, so we need a detailed vocabulary to describe them. Figure 7
shows how algorithms and their configurations are expressed in our ontology.
From top to bottom, it shows a taxonomy of different types of algorithms, the
different internal operators they use (e.g. kernel functions), the definition of algorithm implementations and applications (see Sect. 3.2) and algorithm properties
(only two are shown).
Algorithm implementations. Algorithm implementations are described with all
information needed to retrieve and use them, such as their name, version, url,
and the library they belong to (if any). Moreover, they have implementations of
algorithm parameters and can have qualities, e.g. their susceptibility to noise.
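The kind of information recorded for an implementation can be illustrated with a small data structure; the field names, the Weka example and its version, URL and parameter values below are illustrative stand-ins, not the actual Exposé serialization.

```python
# Illustrative sketch: metadata for an algorithm implementation, loosely
# mirroring the information listed above (all concrete values are examples).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ParameterImplementation:
    name: str
    default_value: object = None

@dataclass
class AlgorithmImplementation:
    name: str
    version: str
    url: str
    library: Optional[str] = None
    parameters: list = field(default_factory=list)
    qualities: dict = field(default_factory=dict)   # e.g. susceptibility to noise

weka_j48 = AlgorithmImplementation(
    name="weka.classifiers.trees.J48",
    version="3.6.2",                                  # example value
    url="http://www.cs.waikato.ac.nz/ml/weka/",       # example value
    library="Weka",
    parameters=[ParameterImplementation("C", 0.25), ParameterImplementation("M", 2)],
    qualities={"handles missing values": "yes"},
)
print(weka_j48)
```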
Algorithm composition. Some algorithms use other algorithms or mathematical functions, which can often be selected (or plugged in) by the user. These
include base-learners in ensemble learners, distance functions in clustering and
nearest neighbor algorithms and kernels in kernel-based learning algorithms.
Some algorithm implementations also use internal data processing algorithms,
e.g. to remove missing values. In Exposé, any operator can be a participant in an algorithm application, combined in internal workflows with inputs and outputs.
Depending on the algorithm, operators can fulfill (realize) certain predefined
roles (center of Fig. 7).
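The sketch below illustrates this plug-in character with a user-supplied kernel function passed to a kernel-based learner, i.e. an operator realizing the kernel role inside an algorithm application; the trivial linear kernel and the dataset are placeholders.

```python
# Illustrative sketch: an operator (a kernel function) plugged into a
# kernel-based learning algorithm, realizing the "kernel" role.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

def linear_kernel(X, Y):
    """Trivial user-supplied kernel: the Gram matrix of dot products."""
    return np.dot(X, Y.T)

X, y = load_iris(return_X_y=True)
clf = SVC(kernel=linear_kernel)    # the callable operator fills the kernel role
clf.fit(X, y)
print(f"training accuracy: {clf.score(X, y):.3f}")
```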
Algorithm mechanisms. Finally, to understand the performance differences between different types of algorithms, we need to look at the internal learning
mechanisms on which they are built. These include the kind of models that are
built (e.g. decision trees), how these models are optimized (e.g. the heuristic
used, such as information gain) and the decision boundaries that are generated
(e.g. axis-parallel, piecewise linear ones in the case of non-oblique decision trees).
These classes, which extend the algorithm definitions through specific properties
(e.g. has model structure), are defined in the DMOP ontology [12], so they won’t
be repeated here.
Fig. 7. Algorithms and their configurations in the Exposé ontology.
4 Conclusions
We have presented Exposé, an ontology for data mining experiments. It is complementary to other data mining ontologies such as OntoDM [22], EXPO [26],
and DMOP [12], and covers data mining experiments in fine detail, including
the experiment context, evaluation metrics, performance estimation techniques,
datasets, and algorithms. It is used in conjunction with experiment databases
(ExpDBs) [3, 29, 28], to engender a collaborative approach to empirical data mining research, in which experiment details can be freely shared in repositories,
linked together with other studies, and immediately reused by researchers all
over the world. Many illustrations of the uses of Exposé to share, collect and
query for experimental meta-data can be found in prior work [3, 29, 30].
References
1. Aha, D.: Generalizing from case studies: A case study. Proceedings of the Ninth
International Conference on Machine Learning pp. 1–10 (1992)
2. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M.,
Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G.: Gene ontology: tool for the unification of biology.
Nature Genetics 25, 25–29 (2000)
3. Blockeel, H., Vanschoren, J.: Experiment databases: Towards an improved experimental methodology in machine learning. Lecture Notes in Computer Science 4702,
6–17 (2007)
4. Cannataro, M., Comito, C.: A data mining ontology for grid programming. First International Workshop on Semantics in Peer-to-Peer and Grid Computing at WWW
2003 pp. 113–134 (2003)
5. Chandrasekaran, B., Josephson, J.: What are ontologies, and why do we need
them? IEEE Intelligent systems 14(1), 20–26 (1999)
6. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal
of Machine Learning Research 7, 1–30 (2006)
7. Derriere, S., Preite-Martinez, A., Richard, A.: UCDs and ontologies. ASP Conference Series 351, 449 (2006)
8. Diamantini, C., Potena, D., Storti, E.: Kddonto: An ontology for discovery and
composition of kdd algorithms. Proceedings of the 3rd Generation Data Mining
Workshop at the 2009 European Conference on Machine Learning (2009)
9. Dietterich, T.: Approximate statistical tests for comparing supervised classification
learning algorithms. Neural computation 10(7), 1895–1923 (1998)
10. Goble, C., Corcho, O., Alper, P., Roure, D.D.: e-science and the semantic web: A
symbiotic relationship. Lecture Notes in Computer Science 4265, 1–12 (2006)
11. Hand, D.: Classifier technology and the illusion of progress. Statistical Science
21(1), 114 (2006)
12. Hilario, M., Kalousis, A., Nguyen, P., Woznica, A.: A data mining ontology for algorithm selection and meta-mining. Proceedings of the ECML/PKDD09 Workshop
on 3rd generation Data Mining (SoKD-09) pp. 76–87 (2009)
13. Horridge, M., Knublauch, H., Rector, A., Stevens, R., Wroe, C.: A practical guide
to building OWL ontologies using Protege 4 and CO-ODE tools. The University
of Manchester (2009)
14. Hoste, V., Daelemans, W.: Comparing learning approaches to coreference resolution. there is more to it than bias. Proceedings of the Workshop on Meta-Learning
(ICML-2005) pp. 20–27 (2005)
15. Karapiperis, S., Apostolou, D.: Consensus building in collaborative ontology engineering processes. Journal of Universal Knowledge Management 1(3), 199–216
(2006)
16. Keogh, E., Kasetty, S.: On the need for time series data mining benchmarks: A
survey and empirical demonstration. Data Mining and Knowledge Discovery 7(4),
349–371 (2003)
17. Kietz, J., Serban, F., Bernstein, A., Fischer, S.: Towards cooperative planning of
data mining workflows. Proceedings of the Third Generation Data Mining Workshop at the 2009 European Conference on Machine Learning (ECML 2009) pp.
1–12 (2009)
18. Klusch, M., Gerber, A., Schmidt, M.: Semantic web service composition planning
with owls-xplan. Proceedings of the First International AAAI Fall Symposium on
Agents and the Semantic Web (2005)
19. Kuehl, R.: Design of experiments: statistical principles of research design and analysis. Duxbury Press (1999)
20. Nielsen, M.: The future of science: Building a better collective memory. APS
Physics 17(10) (2008)
21. Noy, N., McGuinness, D.: Ontology development 101: A guide to creating your first
ontology. Stanford University (2002)
22. Panov, P., Soldatova, L., Dzeroski, S.: Towards an ontology of data mining investigations. Lecture Notes in Artificial Intelligence 5808, 257–271 (2009)
23. Perlich, C., Provost, F., Simonoff, J.: Tree induction vs. logistic regression: A
learning-curve analysis. Journal of Machine Learning Research 4, 211–255 (2003)
24. Pfahringer, B., Bensusan, H., Giraud-Carrier, C.: Meta-learning by landmarking
various learning algorithms. Proceedings of the Seventeenth International Conference on Machine Learning pp. 743–750 (2000)
25. Sirin, E., Parsia, B.: SPARQL-DL: SPARQL query for OWL-DL. Third International Workshop on OWL Experiences and Directions (OWLED 2007) (2007)
26. Soldatova, L., King, R.: An ontology of scientific experiments. Journal of the Royal
Society Interface 3(11), 795–803 (2006)
27. Szalay, A., Gray, J.: The world-wide telescope. Science 293, 2037–2040 (2001)
28. Vanschoren, J., Blockeel, H., Pfahringer, B., Holmes, G.: Experiment databases:
Creating a new platform for meta-learning research. Proceedings of the
ICML/UAI/COLT Joint Planning to Learn Workshop (PlanLearn08) pp. 10–15
(2008)
29. Vanschoren, J., Pfahringer, B., Holmes, G.: Learning from the past with experiment
databases. Lecture Notes in Artificial Intelligence 5351, 485–492 (2008)
30. Vanschoren, J., Soldatova, L.: Collaborative meta-learning. Proceedings of the
Third Planning to Learn Workshop at the 19th European Conference on Artificial
Intelligence (2010)
31. Záková, M., Kremen, P., Zelezný, F., Lavrac, N.: Planning to learn with a
knowledge discovery ontology. Second planning to learn workshop at the joint
ICML/COLT/UAI Conference pp. 29–34 (2008)
32. Záková, M., Podpecan, V., Zelezný, F., Lavrac, N.: Advancing data mining workflow construction: A framework and cases using the Orange toolkit. Proceedings of
the SoKD-09 International Workshop on Third Generation Data Mining at ECML
PKDD 2009 pp. 39–51 (2009)
Foundations of frequent concept mining with
formal ontologies
Agnieszka Lawrynowicz1
Institute of Computing Science, Poznan University of Technology, ul. Piotrowo 2,
60-965 Poznan, Poland
[email protected]
1 Introduction
With the increased availability of information published using standard Semantic Web languages, new approaches are needed to mine this growing resource of data. Since Semantic Web data is relational in nature, there has recently been a growing number of proposals adapting methods of Inductive Logic Programming (ILP) [1] to the Semantic Web knowledge representation formalisms, most notably the Web ontology language OWL1 (grounded in description logics (DLs) [2]).
One of the fundamental data mining tasks is the discovery of frequent patterns. Within the setting of ILP, frequent pattern mining was initially investigated for Datalog, in systems such as WARMR [3], FARMER [4] or c-armr [5]. More recent proposals have extended the scope of relational frequent pattern mining to description logics or hybrid languages (combining Datalog with DL, or DL with some form of rules); examples are the system SPADA [6] and the approaches proposed in [7] and [8]. However, none of the current approaches that use DLs to mine frequent patterns exploit a peculiarity of the DL formalism, namely its variable-free notation, in representing patterns.
This paper aims to fill this gap. The main contributions of the paper are summarized as follows: (a) a novel setting for the task of frequent pattern mining is introduced, coined frequent concept mining, in which patterns are (complex) concepts expressed in description logics (corresponding to OWL classes); (b) basic building blocks for this new setting are provided, such as a generality measure and a refinement operator.
2 Preliminaries
2.1 Representation and Inference
Description logics [2] are a family of knowledge representation languages (equipped
with a model-theoretic semantics and reasoning services) that have been adopted
as the theoretical foundation for the OWL language. The basic elements of DLs are atomic concepts (denoted by A) and atomic roles (denoted by R, S). Complex descriptions (denoted by C and D) are inductively built from them by using the concept and role constructors listed in Tab. 1.

1 http://www.w3.org/TR/owl-features
Table 1: Syntax and semantics of example DL constructors.

Constructor                        Syntax           Semantics
Universal concept                  ⊤                Δ^I
Bottom concept                     ⊥                ∅
Negation of arbitrary concepts     (¬C)             Δ^I \ C^I
Intersection                       (C ⊓ D)          C^I ∩ D^I
Union                              (C ⊔ D)          C^I ∪ D^I
Value restriction                  (∀R.C)           {a ∈ Δ^I | ∀b. (a, b) ∈ R^I → b ∈ C^I}
Full existential quantification    (∃R.C)           {a ∈ Δ^I | ∃b. (a, b) ∈ R^I ∧ b ∈ C^I}
Datatype exists                    (∃T.u)           {a ∈ Δ^I | ∃t. (a, t) ∈ T^I ∧ t ∈ u^D}
Nominals                           {a1, ..., an}    {a1^I, ..., an^I}
Semantics is defined by interpretations I = (Δ^I, ·^I), where the non-empty set Δ^I is the domain of the interpretation and ·^I is an interpretation function which assigns to every atomic concept A a set A^I ⊆ Δ^I, and to every atomic role R a binary relation R^I ⊆ Δ^I × Δ^I. The interpretation function is extended to complex concept descriptions by the inductive definition presented in Tab. 1.
A DL knowledge base, KB, is formally defined as KB = (T, A), where T is called a TBox and contains axioms describing how concepts and roles are related to each other, and A is called an ABox and contains assertions about individuals, such as C(a) (the individual a is an instance of the concept C) and R(a, b) (a is R-related to b). Moreover, DLs may also support reasoning with concrete datatypes such as strings or integers. A concrete domain D consists of a set Δ^D, the domain of D, and a set pred(D), the predicate names of D. Each predicate name P is associated with an arity n and an n-ary predicate P^D ⊆ (Δ^D)^n. The abstract domain Δ^I and the concrete domain Δ^D are disjoint. A concrete role T is interpreted as a binary relation T^I ⊆ Δ^I × Δ^D.
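To make this model-theoretic semantics tangible, the sketch below evaluates some of the constructors of Tab. 1 over a small, explicitly given finite interpretation; it is a toy evaluator over one fixed interpretation (the concept and role names are invented), not a DL reasoner, and retrieval as used later in the paper is defined with respect to entailment rather than a single interpretation.

```python
# Toy evaluator for the constructors of Tab. 1 over one fixed finite
# interpretation I = (domain, concept extensions, role extensions).
# Names and extensions are invented for illustration; this is not a reasoner.
domain = {"a", "b", "c"}
concept_ext = {"Operator": {"a", "b"}, "Algorithm": {"c"}}
role_ext = {"implements": {("a", "c")}}

def NOT(C):
    return domain - C                                   # (¬C)^I = Δ^I \ C^I

def AND(C, D):
    return C & D                                        # (C ⊓ D)^I = C^I ∩ D^I

def OR(C, D):
    return C | D                                        # (C ⊔ D)^I = C^I ∪ D^I

def EXISTS(R, C):                                       # (∃R.C)^I
    return {x for x in domain if any((x, y) in R and y in C for y in domain)}

def FORALL(R, C):                                       # (∀R.C)^I
    return {x for x in domain if all(y in C for y in domain if (x, y) in R)}

# Extension of the complex concept  Operator ⊓ ∃implements.Algorithm  under I:
print(AND(concept_ext["Operator"], EXISTS(role_ext["implements"], concept_ext["Algorithm"])))
# -> {'a'}
```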
Example 1 provides a sample DL knowledge base that represents a part of the data mining domain and is intended to be used in meta-mining, e.g. for algorithm selection (the example is based on the ontology for Data Mining Optimization (DMOP) [9]).
Example 1 (Description logic KB).
T = { RecursivePartitioningAlgorithm ⊑ ClassificationAlgorithm, C4.5-Algorithm ⊑ RecursivePartitioningAlgorithm,
BayesianAlgorithm ⊑ ClassificationAlgorithm, NaiveBayesAlgorithm ⊑ BayesianAlgorithm,
NaiveBayesNormalAlgorithm ⊑ NaiveBayesAlgorithm, OperatorExecution ⊑ ∃executes.Operator,
Operator ⊑ ∃implements.Algorithm, ⊤ ⊑ ∀hasInput⁻.OperatorExecution,
⊤ ⊑ ∀hasInput.(Data ⊔ Model), DataSet ⊑ Data }.
A = { OperatorExecution(Weka NaiveBayes–OpEx01), Operator(Weka NaiveBayes),
executes(Weka NaiveBayes–OpEx01, Weka NaiveBayes),
implements(Weka NaiveBayes, NaiveBayesNormal),
NaiveBayesNormalAlgorithm(NaiveBayesNormal),
hasInput(Weka NaiveBayes–OpEx01,Iris–DataSet),DataSet(Iris–DataSet),
OperatorExecution(Weka NaiveBayes–OpEx02),
executes(Weka NaiveBayes–OpEx02, Weka NaiveBayes),
hasParameterSetting(Weka NaiveBayes–OpEx02, Weka–NaiveBayes–OpEx02–D),
OpParameterSetting(Weka–NaiveBayes–OpEx02–D),
setsValueOf(Weka–NaiveBayes–OpEx02–D,Weka NaiveBayes–D),
hasValue(Weka–NaiveBayes–OpEx02–D,false),
hasParameterSetting(Weka NaiveBayes–OpEx02, Weka–NaiveBayes–OpEx02–K),
OpParameterSetting(Weka–NaiveBayes–OpEx02–K),
setsValueOf(Weka–NaiveBayes–OpEx02–K,Weka NaiveBayes–K),
hasValue(Weka–NaiveBayes–OpEx02–K,false),
OperatorExecution(Weka–J48–OpEx01),Operator(Weka J48),
executes(Weka–J48–OpEx01, Weka J48),
implements(Weka J48, C4.5), C4.5-Algorithm(C4.5) }.

The inference services referred to later in the paper are subsumption and retrieval. Given two concept descriptions C and D in a TBox T, C subsumes D (denoted by D ⊑ C) if and only if D^I ⊆ C^I holds for every interpretation I of T. C is equivalent to D (denoted by C ≡ D) if both C ⊑ D and D ⊑ C hold. The retrieval problem is, given an ABox A and a concept C, to find all individuals a such that A ⊨ C(a).
2.2 Refinement operators for DL
Learning in DLs can be seen as a search in the space of concepts. In ILP it is
common to impose an ordering on this search space, and apply refinement operators to traverse it [1]. Downward refinement operators construct specialisations
of hypotheses (concepts, in this context). Let (S, ⪯) be a quasi-ordered space. Then a downward refinement operator ρ is a mapping from S to 2^S such that, for any C ∈ S, C′ ∈ ρ(C) implies C′ ⪯ C. C′ is called a specialisation of C. For searching the space of DL concepts, a natural quasi-order is subsumption: if C subsumes D (D ⊑ C), then C covers all instances that are covered by D. In this work, subsumption is used as the generality measure between concepts.
Further details concerning refinement operators proposed for description logics
may be found in [10–13].
3 Frequent concept mining
In this section, the task of frequent concept mining is formally introduced.
3.1 The task
The definition of the task of frequent pattern discovery requires a specification
of what is counted to calculate the pattern support. In the setting proposed
in this paper, the support of a concept C is calculated relative to the number of instances of a user-specified reference concept Ĉ, from which the search procedure starts (and which is being specialized).

Definition 1 (Support). Let C be a concept expressed using predicates from a DL knowledge base KB = (T, A), let memberset(C, KB) be a function that returns the set of all individuals a such that A ⊨ C(a), and let Ĉ denote a reference concept, where C ⊑ Ĉ.
The support of a pattern C with respect to the knowledge base KB is defined as the ratio between the number of instances of the concept C and the number of instances of the reference concept Ĉ in KB:

support(C, KB) = |memberset(C, KB)| / |memberset(Ĉ, KB)|
Having defined the support, it is now possible to formulate the definition of frequent concept discovery.
Definition 2 (Frequent concept discovery).
Given
– a knowledge base KB represented in description logic,
– a set of patterns in the form of a concept C, where each C is subsumed by a reference concept Ĉ (C ⊑ Ĉ),
– a minimum support threshold minsup specified by the user,
and assuming that patterns with support s are frequent in KB if s ≥ minsup, the task of frequent pattern discovery is to find the set of frequent patterns.

Example 2. Let us consider the knowledge base KB from Example 1, and let us assume that Ĉ = OperatorExecution (in general, Ĉ can also be a complex concept, not necessarily a primitive one). There are 3 instances of Ĉ in KB. The following example patterns, refinements of OperatorExecution, could be generated:
C1 = OperatorExecution ⊓ ∃executes.Operator
C2 = OperatorExecution ⊓ ∃executes.(Operator ⊓ ∃implements.ClassificationAlgorithm)
C3 = OperatorExecution ⊓ ∃executes.(Operator ⊓ ∃implements.RecursivePartitioningAlgorithm)
C4 = OperatorExecution ⊓ ∃executes.{Weka NaiveBayes}
C5 = OperatorExecution ⊓ ∃hasParameterSetting.(OpParameterSetting ⊓ ∃setsValueOf.{Weka NaiveBayes–K} ⊓ ∃hasValue.false)
C6 = OperatorExecution ⊓ ∃hasInput.Data
The support values of the above patterns are as follows: s(C1) = 3/3, s(C2) = 3/3, s(C3) = 1/3, s(C4) = 2/3, s(C5) = 1/3, s(C6) = 1/3.
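The support values of Example 2 can be reproduced with the naive sketch below, which encodes the asserted ABox facts and the atomic subclass hierarchy of Example 1 (individual names abbreviated) and checks instances under a closed-world reading; a faithful implementation of Definition 1 would instead obtain memberset(C, KB) from a DL reasoner.

```python
# Naive closed-world sketch of support (Definition 1) for Example 2.
# Only asserted facts and the atomic subclass hierarchy of Example 1 are used
# (individual names abbreviated); a real system would query a DL reasoner.
SUB = {"C4.5-Algorithm": "RecursivePartitioningAlgorithm",
       "RecursivePartitioningAlgorithm": "ClassificationAlgorithm",
       "NaiveBayesNormalAlgorithm": "NaiveBayesAlgorithm",
       "NaiveBayesAlgorithm": "BayesianAlgorithm",
       "BayesianAlgorithm": "ClassificationAlgorithm",
       "DataSet": "Data"}
TYPES = {"OpEx01": {"OperatorExecution"}, "OpEx02": {"OperatorExecution"},
         "J48-OpEx01": {"OperatorExecution"},
         "Weka_NaiveBayes": {"Operator"}, "Weka_J48": {"Operator"},
         "NaiveBayesNormal": {"NaiveBayesNormalAlgorithm"}, "C4.5": {"C4.5-Algorithm"},
         "Iris": {"DataSet"},
         "OpEx02-D": {"OpParameterSetting"}, "OpEx02-K": {"OpParameterSetting"}}
ROLES = {"executes": {("OpEx01", "Weka_NaiveBayes"), ("OpEx02", "Weka_NaiveBayes"),
                      ("J48-OpEx01", "Weka_J48")},
         "implements": {("Weka_NaiveBayes", "NaiveBayesNormal"), ("Weka_J48", "C4.5")},
         "hasInput": {("OpEx01", "Iris")},
         "hasParameterSetting": {("OpEx02", "OpEx02-D"), ("OpEx02", "OpEx02-K")},
         "setsValueOf": {("OpEx02-D", "NaiveBayes-D"), ("OpEx02-K", "NaiveBayes-K")},
         "hasValue": {("OpEx02-D", "false"), ("OpEx02-K", "false")}}

def superconcepts(c):
    while c in SUB:
        c = SUB[c]
        yield c

def memberset(p):
    """Patterns: atomic name, ("and", p1, ...), ("some", role, p), ("one_of", a)."""
    if isinstance(p, str):
        return {i for i, cs in TYPES.items()
                if any(p == c or p in superconcepts(c) for c in cs)}
    if p[0] == "and":
        return set.intersection(*(memberset(q) for q in p[1:]))
    if p[0] == "some":
        fillers = memberset(p[2])
        return {a for a, b in ROLES[p[1]] if b in fillers}
    if p[0] == "one_of":
        return set(p[1:])

REF = "OperatorExecution"
patterns = {
    "C1": ("and", REF, ("some", "executes", "Operator")),
    "C2": ("and", REF, ("some", "executes",
           ("and", "Operator", ("some", "implements", "ClassificationAlgorithm")))),
    "C3": ("and", REF, ("some", "executes",
           ("and", "Operator", ("some", "implements", "RecursivePartitioningAlgorithm")))),
    "C4": ("and", REF, ("some", "executes", ("one_of", "Weka_NaiveBayes"))),
    "C5": ("and", REF, ("some", "hasParameterSetting",
           ("and", "OpParameterSetting",
            ("some", "setsValueOf", ("one_of", "NaiveBayes-K")),
            ("some", "hasValue", ("one_of", "false"))))),
    "C6": ("and", REF, ("some", "hasInput", "Data")),
}
n_ref = len(memberset(REF))
for name, p in patterns.items():
    print(name, f"{len(memberset(p))}/{n_ref}")   # 3/3, 3/3, 1/3, 2/3, 1/3, 1/3
```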
3.2 Refinement operator
Depending on the language used, the number of specializations of a concept (ordered by subsumption) may be infinite. There is also a trade-off between the completeness of a refinement operator and its efficiency. Below, a refinement operator is introduced (inspired by [12]) that allows the concepts listed in Example 2 to be generated and, as such, exploits the features of the DMOP ontology (which provides an intended use case for the presented approach).
Definition 3 (Downward refinement operator ρ). ρ = (ρ⊔, ρ⊓), where:
[ρ⊔] given a description in normal form D = D1 ⊔ ... ⊔ Dn:
(a) D′ ∈ ρ⊔(D) if D′ = ⊔_{1≤j≤n, j≠i} Dj for some i ∈ {1, ..., n},
(b) D′ ∈ ρ⊔(D) if D′ = D′i ⊔ ⊔_{1≤j≤n, j≠i} Dj for some D′i ∈ ρ⊓(Di);
[ρ⊓] given a conjunctive description C = C1 ⊓ ... ⊓ Cm:
(a) C′ ∈ ρ⊓(C) if C′ = C ⊓ Cm+1, where Cm+1 is a primitive concept and KB ⊨ Cm+1 ⊑ C,
(b) C′ ∈ ρ⊓(C) if C′ = C ⊓ Cm+1, where Cm+1 = ∃R.Dm+1,
(c) C′ ∈ ρ⊓(C) if C′ = C ⊓ Cm+1, where Cm+1 = ∃T.um+1,
(d) C′ ∈ ρ⊓(C) if C′ = C ⊓ Cm+1, where Cm+1 = {a} and KB ⊨ C(a),
(e) C′ ∈ ρ⊓(C) if C′ = (C ⊔ ¬Cj) ⊓ C′j for some j ∈ {1, ..., m}, where Cj = ∃R.Dj, C′j = ∃R′.Dj and R′ ⊑ R,
(f) C′ ∈ ρ⊓(C) if C′ = (C ⊔ ¬Cj) ⊓ C′j for some j ∈ {1, ..., m}, where Cj = ∃R.Dj, C′j = ∃R.D′j and D′j ∈ ρ⊔(Dj).

ρ⊔ either (a) drops one top-level disjunct or (b) replaces it with a downward refinement obtained with ρ⊓. ρ⊓ adds a new conjunct in the form of (a) an atomic description that is a subconcept of the refined concept, (b) an existential restriction involving an abstract role, (c) an existential restriction involving a concrete role, or (d) a nominal that is an instance of the refined concept; alternatively, it (e) replaces one conjunct with a refinement obtained by replacing the role in an existential restriction by one of its subroles, or (f) replaces one conjunct with a refinement obtained by specializing the concept in the range of an existential restriction using ρ⊔.
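A fragment of ρ⊓ can be sketched on top of the pattern representation and helper structures (SUB, ROLES, memberset) from the previous listing; only rules (a), (b) and (d) are covered, normal forms, concrete roles, subroles and the disjunctive part ρ⊔ are ignored, and subsumption is approximated by the asserted atomic hierarchy, so this is an illustration rather than a complete operator.

```python
# Illustrative fragment of ρ⊓ (rules (a), (b) and (d) only), reusing SUB, ROLES
# and memberset from the previous sketch. "Top" stands for the top concept ⊤
# and is merely a placeholder filler to be specialized by later refinement steps.
def conjuncts(p):
    return list(p[1:]) if isinstance(p, tuple) and p[0] == "and" else [p]

def refine_conjunctive(p):
    atoms = [c for c in conjuncts(p) if isinstance(c, str)]
    # (a) add a primitive concept that is asserted to be a subconcept of a conjunct
    for sub, sup in SUB.items():
        if sup in atoms and sub not in atoms:
            yield ("and", *conjuncts(p), sub)
    # (b) add an existential restriction ∃R.⊤ for each abstract role
    for role in ROLES:
        yield ("and", *conjuncts(p), ("some", role, "Top"))
    # (d) add a nominal {a} for a known instance of the refined pattern
    for a in sorted(memberset(p)):
        yield ("and", *conjuncts(p), ("one_of", a))

for refinement in refine_conjunctive("OperatorExecution"):
    print(refinement)
```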
The open world assumption (OWA) made in DL reasoning behaves differently from the usually applied closed world assumption (CWA). For this reason, the proposed operator does not specialize concepts through the ∀ quantifier: due to the OWA, even if every instance in the KB of interest possessed a certain property, this could not be deduced by a reasoner, which always assumes incomplete knowledge and the possible existence of a counterexample. This could be addressed, for example, by introducing an epistemic operator [2], but such a refinement rule could be costly.
The use of an expressive pattern language, together with the OWA (which places fewer constraints on the generated patterns), may result in a large pattern search space. Thus, further steps are necessary to prune the space explored by the operator. This is usually done in ILP by introducing a declarative bias (restrictions on the depth, width or language of patterns). A common problem in performing data mining with DLs is the frequent lack of disjointness axioms, resulting, e.g., in a huge number of concepts being tested as fillers of a given role. Hence, in addition to concept depth and width restrictions, a declarative bias should make it possible to restrict the language of patterns beyond the constraints imposed by the DL axioms (e.g. to restrict the list of admissible fillers of a particular role).
4 Conclusions and Future Work
To the best of our knowledge, this is the first proposal for mining frequent patterns expressed as concepts represented in description logics. The paper lays the foundations for this task and proposes first steps towards a solution.
Future work will investigate a suitable declarative bias for the proposed setting and will devise an efficient algorithm, most likely employing parallelization. The primary motivation of this work is the future application of the proposed frequent concept mining in real-life scenarios, e.g. for ontology-based meta-learning.
References
1. Nienhuys-Cheng, S., de Wolf, R.: Foundations of Inductive Logic Programming.
Volume 1228 of LNAI. Springer (1997)
2. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P., eds.:
The Description Logic Handbook. Cambridge University Press (2003)
3. Dehaspe, L., Toivonen, H.: Discovery of frequent Datalog patterns. Data Mining
and Knowledge Discovery 3(1) (1999) 7–36
4. Nijssen, S., Kok, J.: Faster association rules for multiple relations. In: Proc. of the
17th Int. Joint Conference on Artificial Intelligence (IJCAI’2001). (2001) 891–897
5. de Raedt, L., Ramon, J.: Condensed representations for inductive logic programming. In: Proc. of the Ninth International Conference on Principles of Knowledge
Representation and Reasoning (KR 2004). (2004) 438–446
6. Lisi, F., Malerba, D.: Inducing multi-level association rules from multiple relations.
Machine Learning Journal 55(2) (2004) 175–210
7. Záková, M., Zelezný, F., Garcia-Sedano, J.A., Tissot, C.M., Lavrac, N., Kremen,
P., Molina, J.: Relational data mining applied to virtual engineering of product
designs. In Muggleton, S., Otero, R.P., Tamaddoni-Nezhad, A., eds.: ILP. Volume
4455 of Lecture Notes in Computer Science., Springer (2006) 439–453
8. Józefowska, J., Lawrynowicz, A., Lukaszewski, T.: The role of semantics in mining
frequent patterns from knowledge bases in description logics with rules. Theory
and Practice of Logic Programming 10(3) (2010) 251–289
9. Hilario, M., Kalousis, A., Nguyen, P., Woznica, A.: A Data Mining Ontology for
algorithm selection and meta-learning. In: Proc of the ECML/PKDD’09 Workshop
on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery
(SoKD-09). (2009) 76–87
10. Kietz, J.U., Morik, K.: A polynomial approach to the constructive induction of
structural knowledge. Machine Learning 14(2) (1994) 193–218
11. Iannone, L., Palmisano, I., Fanizzi, N.: An algorithm based on counterfactuals for
concept learning in the Semantic Web. Appl. Intell. 26(2) (2007) 139–159
12. Fanizzi, N., d’Amato, C., Esposito, F.: DL-Foil: Concept learning in Description
Logics. In Zelezný, F., Lavrač, N., eds.: Proceedings of the 18th International
Conference on Inductive Logic Programming, ILP2008. Volume 5194 of LNAI.
Springer, Prague, Czech Rep. (2008) 107–121
13. Lehmann, J.: DL-learner: Learning concepts in description logics. Journal of
Machine Learning Research (JMLR) 10 (2009) 2639–2642
Workflow-based Information Retrieval to Model
Plant Defence Response to Pathogen Attacks
Dragana Miljković1, Claudiu Mihăilă3, Vid Podpečan1, Miha Grčar1, Kristina Gruden4, Tjaša Stare4, Nada Lavrač1,2
1 Jožef Stefan Institute, Ljubljana, Slovenia
2 University of Nova Gorica, Nova Gorica, Slovenia
3 Faculty of Computer Science, Al. I. Cuza University of Iași, Iași, Romania
4 Department of Biotechnology and Systems Biology, National Institute of Biology, Ljubljana, Slovenia
Abstract. The paper proposes a workflow-based approach to support the modelling of plant defence response to pathogen attacks. Currently, such models are built manually by merging expert knowledge, experimental results, and literature search. To this end, we have developed a methodology which supports the expert in the process of creation, curation, and evaluation of biological models by combining publicly available databases, natural language processing tools, and hand-crafted knowledge. The proposed solution has been implemented in the service-oriented workflow environment Orange4WS, and evaluated using a manually developed Petri Net plant defence response model.
1 Introduction
Bioinformatics workflow management systems have been a subject of numerous research efforts in recent years. For example, Wikipedia lists 19 systems5 which are capable of executing some form of scientific workflows. Such systems offer numerous advantages in comparison with monolithic and problem specific solutions. First, repeatability of experiments is easy since the procedure (i.e. the corresponding workflow) and parameters can be saved and reused. Second, if the tool is capable of using web services, this ensures a certain level of distributed computation and makes the system more reliable6 and independent. Third, as abstract representations of complex computational procedures, workflows are easy to understand and execute, even for non-experts. Finally, such systems typically offer easy access (e.g. by using web services) to large public databases such as PubMed, WOS, BioMart [5], EMBL-EBI data resources7 etc.
The topic of this paper is defence response in plants to virus attacks, which
has been investigated for a considerable time. However, individual research groups
5 http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems
6 The reliability of web service-based solutions is debatable but provided that there is a certain level of redundancy such distributed systems are more reliable than single source solutions [3].
7 http://www.ebi.ac.uk/Tools/webservices/
usually focus their experimental work on a subset of the entire defence system,
while a model of a global defence response mechanism in plants is still to be
developed.
The motivation of biology experts to develop a more comprehensive model
of the entire defence response is twofold. Firstly, it will provide a better understanding of the complex defence response mechanism in plants which means
highlighting connections in the network and understanding how the connections
operate. Secondly, prediction of experimental results through simulation will save
time and indicate further research directions to biology experts. The development of a more comprehensive model of plant defence response for simulation
purposes addresses three general research questions:
– what is the most appropriate formalism for representing the plant defence model,
– how to extract network structure; more precisely, how to retrieve relevant compounds and relations between them,
– how to determine network parameters such as initial compound values, speeds of the reactions, threshold values, etc.
Having studied different representation formalisms, we have decided to represent the model of the given biological network in the form of a graph. This paper addresses the second research question, i.e. the automated extraction of the graph structure through information retrieval and natural language processing techniques, with an emphasis on the implementation in a service-oriented workflow environment. We propose a workflow-based approach to support the modelling of plant defence response to pathogen attacks, and present an implementation of the proposed workflow in the service-oriented environment Orange4WS. The implementation combines open source natural language processing tools, data from publicly available databases, and hand-crafted knowledge. The evaluation of the approach is carried out using a manually crafted Petri net model which was developed by fusing expert knowledge and manual literature mining.
The structure of the paper is as follows. Section 2 presents existing approaches to modelling plant defence response and discusses their advantages and shortcomings. Section 3 introduces our manually crafted Petri net model and proposes a workflow-based solution to assist the creation, curation, and evaluation of such models. Section 4 presents and evaluates the results of our work. Section 5 concludes the paper and proposes directions for further work.
2 Related work
Due to the complexity of the plant defence response mechanism, the challenge of building a general model for simulation purposes has not yet been fully addressed. Early attempts to accomplish numerical simulation by means of a Boolean formalism from experimental microarray data [4] have already indicated the complexity of defence response mechanisms and highlighted many crosstalk connections. Furthermore, many components mediating the beginning of the signalling pathway and the final response are missing. As biology experts are now interested in what the bottlenecks in this response could be, such intermediate components are of particular interest.
Other existing approaches, such as the MoVisPP tool [6], attempt to automatically retrieve information from databases and transfer the pathways into
the Petri Net formalism. MoVisPP is an online tool which automatically produces Petri Net models from KEGG and BRENDA pathways. However, not all
pathways are accessible, and the signalling pathways for plant defence response
do not exist in databases.
Tools for data extraction and graphical representation are also related to
our work as they are used to help experts to understand underlying biological
principles. They can be roughly grouped according to their information sources:
databases (Biomine [15], Cytoscape [16], ProteoLens [8], VisAnt [7], PATIKA
[2]), databases and experimental data (ONDEX [9], BiologicalNetworks [1]), and
literature (TexFlame [12]). More general approaches to the visualization of arbitrary textual data through triplets, such as [14], are also relevant. However, such general systems have to be adapted in order to produce domain-specific models.
3 Approaches to modelling plant defence response
This section presents our manually crafted Petri net model, built using the Cell Illustrator software [11]. We briefly describe the development cycle of the model and show some simulation results. The main part of the section discusses our workflow-based approach to assist the creation and curation of such biological models.
3.1 A Petri Net model of plant defence response
A Petri Net is a bipartite graph with two types of nodes: places and transitions.
Standard Petri Net models are discrete and non-temporal, but their various
extensions can represent both qualitative and quantitative models. The Cell Illustrator software implements the Hybrid Functional Petri Net extension, which was used in our study. In the Hybrid Functional Petri Net formalism, the speed of a transition depends on the amounts of its input components, and both discrete and continuous places exist.
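A toy continuous Petri net in this spirit can be simulated with a few lines of Euler integration, with each transition's speed computed from the current amounts in its input places; the places, arc weights and rate constants below are invented for illustration and bear no relation to the actual Cell Illustrator model described in this section.

```python
# Toy continuous Petri net with transition speeds that depend on the current
# amounts in the input places (simple Euler integration). All places,
# transitions and rate constants are invented for illustration only.
marking = {"SA_chloroplast": 1.0, "SA_cytoplasm": 0.1, "PR1": 0.0}

# Each transition: (input places -> arc weight, output places -> arc weight, speed function)
transitions = [
    # transport of SA from the chloroplast into the cytoplasm
    ({"SA_chloroplast": 1}, {"SA_cytoplasm": 1}, lambda m: 0.5 * m["SA_chloroplast"]),
    # positive feedback: cytoplasmic SA promotes further SA accumulation
    ({}, {"SA_cytoplasm": 1}, lambda m: 0.8 * m["SA_cytoplasm"]),
    # negative feedback: accumulated PR1 promotes SA degradation
    ({"SA_cytoplasm": 1}, {}, lambda m: 0.3 * m["SA_cytoplasm"] * m["PR1"]),
    # PR1 expression induced by SA
    ({}, {"PR1": 1}, lambda m: 0.2 * m["SA_cytoplasm"]),
]

dt, steps = 0.01, 2000
for _ in range(steps):
    speeds = [speed(marking) for _, _, speed in transitions]
    for (inputs, outputs, _), v in zip(transitions, speeds):
        for place, w in inputs.items():
            marking[place] -= dt * v * w
        for place, w in outputs.items():
            marking[place] += dt * v * w
print({place: round(amount, 3) for place, amount in marking.items()})
```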
Our manually crafted Petri Net model of plant defence response currently contains 52 substances and 41 reactions which, according to the Petri Net formalism, correspond to places and transitions, respectively. The model of the salicylic acid biosynthesis and signalling pathway, which is one of the key components in plant defence response, is shown in Figure 1.
Fig. 1. A Petri Net model of salicylic acid biosynthesis and signaling pathway in plants. Relations in the negative and positive feedback loops are colored red and green, respectively.

Early results of the simulation already show the effects of the positive and negative feedback loops in the salicylic acid (SA) pathway, as shown in Figure 2. The red line represents the level of SA in the chloroplast, which is outside the positive feedback loop; the blue line represents the same component in the cytoplasm, which is inside the positive feedback loop. The peak of the blue line depicts the effect of the positive feedback loop, which rapidly increases the amount of SA. After reaching the peak, the trend of the blue line is negative, as the effect of the negative feedback loop prevails.

Fig. 2. Simulation results of the Petri Net model of the salicylic acid pathway. The red line represents the level of SA in the chloroplast, outside the positive feedback loop; the blue line represents the same component in the cytoplasm, inside the positive feedback loop.
The represented Petri Net model consists of two types of biological pathways: a metabolic part and a signalling part. The metabolic part is a cascade of reactions with small compounds as reactants, and it was manually obtained from the KEGG database. The signalling part is not available in databases and had to be obtained from the literature. The biology experts manually extracted the relevant information related to this pathway within a period of approximately two months. Bearing in mind that the salicylic acid pathway is only one out of three pathways involved in plant defence response, it is clear that a considerable amount of time would have to be invested if only a manual approach were employed.
3.2 Computer-assisted development of plant defence response models
The process of fusing expert knowledge and manually obtained information from the literature as presented in the previous section turns out to be time-consuming and non-systematic. Therefore, it is necessary to employ more automated methods of extracting relevant information.
Our proposed solution is based on a service-oriented approach using scientific workflows. Web services offer a platform-independent implementation of processing components, which makes our solution more general as it can be used in any service-oriented environment. Furthermore, by composing the developed web services into workflows, our approach offers reusability and repeatability, and can be easily extended with additional components.
Our implementation is based on Orange4WS, a service-oriented workflow environment which also offers tools for developing new services based on existing software libraries. For natural language processing we employed functions from the NLTK library [10], which were transformed into web services. Additionally, the GENIA tagger [17] for biological domains was used to perform part-of-speech tagging and shallow parsing. The data was extracted from PubMed and WOS using web-service enabled access.
A workflow diagram for computer-assisted creation of plant defence models from textual data is shown in Figure 3. It is composed of the following elements:
1. PubMed web service and WOS search to extract article data,
2. PDF-to-text converter service, which is based on Poppler8, an open source PDF rendering library,
3. NLP web services based on NLTK: tokenizer, shallow parser (chunker), sentence splitter,
4. the GENIA tagger,
5. filtering components, e.g. contradiction removal, synonymity resolver, etc.
The idea underlying this research was to extract sets in the triplet form

{Subject, Predicate, Object}

from biological texts which are freely available. The defence-response-related information is obtained by employing a vocabulary which we have manually developed for this specific field. Subject and Object are biological compounds such as proteins, genes or small compounds, and their names and synonyms are built into the vocabulary, whereas Predicate represents the relation or interaction between the compounds. We have defined four types of reactions, i.e. activation, inhibition, binding and degradation, and the synonyms for these reactions are also included in the vocabulary. An example of such a triplet is shown below:

{PAD4 protein, activates, EDS5 gene}

Such triplets, if automatically found in text and visualized in a graph, can assist the development and finalization of the plant defence response Petri Net model for simulation purposes. Triplet extraction is performed by employing simple rules: the last noun of the first noun phrase is taken as the Subject, the Predicate is part of a verb phrase located between the noun phrases, and the Object is then detected as part of the first noun phrase after the verb phrase. The triplets are further enhanced by linking them to the associated biological lexicon of synonyms, BioLexicon [13]. In addition to these rules, pattern matching against the dictionary is performed to search for more complicated phrases in the text and thus enhance the information extraction. The relevant information (a graph) is then visualized using a native Orange graph visualizer or a Biomine visualization component provided by Orange4WS. An example of such a graph is shown in Figure 4.
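A much simplified version of this extraction rule is sketched below using NLTK's regular-expression chunker on a single hand-tagged sentence; the chunk grammar, the example sentence and the four-entry relation vocabulary are placeholders, and the actual workflow additionally relies on the GENIA tagger, BioLexicon linking and dictionary-based pattern matching.

```python
# Much simplified sketch of the Subject-Predicate-Object extraction rule above,
# using NLTK's RegexpParser. The grammar, the hand-tagged sentence and the
# mini relation vocabulary are illustrative placeholders only.
from nltk import RegexpParser
from nltk.tree import Tree

chunker = RegexpParser(r"""
  NP: {<DT>?<JJ>*<NN.*>+}
  VP: {<VB.*>+}
""")

# Part-of-speech tagged sentence; in the real workflow the tags come from the
# NLTK/GENIA tagging services.
tagged = [("The", "DT"), ("PAD4", "NN"), ("protein", "NN"),
          ("activates", "VBZ"), ("the", "DT"), ("EDS5", "NN"), ("gene", "NN")]

relation_vocabulary = {"activates": "activation", "inhibits": "inhibition",
                       "binds": "binding", "degrades": "degradation"}

chunks = [c for c in chunker.parse(tagged) if isinstance(c, Tree)]
triplet = None
for i, chunk in enumerate(chunks):
    if chunk.label() != "VP" or not (0 < i < len(chunks) - 1):
        continue
    verb = chunk[0][0]
    if verb in relation_vocabulary:
        subject = " ".join(w for w, t in chunks[i - 1] if t != "DT")
        obj = " ".join(w for w, t in chunks[i + 1] if t != "DT")
        triplet = (subject, relation_vocabulary[verb], obj)
print(triplet)   # ('PAD4 protein', 'activation', 'EDS5 gene')
```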
8 http://poppler.freedesktop.org/
Fig. 3. Workflow schema which enables information retrieval from public databases to support modelling of plant defence response.
While such automatically extracted knowledge currently cannot compete with the manually crafted Petri net model in terms of detail and correctness, it can be used to assist the expert in building and curating the model. It can also provide novel relevant information not known to the expert. Provided that wet-lab experimental data are available, some parts of the automatically built models could also be evaluated automatically. This, however, is currently outside the scope of the research presented here.
4 Results: An illustrative example
Consultation with biological experts resulted in a first round of experiments performed on a set of the ten most relevant articles from the field published after 2005. Figure 4 shows the extracted triplets, visualized using the Biomine visualizer, which is available as a widget in the Orange4WS environment.
Salicylic acid (SA) appears to be the central component in the graph, which confirms the biological fact that salicylic acid is indeed one of the three main components in plant defence response. The information contained in the graph of Figure 4 is similar to the initial knowledge obtained by biology experts through manual information retrieval from the literature9. Such a graph, however, cannot capture the cascade network structure which is closer to reality (and to the manually crafted Petri Net model). The first feedback from the biology experts is positive.
9 It is worth noting that before the start of the joint collaboration between the computer scientists and the biology experts, the biology experts had already tried to manually extract knowledge from scientific articles in the form of a graph, and succeeded in building a simple graph representation of the SA biosynthesis and signalling pathway.
Fig. 4. A set of extracted triplets, visualized using the Biomine graph visualizer.
Even though this approach cannot completely substitute human experts, biologists consider it a helpful tool for accelerating information retrieval from the literature. The presented results indicate the usefulness of the proposed approach, but also the need to further improve the quality of the information extraction.
5 Conclusion
In this paper we presented a methodology which supports the domain expert in the process of creation, curation, and evaluation of plant defence response models by combining publicly available databases, natural language processing tools, and hand-crafted knowledge. The methodology was implemented in a service-oriented workflow environment by constructing a reusable workflow, and evaluated using a hand-crafted Petri Net model. This Petri Net model has been developed by fusing expert knowledge, experimental results and literature reading; it serves as a baseline for the evaluation of automatically mined plant defence response knowledge, but it also enables computer simulation and prediction.
In further work we plan to continue the development and curation of the Petri Net model, and to implement additional filters and workflow components to improve the computer-assisted creation of plant defence response models. As the presented methodology is general, future work will also concentrate on the development of other biological models.
Finally, we are preparing a public release of our workflow-based implementation. This will provide us with much-needed feedback from experts, which will help us to improve the knowledge extraction process.
Acknowledgments
This work is partially supported by the AD Futura scholarship and the Slovenian
Research Agency grants P2-0103 and J4-2228. We are grateful to Lorand Dali
and Delia Rusu for constructive discussions and suggestions.
References
1. M. Baitaluk, M. Sedova, A. Ray, and A. Gupta. BiologicalNetworks: visualization
and analysis tool for systems biology. Nucl. Acids Res., 34(suppl 2):W466-471,
2006.
2. E. Demir, O. Babur, U. Dogrusoz, A. Gursoy, G. Nisanci, R. Cetin-Atalay and M.
Ozturk. PATIKA: An integrated visual environment for collaborative construction
and analysis of cellular pathways. Bioinformatics, 18(7):996-1003, 2002.
3. T. Erl. Service-Oriented Architecture: Concepts, Technology, and Design. Prentice Hall. 2006.
4. T. Genoud, M. B. Trevino Santa Cruz, and J.-P. Metraux. Numeric Simulation of
Plant Signaling Networks. Plant Physiology, August 1, 2001; 126(4): 1430 - 1437.
5. S. Haider, B. Ballester, D. Smedley, J. Zhang, P. Rice and A. Kasprzyk. BioMart Central Portal - unified access to biological data. Nucleic Acids Res. 2009 Jul 1;37
(Web Server issue):W23-7. Epub 2009 May 6.
6. S. Hariharaputran, R. Hofestädt, B. Kormeier, and S. Spangardt. Petri net models
for the semi-automatic construction of large scale biological networks. Springer
Science and Business. Natural Computing, 2009.
7. Z. Hu, J. Mellor, J. Wu, and C. DeLisi. VisANT: data-integrating visual framework
for biological networks and modules. Nucleic Acids Research, 33:W352-W357, 2005.
8. T. Huan, A.Y. Sivachenko, S.H. Harrison, J.Y. Chen. ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining. BMC
Bioinformatics 2008, 9(Suppl 9):S5.
9. J. Köhler, J. Baumbach, J. Taubert, M. Specht, A. Skusa, A. Rüegg, C. Rawlings,
P. Verrier and S. Philippi. Graph-based analysis and visualization of experimental
results with Ondex, Bioinformatics 22(11), 2006.
10. E. Loper and S. Bird. NLTK: The Natural Language Toolkit. Proceedings of the
ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pp 62-69, Philadelphia, Association for Computational Linguistics. July 2002.
11. Matsuno H, Fujita S, Doi A, Nagasaki M, Miyano S: Towards biopathway modeling
and simulation. Lecture Notes in Computer Science 2003, 2679:3-22.
12. N. Le Novère, M. Hucka, H. Mi, S. Moodie, F. Schreiber, A. Sorokin, E. Demir, K.
Wegner, M.I. Aladjem, S.M. Wimalaratne, F.T. Bergman, R. Gauges, P. Ghazal,
H. Kawaji, L. Li, Y. Matsuoka, A. Villéger, S.E. Boyd, L. Calzone, M. Courtot,
U. Dogrusoz, T.C. Freeman, A. Funahashi, S. Ghosh, A. Jouraku, S. Kim, F.
Kolpakov, A. Luna, S. Sahle, E. Schmidt, S. Watterson, G. Wu, I. Goryanin, D.B.
Kell, C. Sander, H. Sauro, J.L. Snoep, K. Kohn, H. Kitano. The Systems Biology
Graphical Notation. Nature Biotechnology, 2009 27(8):735-41.
13. D. Rebholz-Schuhmann, P. Pezik, V. Lee, R. del Gratta, J.J. Kim, Y. Sasaki,
J.McNaught, S. Montagni, M. Monachini, N. Calzolari, S. Ananiadou. BioLexicon:
Towards a reference terminological resource in the biomedical domain. Poster at
16th International Conference Intelligent Systems for Molecular Biology, 2008.
14. D. Rusu, B. Fortuna, D. Mladenić, M. Grobelnik, R. Sipoš. Document Visualization
Based on Semantic Graphs. In Proceedings of the 13th International Conference
Information Visualisation, 2009.
15. P. Sevon, L. Eronen, P. Hintsanen, K. Kulovesi, and H. Toivonen. Link discovery
in graphs derived from biological databases. In Proceedings of 3rd International
Workshop on Data Integration in the Life Sciences, 2006.
16. P. Shannon, A. Markiel, O. Ozier, N.S. Baliga, J.T. Wang, D. Ramage, N. Amin, B.
Schwikowski, T. Ideker. Cytoscape: A software environment for integrated models
of biomolecular interaction networks. Genome Research, 13:2498-2504, 2003.
17. Y. Tsuruoka, Y. Tateishi, J. Kim, T. Ohta, J. McNaught, S. Ananiadou, and J.
Tsujii. Developing a Robust Part-of-Speech Tagger for Biomedical Text, Advances
in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382-392, 2005.
OpenTox: A Distributed REST Approach to
Predictive Toxicology
Tobias Girschick1 , Fabian Buchwald1 , Barry Hardy2 , and Stefan Kramer1
1 Technische Universität München, Institut für Informatik/I12, Boltzmannstr. 3, 85748 Garching b. München, Germany
{tobias.girschick, fabian.buchwald, stefan.kramer}@in.tum.de
2 Douglas Connect, Baermeggenweg 14, 4314 Zeiningen, Switzerland
[email protected]
Abstract. While general-purpose data mining has a role to play on the internet of services, there is a growing demand for services particularly tailored to application domains in industry and science. In the talk, we present the results of the European Union funded project OpenTox [1] (see http://www.opentox.org), which aims at building a web-service-based framework specifically for predictive toxicology. OpenTox is an interoperable, standards-based framework for the support of predictive toxicology data and information management, algorithms, (Quantitative) Structure-Activity Relationship modeling, validation and reporting. Data access and management, algorithms for modeling, feature construction and feature selection, as well as the use of ontologies, are core components of the OpenTox framework architecture. Alongside the extensible Application Programming Interface (API) that can be used by contributing developers, OpenTox provides the end-user-oriented applications ToxPredict (http://www.toxpredict.org) and ToxCreate (http://toxcreate.org/create). These are built on top of the API and are especially useful to non-computational scientists. The very flexible component-based structure of the framework allows for the combination of different services into multiple applications. All framework components are API-compliant REST web services that can be combined into distributed and interoperable tools. New software developed by OpenTox partners, such as FCDE [2], FMiner [3] or Last-PM [4], which is particularly suited for toxicology predictions with chemical data input, is integrated. The advantages of the framework should encourage researchers from machine learning and data mining to get involved and develop new algorithms within the framework, which offers high-quality data, controlled vocabularies and standard validation routines.
Acknowledgements
This work was supported by the EU FP7 project (HEALTH-F5-2008-200787)
OpenTox (http://www.opentox.org) and the TUM Graduate School.
References
1. Hardy, B., Douglas, N., Helma, C., Rautenberg, M., Jeliazkova, N., Jeliazkov, V.,
Nikolova, I., Benigni, R., Tcheremenskaia, O., Kramer, S., Girschick, T., Buchwald,
F., Wicker, J., Karwath, A., Gütlein, M., Maunz, A., Sarimveis, H., Melagraki,
G., Afantitis, A., Sopasakis, P., Gallagher, D., Poroikov, V., Filimonov, D., Zakharov, A., Lagunin, A., Gloriozova, T., Novikov, S., Skvortsova, N., Druzhilovsky,
D., Chawla, S., Ghosh, I., Ray, S., Patel, H., Escher, S.: Collaborative Development
of Predictive Toxicology Applications, accepted. Journal of Cheminformatics (2010)
2. Buchwald, F., Girschick, T., Frank, E., Kramer, S.: Fast Conditional Density Estimation for Quantitative Structure-Activity Relationships. In: Proc. of the 24th
AAAI Conference on Artificial Intelligence, AAAI Press (2010) 1268–1273
3. Maunz, A., Helma, C., Kramer, S.: Efficient Mining for Structurally Diverse Subgraph Patterns in Large Molecular Databases. Machine Learning, in press (2010)
4. Maunz, A., Helma, C., Cramer, T., Kramer, S.: Latent Structure Pattern Mining.
In: Proc. of ECML/PKDD 2010, accepted. (2010)