An XML Framework Proposal for Knowledge Discovery in Databases
Petr Kotásek, Jaroslav Zendulka
{kotasekp, zendulka}@dcse.fee.vutbr.cz
Brno University of Technology, Department of Computer Science and Engineering,
Božetěchova 2, 612 66 Brno, Czech Republic
Abstract. In recent years, the XML language has been receiving much interest in the IT
community. It has many properties that make it a strong candidate for the representation
of different kinds of data. In this paper we propose an XML framework for the domain of
knowledge discovery in databases (KDD). This is not a specification document; rather, the
article is a compendium of ideas and remarks concerning the broad area of KDD. It
identifies some common problems of the area from a high-level perspective and then
outlines a possible solution, showing an example definition of data interfaces for the
respective KDD steps using XML.
1 Introduction
There are huge amounts of data stored in various repositories (databases), and processing
them reasonably is beyond human capabilities. It is no longer possible for us to look at a
database, see useful patterns in the data, and derive potentially useful knowledge from our
observation.
Knowledge discovery in databases addresses this problem by developing intelligent
techniques for the automated discovery of useful and interesting patterns (called knowledge)
in databases. The main effort in the knowledge discovery community has so far been devoted
to the development of efficient mining algorithms, but there are few contributions to a
comprehensive solution of the problem. There is no set of widely accepted techniques and
methodologies to support the entire process. Many knowledge discovery systems exist, and
each of them uses its own methodology. This is quite understandable, as most of the systems
were designed for rather narrow application areas (e.g., healthcare, business or image data
analysis).
Knowledge discovery in databases is a complex, interdisciplinary, data-centered and
human-centered task. On one hand, it is naturally desirable to have a unifying platform
(preferably built on formal foundations) for the process. On the other hand, this inherent
complexity makes the development of such a framework very difficult, if not impossible.
However, the need for a systematic description of the knowledge discovery process has been
recognized in the KDD community. In this paper, we summarize some of the problems that
could be addressed by the availability of such a unifying view, and outline a proposal of
an implementation-level solution exploiting XML.
The remainder of this paper is organized as follows: Section 2 summarizes some high-level
problems present in the KDD domain today and suggests a solution through the use of
ontologies. Section 3 describes the XML language; readers familiar with XML can skip this
section. Section 4 presents ideas on how XML could be used in the KDD process. We conclude
and outline future work in Section 5.
2 Some common high-level problems in KDD
We believe that it is worth trying to propose a framework for a systematic approach to the
KDD process. As knowledge discovery is a wide, open and evolving topic, the solution must
reflect its needs; it has to be open and extensible, too. Below we outline some problems that
might be addressed by such a unifying framework:
1. It may seem surprising, but we still do not have a precise definition of the basic terms
and concepts appearing in the area. A verbal definition of basic terms (like knowledge,
pattern, interestingness, etc.) can be found in [1]. However, it would be better to have a
definition of these concepts and their relations based on formal approaches.
2. The KDD process requires the user to deal intensively with huge amounts of data. This
happens especially during the preprocessing step, when the target data have to be identified,
integrated and cleaned. A conceptual view of the raw data is needed to navigate through
the data. Different techniques are also used to process the data; typically, techniques
from mathematical statistics are used extensively.
3. Next, the data mining task has to be identified and the proper data mining method chosen.
These two steps (2 and 3) are highly iterative: the formal description of various data
mining tasks and methods, together with the formal conceptual view of the data, would
help in the task identification and mutual matching between the task and the data.
4. The results have to be presented in a human-readable form. This is usually accomplished
by a combination of graphical and textual primitives. A formal definition of the different
kinds of results would allow for their easier management and transformation (e.g.,
classification tree to rules).
5. When some of the results are identified as knowledge, it should be possible to
manipulate it the way we are used to manipulating knowledge: share it, consolidate it,
report it to interested parties, etc.
6. It is natural that domain knowledge is used by an expert to guide the process, especially
during the initial data preprocessing and during the final knowledge identification. These
days, the domain knowledge usually resides in an expert's brain. It would be convenient to
be able to use the domain knowledge stored in a knowledge base, then integrate it with the
newly discovered knowledge coming as the outcome of the KDD process, and possibly
refine it.
We believe that the above problems might be addressed by creating a system of formal
ontologies for the knowledge discovery domain. For a detailed discussion of ontologies, see
[2] or [3]. For our purposes, we can say that an ontology is a system that defines categories
of things in the domain of interest and their mutual relationships, possibly in an axiomatic
way. For example, in [2] the KIF (Knowledge Interchange Format) language was proposed to
interchange knowledge among disparate programs and to create ontologies (see their
Ontolingua Server [4]).
Ontologies play an important role when we want to describe a domain and to process,
communicate or share knowledge about it. For example, it is obvious that ontologies
describing the data being analyzed are essential during the preprocessing phase. The
conceptual description of the data (mentioned under problem 2 above) is a must: it enables
navigation through and management of the data during the KDD process, to say the least.
Similarly, if we had an ontology describing the characteristics of the KDD process itself,
we could try to address the above problems. We can either start building this ontology and
all the supporting mechanisms from scratch, or we can use some existing technology. One such
technology is being developed by the Knowledge Sharing Effort (KSE) [2], which is creating
an ontology library that includes ontological descriptions of various domains. Moreover, the
fact that the KDD ontology, the ontology of the domain being explored, and all other
supporting ontologies are built on the same platform yields several primary benefits:
firstly, it should be easier to integrate the conceptual description of data and the domain
knowledge of the domain under exploration into the KDD process. Secondly, it should enable
domain knowledge to be used in the KDD process and discovered knowledge to be incorporated
back into the domain knowledge. Thirdly, if any supporting ontologies are present, they can
be used in the process too (typically, for graphical representation or statistical
evaluation). The whole idea is briefly depicted in Fig. 1.
Fig. 1. The idea of the ontological library with KDD ontologies
So now we can suppose that we have a huge library containing ontologies for the domain of
interest, and ontologies of areas that support the knowledge discovery process in various
ways. We are still left with the task of defining the ontologies that would describe the domain
of knowledge discovery itself (with the idea of easy integration and use of different other
ontologies within the knowledge discovery process in mind). Let us show how these
ontologies, together with those existing in the ontology library, might help with the problems
listed at the beginning of this section.
Firstly, to solve the problem mentioned under number 1 above, we have to create an
ontology covering intrinsic concepts like knowledge, interestingness, etc. It will be used
by the following ontologies, especially by those describing different knowledge types
(classification, association rules, etc.).
As for problem 2, the conceptual description of data is a part of the ontology for the
domain under exploration, and mathematical statistics might be covered by one of the
supporting ontologies.
To address problem 3, we will need ontologies describing the characteristics of different
data mining tasks, data mining methods and the architecture of the desired results (which is
tightly coupled with the data mining task).
Regarding problems 4 and 5, if we represent the newly discovered knowledge in
compliance with a previously defined ontology belonging to a family based on a widely
accepted technological platform (like that proposed in [2]), it will be possible to integrate
it with the domain knowledge (provided the latter is represented by an ontology built on the
same platform).
The whole architecture should be open to changes and extensions. Formal ontologies will
be defined using some formal language like KIF. However, these languages are not meant to
be used on the implementation level. Rather, we should use a more suitable format for
physical data, for example XML. Figure 2 shows a possible architecture for ontologies
defining different knowledge types (and thus addressing problems 4 and 5). The overall
structure is hierarchical, with basic terms and basic knowledge types ontologies on top - these
are general purpose ontologies. It is obvious that at least the two bottom ontologies will play a
physical role in the KDD process; they represent the discovered knowledge - the final
product. Therefore, they will have to be implemented using XML. In the remainder of this
paper, we will propose the possible use of XML in the context of the whole KDD process.
Fig. 2. An example of ontologies for description of knowledge types
3 The XML language
3.1 Brief History of XML
The Extensible Markup Language (XML) is a simplified subset of the Standard Generalized
Markup Language (SGML). The main goal of SGML is to provide a mechanism for
platform-independent representation of structured data. Unfortunately, SGML is very complex,
and therefore the cost of its implementation is high. The Hypertext Markup Language (HTML),
on the other hand, has very poor representation capability: it is oriented purely towards
presentation, and as such it is becoming insufficient for the growing demands of the
World Wide Web. Therefore, an initiative was started by the World Wide Web Consortium
(W3C) [5] to build XML. It is much simpler than SGML (and therefore easy to implement)
but still powerful enough to represent structured data. It has been recognized by many
researchers as a promising solution to their problems.
3.2 Brief Description of the XML Concept
XML is a method for putting data in a text file. Actually, it has evolved into a whole family of
technologies.
It is not primarily meant to be read by humans, but it is easy to read when needed. So if
you are an expert or a programmer debugging applications, you can use a simple text editor
to inspect XML files or even repair them.
XML looks very much like HTML but has nothing to do with it. HTML only marks up the
data for a browser to visualize it, whereas XML purely represents the logical structure of the
document regardless of its possible future presentation form. Tags in HTML are predefined
and users can do nothing about it – in other words, HTML is an application of SGML. In
XML, users can define their own tags and attributes (the grammar of an XML document) – in
other words, XML is a subset of SGML.
3.3 Advantages of XML
XML users will find many advantages depending on their field of interest, but there are some
more general advantages that become obvious in every application domain.
• Platform Independence
XML is a text format that can be displayed or processed on any device. The device only
needs to know the Document Type Definition (DTD) of the given XML document. The DTD
defines the grammar with which the XML document has to comply. If the DTD is publicly
available (e.g., through the WWW), the device can retrieve it, parse the document and
transform it in any possible way.
• Robustness
XML documents have to be well-formed, meaning each starting tag must have a
corresponding closing tag, there must be only one root element, and so on. As the
format is textual, it is more resistant to transport errors.
• Extensibility
This is another important feature, together with platform independence. Via DTDs, the
XML technology serves as a metalanguage for the definition of other languages.
Moreover, one of the XML technologies, the Extensible Stylesheet Language (XSL),
provides means for transforming XML documents. A part of XSL, called XSL
Transformations (XSLT), is an XML vocabulary for expressing such transformations. It is
then not a problem to take an XML tree and convert it into a completely different one.
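Well-formedness, as mentioned under Robustness above, can be verified mechanically by any XML parser. As an illustrative sketch (ours, not part of the original proposal), Python's standard-library parser can serve as a simple well-formedness checker:

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

good = "<Knowledge><Rule/></Knowledge>"
bad = "<Knowledge><Rule></Knowledge>"  # <Rule> is opened but never closed

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```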
4 Using XML in the Knowledge Discovery Process
Let us describe the potential of XML in the domain of knowledge discovery and data mining.
The process of knowledge discovery in databases has several steps that are shown in Figure 3.
Fig. 3. The Knowledge Discovery Process
Its goal is to retrieve some potentially useful and usable information (knowledge) from large
amounts of data, and to use this knowledge for decision support, marketing, etc.
It would be very nice if we could run the data mining algorithm directly against the raw
data in the database. Unfortunately, many steps have to be taken before actual mining can
take place:
1. Relevant data have to be selected; this is joint work for the data mining expert and
the domain expert. At this point we assume that we have already decided which type of
knowledge we want to discover and which particular data mining algorithm we will use.
Different knowledge types need different algorithms, and different algorithms require
different data. If we ran the data mining algorithm against irrelevant data, we could
receive useless results (the better case) or even results that are confusing and
therefore potentially harmful if applied in the real world (the worse case).
2. Once the relevant data are known, they have to be preprocessed and transformed into
a shape that the data mining algorithm will understand. A typical preprocessing
activity is the elimination of erroneous data (data cleaning). In the vast majority of
experiments, transformation involves extracting data from the database source and saving it
in plain text files. This is the task of the data mining expert and, without a doubt, it is
the most time-consuming part of the whole process, as the format of these files is
proprietary to the given mining algorithms.
3. Now we can run the data mining algorithm against the data. It produces results (in some
proprietary format again) that have to be visualized somehow and interpreted by domain
experts.
Actually, the three activities mentioned above under steps 1 and 2 (selection, preprocessing
and transformation) overlap each other in the real process. Selection can be understood as
the identification of relevant data without actually touching it. Preprocessing and
transformation are activities that deal with the physical data.
Most researchers in this field concentrate on developing new data mining techniques, which
is only one step along the long path. To our knowledge, little attention has been paid to
formats of data being processed.
4.1 An XML Framework Proposal for Data Interfaces
During the discovery process, data travels through many stages that all have well-defined
functionality. It would be convenient to define input data formats (we will refer to these as
data interfaces) for these stages using some platform-independent, robust, extensible and
human-readable technology. What a task for XML! If we decide to use XML, we can define
XML data interfaces for respective knowledge discovery steps. Then the overall architecture
might look like this (please refer to Figure 3):
Data Selection. The target data have to be extracted from the database system and stored in
an XML format. We will call this data interface XML-TargetData. There are basically two ways
to do this. In the first scenario, an application has to be written that accesses the
database, retrieves data through the standard interface provided by the DBMS (typically
SQL), and converts the data into XML. This approach requires additional coding. Fortunately,
major database system vendors are beginning to realize the importance of XML and are
starting to incorporate XML interfaces into their database engines, so extracting data in
an XML format should be a straightforward process in the near future.
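The first scenario, a hand-written extraction application, might be sketched as follows. The table layout, the element names and the use of SQLite are purely illustrative assumptions; the paper does not prescribe any particular schema for XML-TargetData:

```python
import sqlite3
import xml.etree.ElementTree as ET

# A small in-memory sales table stands in for the source DBMS.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, "nappies"), (1, "beer"), (2, "beer")])

# Retrieve rows through SQL and convert them into an XML-TargetData
# style document (the element names are our own invention).
root = ET.Element("TargetData")
for tid, item in con.execute("SELECT tid, item FROM sales ORDER BY tid, item"):
    row = ET.SubElement(root, "Row")
    ET.SubElement(row, "TransactionId").text = str(tid)
    ET.SubElement(row, "Item").text = item

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```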
Data Preprocessing. The data preprocessing phase receives data in XML-TargetData format.
The data is checked, cleaned and otherwise processed as needed, and the output is
created in XML-PreprocessedData format. This interface contains data for mining that is
semantically ready to be used by the data mining algorithm, but its syntactic structure might
be different from the syntax understood by the algorithm.
Transformation. In the original architecture from Figure 3, the transformation phase was
a single, highly specialized procedure. Imagine this scenario: preprocessed data are in a
text format. This is quite understandable, because preprocessing deals with data checking
and cleaning, so it is desirable that the data mining expert can read the format easily.
Unfortunately, the data mining algorithm is ready to receive data in its own proprietary
binary format. So someone has to sit down and code a program to convert the human-readable
text format into the algorithm-friendly binary format.
With XML, the transformation step is just a simple conversion from XML-PreprocessedData
to XML-TransformedData, the input interface for the data mining algorithm. XML technology
provides instruments for the simple transformation of XML documents into one another
(recall XSLT from Section 3.3). Actually, with XML, there is no single transformation step
in the discovery process any more. Rather, many local transformations can be performed
easily and effectively on each data interface using simple and straightforward XSLT
transformations.
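Such a local transformation might look like the following sketch. Here Python's standard ElementTree library stands in for an XSLT processor, and all element names (Record, Instance, and so on) are hypothetical, since the paper does not define the internals of XML-PreprocessedData or XML-TransformedData:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML-PreprocessedData fragment.
preprocessed = """<PreprocessedData>
  <Record><Attr name="age">34</Attr><Attr name="income">high</Attr></Record>
</PreprocessedData>"""

# Rebuild the tree in the (equally hypothetical) shape the mining
# algorithm expects: one <Instance> per record, attributes flattened.
src = ET.fromstring(preprocessed)
dst = ET.Element("TransformedData")
for record in src.findall("Record"):
    instance = ET.SubElement(dst, "Instance")
    for attr in record.findall("Attr"):
        ET.SubElement(instance, attr.get("name")).text = attr.text

print(ET.tostring(dst, encoding="unicode"))
# <TransformedData><Instance><age>34</age><income>high</income></Instance></TransformedData>
```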
Data Mining. The data mining algorithm takes the data in the XML-TransformedData format
and mines the data for knowledge. The output of this step is the XML-Patterns data format.
Interpretation/Evaluation. This step demonstrates the strength of XML in its full range. We
receive data from the XML-Patterns interface and want to visualize it in human-friendly
fashion. Using XSLT (or any other transformation tool or program), data can be transformed
into any data format and displayed by different programs. Nowadays, all the end-user
programs use their own proprietary formats (.doc, .xls, .jpg, …). If we manage to create a
widely accepted and respected XML vocabulary for a specific domain, we will only be left
with the task of defining transformations from/to these proprietary formats. Some steps are
being taken by W3C in this field (e.g., Precision Graphics Markup Language – PGML, Vector
Markup Language – VML, Document Definition Markup Language – DDML, Mathematical
Markup Language – MathML). Consequently, it is desirable to create a markup language for
data representation on all the data interfaces of the knowledge discovery process.
Another good thing to mention here is knowledge transformation. It is often necessary to
transform one knowledge representation to another, e.g. classification tree to association rules.
Again, there is no problem with XML and its transformation capabilities.
In previous paragraphs, we have assumed that data would be extracted from the database and
stored in XML formats between successive steps. We have accepted this assumption to
emphasize the individual character of the respective KDD steps. In real-world applications,
this approach would result in an unacceptable waste of space. Why should we store the same
(or almost the same) data twice? Rather, the data will be transformed into XML on the fly
while being read from the database. Actually, only the final product of the whole process,
the XML-Patterns data, is expected to be stored permanently for future use.
Figure 4 shows the KDD process again, but now with corresponding XML data interfaces.
Fig. 4. The Knowledge Discovery Process with XML Data Interfaces
Here is a summary of the benefits resulting from the use of XML in the knowledge discovery
process:
1. Each step in the process has its input and output XML data interfaces. If the input and
output interfaces of two consecutive steps do not match, the data can be transformed easily
using XSLT transformations.
2. Consequently, it is possible to combine different discovery components to perform
the whole process. This feature is most appreciated in the data mining step for testing
purposes, as it becomes easy to compare different mining algorithms against each other.
All that has to be done is a transformation from the XML-TransformedData interface into the
proprietary format (which can, but does not necessarily have to, be XML) of the data mining
algorithm.
4.2 An Example: XML-Patterns Interface
There are different types of knowledge patterns that can be discovered in data: data
generalization, summarization, characterization, association rules, classification trees,
clustering analysis, regression analysis, time series, web mining, path traversal patterns
(mining for user access patterns in interactive systems), etc. We will give a demonstration for
association rules.
An association rule is an expression of the form X ⇒ Y, where X and Y are sets of items.
The intuitive meaning of such a rule is that transactions that contain the items in X tend
to also contain the items in Y. Association rules are typically mined in relational
databases.
The typical application is the analysis of sales data; databases contain huge amounts of
transactions, typically consisting of the transaction date and the items bought in the
transaction. One of the well-known examples is the association rule nappies ⇒ beer. The idea
behind this strange rule is that fathers who were sent out by their wives to buy nappies
decided to reward themselves for their heroic performance by buying beer. When a good market
specialist sees this rule, he or she immediately moves the beer (together with crackers)
closer to the nappies to satisfy the thirsty husbands' temptation.
A fraction of an XML document representing this situation, and the DTD to which the
document conforms, might look like the one in Table 1. It is a very idealized and unrealistic
example: it only says that the association rule exists, nothing more. In real-world
applications, there are other data associated with rules, typically some metrics that measure
the value of the rule. Moreover, association rules can take different forms: quantitative,
generalized, fuzzy, etc. The XML document would have to be able to accommodate all these
variants.
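For illustration, the two metrics most commonly attached to association rules in the literature, support and confidence, can be computed over a toy set of transactions as follows. This sketch is ours; the paper does not specify which metrics the interface would carry:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X union Y) / support(X) for the rule X => Y."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

transactions = [
    {"nappies", "beer", "crackers"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer"},
]

print(support({"nappies", "beer"}, transactions))       # 0.5
print(confidence({"nappies"}, {"beer"}, transactions))  # 0.666...
```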
Table 1. An XML and DTD Example for the XML-Patterns Interface
XML Document
<Knowledge>
  <AssociationRule ruleid="1">
    <Antecedent>
      <AntecedentItem>
        <Name>nappies</Name>
      </AntecedentItem>
    </Antecedent>
    <Consequent>
      <ConsequentItem>
        <Name>beer</Name>
      </ConsequentItem>
    </Consequent>
  </AssociationRule>
</Knowledge>
DTD
<!ELEMENT Knowledge        (AssociationRule+)>
<!ELEMENT AssociationRule  (Antecedent,Consequent)>
<!ATTLIST AssociationRule  ruleid CDATA #REQUIRED>
<!ELEMENT Antecedent       (AntecedentItem+)>
<!ELEMENT AntecedentItem   (Name)>
<!ELEMENT Consequent       (ConsequentItem+)>
<!ELEMENT ConsequentItem   (Name)>
<!ELEMENT Name             (#PCDATA)>
Knowledge is the root element. It can include one or more AssociationRule elements. Each
AssociationRule element has a unique identifier, ruleid, and consists of one
Antecedent and one Consequent. Each Antecedent and Consequent must have one or more
AntecedentItem and ConsequentItem elements, respectively. Each AntecedentItem and
ConsequentItem has one Name, which is a string.
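A consumer of the XML-Patterns interface could extract the rules from such a document with a few lines of standard-library Python. This sketch is ours, not part of the proposal:

```python
import xml.etree.ElementTree as ET

document = """<Knowledge>
  <AssociationRule ruleid="1">
    <Antecedent><AntecedentItem><Name>nappies</Name></AntecedentItem></Antecedent>
    <Consequent><ConsequentItem><Name>beer</Name></ConsequentItem></Consequent>
  </AssociationRule>
</Knowledge>"""

rules = []
for rule in ET.fromstring(document).findall("AssociationRule"):
    antecedent = [n.text for n in rule.findall("Antecedent/AntecedentItem/Name")]
    consequent = [n.text for n in rule.findall("Consequent/ConsequentItem/Name")]
    rules.append((rule.get("ruleid"), antecedent, consequent))

print(rules)  # [('1', ['nappies'], ['beer'])]
```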
4.3 An XML Framework Proposal for Communication Interfaces
So far, we have used XML to define data interfaces, i.e., the formats of input/output data
for the given discovery steps. The use of XML in this context is natural, as XML is designed
for data representation.
XML can also be very well used to define communication interfaces. We can assume that
respective KDD steps are performed by agents. By agent, we mean a software application
which is able to communicate with other applications (agents).
The computing environment is becoming more and more parallel and distributed, and this
holds true for knowledge discovery as well. Therefore, it makes sense to think of the various
KDD steps as tasks performed by specialized agents that expose their functionality to
the rest of the world through an XML-defined communication interface. Typically, the data
mining algorithm will reside on a computer and define its communication interface
through an XML document like the one shown in Figure 5.
<DataMiningAgent>
  <Name>AprioriItemset</Name>
  <Description>
    This is a data mining algorithm used for mining association rules among sets of items.
  </Description>
  <BaseUrl url="http://www.fee.vutbr.cz/Mining/Algorithms/AprioriItemset.cgi"
           method="post"/>
  <InputDataInterfaces>
    <InputDataInterface
        url="http://www.fee.vutbr.cz/Mining/DTDs/AprioriItemsetInput1.dtd">
      <Description>DTD for one of the input data interfaces accepted by AprioriItemset</Description>
    </InputDataInterface>
  </InputDataInterfaces>
  <OutputDataInterfaces>
    <OutputDataInterface
        url="http://www.fee.vutbr.cz/Mining/DTDs/AprioriItemsetOutput1.dtd">
      <Description>DTD for one of the output data interfaces produced by AprioriItemset</Description>
    </OutputDataInterface>
  </OutputDataInterfaces>
  <InputParameters>
    <InputParam name="input_dtd_url"/>
    <InputParam name="input_xml_url"/>
    <InputParam name="output_dtd_url"/>
  </InputParameters>
</DataMiningAgent>
Fig. 5. An Example of Communication Interface of the Data Mining Agent
The interface in Figure 5 says that there is a data mining agent called AprioriItemset
located at BaseUrl. It can accept documents conforming to the DTD stored in the file named
AprioriItemsetInput1.dtd. The output of the agent conforms to the DTD stored in the file
named AprioriItemsetOutput1.dtd. The AprioriItemset agent accepts several input parameters:
the URL of the DTD to which the input XML data conforms, the URL of the XML input data, and
the URL of the DTD for the output data. This example assumes that the communication between
agents is built on top of the HTTP transport protocol. Again, this is a very simplified view
of the problem, meant only to show how XML could be utilized.
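A client agent could discover the algorithm's location and expected parameters by parsing the interface document. The following sketch (ours, using only the Python standard library) reads the essential fields from an abbreviated copy of the Figure 5 document:

```python
import xml.etree.ElementTree as ET

# An abbreviated copy of the Figure 5 interface document.
interface = """<DataMiningAgent>
  <Name>AprioriItemset</Name>
  <BaseUrl url="http://www.fee.vutbr.cz/Mining/Algorithms/AprioriItemset.cgi"
           method="post"/>
  <InputParameters>
    <InputParam name="input_dtd_url"/>
    <InputParam name="input_xml_url"/>
    <InputParam name="output_dtd_url"/>
  </InputParameters>
</DataMiningAgent>"""

agent = ET.fromstring(interface)
name = agent.findtext("Name")
base_url = agent.find("BaseUrl").get("url")
params = [p.get("name") for p in agent.findall("InputParameters/InputParam")]

print(name)    # AprioriItemset
print(params)  # ['input_dtd_url', 'input_xml_url', 'output_dtd_url']
```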
5 Conclusion and Future Work
We have tried to show the potential of XML and related technologies in the domain of
knowledge discovery in databases, in the context of a wider formal approach to the KDD
process. There is no 'one and only' solution to this problem. In our approach, software
applications (agents) are used to perform the successive knowledge discovery steps. These
agents have to define their communication and data interfaces. As these interfaces are
defined using XML, the environment is open and easily extensible. It is easy to build new
components, and it should also be easy to accommodate existing ones.
The XML solution resides on the implementation level. Above it, a general formal
architecture is built by means of formal ontologies. These ontologies describe the data being
analyzed (which comes naturally) and, newly, the domain of the KDD process itself. Moreover,
given a formal platform, the following problems should become solvable (in addition to those
mentioned in Section 2):
• integration of different KDD systems
• comparison of different KDD systems
• integration of KDD systems into existing environments
Thus, the future work will lie in the definition of a unifying approach that embraces
the KDD process as much as possible. It will require a deep investigation of the whole area,
identification of key concepts and their relationships, and their description via ontological
structures.
The Knowledge Interchange Format (KIF) is an example of a convenient formalism for
describing these ontologies, and XML can serve as an implementation technology in a way
similar to that outlined briefly in this paper.
References
1. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, eds., Advances in Knowledge
Discovery and Data Mining, AAAI Press/The MIT Press, 1996.
2. The Knowledge-Sharing Effort Consortium, http://www.cs.umbc.edu/kse
3. LADSEB-CNR, http://www.ladseb.pd.cnr.it/infor/ontology/ontology.html
4. Ontolingua Server, http://www-ksl-svc.stanford.edu:5915/
5. World Wide Web Consortium XML page, http://www.w3.org/XML