From Data to Knowledge: an Integrated Rule-Based Data Mining System
Chien-Chung Chan and Zhicheng Su
Department of Computer Science
University of Akron
Akron, OH 44325-4003
[email protected]
Abstract
This paper presents an integrated rule-based data mining system that
creates rule-based classifiers with a web-based user interface from
data sets provided by end users. It provides a streamlined
integration of three technologies: database systems, machine
learning systems, and rule-based systems. Rules generated by the
system are based on rough set theory, so the system is capable of
dealing with uncertain rules. Because the system generates the user
interface dynamically, end users are released from the burden of
programming. The generated classifiers are stored in a database
system and delivered as web-based applications, so they can be
managed and accessed easily with a web browser.
1. Introduction
The development of computer technologies has provided many useful
and efficient tools to produce and store huge amounts of data in
various forms. The raw data stored in databases or computer files
have become important assets of modern enterprises. Therefore, it
has become a common challenge for enterprises to make the best use
of these data.
The task of extracting useful information from data is not a new
one. It has been a common interest in research areas such as
statistical data analysis, machine learning, and pattern
recognition. An emerging research area called Knowledge Discovery in
Databases (KDD) is mainly concerned with developing new approaches
and techniques for the efficient extraction of useful information
from databases.
In general, KDD refers to the overall process of
finding and interpreting useful patterns from
data. A typical KDD process may consist of six
steps: (1) select an application domain and create
a target data set, (2) preprocess and clean the
data set, (3) transform the data set into a proper
form, (4) choose the functions and algorithms of
data mining, (5) validate and verify discovered
patterns, and (6) apply the discovered patterns
[1, 2].
Research in KDD has contributed to the recent
development of commercial systems. Major
vendors of database systems such as Oracle,
IBM, and Microsoft have extended their systems
to support some basic functionalities of KDD.
These systems may be used to create data mining
models based on supported clustering and
classification algorithms. Since they are rooted in
database technologies, no inference or reasoning
tools are supported.
On the other hand, research in AI has contributed to the development
of expert systems since the early 1980s [3]. Most successful expert
systems have used if-then rules to represent experts’ knowledge;
hence, they are also called rule-based systems. The basic structure
of a rule-based expert system includes a rule base, which is a set
of if-then rules, and an inference engine, which infers answers to
given inputs by using the rules in the rule base. One major
challenge of building expert systems is how to construct the rule
base. Another issue is whether the system is capable of dealing with
uncertain rules. Expert system development environments such as
CLIPS [4] and JESS [5] provide tools for end users to program or
enter rules into a rule base. However, they do not provide machine
learning tools to generate rules from data sets. In addition, their
inference engines may not support reasoning with uncertain rules.
Most data mining tools are based on results of machine learning
research, which is a well-established area in AI [6, 7], and there
have been many successful applications. One of the well-known
programs is Quinlan’s C4.5 [8] (a revised commercial version is
called C5.0), which can be used to generate classifier programs from
collections of training examples. The generated classifier can be
represented as a decision tree or a set of production rules. The
C4.5 system can also generate an inference engine that applies the
generated decision trees or rules to produce answers for queries
entered interactively. In C5.0, the system can generate C source
code of classifiers for developing embedded applications.
Another well-known open-source data mining package is WEKA [7],
which is implemented in Java. The WEKA package is quite
comprehensive, and it provides tools for pre-processing and typical
data mining tasks such as classification, clustering, and
association rule mining. However, no inference engines for using the
generated rules are provided. Because it is implemented in Java, it
is easy to embed WEKA’s tools in Java applications.
The capability to support embedded applications is quite attractive.
However, it requires some programming to provide an end-user
interface. In addition, there is one issue that has not been
addressed by most data mining packages, namely, how to manage the
classifiers generated by these systems.
In the proposed system, we focus on the perspective of end users who
do not have programming skills. End users are required only to
provide data sets and the knowledge needed to pre-process the data.
From a pre-processed data set, our system generates a metadata file
containing attribute information of the data and a set of if-then
rules with uncertainties. The rules are generated using the BLEM2
learning program [9]. The resulting metadata file and rule file are
used by a web-based classifier generating system to generate a
web-based user interface for running the target classifier. For each
target classifier, the corresponding metadata file and rule file are
stored in a database system to provide manageability.
The organization of the paper is as follows. In Section 2, we
provide a brief overview of rule-based classifiers and the
components and workflow of our system. Section 3 presents the
underlying database and component design of the system.
Implementation is covered in Section 4. Conclusions are given in
Section 5, and references are listed in Section 6.
2. System Overview
2.1. Rule-Based Classifier
A rule-based classifier system is a set of if-then rules that
implements a classifier. In general, the inference step in a
rule-based classifier is quite simple. If the input data match the
conditions on the Left Hand Side (LHS) of a rule, the rule is called
firable, and firing the rule returns the decision on the Right Hand
Side (RHS) of the rule. Because conditions of different rules are
not necessarily mutually exclusive, multiple rules may be firable
for a given input. When the RHS values returned by these rules
differ, the answer returned by the system is not unique. We call
this condition rule conflict. One way of resolving rule conflict is
to use the majority voting method, i.e., to return the value
produced by most of the rules and break ties arbitrarily. It is also
possible to use methods based on theories such as fuzzy sets
[10, 11] and rough sets [12, 13]. In our system, a maximum sum
approach based on rough set theory has been implemented. It is
presented in Section 2.3.
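As an illustration of the matching and majority-voting steps described above, the following minimal Python sketch may help. The rule set, attribute names, and function names are hypothetical and are not taken from the system itself.

```python
from collections import Counter

# Hypothetical rules: each has LHS conditions and an RHS decision.
RULES = [
    {"lhs": {"outlook": "sunny", "humidity": "high"}, "rhs": "no"},
    {"lhs": {"outlook": "sunny"}, "rhs": "yes"},
    {"lhs": {"humidity": "high"}, "rhs": "no"},
    {"lhs": {"outlook": "rainy"}, "rhs": "yes"},
]


def firable(rule, inputs):
    """A rule is firable when every LHS condition matches the input."""
    return all(inputs.get(attr) == value for attr, value in rule["lhs"].items())


def majority_vote(rules, inputs):
    """Return the decision produced by most firable rules.

    Ties are broken arbitrarily; None is returned when no rule fires.
    """
    votes = Counter(rule["rhs"] for rule in rules if firable(rule, inputs))
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Three rules fire here (two say "no", one says "yes"): a rule conflict.
decision = majority_vote(RULES, {"outlook": "sunny", "humidity": "high"})
print(decision)
```

In this sketch the conflict between the first three rules is resolved in favor of the decision with the most votes.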
2.2. System Components and Workflow
As shown in Figure 1, there are three
components of the proposed system: Data
Preprocessing, Rule Generator, and RBC
Generator.
Figure 1. System components and workflow (Input Data Set, Data
Preprocessing, Rule Generator, RBC Generator).
Input Data Set:
The required format for the input data set is a text file of
comma-separated values (CSV), which can be created with MS Excel. It
is assumed that there are N columns of values corresponding to N
attributes or variables, which may have real or symbolic values. The
first N - 1 attributes are treated as condition attributes and the
last one is the decision attribute.
Data Preprocessing:
This component is used to discretize domains
of real attributes into a finite number of intervals.
The discretized data file is then used to generate
a metadata attribute information file and a training
data file.
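The input format and the discretization step can be sketched as follows. The paper does not state which discretization method is used, so equal-width binning is an assumption made here purely for illustration; the sample data and function names are also hypothetical.

```python
import csv
import io


def split_columns(csv_text):
    """Parse CSV rows into (condition values, decision value) pairs.

    The first N - 1 columns are condition attributes; the last column
    is the decision attribute.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    return [(row[:-1], row[-1]) for row in rows]


def equal_width_bins(values, k):
    """Map each real value to an interval index in 0..k-1.

    Equal-width binning is an assumed method, not the paper's.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin k, so clamp it to k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


SAMPLE = "36.6,low,healthy\n38.2,high,sick\n39.9,high,sick\n"
examples = split_columns(SAMPLE)
temperatures = [float(cond[0]) for cond, _ in examples]
bins = equal_width_bins(temperatures, 3)
print(bins)
```

Each real value is replaced by the index of its interval, which yields the finite symbolic domain a rule learner needs.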
Rule Generator:
A symbolic learning program, BLEM2 [9], is used to generate rules
with uncertainty from the discretized training data file and the
corresponding attribute file.
RBC Generator:
This component is used to generate a web-based Rule-Based Classifier
(RBC) from a rule file and a metadata file. The idea was first
introduced in [14]. For each pair of metadata and rule files, the
user interface for running the classifier is generated dynamically.
The component includes an inference engine shared by all target
classifiers. In addition, relevant information about target
classifiers and user interfaces is stored in a database for
manageability. The architecture of the RBC generator is a multi-tier
client-server system, shown in Figure 2. Using a web browser,
clients interact with the backend SQL server through services
provided by the application server in the middle tier. The
application server is responsible for dispatching requests to the
intended backend server, receiving responses from the backend
server, and presenting the final results to the clients. The
detailed design of the middle-tier components and database of the
RBC generator is presented in Section 3.
Figure 2. Architecture of RBC generator (Client, Middle Tier, and
SQL DB server, connected by requests and responses).
2.3. Workflow of RBC Generator
The workflow of the RBC generator is shown in Figure 3.
Figure 3. Workflow of RBC generator (Rule Set File, Metadata File,
RBC Generator, SQL Rule Table, Rule Table Definition).
As shown in Figure 3, the generator takes a rule set file and a
metadata file as its inputs, and it dynamically generates SQL
statements for creating tables in the database. Rules of different
target classifiers are stored separately in dynamically generated
relational tables.
The inference engine that is shared by all target classifiers is
implemented as dynamic SQL statements generated from user inputs. It
uses the pattern matching ability of a SQL processor to determine
the rules that are firable. Rule conflict resolutions are
implemented using stored procedures. For instance, to implement the
Max-Sum approach based on rough set theory, the SQL statement can be
constructed as follows:
    SELECT TOP 1 <decision-attribute> AS decision,
        SUM(certainty * strength) AS certainty
    FROM <domain name>
    WHERE <selected condition-attributes> = <selected value>
    GROUP BY <decision-attribute>
    ORDER BY certainty DESC
This SQL statement returns the decision-attribute value that has the
maximum sum(certainty*strength) over the rules whose selected
condition-attributes match the user-selected values.
The majority voting method can be implemented as:
    SELECT TOP 1 <decision-attribute>, COUNT(*) AS MAX_NUM
    FROM <meta-data table>
    WHERE <selected condition-attributes> =
        <condition-attributes values in meta table>
        OR <not selected condition-attributes> IS NULL
    GROUP BY <decision-attribute>
    ORDER BY MAX_NUM DESC
This SQL statement returns one of the decision-attribute values that
has the maximum number of matches, where a rule matches whenever the
selected condition-attributes equal the condition-attribute values
stored in the database or the non-selected condition-attributes have
NULL values in the database.
Conflict resolution approaches are stored in a relational table and
retrieved dynamically, based on the user’s preference, during the
execution of an RBC.
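The two conflict resolution queries of Section 2.3 can be tried out with the following Python sketch, which uses the standard-library sqlite3 module in place of MS SQL Server. Several points are assumptions: SQLite has no TOP clause, so LIMIT 1 is used; the rule table, its columns, and its rows are hypothetical; and a NULL condition is treated as "attribute not used by the rule" in both queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rule set table; a NULL condition means the rule does
# not test that attribute.
conn.execute(
    "CREATE TABLE weather (outlook TEXT, humidity TEXT, decision TEXT,"
    " certainty REAL, strength REAL)"
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?, ?, ?)",
    [
        ("sunny", "high", "no", 1.0, 0.30),
        ("sunny", None, "yes", 0.6, 0.20),
        (None, "high", "no", 0.8, 0.25),
        ("rainy", None, "yes", 1.0, 0.25),
    ],
)

# A rule is firable when each condition matches the input or is NULL.
WHERE = "(outlook = ? OR outlook IS NULL) AND (humidity = ? OR humidity IS NULL)"
params = ("sunny", "high")

# Max-Sum: pick the decision with the largest sum(certainty * strength).
max_sum = conn.execute(
    "SELECT decision, SUM(certainty * strength) AS total FROM weather"
    f" WHERE {WHERE} GROUP BY decision ORDER BY total DESC LIMIT 1",
    params,
).fetchone()

# Majority voting: pick the decision returned by the most firable rules.
majority = conn.execute(
    "SELECT decision, COUNT(*) AS max_num FROM weather"
    f" WHERE {WHERE} GROUP BY decision ORDER BY max_num DESC LIMIT 1",
    params,
).fetchone()

print(max_sum, majority)
```

Both queries need a descending sort so that the single returned row is the maximum rather than the minimum.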
3. Design of the RBC Generator
3.1. Backend Database Design
3.1.1. Domain Information Tables
The information retrieved from the metadata file is stored in three
tables: the Domain, Attribute, and Lookup tables. The Domain table
contains the domain name and description, and it guarantees a unique
domain id for each domain. The Attribute table contains the
information of each attribute and its mapping to the corresponding
rule set table, such as the actual data type and size used when
creating the rule set table. Each attribute refers to several rows
in the Lookup table, and each row contains one possible value for
that attribute. The relationship of these tables is shown in Figure
3, where the endpoints of each line indicate whether the
relationship is one-to-one or one-to-many. If a relationship has a
key at one endpoint and a figure-eight at the other, it is a
one-to-many relationship. If a relationship has a key at each
endpoint, it is a one-to-one relationship. A relationship line
indicates that a foreign key relationship exists between one table
and another. For a one-to-many relationship, the foreign key table
is the table near the line's figure-eight symbol. If both endpoints
of the line attach to the same table, the relationship is reflexive.
Figure 3. Structure of domain information tables.
Domain table
From the metadata file, the domain name and description are
extracted into the Domain table, and a unique domain ID is generated
for each new domain.
Attribute table
The attribute information is extracted line by line from the
metadata file. The attribute name and number of values are extracted
directly from the metadata file. The data type and size fields are
computed from the attribute category. For example, if the attribute
category is ‘C’, the mapped data type is ‘VARCHAR(127)’; if the
attribute category is ‘D’, the mapped data type is ‘DECIMAL(8,4)’;
if the attribute category is ‘I’, the mapped data type is ‘INTEGER’.
A type field records whether the attribute is a condition or a
decision attribute.
Lookup table
The Lookup table stores the actual values of attributes. These are
also extracted from the metadata file line by line.
Rule Set table
The rule set file is parsed and stored in a rule set table. Each
domain has its own rule set table, whose name is the same as the
domain name. The rule set table is created dynamically according to
the information in the Attribute table, together with four extra
fields: support, certainty, strength, and coverage, which are
generated by the BLEM2 learning program. These fields are used in
conflict resolution.
3.1.2. Management Tables
For flexibility and ease of maintenance, menu items are stored in
the database and retrieved dynamically.
Figure 4. Structure of management tables.
Users of the system are grouped into different
user roles such as Administrator, Author and
Operator, etc. These user roles are used to
enforce permission control on menu items. In
other words, different users are mapped to
different sets of menu items. The structure of
management tables used in RBC generator is
shown in Figure 4.
3.1.3. Miscellaneous Tables
There are two miscellaneous tables used in the
system. The tblApproach table is used to map a
conflict resolution approach to the stored
procedure that implements it. The approach
names are also dynamically retrieved when the
application runs. The tblMetaFile table is a
temporary place to store the file uploaded from
the client side.
Figure 5. Miscellaneous tables.
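Returning to the domain information tables of Section 3.1.1, the mapping from attribute category to column type and the dynamic creation of a rule set table might be sketched as follows. The category codes and data types come from the text. The function name, the sample attributes, and the DECIMAL(8,4) type chosen for the four BLEM2 fields are assumptions.

```python
# Category-to-type mapping as described for the Attribute table.
TYPE_BY_CATEGORY = {"C": "VARCHAR(127)", "D": "DECIMAL(8,4)", "I": "INTEGER"}

# Extra fields produced by BLEM2; their column type is an assumption.
EXTRA_FIELDS = ["support", "certainty", "strength", "coverage"]


def rule_table_ddl(domain_name, attributes):
    """Build a CREATE TABLE statement for a domain's rule set table.

    attributes is a list of (attribute name, category) pairs. Names are
    interpolated into SQL, so they are checked to be plain identifiers.
    """
    names = [domain_name] + [name for name, _ in attributes]
    for name in names:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    columns = [f"{name} {TYPE_BY_CATEGORY[cat]}" for name, cat in attributes]
    columns += [f"{field} DECIMAL(8,4)" for field in EXTRA_FIELDS]
    return f"CREATE TABLE {domain_name} ({', '.join(columns)})"


ddl = rule_table_ddl(
    "weather",
    [("outlook", "C"), ("temperature", "D"), ("windy", "I"), ("play", "C")],
)
print(ddl)
```

The rule set table takes its name from the domain, as in the paper, so one such statement would be generated per uploaded domain.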
3.2. Middle-Tier Components
3.2.1. Database Access
For ease of maintenance and reusability, we wrap all the necessary
database operations in one component. The component hides the
details of database connection, pooling, and permission control. It
provides many overloaded functions to run stored procedures and SQL
statements, and it can return either a single value or a
disconnected dataset. It can be reused in any other .NET application
without changes. The connection string, which contains the database
name and login information, is read from the standard XML-based web
application configuration file, web.config.
3.2.2. Account Management
User role based permission management:
Users are grouped into roles, and resource permissions such as menu
items are assigned to roles rather than to specific users. Currently
there are three roles: Administrator, Author, and Operator.
Dynamic menu retrieval:
Menu items are not hard coded in the system. They are stored in the
database table tblSideMenuItem. A database administrator can insert
a new item into this table and create a new row in the table
tblUserSideMenuItemAssoc, which contains the permission mapping
between user roles and menu items. Once this is done, the new menu
items are available to all users with the proper roles. This is
accomplished by a web user control component, which takes a user
role as input and retrieves the menu items dynamically from the
table tblUserSideMenuItemAssoc.
Authentication:
Authentication is enforced by the web configuration stored in the
web.config file. Access to any page of this application is
redirected to a logon page. It is possible to protect only some of
the pages by changing the default policy in the configuration file.
Session Control:
After the user logs in, user information such as the user id and
user name is stored throughout the session in order to customize the
appearance for the specific user or to track the user's activities.
This information is deleted after the user logs out or the session
times out.
Add and Delete Account:
Adding or deleting an account involves two tables: tblPerson and
tblUser. We use a transaction to guarantee data integrity.
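The role-based menu retrieval described in Section 3.2.2 can be sketched as a join between the two menu tables. The table names tblSideMenuItem and tblUserSideMenuItemAssoc come from the text. The column names, the sample rows (apart from the "Run Application" caption), and the use of sqlite3 instead of MS SQL Server are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names below are hypothetical; only the table names are the paper's.
conn.executescript("""
CREATE TABLE tblSideMenuItem (ItemID INTEGER PRIMARY KEY, Caption TEXT);
CREATE TABLE tblUserSideMenuItemAssoc (
    RoleName TEXT,
    ItemID INTEGER REFERENCES tblSideMenuItem(ItemID)
);
INSERT INTO tblSideMenuItem VALUES
    (1, 'Run Application'), (2, 'Create Domain'), (3, 'Manage Accounts');
INSERT INTO tblUserSideMenuItemAssoc VALUES
    ('Operator', 1),
    ('Author', 1), ('Author', 2),
    ('Administrator', 1), ('Administrator', 2), ('Administrator', 3);
""")


def menu_for_role(conn, role):
    """Return the menu captions that the given role is permitted to see."""
    rows = conn.execute(
        "SELECT m.Caption FROM tblSideMenuItem AS m"
        " JOIN tblUserSideMenuItemAssoc AS a ON a.ItemID = m.ItemID"
        " WHERE a.RoleName = ? ORDER BY m.ItemID",
        (role,),
    ).fetchall()
    return [caption for (caption,) in rows]


print(menu_for_role(conn, "Author"))
```

Adding a menu item for a role then only requires inserting rows into the two tables, with no change to the application code.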
3.2.3. Domain Operations
Domain operations are encapsulated in a Domain class. When creating
or deleting a domain, several tables need to be handled in the
correct order: the Domain, Attribute, Lookup, and rule set tables.
They have referential relationships, and data integrity must be
maintained. For instance, when deleting a domain, the rule set table
needs to be dropped, and the corresponding information must be
deleted first from the Lookup table, then from the Attribute table,
and finally from the Domain table. All these operations must be done
in one transaction.
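The ordered, transactional deletion of a domain can be sketched as below. This is a minimal illustration using sqlite3 rather than the system's .NET Domain class; the column names, sample rows, and function name are hypothetical. Transactions are controlled explicitly so that the DROP TABLE is rolled back together with the deletes if any step fails.

```python
import sqlite3

# isolation_level=None gives manual BEGIN/COMMIT/ROLLBACK control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Domain (DomainID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Attribute (
    AttributeID INTEGER PRIMARY KEY,
    DomainID INTEGER REFERENCES Domain(DomainID),
    Name TEXT
);
CREATE TABLE Lookup (
    AttributeID INTEGER REFERENCES Attribute(AttributeID),
    Value TEXT
);
CREATE TABLE weather (outlook TEXT, decision TEXT);
INSERT INTO Domain VALUES (1, 'weather');
INSERT INTO Attribute VALUES (10, 1, 'outlook');
INSERT INTO Lookup VALUES (10, 'sunny'), (10, 'rainy');
""")


def delete_domain(conn, domain_id, rule_table):
    """Drop the rule set table, then delete Lookup, Attribute, Domain rows."""
    if not rule_table.isidentifier():
        raise ValueError(f"unsafe table name: {rule_table!r}")
    conn.execute("BEGIN")
    try:
        conn.execute(f"DROP TABLE IF EXISTS {rule_table}")
        conn.execute(
            "DELETE FROM Lookup WHERE AttributeID IN"
            " (SELECT AttributeID FROM Attribute WHERE DomainID = ?)",
            (domain_id,),
        )
        conn.execute("DELETE FROM Attribute WHERE DomainID = ?", (domain_id,))
        conn.execute("DELETE FROM Domain WHERE DomainID = ?", (domain_id,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


delete_domain(conn, 1, "weather")
remaining = [
    conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
    for t in ("Lookup", "Attribute", "Domain")
]
print(remaining)
```

With foreign keys enforced, deleting in the reverse order (Domain first) would fail, which is why child rows are removed before their parents.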
Conflict resolutions are implemented using stored procedures. The
conflict resolution approaches are stored in the tblApproach table
and retrieved dynamically during application execution. This enables
users to add more approaches dynamically, using our stored procedure
builder, to meet their requirements.
Stored procedure builder:
We have developed a stored procedure builder tool for dynamically
creating conflict resolution approaches. The builder first generates
a stored procedure from the SQL statement entered by the user and
then adds a new record to the tblApproach table to map the approach
name to the stored procedure. All of this work is itself done in a
stored procedure; that is, a stored procedure is used to generate
other stored procedures.
Dynamic classifier generator:
Users can run a target classifier at any time by choosing the “Run
Application” menu. After a domain name is chosen, the displayed
attributes change accordingly. The attribute information is
retrieved from the database dynamically, and the WHERE clause is
constructed dynamically as well. The WHERE clause is sent to the
stored procedure of the corresponding approach, and the decision and
the fired rules are returned.
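The dynamic construction of the WHERE clause can be sketched as follows. This is an assumed design, not the system's actual code: the function name is hypothetical, values are passed as query parameters rather than concatenated, and only attribute names known for the domain (for example, read from the Attribute table) are interpolated.

```python
def build_where(selected, domain_attributes):
    """Return (clause, params) for the attribute values the user selected.

    selected maps attribute names to chosen values; attributes the user
    left unselected are skipped. A rule with NULL in a column does not
    test that attribute, so NULL also counts as a match.
    """
    clauses, params = [], []
    for name in domain_attributes:
        if name in selected:
            clauses.append(f"({name} = ? OR {name} IS NULL)")
            params.append(selected[name])
    # With nothing selected, every rule is firable.
    return (" AND ".join(clauses) or "1 = 1"), params


clause, params = build_where(
    {"outlook": "sunny", "windy": "no"},
    ["outlook", "humidity", "windy"],
)
print(clause, params)
```

The returned clause and parameter list would then be handed to whichever conflict resolution query the user has chosen.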
4. Implementation
The proposed system has been implemented using the MS Dot NET
framework and MS SQL Server 2000 [15]. It has been deployed as a Dot
NET application on an IIS web server running on a Pentium 4/700 MHz
machine since summer 2003. Due to space limitations, a running
example is omitted; it will be shown in the conference presentation.
5. Conclusions
This paper presents a web-based system for automated generation of
rule-based classifiers from data sets provided by end users. No
programming is required from end users, since user interfaces for
running classifiers are generated automatically by the system.
Generated classifiers and related user interfaces are stored in
relational tables, so they are extensible and manageable. Created
classifiers are web-based, so they are easily accessible by thin
clients.
6. References
[1] Fayyad, U., Editorial, Int. J. of Data Mining and
Knowledge Discovery, Vol.1, Issue 1, 1997.
[2] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth,
"From data mining to knowledge discovery: an
overview," in Advances in Knowledge Discovery
and Data Mining, Fayyad et al (Eds.), MIT
Press, 1996.
[3] Buchanan, B.G. and E.H. Shortliffe, eds. Rule-Based Expert
Systems: The MYCIN Experiments of the Stanford Heuristic Programming
Project. Reading, MA, Addison-Wesley, 1984.
[4] Giarratano, J. and G. Riley, Expert Systems
Principles and Programming, 3rd ed., PWS
Publishing Company, 1998.
[5] Friedman-Hill, E., Java Expert System Shell,
Sandia National Laboratories, Livermore, CA.,
http://herzberg.ca.sandia.gov/jess
[6] Mitchell, T.M., Machine Learning, The
McGraw-Hill Companies, Inc., 1997.
[7] Witten I.H. and E. Frank, Data Mining: Practical
Machine Learning Tools and Techniques with
Java Implementations, San Francisco, Morgan
Kaufmann, 2000.
[8] Quinlan, J.R., C4.5: Programs for machine
learning. San Francisco, Morgan Kaufmann,
1993.
[9] Chan, C.-C. and S. Santhosh, “BLEM2: Learning Bayes’ rules from
examples using rough sets,” Proc. NAFIPS 2003, 22nd Int. Conf. of
the North American Fuzzy Information Processing Society, July 24-26,
2003, Chicago, Illinois, pp. 187-190.
[10] Zadeh, L.A., “Fuzzy sets,” Information and
Control, 8:338-353, 1965.
[11] Gupta, M.M., R.K. Ragade, and R.R. Yager,
Advances in Fuzzy Set Theory and Applications,
editors, North-Holland, Amsterdam, 1979.
[12] Pawlak, Z., "Rough sets: basic notion," Int. J. of Computer and
Information Science 11, pp. 341-356, 1982.
[13] Pawlak, Z., J. Grzymala-Busse, R. Slowinski, and W. Ziarko,
"Rough sets," Communication of ACM, Vol. 38, No. 11, November, 1995,
pp. 89-95.
[14] Khasawneh, N. and C.-C. Chan, “Servlet-based
implementation for rule-based classifiers,” Proc.
12th MidWest Artificial Intelligence and Cognitive
Science Conference MAICS 2001, March 31 –
April 1, 2001, Miami University, Oxford, Ohio,
pp. 70-74.
[15] Su, Zhicheng, “Dot Net implementation for rule-based
classifiers,” Master’s research report, Department of Computer
Science, University of Akron, June, 2003.