Evolving SQL Queries for Data Mining
Majid Salim and Xin Yao
School of Computer Science, The University of Birmingham
Edgbaston, Birmingham B15 2TT, UK
{msc30mms,x.yao}@cs.bham.ac.uk
Abstract. This paper presents a methodology for applying the principles of evolutionary computation to knowledge discovery in databases by evolving SQL queries that describe datasets. In our system, the fittest queries are rewarded by having their attributes given a higher probability of surviving into subsequent queries. The advantages of using SQL queries include their readability for non-experts and their ease of integration with existing databases. The evolutionary algorithm (EA) used in our system is very different from existing EAs, but appears to be effective and efficient in experiments to date with three different test datasets.
1 Introduction
Data mining studies the identification and extraction of useful knowledge from
large amounts of data [5]. There are a number of different fields of inquiry within
data mining, of which classification is particularly popular. Machine learning algorithms that can learn to classify data correctly can be applied to a wide variety of problem domains, including credit card fraud detection and medical diagnostics [1,2,3]. An important aspect of such algorithms is ensuring that they are easy to comprehend, to facilitate the transfer of machine-discovered knowledge to people [4]. This paper presents a framework for discovering classification knowledge hidden in a database through evolutionary computation techniques applied to SQL queries. The task is related to, but different from, the conventional classification problem. Instead of trying to learn a classifier for predicting unseen examples, we are primarily interested in discovering the underlying knowledge and concept that best describes a given set of data from a large database.
SQL is a standardised data manipulation language that is widely supported
by database vendors. Constructing a data mining framework using SQL is therefore very useful, as it would inherit SQL’s portability and readability.
Ryu and Eick [7] proposed a genetic programming (GP) based approach
to deriving queries from examples. However, there are two major differences
between the work presented here and theirs. First, the query languages used are
different and, as a result, the chromosome representations are different. Our use
of SQL has made the whole system much simpler and more portable. Second, the
evolutionary algorithms used are different. While Ryu and Eick [7] used GP, we
H. Yin et al. (Eds.): IDEAL 2002, LNCS 2412, pp. 62–67, 2002. © Springer-Verlag Berlin Heidelberg 2002
have developed a much simpler algorithm which does not use any conventional
crossover and mutation operators. Instead, the idea of self-adaptation at the
gene level is exploited. Our initial experimental studies have shown that such a
simple scheme is very easy to implement, yet very effective and efficient.
The rest of this paper is structured as follows. Section 2 describes the architecture of the proposed framework, justifying design decisions made and explaining
the benefits and drawbacks that were perceived in the process. Section 3 presents
initial results obtained with the framework, and Section 4 concludes the paper
with a brief discussion of future work that is planned.
2 Evolving SQL Queries
It was necessary to find a way of representing SQL queries genotypically, to
allow for the application of evolutionary search operators. Another issue was
the design of a fitness function to apply evolutionary pressure to the queries, to
guide them towards the correct classification rules.
Genotypes were required to encode the list of conditional constraints that
specify the criterion by which records should be selected. Each conditional constraint in SQL follows the structure [attribute name] [logical operator] [value].
This sequence was chosen as the basic unit of information, or 'gene', from which
genotypes would be constructed. Genotypic representations varied randomly in
length.
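As a sketch of this encoding, a gene can be held as an (attribute, operator, value) tuple and a genotype as a list of such tuples. The attribute names, operator set and value ranges below are illustrative assumptions, not taken from the paper's datasets:

```python
import random

# Illustrative attribute domains and operator set (assumptions, not the
# paper's actual datasets).
ATTRIBUTES = {
    "LEGS": [0, 2, 4, 6],
    "PREDATOR": ["true", "false"],
    "FEATHERS": ["true", "false"],
}
OPERATORS = ["=", "<>", "<", ">"]

def random_genotype(probs):
    """Build a variable-length genotype: each attribute is included with its
    current selection probability (0.5 for every attribute initially); an
    included attribute becomes one (attribute, operator, value) gene."""
    genotype = []
    for attr, values in ATTRIBUTES.items():
        if random.random() < probs.get(attr, 0.5):
            genotype.append((attr, random.choice(OPERATORS), random.choice(values)))
    return genotype
```

Because inclusion is decided per attribute, genotype length varies randomly from individual to individual, as the text describes.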
2.1 Evolutionary Search
The algorithm that was implemented is described in this section. 100 genotypes
were constructed by randomly selecting attribute names, logical operators and
values. Each attribute in the dataset began with a 0.5 probability of being included in any given genotype. Genotypes were then translated into SQL by
initialising a String with the value 'SELECT * FROM [tablename] WHERE', and
then appending each gene in the genotype to the end of the String. For example,
a genotype such as this:
(LEGS = 4) (PREDATOR = TRUE) (FEATHERS = FALSE)
(VENOMOUS = FALSE)
would be translated into the following SQL query, through the random addition
of AND and OR conditionals:
SELECT * FROM Animals WHERE LEGS = 4 AND PREDATOR = true
AND FEATHERS = false OR VENOMOUS = false
Such SQL queries, once constructed, were sent to the database, and the results
analysed.
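The translation step above can be sketched as follows; the table name and gene tuples are illustrative, and the AND/OR connectives are chosen at random as the text describes:

```python
import random

def to_sql(genotype, table="Animals"):
    """Translate a genotype (a list of (attribute, operator, value) tuples)
    into an SQL query, joining genes with randomly chosen AND/OR."""
    query = f"SELECT * FROM {table} WHERE"
    for i, (attr, op, value) in enumerate(genotype):
        connective = "" if i == 0 else " " + random.choice(["AND", "OR"])
        query += f"{connective} {attr} {op} {value}"
    return query

genotype = [("LEGS", "=", 4), ("PREDATOR", "=", "true"),
            ("FEATHERS", "=", "false"), ("VENOMOUS", "=", "false")]
print(to_sql(genotype))  # connectives vary from run to run
```

Note that because AND binds more tightly than OR in SQL, the random connectives implicitly group the conditions, which is part of what gives the evolved queries their expressive variety.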
Each genotype was assigned a fitness value according to the extent to which
its results corresponded with a target result set T. The fitness function used was
fitness = 100 - falsePositives - (2 * falseNegatives),
where 100 was an arbitrarily chosen constant. This fitness function was adapted
from a paper by Ryu and Eick [7] on deriving queries from object-oriented databases. falsePositives is the number of records that were incorrectly identified as belonging to T, and falseNegatives is the number of records that should have been included in T, but were not. The fitness function punishes false
negatives more than it punishes false positives. If a query returns no false negatives, but several false positives, it can be seen to be correctly identifying the
target result set, but generalising too much, whereas a query that returns false
negatives is simply incorrect. By punishing false negatives more, it was hoped
to apply evolutionary pressure that would favour queries that better classified
the training data.
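The fitness function can be sketched directly. Representing the query result and the target set T as sets of record identifiers is our assumption; the paper does not specify the data structures:

```python
def fitness(result_set, target_set):
    """The paper's fitness function: 100 - falsePositives - 2*falseNegatives.
    Both arguments are assumed to be sets of record identifiers."""
    false_positives = len(result_set - target_set)   # returned but not in T
    false_negatives = len(target_set - result_set)   # in T but not returned
    return 100 - false_positives - 2 * false_negatives

print(fitness({1, 2, 3}, {1, 2, 3}))     # perfect classifier: 100
print(fitness({1, 2, 3, 9}, {1, 2, 3}))  # one false positive: 99
print(fitness({1, 2}, {1, 2, 3}))        # one false negative: 98
```

The asymmetric penalties are visible here: a missed record costs twice as much as a spurious one, matching the argument that over-general queries are preferable to incorrect ones.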
After assigning fitness values for the 100 queries, the best and worst three
were selected. If a perfect classifier was found (with fitness of 100) the evolution would terminate, otherwise the attributes would have their probabilities
re-weighted. Every attribute that appeared in the top three fittest genotypes
had its selection probability incremented by 1%. Every attribute in the worst
three genotypes had its probability decremented by 1%.
The old genotypes were then discarded, and a new set of 100 genotypes was randomly created using the self-adapted probabilities. Over a number of generations, attributes that contributed to higher fitness values came to dominate the genotype set, whereas attributes that contributed little featured less and less.
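The re-weighting step can be sketched as follows. Clamping the probabilities to [0, 1] is our assumption, since the paper does not say how out-of-range probabilities are handled:

```python
def reweight(probs, best, worst, step=0.01):
    """Self-adaptation step: every attribute appearing in the three fittest
    genotypes gains 1% selection probability; every attribute in the three
    worst loses 1%. Genotypes are lists of (attribute, operator, value)."""
    new = dict(probs)
    for genotype in best:
        for attr, _op, _val in genotype:
            new[attr] = min(1.0, new[attr] + step)   # clamp is an assumption
    for genotype in worst:
        for attr, _op, _val in genotype:
            new[attr] = max(0.0, new[attr] - step)
    return new
```

Each new generation is then drawn afresh from these probabilities rather than from the surviving individuals, which is the key departure from conventional EAs discussed below.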
2.2 Discussions
Our algorithm departs from the metaphor commonly used in evolutionary algorithms; however, it does offer a mechanism through which the genotypes iteratively converge on the sector of the search space that offers the greatest classification utility. Although the genetic information of parents is not inherited directly by offspring, the genetic information in the whole population is inherited by the next population. Such inheritance is probabilistically biased toward more useful genetic material, so more useful genetic material occurs more frequently in a population. It is hoped that classification rules may be discovered as a consequence.
3 Experimental Studies
Several experiments have been carried out to evaluate the effectiveness and efficiency of the proposed framework. All datasets were downloaded from the UCI Machine Learning Repository (http://www1.ics.uci.edu/~mlearn/MLRepository.html). Each dataset was tested with 20 independent runs. If a perfect classifier was not found after 100 generations, the best classifier found to date was returned. The results were averaged over the 20 runs and are presented below.
3.1 The Zoo Dataset
The Zoo dataset contains data items that describe animals. In total 14 attributes
are provided, of which 13 are boolean and one has a predefined integer range.
The animals are classified into 7 different types. Table 1 describes the results
from the Zoo dataset. 'ANG' refers to the average number of generations that it took for our algorithm to find a perfect classifier.
Table 1. Results for the Zoo dataset, showing performance of the evolved classifying queries for each animal type. The results were averaged over 20 runs.

Type  False Positives  False Negatives   ANG  Accuracy
1     0                0                 0.8  100.0%
2     0                0                 0.7  100.0%
3     1                0                 n/a   83.3%
4     0                0                 4.7  100.0%
5     0                0                21.0  100.0%
6     0                0                44.5  100.0%
7     2                0                 n/a   83.3%
It can be seen that our algorithm performed well on most of the classification
tasks. The two instances in which it failed to find perfect classifiers are the
most difficult tasks within the dataset, as both tasks involve a very small set of
animals. In both cases, however, the best queries did not include false negatives.
3.2 Monk's Problems
The Monk's Problems dataset involves data items with six attributes, all of which are predefined integers between 1 and 4. The first Monk's problem is the identification of data patterns where (B = C) or (E = 1). The second problem is the identification of all data patterns that feature exactly two of (B = 1, C = 1, D = 1, E = 1, F = 1 or G = 1). The third Monk's problem is the identification of data patterns where (F = 3 and E = 1) or (F != 4 and C != 3), and features 5% noise added to the training set. The results averaged over 20 runs are summarised in Table 2.
Our algorithm performed perfectly on the first problem, and very well on the
third, but performed poorly on the second problem. Part of the reason lies in SQL's inherent difficulty in expressing the desired conditions. The second Monk's problem requires a solution that compares relative attribute values, whereas SQL is usually used to select records according to a set of disjunctive attribute constraints.
Table 2. Results for the Monk's Problem datasets, showing performance of the best queries for each problem. 'ANG' refers to the average number of generations that it took for our algorithm to find a perfect classifier.

Type       False Positives  False Negatives   ANG  Accuracy
Problem 1  0                0                40.6  100.0%
Problem 2  0                85                n/a   16.9%
Problem 3  5                3                 n/a   94.7%

3.3 Credit Card Approval
The credit card approval dataset contains anonymised information on credit
card application approvals and rejections. The dataset contains a variety of attribute types, with some attributes having predefined values and others having
continuous values. The dataset also features 5% noise.
Our algorithm succeeded in correctly identifying, on average, 82.9% of the rejections. However, this relative success is countered by the fact that the classifier also included a large number of false positives: 101 on average, accounting for nearly 20% of the dataset size.
3.4 Discussion of the Results
The results for the Zoo and Monk's Problem datasets are encouraging. Our algorithm demonstrates the poorest performance on the second Monk's problem, which may be because the problem is not structurally conducive to an SQL-based classification rule, although future refinements of our algorithm will hopefully improve upon these results.
The results with the credit card approval dataset also show room for improvement. This may be due to its inclusion of continuous variables. Our algorithm performs poorly with continuous-valued attributes because, although it can identify attributes that are valuable in making a classification, it cannot make the same distinction for logical operators or values. The algorithm needs to find suitable operators and values, as well as attributes, for good classification. It is proposed that logical operators will be given initial selection probabilities as well, which will be incremented or decremented according to the effect they have upon the fitness value of their genotype.
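As a purely hypothetical sketch of this proposed extension (the paper only outlines the idea; the names, initial weights and update rule here are all our assumptions):

```python
import random

# Hypothetical: operators, like attributes, carry self-adapted selection
# probabilities. Initial weights are illustrative.
OPERATOR_PROBS = {"=": 0.25, "<>": 0.25, "<": 0.25, ">": 0.25}

def pick_operator(probs):
    """Choose an operator in proportion to its current weight."""
    ops = list(probs)
    return random.choices(ops, weights=[probs[o] for o in ops], k=1)[0]

def reweight_operator(probs, op, delta):
    """Nudge one operator's weight up or down after a fitness outcome,
    clamped to a small positive floor so no operator vanishes entirely."""
    new = dict(probs)
    new[op] = min(1.0, max(0.01, new[op] + delta))
    return new
```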
4 Conclusions
By using evolutionary computation techniques to evolve SQL queries, it is possible to create a data mining framework that both produces easily readable results and can be applied to any SQL-compliant database system. The problem considered here is somewhat different from the conventional classification problem. The key question we are addressing here is: Given a subset of data in a
large database, how can we gain a better understanding of them? Our solution
is to evolve human comprehensible SQL queries that describe the data.
The algorithm proposed in this paper differs from many traditional evolutionary algorithms, in that it does not use the metaphor of selection, whereby
the fittest individuals have their traits inherited by the new generation of individuals, through operations such as crossover or mutation. Rather, it rewards
the attributes that make individuals successful, and then iterates the initial step
of creation. In other words, rather than survival of the fittest, this work operates upon the principle of survival of the qualities that make the fittest fit.
Although many genetic algorithms feature mutation, it is usually scaled down
so that it does not destroy any useful structures that evolution may have already
constructed. This approach differs in that it divorces the importance of an attribute from the values that the attribute happens to have in a given gene. As such, it effects an 'evolutionary liquidity' that in turn results in an appealingly diverse population, more likely to distribute itself over the entire search space than to converge on a local optimum.
Although our preliminary experimental results are promising, they also offer
room for improvement. It is hoped that future improvements with regard to
dealing with continuous variables will improve performance.
References
1. X. Yao and Y. Liu, 'A new evolutionary system for evolving artificial neural networks,' IEEE Transactions on Neural Networks, 8(3):694-713, May 1997.
2. X. Yao and Y. Liu, 'Making use of population information in evolutionary artificial neural networks,' IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 28(3):417-425, June 1998.
3. Y. Liu, X. Yao and T. Higuchi, 'Evolutionary ensembles with negative correlation learning,' IEEE Transactions on Evolutionary Computation, 4(4):380-387, November 2000.
4. J. Bobbin and X. Yao, 'Evolving rules for nonlinear control,' in M. Mohammadian (ed.), New Frontier in Computational Intelligence and its Applications, IOS Press, Amsterdam, 2000, pp. 197-202.
5. A. A. Freitas, 'A genetic programming framework for two data mining tasks: classification and knowledge discovery,' in Genetic Programming 1997: Proc. 2nd Annual Conference, Stanford University, 1997, pp. 96-101.
6. A. A. Freitas, 'A survey of evolutionary algorithms for data mining and knowledge discovery,' in A. Ghosh and S. Tsutsui (eds.), Advances in Evolutionary Computation, Springer-Verlag, 2001.
7. T. W. Ryu and C. F. Eick, 'Deriving queries from results using genetic programming,' in Proc. 2nd International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1996, pp. 303-306.