Vol 8. No. 3 – September, 2015
African Journal of Computing & ICT
© 2015 Afr J Comp & ICT – All Rights Reserved - ISSN 2006-1781
www.ajocict.net
A Query Optimization Application in Database Management System Using
Rough-Genetic Algorithm
P. Enyindah & P.O. Asagba
Department of Computer Science
University of Port Harcourt
Rivers State, Nigeria
Tel: +234 8036710489, +234 8034857781
ABSTRACT
An improved rough-genetic system framework has been implemented for optimal query processing in a database management
system. The system uses rough-set principles to summarize the database and remove duplicate values, while a genetic algorithm
(GA) is used to improve classification and prediction accuracy through an evolutionary structure. The evolved GA structure is
automatically integrated into the Structured Query Language (SQL) database management system (DBMS) using a database
schema based on the Optimal Query Structure (OQS) for optimal GA processing. The genetic algorithm approach ensures that
an incorrect order of entry in the data input fields does not affect the performance of the prediction process: it generates a
population of randomly mutated attributes from the parent set and performs a fitness check on each population of selected
individuals. A random-mutation operation evolves a new set of individual solutions while automatically updating the OQS. The
system has been applied to a plant species database, and the results were satisfactory, with about a 5% improvement
over the traditional SQL/Data Mining Query Language (DMQL) approach.
Keywords: Rough sets, genetic algorithms, Optimal Query Structure, random-mutation
African Journal of Computing & ICT Reference Format:
P. Enyindah & P.O. Asagba (2015): A Query Optimization Application in Database Management System Using Rough-Genetic Algorithm.
Afr J. of Comp & ICTs. Vol 8, No. 3. Pp 181-188.
1. INTRODUCTION
The issue of query optimization in DBMS has generated a lot
of interest, with several attempts to apply data mining
techniques and even to evolve Data Mining Query Languages
for this purpose. The authors of [1] briefly introduced what
they consider the major issues to be addressed in parallel query
optimization. The issues tackled were mainly the
placement of data in memory, concurrent access to data,
and some algorithms for parallel query processing. These
algorithms were restricted to parallel joins. The authors
describe, in a very condensed way, data placement, static and
dynamic query optimization methods, and the accuracy of the
cost model. Nevertheless, they do not show how to compare the
two optimization approaches or how to choose the
appropriate one.
2. STATEMENT OF THE PROBLEM
An efficient query optimization strategy for
modern DBMSs is a recurring problem in
industry and academia. Several research efforts aimed at
improving query response times and reducing storage
requirements are currently under way, in particular in the
area of data-mining-based queries for DBMSs.
However, there is a need to implement query optimizer test bed
applications that include a comprehensive set of queries and
are reliable and time efficient.
3. AIM AND OBJECTIVES
The aim of this paper is to develop an improved query
optimization application for a Database Management System.
The specific objectives are:
i) To develop analytical attributes and data mining models
that will speed up queries and improve the classification
accuracy of the summarised dataset.
ii) To develop an application that will implement a
data mining query language.
4. RELATED WORK
Several scheduling strategies for pipelined operators have also
been proposed. To improve the response time, an
execution model was developed that ensures the best trade-off
between parallel execution and communication overhead. [2]
proposed a data mining query language dubbed “DMQL” for
relational database management systems.
The design was inspired by an application they developed
called DBMiner. DBMiner is a graphical user interface
(GUI) application that facilitates queries on a DMQL-inspired
database engine. Thus, their goal was to provide the
necessary primitives for data mining engines to work on. In [3],
four algorithms (Maximum, MinDp, MaxDp, and Rate-Match)
were proposed to determine the degree of join parallelism
independently of the initial data placement. The originality of
the Rate-Match algorithm is that it tries to match the production
rate of an operator's result tuples with the consumption rate of
the next operator. The authors then describe six alternative
methods of allocating processors to the clones of a single join
operator. These are based on heuristics such as random or
round-robin strategies, and on a model that takes into account
the effect of resource contention.
[6] identified that traditional database systems expect all data
to conform to an explicitly specified rigid schema. However, a
vast amount of the information available today
is semi-structured, that is, irregular or incomplete. They
observed that it was difficult and inefficient to manage such
data using traditional relational or object-oriented
systems, which were designed primarily for well-structured
data. The researchers overcame this bottleneck by developing
a database management system called “LORE”, whose sole
purpose was storing and querying semi-structured data. [7]
performed an experimental study of three heuristic
algorithms, Simulated Annealing (S.A), Tabu Search (T.S)
and Genetic Algorithms (G.A), for the database utilities
scheduling problem. They found that S.A performed
better than T.S and G.A, although
T.S and G.A also fared reasonably well.
In [4], the multi-join process in a multi-user context was of
primary interest. The authors categorized the system state in
terms of multi-resource contention. They studied, more generally,
relational query optimization on a shared-nothing architecture.
The Modular Parallel query Optimizer (MPO) dynamically
determines the degree of intra-operation parallelism of the join
operators of a bushy tree. The authors suggest a dynamic
heuristic for resource allocation in four steps, applied in the
following order: (i) preservation of data locality (or “data
localization”), (ii) size of the memory, (iii) I/O reduction, (iv)
operation serialization of a bushy tree. In [5], a parallel
algorithm to process a query composed of N joins was proposed
for each search space shape (i.e. left-deep tree, right-deep tree
and bushy tree). The authors considered two
methods of hash join, the simple hash join and the hybrid hash
join, and report, for each search space shape, the memory
requirement, the potential scheduling, and the capacity to
exploit the different forms of parallelism.
[8] proposed a data mining query language for knowledge
discovery in a geographical information system. They
postulated that spatial data mining is a process for discovering
interesting but not explicit patterns embedded in both spatial
and non-spatial data. They presented a spatial data mining
object query language (SDMOQL) whose design is based on
the standard object query language (OQL). The SDMOQL
was embedded in a particular geographical information system
known as INGENS (Inductive Geographic Information
System), a prototype GIS that integrates data mining
tools to help users in their task of topographic map
interpretation. The SDMOQL proposed in [8] supports two data
mining tasks:
i. Inducing classification rules.
ii. Discovering association rules.
For both tasks, the language permits the specification of
task-relevant data, the kind of knowledge to be mined, the
background knowledge and the hierarchies, the interestingness
measures, and the visualization of discovered patterns. [9]
used a level-wise Apriori algorithm to optimize an association
rule mining query. Level-wise algorithms have been shown
to work well for association rule mining from sparse data.
However, there are inherent challenges: in many practical
applications, the computation becomes intractable for a
user-given frequency threshold, and the lack of focus leads to
huge collections of frequent item-sets. Among the proposals
concerning parallel relational query optimization, few authors
have proposed a synthesis dedicated to parallel relational query
optimization methods. [9] also investigated two promising issues:
the efficient use of user-defined constraints and the computation
of condensed representations of frequent item-sets. They showed
how the benefits of these two approaches can be combined
in a level-wise algorithm. Their results showed that it can be
used for the discovery of association rules in difficult cases,
i.e. dense and highly correlated data. [10] developed and
implemented a DMQL-inspired language, dubbed
DMQL-457, using a structured programming environment
(Java) for data mining of any DBMS. DMQL-457 is a
streamlined version of DMQL whose major focus is ease
of use and implementation.
The study in [5] covers both the case where the memory resource
is unlimited and the more realistic case where memory is
limited. In the first case, the right-deep tree is best suited
to exploiting parallelism. However, this structure is no longer
the best when memory is limited. Several strategies
allow the capabilities of right-deep trees to be exploited
when memory is limited. The strategy
named "Static Right Deep Scheduling" consists of cutting the
right-deep tree into several separate sub-trees such that the
sum of the sizes of all the hash tables of a sub-tree fits in
memory.
The temporary results of the execution of sub-trees T1, T2,
…, Tn are stored on disk. The drawback of this strategy is
that the number of sub-trees increases with the number of
base relations whose hash tables cannot all be held in memory.
Hence, this method shortens the pipeline chain and increases the
response time. Two methods were proposed, one based on segmented
right-deep trees and the other on zigzag trees.
The objective of these two methods is to avoid
investigating the bushy tree search space and thereby
simplify the optimization process.
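The cutting step of Static Right Deep Scheduling can be illustrated with a small Java sketch. It is not taken from [5]: the class name RightDeepSegmenter, the greedy left-to-right cut, and the sample sizes are all assumptions made for illustration. The only property carried over from the description above is that the hash tables of each sub-tree must fit in memory together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: cut a right-deep chain of hash joins into
// consecutive sub-trees whose hash tables fit in memory together.
final class RightDeepSegmenter {

    // hashTableSizes: size of each join's hash table, in chain order.
    // memoryBudget: memory available for hash tables (same unit as sizes).
    static List<List<Integer>> segment(int[] hashTableSizes, int memoryBudget) {
        List<List<Integer>> segments = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int used = 0;
        for (int size : hashTableSizes) {
            if (size > memoryBudget) {
                throw new IllegalArgumentException("hash table larger than memory: " + size);
            }
            // Assumed greedy rule: start a new sub-tree as soon as the next
            // hash table would overflow the budget.
            if (used + size > memoryBudget) {
                segments.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(size);
            used += size;
        }
        if (!current.isEmpty()) {
            segments.add(current);
        }
        return segments;
    }

    public static void main(String[] args) {
        // prints [[40, 30], [50, 20], [60]]
        System.out.println(segment(new int[] {40, 30, 50, 20, 60}, 100));
    }
}
```

Each inner list is one sub-tree T1, T2, …, Tn. A smaller memory budget yields more sub-trees and therefore more intermediate results written to disk, which is the drawback noted above.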
Using DMQL-457, on-line analytical processing (OLAP) for a
test database schema (or data cube) was achieved with
reasonable execution times. [11] developed an adaptive
genetic algorithm with dynamic population size for finding the
optimal join ordering when executing a query on an RDBMS.
Because of the high processing cost, the author identified the
evaluation of joins and their ordering as the primary focus of
query optimization. However, the author focused on the
optimization of only one particular type of query, the
Selection-Projection-Join (SPJ) query. [12] proposed an
intelligent query answering system on three real-life data sets
(KDD99, Cover-type and Iris) using rough sets and G.As.
Adaptive classification was achieved by reinforcing rough-set
reducts with G.As, with good execution times on the
aggregate functions and reasonably good classification
prediction accuracies for the KDD99 and Iris data sets (98.3%
and 97.65% respectively). However, for the Cover-type data
set the classification accuracy was low, at 64.2%. Also, the
average concept hierarchy prediction accuracy was given only
for KDD99 and Cover-type, with predictions of 95.9% and
61.2% respectively. [13] proposed a genetic algorithm
technique to perform multi-join operations on data in active
data warehousing, retrieving data based on multiple queries.
Using a G.A, they were able to perform the multi-join
operation efficiently using the cross-over, mutation and
selection operators, which in turn improved the data retrieval
process as the number of relational tables increased. [14]
reviewed the field of data mining using neural networks and
genetic algorithms. They described data mining as
a process designed to analyze and explore data in search of
consistent patterns or systematic relationships
between variables, and then to validate the findings by
applying the detected patterns to new subsets of data. They
described a neural network as a collection of many
processing elements called neurons, in which the neurons
are interconnected and each interconnection has
a weight associated with it. They described a genetic
algorithm as an adaptive heuristic, random, global and direct
search method based on imitation of the natural biological
evolution mechanism. The authors concluded that neural
networks and genetic algorithms are two good data mining
tools widely used for classification and prediction on
complex datasets.
5. METHODOLOGY
A rough-genetic approach using object-oriented techniques
was employed. This approach builds on the principles of
rough sets and genetic algorithms, using a set of structured
classes for the development of an improved DBMS (GOptima).
5.2 Rough-Genetic Principles
The rough-genetic scheme for an optimal query processing
system demands that the information system (IS) be
summarized prior to data mining. We define a rough-genetic
algorithmic system that follows a different approach at the
genetic end (mutation before cross-over). Fig 1 shows the
proposed algorithmic scheme for optimal query processing.
Initialize information system (IS):
1. Summarize data set: ISn = summary(IS)
2. for (attributes a1, a2, …, an ∈ ISn)
3.     Set arg = arg1 + arg2 + … + argn
4.     mutate(arg)
5.     crossover(arg)
6.     Compute fitness
7.     if (fitness <= fitness_criterion)
           a. break
8.     end if
9. end for
Fig 1: Algorithmic Scheme for Optimal Query Processing
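The scheme in Fig 1 can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the GOptima implementation: the class RoughGeneticSketch, the duplicate-row summary standing in for the rough-set step, the swap mutation, the single-point crossover, and the mismatch-count fitness (lower is better, so the loop stops once fitness <= criterion) are all hypothetical choices. The sketch also checks the unmodified parent before the first generation, which Fig 1 does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of Fig 1: summarize, then mutate before crossover.
final class RoughGeneticSketch {

    // Step 1 (assumed stand-in for rough-set summarization): drop duplicate rows.
    static List<double[]> summarize(List<double[]> rows) {
        Set<String> seen = new LinkedHashSet<>();
        List<double[]> summary = new ArrayList<>();
        for (double[] row : rows) {
            if (seen.add(Arrays.toString(row))) {
                summary.add(row);
            }
        }
        return summary;
    }

    // Assumed fitness: fewest mismatching positions against any row (0 = exact match).
    static int fitness(double[] arg, List<double[]> summary) {
        int best = arg.length;
        for (double[] row : summary) {
            int misses = 0;
            for (int i = 0; i < arg.length; i++) {
                if (arg[i] != row[i]) misses++;
            }
            best = Math.min(best, misses);
        }
        return best;
    }

    // Step 4: random mutation, here a swap of two randomly chosen alleles.
    static double[] mutate(double[] arg, Random rnd) {
        double[] child = arg.clone();
        int i = rnd.nextInt(child.length);
        int j = rnd.nextInt(child.length);
        double tmp = child[i];
        child[i] = child[j];
        child[j] = tmp;
        return child;
    }

    // Step 5: single-point crossover; prefix from a, suffix from b (equal lengths).
    static double[] crossover(double[] a, double[] b, Random rnd) {
        int cut = rnd.nextInt(a.length + 1);
        double[] child = a.clone();
        System.arraycopy(b, cut, child, cut, b.length - cut);
        return child;
    }

    // Steps 2-9: evolve from the parent set until the fitness criterion is met.
    static double[] evolve(double[] parent, List<double[]> rows,
                           int maxGenerations, int criterion, long seed) {
        List<double[]> summary = summarize(rows);
        Random rnd = new Random(seed);
        double[] best = parent.clone();
        int bestFit = fitness(best, summary);
        for (int g = 0; g < maxGenerations && bestFit > criterion; g++) {
            double[] child = crossover(mutate(parent, rnd), parent, rnd);
            int fit = fitness(child, summary);
            if (fit < bestFit) {
                best = child;
                bestFit = fit;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> rows = Arrays.asList(
                new double[] {5.1, 3.5}, new double[] {5.1, 3.5}, new double[] {4.9, 3.0});
        // Values entered in reversed order; the loop searches for a matching arrangement.
        System.out.println(Arrays.toString(evolve(new double[] {3.5, 5.1}, rows, 50, 0, 42L)));
    }
}
```

Because each generation restarts from the parent set, the search is stochastic: more generations raise the chance of reaching the criterion but do not guarantee it.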
5.3 Storage/Database Structure and Specification
In every information system, a domain of study needs to be
specified [16]. In this study, the IRIS dataset, a plant species
database, has been used because of its popularity in the
literature as a benchmark domain for studying the effectiveness
of data mining algorithms and techniques. The domain
scheme is shown in Table 1.
Table 1: Domain Scheme for Analysis
ID  Attribute 1 (PV)  Attribute 2 (PV)  Attribute 3 (PV)  Attribute 4 (PV)  Species (DV)
1   5.4               3.4               1.7               0.2               Iris-setosa
2   5.1               3.7               1.5               0.4               Iris-setosa
3   7                 3.2               4.7               1.4               Iris-versicolor
4   6.4               3.2               4.5               1.5               Iris-versicolor
Key:
DV – Decision Variable
PV – Prediction Variable
[15] proposed an optimization for data flow specifications known
as pack programs that is able to reorder operators with
MapReduce-style UDFs (user-defined functions) written in an
imperative language. This approach leverages static code
analysis to extract information from UDFs, which is used to
reason about the reorderability of UDF operators. The
process allows a user to peek step by step into each phase of
the optimization process, and finally a chosen execution plan
is executed in parallel, using a set of analytical
data flow programs from relational and non-relational domains.
In this paper, a rough-genetic application (GOptima) has been
developed for the mining of knowledge in a database.
This approach was used so that the genetic algorithm (GA)
structure can easily be adapted to any database. All that is
needed is to specify the attributes in the developed
framework.
Feature (Attribute) Selection and Labelling
The following features of the IRIS dataset are
utilized:
i) the plant species: any of Iris-setosa, Iris-versicolor, and Iris-virginica
ii) the plant attributes: sepal-length, sepal-width, petal-length and petal-width
Based on the selected features, the domain has the form shown
in Fig 2.
5.5 Output/Input Specifications
Input-output data are captured after a connection to the database
has been established. The database result set object serves as the
source container from which other primitive data types may
derive functionality. A fitness criterion is defined in a fitness
class. Tables 2 and 3 show the input and output specifications.
Table 2: Input specifications
ID  Attribute                        No of searches  Property
1   Plant length No, Plant width No  10-50           Numeric, string
2   Plant length No, Plant width No  10-50           Numeric, string
3   Plant length No, Plant width No  10-50           Numeric, string
Table 3: Output specifications
ID  Plant species    No of searches  Bit change
1   Iris-setosa      10-50           0-1
2   Iris-versicolor  10-50           0-1
3   Iris-virginica   10-50           0-1
5.4 Data Querying Structure
The data querying structure takes two forms: one based on
standard SQL, and a data mining structure optimized for SQL.
For the standard case, a typical query on the
IRIS dataset has the form:
String s1 = "SELECT * FROM IRIS WHERE Sepallength = '5.1' AND Sepalwidth = '3.2'";
where
IRIS = table in Relational Data Model
* = All attributes
Sepallength = Attribute 1
Sepalwidth = Attribute 2
Data Mining Structure Optimized for SQL
The optimal SQL query structure for use with the genetic
algorithm (GA) takes the form:
String s1 = "SELECT ID FROM IRIS WHERE sa ⊗ sb";
Here, sa and sb represent the attributes selected for
optimal query processing, and
sa = A1
sb = A2
⊗ = AND
A1 = Sepallength
A2 = Sepalwidth.
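The optimized query form of this section, SELECT ID FROM IRIS WHERE sa ⊗ sb with ⊗ = AND, can be assembled programmatically once the GA has selected its attributes. The following Java sketch shows one possible way; the class QueryBuilderSketch, the method buildQuery, and the use of "?" placeholders for later value binding are assumptions for illustration and are not described in the paper.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper: builds "SELECT ID FROM <table> WHERE sa AND sb ..."
// from the attribute names chosen for optimal query processing.
final class QueryBuilderSketch {

    static String buildQuery(String table, List<String> attributes) {
        if (attributes.isEmpty()) {
            throw new IllegalArgumentException("at least one attribute is required");
        }
        // The ⊗ operator of the paper is AND; each attribute gets a "?"
        // placeholder (an assumption) so values can be bound separately.
        StringJoiner where = new StringJoiner(" AND ");
        for (String attribute : attributes) {
            where.add(attribute + " = ?");
        }
        return "SELECT ID FROM " + table + " WHERE " + where;
    }

    public static void main(String[] args) {
        // prints SELECT ID FROM IRIS WHERE Sepallength = ? AND Sepalwidth = ?
        System.out.println(buildQuery("IRIS", List.of("Sepallength", "Sepalwidth")));
    }
}
```

Table and attribute names are concatenated directly, so in this sketch they should come only from the trusted schema definition, never from end-user input. The placeholders could then be bound through a java.sql.PreparedStatement.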
Fig 2: Domain Scheme for Analysis with feature labels specified
5.6 Rough-Genetic Computational Class
The object-oriented paradigm encourages the use of structured classes. The core classes have been developed, and these are
shown in Fig 3.
Fig 3: Computational Class Structure of Proposed System
6. SYSTEMS TESTING AND RESULTS
After the program was written and debugged, the DBMS
needed to be tested and deployed. Testing was done to assess
the efficiency of the program. The testing procedure is outlined
as follows:
1. Run the main application.
2. Enter numerical values of sepal length and sepal width, using the data as a guide.
3. Click the submit query button.
4. Read and record the values.
Results of the tests are tabulated in Table 4 using the
equality aggregator. The results compare
standard SQL with the genetic algorithm (GA) optimized
SQL for a DBMS. The Query Attributes field represents the
expected attribute values (alleles) for which the end-user
requests a report. The entry process is generalized in the
sense that the end-user may enter any measured or
specified plant attributes to discover the species class. The
standard SQL queries were run using the standard Java
output console to simplify the analysis report. The results
show good performance of the GA-optimized SQL, which
compared favourably with standard SQL on the
select and aggregate queries for fewer than 50 generations.
With deceptive pattern mining, captured by reversing
the alleles, the GA-optimized SQL outperformed
standard SQL, which returned empty results. The reason for
the GA's success over standard SQL is that the GA seeks
to create a new population of attribute pairs for each
generation in the evolution process.
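The deceptive-pattern case, in which the two attribute values are entered in reversed order, can be illustrated with a short Java sketch. It deliberately replaces the GA's random search with a direct check of both allele orders, so it shows only why an exact-match query comes back empty while an order-insensitive search succeeds. The class DeceptivePatternSketch, its methods, and the three in-memory rows (taken from the query values in Table 4) are assumptions for illustration.

```java
import java.util.Optional;

// Hypothetical illustration of reversed-allele ("deceptive") queries.
final class DeceptivePatternSketch {

    // Illustrative rows: {sepal length, sepal width} and the species of each row.
    static final double[][] ROWS = {{5.1, 3.5}, {4.9, 3.0}, {7.0, 3.2}};
    static final String[] SPECIES = {"Iris-setosa", "Iris-setosa", "Iris-versicolor"};

    // Mirrors the standard equality query: both values must match in the given order.
    static Optional<String> exactMatch(double sepalLength, double sepalWidth) {
        for (int i = 0; i < ROWS.length; i++) {
            if (ROWS[i][0] == sepalLength && ROWS[i][1] == sepalWidth) {
                return Optional.of(SPECIES[i]);
            }
        }
        return Optional.empty();
    }

    // Tries the entered order first, then the swapped order. This exhaustive
    // check stands in for the populations of attribute pairs a GA would generate.
    static Optional<String> matchAnyOrder(double first, double second) {
        Optional<String> hit = exactMatch(first, second);
        return hit.isPresent() ? hit : exactMatch(second, first);
    }

    public static void main(String[] args) {
        System.out.println(exactMatch(3.5, 5.1));    // prints Optional.empty
        System.out.println(matchAnyOrder(3.5, 5.1)); // prints Optional[Iris-setosa]
    }
}
```

With only two attributes there are just two orders to try; with more attributes the number of arrangements grows factorially, which is where a randomized search such as the GA becomes the more practical option.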
Table 4: Comparing Standard SQL with GA-optimized SQL
Query Attributes                            Classified Species
Plant Attribute 1     Plant Attribute 2     Standard SQL     GA Optimized SQL
(e.g. Sepal-length)   (e.g. Sepal-width)
5.1                   3.5                   Iris-Setosa      Iris-Setosa
4.9                   3.0                   Iris-Setosa      Iris-Setosa
7                     3.2                   Iris-versicolor  Iris-versicolor
3.5                   5.1                   Empty            Iris-Setosa
4.9                   3.0                   Empty            Iris-Setosa
A snapshot of the running application is shown in Fig 4.
Running Patterns
Running patterns describe the nature of the GA query using a classification aggregate query. This is depicted in Fig 4.
Fig 4: Running Pattern using the = Aggregate query for 10 Searches
7. CONCLUSIONS
In conclusion, genetic algorithms and rough sets play a crucial
role in optimal query processing if properly planned. Using an
object-oriented approach and simple data structures can
assure the quality of the data mining process and thus
eliminate the need for expensive techniques such as a
data mining query language (DMQL). Increasing the number
of generations involved in the program solution does not
necessarily make the predictions much better in certain
circumstances. Thus, a trade-off has to be made between the
required precision and the query load or time.
8. RECOMMENDATIONS FOR FUTURE WORK
A genetic algorithm is a proven data mining algorithm of
choice if efficient and accurate database systems are to be
built. The developed system can thus bring more efficient
and accurate data mining features into a database
management system. Using the system, database engineers
can approach query optimization in a more dynamic and
object-oriented way, which can make the end-user
applications developed more robust. This application will
therefore be useful in modern intelligent database
products in academia and industry. In future, this application
can also be integrated into mobile computing environments in a
platform-independent way.
REFERENCES
[1] Hasan, W., Ganguly, S., Krishnamurthy, R. (1992). Query
optimization for parallel execution. Proceedings of the ACM
SIGMOD International Conference on Management of Data, pp. 1-10.
[2] Han, J., Fu, Y., Wang, W., Koperski, and Zaniane, O.R. (1996).
A Data Mining Query Language for Relational Databases, DMQL.
Montreal, Canada, pp. 27-33.
[3] Mehta, Manish, David J. DeWitt (1997). Managing
Intra-Operator Parallelism in Parallel Database
Systems. gsl.azurewebites.net/.../0/.../VLDB95
[4] Brunie, N., Chaudhuri, S. (1997). Multi-join process
in a Multi-user context. www.csd.uoc.gr/.../...
[5] Schneider, Vinect Singh, David DeWitt (1990).
Processing Complex Join Queries via Hashing in
Multiprocessor Database Machines.
[6] McHugh, J.G. (2000). Data Management and Query
Processing for Semi-Structured Data, PhD Thesis,
Stanford University.
[7] Xu, Z. (2001). “Automatically Scheduling
Database Utilities”, M.Sc. Thesis, Dept. of
Computing and Information Science, Queen’s
University. Manchester.
[8] Malerba, Donato, Annalisa Appice, Michelangelo
Ceci (2004). A data mining query language for
knowledge discovery in a geographical information
system. Lecture Notes in Computer Science, Vol.
2682, pp. 95-116.
[9] Jeudy, B., Boudicaut, J.F. (2002). Optimization of
Association Rule Mining, Journal of Intelligent
Data Analysis, IOT Press, pp. 341-357.
[10] Scanner (2003). MIE457F Project,
http://www.cs.toronto.edu/~ssanner/Projects/index.html
[11] Vellev, S. (2008). Review of Algorithms for the Join
Ordering Problem in Database Query Optimization,
Journal of Information Technology and Control,
2009, pp. 32-40.
[12] Srinivasa, K.G., Venugopal, K.R., and Patnaik (2008).
“A soft computing approach for data
mining based query processing using rough sets
and genetic algorithms”, International Journal of
Hybrid Intelligent Systems, Vol. 5, pp. 1-17.
[13] Paramasivam, K., Chandraskar, C. (2012). Multi-Join
operation using genetic algorithms in active
data warehouse. Asian Journal of Computer Science
and Information Technology, Vol. 2.5, pp. 123-127.
[14] Rahi, P., Gupta, B., and Bisht, S.S. (2014). Data
Mining Using Neural-Genetic Approach – A
Review. International Journal of
Engineering Research and Applications, Vol. 4,
Issue No. 4, pp. 36-42.
[15] Fabian, H., Mathias Peters (2012). Peeking into
optimization of data flow programs with map
reduce-style UD.
[16] Marshall (1998). Iris Dataset,
http://archive.ics.uci.edu/ml/datasets/Iris