Vol 8. No. 3 – September, 2015
African Journal of Computing & ICT
© 2015 Afr J Comp & ICT – All Rights Reserved - ISSN 2006-1781
www.ajocict.net

A Query Optimization Application in Database Management System Using Rough-Genetic Algorithm

P. Enyindah & P.O. Asagba
Department of Computer Science
University of Port Harcourt
Rivers State, Nigeria
E-mail: [email protected], [email protected]
Tel: +234 8036710489, +234 8034857781

ABSTRACT
An improved rough-genetic system framework has been implemented for optimal query processing in a database management system. The system uses rough-set principles to summarize the database and remove duplicate values, while a genetic algorithm (GA) is used to improve classification and prediction accuracy through an evolutionary structure. The evolved GA structure is automatically integrated into the Structured Query Language (SQL) database management system (DBMS) using a database schema based on the Optimal Query Structure (OQS) for optimal GA processing. The genetic algorithm approach ensures that an incorrect order of entry in the data input fields does not affect the performance of the prediction process: it generates a population of randomly mutated attributes from the parent set and performs a fitness check on each population of selected individuals. A random-mutation operation evolves a new set of individual solutions while automatically updating the OQS. The system has been applied to a plant species database, and the results obtained were quite satisfactory, with about 5% improvement over the traditional SQL/Data Mining Query Language (DMQL) approach.

Keywords: Rough sets, genetic algorithms, Optimal Query Structure, random-mutation

African Journal of Computing & ICT Reference Format:
P. Enyindah & P.O. Asagba (2015): A Query Optimization Application in Database Management System Using Rough-Genetic Algorithm. Afr J. of Comp & ICTs. Vol 8, No. 3. Pp 181-188.

1.
INTRODUCTION
Query optimization in DBMSs has generated a lot of interest, with several attempts to apply data mining techniques and even to evolve Data Mining Query Languages for this purpose. [1] briefly introduced what they consider the major issues to be addressed in parallel query optimization. The issues tackled were mainly the placement of data in memory, concurrent access to data, and some algorithms for parallel query processing. These algorithms were restricted to parallel joins. The authors describe, in a very condensed way, data placement, static and dynamic query optimization methods, and the accuracy of the cost model. Nevertheless, they do not show how to compare the two optimization approaches or how to choose the appropriate one. There is therefore a need to implement query optimizer test-bed applications that include a comprehensive set of queries and that are reliable and time efficient.

2. STATEMENT OF THE PROBLEM
Devising an efficient query optimization strategy for modern DBMSs is a recurring problem in industry and academia. Several research efforts aimed at improving query response times and reducing storage requirements are currently under way, in particular in the area of data mining based queries for DBMSs.

3. AIM AND OBJECTIVES
The aim of this paper is to develop an improved query optimization application for a Database Management System. The specific objectives are:
i) To develop analytical attributes and data mining models that will speed up queries and improve the classification accuracy of the summarised dataset.
ii) To develop an application that will implement a data mining query language.

4. RELATED WORK
Several scheduling strategies for pipelined operators were also proposed. To improve the response time, they developed an execution model ensuring the best trade-off between parallel execution and communication overhead.
[2] proposed a data mining query language dubbed “DMQL” for relational database management systems. The design was inspired by an application they developed called DBMiner. DBMiner is a graphical user interface (GUI) application that facilitates queries on a DMQL-inspired database engine. Thus, their goal was to provide the necessary primitives for data mining engines to work on.

In [3], four algorithms (Maximum, MinDp, MaxDp, and Rate-Match) were proposed to determine the degree of join parallelism independently of the initial data placement. The originality of the approach is that it tries to match the rate at which an operator produces result tuples to the rate at which the next operator consumes them. The authors then describe six alternative methods of allocating processors to the clones of a single join operator. These are based on heuristics such as random or round-robin strategies, and on a model that takes the effect of resource contention into account.

[6] observed that traditional database systems expect all data to conform to an explicitly specified rigid schema, whereas a vast amount of the information available today is semi-structured, that is, irregular or incomplete. It is difficult and inefficient to manage such incomplete data using traditional relational or object-oriented systems, which were designed primarily for well-structured data. The researchers overcame this bottleneck by developing a database management system called “LORE”, whose sole purpose was querying and storing semi-structured data.

[7] performed an experimental study of three heuristic algorithms, Simulated Annealing (S.A), Tabu Search (T.S) and Genetic Algorithms (G.A), for the database utilities scheduling problem.
They found that S.A performed better than T.S and G.A, although T.S and G.A also fared reasonably well.

In [4], a multi-join process in a multi-user context was of primary interest. The authors categorized system state in terms of multi-resource contention and studied, more generally, relational query optimization on a shared-nothing architecture. The Modular Parallel query Optimizer (MPO) dynamically determines the intra-operation parallelism degree of the join operators of a bushy tree. The authors suggest a dynamic heuristic for resource allocation in four steps, applied in the following order: (i) preservation of data locality (or “data localization”), (ii) size of the memory, (iii) I/O reduction, (iv) operation serialization of a bushy tree.

In [5], a parallel algorithm was proposed to process a query composed of N joins for each search space shape (i.e. left-deep tree, right-deep tree and bushy tree, Cf). The authors considered two hash join methods, the simple hash join and the hybrid hash join, and report, for each search space shape, the memory requirement, the potential scheduling, and the capacity to exploit the different forms of parallelism.

[8] proposed a data mining query language for knowledge discovery in a geographical information system. They postulated that spatial data mining is a process for discovering interesting but not explicit patterns embedded in both spatial and non-spatial data. They presented the design of a spatial data mining object query language (SDMOQL), which is based on the standard object query language (OQL). SDMOQL was embedded in a particular geographical information system known as INGENS (Inductive Geographic Information System), a prototype GIS that integrates data mining tools to help users in their task of topographic map interpretation. The SDMOQL proposed in [8] supports two data mining tasks:
i. Inducing classification rules.
ii. Discovering association rules.
For both tasks, the language permits the specification of the task-relevant data, the kind of knowledge to be mined, the background knowledge and the hierarchies, the interestingness measures, and the visualization of discovered patterns.

[9] used a level-wise Apriori algorithm to optimize an association rule mining query. Level-wise algorithms have been shown to work well for association rule mining from sparse data. However, there are inherent challenges: in many practical applications the computation becomes intractable for a user-given frequency threshold, and the lack of focus leads to huge collections of frequent item sets. [9] also investigated two promising issues, the efficient use of user-defined constraints and the computation of condensed representations of frequent item sets. They showed how the benefits of these two approaches can be combined into a level-wise algorithm. Their results showed that it can be used for the discovery of association rules in difficult cases, i.e. dense and highly correlated data.

Among the proposals concerning parallel relational query optimization, few authors have proposed a synthesis dedicated to parallel relational query optimization methods.

[10] developed and implemented a DMQL-inspired language, dubbed DMQL-457, using a structured programming environment (Java) for the data mining of any DBMS. DMQL-457 is a streamlined version of DMQL with a major focus on ease of use and implementation.

The study includes the case where the memory resource is unlimited and the more realistic case where the memory is limited. In the first case, the right-deep tree is best suited to exploit the parallelism, but this structure is no longer the best when the memory is limited. Several strategies were therefore proposed to exploit the capabilities of right-deep trees when the memory is limited.
The strategy named "Static Right Deep Scheduling" consists of cutting the right-deep tree into several separate sub-trees in such a way that the sum of the sizes of all the hash tables of a sub-tree fits in memory. The temporary results of the execution of sub-trees T1, T2 … Tn are stored on disk. The drawback of this strategy is that the number of sub-trees increases with the number of base relations, and the temporary results are not held in memory. Hence, this method shortens the pipeline chain and increases the response time. Two methods were proposed: one is based on segmented right-deep trees, and the other is based on zigzag trees. The objective of these two methods is to avoid exploring the bushy tree search space and thereby simplify the optimization process.

Using DMQL-457, on-line analytical processing (OLAP) for a test database schema (or data cube) was achieved with reasonable execution times.

[11] developed an adaptive genetic algorithm with dynamic population size for finding the optimal join ordering when executing a query on an RDBMS. Because of their high processing cost, the author identified the evaluation of joins and their ordering as the primary focus of query optimization. However, the author focused on the optimization of only one particular type of query, the Selection-Projection-Join (SPJ) query.

[12] proposed an intelligent query answering system on three real-life data sets (KDD99, Cover-type and Iris) using rough sets and G.As.

5. METHODOLOGY
A rough-genetic approach using object-oriented techniques was employed. This approach builds on the principles of rough sets and genetic algorithms, using a set of structured classes for the development of an improved DBMS (GOPTIMA).
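The abstract describes the rough-set step as summarizing the database and removing duplicate values. The following is a minimal sketch of one way that step could look, assuming rows are grouped into indiscernibility classes (rows with identical prediction attributes) and each class keeps the set of decision values seen. The class name `RoughSummarySketch` and the methods `summarize` and `consistentCount` are hypothetical and are not taken from GOPTIMA.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the authors' implementation.
final class RoughSummarySketch {

    // Groups rows into indiscernibility classes: rows whose condition
    // (prediction) attributes are identical collapse into one entry,
    // which removes duplicate rows from the summary.
    static Map<List<Double>, Set<String>> summarize(double[][] conditions, String[] decisions) {
        Map<List<Double>, Set<String>> classes = new LinkedHashMap<>();
        for (int r = 0; r < conditions.length; r++) {
            List<Double> key = new ArrayList<>();
            for (double v : conditions[r]) {
                key.add(v);
            }
            classes.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(decisions[r]);
        }
        return classes;
    }

    // Counts classes that map to exactly one decision value, i.e. the
    // classes that fall inside the lower approximation of some decision.
    static long consistentCount(Map<List<Double>, Set<String>> classes) {
        return classes.values().stream().filter(d -> d.size() == 1).count();
    }
}
```

Applied to the sepal length and sepal width of a few IRIS rows with one row repeated, the summary holds one entry per distinct attribute pair, so the duplicate disappears before any genetic search is run.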
5.2 Rough-Genetic Principles
The rough-genetic scheme for an optimal query processing system requires that the information system (IS) be summarized prior to data mining. We define a rough-genetic algorithmic system that follows a different approach at the genetic end (mutation before cross-over). Fig 1 shows the proposed algorithmic scheme for optimal query processing.

In [12], adaptive classification was achieved by reinforcing rough-set reducts with G.As, with good execution times on the aggregate functions and reasonably good classification accuracies for the KDD99 and Iris data sets (98.3% and 97.65% respectively). However, for the Cover-type data set the classification accuracy was low at 64.2%. Also, average concept hierarchy prediction accuracy was given only for KDD99 and Cover-type, with values of 95.9% and 61.2% respectively.

[13] proposed a genetic algorithm technique to perform a multi-join operation on operational data in active data warehousing, where data is retrieved based on multiple queries. Using a G.A, they were able to perform the multi-join operation efficiently using the cross-over, mutation and selection operators, which in turn improved the data retrieval process as the number of relational tables increased.

[14] reviewed the field of data mining using neural networks and genetic algorithms. They described data mining as a process designed to analyze and explore data in search of consistent patterns or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. They described a neural network as a collection of many processing elements called neurons, each interconnected with other neurons, with a weight associated with each interconnection. They described a genetic algorithm as an adaptive, heuristic, random, global and direct search method based on imitation of the natural biological evolution mechanism.
The authors concluded that neural networks and genetic algorithms are two good data mining tools widely used for classification and prediction in complex datasets.

Initialize information system (IS):
1. Summarize data set: Isn = summary(IS)
2. for (attributes a1, a2 … an ∈ Isn)
3.    Set arg = arg1 + arg2 + … + argn
4.    Mutate(arg)
5.    crossover(arg)
6.    Compute fitness
7.    if (fitness <= fitness_criterion)
      a. break;
8.    end if
9. end for
Fig 1: Algorithmic Scheme for Optimal Query Processing

5.3 Storage/Database Structure and Specification
In every information system, a domain of study needs to be specified [16]. In this study, the IRIS dataset, a plant species database, has been studied because of its popularity in the literature as a domain benchmark for studying the effectiveness of data mining algorithms and techniques. The domain scheme is shown in Table 1.

Table 1: Domain Scheme for Analysis
ID   Attribute 1 (PV)   Attribute 2 (PV)   Attribute 3 (PV)   Attribute 4 (PV)   Species (DV)
1    5.4                3.4                1.7                0.2                Iris-setosa
2    5.1                3.7                1.5                0.4                Iris-setosa
3    7                  3.2                4.7                1.4                Iris-versicolor
4    6.4                3.2                4.5                1.5                Iris-versicolor
Key: DV – Decision Variable, PV – Prediction Variable

[15] proposed an optimizer for data flow programs, known as pack programs, that is able to reorder operators with MapReduce-style user-defined functions (UDFs) written in an imperative language. This approach leverages static code analysis to extract information from UDFs, which is used to reason about the reorder-ability of UDF operators. The process allows a user to peek step-by-step into each phase of the optimization process and, finally, into the parallel execution of a chosen execution plan, using a set of analytical data flow programs from relational and non-relational domains.

In this paper, a rough-genetic application (GOptima) has been developed for the mining of knowledge in a database. This approach was used so that the genetic algorithm (GA) structure can easily be adapted to any database. All that is needed is to specify the attributes in the developed framework.

Feature (Attribute) Selection and Labelling
The following features of the IRIS dataset are utilized:
i) the plant species – any of Iris-setosa, Iris-versicolor, and Iris-virginica
ii) the plant attributes – sepal-length, sepal-width, petal-length and petal-width
Based on the selected features, the domain has the form shown in Fig 2.

5.5 Output/Input Specifications
Input-output data are captured after a connection to the database has been established. The database result set object serves as the source container from which other primitive data types may derive functionality. A fitness criterion is defined in a fitness class. Tables 2 and 3 show the input and output specifications.

Table 2: Input specifications
ID: 1, 2, 3
Attribute: Plant length No, Plant length No, Plant length No

5.4 Data Querying Structure
The data querying structure takes two forms: one based on standard SQL and one optimized for the genetic algorithm. For the standard case, a typical query on the IRIS dataset has the form:

   String s1 = "SELECT * FROM IRIS WHERE Sepallength = '5.1' AND Sepalwidth = '3.2'";

   IRIS = table in Relational Data Model
   * = All attributes
   Sepallength = Attribute 1
   Sepalwidth = Attribute 2

Data Mining Structure Optimized for SQL
The optimal SQL query structure for use with the genetic algorithm (GA) takes the form:

   String s1 = "SELECT ID FROM IRIS WHERE sa ⊗ sb";

Here, sa and sb represent the attributes selected for optimal query processing, and
   sa = A1
   sb = A2
   ⊗ = AND
   A1 = Sepallength
   A2 = Sepalwidth.
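To make the scheme of Fig 1 and the query structure above concrete, here is a minimal sketch of how a mutate-then-crossover loop could recover a match when the two attribute values are entered in the wrong order. Several things are assumptions rather than details from the paper: the in-memory `ROWS` array stands in for the IRIS table (values from Table 1), mutation is a swap of two positions, crossover is one-point, fitness is the number of matching rows, and the loop stops as soon as fitness is at least 1. The names `GaQuerySketch`, `select` and `optimizedSelect` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not the GOptima implementation.
final class GaQuerySketch {
    // Sepal length and sepal width of the four rows in Table 1, with their species.
    static final double[][] ROWS = {{5.4, 3.4}, {5.1, 3.7}, {7.0, 3.2}, {6.4, 3.2}};
    static final String[] SPECIES =
        {"Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor"};

    // In-memory stand-in for:
    // SELECT ... FROM IRIS WHERE Sepallength = ? AND Sepalwidth = ?
    static List<String> select(double[] alleles) {
        List<String> hits = new ArrayList<>();
        for (int r = 0; r < ROWS.length; r++) {
            if (ROWS[r][0] == alleles[0] && ROWS[r][1] == alleles[1]) {
                hits.add(SPECIES[r]);
            }
        }
        return hits;
    }

    // Mutation (assumed form): swap two distinct positions. Needs at least two values.
    static double[] mutate(double[] parent, Random rng) {
        double[] child = parent.clone();
        int i = rng.nextInt(child.length);
        int j = (i + 1 + rng.nextInt(child.length - 1)) % child.length;
        double tmp = child[i];
        child[i] = child[j];
        child[j] = tmp;
        return child;
    }

    // One-point crossover (assumed form): head of a followed by tail of b.
    static double[] crossover(double[] a, double[] b, Random rng) {
        int cut = rng.nextInt(a.length + 1);
        double[] child = b.clone();
        System.arraycopy(a, 0, child, 0, cut);
        return child;
    }

    // Loop in the order of Fig 1: mutate, cross over, then check fitness.
    static List<String> optimizedSelect(double[] entered, int maxGenerations, long seed) {
        Random rng = new Random(seed);
        double[] parent = entered.clone();
        List<String> hits = select(parent); // the entry as typed is scored first (assumption)
        for (int g = 0; g < maxGenerations && hits.isEmpty(); g++) {
            double[] mutant = mutate(parent, rng);
            double[] child = crossover(mutant, parent, rng);
            List<String> mutantHits = select(mutant);
            List<String> childHits = select(child);
            hits = childHits.size() > mutantHits.size() ? childHits : mutantHits;
            parent = mutant; // carry the permutation forward so no entered value is lost
        }
        return hits;
    }
}
```

With the values of row 2 of Table 1 entered in reversed order, `select(new double[]{3.7, 5.1})` finds nothing, while `optimizedSelect(new double[]{3.7, 5.1}, 50, 42L)` finds the Iris-setosa row after one swap. In a real deployment the `select` method would issue the SQL query against the DBMS instead of scanning an array.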
Table 3: Output specifications
ID: 1, 2, 3
Plant width No, Plant width No, Plant width No
Plant species: Iris-setosa, Iris-versicolor, Iris-virginica
No of searches: 10-50, 10-50, 10-50
No of searches: 10-50, 10-50, 10-50
Property: Numeric, string; Numeric, string; Numeric, string
Bit change: 0-1, 0-1, 0-1

Fig 2: Domain Scheme for Analysis with feature labels specified

5.6 Rough-Genetic Computational Class
The object-oriented paradigm encourages the use of structured classes. The core classes have been developed, as exemplified in Fig 3.

Fig 3: Computational Class Structure of Proposed System

6. SYSTEMS TESTING AND RESULTS
The DBMS needed to be tested and deployed after the program was written and debugged. Testing is done to assess the efficiency of the program. The testing procedure is outlined as follows:
1. Run the Main Application
2. Enter numerical values of Sepal length and Sepal width using the data as a guide
3. Click the submit query button
4. Read and record the values

Results of the tests, using the equality aggregator, are tabulated in Table 4, which compares standard SQL with the genetic algorithm (GA) optimized SQL for a DBMS. The Query Attributes field represents the expected attribute values (alleles) for which the end-user requests a report. The entry process is generalized in the sense that the end-user may enter any of the measured or specified plant attributes to discover the species class. The standard SQL queries were run using the standard Java output console to simplify the analysis report. The results show good performance of the GA-optimized SQL, which compared favourably with standard SQL on the select and aggregate queries for fewer than 50 generations. With deceptive pattern mining, captured by reversing the alleles, the GA-optimized SQL outperformed standard SQL, which returned empty results. The reason for the GA's success over standard SQL is that the GA seeks to create a new population of attribute pairs for each generation in the evolution process.

Table 4: Comparing Standard SQL with GA-optimized SQL
Query Attributes                                  Standard SQL          GA Optimized SQL
Plant Attribute 1      Plant Attribute 2          Classified Species    Classified Species
(e.g. Sepal-length)    (e.g. Sepal-width)
5.1                    3.5                        Iris-Setosa           Iris-Setosa
4.9                    3.0                        Iris-Setosa           Iris-Setosa
7                      3.2                        Iris-versicolor       Iris-versicolor
3.5                    5.1                        Empty                 Iris-Setosa
4.9                    3.0                        Empty                 Iris-Setosa

A snapshot of the running application is shown in Fig 4.

Running Patterns
Running patterns describe the nature of the GA query using a classification aggregate query. This is depicted in Fig 4.

Fig 4: Running Pattern using the = Aggregate query for 10 Searches

7. CONCLUSIONS
In conclusion, genetic algorithms and rough sets play a crucial role in optimal query processing if properly planned. Using an object-oriented approach and simple data structures can assure the quality of the data mining process and thus eliminate the need for expensive techniques such as a data mining query language (DMQL). Increasing the number of generations involved in the program solution does not necessarily make the predictions much better in certain circumstances. Thus, a trade-off has to be made between the required precision and the query load or time.

8. RECOMMENDATIONS FOR FUTURE WORK
The genetic algorithm is a proven data mining algorithm of choice if efficient and accurate database systems are to be built. The developed system can thus bring more efficient and accurate data mining features into a database management system. Using the system, database engineers can approach query optimization in a more dynamic and object-oriented way, which can make the end-user applications developed more robust. This application will therefore be useful in modern intelligent database products in academia and industry. In future, this application can also be integrated into a mobile computing environment in a platform-independent way.

REFERENCES
[1] Hasan, W., Ganguly, S., Krishnamurthy, R. (1992). Query optimization for parallel execution. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1-10.
[2] Han, J., Fu, Y., Wang, W., Koperski, and Zaiane, O.R. (1996). A Data Mining Query Language for Relational Databases, DMQL. Montreal, Canada, pp. 27-33.
[3] Mehta, Manish, David J. DeWitt (1997). Managing Intra-Operator Parallelism in Parallel Database Systems. gsl.azurewebites.net/.../0/.../VLDB95
[4] Brunie, N., Chaudhuri, S. (1997). Multi-join process in a multi-user context. www.csd.uoc.gr/.../...
[5] Schneider, Vinect Singh, David DeWitt (1990). Processing complex join queries via hashing in multiprocessor database machines.
[6] Mchugh, J.G. (2000). Data Management and Query Processing for Semi-Structured Data. PhD Thesis, Stanford University.
[7] Xu, Z. (2001). “Automatically Scheduling Database Utilities”. M.Sc. Thesis, Dept. of Computing and Information Science, Queen’s University. Manchester.
[8] Malerba, Donato, Annalisa Appice, Michelangelo Ceci (2004). A data mining query language for knowledge discovery in a geographical information system. Lecture Notes in Computer Science, Vol. 2682, pp. 95-116.
[9] Jeudy, B., Boulicaut, J.F. (2002). Optimization of Association Rule Mining. Journal of Intelligent Data Analysis, IOS Press, pp. 341-357.
[10] Sanner (2003). MIE457F Project. http://www.cs.toronto.edu/~ssanner/Projects/index.html
[11] Vellev, S. (2008). Review of Algorithms for the Join Ordering Problem in Database Query Optimization. Journal of Information Technology and Control, 2009, pp. 32-40.
[12] Srinivasa, K.G., Venugopal, K.R., and Patnaik (2008). “A soft computing approach for data mining based query processing using rough sets and genetic algorithms”. International Journal of Hybrid Intelligent Systems, Vol. 5, pp. 1-17.
[13] Paramasivam, K., Chandraskar, C. (2012). Multi-join operation using genetic algorithms in active data warehouse. Asian Journal of Computer Science and Information Technology, Vol. 2.5, pp. 123-127.
[14] Rahi, P., Gupta, B., and Bisht, S.S. (2014). Data Mining Using Neural-Genetic Approach – A Review. International Journal of Engineering Research and Applications, Vol. 4, Issue No. 4, pp. 36-42.
[15] Fabian, H., Mathias Peters (2012). Peeking into optimization of data flow programs with map reduce-style UD.
[16] Marshall (1998). Iris Dataset. http://archive.ics.uci.edu/ml/datasets/Iris