206 Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'16 |

Complex Query JOIN Optimization in Parallel Distributed Environment

Dr. Sunita M. Mahajan1, Ms. Vaishali P. Jadhav2
1 Principal, Computer Science Dept., Mumbai Education Trust, Bandra, Maharashtra, India
2 Research Scholar, NMIMS University, Mumbai, Maharashtra, India

Abstract - This research covers query optimization in a parallel distributed environment. The queries considered are select-project-join (SPJ) queries over large databases. The main query operation studied is the JOIN operation. For fast execution of a complex query, the time taken by the JOIN operation needs to be minimized. Different JOIN algorithms, namely Network Byte Order (NBO) parallel two way semi JOIN with bit array, NBO parallel collision free intelligent bloom JOIN filter, NBO parallel positional encoded reduction filter (PERF) JOIN and NBO parallel distinct encoded reduction filter (DERF) JOIN, are implemented and evaluated against existing algorithms.

Keywords: Query Optimization, JOIN Operation, Network Byte Order, Parallelization, Distributed Environment.

1 Introduction

A query is defined as the retrieval of content from a database on demand. It can be as simple as “Retrieving the name of the person with PAN card number AAQPW2130D” or as complex as “Finding the EMI amount of all bank customers whose age is between 30-39 years, who have more than 5 years of work experience, who have a loan amount between 20 lakhs and 50 lakhs and who want to make up the loan within 5 years, having a loan period of 20 years with floating interest”. The query results are produced after carrying out JOIN operations between the query tables. The JOIN order of the tables decides the performance of the query.

The query optimizer determines the JOIN order using different JOIN algorithms in diverse environments. Depending on the algorithmic changes in each JOIN algorithm, the query JOIN optimization time may vary. The research focus is to reduce the optimization time and the network cost (in terms of the amount of data to be transferred over the network) of the JOIN operation of a complex query on large databases. Fig. 1 gives an example of a complex query consisting of many relations, R1-R15. To produce the final JOIN result in minimum time, we need to reduce the time required by the intermediate JOIN operations. To get a fast result, we parallelize a set of query optimization procedures to improve the performance of the JOIN operation in a complex query on large databases in a parallel distributed environment [1-17] [18-23] [29-31] [48].

Fig. 1 Distributed Complex Query with Multiple JOINs

For implementing the parallel JOIN algorithms, intra-operator parallelism and inter-query parallelism with a shared-nothing architecture are used. In intra-operator parallelism, the JOIN operation of a single query is executed in parallel, and in inter-query parallelism, multiple queries are executed on multiple nodes in the distributed environment [18-23] [29-32] [48]. In parallelization, various load balancing schemes are used. The load balancing schemes that we used are round-robin, total-sum, equi-depth and stratified-allocation. The cost of parallel execution of a JOIN operation includes the cost of data partitioning, the cost of data assembling and the maximum execution cost of the JOIN operation across the nodes. To achieve better results from parallelism, a combination of independent and pipelined parallelism is used. In independent parallelism, each node works independently on the data allotted to it. Pipelined parallelism gathers the results in a pipeline and gives the final result [9-11] [42-45].

Our research work provides additional facilities during the optimization phase. The query JOIN optimizer converts actual JOIN attribute values into a compressed binary form. The problem is that different CPU platforms may store compressed binary data types differently [17], i.e.
Little Endian or Big Endian representation. To solve this problem, the compressed binary data is converted into network byte order (NBO) representation, which serves as an intermediate format for binary data to be transmitted across the network [17] [40] [42]. The NBO data is then encrypted, which provides security during transmission of the data over the network.

ISBN: 1-60132-444-8, CSREA Press ©

The research work focused on various challenges that occur during query optimization in different environments, such as centralized, distributed and parallel environments. We focused on the parallel environment. The challenges in our research work include producing the query result in less execution time, reducing communication cost when data is distributed among different sites, preventing data loss during the JOIN operation, eliminating duplicate data during data transfer, reducing the amount of data to be transferred using data compression techniques, and securing the data during transmission [9-11] [42-45].

The query JOIN optimization algorithms are evaluated on parameters such as execution time, memory utilization, amount of data to be transferred over the network, speed-up and efficiency [40-48].

2 Related Work

The existing JOIN algorithms that we considered for our research are as follows:

Fig. 2 Existing JOINs

For research in the optimization of complex queries, we studied the existing JOIN optimization algorithms. Existing JOIN optimization algorithms are classified as uncompressed and compressed JOIN optimization. We studied the semi JOIN and two way semi JOIN optimization techniques [1-5] [19-23] [42-48] [30-35]. In uncompressed JOIN optimization, the actual JOIN attribute values are transmitted, which increases both the transmission time and the network data. In two way semi JOIN optimization, the common and uncommon JOIN attribute values are compared, but the data is still in uncompressed form. We then studied compressed techniques such as Bloom JOIN [6], Position Encoded Reduction Filter (PERF) JOIN [7, 8] [41-47] and Distinct Encoded Reduction Filter (DERF) JOIN [7, 8] [41-47]. Bloom JOIN faces the collision problem, i.e. more than one element maps to the same address. PERF JOIN uses position encoding, which is not suitable for large databases, and DERF JOIN requires more time to remove duplicates. Our research fills these gaps by using parallelization to improve the performance of the JOIN operation [24-28] [36-39] [47].

3 Research Work

We develop four JOIN optimizers for parallel distributed databases which focus on network data reduction, time and memory reduction, and security during data transmission over the network.

Fig. 3 Proposed Work

Instead of sending data in its actual format, the data is converted into binary format, i.e. the data is compressed and the cost in terms of data transfer is reduced [30-35]. The main issue with binary data transmission is its compatibility with the binary data representation on different machines [40]. Different machines can have different binary representations, such as little endian or big endian, i.e. the way binary data is read may differ. Machines with little endian representation read the Least Significant Byte (LSB) first, and machines with big endian representation read the Most Significant Byte (MSB) first. So when binary data is transferred between machines with different representations, the data can be interpreted wrongly. To avoid this problem, our research converts the ordinary binary data into network byte order (NBO), a single agreed byte order (big endian by convention) that does not depend on any machine's native representation [40] [42].

Another issue that we consider in our research is the security of data during transmission over the network. The next step after conversion of the binary data into NBO is providing data security by encrypting the NBO data using the Advanced Encryption Standard
(AES) algorithm. The encrypted data is transferred over the network, which secures the original data [40] [42-47] [50]. Thus the problems of data reduction, binary data representation and security are addressed by the above solutions (data compression, NBO and AES). Time and memory reduction is achieved by using parallelization in query optimization [9-11] [42-45]. The detailed block diagram of our system is given in Fig. 4.

Fig. 4 Block Diagram and Contribution

The basic block diagram of the system consists of 3 phases: pre-processing, optimization and query execution, as shown in Fig. 3.

Pre-processing consists of a server status checker module, which checks the connection between server and client in the distributed environment. If the connection is OK, the query generator module generates the query. According to the number of JOIN attributes of the query tables, an adjacency matrix is created in the query pre-processing module. It gives the pre-processing time and the query type as its output. The query type defines the number of relations in a query together with the number of JOIN attributes. For example, query type 4-#3 means that the number of relations is 4 and the number of JOIN attributes is 3.

The optimization phase consists of intra-operator and inter-query parallelism. Intra-operator parallelism parallelizes a single operator, i.e. the JOIN operator, on multiple machines. Inter-query parallelism executes multiple queries simultaneously on multiple machines [9-11] [45]. Our four parallel JOIN optimizers are available in both environments, i.e. intra-operator and inter-query parallelism. After data compression and data reduction are complete in all parallel JOIN optimizers, the compressed data is converted into network byte order (NBO). The NBO data is then encrypted with the AES algorithm, which provides security during transmission of the data over the network [43-47] [50].

In the execution phase, the query is executed and its time and memory requirements are calculated. The row count of the reduced relation is also calculated in the execution phase [1-5] [19-23].

The queries considered for research are star, chain, cycle and clique. The five databases used for evaluation are TPC-H, Northwind, Pub, AdventureWorks and Query Optimization (a customized database used for testing) [13-15]. Databases can be scaled with proper scale factors.

The four JOIN optimizers developed are:
NBO parallel two way semi JOIN with byte array
NBO parallel collision free intelligent bloom JOIN
NBO parallel positional encoded reduction filter JOIN
NBO parallel distinct encoded reduction filter JOIN

3.1 NBO Parallel Two Way Semi JOIN with Byte Array

The original two way semi JOIN calculates the common and uncommon JOIN attribute values and compares the counts of the two. Whichever set is smaller, common or uncommon, is transmitted over the network and used for further reduction at the resultant site [1-5] [19-23] [42-44] [50] [30-35]. The detailed algorithmic contribution of our first NBO parallel algorithm is given in Fig. 5.

Fig. 5 Algorithmic Contribution of NBO Parallel Two Way Semi Join with Byte Array

The two way semi JOIN optimizer can also be executed in a parallel environment. Instead of sending actual JOIN attribute values, our optimizer encodes the smaller of the common and uncommon sets of JOIN attribute values into a byte array to save transmission cost. The byte array is converted into network byte order (NBO) to solve the problem of compatibility of binary representations on multiple machines [40].
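The encoding step described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function names (`choose_filter`, `encode_nbo`, `decode_nbo`, `reduce_relation`), the one-byte flag and the assumption that JOIN attribute values are unsigned 32-bit integers are all hypothetical choices made for the example. The sketch picks the smaller of the common and uncommon value sets and packs it with the standard `struct` module, whose `!` prefix selects network (big endian) byte order.

```python
import struct

COMMON, UNCOMMON = 1, 0  # hypothetical flag values marking which set is sent


def choose_filter(incoming_values, local_values):
    """Return (flag, values): the smaller of the common and uncommon sets.

    incoming_values: JOIN attribute values received from the other site.
    local_values: JOIN attribute values of the local relation.
    """
    incoming = set(incoming_values)
    local = set(local_values)
    common = sorted(incoming & local)
    uncommon = sorted(incoming - local)
    if len(common) <= len(uncommon):
        return COMMON, common
    return UNCOMMON, uncommon


def encode_nbo(flag, values):
    """Pack flag, count and values in network byte order ('!' = big endian).

    Assumes every value fits in an unsigned 32-bit integer.
    """
    return struct.pack(f"!BI{len(values)}I", flag, len(values), *values)


def decode_nbo(payload):
    """Inverse of encode_nbo; the result does not depend on host byte order."""
    flag, count = struct.unpack_from("!BI", payload, 0)
    values = struct.unpack_from(f"!{count}I", payload, 5)  # header is 5 bytes
    return flag, list(values)


def reduce_relation(sent_values, flag, values):
    """Apply the decoded filter at the site that originally sent the values."""
    keep = set(values)
    if flag == COMMON:
        return [v for v in sent_values if v in keep]
    return [v for v in sent_values if v not in keep]
```

For example, with incoming values 10, 20, 30, 40, 50 and local values 20, 40, 60, the common set {20, 40} is smaller than the uncommon set {10, 30, 50}, so only two values and a 5-byte header are packed. Encryption of the packed bytes is left out of the sketch.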
Then the NBO data is encrypted using the AES algorithm to provide security during data transmission [12] [42-47] [50]. Parallelization of the above optimizer improves on the performance of the original two way semi JOIN optimizer [9-12] [17] [30-35] [38]. We have used intra-operator and inter-query parallelization in our research.

3.2 NBO Parallel Collision Free Intelligent Bloom JOIN Filter

The original bloom JOIN filter uses a hash function to map the JOIN attribute values of relations. The main problem with the hash function is collision: more than one JOIN attribute value can be mapped to the same address [6] [24-28] [36-39]. Instead of using hash functions in bloom JOIN, our algorithm uses a C# data structure, the Hash Table. The collision problem caused by hash functions is avoided by using the hash table. The hash table also stores only distinct values, so parallel collision free intelligent bloom JOIN with a hash table is useful for large databases and improves on the performance of the original bloom JOIN optimizer [9-12] [17].

The main contribution of this algorithm is the conversion of JOIN attribute values into a hash table in the forward reduction of the JOIN operation. The hash table values are converted into network byte order to solve the problem of binary representation. Security of the transmitted data is provided by the AES algorithm. In backward reduction, the smaller of the common and uncommon value sets is converted into a hash table, then converted into network byte order, and then encrypted with the AES algorithm.

3.3 NBO Parallel Position Encoded Reduction Filter JOIN

The original positional encoded reduction filter (PERF) encodes the position of each JOIN attribute value in a byte array. Considering a JOIN operation between two relations, PERF JOIN sets the byte array value to 1 for a common JOIN attribute value and to 0 for an uncommon JOIN attribute value [7-8] [41-47].
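The position encoding just described can be illustrated with a short Python sketch. It is not the authors' code: the function names `perf_encode` and `perf_reduce` and the sample tuples are hypothetical, and the sketch keeps one flag per byte, as the text describes. The receiving site marks, for each position of the incoming JOIN attribute list, whether that value also occurs locally; the sending site then keeps the tuples whose position is marked 1.

```python
def perf_encode(incoming_values, local_values):
    """Build the positional filter: byte i is 1 if incoming_values[i] joins."""
    local = set(local_values)
    return bytes(1 if v in local else 0 for v in incoming_values)


def perf_reduce(sent_tuples, perf):
    """Keep the tuples whose position is flagged 1 in the returned filter."""
    if len(sent_tuples) != len(perf):
        raise ValueError("filter length must match the number of tuples sent")
    return [t for t, flag in zip(sent_tuples, perf) if flag == 1]


# Site 1 ships its JOIN column in a fixed order; duplicates are kept.
r_tuples = [(7, "a"), (3, "b"), (7, "c"), (9, "d")]
r_join_column = [key for key, _ in r_tuples]

# Site 2 answers with one flag per position instead of the values themselves.
perf = perf_encode(r_join_column, local_values=[7, 8, 9])
reduced_r = perf_reduce(r_tuples, perf)
```

In this example the filter holds the four flags 1, 0, 1, 1, so only the tuple with key 3 is dropped. Both copies of key 7 occupy their own position in the filter.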
The problem with the original PERF JOIN is that it encodes positions in a byte array, so as the number of JOIN attribute values increases, the size of the byte array increases. This problem is addressed by NBO parallel PERF JOIN, which compares the common and uncommon JOIN attribute values and, instead of encoding all JOIN attribute values, encodes only the smaller of the two sets. The conversion of compressed data to NBO data and then of NBO data to encrypted data remains the same for all optimizers. PERF JOIN encodes all the values, including duplicates.

3.4 NBO Parallel Distinct Encoded Reduction Filter JOIN

In PERF JOIN, duplicate JOIN attribute values are also encoded. Duplication increases the cost in terms of response time and transmission time, and it increases the amount of data to be transferred over the network. To solve this problem, the original distinct encoded reduction filter (DERF) JOIN selects only distinct JOIN attribute values when transmitting data over the network, which removes the duplication [7-8] [41]. Our NBO parallel DERF JOIN further reduces the cost in the backward reduction phase of the JOIN operation. NBO parallel DERF provides security during transmission of data and reduces transmission overhead. In the backward reduction phase, the distinct common and uncommon JOIN attribute values are compared and the smaller set is represented in a byte array for further reduction.

4 Experimental Set Up

For experimentation, we consider 3 servers (configuration: Intel® Core™ 2 Duo Processor E7300, 2.66 GHz, 3 MB cache) and 20 clients (configuration: 2.9 GHz Intel Core i5 processor, 4 GB DDR3 RAM, 500 GB hard drive) in one data center. During experimentation, a database of size 500 GB is considered and 1500 queries are evaluated. The queries considered for research are star, chain, cycle and clique. The five databases used for evaluation are TPC-H, Northwind, Pub, AdventureWorks and Query Optimization (a customized database used for testing) [13-15] [16-17].
Databases can be scaled with proper scale factors. Different load balancing schemes are used during experimentation, such as round-robin, total-sum, equi-depth and stratified-allocation [12]. We considered all the test cases (best, average and worst) during our experimentation. This paper includes the results of the ‘equi-depth’ load balancing scheme with all types of queries in intra-operator and inter-query parallelism.

We have used the following parameters for evaluation of our optimizers:
Execution time
Memory utilization
Amount of data to be transferred
Speed-up
Efficiency

With the help of these evaluation parameters, the optimizers' performance is evaluated. All NBO parallel optimizers with all types of queries in intra-operator and inter-query parallelism are shown in the results.

5 Results and Discussion

This section includes the results of our first NBO parallel JOIN optimizer in the worst case in the intra-operator parallel environment. Our algorithms work well in the best case and average case for all types of queries. Our first optimizer, NBO Parallel Two Way Semi JOIN with Byte Array, is compared with two existing JOIN optimizers, Parallel JOIN and Parallel Two Way Semi JOIN. This optimizer is evaluated on all evaluation parameters: execution time, memory utilization, amount of data to be transferred, speed-up and efficiency. Table 1 shows the evaluation of our optimizer using execution time.

Table 1 Execution Time

The NBO Parallel Two Way Semi JOIN optimizer works well with all queries in the best and average cases. In the worst case, the time required to execute chain, cycle and clique queries is more than the time required by the existing optimizers, so our optimizer works well for star queries in the worst case scenario. Counting the common and uncommon JOIN attribute values may take extra time for more complex queries in the worst case scenario.

Table 2 Memory Utilization

Memory utilization is also lower for star queries in the worst case scenario. The memory required by chain, cycle and clique queries is somewhat more than that of the existing optimizers.

Table 3 Amount of Data to be Transferred

Our NBO parallel optimizers use compressed data during transmission over the network, so in all test cases the amount of data to be transferred is less than for the existing JOIN optimizers. Even in the worst case, the amount of data to be transferred over the network is less for all types of queries.

We compared all our NBO parallel optimizers using the above evaluation parameters. For the other two evaluation parameters, speed-up and efficiency, we compared all four NBO parallel optimizers with the existing parallel optimizers.

Table 4 Speed Up

Table 5 Efficiency

Our four NBO parallel optimizers are compared with the two existing optimizers. The speed-up and efficiency of all optimizers are measured. Among the four optimizers, NBO Parallel Distinct Encoded Reduction Filter gives more speed-up and is more efficient than the other 3 NBO parallel optimizers.

6 Conclusion

NBO parallel optimizers significantly improve the performance of the JOIN operation in a parallel distributed environment, reducing both the execution time and the amount of data to be transferred over the network. The compression techniques, namely Byte Array conversion, Hash Table conversion, PERF(Relation) conversion and DERF(Relation) conversion, solve the existing issues in the JOIN operation. We found ‘Equi-Depth’ to be the best load balancing scheme among those tested. Our evaluation parameter ‘amount of data to be transferred’ is always acceptable for all queries in all cases. NBO Parallel Two Way Semi JOIN with Byte Array and NBO parallel PERF JOIN are suggested for star queries.
NBO Parallel Collision Free Intelligent Bloom JOIN Filter is suggested for star and chain queries. NBO Parallel DERF JOIN is suggested for all types of queries, i.e. star, chain, cycle and clique. NBO Parallel DERF is the best optimizer among the NBO parallel optimizers. In future, these NBO parallel optimizers can be tested on unstructured databases or column-oriented databases.

7 References

[1] Abraham Silberschatz, Hank Korth and S. Sudarshan, Database System Concepts, 5th Edition, McGraw-Hill, 2006.
[2] “SQL Statement Processing”, http://technet.microsoft.com/ens/library/ms190623(v=sql.105).aspx
[3] Goetz Graefe, “Query Evaluation Techniques for Large Databases”, ACM Computing Surveys, Vol. 25, No. 2, June 1993.
[4] “Application areas of Databases”, http://my.safaribooksonline.com/book/databases/9788131731925/databasesystem/ch01lev1sec4
[5] Anand V. Hudli, “Distributed Query Processing”, M.Tech Dissertation, IIT Bombay, 1984.
[6] J.M. Morrissey and W. Osborn, “Experiments with the use of reduction filters in distributed query optimization”, in Proceedings of the 9th International Conference on Parallel and Distributed Computing and Systems, pp. 327-330.
[7] Ahmet Cumhur ÖZTÜRK, “Distinct Encoded Records Join Operator for Distributed Query Processing”, Master of Science Thesis, İzmir Institute of Technology, İzmir, 2012.
[8] Zhe Li and Kenneth A. Ross, “PERF Join: An Alternative To Two-way Semijoin And Bloomjoin”, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.57.2324&rep=rep1&type=pdf
[9] “Chapter 8: Parallel Query Optimization”, http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.dc20023_1251/html/optiz-er/X23652.htm
[10] “Parallel Hardware Architecture”, http://docs.oracle.com/cd/A58617_01/server.804/a58238/ch3_arch.htm
[11] Patrick Valduriez, “Parallel Database Systems: Open Problems and New Issues”, Distributed and Parallel Databases 1 (1993) 137-165.
[12] S.K. Basu, “Design Methods and Analysis of Algorithms”, PHI Learning Pvt. Ltd., Second Edition.
[13] http://en.wikipedia.org/wiki/Transaction_Processing_Performance_Council
[14] https://northwinddatabase.codeplex.com/
[15] http://www.codeproject.com/Articles/20987/HowToInstall-the-Northwind-and-Pubs-Sample-Databa
[16] https://msftdbprodsamples.codeplex.com/releases/view/93587
[17] http://www.ccse.kfupm.edu.sa/~fazzedin/COURSES/CSP2005/Reading/NetworkProgramming.pdf
[18] http://nou.edu.ng/NOUN_OCL/pdf/pdf2/DAM%20%20212.pdf
[19] http://www.it.bond.edu.au/inft320/001/lectures/qproc3.pdf
[20] http://www.srpskibre.com/pdf/Fundamentals_of_Database_Systems.pdf
[21] G.M. Lohman, C. Mohan, L.M. Haas, D. Daniels, B.G. Lindsay, P.G. Selinger and P.F. Wilms, “Query processing in R*”, in Query Processing in Database Systems, Springer, New York, 1985.
[22] J.K. Ahn and S.C. Moon, “Optimizing joins between two fragmented relations on a broadcast local network”, Information Systems, vol. 16, no. 2, pages 185-198, 1991.
[23] J.S.J. Chen and V.O.K. Li, “Optimizing joins in fragmented database systems on a broadcast local network”, IEEE Transactions on Software Engineering, vol. 15, no. 1, pages 26-38, 1989.
[24] W.K. Osborn, “The use of reduction filters in distributed query optimization”, Master Thesis, University of Windsor, 1998.
[25] W.T. Bealor, “Semijoin strategies for total cost minimization in distributed query processing”, Master Thesis, University of Windsor, 1995.
[26] W.T. Bealor and J.M. Morrissey, “Minimizing data transfer in distributed query optimization: A comparative study and evaluation”, Computer Journal, Vol. 39, No. 8, 1997.
[27] J.M. Morrissey, S. Bandopadhyay and W.T. Bealor, “A heuristic for minimizing total cost in distributed query processing”, in Proceedings of the 7th International Conference on Computing and Information (ICCI'95), 1995.
[28] J.M. Morrissey, S. Bandopathyay and W.T. Bealor, “A comparison of static and dynamic strategies for query optimization”, in Proceedings of the 7th IASTED/ISM International Conference on Parallel and Distributed Computing Systems, 1995.
[29] P.A. Bernstein, N. Goodman, E. Wong, C. Reeve and J. Rothnie, “Query processing in a system for distributed databases (SDD-1)”, ACM Trans. Database Syst., Vol. 6, Dec. 1981, pages 602-625.
[30] C.T. Yu and C.C. Chang, “On the design of a query processing strategy in a distributed database environment”, Proc. ACM SIGMOD Intl. Conf. Management of Data, 1983, pages 30-39.
[31] P.M.G. Apers, A. Hevner and S.B. Yao, “Optimization algorithm for distributed queries”, IEEE Trans. Software Engg., Vol. 9, No. 1, Jan. 1983, pages 57-68.
[32] C. Wang, V.O.K. Li and A.L.P. Chen, “Distributed query optimization by one shot fixed precision semijoin execution”, in Proceedings of the 7th International Conference on Data Engineering, pages 756-763, 1991.
[33] C. Wang, V.O.K. Li and A.L.P. Chen, “One-shot semi join execution strategies for processing distributed join query”, Computer Systems Science and Engineering, Vol. *, No. 4, pages 245-253, 1993.
[34] H. Kang and N. Roussopoulos, “Using 2-way semijoins in distributed query processing”, in Proceedings of the 3rd International Conference on Data Engineering, pages 644-651, 1987.
[35] N. Roussopoulos and H. Kang, “A pipelined n-way join algorithm based on the 2-way semijoin program”, IEEE Transactions on Knowledge and Data Engineering, 3(4), pages 486-495, 1991.
[36] B.H. Bloom, “Space/time trade-offs in hash coding with allowable errors”, Communications of the ACM, Vol. 13, July 1970, pages 422-426.
[37] J.M. Morrissey and X. Ma, “Investigating response time minimization in distributed query optimization”, in Proceedings of the 10th International Conference on Computing and Information (ICCI'98), 1998.
[38] J.C.R. Tseng and A.L.P. Chen, “Improving distributed query processing by hash semijoins”, Journal of Information Science and Engineering, Vol. 8, pages 525-540, 1992.
[39] J.M. Morrissey and W.K. Osborn, “Experiments with the use of reduction filters in distributed query optimization”, in Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Systems, 1997.
[40] http://www.goldparser.org/doc/about/byte-ordering.htm
[41] Zhe Li and Kenneth A. Ross, “PERF Join: An alternative to two way semi join and Bloom join”, Department of Computer Science, Columbia University, New York, NY 10027.
[42] http://library.iyte.edu.tr/tezler/master/bilgisayaryazilimi/T001058.pdf
[43] https://support.microsoft.com/en-us/kb/246071/
[44] http://mu.ac.in/myweb_test/MCA%20study%20material/Advanced%20Database%20Techniques-f.pdf
[45] http://web.cs.wpi.edu/~cs561/s12/Lectures/45/ParallelDBs.pdf
[46] http://mazsola.iit.uniiskolc.hu/tempus/discom/doc/db/tema01a.pdf
[47] http://bnrg.cs.berkeley.edu/~adj/cs262/papers/graefe.pdf