IN-MEMORY DATA STRUCTURE FOR GOOGLE DATASTORE ON MULTI-CORE ARCHITECTURES

A dissertation submitted to the University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences

2011

MOON MOON NATH
School of Computer Science

List of Contents

List of Figures
List of Tables
Code Listings
Abstract
Declaration
Copyright
Acknowledgements

1 Introduction
  1.1 Shared Memory Multi-Core Systems and Google DataStore
  1.2 Aims and Objectives
  1.3 Organization of the Dissertation

2 Background
  2.1 The BigTable
  2.2 HBase
  2.3 The Google File System
  2.4 Hadoop Distributed File System
  2.5 Data Retrieval in a Cluster Environment: MapReduce
  2.6 In-Memory Database Systems
  2.7 Cache-Oblivious Data Structures

3 System Design
  3.1 System Overview
  3.2 System Model
  3.3 Data Model
  3.4 Data Manipulation and Retrieval

4 Data Structure Implementation
  4.1 Static Binary Tree
  4.2 Packed Array
  4.3 Algorithm to Search
  4.4 Algorithm to Append / Insert
  4.5 Algorithm to Delete

5 Query Implementation
  5.1 Development Tools – Java Fork/Join Framework
  5.2 TPC-H Benchmark Overview
  5.3 Loading the Database
  5.4 TPC-H Query 17 Overview
  5.5 Query 17 – Sequential Implementation
  5.6 Query 17 – Parallel Implementation
  5.7 Need for Synchronization
  5.8 Synchronization Issues
  5.9 Synchronization Techniques

6 Evaluation
  6.1 Experimental Methodology
  6.2 Experimental Results

7 Conclusion
  7.1 Dissertation Summary
  7.2 Limitations
  7.3 Future Work

Appendix 1
Appendix 2
References

Word Count: 20,190

List of Figures

Figure 1.1: NUMA architecture
Figure 1.2: UMA architecture
Figure 2.1: Example table storing web pages
Figure 2.2: To illustrate the concept of ‘rows’, ‘column families’ and ‘columns’ in BigTable
Figure 2.3: To illustrate timestamps
Figure 2.4: GFS architecture
Figure 2.5: The memory hierarchy
Figure 2.6: The RAM model
Figure 3.1: Data structure design
Figure 3.2: Representation of the Data model
Figure 4.1: Steps to create the data structure
Figure 4.2: A complete Binary tree
Figure 4.3: van Emde Boas layout on a binary tree of height 5
Figure 4.4: To illustrate the relation between a full binary tree (of height 4) and the vEB array and the Packed array structure
Figure 5.1: Co-operation among fork() and join() tasks
Figure 5.2: TPC-H database schema
Figure 5.3: Sample Key-Value pairs generated from the de-normalized dataset
Figure 5.4: Overview of parallel execution strategy used in Query 17
Figure 6.1: Mean execution times of Query 17 for 100 MB data (small) on Janus
Figure 6.2: Mean execution times of Query 17 for 500 MB data (medium) on Janus
Figure 6.3: Mean execution times of Query 17 for 1 GB data (large) on Janus
Figure 6.4: Mean execution times of Query 17 for 1 GB data (small) on Mcore48
Figure 6.5: Mean execution times of Query 17 for 3 GB data (medium) on Mcore48
Figure 6.6: Mean execution times of Query 17 for 5 GB data (large) on Mcore48
Figure 6.7: Absolute speedup of Query 17 for all three datasets on Janus
Figure 6.8: Absolute speedup of Query 17 for all three datasets on Mcore48

List of Tables

Table 1: Production system configurations for performance evaluation

Code Listings

Listing 1: Class definition of a node in the implementation of a binary tree
Listing 2: Implementation of a sorted tree
Listing 3: Implementation of vEB array
Listing 4: Pseudo code to explain the mapping of vEB array to packed array
Listing 5: To calculate the number of leaves for a tree of height ‘height’
Listing 6: Implementation of Packed Array
Listing 7: Implementation to convert a key-value pair file to another key-value pair format (based on our system’s data model)
Listing 8: Implementation of search algorithm to check for the first occurrence of a column in the binary tree
Listing 9: Implementation of search algorithm to check within a subtree
Listing 10: Implementation of intersection operation to find the common rows
Listing 11: Implementation of search algorithm to check for a column_name and a specific row_key within a single combo key
Listing 12: Implementation of duplicate removal algorithm
Listing 13: Implementation of search algorithm to check for less-than condition
Listing 14: Implementation of parallel search algorithm (1)
Listing 15: Implementation of parallel search algorithm (2)
Listing 16: Implementation of parallel addition algorithm
Abstract

Google provides its users with an assortment of applications and services. In the process, it must store and manage huge volumes of user data. To accomplish this, all the data is distributed across thousands of servers in a distributed storage system. This approach is beneficial since it exploits parallelism in a cluster environment to achieve good system performance in terms of throughput and response time. The advent of multi-core architectures has prompted a great deal of research into software solutions that can take advantage of the parallel hardware. This project investigates the possibility of developing a Google DataStore-like system for shared memory multi-core machines that is scalable, fast, and efficient. This dissertation discusses the motivation, relevant literature, scope, design, implementation, and evaluation of the project. The literature survey provides the essential background knowledge needed to understand the idea behind this research. The system design comprises primarily an underlying data structure and a set of operations to manipulate the database. The implementation, based on Java 7, includes developing the data structure that supports the database and parallelizing a search query. The system supports several database operations, such as insert, delete, and search, similar to those of the Google DataStore. The query execution is parallelized on several multi-core machines to evaluate the performance and scalability of the design, using execution time and absolute speedup as metrics. The analysis of the results reveals a maximum speedup of 12.7 for 1 GB and 6.3 for 5 GB of data on a 48-core test machine, which indicates the advantage of executing queries on multi-core systems. The designed database is a subset of the Google DataStore and hence supports only its core features.
Given the stringent time frame, enhancements like fault tolerance and security are kept outside the scope of this project.

Declaration

No portion of the work referred to in this dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

i. The author of this dissertation (including any appendices and/or schedules to this dissertation) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this dissertation, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has entered into. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the dissertation, for example graphs and tables (“Reproductions”), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv.
Further information on the conditions under which disclosure, publication and commercialisation of this dissertation, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant Dissertation restriction declarations deposited in the University Library, The University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s Guidance for the Presentation of Dissertations.

Acknowledgements

At the outset, I would like to express my sincere gratitude to my supervisor, Dr. Mikel Lujan, for his invaluable guidance, support and inspiration throughout the project. I would also like to thank all my faculty members at the School of Computer Science. Finally, I would like to convey my heartfelt gratitude to my parents and friends for their ceaseless love and support, without which this work would not have been possible.

1 Introduction

Processor architecture has evolved considerably over the years. From being steered primarily by Moore’s Law [37] to exploiting multi-core parallelism today, it has come a long way. The direct correlation between processor frequency and performance is threatening to vanish owing to certain limiting factors. The most prominent among them is transistor size, which cannot be reduced beyond a certain degree [35]; this is a physical limitation on how far a transistor can shrink, and smaller transistors also demand considerable design effort. Moreover, a direct consequence of increasing the number of transistors on a chip is increased power consumption. Apart from this, there is also the problem of physical memory bandwidth: the speed of the main memory is much slower than that of the processor.
Memory speed has not grown at anywhere near the rate at which processor frequency has in the past decade. This memory bottleneck will always restrict system performance, despite the high clock speed of the processor, because a fast processor paired with a slow memory only increases the processor's idle time. Wulf et al. called this bottleneck the ‘Memory Wall’ [40].

Hardware designers have now incorporated multi-core technology into processors. Instead of a single CPU, processors have multiple CPUs, called ‘cores’, built onto the same chip. The existence of multiple cores creates an opportunity for improvements in processor performance and speed, provided there exists parallel software that can utilize the cores available to it, since a task can now be executed on several cores simultaneously as threads. A modern multi-core processor is usually a NUMA (Non-Uniform Memory Access) shared memory multi-processor [42], whereas those with fewer cores are still SMPs (Symmetric Multi-Processors) with UMA (Uniform Memory Access) [42]. Under UMA, memory is shared by all processor cores such that each core takes the same amount of time to access it. Under NUMA, by contrast, there are pools of memory, each shared by a set of cores (multiple cores grouped together to form a ‘socket’). Each socket (group of cores) is connected directly to one RAM and indirectly to all the others (see Figure 1.1), so some sockets access a particular RAM faster than others. Figures 1.1 and 1.2 illustrate the two multi-core architectures.

Figure 1.1: NUMA architecture. Drawn based on [42]. Here, each socket has 4 cores.

Figure 1.2: UMA architecture. Drawn based on [42].
Here, a single operation can be divided among 16 CPUs in the NUMA architecture and 4 CPUs in the UMA architecture and, if parallel software is available, can utilize these multiple cores for performance enhancements. However, some latency exists in the NUMA case due to the different access times. Although the hardware industry has found an effective technique in the form of multi-core, the software industry still needs to evolve accordingly to exploit this hardware. It is extremely vital to write software and design frameworks that can efficiently scale on and utilize the underlying multi-core architecture. One should also bear in mind the hierarchical memory structure involving CPU caches to yield optimal system performance. This is especially true when processing terabytes of data. There are frameworks for parallel programming on multi-core systems, such as OpenMP [47] for C and Fortran, Java Fork/Join [49], Phoenix [13] and MR-J [17]. However, there still exists a lot of instability in applications written for multi-core architectures. The complexity involved in the appropriate utilization of thread-level parallelism is magnified by the existence of multiple cache levels, cache sharing, memory page sizes and so forth [44]. Therefore, software designers can achieve greater performance from multi-core systems if they consider these factors and design structures and algorithms that are tailored accordingly.

1.1 Shared Memory Multi-Core Systems and Google DataStore

The industry today must manage huge amounts of data, in the order of petabytes. To process such large computations, a distributed cluster computing environment or a shared memory multi-core architecture can be used. Again, apart from improving the hardware, the software should also be rewritten to exploit it, as mentioned earlier. Google has devised a mechanism based on the distributed computing environment to process and manage petabytes of user data.
BigTable [3] is a high-performance, scalable, proprietary database system from Google. It is a distributed storage system that manages large amounts of data across several thousand commodity servers. It is built atop other Google services such as the Google File System (GFS) [5], MapReduce [14] and the Chubby locking service [6]. The GFS is a distributed file system that runs on several thousand inexpensive commodity Linux servers. It provides the usual file system operations with special fault tolerance, scalability and reliability features. The database operations are designed so that they can utilise the distributed nature of the environment and run in parallel. However, the system does not utilize the individual cores within a single machine to gain performance benefits; in other words, it does not support execution on a multi-core architecture. The distributed parallelism is achieved by using MapReduce, a framework that requires a programmer to write only two special functions, while the complex parallel activities are handled by its underlying run-time features.

Many Google projects make use of BigTable, such as Google Earth, Google Finance, the web indexing operation, Gmail and YouTube. Several similar distributed systems exist as open source. The most commonly used are from Hadoop [12]. Hadoop’s HBase [7], HDFS [12] and MapReduce [16] are similar in most ways to Google’s BigTable, GFS and MapReduce respectively. Hadoop is extensively used by services such as Facebook, Twitter, Adobe, eBay, LinkedIn and Yahoo, to name a few. The applications supported by these distributed systems give us a fair idea of the enormity of the data they handle. These systems are robust and have a low response time in most situations.
However, if concurrent activities increase manifold owing to a large number of simultaneous users, or if the amount of data grows many times over in the next few years, the performance might not be as good as it is now. The computations are bound to become larger, requiring more power in the future. Therefore, with the advent of multi-core architectures, it is only natural to try to extract the additional computational power required from the multi-core systems themselves. In fact, BigTable and HBase make use of inexpensive commodity machines for their clusters. The multi-core nature of the individual systems that constitute the cluster can actually be exploited to gain improvements in performance and speed-up.

1.2 Aims and Objectives

This project aims at investigating the possibility of implementing a subset of the BigTable [3] functionality on multi-core architectures. The designed database always resides in memory [18, 19], entirely eliminating access to secondary storage for its operations, and has an underlying data structure with a cache-oblivious [27] design. The research is carried out in three basic phases. The initial phase involves conducting a survey of the Google database system, its underlying infrastructure, the GFS [5] and other similar non-SQL (unconventional or non-relational DBMS) database technologies. It also involves looking at various in-memory and cache-oblivious data structures to identify their suitability for this research. The existing multi-core frameworks are also examined to identify a suitable means of achieving parallel activity, and the programming environment for developing this implementation is explored to arrive at an appropriate choice. The next phase is to design and develop a version of the DataStore system for shared-memory multi-core machines, based on the decisions taken in the previous phase.
This involves designing and implementing a suitable data structure capable of supporting operations similar to Google BigTable's. This structure is then used to perform simple operations on the database: create, populate, append new data and delete. In addition, thread synchronization features are incorporated to allow multiple users concurrent access to a single database. Next, a data retrieval operation is performed on it, exploiting the parallelism of the processor cores. The final objective of this research is to evaluate the performance and usability of this multi-core implementation, along with its efficiency and its scalability on various multi-core systems, using the parallelized query. However, the cache-obliviousness of the data structure is not evaluated, due to time constraints. The results obtained from various multi-core systems are analyzed to arrive at a formal conclusion.

1.3 Organization of the Dissertation

This dissertation is organized into Background (Section 2), System Design (Section 3), Data Structure Implementation (Section 4), Query Implementation (Section 5), Evaluation (Section 6) and Conclusion (Section 7) sections, apart from the Introduction (Section 1). The Background section contains an overview of the entire research activity. It presents the primary motivation behind this project, the Google DataStore (BigTable) [3]; its concept, architecture and salient features are discussed briefly. It is then compared with its open source counterpart from Hadoop, HBase [7]. Next, GFS [5] and its open source version from Hadoop, HDFS [12], are discussed, exploring the architecture of these systems. The database querying mechanism, MapReduce [14, 16], used by these distributed systems is then examined.
Further, we look at alternative database technologies, namely In-Memory Databases (IMDB) [18, 19], to investigate the feasibility of using them for this implementation. We also look at cache-oblivious [20] data structures to explore their suitability and, at the same time, identify an appropriate structure for development. The System Design section presents a detailed description of the data structure that forms the building block of the in-memory, DataStore-like database system that is implemented. The subsequent section deals with the actual implementation details of the data structure, followed by the implementation details of the query used to evaluate the system. The mechanism used to parallelize this query, in order to exploit the processor cores of a multi-core system, is also presented. The Evaluation section covers the evaluation techniques, the various multi-core configurations used, the benchmarks, and the results of the analysis. Finally, the Conclusion section wraps up the report by briefly discussing the outcome, the lessons learned, the system's limitations and the future work that can be undertaken.

2 Background

Google provides its users with a commercial Platform-as-a-Service (PaaS) cloud technology in the form of the Google App Engine [1]. App Engine allows users to build, maintain and run their web applications on Google’s infrastructure by means of a huge storage structure called the DataStore [2]. It comprises several APIs required for its services, one of which is the Datastore API [2], which is available in both Java and Python versions and accesses a query engine and some atomic transactions. This API provides users with stable storage that is both reliable and consistent. The huge amount of user data present in the DataStore is, in reality, stored across thousands of servers and managed by a distributed storage system called BigTable [3]. In other words, the DataStore of App Engine is built on top of BigTable.
BigTable, a single-master distributed storage system, consists of three main components: a library linked to all clients, a master server and several tablet servers [3]. It is a non-SQL (non-traditional DBMS) database management system in that it does not conform to a specific schema; the number of columns in different rows of the same table can vary, so it shares characteristics of both row-oriented and column-oriented databases. Typically, a column-oriented database serializes (stores values internally, in files etc.) the contents of the database column-wise: all data from one column is stored together, then the same for the next column, and so forth. The biggest advantage of such a storage mechanism is the quick execution of aggregation operations (like sum, count, etc., which are performed over specific columns), since entire rows need not be read. Instead, the required column, a much smaller subset of the database, can be accessed directly, giving faster query results. Also, since column data is usually of the same type, compression techniques can be employed to optimize storage size, which is not possible in row-oriented stores. BigTable uses the underlying Google File System (GFS) [5] to store data and is based on the shared-nothing architecture [4]. BigTable also relies on a distributed locking service called Chubby [6] to ensure consistency and synchronization of all client activities in a loosely-coupled distributed system. It provides its clients with a highly reliable and consistent environment. The open source counterparts from Hadoop [12] also have similarities in terms of architecture. One of the primary objectives of this research work is to conduct a survey of these distributed systems to understand their functionality, architecture and the structures employed.
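The aggregation advantage can be made concrete with a toy sketch (a hypothetical class and data, not taken from the dissertation): summing one field over a row-oriented layout walks every whole record, whereas a column-oriented layout exposes that field as one contiguous array.

```java
import java.util.Arrays;

/** Toy contrast between row-oriented and column-oriented access for one aggregation. */
public class ColumnStoreSketch {

    // Row-oriented layout: every record carries all of its fields together,
    // so summing one field still walks every whole record.
    static int sumFieldRowWise(int[][] records, int fieldIndex) {
        int sum = 0;
        for (int[] record : records) {
            sum += record[fieldIndex];
        }
        return sum;
    }

    // Column-oriented layout: one contiguous array per column, so the
    // aggregation touches only the data it actually needs.
    static int sumColumn(int[] column) {
        return Arrays.stream(column).sum();
    }

    public static void main(String[] args) {
        // Hypothetical (id, price) records and the equivalent price column.
        int[][] records = { {1, 10}, {2, 20}, {3, 30} };
        int[] priceColumn = { 10, 20, 30 };
        System.out.println(sumFieldRowWise(records, 1)); // 60
        System.out.println(sumColumn(priceColumn));      // 60
    }
}
```

Both calls compute the same sum, but the column-wise variant also reads a fraction of the bytes, which is the property the compression and cache arguments above rely on.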
Additionally, we will examine in detail different types of data structures, especially the ones that exploit the cache (cache-aware and cache-oblivious) for performance improvements. Their study will provide the necessary understanding and thus allow us to decide on the data structure to implement. This decision will be guided primarily by the requirement that the structure be similar to BigTable’s: data should be stored in a column-oriented manner. Efficient memory and cache utilization, performance, etc. are the other criteria. We will also look at in-memory databases [18, 19], as they have very low response times, and decide on their suitability for this project. This section deals with the aspects discussed above and thus provides an understanding of the background and the system in general.

2.1 The BigTable

BigTable is defined as “a sparse, distributed, persistent multidimensional sorted map” by Chang et al. [3]. It is “sparse” because each row in a table can have an arbitrary number of columns, very different from the other rows in that table. This is possible because BigTable is not a conventional relational database management system that is strictly row-oriented; it is instead a non-SQL, column-oriented database system. A BigTable row contains only those columns which contain some value. Contrary to an RDBMS, there are no NULL values and no JOINs. The tables are also unlike traditional RDBMS ones. A table here is a “map”, indexed by a row key, a column key and a timestamp. In other words, a cell in a BigTable table is identified by 3 dimensions: row, column and timestamp. The timestamp facilitates versioning of the data. Each cell can have multiple values at different points in time, and each value, an array of bytes, is maintained separately with its associated timestamp. Timestamps are 64-bit integers and can be used to store actual time in microseconds as well.
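The "sparse multidimensional sorted map" can be approximated in a few lines with nested sorted maps. This is only an illustrative sketch with hypothetical names (BigTable's real store is the distributed tablet and SSTable machinery discussed later in this section), but it captures indexing a cell by row key, column key and timestamp:

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

/**
 * Sketch of the BigTable map model: (row key, column key, timestamp) -> value.
 * Rows and columns are kept in lexicographic order; versions newest-first.
 */
public class SparseSortedMap {

    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    public void put(String rowKey, String columnKey, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, r -> new TreeMap<>())
            .computeIfAbsent(columnKey, c -> new TreeMap<>(Collections.reverseOrder()))
            .put(timestamp, value);
    }

    /** Most recent version of a cell, or null if the cell was never written. */
    public String getLatest(String rowKey, String columnKey) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) return null;
        NavigableMap<Long, String> versions = row.get(columnKey);
        return versions == null ? null : versions.firstEntry().getValue();
    }
}
```

A cell that was never written simply has no entry and reads back as null, mirroring the sparseness described above: no NULL padding is stored for absent columns.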
The unique row key is an arbitrary string, a maximum of 64 KB in size. All data is maintained in lexicographic order of the row key. A table can be huge and is therefore split at row boundaries into manageable partitions called tablets. Each tablet is around 100–200 MB in size, allowing several hundred to be stored on each machine. This arrangement allows for fine-grained load balancing. Several column keys are combined to form a set called a column family, which is the basic unit of access control. Any number of column keys can be part of a single column family, but these columns are usually of the same data type. The number of column families is restricted to a few hundred, in contrast to the unbounded number of columns in a table. A column key has the syntax family:qualifier, where ‘family’ and ‘qualifier’ refer to the column family and the column key respectively. Figure 2.1, redrawn from the original paper [3], illustrates the structure of a single row in BigTable, keyed by the row key “com.cnn.www”.

Figure 2.1: Example table storing web pages. Redrawn from the original BigTable paper [3].

The figure contains 2 column families, namely ‘contents’ and ‘anchor’. ‘contents’ has a single column while ‘anchor’ has 2 (“anchor:cnnsi.com” and “anchor:my.look.ca”). While each ‘anchor’ column value has a single timestamp (t9 for “CNN” and t8 for “CNN.com”), ‘contents’ has 3 timestamps for a single column (t3, t5 and t6), where t3 is the oldest and t6 holds the most recent value. The next row can have a different number of columns for these two column families. As already mentioned, Google uses the distributed GFS to store all data and maintain log records. Internally, however, an immutable file format called SSTable is used to store the BigTable tablets (data). It is a sequence of blocks, typically 64 KB in size. An index is stored at the end of each SSTable, to locate its blocks.
A BigTable realization comprises three major constituents: a master server, several tablet servers and a library attached to every client machine. The master server assigns tablets to the various tablet servers, performs load balancing and garbage collection, and detects any changes in the set of tablet servers. Each tablet server manages the tablets assigned to it, including reads and writes by clients. Additionally, it is responsible for partitioning tablets that have exceeded their size limit. BigTable uses Chubby [6] as a locking service for synchronization, tablet location information, tablet server expirations, storing schema information and so forth. A three-tier, B+ tree-like structure is used to store tablet location information. Here, the first level is a file stored in Chubby which holds the location of the root tablet; the root tablet in turn holds the locations of all other tablets of a special table called METADATA. Each METADATA tablet stores the locations of the user tablets, including a list of their SSTables. SSTables are loaded into memory using their index into a memtable. All updates are also made to a memtable. As its size grows with updates and reaches a threshold, a new memtable is created; the old memtable is turned into an SSTable and sent to the GFS. This is termed “minor compaction”. Minor compactions create new SSTables, and after a time several of them accumulate. Therefore, to curb the creation of numerous such SSTables, another merge operation, called “major compaction”, is performed at regular intervals. This involves rewriting all existing SSTables into a single one.

2.2 HBase

HBase is an open-source, BigTable-like structured storage built on the Hadoop Distributed File System (HDFS) [12]. Source [7] defines HBase as “an open-source, distributed, versioned, column-oriented store modelled after Google’s BigTable”. Here too, a table is “sparse” in that rows in the same table can have a variable number of columns.
The rows again are lexicographically sorted on a unique row key. Like BigTable, it is a multi-dimensional map, with data being identified by the 3 dimensions row, column and timestamp. A row contains only those columns which hold some data value; no NULL values are used. Columns, as in BigTable, are grouped together to constitute column families, and individual columns are denoted by a column qualifier or label. A column therefore needs to be identified by the <family:qualifier> notation. Figure 2.2 below illustrates rows and columns. It is a JSON example created based on examples from source [8].

{
  "aaaaa" : {
    "A" : { "foo" : "y", "bar" : "d" }
  },
  "aaaab" : {
    "A" : { "check" : "world" },
    "B" : { "test" : "ocean" }
  }
}

Figure 2.2: To illustrate the concept of ‘rows’, ‘column families’ and ‘columns’ in HBase (and BigTable). Drawn based on examples in source [8].

This figure explains the arrangement of rows, column families and columns in HBase (and BigTable). Here, ‘aaaaa’ and ‘aaaab’ are two rows in an HBase table, arranged in ascending lexicographic order. The table contains 2 column families: ‘A’ and ‘B’. Note that the column families in a table are usually static, unlike the columns constituting them. The first row ‘aaaaa’ has 2 columns from only 1 family, A:foo and A:bar, whereas the second row ‘aaaab’ has 2 very different columns belonging to 2 different families, A:check and B:test. Each of these data values can also have several versions, as stated earlier, thus allowing the database to store historical data as well. This can be illustrated using JSON as shown below.

"aaaaa" : {
  "A" : {
    "foo" : { 20 : "y", 8 : "x" },
    "bar" : { 16 : "d" }
  }
}

Figure 2.3: To illustrate timestamps. Drawn based on example in source [8].

The figure above illustrates the use of timestamps in HBase/BigTable. The most recent data is stored first. For instance, to access the most recent value ‘y’, HBase will use the path “aaaaa/A:foo/20”, while “aaaaa/A:foo/8” retrieves ‘x’.
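The versioning just described can be sketched with a per-cell sorted map of timestamp to value. This is a simplified illustration, not HBase code: `TreeMap.floorEntry` returns the entry with the greatest timestamp not exceeding the queried time, which models the “as of time t” lookup discussed next.

```java
import java.util.Map;
import java.util.TreeMap;

// One versioned cell: timestamp -> value, with an "as of" lookup.
public class VersionedCell {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    public void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Value whose timestamp is the largest one <= queryTime,
    // or null if no version is old enough.
    public String getAsOf(long queryTime) {
        Map.Entry<Long, String> e = versions.floorEntry(queryTime);
        return e == null ? null : e.getValue();
    }
}
```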
Also, when responding to a query, HBase returns the value whose timestamp is “less than or equal to” the queried time. For instance, if we query with timestamp 10, we will receive the cell value ‘x’, since its timestamp (8) is the largest one not exceeding 10. An HBase table comprises several regions, each of which is delimited by a ‘startKey’ and an ‘endKey’. Regions are made of several HDFS blocks. There are two types of nodes, namely the Master server and the Region servers, serving numerous client machines. These servers are similar to the master and tablet servers in BigTable. The Master server monitors the Region servers and assigns and balances load across them. Region servers hold multiple regions. In contrast to the Chubby lock service in BigTable, HBase uses ZooKeeper [9], a centralized service, for distributed synchronization. It has an extremely simple interface and is itself distributed and highly reliable. Clients connect to a specific cluster by seeking information from ZooKeeper, since it holds the locations of the Region servers hosting the root locations of all tables. HBase uses an internal file format called HFile [11], analogous to BigTable’s SSTable. It uses a 64 KB block size, with each block containing data and identified by a block magic number. HBase, like BigTable, is extremely efficient when managing huge amounts of data, in the order of petabytes, over an equally large number of machines distributed all across the globe. It allows data replication for reliability, availability and fault tolerance. It also facilitates distributed reads and writes on the data that are very fast.

2.3 The Google File System

The Google File System (GFS) [5] is a proprietary, scalable and distributed file system designed specifically for large, distributed and data-intensive applications. It is fault-tolerant and reliable, providing high aggregate performance to its clients.
The GFS design is primarily motivated by observations of Google’s technological environment and application workloads, where component failures are inevitable. The file system runs on thousands of inexpensive, commodity Linux systems and is accessed by an equivalent number of client machines. Unlike many file systems, it is not built into the OS kernel but is supported as a library. GFS is simple and provides users with basic file commands like open, close, create, read, write, append and snapshot. Append and snapshot are special commands: while append allows multiple clients to add information to files (even concurrently) without overwriting existing data, snapshot creates a copy of a file or directory tree at minimal system cost. Google organizes its resources into distributed clusters of computers, with each cluster comprising thousands of machines, classified as a master server, chunk servers and client machines. Client files tend to be very large (of the order of multiple GB), so they are divided into fixed-size chunks of 64 MB each and stored on various chunk servers. For reliability, chunks are replicated on multiple chunk servers, with a default of 3 replicas. At creation time, each chunk is assigned a globally unique 64-bit chunk handle. The master acts as the cluster coordinator: it maintains an operation log for its cluster and stores all file system metadata, including namespaces, access control information, mappings of files to chunks and current chunk locations. The master does not persistently store any chunk location information; instead, upon start-up, it polls the chunk servers, which respond with the chunks they hold. It also communicates periodically with the chunk servers via HeartBeat messages to give instructions and collect their state.
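Since files are split into fixed 64 MB chunks, a client can translate any byte offset into a chunk index arithmetically before asking the master for that chunk's handle and locations. A minimal sketch of this translation (the class and method names are illustrative, not from GFS):

```java
// Client-side arithmetic in the GFS read path: a byte offset within a file
// maps to a chunk index (files are split into fixed-size 64 MB chunks).
public class GfsClientMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB

    // Which chunk of the file holds this byte offset?
    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    // Offset of the requested byte within that chunk.
    static long offsetInChunk(long byteOffset) {
        return byteOffset % CHUNK_SIZE;
    }
}
```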
Figure 2.4: GFS architecture, showing the application and GFS client, the GFS master, and the GFS chunk servers together with their control and data message flows. Redrawn from the original GFS paper [5].

The client code is linked into each application (Figure 2.4 above), and it communicates with the master and the chunk servers to read and write data. Figure 2.4 illustrates the architecture in terms of a single read request. The application sends a filename and byte offset to the client, which converts this information into a chunk index and sends it (along with the filename) to the master. The master replies with the corresponding chunk handle and replica locations. The client then sends a request to the closest replica. It also caches the chunk replica locations so that future interactions need not involve the master. All metadata on the master is stored in in-memory data structures, and hence master operations are fast and efficient. The operation log mentioned earlier is critical to the GFS in that it contains all vital changes to the system metadata. The system is designed to have minimal master involvement in all operations. To this end, the master grants a lease to one of the replicas, called the primary replica (chunk server), for an initial duration of 60 seconds. All mutations (alterations to file content and/or namespace) are then managed by the primary, including the management of the secondary replicas. Another crucial feature of the GFS is garbage collection. This mechanism is unusual in that the physical storage released by a file deletion is not reclaimed immediately. Instead, the file is renamed with a special (hidden) name along with a timestamp.
The master performs scans at scheduled times, during which it permanently deletes all ‘hidden’ files that have existed for more than 3 days (using the timestamp). GFS applies a very important principle of autonomic computing [45, 46]: the system can detect and correct its problems without any human intervention. It incorporates ‘stale replica detection’ (where, using version information, the master can identify outdated replicas) and various fault-tolerance techniques like fast recovery (all servers restart and restore a stable state in seconds, irrespective of how they terminated), chunk replication, master replication (copies of the master are maintained, including ‘shadow masters’, slightly outdated read-only master replicas) and so forth. The Google File System is structured in a manner that systems as well as hardware memory can be upgraded with ease, making it truly scalable. This knowledge is vital, since it informs this project’s treatment of in-memory data structures, fault tolerance and security techniques.

2.4 Hadoop Distributed File System

HDFS [12] is similar to the GFS [5]: it partitions large data files into fixed-size blocks called chunks and stores them across several machines in the cluster. It is designed to handle hardware failures and network congestion in a robust manner. It uses a large number of inexpensive commodity systems to construct the distributed cluster. It is fault-tolerant, reliable and highly scalable. However, the design is restricted to a specific type of application: it is assumed that the applications using HDFS write their files only once and perform frequent sequential streaming reads with infrequent updates. An HDFS cluster comprises a NameNode connected to numerous DataNodes and client machines. These are analogous to the master and chunk servers in GFS. The NameNode stores all the metadata, like the namespace, file-to-chunk mappings and so on, and also controls the DataNodes.
All metadata is stored in memory to facilitate faster access. The NameNode is accessed by a client to retrieve the location information of all chunks constituting the file it requires. This includes the locations of all chunk replicas, created for greater reliability and fault tolerance. The client then selects the DataNode nearest to it to start its operations. DataNodes, like the GFS chunk servers, store the actual data chunks (blocks), with each chunk being replicated three times by default. Replicas are housed on different machines, preferably on separate racks in the cluster. Moreover, apart from data replication, the NameNode itself is also copied, so as to preserve the metadata in the event of a failure.

2.5 Data Retrieval in a Cluster Environment: MapReduce

Querying and data retrieval are an integral part of any database system and involve complex processing. As the amount of data increases, so does the processing required to maintain a reasonably good response time. Distributed database systems have the advantage of utilizing parallelism to achieve this. A programming model available to exploit parallelism in both distributed and multi-core systems is MapReduce. The advantage of this model is that it abstracts the complex parallel implementation away from the programmer and yet achieves large-scale parallelism. The programmer is typically only involved in expressing the problem at hand in a functional programming style. Once this is done, the MapReduce runtime environment automatically parallelises it. Google MapReduce [14] is a generic programming framework for processing and generating very large datasets in a cluster computing environment. The primary advantage of this paradigm is the simplicity it provides to a programmer, abstracting the underlying complexities and allowing the computation to be expressed in a functional style.
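The programmer's side of this contract can be sketched with the classic word-count example. This is a sequential toy, not the Google or Hadoop runtime: a real framework shards the input, runs map tasks in parallel, shuffles and sorts by key, and runs reduce tasks in parallel, while here only the two user-defined functions and the grouping step are shown.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Word count expressed as map and reduce functions.
public class WordCount {
    // map: (documentId, text) -> list of (word, 1) intermediate pairs
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : text.split("\\s+"))
            if (!word.isEmpty()) out.add(new SimpleEntry<>(word, 1));
        return out;
    }

    // reduce: (word, [counts]) -> total count for that word
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // The runtime's job, done sequentially here: group intermediate
    // pairs by key, then reduce each group.
    static Map<String, Integer> run(Map<String, String> documents) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet())
            for (Map.Entry<String, Integer> kv : map(doc.getKey(), doc.getValue()))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        return result;
    }
}
```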
The implementation is highly scalable and easy to use, capable of processing terabytes of data across thousands of machines. It requires programmers to specify two functions: map and reduce. Both functions accept key/value pairs as input. The map function processes an input key/value pair and generates a list of intermediate key/value pairs. The reduce function reads the sorted output of map and merges all intermediate values for a particular key, producing an output for each unique key. Apart from these user-defined functions, the framework provides a runtime environment that manages data partitioning, scheduling, fault tolerance and automatic parallelism, all abstracted from the programmer. It uses GFS [5] as the underlying file system. The open-source counterpart of Google MapReduce is Hadoop MapReduce [16], also a framework for processing huge amounts of data on large clusters of machines; it is based on HDFS [12]. MapReduce implementations on multi-core systems differ slightly from those on distributed systems, although the underlying principle remains the same. Phoenix, developed by Ranger et al. [13], and MR-J, developed by Kovoor et al. [17], are examples of MapReduce architectures for shared memory multi-core systems.

2.6 In-Memory Database Systems

An In-Memory Database (IMDB) [18, 19] system, also known as a Main Memory Database (MMDB), is a database management system that stores and manipulates its data in main memory, eliminating disk access, unlike most database systems, which use the disk for persistent storage. Conventional disk-resident database (DRDB) systems support all the ACID (Atomicity, Consistency, Isolation and Durability) properties. Database transactions (operations) can fail due to various hardware and software problems. The ACID properties ensure that these transactions are processed reliably; that is, even in the event of a failure the data stored will be consistent and reliable [38].
However, DRDB systems have limitations in terms of their response time and throughput. Caching the disk data in memory for faster access does not completely eliminate disk accesses. Such accesses reduce the throughput while increasing the response time, rendering the system unsuitable for time-critical (hard real-time) applications. IMDB systems, on the other hand, were primarily designed to cater to time-critical applications by achieving very low response times and high throughput. They are faster because their performance does not depend on disk I/O. The data structures employed are also optimized to gain maximum performance benefits. Moreover, they usually have a strictly memory-based architecture, which implies that data is stored and manipulated in memory in exactly the same form in which it is used by the application. This completely eliminates the overheads associated with data translation as well as caching, and also results in minimal CPU usage. Another advantage of IMDB systems is that they can achieve multi-user concurrency on shared data with consistent performance. Main memory is volatile. This makes IMDBs appear to lack the durability property of ACID in the case of a power failure or server crash. Durability can nevertheless be achieved by any of the following mechanisms:

1. Creating checkpoints or snapshots. These periodically record the database state and provide the required persistence. However, in the event of a system failure, the modifications made after the most recent checkpoint will be lost; hence this provides only partial durability.
2. Combining checkpoints and transaction logging. A transaction log records all modifications to a log/journal file, which enables complete system recovery.
3. Using non-volatile RAM or an EEPROM (Electrically Erasable Programmable Read-Only Memory).
4. Maintaining a disk backup.
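Mechanism 2 above can be sketched as follows. This is a minimal illustration, not any particular IMDB's recovery code: real systems write the checkpoint and the log to stable storage, whereas plain in-memory collections stand in for them here so the replay logic can be shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Durability via checkpoint + transaction log: restore the checkpoint,
// then replay every update logged after it was taken.
public class DurableStore {
    private Map<String, String> data = new HashMap<>();
    private Map<String, String> checkpoint = new HashMap<>();
    private final List<String[]> log = new ArrayList<>();   // entries: {key, value}

    public void put(String key, String value) {
        log.add(new String[] { key, value });   // log first, then apply
        data.put(key, value);
    }

    // Record the full database state and truncate the log.
    public void takeCheckpoint() {
        checkpoint = new HashMap<>(data);
        log.clear();
    }

    // Simulate recovery after a crash: restore the checkpoint, replay the log.
    public void recover() {
        data = new HashMap<>(checkpoint);
        for (String[] entry : log) data.put(entry[0], entry[1]);
    }

    public String get(String key) {
        return data.get(key);
    }
}
```

With the log, the update made after the checkpoint survives recovery; with checkpoints alone (mechanism 1) it would be lost.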
Another disadvantage is the limited storage available to these systems, since all data is stored only in main memory, which has less capacity than a disk. IMDBs are primarily used for performance-critical embedded systems, which are usually devices that require applications and data to have a small footprint (size/memory requirement); their being memory-resident (i.e. limited in storage) therefore no longer remains an issue. Moreover, when they are used for systems handling large datasets, virtual memory usually comes into play to hold the excess data. IMDBs are extremely important for this research because of their low response time and high throughput: designing a system with the lowest possible response time and optimal memory usage is one of the basic objectives of this project.

2.7 Cache-Oblivious Data Structures

Modern computers have a multi-level hierarchical storage that includes the CPU registers, several levels of cache, main memory and disk, with data moving between the processor registers and the rest. Figure 2.5 below illustrates this hierarchy.

Figure 2.5: The memory hierarchy, from the CPU registers through the cache levels and RAM to the disk. Redrawn from source [34].

As the memory levels move further from the CPU, their access times as well as their storage capacities increase. In fact, there is a sharp rise in both as we proceed from main memory to disk. This implies that, for any algorithm executing on such a system, the cost of a memory access (and hence system performance) depends entirely on the storage level where the accessed element currently resides. Moreover, data travels between the levels of the hierarchy in blocks of a certain size, and different caches have different block sizes.
So the design of an algorithm, in terms of how it accesses memory, has a major impact on its actual execution time; to achieve optimal performance, it should take into consideration the storage hierarchy characteristics mentioned above, especially the cache. Normally, algorithms are analyzed while overlooking the existence of the cache between the CPU and RAM (illustrated in Figure 2.6), which assumes that all memory accesses consume the same amount of time. In practice this is not so, and therefore data structures and algorithms that can exploit the cache suitably can achieve very high performance.

Figure 2.6: The RAM model, with the CPU accessing RAM directly. Redrawn from source [34].

Cache-aware data structures and algorithms [23] do just this. They contain parameters that can be tuned to gain optimal performance for a specific cache size. This advantage in turn creates a problem: they either need to be tuned for every system (with its particular cache size) for good performance, or they perform well only on the systems for which they were tuned and less well on others. This behaviour is not really an attractive one. Caches in general are based on two basic principles of locality, namely temporal and spatial locality [23]. Temporal locality states that a program which uses a particular piece of data has a high probability of using the same data again in the near future. Spatial locality states that a program which uses a particular piece of data has a high probability of using some adjacent data in the near future. So any optimal cache-aware algorithm should try to exploit both these properties. Harald Prokop introduced the concept of cache-oblivious algorithms in his 1999 master’s thesis [27], later published by Frigo et al. [20]. This arrived as a solution to the cache-aware tuning problem: it also exploits the cache, but without requiring tuning to achieve optimal performance.
It works well for all cache block sizes, since it optimizes the algorithm for one unknown memory level, which automatically optimizes it for all levels. The basic idea is to split a dataset recursively so that, at some point, a single portion (split section) of the dataset is small enough to fit into the cache while filling at least half of it. This minimizes cache misses and also eliminates the need to know the cache block size. The data structures are designed in a manner that a dataset (irrespective of its size) is split appropriately to make good use of caches of all sizes. The memory model suggested by Prokop [27] considers an infinitely large external memory and an internal memory acting as a cache of size M. Data moves between these two in blocks of size B. The algorithm cannot control the cache, in that it does not explicitly manage the movement of data blocks between the two storage devices; it assumes the existence of a cache manager. This restriction arises because the values of M and B are unknown and hence cannot be manipulated directly. A fixed page replacement policy is used, and the cache is assumed to be ideal [27], meaning that it is fully associative and the page replacement strategy is optimal. The cache is also assumed to be tall [27]. A tall cache is one where the number of blocks present in it (M / B) is much greater than the size of a single block (B). This assumption is represented by the following equation:

M = Ω(B²) …………………………………………………… eqn. (1) [20, 30]

This constraint gives cache-oblivious algorithms a large pool of candidate values for the block size B. Demaine [30] in his paper introduces the various cache-oblivious algorithms and data structures available, explaining the techniques behind those designs. Also, Bender et al. [29] proposed a design for a cache-oblivious B-tree, which was later simplified by Wu et al.
[28], while still preserving cache locality. All these designs make efficient use of the cache. Olsen and Skov [26] also analyzed and examined two cache-oblivious priority queues and designed an optimal cache-oblivious priority deque based on one of them. In 2005, Bender, Fineman, Gilbert and Kuszmaul [31] proposed 3 different concurrent cache-oblivious algorithms that they proved make efficient use of the cache. A very important aspect of this research is therefore to analyse these data structures in order to identify one suitable for our purpose.

3 System Design

This research project entails developing a subset of a Google BigTable-like database for shared memory multi-core systems. The implementation is a simplified, in-memory [18, 19] version of the database, aiming to achieve performance benefits such as speed-up, as well as scalability in a multi-core environment. It involves creating a data structure based on the concurrent cache-oblivious B-tree design proposed by Bender et al. [31]. Another vital aspect of the research is to perform data retrieval operations in parallel and then evaluate the efficiency and usability of the design. This is crucial, as it will help us assess the suitability of a multi-core environment for such huge distributed database systems. The functional and architectural details of the implemented system are discussed in the following subsections.

3.1 System Overview

The design for this database system comprises two main parts: the underlying data structure that holds the data, and a set of operations to query the database with. The data structure resides in memory and is based on a cache-oblivious design. The disk is used only to load the database and to store a backup of the data to ensure durability. The cache-oblivious model is primarily based on the packed-memory concurrent cache-oblivious B-tree model proposed by Bender et al.
in 2005 [31], which contains both lock-based and lock-free versions of the structure for concurrency control. It therefore ensures that the data can be accessed concurrently. Moreover, B-trees minimise the number of block transfers between memory levels, which is critical to this design given its in-memory nature. An important point to note here is that all data needs to be stored in key-value format and sorted on a unique key. As mentioned in the previous section, every BigTable datum is identified by a unique combination key (row, column, and timestamp). It is therefore essential for the data structure of this implementation to support this and yield good performance. The operations include a set of functions to create and manipulate the data structure, map newly appended data to the appropriate location, provide thread-level safety (for concurrency), and so forth. The retrieval operations, primarily consisting of search queries, are parallel in nature. These functions are designed in accordance with Google’s API model.

3.2 System Model

B-trees have been one of the most predominant data structures that keep data sorted while allowing insertions, deletions, searches and sequential reads with very low response times. A B-tree is a generalized binary search tree [33], optimized to handle large data sets. The packed-memory concurrent cache-oblivious B-tree model [31] consists of two structures combined into one: a static cache-oblivious binary tree [27] and a packed-memory data structure [29]. The static cache-oblivious binary tree is a static binary tree based on the van Emde Boas (cache-oblivious) layout [29]. The nodes of the tree can be traversed in O(1 + log_B N) memory transfers and the layout is hence asymptotically optimal [29]. The packed-memory data structure is ‘one-way packed’ and stores the data in sorted order in a loosely packed array. It is said to be loosely packed, since the elements are stored with gaps in between to allow for insertions and deletions.
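A loosely packed, sorted array of this kind can be sketched as follows. This is a simplified illustration, not the implementation described later: the fixed section size, the linear in-section scan, and the absence of any rebalancing when a section fills up are all simplifying assumptions.

```java
// A sorted array divided into fixed-size sections, with gaps (null slots)
// in each section so a key can be inserted without shifting the whole array.
public class PackedArray {
    private final Integer[] slots;      // null means a gap
    private final int sectionSize;

    public PackedArray(int sections, int sectionSize) {
        this.slots = new Integer[sections * sectionSize];
        this.sectionSize = sectionSize;
    }

    // Insert into a gap of the given section, keeping the section sorted.
    // Returns false on a duplicate or when the section is full
    // (a full section would trigger a rebalance, not modelled here).
    public boolean insert(int section, int key) {
        int start = section * sectionSize, end = start + sectionSize;
        for (int i = start; i < end; i++) {
            if (slots[i] == null) {             // first gap: place key here...
                int j = i;
                while (j > start && slots[j - 1] != null && slots[j - 1] > key) {
                    slots[j] = slots[j - 1];    // ...shifting larger keys right
                    j--;
                }
                slots[j] = key;
                return true;
            }
            if (slots[i] == key) return false;  // duplicate
        }
        return false;                           // section full
    }

    public boolean contains(int section, int key) {
        int start = section * sectionSize;
        for (int i = start; i < start + sectionSize; i++)
            if (slots[i] != null && slots[i] == key) return true;
        return false;
    }
}
```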
One-way packing allows concurrency to be supported. The array is divided into ‘sections’, with gaps within each section to allow insertions, as mentioned above. The combined structure is a binary tree, sorted on the combined key (row key, column key, timestamp), where each node contains both the key and the data. The leaves, however, correspond to certain sections of the packed-memory array: each leaf of the tree maps onto the first element of a section. The other nodes are not stored in this array and can be accessed directly from the tree. However, since the tree is static, any new insertion results in accessing the appropriate section of the packed array. Thus, the packed array is primarily designed to support insertions, deletions, as well as search queries on the newly added data. The figure below illustrates the design used for the database system.

Figure 3.1: Data structure design. It is a modified design based on the work of Bender et al. [31].

The binary tree in the figure contains the combination keys (of row, column, and timestamp) as its nodes, as shown for the root node ‘55’; each number in the nodes (like 55) represents a combination key. The tree is created only once and is hence static. The actual data is stored in the tree and, for the leaves, in the packed array as well (below the tree). This array contains, in addition to the data, the combination key, so each element of the array is a compound data type comprising a key and its data, as illustrated for the key value ‘44’. The array is again divided into sections, shown in bold black in the diagram. Each section contains a leaf and some gaps for insertions. For example, the first section contains only one key-value pair, with key 1; the remaining array locations in that section are empty. The maximum key value it can possibly hold is 8 (shown in grey).
The next key (9) is part of the next section, which can hold a maximum key of 21 (shown in grey), and so on. Also, the leaves of the tree map onto the first element of every section, as shown above. Thus the packed array allows for quick insertions into the database, with the static tree acting as an index. Figure 3.1 above is the data structure based on the model described earlier. This structure can be used effectively to store the data in key/value format, as required for this research. The choice of a binary tree is important, since it ensures that the data elements added are stored in sorted order. Essentially, this structure ensures that every operation (delete/search) results in traversing the tree and performing that operation on the tree itself, provided the data is available there. It also ensures quick execution, due to the binary layout. However, if the data is not available in the tree, the traversal locates an appropriate leaf, which then directly maps onto the corresponding section of the packed array. A linear search or a binary search can then be carried out within that section to locate the desired element. Also, since the array is loosely packed (contains gaps), insertion operations into the database always result in data being added to the packed array, within the gaps of a particular section. The structure is also beneficial in terms of performance: it remains in main memory throughout, giving a low response time, and the in-memory mapping of keys to values adds to this speed.

3.3 Data Model

As stated earlier, the data in this implementation conforms to the Google Datastore data model, in that it is stored in a key-value format where each data item is identified by a unique combination of three keys (row key, column key, and timestamp). To facilitate this, a unique row id (row key) is generated by the system for every data element belonging to a new row.
All data items belonging to a particular row have the same row id. Since a single row contains one or more columns, the combination key (row key, column key, timestamp) always remains unique for every data element in the database. The exact combination key used in this implementation is a dot-separated string of the column name, row key and timestamp, in that order. Also, all data, as in the Google system, is stored as strings. For instance, to store a value 50 representing the ‘age’ of an employee, we create the combo key AGE.R20.111100, assuming here that this value is the 20th row in the database. The combo key is thus based on the format COL_NAME.ROW_ID.TIMESTAMP. Figure 3.2 below illustrates this model using a sample column-oriented database containing 4 to 5 columns: employee id, first name, an optional middle name, last name and age. It may be recalled that the main advantage of a column store over a traditional RDBMS is the flexibility it gives to each row: a row can contain a variable number of different columns and hence need not store NULL values unnecessarily (in those cells which do not have an appropriate value). In the sample database of Figure 3.2, the column E_MNAME (employee middle name) is one such optional attribute; every row of a column-oriented database is therefore not required to hold that column. The implementation in this project allows the same flexibility by using key-value pairs to store the data.

Figure 3.2: Representation of the data model. Diagram illustrating the key-value pairs and the format of the unique key used to identify a value. Here, E_MNAME is present only in Row 1.

As mentioned earlier, this data representation allows for the creation of a typical columnar database, since each row can have a variable number of different columns, as illustrated above. To ensure efficient operations and performance benefits on this database, the choice of an appropriate key is extremely crucial.
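The combo key format described above can be built and compared as a plain string; sorting such keys lexicographically places all values of a column next to each other, whatever row they belong to. A small sketch with made-up keys (the helper class and its sample values are ours, not taken from the implementation):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Building COL_NAME.ROW_ID.TIMESTAMP combo keys and sorting them.
public class ComboKey {
    static String build(String columnName, int rowId, long timestamp) {
        return columnName + ".R" + rowId + "." + timestamp;
    }

    // Plain lexicographic sort: keys of the same column end up adjacent.
    static List<String> sorted(List<String> comboKeys) {
        List<String> out = new ArrayList<>(comboKeys);
        Collections.sort(out);
        return out;
    }
}
```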
This is because, in this implementation, all data are sorted and arranged in the static binary tree (described earlier) based on this unique combo key, which implies that all operations, including retrievals, depend primarily on the position of these keys in the binary tree. To allow quick data access and retrieval, the COL_NAME.ROW_ID.TIMESTAMP format is followed. This format ensures that the sorting of the keys is based on the column names; thus all values (across all rows) belonging to the same column are grouped together (stored as a subtree) in the binary tree. This columnar locality gives the data structure the advantage typical of a column-oriented store, since data stored in a column-oriented manner allow operations like group-by and other aggregations to be extremely fast.

3.4 Data Manipulation and Retrieval

The operations supported by the database are insertion (data append), deletion and search queries. As in BigTable, there are no random data write operations. A set of system functions is also designed to handle background operations. These include an array rebalancing operation, which is mandatory for data deletions and optional for insertions. Rebalancing is the rearrangement of the packed array to adjust the element density in its sections. The most important runtime system operation is to execute user queries in parallel. Since not all queries are completely parallelizable, they need to be written in a manner that makes maximum use of the processor cores. In this project, however, a single query is designed, parallelized and executed on several multi-core systems for performance evaluation.

4 Data Structure Implementation

This section focuses on the implementation of the underlying data structure that stores the database keys and their corresponding values, as described in the last section.
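As a concrete illustration of these combo keys (the class and the sample data are illustrative sketches, not the dissertation's code), a key can be built in the COL_NAME.ROW_ID.TIMESTAMP format, and its columnar ordering behaviour observed using Java's sorted TreeMap:

```java
import java.util.TreeMap;

// Illustrative sketch: building combo keys of the form COL_NAME.ROW_ID.TIMESTAMP
// and showing that lexicographic key ordering groups all values of one column
// together (the columnar locality described in Section 3.3).
public class ComboKeyDemo {
    // Builds a combo key such as "AGE.R20.111100".
    public static String comboKey(String column, int row, String timestamp) {
        return column + ".R" + row + "." + timestamp;
    }

    public static void main(String[] args) {
        TreeMap<String, String> db = new TreeMap<>(); // keeps keys sorted, like the tree
        db.put(comboKey("AGE", 1, "111100"), "50");
        db.put(comboKey("E_FNAME", 1, "111100"), "John");
        db.put(comboKey("AGE", 2, "111100"), "34");
        db.put(comboKey("E_FNAME", 2, "111100"), "Jane");
        // Both AGE entries sort before both E_FNAME entries (columnar locality).
        System.out.println(db.keySet());
    }
}
```

Because the column name leads the key, all entries of one column form a contiguous, sorted run, which is what makes column-wise aggregation cheap in this design.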
All implementation is done using Java 7 on Eclipse Helios. The data structure, as mentioned earlier, is composed of two parts: a static binary tree based on the van Emde Boas layout [29] and a packed array. The database is in-memory and is loaded only once from a file. The file data is used to create a sorted binary tree, which is balanced and then made complete (every non-leaf node has exactly two children). This tree is traversed in the van Emde Boas manner to create a cache-oblivious array. The leaves of this tree are then mapped onto a loosely packed array (with gaps) to allow for insertions. The following subsections provide a detailed description of the implementation of these structures.

Figure 4.1: Steps to create the data structure.

The steps carried out to create the entire data structure are illustrated in the diagram above. The implementation of each step is described in detail in the following subsections.

4.1 Static Binary Tree

A binary tree is composed of numerous nodes, defined in this implementation as a class called TreeNode. In general, each node of a binary tree consists of data and links to its left and right children. Here too, each TreeNode consists of a data part and links to its two children. The data in turn comprises a combo key (of the form column_name.row_key.timestamp) and its corresponding value. The links to the left and right child are not implemented as separate pointers; instead, an array link of size 2 is used: the 0th element of link stores the left child and the 1st element the right child. The class definition is as follows: Listing 1: Class definition of a node in the implementation of a binary tree. A Full (Complete) Binary Tree: A complete binary tree here is one in which every non-leaf node has exactly 2 child nodes. This implies that the total number of nodes for a complete binary tree of height h is fixed and hence can be calculated.
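Based on the description above, a TreeNode class might look like the following sketch. The link array and the veb_index field match their use in the later listings; all other details (field types, accessor names) are assumptions rather than the original definition.

```java
// Sketch of the TreeNode class described above. The fields `link` and
// `veb_index` match their use in later listings; everything else is assumed.
public class TreeNode {
    private String key;    // combo key: column_name.row_key.timestamp
    private String value;  // the corresponding data value
    public TreeNode[] link = new TreeNode[2]; // link[0] = left child, link[1] = right child
    private long veb_index; // position of this node in the vEB array

    public TreeNode(String key, String value) {
        this.key = key;
        this.value = value;
    }
    public String getKey() { return key; }
    public String getValue() { return value; }
    public long getVeb_index() { return veb_index; }
    public void setVeb_index(long i) { this.veb_index = i; }
}
```

Using an array of two links instead of named left/right fields lets later code pick a child by computed index (link[0] or link[1]) rather than by branching on field names.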
The binary tree required for this implementation, as discussed earlier, cannot be modified once created (hence static). Also, the database is an in-memory one, which means all updates to it should be handled by the data structure. Therefore, the structure itself should allow such updates with minimal rebalancing and adjustment. To cater to this requirement, it is essential, once the tree is populated with the key-value pairs, to check for its completeness and to enforce it if the tree is found not to be complete. Completeness is enforced by adding the missing nodes in the form of empty (zero data) leaves. Creating a complete binary tree facilitates the creation of a packed memory array that allocates space to store key-value pairs for all the leaf nodes, including the empty ones. This allows a large number of insertions into the database without the need to rebalance frequently. Figure 4.2 below illustrates this concept.

Figure 4.2: A complete binary tree. The number of key-value pairs is 11, which creates a binary tree of height 4. However, for a tree of height 4, the total number of nodes should be 15. Hence the tree is made complete by adding the missing nodes in the form of empty (zero data) leaves, illustrated above by the small black circles. This allows space to be allocated for these empty leaf nodes in the packed array structure (drawn below the tree) where new data can be inserted.

Algorithm to Calculate the Number of Nodes and Create a Full Binary Tree: Calculating the number of nodes in a binary tree is quite simple. The number of nodes at level i is always twice the number of nodes at the previous level i-1. For example, in the above diagram, the number of nodes at level 1 (root level) is 1, at the next level (2) it is 2, then 4, and so forth. This idea is used to compute the total number of nodes that should ideally exist in a tree for that tree to become full.
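Summing these per-level counts for a tree of height h (root at level 1) gives 1 + 2 + 4 + ... + 2^(h-1) = 2^h - 1 nodes. A minimal sketch of this computation (the class name is illustrative):

```java
// Total nodes in a full binary tree of height h, with the root at level 1:
// 1 + 2 + 4 + ... + 2^(h-1) = 2^h - 1. Class name is illustrative.
public class FullTreeMath {
    public static int totalNodes(int height) {
        return (1 << height) - 1; // 2^height - 1
    }
}
```

For the tree of height 4 in Figure 4.2 this gives 15 nodes, matching the caption.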
The existing number of nodes is obtained by keeping a count of the items read from the file while constructing the tree. Comparing this count with the ideal total allows us to create a complete binary tree by calculating both the number and the positions of the missing leaves and inserting empty nodes there.

Algorithm to Create a Sorted Tree: The tree is sorted by reading the nodes in-order and storing them temporarily in an array. This sorted array is then used to create the sorted, balanced and complete binary tree. This tree is traversed recursively following the van Emde Boas layout [29] to create the cache-oblivious array. Listing 2 below shows the recursive in-order traversal of nodes used to obtain a sorted binary tree. Listing 2: Implementation of a sorted tree.

Algorithm to Create the vEB array: The van Emde Boas (vEB) technique lays out a balanced and complete binary tree in memory recursively. Let us assume that we have a binary tree of height h and size (number of nodes) N, where h is a power of 2. To traverse this tree using the vEB layout, we divide it into two sections, each of height h/2. The top half of the tree contains a single subtree with the same root as the tree and about √N nodes, whereas the bottom half contains 2^(h/2) subtrees, each with approximately √N nodes. When the height h is not a power of 2, the bottom half is selected such that its height is a power of 2. Figure 4.3 below illustrates this concept using a balanced and full (complete) binary tree of height 5. The basic idea is to first lay out the top half recursively and then the bottom half, with each half being laid out in order of its subtrees.

Figure 4.3: van Emde Boas layout on a binary tree of height 5. Redrawn from [29]. The figure above shows the division of the tree into a top half (small dark square) and a bottom half comprising 2 subtrees (black boxes). Each such subtree can in turn be divided into top and bottom subtrees recursively. The numbers indicate the order of tree traversal.
The structure below the tree illustrates the layout of the tree in memory. The order of tree traversal is the order in which the tree is laid out in memory (therefore, if the traversed tree is stored as an array, the numbers next to the nodes become the array indices). The array so created is the van Emde Boas (vEB) array. In this implementation, the sorted, balanced and full binary tree (created in the previous step) is traversed in a manner similar to the one explained above. First, the height of the tree is computed using a recursive algorithm. Next, using this height, the tree is split and traversed recursively to create an array of keys (the combo keys used in this implementation) having the vEB layout. If the tree height h is not a power of 2, the root node is separated out as the top subtree and the rest of the nodes are treated as the bottom subtrees. For every subtree, the root of the subtree and its children (2 nodes) are traversed in order. Then the grandchild nodes of this root are visited recursively. The same steps are followed until a leaf node is encountered, after which the siblings are traversed in a similar fashion. This simple recursive algorithm traverses a complete binary tree in exactly the same manner as illustrated in figure 4.3 above. The vEB traversal results in the creation of a vEB array in memory. Also, during this traversal, the vEB index of each node is maintained separately (in the veb_index field of the class TreeNode, listed earlier in Listing 1). This facilitates easy mapping of a node to the corresponding position in the vEB array, and from there down to the Packed Array structure during a search, insertion or deletion operation, as explained in detail in the next section. With the above implementation, a full binary tree with vEB indices for each of its nodes and a vEB array are in place. The listing below gives the code snippet that recursively traverses the binary tree in the vEB order.

1.  . . . . .
2.  if(current != null)
3.  {
4.      current.setVeb_index(i);
5.      array[(int) i++] = current.getKey();
6.      runner[0] = current.link[0];
7.      runner[1] = current.link[1];
8.      if(current.link[0] != null)
9.      {
10.         current.link[0].setVeb_index(i);
11.         array[(int) i++] = current.link[0].getKey();
12.         runner[0] = current.link[0];
13.     }
14.     else
15.     {
16.         current.link[0] = new TreeNode("0", "0");
17.         current.link[0].setVeb_index(i);
18.         array[(int) i++] = current.link[0].getKey();
19.         runner[0] = current.link[0];
20.     }
21.     if(current.link[1] != null)
22.     {
23.         current.link[1].setVeb_index(i);
24.         array[(int) i++] = current.link[1].getKey();
25.         runner[1] = current.link[1];
26.     }
27.     else
28.     {
29.         current.link[1] = new TreeNode("0", "0");
30.         current.link[1].setVeb_index(i);
31.         array[(int) i++] = current.link[1].getKey();
32.         runner[1] = current.link[1];
33.     }
34.     if(current.link[0] != null)
35.     {
36.         if(runner[0].link[0] != null)
37.             vebTree(runner[0].link[0]);
38.         if(runner[0].link[1] != null)
39.             vebTree(runner[0].link[1]);
40.     }
41.     if(current.link[1] != null)
42.     {
43.         if(runner[1].link[0] != null)
44.             vebTree(runner[1].link[0]);
45.         if(runner[1].link[1] != null)
46.             vebTree(runner[1].link[1]);
47.     }
48. }
. . . . .

Listing 3: Implementation of the vEB array.

As explained earlier, the root (of any subtree) is accessed first (the current node) and then its children, as done in line numbers 4 through 33 (current.link[0] is the left child of the current node being traversed, while current.link[1] is the right child). Line numbers 34 through 47 are the recursive calls on the grandchild nodes (first left, then right) of the current node.

4.2 Packed Array

The Packed Array structure (mentioned earlier) is a loosely packed array created to facilitate insertions and deletions. The goal is to leave sufficient gaps in this array so that, for most insertion operations, fewer elements need to be moved to accommodate the newly inserted element.
This keeps the average insertion cost as low as possible. New nodes are not added to the binary tree on insertion; it is the packed array that is affected by every new addition. The binary tree acts as an index to the packed array and facilitates quick operations on the database. The packed array stores key-value pairs for every leaf node of the binary tree. Let us assume that there are N leaves in the tree. The packed array then maintains a size of cN to store these N elements, where c is some value > 1 used as a multiplication factor to create gaps in the packed array. The packed array is also divided into N sections (one for each leaf) of size c each; thus c can also be thought of as the capacity of each section of the packed array. When the data in a section exceeds this capacity c, the structure is redistributed by adjusting the elements into the adjacent free sections. However, to make this implementation truly flexible, like a column-oriented database, the packed array is implemented as an ArrayList (a dynamic array in Java). This allows the addition of new columns to the database as well as resizing of the structure when the need arises. Another important point to note is that the relationship between the leaves (stored in the vEB array) and the sections of the packed array is very simple: the first leaf corresponds to the (first element of the) first section of the packed array, the second leaf to the (first element of the) second section, and so forth. As the packed array stores key-value pairs for every leaf in the tree, it is unnecessary to store this information in the vEB array as well. Therefore the leaves in the vEB array do not store any keys; instead, the space allocated to these leaves in the vEB array is utilised to map the leaves onto their corresponding sections in the packed array.
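The leaf-to-section arithmetic described above can be sketched as follows (the class name and the concrete capacity value are illustrative assumptions, not the dissertation's code):

```java
// Sketch of the leaf-to-section mapping described above: the i-th leaf
// (0-based) owns the section starting at index i * c in the packed array.
// Class name and the capacity value are illustrative.
public class PackedLayout {
    public static final int CAPACITY = 10; // the capacity c of each section

    // Index of the first slot of the section belonging to leaf number `leaf`.
    public static int sectionStart(int leaf) {
        return leaf * CAPACITY;
    }

    // Total size of the packed array for N leaves: c * N.
    public static int packedSize(int leaves) {
        return leaves * CAPACITY;
    }
}
```

With c = 10 and 8 leaves this reproduces the sizes used in Figure 4.4 below: a total size of 80, with the fifth leaf's section starting at index 40.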
If the vEB index of a leaf is known, then accessing its key or value from the packed array is a simple mapping onto the section where this leaf is stored. The diagram below illustrates this mapping and shows the relation between the vEB and the packed arrays.

Figure 4.4: The relation between a full binary tree (of height 4), the vEB array and the Packed array structure. The binary tree on top, with 16 nodes, is first converted into a vEB array (the structure below the tree). The leaves are represented as white circles in the vEB array. The Packed array below stores these leaves, shown again as white circles. The numbers (0, 10, 20, ..., 70) below each leaf in the vEB array indicate the index of these leaves in the Packed array. The leaves in the vEB array do not store keys (unlike the other, grey nodes); instead, they store their own index positions in the Packed array (the numbers 0, 10, 20, ..., 70). Each leaf in the Packed array is allocated a section. In this example we assume that the Packed array allocates space for 10 elements in each of its sections (the factor or capacity c is 10). Thus the total size of this structure is number of leaves * c = 8 * 10 = 80.

To highlight the ease of accessing elements in this structure, consider an example where we wish to insert a new element as a child of the data item whose vEB index is 11. It has already been mentioned that the vEB index of any element (key-value pair) in the binary tree is in reality the index position of that element in the vEB array. Therefore, from the above diagram, it is clear that the element with vEB index 11 is a leaf (represented as a white circle in the vEB array). Now, to insert an element, we need to go to the Packed array. The mapping provided in this implementation facilitates an extremely quick and easy way of doing this: in the vEB array, a leaf holds its own Packed array index location rather than key-value information.
Therefore, the element at 11 (a leaf) also stores its corresponding Packed array position, which is 40 (the number written below it, indicating the value it stores). Thus we can jump directly to index position 40 in the Packed array and insert the desired element. A pseudo code snippet (with self-explanatory variable names) illustrating the access mechanism is as follows:

array veb[], array packed_str[];
integer veb_index, packed_index;

veb_index = 11;
packed_index = veb[veb_index];
result = packed_str[packed_index];

Listing 4: Pseudo code to explain the mapping of the vEB array to the packed array.

Algorithm to Calculate the Number of Leaves: In order to implement the packed array structure as explained above, it is essential to know the number of leaves in the binary tree. It may be recalled that the Packed array allocates space for all the leaves of the tree plus some additional space based on the constant c, to allow for low-cost insertions. Thus the size of this structure equals the product of the number of leaves and this constant (capacity) c. For a full binary tree of height h, the maximum number of leaves is fixed. For instance, a full tree of height 2 can have at most 2 leaves, whereas a tree of height 4 has a maximum of 8 leaves, and so on. This is predictable, since each node in a full binary tree has exactly (and at most) 2 child nodes, which makes the number of nodes in one level twice the number in the previous level (explained earlier). This simple relation between nodes in different levels is used to compute the total number of leaves in the binary tree created in this implementation as follows:

leaves = (int) Math.pow(2, height-1);

Listing 5: To calculate the number of leaves for a tree of height ‘height’.

Algorithm to Create the Packed Array structure: Once the number of leaves is known, a temporary array is populated with the key-value pairs of the actual leaf nodes.
The size of this temporary array naturally equals the computed leaf count. Using this array as input, the final Packed array structure is realized. The following code snippet explains the technique used.

. . .
1.  if (i == 0 || i % factor == 0)
2.  {
3.      if (leaf_nodes[k] != null)
4.      {
5.          temp = leaf_nodes[k++].concat("|").concat(leaf_nodes[k++]);
6.      }
7.      al.add(i, temp);
8.  }
9.  else
10.     al.add(i, "0");
. . .

Listing 6: Implementation of the Packed Array.

The Packed array, as stated earlier, is implemented as an ArrayList, al. In the above code, factor is the variable used for the capacity c of the Packed array, which denotes the size of each section in this structure. We already know that each leaf is the first element of a section in the Packed array; this implies that the index position of each leaf in the Packed array is a multiple of c. Therefore, in the sample code, when the index i of the Packed array al is a multiple of factor (or c), an element from the array leaf_nodes (containing the key-value pairs of all the leaves) is added to al. For all other values of i, a zero is added. This allocates space for new data to be inserted into the Packed array al.

4.3 Algorithm to Search

The algorithm to search for data in this database is simple and efficient. To look for a particular element e, the binary tree is searched first. As all key and value (data) pairs are present in the tree itself, the search is quite fast. Searches can be key based, value based or both together. This being a database implementation, the most common searches retrieve the values of a particular key (such as a search for all values of the column employee_name), and these are very efficient. The binary tree is sorted based on the column_name portion of the combo key; hence only a small section of the tree needs to be searched to locate a required value (if the searched column_name is less than the root's, go to the left subtree, else to the right, and so on).
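This descent through the sorted tree can be sketched as follows (a minimal illustrative class; the project's actual code uses its own TreeNode class, and details here are assumptions):

```java
// Minimal sketch (illustrative, not the dissertation's code) of descending
// the sorted tree by comparing combo keys: smaller keys go left (link[0]),
// larger keys go right (link[1]).
public class TreeSearchSketch {
    public static class Node {
        String key;
        Node[] link = new Node[2]; // link[0] = left child, link[1] = right child
        Node(String key) { this.key = key; }
    }

    public static Node find(Node root, String comboKey) {
        Node cur = root;
        while (cur != null) {
            int cmp = comboKey.compareTo(cur.key);
            if (cmp == 0) return cur;        // exact match in the tree
            cur = cur.link[cmp < 0 ? 0 : 1]; // otherwise descend left or right
        }
        return null; // not in the tree: fall through to the Packed array
    }
}
```

Because the keys sort column-first, this descent quickly narrows the search to the subtree holding the requested column before distinguishing rows and timestamps.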
However, if the element being searched for is not found in the tree, there are two possibilities: either the element does not exist in the database at all, or it is an item that was appended to the database later and hence exists in the Packed array rather than the binary tree. Whatever the reason, when such a situation arises the Packed array is searched. This search is also efficient, thanks to the direct mapping of leaves to Packed array sections (explained earlier). Once the appropriate section is reached, a linear search or a binary search is performed within that section. Linear searches are efficient for smaller sections; however, if many updates have been made to the database, a binary search is more appropriate.

4.4 Algorithm to Append / Insert

As mentioned earlier, insertion operations do not affect the tree in any way; all insertions are reflected in the Packed array. Since this implementation of the database is a subset of the Google DataStore, the operations permitted on it are also in accordance with DataStore. The DataStore is an append-only database, and so is this one. Append-only implies that there can be no random write operations or modifications (overwriting) of the existing data in the database. In DataStore, the write operations are merely new insertions (additions or appends) to the database. Each new insertion to any column is added along with a new timestamp value. Thus a query for an updated column will by default retrieve the most recent value, unless specified otherwise; the previous timestamps (versions) become historical data. To perform such an append operation, the binary tree is first traversed (as in the search operation discussed above) to locate the appropriate section i of the Packed array where the new element will be added. This section is then checked for gaps. If the section has not exceeded its capacity, the insertion is made immediately.
However, if the section is full, the previous sections are checked for gaps. If space is available, a redistribution (rebalancing) operation is performed that pushes existing elements from the current section to the left (into the previous section) to accommodate the new element. If the current section i is the leftmost section, or no empty spaces are available in its adjacent sections, the size of the Packed array is doubled and all elements are rearranged. The advantage of using an ArrayList to implement the Packed array structure is evident here, since a resizing operation can be performed easily.

4.5 Algorithm to Delete

The deletion operation is supported only for data that has not been written to disk. Although this is an in-memory database, data is written to a file from time to time for durability, and the database is later loaded into memory from this flat file. This data is stored in the static tree and also partly (leaves only) in the Packed array; no alteration of this data is possible. The only data that can be deleted are items inserted later into the database (and thus present only in the Packed array, excluding the leaves). To delete such a data item (key-value pair), we search through the tree (in the manner explained in the search subsection) to locate the exact section of the Packed array. Once there, a linear search is performed within the section to find the exact element, which is then deleted.

5 Query Implementation

This section discusses a single query designed and developed for the database described in the last section. The query implemented is a standard benchmark query that is realistic and has broad industry-wide relevance. The benchmark used is TPC-H [56], a decision-support (analytical) benchmark that is well suited to this database system, since the system can hold several time-stamped versions of the same data.
Also examined in this section are the need for synchronization, the issues involved, and the synchronization techniques used for multiple concurrent users. All development is done using Java 7 on Eclipse Helios; Java 7 was chosen for its extensive support for parallel programming. The specifics are discussed in the following subsections.

5.1 Development Tools – The Java Fork/Join Framework

Java has had support for multi-threading and concurrency utilities for a long time (since version 5.0). However, the new features introduced in the package java.util.concurrent of Java SE 7 enhance this by adding support for parallelism [51, 52, 53]. The Fork/Join framework is based on the parallel divide-and-conquer strategy. Divide-and-conquer algorithms are perfect for problems that can be split into two or more independent subproblems of the same type; this is analogous to the map-reduce strategy in functional languages. The basic idea is to split the problem recursively until it becomes simple enough to be solved directly; the solutions of the subproblems are then combined to produce the final result. Previous versions of Java can also solve such divide-and-conquer problems concurrently using the Executor framework and the Callable<V> interface. However, Callables waiting for the results of other Callables (in order to combine results and produce the final result) actually go into a wait state, which wastes the opportunity to handle another Callable task in the queue. The distinguishing feature of the Java 7 Fork/Join framework, therefore, is its ability to use resources efficiently in parallel. The Fork/Join framework uses a work-stealing mechanism [51, 53] to steal jobs from other threads in its pool while one thread (task) waits for another to complete. The work-stealing scheduler used in the Java Fork/Join framework is a variant adapted from the Cilk-5 project [55].
The basic mechanism of work stealing assigns each worker thread in the fork/join pool its own private deque (double-ended queue), which holds all subtasks assigned to that thread for execution. When a worker thread completes the tasks in its local deque, it tries to steal pending tasks from other threads, and this process continues until all tasks in all deques are completed. The advantages are efficient resource usage and reduced overheads from load imbalance. The Fork/Join framework provides a ForkJoinPool executor [52], which is dedicated to executing instances of the ForkJoinTask class. A ForkJoinTask object supports the creation of subtasks as well as waiting for those subtasks to complete. This is illustrated below. Each ForkJoinTask object has 2 important methods:

1. the fork() method, which allows a new ForkJoinTask to be spawned from an existing one;
2. the join() method, which allows a ForkJoinTask to wait for the completion of another.

Figure 5.1: Co-operation among fork() and join() tasks. Redrawn from [53].

There are 2 ForkJoinTask specializations:

1. RecursiveAction, instances of which do not return a value;
2. RecursiveTask, instances of which return a value.

For this implementation, we use instances of RecursiveTask, which return the computed result.

5.2 TPC-H Benchmark Overview

The TPC (Transaction Processing Performance Council) Benchmark H, commonly TPC-H, is a decision-support benchmark comprising a suite of business-oriented ad-hoc and concurrent queries [56]. The benchmark models decision-support systems that work with huge volumes of data and support highly complex queries that address real-world business problems. It does not target any specific business area but is applicable to any industry that buys and sells, manages or distributes products worldwide.
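Before turning to the benchmark details, the Fork/Join pattern described in Section 5.1 can be made concrete with a minimal RecursiveTask that sums an array by splitting it in half. This is an illustrative sketch, not the dissertation's query code:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal illustration of the Fork/Join divide-and-conquer pattern:
// a RecursiveTask that sums an int array by splitting it in half.
// Not the dissertation's query code; names and threshold are assumptions.
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 4; // below this size, solve directly
    private final int[] data;
    private final int lo, hi;

    public SumTask(int[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {          // base case: sum sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) / 2;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                          // spawn the left half as a subtask
        return right.compute() + left.join(); // compute right here, wait for left
    }

    public static long parallelSum(int[] data) {
        return new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
    }
}
```

Note the idiom in compute(): one half is forked while the other is computed in the calling thread, so the worker stays busy instead of merely waiting, which is exactly the wasted-wait problem with plain Callables described above.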
The queries are executed against a standard TPC-H database. The performance metric used is the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), which reflects the database size, the query processing power for a single query, and the query throughput for multiple concurrent queries. This benchmark is usually run on commercial DBMSs with an SQL interface. The TPC-H database is composed of 8 tables, with a total of 61 columns across all tables. The relationships between the various columns in the database are illustrated in Figure 5.2 below. The columns with outgoing arrows are primary keys (in their respective tables), while the ones receiving them are foreign keys that aid in joining. For instance, the table PART has the PARTKEY column as its primary key, which becomes a foreign key in PARTSUPP. Similarly, the primary key SUPPKEY of the SUPPLIER table is a foreign key in the PARTSUPP table; together, these 2 foreign keys form the primary key of PARTSUPP. It is also evident from the diagram that the NATIONKEY primary key of the NATION table exists as a foreign key in both the SUPPLIER and CUSTOMER tables. All such relations are drawn in figure 5.2.

Figure 5.2: TPC-H database schema. Copied from source [56]. The parenthesised text following each table name is the prefix used for each column name in that table.

The range of queries available and their industry-wide relevance make TPC-H an ideal choice for analysing the implemented database system. TPC-H provides a set of 22 queries, each addressing a realistic business scenario. The query selected for implementation on this database is Query 17, the details of which are provided in a later subsection.

5.3 Loading the Database

Before implementing the selected query, the stage needs to be set for executing it on a suitable dataset. It may be recalled that the implemented database works on data stored in the form of key-value pairs.
Therefore, the TPC-H benchmark dataset, which is in the traditional RDBMS format of tables (schema shown above), needs to be converted into a dataset appropriate for this system. The first step is to create the normalized TPC-H dataset (in the form of tables) of the required size. TPC-H provides a utility called dbgen that allows users to create a normalized dataset of any size (even gigabytes). Next, this normalized data is loaded into a standard SQL database to form the different tables. The key-value pairs are then created from these tables. It is important to note at this point that BigTable (DataStore), as mentioned previously, is a single big table of key-value pairs, and join operations are not supported directly. Therefore, to create a similar structure, the 8 separate TPC-H tables need to be converted into a single big table as well. To realize this, the individual tables are combined by performing an equi-join on all of them, and the resulting de-normalized single table is written to a file in comma-separated format. The total number of columns after the equi-joins is 54, as opposed to the original 61 columns in TPC-H. This is because an equi-join keeps only one of the two columns used in a particular join: joining 8 tables requires 7 equi-joins utilizing 14 join columns, of which 7 are discarded. Although the data itself remains unchanged, it contains redundancy due to the multiple joins, which is taken into account while implementing the query. This comma-separated file is read line by line and converted into a flat file of key-value pairs. It may be recalled that a key here is a combo key; hence, for every row a unique row key is generated, and all columns belonging to a particular row use the same row key. Also, since the entire dataset is a single version, the timestamp is kept fixed at 111100; new insertions result in new timestamps. The file generated looks like the one shown in figure 5.3 below.
Figure 5.3: Sample key-value pairs generated from the de-normalized dataset. It follows a row_key.col_name.timestamp format, unlike our data model.

This key-value pair file (shown above) was, for the purpose of this implementation, obtained from a colleague working on a similar project (Appendix 2). The system design section earlier illustrated the data model used for this project (column_name.row_key.timestamp), which clearly differs from the one in Figure 5.3. The combo keys in our case are sorted based on the column names and stored accordingly in the binary tree. This gives us the advantage of having all values of the same column (across different rows) grouped together (as a subtree) in the tree. The locality achieved is typical of a column store and helps in querying data from the database. The above key-value pair file is therefore converted into the required data model.

1.  File file = new File(input_file);
2.  Scanner scan = new Scanner(file);
3.  FileWriter fstream = new FileWriter(output_file);
4.  BufferedWriter out = new BufferedWriter(fstream);
5.  while(scan.hasNextLine())
6.  {
7.      String line = scan.nextLine();
8.      String [] newline = line.split(",");
9.      String [] newword = newline[0].split("\\.");
10.     String nline = newword[0].trim();
11.     newword[0] = newword[1].trim();
12.     newword[1] = nline;
13.     newline[0] = newword[0] + "." + newword[1] + "." + newword[2].trim();
14.     line = newline[0].trim() + "," + newline[1].trim();
15.     out.write(line);
16.     out.newLine();
17. }

Listing 7: Implementation to convert a key-value pair file to another key-value pair format (based on our system's data model).

The code snippet above performs this conversion. The file shown in figure 5.3 is read line by line; each line is split at the comma (,) that separates a key from its value (Listing 7: line 8) and stored in an array. The 0th element of this array contains the combo key. This element is further split at the dot (.)
that divides a row key from a column name, and stored in another array (Listing 7: line 9). The 0th element now contains the row key while the 1st element contains the column name. These elements are swapped to obtain the desired combo key (Listing 7: lines 10 through 14).

Now that the dataset is available in the desired format, the implemented database is populated. The data structures developed are created in the manner described in the last section. The static binary tree contains the keys (combo keys) and their corresponding values. These keys map onto a vEB array, while only the leaves are placed in the Packed Array to facilitate insertions and deletions.

5.4 TPC-H Query 17 Overview

Query 17 (Small-Quantity-Order Revenue) of TPC-H, found at section 3.20 of the TPC-H Specifications [56], determines the yearly average revenue lost when orders for small quantities of certain parts are no longer filled. The basic idea is to assess the possibility of reducing overhead expenses by overlooking the smaller consignments and concentrating only on the sales of larger shipments. The query definition in terms of SQL is as follows:

select sum (L_EXTENDEDPRICE) / 7.0 as avg_yearly
from LINEITEM, PART
where P_PARTKEY = L_PARTKEY
  and P_BRAND = '[BRAND]'
  and P_CONTAINER = '[CONTAINER]'
  and L_QUANTITY < (
      select 0.2 * avg (L_QUANTITY)
      from LINEITEM
      where L_PARTKEY = P_PARTKEY );

This query works on a 7-year database. It considers parts belonging to a particular brand and container type. Then, for each such part, it determines the average lineitem quantity over all orders in the database. Finally, it computes the average yearly (gross) loss in revenue if orders for all those parts with a quantity below 20% of that average are no longer filled. The substitution parameters for [BRAND] and [CONTAINER] include values like Brand#23 with MED BOX, or Brand#25 with JUMBO PKG. A few other variations are available in the specification [56].
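To make the query's semantics concrete, its logic can be sketched over an in-memory list of joined rows. This is an illustrative stand-alone sketch, not the project's implementation: the Row class, its field names, and avgYearly are all assumptions introduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Query17Sketch {

    // Hypothetical flat row standing in for a joined LINEITEM/PART record.
    public static class Row {
        final long partKey; final String brand, container;
        final double quantity, extendedPrice;
        public Row(long p, String b, String c, double q, double e) {
            partKey = p; brand = b; container = c; quantity = q; extendedPrice = e;
        }
    }

    // Evaluates the SQL above over an in-memory list; names are illustrative.
    public static double avgYearly(List<Row> rows, String brand, String container) {
        // Inner query: avg(L_QUANTITY) per part key over ALL line items of that part.
        Map<Long, double[]> acc = new HashMap<Long, double[]>(); // partKey -> {sum, count}
        for (Row r : rows) {
            double[] a = acc.get(r.partKey);
            if (a == null) { a = new double[2]; acc.put(r.partKey, a); }
            a[0] += r.quantity; a[1]++;
        }
        // Outer query: sum prices of small-quantity rows matching brand/container,
        // then average over the 7-year horizon.
        double sum = 0;
        for (Row r : rows) {
            double[] a = acc.get(r.partKey);
            if (r.brand.equals(brand) && r.container.equals(container)
                    && r.quantity < 0.2 * (a[0] / a[1]))
                sum += r.extendedPrice;
        }
        return sum / 7.0;
    }
}
```

The correlated inner query is realized here as a single pre-aggregation pass over all rows, which mirrors how the implementation described below first computes the 20%-of-average threshold before applying the final selection.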
To select a single appropriate query from among the available set of 22 TPC-H queries, careful consideration was given to several factors. It was essential to select a query that:

a. Was complex, so that it would be a demanding query with which to evaluate the database system at hand.

b. Had business relevance (in the real world), so that the database could be assessed from a realistic perspective. This would make the evaluation more reliable.

c. Contained some operations that were highly suitable for a traditional RDBMS, since in a real business environment databases are queried irrespective of their type. Queries are created to suit the business requirements rather than an underlying RDBMS or column store. Therefore, it was imperative to select a query that was generic and not completely suited to a typical column-oriented DB.

d. Contained some operations that were highly suitable for column-oriented databases as well. This is quite obvious, since operations on columns, like aggregations, are extremely efficient on column-oriented databases. These kinds of operations display the true power of a column store, so it was necessary to have some operations that could utilize the potential of the underlying database.

e. Contained some operations that could be parallelized. It should be remembered that this project aims to assess the performance of a query on the in-memory database developed, on various multi-core systems. Therefore, a query with the potential to utilize multiple cores of a many-core system as effectively as possible would be a suitable candidate for implementation.

After weighing the different queries against the aforementioned factors, Query 17 appeared to be a suitable choice.

5.5 Query 17 – Sequential Implementation

The query is implemented by following a series of steps as explained below:

1. Execute the first selection condition, column P_BRAND = Brand#25.
This is easy to understand: the column_name part of our combo key is P_BRAND, whose values should equal Brand#25. From the database, all rows satisfying this condition should be selected. An important point to remember here is that our database is not the typical row-oriented table; in our case, all data is in the form of keys and values. However, as illustrated earlier, the key, owing to its format, can be used to search for a column and also for a row: the key starts with a column_name, followed by a row_key, where the row_key is the same for all columns belonging to a particular row. Thus for row1, all columns have row1 as the row_key. Therefore, in the current search operation, extracting the row_key part of each combo key whose column_name is P_BRAND and whose value equals Brand#25 identifies all the required rows (which can then be used to search for other columns). The static binary tree is searched for the column and its value, and the row_keys thus obtained are stored in a temporary ArrayList parser1.res (where parser1 is an object that contains the results in ArrayList res).

The advantage of sorting the binary tree based on the column_names is evident in this simple search operation itself. All instances of column P_BRAND exist within a particular subtree of our binary tree. Therefore, the search algorithm is split into two parts. The first part traverses the binary tree to look for the first occurrence of P_BRAND (the desired column). This search only checks the column names and moves either into the left half or the right half of the binary tree, depending on the value. The node thus obtained becomes the root node for the second half of the search. We now know, due to the locality achieved, that all other instances of P_BRAND exist as children of this node. Thus the second half of the search checks each node of this subtree for P_BRAND as column and Brand#25 as value.
Row_keys of all such instances found within the subtree are added to the result parser1.res. This algorithm is efficient since it searches only a very small portion of the entire tree.

2. Execute the second selection condition, column P_CONTAINER = JUMBO PKG. This operation is identical to the one above. Here, the binary tree is first searched for the first occurrence of P_CONTAINER (column). Once that is found, using this node as root, all child nodes of this subtree are searched for both P_CONTAINER and the value JUMBO PKG. The row_keys obtained are stored in the ArrayList res of another object, parser2 (parser2.res). The code snippets below give the algorithms used.

1.  TreeNode searchTree( TreeNode root, String key )
2.  {
3.      if ( root == null ) {  // Tree is empty, so it certainly doesn't contain key.
4.          return root; }
5.      else if ( root.getKey().contains(key) ) {  // Yes, the key has been found in the root node.
7.          return root; }
8.      else {
9.          int dir = ( key.compareTo(root.getKey()) < 0 ) ? 0 : 1;
10.         TreeNode rt = searchTree( root.link[dir], key );
11.         if (rt == null)
12.             return root;
13.         else
14.             return rt;
15.     }
16. }  // end searchTree()

Listing 8: Implementation of search algorithm to check for the first occurrence of a column in the binary tree.

1.  long nodes = array.length;
2.  Queue<TreeNode> q = new LinkedList<TreeNode>();
3.  q.add(root);
4.  i = 1;
5.  while(!q.isEmpty() && i < nodes) {
6.      TreeNode current = (TreeNode)q.remove();
7.      if(current != null) {
8.          if (item == "") { . . . . }
9.          else {
10.             if ((current.getKey().contains(key)) && (current.getItem().trim().equals(item))) {
11.                 String [] t = current.getKey().split("\\.");
12.                 res.add(t[1]);
13.             }
14.         }
15.         q.add(current.link[0]);
16.         q.add(current.link[1]);
17.         root = current;
18.     }
19. }

Listing 9: Implementation of search algorithm to check within a subtree.

3. We currently have all row_keys that satisfy the two selection conditions.
In the given query, since these occur as AND conditions, we need to find the common rows that satisfy both. To achieve this, an intersection operation is performed on the results obtained in steps 1 and 2. The common row_keys are stored in parser1.res (the variable is re-used). The following snippet gives the implementation details.

1.  public int intersection(LoadFileNew obj)
2.  {
3.      ArrayList <String> temp = new ArrayList<String> ();
4.      for(String t : ds.res)
5.      {
6.          if (obj.ds.res.contains(t))
7.              temp.add(t);
8.      }
    . . . .
9.      ds.res.addAll(temp);
    . . . .

Listing 10: Implementation of intersection operation to find the common rows.

Here, ds.res indicates parser1.res and obj.ds.res signifies parser2.res. As stated, results are stored in ds.res (that is, parser1.res).

4. The third operation in the outer query is a join on the part keys (L_PARTKEY). Since this database already contains data that is joined, this step need not be performed. However, as mentioned earlier, due to de-normalization the data is likely to be redundant. Consequently, it is essential to extract only the unique row_keys from the result set obtained above (after intersection). There is no way to identify redundancy from those results directly. As a solution, the L_PARTKEY value for each row_key obtained in step 3 can be extracted. Since these L_PARTKEY values are values of a primary key, they should be unique; any duplicate values found indicate redundancy, and the corresponding row_keys can be immediately discarded. The results after duplicate removal are stored again in parser1.res. The code snippets for both L_PARTKEY extraction and duplicate removal are given below.

1.  long nodes = array.length;
2.  Queue<TreeNode> q = new LinkedList<TreeNode>();
3.  q.add(root);
4.  i = 1;
5.  while(!q.isEmpty() && i < nodes) {
6.      TreeNode current = (TreeNode)q.remove();
7.      if(current != null) {
8.          if (key2 == "") { . . . . }
9.          else {
10.             if (current.getKey().contains(key + "." + key2 + "."))
11.             {
12.                 String [] t = current.getKey().split("\\.");
13.                 restemp.put(t[1], current.getItem());
14.             }
15.         }
16.         q.add(current.link[0]);
17.         q.add(current.link[1]);
18.         root = current;
19.     }
20. }

Listing 11: Implementation of search algorithm to check for a column_name and a specific row_key within a single combo key.

Here, in the if condition in line 10, the variable key contains the column_name (L_PARTKEY in this case) and key2 is supplied with row_key values from step 3 (the common results in parser1.res).

1.  for (String s : res.keySet())
2.  {
3.      if (map.isEmpty())
4.          map.put(s, res.get(s));
5.      else if (map.containsValue(res.get(s)))
6.          continue;
7.      else
9.          map.put(s, res.get(s));
10. }

Listing 12: Implementation of duplicate removal algorithm.

Here, only unique key-value pairs are added to a structure map. If a new value being checked already exists in map, it is not added (Listing 12: lines 5 – 9).

5. The next step involves working on the inner query: computing the average of the L_QUANTITY (column) values for all unique row_keys obtained in step 4. This is like the L_PARTKEY search in the last step; the column_name L_QUANTITY and a row_key are searched for in conjunction in the binary tree, once for each row_key involved. The average computed is multiplied by 0.2 and the result stored in a variable avg.

6. Next, the final selection condition of the outer query is implemented. This involves searching for all L_QUANTITY values that are < avg (from step 5). This is a search operation similar to the ones in steps 1 and 2; however, instead of searching for a particular value, we search for those that satisfy a relational operation (less than). All row_keys obtained are stored in tempres and then transferred to parser3.res.

1.  if (current.getKey().contains(key))
2.  {
3.      BigDecimal b = new BigDecimal(current.getItem().trim());
4.      if(b.doubleValue() < it.doubleValue())
5.      {
6.          String [] t = current.getKey().split("\\.");
7.          tempres.add(t[1]);
8.      }
9.  }
10. q.add(current.link[0]);
11. q.add(current.link[1]);
12. root = current;

Listing 13: Implementation of search algorithm to check for the less-than condition.

7. Since the selection operation in step 6 is an AND operation with the other selections, an intersection is again performed on the results from step 4 (in parser1.res) and step 6 (in parser3.res). This gives all common row_keys that satisfy all the given selection criteria. Results are stored in parser1.res.

8. It is now time to compute the final result (to be displayed). For each row_key obtained above, the corresponding L_EXTENDEDPRICE value is searched for in the tree, with the method explained in steps 4 and 5. All these values are then added up (and stored in sum), and finally the 7-year average is obtained by dividing the sum by 7.0. This gives the outcome of the query.

5.6 Query 17 – Parallel Implementation

The parallel implementation of the query is accomplished using the Fork/Join framework of Java SE 7. Here, the ForkJoinPool ExecutorService is used to create a pool of Java ForkJoinTask threads. The RecursiveTask variation of the ForkJoinTask base class is extended to execute different tasks in parallel, exploiting the underlying multi-core hardware. The basic mechanism of the Fork/Join framework was explained in an earlier sub-section.

The primary operations to be performed in this query are already known from the sequential implementation sub-section. This implementation executes in parallel (on many cores) the portions of the algorithm that are independent and hence parallelizable; not all steps can be parallelized. The algorithm consists of the 8 basic steps, evident from the last sub-section, that are executed one after another. The parallel mechanism is illustrated below.
Figure 5.4: Overview of parallel execution strategy used in Query 17.

After analysing the steps, the parallel strategy shown above was devised and implemented. It highlights the use of the Java Fork/Join framework to split the execution (and hence speed it up) among multiple fork-join tasks. As mentioned earlier, not all steps can be executed in parallel. As a result, only the search operations traversing the binary tree and the aggregation operations are parallelised to achieve faster execution times, as evident from figure 5.4 above.

Step 1: This step retrieves all rows (row_keys) where the column_name portion of the combo key is P_BRAND and its corresponding value is Brand#25. It can be recalled that the basic idea (as in the sequential case) is to obtain the first occurrence of P_BRAND (column_name) from the tree. Then, using this node as root, we search every node within this particular subtree for both P_BRAND and the value Brand#25. As every node in the subtree is searched, the operation is quite CPU intensive. This processor-intensive operation is therefore split into as many tasks as there are nodes (to be searched) and handled by the threads in the ForkJoinPool. Every node is checked for the selection criterion; if it is satisfied, the row_key part is extracted and added to the result set. Then 2 tasks are spawned, one for each of its two child nodes, and each in turn is checked for the selection condition. For every child, 2 more tasks are spawned (for its children), and the process continues recursively till all nodes are checked and the result obtained. Since an individual task is created for every node in the subtree being searched, the process is very efficient in terms of speed and processor core usage.

Step 2: This step checks for P_CONTAINER column_names with value JUMBO PKG. It is exactly similar to the one above and implemented in the same fashion.
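This node-per-task recursion can be shown as a self-contained sketch. It is illustrative only: the TreeNode class, its fields, and the search method below are stand-ins for the project's actual classes (the project's own version appears in a later listing).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch of the node-per-task parallel subtree search; names are illustrative.
public class ParallelSubtreeSearch extends RecursiveTask<List<String>> {

    public static class TreeNode {
        final String key, item;      // key is a combo key: column_name.row_key.timestamp
        TreeNode left, right;
        public TreeNode(String k, String i) { key = k; item = i; }
    }

    private final TreeNode node;
    private final String column, value;

    public ParallelSubtreeSearch(TreeNode node, String column, String value) {
        this.node = node; this.column = column; this.value = value;
    }

    @Override protected List<String> compute() {
        List<String> res = new ArrayList<String>();
        if (node == null) return res;
        // Check this node's combo key against the selection condition.
        if (node.key.startsWith(column + ".") && node.item.equals(value))
            res.add(node.key.split("\\.")[1]);           // extract the row_key part
        // Fork one child as a new task, compute the other here, then join.
        ParallelSubtreeSearch l = new ParallelSubtreeSearch(node.left, column, value);
        ParallelSubtreeSearch r = new ParallelSubtreeSearch(node.right, column, value);
        l.fork();
        res.addAll(r.compute());
        res.addAll(l.join());
        return res;
    }

    public static List<String> search(TreeNode root, String column, String value) {
        return new ForkJoinPool().invoke(new ParallelSubtreeSearch(root, column, value));
    }
}
```

Forking the left child while computing the right in the current thread keeps each worker busy rather than idly waiting on both joins, which is the usual Fork/Join idiom.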
It should be noted that steps 1 and 2 are run one after another; the parallelism is within their individual execution. The parallel algorithm used for both is listed below.

1.  if(node != null) {
2.      if(node.getKey().contains(key) && node.getItem().equals(item))
3.      {
4.          String [] t = node.getKey().split("\\.");
5.          res.add(t[1]);
6.      }
7.      if(ds.checklinks(node) > 0)
8.      {
9.          test1parent left = new test1parent(ds.returnLeftChild(node), key, item, KV);
10.         test1parent right = new test1parent(ds.returnRightChild(node), key, item, KV);
11.         left.fork();
12.         leftres.addAll(right.compute());
13.         rightres.addAll(left.join());
14.         leftres.addAll(rightres);
15.     }
16. }

Listing 14: Implementation of parallel search algorithm (1).

Step 3: The next step is to obtain all the common row_keys that satisfy both conditions (of steps 1 and 2). The algorithm used to realize this is similar to the one used in the sequential implementation of the query (where its code snippet is listed). The common set of row_keys thus obtained is stored in res1.

Step 4: The fourth step uses the result set (res1) from step 3 and, for each row_key in this set, checks for the combo key that also contains L_PARTKEY as the column_name and retrieves its value. For example, if res1 contains row_keys R0, R45 and R121, then this step retrieves the values of combo keys that contain the following (the timestamp can be anything): L_PARTKEY.R0, L_PARTKEY.R45, L_PARTKEY.R121. It is evident that the number of searches equals the number of row_keys in the result set res1. To accomplish this, tasks equal to the count (size) of the result set res1 are spawned and executed in parallel. The individual result of each task is then joined (merged) and stored in list2. The code snippet is shown below.

1.  if(res1 != null)
2.  {
3.      for(String s : res1)
4.      {
5.          ParallelSubTree task = new ParallelSubTree(node, key, s, KV);
6.          forks.add(task);
7.          task.fork();
8.      }
9.      int i = 0;
10.     for(RecursiveTask<List<String>> task : forks)
11.     {
12.         map.put(res1.get(i), task.join().toString());
13.         i++;
14.     }
15. }

Listing 15: Implementation of parallel search algorithm (2).

Step 5: As discussed during the sequential implementation, the de-normalization of the original TPC-H dataset gives rise to redundancy. The set res1 (from step 3) is therefore likely to contain duplicate data; step 4 is executed to enable its elimination, and in this step the duplicates are removed from list2 (the result of the last step). The unique result set is stored in res2. The algorithm is the same as used in the sequential implementation.

Step 6: This step is exactly the same as step 4. Each row_key in the unique set res2 obtained above is prefixed with the column_name L_QUANTITY. Then the corresponding values of all combo keys containing this string are extracted. The set of values thus obtained is stored in list2.

Step 7: In this step, using list2, the average value is calculated. It is then multiplied by 0.2 and the result stored in avg. The summation operation is executed in parallel by splitting the values recursively (via fork/join) and computing the sum (the usual divide-and-conquer strategy); the average is then calculated. Please note that for simplicity the whole average operation in the diagram (figure 5.4) is shown as being executed in parallel, although in reality only the summation is. The entire dataset is split into halves recursively till a portion becomes less than or equal to the threshold (indicating that it is small enough to be added up), after which the sum of those values is calculated. All such partial results are finally added up to give the complete result. The source code is as follows.

1.  protected BigDecimal compute()
2.  {
3.      if (length <= threshold)
4.          return computeDirectly();
5.      long split = length / 2;
6.      invokeAll(new ParallelAddition (res, start, split),
                  new ParallelAddition (res, start + split, length - split));
7.      return sum;
8.  }

9.  protected BigDecimal computeDirectly()
10. {
11.     if (length != 0)
12.     {
13.         for (int c = 0; c < length; c++)
14.         {
            . . . .
15.             BigDecimal sum1 = new BigDecimal(s.trim());
16.             sum = sum.add(sum1);
17.         }
18.     }
19.     return sum;
20. }

Listing 16: Implementation of parallel addition algorithm.

Step 8: Here the tree is searched for combo keys with column_name L_QUANTITY whose values are less than avg (computed in step 7 above). The strategy is similar to the one used for steps 1 and 2: the first occurrence of L_QUANTITY in the binary tree is found; then, using that node as the root, all its descendants are searched for both column_name L_QUANTITY and values less than avg. This search of the subtree is executed in parallel, spawning as many tasks as there are nodes in the subtree. The mechanism is exactly the one used in steps 1 and 2. The result set of row_keys is stored in res1.

Step 9: Here an intersection operation is performed to obtain all common row_keys (as in step 3) between the result sets res2 (from step 5) and res1 (from step 8). These common row_keys are again stored in res1. This result gives the set of row_keys that satisfy all the selection conditions in the outer query.

Step 10: As in steps 4 and 6, this step fetches all values of column L_EXTENDEDPRICE, combined separately with every row_key in result set res1 (from step 9). Hence the number of searches equals the number of elements in res1. The values obtained are stored in list2.

Step 11: The final step of execution; here the values obtained above (list2) are added up. This addition operation is executed in parallel (as in step 7), using the same algorithm shown above. The result is stored in sum.
Then this value is divided by 7.0, since the database represents a 7-year dataset and we intend to find the average yearly loss (mentioned in the earlier sub-section detailing Query 17). This average value is displayed as the answer to the query.

5.7 Need for Synchronization

We have so far seen the implementation of a database system and of a query that not only executes sequentially on the database, but also exploits the parallelism of the underlying hardware. An essential point to consider at this juncture is the existence of multiple users of this system. In the real world, any database system is likely to be used by several users, and those users are likely to be using the system simultaneously. This implies a risk of inconsistency in the system, especially when concurrent updates are made; hence the need for synchronization arises. In a multi-threaded environment, each thread maintains its own stack and registers. However, if these threads access any shared object, errors are bound to be introduced unless access is synchronized. Synchronization ensures that concurrent accesses to shared object(s) do not corrupt the values stored in them. Various synchronization techniques exist in Java 7. Here we use a lock-based mechanism, the ReadWriteLock [57], to secure the database from the hazards of concurrent activity.

5.8 Synchronization Issues

There are various problems associated with incorrect synchronization, most of which are not discernible till the implementation (code) is executed. These include deadlocks, livelocks, race conditions, starvation and so forth. However, the primary issue associated with concurrent access (reads/writes by multiple users) to a single shared object (variable) is the possibility of one thread (user) seeing the data (shared object) in an incorrect or corrupt state, due to operations performed on it by another thread (user).
In a multi-core environment, where threads execute in parallel on the available cores, two threads might actually try to update the same object simultaneously. This therefore requires an appropriate mechanism to control access and avoid inconsistencies.

5.9 Synchronization Techniques

As mentioned earlier, the synchronization technique used here is the ReadWriteLock [57], an interface in Java 7 that maintains a pair of associated locks, one for read operations and one for writing. A read lock can be held simultaneously by several threads that intend only to read the same object. A write lock, on the other hand, is exclusive: it cannot be held by multiple writer threads at the same time. A read-write lock provides much better performance and allows for greater concurrency when accessing shared data than a typical mutual exclusion lock, owing to the fact that multiple reader threads can read the same piece of data concurrently. This increased concurrency leads to considerable performance improvements on a many-core processor. However, since write operations are exclusive, they do not exploit the processor parallelism. This in turn implies that if write operations are more frequent than read operations, system performance is likely to suffer.

The class ReentrantReadWriteLock [57] is extended, and its methods readLock() and writeLock() are used to lock shared data in the query implementation. Query 17 performs read operations on the database, so every shared access is guarded by a readLock(). This ensures multiple user threads can simultaneously access the database for reading. However, there are variables used to store intermediate results which need to be write-protected during concurrent access to the system, so as not to corrupt the result; these shared variables are locked through the writeLock(). Ensuring synchronization for write operations to the database is slightly trickier to achieve.
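The locking discipline just described can be sketched as follows. The class and variable names here are illustrative, not those of the project; the sketch only shows the readLock()/writeLock() pairing around a shared intermediate-result structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of guarding a shared intermediate-result list with a read-write lock.
public class SynchronizedResults {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> results = new ArrayList<String>();

    // Many reader threads may hold the read lock at the same time.
    public List<String> snapshot() {
        lock.readLock().lock();
        try {
            return new ArrayList<String>(results);   // copy out under the lock
        } finally {
            lock.readLock().unlock();
        }
    }

    // The write lock is exclusive, so concurrent updates cannot interleave.
    public void add(String rowKey) {
        lock.writeLock().lock();
        try {
            results.add(rowKey);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The unlock calls sit in finally blocks so the lock is released even if the guarded code throws, which is the standard idiom for explicit locks.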
We already know that write operations on the database imply updates, which are made to the Packed Array structure. Therefore, this structure, as well as all other shared intermediate data items, needs to be protected by a writeLock() to achieve synchronization. The design of the write synchronization necessary for this project is complete; however, due to the stringent time frame, its complete implementation was not possible.

6 Evaluation

This section discusses the query performance on two different multi-core systems: a 48-core AMD Opteron™ 6100 Series processor and a quad-core Intel Xeon E3 1245 processor with hyper-threading [58]. The evaluation is performed by executing the implemented query (TPC-H Benchmark Query 17) on three different datasets (small, medium and large) on each of the two machines, and comparing the results obtained.

6.1 Experimental Methodology

This evaluation aims to verify whether the database system is truly scalable, as desired, and to analyse the parallelism achieved on different multi-core architectures in terms of execution times and speed-up. This is realized by executing the implemented benchmark query on a variable number of threads for every dataset (small, medium and large). Using data of different sizes is necessary because this database is a subset of DataStore, which manages petabytes of data; it is therefore imperative to test the system with various load sizes. Moreover, the research involves utilizing multiple cores and analysing the performance improvements. Most multi-core systems handle problems that grow in size, which makes it essential to analyse the scalability of this parallel query while scaling the problem size as well.

Intel Xeon (Janus) has a quad-core processor and supports hyper-threading with 2 threads per core, making a total of 8 hardware threads. It has 8 gigabytes of memory. Datasets of size 100 MB, 500 MB and 1 GB are used on this system.
AMD Opteron 6174 (Mcore 48), on the other hand, has 48 hardware cores with one thread per core (no hyper-threading), making a total of 48 hardware threads. It has 128 GB of main memory, and datasets of size 1 GB, 3 GB and 5 GB are used for its evaluation. Appendix 1 contains the tabulated results obtained for every configuration. Both multi-core machines run Linux 2.6 and Sun Java version 1.7.0 (build 1.7.0-b147).

Query 17 is executed with a fixed heap size of 7 GB on Janus and 65 GB on Mcore 48. No other parameters and/or configurations are changed during the execution of the query. Experiments are conducted for different numbers of threads: 1, 2, 4 and 8 threads on Janus and 1, 2, 4, 8, 16, 24, 32, 40 and 48 threads on Mcore 48. On a particular machine, for each thread count, 10 different execution times are recorded and the mean execution time is computed. Execution times are obtained at millisecond resolution and then converted into seconds. The standard error, standard deviation and confidence intervals are also calculated. Every execution of the query is performed by launching a fresh instance of the database, thus avoiding cache benefits.

To measure the scalability of the database system, the absolute speedup and efficiency are calculated. The speedup is given by

Sp = T1 / Tp ................ eqn (2),

where T1 is the sequential execution time and Tp the parallel execution time on p processors. When T1 is the execution time of the best known sequential algorithm, the speedup is referred to as absolute speedup. This performance metric is most effective when evaluating parallel algorithms. The system configurations of the test machines used are tabulated below (table 1). Note also that the complete results for execution time, standard error, standard deviation, confidence interval, speedup and efficiency are listed in Appendix 1.

S.No.  System Name  CPU Type            No. of Cores  No. of Threads (total)  RAM
1      Janus        Intel Xeon E3 1245  4             8                       8 GB
2      MCore 48     AMD Opteron 6174    48            48                      128 GB

Table 1: Production system configurations for performance evaluation.

6.2 Experimental Results

The experimental data obtained is analysed in two ways. First, the mean execution times are plotted against the corresponding number of threads. Second, the computed speedups are mapped against the number of threads. These signify the amount of parallelism achieved through the implementation. The results obtained for a variable number of threads and different datasets indicate the scalability of the system.

Figures 6.1 – 6.6 illustrate the execution times obtained for the small, medium and large datasets on the two test machines. The X-axis (horizontal) represents the number of threads (up to 8 for Janus and up to 48 for Mcore 48). The Y-axis (vertical) represents the mean execution time for a particular dataset in seconds. The confidence intervals (95%) for each reading are also plotted on the graphs; however, the intervals, being quite small, are not visible in most cases. The exact values of the confidence intervals are available in Appendix 1.

Figure 6.1: Mean execution times of Query 17 for 100 MB data (small) on Janus.

Figure 6.2: Mean execution times of Query 17 for 500 MB data (medium) on Janus.

Figure 6.3: Mean execution times of Query 17 for 1 GB data (large) on Janus.
Figure 6.4: Mean execution times of Query 17 for 1 GB data (small) on Mcore 48.

Figure 6.5: Mean execution times of Query 17 for 3 GB data (medium) on Mcore 48.

Figure 6.6: Mean execution times of Query 17 for 5 GB data (large) on Mcore 48.

It is evident from the figures above that the execution time decreases significantly when the number of threads is increased from 1 to 4 in the case of Janus, and from 1 to 8 for Mcore 48. After these points, the execution time decreases very slowly; increasing the number of threads further does not seem to have much effect. This can be attributed to the amount of parallelisation achieved in the query. We may recall that only certain portions of the query (the searches and aggregations) were parallelised; the rest remained sequential. Moreover, the intermediate results obtained in one portion of the query were needed by another, thus making the different parts of the query execute in sequence.
These sequential sections of the query, in all probability, fail to take complete advantage of the threads. There are also additional parallelisation overheads that further affect the execution time of the parallel algorithm.

Comparing figures 6.3 (execution times for the 1 GB dataset on Janus) and 6.4 (execution times for the 1 GB dataset on Mcore48) reveals the following: for the same dataset and parallel algorithm, the execution times for 1, 2, 4 and 8 threads (the thread counts common to both machines) are much lower on Mcore48. In fact, the execution time on 8 threads is almost 8 times lower on Mcore48, and even the single-threaded execution takes twice as long on Janus. This large difference between the query response times on the two systems is owing to their hardware configurations. Janus has a hyper-threaded processor [58]: every physical core is presented to the operating system as two logical cores. This uses the CPU resources efficiently by executing two threads in parallel on a single processor; certain resources are duplicated in each physical core, while others are shared between the two logical cores. Thus, a processor with two physical cores is inherently more powerful, in terms of performance, than a hyper-threaded single core presenting two logical cores. On the test machines, all 8 cores used on Mcore48 are physical cores, whereas the 8 threads of Janus map to only 4 physical cores. This accounts for the superior query performance on Mcore48.

The following figures, 6.7 and 6.8, portray the absolute speedup attained by the query for the different datasets across a variable number of threads, on each test machine.

Figure 6.7: Absolute speedup of Query 17 for all three datasets on Janus.
Figure 6.8: Absolute speedup of Query 17 for all three datasets on Mcore48.

The figures above clearly indicate that, for the same parallel algorithm, the speedup achieved in most cases decreases as the problem (dataset) size increases. The only exception is the 500 MB dataset on Janus, which achieves a higher speedup than the 100 MB dataset at every thread count. However, for a fixed problem size there is a significant increase in speedup with the number of threads, for all datasets on both test machines. Mcore48 produces higher speedups than Janus. On Janus, maximum speedups of 2.8, 5.7 and 2 are obtained for the 100 MB, 500 MB and 1 GB datasets respectively. On Mcore48, maximum speedups of 12.7, 6.1 and 6.3 are obtained for the 1 GB, 3 GB and 5 GB datasets respectively. The rise in speedup is quite sharp initially and extremely sluggish in the latter half, where a much larger number of threads is in use. Further, analysing the efficiency values reveals a gradual decline in efficiency with an increasing number of threads. There are various possible reasons for this behaviour. The first and foremost is the existence of sequential regions in the parallel code (as stated in Amdahl's Law [41]). Other factors include the synchronization and communication costs incurred with an increased number of threads.

It is evident from the experimental results obtained for Query 17 that there are performance improvements when exploiting multi-core parallelism; a good speedup of up to 12.7 is obtained.
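The limiting effect of the sequential regions can be made concrete. Under Amdahl's Law, the speedup on p threads with serial fraction f is S(p) = 1 / (f + (1 − f)/p); inverting an observed speedup (here 12.686 on 48 threads, from Appendix 1) gives the implied serial fraction, a calculation known as the Karp–Flatt metric. Treating all overhead as a pure serial fraction is a simplifying assumption, so this is only a rough estimate, not a measurement from the dissertation.

```java
/** Hedged illustration of Amdahl's Law applied to the observed results:
 *  estimate the serial fraction implied by the measured maximum speedup. */
public class AmdahlEstimate {
    // Amdahl's Law: S(p) = 1 / (f + (1 - f) / p), where f is the serial fraction
    public static double amdahlSpeedup(double f, int p) {
        return 1.0 / (f + (1.0 - f) / p);
    }
    // Karp–Flatt metric: serial fraction implied by observed speedup s on p threads
    public static double serialFraction(double s, int p) {
        return (1.0 / s - 1.0 / p) / (1.0 - 1.0 / p);
    }
    public static void main(String[] args) {
        double f = serialFraction(12.686, 48);  // 1 GB dataset on Mcore48 (Appendix 1)
        System.out.printf("implied serial fraction = %.1f%%%n", 100 * f);
        System.out.printf("Amdahl ceiling as p grows = %.1fx%n", 1.0 / f);
    }
}
```

Even a few percent of serial work caps the attainable speedup well below the core count, which is consistent with the sluggish gains observed beyond 8 threads.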
However, the point to consider here is scalability as the problem (dataset) size increases. For this particular query, the speedup decreases as the problem size grows, indicating that it does not scale very well. The system therefore needs to be exercised with other queries as well, to ascertain its exact behaviour. Moreover, the use of performance counters and profiling would help in observing the memory accesses and monitoring the exact cache levels involved in query execution, allowing a better analysis of system performance. It would also reveal cache misses, thereby providing some information about the data structure itself: whether or not it takes adequate advantage of the cache.

7 Conclusion

This chapter summarises the dissertation and its outcome, and puts forth the scope for improvement and future work.

7.1 Dissertation Summary

This research involved the design and development of a subset of the Google BigTable [3] database system for multi-core machines. The system currently used by Google is distributed in nature and does not exploit the thread-level parallelism of the individual machines in the cluster. This project aimed to explore the possibility, and assess the efficacy, of thread-level parallelism for such huge databases.

The dissertation has presented the Google system and its open source counterparts, the underlying file systems and the distributed parallelism techniques used. It has also described alternative database technologies (IMDBs) and cache-oblivious data structures to give readers the necessary background before explaining the implementation methods employed to build the database system. Presenting this wider context was essential due to the complexity of the system as a whole.
The development technique presented includes building a data structure that resides in memory, allows insertions, deletions and parallel queries, and is also scalable. The implementation of the benchmark query used to evaluate the database has also been described at length. The Java 7 Fork/Join framework was used extensively to achieve the desired parallelism for execution on multi-core machines.

Next, the evaluation results of the database system were presented. Evaluation was performed on two different multiprocessor architectures: an Intel Xeon quad-core processor and an AMD Opteron 48-core processor. Datasets of different sizes were used to analyse the system performance in terms of execution time and speedup. It was observed that on the 48-core processor, a speedup of up to 12.7 was achieved for 1 GB of data, up to 6.1 for the 3 GB dataset and about 6.3 for the 5 GB dataset. It is evident that the system is capable of handling very large (gigabyte-scale) datasets. Although there is a significant speedup for a particular problem size (dataset) across different thread counts, an efficiency analysis indicates a gradual decline. Several factors possibly influence this, the most important being the existence of sequential portions in the query itself, which limits its parallelization. Apart from that, using larger datasets and more threads can incur additional synchronization costs that hinder speedup. However, a different query with more scope for parallelization could produce more favourable results. Therefore, cluster-based column-oriented systems like Google DataStore will benefit fully from the multi-core architecture of their individual machines only if a suitable query is executed.
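The Fork/Join pattern referred to above can be illustrated with a minimal sketch: a parallel aggregation (here, a sum) that splits recursively until a sequential threshold is reached. This is a generic illustration of the framework's divide-and-conquer style, not the dissertation's actual query code; the threshold and pool size are arbitrary choices.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Minimal Fork/Join sketch: parallel aggregation (sum) over an array,
 *  split recursively until a sequential threshold. Illustrative only. */
public class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000;  // arbitrary cut-off
    private final long[] data;
    private final int lo, hi;

    ParallelSum(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override protected Long compute() {
        if (hi - lo <= THRESHOLD) {               // small enough: sum sequentially
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ParallelSum left = new ParallelSum(data, lo, mid);
        ParallelSum right = new ParallelSum(data, mid, hi);
        left.fork();                              // run left half asynchronously
        return right.compute() + left.join();     // compute right here, join left
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        ForkJoinPool pool = new ForkJoinPool(8);  // e.g. cap at 8 worker threads
        long sum = pool.invoke(new ParallelSum(data, 0, data.length));
        System.out.println(sum);                  // 499999500000
    }
}
```

Note that the pool's parallelism can be set explicitly (here 8), which matters for the thread-count trade-offs discussed in this chapter.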
For all other queries (with limited parallelism), such a system might see performance improvements for different data sizes only up to a few cores; with a further increase in parallel threads there will be a drop in efficiency (and a stagnation in speedup), as the query execution will not scale well to a very large number of threads. For such queries, a smaller number of threads can be utilized to gain performance improvements, instead of exploiting all the available machine cores. This means the cores will remain under-utilized; however, the speedup achieved by exploiting a few threads on each multi-core machine will still boost the overall performance of a query running on a cluster (a distributed environment like Google DataStore), and such a trade-off is therefore acceptable.

This research has explored an extremely complex and vast area, and successfully developed the desired database system. However, due to lack of time, the data structure could not be evaluated and assessed for cache-obliviousness; whether the design utilizes the cache efficiently remains unknown. Also, given the enormity and complexity of the DataStore, only a subset of it was examined. Further, the query evaluation was an attempt to analyse the query performance and scalability, although firm conclusions cannot be drawn from the results of a single query.

7.2 Limitations

As mentioned earlier, the evaluation results reported in this dissertation are only an estimate of the performance of the database system. A more accurate evaluation would have been possible had more benchmark queries been implemented and used; a wider range of benchmark queries would indicate the true potential of the system and its applicability. Also, the performance of a typical column-oriented query could not be analysed on this database due to time constraints. This would have showcased the true power of the system's fundamental column-oriented design.
Moreover, the evaluation of the cache efficiency of the data structure could not be performed for the same reason.

There is a design limitation as well. For simplicity, the vEB array and the Packed Array were implemented as separate structures. It is, however, possible to create a single structure that combines their characteristics. Separating the structures results in a larger memory footprint, but this design decision does not affect the query performance in any way.

Another limitation concerns the creation of the database. When the database is loaded for the first time, it is imperative to supply it with a sufficiently large number of entries. Failure to do so results in the creation of a small static binary tree, which in turn means that subsequent write (append) operations would likely cause frequent rebalancing of the Packed Array.

A large number of threads is generated in a multi-core environment when querying the database, and hence more garbage is created per unit time [51]. It is therefore essential to analyse the behaviour of the Java Garbage Collector (GC) as well, to check its impact on system performance.

7.3 Future Work

This project implemented only a subset of Google DataStore. To make the database system more robust and complete, features like security and fault tolerance need to be in place. A single benchmark query was implemented, primarily aimed at evaluating the system performance. In reality, a wide variety of different queries should be executed on the database system before arriving at any firm conclusions; the experimental results obtained here are merely an estimate, not exhaustive. Furthermore, a query scheduler should be developed to manage all (multiple) queries made to the system. Such a mechanism would yield a database that handles all queries made to it in a uniform manner, and hence a more realistic one.
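The Packed Array rebalancing concern raised in the limitations above can be sketched. The toy packed array below keeps sorted elements with gaps and grows and re-spreads the array when a density threshold is exceeded; the single global threshold, the doubling policy and the integer keys are illustrative assumptions, and this is not the dissertation's vEB/Packed Array implementation. Starting from a deliberately small capacity, early inserts trigger repeated rebalances, mirroring the small-initial-load problem described above.

```java
import java.util.Arrays;

/** Heavily simplified packed-array sketch: sorted elements with gaps;
 *  when density exceeds a threshold the array is doubled and elements
 *  are spread evenly. Illustrative assumptions throughout. */
public class PackedArraySketch {
    static final double MAX_DENSITY = 0.7;        // illustrative threshold
    Integer[] slots = new Integer[4];             // small initial capacity
    int count = 0, rebalances = 0;

    void insert(int key) {
        if (count + 1 > MAX_DENSITY * slots.length) rebalance(slots.length * 2);
        int p = 0;                                // first occupied slot >= key
        while (p < slots.length && (slots[p] == null || slots[p] < key)) p++;
        int g = p - 1;
        while (g >= 0 && slots[g] != null) g--;   // nearest gap to the left
        if (g >= 0) {                             // shift run left into the gap
            for (int j = g; j < p - 1; j++) slots[j] = slots[j + 1];
            slots[p - 1] = key;
        } else {                                  // shift run right into a gap
            int r = p;
            while (slots[r] != null) r++;
            for (int j = r; j > p; j--) slots[j] = slots[j - 1];
            slots[p] = key;
        }
        count++;
    }

    void rebalance(int newCapacity) {             // grow and spread evenly
        Integer[] packed = Arrays.stream(slots).filter(x -> x != null)
                                 .sorted().toArray(Integer[]::new);
        slots = new Integer[newCapacity];
        for (int j = 0; j < packed.length; j++)
            slots[(int) ((long) j * newCapacity / Math.max(packed.length, 1))] = packed[j];
        rebalances++;
    }

    public static void main(String[] args) {
        PackedArraySketch pa = new PackedArraySketch();
        for (int k = 0; k < 100; k++) pa.insert(k);
        System.out.println("rebalances for 100 inserts: " + pa.rebalances);
    }
}
```

A larger initial load would amortise these rebalances over far more inserts, which is exactly the behaviour the limitation describes.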
An appropriate access control mechanism should be incorporated to make the database more secure. Also, the synchronization mechanism handling concurrent writes to the database (currently only partially implemented) needs to be completed. All such enhancements would build an exact DataStore-like system for multi-core machines.

A very important assessment would have been that of the data structure itself, in terms of cache efficiency. Although the structure is based on a cache-oblivious design, it was not verified whether it was indeed utilising the cache to improve performance. Such an evaluation would provide a better understanding of the utilization of the memory hierarchy, which could be valuable for enhancing system performance. Profiling, as well as the use of performance counters, can provide useful information about the query performance and the system bottlenecks.

An interesting area of research would be to integrate thread-level parallelism into existing cluster-based systems like DataStore. With the huge volumes of data available, and the equally complex computations performed on them nowadays, further research into this area is necessary. Such an investigation should aim to utilise the power of the multiple cores available to every machine in a cluster and thus improve performance considerably. In this project, by means of a single query executed on a subset of a DataStore-like system, the challenges and shortcomings associated with query execution in a multi-core environment were exposed. Most importantly, it demonstrated the advantage of introducing thread-level parallelism into huge column-oriented database systems. This advantage could therefore be taken forward, integrating multi-core with cluster-based systems and achieving twofold benefits by supporting both thread and process parallelism.

Appendix 1

TPC-H Query 17 Execution Results

1 a.
Execution Results on Mcore48 (AMD Opteron 6174, 48-core processor)

Small Dataset: 1 GB

Number of Threads   Mean Execution Time (T) in sec   Confidence Interval (CI)   Standard Deviation (STDEV)   Standard Error (STDERR)
1                   146.97                           2.81                       4.53                         1.43
2                   131.43                           2.26                       3.64                         1.15
4                   55.44                            0.84                       1.35                         0.43
8                   32.1                             0.43                       0.7                          0.22
16                  26.15                            0.42                       0.68                         0.22
24                  24.93                            0.37                       0.6                          0.19
32                  24.44                            0.38                       0.61                         0.19
40                  23.7                             0.38                       0.62                         0.2
48                  22.96                            0.35                       0.57                         0.18

Medium Dataset: 3 GB

Number of Threads   Mean Execution Time (T) in sec   Confidence Interval (CI)   Standard Deviation (STDEV)   Standard Error (STDERR)
1                   1440.02                          13.93                      22.48                        7.11
2                   1302.51                          27.02                      43.59                        13.78
4                   616.88                           25.56                      41.24                        13.04
8                   478.07                           9.06                       14.62                        4.62
16                  476.95                           10.72                      17.3                         5.47
24                  454.96                           11.63                      18.77                        5.94
32                  439.34                           23.4                       37.76                        11.94
40                  430.39                           23.09                      37.25                        11.78
48                  426.38                           21.84                      35.23                        11.14

Large Dataset: 5 GB

Number of Threads   Mean Execution Time (T) in sec   Confidence Interval (CI)   Standard Deviation (STDEV)   Standard Error (STDERR)
1                   4183.29                          91.59                      147.78                       46.73
2                   3844.93                          88.98                      143.56                       45.4
4                   2144.93                          55.18                      89.03                        28.15
8                   1576.64                          29.01                      46.8                         14.8
16                  1405.18                          14.57                      23.51                        7.43
24                  1290.67                          8.12                       13.1                         4.14
32                  1190.85                          7.86                       12.68                        4.01
40                  1155.63                          10.31                      16.63                        5.26
48                  1134.43                          8.5                        13.72                        4.34

Mcore48 1 GB

S. No.   Number of Threads   Avg. Time (xmean in s)   Absolute Speedup (T1 / Tp)   Efficiency
1        Sequential          291.27                   —                            —
2        2                   131.43                   2.216                        1.108
3        4                   55.44                    5.254                        1.314
4        8                   32.1                     9.074                        1.134
5        16                  26.15                    11.138                       0.696
6        24                  24.93                    11.684                       0.487
7        32                  24.44                    11.918                       0.372
8        40                  23.7                     12.29                        0.307
9        48                  22.96                    12.686                       0.264

Mcore48 3 GB

S. No.   Number of Threads   Avg. Time (xmean in s)   Absolute Speedup (T1 / Tp)   Efficiency
1        Sequential          2605.49                  —                            —
2        2                   1302.51                  2                            1
3        4                   616.88                   4.224                        1.056
4        8                   478.07                   5.45                         0.681
5        16                  476.95                   5.463                        0.341
6        24                  454.96                   5.727                        0.239
7        32                  439.34                   5.93                         0.185
8        40                  430.39                   6.054                        0.151
9        48                  426.38                   6.111                        0.127

Mcore48 5 GB

S. No.   Number of Threads   Avg. Time (xmean in s)   Absolute Speedup (T1 / Tp)   Efficiency
1        Sequential          7169.24                  —                            —
2        2                   3844.93                  1.865                        0.933
3        4                   2144.93                  3.342                        0.836
4        8                   1576.64                  4.547                        0.568
5        16                  1405.18                  5.102                        0.319
6        24                  1290.67                  5.555                        0.231
7        32                  1190.85                  6.02                         0.188
8        40                  1155.63                  6.204                        0.155
9        48                  1134.43                  6.32                         0.132

1 b. Execution Results on Janus (Intel Xeon quad-core processor)

Small Dataset: 100 MB

Number of Threads   Mean Execution Time (T) in sec   Confidence Interval (CI)   Standard Deviation (STDEV)   Standard Error (STDERR)
1                   0.82                             0.02                       0.04                         0.01
2                   0.74                             0.01                       0.02                         0.01
4                   0.45                             0.01                       0.02                         0.01
8                   0.43                             0.02                       0.04                         0.01

Medium Dataset: 500 MB

Number of Threads   Mean Execution Time (T) in sec   Confidence Interval (CI)   Standard Deviation (STDEV)   Standard Error (STDERR)
1                   9.78                             0.1                        0.16                         0.05
2                   9.24                             0.06                       0.1                          0.03
4                   4.32                             0.05                       0.08                         0.03
8                   3.43                             0.08                       0.13                         0.04

Large Dataset: 1 GB

Number of Threads   Mean Execution Time (T) in sec   Confidence Interval (CI)   Standard Deviation (STDEV)   Standard Error (STDERR)
1                   281.18                           2.84                       4.58                         1.45
2                   274.09                           0.95                       1.54                         0.49
4                   251.84                           1.65                       2.66                         0.84
8                   250.33                           0.92                       1.49                         0.47

JANUS 100 MB

S. No.   Number of Threads   Avg. Time (xmean in s)   Absolute Speedup (T1 / Tp)   Efficiency
1        Sequential          1.22                     —                            —
2        2                   0.74                     1.649                        0.825
3        4                   0.45                     2.711                        0.678
4        8                   0.43                     2.837                        0.355

JANUS 500 MB

S. No.   Number of Threads   Avg. Time (xmean in s)   Absolute Speedup (T1 / Tp)   Efficiency
1        Sequential          19.44                    —                            —
2        2                   9.24                     2.104                        1.052
3        4                   4.32                     4.5                          1.125
4        8                   3.43                     5.668                        0.709

JANUS 1 GB

S. No.   Number of Threads   Avg. Time (xmean in s)   Absolute Speedup (T1 / Tp)   Efficiency
1        Sequential          501.41                   —                            —
2        2                   274.09                   1.829                        0.915
3        4                   251.84                   1.991                        0.498
4        8                   250.33                   2.003                        0.25

Appendix 2

IMPORTANT NOTE: A project involving Google BigTable is also being designed by my colleague. However, the choice of research area, scope of implementation, methodology/approach, supported features, and choice of programming language differ completely. Hence, the two projects are separate and unconnected.
The individual projects are thus being conducted independently of each other, with the approval and under the guidance of my supervisor.

References

1. GOOGLE APP ENGINE. http://code.google.com/appengine/docs/whatisgoogleappengine.html, last visited on March 25, 2011.
2. GOOGLE DATASTORE. http://code.google.com/appengine/docs/java/datastore/, last visited on March 25, 2011.
3. CHANG, F., GHEMAWAT, S., DEAN, J. et al. Bigtable: A Distributed Storage System for Structured Data. In Proc. of the 7th OSDI (Nov. 2006).
4. STONEBRAKER, M. The case for shared nothing. Database Engineering Bulletin 9, 1 (Mar. 1986), 4–9.
5. GHEMAWAT, S., GOBIOFF, H. and LEUNG, S.-T. The Google File System. In Proc. of the 19th ACM SOSP (Dec. 2003).
6. BURROWS, M. The Chubby lock service for loosely coupled distributed systems. In Proc. of the 7th OSDI (Nov. 2006).
7. APACHE HBASE. http://hbase.apache.org/, last visited on March 25, 2011.
8. UNDERSTANDING HBASE AND BIGTABLE. http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable, last visited on April 27, 2011.
9. ZOOKEEPER. http://wiki.apache.org/hadoop/ZooKeeper, last visited on April 29, 2011.
10. ADVANCED HBASE. http://www.docstoc.com/docs/66356954/AdvancedHBase, last visited on April 27, 2011.
11. HFILE: A BLOCK-INDEXED FILE FORMAT TO STORE SORTED KEY-VALUE PAIRS. http://www.slideshare.net/schubertzhang/hfile-ablockindexed-file-format-to-store-sorted-keyvalue-pairs, last visited on April 27, 2011.
12. HADOOP TUTORIAL. http://developer.yahoo.com/hadoop/tutorial/module1.html, last visited on April 27, 2011.
13. RANGER, C., et al. "Evaluating MapReduce for Multi-core and Multiprocessor Systems," In Proc. of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, 2007.
14. DEAN, J., GHEMAWAT, S. "MapReduce: Simplified Data Processing on Large Clusters," OSDI, 2004.
15. ORAM, A., WILSON, G. "Distributed Programming with MapReduce", O'Reilly, 2007.
16. HADOOP MAPREDUCE TUTORIAL.
http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html, last visited on April 27, 2011.
17. BUILDING A JAVA MAPREDUCE FRAMEWORK FOR MULTICORE ARCHITECTURES. http://www.cs.man.ac.uk/~lujanmx/research/docs/kovoor_multiprog2010.pdf, last visited on April 29, 2011.
18. GARCIA-MOLINA, H., KENNETH, S. Main Memory Database Systems: An Overview. IEEE Trans. on Knowledge and Data Engineering, Dec. 1992.
19. IN-MEMORY DATABASE. http://it.toolbox.com/wiki/index.php/InMemory_Database, last visited on May 2, 2011.
20. FRIGO, M., LEISERSON, C.E., PROKOP, H. and RAMACHANDRAN, S. Cache-Oblivious Algorithms. Extended abstract. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, 1999.
21. CACHE-OBLIVIOUS DATA STRUCTURES. http://blogs.msdn.com/b/devdev/archive/2007/06/12/cache-oblivious-datastructures.aspx, last visited on May 2, 2011.
22. CACHE-OBLIVIOUS ALGORITHMS. http://qstuff.blogspot.com/2010/06/cache-oblivious-algorithms.html, last visited on May 3, 2011.
23. CACHE-OBLIVIOUS DATA STRUCTURES. http://bryanpendleton.blogspot.com/2009/06/cache-oblivious-datastructures.html, last visited on May 2, 2011.
24. CACHE-OBLIVIOUS ALGORITHMS. http://www.itu.dk/~annao/ADT03/lecture10.pdf, last visited on May 3, 2011.
25. VITTER, J.S. External Memory Algorithms and Data Structures: Dealing with Massive Data. ACM Computing Surveys, 33(2):209–271, 2001.
26. OLSEN, J.H. and SKOV, S.C. Cache-Oblivious Algorithms in Practice. Master's thesis, University of Copenhagen, Copenhagen, Denmark, 2002.
27. PROKOP, H. Cache-Oblivious Algorithms. Master's thesis, Massachusetts Institute of Technology, Massachusetts, 1999.
28. BENDER, M.A., DUAN, Z., IACONO, J. and WU, J. A locality-preserving cache-oblivious dynamic dictionary. Journal of Algorithms, 115–136, 2004.
29. BENDER, M.A., DEMAINE, E.D. and FARACH-COLTON, M. "Cache-Oblivious B-Trees", SIAM Journal on Computing, 2005.
30. DEMAINE, E.D.
"Cache-Oblivious Algorithms and Data Structures", in Lecture Notes from the EEF Summer School on Massive Data Sets, BRICS, University of Aarhus, Denmark, June 27 – July 1, 2002.
31. BENDER, M.A., FINEMAN, J.T., GILBERT, S. and KUSZMAUL, B.C. Concurrent Cache-Oblivious B-Trees. In Proc. of the 17th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), Las Vegas, July 2005.
32. COMER, D. The Ubiquitous B-Tree. Computing Surveys, 1979.
33. CORMEN, T.H., LEISERSON, C.E., RIVEST, R.L. and STEIN, C. Introduction to Algorithms, Second Edition. Chapter 12, Section 15.5.
34. MING-YANG KAO. Encyclopaedia of Algorithms, page 123.
35. TRANSISTOR SIZING ISSUES AND TOOL FOR MULTI-THRESHOLD CMOS TECHNOLOGY. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=597182, last visited on April 24, 2011.
36. EXCERPTS FROM A CONVERSATION WITH GORDON MOORE: MOORE'S LAW. ftp://download.intel.com/museum/Moores_Law/VideoTranscripts/Excepts_A_Conversation_with_Gordon_Moore.pdf, last visited on April 24, 2011.
37. MOORE'S LAW. http://www.intel.com/technology/mooreslaw/index.htm?iid=tech_2b+rhc_law, last visited on April 24, 2011.
38. GARCIA-MOLINA, H., ULLMAN, J.D., WIDOM, J. Database Systems: The Complete Book. Prentice Hall, 2nd Edition.
39. DUAL-CORE PROCESSORS. http://www.tomshardware.com/cn/957,review-957.html, last visited on April 24, 2011.
40. WULF, W. and McKEE, S. "Hitting the Memory Wall: Implications of the Obvious," ACM SIGARCH Computer Architecture News, vol. 23, 1994.
41. FOSTER, I. Designing and Building Parallel Programs. Addison-Wesley, 1994.
42. AN INTRODUCTION TO MULTIPROCESSOR SYSTEMS. http://www.realworldtech.com/page.cfm?ArticleID=RWT121106171654&p=2, last visited on April 24, 2011.
43. THE TROUBLE WITH MULTI-CORE COMPUTERS. http://www.technologyreview.com/computing/17682/page2/, last visited on April 24, 2011.
44. FOURTH WORKSHOP ON PROGRAMMABILITY ISSUES FOR MULTI-CORE COMPUTERS (JAN 2011).
http://multiprog.ac.upc.edu/resources/multiprog11.pdf, p. 3, last visited on April 24, 2011.
45. AUTONOMIC COMPUTING. http://autonomiccomputing.org/, last visited on March 25, 2011.
46. AUTONOMIC COMPUTING. http://www.research.ibm.com/autonomic/overview/benefits.html, last visited on April 22, 2011.
47. OPENMP. http://openmp.org/wp/, last visited on May 4, 2011.
48. THE CILK PROJECT. http://supertech.csail.mit.edu/cilk/, last visited on May 4, 2011.
49. FORK/JOIN TUTORIAL. http://download.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html, last visited on May 4, 2011.
50. HOW TO SURVIVE THE MULTICORE SOFTWARE REVOLUTION. http://akira.ruc.dk/~keld/teaching/IPDC_f10/How_to_Survive_the_Multicore_Software_Revolution-1.pdf, last visited on May 4, 2011.
51. LEA, D. A Java Fork/Join Framework. SUNY Oswego. http://gee.cs.oswego.edu/dl/papers/fj.pdf, last visited on August 30, 2011.
52. PACKAGE JAVA.UTIL.CONCURRENT. http://download.oracle.com/javase/7/docs/api/java/util/concurrent/package-summary.html, last visited on August 30, 2011.
53. PONGE, J. Fork and Join: Java Can Excel at Painless Parallel Programming Too! http://www.oracle.com/technetwork/articles/java/fork-join-422606.html, last visited on August 30, 2011.
54. FORK-JOIN DEVELOPMENT IN JAVA™ SE. http://www.coopsoft.com/ar/ForkJoinArticle.html, last visited on August 30, 2011.
55. FRIGO, M., LEISERSON, C.E. and RANDALL, K.H. "The Implementation of the Cilk-5 Multithreaded Language," SIGPLAN Not., 1998.
56. TPC BENCHMARK H (DECISION SUPPORT) STANDARD SPECIFICATION, Revision 2.8.0. http://www.tpc.org/tpch/spec/tpch2.8.0.pdf, last visited on August 30, 2011.
57. INTERFACE READWRITELOCK. http://download.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReadWriteLock.html, last visited on August 30, 2011.
58. INTEL HYPER-THREADING TECHNOLOGY.
http://www.intel.com/content/www/us/en/architecture-and-technology/hyperthreading/hyper-threading-technology.html/, last visited on August 30, 2011.