IN-MEMORY DATA STRUCTURE FOR
GOOGLE DATASTORE ON MULTI-CORE
ARCHITECTURES
A dissertation submitted to the University of Manchester
for the degree of Master of Science
in the Faculty of Engineering and Physical Sciences
2011
MOON MOON NATH
School of Computer Science
List of Contents
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Code Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Copyright . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1 Shared Memory Multi-Core Systems and Google DataStore . . . . . . . 14
1.2 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1 The BigTable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 HBase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 The Google File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Hadoop Distributed File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Data Retrieval in a Cluster Environment: MapReduce . . . . . . . . . . . . 27
2.6 In-Memory Database Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.7 Cache-Oblivious Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Data Manipulation and Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Data Structure Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 Static Binary Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Packed Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Algorithm to Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Algorithm to Append / Insert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5 Algorithm to Delete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5 Query Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1 Development Tools – Java Fork/Join Framework . . . . . . . . . . . . . . . . 53
5.2 TPC-H Benchmark Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3 Loading the Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4 TPC-H Query 17 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.5 Query 17 – Sequential Implementation . . . . . . . . . . . . . . . . . . . . . . . . 61
5.6 Query 17 – Parallel Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.7 Need for Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.8 Synchronization Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.9 Synchronization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.1 Dissertation Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Appendix 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Appendix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Word Count: 20,190
List of Figures
Figure 1.1: NUMA architecture . . . 13
Figure 1.2: UMA architecture . . . 13
Figure 2.1: Example table storing web pages . . . 20
Figure 2.2: To illustrate the concept of ‘rows’, ‘column families’ and ‘columns’ in BigTable . . . 22
Figure 2.3: To illustrate timestamps . . . 23
Figure 2.4: GFS architecture . . . 25
Figure 2.5: The memory hierarchy . . . 30
Figure 2.6: The RAM model . . . 31
Figure 3.1: Data structure design . . . 36
Figure 3.2: Representation of the Data model . . . 38
Figure 4.1: Steps to create the data structure . . . 40
Figure 4.2: A complete Binary tree . . . 42
Figure 4.3: van Emde Boas layout on a binary tree of height 5 . . . 44
Figure 4.4: To illustrate the relation between a full binary tree (of height 4) and the vEB array and the Packed array structure . . . 48
Figure 5.1: Co-operation among fork() and join() tasks . . . 54
Figure 5.2: TPC-H database schema . . . 56
Figure 5.3: Sample Key-Value pairs generated from the de-normalized dataset . . . 58
Figure 5.4: Overview of parallel execution strategy used in Query 17 . . . 67
Figure 6.1: Mean execution times of Query 17 for 100 MB data (small) on Janus . . . 77
Figure 6.2: Mean execution times of Query 17 for 500 MB data (medium) on Janus . . . 78
Figure 6.3: Mean execution times of Query 17 for 1 GB data (large) on Janus . . . 78
Figure 6.4: Mean execution times of Query 17 for 1 GB data (small) on Mcore48 . . . 79
Figure 6.5: Mean execution times of Query 17 for 3 GB data (medium) on Mcore48 . . . 79
Figure 6.6: Mean execution times of Query 17 for 5 GB data (large) on Mcore48 . . . 80
Figure 6.7: Absolute speedup of Query 17 for all three datasets on Janus . . . 81
Figure 6.8: Absolute speedup of Query 17 for all three datasets on Mcore48 . . . 82
List of Tables
Table 1: Production system configurations for performance evaluation . . . . . .76
Code Listings
Listing 1: Class definition of a node in the implementation of a binary tree . . . 41
Listing 2: Implementation of a sorted tree . . . 43
Listing 3: Implementation of vEB array . . . 45 - 46
Listing 4: Pseudo code to explain the mapping of vEB array to packed array . . . 49
Listing 5: To calculate the number of leaves for a tree of height ‘height’ . . . 50
Listing 6: Implementation of Packed Array . . . 50
Listing 7: Implementation to convert a key-value pair file to another key-value pair format (based on our system’s data model) . . . 58
Listing 8: Implementation of search algorithm to check for the first occurrence of a column in the binary tree . . . 62
Listing 9: Implementation of search algorithm to check within a subtree . . . 63
Listing 10: Implementation of intersection operation to find the common rows . . . 63
Listing 11: Implementation of search algorithm to check for a column_name and a specific row_key within a single combo key . . . 64
Listing 12: Implementation of duplicate removal algorithm . . . 65
Listing 13: Implementation of search algorithm to check for less-than condition . . . 65
Listing 14: Implementation of parallel search algorithm (1) . . . 69
Listing 15: Implementation of parallel search algorithm (2) . . . 70
Listing 16: Implementation of parallel addition algorithm . . . 71
Abstract
Google provides its users an assortment of applications and services. In the process, it needs to store and manage huge volumes of user data. To accomplish this, all the data is distributed and stored across thousands of servers, in a distributed storage system. This approach is beneficial since it exploits parallelism in a cluster environment to achieve good system performance, in terms of throughput and response time.
The advent of multi-core architectures has prompted a great deal of research into effective software solutions that take advantage of the parallel hardware. This project investigates the possibility of, and develops, a Google DataStore-like system for shared memory multi-core machines that is scalable, fast, and efficient.
This dissertation discusses the motivation, relevant literature, scope, design, implementation, and evaluation of the project. The literature survey provides the essential background knowledge necessary to understand the idea behind this research. The system design consists primarily of an underlying data structure and a set of operations to manipulate the database.
The implementation, based on Java 7, includes developing the data structure to support the database and parallelizing a search query. Support for several database operations like insert, delete, and search, similar to those of Google DataStore, exists in this system. The query execution is parallelized on several multi-core machines to evaluate the performance and scalability of the design, using execution time and absolute speedup as metrics. The analysis of the results reveals a maximum speedup of 12.7 for 1 GB and 6.3 for 5 GB data on a 48-core test machine, which indicates the advantage of executing queries on multi-core systems.
The designed database is a subset of the Google DataStore and hence supports only its core features. Given the stringent time frame, enhancements like fault tolerance and security are kept outside the scope of this project.
Declaration
No portion of the work referred to in this dissertation has been submitted in
support of an application for another degree or qualification of this or any other
university or other institute of learning.
Copyright
i. The author of this dissertation (including any appendices and/or
schedules to this dissertation) owns certain copyright or related
rights in it (the “Copyright”) and s/he has given The University of
Manchester certain rights to use such Copyright, including for
administrative purposes.
ii. Copies of this dissertation, either in full or in extracts and whether in
hard or electronic copy, may be made only in accordance with the
Copyright, Designs and Patents Act 1988 (as amended) and
regulations issued under it or, where appropriate, in accordance with
licensing agreements which the University has entered into. This
page must form part of any such copies made.
iii. The ownership of certain Copyright, patents, designs, trade marks
and other intellectual property (the “Intellectual Property”) and any
reproductions of copyright works in the dissertation, for example
graphs and tables (“Reproductions”), which may be described in this
dissertation, may not be owned by the author and may be owned by
third parties. Such Intellectual Property and Reproductions cannot
and must not be made available for use without the prior written
permission of the owner(s) of the relevant Intellectual Property
and/or Reproductions.
iv. Further information on the conditions under which disclosure,
publication and commercialisation of this dissertation, the Copyright
and any Intellectual Property and/or Reproductions described in it
may take place is available in the University IP Policy (see
http://documents.manchester.ac.uk/display.aspx?DocID=487), in any
relevant Dissertation restriction declarations deposited in the
University Library, The University Library’s regulations (see
http://www.manchester.ac.uk/library/aboutus/regulations) and in The
University’s Guidance for the Presentation of Dissertations.
Acknowledgements
At the outset, I would like to express my sincere gratitude to my
supervisor Dr. Mikel Lujan, for his invaluable guidance, support and inspiration
throughout the project. I would also like to thank all my faculty members at the
School of Computer Science. Finally, I would like to convey my heartfelt
gratitude to my parents and friends, for their ceaseless love and support, without
which this work would not have been possible.
1 Introduction
Processor architecture has evolved considerably over the years. From being
steered primarily by Moore’s Law [37] to exploiting multi-core parallelism
nowadays, it has travelled a long way. The direct correlation between processor
frequency and performance is breaking down owing to certain limiting factors. The most prominent of these is transistor size, which cannot be reduced beyond a certain point [35]; smaller transistors also demand considerably more complex design effort, so there is a physical limit on how far transistors can shrink. In addition, increasing the number of transistors on a chip directly increases its power consumption. Apart from this, there is also the problem of physical memory bandwidth. Main memory is much slower than the processor and, over the past decade, memory speed has not kept pace with the rate at which processor frequency has increased. This memory bottleneck restricts system performance regardless of the processor's clock speed, because a fast processor paired with a slow memory merely spends more time idle. Wulf et al. called this bottleneck the ‘Memory Wall’ [40].
Hardware designers have now incorporated multi-core technology into the
processors. Instead of having a single CPU, processors now have multiple CPUs, called ‘cores’, built onto the same chip. The existence of multiple cores creates an
opportunity for improvements in performance and speed of the processor,
provided there exists parallel software that can utilize the cores available to it.
This is because a task can now be executed on several cores simultaneously as
threads.
A modern multi-core processor is usually a NUMA (Non Uniform Memory
access) shared memory multi-processor [42]. However, the ones with fewer
cores are still SMPs (Symmetric Multi-Processor) having UMA (Uniform
Memory Access) [42]. In a UMA system, memory is shared by all processor cores and each core takes the same amount of time to access it. In a NUMA system, by contrast, there are several pools of memory, each shared by a group of cores (a ‘socket’). Each socket is connected directly to one RAM pool and indirectly to all the others (see Figure 1.1), so some sockets access a particular RAM faster than others. Figures 1.1 and 1.2 illustrate the two multi-core architectures.
Figure 1.1: NUMA architecture. Drawn based on [42].
Figure 1.2: UMA architecture. Drawn based on [42]. Here, each socket has 4 cores.
In these examples, a single operation can be divided among 16 cores in the NUMA architecture and 4 cores in the UMA architecture; if parallel software is available, it can exploit these multiple cores for performance gains. However, NUMA incurs some additional latency due to the differing memory access times.
Although the hardware industry has found an effective technique in the
form of multi-core, the software industry still needs to evolve accordingly to
exploit this hardware. It is extremely vital to write software and design
frameworks that can efficiently scale and utilize the underlying multi-core
architecture. Also, one should bear in mind the hierarchical memory structure
involving CPU caches to yield optimal system performance. This is especially
true when processing terabytes of data.
There are several frameworks for parallel programming on multi-core systems, such as OpenMP [47] for C and Fortran, Java Fork/Join [49], Phoenix [13] and MR-J [17]. However, applications written for multi-core architectures are still far from mature. The complexity of exploiting thread-level parallelism appropriately is magnified by the existence of multiple cache levels, cache sharing, memory page sizes and so forth [44]. Therefore, software
designers can achieve greater performance from the multi-core systems if they
consider these factors and design structures and algorithms that are tailored
accordingly.
1.1 Shared Memory Multi-Core Systems and Google DataStore
Industry today needs to manage huge amounts of data, on the order of petabytes. Such large computations can be processed either in a distributed cluster computing environment or on a shared memory multi-core architecture. Again, apart from improving the hardware, the software must also be rewritten to exploit that hardware, as mentioned earlier.
Google has devised a mechanism based on the distributed computing
environment to process and manage petabytes of user data. BigTable [3] is a
high performance, scalable proprietary database system from Google. It is a
distributed storage system that supervises large amounts of data across several
thousand commodity servers. It is built atop other Google services like the
Google File System (GFS) [5], MapReduce [14], Chubby Locking Service [6],
and so on.
The GFS is a distributed file system that runs on several thousand
inexpensive commodity Linux servers. It provides the usual file system
operations with special fault tolerance, scalability and reliability features. The
database operations are designed such that they can utilise the distributed nature
of the environment and run in parallel. However, it does not utilize the
individual cores within a single system to gain performance benefits; in other
words, it does not support execution on a multi-core architecture. The
distributed parallelism is achieved by using MapReduce, which is a framework
that requires a programmer to write only two special functions, while the
complex parallel activities are handled by its underlying run-time features.
Many Google projects make use of BigTable, such as Google Earth, Google
Finance, the web indexing operation, Gmail, YouTube, and so on.
Several similar distributed systems exist as open source. The most
commonly used are from Hadoop [12]. Hadoop’s HBase [7], HDFS [12] and
MapReduce [16] are similar in most ways to Google’s BigTable, GFS and
MapReduce respectively. Hadoop is extensively used by services like Facebook,
Twitter, Adobe, EBay, LinkedIn and Yahoo to name a few.
The applications supported by these distributed systems give us a fair idea
about the enormity of the data handled by them. These systems are robust and
have a low response time in most situations. However, if concurrent activity increases manifold over the next few years, owing to a large number of simultaneous users, or if the amount of data grows many times over, these performance figures may no longer hold. The computations are bound to become larger, requiring more power in the future. Therefore, with the advent of multi-core architectures, it is only natural to try to extract the additional computational power required from the multi-core systems themselves. In fact, BigTable and HBase run their clusters on inexpensive commodity machines; the multi-core nature of the individual systems that constitute the cluster can be exploited to gain improvements in performance and speed-up.
1.2 Aims and Objectives
This project aims at investigating the possibility of implementing a subset
of the BigTable [3] functionality on multi-core architectures. The designed
database resides entirely in memory [18, 19], eliminating access to secondary storage for its operations, and has an underlying data structure with a cache-oblivious [27] design. The research is carried out in three basic phases. The initial phase involves conducting a survey of the Google database system, its underlying infrastructure, the GFS [5], and a study of other similar non-SQL (unconventional or non-relational DBMS) database technologies. It also involves examining various in-memory and cache-oblivious data structures to assess their suitability for this research. Existing multi-core frameworks are also examined to identify a suitable means of achieving parallel execution. In addition, the programming environment for developing the implementation is explored to arrive at an appropriate choice.
The next phase is to design and develop a version of the DataStore system
for shared-memory multi-core machines, based on the decisions taken in the
previous phase. This involves designing and implementing a suitable data structure capable of supporting operations similar to those of Google BigTable. This structure is then used to perform simple operations on the database, such as create, populate, append new data and delete. In addition to this,
thread synchronization features are incorporated to allow multiple users
concurrent access to a single database. Next, a data retrieval operation is
performed on it, exploiting the parallelism of the processor cores.
The final objective of this research is to evaluate the performance and
usability of this multi-core implementation. Also, the efficiency of the
implementation as well as its scalability on various multi-core systems is
examined. The parallelized query is used for this evaluation. However, the
evaluation of the cache-obliviousness of the data structure is not performed due
to time constraints. The results obtained from various multi-core systems are
analyzed to arrive at a formal conclusion.
1.3 Organization of the Dissertation
This dissertation is organized into Background (Section 2), System Design (Section 3), Data Structure Implementation (Section 4), Query Implementation (Section 5), Evaluation (Section 6) and Conclusion (Section 7) sections, in addition to this Introduction (Section 1). The Background section
contains an overview of the entire research activity. It presents the primary
motivation behind this project – Google DataStore (BigTable) [3]. The concept,
architecture and salient features are discussed briefly. It is then compared with
its open source counterpart HBase, from Hadoop [7]. Next GFS [5] and its open
source version from Hadoop, HDFS [12] are discussed, exploring the
architecture of these systems. The database querying mechanism, MapReduce
[14, 16], used by these distributed systems is then examined. Further, we look at
the alternative database technologies like In-Memory Databases (IMDB) [18,
19] to investigate the feasibility of using them for this implementation. We also
look at cache-oblivious [20] data structures to explore their suitability and at the
same time, identify an appropriate structure for development.
The System Design section presents a detailed description of the data structure that forms the building block of the in-memory, DataStore-like database system that is implemented.
The subsequent section deals with the actual implementation details of
the data structure, followed by the implementation details of the query, used to
evaluate the system. The mechanism used to parallelize this query, in order to
exploit the processor cores of a multi-core system, is also presented.
The Evaluation section further includes the evaluation techniques, the
various multi-core configurations used, the benchmarks, as well as the results of
analysis. Finally, the Conclusion section wraps up the report by briefly
discussing the outcome, the learning, system limitations and the future work
that can be undertaken.
2 Background
Google provides its users with a Platform-as-a-Service (PaaS) commercial
cloud technology, in the form of the Google App Engine [1]. App Engine
allows users to build, maintain and run their web applications on Google’s
infrastructure by means of a huge storage structure called the DataStore [2]. It
comprises several APIs required for its services, one of which is the Datastore API [2]. Available in both Java and Python versions, this API provides access to a query engine and atomic transactions, giving users stable storage that is both reliable and consistent.
The huge amount of user data present in the DataStore is, in reality, stored across thousands of servers and managed by a distributed storage system called BigTable [3]. In other words, the DataStore of App Engine is built on top of BigTable. BigTable, a single-master distributed storage system, consists of three main components – a library linked to all clients, a master server and several tablet servers [3]. It is a non-SQL (non-traditional DBMS) database
management system in that it does not conform to a specific schema – the
number of columns in different rows of the same table can vary, thus sharing
characteristics of both row-oriented and column-oriented databases. Typically, a
column-oriented database serializes (stores values internally in file etc.) the
contents of the database in a column-wise fashion, in that all data from one
column gets stored together and then the same for the next column and so forth.
The biggest advantage of such a storage mechanism is the quick execution of
aggregation operations (like sum, count, etc. that are performed over specific
columns) since now entire rows need not be read. Instead the required column, a
much smaller subset of the database, can be accessed directly, giving faster
query results. Also, since column data is usually of the same type, compression
techniques can be employed to achieve storage size optimizations, which is not
possible in row-oriented stores.
BigTable uses an underlying Google File System (GFS) [5] to store data
and is based on the shared-nothing architecture [4]. BigTable also relies on a
distributed locking service called Chubby [6] to ensure consistency and
synchronization of all client activities in a loosely-coupled distributed system. It
provides its client with a highly reliable and consistent environment.
The open source counterparts from Hadoop [12] also have similarities in
terms of architecture. One of the primary objectives of this research work is to
conduct a survey of these distributed systems to understand their functionality,
architecture and the structures employed. Additionally, we will examine in
detail different types of data structures especially the ones that utilize the cache
(cache-aware and cache-oblivious) for performance improvements. Their study
will provide us with necessary understanding and thus allow us to decide on the
data structure to implement. This decision will be guided primarily by the fact
that the structure should be similar to that of BigTable’s; data should be stored
in a column-oriented manner. Efficient memory and cache utilization,
performance, etc. are the other criteria. We will also look at in-memory databases [18, 19], as they have very low response times, and decide on their
suitability for this project.
This section will deal with the above discussed aspects of my research
and thus provide an understanding of the background and the system in general.
2.1 The BigTable
BigTable is defined as “a sparse, distributed, persistent multidimensional sorted map” by Chang et al. [3]. It is “sparse” because each row in a table can have an arbitrary number of columns, very different from the other rows in that table. This is possible because BigTable is not a conventional relational database management system that is strictly row-oriented; it is instead a non-SQL, column-oriented database system. A BigTable row contains only those columns which contain some value. Contrary to an RDBMS, there are no NULL values and no JOINs. The tables are also unlike the traditional RDBMS ones. A table here is a “map”, indexed by a row key, column key and a timestamp. In other words, a cell in a BigTable table is identified by 3 dimensions – row, column and timestamp. The timestamp facilitates versioning of the data. Each cell can have multiple values at different
points in time and each value, an array of bytes, is maintained separately with
its associated timestamp. These are 64 bit integers and can be used to store
actual time in microseconds as well.
The unique row key is a maximum of 64 KB in size and is an arbitrary
string. All data is maintained in lexicographic order of the row key. A table can
be huge and is therefore split at row boundaries to make it manageable. These partitions are called tablets. Each tablet is around 100 – 200 MB in size,
allowing several hundred to be stored on each machine. This sort of an
arrangement allows for fine grained load balancing.
Several column keys are combined to form a set called a column family,
which is the basic access control unit. Any number of column keys can be part
of a single column family, but these columns are usually of the same data type.
The number of column families is restricted to a few hundred in contrast to the
unbounded number of columns in a table. A column key has the following syntax: family:qualifier, where ‘family’ and ‘qualifier’ refer to the column family and the column key respectively.
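The resulting mapping (row key, column key, timestamp) → value can be pictured as nothing more than nested sorted maps. For illustration only, the following minimal Java sketch models a sparse, sorted, versioned table in this spirit; the class and method names are assumptions made here for exposition and are not taken from BigTable's actual implementation.

import java.util.Collections;
import java.util.TreeMap;

// Illustrative only: a sparse table as nested sorted maps,
// keyed by row key -> "family:qualifier" -> timestamp -> value.
public class SparseTableSketch {
    // Rows kept in lexicographic order of the row key.
    private final TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> rows =
            new TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>>();

    public void put(String rowKey, String columnKey, long timestamp, byte[] value) {
        TreeMap<String, TreeMap<Long, byte[]>> row = rows.get(rowKey);
        if (row == null) {
            row = new TreeMap<String, TreeMap<Long, byte[]>>();
            rows.put(rowKey, row);
        }
        TreeMap<Long, byte[]> versions = row.get(columnKey);
        if (versions == null) {
            // Newest timestamp first, as in BigTable.
            versions = new TreeMap<Long, byte[]>(Collections.<Long>reverseOrder());
            row.put(columnKey, versions);
        }
        versions.put(timestamp, value);
    }

    // Returns the most recent value of a cell, or null if the cell is absent.
    public byte[] getLatest(String rowKey, String columnKey) {
        TreeMap<String, TreeMap<Long, byte[]>> row = rows.get(rowKey);
        if (row == null) return null;
        TreeMap<Long, byte[]> versions = row.get(columnKey);
        if (versions == null) return null;
        return versions.firstEntry().getValue();   // newest first
    }
}

A call such as put("com.cnn.www", "anchor:cnnsi.com", 9L, "CNN".getBytes()) then corresponds to the cell shown in Figure 2.1 below; a row that never receives a column simply has no entry for it, which is what makes the table sparse.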
The following diagram, redrawn from the original paper [3] illustrates
the structure of a single row in BigTable.
Figure 2.1: Example table storing web pages. It is redrawn from the original BigTable paper [3]. The diagram contains 2 column families, namely ‘contents’ and ‘anchor’. ‘contents’ has a single column while ‘anchor’ has 2. While the ‘anchor’ column values each have a single timestamp (t8 and t9), ‘contents’ has 3 timestamps for a single value (t3, t5 and t6), where t3 is the oldest and t6 is the most recent value. The next row can have a different number of columns for these two column families.
As already mentioned, Google uses the distributed GFS to store all data
and maintain log records. Internally however, an immutable file format called
SSTable is used to store the BigTable tablets (data). It is a sequence of blocks, typically 64 KB in size. An index is stored at the end of each SSTable to locate its blocks.
A BigTable realization comprises three major constituents: a master server, several tablet servers and a library attached to every client machine. The master server assigns tablets to the various tablet servers, performs load balancing and garbage collection, and detects any alterations in the tablet servers. Tablet servers manage the tablets assigned to them, including reads and writes by the client. Additionally, they are responsible for partitioning tablets that have exceeded their size limit.
BigTable uses Chubby [6] as a locking service for synchronization, tablet location information, tablet server expirations, storing schema information and so forth. A three-tier, B+ tree-like structure is used to store tablet location information. The first level is a file stored in Chubby which holds the location of the root tablet; the root tablet in turn holds the locations of all other tablets in special METADATA tablets. Each METADATA tablet stores the locations of the user tablets, including a list of SSTables. SSTables are loaded into memory, using their index, into a memtable. All updates are also made to a memtable. When its size reaches a threshold due to updates, a new memtable is created; the old memtable is turned into an SSTable and sent to the GFS. This is termed a “minor compaction”. Minor compactions create new SSTables, so several of them accumulate over time. Therefore, to curb the proliferation of SSTables, another merge operation, called a “major compaction”, is performed at regular intervals. This involves rewriting all existing SSTables into a single one.
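The memtable/minor-compaction cycle described above can be summarized in a few lines. The sketch below is a deliberately simplified Java illustration of the idea only; the threshold value, class names and the flushToFileSystem method are hypothetical placeholders, not part of BigTable.

import java.util.SortedMap;
import java.util.TreeMap;

// Simplified illustration of a memtable that is frozen and written out
// as an immutable "SSTable" once it grows past a threshold (a minor compaction).
public class MemtableSketch {
    private static final int FLUSH_THRESHOLD = 100_000;  // hypothetical entry count
    private TreeMap<String, byte[]> memtable = new TreeMap<String, byte[]>();

    public void put(String key, byte[] value) {
        memtable.put(key, value);
        if (memtable.size() >= FLUSH_THRESHOLD) {
            minorCompaction();
        }
    }

    private void minorCompaction() {
        SortedMap<String, byte[]> frozen = memtable;  // old memtable becomes immutable
        memtable = new TreeMap<String, byte[]>();     // new memtable receives further updates
        flushToFileSystem(frozen);                    // written out in sorted key order
    }

    private void flushToFileSystem(SortedMap<String, byte[]> sstable) {
        // In BigTable this step would create an SSTable in GFS; omitted here.
    }
}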
2.2 HBase
HBase is an open source BigTable-like structured storage built on the
Hadoop Distributed File System (HDFS) [12]. Source [7] defines HBase as “an
open-source, distributed, versioned, column-oriented store modelled after
Google’s BigTable”. Here too, a table is “sparse” in that rows in the same table
can have a variable number of columns. The rows again are lexicographically
sorted on a unique row-key. It is a multi-dimensional map like BigTable, with
the data being identified by the 3 dimensions namely, row, column and
timestamp. A row contains only those columns which hold some data value; no
NULL values are used. Columns like in BigTable are grouped together to
constitute column families and are denoted by a column qualifier or label. A
column therefore needs to be identified by the <family:qualifier> notation.
Figure 2.2 below illustrates rows and columns. It is a JSON example created
based on examples from source [8].
{
  "aaaaa" : {
    "A" : {
      "foo" : "y",
      "bar" : "d"
    }
  },
  "aaaab" : {
    "A" : {
      "check" : "world"
    },
    "B" : {
      "test" : "ocean"
    }
  }
}
Figure 2.2: To illustrate the concept of ‘rows’, ‘column families’ and
‘columns’ in BigTable. Drawn based on examples in source [8]
This figure clearly explains the arrangement of rows, column families
and columns in HBase (and BigTable). Here, ’aaaaa’ and ‘aaaab’ are the two
rows in an HBase table arranged in ascending lexicographic order. The table
contains 2 column families: ‘A’ and ‘B’. Note that column families in a table
are usually static unlike the columns constituting them. Therefore, the first row
‘aaaaa’ has 2 columns from only 1 family, A:foo and A:bar, whereas the second
row ‘aaaab’ has 2 very different columns belonging to 2 different families,
A:check and B:test. Each of these data values can also have several versions as
stated earlier, thus allowing the database to store historical data as well. This
can be illustrated using JSON as shown below.
"aaaaa" : {
"A" : {
"foo" : {
20 : "y",
8 : "x"
"bar" : {
16 : "d"
},
}
Figure 2.3: To illustrate timestamps. Drawn based on example in source [8].
The figure above illustrates the use of timestamps in HBase/BigTable.
The most recent data is stored first. For instance, to access the data ‘y’ (most
recent value) HBase will use the path “aaaaa/A:foo/20” while “aaaaa/A:foo/8”
for ‘x’. Also, when responding to a query HBase accesses the timestamp that is
“less than or equal to” the queried time. For instance, if we query for timestamp 10, we will receive the cell value ‘x’, since its timestamp (8) is the latest one that is less than or equal to 10.
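This "latest timestamp less than or equal to the query" rule maps directly onto a floor lookup over a sorted map of versions. The small Java sketch below reproduces the example above; it is illustrative only and does not use the real HBase API.

import java.util.Map;
import java.util.TreeMap;

public class VersionLookupSketch {
    public static void main(String[] args) {
        // Versions of cell aaaaa/A:foo, keyed by timestamp.
        TreeMap<Long, String> versions = new TreeMap<Long, String>();
        versions.put(20L, "y");
        versions.put(8L, "x");

        // Query: the latest version whose timestamp is <= 10.
        Map.Entry<Long, String> hit = versions.floorEntry(10L);
        System.out.println(hit.getValue());                     // prints "x"

        // Query: the most recent version overall.
        System.out.println(versions.lastEntry().getValue());    // prints "y"
    }
}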
An HBase table comprises several regions, each of which is marked by a ‘startKey’ and ‘endKey’. Regions are made up of several HDFS blocks. There are two types of nodes, namely the Master server and the Region servers, serving numerous client machines. These servers are similar to the master and tablet servers in BigTable. The Master server monitors the region servers as well as assigns and balances load on them. Region servers hold multiple regions. Contrary to a
Chubby lock service in BigTable, HBase uses ZooKeeper [9], a centralized
service, for distributed synchronization. It has an extremely simple interface
that is itself distributed and highly reliable. The clients connect to a specific
cluster by seeking information from the ZooKeeper since it holds the locations
of all Region servers hosting the root locations of all tables.
HBase uses an internal file format called HFile [11], analogous to
BigTable’s SSTable. It uses a 64 KB block size; each block contains data and is identified by a block magic number.
HBase, like BigTable, is extremely efficient when managing huge amounts of data, on the order of petabytes, over an equally large number of machines distributed across the globe. It allows data replication for reliability, availability and fault tolerance. It also facilitates very fast distributed reads and writes on the data.
2.3 The Google File System
The Google File System (GFS) [5] is a proprietary, scalable, distributed file system designed specifically for large, distributed, data-intensive applications. It is fault-tolerant and reliable, providing high aggregate performance to its clients. The GFS design is primarily motivated by observations of the technological environment and of the application workloads, in which component failures are inevitable. The file system runs on
thousands of inexpensive, commodity Linux systems and is accessed by an
equivalent number of client machines. Unlike many file systems, it is not built
into the OS kernel, but supported as a library.
GFS is simple and provides the users with the basic file commands like
open, close, create, read, write, append and snapshot. Append and snapshot are
special commands; while append allows multiple clients to add information to
files (even concurrently) without overwriting existing data, snapshot creates a
copy of a file/directory tree at minimal system cost.
Google organizes its resources into distributed clusters of computers, with each cluster comprising thousands of machines classified as a master server, chunk servers and client machines. Client files tend to be very large (of the order of multiple GB), so they are divided into fixed-size chunks of 64 MB each and stored on various chunk servers. For reliability, chunks are replicated on multiple chunk servers, with a default of 3 replicas. At the time of creation, each chunk is assigned a globally unique 64-bit chunk handle. The master acts as the cluster coordinator: it maintains an operation log for its cluster and stores all file system metadata, including namespaces, access control information, mappings of files to chunks and current chunk locations.
The master server does not persistently store any chunk location information; instead, upon start-up, it polls the chunk servers, which respond with the chunks they hold. It also communicates periodically with the chunk servers via HeartBeat messages to give instructions and collect their state.
Figure 2.4: GFS architecture. Redrawn from the original GFS paper [5].
The client code is linked into each application (Figure 2.4 above) and it
communicates with the master and chunk servers to read/write data. Figure 2.4
illustrates the architecture in terms of a single read request. The application
sends a filename and byte offset to the client, which converts this information
into a chunk index and sends it (along with filename) to the master. The master
replies with the corresponding chunk handle and replica locations. The client
then sends a request to the closest replica. Also, it caches the chunk replica
locations so that future interactions need not involve the master.
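The client-side translation from a byte offset to a chunk index is simple fixed-size arithmetic. The following Java sketch illustrates it under the assumptions stated above (64 MB chunks); the class and method names are hypothetical and not taken from GFS.

// Illustrative only: translating a (file name, byte offset) read request
// into the (file name, chunk index) pair sent to the GFS master.
public class ChunkIndexSketch {
    private static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB chunks

    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    public static void main(String[] args) {
        // A read at byte offset 200 MB falls into the fourth chunk (index 3).
        System.out.println(chunkIndex(200L * 1024 * 1024));   // prints 3
    }
}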
All metadata on the master are stored as in-memory data structures and
hence master operations are fast and efficient. The operation log mentioned earlier is critical to the GFS in that it contains all vital changes to the system metadata. The system is designed to involve the master minimally in all operations. To this end, the master grants a lease to one of the replicas and calls it the primary replica (chunk server) for an initial duration of 60 seconds.
All mutations (alterations to file content and/or namespace) are now managed
by the primary, including secondary replica management.
Another crucial feature of the GFS is garbage collection. This
mechanism is unique in that the physical storage released by a file deletion is not reclaimed immediately. Instead, the file is renamed with a special (hidden) name along with a timestamp. The master performs scans at scheduled times, during which it permanently deletes all ‘hidden’ files whose timestamps show that they have existed for more than 3 days.
GFS uses a very important principle of autonomic computing [45, 46],
which means that a system can detect and correct its problems without any
human intervention. It incorporates ‘stale replica detection’ (where, using the replica timestamp, the master can identify outdated replicas) and various fault-tolerance techniques such as fast recovery (all servers can restart and restore a stable state in seconds, irrespective of how they terminated), chunk replication, and master replication (copies of the master are maintained, including ‘shadow masters’ – slightly outdated read-only master replicas).
The Google File System is structured in a manner that allows machines as well as hardware memory to be upgraded with ease, making it truly scalable. This background is valuable since it introduces the concepts of in-memory data structures, fault tolerance and security techniques.
2.4 Hadoop Distributed File System
HDFS [12] is similar to the GFS [5] and partitions the large data files
into fixed sized blocks called chunks and stores them across several machines in
the cluster. It is designed to handle hardware failures and network congestion in
a robust manner. It uses a large number of inexpensive commodity systems to
construct the distributed cluster. It is fault tolerant, reliable and highly scalable.
However, the design is restricted to a specific type of application. It is assumed that files stored in HDFS are written only once and then accessed mostly through sequential streaming reads, with infrequent updates.
An HDFS cluster comprises a NameNode connected to numerous DataNodes and client machines. This is analogous to the master and chunk servers in GFS. The NameNode stores all the metadata information like
namespace, file to chunk mappings etc. and also controls the DataNodes. All
metadata is stored in-memory to facilitate faster access. The NameNode is
accessed by a client to retrieve the location information of all chunks
constituting the file required by it. This also includes the locations of all chunk
replicas, created for greater reliability and fault-tolerance. The client then
selects a DataNode nearest to it to start its operations.
DataNodes, like the GFS chunk servers, store the actual data chunks
(blocks), with each chunk being replicated thrice by default. Also, replicas are
housed on different machines, preferably on separate racks in the cluster.
Moreover, apart from data replication, the NameNode is also copied so as to
save the metadata in the event of a failure.
2.5 Data Retrieval in a Cluster Environment: MapReduce
Querying and data retrieval are an integral part of any database system and involve complex processing. As the amount of data increases, so does the processing complexity required to maintain a reasonably good response time. Distributed database systems have the advantage of being able to exploit parallelism to achieve this. A programming model available for exploiting parallelism in both distributed and multi-core systems is MapReduce. The advantage of this model
is that it abstracts away the complex parallel implementation from the
programmer and yet achieves large scale parallelism. The programmer typically
is involved in expressing the problem at hand as a functional programming
model. Once this is done, the MapReduce runtime environment automatically
parallelises it.
Google MapReduce [14] is a generic programming framework for
processing and generating very large datasets in a cluster computing
environment. The primary advantage of this paradigm is the simplicity it
provides to a programmer; abstracting the underlying complexities and allowing
the programmer to express the computation in a functional style. This
implementation is highly scalable and easy to use, capable of processing
terabytes of data across thousands of machines.
It requires programmers to specify two functions: map and reduce. Both
functions accept key/value pairs as input. The map function processes the input
key/value pair and generates an output consisting of a list of intermediate
key/value pairs. The reduce function reads the sorted output of map and merges
all intermediate values for a particular key to produce output for each unique
key. Apart from these user-defined functions, it also has a runtime environment
that manages data partitioning, scheduling, fault tolerance and automatic
parallelism; all abstracted from the programmer. It uses GFS [5] as the
underlying file system.
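To make the shape of these two functions concrete, the following Java sketch expresses a classic word count as a map and a reduce function over key/value pairs. It is a pedagogical illustration only and deliberately does not use the real Google or Hadoop MapReduce APIs; the class and method names are assumptions.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// Word count expressed as map and reduce over key/value pairs (illustrative only).
public class WordCountSketch {

    // map: (docId, line) -> list of intermediate (word, 1) pairs
    static List<Entry<String, Integer>> map(String docId, String line) {
        List<Entry<String, Integer>> out = new ArrayList<Entry<String, Integer>>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    // reduce: (word, [1, 1, ...]) -> (word, total) for each unique key
    static Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return new SimpleEntry<String, Integer>(word, total);
    }
}

The runtime environment, not shown here, is what partitions the input, runs many map invocations in parallel, groups and sorts the intermediate pairs by key, and feeds each group to reduce.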
The open source counterpart of Google MapReduce is Hadoop
MapReduce [16], also a framework for processing huge amounts of data on large
clusters of machines. This is based on the HDFS [12].
MapReduce implementation on multi-core systems is slightly different
from that of distributed systems, although the underlying principle remains the
same. A model called Phoenix, developed by Ranger et al. [13] and another,
MR-J developed by Kovoor et al. [17] are examples of MapReduce
architectures for shared memory multi-core systems.
2.6 In-Memory Database Systems
An In-Memory Database (IMDB) [18, 19] system, also known as a Main Memory Database (MMDB), is a database management system that stores and manipulates its data in main memory, eliminating disk access, unlike most database systems, which use the disk for persistent storage.
The conventional disk-resident database (DRDB) systems support all the
ACID (Atomicity, Consistency, Isolation and Durability) properties. Database
transactions (operations) can fail due to various hardware and software
problems. The ACID properties ensure that these transactions are processed
reliably, that is, even in the event of a failure the data stored will be consistent
and reliable [38]. However, the DRDB systems have limitations in terms of
their response time and throughput. Caching the disk data into memory for
faster access does not completely eliminate disk accesses. Such accesses reduce
the throughput while increasing the response time, thus rendering the system
unsuitable for time-critical (hard real-time) applications.
On the other hand, IMDB systems were primarily designed to cater to
time-critical applications by achieving very low response time and high
throughput. They are faster because their performance is not dependent on disk
I/Os. The data structures employed are also optimized to gain maximum
performance benefits. Moreover, they usually have a strict memory-based
architecture, which implies that data is stored and manipulated from memory in
exactly the same form, in which it is used by the application. This completely
eliminates all overheads associated with data translations as well as caching.
This also results in minimal CPU usage. Another advantage of IMDB systems is that they can provide multiple users concurrent access to shared data with consistent performance.
The main memory is volatile. This makes IMDBs appear to lack the durability property of ACID in the case of a power failure or server crash. Durability can be provided by any of the following mechanisms:
1. Creating checkpoints or snapshots. These periodically record the database state and provide the required persistence. However, in the event of a system failure, the modifications made after the most recent checkpoint will be lost, so this provides only partial durability.
2. Combining checkpoints with transaction logging. A transaction log records all modifications to a log/journal file, which enables complete system recovery (a minimal sketch of this idea follows this list).
3. Using a non-volatile RAM or an EEPROM (Electrically Erasable Programmable Read-Only Memory).
4. Maintaining a disk backup.
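As indicated in mechanism 2 above, a transaction log is conceptually just an append-only record of every modification, replayed on restart on top of the last checkpoint. The minimal Java sketch below captures that idea for illustration; the file format and class name are arbitrary assumptions, not part of any particular IMDB product.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Illustrative write-ahead log: every update is appended to a journal file
// before it is applied to the in-memory store, so it can be replayed after a crash.
public class WriteAheadLogSketch {
    private final PrintWriter journal;

    public WriteAheadLogSketch(String journalPath) throws IOException {
        this.journal = new PrintWriter(new FileWriter(journalPath, true));  // append mode
    }

    public void logPut(String key, String value) {
        journal.println("PUT\t" + key + "\t" + value);
        journal.flush();   // flush before acknowledging; a real system would also sync to disk
    }

    public void logDelete(String key) {
        journal.println("DEL\t" + key);
        journal.flush();
    }

    public void close() {
        journal.close();
    }
}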
Another disadvantage is the limited storage available to these systems, since all data is stored only in main memory, which has far less capacity than a disk. However, IMDBs are primarily used for performance-critical embedded systems, which are usually devices that require applications and data to have a small footprint (size/memory requirement), so the limited storage of a memory-resident database is rarely an issue. Moreover, when IMDBs are used for systems handling large datasets, virtual memory usually comes into play to hold the excess data.
They are extremely important for this research because of their low
response time and high throughput. Designing a system with the lowest possible
response time and optimal memory usage is one of the basic objectives of this
project.
2.7 Cache-Oblivious Data Structures
Modern computers have a multi-level storage hierarchy that includes CPU registers, several levels of cache, main memory and disk, with data moving between the processor registers and the rest of the hierarchy. Figure 2.5 below illustrates this hierarchy.
Figure 2.5: The memory hierarchy. Redrawn from source [34].
As the memory levels move further from the CPU, their access times as well as their storage capacities increase. In fact, there is a sharp rise in both as we move from main memory to disk. This implies that, for any algorithm executing on such a system, the cost of a memory access (and hence system performance) depends entirely on the storage level at which the element being accessed currently resides. Moreover, data travels between the levels of the hierarchy in blocks of a certain size, and different caches have different block sizes. The way an algorithm accesses memory therefore has a major impact on its actual execution time, and to achieve optimal performance its design should take the above storage hierarchy characteristics, especially the cache, into consideration. Algorithms are normally analyzed in a model that overlooks the existence of the cache between CPU and RAM (illustrated in Figure 2.6) and assumes that all memory accesses take the same amount of time. In practice this is not so, and data structures and algorithms that exploit the cache suitably can therefore achieve very high performance.
Figure 2.6: The RAM model. Redrawn from source [34].
Data structures and algorithms that are cache-aware [23] do just this. They contain parameters that can be tuned to gain optimal performance for a specific cache size. This advantage in turn creates a problem: they either need to be tuned for every system (each with a different cache size) to perform well, or they perform well only on the systems for which they were tuned and not so well on others. This behaviour is not an attractive one.
Caches in general exploit two basic principles of locality, namely temporal and spatial locality [23]. Temporal locality states that a program which uses a particular piece of data is likely to use the same data again in the near future. Spatial locality states that a program which uses a particular piece of data is likely to use adjacent data in the near future. Any optimal cache-aware algorithm should therefore try to exploit both of these properties.
Harald Prokop introduced the concept of cache-oblivious algorithms in his 1999 master's thesis [27], work that was later published by Frigo et al. [20]. This arrived as a solution to the cache-aware problem: cache-oblivious designs also exploit the cache, but without requiring tuning to achieve optimal performance. They work well for all cache block sizes because the algorithm is optimized for one unknown memory level, which automatically optimizes it for all levels.
The basic idea is to split a dataset recursively so that, at some point, a single portion (split section) of the dataset is small enough to fit into the cache and fills at least half of it. This minimizes cache misses, and it also removes the need to know the cache block size. The data structures are designed so that a dataset, irrespective of its size, is split appropriately to make good use of caches of all sizes.
The memory model suggested by Prokop [27] considers an infinitely large external memory and an internal memory acting as a cache of size M. Data moves between the two in blocks of size B. The algorithm cannot control the cache, in that it does not explicitly manage the movement of data blocks between the two storage levels; it assumes the existence of a cache manager. This restriction arises because the values of M and B are unknown and hence cannot be manipulated directly. A fixed page replacement policy is used and the cache is assumed to be ideal [27], meaning that it is fully associative and the page replacement strategy is optimal. The cache is also assumed to be tall [27]. A tall cache is one in which the number of blocks it holds (M / B) is much greater than the size of a single block (B). This assumption is represented by the following equation:
M = Ω(B²) ................................................ eqn. (1) [20, 30]
This constraint gives cache-oblivious algorithms a large pool of values from which to guess the block size (B).
Demaine [30] in his paper introduces the various cache-oblivious
algorithms and data structures available, explaining the techniques behind those
designs. Also, Bender et al. [29] proposed a design for a cache-oblivious B-Tree,
which was later simplified by Wu et al. [28] while still preserving cache locality.
All these designs make efficient use of the cache. Olsen and Skov [26] also
analyzed and examined two cache-oblivious priority queues and designed an
optimal cache-oblivious priority deque based on one of the priority queues.
Also, in 2005, Bender, Fineman, Gilbert and Kuszmaul [31] proposed 3
different concurrent cache-oblivious algorithms that they proved made efficient
use of the cache.
A very important aspect of this research is therefore, to analyse these
data structures in order to identify a suitable data structure for our purpose.
3 System Design
This research project entails developing a subset of a Google BigTable-like database for shared memory multi-core systems. The implementation is a simplified, in-memory [18, 19] version of the database, intended to achieve performance benefits such as speed-up, as well as scalability, in a multi-core environment. It involves creating a data structure based on the concurrent
cache-oblivious B-Tree design proposed by Bender et al. [31]. Another vital
aspect of the research is to perform data retrieval operations in parallel and then
evaluate the efficiency and usability of the design. This is crucial as it will help
us assess the suitability of a multi-core environment for such huge distributed
database systems.
The functional and architectural details of the implemented system will be
discussed in the following sub sections.
3.1 System Overview
The design for this database system consists of two main parts: the
underlying data structure that holds the data and a set of operations to query the
database with. The data structure resides in memory and is based on a
cache-oblivious design. The disk is used only to load the database and to store a
backup of the data to ensure durability. The cache-oblivious model is primarily
based on the Packed-Memory Concurrent Cache-Oblivious B-Tree model
proposed by Bender et al. in 2005 [31], which contains both lock-based and
lock-free versions of the structure for concurrency control. It therefore ensures
that the data can be accessed concurrently. Moreover, B-Trees minimise the
number of block accesses, which is critical to this design given its
in-memory, cache-oblivious nature.
An important point to note here is that all data needs to be stored in the
key-value format and sorted on a unique key. As mentioned in the
previous section, every BigTable data item is identified by a unique combination key
(row, column, and timestamp). It is therefore essential for the data structure of
this implementation to support this and yield good performance.
The operations include a set of functions to create and manipulate the
data structure, map newly appended data to the appropriate location, provide
thread safety (for concurrency), and so forth. The retrieval operations,
consisting primarily of search queries, are parallel in nature. These functions are
designed in accordance with Google's API model.
3.2 System Model
B-Trees have been one of the most predominant data structures for keeping
data sorted while allowing insertions, deletions, searches and sequential reads with very
low response time. A B-Tree is a generalized binary search tree [33], optimized to
handle large data sets. The Packed-Memory Concurrent Cache-Oblivious B-Tree
model [31] consists of two structures combined into one: a static cache-oblivious
binary tree [27] and a packed memory data structure [29].
The static cache-oblivious binary tree is a static binary tree based on the
van Emde Boas (cache-oblivious) layout [29]. The nodes of the tree can be
traversed in O(1 + log_B N) memory transfers and the layout is hence asymptotically
optimal [29].
The packed memory data structure is ‘one-way packed’ and stores the
data in sorted order in a loosely packed array. It is said to be loosely packed
since the elements are stored with gaps in between to allow for insertions
and deletions. One-way packing allows concurrency to be supported. The array
is divided into ‘sections’, with gaps within each section to allow insertions, as
mentioned above.
The combined structure is a binary tree, sorted on the combined key
(row key, column key, timestamp), where each node contains both key and data.
The leaves, however, correspond to sections in the packed memory array:
each leaf maps onto the first element of its section. The other nodes
are not stored in this array and can be accessed directly from the tree. However,
since the tree is static, any new insertions will result in accessing the
appropriate section of the packed array. Thus, the packed array is primarily
designed to support insertions, deletions, as well as search queries on the newly
added data. The figure below illustrates the design used for the database system.
Figure 3.1: Data structure design. It is a modified design based on the work
of Bender et al. [31].
The binary tree above contains the combination-key (of row, column,
and timestamp) as its nodes, as shown for the root node ‘55’. Thus, each
number in the nodes (like 55) here is used to represent the combination-key.
The tree is created only once and hence static. The actual data is stored in the
tree and for the leaves, in the packed array as well (below the tree). This array
contains in addition to the data, the combination-key as well. Thus each element
in the array is a complex data type comprising of a key and its data, as
illustrated using the key value of ‘44’. The array is divided into sections,
marked by the bold black boundaries in the diagram. Each section contains a leaf and some
gaps for insertions. For example, the first section contains only one key-value
pair, with key 1. The remaining array locations in that section are
empty. The maximum key value this section can possibly hold is 8 (shown in
grey). The next key (9) belongs to the next section, which can hold keys up to a maximum
of 21 (shown in grey), and so on. Also, the leaves of the tree map onto the first
element of every section as shown above. Thus the packed array allows for
quick insertions into the database where the static tree acts as an index.
Figure 3.1 above is the data structure based on the model described
earlier. This structure can be effectively used to store the data in the key/value
format, as required for this research. The choice of using the binary tree is
important, since it ensures that the data elements added are stored in sorted
order. Basically, this structure ensures that every operation (delete/search)
results in traversing the tree and performing that operation on the tree itself,
provided the data is available there. It also ensures quick execution, due to the
binary layout. However, if the data is not available in the tree, traversing it
results in locating an appropriate leaf, which then directly maps onto the
corresponding section of the packed array. A linear search or a binary search
can then be carried out within the section to locate the desired element. Also,
since this array is loosely packed (contains gaps), insertion operations to the
database, always result in data getting added to the packed array, within the
gaps of a particular section.
The structure is also beneficial in terms of performance. It remains in
main memory throughout, which keeps response times low. Moreover, the mapping
from key to value, with both held in memory, adds to this speed.
3.3 Data Model
As stated earlier, the data in this implementation conforms to the Google
Datastore data model; in that it is stored in a key-value format where each data
item is identified by a unique combination of three keys (row key, column key,
and timestamp). To facilitate this, a unique row id (row key) is generated by the
system for every data element, belonging to a new row. All data items
belonging to a particular row have the same row id. Since a single row contains
one or more columns, the combination key (row key, column key, timestamp)
always remains unique for every data element in the database. The exact
combination key used for this implementation is a dot (.) separated string of
column name, row key and timestamp respectively. Also, all data, like in the
Google system, are stored as strings. For instance, to store a value 50 that
represents the ‘age’ of an employee, we create a combo key, AGE.R20.111100,
assuming here that this value is the 20th entry in the database (hence row is 20).
The combo key is based on the format (COL_NAME.ROW_ID.TIMESTAMP).
Figure 3.2 below illustrates this model using a sample column-oriented
database containing four to five columns: employee id, first name, an optional middle
name, last name and age. It may be recalled that the main advantage of a
column store over a traditional RDBMS is the flexibility it gives to each row; a
row can contain a variable number of different columns and hence need not store
NULL values unnecessarily (in those cells which do not have an appropriate
value). In the sample database of figure 3.2, the column E_MNAME (employee
middle name) is one such optional attribute, so every row of a column-oriented
database is not required to hold that column. The implementation of
this project allows the same flexibility by using key-value pairs to store the
data.
Figure 3.2: Representation of the data model. Diagram illustrating the key-value pairs and the format of the unique key used to identify a value. Here,
E_MNAME is present only in Row 1.
As mentioned earlier, this data representation allows for the creation of a
typical columnar database, since each row can have a variable number of
different columns, as illustrated above.
To ensure efficient operations and performance benefits on this database,
the choice of an appropriate key is extremely crucial. This is due to the fact that,
in this implementation all data will be sorted and arranged in a static binary tree
(described earlier) based on this unique combo key. This also implies that all
operations including retrievals will depend primarily on the position of these
keys in the binary tree. To allow quick data access and retrieval, the
COL_NAME.ROW_ID.TIMESTAMP format is followed. This format ensures
that the sorting of the keys is based on the column names and thus, all values
(across all rows) belonging to the same column will be grouped together (stored
as a sub-tree) in the binary tree. This kind of columnar locality gives the data
structure the advantage typical of a column-oriented store, since data stored in a
column-oriented manner allows operations like group-by and other aggregations
to be extremely fast.
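To make the key format concrete, the following small sketch (illustrative only; the class, method and variable names are not taken from the project code) shows how a combo key in the COL_NAME.ROW_ID.TIMESTAMP format could be assembled:

public class ComboKeyExample {
    // Hypothetical helper: builds a combo key in the COL_NAME.ROW_ID.TIMESTAMP format.
    static String comboKey(String columnName, long rowId, long timestamp) {
        return columnName + ".R" + rowId + "." + timestamp;
    }

    public static void main(String[] args) {
        // Reproduces the example from the text: the 'age' value of the 20th row.
        System.out.println(comboKey("AGE", 20, 111100));   // prints AGE.R20.111100
    }
}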
3.4 Data Manipulation and Retrieval
The operations supported by the database are insertion (data
append), deletion and search queries. As in BigTable, there are no random data
write operations. A set of system functions is also designed to handle
background operations. These include an array rebalancing operation, which is
mandatory for data deletions and optional in the case of insertions to the data
structure. Rebalancing is the re-arrangement of the packed array to adjust the
element density in its sections.
The most important runtime system operation is to execute user
queries in parallel. Since not all queries are completely parallelisable, they
need to be written in a manner that makes maximum use of the
processor cores. In this project, however, a single query is designed,
parallelised and executed on several multi-core systems for performance
evaluation.
4 Data Structure Implementation
This section will focus on the implementation of the underlying data
structure that stores the database keys and their corresponding values, as
described in the last section. All implementation is done using Java 7 on Eclipse
Helios. The data structure, as mentioned earlier is composed of two parts; a
static binary tree based on the van Emde Boas layout [29] and a packed array.
The database is in-memory and is loaded only once from a file. The file
data is used to create a sorted binary tree, which is also balanced and then made
complete (all nodes have exactly two children). This tree is traversed in the van
Emde Boas manner to create a cache-oblivious array. The leaves of this tree are
then mapped onto a loosely packed array (with gaps) to allow for insertions.
The following sub sections will provide a detailed description of the
implementation of these structures.
Figure 4.1: Steps to create the data structure.
The steps carried out to create the entire data structure are illustrated in
the diagram above. Implementation of each step is described in detail in the
following subsections.
4.1 Static Binary Tree
A binary tree is composed of numerous nodes, which in this
implementation are instances of a class called TreeNode. In general, each node of a binary
tree consists of data and links to its left and right children. Here too,
each TreeNode consists of a data part and links to its two children.
The data in turn comprises a combo key (of the form
column_name.row_key.timestamp) and its corresponding value. The links to the
left and right child are not implemented as separate references; instead an array link of size
2 is used, where element 0 stores the left child and element 1 the
right child. The class definition is as follows:
Listing 1: Class definition of a node in the implementation of a binary tree.
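A minimal sketch of such a node class is given below. The field and accessor names follow those used in the later listings (key, item, link, veb_index), but this is an illustrative reconstruction and the exact definition in the project code may differ.

public class TreeNode {
    private String key;       // combo key of the form column_name.row_key.timestamp
    private String item;      // the value associated with the key
    private long veb_index;   // position of this node in the vEB array (set during traversal)
    public TreeNode[] link = new TreeNode[2];   // link[0] = left child, link[1] = right child

    public TreeNode(String key, String item) {
        this.key = key;
        this.item = item;
    }

    public String getKey()           { return key; }
    public String getItem()          { return item; }
    public long getVeb_index()       { return veb_index; }
    public void setVeb_index(long i) { this.veb_index = i; }
}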
A Full (Complete) Binary Tree: A full (complete) binary tree is one in which every
internal node has exactly 2 child nodes. This implies that the total number of
nodes for a complete binary tree of height h is always fixed and can therefore be
calculated. The binary tree required for this implementation, as discussed earlier,
cannot be modified once created (hence static). Also, the database is an
in-memory one, which means all updates to it should be handled by the data
structure. Therefore, the structure itself should possess the ability to allow such
updates to happen with minimal rebalancing and adjustment. To cater to this
requirement, it is essential, once the tree is populated with the key-value pairs,
to check for its completeness and to enforce it if the tree is found to be incomplete.
Completeness is enforced by adding the missing nodes in the form of empty
(zero data) leaves. Creating a complete binary tree facilitates the creation of a
packed memory array that allocates space to store key-value pairs for all the leaf
nodes, including the empty ones. This allows a large number of insertions to
happen into the database without the need to rebalance frequently. Figure 4.2
below illustrates this concept.
Figure 4.2: A complete Binary tree. The number of key-value pairs is 11,
which creates a binary tree of height 4. However, for a tree of height 4, the
total number of nodes should be 15. Hence the tree is made complete by
adding the missing nodes in the form of empty (zero data) leaves,
illustrated above by the black small circles. This allows space to be
allocated for these empty leaf nodes in the packed array structure (drawn
below the tree) where new data can be inserted.
Algorithm to Calculate the Number of Nodes and Create a Full Binary Tree:
Calculating the number of nodes in a binary tree is quite simple. The number of
nodes N_i at level i is always twice the number of nodes N_(i-1) at the previous
level. For example, in the above diagram, the number of nodes at level 1
(the root level) is 1, at the next level (2) it is 2, at the next 4, and
so forth. This relation is used to compute the total number of nodes that should
ideally exist in a tree for that tree to become full. The existing number of nodes
is obtained by keeping a count of the items read from the file while constructing
the tree. Comparing the two counts allows us to create a complete binary tree by calculating
both the number and the positions of the missing leaves and inserting empty nodes there.
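A minimal sketch of this calculation is shown below (illustrative only, assuming the height is counted from 1 at the root, as in Figure 4.2; the class and method names are hypothetical):

public class FullTreeCount {
    // Total number of nodes a full binary tree of height 'height' must contain: 2^height - 1.
    static long fullTreeNodeCount(int height) {
        return (1L << height) - 1;
    }

    // Number of empty (zero data) leaves that must be added to make the tree full.
    static long missingNodes(int height, long existingNodes) {
        return fullTreeNodeCount(height) - existingNodes;
    }

    public static void main(String[] args) {
        System.out.println(fullTreeNodeCount(4));   // 15, as in Figure 4.2
        System.out.println(missingNodes(4, 11));    // 4 empty leaves to add for 11 key-value pairs
    }
}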
Algorithm to Create a Sorted Tree: The tree is sorted by reading the nodes in-order
and storing them temporarily in an array. This sorted array is then used to
create the sorted, balanced and complete binary tree, which is traversed
recursively following the van Emde Boas layout [29] to create the cache-oblivious
array. Listing 2 below shows the recursive in-order traversal of nodes used
to obtain a sorted binary tree.
Listing 2: Implementation of a sorted tree.
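Such an in-order traversal can be sketched as follows (illustrative only, assuming the TreeNode class of Listing 1; the method name and the caller-supplied list are hypothetical):

// Collects the nodes in sorted key order into a caller-supplied list.
static void inOrder(TreeNode node, java.util.List<TreeNode> sorted) {
    if (node == null) {
        return;
    }
    inOrder(node.link[0], sorted);   // left subtree first
    sorted.add(node);                // then the node itself
    inOrder(node.link[1], sorted);   // right subtree last
}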
Algorithm to Create the vEB Array: The van Emde Boas (vEB) technique lays
out a balanced and complete binary tree in memory recursively. Let us assume
that we have a binary tree of height h and size (number of nodes) N, where h is
a power of 2. In order to traverse this tree using the vEB layout, we divide it
into two sections, each of height h/2. The top half of the tree is a single
subtree with the same root as the whole tree and has approximately √N nodes, whereas the bottom
half contains 2^(h/2) subtrees, each with approximately √N nodes. When the height
h is not a power of 2, the bottom half is selected such that its height is a power
of 2. Figure 4.3 below illustrates this concept using a balanced and full
(complete) binary tree of height 5. The basic idea is to first layout the top half
recursively and then the bottom half, with each half being laid out in order of its
subtrees.
Figure 4.3: van Emde Boas layout on a binary tree of height 5. Redrawn
from [29].
The figure above shows the division of the tree into a top (small dark
square) and a bottom half comprising of 2 subtrees (black boxes). Each such
subtree in turn can be divided into top and bottom subtrees in a recursive
manner. The numbers indicate the order of tree traversal. The structure below
the tree illustrates the layout of the tree in memory. The order of tree traversal is
the order in which it is laid out in memory (therefore if the traversed tree is
stored as an array, the numbers next to the nodes become the array indices). The
array created is the van Emde Boas (vEB) array.
In this implementation, the sorted, balanced and full binary tree (created
in the previous step) is traversed in a manner similar to the one explained above.
First, the height of the tree is computed using a recursive algorithm. Next, using
this height, the tree is split and traversed recursively to create an array of keys
(combo key used in this implementation) having the vEB layout. If the tree
height h is not a power of 2, the root node is separated out as the top subtree and
the rest of the nodes are treated as the bottom subtree. For every subtree, the
root of the subtree and its children (2 nodes) are traversed in order. Then the
grandchild node of this root is called recursively. Again the same steps are
followed till a leaf node is encountered, after which the siblings are traversed in
a similar fashion. This simple recursive algorithm traverses a complete binary
tree in exactly the same manner as illustrated in figure 4.3 above. The vEB
traversal results in the creation of a vEB array in memory. Also, while traversal,
the vEB indices of each node are maintained separately (in the veb_index field
of the class TreeNode, listed earlier in Listing 1). This facilitates easy mapping
of a node to the corresponding position in the vEB array and then further down
to the Packed Array structure during a search, insertion or deletion operation.
This will be explained in detail in the next section.
With the above implementation, a full binary tree with vEB indices for
each of its nodes and a vEB array is in place. The listing below gives the code
snippet that recursively traverses the binary tree in the vEB format.
1.  . . . . .
2.  if(current != null)
3.  {
4.      current.setVeb_index(i);
5.      array[(int) i++] = current.getKey();
6.      runner[0] = current.link[0];
7.      runner[1] = current.link[1];
8.      if(current.link[0] != null)
9.      {
10.         current.link[0].setVeb_index(i);
11.         array[(int) i++] = current.link[0].getKey();
12.         runner[0] = current.link[0];
13.     }
14.     else
15.     {
16.         current.link[0] = new TreeNode("0", "0");
17.         current.link[0].setVeb_index(i);
18.         array[(int) i++] = current.link[0].getKey();
19.         runner[0] = current.link[0];
20.     }
21.     if(current.link[1] != null)
22.     {
23.         current.link[1].setVeb_index(i);
24.         array[(int) i++] = current.link[1].getKey();
25.         runner[1] = current.link[1];
26.     }
27.     else
28.     {
29.         current.link[1] = new TreeNode("0", "0");
30.         current.link[1].setVeb_index(i);
31.         array[(int) i++] = current.link[1].getKey();
32.         runner[1] = current.link[1];
33.     }
34.     if(current.link[0] != null)
35.     {
36.         if(runner[0].link[0] != null)
37.             vebTree(runner[0].link[0]);
38.         if(runner[0].link[1] != null)
39.             vebTree(runner[0].link[1]);
40.     }
41.     if(current.link[1] != null)
42.     {
43.         if(runner[1].link[0] != null)
44.             vebTree(runner[1].link[0]);
45.         if(runner[1].link[1] != null)
46.             vebTree(runner[1].link[1]);
47.     }
48. }
. . . . .
Listing 3: Implementation of the vEB array traversal.
As explained earlier, first the root (of any subtree) is accessed (current
node) and then its children as done in line numbers 4 through 33 (current.link[0]
indicates the left child of the current node being traversed while current.link[1]
is the right child). Line numbers 34 through 47 indicate the recursive calls to
the grandchild nodes (first left and then right) of the current node.
4.2 Packed Array
The Packed Array structure, (mentioned earlier) is a loosely packed
array created to facilitate insertions and deletions. The goal is to leave sufficient
gaps in this array so that for most of the insertion operations, fewer elements
need to be moved in order to accommodate the newly inserted element. This
keeps the average insertion cost as low as possible. New nodes are not added to
the binary tree on insertion; it is this packed array that gets affected with every
new addition. The binary tree acts like an index to the packed array and
facilitates quick operations on the database.
The packed array stores key-value pairs for every leaf node of the binary
tree. Let us assume that there are N leaves in the tree. The packed array thus
maintains a size of cN to store these N elements. Here c is some value > 1 used
as a multiplication factor to create gaps in the packed array.
The packed array is also divided into N sections (one for each leaf) of
size c each. Thus c can also be thought of as the capacity of each section of the
packed array. When the data in a section exceeds this size c, the structure is
redistributed by adjusting the elements in the adjacent free sections.
However, to make this implementation truly flexible, like a column-oriented
database, the packed array is implemented as an ArrayList (a dynamic
array in Java). This allows the addition of new columns to the database as well as
resizing of the structure when the need arises.
Another important point to note is that the relationship between the
leaves (stored in the vEB array) and the sections of the packed array is a very
simple one. The first leaf corresponds to the (first element of the) first section of
packed array, the second leaf to the (first element of the) second section of
packed array and so forth.
As the packed array stores key-value pairs for every leaf in the tree, it is
unnecessary to store this information in the vEB array as well. Therefore the
leaves in the vEB array do not store any keys. Instead, the space allocated to
these leaves in the vEB array is utilised to map the leaves onto their
corresponding sections in the packed array. If the vEB index of a leaf is known,
then accessing its key or value from the packed array is a simple mapping onto
the section where this leaf is stored. The diagram below illustrates the mapping
and shows the relation between the vEB and the packed arrays.
Figure 4.4: To illustrate the relation between a full binary tree (of height 4)
and the vEB array and the Packed array structure.
The binary tree on top, with 15 nodes, is first converted into a vEB array
(structure below tree). The leaves are represented as the white circles in the vEB
array. The Packed array below stores these leaves, shown again as white circles.
The numbers (0, 10, 20, …, 70) below each leaf in the vEB array indicate the
index of these leaves in the Packed Array. The leaves in the vEB array do not
store keys (unlike other grey nodes), instead, they store the index positions of
themselves in the Packed array (the numbers 0, 10, 20, .., 70). Each leaf in the
Packed array is allocated a section. In this example we assume that the Packed
array allocates space for 10 elements in each of its sections (factor or capacity c
is 10). Thus total size of this structure is number of leaves * c = 8 * 10 = 80.
To highlight the ease of accessing elements from this structure, let us
consider an example where we are interested in inserting a new element as a
child of the data item whose vEB index is 11. It has already been mentioned
that the vEB index of any element (key-value pair) in the binary tree is in reality
the index position of that element in the vEB array. Therefore, from the above
diagram, it is clear, that the element with vEB index 11 is a leaf (represented as
a white circle in vEB array).
Now to insert an element, we need to go to the Packed array. The
mapping provided in this implementation facilitates an extremely quick and
easy way of doing this. In the vEB array, a leaf holds its own Packed array
index location and not key-value information. Therefore, element at 11 (a leaf)
also stores its corresponding Packed array position, which is 40 (number written
below it to indicate the value stored by it). Thus we can jump directly to the
index position 40 in the Packed array and insert the desired element. The
pseudo-code snippet (with self-explanatory variable names) to illustrate the
access mechanism is as follows:
array veb[], array packed_str[],
integer veb_index, integer packed_index;
veb_index = 11;
packed_index = veb[veb_index];
result = packed_str[packed_index];
Listing 4: Pseudo code to explain the mapping of vEB array to packed
array.
Algorithm to Calculate the Number of leaves: In order to implement the
packed array structure, as explained above, it is essential to know the number of
leaves in the binary tree. It may be recalled that the Packed array is a structure
that allocates space for all the leaves of the tree as well as allocates some
additional spaces based on some constant c, to allow for low cost insertions.
Thus, the size of this structure equals the product of the number of leaves and
this constant (capacity) c.
For a full binary tree of height h, the number of leaves is fixed.
For instance, a full tree of height 2 has 2 leaves, whereas a tree of
height 4 has 8 leaves, and so on. This is predictable since each node
in a full binary tree has exactly 2 child nodes, which results in
the number of nodes at one level being twice the number of nodes at the
previous level (explained earlier). This simple relation between nodes at
different levels is used to compute the total number of leaves in the binary tree
created in this implementation as follows:
leaves = (int) Math.pow(2, height-1);
Listing 5: To calculate the number of leaves for a tree of height ‘height’.
Algorithm to Create the Packed Array Structure: Once the number of leaves is
known, a temporary array is populated with the key-value pairs of the actual
leaf nodes; the size of this array equals the computed leaf count.
Using this array as input, the final Packed array structure is built. The
following code snippet explains the technique used.
. . . .
if (i == 0 || i % factor == 0)
{
    if (leaf_nodes[k] != null)
    {
        temp = leaf_nodes[k++].concat("|").concat(leaf_nodes[k++]);
    }
    al.add(i, temp);
}
else
    al.add(i, "0");
. . . .
Listing 6: Implementation of the Packed Array.
The Packed array as stated earlier is implemented as an ArrayList, al. In
the above code, factor is the variable used for the capacity c of the Packed array,
which denotes the size of each section in this structure. We already know that
each leaf is the first element of a section in the Packed array. It implies that the
index position of each leaf in the Packed array is a multiple of c. Therefore, in
the sample code, when the index i of the Packed array al is a multiple of factor
(or c), an element from the array leaf_nodes (containing key-value pairs of all
the leaves) is added to al. For all other values of i, a zero is appended. This
allocates space for new data to be inserted into the Packed array al.
4.3 Algorithm to Search
The algorithm to search for data in this database is simple and efficient.
To look for a particular element e, the binary tree is searched first. As all
key-value (data) pairs are present in the tree itself, this search is quite fast.
Searches can be key-based, value-based or both together. As this is a database
implementation, the most common searches retrieve the values of a particular key
(for example, all the values of the column employee_name, which is a key) and
these are very efficient. The binary tree is sorted on the column_name part of the
combo key and hence only a small section of the tree needs to be searched to
locate the required value (if the searched column_name is less than the
root, we go to the left subtree, otherwise to the right, and so on).
However, if the element being searched for is not available in the tree, there
are two possibilities: either the element does not exist in the database at all, or
it was appended to the database later and hence exists in
the Packed array and not in the binary tree. Whatever the reason, when such
a situation arises the Packed array is searched. This search is also
efficient, due to the direct mapping of each leaf to its Packed array section (as
explained earlier). Once the appropriate section is reached, a linear search or a
binary search is performed within that section. Linear searches are efficient for
smaller sections; however, if many updates have been made to the database, a
binary search is more appropriate.
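The fall-back search within a Packed array section can be sketched as follows (illustrative only; it assumes the "comboKey|value" slot format used in Listing 6 and uses hypothetical names for the parameters):

// Linear scan of one packed-array section, skipping gap slots ("0").
static String searchSection(java.util.List<String> packed, int sectionStart, int factor, String comboKey) {
    for (int i = sectionStart; i < sectionStart + factor && i < packed.size(); i++) {
        String slot = packed.get(i);
        if (!"0".equals(slot) && slot.startsWith(comboKey + "|")) {
            return slot.substring(slot.indexOf('|') + 1);   // return the value part
        }
    }
    return null;   // not present in this section
}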
4.4 Algorithm to Append / Insert
As mentioned several times, insertion operations do not affect the tree in
any way. All insertions get reflected in the Packed array. Since this
implementation of the database is a subset of the Google DataStore, the
operations permitted on it are also in accordance with DataStore. The DataStore
is an append-only database and so is this. Append-only implies that there can
be no random write operations or any modifications (overwriting) to the
existing data in the database. In DataStore, the write operations are merely new
insertions (additions or appends) to the database. Each new insertion to any
column gets added along with a new timestamp value. Thus a query for this
updated column will retrieve by default the most recent value unless specified
otherwise. The previous timestamps (versions) become historical data.
To perform such an append operation, the binary tree is first traversed
(like in a search operation discussed above) to locate an appropriate section i of
the Packed array, where the new element will be added. This section is then
checked for gaps. If the section has not exceeded its capacity, the insertion is
immediately made.
However, if the section capacity is full, all previous sections need to be
checked for gaps. If space is available, a redistribution (rebalancing) operation
is performed to push the existing elements from the current section to the left
(to the previous section) to accommodate the new element. Again, if the current
section i being checked for gaps is the leftmost section or empty spaces are not
available in its adjacent sections, the size of the Packed array is doubled. Then,
all elements are rearranged. The advantage of using an ArrayList for
implementing this Packed array structure is evident here, since a resizing
operation can be easily performed.
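A simplified sketch of the gap-filling part of an insertion is given below (illustrative only; the redistribution and resizing operations described above are omitted, and the parameter names are assumptions):

// Places "comboKey|value" in the first free slot of the target section; returns false if the section is full.
static boolean insertIntoSection(java.util.List<String> packed, int sectionStart, int factor,
                                 String comboKey, String value) {
    for (int i = sectionStart; i < sectionStart + factor && i < packed.size(); i++) {
        if ("0".equals(packed.get(i))) {            // found a gap in this section
            packed.set(i, comboKey + "|" + value);
            return true;
        }
    }
    return false;   // caller must redistribute to neighbouring sections or grow the array
}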
4.5 Algorithm to Delete
The deletion operation is supported only for data that has not been
written to disk. Although this is an in-memory database, data is written to a file
from time to time for durability, and the database is later loaded into memory
from this flat file. This data is stored in the static tree and also, partly
(the leaves only), in the Packed array; no alteration to it is therefore possible.
The only data that can be deleted are the items inserted later into the
database (thus present only in the Packed array, apart from the leaves).
To delete such a data item (key-value pair), we search through the tree
(in the manner explained in the search sub section) to locate the exact section in
the Packed array. Once there, a linear search is performed within the section to
obtain the exact element and then delete it.
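A corresponding sketch of the deletion is shown below (again illustrative, with assumed names; a deleted slot is simply turned back into a gap):

// Removes an appended key-value pair from its section by reopening the slot as a gap.
static boolean deleteFromSection(java.util.List<String> packed, int sectionStart, int factor, String comboKey) {
    for (int i = sectionStart; i < sectionStart + factor && i < packed.size(); i++) {
        String slot = packed.get(i);
        if (!"0".equals(slot) && slot.startsWith(comboKey + "|")) {
            packed.set(i, "0");
            return true;
        }
    }
    return false;   // key not found in this section
}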
5 Query Implementation
This section will discuss a single query designed and developed for the
database, described in the last section. The query implemented is a standard
benchmark query that is realistic and has a broad industry-wide relevance. The
benchmark used is TPC-H [56], a decision-support (analytical) benchmark that
is most suitable for this database system, since this system has the ability to
hold several versions of the same data (time stamped).
Also, examined in this section is the need for synchronization, the issues
involved and the synchronization techniques used for multiple concurrent users.
All development is done using Java 7 on Eclipse Helios. Java 7 has a lot
of support for parallel programming, hence the choice. The specifics will be
discussed in detail in the following sub sections.
5.1 Development Tools – Java Fork/Join Framework
Java has had support for multi-threading and concurrency for a long time
(the java.util.concurrent package was introduced in version 5.0). The new features
added to this package in Java SE 7 enhance it with support for fork/join
parallelism [51, 52, 53], which is based on the parallel divide-and-conquer strategy.
Divide-and-conquer algorithms are perfect for problems that can be split into
two or more independent sub problems of the same type. This is analogous to
the map-reduce strategy in functional languages. The basic idea is to recursively
split the problem such that it eventually becomes simple enough to be solved
directly. The solutions of each of the sub problems are then combined to
produce the final result.
Previous versions of Java can also solve such divide-and-conquer
problems concurrently using the Executor framework and the Callable<V>
interface. However, a Callable that waits for the results of other Callables (in order
to combine them into the final result) ties up its thread in a wait state, wasting
the opportunity to handle another queued Callable task. The uniqueness of the
Java 7 Fork/Join framework, therefore, is its ability to efficiently use the
resources in parallel. The Fork/Join framework uses a Work Stealing
mechanism [51, 53] to steal jobs from other threads in its pool, while one thread
(task) waits for another one to complete.
The implementation of the work stealing scheduling used in Java
Fork/Join framework is a variant adapted from the Cilk-5 project [55]. The
basic mechanism of work stealing involves assigning each worker thread in the
fork/join pool with its own private deque (double ended queue). This deque
holds all subtasks assigned to a particular thread for execution. When any
worker thread completes the execution of the tasks in its local deque, it tries to
steal pending tasks from other threads. This process continues (by threads) till
all tasks in all deques are completed. The obvious advantage is efficient
resource usage, as well as reduced overhead from load imbalance.
The Fork/Join framework consists of a ForkJoinPool executor [52],
which is dedicated to executing instances of ForkJoinTask subclasses.
A ForkJoinTask object supports the creation of subtasks as well as waiting for those
subtasks to complete, as illustrated below. Each ForkJoinTask object has 2
specific methods:
1. the fork() method, which allows a new ForkJoinTask to be spawned from an
existing one.
2. the join() method, which allows a ForkJoinTask to wait for the completion of
another one.
Figure 5.1: Co-operation among fork ( ) and join ( ) tasks. Redrawn from
[53].
There are 2 types of ForkJoinTask specializations:
1. RecursiveAction, instances of which do not return a value.
2. RecursiveTask, instances of which return a value.
For this implementation, we use instances of RecursiveTask that return the
computed result.
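To illustrate the pattern, the following self-contained example (not taken from the project code) shows a RecursiveTask that sums an int array using the divide-and-conquer, fork-then-join idiom described above:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1000;   // below this size, sum sequentially
    private final int[] data;
    private final int lo, hi;

    public SumTask(int[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {               // small enough: solve directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) / 2;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                              // run the left half asynchronously
        long rightResult = right.compute();       // compute the right half in this thread
        return rightResult + left.join();         // wait for the forked half and combine
    }

    public static void main(String[] args) {
        int[] values = new int[100000];
        for (int i = 0; i < values.length; i++) values[i] = 1;
        long total = new ForkJoinPool().invoke(new SumTask(values, 0, values.length));
        System.out.println(total);                // prints 100000
    }
}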
5.2 TPC-H Benchmark Overview
The TPC (Transaction Processing Performance Council) Benchmark™ H,
commonly TPC-H, is a decision-support benchmark that comprises a suite
of business-oriented ad-hoc and concurrent queries [56]. The benchmark
primarily describes decision-support systems that work with huge volumes of
data and support highly complex queries that cater to real-world
business problems. It does not target any specific business area but is applicable
to any industry that buys, sells, manages or distributes products worldwide.
There is a standard TPC-H database against which the queries are
executed. The performance metric used is the TPC-H Composite Query-per-Hour
Performance Metric (QphH@Size), which reflects the database size,
the query processing power for a single query and the query throughput for
multiple concurrent queries. This benchmark is usually run against commercial
DBMSes with an SQL interface.
The TPC-H database is composed of 8 tables. The total number of
columns (from all the tables) in the database is 61. The relationship between the
various columns in the database is illustrated in Figure 5.2 below. The columns
with outgoing arrows are the primary keys (in their respective tables), while the
ones receiving them are foreign keys that aid in joining. For instance, the table
PART has the PARTKEY column as its primary key, which becomes a foreign
key in PARTSUPP. Similarly, the primary key SUPPKEY of the SUPPLIER table
is a foreign key in the PARTSUPP table; together, these two foreign keys form
the primary key of PARTSUPP. It is also evident from the diagram
that the NATIONKEY primary key of the NATION table exists as a foreign key in
both the SUPPLIER and CUSTOMER tables. All such relations are drawn
in figure 5.2.
Figure 5.2: TPC-H database schema. Copied from source [56]. The text in
parentheses following each table name is the prefix used for each column
name in that table.
The range of different queries available and their industry-wide
relevance makes TPC-H an ideal choice to analyse the implemented database
system. TPC-H has a set of 22 queries, each addressing some realistic business
scenario. The query selected for implementation on this database is Query 17,
the details of which are provided in a later subsection.
5.3 Loading the Database
Prior to implementing the selected query, the stage needs to be set for
executing the query on a suitable dataset. It may be recalled that the
implemented database works on data stored in the form of key-value pairs.
Therefore, the benchmark dataset (of TPC-H), which is in the traditional
RDBMS format of tables (schema shown above) needs to be converted into a
dataset appropriate for this system.
The first step however, is to create the normalized TPC-H dataset (in the
form of tables) of the required size. TPC-H provides a utility called dbgen that
allows users to create a normalized dataset of any size (even gigabytes). Next
this normalized data is loaded into a standard SQL database to form the
different tables.
Now is the time to create the key-value pairs. It is important to note at
this point that BigTable (DataStore), as mentioned previously, is a single big
table of key-value pairs. Also, join operations are not supported directly.
Therefore, to create a similar structure, the 8 separate tables of TPC-H need to
be converted into a single big table as well. To realize this, the individual tables
are combined by performing an Equi-Join on all of them. This de-normalized
single table is written to a file in the comma separated format. The total number
of columns after an equi-join is 54 as opposed to the original 61 columns in
TPC-H. This is due to the fact that an equi-join keeps only one of the
two columns used in a particular join. Thus joining the 8 tables requires 7
equi-joins utilising 14 join columns, of which 7 are
discarded. Although the data itself remains unchanged, it now contains redundancy due to
the multiple joins, which is taken into account while implementing the query.
This comma separated file is read line by line and converted into a flat
file consisting of key-value pairs. It can be recalled that a key here is a combo
key. Hence, for every row a unique row key is generated. All columns
belonging to a particular row use the same row key. Also, since the entire
dataset is a single version, the timestamp is kept fixed to 111100. New
insertions result in new timestamps. The file generated looks like the one shown
in figure 5.3 below.
Figure 5.3: Sample Key-Value pairs generated from the de-normalized
dataset. It follows a row_key.col_name.timestamp format, unlike our data
model.
This key-value pair file (shown above) for the purpose of this
implementation is obtained from a colleague working on a similar project
(Appendix 2). The system design section earlier illustrated the data model used
for this project (column_name.row_key.timestamp) and it clearly differs from
the one in Figure 5.3. The combo keys in our case are sorted based on the
different column names and stored accordingly in the binary tree. This provides
us with the advantage of having all the same columns (across different rows) to
be grouped together (subtree) in the tree. The locality achieved is typical of a
column store and helps in querying the data from the database. The above key-value pair file is thus converted into the required data model.
1.  File file = new File(input_file);
2.  Scanner scan = new Scanner(file);
3.  FileWriter fstream = new FileWriter(output_file);
4.  BufferedWriter out = new BufferedWriter(fstream);
5.  while(scan.hasNextLine())
6.  {
7.      String line = scan.nextLine();
8.      String [] newline = line.split(",");
9.      String [] newword = newline[0].split("\\.");
10.     String nline = newword[0].trim();
11.     newword[0] = newword[1].trim();
12.     newword[1] = nline;
13.     newline[0] = newword[0] + "." + newword[1] + "." + newword[2].trim();
14.     line = newline[0].trim() + "," + newline[1].trim();
15.     out.write(line);
16.     out.newLine();
17. }
. . . .
Listing 7: Implementation to convert a key-value pair file to another key-value pair format (based on our system's data model).
The code snippet provided above is used to perform this conversion. The
file shown in figure 5.3 is read line by line and split by the comma (,) that
separates a key from its value (Listing 7: line 8) and stored in an array. The 0th
element of this array contains the combo key. This element is further split based
on the dot (.) that divides a row key from a column name and again stored in an
array (Listing 7: line 9). The 0th element contains the row key while the 1st
element the column name. The elements are swapped to obtain the desired
combo key (Listing 7: line 10 through 14). Now that the dataset is available in
the desired format, the implemented database is populated. The data structures
developed get created in the manner described in the last section. The static
binary tree contains the keys (combo keys) and their corresponding values.
These keys map onto a vEB array, while only the leaves are placed in the
Packed array to facilitate insertions and deletions.
5.4 TPC-H Query 17 Overview
Query 17 (Small-Quantity-Order Revenue) of TPC-H found at section
3.20 of the TPC-H Specifications [56] determines the yearly average revenue
lost when orders for small quantities of certain parts are no longer filled. The
basic idea is to assess the possibility of reducing overhead expenses, by
overlooking the smaller consignments and concentrating only on the sales of
larger shipments.
The query definition in terms of SQL is as follows:
select sum(L_EXTENDEDPRICE) / 7.0 as avg_yearly
from
    LINEITEM,
    PART
where
    P_PARTKEY = L_PARTKEY
    and P_BRAND = '[BRAND]'
    and P_CONTAINER = '[CONTAINER]'
    and L_QUANTITY < (
        select
            0.2 * avg(L_QUANTITY)
        from
            LINEITEM
        where
            L_PARTKEY = P_PARTKEY
    );
This query works on a database covering 7 years. It considers parts belonging to
a particular brand and container type and, for such parts, determines the average
lineitem quantity over all orders in the database. Finally, it computes the average
yearly (gross) loss in revenue if orders for those parts with a quantity
below 20% of this average (calculated earlier) were no longer taken.
The substitution parameters for [BRAND] and [CONTAINER] include
values like Brand#23 and MED BOX respectively or Brand#25 and JUMBO
PKG respectively. A few other variations are available in the specification [56].
To select a single appropriate query from among the available set of 22
TPC-H queries, careful consideration was given to several factors. It was
essential to select a query that:
a. Was complex, so that it would be a suitable query with which to evaluate the
database system at hand.
b. Had a business relevance (in the real world), so that the database could
be assessed from a realistic perspective. This would make the
evaluation more reliable.
c. Contained a few operations that were highly suitable for a traditional
RDBMS, since in a real business environment databases are queried
irrespective of their type. Queries are created to suit the business
requirements rather than an underlying RDBMS or a Column store.
Therefore, it was imperative to select a query that was generic and not
completely suited to a typical column-oriented DB.
d. Contained a few operations that were highly suitable for column-oriented databases as well. This is quite obvious, since operations on
columns like aggregations are extremely efficient on column-oriented
databases. These kinds of operations display the true power of a
column store. So it was necessary to have some operations that could
utilize the potential of the underlying database as well.
e. Contained some operations that could be parallelized. It should be
remembered, that this project aims to assess the performance of a
query on the in-memory database developed, on various multi-core
systems. Therefore, a query that had the potential to utilize multiple
cores of a many-core system, as effectively as possible, would be a
suitable candidate for implementation.
After weighing the different queries against the aforementioned factors, Query
17 appeared to be a suitable choice.
5.5 Query 17 – Sequential Implementation
The query is implemented by following a series of steps, as explained below:
1. Execute the first selection condition, column P_BRAND = Brand#25. This
is easy to understand; here the column_name part of our combo key is
P_BRAND whose values should be equal to Brand#25. From the database,
all rows satisfying this condition should be selected. An important point to
remember here is that our database is not the typical row-oriented table. In
our case, all data is in the form of keys and values. However, as illustrated
earlier, the key owing to its format can be used to search for a column and
also a row; the key starts with a column_name, followed by a row_key,
where the row_key is the same for all columns belonging to a particular row.
Thus for row1, all columns have this (row1) as the row_key. Therefore, in
the current search operation, extracting the row_key part of the combo key,
where its associated column_name is P_BRAND, whose value in turn
equals Brand#25, identifies all the required rows (which can then be used to
search for other columns). The static binary tree is searched for the column
and its value and the row_keys thus obtained are stored in a temporary
ArrayList parser1.res (where parser1 is an object that contains the results in
ArrayList res). The advantage of sorting the binary tree based on the
column_names is evident in this simple search operation itself. All instances
of column P_BRAND exist within a particular subtree in our binary tree.
Therefore, the search algorithm is split into two parts. The first part
traverses the binary tree, to look for the first occurrence of P_BRAND
(desired column). This search only checks the column names and moves
either into the left half or the right half of the binary tree, depending upon
the value. The node thus obtained becomes the root node for the second half
of the search. We now know, due to the locality achieved, that all other
instances of P_BRAND exist as children of this node. Thus the second half
of the search checks each node of this subtree for P_BRAND as column and
Brand#25 as value. Row_keys of all such instances found within the subtree
are added to the result parser1.res. This algorithm is extremely efficient
since it searches only a very small portion of the entire tree.
2. Execute the second selection condition, column P_CONTAINER = JUMBO
PKG. This operation is identical to the one above. Here, the binary tree is
searched first for the first occurrence of P_CONTAINER (column). Once
that is found, using this node as root, all child nodes of this subtree are
searched for both P_CONTAINER and value JUMBO PKG. The row_keys
obtained are stored in the ArrayList res of another object parser2
(parser2.res). The code snippet below gives the algorithm used.
TreeNode searchTree( TreeNode root, String key )
{
    if ( root == null ) {
        // Tree is empty, so it certainly doesn't contain key.
        return root;
    }
    else if ( root.getKey().contains(key) ) {
        // Yes, the key has been found in the root node.
        return root;
    }
    else {
        int dir = ( key.compareTo(root.getKey()) < 0 ) ? 0 : 1;
        TreeNode rt = searchTree( root.link[dir], key );
        if (rt == null)
            return root;
        else
            return rt;
    }
} // end searchTree()
Listing 8: Implementation of search algorithm to check for the first occurrence of a column in the binary tree.
. . . .
long nodes = array.length;
Queue<TreeNode> q = new LinkedList<TreeNode>();
q.add(root);
i = 1;
while(!q.isEmpty() && i < nodes) {
    TreeNode current = (TreeNode)q.remove();
    if(current != null)
    {
        if (item == "")
        {
            . . . .
        }
        else
        {
            if ((current.getKey().contains(key)) &&
                (current.getItem().trim().equals(item)))
            {
                String [] t = current.getKey().split("\\.");
                res.add(t[1]);
            }
        }
        q.add(current.link[0]);
        q.add(current.link[1]);
        root = current;
    }
}
. . . .
Listing 9: Implementation of search algorithm to check within a subtree.
3. We currently have all row_keys that satisfy the two selection conditions. In
the given query, since these occur as AND conditions, we need to find out
the common rows that satisfy both conditions. To achieve this, an
intersection operation is performed on the results obtained in steps 1 and 2.
The common row_keys are stored in parser1.res (variable is re-used). The
following snippet gives the implementation details.
public int intersection(LoadFileNew obj)
{
    ArrayList <String> temp = new ArrayList<String> ();
    for(String t : ds.res)
    {
        if (obj.ds.res.contains(t))
            temp.add(t);
    }
    . . . .
    ds.res.addAll(temp);
    . . . .
Listing 10: Implementation of intersection operation to find the common rows.
Here, ds.res indicates parser1.res and obj.ds.res signifies parser2.res. As
stated, results are stored in ds.res (that is, parser1.res).
4. The third operation in the outer query is a join operation on the partkeys
(L_PARTKEY). Since this database already contains data that are joined
with each other, this step need not be performed. However, as mentioned
earlier, due to de-normalization, the data is likely to be redundant.
Consequently, it is essential to extract only the unique row_keys from the
result set obtained above (after intersection). There is no way to identify
redundancy from the results obtained above. As a solution, the
L_PARTKEY values for each row_key (rows) obtained in step 3 can be
extracted. Since these L_PARTKEY values are values of a primary key,
they should be unique. Any duplicate values found indicate redundancy and
the corresponding row_keys can be immediately discarded. The results after
duplicate removal are stored again in parser1.res. The code snippets for
L_PARTKEY extraction and duplicate removal are given below.
. . . .
1.  long nodes = array.length;
2.  Queue<TreeNode> q = new LinkedList<TreeNode>();
3.  q.add(root);
4.  i = 1;
5.  while(!q.isEmpty() && i < nodes) {
6.      TreeNode current = (TreeNode)q.remove();
7.      if(current != null)
        {
8.          if (key2 == "")
            {
                . . . .
            }
9.          else
            {
10.             if (current.getKey().contains(key + "." + key2 + "."))
11.             {
12.                 String [] t = current.getKey().split("\\.");
13.                 restemp.put(t[1], current.getItem());
14.             }
15.         }
16.         q.add(current.link[0]);
17.         q.add(current.link[1]);
18.         root = current;
19.     }
20. }
. . . .
Listing 11: Implementation of search algorithm to check for a column_name and a specific row_key within a single combo key.
Here in the if condition in line 10, the variable key contains values of
column_name (L_PARTKEY in this case) and key2 is supplied with row_key
values from step 3 (the common results in parser1.res).
. . . .
1.  for (String s : res.keySet())
2.  {
3.      if (map.isEmpty())
4.          map.put(s, res.get(s));
5.      else if (map.containsValue(res.get(s)))
6.          continue;
7.      else
8.          map.put(s, res.get(s));
9.  }
. . . .
Listing 12: Implementation of duplicate removal algorithm.
Here, only unique key-value pairs are added to a structure called map. If a
value being checked already exists in map, it is not added (Listing 12: lines
5 – 8).
5. The next step involves working on the inner query and computing the
average of the L_QUANTITY (column) values for all unique row-keys
obtained in step 4. This is like the L_PARTKEY search in the last step: the
column_name L_QUANTITY and a row-key are searched for in conjunction
in the binary tree, once for each row-key involved. The computed average
is multiplied by 0.2 and the result stored in a variable avg, as sketched below.
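A minimal sketch of this computation follows (illustrative only; 'quantities' is an assumed, non-empty list holding the L_QUANTITY values retrieved for the unique row-keys):

// Returns 20% of the average L_QUANTITY, i.e. the 'avg' threshold used in step 6.
static double quantityThreshold(java.util.List<String> quantities) {
    java.math.BigDecimal sum = java.math.BigDecimal.ZERO;
    for (String q : quantities) {
        sum = sum.add(new java.math.BigDecimal(q.trim()));
    }
    double average = sum.doubleValue() / quantities.size();
    return 0.2 * average;
}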
6. Next, the final selection condition of the outer query is implemented. This
involves searching for all L_QUANTITY values that are < avg (from step 5).
This is a search operation similar to the ones in step 1 and 2. However, here
instead of searching for a particular value, we search for those that satisfy a
relational operation (less than). All row-keys obtained are stored in tempres
and then transferred to parser3.res.
. . . .
if (current.getKey().contains(key))
{
    BigDecimal b = new BigDecimal(current.getItem().trim());
    if(b.doubleValue() < it.doubleValue())
    {
        String [] t = current.getKey().split("\\.");
        tempres.add(t[1]);
    }
}
q.add(current.link[0]);
q.add(current.link[1]);
root = current;
. . . .
Listing 13: Implementation of search algorithm to check for the less-than condition.
7. Since the selection in step 6 is ANDed with the other selections, an
intersection is again performed on the results from step 4 (in
parser1.res) and step 6 (in parser3.res). This gives all common row-keys
that satisfy all the given selection criteria. The results are stored in parser1.res.
8. Now the final result (to be displayed) is computed. For each
row-key obtained above, the corresponding L_EXTENDEDPRICE values
are retrieved from the tree, using the method explained in steps 4 and 5. All these
values are then added (and stored in sum) and finally the 7-year average is
obtained by dividing the sum by 7.0. This gives the outcome of the query.
5.6 Query 17 – Parallel Implementation
The parallel implementation of the query is accomplished using the
Fork/Join framework of Java SE 7. Here, a ForkJoinPool executor is
used to run a pool of worker threads that execute ForkJoinTask instances. The RecursiveTask
variation of the ForkJoinTask base class is extended so that different tasks execute in
parallel, exploiting the underlying multi-core hardware. The basic mechanism of
the Fork/Join framework was explained in an earlier sub section.
The primary operations to be performed in this query are already known
from the sequential implementation sub section. This implementation executes
in parallel (on many cores) those portions of the algorithm that are independent
and hence parallelisable; not all steps can be parallelised.
The algorithm consists of the 8 basic steps described in the last sub section,
executed one after another. The parallel mechanism is illustrated below.
Figure 5.4: Overview of parallel execution strategy used in Query 17.
After analysing the steps, the parallel strategy shown above was devised
and implemented. It highlights the use of the Java Fork/Join framework to split
the execution (and hence speed it up) among multiple fork-join tasks. As
mentioned earlier, not all steps can be executed in parallel. As a result,
only the search operations traversing the binary tree and the aggregation operations
are parallelised to achieve faster execution times, as evident from figure 5.4
above.
Step 1: It retrieves all rows (row_keys) where the column_name portion of the
combo key is P_BRAND and its corresponding value is Brand#25. It can be
recalled, that the basic idea is (as in the sequential case) to obtain the first
occurrence of P_BRAND (column_name) from the tree. Then using this node
as root, we search every node within this particular subtree, for both P_BRAND
and value Brand#25. As we search every node in the subtree, the operation is
quite CPU intensive. This processor intensive operation is therefore split into as
many tasks as there are nodes (to be searched) and handled by the threads in the
ForkJoinPool. Every node is checked for the selection criteria; if it is satisfied,
the row_key part is extracted and added to the result set. Then 2 tasks are
spawned for each of its two child nodes; each in turn is checked for the
selection condition. Then for every child, 2 more tasks are spawned (for its
children) and the process continues recursively, till all nodes are checked and
result obtained. Since an individual task is created for every node in the subtree
being searched, the process is very efficient in terms of speed and processor
core usage.
Step 2: This step checks for P_CONTAINER column_names with value
JUMBO PKG. This is exactly analogous to step 1 and is implemented in the
same fashion. It should be noted that steps 1 and 2 are run one after another; the
parallelism is within their individual execution. The parallel algorithm used for
both is listed below.
. . . .
if (node != null) {
    // Check this node's combo key for the column name and its value for the selection constant.
    if (node.getKey().contains(key) && node.getItem().equals(item))
    {
        String[] t = node.getKey().split("\\.");
        res.add(t[1]);                        // keep only the row_key part of the combo key
    }
    if (ds.checklinks(node) > 0)
    {
        test1parent left = new test1parent(ds.returnLeftChild(node), key, item, KV);
        test1parent right = new test1parent(ds.returnRightChild(node), key, item, KV);
        left.fork();                          // search the left subtree asynchronously
        leftres.addAll(right.compute());      // search the right subtree in the current thread
        rightres.addAll(left.join());         // wait for the left subtree's results
        leftres.addAll(rightres);
    }
}
. . . .
Listing 14: Implementation of parallel search algorithm (1).
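For context, a task of this kind is typically started from a ForkJoinPool. The snippet below is a hedged sketch of such an invocation; test1parent and KV are taken from Listing 14 (test1parent is assumed to extend RecursiveTask<List<String>>), while firstBrandNode is a placeholder name for the node returned by the initial sequential lookup:

import java.util.List;
import java.util.concurrent.ForkJoinPool;

ForkJoinPool pool = new ForkJoinPool();          // defaults to one worker per core
test1parent rootTask =
        new test1parent(firstBrandNode, "P_BRAND", "Brand#25", KV);
List<String> rowKeys = pool.invoke(rootTask);    // blocks until the whole subtree is searched

Note the ordering inside Listing 14: the left child is forked, the right child is computed in the current thread, and only then is the left child joined; this keeps the current worker busy instead of blocking immediately on the forked task.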
Step 3: The next step is to obtain all the common row_keys that satisfy both
conditions (of step 1 and step 2). The algorithm used to realize this is similar to
the one used in the sequential implementation of the query (and code snippet is
listed there). The common set of row_keys thus obtained is stored in res1.
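The intersection itself can be expressed compactly; the sketch below (using retainAll, which may differ from the loop-based comparison actually used) illustrates the idea:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Keeps only the row_keys that appear in both result sets.
static List<String> intersect(List<String> brandKeys, List<String> containerKeys) {
    Set<String> common = new LinkedHashSet<>(brandKeys);
    common.retainAll(containerKeys);   // drop keys missing from the second set
    return new ArrayList<>(common);
}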
Step 4: The fourth step uses the result set (res1) from step 3 and for each
row_key in this set, checks for the combo key that also contains L_PARTKEY
as the column_name and retrieves its value. For example, if res1 contains
row_keys R0, R45 and R121, then this step retrieves the values of combo keys that contain the following (the timestamp can be anything): L_PARTKEY.R0, L_PARTKEY.R45, L_PARTKEY.R121. It is evident that the number of
searches equals the number of row_keys in the result set res1. To accomplish
this, tasks equal to the count (size or number of elements) of the result set res1
are spawned and executed in parallel. The individual result of each thread is
then joined (merged) and stored in list2. The code snippet is shown below.
. .
if (res1 != null)
{
    // Spawn one subtree-search task per row_key in res1 ...
    for (String s : res1)
    {
        ParallelSubTree task = new ParallelSubTree(node, key, s, KV);
        forks.add(task);
        task.fork();
    }
    // ... then join the tasks in order and record each row_key's result.
    int i = 0;
    for (RecursiveTask<List<String>> task : forks)
    {
        map.put(res1.get(i), task.join().toString());
        i++;
    }
}
. .
Listing 15: Implementation of parallel search algorithm (2).
Step 5: As discussed for the sequential implementation, the de-normalization of the original TPC-H dataset for our purpose gives rise to redundancy. The set res1 (from step 3) is therefore likely to contain duplicate data; step 4 gathers the corresponding values, and in this step the duplicates are removed from list2 (the result of the last step). The unique result set is stored in res2. The algorithm is the same as used in the sequential implementation.
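One simple way to perform this de-duplication, shown here as an illustrative sketch rather than the exact code used, is to pass the list through a set that preserves insertion order:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Removes duplicate entries while keeping the original encounter order.
static List<String> unique(List<String> list2) {
    return new ArrayList<>(new LinkedHashSet<>(list2));
}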
Step 6: This step is exactly the same as step 4. Each row_key in the unique set
res2 obtained above, is prefixed with the column_name L_QUANTITY. Then
the corresponding values of all combo keys containing this string, are extracted.
The set of values thus obtained is stored in list2.
Step 7: In this step, using list2, the average value is calculated. It is then
multiplied by 0.2 and the result stored in avg. The summation operation is
executed in parallel by simply splitting the values recursively (via fork/join) and
computing the sum (the usual divide-and-conquer strategy). The average is then
calculated. Please note that for simplicity the whole average operation in the
diagram (figure 5.4) is shown as being executed in parallel, although in reality
only the summation operation is. The entire dataset is split into halves
recursively, till a portion becomes less than or equal to the threshold (indicating
that it is small enough to be added up), after which the sum of those values is
calculated. All such partial results are finally added up to give the complete
result. The source code is as follows.
protected BigDecimal compute()
{
    if (length <= threshold)
        return computeDirectly();
    long split = length / 2;
    invokeAll(new ParallelAddition(res, start, split),
              new ParallelAddition(res, start + split, length - split));
    return sum;
}

protected BigDecimal computeDirectly()
{
    if (length != 0)
    {
        for (int c = 0; c < length; c++)
        {
            . . . .
            BigDecimal sum1 = new BigDecimal(s.trim());
            sum = sum.add(sum1);
        }
    }
    return sum;
}
Listing 16: Implementation of parallel addition algorithm.
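Listing 16 is a fragment of the class used in the project; for a self-contained picture, the sketch below shows one way such a divide-and-conquer summation can be written with RecursiveTask. The class name, fields and threshold value here are illustrative assumptions, not the exact ones used in the experiments.

import java.math.BigDecimal;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Recursively splits the value list in half until a portion is small enough to
// be summed directly, then combines the partial sums.
class SumTask extends RecursiveTask<BigDecimal> {
    private static final int THRESHOLD = 1000;   // assumed cut-off for direct summation
    private final List<String> values;
    private final int start, length;

    SumTask(List<String> values, int start, int length) {
        this.values = values;
        this.start = start;
        this.length = length;
    }

    @Override
    protected BigDecimal compute() {
        if (length <= THRESHOLD) {
            BigDecimal sum = BigDecimal.ZERO;
            for (int i = start; i < start + length; i++) {
                sum = sum.add(new BigDecimal(values.get(i).trim()));
            }
            return sum;
        }
        int split = length / 2;
        SumTask left = new SumTask(values, start, split);
        SumTask right = new SumTask(values, start + split, length - split);
        left.fork();                               // sum the left half asynchronously
        return right.compute().add(left.join());   // sum the right half here, then combine
    }
}

// Usage: BigDecimal total = new ForkJoinPool().invoke(new SumTask(list2, 0, list2.size()));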
Step 8: Here the tree is searched for combo keys with column_name
L_QUANTITY, whose values are less than avg (computed in step 7 above).
The strategy here is similar to the one used for steps 1 and 2. The first
occurrence of L_QUANTITY in the binary tree is searched for. Then using that
node as the root, all its descendants are searched for both column_name
L_QUANTITY and values less than avg. This search operation (of the subtree)
is executed in parallel, spawning as many tasks as there are nodes in the subtree.
The mechanism is exactly the one used in steps 1 and 2. The result set of
row_keys is stored in res1.
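The per-node check performed by this parallel task mirrors the sequential test of Listing 13; a hedged sketch of that predicate (the method name and parameters are illustrative) is:

import java.math.BigDecimal;

// True when the combo key belongs to column L_QUANTITY and the stored value
// is strictly below avg (0.2 times the average quantity computed in step 7).
static boolean quantityBelowAvg(String comboKey, String item, BigDecimal avg) {
    if (!comboKey.contains("L_QUANTITY")) {
        return false;
    }
    return new BigDecimal(item.trim()).compareTo(avg) < 0;   // compareTo avoids double rounding
}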
Step 9: Here an intersection operation is performed to obtain all common
row_keys (like in step 3) among the result sets res2 (from step 5) and res1
(from step 8). These common row_keys are again stored in res1. This result
gives the set of row_keys that satisfy all the selection conditions in the outer
query.
Step 10: As in steps 4 and 6 this step also fetches all values, for column
L_EXTENDEDPRICE combined separately with every row_key in result set
res1 (from step 9). Hence the number of searches is equal to the number of
elements in res1. The values obtained are stored in list2.
Step 11: The final step of execution; here the values obtained above (list2) are
added up. This addition operation is executed in parallel (as in step 7). The
algorithm used is also the same as shown above. The result is stored in sum.
Then this value is divided by 7.0 since the database represents a 7 year dataset
and we intend to find the average yearly loss (mentioned in an earlier subsection
with details of Query 17). This average value is displayed as the answer to the
query.
5.7 Need for Synchronization
We have so far seen the implementation of a database system and a
query that not only executes sequentially on the database, but also exploits the
underlying parallelism of the hardware. An essential point to consider at this
juncture is the existence of multiple users of this system. In the real world, any
database system is likely to be used by several users; this also means that those
multiple users are likely to be simultaneously using the system. This further
implies that there exists a risk of inconsistency in the system, especially when
concurrent updates are made to this single system. Hence, the need for
synchronization arises.
In a multi-threaded environment, each thread maintains its own stack
and registers. However, if these threads access any shared object, errors are
bound to be introduced and therefore require synchronization. Synchronization
ensures that the concurrent accesses to the shared object(s) do not corrupt the
value stored in them. Various synchronization techniques exist in Java 7. Here
we use a lock-based mechanism, the ReadWriteLock [57] to secure the database
from the hazards of concurrent activity.
5.8 Synchronization Issues
There are various problems associated with incorrect synchronization,
most of which are not discernible till the implementation (code) is executed.
These include deadlocks, livelocks, race conditions, starvation and so forth.
However, the primary issue associated with the concurrent access (read/write by
multiple users) to a single shared object (variable) is the possibility of one
thread (user) seeing the data (shared object) in an incorrect or corrupt state, due
to operations performed on it by another thread (user). In a multi-core
environment, where threads execute in parallel on the available cores, two
threads might actually try to update the same object simultaneously. This
therefore requires an appropriate mechanism to control access and avoid
inconsistencies.
5.9 Synchronization Techniques
As mentioned earlier, the synchronization technique used here is the
ReadWriteLock [57], an interface in Java 7 that maintains a pair of associated
locks, one for read operations and one for writing. A read lock can be held
simultaneously by several threads that intend to only read the same object. A
write lock on the other hand, cannot be held by multiple writer threads at the
same time; it is exclusive.
A read-write lock provides much better performance and allows for
greater concurrency when accessing shared data, than a typical mutual
exclusion lock, owing to the fact that multiple reader threads can read the same
piece of data concurrently. This increased concurrency leads to considerable
performance improvements on a many-core processor. However, since the write
operations are exclusive, they do not exploit the processor parallelism. This in
turn implies that if write operations are more frequent than read operations, system performance is likely to be affected.
The class ReentrantReadWriteLock [57] is extended, and the methods readLock() and writeLock() are used to lock shared data in the query implementation. Query 17 performs read operations on the database, so every shared access is guarded by a readLock(); this ensures that multiple user threads can simultaneously access the database for reading. However, there are variables used to store intermediate results which need to be write-protected during concurrent access to the system, so as not to corrupt the result. These shared variables are locked through the writeLock().
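As an illustration of this locking discipline (a minimal sketch, not the project code itself), read access to a shared intermediate result and exclusive updates to it might be guarded as follows:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class SharedResult {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> rowKeys = new ArrayList<>();

    // Many query threads may read concurrently under the read lock.
    List<String> snapshot() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(rowKeys);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Updates to the shared intermediate result are made exclusive by the write lock.
    void addAll(List<String> partial) {
        lock.writeLock().lock();
        try {
            rowKeys.addAll(partial);
        } finally {
            lock.writeLock().unlock();
        }
    }
}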
Ensuring synchronization for write operations to the database is slightly
tricky to achieve. We already know that write operations on the database imply
updates, which are made into the Packed Array structure. Therefore, this
structure as well as all other intermediate data items that are shared need to be
protected by a writeLock() to achieve synchronization. The design of write
synchronization necessary for this project is complete. However, due to the
stringent time frame, the complete implementation of this was not possible.
6 Evaluation
This section will discuss the query performance on two different multi-core systems: a 48-core AMD Opteron™ 6100 Series processor and a quad-core Intel Xeon E3 1245 processor with hyper-threading [58]. The evaluation is
performed by executing the implemented query (TPC-H Benchmark Query 17)
for three different datasets (small, medium and large) on each of the two
machines, and comparing the results obtained.
6.1 Experimental Methodology
This evaluation aims to verify if the database system is truly scalable, as
desired, and also analyse the parallelism achieved on different multi-core
architectures, in terms of execution times and speed-up. This is realized by
executing the implemented benchmark query on a variable number of threads,
for every dataset (small, medium and large). Using data of different sizes is
necessary as this database is a subset of DataStore that manages petabytes of
data; it is therefore imperative to test the system with various load sizes.
Moreover, the research involves utilizing multiple cores and analysing for
performance improvements. Most multi-core systems handle problems that
grow in size. This makes it essential to analyse the scalability of this parallel
query while scaling the problem size as well.
Intel Xeon (Janus) has a quad-core processor and supports hyper-threading with 2 threads per core, thus making a total of 8 hardware threads. It has 8
gigabytes of memory. Datasets of size 100 MB, 500 MB and 1 GB are used on
this system. AMD Opteron 6174 (Mcore 48) on the other hand, supports 48
hardware cores with one thread per core (no hyper-threading) thus with a total
of 48 threads. It has 128 GB main memory where datasets of size 1 GB, 3 GB
and 5 GB are used for evaluation. Appendix 1 contains the tabulated
results obtained for every configuration. Both the multi-core machines are
installed with Linux 2.6 and Sun Java Version 1.7.0 (build 1.7.0-b147).
Query 17 is executed with a fixed heap size of 7GB and 65GB on Janus
and Mcore 48 respectively. No other parameters and/or configurations are
changed during the execution of the query. Experiments are conducted for different numbers of threads: 1, 2, 4 and 8 threads on Janus and 1, 2, 4, 8, 16, 24, 32, 40, and 48 threads on Mcore 48. On a particular machine, for each thread count, 10 different execution times are recorded and the mean execution time is then computed. Execution times are obtained at millisecond resolution and then converted into seconds. The standard error, standard deviation and confidence intervals are also calculated. Every execution of the query is performed
by launching a fresh instance of the database, thus avoiding cache benefits.
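For reference, the summary statistics reported in Appendix 1 can be reproduced from the ten recorded timings per configuration; the snippet below is an illustrative sketch, and the factor 1.96 assumes a normal approximation for the 95% confidence interval:

// Computes mean, sample standard deviation, standard error and the 95% CI
// half-width from the recorded execution times (in seconds).
static double[] summarise(double[] times) {
    int n = times.length;                        // 10 runs per configuration
    double mean = 0.0;
    for (double t : times) mean += t;
    mean /= n;
    double ss = 0.0;
    for (double t : times) ss += (t - mean) * (t - mean);
    double stdev = Math.sqrt(ss / (n - 1));      // sample standard deviation
    double stderr = stdev / Math.sqrt(n);
    double ci = 1.96 * stderr;                   // 95% confidence half-width
    return new double[] { mean, stdev, stderr, ci };
}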
To measure the scalability of the database system, the absolute speedup and efficiency are calculated. The speedup is given by

Sp = T1 / Tp .................... eqn (2)

where T1 is the sequential execution time and Tp the parallel execution time on p processors. When T1 is the execution time for the best known sequential algorithm, the speedup is referred to as absolute speedup; the corresponding efficiency is Ep = Sp / p. This performance metric is most effective when evaluating parallel algorithms.
The system configurations of the test machines used are tabulated below (table 1). Also note that the complete results of execution (time), standard error,
standard deviation, confidence interval, speedup and efficiency are listed in
Appendix 1.
S.No. | System Name | CPU Type           | No. of Cores | No. of Threads (total) | RAM
1     | Janus       | Intel Xeon E3 1245 | 4            | 8                      | 8 GB
2     | MCore 48    | AMD Opteron 6174   | 48           | 48                     | 128 GB

Table 1: Production system configurations for performance evaluation.
6.2 Experimental Results
The experimental data obtained is analysed in two ways. First, the mean
execution times are plotted against the corresponding number of threads.
Second, the speedups computed are mapped against the number of threads.
These signify the amount of parallelism achieved through the implementation.
The results obtained for a variable number of threads and different datasets
indicate the scalability of the system.
Figures 6.1 – 6.6, illustrate the execution times obtained for small,
medium and large datasets on the two test machines. The X-axis (horizontal)
represents the number of threads (up to 8 for Janus and up to 48 for Mcore 48).
The Y-axis (vertical) represents the mean execution times for a particular
dataset in seconds. The confidence intervals (95%) for each reading are also
plotted on the graphs. However, the intervals being quite small are not visible in
most cases. The exact values of confidence interval are available in Appendix 1.
[Chart: mean execution time (T) in seconds against number of threads (N); plotted values for the 100 MB dataset on Janus: 0.82 s (1 thread), 0.74 s (2), 0.45 s (4) and 0.43 s (8).]
Figure 6.1: Mean execution times of Query 17 for 100 MB data (small) on Janus.
[Chart: mean execution time (T) in seconds against number of threads (N); plotted values for the 500 MB dataset on Janus: 9.78 s (1 thread), 9.24 s (2), 4.32 s (4) and 3.43 s (8).]
Figure 6.2: Mean execution times of Query 17 for 500 MB data (medium) on Janus.
[Chart: mean execution time (T) in seconds against number of threads (N); plotted values for the 1 GB dataset on Janus: 281.18 s (1 thread), 274.09 s (2), 251.84 s (4) and 250.33 s (8).]
Figure 6.3: Mean execution times of Query 17 for 1 GB data (large) on Janus.
[Chart: mean execution time (T) in seconds against number of threads (N); plotted values for the 1 GB dataset on Mcore48: 146.97 s (1 thread), 131.43 s (2), 55.44 s (4), 32.1 s (8), 26.15 s (16), 24.93 s (24), 24.44 s (32), 23.7 s (40) and 22.96 s (48).]
Figure 6.4: Mean execution times of Query 17 for 1 GB data (small) on Mcore 48.
[Chart: mean execution time (T) in seconds against number of threads (N); plotted values for the 3 GB dataset on Mcore48: 1440.02 s (1 thread), 1302.51 s (2), 616.88 s (4), 478.07 s (8), 476.95 s (16), 454.96 s (24), 439.34 s (32), 430.39 s (40) and 426.38 s (48).]
Figure 6.5: Mean execution times of Query 17 for 3 GB data (medium) on Mcore 48.
[Chart: mean execution time (T) in seconds against number of threads (N); plotted values for the 5 GB dataset on Mcore48: 4183.29 s (1 thread), 3844.93 s (2), 2144.93 s (4), 1576.64 s (8), 1405.18 s (16), 1290.67 s (24), 1190.85 s (32), 1155.63 s (40) and 1134.43 s (48).]
Figure 6.6: Mean execution times of Query 17 for 5 GB data (large) on Mcore 48.
It is evident from the figures above that the execution time decreases
significantly when the number of threads is increased from 1 through 4 in case
of Janus and between 1 and 8 for Mcore 48. After these points, the execution
time lowers at a very slow rate. Increasing the number of threads beyond this
does not seem to have much effect on the execution times. This can be
attributed to the amount of parallelisation achieved in the query. We may recall,
that only certain portions of the query (all searches and aggregations) were
parallelised. The rest remained sequential. Moreover, the intermediate results
obtained in one portion of the query were needed by another, thus making the
different parts of the query execute in sequence. These sequential sections of
the query, in all probability, fail to take complete advantage of the threads. In addition, there are parallelisation overheads that further affect the execution
time of the parallel algorithm.
Analysing figures 6.3 (execution time of 1GB dataset on Janus) and 6.4
(execution time of the 1 GB dataset on Mcore48) reveals the following: for the
same dataset and parallel algorithm, the execution times for threads 1, 2, 4, and
8 (the number common between Mcore48 and Janus) are much lower in case of
Mcore48. In fact, the execution time of this dataset on 8 threads is almost 8
times less on Mcore48. Even the time taken on a single execution thread is
twice on Janus. This massive difference between the query response times on
the two systems is owing to the hardware configurations of the two machines.
Janus is a hyper-threaded processor [58]; this implies that every physical core is
presented as two logical cores to the operating system. This efficiently uses the
CPU resources by executing two threads in parallel on a single processor. In
order to do this, certain resources are duplicated in each physical core. Also
some amount of sharing occurs, as the same physical core is used by the two
logical cores. Thus, a processor with two physical cores is inherently more powerful, in terms of performance, than a single hyper-threaded core that presents two logical cores. For the test machines, all the 8 cores of Mcore48 are physical
cores; whereas, the 8 threads of Janus are actually 4 physical cores. This
accounts for the superior query performance on Mcore48.
The following figures, 6.7 and 6.8 portray the absolute speedup attained
by the query for different datasets across a variable number of threads, on each
test machine.
[Chart: absolute speedup (T1 / Tp) against number of threads (N), with one curve each for the 100 MB, 500 MB and 1 GB datasets on Janus; the plotted values are tabulated in Appendix 1.]
Figure 6.7: Absolute speedup of Query 17 for all three datasets on Janus.
[Chart: absolute speedup (T1 / Tp) against number of threads (N), with one curve each for the 1 GB, 3 GB and 5 GB datasets on Mcore48; the plotted values are tabulated in Appendix 1.]
Figure 6.8: Absolute speedup of Query 17 for all three datasets on Mcore48.
The figures above clearly indicate that for the same parallel algorithm,
in most cases, as the problem size (dataset size) increases, the speedup achieved
decreases. The only exception to this is the speedup attained by the 500 MB
dataset on Janus; for every thread, the speedup achieved is higher than the
corresponding 100 MB dataset. However, for a fixed problem size, there is a
significant increase in speedup with the increase in the number of threads, for
all datasets (for both test machines). Mcore48 produces higher speedups than
Janus. On Janus, a maximum speedup of 2.8, 5.7 and 2 is obtained for 100MB,
500MB and 1GB dataset respectively. On the contrary, on Mcore48, maximum
speedup of 12.7, 6.1 and 6.3 is obtained for 1GB, 3GB and 5GB data
respectively. The rise in speedup is quite sharp initially and extremely sluggish in the latter half, where a much larger number of threads is in use. Further, analysing the efficiency values reveals a gradual decline in efficiency as the number of threads increases.
There are various possible reasons behind this behaviour. The first and
foremost could be the existence of sequential regions in the parallel code (as
stated in Amdahl’s Law [41]). Other factors might include synchronization and
communication costs incurred with increased number of threads.
It is evident, from the experimental results obtained for query 17, that
there are performance improvements when exploiting multi-core parallelism;
good speedup of up to 12.7 is obtained. However, the point to consider here is
the scalability when increasing the problem (dataset) size. For this particular
query, the speedup decreases on increasing the problem size, indicating that it
does not scale very well. The system therefore needs to be queried with some
other queries as well, to ascertain its exact behaviour. Moreover, the use of
performance counters and profiling will help in observing the memory accesses
and monitoring the exact cache levels involved in the query execution. This will allow us to better analyse the system performance. It will also reveal the cache misses, thereby providing some information about the data structure as well, namely whether or not it takes adequate advantage of the cache.
7 Conclusion
This section will summarise the dissertation, and its outcome, as well as
will put forth the scope of improvement and future work.
7.1 Dissertation Summary
This research involves the design and development of a subset of the
Google BigTable [3] database system for multi-core machines. The system
currently used by Google is distributed in nature and does not exploit thread-level parallelism of the individual machines in the cluster. This project aims to
explore the possibility and assess the efficacy of thread-level parallelism for
such huge databases.
The dissertation has presented the Google system and its open source
counterparts, the underlying file systems and the distributed parallelism
techniques used. It has also described alternative database technologies (IMDBs)
and cache-oblivious data structures to provide the readers the necessary
background knowledge, before proceeding on to explain the implementation
methods employed to build the database system. Presenting a wider context was
essential due to the complexity of the system as a whole.
The complete technique of development presented, includes building a
data structure that resides in-memory, allows insertions, deletions, and parallel
queries, and is also scalable. The implementation methodology for the
benchmark query implemented in order to evaluate the database has also been
described at length. The Java 7 Fork/Join framework was extensively used to
achieve the desired parallelism for execution on multi-core machines.
Next, the evaluation results of the database system were presented.
Evaluation has been performed on two different multiprocessor architectures; an
Intel Xeon quad-core processor and an AMD Opteron 48 core processor.
Datasets of different sizes were used to analyse the system performance in
terms of execution time and speed-up. It was observed that on a 48 core
processor, a speedup of up to 12.7 was achieved for 1 GB data, up to 6.1 for a 3 GB dataset and about 6.3 for a 5 GB dataset. It is evident that the system is
capable of handling very large (gigabyte-scale) datasets. Although there is a significant speedup for a particular problem size (dataset) across different thread counts, an efficiency analysis indicates a gradual decline. There are several
possible factors influencing this; the most important being the existence of
sequential portions in the query itself, which limit its parallelization. Apart from that, using larger datasets and more threads can result in
additional synchronization costs that hinder speedup.
However, a different query with more scope for parallelization can
produce favourable results. Therefore, cluster-based column-oriented systems
like Google DataStore will benefit entirely from the inherent multi-core
architecture of its individual machines only if a suitable query is executed. For
all other queries (with limited parallelism), such a system might see
performance improvements for different data sizes only up to a few cores; with
further increase in parallel threads there will be a drop in efficiency (and a
stagnation in speedup) as the query execution will not scale well for very large
number of threads. For such queries, a smaller number of threads can be utilized
to gain performance improvements, instead of exploiting all the available
machine cores. This will mean that the cores will remain under-utilized;
however, the speedup achieved by exploiting a few threads of the multi-core
machines will boost the overall performance of a query running on a cluster
(distributed environment like Google DataStore) and therefore, such a trade-off
is acceptable.
This research has explored an extremely complex and vast quarter, and
successfully managed to develop the desired database system. However, due to
lack of time, the data structure could not be evaluated and assessed for cache-obliviousness; whether the design is efficiently utilizing the cache remains
unknown. Also, given the enormity and complexity of the DataStore, only a
subset of it was looked into. Further, the query evaluation was an attempt to
analyse the query performance and scalability; although rigid conclusions
cannot be drawn based on the results of only a single query.
7.2 Limitations
As mentioned earlier, the evaluation results reported in this
dissertation are only an estimate of the performance of the database system. A
more accurate evaluation of the system would have been possible, had more
benchmark queries been implemented and used. A wider range of benchmark
queries would indicate the true potential of the system and its applicability. Also,
the query performance of a typical column-oriented query could not be analysed
on this database due to time constraints. This would have showcased the true
power of the fundamental column-oriented design of the system. Moreover, the
evaluation of the cache efficiency of the data structure could not be performed
for the same reason.
There exists a design limitation as well. For simplicity, the vEB array
and the Packed Array were implemented as separate structures. It is however
possible to create a single structure that includes both their characteristics.
Separating the structures results in higher memory consumption. However, this
design decision does not affect the query performance in any way.
Another limitation is on the creation of the database. When the database
is being loaded for the first time, it is imperative to supply it with a sufficiently large number of entries. Failure to do so will result in the creation of a static
binary tree of small size, which in turn implies that too many write (append)
operations would likely cause frequent rebalancing of the Packed Array.
A large number of threads get generated in a multi-core environment
when querying the database, and hence more garbage gets created per unit time
[51]. It is therefore essential to analyse the Java Garbage Collector (GC)
behaviour as well, to check for its impact on system performance.
7.3 Future Work
This project involved implementing only a subset of Google DataStore.
To make the database system more robust and complete, features like security,
and fault tolerance need to be in place. A single benchmark query was
implemented, primarily aimed at evaluating the system performance. In reality,
a wide variety of different queries should be executed on the database system,
before arriving at any firm conclusions; the experimental results obtained here
are merely an estimate; they are not exhaustive. Furthermore, a query scheduler
should be developed to manage all (multiple) queries made to the system. Such
a mechanism would build a database that handles all queries made to it in a
uniform manner, and hence a more realistic one. There should be an appropriate access
control mechanism incorporated to make the database more secure. Also, the
synchronization mechanism to handle concurrent writes to the database needs to
be completed (partially implemented right now). All such enhancements will
build an exact Datastore-like system for multi-core machines.
A very important assessment would have been that of the data structure
itself, in terms of cache efficiency. The structure, although based on a cache-oblivious design, was not verified to ascertain if it was indeed utilising the
cache to improve performance. However, such an evaluation will provide a
better understanding of the utilization of the memory hierarchy, which can be
valuable to bring about enhancements in system performance. Profiling
information as well as use of performance counters can provide useful
information about the query performance and the system bottlenecks.
An interesting area of research would be to integrate thread-level
parallelism into existing cluster-based systems like DataStore. With the huge
volumes of data available and the equally complex computations performed on
them nowadays, it is necessary to research further into this area. This
investigation should essentially aim to utilise the power of multiple cores
available to every machine in a cluster and thus improve performance
considerably.
In this project, by means of a single query execution on a subset of a
DataStore-like system, the challenges and shortcomings associated with query
execution in a multi-core environment, were exposed. Most importantly, it
demonstrated the advantage of introducing thread-level parallelism into huge
column-oriented database systems. This advantage therefore, could be taken
forward; thus integrating multi-core with cluster-based systems and achieving
twofold benefits by supporting both thread and process parallelism.
Appendix 1
TPC-H Query 17 Execution Results
1 a. Execution Results on Mcore48 (AMD Opteron 6174, 48 core processor)
Small Dataset: 1 GB

Number of Threads | Mean Execution Time (T) in sec | Confidence Interval (CI) | Standard Deviation (STDEV) | Standard Error (STDERR)
1  | 146.97 | 2.81 | 4.53 | 1.43
2  | 131.43 | 2.26 | 3.64 | 1.15
4  | 55.44  | 0.84 | 1.35 | 0.43
8  | 32.1   | 0.43 | 0.7  | 0.22
16 | 26.15  | 0.42 | 0.68 | 0.22
24 | 24.93  | 0.37 | 0.6  | 0.19
32 | 24.44  | 0.38 | 0.61 | 0.19
40 | 23.7   | 0.38 | 0.62 | 0.2
48 | 22.96  | 0.35 | 0.57 | 0.18

Medium Dataset: 3 GB

Number of Threads | Mean Execution Time (T) in sec | Confidence Interval (CI) | Standard Deviation (STDEV) | Standard Error (STDERR)
1  | 1440.02 | 13.93 | 22.48 | 7.11
2  | 1302.51 | 27.02 | 43.59 | 13.78
4  | 616.88  | 25.56 | 41.24 | 13.04
8  | 478.07  | 9.06  | 14.62 | 4.62
16 | 476.95  | 10.72 | 17.3  | 5.47
24 | 454.96  | 11.63 | 18.77 | 5.94
32 | 439.34  | 23.4  | 37.76 | 11.94
40 | 430.39  | 23.09 | 37.25 | 11.78
48 | 426.38  | 21.84 | 35.23 | 11.14

Large Dataset: 5 GB

Number of Threads | Mean Execution Time (T) in sec | Confidence Interval (CI) | Standard Deviation (STDEV) | Standard Error (STDERR)
1  | 4183.29 | 91.59 | 147.78 | 46.73
2  | 3844.93 | 88.98 | 143.56 | 45.4
4  | 2144.93 | 55.18 | 89.03  | 28.15
8  | 1576.64 | 29.01 | 46.8   | 14.8
16 | 1405.18 | 14.57 | 23.51  | 7.43
24 | 1290.67 | 8.12  | 13.1   | 4.14
32 | 1190.85 | 7.86  | 12.68  | 4.01
40 | 1155.63 | 10.31 | 16.63  | 5.26
48 | 1134.43 | 8.5   | 13.72  | 4.34
Mcore48 1 GB

S.No. | Number of Threads | Avg. Time (xmean in s) | Absolute Speedup (T1 / Tp) | Efficiency
1 | Sequential | 291.27 |        |
2 | 2          | 131.43 | 2.216  | 1.108
3 | 4          | 55.44  | 5.254  | 1.314
4 | 8          | 32.1   | 9.074  | 1.134
5 | 16         | 26.15  | 11.138 | 0.696
6 | 24         | 24.93  | 11.684 | 0.487
7 | 32         | 24.44  | 11.918 | 0.372
8 | 40         | 23.7   | 12.29  | 0.307
9 | 48         | 22.96  | 12.686 | 0.264

Mcore48 3 GB

S.No. | Number of Threads | Avg. Time (xmean in s) | Absolute Speedup (T1 / Tp) | Efficiency
1 | Sequential | 2605.49 |       |
2 | 2          | 1302.51 | 2     | 1
3 | 4          | 616.88  | 4.224 | 1.056
4 | 8          | 478.07  | 5.45  | 0.681
5 | 16         | 476.95  | 5.463 | 0.341
6 | 24         | 454.96  | 5.727 | 0.239
7 | 32         | 439.34  | 5.93  | 0.185
8 | 40         | 430.39  | 6.054 | 0.151
9 | 48         | 426.38  | 6.111 | 0.127

Mcore48 5 GB

S.No. | Number of Threads | Avg. Time (xmean in s) | Absolute Speedup (T1 / Tp) | Efficiency
1 | Sequential | 7169.24 |       |
2 | 2          | 3844.93 | 1.865 | 0.933
3 | 4          | 2144.93 | 3.342 | 0.836
4 | 8          | 1576.64 | 4.547 | 0.568
5 | 16         | 1405.18 | 5.102 | 0.319
6 | 24         | 1290.67 | 5.555 | 0.231
7 | 32         | 1190.85 | 6.02  | 0.188
8 | 40         | 1155.63 | 6.204 | 0.155
9 | 48         | 1134.43 | 6.32  | 0.132
1 b. Execution Results on Janus (Intel Xeon quad-core processor)
Small Dataset: 100 MB

Number of Threads | Mean Execution Time (T) in sec | Confidence Interval (CI) | Standard Deviation (STDEV) | Standard Error (STDERR)
1 | 0.82 | 0.02 | 0.04 | 0.01
2 | 0.74 | 0.01 | 0.02 | 0.01
4 | 0.45 | 0.01 | 0.02 | 0.01
8 | 0.43 | 0.02 | 0.04 | 0.01

Medium Dataset: 500 MB

Number of Threads | Mean Execution Time (T) in sec | Confidence Interval (CI) | Standard Deviation (STDEV) | Standard Error (STDERR)
1 | 9.78 | 0.1  | 0.16 | 0.05
2 | 9.24 | 0.06 | 0.1  | 0.03
4 | 4.32 | 0.05 | 0.08 | 0.03
8 | 3.43 | 0.08 | 0.13 | 0.04

Large Dataset: 1 GB

Number of Threads | Mean Execution Time (T) in sec | Confidence Interval (CI) | Standard Deviation (STDEV) | Standard Error (STDERR)
1 | 281.18 | 2.84 | 4.58 | 1.45
2 | 274.09 | 0.95 | 1.54 | 0.49
4 | 251.84 | 1.65 | 2.66 | 0.84
8 | 250.33 | 0.92 | 1.49 | 0.47
JANUS 100 MB

S.No. | Number of Threads | Avg. Time (xmean in s) | Absolute Speedup (T1 / Tp) | Efficiency
1 | Sequential | 1.22 |       |
2 | 2          | 0.74 | 1.649 | 0.825
3 | 4          | 0.45 | 2.711 | 0.678
4 | 8          | 0.43 | 2.837 | 0.355

JANUS 500 MB

S.No. | Number of Threads | Avg. Time (xmean in s) | Absolute Speedup (T1 / Tp) | Efficiency
1 | Sequential | 19.44 |       |
2 | 2          | 9.24  | 2.104 | 1.052
3 | 4          | 4.32  | 4.5   | 1.125
4 | 8          | 3.43  | 5.668 | 0.709

JANUS 1 GB

S.No. | Number of Threads | Avg. Time (xmean in s) | Absolute Speedup (T1 / Tp) | Efficiency
1 | Sequential | 501.41 |       |
2 | 2          | 274.09 | 1.829 | 0.915
3 | 4          | 251.84 | 1.991 | 0.498
4 | 8          | 250.33 | 2.003 | 0.25
Appendix 2
IMPORTANT NOTE:
A project involving Google BigTable is also being designed by my colleague.
However, the choice of the research area, scope of implementation,
methodology/approach, supported features, and the choice of programming
language vary completely. Hence, the two projects are separate and unconnected. The individual projects are thus being conducted independently of each other, with the approval and under the guidance of my supervisor.
References
1. GOOGLE APP ENGINE.
http://code.google.com/appengine/docs/whatisgoogleappengine.html, last
visited on March 25, 2011.
2. GOOGLE DATASTORE.
http://code.google.com/appengine/docs/java/datastore/, last visited on
March 25, 2011.
3. CHANG, F., GHEMAWAT, S., DEAN, J. et al. Bigtable: A Distributed
Storage System for Structured Data. 7th OSDI (Nov. 2006)
4. STONEBRAKER, M. The case for shared nothing. Database Engineering
Bulletin 9, 1 (Mar. 1986), 4.9.
5. GHEMAWAT, S., GOBIOFF, H., AND LEUNG, S.-T. The Google File
system. In Proc. of the 19th ACM SOSP (Dec.2003).
6. BURROWS, M. The Chubby lock service for loosely coupled distributed
systems. In Proc. of the 7th OSDI (Nov. 2006).
7. APACHE HBASE. http://hbase.apache.org/, last visited on March 25, 2011.
8. UNDERSTANDING HBASE AND BIGTABLE.
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTa
ble, last visited on April 27, 2011.
9. ZOOKEEPER. http://wiki.apache.org/hadoop/ZooKeeper, last visited on
April 29, 2011.
10. ADVANCED HBASE. http://www.docstoc.com/docs/66356954/AdvancedHBase, last visited on April 27, 2011.
11. HFILE: A BLOCK-INDEXED FILE FORMAT TO STORE SORTED
KEY-VALUE PAIRS. http://www.slideshare.net/schubertzhang/hfile-ablockindexed-file-format-to-store-sorted-keyvalue-pairs, last visited on
April 27, 2011.
12. HADOOP TUTORIAL.
http://developer.yahoo.com/hadoop/tutorial/module1.html, last visited on
April 27, 2011.
13. RANGER, C., et al. “Evaluating MapReduce for Multi-core and
Multiprocessor Systems,” Proceedings of the 2007 IEEE 13th International
Symposium on High Performance Computer Architecture 2007.
14. DEAN, J., GHEMAWAT, S. “MapReduce: Simplified Data Processing on
Large Clusters,” OSDI, 2004.
15. ORAM, A., WILSON, G. “Distributed Programming with MapReduce”,
O'Reilly, 2007.
16. HADOOP MAPREDUCE TUTORIAL.
http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html, last
visited on April 27, 2011.
17. BUILDING A JAVA MAPREDUCE FRAMEWORK FOR MULTICORE
ARCHITECTURES.
http://www.cs.man.ac.uk/~lujanmx/research/docs/kovoor_multiprog2010.pd
f, last visited on April 29, 2011.
18. GARCIA-MOLINA, H., KENNETH, S. Main Memory Database Systems:
An Overview. IEEE Trans. on Knowledge and Data Engineering, Dec 1992.
19. IN-MEMORY DATABASE. http://it.toolbox.com/wiki/index.php/InMemory_Database, last visited on May 2, 2011.
20. FRIGO, M., LEISERSON, C. E., PROKOP, H., RAMACHANDRAN, S.
Cache-oblivious algorithms. Extended Abstract. In Proceedings of the 40th
IEEE Symposium on Foundations of Computer Science, pages, 1999.
21. CACHE-OBLIVIOUS DATA STRUCTURES.
http://blogs.msdn.com/b/devdev/archive/2007/06/12/cache-oblivious-datastructures.aspx, last visited on May 2, 2011.
22. CACHE-OBLIVIOUS ALGORITHMS.
http://qstuff.blogspot.com/2010/06/cache-oblivious-algorithms.html, last
visited on May 3, 2011.
23. CACHE OBLIVIOUS DATA STRUCTURES.
http://bryanpendleton.blogspot.com/2009/06/cache-oblivious-datastructures.html, last visited on May 2, 2011.
24. CACHE-OBLIVIOUS ALGORITHMS.
http://www.itu.dk/~annao/ADT03/lecture10.pdf, last visited on May 3, 2011.
25. VITTER, J.S. External Memory Algorithms and Data Structures: Dealing
with massive data. ACM Computing Surveys, 33(2):209–271, 2001.
26. OLSEN, J.H. and SKOV, S.C. Cache-Oblivious Algorithms in Practice.
Master’s thesis, University of Copenhagen, Copenhagen, Denmark, 2002.
27. PROKOP, H. Cache-Oblivious Algorithms. Master’s thesis, Massachusetts
Institute of Technology, Massachusetts, 1999.
28. BENDER, M.A., DUAN, Z., IACONO, J. and WU, J. A locality-preserving
cache-oblivious dynamic dictionary. Journal of Algorithms, 115-136, 2004.
29. BENDER, M.A, DEMAINE, E.D. and FARACH-COLTON, M. “CacheOblivious B-Trees”, SIAM Journal on Computing, 2005.
30. DEMAINE, E.D. “Cache-Oblivious Algorithms and Data Structures”,
in Lecture Notes from the EEF Summer School on Massive Data Sets,
BRICS, University of Aarhus, Denmark, June 27–July 1, 2002.
31. BENDER, M.A., FINEMAN, J.T., GILBERT, S. and KUSZMAUL, B.C.
Concurrent Cache-Oblivious B-Trees. Proc. of the 17th ACM Symposium on
Parallelism in Algorithms and Architectures (SPAA) Las Vegas. July 2005.
32. COMER, D. The ubiquitous B-Tree. Computing Surveys, 1979.
33. CORMEN, T.H, LEISERSON, C.E., RIVEST, R.L. and STEIN,
C. Introduction to Algorithms, Second Edition. Chapter 12, Section 15.5.
34. MING-YANG-KAO. Encyclopaedia of algorithms. Page 123.
35. TRANSISTOR SIZING ISSUES AND TOOL FOR MULTI-THRESHOLD
CMOS TECHNOLOGY.
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=597182, last visited
on April 24, 2011.
36. EXCERPTS FROM A CONVERSATION WITH GORDON MOORE:
MOORE’S LAW.
ftp://download.intel.com/museum/Moores_Law/VideoTranscripts/Excepts_A_Conversation_with_Gordon_Moore.pdf, last visited
on April 24, 2011.
37. MOORE’S LAW.
http://www.intel.com/technology/mooreslaw/index.htm?iid=tech_2b+rhc_la
w, last visited on April 24, 2011.
38. GARCIA-MOLINA, H., ULLMAN, J.D., WIDOM, J. Database Systems:
The Complete Book. Prentice Hall, 2nd Edition.
39. DUAL-CORE PROCESSORS.
http://www.tomshardware.com/cn/957,review-957.html, last visited on
April 24, 2011.
40. WULF, W. and McKEE, S. “Hitting the Memory Wall: Implications of the
Obvious,” ACM SIGARCH Computer Architecture News, vol.23, 1994.
41. FOSTER, I. Designing and Building Parallel Programs, Addison-Wesley,
1994.
42. AN INTRO. TO MULTIPROCESSOR SYSTEMS.
http://www.realworldtech.com/page.cfm?ArticleID=RWT121106171654&p
=2, last visited on April 24, 2011.
43. THE TROUBLE WITH MULTI-CORE COMPUTERS.
http://www.technologyreview.com/computing/17682/page2/, last visited on
April 24, 2011.
44. FOURTH WORKSHOP ON PROGRAMMABILITY ISSUES FOR
MULTI-CORE COMPUTERS (JAN 2011).
http://multiprog.ac.upc.edu/resources/multiprog11.pdf, pp-3, last visited on
April 24, 2011.
45. AUTONOMIC COMPUTING. http://autonomiccomputing.org/, last visited
on March 25, 2011. , last visited on April 22, 2011.
46. AUTONOMIC COMPUTING.
http://www.research.ibm.com/autonomic/overview/benefits.html, last
visited on April 22, 2011.
47. OPEN MP. http://openmp.org/wp/, last visited on May 4, 2011.
48. THE CILK PROJECT. http://supertech.csail.mit.edu/cilk/, last visited on
May 4, 2011.
49. FORK/JOIN TUTORIAL.
http://download.oracle.com/javase/tutorial/essential/concurrency/forkjoin.ht
ml, last visited on May 4, 2011.
50. HOW TO SURVIVE MULTICORE SOFTWARE REVOLUTION.
http://akira.ruc.dk/~keld/teaching/IPDC_f10/How_to_Survive_the_Multicor
e_Software_Revolution-1.pdf, last visited on May 4, 2011.
51. A JAVA FORK/JOIN FRAMEWORK. LEA, D., SUNY, Oswego.
http://gee.cs.oswego.edu/dl/papers/fj.pdf, last visited on August 30, 2011.
52. PACKAGE JAVA.UTIL.CONCURRENT.
http://download.oracle.com/javase/7/docs/api/java/util/concurrent/packagesummary.html, last visited on August 30, 2011.
53. FORK AND JOIN: JAVA CAN EXCEL AT PAINLESS PARALLEL
PROGRAMMING TOO!, PONGE, J.
http://www.oracle.com/technetwork/articles/java/fork-join-422606.html, last
visited on August 30, 2011.
54. FORK-JOIN DEVELOPMENT IN JAVA™ SE.
http://www.coopsoft.com/ar/ForkJoinArticle.html, last visited on August 30,
2011.
55. FRIGO, M., LEISERSON, C.E. AND RANDALL, K.H. “The
implementation of the Cilk-5 multithreaded language,” SIGPLAN Not.,1998.
56. TPC BENCHMARKTM H (DECISION SUPPORT) STANDARD
SPECIFICATION, Revision 2.8.0.
http://www.tpc.org/tpch/spec/tpch2.8.0.pdf, last visited on August 30, 2011.
57. INTERFACE READWRITELOCK.
http://download.oracle.com/javase/7/docs/api/java/util/concurrent/locks/
ReadWriteLock.html, last visited on August 30, 2011.
58. INTEL HYPER-THREADING TECHNOLOGY.
http://www.intel.com/content/www/us/en/architecture-and-technology/hyperthreading/hyper-threading-technology.html/, last visited on August 30, 2011.