Large Spatial Data Computation
on Shared-Nothing Spatial DBMS Cluster
via MapReduce
Dissertation Report
Submitted in partial fulfillment of the requirements for the degree of
Master of Technology
by
Abhishek Sagar
Roll No: 10305042
under the guidance of
Prof. Umesh Bellur
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
June, 2012
Abstract
Vector spatial data types such as lines, polygons, or regions usually comprise hundreds of latitude-longitude pairs to accurately represent the geometry of spatial features such as towns, rivers, or villages. This makes spatial data operations computationally and memory intensive. Moreover, certain real-world scenarios generate extremely large amounts of spatial data: NASA's Earth Observing System (EOS), for instance, generates 1 terabyte of data every day. One way to deal with this is to distribute the spatial operations among multiple computational nodes. Parallel spatial databases attempt to do this, but only at very small scales (on the order of a few tens of nodes at most). Another approach is to use a distributed framework such as MapReduce, since spatial data is cleanly distributable by exploiting its spatial locality. MapReduce affords us the advantage of harnessing commodity hardware operating in a shared-nothing mode while also lending robustness to the computation, since parts of the computation can be restarted on failure. At the same time, however, MapReduce is a purely batch-processing paradigm and scores poorly on performance, because it supports no indexing and operates on unstructured data. Parallel spatial DBMSs, on the other hand, tend to give high performance but have limited scalability, whereas MapReduce tends to deliver scalability but poor performance. An approach is therefore required that allows us to process large amounts of spatial data on potentially thousands of machines while maintaining a reasonable performance level. In this effort, we present HadoopDB - a combination of Hadoop and Postgres spatial - to efficiently handle computations on large spatial data sets. In HadoopDB, Hadoop is employed as a means of coordinating the computational nodes, each of which performs the spatial query on a part of the data set. The Reduce stage collates the partial results to yield the result of the original query. Thus, with HadoopDB, we intend to reap the benefits of two technologies: spatial DBMSs (performance) and MapReduce (scalability and fault tolerance). We present performance results showing that common spatial queries yield a speedup that is nearly linear in the number of Hadoop processes deployed.
Contents

1 Introduction
1.1 Background
1.1.1 Parallel Spatial DBMS
1.1.2 MapReduce - an alternate solution
1.2 MapReduce vs Parallel DBMSs
1.3 Geospatial Processing Environment
1.4 Outline

2 Literature Survey
2.1 Problem Formulation
2.2 Literature Survey

3 Integrating Geoserver, Hadoop and PostGIS
3.1 Hadoop Distributed File System
3.1.1 Goals
3.1.2 Architecture
3.2 Integrating postGIS
3.3 Integrating Geoserver as Front-End
3.4 Query Execution Steps
3.5 Challenges faced
3.6 Summary

4 Vector Data Distribution
4.1 Partitioning Strategy
4.2 Partition Skew
4.3 Load Balancing
4.4 Spatial Data Partitioning using Hilbert Curve

5 Performance Evaluation
5.1 Highly Selective Spatial Queries
5.2 Spatial Join Queries
5.3 Global Sorting
5.4 Queries against shared-nothing restriction
5.5 Fault Tolerance Test

6 Summary and Conclusion
6.1 Summary
6.2 Conclusion
6.3 Future Work

Bibliography

List of Figures

1.1 Spatial Join via MapReduce
1.2 GeoServer Architecture [21]
1.3 Geoserver Deployment as Web Application
3.1 Overall System Architecture: GeoServer + Hadoop + postGIS
3.2 Hadoop Architecture
3.3 Hadoop with Database (postGIS)
3.4 Logical View of System Architecture
3.5 MapReduce Job Compilation by Geoserver
3.6 Geoserver front end to Hadoop cloud
3.7 Query Execution Plan
4.1 Decomposition of the Universe into Partitions
4.2 Tile Based Partitioning Scheme
4.3 Initial Distribution
4.4 Distribution after Tile Partitioning
4.5 Hilbert Space Filling Curve
5.1 Performance evaluation of Highly-Selective Query
5.2 Performance evaluation of Spatial Join Query
5.3 Performance evaluation of Global Sort Query
5.4 Performance evaluation of Against-Shared-Nothing Query
5.5 Fault Tolerance comparison of Hadoop+HDFS with Hadoop+DB
6.1 Comparison Chart (contd.)
6.2 Comparison Chart

List of Tables

3.1 SQL to MapReduce Mapping
5.1 Test Data Description
5.2 Hardware Description
Chapter 1
Introduction
A geographic information system (GIS) captures, stores, analyzes, manages, and presents spatial data along with relevant non-spatial information. GIS forms the core of applications in areas as varied as agriculture and consumer applications such as location-based services. Today, many computer applications, directly or indirectly, carry out spatial analysis at the back end. Spatial analysis involves spatial operations performed on spatial data. We represent spatial features such as roads, towns, and cities as vector data. Vector data is a collection of latitude-longitude pairs, called geospatial points, structured so as to represent the geometry of spatial features; an example is the use of vectored polygons to represent city or state boundaries. To represent the road network of the state of Arizona, for instance, we require approximately 10 million points, each of which is a latitude-longitude coordinate. The number of geospatial coordinates required to represent the geometry of a real-world object varies from a few hundred to tens of thousands. Spatial operations, such as an overlap test (to check whether two areas overlap), are performed on sets of vector spatial data and are generally implementations of geometric algorithms. Because of the enormous number of points required to represent a single spatial object and the complexity of geometric algorithms, carrying out spatial computation on real-world data sets is resource intensive: a 2-core, 1.5 GiB machine shows sustained 75-85% CPU utilization for join queries. In addition, enormous quantities of spatial data are constantly being generated from sources such as satellites, sensors, and mobile devices; NASA's Earth Observing System (EOS), for instance, generates 1 terabyte of data every day. Therefore, we consider spatial operations a strong candidate for parallelism.
1.1 Background
Two widely used technologies for distributed computation of spatial data are parallel spatial DBMSs and MapReduce. Neither technology is a substitute for the other; in fact, they are complementary. We discuss the pros and cons of each in the domain of large spatial data set processing.
1.1.1 Parallel Spatial DBMS
Parallel spatial DBMSs such as Oracle Spatial are widely used for carrying out parallel computation on spatial data across a cluster of machines. Parallel DBMSs are a mature technology, in existence for about 30 years, and have been heavily optimized to yield high performance, yet they do not score well in terms of scalability. Aster Data, a parallel database known to possess among the best scalability in the parallel database community, scales to around 330-350 nodes. In parallel DBMSs, the intermediate results of a query are pipelined to the next query operator or sub-query without being written to disk. If any sub-query fails, the intermediate results processed so far are lost and the entire query has to be restarted. Not writing intermediate data to disk yields high performance, but it prevents parallel DBMSs from exhibiting good fault tolerance. As the size of a cluster of commodity machines grows, the probability of node or task failure also increases, and failures are likely to become frequent if the parallel DBMS cluster size is increased to the order of a few hundred nodes, which would significantly degrade performance. Thus, poor fault tolerance puts an upper bound on the cluster size of parallel DBMSs (up to a few tens of nodes), leaving them with limited scalability as well.
1.1.2 MapReduce - an alternate solution
Given these drawbacks of parallel DBMSs, over the last two to three years MapReduce [1] has attracted researchers as an alternative for studying the parallelization of spatial operations and evaluating their performance in a distributed framework. MapReduce, a distributed parallel programming model developed by Google, provides a framework for processing large volumes of data, on the order of hundreds of terabytes, across thousands of shared-nothing commodity machines. The scalability and fault tolerance of MapReduce allow us to employ a sufficiently large number of commodity machines for data-intensive computations. The MapReduce model does not require the programmer to understand the parallelism inherent in the paradigm; it is a high-level parallel programming model that lets the programmer focus on the core problem logic rather than on parallel programming concerns such as synchronization and deadlock.
The MapReduce programming model requires the programmer to provide the implementation of two functions: Map and Reduce. The Map function partitions the input data to be processed, preferably into disjoint sets, and each set is then handed to a Reduce function for further processing. Key-value pairs form the basic data structure in MapReduce. The input to the Map function is a key-value pair (k1, v1), the key k1 being the byte offset of a record within the input file and the value v1 being the record line. Map outputs a set of intermediate key-value pairs, [(k2, v2)]. The MapReduce library implements the shuffle phase, which lies between the Map and Reduce phases: it rearranges the intermediate Map output and aggregates all values associated with the same key into a (key, list(values)) pair, which forms the input to the Reduce phase that follows. The final phase is the Reduce phase, which processes the list of values associated with each key. Identical Reducer functions execute in parallel on worker nodes, and the output of the Reducers is the final output written back onto HDFS. The MapReduce programming model is thus used to carry out distributed computation on clusters of shared-nothing machines.
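To make the key-value contract above concrete, the following is a minimal sketch (not taken from this system's code) of a Hadoop Map/Reduce pair that counts spatial objects per region. It assumes line-oriented text input of the form regionId<TAB>geometryWKT; the class and field names are illustrative only.

// Map: (k1 = byte offset, v1 = record line) -> [(k2 = regionId, v2 = 1)]
// Reduce: (k2, list(v2)) -> (k2, count); the shuffle phase groups values by key.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ObjectCountPerRegion {

  public static class CountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length >= 2) {
        context.write(new Text(fields[0]), ONE);   // emit one intermediate pair per object
      }
    }
  }

  public static class CountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text regionId, Iterable<IntWritable> ones, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (IntWritable one : ones) count += one.get();  // aggregate the grouped values
      context.write(regionId, new IntWritable(count));
    }
  }
}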
The Apache Hadoop [2] software library is a framework that allows for the distributed processing of large data sets across clusters of computers using MapReduce. It is designed to scale
up from single servers to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect
and handle failures at the application layer, so delivering a highly-available service on top of a
cluster of computers, each of which may be prone to failures. Hadoop Distributed File System
(HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple
replicas of data blocks and distributes them on compute nodes throughout a cluster to enable
availability and reliability.
MapReduce in the spatial domain: The MapReduce programming paradigm is, by its design, well suited to distributed processing of spatial data, and most spatial query logic can be mapped efficiently to Map and Reduce functions. For straightforward spatial queries limited to a WHERE clause and without joins, the MapReduce formulation requires only a Map function: input tuples are tested against the WHERE criteria in the map phase itself. For spatial queries involving spatial joins between two data sets, a Reduce phase is required in addition to the Map phase. Spatial joins are generally performed on spatial objects that are spatially proximal. The Map phase reads the input tuples of the join operand data sets and creates groups, each containing only the spatial objects that lie within a pre-defined spatial boundary. Each group is then processed for the spatial join in parallel by Reduce processes on the cluster machines. Thus, vector spatial data is by its nature well suited to being processed on clusters following a shared-nothing architecture. MapReduce's Map and Reduce processes execute in isolation from each other; there is no inter-process communication whatsoever. Processing all spatial objects enclosed within one finite geographical boundary on a single machine eliminates much of the need for MapReduce processes to interact with each other, thus abiding by Hadoop's shared-nothing architecture.
Figure 1.1 shows the MapReduce formulation of a spatial join between two heterogeneous data sets: Rivers (linestrings) and Settlements (polygons). The Map phase partitions the spatial objects of the two data sets and creates groups, each containing the spatial objects that lie within a pre-defined spatial boundary. For example, 1:River, 2:Setmt, and 4:Setmt are grouped together to form a group identified by key 1; similarly, 2:Setmt and 3:River are grouped together with group key 3. The shuffle phase migrates all spatial objects associated with the same group key from the different Mappers onto a single machine over the network; in the figure, 1:River, 2:Setmt, and 4:Setmt are collated onto a single machine. After the shuffle phase, a Reducer starts on each machine and processes all the spatial objects associated with one group. Here, three Reducers start, corresponding to the three groups, each of which finds the set of settlements crossed by a single river; for example, 2:Setmt and 4:Setmt are crossed by 1:River. The Reducers independently write their final output onto HDFS.
Figure 1.1: Spatial Join via MapReduce
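The following is a rough Java sketch of the formulation in Figure 1.1, assuming the open-source JTS geometry library and a text input where each line is type<TAB>objectId<TAB>WKT with type being RIVER or SETMT; the grid origin and tile size used to derive the group key are illustrative assumptions, not values taken from this work.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKTReader;

public class SpatialJoinSketch {
  static final double MIN_X = -125.0, MIN_Y = 25.0, TILE_SIZE = 1.0; // assumed grid

  // Map: tag each object with the id of the tile containing its centroid,
  // so that spatially proximal objects end up in the same group.
  public static class TileMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final WKTReader reader = new WKTReader();
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      try {
        String[] f = line.toString().split("\t", 3);      // type, id, wkt
        Geometry g = reader.read(f[2]);
        int col = (int) ((g.getCentroid().getX() - MIN_X) / TILE_SIZE);
        int row = (int) ((g.getCentroid().getY() - MIN_Y) / TILE_SIZE);
        ctx.write(new IntWritable(row * 1000 + col), line); // simple tile key
      } catch (ParseException e) {
        // skip malformed geometries
      }
    }
  }

  // Reduce: within one tile, report every settlement crossed by a river.
  public static class JoinReducer extends Reducer<IntWritable, Text, Text, Text> {
    private final WKTReader reader = new WKTReader();
    @Override
    protected void reduce(IntWritable tile, Iterable<Text> objects, Context ctx)
        throws IOException, InterruptedException {
      List<String[]> rivers = new ArrayList<>(), setmts = new ArrayList<>();
      for (Text t : objects) {
        String[] f = t.toString().split("\t", 3);
        (f[0].equals("RIVER") ? rivers : setmts).add(f);
      }
      try {
        for (String[] r : rivers) {
          Geometry riverGeom = reader.read(r[2]);
          for (String[] s : setmts) {
            if (riverGeom.crosses(reader.read(s[2]))) {
              ctx.write(new Text(r[1]), new Text(s[1])); // river id, settlement id
            }
          }
        }
      } catch (ParseException e) {
        // ignore malformed geometries
      }
    }
  }
}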
1.2 MapReduce vs Parallel DBMSs
Processing large amounts of spatial data has become a critical issue in recent times. Parallel DBMS technology has been widely used for processing large volumes of vector data, but with the ever increasing need to process larger and larger spatial data sets, a parallel DBMS alone is no longer sufficient for this purpose. We compare MapReduce and parallel DBMSs with respect to scalability, fault tolerance, and performance [3].
1. Scalability: Parallel database systems scale well into the tens and even low hundreds of machines. Unfortunately, parallel database systems as they are implemented today, unlike Hadoop, do not scale into the realm of many thousands of nodes. Enormous quantities of spatial data are constantly being generated from sources such as satellites, sensors, and mobile devices; NASA's Earth Observing System (EOS), for instance, generates 1 terabyte of data every day [19]. Processing such large volumes of spatial data on a daily basis requires a much larger number of machines, possibly on the order of a few thousand, which parallel DBMS technology does not support.
2. Fault tolerance: Fault tolerance is the ability of the system to cope with node or task failures. A fault-tolerant analytical DBMS is simply one that does not have to restart a query if one of the nodes involved in query processing fails. The fault tolerance of a parallel DBMS is much inferior to that of Hadoop, which has been designed specifically for excellent scalability and fault tolerance. The amount of work lost when a node in the cluster fails is greater in a parallel DBMS than in Hadoop. In a parallel DBMS, the intermediate results of a query are pipelined to the next query operator or sub-query without being written to disk; if any sub-query fails, the intermediate results processed so far are lost and the entire query has to be restarted. In Hadoop, by contrast, the intermediate results of the Mappers (or Reducers) are always written to disk before being fetched by the Reducers (or by the Mappers of the next MapReduce stage). Thus, instead of pipelining intermediate results to subsequent processes, Hadoop processes are themselves pipelined to operate on the target data. In case of a task or node failure, the same task is restarted on another node to operate on the target intermediate data, which still exists on disk.
3. Performance: Parallel DBMSs have been designed for interactive use, where performance is paramount, whereas Hadoop has been designed for batch processing. Hadoop was not originally designed for structured data analysis and is therefore significantly outperformed by parallel database systems on structured data analysis tasks. In fact, Hadoop takes around 10-11 seconds just to initiate distributed processing on a 3-4 node cluster, whereas a parallel DBMS finishes much of the computation in that time. Hadoop's slower performance also stems from storing data in the accompanying distributed file system (HDFS) in the same textual format in which the data was generated. Consequently, this default storage method places the burden of parsing the fields of each record on user code: each Map and Reduce task must repeatedly parse and convert string fields into the appropriate types. This further widens the performance gap between MapReduce and parallel DBMSs [3].
To summarize, MapReduce offers excellent scalability and fault tolerance, which make it a suitable programming model for processing large data sets on sufficiently large clusters of commodity machines, whereas parallel DBMS technology is limited to cluster sizes of up to a few dozen nodes but clearly outperforms MapReduce in terms of performance.
1.3 Geospatial Processing Environment
GeoServer is an open-source server, written in Java, that allows users to share and edit geospatial data. Designed for interoperability, it publishes data from any major spatial data source using open standards. GeoServer has evolved into an easy method of connecting existing information to virtual globes such as Google Earth and NASA World Wind, as well as to web-based maps such as OpenLayers, Google Maps, and Bing Maps [20]. Geoserver marks the beginning of standardization in the GIS arena: it follows the WMS, WFS, and WCS specifications to the letter and forms a platform for developing GIS applications based on these specifications.
Geoserver Design: Figure 1.2 shows the different components of Geoserver. At a high level, Geoserver consists of many different modules that actively interact with each other. Geoserver reads data in a wide variety of formats, from PostGIS, Oracle Spatial, and ArcSDE to shapefiles and GeoTIFF, and it can produce KML, GML, shapefiles, GeoRSS, GeoJSON, and a multitude of other formats. Geoserver essentially has two aspects: the configuration and data store aspect, and the rendering aspect. All configuration in Geoserver is done through the admin interfaces and XML configuration files. A DataStore is essentially a source of data for rendering features; Geoserver supports many different data stores, including Web Feature Server, property files, shapefiles, and databases. A CoverageStore is another entity at the same level as a DataStore but covers raster-based data formats such as ArcGrid, GeoTIFFs, and image mosaics. The rendering components follow the WMS, WFS, and WCS specifications and use GeoTools as the rendering API.
Figure 1.2: GeoServer Architecture [21]
Geoserver Deployment: Figure 1.3 shows the deployment of Geoserver. At run time, actions performed by the user on the client side are translated into HTTP requests by JavaScript code and sent to the server, where the data satisfying the request is selected and sent back to the client as HTML, JavaScript, and raster data. In detail, a GIS query is transmitted as a GET or POST request written according to either the WFS or the WMS specification. The request is captured by the generated GIS application and forwarded to Geoserver which, in turn, interprets the query, composes the SQL statement according to the PostGIS DML, and sends it to the DBMS. Once the DBMS computes the query, the results are gathered by Geoserver to create the answer. In particular, in the case of a WMS request, Geoserver computes a raster map containing the results encoded in a standard picture format (GIF, PNG, SVG, etc.). In the case of a WFS request, Geoserver collects data from the DBMS and returns GML (Geography Markup Language) encoded data to the generated GIS server application, which further processes the resulting GML data and sends it back to the client side in HTML format.
Figure 1.3: Geoserver Deployment as Web Application
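For illustration, a WFS GetFeature request of the kind described above might look like the following; the host name and the layer name (topp:states, GeoServer's sample layer) are placeholders rather than part of this system:

http://localhost:8080/geoserver/wfs?service=WFS&version=1.1.0&request=GetFeature&typeName=topp:states&maxFeatures=10

Geoserver answers such a request with a GML feature collection, which the generated GIS application then post-processes as described above.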
1.4 Outline
The remainder of this report is organised as follows. Chapter 2 presents the problem statement, followed by related work in the field of spatial data computation on a Hadoop cloud. In Chapter 3, we present the details of integrating postGIS with the MapReduce paradigm. Chapter 4 discusses the strategy for distributing vector data across the cluster nodes. In Chapter 5 we present a set of benchmarks that quantify the benefit obtained by bringing spatial DBMSs and MapReduce together in the spatial domain. Chapter 6 summarizes and concludes the work with an outline of future work.
Chapter 2
Literature Survey
2.1 Problem Formulation
Parallel spatial DBMSs such as Oracle Spatial have been used for carrying out spatial analysis on moderately large spatial data sets. Today's spatial DBMSs support a variety of spatial indexing mechanisms that enable them to process spatial queries very fast. However, parallel DBMSs, because of their limited scalability, fail to handle the ever increasing size of spatial repositories. To overcome this barrier, researchers have focused on MapReduce as an alternative which is capable of expressing a variety of spatial operations such as spatial joins [7],[8],[9], nearest neighbor queries [5], and Voronoi diagram construction [15], and which, unlike parallel DBMSs, can process much larger spatial repositories on thousands of commodity machines in parallel. But MapReduce is a batch-processing paradigm, and its brute-force style of data processing makes it unsuitable for real-time spatial data analysis. Spatial indices are a solution to this problem, but they are not supported by the current implementation of Hadoop's MapReduce. Hence the two technologies actually go in opposite directions, and neither is good at what the other does well.
Thus, we believe that in order to facilitate the processing of large spatial data sets while maintaining a reasonable performance level, there is a need to study the behavior of spatial operations in an integrated environment of MapReduce systems and spatial DBMSs.
2.2 Literature Survey
To begin with, [10] discusses the implementation of common spatial operators, such as geometry intersection, on the Hadoop platform by transforming them into the MapReduce paradigm. It sheds light on which input and output key-value pairs a MapReduce programmer should choose for both Mappers and Reducers to effectively partition the data across the slave machines of a cluster and carry out the spatial computation in parallel. The paper also presents a performance evaluation of spatial operations comparing a spatial database with the Hadoop platform. The results demonstrate the feasibility and efficiency of the MapReduce model and show that cloud computing technology has the potential to be applicable to more complex spatial problems.
[6] discusses the implementation in the MapReduce paradigm of spatial queries that involve a spatial join between two or more heterogeneous spatial data sets. The strategies discussed include a strip-based plane-sweeping algorithm, a tile-based spatial partitioning function, and duplication-avoidance technology. The paper experimentally demonstrates the performance of the SJMR (Spatial Join with MapReduce) algorithm in various situations with real-world data sets and establishes the applicability of expressing computation-intensive spatial applications as MapReduce jobs on clusters.
[7] is optimization work built on [6]. It discusses effective strategies for partitioning the spatial data in the Map phase so that the Reducers running on the slave machines get a fair share of data to process, i.e., no Reducer should receive very little data while other Reducers are overwhelmed with data to be processed. The paper presents experimental statistics and results that show the improvement in overall cluster utilization, memory utilization, and run time of the Hadoop job.
[12] discusses strategies for transforming various graph algorithms, such as Dijkstra's single-source shortest path, bipartite matching, approximate vertex and edge covers, and minimum cuts, into MapReduce form. Generally, MapReduce algorithms that manipulate graphs are iterative in nature: the problem is solved by processing the graph through a MapReduce pipeline, with each iteration converging towards a solution. The key challenge in transforming graph-based problems into MapReduce form is the partitioning of the graph. It has to be done carefully, since the slave machine processing one part of the graph has no information whatsoever about the rest of the graph. The graph must therefore be partitioned among the slave machines so that each machine can perform the computation independently on its own share of the graph without needing to be aware of the remaining portion.
[11] discusses a three-stage MapReduce solution to the spatial problem ANN (All Nearest Neighbors) [5]. The Map phase partitions the spatial objects and groups together all those that lie close to each other within a rectangular 2D space called a partition. The algorithm requires just one MapReduce stage if every object's nearest neighbor is present within the same partition as the object itself. But the nearest neighbor of an object may belong to an adjacent partition; objects whose NN is not guaranteed to exist within their own partition are called pending elements. The authors make use of an intermediate data structure called pending files, in which they record each pending element and the candidate partitions that could contain its NN. The input to the Map phase of the next MapReduce stage is the pending files plus the original data source, and the output of the second MapReduce stage produces the final result, in which every element is guaranteed to find its nearest neighbor. Through this approach, the MapReduce programming model overcomes the communication barrier between slave nodes: the Hadoop platform does not allow slave nodes to share information while a MapReduce task is executing, but the data partition that a particular slave node processed is made available to another slave node in the next MapReduce stage.
[13] discusses a MapReduce strategy for building indexes over large spatial data sets. It presents a MapReduce formulation for constructing an R-tree [14] in parallel on the slave machines of a cluster. The maximum size of the R-tree that can be constructed is limited by the total main memory of all the slave machines in the cluster.
[4] presents a comparative study of MapReduce-based systems against parallel DBMSs with respect to three main aspects: performance, fault tolerance, and scalability. It argues that the two systems are actually complementary to each other, that neither is good at what the other does well, and that they target quite different applications. For one-time data processing and batch applications, MapReduce is favorable, whereas for interactive applications demanding high performance, a parallel DBMS is the favorable technology.
Gap Analysis
So far we have seen that, over the past two to three years, there has been a plethora of research on formulating spatial operations as MapReduce problems. But MapReduce-based systems were originally designed for one-time data processing such as log file analysis: in practice they process large data sets with nearly fixed MapReduce jobs, and once the data is processed it is offloaded permanently from the system. Contrary to this, most spatial data is not of the one-time-processing type and requires constant probing from the user's end to derive ever more meaningful results. From this perspective, we believe MapReduce alone is not a suitable programming model for carrying out spatial analysis; the lack of indexing support makes it further unsuitable for this purpose. A spatial DBMS, on the other hand, is a well-known, mature and, unlike MapReduce, significantly optimized technology that has been employed for years for spatial analysis. Therefore, while MapReduce enables us to carry out data processing across large clusters, it does not score on performance; a DBMS, on the other hand, yields high performance but scores poorly on scalability. Even for moderate spatial data sets (250 MB: California roads and counties [18]), on a 3-node Hadoop cluster (the same as the one set up in the experiment section) we observed that MapReduce takes as much as 6-7 minutes to process join queries, whereas a single postGIS instance outputs the results in only 40-50 seconds. Thus, even a 3-node Hadoop cluster succumbs to the optimized capabilities of a single postGIS instance in terms of performance.
We therefore consider that, to cope with the ever increasing size of spatial repositories, we need to study the behavior of systems that can employ potentially thousands of machines yet maintain a reasonable performance level while targeting data analysis over large spatial data sets.
Chapter 3
Integrating Geoserver, Hadoop and PostGIS
As a solution to the problem presented in the previous chapter in the domain of large spatial data analysis, in this chapter we integrate MapReduce with spatial DBMS technology to create a real spatial processing environment. We present a step-by-step discussion of the assembly of three systems: GeoServer, Hadoop, and PostGIS. We start with the open-source Apache Hadoop project as the basic software platform on top of which we build the remaining components; it forms the core component of the overall system. In the second step, we replace Hadoop's default data storage, the Hadoop Distributed File System (HDFS), with postGIS database servers, which store the spatial test data sets as tables, with one instance active on each cluster node. In the third step, we add Geoserver as the front end of the system, which provides a high-level spatial SQL interface for the end user to query the cluster data.
Figure 3.1 shows the top-level architecture of the overall system we intend to build. Geoserver serves as the front end of the system, enabling the user to monitor and interact with the cloud at the back end. The Hadoop Master Node is in direct communication with Geoserver and in turn monitors and coordinates the activities of the rest of the cloud through the usual MapReduce paradigm. This system distinguishes itself from many of the current parallel and distributed databases by dynamically monitoring and adjusting for slow nodes and node failures to optimize performance in heterogeneous clusters. Especially in cloud computing environments, where there may be dramatic fluctuations in the performance and availability of individual nodes, fault tolerance and the ability to perform in heterogeneous environments are critical.
In the remainder of this chapter, we first discuss the basic architecture and functioning of Hadoop, followed by a discussion of the postGIS and Geoserver integration into the system.
Figure 3.1: Overall System Architecture: GeoServer + Hadoop + postGIS
3.1 Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS is now an Apache Hadoop subproject.
3.1.1 Goals
• Hardware Failure: Hardware failure is the norm rather than the exception. An HDFS
instance may consist of hundreds or thousands of server machines, each storing part of
the file system’s data. The fact that there are a huge number of components and that each
component has a non-trivial probability of failure means that some component of HDFS is
always non-functional. Therefore, detection of faults and quick, automatic recovery from
them is a core architectural goal of HDFS.
• Streaming Data Access: Applications that run on HDFS need streaming access to their
data sets. They are not general purpose applications that typically run on general purpose
file systems. HDFS is designed more for batch processing rather than interactive use by
users. The emphasis is on high throughput of data access rather than low latency of data
access.
• Large Data Sets: Applications that run on HDFS have large data sets. A typical file in
HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It
should provide high aggregate data bandwidth and scale to hundreds of nodes in a single
cluster. It should support tens of millions of files in a single instance.
• Simple Coherency Model: HDFS applications need a write-once-read-many access model
for files. A file once created, written, and closed need not be changed. This assumption
simplifies data coherency issues and enables high throughput data access. A MapReduce
application or a web crawler application fits perfectly with this model. There is a plan to
support appending-writes to files in the future.
• Moving Computation is Cheaper than Moving Data: A computation requested by an
application is much more efficient if it is executed near the data it operates on. This is
especially true when the size of the data set is huge. This minimizes network congestion
and increases the overall throughput of the system. The assumption is that it is often better
to migrate the computation closer to where the data is located rather than moving the data
to where the application is running. HDFS provides interfaces for applications to move
themselves closer to where the data is located.
• Portability Across Heterogeneous Hardware and Software Platforms: HDFS has been
designed to be easily portable from one platform to another. This facilitates widespread
adoption of HDFS as a platform of choice for a large set of applications.
3.1.2 Architecture
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master
server that manages the file system namespace and regulates access to files by clients. In addition,
there are a number of DataNodes, usually one per machine in the cluster, which manage storage
attached to the nodes that they run on. HDFS exposes a file system namespace and allows user
data to be stored in files. Internally, a file is split into one or more blocks and these blocks
are stored in a set of DataNodes. The NameNode executes file system namespace operations
like opening, closing, and renaming files and directories. It also determines the mapping of
blocks to DataNodes. The DataNodes are responsible for serving read and write requests from
the file system’s clients. The DataNodes also perform block creation, deletion, and replication
upon instruction from the NameNode. The NameNode and DataNode are pieces of software
designed to run on commodity machines. These machines typically run a GNU/Linux operating
system (OS). HDFS is built using the Java language; any machine that supports Java can run
the NameNode or the DataNode software. Usage of the highly portable Java language means
that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated
machine that runs only the NameNode software. Each of the other machines in the cluster runs
one instance of the DataNode software. The architecture does not preclude running multiple
DataNodes on the same machine but in a real deployment that is rarely the case. The existence of
a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode
is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that
user data never flows through the NameNode. The JobTracker provides command and control for
job management. It supplies the primary user interface to a MapReduce cluster. It also handles
the distribution and management of tasks. There is one instance of this server running on a
cluster. The machine running the JobTracker server is the ’MapReduce master’. TaskTracker
provides execution services for the submitted jobs. Each TaskTracker manages the execution of
tasks on an individual compute node in the MapReduce cluster. The JobTracker manages all of
the TaskTracker processes. There is one instance of this server per compute node.
Figure 3.2: Hadoop Architecture
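As a small illustration of the file system namespace that HDFS exposes to clients, the following sketch writes and reads a file through the org.apache.hadoop.fs.FileSystem API; the NameNode address is assumed to be configured via fs.defaultFS, and the path is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up fs.defaultFS from core-site.xml
    FileSystem fs = FileSystem.get(conf);           // the client contacts the NameNode for metadata only

    // Write: the NameNode chooses DataNodes; the data itself streams to them directly.
    try (FSDataOutputStream out = fs.create(new Path("/spatial/rivers.wkt"))) {
      out.writeBytes("LINESTRING (0 0, 1 1)\n");
    }

    // Read the file back, block by block, from the DataNodes that host it.
    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/spatial/rivers.wkt"))))) {
      System.out.println(in.readLine());
    }
  }
}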
3.2 Integrating postGIS
HDFS is the distributed storage that holds all Hadoop data. On each datanode, we replace HDFS with one instance of an active postGIS database server, which now hosts the Hadoop data as DBMS tables. In general, Hadoop's MapReduce jobs take blocks of data distributed across HDFS as the input data to be processed, and each data block is allocated to a Mapper process. The Hadoop API's DBInputFormat class allows this paradigm to be changed: a Mapper can be instantiated by allocating it a database table hosted on the datanode instead of an HDFS data block. Each Mapper, prior to its execution, establishes a SQL connection with the local postGIS server and executes the input SQL query; the emitted result set then serves as the Mapper's input. The result is a hybrid that combines databases with a scalable, fault-tolerant MapReduce system: Postgres spatial on each node forms the database layer, and Hadoop's MapReduce serves as the communication layer that coordinates the multiple nodes, each running Postgres. Hadoop with a database as the primary storage layer instead of HDFS is termed HadoopDB [16]. By taking advantage of Hadoop (particularly HDFS, scheduling, and job tracking), HadoopDB distinguishes itself from many current parallel and distributed databases by dynamically monitoring and adjusting for slow nodes and node failures to optimize performance in heterogeneous clusters. Especially in cloud computing environments, where there may be dramatic fluctuations in the performance and availability of individual nodes, fault tolerance and the ability to perform in heterogeneous environments are critical. The system is designed to process most of the problem logic within the database layer, thereby speeding up queries by making use of the database's optimized capabilities, such as indexing, which are not supported in MapReduce, whereas the aggregation of data from multiple nodes, if required, is done in the MapReduce environment.
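A rough sketch of this arrangement is shown below, using Hadoop's DBInputFormat to feed a Mapper from a PostGIS table rather than from an HDFS block. The JDBC URL, credentials, table name, and columns are illustrative assumptions; the spatial predicate in the query is evaluated inside the database engine, as described above.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PostgisInputJob {

  // One row of the (assumed) "roads" table: id plus geometry serialized as WKT.
  public static class RoadRecord implements Writable, DBWritable {
    long id; String wkt;
    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1); wkt = rs.getString(2);
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id); ps.setString(2, wkt);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong(); wkt = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id); out.writeUTF(wkt);
    }
  }

  public static class RoadMapper
      extends Mapper<LongWritable, RoadRecord, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, RoadRecord road, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new LongWritable(road.id), new Text(road.wkt)); // extra processing would go here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the job at the local PostGIS server instead of HDFS data blocks.
    DBConfiguration.configureDB(conf, "org.postgresql.Driver",
        "jdbc:postgresql://localhost:5432/spatialdb", "hadoop", "secret");
    Job job = Job.getInstance(conf, "postgis-input-sketch");
    job.setJarByClass(PostgisInputJob.class);
    job.setMapperClass(RoadMapper.class);
    job.setNumReduceTasks(0);                       // map-only: no collation needed
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    // The spatial predicate is pushed down into the database engine.
    DBInputFormat.setInput(job, RoadRecord.class,
        "SELECT gid, ST_AsText(the_geom) FROM roads WHERE ST_Length(the_geom) > 0.1",
        "SELECT COUNT(*) FROM roads WHERE ST_Length(the_geom) > 0.1");
    FileOutputFormat.setOutputPath(job, new Path("/out/roads"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}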
Figure 3.3 shows the architecture of the system. The Database Connector (DC) component is responsible for connecting to the databases hosted on the cluster machines. The DC probes the catalog file residing in HDFS to locate the host address, port number, and database name for a given table name; the catalog also contains the replication details of all tables. The databases hosted by the cluster nodes are spatially enabled open-source Postgres databases, which we shall refer to as postGIS. The Hadoop daemon called TaskTracker runs on each cluster node to assist and control the execution of the local Maps and Reducers.
We also need to implement a spatial SQL front end that allows the user to query the cluster data via high-level SQL queries instead of writing a MapReduce job (60-70 lines of code even for simple SQL queries) for every different query. In the next section, we discuss the implementation details of this front end.
Logical View of Architecture
Figure 3.4 shows the logical view of the System Architecture. The system can be viewed as
the stack of three layers:
• Top Layer - Database layer
• Middle Layer - MapReduce layer
• Bottom Layer - Hadoop HDFS layer

Figure 3.3: Hadoop with Database (postGIS)
The topmost spatial DBMS layer comprises the postGIS servers running on the cluster machines. Most of the spatial data processing takes place at this layer, that is, within the database engines: database tables hosting spatial data are processed via spatial SQL locally and independently on the cluster nodes. The emitted results are then handed to the underlying MapReduce layer. Arrows show the direction of intermediate data flow between layers.
The middle layer is Hadoop's MapReduce programming paradigm. The intermediate output of the top layer is processed further at this layer, which processes each intermediate record (Map phase) and collates the results from the different database sites (Reduce phase). This layer can also write data to, and read data from, the bottom HDFS layer.
The bottom layer is Hadoop's HDFS layer. It provides distributed, transparent storage and coordinates the Hadoop daemon processes across the cluster. This layer is responsible for providing the fault-tolerance capability of the system, thereby improving its scalability.
Figure 3.4: Logical View of System Architecture
3.3 Integrating Geoserver as Front-End
Geoserver [20] comprises the front end of the system. It allows users to share and edit geospatial data and, being designed for interoperability, publishes data from any major spatial data source using open standards. We implement a simple SQL-to-MapReduce Converter module (SMC) in Geoserver that recognizes the basic spatial data types, viz. Polygons, MultiPolygons, LineStrings, and Points, and translates spatial SQL queries into equivalent compiled MapReduce code (MapReduceSQL). The SMC provides a high-level SQL interface for querying cluster data and is capable of transforming any spatial query into its equivalent MapReduceSQL form, provided that no collation of data from different database sites is needed except through a GROUP BY clause, and that the aggregate functions used are limited to sum, max, and min. Table 3.1 shows the set of rules for mapping SQL constructs to MapReduce. As long as the SQL query has no GROUP BY clause, the equivalent MapReduceSQL has only a Map function; a GROUP BY clause requires the records having the same value of the grouped field to be collated from different database sites, thereby introducing a Reduce function. For this MapReduce code, the input specification retrieves the input data from the cluster databases instead of from the default HDFS; once the data is fetched out of the databases, the rest of the computation proceeds as in the usual MapReduce paradigm. Figure 3.5 shows the sequence of operations implemented in the SMC, from taking the spatial SQL query as input from the user up to launching the MapReduce job. TableCatalog.txt is a text file maintained on the node running Geoserver that contains the table schema information (table fields and their data types); this information is required by the MapReduceSQL Generator function of the SMC.
The SQL-enabled compiled MapReduce job produced by the SMC is copied by the Hadoop Master node to the relevant cluster nodes as a single jar file. Here the relevant cluster nodes are the nodes that host any of the tables specified in the original query; this information comes from the catalog file residing on HDFS.
Table 3.1: SQL to MapReduce Mapping

SQL construct           | MapReduce construct
No GROUP BY clause      | Only Map
GROUP BY clause         | Map and Reduce
GROUP BY field          | output key of Mappers and input key of Reducers
Aggregate functions     | sum, min, max
Supported data types    | primitive data types + Geometry data types
Set of fields selected  | Map input value
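As an illustration of the mapping in Table 3.1, the sketch below shows the shape of the generated code for a query such as SELECT state, max(area(the_geom)) FROM counties GROUP BY state: the grouped field becomes the intermediate key and the aggregate is applied in the Reducer. The table, column, and class names are hypothetical, and the input is assumed to arrive as state<TAB>area text tuples already emitted by the database layer.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GroupByMaxSketch {
  public static class GroupByMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text tuple, Context ctx)
        throws IOException, InterruptedException {
      String[] f = tuple.toString().split("\t");
      // key = GROUP BY field, value = the aggregate operand
      ctx.write(new Text(f[0]), new DoubleWritable(Double.parseDouble(f[1])));
    }
  }

  public static class MaxReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text state, Iterable<DoubleWritable> areas, Context ctx)
        throws IOException, InterruptedException {
      double max = Double.NEGATIVE_INFINITY;
      for (DoubleWritable a : areas) max = Math.max(max, a.get());  // aggregate function: max
      ctx.write(state, new DoubleWritable(max));
    }
  }
}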
Figure 3.5: MapReduce job Compilation by Geoserver (input spatial SQL -> Query Parser, generating tokens such as table name, fields, aggregate functions and GROUP BY fields -> MapReduceSQL Generator, using TableCatalog.txt -> MapReduce Job Compiler -> Job Launcher)
Figure 3.6 shows the Geoserver interface screen which forms the front end of the Hadoop + postGIS cloud. On the left-hand side of the screen is the HadoopPlugin hyperlink, clicking which brings up this screen. The text field provided allows the user to enter the spatial query to be processed on the cluster data. After specifying the spatial query, the user presses the CREATE MAPREDUCE JOB button to translate the spatial SQL into the equivalent compiled MapReduceSQL, and clicking the LAUNCH HADOOP JOB button launches the job. The Activity Terminal is the window that shows all information about the user's interaction with the Hadoop cloud. We also provide other buttons for monitoring the Hadoop cluster via the Geoserver interface, viz. START HADOOP CLUSTER, STOP HADOOP CLUSTER, FORMAT HDFS, and CLUSTER STATUS.
Figure 3.6: Geoserver front end to Hadoop cloud
3.4 Query Execution Steps
The query execution passes through three phases:
• In the first phase, the original query executes inside the database engines, locally and in parallel on the cluster nodes.
• In the second phase, the tuples emitted by the DBMSs in the first phase, called the ResultSet, are read by the Mappers. Here the Map performs on each tuple any extra computation that might not be supported at the postGIS layer, if required. For example, we can test inside the postGIS layer for all pairs of roads that intersect each other, but if we are specifically interested in finding all T-point intersections between roads, the map phase can test whether two roads that are confirmed to intersect actually do so at around 90 degrees.
• In the third phase, the Reducers start once all Mappers have finished; each Reducer aggregates the individual Map outputs, consolidates them, and writes the final results back onto HDFS, from where they are imported by Geoserver. This phase is optional and is not required if no aggregation of Map outputs from different cluster nodes is needed. Usually, the third phase comes into the picture for nested queries or queries with a GROUP BY clause.
Suppose we need to find the polygon with the greatest area among all the polygons stored on three database sites. Figure 3.7 shows the query execution plan of such a query in the integrated environment of DBMS and MapReduce. The example shows three postGIS nodes, each hosting the id and area columns of a polygon table. On each node the query outputs the id and area of the polygon with the maximum area in its local table. The Mappers then read this output and accumulate all Map outputs at a single Reducer by binding them to a common reduce key. The Reducer then finds the largest polygon among all the locally largest polygons.
Query: Select id, area(polygon.the_geom) as A from polygon where A = (Select max(area(polygon.the_geom)) from polygon)
Figure 3.7: Query Execution Plan
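A minimal sketch of the MapReduce part of this plan is given below, assuming the per-site maxima arrive at the Mappers as id<TAB>area text tuples; the constant key that binds all Map outputs to a single Reducer mirrors the key1 of Figure 3.7, and all names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GlobalMaxSketch {
  private static final Text COMMON_KEY = new Text("key1");   // one group => one Reducer call

  public static class LocalMaxMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text localMaxTuple, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(COMMON_KEY, localMaxTuple);        // e.g. "2\t42" from one postGIS site
    }
  }

  public static class GlobalMaxReducer
      extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> localMaxima, Context ctx)
        throws IOException, InterruptedException {
      String bestId = null;
      double bestArea = Double.NEGATIVE_INFINITY;
      for (Text t : localMaxima) {                 // pick the global maximum
        String[] f = t.toString().split("\t");
        double area = Double.parseDouble(f[1]);
        if (area > bestArea) { bestArea = area; bestId = f[0]; }
      }
      ctx.write(NullWritable.get(), new Text(bestId + "\t" + bestArea)); // written onto HDFS
    }
  }
}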
3.5 Challenges faced
While replacing HDFS with the postGIS database was accomplished with Hadoop's own libraries, the two main challenges that we faced were:
• attaching Geoserver as a front end to the Hadoop cloud and invoking Hadoop's utilities from the Geoserver interface, and
• implementing the SMC module in GeoServer.
Attaching GeoServer as a front-end: This challenge arises from a technical incompatibility between GeoServer, which is platform independent, and Hadoop's MasterNode, which is biased towards running on a UNIX platform. GeoServer is developed in Java and is therefore platform independent. Hadoop, like GeoServer, is also developed in Java, but Hadoop's MasterNode relies on certain complex shell scripts which configure hundreds of Hadoop parameters stored in several XML files in order to configure the Hadoop cloud, and the MasterNode is therefore only able to run on UNIX machines; for this reason, Windows-based machines can be used only as Hadoop slave nodes. What was required, therefore, was to invoke those shell scripts with the correct parameters from within the HadoopPlugin Java code implemented as a separate module in GeoServer. Invoking shell commands from inside Java code is accomplished with the java.lang.ProcessBuilder class. ProcessBuilder, available since Java 1.5, is used to create operating system processes: with an instance of this class, one can execute an external program (the Hadoop controlling shell scripts in our case) and obtain an instance of a subclass of java.lang.Process, which provides methods for performing input from the process, performing output to the process, waiting for the process to complete, checking its exit status, and destroying (killing) it. GeoServer invokes the Hadoop utilities via an instance of this class. It is for this reason that our GeoServer front end for Hadoop is deployable only on UNIX machines, though it can be accessed remotely over the Web from any machine.
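A minimal sketch of this mechanism is shown below; the location of the Hadoop control script is a hypothetical install path, not the one used in our deployment.

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HadoopScriptRunner {
  /** Runs a Hadoop control script (e.g. start-all.sh) and returns its exit status. */
  public static int run(String scriptPath) throws Exception {
    ProcessBuilder pb = new ProcessBuilder("/bin/bash", scriptPath);
    pb.redirectErrorStream(true);                 // merge stderr into stdout
    Process p = pb.start();
    try (BufferedReader out = new BufferedReader(
             new InputStreamReader(p.getInputStream()))) {
      String line;
      while ((line = out.readLine()) != null) {
        System.out.println(line);                 // echo script output, e.g. to the Activity Terminal
      }
    }
    return p.waitFor();                           // block until the script completes
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical install location of the Hadoop scripts.
    int status = run("/usr/local/hadoop/bin/start-all.sh");
    System.out.println("exit status: " + status);
  }
}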
Implementing the GeoServer SMC module: Implementing this module was algorithmically challenging. The SMC (SQL-to-MapReduce Converter) module has to take the SQL query as an input string and parse it to generate SQL tokens such as the table names specified in the query, the fields selected, their data types, the fields being grouped by, and the aggregate functions used. The module therefore requires a lot of string handling and builds appropriate data structures, such as a mapping from the fields specified in the SQL SELECT to their return data types. Having built these data structures after query parsing, the SMC maps the SQL constructs to MapReduce constructs; the rules for this mapping are listed in Table 3.1. The mapping has to be handled carefully, as SQL queries can be twisted in one way or another. We have tried to make this module support a wide variety of SQL query translations, but it is still limited to the aggregate functions sum, max, and min, and the table field data types supported are Geometry, int, float, double, and String only.
3.6 Summary
HadoopDB does not replace Hadoop; both systems coexist, enabling the analyst to choose the appropriate tool for a given dataset and task. Through the performance benchmarks in the following sections, we show that using an efficient database storage layer cuts down on data processing time, especially on tasks that require complex spatial query processing over structured data, such as joins. We also show in the experiment section that HadoopDB is able to take advantage of the fault tolerance and the ability to run in heterogeneous environments that come naturally with a Hadoop-style system: HadoopDB achieves fault tolerance and the ability to operate in heterogeneous environments by inheriting the scheduling and job-tracking implementation from Hadoop, yet it achieves the performance of parallel databases by doing much of the query processing inside the database engine.
Thus, by integrating the three systems, we obtain a shared-nothing distributed spatial DBMS cluster with MapReduce as the coordinating layer, and we achieve the following goals:
• Performance, a property inherited from DBMSs
• Flexible Query Interface, query the data via SQL instead of MapReduce programs, provided by GeoServer
• Fault-Tolerance, a property inherited from Hadoop
• Scalability, a property inherited from Hadoop
• Ability to run in Heterogeneous environment, a property inherited from Hadoop
Chapter 4
Vector Data Distribution
We shall now discuss the strategy for distributing the vector test data across the cluster nodes. The distribution of data is primarily governed by the JOIN operation, which is the most commonly used and most expensive operation to perform. Spatial joins combine two spatial data sets by a spatial relationship such as intersection, containment, or within. In shared-nothing distributed DBMSs, if two tables residing on different sites need to be joined, one of the tables has to be imported to the other's site prior to performing the join; spatial data is often large and therefore expensive to read from disk and transfer over the network. Vector spatial data, by its nature, is well suited to being processed on clusters following a shared-nothing architecture. Hosting all spatial objects enclosed within a finite geographical boundary (called a partition) as tables on a single database site eliminates much of the need to move tables between database sites, thus abiding by Hadoop's shared-nothing architecture. For example, a spatial object enclosed within region A cannot overlap, intersect, meet, or touch any spatial object in another geographical region B, so the two sets can be hosted on two different database sites, as a join on these predicates between the two sets would never return any result. Also, it is highly unlikely that a join will be performed between tables containing data that is not spatially proximal.
4.1
Partitioning Strategy
For a large random collection of spatial objects, we define the universe as the minimum bounding rectangle (MBR) that encloses all objects in the collection. In order to distribute the data sets across shared-nothing database sites following the discussion above, the universe must be decomposed into smaller regions called partitions. Some metadata is needed first for this decomposition. The metadata comprises the dimensions of the universe, determined by manual analysis of the data set or through hand-coded scripts. This is static, permanent information: once computed, it need not be computed again during the lifetime of the data set. The number of partitions into which the universe is to be spatially decomposed depends on the maximum table size a database can process efficiently without using temporary disk buffers (another parameter essential for data partitioning). If the total number of spatial objects in the universe is N, and the average number of objects that can be stored in a database table while avoiding disk buffer access during query execution is M, then the number of partitions to be made is roughly N/M. The partition boundaries are obtained roughly by dividing the universe into smaller rectangular regions of equal size. Partitioning of the spatial data sets is then done by testing the spatial relationship between each partition and the MBR of each spatial object as per the predicate condition, overlap in our case. A spatial object which satisfies the predicate with one or more partitions becomes a member of those partitions. This step produces candidates which are a superset of the actual result.
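A minimal sketch of this assignment step is shown below, assuming a simplified rectangle type standing in for a real MBR (e.g. a JTS Envelope); for one object, it returns all candidate partitions whose boundaries the object's MBR overlaps.

import java.util.ArrayList;
import java.util.List;

// Sketch of the partitioning step: an object is assigned to every partition whose
// rectangle its MBR overlaps, so an object on a boundary may land in several partitions.
// Rectangle is a simplified stand-in for a real MBR type.
class Rectangle {
    double minX, minY, maxX, maxY;
    Rectangle(double minX, double minY, double maxX, double maxY) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }
    boolean overlaps(Rectangle o) {
        return minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY;
    }
}

public class MbrPartitioner {
    /** Returns the indices of all partitions whose boundary overlaps the object's MBR. */
    static List<Integer> assign(Rectangle objectMbr, Rectangle[] partitions) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int i = 0; i < partitions.length; i++) {
            if (partitions[i].overlaps(objectMbr)) {
                hits.add(i);
            }
        }
        return hits;   // candidate partitions: a superset of the exact answer
    }
}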
Figure 4.1 shows the decomposition of the spatial space into four partitions. Each partition consists of the spatial objects whose MBRs give a positive overlap test with the partition. All the spatial objects belonging to a particular partition reside on a single database site in the distributed DBMS. Also, note that the spatial object labeled O1 in the figure overlaps with two partitions, P1 and P4, so it is a member of both partitions and therefore resides on the two corresponding database sites.
Figure 4.1: Decomposition of the Universe into Partitions (the universe is split into partitions P1-P4; object O1 overlaps both P1 and P4)
4.2
Partition Skew
In practice, the distribution of spatial features over 2D space is generally not even. For example, there are more roads in cities than in rural areas. Therefore, the distribution of spatial objects across partitions may be imbalanced. Figure 4.1 shows that partition P3 contains the fewest spatial objects, whereas partitions P1 and P4 are densely populated. This situation is called Partition Skew and is not uncommon. Since each partition corresponds to tables residing on a single database site, this uneven distribution causes the tables on different database sites to vary in size. Consequently, different amounts of query computation are carried out on different cluster nodes, resulting in an increase in the overall job execution time: the overall execution time of the job is decided by the cluster node that finishes its share of the computation last. Therefore, we need load balancing for a balanced distribution of objects among partitions.
Figure 4.2: Tile Based Partitioning Scheme (decomposition of partition P1 into tiles 0-11; each tile is mapped round-robin to one of the partitions P1-P4)
4.3
Load Balancing
To deal with the problem of partition skew, a tile-based partitioning method [9] is used for balanced distribution of objects among partitions. This method decomposes the universe into N smaller regions called tiles, where N is much larger than P (the number of partitions), and defines a many-to-one mapping between tiles and partitions. Every spatial object that gives a positive overlap test with a tile is copied to the partition that tile maps to. The larger the number of tiles the universe is decomposed into, the more uniform the distribution of objects among partitions. In Figure 4.2 above, the universe is decomposed into 48 tiles. We show the decomposition of only one partition, P1, into tiles numbered 0 to 11; the other partitions are decomposed in the same manner (not shown in the figure). Tiles are mapped to partitions in round-robin fashion (a code sketch of this mapping is given after the list of drawbacks below). Some spatial objects that are spatially enclosed within this partition are now mapped to other partitions. For example, some spatial objects of partition P1 which overlap tiles 2 and 5 will now be members of partitions P3 and P2 respectively. In the same manner, some spatial objects from other partitions are mapped to partition P1. This results in a more uniform distribution of spatial objects among partitions. Though this strategy is simple to implement, balancing the distribution of spatial objects across partitions this way suffers from the following drawbacks:
• As each partition is decomposed into smaller tiles, the number of spatial objects overlapping multiple tiles increases, so a spatial object is mapped to multiple partitions, resulting in pronounced duplication.
• The strategy is somewhat ad hoc and its outcome is hard to predict.
• Choosing an appropriate value of N (the number of tiles) is difficult.
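The sketch below illustrates the round-robin tile-to-partition mapping referred to above, assuming a regular grid of tiles over the universe; the class and field names are illustrative, not taken from the actual implementation.

// Sketch of tile-based load balancing: the universe is cut into tilesX x tilesY tiles
// and tile t is mapped to partition (t mod P) in round-robin order. An object whose
// MBR overlaps several tiles is copied to every partition those tiles map to, which
// is exactly the source of the duplication discussed above.
public class TilePartitioner {
    final int tilesX, tilesY, numPartitions;
    final double minX, minY, tileW, tileH;

    TilePartitioner(double minX, double minY, double maxX, double maxY,
                    int tilesX, int tilesY, int numPartitions) {
        this.minX = minX; this.minY = minY;
        this.tilesX = tilesX; this.tilesY = tilesY;
        this.numPartitions = numPartitions;
        this.tileW = (maxX - minX) / tilesX;
        this.tileH = (maxY - minY) / tilesY;
    }

    /** Round-robin mapping from a tile number to a partition number. */
    int partitionOfTile(int tileId) {
        return tileId % numPartitions;
    }

    /** Set of partitions an object's MBR [x1,x2] x [y1,y2] is copied to. */
    java.util.Set<Integer> partitionsFor(double x1, double y1, double x2, double y2) {
        java.util.Set<Integer> result = new java.util.TreeSet<Integer>();
        int c1 = (int) Math.floor((x1 - minX) / tileW), c2 = (int) Math.floor((x2 - minX) / tileW);
        int r1 = (int) Math.floor((y1 - minY) / tileH), r2 = (int) Math.floor((y2 - minY) / tileH);
        for (int r = Math.max(r1, 0); r <= Math.min(r2, tilesY - 1); r++) {
            for (int c = Math.max(c1, 0); c <= Math.min(c2, tilesX - 1); c++) {
                result.add(partitionOfTile(r * tilesX + c));
            }
        }
        return result;
    }
}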
We applied the tile-based partitioning scheme to a total of 24,07,442 spatial objects of the state of California, comprising counties, roads and rivers, by decomposing the universe into 512 (arbitrarily chosen) equal-sized tiles mapped to 6 partitions (i.e., 6 disks).
Figure 4.3: Initial Distribution (per-partition object counts for P1-P6 before tile-based partitioning)
Figure 4.4: Distribution after Tile Partitioning (per-partition object counts for P1-P6 after tile-based partitioning)
Figure 4.3 shows the original distribution of objects as they lie on the map of California. From this distribution, we infer that there are almost no spatial features in the top right corner of the universe. Figure 4.4 shows the distribution of spatial objects after performing the tile-based division; the distribution of objects across partitions is much more uniform. However, the total number of spatial objects in figure 4.4 comes to 24,94,740, which is 87,298 more than the original count. This is because a spatial object whose MBR overlaps multiple tiles is mapped to multiple partitions, and this object duplication becomes more pronounced as the space is divided into finer tiles. It leads to redundant computation of results across multiple database sites. In the next section we discuss an improved spatial data partitioning strategy, based on the Hilbert space filling curve, which resolves the problem of object duplication.
4.4
Spatial Data Partitioning using Hilbert Curve
The Hilbert Space Filling Curve (HSFC) is a continuous fractal space-filling curve first described by the German mathematician David Hilbert in 1891 [23]. Hilbert curves are useful in the domain of spatial data partitioning because they give a mapping between 1D and 2D space that preserves spatial locality fairly well. If (x,y) are the coordinates of a point within the unit square and d is the distance along the curve when it reaches that point, then points with nearby d values also have nearby (x,y) values; the converse also holds to a large extent, that is, nearby (x,y) points tend to have nearby d values. Figure 4.5 shows how the division of 2D space becomes finer as the recursion depth of the Hilbert curve increases.
Figure 4.5: Hilbert Space Filling Curve at increasing recursion depths
We first define the term Hilbert value. The Hilbert value of an arbitrary point p in 2D space is defined as the Euclidean distance between p and p', where p' is the point lying on the Hilbert curve that is nearest to p. To keep the discussion simple, let us assume we have the following two functions:
• double[] GenerateHilbertCurve(m, Universe Dimensions): generates the set of 2D points, called Hilbert points, which decompose the universe as shown in the figure above; m is the recursion depth.
• double ComputeHilbertValue(point p): returns the Hilbert value of a point p in 2D space.
Partitioning a spatial data set according to this scheme requires computing the Hilbert value of every spatial entity in the data set. The Hilbert value of a spatial object Obj is computed as follows:
27
Obj_hv = ComputeHilbertValue(Center(MBR(geom(Obj))))
Let V_g be the average volume of spatial entities per disk, in bytes; N the number of disks in the system; B_j the volume of spatial entities on the j-th disk, initialized to 0 for all j = 0 to N-1; n the number of spatial entities to be partitioned; and V_i the size of the i-th spatial entity in bytes, with i ranging from 0 to n-1.
The procedure for data partitioning using the HSFC approach is as follows:
• Construct the Hilbert curve within the universe and compute the Hilbert value of each spatial entity in the data set. As a general rule, if n is the number of spatial entities in the universe, the recursion depth m of the generated Hilbert curve should satisfy n < 2^(2m); for our California data set of roughly 2.4 million objects this requires m >= 11, since 2^22 = 4,194,304.
• Sort the spatial entities in increasing order of Hilbert value.
• Beginning with i = 0 and j = 0, place the i-th spatial entity on the j-th disk and set B_j = B_j + V_i.
• Compare B_j with V_g: if B_j has reached V_g, advance to the next disk (j = j + 1). Then move on to the next entity (i = i + 1) and repeat the previous step until all n entities have been placed.
With this approach each spatial entity is assigned to one and only one disk, so there is no duplication. Spatial locality is preserved because entities are assigned to disks in increasing order of Hilbert value.
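The following sketch implements this sort-then-fill assignment. For ordering purposes it computes the standard Hilbert index of the grid cell containing each MBR centre (a slight variation on the definition of Hilbert value given above, but one that induces the same locality-preserving order); the Entity record and the universe bounds passed by the caller are simplifying assumptions, and this is an illustration of the procedure, not the system's actual code.

import java.util.Arrays;
import java.util.Comparator;

public class HilbertPartitioner {
    static class Entity {
        double cx, cy;     // centre of the object's MBR
        long sizeBytes;    // V_i, the object's size in bytes
        long hilbert;      // Hilbert index of the grid cell holding (cx, cy)
    }

    /** Classic conversion of grid cell (x, y) on an n x n grid (n = 2^m) to its Hilbert index d. */
    static long xy2d(long n, long x, long y) {
        long d = 0;
        for (long s = n / 2; s > 0; s /= 2) {
            long rx = (x & s) > 0 ? 1 : 0;
            long ry = (y & s) > 0 ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            if (ry == 0) {                       // rotate/reflect the quadrant
                if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
                long t = x; x = y; y = t;
            }
        }
        return d;
    }

    /** Returns disk[i] = index of the disk the i-th (sorted) entity is assigned to. */
    static int[] assign(Entity[] entities, int numDisks, int m,
                        double minX, double minY, double maxX, double maxY) {
        long cells = 1L << m;                              // grid resolution per axis
        long totalBytes = 0;
        for (Entity e : entities) {
            long gx = Math.min(cells - 1, (long) ((e.cx - minX) / (maxX - minX) * cells));
            long gy = Math.min(cells - 1, (long) ((e.cy - minY) / (maxY - minY) * cells));
            e.hilbert = xy2d(cells, gx, gy);
            totalBytes += e.sizeBytes;
        }
        double vg = (double) totalBytes / numDisks;        // V_g, target volume per disk
        Arrays.sort(entities, new Comparator<Entity>() {   // sort by Hilbert value
            public int compare(Entity a, Entity b) { return Long.compare(a.hilbert, b.hilbert); }
        });
        int[] disk = new int[entities.length];
        long[] filled = new long[numDisks];                // B_j, bytes placed on disk j
        int j = 0;
        for (int i = 0; i < entities.length; i++) {
            disk[i] = j;                                   // each entity goes to exactly one disk
            filled[j] += entities[i].sizeBytes;
            if (filled[j] >= vg && j < numDisks - 1) j++;  // disk full: move on to the next one
        }
        return disk;
    }
}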
Chapter 5
Performance Evaluation
In this chapter we run a set of benchmarks to assess the performance of Hadoop+HDFS, GeoServer on top of a 3-node spatial HadoopDB cluster (referred to simply as 3Node GeoServer from now on), and single-node GeoServer (a GeoServer with a single-node HadoopDB in the back end) in the domain of spatial data processing. We subject each of the systems to spatial queries with different execution plans. The test data comprises the counties (polygons) and roads (linestrings) of three states of the USA: California, Arizona and Texas [18]. The details of the experimental environment are as follows.
Table 5.1: Test Data Description

Node     State        # Counties   # Roads    Size (MB)
Node 1   Texas        32693        1377372    170
Node 2   Arizona      11963        718556     135
Node 3   California   62096        2062872    245
Table 5.2: Hardware Description

Node     CPU            RAM (GB)   Freq (GHz)
Node 1   Intel 4 core   2          2.66
Node 2   Intel 4 core   2          2.66
Node 3   Intel 2 core   2          1.66
Data Distribution: The test data is distributed across the three-node cluster. In the case of Hadoop, we upload the input files onto HDFS, which scatters them as fixed-size data blocks across the cluster. In the case of HadoopDB, one PostGIS database server is active on each node and the data is distributed state-wise, that is, each node stores the county and roads tables of exactly one state. All the experiments are performed on this three-node cluster. Network communication between cluster nodes goes over a 100 Mbps Ethernet link.
5.1
Highly Selective Spatial Queries
Goal: To show the improvement in response time obtained by distributing highly selective queries over multiple PostGIS servers.
Hypothesis: Highly selective spatial queries, such as the one shown in figure 5.1, select a very small number of tuples satisfying a given predicate from large data sets (of the order of tens of millions of rows). By replacing Hadoop's default read-only data source, HDFS, with a database layer, MapReduce is no longer bound to scan all the data blocks (or chunks) in a brute-force manner to retrieve the required result. A MapReduce job written to fetch a small number of tuples from an extremely large data set must make a brute-force linear scan over all the tuples in the entire data set: the framework splits large files into smaller chunks which are distributed across cluster nodes, each data chunk is bound to exactly one mapper, and when the mappers start, the chunks are processed independently in parallel across the cluster. However, the tuples which actually satisfy the selection criteria may belong to only a few data chunks, or even to one. Because MapReduce has no index from a data tuple to the chunk that contains it, it must process all the data chunks and therefore launches as many mappers as there are chunks, increasing the JobTracker's overhead in controlling the computation and over-consuming cluster resources. With a database layer, most of the tuples are filtered out inside the database, and only a small number of records is handed to the subsequent mappers, minimizing network and disk read/write overhead. We therefore expect HadoopDB to outperform Hadoop by a wide margin.
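For contrast, the sketch below shows roughly what the brute-force Hadoop path looks like for this query: every line of every chunk reaches a mapper, which re-parses it and applies the WHERE logic by hand, whereas in HadoopDB the same predicate is evaluated inside PostGIS and the mappers only ever see the qualifying rows. The record layout assumed in the code (pipe-separated id|wkt|length) is hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the brute-force Hadoop path for the selective query: every line of every
// HDFS chunk is parsed and filtered by hand inside the map function.
public class SelectLongRoadsMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] cols = line.toString().split("\\|");   // assumed field separator
        double length = Double.parseDouble(cols[2]);    // assumed precomputed length column
        if (length > 0.1) {                             // WHERE length(geom) > 0.1, done by hand
            context.write(new Text(cols[0] + "|" + cols[1]), NullWritable.get());
        }
    }
}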
Result and Explanation: The query outputs the geometry of only those roads whose length is greater than 0.1 units; 182 tuples qualify out of 4158800. Note that 0.1 units corresponds to about 11.1 km, because all data is represented in decimal degrees [22]. 3Node GeoServer clearly outperforms single-node GeoServer, as shown in figure 5.1. In 3Node GeoServer, the qualifying tuples are fetched from the database layer according to the SQL WHERE condition; tuples not satisfying the constraint never leave the database layer. Hence the amount of work done in the MapReduce environment is much smaller than in the pure MapReduce case. Hadoop scans every tuple and therefore shows the worst performance.
5.2
Spatial Join Queries
Goal: To evaluate the performance of Hadoop, HadoopDB and a single PostGIS instance while performing spatial joins.
Query: select id, geom from roads where length(geom) > 0.1
Figure 5.1: Performance evaluation of Highly-Selective Query (execution time in seconds for 3Node Hadoop, 3Node GeoServer and 1Node GeoServer)
We perform the spatial join between the counties and roads of all three states, aiming to determine the total length of roads in each county. For Hadoop we employ the SJMR algorithm [6], with partitions corresponding to the bounding boxes of the states, i.e., three partitions in all. For HadoopDB and the single DB we use the SQL query shown in figure 5.2.
Hypothesis: We run the spatial join query (see figure 5.2) on Hadoop by implementing SJMR, which performs online partitioning of the spatial data sets in the Map phase followed by the actual spatial join in the Reduce phase. In the case of an intra-site join on HadoopDB (that is, when the join operand tables reside on the same database sites), data partitioning was done offline and is not part of the join's run-time processing: the entire spatial join logic is pushed inside the database layer, completely relieving the Map phase of any intensive geometric computation and avoiding the Reduce phase altogether. We also perform an inter-site join on HadoopDB by randomly redistributing the test data among the database sites; this is similar to SJMR except that its data source is database tables rather than HDFS. We expect HadoopDB to outperform Hadoop cleanly as long as the join is intra-site, whereas the performance of the two systems should be comparable for the inter-site join, where the databases merely act as a read-only layer, like HDFS. Single-node HadoopDB, on the other hand, should perform worse than the 3-node cluster for obvious reasons.
Inter-Site Spatial Join:
As said earlier, the partitioning of spatial data sets among database sites is governed primarily by spatial joins. As long as the join operand tables reside on the same database sites, the database layer naturally performs a speedy join by exploiting spatial indices. However, there are scenarios where we need to join tables residing on different database sites; we call such joins inter-site spatial joins. In other shared-nothing distributed spatial DBMSs, such a join is accomplished by importing one join operand table (preferably the smaller one) to the other's site over the network prior to performing the join. The cost of transmission over the network, of loading the imported data into the database, and of rebuilding spatial indices is part of the join algorithm and adds considerably to the overall cost of the query. HadoopDB has the additional advantage of having MapReduce as a task coordination layer beneath the database layer, capable of expressing a wide variety of algorithmic logic programmatically, so we can shift the entire spatial join algorithm down to the MapReduce layer. Suppose we have spatial data sets R and S residing on database sites Ri and Si respectively. Performing the inter-site spatial join involves three steps:
Step 1. Read source data: Read the qualifying tuples from sites Ri and Si in parallel, applying any SQL WHERE clause. These tuples are read by the Map phase.
Step 2. Spatial data partitioning: The spatial data partitioning scheme described in the previous chapter is now performed online, implemented in the Map phase. This step needs the characteristics of the data sets, such as the universe dimensions and the number of partitions, as additional input in order to decompose the universe into partitions. Each partition contains the spatial objects from R and S that are potential candidates to satisfy the join predicate.
Step 3. Performing the actual spatial join: Each partition is then processed by a reducer in parallel to compute the spatial join between R and S. In this phase we implement the well-known plane-sweep algorithm to perform the join.
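A minimal sketch of these three steps as a MapReduce job is shown below. The map side tags each tuple with its source table and emits it once per overlapping partition, and the reduce side joins the R and S candidates of one partition. The partition lookup, the geometry predicate and the use of a simple nested loop in place of the plane-sweep are all simplifying assumptions, not the actual implementation.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InterSiteJoin {

    // Step 1 and 2: tag each tuple with its source table and emit it per overlapping partition.
    public static class PartitionMapper extends Mapper<Object, Text, IntWritable, Text> {
        protected void map(Object key, Text tuple, Context ctx)
                throws IOException, InterruptedException {
            String tag = ctx.getConfiguration().get("source.table");   // "R" or "S", set per input
            for (int p : partitionsOverlappingMbr(tuple)) {            // online partitioning
                ctx.write(new IntWritable(p), new Text(tag + "\t" + tuple));
            }
        }
        private int[] partitionsOverlappingMbr(Text tuple) { return new int[] {0}; } // placeholder
    }

    // Step 3: each reducer joins the R and S candidates of one partition.
    public static class JoinReducer extends Reducer<IntWritable, Text, Text, Text> {
        protected void reduce(IntWritable partition, Iterable<Text> tuples, Context ctx)
                throws IOException, InterruptedException {
            java.util.List<String> r = new java.util.ArrayList<String>();
            java.util.List<String> s = new java.util.ArrayList<String>();
            for (Text t : tuples) {                                     // split by source tag
                String[] parts = t.toString().split("\t", 2);
                (parts[0].equals("R") ? r : s).add(parts[1]);
            }
            for (String a : r) {                                        // nested loop stands in
                for (String b : s) {                                    // for the plane-sweep
                    if (geometriesIntersect(a, b)) {
                        ctx.write(new Text(a), new Text(b));
                    }
                }
            }
        }
        private boolean geometriesIntersect(String a, String b) { return false; }    // placeholder
    }
}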
Query: select a.id, sum(length(b.geom)) from counties as a, roads as b where intersects(a.geom, b.geom) group by a.id;
Figure 5.2: Performance evaluation of Spatial Join Query (execution time in minutes, split into Map and Reduce phases, for 3Node Hadoop, 3Node GeoServer (intra join), 3Node GeoServer (inter join) and 1Node GeoServer)
Result and Explanation: As shown in figure 5.2, in the intra-site join case 3Node GeoServer clearly outperforms Hadoop+HDFS and single-node GeoServer. However, 3Node GeoServer's performance degrades to that of Hadoop in the inter-site case, because the join processing has then been shifted from the database layer down to the MapReduce layer which, like SJMR, involves online partitioning followed by a Reduce phase.
5.3
Global Sorting
Goal: To evaluate the performance of the systems when network bandwidth becomes the bottleneck.
Hypothesis: The query shown in figure 5.3 requires the counties to be first read out of HDFS (or out of the DBMS in the case of HadoopDB) and then aggregated at a single reducer process for sorting. This sends large volumes of data across the network, because the geometries of the spatial entities are large. The overall completion time includes the time taken to aggregate the data at a single machine over the 100 Mbps link, so the performance is largely driven by network bandwidth.
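A sketch of the job setup behind this plan is given below: mappers key each county tuple by its area and a single reducer receives the keyed tuples in sorted order, which is precisely what forces every geometry across the network to one node. The pipe-separated record layout with a precomputed area column is an assumption for illustration; the real system generates the equivalent plumbing from the SQL.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GlobalSortJob {

    // Keys each record by its (assumed precomputed) area so the shuffle sorts by area.
    public static class AreaKeyMapper extends Mapper<Object, Text, DoubleWritable, Text> {
        protected void map(Object key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split("\\|");        // assumed layout: id|wkt|area
            ctx.write(new DoubleWritable(Double.parseDouble(cols[2])), line);
        }
    }

    // The single reducer receives keys in sorted order and writes the tuples straight out.
    public static class OrderedWriter extends Reducer<DoubleWritable, Text, Text, DoubleWritable> {
        protected void reduce(DoubleWritable area, Iterable<Text> rows, Context ctx)
                throws IOException, InterruptedException {
            for (Text row : rows) {
                ctx.write(row, area);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "global sort by area");
        job.setJarByClass(GlobalSortJob.class);
        job.setMapperClass(AreaKeyMapper.class);
        job.setReducerClass(OrderedWriter.class);
        job.setMapOutputKeyClass(DoubleWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        job.setNumReduceTasks(1);                 // one reducer => one globally sorted output file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}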
Result and Explanation: Figure 5.3 shows that there is no large difference between the performance of 3Node Hadoop+HDFS and that of 3Node GeoServer for this query, because the MapReduce/SQL implementation of this query merely reads all tuples, from each local database in the case of GeoServer and from HDFS in the case of Hadoop. Single-node GeoServer performs slightly better for this query as it incurs no network overhead.
Query: select id, geom from counties order by area(geom);
Figure 5.3: Performance evaluation of Global Sort Query (execution time in seconds, split into Map and Reduce phases, for 3Node Hadoop, 3Node GeoServer and 1Node GeoServer)
5.4
Queries against shared-nothing restriction
Goal: Performance evaluation of Hadoop, HadoopDB and single-node GeoServer for spatial queries whose execution plans go against the shared-nothing restriction.
Certain spatial queries go against Hadoop's shared-nothing restriction by requiring communication between independent MapReduce processes running on different cluster machines. The query shown in figure 5.4 returns all roads of the state of California which are longer than the longest road of Arizona and Texas. Since the roads tables of the three states reside on three different database sites, we first need to evaluate the subquery, whose result is then taken as input by the outer query to yield the final result. Because the local results of different database sites (the lengths of the longest roads of Arizona and Texas) must be communicated to the California database site, the execution plan of this query violates Hadoop's shared-nothing restriction, and the query cannot be expressed as a single-stage MapReduce program. To implement the query in HadoopDB, the MapReduce/SQL plan contains two MapReduce stages. In the first stage, the subquery is processed on the Arizona and Texas sites in parallel and the local results (the length of the longest road of each state) are written to HDFS. In the second stage, the outer query reads the result of the first stage from HDFS at run time and is processed on the California site only. Hadoop follows the same mechanism, with the input directories set to Texas and Arizona for the first MapReduce stage and to the California directory for the second.
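The driver-side chaining of the two stages can be sketched as follows: the first job writes the scalar result to HDFS, the driver reads it back and passes it to the second job through the job configuration. The job wiring (mappers, input paths) is elided with comments, the HDFS paths and output file name are the usual defaults assumed for illustration, and none of this is the system's actual code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Stage 1: max(length(geom)) over the Arizona and Texas sites.
        Job stage1 = Job.getInstance(conf, "stage1: max road length (AZ, TX)");
        // ... set mapper/reducer and input paths for the Arizona and Texas data here (omitted) ...
        FileOutputFormat.setOutputPath(stage1, new Path("/tmp/maxlen"));
        if (!stage1.waitForCompletion(true)) System.exit(1);

        // Read the single value produced by stage 1 back out of HDFS
        // (part-r-00000 is the default name of the first reducer's output file).
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/tmp/maxlen/part-r-00000"));
        String maxLen = new BufferedReader(new InputStreamReader(in)).readLine().trim();
        in.close();

        // Stage 2: the outer query, run against the California site only.
        conf.set("query.max.length", maxLen);       // visible to stage-2 mappers via the config
        Job stage2 = Job.getInstance(conf, "stage2: roads longer than " + maxLen);
        // ... set mapper and the California input path here (omitted) ...
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}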
Query: select geom from california_roads where length(geom) > ALL ( (select max(length(geom)) from arizona_roads) UNION (select max(length(geom)) from texas_roads) );
Figure 5.4: Performance evaluation of Against Shared-Nothing Query (execution time in seconds, split into MR Stage 1 and MR Stage 2, for 3Node Hadoop, 3Node GeoServer and 1Node GeoServer)
Results and Explanation: Figure 5.4 shows that Hadoop's performance is the worst, for the reasons discussed above. However, the performance of 3Node GeoServer is only comparable to that of single-node GeoServer, because the overhead of launching two MapReduce jobs one after the other dominates the overall query execution: the Hadoop framework easily takes around 8-10 seconds just to initiate a MapReduce job. In a cloud computing environment with hundreds of gigabytes of data spread across the cluster nodes, the overhead of launching an extra MapReduce job would be negligible, unlike in our case.
5.5
Fault Tolerance Test
Goal: To verify that HadoopDB inherits the same fault tolerance capability as Hadoop.
In this experiment we use the same query as in experiment 2 and configure both systems with a replication factor of 2. During job execution on the three-node cluster, when the MapReduce job has completed 65%, we disconnect node 1. We then note the time taken by the job to run to completion on both Hadoop and HadoopDB.
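The replication setting used here is ordinary HDFS configuration; a small sketch of how it can be set and checked from Java is shown below (cluster-wide it would normally be set via dfs.replication in hdfs-site.xml). The HDFS path in the snippet is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of inspecting and setting the replication factor used in this test.
public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);                 // default replication for new files
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/user/data/roads");         // placeholder HDFS path
        fs.setReplication(input, (short) 2);               // re-replicate an existing file to 2 copies
        System.out.println("replication of " + input + " = "
                + fs.getFileStatus(input).getReplication());
    }
}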
Hypothesis: In both HadoopDB and Hadoop, tasks of the failed node are distributed over
the remaining available nodes that contain replicas of the data. We expect HadoopDB to slightly outperform Hadoop in such situations. In Hadoop, TaskTrackers assigned blocks not local
to them will copy the data first (from a replica) before processing. In HadoopDB, however,
processing is pushed into the (replica) database. Since the number of records returned after
database query processing is less than the raw size of data, HadoopDB does not experience
Hadoop’s network overhead on node failure.
Figure 5.5: Fault Tolerance comparison of Hadoop+HDFS with Hadoop+DB (3Node Hadoop+HDFS: 8 min normal execution vs 13 min with a node failure, a 62.5% slowdown; 3Node Hadoop+DB: 1.5 min normal execution vs 2 min with a node failure, a 33% slowdown)
Results and Explanation: We observe that the slowdown in the case of Hadoop and HadoopDB is 62.5% and 33% respectively. This is because HadoopDB simply restarts the query on a replica database, whereas in Hadoop the restarted TaskTrackers must pull replica data blocks that are not local to them, incurring additional network overhead which is significant for spatial data due to the large geometry sizes.
Chapter 6
Summary and Conclusion
6.1
Summary
We started with a discussion of parallel DBMSs and MapReduce, two widely used methodologies for processing large data sets, including but not limited to spatial data, and highlighted the pros and cons of both. MapReduce lets us harness commodity hardware operating in shared-nothing mode while lending robustness to the computation, since parts of the computation can be restarted on failure; a spatial DBMS's optimized capabilities, on the other hand, yield high performance as long as the prominent portion of the query logic is processed inside the database layer. HadoopDB, being a hybrid of the MapReduce and database technologies, inherits the benefits of the two, allowing each to do what it is good at.
We carried out a comparative analysis of the three systems, viz. Hadoop+HDFS, Hadoop+DBMS and parallel DBMSs; see figures 6.1 and 6.2. If we treat the DBMS and Hadoop technologies as two extremes, Hadoop-with-Database (HadoopDB) is essentially a DBMS equipped with some Hadoop techniques: DBMSs are taken as the storage and execution units, and the MapReduce mechanism takes responsibility for parallelization and fault tolerance.
By bringing the two technologies together:
• What we gain w.r.t. Hadoop:
1. Efficiency and performance.
2. The data source (DBMS) is modifiable, whereas HDFS is read-only, so update queries come into the picture.
3. The DBMSs' ACID properties.
• What we lose w.r.t. Hadoop:
1. Transparency of data loading: data loading and distribution are no longer automated by the Hadoop framework but become manual.
2. Facebook processes 2 petabytes of data daily; manually uploading that amount of data to DBMS nodes on a daily basis is not affordable.
• What we gain w.r.t. DBMSs:
1. A shared-nothing, open-source parallel DBMS; no open-source parallel DBMS has been available so far.
2. Unlike other distributed DBMSs, HadoopDB possesses better fault tolerance features.
6.2
Conclusion
We conclude that the MapReduce programming paradigm alone is sufficient to express most spatial query logic, but its lack of support for spatial indexing and its brute-force nature make it impractical for interactive, real-time spatial data analysis systems. HadoopDB shows a great improvement in query execution speed, as PostGIS's inherent support for spatial indices adds a significant advantage; on the other hand, performance degrades to no better than MapReduce for queries whose execution plans go against the "shared-nothing" restriction, such as the inter-site spatial join. We also observe that vector spatial data, by its nature, is well suited to being processed on shared-nothing distributed database clusters: hosting all spatial objects confined within a finite geographical boundary as a single table chunk on one database node eliminates much of the need to move tables between database nodes, thus abiding by Hadoop's shared-nothing architecture, avoiding dependency on the MapReduce layer and therefore yielding high performance.
Also, since HadoopDB does not possess any fault tolerance at the data layer, the offline management of spatial data costs much of HadoopDB's usability in the realm of very large scale spatial data analysis. In large-scale data analysis the data sometimes have a short life cycle: they are loaded into the system in batch mode, a set of almost fixed queries is run over them, and then they are offloaded to an offline system. Under such conditions, organizing the data into a sophisticated structure is not worthwhile given the extra maintenance cost and the low utility, making this system impractical in that setting. The situation gets worse if the spatial data suffers from partition skew and load balancing is required, which is not uncommon.
6.3
Future Work
Current Hadoop’s implementation of MapReduce doesn’t support any sort of indexing mechanism. This is not a drawback of MapReduce, it is some thing that MapReduce has not been
designed for. MapReduce has been designed for one time processing of large data sets in batch
mode. We see it as a future scope to empower MapReduce with indexing mechanism to make it
suitable for real time data analysis.
One of the biggest drawbacks of HadoopDB, and what makes it unsuitable in the realm of large-scale data processing, is the lack of fault tolerance at the data layer: partitioning the raw data and uploading it onto the individual database nodes is no longer supervised by the Hadoop framework. While HadoopDB integrates the power of efficient DBMS technology with MapReduce, it still seems impractical to employ this system for large-scale data processing. It would be a great advancement if HadoopDB were improved to possess fault tolerance at the data layer too, just as Hadoop does.
Property Comparison Chart of Hadoop+HDFS, Hadoop+DBMS and Parallel DBMS

Property: Fault Tolerance (amount of work to be redone in case of node/process failure)
• Hadoop: Excellent fault tolerance capability. The amount of work lost due to a node/process failure is small. Good fault tolerance is achieved at the cost of performance, by storing intermediate outputs on disk.
• Hadoop-With-Database: Inherits Hadoop's fault tolerance. In case of node failure, the task is restarted on another node hosting replicated data; only the chunk of data hosted by the failed node needs to be reprocessed on another node storing the replicated database.
• Parallel DBMS: Poor fault tolerance. The amount of work lost due to a node/process failure is large: the query must be restarted from the beginning. Designers emphasize performance, and therefore intermediate output is pipelined to the next query operator without being written to disk.

Property: Scalability
• Hadoop: The system can scale up to 4000-5000 nodes easily.
• Hadoop-With-Database: Possesses scalability close to that of Hadoop.
• Parallel DBMS: Parallel database systems possess poor scalability. Asterdata, a parallel database known to possess some of the best scalability in the parallel database community, scales to around 330-350 nodes. The probability of node failure increases with cluster size, and frequent failures result in degraded performance.

Property: Performance
• Hadoop: Processing time is much larger. The WHERE condition is checked within the Map phase (by hand coding), so the map function unnecessarily has to read every tuple from the input file. There is no provision to index the input data whatsoever. There is a lot of disk read/write: intermediate outputs of the mappers are written to disk before they are fetched by the subsequent processes (the reducers).
• Hadoop-With-Database: By replacing the data source HDFS with a database, the performance of the system is dramatically improved. Reason: the DBMS minimizes the search space and accelerates query execution by making use of database indexes. Because most of the data tuples are filtered out within the database (by the WHERE clause), the number of input tuples to the mappers is not very large, and hence there is much less computation in the MapReduce environment than with Hadoop. However, for inter-site queries, performance can degrade to that of Hadoop.
• Parallel DBMS: We expect a parallel DBMS to clearly outperform Hadoop. Reasons: (1) a smaller search space because of database indexes; (2) unlike Hadoop, intermediate results are pipelined to the next query operator without being written to disk. Avoiding disk read/writes of intermediate results leads to high performance, but poor fault tolerance.

Property: Architecture type
• Hadoop: Shared-nothing architecture.
• Hadoop-With-Database: Shared-nothing architecture.
• Parallel DBMS: Shared-memory architecture.

Figure 6.1: Comparison Chart (contd.)
Property: Hardware Support
• Hadoop: Especially designed to run on cheap commodity hardware; node failure is therefore considered a frequent event, and a lot of emphasis has been placed on Hadoop's fault tolerance.
• Hadoop-With-Database: Inherits this feature from Hadoop; we now have a kind of distributed DBMS able to cope with clusters made up of ordinary hardware machines.
• Parallel DBMS: Designed with the assumption that node failure is a rare event; cluster nodes therefore require quality hardware.

Property: Global Data Structures
• Hadoop: No provision for maintaining record indexes.
• Hadoop-With-Database: Isolated indexes have to be created per database; global structures cannot be used.
• Parallel DBMS: Can make use of global data structures, such as global indexes.

Property: Data Loading and Distribution
• Hadoop: Transparent to the user; data is split into blocks and distributed across HDFS automatically by the Hadoop framework. Data loading is fast, and the data is unstructured, in textual format.
• Hadoop-With-Database: Manual; the user needs to upload the data into the databases across the cluster by hand. Data loading is slow, as each tuple may have to pass many checks to satisfy integrity constraints. The data is structured and organized in tables. This makes the system somewhat unrealistic for large-scale data processing.
• Parallel DBMS: Transparent to the user; tables are replicated without manual user support. Data loading is slow.

Property: Granularity of Parallelism
• Hadoop: Block-level parallelism; each data block is allotted to a map task.
• Hadoop-With-Database: Table-level parallelism; a full database table is the smallest unit allotted to a map task.
• Parallel DBMS: Granule-level parallelism; a table is logically split into chunks called granules, and each granule is processed in parallel. Oracle's parallel DBMS supports this feature.

Property: Application Requirements
• Hadoop: Batch processing.
• Hadoop-With-Database: Midway between batch processing and real-time processing.
• Parallel DBMS: Optimized to yield real-time benefits.

Property: Environment
• Hadoop: Heterogeneous machines.
• Hadoop-With-Database: Inherits this feature from Hadoop.
• Parallel DBMS: Does not score well in heterogeneous environments.

Property: Large Scale Data Analysis (LSDA)
• Hadoop: Designed especially for this purpose.
• Hadoop-With-Database: Scalable to thousands of machines, but not as suitable for LSDA as Hadoop.
• Parallel DBMS: LSDA, but at moderate scale.

Property: Cost
• Hadoop: Open source project, free of cost.
• Hadoop-With-Database: Completely made up of open source components.
• Parallel DBMS: No open source parallel DBMS is known.

Figure 6.2: Comparison Chart
Bibliography
[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," in Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI). San Francisco, CA: USENIX Association, 2004, pp. 10-10.
[2] A. Bialecki, M. Cafarella, D. Cutting, and O. O'Malley, "Hadoop: a framework for running applications on large clusters built of commodity hardware," Wiki at http://lucene.apache.org/hadoop.
[3] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. R. Madden, and M. Stonebraker, "A comparison of approaches to large-scale data analysis," in Proceedings of the 35th SIGMOD International Conference on Management of Data. ACM Press, New York, 2009, pp. 165-178.
[4] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "MapReduce and parallel DBMSs: friends or foes?" Commun. ACM, 53(1):64-71, 2010.
[5] Jun Zhang, N. Mamoulis, D. Papadias, and Yufei Tao, "All-nearest-neighbors queries in spatial databases," June 2004, pp. 297-306.
[6] Zhang, S., Han, J., Liu, Z., Wang, K., and Xu, Z. SJMR: Parallelizing spatial join with
MapReduce on clusters. In Proceedings of CLUSTER. 2009, pp. 1-8.
[7] J.-P. Dittrich and B. Seeger, "Data redundancy and duplicate detection in spatial join processing," in ICDE '00: Proceedings of the 16th International Conference on Data Engineering, 2000, pp. 535-546.
[8] T. Brinkhoff, H.-P. Kriegel, and B. Seeger, "Parallel processing of spatial joins using R-trees," in ICDE '96: Proceedings of the Twelfth International Conference on Data Engineering, pp. 258-265.
[9] J. M. Patel and D. J. DeWitt, "Partition based spatial-merge join," in Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 1996, pp. 259-270.
[10] Yonggang Wang and Sheng Wang, "Research and Implementation on Spatial Data Storage and Operation," 2010, pp. 275-278.
[11] K. Wang, J. Han, B. Tu, J. Dai, W. Zhou, and X. Song, ”Accelerating Spatial Data Processing with MapReduce”, in proceedings of ICPADS , 2010, pp. 229-236
[12] Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, Sergei Vassilvitskii, ”Filtering: a
method for solving graph problems in MapReduce”, in Proceedings of the 23rd Annual
ACM Symposium on Parallelism in Algorithms and Architectures, 2011, pp. 85-94
[13] Haojun Liao, Jizhong Han, Jinyun Fang,”Multi-dimensional Index on Hadoop Distributed
File System”, in Vth International Conference on Networking, Architecture, and Storage,
2010, pp. 240-249
[14] A. Guttman, "R-trees: a dynamic index structure for spatial searching," in Proceedings of the ACM SIGMOD, Boston, Massachusetts, ACM, 1984, pp. 47-57.
[15] Afsin Akdogan, Ugur Demiryurek, Farnoush Banaei-Kashani, and Cyrus Shahabi Integrated Media Systems Center, University of Southern California, Los Angeles, CA90089
[16] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin, "HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads," in Proc. VLDB '09, 2009.
[17] http://en.wikipedia.org/wiki/GeoServer
[18] http://arcdata.esri.com/data/tiger2000/tiger_download.cfm
[19] G. Leptoukh, NASA remote sensing data in earth sciences: Processing, archiving, distribution, applications at the GES DISC, in Proc. of the 31st Intl Symposium of Remote Sensing
of Environment, 2005
[20] http://en.wikipedia.org/wiki/GeoServer
[21] http://people.na.infn.it/~dimartino/webgis/architecture.html
[22] http://en.wikipedia.org/wiki/Decimal_degrees
[23] http://en.wikipedia.org/wiki/Hilbert_curve