Large Spatial Data Computation on Shared-Nothing Spatial DBMS Cluster via MapReduce

Dissertation Report
Submitted in partial fulfillment of the requirements for the degree of Master of Technology
by Abhishek Sagar (Roll No: 10305042)
under the guidance of Prof. Umesh Bellur
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
June, 2012

Abstract

Vector spatial data types such as lines, polygons and regions usually comprise hundreds of latitude-longitude pairs to accurately represent the geometry of spatial features such as towns, rivers or villages. This makes spatial data operations computationally and memory intensive. Moreover, certain real-world scenarios generate extremely large amounts of spatial data; NASA's Earth Observing System (EOS), for instance, generates 1 terabyte of data every day. A solution is to distribute the spatial operations among multiple computational nodes. Parallel spatial databases attempt to do this, but only at small scales (of the order of a few tens of nodes at most). Another approach is to use a distributed paradigm such as MapReduce, since spatial data distributes cleanly by exploiting its spatial locality. MapReduce lets us harness commodity hardware operating in shared-nothing mode while also lending robustness to the computation, since parts of the computation can be restarted on failure. At the same time, MapReduce is a purely batch-processing paradigm and scores poorly on performance, because it supports no indexing and operates on unstructured data. Parallel spatial DBMSs, on the other hand, give high performance but have limited scalability, whereas MapReduce delivers scalability but poor performance. An approach is therefore required that allows us to process large amounts of spatial data on potentially thousands of machines while maintaining a reasonable performance level. In this work we present HadoopDB, a combination of Hadoop and Postgres spatial, to efficiently handle computations on large spatial data sets. In HadoopDB, Hadoop is employed as a means of coordination among computational nodes, each of which runs the spatial query on a part of the data set; the Reduce stage collates the partial results to yield the result of the original query. HadoopDB thus reaps the benefits of both technologies: the performance of spatial DBMSs and the scalability and fault tolerance of MapReduce. We present performance results showing that common spatial queries achieve a speedup that is nearly linear in the number of Hadoop processes deployed.

Contents

1 Introduction
1.1 Background
1.1.1 Parallel Spatial DBMS
1.1.2 MapReduce - an alternate solution
1.2 MapReduce Vs Parallel DBMSs
1.3 Geospatial Processing Environment
1.4 Outline
2 Literature Survey
2.1 Problem Formulation
2.2 Literature Survey
3 Integrating Geoserver, Hadoop and PostGIS
3.1 Hadoop Distributed File System
3.1.1 Goals
3.1.2 Architecture
3.2 Integrating postGIS
3.3 Integrating Geoserver as Front-End
3.4 Query Execution Steps
3.5 Challenges faced
3.6 Summary
4 Vector Data Distribution
4.1 Partitioning Strategy
4.2 Partition Skew
4.3 Load Balancing
4.4 Spatial Data Partitioning using Hilbert Curve
5 Performance Evaluation
5.1 Highly Selective Spatial Queries
5.2 Spatial Join Queries
5.3 Global Sorting
5.4 Queries against shared-nothing restriction
5.5 Fault Tolerance Test
6 Summary and Conclusion
6.1 Summary
6.2 Conclusion
6.3 Future Work
Bibliography
List of Figures

1.1 Spatial Join via MapReduce
1.2 GeoServer Architecture [21]
1.3 Geoserver Deployment as Web Application
3.1 Overall System Architecture: GeoServer + Hadoop + postGIS
3.2 Hadoop Architecture
3.3 Hadoop with Database (postGIS)
3.4 Logical View of System Architecture
3.5 MapReduce job Compilation by Geoserver
3.6 Geoserver front end to Hadoop cloud
3.7 Query Execution Plan
4.1 Decomposition of the Universe into Partitions
4.2 Tile Based Partitioning Scheme
4.3 Initial Distribution
4.4 Distribution after Tile Partitioning
4.5 Hilbert Space Filling Curve
5.1 Performance evaluation of Highly-Selective Query
5.2 Performance evaluation of Spatial Join Query
5.3 Performance evaluation of Global Sort Query
5.4 Performance evaluation of Against-Shared-Nothing Query
5.5 Fault Tolerance comparison of Hadoop+HDFS with Hadoop+DB
6.1 Comparison Chart
6.2 Comparison Chart (contd.)
List of Tables

3.1 SQL to MapReduce Mapping
5.1 Test Data Description
5.2 Hardware Description

Chapter 1 Introduction

A geographic information system (GIS) is one that captures, stores, analyzes, manages and presents spatial data along with relevant non-spatial information. GIS forms the core of applications in areas as varied as agriculture and consumer applications such as location-based services. Today, many computer applications, directly or indirectly, carry out spatial analysis at the back-end. Spatial analysis involves spatial operations performed on spatial data. We represent spatial features such as roads, towns and cities as vector data. Vector data is a collection of latitude-longitude pairs called geospatial points, structured so as to represent the geometry of spatial features. An example is the use of vector polygons to represent city or state boundaries. To represent the road network of the state of Arizona, for instance, we require approximately 10 million points, each a latitude-longitude coordinate. The number of geospatial coordinates required to represent the geometry of a real-world object varies from a few hundred to tens of thousands. Spatial operations, such as the overlap test (checking whether two areas overlap each other or not), are performed on sets of vector spatial data and are generally implementations of geometric algorithms. Because of the enormous number of points required to represent even a single spatial object, and the complexity of geometric algorithms, spatial computation on real-world data sets is resource-intensive: a 2-core machine with 1.5 GiB of memory shows a constant 75-85% CPU consumption for join queries. Moreover, enormous quantities of spatial data are constantly being generated by sources such as satellites, sensors and mobile devices; NASA's Earth Observing System (EOS), for instance, generates 1 terabyte of data every day. We therefore consider spatial operations a strong candidate for parallelism.

1.1 Background

Two widely used technologies for distributed computation on spatial data are parallel spatial DBMSs and MapReduce. Neither substitutes for the other; in fact, they are complementary. We discuss the pros and cons of each in the domain of large spatial data set processing.

1.1.1 Parallel Spatial DBMS

Parallel spatial DBMSs such as Oracle Spatial are widely used for parallel computation on spatial data across a cluster of machines. Parallel DBMSs are a mature technology, in existence for around 30 years, and have been heavily optimized to yield high performance, yet they do not score well in terms of scalability.
Asterdata, a parallel database known to possess some of the best scalability in the parallel database community, scales to around 330-350 nodes. In parallel DBMSs, the intermediate results of a query are pipelined to the next query operator or sub-query without being written to disk. If any sub-query fails, the intermediate results processed so far are lost and the entire query has to be restarted. Not writing intermediate data to disk yields high performance, but at the same time prevents parallel DBMSs from exhibiting good fault tolerance. As the size of a cluster of commodity machines grows, the probability of node or task failure also grows, and failures are likely to become frequent events if a parallel DBMS cluster is scaled to the order of a few hundred nodes, significantly degrading performance. Poor fault tolerance thus puts an upper bound on the practical cluster size of parallel DBMSs (up to a few tens of nodes), leaving them with ordinary scalability as well.

1.1.2 MapReduce - an alternate solution

Given these drawbacks of parallel DBMSs, over the last 2-3 years MapReduce [1] has attracted researchers as an alternative for parallelizing spatial operations and evaluating their performance in a distributed framework. MapReduce, a distributed parallel programming model developed by Google, provides a framework for processing large volumes of data, on the order of hundreds of terabytes, across thousands of shared-nothing commodity machines. The scalability and fault tolerance of MapReduce enable us to use a sufficiently large number of commodity machines for data-intensive computations. The MapReduce programming model does not require the programmer to understand the parallelism inherent in the operation of the paradigm: it is a high-level parallel programming model that lets the programmer focus on the core problem logic rather than on parallel programming concerns such as synchronization and deadlock. The model requires the programmer to provide the implementation of two functions: Map and Reduce. The Map function partitions the input data to be processed, preferably into disjoint sets, and each set is then handed to a Reduce function for further processing. Key-value pairs form the basic data structure in MapReduce. The input to the Map function is a key-value pair (k1, v1), the key k1 being the byte offset of a record within the input file and the value v1 being the record line. Map outputs a set of intermediate key-value pairs, [(k2, v2)]. The MapReduce library implements the shuffle phase, which lies between the Map and Reduce phases: it rearranges the intermediate Map output and aggregates all values associated with the same key into a (key, list(values)) pair, which forms the input to the Reduce phase that follows. The Reduce phase processes the list of values associated with each key; identical Reducer functions execute in parallel on the worker nodes, and their output is the final output written back onto HDFS. The MapReduce programming model thus carries out distributed computation on clusters of shared-nothing machines.
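To make the key-value flow concrete, the following minimal sketch uses Hadoop's Java API with the canonical word-count job (not a spatial query): k1 is the byte offset, v1 the record line, the intermediate pairs (k2, v2) are (word, 1), and each Reduce invocation receives one word together with the list of its counts.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (k1 = byte offset, v1 = record line) -> [(k2 = word, v2 = 1)]
class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable k1, Text v1, Context ctx)
            throws IOException, InterruptedException {
        for (String word : v1.toString().split("\\s+"))
            if (!word.isEmpty())
                ctx.write(new Text(word), ONE);
    }
}

// Reduce: (k2, list(v2)) -> one total per word, written back onto HDFS
class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text k2, Iterable<IntWritable> v2s, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : v2s) sum += v.get();
        ctx.write(k2, new IntWritable(sum));
    }
}
```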
The Apache Hadoop [2] software library is a framework that allows for the distributed processing of large data sets across clusters of computers using MapReduce. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to provide availability and reliability.

MapReduce in the spatial domain: The MapReduce paradigm is, by its design, well suited to distributed processing of spatial data, and most spatial query logic maps efficiently onto Map and Reduce functions. For straightforward spatial queries limited to a WHERE clause and without joins, MapReduce requires only a Map function: input tuples are tested against the WHERE criteria in the Map phase itself. Spatial queries involving spatial joins between two data sets require a Reduce phase in addition to the Map phase. Generally, spatial joins are performed on spatial objects that are spatially proximal. The Map phase reads the input tuples of the join operand data sets and creates groups, each containing only the spatial objects that fall within a pre-defined spatial boundary; each group is then processed for the spatial join in parallel by Reduce processes on the cluster machines. Vector spatial data is thus, by its nature, well suited to processing on clusters following a shared-nothing architecture. MapReduce's Map and Reduce processes execute in isolation from each other; there is no inter-process communication whatsoever. Processing all spatial objects enclosed within a finite geographical boundary on one machine eliminates most of the need for MapReduce processes to interact, thus abiding by Hadoop's shared-nothing architecture.

Figure 1.1 shows the MapReduce formulation of a spatial join between two heterogeneous data sets: Rivers (linestrings) and Settlements (polygons). The Map phase partitions the spatial objects of the two data sets into groups, each containing the spatial objects that lie within a pre-defined spatial boundary. For example, 1:River, 2:Setmt and 4:Setmt are grouped together under the key 1; similarly, 2:Setmt and 3:River are grouped under the key 3. The shuffle phase migrates all spatial objects associated with the same group key from the different Mappers onto a single machine over the network; in the figure, 1:River, 2:Setmt and 4:Setmt are collated onto one machine. After the shuffle phase, a Reducer starts on each machine and processes all spatial objects associated with one group. Here, three Reducers start, corresponding to the three groups, each finding the set of Settlements crossed by a river; for example, 2:Setmt and 4:Setmt are crossed by 1:River. The Reducers independently write their final output onto HDFS.

Figure 1.1: Spatial Join via MapReduce
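The following sketch shows one way this formulation could be coded on Hadoop. The record layout ("id:type:minx,miny,maxx,maxy", carrying only the bounding box) and the 2x2 grid over a unit universe are illustrative assumptions; the Reducer emits candidate pairs by bounding-box co-location, and a real implementation would refine each pair with an exact crossing test from a geometry library such as JTS.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: tag each object with every grid cell its bounding box overlaps;
// an object spanning several cells is duplicated into each of them.
class TileTagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable off, Text rec, Context ctx)
            throws IOException, InterruptedException {
        String[] mbr = rec.toString().split(":")[2].split(",");
        double x1 = Double.parseDouble(mbr[0]), y1 = Double.parseDouble(mbr[1]);
        double x2 = Double.parseDouble(mbr[2]), y2 = Double.parseDouble(mbr[3]);
        for (int gx = 0; gx < 2; gx++)
            for (int gy = 0; gy < 2; gy++)
                if (x1 < (gx + 1) * 0.5 && x2 >= gx * 0.5
                        && y1 < (gy + 1) * 0.5 && y2 >= gy * 0.5)
                    ctx.write(new Text(gx + "," + gy), rec);
    }
}

// Reduce: one call per group key; pair the rivers and settlements it holds.
class TileJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text tile, Iterable<Text> objs, Context ctx)
            throws IOException, InterruptedException {
        List<String> rivers = new ArrayList<String>();
        List<String> setmts = new ArrayList<String>();
        for (Text t : objs)
            (t.toString().contains(":River:") ? rivers : setmts)
                    .add(t.toString());
        for (String r : rivers)
            for (String s : setmts)
                // candidate pair only; the exact geometric crossing test
                // (e.g. JTS Geometry.crosses) would be applied here
                ctx.write(new Text(r), new Text(s));
    }
}
```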
1.2 MapReduce Vs Parallel DBMSs

Processing large amounts of spatial data has become a critical issue in recent times. Parallel DBMS technology has been widely used for processing large volumes of vector data, but with the ever-increasing need to process larger and larger spatial data sets, a parallel DBMS is no longer a desirable technology for this purpose. We compare MapReduce and parallel DBMSs with respect to scalability, fault tolerance and performance [3].

1. Scalability: Parallel database systems scale well into the tens and even low hundreds of machines. Unfortunately, parallel database systems as implemented today, unlike Hadoop, do not scale into the realm of many thousands of nodes. Enormous quantities of spatial data are constantly being generated by sources such as satellites, sensors and mobile devices; NASA's Earth Observing System (EOS), for instance, generates 1 terabyte of data every day [19]. Processing such volumes of spatial data on a daily basis requires a much larger number of machines, probably of the order of a few thousand, which parallel DBMS technology does not support.

2. Fault tolerance: Fault tolerance is the ability of the system to cope with node or task failures. A fault-tolerant analytical DBMS is simply one that does not have to restart a query if one of the nodes involved in query processing fails. The fault tolerance of parallel DBMSs is much inferior to that of Hadoop, which has been explicitly designed for excellent scalability and fault tolerance. The amount of work lost when a node in the cluster fails is larger in a parallel DBMS than in Hadoop. In a parallel DBMS, the intermediate results of a query are pipelined to the next query operator or sub-query without being written to disk; if any sub-query fails, the intermediate results processed so far are lost and the entire query has to be restarted. In Hadoop, by contrast, the intermediate results of the Mappers (or Reducers) are always written to disk before being fetched by the Reducers (or the Mappers of the next MapReduce stage). Instead of pipelining intermediate data to subsequent processes, Hadoop pipelines the processes themselves onto the target data: on a task or node failure, the same task is restarted on another node and operates on the intermediate data that still exists on disk.

3. Performance: Parallel DBMSs have been designed for real-time use, where performance is paramount, whereas Hadoop has been designed for batch processing. Hadoop was not originally designed for structured data analysis and is therefore significantly outperformed by parallel database systems on structured analysis tasks. In fact, Hadoop takes around 10-11 seconds merely to initiate distributed processing on a 3-4 node cluster, in which time a parallel DBMS finishes much of the computation. Hadoop's slower performance also stems from storing data in the accompanying distributed file system (HDFS) in the same textual format in which the data was generated. This default storage method places the burden of parsing the fields of each record on user code: every Map and Reduce task must repeatedly parse and convert string fields into the appropriate types, further widening the performance gap between MapReduce and parallel DBMSs [3].
To summarize, MapReduce offers excellent scalability and fault tolerance, which make it a suitable programming model for processing large data sets on sufficiently large clusters of commodity machines, whereas parallel DBMS technology is limited to cluster sizes of up to a few dozen nodes but clearly outperforms MapReduce in terms of performance.

1.3 Geospatial Processing Environment

GeoServer is an open-source server written in Java that allows users to share and edit geospatial data. Designed for interoperability, it publishes data from any major spatial data source using open standards. GeoServer has evolved into an easy method of connecting existing information to virtual globes such as Google Earth and NASA World Wind, as well as to web-based maps such as OpenLayers, Google Maps and Bing Maps [20]. Geoserver marks the beginning of standardization in the GIS arena: it follows the WMS, WFS and WCS specifications to the letter and forms a platform for developing GIS applications based on these specifications.

Geoserver Design

Figure 1.2 shows the different components of Geoserver. At a high level, Geoserver consists of many modules actively interacting with each other. Geoserver reads data in a wide variety of formats, from PostGIS, Oracle Spatial and ArcSDE to shapefiles and GeoTIFF, and it can produce KML, GML, Shapefiles, GeoRSS, GeoJSON and a multitude of other formats. Geoserver essentially has two aspects: the configuration and data-store aspect, and the rendering aspect. All configuration in Geoserver is done through the admin interfaces and XML configuration files. A DataStore is essentially a source of data for rendering features; Geoserver supports many different data stores, including Web Feature Servers, property files, shapefiles and databases. A CoverageStore is an entity at the same level as a DataStore but covers raster-based data formats such as ArcGrid, GeoTIFFs and image mosaics. The rendering component follows the WMS, WFS and WCS specifications and uses GeoTools as the rendering API.

Figure 1.2: GeoServer Architecture [21]

Geoserver Deployment

Figure 1.3 shows the deployment of Geoserver: a web browser talks over the Internet to an Apache web server and a Tomcat application server hosting the generated GIS application, GeoServer and postGIS; requests travel as WFS/WMS over HTTP, and responses return as HTML, JavaScript and raster data. At run time, actions performed by the user on the client side are translated into HTTP requests by JavaScript code and sent to the server, where data satisfying the request is selected and sent back to the client. In detail, a GIS query is transmitted as a GET or POST request written according to either the WFS or the WMS specification. The request is captured by the generated GIS application and forwarded to Geoserver, which in turn interprets the query, composes the SQL statement according to the PostGIS DML and sends it to the DBMS. Once the DBMS computes the query, the results are gathered by Geoserver to create the answer. In the case of a WMS request, Geoserver computes a raster map containing the results, encoded in a standard picture format (GIF, PNG, SVG, etc.). In the case of a WFS request, Geoserver collects the data from the DBMS and returns GML (Geography Markup Language) encoded data to the generated GIS server application, which further processes the resulting GML and sends it back to the client side in HTML format.

Figure 1.3: Geoserver Deployment as Web Application
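As an illustration, a WFS GetFeature query for at most ten features of GeoServer's stock demo layer topp:states, assuming a default localhost deployment, is a plain HTTP GET (broken across lines here for readability):

```
http://localhost:8080/geoserver/wfs?service=WFS&version=1.0.0
    &request=GetFeature&typeName=topp:states&maxFeatures=10
```

GeoServer answers with a GML-encoded feature collection; the analogous WMS call (request=GetMap, plus bounding box, size and format parameters) returns a rendered raster map instead.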
1.4 Outline

The remainder of this report is organised as follows. Chapter 2 presents the problem statement, followed by related work in the field of spatial data computation on a Hadoop cloud. In Chapter 3, we present the details of integrating postGIS with the MapReduce paradigm. Chapter 4 discusses the strategy for distributing vector data across the cluster nodes. In Chapter 5, we present a set of benchmarks to assess the benefit obtained by bringing spatial DBMSs and MapReduce together in the spatial domain. Chapter 6 summarises and concludes the work with an outline of future work.

Chapter 2 Literature Survey

2.1 Problem Formulation

Parallel spatial DBMSs such as Oracle Spatial have been in use for carrying out spatial analysis on moderately large spatial data sets. Spatial DBMSs today support a variety of spatial indexing mechanisms that enable them to process spatial queries very fast. But parallel DBMSs, because of their limited scalability, fail to handle the ever-increasing size of spatial repositories. To overcome this barrier, researchers have focused on MapReduce as an alternative which is capable of expressing a variety of spatial operations, such as spatial joins [7],[8],[9], nearest-neighbor queries [5] and Voronoi diagram construction [15], and which, unlike parallel DBMSs, can process much larger spatial repositories on thousands of commodity machines in parallel. But MapReduce is a batch-processing paradigm, and its brute-force style of data processing makes it unsuitable for real-time spatial data analysis. Spatial indices are a solution to this problem, but they are not supported by the current implementation of Hadoop's MapReduce. The two technologies thus pull in opposite directions, and neither is good at what the other does well. We therefore believe that, in order to process large spatial data sets while maintaining a reasonable performance level, there is a need to study the behavior of spatial operations in an integrated environment of MapReduce systems and spatial DBMSs.

2.2 Literature Survey

To begin with, [10] discusses the implementation of common spatial operators, such as geometry intersection, on the Hadoop platform by transforming them into the MapReduce paradigm. It sheds light on which input/output key-value pairs a MapReduce programmer should choose for both Mappers and Reducers to effectively partition the data across the slave machines of a cluster and carry out the spatial computation in parallel. The paper also presents a performance evaluation of spatial operations, comparing a spatial database with the Hadoop platform. The results demonstrate the feasibility and efficiency of the MapReduce model and show that cloud computing technology has the potential to be applicable to more complex spatial problems.

[6] discusses the implementation of spatial queries in the MapReduce paradigm, particularly those involving a spatial join between two or more heterogeneous spatial data sets. The strategies discussed include a strip-based plane-sweeping algorithm, a tile-based spatial partitioning function and duplication-avoidance technology. The paper experimentally demonstrates the performance of the SJMR (Spatial Join with MapReduce) algorithm in various situations with real-world data sets and establishes the applicability of expressing computing-intensive spatial applications as MapReduce on clusters. [7] is optimization work built on [6].
It discusses effective strategies for partitioning the spatial data in the Map phase so that the Reducers running on the slave machines get a fair share of data to process, i.e. it should not happen that some Reducers get very little data while others are overwhelmed. The paper presents experimental statistics and results showing the improvement in overall cluster utilization, memory utilization and run time of the Hadoop job.

[12] discusses strategies for transforming various graph algorithms, such as Dijkstra's single-source shortest path, bipartite matching, approximate vertex and edge covers, and minimum cuts, into MapReduce form. MapReduce algorithms for manipulating graphs are generally iterative in nature: the problem is solved by passing the graph through a MapReduce pipeline, with each iteration converging towards a solution. The key challenge in transforming graph-based problems into MapReduce form is partitioning the graph. This has to be done carefully, as the slave machine processing one part of the graph has no information whatsoever about the rest of the graph; the graph must be partitioned among the slave machines so that each machine can perform its computation independently on its own share of the graph.

[11] discusses a multi-stage MapReduce solution to the spatial problem of All Nearest Neighbors (ANN) [5]. The Mapper phase partitions the spatial objects, grouping together all those that lie close to each other within a rectangular 2D space called a partition. The algorithm requires just one MapReduce stage if every object's nearest neighbor is present within the same partition as the object itself. But the nearest neighbor of an object may belong to an adjacent partition; such objects, whose NN is not guaranteed to exist within their own partition, are called pending elements. The authors use an intermediate data structure called pending files, in which they record each pending element and the potential partition that could contain its NN. The input to the Map phase of the next MapReduce stage is the pending files plus the original data source; the output of the second MapReduce stage produces the final result, in which every element is guaranteed to find its nearest neighbor. Through this approach, the MapReduce programming model overcomes the communication barrier between slave nodes: the Hadoop platform does not allow slave nodes to share information while a MapReduce task is executing, but the data partition that one slave node processed is made available to another slave node in the next MapReduce stage.

[13] discusses a MapReduce strategy for building indices of large spatial data sets, in particular constructing an R-tree [14] in parallel on the slave machines of a cluster. The maximum size of R-tree that can be constructed is limited by the total main memory of all the slave machines of the cluster.

[4] is a comparative study of MapReduce-based systems against parallel DBMSs, comparing the two with respect to three main aspects: performance, fault tolerance and scalability. It argues that the two systems are actually complementary to each other, that neither is good at what the other does well, and that they target entirely different applications.
For one-time data processing and batch applications, MapReduce is favorable, whereas for interactive applications demanding high performance, a parallel DBMS is the favorable technology.

Gap Analysis

As we have seen, over the past two to three years there has been a plethora of research on formulating spatial operations as MapReduce problems. But MapReduce-based systems were originally designed for one-time data processing, such as log file analysis: in practice they process large data sets with almost fixed MapReduce jobs, and once the data is processed it is offloaded permanently from the system. Contrary to this, most spatial data is not of the one-time-processing type and requires constant, repeated querying from the user to derive progressively more meaningful results. From this perspective, we believe MapReduce is not a suitable programming model for spatial analysis; its lack of indexing support makes it further unsuitable for the purpose. A spatial DBMS, on the other hand, is a well-known, mature and, unlike MapReduce, heavily optimized technology that has been employed for spatial analysis for years. Thus, while MapReduce enables data processing across large clusters but does not score on performance, a DBMS yields high performance but scores poorly on scalability. Even for a moderate spatial data set (250 MB; California roads and counties [18]) on a 3-node Hadoop cluster (the same as the one set up in the experiment section), we observed that MapReduce takes as much as 6-7 minutes to process join queries, whereas a single postGIS instance outputs the results in only 40-50 seconds: even a 3-node Hadoop cluster loses to the optimized capabilities of a single postGIS in terms of performance. We therefore consider that, to cope with the ever-increasing size of spatial repositories, we need to study systems that can employ potentially thousands of machines while maintaining a reasonable performance level for data analysis over large spatial data sets.

Chapter 3 Integrating Geoserver, Hadoop and PostGIS

As a solution to the problem presented in the previous chapter, in this chapter we integrate MapReduce with spatial DBMS technology to create a real spatial processing environment. We present a step-by-step discussion of the assembly of the three systems: GeoServer, Hadoop and PostGIS. We start with the open-source Apache Hadoop project as the basic software platform, on top of which we build the remaining components; it forms the core of the overall system. In the second step, we replace Hadoop's default data storage, the Hadoop Distributed File System (HDFS), with postGIS database servers, which store the spatial test data sets as tables; one instance is active on each cluster node. In the third step, we add Geoserver as the front end of the system, providing a high-level spatial SQL interface for the end user to query the cluster data. Figure 3.1 shows the top-level architecture of the overall system we intend to build. Geoserver serves as the front end, enabling the user to monitor and interact with the cloud at the back end. The Hadoop Master-Node is in direct communication with Geoserver, and in turn monitors and coordinates the activities of the rest of the cloud via the usual MapReduce paradigm.
This system distinguishes itself from many current parallel and distributed databases by dynamically monitoring and adjusting for slow nodes and node failures to optimize performance in heterogeneous clusters. Especially in cloud computing environments, where there may be dramatic fluctuations in the performance and availability of individual nodes, fault tolerance and the ability to perform in heterogeneous environments are critical. In the remainder of this chapter, we first discuss the basic architecture and functioning of Hadoop, followed by a discussion of the integration of postGIS and Geoserver into the system.

Figure 3.1: Overall System Architecture: GeoServer + Hadoop + postGIS (the user's SQL query enters Geoserver, whose SMC and Reader modules hand compiled MapReduce code to the Namenode; the Database Connector consults catalog.xml on HDFS; each of Nodes 1-3 runs a tasktracker alongside a postGIS instance)

3.1 Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS is now an Apache Hadoop subproject.

3.1.1 Goals

• Hardware Failure: Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components, each with a non-trivial probability of failure, means that some component of HDFS is always non-functional. Detection of faults and quick, automatic recovery from them is therefore a core architectural goal of HDFS.

• Streaming Data Access: Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use; the emphasis is on high throughput of data access rather than low latency.

• Large Data Sets: Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.

• Simple Coherency Model: HDFS applications need a write-once-read-many access model for files. A file once created, written and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web-crawler application fits this model perfectly. There is a plan to support appending writes to files in the future.

• Moving Computation is Cheaper than Moving Data: A computation requested by an application is much more efficient if it executes near the data it operates on, especially when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
• Portability Across Heterogeneous Hardware and Software Platforms: HDFS has been designed to be easily portable from one platform to another. This facilitates the widespread adoption of HDFS as a platform of choice for a large set of applications.

3.1.2 Architecture

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients, and a number of DataNodes, usually one per machine in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing and renaming files and directories, and also determines the mapping of blocks to DataNodes. The DataNodes serve read and write requests from the file system's clients and perform block creation, deletion and replication upon instruction from the NameNode. The NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is built in the Java language; any machine that supports Java can run the NameNode or DataNode software, so HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software; each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system: the NameNode is the arbitrator and repository for all HDFS metadata, and the system is designed so that user data never flows through the NameNode.

The JobTracker provides command and control for job management. It supplies the primary user interface to a MapReduce cluster and handles the distribution and management of tasks. There is one instance of this server running on a cluster; the machine running the JobTracker server is the 'MapReduce master'. The TaskTracker provides execution services for submitted jobs: each TaskTracker manages the execution of tasks on an individual compute node, with one instance per compute node, and the JobTracker manages all of the TaskTracker processes.

Figure 3.2: Hadoop Architecture

3.2 Integrating postGIS

HDFS is the distributed storage that holds all Hadoop data. On each datanode, we replace HDFS with one instance of an active postGIS database server, which now hosts all Hadoop data as DBMS tables. In general, Hadoop's MapReduce jobs take the blocks of data distributed across HDFS as the input to be processed, with each data block allocated to a Mapper process. The DBInputFormat class of the Hadoop API allows this paradigm to be changed: a Mapper is instantiated with a database table hosted on the datanode instead of an HDFS data block. Prior to its execution, each Mapper establishes a SQL connection to the local postGIS server and executes the input SQL query; the emitted result set then serves as the input to the Mappers, as sketched below.
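A minimal sketch of that wiring, using Hadoop's stock org.apache.hadoop.mapreduce.lib.db classes and the PostgreSQL JDBC driver; the roads table, its columns (gid, the_geom) and the connection details are illustrative assumptions.

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// One tuple of the hypothetical "roads" table; geometry arrives as WKT text.
class RoadRecord implements DBWritable {
    long gid;
    String wkt;
    public void readFields(ResultSet rs) throws SQLException {
        gid = rs.getLong(1);
        wkt = rs.getString(2);
    }
    public void write(PreparedStatement ps) throws SQLException {
        // unused: this job only reads from postGIS
    }
}

class PostgisJobSetup {
    static Job makeJob() throws Exception {
        Configuration conf = new Configuration();
        // point the Mappers at the node-local postGIS server instead of HDFS
        DBConfiguration.configureDB(conf, "org.postgresql.Driver",
                "jdbc:postgresql://localhost:5432/gisdb", "gis", "secret");
        Job job = Job.getInstance(conf, "postgis-scan");
        DBInputFormat.setInput(job, RoadRecord.class,
                "SELECT gid, ST_AsText(the_geom) FROM roads", // feeds the Mappers
                "SELECT COUNT(*) FROM roads");                // lets Hadoop split the input
        return job;
    }
}
```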
The result is that we obtain a hybrid that combines databases with a scalable and fault-tolerant MapReduce system. It comprises Postgres spatial on each node (the database layer) and Hadoop's MapReduce as a communication layer that coordinates the multiple nodes, each running Postgres. Hadoop with a database, rather than HDFS, as the primary storage layer is termed HadoopDB [16]. By taking advantage of Hadoop (particularly HDFS, scheduling and job tracking), HadoopDB distinguishes itself from many current parallel and distributed databases by dynamically monitoring and adjusting for slow nodes and node failures to optimize performance in heterogeneous clusters. Especially in cloud computing environments, where there may be dramatic fluctuations in the performance and availability of individual nodes, fault tolerance and the ability to perform in heterogeneous environments are critical. The system is designed to process most of the problem logic within the database layer, speeding up queries by exploiting the database's optimized capabilities, such as indexing, which MapReduce does not support; the aggregation of data from multiple nodes, where required, is done in the MapReduce environment.

Figure 3.3 shows the architecture of the system. The Database Connector (DC) component is responsible for connecting to the databases hosted on the cluster machines. The DC probes the catalog file residing in HDFS to locate the host address, port number and database name for a given table name; the catalog also contains the replication details of all tables. The databases hosted by the cluster nodes are spatially enabled open-source Postgres databases, which we shall refer to as postGIS. The Hadoop daemon called TaskTracker runs on each cluster node to assist and control the execution of the local Maps and Reducers. We also need to implement a spatial SQL front end that allows the user to query the cluster data via high-level SQL queries instead of writing a MapReduce job (60-70 lines of code for a simple SQL query) for every new query; the next section discusses the implementation of this front end.

Logical View of Architecture

Figure 3.4 shows the logical view of the system architecture. The system can be viewed as a stack of three layers:

• Top layer - Database layer
• Middle layer - MapReduce layer
• Bottom layer - Hadoop HDFS layer

The topmost, spatial DBMS layer comprises the postGIS servers running on the cluster machines. Most of the spatial data processing takes place at this layer, that is, within the database engines: the database tables hosting spatial data are processed via spatial SQL locally and independently on the cluster nodes, and the emitted results are handed to the underlying MapReduce layer (the arrows in the figure show the direction of intermediate data flow between layers). The middle layer is Hadoop's MapReduce programming paradigm, which processes the intermediate output of the top layer further: it processes each intermediate record (Map phase) and collates the results from the different database sites (Reduce phase). This layer can also write data to and read data from the bottom HDFS layer. The bottom layer is Hadoop's HDFS layer. It provides distributed, transparent storage and coordinates the Hadoop daemon processes across the cluster; this layer is responsible for providing the fault tolerance of the system, thereby improving its scalability.
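The catalog format itself is not reproduced in this report; purely as an illustration, an entry of the kind the Database Connector described above looks up might resemble the following (hypothetical element names and values):

```xml
<!-- hypothetical catalog entry: where does table "roads" live? -->
<table name="roads">
  <host>node2.cluster.local</host>   <!-- database site -->
  <port>5432</port>                  <!-- postGIS port -->
  <database>gisdb</database>
  <replica>node3.cluster.local</replica>
</table>
```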
Figure 3.4: Logical View of System Architecture (spatial DBMS layer on top, MapReduce layer with its Map and Reduce phases in the middle, Hadoop HDFS layer at the bottom)

3.3 Integrating Geoserver as Front-End

Geoserver [20] comprises the front end of the system. It allows users to share and edit geospatial data and, designed for interoperability, publishes data from any major spatial data source using open standards. We implement a simple SQL-to-MapReduce Converter module (SMC) in Geoserver that recognizes the basic spatial data types, viz. Polygons, MultiPolygons, LineStrings and Points, and translates spatial SQL queries into equivalent compiled MapReduceSQL code. The SMC provides a high-level SQL interface to the user for querying cluster data; it can transform any spatial query into its equivalent MapReduceSQL form, provided that no collation of data from different database sites is needed except through a GROUP BY clause, and the aggregate functions supported are sum, max and min only. Table 3.1 shows the set of rules for mapping SQL constructs to MapReduce. As long as an SQL query has no GROUP BY clause, the equivalent MapReduceSQL has only a Map function; a GROUP BY clause requires records having the same value of the grouped field to be collated from different database sites, thereby introducing a Reduce function (an illustrative translation follows below). For this MapReduce code, the input specification retrieves the input data from the cluster databases instead of the default HDFS; once the data is fetched out of the databases, the rest of the computation proceeds as per the usual MapReduce paradigm.

Table 3.1: SQL to MapReduce Mapping

  SQL construct           MapReduce construct
  ----------------------  ----------------------------------------------
  No GROUP BY clause      Only Map
  GROUP BY clause         Map and Reduce
  GROUP BY field          Output key of Mappers, input key of Reducers
  Aggregate functions     sum, min, max supported
  Data types              Primitive data types + Geometry
  Selected set of fields  Map input value

Figure 3.5 shows the sequence of operations implemented in the SMC, from taking the spatial SQL query as input from the user up to launching the MapReduce job: the input spatial SQL query is parsed to generate tokens (table names, fields, aggregate functions, the GROUP BY clause and its fields), then passed through the MapReduceSQL Generator, the MapReduce Job Compiler and finally the Job Launcher. TableCatalog.txt is a text file maintained on the node running Geoserver which holds the table schema information (table fields and their data types) required by the MapReduceSQL Generator function of the SMC. The SQL-enabled compiled MapReduce job produced by the SMC is copied by the Hadoop Master node to the relevant cluster nodes as a single jar file; the relevant cluster nodes are those hosting any of the tables specified in the original query, information that comes from the catalog file residing on HDFS.

Figure 3.5: MapReduce job Compilation by Geoserver

Figure 3.6 shows the Geoserver interface screen which forms the front end of the Hadoop + postGIS cloud. On the left-hand side of the screen is the hyperlink HadoopPlugin, clicking which brings up this screen. The text field provided allows the user to enter the spatial query to be processed on the cluster data. Having specified the spatial query, the user presses the CREATE MAPREDUCE JOB button to translate the spatial SQL into equivalent compiled MapReduceSQL, and clicking the LAUNCH HADOOP JOB button launches the job. The Activity Terminal is the window that shows all information about the user's interaction with the Hadoop cloud.

Figure 3.6: Geoserver front end to Hadoop cloud
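As an illustration of the Table 3.1 rules, consider a hypothetical query SELECT county, sum(area(the_geom)) FROM parcels GROUP BY county. A sketch of the Map and Reduce pair the SMC might generate for it, assuming the tuples read from postGIS arrive in a "county|area" text layout (the class names and field layout are illustrative): the GROUP BY field becomes the Map output key and Reduce input key, and the Reducer applies the aggregate.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: key = GROUP BY field, value = argument of the aggregate function.
class GroupByMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable off, Text tuple, Context ctx)
            throws IOException, InterruptedException {
        String[] f = tuple.toString().split("\\|");   // "county|area"
        ctx.write(new Text(f[0]),
                new DoubleWritable(Double.parseDouble(f[1])));
    }
}

// Reduce: one call per distinct group; applies the aggregate (sum here).
class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text county, Iterable<DoubleWritable> areas, Context ctx)
            throws IOException, InterruptedException {
        double sum = 0;
        for (DoubleWritable a : areas) sum += a.get();
        ctx.write(county, new DoubleWritable(sum));
    }
}
```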
We also provide 18 Figure 3.6: Geoserver-front end to Hadoop cloud other buttons to monitor Hadoop cluster via Geoserver interface viz : START HADOOP CLUSTER, STOP HADOOP CLUSTER, FORMAT HDFS and CLUSTER STATUS. 3.4 Query Execution Steps The query execution passes through three phases: • The first phase simply involves the original query to execute inside the database engine locally on cluster nodes in parallel. • In second phase, the tuples emitted out of DBMSs (in first phase), called the ResultSet, are read by the Mappers. Here Map performs any extra computation, which might not be supported at postGIS layer, if required, on each tuple. For example, we can test inside the postGIS layer to output all pair of roads thats intersect each other, but in case if we are specifically interested in finding all T-Point intersection points between roads, it can be tested in the map phase whether the two roads, which are now confirmed to intersect, actually intersect at around 90 degrees or not. • In third phase, Reducers start when all mappers have finished, each reducer, aggregate the individual map-outputs, consolidate them and write final results back onto HDFS, which is then be imported by Geoserver. This phase is optional, and is not required if no aggregation 19 of Map-outputs is required from different cluster nodes. Usually, third phase comes into picture in case of nested queries, or queries with GROUP BY clause. Suppose we need to find the polygon with the greatest area among all the polygons stored in three database sites. Figure 3.7 shows the Query execution plan of a query in the integrated environment of DBMS and MapReduce. The example shows three postGIS nodes hosting table containing id and area columns of polygon table. The example query output the polygon id and area of polygon that has maximum area in the respective table. Mappers then read this output and accumulate all Map-outputs at a single reducer by binding them with a common reduce key. Reducer, then find the largest polygon among all locally largest polygons. Query : Select id, area(polygon.the geom) as A from polygon where A = (Select max(area(polygon.the geom)) from polygon )) postGIS 1 21 2 42 3 22 postGIS 4 31 5 35 6 17 postGIS 7 12 8 10 9 47 output = 2, 42 output = 5, 35 output = 9, 47 Map (2, 42) { Map-output ( key1 , (2,42)) } Map (5, 35) { Map-output ( key1 , (5,35)) } Map (9, 47) { Map-output ( key1 , (9,47)) } Reduce (key1, List [(2,42), (5,45), (9,47)) { Find largest polygon among 2, 5, 9 Reduce-output ( NULL , (9,47)) } Final output Written onto HDFS Figure 3.7: Query Execution Plan 3.5 Challenges faced While replacing HDFS with postGIS database was accomplished by Hadoop libraries, the two main challenges that we faced were to : 20 • attach Geoserver as a front end to Hadoop cloud and invoking Hadoop’s utilies from Geoserver-interface and, • implementing SMC module in GeoServer. Attaching GeoServer as a front-end: This challenge is due to technical incompatibility between GeoServer which is platform independent and Hadoop’s MasterNode which is biased to run on UNIX platform. GeoServer is a software developed in Java and is therefore platform independent, Hadoop, like GeoServer, is also developed in Java, but Hadoop’s MasterNode relies on certain complex shell scripts which configures hundreds of Hadoop parameters stored in several xml files to configure Hadoop cloud, and therefore MasterNode is compatible to run only on UNIX machines. 
It is for this reason that Windows based machines can be used as Hadoop slave nodes only. Therefore, what was required, is to invoke those shell scrips with correct parameters from with in HadoopPlugin java code implemented as a separate module in GeoServer. Invoking the shell commands from inside the Java code is accomplished by Java java.lang.ProcessBuilder class. The class ProcessBuilder, in Java 1.5, is used to create operating system processes. With an instance of this class, it can execute an external programs (Hadoop controlling shell scripts in our case) and return an instance of a subclass of java.lang.Process. It provides methods for performing input from the process, performing output to the process, waiting for the process to complete, checking the exit status of the process, and destroying (killing) the process. GeoServer invoke Hadoop utilities via instance of this class. It is because of this reason that our GeoServer as Hadoop front end is only deployable only on UNIX machines, though can be accessed remotely via WEB from any machine. Implementing GeoServer SMC module: Implementing this module was algorithmically very challenging. SMC (SQL to MapReduce converter) module require to take the SQL query as input string, parse this query to generate SQL tokens such as table names specified in the query, fields selected, their data types, fields being grouped by and aggregate functions used. Therefore, this module requires a lot of String handling and builds an appropriate data structures such as mapping of fields specified in SQL Select to their return data types. Having build the data structures after query parsing, SMC maps the SQL constructs to MapReduce constructs. Rules for this mapping are listed in table 3.1. This mapping has to be carefully handled as one can twist the SQL queries in one way or the other. We try our best to make this module support variety of SQL query translation but still it is limited to support aggregate functions sum,max and min and table field data types supported are Geometry, int, float, double and String only. 3.6 Summary HadoopDB does not replace Hadoop. Both systems coexist enabling the analyst to choose the appropriate tools for a given dataset and task. Through the performance benchmarks in the following sections, we show that using an efficient database storage layer cuts down on data processing time especially on tasks that require complex spatial query processing over structured 21 data such as joins. We also show in the experiment section that HadoopDB is able to take advantage of the fault-tolerance and the ability to run on heterogeneous environments that comes naturally with Hadoop-style system. HadoopDB achieves fault tolerance and the ability to operate in heterogeneous environments by inheriting the scheduling and job tracking implementation from Hadoop, yet it achieves the performance of parallel databases by doing much of the query processing inside of the database engine. 
3.6 Summary

HadoopDB does not replace Hadoop; both systems coexist, enabling the analyst to choose the appropriate tool for a given dataset and task. Through the performance benchmarks in the following sections, we show that using an efficient database storage layer cuts down data processing time, especially on tasks that require complex spatial query processing over structured data, such as joins. We also show in the experiment section that HadoopDB is able to take advantage of the fault tolerance and the ability to run in heterogeneous environments that come naturally with a Hadoop-style system. HadoopDB achieves fault tolerance and the ability to operate in heterogeneous environments by inheriting the scheduling and job-tracking implementation from Hadoop, yet it approaches the performance of parallel databases by doing much of the query processing inside the database engine.

Thus, by integrating the three systems, we obtain a shared-nothing distributed spatial DBMS cluster with MapReduce as the coordinating layer, and achieve the following goals:

• Performance, a property inherited from DBMSs
• Flexible query interface (querying the data via SQL instead of MapReduce programs), provided by GeoServer
• Fault tolerance, a property inherited from Hadoop
• Scalability, a property inherited from Hadoop
• The ability to run in a heterogeneous environment, a property inherited from Hadoop

Chapter 4 Vector Data Distribution

We now discuss the strategy for distributing the vector test data across the cluster nodes. The distribution of data is primarily governed by the JOIN operation, the most commonly used and most expensive operation performed. Spatial joins combine two spatial data sets by a spatial relationship such as intersection, containment or within. In a shared-nothing distributed DBMS, if two tables residing on different sites need to be joined, one of the tables has to be imported to the other's site prior to performing the join; spatial data is often large and therefore expensive to read from disk and transfer over the network. Vector spatial data, by its nature, is well suited to processing on clusters following a shared-nothing architecture. Hosting all spatial objects enclosed within a finite geographical boundary (called a partition) as tables on a single database site eliminates much of the need to move tables between database sites, thus abiding by Hadoop's shared-nothing architecture. For example, a spatial object enclosed within region A cannot overlap, intersect, meet or touch any spatial object in another geographical region B, so the two sets can be hosted on two different database sites: a join on these predicates between them would never return any result. Moreover, joins are rarely performed between tables whose data are not spatially proximal.

4.1 Partitioning Strategy

For a large random collection of spatial objects, we define the universe as the minimum bounding rectangle (MBR) that encloses all objects in the collection. In order to distribute the data sets across shared-nothing database sites, following the discussion above, the universe must be decomposed into smaller regions called partitions. Some metadata is needed first: the dimensions of the universe, determined by manual analysis of the data set or through hand-coded scripts. This is static, permanent information; once computed, it need not be computed again throughout the lifetime of the data set. The number of partitions into which the universe is to be spatially decomposed depends on the maximum table size a database can process efficiently without using temporary disk buffers (another parameter essential for data partitioning).
The spatial objects that qualify the predicate with a partition become members of that partition. This step produces candidates which are a superset of the actual result. Figure 4.1 shows the decomposition of a spatial space into four partitions. Each partition consists of the spatial objects whose MBRs give a positive overlap test with the partition. All the spatial objects belonging to a particular partition reside on a single database site in the distributed DBMS. Also, note that the spatial object labeled O1 in the figure overlaps with two partitions, P1 and P4, so it is a member of both partitions and therefore resides on the two corresponding database sites.
Figure 4.1: Decomposition of the Universe into Partitions
4.2 Partition Skew
In practice, the distribution of spatial features over 2D space is generally not even. For example, there are more roads in cities than in rural areas. Therefore, the distribution of spatial objects into partitions may be imbalanced. Figure 4.1 shows that partition P3 contains the fewest spatial objects, whereas partitions P1 and P4 are densely populated. This situation is called partition skew and is not uncommon. Since each partition corresponds to tables residing on the same database site, this uneven distribution causes the tables on different database sites to vary in size. Consequently, different amounts of query computation are carried out on different cluster nodes, increasing the overall job execution time: the overall execution time of the job is decided by the cluster node that finishes its share of the computation last. Therefore, we need load balancing for a balanced distribution of objects among partitions.
Figure 4.2: Tile-Based Partitioning Scheme (partition P1 decomposed into tiles numbered 0 to 11; tiles are mapped round-robin to partitions P1 through P4)
4.3 Load Balancing
To deal with the problem of partition skew, a tile-based partitioning method [9] is used for a balanced distribution of objects among partitions. This method decomposes the universe into N smaller regions called tiles, where N ≫ P (the number of partitions), with a many-to-one mapping between tiles and partitions. All spatial objects that give a positive overlap test with a tile are copied to the partition that the tile maps to. The larger the number of tiles into which the universe is decomposed, the more uniform the distribution of objects among partitions. In Figure 4.2, the universe is decomposed into 48 tiles. We show the decomposition of only one partition, P1, into tiles numbered from 0 to 11; the other partitions are decomposed in the same manner (not shown in the figure). Tiles are mapped to partitions in round-robin fashion. Some spatial objects that are spatially enclosed within this partition are now mapped to other partitions: for example, the spatial objects of partition P1 that overlap with tiles 2 and 5 now become members of partitions P3 and P2, respectively. In the same manner, some spatial objects from other partitions are mapped to partition P1. This results in a more uniform distribution of spatial objects among partitions.
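To make the scheme concrete, the following sketch shows the tile-to-partition assignment in Java. It is a minimal illustration only, not the code of our prototype; the class name, the grid layout and the row-major tile numbering are assumptions made for the example.

    // Illustrative sketch of tile-based partitioning: the universe is cut into
    // nTilesX x nTilesY equal tiles, tiles map to partitions round-robin, and a
    // spatial object is copied to the partition of every tile its MBR overlaps.
    import java.util.HashSet;
    import java.util.Set;

    public class TilePartitioner {
        private final double minX, minY, tileW, tileH; // universe metadata
        private final int nTilesX, nTilesY, nPartitions;

        public TilePartitioner(double minX, double minY, double maxX, double maxY,
                               int nTilesX, int nTilesY, int nPartitions) {
            this.minX = minX; this.minY = minY;
            this.tileW = (maxX - minX) / nTilesX;
            this.tileH = (maxY - minY) / nTilesY;
            this.nTilesX = nTilesX; this.nTilesY = nTilesY;
            this.nPartitions = nPartitions;
        }

        // Round-robin mapping of tile number to partition number.
        private int partitionOfTile(int tileId) {
            return tileId % nPartitions;
        }

        // Given the MBR (xlo, ylo, xhi, yhi) of an object's geometry, return
        // every partition the object must be copied to.
        public Set<Integer> partitionsFor(double xlo, double ylo,
                                          double xhi, double yhi) {
            int cLo = (int) ((xlo - minX) / tileW), cHi = (int) ((xhi - minX) / tileW);
            int rLo = (int) ((ylo - minY) / tileH), rHi = (int) ((yhi - minY) / tileH);
            Set<Integer> result = new HashSet<>();
            for (int r = Math.max(rLo, 0); r <= Math.min(rHi, nTilesY - 1); r++)
                for (int c = Math.max(cLo, 0); c <= Math.min(cHi, nTilesX - 1); c++)
                    result.add(partitionOfTile(r * nTilesX + c)); // row-major tile id
            return result;
        }
    }

An object whose MBR spans a tile boundary is returned for more than one partition, which is exactly the duplication discussed next.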
Though this strategy is simple to implement, balancing the distribution of spatial objects across partitions in this way suffers from the following drawbacks:
• As each partition is decomposed further into smaller tiles, the number of spatial objects overlapping multiple tiles increases, so that a spatial object maps to multiple partitions, resulting in pronounced duplication.
• The strategy is somewhat ad hoc and unpredictable.
• Predicting an appropriate value of N (the number of tiles) is difficult.
We applied the tile-based partitioning scheme to a total of 24,07,442 spatial objects of the state of California, comprising counties, roads and rivers, by decomposing the universe into 512 (randomly chosen) equal-sized tiles mapping to 6 partitions (or 6 disks).
Figure 4.3: Initial Distribution (partition object counts are highly skewed, e.g. 94,333 objects on one partition against 9,20,742 on another)
Figure 4.4: Distribution after Tile Partitioning (partition object counts now range from 3,06,119 to 5,04,312)
Figure 4.3 shows the original distribution of objects as they lie on the map of California. From this distribution we infer that there are almost no spatial features in the top right corner of the universe. Figure 4.4 shows the distribution of spatial objects after performing the tile-based division; the distribution of objects across partitions is much more uniform. However, the total number of spatial objects in Figure 4.4 comes to 24,94,740, which is 87,298 more than the original count. This is because a spatial object whose MBR overlaps multiple tiles at once maps to multiple partitions, increasing object duplication, which becomes more pronounced as the granularity of the division of space into tiles increases. This leads to redundant computation of results across multiple database sites. We discuss an improved spatial data partitioning strategy based on the Hilbert space-filling curve in the next section, which resolves the problem of object duplication.
4.4 Spatial Data Partitioning using Hilbert Curve
The Hilbert space-filling curve (HSFC) is a continuous fractal space-filling curve first described by the German mathematician David Hilbert in 1891 [23]. Hilbert curves are useful in the domain of spatial data partitioning because they give a mapping between 1D and 2D space that preserves spatial locality fairly well. If (x, y) are the coordinates of a point within the unit square, and d is the distance along the curve when it reaches that point, then points with nearby d values will also have nearby (x, y) values. The converse also holds: nearby (x, y) points will have nearby d values. Figure 4.5 shows the division of a 2D space as the recursion depth of the Hilbert curve increases.
Figure 4.5: Hilbert Space-Filling Curve
We first define the term Hilbert value. The Hilbert value of an arbitrary point p in 2D space is defined as the distance along the Hilbert curve at the point p', where p' is the point lying on the Hilbert curve nearest to p. To keep the discussion simple, let us assume we have the following two functions:
• double[] GenerateHilbertCurve(m, universe dimensions): generates the set of 2D points, called Hilbert points, which decompose the universe as shown in the figure above; m is the recursion depth.
• double ComputeHilbertValue(point p): returns the Hilbert value of a point p in 2D space.
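A common concrete realization of ComputeHilbertValue maps a point's grid coordinates directly to its index along the curve. The sketch below shows the standard iterative coordinate-to-index conversion for a curve of recursion depth m; quantizing a real (x, y) coordinate onto the 2^m x 2^m grid is assumed to be done by the caller, and the code is an illustration, not the exact routine of our implementation.

    // Sketch of a Hilbert-value computation: maps grid coordinates (x, y) on a
    // 2^m x 2^m grid to the distance d along the Hilbert curve of depth m.
    public class HilbertValue {

        public static long xy2d(int m, long x, long y) {
            long d = 0;
            for (long s = 1L << (m - 1); s > 0; s >>= 1) {
                long rx = ((x & s) > 0) ? 1 : 0;
                long ry = ((y & s) > 0) ? 1 : 0;
                d += s * s * ((3 * rx) ^ ry);
                // Rotate the quadrant so the sub-square is in canonical orientation.
                if (ry == 0) {
                    if (rx == 1) {
                        x = s - 1 - x;
                        y = s - 1 - y;
                    }
                    long t = x; x = y; y = t;
                }
            }
            return d;
        }

        public static void main(String[] args) {
            // The four cells of a depth-1 curve are visited in the order
            // (0,0) -> (0,1) -> (1,1) -> (1,0), i.e. d = 0, 1, 2, 3.
            System.out.println(xy2d(1, 0, 0)); // 0
            System.out.println(xy2d(1, 0, 1)); // 1
            System.out.println(xy2d(1, 1, 1)); // 2
            System.out.println(xy2d(1, 1, 0)); // 3
        }
    }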
Partitioning spatial data sets according to this scheme requires computing the Hilbert value of every spatial entity in the data set. The Hilbert value of a spatial object Obj is computed as:
Obj_hv = ComputeHilbertValue(Center(MBR(geom(Obj))))
Let Vg be the average volume, in bytes, of spatial entities per disk; N the number of disks in the system; B_j the volume of spatial entities on the j-th disk, initialized to 0 for all j = 0 to N-1; n the number of spatial entities to be partitioned; and V_i the size of the i-th spatial entity in bytes, with i ranging from 0 to n-1. The procedure for data partitioning using the HSFC approach is as follows:
• Construct the Hilbert curve within the universe and compute the Hilbert value of each spatial entity in the data set. As a general rule, if n is the number of spatial entities in the universe, then n < 2^(2m), where m is the recursion depth of the Hilbert curve to be generated.
• Sort the spatial entities in increasing order of Hilbert value.
• Beginning from i = 0, j = 0, put the i-th spatial object on the j-th disk and set B_j = B_j + V_i.
• Compare B_j and Vg. If B_j is smaller than Vg, then i = i + 1; otherwise j = j + 1. Repeat this step until i = n-1.
In this approach each spatial entity is assigned to one and only one disk, so there is no duplication. Spatial locality is preserved because spatial entities are assigned to disks in increasing order of Hilbert value.
Chapter 5 Performance Evaluation
In this chapter we run a set of benchmarks to assess the performance, in the domain of spatial data processing, of Hadoop+HDFS, of GeoServer on top of a 3-node spatial HadoopDB cloud (termed simply 3Node GeoServer from now on), and of a single-node GeoServer (a GeoServer with a single-node HadoopDB back end). We subject each system to spatial queries with different execution plans. The test data comprises the counties (polygons) and roads (linestrings) of three US states: California, Arizona and Texas [18]. The details of the experimental environment follow.
Table 5.1: Test Data Description
Node     State        # Counties   # Roads     Size (MB)
Node 1   Texas        32693        1377372     170
Node 2   Arizona      11963        718556      135
Node 3   California   62096        2062872     245
Table 5.2: Hardware Description
Node     CPU            RAM (GB)   Freq (GHz)
Node 1   Intel 4-core   2          2.66
Node 2   Intel 4-core   2          2.66
Node 3   Intel 2-core   2          1.66
Data Distribution: The test data is distributed across the three-node cluster. In the case of Hadoop, we upload the input files onto HDFS, which scatters them as fixed-size data blocks across the cluster. In the case of HadoopDB, one postGIS database server is active on each node, and we distribute the data state-wise, that is, each node stores the county and roads tables of exactly one state. All experiments are performed on this three-node cluster. Network communication between the cluster nodes is over a 100 Mbps Ethernet link.
5.1 Highly Selective Spatial Queries
Goal: To show the improvement in response time obtained by distributing highly selective queries over multiple postGIS servers.
Hypothesis: Highly selective spatial queries, such as the one shown in Figure 5.1, aim at selecting a very small number of tuples, those qualifying the given predicate condition, from large data sets (of the order of tens of millions of rows).
By replacing Hadoop's default read-only data source, HDFS, with a database layer, MapReduce is no longer bound to scan all data blocks (or chunks) in a brute-force manner to retrieve the result required by the business logic. A MapReduce job written to fetch a very small number of tuples from an extremely large data set would otherwise require a brute-force linear scan over every tuple in the data set: the MapReduce framework splits large files into smaller chunks which are distributed across the cluster nodes, each data chunk is bound to exactly one mapper, and when the mappers start, the chunks are processed independently and in parallel across the cluster. However, the tuples that actually satisfy the selection criterion may belong to only a few data chunks, or even to just one. MapReduce's inability to index from a data tuple to the data chunk that contains it forces it to process all the data chunks, and thus to launch as many mappers as there are data chunks, increasing the JobTracker's overhead in controlling the ongoing computation and over-consuming cluster resources. With a database layer, most of the tuples are filtered out within the database, and only a small number of records is handed to the subsequent mappers, minimizing the network and disk read/write overhead. We expect HadoopDB to outperform Hadoop by a wide margin.
Result and Explanation: The query outputs the geometry of only those roads whose length is greater than 0.1 units; only 182 out of 4158800 tuples qualify. Note that 0.1 units corresponds to 11.1 km of length, because all data is represented in decimal degrees [22]. 3Node GeoServer clearly outperforms single-node GeoServer, as shown in Figure 5.1. In 3Node GeoServer, the qualified tuples are fetched out of the database layer as per the SQL WHERE condition; tuples not satisfying the constraint are filtered out inside the database layer itself. Hence, the amount of work done in the MapReduce environment is very small compared to the pure MapReduce case. Hadoop scans all data tuples and therefore shows the worst performance.
Figure 5.1: Performance evaluation of the highly selective query — select id, geom from roads where length(geom) > 0.1 (y-axis: time in seconds; systems compared: 3Node Hadoop, 3Node GeoServer, 1Node GeoServer)
5.2 Spatial Join Queries
Goal: To evaluate the performance of Hadoop, HadoopDB and a single postGIS instance while performing spatial joins. We perform the spatial join between the counties and roads of all three states, aiming to determine the total length of roads in each county. For Hadoop we employ the SJMR algorithm [6], in which the partitions correspond to the bounding boxes of the states, so three partitions in all. For HadoopDB and the single DB we use the SQL query shown in Figure 5.2.
Hypothesis: We perform the spatial join query (see Figure 5.2) by implementing SJMR on Hadoop, which involves online partitioning of the spatial data sets in the Map phase, followed by a Reduce phase performing the actual spatial join. In the case of an intra-site join on HadoopDB (that is, the join operand tables reside on the same database sites), data partitioning was done offline and is not part of the join's runtime processing: the entire spatial-join query logic is pushed inside the database layer, completely relieving the Map phase of any intensive geometric computation and avoiding the Reduce phase altogether.
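To make the intra-site pushdown concrete, the sketch below issues the spatial join of Figure 5.2 directly against the node-local postGIS server over JDBC. It is illustrative only: the connection URL, credentials and database name are placeholder assumptions, and in HadoopDB this query is issued by the Map task against its local database rather than by a standalone program.

    // Illustrative sketch: the entire spatial-join logic runs inside the local
    // postGIS database, so only the small aggregated result reaches the
    // MapReduce layer. Requires the PostgreSQL JDBC driver on the classpath;
    // uses the PostGIS 1.x-style function names that appear in our queries.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class IntraSiteJoinPushdown {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:postgresql://localhost:5432/spatialdb"; // assumed
            try (Connection conn = DriverManager.getConnection(url, "gis", "gis");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT a.id, SUM(Length(b.geom)) " +
                     "FROM counties AS a, roads AS b " +
                     "WHERE Intersects(a.geom, b.geom) " +
                     "GROUP BY a.id")) {
                while (rs.next()) {
                    // county id -> total road length within that county
                    System.out.println(rs.getInt(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }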
We also perform an inter-site join on HadoopDB by randomly redistributing the test data among the database sites; this is similar to SJMR except that its data source is database tables rather than HDFS. We expect HadoopDB to outperform Hadoop cleanly as long as the join is intra-site, whereas the performance of the two systems should be comparable for the inter-site join, as the databases merely act as a read-only layer in that case, like HDFS. Single-node HadoopDB, on the other hand, performs worse than 3-node HadoopDB for obvious reasons.
Inter-Site Spatial Join: As said earlier, the partitioning of spatial data sets among database sites is governed primarily by spatial joins. As long as the join operand tables reside on the same database sites, the database layer naturally takes care of performing a speedy join by exploiting spatial indices. However, there can be scenarios where we need to perform a join between tables residing on different database sites; we call such joins inter-site spatial joins. In other shared-nothing distributed spatial DBMSs, such a join is accomplished by importing one join operand table (preferably the smaller one) to the other's site over the network prior to performing the join. The transmission cost over the network, the cost of loading the imported data into the database, and the cost of recreating spatial indices are all part of the join algorithm and add much to the overall cost of the spatial query. HadoopDB has the additional advantage of having MapReduce as a task coordination layer beneath the database layer, with the capability to programmatically express a wide variety of algorithmic logic: we can shift the entire spatial join algorithm down to the MapReduce layer. Suppose we have spatial data sets R and S residing on database sites Ri and Si respectively. Performing the inter-site spatial join involves three steps:
Step 1. Read source data: Read the qualified tuples from sites Ri and Si in parallel, as per the SQL WHERE clause, if any. These tuples are read by the Map phase.
Step 2. Spatial data partitioning: The spatial data partitioning scheme described in the previous section is now performed online, implemented in the Map phase. This phase needs the characteristics of the data sets, such as the universe dimensions and the number of partitions, as additional input, which is essential to decompose the universe into partitions. Each partition contains the spatial objects from R and S that are potential candidates to qualify the join predicate.
Step 3. Performing the actual spatial join: Each partition is then processed by the reducers in parallel to compute the spatial join between R and S. In this phase we implement the well-known sweep-line algorithm to perform the join.
Figure 5.2: Performance evaluation of the spatial join query — select a.id, sum(length(b.geom)) from counties as a, roads as b where intersects(a.geom, b.geom) group by a.id; (y-axis: time in minutes; Map and Reduce phases shown separately; systems compared: 3Node Hadoop, 3Node GeoServer (inter join), 3Node GeoServer (intra join), 1Node GeoServer)
Result and Explanation: As shown in Figure 5.2, in the intra-join case 3Node GeoServer clearly outperforms Hadoop+HDFS and single-node GeoServer. However, 3Node GeoServer's performance degrades to that of Hadoop in the inter-join case, because the join processing has now been shifted from the database layer down to the MapReduce layer, which, as in SJMR, involves online partitioning followed by a Reduce phase.
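For reference, the three-step inter-site join described above can be phrased as a single MapReduce job. The skeleton below sketches this using Hadoop's org.apache.hadoop.mapreduce API; the record layout, the partition-lookup helper and the sweep itself are placeholders standing in for the corresponding pieces of our implementation.

    // Sketch of the inter-site spatial join as one MapReduce job: the Map phase
    // tags each tuple with the partition(s) its MBR overlaps (Step 2); the Reduce
    // phase receives all R- and S-candidates of a partition and joins them (Step 3).
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InterSiteJoin {

        public static class PartitionMapper
                extends Mapper<LongWritable, Text, IntWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                // Assumed record layout: source-tag \t id \t MBR \t geometry-WKT.
                String[] fields = value.toString().split("\t");
                // Replicate the tuple to every partition its MBR overlaps.
                for (int p : overlappingPartitions(fields[2])) {
                    ctx.write(new IntWritable(p), value);
                }
            }
            private int[] overlappingPartitions(String mbr) {
                // Placeholder: test the MBR against the partition grid built from
                // the universe dimensions supplied as job configuration.
                return new int[] { 0 };
            }
        }

        public static class SweepJoinReducer
                extends Reducer<IntWritable, Text, Text, Text> {
            @Override
            protected void reduce(IntWritable partition, Iterable<Text> values,
                                  Context ctx)
                    throws IOException, InterruptedException {
                // Placeholder: split values by source tag into R and S candidate
                // lists, sort both by xmin, run the sweep-line join, and emit
                // every pair whose geometries truly satisfy the join predicate.
            }
        }
    }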
5.3 Global Sorting
Goal: To evaluate the performance of the systems when network bandwidth becomes the bottleneck.
Hypothesis: The query shown in Figure 5.3 requires the counties to be first read out of HDFS (or out of the DBMS in the case of HadoopDB) and then aggregated at a single reducer process for sorting. This results in large volumes of data flowing across the network, because the geometry of spatial entities is large. The overall completion time includes the time taken to aggregate the data at a single machine over the 100 Mbps link, so performance is largely driven by network bandwidth.
Result and Explanation: Figure 5.3 shows that there is no large difference between the performance of 3Node Hadoop+HDFS and that of 3Node GeoServer for this query, because the MapReduce-SQL implementation of this query merely reads all tuples, from each local database in the GeoServer case and from HDFS in the Hadoop case. Single-node GeoServer performs slightly better on this query, as it suffers no network overhead.
Figure 5.3: Performance evaluation of the global sort query — select id, geom from counties order by area(geom); (y-axis: time in seconds; Map and Reduce phases shown separately; systems compared: 3Node Hadoop, 3Node GeoServer, 1Node GeoServer)
5.4 Queries against the shared-nothing restriction
Goal: To evaluate the performance of Hadoop, HadoopDB and single-node GeoServer for spatial queries that tend to go against the shared-nothing restriction.
Certain spatial queries go against Hadoop's shared-nothing restriction by requiring communication between independent MapReduce processes running on different cluster machines. The query shown in Figure 5.4 returns all roads of the state of California that are longer than the longest road of Arizona and of Texas. Since the roads tables of the three states reside on three different database sites, we first need to evaluate the result of the subquery, which is then taken as input by the outer query to yield the final result. Because the local results of the different database sites (the lengths of the longest roads of Arizona and Texas) need to be communicated to the California database site, the execution plan of this query goes against Hadoop's shared-nothing restriction, and the query therefore cannot be represented by a single-stage MapReduce program. To implement this query in HadoopDB, the MapReduce-SQL plan contains two MapReduce stages. In the first stage, the subquery is processed on the Arizona and Texas sites in parallel and the local results (the lengths of the longest roads of those states) are written onto HDFS. In the second MapReduce stage, the outer query takes the result of the previous stage from HDFS as input at run time and is processed on the California site only. The same mechanism is followed by Hadoop, by setting the input directories to Texas and Arizona for the first MapReduce stage, and to the California directory for the second stage.
Figure 5.4: Performance evaluation of the against-shared-nothing query — select geom from california_roads where length(geom) > ALL ((select max(length(geom)) from arizona_roads) UNION (select max(length(geom)) from texas_roads)); (y-axis: time in seconds; MR Stage 1 and MR Stage 2 shown separately; systems compared: 3Node Hadoop, 3Node GeoServer, 1Node GeoServer)
Results and Explanation: Figure 5.4 shows that Hadoop's performance is the worst, for obvious reasons. However, the performance of 3Node GeoServer is only comparable to that of single-node GeoServer.
This is because the overhead of launching two MapReduce jobs one after another dominates the overall query execution time; the Hadoop framework alone takes around 8-10 seconds just to initiate a MapReduce job. In a cloud computing environment, where hundreds of gigabytes of data are spread across the cluster nodes, the overhead of launching an extra MapReduce job would be negligible, unlike in our case.
5.5 Fault Tolerance Test
Goal: To verify that HadoopDB inherits the same fault-tolerance capability as Hadoop. In this experiment we use the same query as in experiment 2 and configure both systems with a replication factor of 2. During job execution on the three-node cluster, when the MapReduce job is 65% complete, we disconnect node 1. We then note the time taken by the job to run to completion in both the Hadoop and HadoopDB cases.
Hypothesis: In both HadoopDB and Hadoop, the tasks of the failed node are distributed over the remaining available nodes that contain replicas of the data. We expect HadoopDB to slightly outperform Hadoop in such situations. In Hadoop, TaskTrackers assigned blocks not local to them copy the data first (from a replica) before processing. In HadoopDB, however, the processing is pushed into the (replica) database. Since the number of records returned after database query processing is less than the raw size of the data, HadoopDB does not experience Hadoop's network overhead on node failure.
Figure 5.5: Fault-tolerance comparison of Hadoop+HDFS with Hadoop+DB (N = normal execution, F = execution with node failure; Hadoop+HDFS: N = 8 min, F = 13 min; Hadoop+DB: N = 1.5 min, F = 2 min)
Results and Explanations: We observe that the percentage slowdown for Hadoop and HadoopDB is 62.5% and 33% respectively (13 min against 8 min, and 2 min against 1.5 min). This is because HadoopDB simply restarts the query on a replica database, whereas in Hadoop the restarted TaskTrackers pull replica data blocks not local to them, incurring additional network overhead, which is significant for spatial data due to the large geometry sizes.
Chapter 6 Summary and Conclusion
6.1 Summary
We started with a discussion of parallel DBMSs and MapReduce, two widely used methodologies for processing large data sets, including but not limited to spatial data, and highlighted the pros and cons of both. MapReduce is designed to let us harness commodity hardware operating in shared-nothing mode while lending robustness to the computation, since parts of the computation can be restarted on failure; a spatial DBMS's optimized capabilities, on the other hand, yield high performance as long as the prominent portion of the query logic is processed inside the database layer. Being a hybrid of the MapReduce and database technologies, HadoopDB inherits the benefits of both, allowing each to do what it is good at. We present a comparative analysis of the three systems, viz. Hadoop+HDFS, Hadoop+DBMS and parallel DBMSs, in Figures 6.1 and 6.2. If we treat the DBMS and Hadoop technologies as two extremes, Hadoop-with-Database (HadoopDB) is actually a DBMS equipped with some Hadoop techniques: the DBMSs are taken as the storage and execution units, and the MapReduce mechanism takes responsibility for parallelization and fault tolerance. Bringing the two technologies together:
• What we gain w.r.t. Hadoop:
1. Efficiency and performance.
2. The data source (DBMS) is modifiable, where HDFS was read-only, so update queries come into the picture.
3. The ACID properties of the DBMSs.
• What we lose w.r.t. Hadoop:
1. Transparency with respect to data loading: data loading and distribution are no longer automated by the Hadoop framework, but become manual.
2. Facebook processes 2 petabytes of data daily; manually uploading that amount of data onto DBMS nodes on a daily basis is not affordable.
• What we gain w.r.t. DBMSs:
1. A shared-nothing, open-source parallel DBMS; there has been no open-source parallel DBMS so far.
2. Unlike other distributed DBMSs, HadoopDB possesses better fault-tolerance features.
6.2 Conclusion
We conclude that the MapReduce programming paradigm alone is sufficient to express most spatial query logic, but its lack of support for spatial indexing and its brute-force nature make it impractical for interactive, real-time spatial data analysis systems. HadoopDB shows a great improvement in query execution speed, as postGIS's inherent support for spatial indices adds a significant advantage; on the other hand, performance degrades to no better than MapReduce for queries whose execution plan goes against the "shared-nothing" restriction, such as the inter-site spatial join. We also observe that vector spatial data, by its nature, is well suited to being processed on shared-nothing distributed database clusters: hosting all spatial objects confined within a finite geographical boundary as a single table chunk on one database node eliminates much of the need to move tables between database nodes, thus abiding by Hadoop's shared-nothing architecture, avoiding dependency on the MapReduce layer, and therefore yielding high performance.
Also, since HadoopDB does not possess any fault tolerance at the data layer, the offline management of spatial data costs much to the usability of HadoopDB in the realm of very large-scale spatial data analysis. In large-scale data analysis the data sometimes has a short life cycle: it is loaded into the system in batch mode, a nearly fixed set of queries is put to it, and it is then offloaded to an offline system. In such conditions, organizing the data into a sophisticated structure is not worthwhile, given the extra maintenance cost and the low utility, thus making this system impractical in the field of large-scale data analysis. The situation gets worse if the spatial data suffers from partition skew and load balancing is required, which is not uncommon.
6.3 Future Work
Hadoop's current implementation of MapReduce does not support any sort of indexing mechanism. This is not a drawback of MapReduce; it is something that MapReduce was not designed for. MapReduce was designed for the one-time processing of large data sets in batch mode. We see it as future scope to equip MapReduce with an indexing mechanism to make it suitable for real-time data analysis.
One of the biggest drawbacks of HadoopDB, which makes it unsuitable in the realm of large-scale data processing, is its lack of fault tolerance at the data layer: partitioning the raw data and uploading it onto the individual database nodes is no longer supervised by the Hadoop framework. While HadoopDB integrates the power of efficient DBMS technology with MapReduce, it still seems impractical to employ this system for large-scale data processing. It would be a great advancement towards large-scale data processing if HadoopDB were improved to possess fault tolerance at the data layer too, just as Hadoop does.
Property Comparison Chart of Hadoop+HDFS, Hadoop+DBMS and Parallel DBMS

Fault tolerance (amount of work to be redone on node/process failure):
• Hadoop: Excellent fault-tolerance capability; the amount of work lost due to a node/process failure is small. Good fault tolerance is achieved at the cost of performance (by storing intermediate outputs on disk).
• Hadoop-with-Database: Inherits Hadoop's fault tolerance. On node failure, the task is restarted on another node hosting the replicated data; only the chunk of data hosted by the failed node needs to be reprocessed on another node storing the replicated database.
• Parallel DBMS: Poor fault tolerance; the amount of work lost due to a node/process failure is large, and the query needs to be restarted from the beginning. Designers emphasize performance, so intermediate output data is pipelined to the next query operator without being written to disk.

Scalability:
• Hadoop: The system can scale up to 4000-5000 nodes easily.
• Hadoop-with-Database: Possesses scalability near to that of Hadoop.
• Parallel DBMS: Poor scalability. Aster Data, a parallel database known to possess one of the best scalability figures in the parallel database community, scales to around 330-350 nodes. The probability of node failure increases with cluster size, and frequent failures would result in degraded performance.

Performance:
• Hadoop: Processing time is much larger. The WHERE condition is checked within the Map phase (by hand coding), so the map function unnecessarily has to read every tuple from the input file; there is no provision to index the input data whatsoever. Lots of disk reads/writes: intermediate outputs (of mappers) are written to disk before they are fetched by subsequent processes (reducers).
• Hadoop-with-Database: By replacing the data source HDFS with a database, performance is dramatically improved: the DBMS minimizes the search space and accelerates query execution by making use of database indexes. Because most data tuples are filtered out within the database (by the WHERE clause), the number of input tuples to the mappers is not very large, and hence much less computation takes place in the MapReduce environment compared to Hadoop. However, for inter-site queries, performance can degrade to that of Hadoop.
• Parallel DBMS: We expect a parallel DBMS to clearly outperform Hadoop, because of (1) the smaller search space due to database indexes, and (2) intermediate results being pipelined to the next query operator without being written to disk. No disk reads/writes of intermediate results lead to high performance, but poor fault tolerance.

Architecture type:
• Hadoop: Shared-nothing architecture.
• Hadoop-with-Database: Shared-nothing architecture.
• Parallel DBMS: Shared-memory architecture.

Hardware support:
• Hadoop: Especially designed to run on cheap commodity hardware; node failure is therefore considered a frequent event, and a lot of emphasis has been placed on Hadoop's fault-tolerance features.
• Hadoop-with-Database: Inherits this feature from Hadoop; we now have a kind of distributed DBMS that can cope with clusters made up of ordinary hardware machines.
• Parallel DBMS: Designed on the assumption that node failure is a rare event; cluster nodes therefore require quality hardware machines.

Global data structures:
• Hadoop: No provision for the maintenance of record indexes.
• Hadoop-with-Database: Isolated indexes have to be created per database; cannot make use of global structures.
• Parallel DBMS: Can make use of global data structures, such as global indexes.

Data loading and distribution:
• Hadoop: Transparent to the user; data is split into blocks and distributed across HDFS automatically by the Hadoop framework. Data loading is fast; data is unstructured, in textual format.
• Hadoop-with-Database: Manual; the user needs to upload the data into the databases across the cluster by hand. Data loading is slow, as each tuple might have to undergo many checks to satisfy integrity constraints; data is structured and organized in tables. This makes the system a little unrealistic for large-scale data processing.
• Parallel DBMS: Transparent to the user; tables are replicated without manual user support. Data loading is slow.

Figure 6.1: Comparison Chart (continued in Figure 6.2)

Granularity of parallelism:
• Hadoop: Block-level parallelism; each data block is allotted a map task.
• Hadoop-with-Database: Table-level parallelism; a full database table is the smallest unit allotted to a map task.
• Parallel DBMS: Granule-level parallelism; a table is logically split into chunks called granules, each processed in parallel. Oracle's parallel DBMS supports this feature.

Application requirements:
• Hadoop: Batch processing.
• Hadoop-with-Database: Midway between batch processing and real-time processing.
• Parallel DBMS: Optimized to yield real-time benefits.

Environment:
• Hadoop: Heterogeneous machines.
• Hadoop-with-Database: Inherits this feature from Hadoop.
• Parallel DBMS: Does not score well in heterogeneous environments.

Large-scale data analysis (LSDA):
• Hadoop: Designed especially for this purpose.
• Hadoop-with-Database: Scalable to thousands of machines, but not as suitable for LSDA as Hadoop.
• Parallel DBMS: LSDA, but at moderate scale.

Cost:
• Hadoop: Open-source project, free of cost.
• Hadoop-with-Database: Completely made up of open-source components.
• Parallel DBMS: No open-source parallel DBMS is known.

Figure 6.2: Comparison Chart

Bibliography
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA: USENIX Association, 2004.
[2] A. Bialecki, M. Cafarella, D. Cutting, and O. O'Malley, "Hadoop: A framework for running applications on large clusters built of commodity hardware," Wiki at http://lucene.apache.org/hadoop.
[3] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. R. Madden, and M. Stonebraker, "A comparison of approaches to large-scale data analysis," in Proceedings of the 35th SIGMOD International Conference on Management of Data, ACM Press, New York, 2009, pp. 165-178.
[4] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "MapReduce and parallel DBMSs: Friends or foes?" Communications of the ACM, 53(1):64-71, 2010.
[5] J. Zhang, N. Mamoulis, D. Papadias, and Y. Tao, "All-nearest-neighbors queries in spatial databases," June 2004, pp. 297-306.
[6] S. Zhang, J. Han, Z. Liu, K. Wang, and Z. Xu, "SJMR: Parallelizing spatial join with MapReduce on clusters," in Proceedings of CLUSTER, 2009, pp. 1-8.
[7] J.-P. Dittrich and B. Seeger, "Data redundancy and duplicate detection in spatial join processing," in ICDE '00: Proceedings of the 16th International Conference on Data Engineering, 2000, pp. 535-546.
[8] T. Brinkhoff, H.-P. Kriegel, and B. Seeger, "Parallel processing of spatial joins using R-trees," in ICDE '96: Proceedings of the Twelfth International Conference on Data Engineering, pp. 258-265.
[9] J. M. Patel and D. J. DeWitt, "Partition based spatial-merge join," in Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, 1996, pp. 259-270.
[10] Y. Wang and S. Wang, "Research and implementation on spatial data storage and operation," 2010, pp. 275-278.
[11] K. Wang, J. Han, B. Tu, J. Dai, W. Zhou, and X. Song, "Accelerating spatial data processing with MapReduce," in Proceedings of ICPADS, 2010, pp. 229-236.
[12] S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii, "Filtering: A method for solving graph problems in MapReduce," in Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2011, pp. 85-94.
[13] H. Liao, J. Han, and J. Fang, "Multi-dimensional index on Hadoop distributed file system," in Fifth International Conference on Networking, Architecture, and Storage, 2010, pp. 240-249.
[14] A. Guttman, "R-trees: A dynamic index structure for spatial searching," in Proceedings of ACM SIGMOD, Boston, Massachusetts, ACM, 1984, pp. 47-57.
[15] A. Akdogan, U. Demiryurek, F. Banaei-Kashani, and C. Shahabi, Integrated Media Systems Center, University of Southern California, Los Angeles, CA 90089.
[16] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, "HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads," in Proceedings of VLDB '09, 2009.
[17] http://en.wikipedia.org/wiki/GeoServer
[18] http://arcdata.esri.com/data/tiger2000/tiger_download.cfm
[19] G. Leptoukh, "NASA remote sensing data in earth sciences: Processing, archiving, distribution, applications at the GES DISC," in Proceedings of the 31st International Symposium on Remote Sensing of Environment, 2005.
[20] http://en.wikipedia.org/wiki/GeoServer
[21] http://people.na.infn.it/~dimartino/webgis/architecture.html
[22] http://en.wikipedia.org/wiki/Decimal_degrees
[23] http://en.wikipedia.org/wiki/Hilbert_curve