“UNDERSTANDING CAPABILITIES OF SPATIAL DATABASE (POSTGIS) IN DISTRIBUTED ENVIRONMENT”

MID TERM REPORT

Submitted in Partial Fulfilment for the Award of M.Tech in Computer Science of Banasthali University, Rajasthan

Supervisor: Prof. N.L. SARDA, Computer Science & Engg Dept., IIT BOMBAY
Submitted By: MONIKA YADAV, 11768

AIM & ACT, Banasthali University (Rajasthan), 2011-2012

DECLARATION

I hereby declare that the project entitled “Understanding capabilities of spatial database (PostGIS) in distributed environment”, submitted for the M.Tech Degree, is my original work and that the project has not formed the basis for the award of any degree, associateship, fellowship or any other similar title.

Signature of the Student:
Place:
Date:

ACKNOWLEDGEMENT

The partial completion of this internship project was possible with the help and guidance that I received from Prof. N.L. Sarda and Dr. Smita Sengupta at the GISE Lab. I take this opportunity to thank everyone for the cooperation extended to me during my internship, especially Mr. Bharat Singhvi, Mr. Nikhil Morajkar and Ms. Anuja Shukla. I also specially thank Prof. N.L. Sarda for his guidance and kind co-operation throughout my internship. His insights, expertise and energy contributed greatly to the success of this internship. I also thank my parents, who have always supported me in all aspects. Formal and informal discussions with Neha Arora, Bharti Kathpalia and Mr. Kashiram Vichare also helped me in achieving various tasks. Finally, I thank Banasthali University for giving me the opportunity to work at a great place and in a great lab, and for laying the foundation for my future career path.

Monika Yadav,
Intern, Geo-Spatial Information Science & Engineering (GISE), Advance Research Lab, Indian Institute of Technology Bombay (IITB)
M.Tech, Banasthali University (Rajasthan)

TABLE OF CONTENTS

CHAPTER NO.   TITLE
              ABSTRACT
1.            INTRODUCTION
              1.1 Objective
              1.2 Technology Used
              1.3 Target Users
2.            LITERATURE REVIEW
              2.1 Study of PostGIS
              2.2 Benchmarking On Spatial Database for Single User
              2.3 Pgpool-II-3.0.4: Tool for parallel processing in Postgresql
              2.4 Join Processing in Multi Database System
3.            OUR APPROACH
              3.1 Data Partition
                  3.1.1 Bounding Box Wise
                  3.1.2 Feature wise
              3.2 Block Diagram of our approach
              3.3 Query Fragmentation
              3.4 Query Redirection to Distributed Database
              3.5 Query Execution and Storage of result
              3.6 Displaying Result on Geoserver
4.            CONCLUSION
5.            FUTURE SCOPE
6.            REFERENCES

ABSTRACT

Many organisations, such as municipal corporations and e-Governance services, have large-scale data stored in distributed databases (Postgresql-8.4). To query such data, I have designed a Java application that queries the remotely distributed databases. In this application the user enters a query, which is fragmented into two parts if it involves more than one table. One part of the query goes to one server and the other part goes to the other server, and each part is executed on its remote server. The result sets are then transferred from both remote servers to the local host, where the union of the result sets, or any other operation that requires data from both servers, takes place. Finally, the combined result from both servers is published on Geoserver.

1. INTRODUCTION

1.1 Objective:

To design a Java application that implements the concept of distributed databases on a spatial database (PostGIS) in order to achieve performance optimization.

I have been involved in research in parallel and distributed computing systems. I studied various problems in distributed systems and distributed databases, and my work is on PostGIS, which is a spatial database. My project work includes designing a Java application that provides a user interface for querying the distributed database (PostGIS), establishing connections to the remote Postgresql servers, retrieving the combined result from the remote servers, and publishing the final result on Geoserver.
1.2 Technology Used:

Database:

Postgresql-8.4: PostgreSQL is an open source object-relational database management system (ORDBMS) [4].

PostGIS-1.5.2: PostGIS is an extension to the PostgreSQL object-relational database system which allows GIS (Geographic Information Systems) objects to be stored in the database. PostGIS includes support for GiST-based R-Tree spatial indexes, and functions for analysis and processing of GIS objects [9].

IIT-Bombay Database: The IIT-Bombay database consists of geographical information about IIT-Bombay. The relational data includes Builtup, Boundary, Bus_stop, campus_boundary, culvert, Location_plan, open_space, places, revenue_boundary, zone, gate and all utilities (electrical lines, roads, sewerage lines, water lines, telephone lines), etc. For my experiments I have used IIT-B data only.

Programming languages:

JAVA EE: Java Enterprise Edition is a programming platform, part of the Java Platform, for developing and running distributed multi-tier architecture Java applications, based largely on modular software components running on an application server.

JavaScript: A client-side scripting language used to create dynamic web content and user interfaces.

Java Servlet: A servlet is a Java programming language class used to extend the capabilities of servers that host applications accessed via a request-response programming model. Although servlets can respond to any type of request, they are commonly used to extend the applications hosted by Web servers.

Java Server Pages (JSP): Java Server Pages (JSP) technology provides a simplified, fast way to create dynamic web content. JSP technology enables rapid development of web-based applications that are server and platform independent.

Tools & Development Environment:

Apache Tomcat 6.0.18 Server: Apache Tomcat is a Servlet container developed by the Apache Software Foundation (ASF).
Tomcat implements the Java Servlet and the Java Server Pages (JSP) specifications from Sun Microsystems, and provides a "pure Java" HTTP web server environment for Java code to run.

ECLIPSE J2EE: Eclipse is a toolkit designed for the creation of complex projects, providing fully dynamic web applications utilizing EJBs. It consists of EJB tools, CMP, data mapping tools and a universal test client that is designed to aid testing of EJBs.

Pgadmin-III: PgAdmin is the most popular and feature-rich open source administration and development platform for Postgresql, which is the most advanced open source database. PgAdmin is designed to answer the needs of all users, from writing simple SQL queries to developing complex databases. The graphical interface supports all PostgreSQL features and makes administration easy.

Geoserver (any OGC-compliant server): GeoServer is an open source software server written in Java that allows users to share and edit geospatial data. Designed for interoperability, it publishes data from any major spatial data source using open standards. Any OGC-compliant server can be used to deploy my web application [12] [13].

1.3. TARGET USERS:

This application is helpful in areas where large-scale data is used and performance optimization is needed while accessing data from different databases distributed remotely. At present, many organisations need to keep their data in different databases on multiple servers because of the huge amount of data, and they need some mechanism to access the data from the different sites. My project work includes finding such a capability in PostgreSql-8.4. It is useful in any organisation, such as e-Governance services or municipal corporations, that needs to fire queries on a distributed database and to optimize performance.

2. LITERATURE REVIEW

2.1 Study of PostGIS [9]:

Features and capacities of PostGIS:
1. Capacity known: up to 32 TB of data.
2. Supports OGC standards.
3. Has the proj4 cartographic projection library.
4. GEOS (Geometry Engine Open Source): provides spatial predicate functions, spatial operators and topological functions.
5. Huge client/server library support.
6. Advanced indexing.
7. Transformation support.
8. Variety of output format options ((E)WKT, (E)WKB, GeoJSON, GML, KML, SVG).

Limitations of PostGIS:
1. Not multiprocess (the base application PostgreSQL is multiprocess enabled except for I/O waits, but the extension PostGIS is not multiprocess).
2. No true geodetic support.
3. Fewer hosting providers than Oracle/MYSQL Server/MYSQL.

2.2 Benchmarking On Spatial Database for Single User [14].

In order to understand the capabilities of postgres-8.4 and Postgis-1.5.2, I studied the experiment on spatial databases for a single user done by Subham Roy (student at IIT-Bombay).

Brief of the experiment: comparison of open-source and proprietary products. The experiment has two parts:
Cold Start: Each query is run afresh after clearing all the buffer cache pages.
Warm Start: A bunch of related queries is run to measure the performance of the two databases.

Ways of benchmarking:
Functional Benchmarking: tests the functionality supported by the spatial database.
Performance Benchmarking: tests the speed of the spatial database.
Database Benchmarking: tests the performance and throughput of the spatial database.

Brief of the experiment done by Subham Roy: He benchmarked Oracle11g on Windows7 against Postgres 9.04/Postgis 1.5.2 on Ubuntu 10.04, keeping all system requirements the same for Oracle and PostGIS and using 2010 TIGER/LINE data. By running many spatial queries (some simple and some complex), he found that Postgresql performs better than Oracle. For all queries (simple and complex) the time taken by Postgresql is less. Postgres uses the underlying GEOS (Geometry Engine - Open Source) library functions for implementing the geometric operations, whereas Oracle 11g implements them on its own.
So, for a single user, Postgres performs well in the majority of cases.

2.3 Pgpool-II-3.0.4: Tool for supporting distributed database concept in Postgresql:

PgpoolAdmin: The pgpool Administration Tool is a management interface for pgpool, used to monitor, start and stop pgpool and to change its settings. I used PgpoolAdmin-3.0.3, which is suitable for all pgpool-II-3.0 versions (I was using Pgpool-II-3.0.4) [7].

Fig: 1 Pgpool Admin

Pgpool-II-3.0.4: pgpool-II is a middleware that works between PostgreSQL servers and a PostgreSQL database client. It is licensed under the BSD license [5].

Feature of pgpool of interest to us:

Parallel Query: Using the parallel query function, data can be divided among multiple servers, so that a query can be executed on all the servers concurrently to reduce the overall execution time. Parallel query works best when searching large-scale data [6].

Structure of working of Pgpool: Pgpool-II-3.0.4, running on Server 3 (or the local host) together with System_db and Dblink.sql, sends the query Q to Server 1 and Server 2, receives the result sets Rs1 and Rs2, and computes Rs = Rs1 op Rs2.

Fig: 2 Structure of working of pgpool

Annotations used in the diagram:
Q: SQL query.
Rs1: Result set that stores the result after execution of query Q1.
Rs2: Result set that stores the result after execution of query Q2.
Rs: Spatial function of Rs1 and Rs2.
Op: Operation applied on Rs1 and Rs2 (intersection, union, difference, etc.).
Pgpool-II-3.0.4: Tool for parallel query processing.
Systemdb: User-defined rules for data partitioning and for merging of results via Dblink.
Dblink.sql: An SQL file in usr/local/postgresql/contrib.dblink.sql. It is used to query the local Postgresql and a remote Postgresql [8].

Disadvantages of dblink:
1. It lacks SQL Server's linked server approach or open query, which allows synchronized joins between linked servers/databases.
2. It is not useful in cases where you need to join lots of remote data with local data.
3. The output structure needs to be specified.
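The third limitation is easiest to see in a concrete call: because dblink returns a set of generic records, the caller has to list every output column and its type after `AS`. The sketch below only assembles such a statement as a string. The connection string, the `gate` columns and the class name `DblinkQueryBuilder` are hypothetical and chosen for illustration; only the `dblink(connstr, sql)` form with a column definition list comes from the dblink documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Builds a dblink call; every name passed in below is illustrative only. */
class DblinkQueryBuilder {

    /**
     * dblink returns SETOF record, so each output column and its type must
     * be listed after AS. This is the "output structure" limitation.
     */
    static String build(String connStr, String remoteSql, Map<String, String> columns) {
        StringJoiner cols = new StringJoiner(", ", "(", ")");
        for (Map.Entry<String, String> c : columns.entrySet()) {
            cols.add(c.getKey() + " " + c.getValue());
        }
        return "SELECT * FROM dblink(" + quote(connStr) + ", " + quote(remoteSql) + ") AS t" + cols;
    }

    /** Wraps a value as an SQL string literal, doubling embedded single quotes. */
    static String quote(String s) {
        return "'" + s.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        // Hypothetical remote table and columns.
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("gid", "integer");
        cols.put("name", "text");
        System.out.println(build("host=server1 dbname=iitb",
                "SELECT gid, name FROM gis_schema.gate", cols));
    }
}
```

Every remote query needs its own column list, which is why wide spatial tables become tedious to join through dblink.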
Description of working of pgpool: Pgpool distributes the data over both servers, redirects the same query to both servers, and combines the results via dblink.sql.

Reason for dropping Pgpool: Pgpool is not suitable for our task because it only redirects the same query to both servers and does not support query fragmentation.

2.4 Join Processing in Multi Database System [2].

Description of the experiment: The user inputs a query, which is changed into a modified query. The modified query separates the local and remote references according to their qualification. One part of the query is executed on the local server and the other part on the HP-SQL remote server. Some components are then transferred from the remote server to the local server, because both relations involved in a join need to be present at the same site before the join operation can take place. I have used the same idea in my Java application, with both databases being postgreSQL-8.4.

Structure used for join processing in a multi database system: a Multi Database Support Layer separates the local references and the remote references in the modified query, sends a query to the local database and a query to the remote database, and copies some components to the local machine in order to perform the join.

Fig: 3 Join Processing in Multi database system

3. OUR APPROACH:

3.1 Database Partition:

I have used two methods for data partition: bounding box wise partition and feature wise partition.

3.1.1. Bounding Box wise Partition:

Bounding Box: The bounding box is described by 4 numbers: the x-y coordinates of the lower-left corner of the image, followed by the x-y coordinates of the upper-right corner of the image.

Representation of Bounding Box: BBox (Xmin, Ymin, Xmax, Ymax).

Method used for partitioning data bounding box wise: The bounding box of IIT-Bombay is BBox (72.902, 19.122, 72.911, 19.142). Knowing the bounding box of IIT-B, we divided the whole data into two polygons.
Polygon1 = (72.902 19.122, 72.911 19.122, 72.911 19.142, 72.902 19.142, 72.902 19.122)
Polygon2 = (72.911 19.122, 72.92 19.122, 72.92 19.142, 72.911 19.142, 72.911 19.122)

Then, by applying an SQL query, we found the intersection of each layer with both polygons. Rows of a layer intersecting one polygon are stored on one server and rows intersecting the other polygon are stored on the other server.

SQL query for finding out whether one layer (e.g. electric_line) intersects Polygon1, storing the result in a table (e.g. electric_line_P1) for partition 1:

    SELECT ST_Intersects(electric_line.the_geom, ST_GeomFromText('POLYGON((72.902 19.122, 72.911 19.122, 72.911 19.142, 72.902 19.142, 72.902 19.122))', 4326)), * INTO gis_schema.electric_line_P1 FROM gis_schema.electric_line;

SQL query for finding out whether one layer (e.g. electric_line) intersects Polygon2, storing the result in a table (e.g. electric_line_P2) for partition 2:

    SELECT ST_Intersects(electric_line.the_geom, ST_GeomFromText('POLYGON((72.911 19.122, 72.92 19.122, 72.92 19.142, 72.911 19.142, 72.911 19.122))', 4326)), * INTO gis_schema.electric_line_P2 FROM gis_schema.electric_line;

Likewise, we run the query for all features (layers) and divide the data into two parts. Data that lies on the boundary is replicated on both servers.
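Repeating the partition query for every layer by hand is error-prone, so it can help to generate the statements. The following is a minimal sketch, not the report's own tooling: the class `BBoxPartition` and its helper names are hypothetical, the `_p1`/`_p2` suffixes follow the table names used in the queries, and the coordinates in `main` are the ones that appear in the two queries above.

```java
/** Sketch of the bounding-box partition step; helper names are hypothetical. */
class BBoxPartition {

    /** Closed WKT ring for a box, starting and ending at the lower-left corner. */
    static String toWkt(double xmin, double ymin, double xmax, double ymax) {
        return "POLYGON((" + xmin + " " + ymin + ", " + xmax + " " + ymin + ", "
                + xmax + " " + ymax + ", " + xmin + " " + ymax + ", "
                + xmin + " " + ymin + "))";
    }

    /**
     * Same shape as the report's query: flag every row of the layer with
     * ST_Intersects against one half and store the result as layer_suffix.
     */
    static String partitionSql(String layer, String suffix, String wkt) {
        return "SELECT ST_Intersects(" + layer + ".the_geom, ST_GeomFromText('" + wkt
                + "', 4326)), * INTO gis_schema." + layer + "_" + suffix
                + " FROM gis_schema." + layer;
    }

    public static void main(String[] args) {
        // x = 72.911 is the line that separates the two halves.
        String west = toWkt(72.902, 19.122, 72.911, 19.142);
        String east = toWkt(72.911, 19.122, 72.92, 19.142);
        for (String layer : new String[] {"electric_line", "builtup"}) {
            System.out.println(partitionSql(layer, "p1", west) + ";");
            System.out.println(partitionSql(layer, "p2", east) + ";");
        }
    }
}
```

Generating the ring from the four box numbers also guarantees that the first and last points match, which `ST_GeomFromText` requires for a valid polygon.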
To find the data that lies on the boundary of the two polygons, we apply the following SQL query:

    SELECT count(*) FROM gis_schema.builtup_p1 AS p1 JOIN gis_schema.builtup_p2 AS p2 ON p1.gid = p2.gid WHERE p1.st_intersects = p2.st_intersects;

Table showing the number of rows that lie on the boundary:

Original Table       Partition 1               Partition 2               No. of rows at boundary
boundary             Boundary_bb1              Boundary_bb2              12
Builtup              Builtup_bb1               Builtup_bb2               6
Bus_stop             Bus_stop_bb1              Bus_stop_bb2              0
Campus_boundary      Campus_boundary_bb1       Campus_boundary_bb2       1
Cmp_boundary         Cmp_boundary_bb1          Cmp_boundary_bb2          5
Culvert              Culvert_bb1               Culvert_bb2               1
Electric_line        Electric_line_bb1         Electric_line_bb2         16
Electrical_assest    Electrical_assest_bb1     Electrical_assest_bb2     7
Gate                 Gate_bb1                  Gate_bb2                  8
Location_plan        Location_plan_bb1         Location_plan_bb2         3
Open_space           Open_space_bb1            Open_space_bb2            7
Open_spacev2         Open_spacev2_bb1          Open_spacev2_bb2          7
Places               Places_bb1                Places_bb2                0
Proposed_building    Proposed_building_bb1     Proposed_building_bb2     0
Proposed_road_edge   Proposed_road_edge_bb1    Proposed_road_edge_bb2    12
Revenue_boundary     Revenue_boundary_bb1      Revenue_boundary_bb2      5
Road_centerline      Road_centerline_bb1       Road_centerline_bb2       12
Road_edge            Road_edge_bb1             Road_edge_bb2             8
Road_junction        Road_junction_bb1         Road_junction_bb2         2
Sewerage_assest      Sewerage_assest_bb1       Sewerage_assest_bb2       0
Sewerage_line        Sewerage_line_bb1         Sewerage_line_bb2         11
Swd_bedlevel         Swd_bedlevel_bb1          Swd_bedlevel_bb2          0
Swd_centerline       Swd_centerline_bb1        Swd_centerline_bb2        4
Swd_edge             Swd_edge_bb1              Swd_edge_bb2              7
Telephone_assest     Telephone_assest_bb1      Telephone_assest_bb2      0
Telephone_line       Telephone_line_bb1        Telephone_line_bb2        8
Water_assest         Water_assest_bb1          Water_assest_bb2          1
Water_body           Water_body_bb1            Water_body_bb2            0
Water_line           Water_line_bb1            Water_line_bb2            13
zone                 Zone_bb1                  Zone_bb2                  4

3.1.2. Feature wise partition:

For feature wise partition I have stored all utilities on one server and all other features on the other server.
Different tables are stored on different servers, because one table in Postgresql is equivalent to one feature.

Features stored on Server 1    Features stored on Server 2
Boundary                       Electric_line
Builtup                        Electrical_assest
Bus_stop                       Sewerage_assest
Campus_boundary                Sewerage_line
Cmp_boundary                   Swd_bedlevel
Culvert                        Swd_centerline
Gate                           Swd_edge
Location_plan                  Telephone_assest
Open_space                     Telephone_line
Open_spacev2                   Water_assest
Places                         Water_body
Proposed_building
Proposed_road_edge
Revenue_boundary
Road_centerline
Road_edge
Road_junction
Zone

3.2 Structure of our approach

The local host sends query Q1 to Remote Server 1 and query Q2 to Remote Server 2. Each remote server executes its query and returns its result set. The join and any other operation that requires data from both servers is performed at the local host, followed by display of the result.

Fig: 4 Structure of working of java application

Description of structure: the method is described in Sections 3.3, 3.4 and 3.5.

3.3. Query Fragmentation:

Query fragmentation divides the query into two parts, so that one part of the query goes to one server and the other part goes to the other server. For fragmentation of the query I have designed a Java application that includes the following:

SQL Parsing: Parsing of the SQL statement is done with Java's "String Tokenizer", and various conditions are applied in order to handle all the cases a user may enter. With these cases in mind, the Java application is designed so that all field names referring to one server are kept in one part of the query and all other field names are kept in the other part.

Operator Tree: An operator tree defines a partial order in which operations must be applied in order to produce the result of a query. The leaves of the tree represent relations on different servers.
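The SQL parsing step described above can be illustrated with a small tokenizer. This is only a sketch under strong assumptions, not the application's actual code: it handles a plain `SELECT ... FROM a, b WHERE ...` query without aliases, subqueries or JOIN syntax, and the class `QueryFragmenter`, its methods and the table-to-server map are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

/** Illustrative tokenizer-based splitter for simple comma-separated FROM lists. */
class QueryFragmenter {

    /** Collects the table names between FROM and the next clause keyword. */
    static List<String> tablesIn(String sql) {
        List<String> tables = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(sql, " ,;\t\n");
        boolean inFrom = false;
        while (tok.hasMoreTokens()) {
            String t = tok.nextToken();
            if (t.equalsIgnoreCase("FROM")) {
                inFrom = true;
            } else if (inFrom && isClauseKeyword(t)) {
                break; // the FROM list has ended
            } else if (inFrom) {
                tables.add(t);
            }
        }
        return tables;
    }

    static boolean isClauseKeyword(String t) {
        String u = t.toUpperCase();
        return u.equals("WHERE") || u.equals("GROUP") || u.equals("ORDER") || u.equals("LIMIT");
    }

    /** Groups the tables by the server that stores them, keeping query order. */
    static Map<String, List<String>> groupByServer(List<String> tables,
                                                   Map<String, String> tableToServer) {
        Map<String, List<String>> byServer = new LinkedHashMap<>();
        for (String table : tables) {
            String server = tableToServer.get(table.toLowerCase());
            if (server == null) {
                throw new IllegalArgumentException("No server known for table " + table);
            }
            byServer.computeIfAbsent(server, k -> new ArrayList<>()).add(table);
        }
        return byServer;
    }

    public static void main(String[] args) {
        Map<String, String> where = new LinkedHashMap<>();
        where.put("builtup", "server1");     // non-utility feature
        where.put("water_body", "server2");  // utility feature
        List<String> tables = tablesIn("SELECT * FROM builtup, water_body WHERE builtup.gid = 1;");
        System.out.println(groupByServer(tables, where));
    }
}
```

Each group of tables then becomes the leaf set of one sub-query, which is the point where an operator tree would decide the order of the remaining operations.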
I have used the idea of the operator tree to translate the global query into a fragmented one and then to apply the operations in sequence, keeping local optimization in mind.

Bounding Box wise partition: In bounding box wise partition we need not fragment the query, because the table names are the same on both servers; only the table sizes differ.

Feature wise partition: In feature wise partition we do need to fragment the query, and query fragmentation is done by the method described above.

3.4. Query Redirection to Distributed Databases:

For query redirection to distributed databases I have followed the methodology below.

By distributed database we mean two different servers on which partitions of the database are stored. From a third server, say the local host, we query both remote servers containing the partitioned data.

LookUp Table: A lookup table containing meta information, i.e. which table is stored on which server together with the credentials (host, username, password, port), is maintained at the local host. From the lookup table we can redirect a query to the particular server on which the requested data resides.

Screenshot of the lookup table stored in Pgadmin3 at the local host (the 3rd server) for feature wise partition:

Fig: 5 lookup table at local host

The server name is extracted from the lookup table at the local host for both parts of the query. One part of the query is then redirected to one server (by establishing a connection to that server with parameters fetched from the lookup table) and the other part to the other server.

3.5. Query Execution and Storage of result:

The Java code does the following:

One part of the query is executed on one server and the other part on the other server. We obtain the results in the form of result sets, say Resultset1 and Resultset2.
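The lookup-and-redirect flow that produces the two result sets can be sketched without a live database. In the sketch below the class `QueryRedirector`, the `ServerInfo` holder, the `RemoteExecutor` interface and all hosts are hypothetical; a real executor would open a JDBC connection with the stored credentials and run the fragment, whereas here it is left as a pluggable stand-in so the routing logic can be read on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of lookup-table redirection; all names and hosts are made up. */
class QueryRedirector {

    /** One lookup-table row: where a table lives and how to connect. */
    static final class ServerInfo {
        final String host;
        final int port;
        final String user;
        final String password;

        ServerInfo(String host, int port, String user, String password) {
            this.host = host;
            this.port = port;
            this.user = user;
            this.password = password;
        }

        /** JDBC URL in the form used by the PostgreSQL driver. */
        String jdbcUrl(String database) {
            return "jdbc:postgresql://" + host + ":" + port + "/" + database;
        }
    }

    /** Stand-in for "run this SQL on that server"; a real one would use JDBC. */
    interface RemoteExecutor {
        List<String> run(ServerInfo server, String sql);
    }

    private final Map<String, ServerInfo> lookup = new HashMap<>();

    void register(String table, ServerInfo server) {
        lookup.put(table.toLowerCase(), server);
    }

    ServerInfo serverFor(String table) {
        ServerInfo s = lookup.get(table.toLowerCase());
        if (s == null) {
            throw new IllegalArgumentException("Table not in lookup table: " + table);
        }
        return s;
    }

    /** Sends each fragment to the server owning its table; one result set per fragment. */
    List<List<String>> dispatch(Map<String, String> fragmentByTable, RemoteExecutor exec) {
        List<List<String>> results = new ArrayList<>();
        for (Map.Entry<String, String> f : fragmentByTable.entrySet()) {
            results.add(exec.run(serverFor(f.getKey()), f.getValue()));
        }
        return results;
    }
}
```

Keeping the executor behind an interface means the routing can be tested with a stub, and the two fragments could later be run concurrently without changing the lookup code.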
Each result set is then transferred from its remote server to the local host. Using the Statement object of the local host connection, the code creates a table (the one named by the user in the query) by executing a CREATE SQL statement. The same is done for the result set from the other server.

We now have two tables, which contain the results of the user query from server 1 and from server 2. We can apply any operation that needs data from both tables, including a join. The final result is then stored in a 3rd table (the table is dropped first if it already exists).

3.6. Java Application developed till now:

Step 1: The user enters a query, either on a single table or on more than one table, and clicks the Get Result button. Querying more than one table requires query fragmentation, which I am still working on.

Screenshot of a user querying a single table:

Fig: 6 Form asking User’s query

Step 2: Click on the Get Result button. On the click event of Get Result, the following tasks take place (via Java code):
- A connection is made to all three servers, that is, the local host, Server1 and Server2.
- The query goes to both servers and is executed there.
- The table name is extracted from the query by string tokenization.
- If the table already exists it is dropped, and then it is created.
- The combination of the results from Server1 and Server2 is stored in that table in PostGIS on the 3rd server, i.e. the local host.
- The combination of the results from both servers is also displayed in the console.

Fig: 7 After click on Get Result button

Step 3: A table is created in PostGIS on the local host, which is the combination of the results from both servers.

Fig: 8 Stored Table at Local host in PgAdmin3

3.7.
Displaying Result on Geoserver:

Steps to display the result on Geoserver, that is, to display the 3rd table obtained as the combined result from both servers:

Step 1: Start Geoserver-2.1 by typing the following command in a terminal (geo.sh is the startup file for Geoserver):

    $ sh geo.sh

Step 2: Go to the browser and type the URL: http://localhost:18080/geoserver/web/

A screen will appear; enter the username and password for Geoserver. The default username and password for Geoserver are:
Username: admin
Password: geoserver
Click on login; Geoserver is now ready for further operations.

Fig: 9 Geoserver Login

Step 3: Creation of a workspace inside Geoserver:

Fig: 10.1 Click on workspace

Now click on Add new workspace.

Fig: 10.2 Add New Workspace

Enter the name of the workspace and the Namespace URI, and then click on submit.

Fig: 10.3 Enter Name and Namespace URI

After clicking on submit, a workspace is created, and we can search for our namespace by typing its name in the search box.

Fig: 10.4 Workspace get created

Step 4: Creation of a store inside Geoserver:

Fig: 11.1 Click on Stores

Fig: 11.2 Click on Add new store

Fig: 11.3 Click On PostGIS

Enter the information required to connect PostGIS to Geoserver.

Fig: 11.4 Enter basic Information for connection to PostGIS

Step 5: Publishing a PostGIS layer on Geoserver:

All layers that are stored in the database (as seen in pgadmin3) are displayed here, which shows that Geoserver is connected to PostGIS.

Fig: 12.1 Click On Publish

Fig: 12.2 Select Coordinate System & Compute min, max coordinates from data

Fig: 12.3 Click on layer Preview

Fig: 12.4 Search for Layer and click on Open Layers

The layer from PostGIS that stores the combined result from both servers can be viewed as a map on Geoserver.

Fig: 12.5 Map view of layer from PostGIS On geoserver.
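Once the layer is published, the map shown by Layer Preview can also be requested directly through Geoserver's WMS endpoint. The sketch below only assembles a WMS 1.1.0 GetMap URL. The workspace name `iitb`, the layer name `result` and the image size are hypothetical, as is the class `WmsUrlBuilder`; the port follows the URL used in Step 2, and the bounding box is the IIT-B box given in Section 3.1.1.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Builds a WMS 1.1.0 GetMap URL for a published layer (names are illustrative). */
class WmsUrlBuilder {

    static String getMapUrl(String baseUrl, String workspace, String layer,
                            double xmin, double ymin, double xmax, double ymax,
                            int width, int height) {
        // Parameters required by a WMS 1.1.x GetMap request, in a fixed order.
        Map<String, String> p = new LinkedHashMap<>();
        p.put("service", "WMS");
        p.put("version", "1.1.0");
        p.put("request", "GetMap");
        p.put("layers", workspace + ":" + layer);
        p.put("styles", "");
        p.put("bbox", xmin + "," + ymin + "," + xmax + "," + ymax);
        p.put("width", String.valueOf(width));
        p.put("height", String.valueOf(height));
        p.put("srs", "EPSG:4326");
        p.put("format", "image/png");

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : p.entrySet()) {
            query.add(e.getKey() + "=" + e.getValue());
        }
        return baseUrl + "/" + workspace + "/wms?" + query;
    }

    public static void main(String[] args) {
        System.out.println(getMapUrl("http://localhost:18080/geoserver", "iitb", "result",
                72.902, 19.122, 72.911, 19.142, 512, 512));
    }
}
```

Opening the printed URL in a browser should return a PNG of the combined-result layer, provided a layer with that name has actually been published in that workspace.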
CONCLUSION

The Java application can be used by organisations that have large-scale data distributed over different servers. Implementing query fragmentation and transferring only result sets, rather than whole data, results in a performance gain. It removes the overhead of keeping all data at a single site and distributes the load that would otherwise fall on one server. So far, the application implements queries on a single table. When complete, my application will also support queries on multiple tables, that is, query fragmentation.

FUTURE SCOPE

Parallel database systems attempt to exploit recent multiprocessor computer architectures in order to build high-performance and high-availability database servers at a much lower price than equivalent mainframe computers. At present, many organisations need to keep their data in different databases on multiple servers because of the huge amount of data, and they need some mechanism to access the data from the different sites. The scope of this application lies in areas where large-scale data is used and performance optimization is needed while accessing data from different databases distributed remotely. My project work is useful in any organisation, such as e-Governance services or municipal corporations, that needs to fire queries on a distributed database (Postgresql-8.4) and to optimize performance.

REFERENCES

[1] Stefano Ceri, Giuseppe Pelagatti, “Levels of Distributed Transparency, Distributed Database Design, Translation of Global Query to Fragment Query,” in Distributed Databases: Principles and Systems, pg-37-126.
[2] S. Bandyopadhyay, “Join processing in a Multi database system,” in Data Management (New Dimensions and Perspectives).
[3] P. Bernstein, N. Goodman, E. Wong, C. Reeve and J. Rothnie, “Query Processing in a System for Distributed Databases”.
[4] “Postgresql”, Available: http://www.postgresql.org/
[5] “Features of pgpool”, Available: http://pgpool.projects.postgresql.org/
[6] “Pgpool tutorial”, Available: http://pgpool.projects.postgresql.org/pgpool-II/doc/tutorial-en.html
[7] “PgpoolAdmin features and installation”, Available: http://pgpool.projects.postgresql.org/pgpoolAdmin/doc/en/install.html
[8] “Dblink description”, Available: http://www.postgresonline.com/journal/archives/44-Using-DbLink-to-access-otherPostgreSQL-Databases-and-Servers.html
[9] PostGIS 1.5.2SVN Manual.
[10] “PostGIS Reference”, Available: http://postgis.refractions.net/documentation/manual-1.3/ch06.html
[11] “Spatial Functions”, Available: http://postgis.refractions.net/docs/ST_Intersects.html
[12] “Geoserver-2.1 User Manual”, Available: http://docs.geoserver.org/stable/en/user/
[13] “Geoserver Information”, Available: http://workshops.opengeo.org/
[14] Subham Roy, “Benchmarking on Spatial Database”, Internal Report IIT Bombay.