BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI
Publicat de
Universitatea Tehnică „Gheorghe Asachi” din Iaşi
Tomul LV (LIX), Fasc. 3, 2009
Secţia
AUTOMATICĂ şi CALCULATOARE
USING A SPATIAL DATABASE IN A LOCATION-BASED
SEARCH APPLICATION
BY
ANDREI TABARCEA, *PASI FRÄNTI and VASILE MANTA
Abstract. This paper describes a solution for a georeferencing problem in a
location-based search engine. Georeferencing is the process of assigning a geographic
location to a web page or part of it. Our solution is to use a spatially indexed database
which acts as a gazetteer and contains geographical coordinates attached to address
strings. We perform a series of tests to choose the best indexing solution for the database.
Key words: spatial database, LBS, search engine, gazetteer, georeferencing.
2000 Mathematics Subject Classification: 68P20.
1. Introduction
During the last few years, the location of a user connected to the
internet has gradually become easier to determine. Starting with rough
estimations based on the IP address and continuing with the development of
positioning technologies such as GPS (Global Positioning System),
locating the user has stopped being a major obstacle in the development of
location-based services. The increasing availability of geographical positioning
in low-cost consumer devices such as PDAs or mobile phones has made possible the
development of various services which also consider the user’s location.
A location-based search engine, which is basically a web search engine
that uses the user’s location as an additional relevance criterion, is one
such service. The goal of a location-based search engine is to help users find
points of interest described by one or several keywords in the proximity of their
location.
The main problem which arises in the development of a location-based
search application is georeferencing (the process of assigning geographic
coordinates to a resource, in our case a web page). Only very few web pages
give a direct position (geotags or other forms of coordinates) for the
information they contain, so the use of geographical coordinates in a
web search has little applicability. However, it is common to find street or
postal addresses on web pages. Therefore, an answer to the georeferencing
problem can be to use a predefined data structure or database that connects any
given postal address to its exact location (coordinates).
We propose a solution [14] that searches addresses from web pages,
converts them to geographical coordinates and uses the information as an
additional relevance criterion. Georeferencing is aided by a gazetteer, which is
defined in [7] as a geospatial dictionary of geographic names and its minimum
components as a geographic name, a geographic location represented by
coordinates and a type designation.
Our implementation of the gazetteer is a spatial database that contains
postal addresses as geographic names, their corresponding coordinates as
geographic locations and a single type: postal addresses. We underline the
importance of a spatial database in location-based search and we perform a
series of tests to determine the fastest and most efficient structure for the
database, considering the use of specialized spatial data types and functions and
the indexing of the most commonly used fields.
1.1. Related Work
Location-based search has been implemented in various projects,
ranging from commercial services such as Google Maps, Yahoo! Local, Bing
Maps and Yellow Pages to research projects such as [8], [11] and [2].
The first methods of detecting the location of a web resource are found
in [5] and [9]. In [5], “whois” records are analyzed, and phone numbers of
network administrators are used together with a zip code and area database to
assign coordinates to Class A and B domains and to determine the globality of a
web-site. In [9] the sources of geospatial context are classified as relating
either to the host of a web page (usually found in “whois” databases and in the
way the traffic is routed on the Internet) or to its content (postal addresses
and codes, telephone numbers, geographic feature names). Additional geographical
information is obtained from hyperlinks and meta tags. The postal address
detection uses a postal code database with latitude/longitude information.
The gazetteer has been defined in [7] as a geospatial dictionary of
geographic names and has been used in most of the projects that involve postal
address detection, such as [3], [4], [6] or [1]. On the other hand, Named Entity
Recognition without gazetteers is discussed in [10]; it turns out to work
well with people and organizations, but poorly with locations.
The system in [3] employs the gazetteer approach to identify geographic
locations in web-pages. In [4], an ontology-based approach that extracts
geographic knowledge is presented. The address is divided into 3 parts (basic
address, complement and location identifiers such as phone number, postal code
or municipality name) and the address recognition is a process of geoparsing
and geocoding, which uses a gazetteer described in [12].
In [6] the location-based data is retrieved by recognizing postal
addresses. The method is ontology-based conceptual information retrieval
combined with graph matching. The concepts (knowledge/address elements) in
a document are identified and linked together in a graph by semantic relations
and the concept set used is actually a gazetteer.
In [1] a geoparser that can identify address level location information
using a database rather than rely on metadata or other structured annotation is
described. Their database contains postal codes, city names, street names, and
also every city-postal code combination for each street of the target area and is
also used for validation. Address detection relies on assuming the address
blocks have a certain structure and that there are certain dependencies between
address elements.
2. Postal Address Spatial Database
2.1. Use of a Spatial Database
A location-based search engine finds websites which contain
information about services or targets in the proximity of a given location,
usually the user’s location. The MOPSI location-based search engine [14]
conducts a real-time search with a prominent search engine such as Google and
extracts potential postal address information from the resulting web pages.
Therefore, its georeferencing process consists of finding the coordinates
that correspond to a given postal address.
This requires a data structure or a database that connects any given
address to its exact location (coordinates). Our solution is to use such a
database, which is commonly available (although not necessarily free) and can
be purchased for given (or specified) regions. Such a spatial database can be
used in geocoding, which is the process of finding associated geographic
coordinates (latitude and longitude) from other geographic data, such as street
addresses, or zip codes (postal codes).
The main purposes of the postal address database are: converting a
postal address into geographical coordinates, finding the postal address of
geographical coordinates and finding all the location points in a square or
rectangle bounding box. Therefore, the database or data structure used for
georeferencing and geocoding must be optimized so that every location point
and every address from a bounding box defined by minimum and maximum
latitudes and longitudes can be retrieved easily and fast.
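The bounding box described above can be derived from a centre point and a side length. The following is a minimal sketch in Python, not the authors' implementation: the function name `bounding_box`, the constant of roughly 111.32 km per degree of latitude and the sample point are all assumptions made for illustration.

```python
import math

# Assumption: one degree of latitude is taken as roughly 111.32 km.
KM_PER_DEGREE_LAT = 111.32


def bounding_box(lat, lon, side_km):
    """Return (min_lat, max_lat, min_lon, max_lon) for a square box
    of side `side_km` centred on (lat, lon), in decimal degrees."""
    half_km = side_km / 2.0
    d_lat = half_km / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    d_lon = half_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (lat - d_lat, lat + d_lat, lon - d_lon, lon + d_lon)


# Hypothetical user location, chosen only for illustration.
min_lat, max_lat, min_lon, max_lon = bounding_box(62.60, 29.76, 10.0)
print(min_lat, max_lat, min_lon, max_lon)
```

The four returned values are exactly the minimum and maximum latitudes and longitudes that the database queries take as parameters. The flat approximation used here is only reasonable for boxes of a few kilometres.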
A complete server-side database containing all addresses and
coordinates of the target region was constructed beforehand. The speed and
accuracy of the database can be improved by using a database management
system which implements the Open Geographical Information System (OpenGIS)
specification or other spatial and location-data standards.
The database management systems that can be used include MySQL with
spatial extensions, PostgreSQL with PostGIS or Oracle Spatial.
Alternatively, the database structures can be altered to increase
performance ratios of the entire application. There are several solutions to
enhance performance of the application, for example, using database indexing
or using data types and functions which implement GIS standards.
2.2. Common Operations
In all usage scenarios, the search engine performs the following
operations which translate into database queries:
a) Finding all the municipalities in a square (or rectangle) bounding box area
This query is performed at the beginning of the search, when the user’s
interest area is determined. The area is usually defined as a square bounding
box with a fixed side length; it may intersect only one municipality but, in
many cases, intersects two or more. The search engine needs the names of
the municipalities for location identification, because the same street name
can be found in many cities.
Fig. 1 – The bounding box intersects one municipality (Joensuu).
In the first case (Fig. 1) the bounding box intersects only one
municipality, mainly because the provided location point is near the center. The
search engine will need addresses only from Joensuu, so running a query to find
all the cities would seem pointless.
Fig. 2 – The bounding box intersects two municipalities
(orange – Vantaa, blue – Helsinki).
In the second case (Fig. 2), the user is near the border between the two
municipalities and the search engine needs addresses and location points from
both municipalities. The query for finding all the municipalities in a bounding
box is made only one time per search, so its running time is not critical.
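Operation a) can be sketched as a range query on the coordinate columns. The example below is an illustration under stated assumptions, not the authors' code: an in-memory SQLite database stands in for the MySQL server used in the paper, the table and column names are hypothetical, and the sample rows are made up.

```python
import sqlite3

# Assumption: SQLite (in memory) stands in for the MySQL server of the paper;
# the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE address ("
    " municipality TEXT, street TEXT, number INTEGER,"
    " lat REAL, lon REAL)"
)
# Ordinary (non-spatial) index on the coordinate columns.
conn.execute("CREATE INDEX idx_address_lat_lon ON address (lat, lon)")

# Made-up sample rows; the coordinates are illustrative only.
conn.executemany(
    "INSERT INTO address VALUES (?, ?, ?, ?, ?)",
    [
        ("Helsinki", "Example Street A", 1, 60.25, 24.98),
        ("Vantaa", "Example Street B", 7, 60.28, 25.02),
        ("Joensuu", "Example Street C", 3, 62.60, 29.76),
    ],
)


def municipalities_in_box(conn, min_lat, max_lat, min_lon, max_lon):
    """Return the distinct municipalities having an address inside the box."""
    rows = conn.execute(
        "SELECT DISTINCT municipality FROM address"
        " WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?"
        " ORDER BY municipality",
        (min_lat, max_lat, min_lon, max_lon),
    )
    return [row[0] for row in rows]


# A box near a municipal border and a box near a city centre.
near_border = municipalities_in_box(conn, 60.20, 60.30, 24.90, 25.10)
near_centre = municipalities_in_box(conn, 62.55, 62.65, 29.70, 29.80)
print(near_border, near_centre)
```

With these sample rows the first box returns two municipalities and the second returns one, mirroring the two cases in Fig. 1 and Fig. 2.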
b) Finding all the street names in all the municipalities
of the bounding box
The second step in the location-based search is finding all the street
names in the selected area. This query is made once for every municipality
found in the first step, so the running time is not highly critical. The street
names are used by the search engine for finding services or other targets in the
area.
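A possible shape for operation b) is one query per municipality returned by operation a). The sketch below rests on several assumptions: SQLite replaces MySQL, the schema and rows are hypothetical, and the street names are restricted to the bounding box, which is one reading of "the selected area" rather than something the text states explicitly.

```python
import sqlite3

# Assumption: in-memory SQLite and a hypothetical schema, as a stand-in
# for the MySQL database described in the paper.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE address ("
    " municipality TEXT, street TEXT, number INTEGER,"
    " lat REAL, lon REAL)"
)
conn.execute("CREATE INDEX idx_address_municipality ON address (municipality)")

# Made-up rows: the same street name occurs in two municipalities.
conn.executemany(
    "INSERT INTO address VALUES (?, ?, ?, ?, ?)",
    [
        ("Helsinki", "Street A", 1, 60.250, 24.980),
        ("Helsinki", "Street A", 3, 60.251, 24.981),
        ("Helsinki", "Street B", 2, 60.100, 24.800),  # outside the box below
        ("Vantaa", "Street A", 5, 60.280, 25.020),
    ],
)


def street_names(conn, municipality, box):
    """Return distinct street names of one municipality inside
    box = (min_lat, max_lat, min_lon, max_lon)."""
    rows = conn.execute(
        "SELECT DISTINCT street FROM address"
        " WHERE municipality = ?"
        " AND lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?"
        " ORDER BY street",
        (municipality, *box),
    )
    return [row[0] for row in rows]


box = (60.20, 60.30, 24.90, 25.10)
# One query per municipality found by the first operation.
streets = {m: street_names(conn, m, box) for m in ("Helsinki", "Vantaa")}
print(streets)
```

Keeping the municipality next to each street name is what lets the search engine tell apart identical street names in neighbouring cities.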
c) Converting addresses found into location points
(latitude and longitude)
The third step is converting all the addresses which correspond to the
search results into location points (geocoding). This operation is usually made
for every address found by the search engine and the resulting location points
are used for calculating distances to the user’s location. This is one of the most
time-critical operations, because it can be done from zero to tens or hundreds
of times per search, depending on the number of addresses found in the search
results.
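Operation c) and the subsequent distance calculation can be sketched as follows. This is an illustration only: SQLite replaces MySQL, the schema, the index and the sample row are hypothetical, and the haversine formula with a mean Earth radius of 6371 km is an assumed choice, since the paper does not say how distances are computed.

```python
import math
import sqlite3

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

# Assumption: in-memory SQLite with a hypothetical schema and made-up data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE address ("
    " municipality TEXT, street TEXT, number INTEGER,"
    " lat REAL, lon REAL)"
)
# Index on the fields used for exact address lookup.
conn.execute(
    "CREATE INDEX idx_address_name ON address (municipality, street, number)"
)
conn.execute(
    "INSERT INTO address VALUES (?, ?, ?, ?, ?)",
    ("Joensuu", "Example Street C", 3, 62.60, 29.76),
)


def geocode(conn, municipality, street, number):
    """Return (lat, lon) for an address, or None if it is not stored."""
    return conn.execute(
        "SELECT lat, lon FROM address"
        " WHERE municipality = ? AND street = ? AND number = ?",
        (municipality, street, number),
    ).fetchone()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


user = (62.61, 29.77)  # hypothetical user location
point = geocode(conn, "Joensuu", "Example Street C", 3)
if point is not None:
    print(haversine_km(user[0], user[1], point[0], point[1]))
```

Because this lookup runs once per address found in the search results, it is the query that benefits most from an index on the address fields.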
d) Converting the location points (latitude and longitude)
into addresses
This operation is mainly used for determining the user’s address when the
user provides only a location point. There can also be other situations where
the conversion from location points to addresses is needed, especially if the
search engine finds location points and the user needs addresses.
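Operation d) can be approximated by fetching the addresses in a small box around the point and keeping the nearest one. The sketch below is an assumption about how such a lookup might be written, not the method used in the paper: the schema, the sample rows, the `margin_deg` parameter and the flat-plane distance are all hypothetical choices.

```python
import math
import sqlite3

# Assumption: in-memory SQLite with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE address ("
    " municipality TEXT, street TEXT, number INTEGER,"
    " lat REAL, lon REAL)"
)
conn.execute("CREATE INDEX idx_address_lat_lon ON address (lat, lon)")
conn.executemany(
    "INSERT INTO address VALUES (?, ?, ?, ?, ?)",
    [
        ("Joensuu", "Example Street C", 3, 62.600, 29.760),
        ("Joensuu", "Example Street C", 5, 62.601, 29.762),
        ("Joensuu", "Example Street D", 1, 62.605, 29.750),
    ],
)


def reverse_geocode(conn, lat, lon, margin_deg=0.01):
    """Return (municipality, street, number) of the stored address nearest
    to (lat, lon), or None if nothing lies within the margin box."""
    candidates = conn.execute(
        "SELECT municipality, street, number, lat, lon FROM address"
        " WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (lat - margin_deg, lat + margin_deg,
         lon - margin_deg, lon + margin_deg),
    ).fetchall()
    if not candidates:
        return None
    cos_lat = math.cos(math.radians(lat))

    def planar_distance(row):
        # Flat-plane approximation, adequate for ranking nearby points.
        return math.hypot(row[3] - lat, (row[4] - lon) * cos_lat)

    return min(candidates, key=planar_distance)[:3]


nearest = reverse_geocode(conn, 62.6008, 29.7617)
print(nearest)
```

The box filter lets the database narrow the candidates using the coordinate index, and the exact ranking is then done on a handful of rows.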
2.3. Implementations and Results
A postal address database which stores all the addresses in Finland was
implemented on a MySQL 5 database management system. For the design of the
database the following factors were considered: using MySQL spatial
extensions or common data types, and using database indexing or not.
For testing purposes, a postal address database of the North Karelia
region was created. Table 1 shows the database sizes for the considered
solutions.
Table 1
Database Sizes

Database implementation                            Data size, MB   Index size, MB   Database size, MB
Without spatial extensions and without indexing    22.71           0                22.71
Without spatial extensions and with indexing       26.77           25.67            52.44
With spatial extensions and without indexing       29.31           0                29.31
With spatial extensions and with indexing          33.37           48.02            81.39
Results show that indexing dramatically increases the database size, by
more than 90%, whilst using spatial extensions also increases the storage size,
by more than 12%.
For testing the query execution times, the benchmark program
randomly chose 500 location points and tested the database. The
queries for the 4 most common operations were run for each chosen
location point, the execution times were logged and the
total time for query execution for the proposed testing scenario was calculated
with the following formula:
$$T_{total} = T_{query1} + n_1 \cdot T_{query2} + n_1 \cdot n_2 \cdot T_{query3} + T_{query4},$$
where $n_1$ represents the average number of municipalities returned by query1
and its value is 1.7 and $n_2$ represents the average number of search results and
its value is 64.
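As a check on the formula, the following Python sketch recomputes the totals from the per-query averages reported in Table 2. The function and dictionary names are hypothetical. The assignment of the table values to the four configurations is an assumption read from the table layout; it is supported by the fact that the recomputed totals agree with the reported ones to within about 0.01 s.

```python
N1 = 1.7  # average number of municipalities returned by query1
N2 = 64   # average number of search results


def total_time_ms(t_q1, t_q2, t_q3, t_q4, n1=N1, n2=N2):
    """Total query time for one search, in milliseconds."""
    return t_q1 + n1 * t_q2 + n1 * n2 * t_q3 + t_q4


# Average times in ms for (query1, query2, query3, query4), from Table 2.
averages_ms = {
    "no spatial, no index": (886.10, 715.70, 670.00, 887.78),
    "no spatial, index": (898.67, 953.27, 14.85, 909.27),
    "spatial, no index": (3956.10, 1091.46, 674.99, 3866.22),
    "spatial, index": (176.83, 173.88, 13.33, 215.44),
}

totals_s = {
    name: total_time_ms(*times) / 1000.0
    for name, times in averages_ms.items()
}
for name, total in totals_s.items():
    print(f"{name}: {total:.2f} s")
```

The term for query3 is weighted by 1.7 × 64 = 108.8, which is why the geocoding query dominates the total in every configuration.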
Table 2
Average Query Execution Times

                 Without spatial    Without spatial    With spatial       With spatial
                 extensions and     extensions and     extensions and     extensions and
                 without indexing   with indexing      without indexing   with indexing
query1, ms       886.10             898.67             3956.10            176.83
query2, ms       715.70             953.27             1091.46            173.88
query3, ms       670.00             14.85              674.99             13.33
query4, ms       887.78             909.27             3866.22            215.44
Total time, s    75.88              5.04               83.11              2.13
Results show that the non-indexed solutions are at least 15 times slower
than the indexed ones; using a non-indexed database is therefore not
justified. Using spatial extensions makes queries run significantly faster on the
indexed solutions (at least 2 times faster) and slightly slower on the non-indexed
solutions (1.09 times slower).
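The ratios quoted above follow directly from the total times reported in Table 2. The short Python sketch below reproduces the arithmetic; the variable names are hypothetical and only the four reported totals are used as input.

```python
# Total times in seconds, as reported in Table 2.
no_spatial_no_index = 75.88
no_spatial_index = 5.04
spatial_no_index = 83.11
spatial_index = 2.13

# Slowdown from dropping the index, with and without spatial extensions.
index_gain_plain = no_spatial_no_index / no_spatial_index
index_gain_spatial = spatial_no_index / spatial_index

# Effect of spatial extensions, with and without an index.
spatial_gain_indexed = no_spatial_index / spatial_index
spatial_loss_unindexed = spatial_no_index / no_spatial_no_index

print(f"index gain without spatial extensions: {index_gain_plain:.2f}x")
print(f"index gain with spatial extensions:    {index_gain_spatial:.2f}x")
print(f"spatial gain on indexed database:      {spatial_gain_indexed:.2f}x")
print(f"spatial slowdown without an index:     {spatial_loss_unindexed:.3f}x")
```

The smallest index gain is just above 15, which is where the "at least 15 times" figure comes from.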
3. Conclusions
In this paper we solve the georeferencing problem in a location-based
search engine by using a spatial database as a gazetteer. The most efficient
solutions for the database are also the most storage-costly. However,
the execution time is more important than storage space, which is not a big issue
(the size of a database which stores all the postal addresses in Finland would be
approximately 2 GB). The solution we propose uses spatial extensions and
indexing and is at least 2 times faster than the other tested solutions.
Received: August 6, 2009
“Gheorghe Asachi” Technical University of Iaşi,
Department of Computer Engineering
e-mail: [email protected]
*University of Joensuu, Finland
Department of Computer Science and Statistics
e-mail: [email protected]
REFERENCES
1. Ahlers D., Boll S., Retrieving Address-based Locations from the Web. 2nd International
Workshop on Geographic Information Retrieval, Napa Valley-USA, October
26-30, 2008, 27−34.
2. Ahlers D., Boll S., Urban Web Crawling. First International Workshop on Location
and the Web, Beijing-China. April 22, 2008, 25−32.
3. Amitay E., Har’El N., Sivan R., Soffer A., Web-a-where: Geotagging Web Content.
27th Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval, Sheffield-United Kingdom, July 25-29, 2004, 273–280.
4. Borges K., Laender A., Medeiros C., Davis Jr. C., Discovering Geographic Locations
in Web Pages Using Urban Addresses. 4th ACM Workshop on Geographic
Information Retrieval, Lisbon-Portugal, November 9, 2007, 31−36.
5. Buyukkokten O., Cho J., Garcia-Molina H., Gravano L., Shivakumar N., Exploiting
Geographical Location Information of Web Pages. 2nd International Workshop
on the Web and Databases WebDB (Informal Proceedings), Philadelphia-USA,
June 3-4, 1999, 91−96.
6. Cai W., Wang S., Jiang Q., Address Extraction: Extraction of Location-Based
Information from the Web. 7th Asia-Pacific Web Conference, Shanghai-China,
March 29 – April 1, 2005, 925−937.
7. Hill L., Frew J., Zheng Q., Geographic Names: The Implementation of a Gazetteer in a
Georeferenced Digital Library. D-Lib Magazine, January 1999, Vol. 5, Issue 1 (1999).
8. Jones C.B., Abdelmoty A.I., Finch D., Fu G., Vaid S., The SPIRIT Spatial Search
Engine: Architecture, Ontologies and Spatial Indexing. 3rd International
Conference on Geographic Information Science GIScience 2004,
Maryland-USA, October 20-23, 2004, 125−139.
9. McCurley K.S., Geospatial Mapping and Navigation of the Web. 10th International
Conference on World Wide Web, Hong Kong-China, May 1-5, 2001, 221−229.
10. Mikheev A., Moens M., Grover C., Named Entity Recognition without Gazetteers.
9th Conference on European Chapter of the Association for Computational
Linguistics, Bergen-Norway, June 8-12, 1999, 1−8.
11. Morimoto Y., Aono M., Houle M.E., McCurley K.S., Extracting Spatial Knowledge
from the Web. 2003 Symposium on Applications and the Internet,
Orlando-USA, January 27-31, 2003, 326−333.
12. Souza L.A., Davis Jr. C.A., Borges K.A.V., Delboni T.M., Laender A.H.F., The Role
of Gazetteers in Geographic Knowledge Discovery on the Web. 3rd Latin
American Web Congress LA-WEB 2005, Buenos Aires-Argentina, October 31
– November 2, 2005, 9.
13. Wang C., Xie X., Wang L., Lu Y., Ma W.Y., Detecting Geographic Locations from
Web Resources. 2005 Workshop on Geographic Information Retrieval,
Bremen-Germany, October 31 – November 5, 2005, 17–24.
14. *** The MOPSI Location-based Search Engine. http://cs.joensuu.fi/mopsi/, 2009.
USING A SPATIAL DATABASE IN A LOCATION-BASED
SEARCH APPLICATION
(Summary)
This article describes a solution to the georeferencing problem, a problem that
can arise in a location-based search engine. Georeferencing is the
process of assigning geographic coordinates to a web page or to a part of it.
The proposed solution consists of using an indexed spatial database which
functions as a gazetteer and which contains coordinates mapped to
postal addresses. A series of tests was carried out to determine the database
structure that makes searching most efficient.