On building a high-performance gazetteer database
Amittai Axelrod
MetaCarta Inc
Corporate Proprietary, Copyright 1999-2003, MetaCarta, Inc.
Geographic Text Search
Thanks to
Keith Baker
Kenneth Baker
Michael Bukatin
András Kornai
Plan of the talk
• Database background
• Relating geographic names and features
• Handling ambiguities and inconsistencies in geographic names
• Classification and storage system for geographic features
Databases
• No DB (faking it with flat files) -- clumsy
• Record-oriented -- still runs the world
• Relational -- making headway
• Object-oriented -- still very academic
• For MetaCarta GazDB, the relational approach made the most sense:
  • Overlapping records (McKinley/Denali)
  • Need for frequent updates of subparts of records
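The McKinley/Denali case can be sketched as a one-to-many relation between features and names. This is a minimal illustration using Python's built-in sqlite3 module, not the GazDB schema: the table names, column names, and helper functions are all hypothetical, and the coordinates are approximate.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not taken from GazDB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        lat REAL NOT NULL,
        lon REAL NOT NULL
    );
    CREATE TABLE name (
        name_id INTEGER PRIMARY KEY,
        feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
        name TEXT NOT NULL
    );
""")

# One feature row (approximate coordinates), two name rows pointing at it.
conn.execute("INSERT INTO feature VALUES (1, 63.069, -151.007)")
conn.executemany(
    "INSERT INTO name (feature_id, name) VALUES (?, ?)",
    [(1, "Mount McKinley"), (1, "Denali")],
)


def names_for(conn, feature_id):
    """Return all names attached to one feature, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM name WHERE feature_id = ? ORDER BY name",
        (feature_id,),
    )
    return [row[0] for row in rows]


def rename(conn, old, new):
    """Update a single name row; the feature row is left untouched."""
    cur = conn.execute("UPDATE name SET name = ? WHERE name = ?", (new, old))
    return cur.rowcount
```

Because each name is its own row, changing one spelling touches a subpart of the record rather than rewriting the whole entry.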
Gazetteer production process
Conversion scripts
• Enforce uniform structure on the data
• Normalize across sources (e.g. lat/lon to decimal degrees, spelling, …)
• Configuration required once per source
• Load data into GazDB
• Combination of Perl and SQL
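As an illustration of the lat/lon normalization step, here is a minimal Python sketch that converts degrees, minutes, and seconds to signed decimal degrees. The original scripts are Perl/SQL and are not shown in the talk; the function names and the whitespace-separated input format below are assumptions.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus N/S/E/W to signed decimal degrees."""
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in [0, 60)")
    hemi = hemisphere.upper()
    if hemi not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by convention.
    return -value if hemi in ("S", "W") else value


def parse_dms(text):
    """Parse a hypothetical 'DD MM SS H' source format, e.g. '151 00 27 W'."""
    d, m, s, h = text.split()
    return dms_to_decimal(int(d), int(m), float(s), h)
```

A real conversion script would need one such parser per source format, which matches the "configuration required once per source" point above.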
Relating features and names
Other tables used in GazDB
• Population
• Elevation
• Language
• Feature type
• Source/versioning info
• Temporal extent
• Hierarchical information
• Confidence
• Comments
• Change logs (full auditing)
Geographic names
• Internationalization
  • Full Unicode (UTF-8) support
  • Maintain detailed language information (SIL)
• Name resolution
  • Canonical form (16 bit)
  • Display form (8 bit)
  • Search form (6 bit)
• Authoritativeness
• Explicitness
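The slide does not define the three name forms, so the following Python sketch rests on an assumption: each form uses a progressively smaller character repertoire (full Unicode, then Latin-1, then uppercase letters and digits, which fit in 6 bits per symbol). The function names and folding rules are hypothetical and are not the GazDB implementation.

```python
import string
import unicodedata

# Assumed search alphabet: 26 letters + 10 digits (plus space as separator),
# small enough to fit in 6 bits per symbol.
_SEARCH_ALPHABET = set(string.ascii_uppercase + string.digits)


def _strip_marks(text, form):
    """Decompose text and drop combining marks (accents, cedillas, ...)."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def canonical_form(name):
    """Assumed canonical form: the full Unicode name, NFC-normalized."""
    return unicodedata.normalize("NFC", name)


def display_form(name):
    """Assumed display form: keep Latin-1 characters, strip marks from the rest."""
    out = []
    for ch in canonical_form(name):
        if ord(ch) < 256:
            out.append(ch)
            continue
        base = _strip_marks(ch, "NFD")
        out.append(base if base and all(ord(c) < 256 for c in base) else "?")
    return "".join(out)


def search_form(name):
    """Assumed search form: uppercase A-Z and 0-9, everything else a space."""
    folded = _strip_marks(name, "NFKD").upper()
    kept = "".join(c if c in _SEARCH_ALPHABET else " " for c in folded)
    return " ".join(kept.split())
```

Under these assumptions, a query typed without accents or punctuation can still match the search form, while the canonical form preserves the original spelling for display and auditing.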
Updating a name in the GazDB
Geographic features
• Spatial representations
  • Point, line, area, …
• Functional classes
  • Building, field, campus, city, …
• Administrative types
  • Nation, province, county, international org, …
Export scripts
• Read GazDB
• Select which fields to include in custom output
• Create .gbdm (MetaCarta format) binaries
• Combination of Perl and SQL
• Not yet general across binary output formats
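The idea of selecting fields and writing them to a compact binary record can be sketched with Python's struct module. The .gbdm format is not described in the talk, so the layout below is purely hypothetical (a 16-bit name length, the selected numeric fields as little-endian doubles, then the UTF-8 name), as are the function names.

```python
import struct


def _layout(numeric_fields):
    """Hypothetical layout, NOT .gbdm: uint16 name length + one float64 per field."""
    return "<H%dd" % len(numeric_fields)


def pack_entry(entry, numeric_fields):
    """Pack the name and the selected numeric fields of one gazetteer entry."""
    name = entry["name"].encode("utf-8")
    values = [float(entry[field]) for field in numeric_fields]
    return struct.pack(_layout(numeric_fields), len(name), *values) + name


def unpack_entry(blob, numeric_fields):
    """Inverse of pack_entry for the same field selection."""
    fmt = _layout(numeric_fields)
    size = struct.calcsize(fmt)
    name_len, *values = struct.unpack(fmt, blob[:size])
    name = blob[size:size + name_len].decode("utf-8")
    return dict(zip(numeric_fields, values), name=name)
```

For example, packing an entry with only `["lat", "lon"]` selected omits its elevation from the output, which is the kind of per-purpose customization the export step describes.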
Conclusions
• Accept multiple sources (only configure once per source)
• Fast loading of large datasets (1 million entries per hour on a Linux desktop)
• Simple update procedure
• Output large custom binary gazetteers for different purposes at extreme speeds (1 million entries per minute)