thai-language.com
Glenn Slayden
October 14, 2009
Agenda
• Background and history
• Site surface demonstration
• Database ontology
• Database technology
• Data Entry demonstration
• Future directions
• Q&A: throughout, please
Overarching Motivation
• Long-term objectives:
– Increase linguistic rigor
– Publish any new work
– Maintain popular accessibility
– Build community
Historical Parchment - 1997
More Parchment - 2001
Site Demonstration
Database? What Database?
• How big is a monolingual dictionary?
• 100,000 words × 300 bytes/entry = 30 MB
• How much memory in a modern server? 32 GB.
• That’s about 1/10th of 1% (0.00094)
• SQL? MySQL? PostgreSQL? Not indicated.
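The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and a per-entry size of 300 bytes, which is the figure implied by the 30 MB total; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the dictionary size estimate.
# Assumptions: decimal units, 300 bytes per entry (implied by the 30 MB total).
WORDS = 100_000
BYTES_PER_ENTRY = 300
SERVER_RAM_BYTES = 32 * 10**9  # 32 GB

dictionary_bytes = WORDS * BYTES_PER_ENTRY
fraction_of_ram = dictionary_bytes / SERVER_RAM_BYTES

print(f"dictionary size: {dictionary_bytes / 10**6:.0f} MB")
print(f"share of server RAM: {fraction_of_ram:.5f}")
```

Even if the per-entry estimate were off by a factor of ten, the data would still occupy only about 1% of the server's memory.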
Case Study
October 13, 2009 – 64-bit web server – 32 GB RAM
Server Memory Utilization
n.b. this entire pie chart represents 10% of total memory
In-memory is the way to go
• For performance
• For ease and speed of development
• Easy refactoring
• LINQ – C# “language-integrated query”
• Have a flexible and powerful object-model
without worrying about relational mapping
• Completely avoid OR/M (object-relational
mapping) “impedance mismatch” issues
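To make the in-memory idea concrete, here is a minimal sketch of querying plain objects directly, with no SQL and no mapping layer. It is written in Python rather than the C#/LINQ used on the site, and the class shapes, field names, and sample words are assumptions made for illustration; they are not the site's actual object model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Definition:
    gloss: str  # English gloss for one sense


@dataclass
class Entry:
    headword: str
    definitions: List[Definition] = field(default_factory=list)


# The whole "database" is an ordinary in-memory list (illustrative sample data).
entries = [
    Entry("น้ำ", [Definition("water"), Definition("liquid")]),
    Entry("ไฟ", [Definition("fire")]),
    Entry("แมว", [Definition("cat")]),
]


def find_by_gloss(items, word):
    """Return headwords of entries with at least one definition glossed as `word`."""
    return [e.headword for e in items if any(d.gloss == word for d in e.definitions)]


# Queries are ordinary expressions over the object graph.
multi_sense = [e.headword for e in entries if len(e.definitions) > 1]

print(find_by_gloss(entries, "fire"))
print(multi_sense)
```

Because the query operates on the same objects the rest of the program uses, renaming a field or restructuring a class is an ordinary refactoring step rather than a schema migration.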
thai-language.com Ontology
• Disclaimer and warning
– Internal names of programming objects are not
(any longer) intended to have any relationship to
corresponding linguistic terms. On the following
slides please consider these names to be opaque
monikers.
thai-language.com Ontology
Entry
Definition
Phrase
Category
These colors correspond (roughly) to data-entry screen colors in DBEdit
The most basic
Lucky Decision
• ...that turned out to be incredibly valuable:
– Heterogeneous objects are assigned ID numbers
within mutually exclusive ranges
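A minimal sketch of the ID-range idea follows. The range boundaries are hypothetical, since the slides do not give the real ones, and the four type names are borrowed from the ontology slide purely as labels.

```python
# Hypothetical ID ranges; the slides do not state the real boundaries.
ID_RANGES = {
    "Entry": range(1, 1_000_000),
    "Definition": range(1_000_000, 2_000_000),
    "Phrase": range(2_000_000, 3_000_000),
    "Category": range(3_000_000, 4_000_000),
}


def kind_of(object_id: int) -> str:
    """Return the object type whose range contains `object_id`."""
    for kind, id_range in ID_RANGES.items():
        if object_id in id_range:
            return kind
    raise ValueError(f"ID {object_id} is outside every known range")


def ranges_are_disjoint(ranges) -> bool:
    """Check that no two ranges overlap (all ranges here have step 1)."""
    ordered = sorted(ranges, key=lambda r: r.start)
    return all(a.stop <= b.start for a, b in zip(ordered, ordered[1:]))


print(kind_of(42))
print(kind_of(2_500_000))
```

One likely benefit of this scheme is that a bare number is enough to identify both an object and its type, so links and mixed collections can carry just the ID.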
Scary Picture with Clouds In It
Data Entry Demonstration
Future directions
• Track provenance of entries and changes
• Separate out meta-information in English
senses
• Move towards community curatorship while
maintaining asset value
– Requires reputation-granting authority
• Refine and formalize dictionary statement of
purpose (i.e. to prevent hijacking)
Technology Changes
• In 2009, optimizing a language dictionary
database for size is not necessary
• Detailed fields should be generously deployed
• Exception to the in-memory model:
– Comprehensive change version tracking may
warrant database storage
– This is necessary for community curatorship
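As one way to picture the exception above, here is a minimal sketch of an append-only change log kept in a database while the dictionary itself stays in memory. SQLite, the table layout, and every column and function name are assumptions chosen for illustration; the slides do not specify a storage engine or schema.

```python
import sqlite3

# Hypothetical schema: one append-only row per change to any dictionary object.
SCHEMA = """
CREATE TABLE IF NOT EXISTS change_log (
    change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    object_id  INTEGER NOT NULL,
    editor     TEXT    NOT NULL,
    field      TEXT    NOT NULL,
    old_value  TEXT,
    new_value  TEXT,
    changed_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def open_log(path=":memory:"):
    """Open (or create) the change log; defaults to a throwaway in-memory DB."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_change(conn, object_id, editor, field, old_value, new_value):
    """Append one change record; existing rows are never updated."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO change_log (object_id, editor, field, old_value, new_value) "
            "VALUES (?, ?, ?, ?, ?)",
            (object_id, editor, field, old_value, new_value),
        )


def history(conn, object_id):
    """Return (editor, field, old_value, new_value) tuples, oldest first."""
    rows = conn.execute(
        "SELECT editor, field, old_value, new_value FROM change_log "
        "WHERE object_id = ? ORDER BY change_id",
        (object_id,),
    )
    return rows.fetchall()
```

A usage example: `conn = open_log()`, then `record_change(conn, 42, "alice", "gloss", "watter", "water")`, then `history(conn, 42)` to list every edit made to object 42 and who made it.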
An integrated DELPH-IN style
computational-analytical grammar
• Associate a rigorous HPSG feature structure
with each sense
• Display MRS and tree on dictionary page for
compounds and sentences.
• Ability to designate gold standard parse trees
and attestation provenance
• Live interface for LKB/PET-style parser to
provide arbitrary parsing
Thanks for Coming!