The Data Ring: Community Content Sharing
Serge Abiteboul (INRIA)
Alkis Polyzotis (UC Santa Cruz)
Data Sharing Communities
Data sharing community: a group of users that share and
query information within some domain
• Examples: UCSC genome browser, SwissProt, Flickr
• Interesting data management problem
– Shared information is heterogeneous
– Data is distributed and dynamic
– Lack of central administration
– Users are not database savvy
The Data Ring
• P2P middleware system that provides:
– Monitoring
– Querying
– …and other database-like services over the
distributed information
• Main goal: simplicity of use
Data abstraction in the data ring
• Topological layer
• Physical layer
• External layer
Data abstraction in the data ring
Topological Layer
• Declarative query services
• Data and query model based on XML
Data abstraction in the data ring
Physical Layer
• Basic service is distributed query evaluation
• Comprises the overlay network (DHT), physical access structures
(indices, replicas, views), and the catalog.
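
The slide does not say which DHT the overlay uses. As a minimal sketch of the general idea, assuming a consistent-hashing ring, the catalog entry for a resource can be assigned to the first peer clockwise from the resource's hash. The class name `CatalogRing`, the peer names, and the resource name are hypothetical, chosen only for illustration.

```python
import bisect
import hashlib
from typing import Iterable


def ring_position(key: str) -> int:
    """Map a string to a position on the hash ring (SHA-1 as an integer)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class CatalogRing:
    """Hypothetical consistent-hashing ring, not the Data Ring's actual design."""

    def __init__(self, peers: Iterable[str]) -> None:
        # Sort peers by their ring position so lookups can use binary search.
        self._ring = sorted((ring_position(p), p) for p in peers)
        self._positions = [pos for pos, _ in self._ring]

    def responsible_peer(self, resource: str) -> str:
        """Return the first peer at or after the resource's position, wrapping around."""
        i = bisect.bisect_right(self._positions, ring_position(resource))
        return self._ring[i % len(self._ring)][1]


ring = CatalogRing(["peer-a", "peer-b", "peer-c"])
owner = ring.responsible_peer("genes.xml")
```

Any peer can compute `responsible_peer` locally, so locating a catalog entry needs no central administrator, which matches the setting described earlier.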
Data abstraction in the data ring
External Layer
• Provides semantically richer data models
Data abstraction in the data ring
• Our focus is on the topological and physical layers
• The external layer is equally important and an active research area
Thesis #1: formalism for distributed XML data and queries
Distributed XML data and queries
• What made the relational model successful:
– A logic for describing tables
– An algebra for query optimization
• We need the equivalent for trees in a
distributed context:
– A logic for describing distributed XML data
– An algebra for optimizing distributed XML queries
Desiderata for description logic
• Seamless transition between data and
services
– Important for loose data integration
• Support for XML streams
– Streams are essential for subscription services
– They are also necessary to support recursion
Starting point: AXML
• AXML: XML tree with embedded web service
calls
– Seamless transition between intensional and
extensional data
– Provides a simple mechanism for loose data
integration
• Core concept: XML streams
– A web service call returns a stream of elements
– Support for both push and pull semantics
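
The idea of a tree with embedded calls can be sketched in a few lines. This is an illustration under stated assumptions, not AXML's actual syntax or API: the `call` element, its `service` attribute, and the `SERVICES` registry of local generator functions (standing in for remote web services) are all hypothetical. Each service returns a stream of elements that the consumer pulls on demand.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator


def genes_service() -> Iterator[ET.Element]:
    """Hypothetical service: yields a stream of <gene> elements (pull semantics)."""
    for name in ("BRCA1", "TP53"):
        elem = ET.Element("gene")
        elem.text = name
        yield elem


# Hypothetical registry mapping service names to local stand-ins.
SERVICES: Dict[str, Callable[[], Iterator[ET.Element]]] = {"genes": genes_service}


def materialize(node: ET.Element) -> None:
    """Replace each embedded <call> with the elements its service streams back."""
    new_children = []
    for child in list(node):
        if child.tag == "call":
            # Intensional data (a call) becomes extensional data (its results).
            new_children.extend(SERVICES[child.get("service")]())
        else:
            materialize(child)
            new_children.append(child)
    node[:] = new_children


doc = ET.fromstring('<catalog><title>Genome data</title><call service="genes"/></catalog>')
materialize(doc)
print(ET.tostring(doc, encoding="unicode"))
# prints <catalog><title>Genome data</title><gene>BRCA1</gene><gene>TP53</gene></catalog>
```

Before `materialize` runs, the document describes the gene list only by naming the service that produces it. Afterwards the same position in the tree holds the data itself.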
Desiderata for algebra
• Be amenable to rewrites
• Capture the topology of distributed computation
• Allow seamless transition between logical and
physical state
– Plans may need to be re-optimized in mid-flight
– It may be necessary to perform partial optimization
– Error recovery
A proposal based on AXML
• A distributed plan is a workflow of web
services … which is exactly an AXML tree
• Components:
– An encoding of distributed plans in AXML
– Rewrite rules
• A nice bonus: plans can be readily
exchanged between nodes
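
To make the two components concrete, here is a minimal sketch under a made-up encoding; the slides do not give the real one. The element names `select` and `scan`, the `at`, `pred`, and `doc` attributes, and the rule itself are assumptions for illustration. The rule moves a selection to the peer that holds the scanned document, so that data is filtered before it is shipped.

```python
import xml.etree.ElementTree as ET


def push_selection(plan: ET.Element) -> bool:
    """Rewrite rule: run a <select> at the peer of its single <scan> input.

    Returns True if the plan was changed.
    """
    changed = False
    for node in plan.iter("select"):
        children = list(node)
        if len(children) == 1 and children[0].tag == "scan":
            target = children[0].get("at")
            if node.get("at") != target:
                node.set("at", target)  # the operator now executes at the data's peer
                changed = True
    return changed


# Hypothetical plan: peer p1 filters a document that lives on peer p2.
plan = ET.fromstring(
    '<select at="p1" pred="organism=human"><scan at="p2" doc="genes.xml"/></select>'
)
push_selection(plan)

# Because the plan is itself an XML tree, it can be serialized and sent to another node.
wire_format = ET.tostring(plan, encoding="unicode")
```

Applying the rule a second time leaves the plan unchanged, so a rewriter could apply rules of this kind repeatedly until none fires.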
Disclaimer
• AXML is a starting point, not a panacea
• Bottom line: we need formalisms for
distributed XML queries
Thesis #2: autonomic administration
Autonomic administration
• Users are not database experts
– Typically, scientists with computer experience
• Users are averse to too many “knobs”
• No central authority that is responsible for
administration
• Autonomic administration is a necessity, not
a gadget
Facets of autonomy
• Self-monitoring
• Self-tuning
• Self-healing
Some issues
• System integration
• Distribution
• On-line tuning
• Pro-active tuning
Distributed vs. local tuning
• Distributed tuning
– Based on the global workload
– Catalog organization, replication
• Local tuning
– Based on local workload
– Physical design tuning
Data activation for files
• A large portion of the data is expected to be in files
• We need to develop query processors for data
residing in files
• File activation: optimize access to the file based on
the local workload
– E.g., instantiate an index on file contents or materialize a
relational view
• Local tuning is essential in this context
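
File activation driven by the local workload can be sketched as follows. This is a simplified illustration, not the system's implementation: the class `ActivatedFile`, the CSV file format, and the lookup-count threshold are assumptions. Lookups start as full scans of the file; once a field has been queried often enough, an in-memory index on that field is instantiated and used from then on.

```python
import csv
import io
from collections import Counter, defaultdict
from typing import Dict, List, Set


class ActivatedFile:
    """Hypothetical wrapper that indexes a CSV file's fields on demand."""

    def __init__(self, text: str, threshold: int = 3) -> None:
        self._rows: List[Dict[str, str]] = list(csv.DictReader(io.StringIO(text)))
        self._threshold = threshold          # lookups on a field before indexing it
        self._lookups: Counter = Counter()   # observed local workload, per field
        self._indexes: Dict[str, Dict[str, List[Dict[str, str]]]] = {}

    def find(self, field: str, value: str) -> List[Dict[str, str]]:
        """Return rows where field == value, indexing the field if it is hot."""
        self._lookups[field] += 1
        if field not in self._indexes and self._lookups[field] >= self._threshold:
            self._build_index(field)
        if field in self._indexes:
            return self._indexes[field].get(value, [])
        return [row for row in self._rows if row[field] == value]  # full scan

    def _build_index(self, field: str) -> None:
        index = defaultdict(list)
        for row in self._rows:
            index[row[field]].append(row)
        self._indexes[field] = dict(index)

    def indexed_fields(self) -> Set[str]:
        return set(self._indexes)


data = "gene,organism\nBRCA1,human\nTP53,human\nSox2,mouse\n"
genes = ActivatedFile(data, threshold=2)
genes.find("organism", "mouse")   # first lookup: answered by a scan
genes.find("organism", "human")   # second lookup: triggers the index
```

The user never asks for the index; the decision follows from the observed workload, in line with the goal of avoiding tuning knobs.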