Ömer Korçak
Cmpe 521
Mariposa: A Wide Area Distributed
Database System
What is Mariposa?
• Wide Area Distributed Database
Management System.
• Ongoing project at UC Berkeley.
• Addresses fundamental problems with
the standard approach to distributed
database management.
Why Mariposa?
• To date, distributed database management systems have been designed for
local-area networks (LANs):
– Single administrative structure: Few servers operating within one
administrative domain, such as one company or one department within
a company.
– Uniformity: These systems assume uniformity of all processors and
network connections within the system.
– Static data allocation: Data movement in these systems is a very
“heavyweight” operation and is performed manually by a database
administrator.
• The requirements for wide-area distributed database systems differ
dramatically from those of local-area network systems.
– Individual sites usually report to different system administrators.
– Sites have different access and charging algorithms.
– Sites install site-specific data type extensions.
– Sites have different constraints on servicing remote requests.
– There may be many sites participating in a WAN distributed DBMS.
Main Goals of Mariposa
• Scalability to a large number of cooperating sites. In a WAN
environment, there may be a large number of sites. The authors' goal is to
scale to 10,000 servers.
• Local autonomy. Each site must have control over its resources.
This includes which objects to store and which queries to run. Query
and data allocation cannot be done by a central, authoritarian query
optimizer.
• Data mobility. It should be easy and efficient to change the “home”
of an object. Preferably, the object should remain available during
movement.
• No global synchronization. Updates and schema changes should
not force a site to synchronize with all other sites. Otherwise, many
common operations will have exceptionally poor response time.
• Easily configurable policies. It should be easy for a local database
administrator to change the behavior of a Mariposa site. A Mariposa
system should respond gracefully to changes in user activity and
data access patterns to maintain low response time and high system
throughput.
• Traditional distributed DBMSs do not meet these
requirements.
– Use of an authoritarian, centralized query optimizer does not
scale well.
– The high cost of moving an object between sites restricts data
mobility.
– Schema changes typically require global synchronization.
– Centralized management designs inhibit local autonomy and
flexible policy configuration.
• One could claim that these are implementation issues.
• However, the authors argue that traditional distributed
DBMSs cannot meet these requirements for
fundamental architectural reasons.
• A new architecture is introduced. It uses a
microeconomic framework.
Overview of the Architecture
• All distributed DBMS issues (query optimization, data movement,
name service, etc.) are reformulated in microeconomic terms.
• Implementation of the economic paradigm involves a number of
entities and mechanisms.
• All Mariposa clients and servers have an account with a network bank.
• The user allocates a budget for each query.
• The goal of the query processing system is to solve the query within
the allotted budget by contracting with various Mariposa processing
sites to perform portions of the query.
• Each query is administered by a broker, which obtains bids for
pieces of the query from various sites.
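The budget-and-bid idea above can be illustrated with a minimal sketch. It assumes each bid is a single price for one piece of the query and that the broker simply takes the cheapest bid per piece; the `Bid` class and `select_bids` function are hypothetical names for illustration, not Mariposa's actual interface, and the real bidding protocol may weigh other factors.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bid:
    site: str     # bidding site (hypothetical identifier)
    piece: str    # piece of the query this bid covers
    price: float  # amount the site would charge

def select_bids(bids, budget):
    """Pick the cheapest bid per query piece; return None if over budget."""
    cheapest = {}
    for bid in bids:
        best = cheapest.get(bid.piece)
        if best is None or bid.price < best.price:
            cheapest[bid.piece] = bid
    total = sum(b.price for b in cheapest.values())
    return cheapest if total <= budget else None
```

A piece that receives no bids is simply absent from the result, so a caller would still need to check that every piece of the query is covered.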
Mechanism
• Instead of using centralized metadata to determine where to run the query,
the broker finds sites that might want to bid on the query, using a
distributed advertising service.
• A server can join the Mariposa system by buying objects from other sites,
advertising its services, and then bidding on queries.
• It can leave the system by selling its objects and ceasing to bid.
→ A large number of sites is supported; the system is highly scalable.
• Each site makes storage decisions to buy and sell fragments, based on
optimizing the revenue it expects to collect.
• Mariposa objects have no notion of home. Their owner can change
rapidly as objects are moved.
→ Data mobility, free trade of objects.
→ Global synchronization is avoided by the use of some economic
paradigms, such as replication.
• Each site is free to bid on any business of interest.
→ Total local autonomy.
The Mariposa Architecture
An Example
• A company that sells widgets.
• Has offices in San Francisco, Chicago, New York and Miami.
• The company’s database includes a table called WIDGETS, which contains
pricing and inventory information on all the company’s widgets.
• Widgets are warehoused in New York and Miami.
– The company keeps half of the WIDGETS table in New York and half in Miami.
• In Mariposa, splitting a table is called fragmentation, and the pieces of
a table are called fragments. Here the WIDGETS table is fragmented into
WIDGETS1 and WIDGETS2.
• If the purchasing manager in San Francisco wants to retrieve all the records
from the WIDGETS table, he would enter “SELECT * FROM WIDGETS”. The
site where the query is entered, San Francisco in this case, is called the
home site.
• The query is sent from the frontend application to the Mariposa program
running on the server in San Francisco.
Example (cont)
• The query is passed through:
– Parser, which checks the syntactic correctness of the query.
– Optimizer, which produces a query plan that describes the order in which
different steps in the plan will be executed.
– Fragmenter, which changes the plan produced by the optimizer to reflect the
data fragmentation. The resulting plan is called the fragmented query plan.
• In order to do their work, the parser, optimizer and fragmenter need
information about data types, fragment location, etc.
• This information is maintained by a Mariposa name server.
– In our example the name server is in the Chicago office.
• The fragmented query plan describes the operations that will be performed
in order to execute the query, and the order in which they will be carried out.
– In our example, the purchasing manager’s query, “SELECT * FROM WIDGETS”,
is represented by a query plan which scans the two WIDGETS fragments,
WIDGETS1 and WIDGETS2, and merges the result.
• The fragmented query plan is passed to the query broker, whose job it is to
decide where each piece of the fragmented query plan will be executed.
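The fragmenter step can be sketched roughly as follows. The tuple-based plan representation and the `FRAGMENTS` catalog are assumptions made for illustration (in Mariposa this information would come from the name server), and `fragment_plan` is a hypothetical helper, not the system's actual code.

```python
# Hypothetical catalog mapping each table to its fragments (from the example).
FRAGMENTS = {"WIDGETS": ["WIDGETS1", "WIDGETS2"]}

def fragment_plan(plan):
    """Rewrite ('scan', table) nodes into a merge over per-fragment scans."""
    op, *args = plan
    if op == "scan":
        frags = FRAGMENTS.get(args[0], [args[0]])
        scans = [("scan", f) for f in frags]
        return scans[0] if len(scans) == 1 else ("merge", *scans)
    # Recurse into child plans; leave non-plan arguments untouched.
    return (op, *[fragment_plan(a) if isinstance(a, tuple) else a for a in args])

print(fragment_plan(("scan", "WIDGETS")))
# prints ('merge', ('scan', 'WIDGETS1'), ('scan', 'WIDGETS2'))
```

A table with no catalog entry is treated as a single unfragmented table and its scan is left unchanged.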
Example (cont)
• The query broker contacts the bidder module at each potential processing
site. The broker waits for responses from the bidders before selecting the
best ones.
• After the query broker has specified the processing sites, the backend’s
coordinator module takes over. The coordinator notifies the remote sites to
begin processing, collects the results, and returns the answer to the client
program.
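A rough sketch of the coordinator's role, under heavy simplification: remote execution is replaced by a local callable, and results are merged by concatenation. The names `coordinate`, `fake_run` and `FAKE_DATA`, the sample rows, and the assignment of WIDGETS1 to New York and WIDGETS2 to Miami are all assumptions for illustration; the slides do not say which fragment lives where.

```python
def coordinate(assignments, run_at_site):
    """Run each plan piece at its assigned site and merge the rows.

    assignments: list of (site, fragment) pairs chosen by the broker.
    run_at_site: hypothetical callable standing in for a remote call.
    """
    merged = []
    for site, fragment in assignments:
        merged.extend(run_at_site(site, fragment))
    return merged

# Stand-in for the two remote sites in the widgets example (made-up rows).
FAKE_DATA = {
    ("New York", "WIDGETS1"): [("bolt", 0.10)],
    ("Miami", "WIDGETS2"): [("gear", 2.50)],
}

def fake_run(site, fragment):
    return FAKE_DATA[(site, fragment)]
```

Calling `coordinate([("New York", "WIDGETS1"), ("Miami", "WIDGETS2")], fake_run)` returns the rows of both fragments in one list, which is what the client program would receive.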
Broker
• Responsible for getting the query performed on behalf of the user
• Receives a budget from the user to pay for the query
• Finds possible bidders by examining the Ad Table.
– Finding bidders is achieved through an advertising mechanism.
– Servers announce their willingness to perform various services
by posting ads.
– Name servers keep a record of these ads in an Ad Table.
• Contacts possible bidders and acts as auctioneer
• Coordinates the query execution
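The advertising mechanism can be pictured with a toy ad table. This sketch assumes an ad simply says that a site is willing to serve queries over a given fragment; the `AdTable` class and its method names are hypothetical, and real Mariposa ads may carry more information than this.

```python
from collections import defaultdict

class AdTable:
    """Toy ad table: maps a fragment name to the sites advertising service on it."""

    def __init__(self):
        self._ads = defaultdict(set)

    def post_ad(self, site, fragment):
        # A server announces its willingness to serve this fragment.
        self._ads[fragment].add(site)

    def withdraw_ad(self, site, fragment):
        # A server stops advertising, e.g. after selling the fragment.
        self._ads[fragment].discard(site)

    def possible_bidders(self, fragment):
        # The broker looks up the sites it should contact for bids.
        return sorted(self._ads.get(fragment, ()))
```

The broker would call `possible_bidders` for each piece of the fragmented plan and then contact only those sites for bids.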
Bidder
• One bidder per site
• Responds to queries issued by brokers
• Defines bids so as to maximize system use and site revenue
• Follows the site’s predefined policies
Storage Manager
• Stores the fragments and their revenue history
– Revenue history is a good predictor of future revenue.
• Decides to buy and sell fragments so as to maximize memory use and
site revenue
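One way to picture how revenue history could drive a buy decision is sketched below. The moving-average predictor, the `window` and `horizon` parameters, and both function names are assumptions made for illustration; the slides only state that revenue history is a good predictor of future revenue, not how the prediction is computed.

```python
def predicted_revenue(history, window=3):
    """Predict next-period revenue as the mean of the most recent periods."""
    recent = history[-window:]
    return sum(recent) / len(recent) if recent else 0.0

def should_buy(history, asking_price, horizon=4, window=3):
    """Buy if predicted revenue over `horizon` periods exceeds the price."""
    return predicted_revenue(history, window) * horizon > asking_price
```

For a fragment that earned 10, 20, 30 and 40 in its last four periods, the prediction is 30 per period, so with the default horizon of four periods the site would buy at an asking price of 100 but not at 150.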
Name Server
• Mariposa uses a decentralized naming facility.
• Four structures are used in object naming:
– Internal names:
• Location dependent.
• Used to determine the physical location of a fragment.
– Full names:
• Completely specified names that uniquely identify an object.
• Location independent. (A full name is still valid when an object moves.)
– Common names:
• User-specific, partially specified names.
• Using them avoids the tedium of using full names.
• Simple rules permit translation from common name to full name.
– Name context:
• Set of affiliated names.
• Names within a context are expected to share some feature.
• Names do not have to be globally registered.
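The translation from common names to full names might look something like the following sketch. It assumes a user has an ordered search path of name contexts and that a full name is written as `context.name`; both the dotted format and the `resolve` function are illustrative assumptions, not Mariposa's actual naming syntax or rules.

```python
def resolve(common_name, search_path, contexts):
    """Translate a common name to a full name using an ordered list of contexts.

    contexts: dict mapping a context name to the set of names it holds.
    search_path: the user's contexts, tried in order (hypothetical rule).
    """
    for ctx in search_path:
        if common_name in contexts.get(ctx, ()):
            return f"{ctx}.{common_name}"
    raise KeyError(common_name)
```

With contexts `{"sales": {"WIDGETS"}, "hr": {"EMPLOYEES"}}` and the search path `["hr", "sales"]`, the common name `WIDGETS` resolves to `sales.WIDGETS`; a name found in no context raises `KeyError`.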
Conclusions
• The authors present a distributed microeconomic approach for managing query
execution and storage management.
• The economic model reduces the scheduling complexity of distributed
interactions because it does not seek globally optimal solutions.
• They test the power and flexibility of Mariposa through experiments running
over a WAN, and the results are positive.
• Mariposa is an ongoing project, and they are continuing to implement more
sophisticated features.
Authors:
Michael Stonebraker, Paul M. Aoki, Witold Litwin, Avi Pfeffer, Adam Sah,
Jeff Sidell, Carl Staelin, Andrew Yu.