Ömer Korçak, Cmpe 521, [email protected]
Mariposa: A Wide Area Distributed Database System

What is Mariposa?
• Wide-area distributed database management system.
• Ongoing project at UC Berkeley.
• Addresses the fundamental problems with the standard approach to distributed database management.

Why Mariposa?
• To date, distributed database management systems have been designed for local-area networks (LANs):
  – Single administrative structure: few servers operating within one administrative domain, such as one company or one department within a company.
  – Uniformity: these systems assume uniformity of all processors and network connections within the system.
  – Static data allocation: data movement in these systems is a very "heavyweight" operation and is performed manually by a database administrator.
• The requirements for wide-area distributed database systems differ dramatically from those of LAN systems:
  – Individual sites usually report to different system administrators.
  – Sites have different access and charging algorithms.
  – Sites install site-specific data type extensions.
  – Sites place different constraints on servicing remote requests.
  – There may be many sites participating in a WAN distributed DBMS.

Main Goals of Mariposa
• Scalability to a large number of cooperating sites.
  In a WAN environment, there may be a large number of sites. The goal is to scale to 10,000 servers.
• Local autonomy.
  Each site must have control over its resources. This includes which objects to store and which queries to run. Query and data allocation cannot be done by a central, authoritarian query optimizer.
• Data mobility.
  It should be easy and efficient to change the "home" of an object. Preferably, the object should remain available during movement.
• No global synchronization.
  Updates and schema changes should not force a site to synchronize with all other sites. Otherwise, many common operations will have exceptionally poor response time.
• Easily configurable policies.
  It should be easy for a local database administrator to change the behavior of a Mariposa site. A Mariposa system should respond gracefully to changes in user activity and data access patterns to maintain low response time and high system throughput.

• Traditional distributed DBMSs do not meet these requirements.
  – Use of an authoritarian, centralized query optimizer does not scale well.
  – The high cost of moving an object between sites restricts data mobility.
  – Schema changes typically require global synchronization.
  – Centralized management designs inhibit local autonomy and flexible policy configuration.
• One could claim that these are implementation issues.
• However, the authors argue that traditional distributed DBMSs cannot meet these requirements for fundamental architectural reasons.
• A new architecture is introduced. It uses a microeconomic framework.

Overview of the Architecture
• All distributed DBMS issues (query optimization, data movement, name service, etc.) are reformulated in microeconomic terms.
• Implementation of the economic paradigm involves a number of entities and mechanisms.
• All Mariposa clients and servers have an account at a network bank.
• The user allocates a budget for each query.
• The goal of the query processing system is to solve the query within the allotted budget by contracting with various Mariposa processing sites to perform portions of the query.
• Each query is administered by a broker, which obtains bids for pieces of the query from various sites.

Mechanism
• Instead of using centralized metadata to determine where to run the query, the broker finds sites that might want to bid on the query (by using a distributed advertising service).
• A server can join the Mariposa system by buying objects from other sites, advertising its services, and then bidding on queries. It can leave the system by selling its objects and ceasing to bid.
  → Large number of sites supported; highly scalable system.
• Each site makes storage decisions to buy and sell fragments, based on optimizing the revenue it expects to collect.
• Mariposa objects have no notion of home. Their owner can change rapidly as objects are moved.
  → Data mobility; free trade of objects.
• Global synchronization is avoided by the use of economic paradigms, for example for replication.
• Each site is free to bid on any business of interest.
  → Total local autonomy.

The Mariposa Architecture

An Example
• A company sells widgets.
• It has offices in San Francisco, Chicago, New York, and Miami.
• The company's database includes a table called WIDGETS, which contains pricing and inventory information on all the company's widgets.
• Widgets are warehoused in New York and Miami.
  – The company keeps half of the WIDGETS table in New York and half in Miami.
• In Mariposa, splitting a table is called fragmentation, and the pieces of a table are called fragments.
• Here the WIDGETS table is fragmented into WIDGETS1 and WIDGETS2.
• If the purchasing manager in San Francisco wants to retrieve all the records from the WIDGETS table, he would enter "SELECT * FROM WIDGETS". The site where the query is entered, San Francisco in this case, is called the home site. The query is sent from the frontend application to the Mariposa program running on the server in San Francisco.

Example (cont)
• The query is passed through:
  – Parser, which checks the syntactic correctness of the query.
  – Optimizer, which produces a query plan that describes the order in which the different steps in the plan will be executed.
  – Fragmenter, which changes the plan produced by the optimizer to reflect the data fragmentation. The resulting plan is called the fragmented query plan.
• In order to do their work, the parser, optimizer, and fragmenter need information about data types, fragment location, etc.
• This information is maintained by a Mariposa name server.
  – In our example, the name server is in the Chicago office.
• The fragmented query plan describes the operations that will be performed in order to execute the query, and the order in which they will be carried out.
  – In our example, the purchasing manager's query, "SELECT * FROM WIDGETS", is represented by a query plan that scans the two WIDGETS fragments, WIDGETS1 and WIDGETS2, and merges the results.
• The fragmented query plan is passed to the query broker, whose job is to decide where each piece of the fragmented query plan will be executed.

Example (cont)
• The query broker contacts the bidder module at each potential processing site. The broker waits for responses from the bidders before selecting the best ones.
• After the query broker has specified the processing sites, the backend's coordinator module takes over. The coordinator notifies the remote sites to begin processing, collects the results, and returns the answer to the client program.

Broker
• Responsible for getting the query performed on behalf of the user.
• Receives a budget from the user to pay for the query.
• Finds possible bidders by examining the Ad Table.
  – Finding bidders is achieved through an advertising mechanism.
  – Servers announce their willingness to perform various services by posting ads.
  – Name servers keep a record of these ads in an Ad Table.
• Contacts possible bidders and acts as auctioneer.
• Coordinates the query execution.

Bidder
• One bidder per site.
• Responds to queries issued by brokers.
• Defines bids so as to maximize system use and site revenue.
• Follows the site's predefined policies.

Storage Manager
• Stores the fragments and their revenue history.
  – Revenue history is a good predictor of future revenue.
• Decides to buy and sell fragments so as to maximize memory use and site revenue.

Name Server
• Mariposa uses a decentralized naming facility.
• Four structures are used in object naming:
  – Internal names:
    • Location dependent.
    • Used to determine the physical location of a fragment.
  – Full names:
    • Completely specified names that uniquely identify an object.
    • Location independent (a full name is still valid when an object moves).
  – Common names:
    • User-specific, partially specified names.
    • Using them avoids the tedium of using full names.
    • Simple rules permit translation from a common name to a full name.
  – Name context:
    • Set of affiliated names.
    • Names within a context are expected to share some feature.
• Names do not have to be globally registered.

Conclusions
• The authors present a distributed microeconomic approach for managing query execution and storage management.
• The economic model reduces the scheduling complexity of distributed interactions because it does not seek globally optimal solutions.
• They test the power and flexibility of Mariposa through experiments running over a WAN, and the results are positive.
• Mariposa is an ongoing project, and they are continuing to implement more sophisticated features.

Authors: Michael Stonebraker, Paul M. Aoki, Witold Litwin, Avi Pfeffer, Adam Sah, Jeff Sidell, Carl Staelin, Andrew Yu.
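
Illustrative sketch: broker bid selection

The Broker and Bidder slides describe a broker that collects bids for the pieces of a fragmented query plan and must stay within the user's budget. The slides only say that the broker selects "the best" bids, so the rule used below (cheapest bid per piece, ties broken by shorter promised delay) is an assumption made for illustration, not the actual Mariposa protocol. The names Bid and select_bids, the prices, the delays, and the choice of bidding sites are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    site: str      # site whose bidder module answered the broker
    subquery: str  # piece of the fragmented query plan being bid on
    price: float   # amount the site would charge (hypothetical units)
    delay: float   # promised completion time in seconds (hypothetical)


def select_bids(bids, budget):
    """Pick one bid per subquery; return None if the total exceeds the budget.

    Assumed rule: lowest price wins, shorter delay breaks ties.
    """
    best = {}
    for bid in bids:
        current = best.get(bid.subquery)
        if current is None or (bid.price, bid.delay) < (current.price, current.delay):
            best[bid.subquery] = bid
    total = sum(b.price for b in best.values())
    if total > budget:
        return None  # the query cannot be contracted within the budget
    return best


# Pieces of the plan for "SELECT * FROM WIDGETS": two scans and a merge.
bids = [
    Bid("New York", "scan WIDGETS1", 4.0, 2.0),
    Bid("Chicago", "scan WIDGETS1", 6.0, 1.0),
    Bid("Miami", "scan WIDGETS2", 5.0, 2.5),
    Bid("San Francisco", "merge", 1.0, 0.5),
]

plan = select_bids(bids, budget=12.0)
if plan is not None:
    for subquery, bid in plan.items():
        print(f"{subquery}: {bid.site} at {bid.price}")
```

A real broker could just as well weigh delay against price, or bargain further when the budget is exceeded; the sketch only shows the shape of the decision: one winning site per plan piece, subject to the budget.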
