* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Distributed Database
Entity–attribute–value model wikipedia , lookup
Serializability wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Relational model wikipedia , lookup
Functional Database Model wikipedia , lookup
Clusterpoint wikipedia , lookup
Outline • Introduction – What is a distributed DBMS – Distributed DBMS Architecture • • • • • • • • • • • • • Background Distributed Database Design Database Integration Semantic Data Control Distributed Query Processing Multidatabase query processing Distributed Transaction Management Data Replication Parallel Database Systems Distributed Object DBMS Peer-to-Peer Data Management Web Data Management Current Issues Ch.1/1 File Systems program 1 data description 1 File 1 program 2 data description 2 File 2 program 3 data description 3 File 3 Ch.1/2 Database Management Application program 1 (with data semantics) Application program 2 (with data semantics) DBMS description manipulation control database Application program 3 (with data semantics) Ch.1/3 Motivation Database Technology Computer Networks integration distribution Distributed Database Systems integration integration ≠ centralization Ch.1/4 Distributed Computing • A number of autonomous processing elements (not necessarily homogeneous) that are interconnected by a computer network and that cooperate in performing their assigned tasks. • What is being distributed? – – – – Processing logic Function Data Control Ch.1/5 What is a Distributed Database System? A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system (D–DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users. Distributed database system (DDBS) = DDB + D–DBMS Ch.1/6 What is not a DDBS? • A timesharing computer system • A loosely (separate primary memory and shared secondary memory) or tightly coupled (shared memory) multiprocessor system • A database system which resides at one of the nodes of a network of computers - this is a centralized database on a network node Ch.1/7 Centralized DBMS on a Network Site 1 Site 2 Site 5 Communication Network Site 4 Site 3 Ch.1/8 Distributed DBMS Environment Site 1 Site 2 Site 5 Communication Network Site 4 Site 3 Ch.1/9 Implicit Assumptions • Data stored at a number of sites each site logically consists of a single processor. • Processors at different sites are interconnected by a computer network not a multiprocessor system – Parallel database systems • Distributed database is a database, not a collection of files data logically related as exhibited in the users’ access patterns – Relational data model • D-DBMS is a full-fledged DBMS – Not remote file system, not a TP system Ch.1/10 Data Delivery Alternatives • We characterize the data delivery alternatives along three orthogonal dimensions: • Delivery modes • Frequency • Communication Methods • Note: not all combinations make sense Ch.1/11 Data delivery • Delivery modes – Pull-only {the transfer of data from servers to clients is initiated by a client pull} • Push-only {the transfer of data from servers to clients is initiated by a server push in the absence of any specific request from clients. periodic, irregular, or conditional} – Hybrid (mix of pull and push) • Frequency • Periodic (A client request for IBM’s stock price every week is an example of a periodic pull.) • Conditional (An application that sends out stock prices only when they change is an example of conditional push.) – Ad-hoc or irregular • Communication Methods {Unicast, One-to-many} Ch.1/12 Ch.1/12 Distributed DBMS Promises Transparent management of distributed, fragmented, and replicated data Improved reliability/availability through distributed transactions Improved performance Easier and more economical system expansion Ch.x/13 Ch.1/13 Transparency • Transparency is the separation of the higher level semantics of a system from the lower level implementation issues. • Fundamental issue is to provide data independence in the distributed environment – Network (distribution) transparency – Replication transparency – Fragmentation transparency • horizontal fragmentation: selection • vertical fragmentation: projection • hybrid Ch.x/14 Ch.1/14 Example SELECT ENAME,SAL FROM EMP,ASG,PAY WHERE DUR > 12 AND EMP.ENO = ASG.ENO AND PAY.TITLE = EMP.TITLE Ch.1/15 Transparent Access Tokyo SELECT FROM WHERE AND AND Paris ENAME,SAL Boston EMP,ASG,PAY DUR > 12 EMP.ENO = ASG.ENO PAY.TITLE = EMP.TITLE Communication Network Paris projects Paris employees Paris assignments Boston employees Boston projects Boston employees Boston assignments Montreal New York Boston projects New York employees New York projects New York assignments Montreal projects Paris projects New York projects with budget > 200000 Montreal employees Montreal assignments Ch.1/16 Distributed Database - User View Distributed Database Ch.1/17 Distributed DBMS - Reality User Query DBMS Software DBMS Software DBMS Software User Application DBMS Software Communication Subsystem User Query User Application DBMS Software User Query Ch.1/18 Types of Transparency • Data independence {It refers to the immunity of user applications to changes in the definition and organization of data, and vice versa. • Logical data independence and physical data independence} • Network transparency (or distribution transparency) – Location transparency – Fragmentation transparency • Replication transparency • Fragmentation transparency Ch.1/19 Who Should Provide Transparency? • Nevertheless, the level of transparency is inevitably a compromise between ease of use and the difficulty and overhead cost of providing high levels of transparency. • Gray argues that full transparency makes the management of distributed data very difficult and claims that “applications coded with transparent access to geographically distributed databases have: poor manageability, poor modularity, and poor message performance”. • He proposes a remote procedure call mechanism between the requestor • users and the server DBMSs whereby the users would direct their queries to a specific DBMS. • Application level {code of application, little transperancy} • Operating system {device drivers within the operating system} • DBMS Ch.1/20 Ch.1/20 Reliability Through Transactions • Replicated components and data should make distributed DBMS more reliable. {eliminate single points of failure} • Distributed transactions provide • Concurrency transparency {sequence of database operations executed as an atomic action. consistent db transformed to another consistent db state} – Failure atomicity {update salary by 10%} Ch.1/21 Potentially Improved Performance • Proximity of data to its points of use {data localization} – Requires some support for fragmentation and replication 1. Since each site handles only a portion of the database, contention for CPU and I/O services is not as severe as for centralized databases. 2. Localization reduces remote access delays Ch.1/22 • Parallelism Requirements • read only queries Have as much of the data required by each application at the site where the application executes – Full replication • How about updates? Ch.1/23 System Expansion • Issue is database scaling • Emergence of microprocessor and workstation technologies – Demise of Grosh's law – Client-server model of computing Ch.1/24 Complications Introduced by Distribution • data may be replicated, the distributed database system is responsible for (1) choosing one of the stored copies of the requested data for access in case of retrievals, and (2) making sure that the effect of an update is reflected on each and every copy of that data item. • if some sites fail or communication fail, DBMS will ensure update for fail site as soon as Ch.1/25 Ch.1/25 Distributed DBMS Issues • Distributed Database Design {chapter 3} – How to distribute the database {portioned and replicated} – Replicated (partial dupliacated or fully duplicated) & non-replicated database distribution – Fragmentation – {research area to minimize cost of storing, processing transactions and communication is NP hard. Proposed solution are based on heuristics} Ch.1/26 Distributed DBMS Issues • Query Processing {chapter 6-8} – Convert user transactions to data manipulation instructions – Optimization problem • min{cost = data transmission + local processing} – General formulation is NP-hard • Concurrency Control {chapter 11} – Synchronization of concurrent accesses – The condition that requires all the values of Ch.1/27 • Distributed deadlock management • The deadlock problem in DDBSs is similar in nature to that encountered in operating systems. The competition among users for access to a set of resources (data, in this case) can result in a deadlock if the synchronization mechanism is based on locking. • The well-known alternatives of prevention, avoidance, and detection/recovery also apply to DDBSs. • Reliability and availability – How to make the system resilient to failures – Atomicity and durability Ch.1/28 Ch.1/28 Relationship Between Issues Directory The same information (i.e., fragment structure Management and placement) is used by the query processor toamong determine the query evaluation There is a strong the concurrency control The replication of relationship Finally, the needproblem, for replication protocols strategy. arise if the deadlock fragments when data distribution involves replicas. management problem, and reliability issues. This is to be expected, they are Query As indicated above, Distribution there is a strong Reliability since together distributed affects Processing Design relationship between replication management protocols andproblem. The they usually called the transaction theare concurrency The design of deal concurrency control techniques, since both concurrency control strategies On the other hand, the accessaffects many areas. It affects distributed databases with the consistency of data, but from control algorithm that is employed will determine whether or not a that might be and usage patterns thatdirectory are determined by the management, because the different perspectives. separate deadlock employed. query processor are used as inputs Concurrency definition of fragments and to their placement determine the management facility and is required. If a locking-based algorithm is Control the data distribution fragmentation contents of the directory used, deadlocks will algorithms. Similarly, directory placement (or directories) as well as the strategies that may be whereas theythe willprocessing not if timestamping is the chosen andoccur, contents influence of employed to manage them. Deadlock queries. alternative. Management Ch.1/29 Architecture • Defines the structure of the system – components identified – functions of each component defined – interrelationships and interactions between components defined Ch.1/30 ANSI/SPARC Architecture Users External Schema External view External view Conceptual Schema Conceptual view Internal Schema Internal view External view Ch.1/31 Differences between Three Levels of ANSI-SPARC Architecture © Pearson Education Limited 1995, 2005 32 Ch.1/32 Data Independence • Logical Data Independence – Refers to immunity of external schemas to changes in conceptual schema. – Conceptual schema changes (e.g. addition/removal of entities). – Should not require changes to external schema or rewrites of application programs. © Pearson Education Limited 1995, 2005 33 Ch.1/33 Data Independence • Physical Data Independence – Refers to immunity of conceptual schema to changes in the internal schema. – Internal schema changes (e.g. using different file organizations, storage structures/devices). – Should not require change to conceptual or external schemas. © Pearson Education Limited 1995, 2005 34 Ch.1/34 Data Independence and the ANSISPARC Three-Level Architecture © Pearson Education Limited 1995, 2005 35 Ch.1/35 Generic DBMS Architecture The interface layer manages the interface to the applications. View management consists of translating the user query from decomposes the query into a tree of external data to algebra operations and controls tries to find the The control layer the conceptual data. “optimal” query by adding semantic integrity ordering of the operations. predicates andThe result is storedThe in an access plan. The query processing (oroutput authorization predicates. thismaps the query compilation) of layer layer is a query expressed into an optimizedin lowerlevel code (algebra operations). sequence of lower-level Finally, theoperations. consistency layer The execution layer directs the manages concurrency control execution of thelayer access plans, The data access manages and loggingtransaction for update theincluding data structures that requests. This layer restart) allows and management (commit, implement the files, indices, transaction, system,ofand media algebra etc.synchronization It also manages the buffers recovery after failure. by caching operations the most frequently accessed data. Ch.1/36 DBMS Implementation Alternatives Ch.1/37 Autonomy Degree to which member databases can operate independently Autonomy is a function of a number of factors such as whether the component systems (i.e., individual DBMSs) exchange information, whether they can independently execute transactions, and whether one is allowed to modify them. 1. The local operations of the individual DBMSs are not affected by their participation in the distributed system. 2. The manner in which the individual DBMSs process queries and optimize them should not be affected by the execution of global queries that access multiple databases. 3. System consistency or operation should not be compromised when individual DBMSs join or leave the distributed system. 38 Ch.1/38 Dimensions of the Problem • Distribution – Whether the components of the system are located on the same machine or not • Heterogeneity – Various levels (hardware, communications, operating system) – DBMS important one • data model, query language, transaction management algorithms • Autonomy – Not well understood and most troublesome – Various versions • Design autonomy: Ability of a component DBMS to decide on issues related to its own design. • Communication autonomy: Ability of a component DBMS to decide whether and how to communicate with other DBMSs. • Execution autonomy: Ability of a component DBMS to execute local operations in any manner it wants to. Ch.1/39 In Figure 1.10, we have identified three alternative architectures that are the focus of this book and that we discus in more detail in the next three subsections: (A0, D1, H0) that corresponds to client/server distributed DBMSs, (A0, D2, H0) that is a peer-to-peer distributed DBMS and (A2, D2, H1) which represents a (peer-topeer) distributed, heterogeneous multidatabase system. Note that we discuss the heterogeneity issues within the context of one system architecture, although the issue arises in other models as well. Ch.1/40 Ch.1/40 Client/Server Architecture Ch.1/41 Advantages of Client-Server Architectures • More efficient division of labor • Horizontal and vertical scaling of resources • Better price/performance on client machines • Ability to use familiar tools on client machines • Client access to remote data (via standards) • Full DBMS functionality provided to client workstations • Overall better system price/performance Ch.1/42 Database Server Ch.1/43 Distributed Database Servers Ch.1/44 Datalogical Distributed DBMS Finally, user applications and user Architecture Data independence is supported access to the database is supported ES1 ES2 ... ESn GCS LCS1 LCS2 ... LCSn LIS1 LIS2 ... LISn since the model is an To handle data fragmentation and by We first note that the physical data extension ofthe ANSI/SPARC, which asof replication, logical organization external schemas (ESs), defined organization on each machine may provides such independence data being above the be,global and conceptual naturally. Location at each site needs to be described. schema. probably is, different. This means and replication transparencies Therefore, there needs to be are a third that there needs to be an individual supported by layer the definition in the of the internal schema local and architecture, theglobal local conceptual definition at each site, which we call conceptual schemas the schema (LCS). In the and architectural the local internal schema (LIS). mapping in between. Network model we have The enterprise transparency, on theconceptual chosen, then, the global view of the data is described by the other hand,isisthe supported schema union of by thethe local global conceptual schema (GCS), definition of the global conceptual conceptual which is global schema. schemas. because it describes the logical structure of the data at all the sites. Ch.1/45 Peer-to-Peer Component Architecture Runtime Support Processor Local Recovery Manager Local Query Processor Global Execution Monitor Global Query Optimizer Semantic Data Controller User Interface Handler The semantic data controller uses the PROCESSOR DATA PROCESSOR ItUSER is important to note, atintegrity this point, that our use the constraints and of authorizations The second majorterms “user processor” System that are defined as part ofLocal the global Global Local External Log component of aprocessor” distributed Conceptual does Conceptual Internal andSchema “data not imply a functional division conceptual schema to check if the user GD/D Schema Schema Schema DBMS isThe the local data processor similar to client/server query optimizer, query can be processed. User The userwhich interface handler is andThese divisions are merely organizational and Database actually acts as the access path requests systems. responsible for interpreting user consists of three elements: there is no suggestion that selector, commands as they should placed on different machines. is recovery responsible forbe choosing the best USER The local manager is they come in, and formatting the access to sure access any data item responsible for path5 making that result data as it is sent to the user. The run-time support processor physically accesses the an the localquery The global optimizer and decomposer determines System database according to the physical commands in the database remains consistent The detailed components of a distributed DBMS execution strategy to minimize a cost function, and translates responsesThe distributed execution monitor coordinates the distributed the schedule generated by the querythe optimizer. Thelocal run-time even when failures occur are shown. One global queries into local ones using global and conceptual execution of the user request. The execution monitor is also support processor is as thewell interface toglobal thewith operating system component handles the interaction users, and schemas as the directory. called the distributed transaction manager. In executing and contains the database buffer (or cache) manager, another deals with the storage. The queries in a distributed fashion, the execution monitors whichfirst is responsible for maintaining the main memory major component, which we call the user at various sites may, and usually do, communicate with one buffers and managing the dataelements: accesses. processor, consists of four another. Ch.1/46 Multidatabase systems (MDBS) • Multidatabase systems (MDBS) represent the case where individual DBMSs (whether distributed or not) are fully autonomous and have no concept of cooperation; they may not even “know” of each other’s existence or how to talk to each other. 47 Ch.1/47 Datalogical Multi-DBMS Architecture LES11 … GES1 GES2 LES1n GCS ... GESn LESn1 … LCS1 LCS2 … LCSn LIS1 LIS2 … LISn LESnm Ch.1/48 MDBS Components & Execution Global User Request Local User Request Local User Request Multi-DBMS Layer Global Subrequest DBMS1 Global Subrequest DBMS2 Global Subrequest DBMS3 Ch.1/49 Mediator/Wrapper Architecture Ch.1/50