Database Administration: The Complete Guide to Practices and Procedures
Chapter 19: Data Movement and Distribution

Agenda
• Data Movement Methods:
  – Loading and Unloading Data
  – EXPORT and IMPORT
  – Bulk Data Movement
• Distributed Databases
• Questions

Loading and Unloading Data
• One of the simplest ways for the DBA to move data from one place to another is to use the LOAD and UNLOAD utilities that come with the DBMS.
  – The LOAD utility is used to populate tables with new data.
  – The UNLOAD utility is used to read data from a table and put it into a data file.
• Each DBMS may call the actual utilities by different names, but the functionality is the same or similar from product to product.

The LOAD Utility
• A LOAD utility is used to perform bulk inserts of data into database tables. It typically supports
  – adding rows to a table, retaining the current data; or
  – replacing all existing rows with the new data.

The UNLOAD Utility
• The purpose of the UNLOAD utility is to read data from a database and write it to an output data file.
  – Without an UNLOAD utility, database users are forced to use SQL SELECT statements issued by an interactive SQL facility, report writer, or application program in order to unload data.
  – However, these methods are error-prone and slow for large quantities of data.
  – Requiring a developer to code a program to create a file is inflexible and time-consuming.
• UNLOAD is flexible and quick.

EXPORT and IMPORT
• Similar to an UNLOAD utility, the EXPORT utility reads data from a table and places it into an external file.
• The IMPORT utility reads an external file created by the EXPORT utility and inserts the data into a table.
• IMPORT and EXPORT facilities typically work with more than just the data, though.
  – The EXPORT data file usually contains the schema for the table along with the data.
  – Sometimes the EXPORT file contains more than just a single table.
  – Some EXPORT facilities enable the DBA to specify a single table, and then follow the relationships for that table to extract all of the related files and data.
  – Some IMPORT/EXPORT facilities provide UNLOAD-like features to sample, subset, and limit the data that is exported (and imported).
• Not every DBMS offers IMPORT and EXPORT utilities.

Bulk Data Movement
• There are other methods for moving large quantities of data:
  – ETL software (extract, transform, and load)
  – Replication and propagation
  – Messaging software
  – Others

Bulk Data Movement
• ETL Software
  – ETL is a type of software that performs data movement.
  – ETL stands for extract, transform, and load.
  – ETL software is primarily used to populate data warehouses and data marts from other databases and data sources.

Bulk Data Movement
• Replication
• Another method of moving data is through replication and propagation.
• When data is replicated, one data store is copied to one or more data stores, either locally or at other locations.
• Replication can be implemented simply by copying entire tables to multiple locations.
• Alternatively, replicated data can be a subset of the rows and/or columns.
• Replication can be set up to automatically refresh the copied data on a regular basis.

Bulk Data Movement
• Propagation
• Propagation, on the other hand, is the migration of only changed data.
• Propagation can be implemented by scanning the transaction log and applying the results of data modification statements to another data store.
• Initial population of a data warehouse can be achieved by replication, and subsequent population of changes by either replication (if the data is very dynamic) or propagation.

Bulk Data Movement
• Messaging Software
• Messaging software, also known as message queueing software or application integration, is another popular form of data movement.
• When using a message queue, data is placed onto the queue by one application or process; the data is read from the queue by another application or process.

Bulk Data Movement
• Other Methods
• Of course, many other methods exist for moving data,
  – from the simple, such as using a table editing tool to highlight and copy data,
  – to the complex, such as writing programs to read the database and write to external files or directly to other databases.

Distributed Databases
• Sometimes simply moving data from one location to another is not sufficient.
• Instead, the data needs to be stored at, and accessible from, various locations throughout an organization.
• A distributed database permits data to reside at different physical locations in different databases, perhaps using different DBMS software on different operating systems.

Example
• Consider an organization with retail outlets that implements a distributed database system.
• Each retail outlet would have a database, and the headquarters would house a central database. With networking technology and the distributed capabilities of the DBMS, data could be accessed and modified at any location from any location. Furthermore, you could specify which locations have update, or even read, access to certain databases.

Definitions
• Distributed Database: A single logical database that is spread physically across computers in multiple locations that are connected by a data communications link.
• Decentralized Database: A collection of independent databases at multiple locations on non-networked computers, with no database software that makes the data appear to be in one logical database.
They are NOT the same thing!
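
The message-queue style of data movement described under Bulk Data Movement (one process places data onto the queue, another reads it) can be illustrated with a minimal sketch. This is only an in-process stand-in built on Python's standard `queue` and `threading` modules, not real messaging middleware; the names `producer`, `consumer`, `move_rows`, and `SENTINEL` are hypothetical and chosen for illustration.

```python
import queue
import threading

# Hypothetical end-of-stream marker; a unique object cannot collide with row data.
SENTINEL = object()


def producer(q, rows):
    """Place each source row onto the queue, then signal that no more rows follow."""
    for row in rows:
        q.put(row)
    q.put(SENTINEL)


def consumer(q, target):
    """Read rows from the queue and append them to the target store until the marker arrives."""
    while True:
        row = q.get()
        if row is SENTINEL:
            break
        target.append(row)


def move_rows(rows):
    """Move rows from a source list to a new target list through a queue."""
    q = queue.Queue()
    target = []
    writer = threading.Thread(target=producer, args=(q, rows))
    reader = threading.Thread(target=consumer, args=(q, target))
    writer.start()
    reader.start()
    writer.join()
    reader.join()
    return target


if __name__ == "__main__":
    print(move_rows([(1, "Detroit"), (2, "Chicago")]))
```

The two sides never call each other directly; they share only the queue, which is the property that lets the sending and receiving applications run independently.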
Reasons for Distributed Database
• Business unit autonomy and distribution
• Data sharing
• Data communication costs
• Data communication reliability and costs
• Multiple application vendors
• Database recovery
• Transaction and analytic processing

Distributed database environments

Distributed Database Options
• Homogeneous - Same DBMS at each node
  – Autonomous - Independent DBMSs, passing messages back and forth to share data updates
  – Non-autonomous - Central, coordinating DBMS
  – Easy to manage, difficult to enforce
• Heterogeneous - Different DBMSs at different nodes
  – Systems - With full or partial DBMS functionality
  – Gateways - Simple paths are created to other databases without the benefits of one logical database
  – Difficult to manage, preferred by independent organizations

Homogeneous, Non-Autonomous Database
• Data is distributed across all the nodes
• Same DBMS at each node
• All data is managed by the distributed DBMS (no exclusively local data)
• All access is through one, global schema
• The global schema is the union of all the local schemas

Homogeneous Database
Identical DBMSs
Source: adapted from Bell and Grimson, 1992.

Typical Heterogeneous Environment
• Data distributed across all the nodes
• Different DBMSs may be used at each node
• Local access is done using the local DBMS and schema
• Remote access is done using the global schema

Figure 13-3: Typical Heterogeneous Environment
Non-identical DBMSs
Source: adapted from Bell and Grimson, 1992.
Major Objectives
• Location Transparency
  – User does not have to know the location of the data
  – Data requests automatically forwarded to appropriate sites
• Local Autonomy
  – Local site can operate with its database when network connections fail
  – Each site controls its own data, security, logging, recovery

Advantages of Distributed Database over Centralized Databases
• Increased reliability/availability
• Local control over data
• Modular growth
• Lower communication costs
• Faster response for certain queries

Disadvantages of Distributed Database Compared to Centralized Databases
• Software cost and complexity
• Processing overhead
• Data integrity exposure
• Slower response for certain queries

Options for Distributing a Database
• Data replication
  – Copies of data distributed to different sites
• Horizontal partitioning
  – Different rows of a table distributed to different sites
• Vertical partitioning
  – Different columns of a table distributed to different sites
• Combinations of the above

Data Replication
• Advantages:
  – Reliability (if a node fails, the information can be found at another node)
  – Fast response (for locations that have a full copy)
  – May avoid complicated distributed transaction integrity routines (if replicated data is refreshed at scheduled intervals)
  – Decouples nodes (transactions proceed even if some nodes are down)
  – Reduced network traffic at prime time (if updates can be delayed)

Data Replication (cont.)
• Disadvantages:
  – Additional requirements for storage space
  – Additional time for update operations
  – Complexity and cost of updating
  – Integrity exposure of getting incorrect data if replicated data is not updated simultaneously
Therefore, better when used for non-volatile (read-only) data

Horizontal & Vertical Partitioning
[Figure: horizontal partitioning and vertical partitioning]

Horizontal Partitioning
• Different rows of a table at different sites
• Advantages
  – Data stored close to where it is used → efficiency
  – Local access optimization → better performance
  – Only relevant data is available → security
  – Unions across partitions → ease of query
• Disadvantages
  – Accessing data across partitions → inconsistent access speed
  – No data replication → backup vulnerability

Vertical Partitioning
• Different columns of a table at different sites
• Advantages and disadvantages are the same as for horizontal partitioning, except that combining data across partitions is more difficult because it requires joins (instead of unions)

Figure 13-10: Distributed DBMS architecture

Local Transaction Steps
1. Application makes request to distributed DBMS
2. Distributed DBMS checks distributed data repository for location of data and finds that it is local. (Each site has a copy of the distributed DBMS and the associated distributed data dictionary/directory (DD/D). The distributed DD/D contains the location of all data in the network, as well as data definitions.)
3. Distributed DBMS sends request to local DBMS
4. Local DBMS processes request
5. Local DBMS sends results to application

Figure 13-10: Distributed DBMS architecture (cont.) (showing local transaction steps)
Local transaction – all data stored locally

Global Transaction Steps
1. Application makes request to distributed DBMS
2. Distributed DBMS checks distributed data repository for location of data and finds that it is remote
3. Distributed DBMS routes request to remote site
4. Distributed DBMS at remote site translates request for its local DBMS if necessary, and sends request to local DBMS
5. Local DBMS at remote site processes request
6. Local DBMS at remote site sends results to distributed DBMS at remote site
7. Remote distributed DBMS sends results back to originating site
8. Distributed DBMS at originating site sends results to application

Figure 13-10: Distributed DBMS architecture (cont.) (showing global transaction steps)
Global transaction – some data is at remote site(s)

Distributed DBMS Transparency Objectives
1. Location Transparency - User/application does not need to know where data resides.
• Suppose that a marketing manager in San Mateo, California, wanted a list of all company customers whose total purchases exceed $100,000.
• From a terminal in San Mateo, with location transparency, the manager could enter the following request:
  SELECT * FROM CUSTOMER WHERE TOTAL_SALES > 100000;

Distributed DBMS Transparency Objectives
• Notice that this SQL request does not require the user to know where the data are physically stored. The distributed DBMS at the local site (San Mateo) will consult the distributed DD/D and determine that this request must be routed to New York.
• When the selected data are transmitted and displayed in San Mateo, it appears to the user at that site that the data were retrieved locally.

Distributed DBMS Transparency Objectives
2. Replication Transparency
• Replication transparency (sometimes called fragmentation transparency) – User/application does not need to know about duplication
• If a read request originates at a site that does not contain the requested data, that request will have to be routed to another site. In this case, the distributed DBMS should select the remote site that will provide the fastest response.
  The choice of site will probably depend on current conditions in the network (such as availability of communications lines).
• Thus, the distributed DBMS should dynamically select an optimum route. Again, with replication transparency, the requesting user need not be aware that this is a global (rather than local) transaction.

Distributed DBMS Transparency Objectives
3. Failure Transparency
• For a system to be robust, it must be able to detect a failure, reconfigure the system so that computation may continue, and recover when a processor or link is repaired.
• Error detection and system reconfiguration are probably the functions of the communications controller or processor, rather than the DBMS. However, the distributed DBMS is responsible for database recovery when a failure has occurred.
• The distributed DBMS at each site has a component called the transaction manager that performs the following functions:
• Logs transactions and before and after images
• Concurrency control scheme to ensure data integrity
  – Requires special commit protocol
  – With failure transparency - Either all or none of the actions of a transaction are committed

Two-Phase Commit
• Prepare Phase
  – Coordinator receives a commit request
  – Coordinator instructs all resource managers to get ready to “go either way” on the transaction.
  – Each resource manager writes all updates from that transaction to its own physical log
  – Coordinator receives replies from all resource managers.
  – If all are ok, it writes commit to its own log; if not, then it writes rollback to its log

Two-Phase Commit (cont.)
• Commit Phase
  – Coordinator then informs each resource manager of its decision and broadcasts a message to either commit or rollback (abort).
    If the message is commit, then each resource manager transfers the update from its log to its database
  – A failure during the commit phase puts a transaction “in limbo.” This has to be tested for and handled with timeouts or polling

Concurrency Control
Concurrency Transparency – Design goal for distributed database
• Timestamping
• With this approach, every transaction is given a globally unique time stamp, which generally consists of the clock time when the transaction occurred and the site ID.
• Time-stamping ensures that even if two events occur simultaneously at different sites, each will have a unique time stamp.
• The purpose of time-stamping is to ensure that transactions are processed in serial order, thereby avoiding the use of locks (and the possibility of deadlocks).
• Concurrency control mechanism
  – Alternative to locks in distributed databases

Query Optimization
• A major decision for the DBMS is how to process a query, which is affected by both the way a user formulates a query and the intelligence of the distributed DBMS to develop a sensible plan for processing.

Query Optimization
• A simplified procurement (relational) database has the following three relations:
  – SUPPLIER (SUPPLIER_NUMBER, CITY): 10,000 records, stored in Detroit
  – PART (PART_NUMBER, COLOR): 100,000 records, stored in Chicago
  – SHIPMENT (SUPPLIER_NUMBER, PART_NUMBER): 1,000,000 records, stored in Detroit
• A query is made (in SQL) to list the supplier numbers for Cleveland suppliers of red parts:
  SELECT SUPPLIER.SUPPLIER_NUMBER
  FROM SUPPLIER, SHIPMENT, PART
  WHERE SUPPLIER.CITY = 'Cleveland'
    AND SUPPLIER.SUPPLIER_NUMBER = SHIPMENT.SUPPLIER_NUMBER
    AND SHIPMENT.PART_NUMBER = PART.PART_NUMBER
    AND PART.COLOR = 'Red';

Query Optimization
• Each record in each relation is 100 characters long, there are ten red parts, a history of 100,000 shipments from Cleveland, and a negligible query computation time compared with communication time.
• Also, there is a communication system with a data transmission rate of 10,000 characters per second and a one-second access delay to send a message from one node to another.

Query Optimization
• Date identifies six plausible query-processing strategies for this situation and develops the associated communication times; these strategies and times are summarized in the table.

Query Optimization
• A distributed DBMS typically uses the following three steps to develop a query processing plan:
  1. Query decomposition - rewritten and simplified
  2. Data localization - query fragmented so that fragments reference data at only one site
  3. Global optimization
     • Order in which to execute query fragments
     • Data movement between sites
     • Where parts of the query will be executed

Query Optimization
• In a query involving a multi-site join and, possibly, a distributed database with replicated files, the distributed DBMS must decide where to access the data and how to proceed with the join.
• One technique used to make processing a distributed query more efficient is to use what is called a semijoin operation.
• Semijoin: A joining operation used with distributed databases in which only the joining attribute from one site is transmitted to the other site, rather than all the selected attributes from every qualified row.

Query Optimization
• In a semijoin, only the joining attribute is sent from one site to another, and then only the required rows are returned. If only a small percentage of the rows participate in the join, then the amount of data being transferred is minimal.

Query Optimization
• For example, consider the distributed database in the figure:

Query Optimization
• Suppose that a query at site 1 asks to display the Cust_Name, SIC, and Order_Date (fields) for all customers in a particular Zip_Code range and with an Order_Amount above a specified limit (conditions).
• Assume that 10% of the customers fall in the Zip_Code range and 2% of the orders are above the amount limit.
• A semijoin would work as follows:
  1. A query is executed at site 1 to create a list of the Cust_No values in the desired Zip_Code range. So 10,000 customers * 0.1, or 1,000 rows, satisfy the Zip_Code qualification. Thus, 1,000 rows of 10 bytes each for the Cust_No attribute (the joining attribute), or 10,000 bytes, will be sent to site 2.
  2. A query is executed at site 2 to create a list of the Cust_No and Order_Date values to be sent back to site 1 to compose the final result. If we assume roughly the same number of orders for each customer, then 40,000 rows of the Order table (10% of its 400,000 rows) will match the customer numbers sent from site 1. If we assume any customer order is equally likely to be above the amount limit, then 800 rows (40,000 * 0.02) of the Order table are relevant to this query. For each row, the Cust_No and Order_Date need to be sent to site 1, or 14 bytes * 800 rows, thus 11,200 bytes.

Query Optimization
• The total data transferred is only 21,200 bytes using the semijoin just described: site 1 to site 2 (10,000 bytes) + site 2 to site 1 (11,200 bytes) = 21,200 bytes.
• Compare this total to simply sending the subset of each table needed at one site to the other site:
  – To send data from site 1 to site 2 would require sending the Cust_No, Cust_Name, and SIC (65 bytes) for 10,000 * 0.1 = 1,000 rows of the Customer table (65,000 bytes) to site 2.
  – To send data from site 2 to site 1 would require sending Cust_No and Order_Date (14 bytes) for 400,000 * 0.02 = 8,000 rows of the Order table (112,000 bytes).

Query Optimization
• A distributed DBMS uses a cost model to predict the execution time (for data processing and transmission) of alternative execution plans.
• The cost model is evaluated before the query is executed, based on general network conditions.

Distributed Performance Problems
• The biggest threat to performance is network traffic.
• Performance in a distributed environment is defined by throughput and response time.
• The server views performance primarily in terms of throughput.
• The requester, however, views performance more in terms of response time.

Questions
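
As a check on the semijoin example in the Query Optimization slides, the byte counts can be recomputed with a short sketch. The row counts, selectivities, and field widths are the ones stated in that example; the constant and function names are hypothetical, and the code only reproduces the arithmetic, not a query optimizer.

```python
# Figures taken from the semijoin example; all names here are illustrative only.
CUSTOMER_ROWS = 10_000        # Customer table at site 1
ORDER_ROWS = 400_000          # Order table at site 2
ZIP_PERCENT = 10              # customers in the Zip_Code range
AMOUNT_PERCENT = 2            # orders above the amount limit
CUST_NO_BYTES = 10            # joining attribute sent to site 2
ORDER_REPLY_BYTES = 14        # Cust_No + Order_Date sent back to site 1
CUSTOMER_SUBSET_BYTES = 65    # Cust_No + Cust_Name + SIC


def semijoin_bytes():
    """Return (bytes site 1 -> site 2, bytes site 2 -> site 1, total) for the semijoin."""
    customers_in_range = CUSTOMER_ROWS * ZIP_PERCENT // 100
    to_site_2 = customers_in_range * CUST_NO_BYTES
    # Assumes orders are spread evenly over customers, as the example does.
    matching_orders = ORDER_ROWS * ZIP_PERCENT // 100
    qualifying_orders = matching_orders * AMOUNT_PERCENT // 100
    to_site_1 = qualifying_orders * ORDER_REPLY_BYTES
    return to_site_2, to_site_1, to_site_2 + to_site_1


def ship_customer_subset_bytes():
    """Bytes needed to send the qualifying Customer rows from site 1 to site 2."""
    return (CUSTOMER_ROWS * ZIP_PERCENT // 100) * CUSTOMER_SUBSET_BYTES


def ship_order_subset_bytes():
    """Bytes needed to send the qualifying Order rows from site 2 to site 1."""
    return (ORDER_ROWS * AMOUNT_PERCENT // 100) * ORDER_REPLY_BYTES


if __name__ == "__main__":
    print("semijoin:", semijoin_bytes())
    print("ship Customer subset:", ship_customer_subset_bytes())
    print("ship Order subset:", ship_order_subset_bytes())
```

Under these assumptions the semijoin moves 21,200 bytes, against 65,000 or 112,000 bytes for shipping either table subset whole.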