Database Administration:
The Complete Guide to Practices and Procedures
Chapter 19
Data Movement and Distribution
Agenda
• Data Movement Methods:
– Loading and Unloading Data
– EXPORT and IMPORT
– Bulk Data Movement
• Distributed Databases
• Questions
Loading and Unloading Data
• One of the simplest ways for the DBA to move
data from one place to another is to use the
LOAD and UNLOAD utilities that come with the
DBMS.
– The LOAD utility is used to populate tables with new
data.
– The UNLOAD utility is used to read data from a table
and put it into a data file.
• Each DBMS may call the actual utilities by
different names, but the functionality is the same
or similar from product to product.
The LOAD Utility
• A LOAD utility is used to perform bulk inserts
of data into database tables. It typically can
support
– Adding rows to a table, retaining the current data;
– Or replacing all existing rows with the new data.
The UNLOAD Utility
• The purpose of the UNLOAD utility is to read data from
a database and write it to an output data file.
– Without an UNLOAD utility, database users are forced to
use SQL SELECT statements issued by an interactive SQL
facility, report writer, or application program in order to
unload data.
– However, these methods are error-prone and slow for
large quantities of data.
– Requiring a developer to code a program to create a file is inflexible and time-consuming.
• UNLOAD is flexible and quick.
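To make the load/unload pattern concrete, the following Python sketch imitates both utilities against a SQLite database; the database file, the customer table, and the CSV file names are assumptions made for this example, not any vendor's actual LOAD/UNLOAD syntax.

import csv
import sqlite3

conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE IF NOT EXISTS customer (cust_no INTEGER, name TEXT)")

# "LOAD": bulk-insert rows from an input data file, retaining the current data.
with open("customer_in.csv", newline="") as f:
    rows = [(int(r[0]), r[1]) for r in csv.reader(f)]
conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
conn.commit()

# "UNLOAD": read the data from the table and put it into an output data file.
with open("customer_out.csv", "w", newline="") as f:
    csv.writer(f).writerows(conn.execute("SELECT cust_no, name FROM customer"))

A replace-style load would simply clear the table (DELETE FROM customer) before the bulk insert.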
EXPORT and IMPORT
• Similar to an UNLOAD utility, the EXPORT utility reads data from a
table and places it into an external file.
• The IMPORT utility will read an external file created by the EXPORT
utility and insert the data into a table.
• IMPORT and EXPORT facilities typically work with more than just
the data, though.
– The EXPORT data file usually contains the schema for the table along
with the data.
– Sometimes the EXPORT file contains more than just a single table.
– Some EXPORT facilities enable the DBA to specify a single table, and
then follow the relationships for that table to extract all of the related
tables and data.
– Some IMPORT/EXPORT facilities provide UNLOAD-like features to
sample, subset, and limit the data that is exported (and imported).
• Not every DBMS offers IMPORT and EXPORT utilities.
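As a rough illustration of an export that carries the schema along with the data, the sketch below uses SQLite's built-in dump facility; the source and target database files and the export file name are assumptions for the example.

import sqlite3

# EXPORT: write the schema (CREATE statements) and the data (INSERT statements)
# for the whole database to an external file.
src = sqlite3.connect("source.db")
with open("export.sql", "w") as f:
    for line in src.iterdump():
        f.write(line + "\n")

# IMPORT: replay the exported file against another database, recreating the
# tables and inserting the data.
dst = sqlite3.connect("target.db")
with open("export.sql") as f:
    dst.executescript(f.read())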
Bulk Data Movement
• There are other methods for moving large
quantities of data:
– ETL Software - extract, transform, and load.
– Replication and Propagation
– Messaging Software
– Others
Bulk Data Movement
• ETL Software
– ETL is a type of software that performs data
movement.
– ETL stands for extract, transform, and load.
– ETL software is primarily used to populate data
warehouses and data marts from other databases
and data sources.
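A minimal sketch of one ETL run, assuming an operational SQLite database with an orders table and a data-mart database with a fact_orders table; the transformation shown (cents to dollars) is purely illustrative.

import sqlite3

# Extract: pull raw rows from the operational source.
src = sqlite3.connect("operational.db")
raw = src.execute("SELECT order_id, amount_cents, order_date FROM orders").fetchall()

# Transform: convert cents to dollars and keep only the columns the mart needs.
transformed = [(oid, cents / 100.0, odate) for (oid, cents, odate) in raw]

# Load: populate the data-mart table in bulk.
dwh = sqlite3.connect("datamart.db")
dwh.execute("CREATE TABLE IF NOT EXISTS fact_orders "
            "(order_id INTEGER, amount_usd REAL, order_date TEXT)")
dwh.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)
dwh.commit()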
Bulk Data Movement
• Replication
• Another method of moving data is through replication
and propagation.
• When data is replicated, one data store is copied to
one or more data stores, either locally or at other
locations.
• Replication can be implemented simply by copying
entire tables to multiple locations.
• Alternatively, replicated data can be a subset of the
rows and/or columns.
• Replication can be set up to automatically refresh the
copied data on a regular basis.
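A minimal sketch of a full-refresh replication step, assuming SQLite files for the central and copied databases and a product table; a real replication facility would run this refresh automatically on a schedule.

import sqlite3

def refresh_replica(source_path, replica_path):
    # Copy the entire product table from the source data store to the replica.
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(replica_path)
    rows = src.execute("SELECT product_id, name, price FROM product").fetchall()
    dst.execute("CREATE TABLE IF NOT EXISTS product "
                "(product_id INTEGER, name TEXT, price REAL)")
    dst.execute("DELETE FROM product")  # full refresh of the copied data
    dst.executemany("INSERT INTO product VALUES (?, ?, ?)", rows)
    dst.commit()

# Run from a scheduler to refresh the copied data on a regular basis.
refresh_replica("central.db", "branch_copy.db")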
Bulk Data Movement
• Propagation
• Propagation, on the other hand, is the migration of only changed data.
Propagation can be implemented by scanning the transaction log and
applying the results of data modification statements to another data store.
Initial population of a data warehouse can be achieved by replication, and
subsequent population of changes by either replication (if the data is very
dynamic) or propagation.
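By contrast, a propagation step moves only the changed data. The sketch below is a stand-in for log scanning: instead of reading the real transaction log, it reads an assumed change_log table and applies each modification to a customer table whose primary key is cust_no; all of those names are invented for the example.

import sqlite3

def propagate_changes(source_path, target_path, last_applied_id):
    # Apply only the modifications recorded after the last run's high-water mark.
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(target_path)
    changes = src.execute(
        "SELECT change_id, cust_no, name FROM change_log WHERE change_id > ?",
        (last_applied_id,)).fetchall()
    for change_id, cust_no, name in changes:
        dst.execute(
            "INSERT INTO customer (cust_no, name) VALUES (?, ?) "
            "ON CONFLICT(cust_no) DO UPDATE SET name = excluded.name",
            (cust_no, name))
        last_applied_id = change_id
    dst.commit()
    return last_applied_id  # remember the high-water mark for the next run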
Bulk Data Movement
• Messaging Software
• Messaging software, also known as message
queueing software or application integration,
is another popular form of data movement.
When using a message queue, data is placed
onto the queue by one application or process;
the data is read from the queue by another
application or process.
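The queue pattern can be sketched in a few lines. The example below uses Python's in-process queue module purely to show the handoff; real message queueing software sits between separate applications or processes.

import queue
import threading

message_queue = queue.Queue()

def producer():
    # One application or process places data onto the queue.
    for cust_no in (101, 102, 103):
        message_queue.put({"cust_no": cust_no, "action": "update"})
    message_queue.put(None)  # sentinel: no more messages

def consumer():
    # Another application or process reads the data from the queue.
    while (msg := message_queue.get()) is not None:
        print("applying change for customer", msg["cust_no"])

threading.Thread(target=producer).start()
consumer()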
Bulk Data Movement
• Other Methods
• Of course, many other methods exist for
moving data—
• from the simple, such as using a table editing
tool to highlight and copy data,
• to the complex, such as writing programs to
read the database and write to external files
or directly to other databases.
Distributed Databases
• Sometimes simply moving data from one
location to another is not sufficient.
• Instead, the data needs to be stored at, and
accessible from, various locations throughout
an organization.
• A distributed database permits data to reside
at different physical locations in different
databases, perhaps using different DBMS
software on different operating systems.
Example
• Consider an organization with retail outlets that
implements a distributed database system.
• Each retail outlet would have a database, and the
headquarters would house a central database.
With networking technology and the distributed
capabilities of the DBMS, data could be accessed
and modified at any location from any location.
Furthermore, you could specify which locations
have update, or even read, access to certain
databases.
Definitions
• Distributed Database: A single logical
database that is spread physically across
computers in multiple locations that are
connected by a data communications link
• Decentralized Database: A collection of
independent databases at multiple locations;
the computers are not connected by the
network and database software that would
make the data appear to be in one logical database
They are NOT the same thing!
Reasons for Distributed Database
• Business unit autonomy and distribution
• Data sharing
• Data communication costs
• Data communication reliability and costs
• Multiple application vendors
• Database recovery
• Transaction and analytic processing
Distributed database environments
Distributed Database Options
• Homogeneous - Same DBMS at each node
– Autonomous - Independent DBMSs, passing messages
back and forth to share data updates.
– Non-autonomous - Central, coordinating DBMS
– Easy to manage, difficult to enforce
• Heterogeneous - Different DBMSs at different nodes
– Systems – With full or partial DBMS functionality
– Gateways - Simple paths are created to other databases
without the benefits of one logical database
– Difficult to manage, preferred by independent
organizations
Homogeneous, Non-Autonomous Database
• Data is distributed across all the nodes
• Same DBMS at each node
• All data is managed by the distributed DBMS (no exclusively local data)
• All access is through one global schema
• The global schema is the union of all the local schemas
Figure: Homogeneous Database (identical DBMSs). Source: adapted from Bell and Grimson, 1992.
Typical Heterogeneous Environment
• Data distributed across all the nodes
• Different DBMSs may be used at each node
• Local access is done using the local DBMS and
schema
• Remote access is done using the global
schema
Figure 13-3: Typical Heterogeneous Environment (non-identical DBMSs). Source: adapted from Bell and Grimson, 1992.
Major Objectives
• Location Transparency
– User does not have to know the location of the data
– Data requests automatically forwarded to appropriate
sites
• Local Autonomy
– Local site can operate with its database when network
connections fail
– Each site controls its own data, security, logging,
recovery
Advantages of Distributed Database over Centralized Databases
• Increased reliability/availability
• Local control over data
• Modular growth
• Lower communication costs
• Faster response for certain queries
Disadvantages of Distributed Database Compared to Centralized Databases
• Software cost and complexity
• Processing overhead
• Data integrity exposure
• Slower response for certain queries
Options for
Distributing a Database
• Data replication
– Copies of data distributed to different sites
• Horizontal partitioning
– Different rows of a table distributed to different sites
• Vertical partitioning
– Different columns of a table distributed to different sites
• Combinations of the above
Data Replication
• Advantages:
– Reliability (if a node fails, the information can be found at another node)
– Fast response (for locations that have a full copy)
– May avoid complicated distributed transaction integrity routines (if replicated data is refreshed at scheduled intervals)
– Decouples nodes (transactions proceed even if some nodes are down)
– Reduced network traffic at prime time (if updates can be delayed)
Data Replication (cont.)
• Disadvantages:
– Additional requirements for storage space
– Additional time for update operations
– Complexity and cost of updating
– Integrity exposure of getting incorrect data if replicated data is not updated simultaneously
• Therefore, replication is better suited for non-volatile (read-only) data
Horizontal & Vertical Partitioning
(Figure: horizontal partitioning vs. vertical partitioning)
Horizontal Partitioning
• Different rows of a table at different sites
• Advantages
– Data stored close to where it is used → efficiency
– Local access optimization → better performance
– Only relevant data is available → security
– Unions across partitions → ease of query
• Disadvantages
– Accessing data across partitions → inconsistent access speed
– No data replication → backup vulnerability
Vertical Partitioning
• Different columns of a table at different sites
• Advantages and disadvantages are the same
as for horizontal partitioning except that
combining data across partitions is more
difficult because it requires joins (instead of
unions)
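A tiny sketch of both schemes using in-memory rows; the customer data and site names are invented. Horizontal partitions recombine with a union of rows, while vertical partitions recombine with a join on the key.

# Horizontal partitioning: different rows of the table at different sites.
site_east = [{"cust_no": 1, "name": "Acme", "region": "EAST"}]
site_west = [{"cust_no": 2, "name": "Zenith", "region": "WEST"}]
full_table = site_east + site_west  # union of the row partitions

# Vertical partitioning: different columns of the table at different sites.
site_a = {1: {"name": "Acme"}, 2: {"name": "Zenith"}}            # descriptive columns
site_b = {1: {"credit_limit": 5000}, 2: {"credit_limit": 9000}}  # financial columns
rejoined = {cust_no: {**site_a[cust_no], **site_b[cust_no]}      # join on cust_no
            for cust_no in site_a}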
Figure 13-10: Distributed DBMS architecture
Local Transaction Steps
1. Application makes request to distributed DBMS
2. Distributed DBMS checks distributed data repository for location of data; finds that it is local
(Each site has a copy of the distributed DBMS and the associated distributed data dictionary/directory (DD/D). The distributed DD/D contains the location of all data in the network, as well as data definitions.)
3. Distributed DBMS sends request to local DBMS
4. Local DBMS processes request
5. Local DBMS sends results to application
Figure 13-10: Distributed DBMS Architecture (cont.)
(showing local transaction steps). Local transaction – all data stored locally.
Global Transaction Steps
1. Application makes request to distributed DBMS
2. Distributed DBMS checks distributed data repository for location of data; finds that it is remote
3. Distributed DBMS routes request to remote site
4. Distributed DBMS at remote site translates request for its local DBMS if necessary, and sends request to local DBMS
5. Local DBMS at remote site processes request
6. Local DBMS at remote site sends results to distributed DBMS at remote site
7. Remote distributed DBMS sends results back to originating site
8. Distributed DBMS at originating site sends results to application
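The routing decision in steps 2 and 3 can be sketched as a lookup in the distributed DD/D. The directory contents, site names, and helper functions below are hypothetical, meant only to show how local and global requests diverge.

# Toy distributed data dictionary/directory: which site holds which table.
dd_d = {"customer": "san_mateo", "order": "new_york"}
LOCAL_SITE = "san_mateo"

def run_locally(request):
    return "local DBMS processed: " + request

def send_to_remote_site(site, request):
    # In a real system this crosses the network to the remote distributed DBMS.
    return site + " processed: " + request

def route_request(table, request):
    site = dd_d[table]                    # step 2: look up the data's location
    if site == LOCAL_SITE:
        return run_locally(request)       # local transaction path
    return send_to_remote_site(site, request)  # global transaction path (steps 3-8)

print(route_request("order", "SELECT * FROM order WHERE amount > 100000"))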
Figure 13-10: Distributed DBMS Architecture (cont.)
(showing global transaction steps). Global transaction – some data is at remote site(s).
Distributed DBMS
Transparency Objectives
1. Location Transparency - User/application does not need to know where data resides.
• Suppose that a marketing manager in San Mateo,
California, wanted a list of all company customers whose
total purchases exceed $100,000.
• From a terminal in San Mateo, with location transparency,
the manager could enter the following request:
• SELECT *
FROM CUSTOMER
WHERE TOTAL_SALES > 100000;
Distributed DBMS
Transparency Objectives
• Notice that this SQL request does not require
the user to know where the data are
physically stored. The distributed DBMS at the
local site (San Mateo) will consult the
distributed DD/D and determine that this
request must be routed to New York.
• When the selected data are transmitted and displayed in San Mateo, it appears to the user at that site that the data were retrieved locally.
Distributed DBMS
Transparency Objectives
2. Replication Transparency
• Replication transparency (sometimes called fragmentation transparency)
– User/application does not need to know about duplication
• If a read request originates at a site that does not contain the
requested data, that request will have to be routed to another
site. In this case, the distributed DBMS should select the remote
site that will provide the fastest response. The choice of site will
probably depend on current conditions in the network (such as
availability of communications lines).
• Thus, the distributed DBMS should dynamically select an
optimum route. Again, with replication transparency, the
requesting user need not be aware that this is a global (rather
than local) transaction.
Distributed DBMS
Transparency Objectives
3. Failure Transparency
• For a system to be robust, it must be able to detect a failure, reconfigure
the system so that computation may continue, and recover when a
processor or link is repaired.
• Error detection and system reconfiguration are probably the functions of
the communications controller or processor, rather than the DBMS.
However, the distributed DBMS is responsible for database recovery when
a failure has occurred.
• The distributed DBMS at each site has a component called the transaction
manager that performs the following functions:
• Logs transactions and before and after images
• Concurrency control scheme to ensure data integrity
– Requires special commit protocol
– With failure transparency - Either all or none of the actions of a
transaction are committed
Two-Phase Commit
• Prepare Phase
– Coordinator receives a commit request
– Coordinator instructs all resource managers to get
ready to “go either way” on the transaction.
– Each resource manager writes all updates from
that transaction to its own physical log
– Coordinator receives replies from all resource
managers.
– If all are ok, it writes commit to its own log; if not
then it writes rollback to its log
Two-Phase Commit (cont.)
• Commit Phase
– Coordinator then informs each resource manager of its
decision and broadcasts a message to either commit or
rollback (abort). If the message is commit, then each
resource manager transfers the update from its log to its
database
– A failure during the commit phase puts a transaction “in
limbo.” This has to be tested for and handled with
timeouts or polling
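A minimal in-memory sketch of the flow just described, with plain Python objects standing in for the coordinator, the resource managers, and their logs; it illustrates the protocol's shape, not a real DBMS implementation.

class ResourceManager:
    def __init__(self, name):
        self.name, self.log = name, []

    def prepare(self, updates):
        # Prepare phase: write the transaction's updates to this manager's own
        # log so it can "go either way", then vote ok (False would mean failure).
        self.log.append(("prepared", updates))
        return True

    def finish(self, decision):
        # Commit phase: on "commit", transfer the logged update to the database;
        # on "rollback", discard it.
        self.log.append((decision,))

def two_phase_commit(managers, updates):
    votes = [rm.prepare(updates) for rm in managers]   # coordinator collects replies
    decision = "commit" if all(votes) else "rollback"  # coordinator logs its decision
    for rm in managers:
        rm.finish(decision)                            # broadcast commit or rollback
    return decision

print(two_phase_commit([ResourceManager("site1"), ResourceManager("site2")],
                       {"cust_no": 1, "credit_limit": 9000}))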
Concurrency Control
• Concurrency Transparency
– Design goal for distributed database
• Timestamping
• With this approach, every transaction is given a globally unique
time stamp, which generally consists of the clock time when the
transaction occurred and the site ID.
• Time-stamping ensures that even if two events occur
simultaneously at different sites, each will have a unique time
stamp.
• The purpose of time-stamping is to ensure that transactions are
processed in serial order, thereby avoiding the use of locks (and the
possibility of deadlocks).
• Concurrency control mechanism
– Alternative to locks in distributed databases
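A hedged sketch of how such a globally unique time stamp might be formed and compared; the site identifiers are invented for the example.

import time

def global_timestamp(site_id):
    # Clock time plus site ID: unique even if two events occur at the same
    # instant at different sites.
    return (time.time_ns(), site_id)

ts_a = global_timestamp("detroit")
ts_b = global_timestamp("chicago")

# Transactions are applied in timestamp order; tuple comparison breaks clock
# ties with the site ID, so serial order is achieved without locks.
first, second = sorted([ts_a, ts_b])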
Query Optimization
• A major decision for the DBMS is how to
process a query, which is affected by both the
way a user formulates a query and the
intelligence of the distributed DBMS to
develop a sensible plan for processing.
Query Optimization
• A simplified procurement (relational) database has the following three
relations:
• SUPPLIER (SUPPLIER_NUMBER,CITY) 10,000 records, stored in Detroit
• PART (PART_NUMBER, COLOR) 100,000 records, stored in Chicago
• SHIPMENT(SUPPLIER_NUMBER, PART_NUMBER ) 1,000,000 records,
stored in Detroit
• A query is made (in SQL) to list the supplier numbers for Cleveland
suppliers of red parts:
• SELECT SUPPLIER.SUPPLIER_NUMBER
FROM SUPPLIER, SHIPMENT, PART
WHERE SUPPLIER.CITY = 'Cleveland'
AND SUPPLIER.SUPPLIER_NUMBER = SHIPMENT.SUPPLIER_NUMBER
AND SHIPMENT.PART_NUMBER = PART.PART_NUMBER
AND PART.COLOR = 'Red';
Query Optimization
• Assume each record in each relation is 100 characters long, there are ten red parts, there is a history of 100,000 shipments from Cleveland, and query computation time is negligible compared with communication time.
• Also, there is a communication system with a
data transmission rate of 10,000 characters
per second and one second access delay to
send a message from one node to another.
Query Optimization
• Date identifies six plausible query-processing strategies for this situation and develops the associated communication times; these strategies and times are summarized in the table.
Query Optimization
• A distributed DBMS typically uses the following three-step process to develop a query-processing plan:
1. Query decomposition - the query is rewritten and simplified
2. Data localization - the query is fragmented so that fragments reference data at only one site
3. Global optimization - decides:
• Order in which to execute query fragments
• Data movement between sites
• Where parts of the query will be executed
Query Optimization
• In a query involving a multi-site join and, possibly, a
distributed database with replicated files, the
distributed DBMS must decide where to access the
data and how to proceed with the join.
• One technique used to make processing a distributed
query more efficient is to use what is called a semijoin
operation
• Semijoin: A joining operation used with distributed databases in which only the joining attribute from one site is transmitted to the other site, rather than all the selected attributes from every qualified row.
Query Optimization
• In a semijoin, only the joining attribute is sent
from one site to another, and then only the
required rows are returned. If only a small
percentage of the rows participate in the join,
then the amount of data being transferred is
minimal.
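A small in-memory sketch of the semijoin idea; the two lists stand in for the Customer table at site 1 and the Order table at site 2, and the ZIP range and amount limit are invented.

# Site 1 holds Customer rows; site 2 holds Order rows.
customers_site1 = [
    {"cust_no": 1, "cust_name": "Acme", "zip": "94401"},
    {"cust_no": 2, "cust_name": "Zenith", "zip": "10001"},
]
orders_site2 = [
    {"cust_no": 1, "order_date": "2024-01-05", "amount": 120000},
    {"cust_no": 2, "order_date": "2024-01-06", "amount": 300},
]

# Step 1: send only the joining attribute (Cust_No) for qualifying customers.
qualifying = {c["cust_no"] for c in customers_site1 if c["zip"].startswith("944")}

# Step 2: site 2 returns only the matching rows that satisfy its own condition.
returned = [(o["cust_no"], o["order_date"])
            for o in orders_site2
            if o["cust_no"] in qualifying and o["amount"] > 100000]

# Site 1 then joins the returned rows with its own Customer data for the result.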
Query Optimization
• For example, consider the distributed
database in the Figure :
Query Optimization
• Suppose that a query at site 1 asks to display the Cust_Name, SIC, and Order_Date (the fields) for all customers in a particular Zip_Code range with an Order_Amount above a specified limit (the conditions).
• Assume that 10% of the customers fall in the Zip_Code range and 2% of the orders are above the amount limit.
• A semijoin would work as follows
• 1. A query is executed at site 1 to create a list of the Cust_No
values in the desired Zip_Code range.
• So 10,000 customers * .1, or 1,000 rows satisfy the Zip_Code
qualification. Thus, 1,000 rows of 10 bytes each for the
Cust_No attribute (the joining attribute), or 10,000 bytes, will
be sent to site 2.
• 2. A query is executed at site 2 to create a list of the Cust_No
and Order_Date values to be sent back to site 1 to compose
the final result. If we assume roughly the same number of
orders for each customer, then 40,000 rows of the Order
table will match with the customer numbers sent from site 1.
If we assume any customer order is equally likely to be above
the amount limit, then 800 rows (40,000 * .02) of the Order
table rows are relevant to this query. For each row, the
Cust_No and Order_Date need to be sent to site 1, or 14
bytes * 800 rows, thus 11,200 bytes.
Query Optimization
• The total data transferred is only 21,200 bytes using the semijoin just described: (site 1 → site 2: 10,000 bytes) + (site 2 → site 1: 11,200 bytes) = 21,200 bytes.
• Compare this total to simply sending the subset of each table needed at
one site to the other site:
• To send data from site 1 to site 2 would require sending the Cust_No, Cust_Name, and SIC (65 bytes) for 10,000 * 0.1 = 1,000 rows of the Customer table (65,000 bytes) to site 2.
• To send data from site 2 to site 1 would require sending Cust_No and Order_Date (14 bytes) for 400,000 * 0.02 = 8,000 rows of the Order table (112,000 bytes).
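The byte counts above can be checked with a few lines of arithmetic, using the row counts, selectivities, and attribute widths stated in the example.

# Semijoin: joining attribute out, then only the qualifying rows back.
to_site2 = 10_000 * 0.10 * 10           # 1,000 Cust_No values * 10 bytes = 10,000
to_site1 = 400_000 * 0.10 * 0.02 * 14   # 800 rows * 14 bytes = 11,200
semijoin_total = to_site2 + to_site1    # 21,200 bytes

# Shipping the needed subset of each table instead:
customer_subset = 10_000 * 0.10 * 65    # 1,000 rows * 65 bytes = 65,000
order_subset = 400_000 * 0.02 * 14      # 8,000 rows * 14 bytes = 112,000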
Query Optimization
• A distributed DBMS uses a cost model to
predict the execution time (for data
processing and transmission) of alternative
execution plans.
• The cost model is applied before the query is executed, based on general network conditions.
Distributed Performance Problems
• The biggest threat to performance is network
traffic.
• Performance in a distributed environment is
defined by throughput and response time.
• The server views performance primarily in
terms of throughput.
• The requester, however, views performance
more in terms of response time.
Questions