* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Modern Databases - Stellenbosch University
Survey
Document related concepts
Transcript
Modern Databases Willem Visser RW334 The Web is Changing the Game • Databases used to be the domain of corporations with limited amounts of data and limited amounts of users – Very valuable information, but not a lot of it – Important users, but not many of them • In the modern web-driven world – Enormous amounts of data are being generated – Millions of users are interested in that data What is Wrong here? What is Wrong here? How to make the DB scale? Partition and Distribute the Data Distributed Database • A single logical database spread physically across computers in multiple locations that are connected by a data communications link 6 Major Objectives • Location Transparency – User does not have to know the location of the data – Data requests automatically forwarded to appropriate sites • Local Autonomy – Local site can operate with its database when network connections fail – Each site controls its own data, security, logging, recovery 7 Distributed Databases Advantages • • • • • 8 Increased reliability/availability Local control over data Modular growth Lower communication costs Faster response for certain queries Distributed Database Disadvantages • • • • 9 Software cost and complexity Processing overhead Data integrity exposure Slower response for certain queries Options for Distributing a Database • Data replication – Copies of data distributed to different sites • Horizontal partitioning/Sharding – Different rows of a table distributed to different sites • Vertical partitioning – Different columns of a table distributed to different sites • Combinations of the above 10 Data Replication • Advantages: – Reliability – Fast response – May avoid complicated distributed transaction integrity routines (if replicated data is refreshed at scheduled intervals) – Decouples nodes (transactions proceed even if some nodes are down) – Reduced network traffic at prime time (if updates can be delayed) 11 Data Replication (cont.) • Disadvantages: – Additional requirements for storage space – Additional time for update operations – Complexity and cost of updating – Integrity exposure of getting incorrect data if replicated data is not updated simultaneously Therefore, better when used for non-volatile (read-only) data 12 Factors in Choice of Distributed Strategy • • • • • • 13 Funding, autonomy, security Site data referencing patterns Growth and expansion needs Technological capabilities Costs of managing complex technologies Need for reliable service Distributed DBMS • Distributed database requires distributed DBMS • Functions of a distributed DBMS: – Locate data with a distributed data dictionary – Determine location from which to retrieve data and process query components – DBMS translation between nodes with different local DBMSs (using middleware) – Data management functions: security, concurrency, deadlock control, query optimization, failure recovery – Data consistency (via multiphase commit protocols) – Global primary key control – Scalability – Data and stored procedure replication – Allowing for different DBMSs and application code at different nodes 14 Distributed DBMS Transparency Objectives • Location Transparency – User/application does not need to know where data resides • Replication Transparency – User/application does not need to know about duplication • Failure Transparency – Either all or none of the actions of a transaction are committed – Each site has a transaction manager • Logs transactions and before and after images • Concurrency control scheme to ensure data integrity – Requires special commit protocol 15 Query Optimization • In a query involving a multi-site join and, possibly, a distributed database with replicated files, the distributed DBMS must decide where to access the data and how to proceed with the join. Three step process: 1. 2. 3. Query decomposition–rewritten and simplified Data localization–query fragmented so that fragments reference data at only one site Global optimization– • • • • 16 Order in which to execute query fragments Data movement between sites Where parts of the query will be executed Semi join operation: only the joining attribute of the query is sent from one site to the other, rather than all selected attributes Brewer’s CAP Theorem • Eric Brewer, Keynote at ACM Symposium on the Principles of Distributed Computing 2000 • You cannot have all three of: – Consistency – Availability – Partition Tolerance • Nothing short of complete network failure and the system must keep functioning • Theorem proven in 2002 by Gilbert and Lynch • See http://www.julianbrowne.com/article/viewer/brewers-cap-theorem Dealing with CAP? • Drop Partitioning Tolerance – Don’t partition, but then you have serious scalability issues, which is probably why you want to partition in the first place • Drop Availability – Wait for all the partitions to sync before allowing any usage – This is as bad for scalability as having no partitioning • Drop Consistency – Eventual Consistency seems to work in most cases – If you have to drop one, this is the preferred option – Flies against most DB principles Relational DB? • Seems like we are assuming the DB must still be relational • Web also forces a new concept – Not all data look the same anymore! • Email messages, Images, News documents, Facebook updates, Tweets,… • Relations are too rigid • Semi-structured data The Information-Integration Problem • Related data exists in many places and could, in principle, work together. • But different databases differ in: – Model (relational, object-oriented?). – Schema (normalized/ not normalized?). – Terminology: are consultants employees? Retirees? Subcontractors? – Conventions (meters versus feet?). • How do we model information residing in heterogeneous sources (if we cannot combine it all in a single new database)? Example • Suppose we are integrating information about bars in some town. • Every bar has a database. – One may use a relational DBMS; another keeps the menu in an MS-Word document. – One stores the phones of distributors, another does not. – One distinguishes ales from other beers, another doesn’t. – One counts beer inventory by bottles, another by cases. Semi-structured Data • Purpose: represent data from independent sources more flexibly than either relational or object-oriented models. • Think of objects, but with the type of each object its own business, not that of its “class.” • Labels to indicate meaning of substructures. • Data is self-describing: structural information is part of the data. Graphs of Semistructured Data • Nodes = objects. • Labels on arcs (attributes, relationships). • Atomic values at leaf nodes (nodes with no arcs out). • Flexibility: no restriction on: – Labels out of a node. – Number of successors with a given label. Example: Data Graph Notice a new kind of data. root beer bar beer manf name servedAt Bud A.B. manf prize name M’lob name addr Joe’s Maple The bar object for Joe’s Bar year 1995 The beer object for Bud award Gold XML • XML = Extensible Markup Language. • While HTML uses tags for formatting (e.g., “italic”), XML uses tags for semantics (e.g., “this is an address”). • Key idea: create tag sets for a domain (e.g., bars), and translate all data into properly tagged XML documents. • Well formed XML - XML which is syntactically correct – tags and their nesting totally arbitrary. • Valid XML - XML which has DTD (document type definition) – imposes some structure on the tags, but much more flexible than relational database schema. • DTD and XML Schema – Meta-data for XML – Describe what are valid XML structures XML and Semi-structured Data • Well-Formed XML with nested tags is exactly the same idea as trees of semi-structured data. • XML also enables non-tree structures (with references to IDs of nodes), as does the semistructured data model. Example: Well-Formed XML <?xml version = “1.0” standalone = “yes” ?> <BARS> <BAR><NAME>Joe’s Bar</NAME> <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER> <BEER><NAME>Miller</NAME> Root tag <PRICE>3.00</PRICE></BEER> </BAR> <BAR> … Tags surrounding </BARS> a BAR element A NAME subelement A BEER subelement 27 Example • The <BARS> XML document is: BARS BAR BAR BAR NAME BEER BEER Joe’s Bar NAME Bud PRICE 2.50 NAME Miller PRICE 3.00 ... DTD Elements • The description of an element consists of its name (tag), and a parenthesized description of any nested tags. – Includes order of subtags and their multiplicity. • Leaves (text elements) have #PCDATA (Parsed Character DATA ) in place of nested tags. 29 Example: DTD A BARS object has zero or more BAR’s nested within. <!DOCTYPE BARS [ <!ELEMENT BARS (BAR*)> <!ELEMENT BAR (NAME, BEER+)> <!ELEMENT NAME (#PCDATA)> <!ELEMENT BEER (NAME, PRICE)> <!ELEMENT PRICE (#PCDATA)> ]> NAME and PRICE are text. A BAR has one NAME and one or more BEER subobjects. A BEER has a NAME and a PRICE. 30 Querying XML • Why query XML-documents? – special XML databases – major DBMSs “speak” XML; • Does the world need a new query language? • Most of the world's business data is stored in relational databases; • The relational language SQL is mature and wellestablished; • Can SQL be adapted to query XML data? – Leverage existing software – Leverage existing user skills XML vs Relational Data • Relational data is "flat”: rows and columns; • XML data is nested: and its depth may be irregular and unpredictable; • Relations can represent hierarchic data by foreign keys or by structured datatypes; • In XML it is natural to search for objects at • unknown levels of the hierarchy: • "Find all the red things“; XML vs Relational Data (cont.) • Relational data is uniform and repetitive; • All bank accounts are similar in structure; • Metadata can be factored out to a system catalog; • XML data is highly variable; • Every web page is different; • Each XML object needs to be self-describing; • Metadata is distributed throughout the document; • Queries may access metadata as well as data: "Find elements whose name is the same as their content“: • //*[name(.) =string(.)] XML vs Relational Data (cont.) • Relational queries return uniform sets of rows; • The results of an XML query may have mixed types and complex structures; • "Red things": a flag, a cherry, a stopsign, ... • Elements can be mixed with atomic values • XML queries need to be able to perform structural transformations • Example: invert a hierarchy; XML vs Relational Data (cont.) • The rows of a relation are unordered • Any desired output ordering must be derived from values; • The elements in an XML document are ordered • Implications for query: • Preserve input order in query results • Specify an output ordering at multiple levels • "Find the fifth step“ • "Find all the tools used before the hammer“; XML vs Relational Data (cont.) • Relational data is "dense“ • Every row has a value in every column; • A "null" value is needed for missing or inapplicable data • XML data can be "sparse“ • Missing or inapplicable elements can be "empty“ or "not there“ • This gives XML a degree of freedom not present in relational databases XPATH and XQUERY • XPATH is a language for describing paths in XML documents. – Really think of the semi-structured data graph and its paths. – Why do we need path description language: can’t get at the data using just Relation.Attribute expressions. • XQUERY is a full query language for XML documents with power similar to OQL (Object Query Language, query language for object-oriented databases). 'Modern' Databases