Advanced Databases
Lectures, November 2013. NoSQL 1/3
ZPR-FER - Zagreb, Advanced Databases 2013/2014

NoSQL - agenda
1. Introduction
2. Distributed Databases
3. The Data Model
4. Distribution Models
5. Consistency, version
6. MapReduce
7. Examples
   - Key Value
   - Document
   - Column family
   - Graph DBs

Introduction - history
[Timeline figure; labels in slide order: impedance mismatch, 1980, RDB, 1990, OODB, 2000 and onward, RDB, Web -> NoSQL, 2010]

Introduction - relational databases
• Relational databases
• Persistence
• Concurrency, ACID
• Integration
• Standard data model, query language
• Impedance mismatch
• Application vs Integration databases
• Large amounts of data - scalability
• Availability

Introduction - scalability
• Scalability: the ability of the system to cope with a growing amount of data while retaining acceptable performance
• Vertical scalability (scale up): adding resources (memory, processor, ...) to a single node
• Horizontal scalability (scale out): adding nodes to the system
• Relational DBMSs have problems with horizontal scalability

Introduction - availability, clusters
• RDB clusters
• But we want: [figure not preserved]

Distributed and replicated relational databases
This topic is covered in more detail in Database Systems (http://www.fer.unizg.hr/predmet/sbp). The following slides are taken (and simplified) from the Database Systems lectures.

Distributed DB and distributed RDBMS
• A distributed database (DDB) is a set of logically related databases deployed in different nodes of a computer network (LAN, MAN, WAN)
• A distributed database management system (DDBMS) is a software system that manages a distributed database in such a way that the distribution is transparent to users
• DDBMS includes n local DBMSs.
• Each local DBMS, labelled Si (i = 1, ..., n), represents a single node (site) of a distributed system
• Each node Si can directly or indirectly communicate with each node Sj, i.e. there is two-way communication between any two nodes
• The nodes of a distributed database management system do not share physical components (disk, memory, CPU)

Distributed DB and distributed RDBMS
• Nodes are able to perform transactions that require only local data access (local transactions), but also transactions that require access to data from different nodes (global transactions)
• Nodes have a degree of local autonomy
• Local applications (transactions)
• Global applications (transactions)
• A database is distributed if it supports at least one global application
Example:
• T1, local to ISVU:
  - set exam grade
• T2, local to FerLib:
  - set book borrowed status
• T3, global (ISVU and FerLib):
  - check if all exams are passed
  - check that all books are returned
  - print diploma supplement

DDB design
An important part of the design of distributed databases is to determine how to distribute the data. Data is placed in the nodes where it is commonly used, which minimizes network traffic.
Distribution design = fragmentation + allocation
Fragmentation schema
• division of the database into a disjoint set of fragments that include all of the data in the database. The database must be reconstructable from these fragments without loss of information
• relations can be divided into fragments either horizontally or vertically (or both horizontally and vertically)
Allocation schema
• schema that describes which fragment is assigned to which node of a distributed system

Fragmentation
• Horizontal, e.g. two fragments: r = r1 ∪ r2
• Vertical, e.g. two fragments: (K, A) and (K, B), both keeping the key K
• Hybrid: r = r1 ∪ r2 (horizontal), where r1 = r11 ⋈ r12 and r2 = r21 ⋈ r22 (vertical)
[Figure: relation with key K and attributes A, B, fragmented horizontally, vertically and in hybrid fashion]

Allocation
• Fragment replication degree (factor): the number of nodes in which the fragment is allocated
• Each fragment must be allocated to at least one node!
• Partitioned (or non-replicated) DB: each of the fragments is allocated to exactly one node, i.e. the degree of replication of each fragment = 1
• Fully replicated DB: each of the fragments is allocated to all nodes, so each node contains a replica of the database, i.e. the degree of replication of each fragment = n (number of nodes in the DDBMS)
• Partially replicated DB: the database is neither partitioned nor fully replicated (each of the fragments can be allocated to one, several or all nodes)

Allocation
Nodes: S1, S2, S3
Fragments:
• student1 = σ_{idFaculty = 36}(student)
• student2 = σ_{idFaculty = 102}(student)
• student3 = σ_{idFaculty = 81}(student)
• faculty1
Partially replicated DB: each of the nodes S1, S2, S3 holds one student fragment together with a copy of faculty1.

Global transactions, subtransactions, local transactions
Example: DDBMS nodes S1, S2, S3
• The user initiates a global transaction T1 in node S2
• Transaction T1 is mapped into the set of subtransactions T1^1, T1^2, T1^3
• Each subtransaction contains the operations that are executed in its node
• Label: Ti^j is the subtransaction of global transaction Ti that is executed in node Sj

Transaction in DDBMS
• Fully functional DBMS in each node
• A transaction can no longer be viewed as (only) a series of logically related operations that are executed in a DBMS
• A global transaction is a set of coordinated subtransactions executed at multiple nodes that transform the distributed database from one consistent state to another
ACID?
• Consistency is relatively easily achieved through the usual mechanisms
• Durability is provided by each node's DBMS, because each node guarantees the durability of its subtransaction
• Much more difficult problems: Atomicity and Isolation

Atomicity
• Atomicity of subtransactions is ensured by the local nodes
• How to ensure the atomicity of the global transaction?
  - during the execution of the global transaction, communication between nodes can break down, or one or more nodes can malfunction
  - atomicity of the global transaction means that the DBMSs in all the nodes that perform the corresponding subtransactions must adopt and implement the same decision on the outcome of the transaction: either all subtransactions of a global transaction are executed, or none of them
  - the DDBMS implements a commit protocol for the global transaction: 2PC - two-phase commit

2PC - informal description (1)
There is a Transaction Manager (TM) in each node:
• its tasks are equivalent to those in a centralized system: recovery, isolation, ...
• difference: in addition to local transactions, it executes the subtransactions (for its own node)
There is a Transaction Coordinator (TC) in each node:
• launches the global transactions initiated at its location (node)
• distributes subtransactions to the appropriate nodes, i.e. instructs the individual TMs to execute the subtransactions
• orchestrates the completion of global transactions (initiated in its node) so that the corresponding subtransactions are either committed in all nodes or rolled back in all nodes

2PC - informal description (2)
• The TC, which is located at the initiating node, distributes the transaction's subtransactions to the appropriate TMs
• After the subtransactions are executed, all TMs report the successful execution to the TC. Only then does 2PC begin!
Phase 1
• The TC sends the GetReadyToCommit message to all the TMs. Each TM responds with Ready or NotReady, or does not respond.
Phase 2
• If all nodes reply with Ready → the TC writes the decision in its log and sends the GlobalCommit message to all the TMs
• If any of the nodes replies with NotReady or does not respond within the given time frame → the TC writes the decision in its log and sends the GlobalRollback message to all the TMs
• Each TM writes the TC's decision in its log, replies to the TC with a confirmation of the decision, and commits or rolls back the subtransaction
• When the TC receives responses from all the TMs, it writes the EndTransaction tag into its log

[Figure: 2PC state diagram. TC states C1-C7: BEGIN, write BeginCommit, send GetReadyToCommit, WAIT, then either (all ready) write and send GlobalCommit or (someone is not ready) write and send GlobalRollback, and finally write the end-of-commit record. TM states P1-P8: BEGIN, then either (not ready) write rollback and send NotReady or (ready) write ready and send Ready, then write COMMIT or rollback according to the global decision and send decisionAccepted.]

2PC - informal description
If, due to a network failure or a remote node failure, a message is not received within the predetermined time (timeout), the TC or TM tries to continue operating in order to avoid blocking the transaction:
• C3: the TC waits for the decision of one or more TMs. The TC may decide to roll back the global transaction
• C6, C7: the TC cannot determine whether all TMs executed the decision to commit or roll back the subtransaction. The TC repeatedly polls the TMs that did not respond
• P1: the TM expects a message from the TC telling it to start preparing for the commit. After the timeout expires, the TM may unilaterally roll back the subtransaction. Should the TC send a GetReadyToCommit message afterwards, the TM responds with NotReady
• P4: the TM has sent a Ready message, but it does not know the TC's final decision.
The TM has to wait for communication with the TC to be re-established.

2PC - informal description - error at TC
If a TC that is recovering from a failure finds in its log that it was involved in the 2PC protocol when it failed, it performs the following actions, depending on the moment at which the failure occurred:
• C1: after the recovery, the TC can restart the 2PC protocol in the usual way
• C2, C3: the TC stopped working after it wrote BeginCommit to the log. After the recovery, it continues the protocol by sending the GetReadyToCommit messages
• C4, C5, C6, C7: the TC stopped working after it wrote GlobalCommit or GlobalRollback to the log. After the recovery, it re-sends the appropriate message to the TMs

2PC - informal description - error at TM
If a TM that is recovering from a failure finds in its log that it was involved in the 2PC protocol when it failed, it performs the following actions, depending on the moment at which the failure occurred:
• P1: the TM stopped working before it wrote rollback or ready to the log. During the recovery, the TM unilaterally rolls back the transaction
• P2: the TM stopped working after it wrote rollback to the log. The TM rolls back the subtransaction and leaves it to the TC to perform a global rollback after the response timeout
• P3, P4: the TM stopped working after it wrote Ready in its log. The TM sends the Ready message to the TC and waits for an answer, i.e.
a final decision
• P5, P6: the TM recognizes the outcome of the global transaction and acts accordingly
• P7, P8: the TM does nothing, because it is in a state in which the transaction is committed

Protocol blocking
A protocol is blocking if there is a possibility that a correctly functioning node (TC or TM) will not be able to complete the transaction due to the disruption or failure of another node
Example: point P4 in the previous picture
• The TM has sent the Ready message to the TC and is waiting for the TC's decision on the outcome of the global transaction. At that moment, communication with the TC fails
• The TM cannot unilaterally roll back the local transaction because it does not know what the TC decided (maybe the TC managed to send the GlobalCommit message to all the other nodes)
• The TM has to wait for communication with the TC to be re-established (or for the TC's system to recover)
⇒ 2PC is a blocking protocol

Protocol independence with regard to recovery
A protocol is independent with regard to recovery if each node (TC or TM), after it has failed, can decide independently, without communicating with other nodes, the outcome of all (sub)transactions that were being executed at that node at the time of failure
Example: point P4 in the previous picture
• The TM wrote Ready to the log and sent the Ready message to the TC. At this point, the TM fails
• When the TM starts the recovery, it determines that it was involved in the 2PC protocol at the moment of failure.
It cannot decide whether to commit or roll back the transaction without the TC, so it sends a Ready message to the TC and waits for a response
⇒ the 2PC protocol is not independent with respect to recovery

Errors in DDBMSs
Error management in a DDBMS is more complex than in centralized systems:
• a centralized system either works as a whole or does not work
• parts of a DDBMS can malfunction while other parts continue to operate
In addition to the malfunctions typical of centralized systems (e.g. software and hardware errors, disk destruction), a DDBMS can experience additional types of failures:
• malfunctioning of one or more nodes
• loss of connections between nodes
• loss of messages
• network partition: the network is partitioned (divided) into several subsystems that cannot communicate
An even more complex problem: node Si cannot determine whether a network partition occurred or node Sj simply stopped working

Disadvantages of DDBMS when compared to centralized DBMS
• Significantly greater complexity of the system
• Increased costs, e.g.
  - expensive software
  - more system administrators
• Greater security problems
• Higher costs of ensuring data integrity
• Lack of standards
• Lack of experience
• More complex database design
• Poor implementation of the distributed database can cause:
  - increased communication costs
  - reduced availability of data
  - reduced performance
• DDBMS functionalities and techniques, which are the result of a great body of research, are not fully implemented in any of the currently available commercial systems.

Replicated databases
• A fragment is replicated if it is allocated to more than one node
• For a single logical element (tuple, fragment, relation) there are multiple physical elements (copies, replicas), x1, x2, ..., in nodes S1, S2, ...
[Figure: copies of a logical element stored in nodes S1, S2, S3]

Benefits of replicated DBs
• Increased availability: if the node that stores a copy of the fragment is unavailable, the system can access a copy of the fragment in another node
• Decreased data transmission volume: commonly used data is replicated and accessed locally
• Parallel query execution: a query that involves a fragment can be decomposed, and each part executed over one of the copies of the fragment

Disadvantages of replicated DBs
• Consistency problem: the system must ensure the consistency of all copies. Write operations (insert, delete, update) on one copy of the fragment must be propagated to all nodes in which this fragment is allocated
• The number of operations to be carried out in a number of nodes can decrease availability and increase the number of deadlocks (when synchronous replication is used) or decrease consistency (when asynchronous replication is used)

Synchronous (eager) protocols
• All physical operations arising from the logical operations of the initial transaction are conducted within the boundaries of the initial transaction, that is, all copies have to be modified as part of the initial transaction
• Full consistency
• Good read performance
• Worsened write performance
• Extended transaction execution time, increased deadlocks, low availability (failure of one node prohibits write operations)
[Figure: initial transaction T with subtransactions T1-T4 in nodes S1-S4]

Asynchronous (lazy) protocols
• Operations of the initial transaction are conducted exclusively in the initial node and do not depend in any way on communication with other nodes
• The initial transaction can be completed before the changes are made to all copies.
Changes to other copies are performed asynchronously
• High availability, good performance
• High risk of inconsistent data
[Figure: initial transaction T (T1 in node S1) and propagated transactions T2, T3, T4 in nodes S2, S3, S4]

One way protocols
Also called: one-way, master-ownership, primary-copy
• For each logical element x there is only one master copy: xp
• All write operations on x must first be performed on xp
• Each node that contains at least one master copy is called a master
  - single-master system: all master copies are at one node
  - multi-master system: master copies of various elements are located in different nodes

Often used one-way protocols
• 1Master-nSlaves (dissemination): changes are made in exactly one master node and propagated to the subordinate nodes (slaves). The slave nodes are not allowed to perform transactions that include write operations
• nMasters-1Slave (consolidation): updates performed in subordinate (slave) nodes are propagated to exactly one parent (master) node.
The master node cannot perform transactions that include a write operation
[Figure: dissemination and consolidation topologies, with nodes labelled Updates / Reads]

Two-way protocols
Also called: n-way, peer-to-peer, group-ownership, update-anywhere
• The initial transaction can perform updates on any physical copy
• System availability is considerably increased compared to the one-way system
• If used in combination with an asynchronous protocol, transaction serializability cannot be guaranteed

Important disadvantages of two-way protocols
• Non-serializable transactions can lead to hard-to-repair breaches of data consistency
• Problem of detecting conflicts: some conflicts can be discovered only after the propagation of changes (when the initial transaction has already been committed)
• Problem of resolving conflicts:
  - may require cancelling committed transactions -> the durability property (of ACID) is weakened
  - automatic conflict resolution is often not possible; human intervention is required
Example: fully replicated DB. Nodes S1 and S2 hold the same content, and Device.idProduct references Product.idProduct (referential integrity):

Product
idProduct  prodName
1          ASEA
2          Gyr

Device
idDev  idProduct  serNumber
10     1          M-10
20     1          M-14

Important disadvantages of two-way protocols
The slide shows a timeline from 9:40 to 9:43:
• Initially, S1 and S2 are synchronized and hold the content shown above
• One node executes DELETE FROM Product WHERE idProduct=2; concurrently, the other node executes INSERT INTO Device VALUES (30, 2, M-16)
• Each change is then propagated to the other node
• The propagated INSERT arrives where the Product row has been deleted: ERROR - referential integrity: missing row
• The propagated DELETE arrives where the new Device row has been inserted: ERROR - referential integrity: still referencing row
• Result: system delusion

Important disadvantages of two-way protocols
• Modern systems support two-way asynchronous replication, but a comprehensive solution to the described problem does not exist
• Various systems offer different optional built-in functionalities that can help in specific cases. E.g., in some systems it is possible to:
  - propagate a (user-defined) stored procedure that handles possible conflicts, instead of propagating SQL commands
  - use timestamps to detect a possible conflict and roll back the initial transaction (how does this affect durability?)
  - apply rules such as last wins, first wins, greatest value wins, ...

NoSQL

NoSQL databases
• First used as the name of a (relational) DBMS developed by Carlo Strozzi in 1998
• Used again (as a Twitter hashtag) in 2009 at a "distributed, non-relational database, open source" conference organized by Johan Oskarsson
• No + SQL: "SQL" means "traditional" relational DBMS
  - initially interpreted as "do not use SQL" and do not use relational DBMSs
  - Not Only SQL: solutions that are not based solely on relational technologies

NoSQL informal definition
Informal definition (taken from http://nosql-database.org/):
• Newer-generation databases that usually have the following features: non-relational, distributed, open-source and horizontally scalable ...
• Often have additional properties: schema-free, easy replication, simple API, BASE (not ACID), working with a large amount of data, etc.
[Figure labels: Open source, Non relational, Distributed, 21st century web, Schemaless]

NoSQL vs RDB
• RDBs are the best one-size-fits-all solution we have
• NoSQL databases are specialized solutions for certain (types of) problems

Data Model

The data model - an introduction
• Model used to represent and handle the data
• Not the same as the physical model (which we mostly do not need to know)
• E.g. relational model
• There is no "real" or "correct" model of the world or domain
• In NoSQL, four data models:
  1. Key Value
  2. Document
  3. Column family (not the same as column, columnar)
  4. Graph

Aggregate model
• The term comes from Domain Driven Design
• An aggregate is:
  - a complex record that allows lists and object nesting
  - a set of objects handled as a single record (e.g., an order and its order items)
• One root, through which the aggregate is referenced and its integrity is ensured
• The basic unit of data: an aggregate is saved and/or read as a whole
• One aggregate ~ one "transaction"

Aggregate - example (1)
[Figure: composition; not preserved]

Digression: JSON reminder (1)
JSON - JavaScript Object Notation
• Plain-text, human readable
• Does not depend on the programming language
• Hierarchical (nesting)
• JSON vs XML:
  - js.eval() (but "eval is evil", use a JSON parser)
  - easier to work with than XML
  - shorter than XML
  - no end tags
  - arrays
• name:value pairs, e.g. "name":"Joe"

Digression: JSON reminder (2)
A value can be:
• Number (int, real)
• String ("")
• Boolean (true/false)
• Array (enclosed in [])
• Object (enclosed in {})
• null

    {
      "id": 1001,
      "name": "Order no. 13/2013",
      "total": 21.98,
      "details": [
        {"ProductId": 100, "name": "Chocolate", "price": 9.99},
        {"ProductId": 101, "name": "Jam", "price": 11.99}
      ],
      "payedFor": true,
      "customer": null
    }

Aggregate - example (1) - content
JSON:

    // customer
    {
      "id": 11,
      "fName": "Krešo",
      "ordAddress": {"street": "Unska 3", "city": "Zagreb"}
    }
    // order
    {
      "id": 1001,
      "customerId": 11,
      "items": [
        {"ProductId": 100, "name": "Intro to NoSQL databases", "price": 99.99}
      ],
      "shipAddress": {"street": "Šumski put", "city": "Zagreb"},
      "payment": {
        "transId": "ABBCCAX124",
        "orderAddress": {"street": "Unska 3", "city": "Zagreb"}
      }
    }

Aggregate - example (2)
[Figure not preserved]

Aggregate model - comment
• There is no general rule for where to set the boundaries of aggregates; it depends on the problem
• Good for the distribution of data: aggregates are cohesive units
• The relational model does not have this information; it is "aggregate ignorant", as is the graph model
• Aggregate ignorant does not mean bad
• The aggregate model can help or hinder, depending on the context:
  - fetch, save, distribute the orders
  - order data analysis for the last two months
• Main reason: in a distributed environment, we want to minimize the number of nodes accessed when gathering data for a task

Key-value databases
• Data model: (key, value) pairs
• Operations: Put(k, v), Get(k), Update(k, v), Delete(k)
• Some DBs support a certain structure of values and/or value attributes
• Some DBs support key range queries
• Examples: Riak, Dynamo, …

Document databases
• Like Key-Value, with the value being a document
• Data model: (key, document)
• Document: JSON, BSON, XML, YAML, some other semi-structured format, binary data
• Main operations: Put(k, d), Get(k), Update(k, d), Delete(k)
• Queries based on document content!
(not standardized, no query language)
• Some DBs support indexing
• Examples: CouchDB, MongoDB, SimpleDB, …

Example: MongoDB documents
Relational database: relation

fName  lName  DateBirth    BirthPlace
Ivan   Car    11.11.1971.  -
Iva    Kralj  -            Šibenik

MongoDB: collection

    {
      "_id": ObjectId("4efa8d2b7d284dad101e4bc9"),
      "fName": "Ivan",
      "lName": "Car",
      "BirthDate": "11.11.1971."
    },
    {
      "_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
      "fName": "Iva",
      "lName": "Kralj",
      "BirthPlace": "Šibenik"
    }

Example: MongoDB queries
Mongo queries are JSON (BSON) objects.

SQL:     CREATE TABLE student (mbr INT, lName CHAR(50))
MongoDB: implicitly, by putting the first document into the collection; also explicitly: db.createCollection("student")

SQL:     ALTER TABLE student ADD…
MongoDB: implicitly; every document can be changed, there is no schema

SQL:     INSERT INTO student VALUES (100, 'Šostakovič');
MongoDB: db.student.insert( {mbr: 100, lName: 'Šostakovič'} )

SQL:     SELECT * FROM student;
MongoDB: db.student.find();

SQL:     SELECT lName FROM student WHERE mbr = 100 ORDER BY lName;
MongoDB: db.student.find( {mbr: 100}, {lName: 1} ).sort({lName: 1});

SQL:     UPDATE student SET lName = 'Shostakovich' WHERE mbr = 100;
MongoDB: db.student.update( {mbr: 100}, {$set: {lName: 'Shostakovich'}} );

SQL:     DELETE FROM student WHERE mbr = 100;
MongoDB: db.student.remove( {mbr: 100} );

Aggregate model - KV & Document databases
• KV and document DBs are based on the aggregate data type
• KV DBs:
  - retrieval by key
  - value is a BLOB
• Document DBs:
  - retrieval based on a query
  - part of the document can be retrieved
  - indexing
  - constraints on the value (not everything can be inserted)
• In practice, the distinction between KV and document DBs is blurry

CF databases
• Chang et al. [2006], Bigtable: A Distributed Storage System for Structured Data
• Data model: column family
• Not a table!
• Two-level hash map, two-level aggregate
  - first-level key: row key
  - second-level key: column key
• Each column is a member of a single column family

CF example (1)
get('first', 'color:green')
[Figure not preserved]

CF example (2)
[Figure not preserved]

CF comment
Dual view of the data:
• By rows: each row can be considered an aggregate
• By columns: each CF defines a record type (e.g. customer), with rows for each record
• Row = JOIN of records in all CFs
Different row setups:
• Wide row, sort
• Skinny row

Digression: CF are not columnar DBs (1)
• Column-oriented DBMS: data stored by columns, e.g. C-Store (http://en.wikipedia.org/wiki/Columnar_database)
• Some vendors (Oracle, Informix, Microsoft, …) introduce a columnar storage model (as indexes) into RDBMSs
• Retrieves only the columns required to resolve queries (in a typical fact table, below 15%)
• Better compression
• Increased utilization of the buffer (better compression, often used columns)
[Figure: pages with data stored by rows vs. data stored by columns]

Digression: CF are not columnar DBs (2)
• Order of magnitude (sometimes several) faster query times
• Useful when data is often read and seldom written
• E.g.
Microsoft SQL Server 2012 Vertipaq*:
  - Test: 1 TB star join (1.44 billion rows), 32 processors, 256 GB RAM
  - Speed-up of at least tenfold, and possibly hundreds to thousands of times
  - Compression factor of 4-20 on real data
  - You cannot do INSERT
  - 2-3 times slower index creation in comparison to the B-tree
* http://download.microsoft.com/download/8/C/1/8C1CE06B-DE2F-40D1-9C5C-3EE521C25CE9/Columnstore%20Indexes%20for%20Fast%20DW%20QP%20SQL%20Server%2011.pdf

Graph databases
• Data model: nodes, edges (arcs), properties:
  - nodes can have properties (KV pairs)
  - edges have tags, directions, and start and end nodes
  - edges also have properties
• Interfaces and query languages are not standardized (Cypher, SPARQL, Gremlin)
• Example: [Figure: nodes 03 Ela, 01 Ana and 15 Ivo connected by "friend" and "acquaintance" edges]
• Some DBs: Neo4j, GraphDB, DEX, FlockDB, InfoGrid, OrientDB, Pregel, …

Graph databases - example
[Figure: customer node 11 Krešo with "ordered" and "last order" edges to order nodes racun:101 and racun:107; the order nodes have "shipAddress", "previous" and "contains" edges; product nodes ProductId 11, 17, 33 and 99, each with a name property]

Relational databases and relationships
"relational databases deal poorly with relationships" ☺
Friends of friends of my friends?
(reminder: FOAF, advanced SQL)

Depth  Execution Time - MySQL   Execution Time - Neo4j
2      0.016                    0.010
3      30.267                   0.168
4      1,543.505                1.359
5      Not Finished in 1 Hour   2.132
Source: http://www.neotechnology.com/how-much-faster-is-a-graph-database-really/

Graph database
"Strange fish in SQL pond"
• Breaks the data into even smaller units than RDB (others: aggregates; RDB: relations; graph DB: nodes)
• Not suitable for distribution
• Query language
• ACID
• In common with others: non-relational model, popularity
• Suitable for complex, semi-structured, highly connected data
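
The "friends of friends of my friends" question from the benchmark slide can be read as a graph traversal: start at one node and follow "friend" edges up to a given depth, instead of joining a friendship table with itself once per level. The following is a minimal Python sketch of that idea only. It assumes a hypothetical in-memory adjacency list (the FRIENDS data and the friends_at_depth helper are invented for illustration, and only the names Ana, Ela and Ivo come from the slides); it does not show how Neo4j or MySQL actually evaluate such a query.

```python
from collections import deque

# Hypothetical "friend" edges as an adjacency list (illustrative data, not from the slides).
FRIENDS = {
    "Ana": ["Ela", "Ivo"],
    "Ela": ["Ana", "Maja"],
    "Ivo": ["Ana", "Luka"],
    "Maja": ["Ela", "Nina"],
    "Luka": ["Ivo"],
    "Nina": ["Maja"],
}


def friends_at_depth(graph, start, depth):
    """Return, sorted, the people whose shortest friend-path from `start` has exactly `depth` hops."""
    distance = {start: 0}          # shortest known hop count for every visited person
    queue = deque([start])         # breadth-first frontier
    while queue:
        person = queue.popleft()
        if distance[person] == depth:
            continue               # do not expand beyond the requested depth
        for friend in graph.get(person, []):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return sorted(p for p, d in distance.items() if d == depth)


if __name__ == "__main__":
    for d in range(1, 4):
        print(d, friends_at_depth(FRIENDS, "Ana", d))
```

With this sample data, friends_at_depth(FRIENDS, "Ana", 2) returns ['Luka', 'Maja']. Each additional level only touches the neighbours of the nodes already reached, which is one informal way to see why traversal cost can depend on the size of the visited neighbourhood rather than on the size of the whole data set.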