Advanced Databases
Lectures
November 2013.
NoSQL
1/3
 ZPR-FER - Zagreb
Advanced Databases 2013/2014
1
NoSQL - agenda
1. Introduction
2. Distributed Databases
3. The Data Model: Key Value, Document, Column family, Graph DBs
4. Distribution Models
5. Consistency, version
6. MapReduce
7. Examples

Introduction - history
Impedance mismatch
1980: RDB
1990: OODB
2000: still RDB
2010: Web -> NoSQL

Introduction - relational databases
Relational databases
Persistence
Concurrency, ACID
Integration
Standard data model, query language
Impedance mismatch
Application vs Integration databases
Large amounts of data - scalability
Availability

Introduction - scalability
The ability of the system to cope with a growing amount of data while retaining acceptable performance
Vertical scalability (scale up):
Adding resources to a single node (memory, processor, ...)
Horizontal scalability (scale out):
Adding nodes to the system
Relational DBMSs have problems with horizontal scalability

Introduction - availability, clusters
RDB clusters
But we want:

Distributed and replicated relational databases
This topic is covered in more detail in Database Systems:
http://www.fer.unizg.hr/predmet/sbp
The following slides are taken (and simplified) from the Database Systems lectures

Distributed DB and distributed RDBMS
A distributed database (DDB) is a set of logically related databases deployed on different nodes of a computer network (LAN, MAN, WAN)
A distributed database management system (DDBMS) is a software system that manages a distributed database in such a way that the distribution is transparent to users
• A DDBMS includes n local DBMSs
• Each local DBMS, labelled Si (i = 1, ..., n), represents a single node (site) of the distributed system
• Each node Si can communicate directly or indirectly with each node Sj, i.e. there is two-way communication between any two nodes
• The nodes of a distributed database management system do not share physical components (disk, memory, CPU)

Distributed DB and distributed RDBMS
Nodes are able to perform transactions that require only local data access (local transactions), but also transactions that require access to data from different nodes (global transactions)
• nodes have a degree of local autonomy
• local applications (transactions)
• global applications (transactions)
A database is distributed if it supports at least one global application
Example (nodes ISVU and FerLib):
Local to ISVU: T1 - set exam grade
Local to FerLib: T2 - set book borrowed status
Global: T3 - check if all exams are passed, check if all books are returned, print diploma supplement

DDB design
An important part of the design of a distributed database is to determine how to distribute the data.
Data is placed at the nodes where it is most commonly used, which minimizes network traffic
Distribution design = fragmentation + allocation
Fragmentation schema
• division of the database into a disjoint set of fragments that together include all of the data in the database. The database must be reconstructable from these fragments without loss of information
• relations can be divided into fragments horizontally, vertically, or both
Allocation schema
• describes which fragment is assigned to which node of the distributed system

Fragmentation
Horizontal, e.g. two fragments: r(K, A, B) is split by rows into r1(K, A, B) and r2(K, A, B); r = r1 ∪ r2
Vertical, e.g. two fragments: r(K, A, B) is split by columns into (K, A) and (K, B); the key K is kept in both fragments and r is their join
Hybrid: r = r1 ∪ r2 (horizontal), r1 = r11 ⋈ r12 and r2 = r21 ⋈ r22 (vertical)

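The reconstruction rules above (union for horizontal fragments, join on the key for vertical fragments) can be sketched in a few lines of Python. This is only an illustration: the sample rows and the split predicate `K <= 2` are assumptions, not taken from the lecture.

```python
# Relation r(K, A, B) with hypothetical sample rows.
r = [
    {"K": 1, "A": "a1", "B": "b1"},
    {"K": 2, "A": "a2", "B": "b2"},
    {"K": 3, "A": "a3", "B": "b3"},
]

# Horizontal fragmentation: split the rows by a predicate; rebuild with a union.
r1 = [t for t in r if t["K"] <= 2]
r2 = [t for t in r if t["K"] > 2]
horizontal_rebuilt = r1 + r2

# Vertical fragmentation: split the columns, keeping the key K in every
# fragment; rebuild with a natural join on K.
v1 = [{"K": t["K"], "A": t["A"]} for t in r]
v2 = [{"K": t["K"], "B": t["B"]} for t in r]
v2_by_key = {t["K"]: t for t in v2}
vertical_rebuilt = [{**t, **v2_by_key[t["K"]]} for t in v1]


def as_set(rows):
    """Order-independent view of a relation, used to compare relations."""
    return {tuple(sorted(t.items())) for t in rows}
```

Both `as_set(horizontal_rebuilt)` and `as_set(vertical_rebuilt)` equal `as_set(r)`, i.e. no information is lost by either fragmentation.
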
Allocation
Fragment replication degree (factor)
• the number of nodes to which the fragment is allocated
• each fragment must be allocated to at least one node!
Partitioned (or non-replicated) DB
• each fragment is allocated to exactly one node, i.e. the degree of replication of each fragment = 1
Fully replicated DB
• each fragment is allocated to all nodes - each node contains a replica of the database, i.e. the degree of replication of each fragment = n (the number of nodes in the DDBMS)
Partially replicated DB
• the database is neither partitioned nor fully replicated (each fragment can be allocated to one, several or all nodes)

Allocation
Nodes S1, S2, S3
Fragments:
• student1, student2, student3
• faculty1
student1 = σ idFaculty = 36 (student)
student2 = σ idFaculty = 102 (student)
student3 = σ idFaculty = 81 (student)
Partially replicated DB:
S1: student1, faculty1
S2: student2, faculty1
S3: student3, faculty1

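As a rough illustration of the allocation terms, the sketch below computes the replication degree of each fragment and classifies the allocation. The node and fragment names follow the example; the function names `replication_degree` and `classify` are made up for this sketch.

```python
# Allocation schema from the example: node -> fragments stored at that node.
allocation = {
    "S1": {"student1", "faculty1"},
    "S2": {"student2", "faculty1"},
    "S3": {"student3", "faculty1"},
}


def replication_degree(allocation):
    """Return fragment -> number of nodes the fragment is allocated to."""
    degree = {}
    for fragments in allocation.values():
        for fragment in fragments:
            degree[fragment] = degree.get(fragment, 0) + 1
    return degree


def classify(allocation):
    """Classify an allocation as partitioned, fully or partially replicated."""
    degrees = replication_degree(allocation).values()
    n = len(allocation)
    if all(d == 1 for d in degrees):
        return "partitioned"
    if all(d == n for d in degrees):
        return "fully replicated"
    return "partially replicated"
```

Here `faculty1` has degree 3 and each `student` fragment has degree 1, so the allocation is classified as partially replicated.
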
Global transactions, subtransactions, local transactions
Example: DDBMS nodes: S1, S2, S3
The user initiates a global transaction T1 at node S2
Transaction T1 is mapped into a set of subtransactions: T1^1, T1^2, T1^3
Each subtransaction contains the operations that are executed at its node
Label: Ti^j is the subtransaction of global transaction Ti that is executed at node Sj
[Figure: T1 is initiated at S2; T1^1 runs at S1, T1^2 at S2 and T1^3 at S3]

Transaction in DDBMS
There is a fully functional DBMS at each node
A transaction can no longer be viewed as (only) a series of logically related operations that are executed in one DBMS
A global transaction is a set of coordinated subtransactions, executed at multiple nodes, that transform the distributed database from one consistent state to another
ACID?
• Consistency is relatively easily achieved through the usual mechanisms
• Durability is provided by the nodes' DBMSs, because each node guarantees the durability of its subtransaction
Much more difficult problems: Atomicity and Isolation

Atomicity
Atomicity of subtransactions is ensured by the local nodes
How to ensure the atomicity of the global transaction?
• during the execution of the global transaction, communication between nodes can break down, or one or more nodes can fail
• atomicity of the global transaction means that the DBMSs at all the nodes that perform the corresponding subtransactions must adopt and implement the same decision on the outcome of the transaction: either all subtransactions of a global transaction are executed, or none of them
• DDBMSs implement a commit protocol for global transactions:
• 2PC - two-phase commit

2PC - informal description (1)
There is a Transaction Manager (TM) at each node:
• its tasks are equivalent to those in a centralized system: recovery, isolation, ...
• difference: in addition to local transactions, it executes subtransactions (for its own node)
There is a Transaction Coordinator (TC) at each node:
• launches global transactions initiated at its location (node)
• distributes subtransactions to the appropriate nodes - orders individual TMs to execute the subtransactions
• orchestrates the completion of global transactions (initiated at its node) so that the corresponding subtransactions are either committed at all nodes or rolled back at all nodes

2PC - informal description (2)
The TC, which is located at the initiating node, distributes the transaction's subtransactions to the appropriate TMs
After the subtransactions are executed, all TMs report successful execution to the TC. Only then does 2PC begin!
Phase 1
The TC sends the GetReadyToCommit message to all the TMs. Every TM responds with Ready or NotReady, or does not respond.
Phase 2
If all nodes reply with Ready → the TC writes the decision to its log and sends the GlobalCommit message to all the TMs
If any of the nodes replies with NotReady or does not respond within the given time frame → the TC writes the decision to its log and sends the GlobalRollback message to all the TMs
Each TM writes the TC's decision to its log, confirms the decision to the TC, and commits or rolls back the subtransaction
When the TC receives responses from all the TMs, it writes the EndTransaction tag to its log

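The coordinator's decision rule can be sketched as below. This is a deliberately simplified model, not a real implementation: messages are replaced by a dictionary of votes, a missing reply is modelled as `None`, and failures during phase 2 are not modelled at all. The function name `two_phase_commit` is hypothetical; the log entries reuse the names from the description above.

```python
def two_phase_commit(votes):
    """Decide the outcome of a global transaction.

    votes maps a TM name to "Ready", "NotReady", or None (no reply received
    before the timeout). Returns the decision and the coordinator's log.
    """
    log = ["BeginCommit"]

    # Phase 1: the TC has sent GetReadyToCommit and collected the replies.
    all_ready = all(vote == "Ready" for vote in votes.values())

    # Phase 2: the decision is written to the log before it is sent out.
    decision = "GlobalCommit" if all_ready else "GlobalRollback"
    log.append(decision)

    # Assumption: every TM receives the decision and confirms it.
    confirmations = {tm: "decisionAccepted" for tm in votes}
    if len(confirmations) == len(votes):
        log.append("EndTransaction")
    return decision, log
```

A single `NotReady` vote or a single missing reply is enough to roll back the whole global transaction, which is what gives the global transaction its atomicity.
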
[Figure: 2PC state diagram with Transaction Coordinator (TC) states C1-C7 and Transaction Manager (TM) states P1-P8, which are referenced on the following slides]
TC: BEGIN; write BeginCommit; send GetReadyToCommit; WAIT; if all are ready: write GlobalCommit, send GlobalCommit, COMMIT; if someone is not ready: write GlobalRollback, send GlobalRollback, rollback; write EndTransaction
TM: BEGIN; if not ready: write rollback, send NotReady; if ready: write ready, send Ready; on global rollback: write rollback, send decisionAccepted, rollback; on global commit: write COMMIT, send decisionAccepted, COMMIT

2PC - informal description
If, due to a network failure or a remote node failure, a message is not received within the predetermined time (timeout), the TC or TM tries to continue in order to avoid blocking the transaction
C3: The TC is waiting for the decision of one or more TMs. The TC may decide to roll back the global transaction
C6, C7: The TC cannot determine whether all TMs have executed the decision to commit or roll back the subtransaction. The TC repeatedly polls the TMs that did not respond
P1: The TM is expecting a message from the TC telling it to start preparing for commit. After the timeout expires, the TM may unilaterally roll back the subtransaction. Should the TC send a GetReadyToCommit message afterwards, the TM responds with NotReady
P4: The TM has sent a Ready message, but it does not know the TC's final decision. The TM has to wait for communication with the TC to be re-established.

2PC - informal description - error at TC
If a TC that is recovering from a failure finds in its log that it was involved in the 2PC protocol when it failed, it performs the following actions, depending on the moment at which the failure occurred:
C1: After the recovery, the TC can restart the 2PC protocol in the usual way
C2, C3: The TC stopped working after it wrote BeginCommit to the log. After the recovery, it continues the protocol by sending the GetReadyToCommit messages
C4, C5, C6, C7: The TC stopped working after it wrote GlobalCommit or GlobalRollback to the log. After the recovery, it re-sends the appropriate message to the TMs.

2PC - informal description - error at TM
If a TM that is recovering from a failure finds in its log that it was involved in the 2PC protocol when it failed, it performs the following actions, depending on the moment at which the failure occurred:
P1: The TM stopped working before it wrote rollback or ready to the log. During the recovery, the TM unilaterally rolls back the transaction
P2: The TM stopped working after it wrote rollback to the log. The TM rolls back the subtransaction and leaves it to the TC to roll back the global transaction after the response timeout
P3, P4: The TM stopped working after it wrote Ready to its log. The TM sends the Ready message to the TC and waits for an answer, i.e. the final decision
P5, P6: The TM recognizes the outcome of the global transaction and acts accordingly
P7, P8: The TM does nothing because the outcome of the transaction has already been applied

Protocol blocking
A protocol is blocking if there is a possibility that a correctly functioning node (TC or TM) will not be able to complete the transaction due to the disruption or failure of another node
Example:
Point P4 in the previous picture
• The TM has sent the Ready message to the TC and is waiting for the TC's decision on the outcome of the global transaction. At that moment, communication with the TC fails
• The TM cannot unilaterally roll back the local transaction because it does not know what the TC has decided (maybe the TC managed to send the GlobalCommit message to all the other nodes)
• The TM has to wait for communication with the TC to be re-established (or for the TC's system to recover)
⇒ 2PC is a blocking protocol

Protocol independence with regard to recovery
A protocol is independent with regard to recovery if each node (TC or TM), after it has failed, can decide independently, without communicating with other nodes, the outcome of all (sub)transactions that were being executed at that node at the time of the failure
Example:
Point P4 in the previous picture
• The TM wrote Ready to the log and sent the Ready message to the TC. At this point, the TM fails
• When the TM starts the recovery, it determines that it was involved in the 2PC protocol at the moment of failure. It cannot decide whether to commit or roll back the transaction without the TC, so it sends a Ready message to the TC and waits for a response
⇒ The 2PC protocol is not independent with regard to recovery

Errors at DDBMSs
Error management in a DDBMS is more complex than in centralized systems:
• a centralized system either works as a whole or does not work
• parts of a DDBMS can be malfunctioning while other parts continue to operate
In addition to the failures that are typical for centralized systems (e.g. software and hardware errors, disk destruction), a DDBMS can experience additional types of failures:
• failure of one or more nodes
• loss of connections between nodes
• loss of messages
• network partition: the network is divided into several subsystems that cannot communicate. An even more complex problem: the node Si cannot determine whether a network partition occurred or node Sj simply stopped working

Disadvantages of DDBMS when compared to centralized DBMS
Significantly greater complexity of the system
Increased costs, e.g.
• expensive software
• more system administrators
Greater security problems
Higher costs of ensuring data integrity
Lack of standards
Lack of experience
More complex database design
Poor implementation of a distributed database can cause
• increased communication costs
• reduced availability of data
• reduced performance
DDBMS functionalities and techniques that are the result of a large body of research are not fully implemented in any of the currently available commercial systems.

Replicated databases
A fragment is replicated if it is allocated to more than one node
For a single logical element (tuple, fragment, relation) there are multiple physical elements (copies, replicas) x1, x2, ..., at nodes S1, S2, ...
[Figure: copies of one element at nodes S1, S2, S3]

Benefits of replicated DBs
Increased availability
• if a node that stores a copy of the fragment is unavailable, the system can access a copy of the fragment at another node
Decreased data transmission volume
• commonly used data is replicated and accessed locally
Parallel query execution
• a query that involves a fragment can be decomposed, and each part executed over one of the copies of the fragment

Disadvantages of replicated DBs
Consistency problem:
• the system must ensure the consistency of all copies. Write operations (insert, delete, update) on one copy of a fragment must be propagated to all nodes to which the fragment is allocated
• the large number of operations to be carried out at a number of nodes can decrease availability and increase the number of deadlocks (when synchronous replication is used) or decrease consistency (when asynchronous replication is used)

Synchronous (eager) protocols
All physical operations arising from the logical operations of the initial transaction are conducted within the boundaries of the initial transaction, that is, all copies have to be modified as part of the initial transaction
Full consistency
Good read performance
Worsened write performance
Extended transaction execution time, increased deadlocks, low availability (failure of one node prohibits write operations)
[Figure: initial transaction T with subtransactions T1, T2, T3, T4 at nodes S1, S2, S3, S4]

Asynchronous (lazy) protocols
Operations of the initial transaction are conducted exclusively at the initial node and do not depend in any way on communication with other nodes
The initial transaction can be completed before the changes are made to all copies. Changes to the other copies are performed asynchronously
High availability, good performance
High risk of inconsistent data
[Figure: initial transaction T (T1) at node S1; propagated transactions T2, T3, T4 at nodes S2, S3, S4]

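The trade-off between eager and lazy propagation can be illustrated with a toy model. Everything here is an assumption made for the sketch: replicas are plain dictionaries, a failed node is just a flag, and there is no locking or real transaction handling.

```python
class Replica:
    """A node holding one copy of the data (toy model)."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}


def eager_write(replicas, key, value):
    """Synchronous: all copies are modified as part of the initial transaction."""
    if not all(replica.up for replica in replicas):
        return False  # one failed node is enough to block the write
    for replica in replicas:
        replica.data[key] = value
    return True


def lazy_write(initial, others, key, value, queue):
    """Asynchronous: write at the initial node only, queue the propagation."""
    initial.data[key] = value
    for replica in others:
        queue.append((replica, key, value))
    return True


def propagate(queue):
    """Apply queued changes to reachable replicas, keep the rest pending."""
    pending = []
    for replica, key, value in queue:
        if replica.up:
            replica.data[key] = value
        else:
            pending.append((replica, key, value))
    queue[:] = pending
```

With one of three replicas down, `eager_write` refuses the write (low availability, full consistency), while `lazy_write` succeeds and leaves the failed replica stale until a later `propagate` call reaches it.
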
One-way protocols
one-way, master-ownership, primary-copy
For each logical element x there is only one master copy: xp
All write operations on x must first be performed on xp
Each node that contains at least one master copy is called a master
• single-master system: all master copies are at one node
• multi-master system: master copies of various elements are located at different nodes

Often used one-way protocols
1Master-nSlaves (dissemination)
Changes are made at exactly one master node and propagated to the subordinate nodes (slaves). The slave nodes are not allowed to perform transactions that include write operations
nMasters-1Slave (consolidation)
Updates performed at the subordinate (slave) nodes are propagated to exactly one parent (master) node. The master node cannot perform transactions that include a write operation
[Figure: dissemination and consolidation, showing which nodes accept updates and which serve only reads]

Two-way protocols
n-way, peer-to-peer, group-ownership, update-anywhere
The initial transaction can perform updates on any physical copy
System availability is considerably increased compared to a one-way system
If used in combination with an asynchronous protocol, transaction serializability cannot be guaranteed

Important disadvantages of two-way protocols
Non-serializable transactions can lead to hard-to-repair breaches of data consistency
Problem detecting conflicts: some conflicts can be discovered only after the propagation of changes (when the initial transaction has already been committed)
Problem resolving conflicts: may require cancelling committed transactions -> the durability property (of ACID) is weakened
Automatic conflict resolution is often not possible - human intervention is required
Example: fully replicated DB on nodes S1 and S2
Product (idProduct, prodName): (1, ASEA), (2, Gyr)
Device (idDev, idProduct, serNumber): (10, 1, M-10), (20, 1, M-14)
Referential integrity: Device.idProduct references Product.idProduct

Important disadvantages of two-way protocols
Example timeline (9:40 - 9:43), fully replicated DB on S1 and S2:
1. Initially both nodes hold Product = {(1, ASEA), (2, Gyr)} and Device = {(10, 1, M-10), (20, 1, M-14)}
2. One node executes and commits: INSERT INTO Device VALUES (30, 2, 'M-16')
   The other node executes and commits: DELETE FROM Product WHERE idProduct=2
3. Synchronization: each node propagates its change to the other node
   The propagated INSERT fails: ERROR-referential integrity: missing row
   The propagated DELETE fails: ERROR-referential integrity: still referencing row
Result: system delusion

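The anomaly can be reproduced with a small in-memory model. This is a sketch under stated assumptions: the `Node` class and `try_apply` helper are invented for illustration, the referential integrity check is hand-written, and which node runs which statement is chosen arbitrarily.

```python
class Node:
    """One replica holding the Product and Device relations."""

    def __init__(self):
        self.product = {1: "ASEA", 2: "Gyr"}
        self.device = {10: (1, "M-10"), 20: (1, "M-14")}

    def insert_device(self, id_dev, id_product, ser_number):
        if id_product not in self.product:
            raise ValueError("referential integrity: missing row")
        self.device[id_dev] = (id_product, ser_number)

    def delete_product(self, id_product):
        if any(ref == id_product for ref, _ in self.device.values()):
            raise ValueError("referential integrity: still referencing row")
        del self.product[id_product]


def try_apply(operation, *args):
    """Run an operation and report 'ok' or the integrity error message."""
    try:
        operation(*args)
        return "ok"
    except ValueError as error:
        return str(error)


a, b = Node(), Node()

# Both initial transactions succeed locally and commit.
local_a = try_apply(a.insert_device, 30, 2, "M-16")
local_b = try_apply(b.delete_product, 2)

# Asynchronous propagation: each change is replayed at the other node.
propagated_to_b = try_apply(b.insert_device, 30, 2, "M-16")
propagated_to_a = try_apply(a.delete_product, 2)
```

Both local transactions return `"ok"`, both propagated changes are rejected, and the two copies end up different: the conflict is detected only after both transactions have already been committed.
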
Important disadvantages of two-way protocols
Modern systems support two-way asynchronous replication, but a comprehensive solution to the described problem does not exist
Various systems offer different optional built-in functionalities that can help in specific cases. E.g., in some systems it is possible to:
• propagate a (user-defined) stored procedure that handles possible conflicts, instead of propagating SQL commands
• use timestamps to detect a possible conflict and roll back the initial transaction (how does this affect durability?)
• apply a resolution rule: last wins, first wins, greatest value wins, ...

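The simple resolution rules named above can be sketched as functions over versioned values. The representation of a version as a `(timestamp, value)` pair is an assumption of this sketch; real systems differ in how they timestamp and compare versions.

```python
def last_wins(local, incoming):
    """Keep the version with the newer timestamp."""
    return incoming if incoming[0] > local[0] else local


def first_wins(local, incoming):
    """Keep the version with the older timestamp."""
    return incoming if incoming[0] < local[0] else local


def greatest_value_wins(local, incoming):
    """Keep the version with the greater value, regardless of time."""
    return incoming if incoming[1] > local[1] else local
```

For example, with `local = (941, 5)` and `incoming = (942, 3)`, `last_wins` keeps `(942, 3)` while the other two rules keep `(941, 5)`. In every case one committed update is silently discarded, which is why such rules help only in specific cases.
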
NoSQL
NoSQL databases
First used as the name of a (relational) DBMS developed by Carlo Strozzi in 1998
Used again (as a Twitter hashtag) in 2009 for a conference on "distributed, non-relational, open source" databases organized by Johan Oskarsson
No + SQL:
"SQL" means "traditional" relational DBMS
• Initially interpreted as "do not use SQL", i.e. do not use relational DBMSs
• Not Only SQL - solutions that are not based solely on relational technologies

NoSQL informal definition
Informal definition (taken from: http://nosql-database.org/):
Next-generation databases, usually having the following features: non-relational, distributed, open-source and horizontally scalable ...
... They often have additional properties: no fixed data model (schemaless), easy replication, simple API, BASE (not ACID), working with a large amount of data, etc.
Keywords: open source, non-relational, distributed, 21st century web, schemaless

NoSQL vs RDB
RDBs are the best one-size-fits-all solution we have
NoSQL databases are specialized solutions for certain (types of) problems

Data Model
The data model - an introduction
The model used to represent and handle the data, e.g. the relational model
<> physical model (which we mostly do not need to know)
There is no "real" or "correct" model of the world or of a domain
In NoSQL, four data models:
1. Key Value
2. Document
3. Column family (<> column, columnar)
4. Graph

Aggregate model
The term comes from Domain-Driven Design
An aggregate is:
• a complex record that allows lists and object nesting
• a set of objects handled as a single record (e.g. an order and its order items)
• one root, through which the aggregate is referenced and its integrity is ensured
The basic unit of data - an aggregate is saved and/or read as a whole
One aggregate ~ one "transaction"

Aggregate - example (1)
composition

Digression: JSON reminder (1)
JSON - JavaScript Object Notation
Plain-text, human readable
Does not depend on the programming language
Hierarchical (nesting)
JSON vs XML:
• can be evaluated in JavaScript with eval() (but "eval is evil", use a JSON parser)
• easier to work with than XML
• shorter than XML
• no end tags
• arrays
name:value pairs, e.g. "name":"Joe"

Digression: JSON reminder (2)
A value can be:
• Number (int, real)
• String (enclosed in "")
• Boolean (true/false)
• Array (enclosed in [])
• Object (enclosed in {})
• null
{
  "id": 1001,
  "name": "Order no. 13/2013",
  "total": 21.98,
  "details": [
    {"ProductId": 100, "name": "Chocolate", "price": 9.99},
    {"ProductId": 101, "name": "Jam", "price": 11.99}
  ],
  "payedFor": true,
  "customer": null
}

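As a quick check of how the JSON value types map onto a programming language, the order document above can be parsed with Python's standard `json` module. The variable names are arbitrary; only `json.loads` comes from the library.

```python
import json

text = """
{
  "id": 1001,
  "name": "Order no. 13/2013",
  "total": 21.98,
  "details": [
    {"ProductId": 100, "name": "Chocolate", "price": 9.99},
    {"ProductId": 101, "name": "Jam", "price": 11.99}
  ],
  "payedFor": true,
  "customer": null
}
"""

# Use a JSON parser instead of evaluating the text as code.
order = json.loads(text)

# JSON object -> dict, array -> list, true -> True, null -> None.
item_names = [item["name"] for item in order["details"]]
computed_total = round(sum(item["price"] for item in order["details"]), 2)
```

The nested array becomes a list of dictionaries, so the item prices can be summed directly and compared with the stored `total`.
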
Aggregate - example (1) - content
JSON:
// customer
{
  "id": 11,
  "fName": "Krešo",
  "ordAddress": {"street": "Unska 3", "city": "Zagreb"}
}
// order
{
  "id": 1001,
  "customerId": 11,
  "items": [
    {"ProductId": 100, "name": "Intro to NoSQL databases", "price": 99.99}
  ],
  "shipAddress": {"street": "Šumski put", "city": "Zagreb"},
  "payment": {
    "transId": "ABBCCAX124",
    "orderAddress": {"street": "Unska 3", "city": "Zagreb"}
  }
}

Aggregate - example (2)

Aggregate model - comment
There is no general rule for where to draw the boundaries of aggregates - it depends on the problem
Good for the distribution of data: aggregates are cohesive units
The relational model does not have this information - it is "aggregate ignorant", as is the graph model
Aggregate ignorant <> bad
The aggregate model can help or hinder, depending on the context:
• helps: fetch, save, distribute the orders
• hinders: analysis of order data for the last two months
Main reason to use it: in a distributed environment, we want to minimize the number of nodes accessed when gathering data for a task

Key-value databases
Data model: (key, value) pairs
Operations:
• Put(k, v)
• Get(k)
• Update(k, v)
• Delete(k)
Some DBs support a certain structure of values and/or value attributes
Some DBs support key range queries
Examples: Riak, Dynamo,…

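The four operations, plus an optional key range query, can be sketched with an in-memory dictionary. This is a toy model, not the API of Riak or Dynamo; the class and method names are made up, and the range query simply scans the sorted keys.

```python
class KeyValueStore:
    """Minimal in-memory key-value store; values are opaque to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(key)
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def key_range(self, low, high):
        """Return (key, value) pairs with low <= key <= high, ordered by key."""
        return [(k, self._data[k]) for k in sorted(self._data) if low <= k <= high]
```

Note that every lookup goes through the key; the store never inspects the value, which is the main difference from the document databases described next.
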
Document databases
Like Key-Value, with the value being a document
Data model: (key, document)
Document: JSON, BSON, XML, YAML, some other semi-structured format, binary data
Main operations:
• Put(k, d)
• Get(k)
• Update(k, d)
• Delete(k)
Queries based on document content! (not standardized, no common query language)
Some DBs support indexing
Examples: CouchDB, MongoDB, SimpleDB,…

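What "queries based on document content" adds over a key-value store can be shown with a toy in-memory document store. The class, the `find` method and the equality-only criteria are assumptions of this sketch and do not correspond to any particular product's API.

```python
class DocumentStore:
    """Minimal in-memory document store: key -> document (a dict)."""

    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        self._docs[key] = document

    def get(self, key):
        return self._docs.get(key)

    def delete(self, key):
        self._docs.pop(key, None)

    def find(self, criteria):
        """Return documents whose fields equal all the given criteria."""
        return [
            doc
            for doc in self._docs.values()
            if all(field in doc and doc[field] == value
                   for field, value in criteria.items())
        ]
```

Unlike a plain key-value store, the store looks inside the value: `find({"city": "Zagreb"})` returns matching documents without knowing their keys, and documents in the same store need not have the same fields.
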
Example: MongoDB documents
Relational database: relation
fName   lName   DateBirth     BirthPlace
Ivan    Car     11.11.1971.   NULL
Iva     Kralj   NULL          Šibenik
MongoDB: collection
{
  "_id": ObjectId("4efa8d2b7d284dad101e4bc9"),
  "fName": "Ivan",
  "lName": "Car",
  "BirthDate": "11.11.1971."
},
{
  "_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
  "fName": "Iva",
  "lName": "Kralj",
  "BirthPlace": "Šibenik"
}

Example: MongoDB queries
Mongo queries are JSON (BSON) objects
SQL: CREATE TABLE student(mbr INT, lName CHAR(50))
MongoDB: implicitly, by putting the first document into the collection; also explicitly: db.createCollection("student")
SQL: ALTER TABLE student ADD…
MongoDB: implicitly, every document can be changed, there is no schema
SQL: INSERT INTO student VALUES (100, 'Šostakovič');
MongoDB: db.student.insert({mbr: 100, lName: 'Šostakovič'})
SQL: SELECT * FROM student;
MongoDB: db.student.find();
SQL: SELECT lName FROM student WHERE mbr = 200 ORDER BY lName;
MongoDB: db.student.find({mbr: 200}, {lName: 1}).sort({lName: 1});
SQL: UPDATE student SET lName = 'Shostakovich' WHERE mbr = 100;
MongoDB: db.student.update({mbr: 100}, {$set: {lName: 'Shostakovich'}});
SQL: DELETE FROM student WHERE mbr = 100;
MongoDB: db.student.remove({mbr: 100});

Aggregate model - KV & Document databases
KV and document DBs are based on the aggregate data model
KV DBs
• retrieval by key
• the value is a BLOB
Document DBs
• retrieval based on a query
• part of a document can be retrieved
• indexing
• constraints on the value (not everything can be inserted)
In practice, the distinction between KV and document DBs is blurry

CF databases
Chang et al. [2006], Bigtable: A Distributed Storage System for Structured Data
Data model: column family
Not a table!
Two-level hash map, two-level aggregate
• first level key: row key
• second level key: column key
Each column is a member of a single column family

CF example (1)
get('first', 'color:green')

CF example (2)

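The two-level map behind a call such as `get('first', 'color:green')` can be sketched with nested dictionaries. The stored values, the second column family `shape` and the reading of `'color:green'` as `family:column` are assumptions made for this illustration; the original figure is not available.

```python
# row key -> column family -> column -> value (hypothetical sample data)
rows = {
    "first": {
        "color": {"green": "#0F0", "red": "#F00"},
        "shape": {"round": 1},
    },
    "second": {
        "color": {"blue": "#00F"},
    },
}


def get(row_key, column_key):
    """Look up a value by row key and a 'family:column' column key."""
    family, column = column_key.split(":", 1)
    return rows[row_key][family][column]


def get_family(row_key, family):
    """Return all columns of one column family for a row."""
    return rows[row_key].get(family, {})
```

Rows need not have the same columns: the row `second` has no `shape` family at all, which is one reason a column family is "not a table".
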
CF comment
Dual view of the data:
• By rows: each row can be considered an aggregate
• By columns: each CF defines a record type (e.g. customer), with rows for each record
Row = JOIN of records in all CFs
Different row setups:
• Wide row (sort)
• Skinny row

Digression: CF are not columnar DBs (1)
Column-oriented DBMS: data is stored by columns, e.g. C-Store
http://en.wikipedia.org/wiki/Columnar_database
Some vendors (Oracle, Informix, Microsoft, …) introduce the columnar storage model (as indexes) into RDBMSs
Data stored by columns, compared to data stored by rows:
• retrieves only the columns required to resolve queries (in a typical fact table, below 15%)
• better compression
• increased utilization of buffer pages (better compression, often used columns)

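The difference between the two layouts can be sketched with plain Python lists. The sample table and the use of run-length encoding as the compression scheme are assumptions of this sketch; real columnar engines use a variety of encodings.

```python
# Row storage: one record per row (hypothetical fact table).
rows = [
    {"id": 1, "city": "Zagreb", "amount": 10},
    {"id": 2, "city": "Zagreb", "amount": 20},
    {"id": 3, "city": "Split", "amount": 5},
]

# Column storage: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column reads whole rows in the row layout ...
total_from_rows = sum(row["amount"] for row in rows)
# ... but only a single column in the columnar layout.
total_from_columns = sum(columns["amount"])


def run_length_encode(values):
    """Compress runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


encoded_city = run_length_encode(columns["city"])
```

Values within one column tend to repeat, so a column such as `city` compresses well, which is one source of the better compression and buffer utilization mentioned above.
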
Digression: CF are not columnar DBs (2)
Query times faster by an order of magnitude (sometimes several)
Useful when data is read often and written seldom
E.g. Microsoft SQL Server 2012 Vertipaq*:
Test: 1 TB star join (1.44 billion rows), 32 processors, 256 GB RAM:
• speedups of hundreds to thousands of times are possible, and at least tenfold
• compression factor of 4-20 on real data
• INSERT is not possible
• index creation 2-3 times slower than for a B-tree
* http://download.microsoft.com/download/8/C/1/8C1CE06B-DE2F-40D1-9C5C-3EE521C25CE9/Columnstore%20Indexes%20for%20Fast%20DW%20QP%20SQL%20Server%2011.pdf

Graph databases
Data model: nodes, edges (arcs), properties:
• Nodes can have properties (KV pairs)
• Edges have labels, a direction, a start node and an end node
• Edges can also have properties
Interfaces and query languages are not standardized (Cypher, SPARQL, Gremlin)
Example: [Figure: nodes 01 (Ana), 03 (Ela) and 15 (Ivo), connected by "friend" and "acquaintance" edges]
Some DBs: Neo4j, GraphDB, DEX, FlockDB, InfoGrid, OrientDB, Pregel, …

Graph databases - example
[Figure: customer node 11 (Krešo) is connected by "ordered" edges to orders 101 and 107 and by a "last order" edge to one of them; the orders are linked by a "previous" edge; each order has a "shipAddress" edge and "contains" edges to product nodes (ProductId 11, 17, 33, 99, each with a name)]

Relational databases and relationships
"relational databases deal poorly with relationships" ☺
Friends of friends of my friends? (reminder: FOAF, advanced SQL)
Depth   Execution time (s) - MySQL   Execution time (s) - Neo4j
2       0.016                        0.010
3       30.267                       0.168
4       1,543.505                    1.359
5       Not finished in 1 hour       2.132
http://www.neotechnology.com/how-much-faster-is-a-graph-database-really/

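A friends-of-friends query is a bounded graph traversal, which the sketch below expresses as a breadth-first search over an adjacency map. The graph data and the function name `friends_within` are invented for illustration, and the sketch says nothing about how MySQL or Neo4j execute such queries internally.

```python
from collections import deque

# Undirected "friend" relationships as an adjacency map (hypothetical data).
friends = {
    "Ana": {"Ela", "Ivo"},
    "Ela": {"Ana", "Iva"},
    "Ivo": {"Ana"},
    "Iva": {"Ela", "Jan"},
    "Jan": {"Iva"},
}


def friends_within(graph, start, depth):
    """Return everyone reachable from start in at most `depth` friend hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, distance = frontier.popleft()
        if distance == depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, distance + 1))
    return seen - {start}
```

Each extra level of depth only follows the edges of the nodes reached so far, whereas the equivalent SQL needs one more self-join of the friendship table per level.
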
Graph database
"Strange fish in the NoSQL pond"
Differences from the other NoSQL databases:
• breaks the data into even smaller units than an RDB (other NoSQL: aggregates; RDB: relations; graph DB: nodes)
• not suitable for distribution
• query language
• ACID
In common with the others: non-relational model, popularity
Suitable for complex, semi-structured, highly connected data