Introduction to NoSQL - CS 609: Database Management
NoSQL Databases
C. Strauch
RDBMS
• RDBMS – the dominant model
• Different approaches – object DBs, XML stores
– Wanted the same market share as relational
– Were absorbed by relational systems or became niche products
• One size fits all?
– Movement – NoSQL
• does not include pre-existing non-relational systems, e.g. object or
pure XML databases
NoSQL Movement
• Term first used in 1998
– for a relational DBMS that omitted the use of SQL
• Used again in 2009
– for non-relational stores, by a Rackspace employee/blogger
– purpose: to seek alternatives for problems where relational is a
bad fit
– Relational viewed as
• slow, expensive
– Startups built their own datastores
• Amazon’s Dynamo, Google’s Bigtable, Facebook’s
Cassandra
– Cassandra – writes reported 2500 times faster than MySQL
• Open source
Benefits of NoSQL
• NoSQL more efficient?
– Avoids Complexity
• Relax ACID properties – consistency
• Store in memory
– Higher throughput
• Column store
– Scalability, Commodity hardware
• Horizontal, single server performance, schema design
– Avoid expensive object-relational mapping
• Data structures are simpler, e.g. key/value store
• SQL awkward for procedural code
– Mapping into SQL only pays off for repeated
manipulations
– SQL is overkill for a simple DB structure
Benefits of NoSQL
– Complexity/cost of setting up DB clusters
• Clusters can be expanded without cost and
complexity of sharding
– Compromise reliability for performance
• Applications willing to compromise
– May not store in persistent storage
– One solution (RDBMS) for all is wrong
• Growth of data to be stored
• Process more data faster
– Facebook doesn’t need ACID
Myth
– Myth: centralized data models can be distributed
and partitioned effortlessly
• Doesn’t necessarily scale or work
correctly without effort
• At first replicate, then shard
– Failures within a network differ from those in
a single machine
– Latency, distributed failures, etc.
Motivation for NoSQL
– Movement in programming languages and
development frameworks
• Trying to hide the use of SQL
– Ruby on Rails, object-relational mappers in
Java; some completely omit SQL, e.g. SimpleDB
– Requirements of Cloud Computing
• Ultimate scalability, low administration
overhead
– Data warehousing, key/value stores,
additional features to bridge gap with
RDBMS
Motivation for NoSQL
• More motivation of NoSQL
• Less overhead and memory footprint than
RDB
• Usage of Web technology and RPC calls
for access
• Optional forms of queries
SQL Limitations
– RDBMS + caching vs. built from scratch scalable
systems
• cache objects in memcached for high read loads
• Less capable as systems grow
– RDBMS place computation on reads instead of
avoiding complex reads
– Serial nature of applications waiting for I/O
– Large-scale data
– Operating costs of running and maintaining
systems (Twitter)
• Some sites make use of RDBMSs, but scale is more
important
SQL Limitations
• Yesterday’s vs. today’s needs
– Instead of high end servers, commodity
servers that will fail
– Data no longer rigidly structured
– RDBs designed for centralized not distributed
• Synchronization not implemented
efficiently, requires 2PC or 3PC
False assumptions
• False assumptions about distributed computing:
– Network is reliable
– latency is 0
– bandwidth is infinite
– network is secure
– topology doesn’t change
– one administrator
– transport cost is 0
– network is homogeneous
• Do not try to hide distribution from applications
Positives of RDBMS
• Historical positives of RDBMS:
– Disk oriented storage and indexing structures
– Multi threading to hide latency
– Locking-based concurrency control
– Log-based recovery
SQL Negatives
• Requires rewrite because not good for:
– Text
– Data warehouses
– Stream processing
– Scientific and intelligence databases
– Interactive transactions
– Direct SQL interfaces are rare
RDBMS Changes needed
• Rewrite of RDBMS
– Main memory – currently disk-oriented solution
for main memory problems
• Disk based recovery log files – impact
performance negatively
– Multi-threading and resource control –
• Transactions affect only a few data sets – cheap
if in memory; if long, break them up
• No need for multithreaded execution with
elaborate code to minimize CPU and disk
usage if data is in memory
RDBMS Changes needed
– DBs developed from shared memory to
shared disk
• Today - shared nothing
– Horizontal partitioning of data over several
nodes
– No need to reload and no downtimes
– Can transfer parts of data between nodes
without impacting running transactions
RDBMS Changes needed
– Transactions, processing environment
• Avoid persistent redo-logs
• Avoid client-server ODBC communication; instead
run application logic like stored procedures in
process
• Eliminate undo log
• Reduce dynamic locking
• Single-threaded execution to eliminate latching
and overhead for short transactions
• 2PC avoided whenever possible – network
roundtrips take order of ms
RDBMS Changes needed
– High Availability and Disaster Recovery
• Keep multiple replicas consistent
• Assumed shared nothing at the bottom of system
• Disperse load across multiple machines (P2P)
• Inter machine replication for fault tolerance
– Instead of redo log, just refresh data from
operational site
– Only need undo log in memory that can delete
when commit
– No persistent redo log, just transient undo
RDBMS Changes needed
– No knobs –
• Computers were expensive, people cheap
• Now computers cheap, people expensive
• But a DBA is needed to maximize performance (used to
have complex tuning knobs)
• Need self tuning
Future
– No one size fits all model
– No more one size fits all language
• Ad-hoc queries rarely deployed
• Instead prepared statements, stored procedures
• Data warehouses have complex queries not
fulfilled by SQL
• Embed DB capabilities in programming languages
– E.g. modify Python, Perl, etc.
Types of NoSQL DBs
• Researchers have proposed many
different classification taxonomies
• We will use three categories:
– Key-value stores
– Document stores
– Column stores
Key-value store
• Key–value (k, v) stores allow the application to store
its data in a schema-less way
• Keys – can be ?
• Values – objects not interpreted by the system
– v can be an arbitrarily complex structure with its own
semantics or a simple word
– Good for unstructured data
• Data could be stored in a datatype of a programming
language or an object
• No metadata
• No need for a fixed data model
Key-Value Stores
• Simple data model
– Map/dictionary
– Put/get values per key
– Length of keys limited, few limitations on
values
– High scalability favored over consistency
– No complex ad-hoc querying and analytics
– No joins or aggregate operations (a minimal usage
sketch follows below)
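As a concrete illustration of the put/get interface described above, here is a minimal, self-contained Python sketch of the key/value model (an in-process dictionary standing in for a real store such as Redis or Voldemort; class and key names are illustrative only):

class KVStore:
    """Minimal key/value store: keys map to opaque values the store never interprets."""
    def __init__(self):
        self._data = {}          # plain dict standing in for the storage engine

    def put(self, key, value):
        self._data[key] = value  # value may be bytes, a JSON-like object, anything

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user:42", {"name": "Alice", "cart": ["a", "b"]})  # complex value, no schema
store.put("page:/home", b"<html>...</html>")                 # or just raw bytes
print(store.get("user:42"))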
Document Store
• Next logical step after key-value to slightly more
complex and meaningful data structures
• Notion of a document
• No strict schema that documents must conform to
• Documents encapsulate and encode data in
some standard formats or encodings
• Encodings include:
– XML, YAML, and JSON
– binary forms like BSON, PDF and Microsoft Office
documents
Document Store
• Documents can be organized/grouped as:
– Collections
– Tags
– Non-visible Metadata
– Directory hierarchies
Document Store
• Collections – tables
• documents – records
– But not all documents in a collection have same
fields
• Documents are addressed in the database
via a unique key
• Beyond the simple key-document (or key-value)
lookup, an API or query language allows retrieval of
documents based on their contents
Document Store
• More functionality than key-value
• More appropriate for semi-structured data
• Recognizes the structure of stored objects
• Objects are documents that may have
attributes of various types
• Objects grouped into collections
• Simple query mechanisms to search
collections for attribute values
Column Store
• Stores data tables in column order
– Relational systems store in row order
– Advantages for data warehouses, customer
relationship management (CRM) systems
– More efficient when:
• Aggregates are computed over many rows of the same column
• New values of a column are written for many rows
• Compressing, since all values in a column have the same type
Basic Concepts of NoSQL
Distributed issues
The Usual
• CAP
– Table 3.1
• ACID vs. BASE
– Strict and eventual
– Table 3.2
Choice / Traits / Examples:
• Consistency + Availability (forfeit partitions)
– Traits: 2-phase commit, cache-validation protocols
– Examples: single-site databases, cluster databases, LDAP, xFS file system
• Consistency + Partition tolerance (forfeit availability)
– Traits: pessimistic locking, make minority partitions unavailable
– Examples: distributed databases, distributed locking, majority protocols
• Availability + Partition tolerance (forfeit consistency)
– Traits: expirations/leases, conflict resolution, optimistic concurrency
– Examples: Coda, web caching, DNS
Table 3.1.: CAP-Theorem – Alternatives, Traits, Examples (cf.
[Bre00, slides 14–16])
ACID:
• Strong consistency
• Isolation
• Focus on “commit”
• Nested transactions
• Availability?
• Conservative (pessimistic)
• Difficult evolution (e.g. schema)
BASE:
• Weak consistency – stale data OK
• Availability first
• Best effort
• Approximate answers OK
• Aggressive (optimistic)
• Simpler!
• Faster
• Easier evolution
Table 3.2.: ACID vs. BASE (cf. [Bre00, slide 13])
Replication
• Full vs. partial replication
• Which copy to access
• Improves performance for global queries
but updates a problem
• Ensure consistency of replicated copies of
data
Partitioning
• Can distribute a whole relation at a site
or
• Data partitioning (partitions formerly known as fragments)
– logical units of the DB assigned for storage at
various sites
– horizontal partition – a subset of the tuples in the
relation (select)
– vertical partition – keeps only certain
attributes of the relation (project); needs the PK
Partitioning cont’d
• Horizontal partitioning:
– complete – set of fragments whose conditions
include every tuple
– disjoint – each tuple is a member of only 1 fragment
• e.g. salary < 5000 AND dno = 4
• Vertical partitioning:
– Complete set can be obtained by:
L1 ∪ L2 ∪ ... ∪ Ln = attributes of R,
where Li ∩ Lj = PK(R) for i ≠ j
RDBMS Example
replication/partitioning
• Example of fragments for company DB:
site 1 - company headquarters gets
entire DB
site 2, 3 – horizontal partitioning
based on dept. no.
Versioning and multiple copies
• If datasets distributed, strict consistency not
ensured by distributed transaction protocols
– How to handle concurrent modifications?
– To which values will a dataset converge?
Versioning
• How to handle this?
– Timestamps –
• Relies on synchronized clocks (a problem) and
doesn’t capture causality
– Optimistic locking
• Unique counter or clock value for each data item
• When updating, supply the counter/clock value
• For causality reasoning on versions
– Requires saving history and a total order of version
numbers
Versioning
– Multiversion storage
• Store timestamp for each table cell (any type of
TS)
• For a given row, multiple versions can exist
concurrently
• Can access most recent or most recent before Ti
• Provides optimistic concurrency control via compare-and-swap on TSs
• Can take snapshots
– Vector clocks
• Require no synchronized clocks, no total order of version
numbers, no storage of multiple versions
Vector Clocks
• Defined as tuple V[0], V[1], …, V[n] of clock values from
each node
• Represents state of self (node i) and replica nodes state
(as far as it knows) at a given time:
– Vi[0] for clock value of node 0, Vi[i] for itself, etc.
• Clock values may be TS from a local clock, version/
revision values, etc.
V2[0] = 45; V2[1] = 3, V2[2] = 55
• Node 2 perceives update on node 1 led to revision 3,
update on node 0 revision 45 and on itself revision 55
Vector Clocks
• Updated by rules:
– Internal update will increment its clock Vi[i]
• Seen immediately
– If node i sends message to k, advances its own clock
value Vi[i], attaches vector Vi to message to node k
– If node i receives a message with Vmessage from j, it advances Vi[i]
and merges its own vector clock with Vmessage so that:
• Vi = max(Vi, Vmessage)
• This creates a partial order
– Vi > Vj if for all k Vi[k] > Vj[k]
– If neither Vi > Vj nor Vi < Vj: concurrent updates, and the client
application must resolve the conflict (see the sketch below)
• Discuss paper
Figure 3.1.: Vector Clocks (taken from [Ho09a])
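A minimal Python sketch of the update rules and the partial order above (the dominance test uses the common componentwise reading – greater-or-equal everywhere and strictly greater somewhere – which is slightly more permissive than the strict “>” stated on the slide; node ids and names are illustrative):

class VectorClock:
    def __init__(self, n_nodes, node_id):
        self.node_id = node_id
        self.clock = [0] * n_nodes

    def internal_update(self):
        # Rule 1: a local update increments the node's own entry V_i[i]
        self.clock[self.node_id] += 1

    def send(self):
        # Rule 2: before sending, advance own entry and attach the vector to the message
        self.internal_update()
        return list(self.clock)

    def receive(self, v_message):
        # Rule 3: advance own entry and merge element-wise: V_i = max(V_i, V_message)
        self.internal_update()
        self.clock = [max(a, b) for a, b in zip(self.clock, v_message)]

def dominates(v1, v2):
    # v1 > v2 if every entry is >= and at least one is strictly greater
    return all(a >= b for a, b in zip(v1, v2)) and any(a > b for a, b in zip(v1, v2))

def concurrent(v1, v2):
    # Neither clock dominates the other: concurrent updates, the client must resolve
    return v1 != v2 and not dominates(v1, v2) and not dominates(v2, v1)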
Vector clocks for consistency
models
• Can use vector clocks to resolve
consistency between writes on multiple
replicas
– Nodes don’t keep vector clocks for clients, but
– Clients can keep a vector clock of last replica
node talked to
– Use this depending on consistency model
required
• E.g. For monotonic read, attach last clock received
Vector Clock paper
• TS array maintained locally - Initially all values 0
• Local clock value incremented at least once before each
atomic event
• Current value of entire TS array piggybacked onto each
outgoing message
• When receive message m, value of each entry in the TS
array is the max(TS array value, m’s value)
• Value corresponding to sender set to one greater than
value received unless TS array has a greater value (no
overtaking)
• Receives all values even non-neighbor values
Vector clock advantages
• No dependence on synchronized clocks
• No total order of numbers required for
causal reasoning
• No need to store and maintain multiple
versions of data on all nodes
Vector Clocks Gossiping
• Assume database partitioned
• State Transfer Model
– Data exchanged between clients and DB
servers and among DB servers
– DB servers maintain vector clocks
– Clients also maintain vector clocks
– State – actual data, state version trees
– Vstate – vector clock corresponding to last
updated state of its data
Vector Clocks Gossiping
• Query processing
– Client sends vector clock along with query to
server
– Server sends piece of data that precedes
vector clock (monotonic read)
– Server also sends own vector clock
– Client merges vector clock with server’s clock
Vector Clocks Gossiping
• Update processing
– Client sends vector clock with update to
server
– If the client’s state precedes the server’s, the server omits the
update (it already has more recent data)
– Else the server applies the update (see the sketch after the figure below)
• Internode gossiping
– Replicas responsible for same partitions of
data exchange vector clocks, version trees in
background to merge to keep synchronized
Figure 3.2.: Vector Clocks – Exchange via Gossip in State Transfer Mode (taken from [Ho09a])
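A hedged Python sketch of the query/update decisions just described, with vector clocks as plain lists of ints. The monotonic-read check here is one common reading (only answer if the server’s clock is at least as new as the client’s); all function names are illustrative:

def precedes(a, b):
    # "a happened before or equals b": every entry of a is <= the matching entry of b
    return all(x <= y for x, y in zip(a, b))

def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

def handle_query(server_clock, server_state, client_clock):
    # Monotonic read: answer only if the server's view is at least as new as the client's
    if precedes(client_clock, server_clock):
        return server_state, merge(client_clock, server_clock)  # client merges clocks
    return None, client_clock      # this replica lags behind the client; try another

def handle_update(server_clock, client_clock, apply_update):
    # If the client's state precedes the server's, the server already has newer data: omit
    if precedes(client_clock, server_clock) and client_clock != server_clock:
        return server_clock
    apply_update()                 # otherwise apply the update and merge the clocks
    return merge(server_clock, client_clock)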
Operation Transfer Mode
• Send operations instead of data
– Saves?
• Can buffer operations along with
sender/receiver vector clocks, and order them
causally
• Can group operations together for
execution – may need a total order
Updates, Versions and Replicas
• http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks
• Not blind writes, but updates
• Suppose 3 replicas for each data item
• V1: ('email': '[email protected]', 'phone': '555-5555')
at all 3 replicas
• V2 is an update to V1: ('email': '[email protected]',
'phone': '555-5555'), written only to 2 replicas
• V3 is an update (new phone) applied to the replica that did not
receive V2, so it is based on V1:
('email': '[email protected]', 'phone': '444-4444')
• Should be: ('email': '[email protected]', 'phone': '444-4444')
Solution?
• Some NoSQL DBs just use the last write (last-write-wins), so
data would be lost
• Other NoSQL DBs keep both V1 and V2
– Difficult to deal with multiple versions in
practice
• With vector clocks, client can merge
simple objects
– But tells you only that a conflict occurred, not
how to resolve it
Cassandra
• What Cassandra does
CREATE TABLE users (
  username text PRIMARY KEY,
  email text,
  phone text
);
INSERT INTO users (username, email, phone)
VALUES ('jbellis', '[email protected]', '555-5555');
UPDATE users SET email = '[email protected]' WHERE username = 'jbellis';
UPDATE users SET phone = '444-4444' WHERE username = 'jbellis';
Cassandra
• Breaks a row into columns
• Resolves changes to email and phone
columns automatically
• Storage engine can resolve the changes
• If concurrent changes hit the same column, only one is kept (last write wins)
Partitioning
• Partitioning needed when
– A large-scale system exceeds the capacity of
a single machine
– Replicating for reliability
– Applying scaling measures such as load-balancing
– Different approaches follow
Partitioning Approach
• Memory caches (memcached)
– Transient partitions of in-memory DBs
– Replicate most requested parts of DB to main
memory
– Rapidly deliver data to clients
– Unburden DB servers significantly
– Memory cache is array of processes with assigned
amount of memory
– launched on several machines in a network
– Made known to application via configuration
– Memcached protocol
• Provides simple key/value store API
• Available in different programming languages
• Stores objects in the cache by hashing the key
• Non-responding nodes are ignored; their keys are
rehashed to a responding server
Partitioning Approach
• Clustering
– Data across servers
– Strive for transparency
– Helps to scale
– Criticized because typically added on
top of DBMS not designed for
distribution
Partitioning Approach
• Separating reads from writes (master/slave)
– Specify dedicated (master) servers
– Writes for all or parts of the data are routed to them
– If the master replicates asynchronously, no write lag, but
writes can be lost if the master crashes
– If synchronous, at least one slave is up to date so writes are not lost;
however, reads cannot go to an arbitrary slave if strict consistency is required
– If the master crashes, the slave with the most recent version becomes the
new master
– Master/slave model is good if the R/W ratio is high
– Replication can happen by transfer of state or transfer
of operations
Partitioning Approach -Sharding
• Sharding
– Technically, sharding is a synonym for horizontal
partitioning
– rows of a database table are held separately, rather
than being split into columns
• which is what normalization and vertical partitioning do, to
differing extents
– Each partition forms part of a shard
– A shard may in turn be located on a separate
database server or physical location
Sharding
• Term is often used to refer to any database
partitioning that is meant to make a very large
database more manageable
• The governing concept behind sharding is
based on the idea:
– as the size of a database and the number of
transactions per unit of time made on the
database increase linearly, the response time for
querying the database increases exponentially.
Sharding
• DB sharding can usually be done fairly simply
– Customers located on the East Coast can be
placed on one server
– customers on the West Coast can be placed
on a second server.
• Sharding can be a more complex process in
some scenarios
– less structured data can be very complicated
– resulting shards may be difficult to maintain.
Sharding
• Partition the data so that data typically
requested together resides on the same node
– Load and storage volume evenly distributed
among servers
– Sharded data can be replicated for reliability and
load-balancing
– Writes may go to a dedicated replica or to all
replicas maintaining a partition
– There must be a mapping between partitions (shards) and the
storage nodes responsible for them
– Mapping can be static or dynamic
Sharding – RDBMS vs. NoSQL
– RDBMS
• Joins between shards are not possible
– A client application or proxy layer inside or outside the
DB has to issue requests and post-process (filter,
aggregate) results instead
• With sharding you can lose all features that make RDBMSs
useful
• Sharding is operationally obnoxious
• Sharding is not designed into current RDBMSs but added
on top
Sharding – RDBMS vs. NoSQL
– NoSQL embraces sharding
• Automatic partitioning and balancing of data
among nodes
• Knowing how to map DB objects to servers is
critical
partition = hash(o) mod n, with o = object to hash,
n = number of nodes (naive placement; sketch below)
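A small Python sketch of this naive “hash(o) mod n” placement; note how adding a node changes n and therefore moves most keys, which is the problem consistent hashing (covered later) addresses. The hash choice and key names are illustrative:

import hashlib

def node_for(obj_key: str, n_nodes: int) -> int:
    # partition = hash(o) mod n
    h = int(hashlib.md5(obj_key.encode()).hexdigest(), 16)
    return h % n_nodes

keys = ["user:1", "user:2", "user:3", "user:4"]
print([node_for(k, 4) for k in keys])   # placement with 4 nodes
print([node_for(k, 5) for k in keys])   # adding a 5th node remaps most keys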
Go to slide 78
Sharding MongoDB
• Partitioning of data among multiple machines in an
order-preserving manner
– On a per-collection basis, not the DB as a whole
– MongoDB detects which collections grow faster than average
for sharding
– Slow growers stay on single nodes
– A shard consists of servers that run MongoDB processes and
store data
– Shards are rebalanced across nodes for equal load
– Built on top of replica sets
– Each replica in a set is on a different server; all servers belong to the
same shard
– One server is primary; a new one can be elected on failure
– Architecture: Fig. 5.2
Figure 5.2.: MongoDB – Sharding Components (taken from
[MDH+10, Architectural Overview])
Sharding MongoDB
• Config Servers
– Store cluster’s metadata
• Info on each shard server and chunks contained
therein
• Chunk – contiguous ranges of data from
collections ordered by sharding key
• Able to derive on which shards a document
resides
• Data on config servers kept consistent via 2PC
Sharding MongoDB
• Routing Servers
– Execute R/W requests
• Look up shards, connect to them, execute
operation and return results to client app
• Allows distributed MongoDB to look like a single
server
• Processes do not persist any state and do not
coordinate within a shard
• Lightweight
Sharding MongoDB
• Partitioning within a collection
– A number of fields can be configured by which
documents are partitioned
– If a collection becomes imbalanced by load or data size,
it is partitioned by the configured key while
preserving order
• A document field name is the shard key in the example
below (not reproduced here; see the hypothetical sketch that follows)
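The example referenced on the slide is not reproduced in this transcript; a hypothetical equivalent using pymongo against a mongos router might look like the following (the database/collection names and the "name" shard key are assumptions, while enableSharding and shardCollection are standard MongoDB admin commands):

# Hypothetical sketch: shard the "users" collection of database "app" on "name".
# Assumes a sharded MongoDB cluster with a mongos router at localhost:27017.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")    # connect to the mongos router
client.admin.command("enableSharding", "app")         # allow sharding for the database
client.admin.command(
    "shardCollection", "app.users",
    key={"name": 1},                                   # "name" as a range shard key
)
# Documents are then partitioned into chunks by contiguous ranges of "name",
# and mongos routes reads/writes to the shard owning the matching chunk.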
Sharding MongoDB
• Sharding persistence
– Chunks – contiguous ranges of data
– Triple (not tuple) – (collection, minkey, maxkey)
• Min, max of sharding key
– As chunks grow, they are split
• A side effect of inserts – done in the background
– Choose a sharding key with many distinct values
– Sharding key cannot
• Failures in a shard
– If a single process fails, data is still available on a different server in
the replica set
– If all processes of a shard fail, R/W on other shards are not
affected
– If a config server fails, R/W are not affected, but chunks can’t be split
RDBMS Proposed changes
• Transaction and schema characteristics
– Should exploit:
• Tree schemas –
– Schema is a tree of 1:n relationships
– every table except root has one join term
– 1:n relationship with ancestor
– Easy to distribute among nodes of a grid
– All equi-joins in tree span only a single
site
– The schema may be partitioned by the root table: each
partition of the root table, together with the data of all
other tables referencing primary keys in
that root table partition, can live on one node
Storage Layout
• Layout defines which kind of data (rows,
columns, subset of columns) can be read at
once
– Row-based storage
– Column storage
– Column storage with locality groups
– Log structured merge trees
Row-based storage
• A relational table is serialized as its rows are
appended and flushed to disk
• Whole datasets can be read/written in a single I/O
operation
• Good locality of access, on disk and in cache, across the
different columns of a row
• Operations on whole columns are expensive and must read
extra data
Column Storage
• Serializes tables by appending columns
and flushing to disk
• Operations on columns – fast, cheap
• Operations on rows costly, seeks in many
or all columns
• Good for?
– aggregations (see the sketch below)
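To make the row/column trade-off concrete, here is a small Python sketch contrasting the two serializations of the same toy table (column names and values are made up):

# Toy table of three rows
table = [
    {"id": 1, "name": "Ann",  "salary": 50000},
    {"id": 2, "name": "Bob",  "salary": 60000},
    {"id": 3, "name": "Cleo", "salary": 55000},
]

# Row-based layout: whole rows stored contiguously (good for reading/writing one record)
row_layout = [list(row.values()) for row in table]

# Columnar layout: each column stored contiguously (good for per-column scans/aggregates)
col_layout = {col: [row[col] for row in table] for col in table[0]}

# An aggregate over one column touches a single contiguous list in the columnar layout...
avg_salary = sum(col_layout["salary"]) / len(col_layout["salary"])
# ...but in the row layout we must walk every row and skip the other fields.
avg_salary_rows = sum(r[2] for r in row_layout) / len(row_layout)
print(avg_salary, avg_salary_rows)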
Column storage with locality groups
• Like column storage but groups columns
expected to be accessed together
• Store groups together and physically
separated from other column groups
– Google’s Bigtable
– Started as column families
Figure: Storage Layout – (a) Row-based, (b) Columnar, (c) Columnar with locality groups
Log Structured Merge (LSM)
Trees
• Focus is not on serializing a logical structure (table) but on how to
use memory/disk efficiently to satisfy reads and writes (toy sketch below)
• Hold chunks of data in memory (memtables)
• Maintain on-disk commit logs for the in-memory data and flush
memtables to disk from time to time as SSTables
(sorted string tables)
– SSTables are immutable and get compacted over time
– The old copy is kept until the new copy is compacted
– Collectively they form the complete dataset
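A toy Python sketch of this read/write flow: writes land in an in-memory memtable (the commit log is omitted here), memtables are flushed as immutable sorted tables, and reads merge memtable and SSTables newest-first. Thresholds and names are illustrative, not how Bigtable or Cassandra actually implement it:

class ToyLSM:
    def __init__(self, flush_threshold=3):
        self.memtable = {}        # in-memory writes
        self.sstables = []        # list of immutable sorted tables, newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, immutable SSTable (here: a plain dict)
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:             # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

    def compact(self):
        # Merge all SSTables into one, keeping the newest value per key
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]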
Bloom Filter
• Space efficient probabilistic data structure
used to test whether an element is a
member of a set
• Used when source data would require
impractically large hash area in memory if
conventional hashing used
An empty Bloom filter is a bit array of m bits, all set to 0. There must also
be k different hash functions defined, each of which maps or hashes some set
element to one of the m array positions with a uniform random distribution.
Typically, k is a constant, much smaller than m, which is proportional to the number
of elements to be added; the precise choice of k and the constant of proportionality
of m are determined by the intended false positive rate of the filter.
To add an element, feed it to each of the k hash functions to get k array positions, and set
the bits at all these positions to 1.
To query for an element (test whether it is in the set), feed it to each of
the k hash functions to get k array positions. If any of the bits at these
positions is 0, the element is definitely not in the set – if it were, then all
the bits would have been set to 1 when it was inserted. If all are 1, then
either the element is in the set, or the bits have by chance been set to 1
during the insertion of other elements, resulting in a false positive. In a
simple Bloom filter there is no way to distinguish between the two cases,
but more advanced techniques can address this problem. (A minimal sketch follows below.)
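A minimal Python sketch of the Bloom filter just described, deriving k positions by salting a single hash function (the parameters m and k are arbitrary here and would normally be chosen from the intended false-positive rate):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k positions by hashing the item with k different "salts"
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False => definitely not in the set; True => in the set OR a false positive
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-key-17")
print(bf.might_contain("row-key-17"))   # True
print(bf.might_contain("row-key-99"))   # almost certainly False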
LSM Trees
• R requests to memtable and SSTables containing data
and return merged view
– To optimize, read only relevant SSTable bloom filters
(a) Data layout (b) Read path (c) Read path with bloom filters
LSM Trees
• W requests first in memtable then flushed
to disk to SSTables
(d) Write path (e) Flushing of Memtables (f) Compaction of SSTables
LSM trees
• Advantages
– LSM Trees in memory can be used to satisfy
R quickly
– Disk I/O faster since SSTables can be read
sequentially
– Can tolerate crashes since writes go to memory and a
commit log
• Used in Google’s Bigtable
Figure 3.12.: Storage Layout – MemTables and SSTables in
Bigtable (taken from [Ho09a])
On-disk or In-memory?
• Consider cost per GB of data and R/W performance
• Some NoSQL DBs leave storage
implementation open, allow plugins
• Most NoSQL implement storage to
optimize their characteristics
• CouchDB has interesting storage
• Go to 96
Storage – CouchDB
• CouchDB
– Copy-on-modified
• A private copy for the user is made from the data and B+tree index
• Uses copy-on-modify semantics
• Provides read-your-writes consistency
• The private copy is propagated to replication nodes
– Multiversion concurrency control – clients must merge
conflicting versions (swap the root pointer of the storage metadata)
• CouchDB synchronously persists all updates to disk, append-only
• Garbage collection by copying the compacted file to a new one
Query Models
• Substantial differences in querying
capabilities in different NoSQL DBs
• Key/value – lookup by PK or id field, no
querying of other fields
• Document – allow for complex queries
– Need to query non-key fields
• Can have distributed data structures
supporting querying
Query feature implementations
• Companion SQL DB – searchable attributes are copied to a SQL or
text DB and used to retrieve the PKs of matching datasets, which are then
used to access the NoSQL DB (a DHT, distributed hash table)
Figure 3.14.: Query Models – Companion SQL-Database (taken from [Ho09b])
Query feature implementations
• Scatter/Gather local search – dispatch
queries to the DB nodes, where each query is executed
locally
– Results are sent back to the querying server for post-processing
Figure 3.15.: Query Models – Scatter/Gather Local Search (taken from [Ho09b])
Query feature implementations
• Distributed B+trees – hash the searchable attribute to locate the
root node of a distributed B+tree
– The value of the root node contains the id of the child node to look up
– Repeat until reaching a leaf containing the PK or id of the DB entry
– Node updates in a distributed B+tree are tricky
Figure 3.16.: Query Models – Distributed B+Tree (taken from [Ho09b])
Query feature implementations
• Prefix Hash Table (distributed Trie)
• What is a trie?
A data structure in which all the descendants of a node have a
common prefix of the string associated with that node, and the root
is associated with the empty string.
Values are stored in the leaf nodes (see the sketch below).
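A minimal Python trie sketch matching this definition, with a prefix search that returns every key/value stored below the node reached by the prefix (all names are illustrative):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.value = None          # set only where a complete key ends

class Trie:
    def __init__(self):
        self.root = TrieNode()     # root corresponds to the empty string

    def put(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def prefix_search(self, prefix):
        node = self.root
        for ch in prefix:          # walk down the path spelling the prefix
            if ch not in node.children:
                return []
            node = node.children[ch]
        # collect every (key, value) stored below this node
        results, stack = [], [(node, prefix)]
        while stack:
            n, key = stack.pop()
            if n.value is not None:
                results.append((key, n.value))
            stack.extend((c, key + ch) for ch, c in n.children.items())
        return results

t = Trie()
t.put("car", 1); t.put("cart", 2); t.put("dog", 3)
print(t.prefix_search("ca"))       # [('car', 1), ('cart', 2)] in some order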
Query feature implementations
• Prefix Hash Table (distributed Trie)
– Every path from root to leaf contains prefix of key
– Every leaf node contains all data whose key is
prefixed by it
– The trie can be used to answer prefix queries
Figure 3.17.: Query Models – Prefix Hash Table / Distributed Trie (taken from [Ho09b])
Query feature implementations
• OR-junctions
– Union results from different DBs
• AND-junctions
– Intersection, so need efficient ways for large data sets
– Don’t want to send all to same site to intersect
– Bloom filters to test if not contained in result set
– Results sets can be cached
– If client doesn’t need full set at once, use incremental
strategy – stream and filter
• Memory caches – DB partitioned through
transient in-memory DBs
• Clustering – cluster of servers instead of one
server, added on top of DB not designed to be
distributed
• Separate reads from writes – specify dedicated servers
• Sharding – data typically requested/updated
together resides on the same nodes
Partitioning issues – nodes
leave/join
• Parts of the data must be redistributed when
nodes leave or join at runtime
– Crashes, unreachability, maintenance, etc.
– With in-memory caching, redistribution may be
implicit
• Cache misses, hashing against the available cache
servers, purging stale cache data with LRU
• For persistent data stores not acceptable
– If nodes can join/leave at runtime, need new
strategies, such as consistent hashing
Consistent Hashing
• Family of caching protocols for distributed networks used
to eliminate occurrence of hot spots
• Used by some NoSQL DBs
• Hash both objects (data) and caches (machines) using the
same hash function
• Each machine gets an interval of the hash function’s range;
adjacent machines take over its interval if it leaves
• A machine gives away part of its interval if a new node joins and
is mapped adjacent to it
• The client application can calculate which node to contact to
R/W a data item – no metadata server needed (sketch below)
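A minimal Python sketch of such a hash ring, without virtual nodes or replication (those are added on the slides that follow); the hash function and node names are illustrative:

import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((_h(n), n) for n in nodes)   # node positions on the ring

    def add(self, node):
        bisect.insort(self.ring, (_h(node), node))

    def remove(self, node):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's position to the next node on the ring
        pos = _h(key)
        idx = bisect.bisect(self.ring, (pos, "")) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["A", "B", "C"])
keys = ["obj1", "obj2", "obj3", "obj4"]
before = {k: ring.node_for(k) for k in keys}
ring.remove("C"); ring.add("D")          # node C leaves, node D joins
after = {k: ring.node_for(k) for k in keys}
# Only keys that mapped to C (or now fall in D's interval) move; the rest stay put.
print(before, after)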
Consistent Hashing
• Example: mapping clockwise where distribution
of nodes is random (hash func)
– Nodes A, B, C and objects 1-4
– 4,1 mapped to A, 2 mapped to B, 3 mapped
to C
– Suppose C leaves and D enters
Figure 3.4.: Consistent Hashing – Initial Situation (taken from [Whi07])
Only have to remap
items 3 and 4, not all
items
Figure 3.5.: Consistent Hashing – Situation after Node Joining and
Departure (taken from [Whi07])
Consistent Hashing
• Problems:
– Distribution of nodes is random (hash function)
– Intervals between nodes may be unbalanced
– E.g. D has a larger interval space
• Solution:
– Hash replicas (virtual nodes) for each
physical node onto ring
– Number of virtual nodes can be defined
by capacity of machine
– Attach replica counter to node’s id
– Will distribute points for node all
over ring
Figure 3.6.: Consistent Hashing – Virtual Nodes Example (taken from [Lip09, slide
12])
Consistent Hashing
• Problem:
– If node leaves - Data stored on virtual node
unavailable unless replicated to other nodes
– If new joins - Adjacent nodes no longer asked
to maintain adjacent data
• Introduce replication factor r
– Next r nodes in clockwise direction
responsible for a data item
• In example r=3, for every data object 3 physical
nodes listed in square brackets are responsible
Figure 3.7.: Consistent Hashing – Example with Virtual Nodes and Replicated
Data (taken from [K+10b])
Partitioning R/W operations
• Read/Write operations on partitioned data
– A read request can go to any node responsible for that
piece of data
– Such load-balancing does not apply to quorum scenarios,
which assume R + W > N (e.g. N = 3 replicas with W = 2 and R = 2,
so every read set overlaps every write set)
Partitioning issues –
membership
• Membership changes
– When a new node joins the system
• New node broadcasts arrival and ID to adjacent nodes
• Neighbors adjust object and replica ownerships
• The joining node copies the datasets it is now responsible for from its
neighbors
– In following example, 3 replicas – preceding clockwise
• X Hashed between A and B, so H, A, B transfer data to
new node X
• B, C, D drop data X is now responsible for as 3rd replica
Figure 3.8.: Membership Changes – Node X joins the System (taken
from [Ho09a])
Partitioning issues –
membership
• When a node leaves the systems
– Other nodes must detect if node has
left (crashed, whatever). In some systems no notification
gets exchanged when a node leaves
– If a departure is detected, neighbors must exchange data and
adjust object and replica ownerships
– In the example, node B leaves the system
• C, D, E become responsible for new intervals of hashed objects
• They must copy data from nodes counterclockwise
• Reorganize – Range AB and Range BC now become Range AC
Figure 3.9.: Membership Changes – Node B leaves the System
(taken from [Ho09a])
No to SQL or Not Only SQL
• Many NoSQL advocates proclaimed death
of SQL systems
• Now means not only SQL
• Means more than just no relational
• Maybe No-ACID is better term?
No to SQL or Not Only SQL
• End of assumptions:
– That SQL and ACID are the only tools
– End of master/slave scaling
– End of weaving the relational model through code
• Only defines what it is not
– Like some political parties??
• Post Relational instead?
– Doesn’t invalidate usefulness of relational
Criticisms of NoSQL
• Open source scares business people
• Lots of hype, little promise
• If RDBMS works, don’t fix it
• Questions as to how popular NoSQL is in
production today
NoSQL is nothing new
• Lotus Notes – early document store
supporting distribution and replication
– Favored performance over concurrency
control
– The developer of CouchDB came from Lotus
– CouchDB is “Lotus Notes done right”
– Alternatives to relational have existed for a
long time
RDB vs. NoSQL
• Distribution and shared-nothing – the same issues arise in relational and
non-relational systems
• Many NoSQL stores are disk-based, with a buffer pool, and multithreaded –
locking and buffer management remain problems
• NoSQL offers single-record transactions with BASE properties
in favor of performance
• Single-node performance of a NoSQL, disk-based, non-ACID,
multithreaded system is limited to a modest factor faster than a
well-designed stored-procedure SQL OLTP engine
• ACID transactions are jettisoned for a modest performance
boost, which has nothing to do with SQL
Criticisms of NoSQL – Stonebraker
• NoSQL discussion has nothing to do with
SQL
• Recognizes need for non-relational:
– Flexibility – some data doesn’t fit in relational
model
– Performance – put data in MySQL DB, as
data grows, performance down
• Options: shard data over distributed sites (headaches),
or move to an expensive commercial DB
Areas for improvement in RDBMS
• Optimizing single node performance
– Communication – can use stored procedures to avoid
it between application and DB
– Logging – expensive
– Locking – overhead
– Latching – B+trees, etc. have to be accessed by
multiple threads, short-term locks (latches for parallel
access)
– Buffer management – work to manage disk pages
cached in memory (the buffer pool), move them to disk and
back, and identify field boundaries
Real Challenge
• Speed up an RDBMS by eliminating locking,
latching, logging and buffer management, with
support for stored procedures
• In future remove overhead to have:
– High speed open-source SQL engines with
automatic sharding
– ACID transactions
– Increased programmer productivity
– Lower maintenance
– Better data independence afforded by SQL
Performance and scalability
• Scaling has to do with size, not speed
– Hopefully it will also be fast
• RDBMS hard to scale
– Sharding multiple tables accessed by any column is
insane
• But
– Scaling of RDBMS for years in banking, trading
systems, retailing, etc.
– Vertical scaling is easy, but can be costly
– Avoid rigid normalization, poor indexing
MySQL
– MySQL
• No automatic horizontal scaling or sharding
• But fast enough for most applications
• Familiar and ubiquitous
• Can do everything key/value stores can do, and
isn’t that much slower
• Just as easy (hard) to shard MySQL as NoSQL if there is
no automatic sharding
Not all RDBMS are MySQL
• NoSQL seen as successor of MySQL plus
memcached solution
• Not all RDBMSs should be identified with MySQL; some
are easier or better to scale
H-Store – new DB by Stonebreaker
1. H-Store runs on a grid
2. On each node, rows of tables are placed contiguously in main memory
3. B-tree indexing is used
4. Sites are partitioned into logical sites which are dedicated to one CPU core
5. Logical sites are completely independent, having their own indexes, tuple storage and partition of the main memory of the machine they run on
6. H-Store works single-threaded and runs transactions uninterrupted
7. H-Store allows one to run only predefined transactions implemented as stored procedures
8. H-Store omits the redo-log and tries to avoid writing an undo-log whenever possible; if an undo-log cannot be avoided it is discarded on transaction commit
9. If possible, query execution plans exploit the single-sited and one-shot properties discussed above
10. H-Store tries to achieve the no-knobs and high-availability requirements, as well as transformation of transactions to be single-sited, by “an automatical physical database designer which will specify horizontal partitioning, replication locations, indexed fields”
11. Since H-Store keeps replicas of each table these have to be updated transactionally. Read commands can go to any copy of a table while updates are directed to all replicas.
12. H-Store leverages the above mentioned schema and transaction characteristics for optimizations, e.g. omitting the undo-log in two-phase transactions.
H-store
• Compared H-store to commercial RDBMS
– 82 times better performance
– Overhead in commercial RDBMS due to logging and
concurrency control
Misconception of critics of
NoSQL
• NoSQL is:
– just about scalability/performance
• More to it than that
– Just document or key/value stores
• Shouldn’t narrow the discussion
– “Not needed, I can do just as well in an RDBMS”
• You can tweak and tune an RDB, but it is not a good fit for all
data and workload types
– “A rejection of RDBMSs”
• Not anymore – just less SQL
In Defense of Not NoSQL
• Workloads typically used for NoSQL
– Update- and lookup-intensive OLTP
– Not query-intensive DW or specialized
workloads like document repositories
• Improve performance of OLTP:
– Sharding over shared-nothing
• Most RDBMS provide this
• Just add new nodes when needed
– Improve OLTP performance of single node
• Nothing to do with SQL
The Future
– Need about 5 specialized DBs each for:
• Data warehouses
– Star or snowflake schema
» Can use relational, but not the best
• Stream processing
– Of messages at high speed
» Mix streams and relational data in SQL FROM clauses
• Text processing
– RDBs never used here
• Scientific DBs
– Arrays rather than tables
• Semi-structured data
– Need useful data models
– XML and RDF
The end
Your opinion?
RDBMS Changes needed
• Rewrite of RDBMS
– Main memory – currently disk-oriented solution
for main memory problems
• Disk based recovery log files – impact
performance negatively
– Multi-threading and resource control –
• Transactions affect only a few data sets, cheap
if in memory
• No need for multithreaded execution with
elaborate code to minimize CPU and disk
usage if data is in memory
RDBMS Changes needed
– Grid Computing –
• DBs developed from shared memory to shared
disk
• Today - shared nothing
– Horizontal partitioning of data over several
nodes of grid
– No need to reload and no downtimes
– Can transfer parts of data between nodes
without impacting running transactions
RDBMS Changes needed
– High Availability and Disaster Recovery
• Keep multiple replicas consistent
• Assumed shared nothing at the bottom of system
• Disperse load across multiple machines (P2P)
• Inter machine replication for fault tolerance
– Instead of redo log, just refresh data from
operational site
– Only need undo log in memory that can delete
when commit
– No persistent redo log, just transient undo
RDBMS Changes needed
– No knobs –
• Computers were expensive, people cheap
• Now computers cheap, people expensive
• But DBA needed to maximize performance (knobs?)
• Need self tuning
RDBMS Changes needed
– Transactions, processing environment
• Avoid persistent redo-logs
• Avoid client-server ODBC communication; instead
running application logic like stored procedures in
process
• Eliminate undo log
• Reduce dynamic locking
• Single-threaded execution to eliminate latching
and overhead for short transactions
• 2PC avoided whenever possible – network
roundtrips take order of ms
RDBMS Proposed changes
• Transaction and schema characteristics
– Should exploit:
• Tree schemas –
– Schema is a tree of 1:n relationships
– every table except root has one join term
– 1:n relationship with ancestor
– Easy to distribute among nodes of a grid
– All equi-joins in tree span only a single
site
– The schema may be partitioned by the root table: each
partition of the root table, together with the data of all
other tables referencing primary keys in
that root table partition, can live on one node
RDBMS Proposed changes
– Constrained tree applications CTA
– Has a tree schema and only runs transactions with:
• Every command in every transaction class has an
equality predicate on the PK of the root node
• Every SQL command in every transaction class is
local to one site
• Transaction classes are:
– Collections of the same SQL statements and program
logic
– Differing in run-time constraints used by individual
transactions
– Possible to decompose current OLTP to be CTAs
RDBMS Proposed changes
• Single site transactions
–Execute only on one node
–No communication with other nodes
• CTAs have these properties
RDBMS Proposed changes
• One-shot applications
– Transactions that can be executed in parallel
without requiring intermediate results to be
communicated among sites
– Never use results of earlier queries
– Decompose transactions into a collection of
single site plans sent to appropriate site for
execution
– Partition tables vertically among sites
RDBMS Proposed changes
• 2-phase transactions
– Phase 1: read operations (can lead to an abort)
– Phase 2: writes, guaranteed to cause no integrity
violations
• Strongly 2-phase
– In addition to 2-phase, in second phase all sites
either rollback or commit
RDBMS Proposed changes
• Transaction commutativity
– 2 concurrent transactions commute when any
interleaving of their single-site sub-plans
produce the same final DB state as any other
interleaving
• Sterile transaction classes
– Those that commute with all transaction
classes including itself
Evaluation of NoSQL DBs
• Auto scaling (e.g. via auto sharding)
• Updated millions of objects in 1:1, 1:n
relationships
– Key/value stores:
• Redis
– no means for automatic horizontal scaling
– if a machine is added, it must be made known to the application
– open community working on Redis Cluster to support it
– performs well, so scaling is not needed early
• Voldemort
– Adds servers to cluster automatically, has fault tolerance
– Concentrates on sharding
Evaluation of NoSQL DBs
– Document stores
• MongoDB
– Originally did not scale automatically; no automatic
sharding until a later version
– Now supports sharding extensively
– Column-based
• Cassandra
– Supposed to scale, e.g. add another machine, but didn’t
support losing a machine
SSTables
• Data.db files store the data, compressed or uncompressed depending on the
configuration
• Index.db is an index into the Data.db file
• Statistics.db stores various statistics about that particular SSTable (times
accessed etc.)
• Filter.db is a bloom filter that’s loaded in memory by a node and can tell it
quickly whether a key is in the table or not
• CompressionInfo.db may or may not be there depending on whether
compression is enabled for a ColumnFamily. It contains information about the
compressed chunks in the Data.db file.