Introduction to NoSQL (CS 609: Database Management)
NoSQL Databases (C. Strauch)

RDBMS
• RDBMS have long been the dominant data stores
• Different approaches – object DBs, XML stores – wanted the same market share as relational
  – Were absorbed by relational systems or remained niche products
• One size fits all?
  – A countermovement – NoSQL
  – Does not include stores that existed before the movement, e.g. object or pure XML databases

NoSQL Movement
• Term first used in 1998
  – For a relational database that omitted the use of SQL
• Used again in 2009
  – For non-relational stores, by a Rackspace employee blogger
  – Purpose: seek alternatives for problems where relational is a bad fit
  – Relational viewed as slow and expensive
  – Startups built their own datastores
    • Amazon's Dynamo, Google's Bigtable, Facebook's Cassandra
    • Cassandra – writes reported as 2500 times faster than MySQL
  – Many released as open source

Benefits of NoSQL
• Is NoSQL more efficient?
  – Avoids complexity
    • Relax ACID properties – consistency
    • Store in memory – higher throughput
    • Column stores
  – Scalability, commodity hardware
    • Horizontal scaling, single-server performance, schema design
  – Avoids expensive object-relational mapping
    • Data structures are simpler, e.g. key-value stores
    • SQL is awkward for procedural code
      – Mapping into SQL only pays off if the data is manipulated repeatedly
      – SQL is not a good fit for a simple DB structure

Benefits of NoSQL
  – Complexity and cost of setting up DB clusters
    • Clusters can be expanded without the cost and complexity of manual sharding
  – Compromise reliability for performance
    • Some applications are willing to compromise – may not store data in persistent storage
  – One solution (RDBMS) for all is wrong
    • Growth of data to be stored
    • Need to process more data faster
    • Facebook doesn't need ACID

Myth
• Myth: centralized data models distribute and partition effortlessly
  – They don't necessarily scale or work correctly without effort
  – Typical path: first replicate, then shard
  – Failures within a network are different from failures in a single machine
  – Latency, distributed failures, etc.

Motivation for NoSQL
• A movement in programming languages and development frameworks
  – Trying to hide the use of SQL
    • Ruby on Rails and object-relational mappers in Java; some omit SQL completely, e.g. SimpleDB
• Requirements of cloud computing
  – Ultimate scalability, low administration overhead
  – Data warehousing, key/value stores, plus additional features to bridge the gap with RDBMS

Motivation for NoSQL
• Less overhead and smaller memory footprint than an RDBMS
• Use of Web technology and RPC calls for access
• Optional forms of queries

SQL Limitations
• RDBMS + caching vs. systems built from scratch to scale
  – Cache objects in memcached for high read loads
  – Less capable as systems grow
• RDBMSs place computation on reads instead of avoiding complex reads
• Serial nature of applications waiting for I/O
• Large-scale data
• Operating costs of running and maintaining such systems (Twitter)
  – Some sites still use an RDBMS, but scale matters more
SQL Limitations
• Yesterday's vs. today's needs
  – Instead of high-end servers, commodity servers that will fail
  – Data is no longer rigidly structured
  – RDBMSs were designed for centralized, not distributed, operation
    • Synchronization is not implemented efficiently; it requires 2PC or 3PC

False assumptions
• False assumptions about distributed computing:
  – The network is reliable
  – Latency is zero
  – Bandwidth is infinite
  – The network is secure
  – Topology doesn't change
  – There is one administrator
  – Transport cost is zero
  – The network is homogeneous
• Do not try to hide distribution from applications

Positives of RDBMS
• Historical positives of RDBMSs:
  – Disk-oriented storage and indexing structures
  – Multi-threading to hide latency
  – Locking-based concurrency control
  – Log-based recovery

SQL Negatives
• Requires a rewrite because it is not good for:
  – Text
  – Data warehouses
  – Stream processing
  – Scientific and intelligence databases
  – Interactive transactions
  – Direct SQL interfaces are rare

RDBMS Changes needed
• Rewrite of the RDBMS
  – Main memory – currently a disk-oriented solution is applied to main-memory problems
    • Disk-based recovery log files impact performance negatively
  – Multi-threading and resource control
    • Transactions affect only a few data sets and are cheap if in memory – if long, break them up
    • No need for multithreaded execution with elaborate code to minimize CPU and disk usage if data is in memory

RDBMS Changes needed
  – DBs developed from shared memory to shared disk
    • Today – shared nothing
      – Horizontal partitioning of data over several nodes
      – No need to reload and no downtimes
      – Parts of the data can be transferred between nodes without impacting running transactions

RDBMS Changes needed
  – Transactions and the processing environment
    • Avoid persistent redo logs
    • Avoid client-server ODBC communication; instead run application logic, like stored procedures, in process
    • Eliminate the undo log where possible
    • Reduce dynamic locking
    • Single-threaded execution to eliminate latching and overhead for short transactions
    • Avoid 2PC whenever possible – network round trips take on the order of milliseconds

RDBMS Changes needed
  – High availability and disaster recovery
    • Keep multiple replicas consistent
    • Assume shared nothing at the bottom of the system
    • Disperse load across multiple machines (P2P)
    • Inter-machine replication for fault tolerance
      – Instead of a redo log, just refresh data from an operational site
      – Only an in-memory undo log is needed, which can be deleted at commit
      – No persistent redo log, just a transient undo log

RDBMS Changes needed
  – No knobs
    • Computers were expensive and people cheap; now computers are cheap and people expensive
    • But a DBA is still needed to maximize performance (via complex tuning knobs)
    • Need self-tuning systems

Future
• No one-size-fits-all model
• No more one-size-fits-all language
  – Ad-hoc queries are rarely deployed; instead prepared statements and stored procedures
  – Data warehouses have complex queries not well served by SQL
  – Embed DB capabilities in programming languages, e.g. modify Python, Perl, etc.

Types of NoSQL DBs
• Researchers have proposed many different classification taxonomies
• We will use these categories:
  – Key-value stores
  – Document stores
  – Column stores
Key-value store
• Key-value (k, v) stores allow the application to store its data in a schema-less way
• Keys – can be ?
• Values – objects not interpreted by the system
  – v can be an arbitrarily complex structure with its own semantics, or a simple word
  – Good for unstructured data
• Data could be stored in a datatype of a programming language or as an object
• No metadata
• No need for a fixed data model

Key-Value Stores
• Simple data model
  – Map/dictionary: put and request values per key
  – Length of keys is limited; few limitations on values
  – High scalability over consistency
  – No complex ad-hoc querying and analytics
  – No joins or aggregate operations

Document Store
• The next logical step after key-value, to slightly more complex and meaningful data structures
• Notion of a document
• No strict schema that documents must conform to
• Documents encapsulate and encode data in some standard formats or encodings
• Encodings include:
  – XML, YAML, and JSON
  – Binary forms like BSON, PDF and Microsoft Office documents

Document Store
• Documents can be organized/grouped as:
  – Collections
  – Tags
  – Non-visible metadata
  – Directory hierarchies

Document Store
• Collections ~ tables; documents ~ records
  – But not all documents in a collection have the same fields
• Documents are addressed in the database via a unique key
• Beyond the simple key-document (or key-value) lookup, an API or query language allows retrieval of documents based on their contents

Document Store
• More functionality than key-value
• More appropriate for semi-structured data
• Recognizes the structure of the objects stored
• Objects are documents that may have attributes of various types
• Objects are grouped into collections
• Simple query mechanisms to search collections for attribute values

Column Store
• Stores data tables in column order
  – Relational systems store in row order
• Advantages for data warehouses and customer relationship management (CRM) systems
• More efficient when:
  – Aggregates are computed over many rows of the same column (row order wins when many columns of the same row are required)
  – A column is updated across many rows at once
  – Compression matters – all values in a column are of the same type
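To make the three categories concrete, here is a small illustrative sketch in Python (record names and values are invented, and it is not tied to any particular product) of how the same user record might be handled by a key-value store, a document store and a column store:

    # Illustrative sketch only: the same user record in three storage models.
    import json

    # Key-value store: the value is an opaque blob the system does not interpret.
    kv_store = {}
    kv_store["user:1001"] = json.dumps({"name": "Ann", "email": "ann@example.com"})

    # Document store: the store understands the fields, so it can query them.
    users_collection = [
        {"_id": 1001, "name": "Ann", "email": "ann@example.com", "city": "Oslo"},
        {"_id": 1002, "name": "Bob", "city": "Oslo"},   # documents need not share fields
    ]
    oslo_users = [d for d in users_collection if d.get("city") == "Oslo"]

    # Column store: one column's values are stored contiguously, which makes
    # aggregates over a single column cheap and compression easy.
    columns = {
        "id":   [1001, 1002],
        "name": ["Ann", "Bob"],
        "city": ["Oslo", "Oslo"],
    }
    distinct_cities = set(columns["city"])   # touches only one column, not whole rows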
Basic Concepts of NoSQL – distributed issues
• The usual suspects:
  – CAP – Table 3.1
  – ACID vs. BASE – strict vs. eventual consistency – Table 3.2

Table 3.1: CAP theorem – alternatives, traits, examples (cf. [Bre00, slides 14–16])
• Consistency + Availability (forfeit partition tolerance)
  – Traits: 2-phase commit, cache-validation protocols
  – Examples: single-site databases, cluster databases, LDAP, the xFS file system
• Consistency + Partition tolerance (forfeit availability)
  – Traits: pessimistic locking, make minority partitions unavailable
  – Examples: distributed databases, distributed locking, majority protocols
• Availability + Partition tolerance (forfeit consistency)
  – Traits: expirations/leases, conflict resolution, optimistic
  – Examples: Coda, web caching, DNS

Table 3.2: ACID vs. BASE (cf. [Bre00, slide 13])
• ACID: strong consistency; isolation; focus on "commit"; nested transactions; availability?; conservative (pessimistic); difficult evolution (e.g. schema changes)
• BASE: weak consistency – stale data OK; availability first; best effort; approximate answers OK; aggressive (optimistic); simpler and faster; easier evolution

Replication
• Full vs. partial replication
• Which copy to access?
• Improves performance for global queries, but updates are a problem
• Must ensure consistency of the replicated copies of the data

Partitioning
• Can distribute a whole relation at a site, or
• Data partitioning (formerly known as fragments)
  – Logical units of the DB assigned for storage at various sites
  – Horizontal partition – a subset of the tuples in the relation (select)
  – Vertical partition – keeps only certain attributes of the relation (project); needs the PK

Partitioning cont'd
• Horizontal partitioning:
  – Complete – a set of fragments whose conditions include every tuple
  – Disjoint – each tuple is a member of only one fragment, e.g. salary < 5000 and dno = 4
• Vertical partitioning:
  – A complete set can be obtained by L1 ∪ L2 ∪ ... ∪ Ln = attributes of R, where Li ∩ Lj = PK(R)

RDBMS example – replication/partitioning
• Example fragments for a company DB:
  – Site 1 – company headquarters – gets the entire DB
  – Sites 2 and 3 – horizontal partitioning based on department number

Versioning and multiple copies
• If datasets are distributed, strict consistency is not ensured by distributed transaction protocols
  – How should concurrent modifications be handled?
  – To which values will a dataset converge?

Versioning
• How to handle this?
  – Timestamps
    • Rely on synchronized clocks (a problem) and don't capture causality
  – Optimistic locking
    • A unique counter or clock value for each data item
    • On update, supply the counter/clock value
    • Causality reasoning on versions requires saving history and a total order of version numbers

Versioning
  – Multiversion storage
    • Store a timestamp for each table cell (any type of timestamp)
    • For a given row, multiple versions can exist concurrently
    • Can access the most recent version, or the most recent before time Ti
    • Provides optimistic concurrency control with compare-and-swap on timestamps
    • Can take snapshots
  – Vector clocks
    • Require no clock synchronization, no total order of version numbers, and no multiple stored versions

Vector Clocks
• Defined as a tuple V[0], V[1], ..., V[n] of clock values, one per node
• Represents the state of the node itself (node i) and of the replica nodes, as far as it knows, at a given time:
  – Vi[0] for the clock value of node 0, Vi[i] for node i itself, etc.
• Clock values may be timestamps from a local clock, version/revision numbers, etc.
• Example: V2[0] = 45, V2[1] = 3, V2[2] = 55
  – Node 2 perceives that the last update on node 1 led to revision 3, the last update on node 0 to revision 45, and the last update on itself to revision 55

Vector Clocks
• Updated by the following rules:
  – An internal update on node i increments its own clock Vi[i] – seen immediately
  – If node i sends a message to node k, it advances its own clock value Vi[i] and attaches its vector Vi to the message
  – If node i receives a message with vector Vmessage from node j, it advances Vi[i] and merges its own vector clock with Vmessage so that Vi = max(Vi, Vmessage)
• This creates a partial order
  – Vi > Vj if Vi[k] >= Vj[k] for all k and Vi ≠ Vj
  – If neither Vi > Vj nor Vi < Vj, the updates are concurrent and the client application must resolve the conflict

Figure 3.1: Vector Clocks (taken from [Ho09a])

Vector clocks for consistency models
• Vector clocks can be used to resolve consistency between writes on multiple replicas
  – Nodes don't keep vector clocks for clients, but
  – Clients can keep the vector clock of the last replica node they talked to
  – Use it depending on the consistency model required, e.g. for monotonic reads, attach the last clock received
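The update and comparison rules above fit in a few lines. Below is a minimal sketch, assuming clocks are represented as Python dicts mapping node ids to counters and using the usual componentwise partial order:

    # Minimal vector clock sketch; missing entries count as 0.
    def increment(clock, node_id):
        """Internal update on node_id: advance its own entry."""
        clock = dict(clock)
        clock[node_id] = clock.get(node_id, 0) + 1
        return clock

    def merge(local, received):
        """On receiving a message, take the element-wise maximum."""
        return {k: max(local.get(k, 0), received.get(k, 0))
                for k in set(local) | set(received)}

    def compare(a, b):
        """Return 'before', 'after', 'equal' or 'concurrent' (partial order)."""
        keys = set(a) | set(b)
        a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
        b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
        if a_le_b and b_le_a:
            return "equal"
        if a_le_b:
            return "before"
        if b_le_a:
            return "after"
        return "concurrent"   # neither dominates: the client must resolve the conflict

    # Example: node "n1" updates locally, then sends its clock to "n2".
    v1 = increment({}, "n1")              # {'n1': 1}
    v2 = merge(increment({}, "n2"), v1)   # n2 merges the received vector
    v2 = increment(v2, "n2")              # n2 records the receive event
    print(compare(v1, v2))                # 'before': v1 causally precedes v2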
Vector Clock paper
• A timestamp array is maintained locally – initially all values are 0
• The local clock value is incremented at least once before each atomic event
• The current value of the entire timestamp array is piggybacked onto each outgoing message
• On receiving a message m, each entry of the timestamp array is set to max(current value, m's value)
• The value corresponding to the sender is set to one greater than the value received, unless the array already has a greater value (no overtaking)
• The receiver thus gets all values, even those of non-neighbor nodes

Vector clock advantages
• No dependence on synchronized clocks
• No total order of version numbers required for causal reasoning
• No need to store and maintain multiple versions of the data on all nodes

Vector Clocks – gossiping
• Assume the database is partitioned
• State transfer model
  – Data is exchanged between clients and DB servers and among DB servers
  – DB servers maintain vector clocks; clients also maintain vector clocks
  – State – the actual data, plus state version trees
  – Vstate – the vector clock corresponding to the last updated state of the data

Vector Clocks – gossiping
• Query processing
  – The client sends its vector clock along with the query to the server
  – The server returns a piece of data that precedes the client's vector clock (monotonic read)
  – The server also sends its own vector clock
  – The client merges its vector clock with the server's clock

Vector Clocks – gossiping
• Update processing
  – The client sends its vector clock with the update to the server
  – If the client's state precedes the server's, the server omits the update; otherwise the server applies it
• Internode gossiping
  – Replicas responsible for the same partitions of data exchange vector clocks and version trees in the background to stay synchronized

Figure 3.2: Vector Clocks – exchange via gossip in state transfer mode (taken from [Ho09a])

Operation transfer mode
• Send operations instead of data – saves?
• Operations can be buffered along with the sender/receiver vector clock and ordered causally
• Operations can be grouped together for execution – may need a total order

Updates, versions and replicas
• http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks
• Not blind writes, but updates
• Suppose 3 replicas for each data item
• V1: ('email': '[email protected]', 'phone': '555-5555') at all 3 replicas
• V2 is an update to V1: ('email': '[email protected]', 'phone': '555-5555'), written to only 2 replicas
• V3 is an update applied to the replica that missed V2, so from V1: ('email': '[email protected]', 'phone': '444-4444')
• The result should be: ('email': '[email protected]', 'phone': '444-4444') – V2's email together with V3's phone
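A small sketch, with made-up replica names and counters, of why V2 and V3 above are flagged as concurrent by vector clocks, and why the desired result has to combine fields from both versions:

    # V2 and V3 both descend from V1, but neither descends from the other.
    v1 = {"r1": 1, "r2": 1, "r3": 1}
    v2 = {"r1": 2, "r2": 2, "r3": 1}    # the email update, applied on replicas r1 and r2
    v3 = {"r1": 1, "r2": 1, "r3": 2}    # the phone update, applied only on replica r3

    def dominates(a, b):
        return all(a.get(k, 0) >= b.get(k, 0) for k in set(a) | set(b))

    print(dominates(v2, v1), dominates(v3, v1))     # True True: both supersede V1
    print(dominates(v2, v3) or dominates(v3, v2))   # False: V2 and V3 are concurrent
    # The desired result combines V2's email with V3's phone, which is what
    # per-field (per-column) resolution achieves, as in the Cassandra example below.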
Solution?
• Some NoSQL DBs just use the last write, which would lose data
• Other NoSQL DBs keep both V1 and V2
  – Dealing with multiple versions is difficult in practice
• With vector clocks, the client can merge simple objects
  – But a vector clock only tells you that a conflict occurred, not how to resolve it

Cassandra
• What Cassandra does:

  CREATE TABLE users (
    username text PRIMARY KEY,
    email text,
    phone text
  );
  INSERT INTO users (username, email, phone)
    VALUES ('jbellis', '[email protected]', '555-5555');
  UPDATE users SET email = '[email protected]' WHERE username = 'jbellis';
  UPDATE users SET phone = '444-4444' WHERE username = 'jbellis';

Cassandra
• Breaks a row into columns
• Resolves changes to the email and phone columns automatically
• The storage engine can resolve the changes
• If changes to the same column are concurrent, only one is kept

Partitioning
• Partitioning is needed when
  – A large-scale system exceeds the capacity of a single machine
  – Replicating for reliability
  – Scaling measures such as load balancing are required
• Different approaches exist

Partitioning approach – memory caches (memcached)
• Transient partitions of in-memory DBs
  – Replicate the most requested parts of the DB to main memory
  – Rapidly deliver data to clients and unburden DB servers significantly
• A memory cache is an array of processes, each with an assigned amount of memory, launched on several machines in a network
  – Made known to the application via configuration
• Memcached protocol
  – Provides a simple key/value store API, available in different programming languages
  – Stores objects into the cache by hashing the key
  – Non-responding nodes are ignored; responding nodes are used instead, and keys are rehashed to a different server

Partitioning approach – clustering
• Data is distributed across the servers of a cluster
• Strives for transparency and helps to scale
• Criticized because it is typically added on top of a DBMS not designed for distribution

Partitioning approach – separating reads from writes (master/slave)
• Specify dedicated servers; writes for all or parts of the data are routed to them
• If the master replicates asynchronously, there are no write lags, but writes can be lost if the master crashes
• If replication is synchronous, one slave can lag but no writes are lost; however, reads cannot go to just any slave if strict consistency is required
• If the master crashes, the slave with the most recent version becomes the new master
• The master/slave model works well if the read/write ratio is high
• Replication can happen by transfer of state or transfer of operations

Partitioning approach – sharding
• Technically, sharding is a synonym for horizontal partitioning
  – Rows of a database table are held separately, rather than being split into columns (which is what normalization and vertical partitioning do, to differing extents)
  – Each partition forms part of a shard
  – A shard may in turn be located on a separate database server or physical location

Sharding
• The term is often used for any database partitioning meant to make a very large database more manageable
• The governing concept behind sharding:
  – As the size of a database and the number of transactions per unit of time increase linearly, the response time for querying the database grows exponentially

Sharding
• DB sharding can often be done fairly simply, e.g.:
  – Customers located on the East Coast are placed on one server
  – Customers on the West Coast are placed on a second server
• Sharding can be a more complex process in some scenarios
  – Sharding less structured data can be very complicated
  – The resulting shards may be difficult to maintain
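The East Coast/West Coast case reduces to a trivial routing function; the sketch below uses placeholder shard addresses and a hypothetical region attribute:

    # Toy illustration of region-based sharding; addresses are placeholders.
    SHARDS = {
        "east": "db-east.example.internal",
        "west": "db-west.example.internal",
    }

    def shard_for(customer):
        # The sharding key here is the customer's region attribute.
        return SHARDS[customer["region"]]

    print(shard_for({"id": 42, "region": "east"}))   # db-east.example.internal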
Sharding
• Partition the data so that data typically requested together resides on the same node
  – Load and storage volume are evenly distributed among the servers
  – Sharded data can be replicated for reliability and load balancing
  – A write may go to a dedicated replica or to all replicas maintaining a partition
  – There must be a mapping between partitions (shards) and the storage nodes responsible for them
  – The mapping can be static or dynamic

Sharding – RDBMS vs. NoSQL
• RDBMS
  – Joins between shards are not possible
    • A client application, or a proxy layer inside or outside the DB, has to issue requests and post-process (filter, aggregate) the results instead
  – With sharding you can lose all the features that make an RDBMS useful
  – Sharding is operationally obnoxious
  – Sharding was not designed into current RDBMSs but added on top

Sharding – RDBMS vs. NoSQL
• NoSQL embraces sharding
  – Automatic partitioning and balancing of data among nodes
  – Knowing how to map DB objects to servers is critical, e.g.
    partition = hash(o) mod n, with o = object to hash, n = number of nodes

Sharding – MongoDB
• Partitioning of data among multiple machines in an order-preserving manner
  – On a per-collection basis, not the DB as a whole
  – MongoDB detects which collections grow faster than average and shards them; slow growers stay on single nodes
  – A shard consists of servers that run MongoDB processes and store data
  – Shards are rebalanced across nodes for equal load
  – Built on top of replica sets
    • Each replica in a set is on a different server; all servers of a set belong to the same shard
    • One server is the primary; a new primary can be elected on failure
  – Architecture: Fig. 5.2

Figure 5.2: MongoDB – sharding components (taken from [MDH+10, Architectural Overview])

Sharding – MongoDB
• Config servers
  – Store the cluster's metadata
    • Information on each shard server and the chunks contained therein
    • Chunk – a contiguous range of data from a collection, ordered by the sharding key
    • From this it can be derived on which shard a document resides
  – Data on the config servers is kept consistent via 2PC

Sharding – MongoDB
• Routing servers
  – Execute read/write requests
    • Look up the shards, connect to them, execute the operation and return the results to the client application
    • Make a distributed MongoDB setup look like a single server
  – These processes do not persist any state and do not coordinate within a shard; they are lightweight

Sharding – MongoDB
• Partitioning within a collection
  – The fields by which documents are partitioned can be configured
  – If a collection becomes imbalanced by load or data size, it is partitioned by the configured key while preserving order
  – In the example figure, a document field name is used as the shard key

Sharding – MongoDB
• Sharding persistence
  – Chunks – contiguous ranges of data
  – A triple (not a tuple): (collection, minkey, maxkey)
    • The min and max of the sharding key
  – As chunks grow, they are split
    • A side effect of inserts, done in the background
  – Choose a sharding key with many distinct values
  – The sharding key cannot be changed once chosen
• Failures in a shard
  – If a single process fails, the data is still available on a different server in the replica set
  – If all processes of a shard fail, reads and writes on other shards are not affected
  – If a config server fails, reads and writes are not affected, but chunks cannot be split up
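A sketch of the chunk idea described above, not MongoDB's actual code: chunks are (collection, minkey, maxkey) ranges over the sharding key, and a request is routed to the shard owning the chunk whose range contains the key. Shard names and range boundaries below are invented:

    import bisect

    # Chunks for one collection, ordered by minkey; maxkey is exclusive here,
    # and "~" serves as a crude upper bound for this toy example.
    chunks = [
        {"collection": "users", "minkey": "a", "maxkey": "h", "shard": "shard0"},
        {"collection": "users", "minkey": "h", "maxkey": "p", "shard": "shard1"},
        {"collection": "users", "minkey": "p", "maxkey": "~", "shard": "shard2"},
    ]

    def shard_for_key(key):
        """Find the chunk whose [minkey, maxkey) range contains the sharding key."""
        bounds = [c["minkey"] for c in chunks]
        i = bisect.bisect_right(bounds, key) - 1
        chunk = chunks[i]
        assert chunk["minkey"] <= key < chunk["maxkey"]
        return chunk["shard"]

    print(shard_for_key("mallory"))   # shard1, since 'h' <= 'mallory' < 'p'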
RDBMS Proposed changes
• Transaction and schema characteristics a redesigned RDBMS should exploit:
  – Tree schemas
    • The schema is a tree of 1:n relationships – every table except the root has one join term, a 1:n relationship with its ancestor
    • Easy to distribute among the nodes of a grid
    • All equi-joins in the tree span only a single site
    • A partition of the root table, together with the data of all other tables referencing primary keys in that root table partition, can be kept together

Storage Layout
• The storage layout defines which kind of data (rows, columns, subsets of columns) can be read at once
  – Row-based storage
  – Column storage
  – Column storage with locality groups
  – Log-structured merge (LSM) trees

Row-based storage
• A relational table is serialized as its rows are appended and flushed to disk
• Whole datasets can be read or written in a single I/O operation
• Good locality of access, on disk and in the cache, across the different columns of a row
• Operations on individual columns are expensive, since extra data must be read

Column storage
• Serializes tables by appending their columns and flushing them to disk
• Operations on columns are fast and cheap
• Operations on rows are costly, requiring seeks in many or all columns
• Good for? Aggregations

Column storage with locality groups
• Like column storage, but groups columns expected to be accessed together
• Such groups are stored together and physically separated from other column groups
• Google's Bigtable – column families

Storage Layout – row-based, columnar with/without locality groups
• (a) Row-based  (b) Columnar  (c) Columnar with locality groups

Log-structured merge (LSM) trees
• Do not serialize a logical structure (table) but describe how to use memory and disk efficiently to satisfy reads and writes
• Hold chunks of data in memory (memtables)
• Maintain on-disk commit logs for the in-memory data, and flush memtables to disk from time to time into SSTables (sorted string tables)
  – SSTables are immutable and get compacted over time
  – The old copy is kept until the new copy is compacted
  – Collectively the SSTables form the complete dataset

Bloom filter
• A space-efficient probabilistic data structure used to test whether an element is a member of a set
• Used when the source data would require an impractically large hash area in memory if conventional hashing were used
• An empty Bloom filter is a bit array of m bits, all set to 0. There are also k different hash functions, each of which maps a set element to one of the m array positions with a uniform random distribution. k is a constant much smaller than m, and m is proportional to the number of elements to be added; the precise choice of k and the constant of proportionality of m are determined by the intended false-positive rate of the filter.
• To add an element, feed it to each of the k hash functions to get k array positions, and set the bits at all these positions to 1.
• To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set – if it were, all those bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits happen to have been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom filter there is no way to distinguish the two cases, but more advanced techniques can address this problem.

LSM trees
• Read requests go to the memtable and to the SSTables containing the data, and return a merged view
  – To optimize, consult the per-SSTable Bloom filters and read only the relevant SSTables
• (a) Data layout  (b) Read path  (c) Read path with Bloom filters

LSM trees
• Write requests go first to the memtable, which is later flushed to disk as SSTables
• (d) Write path  (e) Flushing of memtables  (f) Compaction of SSTables

LSM trees
• Advantages
  – The in-memory part can be used to satisfy reads quickly
  – Disk I/O is faster since SSTables can be read sequentially
  – Crashes are tolerated because writes go to memory plus a commit log
• Used in Google's Bigtable

Figure 3.12: Storage layout – memtables and SSTables in Bigtable (taken from [Ho09a])
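A minimal Bloom filter sketch following the description above, with m bits and k hash functions derived here from SHA-256 for simplicity (real systems tune m and k for a target false-positive rate), and used the way the LSM read path uses per-SSTable filters:

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=3):
            self.m, self.k = m, k
            self.bits = [0] * m

        def _positions(self, item):
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(h, 16) % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p] = 1

        def might_contain(self, item):
            # False means "definitely not present"; True may be a false positive.
            return all(self.bits[p] for p in self._positions(item))

    # Usage as in the LSM read path: consult a per-SSTable filter before reading it.
    f = BloomFilter()
    f.add("row-key-17")
    print(f.might_contain("row-key-17"))   # True
    print(f.might_contain("row-key-99"))   # almost certainly False, so skip this SSTable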
On-disk or in-memory?
• A trade-off between the cost per GB of data and read/write performance
• Some NoSQL DBs leave the storage implementation open and allow plugins
• Most NoSQL DBs implement a storage layer that optimizes their particular characteristics
• CouchDB has an interesting storage design

Storage – CouchDB
• Copy-on-modified semantics
  – A private copy of the data and the B+tree index is made for the user
  – The private copy is propagated to the replication nodes
  – Provides read-your-writes consistency
• Multiversion concurrency control – clients must merge conflicting versions (by swapping the root pointer of the storage metadata)
• CouchDB synchronously persists all updates to disk, append-only
• Garbage collection works by copying the compacted data to a new file

Query models
• Substantial differences in querying capabilities across NoSQL DBs
• Key/value – lookup by PK or id field only; no querying of other fields
• Document – allows complex queries
  – Non-key fields need to be queried
• Distributed data structures can be used to support querying

Query feature implementations
• Companion SQL database
  – Searchable attributes are copied to a SQL or text database, which is used to retrieve the PKs of matching datasets; these are then used to access the NoSQL DB (a distributed hash table, DHT)

Figure 3.14: Query models – companion SQL database (taken from [Ho09b])

Query feature implementations
• Scatter/gather local search
  – Dispatch the query to the DB nodes, where it is executed locally
  – Results are sent back to the dispatching DB server for post-processing

Figure 3.15: Query models – scatter/gather local search (taken from [Ho09b])
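A sketch of scatter/gather local search, with hypothetical node names and in-process "partitions" standing in for remote DB nodes:

    # Send the same query to every node, let each execute it locally, then merge.
    from concurrent.futures import ThreadPoolExecutor

    NODES = ["node1", "node2", "node3"]

    LOCAL_PARTITIONS = {
        "node1": [{"user": "ann", "city": "Oslo"}],
        "node2": [{"user": "bob", "city": "Pune"}],
        "node3": [{"user": "cho", "city": "Oslo"}],
    }

    def query_node(node, predicate):
        # Placeholder for "execute the query locally on this node's partition".
        return [doc for doc in LOCAL_PARTITIONS[node] if predicate(doc)]

    def scatter_gather(predicate):
        with ThreadPoolExecutor() as pool:                               # scatter
            partials = pool.map(lambda n: query_node(n, predicate), NODES)
        results = [doc for part in partials for doc in part]             # gather
        return sorted(results, key=lambda d: d["user"])                  # post-process

    print(scatter_gather(lambda d: d["city"] == "Oslo"))   # ann and cho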
Query feature implementations
• Distributed B+trees
  – Hash the searchable attribute to locate the root node of a distributed B+tree
  – The value of the root node contains the id of the child node to look up next
  – Repeat until reaching a leaf containing the PK or id of the DB entry
  – Node updates in a distributed B+tree are tricky

Figure 3.16: Query models – distributed B+tree (taken from [Ho09b])

Query feature implementations
• Prefix hash table (distributed trie)
  – What is a trie? A data structure in which all descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string; values are stored in the leaf nodes

Query feature implementations
• Prefix hash table (distributed trie)
  – Every path from the root to a leaf contains a prefix of the key
  – Every leaf node contains all data whose key is prefixed by it
  – Can be exploited for queries

Figure 3.17: Query models – prefix hash table / distributed trie (taken from [Ho09b])

Query feature implementations
• OR-junctions – union of the results from the different DB nodes
• AND-junctions – intersection, so efficient methods are needed for large result sets
  – Don't want to send everything to the same site to intersect
  – Bloom filters can be used to test whether an item is definitely not in a result set
  – Result sets can be cached
  – If the client doesn't need the full result set at once, use an incremental strategy – stream and filter

Recap – partitioning approaches
• Memory caches – the DB partitioned into transient in-memory stores
• Clustering – a cluster of servers instead of one server, typically added on top of a DB not designed to be distributed
• Separating reads from writes – dedicated write servers
• Sharding – data typically requested/updated together placed on the same node

Partitioning issues – nodes leave/join
• Parts of the data must be redistributed when nodes leave or join at runtime
  – Crashes, unreachability, maintenance, etc.
• For in-memory caching, redistribution may be implicit
  – Cache misses, hashing against the available cache servers, purging stale cache data with LRU
• For persistent data stores this is not acceptable
  – If nodes can join and leave at runtime, new strategies are needed, such as consistent hashing

Consistent hashing
• A family of caching protocols for distributed networks, used to eliminate hot spots
• Used by some NoSQL DBs
• Hash both objects (data) and caches (machines) with the same hash function
• Each machine gets an interval of the hash function's range, and adjacent machines can take over its interval if it leaves
• A machine can give away parts of its interval if a new node joins and is mapped adjacent to it
• The client application can calculate which node to contact to read or write a data item – no metadata server is needed

Consistent hashing
• Example: mapping clockwise, where the distribution of nodes is random (given by the hash function)
  – Nodes A, B, C and objects 1–4
  – Objects 4 and 1 are mapped to A, 2 to B, 3 to C
  – Suppose C leaves and D enters

Figure 3.4: Consistent hashing – initial situation (taken from [Whi07])

• Only items 3 and 4 have to be remapped, not all items

Figure 3.5: Consistent hashing – situation after node joining and departure (taken from [Whi07])

Consistent hashing
• Problems:
  – The distribution of nodes is random (hash function)
  – The intervals between nodes may be unbalanced, e.g. D gets a larger interval
• Solution:
  – Hash replicas ("virtual nodes") of each physical node onto the ring
  – The number of virtual nodes can be defined by the capacity of the machine
  – Attach a replica counter to the node's id
  – This spreads the points for a node all over the ring

Figure 3.6: Consistent hashing – virtual nodes example (taken from [Lip09, slide 12])
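A sketch of a consistent-hash ring with virtual nodes; the node names, number of virtual nodes and MD5-based hash are arbitrary choices for illustration:

    import bisect, hashlib

    def h(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes=4):
            # Place `vnodes` replicas of every physical node on the ring.
            self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

        def node_for(self, key):
            points = [p for p, _ in self.ring]
            i = bisect.bisect(points, h(key)) % len(self.ring)   # wrap around the ring
            return self.ring[i][1]

    ring = Ring(["A", "B", "C"])
    before = {k: ring.node_for(k) for k in ("obj1", "obj2", "obj3", "obj4")}
    ring = Ring(["A", "B", "D"])          # C leaves, D joins
    after = {k: ring.node_for(k) for k in before}
    moved = [k for k in before if before[k] != after[k]]
    print(moved)   # only keys whose interval changed move, not the whole data set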
Consistent hashing
• Problem:
  – If a node leaves, the data stored on its virtual nodes becomes unavailable unless it has been replicated to other nodes
  – If a new node joins, adjacent nodes are no longer asked for data they previously maintained
• Introduce a replication factor r
  – The next r nodes in clockwise direction are responsible for a data item
• In the example r = 3: for every data object, the three physical nodes listed in square brackets are responsible

Figure 3.7: Consistent hashing – example with virtual nodes and replicated data (taken from [K+10b])

Partitioning – read/write operations
• Read/write operations on partitioned data
  – A read request can go to any node responsible for that piece of data
  – Load-balancing reads this way does not apply in strict quorum scenarios, which assume R + W > N
    • e.g. with N = 3 replicas, W = 2 and R = 2 guarantee that every read quorum overlaps every write quorum

Partitioning issues – membership changes
• When a new node joins the system
  – The new node broadcasts its arrival and its id to the adjacent nodes
  – The neighbors adjust their object and replica ownerships
  – The joining node copies the datasets it is now responsible for from its neighbors
• In the following example there are 3 replicas, placed on the preceding nodes clockwise
  – X is hashed between A and B, so H, A and B transfer data to the new node X
  – B, C and D drop the data that X is now responsible for as the third replica

Figure 3.8: Membership changes – node X joins the system (taken from [Ho09a])

Partitioning issues – membership changes
• When a node leaves the system
  – Other nodes must detect that it has left (crashed, or whatever); in some systems no notification is exchanged when a node leaves
  – Once the departure is detected, the neighbors must exchange data and adjust their object and replica ownerships
• In the example, node B leaves the system
  – C, D and E become responsible for new intervals of hashed objects
  – They must copy data from the nodes counterclockwise and reorganize – RangeAB and RangeBC become RangeAC

Figure 3.9: Membership changes – node B leaves the system (taken from [Ho09a])

No to SQL, or Not Only SQL
• Many NoSQL advocates proclaimed the death of SQL systems
• NoSQL now means "not only SQL"
• It means more than just "no relational"
• Maybe "No-ACID" would be a better term?

No to SQL, or Not Only SQL
• The end of some assumptions:
  – SQL and ACID as the only tools
  – The end of master/slave scaling
  – The end of weaving the relational model through the code
• The term only defines what it is not – like some political parties?
• "Post-relational" instead?
  – Doesn't invalidate the usefulness of the relational model

Criticisms of NoSQL
• Open source scares business people
• Lots of hype, little promise
• If the RDBMS works, don't fix it
• Questions about how popular NoSQL really is in production today

NoSQL is nothing new
• Lotus Notes – an early document store supporting distribution and replication
  – Favored performance over concurrency control
  – The developer of CouchDB came from Lotus – "CouchDB is Lotus Notes done right"
• Alternatives to relational have existed for a long time
RDB vs. NoSQL
• Distribution and shared nothing – the same in relational and non-relational systems
• Many NoSQL stores are disk-based, with a buffer pool and multi-threading, so locking and buffer management are still problems
• NoSQL offers single-record transactions with BASE properties in favor of performance
• The single-node performance of a disk-based, non-ACID, multithreaded NoSQL system differs only by a modest factor from that of a well-designed stored-procedure SQL OLTP engine
• ACID transactions are jettisoned for a modest performance boost – a boost that has nothing to do with SQL

Criticisms of NoSQL (Stonebraker)
• The NoSQL discussion has nothing to do with SQL
• Recognizes the need for non-relational systems:
  – Flexibility – some data doesn't fit the relational model
  – Performance – data starts in a MySQL DB; as it grows, performance goes down
    • Options: shard the data over distributed sites (headaches), or move to an expensive commercial DB

Areas for improvement in RDBMSs
• Optimizing single-node performance
  – Communication – can be avoided between application and DB by using stored procedures
  – Logging – expensive
  – Locking – overhead
  – Latching – B+trees and similar structures accessed by multiple threads need short-term locks (latches) for parallel access
  – Buffer management – work to manage disk pages cached in memory (done by the buffer pool), moving them to disk and back and identifying field boundaries

Real challenge
• Speed up an RDBMS by eliminating locking, latching, logging and buffer management, with support for stored procedures
• Removing this overhead would in the future yield:
  – High-speed open-source SQL engines with automatic sharding
  – ACID transactions
  – Increased programmer productivity
  – Lower maintenance
  – The better data independence afforded by SQL

Performance and scalability
• Scaling has to do with size, not speed – hopefully it will also be fast
• RDBMSs are hard to scale
  – Sharding multiple tables that can be accessed by any column is insane
• But:
  – RDBMSs have been scaled for years in banking, trading systems, retailing, etc.
  – Vertical scaling is easy, but can be costly
  – Avoid rigid normalization and poor indexing

MySQL
• No automatic horizontal scaling or sharding
• But fast enough for most applications
• Familiar and ubiquitous
• Can do everything key/value stores can do, and isn't that much slower
• Just as easy (or hard) to shard MySQL as NoSQL if there is no automatic sharding

Not all RDBMSs are MySQL
• NoSQL is often seen as the successor of the "MySQL plus memcached" solution
• Not every RDBMS should be identified with MySQL; some are easier or better to scale

H-Store – a new DB by Stonebraker:
1. H-Store runs on a grid
2. On each node, the rows of tables are placed contiguously in main memory
3. B-tree indexing is used
4. Sites are partitioned into logical sites, each dedicated to one CPU core
5. Logical sites are completely independent, having their own indexes, tuple storage and partition of the main memory of the machine they run on
6. H-Store works single-threaded and runs transactions uninterrupted
7. H-Store only allows predefined transactions, implemented as stored procedures
8. H-Store omits the redo log and tries to avoid writing an undo log whenever possible; if an undo log cannot be avoided, it is discarded on transaction commit
9. If possible, query execution plans exploit the single-sited and one-shot properties discussed above
10. H-Store tries to achieve the no-knobs and high-availability requirements, as well as the transformation of transactions to be single-sited, by "an automatic physical database designer which will specify horizontal partitioning, replication locations, indexed fields"
11. Since H-Store keeps replicas of each table, these have to be updated transactionally
12. Read commands can go to any copy of a table, while updates are directed to all replicas
13. H-Store leverages the above-mentioned schema and transaction characteristics for optimizations, e.g. omitting the undo log in two-phase transactions

H-Store
• Compared H-Store to a commercial RDBMS
  – 82 times better performance
  – The overhead in the commercial RDBMS is due to logging and concurrency control

Misconceptions of the critics of NoSQL
• "NoSQL is just about scalability/performance" – there is more to it than that
• "NoSQL is just document or key/value stores" – shouldn't narrow the discussion
• "Not needed, I can do just as well in an RDBMS" – you can tweak and tune an RDBMS, but it is not good for all data types
• "NoSQL is a rejection of RDBMSs" – not anymore; it is just "less SQL"

In defense of Not-NoSQL
• Workloads typically cited for NoSQL:
  – Update- and lookup-intensive OLTP
  – Not query-intensive data warehousing or specialized workloads like document repositories
• To improve OLTP performance:
  – Sharding over shared nothing – most RDBMSs provide this; just add new nodes when needed
  – Improve the OLTP performance of a single node – which has nothing to do with SQL

The Future
• Need about five specialized DBs, one each for:
  – Data warehouses
    • Star or snowflake schemas – can use relational, but it is not the best fit
  – Stream processing
    • Processing messages at high speed; mix streams and relational data in SQL
  – Text processing
    • RDBs were never used here
  – Science-oriented DBs
    • Arrays rather than tables
  – Semi-structured data
    • Needs useful data models – XML and RDF

The end
• Your opinion?
RDBMS Proposed changes
• Constrained tree applications (CTAs)
  – Have a tree schema and only run transactions where:
    • Every command in every transaction class has an equality predicate on the PK of the root node
    • Every SQL command in every transaction class is local to one site
  – Transaction classes are:
    • Collections of the same SQL statements and program logic
    • Differing only in the run-time constants used by individual transactions
  – It is often possible to decompose current OLTP applications to be CTAs

RDBMS Proposed changes
• Single-site transactions
  – Execute entirely on one node
  – No communication with other nodes
  – CTAs have these properties

RDBMS Proposed changes
• One-shot applications
  – Transactions that can be executed in parallel without requiring intermediate results to be communicated among sites
  – They never use the results of earlier queries
  – Such transactions can be decomposed into a collection of single-site plans sent to the appropriate sites for execution
  – Tables are partitioned vertically among sites to enable this

RDBMS Proposed changes
• Two-phase transactions
  – A read phase (which can lead to an abort)
  – A write phase guaranteed to cause no integrity violations
• Strongly two-phase
  – In addition to being two-phase, in the second phase all sites either roll back or commit

RDBMS Proposed changes
• Transaction commutativity
  – Two concurrent transactions commute when any interleaving of their single-site sub-plans produces the same final DB state as any other interleaving
• Sterile transaction classes
  – Those that commute with all transaction classes, including themselves

Evaluation of NoSQL DBs
• Criteria: automatic scaling (e.g. via auto-sharding); updating millions of objects in 1:1 and 1:n relationships
• Key/value stores
  – Redis
    • No means for automatic horizontal scaling; if a machine is added, it must be made known to the application
    • An open community is working on Redis Cluster to support it
    • Performs well, so scaling is not needed early
  – Voldemort
    • Adds servers to the cluster automatically and has fault tolerance
    • Concentrates on sharding

Evaluation of NoSQL DBs
• Document stores
  – MongoDB
    • Originally did not scale automatically; no automatic sharding until a later version
    • Now has extensive sharding support
• Column-based
  – Cassandra
    • Supposed to scale, e.g. by adding another machine, but did not handle losing a machine

SSTables
• Data.db – stores the data, compressed or uncompressed depending on the configuration
• Index.db – an index into the Data.db file
• Statistics.db – stores various statistics about that particular SSTable (times accessed, etc.)
• Filter.db – a Bloom filter that is loaded into memory by a node and can quickly tell whether a key is in the table or not
• CompressionInfo.db – may or may not be present, depending on whether compression is enabled for a column family; contains information about the compressed chunks in the Data.db file