NoSQL Databases
No SQL or Not Only SQL
Copyright © 2011-2013 Curt Hill

Historically …
• Typical relational databases live in a pleasant niche
  – Their data is relatively small and usually on one machine
  – The meaning of the data is well understood
  – Schemas are tightly defined
  – Transactional consistency (ACID) is maintained
  – Results of transactions are accurate

Things can be different
• Extremely large amounts of data
• Data is spread over many machines, possibly geographically distant
• Change to this data is continuous
• Data quality may be poor, obtained from many sources
• Schemas are fuzzy and uncertain
  – Or completely lacking

Relaxing principles
• Classic database principles have been left behind
• Locking is usually absent
• Schemas are often inconsistent or lacking
• Data comes from many sources
  – How does this get integrated into a rigid schema?
• Accuracy of the data is missing
  – By the time we update, it has already changed

People and Business
• A normal relational database gives us accuracy
  – The limitation is the accuracy of the data
• People are used to making decisions without all the facts
• Businesses often make decisions without all the facts or complete analysis
  – Otherwise the window of opportunity has passed

CAP Theorem
• A distributed database or web service cannot guarantee all three of the following at once:
• Consistency
  – Every read sees the latest write, as if operations occurred all at once on a single copy
• Availability
  – Every operation sent to a working node terminates with a response
• Partition tolerance
  – Operations still complete even if messages between nodes are lost or individual components fail

ACID absent
• ACID, in particular, is in danger
• The goal of a transaction is to make it look like it occurs by itself, without considering other transactions
• When multiple computers are communicating and each has its own data, this is in danger
• Locking and unlocking is a problem
  – Things are
    changing too fast to let one transaction lock data
  – Without it, serializing is in danger

Now and Then
• Suppose a transaction is made
• One computer messages all the others
• By the time that message arrives it reflects a past state
• By the time it is processed that state may have changed
• Virtually everything on the Internet represents a past state, not the current one

Now and Then Again
• A single computer may think of its data as current
• It must accept all messages from other computers as in the past
• Absolute consistency cannot be obtained
• Eventual consistency is now the norm

BASE not ACID
• BASE is an alternative to ACID
• Basically Available, Soft state, Eventually consistent
  – Clearly contrived to complement ACID
• This acknowledges that when the data becomes too widely distributed, something has to give

Not the only relaxation of requirements
• NoSQL databases usually abandon the whole relational format
• They may also include the relational database as a subset of the entire database
• The most common form is the data store
  – AKA key-value store

NoSQL Databases
• Must provide APIs to various programming languages
• Must scale well to very large sizes
• Indexing is the key to rapid access
• These NoSQL databases are targeted at different niches
• Generally not interchangeable
  – Unlike most RDBMS

Kinds of NoSQL
• Key Value
• Columnar
• Document
• Graph

Key Value
• Simplest model
• There is a key (which must be unique) linked to a group of values
• It gets more interesting if the values may include key value pairs as well
• Often not much of a schema
• Think of a database with one table
  – Unlimited string as key
  – Unlimited string as second field
• Two examples: Riak and Redis

Key-Value stores
• A
  relational table is a restricted form of key-value
  – The key is the primary key
  – The data is all the fields associated with that key
  – However, key-value data may not even be in First Normal Form
• There is only one table
  – Key is an unrestricted size string
  – Data is whatever needs to be there
  – The values may be completely different

KV Picture
[Picture]

Key Value Again
• In a relational database we always know what the value extracted from a cell is
• It has the same meaning as everything else in the column
• This is no longer the case in key value stores

Columnar
• Also known as a column store
• A lot of similarity to relational, but the dominant item is the column, not the row
• We lack the rectangularity that relational has
• Columns are stored together
• Halfway between Relational and Key Value
• HBase, Cassandra, Hypertable, Calpont, MonetDB are examples

Columnar
[Picture]

Columnar again
• Often used in data warehouses
• Since the columns are stored together (rather than the rows) and since each column has only one data type, there is an opportunity to compress a column that is absent in relational DBs

Document
• The basic object is now a document instead of a simple field like a number
  – Document is often XML or JSON
• Each document has an ID and other identifying values
• A document is an arbitrary and complicated item
  – As if every field were a BLOB
• Examples: MongoDB, CouchDB, Oracle NoSQL, Amazon’s SimpleDB

Graph
• A mathematical graph consists of nodes (the data) and links between these
  – This is the network model revisited
• Used for highly interconnected data
• Processing rides the links
• Neo4j and Zope are examples

Commentary
• These classifications are incomplete
• Many examples exist that are combinations of several
• We next look
  at some example databases
  – Most of these are open source

Riak
• Key value store designed to be distributed over many nodes
• Designed to be fault-tolerant
  – Peer to peer architecture with no master
  – All the data is scattered over many servers and disks
  – One or more failures do not compromise the data
• Everything is done through web queries
• Used by a quarter of the Fortune 50
• Includes Best Buy, GitHub, Comcast

Redis
• Key value store, optimized for speed
• Creator is Salvatore Sanfilippo, who calls it a data structure server
  – Data could be more than a string or number linked to a key
• Data may also be a sorted or unsorted set of strings
  – This enables set operations on keys
• Keeps data in memory and occasionally updates disk
  – No ACID guarantees in that
• Used by Craigslist, Flickr

MongoDB
• Designed to be a very scalable document model database
  – Used by CERN for Large Hadron Collider data
• Data is formatted as JavaScript objects
  – JavaScript Object Notation (JSON)
• Attributes are indexed
• Queries now become JavaScript functions
• APIs in the major languages
• Who is Mongo?
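
The document-model ideas on the MongoDB slide (documents as JSON objects, indexed attributes, queries expressed as functions) can be sketched in a few lines. This is only an illustration in plain Python using the standard json module; it does not use MongoDB or its API. The sample data and the names RAW_DOCS, load_collection, build_index, and find are hypothetical, made up for this sketch.

```python
import json

# Hypothetical collection: each document is a JSON object, and the
# documents do not have to share the same attributes (no fixed schema).
RAW_DOCS = """
[
  {"_id": 1, "name": "Ada", "langs": ["Python", "Erlang"]},
  {"_id": 2, "name": "Bob", "city": "Fargo"},
  {"_id": 3, "name": "Cy", "langs": ["Java"], "city": "Fargo"}
]
"""


def load_collection(text):
    """Parse JSON text into a list of dict documents."""
    return json.loads(text)


def build_index(collection, attribute):
    """Map each value of `attribute` to the ids of the documents holding it.

    Documents that lack the attribute are simply skipped.
    """
    index = {}
    for doc in collection:
        if attribute in doc:
            index.setdefault(doc[attribute], []).append(doc["_id"])
    return index


def find(collection, predicate):
    """Return the documents for which the predicate function is true."""
    return [doc for doc in collection if predicate(doc)]


collection = load_collection(RAW_DOCS)
city_index = build_index(collection, "city")
python_users = find(collection, lambda d: "Python" in d.get("langs", []))
```

Here the query is an ordinary function applied to each document, and the index is a dictionary from attribute value to document ids. A real document database would persist and maintain such structures itself; the sketch only shows the shape of the idea.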

JSON
• A lightweight data interchange format
• Derived from JavaScript but used outside of JavaScript
• Most languages have a subroutine to parse and assimilate JSON
• A short JSON presentation

MongoDB and ACID
• Atomicity: yes
• Consistency: no schema, so no consistency or inconsistency
• Isolation: good, but not perfect
• Durability: yes

Terms
RDBMS       MongoDB
Table       Collection
Row         JSON Document
Index       Index
Join        Embedding and linking
Partition   Shard

CouchDB
• Document based with JSON content
• Each document has a set of keys that link to it
• Written in Erlang, but with JavaScript API
  – Other languages interface to that
• Very fault tolerant
• Used by LinkedIn, Orbitz

HBase
• A columnar database
• Very scalable, designed for big data
• Each field is versioned, making it 3D rather than 2D
  – Columns are stored together
  – Rows are the related data
  – Depth is the older versions
• Used by Facebook, Twitter, Yahoo, eBay

Cassandra
• Project started at Facebook, originally to power inbox search
• Became an Apache project
• Intended to create a network of equal nodes
• Eventual consistency, not perfect consistency
• Mostly written in Java but provides APIs in Python, Ruby, PHP among others
• Used by IBM, HP, Netflix among others

Neo4j
• Graph database
  – Network of nodes and links
• Data is information on a person or thing
• Links are the connections between one datum and another
• Numerous graph algorithms have been implemented
  – Consider Facebook connections
• Used by Adobe, Lufthansa, Mozilla

CAP
• Several of these are distributed
• Since they cannot do all three, they generally are good at two of the three
• See the following picture

CAP
[Picture: CAP diagram. Labels in reading order: Consistency; MongoDB, HBase; Partition tolerance; Riak, CouchDB; Availability]

Niches
• For a product to be successful it must find one or more niches where it may do well
• A niche is a particular set of circumstances and requirements
• Next we want to consider some of these products and what they do well and what they do poorly

Relational
• Layout and form of the data is well known in advance and relatively stable
  – We do not need to know in advance what will be done with the data, but we do need to know the form
  – Most business processes have this kind of requirement
• Not as effective for deeply hierarchical and widely varying data

Key Value
• Easy to make fast or horizontally scalable or both
• Useful where data does not conform to a well known schema or the data is not very well related
• Searches are easy but more complicated queries are not
  – No indices
  – No linkages, i.e. foreign keys

Columnar
• Horizontal scalability is based on storing columns in different nodes
  – Thus good for big data
• Allows for versioning
• Like relational, schema needs to be done in advance
  – Based on what queries are needed
  – Does poorly with ad hoc data and queries

Document
• Works well with data that is highly variable and not known in advance
• Content is often JSON, so these are object oriented databases
• No normalization is possible, so redundancies are mostly unavoidable
• Most interesting queries are not possible

Graph
• Particularly useful for modeling networking
• For social networking applications
  – Nodes are people and edges their relationships
  – Hard to model this in other models
• Not easy to partition, so not easy to scale
• No common query language

déjà vu?
• In the early 1970s the database world was in some disarray
• There were several models
• None had achieved dominance
• Commercial offerings were present, but a theoretical foundation was lacking
• There was no uniformity to these products
• Interchanging products was very difficult

The End or Start of an Era
• Codd changed that by developing a theoretical foundation for relational databases
• SQL became the common language
• For several decades now Relational Databases have been the undisputed king
• RDBMS is a 32 billion dollar industry
• The products are to some degree interchangeable

Again
• The situation around NoSQL databases has a lot of the same feel as in the 1970s
• They are not interchangeable and not even directed towards the same ends
• Is this the end of the RDBMS era?
• It is unlikely we will soon get rid of RDBMS, but it is not likely to be as exclusive as it has been

Finally
• Some of the motivations of the NoSQL movement are:
  – Big Data
  – Requirements to be distributed
  – Volatility of data, largely caused by the web
• Check out the following link
  – DB-Engines.com rates popularity of databases