NoSQL Databases
No SQL or Not Only SQL
Copyright © 2011-2013 Curt Hill

Historically …
• Typical relational databases live in a pleasant niche
  – Their data is relatively small and usually on one machine
  – The meaning of the data is well understood
  – Schemas are tightly defined
  – Transactional consistency (ACID) is maintained
  – Results of transactions are accurate

Things can be different
• Extremely large amounts of data
• Data is spread over many machines, possibly geographically distant
• Change to this data is continuous
• Data quality may be poor, obtained from many sources
• Schemas are fuzzy and uncertain
  – Or completely lacking

Relaxing principles
• Classic database principles have been left behind
• Locking is usually absent
• Schemas are often inconsistent or lacking
• Data come from many sources
  – How does this get integrated into a rigid schema?
• Accuracy of the data is missing
  – By the time we update, it has already changed

People and Business
• A normal relational database gives us accuracy
  – The limitation is the accuracy of the data
• People are used to making decisions without all the facts
• Businesses often make decisions without all the facts or complete analysis
  – Otherwise the window of opportunity has passed

CAP Theorem
• A distributed database or web service cannot guarantee all three of the following at once:
• Consistency
  – Every read sees the most recent write, as if each operation occurred all at once
• Availability
  – Every operation terminates with the intended response
• Partition tolerance
  – Operations complete even if individual components fail or messages between them are lost

ACID absent
• ACID, in particular, is in danger
• The goal of a transaction is to make it look like it occurs by itself, without considering other transactions
• When multiple computers are communicating and each has its own data, this is in danger
• Locking and unlocking is a problem
  – Things are
    changing too fast to let one transaction lock data
  – Without locking, serializability is in danger

Now and Then
• Suppose a transaction is made
• One computer messages all the others
• By the time that message arrives it reflects a past state
• By the time it is processed that state may have changed
• Virtually everything on the Internet represents a past state, not the current one

Now and Then Again
• A single computer may think of its data as current
• It must accept all messages from other computers as in the past
• Absolute consistency cannot be obtained
• Eventual consistency is now the norm

BASE not ACID
• BASE is an alternative to ACID
• Basically Available, Soft state, Eventually consistent
  – Clearly contrived to complement ACID
• This is acknowledging that when the data becomes too widely distributed something has to give

Not the only relaxation of requirements
• NoSQL databases usually abandon the whole relational format
• They may also include the relational database as a subset of the entire database
• The most common form is the data store
  – AKA key-value store

NoSQL Databases
• Must provide APIs to various programming languages
• Must scale well to very large sizes
• Indexing is the key to rapid access
• These NoSQL databases are targeted at different niches
• Generally not interchangeable
  – Unlike most RDBMS

Kinds of NoSQL
• Key Value
• Columnar
• Document
• Graph

Key Value
• Simplest model
• There is a key (which must be unique) linked to a group of values
• It gets more interesting if the values may include key value pairs as well
• Often not much of a schema
• Think of a database with one table
  – Unlimited string as key
  – Unlimited string as second field
• Two examples: Riak and Redis

Key-Value stores
• A
  relational table is a restricted form of key-value
  – The key is the primary key
  – The data is all the fields associated with that key
  – However, it may not even be in First Normal Form
• There is only one table
  – Key is a string of unrestricted size
  – Data is whatever needs to be there
  – The values may be completely different

KV Picture
[Figure: key-value store picture]

Key Value Again
• In a relational database we always know what a value extracted from a cell means
• It has the same meaning as everything else in the column
• This is no longer the case in key value stores

Columnar
• Also known as a column store
• A lot of similarity to relational, but the dominant item is the column, not the row
• We lack the rectangularity that relational has
• Columns are stored together
• Halfway between Relational and Key Value
• HBase, Cassandra, Hypertable, Calpont, MonetDB are examples

Columnar
[Figure: columnar store picture]

Columnar again
• Often used in data warehouses
• Since the columns are stored together (rather than the rows) and since each column has only one data type, there is an opportunity to compress a column that is absent in relational DBs

Document
• The basic object is now a document instead of a simple field like a number
  – Document is often XML or JSON
• Each document has an ID and other identifying values
• A document is an arbitrary and complicated item
  – As if every field were a BLOB
• Examples: MongoDB, CouchDB, Oracle NoSQL, Amazon’s SimpleDB

Graph
• A mathematical graph consists of nodes (the data) and links between these
  – This is the network model revisited
• Used for highly interconnected data
• Processing rides the links
• Neo4J and Zope are examples

Commentary
• These classifications are incomplete
• Many examples exist that are combinations of several
• We next look
    at some example databases
  – Most of these are open source

Riak
• Key value store designed to be distributed over many nodes
• Designed to be fault-tolerant
  – Peer to peer architecture – no master
  – All the data is scattered over many servers and disks
  – Any one or more failures does not compromise the data
• Everything is done through web queries
• Used by a quarter of the Fortune 50
• Includes Best Buy, GitHub, Comcast

Redis
• Key value store, optimized for speed
• Creator is Salvatore Sanfilippo, who calls it a data structure server
  – Data could be more than a string or number linked to a key
• The data may also be a sorted or unsorted set of strings
  – This enables set operations on keys
• Keeps data in memory and occasionally updates disk
  – This gives no ACID guarantees
• Used by Craigslist, Flickr

MongoDB
• Designed to be a very scalable document model database
  – Used by CERN for Large Hadron Collider data
• Data is formatted as JavaScript objects
  – JavaScript Object Notation (JSON)
• Attributes are indexed
• Queries now become JavaScript functions
• APIs in the major languages
• Who is Mongo?
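The MongoDB points above (documents as JSON objects, indexed attributes, queries expressed as functions) can be illustrated with a small in-memory sketch. It is plain Python and not MongoDB's API: the class `TinyCollection` and its methods `insert`, `find_by` and `find_where` are hypothetical names invented for this illustration, and the sketch assumes indexed attribute values are simple hashable values such as strings.

```python
import json
from collections import defaultdict


class TinyCollection:
    """Hypothetical in-memory stand-in for a document collection (not MongoDB's API)."""

    def __init__(self, indexed_attributes):
        self.documents = {}                      # document id -> document (a dict)
        self.indexes = {name: defaultdict(set)   # attribute -> value -> set of ids
                        for name in indexed_attributes}
        self.next_id = 1

    def insert(self, json_text):
        """Parse a JSON object, store it, and update the attribute indexes."""
        document = json.loads(json_text)
        doc_id = self.next_id
        self.next_id += 1
        self.documents[doc_id] = document
        for name, index in self.indexes.items():
            if name in document:                 # documents need not share fields
                index[document[name]].add(doc_id)
        return doc_id

    def find_by(self, attribute, value):
        """Indexed lookup: goes straight to the matching ids, no full scan."""
        ids = self.indexes[attribute].get(value, set())
        return [self.documents[i] for i in sorted(ids)]

    def find_where(self, predicate):
        """Query expressed as a function applied to every document."""
        return [doc for _, doc in sorted(self.documents.items()) if predicate(doc)]


people = TinyCollection(indexed_attributes=["city"])
people.insert('{"name": "Ada", "city": "London", "languages": ["English"]}')
people.insert('{"name": "Kurt", "city": "Vienna"}')
people.insert('{"name": "Alan", "city": "London", "age": 41}')

london_names = [doc["name"] for doc in people.find_by("city", "London")]
with_age = people.find_where(lambda doc: "age" in doc)
print(london_names)
print(with_age)
```

The three documents have different fields, which is the point of the document model: there is no schema to violate, and the price is that a query such as `find_where` has to check whether a field exists at all.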

JSON
• A lightweight data interchange format
• Derived from JavaScript but used outside of JavaScript
• Most languages have a library routine to parse JSON
• A short JSON presentation

MongoDB and ACID
• Atomicity – yes
• Consistency – no schema, so no consistency or inconsistency
• Isolation – good, but not perfect
• Durable – yes

Terms
RDBMS        MongoDB
Table        Collection
Row          JSON document
Index        Index
Join         Embedding and linking
Partition    Shard

CouchDB
• Document based with JSON content
• Each document has a set of keys that link to it
• Written in Erlang, but with a JavaScript API
  – Other languages interface to that
• Very fault tolerant
• Used by LinkedIn, Orbitz

HBase
• A columnar database
• Very scalable – designed for big data
• Each field is versioned, making it 3D rather than 2D
  – Columns are stored together
  – Rows are the related data
  – Depth is the older versions
• Used by Facebook, Twitter, Yahoo, eBay

Cassandra
• Project started by Facebook to track status updates
• Became an Apache project
• Intended to create a network of equal nodes
• Eventual consistency, not perfect consistency
• Mostly written in Java but provides APIs in Python, Ruby, PHP, among others
• Used by IBM, HP, Netflix, among others

Neo4J
• Graph database
  – Network of nodes and links
• Data is information on a person or thing
• Links are the connections between one datum and another
• Numerous graph algorithms have been implemented
  – Consider Facebook connections
• Used by Adobe, Lufthansa, Mozilla

CAP
• Several of these are distributed
• Since they cannot do all three, they generally are good at two of the three
• See the following picture

CAP
Consistency
    MongoDB, HBase
Partition tolerance
    Riak, CouchDB
Availability

Niches
• For a product to be successful it must find one or more niches where it may do well
• A niche is a particular set of circumstances and requirements
• Next we want to consider some of these products and what they do well and what they do poorly

Relational
• Layout and form of the data is well known in advance and relatively stable
  – We do not need to know in advance what will be done with the data, but we do need to know the form
  – Most business processes have this kind of requirement
• Not as effective for deeply hierarchical and widely varying data

Key Value
• Easy to make fast or horizontally scalable or both
• Useful where data does not conform to a well known schema or the data is not very well related
• Searches are easy but more complicated queries are not
  – No indices
  – No linkages, i.e., foreign keys

Columnar
• Horizontal scalability is based on storing columns in different nodes
  – Thus good for big data
• Allows for versioning
• Like relational, the schema needs to be designed in advance
  – Based on what queries are needed
  – Does poorly with ad hoc data and queries

Document
• Works well with data that is highly variable and not known in advance
• Content is often JSON, so these are object oriented databases
• No normalization is possible, so redundancies are mostly unavoidable
• Most interesting queries are not possible

Graph
• Particularly useful for modeling networks
• For social networking applications
  – Nodes are people and edges their relationships
  – Hard to model this in other models
• Not easy to partition, so not easy to scale
• No common query language

Déjà vu?
• In the early 1970s the database world was in some disarray
• There were several models
• None had achieved dominance
• Commercial offerings were present, but a theoretical foundation was lacking
• There was no uniformity to these products
• Interchanging products was very difficult

The End or Start of an Era
• Codd changed that by developing a theoretical foundation for relational databases
• SQL became the common language
• For several decades now relational databases have been the undisputed king
• RDBMS is a 32 billion dollar industry
• The products are to some degree interchangeable

Again
• The situation around NoSQL databases has a lot of the same feel as in the 1970s
• They are not interchangeable and not even directed towards the same ends
• Is this the end of the RDBMS era?
• Unlikely we will soon get rid of RDBMS, but it is not likely to be as exclusive as it has been

Finally
• Some of the motivations of the NoSQL movement are:
  – Big Data
  – Requirements to be distributed
  – Volatility of data, largely caused by the web
• Check out the following link
  – DB-Engines.com rates popularity of databases
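As a closing illustration of the distribution requirement, the idea of eventual consistency (every message describes a past state, yet the replicas end up agreeing) can be sketched in a few lines of Python. This is a minimal last-writer-wins sketch under stated assumptions, not how any product named in these slides is implemented: the `Replica` class is hypothetical, timestamps are plain integers supplied by the caller, and the sketch assumes no two writes to the same key share a timestamp.

```python
class Replica:
    """Hypothetical replica holding key -> (timestamp, value); last writer wins."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, timestamp):
        """Accept a local write and return the message to send to the peers."""
        message = (key, timestamp, value)
        self.apply(message)
        return message

    def apply(self, message):
        """Apply a message from the past; keep it only if it is newer than ours."""
        key, timestamp, value = message
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


east, west = Replica("east"), Replica("west")

# Two clients update the same key on different replicas; no locking is involved.
m1 = east.write("status", "at lunch", timestamp=1)
m2 = west.write("status", "back at desk", timestamp=2)

# Before the messages are delivered, the replicas disagree.
before = (east.read("status"), west.read("status"))

# The messages arrive later; each one describes a past state of its sender.
east.apply(m2)
west.apply(m1)
after = (east.read("status"), west.read("status"))

print(before)
print(after)
```

Between the writes and the delivery of the messages a reader can see either value, which is the "soft state" in BASE. Once both messages have been applied, both replicas hold the write with the larger timestamp, regardless of the order in which the messages arrived.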