NoSQL Databases
No SQL or Not Only SQL
Copyright © 2011-2013 Curt Hill
Historically …
• Typical relational databases live in a
pleasant niche
– Their data is relatively small and
usually on one machine
– The meaning of the data is well understood
– Schemas are tightly defined
– Transactional consistency (ACID) is
maintained
– Results of transactions are accurate
Things can be different
• Extremely large amounts of data
• Data is spread over many machines,
possibly geographically distant
• Change to this data is continuous
• Data quality may be poor, obtained
from many sources
• Schemas are fuzzy and uncertain
– Or completely lacking
Relaxing principles
• Classic database principles have
been left behind
• Locking is usually absent
• Schemas are often inconsistent or
lacking
• Data come from many sources
– How does this get integrated into rigid
schema?
• Accuracy of the data is missing
– By the time we update it has already
changed
People and Business
• A normal relational database gives
us accuracy
– The limitation is the accuracy of the
data
• People are used to making decisions
without all the facts
• Businesses often make decisions
without all the facts or complete
analysis
– Otherwise the window of opportunity
has passed
CAP Theorem
• A distributed database or web
service cannot guarantee all three of
the following at the same time:
• Consistency
– Every operation appears to occur all
at once, so every read sees the latest
write
• Availability
– Every request to a working node must
end in a response
• Partition tolerance
– The system keeps operating even if
the network splits the nodes into
groups that cannot communicate
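The trade-off can be sketched with a toy two-replica store. This is only an illustration under stated assumptions: the class names, the "CP"/"AP" mode flag, and the `partitioned` switch are hypothetical and do not come from any real database. In "CP" mode the store refuses writes during a partition (giving up availability); in "AP" mode it accepts them and lets the replicas disagree (giving up consistency).

```python
class Replica:
    """One node holding its own copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}


class TwoNodeStore:
    """Two replicas that normally copy every write to each other."""
    def __init__(self, mode):
        assert mode in ("CP", "AP")   # hypothetical mode names
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False      # True = replicas cannot talk

    def write(self, replica, key, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Refuse rather than let the replicas disagree.
                return False
            # AP: accept locally; the other replica is now stale.
            replica.data[key] = value
            return True
        replica.data[key] = value
        other.data[key] = value
        return True

    def read(self, replica, key):
        return replica.data.get(key)


cp = TwoNodeStore("CP")
cp.write(cp.a, "x", 1)
cp.partitioned = True
cp_accepted = cp.write(cp.a, "x", 2)      # rejected: not available

ap = TwoNodeStore("AP")
ap.write(ap.a, "x", 1)
ap.partitioned = True
ap_accepted = ap.write(ap.a, "x", 2)      # accepted: replicas diverge
ap_views = (ap.read(ap.a, "x"), ap.read(ap.b, "x"))
```

In the CP run both replicas still agree on the old value; in the AP run the write succeeds but the two replicas now report different values for the same key.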
ACID absent
• ACID, in particular, is in danger
• The goal of a transaction is to make
it look like it occurs by itself without
considering other transactions
• When multiple computers are
communicating and have their own
data this is in danger
• Locking and unlocking is a problem
– Things are changing too fast to let one
transaction lock data
– Without locking, serializability is in danger
Now and Then
• Suppose a transaction is made
• One computer messages all the
others
• By the time that message arrives it
reflects a past state
• By the time it is processed that state
may have changed
• Virtually everything on the Internet
represents a past state, not the
current one
Now and Then Again
• A single computer may think of its
data as current
• It must accept all messages from
other computers as in the past
• Absolute consistency cannot be
obtained
• Eventual consistency is now the
norm
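A minimal sketch of eventual consistency, assuming a "last write wins" rule keyed on a timestamp. The slides do not say how conflicts are resolved, so that rule, the `Replica` class, and the message tuple are all assumptions made for illustration. Each replica treats its own data as current and treats every incoming message as a report about the past.

```python
class Replica:
    def __init__(self):
        self.store = {}   # key -> (timestamp, value)

    def local_write(self, key, value, timestamp):
        message = (key, timestamp, value)
        self.apply(message)
        return message    # to be delivered to the other replicas later

    def apply(self, message):
        key, timestamp, value = message
        current = self.store.get(key)
        # Assumed rule: keep whichever write carries the newer timestamp.
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


r1, r2 = Replica(), Replica()
m1 = r1.local_write("price", 10, timestamp=1)
m2 = r2.local_write("price", 12, timestamp=2)

# Messages are still in flight: the replicas disagree.
before = (r1.read("price"), r2.read("price"))

# Messages arrive, in any order: the replicas converge.
r1.apply(m2)
r2.apply(m1)
after = (r1.read("price"), r2.read("price"))
```

Between the writes and the deliveries the two replicas answer differently; once every message has been applied they hold the same value, which is all that "eventual" promises.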
BASE not ACID
• BASE is an alternative to ACID
• Basically Available, Soft state,
Eventually consistent
– Clearly contrived to complement ACID
• This is acknowledging that when the
data becomes too widely distributed
something has to give
Not the only relaxation of
requirements
• NoSQL databases usually abandon
the whole relational format
• They may also include the relational
database as a subset of the entire
database
• The most common form is the data
store
– AKA key-value store
NoSQL Databases
• Must provide APIs for various
programming languages
• Must scale well to very large sizes
• Indexing is the key to rapid access
• These NoSQL databases are
targeted at different niches
• Generally not interchangeable
– Unlike most RDBMS
Kinds of NoSQL
• Key Value
• Columnar
• Document
• Graph
Key Value
• Simplest model
• There is a key (which must be
unique) linked to a group of values
• It gets more interesting if the values
may include key value pairs as well
• Often not much of a schema
• Think of a database with one table
– Unlimited string as key
– Unlimited string as second field
• Two examples: Riak and Redis
Key-Value stores
• A relational table is a restricted form
of key-value
– The key is the primary key
– The data is all the fields associated with
that key
– However, it may not be even in First
Normal Form
• There is only one table
– Key is unrestricted size string
– Data is whatever needs to be there
– The values may be completely different
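The one-table picture can be sketched with an ordinary Python dictionary standing in for the store. The `put` and `get` helper names and the sample keys are made up for illustration; real products such as Riak or Redis have their own client APIs.

```python
store = {}   # the single "table": string key -> whatever value is needed


def put(key, value):
    store[key] = value


def get(key, default=None):
    return store.get(key, default)


# The values under different keys need not have anything in common.
put("user:42", {"name": "Ada", "langs": ["Python", "SQL"]})  # nested key-value pairs
put("page:/home:hits", 1093)                                  # a bare number
put("session:ab12", "expires=2013-01-01")                     # a plain string

kinds = sorted(type(value).__name__ for value in store.values())
```

Here `kinds` comes out as `['dict', 'int', 'str']`: three keys, three unrelated shapes of value, and nothing resembling a column definition.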
KV Picture
Key Value Again
• In a relational database we always
know what the value extracted from
a cell is
• It has the same meaning as
everything else in the column
• This is no longer the case in key
value stores
Columnar
• Also known as a column store
• A lot of similarity to relational, but
the dominant item is the column not
the row
• We lack the rectangularity that
relational has
• Columns are stored together
• Halfway between Relational and Key
Value
• HBase, Cassandra, Hypertable,
Calpont, MonetDB are examples
Columnar
Columnar again
• Often used in Data warehouses
• Since the columns are stored
together (rather than the rows) and
since the columns have only one
data type, there is an opportunity to
compress a column that is absent in
relational DBs
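To see why storing a column together helps compression, here is a small sketch that turns row-oriented data into columns and run-length encodes them. Run-length encoding is just one simple scheme chosen for illustration; the slides do not say which compression a given column store uses, and the sample rows are invented.

```python
from itertools import groupby

# Row-oriented layout: each tuple is one record (date, state, quantity).
rows = [
    ("2013-01-01", "ND", 5),
    ("2013-01-01", "ND", 7),
    ("2013-01-01", "MN", 5),
    ("2013-01-02", "MN", 5),
]

# Column-oriented layout: one list per column, each of a single type.
columns = [list(column) for column in zip(*rows)]


def run_length_encode(values):
    """Collapse runs of equal neighbors into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


encoded_dates = run_length_encode(columns[0])
encoded_states = run_length_encode(columns[1])
```

The four dates shrink to `[('2013-01-01', 3), ('2013-01-02', 1)]` and the states to `[('ND', 2), ('MN', 2)]`. In the row layout equal values are separated by the other fields of each record, so runs like these never line up.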
Document
• The basic object is now a document
instead of a simple field like a
number
– Document is often XML or JSON
• Each document has an ID and other
identifying values
• A document is an arbitrary and
complicated item
– As if every field were a BLOB
• Examples: MongoDB, CouchDB,
Oracle NoSQL, Amazon’s SimpleDB
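A rough sketch of the document idea, using Python's standard `json` module and a dictionary as the collection. The `insert` and `find` helpers are hypothetical and far simpler than what MongoDB or CouchDB provide; the point is only that each document has an ID and otherwise its own shape.

```python
import json

collection = {}   # document ID -> document stored as JSON text


def insert(doc_id, document):
    collection[doc_id] = json.dumps(document)


def find(field, value):
    """Return the IDs of documents whose top-level field equals value."""
    matches = []
    for doc_id, text in collection.items():
        document = json.loads(text)
        if document.get(field) == value:
            matches.append(doc_id)
    return sorted(matches)


# Documents in one collection do not need to share a structure.
insert("d1", {"type": "book", "title": "Databases", "authors": ["Codd"]})
insert("d2", {"type": "post", "body": "hello", "tags": ["nosql"], "likes": 3})
insert("d3", {"type": "book", "title": "Graphs"})

books = find("type", "book")
liked = find("likes", 3)
```

The query scans every document, which is why real document stores index attributes instead.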
Graph
• A mathematical graph consists of
nodes (the data) and links between
these
– This is the network model revisited
• Used for highly interconnected data
• Processing rides the links
• Neo4J and Zope are examples
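"Processing rides the links" can be illustrated with a breadth-first search over an adjacency list. This is a plain-Python sketch with invented names, not the query interface of Neo4J or any other graph database.

```python
from collections import deque

# Nodes are people; each list holds the nodes a person is linked to.
links = {
    "Ann": ["Bob", "Cat"],
    "Bob": ["Ann", "Dan"],
    "Cat": ["Ann"],
    "Dan": ["Bob", "Eve"],
    "Eve": ["Dan"],
}


def hops(start, goal):
    """Fewest links between two nodes, or None if they are not connected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbor in links.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


ann_to_eve = hops("Ann", "Eve")   # Ann -> Bob -> Dan -> Eve
```

The search never looks at a node that is not reachable through a link, which is the opposite of a relational join that compares whole tables.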
Commentary
• These classifications are incomplete
• Many examples exist that are
combinations of several
• We next look at some example
databases
– Most of these are open source
Riak
• Key value store designed to be
distributed over many nodes
• Designed to be fault-tolerant
– Peer to peer architecture – no master
– All the data is scattered over many
servers and disks
– One or more failures do not
compromise the data
• Everything is done through web
queries
• Used by a quarter of the Fortune 50
• Users include Best Buy, GitHub, Comcast
Redis
• Key value store, optimized for speed
• Creator is Salvatore Sanfilippo who
calls it a data structure server
– Data could be more than a string or
number linked to a key
• May also treat data as a sorted or
unsorted set of strings
– This enables set operations on keys
• Keeps data in memory and
occasionally updates disk
– That means no ACID durability guarantee
• Used by Craigslist, Flickr
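The "data structure server" idea, where a key names a whole set rather than a single string, can be mimicked with Python's built-in sets. This sketch is not the Redis API; the `add`, `intersection`, and `union` helpers and the key names are invented stand-ins for the kind of set operations the slide mentions.

```python
sets = {}   # key -> set of strings, held entirely in memory


def add(key, *members):
    sets.setdefault(key, set()).update(members)


def intersection(*keys):
    """Members present under every one of the given keys."""
    return set.intersection(*(sets.get(key, set()) for key in keys))


def union(*keys):
    """Members present under at least one of the given keys."""
    return set().union(*(sets.get(key, set()) for key in keys))


add("tags:post1", "nosql", "redis", "speed")
add("tags:post2", "nosql", "mongodb")

common = intersection("tags:post1", "tags:post2")
everything = union("tags:post1", "tags:post2")
```

Because the sets live in memory, operations like these are fast, and that is also why durability depends on how often the data is written to disk.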
MongoDB
• Designed to be a very scalable
document-model database
– Used by CERN for Large Hadron
Collider data
• Data is formatted as JavaScript
objects
– JavaScript Object Notation (JSON)
• Attributes are indexed
• Queries now become JavaScript
functions
• APIs in the major languages
• Who is Mongo?
JSON
• A lightweight data interchange
format
• Derived from JavaScript but used
outside of JavaScript
• Most languages have a subroutine to
parse and assimilate JSON
• A short JSON presentation
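As an example of a language-provided JSON parser, Python's standard `json` module turns JSON text into ordinary dictionaries and lists and back again. The sample record below is invented for illustration.

```python
import json

text = (
    '{"name": "Riak", "kind": "key value", '
    '"open_source": true, "users": ["Best Buy", "Comcast"]}'
)

record = json.loads(text)            # JSON text -> Python dict
record["users"].append("GitHub")     # work with it like any other dict

round_trip = json.dumps(record, sort_keys=True)   # Python dict -> JSON text
```

JSON `true` becomes Python `True`, the array becomes a list, and the object becomes a dict, so no schema has to be declared before the data can be used.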
MongoDB and ACID
• Atomicity – yes
• Consistency – no schema, so no
consistency or inconsistency
• Isolation – good, but not perfect
• Durability – yes
Terms
RDBMS        MongoDB
Table        Collection
Row          JSON Document
Index        Index
Join         Embedding and linking
Partition    Shard
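The "Join" row is the least obvious one, so here is a sketch of the two substitutes using plain Python dictionaries. This is not MongoDB driver code; the field names and the `customer_name` helper are made up for illustration.

```python
# Embedding: the related data lives inside the document itself,
# so reading the order needs no second lookup.
order_embedded = {
    "id": 1,
    "customer": {"name": "Ann", "city": "Fargo"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Linking: the document stores only the ID of another document,
# and the application follows that ID itself.
customers = {"c9": {"name": "Ann", "city": "Fargo"}}
order_linked = {"id": 2, "customer_id": "c9"}


def customer_name(order):
    """Follow the link by hand; this is the work a join would do."""
    return customers[order["customer_id"]]["name"]


total_qty = sum(item["qty"] for item in order_embedded["items"])
linked_name = customer_name(order_linked)
```

Embedding trades redundancy for a single read, while linking avoids the duplicate customer data at the cost of an extra lookup.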
CouchDB
• Document based with JSON content
• Each document has a set of keys
that link to it
• Written in Erlang, but with
JavaScript API
– Other languages interface to that
• Very fault tolerant
• Used by LinkedIn, Orbitz
HBase
• A columnar database
• Very scalable – designed for big
data
• Each field is versioned, making it 3D
rather than 2D
– Columns are stored together
– Rows are the related data
– Depth is the older versions
• Used by Facebook, Twitter, Yahoo,
eBay
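The "3D" idea, where a cell is addressed by row, column, and version, can be sketched as follows. The `VersionedTable` class and its `put` and `get` methods are hypothetical and are not the HBase client API; timestamps are plain integers for simplicity.

```python
from collections import defaultdict


class VersionedTable:
    def __init__(self):
        # (row, column) -> list of (timestamp, value), oldest first
        self.cells = defaultdict(list)

    def put(self, row, column, value, timestamp):
        versions = self.cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0])

    def get(self, row, column, as_of=None):
        """Newest value, or the newest one no later than as_of."""
        versions = self.cells.get((row, column), [])
        for timestamp, value in reversed(versions):
            if as_of is None or timestamp <= as_of:
                return value
        return None


table = VersionedTable()
table.put("user1", "email", "ann@old.example", timestamp=1)
table.put("user1", "email", "ann@new.example", timestamp=5)

latest = table.get("user1", "email")
earlier = table.get("user1", "email", as_of=3)
```

A write never overwrites the old value; it adds depth to the cell, and a read chooses how far back to look.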
Cassandra
• Project started by Facebook to track
status updates
• Became an Apache project
• Intended to create a network of
equal nodes
• Eventual consistency not perfect
consistency
• Mostly written in Java but provides
APIs in Python, Ruby, PHP among
others
• Used by IBM, HP, Netflix among
others
Neo4J
• Graph database
– Network of nodes and links
• Data is information on a person or
thing
• Links are the connections between
one datum and another
• Numerous graph algorithms have
been implemented
– Consider Facebook connections
• Used by Adobe, Lufthansa, Mozilla
CAP
• Several of these are distributed
• Since they cannot do all three they
generally are good at two of the
three
• See the following picture
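CAP
[Figure: triangle with corners Consistency, Partition tolerance, and Availability]
• Between Consistency and Partition tolerance: MongoDB, HBase
• Between Partition tolerance and Availability: Riak, CouchDB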
Niches
• For a product to be successful it
must find one or more niches where
it may do well
• A niche is a particular set of
circumstances and requirements
• Next we want to consider some of
these products and what they do
well and what they do poorly
Relational
• Layout and form of the data is well
known in advance and relatively
stable
– We do not need to know in advance
what will be done with the data, but we
do need to know the form
– Most business processes have these
kinds of requirements
• Not as effective for deeply
hierarchical and widely varying data
Key Value
• Easy to make fast or horizontally
scalable or both
• Useful where data does not conform
to a well known schema or the data
is not very well related
• Searches are easy but more
complicated queries are not
– No indices
– No linkages, i.e. foreign keys
Columnar
• Horizontal scalability is based on
storing columns in different nodes
– Thus good for big data
• Allows for versioning
• Like relational, schema needs to be
done in advance
– Based on what queries are needed
– Does poorly with ad hoc data and
queries
Document
• Works well with data that is highly
variable and not known in advance
• Content is often JSON, so these are
object oriented databases
• No normalization is possible, so
redundancies are mostly
unavoidable
• Most interesting queries are not
possible
Graph
• Particularly useful for modeling
networks
• For social networking applications
– Nodes are people and edges their
relationships
– Hard to model this in other models
• Not easy to partition, so not easy to
scale
• No common query language
déjà vu?
• In the early 1970s the database
world was in some disarray
• There were several models
• None had achieved dominance
• Commercial offerings were present,
but theoretical foundation was
lacking
• There was no uniformity to these
products
• Interchanging products was very
difficult
The End or Start of an Era
• Codd changed that by the
development of a theoretical
foundation for relational databases
• SQL became the common language
• For several decades now Relational
Databases have been the
undisputed king
• RDBMS is a 32 billion dollar industry
• The products are to some degree
interchangeable
Again
• The situation around NoSQL
databases has a lot of the same feel
as in the 1970s
• They are not interchangeable and
not even directed towards the same
ends
• Is this the end of RDBMS era?
• Unlikely we will soon get rid of
RDBMS, but it is not likely to be as
exclusive as it has been
Finally
• Some of the motivations of the
NoSQL movement are:
– Big Data
– Requirements to be distributed
– Volatility of data, largely caused by web
• Check out the following link
– DB-Engines.com rates popularity of
databases