Download Distributed and Non-Relational Databases

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Big data wikipedia , lookup

Open Database Connectivity wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Database wikipedia , lookup

Concurrency control wikipedia , lookup

Functional Database Model wikipedia , lookup

Clusterpoint wikipedia , lookup

Relational model wikipedia , lookup

Database model wikipedia , lookup

Transcript
Distributed and Non-Relational Databases
Mark Feltner
Department of Computer Science
University of Wisconsin – Platteville
Platteville, WI 53818
[email protected]
Abstract
The data storage climate was much different when SQL – the standard data storage model – was
conceived in the 1970s. Today’s data is adaptive, moves quickly, can be astronomical in size,
and may not even be located in a single geographic spot. Consider: the New York Stock
Exchange generates about one terabyte of trade data per day, Facebook hosts approximately ten
billion photos which take up about one petabyte of storage, and the Large Hadron Collider in
Geneva, Switzerland will produce about fifteen petabytes of data per year [1]. New data storage
solutions have been, and are being, conceived which specialize in handling these sorts of
problems. These non-relational models (so called because they stray from the foundations of
relational, SQL databases) are flexible to design and develop with, can easily be scaled
horizontally, and are specialized for dealing with Big Data.
Theory
This section deals broadly with the theory of databases and distributed systems. The foundational
concepts of databases and both the relational and non-relational models are described. A few
important concepts relevant to data storage with distributed systems are also featured.
Relational Databases
The first coinage of a database management system was in 1970 by Edgar Codd at IBM Almden
Research Center. Codd proposed a relational model, based on Set Theory, to structure stored data
[2].
The relational model defines a relation as a set of tuples which have the same attribute. Usually
one tuple represents an object and its associated information. Basically, each tuple is a row in the
table and each attribute is a column [3]. Usually, both rows and attributes are unordered sets. A
1
relation requires a primary key, or at least a set of attributes which are guaranteed to be distinct
among all tuples in the relation [4].
Relations are referenced via foreign keys. This allows a single tuple to be referenced by any
number of other tuples (i.e., one-to-many relationship).
The relational model also allows one to define precompiled indexes which can find tuples
quicker than checking each single tuple for a match.
Data in the relational model is often normalized to retain integrity and remove redundancy [1].
Database transactions are the standard units of work for a relational database management
system (DBMS). Transactions are small and simplified work units. Examples being INSERT,
UPDATE, SELECT, and DELETE. A good DBMS is usually able to take a complex query and divide
it up into efficient, small, discrete, and reliable transactions. This provides the DBMS with the
ability to keep the database in a consistent state even if a transaction fails. Transaction-based
systems benefit from being able to ROLLBACK from changes by simply reversing the transactions
that were executed.
Besides querying data with the SELECT statement, users can modify relations using INSERT,
DELETE, UPDATE, and, of course, other DBMS provided procedures.
ACID
ACID is one of the core characteristics of a "reliable" database. ACID is an acronym for
Atomicity, Consistency, Isolation, and Durability. This set of properties guarantees that reliable
database transactions may take place, and has been a mainstay in modern relational database
systems.
Atomicity simply means that each transaction is all-or-nothing. If any operation in the transaction
group fails, then the entire transaction must be undone. If one were to insert a million rows into a
table, and the 999,999 thousandth one failed before a COMMIT was issued, the database would
revert back to its previous state.
For a database to be consistent, the data must always be in a valid, or consistent, state. Relational
databases are designed with rules in mind. Strict schemas are defined and specify maximum
lengths, value types, relations, and other rules. An ACID compliant database would never break
the rules it is defined with. Of course, this consistency does not guarantee programmer
correctness, just that the database will follow the rules laid out for it.
Isolation defines a DBMS property that enables concurrent execution of transactions. Isolation
ensures that transactions executed serially will result in the same system state as transactions
executed concurrently.
Durability means that once a programmer has executed a COMMIT then the transaction is
permanent across all clients.
2
All of these checks and rules, of course, do add overtime to each transaction that occurs in an
ACID compliant DBMS. For systems in the banking or government sectors that do require a
strong degree of data reliability, then founding their models on that of ACID will benefit them.
The relational model is backed by decades of use, strong foundations in well-studied
mathematics, and a variety of ready-to-use implementations. That being said, it is not the end-allbe-all of data storage. It is inflexible, tedious to work with, and the benefits of ACID do not
always outweigh the costs [4]. Relational databases lend themselves to systems which require a
large degree of reliability and availability, and are generally better suited for single-node
systems. The rise of distributed systems and web services, and the difficulty found when scaling
traditional DBMSs, have led to new paradigms in data storage.
Non-Relational Databases
In 2009, the landscape of databases had changed vastly from that conceived by Codd in the
1970s. Nowadays, we are dealing with systems made to handle thousands of concurrent
connections a second, store petabytes of data, and sync with other similar servers and services
around the globe simulatenously.
NoSQL (so named because of its focused on data storage techniques sans SQL) defines nonrelational, distributed data stores that often did not attempt to provide atomicity, consistency,
isolation, and durability – the components of ACID. As consumers became more and more
connected, data needed to be delivered faster. Two methods of solving this problem were
conceived. First, spread out the data stores geographically. Second, specialize the data stores for
the types of data they hold and process [4].
NoSQL systems store semi-structured or unstructured data. Semi-structured data is much like a
spreadsheet. You know that the data is organized in cells, rows, and columns, but you only have
a loose idea of what kind of data is stored. Unstructured data has no particular internal structure
(i.e., plain text or image data).
Recently, there has been a surge of NoSQL and NoSQL-esque database management systems.
Most, if not all, of these systems lack fixed schemas, avoid joins, and scale really well.
Different systems satisfy different needs. Document-oriented ones, for instance, have an
immense easy-of-use, while key-value or column oriented systems sacrifice this in order to more
efficiently scale data over clusters [4].
Key-Value
Key-value stores are basically “big, distributed, persistent, fault-tolerant hash” tables. Based on
an extremely simple concept, hash tables, key-value stores are prized for their simplicity and
speed [4]. Key-value stores allow the application developer to store schema-less data. This is
beneficial when using a database for persistent storage of object oriented data.
3
The data in a key-value store is usually just a string that represents a key, and a value which is
some type of primitive of whatever programming language is being used (string, integer, array)
[8].
There are a variety of ways to implement this. The most common key-value store is the file
system. The key, for example, would be the path to a file, and the value is the binary data
associated with that file. This simplicity means that key-value stores are not only easy to use, but
the code produced is simple and readable. An example of accessing a key-value store in ruby:
require "pstore" # the key-value store interface
store = PStore.new("data-file.pstore")
store.transaction do # begin transaction
# load some data into the store...
store[:single_object] = "Lorem ipsum dolor sit amet..."
store[:obj_heirarchy] = { "Marc Seeger" => ["ruby", "nosql"],
"Rainer Wahnsinn" => ["php", "mysql"] }
end # commit changes to data store file [8].
Some of the most popular key-value storage solutions include: Apache Cassandra, Redis,
memcached, and Google’s BigTable.
Document-oriented
Document-oriented databases are defined as semi-structured data storages implemented without
tables and relations [4]. Document-stores are highly flexible and have no fixed schema. The idea
is to allow the client application to add and remove attributes to any single tuple without wasting
any space by creating empty attributes for all the other tuples.
Document-oriented databases are suited well for dynamic data because of their lack of fixedschemas and developer flexibility. Because of this, though, document-oriented databases have a
lack of safety (much like using a duck-typed language versus a statically typed one) and there is
a relative performance hit.
Many popular document-oriented databases provide layers over already standardized semistructured file formats such as JSON, YAML, or XML. This allows them to map easily to web
services and other distributed systems which operate by passing these small, structured files
around.
Some of the most popular document-oriented databases include MongoDB, CouchDB, and
SimpleDB.
Graph
Graph-based databases use graph-like structures and use nodes, edges, and properties to
represent data. Graph database are powerful tools, especially for graph-like queries such as
shortest path computation and community analysis.
4
These databases provide another advantage: index-free adjacency. This basically means no index
lookups are necessary because each element contains a direct pointer to its adjacent element.
Graph database scale more naturally for large datasets and are often faster that object-oriented
applications for associative data sets. Graphs typically don’t require expensive JOIN operations
and their flexible schemas are suitable for ad-hoc and changing data.
Some popular graph-based solutions include AllegroGraph, InfiniteGraph, and FlockDB.
Other
Of course, there are many other types of distributed databases including tuple stores, tabular
databases, RDF databases, object databases, multivalue databases, RAM stores, and many other
sub-categories of previously listed categories. All-in-all, there exist quite a number of solutions
for almost any data storage need.
Distributed Systems
Distributed systems, also referred to commonly as “grids”, are in rising popularity. There has
been a growing need for distributed systems in both commercial and scientific domains. Grid
systems can be classified into the computational, data, and service types. For the purposes of this
paper, we will focus on the data grid. The data grid is defined as “systems that provide a
hardware and software infrastructure for synthesizing new information from data repositories
that are distributed in a wide area network” [9].
Almost any application that relies on a database which houses a large volume of data, generally
the bottleneck of that application will be its database. I/O operations are expensive, disk reads
are slow, writes are slower, and since many databases exist in remote data centers one must take
network latency into account as well. A simple solution is to add more hardware to the database
system, but this is innately unsustainable.
Another solution is to parallelize and distribute your data processing. A parallel database
management system can achieve both high throughput and interquery parallelism and low
response time in intraoperation parallelism [10].
Parallelism does add architectural complexity, and creates a lot of system administration
overhead. The biggest disadvantage in parallel computing is the lack of standards and applicable
examples [10].
The structure of distributed systems grants them different performance metrics and different
goals. These systems are measured on their efficiency, or utilization rate of resources;
dependability; adaptability, or ability to support billions of job requests of massive data sets and
virtualized cloud servers under various workloads and service models; and flexibility, or the
ability of the system to run in both high-performance and high-throughput computing
environments [11].
5
CAP Theorem
In 2000, Eric Brewer laid out the CAP Theorem in a keynote at the Proceedings of the Annual
ACM Symposium on Principles of Distributed Computing, and Seth Gilbert and Nancy Lynch
formalized his ideas into a theory in 2009. The CAP Theorem states that it is impossible to
guarantee consistency, availability, and partition tolerance in a distributed system [4]. This idea
has vast implications, especially today where almost every system is distributed in one way or
another.
A service that is consistent operates fully or not at all. That is, a system is consistent if an update
is applied to all relevant nodes at the same logical time [5]. One way to drop consistency with
your service is to simply deal with inconsistencies. Some services simply accept that things are
eventually consistent.
Eventual consistency is a consistency model used in the parallel programming domain. It means
that given a sufficiently long period of time over which no changes are sent, all updates can be
expected to propagate throughout the system and replicas will be consistent [6].
“For a distributed system to be continuously available, every request received by a non-failing
node in the system must result in a response” [7]. Availability is one of the hardest things to test
and plan for because it tends to desert you when you need it most. It is only inevitable that
services go down at busy times just because they are busy. Implementing a system that can deal
with availability gracefully can be non-trivial [4].
“In order to model partition tolerance, the network will be allowed to lose arbitrarily many
messages sent from one node to another. When a network is partitioned, all messages sent from
nodes in one component of the partition to nodes in another component are lost” [7]. Once a
system has become distributed, there is a high chance that it will be in a partitioned state. Say the
data has not traversed the wires to reach your entire data store yet, or perhaps some of your
machines have crashed and are no longer online. To handle issues of consistency and
availability, we can scale horizontally, but as the number of partitions increases, the more atomic
entities you need to keep synchronized. You can drop partition tolerance by placing everything
on a single machine, but this severely limits your ability to scale [4].
To have a consistent system you must sacrifice some level of availability or partition tolerance,
a system with good availability will have to sacrifice some consistency or partition tolerance, and
a system that is partitioned will need to sacrifice some degree of consistency or availability.
Choosing which part of CAP to drop has been argued over-and-over again since CAP’s
inception. Hale believes that you cannot drop partition tolerance because the probability that one
or more node in your system will fail increases exponentially as you add nodes. Instead, Hale
says that you should be asking yourself: “In the event of failures, which will this system
sacrifice? Consistency or availability?”
6
Algorithms
Distributed Systems and Data Parallelization Techniques
MapReduce
MapReduce is a popular web programming model used for scalable data processing on large
clusters over large data sets [4]. MapReduce is suited for applications where data is written once
and read many times [1]. The user provides two functions: a map function which generates a list
of intermediate key/value pairs, and a reduce function which merges all the intermediate values
with the same intermediate key. Being purely functional, MapReduce operates well on semi- or
unstructured data
The map function serves to process a subset of the original problem and returns the result as a list
of (A,B)-tuples.
The reduce function will, given an A value and a list of B values, produces a list of results that is
the final answer for all A-parts. The beauty of these two operations is that they are purely
functional and have no side-effects. What does this mean? It means they tend to scale well, are
highly parallelizable, and extremely flexible [4].
Figure 1: MapReduce Visual Representation
7
This operation is linearly scalable and is massively parallelizable [1]; if you double the amount
of data to operate on, the operation takes twice as much time, but you can also double the size of
the cluster and have the operation take half as much time [1]. It has been proven to work on
terabytes of data on hundreds of thousands of client machines with hundreds of simultaneous
MapReduce programs. In fact, there are estimated thousands of these operations occurring on
Google’s clusters every day [11].
Databases
Table 1: Example collection used to compare row- and column-oriented operations
Title
Breaking the Law
Aces High
Kickstart My Heat
Raining Blood
I Wanna Be Somebody
Artist
Judas Priest
Iron Maiden
Motley Crue
Slayer
W.A.S.P.
Album
British Steel
Powerslave
Dr. Feelgood
Reign in Blood
W.A.S.P.
Year
1980
1984
1989
1986
1984
Row-oriented
Row-oriented databases represent objects as tuples and attributes as one piece of information
about that object. Generally, we think of a tuple as a row and a attribute as column. This is the
primary way of thinking about data in the relational model [4].
A row-oriented database would store data somewhat like this:
Breaking the Law, Judas Priest, British Steel, 1980, Aces High, Iron Maiden, Powerslave.
1984 … [4]
Column-oriented
Column-oriented databases are tuned to operate more efficiently on a set of attributes for all
tuples.
And the table above in a column-oriented database:
Breaking the Law, Aces High, Kickstart My Heart, Raining Blood, I Wanna Be
Somebody, Judas Priest, Iron Maiden, Motley Crue, Slayer, W.A.S.P. , British Steel,
Powerslave, Dr. Feelgood, Reign in Blood … [4]
8
These two distinctly different ways of storing data result in very different results when database
operations are performed.
To create a new tuple (CREATE) in a row-oriented database, the system would only need to
append the new tuple to the end of the collection (table). In a column-oriented database, careful
insertions would need to be made to keep the data lined up correctly.
Reversely, creating a new tuple in a row-oriented database require the more expensive and nontrivial operation, whereas it is as simple as inserting the new data at the end of a column oriented
database.
Select statements, such as a query listing all the albums from the year 1984, are more efficient in
a row-oriented system because the query simply needs to check the 4th element of each tuple and
return the ones matching the year.
Aggregate statements, on the other hand, are more efficient in a column-oriented solution. An
operation such as summing all the years is much easier in a column –oriented system because all
the years are consecutively found and need not be compared to other attributes.
In summary, we can conclude that row- and column-oriented solutions both have trade-offs. A
row-oriented system may be best when the system requires the creation of many rows or the
retrieval of many attributes from the same tuple at the same time. Column-oriented systems
might be the better option when new columns are created often, and aggregations over rows, but
not columns, happen a lot of the time [4].
Technology
Enabling Technology
It has only been in the last decade or so that NoSQL solutions have really started to appeal to
developers and engineers. This is the result of a couple of enabling technologies in the hardware
sector. Specifically, advances in computation and memory and storage systems.
Today’s hard drives can reach sizes in excess of one terabyte with read speeds around 100 MB/s.
So it takes more than two and a half hours to read all the data off the disk. This a long time to
read data, and it takes even longer to write. The solution to this problem is to spread the data out
onto separate drives. By replicating the data onto, say, 100 separate drives, we can take
advantage of the ability to read from multiple disks at once.
There are two problems with this approach though. First is how to deal with hardware failure, a
common occurrence. Commonly, you can use replication to store redundant copies of data so
that if there is a failure, extra copies are available. The second big issue is that you are going to
need to combine that data in some way eventually. Distributed data-storage systems abstract
these problems and provide reliable, shared storage and analysis systems.
9
Distributed Systems
Hadoop
Hadoop was created by Doug Cutting, named after his son’s stuffed elephant, and was bred off
of Apache Nutch, an open-source web search engine that is part of the Lucene project [1]. In
2008, Yahoo! announced that its production search-index was being powered by a 10,000 core
Hadoop cluster [1]. In April of 2008, Hadoop broke the world record to become the fastest
system to sort a terabyte of data. It ran on a 910 –node cluster and only took 209 seconds,
beating the previous winner by 88 seconds [1].
HBase
HBase is another Apache Software Foundation project that stemmed from the Hadoop project.
HBase is basically a Hadoop database running over a Hadoop Distributed File System. It is
modeled after Google’s BigTable. HBase provides nice developer tools including a Java API,
REST interface, and Thrift gateway.
HBase follows a column-oriented architecture and has easy integration with MapReduce jobs
which are run in Hadoop. Tables can be used as both input and output for MapReduce jobs. Its
data model is based on a four dimensional dictionary.
Hbase_db[table][row][column]][time-stamp] -> value
HBase has excellent fault tolerance. It has block-level replication which is achieved from storing
files as set of blocks which are then distributed across the entire cluster. Block replication also
gives HDFS high-throughput across large data sets. By splitting individual files in 64MB blocks
HDFS is able to decrease the amount of metadata storage per file and read data fast because of
sequential blocks.
HBase and the HDFS file system it is built on will sometimes compromise reliability in order to
achieve lower communication costs. HBase supports heartbeat and block report periodic
messaging between nodes.
Cassandra
Cassandra is a “distributed storage solution for managing very large amounts of structured data
spread out across many commodity servers, while providing highly available service with no
single point of failure” [12]. Cassandra is an Apache Software Foundation project now, but was
designed by Facebook to fix their Inbox Search problem. The ability of Facebook users to search
through their Inboxes required massive amounts of read and write throughput, billions of writes
per day, and needed to be scalable for the future. Facebook created Cassandra to resolve this
10
problem and so far Cassandra has held to its duty by scaling well from 100 million users to 250
million users [12].
Cassandra is a column-oriented, distributed DBMS. It features a four to five dimensional keyvalue store with eventual consistency [4].
The Cassandra API is extremely simple, and only exposes three methods:



Insert(table, key, rowMutation)
Get(table, key, columnName)
Delete(table, key, columnName)
[12]
Cassandra has the backing of a large and popular company in Facebook, and has proven itself on
the largest scale. It is extremely simple to use and get started, yet extremely powerful.
Databases
Relational
MySQL is one of the most popular relational database management systems in the world. It is a
core component of web development and is the ‘M’ in the LAMP (Linux Apache MySQL PHP)
acronym. You will find MySQL used in small to medium scale single-server deployments.
MySQL does support scaling by using full replication to increase read capacity, or sharding to
increase write capacity. It also supports multiple storage engines, both native, partner developed,
and open-source.
MongoDB
One popular document-oriented database is MongoDB. MongoDB is a scalable, opens-source,
document-oriented database that has been around since October 2007. MongoDB is based on BTrees and stores data as binary JSON-style documents. MongoDB scales very easily horizontally
and supports sharding and replication. MongoDB is mainly used for problems with strong
transaction requirements. MongoDB is also very flexible and developer-friendy. Its lack of fixed
schema means modifying database definitions, columns, and tuples is easy.
Conclusions
Any software that depends on persistent storage will end up using a database one way or another,
and many times the database can and will be the bottleneck of an application. It is important to
11
know what the implications are in choosing a database system. Today’s data storage climate is
such that traditional, SQL-based DBMSes are not the best solution to every problem. Through a
better understanding of the theory, algorithms, and implementations behind relational and nonrelation databases one will be better suited to making data-driven decisions. For a vast account of
performance considerations that is out of the scope of this paper, the author would like to suggest
Chapter 7 in Lith and Mattsson’s Paper: "Investigating storage solutions for large data”. The
paper investigates relational versus non-relational databases in a specific case study and includes
pages of graphs showing specific performance benchmarks for different databases.
12
References
[1] T. E. White, Hadoop: the definitive guide, Sebastopol, CA: O'Reill Media, Inc., 2012.
[2] M. Stonebraker and U. Çetintemel, "“One Size Fits All”: An Idea Whose Time Has Come and
Gone," Providence, 2007.
[3] C. Ricardo, Databases Illuminated: Second Edition, Ontario: Jones & Bartlett Learning, 2012.
[4] A. Lith and J. Mattsson, "Investigating storage solutions for large data," Chalmers University of
Technology, pp. 1-70, 2010.
[5] C. Hale, "You Can't Sacrifice Partition Tolerance," 07 October 2010. [Online]. Available:
http://codahale.com/you-cant-sacrifice-partition-tolerance/. [Accessed 07 April 2013].
[6] W. Vogels, "Eventually Consistent," ACM Queue, 2008.
[7] S. Gilbert and N. Lynch, "Brewers conjecture and the feasibility of consistent, available, partitiontolerant web services," p. 12, 2002.
[8] M. Seeger, "Key-Value stores: a practical overview," medien informatik, 2009.
[9] F. Magoules, Fundamentals of Grid Computing, Boca Raton: Chapman and Hall, 2010.
[10] A. J. Wells, Grid Database Design, Boca Raton: Auerbach Publications, 2005.
[11] J. D. G. C. F. Kai Hwang, Distributed and Cloud Computing: From Parallel Processing to the
Internet of Things, Morgan Kaufmann, 2011.
[12] A. Lakshman and P. Malik, "Cassandra - A Decentralized Structured Storage System," Ithaca, 2009.
[13] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes
and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," in OSDI'06:
Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 2006.
Acknowledgements
Figure 1: http://upload.wikimedia.org/wikipedia/en/3/35/Mapreduce_Overview.svg