Distributed and Non-Relational Databases

Mark Feltner
Department of Computer Science
University of Wisconsin – Platteville
Platteville, WI 53818
[email protected]

Abstract

The data storage climate was much different when SQL – the standard data storage model – was conceived in the 1970s. Today’s data is adaptive, moves quickly, can be astronomical in size, and may not even be located in a single geographic spot. Consider: the New York Stock Exchange generates about one terabyte of trade data per day, Facebook hosts approximately ten billion photos which take up about one petabyte of storage, and the Large Hadron Collider in Geneva, Switzerland will produce about fifteen petabytes of data per year [1]. New data storage solutions have been, and are being, conceived which specialize in handling these sorts of problems. These non-relational models (so called because they stray from the foundations of relational, SQL databases) are flexible to design and develop with, can easily be scaled horizontally, and are specialized for dealing with Big Data.

Theory

This section deals broadly with the theory of databases and distributed systems. The foundational concepts of databases and both the relational and non-relational models are described. A few important concepts relevant to data storage with distributed systems are also featured.

Relational Databases

The database management system was first proposed in 1970 by Edgar Codd at the IBM Almaden Research Center. Codd proposed a relational model, based on set theory, to structure stored data [2]. The relational model defines a relation as a set of tuples that share the same attributes. Usually one tuple represents an object and its associated information: each tuple is a row in the table, and each attribute is a column [3]. Both rows and attributes are usually unordered. A relation requires a primary key: a set of attributes which is guaranteed to be distinct among all tuples in the relation [4].
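To make the terminology concrete, a relation can be sketched in plain Ruby as an array of tuples sharing the same attributes (the table, attribute, and key names below are invented for illustration):

```ruby
# A toy relation: a set of tuples sharing the same attributes.
# The :id attribute plays the role of the primary key.
employees = [
  { id: 1, name: "Ada",   dept: "Engineering" },
  { id: 2, name: "Grace", dept: "Engineering" },
  { id: 3, name: "Edgar", dept: "Research" }
]

# The primary-key property: no two tuples may share an :id.
ids = employees.map { |t| t[:id] }
raise "primary key violated" unless ids.uniq.length == ids.length

# A simple selection over the relation, analogous to
# SELECT * FROM employees WHERE dept = 'Research'
research = employees.select { |t| t[:dept] == "Research" }
```

A real DBMS adds indexing, persistence, and constraint enforcement on top of this picture, but the underlying model is just sets of uniform tuples.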
Relations are referenced via foreign keys. This allows a single tuple to be referenced by any number of other tuples (i.e., a one-to-many relationship). The relational model also allows one to define precompiled indexes which can find tuples more quickly than checking every tuple for a match. Data in the relational model is often normalized to retain integrity and remove redundancy [1].

Database transactions are the standard units of work for a relational database management system (DBMS). Transactions are small and simplified work units; examples include INSERT, UPDATE, SELECT, and DELETE. A good DBMS is usually able to take a complex query and divide it into efficient, small, discrete, and reliable transactions. This provides the DBMS with the ability to keep the database in a consistent state even if a transaction fails. Transaction-based systems benefit from being able to ROLLBACK changes by simply reversing the transactions that were executed. Besides querying data with the SELECT statement, users can modify relations using INSERT, DELETE, UPDATE, and, of course, other DBMS-provided procedures.

ACID

ACID is one of the core characteristics of a "reliable" database. ACID is an acronym for Atomicity, Consistency, Isolation, and Durability. This set of properties guarantees that reliable database transactions may take place, and it has been a mainstay in modern relational database systems.

Atomicity simply means that each transaction is all-or-nothing. If any operation in the transaction group fails, then the entire transaction must be undone. If one were to insert a million rows into a table, and the 999,999th one failed before a COMMIT was issued, the database would revert to its previous state.

For a database to be consistent, the data must always be in a valid, or consistent, state. Relational databases are designed with rules in mind. Strict schemas are defined and specify maximum lengths, value types, relations, and other rules.
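The all-or-nothing behavior of atomicity can be sketched with Ruby's PStore, a file-backed transactional store from the standard library, standing in for a full DBMS (the account names and file name are invented):

```ruby
require "pstore"
require "tmpdir"

# A throwaway on-disk store; the file name is arbitrary.
store = PStore.new(File.join(Dir.mktmpdir, "accounts.pstore"))

# Seed two account balances in a single committed transaction.
store.transaction do
  store[:alice] = 100
  store[:bob]   = 0
end

# A transfer that fails midway: the exception aborts the whole
# transaction, so the debit from alice is discarded as well.
begin
  store.transaction do
    store[:alice] -= 50
    raise "simulated failure before crediting bob"
    store[:bob] += 50
  end
rescue RuntimeError
  # the store has rolled back to its last committed state
end

alice = bob = nil
store.transaction(true) do # read-only transaction
  alice = store[:alice]
  bob   = store[:bob]
end
# alice is 100 and bob is 0: the partial transfer left no trace
```

A production DBMS implements the same guarantee with write-ahead logs rather than a single serialized file, but the observable contract is identical: either every operation in the transaction commits, or none do.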
An ACID-compliant database would never break the rules it is defined with. Of course, this consistency does not guarantee programmer correctness, just that the database will follow the rules laid out for it.

Isolation defines a DBMS property that enables concurrent execution of transactions. Isolation ensures that transactions executed concurrently will result in the same system state as transactions executed serially.

Durability means that once a programmer has executed a COMMIT, the transaction is permanent across all clients.

All of these checks and rules, of course, do add overhead to each transaction that occurs in an ACID-compliant DBMS. Systems in sectors such as banking or government, which require a strong degree of data reliability, benefit from founding their models on ACID.

The relational model is backed by decades of use, strong foundations in well-studied mathematics, and a variety of ready-to-use implementations. That being said, it is not the end-all-be-all of data storage. It is inflexible, tedious to work with, and the benefits of ACID do not always outweigh the costs [4]. Relational databases lend themselves to systems which require a large degree of reliability and availability, and are generally better suited for single-node systems. The rise of distributed systems and web services, and the difficulty found when scaling traditional DBMSs, have led to new paradigms in data storage.

Non-Relational Databases

By 2009, the landscape of databases had changed vastly from that conceived by Codd in the 1970s. Nowadays, we are dealing with systems made to handle thousands of concurrent connections a second, store petabytes of data, and sync with similar servers and services around the globe simultaneously.
NoSQL (so named because of its focus on data storage techniques sans SQL) defines non-relational, distributed data stores that often do not attempt to provide atomicity, consistency, isolation, and durability – the components of ACID. As consumers became more and more connected, data needed to be delivered faster. Two methods of solving this problem were conceived: first, spread the data stores out geographically; second, specialize the data stores for the types of data they hold and process [4].

NoSQL systems store semi-structured or unstructured data. Semi-structured data is much like a spreadsheet: you know that the data is organized in cells, rows, and columns, but you only have a loose idea of what kind of data is stored. Unstructured data has no particular internal structure (i.e., plain text or image data).

Recently, there has been a surge of NoSQL and NoSQL-esque database management systems. Most, if not all, of these systems lack fixed schemas, avoid joins, and scale well. Different systems satisfy different needs. Document-oriented ones, for instance, offer immense ease-of-use, while key-value or column-oriented systems sacrifice this in order to scale data over clusters more efficiently [4].

Key-Value

Key-value stores are basically “big, distributed, persistent, fault-tolerant hash tables”. Based on an extremely simple concept, the hash table, key-value stores are prized for their simplicity and speed [4]. Key-value stores allow the application developer to store schema-less data, which is beneficial when using a database for persistent storage of object-oriented data. The data in a key-value store is usually a string that represents a key, and a value which is some type of primitive of whatever programming language is being used (string, integer, array) [8]. There are a variety of ways to implement this. The most common key-value store is the file system.
The key, for example, would be the path to a file, and the value the binary data associated with that file. This simplicity means that key-value stores are not only easy to use, but the code produced is simple and readable. An example of accessing a key-value store in Ruby:

    require "pstore" # the key-value store interface

    store = PStore.new("data-file.pstore")
    store.transaction do # begin transaction
      # load some data into the store...
      store[:single_object] = "Lorem ipsum dolor sit amet..."
      store[:obj_hierarchy] = {
        "Marc Seeger"     => ["ruby", "nosql"],
        "Rainer Wahnsinn" => ["php", "mysql"]
      }
    end # commit changes to data store file [8]

Some of the most popular key-value storage solutions include Apache Cassandra, Redis, memcached, and Google’s BigTable.

Document-oriented

Document-oriented databases are defined as semi-structured data storages implemented without tables and relations [4]. Document stores are highly flexible and have no fixed schema. The idea is to allow the client application to add and remove attributes on any single tuple without wasting any space by creating empty attributes for all the other tuples. Document-oriented databases are well suited for dynamic data because of their lack of fixed schemas and their developer flexibility. Because of this, though, document-oriented databases lack safety (much like using a duck-typed language versus a statically typed one) and there is a relative performance hit. Many popular document-oriented databases provide layers over already standardized semi-structured file formats such as JSON, YAML, or XML. This allows them to map easily to web services and other distributed systems which operate by passing these small, structured files around. Some of the most popular document-oriented databases include MongoDB, CouchDB, and SimpleDB.

Graph

Graph-based databases use graph-like structures with nodes, edges, and properties to represent data.
Graph databases are powerful tools, especially for graph-like queries such as shortest-path computation and community analysis. These databases provide another advantage: index-free adjacency. This means no index lookups are necessary, because each element contains a direct pointer to its adjacent elements. Graph databases scale more naturally for large datasets and are often faster than object-oriented databases for associative data sets. Graphs typically don’t require expensive JOIN operations, and their flexible schemas are suitable for ad-hoc and changing data. Some popular graph-based solutions include AllegroGraph, InfiniteGraph, and FlockDB.

Other

Of course, there are many other types of distributed databases, including tuple stores, tabular databases, RDF databases, object databases, multivalue databases, RAM stores, and many other sub-categories of the previously listed categories. All in all, there exist quite a number of solutions for almost any data storage need.

Distributed Systems

Distributed systems, also commonly referred to as “grids”, are rising in popularity. There has been a growing need for distributed systems in both commercial and scientific domains. Grid systems can be classified into computational, data, and service types. For the purposes of this paper, we will focus on the data grid. The data grid is defined as “systems that provide a hardware and software infrastructure for synthesizing new information from data repositories that are distributed in a wide area network” [9]. For almost any application that relies on a database housing a large volume of data, the bottleneck will generally be the database itself. I/O operations are expensive, disk reads are slow, writes are slower, and since many databases exist in remote data centers one must take network latency into account as well. A simple solution is to add more hardware to the database system, but this is innately unsustainable.
Another solution is to parallelize and distribute your data processing. A parallel database management system can achieve both high throughput, via inter-query parallelism, and low response time, via intra-operation parallelism [10]. Parallelism does add architectural complexity, however, and creates a lot of system administration overhead. The biggest disadvantage in parallel computing is the lack of standards and applicable examples [10].

The structure of distributed systems grants them different performance metrics and different goals. These systems are measured on their efficiency, or utilization rate of resources; dependability; adaptability, or the ability to support billions of job requests over massive data sets and virtualized cloud servers under various workloads and service models; and flexibility, or the ability of the system to run in both high-performance and high-throughput computing environments [11].

CAP Theorem

In 2000, Eric Brewer laid out the CAP Theorem in a keynote at the ACM Symposium on Principles of Distributed Computing, and Seth Gilbert and Nancy Lynch formalized his ideas into a theorem in 2002 [7]. The CAP Theorem states that it is impossible for a distributed system to simultaneously guarantee consistency, availability, and partition tolerance [4]. This idea has vast implications, especially today, when almost every system is distributed in one way or another.

A service that is consistent operates fully or not at all. That is, a system is consistent if an update is applied to all relevant nodes at the same logical time [5]. One way to drop consistency in your service is to simply deal with inconsistencies. Some services accept that things are eventually consistent. Eventual consistency is a consistency model used in the parallel programming domain. It means that given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate throughout the system, and replicas will be consistent [6].
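Eventual consistency as described above can be sketched in a few lines of plain Ruby (an invented toy model, not a real replication protocol): a write lands on one replica immediately and is queued for the others, so replicas disagree until the queue drains, after which every copy converges.

```ruby
# Three replicas of the same store, each modeled as a hash.
replicas = [{}, {}, {}]
pending  = []   # updates not yet propagated to the other replicas

write = lambda do |key, value|
  replicas[0][key] = value                                # applied locally at once
  replicas.drop(1).each { |r| pending << [r, key, value] } # deferred elsewhere
end

write.call(:greeting, "hello")

# Immediately after the write the replicas disagree.
inconsistent = replicas.uniq.length > 1

# Once no new changes arrive and the queue drains, all replicas converge.
pending.shift.then { |r, k, v| r[k] = v } until pending.empty?

consistent = replicas.uniq.length == 1
```

Here `inconsistent` is true right after the write and `consistent` is true after propagation; real systems such as Dynamo-style stores add versioning and conflict resolution on top of this basic idea.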
“For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” [7]. Availability is one of the hardest things to test and plan for because it tends to desert you when you need it most. It seems inevitable that services go down at busy times precisely because they are busy. Implementing a system that can deal with availability gracefully can be non-trivial [4].

“In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost” [7]. Once a system has become distributed, there is a high chance that it will be in a partitioned state. Perhaps the data has not yet traversed the wires to reach your entire data store, or perhaps some of your machines have crashed and are no longer online. To handle issues of consistency and availability, we can scale horizontally, but as the number of partitions increases, so does the number of atomic entities that must be kept synchronized. You can drop partition tolerance by placing everything on a single machine, but this severely limits your ability to scale [4].

To have a consistent system you must sacrifice some level of availability or partition tolerance; a system with good availability will have to sacrifice some consistency or partition tolerance; and a system that is partitioned will need to sacrifice some degree of consistency or availability. Choosing which part of CAP to drop has been argued over and over since CAP’s inception. Hale believes that you cannot drop partition tolerance, because the probability that one or more nodes in your system will fail increases exponentially as you add nodes. Instead, Hale says that you should be asking yourself: “In the event of failures, which will this system sacrifice?
Consistency or availability?”

Algorithms

Distributed Systems and Data Parallelization Techniques

MapReduce

MapReduce is a popular programming model used for scalable data processing on large clusters over large data sets [4]. MapReduce is suited for applications where data is written once and read many times [1]. The user provides two functions: a map function, which generates a list of intermediate key/value pairs, and a reduce function, which merges all the intermediate values associated with the same intermediate key. Being purely functional, MapReduce operates well on semi-structured or unstructured data. The map function serves to process a subset of the original problem and returns the result as a list of (A,B)-tuples. The reduce function, given an A value and a list of B values, produces a list of results that is the final answer for all A-parts. The beauty of these two operations is that they are purely functional and have no side effects. What does this mean? It means they tend to scale well, are highly parallelizable, and are extremely flexible [4].

Figure 1: MapReduce Visual Representation

This operation is linearly scalable and massively parallelizable [1]: if you double the amount of data to operate on, the operation takes twice as much time, but if you also double the size of the cluster, the operation takes half as much time [1]. It has been proven to work on terabytes of data on hundreds of thousands of client machines with hundreds of simultaneous MapReduce programs. In fact, an estimated several thousand of these operations occur on Google’s clusters every day [11].

Databases

Table 1: Example collection used to compare row- and column-oriented operations

Title                 Artist        Album           Year
Breaking the Law      Judas Priest  British Steel   1980
Aces High             Iron Maiden   Powerslave      1984
Kickstart My Heart    Motley Crue   Dr. Feelgood    1989
Raining Blood         Slayer        Reign in Blood  1986
I Wanna Be Somebody   W.A.S.P.      W.A.S.P.        1984

Row-oriented

Row-oriented databases represent objects as tuples and attributes as one piece of information about that object. Generally, we think of a tuple as a row and an attribute as a column. This is the primary way of thinking about data in the relational model [4]. A row-oriented database would store the data somewhat like this: Breaking the Law, Judas Priest, British Steel, 1980, Aces High, Iron Maiden, Powerslave, 1984 … [4]

Column-oriented

Column-oriented databases are tuned to operate more efficiently on a set of attributes for all tuples. The table above in a column-oriented database: Breaking the Law, Aces High, Kickstart My Heart, Raining Blood, I Wanna Be Somebody, Judas Priest, Iron Maiden, Motley Crue, Slayer, W.A.S.P., British Steel, Powerslave, Dr. Feelgood, Reign in Blood … [4]

These two distinctly different ways of storing data produce very different results when database operations are performed. To create a new tuple (CREATE) in a row-oriented database, the system only needs to append the new tuple to the end of the collection (table). In a column-oriented database, careful insertions would need to be made to keep the data lined up correctly. Conversely, creating a new attribute in a row-oriented database requires a more expensive and nontrivial operation, whereas it is as simple as appending the new data at the end in a column-oriented database. Select statements, such as a query listing all the albums from the year 1984, are more efficient in a row-oriented system because the query simply needs to check the fourth element of each tuple and return the tuples matching the year. Aggregate statements, on the other hand, are more efficient in a column-oriented solution. An operation such as summing all the years is much easier in a column-oriented system because all the years are found consecutively and need not be compared to other attributes.
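The two layouts can be sketched in plain Ruby using the collection from Table 1: a row store keeps whole tuples together, while a column store keeps each attribute's values contiguous, so an aggregate over Year reads a single array without touching any other attribute.

```ruby
# Row-oriented layout: one array per tuple, attributes interleaved.
rows = [
  ["Breaking the Law",    "Judas Priest", "British Steel",  1980],
  ["Aces High",           "Iron Maiden",  "Powerslave",     1984],
  ["Kickstart My Heart",  "Motley Crue",  "Dr. Feelgood",   1989],
  ["Raining Blood",       "Slayer",       "Reign in Blood", 1986],
  ["I Wanna Be Somebody", "W.A.S.P.",     "W.A.S.P.",       1984]
]

# Column-oriented layout: one array per attribute, values contiguous.
columns = {
  title:  rows.map { |r| r[0] },
  artist: rows.map { |r| r[1] },
  album:  rows.map { |r| r[2] },
  year:   rows.map { |r| r[3] }
}

# Selection (all albums from 1984) inspects every tuple either way,
# but the row layout hands back whole tuples directly.
albums_1984 = rows.select { |r| r[3] == 1984 }.map { |r| r[2] }

# Aggregation over one attribute scans a single contiguous array
# in the column layout, touching no other attributes.
year_sum = columns[:year].sum
```

Running this yields `albums_1984 == ["Powerslave", "W.A.S.P."]` and `year_sum == 9923`; in a real column store the contiguous layout additionally enables compression and cache-friendly scans.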
In summary, we can conclude that row- and column-oriented solutions both have trade-offs. A row-oriented system may be best when the system requires the creation of many rows or the retrieval of many attributes from the same tuple at the same time. Column-oriented systems might be the better option when new columns are created often and aggregations over entire columns, rather than retrievals of whole rows, happen frequently [4].

Technology

Enabling Technology

It has only been in the last decade or so that NoSQL solutions have really started to appeal to developers and engineers. This is the result of a couple of enabling technologies in the hardware sector, specifically advances in computation and in memory and storage systems. Today’s hard drives can reach sizes in excess of one terabyte with read speeds around 100 MB/s, so it takes more than two and a half hours to read all the data off such a disk. This is a long time to read data, and it takes even longer to write. The solution to this problem is to spread the data out onto separate drives. By replicating the data onto, say, 100 separate drives, we can take advantage of the ability to read from multiple disks at once. There are two problems with this approach, though. The first is how to deal with hardware failure, a common occurrence. Commonly, you can use replication to store redundant copies of data so that if there is a failure, extra copies are available. The second big issue is that you are going to need to combine that data in some way eventually. Distributed data-storage systems abstract these problems away and provide reliable, shared storage and analysis systems.

Distributed Systems

Hadoop

Hadoop was created by Doug Cutting, who named it after his son’s stuffed elephant, and grew out of Apache Nutch, an open-source web search engine that is part of the Lucene project [1]. In 2008, Yahoo! announced that its production search index was being powered by a 10,000-core Hadoop cluster [1].
In April of 2008, Hadoop broke the world record to become the fastest system to sort a terabyte of data. It ran on a 910-node cluster and took only 209 seconds, beating the previous winner by 88 seconds [1].

HBase

HBase is another Apache Software Foundation project that stemmed from the Hadoop project. HBase is essentially a Hadoop database running over the Hadoop Distributed File System (HDFS). It is modeled after Google’s BigTable. HBase provides nice developer tools, including a Java API, a REST interface, and a Thrift gateway. HBase follows a column-oriented architecture and integrates easily with MapReduce jobs run in Hadoop; tables can be used as both input and output for MapReduce jobs. Its data model is based on a four-dimensional dictionary:

    hbase_db[table][row][column][timestamp] -> value

HBase has excellent fault tolerance. It has block-level replication, achieved by storing files as sets of blocks which are then distributed across the entire cluster. Block replication also gives HDFS high throughput across large data sets. By splitting individual files into 64 MB blocks, HDFS is able to decrease the amount of metadata storage per file and read data quickly because the blocks are sequential. HBase and the HDFS file system it is built on will sometimes compromise reliability in order to achieve lower communication costs. HBase supports heartbeat and block-report periodic messaging between nodes.

Cassandra

Cassandra is a “distributed storage solution for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure” [12]. Cassandra is now an Apache Software Foundation project, but it was designed by Facebook to fix their Inbox Search problem. The ability of Facebook users to search through their inboxes required massive amounts of read and write throughput, billions of writes per day, and needed to be scalable for the future.
Facebook created Cassandra to resolve this problem, and so far Cassandra has held to its duty by scaling well from 100 million users to 250 million users [12]. Cassandra is a column-oriented, distributed DBMS. It features a four- to five-dimensional key-value store with eventual consistency [4]. The Cassandra API is extremely simple and exposes only three methods:

    Insert(table, key, rowMutation)
    Get(table, key, columnName)
    Delete(table, key, columnName) [12]

Cassandra has the backing of a large and popular company in Facebook, and it has proven itself on the largest scale. It is extremely simple to use and get started with, yet extremely powerful.

Databases

Relational

MySQL is one of the most popular relational database management systems in the world. It is a core component of web development and is the ‘M’ in the LAMP (Linux Apache MySQL PHP) acronym. You will find MySQL used in small- to medium-scale single-server deployments. MySQL does support scaling by using full replication to increase read capacity, or sharding to increase write capacity. It also supports multiple storage engines: native, partner-developed, and open-source.

MongoDB

One popular document-oriented database is MongoDB. MongoDB is a scalable, open-source, document-oriented database that has been around since October 2007. MongoDB is based on B-trees and stores data as binary JSON-style documents. MongoDB scales very easily horizontally and supports sharding and replication. MongoDB is mainly used for problems without strong transaction requirements. MongoDB is also very flexible and developer-friendly. Its lack of a fixed schema means modifying database definitions, columns, and tuples is easy.

Conclusions

Any software that depends on persistent storage will end up using a database one way or another, and many times the database can and will be the bottleneck of an application. It is important to know what the implications are in choosing a database system.
Today’s data storage climate is such that traditional, SQL-based DBMSes are not the best solution to every problem. Through a better understanding of the theory, algorithms, and implementations behind relational and non-relational databases, one will be better suited to making data-driven decisions. For a fuller account of performance considerations, which is beyond the scope of this paper, the author suggests Chapter 7 of Lith and Mattsson’s paper, "Investigating storage solutions for large data". The paper investigates relational versus non-relational databases in a specific case study and includes pages of graphs showing specific performance benchmarks for different databases.

References

[1] T. White, Hadoop: The Definitive Guide, Sebastopol, CA: O'Reilly Media, Inc., 2012.
[2] M. Stonebraker and U. Çetintemel, "“One Size Fits All”: An Idea Whose Time Has Come and Gone," Providence, 2007.
[3] C. Ricardo, Databases Illuminated: Second Edition, Ontario: Jones & Bartlett Learning, 2012.
[4] A. Lith and J. Mattsson, "Investigating storage solutions for large data," Chalmers University of Technology, pp. 1-70, 2010.
[5] C. Hale, "You Can't Sacrifice Partition Tolerance," 07 October 2010. [Online]. Available: http://codahale.com/you-cant-sacrifice-partition-tolerance/. [Accessed 07 April 2013].
[6] W. Vogels, "Eventually Consistent," ACM Queue, 2008.
[7] S. Gilbert and N. Lynch, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services," 2002.
[8] M. Seeger, "Key-Value stores: a practical overview," Medieninformatik, 2009.
[9] F. Magoules, Fundamentals of Grid Computing, Boca Raton: Chapman and Hall, 2010.
[10] A. J. Wells, Grid Database Design, Boca Raton: Auerbach Publications, 2005.
[11] K. Hwang, G. C. Fox and J. J. Dongarra, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things, Morgan Kaufmann, 2011.
[12] A. Lakshman and P. Malik, "Cassandra - A Decentralized Structured Storage System," Ithaca, 2009.
[13] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," in OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 2006.

Acknowledgements

Figure 1: http://upload.wikimedia.org/wikipedia/en/3/35/Mapreduce_Overview.svg