Big Data: A Brief Investigation on NoSQL Databases

Roshni Bajpayee, Sonali Priya Sinha, Vinod Kumar
MATS School of Information Technology, Raipur (C.G.)

ABSTRACT
As the use of information technology has grown worldwide, data generation from various sources has increased far beyond expectations. The technology for handling this vast amount of data has not developed at the same pace as data generation. Traditional database systems cannot handle big data because of its volume, variety, complexity and variability. To deal with this problem, technologies such as the Hadoop Distributed File System (HDFS) have been developed. The data to be processed exists in different formats, which is why the traditional relational database management system is not suitable for big data. To deal with unstructured data, various database tools have been developed. This paper mainly focuses on the NoSQL database tools that are available for dealing with different types of data. It also includes a brief comparison between NTFS and HDFS and between NoSQL and traditional relational databases.

Keywords
NoSQL Database, Big Data, Big Data Tools, HDFS, NTFS, Hadoop

1. INTRODUCTION
Big data is a term for very large amounts of data that call for new technologies and architectures. Such data is captured and accessed very easily by third parties such as Facebook and others.

Big data refers to large data sets with a more varied and complex structure, which brings difficulties in storing, analyzing and visualizing them for further processing or results. The process of examining large amounts of data to reveal hidden patterns and correlations is called big data analytics. This information is useful to companies and organizations because it gives them richer and deeper insights and an advantage over the competition. For this reason, big data implementations need to be analyzed and executed as accurately as possible. Big data applications have high volume, high variety and high velocity. Some basic properties associated with big data are as follows:

- Variety: The data being generated is not of a single category. It includes not only traditional data but also semi-structured data from sources such as e-mail, documents, web pages, web log files and social media sites.
- Volume: This characteristic describes the size of the data generated. In the age of information technology, data is continuously increasing, and volumes will soon grow from petabytes to zettabytes. Networking sites store such large amounts of data that it is very difficult to handle with traditional systems.
- Velocity: Velocity deals with the rate at which data arrives from various sources. This property is not confined to the rate of incoming data; it also covers the rate at which the data flows.
- Variability: Variability considers the inconsistencies of the data flow. Data loads become challenging to maintain, especially with the increase in usage of social networking.
- Complexity: This is a very important aspect of big data, because it is quite an undertaking to link, match, cleanse and transform data coming from various sources across systems.
- Value: Users can run queries against the stored data, obtain important results from the filtered data, and rank those results along the dimensions they need.

Figure 1: Characteristics of Big Data

2. DAWN OF NOSQL
The name "NoSQL" was first used in 1998 by Carlo Strozzi for his relational database management system, Strozzi NoSQL. Strozzi coined the term to differentiate his solution from other RDBMS solutions that make use of SQL. He used the name simply because his database did not expose an SQL interface.
Now, the term NoSQL (Not Only SQL) has come to describe a large set of databases that do not have the characteristics of conventional relational databases and are usually not queried with SQL. The term was revived in recent years as large enterprises and projects such as Google, Amazon and Apache built their own data stores to collect and process the large amounts of data arising in their applications, which prompted other vendors to follow. The main characteristics of NoSQL databases are horizontal scaling, replication and partitioning of data over several servers. In recent years, different kinds of NoSQL databases have been produced, mainly by practitioners and web enterprises, to fulfil their particular requirements regarding performance, maintenance, scalability and feature set. Today's requirements differ from those of some years ago. NoSQL has therefore emerged as a solution for today's data storage requirements and has become a subject of discussion and research.

3. COMPARATIVE OVERVIEW
Today, people live surrounded by big data, and the amount of data grows at every moment. Big data is the massive amount of structured, semi-structured and unstructured data generated at a certain velocity from a variety of sources. A traditional file system (TFS) cannot handle big data efficiently; a distributed file system (DFS) is therefore adopted in its place. The Apache Hadoop Distributed File System (HDFS) plays an important role in the field of big data, using commodity hardware as data nodes for processing data.
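The partitioning and replication of data over several servers, named above as a main characteristic of NoSQL databases and also relied on by distributed file systems, can be illustrated with a small Python sketch. The node names, the replication factor of 3 and the hash-based placement rule are assumptions chosen for illustration; they are not the placement scheme of any particular system discussed in this paper.

```python
import hashlib

def place(key, nodes, replicas=3):
    """Return the nodes that store `key`: a primary chosen by hash, then its successors."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)        # partitioning: the hash picks the primary
    count = min(replicas, len(nodes))           # cannot replicate to more nodes than exist
    return [nodes[(start + i) % len(nodes)] for i in range(count)]

# Hypothetical cluster of commodity servers.
nodes = ["node-a", "node-b", "node-c", "node-d"]

for key in ["user:1", "user:2", "order:17"]:
    print(key, "->", place(key, nodes))
```

Adding a server to `nodes` spreads keys over more machines, which is the essence of horizontal scaling; a production system would typically use consistent hashing so that fewer keys move when the cluster changes.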
Hadoop principally consists of two parts:
- File system (Hadoop Distributed File System)
- Programming paradigm (MapReduce)

3.1 Hadoop Distributed File System
The Hadoop Distributed File System is a file system developed for storing huge files with streaming data access patterns, running on clusters of commodity hardware. The HDFS block size is much larger than that of a traditional file system in order to reduce the number of disk seeks.

3.2 MapReduce
MapReduce is the programming model that runs in the HDFS environment. It consists of two main parts, the Mapper and the Reducer, and accordingly performs two types of work: the map task, carried out by the Mapper, and the reduce task, carried out by the Reducer. MapReduce is responsible for executing the program in the distributed environment and collecting the aggregated result from the distributed nodes, i.e. the commodity hardware.

Table 1: Comparison between NTFS and HDFS

| NTFS (New Technology File System) | HDFS (Hadoop Distributed File System) |
|---|---|
| Files are stored in the local file system | Files are distributed across the cluster machines |
| Block size lies between 512 B and 64 kB | Block size is 64 MB by default |
| A 1 GB file is split into 16384 blocks (at 64 kB per block) | A 1 GB file is split into 16 blocks |
| With a seek time of 0.1 ms per block, seeking through 1 GB takes about 1.6 s | With a seek time of 0.1 ms per block, seeking through 1 GB takes about 1.6 ms |
| Used for write-many, read-many access | Used for write-once, read-many access |
| Data is not replicated | Data is always replicated |
| Scaling up is not possible | Scaling wide and scaling deep are possible |

Table 2: Comparison between NoSQL Database (HDFS) and Traditional Relational Database

| SL | NoSQL Database (HDFS) | Traditional Relational Database |
|---|---|---|
| 1 | NoSQL is an unstructured way of storing data. | RDBMS is a completely structured way of storing data. |
| 2 | The amount of data stored does not depend on the physical memory of the system. It can be scaled horizontally as required. | The amount of data stored mainly depends on the physical memory of the system. |
| 3 | It can effectively handle millions and billions of records. | It can effectively handle a few thousand records. |
| 4 | It is never advised for transaction management. | It is best suited for transaction management. |
| 5 | Processing time depends on the number of cluster machines. | Processing time depends on the server machine's configuration. |
| 6 | Availability is preferred over consistency. | Consistency is preferred over availability. |
| 7 | It follows the CAP theorem. | It follows the ACID properties of transactions. |
| 8 | It scales horizontally as well as vertically. | It scales better vertically. |
| 9 | There is no need for normalization. | Tables must be normalized. |

4. SUMMARIZED DATA OF VARIOUS NOSQL DATABASES AVAILABLE

4.1 Key-value stores
Key-value stores use the associative array as their basic data model. In this model, data is represented as a collection of key-value pairs such that each possible key appears at most once in the collection. The key-value model is one of the simplest non-trivial data models, and richer data models are often implemented on top of it.

4.1.1 Redis
Redis is a data structure server. It is an open-source, networked, in-memory store that keeps keys with optional durability.

4.1.2 Riak
Riak is a distributed NoSQL key-value data store that offers extremely high fault tolerance, availability, operational simplicity and scalability. In addition to the open-source versions, it comes in a supported enterprise version and a cloud storage version that is suited to cloud computing environments.

Table 3: NoSQL - Key Value Store
Columns: SL, Name, Developer, Language, License, Initial/Stable Release, Characteristic.
Systems listed: Redis, Riak, Aerospike, MemcacheDB, Hypertable, Voldemort, hamsterdb, Scalaris, FoundationDB, DynamoDB, Tokyo Cabinet, Berkeley DB, STSdb W4.0, Hibari.
Characteristics recorded across these systems: consistency, high availability, partition tolerance, persistence, replication, multi-version concurrency and ACID support.

4.2 Document-oriented databases
Document-oriented databases are one of the main categories of NoSQL databases. A document-oriented database is designed for storing, managing and retrieving document-oriented information. Its central concept is the document, in contrast to a relational database, in which the tuple (row) is the central concept. A document-oriented database system is designed around the abstract notion of a "document".
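The contrast between a relational row and a document can be sketched in Python. The tiny in-memory `DocumentStore` class, its `insert` and `find` methods and the sample records are hypothetical and exist only for illustration; they are not the API of any database surveyed in this paper.

```python
import json

# A relational row: every row must follow the same fixed column layout.
columns = ("id", "name", "email")
row = (1, "Asha", "asha@example.com")

class DocumentStore:
    """Toy document store: each document is a self-describing dictionary."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, doc):
        doc_id = self._next_id
        # Round-trip through JSON to store an independent copy of the document.
        self._docs[doc_id] = json.loads(json.dumps(doc))
        self._next_id += 1
        return doc_id

    def find(self, **criteria):
        # Return every document whose top-level fields match all criteria.
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]

store = DocumentStore()
store.insert({"name": "Asha", "email": "asha@example.com"})
# The second document has different fields and a nested sub-document.
store.insert({"name": "Ravi", "tags": ["admin"], "address": {"city": "Raipur"}})

print(store.find(name="Ravi"))
```

The two stored documents share no fixed layout, whereas every row in the relational form must supply exactly the fields named in `columns`.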
4.2.1 MongoDB
MongoDB is a document database that provides high availability, easy scalability and high performance. A MongoDB deployment hosts a number of databases. A database holds a set of collections. Documents have a dynamic schema: documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.

Couchbase Server: Couchbase Server, originally known as Membase, is an open-source, distributed (shared-nothing architecture) NoSQL document-oriented database that is optimized for interactive applications.

4.2.2 FatDB
FatDB is a next-generation NoSQL database for Windows that extends database functionality by integrating MapReduce, a work queue, a file management system, a high-speed cache and application services.

4.2.3 ArangoDB
ArangoDB is an open-source, multi-model database that combines a document store with a graph database. This combination allows data to be modelled with a lot of flexibility. ArangoDB differs from other NoSQL databases in several respects, from its support for transactions to its powerful query language, AQL.

4.2.4 RavenDB
RavenDB is a transactional, open-source document database written in .NET that offers a flexible data model designed to address requirements coming from real-world systems. RavenDB allows high-performance, low-latency applications to be built quickly and efficiently.

4.2.5 OrientDB
OrientDB is an open-source NoSQL database management system written in Java. It is a document-based database, but relationships are managed as in graph databases, with direct connections between records. It supports schema-less, schema-full and schema-mixed modes.

4.2.6 BaseX
BaseX is a lightweight, native XML database management system and XQuery processor, developed as a community project on GitHub. It specializes in storing, querying and visualizing large XML documents and collections. BaseX is platform-independent and distributed under a permissive free software license.

Table 4: NoSQL - Document oriented
Columns: SL, Name, Developer, Language, License, Initial/Stable Release, Characteristic.
Systems listed: MongoDB, Couchbase Server, Apache Jackrabbit, OrientDB, SimpleDB, FatDB, CouchDB, ArangoDB, RavenDB, MarkLogic, BaseX, Solr, djondb, iBoxDB.
Characteristics recorded across these systems: consistency, high availability, partition tolerance, persistence, concurrency and transaction support.

4.3 Column Store
The column of a distributed database is a NoSQL object of the lowest level in a keyspace. It is a key-value pair comprising three parts:
- Unique name: used to reference the column.
- Value: the content of the column. It can have different types, such as AsciiType, LongType, TimeUUIDType and UTF8Type, among others.
- Timestamp: the system timestamp used to determine the valid content.

4.3.1 Cassandra
Cassandra is an open-source distributed database management system (DDBMS) designed to handle huge amounts of data across many commodity servers, offering high availability with no single point of failure.

4.3.2 Big Table
Big Table was developed with semi-structured data storage in mind. It is a large map indexed by a row key, a column key and a timestamp. Each value within the map is an array of bytes that is interpreted by the application. Every read or write of data in a row is atomic, regardless of how many different columns are read or written within that row.

4.3.3 HBase
HBase is written in Java and developed by the Apache Software Foundation. It offers Big Table-like capabilities for Hadoop and runs on the Hadoop Distributed File System. It is a non-relational, distributed, open-source database modelled after Google's Big Table.

4.3.4 BangDB
BangDB is developed with the goal of being a fast, robust, scalable, reliable and very simple to use database for the data management services required by different applications. BangDB falls into the category of multi-flavoured distributed key-value NoSQL databases.

4.3.5 Sedna XML
Sedna is an open-source, XML-based database management system that provides native storage for XML data. The distinguishing architectural decisions in Sedna are (i) the use of a schema-based clustering storage strategy for XML data and (ii) the use of a layered address space for memory management.

Table 5: NoSQL - Column Store
Columns: SL, Name, Developer, Language, License, Initial/Stable Release, Characteristic.
Systems listed: Cassandra, HBase, Big Table, BangDB, Hazelcast, Sedna XML, Accumulo, MonetDB, Hypertable, Cloudera, Apache Flink (incubating).
Characteristics recorded across these systems: consistency, high availability, partition tolerance, persistence, durability, concurrency, scalability, reliability and Hadoop compatibility.

4.4 Graph Database
The graph database is one of the abstract types of data store. It is based on graph theory and uses nodes and edges to represent and store data. In a graph database every element contains a direct pointer to its adjacent elements, so no index lookups are necessary.

4.4.1 Neo4j
Neo4j is the most widely used graph database. It is an open-source graph database implemented in Java, and it is ACID compliant. Its primary language is Java, but it has interfaces for many other programming languages, such as Ruby and Python.

Table 6: NoSQL - Graph Database
Columns: SL, Name, Developer, Language, License, Initial/Stable Release, Characteristic.
Systems listed: Neo4j, InfoGrid, HyperGraphDB, Trinity, BrightstarDB, Filament, Bigdata, WhiteDB, AllegroGraph, Meronymy, TITAN, FlockDB, Infinite Graph, DEX.
Characteristics recorded across these systems: ACID transactions, consistency, concurrency, replication, high availability, fault tolerance, high performance, scalability, flexibility and portability.
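The property that each graph element holds a direct pointer to its adjacent elements, so that traversal needs no index lookup, can be illustrated with a minimal Python sketch. The `Node` class, the relationship labels and the sample people are hypothetical; this is not the storage format or API of Neo4j or of any other system listed above.

```python
class Node:
    """Graph element that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.edges = []                      # list of (relationship label, Node) pairs

    def connect(self, label, other):
        self.edges.append((label, other))

    def neighbours(self, label=None):
        # Follows the stored references directly; no global index is consulted.
        return [node for (lbl, node) in self.edges if label is None or lbl == label]

alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
alice.connect("KNOWS", bob)
bob.connect("KNOWS", carol)
alice.connect("WORKS_WITH", carol)

# Friends of friends of Alice: walk two hops along KNOWS edges.
friends_of_friends = [n2.name
                      for n1 in alice.neighbours("KNOWS")
                      for n2 in n1.neighbours("KNOWS")]
print(friends_of_friends)
```

Each hop costs time proportional to the number of edges on the current node, independent of the total size of the graph, which is the advantage the text attributes to graph databases.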
5. CONCLUSION
In the age of information technology, data is very important for extracting useful information. Data evidently exists in different formats, and the processing of big data is still a challenging task. There is no universal tool that can handle enormous volumes of data in various formats. Document-oriented, key-value, column and graph NoSQL databases have been developed to handle this variety of data. The summarized discussion of different NoSQL databases presented here is helpful in selecting a suitable NoSQL database.