Big Data: A Brief Investigation on NoSQL Databases
Roshni Bajpayee
MATS School of Information Technology
Raipur (C.G)
[email protected]
Sonali Priya Sinha
MATS School of Information Technology
Raipur (C.G)
[email protected]
Vinod Kumar
MATS School of Information Technology
Raipur (C.G)
[email protected]
ABSTRACT
As the use of information technology has grown worldwide, the
data generated from various sources has increased at an
unexpected rate. The technology for handling such vast amounts
of data has not developed as fast as data generation itself.
Traditional database systems are unable to handle this data
because of its volume, variety, complexity and variability. To
deal with this problem, technologies such as the Hadoop
Distributed File System (HDFS) have been developed. The data to
be processed exists in different formats, which is why the
traditional relational database management system is not
suitable for big data. To deal with unstructured data, various
database tools have been developed. This paper mainly focuses
on the various NoSQL database tools that are available to deal
with different types of data. It also includes a brief
comparison between NTFS and HDFS, and between NoSQL and
traditional relational databases.
Keywords
NoSQL Database, Big Data, Big Data Tools, HDFS, NTFS, Hadoop
1. INTRODUCTION
Big Data is a term for large amounts of data, for which new
technologies and architectures are important. Such data is
captured and accessed very easily by third parties such as
Facebook and others. Big data is a term for large data sets
with a more varied and complex structure, which are difficult
to store, analyze and visualize for further processing or
results. The process of examining large amounts of data to
reveal hidden patterns and secret correlations is called big
data analytics. This information is useful for companies and
organizations, helping them gain richer and deeper insights and
an advantage over the competition. For this reason, big data
implementations need to be analyzed and executed as accurately
as possible. Big Data applications have high Volume, high
Variety and high Velocity.
Some basic properties associated with big data are as follows:
Variety [13][17][20]
The data being generated is not of one category: it contains
not only traditional data but also semi-structured data from
different sources such as e-mail, documents, web pages, web log
files, social media sites, etc.
Volume [13][17][20][22]
This characteristic of big data describes the size of the data
generated. In the age of information technology, the data is
continuously increasing and is expected soon to grow from
petabytes to zettabytes. Social networking sites store large
amounts of data, which is very difficult to handle using
traditional systems.
Velocity [13][17][20][22]
Velocity deals with the rate at which data arrives from various
sources. This property is not confined to the rate of incoming
data; it also covers the rate at which the data flows.
Figure 1: Characteristics of Big Data
Variability [20]
Variability considers the inconsistencies of the data flow.
Data loads become challenging to maintain, especially with the
increasing use of social networking.
Complexity [20]
This is a very important aspect of big data, because it is
quite an undertaking to link, match, cleanse and transform data
coming from various sources across systems.
Value [20]
Users can run queries against the stored data, obtain important
results from the filtered data, and rank those results along
the dimensions they require.
2. DAWN OF NOSQL [1][3][4][7][15]
The name “NoSQL” was first used in 1998 by Carlo Strozzi [1]
for his relational database management system, Strozzi NoSQL.
Strozzi coined the term mainly to differentiate his solution
from other RDBMS solutions that make use of SQL [15]; he used
the term NoSQL simply because his database did not expose an
SQL interface. Now, the term NoSQL (Not Only SQL) [1] has come
to describe a large set of databases which do not have the
characteristics of conventional relational databases and which
are usually not queried with SQL. The term was re-energized in
recent years when large organizations such as Google, Amazon
and Apache built their own data stores to collect and process
the large amounts of data arising in their applications,
encouraging other vendors to follow. The main characteristics
of NoSQL databases are horizontal scaling and the replication
and partitioning of data over several servers. In recent years,
different kinds of NoSQL databases have been produced, mainly
by practitioners and web enterprises, to fulfil their
particular requirements regarding performance, maintenance,
scalability and feature set. Our needs today differ from those
of some years ago. NoSQL has therefore emerged as a solution
for today’s data store requirements and has become a subject of
discussion and research.
3. COMPARATIVE OVERVIEW
People today live surrounded by big data, and the data is
increasing unexpectedly at every moment. Big data is the
massive amount of data (structured, semi-structured and
unstructured) being generated at a certain velocity from a
variety of sources. A Traditional File System (TFS) is unable
to handle big data efficiently; therefore a Distributed File
System (DFS) is adopted as a solution instead of the TFS. The
Apache Hadoop Distributed File System (HDFS) plays an important
role in the field of big data, where commodity hardware is used
as data nodes for processing data. Hadoop principally consists
of two parts:
• File System (Hadoop Distributed File System)
• Programming Paradigm (Map Reduce)
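As a rough illustration of the Map Reduce paradigm named above, the following sketch counts words in plain Python on a single machine. It does not use Hadoop; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example, and the grouping step stands in for the work a real framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    intermediate = []
    for line in lines:  # on a cluster, each input split is mapped on its own node
        intermediate.extend(map_phase(line))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data needs big storage", "data grows"]))
```

In a distributed setting the map calls would run in parallel on the nodes holding the input blocks, and only the grouped intermediate pairs would travel over the network to the reducers.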
3.1 Hadoop Distributed File System
The Hadoop Distributed File System is a file system developed
for storing huge files with streaming data access patterns,
running on clusters of commodity hardware. The HDFS block size
is much larger than that of a traditional file system, to
reduce the number of disk seeks.
3.2 MapReduce
MapReduce is the programming model which runs in the HDFS
environment. It consists mainly of two parts, the Mapper and
the Reducer, and accordingly performs two types of work: the
Map task by the Mapper and the Reduce task by the Reducer. It
is responsible for executing the program in a distributed
environment and collecting the aggregated result from the
different distributed nodes, i.e. the commodity hardware.
Table 1: Comparison between NTFS and HDFS [12]
NTFS (New Technology File System) | HDFS (Hadoop Distributed File System)
Files are stored in the local file system | Files are distributed across the cluster machines
Block size lies between 512 B and 64 kB | Block size is 64 MB by default
A 1 GB file will be split into 16384 blocks (at 64 kB) | A 1 GB file will be split into 16 blocks
With a seek time of 0.1 ms per block, seeking through 1 GB takes about 1.6 s | With a seek time of 0.1 ms per block, seeking through 1 GB takes about 1.6 ms
Used for write-many, read-many access | Used for write-once, read-many access
Data is not replicated | Data is always replicated
Scaling up is not possible | Scaling wide and scaling deep is possible
Table 2: Comparison between NoSQL Database (HDFS) and Traditional Relational Database [12][15]
SL | NoSQL Database (HDFS) | Traditional Relational Database
1. | NoSQL is an unstructured way of storing data. | An RDBMS is a completely structured way of storing data.
2. | The amount of data stored does not depend on the physical memory of the system. It can be scaled horizontally as required. | The amount of data stored mainly depends on the physical memory of the system.
3. | It can effectively handle millions and billions of records. | It can effectively handle a few thousand records.
4. | It is never advised for transaction management. | It is best suited for transaction management.
5. | Processing time depends upon the number of cluster machines. | Processing time depends on the server machine’s configuration.
6. | Availability is preferred over consistency. | Consistency is preferred over availability.
7. | It follows the CAP theorem. | It follows the ACID properties of transactions.
8. | It scales horizontally as well as vertically. | It scales better vertically.
9. | There is no need for normalization. | Tables must be normalized.
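The block counts in Table 1 can be checked with a few lines of arithmetic. The sketch below assumes binary units (1 GB = 1024 MB), exactly one disk seek per block, and the 0.1 ms seek time quoted in the table; it ignores transfer time. The helper names are hypothetical. Under these assumptions the seek overhead for a 1 GB file comes to roughly 1.6 s with 64 kB blocks and roughly 1.6 ms with 64 MB blocks.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def block_count(file_size, block_size):
    """Number of blocks needed to hold file_size bytes (rounded up)."""
    return -(-file_size // block_size)


def seek_overhead_seconds(file_size, block_size, seek_time_s):
    """Total seek time, assuming exactly one seek per block."""
    return block_count(file_size, block_size) * seek_time_s


if __name__ == "__main__":
    seek = 0.0001  # 0.1 ms, the value quoted in Table 1
    for label, size in (("64 kB blocks", 64 * KB), ("64 MB blocks", 64 * MB)):
        blocks = block_count(GB, size)
        overhead = seek_overhead_seconds(GB, size, seek)
        print(f"{label}: {blocks} blocks, {overhead:.4f} s of seeking")
```

The three orders of magnitude between the two results are the reason a large block size suits streaming access to very large files.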
4. SUMMARIZED DATA OF VARIOUS NOSQL DATABASES AVAILABLE
4.1 Key-value stores [3] [4] [7]
Key-value stores use the associative array as their basic data
model. In this model, data is represented as a collection of
key-value pairs, such that each possible key appears at most
once in the collection. The key-value model is one of the
simplest non-trivial data models, and richer data models are
often implemented on top of it.
4.1.1 Redis
Redis is a data structure server. It is an open-source,
networked, in-memory store that holds keys with optional
durability.
4.1.2 Riak
Riak is a distributed NoSQL key-value data store that offers
extremely high fault tolerance, availability, operational
simplicity and scalability. In addition to the open-source
version, it comes in a supported enterprise version and a cloud
storage version that is ideal for cloud computing environments.
Table 3: NoSQL - Key Value Store [3] [4] [7]
Columns: SL, Name, Developer, Language, License, Initial/Stable Release, Characteristic
Databases listed: Redis, Riak, Aerospike, MemcacheDB, Hypertable, Voldemort, hamsterdb, Scalaris, Tokyo Cabinet, DynamoDB, FoundationDB, Berkeley DB, Hibari, STSdb W4.0
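The key-value model of Section 4.1 can be pictured as a thin wrapper around an associative array. The following is a minimal in-memory sketch, not the interface of Redis, Riak or any other product listed here; the class and method names are hypothetical.

```python
class KeyValueStore:
    """A toy key-value store backed by a Python dict (an associative array)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store value under key; an existing key is overwritten, never duplicated."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value for key, or default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove key and report whether it was present."""
        if key in self._data:
            del self._data[key]
            return True
        return False

    def __len__(self):
        return len(self._data)


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("user:1", "Asha")
    store.put("user:1", "Ravi")  # same key: the value is replaced
    print(len(store), store.get("user:1"))
```

Because the store knows nothing about the structure of the values, richer models (documents, columns) can be layered on top by choosing what is stored under each key.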
4.2 Document-oriented databases
Document-oriented databases are one of the main categories of
NoSQL databases. A document-oriented database is developed for
storing, managing and retrieving document-oriented information.
Its central concept is the document, in contrast to a
relational database, in which the tuple (row) is the central
concept. A document-oriented database system is designed around
the abstract notion of a “Document”.
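To make the notion of a document concrete, the sketch below models a collection as a list of self-describing, nested records and queries it by field value. It is an illustration in plain Python, not the API of any database named in this paper; the sample data and the find helper are hypothetical.

```python
# Two documents in one collection: unlike rows of a relational table,
# they need not share the same set of fields.
collection = [
    {"_id": 1, "name": "Asha", "email": "asha@example.org"},
    {"_id": 2, "name": "Ravi", "tags": ["admin", "ops"],
     "address": {"city": "Raipur"}},
]


def find(documents, **criteria):
    """Return the documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in documents
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


if __name__ == "__main__":
    for doc in find(collection, name="Ravi"):
        print(doc["_id"], doc["address"]["city"])
```

A relational design would need separate tables for tags and addresses; here each document carries its own structure.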
4.2.1 MongoDB
MongoDB is a document database that provides high availability,
easy scalability and high performance. A MongoDB deployment
hosts a number of databases. A database holds a set of
collections. Documents have a dynamic schema, which means that
documents in the same collection do not need to have the same
set of fields or structure, and common fields in a collection’s
documents may hold different types of data.
Couchbase Server: originally known as Membase, Couchbase Server
is an open-source, distributed (shared-nothing architecture)
NoSQL document-oriented database that is optimized for
interactive applications.
4.2.2 FatDB
FatDB is a next-generation NoSQL database for Windows that
extends database functionality by integrating Map Reduce, a
work queue, a file management system, a high-speed cache and
application services.
4.2.3 ArangoDB
ArangoDB is an open-source, multi-model database that combines
a document store with a graph database. This combination allows
data to be modelled with a lot of flexibility. ArangoDB differs
from other NoSQL databases in features ranging from its support
for transactions to its powerful query language, AQL.
4.2.4 OrientDB
OrientDB is an open-source NoSQL database management system
written in Java. It is a document-based database, but
relationships are managed as in graph databases, with direct
connections between records. It supports schema-less,
schema-full and schema-mixed modes.
4.2.5 RavenDB
RavenDB is a transactional, open-source document database
written in .NET, offering a flexible data model designed to
address requirements coming from real-world systems. RavenDB
allows high-performance, low-latency applications to be built
quickly and efficiently.
4.2.6 BaseX
BaseX is a lightweight, native XML database management system
and XQuery processor, designed and developed as a community
project on GitHub. It is specialized in querying, storing and
visualizing large XML documents and collections. BaseX is
platform-independent and distributed under a permissive free
software license.
Table 4: NoSQL - Document oriented [3] [4] [7]
Columns: SL, Name, Developer, Language, License, Initial/Stable Release, Characteristic
Databases listed: MongoDB, Couchbase Server, Apache Jackrabbit, OrientDB, SimpleDB, FatDB, CouchDB, ArangoDB, RavenDB, BaseX, MarkLogic, Solr, djondb, iBoxDB
4.3 Column Store
A column in a distributed database is the lowest-level NoSQL
object in a keyspace. It is a tuple (a key-value pair)
consisting of three parts:
• Unique name: the column is referenced by it.
• Value: the content of the column. It can be of various types,
such as AsciiType, LongType, TimeUUIDType and UTF8Type, among
others.
• Timestamp: the system timestamp used to determine the valid
content.
4.3.1 Cassandra
Cassandra is an open-source distributed database management
system (DDBMS) designed to handle huge amounts of data across
many commodity servers, offering high availability with no
single point of failure.
4.3.2 Big Table
Big Table was developed with semi-structured data storage in
mind. It is a large map that is indexed by a row key, a column
key and a timestamp. Each value within the map is an array of
bytes that is interpreted by the application. Every read or
write of data in a row is atomic, regardless of how many
different columns are read or written within that row.
4.3.3 HBase
HBase is written in Java and developed by the Apache Software
Foundation. It offers Big Table-like capabilities for Hadoop
and runs on the Hadoop Distributed File System. It is
non-relational, distributed and open source, and is modelled on
Google’s Big Table.
4.3.4 BangDB
BangDB is developed with the goal of being a fast, robust,
scalable, reliable and very simple to use database for the
different data management services required by different
applications. BangDB falls in the category of multi-flavored
distributed key-value NoSQL databases.
4.3.5 Sedna XML
Sedna is an open-source database management system that
provides native storage for XML data. Its distinguishing
architectural decisions are (i) a schema-based clustering
storage strategy for XML data and (ii) the use of a layered
address space for memory management.
Table 5: NoSQL - Column Store
Columns: SL, Name, Developer, Language, License, Initial/Stable Release, Characteristic
Databases listed: Cassandra, HBase, Big Table, BangDB, Hazelcast, Sedna XML, Accumulo, MonetDB, Hypertable, Cloudera, Apache Flink (incubating)
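The column structure described in Section 4.3 (unique name, value, timestamp) can be sketched as follows. The example assumes a simple last-write-wins rule, in which the value with the newest timestamp is treated as the valid content; the Column and Row names are hypothetical and do not reproduce the API of Cassandra, HBase or Big Table.

```python
from typing import NamedTuple, Optional


class Column(NamedTuple):
    """One column: a unique name, a value stored as bytes, and a timestamp."""
    name: str
    value: bytes
    timestamp: int


class Row:
    """A row keeps, for each column name, the version with the newest timestamp."""

    def __init__(self):
        self._columns = {}

    def write(self, column: Column) -> None:
        current = self._columns.get(column.name)
        # Assumption: the write with the highest timestamp wins.
        if current is None or column.timestamp >= current.timestamp:
            self._columns[column.name] = column

    def read(self, name: str) -> Optional[bytes]:
        column = self._columns.get(name)
        return None if column is None else column.value


if __name__ == "__main__":
    row = Row()
    row.write(Column("city", b"Raipur", timestamp=2))
    row.write(Column("city", b"Delhi", timestamp=1))  # older write is ignored
    print(row.read("city"))
```

Keeping the timestamp next to each value lets replicas that received writes in different orders converge on the same content.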
4.4 Graph Database
The graph database is one of the abstract types of data store.
It is based on graph theory and uses nodes and edges to
represent and store data. In a graph database, every element
contains a direct pointer to its adjacent elements, so no index
lookups are necessary.
4.4.1 Neo4j
Neo4j is the most widely used and popular graph database. It is
an open-source graph database implemented in Java, and it is
ACID compliant. Its primary language is Java, but it has
interfaces for many other programming languages, such as Ruby
and Python.
Table 6: NoSQL - Graph Database
Columns: SL, Name, Developer, Language, License, Initial/Stable Release, Characteristic
Databases listed: Neo4j, InfoGrid, HyperGraphDB, Trinity, BrightstarDB, Filament, Bigdata, WhiteDB, AllegroGraph, Meronymy, TITAN, FlockDB, DEX, Infinite Graph
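The idea behind Section 4.4, that each element of a graph database holds direct references to its adjacent elements, can be sketched in a few lines. The example below is plain Python rather than the interface of Neo4j or any other product; the Node class and the reachable helper are hypothetical. Traversal follows object references from node to node without consulting a separate index.

```python
from collections import deque


class Node:
    """A graph node holding direct references to its adjacent nodes."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []

    def connect(self, other):
        """Add an undirected edge between this node and another node."""
        self.neighbours.append(other)
        other.neighbours.append(self)


def reachable(start):
    """Breadth-first traversal; returns node names in the order visited."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node.name)
        for neighbour in node.neighbours:  # follow references, no index lookup
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


if __name__ == "__main__":
    a, b, c = Node("a"), Node("b"), Node("c")
    a.connect(b)
    b.connect(c)
    print(reachable(a))
```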
5. CONCLUSION
In the age of information technology, data is very important
for extracting useful information. Data exists in different
formats, and the processing of big data is still a challenging
task. There is no universal tool which can handle data that is
both enormous and of various formats. Document-oriented,
key-value, column and graph types of NoSQL databases have been
developed to handle this variety of data. The summarized
discussion of the different NoSQL databases is helpful in
selecting a suitable NoSQL database.
6. REFERENCES
[1] Strozzi, Carlo: NoSQL – A relational database management
system. 2007–2010.
http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page.
[2] P. Xiang, R. Hou, and Z. Zhou, “Cache and consistency in
nosql,” in Computer Science and Information Technology
(ICCSIT), 2010 3rd IEEE International Conference on, vol. 6.
IEEE, 2010, pp.117–120.
[3] http://nosql.findthebest.com/
[4] http://nosql-database.org/
[5] G. DeCandia et al., “Dynamo: Amazon’s Highly Available
Key-Value Store,” Proc. 21st ACM SIGOPS Symp. Operating Systems
Principles (SOSP 07), 2007, pp. 205–220; doi:
10.1145/1294261.1294281.
[6] A. Masudianpour, “An Introduction to Redis Server, An
Advanced Key Value Database,” SlideShare, 9 Aug. 2013;
www.slideshare.net/masudianpour/redis25088079.
[7] Jing Han; Haihong, E.; Guan Le; Jian Du, "Survey on NoSQL
database," Pervasive Computing and Applications (ICPCA), 2011
6th International Conference on, vol., no., pp.363, 366, 26-28
Oct. 2011.
[8] http://wiki.apache.org/hadoop/
[9] http://hadoop.apache.org/
[10] http://www.cloudera.com/
[11] http://sortbenchmark.org/YahooHadoop.pdf
[12] Tom White, Hadoop: The Definitive Guide, 3rd Edition,
O'Reilly Media, 2010.
[13] Stephen Kaisler, Frank Armour, J. Alberto Espinosa,
William Money, “Big data: issues and challenges
moving forward”, IEEE, 46th Hawaii International
Conference on System Sciences, 2013.
[14] Frank, L., "Countermeasures against consistency
anomalies in distributed integrated databases with
relaxed ACID properties," Innovations in Information
Technology (IIT), 2011 International Conference on ,
vol., no., pp.266,270, 25-27 April 2011.
[15] M. Stonebraker, “Sql databases v. nosql databases,”
Communications of the ACM, vol. 53, no. 4, pp. 10–
11, 2010.
[16] Sachchidanand Singh, Nirmala Singh, “Big data analytics”,
IEEE, International Conference on Communication, Information &
Computing Technology (ICCICT), Oct. 19-20, 2012.
[17] Sagiroglu, S.; Sinanc, D., "Big Data: A review,"
Collaboration Technologies and Systems (CTS), 2013
International Conference on, vol., no., pp.42,47, 20-24 May
2013.
[18] Wielki, J., "Implementation of the big data concept in
organizations - possibilities, impediments and
challenges," Computer Science and Information
Systems (FedCSIS), 2013 Federated Conference on,
vol., no., pp.985, 989, 8-11 Sept. 2013.
[19] Segev, A; Chihoon Jung; Sukhwan Jung, "Analysis of
technology trends based on big data," Big Data
(BigData Congress), 2013 IEEE International
Congress on, vol., no., pp.419, 420, June 27 2013-July
2 2013
[20] Katal, A; Wazid, M.; Goudar, R.H., "Big data: issues,
challenges, tools and good practices," Contemporary
Computing (IC3), 2013 Sixth International Conference
on , vol., no., pp.404,409, 8-10 Aug. 2013.
[21] K. Kambatla, G. Kollias, V. Kumar, A. Grama, “Trends in
big data analytics”, J. Parallel Distrib. Comput. (2014),
http://dx.doi.org/10.1016/j.jpdc.2014.01.003
[22] Sheth, Amit, "Transforming big data into smart data:
deriving value via harnessing volume, variety, and velocity
using semantic techniques and technologies," Data Engineering
(ICDE), 2014 IEEE 30th International Conference on, vol., no.,
pp.2,2, March 31 2014-April 4, 2014.
[23] Saha, Barna; Srivastava, Divesh, "Data quality: the
other face of big data," Data Engineering (ICDE),
2014 IEEE 30th International Conference on, vol., no.,
pp.1294, 1297, March 31 2014-April 4 2014.