Big Data Frameworks: At a Glance
Rajendra Kumar Shukla
School of IT, MATS University
[email protected]

Pooja Pandey
School of IT, MATS University
[email protected]

Vinod Kumar
School of IT, MATS University
[email protected]
Abstract
In the modern era of information technology, the use of IT tools and techniques has grown exponentially in almost every business, enterprise, company and government organization. The rate at which data is generated has therefore also grown exponentially. This huge amount of data, which has properties such as variety, volume, velocity, complexity, variability and value, has led to the concept of big data. Traditional systems, frameworks, tools and techniques are not capable of handling it; new frameworks, operating systems, warehouse tools and analysis techniques are required to address big data issues. This paper mainly focuses on the frameworks available in the big data environment.
Keywords- Big Data Frameworks, Big Data, Big Data Tools
Introduction
Big data is at the center of modern science and business. It comprises billions of records of information about millions of people, including web sales, social media, audio, search queries, business records, social networking, scientific data, mobile phones and their applications, and so on. These huge amounts of data can be managed or unmanaged.
Big Data characteristics - Data is huge, collected from a variety of sources in different forms, and is characterized by four parameters:
 Variety
 Volume
 Velocity
 Variability
Variety makes the data heterogeneous. Data comes from a variety of sources and can be of structured, unstructured or semi-structured type; it includes sensor data, video, text, audio, web log files and so on.
Volume represents how large the data is. The size or volume of data is now more than petabytes, and the sheer scale and growth of data outstrip traditional storage and analysis techniques.
Velocity is essential for big data and for all of its processes. Velocity defines the pace at which data arrives and the rate at which it streams through the system.
Variability applies to the state or validity of the data in relation to time: how quickly can a huge volume of data, coming from a number of varied sources at high speed, change? This also refers to data in motion versus data at rest. The variability of the state of the data comes into play when it is important to take a point-in-time snapshot of the data for further processing and decision making.
Fig-1 Characteristics of Big data
Big data Frameworks
In computer systems, a framework is often a layered structure indicating what kinds of programs can or should be built and how they interrelate. A big data framework is a set of functions, or a structure, which defines how the processing, manipulation and representation of big data are performed. A big data framework handles structured, unstructured and semi-structured data alike.
The big data framework in Fig. 2 consists of different layers with different functionalities. In the first step, data is acquired from different sources, such as enterprises, government organizations, business organizations, telecom industries and social networking sites, and activities such as data processing, data cleaning, integration and normalization are performed. In the second stage, the data repository, the preprocessed data is stored for the further analysis and visualization of big data that yields greater insight and valuable information.
Fig 2: Big data Frameworks
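As a minimal illustration of these layers, the sketch below wires acquisition, preprocessing, storage and analysis together as plain Python functions. It is only a sketch of the flow in Fig. 2: the function names, the sample records and the in-memory "repository" are assumptions made for illustration and do not belong to any particular framework.

    # A minimal sketch of the layered pipeline in Fig. 2. All names here
    # (acquire, preprocess, analyze, repository) are illustrative only.

    def acquire():
        # Stage 1a: data acquisition from varied sources (hard-coded here).
        return ["  ALICE,42 ", "bob,17", "ALICE,42", "carol,"]

    def preprocess(records):
        # Stage 1b: cleaning, integration and normalization.
        cleaned = set()
        for rec in records:
            name, _, value = rec.strip().partition(",")
            if name and value:                       # drop incomplete records
                cleaned.add((name.lower(), int(value)))  # normalize case
        return sorted(cleaned)                       # de-duplicated, ordered

    def analyze(repository):
        # Stage 3: analysis over the stored, preprocessed data.
        return {name: value for name, value in repository}

    repository = preprocess(acquire())               # Stage 2: the repository
    print(analyze(repository))                       # {'alice': 42, 'bob': 17}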
Terminology used in Big Data Frameworks
Availability
Among big data frameworks, the open source approaches have the greatest momentum: the most extensive acceptance and the fastest pace of innovation. Open source platforms are expanding their footprint in advanced analytics.
Operating System
Most organizations prefer operating systems like Windows and Linux, but some frameworks can also work with BSD and OS X.
Platform
The widely used big data language platforms are as follows.
Pig is a platform for data analysis that uses a textual language known as Pig Latin and compiles it into sequences of MapReduce programs. It makes it easier to understand, write and maintain programs which analyze data in parallel. Operating System: operating system independent.
R is a programming environment and language for statistical computing and graphics, similar to the S language developed at Bell Laboratories. The environment includes a set of tools that make it easier to perform calculations, manipulate data and generate charts and graphs. Operating System: OS X, Linux, Windows.
ECL ("Enterprise Control Language") is the language for working with HPCC. A complete set of tools, including a debugger and an IDE, is included with HPCC, and documentation is available on the HPCC website. Operating System: Linux.
Processing Environment
Many enterprises face the hurdle of lots of new data arriving in several different forms. Big data has the ability to confer insights that can help any business organization, and it has generated a whole new industry of supporting architectures such as MapReduce.
MapReduce is a programming framework for distributed computing, created by Google, which uses the divide-and-conquer method to decompose sophisticated big data problems into small units of work and process them in parallel. MapReduce can be divided into two stages (a minimal sketch follows the list):
 Map Step: The master node slices the input data into many smaller sub-problems. A worker node processes some subset of the smaller problems under the control of the JobTracker node and stores the result in the local file system, where a reducer is able to access it.
 Reduce Step: This step merges and analyzes the input data from the map steps. It can be split into many tasks to parallelize the aggregation, and these tasks are executed on the worker nodes under the control of the JobTracker.
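To make the two stages concrete, here is a minimal, framework-free sketch of the MapReduce model in Python, counting words. The map and reduce functions and the in-process shuffle are stand-ins for what a real cluster would distribute across worker nodes under a JobTracker.

    from collections import defaultdict

    # A single-process sketch of the MapReduce model (word count).
    # On a real cluster these steps run in parallel on worker nodes.

    def map_step(document):
        # Emit an intermediate (key, value) pair for every word.
        for word in document.split():
            yield word.lower(), 1

    def shuffle(pairs):
        # Group intermediate values by key (the framework's job in practice).
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_step(key, values):
        # Merge all values emitted for one key into the final result.
        return key, sum(values)

    documents = ["big data frameworks", "big data tools"]
    intermediate = [kv for doc in documents for kv in map_step(doc)]
    results = dict(reduce_step(k, v) for k, v in shuffle(intermediate).items())
    print(results)  # {'big': 2, 'data': 2, 'frameworks': 1, 'tools': 1}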
Brief introduction of various big data frameworks available
1. Apache Spark -
Spark is a fast, in-memory data-processing framework, claimed to be up to 100x faster than Hadoop for some workloads. As the volume and velocity of data gathered from web and mobile applications rapidly increase, it is critical that the speed of data processing and analysis stays at least a step ahead in order to support today's big data applications and end-user expectations. Spark offers the competitive benefit of high-velocity analytics by stream processing large amounts of data, versus the traditionally more heavily "batch-oriented" approach to data processing seen with Hadoop.
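As a small illustration of Spark's in-memory style, the PySpark sketch below counts words and caches the intermediate RDD so that a second query avoids recomputing from the input. The input path "input.txt" is a placeholder assumption.

    from pyspark import SparkContext

    # Minimal PySpark word count; "input.txt" is a placeholder path.
    sc = SparkContext("local[*]", "WordCount")

    counts = (sc.textFile("input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b)
                .cache())               # keep the RDD in memory for reuse

    print(counts.take(10))                               # computes and caches
    print(counts.filter(lambda kv: kv[1] > 1).count())   # reuses cached RDD

    sc.stop()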
2. GraphLab -
The GraphLab PowerGraph academic project was started in 2009 at Carnegie Mellon University to develop a new parallel computation abstraction tailored to machine learning. GraphLab PowerGraph 1.0 employed a shared-memory design; in GraphLab PowerGraph 2.1, the framework was redesigned to target the distributed environment.
3. HPCC Systems -
LexisNexis is positioning HPCC as a competitor to Apache Hadoop, the open source software framework for big data processing and analytics. The entry of LexisNexis and HPCC into the big data ecosystem is yet another validation of the big data space and should spur innovation from all parties: HPCC, Hadoop and others.
Whether HPCC (High Performance Computing Cluster) is a viable competitor to Hadoop for big data dominance is another question. LexisNexis, which has vast experience in collecting and processing large volumes of media and organizational data, certainly thinks it is. The answer depends on a number of factors, most of which are not yet clear.
4. Dryad -
Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use a large number of machines, each of them with multiple cores or processors, without knowing anything about concurrent programming.
5. Apache Flink -
Flink exploits in-memory data streaming and integrates iterative processing deeply into the system runtime. This makes the system extremely fast for data-intensive and iterative jobs. Flink is also designed to perform well when memory runs out; it contains its own serialization framework, type inference engine and memory management component.
6. Storm -
Storm is a free and open source distributed real-time computation system. Storm reliably processes unbounded streams of data for real-time processing. Storm is simple, can be used with any programming language, and is great fun to use!
7. r³ -
r³ (redistribute, reduce, reuse) is a map-reduce engine written in Python using a Redis backend. Its purpose is to be simple, and it has only three concepts to grasp: input streams, mappers and reducers.
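The sketch below illustrates those three concepts in plain Python. It is not r³'s actual API (whose class and function names are not reproduced here and are therefore not assumed); it only mirrors the input-stream, mapper and reducer shape that the engine is built around.

    # Plain-Python illustration of r³'s three concepts. This is NOT the
    # real r³ API, only the input-stream -> mapper -> reducer shape.

    def input_stream():
        # Yields raw items, e.g. lines pulled from a Redis-backed queue.
        yield from ["big data", "big frameworks"]

    def mapper(item):
        # Turns one input item into intermediate (key, value) pairs.
        return [(word, 1) for word in item.split()]

    def reducer(key, values):
        # Folds all values emitted for one key into a single result.
        return key, sum(values)

    grouped = {}
    for item in input_stream():
        for key, value in mapper(item):
            grouped.setdefault(key, []).append(value)

    print(dict(reducer(k, v) for k, v in grouped.items()))
    # {'big': 2, 'data': 1, 'frameworks': 1}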
8. Disco -
Disco is a lightweight, open-source framework for distributed computing based on the MapReduce paradigm. Disco is powerful and easy to use, thanks to its Python platform. Disco distributes and replicates your data and jobs efficiently, and it even includes the tools you need to index billions of data points and query them in real time.
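Disco jobs are ordinary Python functions. The sketch below follows the classic word-count example from the Disco documentation; the input URL is a placeholder, and the details should be checked against the Disco release in use.

    from disco.core import Job, result_iterator

    # Classic Disco word count, adapted from the Disco documentation.
    # The input URL below is a placeholder for your own data.

    def map(line, params):
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == "__main__":
        job = Job().run(input=["http://example.com/bigfile.txt"],
                        map=map, reduce=reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print(word, count)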
9. Phoenix -
Phoenix is a relational database layer for Apache HBase. It is a query engine which transforms SQL queries into native HBase API calls and pushes as much work as possible onto the cluster for parallel execution. It is a high performance, horizontally scalable data store engine for big data, suitable as the store of record for mission-critical data.
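From Python, Phoenix can be reached through its query server. The sketch below assumes the third-party phoenixdb driver and a query server at a placeholder URL; treat the connection details as assumptions rather than a definitive recipe.

    import phoenixdb  # third-party driver for the Phoenix query server

    # Placeholder URL for the Phoenix query server; adjust for your cluster.
    conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
    cursor = conn.cursor()

    # Phoenix turns this SQL into native HBase API calls under the hood.
    cursor.execute("CREATE TABLE IF NOT EXISTS users "
                   "(id BIGINT PRIMARY KEY, name VARCHAR)")
    cursor.execute("UPSERT INTO users VALUES (1, 'alice')")
    cursor.execute("SELECT id, name FROM users")
    print(cursor.fetchall())  # [(1, 'alice')]

    conn.close()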
10. Plasma -
Plasma consists of three parts: PlasmaFS, a distributed file system for large files implemented in user space; Plasma MapReduce, which runs the well-known map-reduce algorithm over large files stored in PlasmaFS; and Plasma KV, a key-value database built on top of PlasmaFS.
11. Peregrine -
Peregrine is a framework designed for running iterative jobs across partitions of data. Peregrine is designed to be fast at executing map-reduce jobs by supporting a number of optimizations and features not present in other map-reduce frameworks.
12. HTTPMR -
HTTPMR is an implementation of Google's MapReduce data processing model on clusters of HTTP servers. HTTPMR makes the following assumptions about the computing environment (a sketch of a job controller built on these assumptions follows the list):
 Machines can only be accessed via HTTP requests.
 Requests are assigned randomly to a set of machines.
 Requests have timeouts on the order of several seconds.
 There is a storage system that is accessible by code receiving HTTP requests.
 The data being processed can be broken up into many, many small records, each having a unique identifier.
 The storage system can accept >, <= range restrictions on operations over the data's unique identifiers.
 Jobs are controlled by a web spidering system (such as wget).
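To illustrate how those assumptions fit together, the hypothetical controller below drives map work by issuing HTTP requests over successive identifier ranges, with short timeouts and retries. The host name, URL layout and query parameters are invented for illustration and are not part of HTTPMR itself.

    import urllib.request
    import urllib.error

    # Hypothetical HTTPMR-style job controller; the host, path and query
    # parameters are invented for illustration only.
    BASE = "http://example.com/mapreduce/map"
    RANGES = [("a", "g"), ("g", "n"), ("n", "t"), ("t", "{")]  # id ranges

    def run_range(lo, hi, retries=3):
        # Ask whichever machine answers to map records with lo < id <= hi.
        url = f"{BASE}?gt={lo}&lte={hi}"
        for _ in range(retries):
            try:
                # Requests must finish within seconds (assumption above).
                with urllib.request.urlopen(url, timeout=5) as resp:
                    return resp.read()
            except urllib.error.URLError:
                continue  # the randomly assigned machine failed; retry
        raise RuntimeError(f"range ({lo}, {hi}] failed after {retries} tries")

    for lo, hi in RANGES:
        run_range(lo, hi)  # a spider such as wget could issue the same GETs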
13. Sector/Sphere -
Sector is a high performance, scalable and secure distributed file system. Sphere is a parallel data processing engine built on top of Sector; it is high performance and can process Sector data files on the storage nodes through very simple programming interfaces.
14. Misco -
Misco is a distributed computing framework designed for mobile devices. Being 100% implemented in Python, Misco is highly portable and should be able to run on any system which supports Python and its networking libraries. As more and more people own mobile devices, and as these devices become increasingly powerful, there has been an explosion in distributed applications: social networking applications keep users connected and updated with their family and friends, while monitoring applications help users avoid traffic congestion and plan routes. Misco provides a powerful platform for developing and supporting such applications.
15. MR-MPI -
The MR-MPI library was developed at Sandia National Laboratories, a US Department of Energy facility, for use on informatics problems. It includes C++ and C interfaces callable from most high-level languages, as well as a Python wrapper and the project's own OINK scripting wrapper, which can be used to develop and chain MapReduce operations together. MR-MPI and OINK are open-source codes, distributed freely under the terms of the modified Berkeley Software Distribution (BSD) License.
Table 1: Comparison of big data frameworks ("-" marks details that could not be recovered)

Framework     | Availability                | Processing                        | Developed by                     | Platform            | Operating System
--------------|-----------------------------|-----------------------------------|----------------------------------|---------------------|------------------
Apache Spark  | Open source                 | Parallel data processing          | Apache S/w Foundation            | Scala, Java, Python | Linux, Mac OS, Windows
GraphLab      | Open source                 | Parallel data processing          | Carnegie Mellon University       | C++                 | Linux, Mac OS
HPCC System   | Open source                 | Parallel data processing          | LexisNexis Risk Solutions        | C++, ECL            | Linux
Dryad         | Microsoft Research project  | Parallel processing               | Microsoft Research               | C#                  | Windows
Apache Flink  | Open source                 | Parallel data processing          | Apache S/w Foundation            | Java and Scala      | Linux, Mac OS, Windows
Storm         | Open source                 | Distributed real-time computation | BackType                         | Java                | Linux, Mac OS, Windows
R3            | Open source                 | Map-reduce engine over Redis      | -                                | Python              | Any system which supports Python and its networking libraries
Disco         | Open source                 | Distributed data processing       | Nokia (core developed in Erlang) | Python              | Linux, Mac OS X, FreeBSD
Phoenix       | Open source                 | SQL database layer                | Apache S/w Foundation            | Java                | Cross platform
Peregrine     | Open source                 | Parallel processing               | -                                | Java                | Linux, Mac OS, Windows
Sector/Sphere | Open source (Apache 2.0)    | Parallel data processing          | -                                | C++                 | Linux only (server side)
Misco         | Open source                 | Parallel processing               | Nokia Research Center            | Python              | Any system which supports Python and its networking libraries
MR-MPI        | Open source                 | Parallel processing               | Sandia National Laboratories     | C++ and C           | -
GridGain      | Open source                 | Parallel processing               | GridGain Systems                 | Java                | Cross platform

Table 2: New frameworks introduced in 2014

System        | Configuration
--------------|---------------
Redshift      | Amazon Redshift with default options.
Shark (disk)  | Input and output tables are on-disk compressed with gzip. OS buffer cache is cleared before each run.
Impala (disk) | Input and output tables are on-disk compressed with snappy. OS buffer cache is cleared before each run.
Shark (mem)   | Input and output tables are stored in the Spark cache.
Impala (mem)  | Input tables are coerced into the OS buffer cache. Output tables are on disk (Impala has no notion of a cached table).
Hive          | Hive on HDP 2.0.6 with default options. Input and output tables are on disk compressed with snappy. OS buffer cache is cleared before each run.
Tez           | Tez with the configuration parameters specified here. Input and output tables are on disk compressed with snappy. OS buffer cache is cleared before each run.

Table 3: Top Open Source Tools for Big Data

Category                              | Tools
--------------------------------------|-------
Big Data Analysis Platforms and Tools | Hadoop, MapReduce, GridGain, HPCC, Storm, Hadoop Distributed File System
Databases / Data Warehouses           | Cassandra, HBase, MongoDB, Neo4j, CouchDB, OrientDB, Terrastore, FlockDB, Hibari, Riak, Hypertable, Hive, InfoBright Community Edition, Infinispan, Redis
Business Intelligence                 | Talend, Jaspersoft, Palo BI Suite/Jedox, Pentaho, SpagoBI, KNIME, BIRT/Actuate
Data Mining                           | RapidMiner/RapidAnalytics, Mahout, Orange, Weka, jHepWork, KEEL, SPMF, Rattle
Big Data Search                       | Lucene, Solr
Data Aggregation and Transfer         | Sqoop, Flume, Chukwa
Programming Languages                 | Pig/Pig Latin, R, ECL
Miscellaneous Big Data Tools          | Zookeeper, Oozie, Avro, Terracotta, MR-MPI
Conclusion
In this brief study of big data frameworks and tools, it has been found that various categories of frameworks and tools are available in the market, catering to the different varieties of functionality needed to deal with big data. It is recommended to utilize each of them in accordance with its key features to obtain the appropriate solution. It has also been observed that each and every framework is aimed at fulfilling some specific requirement of the user. At present there is no existing framework which can cover all kinds of big data requirements. Therefore, the development of a universal framework which can tackle every kind of big data issue remains an open area of research.