Proceedings of the 9th INDIACom; INDIACom-2015
2015 2nd International Conference on “Computing for Sustainable Global Development”, 11th – 13th March, 2015
Bharati Vidyapeeth’s Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)
Big Data Analysis Using Computational
Intelligence and Hadoop: A Study
Apoorva Gupta
Amity School of Engineering and Technology
Amity University
Noida, India
[email protected]
Abstract – Computational Intelligence (CI) techniques are expected to provide powerful tools for addressing Big Data challenges. The main techniques in CI, such as evolutionary computation, neural computation and fuzzy systems, are inherently capable of handling various amounts of uncertainty, which makes CI techniques well suited for dealing with the Variability and Variety of Big Data. On the other hand, two other V's, Volume and Velocity, may create serious challenges for existing CI techniques. The next two V's, Value and Veracity, are equally important and equally challenging when dealing with big data. Consequently, new CI techniques need to be developed to efficiently and effectively tackle huge amounts of data and to rapidly respond to changing situations. It should be pointed out, however, that such new techniques will not be developed from scratch; instead, they will be based on many ongoing research topics scattered across different areas of CI research, e.g., large-scale optimization, many-objective optimization, learning in non-stationary environments, and natural language processing. A recent review of the use of evolutionary computation and other meta-heuristics in the optimization of biological systems indicates a similarity between imparting computational intelligence to huge amounts of data using biologically inspired techniques and big data analysis using the Hadoop environment.
Keywords – big data, computational intelligence, hadoop, hadoop ecosystem, hive, mahout, pig, swarm intelligence.
I. INTRODUCTION
Imparting computational intelligence is, in today's scenario, a vital task for achieving automation and efficient analytics of any given data. This may be done using various computational approaches. One of them is swarm intelligence, a nature inspired approach that applies a family of algorithms such as ant colony optimization, bee colony optimization, bacteria colony optimization and others. Similar to these approaches, a new and widely accepted advance in the field of big data analysis is the introduction of the Hadoop storage framework for storing huge chunks of data, and of its ecosystem, which comprises various analysis tools such as Mahout, Pig and Hive for applying machine learning approaches through recommender engines, large dataset analysis, and data warehousing and querying, respectively.
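To make the querying role of Hive concrete, the sketch below shows how a HiveQL query can be submitted from Java over JDBC. This is an illustrative example rather than code from the surveyed work: the connection URL assumes a HiveServer2 instance on its default port, and the tweets table with columns user_id and txt is purely hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Minimal, illustrative HiveQL query over JDBC.
 * Assumes a HiveServer2 instance on localhost:10000 and a hypothetical
 * table "tweets" with columns (user_id STRING, txt STRING).
 */
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (shipped with the Hive client libraries).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // A simple aggregation: how many tweets each user has posted.
            ResultSet rs = stmt.executeQuery(
                "SELECT user_id, COUNT(*) AS n FROM tweets GROUP BY user_id");

            while (rs.next()) {
                System.out.println(rs.getString("user_id") + "\t" + rs.getLong("n"));
            }
        }
    }
}
```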
Hadoop provides an easy storage solution for huge chunks of raw data that may then be used for analysis, thus enabling the effective conversion of data into information. This Hadoop-related analysis approach is propounded to be nature inspired: the big data analysis approach is observed to exhibit computationally intelligent behavior with reference to swarm intelligence. Swarm intelligence is an artificial intelligence (AI) technique that primarily focuses on the collective behavior of a decentralized system. A swarm is defined as a set of agents that are liable to communicate directly or indirectly with each other and that collectively carry out distributed problem solving [1].
In a similar way, when Hadoop is used for storage together with its various analysis tools, the commodity hardware nodes can be considered to behave as a swarm. Hence the distributed yet interrelated commodity hardware nodes behave as swarms facilitating Hadoop analysis, and swarm optimization inspires Hadoop-optimized analysis [2]. Both nature inspired techniques and Hadoop may be used for imparting computational intelligence to big data. The approaches give ease of programming, extensibility and optimization opportunities.
II. BIG DATA
Today is the era of social media: establishing new connections, social networking, online shopping, web postings, online lectures, blogging and much more. 'Daily data' such as comments, likes, video and picture posts on Facebook, tweets, and the millions of videos on YouTube are just common examples of the sources of the vast amounts of data being stored, uploaded and downloaded every day over the internet. This exponential growth of data is challenging for Facebook, Yahoo, Google, Amazon and Microsoft. The term 'Big Data' is used to refer to collections of data sets that are so large and complex that they are difficult to handle and process using traditional data processing applications.
Fig. 1. Big data characteristics.
The term itself has been more formally defined by IBM as the combination of three V's: velocity, variety and volume. These are the generic big data properties. However, the acquired properties exhibited after data enters the system include value, veracity, variability and visualization. Thus, the seven V's together describe big data [11][12].

Fig. 2. The 7 V's of big data [11][12].
III. BIG DATA AND COMPUTATIONAL INTELLIGENCE
Computational intelligence (CI) provides exceptional tools for
addressing big data challenges. These techniques include
evolutionary computation, neural computation and fuzzy
systems which are inherently capable of handling uncertainty
[13].
A. VOLUME
Millions of items of data are uploaded every day on Facebook, Twitter and other online platforms. Akamai analyzes 75 million events a day, primarily to target online ads, and Wal-Mart handles 1 million customer transactions per hour, contributing to the exponential growth of data online as part of big data. The system is generating terabytes, petabytes and zettabytes of data. This data may also be handled using computationally intelligent, biologically inspired techniques, say bacteria colony optimization. Such huge chunks of data are handled through the CI technique of data mining, expanding its scope to cover big data analytics.
B. VELOCITY
The system generates streams of data from multiple sources, and that data must be consumed as it arrives. There is exponential growth in data every hour. For instance, Wal-Mart's data warehouse stored 1,000 terabytes of data in 1999, a figure that had surpassed 2.5 petabytes by 2012 [12]. Every minute, thousands of new online uploads flood in. Widely used machine learning databases have grown to millions of records, making feature selection a vital requirement. Various CI techniques are used for time domain astronomy (TDA) [14].
C. VARIETY
Both structured and unstructured data, including blogs, images, audio and videos, are a part of big data. These data may be analyzed for sentiment and content. Earlier, companies may have dealt with only a single data format, but today big data provides a platform for all data formats. Various CI techniques, even biologically inspired swarm intelligent techniques, can be used for dealing with such versatile data. Various data mining techniques perform this analysis using neural networks, fuzzy logic, and graphs and trees [15].
D. VARIABILITY
Big data allows uncertainty in constantly changing data to be handled, helping in the prediction of the future behavior of various customers, entrepreneurs, etc. Basically, the meaning of the data is constantly changing, and interpreting it relies mainly on language processing.
E. VERACITY
In order to ensure the accuracy of big data, various security tools are provided to safeguard the potential value of the data. This matters when automating decision making or feeding data into an unsupervised machine learning algorithm, and it ensures the authenticity, availability and accountability of the data.
F. VISUALIZATION
The CI techniques involved in making the data readable and easily accessible contribute to this V of big data. The data needs to be easily understood, and CI techniques such as the various optimization algorithms offer the advantage of providing an optimal view of the analyzed data.
G. VALUE
The value of big data is huge: it enables sentiment analysis, prediction and recommendation. Big data is massive and rapidly expanding, but it loses its worth when dealt with without analysis and visualization, since the raw data is noisy, messy and rapidly changing. This value may be extracted only when various CI techniques are applied to big data, enabling easy analysis and maximum profit.
Big data can also be analyzed using a swarm intelligent approach such as bacteria colony optimization [5]. The bacteria colony gives a huge problem space, and hence a big-data-scale problem space domain, for performing the analysis and optimization needed for speedy decision-making activities.
IV. BIG DATA AND HADOOP
Traditionally, it may have been feasible to analyze the limited data stored on a server's file system. Data-intensive companies (Google, Yahoo, Amazon and Microsoft) needed to figure out which books, websites and people were in demand, and thus to decide what kinds of ads actually appealed to their audience. Existing SQL-based query and analysis tools are not sufficient to meet these growing data analysis demands, failing to tackle multi-platform data storage and requiring multi-platform code.
Hadoop is a distributed open source framework for writing and running distributed applications that process large amounts of data. The key features offered by Hadoop are:
• Accessibility- Hadoop runs on large clusters of commodity machines and provides easy access to all the systems, overcoming the barriers of distance.
• Robust- Since it runs on commodity hardware, Hadoop is designed to gracefully handle frequent machine malfunctions.
• Scalable- Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
• Simple- The simplicity of Hadoop lies in writing quick, efficient parallel programs, giving the programmer the advantage of using any supported language (e.g. Java or Python), as illustrated by the sketch below.
• Cost effective- Hadoop proves to be cost effective by using commodity hardware rather than expensive servers [6].
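To make the programming model behind these features concrete, here is the canonical word-count job written against Hadoop's Java MapReduce API. It is a minimal sketch rather than code from the surveyed paper; the input and output HDFS paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Canonical word-count job: maps each word to 1, reduces by summing. */
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();              // add up all the 1s for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```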
The working environment of Hadoop is given by its Hadoop ecosystem. The ecosystem provides various analysis, data warehousing, data querying and data mining tools, inclusive of machine learning algorithms, so that Hadoop may be used for the analysis of big data.

V. HADOOP ECOSYSTEM

Fig. 3. The Hadoop ecosystem [16]
Above all the layers of the Hadoop ecosystem lies Apache Oozie, for workflow management. Hadoop is written in Java. All the tools are open source and enable successful management of data over a distributed file system.
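As one small example of how these ecosystem tools are used, Pig Latin scripts can be driven programmatically through Pig's PigServer class. The sketch below is an illustration under stated assumptions, not code from the surveyed paper: it runs in local mode and reads a hypothetical whitespace-delimited file words.txt.

```java
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

/**
 * Runs a tiny Pig Latin word-count pipeline embedded in Java.
 * Local mode is used here; on a cluster, ExecType.MAPREDUCE would submit
 * the same script as MapReduce jobs. The file "words.txt" is hypothetical.
 */
public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load lines, split into words, group and count.
        pig.registerQuery("lines = LOAD 'words.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;");

        // Iterate over the result tuples (word, n).
        Iterator<Tuple> it = pig.openIterator("counts");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```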
The initial release of the Hadoop 1.0 architecture has the following disadvantages:
• No horizontal scalability of NameNodes, i.e. there is only one NameNode per Hadoop cluster, and if that NameNode fails the entire system goes down.
• It does not provide NameNode high availability, i.e. the NameNode is a single point of failure.
• It may have an overburdened JobTracker.
• It is not possible to run non-MapReduce big data applications on HDFS.
• It does not support multi-tenancy, i.e. only one type of job can run, or one batch may be executed, at a time.
Despite the above disadvantages, Hadoop 1.0 is still preferred and widely used as compared to YARN (the Hadoop 2.0 architecture), owing to the broad acceptance of the 1.0 architecture in various industries and organizations; they may become accustomed to it first and then shift to the updated versions of Hadoop.
VI. SWARM INTELLIGENCE
Swarm intelligence is successfully being applied in research settings that focus on improving management and control over large numbers of interacting entities, thus describing their collective behavior [3]. It is primarily concerned with the design of multi-agent systems, taking inspiration from the collective behaviors of social insects and other animal societies [1]. Swarm intelligence in turn inspires the Hadoop analysis of big data [1].
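To give a flavor of what a swarm intelligent algorithm looks like in code, the following is a minimal particle swarm optimization sketch minimizing the sphere function f(x) = sum of x_i^2. Particle swarm optimization is a classic swarm technique in the same family as the ant, bee and bacteria colony methods mentioned above, though it is not the specific algorithm studied in the cited work; the swarm size, iteration count and coefficients are arbitrary illustrative choices.

```java
import java.util.Arrays;
import java.util.Random;

/** Minimal particle swarm optimization (PSO) minimizing f(x) = sum of x_i^2. */
public class SimplePso {
    static final int DIM = 5, SWARM = 30, ITERATIONS = 200;
    static final double W = 0.72, C1 = 1.49, C2 = 1.49;   // inertia and attraction weights
    static final Random RNG = new Random(42);

    static double fitness(double[] x) {
        double s = 0;
        for (double v : x) s += v * v;
        return s;
    }

    public static void main(String[] args) {
        double[][] pos = new double[SWARM][DIM], vel = new double[SWARM][DIM];
        double[][] pBest = new double[SWARM][DIM];
        double[] pBestVal = new double[SWARM];
        double[] gBest = null;
        double gBestVal = Double.MAX_VALUE;

        // Initialize particles at random positions in [-5, 5]^DIM.
        for (int i = 0; i < SWARM; i++) {
            for (int d = 0; d < DIM; d++) pos[i][d] = RNG.nextDouble() * 10 - 5;
            pBest[i] = pos[i].clone();
            pBestVal[i] = fitness(pos[i]);
            if (pBestVal[i] < gBestVal) { gBestVal = pBestVal[i]; gBest = pos[i].clone(); }
        }

        for (int t = 0; t < ITERATIONS; t++) {
            for (int i = 0; i < SWARM; i++) {
                for (int d = 0; d < DIM; d++) {
                    // Velocity update: inertia plus pulls toward personal and global bests.
                    vel[i][d] = W * vel[i][d]
                            + C1 * RNG.nextDouble() * (pBest[i][d] - pos[i][d])
                            + C2 * RNG.nextDouble() * (gBest[d] - pos[i][d]);
                    pos[i][d] += vel[i][d];
                }
                double f = fitness(pos[i]);
                if (f < pBestVal[i]) { pBestVal[i] = f; pBest[i] = pos[i].clone(); }
                if (f < gBestVal)    { gBestVal = f;    gBest = pos[i].clone(); }
            }
        }
        System.out.println("best value " + gBestVal + " at " + Arrays.toString(gBest));
    }
}
```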
The main requirements that a swarm-based cluster satisfies are the following:
• Scalability- Commodity hardware in the Hadoop analysis, and the robots in the case of swarms, can be added or removed as per requirements.
• Dealing with different types of attributes- The big data analysis approach deals with data that includes pictures, videos, PDF files, text files and many others [4].
• High dimensionality- Both Hadoop-related analysis and swarm intelligence have the ability to deal with huge amounts of data and thus perform optimization for fast analysis.
• Robustness- Hadoop-based big data analysis and swarm-inspired big data analysis both allow the data to be modified at runtime.
• Highly effective- The two approaches focus on optimized analysis such that the results are effective, fast and reliable.
Fig. 4. The common attributes offered by swarm intelligence and Hadoop big data analysis.
VII. BIG DATA ANALYSIS USING TRADITIONAL CI AND
HADOOP STORAGE FRAMEWORK
The two approaches may be compared with the following few examples:
• Optimization inspired by the evolution process of a bacterial colony, and the Hadoop cluster
A new swarm intelligent technique called bacterial colony optimization (BCO) is considered, in which the problem space is huge due to its evolutionary properties, similar to the scalability of commodity hardware in Hadoop, in order to provide availability and scalability properties to the system of computers [6].
• Support vector machines using nonlinear kernels on Hadoop Mahout, and kernel methods for trees and graphs through neural networks
The four major challenges of big data, i.e. volume, velocity, variety and veracity, are targeted by big data mining. This can be achieved via the Hadoop ecosystem and swarm intelligent techniques. Here a homomorphic cryptosystem with secure multiparty computation of system matrix operations has been shown to preserve privacy while data miners perform information retrieval from big data [7].
Neural networks are applied on structured data for mining useful information, using a recurrent network for the analysis of the data [8].
• Distributed data clustering algorithms
Clustering is one of the major requirements for the analysis of voluminous amounts of data, with applications in the fields of pattern recognition, data mining, bioinformatics and recommender engines [9]. The basic artificial intelligence algorithms for computational intelligence, such as k-means, fuzzy k-means, Dirichlet and latent Dirichlet allocation, are considered for cloud computing environments, i.e. Hadoop and Granules. These algorithms have been shown to give successful results through swarm intelligent techniques.
• Team collaboration and transactive memory in swarm intelligence and through Hadoop
Swarm intelligence describes the collective behavior that emerges from a group of socially interactive insects/animals [3]. Such collaborative filtering has also been demonstrated with Mahout on Hadoop [10]; a minimal recommender sketch follows the advantages list below.
Advantages of Hadoop- and swarm intelligence-inspired big data analysis [2]:
• Computational efficiency- The availability of multiple processors in a swarm, and of commodity hardware in Hadoop, reduces the computational overhead.
• Reliability- There is continuous group operation in both swarm and Hadoop analysis, contributing decentralized control, shared sensor/analytics data, and no single point of failure.
• Low cost- The simple design in the case of swarm intelligence requires less hardware and is ready for mass production. Similarly, Hadoop-based big data analysis provides an easy, cost-effective solution for storing huge amounts of data and thereby enabling its analysis.
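The collaborative filtering mentioned above can be sketched with Mahout's in-memory Taste API, as popularized in Mahout in Action [10]. This is a hedged illustration rather than the surveyed method: the CSV file name, neighborhood size and user id are assumptions, and a production deployment would instead run Mahout's distributed recommender jobs on the Hadoop cluster.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

/**
 * User-based collaborative filtering with Mahout's Taste API.
 * "ratings.csv" is a hypothetical file of lines: userID,itemID,preference
 */
public class UserBasedRecommenderExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // How similar are two users' rating vectors?
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Consider the 10 most similar users as a neighborhood (illustrative size).
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for (hypothetical) user 42.
        List<RecommendedItem> recs = recommender.recommend(42L, 3);
        for (RecommendedItem item : recs) {
            System.out.println("item " + item.getItemID() + " score " + item.getValue());
        }
    }
}
```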
| Attributes | Analysis using traditional computational intelligence methods | Analysis using computational intelligent algorithms through Hadoop |
|---|---|---|
| Huge data sets | Slow | Fast |
| Small data sets | Fast | Slow |
| Tools | Matlab, Weka, Network Simulator | Hive, Pig, Mahout |
| Multitasking, parallel processing | As per the algorithm used | Yes, using commodity hardware |
| Volume | Biologically inspired techniques, e.g. bacteria colony optimization | CI-inspired algorithms using the Hadoop storage framework |
| Knowledge extraction | Using currently available datasets | Machine learning and training for empirical analysis |
| Data | Mostly static | Dynamic and robust |
| Example | Data mining using swarm intelligence (artificial intelligence) | Collaborative filtering for online generated terabytes and petabytes of data (e.g. Gmail spam filtering) |

Table 1: Comparison of analyzing big data using traditional CI techniques and through the Hadoop storage framework.
VIII. CONCLUSION AND FUTURE SCOPE
The Hadoop environment and computational intelligence using various methods such as artificial intelligence, bacteria colony optimization and ant colony optimization are closely related for big data analysis. Moreover, big data analysis using Hadoop is nature inspired and is an effective method for analyzing and mining tons of data for useful information.
Big data analysis can be optimized by taking advantage of various already discovered swarm intelligence and artificial intelligence algorithms, incorporating efficient machine learning for better understanding. This is used for training the machines and carrying forward the tasks of predictive analysis, collaborative filtering, and building empirical statistical predictive models.
REFERENCES
[1] P. K. Bharne, V. S. Gulhane, S. K. Yewale, "Data Clustering Algorithms Based on Swarm Intelligence", IEEE, 2011.
[2] Yan-fei Zhu, Xiong-min Tang, "Overview of Swarm Intelligence", International Conference on Computer Application and System Modeling (ICCASM 2010), IEEE, 2010.
[3] L. L. Ji, Y. H. Jin, "Team Collaboration and Transactive Memory System on Swarm Intelligence", IEEE, 2010.
[4] R. M. Esteves, C. Rong, "Using Mahout for Clustering Wikipedia's Latest Articles", Third IEEE International Conference on Cloud Computing Technology and Science, IEEE, 2011.
[5] R. S. Xavier, N. Omar, de Castro; Natural Computing Lab (LCoN), Mackenzie Presbyterian University, Sao Paulo, Brazil.
[6] Li Ming, "A Novel Swarm Intelligence Optimization Inspired by Evolution Process of a Bacterial Colony", Proceedings of the 10th World Congress on Intelligent Control and Automation, Beijing, China, July 6-8, 2012.
[7] Sin G. Teo, Shuguo Han, Vincent C. S. Lee, "Privacy Preserving Support Vector Machine Using Non-Linear Kernels on Hadoop Mahout", 16th International Conference on Computational Science and Engineering, IEEE, 2013.
[8] Giovanni Da San Martino and Alessandro Sperduti, "Mining Structured Data", IEEE Computational Intelligence Magazine, 2010.
[9] Kathleen Ericson and Shrideep Pallickara, "On the Performance of Distributed Data Clustering Algorithms in File and Streaming Processing Systems", Fourth IEEE International Conference on Utility and Cloud Computing, IEEE, 2011.
[10] Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman, "Mahout in Action", Manning Publications Co., 2012.
[11] Yuri Demchenko, "Overview of NIST Big Data Working Group Activities and the Big Data Architecture Framework (BDAF) by UvA", 2nd RDA Plenary, 17 September 2013.
[12] Rasmus Wegener and Velu Sinha, "The Value of Big Data: How Analytics Differentiates Winners", Bain and Company.
[13] Yaochu Jin, Barbara Hammer, "Computational Intelligence in Big Data", IEEE Computational Intelligence Magazine, August 2014.
[14] Huijse et al., "Computational Intelligence Challenges and Applications on Large-Scale Astronomical Time Series Databases", IEEE, 2013.
[15] Giovanni Da San Martino and Alessandro Sperduti, "Mining Structured Data".
[16] Chuck Lam, "Hadoop in Action", Manning, Greenwich, 2011.