Proceedings of the 9th INDIACom; INDIACom-2015, 2015 2nd International Conference on “Computing for Sustainable Global Development”, 11th – 13th March, 2015, Bharati Vidyapeeth’s Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)

Big Data Analysis Using Computational Intelligence and Hadoop: A Study

Apoorva Gupta
Amity School of Engineering and Technology, Amity University, Noida, India
[email protected]

Abstract – Computational Intelligence (CI) techniques are expected to provide powerful tools for addressing Big Data challenges. The main CI techniques, such as evolutionary computation, neural computation and fuzzy systems, are inherently capable of handling various amounts of uncertainty, which makes them well suited to the Variability and Variety of Big Data. On the other hand, two other V’s, Volume and Velocity, may create serious challenges for existing CI techniques. The next two V’s, Value and Veracity, are equally important and equally challenging when dealing with big data. Consequently, new CI techniques need to be developed to tackle huge amounts of data efficiently and effectively, and to respond rapidly to changing situations. It should be pointed out, however, that such new techniques will not be developed from scratch; instead, they build on many on-going research topics scattered across different areas of CI research, e.g., large-scale optimization, many-objective optimization, learning in nonstationary environments, and natural language processing. A recent review of the use of evolutionary computation and other meta-heuristics in the optimization of biological systems indicates a similarity between imparting computational intelligence to huge amounts of data using biologically inspired techniques and analysing big data in the Hadoop environment.

Keywords – big data, computational intelligence, Hadoop, Hadoop ecosystem, Hive, Mahout, Pig, swarm intelligence.

I. INTRODUCTION
Imparting computational intelligence is today a vital task for achieving automation and efficient analytics of any given data. This may be done using various computational approaches, one of which is swarm intelligence. Swarm intelligence is a nature inspired approach that draws on a family of algorithms such as ant colony optimization, bee colony optimization, bacteria colony optimization and others. A comparable and widely accepted advance in the field of big data analysis is the Hadoop storage framework, which stores huge chunks of data, together with its ecosystem of analysis tools such as Mahout, Pig and Hive, used respectively for machine learning (for example recommender engines), large dataset analysis, and data warehousing and querying. Hadoop provides an easy storage solution for huge chunks of raw data that may then be analyzed, thus enabling the effective conversion of data into information.

This Hadoop related analysis approach is proposed here to be nature inspired: the big data analysis approach is observed to show computationally intelligent behavior comparable to swarm intelligence. Swarm intelligence is an artificial intelligence (AI) technique that primarily focuses on the collective behavior of a decentralized system. A swarm is defined as a set of agents that are able to communicate directly or indirectly with each other and that collectively carry out distributed problem solving [1]. In a similar way, when Hadoop is used for storage and its tools are used for analysis, the commodity hardware nodes can be considered to behave as a swarm: distributed yet interrelated machines that together carry out the Hadoop analysis. In this sense swarm optimization inspires Hadoop optimized analysis [2]. Both nature inspired techniques and Hadoop may be used for imparting computational intelligence to big data.
Both approaches offer ease of programming, extensibility and optimization opportunities.

II. BIG DATA
Today is the era of social media: establishing new connections, social networking, online shopping, web postings, online lectures, blogging and much more. ‘Daily data’ such as Facebook comments and likes, video and picture posts, tweets, and the millions of videos on YouTube are just common examples of the sources of the enormous volume of data that is stored, uploaded and downloaded every day over the internet. The exponential growth of data is challenging for Facebook, Yahoo, Google, Amazon and Microsoft. The term ‘Big Data’ refers to collections of data sets that are so large and complex that they are difficult to handle and process using traditional data processing applications.

Figure 1. Big data characteristics.

The term has been more formally defined by IBM as the combination of 3 V’s: velocity, variety and volume. These are the generic big data properties. However, the acquired properties that data exhibits after entering the system include value, veracity, variability and visualization. Thus, 7 V’s together describe big data [11][12].

Figure 2. The 7 V’s of big data [11][12].

III. BIG DATA AND COMPUTATIONAL INTELLIGENCE
Computational intelligence (CI) provides exceptional tools for addressing big data challenges. These techniques include evolutionary computation, neural computation and fuzzy systems, which are inherently capable of handling uncertainty [13].

A. VOLUME
Millions of data items are uploaded every day on Facebook, Twitter and other online platforms. Akamai analyses 75 million events a day, primarily to target online ads, and Wal-Mart handles 1 million customer transactions per hour, contributing to the exponential growth of online data that forms part of big data. Systems are generating terabytes, petabytes and zettabytes of data. This data may also be handled using computationally intelligent, biologically inspired techniques such as bacteria colony optimization. Such huge volumes of data are handled through the CI technique of data mining, whose scope is expanding to cover big data analytics.

B. VELOCITY
The system generates streams of data, and multiple sources require that data. Data grows exponentially every hour. For instance, Walmart’s data warehouse stored 1,000 terabytes of data in 1999, a figure that surpassed 2.5 petabytes in 2012 [12]. Every minute, thousands of online uploads add to the flood of data. Widely used machine learning databases have grown into the millions, which makes feature selection a vital requirement. Various CI techniques are used for time domain astronomy (TDA) [14].

C. VARIETY
Both structured and unstructured data, including blogs, images, audio and videos, are part of big data. These data may be analyzed for sentiment and content. In the past, companies may have dealt with only a single data format, but today big data provides a platform for all data formats. Various CI techniques, including biologically inspired swarm intelligence techniques, can be used for dealing with such varied data. Various data mining techniques perform the analysis using neural networks, fuzzy logic, and graphs and trees [15].

D. VARIABILITY
Big data analysis must handle uncertainty in constantly changing data, which helps in predicting the future behavior of customers, entrepreneurs and others. Essentially, the meaning of the data is constantly changing, and interpreting it relies mainly on language processing.

E. VERACITY
In order to ensure the accuracy of big data, various security tools are provided to safeguard the potential value of the data. This involves automated decision making or feeding data into an unsupervised machine learning algorithm.
This ensures the authenticity, availability and accountability of the data.

F. VISUALIZATION
The CI techniques involved in making the data readable and easily accessible contribute to the 5th V of big data. The data needs to be easily understood, and CI techniques such as the various optimization algorithms offer the advantage of providing an optimal review of the analyzed data.

G. VALUE
The value of big data is huge. It enables sentiment analysis, prediction and recommendation. Big data is massive and rapidly expanding, but it loses its worth when it is handled without analysis and visualization, because the data is noisy, messy and rapidly changing. The value of big data can be extracted only when various CI techniques are applied to it, enabling easy analysis and maximum profit.

Big data can be analyzed through a swarm intelligent approach such as bacteria colony optimization [5]. The bacteria colony offers a huge problem space and hence provides a big data problem space domain in which to perform analysis and optimization for speedy decision making.

IV. BIG DATA AND HADOOP
Traditionally it was feasible to analyze the limited data stored on a server's file system. Data-intensive companies (Google, Yahoo, Amazon and Microsoft) needed to identify the most in-demand books, websites and popular people, and thus decide what kind of ads actually appealed to their audience. Existing tools, including SQL based query analysis tools, are not sufficient to meet the growing data analysis demands: they fail to handle multiple platforms, since storing the data requires platform-specific code. Hadoop is a distributed, open source framework for writing and running distributed applications that process large amounts of data. The key features offered by Hadoop are:
• Accessibility- Hadoop runs on large clusters of commodity machines and provides easy access to all the systems, overcoming the barriers of distance.
• Robust- Because Hadoop is intended to run on commodity hardware, it is designed to cope with frequent machine malfunctions.
• Scalable- Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
• Simple- The simplicity of Hadoop lies in letting users quickly write efficient parallel programs, giving the programmer the advantage of using programs in any language (Java, Python).
• Cost effective- Hadoop proves to be cost effective because it uses commodity hardware rather than expensive servers [6].
The working environment in Hadoop is provided by the Hadoop ecosystem. Hadoop offers various analysis, data warehousing, data querying and data mining tools, including machine learning algorithms, so that it may be used for the analysis of big data.

V. HADOOP ECOSYSTEM
Figure 3. The Hadoop ecosystem [16]

Above all the layers of the Hadoop ecosystem lies Apache Oozie for workflow management. Hadoop is written in Java. All the tools are open source and enable successful management of data on a distributed file system. The initial release of the Hadoop 1.0 architecture has the following disadvantages:
• No horizontal scalability of NameNodes: there is only one NameNode for a Hadoop cluster, and if that NameNode fails the entire system goes down.
• No NameNode high availability, i.e. a single point of failure.
• The JobTracker may become overburdened.
• It is not possible to run non-MapReduce big data applications on HDFS.
• No support for multi-tenancy, i.e. only one type of job can run, or one batch may be executed, at a time.
Despite the above disadvantages, Hadoop 1.0 is still preferred and more widely used than YARN (the Hadoop 2.0 architecture), because the 1.0 architecture is widely accepted across industries and organizations, which may first get accustomed to it and then shift to the updated versions of Hadoop.
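Sections IV and V refer to parallel programs and to MapReduce applications without showing the programming model itself. The following is a minimal sketch of the map, shuffle and reduce steps for a word count, run locally in a single Python process. It is an illustration only: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_word_count`) are hypothetical and are not part of any Hadoop API, and the in-memory grouping stands in for the work that the framework would distribute across cluster nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases locally (assumption: data fits in memory)."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big clusters", "hadoop stores big data"]
    print(run_word_count(sample))
```

On a cluster, each map call would see only its own block of the input and each reduce call only its own keys, which is why adding nodes lets the same program handle more data. Hadoop Streaming allows the mapper and reducer to be supplied as separate executables that read standard input and write standard output, which is one way a Python program of this shape could be adapted.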
VI. SWARM INTELLIGENCE
Swarm intelligence is being applied successfully in a host of research settings that focus on improving the management and control of large numbers of interacting entities, and thus on describing their collective behavior [3]. It is primarily concerned with the design of multi-agent systems that take inspiration from the collective behaviors of social insects and other animal societies [1]. Swarm intelligence inspires the Hadoop analysis of big data [1]. The main requirements that a swarm based cluster satisfies are the following:
• Scalability- Commodity hardware in Hadoop analysis, and robots in the case of swarms, can be added or removed as required.
• Dealing with different types of attributes- The big data analysis approach deals with data that includes pictures, videos, PDF files, text files and many others [4].
• High dimensionality- Both Hadoop related analysis and swarm intelligence are able to deal with huge amounts of data and thus perform optimization for fast analysis.
• Robustness- Both Hadoop big data analysis and swarm inspired big data analysis allow the data to be modified at runtime.
• Highly effective- The two approaches focus on optimized analysis so that the results are effective, fast and reliable.

Figure 4. The common attributes offered by swarm intelligence and Hadoop big data analysis.

Advantages of Hadoop and swarm intelligence inspired big data analysis [2]:
• Computational efficiency- The availability of multiple processors in a swarm, and of commodity hardware in Hadoop, reduces the computational overhead.
• Reliability- Both swarm and Hadoop analysis operate as a continuous group, which provides decentralised control, shared sensor/analytics data and no single point of failure.
• Low-cost- The simple design of swarm intelligence requires less hardware and is ready for mass production. Likewise, Hadoop related big data analysis provides an easy, cost effective solution for storing huge amounts of data and thereby enabling its analysis.

VII. BIG DATA ANALYSIS USING TRADITIONAL CI AND HADOOP STORAGE FRAMEWORK
The two approaches may be compared through the following examples:
• Optimization inspired by the evolution process of a bacterial colony, and the Hadoop cluster- A new swarm intelligence technique called bacterial colony optimization (BCO) is considered. Its problem space is huge because of its evolutionary properties, similar to the way commodity hardware is scaled in Hadoop to provide availability and scalability to the system of computers [6].
• Support vector machines using nonlinear kernels on Hadoop Mahout, and kernel methods for trees and graphs through neural networks- The four major challenges of big data, i.e. volume, velocity, variety and veracity, are the targets of big data mining. They can be addressed via the Hadoop ecosystem and swarm intelligent techniques. Here a harmonic cryptosystem with secured multiparty computation of system matrix operations has been shown to yield high privacy preservation while data miners perform information retrieval from big data [7]. Neural networks are applied to structured data to mine useful data, using a recurrent network for the analysis [8].
• Distributed data clustering algorithms- Clustering is one of the major requirements for analyzing voluminous data, with applications in pattern recognition, data mining, bioinformatics and recommender engines [9]. The basic artificial intelligence algorithms for computational intelligence, such as K-means, fuzzy K-means, Dirichlet and latent Dirichlet allocation, are considered for cloud computing environments, i.e. Hadoop and Granules. These algorithms have been shown to give successful results through swarm intelligent techniques.
• Team collaboration and transactive memory on swarm intelligence and through Hadoop- Swarm intelligence describes the collective behavior that emerges from a group of socially interactive insects or animals [3]. Comparable collaborative behavior is observed in collaborative filtering with Mahout on Hadoop [10].

Attributes | Analysis using traditional computational intelligence methods | Analysis using computational intelligent algorithms through Hadoop
--- | --- | ---
Huge data sets | Slow | Fast
Small data sets | Fast | Slow
Tools | Matlab, Weka, Network Simulator | Hive, Pig, Mahout
Multitasking, parallel processing | As per the algorithm used | Yes, using commodity hardware
Volume | Biologically inspired techniques, e.g. bacteria colony optimization | CI inspired algorithms using the Hadoop storage framework
Knowledge extraction | Using currently available datasets | Machine learning and training for empirical analysis
Data | Mostly static | Dynamic and robust
Example | Data mining using swarm intelligence (artificial intelligence) | Collaborative filtering for terabytes and petabytes of data generated online (e.g. spam filtering in Gmail)

Table 1: Comparison of analyzing big data using traditional CI techniques and through Hadoop storage framework.

VIII. CONCLUSION AND FUTURE SCOPE
The Hadoop environment and computational intelligence using various artificial methods such as “artificial intelligence”, “bacteria colony optimization” and “ant colony optimization” are closely related for big data analysis. Big data analysis using Hadoop is itself nature inspired and is an effective method for analyzing and mining large volumes of data for useful information. Big data analysis can be optimized by taking advantage of various existing algorithms from swarm intelligence and artificial intelligence, incorporating efficient machine learning for better understanding. This is used for training machines and for carrying out predictive analysis, collaborative filtering and the building of empirical statistical predictive models.

REFERENCES
[1] Bharne P.K., Gulhane V.S., Yewale S.K., “Data Clustering Algorithms based on Swarm Intelligence”, IEEE, 2011.
[2] Yan-fei Zhu, Xiong-min Tang, “Overview of Swarm Intelligence”, International Conference on Computer Application and System Modelling (ICCASM 2010), IEEE, 2010.
[3] L.L. Ji, Y.H. Jin, “Team Collaboration and Transactive Memory System on Swarm Intelligence”, IEEE, 2010.
[4] Esteves R.M., Rong C., “Using Mahout for Clustering Wikipedia’s Latest Articles”, Third IEEE International Conference on Cloud Computing Technology and Science, IEEE, 2011.
[5] Xavier, R.S; Natural Computing Lab-LCoN, Mackenzie Presbyterian University; Sao Paulo, Brazil, Omar N.; de Castro
[6] Li Ming, “A Novel Swarm Intelligence Optimization Inspired by Evolution Process of a Bacterial Colony”, Proceedings of the 10th World Congress on Intelligent Control and Automation, Beijing, China, July 6-8, 2012.
[7] Sin G. Teo, Shuguo Han, Vincent C.S. Lee, “Privacy Preserving Support Vector Machine using Non-Linear Kernels on Hadoop Mahout”, 16th International Conference on Computational Science and Engineering, IEEE, 2013.
[8] Giovanni Da San Martino and Alessandro Sperduti, “Mining Structured Data”, Computational Intelligence Magazine, IEEE, 2010.
[9] Kathleen Ericson and Shrideep Pallickara, “On the Performance of Distributed Data Clustering Algorithms in the File and Streaming Processing Systems”, Fourth IEEE International Conference on Utility and Cloud Computing, IEEE, 2011.
[10] Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman, “Mahout in Action”, Manning Publications Co., 2012.
[11] Yuri Demchenko, “Overview NIST Big Data Working Group Activities and Big Data Architecture Framework (BDAF) by UvA”, 17 September 2013, 2nd RDA Plenary.
[12] Rasmus Wegener and Velu Sinha, “The Value of Big Data: How Analytics Differentiates Winners”, Bain and Company.
[13] Yaochu Jin, Barbara Hammer, “Computational Intelligence in Big Data”, IEEE Computational Intelligence Magazine, August 2014.
[14] Huijse et al., “Computational Intelligence Challenges and Applications on Large Scale Astronomical Time Series Databases”, IEEE, 2013.
[15] Giovanni Da San Martino and Alessandro Sperduti, Italy, “Mining Structured Data”.
[16] Chuck Lam, “Hadoop in Action”, Manning, Greenwich, 2011.