Big Data Frameworks: At a Glance

Rajendra Kumar Shukla, Pooja Pandey, Vinod Kumar
School of IT, MATS University
[email protected], [email protected], [email protected]

Abstract
In the modern era of information technology, the use of IT tools and techniques has increased exponentially in almost every business organization, enterprise, company and government organization. As a result, the rate at which data is generated has also grown exponentially. This huge amount of data, with properties such as variety, volume, velocity, complexity, variability and value, has led to the concept of big data. Traditional frameworks, tools and techniques are not capable of handling it; new frameworks, operating systems, warehouse tools and analysis techniques are required to address big data issues. This paper focuses on the frameworks available in the big data environment.

Keywords: Big Data Framework, Big Data, Big Data Tools

Introduction
Big data is at the center of modern science and business. It comprises billions of records about millions of people, including web sales, social media, audio, search queries, business records, social networking, science data, mobile phones and their applications, and so on. These huge amounts of data can be managed or unmanaged.

Big data characteristics
Data is collected in huge quantities from a variety of sources in different forms, and is characterized by four parameters: variety, volume, velocity and variability.

Variety: Data comes from a variety of sources and can be of structured, unstructured or semi-structured type. It includes sensor data, video, text, audio, web log files and so on.

Volume: Volume represents how large the data is. The volume of data today exceeds petabytes, and the scale and growth of data outstrip traditional storage and analysis techniques.

Velocity: Velocity defines the pace at which data arrives and the rate at which it streams.

Variability: Variability applies to the state or validity of the data in relation to time: how quickly a huge volume of data, coming from a number of varied sources at a high speed, could change. This also refers to data in motion versus data at rest. The variability in the state of data comes into play when it is important to take a point-in-time snapshot of the data for further processing and decision making.

Fig. 1: Characteristics of big data

Big data frameworks
In computer systems, a framework is often a layered structure indicating what kinds of programs can or should be built and how they interrelate. A big data framework is the set of functions, or the structure, that defines how to perform the processing, manipulation and representation of big data; it handles structured, unstructured and semi-structured data. The big data framework in Fig. 2 represents the different layers and their functionalities. Data acquisition from different sources (enterprises, government organizations, business organizations, telecom industries and social networking sites) is carried out in the first step, where activities such as data processing, data cleaning, integration and normalization are performed. In the second stage, the data repository, the preprocessed data is stored for further analysis and visualization, in order to gain deeper insight and extract valuable information.

Fig. 2: Big data frameworks

Terminology used in big data frameworks

Availability
In big data frameworks, open source approaches have the greatest momentum: the most extensive acceptance and the fastest pace of innovation. Open source platforms are expanding their footprint in advanced analytics.
Operating system
Most organizations prefer operating systems such as Windows and Linux, but some frameworks can also work with BSD and OS X.

Platform
The widely used big data language platforms are as follows.

Pig is a platform for data analysis that uses a textual language known as Pig Latin and compiles it into sequences of MapReduce programs. It makes it easier to write, understand and maintain programs that analyze data in parallel. Operating system: operating-system independent.

R, developed at Bell Laboratories, is a programming language and environment for statistical computing and graphics, similar to S. The environment includes a set of tools that make it easier to perform calculations, manipulate data and generate charts and graphs. Operating systems: OS X, Linux, Windows.

ECL ("Enterprise Control Language") is the language for working with HPCC. A complete set of tools, including a debugger and an IDE, is included with HPCC, and documentation is available on the HPCC website. Operating system: Linux.

Processing environment
Many enterprises face a flood of new data arriving in several different variants. Big data has the ability to deliver insights that can help any business organization, and it has given rise to an entirely new industry of supporting architectures such as MapReduce. MapReduce is a programming framework for distributed computing, created by Google, that uses the divide-and-conquer method to decompose complex big data problems into small units of work and process them in parallel. MapReduce can be divided into two stages.

Map step: The input data at the master node is sliced into many smaller sub-problems. A worker node processes some subset of these smaller problems under the control of the JobTracker node and stores the result in the local file system, where a reducer can access it.

Reduce step: This step merges and analyzes the intermediate data produced by the map steps. Reduce tasks aggregate the intermediate results and can themselves be run in parallel; they are executed on the worker nodes under the control of the JobTracker.

Brief introduction to various big data frameworks

1. Apache Spark - Spark is a very fast in-memory data-processing framework, reported to be up to 100x faster than Hadoop. As the volume and velocity of data gathered from web and mobile applications rapidly increase, it is critical that the speed of data processing and analysis stay at least a step ahead in order to support today's big data applications and end-user expectations. Spark offers the competitive benefit of high-velocity analytics through stream processing of large amounts of data, versus the traditionally more "batch-oriented" approach to data processing seen with Hadoop.

2. GraphLab - The GraphLab PowerGraph academic project was started in 2009 at Carnegie Mellon University to develop a new parallel computation abstraction tailored to machine learning. GraphLab PowerGraph 1.0 employed a shared-memory design; in GraphLab PowerGraph 2.1, the framework was redesigned to target the distributed environment.

3. HPCC System - LexisNexis is positioning HPCC (High Performance Computing Cluster) as a competitor to Apache Hadoop, the open source software framework for big data processing and analytics. The entry of LexisNexis and HPCC into the big data ecosystem is yet another validation of the big data space and should spur innovation from all parties: HPCC, Hadoop and others. Whether HPCC is a viable competitor to Hadoop for big data dominance is another question. LexisNexis, which has vast experience in collecting and processing large volumes of media and organizational data, certainly thinks it is. The answer depends on a number of factors, most of which are not yet clear.
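The two MapReduce stages described above can be sketched in plain Python. This is a single-process illustration of the programming model only, not Hadoop's actual API; the function names are invented for illustration, and the "shuffle" grouping that a real framework performs across the network is done here with a simple dictionary:

```python
from collections import defaultdict

def map_step(document):
    """Map: emit (word, 1) pairs from one input split."""
    for word in document.split():
        yield word.lower(), 1

def reduce_step(word, counts):
    """Reduce: merge all intermediate values for one key."""
    return word, sum(counts)

def map_reduce(documents):
    """Drive the two stages: map each split, group by key, then reduce."""
    # Shuffle: group intermediate pairs by key (the framework does this
    # across worker nodes in Hadoop; here it is a local dictionary).
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_step(doc):
            grouped[word].append(count)
    # Reduce each key independently (each key could go to a different worker).
    return dict(reduce_step(w, c) for w, c in grouped.items())

result = map_reduce(["big data big frameworks", "data frameworks"])
# result counts each word across both input splits
```

In a real framework such as Hadoop, the map calls run in parallel across worker nodes, the shuffle is performed by the framework itself, and the reduce tasks run on whichever workers the JobTracker assigns.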
4. Dryad - Dryad is an infrastructure that allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use a large number of machines, each with multiple cores or processors, without knowing anything about concurrent programming.

5. Apache Flink - Flink exploits in-memory data streaming and integrates iterative processing deeply into the system runtime. This makes the system extremely fast for data-intensive and iterative jobs, and Flink is also designed to perform well when memory runs out. Flink contains its own serialization framework, type inference engine and memory management component.

6. Storm - Storm is a free and open source distributed real-time computation system. Storm reliably processes unbounded streams of data for real-time processing. Storm is simple, can be used with any programming language, and is great fun to use!

7. r³ - r³ (redistribute, reduce, reuse) is a MapReduce engine written in Python using a Redis backend. Its purpose is to be simple: r³ has only three concepts to grasp: input streams, mappers and reducers.

8. Disco - Disco is a lightweight, open-source framework for distributed computing based on the MapReduce paradigm. Disco is powerful and easy to use, thanks to Python. Disco distributes and replicates your data and runs jobs efficiently, and it even includes the tools you need to index billions of data points and query them in real time.

9. Phoenix - Phoenix is a relational database layer for Apache HBase. It is a query engine that transforms SQL queries into native HBase API calls and pushes as much work as possible onto the cluster for parallel execution. It is a high-performance, horizontally scalable data store engine for big data, suitable as the store of record for mission-critical data.

10. Plasma - PlasmaFS is a distributed file system that implements large files in user space.
Plasma MapReduce runs the well-known map/reduce algorithm over large files on top of PlasmaFS, and Plasma KV is a key-value database built on top of PlasmaFS.

11. Peregrine - Peregrine is a framework designed for running iterative jobs across partitions of data. Peregrine is designed to be fast at executing MapReduce jobs by supporting a number of optimizations and features not present in other MapReduce frameworks.

12. HTTPMR - HTTPMR is an implementation of Google's MapReduce data processing model on clusters of HTTP servers. HTTPMR makes the following assumptions about the computing environment: machines can only be accessed via HTTP requests; requests are assigned randomly to a set of machines; requests have timeouts on the order of several seconds; there is a storage system accessible by code receiving HTTP requests; the data being processed can be broken up into many, many small records, each with a unique identifier; the storage system can accept range-restriction operations (>, <=) on the data's unique identifiers; and jobs are controlled by a web spidering system (such as wget).

13. Sector/Sphere - Sector is a high-performance, very scalable and secure distributed file system. Sphere is a high-performance parallel data processing engine that can process Sector data files on the storage nodes with very simple programming interfaces.

14. Misco - Misco is a distributed computing framework designed for mobile devices. Implemented in Python, Misco should be able to run on any system that supports Python and its networking libraries. As more and more people own mobile devices, and these devices are increasingly powerful, there has been an explosion in distributed applications: social networking applications keep users connected and updated with their family and friends, and monitoring applications help users avoid traffic congestion and plan routes.

15. MR-MPI - The MR-MPI library was developed at Sandia National Laboratories, a US Department of Energy laboratory. It is a C++ library with C interfaces callable from most high-level languages, along with a Python wrapper and the OINK scripting wrapper, which can be used to develop and chain MapReduce operations together. MR-MPI and OINK are open-source codes, distributed freely under the terms of the modified Berkeley Software Distribution (BSD) License.

Table 3: Top open source tools for big data

Framework | Availability | Processing | Developed by | Platform | Operating system
Apache Spark | Open source | Parallel data processing | Apache S/W Foundation | Scala, Java, Python | Linux, Mac OS, Windows
GraphLab | Open source | Parallel data processing | Carnegie Mellon University | C++ | Linux, Mac OS
HPCC System | Open source | Parallel processing | LexisNexis Risk Solutions | C++, ECL | Linux
Dryad | Microsoft Research project | Parallel processing | Microsoft Research | C# | Windows
Apache Flink | Open source | Distributed data processing | Apache S/W Foundation | Java and Scala | Linux, Mac OS, Windows
Storm | Open source | Distributed real-time computation | BackType | Java | Linux, Mac OS, Windows
r³ | Open source | MapReduce (requires a running Redis database) | - | Python | Any system that supports Python and its networking libraries
Disco | Open source | Distributed data processing | Nokia | Erlang core, Python | Linux, Mac OS X, FreeBSD
Phoenix | Open source | SQL layer over HBase | Apache S/W Foundation | Java | Cross-platform
Peregrine | Open source | Parallel processing | - | Java | Linux, Mac OS, Windows
Sector/Sphere | Open source (Apache 2.0 license) | Parallel data processing | - | C++ | Linux (server side only)
Misco | Open source | Parallel processing | Nokia Research Center | Python | Any system that supports Python and its networking libraries
MR-MPI | Open source | Parallel processing | Sandia National Laboratories | C++ and C | -
GridGain | Open source | Parallel processing | GridGain Systems | Java-based | -

Open source big data tools can also be grouped by category:

Big data analysis platforms and tools: Hadoop, MapReduce, HPCC, Storm, GridGain
Databases/data warehouses: Cassandra, HBase, MongoDB, Neo4j, CouchDB, OrientDB, Terrastore, FlockDB, Hibari, Riak, Hypertable, Hive, InfoBright Community Edition, Infinispan, Redis
Business intelligence: Talend, Jaspersoft, Palo BI Suite/Jedox, Pentaho, SpagoBI, KNIME, BIRT/Actuate
Data mining: RapidMiner/RapidAnalytics, Mahout, Orange, Weka, jHepWork, KEEL, SPMF, Rattle
File system: Hadoop Distributed File System
Programming languages: Pig/Pig Latin, R, ECL
Big data search: Lucene, Solr
Data aggregation and transfer: Sqoop, Flume, Chukwa
Miscellaneous big data tools: Zookeeper, Oozie, Avro, Terracotta

Table 2: New frameworks introduced in 2014

System | Configuration
Redshift | Amazon Redshift with default options.
Shark (disk) | Input and output tables are on disk, compressed with gzip. The OS buffer cache is cleared before each run.
Impala (disk) | Input and output tables are on disk, compressed with snappy. The OS buffer cache is cleared before each run.
Shark (mem) | Input and output tables are stored in the Spark cache.
Impala (mem) | Input tables are coerced into the OS buffer cache; output tables are on disk (Impala has no notion of a cached table).
Hive | Hive on HDP 2.0.6 with default options. Input and output tables are on disk, compressed with snappy. The OS buffer cache is cleared before each run.
Tez | Tez with the configuration parameters specified here. Input and output tables are on disk, compressed with snappy. The OS buffer cache is cleared before each run.

CONCLUSION
In this brief study of big data frameworks and tools, it has been found that various categories of frameworks and tools are available in the market, catering to different varieties of functionality for dealing with big data. It is recommended to use each of them in accordance with its key features to obtain an appropriate solution. It has also been observed that every framework aims to fulfill some specific user requirement; at present, no existing framework covers all kinds of big data requirements. The development of a universal framework that can tackle every kind of big data issue therefore remains an open area of research.

REFERENCES
[1] Stephen Kaisler, Frank Armour, J. Alberto Espinosa, William Money, "Big data: issues and challenges moving forward", 46th Hawaii International Conference on System Sciences, IEEE, 2013.
[2] Strozzi, Carlo, "NoSQL – A relational database management system", 2007–2010; http://www.strozzi.it/cgibin/CSA/tw7/I/en_US/nosql/Home%20Page.
[3] P. Xiang, R. Hou, and Z. Zhou, "Cache and consistency in NoSQL", 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), vol. 6, IEEE, 2010, pp. 117–120.
[4] http://nosql.findthebest.com/
[5] http://nosql-database.org/
[6] Frank, L., "Countermeasures against consistency anomalies in distributed integrated databases with relaxed ACID properties", International Conference on Innovations in Information Technology (IIT), pp. 266–270, 25-27 April 2011.
[7] G.
De Candia et al., "Dynamo: Amazon's Highly Available Key-Value Store", Proc. 21st ACM SIGOPS Symp. Operating Systems Principles (SOSP 07), 2007, pp. 205–220.
[8] A. Masudianpour, "An introduction to redis server, an advanced key value database", SlideShare, 9 Aug. 2013; www.slideshare.net/masudianpour/redis25088079.
[9] Jing Han, Haihong E., Guan Le, Jian Du, "Survey on NoSQL database", 6th International Conference on Pervasive Computing and Applications (ICPCA), pp. 363–366, 26-28 Oct. 2011.
[10] M. Stonebraker, "SQL databases v. NoSQL databases", Communications of the ACM, vol. 53, no. 4, pp. 10–11, 2010.
[11] Sachchidanand Singh, Nirmala Singh, "Big data analytics", IEEE International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, 2012.
[12] Sagiroglu, S., Sinanc, D., "Big Data: A review", International Conference on Collaboration Technologies and Systems (CTS), pp. 42–47, 20-24 May 2013.
[13] Wielki, J., "Implementation of the big data concept in organizations - possibilities, impediments and challenges", Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 985–989, 8-11 Sept. 2013.
[14] Segev, A., Chihoon Jung, Sukhwan Jung, "Analysis of technology trends based on big data", IEEE International Congress on Big Data (BigData Congress), pp. 419–420, June 27 - July 2, 2013.
[15] Katal, A., Wazid, M., Goudar, R.H., "Big data: issues, challenges, tools and good practices", Sixth International Conference on Contemporary Computing (IC3), pp. 404–409, 8-10 Aug. 2013.
[16] K. Kambatla, G. Kollias, V. Kumar, A. Grama, "Trends in big data analytics", J. Parallel Distrib. Comput.
(2014), http://dx.doi.org/10.1016/j.jpdc.2014.01.003.
[17] Sheth, Amit, "Transforming big data into smart data: deriving value via harnessing volume, variety, and velocity using semantic techniques and technologies", 2014 IEEE 30th International Conference on Data Engineering (ICDE), p. 2, March 31 - April 4, 2014.
[18] Saha, Barna, Srivastava, Divesh, "Data quality: the other face of big data", 2014 IEEE 30th International Conference on Data Engineering (ICDE), pp. 1294–1297, March 31 - April 4, 2014.
[19] http://www.datamation.com/data-center/50-top-opensource-tools-for-big-data-3.html [accessed 12.01.2015].
[20] Firat Tekiner and John A. Keane, "Big Data Framework", School of Computer Science, The University of Manchester, Manchester, UK; [email protected], [email protected].
[21] Seref Sagiroglu and Duygu Sinanc, "Big Data: A Review", Department of Computer Engineering, Faculty of Engineering, Gazi University, Ankara, Turkey; [email protected], [email protected].
[22] Shilpa and Manjit Kaur, "Big Data and Methodology – A Review", Computer Science and Engineering, LPU, Phagwara, India.