International Journal of Engineering Research and Technology. ISSN 0974-3154 Volume 8, Number 1 (2015)
© International Research Publication House • http://www.irphouse.com
National Conference on Interdisciplinary Research In Science & Technology (NCIRST-2015), Lingaya's G V K S IMT, Faridabad

Evaluating WEKA over the Open Source Web Data Mining Tools

Dr. Arvind K. Sharma, Dept. of Computer Science, University of Kota, India
Shubhra Saxena, Dept. of Computer Science, SKIT, Jaipur, India
Dr. Anubhav Kumar, HOD CSE, Dept. of CSE, Lingaya's Group
Mahendra Beniwal, Dept. of Computer Science, SKIT, Jaipur, India

Abstract--Today, a huge amount of data and information is available to everyone. Data can now be stored in many different kinds of databases and information repositories, besides being available on the Internet or in printed form. With such an amount of data, there is a need for powerful tools and techniques that can interpret data beyond the human capacity for comprehension and so support better decision making. In order to evaluate the best tools for running the data mining algorithms that help in decision making, this paper discusses a methodology for evaluating freely available data mining tools and software packages. Five leading open source data mining tools, namely Orange, Tanagra, Rapid Miner, Weka, and KNIME, are presented on the basis of the literature, and various data mining methods have been performed.

KEYWORDS--Data Mining Tools, Orange, Tanagra, Rapid Miner, KNIME & Weka

I. INTRODUCTION
Nowadays databases and data repositories contain so much data and information that it becomes almost impossible to analyze them manually for valuable decision making. Humans therefore need assistance in their analysis capacity; they need data mining and its applications [2]. This requirement has generated an urgent need for automated tools that can assist us in transforming those vast amounts of data into useful information and knowledge. Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories [3]. Data mining involves an integration of techniques from multiple disciplines such as database and data warehousing technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial or temporal data analysis [1]. Data mining has many application fields such as marketing, business, science and engineering, economics, games and bioinformatics.

A decade ago a new research domain based on collecting huge amounts of data from the web appeared, which is called web mining. Web mining follows the same knowledge discovery from databases (KDD) process steps as data mining [4]. However, it introduces processes which are unique to this kind of data. Fig. 1 shows the relationship between data mining and mining web data.

[Figure: Data Mining and Web Mining; preprocessing task for other kinds of data, preprocessing task for web data, data mining algorithms, KDD]
Fig. 1: Web Mining is a part of Data Mining

The global consumer browses the web and makes decisions. At the same time, the number of web sites from government institutions, libraries and individuals has increased tremendously over the past few years. The Internet allows two-way communication.
This virtual communication between websites and users generates huge amounts of Internet data, which is either online, offline, or in various electronic forms such as newsletters.

The rest of the paper is organized as follows: Section II summarizes related work on data mining and web data mining tools. Section III provides a general description of the methodology and of the tools and software under test. Section IV reports the working of the Weka software with its experimental evaluation. Finally, the conclusion is presented in the last section of the paper.

II. RELATED WORKS
The quest for patterns in data has been studied for a long time in many fields, including statistics, pattern recognition and exploratory data analysis [5]. Analyzing data can provide further knowledge about a business by going beyond the data explicitly stored. This is where data mining has obvious benefits for any enterprise. Data mining is an approach currently receiving great attention and is being recognized as a newly emerging analysis tool [6]. Recently, data mining has received a great deal of attention in the information industry and in society as a whole. This is due to the wide availability of huge amounts of data and the pressing need to turn such data into useful information and knowledge [7]. Data mining problems are generally categorized as clustering, association, classification and prediction [8].

III. PROPOSED METHODOLOGY
A. DATA MINING TOOLS
Data mining tools provide a user-friendly interface for carrying out automated data analysis tasks. Five leading open source data mining tools, namely Orange, Tanagra, Rapid Miner, Weka, and KNIME, have been investigated on the basis of the literature, and many data mining methods have been performed. These data mining tools and their characteristics are explained in this section.
3.1 Characteristics of Data Mining Tools
The main characteristics of data mining tools are as follows:
• Ability to Handle Complicated Problems: The aim of a data mining tool is to discover useful information automatically, even from complex data sets. Data mining algorithms perform knowledge discovery and are used for prediction and for searching for patterns, even in complex data.
• Automated Discovery of Unknown Patterns: Data mining automates the process of finding predictive patterns in large databases. Pattern discovery helps to detect fraud and errors in transactions, which is the main task of evaluation.
• Scalability: Data mining tools can handle large amounts of data, which makes scalability one of their important features.
• Relatively High Cost: Data mining software tools are not expensive in themselves, but data mining is still somewhat more expensive than other software, because users incur overhead costs such as data preparation, analysis and training, which are relatively high.
• Technical Skill Required: Technical skill is required of data mining software users. The user must know many data mining algorithms in order to choose an appropriate algorithm for the task requirements. Skills are also required to find patterns of interest and to evaluate the results [4].

3.2 ORANGE
Orange is a component-based data mining and machine learning software suite that features a friendly yet powerful, fast and versatile visual programming front-end for explorative data analysis and visualization, and Python bindings and libraries for scripting. It contains a complete set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration techniques. It is written in C++ and Python [9], and its graphical user interface is based on the cross-platform Qt framework.

3.3 TANAGRA
TANAGRA is free data mining software for academic and research purposes [10].
It offers several data mining methods from the areas of exploratory data analysis, statistical learning, machine learning and databases. It runs under most Windows systems; it has been tested under Windows 98, 2000, XP, Vista and Windows 7.

3.4 RAPID MINER
Rapid Miner, formerly called YALE (Yet Another Learning Environment), is an environment for machine learning and data mining experiments that is used for both research and real-world data mining tasks [11]. It enables experiments to be made up of a large number of arbitrarily nestable operators, which are described in XML files and are created with the graphical user interface of Rapid Miner. Rapid Miner provides more than 500 operators for all main machine learning procedures, and it also incorporates the learning schemes and attribute evaluators of the Weka learning environment. It is available as a stand-alone tool for data analysis and as a data mining engine that can be integrated into other products.

3.5 KNIME
KNIME (Konstanz Information Miner) is a user-friendly, intelligible, and comprehensive open-source data integration, processing, analysis, and exploration platform [12]. It gives users the ability to visually create data flows or pipelines, selectively execute some or all analysis steps, and later study the results, models, and interactive views. KNIME is written in Java and is based on Eclipse, whose extension mechanism it uses to support plugins that provide additional functionality. Through plugins, users can add modules for text, image, and time series processing, and can integrate various other open source projects such as the R programming language, Weka, and LibSVM.

3.6 WEKA
Written in Java, Weka (Waikato Environment for Knowledge Analysis) is a well-known suite of machine learning software [13] that supports several typical data mining tasks, particularly data preprocessing, clustering, classification, regression, visualization, and feature selection. Its techniques are based on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes. Weka provides access to SQL databases through Java Database Connectivity (JDBC) and can process the result returned by a database query.

IV. EVALUATING WEKA
WEKA (Waikato Environment for Knowledge Analysis) is a collection of state-of-the-art machine learning algorithms and data preprocessing tools written in Java, developed at the University of Waikato, New Zealand. It is free software that runs on almost any platform and is available under the GNU General Public License. It has a wide range of applications across data mining techniques. It provides extensive support for the entire process of experimental data mining, including preparing the input data, evaluating learning schemes statistically, and visualizing the input data and the result of learning. The WEKA workbench includes methods for the main data mining problems: regression, classification, clustering, association rule mining, and attribute selection. It can be used through either of the following two interfaces:
• Command Line Interface (CLI)
• Graphical User Interface (GUI)
The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI (Multiple Document Interface) appearance, this is provided by an alternative launcher called "Main" (class weka.gui.Main). The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus [13]. The WEKA GUI Chooser is shown in Fig. 2.
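Section 3.6 notes that Weka assumes the data is a single flat relation in which every data point has a fixed number of attributes. Weka's native ARFF file format reflects that assumption: a header declares the relation and its attributes, and every data row supplies one value per attribute. The following Python sketch is not part of the original study; it only illustrates the layout. The function name to_arff, the relation name, and the attribute names and values are hypothetical, and the sketch ignores details such as quoting of values that contain spaces.

```python
from io import StringIO


def to_arff(relation, attributes, rows):
    """Render a flat relation as ARFF text (illustrative sketch only).

    attributes: list of (name, kind) pairs, where kind is either the string
                "numeric" or a list of allowed nominal values.
    rows:       list of tuples holding exactly one value per attribute.
    """
    out = StringIO()
    out.write(f"@relation {relation}\n\n")
    for name, kind in attributes:
        if isinstance(kind, str):
            # Numeric (or other built-in) attribute type.
            out.write(f"@attribute {name} {kind}\n")
        else:
            # Nominal attribute: allowed values listed inside braces.
            out.write(f"@attribute {name} {{{','.join(kind)}}}\n")
    out.write("\n@data\n")
    for row in rows:
        # The flat-file assumption: a fixed number of values per instance.
        if len(row) != len(attributes):
            raise ValueError("each instance needs one value per attribute")
        out.write(",".join(str(value) for value in row) + "\n")
    return out.getvalue()


# Hypothetical toy data, chosen only for illustration.
ATTRIBUTES = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("play", ["yes", "no"]),
]
ROWS = [("sunny", 85, "no"), ("overcast", 83, "yes"), ("rainy", 70, "yes")]

arff_text = to_arff("weather_demo", ATTRIBUTES, ROWS)
print(arff_text)
```

Text produced this way could be saved with an .arff extension and opened in Weka, for example with the ArffViewer tool, although that step was not part of the evaluation reported here.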
Weka's main user interface is the Explorer, but the same functionality can be accessed from the command line or through the component-based Knowledge Flow interface.

Fig. 2: User Interface of WEKA

The buttons can be used to start the following applications:
• Explorer - An environment for exploring data with WEKA.
Fig. 3: WEKA Knowledge Explorer
• Experimenter - An environment for performing experiments and conducting statistical tests between learning schemes.
Fig. 4: WEKA Experimenter Environment
• Knowledge Flow - This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One of its merits is that it supports incremental learning.
Fig. 5: WEKA Knowledge Flow Environment
• Simple CLI - It provides a simple command line interface that allows direct execution of Weka commands on operating systems that do not provide their own command line interface [14].
Fig. 6: Command Line Interface of WEKA

The menu contains the following sections:

1. Program
Fig. 7: Weka Program Menu
• LogWindow - Opens a log window that captures everything printed to stdout or stderr. It is useful in environments like MS Windows, where WEKA is normally not started from a terminal [14].
• Exit - Closes WEKA.

2. Tools
Fig. 8: Weka Tools Menu
Some of the useful applications are as follows:
• ArffViewer - An MDI application for viewing ARFF files in spreadsheet format.
• SqlViewer - An SQL worksheet for querying databases via JDBC.
• Bayes Net editor - An application for editing, visualizing and learning Bayes nets.

3. Visualization
Different ways of visualization.
Fig. 9: Weka Visualization
• Plot - It is used for plotting a 2D plot of a dataset.
• ROC - It displays a previously saved ROC curve.
• TreeVisualizer - It displays directed graphs, e.g., a decision tree.
• GraphVisualizer - It visualizes XML BIF or DOT format graphs, e.g., Bayesian networks.
• BoundaryVisualizer - It allows the visualization of classifier decision boundaries in two dimensions.

4. Help
Online resources for WEKA can be found here.
Fig. 10: Help Menu
• Weka Homepage - Opens a browser window with WEKA's homepage.
• HOWTOs, code snippets, etc. - The general WekaWiki, which contains many examples and HOWTOs on the development and use of WEKA.
• Weka on Sourceforge - WEKA's project homepage on Sourceforge.net.
• SystemInfo - Lists some internals of the Java/WEKA environment, e.g., the CLASSPATH.

V. CONCLUSION & FUTURE WORK
This paper has presented a methodology for evaluating the Weka tool alongside four other open source data mining toolkits with respect to different data mining algorithms. The five toolkits discussed can be used to test six classification algorithms, namely Naïve Bayes (NB), Decision Tree (C4.5), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), One Rule (OneR), and Zero Rule (ZeroR). The paper concludes that the WEKA toolkit is the best tool in terms of the ability to run the selected classifiers, followed by Orange, Tanagra, and finally KNIME. In future research, we plan to test the selected data mining tools on other machine learning tasks, such as clustering, using test data sets designed for such tasks and the known algorithms for clustering and association.

REFERENCES
[1] Han, J., Kamber, M., Pei, J., Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann Publishers, 2011.
[2] Goebel, M., Gruenwald, L., A survey of data mining and knowledge discovery software tools, ACM SIGKDD Explorations Newsletter, v.1 n.1, p.20-33, June 1999.
[3] Abdullah H. Wahbeh, Qasem A., et al., A Comparison Study between Data Mining Tools over some Classification Methods, (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Artificial Intelligence.
[4] Arvind Sharma, P.C. Gupta, Predicting the Number of Blood Donors through their Age and Blood Group by using Data Mining Tool, International Journal of Communication and Computer Technologies, Volume 01 – No.6, Issue: 02, September 2012.
[5] De Mantaras & Armengol E. (1998), "Machine learning from example: Inductive and Lazy methods", Data & Knowledge Engineering 25: 99-123.
[6] Tso, G.K.F. and K.K.W. Yau, "Predicting electricity energy consumption: A comparison of regression analysis, decision tree and neural networks", Energy, 2007. 32: p. 1761-1768.
[7] Han, J. and M. Kamber, "Data Mining: Concepts and Techniques", San Francisco: Morgan Kaufmann Publishers, 2006.
[8] Chien, C.F. and L.F. Chen, "Data mining to improve personnel selection and enhance human capital: A case study in high-technology industry", Expert Systems with Applications, 2008. 34(1): p. 280-290.
[9] http://orange.biolab.si/
[10] http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
[11] https://rapidminer.com/
[12] http://toolkit.snd.org/tools/other/knime/
[13] http://www.cs.waikato.ac.nz/ml/weka/
[14] Ankit Bhardwaj, Arvind Sharma, V.K. Shrivastava, Data Mining Techniques and Their Implementation in Blood Bank Sector – A Review, International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 2, Issue 4, July-August 2012, pp. 1303-1309.