International Journal of Engineering Research and Technology. ISSN 0974-3154 Volume 8, Number 1 (2015)
© International Research Publication House • http://www.irphouse.com
Evaluating WEKA over the Open Source Web Data Mining Tools
Dr. Arvind K. Sharma
Dept. of Computer Science
University of Kota, India
[email protected]
Shubhra Saxena
Dept. of Computer Science
SKIT, Jaipur, India
[email protected]
Dr. Anubhav Kumar
HOD CSE, Dept. of CSE
Lingaya’s Group
[email protected]
Mahendra Beniwal
Dept. of Computer Science
SKIT, Jaipur, India
[email protected]
Abstract--Today, huge amounts of data and information are
available to everyone. Data can now be stored in many different
kinds of databases and information repositories, besides being
available on the Internet or in printed form. With such amounts of
data, there is a need for powerful tools and techniques that can
interpret data beyond the human capacity for comprehension and
thereby support better decision making. In order to evaluate the
best tools for running the data mining algorithms that help in
decision making, this paper discusses a methodology for evaluating
the freely available data mining tools and software packages. The
five best open source data mining tools, namely Orange, Tanagra,
Rapid Miner, Weka and KNIME, are presented on the basis of the
literature, and various data mining methods have been performed.
KEYWORDS: Data Mining Tools, Orange, Tanagra, Rapid Miner,
KNIME & Weka
I. INTRODUCTION
Nowadays databases and data repositories contain so much data
and information that it becomes almost impossible to analyze
them manually for valuable decision-making. Humans therefore
need assistance in their analysis capacity; they need data
mining and its applications [2]. This requirement has generated
an urgent need for automated tools that can assist us in
transforming those vast amounts of data into useful information
and knowledge.
Data mining is the process of discovering interesting
knowledge from large amounts of data stored in databases,
data warehouses, or other information repositories [3]. Data
mining involves an integration of techniques from multiple
disciplines such as database and data warehousing
technology, statistics, machine learning, high-performance
computing, pattern recognition, neural networks, data
visualization, information retrieval, image and signal
processing, and spatial or temporal data analysis [1]. Data
mining has many application fields such as marketing,
business, science and engineering, economics, games and
bioinformatics. A decade ago a new research domain based
on collecting huge amounts of data from the web appeared,
which is called web mining. Web mining follows the same
knowledge discovery from databases (KDD) process steps
as data mining [4]. However, it introduces processes which
are unique to this kind of data. Fig.1 shows the
relationship between data mining and web data mining.
[Figure 1 labels: Data Mining; Web Mining; Preprocessing Task for other kind of Data; Preprocessing Task for Web Data; Data Mining Algorithms; KDD]
Fig.1: Web Mining is a part of Data Mining
The global consumer browses the web and makes decisions. At
the same time, the number of web sites from government
institutions, libraries and individuals has increased
tremendously over the past few years. The Internet allows
two-way communication. This virtual communication between
web sites and users generates huge amounts of Internet data,
which is available online, offline, or in different electronic
forms such as newsletters.
National Conference on Interdisciplinary Research In Science & Technology (NCIRST- 2015)
Lingaya’s G V K S IMT, Faridabad [Page No. 128]
The rest of the paper is organized as follows: Section II
summarizes related work on data mining and web data
mining tools. Section III describes the methodology and the
tools and software under test. Section IV reports the working
of the Weka software with its experimental evaluation. Finally,
the conclusion is given in the last section of the paper.
II. RELATED WORKS
The quest for patterns in data has been studied for a long
time in many fields, including statistics, pattern recognition
and exploratory data analysis [5]. Analyzing data can
provide further knowledge about a business by going
beyond the data explicitly stored to derive knowledge about
the business. This is where data mining has obvious benefits
for any enterprise. Data mining is an approach currently
receiving great attention and is being recognized as a newly
emerging analysis tool [6]. Recently, data mining has received
a great deal of attention in the information industry and in
society as a whole. This is due to the wide availability of
huge amounts of data and the pressing need to turn such data
into useful information and knowledge [7]. Data mining
problems are generally categorized as clustering, association,
classification and prediction [8].
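To make the classification category concrete, the following Python sketch implements the two simplest classifiers named in this paper's conclusion, Zero Rule (ZeroR) and One Rule (OneR). It is only an illustration under stated assumptions: the function names, the toy weather-style data and the handling of unseen values are hypothetical, only nominal attributes are handled, and the code is not the implementation shipped with WEKA or any other toolkit discussed here.

```python
from collections import Counter, defaultdict


def zero_r(labels):
    """ZeroR: always predict the most frequent class seen in training."""
    return Counter(labels).most_common(1)[0][0]


def one_r(rows, labels):
    """OneR: choose the single attribute whose value -> majority-class
    rule makes the fewest errors on the training data."""
    best = None
    for attr in range(len(rows[0])):
        # Count the class labels observed for each value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # Map each attribute value to its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[attr]] != label
                     for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    _, attr, rule = best
    return attr, rule


def predict_one_r(model, row, default):
    """Apply a OneR model; fall back to `default` for unseen values."""
    attr, rule = model
    return rule.get(row[attr], default)


# Hypothetical toy data: (outlook, windy) -> play
ROWS = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"),
        ("rainy", "yes"), ("overcast", "no"), ("overcast", "yes")]
LABELS = ["yes", "yes", "no", "no", "yes", "yes"]

baseline = zero_r(LABELS)
model = one_r(ROWS, LABELS)
```

On this toy data ZeroR predicts the majority class regardless of the input, while OneR selects the first attribute because its rule makes no training errors. Baselines of this kind are mainly useful as a reference point against which more elaborate classifiers can be compared.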
III. PROPOSED METHODOLOGY
A. DATA MINING TOOLS
Data mining tools provide a user-friendly interface for carrying
out automated data analysis tasks. The five best open source
data mining tools, namely Orange, Tanagra, Rapid Miner,
Weka and KNIME, have been investigated on the basis of the
literature, and many data mining methods have been
performed. These data mining tools and their characteristics
are explained in this section.
3.1 Characteristics of Data Mining Tools
The main characteristics of data mining tools are as follows:
• Ability to Handle Complicated Problems: The aim of a
data mining tool is to automatically discover useful
information even from complex data sets. Data mining
algorithms perform knowledge discovery and are used for
prediction and for finding patterns even in complex data.
• Automated Discovery of Unknown Patterns: Data
mining automates the process of finding predictive
patterns in large databases. Pattern discovery helps
to detect fraud and errors in transactions, which is
the main task of evaluation.
• Scalability: Data mining tools can handle large amounts
of data, which makes scalability one of their important
features.
• Relatively High Cost: Open source data mining tools are
inexpensive to obtain, but using them is still somewhat
more costly than using other software, because users
incur overhead costs such as data preparation, analysis
and training, which are relatively high.
• Technical Skill Required: Technical skill is required
of data mining software users. The user must know
many data mining algorithms in order to choose the
appropriate algorithm for the task at hand. Skill is
also required to find patterns of interest and to
evaluate the results [4].
3.2 ORANGE
Orange is a component-based data mining and machine learning
software suite that features a friendly yet powerful, fast and
versatile visual programming front-end for explorative data
analysis and visualization, together with Python bindings and
libraries for scripting. It contains a complete set of components
for data preprocessing, feature scoring and filtering, modeling,
model evaluation, and exploration techniques. It is written in
C++ and Python [9], and its graphical user interface is based on
the cross-platform Qt framework.
3.3 TANAGRA
TANAGRA is free data mining software for academic and
research purposes [10]. It offers several data mining
methods from the areas of exploratory data analysis, statistical
learning, machine learning and databases. It runs under most
Windows systems; it has been tested under
Windows 98, 2000, XP, Vista and Windows 7.
3.4 RAPID MINER
Rapid Miner, formerly called YALE (Yet Another Learning
Environment), is an environment for machine learning and
data mining experiments that is used for both research and
real-world data mining tasks [11]. It enables experiments to
be made up of a huge number of arbitrarily nestable
operators, which are described in XML files and are created with
the graphical user interface of Rapid Miner. Rapid Miner
provides more than 500 operators for all main machine
learning procedures, and it also incorporates the learning
schemes and attribute evaluators of the Weka learning
environment. It is available as a stand-alone tool for data
analysis and as a data mining engine that can be integrated
into other products.
3.5 KNIME
KNIME (Konstanz Information Miner) is a user-friendly,
intelligible, and comprehensive open-source data integration,
processing, analysis, and exploration platform [12]. It gives
users the ability to visually create data flows or pipelines,
selectively execute some or all analysis steps, and later
study the results, models, and interactive views. KNIME is
written in Java, is based on Eclipse, and makes use of the
Eclipse extension mechanism to support plugins that provide
additional functionality. Through plugins, users can add
modules for text, image, and time series processing and
integrate various other open source projects, such as the R
programming language, Weka, and LibSVM.
3.6 WEKA
Written in Java, Weka (Waikato Environment for Knowledge
Analysis) is a well-known suite of machine learning software
[13] that supports several typical data mining tasks,
particularly data preprocessing, clustering, classification,
regression, visualization, and feature selection. Its techniques
are based on the assumption that the data is available as a
single flat file or relation, where each data point is described
by a fixed number of attributes. Weka provides access to SQL
databases through Java Database Connectivity (JDBC) and
can process the result returned by a database query. Its main
user interface is the Explorer, but the same functionality can
be accessed from the command line or through the
component-based Knowledge Flow interface.
IV. EVALUATING WEKA
WEKA (Waikato Environment for Knowledge Analysis) is
a collection of state-of-the-art machine learning algorithms
and data preprocessing tools written in Java, developed at
the University of Waikato, New Zealand. It is free software
that runs on almost any platform and is available under the
GNU General Public License. It has a wide range of
applications in various data mining techniques. It provides
extensive support for the entire process of experimental data
mining, including preparing the input data, evaluating
learning schemes statistically, and visualizing the input data
and the result of learning. The WEKA workbench includes
methods for the main data mining problems: regression,
classification, clustering, association rule mining, and
attribute selection. It can be used through either of the
following two interfaces:
• Command Line Interface (CLI)
• Graphical User Interface (GUI)
The Weka GUI Chooser (class weka.gui.GUIChooser)
provides a starting point for launching Weka’s main GUI
applications and supporting tools. If one prefers an MDI
(Multiple Document Interface) appearance, this is
provided by an alternative launcher called “Main” (class
weka.gui.Main). The GUI Chooser consists of four buttons,
one for each of the four major Weka applications, and four
menus [13]. The WEKA GUI Chooser appears like this:
Fig.2: User Interface of WEKA
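Before turning to the individual applications, it may help to see what the single flat relation described in Section 3.6 looks like in practice. WEKA commonly stores such relations in ARFF text files, with an `@relation` line, one `@attribute` line per attribute, and the data rows after `@data`. The Python sketch below reads a small ARFF document into memory. It is a minimal sketch under stated assumptions, not WEKA's own loader: the function name `parse_arff` and the sample relation are hypothetical, and quoted attribute names, sparse rows and typed value conversion are deliberately left out.

```python
import csv


def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []   # (name, type) pairs in declaration order
    rows = []         # each row is a list of strings, one per attribute
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        if in_data:
            # Data rows are comma-separated, one value per attribute.
            values = next(csv.reader([line], skipinitialspace=True))
            if len(values) != len(attributes):
                raise ValueError(
                    f"expected {len(attributes)} values: {line!r}")
            rows.append(values)
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# Hypothetical toy relation used only for illustration.
SAMPLE = """\
% hypothetical toy relation
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny, 85, no
overcast, 83, yes
rainy, 70, yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
```

Every row carries exactly one value per declared attribute, which is the fixed-attribute, flat-relation assumption stated in Section 3.6. Values are kept as strings here; converting `numeric` columns would be a natural next step.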
The buttons can be used to start the following applications:
• Explorer - An environment for exploring data with
WEKA.
Fig.3: WEKA Knowledge Explorer
• Experimenter - An environment for performing experiments
and conducting statistical tests between learning schemes.
Fig.4: WEKA Experimenter Environment
• Knowledge Flow - This environment supports essentially the
same functions as the Explorer but with a drag-and-drop
interface. One of its merits is that it supports incremental
learning.
Fig.5: WEKA Knowledge Flow Environment
• Simple CLI - It provides a simple command line interface
that allows direct execution of Weka commands for operating
systems that do not provide their own command line
interface [14].
Fig.6: Command Line Interface of WEKA
The menu contains the following sections:
1. Program
Fig.7: Weka Program Menu
LogWindow – It opens a log window that captures everything
that is printed to stdout or stderr. This is useful for
environments like MS Windows, where WEKA is normally not
started from a terminal [14].
Exit – It closes WEKA.
2. Tools
Fig.8: Weka Tools Menu
Some of the useful applications are as follows:
• ArffViewer - An MDI application for viewing ARFF files
in spreadsheet format.
• SqlViewer - It represents an SQL worksheet for querying
databases via JDBC.
• Bayes Net editor - An application for editing, visualizing
and learning Bayes nets.
3. Visualization: Different ways of visualization.
Fig.9: Weka Visualization
• Plot - It is used for plotting a 2D plot of a dataset.
• ROC - It displays a previously saved ROC curve.
• TreeVisualizer - It displays directed graphs, e.g., a decision
tree.
• GraphVisualizer - It visualizes XML BIF or DOT format
graphs, e.g., Bayesian networks.
• BoundaryVisualizer - It allows the visualization of classifier
decision boundaries in two dimensions.
4. Help
Online resources for WEKA can be found here.
Fig.10: Help Menu
Weka Homepage – Opens a browser window with
WEKA’s homepage.
HOWTOs, code snippets, etc. – The general WekaWiki, which
contains many examples and HOWTOs around the development
and use of WEKA.
Weka on Sourceforge – WEKA’s project homepage on
Sourceforge.net.
SystemInfo – Lists some internals about the Java/WEKA
environment, e.g., the CLASSPATH.
V. CONCLUSION & FUTURE WORK
This paper has presented a methodology for evaluating the Weka
tool alongside four other open source data mining toolkits with
respect to different data mining algorithms. The five toolkits
discussed can be used to test six classification algorithms,
namely Naïve Bayes (NB), Decision Tree (C4.5), Support Vector
Machine (SVM), K-Nearest Neighbor (KNN), One Rule
(OneR), and Zero Rule (ZeroR). The paper concludes
that the WEKA toolkit is the best tool in terms of the ability
to run the selected classifiers, followed by Orange, Tanagra,
and finally KNIME. In future research, we plan to test the
selected data mining tools on other machine learning tasks,
such as clustering, using test data sets designed for such
tasks and the known algorithms for clustering and association.
REFERENCES
[1] Han, J., Kamber, M., Pei, J., Data Mining: Concepts and Techniques.
San Francisco, CA: Morgan Kaufmann Publishers, 2011.
[2] Goebel, M., Gruenwald, L., A survey of data mining and knowledge
discovery software tools, ACM SIGKDD Explorations Newsletter,
v.1 n.1, p.20-33, June 1999.
[3] Abdullah H. Wahbeh, Qasem A., et al., A Comparison Study between
Data Mining Tools over some Classification Methods, (IJACSA)
International Journal of Advanced Computer Science and
Applications, Special Issue on Artificial Intelligence.
[4] Arvind Sharma, P.C. Gupta, Predicting the Number of Blood Donors
through their Age and Blood Group by using Data Mining Tool,
International Journal of Communication and Computer Technologies,
Volume 01 – No.6, Issue: 02 September 2012.
[5] De Mantaras & Armengol E. (1998), "Machine learning from
example: Inductive and Lazy methods", Data & Knowledge
Engineering 25: 99-123.
[6] Tso, G.K.F. and K.K.W. Yau, "Predicting electricity energy
consumption: A comparison of regression analysis, decision tree and
neural networks". Energy, 2007. 32: p. 1761-1768.
[7] Han, J. and M. Kamber, "Data Mining: Concepts and Techniques",
San Francisco: Morgan Kaufmann Publisher, 2006.
[8] Chien, C.F. and L.F. Chen, "Data mining to improve personnel
selection and enhance human capital: A case study in high-technology
industry", Expert Systems with Applications, 2008. 34(1): p. 280-290.
[9] http://orange.biolab.si/
[10] http://eric.univlyon2.fr/~ricco/tanagra/en/tanagra.html
[11] https://rapidminer.com/
[12] http://toolkit.snd.org/tools/other/knime/
[13] http://www.cs.waikato.ac.nz/ml/weka/
[14] Ankit Bhardwaj, Arvind Sharma, V.K. Shrivastava, Data Mining
Techniques and Their Implementation in Blood Bank Sector–A
Review, International Journal of Engineering Research and
Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2,
Issue4, July-August 2012, pp.1303-1309.