International Journal of Innovative and Emerging Research in Engineering
Volume 2, Issue 9, 2015
Available online at www.ijiere.com
e-ISSN: 2394-3343
p-ISSN: 2394-5494
Practical Approaches: A Survey on Data Mining
Practical Tools
Kanu Patel¹, Jayna Donga²
¹Assistant Professor, BVM Engineering College, V.V.Nagar, [email protected]
²Assistant Professor, MBICT, New V.V.Nagar, [email protected]
Abstract:
To build a powerful world, data analysis plays a very important role. The progress and application of data mining
algorithms requires the use of software tools. As the number of available tools continues to grow, the choice of the most
suitable tool becomes increasingly difficult. In this study, five efficient tools for analyzing patent documents were
tested: Weka, Orange, Tanagra, KNIME and the R programming language. All five tools analyze structured and
unstructured data alike. They all visualize the results achieved from clustering the text fields of patent documents
and either provide basic statistics graphs themselves or contain filters for performing them with other solutions. In
this paper we study and analyze the basics of all five of these tools.
Keywords: Weka, Orange, Tanagra, R Programming, Data Mining Tools
I. INTRODUCTION
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with
great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict
future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective
analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision
support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They
scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their
expectations.
Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented
rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be
integrated with new products and systems as they are brought on-line. When implemented on high performance client/server
or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as,
"Which clients are most likely to respond to my next promotional mailing, and why?"
This paper provides an introduction to the basic technologies of data mining. Examples of profitable applications
illustrate its relevance to today's business environment, as well as a basic description of how data warehouse architectures can
evolve to deliver the value of data mining to end users.
Data mining
Data mining involves six common classes of tasks [1]; a brief code sketch illustrating two of them follows the list:
 Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be
interesting, or data errors that require further investigation.
 Association rule learning (Dependency modelling) – Searches for relationships between variables. For example, a
supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can
determine which products are frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
 Clustering – is the task of discovering groups and structures in the data that are in some way or another "similar", without
using known structures in the data.
 Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail program might
attempt to classify an e-mail as "legitimate" or as "spam".
 Regression – attempts to find a function which models the data with the least error.
 Summarization – providing a more compact representation of the data set, including visualization and report generation.
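To make these task definitions concrete, the following minimal sketch (an illustrative addition using scikit-learn and its bundled iris data set, both assumptions rather than content from the paper) demonstrates two of the tasks, classification and clustering, in Python:

# Minimal sketch of two data mining tasks: classification and clustering.
# scikit-learn and the bundled iris data set are illustrative choices; any
# tabular data set with a class column could be substituted.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Classification: generalize known structure (class labels) to new data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("classification accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Clustering: discover groups in the data without using the known labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == k).sum()) for k in range(3)])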
The Foundations of Data Mining
Data mining techniques are the result of a long process of research and product development. This evolution began when
business data was first stored on computers, continued with improvements in data access, and more recently, generated
technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond
retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application
in the business community because it is supported by three technologies that are now sufficiently mature:
 Massive data collection
 Powerful multiprocessor computers
 Data mining algorithms
Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found
that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by second quarter of 1996. In some
industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can
now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody
techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable
tools that consistently outperform older statistical methods.
In the evolution from business data to business information, each new step has built upon the previous one. For example,
dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical
to data mining. From the user’s point of view, the four steps listed in Table 1 were revolutionary because they allowed new
business questions to be answered accurately and quickly.
Table 1. Steps in the Evolution of Data Mining.[2]

Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics
Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery
Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery
The core components of data mining technology have been under development for decades, in research areas such as statistics,
artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational
database engines and broad data integration efforts, makes these technologies practical for current data warehouse environments.
The Scope of Data Mining
Data mining derives its name from the similarities between searching for valuable business information in a large database —
for example, finding linked products in gigabytes of store scanner data — and mining a mountain for a vein of valuable ore.
Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where
the value resides. Given databases of sufficient size and quality, data mining technology can generate new business
opportunities by providing these capabilities:
 Automated prediction of trends and behaviors. Data mining automates the process of finding predictive
information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered
directly from the data — quickly. A typical example of a predictive problem is targeted marketing. Data mining uses
data on past promotional mailings to identify the targets most likely to maximize return on investment in future
mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.
 Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify
previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify
seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting
fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be
implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are
implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster
processing means that users can automatically experiment with more models to understand complex data. High speed makes it
practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.
Databases can be larger in both depth and breadth:
 More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due
to time constraints. Yet variables that are discarded because they seem unimportant may carry information about
unknown patterns. High performance data mining allows users to explore the full depth of a database, without
preselecting a subset of variables.
 More rows. Larger samples yield lower estimation errors and variance (see the note after this list), and allow users to
make inferences about small but important segments of a population.
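A brief quantitative note (an addition for clarity, not in the original text): for estimating a mean, the standard error shrinks with the square root of the number of rows,

\[ \mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}, \]

so quadrupling the number of rows roughly halves the estimation error, which is what makes reliable inferences about small population segments feasible.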
A recent Gartner Group Advanced Technology Research Note listed data mining and artificial intelligence at the top of the five
key technology areas that "will clearly have a major impact across a wide range of industries within the next 3 to 5 years."
Gartner also listed parallel architectures and data mining as two of the top 10 new technologies in which companies will invest
during the next 5 years. According to a recent Gartner HPC Research Note, "With the rapid advance in data capture,
transmission and storage, large-systems users will increasingly need to implement new and innovative ways to mine the
aftermarket value of their vast stores of detail data, employing MPP [massively parallel processing] systems to create new
sources of business advantage (0.9 probability)."
The most commonly used techniques in data mining are:
 Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural
networks in structure.
 Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the
classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and
Chi Square Automatic Interaction Detection (CHAID).
 Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural
selection in a design based on the concepts of evolution.
 Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes
of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor
technique.
 Rule induction: The extraction of useful if-then rules from data based on statistical significance.
Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively
small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and
OLAP platforms.
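As a concrete illustration of one technique from the list above, the following minimal sketch (illustrative only; the library, feature names and data values are assumptions, not material from the paper) applies the k-nearest neighbor method in Python with scikit-learn:

# Minimal sketch of the k-nearest neighbor technique: each record is
# classified from the classes of the k most similar records in a
# historical data set. Feature names and values are hypothetical.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny "historical" data set: [age, income in thousands] -> responded to mailing (1) or not (0).
history_X = np.array([[25, 30], [32, 45], [47, 80], [51, 95], [38, 60], [29, 38]])
history_y = np.array([0, 0, 1, 1, 1, 0])

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 most similar records
knn.fit(history_X, history_y)

new_customer = np.array([[45, 75]])
print("predicted response:", knn.predict(new_customer)[0])
print("class probabilities:", knn.predict_proba(new_customer)[0])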
The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.
Pre-processing
Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns
actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough
to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is
essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the
observations containing noise and those with missing data.
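A minimal pre-processing sketch in Python with pandas, following the KDD stages listed above; the file name and column names are hypothetical, and the cleaning rules are illustrative assumptions rather than prescriptions from the paper:

# Assemble and clean a target data set before mining: select relevant
# attributes, drop records with missing data, and filter obvious noise.
import pandas as pd

raw = pd.read_csv("customers.csv")                 # hypothetical extract from a data mart

# Selection: keep only the attributes relevant to the mining task.
target = raw[["age", "income", "purchases", "responded"]]

# Cleaning: remove observations with missing data ...
target = target.dropna()

# ... and crude noise removal, e.g. implausible ages or negative purchase counts.
target = target[(target["age"].between(18, 100)) & (target["purchases"] >= 0)]

target.to_csv("customers_clean.csv", index=False)  # ready for the data mining step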
II. PRACTICAL TOOLS
To support medical data mining and exploratory analysis, a modern data mining suite should provide an easy-to-use
interface for physicians and biomedical researchers that is well supported with data and model visualizations, offers data
analysis tools to accommodate interactive search for any interesting data patterns, and allows interactive exploration of inferred
models.
Data Mining Software
 Weka - an open-source software for data mining
 RapidMiner - an open-source system for data and text mining
 KNIME - an open-source data integration, processing, analysis, and exploration platform
 The Mahout machine learning library - mining large data sets. It supports recommendation mining, clustering,
classification and frequent itemset mining.
 Rattle - a GUI for data mining using R
III. TOOLS IMPLEMENTATION
Weka:
Weka (pronounced to rhyme with Mecca) is a workbench[2] that contains a collection of visualization tools and algorithms
for data analysis and predictive modeling, together with graphical user interfaces for easy access to these functions. The original
non-Java version of Weka was a TCL/TK front-end to (mostly third-party) modeling algorithms implemented in other
programming languages, plus data preprocessing utilities in C, and a Makefile-based system for running machine learning
experiments. This original version was primarily designed as a tool for analyzing data from agricultural domains,[2] but the
more recent fully Java-based version (Weka 3), for which development started in 1997, is now used in many different
application areas, in particular for educational purposes and research. Advantages of Weka include:
 Free availability under the GNU General Public License.
 Portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing
platform.
 A comprehensive collection of data preprocessing and modeling techniques.
 Ease of use due to its graphical user interfaces.
Weka supports several standard data mining tasks, more specifically, data preprocessing, clustering, classification,
regression, visualization, and feature selection. All of Weka's techniques are predicated on the assumption that the data is
available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally, numeric
or nominal attributes, but some other attribute types are also supported). Weka provides access to SQL databases using Java
Database Connectivity and can process the result returned by a database query. It is not capable of multi-relational data mining,
but there is separate software for converting a collection of linked database tables into a single table that is suitable for
processing using Weka.[3] Another important area that is currently not covered by the algorithms included in the Weka
distribution is sequence modeling.
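Weka can also be driven from the command line; the sketch below (an illustrative assumption: weka.jar is installed locally and an ARFF training file is available) runs the J48 decision tree learner from Python and prints Weka's textual output, which includes the induced tree and cross-validation statistics:

# Minimal sketch of invoking Weka's command-line interface from Python.
# Paths and file names are assumptions; weka.jar and an ARFF file must
# already exist on the local machine.
import subprocess

cmd = [
    "java", "-cp", "/opt/weka/weka.jar",   # hypothetical install location
    "weka.classifiers.trees.J48",          # J48 decision tree learner
    "-t", "iris.arff",                     # hypothetical ARFF training file
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)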
Fig 1. Weka Implementation Snapshot
Orange
Orange is an open-source software released under GPL and available for use on GitHub. Versions up to 3.0 include
core components in C++ with wrappers in Python. From version 3.0 onwards, Orange uses common Python open-source
libraries for scientific computing, such as numpy, scipy and scikit-learn, while its graphical user interface operates within
the cross-platform Qt framework.
The default installation includes a number of machine learning, preprocessing and data visualization algorithms in 6
widget sets (data, visualize, classify, regression, evaluate and unsupervised). Additional functionalities are available as
add-ons (bioinformatics, data fusion and text-mining).
Orange is supported on OS X, Windows and Linux and can also be installed from the Python Package Index repository
(pip install Orange). As of 2015 the stable version is 2.7, while 3.0 is available as a beta release.
Orange consists of a canvas interface onto which the user places widgets and creates a data analysis workflow.
Widgets offer basic functionalities such as reading the data, showing a data table, selecting features, training predictors,
comparing learning algorithms, visualizing data elements, etc. The user can interactively explore visualizations or feed the
selected subset into other widgets.
Fig 2. Orange snapshot: Classification Tree widget in Orange 3.0
 Canvas: graphical front-end for data analysis
 Widgets:
 Data: widgets for data input, data filtering, sampling, imputation, feature manipulation and feature selection
 Visualize: widgets for common visualization (box plot, histograms, scatter plot) and multivariate visualization (mosaic display, sieve diagram)
 Classify: a set of supervised machine learning algorithms for classification
 Regression: a set of supervised machine learning algorithms for regression
 Evaluate: cross-validation, sampling-based procedures, reliability estimation and scoring of prediction methods
 Unsupervised: unsupervised learning algorithms for clustering (k-means, hierarchical clustering) and data projection techniques (multidimensional scaling, principal component analysis, correspondence analysis)
 Add-ons:
 Bioinformatics: widgets for gene set analysis, enrichment, and access to pathway libraries
 Data fusion: widgets for collective matrix factorization and exploration of latent factors
 Text mining: widgets for basic tasks in text mining
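Beyond the canvas, Orange can also be scripted directly from Python. The following minimal sketch is an illustrative, version-dependent example (the Orange 3 API and its bundled iris data set are assumptions, not taken from the paper):

# Minimal sketch of using Orange as a Python library rather than through
# the canvas GUI; details vary between Orange versions.
import Orange

data = Orange.data.Table("iris")                  # load a bundled sample data set
learner = Orange.classification.TreeLearner()     # classification tree learner
model = learner(data)                             # train on the whole table

predictions = model(data)                         # predicted class indices
accuracy = (predictions == data.Y).mean()         # training accuracy, for illustration only
print("training accuracy:", accuracy)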
Tanagra
Tanagra works like other current data mining tools: the user can visually design a data mining process in a diagram. Each
node is a statistical or machine learning technique, and the connection between two nodes represents the data transfer. But
unlike the majority of tools, which are based on the workflow paradigm, Tanagra is very simplified. The treatments are
represented in a tree diagram. The results are displayed in HTML format, so it is easy to export the outputs in order to
visualize the results in a browser. It is also possible to copy the result tables to a spreadsheet.
Tanagra makes a good compromise between the statistical approaches (e.g. parametric and nonparametric statistical
tests), the multivariate analysis methods (e.g. factor analysis, correspondence analysis, cluster analysis, regression) and the
machine learning techniques (e.g. neural network, support vector machine, decision trees, random forest).
Fig 3. Tanagra Snapshot
R Programming Language
R and its libraries implement a wide variety of statistical and graphical techniques, including linear and nonlinear modeling,
classical statistical tests, time-series analysis, classification, clustering, and others. R is easily extensible through functions
and extensions, and the R community is noted for its active contributions in terms of packages. Many of R's standard functions
are written in R itself, which makes it easy for users to follow the algorithmic choices made. For computationally intensive
tasks, C, C++, and Fortran code can be linked and called at run time. Advanced users can write C, C++, Java, .NET or Python
code to manipulate R objects directly.
R is highly extensible through the use of user-submitted packages for specific functions or specific areas of study. Due to its S
heritage, R has stronger object-oriented programming facilities than most statistical computing languages. Extending R is also
eased by its lexical scoping rules.[4]
Another strength of R is static graphics, which can produce publication-quality graphs, including mathematical
symbols. Dynamic and interactive graphics are available through additional packages. [5]
R has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both on-line in a
number of formats and in hard copy.
Fig 4. R programming snapshot
KNIME
KNIME is a nicely designed data mining tool that runs inside the Eclipse development environment. The
application is easy to try out because it requires no installation besides downloading and unarchiving. Like YALE, KNIME is
written in Java and can extend its library of built-in supervised and unsupervised data mining algorithms with those provided
by Weka. But unlike that of YALE, KNIME's visual programming is organized like a data flow.
The user "programs" by dragging nodes from the node repository to the central part of the workbench (Fig. 5). Each
node performs a certain function, such as reading the data, filtering, modeling, visualization, or similar functions. Nodes have
input and output ports; most ports send and receive data, whereas some handle data models, such as classification trees. Unlike
nodes in Weka's KnowledgeFlow, different types of ports are clearly marked, relieving the beginner of the guesswork of what
connects where. Typical KNIME nodes have two dialog boxes, one for configuring the algorithm or a visualization and the
other for showing its results (Fig. 5).
Each node can be in one of three states, depicted with a traffic-light display: it can be disconnected, not properly
configured, or lacking input data (red); ready for execution (amber); or finished processing (green). A nice feature called
HiLite (Fig. 5) allows the user to select a set of instances in one node and have them marked in any other visualization in
the current application, in this way further supporting exploratory data analysis.
Fig 5. Snapshots (a) and (b) of KNIME
IV. CONCLUSION
Many advanced tools for data mining are available, ranging from open-source and commercial software to early research
prototypes of newly developed methods. In this paper we have studied five basic data mining tools. Every tool has its own
features, and they vary in many characteristics, such as intended user groups, supported data structures, implemented tasks
and methods, interaction styles, import and export capabilities, platforms and license policies. In the current scenario, a single
tool can handle large amounts of data, and a single feature set can generate all of the required analysis.
REFERENCES
[1] Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996). "From Data Mining to Knowledge Discovery in
Databases" (PDF). Retrieved 17 December 2008.
[2] Ian H. Witten; Eibe Frank; Mark A. Hall (2011). "Data Mining: Practical machine learning tools and techniques, 3rd
Edition". Morgan Kaufmann, San Francisco. Retrieved 2011-01-19.
[3] P. Reutemann; B. Pfahringer; E. Frank (2004). "Proper: A Toolbox for Learning from Relational Data with Propositional
and Multi-Instance Learners". 17th Australian Joint Conference on Artificial Intelligence (AI2004). Springer-Verlag.
Retrieved 2007-06-25.
[4] Jackman, Simon (Spring 2003). "R For the Political Methodologist" (PDF). The Political Methodologist (Political
Methodology Section, American Political Science Association) 11 (1): 20–22. Archived from the original (PDF) on
2006-07-21. Retrieved 2006-08-03.
[5] "CRAN Task View: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization". The Comprehensive R
Archive Network. Retrieved 2011-08-01.
[6] Wikipedia: https://en.wikipedia.org/wiki/Weka_(machine_learning)#cite_note-1