DOI 10.4010/2016.1512
ISSN 2321 3361 © 2016 IJESC
Research Article
Volume 6 Issue No. 5
A Review Paper on Data Mining Techniques
Vinod Bharat¹, Balaji Shelale², K. Khandelwal³, Sushant Navsare⁴
¹HOD, Computer Department, D.Y. Patil School of Engineering Academy, Ambi, Pune, Maharashtra, India
²,³,⁴B.E. Computer, Pune, Maharashtra, India
Abstract:
Terabytes of data are generated every day in many organizations. Data mining (DM) techniques are required to extract hidden predictive information from these massive volumes of data. Organizations are beginning to recognize the significance of data mining in their strategic planning, and the successful application of DM techniques often yields enormous payoffs. The basic principle of data mining is to explore data from different perspectives, classify it, and summarize it. Data mining has become prominent and is favoured in a wide range of applications: although huge amounts of data are available, readily usable information is scarce in many fields. Many data mining tools and software packages are available to help extract useful information from large amounts of data. This paper covers the fundamental steps of data mining: preprocessing the data (removing noisy data, replacing missing values, etc.), feature selection (selecting relevant features and discarding irrelevant and redundant ones), classification, and the evaluation of different classifiers.
Keywords: Classification, Data Mining, Data Preprocessing, Dimensionality Reduction, Feature Selection.
I. INTRODUCTION
The advancement of information technology in various fields of human life has led to large volumes of data being stored in various formats: documents, data records, images, videos, sound recordings, scientific data, and many new formats [1]. Huge amounts of data are available in science, medicine, education, industry, and many other areas [2]. The amount of data being generated and stored is increasing exponentially; such data may provide knowledge and information for decision making [3]. Data, information, and knowledge play a significant role in human activities. The data collected from different applications require proper mechanisms for extracting information and knowledge from large repositories to support better decision making. Research in information technology and databases has given rise to approaches for storing and manipulating this precious data for further decision making [4]. The technologies for generating and collecting data have been advancing very quickly. At the present stage, lack of data is no longer a problem; the inability to generate useful information from data is [5]!
While database technologists have been searching for efficient means of storing, retrieving, and manipulating data, the machine learning community has concentrated on techniques for developing, learning, and acquiring knowledge from large data sets [6]. Owing to the importance of extracting knowledge and information from large data repositories, data mining has become a crucial component in various fields of human life. Data mining is the process of analyzing data from different viewpoints and summarizing it into useful information. It has been defined as "the process of identifying valid, previously unknown, potentially useful, and ultimately understandable patterns in data" [7]. The field of data mining has been growing due to its huge success across wide-ranging applications.
The application areas of data mining include Customer Relationship Management (CRM), web applications, life sciences (LS), manufacturing, competitive intelligence, finance/banking, monitoring/surveillance, teaching support, computer/network security, climate modeling, and more. Data mining applications have been implemented successfully in domains such as finance, retail, health care, telecommunications, risk analysis, and fraud detection. Almost every field of human life has become data dependent, which has made data mining an essential component. Hence, this paper reviews the various trends and techniques of data mining.
This paper is organized as follows. Section II presents data preprocessing and its techniques, Section III presents feature selection and its methods, Section IV presents classification and classification techniques, and finally the conclusion and a discussion of future research are given.
II. DATA PREPROCESSING
Data preprocessing is an often neglected but important prerequisite step in the data mining process. Data preprocessing describes any sort of processing performed on raw data to prepare it for a subsequent analysis procedure. Preprocessing reconstructs the data into a format that is easy and effective to process further. The tools and techniques used for preprocessing encompass data cleaning, data integration, data reduction, data transformation, and data discretization [8].
Data cleaning involves detecting and correcting incorrect records. Data integration involves coupling data from various data sources. Data reduction is the process of shrinking the amount of data that needs to be taken into the mining process. In data transformation, data is converted from one form to another that is appropriate for mining. A number of techniques are used for preprocessing of data, some of which are discussed below; a brief code sketch of the cleaning, integration, and transformation steps follows.
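For illustration, a minimal Python sketch of the cleaning, integration, and transformation steps (pandas and numpy assumed; all column names and values are hypothetical):

import numpy as np
import pandas as pd

# Illustrative raw records (column names are hypothetical).
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, None, None, 40],          # missing values
    "income": [40000.0, 52000.0, 52000.0, 61000.0],
})

# Data cleaning: drop duplicate records, fill missing ages with the
# column mean, and remove records with an impossible (noisy) age.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].mean())
df = df[(df["age"] > 0) & (df["age"] < 120)]

# Data integration: couple data from a second (hypothetical) source.
regions = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["N", "S", "E"]})
df = df.merge(regions, on="customer_id", how="inner")

# Data transformation: log-scale a skewed numeric attribute.
df["income_log"] = np.log1p(df["income"])
print(df)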
2.1 Discretize:
Many data mining tasks and algorithms can benefit from a discrete representation of the original data set. A discrete representation is more comprehensible to humans and can simplify many algorithms, cut down their computational costs, and enhance their accuracy. Discretization is the technique of transforming a continuous-valued series X = {x_1, x_2, ..., x_n} into a discrete-valued series Z = {z_1, z_2, ..., z_n}. Discretization can be performed recursively on an attribute. The crucial part of the discretization process is selecting the best cut points, which divide the continuous value range into a discrete number of bins (states) [9].
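For illustration, a minimal sketch of equal-width binning with pandas (the series values, bin count, and bin labels are illustrative; pd.cut is one common way to place cut points at equal intervals):

import pandas as pd

# Continuous-valued series X (illustrative values).
x = pd.Series([0.2, 1.7, 3.5, 4.9, 2.1, 0.8])

# Equal-width discretization into 3 bins; the bin edges are the cut
# points dividing the continuous range into discrete states.
z = pd.cut(x, bins=3, labels=["low", "mid", "high"])
print(list(z))   # ['low', 'low', 'high', 'high', 'mid', 'low']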
2.2 Normalize:
Normalization is a scaling technique: the process of casting the data to a specific range [10]. It can be helpful for prediction or forecasting purposes [11]: there are many ways to predict or forecast, but their outputs can vary widely, and normalization brings them into a comparable range. Min-Max normalization transforms a value X to X', which fits in the range [C, D]:

X' = ((X − X_min) / (X_max − X_min)) × (D − C) + C

Where:
X' = Min-Max normalized value within the predefined boundary [C, D];
X = original value;
X_min = minimum value of X;
X_max = maximum value of X.
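A minimal Python sketch of this formula (the sample values and target ranges are illustrative):

import numpy as np

def min_max_normalize(x, c=0.0, d=1.0):
    """Min-Max normalization: map x into the predefined range [c, d]."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min) * (d - c) + c

values = np.array([10.0, 20.0, 30.0, 50.0])
print(min_max_normalize(values))           # [0.   0.25 0.5  1.  ]
print(min_max_normalize(values, 1, 10))    # same data scaled into [1, 10]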
2.3 Principal Component Analysis (PCA):
PCA is a technique used to reduce the high dimensionality of big data sets to fewer dimensions that are easier for humans to understand and visualize. PCA is a method of identifying patterns in data and expressing the data in a way that emphasizes their similarities and differences. Since patterns in high-dimensional data can be hard to find, and graphical representation of such data is not possible, PCA is a powerful and handy tool for analysing the data. A major advantage of PCA is that once these patterns are found, the data can be compressed by reducing the number of dimensions without much loss of information.
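A minimal sketch with scikit-learn's PCA implementation (the random 100 × 10 data set and the choice of 2 components are illustrative):

import numpy as np
from sklearn.decomposition import PCA

# Illustrative high-dimensional data: 100 samples, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Project onto the 2 principal components that capture the most variance,
# compressing 10 dimensions to 2 with limited loss of information.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance kept by each component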
III. FEATURE SELECTION
Feature selection is a method that selects an essential subset of features according to some reasonable criterion so that the original task can be achieved efficiently. By choosing an essential subset of features, insignificant and redundant features are removed according to the criterion. A feature selection process involves the following steps: first, a generation procedure that develops the next candidate subset; second, an evaluation function that evaluates the subset; third, a stopping criterion to determine when to stop; and last, a validation procedure to check the validity of the selected subset [12]. A number of feature selection methods exist, some of which are discussed below; a sketch of this generate-evaluate loop follows.
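As an illustration of the four steps above, a minimal sketch of greedy forward selection in Python, using cross-validated accuracy as the evaluation function (the iris data set and the decision tree classifier are illustrative choices, not the paper's):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0

while remaining:
    # Generation procedure: candidate subsets = selected + one more feature.
    scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    # Evaluation function: cross-validated accuracy of each candidate subset.
    f_best = max(scores, key=scores.get)
    # Stopping criterion: stop when no new feature improves the score.
    if scores[f_best] <= best_score:
        break
    best_score = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

# Validation: report the chosen subset and its cross-validated score.
print(selected, best_score)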
3.1 Correlation Feature Selection (CFS):
Correlation feature selection (CFS) is a heuristic way to evaluate the worth of a feature subset. A good feature subset is one whose features are highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. CFS measures relations between nominal features, so numeric features are discretized first. However, the concept of correlation-based feature selection does not rely on any particular data transformation [13]. The heuristic merit of a feature subset is given by:

HM_S = (n · r_cf) / √(n + n(n − 1) · r_ff)

Where HM_S is the heuristic merit of a feature subset S containing n features, r_cf is the average feature-class correlation, and r_ff is the average feature-feature correlation. In this equation, the numerator indicates how predictive of the class a group of features is, and the denominator indicates how much redundancy there is among those features.
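A sketch of this merit computation, using Pearson correlation as a stand-in for the feature-class and feature-feature correlation measures (CFS proper discretizes the data first and uses a nominal-feature correlation measure; this simplification and the sample data are illustrative):

import numpy as np

def cfs_merit(X, y):
    """Heuristic merit HM_S of the feature subset X (columns) against class y."""
    n = X.shape[1]
    # Average absolute feature-class correlation (r_cf).
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n)])
    # Average absolute feature-feature correlation (r_ff) over distinct pairs.
    pairs = [(j, k) for j in range(n) for k in range(j + 1, n)]
    r_ff = np.mean([abs(np.corrcoef(X[:, j], X[:, k])[0, 1])
                    for j, k in pairs]) if pairs else 0.0
    return n * r_cf / np.sqrt(n + n * (n - 1) * r_ff)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)   # class correlated with feature 0
print(cfs_merit(X, y))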
3.2 Correlation Attribute Evaluator (CAE):
Classification methods were originally designed to minimize errors. Real-world applications, however, require classifiers to reduce the overall cost, which involves both the misclassification cost (every error has an associated cost) and the attribute cost. CAE is also called cost-sensitive classification; its main aim is to reduce the cost of classification. The cost function combines, for each attribute j, the gain ratio G_j of attribute j, the cost C_j of attribute j, the risk element R_j associated with attribute j, and a scale factor w for the cost.
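The exact form of the cost function is not fully recoverable here; purely as an illustration of how these quantities might be combined, the sketch below discounts each attribute's gain ratio by its cost and risk (the specific formula and the attribute values are assumptions, not the paper's):

def cost_sensitive_score(gain_ratio, cost, risk, w=1.0):
    """Illustrative cost-adjusted score: a higher gain ratio is rewarded,
    higher attribute cost and risk are penalized, and w scales the cost term.
    This formula is an assumption for illustration only."""
    return gain_ratio / (1.0 + w * (cost + risk))

# Rank three hypothetical attributes by cost-adjusted usefulness.
attrs = {"blood_test": (0.40, 5.0, 0.1),
         "x_ray": (0.55, 20.0, 0.3),
         "age": (0.20, 0.0, 0.0)}
ranked = sorted(attrs, key=lambda a: cost_sensitive_score(*attrs[a]), reverse=True)
print(ranked)   # ['age', 'blood_test', 'x_ray']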
3.3 Information Gain (IG):
Information gain helps determine which feature is most useful for classifying the class, using its entropy value. Entropy indicates the information content of a feature, that is, how much information the feature gives us: the greater the information content, the higher the entropy. The IG value is calculated as:

IG(T, v) = E(T) − E(T | v)

where E is the information entropy, T is a training example, and v is a variable value. The equation above calculates the information gain that a training example T obtains from the observation that a random variable takes some value v.
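A minimal Python sketch of this computation (the toy labels and feature values are illustrative):

import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy E of a sequence of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    """IG = E(T) - E(T | v): entropy reduction from splitting on a feature."""
    total = len(labels)
    conditional = sum(
        (feature_values.count(v) / total) *
        entropy([l for l, fv in zip(labels, feature_values) if fv == v])
        for v in set(feature_values))
    return entropy(labels) - conditional

labels = ["yes", "yes", "no", "no", "yes", "no"]
feature = ["a", "a", "b", "b", "a", "b"]
print(information_gain(labels, feature))   # 1.0: this feature splits perfectly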
IV. CLASSIFICATION
Data classification is the method of organizing data into classes for its simplest and most economical use. It predicts categorical class labels and classifies data by constructing a model based on a training set and the values of a classifying attribute. Many classification techniques exist in data mining, some of which are discussed below.
4.1 J48 Classifier Algorithm:
J48 creates a decision tree based on the attribute values of the training data. The decision tree approach is most helpful in classification problems: a tree is built to model the classification process and, once built, it is applied to every tuple in the database, yielding a classification for that record. While building a decision tree, J48 ignores missing values. J48 allows classification based on decision trees or on rules generated from the decision tree [14][15].
INPUT: TD // training data
OUTPUT: T // decision tree
DTBUILD(TD)
{
    T = ∅;
    T = create a root node and label it with the splitting attribute;
    add an arc to the root node for each split predicate, with its label;
    for each arc do {
        TD_arc = subset of TD created by applying the arc's split predicate;
        if the stopping point is reached for this path then
            T' = create a leaf node and label it with the appropriate class;
        else
            T' = DTBUILD(TD_arc);
        add T' to the arc;
    }
}
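J48 is the Weka implementation of the C4.5 algorithm; scikit-learn's DecisionTreeClassifier implements the related CART algorithm, so the minimal sketch below approximates the same tree-building idea rather than J48 itself (the iris data set is illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Build the tree from the training data, then apply it to unseen tuples.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))   # classification accuracy on held-out data
print(export_text(tree))            # the induced tree, one rule path per leaf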
4.2 Naive Bayes:
A naive Bayes classifier is a probabilistic classifier that applies Bayes' theorem with a naive (strong) independence assumption: the features describing the objects to be classified are assumed to be statistically independent of each other. In spite of this assumption, naive Bayes is very effective in real-world applications [16].
The Bayes theorem:

P(H | Z) = P(Z | H) · P(H) / P(Z)

Where:
P(H | Z) = posterior probability of H given Z;
P(Z | H) = posterior probability of Z given H;
P(H) = prior probability of H;
P(Z) = prior probability of Z.
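A minimal sketch with scikit-learn's Gaussian naive Bayes (the iris data set is illustrative, and modelling each P(z_i | H) as a Gaussian is one common choice among several naive Bayes variants):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Naive Bayes: P(H | Z) is proportional to P(H) times the product of
# P(z_i | H), i.e. the features z_i are treated as conditionally
# independent given the class H.
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))        # often surprisingly accurate
print(nb.predict_proba(X_test[:1]))    # posterior probabilities per class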
V. CONCLUSION
In this paper, different data mining techniques, namely data preprocessing techniques, feature selection methods, and classification techniques, were studied. Discretization selects the best cut points, dividing a continuous value range into a discrete number of bins. Normalization casts the data to a specific range. PCA is a powerful and handy tool for analysing data. Correlation feature selection (CFS) is a heuristic way to evaluate the worth of a feature subset and measures relations between nominal features. CAE reduces the cost of classification. Information gain helps determine which feature is most useful for classification, using its entropy value. J48 allows classification based on decision trees or on rules generated from a decision tree. Despite its unrealistic independence assumption, the naive Bayes classifier is remarkably effective in practice, since its classification decision is usually correct even when its probability estimates are inaccurate.
When used together, these techniques can improve the accuracy of a classifier. Future work will focus on using these techniques to build an effective classification model with improved accuracy.
VI. REFERENCES
[1] Venkatadri M., Lokanatha C. Reddy, "A Review on Data Mining from Past to the Future", International Journal of Computer Applications (0975-8887), Volume 15, No. 7, February 2011.
[2] Smita, Priti Sharma, "Use of Data Mining in Various Field: A Survey Paper", IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 3, Ver. V (May-June 2014), pp. 18-21.
[3] Gary M. Weiss, Brian D. Davison, "Data Mining", Handbook of Technology Management, H. Bidgoli (Ed.), John Wiley and Sons, 2010.
[4] Bharati M. Ramageri, "Data Mining Techniques and Applications", Indian Journal of Computer Science and Engineering, Vol. 1, No. 4, pp. 301-305.
[5] Kalyani M. Raval, "Data Mining Techniques", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277-128X, Volume 2, Issue 10, October 2012.
[6] Anand V. Saurkar, Vaibhav Bhujade, Priti Bhagat, Amit Khaparde, "A Review Paper on Various Data Mining Techniques", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277-128X, Volume 4, Issue 4, April 2014.
[7] Albert Bifet, "Adaptive Learning and Mining for Data Streams and Frequent Patterns", April 2009.
[8] Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques", 3rd edition, Morgan Kaufmann, 2011.
[9] P. Chaudhari, D. P. Rana, R. G. Mehta, N. J. Mistry, M. M. Raghuwanshi, "Discretization of Temporal Data: A Survey".
[10] L. A. Shalabi, Z. Shaaban, B. Kasasbeh, "Data Mining: A Preprocessing Engine", Journal of Computer Science, 2: 735-739, 2006.
[11] S. Gopal Krishna Patro, Pragyan Parimita Sahoo, Ipsita Panda, Kishore Kumar Sahu, "Technical Analysis on Financial Forecasting", International Journal of Computer Sciences and Engineering, Volume 03, Issue 01, pp. 1-6, E-ISSN: 2347-2693, January 2015.
[12] M. Dash, H. Liu, "Feature Selection for Classification", Intelligent Data Analysis, 1(3), pp. 131-156, 1997.
[13] Mark A. Hall, Lloyd A. Smith, "Practical Feature Subset Selection for Machine Learning", Computer Science Department, University of Waikato, Hamilton, New Zealand.
[14] Margaret H. Dunham, S. Sridhar, "Data Mining: Introductory and Advanced Topics", Pearson Education, 1st ed., 2006.
[15] Aman Kumar Sharma, Suruchi Sahni, "A Comparative Study of Classification Algorithms for Spam Email Data Analysis", IJCSE, Vol. 3, No. 5, pp. 1890-1895, 2011.
[16] George Dimitoglou, James A. Adams, Carol M. Jim, "Comparison of the C4.5 and a Naive Bayes Classifier for the Prediction of Lung Cancer Survivability".