Download Comparison of K-means and Backpropagation Data Mining Algorithms

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Expectation–maximization algorithm wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Cluster analysis wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Types of artificial neural networks wikipedia , lookup

K-means clustering wikipedia , lookup

Transcript
ISSN 2249-6343
International Journal of Computer Technology and Electronics Engineering (IJCTEE)
Volume 2, Issue 2
Comparison of K-means and Backpropagation
Data Mining Algorithms
Nitu Mathuriya,
Dr. Ashish Bansal

Abstract— Data mining has got more and more mature as a field
of basic research in computer science and got more and more
widely applied in several other fields. Data mining has been
recognized by many researchers as a key research topic in data
handling. Data mining is the important approach to realize
knowledge discovery. It is the process of extracting patterns or
predicting previously unknown and useful trends from large
quantities of data by using the knowledge of multidisciplinary
fields such as statistics, mode identify, artificial intelligence,
machine learning, database, management information system
and so on. There are many traditional data mining clustering
and classification techniques K-means, fuzzy C- means,
Decision tree etc. The artificial neural network (ANN) is one of
the techniques of data mining different from traditional
technique. It is the nonlinear auto-fit dynamic system made of
many cells with simulating the construction of biology neural
systems. It can make model from analyzing the mode in the
data and discover the unknown knowledge. A study has been
made by applying K-means and back propagation algorithms to
the recruitment data of an organization. Experiments were
conducted with the data collected from an engineering college to
support their hiring decision. From the comparative study it has
been observed that the K-means clustering algorithm is not
much suitable for the problem and the performance of the back
propagation neural network is high.
Index Terms— Artificial Neural Network, Back propagation,
Data mining, K-means
I.
INTRODUCTION
Data mining is the term used to describe the process of
extracting value from a database. A data-warehouse is a
location where information is stored. The type of data stored
depends largely on the type of industry and the company.
Many companies store every piece of data they have
collected, while others are more ruthless in what they deem to
be ―important‖.[4]
Consider the following example of an organization for
recruitment of the candidate. The present paper gives an
engg. application of data mining based on neural networks.
Manuscript received April 15, 2012.
Nitu Mathuriya, Computer Science, RDPV/ Shri Vaishnav institute of
technology and Science (e-mail: [email protected]).Indore, India .
The back propagation (BP) neural network is used as the
algorithm of data mining. In this paper, the back propagation
(BP) neural network method is used as the technique of data
mining to analyze the effects of structural technologic
parameters on efficiency in resume filtering. In this paper,
results of the experiments conducted by cluster the data with
the K means clustering and by using back propagation
algorithm have been analyzed. Accuracy of the classification
is used as a metric for performance. This section presents
with the details of the data sets. In this method, each feature
vector representing the data has a degree of membership in to
all the clusters and the algorithm works to minimize an
objective function. The K-means algorithm has initially a
randomly chosen centre for each cluster and assigns each data
in the training set to one of the cluster whose centre is nearest.
The algorithm recalculates the centre of the clusters and
continues till there is no significant change in the value of
centre. K-means algorithm works with an assumption that all
attributes are independent and normally distributed. This
study is intended to analyze the issues involved in the
recruitment process of fresh graduates, and find out a way to
reduce the time and cost involved. [1][3]
Artificial intelligence was defined as the study and design of
intelligent agents, where an intelligent agent is a system that
perceives its environment and takes actions which maximize
its chances of success. Artificial intelligence has provided a
number of useful methods for data mining by machine
learning. Machine learning is the subfield of artificial
intelligence that is concerned with the design and
development of algorithms that allow computers (machines)
to improve their performance over time (to learn) based on
data, such as from sensor data or databases. Some of the most
popular learning systems include the neural networks and
support vector machines. A major focus of machine learning
research is to automatically produce (induce) models, such as
rules and patterns, from data. Hence, machine learning is
closely related to fields such as data mining. [2]
II. BRIEF INTRODUCTION OF DATA PREPROCESSING AND
DATA MINING
This paper proposed two most popular traditional and neural
network data mining techniques such as K-means and
Back-propagation neural network.[1][4]
Dr. Ashish Bansal, Information Technology, University/ College/ Shri
Vaishnav institute of technology and Science/ , Indore, India,
(e-mail: [email protected]).
151
ISSN 2249-6343
International Journal of Computer Technology and Electronics Engineering (IJCTEE)
Volume 2, Issue 2
More about these techniques and the preprocessing phase in
general will be given in Chapters 2 and 3, where we have
functionally divided preprocessing and its corresponding
techniques into two sub phases: data preparation and
data-dimensionality reduction.
State the problem and formulate the
hypothesis
D. Estimate the model
The selection and implementation of the appropriate
data-mining technique is the main task in this phase. This
process is not straightforward; usually, in practice, the
implementation is based on several models, and selecting the
best one is an additional task.
Collect the Data
Data Preprocessing
Estimate the model
Interpret the model and draw conclusion
Figure: 1 Data mining process
A. State the problem and formulate the hypothesis
Most data-based modeling studies are performed in a
particular application domain. Hence, domain-specific
knowledge and experience are usually necessary in order to
come up with a meaningful problem statement.
Unfortunately, many application studies tend to focus on the
data-mining technique at the expense of a clear problem
statement. In this step, a modeler usually specifies a set of
variables for the unknown dependency and, if possible, a
general form of this dependency as an initial hypothesis.
B. Collect the data
This step is concerned with how the data are generated and
collected. In general, there are two distinct possibilities. The
first is when the data-generation process is under the control
of an expert (modeler): this approach is known as a designed
experiment. The second possibility is when the expert cannot
influence the data generation process: this is known as the
observational approach. An observational setting, namely,
random data generation, is assumed in most data-mining
applications. Also, it is important to make sure that the data
used for estimating a model and the data used later for testing
and applying a model come from the same, unknown,
sampling distribution. If this is not the case, the estimated
model cannot be successfully used in a final application of
the results.
C. Data Preprocessing
Data-preprocessing steps should not be considered
completely independent from other data-mining phases. In
each iteration of the data-mining process, all activities,
together, could define new and improved data sets for
subsequent iterations. Generally, a good preprocessing
method provides an optimal representation for a data-mining
technique by incorporating a priori knowledge in the form of
application- specific scaling and encoding.
E. Interpret the model and draw conclusions
In most cases, data-mining models should help in decision
making. Hence, such models need to be interpretable in order
to be useful because humans are not likely to base their
decisions on complex "black-box" models. Note that the
goals of accuracy of the model and accuracy of its
interpretation are somewhat contradictory. Usually, simple
models are more interpretable, but they are also less accurate.
Modern data-mining methods are expected to yield highly
accurate results using high-dimensional models. The problem
of interpreting these models, also very important, is
considered a separate task, with specific techniques to
validate the results. A user does not want hundreds of pages
of numeric results. He does not understand them; he cannot
summarize, interpret, and use them for successful
decision-making.
III. K-MEANS CLUSTERING
K-means is one of the simplest unsupervised learning
algorithms for clustering problems. The algorithm aims at
forming k clusters of n objects such that the resulting
intra-cluster similarity is high but the inter-cluster similarity
is low. The algorithm randomly selects k of the n objects and
one of them is assigned to each cluster to represent the cluster
mean or the center [1]. For each of the remaining objects, an
object is assigned to the cluster to which it is most similar,
based on the distance between the object and the cluster
mean. Then new mean is computed for each cluster and the
process iterates until the criterion function converges. A
square-error criterion is used and defined as (1).
k
E=∑ ∑  p – m i 2
………. (1)
i=1 pC i
[1] Arbitrarily choose K points into the space representing the
objects that are being clustered. These points represent initial
group centroids.
[2] Assign each remaining object to the group that has the
closest centroid.
[3] When all objects have been assigned, recalculate the
positions of the K centroids.
[4] Repeat Steps 2 and 3 until the centroids no longer
move.[3][7]
152
ISSN 2249-6343
International Journal of Computer Technology and Electronics Engineering (IJCTEE)
Volume 2, Issue 2
IV. BACKPROPAGATION ALGORITHM
Backpropagation, or propagation of error, is a common
method of teaching artificial neural networks how to perform
a given task. The back propagation algorithm is used in
layered feed forward ANNs. This means that the artificial
neurons are organized in layers, and send their signals
―forward‖, and then the errors are propagated backwards.
The back propagation algorithm uses supervised learning,
which means that we provide the algorithm with examples of
the inputs and outputs we want the network to compute, and
then the error (difference between actual and expected
results) is calculated. The idea of the back propagation
algorithm is to reduce this error, until the ANN learns the
training data.[6][9]
Input Layer
Summary of the technique:
Output Layer
Figure: 2 Back propagation network
1. Present a training sample to the neural network.
2. Compare the network's output to the desired output from
that sample. Calculate the error in each output neuron.
3. for each neuron, calculate what the output should have
been, and a scaling factor, how much lower or higher the
output must be adjusted to match the desired output. This is
the local error.
4. Adjust the weights of each neuron to lower the local error.
5. Assign "blame" for the local error to neurons at the
previous level, giving greater responsibility to neurons
connected by stronger weights.
6. Repeat the steps above on the neurons at the previous level,
using each one's "blame" as its error.[8][5]
Actual Algorithm:
1. Initialize the weights in the network (often randomly)
2. repeat
* for each example e in the training set do
1. O = neural-net-output (network, e);
Forward pass
2. T = teacher output for e
3. Calculate error (T - O) at the output units
4. Compute delta_wi for all weights from hidden layer to
output layer;
Hidden Layer
V. PROPOSED MODEL
The design of the system requires the complete understanding of
the problem domain. The data sets and the input attributes are
determined through knowledge engineering in an IT industry. The
process involves defining the problem, identifying relevant stake
holders, and learns about current solutions to the problem. It also
involves learning domain-specific terminology, description of the
problem and restrictions of it. In this step, interviews were
conducted to the domain experts to obtain required information to
solve the problem, knowledge extraction was made with the
collected information and a knowledge base was built. The
knowledge base construction comprises collection of sample data,
and deciding which data will be needed in respect to data mining
knowledge discovery goals including its format and size.[3]
Knowledge acquisition from the domain experts
Preprocessing the data
Apply Traditional
data mining
technique
Backward pass
5. Compute delta_wi for all weights from input layer to
hidden layer;
backward pass continued
6. Update the weights in the network
* end
3. until all examples classified correctly or stopping criterion
satisfied
4. return (network)
Apply Neural
Network data
mining technique
Evaluate the algorithm and choose the structure
with highest accuracy
Discuss with the domain experts and chose the
rules that best fits the problem
Figure: 3 A data mining framework for the problem
Fig 3 shows the steps involved in the mining process. The
mining process begins with the step to gather knowledge
from the domain experts. Knowledge acquisition is a process
that includes elicitation, collection, analysis, modeling and
validation of knowledge for knowledge engineering.
153
ISSN 2249-6343
International Journal of Computer Technology and Electronics Engineering (IJCTEE)
Volume 2, Issue 2
Some of the important issues involved in knowledge
acquisition are the knowledge is hidden within the domain
experts and is not with a single expert. Interviews were
conducted with the domain experts to understand the problem
and the knowledge required to solve the problem. The
knowledge acquired is used along with the recruitment
database maintained in the industry to form the dataset for
experimentation.[3]
The data collected from the industry is complex and have
noisy, missing and inconsistent data. The data is
preprocessed to improve the quality of data and make it fit for
the data mining task. The data used are transformed into
appropriate formats to support meaningful analysis. Some
more attributes are derived using the acquired knowledge to
support the mining process. Apply traditional data mining
algorithm and back propagation neural network algorithm on
the preprocessed data and evaluate the accuracy of each
algorithm, after that compare the accuracy of k means and
back propagation algorithm. Accuracy is the most important
factor to evaluate any model in data mining, so select the
model which gives better accuracy for the problem by
discussing with the domain experts.
The constructed models were reviewed and evaluated before
it is used for decision support. The models were evaluated
using accuracy as the criteria to assess the performance of the
method. Constructive rules were extracted from the
technique which had better accuracy.
VI. EXPERIMENTAL RESULTS
The proposed system has been implemented and evaluated
with extensive experimentations on the collected datasets.
Accuracy of classification is used as the metric for deciding
the best suited model. This section presents the details of the
data sets, test results and comparison of them. The metric
used to evaluate the clustering and classification algorithms
is the accuracy. Accuracy is determined as the ratio of records
correctly classified during testing to the total number of
records tested. The clusters formed were verified for
correctness to know the error.
The details of the applicants to the industry comprising two
datasets were used for experimentation. The first dataset
includes the detail consists of 400 records, the second dataset
include the details consist of 850 records. From the dataset it
is observed, that the dataset consists of more than 60% of
records to be in the rejected category. Hence the machine
learning algorithms were very excellent in recognizing the
rejected data however they were not able to identify selected
records to a large extent. Therefore the dataset was
premeditated and almost equal number of records in both the
categories was used for experimentation. When the records
were chosen for the learning process, the distribution of the
status in the original data was maintained. The algorithms
were trained with records of one dataset and tested with the
records in the other dataset.
K Means and back propagation algorithms were applied with
MATLAB and the accuracy of the algorithms is depicted in
Table 1, Table 2, Table 3 and table 4. It is observed that K
means technique have poor accuracy and not suitable for this
problem domain due to the nature of the data.
Table 1 Results of algorithms Trained with
Dataset1 and Tested with Dataset1
Algorithm used
K- means
Back propagation
Accuracy %
72
80.40
Table 2 Results of Clustering algorithms Trained with
Dataset2 and Tested with Dataset2
Algorithm used
K- means
Back propagation
Accuracy %
86.23
91.42
Table 3 Results of algorithms Trained with
Dataset1 and Tested with Dataset2
Algorithm used
K- means
Back propagation
Accuracy %
71
75.46
Table 2 Results of algorithms Trained with
Dataset2 and Tested with Dataset1
Algorithm used
K- means
Back propagation
Accuracy %
75
70.86
The resultant accuracy of the algorithms shown in the
above table represents backpropagation algorithm gives
better result than K-means algorithm and backpropagation
algorithm suitable for the problem. The variation of the
accuracy depends on the training and testing data sets.
VII. CONCLUSION
The problem domain has been studied to extract the
knowledgebase required to solve the problem. Datasets have
been collected and analyzed to identify the input attributes to
be used for the algorithms. Most popular clustering and
classification techniques were deployed in solving the
problem. It was observed that K-means clustering techniques
are not suitable for this type of data distribution. The popular
algorithm Backprapogation have been applied for the
problem and it has been observed that constructed with
Backpropagation algorithm has better accuracy.
A set of experiments was conducted to test the proposed
approach is using a well defined set of data mining problems.
The results indicate that, using the proposed approach, high
quality or useful data can be discovered from the given data
sets. In future we can apply this technique to select deserving
candidates for an organization. The use of neural networks in
data mining is a promising field of research especially given
the ready availability of large mass of data sets and the
reported ability of neural networks to detect and assimilate
relationships between a large numbers of variables.
154
ISSN 2249-6343
International Journal of Computer Technology and Electronics Engineering (IJCTEE)
Volume 2, Issue 2
REFERENCES
[1]
Arun K Pujari ―Data Mining Techniques‖ University press
india(Private Limited), * th impression,2005.
[2]
Haykin, S., Neural Networks, Prentice Hall International Inc., 1999
[3]
N. Sivaram Research Scholar and V, Department of CSE ,Clustering
and Classification Algorithms for Recruitment Data Mining,
International Journal of
Computer Applications (0975 – 8887)
Volume 4 – No.5, July 2010
[4]
Jiawei Han, Micheline Kamber. 2006. Data Mining Concepts and
Techniques‖, Second Edition Morgan Kaufmann Publishers, San
Francisco
[5]
―Artificial Neural Network‖, Wikipedia Encyclopedia, Wikimedia
Foundation, Inc., 2006.
[6]
1DR. YASHPAL SINGH ,ALOK SINGH CHAUHAN, Journal of
Theoretical and Applied Information Technology Reader,
Bundelkhand Institute of Engineering & Technology, Jhansi, India.
[7]
Bhavani,Thura-is-ingham, ―Data-mining‖, Technologies,Techniques
tools & Trends‖, CRC Press
[8]
Bradley, I., Introduction to Neural Networks, Multinet Systems Pty Ltd
Nitu Mathuriya
Bachelor of Engineering (Computer Science), (Master of Engineering
(Computer Science), Shri Vaishnav institute of technology & Science.
Dr. Ashish Bansal
Ph.D (Information Technology), Subject: Techniques for enhancing
watermarking using neural networks. Rajiv Gandhi University, Bhopal.
Published more than 30 papers in international and National journals
and conferences.
155