ISSN:2229-6093
K Suguna et al, Int.J.Computer Technology & Applications,Vol 6 (4),583-585
LITERATURE REVIEW ON DATA MINING TECHNIQUES
K.Suguna
Asst.Professor
Department of Computer Applications
Dr.N.G.P Arts and Science College
Coimbatore
India
Dr.K.Nandhini
Professor
Department of Computer Applications
Professional Group of Institutions
Coimbatore
India
Abstract
The Web is an information system of interlinked hypertext
documents that are accessed over the Internet. Web mining is
the use of data mining techniques to automatically discover
and extract information from Web documents and services.
A web cache stores copies of documents passing through it;
subsequent requests may be satisfied from the cache if
certain conditions are met.
Keywords: Association, Classification, Clustering,
Prediction, Sequential patterns, Decision trees
1.INTRODUCTION
Data mining is the process of finding useful
information in large amounts of data. Interesting
patterns can be mined with the help of several data
mining techniques. This paper reviews the literature
on data mining techniques such as association rules,
classification and clustering. The review focuses on
how data mining techniques are used in different
application areas to find meaningful patterns in
databases.
2.DATA MINING TECHNIQUES
Several major data mining techniques have been developed
and are used in data mining projects, including association,
classification, clustering, prediction, sequential patterns
and decision trees.
a)Association
Association is one of the best-known data mining
techniques. In association, a pattern is discovered based on
a relationship between items in the same transaction, which
is why the association technique is also known as the
relation technique. Association rule mining is used in
market basket analysis to identify sets of products that
customers frequently purchase together. Retailers use the
association technique to study customers’ buying habits. For
example, a retailer might find that customers often buy
crisps when they buy beer, and can therefore place beer and
crisps next to each other to save customers time and
increase sales.
Association rule mining is normally performed by first
generating frequent itemsets. The concepts behind
association rules are presented first, followed by an
overview of some previous research in this area.
IJCTA | July-August 2015
Available [email protected]
In 2013, Diti Gupta and Abhishek Singh Chauhan [1] describe
an association rule in the form A ⇒ B (S, C), where A and B
are itemsets. S is the support of the rule, defined as the
proportion of transactions containing all items in A and all
items in B: Support(A ⇒ B) = P(A ∪ B). C is the confidence,
defined as the ratio of S to the proportion of transactions
containing A, i.e. P(B|A). Support and confidence are
measures of the interestingness of a rule. The authors
calculate the support value to justify the usefulness of the
items present in the data set; a higher support value
indicates greater effectiveness for the enterprise.
A negative association rule of the form A ⇒ ¬B requires
supp(A ∪ ¬B) ≥ ms, where ms is the minimum support
threshold. Since supp(A ∪ B) = supp(A) − supp(A ∪ ¬B) and,
in most cases, supp(A) < 2·ms, it follows that
supp(A ∪ B) < ms, which means A ∪ B is an infrequent
itemset. Finding negative association rules therefore
requires finding infrequent itemsets first.
The support shows the frequency of the patterns in the rule;
it is the percentage of transactions that contain both A
and B:
Support = P(A and B) = (# of transactions involving A and B) /
(total number of transactions).
Confidence is the strength of implication of a rule; it is
the percentage of transactions containing A that also
contain B:
Confidence = P(B|A) = (# of transactions involving A and B) /
(total number of transactions that have A).
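The two ratios above can be checked on a small example. The following is a minimal Python sketch, not code from any of the reviewed papers; the transactions and the function names `support` and `confidence` are hypothetical and chosen only for illustration.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"beer", "crisps", "bread"},
    {"beer", "crisps"},
    {"beer", "milk"},
    {"bread", "milk"},
    {"beer", "crisps", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as supp(A and B) / supp(A)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


s = support({"beer", "crisps"}, transactions)
c = confidence({"beer"}, {"crisps"}, transactions)
print(f"support={s:.2f} confidence={c:.2f}")
```

In this toy data, three of the five transactions contain both beer and crisps (support 0.6), and three of the four transactions that contain beer also contain crisps (confidence 0.75).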
In 2014, T. Karthikeyan and N. Ravikumar [2] note that the
two basic measures of association rules are support (s) and
confidence (c). Since the database is huge, users are
concerned only with frequently bought items. Users can
pre-define thresholds of support and confidence to drop
rules that are not useful; the two thresholds are called
minimal support and minimal confidence. Support (s) is
defined as the proportion of records that contain X ∪ Y to
the overall number of records in the database. The count for
each item is incremented by one whenever the item is
encountered in a transaction during the scan of the
database.
Support(X ⇒ Y) = (support count of X ∪ Y) /
(overall number of records in the database D)
Confidence (c) is defined as the proportion of the number
of transactions that contain X ∪ Y to the overall number of
records that contain X. If this ratio exceeds the confidence
threshold, the association rule X ⇒ Y can be generated.
Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X)
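To illustrate how the minimal support and minimal confidence thresholds are applied, the following minimal Python sketch uses the classic level-wise (Apriori-style) search for frequent itemsets and then derives rules from them. It is an illustration under stated assumptions, not the procedure of any specific reviewed paper; the function names, the toy transactions and the threshold values are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {frozenset: support}."""
    n = len(transactions)
    # First scan: increment an item's count for each transaction containing it.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join frequent (k-1)-itemsets into k-item candidates.
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count the surviving candidates against the database.
        current = {}
        for c in candidates:
            supp = sum(1 for t in transactions if c <= t) / n
            if supp >= min_support:
                current[c] = supp
        frequent.update(current)
        k += 1
    return frequent


def rules(frequent, min_confidence):
    """Return (X, Y, support, confidence) for rules X => Y above the threshold."""
    out = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for x in combinations(itemset, r):
                x = frozenset(x)
                conf = supp / frequent[x]  # Support(X u Y) / Support(X)
                if conf >= min_confidence:
                    out.append((x, itemset - x, supp, conf))
    return out


# Hypothetical toy data and thresholds.
transactions = [
    {"beer", "crisps", "bread"},
    {"beer", "crisps"},
    {"beer", "milk"},
    {"bread", "milk"},
    {"beer", "crisps", "milk"},
]
freq = frequent_itemsets(transactions, min_support=0.5)
for x, y, supp, conf in rules(freq, min_confidence=0.7):
    print(sorted(x), "=>", sorted(y), f"support={supp:.2f}", f"confidence={conf:.2f}")
```

With these thresholds the only frequent pair is {beer, crisps}, which yields the rules beer ⇒ crisps (confidence 0.75) and crisps ⇒ beer (confidence 1.0).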
In 2012, Li Xiaohui et al. [3] analyze the principle and
efficiency of the Apriori algorithm, point out its defects
and present an improved Apriori algorithm. The improved
algorithm reduces the I/O operations of the mining process
by decreasing the number of database scans. The experimental
results show that the improved algorithm is much more
efficient than the traditional algorithm when applied to
mining association rules.
b)Classification
Classification is a classic data mining technique
based on machine learning. Classification is used to assign
each item in a set of data to one of a predefined set of
classes or groups. Classification methods make use of
mathematical techniques such as decision trees, neural
networks and statistics. In classification, software is
developed that can learn how to classify data items into
groups. For example, classification can be applied to the
problem “given the records of all employees who have left
the company, predict who will probably leave the company in
the future”. In this case, the employee records are divided
into two groups, those who leave and those who stay, and the
data mining software is then asked to classify the employees
into these groups.
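As a rough illustration of the employee example, the sketch below classifies a new employee by majority vote among the most similar past records. It uses a k-nearest-neighbours rule, which is a stand-in chosen for brevity and is not one of the techniques named above; the two features (years at the company, weekly overtime hours) and all data values are hypothetical.

```python
from collections import Counter
import math

# Hypothetical training records: (years at company, weekly overtime hours) -> label.
training = [
    ((1.0, 12.0), "left"),
    ((2.0, 10.0), "left"),
    ((1.5, 11.0), "left"),
    ((6.0, 2.0), "stayed"),
    ((8.0, 1.0), "stayed"),
    ((7.0, 3.0), "stayed"),
]


def classify(sample, training, k=3):
    """Assign `sample` the majority label among its k nearest training records."""
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(classify((1.2, 9.0), training))  # closest to the "left" group
print(classify((7.5, 2.5), training))  # closest to the "stayed" group
```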
In 2011, E.W.T. Ngai et al. [4] propose a graphical
conceptual classification framework for the available
literature on the applications of data mining techniques to
financial fraud detection (FFD). The classification
framework is based on a literature review of existing
knowledge on the nature of data mining and fraud detection
research.
c)Clustering
Clustering is a data mining technique that forms
meaningful clusters of objects with similar characteristics.
The clustering technique defines the classes and puts
objects in each class, whereas in classification techniques
objects are assigned to predefined classes. To make the
concept clearer, consider an example. A library holds a
large number of books on various topics. The challenge is
how to arrange those books so that readers can find several
books on a particular topic without difficulty. Using a
clustering technique, books that have some kind of
similarity are kept in one cluster, and the cluster is
labeled with a meaningful name.
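The grouping idea can be sketched with k-means, one common clustering algorithm (the text above does not prescribe a particular one). In this minimal Python sketch each book is represented by two hypothetical numeric topic scores, and the first k points serve as initial centroids so that the run is deterministic; both choices are assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means; returns (centroids, labels).

    The first k points are used as initial centroids (an assumption made
    here to keep the example deterministic).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical (topic score 1, topic score 2) pairs for six books.
books = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
centroids, labels = kmeans(books, k=2)
print(labels)
```

Books with similar scores end up with the same label; naming each resulting cluster is left to the librarian.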
In 2013, P. IndiraPriya and D.K. Ghosh [5] describe
cluster analysis, which groups data objects based only on
the information found in the data that describes the objects
and their relationships. The aim is that the objects within
a group be similar (or related) to one another and different
from (or unrelated to) the objects in other groups. The
greater the similarity within a group and the greater the
difference between groups, the more distinct the clustering.
Cluster analysis splits the space into regions
characteristic of the clusters found in the data. According
to the survey, the main benefit of a clustered solution is
automatic recovery from failure, while the difficulties of
clustering are complexity and inability to recover from
database corruption.
d)Prediction
Prediction is a data mining technique that determines the
relationship between dependent and independent variables.
For example, prediction analysis can be used in sales to
predict profit: sales is the independent variable and profit
is the dependent variable. Based on past sales and profit
data, a regression curve can then be fitted and used for
profit prediction.
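The sales and profit example can be sketched with ordinary least squares for a straight line, the simplest form of regression curve. The figures below are hypothetical and constructed so that profit is exactly 0.1 times sales plus 2; the function name `fit_line` is likewise an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past data: sales (independent) and profit (dependent).
sales = [100.0, 150.0, 200.0, 250.0]
profit = [12.0, 17.0, 22.0, 27.0]

slope, intercept = fit_line(sales, profit)
predicted = slope * 300.0 + intercept  # predicted profit for sales of 300
print(f"slope={slope:.3f} intercept={intercept:.3f} predicted={predicted:.2f}")
```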
In 2012, Neelamadhab Padhy and Rasmita Panigrahi [6]
observe that predicting data is a complex task and that no
approach or tool can guarantee accurate predictions for an
organization. In their paper, they analyze different
algorithms and prediction techniques. Although least median
squares regression is known to produce better results than
linear regression for the given set of attributes, their
comparison found that linear regression takes less time than
least median squares regression.
In 2011, Brijesh Kumar Bhardwaj and Saurabh Pal [7] apply
a Bayesian classification method to a student database to
predict students’ division on the basis of the previous
year’s data. The study is intended to help students and
teachers improve the students’ division, and to identify
students who need special attention so that the failure
ratio can be reduced and appropriate action taken at the
right time. The study shows that students’ academic
performance does not always depend on their own effort;
other factors also have a significant influence on students’
performance.
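A Bayesian classifier of the kind mentioned above can be sketched as a categorical naive Bayes model with Laplace smoothing. This is a generic illustration, not the model or data used in [7]; the attributes (`attendance`, `prev_grade`), the division labels and all records are hypothetical.

```python
from collections import Counter, defaultdict
import math


def train(records):
    """records: list of (features_dict, label). Returns the counts needed to predict."""
    label_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (label, feature) -> counts of values
    values = defaultdict(set)            # feature -> set of observed values
    for features, label in records:
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            values[name].add(value)
    return label_counts, value_counts, values


def predict(model, features):
    """Return the label with the highest (log) posterior score."""
    label_counts, value_counts, values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in features.items():
            # Laplace-smoothed estimate of P(value | label).
            num = value_counts[(label, name)][value] + 1
            den = count + len(values[name])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical student records: attributes -> division obtained.
records = [
    ({"attendance": "high", "prev_grade": "first"}, "first"),
    ({"attendance": "high", "prev_grade": "first"}, "first"),
    ({"attendance": "high", "prev_grade": "second"}, "first"),
    ({"attendance": "low", "prev_grade": "second"}, "second"),
    ({"attendance": "low", "prev_grade": "third"}, "second"),
    ({"attendance": "high", "prev_grade": "third"}, "second"),
]
model = train(records)
print(predict(model, {"attendance": "high", "prev_grade": "first"}))
print(predict(model, {"attendance": "low", "prev_grade": "third"}))
```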
e)Sequential Patterns
Sequential pattern analysis is a data mining technique
that seeks to discover or identify related patterns, regular
events or trends in transaction data over a business period.
In sales, using past transaction data, a business can
identify sets of items that customers buy together over a
year. The business can then use this information to
recommend those items to customers with better deals, based
on their past purchasing frequency.
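The core operation in sequential pattern analysis is counting how many customer sequences contain a pattern in the right order. The sketch below is a simplified illustration in which each element of a sequence is a single item rather than an itemset; the purchase histories and function names are hypothetical.

```python
def contains(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator until the item is found,
    # so each later item must appear after the earlier ones.
    return all(item in it for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as a subsequence."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


# Hypothetical purchase histories, one list per customer, in time order.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone"],
    ["phone", "headphones", "case"],
]
print(sequence_support(sequences, ["phone", "case"]))
```

Here two of the four customers bought a case at some point after buying a phone, so the pattern has support 0.5; the customer who bought the case first does not count.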
In 2013, Pooja Agrawal et al. [8] review sequential
pattern mining algorithms and show that the important
heuristics employed include optimally sized data structure
representations of the sequence database, early pruning of
candidate sequences, mechanisms to reduce support counting,
and maintaining a narrow search space.
In 2014, Vishal S. Motegaonkar and Madhav V. Vaidya [9]
note that initial work on this topic concentrated on
improving the performance of algorithms by using different
data structures or representations. On this basis,
sequential pattern mining algorithms are categorized into
two types: Apriori-based algorithms and pattern-growth-based
algorithms. From their survey and previous studies by
various researchers on sequential pattern mining algorithms,
they find that algorithms based on the pattern growth
approach are better in terms of scalability, time complexity
and space complexity.
f)Decision trees
The decision tree is one of the most widely used data
mining techniques because its model is easy for users to
understand. In the decision tree technique, the root of the
tree is a simple question or condition that has multiple
answers. Each answer then leads to a further set of
questions or conditions that help narrow down the data, so
that a final decision can be made based on it.
In “Literature review on data mining research” [10],
given a set of examples (training data) described by some
set of attributes (e.g. sex, rank, background), the goal of
the algorithm is to learn the decision function stored in
the data and then use it to classify new inputs. A measure
such as information gain or the Gini index is used to choose
the attribute on which to split at each node.
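Information gain and the Gini index can be computed directly from class counts. The following minimal Python sketch shows both measures on a toy data set; the attribute values, the yes/no labels and the function names are hypothetical, and the attribute names are borrowed from the examples in the text.

```python
from collections import Counter
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on `attribute` (rows are dicts)."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical training data.
rows = [
    {"rank": "senior", "sex": "F"},
    {"rank": "senior", "sex": "M"},
    {"rank": "junior", "sex": "F"},
    {"rank": "junior", "sex": "M"},
]
labels = ["yes", "yes", "no", "no"]

best = max(["rank", "sex"], key=lambda a: information_gain(rows, labels, a))
print(best, information_gain(rows, labels, "rank"), gini(labels))
```

In this toy data `rank` separates the two classes perfectly (gain of 1 bit) while `sex` gives no gain, so a tree builder using information gain would split on `rank` at the root.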
3.Conclusion
This study gives an overall idea of the data mining
techniques that can be applied to various server log files
to find the most frequent patterns. These data mining
techniques can be used to analyze user behavior on the
Internet.
REFERENCES
[1] Diti Gupta and Abhishek Singh Chauhan, Mining
Association Rules from Infrequent Itemsets: A Survey,
International Journal of Innovative Research in Science,
Engineering and Technology (IJIRSET), Vol. 2, Issue 10,
2013.
[2] T. Karthikeyan and N. Ravikumar, A Survey on
Association Rule Mining, International Journal of
Advanced Research in Computer and Communication
Engineering (IJARCCE), Vol. 3, Issue 1, January 2014.
[3] Li Xiaohui, Improvement of Apriori algorithm for
association rules, World Automation Congress (WAC),
IEEE, June 2012.
[4] E.W.T. Ngai et al., The application of data mining
techniques in financial fraud detection: A classification
framework and an academic review of literature,
Elsevier, 2011.
[5] P. IndiraPriya and D.K. Ghosh, A Survey on Different
Clustering Algorithms in Data Mining Technique,
IJMER, Vol. 3, 2013.
[6] Neelamadhab Padhy and Rasmita Panigrahi, Data
Mining: A prediction Technique for the workers in the
PR Department of Orissa (Block and Panchayat),
IJCSEIT, Vol. 2, 2012.
[7] Brijesh Kumar Bhardwaj and Saurabh Pal, Data Mining:
A prediction for performance improvement using
classification, IJCSIS, Vol. 9, 2011.
[8] Pooja Agrawal, Suresh Kashyap, Vikas Chandra Pandey
and Suraj Prasad Keshri, An Analytical Study on
Sequential Pattern Mining With Progressive Database,
IJIRCCE, Vol. 1, Issue 3, May 2013.
[9] Vishal S. Motegaonkar and Madhav V. Vaidya, A Survey
on Sequential Pattern Mining Algorithm, IJCSIT, Vol. 5,
2014.