Survey
ISSN:2229-6093 K Suguna et al, Int.J.Computer Technology & Applications,Vol 6 (4),583-585

LITERATURE REVIEW ON DATA MINING TECHNIQUES

K.Suguna, Asst.Professor, Department of Computer Applications, Dr.N.G.P Arts and Science College, Coimbatore, India
Dr.K.Nandhini, Professor, Department of Computer Applications, Professional Group of Institutions, Coimbatore, India

Abstract
The web is an information system of interlinked hypertext documents that are retrieved over the internet. Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. A web cache stores copies of documents passing through it; subsequent requests may be satisfied from the cache if certain conditions are met.

Keywords: Association, Classification, Clustering, Prediction, Sequential patterns, Decision trees, etc.

1.INTRODUCTION
Data mining is the process of finding useful information in large amounts of data. Interesting patterns can be mined with the help of several data mining techniques. This paper reviews the literature on data mining techniques such as association rules, classification and clustering. The review focuses on how data mining techniques are used in different application areas to find meaningful patterns in a database.

2.DATA MINING TECHNIQUES
Several major data mining techniques have been developed and are used in data mining projects, including association, classification, clustering, prediction, sequential patterns and decision trees.

a)Association
Association is one of the best-known data mining techniques. In association, patterns are discovered based on a relationship between items in the same transaction, which is why the association technique is also known as the relation technique. The association rule mining technique is used in market basket analysis to identify a set of products that customers frequently purchase together.
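As a minimal sketch of market basket analysis, the Python snippet below counts how often pairs of items occur in the same transaction and keeps the pairs that reach a chosen threshold. The transactions, the function names `frequent_pairs` and `confidence`, and the 0.5 threshold are illustrative assumptions rather than anything taken from the surveyed papers; the support and confidence measures it computes are the ones defined later in this section.

```python
from collections import Counter
from itertools import combinations

# Toy transactions, made up for illustration (not from the paper).
transactions = [
    {"beer", "crisps", "bread"},
    {"beer", "crisps"},
    {"bread", "milk"},
    {"beer", "crisps", "milk"},
    {"bread", "crisps"},
]


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs with support >= min_support."""
    counts = Counter()
    for basket in transactions:
        # Count each unordered pair of items once per transaction.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


def confidence(transactions, antecedent, consequent):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for t in with_antecedent if consequent <= t)
    return hits / len(with_antecedent)


pairs = frequent_pairs(transactions, min_support=0.5)
beer_to_crisps = confidence(transactions, {"beer"}, {"crisps"})
```

With these toy baskets only the pair (beer, crisps) reaches the 0.5 threshold, with support 0.6, and every basket that contains beer also contains crisps, so the confidence of beer ⇒ crisps is 1.0.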
Nowadays retailers use the association technique to research customers' buying habits. Retailers might find that customers always buy crisps when they buy beer, and can therefore put beer and crisps next to each other to save customers time and increase sales. Association rule mining is normally performed by generating frequent itemsets. The concepts behind association rules are provided first, followed by an overview of some of the previous research in this area.

IJCTA | July-August 2015 Available [email protected]

In 2013, Diti Gupta et al. [1] suggest that an association rule can be represented in the form A ⇒ B (S, C), where A and B are itemsets. S is the support of the rule, defined as the rate of transactions containing all items in A and all items in B: Support(A ⇒ B) = P(A ∪ B). C is the confidence, defined as the ratio of S to the rate of transactions containing A, that is, the probability P(B|A). Support and confidence are measures of the interestingness of a rule. The authors calculate the support value to justify the usefulness of the items present in the data set; a higher support value indicates greater effectiveness for the enterprise.

A negative association rule of the form A ⇒ ¬B requires supp(A ∪ ¬B) ≥ ms, where ms denotes the minimum support threshold, and supp(A ∪ B) = supp(A) − supp(A ∪ ¬B). In most cases supp(A) < 2·ms, so supp(A ∪ B) < ms, which means that A ∪ B is an infrequent itemset. Finding negative association rules therefore requires finding infrequent itemsets first.

The support count shows the frequency of the patterns in the rule; it is the percentage of transactions that contain both A and B, i.e.
Support = Probability(A and B)
Support = (# of transactions involving A and B) / (total number of transactions)

Confidence is the strength of implication of a rule; it is the percentage of transactions that contain B if they contain A, i.e.
Confidence = Probability(B if A) = P(B|A)
Confidence = (# of transactions involving A and B) / (total number of transactions that have A)

In 2014, T. Karthikeyan and N. Ravikumar [2] suggest that the two basic measures of association rules are support (s) and confidence (c). Since the database is huge, users are concerned only with frequently bought items. Users can predefine thresholds of support and confidence to drop the rules that are not useful. The two thresholds are named minimal support and minimal confidence. Support (s) is defined as the proportion of records that contain X ∪ Y to the overall records in the database. The count for each item is incremented by one whenever the item is encountered in a different transaction while the database is scanned.

Support(X ∪ Y) = (support count of X ∪ Y) / (overall records in the database D)

Confidence (c) is defined as the proportion of the number of transactions that contain X ∪ Y to the overall records that contain X. If the ratio exceeds the confidence threshold, an association rule X ⇒ Y can be generated.

Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X)

In 2012, Li Xiaohui et al. [3], based on an analysis of the principle and efficiency of the Apriori algorithm, point out its defects and present an improved Apriori algorithm. The improved algorithm reduces the I/O operations of the mining process by decreasing the number of times the database is searched. The reported results show that the improved algorithm is much more efficient than the traditional algorithm when applied to mining association rules.

b)Classification
Classification is a classic data mining technique based on machine learning. Classification is mainly used to assign each item in a set of data to one of a predefined set of classes or groups.
Classification methods use mathematical techniques such as decision trees, neural networks and statistics. In classification, software is developed that can learn how to classify data items into groups. For example, classification can be applied to the question "given the records of all employees who have left the company, predict who will probably leave the company in the future". In this case, the data are divided into two groups of employees, those who left and those who stayed, and the data mining software is then asked to classify the employees into these groups.

In 2011, E.W.T. Ngai et al. [4] propose a graphical conceptual classification framework for the available literature on the applications of data mining techniques to financial fraud detection (FFD). The classification framework is based on a literature review of existing knowledge on the nature of data mining and fraud detection research.

c)Clustering
Clustering is a data mining technique that forms useful clusters of objects. The clustering technique defines the classes and puts objects in each class, whereas in classification techniques objects are assigned to predefined classes. To make the concept clearer, consider an example. In a library, a large number of books on various topics are available. The challenge is how to arrange those books so that readers can take several books on a particular topic without difficulty. Using the clustering technique, books that have some kind of similarity can be kept in one cluster and labeled with a meaningful name.

In 2013, P. IndiraPriya and Dr. D.K. Ghosh [5] describe cluster analysis, which groups data objects based only on the information found in the data that describes the objects and their relationships. The aim is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity within a group and the greater the difference between groups, the more distinct the clustering.
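As a rough illustration of this criterion, the Python sketch below compares the average distance between points in the same cluster with the average distance between points in different clusters. The points, the cluster labels and the two helper functions are hypothetical, and Euclidean distance is only one possible choice of similarity measure.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points already assigned to two clusters (made-up data).
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
}


def mean_within_distance(clusters):
    """Average Euclidean distance between pairs of points sharing a cluster."""
    distances = [
        dist(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    ]
    return sum(distances) / len(distances)


def mean_between_distance(clusters):
    """Average Euclidean distance between pairs of points from different clusters."""
    distances = [
        dist(p, q)
        for (_, points_a), (_, points_b) in combinations(clusters.items(), 2)
        for p in points_a
        for q in points_b
    ]
    return sum(distances) / len(distances)


within = mean_within_distance(clusters)
between = mean_between_distance(clusters)
```

A small `within` value together with a large `between` value corresponds to the "more distinct" clustering described above.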
The cluster analysis splits the space into regions that are characteristic of the clusters found in the data. The main benefit of a clustered solution is automatic recovery from failure. The difficulties of clustering are complexity and the inability to recover from database corruption.

d)Prediction
Prediction is a data mining technique that determines the relationship between dependent and independent variables. For example, prediction analysis can be used in sales to predict profit: sales is the independent variable and profit is the dependent variable. Based on past sales and profit data, a regression curve is fitted and then used for profit prediction.

In 2012, Neelamadhab Padhy et al. [6] describe the difficulty of prediction: predicting data is complex, and no approach or tool can guarantee accurate predictions for an organization. In their paper they analyze different algorithms and prediction techniques. Despite the fact that least median squares regression is known to produce better results than the linear regression classifier for the given set of attributes, their comparison found that linear regression takes less time than least median squares regression.

In 2011, Brijesh Kumar Bhardwaj et al. [7] use the Bayesian classification method on a student database to predict students' division on the basis of the previous year's database. The study helps students and teachers to improve students' division. It also helps to identify students who need special attention, so that the failure ratio can be reduced by taking appropriate action at the right time. The study shows that students' academic performance does not always depend on their own effort; other factors have a significant influence on students' performance.
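The sales-and-profit regression described at the start of this section can be sketched as an ordinary least-squares line fit. The figures below are invented for illustration, the helper name `fit_line` is hypothetical, and a straight line is assumed only for simplicity; none of this reproduces the experiments in [6] or [7].

```python
# Made-up past sales and profit figures (arbitrary currency units).
sales = [10.0, 20.0, 30.0, 40.0, 50.0]   # independent variable
profit = [3.0, 5.0, 7.0, 9.0, 11.0]      # dependent variable


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sales, profit)

# Predict profit for a sales level that was not in the past data.
predicted_profit = slope * 60.0 + intercept
```

Because the toy data lie exactly on the line profit = 0.2 × sales + 1, the fitted slope and intercept come out close to 0.2 and 1, and the predicted profit for sales of 60 is close to 13.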
e)Sequential Patterns
Sequential pattern analysis is a data mining technique that seeks to discover or identify related patterns, regular events or trends in transaction data over a business period. In sales, with past transaction data, it is easy to identify a set of items that customers buy together in a year. Businesses can then use this information to recommend that customers buy those items with better deals, based on their purchasing frequency in the past.

In 2013, Ms. Pooja Agrawal et al. [8] review sequential pattern mining algorithms and show that the important heuristics employed include optimally sized data structure representations of the sequence database, early pruning of candidate sequences, mechanisms to reduce support counting, and maintaining a narrow search space.

In 2014, Vishal S. Motegaonkar and Prof. Madhav V. Vaidya [9] note that initial work on this topic concentrated on improving the performance of algorithms by using different data structures or different representations. On this basis, sequential pattern mining algorithms are categorized into two types: Apriori-based algorithms and pattern-growth-based algorithms. From this survey and previous studies by various researchers on sequential pattern mining algorithms, it is found that algorithms based on the pattern-growth approach are better in terms of scalability, time complexity and space complexity.

f)Decision trees
The decision tree is one of the most widely used data mining techniques because its model is easy for users to understand. In the decision tree technique, the root of the tree is a simple question or condition that has multiple answers. Each answer then leads to a set of questions or conditions that help us determine the data so that we can make the final decision based on it.
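The question-and-answer structure described above can be sketched as a small hand-written tree that is walked from the root to a leaf. The attributes, answers and decisions in this Python example are invented for illustration, and the tree is written by hand rather than learned from data.

```python
# A hand-written toy tree. Each internal node asks one question (an attribute
# name) and has one branch per answer; leaves are plain strings (decisions).
# All attribute names, answers and decisions here are hypothetical.
tree = {
    "question": "rank",
    "branches": {
        "junior": {
            "question": "background",
            "branches": {"technical": "stay", "other": "leave"},
        },
        "senior": "stay",
    },
}


def decide(node, record):
    """Follow the answers in `record` from the root until a leaf is reached."""
    while isinstance(node, dict):
        answer = record[node["question"]]
        node = node["branches"][answer]
    return node


decision = decide(tree, {"rank": "junior", "background": "other"})
```

Here the record answers "junior" at the root and "other" at the second question, so the walk ends at the leaf "leave".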
In "Literature review on data mining research" [10], given a set of examples (training data) described by some set of attributes (e.g. sex, rank, background), the goal of the algorithm is to learn the decision function stored in the data and then use it to classify new inputs. Splits are chosen using the concept of information gain or the Gini index.

3.CONCLUSION
This study gives an overall idea of the data mining techniques that can be used on various server log files to find the most frequent patterns. The data mining techniques can be used to find user behavior over the internet.

REFERENCES
[1] Diti Gupta, Abhishek Singh Chauhan, Mining Association Rules from Infrequent Itemsets: A Survey, International Journal of Innovative Research in Science, Engineering and Technology (IJIRSET), Vol. 2, Issue 10, 2013.
[2] T. Karthikeyan and N. Ravikumar, A Survey on Association Rule Mining, International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), Vol. 3, Issue 1, January 2014.
[3] Li Xiaohui, Improvement of Apriori algorithm for association rules, World Automation Congress (WAC), IEEE, June 2012.
[4] E.W.T. Ngai et al., The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature, Elsevier, 2011.
[5] P. IndiraPriya, Dr. D.K. Ghosh, A Survey on Different Clustering Algorithms in Data Mining Technique, IJMER, Vol. 3, 2013.
[6] Neelamadhab Padhy and Rasmita Panigrahi, Data Mining: A prediction Technique for the workers in the PR Department of Orissa (Block and Panchayat), IJCSEIT, Vol. 2, 2012.
[7] Brijesh Kumar Bhardwaj and Saurabh Pal, Data Mining: A prediction for performance improvement using classification, IJCSIS, Vol. 9, 2011.
[8] Ms. Pooja Agrawal, Mr. Suresh Kashyap, Mr. Vikas Chandra Pandey, Mr. Suraj Prasad Keshri, An Analytical Study on Sequential Pattern Mining With Progressive Database, IJIRCCE, Vol. 1, Issue 3, May 2013.
[9] Vishal S. Motegaonkar, Madhav V. Vaidya, A Survey on Sequential Pattern Mining Algorithm, IJCSIT, Vol. 5, 2014.