Survey
DOI 10.4010/2016.1512, ISSN 2321 3361, © 2016 IJESC
Research Article, Volume 6, Issue No. 5

A Review Paper on Data Mining Techniques

Vinod Bharat1, Balaji Shelale2, K. Khandelwal3, Sushant Navsare4
HOD, Computer Department, D.Y. Patil School of Engineering Academy, Ambi, Pune, Maharashtra, India1
B.E Computer & Pune, Maharashtra, India2, 3, 4

Abstract: Terabytes of information are generated every day in many organizations. Data mining (DM) techniques are required to extract hidden predictive information from such massive volumes of data. Organizations are beginning to realize the significance of data mining in their strategic planning, and successful application of DM techniques often helps them attain enormous payoffs. The basic principle of data mining is to explore the data from different aspects, classify it and summarize it. Data mining has become very prominent and is favoured in nearly every application. Although huge amounts of data are available, handy information is not available in every field. Many data mining tools and software packages assist in extracting useful information from large amounts of data. This paper covers the fundamental steps of data mining: preprocessing the data (removing noisy data, replacing missing values, etc.), feature selection (selecting the relevant features and discarding the irrelevant and redundant ones), classification, and the evaluation of different classifiers.

Keywords: Classification, Data Mining, Data Preprocessing, Dimensionality Reduction, Feature Selection.

I. INTRODUCTION

The advancement of information technology in various fields of human life has led to the storage of large volumes of data in various formats such as documents, data records, images, videos, sound recordings, scientific data, and many new data formats [1]. Huge amounts of data are available in science, medicine, education, industry and many other areas [2].
The amount of data being generated and stored is increasing exponentially; such data may provide knowledge and information for decision making [3]. Data and information/knowledge play a significant role in human activities. Data collected from different applications require a proper mechanism for extracting information/knowledge from large repositories to support better decision making. Research in information technology and databases has given rise to approaches for storing and manipulating this precious data for further decision making [4]. The technologies for generating and collecting data have been advancing very quickly. At the present stage, lack of data is no longer the problem; the inability to generate useful information from data is [5]. While database technologists have been searching for efficient means of storing, retrieving and manipulating data, the machine learning community has concentrated on techniques for developing, learning and acquiring knowledge from large data [6].

Because of the importance of extracting knowledge/information from large data repositories, data mining has become a crucial component in various fields of human life. Data mining is the process of analyzing data from different viewpoints and summarizing it into useful information. It has been defined as "the process of identifying valid, previously unknown, potentially useful, and finally understandable patterns in data" [7]. The field of data mining has been growing because of its success in wide-ranging applications. Its application areas include Customer Relationship Management (CRM), Web Applications, Life Sciences (LS), Manufacturing, Competitive Intelligence, Finance/Banking, Monitoring/Surveillance, Teaching Support, Computer/Network/Security, Climate Modeling, etc. Almost every field of human life has become data dependent, which has made data mining an essential component.
Various data mining applications have been implemented successfully in domains such as finance, retail, health care, telecommunication, risk analysis and fraud detection. Hence, this paper reviews the various trends and techniques of data mining. The paper is organized as follows: Section II presents data preprocessing and its techniques, Section III presents feature selection and its methods, Section IV presents classification and its techniques, and Section V gives the conclusion and a discussion of future research.

II. DATA PREPROCESSING

Data preprocessing is an often neglected but important prerequisite step in the data mining process. It covers any processing performed on raw data to prepare it for a subsequent analysis procedure. Preprocessing restructures the data into a format that is easier and more effective to process further. Preprocessing tools and techniques encompass data cleaning, data integration, data reduction, data transformation and data discretization [8]. Data cleaning involves detecting and correcting incorrect records. Data integration involves combining data from various sources. Data reduction reduces the amount of data that has to be considered by the mining process. Data transformation converts data from one form into another that is appropriate for mining. Some of the techniques used for preprocessing are discussed below.

International Journal of Engineering Science and Computing, May 2016 6268 http://ijesc.org/

2.1 Discretize: Many data mining tasks and algorithms can benefit from a discrete representation of the original data set. A discrete representation is easier for humans to interpret, and it can simplify many algorithms, cut down their computational cost and enhance their accuracy. Discretization is the technique of transforming a continuous-valued series $X = \{x_1, x_2, \ldots, x_n\}$ into a discrete-valued series $Z = \{z_1, z_2, \ldots, z_n\}$. Discretization can be performed recursively on an attribute. The crucial part of the discretization process is selecting the best cut points, which divide the continuous value range into a discrete number of bins (states) [9].

2.2 Normalize: Normalization is a scaling technique: it is the process of casting the data to a specific range [10]. It can be helpful for prediction or forecasting purposes [11]. There are many ways to predict or forecast, and their results can differ from each other considerably; normalization is used to bring such widely differing values closer together. Min-Max normalization transforms a value X to X' which fits in the range [C, D]:

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \, (D - C) + C$$

where X' is the Min-Max normalized data with predefined boundary [C, D], X is the range of the original data, $X_{\min}$ is the minimum value of X, and $X_{\max}$ is the maximum value of X.

2.3 Principal Component Analysis (PCA): PCA is a technique used to reduce the high dimensionality of large data sets to fewer dimensions that are easier for humans to understand and visualize. It is a method of identifying patterns in data and expressing the data so as to emphasize their similarities and differences. Patterns can be hard to find in high-dimensional data, where graphical representation is not possible, so PCA is a powerful and handy tool for analysing such data. A major advantage of PCA is that, once these patterns are found, the data can be compressed by reducing the number of dimensions without much loss of information.

III. FEATURE SELECTION

Feature selection is a method that selects an essential subset of features according to some reasonable criterion so that the original task can be achieved efficiently. By choosing an essential subset, insignificant and redundant features are removed according to the criterion. The feature selection process involves the following steps: first, a generation procedure that produces the next candidate subset; second, an evaluation function that evaluates the subset; third, a stopping criterion that determines when to stop; and last, a validation procedure that checks whether the selected subset is valid [12]. Some of the methods for feature selection are discussed below.

3.1 Correlation Feature Selection (CFS): Correlation feature selection (CFS) is a heuristic way to evaluate the value of a feature subset. A good feature subset contains features that are highly associated with (predictive of) the class, yet unassociated with (not predictive of) each other. CFS measures relations between nominal features, so numeric features are discretized first. However, the concept of correlation-based feature selection does not rely on any particular data transformation [13]. The merit of a feature subset is given by:

$$H_M = \frac{n \, \overline{r_{cf}}}{\sqrt{n + n(n-1) \, \overline{r_{ff}}}}$$

where $H_M$ is the heuristic merit of a feature subset S containing n features, $\overline{r_{cf}}$ is the average feature-class correlation, and $\overline{r_{ff}}$ is the average feature-feature correlation. In this equation, the numerator indicates how predictive a group of features is, and the denominator indicates how much redundancy there is among those features.

3.2 Correlation Attribute Evaluator (CAE): Classification methods were traditionally designed to minimize the number of errors. Real-world applications, however, require classifiers that reduce the overall cost, which includes the misclassification cost (every error has an associated cost) and the attribute cost. CAE is also referred to as cost-sensitive classification; its main aim is to reduce the cost of classification. Cost function: [equation missing in the source text], whose terms are the gain ratio for attribute j, the cost of attribute j, the risk element related to attribute j, and a scale factor for cost.
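To make the trade-off in the CFS merit of Section 3.1 concrete, the following minimal Python sketch evaluates the merit for two subsets, assuming the usual form of the merit, H_M = n·r_cf / sqrt(n + n(n−1)·r_ff). The function name `cfs_merit` and the correlation values are hypothetical and serve only as an illustration; in practice the average correlations would be estimated from the data.

```python
import math


def cfs_merit(n, avg_feature_class_corr, avg_feature_feature_corr):
    """Heuristic merit of a subset of n features (higher is better).

    Assumes the merit form n * r_cf / sqrt(n + n * (n - 1) * r_ff).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Numerator: how predictive the group of features is.
    numerator = n * avg_feature_class_corr
    # Denominator: grows with the redundancy among the features.
    denominator = math.sqrt(n + n * (n - 1) * avg_feature_feature_corr)
    return numerator / denominator


# Two hypothetical 4-feature subsets that are equally predictive of the class
# but differ in how strongly their features correlate with each other.
low_redundancy = cfs_merit(4, 0.5, 0.1)
high_redundancy = cfs_merit(4, 0.5, 0.9)
print(low_redundancy > high_redundancy)
```

With equal feature-class correlation, the subset whose features are less correlated with each other receives the higher merit, which is the behaviour the numerator and denominator are meant to capture.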
3.3 Information Gain (IG): Information gain helps determine which feature is most useful for classification, using its entropy value. Entropy measures the information content of a feature, that is, how much information the feature gives us: the more the information content, the higher the entropy. The IG value is calculated as:

$$IG(T, v) = E(T) - E(T \mid v)$$

where E is the information entropy, T is a training example, and v is a variable value. The equation gives the information gain that a training example T obtains from the observation that a random variable A takes some value v.

IV. CLASSIFICATION

Data classification is the method of organizing data into classes for its simplest and most economical use. It predicts categorical class labels: a model is constructed from a training set and the values of a classifying attribute, and is then used to classify data. Many classification techniques exist in data mining; some of them are discussed below.

4.1 J48 Classifier Algorithm: J48 creates a decision tree based on the attribute values. The decision tree approach is most helpful in classification problems. With this technique, a tree is built to model the classification process. Once the tree is built, it is applied to every tuple in the database, which results in a classification for that record. While building the tree, J48 ignores missing values. J48 allows classification via either decision trees or rules generated from them [14][15].
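Decision-tree learners such as J48 pick the splitting attribute with an entropy-based measure closely related to the information gain of Section 3.3. The minimal Python sketch below computes entropy and information gain at the attribute level, that is, E(T) minus the entropy remaining after splitting on each value of the attribute, weighted by how often that value occurs. The helper names and the toy data are hypothetical, and this is an illustration rather than the procedure J48 itself uses.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    total = len(labels)
    groups = {}
    # Group the class labels by the value the feature takes.
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: 'outlook' separates the classes, 'windy' does not.
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

print(information_gain(outlook, play))
print(information_gain(windy, play))
```

In this toy data the attribute that separates the classes perfectly obtains the full entropy of the labels as its gain, while the uninformative attribute obtains none, so a tree builder would split on the former.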
    INPUT:  TD   // Training data
    OUTPUT: T    // Decision tree
    DTBUILD(TD)
    {
        T = φ;
        T = Create root node and label with splitting attribute;
        T = Add arc to root node for each split predicate and label;
        For each arc do
            TD = Database created by applying splitting predicate to TD;
            If stopping point reached for this path, then
                T' = Create leaf node and label with appropriate class;
            Else
                T' = DTBUILD(TD);
            T = Add T' to arc;
    }

4.2 Naive Bayes: A naive Bayes classifier is a probabilistic classifier that applies Bayes' theorem with a naive (strong) assumption: the features that describe the objects to be classified are statistically independent of each other. In spite of this assumption, naive Bayes is very effective in real-world applications [16].

The Bayes Theorem:

$$P(H \mid Z) = \frac{P(Z \mid H) \, P(H)}{P(Z)}$$

where
$P(H \mid Z)$: Posterior Probability of H
$P(Z \mid H)$: Posterior Probability of Z
$P(H)$: Prior Probability of H
$P(Z)$: Prior Probability of Z

V. CONCLUSION

In this paper, different data mining techniques, namely data preprocessing techniques, feature selection methods and classification techniques, are studied. Discretization selects the best cut points for dividing the continuous value range into a discrete number of bins. Normalization casts the data to a specific range. PCA is a powerful and handy tool for analysing the data. Correlation feature selection (CFS) is a heuristic way to evaluate the value of a feature subset; it measures relations between nominal features. CAE reduces the cost of classification. Information gain helps determine which feature is most useful for classification, using its entropy value. J48 allows classification via either decision trees or rules generated from them. Despite its unrealistic independence assumption, the naive Bayes classifier is remarkably effective in practice, since its classification decision may often be correct even when its probability estimates are inaccurate.
These techniques, when used together, can improve the accuracy of the classifier. Future work will focus on using these techniques to build an effective classification model with improved accuracy.

VI. REFERENCES

[1] Venkatadri. M, Dr. Lokanatha C. Reddy, "A Review on Data Mining from Past to the Future", International Journal of Computer Applications (0975 – 8887), Volume 15, No. 7, February 2011.
[2] Smita, Priti Sharma, "Use of Data Mining in Various Field: A Survey Paper", IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 3, Ver. V (May-Jun. 2014), pp. 18-21.
[3] Gary M. Weiss, Brian D. Davison, "Data Mining", Handbook of Technology Management, H. Bidgoli (Ed.), John Wiley and Sons, 2010.
[4] Mrs. Bharati M. Ramageri, "Data Mining Techniques and Applications", Indian Journal of Computer Science and Engineering, Vol. 1, No. 4, pp. 301-305.
[5] Kalyani M Raval, "Data Mining Techniques", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277 128X, Volume 2, Issue 10, October 2012.
[6] Anand V. Saurkar, Vaibhav Bhujade, Priti Bhagat, Amit Khaparde, "A Review Paper on Various Data Mining Techniques", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277 128X, Volume 4, Issue 4, April 2014.
[7] Albert Bifet, "Adaptive Learning and Mining for Data Streams and Frequent Patterns", April 2009.
[8] Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques", 3rd edition, Morgan Kaufmann, 2011 (1st edition 2000-2001, 2nd edition 2006).
[9] P. Chaudhari, D. P. Rana, R. G. Mehta, N. J. Mistry, M. M. Raghuwanshi, "Discretization of Temporal Data: A Survey".
[10] Shalabi, L.A., Z. Shaaban and B. Kasasbeh, "Data Mining: A Preprocessing Engine", J. Computer Sci., 2: 735-739, 2006.
[11] S. Gopal Krishna Patro, Pragyan Parimita Sahoo, Ipsita Panda, Kishore Kumar Sahu, "Technical Analysis on Financial Forecasting", International Journal of Computer Sciences and Engineering, Volume-03, Issue-01, pp. 1-6, E-ISSN: 2347-2693, Jan. 2015.
[12] Dash, M. & Liu, H. (1997), "Feature Selection for Classification", Intelligent Data Analysis, 1(3), 131–56.
[13] Mark A. Hall, Lloyd A. Smith, "Practical Feature Subset Selection for Machine Learning", Computer Science Department, University of Waikato, Hamilton, New Zealand.
[14] Margaret H. Dunham, S. Sridhar, "Data Mining: Introductory and Advanced Topics", Pearson Education, 1st ed., 2006.
[15] Aman Kumar Sharma, Suruchi Sahni, "A Comparative Study of Classification Algorithms for Spam Email Data Analysis", IJCSE, Vol. 3, No. 5, 2011, pp. 1890-1895.
[16] George Dimitoglou, James A. Adams, and Carol M. Jim, "Comparison of the C4.5 and a Naive Bayes Classifier for the Prediction of Lung Cancer Survivability".