Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 10, October 2014) An Analytical and Comparative Study of Various Data Preprocessing Method in Data Mining Kamlesh kumar pandey1, Narendra Pradhan2 1 2 MCRPV, Amarkantak Campus, Madhya-Pradesh, India SS College of Education, Pendra Road, Chhattisgarh, India In a data mining, Data Preprocessing technique is required in order to get quality data by removing such irregularities and do data mining task. There are a number of data preprocessing techniques under data preprocessing method are uses to remove specific irregularities in the real world. Some basic data preprocessing method is Data cleaning, Data Integration and Transformation, Data reduction and Discretization and Summarization of data. Abstract—Data pre-processing is an necessary and critical step of the data mining process or Knowledge discovery in databases. Base of data pre-processing is a preparing data as form of accurate, reliable and qualities data. Data preprocessing not only use for data mining but it also use of constructing on data warehouses, WWW etc. if we do not prepared on data then our mining result are uncompleted and non decisional this region it is an importance. Data preprocessing is a base of association rules, classification, prediction, Pattern evaluation, Knowledge presentation for step of data mining. This research paper presents the different method of Data Preprocessing and each method is solving to specific problem. This research cover to analytical study of how the method are worked and which type of data suitable for which type of method and we comparing on each method based of various parameters like input, output and Complexity of preprocessing etc. Keywords—Data mining, Data cleaning, Data integration, Data Transformation, Data reduction, Data Discretization and Summarization, Type of data. I. INTRODUCTION Data mining refers to extracting on interested information or knowledge from large amounts of data sources. Alternative names are data mining is Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis etc. In KDD are consists of an iterative sequence of the seven steps with specific method. First four steps Cleaning, Integration, Transformation ( include reduction and Discretization method) and Selection of data are form are data preprocessing and after three step Data mining, Pattern evaluation and Knowledge presentation are form of mining process. In real world, when different type of data or similar data are collected to different source in a one place. Then some data are consists to lot of irregularities in terms of missing data, noise, inconsistent or even outliers. This type of data doesn’t possess quality of information. If data is quality less then data mining task like that data analysis, pattern reorganization, decision management are not given to optimal solution. Fig 1: - Data Mining and Data Preprocessing We tried to show relationship between data mining and data preprocessing in fig1. Data preprocessing method are basic need of constructing of information repository and data mining time. 174 International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 10, October 2014) Data mining step are Pattern Evaluation and Data Mining Engine use to Knowledge base for guidance for searching and evaluate the interesting patterns. Data preprocessing method/technique is very useful for constructing on data warehouses, World Wide Web and large databases because data warehouse, World Wide Web and large databases have an accurate data and quality fully data are stored. Data preprocessing method/techniques are helpful in OLTP (online transaction Processing), OLAP (online analytical processing) and any data mining techniques and methods such as classification and clustering. It also Helpful to advance data mining application like a stream mining, bio-data mining ,text mining, web mining because every mining have a needed to accurate and quality data. II. WORKING PROCEDURE OF DATA PREPROCESSING METHOD In this section we analyses and describe how to data preprocessing is worked. In working procedure of data preprocessing included various types of method and technique. We draw a flow chart (fig-2) of data preprocessing method. This flow chart is explained which step are use to prepared on data and how to data preprocessing is worked. A. Data cleaning Data cleaning is a first step of data preprocessing method. Cleaning on the data is one biggest problem in constructing of data warehousing and mining because real world data are very dirty in form of noisy, missing values in tuple and inconsistence. If mining technique is apply to this type of data then our result is unreliable and poor output. Then be needed to clean on the data before data mining. Data cleaning method work to clean the data in form by smoothing noisy data, filling in missing values, correct inconsistencies and identifying or removing outliers. We listed different type of data cleaning technique and their example in table-1. Fig 2: - Flow chart of Data Preprocessing 175 International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 10, October 2014) Table-1 Technique and example for Data Cleaning B. Data Integration Integration of different type of data, attributes and schema are biggest problem in constructing of data warehousing and data mining because real world data are available in a different location. If mining technique or analysis technique is apply to this type of data warehouses then it taken to more time and decision is unreliable and quality less. Then be needed to combines data from multiple sources into a one place or one database form of data integration, attributes integration and schema integration. The sources may be multiple databases, flat files and data cubes. We listed different type of data integration technique and their example in Table-2. Technique Example Form of data: - A. Missing Values (The filled-in value may not be correct. Technique 5 are most popular strategy) 1.Ignore the tuple:- remove If any tuple have any unfilled tuple and it is a inefficient attributes value is blank then it method tuple are ignore. 2. Fill in the missing value If any tuple have any manually: - manually filling on attributes value is blank then value and it is a time consuming assume most suitable value for and not efficient for large data set. this attribute and fill it. 3. Use a global constant to fill in Set one constant value for the missing value: - Replace all specific attributes, and fill this missing attribute values by the value for every blank tuple for same constant. specifies attributes 4. Use the attribute mean to fill in Find mean for specific the missing value (use for numerical attributes, and fill numerical data):- find out mean of this mean for every blank unfilled attributed and fill this tuple for specifies attributes mean to unfilled tuple. 5. Use the most probable value to Find out most repeated value fill in the missing value: - most for specifies attributes and fill repeated value is filled. it use to this value for blank tuple for regression, inference-based tools specifies attributes using a Bayesian formalism, or decision tree induction method Form of data: - 2. Noisy Data 1. Binning (use for numerical data for price (in dollar): 4, data):- The values are distributed 8, 15, 21 into a number of bins. After that Partition into (equaleach value in a bin is replaced by the frequency) bins: mean value of the bin this type of Bin 1:- 4, 8 Bin 2:-15, 21 called smoothing by bin means. Smoothing by bin means: Bin 1:-6,6 Bin 2-: 18,18 2. Regression(use for numerical In salary data data): - In this method finding the X year experience:- 1,3, 3 best line using fit two attributes Y salary(in $1000s):-20, 30, where one attribute can be used to 36 predict the other. it is a type of Calculated on this type of statistical methodology and use to data then we find out(3,36) straight line function y = b+wx are not cover in straight line. where x and y are attributed value and b and w are a regression coefficients. 3. Clustering: - it techniques In human activity data, Early consider data tuple and depend on in childhood, we learn how the nature of the data. in this method to differentiate between cat similar type of data group into a and dog, or between animals cluster and non similar data are and plants, by continuously organized outside of cluster. The improving sub conscious quality of a cluster represent by its clustering method. diameter, the highest distance between any two objects in the cluster. Table-2 Technique and example for Data Integration Technique 1. Data integration: - in this technique combines data from multiple sources as in data warehousing, WWW. 2. Schema integration: - it technique identify and remove entity identification and Redundancy problem. in this technique combine similar type of attributed as a one attributed. metadata can be used remove entity identification and correlation analysis used to remove Redundancy. 3. Detecting and resolving data value conflicts: - when attribute values are differing but concept and structure are same then time we use this technique. Functional dependencies, referential constraints and scaling, or encoding are used to remove this type of problem and check lower level of abstraction of database. 176 Example cust_id for one database and cust_number is a another database attributed then time data analyst check to value in both database if match then be integrated do not integrated. cust_id for one database and cust_number is a another database attributed then time data analyst check to metadata both attributed if match then be integrated otherwise do not integrated. cust_id for one database and cust_number is a another database attributed then time data analyst check to metadata in both database if do not match then be check low level abstraction and function dependencies or do normalization after that modify the structure of metadata and attributed are integrated. International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 10, October 2014) C. Data Transformation Transformation of different type of data, schema from one format to another format is largest problem of constructing in a data mining and data warehousing, WWW etc. because real world data are available in a different format and language. If mining technique or analysis technique is apply to this data in data warehouses or WWW then it taken to more time, holding on more space, decision is suspended and quality less. Then be needed to transform the data one format to another format use of Smoothing, aggregation, generalization, normalization technique. We listed different type of data transformation technique and their example in Table-3. D. Data Reduction A database/data warehouse may store large amount of data (like a TB) and increasing of data size day by day is challenging problem in this time. Mining data and simple data are store in a database, data warehouse, www etc. from. If mining technique or analysis technique is apply to this type of data in database/data warehouses or WWW then it taken to long time and complexity are included in a decision/result. We obtain the reduce data volume help of Data cube aggregation, Dimensionality reduction, Binning, Sampling technique. We listed different type of data reduction technique and their example in Table-4. Table-3 Technique and example for Data Transformation Table-4 Technique and example for Data Reduction Technique 1. Aggregation (use for numerical data): -in this technique calculated to summary on data in a specific attributed. This technique is useful in constructing a data cube for analysis of multiple granularities. 2. Generalization: - in this technique low-level data are replaced by higher-level data through concept hierarchies’ technique. It technique useful of categorical attributes numerical attributes. 3. Normalization: - Normalization is scaling on data to be analyzed to a specific range such as [0.0, 1.0]. It technique useful of ANN classification algorithm. min-max normalization, z-score normalization and decimal scaling are use in normalization technique 4. Attribute construction: - in this technique new attributes are constructed and added from the given set of attributes. It technique useful of feature construction. Example Daily_sales attributed value may be aggregated so as to compute monthly_sales and annual_sales amounts. Technique Example Form of reduction: - 1. Number of attributes 1. Data cube aggregation: - This Constructing a data cube technique is useful in constructing a data for multidimensional cube. Data cubes store multidimensional analysis of sales data aggregated information. Each attributed with respect to annual cell holds an aggregate data value, sales per item type for corresponding to multidimensional each shop_branch space. table/database. 2. Dimensionality reduction: -in this When any data are technique data encoding technique are uploading in internet or applied so as to obtain a reduced communicated one (compressed) form of data. When we place to another place find out original data from the then time we compressed data then some information compressed on large lossless or loosed. some encoding database in a small technique are a Wavelet transforms and volume. principal components analysis 3. Attribute subset selection: -in this To classify customers as technique some problem like a to which product data irrelevant, weakly relevant and are popular in a redundant attributes are detected and shop_branch removed then data set size are reduces table/database. Then by using greedy algorithm or decision notified of a sale, tree induction. Use of this technique attributes such as the find out a minimum set of attributes. customer’s phone number are likely to be irrelevant and unlike attributes as age. Form of reduction: - 2. Attribute values Reducing the attributes value by using See example on data Binning, Clustering, and Aggregation or cleaning and data generalization technique. transformation. Address attributed like street data, can be generalized to higher-level concepts, like city data, after that it generalized to state data,,after that it generalized to country data. Score_pretences attributed value are normalized as a graded value. add the attribute area on database then it based (calculated )on the attributes height and width. 177 International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 10, October 2014) Table-5 Technique and example for Data Discretization Form of reduction: - 3. Number of tuple 1. Sampling: - it allows a large data set Age attributed are hold to be represented by a much smaller to only one time similar random sample of the data. Some type young, middle_age, sampling techniques are SRSWOR, senior data . SRSWR, Cluster sample and Stratified sample. Suppose that a large data set, D, contains N tuple. In SRSWOR technique created by drawing s of the N tuple from D (s < N), where the probability of drawing any tuple is the same. Technique 1. Supervised Discretization- in this technique discretization process uses class information of tuple and its calculation and determination of splitpoints. Entropy-based Discretization is a supervised Discretization and top up approach technique. E. Data Discretization and Summarization It is a sub form of data reduction method. When data volume are reduced then time handled the loss of information is a challenging problem in data reduction time. It technique are very useful for the automatic generation concept and it use for only for continuous data. Data Discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals and listing on categorized data. When Data Discretization are applied in data then we use in a class information. We listed different type of Data Discretization technique and their example in Table-5. 2. Unsupervised Discretization: - it’s is a top up approach and this technique discretization process not uses class information of tuple and attributed value are divided in a fixed partitions. some Unsupervised Discretization technique are Binning, Histogram analysis. 3. Splitting: - in this method finding the best neighboring intervals and then merging these to form larger intervals, recursively. It use to Supervised Discretization and bottom up approach. X2-based Discretization is a technique of Splitting. 4. Cluster Analysis: -it’s is a top up and bottom up approach. this technique applied to discretize a numerical attribute and Clustering can be used to generate a concept hierarchy. 5. Concept Hierarchy Generation: Concept hierarchies can be used to reduce the data by collecting and replacing low-level with higher-level concepts. Discretization can be performed recursively on attribute and given to hierarchical partitioning of the attribute values, called as a concept hierarchy. It’s technique use for numerical and categorical data. 178 Example If any tuple have a same class label then it group together. Like item_type attributes have a two tuple apple and banana so their class is froud then this tuple are combining by interval. Prices attributes, values are partitioned into equal-sized partitions or ranges and each partition contains the same number of data tuples are combine. all pairs of adjacent intervals exceed some threshold, which is determined by a specified significance level. It Use of trained data in a ANN classification with max-interval and mininterval as a form of threshold value Value are partitioned in a cluster form and grouped to similar type of data. Address attributed like street data , can be generalized to higherlevel concepts, like city data, after that it generalized to state data, , after that it generalized to country data International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 10, October 2014) III. COMPARITIVE ANALYSIS OF ALL METHOD Table-6 Comparison of Data Preprocessing Method on Various Parameters Parameter Input Data cleaning Incomplete, Noise Inconsistent data Data Integration Data reduction Data Discretization Different source of same data/data set Entity Identification problem Redundancy data Data Transformation Different type and form of data Multiple granularities data wrong data(noise data) Large records set High dimensional data set Large data volume Large set Continuous data and range of the attribute non categorized data Reduce data and remove unwanted things in data set Minimizing the loss of information Content. Reduce data volume Replacing values of a continuous attribute by an interval labels Categorized data Reduce data volume Simplified data set. Output Quality data Reliable data Completed data. Provide a Metadata Data conflict detection Quality data with care taken Strategy Filling on missing values, smooth on noisy data and identify or remove outliers and also resolve inconsistencies Integration of many databases, several data cubes, and multiple files In data transformation, the data are transformed one form to another form means data are transform for best readable form. Obtains reduced representation in volume but produces the same or similar analytical results Part of data reduction but with numerical data and continue value is dividing on interval. Technique Ignore the record Fill in the missing value manually Use a global constant Attribute Mean Binning Regression clustering High data set and high dimensional data set Data integration Schema integration Correlation analysis of categorical data using X2 Entropy-based Discretization Binning Histogram analysis X2-based Discretization Concept Hierarchy Generation Very Large data set and very high dimensional data set accuracy and understanding of structure in highdimensional data relationships between data attributes Data cube aggregate Attribute subset selection Dimensionality reduction Sampling Binning Clustering Smaller data set and smaller dimensional data set integrity of the original data Complex data analysis Mining on large amounts of data minimize the failure information content Yes Size of output data Complexity of preprocessing (Challenges) Identifying outliers. accuracy and quality of data filling on data Very Large data set and very high dimensional data set semantic heterogeneity and structure of data redundancies and inconsistencies accuracy, speed and quality of data Extra Memory Yes No Normalized data Summarized data Compressed data Correct wrong data Feature construction. Smoothing Aggregation Generalization Normalization Attribute construction 179 Yes Smaller data set and smaller dimensional data set Replacing numerous values of a continuous attribute by a fixed number of interval labels Generalized data meaningful and easier to interpret Classification accuracy and quality Yes International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 10, October 2014) by preprocessed Type of Data Problem of after data preprocessing Any type of data( numeric and char data are more suitable) Different location, volume of database , different form of data, continue value Any type of data Numeric and char type data Numeric and char type data Numeric (may be char data) Volume of database, different form of data, continue value Volume of database, continue value continue value Non (prepared data) IV. CONCLUSIONS [5] This paper discusses to data preprocessing method and their comparison and example. Main aim of data preprocessing is given to qualities data for any type of mining like that data mining, text mining, web mining. Prepared data help to reduce the number of disk accesses or number of input and output costs. Data cleaning method are clean the noisy of data, completed on uncompleted data and remove unnecessary data. Data integration method is integrated to different source of data in one place. Data transformation method change form of data. Data reduction reduces the volume of database by schema integration and Data Discretization reduces the volume of data by interval. Every method is depended to a specific data like Numeric data, alphanumeric data, Strings data. We have compared the various data preprocessing method on the basis of various factors like data, input, output, memory required, working concept, complexity etc. [6] [7] [8] [9] [10] [11] REFERENCES [1] [2] [3] [4] S.S.Baskar, Dr. L. Arockiam, S.Charles ‖ A Systematic Approach on Data Pre-processing In Data Mining‖ COMPUSOFT, An international journal of advanced computer technology, 2 (11), November-2013 (Volume-II, Issue-XI). ISSN:2320-0790 S.Hrushikesava Raju, Dr.T. Swarna Latha ― EXTERNAL DATA PREPROCESSING FOR EFFICIENT SORTING‖ IJRET | NOV 2012, Available @ http://www.ijret.org/ISSN: 2319 – 1163 Petr AUBRECHT, Zden� k KOUBA ―A Universal Data Preprocessing System‖ Lubos Popelínský (ed.), DATAKON 2003, Brno, 18.-21. 10. 2003, pp. 1-3. Neelamadhab Padhy, Dr. Pragnyaban Mishra, and Rasmita Panigrahi ―The Survey of Data Mining Applications And Feature Scope‖ International Journal of Computer Science, Engineering and Information Technology (IJCSEIT), Vol.2, No.3, June 2012 [12] [13] [14] [15] 180 Ricardo Gutierrez-Osuna and H. Troy Nagle ―A Method for Evaluating Data-Preprocessing Techniques for Odor Classification with an Array of Gas Sensors‖ IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 29, NO. 5, OCTOBER 1999 Faisal Kamiran ¢ Toon Calders ―Data Pre-Processing Techniques for Classification without Discrimination‖ address- HG 7.46, P.O. Box 513, 5600 MB, Eindhoven, the Netherlands Jasdeep Singh Malik, Prachi Goyal, Mr.Akhilesh K Sharma ― A Comprehensive Approach Towards Data Preprocessing Techniques & Association Rules‖ address- Assistant Professor, IES-IPS Academy, Rajendra Nagar Indore – 452012 , India Velmurugan.N , Vijayaraj.A ― Efficient Query Optimizing System for Searching Using Data Mining Technique‖ International Journal of Modern Engineering Research (IJMER), Vol.1, Issue.2, pp-347351 ISSN: 2249-6645 Suneetha K.R, Dr. R. Krishnamoorthi ―Data Preprocessing and Easy Access Retrieval of Data through Data Ware House‖ Proceedings of the World Congress on Engineering and Computer Science 2009 Vol I WCECS 2009, October 20-22, 2009, San Francisco, USA Jiawei Han University of Illinois at Urbana-Champaign Micheline Kamber - Data Mining: Concepts and Techniques Second Edition, ISBN 13: 978-1-55860-901-3 and ISBN 10: 1-55860-901-6 Agarwal,R and Psaila G, Active Data Mining. In Proceedings on Knowledge Discovery and Data Mining (KDD-95), 1995, 3-8 Menl www.cs.gsu.edu/~cscyqz/courses/dm/slides/ch02.ppt. Agrawal, Rakesh and Ramakrishnan Srikant, ―Fast Algorithms for Mining & Preprocessing Assosiation Rules‖, Proceedings of the 20th VLDB Conference, Santiago, Chile (1994). A. Joshi and R. Krishnapuram. On Mining Web Access Logs. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 2000, 63- 69 Godswill Chukwugozie Nsofor ―A Comparative Analysis Of Predictive Data-Mining Techniques‖ A Thesis Presented for the Master of Science Degree The University of Tennessee, Knoxville, August, 2006