Golden Research Thoughts
ISSN:- 2231-5063
ORIGINAL ARTICLE
Parul Dubey
ME Scholar, G.S. Moze College of Engineering, Pune, Maharashtra, INDIA.

Ratnaraja Kumar
Professor and HOD, Computer Science, G.S. Moze College of Engineering, Pune, Maharashtra, INDIA.

A STUDY ON IMPORTANCE OF DATA PROCESSING IN DATA MINING PRACTICES

Abstract: Data pre-processing is a very important step in the data mining process, and it plays a vital role in the success of a data mining project. It is the step of the Knowledge Discovery in Databases (KDD) process that reduces the complexity of the data and offers better conditions for subsequent analysis: through it one can understand the nature of the data, and the analysis can be performed more accurately and efficiently. Data pre-processing is a challenging job, as it involves extensive manual effort and requires time for developing the data-operation scripts. Various tools and methods are used for preprocessing, including: sampling, which selects a representative subset from a large population of data; data transformation, which converts the data into forms appropriate for mining; denoising, which removes noise from the data; normalization, which organizes data for efficient access; and feature extraction, which pulls out specified data that is significant in a particular context.

Keywords: Data mining, cleaning, transformation, integration, reduction, data warehouse.
INTRODUCTION
Data mining is the process of revealing nontrivial, previously unknown, and potentially useful information from large databases [1]. Data analysis plays a vital role in our day-to-day life. It is the basis for investigation in many fields of knowledge, from science and engineering to management and process control. Data on a topic of interest are collected in the form of symbolic and numeric attributes, and analysis of the collected data gives a better understanding of the phenomenon of interest. When the development of a knowledge-based system is in its initial (planning) stage, data analysis involves the discovery and generation of new knowledge for building a reliable and comprehensive knowledge base. Data preprocessing is considered an important issue for both data warehousing and data mining, as raw data tend to be incomplete, noisy, and inconsistent. Data preprocessing includes steps such as data cleaning, data integration, data transformation, and data reduction. Data cleaning is applied to make data noise-free and to remove inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformation, such as normalization, may also be applied. Data reduction can reduce the data size through aggregation, elimination of redundant features, or clustering, for instance. With the help of these data preprocessing techniques we can improve the quality of the data and, consequently, of the mining results; the efficiency of the mining process can also be improved.
Data collected from the real world presents different kinds of problems that need to be solved through data pre-processing. For example:
(i) data with missing, out-of-range, or corrupt elements;
(ii) data with noise;
(iii) data from various levels of granularity;
(iv) large data sets, data dependency, and irrelevant data sets;
(v) multiple sources of data.
WHY DATA PREPROCESSING?
Data taken in raw form from the real world is dirty:
Incomplete: missing attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., population="".
Noisy: containing errors or outliers, e.g., Age="-100".
Inconsistent: containing discrepancies in codes or names, e.g., Age="32" while Birthday="03/07/2003", or a ranking that was "1, 2, 3" and is now "A, B, C".
In a well-accepted multi-dimensional view, processed data should possess the following characteristics: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, and accessibility. The important tasks in data pre-processing are listed as follows (Fig. 1):
1) Data cleaning
2) Data integration
3) Data transformation
4) Data reduction
Fig. 1 Adapted from Han J., Kamber M., "Data Mining: Concepts and Techniques", second edition, p. 50.
DATA CLEANING
Real-world data is mostly incomplete, noisy, and inconsistent. Data cleaning is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database [2].
Missing Values
The following methods are useful for filling in missing values:
1. Ignore the tuple: This is usually done when the class label is missing. The method is not very effective unless the tuple contains several attributes with missing values, and it is especially poor when the percentage of missing values per attribute varies considerably [3].
2. Fill in the missing value manually: This approach is time-consuming and may not be feasible for a large data set with many missing values.
3. Use a global constant to fill in the missing value: All missing attribute values are replaced by the same constant, for example a label such as "Unknown".
4. Use the attribute mean to fill in the missing value: The missing value is replaced by the mean of the attribute. For example, a missing salary of a worker can be filled in with the average salary of all workers.
5. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction [3].
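To make these strategies concrete, here is a minimal sketch in Python using pandas; the column names and values are hypothetical, and only strategies 1, 3, and 4 are shown:

```python
import pandas as pd
import numpy as np

# Hypothetical worker records with missing salaries and class labels.
df = pd.DataFrame({
    "worker": ["A", "B", "C", "D"],
    "salary": [50000, np.nan, 62000, np.nan],
    "class_label": ["high", "low", None, "low"],
})

# Strategy 1: ignore the tuple -- drop rows whose class label is missing.
dropped = df.dropna(subset=["class_label"])

# Strategy 3: fill with a global constant such as "Unknown".
constant_filled = df.fillna({"class_label": "Unknown"})

# Strategy 4: fill missing salaries with the attribute mean.
mean_filled = df.fillna({"salary": df["salary"].mean()})
```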
Noisy Data
Noise is a random error or variance in a measured variable [3]. The following data smoothing techniques are used to remove noise.
1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values surrounding it. The sorted values are distributed into a number of "buckets", or bins (Fig. 2).
Fig. 2 Adapted from Han J., Kamber M., "Data Mining: Concepts and Techniques", second edition, p. 63.
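To illustrate, a small sketch of smoothing by bin means in Python, using the sorted prices from the textbook's binning example and equal-depth bins of three values each:

```python
import numpy as np

# Sorted prices, as in the textbook's binning example.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into equal-frequency (equal-depth) bins of 3 values each.
bins = prices.reshape(-1, 3)

# Smoothing by bin means: replace each value by its bin's mean.
smoothed = np.repeat(bins.mean(axis=1), 3)
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```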
2. Regression: Data can be smoothed by fitting the data to a function, such as with regression [3]. Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two attributes are involved.
3. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters". Values that fall outside of the set of clusters may be considered outliers (Fig. 3) [3]. A sketch of both techniques is given after Fig. 3.
Fig. 3 Adapted from Han J., Kamber M., "Data Mining: Concepts and Techniques", second edition, p. 64.
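The following minimal sketch illustrates both techniques in Python, using NumPy for the regression fit and scikit-learn's KMeans for the clustering; all data values and the outlier threshold are invented purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# --- Regression: smooth one attribute by fitting a line to another ---
# Hypothetical paired attributes, e.g. years of experience vs. salary.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 41, 38, 50], dtype=float)
a, b = np.polyfit(x, y, deg=1)   # least-squares "best" line y = a*x + b
y_smoothed = a * x + b           # replace each value by its prediction

# --- Clustering: flag values far from every cluster as outliers ---
# Hypothetical 2-D measurements forming two groups plus one stray point.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [0.9, 1.0],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2], [9.0, 9.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to its assigned cluster centre.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
outliers = X[dist > 2.0]         # threshold chosen purely for illustration
```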
DATA INTEGRATION
Data integration involves combining data residing in different sources and providing users with a unified view of these data [2]. The process becomes significant in a variety of situations, both commercial and scientific.
A number of issues are important during data integration. Schema integration and object matching can be tricky [3]. Redundancy is another important issue: redundancies can be detected by correlation analysis.
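For numeric attributes, a correlation coefficient close to +1 or -1 suggests that one attribute can be derived from the other and is therefore redundant. A quick sketch with pandas, using hypothetical attribute names and values:

```python
import pandas as pd

# Two attributes merged from different sources (hypothetical values).
df = pd.DataFrame({
    "annual_income": [40, 55, 62, 70, 85],
    "monthly_income": [3.3, 4.6, 5.2, 5.8, 7.1],
})

# A correlation near +1 or -1 flags one attribute as likely redundant.
corr = df["annual_income"].corr(df["monthly_income"])
print(f"correlation = {corr:.3f}")
```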
DATA TRANSFORMATION
1. Smoothing: This works to remove noise from the data. Such techniques include binning, regression, and clustering.
2. Aggregation: Summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual totals [1].
3. Generalization of the data: Low-level or "raw" data are replaced by higher-level concepts through the use of concept hierarchies. For example, an attribute like street can be generalized to city or country; similarly, age may be mapped to higher-level concepts such as youth, middle-aged, and senior.
4. Normalization: The attribute data are scaled so as to fall within a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (see the sketch after this list).
5. Attribute construction (or feature construction): New attributes are constructed and added from the given set of attributes to support the mining process.
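As an illustration of normalization (item 4 above), a min-max scaling sketch in Python that maps hypothetical age values onto the range 0.0 to 1.0:

```python
import numpy as np

# Hypothetical age values, including one extreme.
ages = np.array([13, 15, 16, 19, 20, 21, 22, 25, 70], dtype=float)

# Min-max normalization: v' = (v - min) / (max - min) maps onto [0.0, 1.0].
normalized = (ages - ages.min()) / (ages.max() - ages.min())
```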
DATA REDUCTION
Data reduction techniques can be applied to obtain a reduced representation of the data set. The new data set is much smaller in volume yet maintains the integrity of the original data.
Methods for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube [3].
2. Attribute subset selection, where irrelevant or redundant attributes or dimensions are detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges of values or by higher conceptual levels; data discretization is a form of numerosity reduction (see the sketch after this list).
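As an illustration of discretization (item 5 above), raw ages can be replaced by higher-level concept labels with pandas; the cut points and labels are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical raw age values.
ages = pd.Series([18, 25, 33, 47, 52, 68, 71])

# Replace raw values by ranges / concept labels (cut points illustrative).
labels = pd.cut(ages, bins=[0, 29, 59, 120],
                labels=["youth", "middle-aged", "senior"])
```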
USES OF DATA PREPROCESSING
Data preprocessing techniques are useful in OLTP (online transaction processing) and OLAP (online analytical processing). Preprocessing also has applications in association rule algorithms such as Apriori, partition-based algorithms, the Pincer-Search algorithm, and many more. Data preprocessing is a very important stage for data warehousing and data mining, and it is useful in efficient algorithms for mining high-utility itemsets from transactional databases [1].
CONCLUSION
In this paper, we have studied data preprocessing. The paper describes the general steps of data mining and then presents the common data preprocessing techniques; uses of data preprocessing are also discussed. Although numerous methods of data preprocessing have been developed, data preprocessing remains an active area of research owing to causes such as the huge amount of inconsistent or dirty data and the complexity of the problem.
REFERENCES
1. Vincent S. Tseng, Bai-En Shie, Cheng-Wei Wu, and Philip S. Yu (2013), "Efficient Algorithms for Mining High Utility Itemsets from Transactional Databases", IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 8.
2. Wei Jianping, "Research on Data Preprocessing in Supermarket Customers Data Mining".
3. Han J., Kamber M., "Data Mining: Concepts and Techniques", second edition.
4. Famili A., Shen W. M., Weber R., Simoudis E. (1997), "Intelligent Data Analysis", Elsevier, Vol. 1, Issues 1–4, pp. 3–23.
5. http://europa.eu/eurovoc
6. Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996), "From Data Mining to Knowledge Discovery in Databases".
7. O'Brien, J. A., & Marakas, G. M. (2011), Management Information Systems, New York, NY: McGraw-Hill/Irwin.
8. Zhu, Xingquan; Davidson, Ian (2007), Knowledge Discovery and Data Mining: Challenges and Realities, New York, NY: Hershey, pp. 163–189.
9. Agrawal, Rakesh; Mannila, Heikki; Srikant, Ramakrishnan; Toivonen, Hannu; Verkamo, A. Inkeri (1996), "Fast Discovery of Association Rules", in Advances in Knowledge Discovery and Data Mining, MIT Press, pp. 307–328.