Golden Research Thoughts
ISSN:- 2231-5063
ORIGINAL ARTICLE
Parul Dubey
ME Scholar, G.S. Moze College of Engineering, Pune, Maharashtra, INDIA.

Ratnaraja Kumar
Professor and HOD, Computer Science, G.S. Moze College of Engineering, Pune, Maharashtra, INDIA.

A STUDY ON IMPORTANCE OF DATA PROCESSING IN DATA MINING PRACTICES

Abstract: Data pre-processing is a very important step in the data mining process, and it plays a vital role in the success of a data mining project. It is the step of the Knowledge Discovery in Databases (KDD) process that reduces the complexity of the data and offers better conditions for subsequent analysis: through it one can understand the nature of the data, and the analysis can be performed more accurately and efficiently. Data pre-processing is a challenging job, as it involves extensive manual effort and requires time for developing the data-operation scripts. Various tools and methods are used for preprocessing, including: sampling, which selects a representative subset from a large population of data; data transformation, which converts the data into forms appropriate for mining; denoising, which removes noise from the data; normalization, which organizes data for efficient access; and feature extraction, which pulls out specified data that is significant in a particular context.

Keywords: Data mining, cleaning, transformation, integration, reduction, data warehouse.
INTRODUCTION
Data mining is the process of revealing nontrivial, previously unknown, and potentially useful information from large databases [1]. Data analysis plays a vital role in our day-to-day life. It is the basis for investigation in many fields of knowledge, from science and engineering to management and process control. Data on a topic of interest are collected in the form of symbolic and numeric attributes, and analysis of the collected data gives a better understanding of the phenomenon of interest. When the development of a knowledge-based system is in its initial (planning) stage, data analysis involves the discovery and generation of new knowledge for building a reliable and comprehensive knowledge base. Data preprocessing is considered an important issue for both data warehousing and data mining, as raw data tend to be incomplete, noisy, and inconsistent. Data preprocessing includes steps such as data cleaning, data integration, data transformation, and data reduction. Data cleaning is applied to make data noise-free and to remove inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformation, such as normalization, may also be applied. Data reduction can reduce the data size through aggregation, elimination of redundant features, or clustering, for instance. With the help of these data preprocessing techniques we can improve the quality of the data and, consequently, of the mining results; the efficiency of the mining process can also be improved.
Data collected from the real world presents different kinds of problems that need to be solved through data pre-processing. For example:
(i) data with missing, out-of-range, or corrupt elements;
(ii) data with noise;
(iii) data from various levels of granularity;
(iv) large data sets, data dependency, and irrelevant data sets;
(v) multiple sources of data.
WHY DATA PREPROCESSING?
Data taken in raw form from the real world is dirty:
Incomplete: missing attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., population="".
Noisy: containing errors or outliers, e.g., Age="-100".
Inconsistent: containing discrepancies in codes or names, e.g., Age="32" while Birthday="03/07/2003", or a ranking that was "1, 2, 3" and is now "A, B, C".
In a well-accepted multi-dimensional view, processed data should possess the following characteristics: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, and accessibility. The important tasks in data pre-processing are listed as follows (Fig. 1):
1) Data cleaning
2) Data integration
3) Data transformation
4) Data reduction
Fig. 1 Adapted from Han J., Kamber M., "Data Mining: Concepts and Techniques", second edition, p. 50.
DATA CLEANING
Real-world data is mostly incomplete, noisy, and inconsistent. Data cleaning is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database [2].
Missing Values
The following methods are useful for filling in missing values:
1. Ignore the tuple: This is usually done when the class label is missing. The method is not very effective unless the tuple contains several attributes with missing values, and it is especially poor when the percentage of missing values per attribute varies considerably [3].
2. Fill in the missing value manually: This approach is time-consuming and may not be feasible for a large data set with many missing values.
3. Use a global constant to fill in the missing value: All missing attribute values are replaced by the same constant, for example a label such as "Unknown".
4. Use the attribute mean to fill in the missing value: The missing value is replaced by the mean of the attribute. For example, a missing salary of a worker can be filled in with the average salary of all workers.
5. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction [3].
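To make these strategies concrete, here is a minimal sketch in Python using pandas; the column names and values are hypothetical, and only strategies 1, 3, and 4 are shown:

```python
import pandas as pd
import numpy as np

# Hypothetical worker records with missing salaries and class labels.
df = pd.DataFrame({
    "worker": ["A", "B", "C", "D"],
    "salary": [50000, np.nan, 62000, np.nan],
    "class_label": ["high", "low", None, "low"],
})

# Strategy 1: ignore the tuple -- drop rows whose class label is missing.
dropped = df.dropna(subset=["class_label"])

# Strategy 3: fill with a global constant such as "Unknown".
constant_filled = df.fillna({"class_label": "Unknown"})

# Strategy 4: fill missing salaries with the attribute mean.
mean_filled = df.fillna({"salary": df["salary"].mean()})
```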
Noisy Data
Noise is a random error or variance in a measured variable [3]. The following data smoothing techniques are used to remove noise.
1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values surrounding it. The sorted values are distributed into a number of "buckets", or bins (Fig. 2).
Fig. 2 Adapted from Han J., Kamber M., "Data Mining: Concepts and Techniques", second edition, p. 63.
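To illustrate, a small sketch of smoothing by bin means in Python, using the sorted prices from the textbook's binning example and equal-depth bins of three values each:

```python
import numpy as np

# Sorted prices, as in the textbook's binning example.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into equal-frequency (equal-depth) bins of 3 values each.
bins = prices.reshape(-1, 3)

# Smoothing by bin means: replace each value by its bin's mean.
smoothed = np.repeat(bins.mean(axis=1), 3)
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```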
2. Regression: Data can be smoothed by fitting the data to a function, such as with regression [3]. Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two attributes are involved.
3. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters". Values that fall outside of the set of clusters may be considered outliers (Fig. 3) [3]. A sketch of both techniques is given after Fig. 3.
Fig. 3 Adapted from Han J., Kamber M., "Data Mining: Concepts and Techniques", second edition, p. 64.
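The following minimal sketch illustrates both techniques in Python, using NumPy for the regression fit and scikit-learn's KMeans for the clustering; all data values and the outlier threshold are invented purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# --- Regression: smooth one attribute by fitting a line to another ---
# Hypothetical paired attributes, e.g. years of experience vs. salary.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 41, 38, 50], dtype=float)
a, b = np.polyfit(x, y, deg=1)   # least-squares "best" line y = a*x + b
y_smoothed = a * x + b           # replace each value by its prediction

# --- Clustering: flag values far from every cluster as outliers ---
# Hypothetical 2-D measurements forming two groups plus one stray point.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [0.9, 1.0],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2], [9.0, 9.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to its assigned cluster centre.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
outliers = X[dist > 2.0]         # threshold chosen purely for illustration
```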
DATA INTEGRATION
Data integration involves combining data residing in different sources and providing users with a unified view of these data [2]. The process becomes significant in a variety of situations, both commercial and scientific.
A number of issues are important during data integration. Schema integration and object matching can be tricky [3]. Redundancy is another important issue: redundancies can be detected by correlation analysis.
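For numeric attributes, a correlation coefficient close to +1 or -1 suggests that one attribute can be derived from the other and is therefore redundant. A quick sketch with pandas, using hypothetical attribute names and values:

```python
import pandas as pd

# Two attributes merged from different sources (hypothetical values).
df = pd.DataFrame({
    "annual_income": [40, 55, 62, 70, 85],
    "monthly_income": [3.3, 4.6, 5.2, 5.8, 7.1],
})

# A correlation near +1 or -1 flags one attribute as likely redundant.
corr = df["annual_income"].corr(df["monthly_income"])
print(f"correlation = {corr:.3f}")
```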
DATA TRANSFORMATION
1. Smoothing: This works to remove noise from the data. Such techniques include binning, regression, and clustering.
2. Aggregation: Summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual totals [1].
3. Generalization of the data: Low-level or "raw" data are replaced by higher-level concepts through the use of concept hierarchies. For example, an attribute like street can be generalized to city or country; similarly, age may be mapped to higher-level concepts such as youth, middle-aged, and senior.
4. Normalization: The attribute data are scaled so as to fall within a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (see the sketch after this list).
5. Attribute construction (or feature construction): New attributes are constructed and added from the given set of attributes to support the mining process.
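As an illustration of normalization (item 4 above), a min-max scaling sketch in Python that maps hypothetical age values onto the range 0.0 to 1.0:

```python
import numpy as np

# Hypothetical age values, including one extreme.
ages = np.array([13, 15, 16, 19, 20, 21, 22, 25, 70], dtype=float)

# Min-max normalization: v' = (v - min) / (max - min) maps onto [0.0, 1.0].
normalized = (ages - ages.min()) / (ages.max() - ages.min())
```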
DATA REDUCTION
Data reduction techniques can be applied to obtain a reduced representation of the data set. The new data set is much smaller in volume yet maintains the integrity of the original data.
Methods for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube [3].
2. Attribute subset selection, where irrelevant or redundant attributes or dimensions are detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges of values or by higher conceptual levels; data discretization is a form of numerosity reduction (see the sketch after this list).
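As an illustration of discretization (item 5 above), raw ages can be replaced by higher-level concept labels with pandas; the cut points and labels are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical raw age values.
ages = pd.Series([18, 25, 33, 47, 52, 68, 71])

# Replace raw values by ranges / concept labels (cut points illustrative).
labels = pd.cut(ages, bins=[0, 29, 59, 120],
                labels=["youth", "middle-aged", "senior"])
```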
USES OF DATA PREPROCESSING
Data preprocessing techniques are useful in OLTP (online transaction processing) and OLAP (online analytical processing). Preprocessing also has applications in association rule algorithms such as Apriori, partition-based algorithms, the Pincer-Search algorithm, and many more. Data preprocessing is a very important stage for data warehousing and data mining, and it is useful in efficient algorithms for mining high-utility itemsets from transactional databases [1].
CONCLUSION
In this paper, we have studied data preprocessing. The paper describes the general steps of data mining and then presents the common data preprocessing techniques; uses of data preprocessing are also discussed. Although numerous methods of data preprocessing have been developed, data preprocessing remains an active area of research owing to causes such as the huge amount of inconsistent or dirty data and the complexity of the problem.
REFERENCES
1. Vincent S. Tseng, Bai-En Shie, Cheng-Wei Wu, and Philip S. Yu (2013), "Efficient Algorithms for Mining High Utility Itemsets from Transactional Databases", IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 8.
2. Wei Jianping, "Research on Data Preprocessing in Supermarket Customers Data Mining".
3. Han J., Kamber M., "Data Mining: Concepts and Techniques", second edition.
4. Famili A., Shen W. M., Weber R., Simoudis E. (1997), "Intelligent Data Analysis", Elsevier, Vol. 1, Issues 1–4, pp. 3–23.
5. http://europa.eu/eurovoc
6. Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996), "From Data Mining to Knowledge Discovery in Databases".
7. O'Brien, J. A., & Marakas, G. M. (2011), Management Information Systems, New York, NY: McGraw-Hill/Irwin.
8. Zhu, Xingquan; Davidson, Ian (2007), Knowledge Discovery and Data Mining: Challenges and Realities, New York, NY: Hershey, pp. 163–189.
9. Agrawal, Rakesh; Mannila, Heikki; Srikant, Ramakrishnan; Toivonen, Hannu; Verkamo, A. Inkeri (1996), "Fast Discovery of Association Rules", in Advances in Knowledge Discovery and Data Mining, MIT Press, pp. 307–328.