International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 10, October 2014)
An Analytical and Comparative Study of Various Data Preprocessing Method in Data Mining
Kamlesh Kumar Pandey1, Narendra Pradhan2
1 MCRPV, Amarkantak Campus, Madhya-Pradesh, India
2 SS College of Education, Pendra Road, Chhattisgarh, India
Abstract: Data preprocessing is a necessary and critical step of the data mining process, also called knowledge discovery in databases (KDD). Its basis is preparing data so that they are accurate, reliable and of good quality. Data preprocessing is used not only for data mining but also for constructing data warehouses, the WWW, etc. If the data are not prepared, the mining results are incomplete and cannot support decisions, which is why preprocessing is important. Data preprocessing is the basis of the later data mining steps: association rules, classification, prediction, pattern evaluation and knowledge presentation. This paper presents the different methods of data preprocessing, each of which solves a specific problem. It gives an analytical study of how the methods work and which type of data suits which method, and it compares the methods on parameters such as input, output and complexity of preprocessing.
Keywords: Data mining, Data cleaning, Data integration, Data Transformation, Data reduction, Data Discretization and Summarization, Type of data.
I. INTRODUCTION
Data mining refers to extracting interesting information or knowledge from large amounts of data. Alternative names for data mining are knowledge discovery (mining) in databases (KDD), knowledge extraction and data/pattern analysis. KDD consists of an iterative sequence of seven steps, each with its specific methods. The first four steps, cleaning, integration, transformation (which includes the reduction and discretization methods) and selection of data, form data preprocessing. The last three steps, data mining, pattern evaluation and knowledge presentation, form the mining process.
In the real world, data of different or similar types are collected from different sources into one place. Some of these data contain many irregularities in the form of missing data, noise, inconsistencies or even outliers. Such data do not provide quality information. If the data lack quality, data mining tasks such as data analysis, pattern recognition and decision management do not give an optimal solution.
In data mining, data preprocessing techniques are required to obtain quality data by removing such irregularities before the data mining task is carried out. Each data preprocessing method offers a number of techniques that remove specific irregularities found in real-world data. The basic data preprocessing methods are data cleaning, data integration and transformation, data reduction, and discretization and summarization of data.
Fig 1: Data Mining and Data Preprocessing
Fig 1 shows the relationship between data mining and data preprocessing. Data preprocessing methods are a basic need both when constructing an information repository and at data mining time.
In the data mining step, pattern evaluation and the data mining engine use the knowledge base for guidance in searching for and evaluating interesting patterns.
Data preprocessing methods and techniques are very useful for constructing data warehouses, the World Wide Web and large databases, because accurate, high-quality data must be stored in them. They are helpful in OLTP (online transaction processing), OLAP (online analytical processing) and any data mining technique or method, such as classification and clustering. They also help advanced data mining applications such as stream mining, bio-data mining, text mining and web mining, because every kind of mining needs accurate, quality data.
II. WORKING PROCEDURE OF DATA PREPROCESSING METHODS
In this section we analyse and describe how data preprocessing works. The working procedure of data preprocessing includes various types of methods and techniques. The flow chart of the data preprocessing method (Fig 2) shows which steps are used to prepare the data and how data preprocessing works.
A. Data cleaning
Data cleaning is the first step of the data preprocessing method. Cleaning the data is one of the biggest problems in constructing data warehouses and in data mining, because real-world data are very dirty: they are noisy, have missing values in tuples and are inconsistent. If a mining technique is applied to this type of data, the result is unreliable and of poor quality, so the data need to be cleaned before data mining. Data cleaning methods clean the data by smoothing noisy data, filling in missing values, correcting inconsistencies and identifying or removing outliers.
The different data cleaning techniques and their examples are listed in Table-1.
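As a minimal sketch of filling in missing values, the following Python fragment fills a numerical attribute with its mean and a categorical attribute with its most repeated value. The attribute names and values (ages, cities) are hypothetical illustrations, and a missing value is assumed to be represented by None.

```python
from statistics import mean, mode

def fill_with_mean(values):
    """Replace None entries with the mean of the observed numeric values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]

def fill_with_mode(values):
    """Replace None entries with the most repeated observed value."""
    observed = [v for v in values if v is not None]
    most_repeated = mode(observed)
    return [most_repeated if v is None else v for v in values]

# Hypothetical attributes with one missing value each.
ages = [20, None, 30, 40]
cities = ["Pune", None, "Pune", "Agra"]

filled_ages = fill_with_mean(ages)        # the gap receives the mean, 30
filled_cities = fill_with_mode(cities)    # the gap receives "Pune"
```

Both functions assume at least one observed value. A filled-in value is only an estimate and may not be correct.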
Fig 2: - Flow chart of Data Preprocessing
Table-1
Technique and example for Data Cleaning
B. Data Integration
Integrating different types of data, attributes and schemas is one of the biggest problems in constructing data warehouses and in data mining, because real-world data are held in different locations. If a mining or analysis technique is applied to data scattered in this way, it takes more time and the resulting decisions are unreliable and of poor quality. The data from multiple sources therefore need to be combined into one place or one database by means of data integration, attribute integration and schema integration. The sources may be multiple databases, flat files and data cubes.
The different data integration techniques and their examples are listed in Table-2.
Table-1 (Data Cleaning)

Form of data: A. Missing values (the filled-in value may not be correct; technique 5 is the most popular strategy)

1. Ignore the tuple
   Technique: The tuple with the unfilled value is removed. This is an inefficient method.
   Example: If any attribute value of a tuple is blank, the tuple is ignored.
2. Fill in the missing value manually
   Technique: The value is filled in by hand. This is time consuming and not efficient for a large data set.
   Example: If any attribute value of a tuple is blank, the most suitable value for this attribute is assumed and filled in.
3. Use a global constant to fill in the missing value
   Technique: All missing values of an attribute are replaced by the same constant.
   Example: One constant value is set for a specific attribute and filled into every tuple in which that attribute is blank.
4. Use the attribute mean to fill in the missing value (for numerical data)
   Technique: The mean of the attribute is computed and filled into the tuples in which the attribute is unfilled.
   Example: The mean of a specific numerical attribute is found and filled into every tuple in which that attribute is blank.
5. Use the most probable value to fill in the missing value
   Technique: The most probable value is filled in. It can be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
   Example: The most repeated value of the specified attribute is found and filled into the tuples in which that attribute is blank.

Form of data: B. Noisy data

1. Binning (for numerical data)
   Technique: The values are distributed into a number of bins. Each value in a bin is then replaced by the mean value of the bin. This is called smoothing by bin means.
   Example: Data for price (in dollars): 4, 8, 15, 21.
   Partition into (equal-frequency) bins: Bin 1: 4, 8; Bin 2: 15, 21.
   Smoothing by bin means: Bin 1: 6, 6; Bin 2: 18, 18.
2. Regression (for numerical data)
   Technique: The best line is found to fit two attributes, so that one attribute can be used to predict the other. It is a statistical methodology that uses the straight-line function y = b + wx, where x and y are attribute values and b and w are regression coefficients.
   Example: In salary data, X (years of experience): 1, 3, 3; Y (salary in $1000s): 20, 30, 36. Calculating on these data, we find that the point (3, 36) is not covered by the straight line.
3. Clustering
   Technique: The technique considers the data tuples and depends on the nature of the data. Similar data are grouped into a cluster, and dissimilar data fall outside the cluster. The quality of a cluster is represented by its diameter, the highest distance between any two objects in the cluster.
   Example: In human activity data, early in childhood we learn how to differentiate between cats and dogs, or between animals and plants, by continuously improving subconscious clustering.
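The two numerical examples for noisy data in Table-1 can be reproduced with a short Python sketch. It assumes equal-frequency bins whose size divides the number of values evenly, and it assumes the regression line is chosen by ordinary least squares. The table does not state how the line is fitted, so that choice is an illustration only.

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning followed by smoothing by bin means."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

def fit_line(xs, ys):
    """Least-squares estimates of b and w in y = b + w*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)  # assumes the x values are not all equal
    w = s_xy / s_xx
    b = y_bar - w * x_bar
    return b, w

# Price data from Table-1: bins (4, 8) and (15, 21) become (6, 6) and (18, 18).
smoothed_prices = smooth_by_bin_means([4, 8, 15, 21], n_bins=2)

# Salary data from Table-1: experience in years against salary in $1000s.
experience = [1, 3, 3]
salary = [20, 30, 36]
b, w = fit_line(experience, salary)
residuals = [y - (b + w * x) for x, y in zip(experience, salary)]
```

For the salary data the least-squares line is approximately y = 13.5 + 6.5x. It passes through (1, 20) and leaves a residual of about +3 at (3, 36), so that point lies off the fitted line.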
Table-2
Technique and example for Data Integration

1. Data integration
   Technique: Data from multiple sources are combined, as in data warehousing and the WWW.
   Example: cust_id is an attribute in one database and cust_number is an attribute in another database. The data analyst checks the values in both databases. If they match, the attributes are integrated; otherwise they are not integrated.
2. Schema integration
   Technique: This technique identifies and removes the entity identification problem and the redundancy problem. Attributes of a similar type are combined into one attribute. Metadata can be used to resolve entity identification, and correlation analysis can be used to remove redundancy.
   Example: cust_id is an attribute in one database and cust_number is an attribute in another database. The data analyst checks the metadata of both attributes. If they match, the attributes are integrated; otherwise they are not integrated.
3. Detecting and resolving data value conflicts
   Technique: This technique is used when attribute values differ although the concept and structure are the same. Functional dependencies, referential constraints, and scaling or encoding are used to remove this type of problem, and the lower level of abstraction of the database is checked.
   Example: cust_id is an attribute in one database and cust_number is an attribute in another database. The data analyst checks the metadata in both databases. If they do not match, the lower level of abstraction and the functional dependencies are checked, or normalization is carried out. After that the structure of the metadata is modified and the attributes are integrated.
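The cust_id / cust_number example can be sketched in Python as follows. The function names, the 0.9 overlap threshold and the sample records are hypothetical. Comparing the sets of values is only one possible way for an analyst to decide whether two attributes refer to the same entity.

```python
def values_match(column_a, column_b, threshold=0.9):
    """Return True when two columns share most of their distinct values."""
    set_a, set_b = set(column_a), set(column_b)
    union = set_a | set_b
    if not union:
        return False
    overlap = len(set_a & set_b) / len(union)  # Jaccard similarity
    return overlap >= threshold

def integrate(db_a, db_b, key_a, key_b):
    """Merge two lists of records, treating key_a and key_b as the same attribute."""
    merged = {}
    for row in db_a:
        merged[row[key_a]] = dict(row)
    for row in db_b:
        key = row[key_b]
        rest = {k: v for k, v in row.items() if k != key_b}
        merged.setdefault(key, {key_a: key}).update(rest)
    return list(merged.values())

# Hypothetical source databases.
sales_db = [{"cust_id": 1, "name": "Asha"}, {"cust_id": 2, "name": "Ravi"}]
billing_db = [{"cust_number": 1, "city": "Pune"}, {"cust_number": 2, "city": "Agra"}]

ids = [row["cust_id"] for row in sales_db]
numbers = [row["cust_number"] for row in billing_db]

# Integrate only if the values of the two attributes match.
customers = integrate(sales_db, billing_db, "cust_id", "cust_number") if values_match(ids, numbers) else None
```

In this sketch the integrated record keeps the name cust_id for the shared attribute. The checks on metadata and functional dependencies described in Table-2 are not modelled.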
C. Data Transformation
Transforming different types of data and schemas from one format to another is one of the largest problems in data mining and in constructing data warehouses, the WWW, etc., because real-world data are available in different formats and languages. If a mining or analysis technique is applied to such data in data warehouses or on the WWW, it takes more time and more space, and the resulting decision is delayed and of poor quality. The data therefore need to be transformed from one format to another by means of smoothing, aggregation, generalization and normalization techniques.
The different data transformation techniques and their examples are listed in Table-3.
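As an illustration of normalization, the sketch below implements the three commonly used scalings, min-max normalization, z-score normalization and decimal scaling, in plain Python. The income figures are hypothetical, and the z-score uses the population standard deviation, which is an assumption about the convention.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into [new_min, new_max]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and divide by the population standard deviation (assumed nonzero)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer that makes every |value| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [200, 300, 400, 600, 1000]   # hypothetical attribute values

scaled = min_max(incomes)              # the smallest value maps to 0.0, the largest to 1.0
standardized = z_score(incomes)        # result has mean 0 and standard deviation 1
shifted = decimal_scaling(incomes)     # j = 4 here, so 1000 becomes 0.1
```

Min-max scaling to [0.0, 1.0] matches the range mentioned for classification with ANN; the other two scalings are alternatives when the minimum and maximum are unknown or dominated by outliers.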
D. Data Reduction
A database or data warehouse may store a large amount of data (on the order of terabytes), and the day-by-day growth of data size is a challenging problem. Both mined data and ordinary data are stored in databases, data warehouses, the WWW, etc. If a mining or analysis technique is applied to such data, it takes a long time and the decision or result becomes complex. A reduced data volume is obtained with the help of data cube aggregation, dimensionality reduction, binning and sampling techniques.
The different data reduction techniques and their examples are listed in Table-4.
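Sampling, one of the reduction techniques named above, can be sketched with Python's random module. The function names and the age_group records are hypothetical. The sketch covers simple random sampling without replacement (SRSWOR), with replacement (SRSWR) and a stratified sample.

```python
import random

def srswor(data, s, seed=None):
    """Simple random sample of s tuples without replacement (requires s <= len(data))."""
    return random.Random(seed).sample(data, s)

def srswr(data, s, seed=None):
    """Simple random sample of s tuples with replacement; a tuple may be drawn twice."""
    return random.Random(seed).choices(data, k=s)

def stratified(data, key, per_stratum, seed=None):
    """Draw up to per_stratum tuples without replacement from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for row in data:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for rows in strata.values():
        sample.extend(rng.sample(rows, min(per_stratum, len(rows))))
    return sample

# Hypothetical data set D with N = 90 tuples in three age groups.
D = [{"id": i, "age_group": g} for i, g in enumerate(["young", "middle_age", "senior"] * 30)]

without_replacement = srswor(D, 10, seed=0)
with_replacement = srswr(D, 10, seed=0)
one_per_group = stratified(D, key=lambda row: row["age_group"], per_stratum=1, seed=0)
```

In SRSWOR and SRSWR every tuple has the same probability of being drawn; the stratified variant instead guarantees that each group is represented.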
Table-3
Technique and example for Data Transformation

1. Aggregation (for numerical data)
   Technique: A summary of the data in a specific attribute is calculated. This technique is useful in constructing a data cube for analysis at multiple granularities.
   Example: The values of the Daily_sales attribute may be aggregated so as to compute monthly_sales and annual_sales amounts.
2. Generalization
   Technique: Low-level data are replaced by higher-level data through concept hierarchies. This technique is useful for categorical and numerical attributes.
   Example: An address attribute such as street can be generalized to a higher-level concept such as city, then to state, and then to country.
3. Normalization
   Technique: Normalization scales the data to be analyzed to a specific range such as [0.0, 1.0]. This technique is useful for the ANN classification algorithm. Min-max normalization, z-score normalization and decimal scaling are used as normalization techniques.
   Example: The values of the Score_pretences attribute are normalized to a graded value.
4. Attribute construction
   Technique: New attributes are constructed from the given set of attributes and added to it. This technique is useful for feature construction.
   Example: The attribute area is added to the database, calculated from the attributes height and width.

Table-4
Technique and example for Data Reduction

Form of reduction: 1. Number of attributes

1. Data cube aggregation
   Technique: This technique is useful in constructing a data cube. Data cubes store multidimensional aggregated information. Each cell holds an aggregate data value corresponding to a point in multidimensional space.
   Example: Constructing a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each shop_branch table/database.
2. Dimensionality reduction
   Technique: Data encoding techniques are applied so as to obtain a reduced (compressed) form of the data. The reduction is lossless if the original data can be recovered from the compressed data without losing information, and lossy otherwise. Wavelet transforms and principal components analysis are examples of such encoding techniques.
   Example: When data are uploaded to the internet or communicated from one place to another, a large database is compressed into a small volume.
3. Attribute subset selection
   Technique: Irrelevant, weakly relevant and redundant attributes are detected and removed, which reduces the size of the data set, by using a greedy algorithm or decision tree induction. This technique finds a minimum set of attributes.
   Example: When classifying customers according to which products are popular in a shop_branch table/database, or whether they will respond when notified of a sale, attributes such as the customer's phone number are likely to be irrelevant, unlike attributes such as age.

Form of reduction: 2. Attribute values

   Technique: The attribute values are reduced by using binning, clustering, and aggregation or generalization techniques.
   Example: See the examples for data cleaning and data transformation.
Form of reduction: 3. Number of tuples

1. Sampling
   Technique: Sampling allows a large data set to be represented by a much smaller random sample of the data. Some sampling techniques are SRSWOR, SRSWR, cluster sample and stratified sample. Suppose that a large data set, D, contains N tuples. An SRSWOR sample is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple is the same.
   Example: For an age attribute, each similar type of data (young, middle_age, senior) is held only once.

E. Data Discretization and Summarization
Data discretization is a sub-form of the data reduction method. Handling the loss of information when the data volume is reduced is a challenging problem in data reduction. The technique is very useful for the automatic generation of concept hierarchies, and it is used only for continuous data. Data discretization techniques can be used to reduce the number of values of a given continuous attribute by dividing the range of the attribute into intervals and listing the data by category. Data discretization may make use of class information when it is applied to the data.
The different data discretization techniques and their examples are listed in Table-5.

Table-5
Technique and example for Data Discretization

1. Supervised discretization
   Technique: The discretization process uses the class information of the tuples in the calculation and determination of split-points. Entropy-based discretization is a supervised, top-down technique.
   Example: Tuples that have the same class label are grouped together. For instance, if the item_type attribute has the two values apple and banana and their class is fruit, these tuples are combined into one interval.
2. Unsupervised discretization
   Technique: This is a top-down approach in which the discretization process does not use the class information of the tuples, and the attribute values are divided into fixed partitions. Binning and histogram analysis are unsupervised discretization techniques.
   Example: The values of a price attribute are partitioned into equal-sized partitions or ranges, or into partitions that each contain the same number of tuples, and the values in each partition are combined.
3. Merging
   Technique: The best neighboring intervals are found and then merged to form larger intervals, recursively. It is a supervised, bottom-up approach. χ²-based discretization is a merging technique.
   Example: Merging continues until the χ² values of all pairs of adjacent intervals exceed some threshold, which is determined by a specified significance level. It is used on training data in ANN classification, with a max-interval and a min-interval as a form of threshold value.
4. Cluster analysis
   Technique: This can follow either a top-down or a bottom-up approach. It is applied to discretize a numerical attribute, and clustering can be used to generate a concept hierarchy.
   Example: The values are partitioned into clusters, and similar data are grouped together.
5. Concept hierarchy generation
   Technique: Concept hierarchies can be used to reduce the data by collecting low-level concepts and replacing them with higher-level concepts. Discretization can be performed recursively on an attribute to give a hierarchical partitioning of the attribute values, called a concept hierarchy. This technique is used for numerical and categorical data.
   Example: An address attribute such as street can be generalized to a higher-level concept such as city, then to state, and then to country.
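A single step of entropy-based (supervised) discretization can be sketched as follows. The sketch searches the candidate split-points of one numerical attribute for the one that minimizes the class-weighted entropy of the two resulting intervals. The ages and class labels are hypothetical, and a full discretizer would apply the step recursively with a stopping rule that is not shown here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, weighted_entropy) of the best binary split, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        split_point = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint of neighbours
        if best is None or info < best[1]:
            best = (split_point, info)
    return best

# Hypothetical training tuples: age with a class label.
ages = [22, 25, 30, 41, 52, 60]
classes = ["young", "young", "young", "senior", "senior", "senior"]

split_point, info = best_split(ages, classes)
```

Here the split at 35.5 separates the two classes completely, so the weighted entropy of the resulting intervals is zero. This is the sense in which supervised discretization uses class information to determine split-points.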
International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 10, October 2014)
III. COMPARATIVE ANALYSIS OF ALL METHODS
Table-6
Comparison of Data Preprocessing Methods on Various Parameters

Parameter: Input
   Data cleaning: incomplete data; noisy data; inconsistent data.
   Data integration: different sources of the same data/data set; entity identification problem; redundant data.
   Data transformation: different types and forms of data; data of multiple granularities; wrong (noisy) data.
   Data reduction: large record sets; high-dimensional data sets; large data volume.
   Data discretization: large sets of continuous data and the range of the attribute; non-categorized data.

Parameter: Output
   Data cleaning: quality data; reliable data; complete data.
   Data integration: provides metadata; data conflict detection; quality data with care taken.
   Data transformation: normalized data; summarized data; compressed data; corrected wrong data; feature construction.
   Data reduction: reduced data with unwanted things removed from the data set; minimized loss of information content; reduced data volume.
   Data discretization: values of a continuous attribute replaced by interval labels; categorized data; reduced data volume; simplified data set.

Parameter: Strategy
   Data cleaning: filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
   Data integration: integration of many databases, several data cubes and multiple files.
   Data transformation: the data are transformed from one form to another, that is, into the most readable form.
   Data reduction: obtains a representation that is reduced in volume but produces the same or similar analytical results.
   Data discretization: part of data reduction, but applied to numerical data; continuous values are divided into intervals.

Parameter: Technique
   Data cleaning: ignore the record; fill in the missing value manually; use a global constant; attribute mean; binning; regression; clustering.
   Data integration: data integration; schema integration; correlation analysis of categorical data using χ².
   Data transformation: smoothing; aggregation; generalization; normalization; attribute construction.
   Data reduction: data cube aggregation; attribute subset selection; dimensionality reduction; sampling; binning; clustering.
   Data discretization: entropy-based discretization; binning; histogram analysis; χ²-based discretization; concept hierarchy generation.

Parameter: Size of output data
   Data cleaning: large, high-dimensional data set.
   Data integration: very large, very high-dimensional data set.
   Data transformation: very large, very high-dimensional data set.
   Data reduction: smaller data set of smaller dimension.
   Data discretization: smaller data set of smaller dimension.

Parameter: Complexity of preprocessing (challenges)
   Data cleaning: identifying outliers; accuracy and quality of data; filling in data.
   Data integration: semantic heterogeneity and structure of data; redundancies and inconsistencies; accuracy, speed and quality of data.
   Data transformation: accuracy and understanding of structure in high-dimensional data; relationships between data attributes.
   Data reduction: integrity of the original data; complex data analysis; mining on large amounts of data; minimizing the loss of information content.
   Data discretization: replacing numerous values of a continuous attribute by a fixed number of interval labels; making generalized data meaningful and easier to interpret; classification accuracy and quality.

Parameter: Extra memory required by preprocessing
   Data cleaning: yes.
   Data integration: no.
   Data transformation: yes.
   Data reduction: yes.
   Data discretization: yes.

Parameter: Type of data
   Data cleaning: any type of data (numeric and char data are more suitable).
   Data integration: any type of data.
   Data transformation: numeric and char type data.
   Data reduction: numeric and char type data.
   Data discretization: numeric (may be char data).

Parameter: Problems remaining after preprocessing
   Data cleaning: different locations, volume of database, different forms of data, continuous values.
   Data integration: volume of database, different forms of data, continuous values.
   Data transformation: volume of database, continuous values.
   Data reduction: continuous values.
   Data discretization: none (prepared data).
IV. CONCLUSIONS
This paper discusses the data preprocessing methods, with their comparison and examples. The main aim of data preprocessing is to provide quality data for any type of mining, such as data mining, text mining and web mining. Prepared data help to reduce the number of disk accesses and the input and output costs. Data cleaning methods clean noisy data, complete incomplete data and remove unnecessary data. Data integration methods integrate different sources of data into one place. Data transformation methods change the form of the data. Data reduction reduces the volume of the database by aggregation, dimensionality reduction and sampling, and data discretization reduces the volume of data by replacing values with intervals. Every method depends on a specific type of data, such as numeric data, alphanumeric data or string data. We have compared the various data preprocessing methods on the basis of factors such as type of data, input, output, memory required, working concept and complexity.