International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: [email protected]
Volume 4, Issue 2, March-April 2015
ISSN 2278-6856
Data Cleaning Using Clustering Based Data Mining Technique

Sujata Joshi 1, Usha.T 2

1 Assistant Professor, Dept. of CSE, Nitte Meenakshi Institute of Technology, Bangalore, India.
2 Research Scholar, M.Tech, Dept. of CSE, Nitte Meenakshi Institute of Technology, Bangalore, India.
Abstract
Data cleaning is one of the basic tasks performed during knowledge discovery in databases, during the modification and integration of database schemas, and in the creation of data warehouses. Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality. In this paper, data quality problems are summarized, and an algorithm based on a data mining technique is implemented for data standardization and correction.
Keywords: Data Cleaning, Data quality problems,
Attribute correction, Levenshtein distance.
1. INTRODUCTION
Data plays a fundamental role in every software system; information systems and decision support systems in particular depend on it heavily. Data quality is a crucial factor in data warehouse [1] creation and data integration.
Data cleaning or scrubbing is the process of removing errors from data. It is an inherent activity related to database processing, updating and maintenance. Data fed from the various operational systems of the different departments and sub-departments of an organization have discrepancies in schemas, formats, semantics, etc., due to numerous factors. These representations may introduce redundancy leading to exact duplicates of records [2], inconsistencies where records differ in schemas, formats, abbreviations, etc., and, lastly, erroneous data. All such unwanted data records are referred to as 'dirty data'.
Our approach focuses on the correction of errors in an attribute. Here, we present a brief overview of the various sources of errors that arise due to machine or human intervention and a summary of data quality problems. In addition, an algorithm based on clustering is implemented for data correction and standardization.
2. SOURCES OF ERRONEOUS DATA
The sources of erroneous data are:
1. Lexical errors [3] denote discrepancies between the structure of the data items and the specified format.
2. Syntactical errors represent violations of the overall format.
3. Irregularities are concerned with the non-uniform use of values and abbreviations.
4. Duplicates [4] [5] are two or more tuples
representing the same entity from the real world. The
values of these tuples need not be entirely identical.
Inexact duplicates represent the same entity but with
different values for all or some of its attributes.
5. Missing values are the result of omissions while
collecting the data.
6. Data entry anomalies are errors that occur while the user is entering data into the data pool.
3. DATA QUALITY
Data quality [6] is a state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. High-quality data needs to pass a set of quality criteria. The hierarchy of data quality is shown in Figure 1.
4. DATA QUALITY PROBLEMS
Data quality problems [7] are present in single data
collections, such as files and databases, e.g. due to
misspellings during data entry, missing information or
other invalid data. When multiple data sources need to be
integrated, e.g., in data warehouses, federated database
systems or global web-based information systems, the
need for data cleaning increases significantly. This is
because the sources often contain redundant data in
different representations.
This section classifies the major data quality problems to be solved by data cleaning and data transformation. A rough distinction is made between single-source and multi-source problems, and between schema-related and instance-related problems.
Schema-level problems [8] are also reflected in the instances; they can be addressed at the schema level by improved schema design (schema evolution), schema translation and schema integration.
Instance-level problems [9], on the other hand, refer to
errors and inconsistencies in the actual data contents
which are not visible at the schema level. They are the
primary focus of data cleaning.
Figure 2 shows the categorization of data quality problems
in data sources.
The data quality problems for single source at schema
level and instance level are illustrated with examples in
Table 1.
Table 1: Examples of single-source problems at schema and instance level

Problem class | Example
Detection of uniqueness | Name="John Smith", SSN="158739"; Name="Kowalski", SSN="158739"
Detection of invalid references (referential integrity violation) | Name="John Smith", DepartmentId=14; Name="Kowalski", DepartmentId=16
Detection of misspellings | Name="John Smith", City="Germany"; Name="Kowalski", City="Germaany"
Detection of duplicate values | Name="John Smith", Born="1978"; Name="J. Smith", Born="1978"
Detection of invalid values | Name="John Smith", Bdate="28-9-1991"; Name="Kowalski", Bdate="8-13-2015"
Detection of inconsistent values | Bdate="28-9-1991", Age="23"; Bdate="8-2-1981", Age="60"
Detection of missing values | Name="John Smith", Phone="9999-999999"
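To make such instance-level checks concrete, the following minimal Python sketch tests two of the rules from Table 1 (uniqueness of SSN and consistency between birth date and age). The record layout and field names are illustrative assumptions, not part of the original paper.

```python
from collections import Counter
from datetime import date

# Illustrative records and field names; they mirror the examples in Table 1
# but are assumptions made for this sketch.
records = [
    {"name": "John Smith", "ssn": "158739", "bdate": "28-9-1991", "age": 23},
    {"name": "Kowalski",   "ssn": "158739", "bdate": "8-2-1981",  "age": 60},
]

def duplicate_ssns(recs):
    """Uniqueness check: SSN values shared by more than one record."""
    counts = Counter(r["ssn"] for r in recs)
    return [ssn for ssn, n in counts.items() if n > 1]

def inconsistent_age(rec, today=date(2015, 4, 1)):
    """Consistency check: the stated age should match the age derived from Bdate."""
    day, month, year = (int(p) for p in rec["bdate"].split("-"))
    born = date(year, month, day)
    derived = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    return derived != rec["age"]

print(duplicate_ssns(records))                               # ['158739']
print([r["name"] for r in records if inconsistent_age(r)])   # ['Kowalski']
```

Analogous predicates could be written for the remaining problem classes, e.g. format checks for invalid dates or placeholder detection for missing phone numbers.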
Table 2 and Table 3 show examples of multi-source problems at schema and instance level. The two sources are both in relational format but exhibit schema and data conflicts.
At the schema level, there are name conflicts (synonyms
Customer/Client, Cid/Cno, Sex/Gender) and structural
conflicts (different representations for names and
addresses). At the instance level, we note that there are
different gender representations (“0”/”1” vs. “F”/”M”)
and a duplicate record (John Smith). Solving these
problems requires both schema integration and data
cleaning; Table 4 shows a possible solution.
Table 2: Customer (source 1)

CID | Name        | Street      | City       | Sex
214 | John Smith  | 2 Harley Pl | South Fork | 1
461 | Mary Thomas | Harley St 2 | S Fork     | 0

Table 3: Client (source 2)

Cno | LastName | FirstName | Gender | Address
153 | Smith    | Kowalski  | M      | 23 Harley Street, Chicago
186 | Smith    | John      | M      | 2 Harley Place, South Fork
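As a sketch of how such schema and value conflicts might be resolved programmatically, the illustrative Python code below maps both sources onto one assumed unified schema and normalizes the gender coding; the target field names and the mapping 1 -> M, 0 -> F are inferred from the example rows and are not part of the original paper.

```python
# Assumed coding of Sex in source 1, inferred from the example rows.
GENDER_MAP = {"0": "F", "1": "M", "F": "F", "M": "M"}

def from_customer(row):
    """Map a Customer (source 1) record onto the assumed unified schema."""
    first, _, last = row["Name"].partition(" ")          # naive name split, illustration only
    return {"no": row["CID"], "last_name": last, "first_name": first,
            "gender": GENDER_MAP[str(row["Sex"])],
            "address": "{}, {}".format(row["Street"], row["City"])}

def from_client(row):
    """Map a Client (source 2) record onto the assumed unified schema."""
    return {"no": row["Cno"], "last_name": row["LastName"],
            "first_name": row["FirstName"], "gender": GENDER_MAP[row["Gender"]],
            "address": row["Address"]}

customers = [{"CID": 214, "Name": "John Smith", "Street": "2 Harley Pl",
              "City": "South Fork", "Sex": 1}]
clients = [{"Cno": 186, "LastName": "Smith", "FirstName": "John",
            "Gender": "M", "Address": "2 Harley Place, South Fork"}]

unified = [from_customer(r) for r in customers] + [from_client(r) for r in clients]
# Both John Smith records are now in one representation; recognizing them as
# duplicates is a separate record-matching step (see [4], [5]).
```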
5. METHODOLOGY
In this paper we implement an attribute correction method using a clustering technique. Attribute correction [10] solutions require reference data in order to provide satisfactory results. In this algorithm, all the record attributes are examined and cleaned in isolation, without regard to the values of the other attributes of a given record.
The main idea behind this algorithm is based on the observation that, in most data sets, there is a small number of values having a large number of occurrences and a very large number of values with a very low number of occurrences. Therefore, the most representative values may serve as the source of reference data, while the values with a low number of occurrences are noise or misspelled instances of the reference data. Table 5 shows an attribute and its occurrence frequencies. Since "Asymptomatic" occurs most frequently, it is taken as the reference data set; all the other values are excluded from the reference data because of their low frequency counts.
Table 5: Example of Chest_pain_type attribute distribution

Chest_pain_type | Number of occurrences
Asymptomatic    | 2184
Asmytomatic     | 6
Asmythmatics    | 3
Assymtomatics   | 1
Asympotmatic    | 1
Asymptomac      | 1
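A distribution such as the one in Table 5 can be obtained with a simple occurrence count. In the illustrative Python sketch below (the value list stands in for the Chest_pain_type column and is an assumption of ours), the dominant value emerges as the candidate reference value.

```python
from collections import Counter

# Illustrative raw values standing in for the Chest_pain_type column of Table 5.
values = (["Asymptomatic"] * 2184 + ["Asmytomatic"] * 6 + ["Asmythmatics"] * 3
          + ["Assymtomatics", "Asympotmatic", "Asymptomac"])

counts = Counter(v.lower() for v in values)         # preliminary cleaning: uniform case
reference, freq = counts.most_common(1)[0]          # most frequent value -> reference data
print(reference, freq)                              # asymptomatic 2184
print(sorted(v for v, n in counts.items() if n < freq))  # low-frequency values: likely misspellings
```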
The algorithm uses two parameters:
1. Distance threshold distThresh: the maximum distance between two values that allows them to be marked as similar and related.
2. Occurrence relation occRel: used to determine whether both compared values belong to the reference data set.
To measure the distance between two values, a modified Levenshtein distance is used. The Levenshtein distance [11]-[13] for two strings is the number of text edit operations (insertion, deletion, exchange) needed to transform one string into another. For instance, the Levenshtein distance between "Asymptomatic" and "Asmyptomatic" is 2.
The algorithm for attribute correction utilizes a modified
Levenshtein distance defined as
Lev'(s1, s2) = 1/2 * ( Lev(s1, s2) / ||s1|| + Lev(s1, s2) / ||s2|| )

where Lev(s1, s2) is the Levenshtein distance between s1 and s2, and ||s|| denotes the length of string s.
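The following Python sketch implements the plain and modified Levenshtein distances directly from the definitions above; the function names are ours.

```python
def levenshtein(s1, s2):
    """Number of insertions, deletions and substitutions needed to turn s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion from s1
                            curr[j - 1] + 1,             # insertion into s1
                            prev[j - 1] + (c1 != c2)))   # substitution (0 if chars match)
        prev = curr
    return prev[-1]

def modified_levenshtein(s1, s2):
    """Levenshtein distance normalized by the lengths of both strings (formula above)."""
    d = levenshtein(s1, s2)
    return 0.5 * (d / len(s1) + d / len(s2))

print(levenshtein("Asymptomatic", "Asmyptomatic"))                     # 2
print(round(modified_levenshtein("Asymptomatic", "Asmyptomatic"), 3))  # 0.167
```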
The algorithm consists of the following steps:
1. Preliminary cleaning – all attribute values are transformed into uppercase or lowercase.
2. The number of occurrences of every value in the cleaned data set is calculated.
3. Each value is assigned to a separate cluster. The cluster element having the highest number of occurrences is denoted as the cluster representative.
4. The cluster list is sorted in descending order according to the number of occurrences of each cluster representative.
5. Starting with the first cluster, each cluster is compared with the other clusters from the list in the order defined by the number of occurrences of the cluster representatives. The distance between two clusters is defined as the modified Levenshtein distance between their cluster representatives.
6. If the distance is lower than the distThresh parameter and the ratio of occurrences of the cluster representatives is greater than or equal to the occRel parameter, the clusters are merged.
7. After all the clusters are compared, each cluster is examined to check whether it contains values whose distance from the cluster representative is above the threshold value. If so, these values are removed from the cluster and added to the cluster list as separate clusters.
8. Steps 4-7 are repeated until there are no changes in the cluster list, i.e., no clusters are merged and no new clusters are created.
9. The cluster representatives form the reference data set, and each cluster defines a transformation rule: the values in a given cluster should be replaced with the value of the cluster representative.
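A compact Python sketch of steps 1-9 is given below. It is an outline under our own naming conventions rather than the authors' implementation: each cluster is kept as a list whose first element is the representative, and the distThresh and occRel values are illustrative.

```python
from collections import Counter

def modified_levenshtein(s1, s2):
    """Levenshtein distance normalized by both string lengths (the formula above)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return 0.5 * (prev[-1] / len(s1) + prev[-1] / len(s2))

def attribute_correction(values, dist_thresh=0.35, occ_rel=3.0):
    """Clustering-based attribute correction; a sketch of steps 1-9 above."""
    counts = Counter(v.lower() for v in values)               # steps 1-2
    clusters = [[v] for v in counts]                          # step 3: one cluster per value
    changed = True
    while changed:                                            # step 8: repeat until stable
        changed = False
        clusters.sort(key=lambda c: counts[c[0]], reverse=True)   # step 4
        merged = []
        for cluster in clusters:                              # steps 5-6: merge close clusters
            for target in merged:
                close = modified_levenshtein(target[0], cluster[0]) < dist_thresh
                dominant = counts[target[0]] / counts[cluster[0]] >= occ_rel
                if close and dominant:
                    target.extend(cluster)
                    changed = True
                    break
            else:
                merged.append(cluster)
        clusters = []
        for cluster in merged:                                # step 7: split distant members
            keep = [v for v in cluster if v == cluster[0]
                    or modified_levenshtein(cluster[0], v) < dist_thresh]
            far = [v for v in cluster if v not in keep]
            clusters.append(keep)
            clusters.extend([v] for v in far)
            changed = changed or bool(far)
    # Step 9: each cluster yields the rule "member -> cluster representative".
    return {v: c[0] for c in clusters for v in c if v != c[0]}

rules = attribute_correction(["Asymptomatic"] * 2184 + ["Asmytomatic"] * 6
                             + ["Asmythmatics"] * 3
                             + ["Assymtomatics", "Asympotmatic", "Asymptomac"])
print(rules)
```

With the illustrative parameter values above, this sketch maps the five low-frequency variants to "asymptomatic", i.e., it reproduces the corrections of Table 6 up to the lower-casing performed in step 1.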
Table 6 shows example transformation rules discovered during the execution of the above algorithm.
Table 6: Example of corrected values

Original value | Correct value
Asmytomatic    | Asymptomatic
Asmythmatics   | Asymptomatic
Assymtomatics  | Asymptomatic
Asympotmatic   | Asymptomatic
Asymptomac     | Asymptomatic
6. CONCLUSION
Data cleaning is a key precondition for analysis in decision support systems and for data integration. High data quality is a general requirement in current information system construction; in order to provide access to accurate and consistent data, data cleaning becomes necessary. This paper presents an overview and categorization of data quality problems in single and multiple data sources. We also implemented a clustering-based technique for the standardization and correction of an attribute using the Levenshtein distance. The technique was applied to the data, and the results obtained are shown in Table 6.
REFERENCES
[1] M.L. Lee, H. Lu, T.W. Ling, Y.T. Ko, "Cleansing Data for Mining and Warehousing", Proc. 10th Intl. Conf. on Database and Expert Systems Applications (DEXA), 1999.
[2] Mauricio Hernandez, Salvatore Stolfo, "Real World Data Is Dirty: Data Cleansing and The Merge/Purge Problem", Journal of Data Mining and Knowledge Discovery, 1(2), 1998.
[3] Huang Yu, Zhang Xiao-yi, Yuan Zhen, Jiang Guo-quan, "A Universal Data Cleaning Framework Based on User Model", 2009 ISECS.
[4] H.H. Shahri, S.H. Shahri, "Eliminating Duplicates in Information Integration: An Adaptive, Extensible Framework", IEEE Intelligent Systems, 21(5), Sept.-Oct. 2006, pp. 63-71.
[5] A.E. Monge, "Matching Algorithms within a Duplicate Detection System", IEEE Data Engineering Bulletin, 23(4), 2000.
[6] Paul Jermyn, Maurice Dixon, Brian J. Read, "Preparing Clean Views of Data for Data Mining".
[7] Erhard Rahm, Hong Hai Do, "Data Cleaning: Problems and Current Approaches", IEEE Data Engineering Bulletin, 23(4):3-13, 2000.
[8] KDnuggets Polls, "Data Preparation Part in Data Mining Projects", Sep 30 - Oct 12, 2003. http://www.kdnuggets.com/polls/2003/data_preparation.htm
[9] Y.R. Wang, S.E. Madnick, "The Inter-Database Instance Identification Problem in Integrating Autonomous Systems", Proc. Fifth International Conference on Data Engineering, IEEE Computer Society, Silver Spring, February 6-10, 1989, Los Angeles, California, USA, pp. 46-55.
[10] Lukasz Ciszak, "Application of Clustering and Association Methods in Data Cleaning", 978-83-60810-14-9/08, 2008 IEEE.
[11] M. Bilenko, R.J. Mooney, "Adaptive Duplicate Detection Using Learnable String Similarity Measures", ACM SIGKDD, pp. 39-48, 2003.
[12] A.E. Monge, C.P. Elkan, "The Field Matching Problem: Algorithms and Applications", Proc. 2nd Intl. Conf. on Knowledge Discovery and Data Mining (KDD), 1996.
[13] W. Cohen, P. Ravikumar, S. Fienberg, "A Comparison of String Metrics for Name-Matching Tasks", in Proceedings of IJCAI-2003.
AUTHOR
Sujata Joshi received the B.E. degree in Computer Science and Engineering from B.V.B. College of Engineering and Technology, Hubli, in 1995 and the M.Tech. degree in Computer Science and Engineering from M.S. Ramaiah Institute of Technology, Bangalore, in 2007. She is currently working as an Assistant Professor in the Department of Computer Science and Engineering at Nitte Meenakshi Institute of Technology, Bangalore, and pursuing a Ph.D. in the area of data mining under Visvesvaraya Technological University, Belagavi.