International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org    Email: [email protected]
Volume 4, Issue 2, March-April 2015    ISSN 2278-6856

Data Cleaning Using Clustering Based Data Mining Technique

Sujata Joshi 1, Usha T. 2
1 Assistant Professor, Dept. of CSE, Nitte Meenakshi Institute of Technology, Bangalore, India
2 Research Scholar, M.Tech, Dept. of CSE, Nitte Meenakshi Institute of Technology, Bangalore, India

Abstract
Data cleaning is one of the basic tasks performed during the process of knowledge discovery in databases, during the modification and integration of database schemas, and in the creation of data warehouses. Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality. In this paper, data quality problems are summarized and an algorithm using a data mining technique is implemented for data standardization and data correction.

Keywords: Data Cleaning, Data quality problems, Attribute correction, Levenshtein distance.

1. INTRODUCTION
Data plays a fundamental role in every software system; information systems and decision support systems in particular depend on it deeply. Data quality is a crucial factor in data warehouse creation [1] and data integration. Data cleaning, or scrubbing, is the process of removing errors from data. It is an inherent activity in database processing, updating and maintenance. Data fed from the various operational systems of an organization's departments and sub-departments exhibits discrepancies in schemas, formats, semantics, etc. due to numerous factors. These representations may introduce redundancy, leading to exact duplicates of records [2]; inconsistency, where records differ in schemas, formats, abbreviations, etc.; and erroneous data. All such unwanted records are referred to as "dirty data". Our approach focuses on the correction of errors in an attribute. We present a brief overview of the sources of errors that arise from machine or human intervention and a summary of data quality problems, and we implement an algorithm based on clustering for data correction and standardization.

2. SOURCES OF ERRONEOUS DATA
The sources of erroneous data are:
1. Lexical errors [3]: discrepancies between the structure of the data items and the specified format.
2. Syntactical errors: violations of the overall format.
3. Irregularities: non-uniform use of values and abbreviations.
4. Duplicates [4][5]: two or more tuples representing the same real-world entity. The values of these tuples need not be entirely identical; inexact duplicates represent the same entity but differ in some or all of their attribute values.
5. Missing values: the result of omissions while collecting the data.
6. Data entry anomalies: errors that occur while the user is entering data into the data pool.

3. DATA QUALITY
Data quality [6] is a state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. High quality data needs to pass a set of quality criteria. The hierarchy of data quality is shown in Figure 1.

4. DATA QUALITY PROBLEMS
Data quality problems [7] are present in single data collections, such as files and databases, e.g.
due to misspellings during data entry, missing information or other invalid data. When multiple data sources need to be integrated, e.g. in data warehouses, federated database systems or global web-based information systems, the need for data cleaning increases significantly, because the sources often contain redundant data in different representations. This section classifies the major data quality problems to be solved by data cleaning and data transformation. They are roughly distinguished as single-source and multi-source problems, and as schema-level and instance-level problems. Schema-level problems [8] are reflected in the instances; they can be addressed at the schema level by improved schema design (schema evolution), schema translation and schema integration. Instance-level problems [9], on the other hand, refer to errors and inconsistencies in the actual data contents which are not visible at the schema level. They are the primary focus of data cleaning. Figure 2 shows the categorization of data quality problems in data sources. The data quality problems for a single source at the schema level and instance level are illustrated with examples in Table 1.

Table 1: Examples of single-source problems at schema and instance level
  Detection of uniqueness violations:
    Name="John Smith", SSN="158739"; Name="Kowalski", SSN="158739"
  Detection of invalid references (referential integrity violations):
    Name="John Smith", DepartmentId=14; Name="Kowalski", DepartmentId=16
  Detection of misspellings:
    Name="John Smith", City="Germany"; Name="Kowalski", City="Germaany"
  Detection of duplicate values:
    Name="John Smith", Born="1978"; Name="J. Smith", Born="1978"
  Detection of invalid values:
    Name="John Smith", Bdate="28-9-1991"; Name="Kowalski", Bdate="8-13-2015"
  Detection of inconsistent values:
    Bdate="28-9-1991", Age="23"; Bdate="8-2-1981", Age="60"
  Detection of missing values:
    Name="John Smith", Phone="9999-999999"

Table 2 and Table 3 show examples of multi-source problems at the schema and instance level. The two sources are both in relational format but exhibit schema and data conflicts. At the schema level there are name conflicts (synonyms Customer/Client, Cid/Cno, Sex/Gender) and structural conflicts (different representations for names and addresses). At the instance level, we note different gender representations ("0"/"1" vs. "F"/"M") and a duplicate record (John Smith). Solving these problems requires both schema integration and data cleaning; Table 4 shows a possible solution.

Table 2: Customer (source 1)
  CID   Name          Street        City         Sex
  214   John Smith    2 Harley Pl   South Fork   1
  461   Mary Thomas   Harley St 2   S Fork       0

Table 3: Client (source 2)
  Cno   LastName   FirstName   Gender   Address
  153   Smith      Kowalski    M        23 Harley Street, Chicago
  186   Smith      John        M        2 Harley Place, South Fork
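Many of the instance-level problems listed in Table 1 can be detected with simple rule-based checks before any correction is attempted. The following Python sketch, written for this article, illustrates a few of them on hypothetical records with field names name, ssn, bdate, age and phone; it is only an illustration of the problem classes above, not part of the method implemented in this paper.

```python
from datetime import date, datetime

# Hypothetical placeholder values sometimes used to hide missing data.
DUMMY_PHONES = {"9999-999999"}

def check_record(record, today=None):
    """Flag instance-level problems in a single record (illustrative checks only)."""
    today = today or date.today()
    problems = []

    # Invalid value: birthdate that does not parse or lies in the future.
    bdate = None
    if "bdate" in record:
        try:
            bdate = datetime.strptime(record["bdate"], "%d-%m-%Y").date()
            if bdate > today:
                problems.append("invalid value: birthdate lies in the future")
        except ValueError:
            problems.append("invalid value: unparsable birthdate")

    # Inconsistent values: stated age does not match the birthdate.
    if bdate is not None and "age" in record:
        derived = today.year - bdate.year - ((today.month, today.day) < (bdate.month, bdate.day))
        if abs(derived - int(record["age"])) > 1:
            problems.append("inconsistent values: age does not match birthdate")

    # Missing value hidden behind a dummy entry.
    if record.get("phone") in DUMMY_PHONES:
        problems.append("missing value: dummy phone number")

    return problems

def check_uniqueness(records, key="ssn"):
    """Report key values shared by more than one record (uniqueness violations)."""
    seen = {}
    for rec in records:
        seen.setdefault(rec.get(key), []).append(rec.get("name"))
    return {k: names for k, names in seen.items() if len(names) > 1}
```

Applied to the rows of Table 1, checks of this kind would, for example, report SSN "158739" as shared by "John Smith" and "Kowalski", and Bdate="8-13-2015" as an invalid value.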
5. METHODOLOGY
In this paper we implement an attribute correction method using a clustering technique. Attribute correction [10] solutions require reference data in order to provide satisfying results. In this algorithm, each record attribute is examined and cleaned in isolation, without regard to the values of the other attributes of the record.

The main idea behind the algorithm is the observation that in most data sets there is a small number of values with a large number of occurrences and a very large number of values with very few occurrences. The most representative values can therefore serve as the source of reference data, while the values with few occurrences are noise or misspelled instances of the reference data. Table 5 shows an attribute and its occurrence frequencies. Since "Asymptomatic" occurs most frequently, it is taken as the reference value; all the others are excluded from the reference data because of their low occurrence counts.

Table 5: Example of Chest_pain_type attribute distribution
  Chest_pain_type   Number of occurrences
  Asymptomatic      2184
  Asmytomatic       6
  Asmythmatics      3
  Assymtomatics     1
  Asympotmatic      1
  Asymptomac        1

The algorithm uses two parameters:
1. Distance threshold distThresh: the distance below which two values are marked as similar and related.
2. Occurrence relation occRel: used to determine whether both compared values belong to the reference data set.

To measure the distance between two values, a modified Levenshtein distance is used. The Levenshtein distance [11][13] between two strings is the number of text edit operations (insertions, deletions, exchanges) needed to transform one string into the other. For instance, the Levenshtein distance between "Asymptomatic" and "Asmyptomatic" is 2. The attribute correction algorithm uses a modified Levenshtein distance defined as

  \hat{Lev}(s_1, s_2) = \frac{1}{2}\left(\frac{Lev(s_1, s_2)}{\lVert s_1 \rVert} + \frac{Lev(s_1, s_2)}{\lVert s_2 \rVert}\right)

where ||s|| denotes the length of string s.

The algorithm consists of the following steps:
1. Preliminary cleaning: all attribute values are transformed into uppercase or lowercase.
2. The number of occurrences of every value in the cleaned data set is calculated.
3. Each value is assigned to a separate cluster. The cluster element with the highest number of occurrences is denoted the cluster representative.
4. The cluster list is sorted in descending order according to the number of occurrences of each cluster representative.
5. Starting with the first cluster, each cluster is compared with the other clusters from the list in the order defined by the number of occurrences of the cluster representatives. The distance between two clusters is defined as the modified Levenshtein distance between their cluster representatives.
6. If the distance is lower than the distThresh parameter and the ratio of the occurrences of the cluster representatives is greater than or equal to the occRel parameter, the clusters are merged.
7. After all the clusters have been compared, each cluster is examined for values whose distance to the cluster representative is above the threshold value. If such values exist, they are removed from the cluster and added to the cluster list as separate clusters.
8. Steps 4-7 are repeated until there are no changes in the cluster list, i.e. no clusters are merged and no new clusters are created.
9. The cluster representatives form the reference data set, and each cluster defines a transformation rule: the values in a given cluster should be replaced with the value of the cluster representative.
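For illustration, the following Python sketch (written for this article, not taken from the paper's implementation) computes the standard Levenshtein distance with the usual dynamic-programming recurrence and the length-normalized variant defined above.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # delete c1
                            curr[j - 1] + 1,             # insert c2
                            prev[j - 1] + (c1 != c2)))   # substitute c1 -> c2
        prev = curr
    return prev[-1]

def modified_levenshtein(s1: str, s2: str) -> float:
    """Length-normalized distance: 0.5 * (Lev/||s1|| + Lev/||s2||)."""
    d = levenshtein(s1, s2)
    return 0.5 * (d / len(s1) + d / len(s2))

print(levenshtein("Asymptomatic", "Asmyptomatic"))            # 2
print(modified_levenshtein("Asymptomatic", "Asmyptomatic"))   # about 0.167
```

The normalization keeps the distance comparable across values of different lengths, so a single distThresh can be applied to an entire attribute.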
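Steps 1-9 can be sketched as follows. This is a simplified, illustrative implementation written for this article: it reuses modified_levenshtein from the sketch above, the default parameter values are arbitrary choices for the example, and the cluster splitting of step 7 is omitted, so the behaviour of the paper's actual implementation may differ.

```python
from collections import Counter

def attribute_correction(values, dist_thresh=0.2, occ_rel=3.0):
    """Derive value -> representative correction rules for one attribute column.

    Simplified sketch of the clustering algorithm described above; assumes
    modified_levenshtein() from the previous listing is in scope.
    """
    # Steps 1-2: preliminary cleaning (case folding) and occurrence counting.
    counts = Counter(v.strip().lower() for v in values)

    # Step 3: every distinct value starts as its own single-element cluster,
    # keyed by its representative.
    clusters = {v: [v] for v in counts}

    changed = True
    while changed:                                   # step 8: iterate until stable
        changed = False
        # Step 4: process representatives in descending order of occurrences.
        for rep in sorted(clusters, key=lambda v: counts[v], reverse=True):
            if rep not in clusters:                  # already merged in this pass
                continue
            # Steps 5-6: merge nearby, much less frequent clusters into this one.
            for other in list(clusters):
                if other == rep:
                    continue
                close = modified_levenshtein(rep, other) < dist_thresh
                dominant = counts[rep] / counts[other] >= occ_rel
                if close and dominant:
                    clusters[rep].extend(clusters.pop(other))
                    changed = True
        # Step 7 (splitting off values far from the representative) is omitted here.

    # Step 9: each cluster defines rules replacing its members by the representative.
    return {member: rep
            for rep, members in clusters.items()
            for member in members if member != rep}

rules = attribute_correction(
    ["Asymptomatic"] * 2184 + ["Asmytomatic"] * 6 + ["Asympotmatic"])
print(rules)   # {'asmytomatic': 'asymptomatic', 'asympotmatic': 'asymptomatic'}
```

Run on a column distributed as in Table 5, such a sketch produces rules of the kind shown in Table 6, up to the case folding applied in the preliminary cleaning step.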
Table 6 shows example transformation rules discovered during the execution of the above algorithm.

Table 6: Example of corrected values
  Original value   Corrected value
  Asmytomatic      Asymptomatic
  Asmythmatics     Asymptomatic
  Assymtomatics    Asymptomatic
  Asympotmatic     Asymptomatic
  Asymptomac       Asymptomatic

6. CONCLUSION
Data cleaning is a key precondition for analysis in decision support systems and for data integration, and high data quality is a general requirement in the construction of current information systems. In order to provide access to accurate and consistent data, data cleaning becomes necessary. This paper presents a categorization of data quality problems in single and multiple data sources. We also implemented a clustering technique, based on the Levenshtein distance, for the standardization and correction of an attribute; the technique was applied to the data and the resulting corrections are shown in Table 6.

REFERENCES
[1] Lee, M.L.; Lu, H.; Ling, T.W.; Ko, Y.T.: "Cleansing Data for Mining and Warehousing", Proc. 10th Intl. Conf. on Database and Expert Systems Applications (DEXA), 1999.
[2] Mauricio Hernandez, Salvatore Stolfo: "Real World Data Is Dirty: Data Cleansing and The Merge/Purge Problem", Journal of Data Mining and Knowledge Discovery, 1(2), 1998.
[3] Huang Yu, Zhang Xiao-yi, Yuan Zhen, Jiang Guo-quan: "A Universal Data Cleaning Framework Based on User Model", 2009 ISECS.
[4] H.H. Shahri; S.H. Shahri: "Eliminating Duplicates in Information Integration: An Adaptive, Extensible Framework", IEEE Intelligent Systems, Volume 21, Issue 5, Sept.-Oct. 2006, pp. 63-71.
[5] Monge, A.E.: "Matching Algorithms within a Duplicate Detection System", IEEE Techn. Bulletin on Data Engineering, 23(4), 2000.
[6] Paul Jermyn, Maurice Dixon and Brian J. Read: "Preparing Clean Views of Data for Data Mining".
[7] Erhard Rahm, Hong Hai Do: "Data Cleaning: Problems and Current Approaches", IEEE Data Engineering Bulletin, 23(4):3-13, 2000.
[8] KDnuggets Polls: "Data Preparation Part in Data Mining Projects", Sep 30 - Oct 12, 2003. http://www.kdnuggets.com/polls/2003/data_preparation.htm
[9] Wang, Y.R.; Madnick, S.E.: "The Inter-Database Instance Identification Problem in Integrating Autonomous Systems", Proceedings of the Fifth International Conference on Data Engineering, IEEE Computer Society, February 6-10, 1989, Los Angeles, California, USA, pp. 46-55.
[10] Lukasz Ciszak: "Application of Clustering and Association Methods in Data Cleaning", 978-83-60810-14-9/08, 2008 IEEE.
[11] M. Bilenko and R.J. Mooney: "Adaptive Duplicate Detection Using Learnable String Similarity Measures", ACM SIGKDD, pp. 39-48, 2003.
[12] Monge, A.E.; Elkan, C.: "The Field Matching Problem: Algorithms and Applications", Proc. 2nd Intl. Conf. on Knowledge Discovery and Data Mining (KDD), 1996.
[13] W. Cohen, P. Ravikumar, S. Fienberg: "A Comparison of String Metrics for Name-Matching Tasks", Proceedings of IJCAI-2003.

AUTHOR
Sujata Joshi received the B.E. degree in Computer Science and Engineering from B.V.B. College of Engineering and Technology, Hubli, in 1995 and the M.Tech. degree in Computer Science and Engineering from M.S. Ramaiah Institute of Technology, Bangalore, in 2007. She is currently working as Assistant Professor in the Department of Computer Science and Engineering at Nitte Meenakshi Institute of Technology, Bangalore, and pursuing a Ph.D. in the area of data mining under Visvesvaraya Technological University, Belagavi.