Integrating Information from Heterogeneous Databases Using Agents and Metadata
Hazem Turki El Khatib, PhD Thesis ~ 2000

CHAPTER 3
A FRAMEWORK FOR DISTRIBUTED HETEROGENEOUS DATABASES

3.1 INTRODUCTION

During the past three decades there has been a rapid growth in the number of databases. This has led to the storage of related data in different formats across multiple databases. For example, in areas such as healthcare, the information on a single patient may be scattered over a number of different medical databases with no simple way of obtaining a complete record of the patient.

In this chapter a framework for classifying different aspects of heterogeneity in data sets is proposed, and the various aspects of heterogeneity discussed by different researchers are related to this framework. The idea behind such a framework is to identify a comprehensive range of different types of heterogeneity that can arise either alone or in combination. A simple test-suite using this framework has been devised which can be used to test and compare different approaches to the interoperability of databases. The suite comprises a small number of data sets and queries which exercise almost all aspects of the framework.

The focus of this framework is on the relational database model, since the vast majority of databases currently in use are relational. Most of the heterogeneities identified are common across all database models, including Object Oriented, but this framework has not been extended to consider these other models at this time.

Section 3.2 provides an overview of work done by a number of different researchers on the problem of heterogeneous distributed database systems. Section 3.3 describes the framework with examples drawn from a small set of example databases. Section 3.4 describes the test-suite derived from this and Section 3.5 provides a summary of the chapter.
3.2 OVERVIEW OF PREVIOUS WORK

This section provides a brief overview of the aspects covered in a number of different papers on this subject. The next section presents our proposed framework. A summary of how the earlier work fits into the proposed framework is given in Table 3.1, and an index of the abbreviations used in that table is given in Table 3.2. It should be noted that the terminology used differs amongst authors and that their coverage of a concept varies; this is indicated in the table by an ‘F’ (if the concept is fully covered) or a ‘P’ (if it is partially covered).

The technological differences between computer systems give rise to heterogeneity conflicts. These include differences in hardware, system software (such as operating systems), communication protocols, and so on [7]. At the database system level, the heterogeneity can be divided into that resulting from differences in DBMSs and that resulting from differences in the semantics of data, as shown in Figure 3.1. Solaco et al. [25] classified the heterogeneity as systems heterogeneities and semantic heterogeneities. Systems heterogeneities include differences in hardware, operating systems, database management systems, transaction management, communication protocols, and so on. Semantic heterogeneities include differences in database models, particularly in the schemas of the databases.

Database Systems
    Differences in DBMS
        Data models (structures, constraints, query languages)
        System level support (concurrency control, commit, recovery)
    Semantic Heterogeneity
Operating System
    File systems
    Naming, file types, operations
    Transaction support
    Interprocess communication
Hardware/System
    Instruction set
    Data formats & representation
    Configuration
Communication

Figure 3.1 ~ Type of heterogeneities

This thesis is not concerned with system heterogeneities, which may or may not exist.
It may be that during the design process the same hardware, operating system, DBMS, and so on were chosen for all databases in the system. However, “semantic heterogeneities will nearly always exist because the designers of the respective component databases will have conceived the real world in differing ways and will have designed different schemas” [25].

The major problem is that of finding how a data item in one set can be mapped to an appropriate form to make it accessible in another; in other words, finding attribute equivalence. The simplest form of heterogeneity in this regard is that of naming conflicts and naming heterogeneity. In general, the categories of structural and naming heterogeneities are recognised by most authors, e.g. [45] [4]. Several authors [9,16,18,26,31,44,46,47] defined naming conflicts as homonyms and synonyms. Elmasri et al. [48] used the same two categories but widened naming conflicts to include attribute equivalence and entity class equivalence. Structural conflicts may be viewed as differences in abstraction level [48,49,50] as well as differences in roles, degree and cardinality constraints [48,22].

Table 3.1 ~ Relationship between concepts used by other researchers and our classification

Rows (authors): Cardenas, A.F. [58]; Sheth and Larson [7]; Thomas, Thompson et al. [1]; Ferrier and Stangret [59]; Litwin and Abdellatif [11]; Urban and Wu [52]; Hurson and Bright [34]; Larson, Navathe et al. [23]; Chatterjee and Segev [31]; Spaccap., Parent et al. [17]; Navathe and Gadgil [18]; Batini and Lenzerini [26]; Bukhres, Elmag. et al. [4]; Navathe and Savasere [12]; Casanova and Vidal [45]; Motro and Buneman [30]; Al-fedaghi and Scheu. [65]; Yao, Waddle, Housel [66]; Yu, Jia, Sun, and Dao [28]; Teorey and Fry [67]; Kahn [46]; Elmasri and Navathe [24]; Navathe, Sashid. et al. [22]; Mannino and Effelsbe. [49]; Kual, Drosten, Neuhold [50]; Elmasri, Larson, Nav. [48]; Dayal and Hwang [9]; Batini, Lenzerini et al. [16]; Spaccapietra, Parent [60]; Solaco, Saltor et al. [25]; Ventrone and Heiler [55]; Kim and Seo [47]; Reddy, Prasad, Gupta [39]; Breitbart, Olson et al. [61]; Fankhauser, Neuhold [56]; Sheth and Kashyap [57]; Jeffery, Hutchinson et al. [29]; Deen, Amin, Taylor [51].
Columns (abbreviations as in Table 3.2): N.HG (N.S, N.H); R.S.H (R.Z); V.H (N.N, S.S, N.s, S, I.I); S.H (W.D.R, C.C, D.A.L); D.M (P.H, B.D, D.C, D.I.C, D.V, R.K); T.H (I.D.A.U, D.E).
Each cell holds ‘F’ (concept fully covered) or ‘P’ (concept partially covered).

ABBREVIATION    TERM
N.HG            Naming Heterogeneity
N.S             Naming Synonyms
A.S             Attribute Synonyms
R.S             Relation (Table) Synonyms
N.H             Naming Homonyms
A.H             Attribute Homonyms
R.H             Relation (Table) Homonyms
A-R H           Attribute_Relation Homonyms (Entity-Class Homonyms)
R.S.H           Relational Structure Heterogeneity
R.Z             Relation Size
V.H             Value Heterogeneity
N.N             Numeric-Numeric
D.U-F.C         Different Units - Fixed Conversion
D.U-T.V.C       Different Units - Time Varying Conversion
U-O.C           Units - Other Conversion
G               Granularity
C.V.S-V A       Composition of Values in a Single-Valued Attribute
S.S             String-String
V.S             Value Synonyms
V.h             Value Homonyms
D.S.F           Different String Formats
N.s             Numeric-String
S.C             Simple Conversion
S               Structures
I.I             Incomplete Information
S.H             Semantic Heterogeneity
W.D.R           What the data represents
C.C             Context in which data is captured
D.A.L           Difference in Abstraction Level
D.M             Data Model
P.H             Paradigm Heterogeneity
B.D             Behavioural Differences
D.C             Dependency Conflicts
D.I.C           Differences in Constraints
D.V             Default Value
R.K             Relation Keys
T.H             Timing Heterogeneity
D.E             Domain Evolution
I.D.A.U         Inconsistencies due to asynchronous updates

Table 3.2 ~ Abbreviations of the terms used in Table 3.1

The second major form of heterogeneity is concerned with differences in the representations of values. Again several authors recognise differences in units and granularity as well as differences in data types and structure. These include Deen, Amin, and Taylor [51], Bukhres et al. [4], Jeffery et al. [29], and Larson et al. [23], who also cover differences in level of abstraction and in object identifier, Chatterjee and Segev [31], who include codes, incomplete information and recording errors, and Navathe and Savasere [12], who include data type and scale. Another way of viewing this is by distinguishing between schema level conflicts and data level inconsistencies [9]. This notion is elaborated by Kim and Seo [47], who distinguish between data that has been incorrectly entered, obsolete data and different representations for the same data. Reddy, Prasad and Gupta [39] refer to quantitative data incompatibilities which they attribute to different levels of accuracy, asynchronous updates and lack of security.

The most complex heterogeneity is semantic heterogeneity, which is addressed by Urban and Wu [52], Colomb and Orlowska [38], Spaccapietra et al. [17], Sheth and Larson [7], and Hurson and Bright [34]. Solaco, Saltor and Castellanos [25] also base the classification of semantic heterogeneities on an object-oriented data model. In [53] the following definition of semantic heterogeneity is given: “variations among component database systems in the structure, organization, and conceptual description of information facts (units), units of behaviour (procedures), and semantic integrity constraints”.
Two forms of semantic heterogeneity in the context of geographic databases have been identified in [54]. They are generic semantic heterogeneity and contextual semantic heterogeneity. Generic semantic heterogeneity arises when nodes use different generic conceptual models of the spatial information. Contextual semantic heterogeneity is caused by the local environmental conditions at nodes. In addition, Ventrone and Heiler [55] describe problems of semantic heterogeneity resulting from domain evolution. Fankhauser and Neuhold [56] refer to the problem of ambiguity and distinguish model ambiguity (arising from primitives such as is-a, instance-of, part-of) and semantic ambiguity. Sheth and Kashyap [57] include conflicts such as default value conflicts, attribute integrity constraint conflicts and union compatibility conflicts.

The question of data model heterogeneity is addressed by [58] – [61], and [1,11,12,34], while Bukhres et al. [4] break the heterogeneity dimension into three different possible dimensions: model, access, and processing.

Apart from the heterogeneities covered in this chapter, authors have also covered differences in the database management systems [7], in data models [62], in query languages and differences at the system level (e.g. concurrency control, commit protocols and recovery). Ferrier and Stangret [59] include the network and the operating system and Litwin and Abdellatif [11] physical aspects such as login procedures, concurrency control, etc.

3.3 FRAMEWORK FOR CLASSIFYING HETEROGENEITIES

This section presents a framework for classifying the different types of heterogeneity which arise and need to be catered for.
In so doing, the classifications are described in terms of the relational model as mentioned in the previous section. From the discussion in the previous section, different instances of heterogeneity can be classified into one or a combination of the following:

1. Naming heterogeneity. This occurs when the same values are stored in different databases but the names given to the attributes are different in different systems. These can be handled by a simple (syntactic) attribute transformation of the query.

2. Relational structure heterogeneity. Here the composition of elementary attributes into composite structures varies but once again values stored are identical. This can be handled by a (syntactic) relational transformation of the query.

3. Value heterogeneity. In this case the way in which values are represented is different in different databases. This may involve type and value transformations.

4. Semantic heterogeneity. This is the most difficult form to deal with as in this case the data stored in different databases embody different assumptions, e.g. in what they represent or in how they have been captured. To quote [63]: “As we move away from systems issues to semantic issues, we move from well-defined computational paradigms for symbol manipulation to the issues of meaning and use of data as used by different applications and by different human data administrators and end users. We need to deal with multiple (possibly changing) interpretations of data by different user in different context, data inconsistencies, and incomplete information.”

5. Data model heterogeneity. Here the data model itself is the issue and transformations between data models and differences between them are relevant.

6. Timing heterogeneity. This concerns the changes over time in the structure of a database, the representation of attributes and the values themselves. Basically, almost any difference from each of the preceding categories, which can occur between databases, may also arise within a single database if it changes with time.

One area not covered in this categorisation is that of recording errors in the data. Although this is a factor that does create problems, the issues of noisy data are generally highly dependent on the application and therefore impossible to cover in any generalised way.

In order to illustrate the different aspects of heterogeneity which follow, four simple data sets are given in Figures 3.2, 3.3, 3.4 and 3.5. Each contains a collection of patient record data for patients attending different clinics in different institutions. The aim is to link these together to provide a single integrated information system.

The first data set, Database1, comprises three relations: PAT-REC which stores basic patient data, VISIT which records details of individual visits to the clinic and LAB-TEST which stores information on laboratory tests conducted. The second data set, Database2, is a minor variation on Database1 with essentially the same three relations. Database3, on the other hand, consists of four relations: PATIENT which stores patient data, PATIENT-NAME which stores patient names, VISIT and TEST. Database4 represents data from a paediatric clinic. It consists of three relations: PATIENT which stores patient data, VISIT which stores information about home visits, C-VISIT which records details of individual visits to the clinic.
PAT-REC
NAME            ID    BIRTHDATE   SEX      PHONE
Mark Richard    529   05-01-73    Male     031-6220723
Karen Taylor    319   13-04-74    Female   0131-3122468
Susan Marshal   129   05-03-76    Female   NULL

VISIT
ID    WEIGHT   VISIT-DATE   MEDICATION-PRICE   TEST-ID
319   75       120894       £5.5               3100
129   69       240292       £6.0               1200
129   70       030392       £3.0               2400
529   NULL     050492       £8.5               4000

LAB-TEST
TEST-ID   TEST-CODE   RESULT
3100      PNE         1.3
1200      BLD         4.0
2400      LP8         2.2
4000      BLOOD       7.5

RELATION NAME   ATTRIBUTE NAME     SEMANTIC                                          ATTRIBUTE TYPE   NULL VALUE   TYPE
PAT-REC         PHONE              Home telephone number                             String           UNKNOWN      NATIONAL-PHONE
VISIT           VISIT-DATE         The date when patient visits the clinic           Int              -            -
VISIT           WEIGHT             The person’s weight when s/he enters the clinic   Int              -            KILOGRAMS
VISIT           MEDICATION-PRICE   The medication price excluding TAX                String           -            UK-POUND

Note: Patient 129 does not have a home telephone number. Mark Richard's telephone number is stored with the old area code, while Karen Taylor's telephone number is stored with the new area code. The code ‘PNE’ represents ‘Pneumonia’. The tax rate before 1991 was 15% and after this date it became 17.5%.

Figure 3.2 ~ Structure of Database 1

The different types of heterogeneity as shown in Figure 3.6 are given below.

3.3.1 Naming heterogeneity

The simplest form of heterogeneity is associated with concept naming. This arises when the same concept is described by two or more names in different databases (synonyms), or when the same name is used for different concepts (homonyms). This form of heterogeneity is not concerned with the value which is stored but merely with the name by which it is accessed.
PATIENT
ID     PATIENT-NAME   PHONE     SEX   DATE
4480   Peter Brown    3542311   M     23 FEB 80
1280   Janet Smith    6248526   F     12 MAY 81
6512   Mark Richard   5112168   M     05 JAN 73
5555   Karen Taylor   NULL      F     13 APR 74

VISIT
ID     WEIGHT   DATE          TEST-ID   PRICE
4480   180      AUG 12 90     4000      $10.30
6512   150      APRIL 24 92   3010      $12.0
1280   120      MAY 14 88     3020      $5.0
5555   160      JAN 15 93     3030      NULL

TEST
TEST-ID   CODE   RESULT
3010      MKT    Normal
4000      PNE    Above Normal
3020      BLD    High
3030      PNE    Normal

RELATION NAME   ATTRIBUTE NAME   SEMANTIC                                          ATTRIBUTE TYPE   NULL VALUE       TYPE
PATIENT         PHONE            Work telephone number                             Int              UNKNOWN          LOCAL-PHONE
VISIT           DATE             The date when patient visits the clinic           String           -                -
VISIT           WEIGHT           The person’s weight when s/he enters the clinic   Int              -                POUNDS
VISIT           PRICE            The medication price excluding TAX                String           Not Applicable   US-DOLLAR

Note: Mark Richard's telephone number was updated on APRIL 24 1992. The code ‘PNE’ represents ‘Pneumoconiosis’.

Figure 3.3 ~ Structure of Database 2

Naming synonyms

These include the following:

Attribute synonyms

The same attribute may be given different names in different databases. For example, the attribute NAME in relation PAT-REC in Database1 corresponds to the attribute PATIENT-NAME in relation PATIENT in Database2.
PATIENT
ID    BIRTHDATE     SEX      PHONE
529   05 JAN 73     Male     4495111
319   13 APRIL 74   Female   NULL
420   12 JUN 80     Female   NULL

PATIENT-NAME
ID    FIRST-NAME   SURNAME   MAIDEN-NAME
319   Karen        Taylor    Thomas
529   Mark         Richard   NULL
420   Diana        Steven    Adam

VISIT
ID    DATE          WEIGHT   PRICE   T-ID
319   24 JUN 87     74.6     $12     12b
529   03 APRIL 89   68.2     $9      2FC
529   15 MAY 89     67.4     $5      13f
420   13 APRIL 90   70.6     $8      7N

TEST
T-ID   CODE   RESULT
12b    EF6    A
2FC    LP8    C
13f    PNE    B
7N     EF7    C

RELATION NAME   ATTRIBUTE NAME   SEMANTIC                                          ATTRIBUTE TYPE   NULL VALUE       TYPE
PATIENT         PHONE            Work telephone number                             Int              UNKNOWN          LOCAL-PHONE
PATIENT-NAME    MAIDEN-NAME      The surname before marriage                       String           Not applicable   -
VISIT           DATE             The date when patient visits the clinic           String           -                -
VISIT           WEIGHT           The person’s weight when s/he enters the clinic   Float            -                KILOGRAMS (to nearest tenth)
VISIT           PRICE            The medication price including TAX                String           -                US-DOLLAR

Note: Mark Richard's telephone number was updated on MAY 15 1989. Patient phone number was not compulsorily captured until 01/01/1990. So, NULL prior to this date represents 'unknown' and after this date represents 'no telephone number'. The code ‘PNE’ represents ‘Pneumonia’.

Figure 3.4 ~ Structure of Database 3

Relation (Table) synonyms

The same relation may be represented by different names in different databases. For example, the relation LAB-TEST in Database1 corresponds to the relation TEST in Database2.
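The attribute and relation synonyms described above lend themselves to a simple lookup. The following is a minimal sketch, not the thesis's own mechanism, of how a (relation, attribute) reference written against Database1 might be translated into Database2 names. The dictionaries and the function `translate` are hypothetical; the PAT-REC/PATIENT and TEST-CODE/CODE pairings are assumptions read off Figures 3.2 and 3.3, while NAME/PATIENT-NAME and LAB-TEST/TEST are the pairs given in the text.

```python
# Hypothetical synonym tables: Database1 name -> Database2 name.
RELATION_SYNONYMS = {
    "PAT-REC": "PATIENT",   # assumed from Figures 3.2 and 3.3
    "LAB-TEST": "TEST",     # stated in the text
}

# Attribute synonyms are keyed by (relation, attribute) because the same
# attribute name can mean different things in different relations.
ATTRIBUTE_SYNONYMS = {
    ("PAT-REC", "NAME"): "PATIENT-NAME",   # stated in the text
    ("LAB-TEST", "TEST-CODE"): "CODE",     # assumed from the figures
}


def translate(relation, attribute):
    """Translate a Database1 (relation, attribute) pair to Database2 names.

    Names without a recorded synonym are passed through unchanged.
    """
    new_relation = RELATION_SYNONYMS.get(relation, relation)
    new_attribute = ATTRIBUTE_SYNONYMS.get((relation, attribute), attribute)
    return new_relation, new_attribute


if __name__ == "__main__":
    print(translate("PAT-REC", "NAME"))
    print(translate("LAB-TEST", "RESULT"))
```

A lookup of this kind only covers the syntactic renaming; it says nothing about whether the values stored under the translated names are comparable.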
PATIENT
P-NAME       ID     BIRTHDATE   SEX   PHONE
Alex Brown   9211   120393      M     3542311
Sue Peter    9345   231195      F     6248526
Mark Smith   9289   050496      M     5112168

VISIT
ID     V-NAME          DATE          BLOOD PRESSURE   COMMENT
9211   Morton Pattie   OCT 03 97     70:140           NO CHANGE
9345   Fernie Garcia   APRIL 12 96   80:140           IMPROVING
9289   Dines Douglas   JAN 24 96     80:160           IMPROVING
9345   Ross Scott      APRIL 23 96   90:140           NO CHANGE

C-VISIT
ID     DATE        BLOOD PRESSURE   WEIGHT
9345   06 JAN 96   90:140           30.5
9211   03 MAR 96   80:140           35.5
9289   15 SEP 97   90:160           48.5

RELATION NAME   ATTRIBUTE NAME   SEMANTIC                                             ATTRIBUTE TYPE   NULL VALUE   TYPE
PATIENT         P-NAME           The patient’s name                                   String           -            -
PATIENT         PHONE            The person’s contact phone number                    Int              UNKNOWN      LOCAL-PHONE
VISIT           ID               The home visit’s identification                      Int              -            -
VISIT           V-NAME           The nurse’s name who visits patient at home          String           -            -
VISIT           DATE             The date when the nurse visits the patient at home   String           -            -
VISIT           BLOOD PRESSURE   The blood pressure measured at home                  -                -            -
C-VISIT         ID               The clinic visit’s identification                    Int              -            -
C-VISIT         DATE             The date when patient visits the clinic              String           -            -
C-VISIT         BLOOD PRESSURE   The blood pressure measured at the clinic            String           -            -
C-VISIT         WEIGHT           The person’s weight when s/he enters the clinic      Float            -            KILOGRAMS

Figure 3.5 ~ Structure of Database 4

Naming homonyms

These include the following:

Attribute homonyms

Two attributes with the same name occurring in different databases represent different things. For example, the attribute DATE occurring in relation VISIT of Database2 is different from DATE in relation VISIT of Database4. Although they have the same name, they represent different concepts.

Relation (Table) homonyms

Relations with the same name occurring in different databases contain different things.
For example, the relation VISIT in Database1 records details of individual visits to the clinic, while the relation VISIT in Database4 stores information about visits to the patient’s home.

Attribute-Relation homonyms (or entity-class homonyms)

An attribute in one database has the same name as a relation in another database. For example, PATIENT-NAME is an attribute of relation PATIENT in Database2, but it is a relation in Database3.

3.3.2 Relational structure heterogeneity

This form of heterogeneity arises when the way in which attributes are composed into relations in one database is different from that of another. Once again this form of heterogeneity is not concerned with the values of attributes, but merely how they are assembled into relations.

Relation size

In this case relations with the same name have different numbers of attributes in different databases, and thus are not union-compatible. For example, relation PATIENT in Database2 has five attributes whereas relation PATIENT in Database3 has only four.

3.3.3 Value heterogeneity

This form of heterogeneity is concerned with the way in which the values of a concept are represented. It is possible that different instances of the same concept occurring in different databases may be represented in different ways.

Numeric – numeric

Different units - fixed conversion

This arises when different databases use different units for the same data element. For example, an attribute WEIGHT in the VISIT relation of Database1 is expressed in kilograms whereas in Database2 it is expressed in pounds. This represents a straightforward conversion from one set of units to another.
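As an illustration of a fixed conversion, the sketch below normalises WEIGHT values to kilograms using the unit recorded in the TYPE column of the metadata in Figures 3.2 and 3.3 (KILOGRAMS or POUNDS). The function name `weight_in_kg` is hypothetical, and treating NULL as Python `None` is an assumption of this sketch.

```python
# One avoirdupois pound is defined as exactly 0.45359237 kg.
KG_PER_POUND = 0.45359237


def weight_in_kg(value, unit):
    """Convert a WEIGHT value to kilograms.

    `unit` is the label from the metadata TYPE column ("KILOGRAMS" or
    "POUNDS"). A NULL weight (None here) is passed through unchanged.
    """
    if value is None:
        return None
    if unit == "KILOGRAMS":
        return float(value)
    if unit == "POUNDS":
        return value * KG_PER_POUND
    raise ValueError(f"unknown weight unit: {unit!r}")


if __name__ == "__main__":
    # First VISIT row of Database2 (pounds) and of Database1 (kilograms).
    print(round(weight_in_kg(180, "POUNDS"), 1))
    print(weight_in_kg(75, "KILOGRAMS"))
```

Because the factor is constant, the conversion can be applied to any value regardless of when it was recorded, which is what distinguishes this case from the time-varying conversions that follow.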
Different units - time varying conversion

As an example, consider the MEDICATION-PRICE/PRICE attributes in the VISIT relations, which in Database1 contain values expressed in pounds sterling and in Database2 contain values expressed in US dollars. This is also a conversion, but the conversion factor varies with time and a conversion factor must be chosen for an appropriate instant of time.

Units - other conversions

Apart from the standard conversions of the previous two subsections, several irregular conversions arise. For example, the telephone number value in the PHONE attribute in relation PAT-REC of Database1 is represented with area codes whereas in the PHONE attribute in relation PATIENT of Database2 it is represented without area codes.

Naming Heterogeneity (N.HG)
    Naming Synonyms (N.S)
        Attribute Synonyms (A.S)
        Relation (Table) Synonyms (R.S)
    Naming Homonyms (N.H)
        Attribute Homonyms (A.H)
        Relation (Table) Homonyms (R.H)
        Attribute_Relation Homonyms (Entity-Class Homonyms) (A-R H)
Relational Structure Heterogeneity (R.S.H)
    Relation Size (R.Z)
Value Heterogeneity (V.H)
    Numeric-Numeric (N.N)
        Different Units - Fixed Conversion (D.U-F.C)
        Different Units - Time Varying Conversion (D.U-T.V.C)
        Units - Other Conversion (U-O.C)
        Granularity (G)
        Composition of Values in a Single-Valued Attribute (C.V.S-V A)
    String-String (S.S)
        Value Synonyms (V.S)
        Value Homonyms (V.h)
        Different String Formats (D.S.F)
    Numeric-String (N.s)
        Simple Conversion (S.C)
    Structures (S)
    Incomplete Information (I.I)
Semantic Heterogeneity (S.H)
    What the data represents (W.D.R)
    Context in which data is captured (C.C)
    Difference in Abstraction Level (D.A.L)
Data Model (D.M)
    Paradigm Heterogeneity (P.H)
    Behavioural Differences (B.D)
    Dependency Conflicts (D.C)
    Differences in Constraints (D.I.C)
    Default Value (D.V)
    Relation Keys (R.K)
Timing Heterogeneity (T.H)
    Domain Evolution (D.E)
    Inconsistencies due to asynchronous updates (I.D.A.U)

Figure 3.6 ~ The Classification Diagram

Granularity

This form of heterogeneity arises when data elements representing a particular measurement differ in their level of granularity. For example, the WEIGHT attribute value in the VISIT relation of Database1 is stored to the nearest kilogram while in Database3 it is stored to the nearest tenth of a kilogram.

Composition of values in a single-valued attribute

Sometimes a value consists of two or more components which are directly related. A classic example is that of the price of an object or service, which may be given inclusive or exclusive of tax. Similarly, prices in a restaurant may be inclusive or exclusive of service charge. As an example of this form of heterogeneity consider the attribute PRICE of relation VISIT in Database3, which describes the price of the medicine including tax, whereas the attribute PRICE in relation VISIT in Database2 describes the medication price without tax.

String – string

Value synonyms

This occurs when the values of an attribute are represented as strings but a slightly different set of values is used in different databases. As an example, the value of the SEX attribute in the PAT-REC relation of Database1 is stored as Male or Female, while in attribute SEX in relation PATIENT of Database2 it is stored as M or F.

Value homonyms

The value ‘PNE’ occurring in attribute TEST-CODE in relation LAB-TEST of Database1 represents ‘Pneumonia’ but ‘PNE’ represents ‘Pneumoconiosis’ in attribute CODE in relation TEST of Database2.

Different string formats

These arise when different databases use different string formats for the same element.
The most common occurrence of this is in date representation; for example the attribute DATE in relation VISIT of Database3 is represented as Day Month Year, whereas the attribute DATE in relation VISIT of Database2 is represented as Month Day Year. Other forms might include “MM-DD-YY”, “DD/MM/YY”, “DDMMYY”, “MMDDYY”, “YYYYMMDD” and so on.

Numeric – string

Simple conversion

This arises when the same attribute is defined in terms of different data types in different databases. For example, the PHONE attribute in relation PAT-REC of Database1 is of type ‘string’, whereas in relation PATIENT of Database2 it is of type integer. The date problem described in the previous section also arises here; for example the VISIT-DATE attribute value in relation VISIT of Database1 is stored as a numeric value while the DATE attribute value in relation VISIT of Database2 is stored as string.

Structures

These arise when different databases use different formats for the same element. For example, in Database2 the name of the patient is represented as a single attribute in relation PATIENT whereas in Database3, it is represented as a pair of separate attributes in the relation PATIENT-NAME.

Incomplete information

The meaning of null varies amongst databases (unknown, not applicable, unavailable). For example, when the value of an attribute MAIDEN-NAME is NULL, this is interpreted as not applicable if attribute SEX is ‘MALE’. However, if SEX is ‘FEMALE’ it would be either NO MAIDEN-NAME or MAIDEN-NAME is unknown. If the AGE attribute value is equal to NULL this is taken as unknown value. On the other hand, if the PHONE attribute is NULL as in Database1 and Database3, this may mean either not applicable or unknown.
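The date representations discussed above under "Different string formats" and "Simple conversion" can be brought to a common form with small per-database parsers. The sketch below is illustrative only. The function names are hypothetical, the DDMMYY reading of Database1's numeric VISIT-DATE is inferred from the sample values (e.g. 240292 cannot be MMDDYY), and mapping two-digit years to 19xx is an assumption that suits the sample data but not data in general.

```python
from datetime import date

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


def _year(two_digits):
    # Assumption: every two-digit year in the example data is 19xx.
    return 1900 + int(two_digits)


def _month(name):
    # Accept abbreviated and full month names alike ("APR", "APRIL").
    return MONTHS[name.strip().upper()[:3]]


def from_database1(value):
    """Database1 VISIT-DATE: an integer read as DDMMYY, e.g. 120894."""
    text = str(value).zfill(6)  # restore a leading zero lost in the integer
    return date(_year(text[4:6]), int(text[2:4]), int(text[0:2]))


def from_database2(value):
    """Database2 DATE: a string 'Month Day Year', e.g. 'AUG 12 90'."""
    month, day, year = value.split()
    return date(_year(year), _month(month), int(day))


def from_database3(value):
    """Database3 DATE: a string 'Day Month Year', e.g. '24 JUN 87'."""
    day, month, year = value.split()
    return date(_year(year), _month(month), int(day))


if __name__ == "__main__":
    print(from_database1(120894))
    print(from_database2("APRIL 24 92"))
    print(from_database3("03 APRIL 89"))
```

Parsing the month names by hand rather than through locale-dependent library formats keeps the sketch independent of the machine's language settings.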
3.3.4 Semantic Heterogeneity

Ter Bekke [64] defines semantics as the discipline which deals with relationships between words and the things to which these words refer. In database modelling, semantics is concerned with the study of the meaning and relationship between real world features and database objects [3]. This form of heterogeneity occurs when there are differences in what the data actually represents or the context in which the data has been captured in different databases. The semantic heterogeneity can be classed as:

What the data represents

As an example, the PHONE attribute in relation PAT-REC of Database1 is a home phone number; the PHONE attribute in relation PATIENT of Database4 is a contact phone number, which may be the home phone number but may not. They are concepts which are related but not necessarily identical.

Context in which data is captured

As an example, consider blood pressure. If blood pressure is measured at home by a nurse the measurement may be significantly lower than that obtained in the clinic by a doctor (so-called ‘white coat’ syndrome). In the case of Database4 the blood pressure in relation VISIT is measured at home by a nurse, whereas in relation C-VISIT it is measured in the clinic. Equally, one would like to know whether a measurement may be affected by other conditions (e.g. if a patient being examined for condition X is also suffering from condition Y at the same time).

Difference in abstraction level

The requirements of different local DBMSs may cause objects to be modelled at different levels of abstraction. For example, the attribute RESULT in relation LAB-TEST of Database1 describes the result of a test on the scale 0 to 10 whereas attribute RESULT in relation TEST of Database2 describes the result in terms of values {Low, Normal, Above Normal, High}.
As another example of differing abstraction levels, an attribute TELEPHONE-NUMBER may contain a home telephone number [23].

3.3.5 Data model heterogeneity

Paradigm heterogeneity
Local database systems may employ different paradigms, such as relational, hierarchical, object-oriented, or entity-relationship. The focus of this framework is on the relational database model, and it has not been extended to consider these other models at this time.

Behavioural differences
These arise when different insertion/deletion policies are associated with the same class of objects in distinct schemas. A record type may have constraints on the total number of occurrences, or on the insertions and deletions of records. For example, in one database the details of a patient’s visit to hospital must be kept for a minimum of 10 years before they can be deleted, but in another database details may be kept for only 5 years before they can be deleted.

Dependency conflicts
These arise when a group of concepts are related to one another through different dependencies in different schemas. For example, it is possible for a relationship between two concepts in one database to be 1:1, whereas in another it could be 1:n.

Differences in constraints
The data model may support different constraints. For example, in Database4 the patients are all children and hence the attribute BIRTHDATE in relation PATIENT is constrained to dates consistent with this (e.g. less than 10 years of age). On the other hand, the corresponding attribute BIRTHDATE in relation PAT-REC in Database1 has no such constraint.

Default value
This form of heterogeneity occurs when there are different definitions of the attribute domain. Two attributes might have different default values in different databases.
For example, when inserting a new VISIT record the default value for VISIT-DATE in Database1 may be the current date, whereas the default value for DATE in Database2 may be NULL.

Relation keys
In this case, equivalent relations in different databases may have different attributes as keys, which can affect updates to these relations.

3.3.6 Timing heterogeneity

Domain evolution
This problem occurs when the semantics of values of a domain change over time. This includes many of the different kinds of heterogeneity already described. For example, the form used to represent a value may change over time. An example of this is the change in telephone code occurring in Database1, where the area code changed from ‘031’ to ‘0131’ at a particular point in time. Other forms of domain evolution include changes in composition of values (e.g. when the tax rules changed), changes in granularity, changes in string representations resulting from changes in coding systems, changes in cardinality, etc.

Inconsistencies due to asynchronous updates
These happen when data items are replicated in different databases, get updated at different points in time and become inconsistent. For example, the PHONE attribute in relation PATIENT of Database2 for Mark Richard has been updated without a corresponding update to attribute PHONE in relation PATIENT of Database3, and so the two attribute values become inconsistent.

3.4 THE TEST SUITE
For the test suite, the following five simple queries have been selected.

Q1: Find the test code and result for Karen Taylor
This query tests for attribute synonyms (e.g. NAME/PATIENT-NAME), attribute-relation homonyms (e.g. PATIENT-NAME), relation synonyms (e.g. LABTEST/TEST), value homonyms (e.g. meaning of PNE), structures (e.g. PATIENT-NAME), relation size (PATIENT has 4 or 5 attributes), and difference in abstraction level (RESULT).
Q2: Find the telephone numbers for patients born before 01/01/1981
This query involves semantic heterogeneity – what data represents (home vs. work number), different units – other conversion (with or without area code), incomplete information (meaning of NULL), different string formats (for BIRTHDATE), relation synonyms (PAT-REC/PATIENT), relation size (PATIENT), domain evolution (area code and changed meaning of NULL), and inconsistencies due to asynchronous updates (the attribute PHONE in relation PATIENT of Database2 is updated [Mark Richard’s phone number] without a corresponding update to the attribute PHONE in relation PATIENT of Database3).

Q3: Find weights of all male patients weighed within the last year
This query covers fixed conversion between different units (WEIGHT – pounds vs. kilograms), different granularity (kilograms vs. tenths of a kilogram), value synonyms (Male/Female vs. M/F), relation size, relation homonyms (VISIT), relation synonyms (PATIENT/PAT-REC) and numeric-string conversion (VISIT-DATE).

Q4: Find the price of medication for patient 529
This query involves time-varying conversion between different units (Pounds vs. Dollars), composition of values in a single-valued attribute (PRICE with or without TAX), and domain evolution (TAX rate).

Q5: Find the blood-pressure for Alex Brown
This query covers semantic heterogeneity – context in which data is captured (BLOOD PRESSURE), and attribute homonyms (DATE).

A single query which covers most of the range of heterogeneities in the test suite is as follows:

Q6: Find the name, date of birth and telephone number of every male patient who has had a high result (> 4.0) for test PNE and whose weight exceeds 180 pounds.
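As a rough illustration of what answering Q6 involves, the following Python sketch normalises rows from differently encoded sources before applying the query predicate. It is a sketch under stated assumptions, not the test suite itself: the row layout, the `weight_unit` tag, the sample rows in `ROWS` and all function names are hypothetical, and every PNE result is assumed to be already on the numeric 0 to 10 scale. Only the unit pairs (pounds, kilograms, tenths of a kilogram), the value synonyms (Male/Female vs. M/F), the area-code change (‘031’ to ‘0131’) and the thresholds of Q6 come from the text.

```python
from typing import List, Optional, Tuple

LB_PER_KG = 1 / 0.45359237  # one pound is defined as 0.45359237 kg

SEX_SYNONYMS = {"MALE": "M", "M": "M", "FEMALE": "F", "F": "F"}


def weight_in_pounds(value: float, unit: str) -> float:
    """Fixed conversion between units, including the finer granularity."""
    if unit == "lb":
        return value
    if unit == "kg":
        return value * LB_PER_KG
    if unit == "tenths_kg":
        return value / 10 * LB_PER_KG
    raise ValueError(f"unknown weight unit: {unit}")


def update_area_code(phone: str, old: str = "031", new: str = "0131") -> str:
    """Domain evolution: rewrite numbers still carrying the old area code."""
    return new + phone[len(old):] if phone.startswith(old) else phone


def normalise(row: dict) -> dict:
    """Bring one source row into a common representation."""
    phone: Optional[str] = row["phone"]
    return {
        "name": row["name"],
        "birthdate": row["birthdate"],
        "phone": update_area_code(phone) if phone else None,
        "sex": SEX_SYNONYMS[row["sex"].upper()],
        "weight_lb": weight_in_pounds(row["weight"], row["weight_unit"]),
        "pne": row["pne"],
    }


def q6(rows: List[dict]) -> List[Tuple[str, str, Optional[str]]]:
    """Male patients with a PNE result > 4.0 and a weight above 180 pounds."""
    answer = []
    for r in map(normalise, rows):
        if r["sex"] == "M" and r["pne"] > 4.0 and r["weight_lb"] > 180:
            answer.append((r["name"], r["birthdate"], r["phone"]))
    return answer


# Hypothetical sample rows, one per encoding style.
ROWS = [
    {"name": "Patient A", "birthdate": "1975-03-02", "phone": "031 555 0100",
     "sex": "Male", "weight": 85.0, "weight_unit": "kg", "pne": 5.2},
    {"name": "Patient B", "birthdate": "1968-11-20", "phone": None,
     "sex": "M", "weight": 175.0, "weight_unit": "lb", "pne": 6.0},
    {"name": "Patient C", "birthdate": "1982-07-14", "phone": "0131 555 0101",
     "sex": "F", "weight": 900, "weight_unit": "tenths_kg", "pne": 7.1},
    {"name": "Patient D", "birthdate": "1990-01-30", "phone": "0131 555 0102",
     "sex": "M", "weight": 850, "weight_unit": "tenths_kg", "pne": 3.0},
]
```

With these sample rows, only Patient A satisfies all three conditions once the weight is converted from kilograms and the telephone number is rewritten to the new area code. The sketch deliberately leaves out the harder cases named above, such as differing abstraction levels for RESULT and the ambiguous meaning of NULL.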
3.5 SUMMARY
Much research has been carried out on the problem of accessing heterogeneous distributed database systems, and a range of different aspects of heterogeneity has been identified by different authors. This chapter presents a framework for classifying the different types of heterogeneity, which brings together the different aspects of heterogeneity addressed by these authors. A summary of this classification is given in Figure 3.6. An overview of some of the work done by different researchers on the problem of heterogeneity is given in section 3.2. A summary of how the different concepts covered by different authors fit into the proposed framework is given in Table 3.1.

From this framework a test suite has been developed which can be used to evaluate and compare the extent to which different approaches handle different aspects of this heterogeneity. A major advantage of this test suite is that it consists of four small databases and a small set of queries, all of which are easy to implement. Using it, all aspects of heterogeneity identified in the framework are covered, with the exception of data model heterogeneity. This classification is based on a relational model, although it could easily be adapted to other paradigms. Such a framework can provide an aid for database designers and for research on integrating heterogeneous databases.

REFERENCES
[45] M.A. Casanova, V.M.P. Vidal, Towards a sound view integration methodology, Second ACM SIGACT/SIGMOD Conference on Principles of Database Systems, Atlanta, Ga., ACM, New York, Mar. 21-23, (1983) 36-47.
[46] B. Kahn, A structured logical database design methodology, Ph.D. Dissertation, Department of Computer Science, University of Michigan, Ann Arbor, Mich., (1979).
[47] W. Kim, J. Seo, Classifying schematic and data heterogeneity in multidatabase systems, Computer 24 (12) (1991) 12-18.
[48] R. Elmasri, J. Larson, S.B. Navathe, Integration algorithms for federated databases and logical database design, Tech. Rep., Honeywell Corporate Research Center, (1987).
[49] M.V. Mannino, W. Effelsberg, A methodology for global schema design, Tech. Rep. No. TR-84-1, Department of Computer and Information Science, University of Florida, (1984).
[50] M. Kaul, K. Drosten, E.J. Neuhold, View System: integrating heterogeneous information bases by object-oriented views, IEEE 6th International Conference on Data Eng., Los Angeles, (1990) 2-10.
[51] S.M. Deen, R.R. Amin, M.C. Taylor, Data integration in distributed databases, IEEE Trans. Softw. Eng. SE-13 (7) (1987) 860-864.
[52] S.D. Urban, J. Wu, Resolving semantic heterogeneity through the explicit representation of data model semantics, Sigmod Record 20 (4) (1991) 55-58.
[53] P. Drew, R. King, D. McLeod, M. Rusinkiewicz, A. Silberschatz, Report of the workshop on semantic heterogeneity and interoperation in multidatabase systems, Sigmod Record 22 (3) (1993) 47-56.
[54] M.F. Worboys, S.M. Deen, Semantic heterogeneity in distributed geographic databases, Sigmod Record 20 (4) (1991) 30-34.
[55] V. Ventrone, S. Heiler, Semantic heterogeneity as a result of domain evolution, Sigmod Record 20 (4) (1991) 16-20.
[56] P. Fankhauser, E.J. Neuhold, Knowledge based integration of heterogeneous databases, Interoperable Database Systems (DS-5) (A-25), D.K. Hsiao, E.J. Neuhold and R. Sacks-Davis (Editors), Elsevier Science Publishers B.V. North-Holland, (1993) 155-175.
[57] A. Sheth, V. Kashyap, So far (schematically) yet so near (semantically), Interoperable Database Systems (DS-5) (A-25), D.K. Hsiao, E.J. Neuhold and R. Sacks-Davis (Editors), Elsevier Science Publishers B.V. North-Holland, (1993) 283-311.
[58] A.F. Cardenas, Heterogeneous distributed database management: The HD-DBMS, Proceedings of the IEEE 75 (5) (1987) 588-600.
[59] A. Ferrier, C. Stangret, Heterogeneity in the distributed database management system Sirius-Delta, Eighth Int. Conf. on Very Large Data Bases, Mexico City, (1982) 45-53.
[60] S. Spaccapietra, C. Parent, Conflicts and correspondence assertions in interoperable databases, Sigmod Record 20 (4) (1991) 49-54.
[61] Y. Breitbart, P.L. Olson, G.R. Thompson, Database integration in a distributed heterogeneous database system, Second IEEE Data Eng. Int. Conf., CS Press, Los Alamitos, Calif., Order No. 655, (1986) 301-310.
[62] D.K. Hsiao, M.N. Kamel, Heterogeneous databases: proliferations, issues, and solutions (Invited Paper), IEEE Trans. on Know. and Data Eng. 1 (1) (1989) 45-62.
[63] A. Sheth, Special issue on semantic heterogeneity, Sigmod Record 20 (4) (1991).
[64] J.H. Ter Bekke, Semantic data modeling in relational environments, Ph.D. Thesis, University of Delft, (1991).
[65] S. Al-Fedaghi, P. Scheuermann, Mapping considerations in the design of schemas for the relational model, IEEE Trans. Softw. Eng. SE-7 (1) (1981) 99-111.
[66] S.B. Yao, V.E. Waddle, B.C. Housel, View modelling and integration using the functional data model, IEEE Trans. Softw. Eng. SE-8 (6) (1982) 544-553.
[67] T. Teorey, J. Fry, Design of database structures, Prentice-Hall, Englewood Cliffs, N.J., (1982).