Transcript
Agenda
• Brief introduction to distributed databases and multidatabases (pages 1-2)
• Automated Resolution of Semantic Heterogeneity in Multidatabases (pages 2-4)
• SEMINT: A tool for identifying attribute correspondences in heterogeneous databases using neural networks (pages 5-7)
• Context Interchange: New Features and Formalisms for the Intelligent Integration of Information (pages 8-9)
5/22/2017 1

Distributed Database Architecture
[Diagram: external schemas 1..n sit above a single global conceptual schema; beneath it, each site i has a local conceptual schema i over a local internal schema i.]
• A distributed database management system (DDBMS) is the software system that permits the management of the distributed database and makes the distribution transparent to the users.
• Local internal schema: the data are physically located at different sites, and each site has its own local internal schema.
• Local conceptual schema: because data in a distributed database are usually fragmented and replicated, the local conceptual schema describes the logical organization of the data at each site.
• Global conceptual schema: the union of the local conceptual schemas; it provides a single view of the database.
• External schema: supports user applications' access to the database; it sits above the global conceptual schema.

Distributed Multidatabase Systems
• Besides the issues common to distributed database systems, distributed multidatabase systems must also address database integration, a global query language, and query translation.
• Two steps in database integration:
1. Translation: local database schemas are translated (mapped) to a common intermediate canonical representation.
2. Schema integration: each intermediate schema is integrated into a global conceptual schema.
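The two-step integration process above can be sketched as a small pipeline. This is a minimal illustration only: the function names and the dictionary-based canonical form (lowercased table name mapped to a set of attribute names) are hypothetical stand-ins for a real canonical data model.

```python
# Sketch of the two-step database integration process:
# (1) translate each local schema to a common canonical form,
# (2) integrate the intermediate schemas into one global schema.
# The canonical form here (lowercased table -> attribute set) is a
# hypothetical stand-in for a real canonical data model.

def translate(local_schema):
    """Step 1: map a local schema to the intermediate canonical form."""
    return {table.lower(): {attr.lower() for attr in attrs}
            for table, attrs in local_schema.items()}

def integrate(intermediate_schemas):
    """Step 2: merge intermediate schemas into a global conceptual schema."""
    global_schema = {}
    for schema in intermediate_schemas:
        for table, attrs in schema.items():
            global_schema.setdefault(table, set()).update(attrs)
    return global_schema

db1 = {"STAFF": ["Employees", "Wages"]}
db2 = {"Staff": ["Engineers", "Salary"]}
global_schema = integrate([translate(db1), translate(db2)])
# global_schema == {"staff": {"employees", "wages", "engineers", "salary"}}
```

Note that this naive merge only unifies tables whose names match after normalization; resolving semantically similar but differently named entities is exactly the problem the papers below address.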
• With database integration, a multidatabase system provides integrated global access to heterogeneous, autonomous local databases in a distributed system.

Database Integration Process
[Diagram: databases 1..n feed translators 1..n, which produce intermediate schemas 1..n; an integrator combines these into the global conceptual schema.]

Three major design approaches for multidatabase systems:
1. Global-schema multidatabases
2. Federated databases
3. Multidatabase language systems
(The Summary Schemas Model is based on approach #3.)

1. Global-Schema Multidatabases
Another layer above the local external schemas.
Benefits:
• Global users essentially see a single, large, integrated database.
Problems:
• The amount of global knowledge: a global schema can be a very large data object.
• The global DBA must understand all the local optimizations.
• Changes to local schemas must be reflected in the global schema.

2. Federated Databases
Federated databases require only partial integration. A federated database integrates a collection of local database systems by supporting interoperability between pairs or collections of the local databases rather than through a complete global schema.

3. Multidatabase Language Systems
• Attempt to resolve some of the problems associated with global schemas.
• Beyond standard database capabilities, most of the language extensions involve manipulating differing data representations. The language must be able to transform source information into the integrated representations.

Automated Resolution of Semantic Heterogeneity in Multidatabases
M.W. Bright, A.R. Hurson, and S. Pakzad
June 1994, ACM Transactions on Database Systems

Summary
• Problem: identifying semantically similar data in different local databases.
• Solution: the Summary Schemas Model.
• Key characteristics of the solution: accepts imprecise queries; uses semantic distance to measure similarity; uses Roget's Thesaurus as the taxonomy for constructing hierarchical summary schemas.

Summary Schemas Model
• One study showed that the probability of two subjects picking the same term for a given entity ranged from 7% to 18%.
• The Summary Schemas Model (SSM) extends multidatabase systems with linguistic support to automatically identify semantically similar entities that are named with different terms.
• The proposed SSM provides a user-friendly interface by allowing users to specify queries in their own terms (an imprecise query) and/or to use imprecise terms for data references.

Large-System User Interfaces
• In a small data-access system, it is reasonable to expect all users to learn the exact names of the different entities.
• In large distributed systems, it is unreasonable to expect users to know the exact system access terms (learning them manually is too time-consuming); this motivates the multidatabase language systems approach.

Semantic Relationships
• Synonym: semantically similar (similar meaning), e.g., "salary" is similar to "wage".
• Hypernym: a term with a broader, more general meaning, e.g., "earnings" is broader than "salary".
• Hyponym: the opposite of a hypernym.

Summary Schemas Model
• The summary schemas and the online taxonomy map imprecise terms to the semantically closest terms that actually exist in the system.
• Taxonomy: the 1965 version of Roget's Thesaurus. All Summary Schemas Model processing is described in terms of Roget's structure.
• A summary schema represents the input data in a more abstract manner and consequently needs fewer terms to describe the information.
• Example:
Consider two base relations: one includes the attributes "city" and "zip code", while the other has "city" and "country". A global-schema representation of these schemas might have a generalized object with the attributes "city", "zip code", and "country". The global schema is a precise representation of the base schemas. The summary schema for the same base relations may represent the input attributes with a single access term (hypernym), "location".

Schema Summation and the Summary Schema Hierarchy
• Leaf nodes map to the terms in the local schemas.
• Each internal level of the hierarchy has a hypernym relationship with its leaf node(s).
• The Summary Schemas Model structures multidatabase nodes in a hierarchy; each internal node contains a summary schema.

Semantic-Distance Metric
• A key feature of the SSM is the ability to identify semantically similar entities. The Semantic-Distance Metric (SDM) provides a quantitative measurement of "semantic similarity".

Imprecise-Query Processing
• When an imprecise reference is detected at the query origin node, the summary schemas structure is used to match the reference to a semantically close precise reference, and query processing then proceeds normally with precise data references.
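In its simplest form (the one used in the worked example below), the SDM is a count of hypernym links between two terms in the taxonomy. The following sketch illustrates that idea; the taxonomy fragment and function names are hypothetical, and the paper's full metric is richer than a plain link count.

```python
# Illustrative SDM as a count of hypernym links, over a tiny hypothetical
# taxonomy fragment (child -> parent). Not the actual Roget's structure.
HYPERNYMS = {
    "salary": "pay",
    "wages": "pay",
    "pay": "payment",
    "payment": "possessive relations",
}

def sdm(term_a, term_b):
    """Semantic-Distance Metric: number of hypernym links between two terms."""
    # Record the distance from term_a to each of its ancestors.
    dist_a = {}
    t, d = term_a, 0
    while t is not None:
        dist_a[t] = d
        t, d = HYPERNYMS.get(t), d + 1
    # Walk up from term_b until a common ancestor is reached.
    t, d = term_b, 0
    while t is not None:
        if t in dist_a:
            return dist_a[t] + d
        t, d = HYPERNYMS.get(t), d + 1
    return None  # no common ancestor in this fragment

def resolve_imprecise(term, precise_terms, max_sdm=1):
    """Map an imprecise user term to the closest precise access term
    within the allowed semantic distance (the imprecise-query step above)."""
    scored = [(sdm(term, p), p) for p in precise_terms]
    scored = [(d, p) for d, p in scored if d is not None and d <= max_sdm]
    return min(scored)[1] if scored else None
```

For instance, the imprecise term "pay" resolves to the precise access term "salary" at distance 1 (one hypernym link), matching the behavior in the worked query example below.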
Example of Summary Schemas Model Query Processing
NODE 1 (leaf): external schema: (staff, employees, supervisor, wages, address)
NODE 2 (leaf): external schema: (personnel, engineers, manager, salary, town, income, city)
1st-level internal node summary schema:
NODE 3: (personnel<1,2>, worker<1>, manager<1,2>, pay<1,2>, locality<1>, artisan<2>, district<2>, earnings<2>, polity<2>)
2nd-level internal node summary schema:
NODE 4: (agent<3>, director<3>, payment<3>, location<3>, region<3>, acquisition<3>, authority<3>)
3rd-level internal node summary schema:
NODE 5: (voluntary action<4>, possessive relations<4>, space in general<4>, general<4>)

GIVEN: The Semantic-Distance Metric (SDM) is a simple count of links in the hypernym hierarchy. The SDM value used is 1. The user is somewhat familiar with the STAFF and PERSONNEL databases and wants to retrieve employees who make more than $20,000. The user is unsure of the access term for wages in Node 2. The LET command in the query represents the multidatabase language system's ability to combine multiple data references into a single term.

QUERY:
LET PERS = NODE1.STAFF AND NODE2.PERSONNEL; /* open two databases with precise data references */
LET EMP = NODE1.EMPLOYEES AND NODE2.ENGINEERS; /* combine two precise relation references */
LET PY = NODE1.WAGES AND PAY; /* "pay" is an imprecise data reference */
SELECT ID, PY FROM PERS.EMP WHERE PY > 20000;

QUERY PROCESSING:
Step 1) At Node 1, parse the query. PY(pay) is an imprecise reference, so pass the query up to Node 3.
Step 2) Node 3 has a summary schema term "pay", which is 0 links away from PY(pay). Since the SDM at Node 3 is less than 1, send a message down the hierarchy to see if the actual access term at Node 2 is within the specified semantic distance. At Node 2, the access term "salary" is 1 link away from "pay" (the hypernym link). Replace PY(pay) with PY(NODE2.SALARY) and return to Node 3.
Step 3) All data references are now precise.
Step 4) Execute the multidatabase query.

SEMINT: A Tool for Identifying Attribute Correspondences in Heterogeneous Databases Using Neural Networks
Wen-Syan Li and Chris Clifton
2000, Data & Knowledge Engineering

Summary
• Problem: semantic integration, i.e., identifying relationships between attributes or classes in different database schemas.
• Solution: the SEMantic INTegrator (SEMINT).
• Key characteristics of the solution: utilizes both schema information and data contents; uses neural networks.
• Evaluations: a preliminary experiment; the Boeing Company tooling database; the US NSF and Canada NSERC funding award databases.

Related Work
• Comparing attribute names. This approach assumes that the same attribute may be represented by synonyms in different databases. Complications: abbreviations and homonyms.
• Comparing field specifications at the schema level. Attribute matches are 'pre-programmed' by DBAs.
• Comparing attribute values and patterns at the data content level. Relationships and entity sets can be integrated primarily based on their domain relationships: EQUAL, CONTAINS, OVERLAP, CONTAINED-IN, and DISJOINT. Determining such relationships can be time-consuming and tedious.

SEMantic INTegrator (SEMINT)
• The authors focus on the problem of identifying corresponding attributes in different DBMSs that reflect the same real-world class of information.
• Observation: attributes in different databases that represent the same real-world concept will likely have similarities in schema design, constraints, and data value patterns.
• SEMINT supports access to a variety of database systems and utilizes both schema information and data contents to produce rules for matching corresponding attributes automatically.
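The observation above (corresponding attributes share schema-design and data-value similarities) can be illustrated with a toy attribute "signature": a feature vector built from schema features and data-content statistics, compared by distance. The feature choices, field names, and similarity formula here are hypothetical illustrations, not SEMINT's actual discriminators or network.

```python
import math

# Toy attribute "signature": normalized schema features plus data-content
# statistics, in the spirit of SEMINT's use of both kinds of information.
def signature(is_numeric, length, nullable, values):
    """Build a feature vector in [0, 1] for one attribute."""
    norm = lambda x: x / (1.0 + x)  # squash unbounded stats into [0, 1)
    nums = [float(v) for v in values] if is_numeric else [len(str(v)) for v in values]
    return [
        1.0 if is_numeric else 0.0,      # data type (binary feature)
        norm(length),                    # declared field length
        1.0 if nullable else 0.0,        # NULL allowed (constraint)
        norm(sum(nums) / len(nums)),     # average of the data values
        norm(max(nums) - min(nums)),     # spread of the data values
    ]

def similarity(sig_a, sig_b):
    """Similarity in [0, 1]: 1 minus normalized Euclidean distance."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))
    return 1.0 - d / math.sqrt(len(sig_a))

emp_id = signature(True, 9, False, [101234567, 387654321])
ssn    = signature(True, 9, False, [123456789, 321987654])
city   = signature(False, 20, True, ["Seattle", "Everett"])
# Corresponding attributes (emp_id, ssn) score higher than unrelated ones.
```

SEMINT replaces this hand-written distance with a trained neural network, which is what lets it learn which features matter for a given pair of databases.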
Technologies in SEMINT
Neural networks versus traditional programmed computing:
• Programmed computing follows a known set of rules.
• Neural networks are trained.

[Figure: Attribute correspondence identification procedure]

1. Metadata extraction using DBMS-specific parsers
• Data type conversion. Five types are defined: character, number, date, rowid, and raw.
• Normalization of metadata:
- Binary values.
- Category values: a category input is converted into a vector of binary values, e.g., (1, 0, 0).
- Range values: a SIGMOID-like function converts them to the range [0, 1].

2. Classifier
Before the metadata is used for neural network training, a classifier is used to cluster the attributes of a single database into categories. Reasons:
- Ease of training.
- If we have information in one database that refers to the same real-world information, we do not want to separate it into two different categories.
The user determines how fine the categories are by setting the radius of the clusters rather than the number of categories.
[Figure: Illustration of a 3-D self-organizing map]
[Figure: Classifier architecture in SEMINT]

3. Category learning and recognition neural networks
The authors used the output of the classifier (M vectors) in a back-propagation network to train a network to recognize database attributes' signatures. After the back-propagation network is trained, the attribute health_Plan.Insured# is run through the neural network. The network determines the similarity between the given input pattern and each of the M categories. It shows that the input pattern 'Insured#' is closest to category 3 (SSN and Emp_ID), with similarity 0.92. It also shows that health_Plan.Insured# is not likely related to the other attributes, since their similarities are low.
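The metadata normalization in step 1 above can be sketched as follows. This is a minimal illustration of the three listed feature kinds (binary, category, range); the exact encodings and scale parameters in SEMINT differ, and the names here are hypothetical.

```python
import math

# Sketch of SEMINT-style metadata normalization: binary flags, one-hot
# category values, and unbounded range values squashed into (0, 1) with
# a sigmoid-like function. Encodings are illustrative, not the paper's.
def normalize_binary(flag):
    return 1.0 if flag else 0.0

def normalize_category(value, categories):
    """One-hot encode a category value, e.g., 'number' -> [0, 1, 0, 0, 0]."""
    return [1.0 if value == c else 0.0 for c in categories]

def normalize_range(x, scale=1.0):
    """Sigmoid-like squashing of an unbounded value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x / scale))

# Example: encode part of one attribute's metadata as a feature vector.
TYPES = ["character", "number", "date", "rowid", "raw"]
vector = (
    normalize_category("number", TYPES)      # data type (category)
    + [normalize_binary(False)]              # NULL allowed? (binary)
    + [normalize_range(9, scale=10.0)]       # declared length 9 (range)
)
```

Keeping every feature in [0, 1] is what makes the vectors usable as neural network inputs, and it is why the classifier's cluster radius has a consistent meaning across features.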
Using Neural Networks
To determine the attribute correspondences between two databases, users take the network trained for one database and use the information extracted from the other database as input to this network.

[Figure: Automated attribute correspondence identification procedure (summary)]

Test 1: Preliminary experimental results
• The results of these experiments showed a substantial gap between high and low similarity: high similarity reflected corresponding attributes, and low similarity reflected non-corresponding attributes.

Test 2: The Boeing Company tooling database
The database contains 12 tables and 412 attributes. The authors split this information into two parts: the first part (denoted AM) had 4 tables with 260 attributes; the second (OZ) had 8 tables with 152 attributes. They trained a neural network to recognize the attributes of AM and then determined the similarity between this network and OZ. The precision in this test was approximately 80%, and the recall was close to 90%.
(Why compare AM with OZ? Do they need to be integrated?)

Test 3: US NSF and Canada NSERC funding award databases
• The NSF and NSERC award databases are both stored in relational databases.
• They contain information on research grants.
• They contain a wide variety of information, with a few attributes reflecting the same information.

Context Interchange: New Features and Formalisms for the Intelligent Integration of Information
Cheng Hian Goh, S. Bressan, S. Madnick, and M.
Siegel
July 1999, ACM Transactions on Information Systems

Summary
• Problem: intelligent integration of information. (Zhongming: please be more specific.)
• Solution: the context interchange strategy, a mediator-based approach for achieving semantic interoperability among heterogeneous sources and receivers.
• Key characteristics of the solution: a context mediator based on logic.

Context Interchange by Example
Q1: SELECT r1.cname, r1.revenue
    FROM r1, r2
    WHERE r1.cname = r2.cname AND r1.revenue > r2.expenses;
Without mediation, this query will return the empty answer. In the context interchange system, the semantics of the data can be explicitly represented in the form of a context theory and a set of elevation axioms with reference to a domain model.

The Context Mediator rewrites the user query into a mediated query. The Optimizer transforms this into an optimized query plan. The optimized query plan is executed by the Executioner, which dispatches subqueries to the individual systems, collates the results, undertakes conversions, and returns the answers to the receiver.

E.g., Q1 is transformed into a mediated query, MQ1 (how is this done?):
SELECT r1.cname, r1.revenue
FROM r1, r2, r4
WHERE r1.country = r4.country AND r4.currency = 'USD'
  AND r1.cname = r2.cname AND r1.revenue > r2.expenses
UNION
SELECT r1.cname, r1.revenue * 1000 * r3.rate
FROM r1, r2, r3, r4
WHERE r1.country = r4.country AND r4.currency = 'JPY'
  AND r1.cname = r2.cname AND r3.fromCur = 'JPY'
  AND r3.toCur = 'USD' AND r1.revenue * 1000 * r3.rate > r2.expenses
UNION
SELECT r1.cname, r1.revenue * r3.rate
FROM r1, r2, r3, r4
WHERE r1.country = r4.country AND r4.currency <> 'USD'
  AND r4.currency <> 'JPY' AND r3.fromCur = r4.currency
  AND r3.toCur = 'USD' AND r1.cname = r2.cname
  AND r1.revenue * r3.rate > r2.expenses;

The Context Interchange Framework
The truth or falsity of a statement can only be understood with reference to a given context.
This is formalized using assertions of the form ist(c, σ): the statement σ is true ("ist") in context c.

The Domain Model
Two kinds of data objects:
• Primitive objects: native to the sources and receivers, e.g., strings, integers, reals.
• Semantic objects: complex types introduced to support the integration strategy.
A domain model is a collection of primitive types and semantic types. A semantic object may have different values in different "contexts". Suppose we introduce two contexts, labeled c1 and c2, which we associate with the sources and the receiver. We may write:
f_r1_revenue("NTT")[value(c1) -> 1000000]
f_r1_revenue("NTT")[value(c2) -> 9600000]

Elevation Axioms
Elevation axioms provide the means for mapping "values" present in the sources to "objects" that are meaningful with respect to the domain model. The mapping of attributes to semantic types is formally represented in two sets of assertions:
f_r1_revenue(x, y, z) : moneyAmt <- r1(x, y, z)
f_r1_revenue(x, y, z)[value(c) -> y] <- r1(x, y, z)
where (r1, c) is a source-to-context mapping, i.e., a pair (s, c) of source and context.

Context Axioms
Context axioms associated with a source or receiver articulate the data semantics that are often implicit in the given context:
x : moneyAmt, y : semanticNumber |- y[value(c2) -> 1] <- x[scaleFactor(c2) -> y]
x : moneyAmt, y : currencyType |- y[value(c2) -> "USD"] <- x[currency(c2) -> y]
(i.e., in context c2, money amounts are reported with scale factor 1 and in the currency USD.)

Query Mediation as Abductive Inference
The simplest case of abduction takes the form: from observing A, and the axiom B ⊃ A, infer B as a possible "explanation" of A.

Reference (slides 1-5): M. Tamer Özsu and Patrick Valduriez, "Principles of Distributed Database Systems", Prentice Hall, 1991.