1 Distributed Database Concepts
8:30-10:00 AM, Thursday, July 21st 2005
CSIG05
Chaitan Baru

2 What is the issue?
• Ability to access data stored in multiple, different databases using a single request, e.g.
  – Get geologic information from multiple geologic databases
  – Get employee information from all branches
• Ability to update data stored in multiple databases, e.g.
  – Transfer salary amount from University to my bank account
  – Transfer funds from Visa account to vendor’s account

3 Distributed data access
[Diagram: Client accessing Database 1, Database 2, Database 3]
• Homogeneous: MySQL throughout
• Heterogeneous: MySQL, Oracle, DB2, Excel, ASCII flat file
• How about creating a “cached” local copy?

4 Data Warehousing
[Diagram: Client; Data Warehouse (common schema); ETL from Data Source 1, Data Source 2, Data Source 3]
1. Load data from sources to warehouse
  – Extract
  – Transform
  – Load
2. Query processing: interaction only between client and warehouse
But, warehouse data could be “stale”, i.e. out of sync with source data…

5 Data integration via middleware
[Diagram: Client; Data integration Middleware (aka Mediator); Database 1, Database 2, Database 3]
1. Each client request goes to sources, via middleware
2. Result collected by middleware and returned to client

6 Warehousing vs Mediation
• Warehousing: Use ETL to “massage” local data to fit into a common, global warehouse schema
• Mediation: Modify user query to match schemas exported by each source
  – But, which schema does the user query?
  – The Integrated View Schema
  – Sources “export” a view (the export schema)
• Federated databases
  – Local sources belong to different “administrative domains”, i.e. different owners.
  – Local autonomy

7 The Canonical Mediator / Wrapper Architecture
[Diagram labels, top to bottom]
Client Application (sends Q1)
Wrapper processes could execute at sources, at mediator, or elsewhere
Cached data
Export view in mediator data model
Local view in local data model
Mediator (Integrated view in mediator data model, e.g.
relational, XML)
Mediator to wrappers: Q11, Q12, Q13, Q14
Four Wrappers, each over a Local schema
Data source 1, Data source 2, Data source 3, Data source 4 (q14 goes to Data source 4)

8 Example: A Relational Mediator
[Diagram: Client Application; Mediator (Relational data model); two Wrappers, over a Relational DBMS (e.g. PostGIS) and a Shape file]

9 Example: A Shape-file Based Mediator
[Diagram: Client Application; Mediator (Shape file-based data model); two Wrappers, over a Relational DBMS (e.g. PostGIS) and a Shape file]

10 Example: An XML Mediator
[Diagram: User / Applications; Mediator (XML-based data model, e.g. GML); three Wrappers, over a Relational DBMS (e.g. PostGIS), a Shape file, and an XML file (e.g. ArcXML)]

11 User Authentication and Access Control
[Diagram: Client Application; Mediator; two Wrappers; Data source 1, Data source 2]
1. User authenticates to system
2. User connects to mediator (passes credentials to mediator)
3. Mediator connects to sources
  a) Using original user credentials
  b) Or, mapped credentials (role-based access)
4. Need to define users or roles in sources

12 Different types of heterogeneity in data integration
• Platform heterogeneity: different OS platforms
• DBMS heterogeneity: different database systems, e.g.
SQL Server, MySQL, DB2
• Data type heterogeneity
• Schema heterogeneity
• Heterogeneity in units, accuracy, resolution
• Semantic heterogeneity

13 Schema Integration
• A long-standing Computer Science problem
• Simple case
  Source 1 (behind a Wrapper), Table: Sample ID: varchar; Rock type: varchar; Age: int; …
  Source 2 (behind a Wrapper), Table: Sample ID: varchar; Rock type: varchar; Age: varchar; …
  Source 2 Wrapper: convert between int and varchar for Age
  – Mediator View: (SampleID varchar, Rock_Type varchar, Age int)
  – In Source2 Table, map Age to int

14 Another integration scenario
Source 1 Table: Sample ID: varchar; Rock type: varchar; Eon: varchar (Phanerozoic); Era: varchar (Mesozoic); Period: varchar (Jurassic)
Source 2 Table: Sample ID: varchar; Rock type: varchar; Age: varchar (“Phanerozoic/mesozoic;jur”)
– Mediator View: (SampleID varchar, Rock_Type varchar, Age varchar, Era varchar, Period varchar)
– In Source 2 Table, parse Age to obtain sub-components of the field

15 A more advanced integration scenario
Source 1 Table: Sample ID: varchar; Rock type: varchar; Eon: varchar (Phanerozoic); Era: varchar (Mesozoic); Period: varchar (Jurassic)
Source 2 Table: Sample ID: varchar; Rock type: varchar; Age: int (150)
• Mediator View: (SampleID varchar, Rock_Type varchar, Eon varchar, Era varchar, Period varchar)
  – Same as Source1 table schema
• Query: Get rock types for all rocks from the Jurassic period

16 Doing the integration
• Query sent to mediator:
  SELECT DISTINCT(Rock_Type) FROM Mediator_View WHERE Period='Jurassic'
• Query to Source 1:
  SELECT DISTINCT(Rock_Type) FROM Source1_Table WHERE Period='Jurassic'
• For Source2, need to map Period=“Jurassic” to Age values
Source 2 Table: Sample ID: varchar; Rock type: varchar; Age: int
Geologic_Time Table: Eon: varchar; Era: varchar; Period: varchar; Min: int; Max: int

17 Query “fragment” sent to Source 2
• Where is the Geologic_Time table stored?
• SELECT DISTINCT (S2.Rock_Type)
  FROM Source2_Table S2, Geologic_Time_Table GT
  WHERE GT.Period = 'Jurassic' AND (S2.Age >= GT.Min) AND (S2.Age <= GT.Max)

18 Another complex query
• Query: Get rock types for all rocks from the Mesozoic era
  – Easy to do for Source 1: Era = “Mesozoic”
  – For Source 2:
    • Need to find numeric age range for Mesozoic
      – Find age range across all subclasses of Mesozoic (Cretaceous, Jurassic, Triassic)
    • Select all Source 2 Table records whose age falls within the Mesozoic age range

19 Data Integration Carts©
• Integrating data sets without explicitly creating views
• An example request: Plot all gravity data points that fall within the spatial extent of rocks of a given type, in the Rocky Mountain testbed region
  – Use GEONsearch to find all gravity and geologic data using bounding box for “Rocky Mountain testbed region”
    • Need gazetteer / spatial ontology to determine Rocky Mountain region
    • Need to know classification of datasets (as gravity and geology)
    • Intersect extent of gravity and geologic datasets (from metadata) with extent of Rocky Mountain region
  – Plot gravity point data that fall within polygons of rocks of given type

20 Ad hoc integration
[Diagram labels: Search; Metadata Catalog; GEONsearch (“Geologic and gravity data in Rocky Mountains”); Data Integration Cart©; Query; Plot map; Map]

21 Data Registration
[Diagram labels]
Spatial Ontology: Location; Point; Polygon; Latitude; Longitude
Rock Classification Ontology: Igneous; Granite; Quartzmonzonite
Item Registration (Schema registration)
Item Detail Registration
Gravity dataset: Metadata; (X, Y)
Geologic dataset: Metadata; Lat, Long, RockType

22 Data Registration is Important!
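
The Source 2 queries on slides 16 to 18 can be tried out with a small sketch. It is only an illustration under stated assumptions, not the mediator's actual implementation: it uses Python's sqlite3 module with an in-memory database, the Min and Max columns are renamed Min_Age and Max_Age to keep them distinct from the SQL aggregate functions, the Geologic_Time table is assumed to sit next to the Source 2 table (the slides leave its location open), and the sample rows and age boundaries (in millions of years, rounded) are made up for the demonstration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for Source 2 plus the Geologic_Time lookup table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Source2_Table (SampleID TEXT, Rock_Type TEXT, Age INTEGER);
        -- Assumption: Min/Max from the slides are renamed Min_Age/Max_Age.
        CREATE TABLE Geologic_Time_Table (
            Eon TEXT, Era TEXT, Period TEXT, Min_Age INTEGER, Max_Age INTEGER
        );
        """
    )
    # Illustrative, rounded boundaries in millions of years (assumed values).
    conn.executemany(
        "INSERT INTO Geologic_Time_Table VALUES (?, ?, ?, ?, ?)",
        [
            ("Phanerozoic", "Mesozoic", "Cretaceous", 66, 145),
            ("Phanerozoic", "Mesozoic", "Jurassic", 145, 201),
            ("Phanerozoic", "Mesozoic", "Triassic", 201, 252),
            ("Phanerozoic", "Paleozoic", "Permian", 252, 299),
        ],
    )
    # Hypothetical Source 2 samples; Age is numeric, as on slide 15.
    conn.executemany(
        "INSERT INTO Source2_Table VALUES (?, ?, ?)",
        [
            ("S2-1", "Granite", 150),
            ("S2-2", "Basalt", 100),
            ("S2-3", "Sandstone", 220),
            ("S2-4", "Limestone", 280),
            ("S2-5", "Basalt", 160),
        ],
    )
    return conn


def rock_types_for_period(conn, period):
    """Slide 17: join Source 2 ages against the age range of one period."""
    rows = conn.execute(
        """
        SELECT DISTINCT S2.Rock_Type
        FROM Source2_Table S2, Geologic_Time_Table GT
        WHERE GT.Period = ?
          AND S2.Age >= GT.Min_Age
          AND S2.Age <= GT.Max_Age
        ORDER BY S2.Rock_Type
        """,
        (period,),
    ).fetchall()
    return [row[0] for row in rows]


def rock_types_for_era(conn, era):
    """Slide 18: take the age range across all periods of an era, then filter Source 2."""
    low, high = conn.execute(
        "SELECT MIN(Min_Age), MAX(Max_Age) FROM Geologic_Time_Table WHERE Era = ?",
        (era,),
    ).fetchone()
    rows = conn.execute(
        """
        SELECT DISTINCT Rock_Type
        FROM Source2_Table
        WHERE Age BETWEEN ? AND ?
        ORDER BY Rock_Type
        """,
        (low, high),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = build_demo_db()
    print("Jurassic:", rock_types_for_period(demo, "Jurassic"))
    print("Mesozoic:", rock_types_for_era(demo, "Mesozoic"))
```

In this sketch the lookup table lives in the same database as the Source 2 table, so the join runs in one place. If the table were kept at the mediator instead, the mediator would first resolve the period or era to a numeric range and send Source 2 only the Age filter, which is what rock_types_for_era does in two steps.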