1
Distributed Database Concepts
8:30-10:00AM
Thursday, July 21st 2005
CSIG05
Chaitan Baru
2
What is the issue?
• Ability to access data stored in multiple, different databases using a single request, e.g.
– Get geologic information from multiple geologic databases
– Get employee information from all branches
• Ability to update data stored in multiple databases, e.g.
– Transfer salary amount from University to my bank account
– Transfer funds from Visa account to vendor’s account
3
Distributed data access
[Diagram: a Client accessing Database 1, Database 2, and Database 3]
• Homogeneous: every database is mySQL
• Heterogeneous: a mix of sources, e.g. mySQL, Oracle, DB2, Excel, ASCII flat file
• How about creating a “cached” local copy?
4
Data Warehousing
[Diagram: Client → Data Warehouse (common schema) ← ETL ← Data Source 1, Data Source 2, Data Source 3]
1. Load data from sources to warehouse (ETL)
– Extract
– Transform
– Load
2. Query processing: interaction only between client and warehouse
But, warehouse data could be “stale”, i.e. out of sync with source data…
5
Data integration via middleware
[Diagram: Client ↔ Data integration Middleware (aka Mediator) ↔ Database 1, Database 2, Database 3]
1. Each client request goes to sources, via middleware
2. Result collected by middleware and returned to client
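The two numbered steps of the middleware slide can be sketched in a few lines. This is only an illustration under stated assumptions: the `Mediator` class, the `make_source` helper, and the in-memory rows are hypothetical stand-ins for real databases, not part of the original material.

```python
from typing import Callable, Dict, List

# A source is modelled as a function that answers a request with rows.
# (Hypothetical stand-in for a real database connection.)
Source = Callable[[str], List[dict]]


def make_source(rows: List[dict]) -> Source:
    """Build an in-memory source that filters its rows by rock type."""
    def answer(rock_type: str) -> List[dict]:
        return [r for r in rows if r["rock_type"] == rock_type]
    return answer


class Mediator:
    """Forwards each client request to every source and merges the results."""

    def __init__(self, sources: Dict[str, Source]) -> None:
        self.sources = sources

    def query(self, rock_type: str) -> List[dict]:
        results = []
        # Step 1: the request goes to each source via the middleware.
        for name, source in self.sources.items():
            for row in source(rock_type):
                # Step 2: results are collected, tagged with their origin.
                results.append({**row, "source": name})
        return results


mediator = Mediator({
    "database_1": make_source([{"sample_id": "A1", "rock_type": "granite"}]),
    "database_2": make_source([{"sample_id": "B7", "rock_type": "basalt"}]),
    "database_3": make_source([{"sample_id": "C3", "rock_type": "granite"}]),
})
granite = mediator.query("granite")
```

Unlike a warehouse, nothing is copied ahead of time here: every call to `query` reads the sources as they are at that moment.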
6
Warehousing vs Mediation
• Warehousing: Use ETL to “massage” local data to fit into a common, global warehouse schema
• Mediation: Modify user query to match schemas exported by each source
– But, which schema does the user query?
– The Integrated View Schema
– Sources “export” a view (the export schema)
• Federated databases
– Local sources belong to different “administrative domains”, i.e. different owners.
– Local autonomy
7
The Canonical Mediator / Wrapper Architecture
[Diagram: the Client Application sends query Q1 to the Mediator (Integrated view in mediator data model, e.g. relational, XML), which can hold cached data. The Mediator sends sub-queries Q11, Q12, Q13, Q14 to four Wrappers, one in front of each of Data source 1 to Data source 4; the wrapper for Data source 4 issues q14 against its source.]
• Each wrapper: Export view in mediator data model, over a Local view in local data model (Local schema)
• Wrapper processes could execute at sources, at mediator, or elsewhere
8
Example: A Relational Mediator
[Diagram: Client Application → Mediator (Relational data model) → two Wrappers, one over a Relational DBMS (e.g. PostGIS) and one over a Shape file]
9
Example: A Shape-file Based Mediator
[Diagram: Client Application → Mediator (Shape file-based data model) → two Wrappers, one over a Relational DBMS (e.g. PostGIS) and one over a Shape file]
10
Example: An XML Mediator
[Diagram: User / Applications → Mediator (XML-based data model, e.g. GML) → three Wrappers, over a Relational DBMS (e.g. PostGIS), a Shape file, and an XML file (e.g. ArcXML)]
11
User Authentication and Access Control
[Diagram: Client Application → Mediator → two Wrappers → Data source 1, Data source 2]
1. User authenticates to system
2. User connects to mediator (passes credentials to mediator)
3. Mediator connects to sources
a) Using original user credentials
b) Or, mapped credentials (role-based access)
4. Need to define users or roles in sources
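The choice between original and mapped credentials can be illustrated with a small lookup. This is a sketch under assumptions: the user names, role names, source names, and the `source_login` function are all hypothetical, and a real mediator would hand the result to the source's own authentication mechanism.

```python
from typing import Dict, Tuple

# Hypothetical mediator-side tables; every name below is illustrative.
USER_ROLES: Dict[str, str] = {"alice": "geologist", "bob": "guest"}

# Accounts that have been defined in each source for each role.
ROLE_ACCOUNTS: Dict[Tuple[str, str], str] = {
    ("data_source_1", "geologist"): "geo_reader",
    ("data_source_1", "guest"): "public_reader",
    ("data_source_2", "geologist"): "geo_reader",
}


def source_login(user: str, source: str, use_mapped_credentials: bool) -> str:
    """Return the account name the mediator uses to connect to a source."""
    if not use_mapped_credentials:
        # Option (a): pass the original user credentials through.
        return user
    # Option (b): map the user to a role, then to an account in the source.
    role = USER_ROLES[user]
    try:
        return ROLE_ACCOUNTS[(source, role)]
    except KeyError:
        raise PermissionError(
            f"role {role!r} is not defined in {source}"
        ) from None
```

With option (a) every user must exist in every source; with option (b) only the roles need to be defined there, which is why the last step on the slide still applies in both cases.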
12
Different types of heterogeneity in data integration
• Platform heterogeneity: different OS platforms
• DBMS heterogeneity: different database systems, e.g. SQLServer, mySQL, DB2
• Data type heterogeneity
• Schema heterogeneity
• Heterogeneity in units, accuracy, resolution
• Semantic heterogeneity
13
Schema Integration
• A long-standing Computer Science problem
• Simple case
Source 1 Table (behind Source 1 Wrapper)
  Sample ID    Rock type    Age
  varchar      varchar      int
  …
Source 2 Table (behind Source 2 Wrapper: convert between int and varchar for Age)
  Sample ID    Rock type    Age
  varchar      varchar      varchar
  …
– Mediator View:
(SampleID varchar, Rock_Type varchar, Age int)
– In Source2 Table, map Age to int
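One way to picture the simple case is a view that casts Source 2's Age to int before combining the two tables. The sketch below makes several assumptions that are not in the slides: both tables sit in one in-memory SQLite database (real sources would be separate systems behind wrappers), and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source 1 stores Age as an integer; Source 2 stores it as text.
    CREATE TABLE Source1_Table (SampleID TEXT, Rock_Type TEXT, Age INTEGER);
    CREATE TABLE Source2_Table (SampleID TEXT, Rock_Type TEXT, Age TEXT);
    INSERT INTO Source1_Table VALUES ('S1-001', 'granite', 150);
    INSERT INTO Source2_Table VALUES ('S2-001', 'basalt', '65');

    -- The mediator view exposes Age as int for both sources.
    CREATE VIEW Mediator_View AS
        SELECT SampleID, Rock_Type, Age FROM Source1_Table
        UNION ALL
        SELECT SampleID, Rock_Type, CAST(Age AS INTEGER) FROM Source2_Table;
""")

rows = conn.execute(
    "SELECT SampleID, Rock_Type, Age FROM Mediator_View ORDER BY SampleID"
).fetchall()
```

The cast plays the role of the Source 2 wrapper's int/varchar conversion: clients of `Mediator_View` only ever see integer ages.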
14
Another integration scenario
Source 1 Table
  Sample ID    Rock type    Eon            Era         Period
  varchar      varchar      varchar        varchar     varchar
                            Phanerozoic    Mesozoic    Jurassic
Source 2 Table
  Sample ID    Rock type    Age
  varchar      varchar      varchar
                            “Phanerozoic/mesozoic;jur”
– Mediator View:
(SampleID varchar, Rock_Type varchar, Age varchar, Era varchar, Period varchar)
– In Source 2 Table, parse Age to obtain sub-components of the field
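Parsing the Source 2 Age field could look roughly like the following. The delimiters are taken from the single example value on the slide; the abbreviation table and the `parse_age` function are assumptions added for illustration, since the slides do not say how abbreviations such as "jur" are expanded.

```python
import re

# Hypothetical abbreviation table; only "jur" appears in the slides.
PERIOD_ABBREVIATIONS = {
    "jur": "Jurassic",
    "tri": "Triassic",
    "cret": "Cretaceous",
}


def parse_age(age: str) -> dict:
    """Split a value such as 'Phanerozoic/mesozoic;jur' into Eon/Era/Period."""
    # The example value uses '/' and ';' as separators between components.
    eon, era, period = (part.strip() for part in re.split(r"[/;]", age))
    return {
        "Eon": eon.capitalize(),
        "Era": era.capitalize(),
        # Expand a known abbreviation, otherwise just normalise the case.
        "Period": PERIOD_ABBREVIATIONS.get(period.lower(), period.capitalize()),
    }


parsed = parse_age("Phanerozoic/mesozoic;jur")
```

A value that does not split into exactly three components raises `ValueError` here, which is a deliberate choice for a sketch: malformed source values are surfaced rather than guessed at.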
15
A more advanced integration scenario
Source 1 Table
  Sample ID    Rock type    Eon            Era         Period
  varchar      varchar      varchar        varchar     varchar
                            Phanerozoic    Mesozoic    Jurassic
Source 2 Table
  Sample ID    Rock type    Age
  varchar      varchar      int
                            150
• Mediator View: (SampleID varchar, Rock_Type varchar, Eon varchar, Era varchar, Period varchar)
– Same as Source1 table schema
• Query: Get rock types for all rocks from the Jurassic period
16
Doing the integration
• Query sent to mediator:
SELECT DISTINCT(Rock_Type) FROM Mediator_View
WHERE Period='Jurassic'
• Query to Source 1:
SELECT DISTINCT(Rock_Type) FROM Source1_Table
WHERE Period='Jurassic'
• For Source2, need to map Period=“Jurassic” to Age values
Source 2 Table
  Sample ID    Rock type    Age
  varchar      varchar      int
Geologic_Time Table
  Eon        Era        Period     Min    Max
  varchar    varchar    varchar    int    int
17
Query “fragment” sent to Source 2
• SELECT DISTINCT (S2.Rock_Type)
FROM
Source2_Table S2,
Geologic_Time_Table GT
WHERE
GT.Period = 'Jurassic' AND
(S2.Age >= GT.Min) AND
(S2.Age <= GT.Max)
Where is the Geologic_Time table stored?
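The query fragment can be tried out against a small in-memory SQLite database. Everything concrete below is an assumption made for the sketch: the sample rows, the rounded age boundaries (in millions of years, approximate), the column names `Min_Age` and `Max_Age` (renamed from Min and Max to avoid confusion with the aggregate functions), and the choice to store the Geologic_Time table next to the Source 2 table, which is only one possible answer to the question the slide raises.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Source2_Table (SampleID TEXT, Rock_Type TEXT, Age INTEGER);
    CREATE TABLE Geologic_Time_Table (
        Eon TEXT, Era TEXT, Period TEXT, Min_Age INTEGER, Max_Age INTEGER);

    -- Illustrative samples; ages in millions of years.
    INSERT INTO Source2_Table VALUES
        ('S2-001', 'sandstone', 150),
        ('S2-002', 'limestone', 160),
        ('S2-003', 'sandstone', 190),
        ('S2-004', 'granite',    90);

    -- Rounded, approximate period boundaries, for illustration only.
    INSERT INTO Geologic_Time_Table VALUES
        ('Phanerozoic', 'Mesozoic', 'Triassic',   201, 251),
        ('Phanerozoic', 'Mesozoic', 'Jurassic',   145, 200),
        ('Phanerozoic', 'Mesozoic', 'Cretaceous',  66, 144);
""")

# Same shape as the slide's fragment, with the period passed as a parameter.
query = """
    SELECT DISTINCT S2.Rock_Type
    FROM Source2_Table S2, Geologic_Time_Table GT
    WHERE GT.Period = ?
      AND S2.Age >= GT.Min_Age
      AND S2.Age <= GT.Max_Age
    ORDER BY S2.Rock_Type
"""
jurassic_rock_types = [row[0] for row in conn.execute(query, ("Jurassic",))]
```

If the Geologic_Time table lived at the mediator instead, the mediator would first look up the age range and then send Source 2 a plain range predicate on Age.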
18
Another complex query
• Query: Get rock types for all rocks from the Mesozoic era
– Easy to do for Source 1: Era = “Mesozoic”
– For Source 2:
• Need to find numeric age range for Mesozoic
– Find age range across all subclasses of Mesozoic (Cretaceous, Jurassic, Triassic)
• Select all Source 2 Table records whose age falls within the Mesozoic age range
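The two Source 2 steps can be sketched as an aggregate over the periods of the era followed by a range filter. As before, this is a hedged illustration: the tables share one in-memory SQLite database, the sample rows and rounded age boundaries (millions of years, approximate) are made up, and `Min_Age` / `Max_Age` are assumed column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Source2_Table (SampleID TEXT, Rock_Type TEXT, Age INTEGER);
    CREATE TABLE Geologic_Time_Table (
        Eon TEXT, Era TEXT, Period TEXT, Min_Age INTEGER, Max_Age INTEGER);

    INSERT INTO Source2_Table VALUES
        ('S2-001', 'sandstone', 150),
        ('S2-002', 'granite',    90),
        ('S2-003', 'shale',     230),
        ('S2-004', 'slate',     280);

    -- Rounded, approximate boundaries; Permian is included as a non-Mesozoic row.
    INSERT INTO Geologic_Time_Table VALUES
        ('Phanerozoic', 'Paleozoic', 'Permian',    252, 298),
        ('Phanerozoic', 'Mesozoic',  'Triassic',   201, 251),
        ('Phanerozoic', 'Mesozoic',  'Jurassic',   145, 200),
        ('Phanerozoic', 'Mesozoic',  'Cretaceous',  66, 144);
""")

# Step 1: age range across all periods (subclasses) of the era.
min_age, max_age = conn.execute(
    "SELECT MIN(Min_Age), MAX(Max_Age) FROM Geologic_Time_Table WHERE Era = ?",
    ("Mesozoic",),
).fetchone()

# Step 2: Source 2 records whose age falls within that range.
mesozoic_rock_types = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT Rock_Type FROM Source2_Table "
        "WHERE Age BETWEEN ? AND ? ORDER BY Rock_Type",
        (min_age, max_age),
    )
]
```

Taking the minimum and maximum assumes the periods of an era are contiguous, which holds for this table; with gaps between periods, a join per period as in the Jurassic query would be the safer form.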
19
Data Integration Carts©
• Integrating data sets without explicitly creating views
• An example request:
Plot all gravity data points that fall within the spatial extent of rocks of a given type, in the Rocky Mountain testbed region
– Use GEONsearch to find all gravity and geologic data using bounding box for “Rocky Mountain testbed region”
• Need gazetteer / spatial ontology to determine Rocky Mountain region
• Need to know classification of datasets (as gravity and geology)
• Intersect extent of gravity and geologic datasets (from metadata) with extent of Rocky Mountain region
– Plot gravity point data that fall within polygons of rocks of given type
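The two spatial tests in this request, intersecting dataset extents with the region and keeping gravity points that fall inside rock polygons, can be sketched in plain Python. All coordinates, the bounding box used for the region, and the function names are hypothetical; this does not show how GEONsearch or the integration cart actually work.

```python
from typing import List, Tuple

Point = Tuple[float, float]                # (longitude, latitude)
BBox = Tuple[float, float, float, float]   # (min_lon, min_lat, max_lon, max_lat)


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """True if two bounding boxes overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def point_in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count polygon edges crossed by a ray going east."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):           # edge spans the ray's latitude
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical extents: the region, one dataset inside it, one far away.
region: BBox = (-110.0, 36.0, -104.0, 42.0)
gravity_extent: BBox = (-108.0, 37.0, -105.0, 40.0)
far_extent: BBox = (-90.0, 30.0, -85.0, 35.0)

# Hypothetical polygon of the chosen rock type and two gravity points.
granite_polygon = [(-107.0, 38.0), (-106.0, 38.0), (-106.0, 39.0), (-107.0, 39.0)]
gravity_points = [(-106.5, 38.5), (-105.5, 38.5)]

to_plot = [p for p in gravity_points if point_in_polygon(p, granite_polygon)]
```

The bounding-box test is the cheap metadata-level filter; the point-in-polygon test only runs on datasets that survive it.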
20
Ad hoc integration
[Diagram: Search Metadata Catalog with GEONsearch (“Geologic and gravity data in Rocky Mountains”) → Data Integration Cart© Query → Plot map → Map]
21
Data Registration
[Diagram: dataset metadata registered against ontology concepts]
• Spatial Ontology: Location; Point, Polygon; Latitude, Longitude
• Rock Classification Ontology: Igneous; Granite, Quartzmonzonite
• Gravity dataset, Metadata: (X, Y)
• Geologic dataset, Metadata: Lat, Long, RockType
• Item Registration (Schema registration)
• Item Detail Registration
22
Data Registration is Important!