Chapter 11
Information Integration
Spring 2001
Prof. Sang Ho Lee
School of Computing, Soongsil Univ.
[email protected]
How to integrate information, which
is usually scattered physically
• This is an unavoidable question to all of us.
• Approaches
– (Homogeneous) distributed DBMS (80’s)
– Federated databases, Multidatabases, remote data access
(90’s)
– Data warehouse, mediator (late 90’s)
Why Information Integration is
Difficult (1)
• Heterogeneous sources
• Examples (Aardvark Automobile Co.)
– 1000 dealers
– Each dealer maintains a database of their cars in stock
– Aardvark wants to create an integrated database
– 1000 dealers do not all use the same database schema
– Dealer 1:
• Cars(serialNo, model, color, autoTrans, cdPlayer, …)
– Dealer 2:
• Autos(serial, model, color)
• Options(serial, option)
Why Information Integration is
Difficult (2)
• Furthermore …
– Data type differences: Serial numbers might be represented
by character strings or integers
– Value differences: The color black might be represented by
an integer code, the string BLACK, or the code BL
– Semantic differences: One dealer distinguishes station wagons
from minivans, while another doesn’t
– Missing values: A source does not record information that all
or most of the other sources provide
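The data type, value, and missing-value differences listed above can be illustrated with a small normalization step. This is only a sketch: the lookup table COLOR_CODES, the integer code 1 for black, and the function names are assumptions made for illustration, not part of the Aardvark example.

```python
# Hypothetical lookup table: every source encoding of a color maps to one
# canonical string. The integer code 1 is an invented example.
COLOR_CODES = {"BLACK": "black", "BL": "black", 1: "black"}


def normalize_serial(value):
    """Represent every serial number as a string, whatever the source type."""
    return str(value).strip()


def normalize_color(value):
    """Map a source-specific color encoding to the canonical name.

    Returns None when the encoding is unknown or the value is missing.
    """
    key = value.strip().upper() if isinstance(value, str) else value
    return COLOR_CODES.get(key)


def normalize_record(record):
    """Return a source record rewritten in the integrated format."""
    return {
        "serialNo": normalize_serial(record["serial"]),
        # A source that does not record the color yields None here.
        "color": normalize_color(record.get("color")),
    }


integer_serial = normalize_record({"serial": 123, "color": "BL"})
missing_color = normalize_record({"serial": "A7"})
```

Semantic differences (station wagon versus minivan) cannot be repaired by a lookup of this kind, because the finer distinction is simply absent from one of the sources.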
Modes of Information Integration
• Federated databases
– The sources are independent, but one source can call on
others to supply information
• Warehousing
– Copies of data from several sources are stored in a single
database, called a (data) warehouse
• Mediation
– A mediator is a software component that supports a virtual
database, which the user may query as if it were
materialized
– The mediator stores no data of its own
Federated Database Systems
• A federated database system is a federation of
existing database systems (called local database
systems, LDBS) that provides applications with a
uniform means of access to data managed
by more than one of these database systems
• In theory, local databases should preserve local
autonomy
Local Autonomy (1)
• Design autonomy
– Ability of an LDBS to make its own design decisions on
any matter, including data model, query language,
constraints, system functions, semantic interpretation of
data, …
• Execution autonomy
– Ability of an LDBS to execute local operations without
interference from external operations and to decide the
order in which to schedule external operations
Local Autonomy (2)
• Communication autonomy
– Ability of an LDBS to decide whether and when to
communicate with other database systems
• Association autonomy
– Ability of an LDBS to decide whether and how much to share
its functionality and resources with others. For example, an
LDBS may export only part of its database to external users,
or even dissociate itself from the federation.
Federated Database Example
• A federated collection of four local databases
[Figure: four local databases, DB1 through DB4]
Federated Database
• If each of n databases needs to talk to the n – 1 other
databases, then we must write n(n – 1) pieces of
code to support queries between systems
• This approach is easy to implement in some
circumstances!
Query Translation Example
• Dealer 1: NeededCars(model, color, autoTrans)
• Dealer 2: Autos(serial, model, color),
Options(serial, option)
/* Dealer 1 queries Dealer 2 for needed cars */
for (each tuple (:m, :c, :a) in NeededCars) {
    if (:a = TRUE) { /* automatic transmission wanted */
        SELECT Autos.serial
        FROM Autos, Options
        WHERE Autos.serial = Options.serial AND Options.option = 'autoTrans' AND
              Autos.model = :m AND Autos.color = :c;
    } else { /* automatic transmission not wanted */
        SELECT serial
        FROM Autos
        WHERE Autos.model = :m AND Autos.color = :c AND
              NOT EXISTS (
                  SELECT *
                  FROM Options
                  WHERE serial = Autos.serial AND option = 'autoTrans' );
    }
}
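The translation loop above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the function name find_needed_cars, the in-memory database standing in for Dealer 2, and the sample rows are all invented for illustration.

```python
import sqlite3


def find_needed_cars(dealer2, needed_cars):
    """For each (model, color, auto_trans) request, list matching Dealer 2 serials."""
    with_auto = """
        SELECT Autos.serial
        FROM Autos, Options
        WHERE Autos.serial = Options.serial AND Options.option = 'autoTrans'
          AND Autos.model = ? AND Autos.color = ?"""
    without_auto = """
        SELECT Autos.serial
        FROM Autos
        WHERE Autos.model = ? AND Autos.color = ?
          AND NOT EXISTS (SELECT * FROM Options
                          WHERE Options.serial = Autos.serial
                            AND Options.option = 'autoTrans')"""
    results = {}
    for model, color, auto_trans in needed_cars:
        # Pick the query shape that matches the requested transmission.
        sql = with_auto if auto_trans else without_auto
        rows = dealer2.execute(sql, (model, color)).fetchall()
        results[(model, color, auto_trans)] = sorted(row[0] for row in rows)
    return results


# Invented sample data for Dealer 2.
dealer2 = sqlite3.connect(":memory:")
dealer2.executescript("""
    CREATE TABLE Autos(serial INTEGER, model TEXT, color TEXT);
    CREATE TABLE Options(serial INTEGER, option TEXT);
    INSERT INTO Autos VALUES (1, 'Gobi', 'red'), (2, 'Gobi', 'red'), (3, 'Gobi', 'blue');
    INSERT INTO Options VALUES (1, 'autoTrans'), (2, 'cdPlayer');
""")
needed = [("Gobi", "red", True), ("Gobi", "red", False)]
found = find_needed_cars(dealer2, needed)
```

The two query strings are specific to this one pair of schemas, which is why a federation of n systems needs on the order of n(n – 1) such translations.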
Mediators
• A mediator supports a virtual view or collection of views
• It does not store any data of its own
[Figure: a query enters the mediator and a result comes back; the mediator sends queries to two wrappers and receives their results; each wrapper queries its own source (Source 1, Source 2) and receives a result]
Mediator Example (1)
– A view that is a single relation
AutosMed(serialNo, model, color, autoTrans, dealer)
– A query to the mediator
SELECT serialNo, model
FROM AutosMed
WHERE color = 'red';
– The mediator can forward the same query to each of the
two wrappers
– The translation work can be done by the wrappers alone
Mediator Example (2)
– A suitable translation for Dealer 1
Cars(serialNo, model, color, autoTrans, cdPlayer, …)
SELECT serialNo, model
FROM Cars
WHERE color = 'red';
– A suitable translation for Dealer 2
Autos(serial, model, color), Options(serial, option)
SELECT serial, model
FROM Autos
WHERE color = 'red';
– The wrappers return to the mediator a set of serialNo-model pairs
and a set of serial-model pairs, respectively
– The mediator can take the union of these sets and return
the result to the user
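The forward-and-union behaviour described in this example can be sketched as follows. The two in-memory sqlite3 databases, the sample rows, and the names make_source, wrapper_dealer1, wrapper_dealer2, and mediator_query are assumptions introduced for illustration.

```python
import sqlite3


def make_source(script):
    """Create an in-memory database that plays the role of one dealer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)
    return conn


def wrapper_dealer1(conn, color):
    """Translate the mediator query into Dealer 1's schema."""
    return conn.execute(
        "SELECT serialNo, model FROM Cars WHERE color = ?", (color,)).fetchall()


def wrapper_dealer2(conn, color):
    """Translate the mediator query into Dealer 2's schema."""
    return conn.execute(
        "SELECT serial, model FROM Autos WHERE color = ?", (color,)).fetchall()


def mediator_query(color, wrapped_sources):
    """Forward the query to every wrapper and return the union of the results."""
    answer = set()
    for wrapper, conn in wrapped_sources:
        answer |= set(wrapper(conn, color))
    return answer


# Invented sample data for the two dealers.
dealer1 = make_source("""
    CREATE TABLE Cars(serialNo TEXT, model TEXT, color TEXT,
                      autoTrans INTEGER, cdPlayer INTEGER);
    INSERT INTO Cars VALUES ('A1', 'Gobi', 'red', 1, 0), ('A2', 'Sahara', 'blue', 0, 1);
""")
dealer2 = make_source("""
    CREATE TABLE Autos(serial INTEGER, model TEXT, color TEXT);
    INSERT INTO Autos VALUES (10, 'Gobi', 'red'), (11, 'Sahara', 'green');
""")
red_cars = mediator_query("red", [(wrapper_dealer1, dealer1), (wrapper_dealer2, dealer2)])
```

The mediator itself holds no tables; all data stays in the sources, and only the wrappers know the source schemas.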
Wrappers in Mediator-Based Systems
• Sources could be DBMSs (in various models), file
systems, Web servers, …
• A wrapper handles all connection and query-translation problems
peculiar to its source
• Mediator systems require more complex wrappers
than do most warehouse systems
• Techniques
– Wrapper generator
– Template-based
– Filter techniques
– Etc.
Templates for Query Patterns
• Templates are queries with parameters that represent
constants
– Example
SELECT *
FROM AutosMed
WHERE color = '$c'
=>
SELECT serialNo, model, color, autoTrans, 'dealer1'
FROM Cars
WHERE color = '$c';
• In general there would be 2^n templates if we have
the option of specifying n attributes
• The number of templates could grow unreasonably
large
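A template of this kind, and the 2^n growth in the number of templates, can be sketched with Python's string.Template. The names COLOR_TEMPLATE, instantiate, and query_patterns are assumptions; plain text substitution is used only to mirror the $c notation and is not how a production wrapper should build SQL.

```python
from itertools import combinations
from string import Template

# Template for the single query pattern from the example:
# color is the only parameter.
COLOR_TEMPLATE = Template(
    "SELECT serialNo, model, color, autoTrans, 'dealer1' "
    "FROM Cars WHERE color = '$c';"
)


def instantiate(template, **params):
    """Fill the template parameters with the constants from a mediator query."""
    return template.substitute(**params)


def query_patterns(attributes):
    """Enumerate every subset of attributes that a query might constrain."""
    return [subset
            for size in range(len(attributes) + 1)
            for subset in combinations(attributes, size)]


blue_query = instantiate(COLOR_TEMPLATE, c="blue")
patterns = query_patterns(["model", "color", "autoTrans"])
```

With three attributes that may or may not be specified there are already 2^3 = 8 patterns, one per subset, which is the growth the bullet above warns about.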
Wrapper Generators
• Wrapper generator
– The software that creates the wrapper
– It produces a table that holds the various query patterns contained in
the templates
[Figure: wrapper architecture with the components Templates, Wrapper generator, Table, Driver, and Source; queries arrive from the mediator and results are returned to it]
Filters
• It is not always realistic to write a template for every
possible form of query
• Another approach to supporting more queries is to
have the wrapper filter the results of queries
Filters Example
– Suppose the only template we have is the one that finds
cars given a color
– The mediator needs to find blue ‘Gobi’ model cars
SELECT *
FROM AutosMed
WHERE color = 'blue' AND model = 'Gobi';
– A possible way to answer the query
• Use the template (with $c = 'blue')
• Store the result in a temporary relation TempAutos
• Select from TempAutos the tuples whose model is 'Gobi'
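The three steps above can be sketched with sqlite3. Everything beyond the slide's own names is an assumption: the sample Cars rows, the functions run_color_template and answer_with_filter, and the choice to keep the temporary relation on the same connection as the source.

```python
import sqlite3

# Invented sample data for the wrapped source.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE Cars(serialNo INTEGER, model TEXT, color TEXT, autoTrans INTEGER);
    INSERT INTO Cars VALUES
        (1, 'Gobi', 'blue', 1), (2, 'Sahara', 'blue', 0), (3, 'Gobi', 'red', 1);
""")


def run_color_template(conn, color):
    """The only template the wrapper supports: find cars of a given color."""
    return conn.execute(
        "SELECT serialNo, model, color, autoTrans FROM Cars WHERE color = ?",
        (color,)).fetchall()


def answer_with_filter(conn, color, model):
    """Answer a color-and-model query using the color template plus a filter."""
    rows = run_color_template(conn, color)           # step 1: use the template
    conn.execute(
        "CREATE TEMP TABLE TempAutos(serialNo, model, color, autoTrans)")
    conn.executemany(
        "INSERT INTO TempAutos VALUES (?, ?, ?, ?)", rows)  # step 2: store result
    result = conn.execute(
        "SELECT * FROM TempAutos WHERE model = ?", (model,)).fetchall()  # step 3
    conn.execute("DROP TABLE TempAutos")
    return result


blue_gobis = answer_with_filter(source, "blue", "Gobi")
```

The source does more work than strictly needed, since it returns every blue car, but the wrapper avoids needing a second template.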
Data Warehousing
• Growing industry since mid 90’s
• Ranges from desktop to huge
• Lots of buzzwords, hype
– Slice & dice, rollup, MOLAP, pivot, …
Information as a Competitive
Weapon
• Organizations have collected large amounts of data.
Now it is time to use it to their advantage.
Can You Easily Answer
These Questions?
• What is the correlation between expenditures and collection of delinquent taxes?
• What is the impact on revenues and expenditures of changing the operating hours of the Dept. of Motor Vehicles?
• What are Personnel Services costs across all departments for all funding sources?
• What are the effects of outsourcing specific services?
• What is the economic impact of the small business initiative in our district?
What is a Warehouse (1)
• Collection of diverse data
– Subject oriented
– Aimed at executive, decision maker
– Often a copy of operational data
– With value-added data (e.g., summaries, history)
– Integrated
– Time-varying
– Non-volatile
AND …
What is a Warehouse (2)
• Collection of tools
– Gathering data
– Cleansing, integrating
– Querying, reporting, analysis
– Data mining
– Monitoring, administering warehouse
Warehouse Architecture
[Figure: layered architecture, from top to bottom: Clients; Query & Analysis; Warehouse (with Metadata); Integration; Sources]
Motivating Examples
• Forecasting
• Comparing performance of units
• Monitoring, detecting fraud
• Visualization
Why a Warehouse
• Two approaches:
– Query-driven (lazy)
– Warehouse (eager)
Query-driven approach
[Figure: clients query a mediator, which reaches each of three sources through its own wrapper]
Advantages of Query-driven
• No need to copy data
– Less storage
– No need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only query interface needed at sources
• May be less draining on sources
Advantages of Warehousing
• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
– Modify, summarize (store aggregates)
– Add historical information
OLTP vs. OLAP
• OLTP (On-Line Transaction Processing)
– Describes processing at operational sites
• OLAP (On-Line Analytical Processing)
– Describes processing at warehouse
OLTP vs. OLAP
• OLTP
– Mostly updates
– Many small transactions
– Mb-Tb of data
– Current snapshot
– Raw data
– Clerical users
– Consistency, recoverability critical
• Warehouse
– Mostly reads
– Queries are long and complex
– Gb-Tb of data
– History
– Summarized, consolidated data
– Decision-makers, analysts as users
OLAP Example
• The schema for the warehouse
– Sales(serialNo, date, dealer, price)
– Autos(serialNo, model, color)
– Dealers(name, city,state,phone)
• A typical decision-support query
– SELECT state, AVG(price)
FROM Sales, Dealers
WHERE Sales.dealer = Dealers.name AND date >= '1999-01-04'
GROUP BY state;
• Common OLTP query
– “Find the price at which the auto with serial number 123
was sold”
Warehouse Models and Operations
• Data models
– Relations
– Stars and snowflakes
– Cubes
• Operations
– Slice and dice
– Roll-up, drill-down
– Pivoting
– Other
Star Schemas
• Star schema = fact table + dimension tables
[Figure: a central fact table, holding the dependent attributes, surrounded by four dimension tables]
Example-1 (1)
• Sales(serialNo, date, dealer, price)
Autos(serialNo, model, color)
Dealers(name, city, state, phone)
[Figure: cube with axes car, dealer, and date]
• Sales is a fact table
– serialNo, date, dealer are dimensions
– The one dependent attribute is price, which is what OLAP queries
will typically request in an aggregation
• The Autos and Dealers relations are dimension tables
– Attribute serialNo in the fact table is a foreign key, referencing
serialNo of dimension table Autos
• Joins between the fact table and the dimension tables are frequent
Example-1 (2)
• A time dimension table
Days (day, week, month, year)
– Since grouping by various time units is frequently desired by
analysts
– It helps to build into the database a notion of time, as if
there were a time dimension table such as above
Example-2 (1)
product
prodId  name  price
p1      bolt  10
p2      nut   5
store
storeId  city
c1       nyc
c2       sfo
c3       la
sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50
customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
Example-2 (2)
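product(prodId, name, price)
sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
store(storeId, city)
[Figure: schema diagram in which sale refers to customer, product, and store through custId, prodId, and storeId]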
Slicing and Dicing
• Dicing
– For example, in the time
dimension, we might partition
(“group by” clause) according to
days, weeks, months, years, or
not partition at all
– Partitioning is also possible for
cars and dealers
• Slicing
– Through the “where” clause, a
query focuses on partitions
along one or more dimensions
[Figure: cube with axes car, dealer, and date]
Example 1
• A query in which we ask
for a slice in one
dimension (the date),
and dice in two other
dimensions (car and
dealer)
• The date is divided into
four groups, …
[Figure: cube with axes car, dealer, and date]
More Examples
• SELECT color, SUM(price)
FROM Sales NATURAL JOIN Autos
WHERE model = 'Gobi'
GROUP BY color;
– This query dices by color and
slices by model
• SELECT dealer, month, SUM(price)
FROM (Sales NATURAL JOIN Autos) JOIN Days ON
date = day
WHERE model = 'Gobi' AND color = 'red'
GROUP BY dealer, month;
– This query slices by model and color and
dices by dealer and month
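Both slice-and-dice queries can be run against a toy star schema with sqlite3. The sample rows, the dealer name 'Fred', and the prices are invented; the second query is grouped by dealer and month so that the grouping matches its SELECT list, and the join is written without parentheses for portability across SQLite versions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Sales(serialNo INTEGER, date TEXT, dealer TEXT, price INTEGER);
    CREATE TABLE Autos(serialNo INTEGER, model TEXT, color TEXT);
    CREATE TABLE Days(day TEXT, week INTEGER, month INTEGER, year INTEGER);
    INSERT INTO Autos VALUES
        (1, 'Gobi', 'red'), (2, 'Gobi', 'red'), (3, 'Gobi', 'blue'), (4, 'Sahara', 'red');
    INSERT INTO Sales VALUES
        (1, '1999-05-21', 'Fred', 20000), (2, '1999-06-02', 'Fred', 22000),
        (3, '1999-05-21', 'Fred', 21000), (4, '1999-05-21', 'Fred', 30000);
    INSERT INTO Days VALUES ('1999-05-21', 20, 5, 1999), ('1999-06-02', 22, 6, 1999);
""")

# Slice on model (WHERE), dice by color (GROUP BY).
by_color = db.execute("""
    SELECT color, SUM(price)
    FROM Sales NATURAL JOIN Autos
    WHERE model = 'Gobi'
    GROUP BY color
    ORDER BY color""").fetchall()

# Slice on model and color, dice by dealer and month via the Days dimension.
by_dealer_month = db.execute("""
    SELECT dealer, month, SUM(price)
    FROM Sales NATURAL JOIN Autos JOIN Days ON date = day
    WHERE model = 'Gobi' AND color = 'red'
    GROUP BY dealer, month
    ORDER BY dealer, month""").fetchall()
```

The Days table is what lets the second query group by month even though the fact table stores only individual dates.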
How to support cube-structured data
for OLAP
• ROLAP, or Relational OLAP
– Data may be stored in relations with a specialized structure
called a “star schema”
• MOLAP, or Multidimensional OLAP
– A specialized structure, the “data cube”, is used to hold the
data
Data cubes
• An alternative to executing decision-support queries
as ad-hoc queries is to pre-compute all possible
aggregates in a systematic way
• The amount of extra storage needed is often
tolerable
• We shall continue to call the points of the data cube
the “fact table”
Cube Example
Fact table view: sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8
Multi-dimensional cube:
      c1  c2  c3
p1    12  -   50
p2    11  8   -
dimensions = 2
3-D Cube Example
Fact table view: sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Multi-dimensional cube:
day 1  c1  c2  c3
p1     12  -   50
p2     11  8   -
day 2  c1  c2  c3
p1     44  4   -
p2     -   -   -
dimensions = 3
The Cube Operator
• Given a fact table F, we can define an augmented
table CUBE(F) that adds an additional value, denoted
*, to each dimension
– The * represents aggregation along the dimension in which
it appears
• A tuple of the table CUBE(F) that has * in one or more
dimensions holds the aggregate over all fact tuples that match it in the remaining dimensions
The Cube Operator Example
• Sales(model, color, date, dealer, val, cnt)
– “val” denotes the total price, “cnt” denotes the total # of
automobiles
• Possible tuples
– ('Gobi', 'red', '1999-05-21', 'Friendly Fred', 45000, 2)
– ('Gobi', *, '1999-05-21', 'Friendly Fred', 152000, 7)
– ('Gobi', *, '1999-05-21', *, 2348000, 100)
– ('Gobi', *, *, *, 1339800000, 58000)
– (*, *, *, *, 3521727000, 198000)
Another Example
• Consider
SELECT color, AVG(price)
FROM Sales WHERE model = 'Gobi'
GROUP BY color;
• The above query is answered by looking for all tuples of
CUBE(Sales) with the form ('Gobi', c, *, *, v, n)
– c is any specific color
– The tuple asked for by the query is (c, v/n)
• The answer is the set of (c, v/n) pairs from all ('Gobi', c, *,
*, v, n) tuples
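The construction of CUBE(F) and the lookup of ('Gobi', c, *, *, v, n) tuples can be sketched in a few lines. This is an illustrative brute-force version under assumptions: the function names cube and avg_price_by_color, the one-row-per-car fact format, and the sample facts are all invented.

```python
from itertools import product


def cube(fact_rows, n_dims):
    """Return CUBE(F) as a dict: dimension tuple (with '*') -> (val, cnt)."""
    totals = {}
    for row in fact_rows:
        dims, price = row[:n_dims], row[n_dims]
        # Each fact contributes to every combination of
        # "keep the value" / "replace it by *" over the dimensions.
        for mask in product((False, True), repeat=n_dims):
            key = tuple("*" if star else d for d, star in zip(dims, mask))
            val, cnt = totals.get(key, (0, 0))
            totals[key] = (val + price, cnt + 1)
    return totals


def avg_price_by_color(cube_table, model):
    """Answer SELECT color, AVG(price) ... GROUP BY color from the cube alone."""
    return {color: val / cnt
            for (m, color, date, dealer), (val, cnt) in cube_table.items()
            if m == model and color != "*" and date == "*" and dealer == "*"}


# Invented facts: (model, color, date, dealer, price), one row per car sold.
facts = [
    ("Gobi", "red", "1999-05-21", "Friendly Fred", 22000),
    ("Gobi", "red", "1999-05-22", "Friendly Fred", 23000),
    ("Gobi", "blue", "1999-05-21", "Honest Al", 21000),
    ("Sahara", "red", "1999-05-21", "Honest Al", 30000),
]
sales_cube = cube(facts, 4)
gobi_averages = avg_price_by_color(sales_cube, "Gobi")
```

Each fact touches 2^4 = 16 cube entries here, which shows why the storage cost of the full cube grows quickly with the number of dimensions.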
Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
ans
date  sum
1     81
2     48
Rollup vs. Drill-down
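• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE
GROUP BY date, prodId
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Result:
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48
– Rollup: go from these per-day, per-product totals to the coarser per-day totals
– Drill-down: go from the per-day totals back to these finer totals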
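The two groupings of the sale table can be checked with sqlite3, using the six rows shown on the slide. The variable names and the small Python loop that rolls the finer result up to per-day totals are additions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sale(prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
    INSERT INTO sale VALUES
        ('p1', 'c1', 1, 12), ('p2', 'c1', 1, 11), ('p1', 'c3', 1, 50),
        ('p2', 'c2', 1, 8),  ('p1', 'c1', 2, 44), ('p1', 'c2', 2, 4);
""")

# Drill-down level: totals per day and product.
by_date_product = db.execute("""
    SELECT prodId, date, SUM(amt) FROM sale
    GROUP BY date, prodId
    ORDER BY date, prodId""").fetchall()

# Rolled-up level: totals per day only.
by_date = db.execute("""
    SELECT date, SUM(amt) FROM sale
    GROUP BY date
    ORDER BY date""").fetchall()

# Rollup can also start from the finer totals instead of the raw facts.
rolled_up = {}
for _prod, date, total in by_date_product:
    rolled_up[date] = rolled_up.get(date, 0) + total
```

Because sums can be combined from partial sums, the rolled-up answer can be derived from the finer one without rereading the fact table.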
Aggregates
• Operators: sum, count, max, min,
median, avg
• “Having” clause
• Using dimension hierarchy
– average by region (within store)
– maximum by month (within date)
Cube Aggregation
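Example: computing sums
Cube (amt by product, store, and day):
day 1  c1  c2  c3
p1     12  -   50
p2     11  8   -
day 2  c1  c2  c3
p1     44  4   -
p2     -   -   -
Summed over days:
       c1  c2  c3
p1     56  4   50
p2     11  8   -
Summed over days and products:
c1  c2  c3  total
67  12  50  129
Summed over days and stores:
p1   p2
110  19
Rollup moves from the detailed cube toward the sums; drill-down moves back toward the detail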
Cube Operators
[Figure: the cube and its sums from the Cube Aggregation slide, with individual cells addressed in cube notation]
sale(c1,*,*) = 67
sale(c2,p2,*) = 8
sale(*,*,*) = 129
Extended Cube
day 1  c1  c2  c3  *
p1     12  -   50  62
p2     11  8   -   19
*      23  8   50  81
day 2  c1  c2  c3  *
p1     44  4   -   48
p2     -   -   -   -
*      44  4   -   48
*      c1  c2  c3  *
p1     56  4   50  110
p2     11  8   -   19
*      67  12  50  129
sale(*,p2,*) = 19
The lattice of Views
• It helps to think of a lattice of possible groupings for
each dimension of the cube
• A path from some node P2 down to P1 means that
P1 <= P2
[Figure: two lattices of groupings. Dealers: Dealer, City, State, All. Time: Days, Weeks, Months, Quarters, Years, All]
Aggregation Using Hierarchies
Cube (amt by product, customer, and day):
day 1  c1  c2  c3
p1     12  -   50
p2     11  8   -
day 2  c1  c2  c3
p1     44  4   -
p2     -   -   -
Hierarchy: customer, region, country
Aggregated by region (summed over days):
       region A  region B
p1     56        54
p2     11        8
(customer c1 in Region A;
customers c2, c3 in Region B)
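The regional totals on the Aggregation Using Hierarchies slide can be reproduced by mapping each customer to its region before summing. The mapping REGION_OF encodes the slide's note (c1 in Region A; c2 and c3 in Region B); the function name is an assumption.

```python
from collections import defaultdict

# Customer -> region, as stated on the slide.
REGION_OF = {"c1": "A", "c2": "B", "c3": "B"}

# Fact rows from the 3-D cube: (product, customer, day, amt).
sales = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def total_by_product_and_region(rows):
    """Sum amounts over days, grouping customers by their region."""
    totals = defaultdict(int)
    for prod, cust, _day, amt in rows:
        totals[(prod, REGION_OF[cust])] += amt
    return dict(totals)


region_totals = total_by_product_and_region(sales)
```

Moving one more step up the hierarchy, from region to country, would only require a second mapping applied to the regional totals.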
Data Mining
• Knowledge discovery
• To find surprising facts from existing databases
• Techniques from DBMS, machine learning, and
statistics, …
Decision Tree
• The interior nodes each have an attribute and a value
that serves as a threshold
• The children of a node are either other interior nodes,
or a decision: accept or reject
• A given tuple is passed down the tree, going left or
right at each step according to the value the tuple
has, until a decision node is reached
• The tree is constructed from a training set of tuples
whose outcome is known
Example (weather vs. tennis play)
Outlook
– sunny: test humidity
• high: no
• normal: yes
– overcast: yes
– rainy: test windy
• false: yes
• true: no
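Passing a tuple down the weather tree can be sketched as a walk over a nested structure. The encoding below (a pair of attribute name and child table for interior nodes, a string for decisions) and the branch layout read from the figure labels are assumptions; this tree tests categorical values rather than numeric thresholds.

```python
# Interior node: (attribute, {attribute value: child}). Leaf: "yes" or "no".
TREE = ("outlook", {
    "overcast": "yes",
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "rainy": ("windy", {False: "yes", True: "no"}),
})


def classify(tree, record):
    """Follow the branch matching the record's value until a decision is reached."""
    node = tree
    while not isinstance(node, str):
        attribute, children = node
        node = children[record[attribute]]
    return node


sunny_humid = classify(TREE, {"outlook": "sunny", "humidity": "high", "windy": False})
rainy_calm = classify(TREE, {"outlook": "rainy", "humidity": "high", "windy": False})
```

Building the tree from a training set is a separate step that this sketch does not attempt.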
Clustering
• To group data items into
some small number of
groups such that the
groups each have
something substantial in
common
• Example
– Clustering of Web pages in
Web search engines
Association-Rule Mining Example
• Market-basket data
– A customer approaches the checkout with a “market basket”
full of the items he or she has selected
– The cash register records all of these items as part of a
single transaction
• Claim: People who buy diapers are unusually likely
also to buy beer
• Schema: Baskets(basket, item)
Data-Mining Applications: Association-Rule Mining
• Naive way to find all high-support pairs of items
SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets J
WHERE I.basket = J.basket AND I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(I.basket) >= s;
The A-Priori Algorithm
• Basic observation
– If a set of items X has support s, then each subset of X must
also have support at least s.
– If a pair of items, say {i, j} appears in, say, 1000 baskets,
then we know there are at least 1000 baskets with item i
and there are at least 1000 baskets with item j.
• Strategies
– First find the set of “OK” items, those that appear in a
sufficient number of baskets by themselves
– Then run the pair query on only the items in the OK set
The A-Priori Algorithm
INSERT INTO OkBaskets
SELECT *
FROM Baskets
WHERE item IN (
    SELECT item
    FROM Baskets
    GROUP BY item
    HAVING COUNT(*) >= s
);
SELECT I.item, J.item, COUNT(I.basket)
FROM OkBaskets I, OkBaskets J
WHERE I.basket = J.basket AND
      I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(*) >= s;
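The two statements can be run end to end with sqlite3. The function name frequent_pairs, the sample baskets, and the support threshold of 3 are invented for illustration; the SQL follows the slide, with s passed as a parameter.

```python
import sqlite3


def frequent_pairs(conn, s):
    """Return item pairs that occur together in at least s baskets."""
    conn.executescript("""
        DROP TABLE IF EXISTS OkBaskets;
        CREATE TABLE OkBaskets(basket INTEGER, item TEXT);
    """)
    # Pass 1: keep only rows whose item is frequent by itself.
    conn.execute("""
        INSERT INTO OkBaskets
        SELECT * FROM Baskets
        WHERE item IN (SELECT item FROM Baskets
                       GROUP BY item HAVING COUNT(*) >= ?)""", (s,))
    # Pass 2: count pairs, but only among the surviving items.
    return conn.execute("""
        SELECT I.item, J.item, COUNT(I.basket)
        FROM OkBaskets I, OkBaskets J
        WHERE I.basket = J.basket AND I.item < J.item
        GROUP BY I.item, J.item
        HAVING COUNT(*) >= ?
        ORDER BY I.item, J.item""", (s,)).fetchall()


# Invented market-basket data.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Baskets(basket INTEGER, item TEXT);
    INSERT INTO Baskets VALUES
        (1, 'beer'), (1, 'diapers'), (1, 'milk'),
        (2, 'beer'), (2, 'diapers'),
        (3, 'beer'), (3, 'diapers'), (3, 'bread'),
        (4, 'milk'), (4, 'bread');
""")
pairs = frequent_pairs(db, 3)
```

With a threshold of 3, milk and bread are discarded in the first pass, so the self-join in the second pass never considers pairs that involve them.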
How Good the A-Priori Algorithm is
– Assumptions (Example 11.20)
• 10,000 different items, average market basket has 20 items in
it
• 1,000,000 baskets, the Baskets relation has 20,000,000 tuples
– The naive algorithm
• The join has 190,000,000 pairs (each basket of 20 items yields 20 × 19 / 2 = 190 pairs)
• The 190,000,000 tuples must all be grouped and counted
– The A-Priori algorithm
• Suppose that s is 10,000, i.e., 1% of the baskets
• It is not possible that more than 2000 (= 20,000,000 / 10,000)
items appear in at least 10,000 baskets
• The subquery therefore produces at most 2000 items, and typically many fewer
• Assume that a basket in OkBaskets has on average 10 items
• The join on OkBaskets then has 45 pairs per basket instead of 190, less than ¼ of
the join on Baskets, so the running time drops to roughly ¼
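The figures in this comparison follow from counting unordered pairs per basket. The short calculation below reproduces them; the variable names are illustrative, and the assumption of 10 surviving items per basket is the one stated on the slide.

```python
from math import comb

baskets = 1_000_000
items_per_basket = 20

# Naive join: every unordered pair of items within each basket.
pairs_naive = baskets * comb(items_per_basket, 2)      # 190 pairs per basket

# At most this many items can reach the support threshold, because the
# Baskets relation has only baskets * items_per_basket tuples in total.
support = 10_000
max_ok_items = (baskets * items_per_basket) // support

# A-Priori join, assuming 10 surviving items per basket on average.
ok_items_per_basket = 10
pairs_apriori = baskets * comb(ok_items_per_basket, 2)  # 45 pairs per basket

ratio = pairs_apriori / pairs_naive
```

The ratio 45/190 is about 0.24, which is the "less than ¼" quoted above.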