Chapter 11
Information Integration

Spring 2001
Prof. Sang Ho Lee
School of Computing, Soongsil Univ.

How to integrate information, which is usually scattered physically
• This is an unavoidable question for all of us.
• Approaches
  – (Homogeneous) distributed DBMS (80’s)
  – Federated databases, multidatabases, remote data access (90’s)
  – Data warehouse, mediator (late 90’s)

Why Information Integration is Difficult (1)
• Heterogeneous sources
• Example (Aardvark Automobile Co.)
  – 1000 dealers
  – Each dealer maintains a database of their cars in stock
  – Aardvark wants to create an integrated database
  – The 1000 dealers do not all use the same database schema
  – Dealer 1:
    • Cars(serialNo, model, color, autoTrans, cdPlayer, …)
  – Dealer 2:
    • Autos(serial, model, color)
    • Options(serial, option)

Why Information Integration is Difficult (2)
• Furthermore …
  – Data type differences: serial numbers might be represented by character strings or by integers
  – Value differences: the color black might be represented by an integer code, the string BLACK, or the code BL
  – Semantic differences: one dealer distinguishes station wagons from minivans, while another doesn’t
  – Missing values: a source does not record information that all or most of the other sources provide

Modes of Information Integration
• Federated databases
  – The sources are independent, but one source can call on others to supply information
• Warehousing
  – Copies of data from several sources are stored in a single database, called a (data) warehouse
• Mediation
  – A mediator is a software component that supports a virtual database, which the user may query as if it were materialized
  – The mediator stores no data of its own

Federated Database Systems
• A federated database system is a federation of existing database systems (called local database systems, LDBS) that provides applications with a uniform means of access to data managed by more than one of these database systems
• In theory, local databases should preserve local autonomy

Local Autonomy (1)
• Design autonomy
  – Ability of an LDBS to make its own design decisions on any matter, including data model, query language, constraints, system functions, semantic interpretation of data, …
• Execution autonomy
  – Ability of an LDBS to execute local operations without interference from external operations and to decide the order in which to schedule external operations

Local Autonomy (2)
• Communication autonomy
  – Ability of an LDBS to decide whether and when to communicate with other database systems
• Association autonomy
  – Ability of an LDBS to decide whether and how much to share its functionality and resources with others. For example, an LDBS may export only part of its database to external users, or even disassociate itself from the federation.

Federated Database Example
• A federated collection of four local databases
[Figure: four local databases]

Federated Database
• If n databases each need to talk to the n – 1 other databases, then we must write n(n – 1) pieces of code to support queries between systems
• Even so, this approach is easy to implement in some circumstances
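The n(n – 1) count can be checked by listing every ordered (asking, answering) pair of databases, since each direction needs its own translation code. A minimal sketch follows; the function name and the database labels are made up for illustration and do not come from the slides.

```python
from itertools import permutations


def translator_pairs(databases):
    """Return every ordered (asker, answerer) pair of distinct databases.

    Each pair stands for one piece of query-translation code that a
    fully connected federation would need.
    """
    return list(permutations(databases, 2))


# Hypothetical labels for a federation of four local databases.
federation = ["DB1", "DB2", "DB3", "DB4"]
pairs = translator_pairs(federation)
print(len(pairs))  # prints 12, i.e. 4 * (4 - 1)
```

Adding a fifth database to this sketch raises the count from 12 to 20, which is why the pairwise approach becomes costly as the federation grows.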
Query Translation Example
• Dealer 1: NeededCars(model, color, autoTrans)
• Dealer 2: Autos(serial, model, color), Options(serial, option)

    /* Dealer 1 queries Dealer 2 for needed cars */
    for (each tuple (:m, :c, :a) in NeededCars) {
        if (:a = TRUE) {   /* automatic transmission wanted */
            SELECT Autos.serial
            FROM Autos, Options
            WHERE Autos.serial = Options.serial AND
                  Options.option = 'autoTrans' AND
                  Autos.model = :m AND
                  Autos.color = :c;
        } else {           /* automatic transmission not wanted */
            SELECT serial
            FROM Autos
            WHERE Autos.model = :m AND
                  Autos.color = :c AND
                  NOT EXISTS (
                      SELECT * FROM Options
                      WHERE serial = Autos.serial AND
                            option = 'autoTrans'
                  );
        }
    }

Mediators
• A mediator supports a virtual view, or a collection of views, that users can query
• It does not store any data of its own
[Figure: the user sends a query to the mediator and receives a result; the mediator sends queries to two wrappers; each wrapper queries its source (Source 1, Source 2) and passes the result back]

Mediator Example (1)
  – A view that is a single relation
      AutosMed(serialNo, model, color, autoTrans, dealer)
  – A query to the mediator
      SELECT serialNo, model
      FROM AutosMed
      WHERE color = 'red'
  – The mediator can forward the same query to each of the two wrappers
  – The translation work can be done by the wrappers alone

Mediator Example (2)
  – A suitable translation for Dealer 1
      Cars(serialNo, model, color, autoTrans, cdPlayer, …)
      SELECT serialNo, model
      FROM Cars
      WHERE color = 'red';
  – A suitable translation for Dealer 2
      Autos(serial, model, color), Options(serial, option)
      SELECT serial, model
      FROM Autos
      WHERE color = 'red';
  – The wrappers return to the mediator a set of serialNo-model pairs and a set of serial-model pairs, respectively
  – The mediator can take the union of these sets and return the result to the user

Wrappers in Mediator-Based Systems
• Sources could be DBMSs (in various models), file systems, Web servers, …
• A wrapper handles all connection and query-translation problems peculiar to its source
• Mediator systems require more complex wrappers than do most warehouse systems
• Techniques
  – Wrapper generator
  – Template-based
  – Filter techniques
  – Etc.

Templates for Query Patterns
• Templates are queries with parameters that represent constants
  – Example
      SELECT * FROM AutosMed WHERE color = '$c'
      =>
      SELECT serialNo, model, color, autoTrans, 'dealer1'
      FROM Cars
      WHERE color = '$c';
• In general there would be 2^n templates if we have the option of specifying n attributes
• The number of templates could grow unreasonably large

Wrapper Generators
• Wrapper generator
  – The software that creates the wrapper
  – It creates a table that holds the various query patterns contained in the templates
[Figure: templates go into the wrapper generator, which produces the table; the driver receives queries from the mediator, uses the table to send queries to the source, and returns the results]

Filters
• It is not always realistic to write a template for every possible form of query
• Another approach to supporting more queries is to have the wrapper filter the results of queries

Filters Example
  – Suppose the only template we have is the one that finds cars given a color
  – The mediator needs to find blue ‘Gobi’ model cars
      SELECT * FROM AutosMed
      WHERE color = 'blue' AND model = 'Gobi'
  – A possible way to answer the query
    • Use the template (with $c = 'blue')
    • Store the result in a temporary relation, TempAutos
    • Select the Gobis from TempAutos

Data Warehousing
• Growing industry since the mid 90’s
• Ranges from desktop to huge
• Lots of buzzwords, hype
  – Slice & dice, rollup, MOLAP, pivot, …

Information as a Competitive Weapon
• Organizations have collected large amounts of data. Now it is time to use it to their advantage.

Can You Easily Answer These Questions?
• What is the correlation between expenditures and collection of delinquent taxes?
• What is the impact on revenues and expenditures of changing the operating hours of the Dept. of Motor Vehicles?
• What are Personnel Services costs across all departments for all funding sources?
• What are the effects of outsourcing specific services?
• What is the economic impact of the small business initiative in our district?

What is a Warehouse (1)
• Collection of diverse data
  – Subject oriented
  – Aimed at executives, decision makers
  – Often a copy of operational data
  – With value-added data (e.g., summaries, history)
  – Integrated
  – Time-varying
  – Non-volatile
and …

What is a Warehouse (2)
• Collection of tools
  – Gathering data
  – Cleansing, integrating
  – Querying, reporting, analysis
  – Data mining
  – Monitoring, administering the warehouse

Warehouse Architecture
[Figure: clients use query and analysis tools on top of the warehouse and its metadata; an integration layer loads the warehouse from several sources]

Motivating Examples
• Forecasting
• Comparing performance of units
• Monitoring, detecting fraud
• Visualization

Why a Warehouse
• Two approaches:
  – Query-driven (lazy)
  – Warehouse (eager)
[Figure: two sources beneath a question mark]

Query-driven approach
[Figure: clients query a mediator, which reaches each of three sources through its own wrapper]

Advantages of Query-driven
• No need to copy data
  – Less storage
  – No need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only a query interface is needed at the sources
• May be less draining on sources

Advantages of Warehousing
• High query performance
• Queries not visible outside the warehouse
• Local processing at sources unaffected
• Can operate when sources are unavailable
• Can query data not stored in a DBMS
• Extra information at the warehouse
  – Modify, summarize (store aggregates)
  – Add historical information

OLTP vs. OLAP
• OLTP (On-Line Transaction Processing)
  – Describes processing at operational sites
• OLAP (On-Line Analytical Processing)
  – Describes processing at the warehouse

OLTP vs. OLAP
• OLTP
  – Mostly updates
  – Many small transactions
  – MB-TB of data
  – Current snapshot
  – Raw data
  – Clerical users
  – Consistency, recoverability critical
• Warehouse
  – Mostly reads
  – Queries are long and complex
  – GB-TB of data
  – History
  – Summarized, consolidated data
  – Decision-makers, analysts as users

OLAP Example
• The schema for the warehouse
  – Sales(serialNo, date, dealer, price)
  – Autos(serialNo, model, color)
  – Dealers(name, city, state, phone)
• A typical decision-support query
      SELECT state, AVG(price)
      FROM Sales, Dealers
      WHERE Sales.dealer = Dealers.name AND
            date >= '1999-01-04'
      GROUP BY state;
• Common OLTP query
  – “Find the price at which the auto with serial number 123 was sold”

Warehouse Models and Operations
• Data models
  – Relations
  – Stars and snowflakes
  – Cubes
• Operations
  – Slice and dice
  – Roll-up, drill-down
  – Pivoting
  – Other

Star Schemas
• Star schema = fact table + dimension tables
[Figure: a fact table, holding the dependent attributes, at the center, linked to four surrounding dimension tables]

Example-1 (1)
• Sales(serialNo, date, dealer, price)
  Autos(serialNo, model, color)
  Dealers(name, city, state, phone)
[Figure: cube with dimensions car, dealer, and date]
• Sales is a fact table
  – serialNo, date, and dealer are dimensions
  – The one dependent attribute is price, which is what OLAP queries will typically request in an aggregation
• The Autos and Dealers relations are dimension tables
  – Attribute serialNo in the fact table is a foreign key, referencing serialNo of dimension table Autos
• Joins between the fact table and the dimension tables are frequent

Example-1 (2)
• A time dimension table
      Days(day, week, month, year)
  – Analysts frequently want to group by various time units
  – It therefore helps to build a notion of time into the database, as if there were a time dimension table such as the one above

Example-2 (1)

product
prodId  name  price
p1      bolt  10
p2      nut   5

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

store
storeId  city
c1       nyc
c2       sfo
c3       la

Example-2 (2)
• product(prodId, name, price)
• sale(orderId, date, custId, prodId, storeId, qty, amt)
• customer(custId, name, address, city)
• store(storeId, city)

Slicing and Dicing
• Dicing
  – For example, in the time dimension, we might partition (“group by” clause) according to days, weeks, months, years, or not partition at all
  – Partitioning is also possible for cars and dealers
• Slicing
  – Through the “where” clause, a query focuses on particular partitions along one or more dimensions
[Figure: cube with dimensions car, dealer, and date]

Example 1
• A query in which we ask for a slice in one dimension (the date), and dice in two other dimensions (car and dealer)
• The date is divided into four groups, …
[Figure: cube with dimensions car, dealer, and date]

More Examples
•     SELECT color, SUM(price)
      FROM Sales NATURAL JOIN Autos
      WHERE model = 'Gobi'
      GROUP BY color;
  – This query dices by color and slices by model
•     SELECT dealer, month, SUM(price)
      FROM (Sales NATURAL JOIN Autos) JOIN Days ON date = day
      WHERE model = 'Gobi' AND color = 'red'
      GROUP BY dealer, month;

How to support cube-structured data for OLAP
• ROLAP, or Relational OLAP
  – Data may be stored in relations with a specialized structure called a “star schema”
• MOLAP, or Multidimensional OLAP
  – A specialized structure, the “data cube”, is used to hold the data
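
As a rough ROLAP-style illustration, the color/model query from the More Examples slide can be run against a tiny star schema held in ordinary relations. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the function name, the rows, the dealer names (d1, d2), and the second model name are invented for illustration and are not data from the slides.

```python
import sqlite3


def sales_by_color(model):
    """Slice on one model, dice by color, and sum the sale prices."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Dimension table
        CREATE TABLE Autos(serialNo INTEGER PRIMARY KEY, model TEXT, color TEXT);
        -- Fact table: price is the dependent attribute
        CREATE TABLE Sales(serialNo INTEGER REFERENCES Autos(serialNo),
                           date TEXT, dealer TEXT, price INTEGER);
        -- Made-up sample rows
        INSERT INTO Autos VALUES
            (1, 'Gobi', 'red'), (2, 'Gobi', 'blue'),
            (3, 'Gobi', 'red'), (4, 'Sahara', 'red');
        INSERT INTO Sales VALUES
            (1, '1999-01-05', 'd1', 20000), (2, '1999-01-06', 'd1', 21000),
            (3, '1999-02-01', 'd2', 19000), (4, '1999-02-02', 'd2', 30000);
        """
    )
    rows = conn.execute(
        """
        SELECT color, SUM(price)
        FROM Sales NATURAL JOIN Autos   -- joins on the shared column serialNo
        WHERE model = ?                 -- the slice
        GROUP BY color                  -- the dice
        ORDER BY color
        """,
        (model,),
    ).fetchall()
    conn.close()
    return rows


print(sales_by_color("Gobi"))  # prints [('blue', 21000), ('red', 39000)]
```

The WHERE clause restricts the cube to one model (the slice), while GROUP BY partitions what remains by color (the dice); a MOLAP system would answer the same question from a precomputed cube rather than from a join.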