Download CS 245: Database System Principles

Chapter 11 Information Integration
Spring 2001
Prof. Sang Ho Lee
School of Computing, Soongsil Univ.
[email protected]
Chapter 11 1

How to integrate information, which is usually scattered physically
• This is an unavoidable question for all of us.
• Approaches
  – (Homogeneous) distributed DBMS (80’s)
  – Federated databases, multidatabases, remote data access (90’s)
  – Data warehouse, mediator (late 90’s)

Why Information Integration is Difficult (1)
• Heterogeneous sources
• Example (Aardvark Automobile Co.)
  – 1000 dealers
  – Each dealer maintains a database of its cars in stock
  – Aardvark wants to create an integrated database
  – The 1000 dealers do not all use the same database schema
  – Dealer 1:
    • Cars(serialNo, model, color, autoTrans, cdPlayer, …)
  – Dealer 2:
    • Autos(serial, model, color)
    • Options(serial, option)

Why Information Integration is Difficult (2)
• Furthermore …
  – Data type differences: serial numbers might be represented by character strings or integers
  – Value differences: the color black might be represented by an integer code, the string BLACK, or the code BL
  – Semantic differences: one dealer distinguishes station wagons from minivans, while another doesn’t
  – Missing values: a source does not record information that all or most of the other sources provide

Modes of Information Integration
• Federated databases
  – The sources are independent, but one source can call on others to supply information
• Warehousing
  – Copies of data from several sources are stored in a single database, called a (data) warehouse
• Mediation
  – A mediator is a software component that supports a virtual database, which the user may query as if it were materialized
  – The mediator stores no data of its own

Federated Database Systems
• A federated database system is a federation of existing database systems (called local database systems, LDBS) that provides applications with a uniform means of access to data managed by more than one of these database systems
• In theory, local databases should preserve local autonomy

Local Autonomy (1)
• Design autonomy
  – Ability of an LDBS to make its own design decisions on any matter, including data model, query language, constraints, system functions, semantic interpretation of data, …
• Execution autonomy
  – Ability of an LDBS to execute local operations without interference from external operations and to decide the order in which to schedule external operations

Local Autonomy (2)
• Communication autonomy
  – Ability of an LDBS to decide whether and when to communicate with other database systems
• Association autonomy
  – Ability of an LDBS to decide whether and how much to share its functionality and resources with others. For example, an LDBS may export only part of its database to external users or even disassociate itself from an LDBS for some reason.

Federated Database Example
• A federated collection of four local databases
[Figure: four local databases, DB1, DB2, DB3, and DB4]

Federated Database
• If n databases each need to talk to the n – 1 other databases, then we should write n(n – 1) pieces of code to support queries between systems
• This approach is easy to implement in some circumstances!
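
To get a feel for the n(n – 1) count above, the following minimal Python sketch enumerates the ordered (asking, answering) pairs for the four-database example and evaluates the formula for Aardvark's 1000 dealers. The function names are illustrative assumptions and do not come from any federated-database product.

```python
from itertools import permutations


def pairwise_translators(databases):
    """List every ordered pair (asker, answerer).

    Each pair stands for one piece of query-translation code, because
    DB1 asking DB2 is a different translation from DB2 asking DB1.
    """
    return list(permutations(databases, 2))


def translator_count(n):
    """Closed form of the same count: n databases, n - 1 partners each."""
    return n * (n - 1)


if __name__ == "__main__":
    pairs = pairwise_translators(["DB1", "DB2", "DB3", "DB4"])
    for asker, answerer in pairs:
        print(f"{asker} -> {answerer}")
    print(len(pairs), translator_count(4))
    print(translator_count(1000))
```

For the four databases this lists 12 translators; for 1000 dealers the formula gives 999,000, which shows how quickly the pairwise approach grows.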

Query Translation Example
• Dealer 1: NeededCars(model, color, autoTrans)
• Dealer 2: Autos(serial, model, color), Options(serial, option)

    /* Dealer 1 queries Dealer 2 for needed cars */
    for (each tuple (:m, :c, :a) in NeededCars) {
        if (:a = TRUE) { /* automatic transmission wanted */
            SELECT Autos.serial
            FROM Autos, Options
            WHERE Autos.serial = Options.serial
              AND Options.option = 'autoTrans'
              AND Autos.model = :m
              AND Autos.color = :c;
        } else { /* automatic transmission not wanted */
            SELECT serial
            FROM Autos
            WHERE Autos.model = :m
              AND Autos.color = :c
              AND NOT EXISTS (
                  SELECT * FROM Options
                  WHERE serial = Autos.serial
                    AND option = 'autoTrans'
              );
        }
    }

Mediators
• A mediator supports queries over a virtual view or a collection of views
• It does not store any data of its own
[Figure: the user sends a query to the mediator and receives a result; the mediator sends queries to two wrappers, each of which queries its own source (Source 1, Source 2) and passes the result back]

Mediator Example (1)
– A view that is a single relation
    AutosMed(serialNo, model, color, autoTrans, dealer)
– A query to the mediator
    SELECT serialNo, model
    FROM AutosMed
    WHERE color = 'red'
– The mediator can forward the same query to each of the two wrappers
– The translation work can be done by the wrappers alone

Mediator Example (2)
– A suitable translation for Dealer 1
    Cars(serialNo, model, color, autoTrans, cdPlayer, …)
    SELECT serialNo, model
    FROM Cars
    WHERE color = 'red';
– A suitable translation for Dealer 2
    Autos(serial, model, color), Options(serial, option)
    SELECT serial, model
    FROM Autos
    WHERE color = 'red';
– The wrappers return to the mediator a set of serialNo-model pairs and a set of serial-model pairs, respectively
– The mediator can take the union of these sets and return the result to the user

Wrappers in Mediator-Based Systems
• Sources could be DBMSs (in various models), file systems, Web servers, …
• A wrapper handles all connection and query-translation problems peculiar to its source
• Mediator systems require more complex wrappers than do most warehouse systems
• Techniques
  – Wrapper generator
  – Template-based
  – Filter techniques
  – Etc.

Templates for Query Patterns
• Templates are queries with parameters that represent constants
  – Example
    SELECT * FROM AutosMed
    WHERE color = '$c'
    =>
    SELECT serialNo, model, color, autoTrans, 'dealer1'
    FROM Cars
    WHERE color = '$c';
• In general there would be 2^n templates if we have the option of specifying n attributes
• The number of templates could grow unreasonably large

Wrapper Generators
• Wrapper generator
  – The software that creates the wrapper
  – A table that holds the various query patterns contained in the templates
[Figure: wrapper generator architecture, with components labeled Templates, Wrapper generator, Table, Driver, and Source; queries from the mediator go in and results come back]

Filters
• It is not always realistic to write a template for every possible form of query
• Another approach to supporting more queries is to have the wrapper filter the results of queries

Filters Example
– Suppose the only template we have is the one that finds cars given a color
– The mediator needs to find blue ‘Gobi’ model cars
    SELECT * FROM AutosMed
    WHERE color = 'blue' AND model = 'Gobi'
– A possible way to answer the query
  • Use the template (with $c = 'blue')
  • Store the result in a temporary relation TempAutos
  • Select from TempAutos the Gobis

Data Warehousing
• Growing industry since mid 90’s
• Ranges from desktop to huge
• Lots of buzzwords, hype
  – Slice & dice, rollup, MOLAP, pivot, …

Information as a Competitive Weapon
• Organizations have collected large amounts of data. Now it is time to use it to their advantage.

Can You Easily Answer These Questions?
What is the correlation between expenditures and collection of delinquent taxes?
What is the impact on revenues and expenditures of changing the operating hours of the Dept. of Motor Vehicles?
What are Personnel Services costs across all departments for all funding sources?
What are the effects of outsourcing specific services?
What is the economic impact of the small business initiative in our district?

What is a Warehouse (1)
• Collection of diverse data
  – Subject oriented
  – Aimed at executive, decision maker
  – Often a copy of operational data
  – With value-added data (e.g., summaries, history)
  – Integrated
  – Time-varying
  – Non-volatile
AND …

What is a Warehouse (2)
• Collection of tools
  – Gathering data
  – Cleansing, integrating
  – Querying, reporting, analysis
  – Data mining
  – Monitoring, administering warehouse

Warehouse Architecture
[Figure: clients use a Query & Analysis component on top of the Warehouse and its Metadata; an Integration component sits between the warehouse and several sources]

Motivating Examples
• Forecasting
• Comparing performance of units
• Monitoring, detecting fraud
• Visualization

Why a Warehouse
• Two approaches:
  – Query-driven (lazy)
  – Warehouse (eager)
[Figure: a question mark above two sources]

Query-driven approach
[Figure: clients query a mediator, which reaches each of three sources through its own wrapper]

Advantages of Query-driven
• No need to copy data
  – Less storage
  – No need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only query interface needed at sources
• May be less draining on sources

Advantages of Warehousing
• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
  – Modify, summarize (store aggregates)
  – Add historical information

OLTP vs. OLAP
• OLTP (On-Line Transaction Processing)
  – Describes processing at operational sites
• OLAP (On-Line Analytical Processing)
  – Describes processing at warehouse

OLTP vs. OLAP
• OLTP
  – Mostly updates
  – Many small transactions
  – MB-TB of data
  – Current snapshot
  – Raw data
  – Clerical users
  – Consistency, recoverability critical
• Warehouse
  – Mostly reads
  – Queries are long and complex
  – GB-TB of data
  – History
  – Summarized, consolidated data
  – Decision-makers, analysts as users

OLAP Example
• The schema for the warehouse
  – Sales(serialNo, date, dealer, price)
  – Autos(serialNo, model, color)
  – Dealers(name, city, state, phone)
• A typical decision-support query
    SELECT state, AVG(price)
    FROM Sales, Dealers
    WHERE Sales.dealer = Dealers.name
      AND date >= '1999-01-04'
    GROUP BY state;
• Common OLTP query
  – “Find the price at which the auto with serial number 123 was sold”

Warehouse Models and Operations
• Data models
  – Relations
  – Stars and snowflakes
  – Cubes
• Operations
  – Slice and dice
  – Roll-up, drill-down
  – Pivoting
  – Other

Star Schemas
• Star schema = fact table + dimension tables
[Figure: a fact table with its dependent attributes in the center, linked to four surrounding dimension tables]

Example-1 (1)
• Sales(serialNo, date, dealer, price)
  Autos(serialNo, model, color)
  Dealers(name, city, state, phone)
[Figure: data cube with axes car, dealer, and date]
• Sales is a fact table
  – serialNo, date, dealer are dimensions
  – The one dependent attribute is price, which is what OLAP queries will typically request in an aggregation
• The Autos and Dealers relations are dimension tables
  – Attribute serialNo in the fact table is a foreign key, referencing serialNo of dimension table Autos
• Joins between the fact table and the dimension tables are frequent

Example-1 (2)
• A time dimension table
    Days(day, week, month, year)
  – Analysts frequently want to group by various time units
  – It helps to build a notion of time into the database, as if there were a time dimension table such as the one above

Example-2 (1)
(- marks an empty cell in the cube tables of this deck)

product
prodId  name  price
p1      bolt  10
p2      nut   5

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

store
storeId  city
c1       nyc
c2       sfo
c3       la

Example-2 (2)
[Figure: star schema with sale as the fact table]
• product(prodId, name, price)
• sale(orderId, date, custId, prodId, storeId, qty, amt)
• customer(custId, name, address, city)
• store(storeId, city)

Slicing and Dicing
• Dicing
  – For example, in the time dimension, we might partition (“group by” clause) according to days, weeks, months, years, or not partition at all
  – Partitioning is also possible for cars and dealers
• Slicing
  – Through the “where” clause, a query focuses on partitions along one or more dimensions
[Figure: data cube with axes car, dealer, and date]

Example 1
• A query in which we ask for a slice in one dimension (the date), and dice in two other dimensions (car and dealer)
• The date is divided into four groups, …
[Figure: data cube with axes car, dealer, and date]

More Examples
•   SELECT color, SUM(price)
    FROM Sales NATURAL JOIN Autos
    WHERE model = 'Gobi'
    GROUP BY color;
  – This query dices by color and then slices by model
•   SELECT dealer, month, SUM(price)
    FROM (Sales NATURAL JOIN Autos) JOIN Days ON date = day
    WHERE model = 'Gobi' AND color = 'red'
    GROUP BY dealer, month;

How to support cube-structured data for OLAP
• ROLAP, or Relational OLAP
  – Data may be stored in relations with a specialized structure called a “star schema”
• MOLAP, or Multidimensional OLAP
  – A specialized structure, the “data cube”, is used to hold the data

Data cubes
• An alternative to executing decision-support queries as ad hoc queries is to pre-compute all possible aggregates in a systematic way
• The amount of extra storage needed is often tolerable
• We shall continue to call the points of the data cube the “fact table”

Cube Example
Fact table view: sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8

Multi-dimensional cube (dimensions = 2):
      c1   c2   c3
p1    12   -    50
p2    11   8    -

3-D Cube
Example
Fact table view: sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube (dimensions = 3; - marks an empty cell):
day 1
      c1   c2   c3
p1    12   -    50
p2    11   8    -
day 2
      c1   c2   c3
p1    44   4    -
p2    -    -    -

The Cube Operator
• Given a fact table F, we can define an augmented table CUBE(F) that adds an additional value, denoted *, to each dimension
  – The * represents aggregation along the dimension in which it appears
• A tuple of the table CUBE(F) has * in one or more dimensions

The Cube Operator Example
• Sales(model, color, date, dealer, val, cnt)
  – “val” denotes the total price, “cnt” denotes the total number of automobiles
• Possible tuples
  – ('Gobi', 'red', '1999-05-21', 'Friendly Fred', 45000, 2)
  – ('Gobi', *, '1999-05-21', 'Friendly Fred', 152000, 7)
  – ('Gobi', *, '1999-05-21', *, 2348000, 100)
  – ('Gobi', *, *, *, 1339800000, 58000)
  – (*, *, *, *, 3521727000, 198000)

Another Example
• Consider
    SELECT color, AVG(price)
    FROM Sales
    WHERE model = 'Gobi'
    GROUP BY color;
• The query above is answered by looking for all tuples of CUBE(Sales) of the form ('Gobi', c, *, *, v, n)
  – c is any specific color
  – The tuple asked for by the query is (c, v/n)
• The answer is the set of (c, v/n) pairs from all ('Gobi', c, *, *, v, n) tuples

Aggregates
• Add up amounts by day
• In SQL:
    SELECT date, sum(amt)
    FROM SALE
    GROUP BY date

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

ans
date  sum
1     81
2     48

Rollup vs. Drill-down
• Add up amounts by day, product
• In SQL:
    SELECT prodId, date, sum(amt)
    FROM SALE
    GROUP BY date, prodId

sale (input): the same six-row fact table as on the previous slide

sale (result)
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

[Figure: arrows labeled rollup and drill-down; rollup moves to a coarser grouping, drill-down to a finer one]

Aggregates
• Operators: sum, count, max, min, median, ave
• “Having” clause
• Using dimension hierarchy
  – average by region (within store)
  – maximum by month (within date)

Cube Aggregation
Example: computing sums
Starting from the day 1 and day 2 layers of the 3-D cube, summing over days gives
      c1   c2   c3
p1    56   4    50
p2    11   8    -
Rolling up further (drill-down goes the other way):
• Sum by store: c1 67, c2 12, c3 50
• Sum by product: p1 110, p2 19
• Grand total: 129

Cube Operators
Cells of the aggregated cube can be addressed with * for “all values”:
• sale(c1,*,*) = 67
• sale(c2,p2,*) = 8
• sale(*,*,*) = 129

Extended Cube
The cube extended with a * value in every dimension:
day 1
      c1   c2   c3   *
p1    12   -    50   62
p2    11   8    -    19
*     23   8    50   81
day 2
      c1   c2   c3   *
p1    44   4    -    48
p2    -    -    -    -
*     44   4    -    48
* (all days)
      c1   c2   c3   *
p1    56   4    50   110
p2    11   8    -    19
*     67   12   50   129
• sale(*,p2,*) = 19

The Lattice of Views
• It helps to think of a lattice of possible groupings for each dimension of the cube
• A path from some node P2 down to P1 means that P1 <= P2
[Figure: a lattice for time with nodes All, Years, Quarters, Months, Weeks, Days; a lattice for dealers with nodes All, State, City, Dealer]

Aggregation Using Hierarchies
• Hierarchy: customer, region, country
• Customer c1 is in Region A; customers c2 and c3 are in Region B
Aggregating the 3-D cube by region:
      region A  region B
p1    56        54
p2    11        8

Data Mining
• Knowledge discovery
• To find surprising facts from existing databases
• Techniques from DBMS, machine learning, statistics, …

Decision Tree
• The interior nodes each have an attribute and a value that serves as a threshold
• The children of a node are either other interior nodes, or a decision: accept or reject
• A given tuple is passed down the tree, going left or right at each step according to the value the tuple has, until a decision node is reached
• The tree is constructed from a training set of tuples whose outcome is known

Example (weather vs. tennis play)
Outlook
  sunny -> humidity
    high -> no
    normal -> yes
  overcast -> yes
  rainy -> windy
    false -> yes
    true -> no

Clustering
• To group data items into some small number of groups such that the groups each have something substantial in common
• Example
  – Clustering of Web pages in Web search engines

Association-Rule Mining Example
• Market-basket data
  – A customer approaches the checkout with a “market basket” full of the items he or she has selected
  – The cash register records all of these items as part of a single transaction
• Claim: People who buy diapers are unusually likely also to buy beer
• Schema: Baskets(basket, item)

Data-Mining Applications: Association-Rule Mining
• Naive way to find all high-support pairs of items
    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket
      AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s;

The A-Priori Algorithm
• Basic observation
  – If a set of items X has support s, then each subset of X must also have support at least s.
  – If a pair of items, say {i, j}, appears in, say, 1000 baskets, then we know there are at least 1000 baskets with item i and at least 1000 baskets with item j.
• Strategies
  – First find the set of “OK” items: those that appear in a sufficient number of baskets by themselves
  – Then run the query on only the items in the OK set

The A-Priori Algorithm
    INSERT INTO OkBaskets
    SELECT * FROM Baskets
    WHERE item IN (
        SELECT item FROM Baskets
        GROUP BY item
        HAVING COUNT(*) >= s
    );

    SELECT I.item, J.item, COUNT(I.basket)
    FROM OkBaskets I, OkBaskets J
    WHERE I.basket = J.basket
      AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(*) >= s;

How Good the A-Priori Algorithm is
– Assumptions (Example 11.20)
  • 10,000 different items; the average market basket has 20 items in it
  • 1,000,000 baskets, so the Baskets relation has 20,000,000 tuples
– The naive algorithm
  • The join has 190,000,000 pairs (1,000,000 baskets × 190 pairs of items per basket)
  • The 190,000,000 tuples must all be grouped and counted
– The A-Priori algorithm
  • Suppose that s is 10,000, i.e., 1% of the baskets
  • It is not possible that more than 2000 (= 20,000,000 / 10,000) items appear in at least 10,000 baskets
  • The subquery produces many fewer than 2000 items
  • Assume OkBaskets has on average 10 items per basket
  • The join of OkBaskets with itself (1,000,000 × 45 = 45,000,000 pairs) is less than ¼ the size of the join of Baskets with itself, so the running time should drop to less than ¼
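
The two-step A-Priori strategy above can be tried on a toy Baskets relation. The sketch below is a minimal illustration, assuming Python's built-in sqlite3 module as the database and a made-up set of five baskets with threshold s = 3. Only the Baskets(basket, item) schema, the OkBaskets relation, and the shape of the two queries come from the slides; the helper names and the data are hypothetical.

```python
import sqlite3

# Hypothetical toy data (basket id -> items); not from the slides.
TOY_BASKETS = {
    1: ["beer", "diapers", "milk"],
    2: ["beer", "diapers"],
    3: ["beer", "diapers", "bread"],
    4: ["bread", "milk"],
    5: ["beer", "chips"],
}

# Pair-counting query from the slides, parameterized by table name.
PAIR_QUERY = """
    SELECT I.item, J.item, COUNT(I.basket)
    FROM {table} I, {table} J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= ?
"""


def load(baskets):
    """Create an in-memory database holding Baskets and an empty OkBaskets."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Baskets (basket INTEGER, item TEXT)")
    conn.execute("CREATE TABLE OkBaskets (basket INTEGER, item TEXT)")
    conn.executemany(
        "INSERT INTO Baskets VALUES (?, ?)",
        [(b, item) for b, items in baskets.items() for item in items],
    )
    return conn


def naive_pairs(conn, s):
    """Naive approach: self-join the full Baskets relation."""
    return conn.execute(PAIR_QUERY.format(table="Baskets"), (s,)).fetchall()


def apriori_pairs(conn, s):
    """A-Priori: keep only items with support >= s, then self-join."""
    conn.execute("DELETE FROM OkBaskets")
    conn.execute(
        """
        INSERT INTO OkBaskets
        SELECT * FROM Baskets
        WHERE item IN (
            SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= ?
        )
        """,
        (s,),
    )
    return conn.execute(PAIR_QUERY.format(table="OkBaskets"), (s,)).fetchall()


def join_size(conn, table):
    """Number of item pairs the self-join must group and count."""
    sql = (
        "SELECT COUNT(*) FROM {t} I, {t} J "
        "WHERE I.basket = J.basket AND I.item < J.item"
    ).format(t=table)
    return conn.execute(sql).fetchone()[0]


if __name__ == "__main__":
    conn = load(TOY_BASKETS)
    print(naive_pairs(conn, 3))
    print(apriori_pairs(conn, 3))
    print(join_size(conn, "Baskets"), join_size(conn, "OkBaskets"))
```

On this toy data both approaches report the single pair (beer, diapers) with count 3, while the self-join shrinks from 9 pairs over Baskets to 3 pairs over OkBaskets. The saving is the same effect as in the 190,000,000 versus 45,000,000 estimate, at a much smaller scale.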
