Lecture 5: Data Warehousing and OLAP
Database Management Systems, Winter 2004
CMPUT 391: Data Warehousing
Dr. Osmar R. Zaïane, University of Alberta
Chapter 19 of Textbook
Dr. Osmar R. Zaïane, 2001-2004, Database Management Systems, University of Alberta

Objectives of Lecture 5: Data Warehousing and OLAP
• Realize the purpose of data warehousing.
• Comprehend the data structures behind data warehouses and understand the OLAP technology.
• Get an overview of the schemas used for multidimensional data.
• See some implementations of OLAP operators with SQL.

Data Warehouse and OLAP
• What is a data warehouse and what is it for?
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Are there issues related to data cube technology?

Incentive for a Data Warehouse
• Businesses have a lot of data: operational data and facts.
• This data is usually in different databases and in different physical places.
• Data is available (or archived), but in different formats and locations (heterogeneous and distributed).
• Decision makers need to access information (data that has been summarized) virtually on one single site.
• This access needs to be fast regardless of the size of the data and how old the data is.

Evolution of Decision Support Systems
• 1960s, Batch and Manual Reporting: statistician, computer scientist. Difficult and limited queries, highly specific to some distinctive needs.
• 1970s, Terminal-based Decision Support Systems: data analyst. Inflexible and non-integrated tools.
• 1980s, Desktop Data Analysis Tools: data analyst. Flexible integrated spreadsheets. Slow access to operational data.
• 1990s, Data Warehousing and On-Line Analytical Processing: executive. Integrated tools. Data mining.

What Is a Data Warehouse?
• A data warehouse consolidates different data sources.
• A data warehouse is a database that is different and maintained separately from an operational database.
• A data warehouse combines and merges information in a consistent database (not necessarily up-to-date) to help decision support.
Decision support systems access the data warehouse and do not need to access operational databases → they do not unnecessarily overload operational databases.

Definitions
Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process. (W.H. Inmon)
Subject oriented: oriented to the major subject areas of the corporation that have been defined in the data model.
Integrated: data collected in a data warehouse originates from different heterogeneous data sources.
Time-variant: The dimension “time” is all-pervading in a data warehouse. The data stored is not the current value, but an evolution of the value in time.
Non-volatile: update of data does not occur frequently in the data warehouse. The data is loaded and accessed.

Definitions (cont'd)
Data Warehousing is the process of constructing and using data warehouses.
A corporate data warehouse collects data about subjects spanning the whole organization.
Data Marts are specialized, single-line-of-business warehouses. They collect data for a department or a specific group of people.
Building a Data Warehouse
• Option 1: Consolidate Data Marts (Data Mart, Data Mart, Data Mart → Corporate Data Warehouse)
• Option 2: Build from scratch (Corporate data → Data Mart)

Describing the Organization
• Business Manager: "We sell products in various markets, and we measure our performance over time."
• Data Warehouse Designer: "We sell Products in various Markets, and we measure our performance over Time."
[Figure: a cube whose three axes are Products, Markets and Time.]

Construction of Data Warehouse Based on Multi-dimensional Model
• Think of it as a cube with labels on each edge of the cube.
• The cube doesn’t just have 3 dimensions, but may have many dimensions (N).
• Any point inside the cube is at the intersection of the coordinates defined by the edges of the cube.
• A point in the cube may store values (measurements) relative to the combination of the labeled dimensions.

Concept-Hierarchies
Most dimensions are hierarchical by nature: total orders or partial orders.
Example: Location (continent → country → province → city)
         Time (year → quarter → (month, week) → day)
Dimensions: Product, Region, Time. Hierarchical summarization paths:
  Product: Industry → Category → Product
  Region:  Region → Country → City → Office
  Time:    Year → Quarter → (Month, Week) → Day

On-Line Transaction Processing
• Database management systems are typically used for on-line transaction processing (OLTP).
• OLTP applications normally automate clerical data processing tasks of an organization, like data entry and enquiry, transaction handling, etc. (access, read, update).
• Database is current, and consistency and recoverability are critical. Records are accessed one at a time.
  – OLTP operations are structured and repetitive
  – OLTP operations require detailed and up-to-date data
  – OLTP operations are short, atomic and isolated transactions
Databases tend to be hundreds of MB to GB.

On-Line Analytical Processing
• On-line analytical processing (OLAP) is essential for decision support.
• OLAP is supported by data warehouses.
• Data warehouse: consolidation of operational databases.
• The key structure of the data warehouse always contains some element of time.
• Owing to the hierarchical nature of the dimensions, OLAP operations view the data flexibly from different perspectives (different levels of abstraction).
DW tend to be in the order of TB.
• OLAP operations:
  – roll-up (increase the level of abstraction)
  – drill-down (decrease the level of abstraction)
  – slice and dice (selection and projection)
  – pivot (re-orient the multi-dimensional view)
  – drill-through (links to the raw data)

OLTP vs OLAP
                     OLTP                                   OLAP
users                clerk, IT professional                 knowledge worker
function             day-to-day operations                  decision support
DB design            application-oriented                   subject-oriented
data                 current, up-to-date, detailed,         historical, summarized, multidimensional,
                     flat relational, isolated              integrated, consolidated
usage                repetitive                             ad-hoc
access               read/write, index/hash on prim. key    lots of scans
unit of work         short, simple transaction              complex query
# records accessed   tens                                   millions
# users              thousands                              hundreds
DB size              100MB-GB                               100GB-TB
metric               transaction throughput                 query throughput, response

Data Warehouse OLAP Example
[Figure (Source: JH): a data cube with dimensions Time (Quarters: Q1, Q2, Q3, Q4), Location (city, AB: Edmonton, Calgary, Lethbridge, Red Deer) and Category (Compact, Mid-size, Family, Minivan), and the result of four operations on it:
  – Drill down on Q3: Time (Months, Q3: Jul, Aug, Sep)
  – Roll-up on Location: Location (province, Canada: Maritimes, Quebec, Ontario, Prairies, Western Pr)
  – Slice on Category = Mid-size
  – Pivot: the Time and Location axes are exchanged]

Why Do We Separate DW From DB?
• Performance reasons:
  – OLAP necessitates special data organization that supports multidimensional views.
  – OLAP queries would degrade operational DB.
  – OLAP is read only.
  – No concurrency control and recovery.
• Decision support requires historical data.
• Decision support requires consolidated data.

Three-tier Architecture
[Figure: Data sources (operational DBs, external sources) → Extract, Transform, Load, Refresh → Data Warehouse and Data Marts (with Monitor & Integrator and metadata) → Serve → OLAP Server → Client Tools (analysis, query/reports, data mining).]

Data Warehouse Design
Most data warehouses use a star schema to represent the multidimensional model.
Each dimension is represented by a dimension-table that describes it.
A fact-table connects to all dimension-tables with a multiple join. Each tuple in the fact-table consists of a pointer to each of the dimension-tables that provide its multi-dimensional coordinates, and stores measures for those coordinates.
The links between the fact-table in the centre and the dimension-tables in the extremities form a shape like a star (star schema).

Example of Star Schema
[Figure (Source: JH):
  Sales Fact Table (Date, Product, Store, Cust, unit_sales, dollar_sales)
  Date (Date, Day, Month, Year)
  Product (ProductNo, ProdName, ProdDesc, Category)
  Store (StoreID, City, State, Country, Region)
  Customer (CustId, CustName, CustCity, CustCountry)]
Star schema: A single object (fact table) in the middle connected to a number of objects (dimension tables).
Each dimension is represented by one table → un-normalized (introduces redundancy).
Normalize dimension tables → snowflake schema.

Example of Snowflake Schema
[Figure (Source: JH): the same Sales Fact Table, Product and Customer tables, with the Date and Store dimensions normalized into chains of tables: Date → Month → Year and Store → City → State → Country → Region.]
Snowflake schema: Easier to maintain dimension tables when dimension tables are very large (reduces overall space).
Star schema: More effective for data cube browsing (fewer joins); can affect performance.

Aggregation in Data Warehouses
Multidimensional view of data in the warehouse: stress on aggregation of measures by one or more dimensions.

Construction of Multi-dimensional Data Cube
[Figure: a cube with axes Time (Q1, Q2, Q3, Q4), Location (Edmonton, Calgary, Lethbridge, Red Deer) and Category (Drama, Comedy, Horror), with Sum cells along each axis, including "All Years" and "Drama, Edmonton".]

The Data Cube and The Sub-Space Aggregates
[Figure (Source: JH): Aggregate (Sum) → Group By (By Category) → Cross Tab (By Time, By Category) → data cube over Time (1999, 2000, 2001, 2002), City (Edmonton, Calgary, Lethbridge) and Category (Drama, Comedy, Horror), with sub-space aggregates By City, By Time, By Category, By Category & City, By Time & City, By Time & Category, and Sum.]

Implementation of the OLAP Server
ROLAP: Relational OLAP - data is stored in tables in relational or extended-relational databases. They use an RDBMS to manage the warehouse data and aggregations, often using a star schema.
• They support extensions to SQL
• A cell in the multi-dimensional structure is represented by a tuple.
Advantage: Scalable (no empty cells for sparse cube).
Disadvantage: No direct access to cells.
Ex: Microstrategy, Metacube (Informix)

MOLAP: Multidimensional OLAP - implements the multidimensional view by storing data in special multidimensional data structures (MDDS).
Advantage: Fast indexing to pre-computed aggregations. Only values are stored.
Disadvantage: Not very scalable, and sparse.
Ex: Essbase of Arbor

HOLAP: Hybrid OLAP - combines ROLAP and MOLAP technology (scalability of ROLAP and faster computation of MOLAP).

Example of Fact and Dimension tables for ROLAP
• The dimensions of the fact table are further described with dimension tables
• Fact table:
    Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
• Dimension tables:
    Market (Market_Id, City, Province, Region)
    Product (Product_Id, Name, Category, Price)
    Time (Time_Id, Week, Month, Quarter)
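The ROLAP fact and dimension tables can be tried out with a small sketch. It uses Python's standard sqlite3 module purely as a convenient relational engine; the column types and every row of data are assumptions made up for illustration, and only the table and column names come from the lecture.

```python
import sqlite3

# In-memory database holding a tiny star schema: one fact table (Sales)
# surrounded by three dimension tables (Market, Product, Time).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Market  (Market_Id TEXT PRIMARY KEY, City TEXT, Province TEXT, Region TEXT);
CREATE TABLE Product (Product_Id TEXT PRIMARY KEY, Name TEXT, Category TEXT, Price INTEGER);
CREATE TABLE Time    (Time_Id INTEGER PRIMARY KEY, Week TEXT, Month TEXT, Quarter INTEGER);
CREATE TABLE Sales (
    Market_Id  TEXT    REFERENCES Market(Market_Id),    -- pointer into each dimension
    Product_Id TEXT    REFERENCES Product(Product_Id),
    Time_Id    INTEGER REFERENCES Time(Time_Id),
    Sales_Amt  INTEGER                                  -- the measure
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO Market VALUES (?, ?, ?, ?)", [
    ("M1", "Edmonton", "Alberta", "Prairies"),
    ("M2", "Calgary", "Alberta", "Prairies"),
    ("M3", "Toronto", "Ontario", "Central"),
])
conn.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", [
    ("P1", "Sedan", "Compact", 20000),
    ("P2", "Van", "Minivan", 30000),
])
conn.executemany("INSERT INTO Time VALUES (?, ?, ?, ?)", [
    (1, "Wk-1", "Jan", 1),
    (2, "Wk-14", "Apr", 2),
])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("M1", "P1", 1, 100), ("M2", "P1", 1, 50), ("M1", "P2", 2, 70),
    ("M3", "P1", 2, 30), ("M3", "P2", 1, 20),
])

# A multiple join from the fact table to two dimension tables,
# aggregating the measure by Region and Category.
rows = conn.execute("""
    SELECT M.Region, P.Category, SUM(S.Sales_Amt)
    FROM Sales S, Market M, Product P
    WHERE M.Market_Id = S.Market_Id AND P.Product_Id = S.Product_Id
    GROUP BY M.Region, P.Category
    ORDER BY M.Region, P.Category
""").fetchall()

for region, category, total in rows:
    print(region, category, total)
```

Each Sales tuple carries one key per dimension plus the measure, which is the "a cell is represented by a tuple" idea: combinations with no sales simply have no row.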
Aggregation
• Many OLAP queries involve aggregation of the data in the fact table
• For example, to find the total sales (over time) of each product in each market, we might use
    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id, S.Product_Id
• The aggregation is over the entire time dimension and thus produces a two-dimensional view of the data

Aggregation over Time
• The output of the previous query
    SUM(Sales_Amt)        Market_Id
    Product_Id       M1      M2      M3      M4     …
    P1             3003    1503       …       …     …
    P2             6003    2402       …       …     …
    P3             4503       3       …       …     …
    P4             7503    7000       …       …     …
    P5                …       …       …       …     …

Drilling Down and Rolling Up
• Some dimension tables form an aggregation hierarchy
    Market_Id → City → Province → Region
• Executing a series of queries that moves down a hierarchy (e.g., from aggregation over regions to that over provinces) is called drilling down
  – Requires the use of the fact table or information more specific than the requested aggregation (e.g., cities)
• Executing a series of queries that moves up the hierarchy (e.g., from provinces to regions) is called rolling up
  – Note: In a rollup, coarser aggregations can be computed using prior queries for finer aggregations

Drilling Down
• Drilling down on market: from Region to Province
    Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
    Market (Market_Id, City, Province, Region)
1.  SELECT S.Product_Id, M.Region, SUM (S.Sales_Amt)
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Region
2.  SELECT S.Product_Id, M.Province, SUM (S.Sales_Amt)
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Province

Rolling Up
• Rolling up on market, from Province to Region
  – If we have already created a table, Province_Sales, using
1.  SELECT S.Product_Id, M.Province, SUM (S.Sales_Amt) AS Sales_Amt
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Province
    then we can roll up from there to:
2.  SELECT T.Product_Id, M.Region, SUM (T.Sales_Amt)
    FROM Province_Sales T, (SELECT DISTINCT Province, Region FROM Market) M
    WHERE M.Province = T.Province
    GROUP BY T.Product_Id, M.Region
  – Market is reduced to distinct (Province, Region) pairs so that each province total is counted once, even when a province has several markets

Pivoting
• When we view the data as a multi-dimensional cube and group on a subset of the axes, we are said to be performing a pivot on those axes
  – Pivoting on dimensions D1,…,Dk in a data cube D1,…,Dk,Dk+1,…,Dn means that we use GROUP BY A1,…,Ak and aggregate over Ak+1,…,An, where Ai is an attribute of the dimension Di
  – Example: Pivoting on Product and Time corresponds to grouping on Product_Id and Quarter and aggregating Sales_Amt over Market_Id:
    SELECT S.Product_Id, T.Quarter, SUM (S.Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id
    GROUP BY S.Product_Id, T.Quarter          (pivot)
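The note that a roll-up can reuse a finer aggregation can be checked with a small sketch, again using Python's sqlite3 as a stand-in engine with made-up rows. One assumption is worth flagging: because Market holds one row per market, the sketch joins Province_Sales against the distinct (Province, Region) pairs rather than against Market itself, so that a province with several markets is not counted more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Market (Market_Id TEXT PRIMARY KEY, City TEXT, Province TEXT, Region TEXT);
CREATE TABLE Sales  (Market_Id TEXT, Product_Id TEXT, Time_Id INTEGER, Sales_Amt INTEGER);
""")
# Hypothetical data: Alberta has two markets, which is what makes the
# DISTINCT in the roll-up query matter.
conn.executemany("INSERT INTO Market VALUES (?, ?, ?, ?)", [
    ("M1", "Edmonton", "Alberta", "Prairies"),
    ("M2", "Calgary", "Alberta", "Prairies"),
    ("M3", "Regina", "Saskatchewan", "Prairies"),
    ("M4", "Toronto", "Ontario", "Central"),
])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("M1", "P1", 1, 10), ("M2", "P1", 1, 20), ("M3", "P1", 1, 5),
    ("M4", "P1", 1, 7), ("M1", "P2", 1, 1), ("M4", "P2", 1, 2),
])

# Finer aggregation (per province), saved as a table.
conn.execute("""
    CREATE TABLE Province_Sales AS
    SELECT S.Product_Id, M.Province, SUM(S.Sales_Amt) AS Sales_Amt
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Province
""")

# Roll up from provinces to regions using only the saved aggregate.
rolled_up = conn.execute("""
    SELECT T.Product_Id, M.Region, SUM(T.Sales_Amt)
    FROM Province_Sales T, (SELECT DISTINCT Province, Region FROM Market) M
    WHERE M.Province = T.Province
    GROUP BY T.Product_Id, M.Region
    ORDER BY T.Product_Id, M.Region
""").fetchall()

# The same aggregation computed directly from the fact table.
direct = conn.execute("""
    SELECT S.Product_Id, M.Region, SUM(S.Sales_Amt)
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Region
    ORDER BY S.Product_Id, M.Region
""").fetchall()

print(rolled_up == direct)  # both routes give the same regional totals
```

Drilling down cannot be done this way: a table of regional totals does not contain enough information to recover provincial totals, which is why it needs the fact table or a finer aggregate.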
Slicing-and-Dicing
• When we use WHERE to specify a particular value for an axis (or several axes), we are performing a slice
  – Slicing the data cube in the Time dimension (choosing sales only in week 12) then pivoting to Product_Id (aggregating over Market_Id)
    SELECT S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id AND T.Week = 'Wk-12'          (slice)
    GROUP BY S.Product_Id                                      (pivot)

Slicing-and-Dicing
• Typically slicing and dicing involves several queries to find the “right slice.” For instance, change the slice and the axes:
• Slicing on Time and Market dimensions then pivoting to Product_Id and Week (in the time dimension)
    SELECT S.Product_Id, T.Week, SUM (S.Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id
          AND T.Quarter = 4 AND S.Market_Id = 12345           (slice)
    GROUP BY S.Product_Id, T.Week                              (pivot)

The CUBE Operator
• Constructing the following table would take 3 queries
    SUM(Sales_Amt)        Market_Id
    Product_Id       M1      M2      M3    Total
    P1             3003    1503       …        …
    P2             6003    2402       …        …
    P3             4503       3       …        …
    P4             7503    7000       …        …
    Total             …       …       …        …

The Three Queries
• For the table entries, without the totals (aggregation on time)
    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id, S.Product_Id
• For the row totals (aggregation on time and markets)
    SELECT S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Product_Id
• For the column totals (aggregation on time and products)
    SELECT S.Market_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id
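A slice followed by a pivot, and the three queries behind a cross-tab with totals, can be sketched as follows. The engine is Python's bundled sqlite3 and all rows are hypothetical; only the table names, column names and query shapes follow the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time  (Time_Id INTEGER PRIMARY KEY, Week TEXT);
CREATE TABLE Sales (Market_Id TEXT, Product_Id TEXT, Time_Id INTEGER, Sales_Amt INTEGER);
INSERT INTO Time  VALUES (1, 'Wk-12'), (2, 'Wk-13');
INSERT INTO Sales VALUES
    ('M1', 'P1', 1, 10), ('M1', 'P1', 2, 5), ('M2', 'P1', 1, 7),
    ('M1', 'P2', 1, 3),  ('M2', 'P2', 2, 4);
""")

# Slice on Time (week 12 only), then pivot to Product_Id.
slice_wk12 = conn.execute("""
    SELECT S.Product_Id, SUM(S.Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id AND T.Week = 'Wk-12'
    GROUP BY S.Product_Id
    ORDER BY S.Product_Id
""").fetchall()

# Query 1: the table entries, keyed by (market, product).
cells = {(m, p): v for m, p, v in conn.execute(
    "SELECT Market_Id, Product_Id, SUM(Sales_Amt) FROM Sales "
    "GROUP BY Market_Id, Product_Id")}
# Query 2: row totals (one per product).
row_totals = dict(conn.execute(
    "SELECT Product_Id, SUM(Sales_Amt) FROM Sales GROUP BY Product_Id"))
# Query 3: column totals (one per market).
col_totals = dict(conn.execute(
    "SELECT Market_Id, SUM(Sales_Amt) FROM Sales GROUP BY Market_Id"))

# Assemble and print the cross-tab; missing cells are shown as 0.
markets = sorted(col_totals)
print("Product", *markets, "Total")
for p in sorted(row_totals):
    print(p, *(cells.get((m, p), 0) for m in markets), row_totals[p])
print("Total", *(col_totals[m] for m in markets), sum(col_totals.values()))
```

Queries 2 and 3 rescan the fact table even though their answers are sums over the cells returned by query 1, which is the redundancy that motivates a single combined operator.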
Definition of the CUBE Operator
• Doing these three queries is wasteful
  – The first does much of the work of the other two: if we could save that result and aggregate over Market_Id and Product_Id, we could compute the other queries more efficiently
• The CUBE clause is part of SQL:1999
  – GROUP BY CUBE (v1, v2, …, vn)
  – Equivalent to a collection of GROUP BYs, one for each of the 2^n subsets of v1, v2, …, vn

Example of CUBE Operator
• The following query returns all the information needed to make the previous products/markets table:
    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY CUBE (S.Market_Id, S.Product_Id)

ROLLUP
• ROLLUP is similar to CUBE except that instead of aggregating over all subsets of the arguments, it creates subsets moving from right to left
• GROUP BY ROLLUP (A1, A2, …, An) is a series of these aggregations:
  – GROUP BY A1, …, An-1, An
  – GROUP BY A1, …, An-1
  – ………
  – GROUP BY A1, A2
  – GROUP BY A1
  – No GROUP BY
• ROLLUP is also in SQL:1999

Example of ROLLUP Operator
    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY ROLLUP (S.Market_Id, S.Product_Id)
  – first aggregates with the finest granularity: GROUP BY S.Market_Id, S.Product_Id
  – then with the next level of granularity: GROUP BY S.Market_Id
  – then the grand total is computed with no GROUP BY clause

ROLLUP vs. CUBE
• The same query with CUBE:
  – first aggregates with the finest granularity: GROUP BY S.Market_Id, S.Product_Id
  – then with the next level of granularity: GROUP BY S.Market_Id and GROUP BY S.Product_Id
  – then the grand total with no GROUP BY

Materialized Views
The CUBE operator is often used to precompute aggregations on all dimensions of a fact table and then save them as materialized views to speed up future queries.

Issues
• Scalability
• Sparseness
• Curse of dimensionality
• Materialization of the multidimensional data cube (total, virtual, partial)
• Efficient computation of aggregations
• Indexing
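The expansion of CUBE into 2^n GROUP BYs, and of ROLLUP into n+1 of them, can be sketched by generating the grouping sets explicitly and running one ordinary GROUP BY per set. This is an emulation for illustration, not how an engine would implement the operators; it runs on Python's bundled sqlite3, which is used here on the assumption that the engine lacks GROUP BY CUBE and GROUP BY ROLLUP. The helper names and the data are hypothetical.

```python
import sqlite3
from itertools import combinations


def grouping_sets_cube(cols):
    """All 2^n subsets of cols, finest granularity first."""
    return [list(c) for k in range(len(cols), -1, -1)
            for c in combinations(cols, k)]


def grouping_sets_rollup(cols):
    """The n+1 prefixes of cols, dropping columns from right to left."""
    return [cols[:k] for k in range(len(cols), -1, -1)]


def aggregate(conn, cols, grouping_sets):
    """Run one GROUP BY per grouping set; None marks an aggregated-out column."""
    out = []
    for gs in grouping_sets:
        select = ", ".join(c if c in gs else "NULL" for c in cols)
        sql = f"SELECT {select}, SUM(Sales_Amt) FROM Sales"
        if gs:
            sql += " GROUP BY " + ", ".join(gs)
        out.extend(conn.execute(sql).fetchall())
    return out


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Market_Id TEXT, Product_Id TEXT, Sales_Amt INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", [
    ("M1", "P1", 15), ("M2", "P1", 7), ("M1", "P2", 3), ("M2", "P2", 4),
])

cols = ["Market_Id", "Product_Id"]
cube = aggregate(conn, cols, grouping_sets_cube(cols))
rollup = aggregate(conn, cols, grouping_sets_rollup(cols))

print(len(cube), "rows from the CUBE emulation")
print(len(rollup), "rows from the ROLLUP emulation")
```

With two columns the CUBE emulation issues four GROUP BYs and the ROLLUP emulation three; the per-product totals (rows with Market_Id aggregated out) appear only in the CUBE result. Saving either result in a table is one simple way to approximate the materialized view mentioned above.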