* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Chapter 16 of Database Design, Application Development
Survey
Document related concepts
Transcript
Chapter 16 Data Warehouse Technology and Management McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved. Outline Basic concepts and characteristics Business architectures and applications Data cube concepts and operators Relational DBMS features Maintaining a data warehouse 16-2 Comparison of Environments Transaction processing Uses operational databases Short-term decisions: fulfill orders, resolve complaints, provide staffing Decision support processing Uses integrated and summarized data Medium and long-term decisions: capacity planning, store locations, new lines of business 16-3 Definition and Characteristics A central repository for summarized and integrated data from operational databases and external data sources Key Characteristics Subject-oriented Integrated Time-variant Nonvolatile 16-4 Data Comparison Characteristic Data Warehouse Currency Operational Database Current Detail level Individual Orientation Process orientation Few Individual and Summary Subject orientation Number of records processed Normalization level Mostly normalized Update level Volatile Data model Relational Historical Thousands Mostly denormalized Nonvolatile (refreshed) Multidimensional and relational with star schemas 16-5 Architectures and Applications Data warehouse projects Top-down architectures Bottom-up architecture Applications and data mining 16-6 Data Warehouse Projects Large efforts with much coordination across departments Enterprise data model Important artifact of data warehouse project Structure of data model Meta data for data transformation Top-down vs. bottom-up business architectures 16-7 Two Tier Architecture Data warehouse server Operational database User departments Transformation process Operational database Detailed and summarized data External data source EDM Data warehouse 16-8 Three Tier Architecture Data warehouse server Operational database Transformation process Data mart tier User departments Extraction process Data mart Operational database Detailed and summarized data External data source EDM Data warehouse Data mart 16-9 Bottom-up Architecture Data mart tier User departments Operational database Transformation process Data mart Operational database External data source Data mart 16-10 Applications Industry Airline Key Applications Yield management, route assessment Telecommunications Customer retention, network design Insurance Risk assessment, product design, fraud detection Retail Target marketing, supply-chain management 16-11 Maturity Model Guidance for investment decisions Stages provide a framework to view an organization’s progress Insights: difficulty moving between stages Infant to child stages because of investment level Teenager to adult because of strategic importance of data warehouse 16-12 Data Mining Discover significant, implicit patterns Target promotions Change mix and collocation of items Requires large volumes of transaction data Important application for data warehouses 16-13 Data Cube Concepts and Operators Basics Dimension and measure details Operators 16-14 Data Cube Basics Multidimensional arrangement of data Users think about decision support data as data cubes Terminology Dimension: subject label for a row or column Member: value of dimension Measure: quantitative data stored in cells 16-15 Data Cube Example 110 60 25 Utah 40 90 50 30 Arizona 70 55 60 35 Washington 75 85 45 45 65 45 85 60 Location 80 California Colorado Mono Laser Ink Photo Jet Product Portable 12/31/2006 ….. 1/2/2006 1/1/2006 Time 16-16 Dimensions and Measures Dimensions Hierarchies: members can have sub members Sparsity: many cells do not have data Measures Derived measures Multiple measures in cells 16-17 Time Series Data Common data type in trend analysis Reduce dimensionality using time series Time series properties Data type Start date Calendar Periodicity Conversion 16-18 Slice Operator Focus on a subset of dimensions Set dimension to specific value: 1/1/2006 Location California Utah Arizona Washington Colorado Mono Laser 80 40 70 75 65 Product Ink Jet Photo 110 60 90 50 55 60 85 45 45 85 Portable 25 30 35 45 60 16-19 Dice Operator Location Focus on a subset of member values Replace dimension with a subset of values Dice operation often follows a slice operation Utah 40 90 50 30 Mono Ink Photo Portable Laser Jet Product 16-20 Other Operators Operators for hierarchical dimensions Drill-down: add detail to a dimension Roll-up: remove detail from a dimension Recalculate measure values Pivot: rearrange dimensions 16-21 Operator Summary Operator Purpose Description Slice Focus attention on a subset of dimensions Replace a dimension with a single member value or with a summary of its measure values Dice Focus attention on a subset of member values Replace a dimension with a subset of members Drill-down Obtain more detail about Navigate from a more general level a dimension to a more specific level Roll-up Summarize details about a dimension Navigate from a more specific level to a more general level Pivot Present data in a different order Rearrange the dimensions in a data cube 16-22 Relational DBMS Support Data modeling Dimension representation GROUP BY extensions Materialized views and query rewriting Storage structures and optimization 16-23 Relational Data Modeling Dimension table: contains member values Fact table: contains measure values 1-M relationships from dimension to fact tables Grain: most detailed measure values stored 16-24 Star Schema Example Store Item ItemId ItemName ItemUnitPrice ItemBrand ItemCategory StoreSales ItemSales Customer CustId CustName CustPhone CustStreet CustCity CustState CustZip CustNation Sales SalesNo SalesUnits SalesDollar SalesCost TimeDim TimeSales CustSales StoreId StoreManager StoreStreet StoreCity StoreState StoreZip StoreNation DivId DivName DivManager TimeNo TimeDay TimeMonth TimeQuarter TimeYear TimeDayOfWeek TimeFiscalYear 16-25 Constellation Schema Supplier SuppId SuppName SuppCity SuppState SuppZip SuppNation Inventory InvNo InvQOH InvCost InvReturns SuppInv StoreInv ItemInv Store Item ItemId ItemName ItemUnitPrice ItemBrand ItemCategory StoreSales ItemSales Customer CustId CustName CustPhone CustStreet CustCity CustState CustZip CustNation Sales SalesNo SalesUnits SalesDollar SalesCost TimeInv TimeDim TimeSales CustSales StoreId StoreManager StoreStreet StoreCity StoreState StoreZip StoreNation DivId DivName DivManager TimeNo TimeDay TimeMonth TimeQuarter TimeYear TimeDayOfWeek TimeFiscalYear 16-26 Snowflake Schema Example Item Store ItemId ItemName ItemUnitPrice ItemBrand ItemCategory StoreSales ItemSales Customer CustId CustName CustPhone CustStreet CustCity CustState CustZip CustNation Sales SalesNo SalesUnits SalesDollar SalesCost CustSales TimeSales StoreId StoreManager StoreStreet StoreCity StoreState StoreZip StoreNation Division DivStore DivId DivName DivManager TimeDim TimeNo TimeDay TimeMonth TimeQuarter TimeYear TimeDayOfWeek TimeFiscalYear 16-27 Handling M-N Relationships Source data may have M-N relationships, not 1-M relationships Adjust fact or dimension tables for a fixed number of exceptions More complex solutions to support M-N relationships with a variable number of connections 16-28 Time Representation Timestamp Time dimension table for organization specific calendar features Two fact tables for international operations Accumulating fact table for representation of multiple events 16-29 Level of Historical Integrity Primarily an issue for dimension updates Type I: overwrite old values Type II: version numbers for an unlimited history Type III: new columns for a limited history 16-30 Historical Integrity Example Type II Representation Type III Representation Customer Customer CustId VersionNo CustName CustPhone CustStreet CustCity CustCityBegEffDate CustCityEndEffDate CustState CustZip CustNation CustId CustName CustPhone CustStreet CustCityCurr CustCityCurrBegEffDate CustCityCurrEndEffDate CustCityPrev CustCityPrevBegEffDate CustCityPrevEndEffDate CustCityPast CustCityPastBegEffDate CustCityPastEndEffDate CustState CustZip CustNation 16-31 Dimension Representation Star schema and variations lack dimension representation Explicit dimension representation important to data cube operations and optimization Proprietary extensions for dimension representation Represent levels, hierarchies, and constraints 16-32 Oracle Dimension Representation Levels: dimension components Hierarchies: may have multiple hierarchies Constraints: functional dependency relationships 16-33 CREATE DIMENSION Example CREATE DIMENSION StoreDim LEVEL StoreId IS Store.StoreId LEVEL City IS Store.StoreCity LEVEL State IS Store.StoreState LEVEL Zip IS Store.StoreZip LEVEL Nation IS Store.StoreNation LEVEL DivId IS Division.DivId HIERARCHY CityRollup ( StoreId CHILD OF City CHILD OF State CHILD OF Nation ) HIERARCHY ZipRollup ( StoreId CHILD OF Zip CHILD OF State CHILD OF Nation ) HIERARCHY DivisionRollup ( StoreId CHILD OF DivId JOIN KEY Store.DivId REFERENCES DivId ) ATTRIBUTE DivId DETERMINES Division.DivName ATTRIBUTE DivId DETERMINES Division.DivManager ; 16-34 GROUP BY Extensions ROLLUP operator CUBE operator GROUPING SETS operator Other extensions Ranking Ratios Moving summary values 16-35 CUBE Example SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear = 2005 GROUP BY CUBE (StoreZip, TimeMonth) 16-36 ROLLUP Example SELECT TimeMonth, TimeYear, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear BETWEEN 2005 AND 2006 GROUP BY ROLLUP (TimeMonth,TimeYear); 16-37 GROUPING SETS Example SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear = 2005 GROUP BY GROUPING SETS((StoreZip, TimeMonth), StoreZip, TimeMonth, ()); 16-38 Variations of the Grouping Operators Partial cube Partial rollup Composite columns CUBE and ROLLUP inside a GROUPIING SETS operation 16-39 Materialized Views Stored view Periodically refreshed with source data Usually contain summary data Fast query response for summary data Appropriate in query dominant environments 16-40 Materialized View Example CREATE MATERIALIZED VIEW MV1 BUILD IMMEDIATE REFRESH COMPLETE ON DEMAND ENABLE QUERY REWRITE AS SELECT StoreState, TimeYear, SUM(SalesDollar) AS SUMDollar1 FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND TimeYear > 2003 GROUP BY StoreState, TimeYear; 16-41 Query Rewriting Substitution process Materialized view replaces references to fact and dimension tables in a query Query optimizer must evaluate whether the substitution will improve performance over the original query More complex than query modification process for traditional views 16-42 Query Rewriting Process QueryFD Results QueryMV Rewrite SQL Engine QueryFD: query that references fact and dimension tables QueryMV: rewrite of QueryFD such that materialized views are substituted for fact and dimension tables whenever justified by expected performance improvements. 16-43 Query Rewriting Matching Row conditions: query conditions at least as restrictive as MV conditions Grouping detail: query grouping columns at least as general as MV grouping columns Grouping dependencies: query columns must match or be derivable by functional dependencies Aggregate functions: query aggregate functions must match or be derivable from MV aggregate functions 16-44 Query Rewriting Example -- Data warehouse query SELECT StoreState, TimeYear, SUM(SalesDollar) FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND StoreNation IN ('USA','Canada') AND TimeYear = 2005 GROUP BY StoreState, TimeYear; -- Query Rewrite: replace Sales and Time tables with MV1 SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1 FROM MV1, Store WHERE MV1.StoreState = Store.StoreState AND TimeYear = 2005 AND StoreNation IN ('USA','Canada'); 16-45 Storage and Optimization Technologies MOLAP: direct storage and manipulation of data cubes ROLAP: relational extensions to support multidimensional data HOLAP: combine MOLAP and ROLAP storage engines 16-46 ROLAP Techniques Bitmap join indexes Star join optimization Query rewriting Summary storage advisors Parallel query execution 16-47 Maintaining a Data Warehouse Data sources Workflow representation Optimizing the refresh process 16-48 Data Sources Cooperative: Notification using triggers Requires source system changes Logged Readily available Extraneous data in logs Queryable Queries using timestamps Requires timestamps in source data Snapshot Periodic dumps of source data Significant processing for difference operations 16-49 Maintenance Workflow Notification Update Phase Propagation Auditing Integration Phase Merging Auditing Preparation Phase Cleaning Transportation Extraction 16-50 Data Quality Problems Multiple identifiers Multiple field names Different units Missing values Orphaned values Multipurpose fields Conflicting data Different update times 16-51 ETL Tools Extraction, Transformation, and Loading Specification based Eliminate custom coding Third party and DBMS based tools 16-52 Refresh Processing Unknown Processes External Data Sources Primarily Dimension Changes Load Time Lag Valid Time Lag ETL Tools Data Warehouse Internal Data Sources Accounting Fact and Dimension Changes 16-53 Determining the Refresh Frequency Maximize net refresh benefit Value of data timeliness Cost of refresh Satisfy data warehouse and source system constraints 16-54 Refresh Constraints Source access: restrictions on time and frequency Integration: restrictions that require concurrent reconciliation Completeness/consistency: loading in the same refresh period Availability: load scheduling restrictions due to storage capacity, online availability, and server usage 16-55 Summary Data warehouse requirements differ from transaction processing. Architecture choice is important. Multidimensional data model is intuitive Relational representation and storage techniques are significant. Maintaining a data warehouse is an important, operational problem. 16-56