OLAP Products, Challenges, and Related Technologies

Agenda
  - Basic Introduction and Overview
  - Thoughts on Models and Schemas
  - Product Challenges
  - RDBMS Focus on Warehousing
  - Parting Comments

Some Terms
  - Data Warehouse
      - The primary repository for reporting
      - Typically a secondary data source created with extracts from primary data sources
  - Datamarts
      - Smaller repositories created from the warehouse
      - Also small warehouses or summary tables
  - DSS
      - Reporting -- accomplished by any type of reporting tool or application
  - Business Intelligence
      - Same as above

Some More Terms
  - OLAP
      - Any type of “live” reporting, as opposed to batch
  - Query Tools
      - Require the user to write SQL; may be on-line or batch
  - OLAP Tools
      - Generate SQL for the user based on a semantic model and metadata
  - Datamining Tools
      - Stand-alone applications focused on high-end analytics
  - Difference between ROLAP and MOLAP
      - Amount of work done in the relational database. And that’s it...

The DSS Value Proposition
  - Banks and Financial Services
      - 4-7% churn rates
      - Fraud costs billions worldwide
      - 1% default rate = $500 million for $50 billion assets
      - Cross-selling products is worth tens of millions
  - Telecommunications
      - 25% churn rates; turn over entire customer base in 4 years
      - Typical Telco loses $100+ million/yr to churn
      - $300,000+/day in losses
  - Database Marketing
      - More targeted mail campaigns can save hundreds of millions
  - Retail
      - Inventory Management

The Basic Idea
  - Warehousing / Basic Reporting
      - Collect the data
          - Who purchased mutual funds in the last 3 years?
      - Analyze data
          - What is the income distribution of mutual fund buyers?
          - Who are my most profitable customers?
  - Advanced Reporting
      - Predict
          - What do customers buy in combination?
          - Who will buy a mutual fund in the next 6 months and why?

The DSS System Life Cycle
  - Customers “grow” into high end DSS
  - Most customers struggle to build the warehouse.
  - Once the warehouse is in place they progress fairly rapidly up the “DSS chain”
  - [Figure: the “DSS chain”, labels listed from the top: Closed Loop DSS, Data Mining, OLAP, “Actionable DSS”, Simple Reporting, Warehouse, Raw Data (OLTP, external data); the step between raw data and warehouse is labeled Extracts, Load, Transformation]
  - Customers opt to build Datamarts because of the “difficulty” of building a proper warehouse

So How Are Customers Doing?
  - Customers succeed when they know what they want to know.
      - Victoria’s Secret Inventory Management
      - Wal*Mart Pharmacy
  - Many customers “fail / struggle” at first.
      - Poor source information
      - Distracted by technology
          - Database benchmarks?
      - Internal politics
      - Unrealistic scope/timeframes
      - Attempt to implement inappropriate technology
          - Database gateways?

Types of Reporting / OLAP Strategies
  - MD OLAP
      - Extract data into a cube or MD database and perform analysis on extracted data
  - Datamarts
      - Extract data into a smaller relational repository and perform analysis on the datamart, using some SQL based tool
  - Structured ROLAP
      - Build a schema tailored to a ROLAP tool and perform analysis on the structured schema using a SQL based tool
  - Flexible ROLAP
      - Perform analysis directly on the warehouse
      - Requires an intelligent SQL engine which is used for more than simple extractions

Different Approaches to DSS
  - [Figure: diagram relating MOLAP/HOLAP (MD API, MDDB) and ROLAP (SQL) to the Structured Schema, Datamart, Warehouse, and Raw Data (OLTP, external data)]

Models & Schemas
  - Star -- Structured
      - A MDDB stored relationally
  - Snowflake -- Normalized
  - Dimensional Modeling?
      - Process of putting a semantic object layer over the physical schema
      - Semantic model typically includes dimensions; attributes or levels; facts, metrics or measures; and possibly other objects
      - Wide degree of variance in products on how closely the physical structure must resemble the logical presentation layer.
      - Terrible “new” name for an old concept
  - The Real World?
      - TPC-D as a good example

The Original Star Schema (“A relational cube”)
  Lookup_Product:   product_key, item_name, class_name, department_name, division_name, level
  Lookup_Geography: geo_key, store_name, market_name, region_name, level
  Lookup_Time:      time_key, day, month_name, year, level
  Fact_Sales:       product_key, geo_key, time_key, reg_sls_unit, reg_sls_dollar, cle_sls_unit, cle_sls_dollar, pml_sls_unit, pml_sls_dollar, pln_sls_unit, pln_sls_dollar

The Snowflake Schema
  Lookup_year:      year
  Lookup_month:     month_id, month_name, year
  Lookup_day:       day, month_id
  Lookup_division:  division_id, division_name
  Lookup_dept:      department_id, department_name, division_id
  Lookup_class:     class_id, class_name, department_id
  Lookup_item:      item_id, item_name, class_id
  Lookup_region:    region_id, region_name
  Lookup_market:    market_id, market_name, region_id
  Lookup_store:     store_id, store_name, market_id
  Fact_Sales:       item_id, store_id, day, reg_sls_unit, reg_sls_dollar, cle_sls_unit, cle_sls_dollar, pml_sls_unit, pml_sls_dollar, pln_sls_unit, pln_sls_dollar

The Star / Snowflake Schema?
  Lookup_year:      year
  Lookup_month:     month_id, month_name, year
  Lookup_day:       day, month_id, month_name, year
  Lookup_division:  division_id, division_name
  Lookup_dept:      department_id, department_name, division_id
  Lookup_class:     class_id, class_name, department_id
  Lookup_item:      item_id, item_name, class_id, class_name, department_id, department_name, division_id, division_name
  Lookup_region:    region_id, region_name
  Lookup_market:    market_id, market_name, region_id
  Lookup_store:     store_id, store_name, market_id, market_name, region_id, region_name
  Fact_Sales:       item_id, store_id, day, reg_sls_unit, reg_sls_dollar, cle_sls_unit, cle_sls_dollar, pml_sls_unit, pml_sls_dollar, pln_sls_unit, pln_sls_dollar

A Real Schema: TPC-D for example

Dimensional Model for TPC-D
  Supp Region, Cust Region:  Region Key, Name, Comment
  Supp Nation, Cust Nation:  Nation Key, Name, Region Key, Comment
  Customer:    Cust Key, Name, Address, Nation Key, Phone, Acct Bal, Mkt Segment, Comment
  Supplier:    Supp Key, Name, Address, Nation Key, Phone, Acct Bal, Comment
  Part:        Part Key, Name, MFGR, Brand, Type, Size, Container, Retail Price, Comment
  Part Supp:   Part Key, Supp Key, Avail Qty, Supply Cost, Comment
  Order:       Order Key, Cust Key, Order Status, Total Price, Order Date, Order Priority, Clerk, Ship Priority, Comment
  Line Item:   Order Key, Part Key, Supp Key, Line Number, Quantity, Extend Price, Discount, Tax, Return Flag, Line Status, Ship Date, Commit Date, Receipt Date, Ship Instruct, Ship Mode, Comment
  Order Time, Ship Time, Commit Time, Receipt Time:  Time Key, Alpha, Year, Month, Week, Day

Logical Business Model for the Order Dimension
  [Figure: logical model laid over the Cust Region, Cust Nation, Customer, Order, Line Item, and Order Time tables listed above. Attributes shown for Order: Cust Region, Cust Nation, Customer, Mkt Segment, Clerk, Order Time, Order Date, Order Priority, Order Status, Ship Priority. Attributes shown for Line Item: Receipt Date, Commit Date, Ship Date, Part Key, Supp Key, Ship Mode, Ship Instruct, Return Flag, Line Status.]

Modeling Conclusions
  - Seem complicated?
  - Modeling data is fairly simple if the data and the capabilities/requirements of the tools are well understood.
  - Not all tools are created equal, so many data transformations must often occur to achieve the desired results.
  - Real world data is rarely “cube” or “star” like.
  - Caveat...
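The difference between the snowflake layout and the hybrid star / snowflake layout shown earlier can be sketched with a small runnable example. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module, trims Fact_Sales to one measure, invents all rows, and names the widened item lookup Lookup_item_wide (a hypothetical name, chosen so both layouts can coexist in one database).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake layout: one lookup table per attribute of the product dimension.
CREATE TABLE Lookup_division (division_id INTEGER PRIMARY KEY, division_name TEXT);
CREATE TABLE Lookup_dept (department_id INTEGER PRIMARY KEY, department_name TEXT,
                          division_id INTEGER);
CREATE TABLE Lookup_class (class_id INTEGER PRIMARY KEY, class_name TEXT,
                           department_id INTEGER);
CREATE TABLE Lookup_item (item_id INTEGER PRIMARY KEY, item_name TEXT, class_id INTEGER);
CREATE TABLE Fact_Sales (item_id INTEGER, store_id INTEGER, day TEXT, reg_sls_dollar REAL);

-- Invented sample rows, for illustration only.
INSERT INTO Lookup_division VALUES (1, 'Apparel'), (2, 'Home');
INSERT INTO Lookup_dept VALUES (10, 'Menswear', 1), (20, 'Kitchen', 2);
INSERT INTO Lookup_class VALUES (100, 'Shirts', 10), (200, 'Cookware', 20);
INSERT INTO Lookup_item VALUES (1, 'Oxford Shirt', 100), (2, 'Polo Shirt', 100),
                               (3, 'Skillet', 200);
INSERT INTO Fact_Sales VALUES (1, 101, '1999-01-01', 50.0), (2, 101, '1999-01-01', 30.0),
                              (3, 102, '1999-01-01', 20.0), (1, 102, '1999-01-02', 25.0);

-- Hybrid layout: the item lookup also carries its parents' ids and names.
CREATE TABLE Lookup_item_wide AS
SELECT i.item_id, i.item_name, c.class_id, c.class_name,
       d.department_id, d.department_name, v.division_id, v.division_name
FROM Lookup_item AS i
JOIN Lookup_class AS c ON c.class_id = i.class_id
JOIN Lookup_dept AS d ON d.department_id = c.department_id
JOIN Lookup_division AS v ON v.division_id = d.division_id;
""")

# Snowflake: reaching division_name from the fact table takes four joins.
snowflake_rows = conn.execute("""
    SELECT v.division_name, SUM(f.reg_sls_dollar)
    FROM Fact_Sales AS f
    JOIN Lookup_item AS i ON i.item_id = f.item_id
    JOIN Lookup_class AS c ON c.class_id = i.class_id
    JOIN Lookup_dept AS d ON d.department_id = c.department_id
    JOIN Lookup_division AS v ON v.division_id = d.division_id
    GROUP BY v.division_name
    ORDER BY v.division_name
""").fetchall()

# Hybrid: the same question needs a single join.
hybrid_rows = conn.execute("""
    SELECT w.division_name, SUM(f.reg_sls_dollar)
    FROM Fact_Sales AS f
    JOIN Lookup_item_wide AS w ON w.item_id = f.item_id
    GROUP BY w.division_name
    ORDER BY w.division_name
""").fetchall()

print(snowflake_rows)
print(hybrid_rows)
```

Both queries return the same division totals; the trade-off is fewer joins at query time against redundant parent columns that must be kept in step with the normalized lookups.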
Industry Benchmarks - Comparison
  There are interesting differences between the two DSS benchmarks, APB-1 and TPC-D:
  - APB-1 (built by the OLAP Council)
      - Is a basic budgeting application
      - Contains no many-to-many relationships
      - Contains “clean” dimensions
      - Is very “star” and “cube” like
  - TPC-D (built by the RDBMS community)
      - Is a basic order entry system
      - Contains facts with different dimensional keys
      - Is relatively normalized
      - Contains cross-dimensional attribute relationships
      - Contains “table-less” dimensions

The Tough Problems
  - Handling Large Volumes
  - Working with Complex / Varied Data Structures
  - Performing Advanced Calculations -- Efficiently
  - Also called “Depth - Breadth - Reporting Range”

State of Technology
  - Vendors - Database (good)
      - Database engines add more scalability and flexibility
  - Vendors - OLAP (bad)
      - Continue to focus on making simple problems simpler
      - Basing solutions on too many assumptions
      - Working to confuse market -- ROLAP, HOLAP, MOLAP
      - Working with inherently limited architectures
      - Not utilizing underlying RDBMS capabilities
      - Working within fixed database schemas
  - Net Result (still much room for improvement)
      - Vendors failing to solve customers’ true needs
      - The market is pushing back to datamarts
      - Market is living with simpler reporting -- which may not be bad

Large Systems? >> 50-100+ Gigabytes of Raw Data
  - Customers
      - Want central data warehouse
      - Find large systems difficult to build and maintain
      - Have data in a variety of structures (table formats)
  - OLAP Vendors
      - Advocate storing subsets of data in different structures
      - Build proprietary MDDB
      - Push for datamarts
  - ROLAP / RDBMS Vendors
      - Push for less data restructuring
      - Design for less data movement
  - Fortunately, with advances in RDBMS technology, ROLAP is increasingly recognized as the best approach

Key Enhancements - System
  - Better support for large systems
  - Partitioning (a.k.a. segmentation, a.k.a. fragmentation)
  - Hash and bitmap joins, hash and bitmap index technology
  - Parallel and clustered processing

Key Enhancements -- Function
  - Temporary table support
  - Derived table support
  - Outer Joins
  - “OLAP” functions

OLAP Demands On An RDBMS
  - Ability to efficiently perform
      - Joins and Aggregations
      - Row restrictions -- filters
  - Ability to generate Counts and Sums
  - Ability to perform iterative calculations and filters
      - Temporary Tables
      - Derived Tables or Table Expressions
  - The above gets you 80 percent of the way there, except for Ranks, Cumulative Sums, Moving Sums

Example Analysis
  - DSS Question
      - “Show me customer revenue & customers’ percent contribution (customer rev / total rev), only for those customers who contributed more than 1% to total revenue”
  - Popular OLAP Approach
      - Fetch revenue data for each Customer into OLAP Server
      - Calculate percent to total revenue for each Customer
      - Restrict result set to those Customers whose Contribution is greater than 1%

Example Analysis (2)
  Pure ROLAP Approach

    select Customer, Sum(Revenue) as REV
    into Temp1
    from Customer_Fact
    group by Customer

    select Sum(REV) as TOT_REV
    into Temp2
    from Temp1

    select Temp1.Customer, Temp1.REV, Temp1.REV/Temp2.TOT_REV as CONT
    from Temp1, Temp2
    where Temp1.REV/Temp2.TOT_REV > .01

  SQL Extensions
  - Temporary Table -- Declared Local Tables (ANSI ‘92)
  - Derived Tables -- Selects in FROM clause (ANSI ‘92)

Example Analysis (3)
  ROLAP using Table Expressions

    select Temp1.Customer, Temp1.REV, Temp1.REV/Temp2.TOT_REV as CONT
    from (select Customer, Sum(Revenue) as REV
          from Customer_Fact
          group by Customer) as Temp1,
         (select Sum(Revenue) as TOT_REV
          from Customer_Fact) as Temp2
    where Temp1.REV/Temp2.TOT_REV > .01

  Either implementation is known as Multi-pass SQL

So What About the Other 20%?
  - How do you calculate Ranking, Moving Sums, and Cumulative Sums?
  - Currently OLAP tools must do this on their own.
  - RDBMS vendors are beginning to add support for this.
  - Teradata and Red Brick have commercial implementations.
  - Proposal put forth by Oracle and IBM for ANSI SQL ‘99. (just approved)

Aggregate Navigation
  - Aggregate Navigation involves two parts
      - Materialized View support
      - Query rewrite capabilities
  - Materialized Views
      - A “Summary Table” defined as a view
      - Additional properties telling the database how to update the view
      - An advanced type of index
  - Query Rewrite
      - The ability for the optimizer to redirect a query to a “higher” materialized view based on group by and where clause evaluation

Query Rewrite Example
  Base Table:         Base_Sales (store_id, sls_unit, sls_dollar)
  Materialized View:  Aggregate_Sales (region_id, sls_unit, sls_dollar)
  Query against the base table:
    Select Region, Sum(Sls_unit) from Base_Sales group by Region
  Query against the materialized view:
    Select Region, Sls_unit from Aggregate_Sales

OLAP & RDBMS
  - How does all this affect OLAP tools?
      - As RDBMS vendors add more functionality, OLAP tools must become smarter in terms of generating SQL
      - DOLAP does not replace OLAP tools; the tools must work together more intelligently
      - It lessens the appeal of MD OLAP solutions

Database Technologies
  Who are the “leaders”? By market share...
  - International Data Corp -- $9.7 billion market
      - 40.4% Oracle with $3.93 billion
      - 17.8% UDB with $1.73 billion
      - 5.7% Informix
      - 5.1% Microsoft
      - 4.4% Sybase
  - Dataquest Inc
      - 32.3% UDB
      - 29.3% Oracle
      - 10.2% Microsoft
      - 4.4% Informix
      - 3.5% Sybase

Database Technologies (cont.)
  Who are the “leaders”? By benchmarks...
  - See TPC-D results
  Another Tangent -- Database Gateways?
  - Cohera, ISG Navigator, IBM’s Data Joiner

OLAP Technologies
  Who are the “leaders”? By Market Share
  - “The OLAP Report” - $2+ billion
      - 34% Hyperion Solutions Inc. (merged with Arbor)
      - 17% Oracle Express (slipping from 21%)
      - 9.6% Cognos (slipping slightly)
      - 6.4% MicroStrategy (up from 4.5% and rising)

Parting Comments
  - Customers need...
  - Sophisticated problems
  - More flexible OLAP tools
  - RDBMS optimized for DSS
  - Limitations of Multidimensional Model?
  - Large volumes
  - Schema support
  - Management of the environment
  - True ROLAP calculations -- minimize data movement
  - HOLAP is not necessarily ROLAP
  - DSS is becoming mission critical
  - Systems need to ensure success and availability.

The End

Some further detail on schemas

A Sample Star Lookup Table
  Star #1: Lookup_Geography (geo_key, geo_name, store_id, market_id, region_id, level)

  GEO_KEY  GEO_NAME      STORE_ID  MARKET_ID  REGION_ID  LEVEL
  1001     Boston        101       20         1          1
  1002     Greenwich     102       20         1          1
  1003     Providence    103       20         1          1
  1004     Baltimore     104       10         1          1
  1005     Philadelphia  105       10         1          1
  1006     Charlotte     106       30         2          1
  1007     Durham        107       30         2          1
  1008     Greenville    108       30         2          1
  1009     Atlanta       109       40         2          1
  1010     Fayetteville  110       40         2          1
  1011     Mid-Atlantic            10         1          2
  1012     New England             20         1          2
  1013     Carolinas               30         2          2
  1014     Deep South              40         2          2
  1015     Northeast                          1          3
  1016     South                              2          3

  - Star Schema lookup tables hold all of the elements within a dimension in one physical lookup table.
  - Each dimensional lookup table has a single-column primary key that is unique within the dimension, regardless of the attribute.
  - Each dimension lookup table includes a ‘level’ field which indicates the attribute level.

Original Star Fact Tables
  Fact_Sales: product_key, geo_key, time_key, reg_sls_unit, reg_sls_dollar, cle_sls_unit, cle_sls_dollar, pml_sls_unit, pml_sls_dollar, pln_sls_unit, pln_sls_dollar

  Two types of Star Schemas
  - Atomic data only
      - Base tables contain only one level of data (per table). No ‘in-table’ aggregation.
  - Consolidated
      - Base tables contain base table data as well as aggregate data for every possible level of aggregation
  ‘In-table’ aggregation = storing aggregate data in the same table as atomic level data, for example, storing store, market, and region level information within the same fact table.
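The consolidated star described above can be sketched in a few lines. This is a hedged illustration, not the deck's own implementation: it uses Python's built-in sqlite3 module, loads the Lookup_Geography rows from the sample table, trims Fact_Sales to a single measure with no product or time keys, and assumes an invented 10 dollars of sales per store. The helper name sales_at_level is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Lookup_Geography (
    geo_key INTEGER PRIMARY KEY, geo_name TEXT,
    store_id INTEGER, market_id INTEGER, region_id INTEGER, level INTEGER)""")

# Rows from the sample star lookup table: level 1 = store, 2 = market, 3 = region.
geography = [
    (1001, "Boston", 101, 20, 1, 1), (1002, "Greenwich", 102, 20, 1, 1),
    (1003, "Providence", 103, 20, 1, 1), (1004, "Baltimore", 104, 10, 1, 1),
    (1005, "Philadelphia", 105, 10, 1, 1), (1006, "Charlotte", 106, 30, 2, 1),
    (1007, "Durham", 107, 30, 2, 1), (1008, "Greenville", 108, 30, 2, 1),
    (1009, "Atlanta", 109, 40, 2, 1), (1010, "Fayetteville", 110, 40, 2, 1),
    (1011, "Mid-Atlantic", None, 10, 1, 2), (1012, "New England", None, 20, 1, 2),
    (1013, "Carolinas", None, 30, 2, 2), (1014, "Deep South", None, 40, 2, 2),
    (1015, "Northeast", None, None, 1, 3), (1016, "South", None, None, 2, 3),
]
conn.executemany("INSERT INTO Lookup_Geography VALUES (?, ?, ?, ?, ?, ?)", geography)

# Simplified fact table; the real Fact_Sales also has product and time keys.
conn.execute("CREATE TABLE Fact_Sales (geo_key INTEGER, reg_sls_dollar REAL)")

# Atomic data: one row per store (the 10.0 per store is invented).
conn.execute("""INSERT INTO Fact_Sales
                SELECT geo_key, 10.0 FROM Lookup_Geography WHERE level = 1""")

# 'In-table' aggregation: add market-level and region-level rows to the same table,
# always summing from the store-level (level 1) rows only.
for level, key_column in ((2, "market_id"), (3, "region_id")):
    conn.execute(f"""
        INSERT INTO Fact_Sales
        SELECT agg.geo_key, SUM(f.reg_sls_dollar)
        FROM Lookup_Geography AS agg
        JOIN Lookup_Geography AS store
          ON store.level = 1 AND store.{key_column} = agg.{key_column}
        JOIN Fact_Sales AS f ON f.geo_key = store.geo_key
        WHERE agg.level = ?
        GROUP BY agg.geo_key""", (level,))

def sales_at_level(level):
    """Read pre-aggregated sales for one attribute level; no GROUP BY needed."""
    return conn.execute("""
        SELECT g.geo_name, f.reg_sls_dollar
        FROM Fact_Sales AS f
        JOIN Lookup_Geography AS g ON g.geo_key = f.geo_key
        WHERE g.level = ?
        ORDER BY g.geo_key""", (level,)).fetchall()

market_sales = sales_at_level(2)
region_sales = sales_at_level(3)
print(market_sales)
print(region_sales)
```

One consequence of this design is that every query must filter on level: summing the whole consolidated table would count each sale once per level.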
A Sample Snowflake Lookup
  Lookup_store (store_id, store_name, market_id, region_id)
  Lookup_market (market_id, market_name, region_id)
  Lookup_region (region_id, region_name)

  Lookup_region
  REGION_ID  REGION_NAME
  1          Northeast
  2          South

  Lookup_market
  MARKET_ID  MARKET_NAME   REGION_ID
  10         Mid-Atlantic  1
  20         New England   1
  30         Carolinas     2
  40         Deep South    2

  Lookup_store
  STORE_ID  STORE_NAME    MARKET_ID  REGION_ID
  101       Boston        20         1
  102       Greenwich     20         1
  103       Providence    20         1
  104       Baltimore     10         1
  105       Philadelphia  10         1
  106       Charlotte     30         2
  107       Durham        30         2
  108       Greenville    30         2
  109       Atlanta       40         2
  110       Fayetteville  40         2

  - The snowflake design typically has one physical lookup table per attribute, with each attribute identified by a unique key and having its own description column.
  - Attributes are related to each other by including foreign key columns in attribute lookup tables, as region_id is stored in the Lookup_market table.
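A minimal sketch of how these snowflake lookups are used at query time follows. It assumes Python's built-in sqlite3 module, the lookup rows from the sample tables, and an invented Fact_Sales with one measure in which each store sells (store_id - 100) units. It rolls store-level facts up to region twice: once by walking the foreign keys store, market, region, and once through the region_id that Lookup_store also carries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Lookup_region (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE Lookup_market (market_id INTEGER PRIMARY KEY, market_name TEXT,
                            region_id INTEGER REFERENCES Lookup_region(region_id));
CREATE TABLE Lookup_store (store_id INTEGER PRIMARY KEY, store_name TEXT,
                           market_id INTEGER REFERENCES Lookup_market(market_id),
                           region_id INTEGER REFERENCES Lookup_region(region_id));
-- Simplified fact table with a single measure (an assumption for brevity).
CREATE TABLE Fact_Sales (store_id INTEGER, reg_sls_unit INTEGER);
""")

conn.executemany("INSERT INTO Lookup_region VALUES (?, ?)",
                 [(1, "Northeast"), (2, "South")])
conn.executemany("INSERT INTO Lookup_market VALUES (?, ?, ?)",
                 [(10, "Mid-Atlantic", 1), (20, "New England", 1),
                  (30, "Carolinas", 2), (40, "Deep South", 2)])
conn.executemany("INSERT INTO Lookup_store VALUES (?, ?, ?, ?)", [
    (101, "Boston", 20, 1), (102, "Greenwich", 20, 1), (103, "Providence", 20, 1),
    (104, "Baltimore", 10, 1), (105, "Philadelphia", 10, 1),
    (106, "Charlotte", 30, 2), (107, "Durham", 30, 2), (108, "Greenville", 30, 2),
    (109, "Atlanta", 40, 2), (110, "Fayetteville", 40, 2),
])

# Invented facts: store 101 sells 1 unit, store 102 sells 2 units, and so on.
conn.execute("INSERT INTO Fact_Sales SELECT store_id, store_id - 100 FROM Lookup_store")

# Roll up by walking the foreign keys: store -> market -> region.
via_market = conn.execute("""
    SELECT r.region_name, SUM(f.reg_sls_unit)
    FROM Fact_Sales AS f
    JOIN Lookup_store AS s ON s.store_id = f.store_id
    JOIN Lookup_market AS m ON m.market_id = s.market_id
    JOIN Lookup_region AS r ON r.region_id = m.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()

# Roll up using the region_id stored directly in Lookup_store (one join fewer).
via_store = conn.execute("""
    SELECT r.region_name, SUM(f.reg_sls_unit)
    FROM Fact_Sales AS f
    JOIN Lookup_store AS s ON s.store_id = f.store_id
    JOIN Lookup_region AS r ON r.region_id = s.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()

print(via_market)
print(via_store)
```

Both paths give the same totals as long as the region_id in Lookup_store agrees with the one reached through Lookup_market, which is the consistency cost of carrying the extra foreign key.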