Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
An overview of Data Warehousing and OLAP Technology Original slides by: Manish Desai Modified and presented by Alice Leung Introduction • Essential elements of decision support • Enables The Knowledge Worker to make better and faster decisions • Used in many industries like: – Manufacturing (for order shipment) – Retail (for inventory management) – Financial Services (claims and risk analysis) • Every major database vendor offers product in this area 2 What is Data Warehouse ? • A data warehouse is a “subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making” • Typically maintained separately from operational databases 3 Explanation of definition • Subject-Oriented: – Designed around subject such as customer, vendor, product and activity – Does not includes data that are not needed for Decision support system (DSS) • Integrated: – Most important feature – Consistent naming convention, measurement of variables and so forth – The data should be stored in single globally acceptable fashion 4 Explanation (continues…) • Time Varying: – All data in the warehouse should be accurate as of some moment in time – Data stored over a long time horizon (5 –10 years) – Key structure contains element of time (implicitly or explicitly) – Data once correctly recorded cant be updated • Non Volatile: – No Update of data allowed – only loading and access of data operations 5 Why Separate Data Warehouse? • High performance for both systems – DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery – Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation • Different functions and different data: – missing data: Decision support requires historical data which operational DBs do not typically maintain – data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources – data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled April 29, 2017 Data Mining: Concepts and Techniques 6 6 Data Warehouse Vs. Operational Database Data Warehouse Operational Database User Knowledge worker (Executive, manager, analyst) Clerk, IT professional Function Decision support Day to day operations Data Historical,summarized, multidimensional, integrated Current, up-to-date, detailed Unit of work Complex query Short, simple transaction DB Design Subject-oriented Application-oriented Metric Query throughout, response Transaction throughput 7 Tiered Architecture External Sources Operational Databases Tier1: Data Warehouse Server Extract Transform Load Refresh Data Warehouse Tier2: OLAP Server OLAP Server Serve Tier3: Client s Analysis Query/Reports Data mining Data Marts Data Sources Data Storage OLAP Engine Front-End Tools 8 Architecture (continues…) • Distributed Data warehouse – Load balancing, scalability,higher availability – Meta data replicated and centrally administrated – Too expansive • Data marts – Departmental subset focused on selected subjects – example: marketing department includes customer, sales and product tables – Has own repository and administration – May lead to complex integration problems if not designed properly 9 Back end tools and Utilities • Data cleaning, loading, refreshing tools • Cleaning – Multiple source, possibility of errors – Example: replace string sex by gender • Loading – Building indices, sorting and making access paths – Large amount of data • Incremental loading • Only updated tuples are inserted ,Process hard to manage • Refresh – Propagating updates – When to refresh ? – Set by administrator depending on user needs and traffic 10 Conceptual Model and front end tools • Multi dimensional view – – – – Dimensions together uniquely determine the measure Example: Sales can be represented as city,product, data Each dimension is described by set of attribute Example: product consist of • Category of product • Industry of product • Year of introduction • Front end tools – Multi dimensional spreadsheet • Supports Pivoting-reorientation • Roll_up - summarized data • Drill_down - go from high level to low level summary 11 Conceptual Model Date 1 2 3 4 sum U.S.A Canada Mexico Country TV PC PVR sum Total annual sales of TV in U.S.A. sum ALL 12 Database design • Two ways to represent Multi dimensional model – Star schema • Database consist of single fact table and single table for each dimension • Each tuples in fact table consist of pointer to each of dimension – Snowflake schema • Refinement over star schema • Dimensional hierarchy is explicitly represented by normalizing dimension tables 13 Star Schema Time T_key T_day T_day_week T_month T_quarter T_year Sales Fact Table Time_key Item_key Branch Branch_key B_key B_name B_type Measures Location_key Units_sold Dollars_sold Avg_sales item I_key I_name I_brand I_type I_supplier_type location location_key street city province country Star Schema 14 Snowflake Schema Time T_key T_day T_day_week T_month T_quarter T_year Sales Fact Table time_key item_key Branch branch_key B_key B_name B_type Measures location_key Item I_key I_name I_brand I_type I_supplier_type units_sold Location dollars_sold location_key street city avg_sales Supplier S_key S_type City C_key C_city C_province C_country Snowflake Schema 15 Warehouse Servers • Specialized SQL servers – Provides advanced query language and query processing support for SQL queries over star and snowflake schemas – Example: Redbrick • ROLAP – Between relational back end and client front end tools – Extend traditional relational servers to support multidimensional queries – Example: Microstratergy • MOLAP – Multidimensional storage engine – Direct mapping – Example: Essbase from Arbor Inc. 16 Index structures • Bit map indices – Use single bit to indicate specific value of attribute – Example: instead of storing eight characters to record “engineer” as skill of employee use single bit id# Name Skill 1000 John 1 • Join indices – Maintains the relationship between foreign key with its matching primary keys 17 Meta data and warehouse management • Its data about data • Used for building, maintain, managing and using data warehouse • Administrative meta data – Information about setting up and using warehouse • Business meta data – Business terms and definition • Operational meta data – Information collected during operation of warehouse 18