Data Warehouse & OLAP Technology
CS 5331 by Rattikorn Hewett
Texas Tech University

Outline
  • What is a data warehouse?
  • A multi-dimensional data model
  • Data warehouse design & architecture
  • Data warehouse implementation
  • From data warehousing to data mining

Motivation
  In business domains:
  • Large amounts of business data
  • Data are in different formats and locations (heterogeneous vs. distributed)
  • Decision makers need fast access to summarized information, usually at a single site

Data Warehouses
  • Loosely speaking, a data warehouse is a data store that
      • integrates data from multiple heterogeneous sources to support decision making
      • is maintained separately from operational databases
  • A data warehouse system provides a solid platform of consolidated data for on-line analytical processing (OLAP) and data analysis
  • Data warehousing is the process of constructing and using data warehouses

What is a data warehouse?
  More precisely, a data warehouse (DW) is a
  • subject-oriented,
  • integrated,
  • time-variant, and
  • non-volatile (i.e., stable)
  collection of data in support of management's decision-making process [Inmon, 96]

Other Related Terms
  • Data Warehouse: enterprise-based
      • Concerned with decision subjects of the whole enterprise or organization
  • Data Mart: department-based
      • Specialized single-line-of-business warehouses, e.g., within departments or groups of people

Subject-oriented
  A data warehouse:
  • Is organized around subjects of interest, e.g., customer, product, sales
  • Focuses on the modeling/analysis of data for decision makers, not on daily operations or transaction processing
  • Gives a simple and concise view of relevant issues by excluding data that are not useful for decision support

Integrated
  • A data warehouse integrates multiple heterogeneous data sources: relational databases, on-line transaction records
  • Moving (or converting) data to the warehouse requires data cleaning and data integration techniques to ensure consistency among different data sources in naming conventions, encoding structures, attribute measures, etc.
      • E.g., hotel price: currency, tax, breakfast covered, etc.

Time-variant
  • Key structures in data warehouses contain "time" information, explicitly or implicitly
      • Not necessary in operational databases
  • Data warehouses give a historical perspective (e.g., past 5-10 years)
      • The operational time period is shorter (e.g., current data)

Non-volatile
  A data warehouse
  • is a physically separate store of data transformed from the operational environment
  • does not require transaction processing, recovery, and concurrency control mechanisms
  • requires only two operations in data accessing:
      • initial loading of data
      • access of data

Data Warehouse vs. Heterogeneous DB
  • Traditional heterogeneous DB integration: query-driven
      • Builds wrappers/mediators on top of heterogeneous DBs
      • Requires complex information filtering and integration processes
  [Figure: a client query site sends a query to a mediator with a metadata dictionary; local query processors 1..n at heterogeneous DB sites 1..n return results that are combined into a global answer]
  • Data warehouse: update-driven
      • Information from multiple heterogeneous sources is integrated in advance and stored for query
      • More efficient than heterogeneous DB integration, but data (stored over a longer period) are not as current

DW systems vs. Operational DBMS
  • OLTP (on-line transaction processing)
      • Major task of operational (traditional relational) DBMS
      • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
      • OLTP operations:
          • Are structured, repetitive, short, atomic and isolated transactions
          • Require detailed and up-to-date data
  • OLAP (on-line analytical processing)
      • Major task of data warehouse systems
      • Serves knowledge workers: data analysis and decision making
      • OLAP operations:
          • View data flexibly from different perspectives and abstractions
          • Require historical data with time elements
OLTP vs. OLAP

                      OLTP                          OLAP
  users               clerk, IT professional        knowledge worker
  function            day-to-day operations         decision support
  DB design           application-oriented          subject-oriented
  data                current, up-to-date,          historical, summarized,
                      detailed, flat relational,    multidimensional,
                      isolated                      integrated, consolidated
  usage               repetitive                    ad-hoc
  access              read/write,                   lots of scans
                      index/hash on prim. key
  unit of work        short, simple transaction     complex query
  # records accessed  tens                          millions
  # users             thousands                     hundreds
  DB size             100MB-GB                      100GB-TB
  metric              transaction throughput        query throughput, response

DW systems vs. Operational DBMS
  Distinct features (OLTP vs. OLAP):
  • User/system orientation: customer vs. market
  • Data contents: current, detailed vs. historical, consolidated data
  • Design: ER & application-oriented vs. star & subject-oriented
  • View: current, local vs. evolutionary, integrated
  • Access patterns: update vs. read-only but complex queries

Why Separate Data Warehouse?
  • High performance is desired in both DBMS and DW systems
      • DBMS, tuned for OLTP: indexing, searching for certain records, "canned" queries
      • DW, tuned for OLAP: complex OLAP queries, special organization, multidimensional view, consolidation
      • OLAP necessitates a special data organization that supports multidimensional views
      • If mixed, OLAP queries would degrade the operational DBMS
  • Different functions, contents and uses in the two systems
      • DBMS: transactional operations, requires concurrency control (to ensure data consistency)
          • OLAP is read-only and does not need concurrency control
          • Concurrency control on OLAP's operations would jeopardize execution of concurrent transactions in the DBMS
      • DW: decision support, requires consolidation of data from heterogeneous sources (for aggregation etc.) and historical data
          • A DBMS does not maintain historical data, nor does it facilitate data consolidation

Data Cubes
  • A data warehouse is based on a multi-dimensional data model, which views data in the form of a data cube
  • Example: a data cube with three dimensions
      • item (type), time (quarter), location (city)
      • with measure: Dollars_sold (in thousands)
  [Figure: 3-D cube with axes Item (TV, VCR, oven, phone), Time (Q1-Q4) and Location (city)]
      • Dollars_sold of (phone, Q1, Vancouver) = 400
      • Dollars_sold of (TV, Q1) = 650
      • Dollars_sold of (Q3)

Lattice of Cuboids
  • The lattice of cuboids forms a data cube
  • Each cuboid represents a different degree of summarization
  • The bottom-most n-D cuboid is called the base cuboid
  • The top-most 0-D cuboid is called the apex cuboid; it holds the highest level of summarization and is denoted by all
  Lattice for the 3-D example:
      0-D (apex) cuboid:  ()
      1-D cuboids:        (item), (time), (location)
      2-D cuboids:        (location, item), (location, time), (item, time)
      3-D (base) cuboid:  (location, item, time)

Cuboids of 4-D data cube
      0-D (apex) cuboid:  all
      1-D cuboids:        time; item; location; supplier
      2-D cuboids:        time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
      3-D cuboids:        time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
      4-D (base) cuboid:  time, item, location, supplier

Three data model schemas
  • Star schema: a fact table in the middle connected to a set of dimension tables
  • Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into additional smaller dimension tables, forming a shape similar to a snowflake
  • Fact constellation schema: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema

Example of Star Schema
  • Sales: fact table
      • keys: time_key, item_key, branch_key, location_key
      • measures: units_sold, dollars_sold, avg_sales
  • Time: dimension table (time_key, day, day_of_the_week, month, quarter, year)
  • Item: dimension table (item_key, item_name, brand, type, supplier_type)
  • Branch: dimension table (branch_key, branch_name, branch_type)
  • Location: dimension table (location_key, street, city, state_or_province, country)

Star → Snowflake
  • Star introduces redundancy. E.g., in Location (location_key, street, city, state_or_province, country):
      (203 Pine St, Abilene, TX, USA)
      (205 14th St, Abilene, TX, USA)
  • Snowflake normalizes this dimension table into (location_key, street, city_key) and (city_key, state_or_province, country):
      (203 Pine St, Abilene_key)
      (205 14th St, Abilene_key)
      (Abilene_key, TX, USA)

Example of Snowflake Schema
  • Sales: fact table
      • keys: time_key, item_key, branch_key, location_key
      • measures: units_sold, dollars_sold, avg_sales
  • Time: dimension table (time_key, day, day_of_the_week, month, quarter, year)
  • Item: dimension table (item_key, item_name, brand, type, supplier_key)
      • Supplier (supplier_key, supplier_type)
  • Branch: dimension table (branch_key, branch_name, branch_type)
  • Location: dimension table (location_key, street, city_key)
      • City (city_key, state_or_province, country)

Example of Constellation Schema
  • Sales: fact table
      • keys: time_key, item_key, branch_key, location_key
      • measures: units_sold, dollars_sold, avg_sales
  • Shipping: fact table
      • keys: item_key, time_key, shipper_key, from_location, to_location
      • measures: dollars_cost, units_shipped
  • Time: dimension table (time_key, day, day_of_the_week, month, quarter, year)
  • Item: dimension table (item_key, item_name, brand, type, supplier_type)
  • Branch: dimension table (branch_key, branch_name, branch_type)
  • Location: dimension table (location_key, street, city, state_or_province, country)
  • Shipper: dimension table (shipper_key, shipper_name, location_key, shipper_type)

Which design schema to use?
  • Benchmarking performance can be used to select the best schema
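As a rough illustration of the Star → Snowflake slide above, the following sketch builds the Location dimension both ways in an in-memory SQLite database and checks that they give the same view. The table names `location_star`, `location_snow` and `city`, and the integer key values, are assumptions made for this example; only the two Abilene rows come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star layout: one denormalized Location dimension table.
cur.execute("""CREATE TABLE location_star (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    state_or_province TEXT, country TEXT)""")
cur.executemany("INSERT INTO location_star VALUES (?, ?, ?, ?, ?)", [
    (1, "203 Pine St", "Abilene", "TX", "USA"),
    (2, "205 14th St", "Abilene", "TX", "USA"),
])

# Snowflake layout: city-level attributes move to their own table.
cur.execute("""CREATE TABLE city (
    city_key TEXT PRIMARY KEY, state_or_province TEXT, country TEXT)""")
cur.execute("""CREATE TABLE location_snow (
    location_key INTEGER PRIMARY KEY, street TEXT,
    city_key TEXT REFERENCES city(city_key))""")
cur.execute("INSERT INTO city VALUES ('Abilene_key', 'TX', 'USA')")
cur.executemany("INSERT INTO location_snow VALUES (?, ?, ?)", [
    (1, "203 Pine St", "Abilene_key"),
    (2, "205 14th St", "Abilene_key"),
])

# Browsing the star layout needs no join ...
star_rows = cur.execute(
    "SELECT street, state_or_province, country "
    "FROM location_star ORDER BY location_key").fetchall()

# ... while the snowflake layout needs one extra join for the same view.
snow_rows = cur.execute("""
    SELECT l.street, c.state_or_province, c.country
    FROM location_snow AS l JOIN city AS c ON l.city_key = c.city_key
    ORDER BY l.location_key""").fetchall()

# How many times is the (TX, USA) pair physically stored in each layout?
star_copies = cur.execute("SELECT COUNT(*) FROM location_star").fetchone()[0]
snow_copies = cur.execute("SELECT COUNT(*) FROM city").fetchone()[0]
```

Both queries return the same rows, but the star layout stores the state and country once per street address while the snowflake layout stores them once per city, at the price of a join.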
Snowflake vs. Star
  • Snowflake: dimension tables are easier to maintain when they are very large (reduces space)
  • Star: more effective for browsing (fewer joins per query)
  • Data Warehouse: uses "fact constellation", since it can model multiple, interrelated subjects
  • Data Mart: uses "star" (more popular) or "snowflake"

DMQL: A Data Mining Query Language
  • Cube definition (fact table)
      define cube <cube_name> [<dimension_list>]: <measure_list>
  • Dimension definition (dimension table)
      define dimension <dimension_name> as (<attribute_or_subdimension_list>)
  • Special case (shared dimension tables)
      • First define the dimension at "cube definition" time, then refer to it:
      define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

A Star Schema in DMQL
    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)

A Snowflake Schema in DMQL
    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier (supplier_key, supplier_type))
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city (city_key, province_or_state, country))

A Fact Constellation Schema in DMQL
    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
    define cube shipping [time, item, shipper, from_location, to_location]:
        dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
    define dimension time as time in cube sales
    define dimension item as item in cube sales
    define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
    define dimension from_location as location in cube sales
    define dimension to_location as location in cube sales

Measures
  Quantities computed by aggregating values in data cube cells in various ways:
  • distributive: the result can be derived by aggregating results from partitioned data
      • E.g., count(), sum(), min(), max()
  • algebraic: it can be computed by an algebraic function with a finite number of arguments, each of which is obtained by applying a distributive function
      • E.g., avg() = sum()/count(), standard_deviation()
  • holistic: there is no constant bound on the storage size needed to describe a subaggregate
      • E.g., median(), mode(), rank()

Concept Hierarchy
  Example: hierarchy for the dimension "location"
      all:      all
      region:   Europe, ..., North_America
      country:  Germany, ..., Spain; Canada, ..., Mexico
      city:     Frankfurt, ...; Vancouver, ..., Toronto
      office:   L. Chan, ..., M. Wind

Specifications of Hierarchies
  • Schema hierarchy:
      Office < City < Country < Region
      Day < {month < quarter; week} < year
  • Set_grouping hierarchy:
      (1..10] < inexpensive

Hierarchy and Warehouse View
  • Dimensions:
      • Location: Region, Country, City, Office
      • Time: Year, Quarter, Month, Week, Day
  • DBMiner:
      • Importing data
      • Table browsing
      • Dimension creation/browsing
      • Cube building/browsing

Building cube: Sales volume as a function of product, month, and region
  [Figure: cube with axes Time (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Location (U.S.A, Canada, Mexico, sum); one cell is labeled "Total annual sales of TVs in USA"; the cell (sum, sum, sum) is ALL = total sales]

Browsing a Data Cube
  • Visualization
  • OLAP capabilities
  • Interactive manipulation

Typical OLAP operations
  • Roll up (increase the level of abstraction):
      • Summarize data by climbing up a hierarchy or by dimension reduction
  • Drill down (decrease the level of abstraction):
      • Reverse of roll-up
      • From a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions

Examples: Roll-up & drill-down
  [Figure: the item/time/location cube; roll-up on location replaces the city axis with countries (USA, Canada); drill-down on time for Q1 replaces Q1 with Jan, Feb, Mar]
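The roll-up and drill-down operations above can be sketched in a few lines of Python. The cell values, the month-level base cuboid, and the helper names `roll_up` and `drill_down` are assumptions for illustration, not taken from the slides. Because `sum()` is a distributive measure, the second roll-up below works from the already summarized quarter-level cells instead of going back to the base data.

```python
from collections import defaultdict

# Hypothetical base cuboid: (item, month, city) -> dollars_sold (in thousands).
base = {
    ("TV", "Jan", "Vancouver"): 100,
    ("TV", "Feb", "Vancouver"): 200,
    ("TV", "Mar", "Toronto"): 100,
    ("phone", "Jan", "Chicago"): 150,
    ("phone", "Apr", "New York"): 250,
}

# Concept hierarchies: each member maps to its parent one level up.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA", "New York": "USA"}
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}


def roll_up(cells, position, mapping):
    """Replace the dimension at `position` by its parent level and re-sum."""
    out = defaultdict(int)
    for key, value in cells.items():
        parent = list(key)
        parent[position] = mapping[key[position]]
        out[tuple(parent)] += value
    return dict(out)


def drill_down(cells, position, member, mapping):
    """Return the detailed cells whose parent at `position` is `member`."""
    return {k: v for k, v in cells.items() if mapping[k[position]] == member}


by_quarter = roll_up(base, 1, MONTH_TO_QUARTER)               # month -> quarter
by_quarter_country = roll_up(by_quarter, 2, CITY_TO_COUNTRY)  # city -> country
q1_detail = drill_down(base, 1, "Q1", MONTH_TO_QUARTER)       # Q1 -> Jan, Feb, Mar
```

A holistic measure such as `median()` could not be rolled up this way, since the quarter-level medians do not determine the country-level median.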
Typical OLAP operations (cont)
  • Slice and dice (project and select)
  • Pivot (rotate): reorient the cube for visualization, e.g., 3-D to a series of 2-D planes
  • Other operations
      • drill across: involving (across) more than one fact table
      • drill through (links to the raw data): through the bottom level of the cube to its back-end relational tables (using SQL)

Examples: dice, slice, pivot
  [Figure: the item/time/location cube with cities Chicago, New York, Toronto, Vancouver; a dice selects the top-left corner (Toronto and Vancouver, Q1 and Q2, TV and VCR); a slice for Q1 gives a 2-D city-by-item table; a pivot swaps the city and item axes of that table]

OLAP functions
  • OLAP offers functions to:
      • Calculate, e.g., ratios, variance, etc.
      • Generate, e.g., summarizations, aggregations and hierarchies at each level and every dimension intersection
      • Support forecasting functions, e.g., trend and statistical analysis
  • OLAPs and SDBs (statistical databases) are similar
      • SDBs focus on socioeconomic applications, and are sensitive to privacy issues when drilling down
      • OLAPs are designed to handle huge amounts of data

A Star-Net Query Model
  • OLAP's querying is based on a starnet model, representing abstraction levels in each dimension
  • Each circle is called a footprint
  [Figure: star-net with radial lines Customer Orders (CONTRACTS, ORDER), Shipping Method (AIR-EXPRESS, TRUCK), Time (ANNUALLY, QTRLY, DAILY), Product (PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP), Location (CITY, COUNTRY, REGION), Organization (SALES PERSON, DISTRICT, DIVISION), Customer, and Promotion]
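The slice, dice and pivot operations listed above can be mimicked on a small dictionary-based cube. This is only a sketch: the helper names `slice_cube`, `dice` and `pivot` and most of the cell values are hypothetical; only (phone, Q1, Vancouver) = 400 appears in the slides.

```python
# Hypothetical cube: (city, quarter, item) -> dollars_sold (in thousands).
DIMS = ("city", "quarter", "item")
cube = {
    ("Toronto", "Q1", "TV"): 200,
    ("Toronto", "Q2", "TV"): 250,
    ("Vancouver", "Q1", "TV"): 100,
    ("Vancouver", "Q1", "phone"): 400,
    ("Chicago", "Q1", "VCR"): 150,
    ("New York", "Q3", "oven"): 50,
}


def slice_cube(cells, dim, member):
    """Fix one dimension to a single member and drop that dimension."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cells.items() if k[i] == member}


def dice(cells, **allowed):
    """Select a subcube by restricting several dimensions to member sets."""
    wanted = {DIMS.index(d): set(members) for d, members in allowed.items()}
    return {k: v for k, v in cells.items()
            if all(k[i] in members for i, members in wanted.items())}


def pivot(cells_2d):
    """Rotate a 2-D view by swapping its two axes."""
    return {(b, a): v for (a, b), v in cells_2d.items()}


q1 = slice_cube(cube, "quarter", "Q1")        # keys are now (city, item)
sub = dice(cube, city=["Toronto", "Vancouver"],
           quarter=["Q1", "Q2"], item=["TV", "VCR"])
rotated = pivot(q1)                           # keys are now (item, city)
```

A slice reduces dimensionality by one, a dice keeps all dimensions but shrinks each axis, and a pivot changes only the presentation, not the values.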
A Data Warehouse Design
  Four views to be considered:
  • Top-down view: allows selection of the relevant information necessary for the data warehouse
  • Data source view: exposes the information being captured, stored, and managed by operational systems
  • Data warehouse view: consists of fact tables and dimension tables
  • Business query view: perspectives of data in the warehouse from the view of end-users

Design approaches
  • Top-down: starts with overall design and planning
      • Good for mature technology and well-defined problems
      • Systematic, with minimal integration problems
      • Expensive and lacks flexibility
  • Bottom-up: starts with experiments and prototypes
      • Rapid returns, low cost
      • Flexible but hard to integrate
  • Combined

Design process
  Typical steps in the data warehouse design process: select
  • the business process to model (e.g., orders, invoices, etc.)
  • the grain (atomic level of data) of the business process (e.g., individual transactions, daily snapshots, etc.)
  • the dimensions that will apply to each fact table record
  • the measure that will populate each fact table record

Three data warehouse models
  • Enterprise warehouse
      • Collects all of the information about subjects spanning the entire organization
  • Data Mart
      • A subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
      • Independent vs. dependent (directly from warehouse) data mart
  • Virtual warehouse
      • A set of views over operational databases
      • Only some of the possible summary views may be materialized

Data warehouse development
  As in software engineering, two development methods:
  • Waterfall: do structured and systematic analysis at each step before proceeding to the next
  • Spiral: rapid generation of increasingly functional systems
      • Short turnaround time
      • Recommended for the development of data warehouses and data marts

A recommended approach
  For data warehouse development:
  [Figure: define a high-level corporate data model; refine the model into data marts and an enterprise data warehouse; continue model refinement toward distributed data marts and a multi-tier data warehouse]

Data Warehouse Architecture
  [Figure: three-tier architecture. Data sources (operational DBs, external sources) feed Extract / Transform / Load / Refresh into the bottom tier, the data warehouse server (data warehouse, data marts, metadata, monitor & integrator). The middle tier is the OLAP engine (OLAP server). The top tier is the front-end (client) tools: analysis, query/reports, data mining.]

Data sources
  • Data sources are often the operational DBMS
      • Designed for operational use, not for decision support
  • Multiple data sources
      • Are often from different systems using different hardware and customized software
      • Create problems, e.g., semantic conflicts

Back-End Tools & Utilities
  • Data cleaning: eliminates noise and errors. Three classes of tools:
      • Migration: simple data transformation
      • Scrubbing: uses domain-specific knowledge
      • Auditing: detects outliers
  • Data extraction: by a program interface called a gateway
  • Data transformation: converts data from legacy or host format to warehouse format
  • Load: sort, summarize, consolidate, compute views, check integrity constraints, and build indices and partitions
  • Refresh: propagates updates from data sources to the data store
      • When to refresh? Depends on usage and types of data
      • How to refresh? Data shipping (by triggers) and transaction shipping (by updating transaction logs)

The Bottom tier: warehouse server
  • Monitor:
      • Detects changes that are of interest
      • Propagates the changes to the integrator
  • Integrator:
      • Integrates changes into the warehouse
      • Makes the data conform to the conceptual schema in the warehouse
      • Resolves possible update anomalies
  • Metadata repository: data about data (warehouse objects)
      • Administrative metadata: source DBs and their contents, security
      • Business metadata: business terms, ownership of data
      • Operational metadata: history of migrated data and transformations applied

The Middle tier: OLAP server
  Implemented by:
  • ROLAP (Relational OLAP), e.g., MicroStrategy's DSS Server, Informix's Metacube
      • Uses a relational or extended-relational DBMS (often in a star schema)
      • Includes optimization of DBMS back-end tools
      • Supports extended SQL (Structured Query Language)
      • A cell in a multi-dimensional structure is represented by a tuple
      • Advantage: scalable (no empty cells are stored for a sparse cube)
      • Disadvantage: no direct access to cells

The Middle tier: OLAP server (cont)
  • MOLAP (Multidimensional OLAP), e.g., Arbor's Essbase
      • Stores data in a multi-dimensional data structure (MDDS)
      • Uses an array-based multidimensional storage engine (sparse matrix techniques)
      • Advantage: fast indexing to pre-computed aggregations
      • Disadvantage: not very scalable, and the arrays are sparse
  • HOLAP (Hybrid OLAP)
      • Scalability of ROLAP + efficiency of MOLAP
      • User flexibility, e.g., low level: relational, high level: array-based
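The ROLAP and MOLAP trade-offs above (a tuple per cell vs. an array addressed by position) can be made concrete with a toy cube. This is a simplified sketch under stated assumptions: the data and the names `rolap_lookup` and `molap_lookup` are hypothetical, a real ROLAP server would use DBMS indexes rather than a linear scan, and a real MOLAP engine would compress sparse regions rather than store a fully dense array.

```python
# Hypothetical 3-D cube: 4 members per dimension, only three non-empty cells.
ITEMS = ["TV", "VCR", "oven", "phone"]
QUARTERS = ["Q1", "Q2", "Q3", "Q4"]
CITIES = ["Chicago", "New York", "Toronto", "Vancouver"]

facts = [
    ("phone", "Q1", "Vancouver", 400),
    ("TV", "Q1", "Toronto", 200),
    ("VCR", "Q2", "Chicago", 150),
]

# ROLAP style: one tuple per non-empty cell, as in a fact table.
rolap_rows = list(facts)


def rolap_lookup(item, quarter, city):
    """Find a cell by searching the stored tuples (no direct cell access)."""
    for i, q, c, dollars in rolap_rows:
        if (i, q, c) == (item, quarter, city):
            return dollars
    return None


# MOLAP style: a dense array addressed by position; empty cells take space too.
molap = [[[None] * len(CITIES) for _ in QUARTERS] for _ in ITEMS]
for i, q, c, dollars in facts:
    molap[ITEMS.index(i)][QUARTERS.index(q)][CITIES.index(c)] = dollars


def molap_lookup(item, quarter, city):
    """Address the cell directly through its coordinates."""
    return molap[ITEMS.index(item)][QUARTERS.index(quarter)][CITIES.index(city)]


stored_rolap = len(rolap_rows)                             # non-empty cells only
stored_molap = len(ITEMS) * len(QUARTERS) * len(CITIES)    # every cell
```

With three facts the tuple store holds 3 entries while the dense array reserves 64 slots, which is the sparsity problem the MOLAP slide mentions.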
The Middle tier: OLAP server (cont)
  • Specialized SQL servers
      • Specialized support for SQL queries over star/snowflake schemas

Efficient Cube Computation
  • A data cube can be viewed as a lattice of cuboids
      • The bottom-most cuboid is the base cuboid
      • The top-most cuboid (apex) contains only one cell
  • Suppose that at most one level of each dimension appears in a cuboid. How many cuboids are there in an n-dimensional cube where dimension i has $L_i$ levels?

      $T = \prod_{i=1}^{n} (L_i + 1)$

      • A 10-D cube with 4 levels per dimension gives $T = 5^{10} \approx 9.8 \times 10^{6}$ cuboids
  • Ex: a data cube of 3 dimensions: time, location, item
      • The total number of cuboids = 2 x 2 x 2
      • For the time dimension, a cuboid contains either time or none
  • Suppose now time has 3 levels of hierarchy: month < quarter < year
      • The total number of cuboids = 4 x 2 x 2
      • For the time dimension, a cuboid contains month or quarter or year or none

Partial Materialization
  • Pre-computing and materializing all possible cuboids of a data cube is not feasible when there is a large number of them
  • Given a base cuboid, there are three choices:
      • Full materialization: pre-compute every cuboid
      • No materialization: do not pre-compute any "nonbase" cuboid
      • Partial materialization: select a subset of cuboids to compute
  • The selection of which cuboids to materialize depends on the queries in the workload, their frequencies, access costs, etc.
      • E.g., an OLAP product may use the heuristic "prefer cuboids that are most referred to"

Full materialization
  • To ensure fast on-line processing, full materialization may be necessary

Efficient cube computation
  • ROLAP cube computation. Optimization ideas:
      • Reorder or cluster related tuples
      • Compute partial groupings of subaggregates and use them to speed up the computation of other subaggregates and aggregates

Multi-way Array Aggregation
  • Partition arrays into chunks (a chunk is a small subcube which fits in memory)
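Returning to the cuboid-count formula under "Efficient Cube Computation" above, a short sketch can check the three numbers quoted there. The function names `count_cuboids` and `enumerate_cuboids` and the dictionary layout are assumptions for illustration; the extra `"all"` entry plays the role of the "+1" (the dimension is absent from the cuboid).

```python
from itertools import product
from math import prod


def count_cuboids(levels_per_dimension):
    """T = product over dimensions of (L_i + 1); the +1 is the 'all' level."""
    return prod(levels + 1 for levels in levels_per_dimension)


def enumerate_cuboids(hierarchies):
    """List every cuboid as one chosen level (or 'all') per dimension."""
    choices = [list(levels) + ["all"] for levels in hierarchies.values()]
    return list(product(*choices))


# Three dimensions with a single level each: 2 x 2 x 2 cuboids.
flat = {"time": ["time"], "location": ["location"], "item": ["item"]}

# Time now has three levels (month < quarter < year): 4 x 2 x 2 cuboids.
with_time_levels = {
    "time": ["month", "quarter", "year"],
    "location": ["location"],
    "item": ["item"],
}

cuboids_flat = enumerate_cuboids(flat)
cuboids_time = enumerate_cuboids(with_time_levels)
ten_d = count_cuboids([4] * 10)   # 10 dimensions with 4 levels each: 5 ** 10
```

The tuple `("all", "all", "all")` in the enumeration is the apex cuboid, and the tuple that picks the lowest level of every dimension is the base cuboid.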
Multi-way Array Aggregation (cont)
  • Compress chunks to remove wasted space (handles sparse cubes)
  • Compute aggregates in "multiway" fashion by visiting cube cells in the order which minimizes the number of times each cell is visited, and reduces memory access and storage cost

Efficient cube computation (cont)
  • MOLAP cube computation
      • Multi-way array aggregation (ROLAP's approach can't be used, since MOLAP is based on a different data structure and different data access/indexing)

Multi-way Array Aggregation
  [Figure: a 3-D array with dimensions A (a0-a3), B (b0-b3) and C (c0-c3), partitioned into 64 chunks numbered 1-64; chunks 1-16 form the c0 plane, ordered a0..a3 within b0, then b1, b2, b3]
  • What is the best traversing order to do multi-way aggregation?

Multi-way Array Aggregation
  • Method: the planes should be sorted and computed in order of size, from small to large (see Han and Kamber, Example 2.12, pp. 75-78)
  • Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
  • Limitation: efficient only for a small number of dimensions
  • If there is a large number of dimensions:
      • bottom-up computation [Beyer & Ramakrishnan, SIGMOD'99]
      • iceberg cube computation

Indexing OLAP data
  Bitmap indexing
  • Index on a particular column (attribute)
  • Each value in the column corresponds to a bit vector: bit operations are fast
  • The length of the bit vector: # of records in the base table
  • The i-th bit is set if the i-th row of the base table has the value for the indexed column
  • Not suitable for high-cardinality domains

  Base table
      Cust  Region   Type
      C1    Asia     Retail
      C2    Europe   Dealer
      C3    Asia     Dealer
      C4    America  Retail
      C5    Europe   Dealer

  Index on Region
      RecID  Asia  Europe  America
      1      1     0       0
      2      0     1       0
      3      1     0       0
      4      0     0       1
      5      0     1       0

  Index on Type
      RecID  Retail  Dealer
      1      1       0
      2      0       1
      3      0       1
      4      1       0
      5      0       1

Indexing OLAP data
  Join index
  • Allows records to identify joinable tuples without performing the (expensive) join operation; used to identify subcubes of interest
  • E.g., in a star schema data warehouse with fact table "sale" and dimensions location and item:
      sale (sale_k, Loc_k, Br_k, Item_k, Time_k)
      location (Loc_k, Str, City, State, country)
      item (Item_k, Name, Brand, type)
  • Join indices: (sale_k, Loc_k), (sale_k, Item_k), (sale_k, Item_k, Loc_k)

Example: Selecting cuboids
  • Which cuboid should be used to process a query on {brand, state} with "year = 2000"?
      • Hierarchies: Street < City < State < Country and Item_name < Brand < Type
      cuboid1: {item_name, city, year}
      cuboid2: {brand, country, year}
      cuboid3: {brand, state, year}
      cuboid4: {item_name, state} where year = 2000
  • Eliminate cuboid2, since "country" is more general than "state" and cannot be broken down to the lower level
  • By the generalization rule, the rest of the cuboids will work. But which is better?
      • cuboid1 requires two generalizations and costs the most
      • If there are not many year values associated with each brand, but each brand has several item_names, cuboid3 is less costly than cuboid4
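The bitmap index from the "Indexing OLAP data" slide above can be sketched with Python integers as bit vectors, using the five-row base table shown there. The function names `build_bitmap_index` and `rows_of` are made up for this example, and the bit order is an assumption: row i (counting from 0) is stored in bit i, so the vectors read right-to-left compared with the RecID columns in the slide.

```python
# Base table from the bitmap indexing slide: (Cust, Region, Type).
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose i-th bit marks row i."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def rows_of(bits, rows):
    """Decode a bit vector back into the matching rows."""
    return [row for i, row in enumerate(rows) if (bits >> i) & 1]


region_index = build_bitmap_index(base_table, 1)
type_index = build_bitmap_index(base_table, 2)

# "Region = Europe AND Type = Dealer" becomes a single bitwise AND.
europe_dealers = rows_of(region_index["Europe"] & type_index["Dealer"], base_table)

# "Region = Asia OR Region = America" becomes a bitwise OR.
asia_or_america = rows_of(region_index["Asia"] | region_index["America"], base_table)
```

Each distinct value needs its own vector of one bit per row, which is why the slide warns against high-cardinality columns.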
Data Warehouse Usage
  Three kinds of data warehouse applications:
  • Information processing
      • Supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  • Analytical processing
      • Multidimensional analysis of data warehouse data
      • Supports basic OLAP operations: slice-dice, drilling, pivoting
  • Data mining
      • Knowledge discovery from hidden patterns
      • Supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
  • Differences among the three tasks

From OLAP to OLAM
  Why OLAM (on-line analytical mining)?
  • High quality of data in data warehouses (DW)
      • A DW contains integrated, consistent, cleaned data
  • Available information processing structure surrounding data warehouses
      • ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
  • OLAP gives a good framework for data exploration
      • "Interactive" mining with drilling, dicing, pivoting, etc.
  • On-line selection of data mining functions
      • E.g., integration and swapping of multiple mining functions, algorithms, and tasks

OLAM Architecture
  [Figure: four layers. Layer 4, User Interface: the user GUI API takes mining queries and returns mining results. Layer 3, OLAP/OLAM: OLAM engine and OLAP engine on top of the data cube API. Layer 2, MDDB (multi-dimensional DB): MDDB and meta data, fed through filtering & integration and the database API. Layer 1, Data Repository: databases and data warehouse, with data cleaning, data integration and filtering.]
Discovery-Driven Exploration
  Data cube exploration:
  • Hypothesis-driven
      • Exploration by the user; huge search space; interesting anomalies can be overlooked
  • Discovery-driven (Sarawagi et al. '98)
      • Effective navigation of large OLAP data cubes
      • Pre-compute measures indicating exceptions, to guide the user in the data analysis at all levels of aggregation
      • Exception: a value significantly different from the value anticipated, based on a statistical model
      • Visual cues such as background color are used to reflect the degree of exception of each cell

Example
  [Figure: discovery-driven exploration of a data cube, with cells shaded by degree of exception]

"Exception" and its computation
  • Parameters
      • SelfExp: surprise of a cell relative to other cells at the same level of aggregation
      • InExp: surprise beneath the cell
      • PathExp: surprise beneath the cell for each drill-down path
  • Computation of the exception indicators (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction
  • Exceptions themselves can be stored, indexed and retrieved like pre-computed aggregates

Data Mining & Warehouses
  • Construction of data warehouses requires data cleaning and integration
      • Important preprocessing in data mining
  • Data mining tools can interface with the OLAP engine
      • Facilitates interactive data exploration, including data integration, aggregation and navigation
  • OLAP-based mining: OLAM

Summary
  • Data warehouse
  • A multi-dimensional model of a data warehouse
      • Representation: a data cube consists of dimensions & measures
      • Schema: star, snowflake, and fact constellations
      • OLAP operations: drilling, rolling, slicing, dicing and pivoting
  • OLAP servers: ROLAP, MOLAP, HOLAP
  • Efficient computation of data cubes
      • Partial vs. full vs. no materialization
      • Multi-way array aggregation
      • Bitmap index and join index implementations
  • Data mining and data warehouse
      • Discovery-driven exploration in data warehouses
      • From OLAP to OLAM (on-line analytical mining)

For more, see …
  • S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.
  • D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD'97.
  • R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE'97.
  • K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD'99.
  • S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.
  • OLAP council. MDAPI specification version 2.0. In http://www.olapcouncil.org/research/apily.htm, 1998.
  • G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining Multi-dimensional Constrained Gradients in Data Cubes. VLDB'2001.
  • J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and subtotals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
  • J. Han, J. Pei, G. Dong, and K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD'01.
  • V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD'96.
  • Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998.
  • K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB'97.
  • K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. EDBT'98.
  • S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
  • E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley & Sons, 1997.
  • W. Wang, H. Lu, J. Feng, and J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.
  • Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.