* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Data - Telkom University
Survey
Document related concepts
Transcript
Data Warehouse & OLAP Introduction Sesi: 03-04 Dosen Pembina : Danang Junaedi The Knowledge Discovery Process Interpretation/ Evaluation Data Mining Transformation Knowledge Patterns / Models Preprocessing Transformed Data Selection Preprocessed Data Target Data Data Sources Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P, From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996. Data Sources Relational Databases Data Warehouses WWW Audio Video Printed Materials : : Relational Databases Multidimensional Data Cube Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000 Evolution of DSS Transaction Processing Systems (TPS) Operational data stores and OLTP Batch reports, hard to find and analyze information, inflexible and expensive, reprogram every new request (circa 60’s) MIS Management reporting from transactions in TPS Still inflexible, not integrated with desktop tools (circa 70’s) DSS Combine data with analytic models or expert rules Integration with desktop tools (80’s) Data Warehousing Data integrated after (cleaning and scrubbing) from multiple sources (both internal and external to the organization) OLAP is the technology used to study the data in terms of operations on a multi-dimensional data set Data warehousing also supports processing of data by analytic methods and permits data mining (90’s) Applications Retail - inventory management, promotions Manufacturing - order shipment Insurance – policy and claims tracking Telecommunications - call analysis Financial – account tracking CRM/eCRM – customer profiling, clickstream analysis Healthcare – disease management, patient and physician profiling Databases for Decision Support Transaction Processing systems are optimized for performance Data they capture are too detailed to be of use for decision support purposes Online Analytical Processing (OLAP) imposes very different demands on databases than does Online Transaction Processing (OLTP) Heterogeneous Database Integration Integration System World Wide Web Personal Digital Libraries Scientific Databases Databases • Collects and combines information from disparate sources • Provides integrated view, and a uniform user interface • Supports sharing of data between entities Why look at data in this way? What would be the demand for services (forecasting)? Who are our key customers/patients, and What are the margins/outcomes? (profitable customers/satisfied patients) How do we market to them/treat them? What pricing/treatment strategy is desirable? What are their preferences? What type of customer/patient services are required? What services when packaged result in higher/better sales/revenues/margins/outcomes, efficient workflow? Which promotion/patient education/counseling works or does not work and why? What is the inventory/patient turnover? Which channel/technology is more effective/profitable? Why do margins/outcomes differ from one place to another or one patient to another? Data Warehousing and Industry One of the hottest topic in IS. Over 90% of larger companies either have a DW or are starting one. Warehousing is big business $2 billion in 1995 $3.5 billion in early 1997 $8 billion in 1998 [Metagroup] over $200 billion over next 5 years. Data Warehousing and Industry (2) A 1996 study of 62 data warehousing projects showed: An average return on investment of 321%, with an average payback period of 2.73 years. WalMart has largest warehouse 900-CPU, 2,700 disk, 23 TB Teradata system ~7TB in warehouse 40-50GB per day Why Data Warehousing? Advance of information technology. Data collected in huge amounts. Need to make good use of data? Architecture and tools to Bring together scattered information from multiple sources to provide consistent data source for decision support. Support information processing by providing a solid platform of consolidated, historical data for analysis. A producer wants to know…. Which are our lowest/highest margin customers ? Who are my customers and what products are they buying? What is the most effective distribution channel? What product prom-otions have the biggest impact on revenue? Which customers are most likely to go to the competition ? What impact will new products/services have on revenue and margins? Data, Data everywhere yet ... I can’t find the data I need data is scattered over the network many versions, subtle differences I can’t get the data I need o need an expert to get the data I can’t understand the data I found o available data poorly documented I can’t use the data I found o results are unexpected o data needs to be transformed from one form to other What are the users saying... Data should be integrated across the enterprise Summary data has a real value to the organization Historical data holds the key to understanding data over time What-if capabilities are required What is a Data Warehouse? Defined in many different ways non-rigorously. A DB for decision support. Maintained separately from an organization’s operational database. A data warehouse is a subject-oriented, integrated, timevariant, and nonvolatile collection of data in support of management’s decision-making process.— W. H. Inmon 90% of major organizations have or are building some kind of data warehouse. A decision support database that is maintained separately from the organization’s operational databases. Data Warehouse A data warehouse is a subject-oriented integrated time-varying non-volatile collection of data that is used primarily in organizational decision making. -- Bill Inmon, Building the Data Warehouse 1996 What is Data Warehousing? Information A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research, April 1996] Data Data Warehousing -- It is a process Technique for assembling and managing data from various sources for the purpose of answering business questions. Thus making decisions that were not previous possible A decision support database maintained separately from the organization’s operational database Explorers, Farmers and Tourists Tourists: Browse information harvested by farmers Farmers: Harvest information from known access paths Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data Data Warehouse Architecture OLAP / Decision support/ User queries Data cubes/ data mining Relational database (warehouse) Data extraction Data cleaning/ programs scrubbing Data Data Data source source source Data Warehouse Architecture Data Warehouse Architecture Data Warehouse Architecture Relational Databases Optimized Loader ERP Systems Extraction Cleansing Data Warehouse Engine Purchased Data Legacy Data Metadata Repository Analyze Query Data Warehouse Component Data Warehouse Architecture (e.g., AS400) Clinical Payroll System System Data Meta-Data Mining Tools Oracle OLAP servers Financials on HP 9000 Billing System Transformation Data Integration Warehouse Access, Files Other Internal Data (Industry Reports) External Data Excel Web Other Data Sources ETL Software S T A G I N G Transaction Data Prod IBM Mkt IMS HR Fin VSAM Ascential Oracle Extract Acctg Syba se Other Internal Data ERP SAP Infor mix SAS HarteHanks Users ANALYSTS Cognos Teradata IBM Load Informatica D A T A External Data Demographic A R E A O P E R A T I O N A L Data Analysis Tools and Applications SQL Sagent Web Data Clickstream Data Stores S T O R E Clean/Scrub Trans form Firstlogic Data Warehouse Data Marts SAS MANAGERS Finance Essbase Marketing Queries,Reporting, DSS/EIS, Data Mining EXECUTIVES Micro Strategy Meta Data Sales Microsoft Siebel Business Objects OPERATIONAL PERSONNEL Web Browser CUSTOMERS/ SUPPLIERS Data Warehouse Architecture Extraction, Transformation, & Load (ETL) ETL is a set of tools and techniques used to populate a data warehouse Extraction Extract data from sources (e.g., operational DBMSs, file systems, Web pages) Transformation Clean data Convert from legacy/host format to warehouse format (e.g., convert “surname” to “last name”) Extraction, Transformation, & Load (ETL) Load Sort, summarize, consolidate, compute views, check integrity, build indexes, partition Huge volumes of data to be loaded, yet small time window (usually at night) when the warehouse can be taken off-line Techniques: batch, sequential load often too slow; incremental, parallel loading techniques may be used Refresh Propagate updates from sources to the warehouse When to refresh - on every update, periodically (e.g., every 24 hours), or after “significant” events How to refresh – full extract from base tables vs. incremental techniques Data Mart A data mart stores data for a limited number of subject areas, such as marketing and sales data. It is used to support specific applications. An independent data mart is created directly from source systems. A dependent data mart is populated from a data warehouse. Data Warehouse vs. Data Marts Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization. Requires extensive business modeling May take years to design and build Data Marts: departmental subsets that focus on selected subjects: Marketing data mart: customer, products, sales. Faster roll out, but complex integration in the long run. Baca artikel di url: http://www.martinsights.com/?p=1088 Data Warehouse for Decision Support & OLAP Putting Information technology to help the knowledge worker make faster and better decisions Which of my customers are most likely to go to the competition? What product promotions have the biggest impact on revenue? How did the share price of software companies correlate with profits over last 10 years? Decision Support Used to manage and control business Data is historical or point-in-time Optimized for inquiry rather than update Use of the system is loosely defined and can be ad-hoc Used by managers and end-users to understand the business and make judgements The Complete Decision Support System (Source: Franconi) Information Sources Data Warehouse OLAP Servers Clients Server (Tier 2) (Tier 3) (Tier 1) e.g., MOLAP Analysis Semistructured Sources Data serve Warehouse extract Query/Reporting transform serve load e.g., ROLAP refresh Operational DB’s etc. Data Mining serve Data Marts Data Mining works with Warehouse Data Data Warehousing provides the Enterprise with a memory Data Mining provides the Enterprise with intelligence We want to know ... Given a database of 100,000 names, which persons are the least likely to default on their credit cards? Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer? If I raise the price of my product by Rs. 2, what is the effect on my ROI? If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result? If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues? Which of my customers are likely to be the most loyal? Data Mining helps extract such information Why Separate Data Warehouse? Performance Op dbs designed & tuned for known txs & workloads. Complex OLAP queries would degrade perf. for op txs. Special data organization, access & implementation methods needed for multidimensional views & queries. Function Missing data: Decision support requires historical data, which op dbs do not typically maintain. Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources. Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled. Data Warehouse vs. Heterogeneous DBMS Traditional heterogeneous DB integration: Build wrappers/mediators on top of heterogeneous databases Query driven approach – When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set – Complex information filtering, compete for resources Data warehouse: update-driven, high performance Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis Data Warehouse vs. Operational DBMS OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries RDBMS used for OLTP Database Systems have been used traditionally for OLTP clerical data processing tasks detailed, up to date data structured repetitive tasks read/update a few records isolation, recovery and integrity are critical Operational Systems Run the business in real time Based on up-to-the-second data Optimized to handle large numbers of simple read/write transactions Optimized for fast response to predefined transactions Used by people who deal with customers, products -- clerks, salespeople etc. They are increasingly used by customers Examples of Operational Data Data Industry Usage Technology Volumes Customer File All Legacy application, flat files, main frames Small-medium Account Balance Finance Legacy applications, hierarchical databases, mainframe Large Point-ofSale data Retail ERP, Client/Server, relational databases Very Large Call Record Telecommunications Legacy application, hierarchical database, mainframe ERP, relational databases, AS/400 Very Large Production ManufactRecord uring Track Customer Details Control account activities Generate bills, manage stock Billing Control Production Medium Application-Orientation vs. Subject-Orientation Application-Orientation Subject-Orientation Operational Database Loans Credit Card Data Warehouse Customer Vendor Trust Savings Product Activity OLTP vs. OLAP OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date detailed, flat relational isolated repetitive historical, summarized, multidimensional integrated, consolidated ad-hoc lots of scans unit of work read/write index/hash on prim. key short, simple transaction # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response usage access complex query OLTP vs. Data Warehouse OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries) e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December OLTP vs Data Warehouse OLTP Application Oriented Used to run business Detailed data Current up to date Isolated Data Repetitive access Clerical User Warehouse (DSS) Subject Oriented Used to analyze business Summarized and refined Snapshot data Integrated Data Ad-hoc access Knowledge User (Manager) OLTP vs Data Warehouse OLTP Performance Sensitive Few Records accessed at a time (tens) Read/Update Access No data redundancy Database Size 100MB 100 GB Data Warehouse Performance relaxed Large volumes accessed at a time(millions) Mostly Read (Batch Update) Redundancy present Database Size 100 GB - few terabytes OLTP vs Data Warehouse OLTP Transaction throughput is the performance metric Thousands of users Managed in entirety Data Warehouse Query throughput is the performance metric Hundreds of users Managed by subsets To summarize ... OLTP Systems are used to “run” a business The Data Warehouse helps to “optimize” the business From Tables and Spreadsheets to Data Cubes A data warehouse is based on a multidimensional data model which views data in the form of a data cube A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube. A Sample Data Cube Total annual sales 2Qtr 3Qtr 4Qtr sum U.S.A Canada Mexico sum Country TV PC VCR sum 1Qtr of TVs in U.S.A. Date Cube: A Lattice of Cuboids all time 0-D(apex) cuboid item time,location location supplier item,location location,supplier time,item time,supplier 1-D cuboids 2-D cuboids item,supplier time,location,supplier 3-D cuboids time,item,location time,item,supplier item,location,supplier 4-D(base) cuboid time, item, location, supplier Data (Hyper) cubes 2-d to 3-d cube Rotating the cube Conceptual Modeling of Data Warehouses ER design techniques not appropriate Modeling data warehouses: dimensions & measures Star schema: A fact table in the middle connected to a set of dimension tables Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Fact constellations schema : Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000. Problem with ER ER models are NOT suitable for DW? End user cannot understand or remember an ER Model Many DWs have failed because of overly complex ER designs Not optimized for complex, ad-hoc queries Data retrieval becomes difficult due to normalization Browsing becomes difficult Warehouse Models & Operators Data Models relations stars & snowflakes cubes Operators slice & dice roll-up, drill down pivoting other Multidimensional Data Model Database is a set of facts (points) in a multidimensional space A fact has a measure dimension quantity that is analyzed, e.g., sale, budget A set of dimensions on which data is analyzed e.g. , store, product, date associated with a sale amount Dimensions form a sparsely populated coordinate system Each dimension has a set of attributes e.g., owner, city and county of store Attributes of a dimension may be related by partial order Hierarchy: e.g., street > county >city Lattice: e.g., date> month>year, date>week>year Example: Patient profiling A healthcare organization needed a longitudinal view of patients, including trends of services to patients Model Facts include Healthcare (e.g., diagnosis, procedure), Financial (e.g., amount billed, number of claims), Resources (e.g., number of bed-days, inpatient and outpatient visits) Dimensions include Time, Provider, Claim Type, Demographics, Encounter Type, Diagnosis and Procedure, Person, Organization Questions answered by system: Which individuals are eligible for services but not obtaining them? Which individuals are registered for services, but not receiving preventive healthcare? Star Schema A single fact table and a single table for each dimension Every fact points to one tuple in each of the dimensions and has additional attributes Does not capture hierarchies directly Straightforward means of capturing a multiple dimension data model using relations Slowly Changing Dimensions Example of Star Schema time item time_key item_key day Sales Fact Table day_of_the_week time_key month item_key quarter year branch_key branch location_key branch_key branch_name units_sold branch_type dollars_sold avg_sales Measures item_name brand type supplier_type location location_key street city province_or_street country Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000. Example of a Star Schema Order Product Order No ProductNO Order Date Customer Customer No Customer Name Customer Address City Salesperson SalespersonID SalespersonName City Quota ProdName Fact Table OrderNO SalespersonID ProdDescr Category CategoryDescription UnitPrice CustomerNO ProdNo Date DateKey DateKey CityName Date Quantity Total Price City CityName State Country A star schema for credit card purchases Purchase Key 1 2 3 4 5 6 . . . Purchase Dimension Category Supermarket Travel & Entertainment Auto & Vehicle Retail Restarurant Miscellaneous . . . Time Dimension Time Key Month Day Quarter Year 10 Jan 5 1 2002 . . . . . . . . . . . . . . . Cardholder Key Purchase Key Location Key 1 2 1 15 4 5 1 2 3 . . . . . . . . . Cardholder Key Name 1 John Doe 2 Sara Smith . . . . . . Cardholder Dimension Gender Income Range Male 50 - 70,000 Female 70 - 90,000 . . . . . . Fact Table Time Key Amount 10 14.50 11 8.25 10 22.40 . . . . . . Location Key Street 10 425 Church St . . . . . . Location Dimension City State Region Charleston SC 3 . . . . . . . . . The Star Schema A multi-dimensional data model Customer dimension Non-normalized Time dimension Fact table Non-normalized Normalized Non-normalized Non-normalized Application dimension Problem dimension One E-R vs. Many Stars Transaction Analysis focus focus One E-R model for all the business process. One star per modeled business process. Snowflake Schema Represent dimensional hierarchy directly by normalizing the dimension tables Easy to maintain Saves storage, but may reduce effectiveness of browsing (Kimball) Example of Snowflake Schema time item time_key day Sales Fact Table day_of_the_week time_key month item_key quarter branch_key year branch location_key branch_key branch_name branch_type units_sold dollars_sold avg_sales Measures item_key supplier item_name supplier_key brand supplier_type type supplier_key location location_key street city_key city city_key city province_or_street Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000. country Example of a Snowflake Schema Order Product Order No Category ProductNO Order Date Fact Table CategoryName ProdName CategoryDescr ProdDescr Customer Category Customer No OrderNO Category Customer Name SalespersonID UnitPrice Customer Address CustomerNO City ProdNo DateKey Salesperson CityName SalespersonID Quantity SalespersonName Total Price City Quota Date DateKey Date Month City CityName State Country Month Month Year Year State StateName Country Year Fact Constellation Schema Multiple fact tables share dimension tables. This schema is viewed as collection of stars hence called galaxy schema or fact constellation. Sophisticated applications require such schema. Example of Fact Constellation time time_key item day day_of_the_week month Sales Fact Table item_key year branch_key branch location_key branch_key units_sold branch_name dollars_sold branch_type brand type supplier_type location avg_sales item_key shipper_key from_location to_location location_key dollars_cost street units_shipped city Measures time_key item_name time_key quarter item_key Shipping Fact Table province_or_street country Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000. shipper shipper_key shipper_name location_key shipper_type Example Fact Constellation Schema Sales Fact Table Store Key Shipping Product Dimension Fact Table Shipper Key Product Key Product Key Store Key Period Key Product Desc Product Key Units Period Key Price Units Store Dimension Store Key Store Name City State Region Price A constellation schema for credit card purchases and promotions Promotion Key 1 . . . Promotion Dimension Description Cost watch promo 15.25 . . . . . . Time Dimension Time Key Month Day Quarter Year 5 Dec 31 4 2001 8 Jan 3 1 2002 10 Jan 5 1 2002 . . . . . . . . . . . . . . . Purchase Key 1 2 3 4 5 6 Promotion Fact Table Cardholder Key Promotion Key Time Key 1 1 5 2 1 5 . . . . . . . . . Response Yes No . . . Cardholder Key Name 1 John Doe 2 Sara Smith . . . . . . Purchase Dimension Category Supermarket Travel & Entertainment Auto & Vehicle Retail Restarurant Miscellaneous Purchase Fact Table Cardholder Key Purchase Key Location Key 1 2 1 15 4 5 1 2 3 . . . . . . . . . Cardholder Dimension Gender Income Range Male 50 - 70,000 Female 70 - 90,000 . . . . . . Time Key Amount 10 14.50 11 8.25 10 22.40 . . . . . . Location Key Street 5 425 Church St . . . . . . Location Dimension City State Region Charleston SC 3 . . . . . . . . . What is OLAP? Software tool providing multi-dimensional view of data for business analysis Example of “Decision Support” or “Business Intelligence” tool Fast data access and fast computations Interactive, flexible user interface “Slice, dice, drill-down” Excel Pivot Table and Pivot Chart are examples of simple OLAP tools Defining OLAP - ANALYSIS Business logic and statistical analysis relevant to end user Should not require programming for everything Analysis can be via vendors’ tools or link to generic analytical platform such as spreadsheet Examples include time series analysis, cost allocation, currency translation, goal seeking, ad-hoc multidimensional structural changes (cube building), nonprocedural modeling, exception alerting, and data mining. Capabilities vary widely by vendor and market OLAP Operations Common cube operations Pivot or Rotate – change which dimensions and/or levels within dimensions are shown on row and column axes Roll-up – aggregate or combine cells within a dimension according to some mathematical operation Uses a hierarchy definition for the dimension Commonly this is summation or count Drill down – examine data a greater level of detail Add another row or column header which is further down the concept hierarchy Slice – select a subset of a cube by constraining the value of some dimension Ex: Select cells for month = January in time dimension Dice – select a subset of a cube by constraining two or more dimensions Drill through – access atomic level detail data OLAP Operations & SQL Sample Cube Operation (SQL) A Few Products Microsoft Analysis Services Part of SQL Server 2005 Create OLAP cubes, 10 data mining algorithms Tableau A new, pretty amazing pivoting tool Cognos Recently bought by IBM Hyperion Essbase Full suite of business intelligence developer and end user tools Purchased by Oracle Business Objects (Crystal) Full suite of business intelligence developer and end user tools Microstrategy Oracle Information Builders Home of WebFocus, a web based OLAP tool Pentaho A new open source business intelligence project http://www.pentaho.org/ A Data Mining Query Language, DMQL: Language Primitives Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]: <measure_list> Dimension Definition ( Dimension Table ) define dimension <dimension_name> as (<attribute_or_subdimension_list>) Special Case (Shared Dimension Tables) First time as “cube definition” define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time> Defining a Star Schema in DMQL define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) Defining a Snowflake Schema in DMQL define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country)) Defining a Fact Constellation in DMQL define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales First Example Enrollment Data Sumber: Dr. Mourad YKHLEF,Decision Support System, King Saud University, 2009 Enroll Table Class Day Time Prof Enrolled 1336 T 8 Aars 14 1430 M 2 Aars 28 1430 M 11 Booth 30 1430 T 11 Booth 26 1430 T 2 Booth 23 1430 T 2 Fry 27 1430 T 12 Aars 29 1440 M 1 Aars 11 2334 M 9 Fry 27 2350 M 10 Maurer 19 3101 T 12 Grabow 16 3303 M 11 Aars 11 3324 M 8 Gaitros 20 3330 T 11 Fry 5 3331 T 12 Aars 11 3334 M 11 Hamerly 20 3335 M 2 Donahoo 17 3336 T 8 Sturgill 9 3342 T 2 Aars 10 3439 T 9 Poucher 10 Group by rollup Select Prof, Sum(Students) From enroll Group by prof With rollup Prof Aars Booth Donahoo Fry Gaitros Grabow Hamerly Maurer Poucher Sturgill Enrolled 114 79 17 59 20 16 20 19 10 9 363 Group by cube Select Day, Time, Sum(Students) From enroll Group By Day,Time With cube {rollup(Day,Time) -- all rows except ([NULL])} Day Time [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] M M M M M M M T T T T T T 1 2 8 9 10 11 12 1 2 8 9 10 11 2 8 9 11 12 Enrolled 363 11 105 43 37 19 92 56 183 11 45 20 27 19 61 180 60 23 10 31 56 Group by rollup Select Day, Time, Prof, Sum(Students) From enroll Group by day,time, prof With rollup Second Example Sales Data Sales Cube Product Market Time Sales Example (Cont.) Simple Cross-Tabular Report 1997 Region Department CassetteSales Profit Video Sales Profit Total Profit Central 82,000 85,000 167,000 East 101,000 137,000 238,000 West 96,000 97,000 193,000 Total 279,000 319,000 598,000 Sales Example (Cont.) Roll up – the query SELECT Time, Region, Department, sum(Profit) FROM sales GROUP BY Time, Region, Dept WITH ROLLUP this query returns the following sets of rows: • Regular aggregation rows that would be produced by GROUP BY without using ROLLUP . • First-level subtotals aggregating across Department for each combination of Time and Region . • Second-level subtotals aggregating across Region and Department for each Time value . • A grand total row . Sales Example (Cont.) Roll up – the result of query Time Region Dept Profit 1996 Central CassetteSales 75,000 1996 Central VideoSales 74,000 1996 Central [NULL] 149,000 1996 East CassetteSales 89,000 1996 East VideoSales 115,000 1996 East [NULL] 204,000 1996 West CassetteSales 87,000 1996 West VideoSales 86,000 1996 West [NULL] 173,000 1996 [NULL] [NULL] 526,000 1997 Central CassetteSales 82,000 1997 Central VideoSales 85,000 1997 Central [NULL] 167,000 1997 East CassetteSales 101,000 1997 East VideoSales 137,000 1997 East [NULL] 238,000 1997 West CassetteSales 96,000 1997 West VideoSales 97,000 1997 West [NULL] 193,000 1997 [NULL] [NULL] 598,000 [NULL] [NULL] [NULL] 1,124,000 Sales Example (Cont.) Roll up Calculating Subtotals without ROLLUP The result set could be generated by the UNION of four SELECT statements, as shown below. This is a subtotal across three dimensions. Notice that a complete set of ROLLUP-style subtotals in n dimensions would require n+1 SELECT statements linked with UNION ALL. SELECT Time, Region, Department, SUM(Profit) FROM Sales GROUP BY Time, Region, Department UNION ALL SELECT Time, Region, '' , SUM(Profit) FROM Sales GROUP BY Time, Region UNION ALL SELECT Time, '', '', SUM(Profits) FROM Sales GROUP BY Time UNION ALL SELECT '', '', '', SUM(Profits) FROM Sales; Sales Example (Cont.) Cube - the query SELECT Time, Region, Department, sum(Profit) FROM sales GROUP BY Time, Region, Dept WITH CUBE Sales Example (Cont.) Cube – the result of query Time Region Dept Profit 1996 Central CassetteSales 75,000 1996 Central VideoSales 74,000 1996 Central [NULL] 149,000 1996 East CassetteSales 89,000 1996 East VideoSales 115,000 1996 East [NULL] 204,000 1996 West CassetteSales 87,000 1996 West VideoSales 86,000 1996 West [NULL] 173,000 1996 [NULL] CassetteSales 251,000 1996 [NULL] VideoSales 275,000 1996 [NULL] [NULL] 526,000 1997 Central CassetteSales 82,000 1997 Central VideoSales 85,000 1997 Central [NULL] 167,000 1997 East CassetteSales 101,000 1997 East VideoSales 137,000 1997 East [NULL] 238,000 1997 West CassetteSales 96,000 1997 West VideoSales 97,000 1997 West [NULL] 193,000 1997 [NULL] CassetteSales 279,000 1997 [NULL] VideoSales 319,000 1997 [NULL] [NULL] 598,000 [NULL] Central CassetteSales 157,000 [NULL] Central VideoSales 159,000 [NULL] Central [NULL] 316,000 [NULL] East CassetteSales 190,000 [NULL] East VideoSales 252,000 [NULL] East [NULL] 442,000 Sales Example (Cont.) Cube Calculating Subtotals without CUBE Just as for ROLLUP, multiple SELECT statements combined with UNION statements could provide the same information gathered through CUBE. However, this may require many SELECT statements: for an ndimensional cube, 2n SELECT statements are needed. In our 3-dimension example, this would mean issuing 8 SELECTS linked with UNION ALL. Sales Example (Cont.) Grouping Two challenges arise with the use of ROLLUP and CUBE. First, how can we programmatically determine which result set rows are subtotals, and how do we find the exact level of aggregation of a given subtotal? We will often need to use subtotals in calculations such as percent-oftotals, so we need an easy way to determine which rows are the subtotals we seek. Second, what happens if query results contain both stored NULL values and "NULL" values created by a ROLLUP or CUBE? How does an application or developer differentiate between the two? Sales Example (Cont.) Grouping To handle these issues, we have a function called GROUPING. Using a single column as its argument, Grouping returns 1 when it encounters a NULL value created by a ROLLUP or CUBE operation. That is, if the NULL indicates the row is a subtotal, GROUPING returns a 1. Any other type of value, including a stored NULL, will return a 0. Sales Example (Cont.) Grouping – the query SELECT Time, Region, Department, SUM(Profit) GROUPING (Time) as T, GROUPING (Region) as R, GROUPING (Department) as D FROM Sales GROUP BY Time, Region, Department WITH ROLLUP Sales Example (Cont.) Grouping – the result of query Time Region Dept Profit T R D 1996 Central CassetteSales 75,000 0 0 0 1996 Central Video Sales 74,000 0 0 0 1996 Central [NULL] 149,000 0 0 1 1996 East CassetteSales 89,000 0 0 0 1996 East Video Sales 115,000 0 0 0 1996 East [NULL] 204,000 0 0 1 1996 West CassetteSales 87,000 0 0 0 1996 West Video Sales 86,000 0 0 0 1996 West [NULL] 173,000 0 0 1 1996 [NULL] [NULL] 526,000 0 1 1 1997 Central CassetteSales 82,000 0 0 0 1997 Central Video Sales 85,000 0 0 0 1997 Central [NULL] 167,000 0 0 1 1997 East CassetteSales 101,000 0 0 0 1997 East Video Sales 137,000 0 0 0 1997 East [NULL] 238,000 0 0 1 Grouping This table shows an ambiguous result set created using the CUBE extension. Time Region Profit 1996 East 200,000 1996 [NULL] 200,000 [NULL] East 200,000 [NULL] [NULL] 190,000 [NULL] [NULL] 190,000 [NULL] [NULL] 190,000 [NULL] [NULL] 390,000 Grouping (Cont.) We can resolve the ambiguity by using the GROUPING and other functions in the code below SELECT decode(grouping(Time), 1, 'All Times', Time) as Time, decode(grouping(region), 1, 'All Regions', Region) as Region, sum(Profit) FROM Sales GROUB BY Time, Region WITH CUBE Grouping (Cont.) The code result Time Region Profit 1996 1996 East All Regions 200,000 200,000 All Times East 200,000 [NULL] [NULL] 190,000 [NULL] All Regions 190,000 All Times [NULL] 190,000 All Times All Regions 390,000 Grouping (Cont.) Also we can use GROUPING function for this purpose we retrieve a subset of the subtotals created by a CUBE and none of the base-level aggregations. The HAVING clause constrains columns which use GROUPING functions SELECT Time, Region, Department, SUM(Profit) AS Profit, GROUPING (Time) AS T, GROUPING (Region) AS R, GROUPING (Department) AS D FROM Sales GROUP BY Time, Region, Department WITH CUBE HAVING (D=1 AND R=1 AND T=1) OR (R=1 AND D=1) OR (T=1 AND D=1) Grouping (Cont.) The query result Time Region Department Profit 1996 1997 [NULL] [NULL] [NULL] [NULL] 526,000 598,000 [NULL] Central [NULL] 316,000 [NULL] East [NULL] 442,000 [NULL] West [NULL] 366,000 [NULL] [NULL] [NULL] 1,124,000 Roll up Example SELECT Year, Quarter, Month, SUM(Profit) AS Profit FROM sales GROUP BY Year, Quarter, Month WITH ROLLUP Roll up Example The query result Year Quarter Month Profit 1997 Winter Jan 55,000 1997 Winter Feb 64,000 1997 Winter March 71,000 1997 Winter [NULL] 190,000 1997 Spring April 75,000 1997 Spring May 86,000 1997 Spring June 88,000 1997 Spring [NULL] 249,000 1997 Summer July 91,000 1997 Summer August 87,000 1997 Summer September 101,000 1997 Summer [NULL] 279,000 1997 Fall October 109,000 1997 Fall November 114,000 Referensi 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Keith C.C. Chan,2003, Data Warehousing & Data Mining, The Hong Kong Polytechnic University Dr. Mourad YKHLEF,2009,Decision Support System, King Saud University, -,-, Decision Support Technology,The Heinz School Ahmed M. Zeki, 2004, Data Mining & Data Warehousing, INFO 6630 Dan St. Clair, 2002, Lect 1 – Intro. To Data Mining & Data Warehouses, University of Missouri-Rolla S. Sudarshan; Krithi Ramamritham,-, Data Warehouse and Data Mining, IIT Bombay Chris Clifton, 2004, Data Warehousing, Purdue University Richard J. Roiger,-,The Data Warehouse,Hugh J. Watson,-, Recent Developments in Data Warehousing, http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt, Tanggal Akses:17-092010 Mark Isken,-, Data Warehousing and Online Analytical Processing (OLAP),Ari Cahyono,-,Introduction to Data Warehouse, Magister Teknologi Informasi UGM Reference Library BI Resources The Data Warehousing Institute http://www.tdwi.org/ Kimball and Associates http://www.ralphkimball.com./html/articles.html A Dimensional Modeling Manifesto – Kimball, R. http://www.dbmsmag.com/9708d15.html DSS Resources http://dssresources.com/ Data Warehousing Information Center http://www.dwinfocenter.org/ Intelligent Enterprise http://www.intelligententerprise.com/ DM Review http://dmreview.com/ KDNuggets http://www.kdnuggets.com/ IT Toolbox http://www.ittoolbox.com/ http://businessintelligence.ittoolbox.com/ http://datawarehouse.ittoolbox.com/ OLAP Report http://www.olapreport.com/ Some free stuff (nice history of OLAP and commentary on industry trends Other stuff costs $ http://www.mosha.com/msolap/ Awesome set of resources from the lead developer on MS SQL Server Analysis Server team Some Good Books and Articles The Data Warehouse Toolkit – Kimball, R. Definitive, Microsoft SQL Server 2005 based 3rd edition now out OLAP Solutions – Thomsen, E. Definitive, abstract and dense, good MDX Solutions: With Microsoft SQL Server Analysis Services 2005 and Hyperion Essbase by George Spofford MDX is the “SQL” for cubes Data Mining with SQL Server 2005 (Paperback) by ZhaoHui Tang (Author), Jamie MacLennan Data Warehouse Design Solutions – Adamson and Venerable Multi-D DW designs from lots of different industries Very practical, uses realistic situations to reinforce the concepts Summers Rubber Company designs its data warehouse Gorla, Narasimhaiah; Krehbiel, Steve Interfaces; Mar/Apr 1999; 29, 2; ABI/INFORM Global More AS Tutorials and Resources http://www.mosha.com/msolap/ This is the granddaddy of MS SQL Server Analysis Services resoures. Mosha Pasumansky is the MS development lead on AS engine. Site is gold mine of information and links regarding AS and related software He participates in microsoft.public.sqlserver.olap Great Blog at http://www.sqljunkies.com/WebLog/mosha/ Introduction to Analysis Services - by William Pearson (series of articles) http://www.databasejournal.com/article.php/1459531/ Very nice series of MS AS tutorials Best practices for Business Intelligence using the Microsoft Data Warehousing Framework A white paper from Microsoft