Query Processing Chapter 6

Contents
Present Business Scenario
Data Modeling and Normalization
A Data Warehouse!
o What is Data Warehouse?
o Data Warehousing -- It is a process
o Evolution
Data Warehouse -- Subject-Oriented
Data Warehouse -- Integrated
Data Warehouse -- Time Variant
Data Warehouse -- Non-Volatile
Data Warehouse vs. Heterogeneous DBMS
OLTP vs. OLAP
Entering Data into the Warehouse
Data Warehouse Components
Loading the Warehouse
Data Transformation Terms
Loading
Data -- Heart of the Data Warehouse
Vertical Partitioning
Schema Design
Dimension Tables
Star Schema
Snowflake schema
Fact Constellation
Cube: A Lattice of Cuboids
Defining a Star Schema in DMQL
Concept Hierarchy: Dimension (location)
On-Line Analytical Processing (OLAP)
Limitations of SQL
Typical OLAP Operations
Aggregates
Cube Operation
Cross Tab
o Cross Tabulation of sales by item-name and color
o Cross Tabulation With Hierarchy
o Relational Representation of Crosstabs
o Three-Dimensional Data Cube
Data warehouse implementation …
Relational OLAP: 3 Tier DSS
MOLAP: 2 Tier DSS
Data Warehouse Back-End Tools and Utilities
Index Structures
Bit Maps
Join
What to Materialize?
Meta data
Recipe for a Successful Warehouse
DW and OLAP Research Issues
Data Warehouse Usage
An OLAM Architecture
Summary
Exercises

Data Warehouse
Extracted from: Han, Data Mining: Concepts and Techniques; S. Sudarshan and Krithi Ramamritham, Slides

DATA WAREHOUSING AND DATA MINING

Objective
To gain an understanding of the data warehouse, the multi-dimensional data model, and OLAP.

Contents
Motivation
What is a data warehouse?
A multi-dimensional data model
Data warehouse implementation
From data warehousing to data mining

Motivation
Over the last 20 years, $1 trillion has been invested in new computer systems to gain competitive advantage. The vast majority of these systems have automated business processes, to make them faster, cheaper, and more responsive to the customer.
Electronic point of sale (EPOS) systems at supermarkets, itemized billing at telecommunication companies (telcos), and mass-market mailing at catalog companies are some examples of such "operational systems".

Present Business Scenario
Almost all businesses now have operational systems, so these systems no longer give them any competitive advantage. Over the years these systems have gathered a vast amount of data, and companies are now realizing the importance of this "hidden treasure" of information. Efforts are under way to tap into this information to improve the quality of decision-making.
A "data warehouse" is a repository of data collected from the various operational systems of an organization. This data is then comprehensively analyzed to gain competitive advantage. The analysis is used mainly in decision making at the top level.
A data warehousing system can perform advanced analyses of operational data without impacting operational systems.

Decision support systems & Transactional Databases
Operational or transactional databases are designed to process individual transactions quickly and efficiently. This type of transaction-based interaction is known as On-Line Transaction Processing (OLTP). OLTP is very fast and efficient at recording business transactions, but not so good at answering high-level strategic questions.
In contrast, decision support systems are subject oriented. They incorporate facilities for reporting, analyzing, and mining data about a particular topic such as heart disease.

Data Modeling and Normalization
Relationships:
o One-to-One Relationships
o One-to-Many Relationships
o Many-to-Many Relationships
Normal forms
o First Normal Form
o Second Normal Form
o Third Normal Form

[Figure: entities Vehicle-Type (Type ID, Make, Year) and Customer (Customer ID, Income Range)]

Table 6.1a • Relational Table for Vehicle-Type
Type ID   Make        Year
4371      Chevrolet   1995
6940      Cadillac    2000
4595      Chevrolet   2001
2390      Cadillac    1997

Table 6.1b • Relational Table for Customer
Customer ID   Income Range ($)   Type ID
0001          70–90K             2390
0002          30–50K             4371
0003          70–90K             6940
0004          30–50K             4595
0005          70–90K             2390

Table 6.2 • Join of Tables 6.1a and 6.1b
Customer ID   Income Range ($)   Type ID   Make        Year
0001          70–90K             2390      Cadillac    1997
0002          30–50K             4371      Chevrolet   1995
0003          70–90K             6940      Cadillac    2000
0004          30–50K             4595      Chevrolet   2001
0005          70–90K             2390      Cadillac    1997

Is this table Normalized?
"Is there a relationship between salary and type of a car?" This kind of relationship is not of interest in a transactional environment, but it is of primary importance to decision support.
Relationships showing this type of redundancy can only be observed by denormalizing the data.
This leads to new questions about
o which entities to combine, as well as
o how and where to store and maintain the combined entities
A better choice is to have two separate mechanisms:
o one for storing, maintaining, and processing transactional data, and
o another to house data for decision support

A Data Warehouse!
A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
o Important for large businesses which generate data from multiple divisions, possibly at multiple sites.
When transactional data is no longer of value to the operational environment, it is removed from the database.
If a business has no decision support (DS) facility, the data is archived and eventually destroyed. However, if there is a DS environment, the data is transported to some type of interactive medium commonly referred to as a data warehouse.

What is Data Warehouse?
A decision support database that is maintained separately from the organization's operational database.
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing:
o The process of constructing and using data warehouses

Data Warehousing -- It is a process
A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible.
A decision support database maintained separately from the organization's operational database

Problems faced by users
I can't find the data I need
o data is scattered over the network
o many versions, subtle differences
I can't get the data I need
o need an expert to get the data
I can't understand the data I found
o available data poorly documented
I can't use the data I found
o results are unexpected
o data needs to be transformed from one form to another

What are the users saying...
Data should be integrated across the enterprise
Summary data has a real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required

Evolution
60's: Batch reports
o hard to find and analyze information
o inflexible and expensive, reprogram every new request
70's: Terminal-based DSS and EIS (executive information systems)
o still inflexible, not integrated with desktop tools
80's: Desktop data access and analysis tools
o query tools, spreadsheets, GUIs
o easier to use, but only access operational databases
90's: Data warehousing with integrated OLAP engines and tools

Very Large Data Bases
Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Data Warehouse
A data warehouse is a
o subject-oriented
o integrated
o time-varying
o non-volatile
collection of data that is used primarily in organizational decision making.
Bill Inmon, Building the Data Warehouse 1996

Data Warehouse -- Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Example: [figure]

Data Warehouse -- Integrated
Constructed by integrating multiple, heterogeneous data sources
o relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
o Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources. E.g., Hotel price: currency, tax, breakfast covered, etc.
o When data is moved to the warehouse, it is converted.

Data Warehouse -- Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
o Operational database: current value data.
o Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
o Contains an element of time, explicitly or implicitly
o But the key of operational data may or may not contain a "time element".

Data Warehouse -- Non-Volatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
o Does not require transaction processing, recovery, and concurrency control mechanisms
o Requires only two operations in data accessing: initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
o Build wrappers/mediators on top of heterogeneous databases
o Query driven approach: when a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
Data warehouse: Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

OLTP vs. OLAP
OLTP (on-line transaction processing)
o Major task of traditional relational DBMS
o Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
o Major task of data warehouse system
o Data analysis and decision making

OLTP vs. OLAP, feature by feature:
users
o OLTP: clerk, IT professional, salespersons, administrator
o OLAP: business analysts, managers
primary purpose
o OLTP: day to day operations
o OLAP: decision support
DB design
o OLTP: application-oriented
o OLAP: subject-oriented
data
o OLTP: current, up-to-date, detailed, flat relational, isolated
o OLAP: historical, summarized, multidimensional, integrated, consolidated
usage
o OLTP: repetitive
o OLAP: ad-hoc
access
o OLTP: read/write, index/hash on prim. key
o OLAP: lots of scans
unit of work
o OLTP: short, narrow, planned, and simple updates and queries
o OLAP: broad, complex queries and analysis
# records accessed
o OLTP: tens; many constant updates and queries on one or few table rows
o OLAP: millions; periodic batch updates and queries requiring many or all rows
# users
o OLTP: thousands
o OLAP: hundreds
DB size
o OLTP: 100MB-GB
o OLAP: 100GB-TB
design goal
o OLTP: performance throughput, availability
o OLAP: ease of flexible access and use

Why Separate Data Warehouse?
High performance for both systems
o DBMS -- tuned for OLTP: access methods, indexing, concurrency control, recovery
o Warehouse -- tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
Different functions and different data:
o missing data: Decision support requires historical data which operational DBs do not typically maintain
o data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
o data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Entering Data into the Warehouse
A data mart is a data store that is similar to a warehouse but limits its focus to a single subject.
An independent data mart is structured by using operational data as well as external data sources.
A dependent data mart can be structured by using data from a warehouse.
External data represents items such as economic indicators and weather information (i.e., information not specific to the internal organization).

ETL (Extract, Transform, Load)
Responsibilities of ETL include:
o Extracting data from one or more of the input sources
o Cleaning and transforming the external data as necessary
o Loading the data into the warehouse

Data Warehouse Components
Components of the Warehouse
Data Extraction, Cleaning and Loading
The Warehouse
Analyze and Query -- OLAP Tools
Metadata
Data Mining tools

Need for Data warehouse
Can data mining continue to take place in environments not supporting a data warehouse?
Yes, but as the volume of data collected for decision support continues to grow, the need for organized, efficient data storage and retrieval architectures has become quite apparent; hence the need for the data warehouse.

Loading the Warehouse
Extract and Transform the data before it is loaded

Source Data
Typically host based, legacy applications
o Customized applications, COBOL, 3GL, 4GL
Point of Contact Devices
o POS, ATM, Call switches
External Sources
o Nielsen's, Acxiom, CMIE, Vendors, Partners

Data Quality - The Reality
It is tempting to think that creating a data warehouse is simply a matter of extracting operational data and entering it into a data warehouse
Nothing could be farther from the truth
Warehouse data comes from disparate, questionable sources

Data Integration Across Sources
Data Transformation Example

Data Integrity Problems
Same person, different spellings
o Agarwal, Agrawal, Aggarwal etc...
Multiple ways to denote company name
o Persistent Systems, PSPL, Persistent Pvt. LTD.
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
o manual entry leads to mistakes
o "in case of a problem use 9999999"

Data Transformation Terms
Extracting, Conditioning, Scrubbing, Merging, Householding, Enrichment, Scoring, Loading, Validating, Updating

Extracting
o Capture of data from operational source in "as is" status
o Sources for data generally in legacy mainframes in VSAM, IMS, IDMS, DB2; more data today in relational databases on Unix
Conditioning
o The conversion of data types from the source to the target data store (warehouse) -- always a relational database
Householding
o Identifying all members of a household (living at the same address)
o Ensures only one mail is sent to a household
o Can result in substantial savings, e.g. catalog postage.
Enrichment
o Bring data from external sources to augment/enrich operational data. Data sources include Dun and Bradstreet, A. C. Nielsen, CMIE, IMRA etc...
Scoring
o computation of a probability of an event, e.g., chance that a customer will defect to AT&T from MCI, chance that a customer is likely to buy a new product

Scrubbing/Cleaning Data
Sophisticated transformation tools.
Used to improve the quality of data
Clean data is vital for the success of the warehouse
Example
o Naming problems, Coding, …
Cleaning Tools
o Apertus -- Enterprise/Integrator
o Vality -- IPE
o Postal Soft

Loading
After extracting, scrubbing, cleaning, validating etc.
need to load the data into the warehouse
Issues
o huge volumes of data to be loaded
o small time window available when warehouse can be taken off line (usually nights)
o when to build index and summary tables
o allow system administrators to monitor, cancel, resume, change load rates
o Recover gracefully -- restart after failure from where you were and without loss of data integrity

When to Refresh?
periodically (e.g., every night, every week) or after significant events
on every update: not warranted unless warehouse data require current data (up to the minute stock quotes)
refresh policy set by administrator based on user needs and traffic
possibly different policies for different sources

Refresh Techniques
Full Extract from base tables
o read entire source table: too expensive
o maybe the only choice for legacy systems

Detecting Changes
Create a snapshot log table to record ids of updated rows of source data and timestamp
Detect changes by:
o Defining after row triggers to update snapshot log when source table changes
o Using regular transaction log to detect changes to source data

Data -- Heart of the Data Warehouse
Heart of the data warehouse is the data itself!
Corporate memory
Data is organized in a way that represents business -- subject orientation

Data Warehouse Structure
Subject Orientation -- customer, product, policy, account etc...
A subject may be implemented as a set of related tables.
o E.g., customer may be five tables
o base customer (1985-87): custid, from date, to date, name, phone, dob
o base customer (1988-90): custid, from date, to date, name, credit rating, employer
o customer activity (1986-89) -- monthly summary
o customer activity detail (1987-89): custid, activity date, amount, clerk id, order no
o customer activity detail (1990-91): custid, activity date, amount, line item no, order no

Data Analysis and OLAP
Aggregate functions summarize large volumes of data
Online Analytical Processing (OLAP)
o Interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion (with negligible delay)
Data that can be modeled as dimension attributes and measure attributes are called multidimensional data.

Measure and Dimension attributes
Measure attributes measure some value, and can be aggregated upon. For instance, the attribute number of the sales relation is a measure attribute, since it measures the number of units sold.
Dimension attributes define the dimensions on which measure attributes, and summaries of measure attributes, are viewed.

Data Granularity in Warehouse
Summarized data stored
o reduces storage costs
o reduces cpu usage
o increases performance since a smaller number of records has to be processed
o design around traditional high level reporting needs
o tradeoff with volume of data to be stored and detailed usage of data

Granularity in Warehouse
Can not answer some questions with summarized data
o Did John call Sara last month? Not possible to answer if only the total duration of calls by John over a month is maintained and individual call details are not.
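
The granularity point above can be illustrated with a minimal sketch. It assumes a hypothetical list of call-detail records; the names `calls`, `monthly_summary`, and `did_call` are illustrative and do not come from the source. The monthly summary keeps only total minutes per caller, so the question about John and Sara can only be answered from the detail records.

```python
from collections import defaultdict

# Hypothetical call-detail records: (caller, callee, day_of_month, minutes).
calls = [
    ("John", "Sara", 3, 12),
    ("John", "Mike", 9, 5),
    ("Anna", "Sara", 14, 7),
]

def monthly_summary(records):
    """Summarized granularity: total call minutes per caller for the month."""
    totals = defaultdict(int)
    for caller, _callee, _day, minutes in records:
        totals[caller] += minutes
    return dict(totals)

def did_call(records, caller, callee):
    """Detail granularity: answerable only while individual calls are kept."""
    return any(c == caller and d == callee for c, d, _day, _minutes in records)

summary = monthly_summary(calls)
print(summary)                          # callee information is gone here
print(did_call(calls, "John", "Sara"))  # needs the detail records
```

The summary is much smaller than the detail data, which is the storage and CPU saving listed above; the price is that any question about individual calls has to go back to the detail level.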
Detailed data too voluminous
Tradeoff is to have dual level of granularity
o Store summary data on disks: 95% of DSS processing done against this data
o Store detail on tapes: 5% of DSS processing against this data

Vertical Partitioning

Schema Design
Data warehouse organization
o must look like business
o must be recognizable by business user
o approachable by business user
o Must be simple
Schema Types
o Star Schema
o Fact Constellation Schema
o Snowflake schema
Two types of tables
o Dimension
o Measure (Fact)

Dimension Tables
Dimension tables
o Define business in terms already familiar to users
o Wide rows with lots of descriptive text
o Tables with few rows
o Joined to fact table by a foreign key
o heavily indexed
o typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Fact (measure) Table
Contains the measure attributes
Central table
o mostly raw numeric items
o narrow rows, a few columns at most
o large number of rows (millions to a billion)
o Access via dimensions

Star Schema
A single fact table and for each dimension one dimension table
Does not capture hierarchies directly

Snowflake schema
Represent dimensional hierarchy directly by normalizing tables.
Easy to maintain and saves storage

Fact Constellation
Multiple fact tables that share many dimension tables
Booking and Checkout may share many dimension tables in the hotel industry

De-normalization
Normalization in a data warehouse may lead to lots of small tables
Can lead to excessive I/O's since many tables have to be accessed
De-normalization is the answer especially since updates are rare

Multidimensional data model
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
o Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
o Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
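
As a rough illustration of the lattice of cuboids, the sketch below enumerates every subset of the dimensions and aggregates a measure for each one. The fact rows, the three dimension names, and the helper functions `cuboid` and `lattice` are assumptions made for this example; only the measure name dollars_sold follows the text.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical fact rows: (item, time, location, dollars_sold).
DIMENSIONS = ("item", "time", "location")
facts = [
    ("TV", "Q1", "Delhi", 100),
    ("TV", "Q2", "Delhi", 150),
    ("PC", "Q1", "Pune", 200),
    ("PC", "Q1", "Delhi", 50),
]

def cuboid(rows, dims):
    """Sum dollars_sold grouped by the chosen subset of dimensions."""
    idx = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)

def lattice(rows):
    """All 2^n cuboids, from the n-D base cuboid down to the 0-D apex."""
    result = {}
    for k in range(len(DIMENSIONS), -1, -1):
        for dims in combinations(DIMENSIONS, k):
            result[dims] = cuboid(rows, dims)
    return result

cube = lattice(facts)
for dims, cells in cube.items():
    print(dims or "apex", cells)
```

With three dimensions the lattice has 2^3 = 8 cuboids. The empty tuple of dimensions is the apex cuboid, holding the single grand total, and the full tuple is the base cuboid.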

Cube: A Lattice of Cuboids

Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
o Star schema: A fact table in the middle connected to a set of dimension tables
o Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables
o Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema
Example of Snowflake Schema
Example of Fact Constellation

A Data Mining Query Language, DMQL: Language Primitives
Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
o First time as "cube definition"
o define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Defining a Star Schema in DMQL
    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)

Defining a Snowflake Schema in DMQL
    define cube sales_snowflake [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
    define dimension branch as
        (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining a Fact Constellation in DMQL
    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
    define cube shipping [time, item, shipper, from_location, to_location]:
        dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
    define dimension time as time in cube sales
    define dimension item as item in cube sales
    define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
    define dimension from_location as location in cube sales
    define dimension to_location as location in cube sales

Concept Hierarchy: Dimension (location)
Concept Hierarchy … Example of dbminer software
Sales volume as a function of product, month, and region
A Sample Data Cube
Cuboids Corresponding to the Cube

On-Line Analytical Processing (OLAP)
Making Decision Support Possible

Typical OLAP Queries
Write a multi-table join to compare sales for each product this year vs. last year.
Repeat the above process to find the top 5 product contributors to margin.
Repeat the above process to find the sales of a product line to new vs. existing customers.
Repeat the above process to find the customers that have had negative sales growth.

Limitations of SQL

What Is OLAP?
Online Analytical Processing: term coined by E. F. Codd in a 1994 paper contracted by Arbor Software
Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System, Multidimensional Database

The OLAP Market

Rapid growth in the enterprise market
   o 1995: $700 Million
   o 1997: $2.1 Billion
Significant consolidation activity among major DBMS vendors
   o 10/94: Sybase acquires ExpressWay
   o 7/95: Oracle acquires Express
   o 11/95: Informix acquires Metacube
   o 1/97: Arbor partners up with IBM
   o 10/96: Microsoft acquires Panorama
Result: OLAP shifted from small vertical niche to mainstream DBMS category

Strengths of OLAP

It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and outliers
Many vendors offer OLAP tools

Multi-dimensional Data

Data Cube Lattice

Cube lattice
Can materialize some group-bys, compute others on demand
Question: which group-bys to materialize?
Question: what indices to create?
Question: how to organize data (chunks, etc.)?

Typical OLAP Operations

Pivot (rotate):
   o reorient the cube, visualization, 3D to series of 2D planes.
Roll up: summarize data
   o by climbing up hierarchy or by dimension reduction
Drill down: reverse of roll-up
   o from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice:
   o project and select

Browsing a Data Cube

Visualization
OLAP capabilities
Interactive manipulation

Visualizing Neighbors is simpler

A Visual Operation: Pivot (Rotate)

Relational presentation vs multi-dimensional cube

Aggregates

Add up amounts for day 1
   In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
Add up amounts by day
   In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

Another Example

Add up amounts by day, product
   In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId

Roll-up and Drill Down

Aggregates

Operators: sum, count, max, min, median, avg
"Having" clause
Using dimension hierarchy
   o average by region (within store)
   o maximum by month (within date)

Cube Aggregation

Example: computing sums

Cube Operators

Extended Cube

Aggregation Using Hierarchies

"Slicing and Dicing"

A Star-Net Query Model

Nature of OLAP Analysis

Aggregation: total sales, percent-to-total
Comparison: Budget vs. Expenses
Ranking: Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Visualization

Multidimensional Spreadsheets

Analysts need spreadsheets that support
   o pivot tables (cross-tabs)
   o drill-down and roll-up
   o slice and dice
   o sort
   o selections
   o derived attributes
E.g., Excel pivot table

SQL Extensions

Front-end tools require
   o Extended Family of Aggregate Functions
        rank, median, mode
   o Reporting Features
        running totals, cumulative totals
   o Results of multiple group by
        total sales by month and total sales by product
   o Data Cube

Cube Operation

Cube definition and computation in DMQL
   define cube sales[item, city, year]: sum(sales_in_dollars)
   compute cube sales
Transform it into a SQL-like language (with a new operator, cube by, introduced by Gray et al. '96)
   SELECT item, city, year, SUM (amount)
   FROM SALES
   CUBE BY item, city, year
Need to compute the following group-bys, one per subset of the three dimensions:
   (item, city, year), (item, city), (item, year), (city, year), (item), (city), (year), ()

Typical OLAP Problems

Data Explosion

On-line Analytical Processing (OLAP)

OLAP is a query-based methodology that supports data analysis in a multidimensional environment.
OLAP is a valuable tool for
   o verifying or refuting human-generated hypotheses and
   o performing manual data mining.
OLAP engine logically structures multidimensional data in the form of a cube.

[Figure: data cube with axes Month (Jan. to Dec.), Category (Miscellaneous, Restaurant, Retail, Vehicle, Travel, Supermarket), and Region (One, Two, Three, Four); highlighted cell: Month = Dec., Category = Vehicle, Region = Two, Amount = 6,720, Count = 110]

The Cube

The Cube displays three dimensions
   o Purchase category
   o Time in months
   o Regions
It is difficult to picture a data cube having more than three dimensions, but crosstabs can be used as views on a data cube

Cross Tab

A cross-tab is a table where
   o values for one of the dimension attributes form the row headers, and values for another dimension attribute form the column headers
        Other dimension attributes are listed on top
   o values in individual cells are (aggregates of) the values of the dimension attributes that specify the cell.
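
The cross-tab definition above can be sketched in a few lines of Python. This is a minimal illustration and not taken from the source: the sales rows, the function names cross_tab and render, and the "all" label used for totals are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical sales relation: (item_name, color, quantity) rows.
SALES = [
    ("skirt", "dark", 8),
    ("skirt", "pastel", 35),
    ("dress", "dark", 20),
    ("dress", "pastel", 10),
    ("skirt", "dark", 2),
]

def cross_tab(rows):
    """Aggregate (row_key, col_key, value) rows into a cross-tab with totals.

    Returns a dict mapping (row_key, col_key) to the summed value; the
    label "all" marks row totals, column totals, and the grand total.
    """
    cells = defaultdict(int)
    for row_key, col_key, value in rows:
        cells[(row_key, col_key)] += value   # individual cell
        cells[(row_key, "all")] += value     # row total
        cells[("all", col_key)] += value     # column total
        cells[("all", "all")] += value       # grand total
    return dict(cells)

def render(cells):
    """Format the cross-tab as tab-separated lines, one per row header."""
    row_keys = sorted({r for r, _ in cells if r != "all"}) + ["all"]
    col_keys = sorted({c for _, c in cells if c != "all"}) + ["all"]
    lines = ["\t".join(["item"] + col_keys)]
    for r in row_keys:
        lines.append("\t".join([r] + [str(cells.get((r, c), 0)) for c in col_keys]))
    return lines

if __name__ == "__main__":
    print("\n".join(render(cross_tab(SALES))))
```

Row headers come from one dimension attribute (item name), column headers from another (color), and each cell holds the aggregate of the rows that share both values. The "all" entries play the role of the summary row and column usually shown in a pivot table.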
Cross Tabulation of sales by item-name and color

A table of this kind is an example of a cross-tabulation (cross-tab), also referred to as a pivot table.

Cross Tabulation With Hierarchy

   o Crosstabs can be easily extended to deal with hierarchies
        Can drill down or roll up on a hierarchy

Relational Representation of Crosstabs

Crosstabs can be represented as relations

Three-Dimensional Data Cube

A data cube is a multidimensional generalization of a crosstab
A three-dimensional object cannot be viewed in its entirety, but crosstabs can be used as views on a data cube

Contents

Motivation
What is a data warehouse?
A multi-dimensional data model
Data warehouse implementation
From data warehousing to data mining

Data warehouse implementation …

ROLAP servers vs. MOLAP servers
Index Structures
What to Materialize?
Metadata

Relational OLAP (ROLAP) Architecture

Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces

Relational OLAP: 3 Tier DSS

Multidimensional OLAP (MOLAP) Architecture

Array-based multidimensional storage engine (sparse matrix techniques) providing fast indexing to pre-computed summarized data

MOLAP: 2 Tier DSS

Data Warehouse Back-End Tools and Utilities

Data extraction:
   o get data from multiple, heterogeneous, and external sources
Data cleaning:
   o detect errors in the data and rectify them when possible
Data transformation:
   o convert data from legacy or host format to warehouse format
Load:
   o sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh:
   o propagate the updates from the data sources to the warehouse

Index Structures

Traditional Access Methods
   o B-trees, hash tables, …
Popular in Warehouses
   o inverted lists
   o bit map indexes
   o join indexes

Inverted Lists

Using Inverted Lists

Query:
   o Get people with age = 20 and name = "fred"
List for age = 20: r4, r18, r34, r35
List for name = "fred": r18, r52
Answer is intersection: r18

Bit Maps

Using Bit Maps

Query:
   o Get people with age = 20 and name = "fred"
Bit vector for age = 20: 1101100000
Bit vector for name = "fred": 0100000001
Answer is intersection (bitwise AND): 0100000000
Good if domain cardinality is small
Bit vectors can be compressed

Join

"Combine" SALE, PRODUCT relations
   In SQL: SELECT * FROM SALE, PRODUCT
   (written without a WHERE clause this returns the cross product; a join adds a condition that matches the product key of the two relations)

Join Indexes

What to Materialize?

Store in warehouse results useful for common queries
Example:

Materialization Factors

Type/frequency of queries
Query response time
Storage cost
Update cost

Cube Aggregates Lattice

Meta data

Meta data is the data defining warehouse objects. It has the following kinds
   o Description of the structure of the warehouse
        schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
   o Operational meta-data
        data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
   o The algorithms used for summarization
   o The mapping from operational environment to the data warehouse
   o Data related to system performance
        warehouse schema, view and derived data definitions
   o Business data
        business terms and definitions, ownership of data, charging policies

Recipe for a Successful Warehouse

From Larry Greenfield, http://pwp.starnetinc.com/larryg/index.html
From day one establish that warehousing is a joint user/builder project
Establish that maintaining data quality will be an ONGOING joint user/builder responsibility
Train the users one step at a time
Look closely at the data extracting, cleaning, and loading tools
Determine a plan to test the integrity of the data in the warehouse

Data Warehouse Pitfalls

You are going to spend much time extracting, cleaning, and loading data
Despite best efforts at project management, data warehousing project scope will increase
You are going to find problems with systems feeding the data warehouse
You will find the need to store data not being captured by any existing system
You will need to validate data not being validated by transaction processing systems
Some transaction processing systems feeding the warehousing system will not contain detail
After end users receive query and report tools, requests for IS written reports may increase
'Overhead' can eat up great amounts of disk space
You will fail if you concentrate on resource optimization to the neglect of project, data, and customer management issues and an understanding of what adds value to the customer

DW and OLAP Research Issues

Data cleaning
   o focus on data inconsistencies, not schema differences
   o data mining techniques
Physical Design
   o design of summary tables, partitions, indexes
   o tradeoffs in use of different indexes
Query processing
   o selecting appropriate summary tables
   o dynamic optimization with feedback
   o query optimization: cost estimation, use of transformations, search strategies
   o partitioning query processing between OLAP server and backend server.
Warehouse Management
   o resource management
   o incremental refresh techniques
   o computing summary tables during load
   o failure recovery during load and refresh
   o process management: scheduling queries, load and refresh
   o query processing, caching
   o use of workflow technology for process management

Contents

Decision support system
Transactional Databases
What is a data warehouse?
A multi-dimensional data model
Data warehouse implementation
From data warehousing to data mining

Data Warehouse Usage

Three kinds of data warehouse applications
   o Information processing: informative reporting
        supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
   o Analytical processing
        multidimensional data analysis of data warehouse data
        supports basic OLAP operations, slice-dice, drilling, pivoting
   o Data mining
        knowledge discovery from hidden patterns
        supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)

Why online analytical mining?
   o High quality of data in data warehouses
        DW contains integrated, consistent, cleaned data
   o Available information processing structure surrounding data warehouses
        ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
   o OLAP-based exploratory data analysis
        mining with drilling, dicing, pivoting, etc.
   o On-line selection of data mining functions
        integration and swapping of multiple mining functions, algorithms, and tasks.

An OLAM Architecture

Summary

Data warehouse
   o A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
A multi-dimensional model of a data warehouse
   o Star schema, snowflake schema, fact constellations
   o A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP
Efficient computation of data cubes
   o Partial vs. full vs. no materialization
   o Multiway array aggregation
   o Bitmap index and join index implementations
Further development of data cube technology
   o From OLAP to OLAM (on-line analytical mining)

Exercises

Use SQL Server to create a cube. Your cube should have at least three dimensions. Experiment with the OLAP operations.
Explore the cube and OLAP capabilities in relational DBMSs such as SQL Server, Oracle, PostgreSQL, MySQL, …
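
As a warm-up for the exercises, the cube operator discussed in this chapter can be tried in plain Python before moving to a DBMS. The sketch below computes every group-by of a small, made-up sales relation; the rows, the dimension names, and the function name compute_cube are assumptions chosen for illustration. With n dimensions there are 2^n group-bys, which is where the data explosion problem comes from. The approach is deliberately naive: it rescans the rows once per group-by and is not one of the efficient cube-computation methods named in the summary.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (item, city, year, amount).
DIMENSIONS = ("item", "city", "year")
SALES = [
    ("tv", "Vancouver", 2000, 400),
    ("tv", "Toronto", 2000, 300),
    ("pc", "Vancouver", 2001, 900),
    ("pc", "Vancouver", 2000, 100),
]

def compute_cube(rows, dimensions):
    """Return {group_by_dims: {key_tuple: sum(amount)}} for all 2^n group-bys.

    Each row is a tuple whose first len(dimensions) fields are dimension
    values and whose last field is the measure to be summed.
    """
    cube = {}
    n = len(dimensions)
    # Walk from the most detailed group-by (all dimensions) down to ().
    for k in range(n, -1, -1):
        for idx in combinations(range(n), k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in idx)
                totals[key] += row[-1]
            cube[tuple(dimensions[i] for i in idx)] = dict(totals)
    return cube

if __name__ == "__main__":
    for dims, totals in compute_cube(SALES, DIMENSIONS).items():
        print(dims, totals)
```

In this representation a roll-up corresponds to moving from one group-by to a coarser one, for example from ("item", "city") to ("item",), and a slice corresponds to keeping only the keys with a fixed value in one position.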