From Transaction Processing to Support for Decision Making
CIS 671
Decision Support Systems

Computerized Information Systems
• Used to “run the business”.
• OSU Examples
  – Personnel & Payroll (ARMS)
  – Course Offerings
  – Students, including course enrollments and grades
    • (estimated $30M to replace)
  – Inventory
• Transaction Processing

1st Generation DBMS
• Designed for Transaction Processing
  – Hierarchical (IBM IMS)
  – Network
• Management Information Systems
  – Added later
  – Mostly standard summary reports
    • Produced on a regular basis

Relational DBMS
• Codd – particularly designed for “ad hoc” queries
• First uses for Transaction Processing
• Transaction Data now available on-line
  – Use it to help Decision Making
  – Ad Hoc

Decision Support Systems (DSS)
• Use comprehensive view of all aspects of business.
  – Different business units
  – Historical data
  – Summary information
• Classes of analysis tools:
  – Complex “traditional” SQL queries
  – Many “group-by” and “aggregation” queries (On Line Analytical Processing)
  – Exploratory data analysis - Data Mining

Data Warehousing
• Properties
  – Consolidated data from many sources
  – Spanning long time periods
  – Augmented with summary information
• Size: several gigabytes to terabytes

Data Warehouse Creation
• Integrate schemas from different groups
  – Semantic mismatches
    • Different currencies
    • Different names for same attributes
    • Different structures for similar tables

Data Warehouse Creation, cont.
• Extract data from different operational databases and other external sources
  – Clean data - correct errors, fill in missing data
  – Transform data to match integrated schema
  – Load data into warehouse
  – Refresh data in a timely fashion
  – Purge very old data
  – Create metadata repository
    • May be so large that it is in a separate database

Data Warehouse - Provide Variety of Analytical Tools
  – Complex “traditional” SQL queries
  – OLAP query engine
  – Data mining algorithm
  – Information visualization tools
  – Statistical packages
  – Report generators

Data Mart
• Departmental subset of a data warehouse
• Top-down approach
  – Derive from the organization’s data warehouse
  – May be too hard to do all at once
• Bottom-up approach
  – Initially create departmental data marts
  – Integrate data marts into organizational data warehouse
  – If not done carefully, may be hard to integrate

OLTP vs. Data Warehouse DBs
(from Toby J. Teorey, Database Modeling & Design, Morgan Kaufmann, 1999, p. 212)

OLTP                                                      Data Warehouse
Transaction oriented                                      Subject oriented
Thousands of users                                        Few users (on the order of 100)
Small (MB to several GB)                                  Large (hundreds of GB to several TB)
Current data                                              Historical data
Normalized data (many tables, few columns per table)      Denormalized data (few tables, many columns per table)
Continuous update                                         Batch updates
Simple to complex queries                                 Usually very complex queries

Complex “traditional” SQL queries
• Relational DBMS optimized for decision support
  – in contrast to a DBMS optimized for transaction processing
• Example:
  – Teradata machine from NCR

On Line Analytical Processing (OLAP)
Multidimensional Databases (MDD)

Example from Finkelstein [Fink95]:

SALES_INFO
Branch   ProdID   Date     Sales       Returns
BOS      1        1/2/98   $1,000.00   4
NY       1        1/2/98   $1,222.00   2
CMH      2        1/3/98   $555.00     1
SF       2        1/3/98   $1,777.00   9

BRANCH_INFO
Branch   Region
BOS      A
NY       A
CMH      B
SF       C

PROD_INFO
ProdID   Description    Category
1        Widget         I
2        Super Widget   II

REGION_INFO
Region   Territory
A        East
B        East
C        West

• Note that Branch, ProdID, Date → Sales, Returns
• Note the multidimensionality of the SALES_INFO table.

Dimension Hierarchies (coarsest to finest)
• LOCATION: Territory, Region, Branch
• TIME: Year, Quarter, Week / Month, Date
• PRODUCT: Category, ProdID

Possible queries:
1. How did product Widget sell in the last month, and how does this figure compare with sales over the last five years? How about by branch, region and territory?
2. Did this product sell better in different regions, and are there any regional trends?
3. Were there more returns of Widgets over the last year? Were these returns caused by defects? Were they manufactured in any particular plants?

Additional possible query:
4. Do commissions and pricing affect how salespersons sell the product? Do particular salespersons do a better job of selling the product?
Note that a "multidimensional" spreadsheet would be useful. Codd called this type of problem On Line Analytical Processing (OLAP), in contrast to On Line Transaction Processing (OLTP).

Codd's rules for OLAP: [Codd93]
1. Multi-Dimensional Concept View
   The user should be able to see the data as being multidimensional insofar as it should be easy to 'pivot' or 'slice and dice'. (See later.)
2. Transparency
   The OLAP functionality should be provided behind the user's existing software without adversely affecting the functionality of the 'host'.
3. Accessibility
   OLAP should allow the user to access diverse data stores but see the data within a common 'schema' provided by the OLAP tool.
4. Consistent Reporting Performance
   There should not be significant degradation in performance with large numbers of dimensions or large quantities of data.
5. Client-Server Architecture
   Since much of the data is on mainframes, and the users work on PCs, the OLAP tool must be able to bring the two together!
6. Generic Dimensionality
   Data dimensions must all be treated equally. Functions available for one dimension must be available for others.
7. Dynamic Sparse Matrix Handling
   The OLAP tool should be able to work out for itself the most efficient way to store sparse matrix data.
8. Multi User Support
   This is self-evident.
9. Unrestricted Cross-Dimensional Operations
   E.g., individual office overheads are allocated according to total corporate overheads divided in proportion to individual office sales.
10. Intuitive Data Manipulation
    Navigation should be done by operations on individual cells rather than menus.
11. Flexible Reporting
    Row and column headings must be capable of more than one dimension each, and of displaying subsets of any dimension.
12. Unlimited Dimensions and Aggregation Levels
    At least 15 dimensions may be required, and within each there may be many hierarchical levels.

“Pivoting”: Cross Tabulation
Sales by Date and Region (from the SALES_INFO example of [Fink95])

           Region
Date       A        B      C        Total
1/2/98     $2,222   $0     $0       $2,222
1/3/98     $0       $555   $1,777   $2,332
Total      $2,222   $555   $1,777   $4,554

• “Drill Down” (narrower category): in the cross tabulation above, replace Region by Branch.
• “Rollup” (more general category): replace Region by Territory.

OLAP Questions
1. Query language - how to say what's wanted.
2. Processing language - how to specify calculations: ratios, variances, ...
3. Data visualization - how to see the data.
4. Performance - time to process the query (5 second rule).

OLAP References
[Codd93] E. F. Codd, S. B. Codd, and C. T. Salley, "Providing OLAP to User Analysts: An IT Mandate," Codd & Date Inc., 1993.
[Fink95] Richard Finkelstein, "MDD: Database Reaches the Next Dimension," DATABASE Programming and Design, 8(4), April 1995.

Exploratory Data Analysis: Data Mining
• Find interesting trends or patterns in large data sets.
• Statistics - Exploratory Data Analysis
• Artificial Intelligence - Knowledge Discovery and Machine Learning
• Much larger data sets

Mining for Association Rules
• Classic example
• Market basket analysis
  – Record each customer transaction at a grocery store.
  – Try to identify sets of items purchased together.

TransID   Item
111       coke
111       chips
111       dip
112       coke
112       chips
112       veggies
113       coke
113       beef
113       chicken
114       chips
114       beef
115       chips
115       chicken

Association Rule: {coke} ⇒ {chips}
People who buy coke usually buy chips.

Measures for Association Rule {LHS} ⇒ {RHS}
• Support: % of transactions containing this set of items. (2/5 = 40%)
• Confidence: given all transactions containing the LHS items, the % that also contain the RHS. (2/3 = 67%)
Want both to be “reasonably” large.

On-Line Analytical Processing (OLAP)
Part II: CIS 671
Elmasri & Navathe §26.1

Multi-dimensional View of Data
• Fact table (also called a cube)
  – Dimension attributes
  – Dependent attributes (functions of the dimension attributes)
• Dimension tables, potentially one for each dimension

OLAP Operations
• Roll-up – increase the level of aggregation
• Drill-down – decrease the level of aggregation
• Slice-and-dice – selection and projection, i.e., reduce dimensionality of the data
• Pivot – re-orient the dimensional view

Implementation Approaches
• Relational OLAP (ROLAP) Servers
  – Data stored in a relational system
  – SQL extended
    • To allow easy OLAP query expression
    • To provide efficient OLAP query execution.
• Multidimensional OLAP (MOLAP)
  – Systems directly store multidimensional data in special data structures
  – OLAP operations implemented directly on these data structures.
• Hybrid OLAP (HOLAP)
  – Combines ROLAP and MOLAP.
  – Detail records (largest volume) in relational database.
  – Aggregations in a “separate, but connected” MOLAP store.
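
The OLAP operations listed in this part (roll-up, drill-down, slice, pivot) can be tried on the small SALES_INFO example from [Fink95]. The following is only a minimal sketch in plain Python, not how a ROLAP or MOLAP server implements them; the helper names `cross_tab`, `to_region` and `to_territory` are hypothetical, and the data is copied from the SALES_INFO, BRANCH_INFO and REGION_INFO tables.

```python
from collections import defaultdict

# Rows of SALES_INFO from the [Fink95] example: (branch, prod_id, date, sales).
SALES_INFO = [
    ("BOS", 1, "1/2/98", 1000.00),
    ("NY", 1, "1/2/98", 1222.00),
    ("CMH", 2, "1/3/98", 555.00),
    ("SF", 2, "1/3/98", 1777.00),
]
BRANCH_TO_REGION = {"BOS": "A", "NY": "A", "CMH": "B", "SF": "C"}  # BRANCH_INFO
REGION_TO_TERRITORY = {"A": "East", "B": "East", "C": "West"}     # REGION_INFO


def cross_tab(rows, location_level):
    """Sum sales per (date, location). location_level maps a branch to the
    chosen level of the LOCATION hierarchy (hypothetical helper)."""
    totals = defaultdict(float)
    for branch, _prod_id, date, sales in rows:
        totals[(date, location_level(branch))] += sales
    return dict(totals)


def to_region(branch):
    return BRANCH_TO_REGION[branch]


def to_territory(branch):
    return REGION_TO_TERRITORY[BRANCH_TO_REGION[branch]]


by_region = cross_tab(SALES_INFO, to_region)        # Sales by Date and Region
by_branch = cross_tab(SALES_INFO, lambda b: b)      # drill-down: Region -> Branch
by_territory = cross_tab(SALES_INFO, to_territory)  # roll-up: Region -> Territory

# Slice: fix one value of the date dimension, leaving a one-dimensional view.
slice_jan3 = {loc: total for (date, loc), total in by_region.items()
              if date == "1/3/98"}

# Pivot: re-orient the same cells so that location comes first.
pivoted = {(loc, date): total for (date, loc), total in by_region.items()}

print(by_region)
print(by_territory)
```

Only non-empty cells are stored here, so the $0 entries of the cross tabulation simply do not appear as keys; that is one simple way to handle sparse data, chosen for brevity.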
Example: a Star Schema
• Fact table
  – Sales(OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice)
• Dimension tables
  – Order(OrderNo, OrderDate)
  – Customer(CustomerNo, CustomerName, CustomerAddress, City)
  – Salesperson(SalespersonID, SalespersonName, City, Quota)
  – Product(ProdNo, ProdName, ProdDescr, Category, CategoryDescr, UnitPrice, QOH)
  – Date(DateKey, Date, Month, Year)
  – City(CityName, State, Region)

Snowflake Schema
• Fact table
  – Sales(OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice)
• Dimension tables
  – Order(OrderNo, OrderDate)
  – Customer(CustomerNo, CustomerName, CustomerAddress, City)
  – Salesperson(SalespersonID, SalespersonName, City, Quota)
  – Product(ProdNo, ProdName, ProdDescr, Category, UnitPrice, QOH)
  – Category(CategoryName, CategoryDescr)
  – Date(DateKey, Date, Month, Year)
  – Month(Month, Year)
  – City(CityName, State, Region)
  – State(State, Region)
  – Year
  – Region

Data Cubes
• Precompute all possible aggregations.
• Required extra storage is tolerable.
• Little penalty to keep aggregate up-to-date if data does not change.
• Normally some aggregation of raw data is done before it is entered into the data cube.

Data Cube with Orders Accumulated
• Sales table
  – Sales(SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
• Dimension tables
  – Customer(CustomerNo, CustomerName, CustomerAddress, City)
  – Salesperson(SalespersonID, SalespersonName, City, Quota)
Note that the average for any aggregate can be calculated from TotalValue and Quantity.
  – Product(ProdNo, ProdName, ProdDescr, Category, UnitPrice, QOH)
  – Category(CategoryName, CategoryDescr)
  – Date(DateKey, Date, Month)
  – Month(Month, Year)
  – City(CityName, State)
  – State(State, Region)
  – Year
  – Region

Sample of Aggregates in the CUBE

Sales
SalespersonID   CustomerNo   ProdNo   DateKey   CityName     Quantity   TotalValue
22              11           100      2         'Columbus'   3          300

CUBE(Sales)
SalespersonID   CustomerNo   ProdNo   DateKey   CityName     Quantity   TotalValue
22              11           100      2         'Columbus'   3          300
22              *            100      2         'Columbus'   6          2222
22              *            *        2         'Columbus'   25         33000
*               *            *        2         'Columbus'   75         90000
*               *            *        *         'Columbus'   200        503444

How to answer a query given the relation CUBE(Sales)
Choose tuples in CUBE(Sales) with the following properties:
1. Query specifies value v for attribute a ⇒ tuple t has v in its component for a.
2. Query groups by attribute a ⇒ tuple t has any non-* value in its component for a.
3. Query neither groups by attribute a nor specifies a value for a ⇒ tuple t has * in its component for a.

How to answer a query given the relation CUBE(Sales): example
    select CustomerNo, avg(Price)
    from Sales
    where SalespersonID = 22
    group by CustomerNo
Answer: Result(c, v/n) for each tuple of Cube(Sales) of the form
    (SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
     22             c           *       *        *         n         v

Cube Implementation by Materialized Views
• Dimensions may have hierarchies.
  – Product, Category
  – City, State, Region

Example: Materialized Views
Cube(Sales) (SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
City (CityName, State, Region)

    insert into SalesV1
    select SalespersonID, CustomerNo, Month, State,
           sum(Quantity) as Quantity, sum(TotalValue) as TotalValue
    from Sales join City on Sales.CityName = City.CityName
    group by SalespersonID, CustomerNo, Month, State;

    insert into SalesV2
    select SalespersonID, CustomerNo, Month, Region,
           sum(Quantity) as Quantity, sum(TotalValue) as TotalValue
    from Sales join City on Sales.CityName = City.CityName
    group by SalespersonID, CustomerNo, Month, Region;

Example: Query 1
    select SalespersonID, sum(TotalValue)
    from Sales
    group by SalespersonID;
Answer by
    select SalespersonID, sum(TotalValue)
    from SalesV1
    group by SalespersonID;
or by
    select SalespersonID, sum(TotalValue)
    from SalesV2
    group by SalespersonID;

Example: Query 2
    select SalespersonID, State, sum(TotalValue)
    from Sales
    group by SalespersonID, State;
Answer only by
    select SalespersonID, State, sum(TotalValue)
    from SalesV1
    group by SalespersonID, State;

Example: Query 3
    select SalespersonID, State, Date, sum(TotalValue)
    from Sales
    group by SalespersonID, State, Date;
Cannot be answered by either SalesV1 or SalesV2. Thus must use Sales itself.
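
The reasoning behind Queries 1 to 3 can be written down as a small check: a view can answer a sum query if, for every grouping attribute of the query, the view kept the same dimension at the same or a finer level. The sketch below is an illustration under assumptions, not part of the original example: the `HIERARCHIES` layout, the function names, and the use of `Date` (rather than `DateKey`) as the finest time level are choices made here, and the test applies only to aggregates such as sum that can be re-aggregated from partial sums.

```python
# Levels of each dimension, finest first (assumed from the Date, City and
# Product dimension tables of the Sales example).
HIERARCHIES = {
    "salesperson": ["SalespersonID"],
    "customer": ["CustomerNo"],
    "product": ["ProdNo", "Category"],
    "time": ["Date", "Month", "Year"],
    "location": ["CityName", "State", "Region"],
}

# Grouping attributes kept by each view, plus the base table itself.
VIEWS = {
    "SalesV1": ["SalespersonID", "CustomerNo", "Month", "State"],
    "SalesV2": ["SalespersonID", "CustomerNo", "Month", "Region"],
    "Sales": ["SalespersonID", "CustomerNo", "ProdNo", "Date", "CityName"],
}


def locate(attribute):
    """Return (dimension, level index) for a grouping attribute."""
    for dimension, levels in HIERARCHIES.items():
        if attribute in levels:
            return dimension, levels.index(attribute)
    raise KeyError(attribute)


def can_answer(view_attrs, query_attrs):
    """True if every query attribute is at the same or a coarser level than
    the level the view kept for that dimension."""
    kept = dict(locate(a) for a in view_attrs)  # dimension -> level index
    for attribute in query_attrs:
        dimension, level = locate(attribute)
        if dimension not in kept or kept[dimension] > level:
            return False
    return True


def usable_views(query_attrs):
    """Names of all views (and the base table) that can answer the query."""
    return [name for name, attrs in VIEWS.items()
            if can_answer(attrs, query_attrs)]


print(usable_views(["SalespersonID"]))                   # Query 1
print(usable_views(["SalespersonID", "State"]))          # Query 2
print(usable_views(["SalespersonID", "State", "Date"]))  # Query 3
```

The base table always qualifies; in practice one would pick the smallest qualifying view, which is why Query 1 is better served by SalesV1 or SalesV2 than by Sales.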
Lattice of Views (coarsest to finest)
• Time: All, Years, Quarters, Weeks / Months, Days
• Location: All, Region, State, City

Lattice of Materialized Views and Queries
[Figure: lattice relating the queries Q1, Q2, Q3 to the views SalesV1 and SalesV2 and the base table Sales]

OLAP Example
Garcia-Molina, Ullman & Widom, Database System Implementation, Prentice Hall, 2000
Automobile Sales Company: analyze sales of cars
• Fact table
  – Sales(serialNo, date, dealer, price)
• Dimension tables
  – Autos(serialNo, model, color)
  – Dealers(name, city, state)
  – Days(day, week, month, year), e.g. (5, 27, 7, 2000): the time dimension table, probably not stored

Assume a particular car model, say 'Gobi', is not selling as well as anticipated. How to analyze?
Maybe it's the color. Slice for 'Gobi'. Dice for color.
    select color, sum(price)
    from Sales natural join Autos
    where model = 'Gobi'
    group by color;
Doesn't show anything interesting.

Gobi analysis, continuing
What about time? Drill down for month.
    select color, month, sum(price)
    from Sales natural join Autos join Days on date = day
    where model = 'Gobi'
    group by color, month;
Suppose we discover red Gobis have not sold well recently.

Gobi analysis, continuing
Are red Gobis selling poorly for all dealers or just some? Drill down for dealer.
    select dealer, month, sum(price)
    from Sales natural join Autos join Days on date = day
    where model = 'Gobi' and color = 'red'
    group by dealer, month;
Discover there are too few sales to show anything interesting.

Gobi analysis, continuing
Rollup time from month to year and slice for last two years.
    select dealer, year, sum(price)
    from Sales natural join Autos join Days on date = day
    where model = 'Gobi' and color = 'red' and (year = '1999' or year = '2000')
    group by dealer, year;
Does show variation. Now understand the problem better.

Administration
• Lab assignments and HWs posted on the web.
• Clarifications/Questions?
• Please use the appropriate online submit command.
• Teams of 2 allowed, but make the contribution of each team member explicit, especially in the lab assignment.
• Extra credit assignment in lab.
• Bring questions to class on Thursday.

Data cube figure: color codes, meaning, and tuple representation
• Tuples are (time in quarters, product, country, Tsales); time, product and country are dimension attributes, Tsales is total sales.
• White squares: basic fact table. (q, p, c, sales)
• Green squares: total annual sales grouped by product and country. (*, p, c, Tsales)
• Dark green squares: total annual sales grouped by product. (*, p, *, Tsales)
• Orange squares: total sales grouped by quarter and country. (q, *, c, Tsales)
• Dark orange squares: total sales grouped by quarter. (q, *, *, Tsales)
• Grey: total annual sales grouped by country. (*, *, c, Tsales)
• Other pair (quarter and product) not shown (need to pivot). (q, p, *, Tsales)
• Dark blue: all sales. (*, *, *, sales)
• Size of white cube = Q×P×C; size of colored cube = (Q+1)×(P+1)×(C+1). Why? Think of * as another category along each dimension.
• Size of colored cube with hierarchy: even larger!

Aggregation causes Database Explosion in Large Multi-dimensional Applications as the Number of Dimensions Increases
Based on Nigel Pendse, “Database Explosion”, www.olapreport.com/DatabaseExplosion.htm

Factors not causing data explosion
• Poor handling of data sparsity.
  – No more than a factor of 4 vs. factors of 10s or 100s
• Type of database technology.
  – Although optimized storage technology will be significantly better.
• Lack of data compression.
  – Compression is helpful, but explosion still occurs.
• Software errors
  – Again, a different problem.

Multi-dimensional Database (MDB) can save significant space
• Keys, indexes & dimensional structures.
  – Not required or take far less space.
• Sparsity better suppressed.
• Data compressed.
• Example:
  – 6-dimensional (including measures) banking cube
  – 13 million row fact table
  – Relational fact table incl. indexes, but not aggregates: 5188 MB
  – MOLAP cube including aggregations: 336 MB
  – Well under 10% of the space.
  – Much faster query processing.

Why is there a data explosion even without sparsity?
• Take a two-dimensional example.
• n: data from original source.
• m: data aggregations precalculated.
• p: on-the-fly results, not stored.
Cell counts: $n^2$ (source data only), $(n+m)^2$ (with precalculated aggregations), $(n+m+p)^2$ (with on-the-fly results).
Simplifying to $n = m = p$: $1n^2$, $4n^2$, $9n^2$.
In 3 dimensions this becomes $1n^3$, $8n^3$, $27n^3$.

When data is sparse it’s much worse.
• One-dimensional data.
• Simple hierarchy. Black - actual data, red - nulls.
• Detailed level: 8 of 25 or 32%.
• Aggregated levels: 5 of 6 or 83%.
• Growth factor: 1.625 (13 cells based on 8 input cells)

Two dimensions: the problem gets worse
• Potential input cells (detail data): 25*25 = 625
• Potential aggregated cells (aggregated data): 6*6 + 6*25 + 6*25 = 336
• More than 1 derived cell for every 2 possible input cells.
• In 6 dimensions, could have 2 or 3 derived cells per 1 input cell.

What about higher dimensions?
• One percent density, 6 of 625 input cells.
• Yields 29 computed cells.
• I.e., 35 total cells, only 6 input.
• Growth factor: 5.83.
• Growth factor per dimension: sqrt(5.83) = 2.4.
  – Called compound growth factor (CGF).
• CGF is typically in the range 1.5 to 2.5.
• CGF increases as sparsity increases.
• With large dimensions, there will often be more consolidation.
  – (Many thousands of products ⇒ more levels of groupings.)
• With a CGF of 2.0, an extra dimension, with no increase in input data, will double the size of the fully computed database.

So what is the problem?
• Disk space increases.
• Can software handle this much data?
• Time to load and update database increases.
  – Could take days to load the database.

What to do?
• Avoid fully pre-calculating any multidimensional object with more than 5 sparse dimensions.
• Reduce sparsity of individual data objects:
  – Use good application design.

What to pre-calculate?
• Data that is slow to calculate at run-time because it depends on many other cells or complex formulae.
• Data that is frequently viewed.
• Data that is the basis of many other calculations.
• Note: If too much is precalculated, performance may decrease because the cache will not include as much useful data.
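
The database-explosion arithmetic above can be reproduced with a few lines of Python. This is a sketch for checking the numbers only; the function names are hypothetical, and it assumes every dimension has the same number of detail members (25) and aggregate members (6), as in the example.

```python
def potential_cells(detail_members, aggregate_members, dimensions):
    """Return (input cells, derived cells) for a cube in which every
    dimension has the same number of detail and aggregate members."""
    inputs = detail_members ** dimensions
    total = (detail_members + aggregate_members) ** dimensions
    return inputs, total - inputs


def compound_growth_factor(input_cells, total_cells, dimensions):
    """Per-dimension growth: the dimensions-th root of total / input."""
    return (total_cells / input_cells) ** (1.0 / dimensions)


# Two dimensions: 625 potential input cells and 336 potential derived cells.
inputs_2d, derived_2d = potential_cells(25, 6, 2)

# Six dimensions: between 2 and 3 derived cells per potential input cell.
inputs_6d, derived_6d = potential_cells(25, 6, 6)
ratio_6d = derived_6d / inputs_6d

# Sparse two-dimensional case: 6 input cells, 35 total cells.
cgf_sparse = compound_growth_factor(6, 35, 2)

# One-dimensional case: 8 input cells, 13 total cells.
growth_1d = compound_growth_factor(8, 13, 1)

# Dense case with n = m = p: multiples of n^d for (n+m)^d and (n+m+p)^d.
dense_factors = [(2 ** d, 3 ** d) for d in (2, 3)]

print(inputs_2d, derived_2d, round(ratio_6d, 2), round(cgf_sparse, 2))
```

With a compound growth factor g, a fully computed database over d dimensions is roughly g to the power d times the input size, which is why an extra dimension at g = 2.0 doubles it.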