* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download A Paradigm Shift in Database Optimization: From Indices to
Survey
Document related concepts
Expense and cost recovery system (ECRS) wikipedia , lookup
Data center wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Information privacy law wikipedia , lookup
Data analysis wikipedia , lookup
3D optical data storage wikipedia , lookup
Data vault modeling wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Open data in the United Kingdom wikipedia , lookup
Business intelligence wikipedia , lookup
Clusterpoint wikipedia , lookup
Versant Object Database wikipedia , lookup
Transcript
A Paradigm Shift in Database Optimization: From Indices to Aggregates Presented to: The Data Warehousing & Data Mining mini-track – AMCIS 2002 as Research-in-Progress Ryan LaBrie • Robert St. Louis • Lin Ye Arizona State University [email protected] • [email protected] • [email protected] Agenda A need for a shift in optimization strategy What our research is focusing on How we performed this research Update on our results Next steps Why a Shift, Why Now? HISTORICALLY Relational database technology is really good at what it does… Transaction-oriented, operational systems Optimized for data INPUT FOCUS: Storage of DATA TODAY’S ENVIRONMENT Large Data Warehouses Used for decision support Need to be optimized for information OUTPUT FOCUS: Retrieval of INFORMATION The Decision Support Problem Relational DBMS limitations Too much data Too complex queries Tera- and petabytes, quickly approaching exabytes Structured Query Language Too long for results Indexing limitations Usage of (i.e. Table Scans) B+ Trees A Possible Decision Support Solution Multidimensional Databases New effective storage techniques Simpler modeling techniques Potential for easier query interfaces and Intelligent Aggregation Appropriate use of redundancy More effective indexing algorithms Bitmapped indices The Focus of Our Research CURRENT RESEARCH Cost comparisons of Relational vs. Multidimensional Decision Support Systems Working towards a multidimensional benchmarking system 1. 2. TPC-H is positioned as a Decision Support benchmark, however it is based on relational technologies GOAL: Vendor neutral benchmark for comparing multidimensional database products FUTURE RESEARCH In the long term, show that decisions can be made more easily with multidimensional technology Simpler design, simple interfaces, faster responses Why Develop a Multidimensional Benchmark? Benchmarking is an established method for creating vendor neutral tests Benchmarking has been examine in other IS fields including Server Platforms: Johnson & Gray, 1993 eCommerce: Menasce, 2002 It has been called for specifically in the data warehousing academic community Transaction Processing Performance Council (TPC) Nemati et al., 2000 and Has yet to be done How Are We Building Our Benchmark Based on the TPC-H relational decision support benchmark Create a relational dimensional model that forms the basis for the data mart Build a multidimensional cube off the dimensional model Convert the SQL statement to the equivalent MDX Run both the SQL query and the MDX query, report results What We Have Done To Date Initially have mapped all 22 TPC-H relational queries to potential data marts 3-4 data marts necessary Built 2 TPC-H data sets (1GB and 10GB) Converted TPC-H Query #4 to MDX Ran comparisons on both data sets In the process of converting a second query (TPC-H Query #7) for additional analysis/confirmation of gains TPC-H: Query #4 – Relational SQL Typical Decision Support Request: Answers the questions, “How many orders were delivered late in Quarter 3 of 1993, sorted by priority?” SELECT o_orderpriority, COUNT(*) AS order_count FROM orders WHERE o_orderdate >= '1993-07-01' AND o_orderdate < '1993-10-01' AND EXISTS (SELECT * FROM lineitem WHERE l_orderkey = o_orderkey AND l_commitdate < l_receiptdate) GROUP BY o_orderpriority ORDER BY o_orderpriority lineitem L_ORDERKEY L_PARTKEY L_SUPPKEY L_LINENUMBER L_QUANTITY L_EXTENDEDPRICE L_DISCOUNT L_TAX L_RETURNFLAG L_LINESTATUS L_SHIPDATE L_COMMITDATE L_RECEIPTDATE L_SHIPINSTRUCT L_SHIPMODE L_COMMENT part P_PARTKEY P_NAME P_MFGR P_BRAND P_TYPE P_SIZE P_CONTAINER P_RETAILPRICE P_COMMENT partsupp PS_PARTKEY PS_SUPPKEY PS_AVAILQTY PS_SUPPLYCOST PS_COMMENT customer C_CUSTKEY C_NAME C_ADDRESS C_NATIONKEY C_PHONE C_ACCTBAL C_MKTSEGMENT C_COMMENT supplier S_SUPPKEY S_NAME S_ADDRESS S_NATIONKEY S_PHONE S_ACCTBAL S_COMMENT NATION N_NATIONKEY N_NAME N_REGIONKEY N_COMMENT REGION R_REGIONKEY R_NAME R_COMMENT orders O_ORDERKEY O_CUSTKEY O_ORDERSTATUS O_TOTALPRICE O_ORDERDATE O_ORDERPRIORITY O_CLERK O_SHIPPRIORITY O_COMMENT TPC-H: Query #4 – Multidimensional Expression (MDX) Equivalent SELECT {[Measures].[O Latecount]} ON COLUMNS, {[PriorityDim].children} ON ROWS FROM Q4Cube WHERE ([TimeDim].[All TimeDim].[1993].[Quarter 3]) The Database Costs Dilemma Disk Space? Build Time? Query Speed? Results To Date (Query Speed) TPC-H Query 4 1 GB Dataset 10 GB Dataset Multidimensional 0.33 seconds 0.33 seconds Relational 46.6 seconds (140x slower) 925 seconds (~15.5 min) (~2800x slower) Relational (optimized w/Indices) 38 seconds (114x slower) Relational (optimized w/Indices & Striping) 26 seconds (78x slower) Test not run 247 seconds (~4.0 min) (~750x slower) Results To Date (Other Measures) TPC-H Query 4 1 GB Dataset 10 GB Dataset Relational DB 1.2 GB 12.5 GB Relational DB 1.8 GB (w/Indices) Multidimensional .16 MB Cube Size Multidimensional 46 seconds Cube Build Time 28.9 GB .16 MB 356 seconds (~6 minutes) Preliminary Conclusions For a very modest investment organizations will be able to process very large data warehouses The multidimensional data mart is the only practical (speed, processing time) way to support the end-user decision maker. Aggregation truly is a substitute for expensive hardware Next Steps Acquire a larger server Build 100GB and 300GB TPC-H data sets Benchmark both relational and dimensional queries Publish results Consider ROLAP, HOLAP, MOLAP issues Possible extensions to some data mining research Possible extensions to decision making through technology research Thank You for Your Time Questions? [email protected] www.public.asu.edu/~rlabrie (for this presentation and paper) Appendix A: Current System SOFTWARE Microsoft Windows 2000 Advanced Server Microsoft SQL Server 2000 Enterprise Edition Microsoft SQL Server 2000 Analysis Services Enterprise Edition HARDWARE (1) 1.8GHz Intel Pentium 4 processor 768MB RAM 240GB HD space (3 IDE 80GB 7200RPM Drives) Total cost: $1100 (Hardware only)