* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download multi-dimensional views of data
Survey
Document related concepts
Transcript
Basic concepts of On-Line Analytical processing DT211 /4 1 What is OLAP • OLAP stands for "On-Line Analytical Processing.“ • OLTP ("On-Line Transaction Processing") • OLAP describes a class of technologies that are designed for live ad hoc data access and analysis. • OLTP generally relies solely on relational databases, • OLAP has become synonymous with multidimensional views of business data supported by multidimensional databases • Relational databases were never intended to provide data synthesis, analysis and consolidation functionality. 2 What is OLAP • OLTP databases are optimised for transaction updating however, OLAP applications are used by managers and analysts for a higher level aggregate view of the data, thus they are designed for analysis. • Many problems that people try to solve using relational databases e.g. summaries are handled much more efficiently by an OLAP server than by RDBMS 3 OLAP Applications Although OLAP applications are found in widely divergent functional areas, as illustrate in the table opposite. Moreover they all have the following key features: 1. multi-dimensional views of data (MD databases via Star Schema) 2. support for complex calculations 3. time intelligence 4 Purchase Key 1 2 3 4 5 6 . . . Purchase Dimension Category Supermarket Travel & Entertainment Auto & Vehicle Retail Restarurant Miscellaneous . . . Star Schema: basis of MD view Time Dimension Time Key Month Day Quarter Year 10 Jan 5 1 2002 . . . . . . . . . . . . . . . Cardholder Key Purchase Key Location Key 1 2 1 15 4 5 1 2 3 . . . . . . . . . Cardholder Key Name 1 John Doe 2 Sara Smith . . . . . . Cardholder Dimension Gender Income Range Male 50 - 70,000 Female 70 - 90,000 . . . . . . Fact Table Time Key Amount 10 14.50 11 8.25 10 22.40 . . . . . . Location Key Street 10 425 Church St . . . . . . A star schema for credit card purchases Location Dimension City State Region Charleston SC 3 . . . . . . . . . Multi-dimensional view as a cube Month = Dec. Category = Vehicle Region = Two Amount = 6,720 Count = 110 Dec. Nov. Oct. Sep. Jul. Jun. May Apr. Mar. Feb. Fo Jan. Miscellaneous Restaurant Retail Vehicle Travel On Supermarket Month Aug. Category Figure 6.6 A multidimensional cube for credit card purchases Tw o Th r e ee n gio Re ur Representation of Multidimensional Data • Example of two-dimensional query. • What is the total revenue generated by property sales in each city, in each quarter of 2011?’ • Choice of representation is based on types of queries end-user may ask. 7 Multi-dimensional Data as Three-field table versus Two-dimensional Matrix 8 Compare representation - three-field relational table versus two-dimensional matrix. Representation of Multidimensional Data • Example of three-dimensional query. – ‘What is the total revenue generated by property sales for each type of property (Flat or House) in each city, in each quarter of 2004?’ 9 © Pearson Education Limited 1995, 2005 Multi-dimensional Data as Four-field Table versus Three-dimensional Cube 10 Compare representation - four-field relational table versus three-dimensional matrix. Multidimensional Data • Queries requiring only a single number to be retrieved need not use multidimensional databases. • If queries involved retrieving multiple numbers and aggregating them for large databases can become intolerable as relational databases can scan only a few hundred records per second. • However multidimensional databases can add up 10,000 or more numbers in rows and columns per second. • Thus for such queries multidimensional databases have an enormous performance advantage 11 Multi-dimensional Operations • Slice – A single dimension operation • Dice – A multidimensional operation • Roll-up – A higher level of generalization • Drill-down – A greater level of detail • Rotation – View data from a new perspective Slice and Dice • A desired analysis of a three dimensional matrix may require any combination of dimensions to be reported so you need to be able to rotate this data cube. • 6 views are required for a three dimensional data object while a 4 dimensional object would require 12 views. • The technique of being able to look at all the data of the cube from different viewpoints is called: “slice and dice” 13 Simple Hierarchies: Roll up • With hierarchical dimensions the database knows not to combine members of the dimension that are at different levels of the hierarchy: referred to as roll-up • It allows the user to view queries at all or any different levels e.g.. At state level or at city level. • Such hierarchies facilitate drill down to successive levels of detail: yearly, quarterly, monthly, weekly, daily. 14 Multiple hierarchies: roll up • Utilising multiple hierarchies product sales can roll up by region, type , brand name and so forth. Without this capability an extra dimension would have to be created for each. • Another use of multiple hierarchies is for geographical dimensions e.g.: 15 Drill down to core database • Most organisations now utilise relational databases as standard for their data warehouses. • Often there is no need to replicate all the data in the relational database into a MD database for OLAP. • Summary level data can be kept in the MD database and detailed data in the relational database. 16 Drilling to relational data • To get a single number from a MD database takes the same time as it does from a relational database. • Thus it would be futile to individual customers into a MD database. But for summarised data a MD database is superior. • Thus ideally you should be able to drill down through the MD database into the relational database. • Such an approach is useful as most of data volume will reside at the detailed level and will thus not hinder queries of the higher levels 17 Support for complex calculations • Important computational features of OLAP servers inlcude: – Independently dimensioned variables (IDV) – Derived variables – Consolidation speed – Vector Arithmetic 18 OLAP calculations : Variables • Variables are numeric measures (facts) such as Sales, Cost, price…; dimensions include region, customer type, product… • OLAP servers treat variables as a special dimension. So one can select only the relevant dimensions for each variable (IDV) . See next slide • They can also incorporate various rules for consolidation. • Must provide a range of powerful computational and statistical methods such as that required by sales forecasting . • Another variable is called a derived variable which appears to be a variable in the database but which is actually computer at run time. E.g. margin = profit – expenses. 19 Star schema for property sales of DreamHome 20 Vector Arithmetic • Data held in arrays can be more easily manipulated than data stored in a relational table. • Thus a plane for actual can be easily subtracted from a plane from budget to give a plane for variance. • Such arithmetic allows entire planes of the database to be combined quickly. 21 Time Series Data Types • Users want to look at trends in all aspects of their business e.g. sales trends, market trends etc. • A series of numbers representing a particular variable over time is called a time series e.g.. 52 weekly sales numbers is a time series. • Utilising a time-series data type allows you to store an entire string of numbers representing daily, weekly or monthly data. • Thus an OLAP server that supports time-series data type allows one to store historical data without having to specify a separate dimension for time. • Unlike other dimensions time has special attributes and rules. 22 Time-series data type • Time series always have a particular periodicity. • Time series data must include rules to convert one periodicity to another • In the absence of a time-series data type a new dimension must be declared and labelled explicitly. • A time-series data cell contains a great deal of information compared with a single cell or even a full record. 23 Time-Series Data types • Consider the following example for a time-series data type of sales. • • • • • • • • Start date = 1\1\2000 Periodicity = Daily, business days only Conversion = Summation Long description = Variable=Sales, Product=Nuts, Region=East Data type = Numeric, single precision Sacristy = Non-sparse Calendar = 445 Fiscal year Data points = 708,800,821,743,779,856,878,902,799, ... 24 Time-series data types • Start date is the first data point • Periodicity can be daily, weekly etc with calendar years, fiscal periods and business weeks etc being understood. • Data type can be single precision, double precision, text strings or dates • Sparse data is used where the same number is used over and over again e.g. price. Defining it as sparse would cause the database to store dates on which the price changed and the corresponding new values. • Data points can store very long time series e.g. 10 years of daily data. 25 Sparse Data • When less than 10% of the cells contain data the database is said to be sparsely populated or sparse. • Scarcity can also occur if there are many cells that contain the same number e.g.. Price of a product every day. • This situation can also be represented by storing the number once along with the number of days that the number is repeated • While a relational database would fill up the database with duplicate data an OLAP server that understands sparse data can skip over zeros, missing data and duplicate data. 26 Conclusion • In essence OLAP technology is a fast, flexible data summarisation and analysis tool. • The data analysis requires the ability to summarise data in many ways and view trends. • It should have 3 main characteristics: MD views, ability to perform complex calculations, time intelligence 27 Question • What are the three basic features of any OLAP system and explain the role each play in the analytical process, illustrating your answer with appropriate examples. 28