OLAP2 outline
· Multi Dimensional Data Model
· Need for Multi Dimensional Analysis
· OLAP Operators
· Data Cube Demonstration Using SQL
Multi Dimensional Data Model
Multi dimensional analysis is a popular approach for extending reporting features. Instead of submitting multiple queries, the data is structured for fast and easy access so that questions posed by different kinds of users can be answered interactively. This type of analysis is generally performed on large corporate warehouses or data marts. The view of data as a multi dimensional array can be generalized to more than three dimensions. In OLAP applications the bulk of the data can be represented in such a multi dimensional array. Systems that store multi dimensional data in this form are termed MOLAP systems. A multi dimensional array can also be represented as a table, as shown below.
The relation that relates the dimensions to the measure of interest is called the fact table. The multi dimensional data model focuses on a collection of numeric measures, termed facts, each of which depends on a number of associated dimensions. In the data warehousing literature the data cube is one of the popular structures widely used to represent the multi dimensional model. We have represented the multi dimensional data model in two ways: one in the form of a table and the other in the form of an array. The dimension names are not shown in the above diagram, but the IDs associated with each dimension are: PID for Product ID, TimeId for the Time dimension and LocId for Location ID. Sales is the numeric measure.
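As a minimal sketch of the two representations, assuming made-up sample rows (the values below are illustrative and are not taken from the slide), the same facts can be held as a fact table and as an array indexed by dimension position. The function name to_array is hypothetical.

```python
# Hypothetical fact table rows: (pid, timeid, locid, sales).
fact_table = [
    (11, 1, 1, 25),
    (11, 2, 1, 8),
    (12, 1, 1, 30),
    (12, 1, 2, 20),
    (13, 3, 2, 10),
]


def to_array(rows):
    """Build a 3-D array (nested lists) indexed [product][time][location]."""
    pids = sorted({r[0] for r in rows})
    times = sorted({r[1] for r in rows})
    locs = sorted({r[2] for r in rows})
    # Combinations with no fact row keep a sales value of 0.
    cube = [[[0 for _ in locs] for _ in times] for _ in pids]
    for pid, timeid, locid, sales in rows:
        cube[pids.index(pid)][times.index(timeid)][locs.index(locid)] += sales
    return pids, times, locs, cube


pids, times, locs, cube = to_array(fact_table)
```

In the table form each row carries its own dimension IDs; in the array form the IDs become positions along the three axes and only the measure is stored in the cell.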
A Sample Data Cube
The representation given below shows the annual sales of various categories of products in different quarters across different cities. For example, the annual sales of TVs for 3 cities are shown in the form of an array, and the total sales as a single number. If we observe the diagram below, different kinds of aggregation are stored in this physical structure. One is the total sales over all cities and all products; another is the total sales over all quarters for each product.
The diagram thus specifies different kinds of aggregations in the form of a physical structure called a data cube. Here we have 3 dimensions; with more than 3 dimensions no physical structure can be drawn, but we can always generalize by using the concept of a hyper cube.
Cuboids Corresponding to the Cube
The slide shows 3 dimensions called Product, Date and Country. That is to say, computations are combined across combinations of these dimensions.
The diagram is an integration of different kinds of computations over combinations of these three dimensions. Each combination is a particular type of cuboid. There are 4 types of cuboids here, from the 0-dimensional cuboid, which is the apex cuboid and gives the grand total of sales, to the base or 3-D cuboid, which gives sales for each product, date and city. Between these two aggregations we have different kinds of computations. The property of these computations is that the values computed at an intermediate level are always at least as large as those computed at the level below and no larger than those at the level above. In mathematics this type of structure is termed a lattice, so the concept of the data cube imitates the behavior of a lattice: we take all the combinations with some ordering, and this ordering is a partial ordering.
Data Cube involving four Dimensions
This slide talks about the extension of a three dimensional cube into a collection of cuboids over 4 dimensions. If we have 4 dimensions we have 2^4 = 16 cuboids: one cuboid of 0 dimensions, 4 cuboids of 1 dimension, 6 cuboids of 2 dimensions, 4 cuboids of 3 dimensions and one cuboid of 4 dimensions. In a similar way we can extend the combinations of computations over several dimensions in the form of a structure. However, no physical structure exists to show the data when the number of dimensions is more than three. With three or fewer dimensions we can always show it physically. With more than three dimensions, we need to think in an abstract sense, in the form of a cube called a hyper cube.
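The count of cuboids can be checked with a short sketch: every subset of the dimensions is one cuboid, so n dimensions give 2^n cuboids. The fourth dimension name used here (supplier) is a hypothetical placeholder, since the transcript does not name it.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every cuboid (subset of dimensions), grouped by its number of dimensions."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}


# 'supplier' is an assumed name for the fourth dimension.
by_level = cuboids(["product", "date", "city", "supplier"])
counts = {k: len(group) for k, group in by_level.items()}
total = sum(counts.values())
```

For four dimensions this gives 1, 4, 6, 4 and 1 cuboids at levels 0 to 4, which is 16 in total.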
Questions on Multi Dimension Model
Q 1) What is Hyper cube?
As I mentioned earlier, a hyper cube is a generic metaphor for representing multi dimensional data: a group of data cells arranged by the dimensions of the data. For example, a spreadsheet exemplifies a two dimensional array with data cells arranged in rows and columns, each being a dimension: the row is one dimension and the column is another. In a similar way, a database table is also represented in the form of rows and columns. So both spreadsheets and database tables are metaphors for representing data, in Excel sheets and in database tables respectively. In the same way, the hyper cube is a generic metaphor for representing multi dimensional data. Since it is a complex topic, I will go into the concept of the hyper cube in more detail. As spreadsheets use worksheets and databases use tables, hyper cubes are used to understand multi dimensional views. For example, consider a two dimensional cross tabulation report which models location and product to measure sales. This two dimensional grid provides all possible combinations of locations and products. That means that if it has K locations and P products, sales are computed for each of the K x P combinations of location and product.
How is it viewed?
We can view the hyper cube in an abstract manner by understanding how the cube is built in one dimension, how it is viewed in two dimensions and how it is viewed in three dimensions. I will explain how it can be viewed in each case step by step. If we have data in one dimension, we can view it as a single row or column. If we have two dimensions, like product and location in our case, it is a matrix or a table. If the number of dimensions is more than 3, imagining the hyper cube is quite difficult, in the sense that we cannot represent it physically in the form of a structure; you have to understand it in an abstract sense. Imagine the control panel of a stereo sound system, of the kind used in day to day presentations. Each slider controls one aspect of the sound, such as balance, volume, bass and treble. All of these are knobs that can be adjusted independently. In a similar way, if we have more than three controls, we can add these higher dimensions to a two dimensional grid. What happens if we add more and more dimensions, such as payment methods, coupons etc.? The grid becomes a cube. That is, if I add the payment method dimension to the location and product grid, it becomes a three dimensional cube; if coupons is added, it becomes a hyper cube. Any number of dimensions can be added, but no physical metaphor exists for more than three dimensions. This is how we should interpret the concept of generalizing to two dimensions, three dimensions and beyond three dimensions.
Q 2) Can you explain the cube by taking measures and dimensions?
Commercial tools like Cognos and Business Objects have a concept of a cube. OLAP software packages support this transformation with the concept of the power cube, which is available as a part of the Cognos tool. Cognos provides a component called Transformer, whose job is to transform the data into the form of a cube. This is nothing but a hyper cube; Cognos simply names it a power cube, so in this sense a power cube is the same as a hyper cube. It is used to organize the data into selected business perspectives. For example, the power cube shown in the slide involves measures and dimensions: Time is a dimension, Status is a dimension, and performance indicators and salaries are measures. When you take a particular category of values in each dimension you get a combination, and for that combination we calculate the aggregate values of all these measures. The cell contains these aggregated values. As I mentioned earlier, no physical metaphor exists for the hyper cube, but I can always map it onto a two dimensional plane. For example, here there are 4 dimensions, and I can add as many dimensions as I want. If I define the measures across the different combinations of dimensions, the measures are automatically calculated based on the functions defined. Once the functions are triggered, a cube is generated and shown here. So a hyper cube is simulated by the facility provided in Cognos under the name Transformer.
Extending the answer to the solution:
Representing Multi Dimensional Data
Since this is an abstract concept, I would like to extend the discussion by physically representing the data in both forms. In a two dimensional representation, the cube represents the data in an array, while a relational table represents multi dimensional data only in two dimensions (rows and columns). Suppose you take an array of two dimensions and ask: what is the total revenue generated by sales in each city in each quarter of the year 2009? There are 4 quarters in a year, and for each quarter and each city we want the total sales. This is a two dimensional grid, and the measure is total sales. The choice of representation is always based on the types of queries that the end user asks. There are two kinds of representation: array based and table based. In the table representation, to represent all the sales in each city for each quarter of 2009 we need three columns: the first for the city, the second for the quarter and the third for storing the measure. So storing the data in a relational table requires three fields. If we represent the same data in a matrix, only two axes are required, one for city and one for quarter, with the measure stored in the cells.
For example, the slide has city values, time values and total revenue values. In this representation there are two dimensions and one measure. The same data is transformed into a matrix representation by taking the values stored in the database rows: the dimension values are nothing but the database values, and they become the headings for the rows and columns. For example, Q1, Q2, Q3 and Q4 are the headings for the quarter, and Glasgow, London etc. are the headings for the city. Taking a combination of these, the value 45677 is the total revenue in quarter 4 in the city of Aberdeen.
As I mentioned, both representations use the concept of cells; in the two dimensional matrix the database values become the row and column headings. Now consider the query ‘What is the total revenue generated by property sales for each type of property (flat or house) in each city, in each quarter of 2009?’ Four columns are required to represent this in a two dimensional table: one for each of the three dimensions and one for the measure.
Depending on the combination of attributes, the aggregate operation is applied and the cell for that combination holds the measure. That is why measures are associated with dimensions.
Q) What are the different kinds of functional support provided for data cubes in commercial databases?
As you all know, several very popular commercial databases exist in the market today, such as DB2, SQL Server, Oracle, Ingres and Postgres. All these packages now provide functional support for the data cube. That is, the functionality of traditional SQL is extended with additional features for multi dimensional operations. As I mentioned, there are different kinds of multi dimensional operations, such as slicing, dicing, pivoting, rollup and drill down. All these operations require a lot of aggregation and computation at the back end of the database, so most commercial databases have pushed this functionality into the database level by incorporating various powerful operators. Oracle, for example, provides two powerful operators in OLAP for aggregation over combinations of dimensions: Cube and Grouping sets. Cube is used to find the aggregations across all combinations of k dimensions, and Grouping sets is used to compute selected combinations of aggregations.
Q) Explain the lattice concept with example.
A given set has different kinds of subsets. If you have k elements in a set you have 2^k subsets, including the empty subset. What relationship exists among these subsets? Any k element subset contains its (k-1) element subsets. In the same sense, a 2 element subset is always contained in some 3 element subset. The meaning of this containment is a kind of subset ordering: I can always order these 2^k subsets in a partial order, and the relationship is called containment. Under this order, any subset contains its one element subsets, and every subset is contained in the original set. So we have a partial ordering together with a greatest lower bound and a least upper bound. This is nothing but a lattice: a partially ordered set with bounds. The same concept is widely used in representing multi dimensional data analysis.
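The containment ordering can be sketched in a few lines. This is only an illustration of the set-theoretic idea: it builds all subsets of a three element set, checks that containment is a partial order, and takes the intersection and union of two subsets as their greatest lower bound and least upper bound. The function names are hypothetical.

```python
from itertools import combinations


def powerset(elements):
    """All subsets of `elements` as frozensets, including the empty set."""
    items = sorted(elements)
    return [frozenset(c) for k in range(len(items) + 1) for c in combinations(items, k)]


def is_partial_order(sets):
    """Check reflexivity, antisymmetry and transitivity of containment (<=)."""
    reflexive = all(a <= a for a in sets)
    antisymmetric = all(a == b for a in sets for b in sets if a <= b and b <= a)
    transitive = all(a <= c for a in sets for b in sets for c in sets if a <= b and b <= c)
    return reflexive and antisymmetric and transitive


def bounds(a, b):
    """Under containment, the greatest lower bound is the intersection and the least upper bound is the union."""
    return a & b, a | b


subsets = powerset({"product", "city", "date"})
glb, lub = bounds(frozenset({"product", "city"}), frozenset({"city", "date"}))
```

Here glb is {city} and lub is {product, city, date}; every pair of subsets has both bounds, which is what makes the structure a lattice.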
I will explain how the lattice is represented.
Suppose we have 3 dimensions: Product, City and Date. We consider these as our 3 elements, which are nothing but the perspectives or dimensions. From these we have three 2 element combinations, {Product, City}, {Product, Date} and {Date, City}, and also the 3 element set {Product, City, Date}; together they form a lattice of cuboids. Once we have the 2 dimensional computations, we can also compute the total sales for each product over all cities and all dates. Using a one dimensional cuboid we can compute the grand sales over all products, all cities and all dates. If you observe ‘all’ (the grand sales) and the base cuboid, the base cuboid is the greatest lower bound and the apex cuboid is the least upper bound.
That is why a cube is nothing but a collection of cuboids, and each cuboid is nothing but an aggregation. If we integrate the one dimensional, two dimensional and three dimensional aggregations, that is a data cube.
Q) Why should we take the cuboid in the multi dimensional model and not any polygon, when there are 4 or more dimensions?
No physical metaphor exists for more than 3 dimensions. Physically we cannot view it; we need to think in an abstract sense. Cuboid is a mathematical term brought from discrete mathematics for representing certain types of aggregations. The representation of aggregations always follows the containment principle. Just as the relational model has a strong mathematical base, which is the concept of a set, the same holds here. Why not a polygon? Polygon is a general term; poly means many. In that sense a cuboid is one kind of many-sided figure. The technical name for the 4 dimensional analogue of the cube is the tesseract. There is no direct visual representation of a tesseract.
Q) What is the difference between Relational DBMS and Multi Dimensional DBMS?
If you observe the above slide, ‘all’ is the 0-dimensional cuboid; product, city and date are one-dimensional cuboids; and so forth. What happens when we add one more dimension? We just discussed that the visual representation for 4-D is messy. Even though the diagram is messy, we can understand the concept by representing it in the form of a lattice.
From the functionality point of view, both are used for certain kinds of activities. I will illustrate with an example that has two dimensions and one measure, which is also shown in a multi dimensional representation. In the multi dimensional database the data is transformed into a square, because here we have only two dimensions.
Here the values of the columns are used as the column and row headings in the multi dimensional model. If we have three values in each dimension column, the relational table needs 27 cells in total (9 rows of 3 columns each), whereas the multi dimensional database requires only 9 cells for the same data. Now extend the complexity by adding one more column, say Dealership, so that we have 4 columns in the relation. If I select a particular dealer, I get a matrix for that dealer. Observe that when a dimension is added, the multi dimensional data model always requires fewer cells than the relational table representation. Operations on a relational table are much slower than operations on the multi dimensional cube. If I add one more dimension, say ‘Time’, then each time value corresponds to a 3-dimensional cube. Hence multi dimensional structures are much faster than relational tables, because they need less storage space and accommodate more values in that storage.
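The cell counts in this comparison can be reproduced with a small calculation. The sketch assumes a dense cube, meaning every combination of dimension values has a fact row; the function names are hypothetical.

```python
from math import prod


def relational_cells(cardinalities, measures=1):
    """Cells in a relational table: one row per combination, one column per dimension plus the measures."""
    rows = prod(cardinalities)
    columns = len(cardinalities) + measures
    return rows * columns


def array_cells(cardinalities):
    """Cells in a multi dimensional array: one cell per combination, holding only the measure."""
    return prod(cardinalities)


two_dims = (relational_cells([3, 3]), array_cells([3, 3]))
# Adding a third dimension with three values, such as Dealership.
three_dims = (relational_cells([3, 3, 3]), array_cells([3, 3, 3]))
```

With two dimensions of three values each, this gives 27 cells against 9; with a third dimension of three values it gives 108 against 27. When many combinations have no data the comparison changes, because a relational table stores only the rows that exist.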
Establish the need for Multi Dimensional Analysis
A one dimensional query is, for example, ‘How many units of item a did we sell in the store located in Delhi?’ The other query shown on the slide is ‘How much revenue did the new item X generate during the last six months, broken down by individual months, in AP state, by individual stores, broken down by promotions (p1, p2…), compared to estimates, and compared to the previous version of the product?’ For efficient analysis the decision maker should be equipped with an easy way of performing complex analysis along different business dimensions. We can establish such an environment using a representation of data called the multi dimensional model. The basic advantage of this data model is that it provides easy and flexible access to information: decision makers are able to analyze the data along any number of dimensions, at any level of aggregation, with the capability of viewing the results in a variety of ways. They must also be able to navigate the results from one level of summarization to the next. Such power does not exist in 1-D queries. Without a solid system offering this kind of facility, the purpose of using a data warehouse is incomplete. That is the reason why multi dimensional analysis is very widely used. Time is also an important dimension in any system. Every analytical query is executed with time as one of the dimensions, and an analytical system must recognize the sequential nature of time. Because of these factors, traditional systems are inappropriate for answering complex queries.
Concept hierarchies are very important and widely used in data warehouses. A concept hierarchy defines a sequence of mappings from a set of low level concepts to higher level concepts which are more general in nature. The slide shows different kinds of hierarchies for Industry, Region and Time. Time has two different kinds of hierarchies; in one of them, a collection of days is a month, a collection of months is a quarter, and a collection of quarters is a year. Another hierarchy, over location, is office > city > country > region. The purpose of using hierarchies in OLAP is to provide a better navigation facility for decision makers.
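As a rough sketch of the day, month, quarter, year hierarchy, the snippet below maps each day to a higher level concept and sums a measure at that level. The daily sales figures are invented for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales, keyed by day.
daily_sales = {
    date(2009, 1, 15): 100,
    date(2009, 2, 3): 50,
    date(2009, 4, 20): 70,
    date(2009, 11, 5): 30,
}


def quarter_of(day):
    """Map a day to its quarter number (1 to 4)."""
    return (day.month - 1) // 3 + 1


def roll_up(sales, to_level):
    """Sum the measure after mapping each day to a higher level concept."""
    totals = defaultdict(int)
    for day, amount in sales.items():
        totals[to_level(day)] += amount
    return dict(totals)


by_month = roll_up(daily_sales, lambda d: (d.year, d.month))
by_quarter = roll_up(daily_sales, lambda d: (d.year, quarter_of(d)))
by_year = roll_up(daily_sales, lambda d: d.year)
```

Each mapping function is one step of the concept hierarchy, so moving to a coarser level only changes the key used for grouping.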
Q) Explain why multi dimensional analysis is important?
From the analysis point of view we call this a very important structure because it gives easy and flexible access, in the sense that you can retrieve any kind of aggregation just by querying. Any kind of ad-hoc query can be answered. Analysis of the data is easy, and we can also show the results at varying levels. That means we do the computations bottom to top or top to bottom, which in OLAP we call drill up and drill down respectively.
Q) What are the Constraints applied on the OLAP?
Constraints can be considered from a theoretical point of view and from an implementation point of view. From the implementation point of view, as I mentioned earlier, there are various extensions available in SQL with the operator names Rollup and Cube. The disadvantage of these operators is that there is no way to compute just a desired set of computations with them. If you use Rollup with k dimensions, you get k+1 computations. If you use Cube with k dimensions, you get 2^k computations. So if analysts want to view only a desired set of combinations of aggregations, there is no automatic way to do so with the Rollup and Cube operators. To support this, every package enriches the functionality of SQL with an analytical concept called the Grouping function, which is used to compute the desired combinations of aggregates. It gives analysts the flexibility to specify the combinations they want. In addition, many packages also provide partial rollup and partial cube computations by extending the grouping function operator.
The same question can also be answered theoretically: there are many ways to compute cubes, for example Iceberg queries and the BUC algorithm.
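The k+1 and 2^k counts can be illustrated by listing which groupings each operator produces. This is a sketch of the combinatorics only, written in Python rather than SQL; the function names are hypothetical, and the analyst-chosen list at the end stands in for the grouping facility described above.

```python
from itertools import combinations


def rollup_groupings(dims):
    """Rollup over d1..dk: the k+1 prefixes, from all dimensions down to the grand total."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]


def cube_groupings(dims):
    """Cube over d1..dk: all 2^k subsets of the dimensions."""
    return [c for k in range(len(dims), -1, -1) for c in combinations(dims, k)]


dims = ["product", "city", "date"]
rollup_sets = rollup_groupings(dims)
cube_sets = cube_groupings(dims)

# An analyst who wants only two specific aggregations lists them explicitly.
chosen_sets = [("product", "city"), ("date",)]
```

With three dimensions, rollup_sets has 4 entries and cube_sets has 8, while chosen_sets contains exactly the combinations the analyst asked for.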
Q) What is Navigation and what are the OLAP operations that provide navigation?
Navigation is used to move from one level of concept to another, for example from low level concepts to higher level concepts. In the lattice structure, the low level concept is the n-dimensional cuboid and the 0-dimensional cuboid is the higher level structure. Once we know the real world data, that is, once we have the low level aggregations, we can compute all the higher level aggregations. In OLAP this is possible using two powerful operators, namely rollup and drill down, which are the reverse of each other. Drill down moves from a high level cuboid to a low level cuboid, and roll up moves from a low level cuboid to a high level cuboid.
Roll Up: {product, city, date} → {product, city} → {all}
Drill Down: {all} → {product, city} → {product, city, date}
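The roll up path can be sketched by aggregating an existing cuboid to a coarser one, so that each higher level is computed from the level below it. The base cuboid values are invented for illustration and the function name is hypothetical.

```python
from collections import defaultdict

dims = ("product", "city", "date")

# Hypothetical base cuboid: sales keyed by (product, city, date).
base = {
    ("TV", "Delhi", "Q1"): 10,
    ("TV", "Delhi", "Q2"): 15,
    ("TV", "Mumbai", "Q1"): 20,
    ("PC", "Delhi", "Q1"): 5,
}


def roll_up(cuboid, cuboid_dims, keep):
    """Aggregate a cuboid keyed by `cuboid_dims` to the coarser dimensions in `keep`."""
    positions = [cuboid_dims.index(d) for d in keep]
    totals = defaultdict(int)
    for key, sales in cuboid.items():
        totals[tuple(key[p] for p in positions)] += sales
    return dict(totals)


product_city = roll_up(base, dims, ("product", "city"))
grand_total = roll_up(product_city, ("product", "city"), ())
```

Drill down follows the same path in reverse: starting from grand_total, the analyst looks at product_city and then at base to see the finer detail behind a number.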
OLAP Operators
A summary of all the OLAP operators is shown in the slide below. These five operators give decision makers easy and flexible access for computing different kinds of aggregations. They are a very convenient set of operators for end users.
Data Cube Demonstration using SQL
We can easily simulate the data cube using SQL operators. As you all know, the clause commonly used in SQL for aggregation is the Group by clause. By writing different queries using the Group by clause, we can combine the different kinds of aggregations by applying the union operator to the various sub queries. For example, if we have two dimensions, we have 2^2 = 4 combinations of aggregations, as I mentioned. One query is for the grand sum, another is for the base cuboid, and the remaining two queries aggregate on one dimension each. With these 4 queries, one for each cuboid, the union operator of SQL combines all their results into a single result set. So whatever results the data cube gives over any number of dimensions can be simulated in a single query using SQL: we write one sub query per cuboid and unite them all using the union operator. In this way we can easily get the full set of aggregated values required to represent the data cube.
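The two dimensional case can be tried out with a small sketch that runs the four Group by queries joined by union. It uses Python's built-in sqlite3 module only as a convenient way to execute the SQL; the table name, column names and rows are assumptions made for illustration, and NULL marks a dimension that has been aggregated away.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("TV", "Delhi", 10), ("TV", "Mumbai", 20), ("PC", "Delhi", 5)],  # hypothetical rows
)

# One sub query per cuboid, combined with UNION.
query = """
SELECT product, city, SUM(amount) FROM sales GROUP BY product, city
UNION
SELECT product, NULL, SUM(amount) FROM sales GROUP BY product
UNION
SELECT NULL, city, SUM(amount) FROM sales GROUP BY city
UNION
SELECT NULL, NULL, SUM(amount) FROM sales
"""
cube_rows = conn.execute(query).fetchall()
conn.close()
```

The result set holds the base cuboid, the two one dimensional cuboids and the grand total together. With k dimensions the same pattern needs 2^k sub queries, which is why databases offer the Cube operator as a shorthand.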