How to design your MDDB?
Geert Peeters, Origin International (ICA)
e-mail: [email protected]
Abstract.
One of the most valuable add-ons to SAS6.12 is the introduction of the multi-dimensional database
(MDDB). Although the advantages of the MDDB are clear to all, i.e. better performance while
accessing enormous amounts of detailed data, many people still fail to realise them in practice.
This can mainly be put down to the fact that users of MDDBs make the wrong choices during the
design of their system, resulting in bad performance.
This article tries to give some answers on how to build a data model using MDDB technology. They
are based on practical experience gathered during the implementation of a large EIS-system using
SAS6.12 technology.
The main topics addressed in this article are: the size of a SAS MDDB and the way to circumvent
the 2GB limit; the choice between a mono-cube structure and a multi-cube structure and its
implementation in SAS; MDDB in a client/server environment; and the difference between ROLAP,
MOLAP and HOLAP and their practical use in SAS.
Introduction.
OLAP-applications are characterised by the flexibility with which users can view and report the
data any way they want, perform new ad hoc analyses, do large-scale complex calculations, and
perform dynamic exception reporting from large databases.
Together with the introduction of these applications, an interesting discussion has started about
the database technology that supports them best. The most quoted technology is the
multi-dimensional database. An MDDB organises its data in such a way that, at the moment of data
access, the database is able to respond to the query immediately. This leads to applications
with very good response times.
In traditional databases the design of the database is done with universally accepted techniques
such as normalisation. These techniques can be applied in almost every relational database. The
theory around multi-dimensional modelling, however, has not evolved as far.
This article is a step-by-step approach explaining how to design your MDDB in SAS. The approach
is based on practical experience.
Gathering elements.
The first step in the design of the MDDB is gathering the elements the model should contain.
This means the designer of the OLAP-application should know which elements are going to be used
in the application and should understand how they are calculated. This information can normally
be derived directly from the business rules the OLAP-application will support.
For example: suppose we want to build an application in which the sales margins can be analysed.
To implement this, at least the sales and cost price will be needed.
Elements of this type are called analysis variables. Analysis variables usually contain
measurable values. To give meaning to these analysis variables, they need to be connected with
some event. An event can be identified using class variables. Class variables have limited or
discrete class values. The base table holds the events and their analysis variables at the most
detailed level.
For example: a base table may contain sales transactions. The elements by which the sales are
identified are the class variables, e.g. time stamp and sales place. The amount and quantity of
the sale are the analysis variables.
Dimensions.
Some of the class variables are interrelated, e.g. there might be class variables containing the
sales place and the sales region. In a relational database, following the normalisation steps,
this kind of information would be put in separate tables. In MDDB technology, however, we keep
these elements together in a denormalised base table containing redundant information.
A set of class variables that are related to each other is called a dimension. Different
dimensions are by default never interrelated.
A logical way to show an MDDB is a star diagram. Each axis of the diagram represents a dimension.
The class variables that belong to the dimension are drawn along that axis. The way the class
variables are related is shown by the way they are connected.
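The terms introduced so far (base table, class variables, analysis variables, dimensions) can be made concrete with a small sketch. It is written in Python purely for illustration and is not SAS code; the table contents, the column names and the helpers `summarise` and `margin` are hypothetical and do not come from the system described in this article.

```python
from collections import defaultdict

# Hypothetical denormalised base table: one row per sales transaction.
# sales_place and sales_region are related, so the region is stored
# redundantly on every row instead of in a separate lookup table.
BASE_TABLE = [
    {"month": 199805, "sales_place": "Gent", "sales_region": "North",
     "sales": 120.0, "cost": 90.0},
    {"month": 199806, "sales_place": "Gent", "sales_region": "North",
     "sales": 80.0, "cost": 50.0},
    {"month": 199806, "sales_place": "Namur", "sales_region": "South",
     "sales": 200.0, "cost": 150.0},
]

# Class variables grouped into dimensions (an assumed grouping).
DIMENSIONS = {
    "time": ["month"],
    "location": ["sales_place", "sales_region"],
}
ANALYSIS_VARIABLES = ["sales", "cost"]


def summarise(rows, class_vars):
    """Sum every analysis variable for each combination of class values."""
    totals = defaultdict(lambda: dict.fromkeys(ANALYSIS_VARIABLES, 0.0))
    for row in rows:
        key = tuple(row[name] for name in class_vars)
        for var in ANALYSIS_VARIABLES:
            totals[key][var] += row[var]
    return dict(totals)


def margin(total):
    """Derived element: sales margin computed from two analysis variables."""
    return total["sales"] - total["cost"]


by_region = summarise(BASE_TABLE, ["sales_region"])
margins = {key[0]: margin(total) for key, total in by_region.items()}
```

Here the margin is not stored in the base table; it is derived from the two analysis variables after they have been summarised over the chosen class variable.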
Sizing.
The size of a SAS/MDDB can be calculated using:
• the number of analysis variables;
• the number of class variables;
• the maximum formatted length of each class variable;
• the length of each class variable;
• the number of distinct values in each class variable; and
• the number of valid crossings between class variables for each hierarchy.
The formula for the exact size of the SAS/MDDB can be found in reference [1].
The size of a SAS/MDDB is, however, limited to 2GB.
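The exact size formula is given in [1] and is not reproduced here. As a rough sketch only, the inputs listed above could be collected from a base table as shown below. The code is Python for illustration, not SAS; the table, the column names and the helper `sizing_inputs` are hypothetical, and "valid crossings" is taken to mean the value combinations that actually occur in the data.

```python
def sizing_inputs(rows, class_vars, analysis_vars, crossing):
    """Collect the quantities that the size calculation depends on.

    `crossing` is the list of class variables whose valid (actually
    occurring) value combinations should be counted.
    """
    distinct = {name: {row[name] for row in rows} for name in class_vars}
    return {
        "n_analysis_vars": len(analysis_vars),
        "n_class_vars": len(class_vars),
        # Unformatted values are used here; a formatted length would
        # need the display format of each variable, which is not modelled.
        "max_length": {name: max(len(str(v)) for v in values)
                       for name, values in distinct.items()},
        "n_distinct": {name: len(values) for name, values in distinct.items()},
        "n_valid_crossings": len({tuple(row[name] for name in crossing)
                                  for row in rows}),
    }


# Hypothetical base table with two class variables and one analysis variable.
rows = [
    {"region": "North", "product": "TV", "sales": 10.0},
    {"region": "North", "product": "Radio", "sales": 5.0},
    {"region": "South", "product": "TV", "sales": 7.0},
    {"region": "North", "product": "TV", "sales": 3.0},
]
inputs = sizing_inputs(rows, ["region", "product"], ["sales"],
                       crossing=["region", "product"])
```

In this toy table only 3 of the 2 x 2 = 4 possible region/product combinations occur, so the number of valid crossings is smaller than the product of the distinct-value counts.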
Figure: An example of a star diagram. Axes shown: product, entity, product type, time, brand,
customer class, Eurochannel and customer.
Summary crossings.
An ideal MDDB contains pre-summarised analysis variables for each combination of class variables.
In this way the MDDB is able to respond immediately to each kind of query. A combination of
class variables is called a summary crossing.
There are some drawbacks to the approach in which all crossings are available. The first problem
is the number of possible crossings and therefore the size of the MDDB. In theory this can be
very high (see the example below). A second disadvantage is the usability of the crossings: a
lot of the available summary crossings will never be used by any query.
The total number of summary crossings is given by the following formula: for each combination of
dimensions, multiply the numbers of class variables within those dimensions, and sum these
products over all combinations of dimensions.
For example: imagine a base table containing 3 dimensions, e.g. time, location and product. The
number of class variables is e.g. 5 for the time dimension, 2 for the location dimension and 7
for the product dimension. The possible combinations of the dimensions are:
time-location-product, time-location, time-product, location-product, time, location and
product. This leads to 5*2*7 + 5*2 + 5*7 + 2*7 + 5 + 2 + 7 = 143 summary crossings.
To overcome this problem, SAS/MDDB allows hierarchies to be defined at MDDB build-time. The
summary crossings chosen are those that correspond to the queries most likely to be asked. In
this way SAS/MDDB reduces the number of unused hierarchies. When a query is asked for which no
hierarchy exists, the SAS/MDDB selects the hierarchy that is closest to the one corresponding to
the query and calculates the information at run-time. By selecting the closest hierarchy, a very
good response time is maintained.
Accessing the SAS/MDDB.
SAS supplies three different ways to access the data in an MDDB.
The most obvious use of the MDDB is in SAS/EIS. The metabase of SAS/EIS is a user-friendly tool
in which the attributes of the MDDB can be defined. This information is used to assist the
building of the OLAP-applications. Access to the data is handled by the SAS/EIS model
co-ordinator (EMDDB_M class). This model co-ordinator can be sub-classed, which allows in-depth
customisation.
Another way of accessing the data of the MDDB is through SCL. In SCL an instance of the MDDB_M
class can be created. By sending the correct methods to this object, the data of the MDDB can be
fetched.
The last way of using the data in the MDDB is in Base/SAS. A special engine (SASSFIO) is
provided to open the MDDB as a library. In this way the MDDB tables are directly available to
Base/SAS programs.
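The count of 143 summary crossings in the three-dimension example (5, 2 and 7 class variables for time, location and product) can be reproduced with a short calculation. The sketch is in Python for illustration only; the function name `count_summary_crossings` is hypothetical and not part of SAS.

```python
from itertools import combinations
from math import prod


def count_summary_crossings(class_vars_per_dimension):
    """Sum, over every non-empty combination of dimensions, the product
    of the numbers of class variables in the chosen dimensions."""
    sizes = list(class_vars_per_dimension.values())
    total = 0
    for r in range(1, len(sizes) + 1):
        for combo in combinations(sizes, r):
            total += prod(combo)
    return total


example = {"time": 5, "location": 2, "product": 7}
crossings = count_summary_crossings(example)
```

The same total follows from (5+1)*(2+1)*(7+1) - 1 = 143, which shows how quickly the number of crossings grows with every additional dimension or class variable.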
Design issues.
There are some factors that need to be considered during the design of the MDDB. The choices made
influence the size of the SAS/MDDB and affect the way the OLAP-applications need to be built.
The first issue is the choice between combined class variables and single class variables. A
single class variable is a variable whose values alone uniquely identify the class. A combined
class variable, on the other hand, needs multiple variables to identify the class.
For example: the class year/month can be identified by the combination of two variables, year
and month. E.g. June 1998 can be identified by the combination year = 1998 and month = 6. The
same information can be stored in a single variable yearmm. E.g. the month June 1998 can be
identified by the single variable yearmm = 199806.
There is an important advantage when an MDDB is built with single class variables. The
relationships between single class variables of the same dimension can very easily be stored in
SAS formats. These formats can become very handy when building the OLAP-application, and can
also be used when customising the MDDB access.
On the other hand, single class variables have a higher number of distinct class values. They
also have a bigger class variable size, compared to combined class variables. Both features
negatively influence the size of the MDDB.
A next issue is the choice of technical keys instead of extended class variables or formatted
class variables.
As indicated in the sizing of the MDDB, the storage space of an MDDB is negatively influenced by
the length of the formatted class variables. It is therefore advisable to replace long class
variables with smaller technical keys. A technical key is an arbitrary unique number. The
relation between the actual class variable value and the technical key can be stored in SAS
formats and SAS informats. The major advantage of this approach is the smaller size of the MDDB.
The translation to and from the technical key can be done through customised data access.
In OLAP-applications it often happens that elements of totally different origin are accessed
together.
For example: in an OLAP-application that will be used to analyse the profitability of a company,
accounting and sales elements are often used together.
A problem exists when different types of elements are identified by different kinds of class
variables. These elements can still be combined in a common base table, using the total set of
class variables. This results in a sparse data set in which many analysis elements are missing.
The MDDB built on top of this base table will, in its turn, have a high degree of sparsity.
This means a very big MDDB will be built of which only a limited amount of storage space is
actually used. This approach is called a mono-cube structure.
An alternative way to store the data is to keep different kinds of analysis elements in separate
MDDBs. This leads to reduced sizes of the individual MDDBs. This way of storing analysis
elements is called the multi-cube approach. The retrieval of the data in a multi-cube can be
accomplished through customised data access.
A last design issue exists when the MDDB is built on an enormous base table. In this case the
MDDB might grow above 2GB. This is a problem since the SAS/MDDB is limited to this size.
In order to solve this problem, this one MDDB can be split into multiple MDDBs. The way to do
this is by spreading the data of the base table over multiple base tables.
Suppose a base table is built with n dimensions. This base table can be split into n base tables
containing n-1 dimensions. This is a technique already used in SAS/Motore (SAS 6.11).
For example: imagine a base table containing 3 dimensions, e.g. time, location and product. The
data in this base table can be spread over 3 other base tables containing the dimensions
time-location, time-product and location-product.
The access of the data in these multiple MDDBs can be done by customised data access.
Client/server.
Since OLAP-applications use large amounts of data and perform complex analyses, they need to be
built on a platform with the necessary resources. This is the reason why OLAP-applications are
often associated with a client/server architecture.
There are several ways in which a SAS/MDDB can be accessed in a client/server environment.
The first way to access a SAS/MDDB on a server is by using the Remote Library Service (RLS).
This technique allows data of an MDDB to be transported from the server to the client. The
analysis of the data is still performed on the client.
In case of heavy analyses it might be preferable
to perform the calculations on the server as
well. This means that only the result of the
calculations will be transported in the direction
of the client.
There are two ways in which this can be
implemented.
In the first solution, every MDDB access
results in the creation of an instance of the
MDDB_M class on the remote SAS session.
This instance is used to fetch the data. After the
calculation of the result, the instance will be
terminated.
In the second solution, a permanent instance of
the MDDB_M class (or a customised sub-class)
is maintained on the server. This permanent
instance can only exist in an AF-application.
This AF-application can be executed as a
background process on the server. The remote
session is used to communicate with the
background process.
The second solution is advisable when a
customised sub-class for the MDDB access is
constructed. The creation of an instance of this
sub-class every time the MDDB is accessed
would be too time consuming.
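The difference between the two solutions is one of object lifetime, which can be sketched without any SAS specifics. The Python fragment below is illustrative only: `CubeAccessor` is a hypothetical stand-in for a customised access class, the dictionary `CUBE` stands in for the MDDB, and construction is simply assumed to be the expensive step.

```python
class CubeAccessor:
    """Hypothetical stand-in for a customised data-access class.

    Construction is assumed to be the expensive step, so the class
    counts how many times it has been instantiated.
    """
    instances_created = 0

    def __init__(self, cube):
        CubeAccessor.instances_created += 1
        self._cube = cube

    def fetch(self, key):
        return self._cube.get(key)


CUBE = {("1998", "North"): 200.0, ("1998", "South"): 150.0}


def fetch_with_temporary_instance(key):
    """First solution: create an instance per access, then discard it."""
    accessor = CubeAccessor(CUBE)
    result = accessor.fetch(key)
    del accessor  # the instance is terminated once the result is known
    return result


class PermanentService:
    """Second solution: one long-lived instance serves every request."""

    def __init__(self, cube):
        self._accessor = CubeAccessor(cube)

    def fetch(self, key):
        return self._accessor.fetch(key)


keys = [("1998", "North"), ("1998", "South"), ("1998", "North")]

CubeAccessor.instances_created = 0
temporary_results = [fetch_with_temporary_instance(k) for k in keys]
temporary_count = CubeAccessor.instances_created

CubeAccessor.instances_created = 0
service = PermanentService(CUBE)
permanent_results = [service.fetch(k) for k in keys]
permanent_count = CubeAccessor.instances_created
```

Both variants return the same data; they differ only in how often the access object has to be built, which is why the permanent instance pays off when the constructor is costly.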
HOLAP.
MOLAP stands for an OLAP-application built
on the MDDB-technology. ROLAP on the
other hand is an OLAP-application on top of a
relational database. HOLAP is a hybrid form
of MOLAP and ROLAP.
SAS6.12 also offers an HOLAP extension, in
which the data is partly stored in a MDDB and
partly in data sets.
The result of this extension is a reduced
storage space and improved response times.
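How the HOLAP extension works internally is not described here, so the following is only a generic sketch of the hybrid idea, written in Python for illustration: frequently used crossings are kept pre-summarised, and anything else is calculated from detail data on request. The class `HybridStore`, the helper `aggregate` and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical detail data kept outside the cube (the "data set" part).
DETAIL_ROWS = [
    {"year": 1998, "region": "North", "product": "TV", "sales": 120.0},
    {"year": 1998, "region": "North", "product": "Radio", "sales": 80.0},
    {"year": 1998, "region": "South", "product": "TV", "sales": 200.0},
]


def aggregate(rows, class_vars):
    """Sum sales for each combination of the given class variables."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[name] for name in class_vars)] += row["sales"]
    return dict(totals)


class HybridStore:
    """Answers from pre-summarised crossings when possible, else from detail."""

    def __init__(self, rows, stored_crossings):
        self._rows = rows
        # The "cube" part: only the crossings expected to be queried often.
        self._cube = {tuple(c): aggregate(rows, c) for c in stored_crossings}

    def query(self, class_vars):
        key = tuple(class_vars)
        if key in self._cube:
            return self._cube[key], "cube"
        return aggregate(self._rows, class_vars), "detail"


store = HybridStore(DETAIL_ROWS, stored_crossings=[["year", "region"]])
stored_result, stored_source = store.query(["year", "region"])
adhoc_result, adhoc_source = store.query(["product"])
```

Only one crossing is stored in this sketch, which keeps the pre-summarised part small, while a query on any other crossing is still answered, at the cost of a run-time calculation.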
Conclusion.
SAS/MDDB is a very valuable add-on to the
SAS-system that allows the design of
high-performance OLAP-applications. SAS/MDDB
also offers the necessary tools to customise the
MDDB according to the needs of the
OLAP-application.
Reference.
[1] M. Moorman. Getting a Grip on a Growing
Concern: Managing Large Data Sets with a
Multidimensional Database. Observations First Quarter 1997, pp. 52-56.