Data Warehouse Design and Implementation
Paul Murray, Data Log Consultants Ltd. Cheltenham UK.
Abstract
A UK-based financial institution undertook a Business Process Re-engineering (BPR) project in the early 1990s, its objective being to become more customer orientated rather than product dependent, so that it could react more quickly to changes in the marketplace. From the IT systems derived from the BPR, a Data Warehouse was identified as the solution to accommodate the needs of business, management and process information reporting.
After exploring the capabilities of tools already in use on site, and others, the SAS system was
selected to satisfy the reporting needs and as the primary tool used to create the Data Warehouse.
The data came from various sources, the largest of which was the OLTP system supported by an
IMS/DL1 database architecture and associated products. This paper will focus on the tasks
undertaken to deliver a Data Warehouse, and state the particular SAS System products, and
features thereof, that accomplished these tasks.
Introduction
It has been said before that a Data Warehouse must be built, it cannot be bought and just filled
with data. The steps undertaken in creating this particular Data Warehouse were:
• Analysis of existing data sources
• Inclusion of operational reference data
• Creation of a Data Dictionary
• Denormalisation
• Specification and coding of an incremental load
• Design of the Data Warehouse structure
• Migration of business data
• Reporting from the Data Warehouse
• User interfaces
• Audit of code and documentation
Because of the flexibility of the SAS System, all the tasks mentioned above were carried out using one or more SAS Institute products. This fitted in well with the company strategy, as all future MI needs were to be fulfilled with the SAS System. Some examples of the SAS products that carried out particular Data Warehouse tasks:
• IMS analysis using SAS/ACCESS to IMS/DL1
• Data Dictionary - Base SAS, SAS/ACCESS and SAS/AF
• Specification and coding of an incremental load - Base SAS (especially the DATA step, PROC SQL and the SAS macro language)
• OLTP reference data - Base SAS (especially PROC FORMAT and indexed datasets)
• Migration of business data - Base SAS and SAS/ACCESS to IMS/DL1
• Reporting from the Data Warehouse - Base SAS, SAS/GRAPH, SAS/STAT and SAS/ETS
• User interfaces - SAS/EIS, SAS/AF, SAS/ASSIST and SAS/INSIGHT
Although the Data Warehouse discussed is insurance based, many of the techniques adopted can be
used to create a Data Warehouse in any industry.
Analysis Of Existing Data Sources
The data was primarily sourced from the OLTP system, which used an IMS/DL1 database architecture and associated products. Additional sources were:
• The Walker Accounts Package
• In-house Marketing Databases
• A Performance Database
Much effort was put into analysing the structure of the OLTP Database concerning:
• Subject
• Key attributes
• Date and time based attributes
• Reporting requirements
• Analysis and forecasting potential
• Data coding - used extensively throughout the transaction system
The information collected in these areas helped to design the structure of the Data Warehouse. The structure covers not only how the data is stored, but also archiving and management, the data marts and their summarisation requirements.
Inclusion of Operational Reference Data
The OLTP system was designed to be as generic as possible, so for example new types of insurance
could be added to the system without changes being required to the infrastructure or applications.
Part of this was achieved by having a single repository of all potential values of database attributes
(e.g. all possible postcodes, vehicle types, incident circumstances). This 'Reference Data' is continuously updated.
Reference data sources such as Address and Area Rating down to car model and gender had to be
assimilated into, or used by, the Data Warehouse. Some reference data were found not to be needed by the Data Warehouse at all (e.g. underwriter rule sets). Maintenance programs had to be
developed to keep the reference data up to date.
Reference data was included in the Data Warehouse in one of two ways: either as SAS formats (gender, marital status, policy type), or as look-up tables in the case of the largest sources (address, rating area, postcode, vehicles).
Some reference data was retrieved during the loading of the database and stored in its dereferenced form, rather than having to derive values at reporting time. This is useful in cases such as rating areas, which can change over time, so the rating area at the time of quotation is stored.
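To illustrate the two approaches, the sketch below defines a simple character format for one of the smaller reference sources and indexes one of the larger look-up tables. The library names, code values and dataset names are invented for the example and are not those used on site.

proc format library=dwfmts;               /* catalogue of warehouse formats  */
   value $gender 'M' = 'Male'             /* small reference source held as  */
                 'F' = 'Female';          /* a SAS format                    */
run;

proc datasets library=refdata nolist;
   modify postcode;                       /* large reference source kept as  */
   index create postcode / unique;        /* an indexed look-up table        */
quit;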
Creation of a Data Dictionary
The OLTP system, the main data source for the Data Warehouse, was in the latter stages of
development when the Data Warehouse Project was initiated in parallel. This meant that the Data
Warehouse structure had to reflect any changes to the OLTP database during final development,
through five phases of testing and up to release.
A data dictionary was built to contain, amongst other details, information relating the IMS/DL1 source tables and attributes to their SAS counterparts. The IMS/DL1 information was created by combining the site's central file of attribute and table definitions (a VSAM file) with the IMS database definition files (DBDs) and Program Specification Blocks (PSBs). The VSAM file was needed because the transaction system developers, to make a more flexible system, did not fully define the database within IMS, but used in-house developed 'middleware' to access the VSAM file of definitions.
The dictionary was updated with the Data Warehouse denormalised entity information, once the entities were created, using the SASHELP views and PROC CONTENTS.
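As a minimal sketch of that step, the code below reads the SASHELP views and PROC CONTENTS output for an assumed warehouse library DWLIB and writes the results to an assumed dictionary dataset; the library and dataset names are placeholders, not the names used on site.

/* Using the SASHELP views: one row per variable in the warehouse library */
proc sql;
   create table dict.entity_columns as
   select libname, memname, name, type, length, label
   from sashelp.vcolumn
   where libname = 'DWLIB';
quit;

/* PROC CONTENTS provides the same details for a single entity */
proc contents data=dwlib.policy noprint
              out=work.policy_columns (keep=memname name type length label);
run;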
The dictionary was then supplemented with more metadata until it contained enough information to be used to build a prototype Data Warehouse, including development environments, and the incremental load. In addition, the dictionary also drove the build process (see later). A SAS/AF based front end provided an interface to facilitate deriving new variables and defining the use of variables in reports.
As new releases of the OLTP system, and the metadata changes that came with them, were passed into test, the Data Warehouse data dictionary had to be updated regularly, and the test environments used for reporting and application development had to be built and re-built. Code was developed for the build that processed new releases of the metadata describing the OLTP database. This kept code re-writes to a minimum with subsequent new releases of the operational system.
During testing of the transaction system the Data Warehouse team had to build and maintain
several releases of the Data Warehouse simultaneously, receiving data from different test releases
of the OLTP databases. The different releases allowed different development teams working on the OLTP system to test their own programs, and their integration with the whole system.
The code that formed the incremental load to the Data Warehouse had to keep pace with the OLTP database. With new tables and attributes being added, some tables being amalgamated and some being removed, the Data Warehouse load and denormalisation programs had to be dependent on the data dictionary.
A SAS/AF based application was used to access and enhance the metadata contained in the data dictionary, adding in derived variables and referencing the code used to create them. As multiple users were updating the PC/network dictionary datasets simultaneously, access was controlled by SAS/SHARE, with SAS/CONNECT providing upload to the mainframe for build execution.
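A hedged sketch of that arrangement is shown below; the SAS/SHARE server id, the SAS/CONNECT session name and the library paths are placeholders rather than the actual site configuration.

/* Concurrent access to the dictionary datasets through a SAS/SHARE server */
libname dict 'f:\dwdict' server=shr1;

/* Upload the dictionary to the mainframe session before a build run */
signon mfhost;                       /* remote session configured elsewhere */
rsubmit;
   libname mfdict 'DW.DICT.LIBRARY';
   proc upload data=dict.data_dictionary out=mfdict.data_dictionary;
   run;
endrsubmit;
signoff;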
The application provides:
• attributes to be accessed by Data Warehouse build and load programs
• drill down on the entities to attributes to code
• reports on entities for programmers
• query on entities, attributes and their sources
• usage of attributes in the Data Warehouse, keys and indices
• usage of attributes in reports (classification and analysis)
Denormalisation
The OLTP system architecture is typical of a transaction database, with the transaction system
updating normalised tables. The normalised tables are in subject oriented IMS databases, called
Domains.
The structure of the denormalised tables was worked out by considering each domain, its key variables and the relationships between the tables (to estimate the volume of data once denormalised).
The denormalised tables were created by making physical joins on the tables (segments) within the
domains. Duplication and redundancy are impossible to avoid during the denormalisation process,
and it required a paradigm shift for people used to working with normalised data to work with the
denormalisation process.
The simplest way to explain denormalisation was to remind the team that they denormalise data
every time they report on the joining of two or more tables, so the process we wanted to follow was
to use reporting methods for a normalised structure, but instead of putting the results on paper we
would be appending the resulting denormalised data to the Data Warehouse.
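A minimal sketch of that idea for one domain is shown below, assuming hypothetical policy and vehicle segments already staged as SAS datasets; all table and column names are illustrative.

/* Join the normalised segments of the domain ...                         */
proc sql;
   create table work.policy_denorm as
   select p.policy_no, p.start_date, p.policy_type,
          v.vehicle_id, v.vehicle_model
   from stage.policy as p
        left join stage.vehicle as v
        on p.policy_no = v.policy_no;
quit;

/* ... then, instead of putting the result on paper, append it to the     */
/* corresponding Data Warehouse entity                                    */
proc append base=dw.policy_denorm data=work.policy_denorm;
run;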
Some duplication was avoided by summarising data (e.g. named drivers per vehicle, or countries visited per policy); the sources of any summarised data were stored as stand-alone tables to avoid loss of information. Some tables were transposed; others were not used in the denormalisation process at all but were left as look-up tables.
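The summarisation of a repeating group might look something like the sketch below, again with invented names; the detail table itself is retained as a stand-alone table so that no information is lost.

/* One row per policy and vehicle: a count of named drivers */
proc summary data=stage.driver nway;
   class policy_no vehicle_id;
   output out=stage.driver_count (drop=_type_ rename=(_freq_=driver_count));
run;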
The decision on how far to go with denormalisation was to draw the line at the point where storage space would be wasted on a data structure that did nothing to make reporting, or summarising into Data Marts, easier.
The final denormalised structure was arrived at by an iterative process: the initial structure was defined, and the Data Warehouse Reporting Team (which also specified the Data Marts) was briefed to see how useful each denormalised Data Warehouse entity was for reporting and for feeding the data marts.
If relationships in the data change, or relationships are discovered that make reporting more difficult using the denormalised structure, tables can be moved around within the structure. This process may be harder to accomplish once the Data Warehouse has been in production for some time, but we are confident we have the tools and methodology to accomplish future structural changes.
Specification and Coding of the Incremental Load
To update the Data Warehouse, the IMS log was used to generate an audit dataset to be read by SAS programs, the results being SAS datasets resembling the IMS segments they came from.
The data was validated to ensure that key and essential data were present or within certain ranges. Any data containing erroneous values was passed to the 'spin' database and reported on, the 'spin' database being a SAS library based copy of the OLTP database structure. A SAS/AF based front end was developed to maintain the spin database.
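A much simplified sketch of the validation step is given below; the dataset names, key variable and range check are assumptions made for the example.

/* Route records failing key or range checks to the spin library, */
/* and pass clean records on to the denormalisation step          */
data work.policy_ok spin.policy_err;
   set audit.policy;
   if missing(policy_no)
      or not ('01JAN1990'd <= start_date <= today()) then output spin.policy_err;
   else output work.policy_ok;
run;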
Only records that belonged to a complete business case (a business case being a logical business
process such as a quotation, a claim, a policy or claim adjustment) would normally be entered into
the Data Warehouse. A process may take a number of days and customer contacts to complete, so records belonging to an incomplete business case are also added to the spin database.
The spin maintenance facility allowed the forcing of incomplete business cases through the Data Warehouse load process, to help with things like end-of-period accounting.
The tables containing validated data belonging to complete business cases were then denormalised. This created records in the structures adopted by the Data Warehouse, to which they were then appended.
During the incremental load process, data flows into the Data Warehouse in a number of different ways:
• All record changes are appended to the movement database.
• All quotation records go to the quote database.
• All claim details are passed to the claims database.
• The most up-to-date versions of all records are stored in the current picture database.
• Summaries of the combination of movement and current picture update the data marts.
The denormalisation process was specified in terms of which tables would be joined, by which keys, which tables would be transposed before joining, and which would be left as stand-alone look-up tables. During this specification process it was noted that the eventual code could be developed as a macro, making changes to the denormalised structure (as noted in the previous section) easy to implement.
During this process control totals were calculated, based on transactions and units of work
processed. These are stored and referenced during each subsequent incremental load, providing the
basis of checkpoint data enabling the development of mechanisms to avoid the same data being
loaded to the Data Warehouse more than once, or out of sequence. To ensure flexibility in case of
job failure, override facilities are also in place.
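The checkpoint idea could be sketched as a small macro such as the one below; the control table, variable names and abort behaviour are assumptions for the example rather than the code used on site.

%macro check_load(auditds=, ctlds=dw.control_totals);
   /* last unit of work loaded, and first unit of work in today's audit data */
   proc sql noprint;
      select max(last_uow) into :last_uow  from &ctlds;
      select min(uow_no)   into :first_uow from &auditds;
   quit;
   /* stop the job if data would be loaded twice or out of sequence */
   %if %eval(&first_uow ne &last_uow + 1) %then %do;
      %put ERROR: expected unit of work %eval(&last_uow + 1), found &first_uow;
      %abort cancel;
   %end;
%mend check_load;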
A SAS/AF application exists for viewing the data at all stages of the incremental load, and viewing
on-line reports on data throughput, spin data and data loaded to the Data Warehouse. Reports on
throughput are also downloaded to the PC and exported to spreadsheet packages to fit in with in-house reporting standards.
Design of Data Warehouse Structure
The Data Warehouse has four distinct elements:
• Daily movement database
• Current Picture database
• Specific databases (Quotation and Claim)
• Data Marts
All changes to data are written to the movement database. Where information is needed to fill in
missing values in movement data, values are read from the current picture.
Daily movement data from the previous day's business comes from the audit dataset by way of the incremental load. As a table is added to the movement database, all preceding datasets are aged (as with computer performance databases), leaving a series of datasets, each holding a day's worth of movement. The number of generations of these datasets is parameter driven. The movement data is used by the data mart updating programs, and the relevant data is also passed to the Quotation and Claim databases.
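One way to implement this aging in Base SAS is the AGE statement of PROC DATASETS, sketched below for an assumed five-generation movement library; the member names and the number of generations are illustrative.

proc datasets library=mvmt nolist;
   /* day5, the oldest generation, is deleted and each remaining member */
   /* is renamed one generation older; the new day's movement is then   */
   /* loaded as mvmt.day1                                               */
   age day1 day2 day3 day4 day5;
quit;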
As the audit trail only contains details of object images that have changed, creating a denormalised record in the movement database requires that the movement data be referenced against the data as it stood before the change, i.e. the current picture.
The current picture mirrors the denormalised structure of the movement database, but contains the most up-to-date values of all the records in each of the Data Warehouse entities. The records in the current picture have lifetimes according to their business definitions, so an incident record, the basis of a claim, can have a different lifetime in the current picture database than a policy record. Without the current picture, a trawl through up to a year's worth of movement data would be required to update the Data Warehouse.
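Keeping the current picture up to date from the day's movement could be sketched with the DATA step UPDATE statement, assuming both datasets are sorted by the entity key; all names are hypothetical.

/* Apply the latest movement records to the current picture by key.      */
/* Missing values in the movement data do not overwrite existing values. */
data dw.policy_current;
   update dw.policy_current mvmt.policy_day1;
   by policy_no;
run;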
Two further databases were created from the denormalised structure to enable data to be held for longer periods than is allowed for in the Data Warehouse movement data, which holds all types of data for short-term analysis only.
The quotation database exists because of the high volume of enquiries passing through the Data Warehouse. A quotation can be a very short-lived piece of information with a lot of associated attributes and possible outcomes. Reports are needed on quotation type (e.g. initial, additional, adjustment), and specific details are needed to investigate potential new business areas and to forecast the volume of quotations issued and taken up. A forecast of potential enquiries can help with planning staffing levels of telephone agents, while forecasting quotations taken up gives a direct estimate of future income when tied in with a forecast of claims.
Claims can have a long life cycle, potentially several years, making the subsequent collection of data for claim analysis via movement and archive data untimely and expensive. Analysing claims can involve a lot of customer contacts, written or by telephone; a number of organisations can be involved in settling a claim; and amounts have to be set aside for settlement. The complexity of the analysis, and the extended life cycle of a claim, made claims a candidate for a separate database.
Data marts were created to fulfil separate reporting requirements. Marts are sets of summaries that provide information to different types of business report, such as:
• Policy (quotations, adjustments, new business)
• Claims
• Vehicles
• Contents
The idea behind the data marts is that they are sourced from the Data Warehouse by programs
that run after the incremental load. The mart data may consist of a merge and then a summary of
one or more of the tables from Data Warehouse movement and current picture databases (see
Figure 1.).
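One such mart refresh might be sketched as below: a join of movement against the current picture followed by a summary. The table, class and analysis variable names are invented for the example.

/* Combine the day's movement with descriptive details from the current picture */
proc sql;
   create table work.mart_input as
   select c.policy_type, c.region, m.movement_type, m.premium
   from mvmt.policy_day1 as m
        inner join dw.policy_current as c
        on m.policy_no = c.policy_no;
quit;

/* Summarise into the policy data mart */
proc summary data=work.mart_input nway;
   class policy_type region movement_type;
   var premium;
   output out=marts.policy_summary (drop=_type_ _freq_)
          sum=total_premium n=policy_count;
run;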
Migration of Business Data
Once all the BPR-derived applications, including the OLTP system and the Data Warehouse, are operational, data stored in the current business systems (these will one day be the legacy systems) will need to be integrated into the Data Warehouse and OLTP database.
A major project is underway to migrate data to the OLTP system. The team managing the migration is using the SAS System to reconcile data from the business systems with the new OLTP system data, to ensure that data is not 'double counted' and that key values match up once the data is migrated.
The Data Warehouse plan is to take advantage of this project and extract the segments of the migrated data needed to update the Data Warehouse from the audit trail generated by the migration jobs, so the load from the migration will be achieved by the incremental load jobs.
Reporting From the Data Warehouse
Because the Data Warehouse has been designed to make analysis of the data easier for the user, reports can be generated from all the different sections of the Data Warehouse: from the movement data and current picture, the quotation and claims databases, and the data marts.
Initially the Data Warehouse had to be defined to provide reporting on the OLTP system, i.e. to fill the gap between operations and management in respect of day-to-day movement. Subsequent activity has been related to satisfying business enquiries from business users of the Data Warehouse. Reports are generated by category under business area; in some cases data marts have been generated to provide ad-hoc and what-if analysis against the data reported.
User Interfaces
User interfaces have been created for:
• The data dictionary
• Viewing and maintaining spin data
• Viewing Data Warehouse data and reports on the incremental load
• Analysing accounts data for potential fraud
Audit of Programs/Documentation
In order to leave behind a manageable and maintainable system, all code and documentation created in building the Data Warehouse had to pass testing and an audit. During the project the team changed from being largely external consultant based to being a mixture of consultants and staff. The site standards in naming conventions and programming style were taken on board and modified to reflect SAS based programming.
Guidelines were issued to the programmers on the project; these included:
• naming conventions for variables, datasets and program/macro names
• coding style
• efficiency techniques
• macro library references (which was continuously updated)
• documentation style
• testing process
• implementation to production
As well as this, every week a 'random dope test' approach was adopted where a couple of hours were
set aside for the audit police to look over some of the latest releases from each programmer. Any
problems were sorted out, and any good practice emerging or interesting techniques used were
added to the guidelines and distributed.
The Future
Now that the Data Warehouse is on-line, applications have been specified for:
• Replacing the existing EIS with a SAS/AF and SAS/EIS application
• On-line analysis and forecasting of quotations, and similarly for claims
• A marketing analysis application through customer and product segmentation
The Data Warehouse can become a victim of its own success, as the more information that is given
to managers and business analysts the more information they want and the more questions they
ask.
Summary
This paper has outlined the many considerations and tasks involved in building one Data
Warehouse, from analysis of data sources to design of Data Warehouse structure to loading the
Data Warehouse and reporting from data marts. In all these areas the SAS System has provided
the tools required to complete the task.
As can be seen, every Data Warehouse will have a different design, depending on the industry concerned, the sources feeding it, the analysis and reporting requirements placed upon it and, not least of all, the tool selected to provide it.
Figure 1. Schematic of Data Warehouse Architecture (overall design; components shown: audit data, staging, validation, reference data, subject orientated databases, archive, data marts)
Acknowledgements
SAS, Base SAS, SAS/ACCESS, SAS/AF, SAS/ASSIST, SAS/CONNECT, SAS/EIS, SAS/ETS, SAS/GRAPH, SAS/INSIGHT, SAS/SHARE and SAS/STAT are registered trademarks of SAS Institute Inc., Cary,
NC, USA.
For Further Information Contact:
Paul Murray
Data Log Consultants Ltd.
26 Hales Road
Cheltenham
GLOCS
GL52 6SE
UK.


+44 1242 573709
[email protected]
Gary Beck
4.G.B. Ltd.
4,Keswick Close,
Cringleford,
Norwich,
Norfolk.
NR4 6UW.
UK.
+44 1603 506988