Data Warehouse Design and Implementation

Paul Murray, Data Log Consultants Ltd., Cheltenham, UK

Abstract

A U.K. based financial institution undertook a Business Process Re-engineering (BPR) project in the early 90's, its objective to become customer orientated rather than product dependent, so that it could react more quickly to changes in the marketplace. From the IT systems derived from the BPR, a Data Warehouse was identified as the solution to accommodate the needs of business, management and process information reporting. After exploring the capabilities of tools already in use on site, and others, the SAS System was selected to satisfy the reporting needs and as the primary tool used to create the Data Warehouse. The data came from various sources, the largest of which was the OLTP system supported by an IMS/DL1 database architecture and associated products. This paper will focus on the tasks undertaken to deliver a Data Warehouse, and state the particular SAS System products, and features thereof, that accomplished these tasks.

Introduction

It has been said before that a Data Warehouse must be built; it cannot be bought and just filled with data. The steps undertaken in creating this particular Data Warehouse were:

• Analysis of existing data sources
• Inclusion of operational reference data
• Creation of a data dictionary
• Denormalisation
• Specification and coding of an incremental load
• Design of the Data Warehouse structure
• Migration of business data
• Reporting from the Data Warehouse
• User interfaces
• Audit of code and documentation

Because of the flexibility of the SAS System, all the tasks mentioned above were carried out using one or more SAS Institute products. This fitted in well with the company strategy, as all future MI needs were to be fulfilled with the SAS System.
Some examples of which SAS products carried out Data Warehouse tasks:

• IMS analysis - SAS/ACCESS to IMS/DL1
• Data dictionary - Base SAS, SAS/ACCESS and SAS/AF
• Specification and coding of an incremental load - Base SAS (especially the DATA step, PROC SQL and the SAS macro language)
• OLTP reference data - Base SAS (especially PROC FORMAT and indexed datasets)
• Migration of business data - Base SAS and SAS/ACCESS to IMS/DL1
• Reporting from the Data Warehouse - Base SAS, SAS/GRAPH, SAS/STAT, SAS/ETS
• User interfaces - SAS/EIS, SAS/AF, SAS/ASSIST and SAS/INSIGHT

Although the Data Warehouse discussed is insurance based, many of the techniques adopted can be used to create a Data Warehouse in any industry.

Analysis of Existing Data Sources

The data was primarily sourced from the OLTP system, using an IMS/DL1 database architecture and products. Additional sources were:

• The Walker accounts package
• In-house marketing databases
• A performance database

Much effort was put into analysing the structure of the OLTP database concerning:

• Subject
• Key attributes
• Date and time based attributes
• Reporting requirements
• Analysis and forecasting potential
• Data coding, used extensively throughout the transaction system

The information collected in these areas helped to design the structure of the Data Warehouse. The structure covers not only how the data is stored, but also archiving and management, the data marts and their summarisation requirements.

Inclusion of Operational Reference Data

The OLTP system was designed to be as generic as possible, so that, for example, new types of insurance could be added to the system without changes being required to the infrastructure or applications. Part of this was achieved by having a single repository of all potential values of database attributes (i.e. all possible postcodes, vehicle types, incident circumstances). This 'reference data' is continuously updated.
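Such reference data can be consumed in Base SAS in two ways: small code sets as formats, and large sources as indexed look-up tables. The sketch below illustrates both; all dataset, library and variable names are hypothetical, not those of the system described.

```sas
/* Small code set as a SAS format (hypothetical values) */
proc format;
   value $marstat
      'S' = 'Single'
      'M' = 'Married'
      'D' = 'Divorced';
run;

/* Large source as an indexed look-up table, keyed for direct access */
data ref.postcode (index=(pcode / unique));
   set staging.postcode;          /* hypothetical extract */
run;

/* Dereferencing at load time with a keyed SET */
data dw.policy_load;
   set work.policy_in;
   set ref.postcode key=pcode / unique;
   if _iorc_ ne 0 then do;       /* no match: clear error, blank result */
      _error_ = 0;
      ratearea = ' ';
   end;
run;
```

The keyed SET avoids reading the whole look-up table for each incoming record, which matters when the source (e.g. postcodes) runs to millions of rows.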
Reference data sources, from address and area rating down to car model and gender, had to be assimilated into, or used by, the Data Warehouse. Some reference data was found not to be needed by the Data Warehouse at all (e.g. underwriter rule sets). Maintenance programs had to be developed to keep the reference data up to date. Reference data was incorporated into the Data Warehouse in one of two ways: either as SAS formats (gender, marital status, policy type), or as look-up tables in the case of the largest sources (address, rating area, postcode, vehicles). Some reference data was retrieved during the loading of the database and stored in its dereferenced form, rather than having to derive values at reporting time. This is useful in the case of, say, rating areas that can change, so the rating area at the time of quotation is stored.

Creation of a Data Dictionary

The OLTP system, the main data source for the Data Warehouse, was in the latter stages of development when the Data Warehouse project was initiated in parallel. This meant that the Data Warehouse structure had to reflect any changes to the OLTP database during final development, through five phases of testing and up to release. A data dictionary was built to contain, amongst other details, information relating the IMS/DL1 source tables and attributes to their SAS counterparts. The IMS/DL1 information was created by combining the site's central file of attribute and table definitions (a VSAM file) with the IMS database definition files (DBDs) and Program Specification Blocks (PSBs). The VSAM file was needed because the transaction system developers, to make a more flexible system, did not fully define the database within IMS, but used in-house developed 'middleware' to access the VSAM file of definitions. The dictionary was updated with the Data Warehouse denormalised entity information, once those entities were created, using the SASHELP views and PROC CONTENTS.
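The dictionary update from the SASHELP views and PROC CONTENTS mentioned above can be sketched as follows; the library and dataset names are hypothetical stand-ins for the actual dictionary structure.

```sas
/* Capture the SAS side of the dictionary from the SASHELP views */
proc sql;
   create table dict.sasattrs as
   select libname, memname, name, type, length, label
   from sashelp.vcolumn
   where libname = 'DWH';        /* hypothetical warehouse libref */
quit;

/* The same information via a PROC CONTENTS output dataset */
proc contents data=dwh._all_ out=dict.contents noprint;
run;
```

Either route yields one observation per attribute per entity, which can then be matched against the IMS/DL1 definitions held in the dictionary.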
The dictionary was then supplemented with more metadata until it contained enough information to be used to build a prototype Data Warehouse, including development environments, and the incremental load. In addition, the dictionary also drove the build process (see later). A SAS/AF based front end provided an interface to facilitate deriving new variables and defining the use of variables in reports. As metadata changes for new releases of the OLTP system were passed into test, the Data Warehouse data dictionary had to be updated regularly, and the test environments used for reporting and application development had to be built and re-built. Code was developed for the build that processed new releases of the metadata describing the OLTP database. This kept code re-writes to a minimum with subsequent new releases of the operational system. During testing of the transaction system, the Data Warehouse team had to build and maintain several releases of the Data Warehouse simultaneously, receiving data from different test releases of the OLTP databases. The multiple releases allowed the different development teams working on the OLTP system to test their own programs and their integration with the whole system. The code that formed the incremental load to the Data Warehouse had to keep pace with the OLTP database. With new tables and attributes being added, some tables being amalgamated and some being removed, the Data Warehouse load and denormalisation programs had to be dependent on the data dictionary. A SAS/AF based application was used to access and enhance the metadata contained in the data dictionary, adding in derived variables and referencing the code used to create them. As multiple users were updating the PC/network dictionary datasets simultaneously, access was controlled by SAS/SHARE, with SAS/CONNECT providing upload to the mainframe for build execution.
The application provides:

• attributes to be accessed by Data Warehouse build and load programs
• drill down from entities to attributes to code
• reports on entities for programmers
• queries on entities, attributes and their sources
• usage of attributes in the Data Warehouse, keys and indices
• usage of attributes in reports (classification and analysis)

Denormalisation

The OLTP system architecture is typical of a transaction database, with the transaction system updating normalised tables. The normalised tables are held in subject oriented IMS databases, called domains. The structure of the denormalised tables was worked out by considering each domain, its key variables and the relationships between the tables (to estimate the volume of data once denormalised). The denormalised tables were created by making physical joins on the tables (segments) within the domains. Duplication and redundancy are impossible to avoid during the denormalisation process, and it required a paradigm shift for people used to working with normalised data. The simplest way to explain denormalisation was to remind the team that they denormalise data every time they report on the join of two or more tables; the process we wanted to follow was to use reporting methods for a normalised structure, but instead of putting the results on paper we would be appending the resulting denormalised data to the Data Warehouse. Some duplication was avoided by summarising data (e.g. named drivers per vehicle, or countries visited per policy); the sources of any summarised data were stored as stand-alone tables to avoid loss of information. Some tables were transposed; others were not used in the denormalisation process at all but were left as look-up tables.
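The join-then-append pattern described above can be sketched with PROC SQL and PROC APPEND. The domain, entity and key names below are hypothetical; the summary sub-query shows how duplication can be avoided by collapsing a child table to a count.

```sas
/* Join the segments of a hypothetical Policy domain on their keys,
   summarising the driver table to one row per policy */
proc sql;
   create table work.policy_dn as
   select p.*,
          v.make, v.model,
          d.ndrivers
   from   domain.policy as p
          left join domain.vehicle as v
            on p.polno = v.polno
          left join (select polno, count(*) as ndrivers
                     from domain.driver
                     group by polno) as d
            on p.polno = d.polno;
quit;

/* Append the denormalised records to the Data Warehouse entity */
proc append base=dw.policy data=work.policy_dn;
run;
```

This is exactly the code a reporting join would use; the only difference is that the result is appended to the warehouse rather than printed.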
The decision on how far to go with denormalisation was to draw the line at the point where storage space was wasted: where the resulting data structure served no purpose in making reporting, or summarising into data marts, easier. The final denormalised structure was arrived at by an iterative process: the initial structure was defined, and the Data Warehouse reporting team (which also specified the data marts) was briefed to assess how useful each denormalised Data Warehouse entity was for reporting and for feeding the data marts. If relationships in the data change, or are discovered, that make reporting more difficult using the denormalised structure, tables can be moved around the structure. This process may be harder to accomplish once the Data Warehouse has been in production for some time, but we are confident we have the tools and methodology to accomplish future structural changes.

Specification and Coding of the Incremental Load

To update the Data Warehouse, the IMS log was used to generate an audit dataset to be read by SAS programs, the results being SAS datasets resembling the IMS segments they came from. The data was validated to ensure key and essential data were present or within certain ranges. Any data containing erroneous values was passed to the 'spin' database and reported on, the 'spin' database being a SAS library based copy of the OLTP database structure. A SAS/AF based front end was developed to maintain the spin database. Only records that belonged to a complete business case (a business case being a logical business process such as a quotation, a claim, or a policy or claim adjustment) would normally be entered into the Data Warehouse. A process may take a number of days and customer contacts to complete, so records belonging to an incomplete business case are also added to the spin database. The spin maintenance facility allowed the forcing of incomplete business cases through the Data Warehouse load process, to help with things like end of period accounting.
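The validation step that routes erroneous records to the spin database can be sketched as a two-output DATA step. The dataset names, key variable and range checks here are hypothetical examples, not the actual validation rules.

```sas
/* Route records failing key or range checks to the spin database;
   clean records continue into the denormalisation step */
data work.policy_ok
     spin.policy (label='Rejected or incomplete policy movements');
   set audit.policy;                /* hypothetical audit extract */
   if missing(polno)                /* key must be present */
      or premium < 0                /* amounts within range  */
      or effdate > today()          /* no future-dated movements */
      then output spin.policy;
   else output work.policy_ok;
run;
```

Because the spin library mirrors the OLTP structure, rejected records can later be re-read by the same load programs once corrected or completed.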
These tables, containing validated data belonging to complete business cases, were then denormalised. This created records in the structures adopted by the Data Warehouse, to which they were then appended. During the incremental load process, data flows into the Data Warehouse in a number of different ways:

• All record changes are appended to the movement database.
• All quotation records go to the quote database.
• All claim details are passed to the claims database.
• The most up to date versions of all records are stored in the current picture database.
• Summaries of the combination of movement and current picture update the data marts.

The denormalisation process was specified in terms of which tables would be joined, by which keys, which tables would be transposed before joining, and which would be left as stand-alone look-up tables. During this specification process it was noted that the eventual code could be developed as a macro, making changes to the denormalised structure (as noted in the previous section) easy to implement. During this process control totals were calculated, based on transactions and units of work processed. These are stored and referenced during each subsequent incremental load, providing the basis of checkpoint data and enabling the development of mechanisms to avoid the same data being loaded to the Data Warehouse more than once, or out of sequence. To ensure flexibility in case of job failure, override facilities are also in place. A SAS/AF application exists for viewing the data at all stages of the incremental load, and for viewing on-line reports on data throughput, spin data and data loaded to the Data Warehouse. Reports on throughput are also downloaded to the PC and exported to spreadsheet packages to fit in with in-house reporting standards.
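A checkpoint of this kind can be sketched as a macro that compares the incoming batch number with the last one recorded, refusing duplicates and gaps; the checkpoint dataset, macro name and override parameter are all hypothetical.

```sas
/* Refuse a batch already loaded or arriving out of sequence;
   override=Y mimics the override facility for job-failure recovery */
%macro loadchk(batchno, override=N);
   proc sql noprint;
      select max(batchno) into :lastbatch
      from dw.checkpoint;
   quit;
   %if &override = N %then %do;
      %if &batchno le &lastbatch %then %do;
         %put ERROR: batch &batchno already loaded (last was &lastbatch);
         %abort cancel;
      %end;
      %if %eval(&batchno - &lastbatch) ne 1 %then %do;
         %put ERROR: batch &batchno is out of sequence;
         %abort cancel;
      %end;
   %end;
%mend loadchk;
```

On success the load would proceed and write the new batch number, with its control totals, back to the checkpoint dataset.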
Design of Data Warehouse Structure

The Data Warehouse has four distinct elements:

• Daily movement database
• Current picture database
• Specific databases (quotation and claim)
• Data marts

All changes to data are written to the movement database. Where information is needed to fill in missing values in movement data, values are read from the current picture. Daily movement data from the previous day's business comes from the audit dataset by way of the incremental load. As a table is added to the movement database, all preceding datasets are aged (as per computer performance databases), leaving a series of datasets each holding a day's worth of movement. The number of iterations of these datasets is parameter driven. The movement data is used by the data mart updating programs, and the relevant data is also passed to the quotation and claim databases. As the audit trail only contains details of object images that have changed, creating a denormalised record in the movement database requires that movement data is referenced against the existing data before the change, i.e. the current picture. The current picture mirrors the denormalised structure of the movement database, but contains the most up to date values of all the records in each of the Data Warehouse entities. The records on the current picture have lifetimes according to their business definitions, so an incident record, the basis of a claim, can have a different lifetime in the current picture database than a policy record. Without the current picture, a trawl through up to a year's worth of movement data would be required to update the Data Warehouse. Two further databases were created from the denormalised structure to enable the holding of data for longer periods than was allowed for in the Data Warehouse movement data, which holds all types of data for short term analysis only. The quotation database exists because of the high volume of enquiries passing through the Data Warehouse.
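The parameter-driven aging of daily movement datasets can be sketched with PROC DATASETS; the generation naming scheme (day1 for yesterday through day&ndays for the oldest) and the mvmt libref are hypothetical.

```sas
/* Keep &ndays generations of daily movement data:
   day1 is yesterday, day&ndays the oldest */
%macro age(ndays);
   proc datasets lib=mvmt nolist;
      delete day&ndays;                       /* drop the oldest */
      %do i = %eval(&ndays - 1) %to 1 %by -1;
         change day&i = day%eval(&i + 1);     /* shuffle the rest up */
      %end;
      run;
   quit;
%mend age;

%age(7)   /* e.g. retain a week of daily movement */
```

The oldest generation is deleted before the renames, so day1 is always free to receive the next day's load.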
A quotation can be a very short lived piece of information, with a lot of associated attributes and possible outcomes. Reports are needed on quotation type (e.g. initial, additional, adjustment), and specific details are needed to investigate potential new business areas and to forecast the potential volume of quotations issued and taken up. A forecast of potential enquiries can help with planning staffing levels of telephone agents, while forecasting quotations taken up, tied in with a forecast of claims, gives a direct estimate of future income. Claims can have a long life cycle, potentially several years, making the subsequent collection of data for claim analysis via movement and archive data untimely and expensive. Analysing claims can involve a lot of customer contacts, written or by telephone; a number of organisations can be involved in the settling of a claim; and amounts have to be set aside for settlement. The complexity of the analysis, and the extended life cycle of a claim, made claims a candidate for a separate database. Data marts were created to fulfil separate reporting requirements. Marts are sets of summaries that provide information to different types of business report, such as:

• Policy (quotations, adjustments, new business)
• Claims
• Vehicles
• Contents

The idea behind the data marts is that they are sourced from the Data Warehouse by programs that run after the incremental load. The mart data may consist of a merge and then a summary of one or more of the tables from the Data Warehouse movement and current picture databases (see Figure 1).

Migration of Business Data

Once all the BPR derived applications, including the OLTP system and the Data Warehouse, are operational, data stored in the current business systems (these will one day be the legacy systems) will need to be integrated into the Data Warehouse and OLTP database.
A major project is underway to migrate data to the OLTP system. The team managing migration is using the SAS System to reconcile data from the business systems with the new OLTP system data, to ensure data is not 'double counted' and that key values match up once the data is migrated. The Data Warehouse plan is to take advantage of this project and extract the segments of the migrated data needed to update the Data Warehouse from the audit trail generated by the migration jobs, so the load from the migration will be achieved by the incremental load jobs.

Reporting from the Data Warehouse

Because the Data Warehouse has been designed to make analysis of the data easier for the user, reports can be generated from all the different sections of the Data Warehouse: from the movement data and current picture, from the quotation and claims databases, and from the data marts. Initially the Data Warehouse had to be defined to provide reporting on the OLTP system, i.e. to fill the gap between operations and management in respect of day to day movement. Subsequent activity has been related to satisfying business enquiries from business users of the Data Warehouse. Reports are generated by category under business area; in some cases data marts have been generated to provide ad-hoc and what-if analysis against the data reported.

User Interfaces

User interfaces have been created for:

• The data dictionary
• Viewing and maintaining spin data
• Viewing Data Warehouse data and reports on the incremental load
• Analysing accounts data for potential fraud

Audit of Programs/Documentation

In order to leave behind a manageable and maintainable system, all code and documentation created in building the Data Warehouse had to pass testing and an audit. During the project the team changed from being largely external consultant based to being a mixture of consultants and staff.
The site standards in naming conventions and programming style were taken on board and modified to reflect SAS based programming. Guidelines were issued to the programmers on the project; these included:

• naming conventions for variables, datasets and program/macro names
• coding style
• efficiency techniques
• macro library references (continuously updated)
• documentation style
• the testing process
• implementation to production

As well as this, every week a 'random dope test' approach was adopted, where a couple of hours were set aside for the audit police to look over some of the latest releases from each programmer. Any problems were sorted out, and any good practice or interesting techniques that emerged were added to the guidelines and distributed.

The Future

Now that the Data Warehouse is on line, applications have been specified for:

• Replacing the existing EIS with a SAS/AF and SAS/EIS application
• On-line analysis and forecasting of quotations, and similarly for claims
• A marketing analysis application using customer and product segmentation

The Data Warehouse can become a victim of its own success: the more information that is given to managers and business analysts, the more information they want and the more questions they ask.

Summary

This paper has outlined the many considerations and tasks involved in building one Data Warehouse, from analysis of data sources, to design of the Data Warehouse structure, to loading the Data Warehouse and reporting from data marts. In all these areas the SAS System has provided the tools required to complete the task. As can be seen, every Data Warehouse will have a different design, depending on the industry concerned, the sources feeding the Data Warehouse, the analysis and reporting requirements placed upon it and, not least of all, the tool selected to provide it.

[Diagram: overall design showing staging and validation feeding the subject orientated databases, with audit data, archive, reference data and the data marts]

Figure 1.
Schematic of Data Warehouse Architecture

Acknowledgements

Base SAS, SAS, SAS/ACCESS, SAS/AF, SAS/ASSIST, SAS/CONNECT, SAS/EIS, SAS/ETS, SAS/GRAPH, SAS/INSIGHT, SAS/SHARE and SAS/STAT are registered trademarks of SAS Institute Inc., Cary, NC, USA.

For Further Information Contact:

Paul Murray
Data Log Consultants Ltd.
26 Hales Road
Cheltenham
GLOCS GL52 6SE
UK
+44 1242 573709
[email protected]

Gary Beck
4.G.B. Ltd.
4 Keswick Close
Cringleford
Norwich, Norfolk
NR4 6UW
UK
+44 1603 506988