Host Systems

Building a Data Warehouse with SAS® Software in the Unix® Environment

Karen Grippo, Dun & Bradstreet, Basking Ridge, NJ
John Chen, Dun & Bradstreet, Basking Ridge, NJ

ABSTRACT

A common trend in corporations today is "downsizing": moving applications from expensive mainframes to more affordable midrange platforms. The Information Center of Dun & Bradstreet Information Services, faced with the challenge of moving fulfillment off the mainframe, has chosen SAS software not only as the vehicle for its data analysis and delivery but also as its database management system. Twenty gigabytes, extracted from mainframe legacy databases, have been loaded onto an HP-UX® platform as indexed SAS data files. Two data structures (a table mapping the data files to file systems and a data dictionary mapping data elements to data files) and a set of SAS macros have been developed to give the Information Center developers and business users transparent access to the data, freeing them from the need to know data file or data element locations.

THE CHALLENGE

Like many corporations today, Dun & Bradstreet Information Services is trying to reduce costs by moving applications from the mainframe to more affordable midrange platforms. The Information Center of the Technology & Business Services Division was given the challenge of reducing its use of mainframe MIPS by designing and implementing an alternate platform solution.

Historically, the role of the Information Center within the organization has been to provide customized development, analysis and fulfillment to both internal and external customers. The requests for information received are as varied as the data collected by Dun & Bradstreet, and in the course of fulfillment, probably every database and data source has been accessed by the group at one time or another. Since its inception, the InfoCenter has used the SAS System for its extensive data manipulation, analytical, and reporting capabilities.

THE EVALUATION PROCESS

Due to the diversity of projects and data requirements, some time and analysis was required to identify applications suitable for migration. Once the applications and the supporting data elements were chosen, a database management system had to be selected. Several key requirements were identified:

(1) The time required to load the database had to be minimized. Since commitments to customers would be at stake, a solution requiring more than one weekend to load the data would not be acceptable, and less than twelve hours was preferable, in the event that the data had to be reloaded overnight.

(2) The majority of projects involved appending Dun & Bradstreet data to files with fewer than 50,000 records using the DUNS Number® as the primary key for retrieving data (the DUNS Number is a unique number, like the Social Security number, that is assigned to every company in the database).

(3) The time required to implement the database and transition the group from the mainframe to the Unix platform should be as short as possible.

The evaluation process covered not only traditional RDBMSs (Sybase® and Informix®), but also a new product from Red Brick Systems® (a product optimized for data warehousing, querying and decision support), and lastly, the SAS System. The SAS System was chosen for several reasons:

(1) The time quoted by Sybase and Informix for the database load was between 3 and 7 days. Benchmarks with SAS software indicated that both the download of the raw data and the building and indexing of the SAS data files could be completed within a ten-hour window.

(2) No other product offered as comprehensive a set of integrated tools as the SAS System - tools not only for complex data manipulation, but for analysis and presentation as well. Ultimately, SAS software provides tools that we can give to our business users to empower them to meet more of their own information delivery needs.

(3) The level of SAS software expertise within the group meant an easier transition to the Unix platform. Less time would be required for implementing the data warehouse and less time would be required to begin fulfillment on the new platform.

(4) The portability of SAS code meant that many of the existing mainframe procedures could be ported to the Unix environment with little or no change. Since the SAS language is virtually platform-independent, our developers could be productive immediately, with very little retraining necessary.

(5) SAS is optimized to read SAS data files, as opposed to tables from other DBMSs. Since we were going to use SAS to present our data, regardless of the database we chose, storing our data in SAS data files enhanced our performance.

THE IMPLEMENTATION

Once the decision had been made to use SAS software as the data management tool, one operating system constraint became a very important issue. In the Unix operating system, the size of any file is restricted to 2 GB. Some DBMS packages, like Sybase, manage their own database space and thus have no limitation on the size of a table. The SAS System, however, does not manage its own space, dictating that the database files had to comply with the 2 GB limit.

One obvious choice, and probably the simplest, for segmenting the data would have been to split the data sequentially by the DUNS Number. However, a different approach was adopted with the following strategy.

First, in an effort to minimize the size of the database, some data elements have been separated into smaller datasets, generically called 'support' files. These support files contain data elements whose frequency of occurrence in the data is low relative to the amount of space they occupy. For example, the mailing address of a company occupies 65 bytes but is present in only 17% of the cases. By placing the mailing address fields in a separate data file, more than 1 GB of space is saved. In this manner, more than 10 separate support files have been created, saving 4+ GB in space.

Second, the remaining data elements, comprising a 600+ byte extract for over 19 million cases (12 GB), are segmented using five key business indicators which place each record into one of 13 different categories, called 'main' segments. An example of a key business indicator is one which determines if a company is out of business or not. The advantage of our scheme is that certain categories of records can automatically be included or excluded, without even reading them, simply by knowing into which category they fall. For projects where we need to query the entire database, we frequently want to eliminate certain types of companies altogether (e.g. those which are out of business). In these circumstances, our database design allows us to avoid processing millions of records, using fewer valuable CPU resources and giving a faster turnaround.

Some segments still exceed the 2 GB limit and have to be further subdivided by the DUNS Number. The entire segmentation scheme is illustrated in Figure 1.

Indexing the Database

It has been decided to index the datasets only on the DUNS number for several reasons. First, the nature of applications being implemented is such that most of the projects involve taking a DUNS-numbered file of customer data and appending Dun & Bradstreet data using the DUNS number as the primary key. For this type of operation, only an index on the DUNS number is required. Second, it has been difficult to identify a reasonable number of fields on which to create secondary indexes. The decision-support type projects which require 'sweeping' the entire database, looking for records matching certain selection criteria, have had insufficient commonality in terms of data elements used for selection. Creating too many indexes would add significant overhead in terms of space and require additional time to build the database. This might not be offset by gains in turnaround time. Thus, it has been decided to defer the issue of creating additional indexes until further analysis can be done. Since space is not an issue, it has been decided not to compress the datasets, allowing for faster retrieval.

As described above, the 'main' database is segmented by business indicators, comprising 13 categories broken into 21 data segments. Retrieving a record using the DUNS Number index could potentially require searching all 21, querying each until a particular DUNS number is located - not a desirable solution. To resolve this issue, a more efficient two-level indexing scheme has been developed. The first level is a SAS data file, indexed by DUNS number, containing all 19+ million cases and pointers (in the form of a code) to the main segments in which they are located. At the second level, the main segments themselves are indexed by the DUNS number. This allows the retrieval of data for any DUNS number in only two steps: (a) use the first-level index to determine which main segment contains each DUNS Number and then (b) follow the pointers to the appropriate segments, using the DUNS Number to directly access the correct record in each segment. This indexing scheme is illustrated in Figure 2.

DATA ACCESS METHODOLOGY

The complexity of the database design just described translates to complexity in the SAS code required to retrieve data. As an application developer or business user wishing to retrieve data, you would have to know, for each element you wanted to extract, whether it is in a main segment or a support file, and if it is in a support file, which one. 'Main' data elements must be retrieved from multiple files since the data is segmented. To further complicate matters, since a given file cannot exceed 2 GB, data files need to be split into two as they reach the size limit. Developers would have to constantly monitor the state of the database and adjust their code as changes occurred - an undesirable situation and a nightmare for program maintenance.

To address this problem, metadata (that is, data about the data), in the form of two data structures, and a set of SAS macros have been created. This new data access methodology makes the number, type and location of the SAS data files as well as the location of individual data elements transparent to the application developer and the business user.

The first data structure, called the File System Map, is an ASCII file containing the name of each SAS data file in the database, the file system on which it resides, and a code indicating whether it is a main or support file. Any time a data file is added, deleted or moved, a change is made to this table. Certain naming conventions have also been adopted. File system names have to be 8 characters or less so that they can also be used as SAS librefs. The main files are named after the 13 categories, with a number added at the end to indicate how many sub-segments that category contains. For example, if the categories are given as A, B, ..., M then the SAS data files might be named A1, B1, B2, B3, ..., M1, indicating that category A maps to one dataset, B to three, M to one, and so on.

The second data structure is a data dictionary. Created using PROC CONTENTS and some manually entered information, it is a SAS data file containing one record for each variable in the database with its name, type, length, and the segment from which it comes. Additional information, such as the mainframe source of the data element and a long text description of the variable, has been added to help users understand the contents of the database and for the creation of a hardcopy data dictionary for documentation.

A set of SAS macros utilizes these two data structures. The main macro, APND, defines the user's interface to the database. The user simply provides the names of the variables requested (up to a maximum of 100), the name of the input SAS data file containing the DUNS Numbers to which data will be appended, and the name of the SAS data file that will contain the output. The macros do all the work. The variables requested are matched to the data dictionary to determine in which datasets they reside. A PROC SQL step is constructed for each dataset, extracting the appropriate data elements. All the individual files are then combined into one data set and returned to the user. This data retrieval process is illustrated in Figure 3.

The design of this data access methodology embodies object-oriented concepts, even though SAS software is not traditionally considered an object-oriented language. These two data structures, plus the SAS macros which utilize them, encapsulate everything any program needs to know to extract data from the database. Programs become virtually maintenance-free.

RESULTS

The benefits we have experienced from implementing our data warehouse have been substantial:

(1) Development time and effort significantly reduced. In the past, a developer had to code one COBOL program per database accessed, typically four or five. Additionally, a SAS program had to be written to combine the extracts and produce a customized file or report. With the data warehouse, extracting data is as simple as a function call. Development time is significantly reduced. Most projects can be completed in just one step, combining data extraction, data manipulation, and presentation of the final result.

(2) Turnaround time significantly reduced. Mainframe processing windows have been diminishing steadily over the past several years. Access to certain databases is severely restricted, sometimes eliminated, during the day. Virtually any job requires an overnight run. If an error is made in JCL or code, another day could be lost. The alternate platform is our dedicated resource. A job on the platform typically runs in under 30 minutes and can be run at any time, with no queues.

(3) Greater customer satisfaction. The two factors listed above both translate to increased customer satisfaction. Projects which would have taken two to three days on the mainframe are usually completed within one day, allowing tight deadlines to be met, sometimes saving contracts and revenue.

(4) Greater data consistency. By pulling data from a variety of sources, and then cleansing and combining it, we have been able to provide a comprehensive source of data that has a high degree of consistency and quality. This type of resource has great potential to be leveraged throughout the organization.

(5) End-User Empowerment. The database interface allows easy access to the database, freeing all users from the need to know the details of the database implementation. This ease of access makes it possible to allow non-technical users to utilize the data warehouse. Efforts are currently underway to build an application that will give users point-and-click data retrieval capabilities. An analytical workstation has also been built which allows business analysts to view data and drill down using both summarized and detail data extracted from the warehouse.

SUMMARY

The task of building an alternate platform solution is difficult and complex. The SAS System may not be an obvious choice for implementing a data warehouse, but our experience illustrates that it is certainly a viable alternative. The SAS System provides traditional database management features such as indexing, data compression, support for SQL, and the creation of data views. With SAS Macro Language, you have a powerful facility for building the infrastructure of a data warehouse and for building a sophisticated mechanism for transparent data access.

[Figure 1: Data Segmentation]

[Figure 2: Indexing Scheme]

[Figure 3: data retrieval process, showing the data dictionary]
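
The two-level indexing scheme and the metadata-driven retrieval described under DATA ACCESS METHODOLOGY can be illustrated with a minimal sketch. It is written in Python rather than SAS, purely as a language-neutral illustration, and it is not the authors' APND macro. Every file name, libref, variable name and record value below is hypothetical, and in-memory dictionaries stand in for the indexed SAS data files. The paper's macros build one PROC SQL step per dataset and then combine the results; for brevity, this sketch instead looks up each DUNS Number record by record.

```python
# Hypothetical stand-ins for the paper's metadata; all names and values are invented.

# File System Map: data file name -> (file system / libref, "main" or "support").
FILE_SYSTEM_MAP = {
    "A1": ("fs01", "main"),
    "B1": ("fs02", "main"),
    "B2": ("fs03", "main"),
    "MAILADDR": ("fs04", "support"),
}

# Data dictionary: variable name -> "main", or the support file that holds it.
DATA_DICTIONARY = {
    "name": "main",
    "sales": "main",
    "mail_city": "MAILADDR",
}

# First-level index: DUNS Number -> code of the main segment file holding the record.
FIRST_LEVEL_INDEX = {"001": "A1", "002": "B2", "003": "B1"}

# Second level: each data file is keyed (indexed) by DUNS Number.
DATA_FILES = {
    "A1": {"001": {"name": "Acme", "sales": 10}},
    "B1": {"003": {"name": "Cobalt", "sales": 30}},
    "B2": {"002": {"name": "Birch", "sales": 20}},
    "MAILADDR": {"002": {"mail_city": "Newark"}},
}


def locate(data_file):
    """Return 'libref.file' for a data file, resolved through the File System Map."""
    libref, _kind = FILE_SYSTEM_MAP[data_file]
    return f"{libref}.{data_file}"


def apnd(variables, duns_numbers):
    """Append the requested variables to each DUNS Number, one output row per input."""
    unknown = [v for v in variables if v not in DATA_DICTIONARY]
    if unknown:
        raise KeyError(f"not in data dictionary: {unknown}")
    rows = []
    for duns in duns_numbers:
        row = {"duns": duns}
        for var in variables:
            home = DATA_DICTIONARY[var]
            # Step (a): a main variable needs the first-level index to pick the segment;
            # a support variable lives in the single file named by the dictionary.
            data_file = FIRST_LEVEL_INDEX.get(duns) if home == "main" else home
            # Step (b): direct keyed access by DUNS Number within that one file.
            record = DATA_FILES.get(data_file, {}).get(duns, {})
            row[var] = record.get(var)  # None when the element is absent
        rows.append(row)
    return rows


# The caller names variables and DUNS Numbers only, never files or file systems.
result = apnd(["name", "mail_city"], ["002", "001"])
```

In this sketch the caller never mentions a segment or file system. Moving or splitting a data file would mean changing only the map, the dictionary or the first-level index, which is the maintenance property the paper attributes to its design.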