AN INTERFACE BETWEEN THE SAS SYSTEM AND THE INFORMIX DATABASE

Kevin Kane, John B Brown, Dr. Krystyna Kelly - AMGEN Ltd.

ABSTRACT : AMGEN Ltd., a pharmaceutical company, uses an Informix-based relational database to store its clinical trials data, and performs statistical analysis using the SAS system. This creates the problem of transferring data between the two systems and ensuring that the transfer is exact. A set of generic Informix-4GL programs was written to transfer the data automatically from any Informix database to a series of SAS datasets by interrogating the database system catalogue. The first 4GL program dumps the data from each database table into formatted ASCII files. The next program creates SAS code which reads the ASCII files into SAS datasets. The generated SAS code is executed, creating a SAS dataset corresponding to each database table. To ensure that the SAS datasets exactly match the data held in the Informix tables, the data are re-transferred from SAS into Informix and compared with the original copy. This new methodology has proved to be reliable and has saved many hours of programming time.

AMGEN Ltd. is a biopharmaceutical company based in Thousand Oaks, California. Clinical trials, which are part of Amgen's drug development programme, are carried out globally, and the data centre for Europe is based in Cambridge, England. The company is relatively small, with around 1500 employees worldwide, 30 of whom work in the data centre at Cambridge. The activities within the data centre include database development, data entry, and the analysis and reporting of clinical trials. Currently, Amgen's two main products are growth factors. One of these stimulates the production of red blood cells, and the other increases the number of one type of white cell.
Because clinical trials of these drugs tend to be carried out over long periods of time and measurements have to be recorded frequently, large amounts of data are generated. Typically, one patient generates 10,000 data points. The data centre processes these data using around 35 databases every year, and each of these databases has approximately 1000 variables.

The clinical trials data are processed using a Sun 4/490 Sparc Server which runs under Unix. The data are stored in a relational database developed using the Informix database management system. Statistical analysis is carried out using the SAS® system. The use of these two different packages creates the problem of transferring data from the Informix database to the SAS system. It also raises the issue of whether data manipulation should be done before or after the data are transferred.

In the past, analysis variables were derived using Informix tools, and then written to ASCII (or text) files. The statistician had to specify the exact layout of these ASCII files. SAS code was then used to read in the data and do any required analysis. Many problems were encountered with this system because of the large number of steps required - opportunity for error was high and reproducibility was difficult. Additionally, because the database tools being used were not primarily designed for the mathematical data manipulations required, the programming was more difficult and very labour intensive.

This system was not satisfactory, and a better method of data transfer and manipulation was sought. The requirements for our data transfer process were that the transfer had to be exact, so that we could have confidence that any end results produced were correct, and that it had to be fast and easy, so that we could have an efficient process.

The new method that was developed consisted of three automatic steps. Initially, ASCII files were produced of the contents of the database.
SAS code was then generated to read in these ASCII files, and finally the SAS statements were run.

The first step involved writing a program that would, for any Informix database, produce standard column-formatted ASCII files. One file is produced for every database table and each file has one column per variable. The next step is the automatic generation of SAS statements that will read the ASCII files into SAS datasets.

Both of these steps require information about the contents of the database. In a relational database, this information is stored in the database system catalogue (see figure 1). A database contains a number of different tables, each holding a different kind of data. Each table has a number of columns, or variables, of different data types. For each table, the system catalogue holds information such as the table name, a unique identification number and the number of columns in the table. Similar information is held for each column - the column name, the identification number of the table the column is located in, a column number, the type of data that is held in this column (e.g. numeric or text), and the length of the data allowed to be stored in the column. From the system catalogue we can extract information about both the tables and columns in the database for use in our programs.

Figure 1 : The Database System Catalogue
    System Table Information: Table Name, Table ID Number, Number of Columns, etc.
    System Column Information: Column Name, Table ID Number, Column Number, Column Type, Column Length

There are some problems caused by differences in the method of handling data between SAS and the Informix database. The two main differences are the length of variable names and the different data types that are used in each system. In the SAS system, variable names can only be up to a maximum of 8 characters long. However, in the Informix database, they can be up to 18 characters long.
Simply truncating the variable names would cause problems if any two variables in one table had names which differed only after the eighth character. Additionally, our databases are developed using a two-character prefix and an underscore in front of each column name to indicate which table it is from, and to make it unique in the database. Preserving this prefix in our SAS names would leave only five characters available to represent meaningful names.

To overcome these problems, we developed a method of translation for the variable names. The translator strips the two-letter prefix and underscore from each column name, and truncates the remaining text at eight characters. If the resulting name has already been used, a number is substituted for the last character to make the name unique. For example, if there were two variable names in one table that were called tl_temperature_before and tl_temperature_after, then the resulting SAS names would be temperat and tempera1.

The other main problem encountered in transferring the data from Informix to SAS was that the types of data storage available in each system differed. In Informix, a number of data types are available, and each column's type is defined when the database is built. SAS uses only two data types - numeric and character - but has the ability to format these in a variety of ways. The program which generates the SAS statements reads the system catalogue to determine the data type and length of each column. An equivalent SAS format is then constructed and used in the SAS input statement. For example, a date type in the database would be mapped onto a ddmmyy8. format, and an Informix data type decimal(8,3) would have a numeric 10.3 format.

When the SAS input statements have been generated, using the translated SAS names and formats, the code can then be run. The SAS input statements are generated for column input, using the same columns as the program which produces the ASCII files.
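The name translation and the construction of the input statement described above can be sketched in Python as follows. This is an illustration only, not the authors' Informix-4GL code: the function names are hypothetical, the prefix test (any two characters followed by an underscore) is an assumption, the decimal width rule (precision plus two, for the decimal point and a sign) is an assumption chosen to reproduce the 10.3 format mentioned in the text, and the fields are assumed to be contiguous starting in column 1.

```python
SAS_NAME_MAX = 8  # SAS variable-name limit stated in the paper


def translate_names(column_names):
    """Map Informix column names to unique SAS names (hypothetical helper)."""
    used = set()
    translated = {}
    for col in column_names:
        # Strip the two-character table prefix and underscore, e.g. "tl_".
        base = col[3:] if len(col) > 3 and col[2] == "_" else col
        stem = base[:SAS_NAME_MAX]
        name = stem
        counter = 1
        while name in used:
            # Replace the trailing character(s) with a number until unique.
            digits = str(counter)
            name = stem[: len(stem) - len(digits)] + digits
            counter += 1
        used.add(name)
        translated[col] = name
    return translated


def sas_informat(col_type, length=0, scale=0):
    """Return (informat, field width) for an Informix column type."""
    if col_type == "date":
        return "ddmmyy8.", 8  # mapping given in the paper
    if col_type == "decimal":
        width = length + 2  # assumption: digits + decimal point + sign
        return f"{width}.{scale}", width
    if col_type == "char":
        return f"${length}.", length  # assumption: plain character informat
    raise ValueError(f"unsupported column type: {col_type}")


def build_input_statement(columns):
    """Build a SAS INPUT statement from (sas_name, type, length, scale) tuples."""
    parts = []
    position = 1  # assumption: fields are contiguous, starting in column 1
    for sas_name, col_type, length, scale in columns:
        informat, width = sas_informat(col_type, length, scale)
        parts.append(f"@{position} {sas_name} {informat}")
        position += width
    return "input " + " ".join(parts) + ";"


names = translate_names(["tl_temperature_before", "tl_temperature_after"])
print(list(names.values()))  # ['temperat', 'tempera1']

# visitdt and site are made-up columns used only for illustration.
print(build_input_statement([
    ("temperat", "decimal", 8, 3),
    ("visitdt", "date", 0, 0),
    ("site", "char", 4, 0),
]))
# input @1 temperat 10.3 @11 visitdt ddmmyy8. @19 site $4.;
```

Driving both the ASCII dump and the input statement from the same list of column widths is what keeps the two in step; a real generator would take the types and lengths from the system catalogue rather than from a hand-written list.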
The SAS code reads the ASCII files and produces one dataset for every database table.

The advantages of the new data transfer system are numerous. Because all data in the database are transferred to SAS datasets, the statistician has access to all data. This is an advantage because the statistical analysis of clinical trials is usually a dynamic process that can only be planned to a certain extent. The process also removes the tedious work of specifying column-formatted text files, writing code to produce these, and creating SAS input statements to read them in. Another major implication of the new system is that the data manipulation is carried out in SAS as opposed to using database tools. Complicated data manipulation is clumsy using the database's manipulation facilities, as they are designed primarily for data extraction. A further advantage is that the new data transfer process allows you to take a 'snapshot' of a live database at any time, so that analyses can be carried out at any time, or a small amount of actual data can be used as a basis for testing statistical programs.

The only disadvantage of the new system is that large amounts of data are stored twice. As the volume of data to be analysed increases, this has implications for the amount of disk space required.

In the pharmaceutical industry, there is a need to be able to prove that any results or conclusions made are genuine. Regulatory agencies must be confident that data collected during clinical trials have not been corrupted from the original information recorded in the patients' notes at the hospitals. It is therefore incumbent on the pharmaceutical company to verify all steps of data collection and transfer. In particular, the use of electronic systems must be validated.

To satisfy the validation requirement, a plan for the quality control of the proposed data transfer system was devised (see figure 2).
The basis of the plan was to transfer all the data in the SAS datasets back into an Informix database and compare this with the original. To transfer the data from SAS to Informix, a SAS program was written which would output the data in the style of Informix 'unload' files. An unload file is a free-format ASCII file, produced by an Informix utility, which holds the contents of a table with separators between the values of each variable. Commands exist to load and unload the data easily between unload files and the database. The unload files that are generated from the SAS datasets are then loaded into an empty copy of the database and unloaded once again into unload files. This step is necessary to ensure that these unload files are formatted in exactly the same way as the files that have been unloaded from the original database. The two sets of unload files are compared (using the Unix 'diff' utility) and an error report is generated of any differences. Because all of the computer programs used are generic, the process can be repeated for any Informix database.

Figure 2 : Plan for the Quality Control of the New Data Transfer System
    (Diagram labels: Informix original database; Informix-style unload file; copy of Informix database; Informix unload files; Error Report)

Planned improvements to the system include adding a user interface, and updating the programs so that they can handle any new data types available in future releases of the Informix software.

In summary, the development of the new data transfer system was successful, with the time to analyse clinical trials being reduced from over 100 person/working hours to around 20 person/working hours. The system has been adopted for use in the analysis of all European clinical trials. The system is also well validated, so we can be confident in our results and satisfy any requirements from the regulatory agencies.

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA.
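The comparison step of the quality control plan can be illustrated with a short Python sketch. It is not the authors' SAS or Unix tooling: the function names are hypothetical, the pipe character is assumed as the separator (the paper does not name one), empty fields are assumed to stand for missing values, and Python's difflib stands in for the Unix 'diff' utility.

```python
import difflib


def to_unload_lines(rows, delimiter="|"):
    """Render rows as delimiter-separated text lines (hypothetical helper).

    Assumption: every value, including the last, is followed by the
    delimiter, and a missing value (None) becomes an empty field.
    """
    return [
        delimiter.join("" if value is None else str(value) for value in row)
        + delimiter
        for row in rows
    ]


def error_report(original_lines, roundtrip_lines):
    """Return diff-style lines for any differences; an empty list means a match."""
    return list(
        difflib.unified_diff(
            original_lines,
            roundtrip_lines,
            fromfile="original",
            tofile="roundtrip",
            lineterm="",
        )
    )


original = to_unload_lines([(1, 38.5), (2, None)])
roundtrip = to_unload_lines([(1, 38.5), (2, None)])
print(error_report(original, roundtrip))  # []

corrupted = to_unload_lines([(1, 38.6), (2, None)])
for line in error_report(original, corrupted):
    print(line)
```

An empty report corresponds to a clean transfer; any line prefixed with "-" or "+" marks a value that changed somewhere on the round trip.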