Deployment of a Data Transfer Application Using PROC SQL and Nested Macros
Keh-Dong Shiang, City of Hope National Medical Center, Duarte, CA

ABSTRACT
With more and more data being delivered electronically via the Internet, complications arise when we attempt to bring together information currently maintained by heterogeneous systems. Data transformation and loading are vital components of data integration projects. They are defined here as the processes of moving stored data from one system (the source system, denoted by S) to fit into another system (the target system, denoted by T), uni- or bidirectionally, where S and T can be any of SAS®, Microsoft SQL Server, Access and Oracle. Building a bridging system to link two independent data systems is therefore essential to the effective management of multiple data systems. This paper describes a step-by-step methodology for transferring data between back-end database systems. The concept of a mapping table and the application of 4-level nested macros and SQL procedures are also introduced.

INTRODUCTION
The City of Hope National Medical Center leads a multi-center research collaboration called the Southern California Islet Consortium (SC-IC) to evaluate islet transplantation as a treatment for type 1 diabetes. Data for the clinical islet transplantation trials performed by the SC-IC come from the clinical and laboratory records of our central facility in Duarte, California, as well as from many external laboratories and medical centers. As the consortium’s data collection center, we regularly report our study data to several national registries, including the Collaborative Islet Transplant Registry (CITR). Data transfer is an important component of information sharing among institutions, and creating an automated process for it has become a primary focus at our data coordination center.
The approach we are developing should effectively bridge information from multiple sources while reducing the errors associated with manual data entry. The concept, methodology and procedures we have implemented to process our data upload projects accurately and efficiently are presented here. This paper introduces a method for building a specialized SAS macro that automates the process of reading data from a Microsoft SQL database and transforming the data sets into formats suitable for the other center’s data structure.

DATABASE DESIGN: EXTRACT, TRANSFORM AND LOAD
I have developed a Microsoft Access database that has been placed in production to electronically transfer study data from the SC-IC (i.e. the source system), an ODBC-linked SQL database, to CITR (i.e. the target system), a CSV (Comma Separated Value) database. Recently, this Access database system and its Visual Basic modules have been converted and translated to SAS. The SAS version still functions as a primitive test database system, but it may be worth applying and sharing with others. My programming concept and flow are illustrated in the six steps below.

STEP 1. Create a SAS library directory as the system container.
A LIBNAME statement opens a file location that holds all of the source and target data sets.

STEP 2. Make an ODBC link to import all of the source study SQL tables into the specified SAS library.
Two macro variables, SQLTABLE and SASDATASET, hold the SQL table name and the SAS data set name.

    /* Generates one CREATE TABLE statement; the terminating semicolon */
    /* is supplied by the caller inside PROC SQL.                      */
    %MACRO GetSQLTable(SQLTABLE, SASDATASET);
      CREATE TABLE Mylib.&SASDATASET AS
        SELECT * FROM CONNECTION TO ODBC (SELECT * FROM &SQLTABLE)
    %MEND GetSQLTable;

    PROC SQL;
      CONNECT TO ODBC ("DSN=ISLET");
      %GetSQLTable(PhysicicianReport, PhysReport);
      %GetSQLTable(PhysicalExam, PhysExam);
      %GetSQLTable(Symptom, SYMP);
      %GetSQLTable(FollwUp, FUL);
      %GetSQLTable(AdverseEvent, AdvEvent);
      %GetSQLTable(TransplantDay0, TranspDay0);
      /* ………………………………………. */
      DISCONNECT FROM ODBC;
    QUIT;

STEP 3. Gather all source tables and save their names to a table called the “source index table”.

    FormIndex  MainTableName     SubTableName
    F03        ChemMetCBC        ChemMetCBC
    F05        ChemistriesOther  ChemistriesOther
    F09        CTCMaster         CTCDetail
    F09        CTCMaster         CTCMaster
    F11        SAEFU             SAEFU
    M04        OSVitalSigns      OSVitalSigns
    T01        TransplantDay0    TransDay0PortPressure
    T01        TransplantDay0    TransplantDay0
    …          …..               …..

STEP 4. Copy all the empty target data sets (i.e., only the structure, no data) into the working library.
These blank data sets include AEV, FUP, etc., and their names are added to a table called the “target index table”.

    FormID  FormName
    AEV     Adverse Event Form
    FUP     Follow-up Form
    …       ……

STEP 5. Profile both the source and target data.
First, the target form/table ID and field columns are added at the beginning of the mapping data set, and their values are entered for AEV (Adverse Event Form), FUP (Follow-up Form), and so on.

    TargetFormID  TargetFormName      TargetFieldName
    AEV           Adverse Event Form  PATID
    AEV           Adverse Event Form  AEONSET
    AEV           Adverse Event Form  AEDESCR
    AEV           Adverse Event Form  AEPERSDT
    AEV           Adverse Event Form  RESDTE
    AEV           Adverse Event Form  EVTOUTCM
    AEV           Adverse Event Form  AEIMMUN
    AEV           Adverse Event Form  AERELAT
    AEV           Adverse Event Form  AESEVTY
    AEV           Adverse Event Form  TXREQRD
    AEV           Adverse Event Form  SAEDEATH
    FUP           Follow-up Form      ASSMTDTE
    FUP           Follow-up Form      FOICPEP
    FUP           Follow-up Form      FOIGLUC
    FUP           Follow-up Form      FOIGLUND
    FUP           Follow-up Form      FOIHBA1C
    FUP           Follow-up Form      FOIINS
    FUP           Follow-up Form      TXDATE

Second, a number of source data fields are added, such as SourceFormID, SourceTableName, SourceSubTableName, SourceFieldName, SourceFunctionCalled, SourceFixedValue, etc.

    SourceTableID  SourceTableName   SourceSubTableName  SourceFieldName        SourceFunctionCalled                        SourceFixedValue
    F11            SAEFU                                 PtID
    F11            SAEFU                                 SAEDate
    F11            SAEFU                                 ToxicityCode           CTCAEterm("ToxicityCode")
    F11            SAEFU                                 LastAssessmentDate     LastAssessDate("LastAssessmentDate")
    F11            SAEFU                                 ResolutionDate         ResolutionDate("ResolutionDate")
    F11            SAEFU                                 OutcomeOfSAE
    F11            SAEFU                                 TreatmentRelationship
    F11            SAEFU                                 TransRelationship
    F09            CTCDetail         CTCDetail           Grade                  ToxGrade('Grade')
    F11            SAEFU                                 TreatmentRequired      TxRequired("TreatmentRequired")
    F09            CTCDetail         CTCDetail           Grade                  SAEdef(1,"Grade")
    F03            ChemMetCBC                            ProinsulinDate         AssesmentDate("ProinsulinDate")
    F03            ChemMetCBC                            Cpeptide               CalcCpeptide("ProinsulinDate","Cpeptide")
    F03            ChemMetCBC                            Glucose                CalcGlucose("MetPanelDate","Glucose")
    F03            ChemMetCBC                            Glucose                GlucoseNotDone("MetPanelDate","Glucose")
    F05            ChemistriesOther                      HgbA1c                 CalcHgbA1c("HgbA1cDate","HgbA1c")
    M04            OSVitalSigns                          PtCurrTakingInsulinYN
    T01            TransplantDay0                        TransplantDate         InfusionDateFOI("Transplant1Date")

STEP 6. Create a mapping table that plays a central role in the entire data transfer process.
The combined data set from STEP 5, formed by merging the two tables above, is the so-called ‘mapping table’. All the values for the source fields are entered manually. A single patient can have multiple patient identifiers, such as Tracking Number, Research Participant Number, Medical Record Number, UNOS ID and SC-IC Patient ID, depending on the study phase (recruitment, pre-study, treatment and follow-up) and on the data source. We have therefore manually created an index Patients table that includes all the patient identifier variables. Please note that, within each row of this table, the different identifiers all point to the same patient.

Once all of these steps are complete and the tables are ready, a main SAS macro must be developed to perform the automatic data transfer task.
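As an illustration of STEP 6, a mapping table with the columns named above could be keyed in directly with a DATA step. The sketch below is only an assumption about the layout: the variable lengths, the pipe delimiter, and the pairing of each target field with a source field (taken in listing order) are not confirmed by the paper, and the data set is written to WORK rather than Mylib so that the example runs on its own. Rows that call a source function are left out to keep the delimited input simple.

```sas
/* Hypothetical sketch: a three-row excerpt of the mapping table.        */
/* Lengths, delimiter and row pairing are assumptions for illustration.  */
DATA WORK.Mapping;
  LENGTH TargetFormID $3 TargetFormName $30 TargetFieldName $12
         SourceTableID $3 SourceTableName $32 SourceSubTableName $32
         SourceFieldName $32 SourceFunctionCalled $60 SourceFixedValue $20;
  /* DSD treats two consecutive delimiters as a blank (missing) cell */
  INFILE DATALINES DLM='|' DSD TRUNCOVER;
  INPUT TargetFormID TargetFormName TargetFieldName
        SourceTableID SourceTableName SourceSubTableName
        SourceFieldName SourceFunctionCalled SourceFixedValue;
  DATALINES;
AEV|Adverse Event Form|PATID|F11|SAEFU||PtID||
AEV|Adverse Event Form|AEONSET|F11|SAEFU||SAEDate||
FUP|Follow-up Form|FOIINS|M04|OSVitalSigns||PtCurrTakingInsulinYN||
;
RUN;

/* List the rows to check that every cell landed in the intended column */
PROC PRINT DATA=WORK.Mapping NOOBS;
RUN;
```

Keeping the mapping in a data set rather than in program logic means that a new target field can be supported by adding a row instead of editing the macro code.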
The main macro contains a total of four nested iterative loops, as follows.

    DO Loop through TargetIndex data set
      Select a single row (i.e. target table) each time to initiate the data insertion
      DO Loop through entire Patients data set
        Choose a single patient row by row from this data set
        Call AddToDestination macro
        DO Loop through whole ‘Mapping’ data set
          Select a single mapping record at each iterative loop
          Call RunEachRowInMapping macro
          DO Loop through each source table
            Insert records into target data sets from source tables
            Customized data manipulation functions may be needed
          END DO
        END DO
      END DO
    END DO

This module makes extensive use of record searches, the SQL procedure, DATA steps, data insertion, iterative loops, flow control, macro variables and nested macros. The complete code is included in the appendix.

In the working library, our study SQL database consists of almost 100 tables, most of which are normalized Master and Detail views. The common fields used to link the views, however, vary from form to form (i.e. table to table), which increases the complexity of our mapping table approach. Our target database system contains only 20 data sets, which is relatively simple. Even so, transferring data between two heterogeneous systems, from about 100 source data sets to 20 target data sets, is a high hurdle for database programmers. Many institutions facing this challenge will first look for an efficient gateway interface to link the two databases. Data profiling involves studying both the source and target data thoroughly to understand their structure and content. Once both data systems have been profiled, an appropriate set of mapping specifications is recorded in the mapping table based on this profile.
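Part of the profiling described above can be done by querying the SAS dictionary tables, which the appendix macro VarNumber also relies on. The following is a minimal sketch and not the author's procedure: the two demo data sets, their variables and the sample values are made up so that the query runs on its own.

```sas
/* Hypothetical stand-in for one source table (made-up values) */
DATA WORK.SAEFU_demo;
  LENGTH PtID $10;
  FORMAT SAEDate DATE9.;
  PtID = 'P001';
  SAEDate = '15JAN2005'D;
  OUTPUT;
RUN;

/* Hypothetical stand-in for one empty target data set:        */
/* structure only, no rows, as in STEP 4                       */
DATA WORK.AEV_demo;
  LENGTH PATID $10;
  FORMAT AEONSET DATE9.;
  STOP;
RUN;

/* Profile both data sets: name, type, length and format of each column */
PROC SQL;
  SELECT MEMNAME, VARNUM, NAME, TYPE, LENGTH, FORMAT
    FROM DICTIONARY.COLUMNS
    WHERE LIBNAME = 'WORK'
      AND MEMNAME IN ('SAEFU_DEMO', 'AEV_DEMO')
    ORDER BY MEMNAME, VARNUM;
QUIT;
```

Comparing the TYPE and LENGTH columns for a source field and its intended target field is one way to spot, before any data are moved, where a conversion function would be needed in the mapping table.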
The combination of data profiling and mapping is the essential step in our data transfer project and should be completed before attempting to extract, transform and load the source data into the target database.

CONCLUSION
This paper presents a simple example and a structured methodology for transferring data between two heterogeneous data systems. A mapping table consisting of metadata records is introduced to retrieve and map variables from one or more source data sets into one or more target data sets. With the aid of the mapping table, the task of data transfer becomes systematic, straightforward and easy to visualize. The approach may be of interest to anyone who wishes to extend this idea and methodology into a more sophisticated program.

ACKNOWLEDGMENT
The author wishes to thank Dr. Jeffrey Longmate, Dr. Joyce Niland, Dr. Fouad Kandeel and Dr. Craig Smith from City of Hope for their support. Thanks also to my colleagues, Lorraine Lesiecki and Jeannette Hacker, for their kind assistance and reviews of this manuscript. In addition, I would like to thank CITR for the opportunity to work on this data transfer project.

TRADEMARKS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION
Contact the author at:
Keh-Dong Shiang, Ph.D.
Department of Biostatistics & Department of Diabetes
City of Hope National Medical Center
1500 East Duarte Road
Duarte, CA 91010-3000
Work Phone: (626)256-4673 Ext. 65768
Fax: (626)471-7106
E-mail: [email protected]

APPENDIX

    LIBNAME Mylib 'C:\WUSS2005';

    /* List the variables of one data set and store the highest VARNUM */
    %MACRO VarNumber(LibName, DsName);
      PROC SQL NOPRINT;
        CREATE TABLE Mylib.VarList AS
          SELECT VARNUM, NAME, TYPE, LENGTH, LABEL, FORMAT
          FROM DICTIONARY.COLUMNS
          WHERE LIBNAME=%UPCASE("&LibName") AND MEMNAME=%UPCASE("&DsName")
          ORDER BY VARNUM;
        SELECT MAX(VARNUM) INTO :MaxNumVar FROM Mylib.Varlist;
      QUIT;
    %MEND VarNumber;

    /* Look up the type (char or num) of one target variable */
    %MACRO VariableType(VarName);
      PROC SQL NOPRINT;
        /* NAME keeps its original case, so both sides are upcased */
        SELECT TYPE INTO :VarType FROM Mylib.VarList
          WHERE UPCASE(NAME)=%UPCASE("&VarName");
      QUIT;
    %MEND VariableType;

    /* Process one mapping record for one patient */
    %MACRO RunEachRowInMapping(PatientID, FormID, MaxNumVar, TFieldName,
                               STableName, SSubTableName, SFieldName,
                               SFuncCalled, SFixValue);
      %GLOBAL VarType;
      %VariableType(&TFieldName)
      %LET valSField = '';
      /* Read the source field value for this patient */
      DATA _NULL_;
        SET Mylib.&STableName (WHERE=(PtId="&PatientID"));
        varSField = &SFieldName;
        IF varSField NE '' THEN DO;
          CALL SYMPUT('valSField', varSField);
        END;
      RUN;
      %IF &valSField NE '' AND &TFieldName NE '' %THEN %DO;
        PROC SQL NOPRINT;
          /* First field for this patient and form: insert a new row */
          %IF &row=1 %THEN %DO;
            %IF &VarType=char %THEN %DO;
              INSERT INTO Mylib.&FormID SET &TFieldName="&valSField";
            %END;
            %ELSE %DO;
              INSERT INTO Mylib.&FormID SET &TFieldName=&valSField;
            %END;
          %END;
          /* Later fields: UPDATE has no WHERE clause, so it applies */
          /* to every row currently in the target data set           */
          %ELSE %DO;
            %IF &VarType=char %THEN %DO;
              UPDATE Mylib.&FormID SET &TFieldName="&valSField";
            %END;
            %ELSE %DO;
              UPDATE Mylib.&FormID SET &TFieldName=&valSField;
            %END;
          %END;
          %LET row = %EVAL(&row+1);
        QUIT;
      %END;
    %MEND RunEachRowInMapping;

    /* Loop through the mapping table for one patient and one target form */
    %MACRO AddToDestination(PatientID, FormID);
      %LOCAL k;
      %GLOBAL MaxNumVar;
      %VarNumber(Mylib, &FormID)
      %GLOBAL row;
      %LET row = 1;
      DATA _NULL_;
        SET Mylib.Mapping;
        /* CALL SYMPUTX strips leading and trailing blanks, so a blank */
        /* mapping cell becomes an empty string in the %IF tests below */
        CALL SYMPUTX('NoObsM', _N_);
        CALL SYMPUTX('TFormID' || LEFT(_N_), TargetFormID);
        CALL SYMPUTX('TFieldName' || LEFT(_N_), TargetFieldName);
        CALL SYMPUTX('STableName' || LEFT(_N_), SourceTableName);
        CALL SYMPUTX('SSubTableName' || LEFT(_N_), SourceSubTableName);
        CALL SYMPUTX('SFieldName' || LEFT(_N_), SourceFieldName);
        CALL SYMPUTX('SFuncCalled' || LEFT(_N_), SourceFunctionCalled);
        CALL SYMPUTX('SFixValue' || LEFT(_N_), SourceFixedValue);
      RUN;
      %DO k = 1 %TO &NoObsM;
        %IF "&FormID" EQ "&&TFormID&k" AND "&&STableName&k" NE "" AND
            ("&&SFieldName&k" NE "" OR "&&SFuncCalled&k" NE "" OR
             "&&SFixValue&k" NE "") %THEN %DO;
          %RunEachRowInMapping(&PatientID, &FormID, &MaxNumVar,
                               &&TFieldName&k, &&STableName&k,
                               &&SSubTableName&k, &&SFieldName&k,
                               &&SFuncCalled&k, &&SFixValue&k)
        %END;
      %END;
    %MEND AddToDestination;

    /* Outer loops: every target form, then every patient */
    %MACRO Main;
      %LOCAL h i j;
      DATA _NULL_;
        SET Mylib.TargetIndex;
        CALL SYMPUT('NoObsT', _N_);
        CALL SYMPUT('FormID' || LEFT(_N_), TRIM(FormID));
      RUN;
      /* Empty every target data set before loading */
      %DO h = 1 %TO &NoObsT;
        PROC SQL NOPRINT;
          DELETE FROM Mylib.&&FormID&h;
        QUIT;
      %END;
      %IF &NoObsT > 0 %THEN %DO i = 1 %TO &NoObsT;
        DATA _NULL_;
          SET Mylib.Patients;
          CALL SYMPUT('NoObsP', _N_);
          CALL SYMPUT('PatientID' || LEFT(_N_), TRIM(id));
        RUN;
        %IF &NoObsP > 0 %THEN %DO j = 1 %TO &NoObsP;
          %PUT &&FormID&i &&PatientID&J;
          %AddToDestination(&&PatientID&J,&&FormID&i)
        %END;
      %END;
    %MEND Main;

    %Main
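
The appendix code stops once the target data sets are populated. Because the target system is described as a CSV database, a final export step along the following lines might follow. This is a sketch under assumptions: the macro name ExportForm, its parameters, the demo data set and the use of PROC EXPORT are illustrative and do not come from the paper.

```sas
/* Hypothetical helper: write one populated target data set to a CSV file */
%MACRO ExportForm(LibName, FormID, OutDir);
  PROC EXPORT DATA=&LibName..&FormID
              OUTFILE="&OutDir/&FormID..csv"
              DBMS=CSV
              REPLACE;
  RUN;
%MEND ExportForm;

/* Made-up demo data set so that the example runs on its own */
DATA WORK.AEV_demo;
  LENGTH PATID $10 AEDESCR $40;
  PATID = 'P001';
  AEDESCR = 'Nausea';
  OUTPUT;
RUN;

/* Example call: the file is written to the WORK library's folder */
%ExportForm(WORK, AEV_demo, %SYSFUNC(PATHNAME(WORK)))
```

In the setting of the appendix, the same macro could be called once per row of Mylib.TargetIndex, for example from a loop like the one in %Main, with Mylib as the library and a delivery folder as the output directory.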