Deployment of a Data Transfer Application Using PROC SQL and
Nested Macros
Keh-Dong Shiang, City of Hope National Medical Center, Duarte, CA
ABSTRACT
With more and more data being electronically delivered via the Internet, complications arise when
we attempt to bring together the information currently maintained by heterogeneous systems.
Data transformation and loading are vital components of data integration projects,
defined here as the processes of moving stored data from one system (source
system, denoted by S) into another system (target system, denoted by T), uni- or bidirectionally, where S and T can be any of: SAS®, Microsoft SQL Server, Access and Oracle.
Hence, building a bridging system to link the two independent data systems is essential in the
effective management of multiple data systems. This paper describes a step-by-step
methodology to transfer data between back-end database systems. The concept of a
mapping table as well as the application of 4-level nested macros and SQL procedures are also
introduced.
INTRODUCTION
The City of Hope National Medical Center leads a multi-center research collaboration called the
Southern California Islet Consortium (SC-IC) to evaluate islet transplantation as a treatment for
type 1 diabetes. Data for the clinical islet transplantation trials performed by the SC-IC are
obtained from both the clinical and laboratory data from our central facility located in Duarte,
California, as well as from many other external laboratories and medical centers. As the
consortium’s data collection center, we regularly report our study data to several other national
registries, including the Collaborative Islet Transplant Registry (CITR). Data transfer is an
important component in information sharing among institutions, and creating an automated
process for this has become a primary focus at our data coordination center.
The approach we are developing should effectively bridge information from multiple
sources, while reducing the errors associated with manual data entry. The concept, methodology
and procedures we have implemented to accurately and efficiently process our data upload
projects are presented here. This paper will introduce a method for building a specialized SAS
macro to automate the process of reading data from a Microsoft SQL database and transforming
data sets into the formats suitable for the other center’s data structure.
DATABASE DESIGN: EXTRACT, TRANSFORM AND LOAD
I have developed a Microsoft Access database that has been placed in production to
electronically transfer study data from the SC-IC (i.e., the source system), an ODBC-linked SQL
database, to CITR (i.e., the target system), a CSV (comma-separated values) database. Recently,
this Access database system and its Visual Basic modules have been converted and translated
to SAS. Currently, the SAS version still functions as a primitive test database system, but it may be
worth applying and sharing with others.
My programming concept and flow are illustrated in the six steps below:
STEP 1. Create a SAS library directory as the system container.
A LIBNAME statement assigns a file location that holds all of the source and target data sets.
STEP 2. Make an ODBC link to import all the source study SQL tables into the specified SAS library.
Two macro parameters, SQLTABLE and SASDATASET, hold the SQL table name and the SAS
data set name.
/* Generates one CREATE TABLE clause; the semicolon after each macro call ends the statement. */
%MACRO GetSQLTable(SQLTABLE, SASDATASET);
    CREATE TABLE Mylib.&SASDATASET AS
        SELECT * FROM CONNECTION TO ODBC
            (SELECT * FROM &SQLTABLE)
%MEND GetSQLTable;
PROC SQL;
    /* Open the pass-through connection to the ODBC data source named ISLET. */
    CONNECT TO ODBC (DSN=ISLET);
    %GetSQLTable(PhysicicianReport, PhysReport);
    %GetSQLTable(PhysicalExam, PhysExam);
    %GetSQLTable(Symptom, SYMP);
    %GetSQLTable(FollwUp, FUL);
    %GetSQLTable(AdverseEvent, AdvEvent);
    %GetSQLTable(TransplantDay0, TranspDay0);
    ……………………………………….
    DISCONNECT FROM ODBC;
QUIT;
STEP 3. Gather all source tables and save their names to a table called the “source index table”.
FormIndex  MainTableName     SubTableName
F03        ChemMetCBC        ChemMetCBC
F05        ChemistriesOther  ChemistriesOther
F09        CTCMaster         CTCDetail
F09        CTCMaster         CTCMaster
F11        SAEFU             SAEFU
M04        OSVitalSigns      OSVitalSigns
T01        TransplantDay0    TransDay0PortPressure
T01        TransplantDay0    TransplantDay0
…          …..               …..
STEP 4. Copy all the empty target data sets (i.e., only the structure, no data) into the working
library. These blank data sets include AEV, FUP, etc., and are listed in a table called the
“target index table”.
FormID  FormName
AEV     Adverse Event Form
FUP     Follow-up Form
…       ……
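One way to obtain a structure-only copy of a data set is the LIKE clause of PROC SQL's CREATE TABLE statement. The following is a minimal sketch rather than the production code: the data set WORK.AEV_full, its variable lengths and its sample row are hypothetical stand-ins for a populated target data set, and only the variable names are borrowed from the Adverse Event Form.

```sas
/* Hypothetical stand-in for a populated target data set.
   Variable lengths and values are illustrative assumptions. */
DATA WORK.AEV_full;
    LENGTH PATID $10 AEONSET 8 AEDESCR $40;
    FORMAT AEONSET DATE9.;
    PATID   = 'P001';
    AEONSET = '01JAN2005'd;
    AEDESCR = 'Example event';
    OUTPUT;
RUN;

PROC SQL;
    /* LIKE copies the column definitions only; WORK.AEV is created with zero rows. */
    CREATE TABLE WORK.AEV LIKE WORK.AEV_full;
QUIT;
```

In this sketch WORK.AEV has the same three columns as WORK.AEV_full and no observations, which is the state STEP 4 asks for before any records are inserted.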
STEP 5. Profile both the source and target data.
First, the target form/table ID and field name columns are added at the beginning of the mapping
data set, with form values such as AEV (Adverse Event Form), FUP (Follow-up Form),
and so on.
TargetFormID  TargetFormName      TargetFieldName
AEV           Adverse Event Form  PATID
AEV           Adverse Event Form  AEONSET
AEV           Adverse Event Form  AEDESCR
AEV           Adverse Event Form  AEPERSDT
AEV           Adverse Event Form  RESDTE
AEV           Adverse Event Form  EVTOUTCM
AEV           Adverse Event Form  AEIMMUN
AEV           Adverse Event Form  AERELAT
AEV           Adverse Event Form  AESEVTY
AEV           Adverse Event Form  TXREQRD
AEV           Adverse Event Form  SAEDEATH
FUP           Follow-up Form      ASSMTDTE
FUP           Follow-up Form      FOICPEP
FUP           Follow-up Form      FOIGLUC
FUP           Follow-up Form      FOIGLUND
FUP           Follow-up Form      FOIHBA1C
FUP           Follow-up Form      FOIINS
FUP           Follow-up Form      TXDATE
Second, a number of source data fields are added, such as SourceTableID, SourceTableName,
SourceSubTableName, SourceFieldName, SourceFunctionCalled, SourceFixedValue, etc.
SourceTableID  SourceTableName   SourceSubTableName  SourceFieldName        SourceFunctionCalled                        SourceFixedValue
F11            SAEFU                                 PtID
F11            SAEFU                                 SAEDate
F11            SAEFU                                 ToxicityCode           CTCAEterm("ToxicityCode")
F11            SAEFU                                 LastAssessmentDate     LastAssessDate("LastAssessmentDate")
F11            SAEFU                                 ResolutionDate         ResolutionDate("ResolutionDate")
F11            SAEFU                                 OutcomeOfSAE
F11            SAEFU                                 TreatmentRelationship
F11            SAEFU                                 TransRelationship
F09            CTCDetail         CTCDetail           Grade                  ToxGrade('Grade')
F11            SAEFU                                 TreatmentRequired      TxRequired("TreatmentRequired")
F09            CTCDetail         CTCDetail           Grade                  SAEdef(1,"Grade")
F03            ChemMetCBC                            ProinsulinDate         AssesmentDate("ProinsulinDate")
F03            ChemMetCBC                            Cpeptide               CalcCpeptide("ProinsulinDate","Cpeptide")
F03            ChemMetCBC                            Glucose                CalcGlucose("MetPanelDate","Glucose")
F03            ChemMetCBC                            Glucose                GlucoseNotDone("MetPanelDate","Glucose")
F05            ChemistriesOther                      HgbA1c                 CalcHgbA1c("HgbA1cDate","HgbA1c")
M04            OSVitalSigns                          PtCurrTakingInsulinYN
T01            TransplantDay0                        TransplantDate         InfusionDateFOI("Transplant1Date")
STEP 6. Create a mapping table that plays a central role in the entire data transfer process.
The final combined data set from STEP 5 is the so-called ‘mapping table’, formed by merging the
two tables above. All values for the source fields are entered manually.
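If the target-side and source-side columns from STEP 5 were kept as two separate data sets in the same row order, they could be combined with a one-to-one merge. The sketch below is only an illustration under that assumption: the data set names WORK.TargetFields and WORK.SourceFields are hypothetical, and just two rows and a subset of the columns are shown.

```sas
/* Hypothetical target-side rows (subset of columns). */
DATA WORK.TargetFields;
    LENGTH TargetFormID $3 TargetFieldName $8;
    INPUT TargetFormID $ TargetFieldName $;
    DATALINES;
AEV PATID
AEV AEONSET
;
RUN;

/* Hypothetical source-side rows, entered in the same order. */
DATA WORK.SourceFields;
    LENGTH SourceTableName $32 SourceFieldName $32;
    INPUT SourceTableName $ SourceFieldName $;
    DATALINES;
SAEFU PtID
SAEFU SAEDate
;
RUN;

/* MERGE without a BY statement pairs observations by position:
   row 1 with row 1, row 2 with row 2, and so on. */
DATA WORK.Mapping;
    MERGE WORK.TargetFields WORK.SourceFields;
RUN;
```

A positional merge is only safe when both inputs have the same number of rows in the same order; a shared key variable and a BY statement would be more robust if such a key exists.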
A single patient can have multiple patient identifiers, such as Tracking Number, Research
Participant Number, Medical Record Number, UNOS ID and SC-IC Patient ID, depending on the
study phase (recruitment, pre-study, treatment and follow-up) and on the data source.
We have therefore manually created an index table, Patients, which includes all the patient
identifier variables. Note that all the identifiers in a given row of this table refer to the
same patient.
Once these steps are complete and the tables are ready, a main SAS macro is needed to
perform the automatic data transfer task. This code contains a total of four nested iterative
loops, outlined as follows.
DO
    Loop through TargetIndex data set
    Select a single row (i.e. target table) each time to initiate the data insertion
    DO
        Loop through entire Patients data set
        Choose a single patient row by row from this data set
        Call AddToDestination macro
        DO
            Loop through whole ‘Mapping’ data set
            Select a single mapping record at each iterative loop
            Call RunEachRowInMapping macro
            DO
                Loop through each source table
                Insert records into target data sets from source tables
                Customized data manipulation functions may be needed
            END DO
        END DO
    END DO
END DO
This module makes extensive use of record searches, the SQL procedure, the DATA step, data
insertion, iterative loops, flow control, macro variables and nested macros. The complete
code is included as an appendix.
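Each loop level in the appendix relies on the same idiom: a DATA step copies every row of a data set into a numbered series of macro variables, and a %DO loop then walks through that series. The following stripped-down sketch shows the idiom on its own. The data set WORK.TargetIndex and the macro LoopOverForms are hypothetical, and CALL SYMPUTX is used here instead of the CALL SYMPUT found in the appendix because it trims leading and trailing blanks.

```sas
/* Hypothetical two-row index table standing in for Mylib.TargetIndex. */
DATA WORK.TargetIndex;
    LENGTH FormID $3;
    INPUT FormID $;
    DATALINES;
AEV
FUP
;
RUN;

%MACRO LoopOverForms;
    %LOCAL i NoObs;
    %LET NoObs = 0;   /* stays 0 if the data set is empty */

    /* Store each FormID in FormID1, FormID2, ... and the row count in NoObs. */
    DATA _NULL_;
        SET WORK.TargetIndex;
        CALL SYMPUTX(CATS('FormID', _N_), FormID);
        CALL SYMPUTX('NoObs', _N_);
    RUN;

    /* &&FormID&i resolves in two passes: first to &FormID1, then to its value. */
    %DO i = 1 %TO &NoObs;
        %PUT Processing form &&FormID&i;
    %END;
%MEND LoopOverForms;

%LoopOverForms
```

Nesting one such loop inside another, with a macro call in the body of each, gives the four-level structure outlined above.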
In the working library, our study SQL database consists of almost 100 tables, and most of them
are normalized Master and Detail views. However, the common fields (for linking views)
vary from form to form (i.e., table to table), which increases the complexity of our
mapping table approach.
Our target database system contains 20 data sets, which is relatively simple. Even so,
transferring data between two heterogeneous data systems, from 100 source data
sets to 20 target data sets, is a high hurdle for database programmers. Many institutions
facing this challenge, like ours, would first attempt to find an efficient
gateway interface to link the two databases.
Data profiling involves studying both the source and target data thoroughly to
understand their structure and content. Once both data systems have been profiled, an
appropriate set of mapping specifications is recorded in the mapping table based on this profile.
The combination of data profiling and mapping processes comprises the essential step in our
data transfer project and should definitely be completed prior to attempting to extract, transform
and load the source data into the target database.
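Part of the structural profiling can be done by querying DICTIONARY.COLUMNS, the same metadata view that the VarNumber macro in the appendix reads. The query below is a minimal sketch that assumes the Mylib libref from the appendix has already been assigned; it lists every variable of every data set in that library.

```sas
PROC SQL;
    /* One row per variable in each data set of the library.
       LIBNAME values in DICTIONARY.COLUMNS are stored in uppercase. */
    SELECT MEMNAME, VARNUM, NAME, TYPE, LENGTH, FORMAT
        FROM DICTIONARY.COLUMNS
        WHERE LIBNAME = 'MYLIB'
        ORDER BY MEMNAME, VARNUM;
QUIT;
```

A listing like this covers structure only; profiling the content (value ranges, missing values, coding schemes) still requires examining the data themselves.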
CONCLUSION
This paper presents a simple example and a structured methodology of how we can transfer data
between two heterogeneous data systems. A mapping table that consists of the metadata
records is introduced to retrieve and map variables from one or more source data sets into one or
more target data sets. With the aid of the mapping table, the challenge of data transfer
becomes systematic, straightforward and easy to visualize.
We hope this idea and methodology will interest anyone who wishes to extend it into a
more sophisticated program.
ACKNOWLEDGMENT
The author wishes to thank Dr. Jeffrey Longmate, Dr. Joyce Niland, Dr. Fouad Kandeel and Dr.
Craig Smith from City of Hope for their support. Also, thanks to my colleagues, Lorraine Lesiecki
and Jeannette Hacker, for their kind assistance and reviews of this manuscript. In addition, I
would like to thank CITR for the opportunity to work on this data transfer project.
TRADEMARKS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective
companies.
CONTACT INFORMATION
Contact the author at:
Keh-Dong Shiang, Ph.D.
Department of Biostatistics & Department of Diabetes
City of Hope National Medical Center
1500 East Duarte Road
Duarte, CA 91010-3000
Work Phone: (626)256-4673 Ext. 65768
Fax: (626)471-7106
E-mail: [email protected]
APPENDIX
LIBNAME Mylib 'C:\WUSS2005';
/* List the variables of a data set in Mylib.VarList and store the variable count in MaxNumVar. */
%MACRO VarNumber(LibName, DsName);
    PROC SQL NOPRINT;
        CREATE TABLE Mylib.VarList AS
            SELECT VARNUM, NAME, TYPE, LENGTH, LABEL, FORMAT
            FROM DICTIONARY.COLUMNS
            WHERE LIBNAME=%UPCASE("&LibName") AND
                  MEMNAME=%UPCASE("&DsName")
            ORDER BY VARNUM;
        SELECT MAX(VARNUM) INTO :MaxNumVar FROM Mylib.Varlist;
    QUIT;
%MEND VarNumber;
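/* Look up the type (char or num) of one variable in Mylib.VarList and store it in VarType. */
%MACRO VariableType(VarName);
    PROC SQL NOPRINT;
        /* UPCASE(NAME) makes the match independent of how the variable name is cased. */
        SELECT TYPE INTO :VarType FROM Mylib.VarList
            WHERE UPCASE(NAME)=%UPCASE("&VarName");
    QUIT;
%MEND VariableType;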
/* Process one mapping record: read one source field for one patient and write it to the target field. */
%MACRO RunEachRowInMapping(PatientID, FormID, MaxNumVar, TFieldName,
        STableName, SSubTableName, SFieldName, SFuncCalled, SFixValue);
    %GLOBAL VarType;
    %VariableType(&TFieldName)
    %LET valSField = '';
    /* Copy the non-missing source value for this patient into the macro variable valSField. */
    DATA _NULL_;
        SET Mylib.&STableName (WHERE=(PtId="&PatientID"));
        varSField = &SFieldName;
        IF varSField NE '' THEN DO;
            CALL SYMPUT('valSField', varSField);
        END;
    RUN;
    %IF &valSField NE '' AND &TFieldName NE '' %THEN %DO;
        PROC SQL NOPRINT;
            /* The first field written for a patient inserts a new row. */
            %IF &row=1 %THEN %DO;
                %IF &VarType=char %THEN %DO;
                    INSERT INTO Mylib.&FormID SET
                        &TFieldName="&valSField";
                %END;
                %ELSE %DO;
                    INSERT INTO Mylib.&FormID SET &TFieldName=&valSField;
                %END;
            %END;
            /* Later fields use UPDATE; with no WHERE clause it applies to every row of the target data set. */
            %ELSE %DO;
                %IF &VarType=char %THEN %DO;
                    UPDATE Mylib.&FormID SET &TFieldName="&valSField";
                %END;
                %ELSE %DO;
                    UPDATE Mylib.&FormID SET &TFieldName=&valSField;
                %END;
            %END;
            %LET row = %EVAL(&row+1);
        QUIT;
    %END;
%MEND RunEachRowInMapping;
/* For one patient and one target form, run every mapping record that belongs to that form. */
%MACRO AddToDestination(PatientID, FormID);
    %LOCAL k;
    %GLOBAL MaxNumVar;
    %VarNumber(Mylib, &FormID)
    %GLOBAL row;
    %LET row = 1;
    /* Load each column of the mapping table into a numbered series of macro variables. */
    DATA _NULL_;
        SET Mylib.Mapping;
        CALL SYMPUT('NoObsM', _N_);
        CALL SYMPUT('TFormID' || LEFT(_N_), TargetFormID);
        CALL SYMPUT('TFieldName' || LEFT(_N_), TargetFieldName);
        CALL SYMPUT('STableName' || LEFT(_N_), SourceTableName);
        CALL SYMPUT('SSubTableName' || LEFT(_N_), SourceSubTableName);
        CALL SYMPUT('SFieldName' || LEFT(_N_), SourceFieldName);
        CALL SYMPUT('SFuncCalled' || LEFT(_N_), SourceFunctionCalled);
        CALL SYMPUT('SFixValue' || LEFT(_N_), SourceFixedValue);
    RUN;
    /* Use only records for this form that name a source table and a field, function or fixed value. */
    %DO k = 1 %TO &NoObsM;
        %IF "&FormID" EQ "&&TFormID&k" AND "&&STableName&k" NE "" AND
            ("&&SFieldName&k" NE "" OR "&&SFuncCalled&k" NE "" OR "&&SFixValue&k" NE "") %THEN
        %DO;
            %RunEachRowInMapping(&PatientID, &FormID, &MaxNumVar,
                &&TFieldName&k, &&STableName&k, &&SSubTableName&k, &&SFieldName&k,
                &&SFuncCalled&k, &&SFixValue&k)
        %END;
    %END;
%MEND AddToDestination;
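/* Driver: empty every target data set, then fill each one patient by patient. */
%MACRO Main;
    %LOCAL h i j;
    /* Load the target form IDs into FormID1, FormID2, ... and the row count into NoObsT. */
    DATA _NULL_;
        SET Mylib.TargetIndex;
        CALL SYMPUT('NoObsT', _N_);
        CALL SYMPUT('FormID' || LEFT(_N_), TRIM(FormID));
    RUN;
    /* Delete any existing rows from each target data set. */
    %DO h = 1 %TO &NoObsT;
        PROC SQL NOPRINT;
            DELETE FROM Mylib.&&FormID&h;
        QUIT;
    %END;
    %IF &NoObsT > 0 %THEN
    %DO i = 1 %TO &NoObsT;
        /* Load the patient IDs into PatientID1, PatientID2, ... and the row count into NoObsP. */
        DATA _NULL_;
            SET Mylib.Patients;
            CALL SYMPUT('NoObsP', _N_);
            CALL SYMPUT('PatientID' || LEFT(_N_), TRIM(id));
        RUN;
        %IF &NoObsP > 0 %THEN
        %DO j = 1 %TO &NoObsP;
            %PUT &&FormID&i &&PatientID&J;
            %AddToDestination(&&PatientID&J,&&FormID&i)
        %END;
    %END;
%MEND Main;
%Main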