Paper DM02

Prepare Data Listings with SAS® Software for Database Audit

Jay Zhou and Larry Shen, Amylin Pharmaceuticals, Inc., San Diego, CA

ABSTRACT

ICH-GCP emphasizes the importance of a clinical trial database audit. Auditing an entire database, especially in a large clinical trial, can be daunting because it requires great effort and resources. It is also unnecessary from the perspective of statistical quality control. Efficiency can be gained by classifying variables as either critical or non-critical: 100% of the critical variable data are audited, whereas only a subset of the non-critical variable data is audited. This paper presents techniques for separating critical from non-critical variable data, selecting a representative sample from the non-critical variable data, and preparing data listings for an audit.

INTRODUCTION

A database audit is a key step in assuring database quality in clinical trials. The audit is performed by comparing SAS data listings generated from the raw data against the original Case Report Forms (CRFs). Based on the importance of the clinical data, variables can be divided into critical and non-critical ones. For the critical variables, such as birth date, all adverse event data, dosing date and time of study medication, visit dates, termination date, and key efficacy variables, 100% of the data points need to be audited. For the non-critical variables, a smaller subset of the data that represents the original database is randomly selected in order to save time and resources. Some data sets contain a few critical variables alongside other, non-critical variables. In that case, for easy handling, the data set is split and treated as two data sets, one for the critical variables and the other for the non-critical variables, although all these variables will be audited together.
Once the data audit is completed, the error rates are calculated separately for the critical and the non-critical variables.

To facilitate the audit process and to conform to the principle of statistical quality control, the data listings generated for audit should meet the following requirements: 1) the final data listings should be reported in a way that mirrors the order and set-up of the actual CRFs; 2) the random sampling unit should be as small as possible to ensure that the sample is a good representation of the database and that the error rate and its confidence interval have statistical validity.

Once the data points are selected, to satisfy the first requirement, the listings are generated in such a way that each page contains data per CRF module arranged by visit for each subject. This allows QA audit personnel to conduct the data audit by subject, following the CRF flow in time sequence. This means that the critical and non-critical variables are presented in the same listing for the same subject. They are presented on different pages, however, even when the data are collected from the same CRF page, so that the auditors can calculate the error rates of the non-critical variables. As for the second requirement, the sampling unit is chosen to be the lines of observations in a data set (described fully later).

To meet these needs, the programming steps include (1) separating non-critical variables, (2) calculating the probability with which the data points will be randomly selected, (3) randomly sampling observations for the non-critical variables, and (4) generating the data listings with both critical and non-critical variables together. Preparing such a sophisticated listing per subject by visit can be challenging and time-consuming. The goal of this paper is to provide techniques for executing these steps that can be used to speed the production of data audit listings.

SEPARATING NON-CRITICAL VARIABLES

Since the random sampling pertains only to the non-critical data points, the non-critical variables must be determined by the database audit plan. In this paper we assume that the overall number of non-critical data points to be audited has been determined by a valid statistical method in the audit plan. To determine how many data points to select randomly from each data set so that the overall number selected meets that requirement, it is necessary to create a data library restricted to data sets that contain the non-critical variables plus the identifier variables (e.g., subject identifier and visit number). Depending on the number of non-critical variables in each data set, the new data set can be easily created using a KEEP/DROP statement or the KEEP=/DROP= data set option.

Because there are many data sets in each clinical trial, it is more efficient to create a SAS macro that processes the entire data library with a single call. Here is an example of a macro call to prepare the data sets for the non-critical variables:

    %drop(inlib=raw,
          outlib=audit,
          data=comment conmed disp ecg incexc lab pe vs,
          drop=studyid domain cmclas cmdecod cmclascd weight wtu dsstdtc dsstdy egdy pedy)

The inlib= parameter specifies the libref of the original data sets, while the outlib= parameter specifies the libref of the data sets with the non-critical variables. Because not all the data sets in the library need to be processed (e.g., there may be no non-critical variables in the data sets for adverse events and subject dispositions), the data sets that contain the non-critical variables are specified with the data= parameter. If this parameter is not defined, the macro processes the entire library. As its name suggests, the drop= parameter lists the variables that should not be included in the output data sets.
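
For a single data set, the DROP= data set option mentioned above is enough on its own. The following is a minimal sketch rather than the authors' code. It assumes that the RAW and AUDIT librefs are already assigned and that the VS data set contains the variables STUDYID, DOMAIN, WEIGHT, and WTU named in the macro call.

```sas
/* Sketch only: librefs RAW and AUDIT are assumed to be assigned already. */
data audit.vs;
  /* Drop the study identifiers and the critical variables as the data are read */
  set raw.vs (drop=studyid domain weight wtu);
run;
```

By default SAS reports an error when a variable named in DROP= on an input data set does not exist (the DKRICOND= system option controls this behavior), which is one reason the %drop macro in Appendix 1 first checks which of the listed variables are present in each data set.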

The following variables should be dropped: study identifiers (e.g., studyid, domain); non-CRF data (e.g., the dictionary terms cmclas, cmdecod, cmclascd); derived numerical timing variables (e.g., dsstdy, egdy, pedy) that represent the same data points as the character timing variables; and critical variables (e.g., dsstdtc in the DISP data set, weight and wtu in the VS data set). The %drop macro program is included in Appendix 1.

CALCULATING THE PROBABILITY OF RANDOM SAMPLE

The purpose of the random sampling is to ensure that the data points selected for audit are a 'true' representation of the underlying database. Typically, a database audit is conducted by randomly sampling a certain number of subjects from the database. If a subject is chosen, all the observations of that subject are chosen, which means that the individual observations are not really randomly sampled. With the method presented in this paper, the sample is drawn randomly from observations that each have an equal opportunity of being selected. Within each data set the columns represent different data variables and the rows represent the observations (records) of data for the subjects across different visits. The proportion of rows to be sampled is determined so that the total number of selected data points meets the required sample size specified in the database audit plan.

Once the data sets with the non-critical variables are prepared, the next step is to determine the total number of non-critical data points for the entire database. To do that, one must obtain the number of non-critical variables and the number of observations from each data set.
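
The calculation described above can be written compactly. This is a sketch in notation introduced here for illustration (it does not appear in the audit plan): d indexes the data sets, n_d is the number of observations in data set d, v_d is its number of non-critical variables, and S is the required sample size of non-critical data points. The worked values are taken from Table 1.

```latex
\begin{align*}
T   &= \sum_{d} n_d \, v_d            && \text{total non-critical data points} \\
p   &= \frac{S}{T}                    && \text{sampling probability} \\
k_d &= \operatorname{round}(n_d \, p) && \text{observations to select from data set } d \\
\sum_{d} k_d \, v_d &\approx S        && \text{data points actually selected}
\end{align*}

% Worked example with the Table 1 values (S = 2000):
\begin{align*}
T &= 123197, \qquad p = \frac{2000}{123197} \approx 0.016234 \\
k_{\mathrm{LAB}} &= \operatorname{round}(7022 \times 0.016234) = 114,
\qquad 114 \times 7 = 798 \text{ data points}
\end{align*}
```

Because each k_d is rounded, the number of data points actually selected matches S only approximately in general; with the Table 1 values the selected data points happen to total exactly 2000.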
These counts can be obtained easily using PROC SQL with the data set metadata from the dictionary tables, COLUMNS and TABLES (or the view tables, VCOLUMN and VTABLE, in the SASHELP library):

    %let size=2000; *** sample size of non-critical data points ***;

    proc sql noprint;
      create table tempdata as
        select distinct(a.memname), b.nobs,
               count(a.name) as nvar label='Number of Variables',
               b.nobs*calculated nvar as points
                 label='Total Number of Data Points per Data set'          /* (1) */
        from dictionary.columns as a, dictionary.tables as b
        where a.libname=b.libname and a.libname="AUDIT"
              and a.memname=b.memname
              and upcase(a.name) not in ('USUBJID','SITEID','VISIT')      /* (2) */
        group by a.memname;

      select &size/sum(points), sum(points) into :prob, :total           /* (3) */
        from tempdata;

      create table tempdata as
        select *,
               round(nobs*&prob, 1) as obs
                 label='Number of Observations per Data set Randomly Selected',
               calculated obs*nvar as numbers
                 label='Number of Data Points per Data set Selected'       /* (4) */
        from tempdata;

      select memname into :memname separated by '|'                      /* (5) */
        from tempdata;
    quit;

    ods listing close;
    ods rtf file="c:\temp\DataPointsAudited.rtf";
    title "Total data points for the non-critical variables in Study &study are &total";
    title2 "The probability of sampling &size data points for audit is &prob";
    proc print data=tempdata noobs label;                                 /* (6) */
      var memname nobs nvar points obs numbers;
      label memname='Data set Name'
            nobs='Number of Observations per Data set';
    run;
    ods rtf close;
    ods listing;

(1) The DISTINCT keyword keeps only one record for each data set in the memname column. The COUNT function creates a new column, nvar, holding the number of non-critical variables counted from the name column for each data set in the COLUMNS table. Because the nvar column is newly calculated, the CALCULATED keyword is needed to use its result in the same SELECT clause to obtain the total data points for each data set.

(2) Because usubjid, siteid, and visit are identifier variables rather than non-critical variables, they are excluded from the calculation so that the total number of non-critical data points is accurate.

(3) This SELECT clause creates two macro variables, &prob and &total. &prob is the sampling probability, calculated as the sample size (2000 data points) divided by the total number of non-critical data points. It is used in the next step to select the data points from each data set. &total is used in the title of the sampling report to give the auditor the total number of non-critical data points in the database.

(4) This SELECT clause adds two variables, obs and numbers, to the TEMPDATA data set for later reporting in (6).

(5) The macro variable &memname created by this SELECT clause contains all the data set names in the AUDIT library, separated by the '|' character. It is used later in the %_select macro to automate the sampling of data from each data set.

(6) This step generates a table (see Table 1), captured by ODS in RTF format, for the auditor's reference. It helps the auditor calculate the error rate for each data set.

Table 1.
Total data points for the non-critical variables in Study 123 are 123197.
The probability of sampling 2000 data points for audit is 0.016234.

    Data set   Number of       Number of   Total Number     Number of Observations   Number of Data Points
    Name       Observations    Variables   of Data Points   per Data set Randomly    per Data set Selected
               per Data set                per Data set     Selected
    COMMENT    2464             6          14784             40                      240
    CONMED     2007            13          26091             33                      429
    DISP        826             5           4130             13                       65
    ECG         192            21           4032              3                       63
    IECEXC      190             9           1710              3                       27
    LAB        7022             7          49154            114                      798
    PE         2243             7          15701             36                      252
    VS         1085             7           7595             18                      126

RANDOMLY SAMPLING OBSERVATIONS

A random sample is any collection of N observations selected from a population in such a way that each possible sample has the same chance of being chosen.
There are several techniques available for drawing random samples. The method used in this paper is a sequential method of sampling without replacement, under which each observation has only one opportunity to be selected. Because of the sequential nature, the decision to include an observation in the sample must be made at the time the observation is processed; you cannot go back and reconsider an observation that has already been passed.

Because the data reside in different data sets, the sampling is done within each data set that contains non-critical variables. The procedure can therefore be considered stratified sampling, where each data set serves as a stratum and the resulting total sample is called a stratified sample (Mittag & Rinne, 1993). In addition, the minimum sampling unit is an observation line with multiple data values corresponding to different non-critical variables, so the procedure can also be considered cluster sampling, where each observation line is a cluster.

The selection strategy modifies the probability of selecting an observation based on how many sampled observations (k) are still desired and how many observations (n) remain available in the data set. Initially, k for a specific data set is the total number of observations in the data set multiplied by the probability described in the previous section, and n is the total number of observations. An observation is selected if the random number generated by the RANUNI function is smaller than or equal to the ratio k/n (the selection probability).
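
It may not be obvious that a ratio that changes from observation to observation still gives every observation the same overall chance of selection. The short check below is a sketch using k and n as defined above, with π_i (a symbol introduced here for illustration) denoting the probability that the i-th observation is selected, and assuming 1 ≤ k ≤ n and n ≥ 2.

```latex
\begin{align*}
\pi_1 &= \frac{k}{n} \\[4pt]
\pi_2 &= \underbrace{\frac{k}{n}\cdot\frac{k-1}{n-1}}_{\text{first selected}}
       + \underbrace{\Bigl(1-\frac{k}{n}\Bigr)\cdot\frac{k}{n-1}}_{\text{first not selected}}
       = \frac{k(k-1) + (n-k)\,k}{n(n-1)}
       = \frac{k(n-1)}{n(n-1)}
       = \frac{k}{n}
\end{align*}
```

The same argument extends by induction to later observations. Because k is reduced by one at each selection and the ratio reaches 1 whenever the number of observations still needed equals the number remaining, the procedure returns exactly k observations.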

SAS code for drawing a sample of exact size with this changing-probability method is given in SAS Language and Procedures: Usage 2, p. 235:

    data SAMPLE (drop=k n);
      retain k 100 n;
      if _n_=1 then n=total;
      set POP nobs=total;
      if ranuni(0)<=k/n then do;
        output;
        k=k-1;
      end;
      n=n-1;
      if k=0 then stop;
    run;

The seed value used to initiate a random sequence with the RANUNI function is not critical. In the above example the seed value is zero (RANUNI(0)), for which SAS uses the system clock's time value as the seed, so rerunning the code yields a different sample on each execution. To control the initialization, use a fixed positive integer as the seed, which guarantees that each run on the same data set selects the same sample. This is important because sometimes you may want to rerun the program and obtain the same samples. For this purpose, a macro %_select is created:

    %macro _select;
      %let i=1;                                          /* (1) */
      %let _data=%scan(&memname,&i, |);                  /* (2) */
      %do %while (%length(&_data) gt 0);                 /* (3) */
        %sample(dsin=audit.&_data, dsout=subset.&_data,
                prob=&prob, seed=&i);                    /* (4) */
        %let i=%eval(&i + 1);                            /* (5) */
        %let _data=%scan(&memname,&i, |);                /* (6) */
      %end;
    %mend _select;

(1) To loop through each of the data sets listed in &memname, the iteration variable is initialized to 1, which references the first data set.

(2) The &memname macro variable created with the PROC SQL SELECT clause in the previous step contains all the data set names. Since the delimiter passed to the %SCAN function is the '|' character, the macro variable &_data is initialized by scanning &memname for the characters preceding the first '|' character.

(3) A %DO %WHILE loop is started to process each data set until all data sets have been iterated through.

(4) The macro %sample (see Appendix 2 for details) is modified from the exact-sample-size, changing-probability code shown previously. The input population data set comes from the AUDIT data library and the output sample data set resides in the SUBSET data library. The &prob macro variable, a constant, is the sampling probability calculated with the PROC SQL SELECT clause in the previous section. Because every data set is treated as equally important, no weight is added to the probability for particular data sets. The &i macro variable is used as the seed value, so the initialization differs from data set to data set as the value of &i changes. The seed value is fixed for any given data set, however, which allows you to repeat the process with the same results.

(5) After the first data set is sampled, the iteration variable is incremented to 2.

(6) The &_data variable takes the second data set name and control returns to the %DO %WHILE statement. The loop continues in this manner until &_data has stepped through all the data set names.

GENERATING THE DATA LISTINGS

The main challenge of reporting is to present the data by subject, visit, and data module, with critical and non-critical variables, in the way that best mirrors the flow of the CRFs. The process is much simpler if each data set is reported independently and the reports are then assembled together on the listing for the same subject. With this approach, the reporting programs for all the data sets can be easily generated with a process similar to that described by Morrill, Wiser, and Zhou (2002). To make the report correspond to the CRF pages for each subject, a WHERE statement with usubjid=&usubjid and visit=&visit (if applicable) must be embedded in each program to subset the data, and the programs must be arranged in order. Since each program (data set) reports only one subject's data at each visit, each page needs to contain a large number of variables but few observations. This makes PROC PRINT a better choice than PROC REPORT.
If PROC PRINT cannot fit all the variables on a single line, it splits the observations into two or more sections and prints the observation number or the ID variables at the beginning of each line. This requires fewer pages than PROC REPORT. To help auditors understand the data on the listings when comparing them with the CRFs, it is desirable to use the LABEL option to display the variables' labels as column headings.

The sample code below illustrates the inclusion of all the programs, arranged to correspond to the order of the CRF pages in time sequence. The PROC SQL step creates a macro variable, &_usubjid, which contains all the unique subject numbers from the DM data set separated by the '|' character and is used in the macro %report. When that macro is invoked, the &usubjid macro variable holding an individual subject number is created by the %let statement with the %SCAN function from &_usubjid, and it subsets the data by subject for each data set used in the program. For instance, vsn.sas, for the vital sign data set with the non-critical variables, is invoked by the %include statement. The vsn.sas program is included in Appendix 3.

    proc sql noprint;
      select usubjid into :_usubjid separated by '|'
        from dm;
    quit;

    %macro report;
      %let prgpath=c:\temp\pgm;
      %let i=1;
      %let usubjid=%scan(&_usubjid,&i, |);
      %do %while (%length(&usubjid) gt 0);
        ods listing close;
        ods rtf file="c:\temp\out\P&usubjid..rtf";

        *** Visit One ***;
        %let visit=1;
        %include "&prgpath\incexc.sas";  * For Inclusion/Exclusion Criteria *;
        %include "&prgpath\dm.sas";      * For Demographics *;
        %include "&prgpath\vsc.sas";     * For Critical variables in Vital Signs *;
        %include "&prgpath\vsn.sas";     * For Non-critical variables in Vital Signs *;
        %include "&prgpath\pe.sas";      * For Physical Examination *;

        *** Visit Two ***;
        %let visit=2;
        %include "&prgpath\vsc.sas";     * For Critical variables in Vital Signs *;
        %include "&prgpath\vsn.sas";     * For Non-critical variables in Vital Signs *;
        ...
        * Include other programs for other data modules at Visit Two *;

        *** Other Visits ***;
        %let visit=3;
        ...
        * Include programs for other visits *;

        *** Visit Termination ***;
        %include "&prgpath\vsc.sas";     * For Critical variables in Vital Signs *;
        %include "&prgpath\vsn.sas";     * For Non-critical variables in Vital Signs *;
        %include "&prgpath\ae.sas";      * For Adverse Events *;
        %include "&prgpath\conmed.sas";  * For Concomitant Medication *;
        %include "&prgpath\comment.sas"; * For Comments *;
        %include "&prgpath\disp.sas";    * For Disposition *;

        * Close the RTF file for this subject and restore the listing output *;
        ods rtf close;
        ods listing;

        %let i=%eval(&i + 1);
        %let usubjid=%scan(&_usubjid,&i, |);
      %end;
    %mend report;

    %report;

CONCLUSION

Database auditing is an important step in ensuring database quality, but it is time-consuming, especially when more data points are audited than needed and the data listings are hard to compare with the CRFs. This paper presents a solution that randomly selects the observations for non-critical variables and reports both critical and non-critical variables in the same listing per subject, in an order that mirrors the flow of the CRF pages. The solution not only facilitates the audit process, saving time and resources, but also gives the audit a sound statistical basis.

REFERENCES

Morrill, J., Wiser, K., and Zhou, J. (2002), A Data-Driven Macro Automating the Data Presentation Process by Generating Tailored, Customizable SAS Code - Relax, let %TABGEN do your work! Proceedings of the Annual Conference of the Pharmaceutical Industry SAS Users Group in Year 2002, pp. 43-47.

Mittag, H. J. and Rinne, H. (1993), Statistical Methods of Quality Assurance, Chapman & Hall.

SAS Institute (2000), SAS Language and Procedures: Usage 2, Version 6, First Edition, PDF Format, p. 235.

ACKNOWLEDGMENTS

The authors would like to thank Nuwan Nanayakkara and David Brown for reviewing this manuscript and providing valuable comments.

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Jay Zhou
Amylin Pharmaceuticals, Inc.
9360 Towne Centre Drive
San Diego, CA 92121
Email: [email protected]

APPENDIX 1. MACRO %DROP PROGRAM

/*************************************************************************
DESCRIPTION: To drop unwanted variables for the specified data sets
PARAMETERS:
  inlib   Specify the library of the source data sets
  data    Specify data sets to be processed
  outlib  Specify the library of the processed data sets
  drop    Specify the variables to be dropped
**************************************************************************/
%macro drop(inlib=, data=, outlib=, drop=);
  %local data inlib sasfile files filename drop vars j;

  %if %length(&inlib)=0 %then %do;
    %let inlib=work;
    %put CAUTION: The INLIB parameter was not defined ***;
    %put NOTE: The WORK library has been used as default ****;
  %end;

  %*** make the program not to be case-sensitive ***;
  %*** (librefs are stored in upper case in the dictionary tables) ***;
  %let inlib=%upcase(&inlib);
  %if %length(&data)>0 %then
    %let data=%upcase(%sysfunc(tranwrd(%sysfunc(compbl(&data)),%str( ),%str(" "))));
  %let drop=%upcase(%sysfunc(compbl(&drop)));

  %*** create a macro variable containing the data set names ***;
  proc sql NOPRINT;
    select trim(libname)||'.'||memname into :files separated by ' '
      from dictionary.tables
      where libname eq "&inlib"
      %if %length(&data) gt 0 %then %str(and memname in ("&data"));;
  quit;

  %let j=1;
  %let sasfile=%scan(&files,&j,%str( ));
  %do %while (%length(&sasfile) gt 0);
    %let filename=%lowcase(%scan(&sasfile,2));

    %*** get the variables to be dropped from the input data set ***;
    data _data;
      dsid=open("&sasfile", 'i');
      do n=1 to attrn(dsid, 'nvars');
        name=upcase(varname(dsid, n));
        if indexw("&drop", name) >0 then output;
      end;
      dsid=close(dsid);
    run;

    %*** reset the list so that names from a previous data set are not reused ***;
    %let vars=;
    proc sql NOPRINT;
      select name into :vars separated by ' '
        from _data;
    quit;

    data &outlib..&filename;
      set &sasfile;
      %if %length(&vars) gt 0 %then %str(drop &vars;);
    run;

    %let j=%eval(&j+1);
    %let sasfile=%scan(&files,&j,%str( ));
  %end;
%mend drop;

APPENDIX 2. MACRO %SAMPLE PROGRAM

/*************************************************************************
DESCRIPTION: To randomly sample observations from a given data set
PARAMETERS:
  dsin  - Specify the input data set name.
  dsout - Specify the output data set name.
  prob  - Specify the probability of selecting each observation.
  seed  - Optional. By default, seed=0.
**************************************************************************/
%macro sample(dsin=,dsout=,prob=,var=,seed=);
  %if %length(&seed)=0 %then %let seed=0;

  *** determine how many observations needed ***;
  %local _tobs _obsneed;
  %let _tobs=%nobs(&dsin);
  %let _obsneed=%sysfunc(round(&_tobs * &prob, 1));

  data &dsout (drop=_tobs _obsneed);
    retain _obsneed &_obsneed _tobs &_tobs;
    set &dsin;
    if ranuni(&seed) <= _obsneed/_tobs then do;
      output;
      _obsneed = _obsneed - 1;
    end;
    _tobs = _tobs - 1;
    if _obsneed=0 then stop;
  run;
%mend sample;

APPENDIX 3. VSN.SAS PROGRAM

/*****************************************************************************
Program:     vsn.sas
Author:      %datadump macro
Created:     23MAY2005
Description: Create a listing for VS data set with non-critical variables
*****************************************************************************/
options number pageno=1;

proc print data=subset.vs noobs label;
  by siteid usubjid visit;
  where usubjid=&usubjid and visit=&visit;
  var NOTDONE TEMP TEMPU RHR RR SSBP SDBP;
  label NOTDONE ="Vital Not Done"
        TEMP    ="Temperature"
        TEMPU   ="Temperature Units"
        RHR     ="Resting Heart Rate (bpm)"
        RR      ="Respiratory Rate (bpm)"
        SSBP    ="Sitting Systolic Blood Pressure (mmHg)"
        SDBP    ="Sitting Diastolic Blood Pressure (mmHg)"
        ;
  title1 "Study 123 DATABASE QUALITY REVIEW (Produced at &systime:&sysdate9)";
  title2 "Tabulation of VS (Vital Signs) With Non-Critical Variables";
run;
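
The %sample macro in Appendix 2 calls a %nobs macro that is not listed in this paper. The following is a minimal sketch of such a helper and is hypothetical, not the authors' version. It assumes the helper only needs to return the number of observations of a data set as macro text.

```sas
/* Hypothetical helper, not part of the original paper: returns the number of
   non-deleted observations in data set &ds as macro text. */
%macro nobs(ds);
  %local dsid n rc;
  %let n=0;
  %let dsid=%sysfunc(open(&ds));
  %if &dsid > 0 %then %do;
    /* NLOBS = number of logical (non-deleted) observations */
    %let n=%sysfunc(attrn(&dsid, nlobs));
    %let rc=%sysfunc(close(&dsid));
  %end;
  %else %put WARNING: could not open &ds;
  &n
%mend nobs;
```

With this definition, a statement such as %let _tobs=%nobs(audit.vs); would store the observation count of AUDIT.VS in &_tobs. For SAS views the NLOBS attribute may be returned as -1, so the sketch is intended for ordinary data sets such as those in the AUDIT library.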