Paper DM02
Prepare Data Listings with SAS® Software for Database Audit
Jay Zhou and Larry Shen, Amylin Pharmaceuticals, Inc., San Diego, CA
ABSTRACT
ICH-GCP emphasizes the importance of a clinical trial database audit. Auditing an entire database,
especially in a large clinical trial, could be daunting, as it involves great effort and resources. It is also
unnecessary from the perspective of statistical quality control. Efficiency can be gained by classifying
variables as either critical or non-critical: 100% of the critical variable data are audited, whereas only a
subset of the non-critical variable data is audited. This paper presents techniques for separating
critical from non-critical variable data, selecting a representative sample from the non-critical variable
data, and preparing data listings for an audit.
INTRODUCTION
A database audit is a key step in assuring database quality in clinical trials. The audit is performed as a
comparison of SAS data listings generated with the raw data against the original copy of the Case Report
Forms (CRFs). Based on the importance of the clinical data, variables can be divided into critical and
non-critical ones. For the critical variables, such as birth date, all adverse events data, dosing date and
time of study medication, visit dates, termination date, and key efficacy variables, 100% of the data points
need to be audited. But for the non-critical variables, it is necessary to randomly select a smaller subset
of the data that represents the original database in order to save time and resources. For some of the
data sets there may be a few variables that are critical and other variables that are non-critical. In that
case, for easy handling, the data sets will be split and treated as two data sets: one for the critical
variables and the other for the non-critical, although all these variables will be audited together. Once the
data audit is completed, error rates are calculated separately for the critical and non-critical variables.
To facilitate the audit process and to conform to the principle of statistical quality control, the data listings
generated for audit should meet the following requirements: 1) the final data listings should be reported in
a way that mirrors the order and set-up of the actual CRFs; 2) the random sampling unit should be as
small as possible to ensure that the sample is a good representation of the database and that the error
rate and its confidence interval have statistical validity.
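As a hedged illustration of this second requirement (not part of the audit plan itself), the error rate for a
group of audited data points can be estimated as the number of erroneous data points divided by the
number of data points audited, with an approximate 95% confidence interval from the normal
approximation to the binomial. The counts in the sketch below are hypothetical:

   *** Hypothetical counts: estimate the error rate and an approximate  ***;
   *** 95% confidence interval (normal approximation to the binomial).  ***;
   data error_rate;
      errors  = 12;                            * erroneous data points found (assumed) *;
      audited = 2000;                          * data points audited (assumed)         *;
      rate    = errors / audited;              * estimated error rate                  *;
      se      = sqrt(rate*(1 - rate)/audited);
      lower   = max(0, rate - 1.96*se);        * lower 95% confidence limit            *;
      upper   = min(1, rate + 1.96*se);        * upper 95% confidence limit            *;
   run;

   proc print data=error_rate noobs;
   run;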
Once the data points are selected, to satisfy the first requirement, the listings will be generated in such a
way that each page contains data per CRF module arranged by visit for each subject. This will allow QA
audit personnel to conduct the data audit by subject, following the CRF flow in a time-sequential manner.
This means that the critical and non-critical variables will be presented in the same listing for the same
subject but on different pages, even though the data are collected from the same CRF page, so that the
auditors can calculate the error rates for the non-critical variables. As for the second
requirement, the sampling unit is chosen to be the lines of observations in a data set (to be fully described
later).
To meet these challenging needs, the programming steps include (1) separating the non-critical variables,
(2) calculating the sampling probability for the data points to be randomly selected, (3) randomly sampling
observations for the non-critical variables, and (4) generating the data listings with the critical and
non-critical variables together. Preparing such a sophisticated listing per subject by visit can be challenging
and time-consuming. The goal of this paper is to provide a solution, built on the techniques that execute
these steps, to speed up the production of the data audit listings.
SEPARATING NON-CRITICAL VARIABLES
Since the random sampling only pertains to the non-critical data points, the non-critical variables must be
determined by the database audit plan. In this paper we assume that the overall number of non-critical
data points to be audited has been determined by a valid statistical method in the audit plan. To determine
the number of data points to be randomly selected from each data set for the non-critical variables, so that
the overall number of data points selected meets the requirement, it is necessary to create a data library
restricted to data sets containing the non-critical variables plus a few identifier variables (e.g., subject
identifier and visit number). Depending on the number of non-critical variables in each data set, the new
data set can be easily created with a KEEP/DROP statement or a KEEP=/DROP= data set option.
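For a single data set, this is simply a DATA step with a DROP= (or KEEP=) data set option. The sketch
below is illustrative only; it borrows the RAW and AUDIT librefs and the VS variable names from the macro
call shown below:

   *** Illustrative sketch: restrict one data set to its identifiers and ***;
   *** non-critical variables by dropping critical and non-CRF variables ***;
   data audit.vs;
      set raw.vs(drop=studyid domain weight wtu);
   run;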
Because there are many data sets in each clinical trial, it is more efficient to create a SAS macro that
processes the entire data library with a single call. Here is an example of a macro call to
prepare the data sets for the non-critical variables:
%drop(inlib=raw, outlib=audit,
data=comment conmed disp ecg incexc lab pe vs,
drop=studyid domain cmclas cmdecod cmclascd weight wtu dsstdtc dsstdy egdy pedy)
The inlib= parameter specifies the libref of the original data sets, while the outlib= defines the libref
of the data sets with the non-critical variables. Because not all the data sets in the library need to be
processed (e.g., there may not be non-critical variables in the data sets for adverse events and subject
dispositions), it is necessary to specify the data sets that contain the non-critical variables with the data=
parameter; if this parameter is not defined, the macro will process the entire library. As its name implies,
the drop= parameter drops the variables that should not be included in the output data sets. Variables
such as study identifiers (e.g., studyid, domain), non-CRF data (e.g., the dictionary terms cmclas,
cmdecod, cmclascd), derived numeric timing variables (e.g., dsstdy, egdy, pedy) that represent the
same data points as the character timing variables, and critical variables (e.g., dsstdtc in the DISP data
set, weight and wtu in the VS data set) should be dropped. The
macro %drop program is included in Appendix 1.
CALCULATING THE PROBABILITY OF RANDOM SAMPLE
The purpose of the random sampling is to ensure that the data points selected to be audited will be a
‘true’ representation of the underlying database. Typically, the database audit is conducted by randomly
sampling a certain number of subjects from the database. If a subject is chosen, all the observations for
that subject are chosen, which means that the individual observations are not themselves randomly
sampled. With the method presented in this paper, the sample will be randomly drawn from observations
that each have an equal opportunity of being selected. Within each data set the columns represent
different data variables and rows represent the observations (records) of data for the subjects across
different visits. The proportions of the rows to be sampled are determined to ensure that the total number
of selected data points will meet the required sample size specified in the database audit plan.
Once the data sets with the non-critical variables are prepared, the next step is to determine the total
number of the non-critical data points for the entire database. To do that, one must obtain the total
number of non-critical variables and the total number of observations from each data set. This can easily
be done using PROC SQL with the data set metadata from the dictionary tables, COLUMNS and TABLES (or
the corresponding views, VCOLUMN and VTABLE, in the SASHELP library):
%let size=2000;   *** sample size of non-critical data points ***;

proc sql noprint;
   create table tempdata as
   select distinct(a.memname), b.nobs,
          count(a.name) as nvar label='Number of Variables',
          b.nobs*calculated nvar as points
             label='Total Number of Data Points per Data set'               /* (1) */
   from dictionary.columns as a, dictionary.tables as b
   where a.libname="AUDIT" and b.libname="AUDIT" and a.memname=b.memname and
         upcase(a.name) not in ('USUBJID','SITEID','VISIT')                 /* (2) */
   group by a.memname;

   select &size/sum(points), sum(points) into :prob, :total                 /* (3) */
   from tempdata;

   create table tempdata as
   select *, round(nobs*&prob, 1) as obs
             label='Number of Observations per Data set Randomly Selected',
          calculated obs*nvar as numbers
             label='Number of Data Points per Data set Selected'            /* (4) */
   from tempdata;

   select memname into :memname                                             /* (5) */
   separated by '|'
   from tempdata;
quit;

ods listing close;
ods rtf file="c:\temp\DataPointsAudited.rtf";
title "Total data points for the non-critical variables in Study &study are &total";
title2 "The probability of sampling &size data points for audit is &prob";
proc print data=tempdata noobs label;                                       /* (6) */
   var memname nobs nvar points obs numbers;
   label memname='Data set Name'
         nobs='Number of Observations per Data set';
run;
ods rtf close;
ods listing;
(1) The DISTINCT function is used to keep only one record for each data set in the memname column. The
COUNT function creates a new column, nvar, for the total number of non-critical variables, counted from
the name column of the COLUMNS table for each data set. Because the nvar column is newly calculated,
the CALCULATED keyword is necessary to let you use its result in the same SELECT clause to obtain the
total number of data points for each data set.
(2) Because the variables usubjid, siteid, and visit are not non-critical variables, they are excluded
from the calculation so that the total number of non-critical data points is accurate.
(3) This SELECT clause creates two macro variables, &prob and &total. &prob is the sampling probability,
calculated as the sample size (2000 data points) divided by the total number of non-critical data points; it
is used in the next step to select the data points from each data set. The &total macro variable is used in
the title of the sampling report to give the auditor an idea of the total number of non-critical data points in
the database.
(4) This SELECT clause adds two variables, obs and numbers, to the TEMPDATA data set for later reporting
in (6).
(5) The macro variable &memname created by this SELECT clause contains all the data set names in the
AUDIT library, separated by the '|' character. It is used later in the %_select macro to automate sampling
from each data set.
(6) This step generates a table (see Table 1), captured by ODS in RTF format, for the auditor's reference;
it helps the auditor calculate the error rate for each data set.
Table 1. Total data points for the non-critical variables in Study 123 are 123197.
The probability of sampling 2000 data points for audit is 0.016234.

Data set    Number of        Number of    Total Number of    Number of Observations    Number of Data
Name        Observations     Variables    Data Points per    per Data set Randomly     Points per Data
            per Data set                  Data set           Selected                  set Selected
COMMENT         2464              6            14784                   40                   240
CONMED          2007             13            26091                   33                   429
DISP             826              5             4130                   13                    65
ECG              192             21             4032                    3                    63
INCEXC           190              9             1710                    3                    27
LAB             7022              7            49154                  114                   798
PE              2243              7            15701                   36                   252
VS              1085              7             7595                   18                   126
RANDOMLY SAMPLING OBSERVATIONS
A random sample is any collection of N observations selected from a population in such a way that each
possible sample has the same chance of being chosen. There are several techniques available for
drawing random samples. The method used in this paper is a sequential method of sampling without
replacement, under which each observation has only one opportunity to be selected. Because of
the sequential nature, the decision to consider an observation as a sample must be made at the time the
observation is processed. You cannot go back and reconsider an observation that has already been
passed. Since the sampling is done within each data set that contains non-critical variables, the sampling
procedure can be considered stratified sampling, where each data set serves as a stratum and the
resulting total sample is called a stratified sample (Mittag & Rinne 1993). In addition, because the minimum
sampling unit is an observation line with multiple data values corresponding to different non-critical
variables, the sampling procedure can also be considered cluster sampling, where each observation line is
a cluster.
Because the data reside in different data sets, the sampling is done within each data set that contains
non-critical variables. The strategy in the selection involves modifying the probability of selecting an
observation based upon how many sampled observations (k) are desired and how many total
observations (n) are available in the data set. Initially, k for a specific data set is the product of the total
number of observations in the data set and the probability described in the previous section. An
observation is selected if the random number generated by the RANUNI function is less than or equal to
the ratio k/n (the current selection probability). The SAS code for the exact-sample-size, changing-probability
method is found in SAS Language and Procedures: Usage 2, p. 235:
data SAMPLE (drop=k n);
   retain k 100 n;                  * k = number of observations still to select  *;
   if _n_=1 then n=total;           * n = number of observations still to process *;
   set POP nobs=total;
   if ranuni(0)<=k/n then do;       * select with the current probability k/n     *;
      output;
      k=k-1;
   end;
   n=n-1;
   if k=0 then stop;                * stop once the sample is complete            *;
run;
The seed value used to initialize the random number sequence for the RANUNI function is not critical. In
the example above, the seed value is zero (RANUNI(0)), in which case SAS uses the system clock as the
seed, so rerunning the code produces a different sample on each execution. If you want to control the
initialization, you can use a fixed positive integer as the seed, which guarantees that the sample selected
from the same data set is the same on every run. This is important when you need to rerun the program
and reproduce the same samples. For this purpose, a macro, %_select, is created:
%macro _select;
   %let i=1;                                                                   /* (1) */
   %let _data=%scan(&memname, &i, |);                                          /* (2) */
   %do %while (%length(&_data) gt 0);                                          /* (3) */
      %sample(dsin=audit.&_data, dsout=subset.&_data, prob=&prob, seed=&i);    /* (4) */
      %let i=%eval(&i + 1);                                                    /* (5) */
      %let _data=%scan(&memname, &i, |);                                       /* (6) */
   %end;
%mend _select;
(1) To loop through each of the data sets listed in &memname, the iteration variable is initialized to 1,
which references the first data set.
(2) The &memname macro variable, created with the PROC SQL SELECT clause in the previous step,
contains all the data set names. Since the delimiter for the %SCAN function is specified as the '|'
character, the macro variable &_data is initialized by scanning &memname for the characters preceding
the first '|'.
(3) A %DO %WHILE loop is started to process each data set until all the data sets have been iterated
through.
(4) The macro %sample (see Appendix 2 for detail) is adapted from the exact-sample-size,
changing-probability code shown previously. The input population data set comes from the AUDIT data
library, and the output sample data set resides in the SUBSET data library. The &prob macro variable, a
constant, is the sampling probability calculated with the PROC SQL SELECT clause in the previous
section. Because every data set is treated as equally important, no weight is applied to the probability for
any particular data set. The &i macro variable is used as the seed value so that the initialization differs
from data set to data set as &i changes, but the seed is fixed for a given data set, which allows you to
repeat the process and obtain the same results.
(5) After the first data set is sampled, the iteration variable is incremented to 2.
(6) The &_data variable takes the second data set name, and control returns to the %DO %WHILE
statement. The loop continues in this manner until &_data has cycled through all the data set names.
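To tie the pieces together, here is a minimal sketch of how the sampling step might be driven. The
directory paths are placeholders rather than values from this paper, and the sketch assumes the PROC
SQL step in the previous section has already created the &prob and &memname macro variables:

   *** Hypothetical librefs: point them at directories of your choosing ***;
   libname audit  "c:\temp\audit";    * data sets restricted to the non-critical variables *;
   libname subset "c:\temp\subset";   * the sampled observations are written here          *;

   *** Draw the random samples for every data set listed in &memname ***;
   %_select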
GENERATING THE DATA LISTINGS
The main challenge of reporting is to present the data by subject, visit, and data module, with both critical
and non-critical variables, in a way that best mirrors the flow of the CRFs. To simplify the process, it is
much easier to report each data set independently but assemble the output into the same listing for each
subject. With this approach, the reporting programs for all the data sets can be easily generated with a
process similar to that described by Morrill, Wiser, and Zhou (2002). To make the report correspond to
the CRF pages for each subject, a WHERE statement with usubjid=&usubjid and visit=&visit (if
applicable) must be embedded in each program to subset the data, and the programs must be arranged
in order. Since only one subject's data will be reported at each visit by each program (data set), each
page needs to contain a large number of variables but few observations. This makes PROC PRINT a
better choice than PROC REPORT. If PROC PRINT cannot fit all the variables on a single line, it splits the
observations into two or more sections and prints the observation number or the ID variables at the
beginning of each line. This requires fewer pages than PROC REPORT. To help auditors understand the
data on the listings when comparing them with the CRFs, it is desirable to use the LABEL option to
display variable labels as column headings.
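As a brief, hedged illustration of that splitting behavior, the sketch below (reusing the VS data set and
variables from Appendix 3) adds an ID statement so that the subject identifier, rather than the observation
number, begins each line of every wrapped section:

   *** Illustrative sketch: with an ID statement, PROC PRINT repeats usubjid ***;
   *** at the start of each line when the variables wrap into sections       ***;
   proc print data=subset.vs label;
      id usubjid;
      var visit notdone temp tempu rhr rr ssbp sdbp;
   run;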
The sample code below illustrates how all the programs are included and arranged to correspond to the
order of the CRF pages in time sequence. The PROC SQL procedure creates a macro variable,
called &_usubjid, which contains all the unique subject numbers from the DM data set separated by the
‘|’ character and will be used in the macro %report later. When that macro is invoked, the &usubjid
macro variable with the individual subject number is created by the %let statement with the %SCAN
function from &_usubjid to subset the data by subject for each data set used in the program. For
instance, vsn.sas for the vital sign data set with the non-critical variables is invoked by the %include
statement. The vsn.sas program is included in Appendix 3.
proc sql noprint;
select usubjid into :_usubjid
separated by '|'
from dm;
quit;
%macro report;
%let prgpath=c:\temp\pgm;
%let i=1;
%let usubjid=%scan(&_usubjid,&i, |);
%do %while (%length(&usubjid) gt 0);
ods listing close;
ods rtf file="c:\temp\out\P&usubjid..rtf";
*** Visit One ***;
%let visit=1;
%include "&prgpath\incexc.sas";  * For Inclusion/Exclusion Criteria *;
%include "&prgpath\dm.sas";      * For Demographics *;
%include "&prgpath\vsc.sas";     * For Critical variables in Vital Signs *;
%include "&prgpath\vsn.sas";     * For Non-critical variables in Vital Signs *;
%include "&prgpath\pe.sas";      * For Physical Examination *;
*** Visit Two ***;
%let visit=2;
%include "&prgpath\vsc.sas"; * For Critical variables in Vital Signs *;
%include "&prgpath\vsn.sas"; * For Non-critical variables in Vital Signs *;
... * Include other programs for other data modules at Visit Two *;
*** Other Visits ***;
%let visit=3;
... * Include programs for other visits *;
*** Visit Termination ***;
%include "&prgpath\vsc.sas"; * For Critical variables in Vital Signs *;
%include "&prgpath\vsn.sas"; * For Non-critical variables in Vital Signs *;
%include "&prgpath\ae.sas"; * For Adverse Events *;
%include "&prgpath\conmed.sas"; * For Concomitant Medication *;
%include "&prgpath\comment.sas"; * For Comments *;
%include "&prgpath\disp.sas";    * For Disposition *;
ods rtf close;
ods listing;
%let i=%eval(&i + 1);
%let usubjid=%scan(&_usubjid,&i, |);
%end;
%mend report;
%report;
CONCLUSION
Database auditing is an important step in ensuring database quality, but it is time-consuming, especially
when more data points than necessary are audited against unfriendly data listings. This paper presents a
solution that randomly selects the observations for the non-critical variables and reports both critical and
non-critical variables in the same listing per subject, in an order that mirrors the flow of the CRF pages.
The solution not only facilitates the audit process, saving time and resources, but also keeps the audit
statistically sound.
REFERENCES
Morrill, J., Wiser, K., and Zhou, J. (2002), "A Data-Driven Macro Automating the Data Presentation
   Process by Generating Tailored, Customizable SAS Code - Relax, Let %TABGEN Do Your Work!"
   Proceedings of the Annual Conference of the Pharmaceutical Industry SAS Users Group, 2002,
   pp. 43-47.
Mittag, H.-J. and Rinne, H. (1993), Statistical Methods of Quality Assurance, Chapman & Hall.
SAS Institute Inc. (2000), SAS Language and Procedures: Usage 2, Version 6, First Edition, p. 235.
ACKNOWLEDGMENTS
The authors would like to thank Nuwan Nanayakkara and David Brown for reviewing this manuscript and
providing valuable comments.
SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries.
® indicates USA registration. Other brand and product names are registered trademarks or trademarks
of their respective companies.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Jay Zhou
Amylin Pharmaceuticals, Inc.
9360 Towne Centre Drive
San Diego, CA 92121
Email: [email protected]
APPENDIX 1. MACRO %DROP PROGRAM
/*************************************************************************
DESCRIPTION: To drop unwanted variables for the specified data sets
PARAMETERS:
inlib Specify the library of the source data sets
data Specify data sets to be processed
outlib Specify the library of the processed data sets
drop Specify the variables to be dropped
**************************************************************************/
%macro drop(inlib=, data=, outlib=, drop=);
   %local data inlib sasfile files filename drop vars j;
   %if %length(&inlib)=0 %then %do;
      %let inlib=work;
      %put CAUTION: The INLIB parameter was not defined ***;
      %put NOTE: The WORK library has been used as the default ****;
   %end;
   %*** make the macro case-insensitive ***;
   %let inlib=%upcase(&inlib);
   %if %length(&data)>0 %then
      %let data=%upcase(%sysfunc(tranwrd(%sysfunc(compbl(&data)),%str( ),%str(" "))));
   %let drop=%upcase(%sysfunc(compbl(&drop)));
   %*** create a macro variable containing the data set names ***;
   proc sql noprint;
      select trim(libname)||'.'||memname into :files
      separated by ' '
      from dictionary.tables
      where libname eq "&inlib" %if %length(&data) gt 0 %then
            %str(and memname in ("&data"));;
   quit;
   %let j=1;
   %let sasfile=%scan(&files,&j,%str( ));
   %do %while (%length(&sasfile) gt 0);
      %let filename=%lowcase(%scan(&sasfile,2));
      %*** get the variables to be dropped from the input data set ***;
      data _data;
         dsid=open("&sasfile", 'i');
         do n=1 to attrn(dsid, 'nvars');
            name=upcase(varname(dsid, n));
            if indexw("&drop", name) > 0 then output;
         end;
         dsid=close(dsid);
      run;
      %let vars=;  %*** reset so a data set with nothing to drop is handled correctly ***;
      proc sql noprint;
         select name into :vars
         separated by ' '
         from _data;
      quit;
      data &outlib..&filename;
         set &sasfile;
         %if %length(&vars) gt 0 %then %str(drop &vars;);
      run;
      %let j=%eval(&j+1);
      %let sasfile=%scan(&files,&j,%str( ));
   %end;
%mend drop;
APPENDIX 2. MACRO %SAMPLE PROGRAM
/*************************************************************************
DESCRIPTION: To randomly sample observations from a given data set
PARAMETERS:
   dsin  - Specify the input data set name.
   dsout - Specify the output data set name.
   prob  - Specify the probability of selecting each observation.
   seed  - Optional. By default, seed=0 (the system clock is used).
**************************************************************************/
%macro sample(dsin=, dsout=, prob=, var=, seed=);
   %if %length(&seed)=0 %then %let seed=0;
   %*** determine how many observations are needed (%nobs is a utility ***;
   %*** macro, not shown here, that returns the number of observations ***;
   %*** in a data set)                                                  ***;
   %local _tobs _obsneed;
   %let _tobs=%nobs(&dsin);
   %let _obsneed=%sysfunc(round(&_tobs * &prob, 1));
   data &dsout (drop=_tobs _obsneed);
      retain _obsneed &_obsneed _tobs &_tobs;
      set &dsin;
      if ranuni(&seed) <= _obsneed/_tobs then do;   * changing probability k/n *;
         output;
         _obsneed = _obsneed - 1;
      end;
      _tobs = _tobs - 1;
      if _obsneed=0 then stop;
   run;
%mend sample;
APPENDIX 3. VSN.SAS PROGRAM
/*****************************************************************************
Program:     vsn.sas
Author:      %datadump macro
Created:     23MAY2005
Description: Create a listing for the VS data set with non-critical variables
*****************************************************************************/
options number pageno=1;
proc print data=subset.vs noobs label;
   by siteid usubjid visit;
   where usubjid=&usubjid and visit=&visit;
   var NOTDONE TEMP TEMPU RHR RR SSBP SDBP;
   label NOTDONE = "Vital Not Done"
         TEMP    = "Temperature"
         TEMPU   = "Temperature Units"
         RHR     = "Resting Heart Rate (bpm)"
         RR      = "Respiratory Rate (bpm)"
         SSBP    = "Sitting Systolic Blood Pressure (mmHg)"
         SDBP    = "Sitting Diastolic Blood Pressure (mmHg)";
   title1 "Study 123 DATABASE QUALITY REVIEW (Produced at &systime on &sysdate9)";
   title2 "Tabulation of VS (Vital Signs) With Non-Critical Variables";
run;