SIMPLIFYING NDA PROGRAMMING WITH PROC SQL

Aileen L. Yam, Besselaar Associates, Princeton, NJ

ABSTRACT

The programming of New Drug Application (NDA) Integrated Summary of Safety (ISS) usually involves obtaining patient counts, percentages, and other summary statistics such as mean, standard deviation and range. This paper shows how to obtain all of these results with the SQL procedure. While PROC SQL is often perceived as a data retrieval tool, its unique features allow programmers to write compact code to obtain data summaries for any application similar to the NDA ISS or the safety summary tables in individual new drug studies. This paper also shows that several DATA or other PROC steps can be reduced to one or two steps with PROC SQL.

OVERVIEW

At the end of this paper are three tables that represent the types of most commonly presented summary statistics in safety tables in pharmaceutical research. The data in those tables are fictitious and for illustration purposes only. The three types of summary tables are:

1. counts, percentages, mean, standard deviation, range and missing value frequencies of demographic data;
2. counts and percentages of adverse events by body system;
3. counts and percentages of adverse events by body system and COSTART term.

This paper shows that the summary statistics of each of these three types of tables can be obtained entirely within one or two PROC SQL steps. The unique features of PROC SQL make it possible to reduce many DATA or other PROC steps in summing, grouping, sorting, selecting first occurrences of each subgroup, merging, concatenating, conditional processing, and calculating percentages, mean, standard deviation, range and missing values. Such unique features are boldfaced in the following programs. Repeated uses of the same features in a subsequent program are not boldfaced or discussed again. Variable names, data set names, macro variable names and macro variable references from the programs are italicized in the discussion.

The intention of this paper is not to advocate PROC SQL over DATA steps or other procedures, and there are no benchmark statistics to compare their performance differences. The objective, however, is to present the SQL procedure as a valuable alternative for summarizing data with fewer steps.

TOTAL PATIENT COUNTS

The following SQL procedure obtains total patient counts for drug 1, drug 2, and all drug groups (drug 1 and drug 2 combined, assigned as drug 3 for report writing purposes). Since total patient counts appear in all three summary tables, the counts are calculated once and saved in a permanent data set called totpat.

    %let numdrug=2;

    proc sql;
       create table perm.totpat as
       select drug,
              count(distinct patient) as totn
       from raw.data
       group by drug
       union
       select %eval(&numdrug+1) as drug,
              count(distinct patient) as totn
       from raw.data;
    quit;

The DISTINCT keyword eliminates duplicate rows before counting. The GROUP BY clause is used to classify patient counts into drug groups. The UNION operator combines two queries, putting the result from the first query on top of the result from the second query. The AS keyword names the variable that receives the values.

Assume that there are two drug groups, 1 and 2. The first SELECT statement counts the number of nonduplicating patients in each drug group. The second SELECT statement counts the number of nonduplicating patients without grouping patients by drug. Notice that the variable drug is given a value of 3 in the second SELECT statement. Since PROC SQL allows the selection of a literal numeric value or a character string for a variable, any arbitrary value can be assigned. The number of drug groups is set to 2 in the macro variable reference, &numdrug; therefore, the two drug groups combined are assigned as 3, that is, the number of drug groups, &numdrug, plus one. The results from the two SELECT statements are concatenated into a permanent data set, totpat. Totpat consists of patient counts in drug 1, drug 2, as well as in drug 1 and drug 2 combined. The macro variable reference, &numdrug, can be adjusted according to the number of drug groups in a study.

Several steps are saved. The data do not need to be sorted by drug group and patient. There is no need to set the data by the sorted variables into a DATA step to get the first observation of each patient for a count of nonduplicating patients. There is no need to count the patients in two steps, one with by drug group and the other without by drug group. The resulting counts will not need to be passed into another DATA step to be concatenated together, or to be reorganized by _TYPE_ if a PROC step for summary statistics is used.

DEMOGRAPHIC TABLE

The demographic table consists of two parts, so two SQL procedures are written. The first SQL procedure generates counts (cnt) and percentages (pct) of gender and race groups in Table 1.

    %macro xx(outds=,var=);
    proc sql;
       create table &outds as
       select *,
              round(cnt/
                    case sum(cnt)
                       when 0 then .
                       else sum(cnt)
                    end *100) as pct
       from (select drug, &var,
                    count(distinct patient) as cnt
             from raw.data
             group by drug, &var)
       group by drug
       union
       select *,
              round(cnt/
                    case sum(cnt)
                       when 0 then .
                       else sum(cnt)
                    end *100) as pct
       from (select %eval(&numdrug+1) as drug, &var,
                    count(distinct patient) as cnt
             from raw.data
             group by &var)
       order by 1, 2;
    quit;
    %mend xx;

    %xx(outds=gencnt,var=gender);
    %xx(outds=racecnt,var=race);

In the macro calls to xx, there are two major queries joined by the UNION operator. In each of these queries, a subquery is used by nesting the second SELECT statement within the first SELECT statement. The CASE expression is used to perform conditional processing. The SUM function is used to calculate the grand total for the denominator. The ORDER BY clause sorts the results by the order-by items in a default sequence, from the lowest value to the highest value.

An asterisk (*) after the SELECT statement in the outer query indicates that all the values, drug, &var and cnt, returned by the second SELECT statement are used. In the second SELECT statement, the number (cnt) of nonduplicating patients is counted by drug group and by the macro variable reference, &var, when it is resolved. Percentages (pct) are calculated in the outer query using cnt as numerator and the SUM of cnt as denominator. The CASE expression is used to prevent an error message when the denominator is zero. WHEN the SUM of cnt is zero, THEN it is set to missing, ELSE the SUM of cnt is the denominator. Similar calculations are done after the UNION operator without grouping patients by drug. Thus, the counts (cnt) and percentages (pct) of gender and race for the two drug groups separately and combined are obtained.

The results are ordered by the values in the first and second columns, as indicated by 1 and 2 in the ORDER BY clause. The first column is the first variable specified in the SELECT statement, and the first variable is drug. Similarly, the second column refers to the second variable in the SELECT statement, and the second variable is a macro variable that varies depending on the values supplied in the macro calls. In other words, the results are ordered by drug and gender in the first macro call, and by drug and race in the second macro call.

The second SQL procedure generates mean, standard deviation, range and number of missing values of age, weight and height in Table 1.

    %macro yy(outds=,var=);
    proc sql;
       create table &outds as
       select drug,
              "&var" as var,
              mean(&var) as mean,
              std(&var) as std,
              min(&var) as min,
              max(&var) as max,
              nmiss(&var) as nmiss
       from raw.data
       group by drug
       union
       select %eval(&numdrug+1) as drug,
              "&var" as var,
              mean(&var) as mean,
              std(&var) as std,
              min(&var) as min,
              max(&var) as max,
              nmiss(&var) as nmiss
       from raw.data;
    quit;
    %mend yy;

    %yy(outds=agestat,var=age);
    %yy(outds=wtstat,var=weight);
    %yy(outds=htstat,var=height);

In the macro calls to yy, the functions MEAN, STD, MIN, MAX and NMISS are used to calculate summary statistics. The character string resolved from the macro variable reference, &var, is used to associate each variable in the macro call with its corresponding summary statistics. All the summary statistics for the two drug groups separately and combined are calculated and concatenated within one SQL procedure.

Besides the steps mentioned under the Total Patient Counts section, two additional steps are saved. One is not having to pass the patient counts into a DATA step for calculating percentages. The other is not having to sort the result table in ascending order.

ADVERSE EVENTS TABLES

The summary statistics for the two adverse events tables, Table 2 and Table 3, can be obtained by calling the same single SQL procedure below.

    %macro zz(inds1=,inds2=,outds=,var=,selectif=,
              sortord=);
    proc sql;
       create table &outds (drop=totn) as
       select distinct *,
              round(cnt/
                    case totn
                       when 0 then .
                       else totn
                    end *100) as pct,
              1 as seq
       from (select &inds1..drug,
                    count(distinct patient) as cnt,
                    &inds2..totn
             from raw.&inds1 left join perm.&inds2
             on &inds1..drug=&inds2..drug
             where &selectif
             group by &inds1..drug)
       outer union corresponding
       select distinct *,
              round(cnt/
                    case totn
                       when 0 then .
                       else totn
                    end *100) as pct,
              2 as seq
       from (select &inds1..drug, &var,
                    count(distinct patient) as cnt,
                    &inds2..totn
             from raw.&inds1 left join perm.&inds2
             on &inds1..drug=&inds2..drug
             where &selectif
             group by &inds1..drug, &var)
       order by seq, drug, &sortord;
    quit;
    %mend zz;

    %zz(inds1=ae,inds2=totpat,outds=aebcnt,
        var=body,selectif=%str(period=2),
        sortord=cnt desc);
    %zz(inds1=ae,inds2=totpat,outds=aebccnt,
        var=%str(body,costart),selectif=%str(period=2),
        sortord=%str(body, cnt desc));

The first macro call to zz groups adverse events by body system. The second macro call to zz groups adverse events by body system and COSTART term.

The DISTINCT keyword eliminates duplicate rows of data. The LEFT JOIN operator retrieves matching rows and nonmatching rows based on the data specified on the left (raw.&inds1). The ON clause specifies the columns for matching rows in two data sets to be joined. The WHERE clause specifies a condition for selecting the data. The OUTER UNION CORRESPONDING operator concatenates results from SELECT statements, similar to using a DATA step with a SET statement. The differences between UNION and OUTER UNION CORRESPONDING are: UNION matches columns in a table expression by ordinal position, keeping the column names in the result table from the first table. OUTER UNION CORRESPONDING, on the other hand, matches columns by column names. In addition, when the OUTER UNION CORRESPONDING operator is used, the non-matching columns are retained in the result table. The DESC keyword sorts the result table in descending order.

Two sets of queries, identified by the variable seq as 1 and 2 for report writing purposes, are concatenated by the OUTER UNION CORRESPONDING operator. The first set of queries counts the total number (cnt) of nonduplicating patients with adverse events, merges the results with the totpat permanent data set by drug group, keeping only the rows from the adverse events counts with the LEFT JOIN operator, and calculates the percentages (pct) of cnt. Only patients from the double-blind period (period=2) are selected in the WHERE clause. The second set of queries performs similar calculations, except that the patient counts (cnt) and percentages (pct) are by body system or by body system and COSTART term.

For the Adverse Events tables, the DISTINCT option is used in two different ways: to count the number of nonduplicating patients for each adverse event category, and to eliminate duplicate rows as a result of the LEFT JOIN.

The DISTINCT option is particularly useful for counting patients with adverse events, because patients with multiple occurrences of the same adverse event are to be counted once only. Among the steps mentioned previously, the most important steps saved here are not having to sort the adverse events data and to set the sorted data to get the first occurrences of adverse events by patient.

The selection of first occurrences of each adverse event, the conditional processing, the sorting, the summing, the calculation of percentages, the concatenation of data sets, the sorting of the result table by seq, drug, body system, and by descending adverse event counts (cnt) can all take place within the same SQL procedure.

SUMMARY

This paper uncovers the potential of PROC SQL as a very useful data summary tool, in addition to being a data retrieval tool. The beauty of PROC SQL lies in the simplicity and resourcefulness of the code. Several steps can be condensed to make one-step programming possible. The tradeoff is that it generally takes more time to write and debug SQL programs, because with PROC SQL, the intermediate results from each step take place internally, and all the query expressions produce a single output table.

The programs in this paper were originally developed for an NDA, but the programming logic and techniques can be used for similar data summaries. (Three sample NDA Integrated Summary of Safety tables follow.)

For additional information, contact:

Aileen L. Yam
Besselaar Associates
210 Carnegie Center
Princeton, NJ 08540-6681
Tel.: (609) 452-4200

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

TABLE 1
SUMMARY OF DEMOGRAPHIC DATA

                                                      All Drug
                           Drug 1       Drug 2        Groups
Total Patients             849          851           1700

Gender
  Male                     429 (51%)    432 (51%)     861 (51%)
  Female                   420 (49%)    419 (49%)     839 (49%)

Race
  White                    467 (55%)    471 (55%)     938 (55%)
  Black                    362 (43%)    371 (44%)     733 (43%)
  Other                     20 (2%)       9 (1%)       29 (2%)

Age
  Mean                     36.2         37.1          36.8
  Standard Deviation       16.3         16.1          16.2
  Range                    17-69        17-69         17-69
  # Missing                1            2             3

Weight (pounds)
  Mean                     155.2        156.1         155.8
  Standard Deviation       10.6         10.9          10.8
  Range                    96-209       98-212        96-212
  # Missing                0            1             1

Height (inches)
  Mean                     64.7         65.6          65.2
  Standard Deviation       8.8          9.0           8.9
  Range                    59-72        60-76         59-76
  # Missing                0            0             0

TABLE 2
NUMBER AND PERCENT OF PATIENTS WITH ADVERSE EVENTS BY BODY SYSTEM

                                       Drug 1       Drug 2
Total Patients                         849          851
Total Patients with Adverse Events     420 (49%)    320 (38%)

BODY AS A WHOLE                        360 (42%)    277 (33%)
DIGESTIVE SYSTEM                       280 (33%)    230 (27%)
SKIN AND APPENDAGES                    200 (24%)    207 (24%)
RESPIRATORY SYSTEM                      39 (5%)      32 (4%)
CARDIOVASCULAR SYSTEM                   30 (4%)      28 (3%)
ENDOCRINE SYSTEM                         9 (1%)       3 (.4%)
NERVOUS SYSTEM                           2 (.2%)      1 (.1%)
etc.

TABLE 3
NUMBER AND PERCENT OF PATIENTS WITH ADVERSE EVENTS BY BODY SYSTEM AND COSTART TERM

                                       Drug 1       Drug 2
Total Patients                         849          851
Total Patients with Adverse Events     420 (49%)    320 (38%)

BODY AS A WHOLE
  HEADACHE                             120 (14%)    110 (13%)
  CHILLS                                70 (8%)      62 (7%)
  FLU SYNDROME                          64 (8%)      52 (6%)
  ALLERGIC REACTION                     52 (6%)      42 (5%)
  INFECTION                             47 (6%)      30 (4%)
  FEVER                                 18 (2%)      12 (1%)
  PAIN                                   3 (.4%)      1 (.1%)
  Subtotal                             360 (42%)    277 (33%)

DIGESTIVE SYSTEM
  DIARRHEA                             120 (14%)    100 (12%)
  NAUSEA                                80 (9%)      70 (8%)
  FLATULENCE                            72 (8%)      34 (4%)
  STOMATITIS                            40 (5%)      30 (4%)
  GASTRITIS                             12 (1%)       8 (1%)
  ESOPHAGITIS                            6 (1%)       3 (.4%)
  CONSTIPATION                           2 (.2%)      1 (.1%)
  Subtotal                             280 (33%)    230 (27%)
etc.
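
As a way to try the total patient count query from the TOTAL PATIENT COUNTS section without access to the study data, the following is a minimal sketch on a small made-up data set. The data set name work.demo, the output name work.totpat_demo, and the patient numbers are hypothetical and are not from the paper; only the query pattern (COUNT(DISTINCT ...), GROUP BY, UNION, and %EVAL on &numdrug) follows the program described in that section.

```sas
/* Hypothetical test data: one row per record, so a patient can  */
/* appear more than once (for example, once per adverse event).  */
data work.demo;
   input drug patient;
   datalines;
1 101
1 101
1 102
2 201
2 202
2 202
2 203
;
run;

%let numdrug=2;

proc sql;
   /* First query: nonduplicating patients within each drug group.  */
   /* Second query: all patients combined, labelled as drug 3       */
   /* (number of drug groups plus one).                             */
   create table work.totpat_demo as
   select drug,
          count(distinct patient) as totn
   from work.demo
   group by drug
   union
   select %eval(&numdrug+1) as drug,
          count(distinct patient) as totn
   from work.demo;

   /* Print the result for inspection. */
   select * from work.totpat_demo;
quit;
```

With these seven rows, drug 1 has two distinct patients, drug 2 has three, and the combined group (drug 3) has five, so the repeated rows for patients 101 and 202 should not inflate the counts.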
