Pharmaceutical Industry

SIMPLIFYING NDA PROGRAMMING WITH PROC SQL

Aileen L. Yam
Corning Besselaar Inc., Princeton, NJ

ABSTRACT

The programming of New Drug Application (NDA) Integrated Summary of Safety (ISS) usually involves obtaining patient counts, percentages, and other summary statistics such as mean, standard deviation and range. This paper shows how to obtain all of these results with the SQL procedure. While PROC SQL is often perceived as a data retrieval tool, its unique features allow programmers to write compact code to obtain data summaries for any application similar to the NDA ISS or the safety summary tables in individual new drug studies. This paper also shows that several DATA or other PROC steps can be reduced to one or two steps with PROC SQL.

OVERVIEW

At the end of this paper are three tables that represent the types of summary statistics most commonly presented in safety tables in pharmaceutical research. The data in those tables are fictitious and are for illustration purposes only. The three types of summary tables are:

1. counts, percentages, mean, standard deviation, range and missing value frequencies of demographic data;
2. counts and percentages of adverse events by body system;
3. counts and percentages of adverse events by body system and COSTART term.

This paper shows that the summary statistics of each of these three types of tables can be obtained entirely within one or two PROC SQL steps. The unique features of PROC SQL make it possible to reduce many DATA or other PROC steps in summing, grouping, sorting, selecting first occurrences of each subgroup, merging, concatenating, conditional processing, and calculating percentages, mean, standard deviation, range and missing values. Such unique features are boldfaced in the following programs. Repeated uses of the same features in a subsequent program are not boldfaced or discussed again. Variable names, data set names, macro variable names and macro variable references from the programs are italicized in the discussion.

The intention of this paper is not to advocate PROC SQL over DATA steps or other procedures, and there are no benchmark statistics to compare their performance. The objective is to present the SQL procedure as a valuable alternative for summarizing data with fewer steps.

TOTAL PATIENT COUNTS

The following SQL procedure obtains total patient counts for drug 1, drug 2, and all drug groups (drug 1 and drug 2 combined, assigned as drug 3 for report writing purposes). Since total patient counts appear in all three summary tables, the counts are calculated once and saved in a permanent data set called totpat.

   %let numdrug=2;

   proc sql;
      create table perm.totpat as
      select drug, count(distinct patient) as totn
         from raw.data
         group by drug
      union
      select %eval(&numdrug+1) as drug,
             count(distinct patient) as totn
         from raw.data;

The DISTINCT keyword eliminates duplicate rows before counting. The GROUP BY clause is used to classify patient counts into drug groups. The UNION operator combines the results of two queries into one result table, so the rows for the individual drug groups from the first query and the row for all drug groups from the second query end up in the same table. The AS keyword assigns a column name to the value or expression that precedes it.

Assume that there are two drug groups, 1 and 2. The first SELECT statement counts the number of nonduplicating patients in each drug group. The second SELECT statement counts the number of nonduplicating patients without grouping patients by drug. Notice that the variable drug is given a value of 3 in the second SELECT statement. Since PROC SQL allows the selection of a literal numeric value or a character string for a variable, any arbitrary value can be assigned. The number of drug groups is set to 2 in the macro variable reference, &numdrug; therefore, the two drug groups combined are assigned as 3, that is, the number of drug groups, &numdrug, plus one. The results from the two SELECT statements are concatenated into a permanent data set, totpat. Totpat consists of patient counts in drug 1, drug 2, as well as in drug 1 and drug 2 combined. The macro variable, numdrug, can be adjusted according to the number of drug groups in a study.

Several steps are saved. The data do not need to be sorted by drug group and patient. There is no need to set the data by the sorted variables into a DATA step to get the first observation of each patient for a count of nonduplicating patients. There is no need to count the patients in two steps, one with BY drug group and the other without BY drug group. The resulting counts will not need to be passed into another DATA step to be concatenated together, or to be reorganized by _TYPE_ if a PROC step for summary statistics is used.

DEMOGRAPHIC TABLE

The demographic table consists of two parts, so two SQL procedures are written. The first SQL procedure generates counts (cnt) and percentages (pct) of gender and race groups in Table 1.

   %macro xx(outds=,var=);
   proc sql;
      create table &outds as
      select *,
             round(cnt/
                   case sum(cnt)
                      when 0 then .
                      else sum(cnt)
                   end *100) as pct
         from (select drug, &var,
                      count(distinct patient) as cnt
                  from raw.data
                  group by drug, &var)
         group by drug
      union
      select *,
             round(cnt/
                   case sum(cnt)
                      when 0 then .
                      else sum(cnt)
                   end *100) as pct
         /* all drug groups combined: no grouping by drug */
         from (select %eval(&numdrug+1) as drug, &var,
                      count(distinct patient) as cnt
                  from raw.data
                  group by &var)
      order by 1, 2;
   %mend xx;

   %xx(outds=gencnt,var=gender);
   %xx(outds=racecnt,var=race);

In the macro calls to xx, there are two major queries joined by the UNION operator. In each of these queries, a subquery is used by nesting the second SELECT statement within the first SELECT statement. The CASE expression is used to perform conditional processing. The SUM function is used to calculate the grand total for the denominator. The ORDER BY clause sorts the results by the order-by items in the default sequence, from the lowest value to the highest value.

An asterisk (*) after the SELECT keyword in the outer query indicates that all the values, drug, &var and cnt, returned by the second SELECT statement are used. In the second SELECT statement, the number (cnt) of nonduplicating patients is counted by drug group and by the macro variable reference, &var, when it is resolved. Percentages (pct) are calculated in the outer query using cnt as the numerator and the SUM of cnt as the denominator. The CASE expression is used to prevent a division-by-zero error message: WHEN the SUM of cnt is zero, THEN the denominator is set to missing, ELSE the SUM of cnt is the denominator. Similar calculations are done after the UNION operator without grouping patients by drug. Thus, the counts (cnt) and percentages (pct) of gender and race for the two drug groups separately and combined are obtained. The results are ordered by the values in the first and second columns, as indicated by 1 and 2 in the ORDER BY clause. The first column is the first variable specified in the SELECT statement, and the first variable is drug. Similarly, the second column refers to the second variable in the SELECT statement, and the second variable is a macro variable that varies depending on the values supplied in the macro calls. In other words, the results are ordered by drug and gender in the first macro call, and by drug and race in the second macro call.

Besides the steps mentioned under the Total Patient Counts section, two additional steps are saved. One is not having to pass the patient counts into a DATA step for calculating percentages. The other is not having to sort the result table in ascending order.

The second SQL procedure generates mean, standard deviation, range and number of missing values of age, weight and height in Table 1.

   %macro yy(outds=,var=);
   proc sql;
      create table &outds as
      select drug, "&var" as var,
             mean(&var) as mean, std(&var) as std,
             min(&var) as min, max(&var) as max,
             nmiss(&var) as nmiss
         from raw.data
         group by drug
      union
      select %eval(&numdrug+1) as drug, "&var" as var,
             mean(&var) as mean, std(&var) as std,
             min(&var) as min, max(&var) as max,
             nmiss(&var) as nmiss
         from raw.data;
   %mend yy;

   %yy(outds=agestat,var=age);
   %yy(outds=wtstat,var=weight);
   %yy(outds=htstat,var=height);

In the macro calls to yy, the functions MEAN, STD, MIN, MAX and NMISS are used to calculate summary statistics. The character string resolved from the macro variable reference, &var, is used to associate each variable in the macro call with its corresponding summary statistics. All the summary statistics for the two drug groups separately and combined are calculated and concatenated within one SQL procedure.

ADVERSE EVENTS TABLES

The summary statistics for the two adverse events tables, Table 2 and Table 3, can be obtained by calling the same single PROC SQL step below.

   %macro zz(inds1=,inds2=,outds=,var=,selectif=,
             sortord=);
   proc sql;
      create table &outds (drop=totn) as
      select distinct *,
             round(cnt/
                   case totn
                      when 0 then .
                      else totn
                   end *100) as pct,
             1 as seq
         /* libref perm is assumed here: totpat was created as perm.totpat */
         from (select &inds1..drug,
                      count(distinct patient) as cnt,
                      &inds2..totn
                  from raw.&inds1 left join perm.&inds2
                  on &inds1..drug=&inds2..drug
                  where &selectif
                  group by &inds1..drug)
      outer union corresponding
      select distinct *,
             round(cnt/
                   case totn
                      when 0 then .
                      else totn
                   end *100) as pct,
             2 as seq
         from (select &inds1..drug, &var,
                      count(distinct patient) as cnt,
                      &inds2..totn
                  from raw.&inds1 left join perm.&inds2
                  on &inds1..drug=&inds2..drug
                  where &selectif
                  group by &inds1..drug, &var)
      order by seq, drug, &sortord;
   %mend zz;

   %zz(inds1=ae,inds2=totpat,outds=aebcnt,
       var=body,selectif=%str(period=2),
       sortord=cnt desc);
   %zz(inds1=ae,inds2=totpat,outds=aebccnt,
       var=%str(body,costart),selectif=%str(period=2),
       sortord=%str(body, cnt desc));

The DISTINCT keyword eliminates duplicate rows of data. The LEFT JOIN operator retrieves matching rows and nonmatching rows based on the data specified on the left (raw.&inds1). The ON clause specifies the columns for matching rows in the two data sets to be joined. The WHERE clause specifies a condition for selecting the data. The OUTER UNION CORRESPONDING operator concatenates results from SELECT statements, similar to using a DATA step with a SET statement. The differences between UNION and OUTER UNION CORRESPONDING are as follows. UNION matches columns in a table expression by ordinal position, keeping the column names in the result table from the first table. OUTER UNION CORRESPONDING, on the other hand, matches columns by column names. In addition, when the OUTER UNION CORRESPONDING operator is used, the non-matching columns are retained in the result table. The DESC keyword sorts the result table in descending order.

Two sets of queries, identified by the variable seq as 1 and 2 for report writing purposes, are concatenated by the OUTER UNION CORRESPONDING operator. The first set of queries counts the total number (cnt) of nonduplicating patients with adverse events, merges the results with the totpat permanent data set by drug group, keeping only the rows from the adverse events counts with the LEFT JOIN operator, and calculates the percentages (pct) of cnt. Only patients from the double-blind period (period=2) are selected in the WHERE clause. The second set of queries performs similar calculations, except that the patient counts (cnt) and percentages (pct) are by body system or by body system and COSTART term. The first macro call to zz groups adverse events by body system. The second macro call to zz groups adverse events by body system and COSTART term.

For the Adverse Events tables, the DISTINCT option is used in two different ways: to count the number of nonduplicating patients for each adverse event category, and to eliminate duplicate rows as a result of the LEFT JOIN. The DISTINCT option is particularly useful for counting patients with adverse events, because patients with multiple occurrences of the same adverse event are to be counted once only.

Among the steps mentioned previously, the most important steps saved here are not having to sort the adverse events data and to set the sorted data to get the first occurrences of adverse events by patient. The selection of first occurrences of each adverse event, the conditional processing, the sorting, the summing, the calculation of percentages, the concatenation of data sets, and the sorting of the result table by seq, drug, body system, and by descending adverse event counts (cnt) can all take place within the same SQL procedure.

SUMMARY

This paper uncovers the potential of PROC SQL as a very useful data summary tool, in addition to being a data retrieval tool. The beauty of PROC SQL lies in the simplicity and resourcefulness of the code. Several steps can be condensed to make one-step programming possible. The tradeoff is that it generally takes more time to write and debug SQL programs, because with PROC SQL, the intermediate results from each step take place internally, and all the query expressions produce a single output table. The programs in this paper were originally developed for an NDA, but the programming logic and techniques can be used for other similar data summaries.

For additional information, contact:

   Aileen L. Yam
   Corning Besselaar, Inc.
   210 Carnegie Center
   Princeton, NJ 08540-6233
   Tel.: (609) 452-4200
(Three sample NDA Integrated Summary of Safety tables follow.)

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

TABLE 1
SUMMARY OF DEMOGRAPHIC DATA

                          Drug 1       Drug 2       All Drug Groups

Total Patients            849          851          1700

Gender
   Male                   429 (51%)    432 (51%)    861 (51%)
   Female                 420 (49%)    419 (49%)    839 (49%)

Race
   White                  467 (55%)    471 (55%)    938 (55%)
   Black                  362 (43%)    371 (44%)    733 (43%)
   Other                   20 (2%)       9 (1%)      29 (2%)

Age
   Mean                   36.2         37.1         36.8
   Standard Deviation     16.3         16.1         16.2
   Range                  17-69        17-69        17-69
   # Missing              1            2            3

Weight (pounds)
   Mean                   155.2        156.1        155.8
   Standard Deviation     10.6         10.9         10.8
   Range                  96-209       98-212       96-212
   # Missing              0            1            1

Height (inches)
   Mean                   64.7         65.6         65.2
   Standard Deviation     8.8          9.0          8.9
   Range                  59-72        60-76        59-76
   # Missing              0            0            0

TABLE 2
NUMBER AND PERCENT OF PATIENTS WITH ADVERSE EVENTS BY BODY SYSTEM

                                       Drug 1       Drug 2

Total Patients                         849          851
Total Patients with Adverse Events     420 (49%)    320 (38%)

BODY AS A WHOLE                        360 (42%)    277 (33%)
DIGESTIVE SYSTEM                       280 (33%)    230 (27%)
SKIN AND APPENDAGES                    200 (24%)    207 (24%)
RESPIRATORY SYSTEM                      39 (5%)      32 (4%)
CARDIOVASCULAR SYSTEM                   30 (4%)      28 (3%)
ENDOCRINE SYSTEM                         9 (1%)       3 (.4%)
NERVOUS SYSTEM                           2 (.2%)      1 (.1%)
etc.
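The cells of Table 1 pair each count with its percentage, for example 429 (51%), whereas the %xx macro returns cnt and pct as separate numeric columns. The following is a minimal sketch of one way the two could be combined for report writing. It is not taken from the paper: the data set gencnt is recreated here by hand from the gender rows of Table 1 so that the step can be run on its own, the names gencell and cell are hypothetical, and the CATX and CATS functions assume SAS 9 or later.

```sas
/* Hypothetical stand-in for the gencnt table that %xx would create; */
/* the counts and percentages are the gender rows of Table 1.        */
data gencnt;
   input drug gender $ cnt pct;
   datalines;
1 Female 420 49
1 Male 429 51
2 Female 419 49
2 Male 432 51
3 Female 839 49
3 Male 861 51
;
run;

proc sql;
   /* Build display text such as "429 (51%)" from cnt and pct. */
   create table gencell as
   select drug, gender,
          catx(' ', put(cnt, 4.), cats('(', put(pct, 3.), '%)')) as cell
      from gencnt
      order by drug, gender;
quit;
```

CATS removes the blanks left by the PUT formats inside the parentheses, and CATX joins the count and the bracketed percentage with a single space. The same expression would apply to the race counts by reading from racecnt instead.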

TABLE 3
NUMBER AND PERCENT OF PATIENTS WITH ADVERSE EVENTS BY BODY SYSTEM AND COSTART TERM

                                       Drug 1       Drug 2

Total Patients                         849          851
Total Patients with Adverse Events     420 (49%)    320 (38%)

BODY AS A WHOLE
   HEADACHE                            120 (14%)    110 (13%)
   CHILLS                               70 (8%)      62 (7%)
   FLU SYNDROME                         64 (8%)      52 (6%)
   ALLERGIC REACTION                    52 (6%)      42 (5%)
   INFECTION                            47 (6%)      30 (4%)
   FEVER                                18 (2%)      12 (1%)
   PAIN                                  3 (.4%)      1 (.1%)
   Subtotal                            360 (42%)    277 (33%)

DIGESTIVE SYSTEM
   DIARRHEA                            120 (14%)    100 (12%)
   NAUSEA                               80 (9%)      70 (8%)
   FLATULENCE                           72 (8%)      34 (4%)
   STOMATITIS                           40 (5%)      30 (4%)
   GASTRITIS                            12 (1%)       8 (1%)
   ESOPHAGITIS                           6 (1%)       3 (.4%)
   CONSTIPATION                          2 (.2%)      1 (.1%)
   Subtotal                            280 (33%)    230 (27%)

etc.
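
In Table 3 each body system subtotal is smaller than the sum of the term rows above it; for Drug 1 under BODY AS A WHOLE the seven term counts add up to 374 against a subtotal of 360. That is what counting nonduplicating patients at each level would be expected to produce, since a patient who reports two different terms in the same body system appears in both term rows but only once in the subtotal. The sketch below illustrates the effect with a few invented rows; the data set ae_demo, its values, and the shortened body system labels are hypothetical and are not taken from the paper.

```sas
/* Invented data: patient 1 reports HEADACHE twice and CHILLS once. */
data ae_demo;
   input patient drug body $ costart $;
   datalines;
1 1 BODY HEADACHE
1 1 BODY HEADACHE
1 1 BODY CHILLS
2 1 BODY HEADACHE
3 1 DIGEST NAUSEA
;
run;

proc sql;
   /* Term level: HEADACHE = 2, CHILLS = 1, NAUSEA = 1 */
   select body, costart, count(distinct patient) as cnt
      from ae_demo
      group by body, costart;

   /* Body system level: BODY = 2 (not 2 + 1 = 3), DIGEST = 1 */
   select body, count(distinct patient) as cnt
      from ae_demo
      group by body;
quit;
```

Patient 1 contributes to both the HEADACHE and CHILLS rows, yet is counted once for the body system, which is why the two term counts for BODY sum to 3 while the body system count is 2.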
