SIMPLIFYING NDA PROGRAMMING WITH PROC SQL
Aileen L. Yam, Besselaar Associates, Princeton, NJ

ABSTRACT
The programming of a New Drug Application (NDA) Integrated Summary of Safety (ISS)
usually involves obtaining patient counts, percentages, and other summary statistics such as
mean, standard deviation and range. This paper shows how to obtain all of these results with
the SQL procedure. While PROC SQL is often perceived as a data retrieval tool, its unique
features allow programmers to write compact code to obtain data summaries for any
application similar to the NDA ISS or the safety summary tables in individual new drug
studies. This paper also shows that several DATA or other PROC steps can be reduced to one or
two steps with PROC SQL.

OVERVIEW
At the end of this paper are three tables that represent the types of summary
statistics most commonly presented in safety tables in pharmaceutical research.
The data in those tables are fictitious and for illustration purposes only.

The three types of summary tables are:
1. counts, percentages, mean, standard deviation, range and missing value
   frequencies of demographic data;
2. counts and percentages of adverse events by body system;
3. counts and percentages of adverse events by body system and COSTART term.

This paper shows that the summary statistics of each of these three types of
tables can be obtained entirely within one or two PROC SQL steps. The unique
features of PROC SQL make it possible to reduce many DATA or other PROC steps in
summing, grouping, sorting, selecting first occurrences of each subgroup,
merging, concatenating, conditional processing, and calculating percentages,
mean, standard deviation, range and missing values. Such unique features are
boldfaced in the following programs. Repeated uses of the same features in a
subsequent program are not boldfaced or discussed again.

Variable names, data set names, macro variable names and macro variable
references from the programs are italicized in the discussion.

The intention of this paper is not to advocate PROC SQL over DATA steps or other
procedures, and there are no benchmark statistics to compare their performance
differences. The objective, however, is to present the SQL procedure as a
valuable alternative for summarizing data with fewer steps.

TOTAL PATIENT COUNTS
The following SQL procedure obtains total patient counts for drug 1, drug 2, and
all drug groups (drug 1 and drug 2 combined, assigned as drug 3 for report
writing purposes). Since total patient counts appear in all three summary
tables, the counts are calculated once and saved in a permanent data set called
totpat.

    %let numdrug=2;                     /* number of drug groups */
    proc sql;
       create table perm.totpat as
       /* patients per drug group */
       select drug,
              count(distinct patient) as totn
          from raw.data
          group by drug
       union
       /* all drug groups combined, stored as drug 3 */
       select %eval(&numdrug+1) as drug,
              count(distinct patient) as totn
          from raw.data;
    quit;

The DISTINCT keyword eliminates duplicate rows before counting. The GROUP BY
clause is used to classify patient counts into drug groups. The UNION operator
combines two queries, putting the result from the first query on top of the
result from the second query. The AS keyword names a column in the result, so a
computed or literal value can be stored under a chosen variable name.

Assume that there are two drug groups, 1 and 2. The first SELECT statement
counts the number of nonduplicating patients in each drug group. The second
SELECT statement counts the number of nonduplicating patients without grouping
patients by drug. Notice that the variable drug is given a value of 3 in the
second SELECT statement. Since PROC SQL allows the selection of a literal
numeric value or a character string for a variable, any arbitrary value can be
assigned. The number of drug groups is set to 2 in the macro variable reference,
&numdrug; therefore, the two drug groups combined are assigned the value 3, that
is, the number of drug groups, &numdrug, plus one. The results from the two
SELECT statements are concatenated into a permanent data set, totpat. Totpat
consists of patient counts in drug 1, drug 2, as well as in drug 1 and drug 2
combined. The macro variable reference, &numdrug, can be adjusted according to
the number of drug groups in a study.

Several steps are saved. The data do not need to be sorted by drug group and
patient. There is no need to set the data by the sorted variables into a DATA
step to get the first observation of each patient for a count of nonduplicating
patients. There is no need to count the patients in two steps, one by drug group
and the other without grouping by drug. The resulting counts will not need to be
passed into another DATA step to be concatenated together, or to be reorganized
by _TYPE_ if a PROC step for summary statistics is used.

DEMOGRAPHIC TABLE
The demographic table consists of two parts, so two SQL procedures are written.
The first SQL procedure generates counts (cnt) and percentages (pct) of gender
and race groups in Table 1.

    %macro xx(outds=,var=);
    proc sql;
       create table &outds as
       /* counts and percentages within each drug group */
       select *,
              round(cnt/
                    case sum(cnt)
                       when 0 then .
                       else sum(cnt)
                    end *100) as pct
          from (select drug, &var,
                       count(distinct patient) as cnt
                   from raw.data
                   group by drug, &var)
          group by drug
       union
       /* counts and percentages for all drug groups combined */
       select *,
              round(cnt/
                    case sum(cnt)
                       when 0 then .
                       else sum(cnt)
                    end *100) as pct
          from (select %eval(&numdrug+1) as drug,
                       &var,
                       count(distinct patient) as cnt
                   from raw.data
                   group by &var)
       order by 1, 2;
    quit;
    %mend xx;

    %xx(outds=gencnt,var=gender);
    %xx(outds=racecnt,var=race);

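The UNION pattern that builds totpat, and that the macro above reuses, can be tried on a few invented rows. The sketch below is illustrative only: the data set work.demo, its values and the output name work.totdemo are assumptions and are not part of the original programs.

```sas
/* Invented data: patient 101 has two records but must be counted once. */
data work.demo;
   input patient drug;
   datalines;
101 1
101 1
102 1
103 2
;
run;

%let numdrug=2;   /* number of drug groups, as in the paper */

proc sql;
   create table work.totdemo as
   /* one row per drug group */
   select drug,
          count(distinct patient) as totn
      from work.demo
      group by drug
   union
   /* one extra row for all groups combined, labelled drug 3 */
   select %eval(&numdrug+1) as drug,
          count(distinct patient) as totn
      from work.demo;
quit;

proc print data=work.totdemo noobs;
run;
```

With these rows, totn should be 2 for drug 1, 1 for drug 2 and 3 for the combined group, since the repeated record for patient 101 is counted only once.
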
In the macro calls to xx, there are two major queries joined by the UNION
operator. In each of these queries, a subquery is used by nesting the second
SELECT statement within the first SELECT statement. A CASE expression is used to
perform conditional processing. The SUM function is used to calculate the grand
total for the denominator. The ORDER BY clause sorts the results by the order-by
items in a default sequence, from the lowest value to the highest value.

An asterisk (*) after the SELECT statement in the outer query indicates that all
the values, drug, &var and cnt, returned by the second SELECT statement are
used. In the second SELECT statement, the number (cnt) of nonduplicating
patients is counted by drug group and by the macro variable reference, &var,
when it is resolved. Percentages (pct) are calculated in the outer query using
cnt as numerator and the SUM of cnt as denominator. The CASE expression is used
to prevent an error message when the denominator is zero. WHEN the SUM of cnt is
zero, THEN it is set to missing, ELSE the SUM of cnt is the denominator. Similar
calculations are done after the UNION operator without grouping patients by
drug. Thus, the counts (cnt) and percentages (pct) of gender and race for the
two drug groups separately and combined are obtained. The results are ordered by
the values in the first and second columns, as indicated by 1 and 2 in the ORDER
BY clause. The first column is the first variable specified in the SELECT
statement, and the first variable is drug. Similarly, the second column refers
to the second variable in the SELECT statement, and the second variable is a
macro variable that varies depending on the values supplied in the macro calls.
In other words, the results are ordered by drug and gender in the first macro
call, and by drug and race in the second macro call.

Besides the steps mentioned under the Total Patient Counts section, two
additional steps are saved. One is not having to pass the patient counts into a
DATA step for calculating percentages. The other is not having to sort the
result table in ascending order.

The second SQL procedure generates mean, standard deviation, range and number of
missing values of age, weight and height in Table 1.

    %macro yy(outds=,var=);
    proc sql;
       create table &outds as
       /* statistics within each drug group */
       select drug,
              "&var" as var,
              mean(&var) as mean,
              std(&var) as std,
              min(&var) as min,
              max(&var) as max,
              nmiss(&var) as nmiss
          from raw.data
          group by drug
       union
       /* statistics for all drug groups combined */
       select %eval(&numdrug+1) as drug,
              "&var" as var,
              mean(&var) as mean,
              std(&var) as std,
              min(&var) as min,
              max(&var) as max,
              nmiss(&var) as nmiss
          from raw.data;
    quit;
    %mend yy;

    %yy(outds=agestat,var=age);
    %yy(outds=wtstat,var=weight);
    %yy(outds=htstat,var=height);

In the macro calls to yy, the functions MEAN, STD, MIN, MAX and NMISS are used
to calculate summary statistics. The character string resolved from the macro
variable reference, &var, is used to associate each variable in the macro call
with its corresponding summary statistics. All the summary statistics for the
two drug groups separately and combined are calculated and concatenated within
one SQL procedure.

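As a rough check of how these summary functions treat missing values, the grouped half of the yy query can be run on a few invented rows. The data set work.demo2 and its values are assumptions made for illustration, and the sketch assumes one record per patient.

```sas
/* Invented data, one record per patient; patient 104 has a missing age. */
data work.demo2;
   input patient drug age;
   datalines;
101 1 30
102 1 40
103 2 50
104 2 .
;
run;

proc sql;
   select drug,
          "age"      as var,
          mean(age)  as mean,   /* missing values are ignored           */
          std(age)   as std,    /* missing with fewer than two values   */
          min(age)   as min,
          max(age)   as max,
          nmiss(age) as nmiss   /* number of missing values             */
      from work.demo2
      group by drug;
quit;
```

For drug 1 this should give a mean of 35, a range of 30 to 40 and no missing values. For drug 2 the mean, minimum and maximum should all be 50, the standard deviation should be missing because only one value is available, and nmiss should be 1. If the input held more than one record per patient, every record would contribute to these statistics.
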
ADVERSE EVENTS TABLES
The summary statistics for the two adverse events tables, Table 2 and Table 3,
can be obtained by calling the same single PROC SQL statement below.

    %macro zz(inds1=,inds2=,outds=,var=,selectif=,sortord=);
    proc sql;
       create table &outds (drop=totn) as
       /* seq 1: all patients with adverse events, by drug group */
       select distinct *,
              round(cnt/
                    case totn
                       when 0 then .
                       else totn
                    end *100) as pct,
              1 as seq
          from (select &inds1..drug,
                       count(distinct patient) as cnt,
                       &inds2..totn
                   from raw.&inds1 left join
                        perm.&inds2
                      on &inds1..drug=&inds2..drug
                   where &selectif
                   group by &inds1..drug)
       outer union corresponding
       /* seq 2: patients with adverse events, by drug group and &var */
       select distinct *,
              round(cnt/
                    case totn
                       when 0 then .
                       else totn
                    end *100) as pct,
              2 as seq
          from (select &inds1..drug, &var,
                       count(distinct patient) as cnt,
                       &inds2..totn
                   from raw.&inds1 left join
                        perm.&inds2
                      on &inds1..drug=&inds2..drug
                   where &selectif
                   group by &inds1..drug, &var)
       order by seq, drug, &sortord;
    quit;
    %mend zz;

    %zz(inds1=ae,inds2=totpat,outds=aebcnt,
        var=body,selectif=%str(period=2),sortord=cnt desc);
    %zz(inds1=ae,inds2=totpat,outds=aebccnt,
        var=%str(body,costart),selectif=%str(period=2),
        sortord=%str(body, cnt desc));

The first macro call to zz groups adverse events by body system. The second
macro call to zz groups adverse events by body system and COSTART term.

The DISTINCT keyword eliminates duplicate rows of data. The LEFT JOIN operator
retrieves all rows from the table specified on the left (raw.&inds1), together
with the matching rows from the table on the right. The ON clause specifies the
columns for matching rows in the two data sets to be joined. The WHERE clause
specifies a condition for selecting the data. The OUTER UNION CORRESPONDING
operator concatenates results from SELECT statements, similar to using a DATA
step with a SET statement. The differences between UNION and OUTER UNION
CORRESPONDING are as follows. UNION matches columns in a table expression by
ordinal position, keeping the column names in the result table from the first
table. OUTER UNION CORRESPONDING, on the other hand, matches columns by column
names. In addition, when the OUTER UNION CORRESPONDING operator is used, the
non-matching columns are retained in the result table. The DESC keyword sorts
the result table in descending order.

Two sets of queries, identified by the variable seq as 1 and 2 for report
writing purposes, are concatenated by the OUTER UNION CORRESPONDING operator.
The first set of queries counts the total number (cnt) of nonduplicating
patients with adverse events, merges the results with the totpat permanent data
set by drug group, keeping only the rows from the adverse events counts with the
LEFT JOIN operator, and calculates the percentages (pct) of cnt. Only patients
from the double-blind period (period=2) are selected in the WHERE clause. The
second set of queries performs similar calculations, except that the patient
counts (cnt) and percentages (pct) are by body system or by body system and
COSTART term.

For the Adverse Events tables, the DISTINCT option is used in two different
ways: to count the number of nonduplicating patients for each adverse event
category, and to eliminate duplicate rows as a result of the LEFT JOIN.

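Both uses of DISTINCT can be seen on a few invented rows. In this sketch the data sets work.aedemo and work.totdemo2 and all of their values are assumptions; the queries mirror the structure of macro zz but are not the macro itself.

```sas
/* Invented adverse event records: patient 101 reports DIGESTIVE twice. */
data work.aedemo;
   input patient drug body :$20.;
   datalines;
101 1 DIGESTIVE
101 1 DIGESTIVE
101 1 SKIN
102 1 DIGESTIVE
103 2 SKIN
;
run;

/* Invented totals per drug group, standing in for perm.totpat. */
data work.totdemo2;
   input drug totn;
   datalines;
1 10
2 10
;
run;

proc sql;
   /* Use 1: COUNT(DISTINCT ...) counts each patient once per category. */
   select drug, body,
          count(*)                as records,
          count(distinct patient) as cnt
      from work.aedemo
      group by drug, body;

   /* Use 2: totn is neither grouped nor summarized, so PROC SQL remerges */
   /* the counts onto every joined row; DISTINCT collapses the repeats.   */
   select distinct a.drug,
          count(distinct a.patient) as cnt,
          t.totn
      from work.aedemo as a left join work.totdemo2 as t
         on a.drug = t.drug
      group by a.drug;
quit;
```

In the first query, drug 1 with body DIGESTIVE should show 3 records but a cnt of 2. In the second query, removing the DISTINCT keyword should return one row per joined adverse event record (five rows here) instead of one row per drug group.
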
The DISTINCT option is particularly useful for counting patients with adverse
events, because patients with multiple occurrences of the same adverse event are
to be counted once only. Among the steps mentioned previously, the most
important steps saved here are not having to sort the adverse events data and
not having to set the sorted data to get the first occurrences of adverse events
by patient.

The selection of first occurrences of each adverse event, the conditional
processing, the sorting, the summing, the calculation of percentages, the
concatenation of data sets, and the sorting of the result table by seq, drug,
body system, and by descending adverse event counts (cnt) can all take place
within the same SQL procedure.

SUMMARY
This paper uncovers the potential of PROC SQL as a very useful data summary
tool, in addition to being a data retrieval tool.

The beauty of PROC SQL lies in the simplicity and resourcefulness of the code.
Several steps can be condensed to make one-step programming possible. The
tradeoff is that it generally takes more time to write and debug SQL programs,
because with PROC SQL, the intermediate results from each step take place
internally, and all the query expressions produce a single output table.

The programs in this paper were originally developed for an NDA, but the
programming logic and techniques can be used for similar data summaries.

(Three sample NDA Integrated Summary of Safety tables follow.)

For additional information, contact:
Aileen L. Yam
Besselaar Associates
210 Carnegie Center
Princeton, NJ 08540-6681
Tel.: (609) 452-4200

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and
other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their
respective companies.

TABLE 1
SUMMARY OF DEMOGRAPHIC DATA

                          Drug 1        Drug 2        All Drug Groups
Total Patients            849           851           1700

Gender
  Male                    429 (51%)     432 (51%)     861 (51%)
  Female                  420 (49%)     419 (49%)     839 (49%)

Race
  White                   467 (55%)     471 (55%)     938 (55%)
  Black                   362 (43%)     371 (44%)     733 (43%)
  Other                    20 (2%)        9 (1%)       29 (2%)

Age
  Mean                    36.2          37.1          36.8
  Standard Deviation      16.3          16.1          16.2
  Range                   17-69         17-69         17-69
  # Missing               1             2             3

Weight (pounds)
  Mean                    155.2         156.1         155.8
  Standard Deviation      10.6          10.9          10.8
  Range                   96-209        9&-212        96-212
  # Missing               0             1             1

Height (inches)
  Mean                    64.7          65.6          65.2
  Standard Deviation      8.8           9.0           8.9
  Range                   59-72         60-76         59-76
  # Missing               0             0             0

TABLE 2
NUMBER AND PERCENT OF PATIENTS WITH ADVERSE EVENTS
BY BODY SYSTEM

                                      Drug 1        Drug 2
Total Patients                        849           851
Total Patients with Adverse Events    420 (49%)     320 (38%)

BODY AS A WHOLE                       360 (42%)     277 (33%)
DIGESTIVE SYSTEM                      280 (33%)     230 (27%)
SKIN AND APPENDAGES                   200 (24%)     207 (24%)
RESPIRATORY SYSTEM                     39 (5%)       32 (4%)
CARDIOVASCULAR SYSTEM                  30 (4%)       28 (3%)
ENDOCRINE SYSTEM                        9 (1%)        3 (.4%)
NERVOUS SYSTEM                          2 (.2%)       1 (.1%)
etc.

TABLE 3
NUMBER AND PERCENT OF PATIENTS WITH ADVERSE EVENTS
BY BODY SYSTEM AND COSTART TERM

                                      Drug 1        Drug 2
Total Patients                        849           851
Total Patients with Adverse Events    420 (49%)     320 (38%)

BODY AS A WHOLE
  HEADACHE                            120 (14%)     110 (13%)
  CHILLS                               70 (8%)       62 (7%)
  FLU SYNDROME                         64 (8%)       52 (6%)
  ALLERGIC REACTION                    52 (6%)       42 (5%)
  INFECTION                            47 (6%)       30 (4%)
  FEVER                                18 (2%)       12 (1%)
  PAIN                                  3 (.4%)       1 (.1%)
  Subtotal                            360 (42%)     277 (33%)

DIGESTIVE SYSTEM
  DIARRHEA                            120 (14%)     100 (12%)
  NAUSEA                               80 (9%)       70 (8%)
  FLATULENCE                           72 (8%)       34 (4%)
  STOMATITIS                           40 (5%)       30 (4%)
  GASTRITIS                            12 (1%)        8 (1%)
  ESOPHAGITIS                           6 (1%)        3 (.4%)
  CONSTIPATION                          2 (.2%)       1 (.1%)
  Subtotal                            280 (33%)     230 (27%)

etc.