Paper TT22
MULTILABEL - A useful addition to the FORMAT procedure
Venky Chakravarthy, Ann Arbor, MI
ABSTRACT
This paper examines the MULTILABEL option that was added to the FORMAT procedure in SAS® Version 8.2 and
its application to clinical trials reporting. The MULTILABEL option allows for the specification of (1) overlapping
ranges across labels and (2) secondary labels to the same values. Two examples are covered – one illustrates an
application with overlapping ranges and the other with secondary label values. The latter has relevance to a New
Drug Application (NDA) filing. Both are relevant to table generation in the Pharmaceutical Industry.
INTRODUCTION
PROC FORMAT is almost an indispensable tool for the SAS programmer. This procedure had two limitations, until
recently, that resulted in extra work. One was that it could accept only mutually exclusive categories. The second
was that multiple labels could not be specified for the same value. The MULTILABEL option introduced in version 8.2
overcomes these two limitations. This option has some useful applications in generating tables. This paper illustrates
MULTILABEL with two examples. The first example summarizes by age category where some age values are
mapped to more than one category. The second example is targeted towards an NDA filing. Multiple dosage
strengths of the drug of interest (e.g. 10mg, 20mg) are often summarized as “Any Dose”. This example illustrates
how this can be easily calculated with a MULTILABEL format that assigns more than one label to a value. Both examples
use PROC MEANS, one of the “MULTILABEL enabled” procedures that can take advantage of this option. The paper
also addresses some of its limitations. A bonus feature is the coverage given to useful enhancements such as
the PRELOADFMT and COMPLETETYPES options, which are used in conjunction with MULTILABEL. The target audience
is the pharmaceutical programmer who is an intermediate to advanced user. However,
beginners are encouraged to attend the presentation and read the paper, as the material has significant relevance to
generating tables.
MULTILABEL – THE SORCERER
The MULTILABEL option (SAS OnlineDoc, Version 8, 2001) is a valuable tool. Prior to its introduction in
Version 8.2, the following two things could not be done in PROC FORMAT.
proc format ;
   value THIS                 /* (1) */
      0 - 1 = "Under 2"
      0 - 2 = "Under 3"
   ;
   value THAT                 /* (2) */
      1 = "One"
      1 = "Again One"
   ;
run ;
(1) THIS is specifying overlapping ranges
(2) THAT is assigning a secondary label to a single value.
These were not allowed in PROC FORMAT, and any attempt to create THIS and THAT would result in ERRORS.
Let us not go into the details of doing THIS and THAT with the MULTILABEL option yet. First, let us
briefly cover some of the confines under which it operates.
MULTILABEL – NOT A STAND ALONE TOOL
The main restriction is that it can be used only in certain procedures. Although it is a welcome addition to PROC
FORMAT, the MULTILABEL option becomes useful only when used in conjunction with some of the procedures
summarizing data. It is important to note that as of SAS® Version 8.2 only the MEANS, SUMMARY and TABULATE
procedures are MULTILABEL enabled to take advantage of this option (SAS OnlineDoc, Version 8, 2001). This
paper limits the examples to PROC MEANS.
With that background we can now cover the specific details. One of the best ways of going about explaining
MULTILABEL is to examine how it can help us in our jobs as pharmaceutical programmers in generating tables for
clinical trials.
Let us assume that we are working on a drug for weight loss. One of our table specifications is to categorize patients
under each treatment group into age groups of “Under 25”, “Under 50”, “Under 75”, “Under 100” and “100 and
Above”. So the final table would look something like Table 1 below:
Table 1
Number and Percentage of Patients with Significant Weight Loss

Age Group        Placebo      Block Buster
Under 25         x (x.x)      x (x.x)
Under 50         x (x.x)      x (x.x)
Under 75         x (x.x)      x (x.x)
Under 100        x (x.x)      x (x.x)
100 and Above    x (x.x)      x (x.x)
Note that all the categories except the last row are overlapping. As an example, the “Under 50” category has all the
“Under 25” elements subsumed but not vice versa. Based on what we know thus far, PROC FORMAT with its new
MULTILABEL option could make this task easier. Without it, we would have to read the data, flag the overlapping
ranges with some unique value and output each observation once for every group it belongs to. Perhaps something like below:
data <output data> ;
   set <input data> ;
   length group $13 ;
   /* Write out the observations to the output data set */
   if age lt 25 then do ;
      group = 'Under 25' ;
      output ;
   end ;
   if age lt 50 then do ;
      group = 'Under 50' ;
      output ;
   end ;
   if age lt 75 then do ;
      group = 'Under 75' ;
      output ;
   end ;
   if age lt 100 then do ;
      group = 'Under 100' ;
      output ;
   end ;
   if age ge 100 then do ;
      group = '100 and Above' ;
      output ;
   end ;
run ;
At best, it is a kludge. Let us now examine how this can be done more elegantly. Let us first go about the task of
creating the sample data to be used in the demonstration.
CREATING SAMPLE DATA
First, let us cook up some demographic data (see Table 2). To keep the clutter to a minimum only 10 patients and
their treatment groups are considered. The study is given an arbitrary number of 999. Again for the sake of
simplicity this is limited to 2 treatment groups - a Placebo group and the Company Drug fictionally referred to as
Block Buster.
Let us now assume that we have all the weight information available and we are considering the weight loss at
termination (see Table 3). Further, let us assume that the patients in the demo data are the only ones with clinically
significant weight loss (whatever that criterion may be). Here are the cooked up data sets of demographic data and
patients with weight loss.
data demo ( label = "Cooked Demo" ) ;
   retain study 999 ;
   do patient = 1 to 10 ;
      rxgroup = mod(patient,2)+1 ;
      output ;
   end ;
run ;

Table 2
Demographic Data

Obs    study    patient    rxgroup
  1      999          1          2
  2      999          2          1
  3      999          3          2
  4      999          4          1
  5      999          5          2
  6      999          6          1
  7      999          7          2
  8      999          8          1
  9      999          9          2
 10      999         10          1

data weightloss ( label = "Cooked Weight Loss" drop = seed: ) ;
   retain Seed_1 129857 weightloss 1 ;
   do patient = 1 to 10 ;
      call ranuni ( Seed_1 , age ) ;
      age = ceil(100*age) ;
      output ;
   end ;
run ;

Table 3
Clinically Significant Weight Loss Data

Obs    weightloss    patient    age
  1             1          1     69
  2             1          2     82
  3             1          3     99
  4             1          4     73
  5             1          5     55
  6             1          6     45
  7             1          7     65
  8             1          8     90
  9             1          9     11
 10             1         10     14

Next we need some formats to assign treatment groups. We will spend some time on examining how the
MULTILABEL format is specified to categorize the ages into the relevant overlapping groups needed for reporting
purposes. Once we have used the FORMAT to summarize, we can use an INFORMAT to order the age groups to
meet the reporting requirements.
proc format ;
   value rxfmt
      1 = "Placebo"
      2 = "Block Buster"
   ;
Readers are too familiar with creating a simple format like the one above for it to merit any explanation, except to mention that
it creates the RXFMT format. Next comes the main focus of this paper – creating a MULTILABEL FORMAT. The syntax is
not complicated. Go about creating a format the normal way. The only difference is to specify the word MULTILABEL
in parentheses next to the format name.
   value agefmt (multilabel)
      low - 24   = "Under 25"
      low - 49   = "Under 50"
      low - 74   = "Under 75"
      low - 99   = "Under 100"
      100 - high = "100 and Above"
   ;
It is as simple as that. We notice that everything is the same in assigning a FORMAT except for the explicit mention
of MULTILABEL. We have now passed the information to PROC FORMAT to create this format in a special way.
Once this is executed, the AGEFMT FORMAT is created and SAS internally assigns “M” to the value of HLO to flag
this as a MULTILABEL FORMAT. The only way one can see this is by outputting a control data set with the AGEFMT
format and viewing the data.
   /* Since the Multilabel format is used
      by the class statement in PROC MEANS
      we need an informat to order the age
      categories */
   invalue ageinf
      "Under 25"      = 1
      "Under 50"      = 2
      "Under 75"      = 3
      "Under 100"     = 4
      "100 and Above" = 5
   ;
run ;
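The remark above about the HLO flag can be checked once the PROC FORMAT step has run. The following is a minimal sketch, assuming the AGEFMT format was stored in the default WORK format catalog; the data set name FMTCTL is hypothetical. The CNTLOUT= option writes the format definition to a control data set, and the HLO column of that data set should carry an “M” for each range of a MULTILABEL format.

```sas
/* Sketch: write the AGEFMT definition to a control data set      */
/* (the name FMTCTL is hypothetical) and list the columns that    */
/* describe each range, including the HLO flag.                   */
proc format cntlout = fmtctl ;
   select agefmt ;
run ;

proc print data = fmtctl ;
   var fmtname start end label hlo ;
run ;
```

The SELECT statement restricts the control data set to AGEFMT, so the other formats and informats in the catalog do not clutter the listing.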
Now we simply merge the demographic data with the Weight Loss data by Patient number.
data demowgt ;
   merge demo weightloss ;
   by patient ;
run ;
After merging with the demographic data we now have all the relevant information to summarize the data. Let us now
look at how the MULTILABEL option of the FORMAT can be used.
SUMMARIZE DATA BY THE MULTILABEL GROUPS
We will use the MEANS procedure to summarize by the MULTILABEL groups. You may have noticed that
there are no patients in the “100 and Above” age category in our data set. However this is a reporting
requirement. One of the recent enhancements to PROC MEANS is the COMPLETETYPES option. This
tells the procedure to create all possible combinations of class variable values. Another enhancement is the
PRELOADFMT, which is used in conjunction with the COMPLETETYPES option. The PRELOADFMT option is an
instruction to load the FORMAT that is assigned to the class variable prior to the analysis. Since the FORMAT
already has the “100 and Above” category this will be PRELOADED and then output albeit as having a zero value.
Let us now examine how we would specify the syntax for the PROC MEANS to meet our reporting requirements.
%*----------------------------------------------------;
%* NOTE: PROC MEANS takes advantage of the            ;
%*       Multilabel format with the MLF option        ;
%*       specified in the CLASS statement.            ;
%*----------------------------------------------------;
proc means noprint data = demowgt completetypes ;   /* (1) */
   class age / mlf preloadfmt ;                     /* (2) */
   class rxgroup ;
   var weightloss ;
   output out = sumweight ( drop=_type_ _freq_ )
          n = N_by_RX ;
   types age*rxgroup ;
   format age agefmt. ;
run ;
The most important lines for us are (1) and (2). Line (1) carries the COMPLETETYPES option. This means all
possible combinations of the CLASS variable values have to be considered. Line (2) carries the MLF option, which
tells PROC MEANS to use the MULTILABEL format, along with PRELOADFMT. All the possible combinations are fed by the
MULTILABEL AGEFMT FORMAT, which is PRELOADED before analyzing the data. The VAR statement must
contain a numeric type variable. We are interested only in the counts in each category so only the N statistic is
requested in the output data set SUMWEIGHT. The TYPES statement requests the crossings between the age
categories and RXGROUP. Table 4 presents a PROC PRINT of the SUMWEIGHT data:
Table 4

Obs    age              rxgroup    N_by_RX
  1    100 and Above          1          0
  2    100 and Above          2          0
  3    Under 100              1          5
  4    Under 100              2          5
  5    Under 25               1          1
  6    Under 25               2          1
  7    Under 50               1          2
  8    Under 50               2          1
  9    Under 75               1          3
 10    Under 75               2          4

Notice that the 100 and Above age category is added to the output even though there was no one in the data set that
qualified for that age category. This is a result of the COMPLETETYPES option, which used PRELOADFMT to pick
up this age category from the AGEFMT FORMAT. Also note that the weight loss information is summarized correctly
using the overlapping age ranges specified in the MULTILABEL AGEFMT format. After sorting by the AGEINF
INFORMAT value of AGE, we can transpose the data and output the results presented in Table 5.
proc sql ;
   create table sorteddata as
   select age , put(rxgroup,rxfmt.) as arxgroup, n_by_rx
   from sumweight
   order by input(age,ageinf.) , rxgroup ;
quit ;

data transposed ( drop = N_by_RX arxgroup ) ;
   do _n_ = 1 by 1 until ( last.age ) ;
      set sorteddata ;
      by age notsorted ;
      array rx (2) ;
      rx(_n_) = N_by_RX ;
   end ;
   label age = "Age*Group"
         rx1 = "Placebo"
         rx2 = "Block*Buster" ;
run ;
The syntax used above may be unfamiliar to the audience. There is an entire SUGI 28 paper devoted to this
(Chakravarthy, 2003). However, there is always another way to transpose the data and you can proceed further
without any loss of information.
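As one such alternative, the same reshaping can be sketched with PROC TRANSPOSE. This is only an illustration under stated assumptions: it reads the SORTEDDATA table created above, the output name TRANSPOSED2 is hypothetical, and the resulting column names are derived from the values of ARXGROUP rather than being RX1 and RX2 as in the DATA step version.

```sas
/* Sketch: one output row per age group, one column per treatment.  */
/* TRANSPOSED2 is a hypothetical name. The ID values come from      */
/* ARXGROUP, so the columns are named after the treatment labels    */
/* (a blank in a label is replaced when the name is built).         */
proc transpose data = sorteddata
               out  = transposed2 ( drop = _name_ ) ;
   by age notsorted ;     /* keep the AGEINF order set by PROC SQL */
   id arxgroup ;          /* treatment label becomes column name   */
   idlabel arxgroup ;     /* and also the column label             */
   var n_by_rx ;
run ;
```

The NOTSORTED keyword matters here because the rows are ordered by the AGEINF informat value, not alphabetically by the age label.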
title1 "Block Buster Busted" ;
title2 "The treatment group columns reflect number having Significant Weight Loss" ;
proc print data = transposed label split = "*" ;
run ;
title ;
Table 5
Block Buster Busted
The treatment group columns reflect number having Significant Weight Loss

       Age                           Block
Obs    Group            Placebo     Buster
  1    Under 25               1          1
  2    Under 50               2          1
  3    Under 75               3          4
  4    Under 100              5          5
  5    100 and Above          0          0

Once we divide by the N for the treatment groups we will get the percentage with weight loss by treatment group,
which is not presented here.
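For readers who want to carry out that division, here is a minimal sketch. It assumes the denominator is the number of patients per treatment group in the DEMO data set; the names PCTWEIGHT, BIGN and PCT are hypothetical and do not appear in the paper.

```sas
/* Sketch: percentage with weight loss by age group and treatment.  */
/* Assumes the denominator is the patient count per RXGROUP in      */
/* DEMO. PCTWEIGHT, BIGN and PCT are hypothetical names.            */
proc sql ;
   create table pctweight as
   select s.age ,
          s.rxgroup ,
          s.n_by_rx ,
          d.bign ,
          100 * s.n_by_rx / d.bign as pct format = 5.1
   from sumweight as s
        inner join
        ( select rxgroup , count(*) as bign
          from demo
          group by rxgroup ) as d
        on s.rxgroup = d.rxgroup
   order by input(s.age,ageinf.) , s.rxgroup ;
quit ;
```

Because the age categories overlap, the percentages within a treatment group are not expected to add up to 100.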
So far we have covered one aspect of the MULTILABEL format i.e. the ability to specify overlapping ranges. Let
us now proceed to the other function – the ability to specify secondary labels to the same value.
MULTILABEL – SECONDARY LABELS
Once again, it is best to relate this to the work done in clinical trials. This time let us look at a possible application to
an NDA filing. There are usually a number of clinical trials performed at varying doses of the company drug before it
is ready for filing. When the Summary of Clinical Safety (SCS) or Integrated Summary of Safety (ISS) is written there
are a number of tables that go into it with an additional treatment column. This is the “Any Dose of the Company
Drug” column. This is not captured anywhere in the database as such. There are also typically a few comparator
drugs that comprise the last few columns of a table (see Table 6 for an illustration). Notice that the “Any Dose”
column must precede them. The same can also be expected from an Integrated Summary of Efficacy (ISE).
Table 6
The Overall N of treatment group columns including the Any Dose of the Company Drug

           Company    Company    Any
Placebo    10mg       20mg       Dose    Comparator
N=25       N=24       N=26       N=50    N=25

It can be seen that the ANY DOSE column is made up of the two company drug columns. This kind of table
specification naturally lends itself to the other type of MULTILABEL application i.e. the ability to specify secondary
labels to the same value. This is presented next.
We have already seen how the MULTILABEL format is created and how PROC MEANS uses it with elaborate
explanations above. If anything in the syntax below is unclear the reader is encouraged to go back to the relevant
sections and read that over again. The second example will be short and strictly focused on extracting the
information for Table 6. Let us first create a data set that has the patients pooled from different studies and then
reassigned a unique patient ID. To simplify matters, we only have the patient numbers and the treatment group
information in this data set. This is the only source data that will be used. The reader can run the following code and
view the patient data for further clarity. There are 4 treatment groups present in the data and a 5th group, “Any Dose”,
will be added using the MULTILABEL format. Treatment groups 2 and 3 are the company drugs and these two
together form the “Any Dose” column. With this background let us proceed with the task of creating the data,
FORMAT and INFORMAT before summarizing it.
data rxgroups ;
   do drug = 1 to 4 ;
      do _n_ = 1 to ceil(100*ranuni(196)) ;
         patient + 1 ;
         output ;
      end ;
   end ;
run ;
The specification of multiple labels is presented below.
proc format ;
   value drugfmt (multilabel)
      1 = "Placebo"
      2 = "Company 10mg"
      2 = "Any Dose"
      3 = "Company 20mg"
      3 = "Any Dose"
      4 = "Comparator"
   ;
   invalue order
      "Placebo"      = 1
      "Company 10mg" = 2
      "Company 20mg" = 3
      "Any Dose"     = 4
      "Comparator"   = 5
   ;
run ;
Since the specification of multiple labels is the other feature covered, it deserves some explanation. The FORMAT
DRUGFMT assigns two separate labels to the value 2. Likewise the value 3 is assigned two labels. One label carries the
dosage and the other is the “Any Dose” treatment group, which is not present in the data. We will now use
this in PROC MEANS as we did with the AGEFMT before.
proc means noprint data = rxgroups completetypes ;
   class drug / mlf preloadfmt ;
   var drug ;
   output out = Overall_N ( drop=_type_ _freq_ )
          n = N_by_RX ;
   types drug ;
   format drug drugfmt. ;
run ;
Once again we order the data with the INFORMAT built specifically for this purpose.
proc sql ;
   create table sortedRX as
   select *
   from overall_n
   order by input(drug,order.) ;
quit ;
By using DRUG as the ID variable in the PROC TRANSPOSE we ensure that the column headers are named by the
distinct values present in the DRUG column (see Table 7).
proc transpose data = sortedrx out = result (drop=_name_) ;
   id drug ;
run ;
title1 "Table 7" ;
title2 "The Overall N of treatment groups" ;
proc print data = result noobs split="_";
run ;
title ;
Table 7
The Overall N of treatment groups

           Company    Company    Any
Placebo    10mg       20mg       Dose    Comparator
     26         14         37      51            85

Thus we have the essential information summarized to meet the spec in Table 6. However, it requires a little
polishing before it can be fully validated. That polishing is not the focus here; the focus is how the “Any Dose” column
was built on the fly using the enhanced capabilities of the MULTILABEL format.
Let us now revisit THIS and THAT, which we addressed at the beginning of the paper and mention some of the
limitations of the MULTILABEL format. We have created the equivalent of THIS with the MULTILABEL AGEFMT
format that covered overlapping ranges. We also created something like THAT with the DRUGFMT that assigned
multiple labels to the same values. However in doing THIS and THAT there are some limitations, which are
addressed next.
LIMITATIONS OF THE MULTILABEL FORMAT
The first limitation is that practically only two procedures can use it (SUMMARY and MEANS are almost identical).
This does not allow for more robust user testing and adaptation. Yes, we can do THIS and THAT with the FORMAT
but we can only use it somewhat.
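Although the examples in this paper are limited to PROC MEANS, the same format can in principle be handed to PROC TABULATE, the other MULTILABEL enabled procedure. The following is a minimal sketch rather than a tested report: it assumes the DEMOWGT data set and the AGEFMT and RXFMT formats created earlier, and it adds PRINTMISS on the assumption that PROC TABULATE needs it for the preloaded “100 and Above” row to appear.

```sas
/* Sketch: counts by overlapping age group and treatment using      */
/* PROC TABULATE. Assumes DEMOWGT, AGEFMT and RXFMT from the first  */
/* example. PRINTMISS is added on the assumption that it is needed  */
/* for preloaded categories with no observations to be displayed.   */
proc tabulate data = demowgt ;
   class age / mlf preloadfmt ;
   class rxgroup ;
   table age , rxgroup * n / printmiss ;
   format age agefmt. rxgroup rxfmt. ;
run ;
```

Only one MULTILABEL format is used here, in keeping with the limitation noted for PROC MEANS.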
With some experimentation, it was found that more than one MULTILABEL format could not be used in PROC
MEANS. This is supposedly fixed in version 9.1.
CONCLUSION
In conclusion, the MULTILABEL format is a useful addition to the FORMAT procedure.
When used with the MULTILABEL enabled procedures it has useful applications in clinical trials reporting, including
tables with special requirements generated for the filing of an NDA. There is scope for improvement, and support should be
extended to other procedures.
REFERENCES
SAS OnlineDoc (2001), SAS Procedures Guides: The FORMAT Procedure. SAS Institute Inc., Cary, NC, USA.
SAS OnlineDoc (2001), SAS Procedures Guides: The MEANS Procedure. SAS Institute Inc., Cary, NC, USA.
Chakravarthy, V. (2003), “The DOW (not that DOW!!!) and the LOCF in Clinical Trials,” Proceedings of the Twenty
Eighth Annual SAS Users Group International Conference, 99-28, 1-4.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Venky Chakravarthy
1591 Abigail Way
Ann Arbor, MI 48103
Email: [email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.