Paper TT22

MULTILABEL - A useful addition to the FORMAT procedure

Venky Chakravarthy, Ann Arbor, MI

ABSTRACT

This paper examines the MULTILABEL option that was added to the FORMAT procedure in SAS® Version 8.2 and its application to clinical trials reporting. The MULTILABEL option allows for the specification of (1) overlapping ranges across labels and (2) secondary labels to the same values. Two examples are covered: one illustrates an application with overlapping ranges and the other with secondary label values. The latter has relevance to a New Drug Application (NDA) filing. Both are relevant to table generation in the Pharmaceutical Industry.

INTRODUCTION

PROC FORMAT is an almost indispensable tool for the SAS programmer. Until recently, this procedure had two limitations that resulted in extra work. One was that it could accept only mutually exclusive categories. The second was that multiple labels could not be specified for the same value. The MULTILABEL option introduced in Version 8.2 overcomes these two limitations. This option has some useful applications in generating tables.

This paper illustrates MULTILABEL with two examples. The first example summarizes by age category where some age values are mapped to more than one category. The second example is targeted towards an NDA filing. Multiple dosage strengths of the drug of interest (e.g. 10mg, 20mg) are often summarized as “Any Dose”. This example illustrates how this can be easily calculated with MULTILABEL assigning more than one label to a value. Finally, the paper covers one of the “MULTILABEL enabled” procedures that can take advantage of this useful option. The paper also addresses some of its limitations. A bonus feature is the coverage given to some useful enhancements, such as PRELOADFMT and COMPLETETYPES, that are used in conjunction with MULTILABEL.

The targeted audience is the Pharmaceutical programmer who can be classified as an intermediate to advanced user.
However, beginners are encouraged to attend the presentation and read the paper, as it has significant relevance to generating tables.

MULTILABEL – THE SORCERER

MULTILABEL (SAS OnlineDoc, Version 8, 2001) is a valuable tool. Prior to its introduction in Version 8.2, the following two things could not be done in a PROC FORMAT.

    proc format ;
      value THIS            /* (1) */
        0 - 1 = "Under 2"
        0 - 2 = "Under 3" ;
      value THAT            /* (2) */
        1 = "One"
        1 = "Again One" ;
    run ;

(1) THIS specifies overlapping ranges.
(2) THAT assigns a secondary label to a single value.

Neither was allowed in PROC FORMAT, and an attempt to create THIS or THAT would result in ERRORs. Let us not go into the details of doing THIS and THAT with the MULTILABEL option yet. First, let us briefly cover some of the confines under which it operates.

MULTILABEL – NOT A STAND ALONE TOOL

The main restriction is that it can be used only in certain procedures. Although it is a welcome addition to PROC FORMAT, the MULTILABEL option becomes useful only when used in conjunction with procedures that summarize data. It is important to note that as of SAS® Version 8.2 only the MEANS, SUMMARY and TABULATE procedures are MULTILABEL enabled to take advantage of this option (SAS OnlineDoc, Version 8, 2001). This paper limits the examples to PROC MEANS.

With that background we can now cover the specific details. One of the best ways to explain MULTILABEL is to examine how it can help us in our jobs as pharmaceutical programmers generating tables for clinical trials. Let us assume that we are working on a drug for weight loss. One of our table specifications is to categorize patients under each treatment group into age groups of “Under 25”, “Under 50”, “Under 75”, “Under 100” and “100 and Above”.
So the final table would look something like Table 1 below:

Table 1
Number and Percentage of Patients with Significant Weight Loss

    Age Group        Placebo      Block Buster
    Under 25         x (x.x)      x (x.x)
    Under 50         x (x.x)      x (x.x)
    Under 75         x (x.x)      x (x.x)
    Under 100        x (x.x)      x (x.x)
    100 and Above    x (x.x)      x (x.x)

Note that all the categories except the last row are overlapping. As an example, the “Under 50” category has all the “Under 25” elements subsumed but not vice versa. Based on what we know thus far, PROC FORMAT with its new MULTILABEL option could make this task easier. Without it, we would have read the data, flagged each overlapping range with a unique value and output the observation once for every group it belongs to. Perhaps something like below:

    data <output data> ;
      set <input data> ;
      length group $13 ;
      /* Write out the observations to the output data set */
      if age lt 25 then do;
        group = 'Under 25' ;
        output ;
      end ;
      if age lt 50 then do;
        group = 'Under 50' ;
        output ;
      end ;
      if age lt 75 then do;
        group = 'Under 75' ;
        output ;
      end ;
      if age lt 100 then do;
        group = 'Under 100' ;
        output ;
      end ;
      if age ge 100 then do;
        group = '100 and Above' ;
        output ;
      end ;
    run ;

At best, it is a kludge. Let us now examine how this can be done more elegantly. Let us first go about the task of creating the sample data to be used in the demonstration.

CREATING SAMPLE DATA

First, let us cook up some demographic data (see Table 2). To keep the clutter to a minimum only 10 patients and their treatment groups are considered. The study is given an arbitrary number of 999. Again for the sake of simplicity this is limited to 2 treatment groups: a Placebo group and the Company Drug, fictionally referred to as Block Buster. Let us now assume that we have all the weight information available and we are considering the weight loss at termination (see Table 3). Further, let us assume that the patients in the demo data are the only ones with clinically significant weight loss (whatever those criteria may be).
Here are the cooked up data sets of demographic data and patients with weight loss.

    data demo ( label = "Cooked Demo" ) ;
      retain study 999 ;
      do patient = 1 to 10 ;
        rxgroup = mod(patient,2)+1 ;
        output ;
      end ;
    run ;

Table 2
Demographic Data

    Obs    study    patient    rxgroup
      1      999          1          2
      2      999          2          1
      3      999          3          2
      4      999          4          1
      5      999          5          2
      6      999          6          1
      7      999          7          2
      8      999          8          1
      9      999          9          2
     10      999         10          1

    data weightloss ( label = "Cooked Weight Loss" drop = seed: ) ;
      retain Seed_1 129857 weightloss 1 ;
      do patient = 1 to 10 ;
        call ranuni ( Seed_1 , age ) ;
        age = ceil(100*age) ;
        output ;
      end ;
    run ;

Table 3
Clinically Significant Weight Loss Data

    Obs    weightloss    patient    age
      1             1          1     69
      2             1          2     82
      3             1          3     99
      4             1          4     73
      5             1          5     55
      6             1          6     45
      7             1          7     65
      8             1          8     90
      9             1          9     11
     10             1         10     14

Next we need some formats to assign treatment groups. We will spend some time examining how the MULTILABEL format is specified to categorize the ages into the overlapping groups needed for reporting purposes. Once we have used the FORMAT to summarize, we can use an INFORMAT to order the age groups to meet the reporting requirements.

    proc format ;
      value rxfmt
        1 = "Placebo"
        2 = "Block Buster" ;

Readers are familiar enough with simple formats that this one needs no explanation beyond noting that it creates RXFMT. Next comes the main focus of this paper: creating a MULTILABEL FORMAT. The syntax is not complicated. Go about creating a format the normal way. The only difference is to specify the word MULTILABEL in parentheses next to the format name.

      value agefmt (multilabel)
        low - 24   = "Under 25"
        low - 49   = "Under 50"
        low - 74   = "Under 75"
        low - 99   = "Under 100"
        100 - high = "100 and Above" ;

It is as simple as that. Everything is the same as in assigning an ordinary FORMAT except for the explicit mention of MULTILABEL. We have now passed the information to PROC FORMAT to create this format in a special way.
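As a rough sketch of how a MULTILABEL format can be inspected once it exists, the steps below write a format to a control data set with the CNTLOUT= option and print a few of its columns. The sketch repeats the AGEFMT definition so that it runs on its own. The data set name AGEFMT_CTL is an assumption made for this illustration and does not come from the paper.

```sas
/* Re-create AGEFMT so that this sketch is self-contained */
proc format ;
  value agefmt (multilabel)
    low - 24   = "Under 25"
    low - 49   = "Under 50"
    low - 74   = "Under 75"
    low - 99   = "Under 100"
    100 - high = "100 and Above" ;
run ;

/* Write the stored definition of AGEFMT to a control data set. */
/* AGEFMT_CTL is a hypothetical name chosen for this example.   */
proc format cntlout = work.agefmt_ctl ;
  select agefmt ;
run ;

/* One row per range; HLO holds the flags SAS stores for each range */
proc print data = work.agefmt_ctl noobs ;
  var fmtname start end label hlo ;
run ;
```

In the printed control data set, the HLO column is where the flag marking a MULTILABEL format would be expected to appear on each row.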
Once the PROC FORMAT step is executed, the AGEFMT format is created and SAS internally assigns “M” to the value of HLO to flag it as a MULTILABEL format. The only way to see this is to output a control data set for the AGEFMT format and view the data.

      /* Since the Multilabel format is used by the class statement
         in PROC MEANS we need an informat to order the age categories */
      invalue ageinf
        "Under 25"      = 1
        "Under 50"      = 2
        "Under 75"      = 3
        "Under 100"     = 4
        "100 and Above" = 5 ;
    run ;

Now we simply merge the demographic data with the Weight Loss data by Patient number.

    data demowgt ;
      merge demo weightloss ;
      by patient ;
    run ;

After merging with the demographic data we have all the relevant information to summarize the data. Let us now look at how the MULTILABEL option of the FORMAT can be used.

SUMMARIZE DATA BY THE MULTILABEL GROUPS

We will use the MEANS procedure to summarize by the MULTILABEL groups. You may have noticed that there are no patients in the “100 and Above” age category in our data set. However, reporting this category is a requirement. One of the recent enhancements to PROC MEANS is the COMPLETETYPES option. This tells the procedure to create all possible combinations of class variable values. Another enhancement is PRELOADFMT, which is used in conjunction with the COMPLETETYPES option. The PRELOADFMT option is an instruction to load the FORMAT that is assigned to the class variable prior to the analysis. Since the FORMAT already has the “100 and Above” category, this will be preloaded and then output, albeit with a count of zero. Let us now examine how we would specify the syntax for PROC MEANS to meet our reporting requirements.

    %*----------------------------------------------------;
    %* NOTE: PROC MEANS takes advantage of the            ;
    %*       Multilabel format with the MLF option        ;
    %*       specified in the CLASS statement.            ;
    %*----------------------------------------------------;
    proc means noprint data = demowgt completetypes ;   /* (1) */
      class age / mlf preloadfmt ;                       /* (2) */
      class rxgroup ;
      var weightloss ;
      output out = sumweight ( drop=_type_ _freq_ ) n = N_by_RX ;
      types age*rxgroup ;
      format age agefmt. ;
    run ;

The most important lines for us are (1) and (2). Line (1) carries the COMPLETETYPES option. This means all possible combinations of the CLASS variable values have to be considered. Line (2) carries the MLF option, which activates the multiple labels, and PRELOADFMT. All the possible combinations are fed by the MULTILABEL AGEFMT FORMAT, which is preloaded before analyzing the data. The VAR statement must contain a numeric type variable. We are interested only in the counts in each category, so only the N statistic is requested in the output data set SUMWEIGHT. The TYPES statement requests the crossings between the age categories and RXGROUP. Table 4 presents a PROC PRINT of the SUMWEIGHT data:

Table 4

    Obs    age              rxgroup    N_by_RX
      1    100 and Above          1          0
      2    100 and Above          2          0
      3    Under 100              1          5
      4    Under 100              2          5
      5    Under 25               1          1
      6    Under 25               2          1
      7    Under 50               1          2
      8    Under 50               2          1
      9    Under 75               1          3
     10    Under 75               2          4

Notice that the 100 and Above age category is added to the output even though no one in the data set qualified for that age category. This is a result of the COMPLETETYPES option, which used PRELOADFMT to pick up this age category from the AGEFMT FORMAT. Also note that the weight loss information is summarized correctly using the overlapping age ranges specified in the MULTILABEL AGEFMT format. After sorting by the AGEINF INFORMAT value of AGE, we can transpose the data and output the results presented in Table 5.

    proc sql ;
      create table sorteddata as
        select age
             , put(rxgroup,rxfmt.) as arxgroup
             , n_by_rx
        from sumweight
        order by input(age,ageinf.)
               , rxgroup ;
    quit ;

    data transposed ( drop = N_by_RX arxgroup ) ;
      do _n_ = 1 by 1 until ( last.age ) ;
        set sorteddata ;
        by age notsorted ;
        array rx (2) ;
        rx(_n_) = N_by_RX ;
      end ;
      label age = "Age*Group"
            rx1 = "Placebo"
            rx2 = "Block*Buster" ;
    run ;

The syntax used above may be unfamiliar to the audience. There is an entire SUGI 28 paper devoted to this (Chakravarthy, 2003). However, there is always another way to transpose the data, and you can proceed further without any loss of information.

    title1 "Block Buster Busted" ;
    title2 "The treatment group columns reflect number having Significant Weight Loss" ;
    proc print data = transposed label split = "*" ;
    run ;
    title ;

Table 5
Block Buster Busted
The treatment group columns reflect number having Significant Weight Loss

    Obs    Age Group        Placebo    Block Buster
      1    Under 25               1               1
      2    Under 50               2               1
      3    Under 75               3               4
      4    Under 100              5               5
      5    100 and Above          0               0

Once we divide by the N for the treatment groups we will get the percentage with weight loss by treatment group, which is not presented here. So far we have covered one aspect of the MULTILABEL format, i.e. the ability to specify overlapping ranges. Let us now proceed to the other function: the ability to specify secondary labels to the same value.

MULTILABEL – SECONDARY LABELS

Once again, it is best to relate this to the work done in clinical trials. This time let us look at a possible application to an NDA filing. There are usually a number of clinical trials performed at varying doses of the company drug before it is ready for filing. When the Summary of Clinical Safety (SCS) or Integrated Summary of Safety (ISS) is written, a number of the tables that go into it carry an additional treatment column. This is the “Any Dose of the Company Drug” column. It is not captured anywhere in the database as such. There are also typically a few comparator drugs that comprise the last few columns of a table (see Table 6 for an illustration).
Notice that the “Any Dose” column must precede them. The same can also be expected from an Integrated Summary of Efficacy (ISE).

Table 6
The Overall N of treatment group columns including the Any Dose of the Company Drug

    Placebo    Company 10mg    Company 20mg    Any Dose    Comparator
    N=25       N=24            N=26            N=50        N=25

It can be seen that the ANY DOSE column is made up of the two company drug columns. This kind of table specification naturally lends itself to the other type of MULTILABEL application, i.e. the ability to specify secondary labels to the same value. This is presented next.

We have already seen, with elaborate explanations above, how the MULTILABEL format is created and how PROC MEANS uses it. If anything in the syntax below is unclear, the reader is encouraged to go back to the relevant sections and read them again. The second example will be short and strictly focused on extracting the information for Table 6.

Let us first create a data set that has the patients pooled from different studies and then reassigned a unique patient ID. To simplify matters, we only have the patient numbers and the treatment group information in this data set. This is the only source data that will be used. The reader can run the following code and view the patient data for further clarity. There are 4 treatment groups present in the data, and a 5th group, “Any Dose”, will be added using the MULTILABEL format. Treatment groups 2 and 3 are the company drugs, and these two together form the “Any Dose” column. With this background let us proceed with the task of creating the data, FORMAT and INFORMAT before summarizing it.

    data rxgroups ;
      do drug = 1 to 4 ;
        do _n_ = 1 to ceil(100*ranuni(196)) ;
          patient + 1 ;
          output ;
        end ;
      end ;
    run ;

The specification of multiple labels is presented below.
    proc format ;
      value drugfmt (multilabel)
        1 = "Placebo"
        2 = "Company 10mg"
        2 = "Any Dose"
        3 = "Company 20mg"
        3 = "Any Dose"
        4 = "Comparator" ;
      invalue order
        "Placebo"      = 1
        "Company 10mg" = 2
        "Company 20mg" = 3
        "Any Dose"     = 4
        "Comparator"   = 5 ;
    run ;

Since the specification of multiple labels is the other feature covered, this deserves some explanation. The FORMAT DRUGFMT assigns the value 2 to two separate labels. Likewise the value 3 is assigned two labels. One label carries the dosage and the other is the “Any Dose” treatment group that is not present in the data. We will now use this in PROC MEANS as we did with AGEFMT before.

    proc means noprint data = rxgroups completetypes ;
      class drug / mlf preloadfmt ;
      var drug ;
      output out = Overall_N ( drop=_type_ _freq_ ) n = N_by_RX ;
      types drug ;
      format drug drugfmt. ;
    run ;

Once again we order the data with the INFORMAT built specifically for this purpose.

    proc sql ;
      create table sortedRX as
        select *
        from overall_n
        order by input(drug,order.) ;
    quit ;

By using DRUG as the ID variable in PROC TRANSPOSE we ensure that the column headers are named by the distinct values present in the DRUG column (see Table 7).

    proc transpose data = sortedrx out = result (drop=_name_) ;
      id drug ;
    run ;

    title1 "Table 7" ;
    title2 "The Overall N of treatment groups" ;
    proc print data = result noobs split="_";
    run ;
    title ;

Table 7
The Overall N of treatment groups

    Placebo    Company 10mg    Company 20mg    Any Dose    Comparator
         26              14              37          51            85

Thus we have the essential information summarized to meet the spec in Table 6. It requires a little polishing before it can be fully validated, but that is not the focus here. The focus is how the “Any Dose” column was built on the fly using the enhanced capabilities of the MULTILABEL format.

Let us now revisit THIS and THAT, which we addressed at the beginning of the paper, and mention some of the limitations of the MULTILABEL format.
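For reference, the two definitions that PROC FORMAT rejected at the start of the paper can be written with the MULTILABEL option roughly as sketched below. The data set ONE and its variable X are hypothetical and exist only to give PROC MEANS something to summarize; the paper itself does not include this step.

```sas
proc format ;
  value this (multilabel)      /* overlapping ranges */
    0 - 1 = "Under 2"
    0 - 2 = "Under 3" ;
  value that (multilabel)      /* secondary label for a single value */
    1 = "One"
    1 = "Again One" ;
run ;

/* Hypothetical test data: x takes the values 0, 1 and 2 */
data one ;
  do x = 0 to 2 ;
    output ;
  end ;
run ;

/* MLF asks PROC MEANS to count an observation under every label it matches */
proc means data = one n ;
  class x / mlf ;
  var x ;
  format x this. ;
run ;
```

Swapping `this.` for `that.` in the FORMAT statement exercises the secondary-label case in the same way.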
We have created the equivalent of THIS with the MULTILABEL AGEFMT format that covered overlapping ranges. We also created something like THAT with the DRUGFMT format that assigned multiple labels to the same values. However, in doing THIS and THAT there are some limitations, which are addressed next.

LIMITATIONS OF THE MULTILABEL FORMAT

The first limitation is that, in practice, only two procedures can use it: TABULATE and MEANS (SUMMARY and MEANS are almost identical). This does not allow for more robust user testing and adaptation. Yes, we can do THIS and THAT with the FORMAT, but we can use the result only in this small set of procedures. With some experimentation, it was found that more than one MULTILABEL format could not be used in PROC MEANS. This is reportedly fixed in Version 9.1.

CONCLUSION

In conclusion, the MULTILABEL format is a useful addition to the FORMAT procedure. When used with the MULTILABEL enabled procedures it has useful applications in clinical trials reporting, including tables with special requirements generated for the filing of an NDA. There is scope for improvement, and it should be extended to other procedures.

REFERENCES

SAS OnlineDoc (2001), SAS Procedures Guides: The FORMAT Procedure. SAS Institute Inc., Cary, NC, USA.

SAS OnlineDoc (2001), SAS Procedures Guides: The MEANS Procedure. SAS Institute Inc., Cary, NC, USA.

Chakravarthy, V. (2003), “The DOW (not that DOW!!!) and the LOCF in Clinical Trials,” Proceedings of the Twenty Eighth Annual SAS Users Group International Conference, 99-28, 1-4.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Venky Chakravarthy
1591 Abigail Way
Ann Arbor, MI 48103
Email: [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.