Non-standard graphical representation of clinical data

David Granger - Parexel International Limited

Abstract. In this paper I demonstrate how Parexel International Limited use SAS/GRAPH® software to obtain some unusual graphical displays of clinical data.

Introduction. Parexel is a worldwide leader in providing independent clinical research to the pharmaceutical industry. In one of our many roles we provide biostatistical analyses, which often require presentation-quality graphics to accompany our statistical and clinical reports. Parexel run SAS® software version 6.04 on PC/DOS and version 6.07 under VAX/VMS. All source code for this paper was written under VAX/VMS.

Background information. When analysing data from clinical trials, it is useful to calculate various summary statistics. Commonly data possess the properties of the normal distribution (i.e. amongst other things, that the distribution of the data is symmetrical and the nature of the data is continuous). If there is evidence to suggest that these properties hold, then a formal analysis based on the summary statistics mean and standard deviation (a standard measure of the variation in the data) may be appropriate.

If it is suspected that the data are not normally distributed, then an analysis based on the ranks of the values may be more appropriate. The corresponding summary statistics for this type of non-parametric analysis include the median (the 50th percentile point) and the upper and lower quartiles (the 75th and 25th percentile points). These methods are robust, as they are not affected by extreme values (the upper and lower quartiles cover the middle 50% of the data), and they can handle data that may be very skewed (asymmetrical).

I Graph displaying median, upper and lower quartiles.
If an analysis based on the ranks of the data is performed (i.e. an analysis using the percentile points of the distribution), a graph showing the mean and standard deviation of the data is inappropriate. A graph showing medians and quartiles is more suitable. SAS does not provide an interpolation option for the GPLOT procedure that will join the lower and upper quartiles of Y at each X value and join the median value of Y between X values. (I=STD1TJ joins the mean value of Y between X values, and joins the mean value of Y ± 1 standard deviation at each X value.) We can, however, achieve this form of interpolation by using the I=HILOCJT option. This option is primarily designed to display stock market data graphically, as it joins the maximum and minimum values of Y at each X value and joins the closing value of Y (the last value) between X values.

Example. We have data available for approximately 200 patients recording the number of panic attacks in the last 5 days, in the form

    TRT  PATIENT  VISIT  PANIC
    A    1        1      8
    A    1        2      6

where TRT takes the values A, B, C, D, and VISIT ranges from 1 through to 7. Summarizing the data using the UNIVARIATE procedure, and transposing, we can obtain data of the form

    TRT  VISIT  _NAME_
    A    1      LQ
    A    1      MED
    A    1      UQ
    A    2      LQ
    A    2      MED
    A    2      UQ
    ...
    D    7      LQ

with the value of each statistic held in the variable COL1.

In many situations the summary statistics, and hence graphics, for each treatment group (TRT) will overlap. To clarify the graph, the values of VISIT can be offset using a simple data step.

    data a;
      set summary;
      /* shift each treatment group sideways so the bars do not coincide */
      if trt='A' then visit=visit-0.15;
      else if trt='B' then visit=visit-0.5;
    run;

The I=HILOCJT interpolation option can now be used because, for each TRT and VISIT combination, we have a suitable HIGH value (_NAME_ = UQ), LOW value (_NAME_ = LQ), and CLOSE value (_NAME_ = MED). A GPLOT procedure of the form

    proc gplot;
      /* one SYMBOL statement per treatment group; L= sets the line type */
      symbol1 v=none i=hilocjt l=1;
      symbol2 v=none i=hilocjt l=2;
      symbol3 v=none i=hilocjt l=3;
      symbol4 v=none i=hilocjt l=4;
      plot col1*visit=trt;
    run;

will produce the required output.
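The summarising and transposing steps mentioned above are not listed in this paper. A minimal sketch of how they might be written is given here, assuming the raw data are held in a dataset called PANICS with the variables TRT, PATIENT, VISIT, and PANIC. The dataset names PANICS and QUART are hypothetical, and the statistic names LQ, MED, and UQ are chosen only so that the result matches the layout of the SUMMARY dataset read by the offsetting data step.

```sas
/* Assumption: PANICS holds one observation per patient and visit. */
proc sort data=panics;
  by trt visit;
run;

/* One observation per treatment group and visit, holding the quartiles. */
proc univariate data=panics noprint;
  by trt visit;
  var panic;
  output out=quart q1=lq median=med q3=uq;
run;

/* Turn the three statistics into three observations per BY group:  */
/* _NAME_ holds LQ, MED, or UQ and COL1 holds the value.            */
proc transpose data=quart out=summary;
  by trt visit;
  var lq med uq;
run;
```

With this arrangement each TRT and VISIT combination contributes exactly three values of COL1, which is what the I=HILOCJT interpolation expects.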
An example of such a graph is shown in Figure I.

II Graph displaying mean ± 1 standard deviation(s).

If an analysis based on mean values has been performed, the results of the analysis can be complemented by a graph showing means and standard deviations. SAS provides the STD... interpolation methods to produce plots of mean values of Y ± standard deviations at each X value (see above). Unlike the graphs described previously, which display medians and quartiles, graphs showing mean values and standard deviations are by their very nature symmetrical about the mean. It is possible to exploit this fact to produce some unusual, yet informative, graphs.

Example. Systolic blood pressure data are available for the 200 patients previously analysed, and for the purposes of this example we will consider only treatment groups A and B. As blood pressure data are normally distributed, an analysis based on this fact is appropriate, and hence graphs showing means and standard deviations are required. Since the summary statistics for the two treatment groups are likely to overlap, the data at each visit would usually have to be offset by a small amount, as described in section I. We can use the symmetry of the situation to remove some redundant information from the graph, and hence remove the need for the 'offsetting procedure'.

By using a summarising procedure, MEANS or SUMMARY perhaps, we can obtain the mean and standard deviation for each treatment group (TRT) and visit (VISIT) combination, producing a dataset of the form

    TRT  VISIT  MEAN  STD
    A    1      163   5
    A    2      166   5
    ...
    B    7      149   10

We shall use the notation meanA for the mean systolic blood pressure in group A, and meanB for the corresponding group B value. Similarly we define stdA and stdB. By comparing the range [mean-std, mean+std] for treatment groups A and B at each visit, we can decide which one of the following situations applies.
    i    meanA > meanB  and  meanA-stdA > meanB+stdB
    ii   meanA < meanB  and  meanA+stdA < meanB-stdB
    iii  meanA > meanB  and  meanA-stdA ≤ meanB+stdB
    iv   meanA < meanB  and  meanA+stdA ≥ meanB-stdB
    v    meanA = meanB

If situation i or ii holds then we can draw the lines mean±std, as we know that the lines for the two treatment groups will not overlap.

If, however, situation iii, iv, or v holds then we know that the lines covering the range mean±std will overlap. Due to the symmetry of the situation we will not lose any information by drawing a single standard deviation away from each mean value, e.g. if iii holds we draw lines from meanA to meanA+stdA, and from meanB to meanB-stdB.

We can use data step(s) to perform the above operation and output a dataset of the form

    TRT  VISIT  VALUE
    A    1      163
    A    1      158
    A    1      168
    ...
    B    7      149
    B    7      139
    B    7      149

where if i or ii holds (VISIT=1, for example) we output 3 observations for each treatment group, i.e. for TRT=A we output observations with VALUE = meanA, meanA-stdA, and meanA+stdA, and for TRT=B we output observations with VALUE = meanB, meanB-stdB, and meanB+stdB. If iii holds (VISIT=7, for example) we again output 3 observations for each treatment group, but now for TRT=A we output observations with VALUE = meanA (twice) and meanA+stdA, and for TRT=B we output observations with VALUE = meanB (twice) and meanB-stdB.

This data can then be displayed using the GPLOT procedure and the I=HILOCJT interpolation method. Figure II shows the completed example.

In practice this method is often used to display confidence limits, where we replace mean±std in the comparisons with the appropriate confidence limits.

III Decision tree representation of estimated probabilities

When fitting a statistical model to our data it is often easier to interpret the results if they are displayed graphically.

Example. The data listed below are the result of a logistic regression with outcome 'disease developed', in which the effect of several demographic covariates was investigated.
    Age (yrs)  Smoking status  Alcohol consumption  Probability
    <55        Smoker          None                 0.170
    <55        Smoker          < 10 units           0.220
    <55        Smoker          10 units +           0.280
    <55        Non-Smoker      None                 0.050
    <55        Non-Smoker      < 10 units           0.130
    <55        Non-Smoker      10 units +           0.160
    55 +       Smoker          None                 0.200
    55 +       Smoker          < 10 units           0.300
    55 +       Smoker          10 units +           0.370
    55 +       Non-Smoker      None                 0.070
    55 +       Non-Smoker      < 10 units           0.200
    55 +       Non-Smoker      10 units +           0.150

The outline below demonstrates how SAS/GRAPH software, and in particular the ANNOTATE dataset, can be used to produce some powerful graphical tools for displaying this data. At the time of writing, several analyses of the above type had been performed which needed to be displayed in a neat and concise graphical manner. This led to the development of a macro which, with a minimum number of parameters, would take a dataset of the above form and produce a graph similar to the one presented below.

[Diagram: decision tree with the columns Age, Smoking status, Alcohol consumption, and Probability; the full version appears as Figure III.]

The BOXDIAG macro. This macro has been developed to accept, as input, a dataset of the above form where the n classification variables (age, smoking status, and alcohol consumption in this case) are named newlev1-newlevn (in this example we have the variables newlev1-newlev3). In this example we have one probability column, and this is stored in the variable prob1. In general there may be more than one probability column, and so the macro has been designed to handle the variables prob1-probp, where we have p probability variables. The labels of the variables newlev1-newlevn and prob1-probp will be used to create the column headings, and the values of these variables will be the text that is to be displayed on the graph (<55, Non-Smoker, etc.).
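For concreteness, an input dataset of this form could be built as follows. This is only a sketch: the dataset name RISK, the variable lengths, and the use of comma-delimited DATALINES are assumptions made for illustration, whereas the variable names newlev1-newlev3 and prob1, and the use of their labels as column headings, follow the description above. The variable prob1 is read as character on the reading that the text displayed on the graph is taken directly from the variable values.

```sas
/* Hypothetical input dataset for the BOXDIAG macro (name RISK assumed). */
data risk;
  length newlev1 $8 newlev2 $12 newlev3 $12 prob1 $8;
  infile datalines dlm=',';
  input newlev1 $ newlev2 $ newlev3 $ prob1 $;
  /* The labels supply the column headings of the diagram. */
  label newlev1 = 'Age (yrs)'
        newlev2 = 'Smoking status'
        newlev3 = 'Alcohol consumption'
        prob1   = 'Probability';
  datalines;
<55,Smoker,None,0.170
<55,Smoker,< 10 units,0.220
<55,Smoker,10 units +,0.280
<55,Non-Smoker,None,0.050
<55,Non-Smoker,< 10 units,0.130
<55,Non-Smoker,10 units +,0.160
55 +,Smoker,None,0.200
55 +,Smoker,< 10 units,0.300
55 +,Smoker,10 units +,0.370
55 +,Non-Smoker,None,0.070
55 +,Non-Smoker,< 10 units,0.200
55 +,Non-Smoker,10 units +,0.150
;
run;
```

The rows are entered in the order in which the cells are to appear, reading down the diagram.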
Note that these are character strings and not formatted values. Calls to the BOXDIAG macro are of the form

    %boxdiag(dset,n,p,maxh,maxw,w1,w2,w3,w4,w5,w6);

At present the macro is designed to cater for a maximum of 6 classification columns, but it is a simple matter to extend this if required. The parameters are:

    dset   The name of the input dataset
    n      The number of classification variables newlev1-newlevn
    p      The number of probability variables prob1-probp
    maxh   The maximum height of the procedure output area to be used by the display (%)
    maxw   The width of the procedure output area to be used by the classification variables (%)
    w1     The width of the newlev1 column (% of procedure output area)
    ...
    w6     The width of the newlev6 column (% of procedure output area)

With the parameters n=3, maxw=80, w1=30, w2=20, w3=15, the classification columns occupy 30+20+15=65% of the procedure output area. As we have specified that 80% of the width of this area is to be used, the remaining 15% will be divided evenly between the columns, i.e. the space between columns 1 and 2, and between columns 2 and 3, will be 15/2=7.5%. A check is built into the macro to ensure that w1+...+wn ≤ maxw.

The main steps performed in the macro achieve the following objectives.

1 Create a dataset containing all the changes in the values of the classification variables, storing them in an (n,100) array. Here the changes for column n are stored in the elements (n,1) to (n,100). Given the above data this may look something like this

    <55    Smoker      None
    55 +   Non-Smoker  < 10 units
           Smoker      10 units +
           Non-Smoker  None
                       < 10 units
                       10 units +
                       ...

The probabilities and column headings are stored in a similar manner. The variable labels are obtained with the LABEL call routine.

2 From the dataset created in step 1 it is possible to determine the maximum number of cells (boxes) needed in a column by looking at the data for the furthest right classification column (newlev3 in this case).
In the example 12 cells are required in column 3. Given this information the size and vertical position of each cell in column 3 can be determined. The horizontal coordinates are determined from the macro variables w1-wn. This also allows us to determine the positions of the probability cells (prob1).

3 Again, from the dataset created in step 1 we can see that each cell in column 1 joins to 2 cells in column 2, and each cell in column 2 joins to 3 cells in column 3. From this information we can work out the vertical positions of the cells in all but the far right classification column. As before, the horizontal positions come from the w1-wn macro variables.

4 Once all the coordinates have been calculated, a data step creates an annotate dataset that is later used to draw the cells in each of the columns and the lines connecting the relevant cells. The annotate functions MOVE and DRAW are used here.

5 Another annotate dataset is created to place all the text on the graph.

6 The annotate datasets are combined and the variables XSYS, YSYS, and HSYS are set to '5' so that the coordinates are given as a percentage of the procedure output area.

7 The GSLIDE procedure is used to display the graph produced from the annotate dataset, as any titles and footnotes will automatically be included.

Figure III shows a graph of the above data produced using the BOXDIAG macro.

Conclusion

Through the above three examples I hope to have shown how SAS/GRAPH software can be used in an innovative manner. Each example has increased in complexity, with the corresponding source code increasing from approximately 10, to 60, and to 350 lines.

SAS and SAS/GRAPH are registered trademarks of SAS Institute Inc., Cary, NC, USA.

Parexel International Limited
Craven House
40 Uxbridge Road
Ealing
LONDON W5 2BS
(0) 81 579-8292
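Steps 4 to 7 of the BOXDIAG outline can be illustrated with a much reduced sketch. The code below is not the macro itself: it draws a single cell, one connecting line, and one text label at hard-coded coordinates. The coordinates, the text size, and the dataset names BOXES, TEXT, and ANNO are all assumptions made for illustration. What it shares with the outline is the use of the annotate functions MOVE and DRAW (and LABEL for the text), the setting of XSYS, YSYS, and HSYS to '5', and the display through the GSLIDE procedure.

```sas
/* Step 4 (sketch): one cell drawn as four line segments, plus a connector. */
data boxes;
  length function $8;
  function = 'move'; x = 10; y = 45; output;  /* bottom-left corner          */
  function = 'draw'; x = 30; y = 45; output;  /* bottom edge                 */
  function = 'draw'; x = 30; y = 55; output;  /* right edge                  */
  function = 'draw'; x = 10; y = 55; output;  /* top edge                    */
  function = 'draw'; x = 10; y = 45; output;  /* left edge                   */
  function = 'move'; x = 30; y = 50; output;  /* middle of the right edge    */
  function = 'draw'; x = 40; y = 50; output;  /* line towards the next column */
run;

/* Step 5 (sketch): text centred inside the cell. */
data text;
  length function $8 text $20;
  function = 'label';
  x = 20; y = 50;
  position = '5';   /* centre the text on (x, y)          */
  size = 3;         /* text height, in the units of HSYS  */
  text = '<55';
  output;
run;

/* Step 6: combine, and give all coordinates as percentages of the */
/* procedure output area.                                           */
data anno;
  set boxes text;
  xsys = '5'; ysys = '5'; hsys = '5';
run;

/* Step 7: GSLIDE displays the annotation with any titles and footnotes. */
title 'Probability of developing disease by demographic factors';
proc gslide annotate=anno;
run;
quit;
```

In the macro the coordinates would of course be calculated from maxh, maxw, and w1-wn rather than typed in.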
Figure I Plot of number of panic attacks per week showing medians, upper and lower quartiles. [Horizontal axis: Visit (1 to 7). Legend: Treatment Group A, B, C, D.]

Figure II Plot of systolic blood pressure showing means +/- 1 standard deviation. [Horizontal axis: Visit (1 to 7). Legend: Treatment Group A, B.]

Figure III Probability of developing disease by demographic factors. [Columns: Age (yrs), Smoking status, Alcohol consumption, Probability. Note: Probabilities are adjusted.]
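The data step(s) described in section II, which decide between full and half bars, are not listed in this paper. A minimal sketch of one way to write them is given here. It assumes a dataset called STATS (a hypothetical name) with one observation per TRT and VISIT, holding the variables MEAN and STD, that only groups A and B are present, that both groups have data at every visit, and that the standard deviations are not negative. The dataset names SORTED and BARS and the variables LOA, HIA, LOB, and HIB are also introduced only for this illustration. For situation v the text does not say which group is drawn upwards, so the direction chosen below is arbitrary.

```sas
proc sort data=stats out=sorted;
  by visit trt;
run;

/* Put the group A and group B statistics for each visit side by side. */
data bars(keep=trt visit value);
  merge sorted(where=(trt='A') rename=(mean=meana std=stda))
        sorted(where=(trt='B') rename=(mean=meanb std=stdb));
  by visit;

  /* Situations i and ii: the ranges do not overlap, so draw both halves. */
  if (meana - stda > meanb + stdb) or (meana + stda < meanb - stdb) then do;
    loa = meana - stda; hia = meana + stda;
    lob = meanb - stdb; hib = meanb + stdb;
  end;
  /* Situation iii: A is higher, so A is drawn upwards and B downwards. */
  else if meana > meanb then do;
    loa = meana;        hia = meana + stda;
    lob = meanb - stdb; hib = meanb;
  end;
  /* Situations iv and v: B is drawn upwards and A downwards. */
  else do;
    loa = meana - stda; hia = meana;
    lob = meanb;        hib = meanb + stdb;
  end;

  /* Three observations per group and visit: mean, lower end, upper end. */
  trt = 'A';
  value = meana; output;
  value = loa;   output;
  value = hia;   output;
  trt = 'B';
  value = meanb; output;
  value = lob;   output;
  value = hib;   output;
run;

symbol1 v=none i=hilocjt l=1;
symbol2 v=none i=hilocjt l=2;
proc gplot data=bars;
  plot value*visit=trt;
run;
quit;
```

When a half bar is drawn, the mean appears twice among the three values for that group and visit, as in the VISIT=7 rows of the example dataset in section II.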