Non-standard graphical representation of clinical data
David Granger - Parexel International Limited
Abstract.
In the course of this paper I propose to demonstrate how Parexel International Limited use
SAS/GRAPH® software to obtain some unusual graphical displays of clinical data.
Introduction.
Parexel is a worldwide leader in providing independent clinical research to the pharmaceutical
industry. In one of our many roles we provide Biostatistical analyses which often require
presentation quality graphics to accompany our statistical and clinical reports.
Parexel run SAS® software version 6.04 on PC/DOS and version 6.07 under VAX/VMS. All
source code written for this paper has been written under VAX/VMS.
Background information.
When analysing data from clinical trials, it is useful to calculate various summary statistics.
Commonly data possess the properties of the normal distribution (ie amongst other things,
that the distribution of the data is symmetrical, and the nature of the data is continuous). If
there is evidence to suggest that these properties hold, then a formal analysis, based on the
summary statistics mean and standard deviation (a standard measure of the variation in the
data), may be appropriate.
If it is suspected that the data are not from the normal distribution (ie the data are not
normally distributed), then an analysis based on the ranks of the values may be more
appropriate. The corresponding summary statistics for this type of non-parametric analysis
include the median (the 50th percentile point), and the upper and lower quartiles (75th and 25th
percentile points). These methods are robust in their operation as they are not affected by
extreme values (the upper and lower quartiles cover the middle 50% of the data), and they
can handle data that may be very skewed (asymmetrical).
I Graph displaying median, upper and lower quartiles.
If an analysis based on the ranks of the data is performed (ie an analysis using the percentile
points of the distribution) a graph showing the mean and standard deviation of the data is
inappropriate. A graph showing medians and quartiles is more suitable.
SAS does not provide an interpolation option for the GPLOT procedure that will join the
lower and upper quartiles of Y at each X value, and join the median value of Y between X
values. (I=STD1TJ joins the mean value of Y between X values, and joins the mean value
of Y ±1 standard deviation at each X value.)
We can however achieve this form of interpolation by using the I=HILOCJT option. This
option is primarily designed to graphically display stock market data as it joins the maximum
and minimum values of Y at each X value and joins the closing value of Y (the last value)
between X values.
Example.
We have data available for approximately 200 patients recording the number of panic attacks
in the last 5 days available in the form
TRT   PATIENT   VISIT   PANIC
A     1         1       8
A     1         2       6
where trt takes the values A, B, C, D, and visit ranges from 1 through to 7.
Summarizing the data using the univariate procedure, and transposing we can obtain data
TRT   VISIT   _NAME_   COL1
A     1       MED      5
A     1       UQ       10
A     1       LQ       4
A     2       MED      5
A     2       UQ       9
A     2       LQ       3
...
D     7       LQ       0
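The summarising and transposing step might look like the sketch below. It assumes the raw data sit in a dataset called PANIC (a hypothetical name) holding the variables TRT, VISIT and PANIC listed earlier. The statistic names MED, UQ and LQ and the output dataset name SUMMARY are taken from the text; the intermediate dataset STATS and the order of the VAR list are assumptions, the latter chosen only to give the row order tabulated above.

```sas
/* Sketch only: PANIC and STATS are assumed dataset names. */
proc sort data=panic;
  by trt visit;
run;

/* Median and quartiles of PANIC for each treatment group and visit. */
proc univariate data=panic noprint;
  by trt visit;
  var panic;
  output out=stats median=MED q3=UQ q1=LQ;
run;

/* One row per statistic: _NAME_ holds MED, UQ or LQ, COL1 holds its value. */
proc transpose data=stats out=summary;
  by trt visit;
  var MED UQ LQ;
run;
```

PROC TRANSPOSE names the transposed column COL1 by default, which is why that name appears in the PLOT request later in this section.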
In many situations the summary statistics, and hence graphics, for each treatment group
(TRT) will overlap. To clarify the graph the values of visit can be offset using a simple
datastep.
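data a;
set summary;
/* shift each treatment group slightly so that the lines do not overprint */
if trt='A' then visit=visit-0.15;
else if trt='B' then visit=visit-0.05;
/* offsets for C and D are assumed here, by symmetry with A and B */
else if trt='C' then visit=visit+0.05;
else if trt='D' then visit=visit+0.15;
run;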
The I=HILOCJT interpolation option can now be used, as for each trt, visit combination we
have a suitable HIGH value (_NAME_ = UQ), LOW value (_NAME_ = LQ), and CLOSE
value (_NAME_ = MED).
A GPLOT procedure of the form
proc gplot data=a;
symbol1 v=none i=hilocjt l=1; symbol2 v=none i=hilocjt l=2;
symbol3 v=none i=hilocjt l=3; symbol4 v=none i=hilocjt l=4;
plot col1*visit=trt;
run;
will produce the required output. An example graph is shown in Figure I.
II Graph displaying mean ± 1 standard deviation(s).
If an analysis based on mean values has been performed, the results of the analysis can be
complemented by a graph showing means and standard deviations. SAS provides us with the
STD... interpolation methods to produce plots of mean values of Y ± standard deviations at
each X value (see above).
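For comparison, a minimal sketch of the built-in approach follows. It assumes a dataset BP (hypothetical name) with one observation per patient and visit, and the systolic pressure held in a variable SBP (also hypothetical). The interpolation keyword is the STD form described in section I.

```sas
/* Sketch only: BP and SBP are assumed names. */
symbol1 v=none i=std1tj l=1;   /* first group: solid line   */
symbol2 v=none i=std1tj l=2;   /* second group: broken line */

proc gplot data=bp;
  where trt in ('A', 'B');
  /* GPLOT works out the mean and mean +/- 1 SD at each visit itself */
  plot sbp*visit=trt;
run;
quit;
```

Because the bars are computed from the raw data inside GPLOT, this form gives no control over which half of each bar is drawn, which is the motivation for the approach described next.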
Unlike the graphs described previously that display medians and quartiles, graphs showing
mean values and standard deviations are symmetrical about the mean by their very nature.
It is possible to exploit this fact to produce some unusual, yet informative graphs.
Example.
Systolic blood pressure data is available for the 200 patients previously analysed, and for the
purposes of this example we will only consider treatment groups A and B.
As blood pressure data is normally distributed an analysis based on this fact is appropriate,
and hence graphs showing means and standard deviations are required.
Since the summary statistics for the two treatment groups are likely to overlap, the data at
each visit will usually have to be offset by a small amount as described in section I. We can
use the symmetry of the situation to remove some redundant information from the graph, and
hence remove the need for the 'offsetting procedure'.
By using a summarising procedure, MEANS or SUMMARY perhaps, we can obtain the mean
and standard deviation for each treatment group (TRT), visit (VISIT) combination, producing
a dataset of the form
TRT   VISIT   MEAN   STD
A     1       163    5
A     2       166    5
...
B     7       149    10
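A dataset of this shape could be produced along the following lines. This is a sketch under assumptions: the raw dataset name BP, the analysis variable SBP and the output dataset name STATS are all hypothetical.

```sas
/* Sketch only: BP, SBP and STATS are assumed names. */
proc means data=bp noprint nway;
  where trt in ('A', 'B');          /* only groups A and B are considered */
  class trt visit;
  var sbp;
  /* one row per TRT, VISIT combination with its mean and standard deviation */
  output out=stats(drop=_type_ _freq_) mean=mean std=std;
run;
```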
We shall use the notation meanA for the mean systolic blood pressure in group A, and meanB
for the corresponding group B value. Similarly we define stdA and stdB.
By comparing the range [mean-std, mean+std] for treatment groups A and B at each visit we
can decide which one of the following situations applies.
i     meanA > meanB and meanA-stdA > meanB+stdB
ii    meanA < meanB and meanA+stdA < meanB-stdB
iii   meanA > meanB and meanA-stdA ≤ meanB+stdB
iv    meanA < meanB and meanA+stdA ≥ meanB-stdB
v     meanA = meanB
If situations i or ii hold then we can draw the lines mean±std as we know that the lines for
each treatment group will not overlap.
If, however, situations iii, iv, or v hold then we know that the lines covering the range
mean±std will overlap. Due to the symmetry of the situation we will not lose any information
by drawing a single standard deviation away from each mean value.
eg if iii holds we draw lines from meanA to meanA+stdA, and from meanB to meanB-stdB.
We can use datastep(s) to perform the above operation and output a dataset of the form
TRT   VISIT   VALUE
A     1       163
A     1       158
A     1       168
...
B     7       149
B     7       139
B     7       149
where if i or ii hold (VISIT=1, for example) we output 3 observations for each treatment
group, ie for TRT=A we output observations with VALUE = meanA, meanA-stdA, meanA+stdA,
and for TRT=B we output observations with VALUE = meanB, meanB-stdB, and meanB+stdB.
If iii holds (VISIT=7, for example) we again output 3 observations for each treatment group,
but now for TRT=A we output observations with VALUE = meanA (twice), meanA+stdA, and
for TRT=B we output observations with VALUE = meanB (twice), meanB-stdB.
This data can then be displayed using the GPLOT procedure and the I=HILOCJT interpolation
method.
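The datasteps described above might be sketched as follows. The input dataset STATS (with TRT, VISIT, MEAN and STD) and the names WIDE, PLOTDATA, SIDEA and SIDEB are hypothetical. The text does not say which way the bars are drawn when the two means are equal (situation v); treating it like situation iii is an assumption made here.

```sas
/* Sketch only: dataset and variable names are assumed. */
proc sort data=stats;
  by visit trt;
run;

/* Put the group A and group B statistics for each visit on one row. */
data wide;
  merge stats(where=(trt='A') rename=(mean=meana std=stda))
        stats(where=(trt='B') rename=(mean=meanb std=stdb));
  by visit;
  drop trt;
run;

data plotdata;
  set wide;
  length trt $1;
  /* side: 0 = draw both halves, 1 = upper half only, -1 = lower half only */
  if meana - stda > meanb + stdb or meana + stda < meanb - stdb then do;
    sidea = 0;  sideb = 0;            /* situations i and ii: no overlap */
  end;
  else if meana >= meanb then do;
    sidea = 1;  sideb = -1;           /* situation iii (and v, by assumption) */
  end;
  else do;
    sidea = -1; sideb = 1;            /* situation iv */
  end;

  /* Three observations per group; a comparison evaluates to 1 or 0,  */
  /* so the suppressed half collapses onto the mean.                  */
  trt = 'A';
  value = meana;                        output;
  value = meana - stda * (sidea <= 0);  output;
  value = meana + stda * (sidea >= 0);  output;
  trt = 'B';
  value = meanb;                        output;
  value = meanb - stdb * (sideb <= 0);  output;
  value = meanb + stdb * (sideb >= 0);  output;
  keep trt visit value;
run;

symbol1 v=none i=hilocjt l=1;
symbol2 v=none i=hilocjt l=2;

proc gplot data=plotdata;
  plot value*visit=trt;
run;
quit;
```

For a visit in situation iii this writes meanA, meanA, meanA+stdA for group A and meanB, meanB-stdB, meanB for group B, which is the pattern shown for VISIT=7 above.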
Figure II shows the completed example.
In practice this method is often used to display confidence limits, where we replace mean±std
in the comparisons with the appropriate confidence limits.
III Decision tree representation of estimated probabilities
When fitting a statistical model to our data it is often easier to interpret the results if they are
displayed graphically.
Example.
The data listed overleaf are the result of a logistic regression with outcome 'disease
developed' and the effect of several demographic covariates investigated.
Age (yrs)   Smoking status   Alcohol consumption   Probability
<55         Smoker           None                  0.170
<55         Smoker           < 10 units            0.220
<55         Smoker           10 units +            0.280
<55         Non-Smoker       None                  0.050
<55         Non-Smoker       < 10 units            0.130
<55         Non-Smoker       10 units +            0.160
55 +        Smoker           None                  0.200
55 +        Smoker           < 10 units            0.300
55 +        Smoker           10 units +            0.370
55 +        Non-Smoker       None                  0.070
55 +        Non-Smoker       < 10 units            0.200
55 +        Non-Smoker       10 units +            0.150
The outline below demonstrates how SAS/GRAPH software, and in particular the
ANNOTATE dataset, can be used to produce some powerful graphical tools for displaying this
data.
At the time of writing several analyses of the above type were performed which needed to
be displayed in a neat and concise graphical manner. This led to the development of a macro
which, with a minimum number of parameters passed to the macro, would take a dataset of
the above form and produce a graph similar to the one presented below.
[Diagram: tree of boxes under the column headings Age, Smoking status, Alcohol consumption and Probability; the finished version appears as Figure III.]
The BOXDIAG macro.
This macro has been developed to accept, as input, a dataset of the above form where the
classification variables (age, smoking status, and alcohol consumption in this case) are named
newlev1-newlevn, where we have n classification variables (in this example we have the
variables newlev1-newlev3). In this example we have one probability column, and this is
stored in the variable prob1. In general there may be more than one probability column and
so the macro has been designed to handle the variables prob1-probp, where we have p
probability variables.
The labels of the variables newlev1-newlevn and prob1-probp will be used to create the
column headings, and the values of these variables will be the text that is to be displayed on
the graph (<55, Non-Smoker etc.) Note that these are character strings and not formatted
values.
Calls to the boxdiag macro are of the form
%boxdiag(dset, n, p, maxh, maxw, w1, w2, w3, w4, w5, w6);
At present the macro is designed to cater for a maximum of 6 classification columns, but it
is a simple matter to extend this if required.
The parameters are:
dset   The name of the input dataset
n      The number of classification variables newlev1-newlevn
p      The number of probability variables prob1-probp
maxh   The maximum height of the procedural output area to be used by the
       display (%)
maxw   The width of the procedural output area to be used by the classification
       variables (%)
w1     The width of the newlev1 column (% of procedural output area)
...
w6     The width of the newlev6 column (% of procedural output area)
With the following parameters n=3, maxw=80, w1=30, w2=20, w3=15, the classification
columns occupy 30+20+15=65% of the procedural output area. As we have specified to use
80% of the width of this area, the remaining 15% will be divided evenly between the columns,
ie the space between columns 1 and 2, and columns 2 and 3, will be 15/2=7.5%.
A check is built into the macro so that w1 + ... + wn ≤ maxw.
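The width check and the spacing arithmetic could be sketched as below. CHKWIDTH is a hypothetical helper written for illustration and is not part of BOXDIAG. It relies on %SYSEVALF for non-integer arithmetic, which may not be available in the SAS releases named in the Introduction.

```sas
/* Sketch only: CHKWIDTH is a hypothetical helper, not the BOXDIAG code. */
%macro chkwidth(n, maxw, w1, w2, w3, w4, w5, w6);
  %local i total gap;
  %let total = 0;
  /* add up the widths of the n classification columns */
  %do i = 1 %to &n;
    %let total = %sysevalf(&total + &&w&i);
  %end;
  %if %sysevalf(&total > &maxw) %then
    %put ERROR: column widths total &total, which exceeds maxw=&maxw..;
  %else %if &n > 1 %then %do;
    /* share the unused width equally between the n-1 gaps */
    %let gap = %sysevalf((&maxw - &total) / (&n - 1));
    %put NOTE: widths total &total, leaving a gap of &gap between columns.;
  %end;
%mend chkwidth;

%chkwidth(3, 80, 30, 20, 15)
```

With the parameters quoted above the widths total 65 and the gap works out at 15/2 = 7.5, matching the figure in the text.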
The main steps performed in the macro achieve the following objectives.
1  Create a dataset containing all the changes in the values of the classification
   variables, storing them in an (n,100) array. Here the changes for column n are
   stored in the elements (n,1) to (n,100).
   Given the above data this may look something like this
   <55          55 +
   Smoker       Non-Smoker   Smoker       Non-Smoker
   None         < 10 units   10 units +   None         < 10 units   10 units +   ...
   The probabilities and column headings are stored in a similar manner. The
   variable labels are obtained with the LABEL call routine.
2  From the dataset created in step 1 it is possible to determine the maximum
   number of cells (boxes) needed in a column by looking at the data for the
   furthest right classification column (newlev3 in this case). In the example 12
   cells are required in column 3. Given this information the size and vertical
   position of each cell in column 3 can be determined. The horizontal
   coordinates are determined from the macro variables w1-wn. This also allows
   us to determine the positions of the probability cells (prob1).
3  Again, from the dataset created in step 1 we can see that each cell in column
   1 joins to 2 cells in column 2, and each cell in column 2 joins to 3 cells in
   column 3. From this information we can work out the vertical positions of the
   cells in all but the far right classification column. As before the horizontal
   positions come from the w1-wn macro variables.
4  Once all the coordinates have been calculated a datastep creates an annotate
   dataset that is later used to draw the cells in each of the columns, and the lines
   connecting the relevant cells. The annotate functions move and draw are used
   here.
5  Another annotate dataset is created to place all the text on the graph.
6  The annotate datasets are combined and the variables xsys, ysys, and hsys are
   set to '5' so that the coordinates are given as a percentage of the procedural
   output area.
7  The GSLIDE procedure is used to display the graph produced from the
   annotate dataset as any titles and footnotes will automatically be included.
Figure III shows a graph of the above data produced using the boxdiag macro.
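Steps 4 to 7 might be sketched in miniature as follows. This is not the BOXDIAG source: the dataset names BOXES, LABELS and ANNO and all of the geometry values (MAXH, LEFT, WIDTH, and the 80% cell height) are hypothetical. The coordinate-system variables are set as each dataset is built rather than when they are combined. Only the twelve cells of the right-most classification column and a single label are drawn.

```sas
/* Sketch only: geometry values are hypothetical, not those used by BOXDIAG. */
data boxes;
  length function $8;
  retain xsys ysys hsys '5' line 1 size 1;   /* '5' = % of procedure output area */
  ncell = 12;              /* cells needed in the right-most column      */
  maxh  = 90;              /* height of the area used by the display (%) */
  left  = 60;              /* left edge of the column (%)                */
  width = 15;              /* width of the column (%)                    */
  slot  = maxh / ncell;    /* vertical space available to each cell      */
  do cell = 1 to ncell;
    top    = maxh - (cell - 1) * slot;
    bottom = top - 0.8 * slot;               /* 20% of each slot left as a gap */
    /* trace the outline of the cell with one move and four draws */
    function = 'move'; x = left;         y = bottom; output;
    function = 'draw'; x = left + width; y = bottom; output;
    function = 'draw'; x = left + width; y = top;    output;
    function = 'draw'; x = left;         y = top;    output;
    function = 'draw'; x = left;         y = bottom; output;
  end;
  keep function xsys ysys hsys line size x y;
run;

/* Text for the graph: here just one label, centred in the top cell. */
data labels;
  length function $8 text $40 position $1;
  retain xsys ysys hsys '5' function 'label' position '5' size 3;
  x = 67.5; y = 87; text = 'None'; output;
run;

/* Combine the two annotate datasets and display them. */
data anno;
  set boxes labels;
run;

proc gslide annotate=anno;
run;
quit;
```

Any TITLE or FOOTNOTE statements in effect when GSLIDE runs are drawn along with the annotation, which is the reason the text gives for choosing that procedure.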
Conclusion
Through the above three examples I hope to have shown how SAS/GRAPH software can be
used in an innovative manner.
Each example has increased in complexity, with the corresponding source code increasing
from approximately 10, to 60, and 350 lines.
SAS, and SAS/GRAPH are registered trademarks of SAS Institute Inc., Cary, NC, USA.
Parexel International Limited
Craven House
40 Uxbridge Road
Ealing
LONDON W5 2BS
(0) 81 579-8292
Figure I
Plot of number of panic attacks per week showing medians, upper and lower quartiles
[Line plot: number of panic attacks (vertical axis, scaled 0 to 17) against Visit (1 to 7), with one line for each Treatment Group A, B, C and D.]
Figure II
Plot of Systolic blood pressure showing means +/- 1 standard deviation
[Line plot: systolic blood pressure (vertical axis, scaled 130 to 180) against Visit (1 to 7), with one line for each Treatment Group A and B.]
Figure III
Probability of developing disease by demographic factors
[Decision tree: columns Age (yrs), Smoking status, Alcohol consumption and Probability, with one probability cell for each of the twelve combinations tabulated in section III.]
Note: Probabilities are adjusted