PRODUCING SAS FILES FROM LARGE MASTER FILES FOR A CLINICAL RESEARCH PROJECT
Frank E. Harrell, Jr., Lipids Research Clinics Program, University of North Carolina

The Primary Prevention Trial of the Lipids Research Clinics Program
is a clinical intervention trial of almost 4000 subjects, gathering
data for thousands of variables on each subject.

The subjects each attend 60 visits to their clinic. Visits 1-5 are
baseline visits and data forms for these visits are dissimilar. After
visit 5, clinic visits consist of only three types: 2 month, 6 month,
and yearly visits. So for visits 6-60, the visit number and one of
three system item numbers completely define each variable for purposes
of retrieval of data for statistical analyses. For visits 1-5, five
system item numbers describe the address of each variable. Although
data forms vary across clinic visits, there is much overlapping of
variables between visits in terms of definitions, units of
measurement, and input formats. Therefore, we can reduce the problem
of data retrieval to its simplest terms by setting up a dictionary
containing item numbers and descriptions of variables.

In our master data files, every item has stored with it a 1-byte
status field describing the quality of the item. These are outlined in
the following table:

Figure 1 - Meanings of Item Status

Status   Meaning                                      SAS Special
                                                      Missing Value
Blank    Form containing item not arrived to system   .
1        Form containing item arrived but item not    .A
         obtainable
2        Item has an edit error (e.g. out of valid    .B
         range)
4        Item inconsistent with another item          .D
6 or 9   Item has good status but is blank or not     .E
         valid numeric
6        Valid numeric field, no errors               i.e. a valid value
9        Item verified by clinic after initially      i.e. a valid value
         failing an edit or consistency check and
         is a valid numeric field

Because calculations are to be performed only on non-missing,
non-erroneous data, we have found that it is not worthwhile to carry
statuses to the file used in analyses, but only to recode variables
which have bad status (status < 5). For numeric variables, the special
missing values in SAS allow for recoding of bad items such that the
reason an item was not clean will be known. The special missing values
used for such recoding are displayed in Figure 1. For example, a
3-byte numeric value having a status of 2 is recoded with 'B ' by the
analysis file retrieval program so that SAS will input the variable as
.B if a MISSING statement is given.

The missing value .E is used principally for multiple-choice items in
which a blank answer means "negative" and a non-blank answer means
"affirmative". Recoding the blank with E allows distinguishing between
"negative" and "form not arrived yet".

For character variables, of which there are very few on our forms,
values are merely changed to blank if the corresponding statuses are
less than 5. This could be modified to recode the values to special
characters instead of blanks.

Special programs are needed to retrieve data for analysts from the
master file because the data are encrypted and because only a subset
of the thousands of variables for each subject are desired for
analysis. Analysis file subsetting is initiated by listing the
mnemonics and visit numbers of desired items. These are linked with an
item dictionary. There are two separate item dictionaries - one for
visits 1-5 and one for visits 6-60. The item dictionary defines the
system item numbers, decimal places, label, and answer type for
categorical items, given the mnemonic base. The final SAS variable
name is formed by concatenating the visit number (without leading
blanks or zeros) to the mnemonic base. If the item number field on the
dictionary is blank for a particular visit, the variable is undefined
for that visit (i.e. is not on any form for that visit) and hence is
not retrieved. The dictionary for visits 1-5 is described in Figure 2:

Figure 2 - Layout of Item Dictionary

Columns  Contents
1-5      Mnemonic for item - to become SAS name after adding visit
         number
6-30     Five system item numbers for visits 1-5 which define the
         location of the item on the master file
31       Number of places to right of decimal pt. Blank if integer or
         character
32-71    Description of item - to become SAS label
80       Answer type for item, if categorical. Defines "value labels"

Example of portion of dictionary:

CHOL  1841823566 1418 1193 1803  CHOLESTEROL
LDL   1860623725 1606 1371 2077  ESTIMATED LDL ...
VLDL  23871                      VLDL CHOLESTEROL
[entries illegible in source]
WT    5702148525473258332627411  WEIGHT, KG
MBRTH 281                        MONTH OF BIRTH
DBRTH 992                        DAY OF BIRTH
YBRTH 998                        YEAR OF BIRTH
EVMAR 320                        EVER MARRIED?     M
RACE  21288                                        R
EDUC  21235                      EDUCATION         E
OCCUP 21303                      OCCUPATION        O
CORNA 22360                      CORNEAL ARCUS?    A

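The Figure 2 layout and the naming rule can be illustrated with a short Python sketch. It is not the dictionary-handling code used by the project, which the paper does not show. The function names `parse_entry` and `sas_names` are hypothetical, the CHOL record is rebuilt from the example row, and the placement of the single VLDL item number in the visit 2 field is an assumption made only for illustration.

```python
from typing import NamedTuple, Optional


class DictEntry(NamedTuple):
    mnemonic: str
    item_numbers: list          # one entry per visit 1-5; None where blank
    decimals: int
    label: str
    answer_type: str


def parse_entry(record: str) -> DictEntry:
    """Split one dictionary record using the column layout of Figure 2."""
    record = record.ljust(80)
    # Columns 6-30 hold five 5-character item number fields.
    fields = [record[5 + 5 * i: 10 + 5 * i].strip() for i in range(5)]
    items: list = [int(f) if f else None for f in fields]
    return DictEntry(
        mnemonic=record[0:5].strip(),                            # columns 1-5
        item_numbers=items,                                      # columns 6-30
        decimals=int(record[30]) if record[30].strip() else 0,   # column 31
        label=record[31:71].strip(),                             # columns 32-71
        answer_type=record[79].strip(),                          # column 80
    )


def sas_names(entry: DictEntry) -> dict:
    """Map each SAS name (mnemonic + visit number) to its item number.

    Visits whose item number field is blank are skipped, since the
    variable is undefined for those visits.
    """
    return {
        f"{entry.mnemonic}{visit}": item
        for visit, item in enumerate(entry.item_numbers, start=1)
        if item is not None
    }


# Record rebuilt from the CHOL example row (column 31 left blank).
chol = "CHOL " + "1841823566 1418 1193 1803" + " " + "CHOLESTEROL"
# Hypothetical placement: the VLDL item number is put in the visit 2 field.
vldl = "VLDL " + " " * 5 + "23871" + " " * 16 + "VLDL CHOLESTEROL"

chol_names = sas_names(parse_entry(chol))
vldl_names = sas_names(parse_entry(vldl))
```

Under these assumptions the CHOL record yields the names CHOL1 through CHOL5, while the VLDL record yields only VLDL2.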
The retrieval program outputs a file suitable for reading in a SAS
program, automatically computing locations in the output record for
each item. Use is made of a separate system table not described here
which contains the type (numeric or character) and length of each
item, given the item number.
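
As a rough illustration of how output locations might be computed, the Python sketch below assigns consecutive columns to items from their field lengths. The function name `assign_columns` is hypothetical. The widths are those implied by the INPUTF macro in Figure 4, and the starting column of 10 assumes the subject ID occupies columns 1-9 as in the INPUT statement of Figure 3.

```python
def assign_columns(items, first_column=10):
    """Give each (name, width) item the next free columns in the record.

    Returns a list of (name, start_column, end_column) tuples, with
    columns numbered from 1 as in a SAS INPUT statement.
    """
    layout = []
    column = first_column
    for name, width in items:
        layout.append((name, column, column + width - 1))
        column += width
    return layout


# Field widths taken from the INPUTF macro shown in Figure 4.
items = [("AGE1", 2), ("AGE3", 2), ("WT1", 4), ("WT2", 4), ("WT3", 4),
         ("HT1", 4), ("RACE1", 1), ("SEX1", 1), ("SGOT1", 6), ("SGOT2", 6)]
layout = assign_columns(items)
starts = {name: start for name, start, _ in layout}
```

The resulting start columns (10, 12, 14, 18, 22, 26, 30, 31, 32, 38) agree with the column pointers in the INPUTF macro.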

The following statements give a simple example of creating a SAS file
from one of our master files:

Figure 3 - Example of Creating SAS File

// EXEC FETCH         - subset master file. Read dictionary to get
                        item numbers for mnemonics and visit numbers
                        listed below.
//INPUT DD            - master file
//OUTPUT DD DSN=SUBSET,...
//ITEMS DD DSN=ITEMS,DISP=(NEW,PASS),...
                      - will have one record for every item at each
                        visit containing label, decimal places, etc.,
                        and position of item in extracted file
AGE 1 3 WT 1 2 3 HT 1 RACE 1 SEX 1
SGOT * 1 4              (visits 1-4 inclusive)
// EXEC SASDESF       - produces SAS macros for reading and
                        describing file
//ITEMS DD DSN=ITEMS,...
//OUTPUT DD DSN=MACROS,...
// EXEC SAS
//SAVE DD             - disk dataset for saving SAS file
//IN DD DSN=SUBSET,...
//SYSIN DD DSN=MACROS,...
// DD *
DATA SAVE.ONE; INFILE IN; MISSING A B C D E;
LABEL ID=PATIENT ID LABELF; LENGTH LENGTHF;
FORMAT FORMATF; INPUT ID $ 1-9 INPUTF;
... statements to compute derived values ...
... LABEL, LENGTH, FORMAT for derived values ...
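
The item list passed to FETCH in Figure 3 can be read as mnemonics followed by visit numbers, with `*` introducing an inclusive range of visits. The Python sketch below expands such a list into (mnemonic, visit) pairs. It is only an interpretation of the example: the function name `expand_requests` is hypothetical, and the sketch assumes that a mnemonic is never made up solely of digits and that `*` is always followed by exactly two visit numbers.

```python
def expand_requests(text):
    """Expand an item list such as 'AGE 1 3 SGOT * 1 4' into pairs.

    A token that is not a number starts a new mnemonic. Plain numbers
    are visit numbers for the current mnemonic, and '* a b' stands for
    visits a through b inclusive.
    """
    pairs = []
    tokens = text.split()
    mnemonic = None
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "*":
            first, last = int(tokens[i + 1]), int(tokens[i + 2])
            pairs.extend((mnemonic, visit) for visit in range(first, last + 1))
            i += 3
        elif token.isdigit():
            pairs.append((mnemonic, int(token)))
            i += 1
        else:
            mnemonic = token
            i += 1
    return pairs


requested = expand_requests("AGE 1 3 WT 1 2 3 HT 1 RACE 1 SEX 1 SGOT * 1 4")
# Final SAS names: visit number appended to the mnemonic base.
names = [f"{mnemonic}{visit}" for mnemonic, visit in requested]
```

With the list from Figure 3 this gives the twelve variables AGE1, AGE3, WT1, WT2, WT3, HT1, RACE1, SEX1 and SGOT1 through SGOT4.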

We also have occasion to run a program before FETCH which splits lists
of mnemonics by the mnemonic base type so that multiple SAS datasets
are created. This is useful when there are more than 1024 variables to
analyze (the maximum SAS allows) - the variables can be split up into
2 or more logical groups and 2 or more SAS datasets can be created by
executing SASDESF and SAS more than once.

Answer types for categorical variables are defined by linking the
ITEMS file (Figure 3) with an external "ANSWER" file. Since SAS does
not yet have value labels, program SASDESF merely prints descriptions
of answers as comments in the user's SAS program for documentation.
The program builds macros for specifying LABEL, LENGTH, INPUT, and
FORMAT statements. These are exemplified in Figure 4 below:

Figure 4 - Comments and Macros Produced by Program "SASDESF" in
Example in Figure 3

*-----------------------------------------------*
VARIABLE  ANSWERS
RACE      1=WHITE 2=BLACK 3=ORIENTAL 9=OTHER
SEX       1=MALE 2=FEMALE
*-----------------------------------------------;
PAGE;
MACRO INPUTF @10 AGE1 2. @12 AGE3 2. @14 WT1 4.1
@18 WT2 4.1 @22 WT3 4.1 @26 HT1 4.1 @30 RACE1 1.
@31 SEX1 1. @32 SGOT1 6.2 @38 SGOT2 6.2 ...
SGOT4 6.2%
MACRO FORMATF AGE1 2. AGE3 2. WT1 5.1 WT2 5.1
WT3 5.1 HT1 5.1 RACE1 1. SEX1 1. SGOT1 7.2 ...
SGOT4 7.2%
MACRO LENGTHF AGE1 2 AGE3 2 WT1 3 WT2 3 WT3 3
HT1 3 RACE1 2 SEX1 2 SGOT1 4 ... SGOT4 4%
MACRO LABELF AGE1='AGE IN YEARS' AGE3='AGE IN YEA
RS' WT1='WEIGHT, KG' WT2='WEIGHT, KG' WT3='WEIGHT
, KG' HT1='HEIGHT, CM'%
(Others had blank label fields in dictionary since their mnemonics
were self-explanatory.)
PAGE;

The labels were retrieved from the item dictionary and put in macro
LABELF. The lengths generated for the LENGTH statement are the minimum
number of bytes allowing for n significant digits where n is the
number of columns for the numeric field on the input record. Out of
all the variables in our data, the only one requiring more than 4
bytes of floating binary storage is social security number, so
considerable savings of disk space can be achieved by computing the
minimum length for each variable and not using the default length of 8
bytes, which allows for 16 significant digits.
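
The length rule can be sketched in Python as follows. The paper does not give the formula used by SASDESF, so this is only one plausible reading: it assumes that a stored length of L bytes keeps one byte for sign and exponent and L-1 bytes of fraction, so that integers up to 256^(L-1) are held exactly, and that the shortest permitted length is 2 bytes. The function name `min_length` is hypothetical.

```python
def min_length(n_digits, smallest=2, largest=8):
    """Smallest stored length (bytes) that holds n_digits digits exactly.

    Assumes integers up to 256 ** (length - 1) are represented exactly
    in a numeric variable of the given length.
    """
    largest_value = 10 ** n_digits - 1      # e.g. 9999 for a 4-column field
    for length in range(smallest, largest + 1):
        if largest_value <= 256 ** (length - 1):
            return length
    return largest


# Field widths from the INPUTF macro in Figure 4.
lengths = {name: min_length(width)
           for name, width in [("AGE1", 2), ("WT1", 4), ("RACE1", 1), ("SGOT1", 6)]}
ssn_length = min_length(9)                  # 9-digit social security number
```

Under these assumptions the sketch reproduces the lengths in the LENGTHF macro (2, 3, 2 and 4 bytes), gives more than 4 bytes only for a 9-digit field, and needs the full 8 bytes for 16 digits.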

The components of the INPUT statement are obtained from the ITEMS
file. These include column locations on the subsetted analysis file,
types (character or numeric) and number of places after the decimal
point for numeric items. The output formats for numeric variables for
the FORMAT statement are taken as the input specification for integer
data (e.g. FORMAT X 2.) or k.d when the input specification was m.d,
where k=m+1. The main reason for using output formats is that for
non-integer data such as 33.01, SAS may print the value as 33.0099770,
for example, when the length of the stored variable is less than 8
bytes and the implied format of the variable has been overridden by
placing the variable on the left hand side of an equals sign. This
problem is due to the truncation problem described in the SAS manual.
Giving the variable an output format such as FORMAT Y 5.2 causes the
variable to always be printed out by SAS programs or procedures in a
nice format.
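
The format rule above is simple enough to state as a few lines of Python. The function name `output_format` is hypothetical, and the last line uses Python's own fixed-point formatting only as a stand-in for a SAS w.d output format, to show that a width of 5 with 2 decimals displays the truncated value from the text as 33.01.

```python
def output_format(input_spec):
    """Derive the output format from an input specification.

    Integer specifications such as '2.' are used unchanged. A
    specification m.d becomes k.d with k = m + 1.
    """
    width, _, decimals = input_spec.partition(".")
    if decimals == "":
        return input_spec
    return f"{int(width) + 1}.{decimals}"


formats = {spec: output_format(spec) for spec in ["2.", "1.", "4.1", "6.2"]}

# Python analogue of printing the truncated value with FORMAT Y 5.2.
shown = format(33.0099770, "5.2f")
```

The derived formats (2., 1., 5.1 and 7.2) match those in the FORMATF macro of Figure 4.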

After statements for reading and describing a SAS file, statements
computing derived values can appear. Due to the propagation of missing
values in any algebraic statement, new variables thus created are
automatically recoded if any of the component variables have bad
status (special missing value).
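
The recoding of Figure 1 and the propagation of missing values can be mimicked in Python. This is a sketch only: special missing values are represented here as the strings ".", ".A", ".B", ".D" and ".E", the function names `recode` and `add` are hypothetical, and the derived value is simply set to "." whenever a component is missing, which is meant to imitate the behaviour of a SAS assignment statement rather than reproduce it.

```python
# Status codes with bad status (< 5) and their special missing values,
# as listed in Figure 1. A blank status is written here as a space.
MISSING_FOR_STATUS = {" ": ".", "1": ".A", "2": ".B", "4": ".D"}
GOOD_STATUSES = {"6", "9"}


def recode(status, text):
    """Return a number, or a missing-value code when the item is not clean."""
    if status in MISSING_FOR_STATUS:
        return MISSING_FOR_STATUS[status]
    if status not in GOOD_STATUSES:
        raise ValueError(f"status {status!r} is not listed in Figure 1")
    try:
        return float(text)
    except ValueError:
        return ".E"          # good status but blank or not valid numeric


def add(x, y):
    """Derived value that is missing if either component is missing."""
    if isinstance(x, str) or isinstance(y, str):
        return "."
    return x + y


clean = add(recode("6", "10"), recode("9", "2.5"))
dirty = add(recode("6", "10"), recode("1", ""))
```

Here `clean` is an ordinary number, while `dirty` is missing because one component had status 1.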

After this file is built, we have to take care of some problems which
are present with the type of data we have. For instance, subjects can
transfer between clinics, so that the master files may have partial
records for the subject from each clinic. Each record will have the
same subject ID code. We wish to pool all available data for each
unique subject ID into a single record. This is easily handled using
the UPDATE statement and a small auxiliary file containing a list of
all unique ID's. After this problem is handled, other operations can
take place, such as merging baseline data with data for later visits.
The visit number at the end of the variable names makes them unique
across visits.
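
The pooling step can be pictured with the following Python sketch, which imitates the way an update fills a master record from several partial records: a value is carried over only when it is not missing. The function name `pool_by_id`, the use of `None` for a missing value, and the subject IDs and cholesterol values are all made up for illustration.

```python
def pool_by_id(unique_ids, partial_records):
    """Combine partial records that share an ID into one record per ID.

    Starts from an empty record for every unique ID (the auxiliary
    file) and copies in each non-missing value from the partial records.
    """
    pooled = {subject_id: {} for subject_id in unique_ids}
    for record in partial_records:
        target = pooled[record["ID"]]
        for name, value in record.items():
            if name != "ID" and value is not None:
                target[name] = value
    return pooled


# Made-up data: subject 000000001 transferred clinics after visit 1.
partial = [
    {"ID": "000000001", "CHOL1": 280.0, "CHOL2": None},
    {"ID": "000000001", "CHOL1": None, "CHOL2": 265.0},
    {"ID": "000000002", "CHOL1": 301.0, "CHOL2": 290.0},
]
pooled = pool_by_id(["000000001", "000000002"], partial)
```

After pooling, subject 000000001 has a single record holding both CHOL1 and CHOL2.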

We run a procedure written by us, DATACHK, on the completed file. This
procedure prints, for every numeric variable in the file, the number
of special missing values broken down by ., .A-.E, and the lowest and
highest 5 distinct values of the variable. This is useful in checking
the quality of the dataset. If errors are found, variables are
corrected by running a small SAS program. We use the value .C to
denote bad data found at analysis time.
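
The kind of summary described for DATACHK can be approximated in a few lines of Python. This is not the DATACHK procedure itself, whose code is not given in the paper. The function name `check_variable` is hypothetical, missing values are again represented as strings such as "." and ".A", and the sample values are invented.

```python
from collections import Counter


def check_variable(values, n_extremes=5):
    """Summarize one numeric variable.

    Counts each kind of missing value and lists the lowest and highest
    n_extremes distinct non-missing values.
    """
    missing = Counter(v for v in values if isinstance(v, str))
    distinct = sorted({v for v in values if not isinstance(v, str)})
    return {
        "missing": dict(missing),
        "lowest": distinct[:n_extremes],
        "highest": distinct[-n_extremes:],
    }


# Invented values for one variable; 12.0 and 610.0 stand out as suspect.
values = [140.0, ".A", 152.0, 152.0, ".", 610.0, ".B", ".A",
          12.0, 160.0, 171.0, 185.0]
report = check_variable(values)
```

Listing distinct values at both ends makes out-of-range entries such as 12.0 and 610.0 easy to spot.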

When all this is done, running PROC CONTENTS gives the user complete
documentation of the file, a step which used to be very time
consuming.

When reading large SAS datasets, we have found the following practices
to save money in terms of I/O time.
1. When subsetting files, using OPTIONS GEN=0 can decrease the number
   of I/O operations since the large amount of source code and
   variable descriptions and history information are not copied each
   time.
2. Using KEEP= in the SET statement, e.g.
   DATA NEW;SET SAVE.OLD(KEEP=X Y Z...);
   can increase efficiency when subsetting a small proportion of
   variables from a file with many variables. This is more efficient
   than using a separate KEEP statement after the DATA statement.

This work was supported by U.S. National Heart,
Lung, and Blood Institute contract number
NHLI-NIH-71-2243 from National Institutes of
Health.