Getting from Complex Databases to a SAS® Analytic File: Tales from the Trenches
MARCIA STODDARD, LEWIN-TAG, INC.
ABSTRACT
Increasingly, data destined for statistical analysis
in SAS come from one or more complex
databases. In this paper the author demonstrates
various methods of gathering information from
multiple, complex data streams and knitting them
together into one or more analytic data sets. The
types of data considered include date-specific
event histories, episodic clinical data, and patient-provided panel data. The focus is on how to use
the linkages among these various types of
information to create analytic files in SAS. This
paper also explores how to apply summarization
techniques to both prospective and retrospective
data sources.
INTRODUCTION
Health care research often depends on data
collection in the absence of fully formed research
questions. Often detailed patient- and physician-reported data about a specific condition are
collected over a long period of time (e.g., registry
data, observational data). Sometimes these data
may be augmented with chart abstraction and
survey questionnaires. These rich sources of
data can be used in more than one analytic study
and can provide a level of detail that will yield
results in different disciplines (e.g., Quality of Life
(QoL) research, assessment of practice patterns,
economic outcomes). However, the very complex
and detailed nature of these data presents
challenges in building and using analytic files.
Data collection is sometimes initiated before
exploring every avenue of interesting and useful
research and may be driven by the multiple
interests represented by a team of specialists.
Clinicians select which portions of their patients'
medical information - past, present, and future - to
gather and put into an observational database.
Meanwhile, academic researchers develop
questionnaires to administer to the specialists' patients on a regular basis covering demographics, QoL, satisfaction with care, and a variety of health care resource use. Data analysts weigh in with database design issues. The basic
notion is to have ongoing data collection for a
select group of people. This kind of longitudinal
data collection is common in health care research
and is often continuously 'mined' for information
as it continues to grow and change.
In long-term studies, data elements may be
added or changed. Patients are added or lost to
follow-up. As data are gathered and entered they
also evolve over time. Thus, as the database
matures, requests for apparently 'simple'
analyses may present complexities for the analyst
that are primarily linked to the structure of the
data and their internal relationships. The analyst
will be asked to do what may sound like a 'quick and dirty' task that turns out to be a major undertaking - even when the research question is deceptively simple: 'How many patients had a surgical treatment?'. There are factors to consider such as the time frame, definitions, and the patient sample. In no
time, the analyst will be rattling off lists of
questions hoping to accurately select the
information required to create useful analytic files
from large complex databases.
The following definitions outline types of data that
will be discussed:
DEFINITIONS
Panel data: Data that are collected in waves,
usually in predefined intervals but not necessarily
equally spaced.
Event history data: Data for which you know the
exact date of occurrence (e.g., birth, death,
marriage, divorce, job status). This type of data
may also be reconstructed based on recall.
Episodic clinical data: Data that are gathered
every time there is a clinical encounter (e.g.,
doctor visit, lab values, hospitalization).
Analytic file: Data that are used as input to SAS
procedures.
A VIEW OF COMPLEX DATA
Prior to the creation of a complex database, there
are numerous decisions made that involve
different aspects of the study: defining the patient
population (e.g., sampling, inclusion and
exclusion criteria), the timing of collection, the
data elements of interest, and the forms to be
used. These decisions can incorporate the input of researchers, sponsors, clinicians, academicians, and data analysts. The result of
this collaboration is the study protocol. The
information contained within this protocol might
include a copy of the data collection forms; an
outline detailing what forms are to be distributed,
when, and to whom; rules on how and what
portions of the data are to be coded; and criteria
for selecting patients.
Data collection may follow a relatively typical
scenario. The clinicians that agree to take part in
the study screen and enroll patients, and obtain
informed consent. During a patient's visit, they
explain the study and ask if the patient would be
interested in getting involved. The patients that
agree and that meet study criteria are generally
handed a baseline questionnaire that contains
questions about demographics, comorbidities,
and QoL. The clinician fills out a baseline clinical
form detailing diagnostic and past treatment
information. Additionally, any current information
based on the visit that day (e.g., medications
prescribed, labs ordered) is recorded on a
separate form.
Figure 1 shows how different types of data enter
into the picture. The first line indicates
questionnaires that are sent out on a regular
basis, quarterly for example. These represent
patient-reported panel data where current health
status is recorded. A patient might not return a
questionnaire at any given timepoint; a gap in the figure's line represents the possibility of missing data.
The second line shows when a patient reports a
hospitalization (or any other resource use) in the
past quarter. Based on recall, the patient is
reconstructing event history data.
When a clinician records a visit, medication, or other medical procedure (surgery, lab test), these data are included as well. The third line
represents this frequency and variety of episodic
clinical data.
FIG. 1. STREAMS OF DATA
[Timeline diagram: parallel streams of questionnaires, hospitalizations, and clinical data, with gaps where data may be missing.]
WHAT MAKES THESE DATA COMPLEX?
These data are complex for a variety of reasons.
Inherent in the data gathering process for
observational studies is ongoing change. There is
a need to allow for flexibility in the data collection forms so that the latest treatments and practice patterns can be captured. New therapies might be introduced and
variables could be added to the forms and thus,
the database. In order to understand occurrences
that have not been pre-defined, open-ended text
fields are collected, periodically analyzed, and
sometimes recast as coded fields. As a result,
whole new sections are added to the data
collection forms so that frequent responses can
be collected in coded responses rather than
open-ended text fields. The prospect of open-ended text fields can cause a fair
amount of apprehension for the analysts involved
in a project, whereas a researcher may view them
as an opportunity for exploration. Change must
be possible, and expected, but must also be
systematic and controlled. It is often the analyst
who is the overseer of this process.
DOCUMENTATION
Before running so much as a line of SAS code or
extracting a single morsel of information from the
database, it is important to get all parties involved
to agree upon a variety of issues related to the
data. The minimal facts that need to be
established are which patients will be included
and what types of information will be explored
during data analysis. Before data collection even
begins, there will be a string of memoranda,
meetings, and emails where questions are asked
and answered. If you document all decisions made
along the way, the methodology section of papers
will be much easier to write.
281
In addition, when the piles of output start to
accumulate on your desk, it helps to keep things
organized. For example, copies of data collection
forms should be indexed as they will change over time (e.g., form version A, with B, C, etc. to
follow). Guides are available which give
suggestions on organizing project information and
program logs and output (Cisternas and Stoddard 1996).
DETERMINE THE VARIABLES
The analyst and researchers need to identify what
portions of the data collection forms will be
analyzed, i.e., which data elements should be
extracted and consolidated into an analytic file.
Some information may be collected on multiple
forms. For example, patients report quarterly on
their recent treatment history while, at the same time, the physicians report on the clinical information, treatment and otherwise, as it occurs.
If the analysis of choice involves treatment, the
researchers need to decide if patient self-reported
or clinical information will be used. Alternatively,
one could be used to validate the other by
comparing the information from both sources.
It is highly recommended to have copies of
annotated forms on hand during discussions
about what information is to be extracted and
analyzed. As the forms change over time, a copy
of each iteration should be kept on hand in order
to be able to trace the history of what variables
contain the data of interest. For example, it is
important to be aware of data that were once captured in open-ended fields and have since been moved to their own coded variables. The
implication is that it may be necessary to search
both places for the data.
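A minimal sketch of such a two-place search follows; the data set and variable names (COMBINED, SYMPFLAG, OTHERTXT) and the search term are hypothetical:
*-----------------------------------*;
* Hypothetical: a symptom once      *;
* captured only in an open-ended    *;
* field (OTHERTXT) now has its own  *;
* coded variable (SYMPFLAG), so     *;
* both places must be searched.     *;
*-----------------------------------*;
data symptom;
  set combined;
  has_symp = (sympflag = 1) or
             (index(upcase(othertxt), 'FATIGUE') > 0);
run;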
FREEZE THE DATA
Observational databases by nature are in a
constant state of change. However, most of them
have imposed 'phases' or stages (somewhat arbitrary in many cases) - points in time at which the data are considered complete. This is often
referred to as a closed or frozen database.
In order to be able to reconstruct any files or
replicate analytic results during or after analysis, it
is a good idea to extract files for analysis only
from a frozen copy of the database, and to
document which 'phase' of the data was used.
This is particularly important if you're using a
database without some built in rollback feature. It
is also advisable to create permanent files for
every extraction or 'phase' of the database. If
done on a regular basis, say quarterly, then later
studies can use the most recent copy of the data,
or easily re-analyze historical data from a new
perspective. Quality control can be done on any
new or updated data since the most recent freeze
of the database.
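A minimal sketch of creating one such permanent file follows; the library path and data set names are illustrative only:
*-----------------------------------*;
* Save the current freeze as a      *;
* permanent, dated data set.        *;
*-----------------------------------*;
libname frozen 'c:\study\frozen';

data frozen.tx_1999q1;
  set work.tx;  * tx as of the 1999 Q1 freeze;
run;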
EXTRACT THE DESIRED DATA
Once criteria for a patient population are
established, the data extraction begins. It is useful
to show the researchers involved exactly which
portions of the form will be used in the data
extraction to make sure everyone is literally on
the same page. It is important to agree which
variables will be extracted for purposes of defining
which patients are included.
In addition, it is crucial to agree on data
transformations and definitions of any new data
elements created for the analysis. QoL scores are
a good example - they are typically constructed
from single items of varying scales (values of 1-4
or 1-5) and converted to 100-point scales. In
some cases they are also weighted.
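As a sketch of the simplest such conversion, assuming a hypothetical 1-5 item named ITEM1 (actual instruments define their own scoring and weighting rules):
*-----------------------------------*;
* Rescale a 1-5 item to 0-100:      *;
* 1 maps to 0 and 5 maps to 100.    *;
*-----------------------------------*;
data qol2;
  set qol;
  if item1 in (1,2,3,4,5) then
    item1_100 = (item1 - 1) / (5 - 1) * 100;
run;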
Another example of a common data manipulation
is the definition of treatment duration, which
depends on strict identification of start and stop
points, but may also depend on combinations of
clinical variables. Initial treatment generally is
defined as treatment that happens in some
window around diagnosis of a disease. Any
number of treatments might occur in the specified
window. Subsequent treatment may also be of
interest. Refining the definitions of treatment requires taking a closer look at the data.
CATEGORIZE TYPE OF TREATMENT
Depending on what type of disease is being
studied, evaluating a course of treatment can be
quite challenging from a data analyst's
perspective. There may be some types of medical
or surgical procedures that occur over several
days. There may also be data on medications for
which there are varying types, each with start and
stop dates which last for varying periods of time.
In order to operationalize a definition of treatment,
the analyst develops a categorization process
that typically involves timing, context and content.
A suggested approach on how to deal with
treatment information is as follows. First, isolate
the variables of interest which contain the dates
of the treatment. Create a simple data set which
includes the patient ID, a date variable, and a
character variable whose value is the name of the
variable whenever there is a date for a treatment
recorded. For example, treatment X recorded in
TX1VAR with a value of 01/01/99 would translate to PTID='Pt Bob', TXDATE=01/01/99, and TXTEXT='TX1VAR'. Then, create a descriptive
variable (e.g., TXDESC) which groups certain
values of TXTEXT into a descriptive and easier to
understand set of values. If TX1VAR, TX2VAR,
and TX3VAR all record similar treatments, then
TXDESC is set to the same value when TXTEXT
takes on those values.
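A sketch of this restructuring might look like the following; the source data set name (CLIN) and the grouping of all three variables under one TXDESC value are assumptions for illustration:
*-----------------------------------*;
* One record per recorded treatment *;
* date: source variable name kept   *;
* in TXTEXT, grouped label in       *;
* TXDESC.                           *;
*-----------------------------------*;
data txlong(keep=ptid txdate txtext txdesc);
  set clin;
  length txtext $8 txdesc $20;
  array txvars{3} tx1var tx2var tx3var;
  do i = 1 to dim(txvars);
    if txvars{i} ne . then do;
      txdate = txvars{i};
      txtext = upcase(vname(txvars{i}));
      * Similar treatments share one value. ;
      if txtext in ('TX1VAR','TX2VAR','TX3VAR')
        then txdesc = 'tx1';
      output;
    end;
  end;
run;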
These data can then be arrayed by date and
reviewed if they occur within a certain window of
time. After looking over the output, mutually
exclusive and exhaustive categories can be set
up and SAS code written to try to place each
patient in one of the categories based on their
arrayed treatment information. The first line of
defense to ensure the categories are set up
correctly is to use a series of IF statements,
rather than ELSE IF's, and set flags for each
category. Then, see if the sum of all of the flags is
ever other than one.
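For instance, using the patient-level data set with the concatenated INIT_TX variable that Figure 3 builds later in this paper (the category definitions here are hypothetical):
*-----------------------------------*;
* Independent IFs, not ELSE IFs, so *;
* overlapping or missed categories  *;
* surface when the flags are summed.*;
*-----------------------------------*;
data checkcat;
  set pt_tx1;
  cat1 = 0; cat2 = 0; cat3 = 0;
  if init_tx = '-tx1' then cat1 = 1;
  if init_tx = '-tx2' then cat2 = 1;
  if index(init_tx,'tx1') and index(init_tx,'tx2') then cat3 = 1;
  nflags = sum(cat1, cat2, cat3);
  if nflags ne 1 then
    put 'NOTE: review category definitions for ' ptid= init_tx=;
run;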
THINGS TO THINK ABOUT
Once the population is established, it is time for
the next step in data extraction. What is the main
research question? This is the central issue that
theoretically should inform every part of the data
analyst's decision making. What is the next level
of information that needs to be extracted and in
what timeframe? These questions are particularly
important when dealing with observational data
that can stretch endlessly in time in either
direction. What are the types of things that will be
controlled for in the analysis? Because these data
are not collected in a randomized fashion it is
critical to consider the characteristics that make
patients different from one another. Is the order of
occurrence important for certain variables? This
question is important to any analysis where
changes in patient status are linked to treatment
effects. Is it enough to know whether something
'ever' occurred? This devious question is one of
the hardest to answer, because data collected
even over long periods of time are never
complete.
ESTABLISH DATE WINDOWS
The study population can be defined in a variety
of ways, depending on the information available
and the research question. Patients may be
included based, for example, on their course of
treatment, or on their disease status at the time of
study entry or diagnosis, or perhaps on some of
their demographic characteristics. This is the
point at which the concept of date windows is
introduced. If the patient population is defined
based on an occurrence of something, treatment
for example, then a date window during which
treatments of interest are flagged needs to be
established. Initial treatment might include any
therapy noted in the six months after diagnosis of
a disease. Subsequent therapies might be
flagged as those that occur an additional six
months beyond initial treatment. Essentially these
windows depend very heavily on definitions of
'treatment', 'diagnosis', 'first treatment', and so
forth.
If information on disease status (a clinical score,
for instance) is recorded in the database on
exactly the same date as diagnosis, then it will be
easy to associate that score with the diagnosis of
the disease. However, if such information is
recorded around the time of diagnosis, date
windows (expansion of time beyond the single
point) need to be defined around the diagnosis in
which the variables containing disease status
information should be extracted. If disease status
variables have multiple values recorded in the
specified windows, then a decision will need to be
made as to which value should be counted (e.g.,
earliest, latest, highest value). For example,
cancer scores are taken at the time of diagnosis,
but different diagnostic tests may yield different
results. The diagnostic tests may or may not
occur all on the same day. Ideally, the clinician has decided upon a definitive date of diagnosis, which
is available in the database. But the analyst
should anticipate the possibility of multiple tests
occurring. There will need to be a discussion as to
how to deal with this issue programmatically.
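As a sketch of one programmatic approach, assuming a data set STATUS with one record per status measurement that already carries each patient's DIAGDATE, a 30-day window with an 'earliest value' rule might look like:
*-----------------------------------*;
* Keep status values within 30 days *;
* of diagnosis; take the earliest.  *;
*-----------------------------------*;
proc sort data=status(where=(abs(statdate - diagdate) le 30))
          out=instat;
  by ptid statdate;
run;

data dxstat;
  set instat;
  by ptid;
  if first.ptid;  * earliest value in the window;
run;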
SUMMARIZATION TECHNIQUES
Figure 2 includes SAS code used to gather treatment information that occurs up to and including the date of a questionnaire. A treatment-level data set is merged with a questionnaire-level data set. When the treatments 'tx1' and 'tx2' occur, a flag is set and only questionnaire observations are output. The flag indicates that a patient underwent treatments 'tx1' or 'tx2' on or before that particular questionnaire. The earliest date of 'tx1' and 'tx2' up to the time of the questionnaire is retained for later use.
In this example, a six-month (180 day) window
was established during which any and all
treatment information was extracted. It was sorted
by date of treatment within each patient so that
once the type of treatment was arrayed out at the
patient-level, the order of occurrence could be
assessed as part of the categorization process.
For example, it might be of interest that
medication was given pre-surgery to improve the
outcome and these patients would be categorized
differently than those without the medication.
FIG. 2. PRIOR TREATMENT DATE & FLAG
*-----------------------------------*;
* Flag any tx up to QoL instrument. *;
*-----------------------------------*;
proc sort data=qol;
  by ptid qoldate;
run;

proc sort data=tx;
  by ptid txdate;
run;

data flagit;
  merge qol(in=qol rename=(qoldate=txdate))
        tx(in=txs);
  by ptid txdate;
  retain pre_tx1f pre_tx1d pre_tx2f pre_tx2d;
  array pretxf{*} pre_tx1f pre_tx2f;
  array pretxd{*} pre_tx1d pre_tx2d;

  *-----------------------------------*;
  * Initialize the arrays.            *;
  *-----------------------------------*;
  if first.ptid then do;
    do i = 1 to dim(pretxf);
      pretxf{i} = 0;
      pretxd{i} = .;
    end;
  end;

  *-----------------------------------*;
  * Set flag and keep earliest txdate.*;
  *-----------------------------------*;
  if txdesc = 'tx1' then do;
    pre_tx1f = 1;
    pre_tx1d = min(pre_tx1d, txdate);
  end;
  else if txdesc = 'tx2' then do;
    pre_tx2f = 1;
    pre_tx2d = min(pre_tx2d, txdate);
  end;

  if qol then output;
run;
Figure 3 includes SAS code that arrays out treatment information for each patient and concatenates the types of treatment into a single variable. A frequency of this information is created and reviewed. Using this frequency, an algorithm can be developed which places patients into categories based on their treatment information.
A count variable is created for two reasons. Its value is one for each patient's first treatment observation, two for the second, and so on. It helps in the creation of the array. Additionally, the maximum value of COUNT across all patients is kept in a macro variable for use in the array processing. As inclusion criteria change (e.g., the treatment window is expanded), the macro variable keeps the analyst from having to search through code to change the length of the array.
Finally, all of the treatment descriptive information is concatenated into one variable and a frequency is printed. From this information, patients are grouped into meaningful categories.
FIG. 3. CREATE INITIAL TREATMENT
*-----------------------------------*;
* Macro to concatenate tx info.     *;
*-----------------------------------*;
%macro doit;
  %do i = 1 %to &numtx;
    init_tx = trim(init_tx) || '-' || txdesc&i;
  %end;
%mend doit;

*-----------------------------------*;
* Sort and keep only txs whose date *;
* falls in the initial tx window.   *;
*-----------------------------------*;
proc sort data=tx(where=(txdate le diagdate+180));
  by ptid txdate;
run;

*-----------------------------------*;
* Count # of txs per pt.            *;
*-----------------------------------*;
data count;
  set tx;
  by ptid;
  retain count;
  if first.ptid then count = 0;
  count = count + 1;
run;

proc summary data=count nway;
  output out=counts max(count)=;
run;

*-----------------------------------*;
* Macro-ize the # for use in array. *;
*-----------------------------------*;
data _null_;
  set counts;
  call symput('numtx', left(put(count,3.)));
run;

data _null_;
  put "The max # of txs per pt is &numtx";
run;

*-----------------------------------*;
* Array out tx info per pt.         *;
*-----------------------------------*;
data pt_tx1(drop=txdate txdesc);
  set count;
  by ptid count;
  length init_tx $200;  * room for the concatenation;
  array txtxt{*} $ txdesc1-txdesc&numtx;
  array txdte{*} txdate1-txdate&numtx;
  retain txdesc1-txdesc&numtx txdate1-txdate&numtx;

  *-----------------------------------*;
  * Initialize the arrays.            *;
  *-----------------------------------*;
  if first.ptid then do;
    do i = 1 to &numtx;
      txtxt{i} = ' ';
      txdte{i} = .;
    end;
  end;

  *-----------------------------------*;
  * Update the array with tx info.    *;
  *-----------------------------------*;
  txtxt{count} = txdesc;
  txdte{count} = txdate;
  if last.ptid then do;
    %doit;
    output;
  end;
run;

proc freq data=pt_tx1;
  tables init_tx;
  title 'Pt-level initial tx information';
run;
DISCUSSION
Dealing with multiple data sources and combining data into an analytic file (or series of analytic files) is complicated. In part, the complexity is due to:
• Recording of duplicate information across multiple data collection forms.
• Issues of timing when defining the patient population and when deciding what information to extract and how.
• The nature of analyzing observational data where data collection is ongoing.
Some suggestions to consider include:
• Keeping good documentation.
• Understanding and discussing the level of analysis and which source of data to use.
• Freezing ongoing databases.
• Using only frozen copies for analysis.
• Remembering to ask lots of questions.
• Taking the time to establish some useful summarization techniques.
Keep in mind that the data are bound to change over time (and even the analyst). This paper is a product of several years of trying things one way, finding an even better way, soliciting guidance from other analysts experienced in ongoing data collection and dissemination, and continually learning along the way.
Cisternas, M. and Stoddard, M. 1996. "Designing and Implementing SAS Programming Standards: Examples and Suggestions," Proceedings of the Fourth Annual Western Users of SAS Software Regional Users Group Conference, 348-352.
ACKNOWLEDGMENTS
SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
AUTHOR CONTACT
Marcia Stoddard, Lead Statistical Analyst
[email protected]
Lewin-TAG, Inc.
490 2nd St., Suite 201
San Francisco, CA 94107
(415) 495-8966 (phone)
(415) 495-8669 (fax)