Getting from Complex Databases to a SAS® Analytic File: Tales from the Trenches

MARCIA STODDARD, LEWIN-TAG, INC.

ABSTRACT

Increasingly, data destined for statistical analysis in SAS come from one or more complex databases. In this paper the author demonstrates various methods of gathering information from multiple, complex data streams and knitting them together into one or more analytic data sets. The types of data considered include date-specific event histories, episodic clinical data, and patient-provided panel data. The focus is on how to use the linkages among these various types of information to create analytic files in SAS. This paper also explores how to apply summarization techniques to both prospective and retrospective data sources.

INTRODUCTION

Health care research often depends on data collection in the absence of fully formed research questions. Often detailed patient- and physician-reported data about a specific condition are collected over a long period of time (e.g., registry data, observational data). Sometimes these data may be augmented with chart abstraction and survey questionnaires. These rich sources of data can be used in more than one analytic study and can provide a level of detail that will yield results in different disciplines (e.g., Quality of Life (QoL) research, assessment of practice patterns, economic outcomes). However, the very complex and detailed nature of these data presents challenges in building and using analytic files.

Data collection is sometimes initiated before exploring every avenue of interesting and useful research and may be driven by the multiple interests represented by a team of specialists. Clinicians select which portions of their patients' medical information - past, present, and future - to gather and put into an observational database. Meanwhile, academic researchers develop questionnaires to administer to the specialists' patients on a regular basis covering demographics, QoL, satisfaction with care, and a variety of health care resource use. Data analysts weigh in with database design issues. The basic notion is to have ongoing data collection for a select group of people.

This kind of longitudinal data collection is common in health care research and is often continuously 'mined' for information as it continues to grow and change. In long-term studies, data elements may be added or changed. Patients are added or lost to follow-up. As data are gathered and entered they also evolve over time. Thus, as the database matures, requests for apparently 'simple' analyses may present complexities for the analyst that are primarily linked to the structure of the data and their internal relationships. The analyst will be asked to do what may sound like a 'quick and dirty' task that turns out to be a major undertaking - even when the research question is deceptively simple: 'How many patients had a surgical treatment?'. There are factors to consider, such as time, definitions, and the patient sample. In no time, the analyst will be rattling off lists of questions hoping to accurately select the information required to create useful analytic files from large complex databases. The following definitions outline the types of data that will be discussed:

DEFINITIONS

Panel data: Data that are collected in waves, usually in predefined intervals but not necessarily equally spaced.

Event history data: Data for which you know the exact date of occurrence (e.g., birth, death, marriage, divorce, job status).
This type of data may also be reconstructed based on recall.

Episodic clinical data: Data that are gathered every time there is a clinical encounter (e.g., doctor visit, lab values, hospitalization).

Analytic file: Data that are used as input to SAS procedures.

A VIEW OF COMPLEX DATA

Prior to the creation of a complex database, there are numerous decisions made that involve different aspects of the study: defining the patient population (e.g., sampling, inclusion and exclusion criteria), the timing of collection, the data elements of interest, and the forms to be used. These decisions can incorporate the input of researchers, sponsors, clinicians, academicians, and data analysts. The result of this collaboration is the study protocol. The information contained within this protocol might include a copy of the data collection forms; an outline detailing what forms are to be distributed, when, and to whom; rules on how and what portions of the data are to be coded; and criteria for selecting patients.

Data collection may follow a relatively typical scenario. The clinicians that agree to take part in the study screen and enroll patients, and obtain informed consent. During a patient's visit, they explain the study and ask if the patient would be interested in getting involved. The patients that agree and that meet study criteria are generally handed a baseline questionnaire that contains questions about demographics, comorbidities, and QoL. The clinician fills out a baseline clinical form detailing diagnostic and past treatment information. Additionally, any current information based on the visit that day (e.g., medications prescribed, labs ordered) is recorded on a separate form.

Figure 1 shows how different types of data enter into the picture. The first line indicates questionnaires that are sent out on a regular basis, quarterly for example. These represent patient-reported panel data where current health status is recorded. A patient might not return a questionnaire at any given timepoint, and therefore there is a hole there representing the possibility of missing data. The second line shows when a patient reports a hospitalization (or any other resource use) in the past quarter. Based on recall, the patient is reconstructing event history data. When a clinician records a visit, medication, or other medical procedure (surgery, lab test), these data are included as well. The third line represents this frequency and variety of episodic clinical data.

FIG. 1. STREAMS OF DATA
[Timeline figure: three parallel rows of tick marks per patient, one each for questionnaires, hospitalizations, and clinical encounters.]

WHAT MAKES THESE DATA COMPLEX?

These data are complex for a variety of reasons. Inherent in the data gathering process for observational studies is ongoing change. There is a need to allow for flexibility in the data collection forms so that the latest treatments and practice patterns can be captured. New therapies might be introduced, and variables could be added to the forms and thus, the database. In order to understand occurrences that have not been pre-defined, open-ended text fields are collected, periodically analyzed, and sometimes recast as coded fields. As a result, whole new sections are added to the data collection forms so that frequent responses can be collected as coded responses rather than open-ended text fields. The prospect of open-ended text fields is one that can cause a fair amount of apprehension for the analysts involved in a project, whereas a researcher may view them as an opportunity for exploration. Change must be possible, and expected, but must also be systematic and controlled. It is often the analyst who is the overseer of this process.
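As a sketch of how open-ended responses might be recast as coded fields, the steps below are illustrative only; the data set VISITS, the text field OTHERTX, the code values, and the matching rules are hypothetical, not from the paper.

* Review free-text responses, most frequent first. *;
proc freq data=visits order=freq;
  tables othertx / nocum;
run;

* Recode frequent responses into a coded variable; *;
* unmapped non-blank text is flagged for review.   *;
data visits2;
  set visits;
  length othercd $8;
  if      upcase(othertx) =: 'PHYSICAL THER' then othercd = 'PT';
  else if upcase(othertx) =: 'ACUPUNCT'      then othercd = 'ACU';
  else if othertx ne ' '                     then othercd = 'REVIEW';
run;

Once the frequent responses have stable codes, the form itself can be revised to collect them as coded fields going forward.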
DOCUMENTATION

Before running so much as a line of SAS code or extracting a single morsel of information from the database, it is important to get all parties involved to agree upon a variety of issues related to the data. The minimal facts that need to be established are which patients will be included and what types of information will be explored during data analysis. Before data collection even begins, there will be a string of memoranda, meetings, and emails where questions are asked and answered. If you document all decisions made along the way, the methodology section of papers will be much easier to write. In addition, when the piles of output start to accumulate on your desk, it helps to keep things organized. For example, copies of data collection forms should be indexed as they will change over time (e.g., form version A, with B, C, etc. to follow). Guides are available which give suggestions on organizing project information and program logs and output (Cisternas and Stoddard 1996).

DETERMINE THE VARIABLES

The analyst and researchers need to identify what portions of the data collection forms will be analyzed, i.e., which data elements should be extracted and consolidated into an analytic file. Some information may be collected on multiple forms. For example, patients report quarterly on their recent treatment history, while at the same time the physicians are reporting on the clinical information, treatment and otherwise, as it occurs. If the analysis of choice involves treatment, the researchers need to decide whether patient self-reported or clinical information will be used. Alternatively, one could be used to validate the other by comparing the information from both sources.

It is highly recommended to have copies of annotated forms on hand during discussions about what information is to be extracted and analyzed. As the forms change over time, a copy of each iteration should be kept on hand in order to be able to trace the history of which variables contain the data of interest. For example, it is important to be aware of data that were once captured in open-ended fields and have since been moved to their own separate variables. The implication is that it may be necessary to search both places for the data.

FREEZE THE DATA

Observational databases by nature are in a constant state of change. However, most of them have imposed (somewhat arbitrary in many cases) 'phases' or stages: points in time at which the data are considered complete. This is often referred to as a closed or frozen database. In order to be able to reconstruct any files or replicate analytic results during or after analysis, it is a good idea to extract files for analysis only from a frozen copy of the database, and to document which 'phase' of the data was used. This is particularly important if you're using a database without some built-in rollback feature. It is also advisable to create permanent files for every extraction or 'phase' of the database. If done on a regular basis, say quarterly, then later studies can use the most recent copy of the data, or easily re-analyze historical data from a new perspective. Quality control can be done on any new or updated data since the most recent freeze of the database.
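A minimal sketch of saving a freeze as permanent SAS data sets follows; the library path and the member names PTS, TX, and QOL are hypothetical.

* Point at a permanent, dated location for this freeze. *;
libname frz99q1 '/project/frozen/1999q1';

* Copy the extracted WORK data sets into the frozen library. *;
proc copy in=work out=frz99q1 memtype=data;
  select pts tx qol;
run;

Later analyses can then point a libname at the appropriate freeze, so results remain reproducible even as the live database continues to change.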
EXTRACT THE DESIRED DATA

Once criteria for a patient population are established, the data extraction begins. It is useful to show the researchers involved exactly which portions of the form will be used in the data extraction to make sure everyone is literally on the same page. It is important to agree which variables will be extracted for purposes of defining which patients are included. In addition, it is crucial to agree on data transformations and definitions of any new data elements created for the analysis. QoL scores are a good example - they are typically constructed from single items of varying scales (values of 1-4 or 1-5) and converted to 100-point scales. In some cases they are also weighted. Another example of a common data manipulation is the definition of treatment duration, which depends on strict identification of start and stop points, but may also depend on combinations of clinical variables. Initial treatment generally is defined as treatment that happens in some window around diagnosis of a disease. Any number of treatments might occur in the specified window. Subsequent treatment may also be of interest. Refining the definitions of treatment requires taking a closer look at the data.
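As a sketch of the kind of QoL conversion described above, the step below rescales 1-5 items to a 0-100 scale and averages them into an unweighted summary score; the data set QOL and the item names Q1-Q5 are hypothetical stand-ins.

data qolscore;
  set qol;
  array item{5} q1-q5;
  array resc{5} r1-r5;
  do i = 1 to 5;
    * (item - 1) / 4 maps the 1-5 range onto 0-1, then scale by 100. *;
    if item{i} ne . then resc{i} = 100 * (item{i} - 1) / 4;
  end;
  * Average the rescaled items, tolerating item-level missingness. *;
  score = mean(of r1-r5);
  drop i;
run;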
CATEGORIZE TYPE OF TREATMENT

Depending on what type of disease is being studied, evaluating a course of treatment can be quite challenging from a data analyst's perspective. There may be some types of medical or surgical procedures that occur over several days. There may also be data on medications of varying types, each with start and stop dates, which last for varying periods of time. In order to operationalize a definition of treatment, the analyst develops a categorization process that typically involves timing, context, and content.

A suggested approach to dealing with treatment information is as follows. First, isolate the variables of interest which contain the dates of the treatment. Create a simple data set which includes the patient ID, a date variable, and a character variable whose value is the name of the variable whenever there is a date for a treatment recorded. For example, treatment X recorded in TX1VAR with a value of 01/01/99 would translate to PTID='Pt Bob', TXDATE=01/01/99, and TXTEXT='TX1VAR'. Then, create a descriptive variable (e.g., TXDESC) which groups certain values of TXTEXT into a descriptive and easier to understand set of values. If TX1VAR, TX2VAR, and TX3VAR all record similar treatments, then TXDESC is set to the same value when TXTEXT takes on those values. These data can then be arrayed by date and reviewed to see if they occur within a certain window of time. After looking over the output, mutually exclusive and exhaustive categories can be set up and SAS code written to try to place each patient in one of the categories based on their arrayed treatment information. The first line of defense to ensure the categories are set up correctly is to use a series of IF statements, rather than ELSE IFs, and set flags for each category. Then, see if the sum of all of the flags is ever other than one.
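A minimal sketch of the two steps just described follows; the input data set CLIN, the variables TX1VAR-TX3VAR, the patient-level set PT_LEVEL with dates MED_D and SURG_D, and the category definitions are all hypothetical stand-ins.

* Step 1: one row per recorded treatment date, tagged with the *;
* source variable name and a descriptive grouping.             *;
data txlong(keep=ptid txdate txtext txdesc);
  set clin;
  length txtext $8 txdesc $12;
  array txs{3} tx1var tx2var tx3var;
  do i = 1 to 3;
    if txs{i} ne . then do;
      txdate = txs{i};
      txtext = upcase(vname(txs{i}));
      * TX1VAR and TX2VAR record similar treatments. *;
      if txtext in ('TX1VAR','TX2VAR') then txdesc = 'medication';
      else txdesc = 'surgery';
      output;
    end;
  end;
  format txdate mmddyy8.;
run;

* Step 2: set one flag per category with independent IFs, then  *;
* confirm the flags always sum to exactly one.                  *;
data check;
  set pt_level;
  medflag  = 0; surgflag = 0; bothflag = 0;
  if med_d ne . and surg_d =  . then medflag  = 1;
  if med_d =  . and surg_d ne . then surgflag = 1;
  if med_d ne . and surg_d ne . then bothflag = 1;
  nflags = sum(medflag, surgflag, bothflag);
run;

proc freq data=check;
  tables nflags;
  * Any NFLAGS value other than 1 signals overlapping or missing categories. *;
run;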
THINGS TO THINK ABOUT

Once the population is established, it is time for the next step in data extraction. What is the main research question? This is the central issue that theoretically should inform every part of the data analyst's decision making. What is the next level of information that needs to be extracted, and in what timeframe? These questions are particularly important when dealing with observational data that can stretch endlessly in time in either direction. What are the types of things that will be controlled for in the analysis? Because these data are not collected in a randomized fashion, it is critical to consider the characteristics that make patients different from one another. Is the order of occurrence important for certain variables? This question is important to any analysis where changes in patient status are linked to treatment effects. Is it enough to know whether something 'ever' occurred? This devious question is one of the hardest to answer, because data collected even over long periods of time are never complete.

ESTABLISH DATE WINDOWS

The study population can be defined in a variety of ways, depending on the information available and the research question. Patients may be included based, for example, on their course of treatment, or on their disease status at the time of study entry or diagnosis, or perhaps on some of their demographic characteristics. This is the point at which the concept of date windows is introduced. If the patient population is defined based on an occurrence of something, treatment for example, then a date window during which treatments of interest are flagged needs to be established. Initial treatment might include any therapy noted in the six months after diagnosis of a disease. Subsequent therapies might be flagged as those that occur an additional six months beyond initial treatment. Essentially these windows depend very heavily on definitions of 'treatment', 'diagnosis', 'first treatment', and so forth.

If information on disease status (a clinical score, for instance) is recorded in the database on exactly the same date as diagnosis, then it will be easy to associate that score with the diagnosis of the disease. However, if such information is recorded around the time of diagnosis, date windows (expansions of time beyond the single point) need to be defined around the diagnosis, within which the variables containing disease status information should be extracted. If disease status variables have multiple values recorded in the specified windows, then a decision will need to be made as to which value should be counted (e.g., earliest, latest, highest value). For example, cancer scores are taken at the time of diagnosis, but different diagnostic tests may yield different results. The diagnostic tests may or may not all occur on the same day. The clinician ideally decided upon a definitive date of diagnosis, which is available in the database. But the analyst should anticipate the possibility of multiple tests occurring. There will need to be a discussion as to how to deal with this issue programmatically.
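One way such a window might be programmed is sketched below; the data sets SCORES and DX, the variable names, and the 30-day window are hypothetical choices, and the 'earliest value counts' rule is just one of the options mentioned above.

* Keep status scores recorded within +/- 30 days of diagnosis. *;
proc sql;
  create table dxscore as
  select s.ptid, s.scoredate, s.score
  from scores as s, dx as d
  where s.ptid = d.ptid
    and s.scoredate between d.diagdate - 30 and d.diagdate + 30
  order by s.ptid, s.scoredate;
quit;

* When several scores qualify, keep the earliest per patient. *;
data dxscore1;
  set dxscore;
  by ptid scoredate;
  if first.ptid;
run;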
SUMMARIZATION TECHNIQUES

Figure 2 includes SAS code used to gather treatment information that occurs up to and including the date of a questionnaire. A treatment-level data set is merged with a questionnaire-level data set. When the treatments 'tx1' and 'tx2' occur, a flag is set and only questionnaire observations are output. The flag indicates that a patient underwent treatments 'tx1' or 'tx2' on or before that particular questionnaire. The earliest date of 'tx1' and 'tx2' up to the time of the questionnaire is retained for later use.

In this example, a six-month (180 day) window was established during which any and all treatment information was extracted. It was sorted by date of treatment within each patient so that once the type of treatment was arrayed out at the patient level, the order of occurrence could be assessed as part of the categorization process. For example, it might be of interest that medication was given pre-surgery to improve the outcome, and these patients would be categorized differently than those without the medication.

FIG. 2. PRIOR TREATMENT DATE & FLAG

*-----------------------------------*;
* Flag any tx up to QoL instrument. *;
*-----------------------------------*;
proc sort data=qol; by ptid qoldate; run;
proc sort data=tx;  by ptid txdate;  run;

data flagit;
  merge qol(in=qol rename=(qoldate=txdate))
        tx(in=txs);
  by ptid txdate;
  retain pre_tx1f pre_tx2f pre_tx1d pre_tx2d;
  array pretxf{*} pre_tx1f pre_tx2f;
  array pretxd{*} pre_tx1d pre_tx2d;

  *-----------------------------------*;
  * Initialize the arrays.            *;
  *-----------------------------------*;
  if first.ptid then do;
    do i = 1 to dim(pretxf);
      pretxf{i} = 0;
      pretxd{i} = .;
    end;
  end;

  *-----------------------------------*;
  * Set flag and keep earliest txdate.*;
  *-----------------------------------*;
  if txdesc = 'tx1' then do;
    pre_tx1f = 1;
    pre_tx1d = min(pre_tx1d, txdate);
  end;
  else if txdesc = 'tx2' then do;
    pre_tx2f = 1;
    pre_tx2d = min(pre_tx2d, txdate);
  end;

  if (qol) then output;
run;

Figure 3 includes SAS code that arrays out treatment information for each patient and concatenates the types of treatment into a single variable. A frequency of this information is created and reviewed. Using this frequency, an algorithm can be developed which places patients into categories based on their treatment information.

A count variable is created for two reasons. Its value is one for each patient's first treatment observation, two for the second, and so on, which helps in the creation of the array. Additionally, the maximum value of COUNT across all patients is kept in a macro variable for use in the array processing. As inclusion criteria change (e.g., the treatment window is expanded), the macro variable keeps the analyst from having to search through code to change the length of the array.

FIG. 3. CREATE INITIAL TREATMENT

*-----------------------------------*;
* Macro to concatenate tx info.     *;
*-----------------------------------*;
%macro doit;
  %do i = 1 %to &numtx;
    init_tx = trim(init_tx) || '-' || txdesc&i;
  %end;
%mend doit;

*-----------------------------------*;
* Sort and keep only txs whose date *;
* falls in the initial tx window.   *;
*-----------------------------------*;
proc sort data=tx(where=(txdate le diagdate+180));
  by ptid txdate;
run;

*-----------------------------------*;
* Count # of txs per pt.            *;
*-----------------------------------*;
data count;
  set tx;
  by ptid;
  retain count;
  if first.ptid then count = 0;
  count = count + 1;
run;

proc summary data=count nway;
  output out=counts max(count)=;
run;

*-----------------------------------*;
* Macro-ize the # for use in array. *;
*-----------------------------------*;
data _null_;
  set counts;
  call symput('numtx', left(put(count,3.)));
run;

data _null_;
  put "The max # of txs per pt is &numtx";
run;

*-----------------------------------*;
* Array out tx info per pt.         *;
*-----------------------------------*;
data pt_tx1(drop=txdate txdesc i);
  set count;
  by ptid count;
  length init_tx $200;
  array txtxt{*} $ txdesc1-txdesc&numtx;
  array txdte{*}   txdate1-txdate&numtx;
  retain txdesc1-txdesc&numtx txdate1-txdate&numtx;

  *-----------------------------------*;
  * Initialize the arrays.            *;
  *-----------------------------------*;
  if first.ptid then do;
    do i = 1 to &numtx;
      txtxt{i} = ' ';
      txdte{i} = .;
    end;
  end;

  *-----------------------------------*;
  * Update the array with tx info.    *;
  *-----------------------------------*;
  txtxt{count} = txdesc;
  txdte{count} = txdate;

  if last.ptid then do;
    %doit;
    output;
  end;
run;

proc freq data=pt_tx1;
  tables init_tx;
  title 'Pt-level initial tx information';
run;

Finally, all of the treatment descriptive information is concatenated into one variable and a frequency is printed. From this information, patients are grouped into meaningful categories.

DISCUSSION

Dealing with multiple data sources and combining data into an analytic file (or series of analytic files) is complicated. In part, the complexity is due to:

• Recording of duplicate information across multiple data collection forms.
• Issues of timing when defining the patient population and deciding what information to extract, when, and how.
• The nature of analyzing observational data where data collection is ongoing.

Some suggestions to consider include:

• Keeping good documentation.
• Understanding and discussing the level of analysis and which source of data to use.
• Freezing ongoing databases. Using only frozen copies for analysis.
• Remembering to ask lots of questions.
• Taking the time to establish some useful summarization techniques.

Keep in mind that the data are bound to change over time (and even the analyst). This paper is a product of several years of trying things one way, finding an even better way, soliciting guidance from other analysts experienced in ongoing data collection and dissemination, and continually learning along the way.

REFERENCES

Cisternas, M. and Stoddard, M. 1996. "Designing and Implementing SAS Programming Standards: Examples and Suggestions," Proceedings of the Fourth Annual Western Users of SAS Software Regional Users Group Conference, 348-352.

ACKNOWLEDGMENTS

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

AUTHOR CONTACT

Marcia Stoddard, Lead Statistical Analyst
[email protected]
Lewin-TAG, Inc.
490 2nd St., Suite 201
San Francisco, CA 94107
(415) 495-8966 (phone)
(415) 495-8669 (fax)