DEATH MATCHING: AN ALGORITHM AND COMPARISON OF ITS IMPLEMENTATION IN PROC SQL VS. PROC SQL PLUS A DATA STEP

Charlotte Corelle, Kaiser Permanente Center for Health Research
Dean MacLaughlin, Boston Collaborative Drug Surveillance Program
Lois Drew, Kaiser Permanente Center for Health Research
Mary Longacre, Kaiser Permanente Center for Health Research

ABSTRACT

This paper describes how we used base SAS® software to match person-based records to other person-based records without the benefit of a unique identifier on which to merge. It illustrates how analysts at the Center for Health Research (CHR) in Portland, Oregon, linked records of a study population whose vital status was unknown with state death certificate records to determine mortality. CHR analysts modified a death-matching algorithm used at the Boston Collaborative Drug Surveillance Program and implemented it in two different schemes, both of which used PROC SQL and one of which also used a DATA step. One of the coding schemes was considerably more efficient than the other. This paper presents the matching algorithm and compares the performance of the two coding implementations.

INTRODUCTION

From time to time, an analyst must develop an application to locate matching records among person-based data sets containing records without a unique common identifier. In epidemiological research, this task is sometimes broken into two steps: first, electronically identifying potential matches, and, second, reviewing (actually looking at) each of the potential matches. This paper focuses only on the process of electronically identifying the potential matches. It describes a powerful, efficient, and flexible approach to person-based record matching. We developed an algorithm and method to search for death certificates of study subjects; however, the approach could be used to perform other person-based or non-person-based matching without a unique common identifier.

RESEARCH OBJECTIVE

Our work with the Food and Drug Administration (FDA) required us to search through Oregon and Washington vital statistics data to find out which of a group of infants born in Kaiser Permanente, Northwest Region (KPNW) hospitals had died before reaching age one. Name was available in both data sources, but name is an unstable matching variable. The value for a person's name may vary as the result of a true name change (e.g., Theresa becomes Reesa), a mistaken spelling (e.g., Browne is entered as Brown), or other data entry and spelling errors (e.g., Thomas Lee Nyguen is entered as Lee Thoma Nyguen). We needed to develop a matching algorithm that did not rely on a single common identifier to join records that were potential matches.

METHODS

BACKGROUND

The CHR uses KPNW as a research laboratory. KPNW is a health maintenance organization (HMO) with 378,000 members located in the Portland, Oregon and Vancouver, Washington metropolitan areas. The HMO maintains a variety of automated databases used by CHR scientists and analysts. These databases include an outpatient pharmacy system, an automated hospital admission and discharge database, an outpatient utilization and morbidity database, a tumor registry, and a KPNW membership eligibility database. Across all databases a single unique identifier is assigned to each patient. This identifier allows us to link data from the various databases. For example, we can select all dispensing records for drugs of particular interest to the FDA. Then, using the unique identifier from the dispensing record, we can map to the other databases to get demographic, morbidity, and health care data for the drug users.

Our goal was to compare each record in the study population data set to every record in the death certificate data set in order to identify all potential matches. The study population was composed of all babies delivered at the KPNW hospitals from 1986 through 1989 (N = 18,000).
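The every-record-to-every-record comparison described above can be sketched in PROC SQL with two tiny tables. This is a minimal illustration, not the authors' code: the data set names study_toy, deaths_toy, and pairs_toy and all of the values are hypothetical. It relies only on the standard PROC SQL behavior that listing two tables in the FROM clause without a join condition produces their Cartesian product.

```sas
/* Hypothetical toy data: three study infants (names invented for illustration) */
data study_toy;
  input klname $ kfname $;
  datalines;
Browne Ann
Smith Lee
Jones Kim
;
run;

/* Hypothetical toy data: two death certificate records */
data deaths_toy;
  input slname $ sfname $;
  datalines;
Brown Ann
Smyth Lee
;
run;

/* Two tables in FROM with no join condition: every study_toy row is
   paired with every deaths_toy row, giving 3 x 2 = 6 joined rows. */
proc sql;
  create table pairs_toy as
  select *
  from deaths_toy, study_toy;
quit;
```

Because the number of joined rows is the product of the two input sizes, the output grows very quickly as the inputs grow, which is why any filtering that can be applied during the join matters so much for disk space.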
Because we wanted to ascertain only infant mortality among our study population, we restricted the study data set to KPNW babies not known to be alive after age one (N = 5,500), and we restricted the death data set to deaths occurring before age one (N = 5,500). We wanted to review manually only those joined records that had a matching potential score of at least 6; therefore, those were the only records we ultimately needed to save to an output data set.

Joining each record of the KPNW group to every record of the Oregon and Washington state deaths would have resulted in 30,250,000 joined records (5,500 x 5,500). When we calculated this, we realized we faced a serious shortage of disk space. We chose PROC SQL to perform the many-to-many merge and concentrated on reducing the size of the input data sets to minimize the requisite disk space. We believed we could build an algorithm to create a matching potential variable, but at first we understood how to implement the variable only in a DATA step. Figure 2 displays the data elements used in the matching process along with the variable names in each data source.

Figure 2

data element            KPNW variable    state death variable
last name               klname           slname
first name              kfname           sfname
middle initial          kinitial         sinitial
last name Soundex       klsndx           slsndx
first name Soundex      kfsndx           sfsndx
birth month             kbmon            sbmon
birth day               kbday            sbday
birth year              kbyear           sbyear
gender                  ksex             ssex
last known alive date   klastalv
age one birthday        kbday1
date of death                            sdod

Besides the baby's name, date of birth (DOB), and gender, the KPNW data set included two other variables important to the matching process. One of these variables was the date of the baby's last KPNW hospital or clinic visit (last known alive date), and the other was the date the infant would have turned one year old (date of first birthday). We used these two variables to create a search window within which to fit the date of death (DOD) from the certificate data set.

If records matched on gender and if the DOD on the certificate record fell between the date the infant last visited the hospital or clinic and the date of the infant's first birthday (i.e., if the death fit the KPNW infant's search window), then we wanted to evaluate the matching potential of the joined record electronically; otherwise, we concluded there was not a potential match.

We decided to put the records of each gender through a PROC SQL step separately and to use a WHERE clause to eliminate the need to join records in which DOD fell outside the search window. Using this approach we barely had enough disk space to do the SQL processing that would join the records we considered might have some matching potential and output them for further processing in a DATA step.

In the DATA step we next created a matching potential variable. This variable scored how well the components of name and DOB matched between the two data sources. The maximum matching potential score for a joined record was 11. Figure 1 shows how points were assigned to create the matching potential score:

Figure 1

data element      point value
birth month       2 if exact match, else 0
birth day         2 if exact match, else 0
birth year        2 if exact match, else 0
last name         2 if exact match, else 1 if Soundex match, else 0
first name        2 if exact match, else 1 if Soundex match, else 0
middle initial    1 if exact match, else 0

Because at first we could not figure out how to code the entire matching algorithm in PROC SQL, we used PROC SQL to join gender-matched records where DOD fell within the study infant's search window and used a DATA step after that to create the matching potential score. Figure 3 shows the code for the inefficient two-step method.

Figure 3

    /* Step 1: join every death record to every KPNW record (female   */
    /* records only), keeping pairs whose DOD is in the search window */
    proc sql;
      create table female1 as
      select *
      from deaths, kaiser
      where sdod between klastalv and kbday1;
    quit;

    /* Step 2: score each joined record and keep scores of 6 or more. */
    /* Each comparison resolves to 1 (true) or 0 (false).             */
    data female2;
      set female1;
      matchpot = ((kbmon=sbmon)       * 2) +
                 ((kbday=sbday)       * 2) +
                 ((kbyear=sbyear)     * 2) +
                 ((kfsndx=sfsndx)     * 1) +
                 ((kfname=sfname)     * 1) +
                 ((kinitial=sinitial) * 1) +
                 ((klsndx=slsndx)     * 1) +
                 ((klname=slname)     * 1);
      if matchpot >= 6;
    run;

This first method worked, but it was extremely disk-intensive and used 30% more CPU time than the preferred method we subsequently tested. The preferred method implements the algorithm entirely within one PROC SQL step. It resolves a compound expression within the WHERE clause to evaluate whether the minimum matching potential score is met. In addition to saving CPU time, when we tested the two methods with input data sets of female records only, the preferred implementation used negligible disk space for processing (less than 1 block as compared to 180,000 blocks). After the test we added a gender-matching requirement to the WHERE clause and rewrote the program so that the male and female records could remain together for processing.

Figure 4 shows the code for implementing the matching algorithm in the efficient one-step method described above. The preferred method creates in one step a set of all potential matches with a score of 6 or more among study infants and infants with death certificates of the same gender when the state DOD falls after the KPNW baby's last known alive date and before the date of her first birthday.

Figure 4

    /* One step: gender match, search window, and minimum score are   */
    /* all tested in the WHERE clause, so only potential matches are  */
    /* written to the output table                                    */
    proc sql;
      create table toreview as
      select *
      from deaths, kaiser
      where (ksex=ssex) &
            (sdod between klastalv and kbday1) &
            (sum(((kbmon=sbmon)       * 2),
                 ((kbday=sbday)       * 2),
                 ((kbyear=sbyear)     * 2),
                 ((kfsndx=sfsndx)     * 1),
                 ((kfname=sfname)     * 1),
                 ((kinitial=sinitial) * 1),
                 ((klsndx=slsndx)     * 1),
                 ((klname=slname)     * 1)) >= 6);
    quit;

SIGNIFICANCE

The benefit of the method we described is in its power, efficiency, and flexibility. Its power comes from PROC SQL, which allows each record of one data set to be joined with every record of another data set. It is efficient because a record is read into the data set of potential matches for manual review only when the compound expression in the WHERE clause resolves to 1 (true). This saves an enormous amount of processing disk space. Finally, it is flexible because the matching algorithm can be changed easily, depending upon the data and the matching task at hand.

ACKNOWLEDGMENTS

We developed the algorithm and code at the Center for Health Research of Kaiser Permanente, Northwest Region (KPNW). The work was done in conjunction with CHR's participation in a Cooperative Agreement with the Food and Drug Administration to conduct studies on adverse effects of marketed drugs.

SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are registered trademarks or trademarks of their respective companies.

AUTHOR CONTACT

Charlotte Corelle
Kaiser Permanente Center for Health Research
3800 N. Kaiser Center Drive
Portland, Oregon 97227
(503) 335-6740
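As an illustration of the flexibility claimed for the one-step method, the sketch below runs the Figure 4 approach on a small made-up example with the minimum score held in a macro variable. Everything here is an assumption for illustration only: the data set names kaiser_demo, deaths_demo, and toreview_demo, the macro variable minscore, and all data values are hypothetical. The Soundex columns are derived with the base SAS SOUNDEX function, which may or may not be how the authors created theirs.

```sas
/* Assumption: the threshold is a parameter; the paper uses a fixed 6 */
%let minscore = 6;

/* Hypothetical KPNW-side records; Soundex codes derived with SOUNDEX() */
data kaiser_demo;
  input klname $ kfname $ kinitial $ kbmon kbday kbyear ksex $
        klastalv :date9. kbday1 :date9.;
  klsndx = soundex(klname);
  kfsndx = soundex(kfname);
  format klastalv kbday1 date9.;
  datalines;
Browne Ann M 3 14 1987 F 01APR1987 14MAR1988
Smith Lee J 7 2 1988 M 15JUL1988 02JUL1989
;
run;

/* Hypothetical death certificate records */
data deaths_demo;
  input slname $ sfname $ sinitial $ sbmon sbday sbyear ssex $
        sdod :date9.;
  slsndx = soundex(slname);
  sfsndx = soundex(sfname);
  format sdod date9.;
  datalines;
Brown Ann M 3 14 1987 F 20JUN1987
Jones Kim K 1 5 1988 F 10FEB1988
;
run;

/* Gender, search window, and score are all tested during the join,
   so only pairs meeting the threshold reach the output table. */
proc sql;
  create table toreview_demo as
  select *
  from deaths_demo, kaiser_demo
  where (ksex = ssex) and
        (sdod between klastalv and kbday1) and
        (sum(((kbmon=sbmon)       * 2),
             ((kbday=sbday)       * 2),
             ((kbyear=sbyear)     * 2),
             ((kfsndx=sfsndx)     * 1),
             ((kfname=sfname)     * 1),
             ((kinitial=sinitial) * 1),
             ((klsndx=slsndx)     * 1),
             ((klname=slname)     * 1)) >= &minscore);
quit;
```

With these toy values, only the Browne/Brown pair should survive: it matches on gender, falls inside the search window, and differs only in the exact spelling of the last name. Changing the threshold or a weight means editing a single value in the WHERE clause rather than restructuring the program.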