DEATH MATCHING: AN ALGORITHM AND COMPARISON OF ITS
IMPLEMENTATION IN PROC SQL VS. PROC SQL PLUS A DATA STEP
Charlotte Corelle, Kaiser Permanente Center for Health Research
Dean MacLaughlin, Boston Collaborative Drug Surveillance Program
Lois Drew, Kaiser Permanente Center for Health Research
Mary Longacre, Kaiser Permanente Center for Health Research

ABSTRACT
This paper describes how we used base SAS® software to match
person-based records to other person-based records without the
benefit of a unique identifier on which to merge. It illustrates how
analysts at the Center for Health Research (CHR) in Portland, Oregon,
linked records of a study population whose vital status was unknown
with state death certificate records to determine mortality. CHR
analysts modified a death-matching algorithm used at the Boston
Collaborative Drug Surveillance Program and implemented it in two
different schemes: one used PROC SQL followed by a DATA step, and the
other used PROC SQL alone. One of the coding schemes was considerably
more efficient than the other. This paper presents the matching
algorithm and compares the performance of the two coding
implementations.

INTRODUCTION
From time to time, an analyst must develop an application to locate
matching records among person-based data sets containing records
without a unique common identifier. In epidemiological research, this
task is sometimes broken into two steps: first, electronically
identifying potential matches, and, second, reviewing (actually
looking at) each of the potential matches. This paper focuses only on
the process of electronically identifying the potential matches. It
describes a powerful, efficient, and flexible approach to
person-based record matching. We developed an algorithm and method to
search for death certificates of study subjects; however, the
approach could be used to perform other person-based or
non-person-based matching without a unique common identifier.

BACKGROUND
We developed the algorithm and code at the Center for Health Research
of Kaiser Permanente, Northwest Region (KPNW). The work was done in
conjunction with CHR's participation in a Cooperative Agreement with
the Food and Drug Administration (FDA) to conduct studies on adverse
effects of marketed drugs.

The CHR uses KPNW as a research laboratory. KPNW is a health
maintenance organization (HMO) with 378,000 members located in the
Portland, Oregon and Vancouver, Washington metropolitan areas. The
HMO maintains a variety of automated databases used by CHR scientists
and analysts. These databases include an outpatient pharmacy system,
an automated hospital admission and discharge database, an outpatient
utilization and morbidity database, a tumor registry, and a KPNW
membership eligibility database. Across all databases a single unique
identifier is assigned to each patient. This identifier allows us to
link data from the various databases. For example, we can select all
dispensing records for drugs of particular interest to the FDA. Then,
using the unique identifier from the dispensing record, we can map to
the other databases to get demographic, morbidity, and health care
data for the drug users.

RESEARCH OBJECTIVE
Our work with the FDA required us to search through Oregon and
Washington vital statistics data to find out which of a group of
infants born in KPNW hospitals had died before reaching age one. Name
was available in both data sources, but name is an unstable matching
variable. The value for a person's name may vary as the result of a
true name change (e.g., Theresa becomes Reesa) or due to mistaken
spelling (e.g., Browne is entered as Brown) or due to other data
entry and spelling errors (e.g., Thomas Lee Nyguen is entered as Lee
Thoma Nyguen). We needed to develop a matching algorithm that did not
rely on a single common identifier to join records that were
potential matches.

METHODS
Our goal was to compare each record in the study population dataset
to every record in the death certificate data set in order to
identify all potential matches. The study population was composed of
all babies delivered at the KPNW hospitals from 1986 through 1989
(N≈18,000). Because we wanted to ascertain only infant mortality
among our study population, we restricted the study dataset to KPNW
babies not known to be alive after age one (N≈5,500) and we
restricted the death dataset to deaths occurring before age one
(N≈5,500).
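
To get a feel for the scale of an all-pairs comparison, the following
Python sketch counts the joined records for two inputs of about 5,500
records each, the approximate sizes given above. It is an
illustration only, since the authors worked in SAS, and the function
name cross_join_size is hypothetical.

```python
from itertools import product


def cross_join_size(n_left, n_right):
    """Number of joined records when every left record meets every right record."""
    return n_left * n_right


# Tiny demonstration that a many-to-many join enumerates every pair.
study_ids = ["k1", "k2", "k3"]
death_ids = ["s1", "s2"]
pairs = list(product(study_ids, death_ids))
assert len(pairs) == cross_join_size(len(study_ids), len(death_ids))

# Approximate sizes quoted in the text for the two restricted data sets.
full_join = cross_join_size(5_500, 5_500)
print(f"{full_join:,}")  # 30,250,000
```

Every one of those pairs would have to be written somewhere if the
join were materialized before any filtering took place.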

Joining each record of the KPNW group to every record of the Oregon
and Washington state deaths would have resulted in 30,250,000 joined
records (5,500 x 5,500). When we calculated this, we realized we
faced a serious shortage of disk space. We chose PROC SQL to perform
the many-to-many merge and concentrated on reducing the size of the
input data sets to minimize the requisite disk space. We believed we
could build an algorithm to create a matching potential variable, but
at first we understood how to implement the variable only in a DATA
step.

Besides the baby's name, date of birth (DOB), and gender, the KPNW
data set included two other variables important to the matching
process. One of these variables was the date of the baby's last KPNW
hospital or clinic visit (last alive date) and the other was the date
the infant would have turned one year old (date of first birthday).
We used these two variables to create a search window within which to
fit the date of death (DOD) from the certificate dataset.

If records matched on gender and if the DOD on the certificate record
fell between the date the infant last visited the hospital or clinic
and the date of the infant's first birthday (i.e., if the death fit
the KPNW infant's search window), then we wanted to evaluate the
matching potential of the joined record electronically; otherwise, we
concluded there was not a potential match. We decided to put the
records of each gender through a PROC SQL step separately and to use
a WHERE CLAUSE to eliminate the need to join records in which DOD
fell outside the search window. Using this approach we barely had
enough disk space to do the SQL processing that would join the
records we considered might have some matching potential and output
them for further processing in a DATA step.

In the DATA step we next created a matching potential variable. This
variable scored how well the components of name and DOB matched
between the two data sources. A maximum matching potential score for
a joined record was 11. Figure 1 shows how points were assigned to
create the matching potential score:

Figure 1
data element      point value
birth month       2 if exact match, else 0
birth day         2 if exact match, else 0
birth year        2 if exact match, else 0
last name         2 if exact match, else 1 if Soundex match, else 0
first name        2 if exact match, else 1 if Soundex match, else 0
middle initial    1 if exact match, else 0

We wanted to review manually only those joined records that had a
matching potential score of at least 6; therefore, those were the
only records we ultimately needed to save to an output data set.

Figure 2 displays the data elements used in the matching process
along with the variable names in each data source.

Figure 2
data element             KPNW variable    state death variable
last name                klname           slname
first name               kfname           sfname
middle initial           kinitial         sinitial
last name Soundex        klsndx           slsndx
first name Soundex       kfsndx           sfsndx
birth month              kbmon            sbmon
birth day                kbday            sbday
birth year               kbyear           sbyear
gender                   ksex             ssex
last known alive date    klastalv
age one birthday         kbday1
date of death                             sdod

Because at first we could not figure out how to code the entire
matching algorithm in PROC SQL, we used PROC SQL to join
gender-matched records where DOD fell within the study infant's
search window and used a DATA step after that to create the matching
potential score. Figure 3 shows the code for the inefficient two-step
method.

Figure 3
   /* Step 1: join every pair whose DOD falls in the search window */
   proc sql;
      create table female1 as
      select *
      from deaths, kaiser
      where sdod between klastalv and kbday1;
   quit;

   /* Step 2: score each joined record and keep scores of 6 or more */
   data female2;
      set female1;
      matchpot = ((kbmon=sbmon)       * 2) +
                 ((kbday=sbday)       * 2) +
                 ((kbyr=sbyr)         * 2) +
                 ((kfsndx=sfsndx)     * 1) +
                 ((kfname=sfname)     * 1) +
                 ((kinitial=sinitial) * 1) +
                 ((klsndx=slsndx)     * 1) +
                 ((klname=slname)     * 1);
      if matchpot>=6;
   run;

This first method worked, but it was extremely disk-intensive and
used 30% more CPU time than the preferred method we subsequently
tested. The preferred method implements the algorithm entirely within
one PROC SQL step. It resolves a compound expression within the WHERE
CLAUSE to evaluate whether the minimum matching potential score is
met. In addition to saving CPU time, when we tested the two methods
with input data sets of female records only, the preferred
implementation used negligible disk space for processing (less than 1
block as compared to 180,000 blocks). After the test we added a
gender-matching requirement to the WHERE CLAUSE and rewrote the
program so that the male and female records could remain together for
processing.
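
The idea of filtering each pair as it is produced can be sketched
outside SAS. The following Python illustration is a sketch under
stated assumptions, not the authors' code: it borrows the variable
names used in the figures, represents each record as a dictionary,
and runs on made-up records with illustrative Soundex-style codes.
The function names match_potential, in_search_window, and
potential_matches are hypothetical.

```python
from datetime import date
from itertools import product

# Weights follow the point scheme in the text: 2 points each for birth
# month, day, and year; 1 point each for the Soundex and exact forms of
# first and last name; 1 point for middle initial (maximum 11).
WEIGHTS = [
    ("kbmon", "sbmon", 2), ("kbday", "sbday", 2), ("kbyr", "sbyr", 2),
    ("kfsndx", "sfsndx", 1), ("kfname", "sfname", 1),
    ("kinitial", "sinitial", 1),
    ("klsndx", "slsndx", 1), ("klname", "slname", 1),
]


def match_potential(k, s):
    """Sum the weights of every field pair that agrees exactly."""
    return sum(w for kvar, svar, w in WEIGHTS if k[kvar] == s[svar])


def in_search_window(k, s):
    """Same gender, and DOD between last alive date and first birthday."""
    return k["ksex"] == s["ssex"] and k["klastalv"] <= s["sdod"] <= k["kbday1"]


def potential_matches(kaiser, deaths, min_score=6):
    """Yield only the pairs that pass every test; nothing else is stored."""
    for k, s in product(kaiser, deaths):
        if not in_search_window(k, s):
            continue
        score = match_potential(k, s)
        if score >= min_score:
            yield {**k, **s, "matchpot": score}


# Made-up records for illustration only.
kaiser = [dict(klname="BROWNE", kfname="ANN", kinitial="L", klsndx="B650",
               kfsndx="A500", kbmon=3, kbday=14, kbyr=1987, ksex="F",
               klastalv=date(1987, 3, 20), kbday1=date(1988, 3, 14))]
deaths = [
    # Misspelled last name, death inside the window: score 10.
    dict(slname="BROWN", sfname="ANN", sinitial="L", slsndx="B650",
         sfsndx="A500", sbmon=3, sbday=14, sbyr=1987, ssex="F",
         sdod=date(1987, 6, 1)),
    # Same person details, but death after the first birthday.
    dict(slname="BROWN", sfname="ANN", sinitial="L", slsndx="B650",
         sfsndx="A500", sbmon=3, sbday=14, sbyr=1987, ssex="F",
         sdod=date(1988, 5, 1)),
    # Inside the window, but only birth month and day agree: score 4.
    dict(slname="SMITH", sfname="JOHN", sinitial="Q", slsndx="S530",
         sfsndx="J500", sbmon=3, sbday=14, sbyr=1986, ssex="F",
         sdod=date(1987, 7, 4)),
]
matches = list(potential_matches(kaiser, deaths))
```

Here only the first death record is kept. Storing the score on each
kept pair (matchpot) is an addition made for the illustration, so a
reviewer could sort the candidates by strength of match.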

Figure 4 shows the code for implementing the matching algorithm in
the efficient one-step method described above. The preferred method
creates in one step a set of all potential matches with a score of 6
or more among study infants and infants with death certificates of
the same gender when the state DOD falls after the KPNW baby's last
known alive date and before the date of her first birthday.

Figure 4
   proc sql;
      create table toreview as
      select *
      from deaths, kaiser
      where (ksex=ssex) & (sdod between klastalv and kbday1) &
            (sum(((kbmon=sbmon)       * 2),
                 ((kbday=sbday)       * 2),
                 ((kbyr=sbyr)         * 2),
                 ((kfsndx=sfsndx)     * 1),
                 ((kfname=sfname)     * 1),
                 ((kinitial=sinitial) * 1),
                 ((klsndx=slsndx)     * 1),
                 ((klname=slname)     * 1)) >= 6);
   quit;

SIGNIFICANCE
The benefit of the method we described is in its power, efficiency,
and flexibility. Its power comes from PROC SQL, which allows each
record of one data set to be joined with every record of another data
set. It is efficient because only when the compound expression in the
WHERE CLAUSE resolves to 1 (true) is a record read into the data set
of potential matches for manual review. This saves an enormous amount
of processing disk space. Finally, it is flexible because the
matching algorithm can be changed easily, depending upon the data and
the matching task at hand.

ACKNOWLEDGMENTS
SAS is a registered trademark of SAS Institute Inc. in the USA and
other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks
of their respective companies.

AUTHOR CONTACT
Charlotte Corelle
Kaiser Permanente Center for Health Research
3800 N. Kaiser Center Drive
Portland, Oregon 97227
(503)335-6740