Download Sequential Medical Treatment Mining for Survival Analysis

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Sequential Medical Treatment Mining for Survival Analysis
Arlei Silva1 , Wagner Meira Jr1 ., Odilon Queiroz2 , Mariângela Cherchiglia3
1
Computer Science Depart. – Federal University of Minas Gerais – Brazil
2
Faculty of Medicine – Federal University of Minas Gerais – Brazil
{arlei, meira}@dcc.ufmg.br, {odilon, cherchml}@medicina.ufmg.br
Abstract. In this paper, we study the problem of evaluating the survival associated with sequential medical treatments. We propose a new data mining algorithm (SMTM) that combines the survival analysis framework with the sequence
mining task. This research is motivated by the necessity of assessing the quality of the renal replacement therapies (RRTs) , what has become a policy issue
in several countries. We apply SMTM to evaluate sequences of RRTs and show
that SMTM is computationally efficient and able to provide important knowledge
about the survival of patients in RRT, better describing the patients’ survival
pattern than the traditional survival analysis. The results obtained may support
future programs and health policies for the assistance of patients in RRT.
1. INTRODUCTION
Survival analysis is a collection of statistical procedures for data analysis for which the
outcome variable of interest is the time until an event occurs [5, 6]. Through survival
analysis we can, for example, study how long patients survive after receiving a heart
transplant or the time it takes for a patient to respond to a therapy. These studies are
important to compare competing treatments, to evaluate the effects of a disease, and to
support the medical decision process in general [13].
Despite the widespread application of survival analysis techniques, most of them
assume that the patient receives few treatments or do not consider the ordered execution
of different single treatments through time. In this case, a single treatment can be a medication, a surgery, or any other medical procedure. A positive HIV patient, for example,
may begin his/her treatment with a particular combination of antiviral medications, and
then, as the patient’s viral load and CD4 count change over time, this combination may
be changed or other treatments may be indicated [8]. Patients suffering from end-stage
renal disease receive a long sequences of therapies, which can be composed by sessions
of intermittent peritoneal dialysis, hemodialysis, among others [4]. In this paper, we study
the problem of evaluating the survival time associated with sequential treatments.
We define a sequential medical treatment as an ordered sequence of single medical
treatments. Given two single treatments A and B, an example of a sequential treatment
is the sequence (A → A → B), that is, two consecutive executions of A followed by an
execution of B. In this case, the goal of the survival analysis is to evaluate the survival
time of patients who receive (A → A → B). Given a database of patients, containing sequences of treatments, this analysis may be an evidence about the effectiveness of
(A → A → B) in terms of survival time. The problem of evaluating sequential medical treatments presents some similarities with the traditional sequence mining task, since
each sequential treatment can be seen as a sequence of events [1, 15]. However, traditional
sequence mining algorithms are not able to perform a survival analysis of a sequence of
medical treatments, since they only identify frequent sequences in a database. On the
other hand, existing survival analysis techniques do not take into account how different
treatments are executed sequentially.
In order to evaluate sequential treatments in terms of survival, we formulate a new
technique based on the existing framework for survival analysis and the sequential pattern
mining task, which we call SMTM (Sequential Medical Treatment Mining). The SMTM
algorithm searches the space of sequential treatments in a level-wise manner to identify
effective sequential treatments using two pruning strategies, based on support and median
survival of the patients who receive the sequences of treatments.
Due to the increasing number of patients who require an RRT, the high cost of
these treatments, and the low estimates of survival for patients suffering from end-stage
renal disease, evaluating the RRTs has become a policy issue in several countries. We
apply our sequential treatment mining technique using a dataset composed by more than
100,000 Brazilian patients in RRT [2]. Therefore, besides the innovative application of
data mining (e.g., pattern recognition, statistics) and medical disciplines (e.g., survival
analysis), this paper also describes an experience of the use of these disciplines in practice.
More specifically, our main contributions are summarized as follows:
• A Novel Method for Survival Analysis of Sequential Treatments: The method
proposed in this paper characterizes the survival associated with a sequence of
treatments. We have not found any other technique for survival analysis of sequential treatments.
• A New Data Mining Algorithm for Sequential Medical Treatment Mining:
We describe a new data mining algorithm to evaluate sequences of medical treatments, called SMTM. The algorithm employs two pruning strategies during a
level-wise search for sequential treatments.
• The Application of the Proposed Algorithm in a Real Case Study: We evaluate
the SMTM algorithm using a real dataset of patients suffering from end-stage renal
disease in Brazil.
The remainder of this paper is organized as follows. Section 2 summarizes the
motivations of this research. Section 3 describes our technique for survival analysis of
sequential treatments. Section 4 presents the SMTM algorithm. Section 5 presents the
empirical results. Section 6 discusses related work. Finally, Section 7 presents our conclusions and future work.
2. MOTIVATION
The motivating problem for this research is the evaluation of Renal Replacement Therapies (RRTs) [4], required by patients suffering from end-stage renal disease (ESRD). The
costs of RRTs lead to a large burden for the health care systems, particularly in developing
countries [12, 11, 2]. In order to encourage the development of new procedures for the
analysis of the RRTs, the Brazilian Government supported the construction of a national
database of ESRD patients assisted by the Brazilian Public Health System.
End-stage renal disease, also known as chronic kidney failure, is the permanent
loss of the kidney function. The main functions of the kidneys are removing waste prod-
ucts and excess of water from the blood. Due to the asymptomatic nature of this disease,
the kidney disease is not frequently detected until it is not reversible anymore, meaning
that the prevention opportunities are over[12]. We distinguish five major RRTs:
• Hemodialysis (HD): The blood is filtered through a dialysis machine, cleansed,
and returned to the body.
• Transplantation (TX): Surgery to replace the kidney that failed by a healthy kidney. Transplanted patients require the use of immunosupressants for the rest of
their lives to prevent the rejection of the new kidney.
• Continuous Ambulatory Peritoneal Dialysis (CAPD): A solution is drained into
the abdomen through a catheter, during few hours, it absorbs waste products from
the blood, and then it is drained out.
• Continuous Cycling Peritoneal Dialysis (CCPD): Employs a cycler machine to
perform a procedure similar to the CAPD.
• Intermittent Peritoneal Dialysis (IPD): Similar to CCPD, but normally executed
in a hospital, taking around 24 hours and performed several times a week.
Patients suffering from ESRD require long term treatments, which can be composed by different RRTs. A patient may, for example, initiate his/her treatment with
sessions of IPD during 6 months, be in HD for 10 months, and then receive a kidney
transplant. Evaluating how these treatments interact and affect the survival of patients is
an important medical research problem. The survival analysis of sequences of treatments
may address key questions such as: Which are the most appropriate sequences of RRTs?
What is (are) the most suitable dialysis for a patient who has experienced a rejection?
What is the survival prognostic for transplanted patients who performed hemodialysis for
a long period of time? Nevertheless, it raises important challenges:
1. The number of possible sequences may be very large.
2. The frequency of the sequences tends to be skewed due to existence of common
medical practices.
3. Considering that the whole sequence of treatments may lead to long infrequent
patterns, short frequent sequences may be more representative.
In order to analyze the survival of patients who execute a sequence of RRTs, we
propose a new data mining algorithm for survival analysis of sequential treatments. Our
main goal is to discover survival patterns associated with the sequential execution of medical treatments. These patterns may be useful to assess the quality of the RRTs in terms
of the survival of the patients, leading to a more effective usage of the healthcare budget.
Moreover, the proposed technique may be applied to other diseases.
3. SURVIVAL ANALYSIS OF SEQUENTIAL TREATMENTS
In this section, we define the concept of sequential medical treatment and describe how we
combine the framework for survival analysis and the sequence mining task for the survival
analysis of sequences of treatments. Let I = {α1 , α2 , . . . αm } be a set of m distinct
treatments, a sequential treatment is a sequence of treatments s = (α1 → α2 → . . . αq ),
composed by q ordered treatments. We state that a sequential treatment sa = (α1 →
α2 → . . . αn ) is contained in sb = (β1 → β2 → . . . βm ), or sa is a subsequence of sb , if
there exists 1 ≤ i1 < i2 < . . . < in ≤ m such that α1 = βi1 , α2 = βi2 , . . . αn = βin .
For example, (A → C) is a subsequence of (A → B → C). Gaps in the sequences are
allowed in order to not restrict the patterns discovered to strictly executed successions of
treatments. These gaps may result in more general and simple patterns able to describe
long and complex sequential treatments, as will be discussed along this paper.
A database D of sequential treatments, containing data of patients suffering from
a given disease, can be obtained through a survival study. For each patient the database
has the patient id (pid), the observed survival time t, whether the patient is censored or
not (c), and the patient sequence p = (α1 , α2 , . . . αg ) of treatments received by him/her
during the observed time. Censoring occurs when we do not know a person’s survival
time exactly because we loose track of him/her or because the patient survives during
the whole observation time [6]. A patient sequence of treatments pb is said to contain a
sequence pa if pa is a subsequence of pb . We use a different representation for the patient
sequences (i.e., transactions in the database) to differentiate them from the sequential
treatments mined (i.e., result of the survival analysis). The execution of a sequence of
treatments is divided into constant and discrete time intervals t. A treatment executed
in ti affects the patient in ti+1 and a size k sequence of treatments takes k intervals of
duration t to be executed. Table 1 shows an example database of sequential treatments,
where I = {A, B, C, D, E}, and the number of patients is 7.
pid
1
2
3
4
5
6
7
t
7
2
1
4
6
3
2
c
0
0
0
1
0
0
1
s
A, C, D, B, A, B, B
A, C
E
E, E, E, E
B, A, C, B, C, C
A, A, C
D, A
Table 1: Example database
The goal of the survival analysis [5, 6] is to infer the relation between the survival
time and one or more explanatory variables. In our case, the explanatory variable is
a sequential treatment. Two important functions for survival analysis are the survivor
function S(t) and the hazard function h(t). The survivor function gives the probability
that a person survives longer than some specified time t (S = P (T > t)). The hazard
function h(t) gives the instantaneous potential per unit time for the patient death to occur,
given that the individual has survived up to time t:
P (t ≤ T ≤ t + ∆t|T ≥ t)
∆t→0
∆t
h(t) = lim
where T is the random variable for survival time. Moreover, we can evaluate medical
treatments by descriptive measures, such as the median survival and the average hazard
rate. These measures give an overall value for the survival associated with the explanatory
variables. The median survival tM is the time at which the survival probability is 0.5
(S(tM ) = 0.5). The average hazard rate is defined by dividing the total number of failures
(or deaths) by the sum of the observed survival times:
h=
#f ailures
Pn
i=1 ti
The most applied method for survival analysis is the Kaplan Meier (KM) estimate [6]. It
calculates the proportion Ŝ(t) of patients whose survival time at death would exceed t if
no censoring had occurred. The result is a cumulative step function that decreases along
the time. According to the KM estimate, the survival probability at time t is given by:
Ŝ(t) =
i
Y
nj − dj
), ti ≤ t ≤ ti+1
(
nj
j=1
Where dj is the number of failures occurring at time t out of nj surviving to tj (risk set).
Both the h(t) and Ŝ(t) functions and also the descriptive measures have been proposed for
the analysis of single or combined treatments, but not for sequential medical treatments.
In this paper, we extend these functions and measures to the analysis of sequences of medical treatments. In order to achieve this goal, we must understand how a given sequence
of treatments affects the patients’ survival. As a patient may receive the same treatment
or sequence of treatments more than once, we always consider the first execution of such
treatment or sequence. In the case of size one sequential treatments, the evaluation process is straightforward. Since a size one sequence is composed by one single treatment,
the problem is reduced to the analysis of one explanatory variable. Considering the example database presented in Table 1, Table 2 presents the number of patients in the risk set
(nj ), the number of deaths (dj ), the survival probability Ŝ(t) and the hazard function h(t)
in the beginning of the sequential treatment and for different failure times tj . The median
survival tM and the average hazard rate (h) obtained by (A) are 5 and 0.5, respectively.
tj
0
2
3
5
7
nj
5
4
3
2
1
dj
0
1
1
1
1
Ŝ(t)
1
1 × 0.75 = 0.75
0.75 × 0.67 = 0.5
0.5 × 0.5 = 0.25
0.25 × 0 = 0
h(t)
0
0.25
0.33
0.5
1
Table 2: Survival data of (A), tM =5, h =0.5
The evaluation of longer treatments (size > 1) is more challenging. In this case,
we are interested not only in the patient’s survival after the execution of a sequence of
treatments, but also in how the patients survive along the sequential treatment. Notice that,
in the evaluation of a long sequential treatment, considering only the whole sequence may
lead to a strong bias w.r.t. the survival time associated with such sequence of treatments,
since the complete execution of a sequential treatment of size k requires a survival time of,
at least, k periods of time. Moreover, a long sequence of treatments may be composed by
a subsequence with a high mortality rate. The case of the sequential treatment (E → E)
exposes this problem. One patient (pid = 4) executed this sequential treatment, surviving
for three periods of time, until being censored, which may be considered a good result.
However, (E → E) starts with an execution of (E), its leftmost size-one subsequence,
executed by patients 3 and 4, and half of them deceased after its execution.
Considering the results obtained by long sequences of treatments along their execution is very important to achieve a meaningful evaluation of sequential treatments in
terms of survival. Nevertheless, it brings the problem of valuating the results of the subsequences of treatments over the results of long sequences. The basis of the survival
analysis procedures described in this section is to obtain the risk set and the number of
deaths associated with the explanatory variables at time tj . Therefore, the problem of ex-
tending these procedures to the evaluation of sequences of treatments reduces to identify
the risk set and the number of failures associated with a sequence of single treatments.
A long sequential treatment s = (α1 → α2 → . . . αq ) can be seen as a set of
subsequences composed by its partial executions ((α1 ), (α1 → α2 ), . . . (α1 → α2 →
. . . αq )). Therefore, we can evaluate s through the results of its subsequences. However,
several patients who execute these subsequences do not receive the whole sequence s, as
they can be censored, die or receive other sequences of treatments that do not contain s.
We define a window time ω to consider the partial results of a sequential treatment in its evaluation. The intuition behind the window time is to estimate how long
is the time interval between the execution of two treatments αk and αk+1 in practice.
If these treatments are frequently distant in terms of patient transactions, we extend the
effect of the subsequence (α1 → α2 → . . . αk ) over (α1 → α2 → . . . αk+1 ). The determination of ω depends on a user-defined parameter, what we call window size threshold (λ), that defines the percentage of patients for whom αk and αk+1 must be separated by a distance shorter than or equal to ω in the whole set of patients who executed
(α1 → α2 → . . . αk+1 ). For example, in the database shown in Table 1, if λ = 0.5, the
window size ω that separates the treatments A and B in the sequence (A → B) is 1, and
if λ = 1, ω is 2. The risk set and the number of deaths associated to (α1 → α2 → . . . αk ),
executed in the interval t0 < t < tk are the same of (α1 → α2 → . . . αk ) in the interval
t0 < t < tk + ω. After tk + ω, only patients who executed (α1 → α2 → . . . αk+1 ) are
considered in the risk set of this sequential treatment.
Table 3 shows the survival data associated with (A → B) according to our example database. The window size threshold λ applied is 0.5. In this case, the results of
(A → B) in the interval 0 ≤ t ≤ 2 are the same obtained by the sequence (A), since
the window size ω is 1. For t > 2, we consider only data about patients who received
a sequence that contains (A → B) and have not been considered yet. Figure 1 shows
the survival probability of (A) and (A → B) through a Kaplan Meier plot. The median
survival tM obtained by (A → B) is 5 and the average hazard rate (h) is 0.43.
tj
0
2
5
6
nj
5
4
2
1
dj
0
1
1
1
Ŝ(t)
1
1 × 0.75 = 0.75
0.75 × 0.5 = 0.37
0.37 × 0 = 0
h(t)
0
0.25
0.5
1
Table 3: Survival data of (A → B), tM = 5, h =0.43
Once defined an evaluation technique to assess the survival associated with a sequence of treatments, the second problem studied in this section is the computational cost
of evaluating all possible sequences of treatments. We can see this problem as an instance
of the traditional sequence mining task [1, 15]. As we do not consider the existence of
parallel treatments, each event of the sequence is a single treatment. For m possible single treatments, the number of possible sequences of treatments of size k is equal to mk .
Considering different sizes of sequences,Puntil a limit n (1 ≤ k ≤ n), the total number of possible sequential treatments is nk=1 mk . Therefore, it becomes unfeasible to
evaluate all the sequences of treatments in the case of long sequences and a large set of
single treatments. Moreover, evaluating the whole set of possible sequential treatments
1
A
0.8
0.6
0.6
S(t)
S(t)
1
0.8
0.4
0.2
A
A−>B
0.4
0.2
0
0
0
1
2
3
4
5
6
7
0
1
2
3
t
4
5
6
7
t
(b) (A → B)
(a) (A)
Figure 1: Survival probabilities for (A) and (A → B), they share the same results
into the window interval 0 ≤ t ≤ 2
may not be always necessary. Based on the traditional sequence mining task, we propose
the application of the support as a selection criteria for the sequences to be evaluated. A
user-defined minimum support threshold σ is used to select only the frequent sequential
treatments. A second approach to select the sequential treatments is to apply an evaluation measure of survival as a pruning criteria. We present this strategy in the next section,
while describing our algorithm for sequential treatment mining.
4. THE SMTM ALGORITHM
The SMTM (Sequential Medical Treatment Mining) algorithm implements the methodology described in the last section. Besides the evaluation of sequential treatments, the
algorithm also employs two pruning strategies for sake of efficiency, based on the support
and the median survival of the sequential treatments. To prune the sequences of treatments according to their support in the database, we apply the anti-monotone property,
similar to most of the sequence mining algorithms presented in the literature [1, 15]. If
a size k sequential treatment has a support lower than a minimum support threshold, it is
pruned from the list of candidate to be size k + 1 frequent sequences.
1
A
A−>C
A−>C−>C
0.8
0.6
S(t)
S(t)
1
A
A−>C
0.8
0.4
0.2
0.6
0.4
0.2
0
0
0
1
2
3
4
5
t
(a) (A → C)
6
7
0
1
2
3
4
5
6
7
t
(b) (A → C → C)
Figure 2: Survival probabilities for (A → C) and (A → C → C)
The median survival based pruning is based on a user-defined minimum median
survival threshold µ and in the time window ω, defined in the last section. If a size k + 1
sequence of treatments s=(α1 → α2 → . . . αk+1 ) is composed by a size k sequence
(α1 → α2 → . . . αk ), executed in the interval 0 ≤ t ≤ tk , that has an median survival
lower than µ in the interval 0 ≤ t ≤ tk + ω, s can be pruned from the size k + 2
sequential treatment candidates. To illustrate this property, in Figures 2(a) and 2(b), we
show the Kaplan Meier curves of the sequential treatments (A → C) and (A → C → C),
respectively, according to the database presented in Table 1. The time window ω that
divides A and C is 1, the survival probability when t is equal to 2 is 0.5. If µ >2,
(A → C) can be pruned and it is not necessary to evaluate (A → C → C) or any other
sequence that contains (A → C) as prefix, since they have the same median survival.
Algorithm 1: SMTM
Input : D, σ, λ, µ,max size
Output: Sequential T reatments (S)
S1 ⇐ {Size one sequential treatments};
K ⇐ 1;
while K ≤ max size do
Evaluate sequential treatments in SK ;
for s ∈ SK do
if (median-survival(s) ≤ wk + 1) then
if (median-survival(s) < µ) then
Prune s;
Generate SK+1 from SK ;
K ⇐ K + 1;
The Algorithm 1 is a high-level description of the SMTM algorithm. It receives
as parameters the sequential database D, the minimum support threshold σ, the window
size threshold λ, the minimum median survival threshold µ, and the maximum size of
the sequential treatments to be evaluated (max size). The outputs of SMTM are the
sequential treatments with maximum size max size that have a support higher than σ
and a median survival higher than µ, considering the window size threshold λ.
The SMTM algorithm searches the space of sequential treatments in a level-wise
manner. After generating and evaluating the size one sequential treatments, the algorithm
employs the size k sequences to generate the size k + 1 sequences of treatments. The
evaluation process implements the methodology presented in the last section. For each
frequent sequential treatment s generated, its median survival is evaluated and, if possible,
the median survival based pruning is applied.
1
1
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0
0
0
0
0 1 2 3 4
#changes
(a) General
ccdf
1
0.8
ccdf
1
0.8
ccdf
1
0.8
ccdf
ccdf
5. EXPERIMENTAL EVALUATION
0.4
0.2
0
0 1 2 3 4
#changes
0 1 2 3 4
#changes
0 1 2 3 4
#changes
0 1 2 3 4
#changes
(b) 1+
(c) 2+
(d) 3+
(e) 4+
Figure 3: Distribution of treatment changes for different survival times (years)
5.1. Dataset
We evaluate the SMTM algorithm using the RRT database[2]. Originally, the database
contains 6 single treatments: Hemodialysis (HD), Transplantation (TX), Continuous Ambulatory Peritoneal Dialysis (CAPD), Continuous Cycling Peritoneal Dialysis (CCPD),
SMTM−0
SMTM−20
SMTM−40
500
300
200
100
400
300
200
100
0
0.02
0.03
0.04
0.05
minimum support(%)
300
200
0
10
20
30
40
50
60
max_size
(a)
400
100
0
0.01
SMTM−0
SMTM−20
SMTM−40
500
execution time(s)
SMTM−0
SMTM−20
SMTM−40
400
execution time(s)
execution time(s)
500
(b)
20
40
60
80
100
number of patients(’000)
(c)
Figure 4: Performance evaluation
Intermittent Peritoneal Dialysis (IPD), and Hemodialysis for positive HIV patients (HD/HIV).
Since more than one treatment can be executed during a given period of time, we also consider these combinations as single treatments. The total number of treatments (single and
composed therapies) in the database is 30. The number of patients considered is 106,449.
The maximum observation time is 60 months, the average observed time is 18.6 months,
and 63% of the patients are censored. We considered both the transplantation surgery and
the use of immunosupressants as TX, since they are close related treatments.
Figure 3 shows the CCDF (complementary cumulative distribution function) of
the number of treatment changes in all the dataset (general) and for patients who survived
more than one (1+), two (2+), three (3+), and four years (4+). We can see that the more
the patients survive the higher is the probability of receiving more than one therapy. In the
case of patients who live more than four years, this probability is higher than 40%. It motivates us to evaluate sequences of treatments in order to understand how these sequential
treatments interact and affect the survival time of the patients.
5.2. Performance evaluation
In this section, we present a performance evaluation of the SMTM algorithm. The experiments were executed on a Suse Linux PC with a 64-bit AMD Athlon 3500+ and 1GB
main memory. The execution time of SMTM is evaluated in terms of the median survival threshold µ, the minimum support threshold σ, the maximum size of the sequential
treatments to be evaluated (max size), and the number of patients in the database n. In
all experiments, we compare three median survival threshold (µ) levels, which we call
SMTM-0 (µ = 0), SMTM-20 (µ = 20), and SMTM-40 (µ = 40). The value of the
window size threshold λ is kept constant, set to 0.6.
Figure 4(a) shows the execution time of the SMTM algorithm as the minimum
support σ is changed from 0.01% to 0.05%. The number of patients in the database is
106,449 and the maximum size of the sequential treatments is 60. It can be observed that
reducing the value of σ increases the execution time exponentially, as expected. However, the median survival threshold (µ) is able to reduce the execution time substantially,
enabling the analysis of sequential treatments using low minimum support thresholds.
The execution time for maximum sizes of sequential treatments (max size) varying from 10 to 60 is shown in Figure 4(b). The number of patients in the database is
106,449 and the minimum support is 0.01%. Since the average observed time of the pa-
tients is low (18.6 months), the impact of increasing max size over the execution time is
reduced for long sequences.
To evaluate the scalability of the SMTM algorithm in terms of the number of
patients (n) in the database, we selected samples of the RRT database with n ranging
from 20,000 to 100,000 patients. The minimum support is set to 0.01% and the maximum
size of the sequential treatments is set to 60. The SMTM algorithm scales almost linearly
while we increase the number of patients in the database. Moreover, the median survival
threshold may allow the survival analysis of large databases.
The performance evaluation of the SMTM algorithm shows that it is a computationally efficient technique for the analysis of sequential treatments. Furthermore, the
proposed median survival threshold is able to reduce the execution time of the algorithm significantly, enabling the survival analysis of long sequential treatments, using
large databases, in a feasible time.
5.3. Sequential treatments as survival pattern descriptors
In order to evaluate the SMTM algorithm regarding the quality of the survival analysis
provided, we apply it to generate survival pattern descriptors. Given a chain of treatments
p = (α1 , α2 , . . . αg ), completely or partially executed by a set of patients P , we are interested in the sequential treatment s = (α1 → α2 → . . . αq ) that best describes the survival
function of the patients in P . We like to point out that, while s is a sequential treatment,
p is series of linked treatments that tends to be much larger than s.
We analyze three important aspects associated with the sequential treatment analysis: (1) the capacity of the proposed modeling to improve the survival analysis of patients,
(2) the impact of the size of p over the sequential treatment analysis, and (3) the correlation between the size of the sequential treatments and their precision as survival pattern
descriptors. We perform these analysis using the RRT database.
To compare the survivor function Ŝ(p), associated with patients who execute a
given chain of treatments p, and the survivor function Ŝ(s), of a sequential treatment s,
we calculate the deviation of Ŝ(s) in relation to Ŝ(p) through the following equation:
deviation(s, p) =
Pk
j=1
|Ŝ(s, tj ) − Ŝ(p, tj )|
Pk
j=1 Ŝ(p, tj )
where k is the observed time, Ŝ(s, tj ) is the survival probability associated with the sequential treatment s in tj , and Ŝ(p, tj ) is the survival probability of p in tj . Since several
sequential treatments can be selected, we always apply the sequential treatment s contained in p that has the lowest deviation in w.r.t. to p. The window size threshold λ
applied is 0.6 , since it achieved the best results for most of the experiments executed, the
minimum support threshold σ was set to 0.1%, the median survival threshold µ was set
to 0. In all the experiments, we select a random set of patients R, containing 5% of the
patients (5,323 patients) from the RRT database, for whom the chain of treatments p will
be described by a sequential treatment s.
In Figure 5(a), we show the average deviation obtained by the SMTM algorithm
and the traditional Kaplan Meier (KM) method in the generation of descriptors for sequences of treatments. We randomly selected five different sets of patients R, containing
6.5
SMTM
KM
average deviation(%)
average deviation (%)
15
12
9
6
SMTM
6
5.5
3
5
0+
11+
23+
35+
47+
0
2
4
6
survival
8
10
12
max_size
(a) Kaplan Meier method (KM) X SMTM approach (b) SMTM deviation for different values of
max size
1
1
0.8
0.8
0.6
0.6
0.6
0.4
0.4
0.2
0.2
0
0
0
10
20
30
t
40
S(t)
1
0.8
S(t)
S(t)
Figure 5: Sequential treatments as survival pattern descriptors
50
0.2
0
0
(a) HD(13)
0.4
10
20
30
t
40
50
0
(b) TX(13)
0.06
0.06
0.06
0.04
0.04
0.04
0.02
0.02
0.02
0
0
20
30
t
(d) HD(13)
40
50
30
t
40
50
h(t)
0.1
0.08
h(t)
0.1
0.08
h(t)
0.1
10
20
(c) HD(7)-TX(6)
0.08
0
10
0
0
10
20
30
t
(e) TX(13)
40
50
0
10
20
30
t
40
50
(f) HD(7)-TX(6)
Figure 6: Survivor and hazard functions of the sequential treatments HD(13),
TX(13), and HD(7)-TX(6)
patients from the whole database (0+) and also from the set of patients who survived
more than 11 (11+), 23 (23+), 35 (35+), and 47 months (47+). Each set contains 5,323
patients. The KM method was applied to analyze the survival function of single therapies and we selected, for each chain of treatments p, the therapy that presented the lowest
deviation compared to p, among the therapies contained in p. The sequential treatments
were generated by the SMTM algorithm using the maximum size of the sequential treatments (max size) set to 10. The sequential treatment analysis is shown to be much more
effective than the traditional KM method in the generation of survival pattern descriptors.
Moreover, the average deviations obtained by both the SMTM and KM techniques tend to
reduce as the size of the treatments analyzed increase, since these strategies may exploit
more information about the treatments received by the patients.
Figure 5(b) shows how the maximum size (max size) of the sequential treatments
used as survival pattern descriptors affects the average deviation w.r.t. the survival of patients in the database. Longer sequential treatments are able to obtain a better description
of the sequences of treatments, but, in general, short sequences are able to obtain a low
average deviation.
The results presented in this section evaluate the capacity of sequential treatments
to be used as survival descriptors for sequences of treatments. The proposed technique
is able to obtain more precise descriptors than the analysis of single therapies using the
traditional Kaplan Meier estimate. We also may improve the precision of these descriptors
using long sequential treatments.
5.4. Assessing the quality of the RRTs
In the last part of this experimental evaluation, we analyze the survival time of patients
who execute different sequences of RRTs, what is the main motivation for this research.
However, since this paper is focused on the description and evaluation of the proposed
algorithm as a data mining application, we will only emphasize specific results that highlight its relevance in the medical field. Due to space constraints, an extensive survival
analysis of the sequential execution of RRTs is left as future work.
Sequential treatment
TX(13)
HD(1)-TX(12)
HD(2)-TX(11)
HD(3)-TX(10)
HD(4)-TX(9)
HD(5)-TX(8)
HD(6)-TX(7)
HD(7)-TX(6)
HD(8)-TX(5)
HD(9)-TX(4)
HD(10)-TX(3)
HD(11)-TX(2)
HD(12)-TX(1)
HD(13)
CAPD(13)
sup(%)
9.3
3.1
3.2
3.2
3.3
3.3
3.3
3.4
3.4
3.4
3.4
3.3
3.2
37.2
3.1
tM
>60
>60
>60
>60
>60
>60
>60
>60
>60
>60
>60
>60
>60
35.0
28.0
h
17.6
474.5
470.4
467.6
471.3
469.7
468.7
474.2
473.2
478.9
478.0
483.1
494.8
563.3
52.1
Table 4: Survival analysis of size 13 sequential RRTs
Table 4 presents the support (sup), the median survival (tM ), and the average hazard rate (h) of the 15 most frequent size 13 sequential RRTs in the RRT database. The sequential treatments are represented by the ordered therapies that compose each sequence
and the number of repetitions of each therapy. We decided to evaluate the size 13 sequential treatments because it is the median observed time of the patients, what leads to the
analysis of a large sample of sequences completely or partially received by several patients. The frequent window threshold λ was set as 0.6. We set the value of tM as ’>60’
for those sequential treatments with median survival higher than the observed time, what
is true for most of the sequences shown in Table 4.
It can be observed that TX(13) presents the lowest average hazard rate, on the
other hand, HD(13) has the highest one. Moreover, these sequential treatments are very
frequent, specially HD(13), which has a support of 37.2%. Patients who receive CAPD(13)
have the lowest median survival among the sequential treatments evaluated. The execution of HD for one or more months followed by TX is a very frequent sequential treatment
and presents different results in terms of average hazard rate depending on how long the
patient is maintained in HD. The longer the patient waits for a transplantation, the higher
is the probability of death, and this pattern becomes stronger after the ninth month in HD.
Figures 6(a), 6(b), and 6(c) show the survivor functions, and Figures 6(d), 6(e),
and 6(f) show the hazard functions of the sequential treatments HD(13), TX(13), and
HD(7)-TX(6), respectively. The survival pattern of patients who receives HD(13) is very
different from that associated with TX(13). While the hemodialysis presents a very high
probability of death in the first months of treatment and a high probability of death along
the whole sequential treatment, the transplantation achieves a constant and low probability
of death. On the other hand, the survival of patients executing HD(7)-TX(6) is similar to
the survival of patients in hemodialysis in the beginning of the sequential treatment until
the 22th month, when 60% of the patients in HD(7)-TX(6) had already been transplanted
(λ = 0.6). After the transplantation, patients executing HD(7)-TX(6) tend to reduce the
probability of death significantly. About ten months after the transplantation, when the
risk of rejection is reduced, the probability of death associated with HD(7)-TX(6) is very
low, even lower than that associated with TX(13).
6. RELATED WORK
In this section, we discuss previous work related to medical data mining, survival analysis,
evidence-based medicine, and sequential pattern mining. The application of data mining
algorithms to medical databases has been presented as a solution to face the extensive
amounts of data collected and stored by medical information systems. Through intelligent data analysis techniques, these databases can provide rich information resources for
decision makers. In [9] the authors study the specific problem of finding risk patterns in
medical data. [10] presents a pattern discovery methodology for the analysis of relationships between events in medical records. In this paper, we propose a novel data mining
application for the survival analysis of sequential treatments. The main challenges for the
application of data mining on medical data include specific requirements of understandability, efficiency, and contextualized measures of interestingness.
Survival analysis consists of a set of statistical procedures extensively applied by
medical information systems. These procedures are also useful to the analysis of other
events, such as customer behavior and biological processes [6]. However, similar to other
statistical methods, most of the survival analysis techniques are not able to deal with large
databases and complex patterns. As we describe in this paper, the application of data
mining to the survival analysis may achieve promising results. A data mining strategy to
predict whether patients in dialysis are more likely to survive for a time below or above
the median survival is proposed in [7].
The study of the patient’s history of treatments and health conditions has attracted
the interest of the medical community in the recent years, specially in the case of chronic
disease treatments. Two important research topics related to the analysis of long term
information about patients are the patient pathways [3] and the adaptive treatment strategies [8]. The application of data mining has already presented successful results in both
topics. A patient pathway represents the patient’s journey through the care system. These
paths are modeled through states and transitions through a Markov chain. Frequent paths
can be identified using sequence mining algorithms. Adaptive treatment strategies are
techniques for adapting a treatment plan according to the patient’s history of treatments
and the response to those treatments. In this case, data mining algorithms may extract
patterns associated with optimal decisions.
A sequential treatment is an ordered sequence of treatments executed across time,
what makes this problem close to the traditional sequence mining task [1, 15]. The discovered patterns of sequence mining algorithms are frequent sequences of items. Despite
the importance of analyzing the most frequent treatments, which affect the health of many
patients, the main objective of the survival analysis of sequential treatments is to evaluate
sequences of medical treatments in terms of the survival time. The problem of mining sequential patterns was introduced in [1]. An efficient algorithm that applies lattice
search techniques to identify frequent sequences is described in [15]. The application
of sequence mining algorithms to knowledge discovery in contexts where the support is
not the only interesting pattern have already been studied in previous work. In [16], the
authors present an algorithm to predict failures in databases with plan executions using
sequence mining. In [14], it is proposed an algorithm to identify high utility plans to be
used to convert groups of customers from less desirable classes to more desirable ones.
7. CONCLUSIONS AND FUTURE WORK
In this paper, we have described a novel method for survival analysis of sequences of
medical treatments executed across time, what we called sequential treatments. Based on
the proposed method, we designed the SMTM algorithm, a new data mining algorithm
for the survival analysis of sequential medical treatments.
The SMTM algorithm exploits frequent gaps between the execution of single treatments in order to evaluate the survival time of patients not only after, but also along the
execution of a sequential treatment, combining the existing survival analysis framework
and the traditional sequence mining task. Moreover, the SMTM algorithm applies two
pruning strategies, based on support and median survival of the sequential treatments.
We have evaluated the SMTM algorithm using a database of Brazilian patients
suffering from ESRD. The results show that the proposed algorithm is computationally
efficient, and the median survival pruning is able to reduce its execution time significantly,
allowing the survival analysis of long sequential treatments, using large databases, in a
feasible time. We also have shown that sequential treatments are much more effective than
the traditional survival analysis method in the generation of survival pattern descriptors for
sequences of treatments. Finally, we present a real case study, evaluating the sequential
execution of RRT’s. The results obtained provide important knowledge to assess the
quality of the RRT’s in Brazil, and may affect new government programs and health
policies for the assistance of patients in RRT.
Future work include an extensive survival analysis of sequential treatments for
ESRD and the application of the proposed algorithm using other datasets. Moreover,
we will develop a methodology for calibrating the parameters of the algorithm, since the
empirical evaluation of them, like is done in this paper, may be a difficult task.
8. ACKNOWLEDGMENTS
This work was partially supported by CNPq, CAPES, Finep, Fapemig, and the Brazilian
Ministry of Health.
References
[1] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 11th IEEE Int’l Conf. on
Data Engineering, pages 3–14, 1995.
[2] M. Cherchiglia, A. Guerra, E. Andrade, C. Machado, F. Acurcio, W. Meira Jr, B. Paula,
and O. Queiroz. The Construction of a National Database of Renal Replacement
Therapies Focused on the Individual: A record linkage procedure approach (in Portuguese). Revista Brasileira de Estudos da População, 24(1):163–167, 2007.
[3] L. Garg, S. McClean, B. Meenan, and P. Millard. Non-homogeneous Markov models for
sequential pattern mining of healthcare data. IMA Journal of Management Mathematics, (1), 2008.
[4] L. Kirby and L. Vale. Dialysis for end-stage renal disease. International Journal of
Technology Assessment in Health Care, 17(2):181–189, 2001.
[5] J. Klein and M. Moeschberger. Survival analysis: Techniques for censored and truncated
data. Springer, New York, NY, 2003.
[6] D. Kleinbaum and M. Klein. Survival analysis: A self-learning text. Springer, New York,
NY, 2005.
[7] A. Kusiak, B. Dixon, and S. Shah. Predicting survival time for kidney dialysis patients: a
data mining approach. Computers in Biology and Medicine, 35(4):311–327, 2005.
[8] P. Lavori and R. Dawson. Adaptive Treatment Strategies in Chronic Disease. Annual
Review of Medicine, 59(1):443, 2008.
[9] J. Li, A. Fu, H. He, J. Chen, H. Jin, D. McAullay, G. Williams, R. Sparks, and C. Kelman.
Mining risk patterns in medical data. In Proc. 11th ACM SIGKDD Int’l Conf. on
Knowledge Discovery and Data Mining, pages 770–775, 2005.
[10] G. N. Norén, A. Bate, J. Hopstadius, K. Star, and I. R. Edwards. Temporal pattern discovery for trends and transient effects: its application to patient records. In Proc.
14th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pages
963–971, 2008.
[11] M. Perazella and S. Khan. Increased mortality in chronic kidney disease: A call to action.
The American Journal of the Medical Sciences, 331(3):150, 2006.
[12] Z. Qiu-Li and R. Dietrich. Prevalence of chronic kidney disease in population-based
studies: Systematic review. BMC Public Health, 8(1):117, 2008.
[13] D. Sacket, W. Rosenberg, J. Gray, R. Haynes, and S. Richardison. Evidence based
medicine: How to practice and teach EBM. Churchill Livingstone, New York, NY,
1997.
[14] Q. Yang and H. Cheng. Mining plans for customer-class transformation. In Proc. 3rd
IEEE ICDM Int’l Conf. on Data Mining, pages 403–410, 2003.
[15] M. J. Zaki. Spade: An efficient algorithm for mining frequent sequencies. Machine
Learning, 42(1):31–60, 2001.
[16] M. J. Zaki, N. Lesh, and M. Ogihara. Planmine: Sequence mining for plan failures. In
Proc. 4th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining,
pages 369–374, 1998.