Sequential Medical Treatment Mining for Survival Analysis

Arlei Silva1, Wagner Meira Jr.1, Odilon Queiroz2, Mariângela Cherchiglia3

1 Computer Science Depart. – Federal University of Minas Gerais – Brazil
2 Faculty of Medicine – Federal University of Minas Gerais – Brazil

{arlei, meira}@dcc.ufmg.br, {odilon, cherchml}@medicina.ufmg.br

Abstract. In this paper, we study the problem of evaluating the survival associated with sequential medical treatments. We propose a new data mining algorithm (SMTM) that combines the survival analysis framework with the sequence mining task. This research is motivated by the need to assess the quality of renal replacement therapies (RRTs), which has become a policy issue in several countries. We apply SMTM to evaluate sequences of RRTs and show that SMTM is computationally efficient and able to provide important knowledge about the survival of patients in RRT, describing the patients' survival pattern better than traditional survival analysis. The results obtained may support future programs and health policies for the assistance of patients in RRT.

1. INTRODUCTION

Survival analysis is a collection of statistical procedures for data analysis in which the outcome variable of interest is the time until an event occurs [5, 6]. Through survival analysis we can, for example, study how long patients survive after receiving a heart transplant or the time it takes for a patient to respond to a therapy. These studies are important to compare competing treatments, to evaluate the effects of a disease, and to support the medical decision process in general [13]. Despite the widespread application of survival analysis techniques, most of them assume that the patient receives few treatments, or do not consider the ordered execution of different single treatments through time. Here, a single treatment can be a medication, a surgery, or any other medical procedure.
An HIV-positive patient, for example, may begin treatment with a particular combination of antiviral medications, and then, as the patient's viral load and CD4 count change over time, this combination may be changed or other treatments may be indicated [8]. Patients suffering from end-stage renal disease receive long sequences of therapies, which can be composed of sessions of intermittent peritoneal dialysis and hemodialysis, among others [4].

In this paper, we study the problem of evaluating the survival time associated with sequential treatments. We define a sequential medical treatment as an ordered sequence of single medical treatments. Given two single treatments A and B, an example of a sequential treatment is the sequence (A → A → B), that is, two consecutive executions of A followed by an execution of B. In this case, the goal of the survival analysis is to evaluate the survival time of patients who receive (A → A → B). Given a database of patients containing sequences of treatments, this analysis may provide evidence about the effectiveness of (A → A → B) in terms of survival time.

The problem of evaluating sequential medical treatments has some similarities with the traditional sequence mining task, since each sequential treatment can be seen as a sequence of events [1, 15]. However, traditional sequence mining algorithms are not able to perform a survival analysis of a sequence of medical treatments, since they only identify frequent sequences in a database. On the other hand, existing survival analysis techniques do not take into account how different treatments are executed sequentially. In order to evaluate sequential treatments in terms of survival, we formulate a new technique based on the existing framework for survival analysis and the sequential pattern mining task, which we call SMTM (Sequential Medical Treatment Mining).
The SMTM algorithm searches the space of sequential treatments in a level-wise manner to identify effective sequential treatments, using two pruning strategies based on the support and the median survival of the patients who receive the sequences of treatments.

Due to the increasing number of patients who require an RRT, the high cost of these treatments, and the low estimates of survival for patients suffering from end-stage renal disease, evaluating the RRTs has become a policy issue in several countries. We apply our sequential treatment mining technique to a dataset of more than 100,000 Brazilian patients in RRT [2]. Therefore, besides the innovative combination of data mining (e.g., pattern recognition, statistics) and medical disciplines (e.g., survival analysis), this paper also reports practical experience with the use of these disciplines. More specifically, our main contributions are summarized as follows:

• A Novel Method for Survival Analysis of Sequential Treatments: The method proposed in this paper characterizes the survival associated with a sequence of treatments. We have not found any other technique for survival analysis of sequential treatments.
• A New Data Mining Algorithm for Sequential Medical Treatment Mining: We describe a new data mining algorithm to evaluate sequences of medical treatments, called SMTM. The algorithm employs two pruning strategies during a level-wise search for sequential treatments.
• The Application of the Proposed Algorithm in a Real Case Study: We evaluate the SMTM algorithm using a real dataset of patients suffering from end-stage renal disease in Brazil.

The remainder of this paper is organized as follows. Section 2 summarizes the motivations of this research. Section 3 describes our technique for survival analysis of sequential treatments. Section 4 presents the SMTM algorithm. Section 5 presents the empirical results. Section 6 discusses related work.
Finally, Section 7 presents our conclusions and future work.

2. MOTIVATION

The motivating problem for this research is the evaluation of Renal Replacement Therapies (RRTs) [4], required by patients suffering from end-stage renal disease (ESRD). The costs of RRTs place a large burden on health care systems, particularly in developing countries [12, 11, 2]. In order to encourage the development of new procedures for the analysis of the RRTs, the Brazilian Government supported the construction of a national database of ESRD patients assisted by the Brazilian Public Health System.

End-stage renal disease, also known as chronic kidney failure, is the permanent loss of kidney function. The main functions of the kidneys are removing waste products and excess water from the blood. Because the disease is asymptomatic, it is frequently not detected until it is no longer reversible, at which point the opportunities for prevention have passed [12]. We distinguish five major RRTs:

• Hemodialysis (HD): The blood is filtered through a dialysis machine, cleansed, and returned to the body.
• Transplantation (TX): Surgery to replace the failed kidney with a healthy kidney. Transplanted patients require immunosuppressants for the rest of their lives to prevent the rejection of the new kidney.
• Continuous Ambulatory Peritoneal Dialysis (CAPD): A solution is drained into the abdomen through a catheter; over a few hours it absorbs waste products from the blood, and then it is drained out.
• Continuous Cycling Peritoneal Dialysis (CCPD): Employs a cycler machine to perform a procedure similar to CAPD.
• Intermittent Peritoneal Dialysis (IPD): Similar to CCPD, but normally executed in a hospital, taking around 24 hours and performed several times a week.

Patients suffering from ESRD require long-term treatments, which can be composed of different RRTs.
A patient may, for example, start treatment with sessions of IPD for 6 months, be in HD for 10 months, and then receive a kidney transplant. Evaluating how these treatments interact and affect the survival of patients is an important medical research problem. The survival analysis of sequences of treatments may address key questions such as: Which are the most appropriate sequences of RRTs? Which dialysis modality (or modalities) is most suitable for a patient who has experienced a rejection? What is the survival prognosis for transplanted patients who underwent hemodialysis for a long period of time? Nevertheless, it raises important challenges:

1. The number of possible sequences may be very large.
2. The frequency of the sequences tends to be skewed due to the existence of common medical practices.
3. Since considering the whole sequence of treatments may lead to long, infrequent patterns, short frequent sequences may be more representative.

In order to analyze the survival of patients who execute a sequence of RRTs, we propose a new data mining algorithm for survival analysis of sequential treatments. Our main goal is to discover survival patterns associated with the sequential execution of medical treatments. These patterns may be useful to assess the quality of the RRTs in terms of the survival of the patients, leading to a more effective usage of the healthcare budget. Moreover, the proposed technique may be applied to other diseases.

3. SURVIVAL ANALYSIS OF SEQUENTIAL TREATMENTS

In this section, we define the concept of sequential medical treatment and describe how we combine the framework for survival analysis and the sequence mining task for the survival analysis of sequences of treatments.

Let I = {α_1, α_2, ..., α_m} be a set of m distinct treatments. A sequential treatment is a sequence of treatments s = (α_1 → α_2 → ... → α_q), composed of q ordered treatments. We state that a sequential treatment s_a = (α_1 → α_2 → ... → α_n) is contained in s_b = (β_1 → β_2 → ... → β_m), or that s_a is a subsequence of s_b, if there exist indices 1 ≤ i_1 < i_2 < ... < i_n ≤ m such that α_1 = β_{i_1}, α_2 = β_{i_2}, ..., α_n = β_{i_n}. For example, (A → C) is a subsequence of (A → B → C). Gaps in the sequences are allowed so that the patterns discovered are not restricted to strictly consecutive successions of treatments. These gaps may result in more general and simpler patterns able to describe long and complex sequential treatments, as will be discussed throughout this paper.

A database D of sequential treatments, containing data of patients suffering from a given disease, can be obtained through a survival study. For each patient the database has the patient id (pid), the observed survival time t, whether the patient is censored or not (c), and the patient sequence p = (α_1, α_2, ..., α_g) of treatments received during the observed time. Censoring occurs when we do not know a person's survival time exactly, because we lose track of him/her or because the patient survives during the whole observation time [6]. A patient sequence of treatments p_b is said to contain a sequence p_a if p_a is a subsequence of p_b. We use a different representation for the patient sequences (i.e., transactions in the database) to differentiate them from the sequential treatments mined (i.e., the result of the survival analysis). The execution of a sequence of treatments is divided into constant and discrete time intervals t. A treatment executed in t_i affects the patient in t_{i+1}, and a size k sequence of treatments takes k intervals of duration t to be executed. Table 1 shows an example database of sequential treatments, where I = {A, B, C, D, E} and the number of patients is 7.

pid   t   c   s (patient sequence)
1     7   0   A, C, D, B, A, B, B
2     2   0   A, C
3     1   0   E
4     4   1   E, E, E, E
5     6   0   B, A, C, B, C, C
6     3   0   A, A, C
7     2   1   D, A

Table 1: Example database

The goal of the survival analysis [5, 6] is to infer the relation between the survival time and one or more explanatory variables.
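The containment test with gaps and the notion of support can be illustrated with a short Python sketch. The names `is_subsequence`, `support`, and `TABLE_1` are hypothetical and chosen only for this illustration; the data is a transcription of the patient sequences of Table 1, and support is assumed here to be the fraction of patients whose sequence contains the sequential treatment.

```python
def is_subsequence(sa, sb):
    """Return True if sa is contained in sb, allowing gaps between matches."""
    it = iter(sb)
    # Each membership test consumes the iterator up to the match,
    # so the matches are forced to occur in increasing positions.
    return all(x in it for x in sa)


# Patient sequences transcribed from Table 1 (pid -> sequence).
TABLE_1 = {
    1: ["A", "C", "D", "B", "A", "B", "B"],
    2: ["A", "C"],
    3: ["E"],
    4: ["E", "E", "E", "E"],
    5: ["B", "A", "C", "B", "C", "C"],
    6: ["A", "A", "C"],
    7: ["D", "A"],
}


def support(s, db):
    """Fraction of patients whose sequence contains s (an assumed definition)."""
    return sum(is_subsequence(s, p) for p in db.values()) / len(db)


if __name__ == "__main__":
    print(is_subsequence(["A", "C"], ["A", "B", "C"]))  # the example from the text
    print(support(["A", "C"], TABLE_1))
    print(support(["A", "B"], TABLE_1))
```

Under these assumptions, (A → C) is contained in the sequences of patients 1, 2, 5, and 6, and (A → B) in those of patients 1 and 5.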
In our case, the explanatory variable is a sequential treatment. Two important functions for survival analysis are the survivor function S(t) and the hazard function h(t). The survivor function gives the probability that a person survives longer than some specified time t, that is, S(t) = P(T > t). The hazard function h(t) gives the instantaneous potential per unit time for the patient's death to occur, given that the individual has survived up to time t:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T \le t + \Delta t \mid T \ge t)}{\Delta t}$$

where T is the random variable for survival time. Moreover, we can evaluate medical treatments by descriptive measures, such as the median survival and the average hazard rate. These measures give an overall value for the survival associated with the explanatory variables. The median survival t_M is the time at which the survival probability is 0.5 (S(t_M) = 0.5). The average hazard rate is defined by dividing the total number of failures (or deaths) by the sum of the observed survival times:

$$\bar{h} = \frac{\#\text{failures}}{\sum_{i=1}^{n} t_i}$$

The most widely applied method for survival analysis is the Kaplan Meier (KM) estimate [6]. It calculates the proportion Ŝ(t) of patients whose survival time at death would exceed t if no censoring had occurred. The result is a cumulative step function that decreases over time. According to the KM estimate, the survival probability at time t is given by:

$$\hat{S}(t) = \prod_{j=1}^{i} \left( \frac{n_j - d_j}{n_j} \right), \quad t_i \le t \le t_{i+1}$$

where d_j is the number of failures occurring at time t_j out of the n_j patients surviving to t_j (the risk set). Both the h(t) and Ŝ(t) functions and the descriptive measures have been proposed for the analysis of single or combined treatments, but not for sequential medical treatments. In this paper, we extend these functions and measures to the analysis of sequences of medical treatments. In order to achieve this goal, we must understand how a given sequence of treatments affects the patients' survival.
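The Kaplan Meier product, the median survival, and the average hazard rate defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the observations are invented toy data, and the median is taken as the first failure time at which Ŝ(t) drops to 0.5 or below, which is one common convention rather than something the text prescribes.

```python
def kaplan_meier(observations):
    """Kaplan Meier estimate.

    observations: list of (time, event) pairs, event = 1 for a death
    and 0 for a censored patient.
    Returns a list of (t_j, n_j, d_j, S_hat) rows, one per failure time.
    """
    failure_times = sorted({t for t, e in observations if e == 1})
    s_hat = 1.0
    rows = []
    for tj in failure_times:
        n_j = sum(1 for t, _ in observations if t >= tj)               # risk set
        d_j = sum(1 for t, e in observations if t == tj and e == 1)    # deaths at t_j
        s_hat *= (n_j - d_j) / n_j
        rows.append((tj, n_j, d_j, s_hat))
    return rows


def median_survival(rows, eps=1e-12):
    """First failure time with S_hat <= 0.5 (assumed convention), else None."""
    for tj, _, _, s_hat in rows:
        if s_hat <= 0.5 + eps:
            return tj
    return None


def average_hazard(observations):
    """Number of failures divided by the sum of the observed times."""
    return sum(e for _, e in observations) / sum(t for t, _ in observations)


if __name__ == "__main__":
    # Invented toy data: (observed time, 1 = death / 0 = censored).
    toy = [(2, 1), (3, 0), (4, 1), (4, 1), (6, 0), (8, 1)]
    for row in kaplan_meier(toy):
        print(row)
    print(median_survival(kaplan_meier(toy)), average_hazard(toy))
```

Censored patients contribute to the risk set n_j up to their censoring time but never to d_j, which is what distinguishes the KM estimate from a plain empirical survival proportion.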
As a patient may receive the same treatment or sequence of treatments more than once, we always consider the first execution of such a treatment or sequence. In the case of size one sequential treatments, the evaluation process is straightforward. Since a size one sequence is composed of a single treatment, the problem reduces to the analysis of one explanatory variable. Considering the example database presented in Table 1, Table 2 presents the number of patients in the risk set (n_j), the number of deaths (d_j), the survival probability Ŝ(t), and the hazard function h(t) at the beginning of the sequential treatment and for the different failure times t_j. The median survival t_M and the average hazard rate (h) obtained by (A) are 5 and 0.5, respectively.

t_j   n_j   d_j   Ŝ(t)                  h(t)
0     5     0     1                     0
2     4     1     1 × 0.75 = 0.75       0.25
3     3     1     0.75 × 0.67 = 0.5     0.33
5     2     1     0.5 × 0.5 = 0.25      0.5
7     1     1     0.25 × 0 = 0          1

Table 2: Survival data of (A), t_M = 5, h = 0.5

The evaluation of longer treatments (size > 1) is more challenging. In this case, we are interested not only in the patient's survival after the execution of a sequence of treatments, but also in how the patients survive along the sequential treatment. Notice that, in the evaluation of a long sequential treatment, considering only the whole sequence may lead to a strong bias w.r.t. the survival time associated with such a sequence of treatments, since the complete execution of a sequential treatment of size k requires a survival time of at least k periods of time. Moreover, a long sequence of treatments may contain a subsequence with a high mortality rate. The case of the sequential treatment (E → E) exposes this problem. One patient (pid = 4) executed this sequential treatment, surviving for three periods of time until being censored, which may be considered a good result. However, (E → E) starts with an execution of (E), its leftmost size-one subsequence, which was executed by patients 3 and 4, and half of them died after its execution.
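The risk-set and death counts of a size one treatment can be derived from Table 1 with a small Python sketch. It assumes, as one reading of the text, that each element of a patient sequence occupies one time interval and that the survival time is counted from the interval of the first execution of the treatment; the names `TABLE_1`, `observations`, and `risk_table` are hypothetical. Under this reading the n_j and d_j columns of Table 2 are reproduced for the failure times of (A).

```python
# Rows of Table 1: (pid, observed time t, censored flag c, patient sequence).
TABLE_1 = [
    (1, 7, 0, ["A", "C", "D", "B", "A", "B", "B"]),
    (2, 2, 0, ["A", "C"]),
    (3, 1, 0, ["E"]),
    (4, 4, 1, ["E", "E", "E", "E"]),
    (5, 6, 0, ["B", "A", "C", "B", "C", "C"]),
    (6, 3, 0, ["A", "A", "C"]),
    (7, 2, 1, ["D", "A"]),
]


def observations(treatment, db):
    """(duration, event) pairs for patients who received `treatment`.

    Assumption: the duration runs from the interval of the FIRST execution
    of the treatment to the end of the observed time; event = 1 means death.
    """
    obs = []
    for _pid, t, c, seq in db:
        if treatment in seq:
            first = seq.index(treatment)  # 0-based interval of first execution
            obs.append((t - first, 1 - c))
    return obs


def risk_table(obs):
    """(t_j, n_j, d_j) for every failure time t_j."""
    rows = []
    for tj in sorted({d for d, e in obs if e == 1}):
        n_j = sum(1 for d, _ in obs if d >= tj)
        d_j = sum(1 for d, e in obs if d == tj and e == 1)
        rows.append((tj, n_j, d_j))
    return rows


if __name__ == "__main__":
    print(risk_table(observations("A", TABLE_1)))
    print(risk_table(observations("E", TABLE_1)))
```

For (E) the sketch yields a single failure time with two patients at risk and one death, matching the observation that half of the patients who executed (E) died after its execution.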
Considering the results obtained by long sequences of treatments along their execution is very important to achieve a meaningful evaluation of sequential treatments in terms of survival. Nevertheless, it raises the problem of weighing the results of the subsequences of treatments against the results of the long sequences.

The basis of the survival analysis procedures described in this section is to obtain the risk set and the number of deaths associated with the explanatory variables at time t_j. Therefore, the problem of extending these procedures to the evaluation of sequences of treatments reduces to identifying the risk set and the number of failures associated with a sequence of single treatments. A long sequential treatment s = (α_1 → α_2 → ... → α_q) can be seen as a set of subsequences composed of its partial executions ((α_1), (α_1 → α_2), ..., (α_1 → α_2 → ... → α_q)). Therefore, we can evaluate s through the results of its subsequences. However, several patients who execute these subsequences do not receive the whole sequence s, as they can be censored, die, or receive other sequences of treatments that do not contain s.

We define a window time ω to consider the partial results of a sequential treatment in its evaluation. The intuition behind the window time is to estimate how long the time interval between the execution of two treatments α_k and α_{k+1} is in practice. If these treatments are frequently distant in terms of patient transactions, we extend the effect of the subsequence (α_1 → α_2 → ... → α_k) over (α_1 → α_2 → ... → α_{k+1}). The determination of ω depends on a user-defined parameter, which we call the window size threshold (λ), that defines the percentage of patients for whom α_k and α_{k+1} must be separated by a distance shorter than or equal to ω in the whole set of patients who executed (α_1 → α_2 → ... → α_{k+1}).
For example, in the database shown in Table 1, if λ = 0.5, the window size ω that separates the treatments A and B in the sequence (A → B) is 1, and if λ = 1, ω is 2. The risk set and the number of deaths associated with (α_1 → α_2 → ... → α_k), executed in the interval t_0 < t < t_k, are the same as those of (α_1 → α_2 → ... → α_k) in the interval t_0 < t < t_k + ω. After t_k + ω, only patients who executed (α_1 → α_2 → ... → α_{k+1}) are considered in the risk set of this sequential treatment.

Table 3 shows the survival data associated with (A → B) according to our example database. The window size threshold λ applied is 0.5. In this case, the results of (A → B) in the interval 0 ≤ t ≤ 2 are the same as those obtained by the sequence (A), since the window size ω is 1. For t > 2, we consider only data about patients who received a sequence that contains (A → B) and have not been considered yet. Figure 1 shows the survival probability of (A) and (A → B) through a Kaplan Meier plot. The median survival t_M obtained by (A → B) is 5 and the average hazard rate (h) is 0.43.

t_j   n_j   d_j   Ŝ(t)                 h(t)
0     5     0     1                    0
2     4     1     1 × 0.75 = 0.75      0.25
5     2     1     0.75 × 0.5 = 0.37    0.5
6     1     1     0.37 × 0 = 0         1

Table 3: Survival data of (A → B), t_M = 5, h = 0.43

Having defined an evaluation technique to assess the survival associated with a sequence of treatments, the second problem studied in this section is the computational cost of evaluating all possible sequences of treatments. We can see this problem as an instance of the traditional sequence mining task [1, 15]. As we do not consider the existence of parallel treatments, each event of the sequence is a single treatment. For m possible single treatments, the number of possible sequences of treatments of size k is equal to m^k. Considering different sizes of sequences, up to a limit n (1 ≤ k ≤ n), the total number of possible sequential treatments is Σ_{k=1}^{n} m^k.
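The choice of ω from λ can be sketched in Python. The text does not state exactly how the distance between α_k and α_{k+1} is measured, so this sketch makes an explicit assumption: the distance for a patient is the smallest number of intervals between an execution of the first treatment and a later execution of the second one, and ω is the smallest distance that covers at least a fraction λ of the patients who executed the pair. Under that assumption the values quoted for (A → B) in Table 1 are reproduced; the function names are hypothetical.

```python
# Patient sequences transcribed from Table 1.
TABLE_1 = [
    ["A", "C", "D", "B", "A", "B", "B"],
    ["A", "C"],
    ["E"],
    ["E", "E", "E", "E"],
    ["B", "A", "C", "B", "C", "C"],
    ["A", "A", "C"],
    ["D", "A"],
]


def min_distance(seq, a, b):
    """Smallest distance (in intervals) from an `a` to a later `b`, or None."""
    best = None
    last_a = None
    for pos, x in enumerate(seq):
        if x == b and last_a is not None:
            gap = pos - last_a
            if best is None or gap < best:
                best = gap
        if x == a:
            last_a = pos  # remember the most recent execution of `a`
    return best


def window_size(db, a, b, lam):
    """Smallest omega such that at least a fraction `lam` of the patients
    who executed (a -> b) have distance <= omega (assumed reading)."""
    gaps = sorted(g for g in (min_distance(s, a, b) for s in db) if g is not None)
    if not gaps:
        return None
    for i, g in enumerate(gaps):
        if (i + 1) / len(gaps) >= lam:
            return g
    return gaps[-1]


if __name__ == "__main__":
    print(window_size(TABLE_1, "A", "B", 0.5))
    print(window_size(TABLE_1, "A", "B", 1.0))
```

Other distance definitions (for example, measuring from the first execution only) would give different values, so the sketch should be read as one consistent interpretation rather than as the definition used by the authors.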
Therefore, it becomes infeasible to evaluate all the sequences of treatments in the case of long sequences and a large set of single treatments. Moreover, evaluating the whole set of possible sequential treatments may not always be necessary.

Figure 1: Survival probabilities for (A) and (A → B); they share the same results within the window interval 0 ≤ t ≤ 2. Panels: (a) (A), (b) (A → B); axes: t (horizontal) and S(t) (vertical).

Based on the traditional sequence mining task, we propose the application of the support as a selection criterion for the sequences to be evaluated. A user-defined minimum support threshold σ is used to select only the frequent sequential treatments. A second approach to select the sequential treatments is to apply an evaluation measure of survival as a pruning criterion. We present this strategy in the next section, while describing our algorithm for sequential treatment mining.

4. THE SMTM ALGORITHM

The SMTM (Sequential Medical Treatment Mining) algorithm implements the methodology described in the last section. Besides the evaluation of sequential treatments, the algorithm also employs two pruning strategies for the sake of efficiency, based on the support and the median survival of the sequential treatments.

To prune the sequences of treatments according to their support in the database, we apply the anti-monotone property, as do most of the sequence mining algorithms presented in the literature [1, 15]. If a size k sequential treatment has a support lower than the minimum support threshold, it is pruned from the list of candidates for generating size k + 1 frequent sequences.
Figure 2: Survival probabilities for (A → C) and (A → C → C). Panels: (a) (A → C), (b) (A → C → C); axes: t (horizontal) and S(t) (vertical).

The median survival based pruning relies on a user-defined minimum median survival threshold µ and on the time window ω, defined in the last section. If a size k + 1 sequence of treatments s = (α_1 → α_2 → ... → α_{k+1}) is composed of a size k sequence (α_1 → α_2 → ... → α_k), executed in the interval 0 ≤ t ≤ t_k, that has a median survival lower than µ in the interval 0 ≤ t ≤ t_k + ω, s can be pruned from the size k + 2 sequential treatment candidates. To illustrate this property, in Figures 2(a) and 2(b), we show the Kaplan Meier curves of the sequential treatments (A → C) and (A → C → C), respectively, according to the database presented in Table 1. The time window ω that divides A and C is 1, and the survival probability when t is equal to 2 is 0.5. If µ > 2, (A → C) can be pruned and it is not necessary to evaluate (A → C → C) or any other sequence that contains (A → C) as a prefix, since they have the same median survival.

Algorithm 1: SMTM
Input: D, σ, λ, µ, max_size
Output: Sequential treatments (S)
    S_1 ⇐ {Size one sequential treatments};
    K ⇐ 1;
    while K ≤ max_size do
        Evaluate sequential treatments in S_K;
        for s ∈ S_K do
            if (median-survival(s) ≤ wk + 1) then
                if (median-survival(s) < µ) then
                    Prune s;
        Generate S_{K+1} from S_K;
        K ⇐ K + 1;

Algorithm 1 is a high-level description of the SMTM algorithm. It receives as parameters the sequential database D, the minimum support threshold σ, the window size threshold λ, the minimum median survival threshold µ, and the maximum size of the sequential treatments to be evaluated (max_size). The outputs of SMTM are the sequential treatments of size at most max_size that have a support higher than σ and a median survival higher than µ, considering the window size threshold λ.
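The level-wise structure of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: candidates are generated by extending each surviving sequence with every single treatment, and the windowed survival evaluation is replaced by a hypothetical callback `evaluate(s)` that returns a pair (median survival, end of the window interval). The pruning test follows one reading of the nested condition in Algorithm 1: a sequence is dropped from candidate generation only when its median falls inside the window interval and is below µ.

```python
import math


def is_subsequence(sa, sb):
    """True if sa is contained in sb, allowing gaps."""
    it = iter(sb)
    return all(x in it for x in sa)


def support(s, sequences):
    """Fraction of patient sequences that contain s."""
    return sum(is_subsequence(s, p) for p in sequences) / len(sequences)


def smtm_sketch(sequences, items, sigma, mu, max_size, evaluate):
    """Level-wise search with support and median-survival pruning.

    evaluate(s) is a HYPOTHETICAL stand-in for the windowed Kaplan Meier
    evaluation; it must return (median_survival, window_end).
    """
    level = [(a,) for a in items]   # size one sequential treatments
    selected = []
    k = 1
    while k <= max_size and level:
        extendable = []
        for s in level:
            if support(s, sequences) < sigma:
                continue            # support pruning (anti-monotone)
            median, window_end = evaluate(s)
            if median >= mu:
                selected.append(s)
            if median <= window_end and median < mu:
                continue            # median-survival pruning
            extendable.append(s)
        # Generate the size k + 1 candidates from the surviving size k ones.
        level = [s + (a,) for s in extendable for a in items]
        k += 1
    return selected


if __name__ == "__main__":
    table_1 = [
        list("ACDBABB"), list("AC"), list("E"), list("EEEE"),
        list("BACBCC"), list("AAC"), list("DA"),
    ]
    # With no median pruning, all sequences of size <= 2 executed by at
    # least two of the seven patients are returned.
    found = smtm_sketch(table_1, "ABCDE", 0.28, 0, 2, lambda s: (math.inf, 0))
    print(found)
```

Because every extension of a pruned sequence shares its prefix, skipping it at level k removes an entire subtree of candidates, which is where the reduction in execution time would come from.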
The SMTM algorithm searches the space of sequential treatments in a level-wise manner. After generating and evaluating the size one sequential treatments, the algorithm employs the size k sequences to generate the size k + 1 sequences of treatments. The evaluation process implements the methodology presented in the last section. For each frequent sequential treatment s generated, its median survival is evaluated and, if possible, the median survival based pruning is applied.

5. EXPERIMENTAL EVALUATION

Figure 3: Distribution of treatment changes for different survival times (years). Panels: (a) General, (b) 1+, (c) 2+, (d) 3+, (e) 4+; axes: #changes (horizontal) and ccdf (vertical).

Figure 4: Performance evaluation. Panels: (a) execution time (s) vs. minimum support (%), (b) execution time (s) vs. max_size, (c) execution time (s) vs. number of patients ('000); series: SMTM-0, SMTM-20, SMTM-40.

5.1. Dataset

We evaluate the SMTM algorithm using the RRT database [2]. Originally, the database contains 6 single treatments: Hemodialysis (HD), Transplantation (TX), Continuous Ambulatory Peritoneal Dialysis (CAPD), Continuous Cycling Peritoneal Dialysis (CCPD), Intermittent Peritoneal Dialysis (IPD), and Hemodialysis for positive HIV patients (HD/HIV). Since more than one treatment can be executed during a given period of time, we also consider these combinations as single treatments. The total number of treatments (single and composed therapies) in the database is 30. The number of patients considered is 106,449. The maximum observation time is 60 months, the average observed time is 18.6 months, and 63% of the patients are censored.
We considered both the transplantation surgery and the use of immunosuppressants as TX, since they are closely related treatments. Figure 3 shows the CCDF (complementary cumulative distribution function) of the number of treatment changes in the whole dataset (general) and for patients who survived more than one (1+), two (2+), three (3+), and four years (4+). We can see that the longer the patients survive, the higher the probability of receiving more than one therapy. In the case of patients who live more than four years, this probability is higher than 40%. This motivates us to evaluate sequences of treatments in order to understand how these sequential treatments interact and affect the survival time of the patients.

5.2. Performance evaluation

In this section, we present a performance evaluation of the SMTM algorithm. The experiments were executed on a SUSE Linux PC with a 64-bit AMD Athlon 3500+ and 1GB of main memory. The execution time of SMTM is evaluated in terms of the median survival threshold µ, the minimum support threshold σ, the maximum size of the sequential treatments to be evaluated (max_size), and the number of patients in the database n. In all experiments, we compare three median survival threshold (µ) levels, which we call SMTM-0 (µ = 0), SMTM-20 (µ = 20), and SMTM-40 (µ = 40). The value of the window size threshold λ is kept constant, set to 0.6.

Figure 4(a) shows the execution time of the SMTM algorithm as the minimum support σ is changed from 0.01% to 0.05%. The number of patients in the database is 106,449 and the maximum size of the sequential treatments is 60. It can be observed that reducing the value of σ increases the execution time exponentially, as expected. However, the median survival threshold (µ) is able to reduce the execution time substantially, enabling the analysis of sequential treatments using low minimum support thresholds.
The execution time for maximum sizes of sequential treatments (max_size) varying from 10 to 60 is shown in Figure 4(b). The number of patients in the database is 106,449 and the minimum support is 0.01%. Since the average observed time of the patients is low (18.6 months), the impact of increasing max_size on the execution time is reduced for long sequences.

To evaluate the scalability of the SMTM algorithm in terms of the number of patients (n) in the database, we selected samples of the RRT database with n ranging from 20,000 to 100,000 patients. The minimum support is set to 0.01% and the maximum size of the sequential treatments is set to 60. The SMTM algorithm scales almost linearly as we increase the number of patients in the database. Moreover, the median survival threshold may allow the survival analysis of large databases.

The performance evaluation of the SMTM algorithm shows that it is a computationally efficient technique for the analysis of sequential treatments. Furthermore, the proposed median survival threshold is able to reduce the execution time of the algorithm significantly, enabling the survival analysis of long sequential treatments, using large databases, in a feasible time.

5.3. Sequential treatments as survival pattern descriptors

In order to evaluate the SMTM algorithm regarding the quality of the survival analysis provided, we apply it to generate survival pattern descriptors. Given a chain of treatments p = (α_1, α_2, ..., α_g), completely or partially executed by a set of patients P, we are interested in the sequential treatment s = (α_1 → α_2 → ... → α_q) that best describes the survival function of the patients in P. Note that, while s is a sequential treatment, p is a series of linked treatments that tends to be much longer than s.
We analyze three important aspects associated with the sequential treatment analysis: (1) the capacity of the proposed modeling to improve the survival analysis of patients, (2) the impact of the size of p on the sequential treatment analysis, and (3) the correlation between the size of the sequential treatments and their precision as survival pattern descriptors. We perform these analyses using the RRT database.

To compare the survivor function Ŝ(p), associated with patients who execute a given chain of treatments p, and the survivor function Ŝ(s) of a sequential treatment s, we calculate the deviation of Ŝ(s) in relation to Ŝ(p) through the following equation:

$$\text{deviation}(s, p) = \frac{\sum_{j=1}^{k} |\hat{S}(s, t_j) - \hat{S}(p, t_j)|}{\sum_{j=1}^{k} \hat{S}(p, t_j)}$$

where k is the observed time, Ŝ(s, t_j) is the survival probability associated with the sequential treatment s in t_j, and Ŝ(p, t_j) is the survival probability of p in t_j. Since several sequential treatments can be selected, we always apply the sequential treatment s contained in p that has the lowest deviation w.r.t. p. The window size threshold λ applied is 0.6, since it achieved the best results for most of the experiments executed; the minimum support threshold σ was set to 0.1% and the median survival threshold µ was set to 0. In all the experiments, we select a random set of patients R, containing 5% of the patients (5,323 patients) from the RRT database, for whom the chain of treatments p will be described by a sequential treatment s.

In Figure 5(a), we show the average deviation obtained by the SMTM algorithm and the traditional Kaplan Meier (KM) method in the generation of descriptors for sequences of treatments.
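The deviation measure can be computed directly from two survival curves sampled at the same time points t_1, ..., t_k. The following Python sketch assumes each curve is given as a plain list of survival probabilities; the function name `deviation` and the numbers in the usage example are invented for illustration only.

```python
def deviation(s_curve, p_curve):
    """Relative deviation of the curve of s with respect to the curve of p.

    Both arguments are lists of survival probabilities sampled at the
    same time points t_1, ..., t_k.
    """
    if len(s_curve) != len(p_curve):
        raise ValueError("curves must be sampled at the same time points")
    total = sum(p_curve)
    if total == 0:
        raise ValueError("reference curve sums to zero")
    return sum(abs(a - b) for a, b in zip(s_curve, p_curve)) / total


if __name__ == "__main__":
    # Invented curves: the descriptor slightly underestimates the reference.
    s_hat = [1.0, 0.8, 0.5]
    p_hat = [1.0, 0.9, 0.6]
    print(deviation(s_hat, p_hat))  # (0 + 0.1 + 0.1) / 2.5, about 0.08
```

Normalizing by the area under the reference curve makes deviations comparable across chains of treatments with very different survival levels.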
We randomly selected five different sets of patients R, containing patients from the whole database (0+) and also from the set of patients who survived more than 11 (11+), 23 (23+), 35 (35+), and 47 months (47+). Each set contains 5,323 patients. The KM method was applied to analyze the survival function of single therapies and we selected, for each chain of treatments p, the therapy that presented the lowest deviation compared to p, among the therapies contained in p. The sequential treatments were generated by the SMTM algorithm using the maximum size of the sequential treatments (max_size) set to 10. The sequential treatment analysis is shown to be much more effective than the traditional KM method in the generation of survival pattern descriptors. Moreover, the average deviations obtained by both the SMTM and KM techniques tend to decrease as the size of the treatments analyzed increases, since these strategies may exploit more information about the treatments received by the patients.

Figure 5: Sequential treatments as survival pattern descriptors. Panels: (a) Kaplan Meier method (KM) vs. SMTM approach, average deviation (%) for the survival groups 0+, 11+, 23+, 35+, and 47+; (b) SMTM average deviation (%) for different values of max_size.

Figure 6: Survivor and hazard functions of the sequential treatments HD(13), TX(13), and HD(7)-TX(6). Panels: (a) HD(13), (b) TX(13), (c) HD(7)-TX(6) show S(t) against t; (d) HD(13), (e) TX(13), (f) HD(7)-TX(6) show h(t) against t.

Figure 5(b) shows how the maximum size (max_size) of the sequential treatments used as survival pattern descriptors affects the average deviation w.r.t. the survival of patients in the database.
Longer sequential treatments are able to obtain a better description of the sequences of treatments, but, in general, short sequences are able to obtain a low average deviation. The results presented in this section evaluate the capacity of sequential treatments to be used as survival descriptors for sequences of treatments. The proposed technique is able to obtain more precise descriptors than the analysis of single therapies using the traditional Kaplan Meier estimate. We may also improve the precision of these descriptors using long sequential treatments.

5.4. Assessing the quality of the RRTs

In the last part of this experimental evaluation, we analyze the survival time of patients who execute different sequences of RRTs, which is the main motivation for this research. However, since this paper is focused on the description and evaluation of the proposed algorithm as a data mining application, we will only emphasize specific results that highlight its relevance in the medical field. Due to space constraints, an extensive survival analysis of the sequential execution of RRTs is left as future work.

Sequential treatment    sup(%)    tM      h
TX(13)                  9.3       >60     17.6
HD(1)-TX(12)            3.1       >60     474.5
HD(2)-TX(11)            3.2       >60     470.4
HD(3)-TX(10)            3.2       >60     467.6
HD(4)-TX(9)             3.3       >60     471.3
HD(5)-TX(8)             3.3       >60     469.7
HD(6)-TX(7)             3.3       >60     468.7
HD(7)-TX(6)             3.4       >60     474.2
HD(8)-TX(5)             3.4       >60     473.2
HD(9)-TX(4)             3.4       >60     478.9
HD(10)-TX(3)            3.4       >60     478.0
HD(11)-TX(2)            3.3       >60     483.1
HD(12)-TX(1)            3.2       >60     494.8
HD(13)                  37.2      35.0    563.3
CAPD(13)                3.1       28.0    52.1

Table 4: Survival analysis of size 13 sequential RRTs

Table 4 presents the support (sup), the median survival (tM), and the average hazard rate (h) of the 15 most frequent size 13 sequential RRTs in the RRT database. The sequential treatments are represented by the ordered therapies that compose each sequence and the number of repetitions of each therapy.
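As an illustration of this notation, the following Python sketch parses a label such as HD(7)-TX(6) into its ordered (therapy, repetitions) pairs. The helper names are hypothetical, and the reading of "size" as the total number of repetitions is an assumption inferred from the labels in Table 4, all of which sum to 13.

```python
import re
from typing import List, Tuple

# One term of the notation: a therapy code followed by its repetition count.
_TERM = re.compile(r"([A-Za-z]+)\((\d+)\)")


def parse_sequential_treatment(label: str) -> List[Tuple[str, int]]:
    """Split a label like 'HD(7)-TX(6)' into ordered (therapy, repetitions) pairs."""
    terms = []
    for part in label.split("-"):
        match = _TERM.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"unrecognized term: {part!r}")
        terms.append((match.group(1), int(match.group(2))))
    return terms


def treatment_size(label: str) -> int:
    """Total number of repetitions (assumed here to be the size of the sequence)."""
    return sum(count for _, count in parse_sequential_treatment(label))


def expand(label: str) -> List[str]:
    """Unroll the label into one therapy code per step, in execution order."""
    return [therapy for therapy, count in parse_sequential_treatment(label)
            for _ in range(count)]


print(parse_sequential_treatment("HD(7)-TX(6)"))
print(treatment_size("HD(7)-TX(6)"))
print(expand("HD(2)-TX(1)"))
```

Under this reading, HD(7)-TX(6) denotes seven consecutive steps of HD followed by six of TX.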
We decided to evaluate the size 13 sequential treatments because 13 is the median observed time of the patients, which leads to the analysis of a large sample of sequences completely or partially received by several patients. The frequent window threshold λ was set to 0.6. We set the value of tM to '>60' for those sequential treatments with median survival higher than the observed time, which is true for most of the sequences shown in Table 4. TX(13) presents the lowest average hazard rate, whereas HD(13) has the highest one. Moreover, these sequential treatments are very frequent, especially HD(13), which has a support of 37.2%. Patients who receive CAPD(13) have the lowest median survival among the sequential treatments evaluated. The execution of HD for one or more months followed by TX is a very frequent sequential treatment and presents different results in terms of average hazard rate depending on how long the patient is maintained in HD. The longer the patient waits for a transplantation, the higher the probability of death, and this pattern becomes stronger after the ninth month in HD. Figures 6(a), 6(b), and 6(c) show the survivor functions, and Figures 6(d), 6(e), and 6(f) show the hazard functions of the sequential treatments HD(13), TX(13), and HD(7)-TX(6), respectively. The survival pattern of patients who receive HD(13) is very different from that associated with TX(13). While hemodialysis presents a very high probability of death in the first months of treatment and a high probability of death along the whole sequential treatment, transplantation achieves a constant and low probability of death. On the other hand, the survival of patients executing HD(7)-TX(6) is similar to the survival of patients in hemodialysis from the beginning of the sequential treatment until the 22nd month, when 60% of the patients in HD(7)-TX(6) had already been transplanted (λ = 0.6).
After the transplantation, the probability of death of patients executing HD(7)-TX(6) tends to fall significantly. About ten months after the transplantation, when the risk of rejection is reduced, the probability of death associated with HD(7)-TX(6) is very low, even lower than that associated with TX(13).

6. RELATED WORK

In this section, we discuss previous work related to medical data mining, survival analysis, evidence-based medicine, and sequential pattern mining.

The application of data mining algorithms to medical databases has been presented as a solution to cope with the extensive amounts of data collected and stored by medical information systems. Through intelligent data analysis techniques, these databases can provide rich information resources for decision makers. In [9] the authors study the specific problem of finding risk patterns in medical data. [10] presents a pattern discovery methodology for the analysis of relationships between events in medical records. In this paper, we propose a novel data mining application for the survival analysis of sequential treatments. The main challenges for the application of data mining to medical data include specific requirements of understandability, efficiency, and contextualized measures of interestingness.

Survival analysis consists of a set of statistical procedures extensively applied by medical information systems. These procedures are also useful for the analysis of other events, such as customer behavior and biological processes [6]. However, similar to other statistical methods, most survival analysis techniques are not able to deal with large databases and complex patterns. As we describe in this paper, the application of data mining to survival analysis may achieve promising results. A data mining strategy to predict whether patients in dialysis are more likely to survive for a time below or above the median survival is proposed in [7].
The study of the patient’s history of treatments and health conditions has attracted the interest of the medical community in recent years, especially in the case of chronic disease treatments. Two important research topics related to the analysis of long term information about patients are patient pathways [3] and adaptive treatment strategies [8]. The application of data mining has already presented successful results in both topics. A patient pathway represents the patient’s journey through the care system. These paths are modeled as the states and transitions of a Markov chain. Frequent paths can be identified using sequence mining algorithms. Adaptive treatment strategies are techniques for adapting a treatment plan according to the patient’s history of treatments and the response to those treatments. In this case, data mining algorithms may extract patterns associated with optimal decisions.

A sequential treatment is an ordered sequence of treatments executed across time, which makes this problem close to the traditional sequence mining task [1, 15]. The patterns discovered by sequence mining algorithms are frequent sequences of items. Despite the importance of analyzing the most frequent treatments, which affect the health of many patients, the main objective of the survival analysis of sequential treatments is to evaluate sequences of medical treatments in terms of the survival time. The problem of mining sequential patterns was introduced in [1]. An efficient algorithm that applies lattice search techniques to identify frequent sequences is described in [15]. The application of sequence mining algorithms to knowledge discovery in contexts where the support is not the only measure of interest has already been studied in previous work. In [16], the authors present an algorithm to predict failures in databases with plan executions using sequence mining.
In [14], an algorithm is proposed to identify high utility plans to be used to convert groups of customers from less desirable classes to more desirable ones.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we have described a novel method for the survival analysis of sequences of medical treatments executed across time, which we call sequential treatments. Based on the proposed method, we designed the SMTM algorithm, a new data mining algorithm for the survival analysis of sequential medical treatments. The SMTM algorithm exploits frequent gaps between the execution of single treatments in order to evaluate the survival time of patients not only after, but also along the execution of a sequential treatment, combining the existing survival analysis framework and the traditional sequence mining task. Moreover, the SMTM algorithm applies two pruning strategies, based on the support and the median survival of the sequential treatments.

We have evaluated the SMTM algorithm using a database of Brazilian patients suffering from ESRD. The results show that the proposed algorithm is computationally efficient, and that the median survival pruning is able to reduce its execution time significantly, allowing the survival analysis of long sequential treatments, using large databases, in a feasible time. We have also shown that sequential treatments are much more effective than the traditional survival analysis method in the generation of survival pattern descriptors for sequences of treatments. Finally, we presented a real case study, evaluating the sequential execution of RRTs. The results obtained provide important knowledge to assess the quality of the RRTs in Brazil, and may affect new government programs and health policies for the assistance of patients in RRT. Future work includes an extensive survival analysis of sequential treatments for ESRD and the application of the proposed algorithm to other datasets.
Moreover, we will develop a methodology for calibrating the parameters of the algorithm, since evaluating them empirically, as is done in this paper, may be a difficult task.

8. ACKNOWLEDGMENTS

This work was partially supported by CNPq, CAPES, Finep, Fapemig, and the Brazilian Ministry of Health.

References

[1] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 11th IEEE Int’l Conf. on Data Engineering, pages 3–14, 1995.
[2] M. Cherchiglia, A. Guerra, E. Andrade, C. Machado, F. Acurcio, W. Meira Jr, B. Paula, and O. Queiroz. The Construction of a National Database of Renal Replacement Therapies Focused on the Individual: A record linkage procedure approach (in Portuguese). Revista Brasileira de Estudos da População, 24(1):163–167, 2007.
[3] L. Garg, S. McClean, B. Meenan, and P. Millard. Non-homogeneous Markov models for sequential pattern mining of healthcare data. IMA Journal of Management Mathematics, (1), 2008.
[4] L. Kirby and L. Vale. Dialysis for end-stage renal disease. International Journal of Technology Assessment in Health Care, 17(2):181–189, 2001.
[5] J. Klein and M. Moeschberger. Survival analysis: Techniques for censored and truncated data. Springer, New York, NY, 2003.
[6] D. Kleinbaum and M. Klein. Survival analysis: A self-learning text. Springer, New York, NY, 2005.
[7] A. Kusiak, B. Dixon, and S. Shah. Predicting survival time for kidney dialysis patients: a data mining approach. Computers in Biology and Medicine, 35(4):311–327, 2005.
[8] P. Lavori and R. Dawson. Adaptive Treatment Strategies in Chronic Disease. Annual Review of Medicine, 59(1):443, 2008.
[9] J. Li, A. Fu, H. He, J. Chen, H. Jin, D. McAullay, G. Williams, R. Sparks, and C. Kelman. Mining risk patterns in medical data. In Proc. 11th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pages 770–775, 2005.
[10] G. N. Norén, A. Bate, J. Hopstadius, K. Star, and I. R. Edwards. Temporal pattern discovery for trends and transient effects: its application to patient records. In Proc. 14th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pages 963–971, 2008.
[11] M. Perazella and S. Khan. Increased mortality in chronic kidney disease: A call to action. The American Journal of the Medical Sciences, 331(3):150, 2006.
[12] Z. Qiu-Li and R. Dietrich. Prevalence of chronic kidney disease in population-based studies: Systematic review. BMC Public Health, 8(1):117, 2008.
[13] D. Sackett, W. Rosenberg, J. Gray, R. Haynes, and S. Richardson. Evidence based medicine: How to practice and teach EBM. Churchill Livingstone, New York, NY, 1997.
[14] Q. Yang and H. Cheng. Mining plans for customer-class transformation. In Proc. 3rd IEEE ICDM Int’l Conf. on Data Mining, pages 403–410, 2003.
[15] M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1):31–60, 2001.
[16] M. J. Zaki, N. Lesh, and M. Ogihara. Planmine: Sequence mining for plan failures. In Proc. 4th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pages 369–374, 1998.