National Science Council (Taiwan) Project Final Report

Fuzzy Statistical Analysis in Data Mining and Soft Computing: Theory and Applications

Project type: individual project
Project number: NSC91-2213-E-004-007
Period: August 1, 2002 to July 31, 2003
Institution: Department of Applied Mathematics, National Chengchi University
Principal investigator: 吳柏林 (Berlin Wu)
Report type: brief report
Handling: the project involves patents or other intellectual property rights; open to public inquiry one year after completion
Date: October 13, 2003

Introduction

Data mining for knowledge discovery is a new and rapidly developing area of research oriented towards applications in this information age. Although this area falls within the field of multivariate data analysis, there is a tendency to consider alternatives to traditional multivariate statistical techniques, borrowed from computer science and artificial intelligence, such as neural networks and rough set theory (see, e.g., [5], [7]). These emerging techniques are somewhat ad hoc and lack theoretical foundations. As Elder [2] pointed out, there is a compelling reason to look at statistical perspectives on knowledge discovery in databases (KDD), mainly because recent statistical contributions to multivariate data analysis have not fully spread to the KDD community. Needless to say, the advantage of using statistical methodology is that it provides a firm theoretical foundation for deriving inference procedures in decision-making. In this research project, we show that a relatively new area of probability theory and statistics, namely random set theory (see, e.g., [4], [7], and [15]), is suitable as a framework for data mining problems.

Theorem 2.1. Let X be a universal set and A be a fuzzy set on X. Then the complement of the fuzzy set, A^c, is given by

    A^c = \bigvee_{(1-\alpha) \in \Lambda(A^c)} (1-\alpha)\,(A^c)_{(1-\alpha)},    (2.3)

where \Lambda(A^c) is the level set of A^c, (A^c)_{(1-\alpha)} is a (1-\alpha)-cut of A^c, and \bigvee denotes the finite fuzzy union.

Theorem 2.2. Let X be a universal set, and let A and B be two fuzzy sets on X.
For any significance levels \varepsilon_1, \varepsilon_2 \in [0,1], the standard fuzzy union and intersection of A and B are given by

    (i)  A \cup_F B = [\bigvee_{\alpha \in \Lambda(A)} \alpha A_\alpha] \cup_F [\bigvee_{\beta \in \Lambda(B)} \beta B_\beta]
                    = [Ker(A) \cup_F Ker(B)] \cup_F [(\bigvee_{\alpha \ge \varepsilon_1} \alpha A_\alpha) \cup_F (\bigvee_{\beta \ge \varepsilon_2} \beta B_\beta)],    (2.11)

    (ii) A \cap_F B = [\bigvee_{\alpha \in \Lambda(A)} \alpha A_\alpha] \cap_F [\bigvee_{\beta \in \Lambda(B)} \beta B_\beta]
                    = [Ker(A) \cap_F Ker(B)] \cup_F [(\bigvee_{\alpha \ge \varepsilon_1} \alpha A_\alpha) \cap_F (\bigvee_{\beta \ge \varepsilon_2} \beta B_\beta)],    (2.12)

where \Lambda(A) is the level set of A, \Lambda(B) is the level set of B, A_\alpha is an \alpha-cut of A, B_\beta is a \beta-cut of B, and \cup_F and \cap_F denote the standard fuzzy union and intersection, respectively.

Applications with the RRI data

The data analyzed here come from the ICU of Taipei Veterans General Hospital, 2001. The data record the RRI of the patients who died and the patients who survived during their first four days in the ICU. The RRI data for each patient come from a 30-minute recording, which contains about 1800-3000 RRI observations. Discarding the first 100 observations, we analyze observations 101 to 600 from each patient. The purpose of this study is to extract features and identify nonlinear time series for the ICU patients. Figure 4.1 (a) and (b) plot the RRI of the dead and the surviving patients, respectively.

Figure 4.1 Plots of RRI for dead and survival patients: (a) second-day RRI of the 1st through 5th dead patients; (b) second-day RRI of the 1st through 5th survival patients. [Figure not reproduced here.]

For the 500 observations, we can find the cluster center of each data set. Under the significance level \varepsilon = 0.9, we construct the kernel sets by the procedures proposed in Section 3. The dead and the surviving patients' cluster centers, radii of kernel sets, and ratios are shown in Table 4.1.
Table 4.1 The dead and the survival patients' cluster centers, radii of kernel sets, and ratios

    Dead patient     Cluster center    Radius of kernel set    Ker(D_i2)/D_i2
    D_12             623.734           0.734                   0.348
    D_22             976.018           1.018                   0.118
    D_32             850.132           0.868                   0.006
    D_42             651.442           0.558                   0.046
    Average of Ker(D_i2)/D_i2: 0.15

    Survival patient    Cluster center    Radius of kernel set    Ker(S_i2)/S_i2
    S_12                750.546           0.546                   0.078
    S_22                883.592           0.592                   0.088
    S_32                561.882           0.882                   0.066
    S_42                667.570           0.570                   0.060
    Average of Ker(S_i2)/S_i2: 0.05

The kernel set learned from the dead patients is Ker(D) = Ker(D_12) \cup_F Ker(D_22) \cup_F Ker(D_32) \cup_F Ker(D_42), the fuzzy union of the four kernel sets. We can then give the following hypothesis-testing procedure:

    H_0: the data belong to the dead pattern.
    H_1: the data do not belong to the dead pattern.

Decision rule: for a new sample with kernel set Ker(D_new), under the significance level \alpha_D, if there exists an i such that |Ker(D_new) \cap Ker(D_i2)| / |Ker(D_i2)| > 1 - \alpha_D, then we accept H_0; otherwise, we reject H_0.

Similarly, the kernel set learned from the survival patients is Ker(S) = Ker(S_12) \cup_F Ker(S_22) \cup_F Ker(S_32) \cup_F Ker(S_42), with the hypothesis-testing procedure:

    H_0: the data belong to the survival pattern.
    H_1: the data do not belong to the survival pattern.

Decision rule: for a new sample with kernel set Ker(S_new), under the significance level \alpha_D, if there exists an i such that |Ker(S_new) \cap Ker(S_i2)| / |Ker(S_i2)| > 1 - \alpha_D, then we accept H_0; otherwise, we reject H_0.

Now we examine two new RRI samples of patients from the ICU. First, we find the cluster center of each data set and construct its kernel set under the significance level \varepsilon = 0.9. Then we find the radii of the kernel sets for these samples. Finally, we obtain their patterns from their features by the above hypothesis-testing procedures. For the two samples, we get the cluster centers 592.624 and 761.658, respectively. We calculate the membership of each observation via the distance between the observation and its cluster center.
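The memberships reported in Table 4.2 are consistent with a reciprocal-distance membership function, \mu(x) = min(1, 1/|x - c|), where c is the cluster center; the report does not state this functional form explicitly, so the following is a minimal sketch inferred from the tabulated values:

```python
def membership(x, center):
    """Membership of observation x relative to a cluster center:
    1 within distance 1 of the center, decaying as 1/|x - center|
    farther away. (Form inferred from the values in Table 4.2.)"""
    d = abs(x - center)
    return 1.0 if d <= 1.0 else 1.0 / d

# New sample (I): cluster center 592.624
center = 592.624
obs = [569, 573, 571, 529, 622, 598, 592, 596]
memberships = [round(membership(x, center), 3) for x in obs]
# Matches Table 4.2: 569 -> 0.042, 598 -> 0.186, 592 -> 1.000, ...

# An observation belongs to the kernel set when its membership
# exceeds the significance level eps = 0.9.
eps = 0.9
kernel = [x for x in obs if membership(x, center) > eps]  # -> [592]
```

The same function reproduces the tabulated memberships of new sample (II) with cluster center 761.658, e.g. membership(760, 761.658) rounds to 0.603 and membership(751, 761.658) to 0.094.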
Under the significance level \varepsilon = 0.9, if the membership of an observation is larger than 0.9, then the observation is a member of the kernel set. The results for the two new samples are shown in Table 4.2.

Table 4.2 The sample memberships and kernel sets for the RRI of the two new samples

    New sample (I) (cluster center: 592.624, \varepsilon = 0.9):

    No.    Data    Membership    In the kernel set?
    1      569     0.042         no
    2      573     0.051         no
    3      571     0.046         no
    4      529     0.016         no
    5      622     0.034         no
    6      598     0.186         no
    7      609     0.061         no
    8      614     0.047         no
    9      608     0.065         no
    10     605     0.081         no
    ...
    491    591     0.616         no
    492    591     0.616         no
    493    591     0.616         no
    494    595     0.421         no
    495    598     0.186         no
    496    598     0.186         no
    497    595     0.421         no
    498    592     1.000         yes
    499    598     0.186         no
    500    596     0.296         no
    Total in the kernel set: 47 (47/500 = 0.094 > 0.05)

    New sample (II) (cluster center: 761.658, \varepsilon = 0.9):

    No.    Data    Membership    In the kernel set?
    1      751     0.094         no
    2      740     0.046         no
    3      760     0.603         no
    4      734     0.036         no
    5      730     0.032         no
    6      729     0.031         no
    7      718     0.023         no
    8      713     0.021         no
    9      708     0.019         no
    10     741     0.048         no
    ...
    491    737     0.041         no
    492    725     0.027         no
    493    722     0.025         no
    494    717     0.022         no
    495    721     0.025         no
    496    721     0.025         no
    497    717     0.022         no
    498    725     0.027         no
    499    736     0.039         no
    500    728     0.030         no
    Total in the kernel set: 17 (17/500 = 0.034 < 0.05)

From Table 4.2 we find the following:
(1) For the two new samples, the radii of the sample kernel sets are 0.624 and 0.658, respectively. This information alone seems insufficient for classifying them with a traditional identification method.
(2) For new sample (I), the proportion of observations that belong to the sample kernel set is 0.094 (= 47/500), which is larger than 0.05. From the radius of the sample kernel set and the ratio 0.094, we can say that the patient has the same features as the patients who died. From Table 4.1, we take the significance level \alpha_D = 0.05. Under this condition, we find that the proportion of observations in the sample kernel set relative to the size of Ker(D_i2) is larger than 0.95 for i = 3, 4.
By the decision rule, the RRI of this patient is identified with the learned dead pattern of RRI.
(3) Similarly, for new sample (II), the proportion of observations that belong to the sample kernel set is 0.034 (= 17/500), which is smaller than 0.05. From the radius of the sample kernel set and the ratio 0.034, we can say that the patient has the same features as the patients who survived. Under the significance level \alpha_D = 0.05, we find that the proportion of observations in the sample kernel set relative to the size of Ker(S_22) is larger than 0.95. By the decision rule, the RRI of this patient is identified with the learned survival pattern of RRI.

Conclusion

In the medical analysis discussed above, the time series data carry inherent uncertainty, and conventional clustering methods do not resolve the resulting classification problem. The contribution of this paper is a new method for clustering and identifying time series. We presented a procedure that effectively clusters nonlinear time series into several patterns based on kernel sets; the proposed algorithm also incorporates the concept of a fuzzy set. We demonstrated how to find a kernel set that helps cluster nonlinear time series into several patterns. The algorithm is recommended for practical clustering of nonlinear time series and is supported by the empirical results. A major advantage of this framework is that the procedure requires no initial knowledge about the structure of the data and can exploit much more detailed information in the presence of ambiguity. However, certain challenging problems remain open:
(1) Since disturbance or noise in a data set can hardly ever be completely eliminated, the complexity of the multivariate filtering problem in the case of interacting noise remains to be solved.
(2) The convergence of the clustering algorithm and the properties of the proposed statistics have not been rigorously proved, although both are based on fuzzy criteria. This needs further investigation.
Although many problems remain to be overcome, we believe that fuzzy statistical methods are a worthwhile approach and will stimulate more empirical work in time series analysis.
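As a concrete summary, the identification step of Section 4 can be sketched as a short decision function. This is an illustrative sketch under two assumptions not spelled out in the report: kernel sets are represented as plain Python sets of observations, and the overlap in the decision rule is computed as |Ker_new \cap Ker_i| / |Ker_i|.

```python
def identify(new_kernel, learned_kernels, alpha=0.05):
    """Decision rule: accept H0 (the new sample matches the learned
    pattern) if, for some learned kernel set K_i, the overlap
    |new_kernel & K_i| / |K_i| exceeds 1 - alpha; otherwise reject H0."""
    return any(
        len(new_kernel & k) / len(k) > 1 - alpha
        for k in learned_kernels if k
    )

# Toy illustration with hypothetical kernel sets:
learned = [{592, 593, 594}, {596, 597}]
print(identify({592, 593, 594}, learned))  # full overlap with K_1 -> True
print(identify({592, 596}, learned))       # overlap at most 1/2 -> False
```

With the default alpha = 0.05 this reproduces the 0.95 overlap threshold used in the classification of the two new RRI samples.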