Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
A Sub-health Risk Appraisal Model Based On Decision Tree and Rough Sets Xin Lu Licheng Liu School of Software University of Electronic Science and Technology of China (UESTC) Cheng du, china E-mail: [email protected] School of Software University of Electronic Science and Technology of China (UESTC) Cheng du, china E-mail:[email protected] moreover, appraises the way also inclined to the current diagnosis, and can not forecast the potential of sub-health risk, If fails to take into account of the history of individual health data, it’s hard to do with the potential sub-health risk in the future. Regardless of uses which appraisal method to achieve the sub-health risk accurate appraisal that must solve two problems ˖How to take full advantage of relevant data and established of appropriate appraisal model. A novel appraisal model which based on decision tree and rough sets theory was presented from this paper. The research model was different from the above same type is: first, the different from this model and above-mentioned similar research was: This model used the rough sets theory to carry on attribute and attribute value reduction preprocessing to individual subhealth monitoring data, avoids disorderly and the noise data, Then through the establishment decision tree C4.5 excavation algorithm, after undergoing the rough sets theory preprocessing the historical sub-health monitoring data carries on the classified training ,mining the potential of subhealth risk information. Abstract—There are some problems in people’s sub-health risk appraisal using current technology, for example, incomplete data, bias in the diagnosis and can not effectively predict participant’s the future health state. This paper presents a subhealth risk appraisal method based on data mining technique to resolve these issues. By introduction the rough sets preprocessing risk appraisal noise data, extraction of information entropy in the training set, combined with C4.5 decision tree algorithm, it established the sub-health risk appraisal prediction model. Experimental results confirm that this model than the normal method of decision tree model has higher prediction accuracy of sub-health state. Keywords-sub-health; data preprocessing; rough sets; C4.5 algorithm I. INTRODUCTION Many customers, enterprises and the related departments has been paying attention to sub-health appraisal. Especially in recent years, the sub-health risk which caused by stress and improper living habits, like smoking, drinking, lack of exercise and eating disorders became increasingly prominent [1]. Therefore, the research of sub-health risk appraisal became a hot spot. Many research institutes and university have deeply studied it, moreover, set up special health management study association like Health Management Research Center of Michigan University [2]. Sub-health was the intermediate state of physical and mental between health and disease [3].The reason of subhealth was complex and it was difficult to find the obvious symptoms. The difficulty of current sub-health risk appraisal was how to make full use of the individual's history and current data, predict the potential sub-health risks. The research of sub-health risk appraisal by scholars in and abroad showed that the appraisal methods can be classified into three categories. The first one was symptom standard appraisal which appraised individual's symptom by simulating expert diagnosis, such as [4]. The second one was quantitative appraisal according to physical examination and living environment, such as [5]. The third one combined with mathematic model which appraised individual health data synthetically, such as [6]. But many researches have suggested that sub-health state was influence of many sided element. If want to forecast their risk appraisal must be combined with a large amount of historical data [7]. Abovementioned three methods only use the current data, II. ROUGH SETS PREPROCESSING HEALTH DATA Rough sets theory is called the treated object composed of a finite set as universe of discourse, and to respond to that "knowledge" has classification capability and granularity. Knowledge of universe of discourse objects go through attributes and attributes values to describe [8]. Based on this, the sub-health appraisal data we set to as follows. Set the sub-health monitoring data knowledge expression system S as S U, C, D, F ! ,and U is set of objects, called universe of discourse; CĤD R is attribute for the set of sub-health monitoring data , called subset C and D are sub-health monitoring data condition attributes and decision attribute; U Ĥ9r ( rę R ) is property value for the set of sub-health monitoring data; V r to indicate ręR attribute range of attribute values; f : U * R o V is an information function ,and set the x attribute value of each U object. The application of rough sets theory in the sub-health monitoring data preprocessing, first, we discredited the original sub-health monitoring data, Furthermore reduction of sub-health of data attribute and calculating nuclear set, 464 § ¨ ¨ # ¨ 0 © after calculating the nuclear set, then carry on attribute value reduction. Hereon, we based on follow rule established subhealth decision-making table. Set U {U1 ,U2 ,...,Un } is a universe of discourse, Ui (i 1, 2,..., n) is health knowledge library's research object, P is health knowledge library attribute set, P CD C is condition attribute value, D is sub-health knowledge decision attribute values, T (U, P, C, D) is health information decision table. Decision table is a decision rule A. Attribute Reduction of a Core Based on the above established decision table, suppose all of reduction's occurring set is the nucleus in sub-health monitoring data attribute set P, to indicate as core ( P ) ,In other words ,It can serve as a basis for all the reduction and the feature set which can’t elimination. Hereon, based on the discriminated metrics calculating nuclear set Pc , regard as health knowledge system P attribute set S is cannot distinguish, and core set of knowledge systems definite as follow. Established a sub-health attribute decision m * n matrix D, and element D ij is subset of data attribute set P, definition element D ij as follow (1) and (2). d ijk I ,U { P ,Uik k ik U i, j 1, 2 , ..., n " From above discriminated matrix obtain. D {Y } is sub-health decision-making set, does not need reduction ,but set of conditions C required reduction, and according to two attribute rules extracts the sub-health knowledge library's reduction set and the nuclear set. (1) TABLE I. U e1 e2 e3 e4 jk zU k jk 1, 2 , 1, ..., n (4) Attribute Value Reduction Attribute value reduction is remove of redundant value. Hereon, also defines the attribute value reduction a related attribute value nucleus [10]. Set ) o < is a decision table decision rules, health knowledge attribute value v is one can cancel if only works as () o < ) o () {v} o < ) , ) and < are the decision table logic formula. In this article, we used part of sub-health data established decision tablHĉ, the decision table attribute and attribute value set as: Gender G (1: man, 2: female), Family disease inheritance H (1: y, 2: n), carrying disease and other properties (O, other property, including life diet, body sensation, etc.) (1: If exception is more than 60%, 2: undecided 40% greater than 2 less than 60%, 3: less than 40%), and reduction it. Sub-health decision e (1:sub-health; 2: normal). x about health data attribute set P value. {d ij1 , d ij 2 , d ij 3 ,..., d ijn } % a1 n · ¸ # ¸ ¸¹ B. for each row: dx~C ! dx~D, dx~P to indicate individual Dij a1 2 ! (2) U ik And U jk is the decision tables of i row and j row SUB-HEALTH DATA ATTRIBUTE VALUE G 2 1 1 2 H 1 2 1 2 O 2 3 2 3 e 3 2 1 3 Tableĉ by the examples given in the decision table, we use rough sets attribute value reduction rule reduction; will be come to the tableĊ. values of two properties, k is decision-making table of the number of research object, discriminated metrics D is a diagonal line 0 symmetrical matrices. In symmetrical matrices, following attribute reduction rule to process it [9]. The relationship of attribute set reduction and nucleus as follows (3). core( P ) red ( P ) (3) TABLE II. AFTER ATTRIBUTE VALUE REDUCTION DATA U 1 2 3 4 red ( P ) is all of reduction P . core( P ) Include all of reduction common equivalent relation among P , it’s the set of attributes P indispensable and important sub-health data G 1 2 1 2 H 1 1 1 2 O 1 2 2 3 e 1 2 1 2 According to the examples of above reduction the subhealth knowledge attribute value, it can obtain more streamlined and coincidence of system rules data, reduced the amount of appraisal calculate complex. attribute set. According to sub-health knowledge library interrelated attribute, set sub-health knowledge expression system is universe of discourse, S , U={U1,U2,...,Un} is III. SUB-HEALTH RISK APPRAISAL MODEL In decision tree algorithm application, node generates its attribute according to the method of information gain judgment [11]. In this paper, used in decision tree as a subhealth appraisal model, application after rough sets techniques preprocessing sub-health monitoring data set as the training set, the output for the sub-health risk appraisal, C {a11, a12,..., a1n}is sub-health knowledge library's condition attribute, D {Y } is decision attribute set of sub-health knowledge, P C D , and construct the corresponding discriminated matrix as(4). 465 decision analysis. “Fig.1” is the principle of sub-health appraisal. p ij C i probability. Besides, A branch point will also carry on the corresponding information gain. Its formula description is as follows (8). (8) G a in ( A ) I ( s 1 , s 2 , ..., s m ) E ( A ) Induction algorithms according to each attribute information gain to carry on the calculated. After calculation took of the biggest gain among attributes information selected as the test attribute for a given set S . According to this way produce corresponding branch point. And the corresponding attribute mark will be producing the point. Then use this node's attribute foundation branch, the corresponding branch is also the sample subset which divides. Figure 1. Sub-health risk appraisal principle A. Decision Tree Appraisal Model After determining the training set data, we will accord the following process establishment decision tree sub-health risk appraisal model. First, Set S to one contains s set of sub-health monitoring data sample; assume that class label attribute has m different values. Definition m different class Ci (i 1, 2,3..., m) , assume Si is class Ci number of data samples in the sub-health. According to (5) to calculate the sub-health monitoring data sample classification expectation information. m I ( s1 , s 2 ...s m ) ¦ p i lo g 2 ( p i ) B. Prediction Algorithm Based on decision tree model above, we adopted decision tree C4.5 algorithm for establishing sub-health risk appraisal tree, C4.5 structure algorithm is top-down recursive, using information gain ratio calculation of others type’s sub-health data samples the proportional gain ratio[12]. Use the formula (9) and (10) calculate the information gain ratio. G a in ( A ) (9) G a in R a tio ( A ) (5) i 1 pi is the random selection sample belongs Ci probability, and use si / s estimated. Logarithm functions S p litI ( A ) with 2 as the bottom, because the information encoded with a binary. v different Set attribute A have total of values {a1 , a2 ,...av } , use attribute A divide the sub-health And u S p litI ( A ) IV. through A division of a subset of information (entropy) is calculated as follow (6). u (6) j 1 j subset weight, and is equal to the subset number of samples divided total number of samples in S (i.e.: ai is value of A ). The entropy value is smaller, the purity is higher. Given the subset S j , we using formulas (7) calculate its expectation information. m ¦ p ij lo g( p ij ) EXPERIMENT AND ANALYSIS In the experiment, in order to analyze conveniently, after select the sample taken as input data, used to this paper proposed sub-health appraisal model, Simultaneously, compared to without through rough sets preprocessing mode. Finally, use Mat Lab simulation software testing and analysis. According to the above-mention sub-health appraisal principle, we using sub-health monitoring data sample provided by Sichuan Center for Disease Control and Prevention (SCDCP), choose 2000 experts confirmed the sub-health sample as test data. Sample and attribute settings as follows. Gender G (1:man,2:female), Age A (1:16-30 years old,2:30-55 years old,3:above 55), Weight W (1:fat,2:normal3:thin), Family disease inheritance H (1:y,2:n), Education E (1:junior college or under,2: undergraduate,3:master or above), Marital status M (1:y,2:n), carrying disease and other properties (O, other property, including life diet , body sensation, etc.), (1:If exception is more than 60%,2: undecided 40% greater than 2 less than monitoring samples which belonging to category Ci , I ( s1 j , s 2 j ...s m j ) (10) From 2.1, 2.2, we have use rough sets theory preprocessing sub-health monitoring data, simply use the C4.5 algorithm to generate sub-health decision tree, and then under the generate sub-health decision tree analysis and forecast sub-health risk to obtain results. The next section in this article we will experiment with real data test validity. ai value, if A take as test attribute, then these correspond in containing the set S growing out of the branch nodes. Set Sij is subset S j and the number of sub-health Hereon, ( s1 j s 2 j ... s mj ) / s the number of ¦ p i lo g i ( p i ) i 1 monitoring data sample set S into total of V subset {S1, S2,...Sv}, and S contain A sample, in A taking E(A) ¦ (s1 j s2 j ... smj )/ s *I (s1 j s2 j ... smj ) s ij / | s j | is S j a sample and belongs to the class (7) i 1 466 the accuracy of sub-health risk appraisal rate about 92.6%, without through rough sets preprocessing of the model about 86%, can be the conclusion of this paper, rough sets theory after preprocessing appraisal combined with decision tree C4.5 algorithm model more accuracy. 60% ,3 :less than 40%) 7 attributes as input, The attribute after rough set processing as followstable ċ: TABLE III. G 1(200) 1(100) 2(100) 2(100) 2(200) 1(200) 2(200) 1(50) 1(350) 2(150) 2(150) 1(100) 1(100) TEST SAMPLIE ATTRIBUTE A 1 1 2 1 2 3 2 3 2 2 2 3 2 W 1 1 2 1 2 1 1 1 1 2 2 2 2 H 1 2 1 1 1 2 1 2 1 1 2 1 1 E 2 1 2 1 2 1 2 3 1 3 2 1 2 M 1 2 1 1 1 1 1 2 1 1 1 1 1 O 1 1 1 2 1 1 2 1 2 2 1 1 2 V. How to take full advantage of individual sub-health monitoring data, and establish an appropriate appraisal model is the difficulties of sub-health risk appraisal study. The appraisal model based on rough sets theory and decision tree was proposed in this article. Through the rough sets theory preprocessing of the sub-health monitoring data, being fully used individual health data, solved the clutter, noise and other problem in the data. According to the needs of this appraisal model application in practical engineering projects, mainly considered the efficiency of the algorithm and appraisal accuracy. The establishment of sub-health appraisal model was proposed by application of decision tree algorithm, achieved a scientific, objective, reasonable subhealth risk appraisal. Carries on the test after above-mention sub-health sample through rough set theory preprocessing, distinction input this test into two kinds of model contrast, can obtains two kind of models the sub-health appraisal result rate of accuracy following tablesČ. ACKNOWLEDGMENT TABLE Č TEST SAMPLE RESULT Appraisal model This model C4.5 model Sub-health normal Sample 1850 1740 150 260 2000 2000 The project is supported in part by Science & Technology Department of Sichuan Province (S&TDSP). accurac y 92.5% 87% REFERENCES Based on this, applies Mat Lab tools to the sample which selects to carry on the simulation test. First, two kind of model's operating speed compared shown in “Fig.2”. [1] [2] [3] [4] 1 0.9 0.8 times(×10s) 0.7 [5] 0.6 0.5 0.4 [6] 0.3 0.2 0.1 0 Not Rough set process After Rough set process 0 200 400 600 800 1000 1200 Input sample quantity 1400 1600 1800 [7] 2000 Figure 2. Computation speed and input sample change tendency [8] Next, the input sample and appraisal rate of accuracy change tendency shown in “Fig.3”. [9] 0.95 0.9 [10] Testing accurately 0.85 0.8 0.75 [11] 0.7 0.65 [12] 0.6 0.55 Not Rough set process After Rough set process 0 200 400 600 800 1000 1200 Input sample quantity 1400 1600 1800 CONCLUSION 2000 Figure 3. Two kinds of appraisal model forecasting result curvilinear trend [13] Can be seen from“Fig.2” “Fig.3”and TablH Č , after through rough sets preprocessing sub-health monitoring data 467 The role of health management and development. http://www.3gaojk.com/article. _view.php?id-133 Health Manage Research Center.http://www.hmrc.kines.unich.Edu /hra/toc.cgi. Zhao hong. A non-invasive type of the risk appraisal model method. China Health Management Journal, vol. 3, pp. 166–169, March 2003. Wei yuke,Wang renhuang. A kind of new method of sub-health diagnostic reasoning. Computer Applications. vol. 3, pp. 70–73, February 2006. Wang liming,Zhao yin. Sub-health state comprehensive evaluation index system of ideas of. Journal of Chinese Medicine. vol. 25, pp. 180–183, February 2010. Liu zunxian. Sub-health of the differential equation model and data analysis. Mathematics in Practice and Theory. vol. 15, pp. 221–224, December 2009 Lin jie. Data Mining in Health Management. China Archives Science, vol 17, pp㸸35-36, Oct 2007. Yuan changan,Deng song,Li wenjing. Principles of Data Mining and SPSS Clementine.beijing: Electronic Industry Press, pp. 228–232, April 2009. Qin guangzhong,Mao zongyuan. Rough neural network and its application of Traditional Chinese Medicine intelligent diagnosis system. Computer Engineering and Applications. vol 22,pp. 34–35, June 2001. Shao fengjing, Principle and Algorithm of Data Mining (Second Edition).beijing: Science Press, pp:128-129, August 2009. Márcio P. Basgalupp1, Rodrigo C. Barros2, André C.P.L.F. de Carvalho1, Alex A. Freitas3 and Duncan D. Ruiz. LEGAL-Tree: A Lexicographic Multi-objective Genetic Algorithm for Decision Tree Induction. Hawaii, U.S.A, SAC’09,vol 15 , pp:1085-1091,March 2009 Han jiawei. Data Mining Concepts and Techniques.beijing: Machinery Industry Press, pp:188-190 ,September 2007.