A Sub-health Risk Appraisal Model Based On Decision Tree and Rough Sets
Xin Lu
School of Software
University of Electronic Science and Technology of China (UESTC)
Chengdu, China
E-mail: [email protected]

Licheng Liu
School of Software
University of Electronic Science and Technology of China (UESTC)
Chengdu, China
E-mail: [email protected]
Abstract—Current techniques for appraising people's sub-health risk have several problems, for example incomplete data, bias in diagnosis, and an inability to effectively predict a participant's future health state. This paper presents a sub-health risk appraisal method based on data mining techniques to resolve these issues. By introducing rough sets to preprocess noisy risk appraisal data, extracting the information entropy of the training set, and combining these with the C4.5 decision tree algorithm, a sub-health risk appraisal prediction model is established. Experimental results confirm that this model predicts the sub-health state more accurately than the ordinary decision tree model.
Keywords-sub-health; data preprocessing; rough sets; C4.5 algorithm

I. INTRODUCTION
Many customers, enterprises and related departments have been paying attention to sub-health appraisal. In recent years especially, the sub-health risk caused by stress and improper living habits, such as smoking, drinking, lack of exercise and eating disorders, has become increasingly prominent [1]. Research on sub-health risk appraisal has therefore become a hot spot. Many research institutes and universities have studied it in depth and have set up dedicated health management research bodies, such as the Health Management Research Center of the University of Michigan [2].
Sub-health is the intermediate physical and mental state between health and disease [3]. Its causes are complex and obvious symptoms are difficult to find. The difficulty in current sub-health risk appraisal is how to make full use of an individual's historical and current data to predict potential sub-health risks. Research by scholars at home and abroad shows that appraisal methods fall into three categories. The first is symptom-standard appraisal, which appraises an individual's symptoms by simulating expert diagnosis, such as [4]. The second is quantitative appraisal based on physical examination and living environment, such as [5]. The third uses a mathematical model to appraise individual health data comprehensively, such as [6]. However, many studies suggest that the sub-health state is influenced by many factors, so forecasting the risk requires a large amount of historical data [7]. The three methods above use only current data. Moreover, their appraisals lean toward the current diagnosis and cannot forecast potential sub-health risk. If an individual's historical health data is not taken into account, it is hard to deal with potential sub-health risk in the future.
Whichever appraisal method is used, accurate sub-health risk appraisal must solve two problems: how to take full advantage of the relevant data, and how to establish an appropriate appraisal model. This paper presents a novel appraisal model based on decision trees and rough sets theory. The model differs from the similar research mentioned above as follows. First, it uses rough sets theory to preprocess individual sub-health monitoring data by attribute reduction and attribute value reduction, which avoids disordered and noisy data. Then, using the C4.5 decision tree mining algorithm, the historical sub-health monitoring data preprocessed by rough sets theory is used for classification training, so as to mine potential sub-health risk information.
II. ROUGH SETS PREPROCESSING HEALTH DATA
Rough sets theory treats the objects under study as a finite set called the universe of discourse, and regards "knowledge" as the ability to classify those objects at some granularity. Knowledge about the objects of the universe is described through attributes and attribute values [8]. On this basis, the sub-health appraisal data is set up as follows.
Let the knowledge expression system of the sub-health monitoring data be $S = \langle U, C, D, F \rangle$, where $U$ is the set of objects, called the universe of discourse; $C \cup D = R$ is the attribute set of the sub-health monitoring data, whose subsets $C$ and $D$ are the condition attributes and the decision attribute respectively; $V = \bigcup_{r \in R} V_r$ is the set of attribute values of the sub-health monitoring data, with $V_r$ the value range of attribute $r$; and $f : U \times R \to V$ is an information function that assigns an attribute value to each object $x$ in $U$.
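To make the notation concrete, the following is a minimal Python sketch of such a knowledge expression system. It is not the authors' implementation: the table values, the object names `u1` to `u4`, and the function names `f`, `value_range`, `partition` and `indispensable` are all illustrative assumptions. It shows the information function, the value range $V_r$, the classification of $U$ induced by a set of attributes, and one standard way to test whether a condition attribute is indispensable (dropping it changes the classification).

```python
from collections import defaultdict

# Illustrative decision table (assumed values): object -> {attribute: value}.
# G, H, O play the role of condition attributes C; e is the decision attribute D.
TABLE = {
    "u1": {"G": 1, "H": 1, "O": 1, "e": 1},
    "u2": {"G": 2, "H": 1, "O": 2, "e": 2},
    "u3": {"G": 1, "H": 1, "O": 2, "e": 1},
    "u4": {"G": 2, "H": 2, "O": 3, "e": 2},
}
CONDITION = ("G", "H", "O")
DECISION = ("e",)


def f(x, r):
    """Information function f: U x R -> V."""
    return TABLE[x][r]


def value_range(r):
    """V_r: the set of values that attribute r takes over the universe U."""
    return {f(x, r) for x in TABLE}


def partition(attrs):
    """Group the objects of U that cannot be told apart using attrs."""
    classes = defaultdict(list)
    for x in TABLE:
        classes[tuple(f(x, r) for r in attrs)].append(x)
    return sorted(classes.values())


def indispensable(attrs):
    """Attributes whose removal changes the partition induced by attrs."""
    full = partition(attrs)
    return [a for a in attrs
            if partition([b for b in attrs if b != a]) != full]


if __name__ == "__main__":
    print(value_range("O"))
    print(partition(["G"]))
    print(indispensable(CONDITION))
```

In this toy table, removing `H` leaves every object distinguishable, so `H` is dispensable, whereas removing `G` or `O` merges two objects. Note that this sketch tests indispensability with respect to the condition attributes only; a decision-relative test would compare against the decision attribute as well.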
To apply rough sets theory to the preprocessing of sub-health monitoring data, we first discretize the original data, then reduce the attributes and compute the core set, and finally carry out attribute value reduction. The sub-health decision table is established according to the following rules.
Let $U = \{U_1, U_2, \ldots, U_n\}$ be the universe of discourse, where $U_i$ ($i = 1, 2, \ldots, n$) is a research object of the health knowledge library. Let $P$ be the attribute set of the health knowledge library, $P = C \cup D$, where $C$ is the set of condition attributes and $D$ is the set of sub-health decision attributes. Then $T = (U, P, C, D)$ is the health information decision table. Each row of the decision table is a decision rule $d_x|C \to d_x|D$, where $d_x|P$ denotes the values that individual $x$ takes on the health data attribute set $P$.

A. Attribute Reduction and Core
Based on the decision table established above, the intersection of all reductions of the sub-health monitoring data attribute set $P$ is called the core of $P$, written $core(P)$. In other words, the core serves as the basis of every reduction and is the set of features that cannot be eliminated.
The core set $P_c$ is computed from the discernibility matrix, which records the attributes on which pairs of objects of the knowledge system $S$ can be distinguished. The core of the knowledge system is defined as follows.
We establish an $m \times n$ sub-health attribute discernibility matrix $D$ whose element $D_{ij}$ is a subset of the data attribute set $P$, defined by (1) and (2):

$$d_{ijk} = \begin{cases} \varnothing, & U_{ik} = U_{jk} \\ P_k, & U_{ik} \neq U_{jk} \end{cases} \qquad k = 1, 2, \ldots, n \tag{1}$$

$$D_{ij} = \{d_{ij1}, d_{ij2}, d_{ij3}, \ldots, d_{ijn}\}, \qquad i, j = 1, 2, \ldots, n \tag{2}$$

Here $U_{ik}$ and $U_{jk}$ are the values of the $k$-th attribute in rows $i$ and $j$ of the decision table, and $P_k$ is the $k$-th attribute of $P$. The discernibility matrix $D$ is symmetric with an empty diagonal, and it is processed according to the attribute reduction rules in [9].
The relationship between the reductions of the attribute set and the core is given by (3):

$$core(P) = \bigcap red(P) \tag{3}$$

where $red(P)$ is the set of all reductions of $P$. $core(P)$ contains the equivalence relations common to all reductions of $P$; it is the set of indispensable, important attributes of the sub-health data attribute set $P$.
For the sub-health knowledge library, let the sub-health knowledge expression system be $S$, with $U = \{U_1, U_2, \ldots, U_n\}$ the universe of discourse, $C = \{a_{11}, a_{12}, \ldots, a_{1n}\}$ the condition attributes of the sub-health knowledge library, $D = \{Y\}$ the decision attribute set of sub-health knowledge, and $P = C \cup D$. The corresponding discernibility matrix is constructed as (4):

$$\begin{pmatrix} \varnothing & a_{12} & \cdots & a_{1n} \\ \vdots & & \ddots & \vdots \\ 0 & & \cdots & \varnothing \end{pmatrix} \tag{4}$$

From this discernibility matrix, the sub-health decision set $D = \{Y\}$ needs no reduction, whereas the condition set $C$ does. The reduction set and the core set of the sub-health knowledge library are extracted according to the two attribute rules above.

B. Attribute Value Reduction
Attribute value reduction removes redundant attribute values. Here, a core of attribute values is also defined in relation to attribute value reduction [10].
Let $\Phi \to \Psi$ be a decision rule of the decision table. A health knowledge attribute value $v$ can be removed if and only if $(\Phi \to \Psi) \Rightarrow ((\Phi \setminus \{v\}) \to \Psi)$, where $\Phi$ and $\Psi$ are logic formulas of the decision table.
In this article, part of the sub-health data is used to build the decision table shown in Table I, which is then reduced. Its attributes and attribute values are: Gender G (1: male, 2: female); Family disease inheritance H (1: yes, 2: no); carrying disease and other properties O (other properties, including diet, body sensation, etc.; 1: exceptions exceed 60%, 2: undecided, greater than 40% and less than 60%, 3: less than 40%); and the sub-health decision e (1: sub-health, 2: normal).

TABLE I. SUB-HEALTH DATA ATTRIBUTE VALUE
U     G    H    O    e
e1    2    1    2    3
e2    1    2    3    2
e3    1    1    2    1
e4    2    2    3    3

Applying the rough sets attribute value reduction rule to the example decision table in Table I yields Table II.

TABLE II. DATA AFTER ATTRIBUTE VALUE REDUCTION
U     G    H    O    e
1     1    1    1    1
2     2    1    2    2
3     1    1    2    1
4     2    2    3    2

As this example shows, reducing the sub-health knowledge attribute values yields more streamlined data that remains consistent with the system rules, and it reduces the computational complexity of the appraisal.

III. SUB-HEALTH RISK APPRAISAL MODEL
In a decision tree algorithm, the attribute tested at each node is chosen by the information gain criterion [11]. In this paper, a decision tree serves as the sub-health appraisal model: the sub-health monitoring data set preprocessed by rough sets techniques is the training set, and the output is the decision analysis of the sub-health risk appraisal. Fig. 1 shows the principle of the sub-health appraisal.
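As a rough illustration of the information gain criterion mentioned above, and of the gain ratio that C4.5 uses in its place, the following Python sketch computes both quantities for a tiny training set. The four sample rows, their class labels and all function names are assumptions made for the example; it is a sketch of the standard definitions, not the code used in the paper's experiments.

```python
import math
from collections import Counter


def entropy(labels):
    """Expected information: -sum(p * log2(p)) over the label frequencies."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def split(rows, labels, attr):
    """Group the class labels by the value each row takes on attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return groups


def gain(rows, labels, attr):
    """Information gain: entropy before the split minus weighted entropy after."""
    total = len(labels)
    expected = sum(len(g) / total * entropy(g)
                   for g in split(rows, labels, attr).values())
    return entropy(labels) - expected


def gain_ratio(rows, labels, attr):
    """Gain divided by the split information of attr (0 if attr is constant)."""
    split_info = entropy([row[attr] for row in rows])
    return gain(rows, labels, attr) / split_info if split_info > 0 else 0.0


# Hypothetical preprocessed samples: condition attributes G, H, O.
ROWS = [
    {"G": 1, "H": 1, "O": 1},
    {"G": 2, "H": 1, "O": 2},
    {"G": 1, "H": 1, "O": 2},
    {"G": 2, "H": 2, "O": 3},
]
LABELS = [1, 2, 1, 2]  # assumed decision values (1: sub-health, 2: normal)

if __name__ == "__main__":
    for attr in ("G", "H", "O"):
        print(attr, round(gain(ROWS, LABELS, attr), 4),
              round(gain_ratio(ROWS, LABELS, attr), 4))
    best = max(("G", "H", "O"), key=lambda a: gain_ratio(ROWS, LABELS, a))
    print("test attribute:", best)
```

On this toy data, attribute `G` separates the two classes perfectly, so it has the largest gain and gain ratio and would be chosen as the test attribute at the root. Dividing by the split information penalises attributes such as `O` that fragment the samples into many small subsets.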
Figure 1. Sub-health risk appraisal principle

A. Decision Tree Appraisal Model
After the training set is determined, the decision tree sub-health risk appraisal model is established by the following process.
First, let $S$ be a set of $s$ sub-health monitoring data samples, and assume that the class label attribute has $m$ different values, defining $m$ different classes $C_i$ ($i = 1, 2, 3, \ldots, m$). Let $s_i$ be the number of samples in $S$ that belong to class $C_i$. The expected information needed to classify a sub-health monitoring data sample is calculated by (5):

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i) \tag{5}$$

where $p_i$ is the probability that a randomly selected sample belongs to $C_i$, estimated by $s_i / s$. The logarithm is taken to base 2 because the information is encoded in binary.
Let attribute $A$ have $v$ different values $\{a_1, a_2, \ldots, a_v\}$. Attribute $A$ divides the sub-health monitoring data sample set $S$ into $v$ subsets $\{S_1, S_2, \ldots, S_v\}$, where $S_j$ contains the samples of $S$ for which $A$ takes the value $a_j$. If $A$ is taken as the test attribute, these subsets correspond to the branches growing out of the node that contains $S$. Let $s_{ij}$ be the number of sub-health monitoring samples in subset $S_j$ that belong to class $C_i$. The information (entropy) of the division into subsets by $A$ is calculated by (6):

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \ldots, s_{mj}) \tag{6}$$

Here $(s_{1j} + s_{2j} + \cdots + s_{mj}) / s$ is the weight of the $j$-th subset, equal to the number of samples in the subset (those for which $A = a_j$) divided by the total number of samples in $S$. The smaller the entropy, the higher the purity of the division.
Given the subset $S_j$, its expected information is calculated by (7):

$$I(s_{1j}, s_{2j}, \ldots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2(p_{ij}) \tag{7}$$

where $p_{ij} = s_{ij} / |S_j|$ is the probability that a sample in $S_j$ belongs to class $C_i$.
Branching on $A$ then yields the corresponding information gain, given by (8):

$$Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A) \tag{8}$$

The induction algorithm computes the information gain of every attribute. The attribute with the largest information gain is selected as the test attribute for the given set $S$. A node is created and labelled with that attribute, a branch is created for each value of the attribute, and the samples are divided among the branches accordingly.

B. Prediction Algorithm
Based on the decision tree model above, we adopt the C4.5 decision tree algorithm to build the sub-health risk appraisal tree. C4.5 constructs the tree top-down and recursively, and uses the information gain ratio to evaluate the attributes of the sub-health data samples [12]. The information gain ratio is calculated by (9) and (10):

$$GainRatio(A) = \frac{Gain(A)}{SplitI(A)} \tag{9}$$

and

$$SplitI(A) = -\sum_{i=1}^{v} p_i \log_2(p_i) \tag{10}$$

In (10), following the usual C4.5 definition, $p_i$ is the fraction of the samples in $S$ that fall into subset $S_i$.
With the sub-health monitoring data preprocessed by rough sets theory as described in Sections II.A and II.B, the C4.5 algorithm is used to generate the sub-health decision tree, which is then used to analyze and forecast sub-health risk. The next section tests the validity of the approach on real data.

IV. EXPERIMENT AND ANALYSIS
In the experiment, the selected samples are used as input to the sub-health appraisal model proposed in this paper and, for comparison, to the same model without rough sets preprocessing. MATLAB is used for the simulation testing and analysis.
Following the sub-health appraisal principle above, we use sub-health monitoring data samples provided by the Sichuan Center for Disease Control and Prevention (SCDCP), and choose 2000 expert-confirmed sub-health samples as test data. The sample attributes are set as follows: Gender G (1: male, 2: female), Age A (1: 16-30 years old, 2: 30-55 years old, 3: above 55), Weight W (1: fat, 2: normal, 3: thin), Family disease inheritance H (1: yes, 2: no), Education E (1: junior college or below, 2: undergraduate, 3: master or above), Marital status M (1: yes, 2: no), and carrying disease and other properties O (other properties, including diet, body sensation, etc.; 1: exceptions exceed 60%, 2: undecided, greater than 40% and less than
60%, 3: less than 40%), 7 attributes in total as input. The attributes after rough set processing are shown in Table III.

TABLE III. TEST SAMPLE ATTRIBUTE
G         A    W    H    E    M    O
1(200)    1    1    1    2    1    1
1(100)    1    1    2    1    2    1
2(100)    2    2    1    2    1    1
2(100)    1    1    1    1    1    2
2(200)    2    2    1    2    1    1
1(200)    3    1    2    1    1    1
2(200)    2    1    1    2    1    2
1(50)     3    1    2    3    2    1
1(350)    2    1    1    1    1    2
2(150)    2    2    1    3    1    2
2(150)    2    2    2    2    1    1
1(100)    3    2    1    1    1    1
1(100)    2    2    1    2    1    2

The test is then carried out on the samples above: they are fed to each of the two models (with and without rough set preprocessing) for comparison, and the accuracy of the sub-health appraisal results of the two models is shown in Table IV.

TABLE IV. TEST SAMPLE RESULT
Appraisal model    Sub-health    Normal    Sample    Accuracy
This model         1850          150       2000      92.5%
C4.5 model         1740          260       2000      87%

On this basis, MATLAB is used to run a simulation test on the selected samples. First, the running speeds of the two models are compared in Fig. 2.

[Figure 2: time (×10 s) versus input sample quantity (0 to 2000); curves: "Not Rough set process" and "After Rough set process"]
Figure 2. Computation speed and input sample change tendency

Next, the trend of appraisal accuracy against the number of input samples is shown in Fig. 3.

[Figure 3: testing accuracy (0.55 to 0.95) versus input sample quantity (0 to 2000); curves: "Not Rough set process" and "After Rough set process"]
Figure 3. Two kinds of appraisal model forecasting result curvilinear trend

As can be seen from Fig. 2, Fig. 3 and Table IV, with rough sets preprocessing of the sub-health monitoring data the accuracy of the sub-health risk appraisal is about 92.6%, whereas without rough sets preprocessing it is about 86%. We therefore conclude that the appraisal model combining rough sets preprocessing with the C4.5 decision tree algorithm is more accurate.

V. CONCLUSION
How to take full advantage of individual sub-health monitoring data and how to establish an appropriate appraisal model are the difficulties of sub-health risk appraisal research. This article proposed an appraisal model based on rough sets theory and decision trees. Preprocessing the sub-health monitoring data with rough sets theory makes full use of individual health data and resolves clutter, noise and other problems in the data. With the application of this appraisal model in practical engineering projects in mind, we mainly considered the efficiency of the algorithm and the accuracy of the appraisal. The sub-health appraisal model built with the decision tree algorithm achieves a scientific, objective and reasonable sub-health risk appraisal.

ACKNOWLEDGMENT
The project is supported in part by the Science & Technology Department of Sichuan Province (S&TDSP).

REFERENCES
[1] The role of health management and development. http://www.3gaojk.com/article. _view.php?id-133
[2] Health Manage Research Center. http://www.hmrc.kines.unich.Edu/hra/toc.cgi.
[3] Zhao Hong. A non-invasive type of the risk appraisal model method. China Health Management Journal, vol. 3, pp. 166–169, March 2003.
[4] Wei Yuke, Wang Renhuang. A kind of new method of sub-health diagnostic reasoning. Computer Applications, vol. 3, pp. 70–73, February 2006.
[5] Wang Liming, Zhao Yin. Sub-health state comprehensive evaluation index system of ideas of. Journal of Chinese Medicine, vol. 25, pp. 180–183, February 2010.
[6] Liu Zunxian. Sub-health of the differential equation model and data analysis. Mathematics in Practice and Theory, vol. 15, pp. 221–224, December 2009.
[7] Lin Jie. Data Mining in Health Management. China Archives Science, vol. 17, pp. 35–36, Oct 2007.
[8] Yuan Changan, Deng Song, Li Wenjing. Principles of Data Mining and SPSS Clementine. Beijing: Electronic Industry Press, pp. 228–232, April 2009.
[9] Qin Guangzhong, Mao Zongyuan. Rough neural network and its application of Traditional Chinese Medicine intelligent diagnosis system. Computer Engineering and Applications, vol. 22, pp. 34–35, June 2001.
[10] Shao Fengjing. Principle and Algorithm of Data Mining (Second Edition). Beijing: Science Press, pp. 128–129, August 2009.
[11] Márcio P. Basgalupp, Rodrigo C. Barros, André C.P.L.F. de Carvalho, Alex A. Freitas and Duncan D. Ruiz. LEGAL-Tree: A Lexicographic Multi-objective Genetic Algorithm for Decision Tree Induction. Hawaii, U.S.A., SAC'09, vol. 15, pp. 1085–1091, March 2009.
[12] Han Jiawei. Data Mining Concepts and Techniques. Beijing: Machinery Industry Press, pp. 188–190, September 2007.