Towards Digital Earth
— Proceedings of the International Symposium on Digital Earth
Science Press,1999
An Approach of Differential Geometry to Data Mining
Tianxiang Yue
Chenghu Zhou
State Key Laboratory of Resources and Environmental Information System,
Chinese Academy of Sciences
917 Building, Datun, Anwai, 100101 Beijing, P. R. China
Tel.: 86-10-64889633, Fax: 86-10-64889630
email: [email protected], [email protected]
ABSTRACT In this paper, an approach of differential geometry is proposed as a solution to the problems facing data
mining. By means of the plane curve theorem of differential geometry, a mathematical model formulating the distance
between plane curves is constructed, in which the distance is determined by at most 3 variables. According to the
theory of functional analysis, this distance is a distance on the metric space of curves. Finally, a model for
huge-data in a single-attribute-phase and a model for huge-indices in a multi-attribute-phase are constructed, both
based on the mathematical model formulating the distance between curves. The approach of differential geometry to data
mining includes four important steps: identifying the overall purpose of data mining, preparing data, operating models,
and evaluating model results.
In recent years, both the number and the size of databases have been growing at a staggering rate. It has been realized
that there is valuable knowledge buried in the data. In the meantime, some of the enabling technologies have recently
become mature enough to make data mining possible on large data sets (Carbone, 1998). Therefore, data mining has
received enormous attention and is becoming popular due to the decreasing costs of data collection (Pfeiffer et al.,
1998).
Data mining is defined as the process of extracting patterns and relationships, often previously unknown, from data
sources that include databases, data collections, or even data warehouses (Thuraisingham, 1997). Data mining is a step
in a larger process of knowledge discovery in databases (KDD), which refers to the overall process of discovering useful
knowledge from data. To begin the KDD process, the analysis must first have an overall purpose or set of goals in order
to select the data to be analyzed from the set of all available data. Then, the target data are moved to another
database for further preprocessing. To discover knowledge such as trends, patterns, characteristics and anomalies, data
mining algorithms should be used that are pertinent to the purpose of the analysis and to the type of data to be
analyzed. When a pattern is identified, it should be examined to determine whether it is new, relevant and correct by
some standard of measure. After the interpretation and evaluation step is completed and the pattern is deemed relevant
and useful, the pattern can be deemed knowledge (Carbone, 1998).
Data mining is an important method for extracting valuable information from databases of all sizes. Data miners are
sometimes required to construct a highly accurate model for data mining as quickly as possible. But three factors make
constructing a model for data mining a potentially lengthy process: (1) an enormous amount of data that must be
processed, (2) a large number of models that must be constructed, and (3) the intricacies of testing and validating
models (Small and Edelstein, 1998). The approach of differential geometry developed in this paper is a solution to these
problems. This approach to data mining includes a model for huge-data in a single-attribute-phase and a model for
huge-indices in a multi-attribute-phase.
KEY WORDS Data mining, Differential geometry, Mathematical models, Attribute phase
1. The Foundation of the Approach
Curve Theorem in the plane (Spivak, 1979). Let $k: [S_0, S] \to \mathbb{R}$ be continuous. Then there is a curve
$L: [S_0, S] \to \mathbb{R}^2$, parameterized by arc-length, whose curvature at $s$ is $k(s)$ for all
$s \in [S_0, S]$. Moreover, if $L_1$ and $L_2$ are two such curves, then $L_1 = A \circ L_2$, where $A$ is some
proper Euclidean motion (a translation followed by a rotation).
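The existence part of the theorem can be made concrete: integrating the curvature once gives the tangent angle, and integrating the unit tangent gives the curve. The Python sketch below illustrates this numerically. The function names `curve_from_curvature` and `chord`, the midpoint integration scheme and the step counts are illustrative assumptions, not part of the original paper.

```python
import math

def curve_from_curvature(k, s0, s1, n=1000, x0=0.0, y0=0.0, theta0=0.0):
    """Return n + 1 points of an arc-length parameterized curve with curvature k(s).

    Integrates theta'(s) = k(s), x'(s) = cos(theta), y'(s) = sin(theta)
    with the midpoint rule. The start point and start angle are free,
    which is the Euclidean-motion freedom mentioned in the theorem.
    """
    h = (s1 - s0) / n
    x, y, theta = x0, y0, theta0
    points = [(x, y)]
    for i in range(n):
        dtheta = h * k(s0 + (i + 0.5) * h)   # change of tangent angle over this step
        theta_mid = theta + 0.5 * dtheta     # tangent angle at the step midpoint
        x += h * math.cos(theta_mid)
        y += h * math.sin(theta_mid)
        theta += dtheta
        points.append((x, y))
    return points

def chord(points, a, b):
    """Euclidean distance between points[a] and points[b]."""
    return math.hypot(points[b][0] - points[a][0], points[b][1] - points[a][1])

# Constant curvature 1 on [0, 2*pi] traces a closed unit circle.
circle = curve_from_curvature(lambda s: 1.0, 0.0, 2 * math.pi)

# Same curvature, different start point and angle: a congruent copy.
c1 = curve_from_curvature(math.cos, 0.0, 3.0, n=200)
c2 = curve_from_curvature(math.cos, 0.0, 3.0, n=200, x0=2.0, y0=-1.0, theta0=0.7)
print(chord(c1, 0, 100), chord(c2, 0, 100))  # the two chord lengths agree
```

Because the two curves built from the same curvature differ only by a rotation and a translation, any chord length measured on one matches the corresponding chord on the other, which is the uniqueness statement of the theorem in discrete form.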
Therefore, the overall difference between two plane curves can be formulated as follows (Yue and Ai, 1990):

$$CD = IV + SL + CU = \frac{1}{S - S_0} \int_{S_0}^{S} \left[ \left( L_1(S_0) - L_2(S_0) \right)^2 + \left( \theta_1(s) - \theta_2(s) \right)^2 + \left( k_1(s) - k_2(s) \right)^2 \right] ds \tag{1}$$

where

$$IV = \frac{1}{S - S_0} \int_{S_0}^{S} \left( L_1(S_0) - L_2(S_0) \right)^2 ds \tag{2}$$

$$SL = \frac{1}{S - S_0} \int_{S_0}^{S} \left( \theta_1(s) - \theta_2(s) \right)^2 ds \tag{3}$$

$$CU = \frac{1}{S - S_0} \int_{S_0}^{S} \left( k_1(s) - k_2(s) \right)^2 ds \tag{4}$$

$k_i(s)$ is the curvature of the plane curve $L_i$; $\theta_i(s)$ is the slope of the plane curve $L_i$;
$L_i(S_0)$ is the initial value ($i = 1, 2$).
It can be proven (Yue et al., 1999) that $CD(L_1, L_2)$ has the following three properties:
(a) $CD(L_1, L_2) \ge 0$; $CD(L_1, L_2) = 0$ if and only if $L_1 = L_2$;
(b) $CD(L_1, L_2) = CD(L_2, L_1)$;
(c) $CD(L_1, L_3) \le CD(L_1, L_2) + CD(L_2, L_3)$.
In terms of the theory of functional analysis, $CD(L_1, L_2)$ is a kind of distance on the metric space of curves
(Taylor, 1958). We call this kind of distance a Curves' Distance.
If the curves $L_i$ can be expressed as

$$y = f_i(x) \tag{5}$$

then $\theta_i$ and $k_i$ can be respectively formulated as

$$\theta_i(x) = \frac{d f_i(x)}{dx} \tag{6}$$

$$k_i(x) = \frac{d \theta_i(x)}{dx} \left( 1 + \theta_i^2(x) \right)^{-3/2} \tag{7}$$

and

$$ds = \left( 1 + \theta^2(x) \right)^{1/2} dx \tag{8}$$

Suppose that the curve $L_2: f(x)$ is considered as an intended-goal-function and $L_1: g(x)$ is an arbitrary
function. According to the discussion above, if $g(x)$ consists of negative factors, the intended-goal-function is a
plane straight line $f(x) = 0$, $x \in [X_0, X]$. In this case, $\theta_2(s)$, $k_2(s)$ and $f(X_0)$ equal zero,
and then (Yue, 1994)

$$CD_{negative} = \frac{1}{X - X_0} \int_{X_0}^{X} \left( \theta^2(x) + k^2(x) + g^2(X_0) \right) \left( 1 + \theta^2(x) \right)^{1/2} dx \tag{14}$$

Obviously, $CD_{negative} \ge 0$, and $CD_{negative} = 0$ is the optimum situation. In other words, the closer the
curve is to the line $f(x) = 0$, the better the situation is.
If $g(x)$ consists of positive factors, it is not so easy to determine a quantitative intended goal. In this
situation, we express the intended goal as a longer distance from the straight line $f(x) = 0$. In other words, for
issues of positive factors, the longer the distance from the straight line $f(x) = 0$, the better the situation is.
The model can be generally formulated as

$$CD_{positive} = \frac{1}{X - X_0} \int_{X_0}^{X} \left( \theta^2(x) + k^2(x) + g^2(X_0) \right) \left( 1 + \theta^2(x) \right)^{1/2} dx \tag{15}$$

where $CD_{positive} \ge 0$; $CD_{positive} = 0$ is the worst situation and the largest $CD_{positive}$ is the
optimum situation.
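The distance of a curve $y = g(x)$ from the goal line $f(x) = 0$ can be evaluated numerically. The sketch below assumes that Eq. (14) reads $CD = \frac{1}{X - X_0} \int_{X_0}^{X} (\theta^2 + k^2 + g^2(X_0)) (1 + \theta^2)^{1/2} dx$, with slope $\theta = g'$ and curvature $k = \theta' (1 + \theta^2)^{-3/2}$ as in Eqs. (6) and (7). The function names, the finite-difference step `h` and the midpoint rule are illustrative assumptions rather than the authors' implementation.

```python
import math

def slope(g, x, h=1e-4):
    """Eq. (6): theta(x) = dg/dx, approximated by a central difference."""
    return (g(x + h) - g(x - h)) / (2 * h)

def curvature(g, x, h=1e-4):
    """Eq. (7): k(x) = theta'(x) * (1 + theta(x)^2)^(-3/2)."""
    second = (g(x + h) - 2 * g(x) + g(x - h)) / (h * h)
    return second * (1 + slope(g, x, h) ** 2) ** -1.5

def cd_from_zero_line(g, x0, x1, n=2000):
    """Distance of y = g(x) from the line f(x) = 0 on [x0, x1] (midpoint rule)."""
    dx = (x1 - x0) / n
    initial_sq = g(x0) ** 2                    # squared initial value g(X0)^2
    total = 0.0
    for i in range(n):
        x = x0 + (i + 0.5) * dx
        th = slope(g, x)
        k = curvature(g, x)
        # integrand of Eq. (14); sqrt(1 + th^2) dx is the arc-length element ds
        total += (th ** 2 + k ** 2 + initial_sq) * math.sqrt(1 + th ** 2) * dx
    return total / (x1 - x0)

on_goal = cd_from_zero_line(lambda x: 0.0, 0.0, 1.0)   # curve lying on the goal line
offset = cd_from_zero_line(lambda x: 2.0, 0.0, 1.0)    # constant offset of 2
ramp = cd_from_zero_line(lambda x: x, 0.0, 1.0)        # unit slope through the origin
print(on_goal, offset, ramp)
```

For negative factors a smaller value is better, for positive factors a larger one; the same function serves both readings because Eqs. (14) and (15) share the integrand.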
2. The Model for Huge-Data in A Single-Attribute-Phase
If the relevant data at every point of the earth or of a region are put in order in terms of longitude, latitude
and time, they are sequenced in three dimensions. The train of thought behind this model at the initial stage of its
development can be expressed as follows: (1) at first, let two of the three variables (longitude, latitude and time)
be fixed temporarily and transform the sequenced data into plane curves; for any plane curve, it is enough to analyze
three parameters, namely intercept, slope and curvature, to find the pattern of the sequenced data; (2) in order to
analyze the reasons that have caused the pattern, matrices of leading factors are included in the model; (3) finally,
the temporarily fixed variables are respectively allowed to change freely so that we can analyze the spatial and
temporal dynamics.
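Step (1) above, fixing two of the three variables to obtain a plane curve, can be sketched as follows. The nested-list layout `data[t][i][j]` (time layer, latitude row, longitude column), the zero-based indices and the function names are assumptions made for illustration only.

```python
def longitude_profile(data, i, t):
    """Fix latitude i and time t; the curve runs along longitude j."""
    return list(data[t][i])

def latitude_profile(data, j, t):
    """Fix longitude j and time t; the curve runs along latitude i."""
    return [row[j] for row in data[t]]

def time_series(data, i, j):
    """Fix latitude i and longitude j; the curve runs along time t."""
    return [layer[i][j] for layer in data]

# Toy data set with T = 2 layers, I = 2 latitudes and J = 3 longitudes.
data = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
]
print(longitude_profile(data, i=0, t=1))  # values along longitude
print(latitude_profile(data, j=2, t=0))   # values along latitude
print(time_series(data, i=1, j=0))        # values along time
```

Releasing the fixed variables in step (3) then amounts to looping these extractions over all admissible values of the fixed indices.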
Suppose that the sequenced data in a single-attribute-phase can be expressed as follows in terms of longitude,
latitude and time,

$$X(t) = \begin{pmatrix} x(1,1,t) & x(1,2,t) & \cdots & x(1,J,t) \\ x(2,1,t) & x(2,2,t) & \cdots & x(2,J,t) \\ \vdots & \vdots & & \vdots \\ x(I,1,t) & x(I,2,t) & \cdots & x(I,J,t) \end{pmatrix} = \left( x(i,j,t) \right)_{I \times J} \tag{16}$$

where $X(t)$ is the $t$th layer of the three-dimensional matrix ($t = 1, 2, \ldots, T$); $J$ is the maximum
longitude; $I$ is the maximum latitude and $T$ is the maximum value of the time variable.
The three-dimensional matrix $X(t)$ can be transformed into a standardized matrix

$$Y(t) = \left( y(i,j,t) \right)_{I \times J} \tag{17}$$

Then, the dynamic model in terms of latitude and time can be formulated as

$$CD(i,t) = \mathrm{sign}(y) \cdot \sum_{j=1}^{J} \left( \theta^2(i,j,t) + k^2(i,j,t) + y^2(i,0,t) \right) \left( 1 + \theta^2(i,j,t) \right)^{1/2} \tag{18}$$

where

$$y(i,j,t) = \frac{x(i,j,t)}{x_{\max}}, \qquad x_{\max} = \max_{i,j,t} \{ x(i,j,t) \} \tag{19}$$

$$\mathrm{sign}(y) = \begin{cases} 1, & y \ge 0 \\ -1, & y < 0 \end{cases}$$

$$k(i,j,t) = \left( \theta(i,j,t) - \theta(i,j-1,t) \right) \left( 1 + \theta^2(i,j,t) \right)^{-3/2} \tag{22}$$

$$\theta(i,j,t) = y(i,j,t) - y(i,j-1,t) \tag{23}$$

$$y(i,0,t) = \frac{1}{J} \sum_{j=1}^{J} y(i,j,t) \tag{24}$$

$$\theta(i,0,t) = \frac{1}{J} \sum_{j=1}^{J} \theta(i,j,t) \tag{25}$$

$$k(i,0,t) = \frac{1}{J} \sum_{j=1}^{J} k(i,j,t) \tag{26}$$

To formulate the dynamic state of the leading factors, we introduce two special matrices,

$$S_{\max}(t) = \left( M(i,j,t) \right)_{I \times J} \tag{27}$$

$$S_{\min}(t) = \left( m(i,j,t) \right)_{I \times J} \tag{28}$$

where

$$M(i,j,t) = \begin{cases} 0, & x(i,j,t) < D_1 \\ x(i,j,t), & x(i,j,t) \ge D_1 \end{cases} \tag{29}$$

$$m(i,j,t) = \begin{cases} 0, & x(i,j,t) > D_2 \\ x(i,j,t), & x(i,j,t) \le D_2 \end{cases} \tag{30}$$

$D_1 > D_2$; $D_1$ is the critical upper-value and $D_2$ is the critical lower-value.
According to the requirements of some studied issues, very useful knowledge can sometimes be obtained by analyzing
the dynamic characteristics of the sector $(y, \theta(i,j,t), k(i,j,t))$, where $y$ is the measurement of the
average situation of the huge-data in a single-attribute-phase.
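The single-attribute model can be sketched for one latitude row at a fixed time. The sketch assumes that Eq. (18) reads $CD(i,t) = \mathrm{sign}(y) \sum_j (\theta^2 + k^2 + y^2(i,0,t)) (1 + \theta^2)^{1/2}$, with $y = x / x_{\max}$, $\theta(i,j,t) = y(i,j,t) - y(i,j-1,t)$ and $k(i,j,t) = (\theta(i,j,t) - \theta(i,j-1,t)) (1 + \theta^2(i,j,t))^{-3/2}$. The paper does not say how the first differences at the start of a row are handled; setting them to zero here is an assumption, as are the function names and the choice of sign $+1$ for a zero mean.

```python
import math

def standardize(data):
    """Eq. (19): divide every x(i, j, t) by the overall maximum (assumed positive)."""
    x_max = max(v for layer in data for row in layer for v in row)
    return [[[v / x_max for v in row] for row in layer] for layer in data]

def slopes(y_row):
    """Eq. (23): theta_j = y_j - y_(j-1); the first entry is set to 0 (assumption)."""
    return [0.0] + [y_row[j] - y_row[j - 1] for j in range(1, len(y_row))]

def curvatures(theta):
    """Eq. (22): k_j = (theta_j - theta_(j-1)) * (1 + theta_j^2)^(-3/2).

    The first two entries are set to 0 because theta_1 is undefined (assumption).
    """
    k = [0.0] * len(theta)
    for j in range(2, len(theta)):
        k[j] = (theta[j] - theta[j - 1]) * (1 + theta[j] ** 2) ** -1.5
    return k

def cd_row(y_row):
    """Eq. (18) for one standardized row y(i, 1..J, t)."""
    y_mean = sum(y_row) / len(y_row)          # Eq. (24): average situation y(i, 0, t)
    theta = slopes(y_row)
    k = curvatures(theta)
    sign = 1.0 if y_mean >= 0 else -1.0       # sign of the average situation
    total = sum((th ** 2 + kk ** 2 + y_mean ** 2) * math.sqrt(1 + th ** 2)
                for th, kk in zip(theta, k))
    return sign * total

# One time layer with two latitude rows and four longitudes.
layer = standardize([[[2.0, 4.0, 6.0, 8.0], [8.0, 8.0, 8.0, 8.0]]])[0]
print([cd_row(row) for row in layer])
```

Whatever the number of data points in a row, the pattern value depends only on the three quantities named in the text: the average level, the slope and the curvature.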
3. The Model For Huge-Indices In A Multi-Attribute-Phase
Index systems, in which each index is a summarization of a data cluster, have been studied by many scientists. The
Organization for Economic Co-operation and Development (1997) developed a set of environmental indicators for
agriculture in terms of the Driving Force-State-Response (DSR) framework in order to identify and quantify the extent
of the impacts of agriculture and agricultural policies on the environment and to better understand the effects of
different policy measures on the environment. To establish ecological balance-sheets and measures of environmental
protection, Haber and Engelfried (1997) set up a criterion system for environmental impact assessment. To measure
changes in the quality or condition of land and so promote land management practices that ensure productive and
sustainable use of natural resources, Pieri et al. (1995) proposed the land quality indicators. To answer the question
whether the development of a region or a nation is sustainable or not, Opschoor and Reijnders (1992) introduced
sustainable development indicators. In order to find a way of measuring the economy that can give better guidance than
the gross national product to those interested in promoting economic welfare, Daly and Cobb (1990) developed an index
system of sustainable economic welfare.
For all these index systems, we design a model for huge-indices decision making by means of differential geometry.
This model applies to situations that have more than 10 indices (indicators or criteria). The studied issues might
sometimes require us to analyze (1) effects of negative factors, (2) effects of positive factors, or (3) both
simultaneously. In these three situations we must separately set up an index system of negative factors and one of
positive factors. Within each index system, all indexes should be relatively independent.
Suppose that for an analyzed issue an index system has been set up as follows

$$\left( z(1,j,t), z(2,j,t), \ldots, z(i,j,t), \ldots, z(I,j,t) \right) \tag{31}$$

For constructing a temporal dynamic model, a sub-period $t$ is temporarily fixed. Then, we can get the following
algebraic matrices

$$Z(t) = \left( z(i,j,t) \right)_{I \times J} \tag{32}$$

Let

$$z_{\max}(i,t) = \max_{1 \le j \le J} \{ z(i,j,t) \} \tag{33}$$

$$y(i,j,t) = \frac{z(i,j,t)}{z_{\max}(i,t)} \tag{34}$$

$$y(0,j,t) = \frac{1}{I} \sum_{i=1}^{I} y(i,j,t) \tag{35}$$

$$\theta(i,j,t) = y(i,j,t) - y(i-1,j,t) \tag{36}$$

$$k(i,j,t) = \left( \theta(i,j,t) - \theta(i-1,j,t) \right) \left( 1 + \theta^2(i,j,t) \right)^{-3/2} \tag{37}$$

$$\theta(0,j,t) = \frac{1}{I} \sum_{i=1}^{I} \theta(i,j,t) \tag{38}$$

$$k(0,j,t) = \frac{1}{I} \sum_{i=1}^{I} k(i,j,t) \tag{39}$$

For the index system (31), the determination of the index weights is very important for constructing its model. Each
set of weights corresponds to one kind of structure in the index system. A change of index weights means a change in
the model's structural dynamics. Different sets of weights produce different results (or scenarios). The weights of
the indexes can be determined in various ways, such as choosing equal weights for all indexes, or determining the
weights by analysis of administrative levels or by a membership function of fuzzy sets. The weight system can be
generally formulated as

$$\left( w(1), w(2), \ldots, w(i), \ldots, w(I) \right) \tag{40}$$

where $i = 1, 2, \ldots, I$; $j = 1, 2, \ldots, J$; $t = 1, 2, \ldots, T$; $\sum_{i=1}^{I} w(i) = 1$; $I$ is the
total number of the indexes; $J$ is the total number of analyzed regions; and $T$ is the total number of analyzed
sub-periods. The common model both for negative factors and for positive factors in the $j$th region can be expressed
as

$$CD(j,t) = \mathrm{sign}(y) \cdot \sum_{i=1}^{I} w(i) \cdot \left( \theta^2(i,j,t) + k^2(i,j,t) + y^2(0,j,t) \right) \left( 1 + \theta^2(i,j,t) \right)^{1/2} \tag{41}$$

where

$$\mathrm{sign}(y) = \begin{cases} 1, & y \ge 0 \\ -1, & y < 0 \end{cases}$$

The general model in the whole area investigated can be formulated as

$$GSCD(t) = \sum_{j=1}^{J} P(j,t) \cdot CD(j,t) \tag{42}$$

where $P(j,t)$ is a parameter determined by the $j$th country or region; $CD(j,t)$ is the pattern in the $j$th
country or region; $GSCD(t)$ is the general pattern in the whole analyzed area.
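The multi-attribute model for one fixed sub-period can be sketched in the same way. The sketch assumes that Eq. (41) reads $CD(j,t) = \mathrm{sign}(y) \sum_i w(i) (\theta^2 + k^2 + y^2(0,j,t)) (1 + \theta^2)^{1/2}$ and that Eq. (42) reads $GSCD(t) = \sum_j P(j,t) \cdot CD(j,t)$. The matrix layout `z[i][j]` (index $i$, region $j$), the zero values used for the undefined first differences, the positivity of each row maximum and the function names are illustrative assumptions.

```python
import math

def region_patterns(z, weights):
    """Return CD(j) for every region j, given z[i][j] and weights w(i) summing to 1."""
    n_idx, n_reg = len(z), len(z[0])
    # Eqs. (33)-(34): scale each index by its maximum over the regions
    y = [[v / max(row) for v in row] for row in z]
    patterns = []
    for j in range(n_reg):
        col = [y[i][j] for i in range(n_idx)]
        y_mean = sum(col) / n_idx                                   # Eq. (35)
        # Eq. (36); the first slope is set to 0 (assumption)
        theta = [0.0] + [col[i] - col[i - 1] for i in range(1, n_idx)]
        # Eq. (37); the first two curvatures are set to 0 (assumption)
        k = [0.0] * n_idx
        for i in range(2, n_idx):
            k[i] = (theta[i] - theta[i - 1]) * (1 + theta[i] ** 2) ** -1.5
        sign = 1.0 if y_mean >= 0 else -1.0
        total = sum(w * (th ** 2 + kk ** 2 + y_mean ** 2) * math.sqrt(1 + th ** 2)
                    for w, th, kk in zip(weights, theta, k))
        patterns.append(sign * total)                               # Eq. (41)
    return patterns

def general_pattern(patterns, p):
    """Eq. (42): weighted sum of the regional patterns with parameters P(j)."""
    return sum(pj * cd for pj, cd in zip(p, patterns))

# Three indices, two regions, equal weights (one of the weighting options in the text).
z = [[2.0, 4.0], [3.0, 6.0], [5.0, 10.0]]
weights = [1.0 / 3.0] * 3
cd = region_patterns(z, weights)
print(cd, general_pattern(cd, [0.5, 0.5]))
```

Changing `weights` while keeping `z` fixed produces the different scenarios that the text attributes to different weight systems.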
In order to know in which countries or regions the leading indices exist, we introduce two leading matrices

$$M_{\max}(t) = \left( M(i,j,t) \right)_{I \times J} \tag{43}$$

$$m_{\min}(t) = \left( m(i,j,t) \right)_{I \times J} \tag{44}$$

where

$$M(i,j,t) = \begin{cases} 0, & z(i,j,t) < C_1 \\ z(i,j,t), & z(i,j,t) \ge C_1 \end{cases} \tag{45}$$

$$m(i,j,t) = \begin{cases} 0, & z(i,j,t) > C_2 \\ z(i,j,t), & z(i,j,t) \le C_2 \end{cases} \tag{46}$$

$C_1 > C_2$; $C_1$ is the critical upper-value of the index system and $C_2$ is the critical lower-value of the
index system.
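The two leading matrices are simple threshold filters. The sketch below assumes that an entry is kept in the upper matrix when it reaches the critical upper-value and in the lower matrix when it does not exceed the critical lower-value, and is replaced by zero otherwise; whether the comparisons are strict is not recoverable from the text, so the inclusive comparisons and the function name are assumptions.

```python
def leading_matrices(z, c_upper, c_lower):
    """Return (M, m): entries at or above c_upper, and entries at or below c_lower."""
    if c_upper <= c_lower:
        raise ValueError("the critical upper-value must exceed the lower-value")
    upper = [[v if v >= c_upper else 0 for v in row] for row in z]
    lower = [[v if v <= c_lower else 0 for v in row] for row in z]
    return upper, lower

# Two indices (rows) in two regions (columns).
z = [[1, 5], [9, 3]]
M, m = leading_matrices(z, c_upper=5, c_lower=3)
print(M)  # entries that reach the upper threshold
print(m)  # entries that stay at or below the lower threshold
```

The non-zero positions of `M` and `m` point to the index and region combinations that dominate the pattern at either extreme.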
According to the concrete contents of some studied issues, the dynamic characteristics of the sector
$(y(0,j,t), \theta(i,j,t), k(i,j,t))$ are sometimes useful for bringing to light the laws of the issues.
4. Discussions
The models in the approach of differential geometry to data mining have a common shell and need to deal with at most
three variables, which are the curvature, the slope and the initial value, no matter how many data must be mined or
how many indices must be handled. The approach of differential geometry does not need to construct a large number of
models in order to process an enormous amount of data. The effective application of the approach of differential
geometry to data mining requires performing 4 important steps: identifying the overall purpose of data mining,
preparing data, operating the model and evaluating the results of the model. Because different purposes require very
different data or index systems, the overall purpose must be clearly stated in order to make the best use of data
mining.
The step of preparing data is the most time consuming. It is quite possible that some of the required data have never
been collected, so that it may be necessary to collect additional data. Because good models must be supported by good
data, it is essential to assess data characteristics and to repair data defects. When data come from multiple
sources, they must be consolidated into a single database, and it must be ensured that they measure the same thing in
the same way. Once the data are gathered for the model to be constructed, it is necessary to select the specific data.
When the specific data are selected, some additional data transformations may be necessary. For instance, to operate
the model for huge-data in a single-attribute-phase, the data may be sorted and given a plus sign or a minus sign
according to whether they make a positive or a negative contribution to the overall purpose; to operate the model for
huge-indices in a multi-attribute-phase, the data may be clustered and transformed into an index system according to
certain algorithms.
After the model constructed by means of the approach of differential geometry is operated, its results must be
evaluated and their significance must be interpreted. When the model has been used, it must be measured how well it
has worked. Even when the model works well, its performance must be continually monitored because all systems may
evolve and the data may change over time (Edelstein, 1998).
References
Carbone, P. L. 1998, Data mining: knowledge discovery
in data bases. In B. Thuraisingham (ed.), Data
Management: 611-624, Washington D C: CRC Press
LLC
Daly, H. E. & J. J. B. Cobb, 1990, For the Common Good
- Redirecting the economy towards community, the
environment, and a sustainable future, London: Green
Print
Edelstein, H. 1998, Data mining—let’s get practical, DB2
Magazine, http://www.db2mag.com/98smEdel.htm.
Haber, W. & J. Engelfried, 1997, Von Ökobilanzen zur
Umweltverträglichkeit menschlicher Aktivitäten,
Zeitschrift für Angewandte Umweltforschung 10:
222-229
OECD, 1997, Environmental Indicators for Agriculture,
75775 Paris Cedex 16, France: OECD Publications
Opschoor, H. & L. Reijnders, 1992, Towards sustainable
development indicators, In O. Kuik & H. Verbruggen
(eds), In Search of Indicators of Sustainable Development:7-28, Dordrecht: Kluwer Academic Publishers
Pfeiffer, K., E. Papcek & D. Smith, 1998, What is data
mining? http://www-personal.umd.umich.edu/~kpfeiff/index.html
Pieri, C., J. Dumanski, A. Hamblin & A.Young, 1995, Land
Quality Indicators, World Bank Discussion Papers, No.
315
Small, R. D. & H. A. Edelstein, 1998, Scalable data
mining. In B. Thuraisingham (ed.), Data Management:
637-647, Washington, D. C.: CRC
Spivak, M., 1979, A Comprehensive Introduction to
Differential Geometry. Houston, Texas: Publish or
Perish, INC
Taylor, A. E., 1958, Introduction to Functional Analysis,
New York: John Wiley & Sons, Inc
Thuraisingham, B. 1997, Data Management Systems:173-185, Florida:
CRC Press LLC
Yue, T. X., W. Haber, W. D. Grossmann & H. D.
Kasperidus, 1999, A method for strategic management
of land, In Y. A. Pykh, D. E. Hyatt & R. J.M.B.
Lenz(eds), Environmental Indices: Systems Analysis
Approaches: 181-201, London: EOLSS Publishers Co
Ltd
Yue, T. X. 1994, Systems Models for Land Management
and Real Estate Evaluation:149-152, Beijing: China
Society Press(in Chinese)
Yue, T. X. & N. S. Ai, 1990, A morphological mathematical
model for cirques. Glaciology and Cryopedology 12(3):
227-234 (in Chinese)