Towards Digital Earth: Proceedings of the International Symposium on Digital Earth, Science Press, 1999

An Approach of Differential Geometry to Data Mining

Tianxiang Yue, Chenghu Zhou
State Key Laboratory of Resources and Environmental Information System, Chinese Academy of Sciences
917 Building, Datun, Anwai, 100101 Beijing, P. R. China
Tel.: 86-10-64889633, Fax: 86-10-64889630
email: [email protected], [email protected]

ABSTRACT
This paper proposes an approach of differential geometry as a solution to the problems facing data mining. By means of the plane curve theorem of differential geometry, a mathematical model formulating the distance between plane curves is constructed, in which the distance is determined by at most three variables. According to the theory of functional analysis, this distance is a distance on a metric space of curves. Finally, a model for huge-data in a single-attribute-phase and a model for huge-indices in a multi-attribute-phase are constructed, both based on the mathematical model formulating the distance between curves. The approach of differential geometry to data mining includes four important steps: identifying the overall purpose of data mining, preparing data, operating models, and evaluating model results.

In recent years, both the number and the size of databases have been growing at a staggering rate. It has been realized that there is valuable knowledge buried in the data. In the meantime, some of the enabling technologies have recently become mature enough to make data mining possible on large data sets (Carbone, 1998). Data mining has therefore received enormous attention and is becoming popular owing to the decreasing costs of data collection (Pfeiffer et al., 1998). Data mining is defined as the process of extracting patterns and relationships, often previously unknown, from data sources that include databases, collection data, or even data warehouses (Thuraisingham, 1997).
Data mining is a step in the larger process of knowledge discovery in databases (KDD), which refers to the overall process of discovering useful knowledge from data. To begin the KDD process, the analysis must first have an overall purpose or set of goals, which is used to select the data to be analyzed from the set of all available data. Then, the target data are moved to another database for further preprocessing. To discover knowledge such as trends, patterns, characteristics and anomalies, data mining algorithms are used; these should be pertinent to the purpose of the analysis and to the type of data to be analyzed. When a pattern is identified, it should be examined to determine whether it is new, relevant and correct by some standard of measure. After the interpretation and evaluation step is completed and the pattern is deemed relevant and useful, the pattern can be deemed knowledge (Carbone, 1998).

Data mining is an important method for extracting valuable information from databases of all sizes. Data miners are sometimes required to construct a highly accurate model for data mining as quickly as possible. But three factors make constructing a model for data mining a potentially lengthy process: (1) the enormous amount of data that must be processed, (2) the large number of models that must be constructed, and (3) the intricacies of testing and validating models (Small and Edelstein, 1998). The approach of differential geometry developed in this paper is a solution to these problems. This approach to data mining includes a model for huge-data in a single-attribute-phase and a model for huge-indices in a multi-attribute-phase.

KEY WORDS
Data mining, Differential geometry, Mathematical models, Attribute phase

1. The Foundation of the Approach

Curve Theorem in the plane (Spivak, 1979). Let $k: [S_0, S] \to \mathbb{R}$ be continuous. Then there is a curve $L: [S_0, S] \to \mathbb{R}^2$, parameterized by arc-length, whose curvature at $s$ is given by $k(s)$ for all $s \in [S_0, S]$.
Moreover, if $L_1$ and $L_2$ are two such curves, then $L_1 = A \circ L_2$, where $A$ is some proper Euclidean motion (a translation followed by a rotation).

Therefore, the overall difference between two plane curves can be simulated as follows (Yue and Ai, 1990):

$$CD(L_1, L_2) = (IV + SL + CU)^{1/2} = \left\{ \frac{1}{S - S_0} \int_{S_0}^{S} \left[ \left(L_1(S_0) - L_2(S_0)\right)^2 + \left(\alpha_1(s) - \alpha_2(s)\right)^2 + \left(k_1(s) - k_2(s)\right)^2 \right] ds \right\}^{1/2} \tag{1}$$

where

$$IV = \frac{1}{S - S_0} \int_{S_0}^{S} \left(L_1(S_0) - L_2(S_0)\right)^2 ds \tag{2}$$

$$SL = \frac{1}{S - S_0} \int_{S_0}^{S} \left(\alpha_1(s) - \alpha_2(s)\right)^2 ds \tag{3}$$

$$CU = \frac{1}{S - S_0} \int_{S_0}^{S} \left(k_1(s) - k_2(s)\right)^2 ds \tag{4}$$

$k_i(s)$ is the curvature of the plane curve $L_i$; $\alpha_i(s)$ is the slope of the plane curve $L_i$; $L_i(S_0)$ is the initial value ($i = 1, 2$).

It can be proven (Yue et al., 1999) that $CD(L_1, L_2)$ has the following three properties:
(a) $CD(L_1, L_2) \ge 0$, and $CD(L_1, L_2) = 0$ if and only if $L_1 = L_2$;
(b) $CD(L_1, L_2) = CD(L_2, L_1)$;
(c) $CD(L_1, L_3) \le CD(L_1, L_2) + CD(L_2, L_3)$.
In terms of the theory of functional analysis, $CD(L_1, L_2)$ is a distance on a metric space of curves (Taylor, 1958). We call this kind of distance the Curves' Distance.

If the curves $L_i$ can be simulated as

$$y = f_i(x) \tag{5}$$

then $\alpha_i$ and $k_i$ can be respectively formulated as

$$\alpha_i(x) = \frac{d f_i(x)}{dx} \tag{6}$$

$$k_i(x) = \frac{d \alpha_i(x)}{dx} \left(1 + \alpha_i^2(x)\right)^{-3/2} \tag{7}$$

and

$$ds = \left(1 + \alpha^2(x)\right)^{1/2} dx \tag{8}$$

Suppose that the curve $L_2: f(x)$ is considered as an intended-goal-function and $L_1: g(x)$ is an arbitrary function. According to the discussion above, if $g(x)$ consists of negative factors, the intended-goal-function is a plane straight line $f(x) = 0$, $x \in [X_0, X]$. In this case, $\alpha_2(s)$, $k_2(s)$ and $f(X_0)$ equal zero, and then (Yue, 1994)

$$CD_{negative} = \left\{ \frac{1}{X - X_0} \int_{X_0}^{X} \left[ \alpha^2(x) + k^2(x) + g^2(X_0) \right] \left(1 + \alpha^2(x)\right)^{1/2} dx \right\}^{1/2} \tag{14}$$

Obviously, $CD_{negative} \ge 0$, and $CD_{negative} = 0$ is the optimum situation. In other words, the closer the curve is to the straight line $f(x) = 0$, the better the situation is.

If $g(x)$ consists of positive factors, it is not so easy to determine the intended goals quantitatively. In this situation, we express the intended goal as a longer distance from the straight line $f(x) = 0$. In other words, for the issues of positive factors, the better the situation is, the longer the distance is from the straight line $f(x) = 0$. The model can be generally formulated as

$$CD_{positive} = \left\{ \frac{1}{X - X_0} \int_{X_0}^{X} \left[ \alpha^2(x) + k^2(x) + g^2(X_0) \right] \left(1 + \alpha^2(x)\right)^{1/2} dx \right\}^{1/2} \tag{15}$$

where $CD_{positive} \ge 0$; $CD_{positive} = 0$ is the worst situation and the biggest $CD_{positive}$ is the optimum situation.

2. The Model for Huge-Data in a Single-Attribute-Phase

If the relevant data at every point of the earth or of a region are put in order in terms of longitude, latitude and time, they are sequenced in three dimensions. The train of thought on this model at the initial stage of its development can be expressed as follows: (1) at first, let two of the three variables (longitude, latitude and time) be fixed temporarily and transform the sequenced data into plane curves; for any plane curve, it is enough to analyze three parameters, namely intercept, slope and curvature, to find the pattern of the sequenced data; (2) in order to analyze the reasons that have caused the pattern, matrices of leading factors are included in the model; (3) finally, the temporarily fixed variables are respectively allowed to change freely so that we can analyze the spatial and temporal dynamics.

Suppose that the sequenced data in a single-attribute-phase can be expressed as follows in terms of longitude, latitude and time,

$$X(t) = \begin{pmatrix} x(1,1,t) & x(1,2,t) & \cdots & x(1,J,t) \\ x(2,1,t) & x(2,2,t) & \cdots & x(2,J,t) \\ \vdots & \vdots & \ddots & \vdots \\ x(I,1,t) & x(I,2,t) & \cdots & x(I,J,t) \end{pmatrix} = \left(x(i,j,t)\right)_{I \times J} \tag{16}$$

where $X(t)$ is the $t$th layer of the three-dimensional matrix ($t = 1, 2, \ldots, T$); $J$ is the maximum longitude index; $I$ is the maximum latitude index; and $T$ is the maximum value of the time variable.
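As a minimal sketch of step (1) above, the following Python fragment fixes a latitude index and a time layer, reads the resulting plane curve along longitude, and computes its intercept, slope and curvature. The array layout `X[t, i, j]`, the function names, the toy data, and the use of `numpy.gradient` with unit spacing as a stand-in for the derivatives in Eqs. (6)-(7) are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def curve_along_longitude(X, i, t):
    """Fix latitude index i and time index t; return the curve over longitude j.

    Assumes a zero-based array laid out as X[t, i, j] (hypothetical layout).
    """
    return X[t, i, :]


def curve_parameters(values):
    """Return (intercept, slope, curvature) for one sampled plane curve.

    Assumes unit spacing between samples. np.gradient approximates the
    derivatives of Eqs. (6)-(7); it is an illustrative choice only.
    """
    y = np.asarray(values, dtype=float)
    intercept = y[0]                                   # initial value of the curve
    slope = np.gradient(y)                             # alpha(x) = df/dx, Eq. (6)
    curvature = np.gradient(slope) / (1.0 + slope**2) ** 1.5   # Eq. (7)
    return intercept, slope, curvature


# Hypothetical toy data: T = 2 layers, I = 3 latitudes, J = 5 longitudes.
X = np.zeros((2, 3, 5))
X[0, 1, :] = [1.0, 3.0, 5.0, 7.0, 9.0]   # a straight line along longitude

intercept, slope, curvature = curve_parameters(curve_along_longitude(X, i=1, t=0))
```

The toy row is a straight line, so its slope is constant and its curvature vanishes, which makes it a convenient sanity check before applying the same functions to real layers.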
The three-dimensional matrix $X(t)$ can be transformed into a standardized matrix

$$Y(t) = \left(y(i,j,t)\right)_{I \times J} \tag{17}$$

where

$$y(i,j,t) = \frac{x(i,j,t)}{x_{max}} \tag{18}$$

and

$$x_{max} = \max_{i,j,t} \{ x(i,j,t) \} \tag{19}$$

Then, the dynamic model in terms of latitude and time can be formulated as

$$CD(i,t) = \operatorname{sign}(y) \left\{ \frac{1}{J} \sum_{j=1}^{J} \left[ \alpha^2(i,j,t) + k^2(i,j,t) + y^2(i,0,t) \right] \left(1 + \alpha^2(i,j,t)\right)^{1/2} \right\}^{1/2} \tag{20}$$

where

$$\operatorname{sign}(y) = \begin{cases} 1, & y \ge 0 \\ -1, & y < 0 \end{cases} \tag{21}$$

$$\alpha(i,j,t) = y(i,j,t) - y(i,j-1,t) \tag{22}$$

$$k(i,j,t) = \left(\alpha(i,j,t) - \alpha(i,j-1,t)\right) \left(1 + \alpha^2(i,j,t)\right)^{-3/2} \tag{23}$$

$$y(i,0,t) = \frac{1}{J} \sum_{j=1}^{J} y(i,j,t) \tag{24}$$

$$\alpha(i,0,t) = \frac{1}{J} \sum_{j=1}^{J} \alpha(i,j,t) \tag{25}$$

$$k(i,0,t) = \frac{1}{J} \sum_{j=1}^{J} k(i,j,t) \tag{26}$$

To formulate the dynamic state of the leading factors, we introduce two special matrices,

$$S_{max}(t) = \left(M(i,j,t)\right)_{I \times J} \tag{27}$$

$$S_{min}(t) = \left(m(i,j,t)\right)_{I \times J} \tag{28}$$

where

$$M(i,j,t) = \begin{cases} 0, & x(i,j,t) < D_1 \\ x(i,j,t), & x(i,j,t) \ge D_1 \end{cases} \tag{29}$$

$$m(i,j,t) = \begin{cases} 0, & x(i,j,t) > D_2 \\ x(i,j,t), & x(i,j,t) \le D_2 \end{cases} \tag{30}$$

$D_1 > D_2$; $D_1$ is the critical upper-value and $D_2$ is the critical lower-value.

According to the requirements of some studied issues, very useful knowledge can sometimes be obtained by analyzing the dynamic characteristics of the vector $(y, \alpha(i,j,t), k(i,j,t))$, where $y$ measures the average situation of the huge-data in a single-attribute-phase.

3. The Model for Huge-Indices in a Multi-Attribute-Phase

Index systems, in which each index is a summarization of a data cluster, have been studied by many scientists. The Organization for Economic Co-operation and Development (1997) developed a set of environmental indicators for agriculture in terms of the Driving Force-State-Response (DSR) framework in order to identify and quantify the extent of the impacts of agriculture and agricultural policies on the environment and to better understand the effects of different policy measures on the environment. To establish ecological balance-sheets and measures of environmental protection, Haber and Engelfried (1997) set up a criterion system for environmental impact assessment.
To measure changes in the quality or condition of land and so promote land management practices that ensure productive and sustainable use of natural resources, Pieri et al. (1995) proposed the land quality indicators. To answer the question of whether the development of a region or a nation is sustainable, Opschoor and Reijnders (1992) introduced sustainable development indicators. In order to find a way of measuring the economy that can give better guidance than the gross national product to those interested in promoting economic welfare, Daly and Cobb (1990) developed an index system of sustainable economic welfare.

For all these index systems, we design a model for huge-indices decision making by means of differential geometry. This model applies to situations that have more than 10 indices (indicators or criteria). The studied issues might sometimes require us to analyze (1) effects of negative factors, (2) effects of positive factors, or (3) both simultaneously. In these three situations we must separately set up an index system of negative factors or one of positive factors. Within each index system, all indexes should be relatively independent.

Suppose that for an analyzed issue an index system has been set up as follows

$$\{ z(1,j,t),\ z(2,j,t),\ \ldots,\ z(i,j,t),\ \ldots,\ z(I,j,t) \} \tag{31}$$

For constructing a temporal dynamic model, a sub-period $t$ is temporarily fixed. Then, we can get the following algebraic matrices

$$Z(t) = \left(z(i,j,t)\right)_{I \times J} \tag{32}$$

Let

$$z_{max}(i,t) = \max_{1 \le j \le J} \{ z(i,j,t) \} \tag{33}$$

$$y(i,j,t) = \frac{z(i,j,t)}{z_{max}(i,t)} \tag{34}$$

$$y(0,j,t) = \frac{1}{I} \sum_{i=1}^{I} y(i,j,t) \tag{35}$$

$$\alpha(i,j,t) = y(i,j,t) - y(i-1,j,t) \tag{36}$$

$$k(i,j,t) = \left(\alpha(i,j,t) - \alpha(i-1,j,t)\right) \left(1 + \alpha^2(i,j,t)\right)^{-3/2} \tag{37}$$

$$\alpha(0,j,t) = \frac{1}{I} \sum_{i=1}^{I} \alpha(i,j,t) \tag{38}$$

$$k(0,j,t) = \frac{1}{I} \sum_{i=1}^{I} k(i,j,t) \tag{39}$$

For the index system (31), the determination of the index weights is very important for constructing its model. Each set of weights corresponds to one kind of structure in the index system, so a change of index weights means a change in the model's structure. Different sets of weights produce different results (or scenarios). The weights of the indexes can be determined in various ways, such as choosing equal weights for all indexes, or determining the weights by analysis of administrative levels or by a subordinate function of fuzzy sets. The weight system can be generally formulated as

$$\{ w_1,\ w_2,\ \ldots,\ w_i,\ \ldots,\ w_I \} \tag{40}$$

where $i = 1, 2, \ldots, I$; $j = 1, 2, \ldots, J$; $t = 1, 2, \ldots, T$; $\sum_{i=1}^{I} w_i = 1$; $I$ is the total number of the indexes; $J$ is the total number of analyzed regions; and $T$ is the total number of analyzed sub-periods.

The common model both for negative factors and for positive factors in the $j$th region can be expressed as

$$CD(j,t) = \operatorname{sign}(y) \left\{ \sum_{i=1}^{I} w_i \left[ \alpha^2(i,j,t) + k^2(i,j,t) + y^2(0,j,t) \right] \left(1 + \alpha^2(i,j,t)\right)^{1/2} \right\}^{1/2} \tag{41}$$

where

$$\operatorname{sign}(y) = \begin{cases} 1, & y \ge 0 \\ -1, & y < 0 \end{cases}$$

The general model in the whole area investigated can be formulated as

$$GSCD(t) = \sum_{j=1}^{J} P(j,t) \, CD(j,t) \tag{42}$$

where $P(j,t)$ is a parameter determined by the $j$th country or region; $CD(j,t)$ is the pattern in the $j$th country or region; $GSCD(t)$ is the general pattern in the whole analyzed area.

In order to know in which countries or regions the leading indices exist, we introduce two leading matrices

$$M_{max}(t) = \left(M(i,j,t)\right)_{I \times J} \tag{43}$$

$$m_{min}(t) = \left(m(i,j,t)\right)_{I \times J} \tag{44}$$

where

$$M(i,j,t) = \begin{cases} 0, & z(i,j,t) < C_1 \\ z(i,j,t), & z(i,j,t) \ge C_1 \end{cases} \tag{45}$$

$$m(i,j,t) = \begin{cases} 0, & z(i,j,t) > C_2 \\ z(i,j,t), & z(i,j,t) \le C_2 \end{cases} \tag{46}$$

$C_1 > C_2$; $C_1$ is the critical upper-value of the index system and $C_2$ is the critical lower-value of the index system.

According to the concrete contents of some studied issues, the dynamic characteristics of the vector $(y(0,j,t), \alpha(i,j,t), k(i,j,t))$ are sometimes useful for bringing to light the law of the issues.

4. Discussions

The models in the approach of differential geometry to data mining have a common shell and need to deal with at most three variables, namely the curvature, the slope and the initial value, no matter how much data must be mined or how many indices must be handled. The approach of differential geometry does not need to construct a large number of models in order to process an enormous amount of data.

The effective application of the approach of differential geometry to data mining requires performing four important steps: identifying the overall purpose of data mining, preparing data, operating the model, and evaluating the results of the model. Because different purposes require very different data or index systems, the overall purpose must be clearly stated in order to make the best use of data mining. The step of preparing data is the most time consuming. It is quite possible that some of the required data have never been collected, so that it may be necessary to collect additional data. Because good models must be supported by good data, it is essential to assess data characteristics and to repair data defects. When data come from multiple sources, they must be consolidated into a single database and checked to ensure that they measure the same thing in the same way. Once the data are gathered for the model to be constructed, the specific data need to be selected. When the specific data are selected, some additional data transformations may be necessary. For instance, to operate the model for huge-data in a single-attribute-phase, the data may be sorted and given a plus or minus sign according to whether they make a positive or a negative contribution to the overall purpose; to operate the model for huge-indices in a multi-attribute-phase, the data may be clustered and transformed into an index system according to certain algorithms.

After the model constructed by means of the approach of differential geometry is operated, its results must be evaluated and their significance must be interpreted. When the model has been used, how well it has worked must be measured. Even when the model works well, its performance must be continually monitored, because all systems may evolve and the data may change over time (Edelstein, 1998).

References

Carbone, P. L., 1998, Data mining: knowledge discovery in data bases. In B. Thuraisingham (ed.), Data Management: 611-624, Washington, D. C.: CRC Press LLC
Daly, H. E. & J. J. B. Cobb, 1990, For the Common Good - Redirecting the economy towards community, the environment, and a sustainable future, London: Green Print
Edelstein, H., 1998, Data mining - let's get practical, DB2 Magazine, http://www.db2mag.com/98smEdel.htm
Haber, W. & J. Engelfried, 1997, Von Ökobilanzen zur Umweltverträglichkeit menschlicher Aktivitäten, Zeitschrift für Angewandte Umweltforschung 10: 222-229
OECD, 1997, Environmental Indicators for Agriculture, 75775 Paris Cedex 16, France: OECD Publications
Opschoor, H. & L. Reijnders, 1992, Towards sustainable development indicators. In O. Kuik & H. Verbruggen (eds), In Search of Indicators of Sustainable Development: 7-28, Dordrecht: Kluwer Academic Publishers
Pfeiffer, K., E. Papcek & D. Smith, 1998, What is data mining? http://www-personal.umd.umich.edu/~kpfeiff/index.html
Pieri, C., J. Dumanski, A. Hamblin & A. Young, 1995, Land Quality Indicators, World Bank Discussion Papers, No. 315
Small, R. D. & H. A. Edelstein, 1998, Scalable data mining. In B. Thuraisingham (ed.), Data Management: 637-647, Washington, D. C.: CRC
Spivak, M., 1979, A Comprehensive Introduction to Differential Geometry, Houston, Texas: Publish or Perish, Inc
Taylor, A. E., 1958, Introduction to Functional Analysis, New York: John Wiley & Sons, Inc
Thuraisingham, B., 1997, Data Management Systems: 173-185, Florida: CRC Press LLC
Yue, T. X., W. Haber, W. D. Grossmann & H. D. Kasperidus, 1999, A method for strategic management of land. In Y. A. Pykh, D. E. Hyatt & R. J. M. B. Lenz (eds), Environmental Indices: Systems Analysis Approaches: 181-201, London: EOLSS Publishers Co Ltd
Yue, T. X., 1994, Systems Models for Land Management and Real Estate Evaluation: 149-152, Beijing: China Society Press (in Chinese)
Yue, T. X. & N. S. Ai, 1990, A morphological mathematical model for cirques, Glaciology and Cryopedology 12(3): 227-234 (in Chinese)