Data Mining of Population Distribution Rules: an Attribute-Oriented Approach

Liu Deqin, Ma Weijun
Chinese Academy of Surveying and Mapping, Beijing 100039, China

Abstract: Attribute-Oriented Induction (AOI) reduces the search space of large data sets to produce a minimal rule set. This paper introduces the theory and technology of data mining based on AOI. Recent spatial analysis of population has relied mainly on qualitative methods. To address this shortcoming, AOI is applied to integrated population and geomorphologic data in order to discover rules of population distribution by geomorphologic region. The theory, technology, process and results are presented.

1 INTRODUCTION

Data mining is the extraction of interesting patterns concealed in large databases. Attribute-Oriented Induction (AOI) is a set-oriented generalization technique used to find various types of rules, including association rules, sequential patterns, classification and summarization. The method integrates learning-from-examples techniques with database procedures. Census databases store vast amounts of population data. In the past these data were analysed mainly by qualitative methods, which left quantitative questions unanswered. This paper presents an attribute-oriented induction approach to the discovery of population distribution rules.

2 POPULATION SPATIAL DISTRIBUTIONS

Population spatial distribution is the state of distribution of a population in geographic space at a given time. It results from the interaction of many factors, including natural, social, economic and political ones. In 1935 Professor Hu Huanyong produced the first isoline map of population density in China. On that map he drew a line from Aihui in Heilongjiang Province to Tengchong in Yunnan Province, dividing China into two parts.
The south-east part holds about half of the country's total area and 96 percent of its population, while the north-west part holds about half of the area and only 4 percent of the population. China's population is concentrated in the east, which has a better natural environment and a more developed economy. This imbalance is the result of the long-term development of nature, environment, society and economy.

Population change depends on natural increase and migration. Natural increase differs notably between regions: in general, the rate is higher in economically less developed regions and lower in economically developed ones. Migration flows from less developed areas to developed areas.

Recent research on population distribution rules has relied mainly on qualitative analysis of macro-level population data. The shortcomings of this approach are a weak ability to process large numbers of population indices, low processing speed, strong subjectivity, no concept of spatial position, and an inability to integrate large amounts of natural, social and economic data. With the development of knowledge discovery and data mining technology, many studies have mined population distribution rules and described the characteristics of population geography quantitatively. The government holds a huge amount of population data from the census, including number, age, nationality, distribution, composition, quality, etc. Important knowledge about population spatial distribution can be obtained from these data through data mining.
These methods will improve the application of population data and so provide reliable information for the scientific formulation of mid- and long-term programs for national economic and social development, for the integrated arrangement of the material and cultural life of the population, and for the coordinated development of population, economy, resources and environment.

3 ATTRIBUTE-ORIENTED INDUCTION

Attribute-Oriented Induction (AOI) is a set-oriented generalization technique that produces high-level rules from huge data sets. AOI reduces the input relation to a minimal relation called a prime table, and then to a final rule table, by applying an attribute or rule threshold. The threshold determines how many distinct attribute values or rules remain in the final rule table. For each attribute, AOI uses a concept hierarchy tree and generalizes the attribute by climbing through the levels of that hierarchy. An attribute is generalized when its low-level concepts (e.g. leaf concepts) are replaced by high-level concepts. Database values are stored as leaf concepts in the tree.

The AOI method involves three primitives that specify the learning task: collection of the initial task-relevant data (data collection), use of background knowledge during mining (domain knowledge), and representation of the learning result (rule formation). The fundamental principle of AOI is to generalize the initial relation to a prime relation and then to a final relation, using background knowledge and user-defined thresholds.

Besides the population database, the study also uses geomorphologic data, which comprise 17 classes and 55 sub-classes. The analysis steps are as follows:

1. Preparation of the data, which includes digitizing the geomorphologic map, retrieving the county boundary data from the existing database and establishing the population database.
2. Overlay analysis of the boundary data with the population data.
3. Calculation of the population data in each geomorphologic sub-class, yielding spatially distributed population data by geomorphologic class.
4. Data mining of rules of population spatial distribution by the attribute-oriented approach.

4 EXAMPLE

4.1 Data preparation and processing

Data preparation for mining rules of population spatial distribution is shown in Figure 1.

[Figure 1, flowchart: Geomorphologic map -> digitizing, projection transformation -> geomorphologic region; Map Database (1:250K) -> retrieval -> county boundary; Census Database -> retrieval -> population data; these inputs -> overlay, calculation -> population spatially distributed data]
Figure 1 Data Preparation for Data Mining

The original geomorphologic map has a scale of 1:4,000,000. The map was digitized, edited, projection-transformed and quality-controlled. The county boundaries were extracted from the 1:250,000 spatial database of the National Fundamental Geographic Information System. These data are linked through a common coding system. The population data are then overlaid with the geomorphologic data. A population calculation weighted by the area of each geomorphologic region yields, for every region, the total population and the population with college and above education attainment. The resulting population spatial distribution by geomorphologic region is summarized in Table 1.
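The area-weighted calculation described above can be sketched as follows. The paper does not give its exact formula, so this is only a minimal sketch under one common assumption: population is spread evenly within each county, and each county's population is apportioned to a geomorphologic region in proportion to the share of the county's area that falls inside it. The function name `areal_weighted_density`, the county identifiers and the region codes `R1` and `R2` are hypothetical, and the numbers are made up for illustration.

```python
from collections import defaultdict


def areal_weighted_density(counties, overlaps):
    """Apportion county populations to regions by area share.

    counties: {county_id: (population, county_area_km2)}
    overlaps: iterable of (county_id, region_code, overlap_area_km2),
              one entry per county/region intersection from the overlay.
    Returns {region_code: population density in person/km2}.
    Assumes population is uniformly distributed inside each county.
    """
    region_pop = defaultdict(float)
    region_area = defaultdict(float)
    for county_id, region_code, overlap_km2 in overlaps:
        population, county_area = counties[county_id]
        share = overlap_km2 / county_area  # fraction of the county in this region
        region_pop[region_code] += population * share
        region_area[region_code] += overlap_km2
    return {code: region_pop[code] / region_area[code] for code in region_pop}


# Hypothetical input: county A lies wholly in R1, county B straddles R1 and R2.
counties = {"A": (120_000, 1000.0), "B": (60_000, 2000.0)}
overlaps = [("A", "R1", 1000.0), ("B", "R1", 500.0), ("B", "R2", 1500.0)]

print(areal_weighted_density(counties, overlaps))
# {'R1': 90.0, 'R2': 30.0}
```

The same function could be applied to the count of people with college and above education attainment to obtain the second density attribute.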

Table 1 Data of population spatial distribution according to geomorphologic region
(Code: code of secondary class; PD: population density; ED: density of population with college and above education attainment; both densities in person/km2)

Code  Name of Geomorphologic Region                          PD    ED
1.1   SanJiang Plain                                         78     4
1.2   North-east Plain                                      134     7
1.3   North China Plain                                     588    19
1.4   Jiangsu Plain                                         803    38
2.1   Mountainous Region of East part in North-east China   154     9
2.2   Xinganling Mountainous Region                          27     1
3.1   Low Mountain Hill of Jiaodong                         509    21
3.2   Jiaolai Plain                                         614    28
3.3   Low Mountain Hill of Middle-south in Shandong         644    20
4.1   Hulunbeier Plateau                                     10     0
4.2   Xilingele Plateau                                       8     0
4.3   Hill of South-east Part in Inner Mongolia              26     1
5.1   Plain of Inner Mongolia and Shanxi                     77     2
5.2   Mountainous Region of Yinshan                          72     3
5.3   Middle Mountain and Plateau of Shanxi                 309    14
5.4   Middle and Low Mountain in North Part in Hebei,
      West part in Liaoning and Plateau of Zhaomong         165    10
5.5   Middle Mountain in Gansu and Loess Hill               150     4
6.0   Mountainous Region of Aertai                            5     0
7.1   Basin of Zhungeer                                      17     1
7.2   Plateau of East Zhungeer                                5     0
7.3   Mountainous Region of West Zhungeer                     9     0
7.4   Plain of Tachen                                         9     0
8.1   Mountainous Region in North Tianshan                   26     3
8.2   Mountainous Region and Plain in Middle Tianshan        10     0
8.3   Mountainous Region in South Tianshan                   12     0
9.1   Basin of Talimu                                         9     0
…     …                                                       …     …

4.2 Data mining of rules of population spatial distribution

The data in Table 1 are used to analyse the pattern and regional differences of population spatial distribution in China, in order to discover and demonstrate its general characteristics and rules. Following the Attribute-Oriented Induction method, the data in Table 1 are generalized. Based on the spatial location of each second-class geomorphologic region, the regions are generalized into North East, North, North West, East, Middle and South, and South West regions.
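
As a rough illustration of this generalization step, the sketch below replaces each numeric density by its class (one climb up a two-level concept hierarchy) and then merges identical generalized tuples while counting them. The class boundaries and the region labels are those the paper reports for Table 2, and the six sample rows are taken from Table 1. The names `climb`, `generalize`, `POP_CLASSES`, `EDU_CLASSES` and `ROWS` are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

# Concept hierarchies for the numeric attributes, as (upper bound, class),
# using the class boundaries stated in the paper (person/km2).
POP_CLASSES = [(99, "Low"), (499, "Middle"), (999, "High")]
EDU_CLASSES = [(9, "Low"), (19, "Middle"), (39, "High")]


def climb(value, classes):
    """Replace a leaf value by its higher-level concept."""
    if value < 0:
        raise ValueError(f"negative density: {value}")
    for upper, label in classes:
        if value <= upper:
            return label
    raise ValueError(f"value {value} is above the highest class boundary")


# (code, region, population density, college-educated density);
# densities from Table 1, region labels from Table 2.
ROWS = [
    ("1.1", "North East", 78, 4),
    ("1.2", "North East", 134, 7),
    ("1.3", "North", 588, 19),
    ("1.4", "East", 803, 38),
    ("2.1", "North East", 154, 9),
    ("2.2", "North East", 27, 1),
]


def generalize(rows):
    """Generalize each row, drop the code, and count identical tuples."""
    counts = Counter()
    for _code, region, pop, edu in rows:
        key = (region, climb(pop, POP_CLASSES), climb(edu, EDU_CLASSES))
        counts[key] += 1
    return counts


for rule, count in generalize(ROWS).items():
    print(rule, count)
```

On this sample the four North East rows collapse into two generalized tuples with a count of 2 each. The further merging into combined classes such as "Middle, Low" that appears in the final rule table is a separate step and is not attempted here.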
Population density is divided into three classes: High is 500-999 person/km2, Middle is 100-499 person/km2 and Low is 0-99 person/km2. The density of population with college and above education attainment is also divided into three classes: High is 20-39 person/km2, Middle is 10-19 person/km2 and Low is 0-9 person/km2. The generalized result is shown in Table 2.

Table 2 Generalized result of population spatial distribution according to geomorphologic region
(PD: population density; ED: density of population with college and above education attainment)

Code  Region      Geomorphologic Type                   PD      ED
1.1   North East  Plain                                 Low     Low
1.2   North East  Plain                                 Middle  Low
1.3   North       Plain                                 High    Middle
1.4   East        Plain                                 High    High
2.1   North East  Mountain                              Middle  Low
2.2   North East  Mountain                              Low     Low
3.1   East        Low Mountain and Hill                 High    High
3.2   East        Plain                                 High    High
3.3   East        Low Mountain and Hill                 High    High
4.1   North       Plateau                               Low     Low
4.2   North       Plateau                               Low     Low
4.3   North       Hill                                  Low     Low
5.1   North       Plain and Hill                        Low     Low
5.2   North       Mountain                              Low     Low
5.3   North       Middle Mountain and Plateau           Middle  Middle
5.4   North       Middle and Low Mountain, and Plateau  Middle  Middle
5.5   North West  Middle Mountain and Loess Hill        Middle  Low
6.0   North West  Mountain                              Low     Low
7.1   North West  Basin                                 Low     Low
7.2   North West  Plateau                               Low     Low
7.3   North West  Mountain                              Low     Low
7.4   North West  Plain                                 Low     Low
8.1   North West  Mountain                              Low     Low
8.2   North West  Mountain and Plain Between Mountain   Low     Low
8.3   North West  Mountain                              Low     Low
9.1   North West  Basin                                 Low     Low
…     …           …                                     …       …

After the data in Table 2 are combined and generalized further, the final generalized result is obtained, as shown in Table 3.

Table 3 Final generalized result of population spatial distribution according to geomorphologic region
(PD: population density; ED: density of population with college and above education attainment)

No.  Region         Geomorphologic Type                                 PD           ED           Count
1    North East     -                                                   Middle, Low  Low          4
2    North          -                                                   Middle, Low  Middle, Low  7
3    North West     Mountain, Plateau, Basin                            Low          Low          15
4    North West     Middle Mountain, Loess Hill, Low Mountain and Hill  Middle       Low          3
5    South West     High Mountain, Plateau                              Low          Low          6
6    South West     Low Mountain, Hill, Mountain                        Middle       Low          4
7    East           -                                                   Middle       Middle       4
8    Central South  -                                                   Middle       Low          4

Table 3 presents a set of rules of population spatial distribution. Each record represents one rule, and its count is taken as the support of that rule. For example, record 2 states that in the north of China the population density is at the middle-to-low level, and the density of population with college and above education attainment is also middle-to-low. Records 5 and 6 state that in the high mountain and plateau areas of south-west China both densities are in the low class, whereas in the low mountain, hill and mountain areas of south-west China the population density is in the middle class and the density of population with college and above education attainment is in the low class.

5 CONCLUSIONS

Compared with traditional methods of population data analysis, the AOI method is objective and well suited to execution by computer. AOI is capable of discovering knowledge from large GIS databases. The government holds a huge amount of population data from the census, and important knowledge about population spatial distribution can be obtained from it through data mining. These methods will improve the application of population data and so support better decision-making in economic and social development planning.