Data Mining of Population Distribution Rules:
an Attribute-Oriented Approach
Liu Deqin, Ma Weijun
Chinese Academy of Surveying and Mapping,
Beijing 100039, China
[email protected]
Abstract: Attribute-Oriented Induction (AOI) reduces the search space of large data
sets to produce a minimal rule set. This paper introduces the theory and technology
of data mining based on AOI. Recent population spatial analysis has relied mainly on
qualitative methods. To address this shortcoming, AOI is applied to integrated
population and geomorphologic data to discover rules of population distribution by
geomorphologic region. The theory, technology, process and results are presented.
1 INTRODUCTION
Data mining is the extraction of interesting patterns concealed in large databases.
Attribute-Oriented Induction (AOI) is a set-oriented generalization technique used
to find various types of rules, including association rules, sequential patterns,
classification and summarization. The method integrates learning-from-examples
techniques with database procedures. Census databases store vast amounts of
population data. In the past these data were analyzed mainly by qualitative methods,
which left quantitative questions unanswered. This paper presents an
attribute-oriented induction approach for the discovery of population distribution
rules.
2 POPULATION SPATIAL DISTRIBUTIONS
Population spatial distribution is the state of population distribution in a
geographic space within a certain time frame. It results from the interaction of many
factors, including natural, social, economic and political ones. In 1935 Professor Hu
Huanyong made the first isoline map of population density in China. On this map he
drew a line from Aihui in Heilongjiang Province to Tengchong in Yunnan Province that
divides China into two parts. The south-east part has about half of the total area of
the country and 96 percent of the total population, whereas the north-west part has
about half of the total area and only 4 percent of the total population. The
population of China is concentrated in the east, which has a better natural
environment and a more developed economy. This imbalance is the result of the
long-term development of nature, environment, society and economy. Population change
depends on natural increase and migration. Natural increase differs notably between
regions: generally speaking, the rate is higher in economically less developed
regions and lower in economically developed regions. Migration flows from less
developed areas to developed areas.
Recent research on population distribution rules has been based mainly on
qualitative analysis of macro population data. The shortcomings of this method are a
weak ability to process large numbers of population indices, low processing speed,
strong subjectivity, no concept of spatial position, and an inability to integrate
large amounts of natural, social and economic data. With the development of knowledge
discovery and data mining technology, many studies have been carried out to mine
population distribution rules and to describe the characteristics of population
geography quantitatively. The government holds a huge amount of population data
from the population census, including number, age, nationality, distribution,
composition and quality. Important knowledge of population spatial distribution
can be obtained from these data through data mining. These methods improve the
application of population data and so provide reliable information for the scientific
formulation of mid- and long-term programs for national economic and social
development, for the integrated arrangement of the material and cultural life of the
population, and for the coordinated development of population, economy, resources
and environment.
3 ATTRIBUTE-ORIENTED INDUCTION
Attribute-Oriented Induction (AOI) is a set-oriented generalization technique that
produces high-level rules from huge data sets. AOI reduces the input relation to a
minimal relation called a prime table, and then to a final rule table, by using an
attribute threshold or a rule threshold. These thresholds determine how many distinct
attribute values or rules remain in the final rule table. For each attribute, AOI
uses a concept hierarchy tree and generalizes the attribute by climbing through its
hierarchy levels. An attribute is generalized when its low-level concepts (e.g. leaf
concepts) are replaced by high-level concepts. Database values are stored as leaf
concepts in the tree.
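As an illustration of hierarchy climbing, the following Python sketch generalizes one attribute until the number of distinct values falls to an attribute threshold. The hierarchy contents, the function names `climb` and `generalize_attribute`, and the root concept "China" are assumptions made for this example; they are not taken from the authors' implementation.

```python
# Hypothetical concept hierarchy for a "region" attribute, stored as
# child -> parent. Leaf names are borrowed from Table 1 for illustration;
# the parent assignments and the root "China" are assumptions.
REGION_HIERARCHY = {
    "SanJiang Plain": "North East",
    "North-east Plain": "North East",
    "North China Plain": "North",
    "Hulunbeier Plateau": "North",
    "North East": "China",
    "North": "China",
}


def climb(value, hierarchy):
    """Replace a concept by its parent; a root concept is returned unchanged."""
    return hierarchy.get(value, value)


def generalize_attribute(values, hierarchy, threshold):
    """Climb one level at a time until at most `threshold` distinct values remain."""
    current = list(values)
    while len(set(current)) > threshold:
        lifted = [climb(v, hierarchy) for v in current]
        if lifted == current:  # every value is already a root concept
            break
        current = lifted
    return current


leaves = ["SanJiang Plain", "North-east Plain",
          "North China Plain", "Hulunbeier Plateau"]
print(generalize_attribute(leaves, REGION_HIERARCHY, threshold=2))
```

With a threshold of 2 the four leaf concepts are replaced by their parents, leaving two distinct values; a threshold of 1 forces one more climb to the root.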
The AOI method involves three primitives that specify the learning task: collection
of the initial task-relevant data (data collection), use of background knowledge
during the mining process (domain knowledge), and representation of the learning
result (rule formation). The fundamental principle of AOI is to generalize the
initial relation to a prime relation and then to a final relation using background
knowledge and user-defined thresholds. In addition to the population database, this
study uses geomorphologic data comprising 17 classes and 55 sub-classes. The analysis
steps are as follows:
1. Preparation of the data, which includes digitizing the geomorphologic map,
retrieving the county boundary data from the existing database and
establishing the population database.
2. Overlay analysis of the boundary data with the population data.
3. Calculation of the population in each geomorphologic sub-class to obtain
population data spatially distributed by geomorphologic class.
4. Data mining of rules of population spatial distribution by the
attribute-oriented approach.
4 EXAMPLE
4.1 Data preparation and processing
Data preparation for mining rules of population spatial distribution is shown in
Figure 1.
[Figure 1: flowchart. The geomorphologic map is digitized and projection-transformed
to give the geomorphologic regions; county boundaries are retrieved from the Map
Database (1:250K); population data are retrieved from the Census Database; overlay
and calculation then produce the population spatially distributed data.]
Figure 1 Data Preparation for Data Mining
The scale of the original geomorphologic map is 1:4,000,000. The map was digitized,
edited, projection-transformed and quality-controlled. The county boundaries were
extracted from the 1:250,000 spatial database of the National Fundamental Geographic
Information System. These data are linked through a common coding system. The
population data are then overlaid on the geomorphologic data, and population is
allocated to each geomorphologic region by area weighting. This yields the total
population and the population with college and above education attainment for each
region. Table 1 summarizes population spatial distribution by geomorphologic region.
Table 1 Data of population spatial distribution according to geomorphologic region
Code of Secondary Class | Name of Geomorphologic Region | Population Density (person/km2) | Density of Population with college and above education attainment (person/km2)
1.1 | SanJiang Plain | 78 | 4
1.2 | North-east Plain | 134 | 7
1.3 | North China Plain | 588 | 19
1.4 | Jiansu Plain | 803 | 38
2.1 | Mountainous Region of East part in North-east China | 154 | 9
2.2 | Xinganling Mountainous Region | 27 | 1
3.1 | Low Mountain Hill of Jiaodong | 509 | 21
3.2 | Jiaolai Plain | 614 | 28
3.3 | Low Mountain Hill of Middle-south in Shandong | 644 | 20
4.1 | Hulunbeier Plateau | 10 | 0
4.2 | Xilingele Plateau | 8 | 0
4.3 | Hill of South-east Part in Inner Mongolia | 26 | 1
5.1 | Plain of Inner Mongolia and Shanxi | 77 | 2
5.2 | Mountainous Region of Yinshan | 72 | 3
5.3 | Middle Mountain and Plateau of Shanxi | 309 | 14
5.4 | Middle and Low Mountain in North Part in Hebei, West part in Liaoning and Plateau of Zhaomong | 165 | 10
5.5 | Middle Mountain in Gansu and Loess Hill | 150 | 4
6.0 | Mountainous Region of Aertai | 5 | 0
7.1 | Basin of Zhungeer | 17 | 1
7.2 | Plateau of East Zhungeer | 5 | 0
7.3 | Mountainous Region of West Zhungeer | 9 | 0
7.4 | Plain of Tachen | 9 | 0
8.1 | Mountainous Region in North Tianshan | 26 | 3
8.2 | Mountainous Region and Plain in Middle Tianshan | 10 | 0
8.3 | Mountainous Region in South Tianshan | 12 | 0
9.1 | Basin of Talimu | 9 | 0
… | … | … | …
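The area-weighted calculation described in Section 4.1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that population is spread uniformly within each county, so that a county contributes to a geomorphologic region in proportion to the overlapping area. The function name `region_density`, the input layout and the sample numbers are all hypothetical.

```python
from collections import defaultdict


def region_density(overlaps, county_population, county_area):
    """Return population density (person/km2) per geomorphologic region.

    overlaps: iterable of (county_id, region_code, overlap_area_km2),
              i.e. the pieces produced by an overlay of counties and regions.
    Assumption: population is uniformly distributed inside each county.
    """
    population = defaultdict(float)
    area = defaultdict(float)
    for county, region, overlap_area in overlaps:
        share = overlap_area / county_area[county]  # fraction of the county
        population[region] += county_population[county] * share
        area[region] += overlap_area
    return {region: population[region] / area[region] for region in population}


# Hypothetical sample: two counties split over two regions.
county_population = {"A": 1000, "B": 300}
county_area = {"A": 100.0, "B": 50.0}
overlaps = [("A", "1.1", 60.0), ("A", "1.2", 40.0), ("B", "1.1", 50.0)]
densities = region_density(overlaps, county_population, county_area)
```

The same function can be called a second time with the population holding college and above education attainment to obtain the last column of Table 1.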
4.2 Data mining of rules of population spatial distribution
The data of population spatial distribution by geomorphologic region shown in
Table 1 are used to analyze the patterns and differences of population spatial
distribution in China, in order to discover and demonstrate its general
characteristics and rules. The data in Table 1 are generalized following the
Attribute-Oriented Induction method. From the spatial location of each second-class
geomorphologic region, the region is generalized to one of six regions: North East,
North, North West, East, Middle and South, and South West. Population density is
divided into 3 classes: High represents 500-999 person/km2, Middle represents
100-499 person/km2 and Low represents 0-99 person/km2. The density of population
with college and above education attainment is also divided into 3 classes: High
represents 20-39 person/km2, Middle represents 10-19 person/km2 and Low represents
0-9 person/km2. The generalized result is shown in Table 2.
Table 2 Generalized result of population spatial distribution according to
geomorphologic region
Code | Region | Geomorphologic Type | Population Density | Density of Population with college and above education attainment
1.1 | North East | Plain | Low | Low
1.2 | North East | Plain | Middle | Low
1.3 | North | Plain | High | Middle
1.4 | East | Plain | High | High
2.1 | North East | Mountain | Middle | Low
2.2 | North East | Mountain | Low | Low
3.1 | East | Low Mountain and Hill | High | High
3.2 | East | Plain | High | High
3.3 | East | Low Mountain and Hill | High | High
4.1 | North | Plateau | Low | Low
4.2 | North | Plateau | Low | Low
4.3 | North | Hill | Low | Low
5.1 | North | Plain and Hill | Low | Low
5.2 | North | Mountain | Low | Low
5.3 | North | Middle Mountain and Plateau | Middle | Middle
5.4 | North | Middle and Low Mountain, and Plateau | Middle | Middle
5.5 | North West | Middle Mountain and Loess Hill | Middle | Low
6.0 | North West | Mountain | Low | Low
7.1 | North West | Basin | Low | Low
7.2 | North West | Plateau | Low | Low
7.3 | North West | Mountain | Low | Low
7.4 | North West | Plain | Low | Low
8.1 | North West | Mountain | Low | Low
8.2 | North West | Mountain and Plain Between Mountain | Low | Low
8.3 | North West | Mountain | Low | Low
9.1 | North West | Basin | Low | Low
… | … | … | … | …
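The discretization of the two density attributes from Table 1 into the classes of Table 2 can be sketched as below. The class boundaries come from Section 4.2; reading them as half-open intervals (for example Low as "below 100"), so that non-integer densities are also covered, is an assumption of this sketch, as are the names `classify`, `POPULATION_CLASSES` and `EDUCATION_CLASSES`.

```python
# (label, exclusive upper bound) pairs in ascending order, in person/km2.
# Boundaries follow Section 4.2; the half-open reading is an assumption.
POPULATION_CLASSES = [("Low", 100), ("Middle", 500), ("High", 1000)]
EDUCATION_CLASSES = [("Low", 10), ("Middle", 20), ("High", 40)]


def classify(value, classes):
    """Return the label of the first class whose upper bound exceeds `value`."""
    if value < 0:
        raise ValueError("density cannot be negative")
    for label, upper in classes:
        if value < upper:
            return label
    raise ValueError(f"value {value} lies above the highest class")


# Row 1.3 of Table 1 (North China Plain): 588 and 19 person/km2.
row_1_3 = (classify(588, POPULATION_CLASSES), classify(19, EDUCATION_CLASSES))
print(row_1_3)
```

For row 1.3 this gives the pair (High, Middle), which agrees with the corresponding row of Table 2.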
After the combination and further induction of data from Table 2, the final generalized
result is obtained, which is shown in Table 3.
Table 3 Final generalized result of population spatial distribution according to
geomorphologic region
No. | Region | Geomorphologic Type | Population Density | Density of Population with college and above education attainment | Count
1 | North East | - | Middle, Low | Low | 4
2 | North | - | Middle, Low | Middle, Low | 7
3 | North West | Mountain, Plateau, Basin | Low | Low | 15
4 | North West | Middle Mountain, Loess Hill, Low Mountain and Hill | Middle | Low | 3
5 | South West | High Mountain, Plateau | Low | Low | 6
6 | South West | Low Mountain, Hill, Mountain | Middle | Low | 4
7 | East | - | Middle | Middle | 4
8 | Central South | - | Middle | Low | 4
Table 3 shows a set of rules of population spatial distribution. Each record in the
table represents one rule, and its count is taken as the support of the rule. For
example, record 2 states that in the north of China the population density is at the
middle-to-low level, and the density of population with college and above education
attainment is also in the middle-to-low class. Records 5 and 6 state that in the high
mountain and plateau areas of south-west China both the population density and the
density of population with college and above education attainment are in the low
class, whereas in the low mountain, hill and mountainous regions of south-west China
the population density is in the middle class and the density of population with
college and above education attainment is in the low class.
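One possible way to reproduce the combination step for a single region is sketched below: the geomorphologic type attribute is dropped, the remaining rows are grouped by region, and the number of merged rows is kept as the count. The four input rows are the North East rows of Table 2. The grouping strategy and the function name `merge_by_region` are assumptions; the paper does not state exactly how rows were combined.

```python
from collections import defaultdict

# North East rows copied from Table 2:
# (region, geomorphologic type, population density, education density).
NORTH_EAST_ROWS = [
    ("North East", "Plain", "Low", "Low"),
    ("North East", "Plain", "Middle", "Low"),
    ("North East", "Mountain", "Middle", "Low"),
    ("North East", "Mountain", "Low", "Low"),
]


def merge_by_region(rows):
    """Drop the geomorphologic type and merge the rows of each region.

    Returns, per region, the set of density classes seen, the set of
    education classes seen, and the number of rows merged (the count).
    """
    merged = defaultdict(
        lambda: {"density": set(), "education": set(), "count": 0}
    )
    for region, _geo_type, density, education in rows:
        rule = merged[region]
        rule["density"].add(density)
        rule["education"].add(education)
        rule["count"] += 1
    return dict(merged)


rule = merge_by_region(NORTH_EAST_ROWS)["North East"]
```

For the North East rows this produces the density classes Middle and Low, the education class Low and a count of 4, in agreement with record 1 of Table 3.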
5 CONCLUSIONS
Compared with traditional methods of population data analysis, AOI is an objective
method that is well suited to computer processing. AOI can discover knowledge from
large GIS databases. The government holds a huge amount of population data from the
population census, and important knowledge of population spatial distribution can be
obtained from it through data mining. These methods improve the application of
population data and provide better decision support for economic and social
development planning.