Download Integration of GIS and Data Mining Technology to Enhance the

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Integration of GIS and Data Mining Technology to Enhance
the Pavement Management Decision Making
Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved.
Guoqing Zhou1; Linbing Wang2; Dong Wang3; and Scott Reichle4
Abstract: This paper presents a research effort undertaken to explore the applicability of data mining and knowledge discovery 共DMKD兲
in combination with Geographic Information System 共GIS兲 technology to pavement management to better decide maintenance strategies,
set rehabilitation priorities, and make investment decisions. The main objective of the research is to utilize data mining techniques to find
pertinent information hidden within the pavement database. Mining algorithm C5.0, including decision trees and association rules, has
been used in this analysis. The selected rules have been used to predict the maintenance and rehabilitation strategy of road segments. A
pavement database covering four counties within the state of North Carolina, which was provided by North Carolina DOT 共NCDOT兲, has
been used to test this method. A comparison was conducted in this paper for the decisions related to a rehabilitation strategy proposed by
the NCDOT to the proposed methodology presented in this paper. From the experimental results, it was found that the rehabilitation
strategy derived by this paper is different from that proposed by the NCDOT. After combining with the AIRA Data Mining method, seven
final rules are defined. Using these final rules, the maps of several pavement rehabilitation strategies are created. When their numbers and
locations are compared with ones made by engineers at the Institute for Transportation Research and Education 共ITRE兲 at North Carolina
State University, it has been found that error for the number and the location are various for the different rehabilitation strategies. With the
pilot experiment in the project, it can be concluded: 共1兲 use of the DMKD method for the decision of road maintenance and rehabilitation
can greatly increase the speed of decision making, thus largely saving time and money, and shortening the project period; 共2兲 the DMKD
technology can make consistent decisions about road maintenance and rehabilitation if the road conditions are similar, i.e., interference
from human factors is less significant; 共3兲 integration of the DMKD and GIS technologies provides a pavement management system with
the capabilities to graphically display treatment decisions against distresses; and 共4兲 the decisions related to pavement rehabilitation made
by the DMKD technology is not completely consistent with that made by ITRE, thereby, the postprocessing for verification and refinement
is necessary.
DOI: 10.1061/共ASCE兲TE.1943-5436.0000092
CE Database subject headings: Pavement management; Data collection; Geographic information systems; Decision making.
Author keywords: Pavement management; Data mining; GIS; Pavement distress.
Introduction
A pavement management system 共PMS兲 is a valuable tool and
one of the critical elements of the highway transportation infrastructure 共Tsai et al. 2004; Kulkarni and Miller 2003兲. The earliest
PMS concept can be traced back to the 1960s. With rapid increase
1
Dept. of Civil Engineering and Technology, Old Dominion Univ.,
Kaufman Hall, Rm. 214, Norfolk, VA 23529; and, Virginia Tech Transportation Institute 共VTTI兲, Dept. of Civil and Environmental Engineering,
Virginia Polytechnic Institute and State Univ., Blacksburg, VA 24061
共corresponding author兲. E-mail: [email protected]
2
Virginia Tech Transportation Institute 共VTTI兲, Dept. of Civil and
Environmental Engineering, Virginia Polytechnic Institute and State
Univ., Blacksburg, VA 24061. E-mail: [email protected]
3
Virginia Tech Transportation Institute 共VTTI兲, Dept. of Civil and
Environmental Engineering, Virginia Polytechnic Institute and State
Univ., Blacksburg, VA 24061. E-mail: [email protected]
4
Dept. of Civil Engineering and Technology, Old Dominion Univ.,
Kaufman Hall, Rm. 214, Norfolk, VA 23529. E-mail: [email protected]
Note. This manuscript was submitted on July 25, 2008; approved on
July 31, 2009; published online on August 3, 2009. Discussion period
open until September 1, 2010; separate discussions must be submitted for
individual papers. This paper is part of the Journal of Transportation
Engineering, Vol. 136, No. 4, April 1, 2010. ©ASCE, ISSN 0733-947X/
2010/4-332–341/$25.00.
of advanced information technology, many investigators have
successfully integrated the Geographic Information System 共GIS兲
into PMS for storing, retrieving, analyzing, and reporting information needed to support pavement-related decision making.
Such an integration system is thus called G-PMS 共Lee et al.
1996兲. The main characteristic of a GIS system is that it links
data/information to its geographical location 共e.g., latitude/
longitude or state plane coordinates兲 instead of the milepost or
reference-point system traditionally used in transportation. Moreover, the GIS can describe and analyze the topological relationship of the real world using the topological data structure and
model 共Goulias 2002; Lee et al. 1996兲. GIS technology is also
capable of rapidly retrieving data from a database and can automatically generate customized maps to meet specific needs such
as identifying maintenance locations. Therefore, a G-PMS can be
enhanced with features and functionality by using a GIS to perform pavement management operations, create maps of pavement
condition, provide cost analysis for the recommended maintenance strategies, and long-term pavement budget programming.
With the increasing amount of pavement data collected, updated and exchanged due to deteriorating road conditions, increasing traffic loading, and shrinking funds, many knowledgebased expert systems 共KBESs兲 have been developed to solve
various transportation problems 共e.g., Abkowitz et al. 1990; Nassar 2007; Spring and Hummer 1995兲. A comprehensive survey of
332 / JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010
J. Transp. Eng. 2010.136:332-341.
Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved.
KBESs in transportation is summarized and discussed by Cohn
and Harris 共1992兲. However, only a few scholars have investigated applying data mining and knowledge discovery 共DMKD兲 to
PMSs. For example, Attoh-Okine 共2002, 1997兲 presented application of rough set theory to enhance the decision support of the
pavement rehabilitation and maintenance. Prechaverakul and Hadipriono 共1995兲 applied KBES and fuzzy logic for minor rehabilitation projects in Ohio. Wang et al. 共2003兲 discussed the decisionmaking problem of pavement maintenance and rehabilitation. Leu
et al. 共2001兲 investigated the applicability of data mining in the
prediction of tunnel support stability using an artificial neural
network algorithm. Sarasua and Jia 共1995兲 explored an integration
of GIS technology with knowledge discovery and expert system
for pavement management. Ferreira et al. 共2002兲 explored the
application of probabilistic segment-linked pavement management optimization model. Chan et al. 共1994兲 applied the genetic
algorithm for road maintenance planning. Soibelman and Kim
共2000兲 discussed the data preparation process for construction
knowledge generation through knowledge discovery in databases,
as well as construction knowledge generation and dissemination.
This paper presents a research effort undertaken to explore the
applicability of data mining in combination with GIS technology
to pavement management to better decide maintenance strategies,
set rehabilitation priorities, and make investment decisions. A
brief overview of data mining techniques as applied to pavement
management is presented in “Data Mining and Inductive Learning.” Data collected from the Institute for Transportation Research and Education 共ITRE兲 of North Carolina State University
and pavement condition rating are presented in “Pavement Performance Measures.” Examples using data mining to reveal unknown patterns and rules in the database are presented in “Data
Mining-Based Maintenance and Repair 共M&R兲.” Finally, the
findings and conclusions are discussed.
Data Mining and Inductive Learning
The goal of applying DMKD technology to a PMS is to discover
any pertinent information in the pavement management database
by creating a decision tree and/or inducing rules. Some information is stored in an obvious location of the database, and thus can
be obtained by traditional database query operation, such as maximum and minimum width of the roads. Other knowledge is hidden in a more remote area of the database 共e.g., pavement
condition patterns兲, and thus only can be obtained by intelligent
technology, such as data mining. Several data mining techniques
have been developed over the last few decades by artificial intelligence community. Generally, the data mining techniques can be
categorized in four categories, depending on their functionality:
共1兲 classification; 共2兲 clustering; 共3兲 numeric prediction; and 共4兲
association rules 共Michalski 1983兲.
• Classification: this technique is designed essentially to generate predictive models by analyzing an existing database to
determine categorical divisions or patterns in the database.
Thus, this method focuses on identifying the characteristics of
the group or class to which each record belongs in the database;
• Clustering: the purpose of this data mining technique is to
group or classify items that seem to fall naturally together in
the database when there is no preidentified classification or
group;
• Numeric prediction: this is essentially a classification learning
technique with the intended outcome of a numeric value 共nu-
meric quantity兲 rather than a category 共discrete class兲. Thus,
the numeric values are used for prediction; and
• Association rule: the purpose of this data mining technique is
to find useful associations and/or correlation relationships
among large sets of data items. Association rules, expressed by
in the form of “if-then” statements, shows attribute value conditions that occur frequently together in a given data set. These
rules are computed from the data, unlike the if-then logic
rules, and thus are probabilistic in nature.
The main differences among the earlier discussed techniques
lie in which algorithms and methods will be used to extract
knowledge and how the mined knowledge and discovery rules are
expressed. The data mining technique used in this paper is an
association/inductive learning rule, so that relevant and implicit
knowledge, such as road condition distress pattern, can be extracted. Many inductive learning algorithms, which mainly come
from the machine learning, have been presented, such as AQ11
共Michalski and Larson 1975兲, AQ15 共Michalski et al. 1986; Hong
et al. 1995兲 and AQ19 共Kaufman and Michalski 1999兲, AE1 and
AE9 共Hong et al. 1995兲, concept learning system 共CLS兲 共Hunt
1966兲, ID3, C4.5, and C5.0 共Quinlan 1979, 1986, 1993兲, and CN2
共Clark and Niblett 1989兲, and so on. CLS 共Hunt et al. 1966兲 used
a heuristic look ahead method to construct decision trees. Quinlan
extended CLS method by using information content in the heuristic function, called ID3. ID3 method adopts a strategy, called
“divide and conquer,” and selects classification attributes recursively on the basis of information entropy. Quinlan 共1993兲 further
developed his method, called C4.5, which only dealt with strings
of constant length. C4.5 not only can create a decision tree, but
induce equivalent production rules as well, and deals with multiclass problems with continuous attributes. C5.0 is an upgraded
version of C4.5. This method requires the training data, which are
usually constructed by several tuples, each of which has several
attributes. Note that if the records in the database are taken as
tuples and fields as attributes, the C5.0 algorithm is easily realized
in a pavement database. Thus, the knowledge discovered by C5.0
algorithm is a group of classification rules and a default class, and
furthermore, with each rule there is a confidence value 共between 0
and 1兲. Moreover, the C5.0 method can generate a decision tree
and associated decision rule.
Decision Tree
The inductive learning C5.0 method is a top-down induction algorithm that is capable of generating a decision tree model to
classify data in a database. The decision tree is capable of partitioning the data at each level based on a particular feature of the
data. The processes of C5.0 algorithm are: 共1兲 a decision tree is
built by choosing an attribute that best separates the classes in the
database and 共2兲 a pruning strategy is used to prune a branch
when the introduced error is one standard error of the existing
errors. The implementation of the C5.0 algorithm requires recursive iteration to the partitions until partitions only have a single
class, or the partition becomes too small. With such recursive
iteration, data with higher partitioning ability will be toward the
root, and iterative classification occurs at the leaves in a decision
tree. At each iteration, the data are fed into a filtering method,
which reduces the number of features. In our project, the decision
tree categorizes the entire subject according to whether or not
they are likely to have hypertension.
JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 / 333
J. Transp. Eng. 2010.136:332-341.
Association Rule
In addition to the decision tree generation, the C5.0 algorithm
also induces association rules. An association rule differs from a
classification rule because the associate rule can be used to predict any attribute 共not just the group or class兲, and more than one
attribute’s value at a time. An associate rule also gives an occurrence relationship among factors. A typical association rule is
if-then sentence, i.e.,
Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved.
IF Cause_1, AND/OR Cause_2
THEN Result 共or consequence兲
This rule means “if Cause_1 and/or Cause_2 are true, and then
Result 共the association rule兲 would be generated at cases of x%
with a confidence of x%.” With such as rule, a piece of knowledge
can be discovered from data set. The rules induced by C5.0
algorithm also provide with support, confidence level, and capture
using statistical tests to correct the decision tree model 共Chae et al.
2001兲.
• Support: this measures the percentages of the training data that
support the rule; i.e., how many data generates the left hand
side of the rule. The support is the number of cases in which
the rule is found and a pattern is defined by the rule. Once this
pattern is identified, it can be used to predict the similar patterns in database;
• Confidence: the confidence indicates a probability of a certain
rule. It can be used to measure the accuracy of rule; and
• Capture: this is used to measure the percentage of records that
are correctly captured by this rule. For example, if a rule with
capture is close to 100%, it means that all observations with
this class are closely clustered.
In this paper, the initial association rules will be induced by
using the C5.0 algorithm. The final rules are obtained by postprocessing of the initial rules. The attributes for deductive learning
are the same as that in inductive learning except the class label
attribute. The final rules are used to identify the occurrence relationship between pavement rehabilitation and various road distress conditions.
Pavement Performance Measures
Data Sources
In 1983, ITRE of North Carolina State University began working
with the Division of Highways of the North Carolina DOT
共NCDOT兲 to develop and implement a PMS for its 60,000 mi of
paved state highways. At the request of several municipalities, the
NCDOT has made the PMS available for North Carolina municipalities. The ITRE modified the system for municipal streets and
has used it in more than 100 municipalities in North and South
Carolina. The data sources for this experiment are provided by the
ITRE. They conducted a pavement distress survey for several
counties in North Carolina in January of 2007 to determine
whether pavement rehabilitation was necessary. Field data collection was performed by a two-person crew comprised of a “rater”
and a “driver.” The rater performed the visual evaluation/
inspection for each section of roadway and entered all field collected attributes to the corresponding line feature 共by using a
customized ArcPad software application installed on a laptop/
pentop computer兲. The driver handled field measurement of pavement width, as necessary. All pavement distress data was
identified and quantified by means of visual inspection 共windshield survey performed by trained staff兲. A measuring wheel was
used as necessary.
The survey assessment was performed following the guidelines provided in the Pavement Condition Rating Manual
共AASHTO 2001, 1999兲. The collected 1,285 records, to be used
in this project, were from a network-level survey, covering several county roads including Route U.S.1. The pavement database
provided by the ITRE is a spatial-based rational database 共i.e., an
ArcGIS software compatible database.兲 In this database, 89 fields
including geospatial data 共e.g., X , Y coordinates, central line,
width of lane, etc.兲 and pavement condition data 共e.g., cracking,
rutting, etc.兲, and traffic data 共e.g., shoulder, lane number, etc.兲,
and economic data 共e.g., initial cost, total cost兲 are recorded by
engineers. They conducted these surveys by walking or driving
along the roads and recording the pertinent information and their
corresponding maintenance and repair 共M&R兲 strategy. Then, the
data are integrated into the database, as shown in Fig. 1. The first
through 19th columns recorded road name, type, class, owner;
and the 31st through 43rd columns recorded the pavement condition 共distress兲; the 44th through 50th columns recorded the different types of cost information; others columns included proposed
activities and other related information. From the expert’s suggestion in the ITRE, the following nine common types of distresses
are considered to assess the necessity for road rehabilitation
共Table 1兲. The rating was determined by visual evaluation/
inspection at each section of roadway.
Distress Rating
All the analyses and pavement condition evaluations presented in
this paper are based on the pavement performance measures. A
common acceptable pavement performance measure is pavement
condition index 共PCI兲, which was first defined by the U.S. Army.
Under the PCI, the pavement condition is related to factors such
as structural integrity, structural capacity, roughness, skid resistance, and rate of distress. These factors are quantified in the
evaluation worksheet that field inspectors used to assess and express the local pavement condition and severity of damage. Note
that inspectors primarily relied on their own judgment to assess
the distress conditions. Usually, the PCI is quantified into seven
levels, corresponding to from Excellent 共over 85兲 to Failed 0.
Table 1 presents eight types of distresses that are evaluated
for asphaltic concrete pavements. The severity of distresses in
Table 1 is rated in four categories ranging from very slight to very
severe. Extent 共or density兲 is also classified in five categories
ranging from few 共less than 10%兲 to throughout 共more than 80%兲.
The identification and description of types of distress, severity,
and density are as follow, respectively.
• The road conditions of the alligator cracking are rated as a
percentage of the section that falls under the categories of
none, light, moderate, and severe. Percentages are shown as
1 = 10%, 2 = 30%, 3 = 60%, up to 10= 100%. The appropriate
percentages were placed under none, light, moderate, and severe. These percentages should always add up to 100%;
• The severity levels of distresses for block cracking, transverse
cracking, bleeding, rutting, utility cut patching, patching deterioration, and raveling are rated four levels: none 共N兲, light
共L兲, moderate 共M兲, and severe 共S兲, respectively; and
• The severity levels of ride quality are classified as: average
共L兲, slightly rough 共M兲, and rough 共S兲.
334 / JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010
J. Transp. Eng. 2010.136:332-341.
Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved.
Fig. 1. Spatial-based pavement database provided by NCDOT
Potential Rehabilitation Strategies
Based on the knowledge obtained from ITRE in the field, the
ITRE has classified flexible pavement rehabilitation needs into
three main categories according to the type of the problem to be
corrected: 共1兲 cracking; 共2兲 surface defect problems; and 共3兲 structural problems. These problems can be treated using crack treatment, surface treatment, and nonstructural overlay 共one- and twocourse overlay兲, respectively. To select an appropriate treatment
for rehabilitation and maintenance, seven potential rehabilitation
and maintenance strategies have been proposed by the NCDOT
共see Table 2兲. Which treatment strategies will be carried out for a
pavement segment is traditionally determined by the ITRE who
comprehensively evaluate all types of distresses. This paper seeks
to replace this process of decision making for rehabilitation strategies with data mining technology. Thus, one of the eight potential rehabilitation strategies will be determined based upon the
association rule in data mining, which will be generated by the
data mining technology.
Table 1. Nine Common Types of Distresses to Be Used in This Study
Distress
Alligator cracking 共four types of rates are given兲
Block/transverse cracking 共BK兲
Reflective cracking 共RF兲
Rutting 共RT兲
Raveling 共RV兲
Bleeding 共BL兲
Patching 共PA兲
Utility cut patching
Ride quality 共RQ兲
Rating
Alligator
Alligator
Alligator
Alligator
none 共AN兲
Percentages of 1 = 10%, 2 = 20%, 3 = 30%, up to 10= 100%
indicate none, light, moderate, and severe, respectively
light 共AL兲
moderate 共AM兲
severe 共AS兲
This indicates the overall condition of the section as follows:
• N—none;
• L—light;
• M—moderate; and
• S—severe
The same rating as BK’s
The same rating as BK’s
The same rating as BK’s
The same rating as BK’s
The same rating as BK’s
The same rating as BK’s
The condition is designated as follows:
• L—average;
• M—slightly rough; and
• S—rough
JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 / 335
J. Transp. Eng. 2010.136:332-341.
Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved.
Table 2. Potential Rehabilitation Strategies Proposed by Engineers of
NCDOT
ID
Rehabilitation strategies
0
1
2
3
5
6
7
Nothing
Crack pouring 共CP兲
Full-depth patch 共FDP兲
1-in. plant mix resurfacing 共PM1兲
2-in. plant mix resurfacing 共PM2兲
Skin patch 共SKP兲
Short overlay 共SO兲
Table 3. Decision Tree Model
Tree information
Total number of nodes
Number of leaf nodes
Number of levels
Data Mining-Based Maintenance and Repair „M&R…
Decision Trees of M&R
Implementation of data mining-based maintenance and rehabilitation usually involves the following five basic steps: 共1兲 problem
identification; 共2兲 knowledge acquisition; 共3兲 knowledge representation; 共4兲 implementation; and 共5兲 validation and extension.
The first step of problem identification has already been discussed, which is to reveal and extract the relevant information
hidden in the pavement database to predict potential pavement
conditions and plan the rehabilitation. As to the second step, it has
been obtained a significant amount of knowledge from the ITRE
in the field, such as distress ratings, PCI ratings, and potential
rehabilitation strategies. As for the third step, the decision tree
and induced rule to be derived by the C5.0 algorithm constitute
the knowledge representation. The theory behind the decision tree
and inductive rules has already been described earlier 共see Data
Mining and Inductive Learning, supra.兲 The experimental results
will now be presented in this section, which constitutes the fourth
step of implementation. Finally, Step 5, validation, will be discussed next as well.
Experiments
The CTree for Excel tool 共Saha 2003兲 was used to create the
decision tree. This tool is based on the C5.0 algorithm, which lets
user build a tree-based classification model. The classification tree
can generate the association rules. The basic steps are as follows.
Step 1: Load Pavement Database. It is first loaded the pavement database into the data worksheet, in which the observations
were placed in rows and the variables were placed in columns. In
addition, for each column in the data worksheet, it is chosen an
appropriate type, i.e., Omit, Class, Cont, and Cat. The implications of these types are:
• Omit= to drop a column from model;
• Cat= to treat a column as categorical predictor;
• Cont= to treat a column as continuous predictor; and
• Output= to treat a column as class variable.
Because the type, Cont, in CTree only recognizes the digital
number, while the distresses in Table 1 were measured by letter,
N, L, M, and S, these letters have to be quantified into 100, 75,
50, and 25. In the CTree tool, the maximum predictor variables
are 50, and only a class variable is allowed. Application will treat
the class variable as categorical, and each of categorical predictors 共including class variable兲 is limited to 20. After the data are
uploaded into software, the class variables and data do not have
blank rows and blank columns, which are treated as missing values in the CTree tool.
Percentage for misclassified data
72
37
20
On training data
On test data
61.2%
60.0%
Step 2: Fill-Up Model Inputs. In addition, the tool requires
inputting some parameters to optimize the process of decision tree
generation. These parameters include:
1. Adjust for number categories of a categorical predictor:
while growing the tree, child nodes are created by splitting
parent nodes. Which predictor is used for this split is decided
by certain criterion. Because this criterion has an inherent
bias toward choosing predictors with more categories, input
of an adjustment factor will be used to adjust this bias;
2. Minimum node size criterion: while growing the tree, the
decision as to whether to stop splitting a node and declare the
node as a leaf node will be determined by some criteria
which need to be chosen. These criteria are:
• Minimum node size: a valid minimum node size is between 0
and 100;
• Maximum purity: an effective value is between 0 and 100. The
higher the value of this, the larger will be the tree. Stop splitting a node if its purity is 95% or more 共e.g., purity is 90%
means兲 Also, stop splitting a node if number of records in that
node is 1% or less of total number of records; and
• Maximum depth: a valid maximum depth is greater than 1 and
less than 20. The higher the value, the larger will be the tree.
Stop splitting a node if its depth is 6 or more 共depth of root
node is 1; any node’s depth is that its parent’s depth +1兲.
3. Pruning option: this option allows us to determine whether
or not to prune the tree when tree is growing, which can help
us to study the effect of pruning; and
4. Training and test data: in this research, a subset of data are
used to build the model and the rest to study the performance
of the model. Also, the tool is required to randomly select the
test set at a ratio of 10%.
Experimental Results
With the aforementioned data input, a decision tree is generated.
The parameters of the decision tree are listed in Table 3. As seen
from Table 3, the total number of nodes is 72, in which the number of leaf nodes is 37. The details on each node can be obtained
from the NodeView sheet, which shows the class distribution on
each node. Moreover, the decision tree consists of 20 levels. In
this test, the number of training observations achieved is 555, the
number of test observations is 510, and the number of predictors
is 9. The percentage of misclassified data reaches 61.2% for training data and 60.0% for test data. The number for different types
of rehabilitation strategies, including nothing, crack pouring 共CP兲,
full-depth patch 共FDP兲, 1-in. plant mix resurfacing 共PM1兲, 2-in.
plant mix resurfacing 共PM2兲, skin patch 共SKP兲, and short overlay
共SO兲 in the study area is calculated for training data and test data
共Table 4兲.
In addition, the C5.0 algorithm also generates rules. These
associated rules discovered rehabilitation strategies that will be
carried out if the conditions are met 共see Fig. 2兲. Since only seven
rehabilitation strategies in the study area were suggested by the
ITRE at the NCDOT, the generated rules by the C5.0 algorithms
are clustered manually because of the fact too many rules were
generated by the CTree software tool. The method of clustering
336 / JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010
J. Transp. Eng. 2010.136:332-341.
Table 4. Number for Different Types of Rehabilitation Strategies for the
Training Data and Test Data
Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved.
Rehabilitation
strategies
Nothing
CP
FDP
PM1
PM2
SKP
SO
Total
Training
data
Test
data
417
12
13
3
2
8
4
459
589
23
3
8
5
3
4
640
Table 5. Support, Confidence, and Capture for Each Generated Rule
Rule
ID
0
1
2
3
4
5
6
the rules are as follows: merging the similar rules in each leaf and
node together. As a result, their support and confidence and capture are merged as well. Finally, seven rules, associated their
SUPPORT and confidence and capture, are listed in Table 5. As
observed from Table 5, The confidence for rehabilitation strategies, CP and PM2 are 100%, and support for rehabilitation strategies are 100% as well. in addition, the support, confidence, and
capture for rehabilitation, FDP, are lowest. This means that the
decision made for FDP rehabilitation has a low reliability.
M&R Rules Verification
The aforementioned generated decision tree theoretically organizes the “hidden” knowledge in a logical order. A determination
as to whether the tree is used to provide a useful methodology for
selecting a feasible and effective treatment strategy for rehabilitation 共from the seven predetermined strategies兲 will require this
prototype of rules to be tested and validated, and then modified or
extended, if necessary. In the study area, seven rules have been
created to handle operations involved in the spatial knowledge of
the PMS. Upon carefully checking the rules, it was determined
Classes
Support
Confidence
Capture
NO
CP
SKP
FDP
PM2
PM1
SO
100.0%
60.7%
60.5%
71.6%
80.2%
81.3%
81.6%
86.7%
100.0%
66.7%
55.6%
100.0%
71.4%
66.7%
93.0%
75.6%
82.5%
66.2%
73.1%
85.3%
73.5%
that these rules are not completely correct. Some of rules are
redundant. For this reason, the AIRA tool for Excel v1.3.3 is used
to verify these rules. This tool is an add-in for MS-Excel and
allows user to extract the “hidden information” 共i.e., discover
rules兲 right from the spreadsheets for small to midrange database
files. With the same data set and the AIRA tool, 41 rules are
generated, part of which are listed in Fig. 2. Obviously, many
rules will result in misclassification, and thus have to be reduced.
To this end, the following schemes are adopted in this research:
1. If the attribute values simultaneously match the condition of
the rules induced by both decision tree and AIRA reasoning,
the rules will be retained;
2. If attribute values simultaneously match the conditions of
several rules, those rules with the maximum confidence will
be kept;
3. If attribute values simultaneously match several rules with
the same confidence values, those rules with the maximum
coverage of learning samples will be kept; and
4. If attribute values do not match any rule, this class of attribute is defined as the rule, nothing treatment.
With the schemes proposed earlier for rule reduction, 14 rules
were still retained. However, only seven rehabilitation strategies
in the study area were suggested by the ITRE. Thus, the following
method is suggested to further reduce the number of rules:
1. Reduce the attribute data sets from alligator cracking family
through
AC = 兵关AN兴,关AL兴,关AM兴,关AS兴其
2.
Reduce the attribute data through checking the PCI values.
The principles are,
a.
If AC= M or S reduce other attributes; and
b. If RT= S, reduce other attributes.
With the aforementioned reduction methods, the seven rules is
finally refined 共see Fig. 3兲.
Mapping of Data Mining-Based Decision of M&R
Fig. 2. Rules induced by AIAR algorithm
With the rules induced earlier, the rehabilitation strategies can be
predicted and decided for each road segment in the database. In
other words, the operation using DMKD only occurs in the database, and thus the results cannot be visualized and displayed on
either map or screen. However, the GIS system is capable of
rapidly retrieving data from the database and automatically generating customized maps to meet specific needs such as identifying maintenance locations, visualizing spatial and nonspatial data,
and linking data/information to its geographical location. Thus,
this paper employed ArcGIS software in combination with the
earlier induced results to create the decision-making map for
maintenance and rehabilitation. The basic operation consists of
taking each of the above rules as a logic query in the ArcGIS
JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 / 337
J. Transp. Eng. 2010.136:332-341.
Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved.
Fig. 3. Final rules after verification and postprocessing
software, and then queried results are displayed in the ArcGIS
layout map. The results are listed in Figs. 4–9, corresponding to
each of the preset rehabilitation strategies, respectively. The rehabilitations suggested by the ITRE of North Carolina State University are superimposed with the decisions made under the analysis
presented in this paper. As seen from Figs. 4–9, each rehabilitation strategy derived in this paper can be located with its geographical coordinates, and visualized with its spatial, nonspatial
data, and different colors.
Comparison Analysis and Discussion
Comparison Analysis
To verify the results of the proposed method for the decision
making of road maintenance and rehabilitation, the recommended
pavement segment treatments produced in this paper were compared with those suggested by the ITRE at the NCDOT. All pavement treatments derived by this paper and by NCDOT are
displayed in Fig. 10. A comparison analysis for both the numbers
Fig. 4. Comparison for the CP road rehabilitation made by the proposed method and by NCDOT 共ITRC兲
Fig. 5. Comparison for the FDP road rehabilitation made by the
proposed method and by NCDOT 共ITRC兲
and the location of each treatment strategy derived by this paper
and by NCDOT in the study area is listed in Table 6. As seen from
Table 6, the number of the crack pouring treatment derived by
this paper and by NCDOT is the same, i.e., 3, but the locations of
the three roads are not the same. Specifically, the location of one
road derived by the method presented in this paper is different
from the one derived by NCDOT. The number of the full-depth
patch 共FDP兲 treatments suggested by NCDOT is 34, but 29 under
the method employed in this paper. The difference between two
methods is five. Moreover, the location of three road segments for
FDP treatment is different for two methods. In the 1-in. plant mix
resurfacing 共PM1兲 strategy, NCDOT suggested six roads for PM1
treatment, but this paper indicates seven roads using data mining.
Moreover, a road location is different for two methods. For the
skin patch 共SKP兲 rehabilitation strategy, 65 roads are suggested
by NCDOT for treatment, but 56 roads are identified using data
mining. Moreover 13 road locations are different between two
methods. For the short overlay rehabilitation strategy, three roads
Fig. 6. Comparison for the PM1 road rehabilitation made by the
proposed method and by NCDOT 共ITRC兲
338 / JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010
J. Transp. Eng. 2010.136:332-341.
Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved.
Fig. 7. Comparison for the PM2 road rehabilitation made by the
proposed method and by NCDOT 共ITRC兲
are suggested for treatment, but five roads are recommended by
the methods employed in this paper, of which the locations of two
roads are different between the two methods.
Fig. 9. Comparison for the SO road rehabilitation made by the proposed method and by NCDOT 共ITRC兲
The decision trees are based on the knowledge acquired from
pavement management engineers to be used for rehabilitation
strategy selection. A decision tree is used to organize the obtained
knowledge in a logical order. Thus, the decision trees can determine a technically feasible rehabilitation strategy for each road
segment. Different decision trees can be built to acknowledge
changes. For example, the decision trees were based on severity
levels of individual distresses in this paper. If the pavement layer
thickness and material type are taken as knowledge, or work
history, pavement type, and ride data are taken as knowledge
for generating decision trees, these decision trees generated by
different types of knowledge are different. This means the association rules generated by different knowledge are different as
well.
On the other hand, from our experiment, the rules induced
from the decision tree are probably inexact, thereby verification of the rules is needed. This project, in fact, tested different
tools 共e.g., SPSS, ACRT, RSES, CHIAD兲 to endeavor to generate
an “optimum” or “ideal” decision tree and rules. Unfortunately,
the decision trees and rules generated by these tools are not
completely the same. Therefore, postprocessing for verification of
the rules is still needed. Through the project, it has been realized
that:
1. The decision tree and rules induced by the data mining
method are not completely exact, i.e., the DMKD method is
not as exact as individual inspection and observation. Therefore, postprocessing for refining the rules is still essential;
and
2. The advantages of applying the DMKD method for road
maintenance and rehabilitation are: 共a兲 it can largely increase
the speed of choosing a rehabilitation strategy, thus largely
save time and money, and shorten the project period and 共b兲
the same treatments proposed by the DMKD method in a
Fig. 8. Comparison for the SKP road rehabilitation made by the
proposed method and by NCDOT 共ITRC兲
Fig. 10. All decisions for the road rehabilitation made by ours and
NCDOT
Discussion
JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 / 339
J. Transp. Eng. 2010.136:332-341.
Table 6. Comparison Analysis for Each Treatment Strategy Made by the Writers and NCDOT
Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved.
ID
Proposed treatment strategies
1
Crack pouring 共CP兲
2
Full-depth patch 共FDP兲
3
1-in. plant mix resurfacing 共PM1兲
4
2-in. plant mix resurfacing 共PM2兲
5
Skin patch 共SKP兲
6
Short overlay 共SO兲
From
Number
NCDOT
This paper
NCDOT
This paper
NCDOT
This paper
NCDOT
This paper
NCDOT
This paper
NCDOT
This paper
3
3
34
29
6
7
3
4
65
56
3
5
Difference
in number
Difference
in location
0
1
5
3
1
1
1
1
9
13
2
2
mined by DMKD are not exact, and thus, postprocessing for
verification and refinement is necessary.
road network level should have similar pavement conditions,
thus, this method avoids any human factors such as visual
distress data by engineers.
Acknowledgments
Conclusions
This paper conducts research and analysis related to applying data
mining technology in combination with GIS for pavement management. The main purpose of the research is to utilize data mining techniques to find some useful knowledge hidden in the
pavement database. The C5.0 algorithm has been employed to
generate decision trees and association rules. The induced rules
have been used to predict which maintenance and rehabilitation
strategy to be selected for each road segment. A pavement database covering four counties in the state of North Carolina, which
are provided by the ITRE at NCDOT, has been used to test the
proposed method. The comparison of two decisions for rehabilitation treatment suggested by NCDOT and by the methodology
presented in this paper has been conducted. From the experimental results, it was found that the rehabilitation strategies derived
by the rules, i.e., C5.0 method, are different from those suggested
by NCDOT. After combining other technologies, e.g., AIRA
method, and postprocessing, seven rules are finally refined. Using
the final rules, mapping for different types of pavement rehabilitation strategies is created using ArcGIS v. 9.1. When compared
with the results from NCDOT, the number and location of the
suggested road rehabilitations are different. The maximum error
for the number of suggested road rehabilitations is nine, and for
the location is 13 out of 65 共see Table 6兲.
Through this project, it has been concluded that
1. The DMKD technology can make consistent decisions regarding road maintenance and rehabilitation for the same
road conditions with less interference from human factors;
2. The DMKD technology can largely increase the speed of
decision making because the technology automatically generates a decision tree and rules if the expert knowledge is
given. This can save significant time and money for PMS;
3. Integration of the DMKD and GIS can provide the PMS with
the capabilities of graphically displaying treatment decisions,
visualize the attribute and nonattribute data, and link data
and information to the geographical coordinates; and
4. The DMKD method is not quite as intelligent as individual
observation. In other words, the decisions and rules deter-
The pavement database was provided by Greg Ferrara, GIS Program Manager at the Institute for Transportation Research and
Education 共ITRE兲 of North Carolina State University. John Roklevi at the ITRE provided essential help in data delivery, data
interpretation, and explanation related the metadata. The writers
sincerely thank both of them. The writers also thank the project
administrators for granting permission to use their data.
References
AASHTO. 共1999兲. AASHTO guidelines for pavement management system, AASHTO, Washington, D.C.
AASHTO. 共2001兲. Pavement management guide, AASHTO, Washington,
D.C.
Abkowitz, M., Walsh, S., Hauser, E., and Minor, L. 共1990兲. “Adaptation
of geographic information systems to highway management.”
J. Transp. Eng., 116共3兲, 310–327.
Attoh-Okine, N. O. 共1997兲. “Rough set application to data mining principles in pavement management database.” J. Comput. Civ. Eng.,
11共4兲, 231–237.
Attoh-Okine, N. O. 共2002兲. “Combining use of rough set and artificial
neural networks in doweled-pavement-performance modeling—A hybrid approach.” J. Transp. Eng., 128共3兲, 270–275.
Chae, Y. M., Ho, S. H., Cho, K. W., Lee, D. H., and Ji, S. H. 共2001兲.
“Data mining approach to policy analysis in a health insurance domain.” Int. J. Med. Inf., 62, 103–111.
Chan, W. T., Fwa, T. F., and Tan, C. Y. 共1994兲. “Road maintenance
planning using genetic algorithms. I: Formulation.” J. Transp. Eng.,
120共5兲, 693–709.
Clark, P., and Niblett, T. 共1989兲. “The CN2 induction algorithm.” Mach.
Learn., 3, 261–283.
Cohn, L. F., and Harris, R. A. 共1992兲. Knowledge-based expert system in
transportation. NCHRP synthesis 183, TRB, National Research Council, Washington, D.C.
Ferreira, A., Antunes, A., and Picado-Santos, L. 共2002兲. “Probabilistic
segment-linked pavement management optimization model.” J.
Transp. Eng., 128共6兲, 568–577.
Goulias, D. G. 共2002兲. “Management systems and spatial data analysis in
transportation and highway engineering.” Proc., Management Infor-
340 / JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010
J. Transp. Eng. 2010.136:332-341.
Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved.
mation Systems, 2002: Incorporating GIS and Remote Sensing, Vol.
2002, Univ. of Illinois at Urbana-Champaign, Urbana, Ill., 321–327.
Hong, J., Mozetic, I., and Michalski, R. S. 共1995兲. “AQ15: Incremental
learning of attribute-based descriptions from examples, the method
and user’s guide.” Rep. Prepared for Intelligent Systems Group, Univ.
of Illinois at Urbana-Champaign, Urbana, Ill.
Hunt, E. B., Marin, J., and Stone, P. T. 共1966兲. Experiments in induction,
Academic, San Diego.
Kaufman, K. A., and Michalski, R. S. 共1999兲. “Learning in an inconsistent world: Rule selection in AQ19.” Rep. No. MLI 99-2, George
Mason Univ., Fairfax, Va.
Kulkarni, R. B., and Miller, R. W. 共2003兲. “Pavement management systems: Past, present, and future.” Transp. Res. Rec., 1853, 65–71.
Lee, H. N., Jitprasithsiri, S., Lee, H., and Sorcic, R. G. 共1996兲. “Development of geographic information system-based pavement management system for Salt Lake City.” Transp. Res. Rec., 1524, 16–24.
Leu, S.-S., Chen, C.-N., and Chang, S.-L. 共2001兲. “Data mining for tunnel
support stability: Neural network approach.” Autom. Constr., 10共4兲,
429–441.
Michalski, R. S. 共1983兲. “A theory and methodology of machine learning.” Machine learning: An artificial intelligence approach, R. S.
Michalski, J. G. Carbonell, and T. M. Mitchell, eds., Univ. of Illinois
at Urbana-Champaign, Urbana, Ill., 83–134.
Michalski, R. S., and Larson, J. 共1975兲. “AQVAL/1 共AQ7兲 user’s guide
and program description.” Rep. No. 731, Dept. of Computer Science,
Univ. of Illinois at Urbana-Champaign, Urbana, Ill.
Michalski, R. S., Mozetic, I., Hong, J., and Lavrac, N. 共1986兲. “The
multi-purpose incremental learning system AQ15 and its testing application to three medical domains.” Proc., 5th National Conf. on
Artificial Intelligence (AAAI-86), AAAI Press, 1041–1045.
Nassar, K. 共2007兲. “Application of data-mining to state transportation
agencies, projects databases.” ITcon, 12, 139–149.
Prechaverakul, S., and Hadipriono, F. C. 共1995兲. “Using a knowledge
based expert system and fuzzy logic for minor rehabilitation projects
in Ohio.” Transp. Res. Rec., 1497, 19–26.
Quinlan, J. R. 共1979兲. “Discovering rules from large collections of examples: A case study.” Expert systems in the microelectronic age,
D. Michie, ed., Edinburgh University Press, Edinburgh, U.K.
Quinlan, J. R. 共1986兲. “Induction of decision trees.” Mach. Learn., 1,
81–106.
Quinlan, J. R. 共1993兲. C4.5: Programs for machine learning, Morgan
Kaufmann, San Mateo, Calif.
Saha, A. 共2003兲. “CTree software for Excel.” Classification tree in Excel,
具http://www.geocities.com/adotsaha/典 共Feb. 24, 2003兲.
Sarasua, W. A., and Jia, X. 共1995兲. “Framework for integrating GIS-T
with KBES: A pavement management system example.” Transp. Res.
Rec., 1497, 153–163.
Soibelman, L., and Kim, H. 共2000兲. “Generating construction knowledge
with knowledge discovery in databases.” Computing in Civil and
Building Engineering, 2, 906–913.
Spring, G. S., and Hummer, J. 共1995兲. Identification of hazardous highway locations using knowledge-based GIS: A case study.” Transp.
Res. Rec., 1497, 83–90.
Tsai, Y., Gao, B., and Lai, J. S. 共2004兲. “Multiyear pavementrehabilitation planning enabled by geographic information system:
Network analyses linked to projects.” Transp. Res. Rec., 1889, 21–30.
Wang, F., Zhang, Z., and Machemehl, R. B. 共2003兲. “Decision-making
problem for managing pavement maintenance and rehabilitation
projects.” Transp. Res. Rec., 1853, 21–28.
JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 / 341
J. Transp. Eng. 2010.136:332-341.