Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Integration of GIS and Data Mining Technology to Enhance the Pavement Management Decision Making Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved. Guoqing Zhou1; Linbing Wang2; Dong Wang3; and Scott Reichle4 Abstract: This paper presents a research effort undertaken to explore the applicability of data mining and knowledge discovery 共DMKD兲 in combination with Geographic Information System 共GIS兲 technology to pavement management to better decide maintenance strategies, set rehabilitation priorities, and make investment decisions. The main objective of the research is to utilize data mining techniques to find pertinent information hidden within the pavement database. Mining algorithm C5.0, including decision trees and association rules, has been used in this analysis. The selected rules have been used to predict the maintenance and rehabilitation strategy of road segments. A pavement database covering four counties within the state of North Carolina, which was provided by North Carolina DOT 共NCDOT兲, has been used to test this method. A comparison was conducted in this paper for the decisions related to a rehabilitation strategy proposed by the NCDOT to the proposed methodology presented in this paper. From the experimental results, it was found that the rehabilitation strategy derived by this paper is different from that proposed by the NCDOT. After combining with the AIRA Data Mining method, seven final rules are defined. Using these final rules, the maps of several pavement rehabilitation strategies are created. When their numbers and locations are compared with ones made by engineers at the Institute for Transportation Research and Education 共ITRE兲 at North Carolina State University, it has been found that error for the number and the location are various for the different rehabilitation strategies. With the pilot experiment in the project, it can be concluded: 共1兲 use of the DMKD method for the decision of road maintenance and rehabilitation can greatly increase the speed of decision making, thus largely saving time and money, and shortening the project period; 共2兲 the DMKD technology can make consistent decisions about road maintenance and rehabilitation if the road conditions are similar, i.e., interference from human factors is less significant; 共3兲 integration of the DMKD and GIS technologies provides a pavement management system with the capabilities to graphically display treatment decisions against distresses; and 共4兲 the decisions related to pavement rehabilitation made by the DMKD technology is not completely consistent with that made by ITRE, thereby, the postprocessing for verification and refinement is necessary. DOI: 10.1061/共ASCE兲TE.1943-5436.0000092 CE Database subject headings: Pavement management; Data collection; Geographic information systems; Decision making. Author keywords: Pavement management; Data mining; GIS; Pavement distress. Introduction A pavement management system 共PMS兲 is a valuable tool and one of the critical elements of the highway transportation infrastructure 共Tsai et al. 2004; Kulkarni and Miller 2003兲. The earliest PMS concept can be traced back to the 1960s. With rapid increase 1 Dept. of Civil Engineering and Technology, Old Dominion Univ., Kaufman Hall, Rm. 214, Norfolk, VA 23529; and, Virginia Tech Transportation Institute 共VTTI兲, Dept. of Civil and Environmental Engineering, Virginia Polytechnic Institute and State Univ., Blacksburg, VA 24061 共corresponding author兲. E-mail: [email protected] 2 Virginia Tech Transportation Institute 共VTTI兲, Dept. of Civil and Environmental Engineering, Virginia Polytechnic Institute and State Univ., Blacksburg, VA 24061. E-mail: [email protected] 3 Virginia Tech Transportation Institute 共VTTI兲, Dept. of Civil and Environmental Engineering, Virginia Polytechnic Institute and State Univ., Blacksburg, VA 24061. E-mail: [email protected] 4 Dept. of Civil Engineering and Technology, Old Dominion Univ., Kaufman Hall, Rm. 214, Norfolk, VA 23529. E-mail: [email protected] Note. This manuscript was submitted on July 25, 2008; approved on July 31, 2009; published online on August 3, 2009. Discussion period open until September 1, 2010; separate discussions must be submitted for individual papers. This paper is part of the Journal of Transportation Engineering, Vol. 136, No. 4, April 1, 2010. ©ASCE, ISSN 0733-947X/ 2010/4-332–341/$25.00. of advanced information technology, many investigators have successfully integrated the Geographic Information System 共GIS兲 into PMS for storing, retrieving, analyzing, and reporting information needed to support pavement-related decision making. Such an integration system is thus called G-PMS 共Lee et al. 1996兲. The main characteristic of a GIS system is that it links data/information to its geographical location 共e.g., latitude/ longitude or state plane coordinates兲 instead of the milepost or reference-point system traditionally used in transportation. Moreover, the GIS can describe and analyze the topological relationship of the real world using the topological data structure and model 共Goulias 2002; Lee et al. 1996兲. GIS technology is also capable of rapidly retrieving data from a database and can automatically generate customized maps to meet specific needs such as identifying maintenance locations. Therefore, a G-PMS can be enhanced with features and functionality by using a GIS to perform pavement management operations, create maps of pavement condition, provide cost analysis for the recommended maintenance strategies, and long-term pavement budget programming. With the increasing amount of pavement data collected, updated and exchanged due to deteriorating road conditions, increasing traffic loading, and shrinking funds, many knowledgebased expert systems 共KBESs兲 have been developed to solve various transportation problems 共e.g., Abkowitz et al. 1990; Nassar 2007; Spring and Hummer 1995兲. A comprehensive survey of 332 / JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 J. Transp. Eng. 2010.136:332-341. Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved. KBESs in transportation is summarized and discussed by Cohn and Harris 共1992兲. However, only a few scholars have investigated applying data mining and knowledge discovery 共DMKD兲 to PMSs. For example, Attoh-Okine 共2002, 1997兲 presented application of rough set theory to enhance the decision support of the pavement rehabilitation and maintenance. Prechaverakul and Hadipriono 共1995兲 applied KBES and fuzzy logic for minor rehabilitation projects in Ohio. Wang et al. 共2003兲 discussed the decisionmaking problem of pavement maintenance and rehabilitation. Leu et al. 共2001兲 investigated the applicability of data mining in the prediction of tunnel support stability using an artificial neural network algorithm. Sarasua and Jia 共1995兲 explored an integration of GIS technology with knowledge discovery and expert system for pavement management. Ferreira et al. 共2002兲 explored the application of probabilistic segment-linked pavement management optimization model. Chan et al. 共1994兲 applied the genetic algorithm for road maintenance planning. Soibelman and Kim 共2000兲 discussed the data preparation process for construction knowledge generation through knowledge discovery in databases, as well as construction knowledge generation and dissemination. This paper presents a research effort undertaken to explore the applicability of data mining in combination with GIS technology to pavement management to better decide maintenance strategies, set rehabilitation priorities, and make investment decisions. A brief overview of data mining techniques as applied to pavement management is presented in “Data Mining and Inductive Learning.” Data collected from the Institute for Transportation Research and Education 共ITRE兲 of North Carolina State University and pavement condition rating are presented in “Pavement Performance Measures.” Examples using data mining to reveal unknown patterns and rules in the database are presented in “Data Mining-Based Maintenance and Repair 共M&R兲.” Finally, the findings and conclusions are discussed. Data Mining and Inductive Learning The goal of applying DMKD technology to a PMS is to discover any pertinent information in the pavement management database by creating a decision tree and/or inducing rules. Some information is stored in an obvious location of the database, and thus can be obtained by traditional database query operation, such as maximum and minimum width of the roads. Other knowledge is hidden in a more remote area of the database 共e.g., pavement condition patterns兲, and thus only can be obtained by intelligent technology, such as data mining. Several data mining techniques have been developed over the last few decades by artificial intelligence community. Generally, the data mining techniques can be categorized in four categories, depending on their functionality: 共1兲 classification; 共2兲 clustering; 共3兲 numeric prediction; and 共4兲 association rules 共Michalski 1983兲. • Classification: this technique is designed essentially to generate predictive models by analyzing an existing database to determine categorical divisions or patterns in the database. Thus, this method focuses on identifying the characteristics of the group or class to which each record belongs in the database; • Clustering: the purpose of this data mining technique is to group or classify items that seem to fall naturally together in the database when there is no preidentified classification or group; • Numeric prediction: this is essentially a classification learning technique with the intended outcome of a numeric value 共nu- meric quantity兲 rather than a category 共discrete class兲. Thus, the numeric values are used for prediction; and • Association rule: the purpose of this data mining technique is to find useful associations and/or correlation relationships among large sets of data items. Association rules, expressed by in the form of “if-then” statements, shows attribute value conditions that occur frequently together in a given data set. These rules are computed from the data, unlike the if-then logic rules, and thus are probabilistic in nature. The main differences among the earlier discussed techniques lie in which algorithms and methods will be used to extract knowledge and how the mined knowledge and discovery rules are expressed. The data mining technique used in this paper is an association/inductive learning rule, so that relevant and implicit knowledge, such as road condition distress pattern, can be extracted. Many inductive learning algorithms, which mainly come from the machine learning, have been presented, such as AQ11 共Michalski and Larson 1975兲, AQ15 共Michalski et al. 1986; Hong et al. 1995兲 and AQ19 共Kaufman and Michalski 1999兲, AE1 and AE9 共Hong et al. 1995兲, concept learning system 共CLS兲 共Hunt 1966兲, ID3, C4.5, and C5.0 共Quinlan 1979, 1986, 1993兲, and CN2 共Clark and Niblett 1989兲, and so on. CLS 共Hunt et al. 1966兲 used a heuristic look ahead method to construct decision trees. Quinlan extended CLS method by using information content in the heuristic function, called ID3. ID3 method adopts a strategy, called “divide and conquer,” and selects classification attributes recursively on the basis of information entropy. Quinlan 共1993兲 further developed his method, called C4.5, which only dealt with strings of constant length. C4.5 not only can create a decision tree, but induce equivalent production rules as well, and deals with multiclass problems with continuous attributes. C5.0 is an upgraded version of C4.5. This method requires the training data, which are usually constructed by several tuples, each of which has several attributes. Note that if the records in the database are taken as tuples and fields as attributes, the C5.0 algorithm is easily realized in a pavement database. Thus, the knowledge discovered by C5.0 algorithm is a group of classification rules and a default class, and furthermore, with each rule there is a confidence value 共between 0 and 1兲. Moreover, the C5.0 method can generate a decision tree and associated decision rule. Decision Tree The inductive learning C5.0 method is a top-down induction algorithm that is capable of generating a decision tree model to classify data in a database. The decision tree is capable of partitioning the data at each level based on a particular feature of the data. The processes of C5.0 algorithm are: 共1兲 a decision tree is built by choosing an attribute that best separates the classes in the database and 共2兲 a pruning strategy is used to prune a branch when the introduced error is one standard error of the existing errors. The implementation of the C5.0 algorithm requires recursive iteration to the partitions until partitions only have a single class, or the partition becomes too small. With such recursive iteration, data with higher partitioning ability will be toward the root, and iterative classification occurs at the leaves in a decision tree. At each iteration, the data are fed into a filtering method, which reduces the number of features. In our project, the decision tree categorizes the entire subject according to whether or not they are likely to have hypertension. JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 / 333 J. Transp. Eng. 2010.136:332-341. Association Rule In addition to the decision tree generation, the C5.0 algorithm also induces association rules. An association rule differs from a classification rule because the associate rule can be used to predict any attribute 共not just the group or class兲, and more than one attribute’s value at a time. An associate rule also gives an occurrence relationship among factors. A typical association rule is if-then sentence, i.e., Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved. IF Cause_1, AND/OR Cause_2 THEN Result 共or consequence兲 This rule means “if Cause_1 and/or Cause_2 are true, and then Result 共the association rule兲 would be generated at cases of x% with a confidence of x%.” With such as rule, a piece of knowledge can be discovered from data set. The rules induced by C5.0 algorithm also provide with support, confidence level, and capture using statistical tests to correct the decision tree model 共Chae et al. 2001兲. • Support: this measures the percentages of the training data that support the rule; i.e., how many data generates the left hand side of the rule. The support is the number of cases in which the rule is found and a pattern is defined by the rule. Once this pattern is identified, it can be used to predict the similar patterns in database; • Confidence: the confidence indicates a probability of a certain rule. It can be used to measure the accuracy of rule; and • Capture: this is used to measure the percentage of records that are correctly captured by this rule. For example, if a rule with capture is close to 100%, it means that all observations with this class are closely clustered. In this paper, the initial association rules will be induced by using the C5.0 algorithm. The final rules are obtained by postprocessing of the initial rules. The attributes for deductive learning are the same as that in inductive learning except the class label attribute. The final rules are used to identify the occurrence relationship between pavement rehabilitation and various road distress conditions. Pavement Performance Measures Data Sources In 1983, ITRE of North Carolina State University began working with the Division of Highways of the North Carolina DOT 共NCDOT兲 to develop and implement a PMS for its 60,000 mi of paved state highways. At the request of several municipalities, the NCDOT has made the PMS available for North Carolina municipalities. The ITRE modified the system for municipal streets and has used it in more than 100 municipalities in North and South Carolina. The data sources for this experiment are provided by the ITRE. They conducted a pavement distress survey for several counties in North Carolina in January of 2007 to determine whether pavement rehabilitation was necessary. Field data collection was performed by a two-person crew comprised of a “rater” and a “driver.” The rater performed the visual evaluation/ inspection for each section of roadway and entered all field collected attributes to the corresponding line feature 共by using a customized ArcPad software application installed on a laptop/ pentop computer兲. The driver handled field measurement of pavement width, as necessary. All pavement distress data was identified and quantified by means of visual inspection 共windshield survey performed by trained staff兲. A measuring wheel was used as necessary. The survey assessment was performed following the guidelines provided in the Pavement Condition Rating Manual 共AASHTO 2001, 1999兲. The collected 1,285 records, to be used in this project, were from a network-level survey, covering several county roads including Route U.S.1. The pavement database provided by the ITRE is a spatial-based rational database 共i.e., an ArcGIS software compatible database.兲 In this database, 89 fields including geospatial data 共e.g., X , Y coordinates, central line, width of lane, etc.兲 and pavement condition data 共e.g., cracking, rutting, etc.兲, and traffic data 共e.g., shoulder, lane number, etc.兲, and economic data 共e.g., initial cost, total cost兲 are recorded by engineers. They conducted these surveys by walking or driving along the roads and recording the pertinent information and their corresponding maintenance and repair 共M&R兲 strategy. Then, the data are integrated into the database, as shown in Fig. 1. The first through 19th columns recorded road name, type, class, owner; and the 31st through 43rd columns recorded the pavement condition 共distress兲; the 44th through 50th columns recorded the different types of cost information; others columns included proposed activities and other related information. From the expert’s suggestion in the ITRE, the following nine common types of distresses are considered to assess the necessity for road rehabilitation 共Table 1兲. The rating was determined by visual evaluation/ inspection at each section of roadway. Distress Rating All the analyses and pavement condition evaluations presented in this paper are based on the pavement performance measures. A common acceptable pavement performance measure is pavement condition index 共PCI兲, which was first defined by the U.S. Army. Under the PCI, the pavement condition is related to factors such as structural integrity, structural capacity, roughness, skid resistance, and rate of distress. These factors are quantified in the evaluation worksheet that field inspectors used to assess and express the local pavement condition and severity of damage. Note that inspectors primarily relied on their own judgment to assess the distress conditions. Usually, the PCI is quantified into seven levels, corresponding to from Excellent 共over 85兲 to Failed 0. Table 1 presents eight types of distresses that are evaluated for asphaltic concrete pavements. The severity of distresses in Table 1 is rated in four categories ranging from very slight to very severe. Extent 共or density兲 is also classified in five categories ranging from few 共less than 10%兲 to throughout 共more than 80%兲. The identification and description of types of distress, severity, and density are as follow, respectively. • The road conditions of the alligator cracking are rated as a percentage of the section that falls under the categories of none, light, moderate, and severe. Percentages are shown as 1 = 10%, 2 = 30%, 3 = 60%, up to 10= 100%. The appropriate percentages were placed under none, light, moderate, and severe. These percentages should always add up to 100%; • The severity levels of distresses for block cracking, transverse cracking, bleeding, rutting, utility cut patching, patching deterioration, and raveling are rated four levels: none 共N兲, light 共L兲, moderate 共M兲, and severe 共S兲, respectively; and • The severity levels of ride quality are classified as: average 共L兲, slightly rough 共M兲, and rough 共S兲. 334 / JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 J. Transp. Eng. 2010.136:332-341. Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved. Fig. 1. Spatial-based pavement database provided by NCDOT Potential Rehabilitation Strategies Based on the knowledge obtained from ITRE in the field, the ITRE has classified flexible pavement rehabilitation needs into three main categories according to the type of the problem to be corrected: 共1兲 cracking; 共2兲 surface defect problems; and 共3兲 structural problems. These problems can be treated using crack treatment, surface treatment, and nonstructural overlay 共one- and twocourse overlay兲, respectively. To select an appropriate treatment for rehabilitation and maintenance, seven potential rehabilitation and maintenance strategies have been proposed by the NCDOT 共see Table 2兲. Which treatment strategies will be carried out for a pavement segment is traditionally determined by the ITRE who comprehensively evaluate all types of distresses. This paper seeks to replace this process of decision making for rehabilitation strategies with data mining technology. Thus, one of the eight potential rehabilitation strategies will be determined based upon the association rule in data mining, which will be generated by the data mining technology. Table 1. Nine Common Types of Distresses to Be Used in This Study Distress Alligator cracking 共four types of rates are given兲 Block/transverse cracking 共BK兲 Reflective cracking 共RF兲 Rutting 共RT兲 Raveling 共RV兲 Bleeding 共BL兲 Patching 共PA兲 Utility cut patching Ride quality 共RQ兲 Rating Alligator Alligator Alligator Alligator none 共AN兲 Percentages of 1 = 10%, 2 = 20%, 3 = 30%, up to 10= 100% indicate none, light, moderate, and severe, respectively light 共AL兲 moderate 共AM兲 severe 共AS兲 This indicates the overall condition of the section as follows: • N—none; • L—light; • M—moderate; and • S—severe The same rating as BK’s The same rating as BK’s The same rating as BK’s The same rating as BK’s The same rating as BK’s The same rating as BK’s The condition is designated as follows: • L—average; • M—slightly rough; and • S—rough JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 / 335 J. Transp. Eng. 2010.136:332-341. Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved. Table 2. Potential Rehabilitation Strategies Proposed by Engineers of NCDOT ID Rehabilitation strategies 0 1 2 3 5 6 7 Nothing Crack pouring 共CP兲 Full-depth patch 共FDP兲 1-in. plant mix resurfacing 共PM1兲 2-in. plant mix resurfacing 共PM2兲 Skin patch 共SKP兲 Short overlay 共SO兲 Table 3. Decision Tree Model Tree information Total number of nodes Number of leaf nodes Number of levels Data Mining-Based Maintenance and Repair „M&R… Decision Trees of M&R Implementation of data mining-based maintenance and rehabilitation usually involves the following five basic steps: 共1兲 problem identification; 共2兲 knowledge acquisition; 共3兲 knowledge representation; 共4兲 implementation; and 共5兲 validation and extension. The first step of problem identification has already been discussed, which is to reveal and extract the relevant information hidden in the pavement database to predict potential pavement conditions and plan the rehabilitation. As to the second step, it has been obtained a significant amount of knowledge from the ITRE in the field, such as distress ratings, PCI ratings, and potential rehabilitation strategies. As for the third step, the decision tree and induced rule to be derived by the C5.0 algorithm constitute the knowledge representation. The theory behind the decision tree and inductive rules has already been described earlier 共see Data Mining and Inductive Learning, supra.兲 The experimental results will now be presented in this section, which constitutes the fourth step of implementation. Finally, Step 5, validation, will be discussed next as well. Experiments The CTree for Excel tool 共Saha 2003兲 was used to create the decision tree. This tool is based on the C5.0 algorithm, which lets user build a tree-based classification model. The classification tree can generate the association rules. The basic steps are as follows. Step 1: Load Pavement Database. It is first loaded the pavement database into the data worksheet, in which the observations were placed in rows and the variables were placed in columns. In addition, for each column in the data worksheet, it is chosen an appropriate type, i.e., Omit, Class, Cont, and Cat. The implications of these types are: • Omit= to drop a column from model; • Cat= to treat a column as categorical predictor; • Cont= to treat a column as continuous predictor; and • Output= to treat a column as class variable. Because the type, Cont, in CTree only recognizes the digital number, while the distresses in Table 1 were measured by letter, N, L, M, and S, these letters have to be quantified into 100, 75, 50, and 25. In the CTree tool, the maximum predictor variables are 50, and only a class variable is allowed. Application will treat the class variable as categorical, and each of categorical predictors 共including class variable兲 is limited to 20. After the data are uploaded into software, the class variables and data do not have blank rows and blank columns, which are treated as missing values in the CTree tool. Percentage for misclassified data 72 37 20 On training data On test data 61.2% 60.0% Step 2: Fill-Up Model Inputs. In addition, the tool requires inputting some parameters to optimize the process of decision tree generation. These parameters include: 1. Adjust for number categories of a categorical predictor: while growing the tree, child nodes are created by splitting parent nodes. Which predictor is used for this split is decided by certain criterion. Because this criterion has an inherent bias toward choosing predictors with more categories, input of an adjustment factor will be used to adjust this bias; 2. Minimum node size criterion: while growing the tree, the decision as to whether to stop splitting a node and declare the node as a leaf node will be determined by some criteria which need to be chosen. These criteria are: • Minimum node size: a valid minimum node size is between 0 and 100; • Maximum purity: an effective value is between 0 and 100. The higher the value of this, the larger will be the tree. Stop splitting a node if its purity is 95% or more 共e.g., purity is 90% means兲 Also, stop splitting a node if number of records in that node is 1% or less of total number of records; and • Maximum depth: a valid maximum depth is greater than 1 and less than 20. The higher the value, the larger will be the tree. Stop splitting a node if its depth is 6 or more 共depth of root node is 1; any node’s depth is that its parent’s depth +1兲. 3. Pruning option: this option allows us to determine whether or not to prune the tree when tree is growing, which can help us to study the effect of pruning; and 4. Training and test data: in this research, a subset of data are used to build the model and the rest to study the performance of the model. Also, the tool is required to randomly select the test set at a ratio of 10%. Experimental Results With the aforementioned data input, a decision tree is generated. The parameters of the decision tree are listed in Table 3. As seen from Table 3, the total number of nodes is 72, in which the number of leaf nodes is 37. The details on each node can be obtained from the NodeView sheet, which shows the class distribution on each node. Moreover, the decision tree consists of 20 levels. In this test, the number of training observations achieved is 555, the number of test observations is 510, and the number of predictors is 9. The percentage of misclassified data reaches 61.2% for training data and 60.0% for test data. The number for different types of rehabilitation strategies, including nothing, crack pouring 共CP兲, full-depth patch 共FDP兲, 1-in. plant mix resurfacing 共PM1兲, 2-in. plant mix resurfacing 共PM2兲, skin patch 共SKP兲, and short overlay 共SO兲 in the study area is calculated for training data and test data 共Table 4兲. In addition, the C5.0 algorithm also generates rules. These associated rules discovered rehabilitation strategies that will be carried out if the conditions are met 共see Fig. 2兲. Since only seven rehabilitation strategies in the study area were suggested by the ITRE at the NCDOT, the generated rules by the C5.0 algorithms are clustered manually because of the fact too many rules were generated by the CTree software tool. The method of clustering 336 / JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 J. Transp. Eng. 2010.136:332-341. Table 4. Number for Different Types of Rehabilitation Strategies for the Training Data and Test Data Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved. Rehabilitation strategies Nothing CP FDP PM1 PM2 SKP SO Total Training data Test data 417 12 13 3 2 8 4 459 589 23 3 8 5 3 4 640 Table 5. Support, Confidence, and Capture for Each Generated Rule Rule ID 0 1 2 3 4 5 6 the rules are as follows: merging the similar rules in each leaf and node together. As a result, their support and confidence and capture are merged as well. Finally, seven rules, associated their SUPPORT and confidence and capture, are listed in Table 5. As observed from Table 5, The confidence for rehabilitation strategies, CP and PM2 are 100%, and support for rehabilitation strategies are 100% as well. in addition, the support, confidence, and capture for rehabilitation, FDP, are lowest. This means that the decision made for FDP rehabilitation has a low reliability. M&R Rules Verification The aforementioned generated decision tree theoretically organizes the “hidden” knowledge in a logical order. A determination as to whether the tree is used to provide a useful methodology for selecting a feasible and effective treatment strategy for rehabilitation 共from the seven predetermined strategies兲 will require this prototype of rules to be tested and validated, and then modified or extended, if necessary. In the study area, seven rules have been created to handle operations involved in the spatial knowledge of the PMS. Upon carefully checking the rules, it was determined Classes Support Confidence Capture NO CP SKP FDP PM2 PM1 SO 100.0% 60.7% 60.5% 71.6% 80.2% 81.3% 81.6% 86.7% 100.0% 66.7% 55.6% 100.0% 71.4% 66.7% 93.0% 75.6% 82.5% 66.2% 73.1% 85.3% 73.5% that these rules are not completely correct. Some of rules are redundant. For this reason, the AIRA tool for Excel v1.3.3 is used to verify these rules. This tool is an add-in for MS-Excel and allows user to extract the “hidden information” 共i.e., discover rules兲 right from the spreadsheets for small to midrange database files. With the same data set and the AIRA tool, 41 rules are generated, part of which are listed in Fig. 2. Obviously, many rules will result in misclassification, and thus have to be reduced. To this end, the following schemes are adopted in this research: 1. If the attribute values simultaneously match the condition of the rules induced by both decision tree and AIRA reasoning, the rules will be retained; 2. If attribute values simultaneously match the conditions of several rules, those rules with the maximum confidence will be kept; 3. If attribute values simultaneously match several rules with the same confidence values, those rules with the maximum coverage of learning samples will be kept; and 4. If attribute values do not match any rule, this class of attribute is defined as the rule, nothing treatment. With the schemes proposed earlier for rule reduction, 14 rules were still retained. However, only seven rehabilitation strategies in the study area were suggested by the ITRE. Thus, the following method is suggested to further reduce the number of rules: 1. Reduce the attribute data sets from alligator cracking family through AC = 兵关AN兴,关AL兴,关AM兴,关AS兴其 2. Reduce the attribute data through checking the PCI values. The principles are, a. If AC= M or S reduce other attributes; and b. If RT= S, reduce other attributes. With the aforementioned reduction methods, the seven rules is finally refined 共see Fig. 3兲. Mapping of Data Mining-Based Decision of M&R Fig. 2. Rules induced by AIAR algorithm With the rules induced earlier, the rehabilitation strategies can be predicted and decided for each road segment in the database. In other words, the operation using DMKD only occurs in the database, and thus the results cannot be visualized and displayed on either map or screen. However, the GIS system is capable of rapidly retrieving data from the database and automatically generating customized maps to meet specific needs such as identifying maintenance locations, visualizing spatial and nonspatial data, and linking data/information to its geographical location. Thus, this paper employed ArcGIS software in combination with the earlier induced results to create the decision-making map for maintenance and rehabilitation. The basic operation consists of taking each of the above rules as a logic query in the ArcGIS JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 / 337 J. Transp. Eng. 2010.136:332-341. Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved. Fig. 3. Final rules after verification and postprocessing software, and then queried results are displayed in the ArcGIS layout map. The results are listed in Figs. 4–9, corresponding to each of the preset rehabilitation strategies, respectively. The rehabilitations suggested by the ITRE of North Carolina State University are superimposed with the decisions made under the analysis presented in this paper. As seen from Figs. 4–9, each rehabilitation strategy derived in this paper can be located with its geographical coordinates, and visualized with its spatial, nonspatial data, and different colors. Comparison Analysis and Discussion Comparison Analysis To verify the results of the proposed method for the decision making of road maintenance and rehabilitation, the recommended pavement segment treatments produced in this paper were compared with those suggested by the ITRE at the NCDOT. All pavement treatments derived by this paper and by NCDOT are displayed in Fig. 10. A comparison analysis for both the numbers Fig. 4. Comparison for the CP road rehabilitation made by the proposed method and by NCDOT 共ITRC兲 Fig. 5. Comparison for the FDP road rehabilitation made by the proposed method and by NCDOT 共ITRC兲 and the location of each treatment strategy derived by this paper and by NCDOT in the study area is listed in Table 6. As seen from Table 6, the number of the crack pouring treatment derived by this paper and by NCDOT is the same, i.e., 3, but the locations of the three roads are not the same. Specifically, the location of one road derived by the method presented in this paper is different from the one derived by NCDOT. The number of the full-depth patch 共FDP兲 treatments suggested by NCDOT is 34, but 29 under the method employed in this paper. The difference between two methods is five. Moreover, the location of three road segments for FDP treatment is different for two methods. In the 1-in. plant mix resurfacing 共PM1兲 strategy, NCDOT suggested six roads for PM1 treatment, but this paper indicates seven roads using data mining. Moreover, a road location is different for two methods. For the skin patch 共SKP兲 rehabilitation strategy, 65 roads are suggested by NCDOT for treatment, but 56 roads are identified using data mining. Moreover 13 road locations are different between two methods. For the short overlay rehabilitation strategy, three roads Fig. 6. Comparison for the PM1 road rehabilitation made by the proposed method and by NCDOT 共ITRC兲 338 / JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 J. Transp. Eng. 2010.136:332-341. Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved. Fig. 7. Comparison for the PM2 road rehabilitation made by the proposed method and by NCDOT 共ITRC兲 are suggested for treatment, but five roads are recommended by the methods employed in this paper, of which the locations of two roads are different between the two methods. Fig. 9. Comparison for the SO road rehabilitation made by the proposed method and by NCDOT 共ITRC兲 The decision trees are based on the knowledge acquired from pavement management engineers to be used for rehabilitation strategy selection. A decision tree is used to organize the obtained knowledge in a logical order. Thus, the decision trees can determine a technically feasible rehabilitation strategy for each road segment. Different decision trees can be built to acknowledge changes. For example, the decision trees were based on severity levels of individual distresses in this paper. If the pavement layer thickness and material type are taken as knowledge, or work history, pavement type, and ride data are taken as knowledge for generating decision trees, these decision trees generated by different types of knowledge are different. This means the association rules generated by different knowledge are different as well. On the other hand, from our experiment, the rules induced from the decision tree are probably inexact, thereby verification of the rules is needed. This project, in fact, tested different tools 共e.g., SPSS, ACRT, RSES, CHIAD兲 to endeavor to generate an “optimum” or “ideal” decision tree and rules. Unfortunately, the decision trees and rules generated by these tools are not completely the same. Therefore, postprocessing for verification of the rules is still needed. Through the project, it has been realized that: 1. The decision tree and rules induced by the data mining method are not completely exact, i.e., the DMKD method is not as exact as individual inspection and observation. Therefore, postprocessing for refining the rules is still essential; and 2. The advantages of applying the DMKD method for road maintenance and rehabilitation are: 共a兲 it can largely increase the speed of choosing a rehabilitation strategy, thus largely save time and money, and shorten the project period and 共b兲 the same treatments proposed by the DMKD method in a Fig. 8. Comparison for the SKP road rehabilitation made by the proposed method and by NCDOT 共ITRC兲 Fig. 10. All decisions for the road rehabilitation made by ours and NCDOT Discussion JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 / 339 J. Transp. Eng. 2010.136:332-341. Table 6. Comparison Analysis for Each Treatment Strategy Made by the Writers and NCDOT Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved. ID Proposed treatment strategies 1 Crack pouring 共CP兲 2 Full-depth patch 共FDP兲 3 1-in. plant mix resurfacing 共PM1兲 4 2-in. plant mix resurfacing 共PM2兲 5 Skin patch 共SKP兲 6 Short overlay 共SO兲 From Number NCDOT This paper NCDOT This paper NCDOT This paper NCDOT This paper NCDOT This paper NCDOT This paper 3 3 34 29 6 7 3 4 65 56 3 5 Difference in number Difference in location 0 1 5 3 1 1 1 1 9 13 2 2 mined by DMKD are not exact, and thus, postprocessing for verification and refinement is necessary. road network level should have similar pavement conditions, thus, this method avoids any human factors such as visual distress data by engineers. Acknowledgments Conclusions This paper conducts research and analysis related to applying data mining technology in combination with GIS for pavement management. The main purpose of the research is to utilize data mining techniques to find some useful knowledge hidden in the pavement database. The C5.0 algorithm has been employed to generate decision trees and association rules. The induced rules have been used to predict which maintenance and rehabilitation strategy to be selected for each road segment. A pavement database covering four counties in the state of North Carolina, which are provided by the ITRE at NCDOT, has been used to test the proposed method. The comparison of two decisions for rehabilitation treatment suggested by NCDOT and by the methodology presented in this paper has been conducted. From the experimental results, it was found that the rehabilitation strategies derived by the rules, i.e., C5.0 method, are different from those suggested by NCDOT. After combining other technologies, e.g., AIRA method, and postprocessing, seven rules are finally refined. Using the final rules, mapping for different types of pavement rehabilitation strategies is created using ArcGIS v. 9.1. When compared with the results from NCDOT, the number and location of the suggested road rehabilitations are different. The maximum error for the number of suggested road rehabilitations is nine, and for the location is 13 out of 65 共see Table 6兲. Through this project, it has been concluded that 1. The DMKD technology can make consistent decisions regarding road maintenance and rehabilitation for the same road conditions with less interference from human factors; 2. The DMKD technology can largely increase the speed of decision making because the technology automatically generates a decision tree and rules if the expert knowledge is given. This can save significant time and money for PMS; 3. Integration of the DMKD and GIS can provide the PMS with the capabilities of graphically displaying treatment decisions, visualize the attribute and nonattribute data, and link data and information to the geographical coordinates; and 4. The DMKD method is not quite as intelligent as individual observation. In other words, the decisions and rules deter- The pavement database was provided by Greg Ferrara, GIS Program Manager at the Institute for Transportation Research and Education 共ITRE兲 of North Carolina State University. John Roklevi at the ITRE provided essential help in data delivery, data interpretation, and explanation related the metadata. The writers sincerely thank both of them. The writers also thank the project administrators for granting permission to use their data. References AASHTO. 共1999兲. AASHTO guidelines for pavement management system, AASHTO, Washington, D.C. AASHTO. 共2001兲. Pavement management guide, AASHTO, Washington, D.C. Abkowitz, M., Walsh, S., Hauser, E., and Minor, L. 共1990兲. “Adaptation of geographic information systems to highway management.” J. Transp. Eng., 116共3兲, 310–327. Attoh-Okine, N. O. 共1997兲. “Rough set application to data mining principles in pavement management database.” J. Comput. Civ. Eng., 11共4兲, 231–237. Attoh-Okine, N. O. 共2002兲. “Combining use of rough set and artificial neural networks in doweled-pavement-performance modeling—A hybrid approach.” J. Transp. Eng., 128共3兲, 270–275. Chae, Y. M., Ho, S. H., Cho, K. W., Lee, D. H., and Ji, S. H. 共2001兲. “Data mining approach to policy analysis in a health insurance domain.” Int. J. Med. Inf., 62, 103–111. Chan, W. T., Fwa, T. F., and Tan, C. Y. 共1994兲. “Road maintenance planning using genetic algorithms. I: Formulation.” J. Transp. Eng., 120共5兲, 693–709. Clark, P., and Niblett, T. 共1989兲. “The CN2 induction algorithm.” Mach. Learn., 3, 261–283. Cohn, L. F., and Harris, R. A. 共1992兲. Knowledge-based expert system in transportation. NCHRP synthesis 183, TRB, National Research Council, Washington, D.C. Ferreira, A., Antunes, A., and Picado-Santos, L. 共2002兲. “Probabilistic segment-linked pavement management optimization model.” J. Transp. Eng., 128共6兲, 568–577. Goulias, D. G. 共2002兲. “Management systems and spatial data analysis in transportation and highway engineering.” Proc., Management Infor- 340 / JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 J. Transp. Eng. 2010.136:332-341. Downloaded from ascelibrary.org by MARRIOTT LIB-UNIV OF UT on 11/26/14. Copyright ASCE. For personal use only; all rights reserved. mation Systems, 2002: Incorporating GIS and Remote Sensing, Vol. 2002, Univ. of Illinois at Urbana-Champaign, Urbana, Ill., 321–327. Hong, J., Mozetic, I., and Michalski, R. S. 共1995兲. “AQ15: Incremental learning of attribute-based descriptions from examples, the method and user’s guide.” Rep. Prepared for Intelligent Systems Group, Univ. of Illinois at Urbana-Champaign, Urbana, Ill. Hunt, E. B., Marin, J., and Stone, P. T. 共1966兲. Experiments in induction, Academic, San Diego. Kaufman, K. A., and Michalski, R. S. 共1999兲. “Learning in an inconsistent world: Rule selection in AQ19.” Rep. No. MLI 99-2, George Mason Univ., Fairfax, Va. Kulkarni, R. B., and Miller, R. W. 共2003兲. “Pavement management systems: Past, present, and future.” Transp. Res. Rec., 1853, 65–71. Lee, H. N., Jitprasithsiri, S., Lee, H., and Sorcic, R. G. 共1996兲. “Development of geographic information system-based pavement management system for Salt Lake City.” Transp. Res. Rec., 1524, 16–24. Leu, S.-S., Chen, C.-N., and Chang, S.-L. 共2001兲. “Data mining for tunnel support stability: Neural network approach.” Autom. Constr., 10共4兲, 429–441. Michalski, R. S. 共1983兲. “A theory and methodology of machine learning.” Machine learning: An artificial intelligence approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, eds., Univ. of Illinois at Urbana-Champaign, Urbana, Ill., 83–134. Michalski, R. S., and Larson, J. 共1975兲. “AQVAL/1 共AQ7兲 user’s guide and program description.” Rep. No. 731, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, Urbana, Ill. Michalski, R. S., Mozetic, I., Hong, J., and Lavrac, N. 共1986兲. “The multi-purpose incremental learning system AQ15 and its testing application to three medical domains.” Proc., 5th National Conf. on Artificial Intelligence (AAAI-86), AAAI Press, 1041–1045. Nassar, K. 共2007兲. “Application of data-mining to state transportation agencies, projects databases.” ITcon, 12, 139–149. Prechaverakul, S., and Hadipriono, F. C. 共1995兲. “Using a knowledge based expert system and fuzzy logic for minor rehabilitation projects in Ohio.” Transp. Res. Rec., 1497, 19–26. Quinlan, J. R. 共1979兲. “Discovering rules from large collections of examples: A case study.” Expert systems in the microelectronic age, D. Michie, ed., Edinburgh University Press, Edinburgh, U.K. Quinlan, J. R. 共1986兲. “Induction of decision trees.” Mach. Learn., 1, 81–106. Quinlan, J. R. 共1993兲. C4.5: Programs for machine learning, Morgan Kaufmann, San Mateo, Calif. Saha, A. 共2003兲. “CTree software for Excel.” Classification tree in Excel, 具http://www.geocities.com/adotsaha/典 共Feb. 24, 2003兲. Sarasua, W. A., and Jia, X. 共1995兲. “Framework for integrating GIS-T with KBES: A pavement management system example.” Transp. Res. Rec., 1497, 153–163. Soibelman, L., and Kim, H. 共2000兲. “Generating construction knowledge with knowledge discovery in databases.” Computing in Civil and Building Engineering, 2, 906–913. Spring, G. S., and Hummer, J. 共1995兲. Identification of hazardous highway locations using knowledge-based GIS: A case study.” Transp. Res. Rec., 1497, 83–90. Tsai, Y., Gao, B., and Lai, J. S. 共2004兲. “Multiyear pavementrehabilitation planning enabled by geographic information system: Network analyses linked to projects.” Transp. Res. Rec., 1889, 21–30. Wang, F., Zhang, Z., and Machemehl, R. B. 共2003兲. “Decision-making problem for managing pavement maintenance and rehabilitation projects.” Transp. Res. Rec., 1853, 21–28. JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / APRIL 2010 / 341 J. Transp. Eng. 2010.136:332-341.