Data Mining for Building & Not Digging
Dr. Kathryn Burn-Thornton
Data Mining Group, Durham University
Tel: +44 191 374 7017  Fax: +44 191 374 2560
[email protected]
http://www.dur.ac.uk/CompSci/personnel/dmg
seugi presentation - K.E.Burn-Thornton

» Introduction to the Data Mining Group
» Problem Domain
» Methodology Employed
» Results
» Conclusions & Future Work

• Algorithm Development
• Application of Algorithms to Domain specific Problems

Data Mining: Networks, Construction, Medical, Pharmaceutical

Analysis of Construction Projects
CG A.B.I.

DMG Project Aim
Determine if Data Mining may be used to determine whether or not construction programs have been successful.

Project Aims
• Phase 1: class
• Phase 2: specific algorithm development (algorithm, tool)

Methodology: Phase 1
• Data Analysis
• Supervised Data Mining

Algorithms
• k-NN - statistical
• C4.5 - machine learning decision tree
• OC1 - machine learning oblique
• CN2 - machine learning rule induction
• RBF - neural networks

Algorithms

Algorithm   Algorithm Class    Classification Method                                       Data Type Appropriateness
k-NN        Statistical        k nearest neighbours + neighbour weighting where required   Numerical
C4.5        Machine Learning   Entropy based + pruning                                     All
CN2         Machine Learning   Rule induction                                              All
OC1         Machine Learning   Oblique                                                     All
RBF         Neural Network     Hidden node                                                 Numerical

Supervised: Training/Testing
• tr/te 9:1
• training based upon ‘expert’ knowledge
• maintaining stratification bias
• 10 fold cross validation

Results: Supervised
• Best Classification Accuracy: OC1
• Fastest: C4.5 & kNN
• No Saturation

Conclusions & Future Work
• Initially Promising
• More work needed - in progress

Data Mining for building but not for digging

K.E. Burn-Thornton
Data Mining Group, Dept. Computer Science, Durham University, South Rd, Durham City, DURHAM DH1 3LE.
Email: [email protected]
Tel: +44 (0) 191 374 7017  Fax: +44 (0) 191 374 2560
http://www.durham.ac.uk/CompSci/personnel/dmg

ABSTRACT

This paper describes the use of Data Mining in an area which has yet to be fully exploited: construction. The main body of the text discusses the methodology employed during, and the results obtained from, a feasibility study investigating the possible use of various classes of Data Mining algorithms to aid the optimization of process flow during building construction. One example algorithm was chosen from each of 5 algorithm classes, and the algorithms were tested on the data using a test-bed (tool), a detailed description of which can be found in [4].

Key Words: Data Mining, Construction, Accurate, Fast.

1 Introduction

The aim of this work was to determine whether Data Mining algorithms can be used to analyze, and classify, data which is routinely gathered in the construction industry, in order to optimize the process flow during construction projects. An optimized process flow is defined as one in which a ‘late penalty’ does not have to be paid by the construction company. This paper provides a brief introduction to Data Mining; the main body of the text then discusses the data set fusion, the investigation methodology, and the results of this initial investigation.
Conclusions are then drawn with regard to the accuracy of construction project outcome classification by each algorithm and the potential use of the algorithms in the construction industry. Future work is also discussed.

2 Data Mining

Data Mining enables vast amounts of data to be mined in order to find valid, novel, potentially useful and ultimately understandable patterns. It is still a rapidly expanding field whose algorithms, techniques and methodologies have yet to be fully exploited in many domains. Initially, Data Mining tools and techniques were targeted towards the most lucrative application domains, such as finance [1-2] and communications [3]. However, they are now being applied to less lucrative, but more exploitable, fields such as medical diagnosis, where data is mined for one or more key indicators [4-6]. Another domain with great potential for the application of Data Mining algorithms is construction, particularly the optimization of process flow.

It is generally believed that each Data Mining algorithm performs most accurately over data sets with certain characteristics: numerical or categorical data, many variables (k-NN [7]) or few variables (SMART [8]), and many classes (ALLOC80 [9]) or few classes (Cal5 [10]). The algorithms are grouped into five classes based upon the approach used in data classification. These classes are statistical, machine learning (both decision tree and rule induction), neural networks, and hybrids. However, some algorithms appear to behave equally well over data of all characteristics.

In this investigation we used a tool into which the algorithms to be used for an investigation may be incorporated, so that they can be applied to the data set(s) being investigated. The Data Mining algorithms chosen were k-NN, C4.5, CN2, RBF, and OC1, as examples of the statistical, machine learning (both decision tree and rule induction), neural network, and hybrid classes.
It is believed that these algorithms perform equally well over data sets of all characteristics.

2.1 k-NN

This algorithm compares an unknown data point (from a new data set) with its k nearest neighbours among previously classified data points. The idea of this method is that the k nearest neighbours of the unknown point are most likely to come from the point's proper population, so the point is assigned the class most common among them. However, it may be necessary to reduce the weight attached to some variables by suitable scaling; at one extreme, variables which do not contribute usefully to the discrimination may be removed completely.

2.2 C4.5

The C4.5 algorithm uses an entropy based measure in order to determine the quality of the tests available. However, this measure alone would favour tests with many outcomes, so C4.5 also uses a modified measure (the gain ratio) to ensure that it is not biased towards such tests. The advantage that this algorithm has over other members of the same class is that it supports estimated error-based pruning, which ensures that performance is not reduced by overfitting.

2.3 CN2

One key advantage that this algorithm has over similar methods in its class is its ability to cope with noise and ‘other complications’ in the data. During its search for complexes, CN2 does not automatically remove from consideration a candidate which is found to include one or more negative examples. Instead, it retains a set of complexes in its search, each of which is evaluated statistically as covering a large number of examples of a given class and few of other classes. The manner in which CN2 conducts a search is general-to-specific. Each trial specialization step takes the form of either adding a new conjunctive term or removing a disjunctive term. Once CN2 has found a good complex, the algorithm removes the examples it covers from the training set and adds ‘if <complex> then predict <class>’ to the end of the rule list.
The process terminates for each given class when no more acceptable complexes can be added to the list.

2.4 RBF

The RBF network consists of a layer of units which perform linear or non-linear functions of the attributes, followed by a layer of weighted connections to nodes whose outputs have the same form as the target vectors. It has a structure similar to the Multilayer perceptron (MLP), except that each node of the hidden layer computes an arbitrary function of the inputs and the transfer function of each output node is the trivial identity function. The hidden layer has parameters appropriate for whatever functions are being used, such as Gaussian widths and positions.

The main advantages that this algorithm has over other neural net algorithms are a linear training rule, which applies once the locations in attribute space of the non-linear functions have been determined, and an underlying model built from localised functions in the attribute space, rather than the long-range functions occurring in other models. The linear learning rule avoids problems associated with local minima; in particular, it provides an enhanced ability to make statements about the probabilistic interpretation of the outputs.

2.5 OC1

OC1 is a machine learning, decision tree algorithm. Unlike C4.5, which tests a single attribute at each decision, OC1 places its decision boundaries using linear combinations of attributes (termed oblique decisions). It therefore requires all attributes to be numeric.

3 Methodology

Data has been collected, during the past few years, as a by-product of various construction projects. The many data sets assembled during these projects were ‘fused’ into a single data set, which was analyzed during the feasibility investigation. Once ‘fused’, the data was reduced to the variables recorded during every project; variables which were not common to all projects were not used in this feasibility investigation.
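The fusion step described above can be illustrated with a minimal Python sketch. It assumes each project is held as a list of records (dictionaries) sharing one set of variable names; the function names, variable names and values are hypothetical and are not taken from the study's data or tool.

```python
from typing import Dict, List

Record = Dict[str, float]


def common_variables(projects: List[List[Record]]) -> List[str]:
    """Return, in sorted order, the variable names recorded in every project.

    Assumption: all records within one project share the same keys, so the
    first record of each project describes that project's variables.
    """
    key_sets = [set(records[0].keys()) for records in projects]
    return sorted(set.intersection(*key_sets))


def fuse(projects: List[List[Record]]) -> List[Record]:
    """Concatenate all project records, keeping only the common variables."""
    names = common_variables(projects)
    return [{n: rec[n] for n in names} for records in projects for rec in records]


# Hypothetical example data: two projects that share only two variables.
project_a = [{"duration_weeks": 40.0, "budget": 1.2, "crew_size": 30.0}]
project_b = [{"duration_weeks": 55.0, "budget": 2.0, "site_area": 900.0}]

fused = fuse([project_a, project_b])
print(common_variables([project_a, project_b]))
print(fused)
```

In this sketch `crew_size` and `site_area` are dropped because each appears in only one project, which mirrors the exclusion of the ‘non-common’ variables.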
The ‘fused’, common-variable data was then classified by an ‘expert’. Note: No public details will be made available regarding the identity of the construction company providing the data.

The ‘expert’ classified the projects, and sub-project constituents, into two outcome classes. Class 1 was project success, i.e. no ‘late penalty’ was paid by the construction company. Class 2 was project failure, i.e. a ‘late penalty’ was paid by the construction company.

The ‘fused’ data set was split in the ratio 9:1, so that the Data Mining algorithms could be trained on the data of the larger sub-set and then tested on the data of the smaller sub-set. The training data was used to train all five of the algorithms. After training, the algorithms were tested on the accuracy of their project outcome classification.

4 Results

The initial results showed that OC1, from the machine learning algorithm class, appeared to be the most accurate at project outcome determination. This result might be expected because machine learning algorithms are best at the analysis of data with few classes and many variables, and the construction data is such data.

The k-NN and C4.5 algorithms required less training time to reach project outcome determination than the other three algorithms. However, these two methods were less accurate than the more time-consuming OC1. In this specific application, accuracy of classification is far more important than speed. Therefore the k-NN and C4.5 algorithms would appear to be less suited to this task than OC1.

5 Discussion

The ‘non-common’ variables were excluded from the investigation of project outcome classification because some of the algorithms used in the trial ‘do not handle missing values too well’. This variable exclusion may have had an effect on the classification outcome. Future work which will investigate this is described in the next section.
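As an illustration of the evaluation protocol in Section 3, the sketch below performs a class-stratified 9:1 split and scores a plain k-NN classifier on the held-out tenth. It is a minimal sketch on synthetic, made-up data; the function names, the choice of k = 3 and the Euclidean distance are assumptions, not details of the tool used in the study.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.1, seed=0):
    """Return (train, test) index lists that keep each class's share in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training points nearest to x (Euclidean)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


# Synthetic data (hypothetical): class 1 = success, class 2 = 'late penalty' paid.
rng = random.Random(1)
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
rows += [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(40)]
labels = [1] * 60 + [2] * 40

train_idx, test_idx = stratified_split(labels)
train_x = [rows[i] for i in train_idx]
train_y = [labels[i] for i in train_idx]

correct = sum(knn_predict(train_x, train_y, rows[i]) == labels[i] for i in test_idx)
accuracy = correct / len(test_idx)
print(f"train={len(train_idx)} test={len(test_idx)} accuracy={accuracy:.2f}")
```

Splitting within each class, rather than over the whole data set, keeps the proportion of successful and failed projects the same in the training and test sub-sets.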
6 Conclusions & Future Work

This work has shown the potential for the use of Data Mining algorithms to determine project outcome in the construction industry, an issue which is of paramount importance to the companies involved in tendering for construction contracts.

Future work, already in progress, will investigate the effect of the ‘excluded’ variables on the classification of project outcomes. Other work will investigate the filling in of missing values, as well as the use of other experts to classify the projects into more than 2 classes.

References

[1] Petersen J., 'Business Applications of Statistics for Data Mining', UNICOM Seminar on Data Mining, July 1995, London, pp. 157-165.
[2] Leon M. & Valamudi P., 'Data Warehouse vendors do Data Mining', Info World, 1996, vol. 18, pp. 38-42.
[3] Burn-Thornton K.E., Cattrall D. & Simpson A.D., 'Polymorphic Functions for Data Mining in ATM Networks', 4th IFIP Conference on ATM Networks, 4th IFIP ’96, July 1996, Ilkley, UK.
[4] Burn-Thornton K.E. & Edenbrandt L., 'Myocardial Infarction - Pinpointing the Key Indicators using Data Mining', Computers and Medicine, August 1998.
[5] Bounds D., Lloyd P. & Mathew B., 'A comparison of neural networks and other pattern recognition approaches to the diagnosis of low back disorders', Neural Networks, 1990, vol. 3, pp. 583-591.
[6] Waddell G., 'A new clinical model for the treatment of lower back-pain', Spine, 1998, vol. 12, pp. 632-644.
[7] Enas G.G. & Choi S.C., 'Choice of the smoothing parameter and efficiency of the k-nn neighbour classification', Comput. Math. Applic., 1986, vol. 12A, pp. 235-244.
[8] Friedman J., 'Regularised discriminant analysis', J. Amer. Statist. Assoc., 1989, vol. 84, pp. 165-175.
[9] Kendall M.G., Stuart A. & Ord J.K., 'The advanced theory of statistics', 1984, vol. 3, 4th ed., Griffin, London.
[10] Quinlan J.R., 'Simplifying decision trees', Int. J. Man-Machine Studies, 1987, vol. 27, pp. 221-234.