Data Mining for Building & Not Digging
Dr. Kathryn Burn-Thornton
Data Mining Group,
Durham University
+ 44 191 374 7017
+ 44 191 374 2560
[email protected]
http://www.dur.ac.uk/CompSci/personnel/dmg
seugi presentation - K.E.Burn-Thornton
» Introduction to the Data
Mining Group
» Problem Domain
» Methodology Employed
» Results
» Conclusions & Future Work
Algorithm Development
Application of Algorithms to Domain
specific Problems
Data Mining
• Networks
• Construction
• Medical
• Pharmaceutical
Analysis of Construction Projects
CG
A.B.I.
DMG
Project Aim
Determine if Data Mining may be
used to determine whether or not
construction programs have been
successful.
Project Aims
Phase 1:
• algorithm class
• specific algorithm
Phase 2: development
• algorithm
• tool
Methodology: Phase1
• Data Analysis
• Supervised Data Mining
Algorithms
• k-NN - statistical
• C4.5 - machine learning decision tree
• OC1 - machine learning oblique
• CN2 - machine learning rule induction
• RBF - neural networks
Algorithms

Algorithm  Algorithm Class   Classification Method                  Data Type Appropriateness
k-NN       Statistical       k nearest neighbours + neighbour       Numerical
                             weighting where required
C4.5       Machine Learning  Entropy based + pruning                All
CN2        Machine Learning  Rule induction                         All
OC1        Machine Learning  Oblique                                All
RBF        Neural Network    Hidden node                            Numerical
Supervised: Training/Testing
• training/testing split 9:1
• training based upon ‘expert’ knowledge
• maintaining stratification bias
• 10 fold cross validation
Results: Supervised
• Best Classification Accuracy: OC1
• Fastest: C4.5 & kNN
• No Saturation
Conclusions & Future Work
• Initially Promising
• More work needed
- in progress
Data Mining for building but not for digging
K.E. Burn-Thornton
Data Mining Group,
Dept. Computer Science,
Durham University,
South Rd,
Durham City,
DURHAM DH1 3LE.
Email: [email protected]
Tel: + 44 (0) 191 374 7017 Fax: + 44(0) 191 374 2560
http://www.durham.ac.uk/CompSci/personnel/dmg
ABSTRACT
This paper describes the use of Data Mining in construction, an area in which it has yet to be
fully exploited. The main body of the text discusses the methodology employed during, and the
results obtained from, a feasibility study investigating the possible use of various classes of Data
Mining algorithms to aid the optimization of process flow during building construction.
The Data Mining algorithms chosen for this task were example algorithms from five algorithm
classes. They were tested on the data using a test-bed (tool), a detailed description of which can
be found in [4].
Key Words: Data Mining, Construction, Accurate, Fast.
1 Introduction
The aim of this work was to determine whether Data Mining algorithms can be used to analyze,
and classify, data which is routinely gathered in the construction industry, in order to attempt to
optimize the process flow during construction projects. An optimized process flow is defined
here as one in which the construction company does not have to pay a ‘late penalty’.
This paper provides a brief introduction to Data Mining with the main body of this text
discussing the data set fusion, investigation methodology, and the results of this initial
investigation. Conclusions are then drawn with regard to accuracy of construction project
outcome classification by each algorithm and their potential use in the construction industry.
Future work is also discussed.
2 Data Mining
Data Mining enables vast amounts of data to be mined in order to find valid, novel, potentially
useful and ultimately understandable patterns. It is still a rapidly expanding field whose
algorithms, techniques and methodologies have yet to be fully exploited in many domains.
Initially, Data Mining tools and techniques were targeted towards the most lucrative application
domains, such as finance [1-2] and communications [3]. However, they are now being applied to
less lucrative, but more exploitable, fields such as medical diagnosis, where data is mined for one
or more key indicators [4-6]. Another domain with great potential for the application of Data
Mining algorithms is construction, particularly the optimization of process flow.
It is generally believed that each Data Mining algorithm performs most accurately over data sets
with certain characteristics, such as numerical or categorical data, many (k-NN [7]) or few
(SMART [8]) variables, and many (ALLOC80 [9]) or few (Cal5 [10]) classes. The algorithms are
grouped into five classes based upon the approach used in data classification: statistical,
machine learning (both decision tree and rule induction), neural networks, and hybrids.
However, some algorithms appear to behave equally well over data of all characteristics.
In this investigation we used a tool into which the algorithms to be investigated may be
incorporated so that they can be applied to the data set(s) being investigated. The Data Mining
algorithms chosen were k-NN, C4.5, CN2, RBF, and OC1, as examples of the statistical,
machine learning (both decision tree and rule induction), neural network, and hybrid classes. It
is believed that these algorithms perform equally well over all types of data set characteristics.
2.1 k-NN
This algorithm compares an unknown data point (from a new data set) with its k nearest
neighbours among previously classified data points. The idea of this method is that the k nearest
neighbours of the unknown point are most likely to come from the point's proper population.
However, it may be necessary to reduce the weight attached to some variables by suitable scaling;
at one extreme, variables may be removed completely if they do not contribute usefully to the
discrimination.
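The voting and attribute-scaling ideas described above can be illustrated with a minimal sketch. It is not the implementation used in the test-bed: the function name knn_predict, the weighted Euclidean distance, and the toy data (a hypothetical duration attribute plus an uninformative noise attribute) are all assumptions made for illustration.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3, weights=None):
    """Classify `query` by majority vote among its k nearest training points.

    train   -- list of (feature_vector, label) pairs
    weights -- optional per-attribute scale factors; a weight of 0 removes
               the attribute from the distance entirely
    """
    if weights is None:
        weights = [1.0] * len(query)

    def distance(a, b):
        # Weighted Euclidean distance between two feature vectors.
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (duration, noise attribute) -> outcome class.
train = [((1.0, 90.0), 1), ((2.0, 10.0), 1), ((3.0, 50.0), 1),
         ((8.0, 12.0), 2), ((9.0, 88.0), 2), ((10.0, 52.0), 2)]
query = (2.5, 89.0)

unweighted = knn_predict(train, query, k=3)
noise_removed = knn_predict(train, query, k=3, weights=[1.0, 0.0])
```

In this toy data the second attribute carries no class information but dominates the unweighted distance, so the vote changes once its weight is set to zero.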
2.2 C4.5
The C4.5 algorithm uses an entropy based measure (information gain) to determine the quality of
the tests available. However, this measure alone would favour tests with many outcomes, so C4.5
also uses a modified measure (the gain ratio) to ensure that it is not biased towards such tests.
The advantage that this algorithm has over other members of the same class is that it supports
estimated error-based pruning, which limits the loss of performance due to overfitting.
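The bias towards many-outcome tests can be seen in a small worked example. The sketch below computes information gain and gain ratio for a categorical attribute; the function names and the four-example data set are assumptions for illustration, and the sketch covers only the split measure, not tree construction or pruning.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of the value distribution in `items`."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_and_ratio(values, labels):
    """Return (information gain, gain ratio) for a test on a categorical attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected class entropy after the split, weighted by subset size.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition itself, large for many outcomes.
    split_info = entropy(values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

labels = [1, 1, 2, 2]
many_outcomes = ["a", "b", "c", "d"]   # one outcome per example
two_outcomes = ["x", "x", "y", "y"]    # a genuinely informative binary test

gain_many, ratio_many = gain_and_ratio(many_outcomes, labels)
gain_two, ratio_two = gain_and_ratio(two_outcomes, labels)
```

Both tests have the same information gain here, but the gain ratio is lower for the test that splits every example into its own outcome.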
2.3 CN2
One key advantage that this algorithm has over similar methods in its class is its ability to cope
with 'other complications' in the data. During its search for complexes, CN2 does not
automatically remove from consideration candidates which cover one or more negative
examples. Instead it retains a set of complexes in its search, each of which is evaluated
statistically as covering a large number of examples of a given class and few of other classes.
CN2 conducts its search in a general-to-specific manner. Each trial specialization step takes the
form of either adding a new conjunctive term or removing a disjunctive term. Once CN2 has
found a good complex, the algorithm removes the examples it covers from the training set and
adds the rule 'if <complex> then predict <class>' to the end of the rule list. The process
terminates for each given class when no more acceptable complexes can be added to the list.
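The covering loop described above (find a complex, remove the examples it covers, append a rule) can be sketched as follows. This is a deliberately simplified illustration and not CN2 itself: it keeps a single candidate rather than a set of complexes, scores candidates by class purity rather than a statistical test, and only adds conjunctive terms. The attribute names and data are hypothetical.

```python
from collections import Counter

def covers(complex_, example):
    # A complex is a conjunction of attribute == value tests.
    return all(example[a] == v for a, v in complex_.items())

def purity(complex_, examples):
    """Return (majority label, its fraction, number covered) for a complex."""
    covered = [y for x, y in examples if covers(complex_, x)]
    if not covered:
        return None, 0.0, 0
    label, count = Counter(covered).most_common(1)[0]
    return label, count / len(covered), len(covered)

def find_complex(examples):
    """Greedy general-to-specific search: add one conjunct at a time."""
    complex_ = {}
    label, score, _ = purity(complex_, examples)
    while score < 1.0:
        candidates = {(a, x[a]) for x, _ in examples for a in x if a not in complex_}
        best = None
        for a, v in sorted(candidates):
            trial = {**complex_, a: v}
            t_label, t_score, t_n = purity(trial, examples)
            if t_n and (best is None or (t_score, t_n) > best[0]):
                best = ((t_score, t_n), trial, t_label)
        if best is None or best[0][0] <= score:
            break  # no specialization improves purity
        (score, _), complex_, label = best
    return complex_, label

def learn_rule_list(examples):
    rules, remaining = [], list(examples)
    while remaining:
        complex_, label = find_complex(remaining)
        rules.append((complex_, label))
        if not complex_:
            break  # empty complex acts as the default rule
        remaining = [(x, y) for x, y in remaining if not covers(complex_, x)]
    return rules

def predict(rules, x):
    # Rules are ordered: the first complex that covers x decides the class.
    for complex_, label in rules:
        if covers(complex_, x):
            return label
    return None

# Hypothetical project attributes and outcomes.
examples = [
    ({"weather": "dry", "crew": "large"}, "on_time"),
    ({"weather": "dry", "crew": "small"}, "on_time"),
    ({"weather": "wet", "crew": "large"}, "on_time"),
    ({"weather": "wet", "crew": "small"}, "late"),
]
rules = learn_rule_list(examples)
```

Because the rule list is ordered, the final empty complex only fires for examples that no earlier rule covers.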
2.4 RBF
The RBF network consists of a layer of units which perform linear or non-linear functions of the
attributes, followed by a layer of weighted connections to nodes whose outputs have the same
form as the target vectors. It has a structure similar to the Multilayer Perceptron (MLP), except
that each node of the hidden layer computes an arbitrary function of the inputs, and the transfer
function of each output node is the trivial identity function. The hidden layer has parameters
appropriate for whatever functions are being used, such as Gaussian widths and positions.
The main advantages that this algorithm has over other neural net algorithms are that it has a
linear training rule once the locations in attribute space of the non-linear functions have been
determined, and an underlying model based on localised functions in the attribute space rather
than the long-range functions occurring in other models. The linear learning rule avoids problems
associated with local minima and, in particular, makes it easier to make statements about the
probabilistic interpretation of the outputs.
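The linear training rule can be illustrated with a minimal sketch. It assumes Gaussian hidden units with a fixed width, one unit centred on each training point, and identity output nodes, so that the output weights are the solution of a square linear system. These choices, and all function names, are assumptions for illustration rather than the configuration used in this study.

```python
import math

def gaussian(x, centre, width):
    # Localised hidden-unit response: largest when x is at the centre.
    d2 = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-d2 / (2 * width ** 2))

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def train_rbf(X, y, width=1.0):
    # With the centres fixed, training is a purely linear problem: Phi w = y.
    Phi = [[gaussian(x, c, width) for c in X] for x in X]
    return solve(Phi, y)

def predict_rbf(x, centres, weights, width=1.0):
    # Identity output node: a weighted sum of the hidden-unit responses.
    return sum(w * gaussian(x, c, width) for w, c in zip(weights, centres))

X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0.0, 1.0, 1.0, 0.0]   # XOR-style targets, not linearly separable
weights = train_rbf(X, y)
outputs = [predict_rbf(x, X, weights) for x in X]
```

No iterative optimisation is involved once the centres and widths are fixed, which is why local minima do not arise in this step.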
2.5 OC1
OC1 is a machine learning, decision tree algorithm which, unlike C4.5, does not restrict its
decision boundaries to tests on single attributes. Instead OC1 uses linear combinations of
attributes in decision making (termed oblique decisions) and requires all attributes to be numeric.
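The difference between a single-attribute test and an oblique test can be shown on a small example. The sketch below does not implement OC1's search for the coefficients; the coefficients are chosen by hand, and the data and function names are assumptions for illustration.

```python
def oblique_test(x, coeffs, bias):
    # Oblique decision: threshold a linear combination of all attributes.
    return sum(a * v for a, v in zip(coeffs, x)) + bias > 0

def best_axis_parallel_accuracy(X, y):
    """Best accuracy achievable by any single-attribute threshold test."""
    best = 0.0
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            pred = [x[j] > t for x in X]
            acc = sum(p == label for p, label in zip(pred, y)) / len(y)
            # 1 - acc covers the same threshold with the classes swapped.
            best = max(best, acc, 1 - acc)
    return best

# Hypothetical data: the class is True exactly when attribute 0 exceeds attribute 1.
X = [(1, 2), (2, 3), (3, 4), (2, 1), (3, 2), (4, 3)]
y = [False, False, False, True, True, True]

oblique_pred = [oblique_test(x, coeffs=(1, -1), bias=0) for x in X]
oblique_accuracy = sum(p == label for p, label in zip(oblique_pred, y)) / len(y)
axis_accuracy = best_axis_parallel_accuracy(X, y)
```

On this data a single oblique test separates the classes, whereas no single-attribute threshold does; a tree restricted to single-attribute tests would need several levels to approximate the same boundary.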
3 Methodology
Data has been collected, during the past few years, as a by-product of various construction
projects. The data was ‘fused’ from the many data sets, assembled during the construction
projects, and analyzed during the feasibility investigation. Once ‘fused’ the data was then further
modified to contain the common variables recorded during each project. All variables which
were not common to all projects were not used in this feasibility investigation. The ‘fused’,
common variable, data was then classified by an ‘expert’. Note: No public details will be made
available regarding the identity of the construction company providing the data.
The ‘expert’ classified the projects, and sub project constituents, into two outcome classes. Class
1 was project success i.e. no ‘late penalty’ was paid by the construction company. Class 2 was
project failure i.e. where a ‘late penalty’ was paid by the construction company.
The ‘fused’ data set was split in the ratio 9:1. This enabled the Data Mining algorithms to be
trained on the data of the larger sub-set and then tested on the data from the smaller sub-set.
The training data was used to train all five of the algorithms. After training, the algorithms were
tested for the accuracy of their project outcome classification.
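A 9:1 split that preserves the class proportions can be sketched as below. The round-robin assignment, the function names, and the class counts (70 successes, 30 failures) are assumptions for illustration; the paper does not state how the split was produced or how large each class was. Rotating the held-out tenth gives ten such splits.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=10):
    """Assign example indices to folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)   # deal round-robin within a class
    return folds

def train_test_splits(labels, n_folds=10):
    """Yield (train indices, test indices), holding out each fold in turn."""
    folds = stratified_folds(labels, n_folds)
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Hypothetical outcome labels: class 1 = success, class 2 = failure.
labels = [1] * 70 + [2] * 30
splits = list(train_test_splits(labels))
```

The sketch does not shuffle the examples first; in practice the order within each class would normally be randomised before the folds are dealt.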
4 Results
The initial results showed that OC1, from the machine learning algorithm class, appeared to be
the most accurate at project outcome determination. This result might be expected because
machine learning algorithms are best at analysis of small class, multi-variable data - the
construction data is such data.
The k-NN and C4.5 algorithms required less training time to reach project outcome
determination compared to the other three algorithms. However, the accuracy of these two
methods was worse than the more time consuming OC1. In this specific application accuracy of
classification is far more important than speed. Therefore the k-NN and C4.5 algorithms would
appear to be less suited for this task than OC1.
5 Discussion
The ‘non-common’ variables were excluded when investigating the classification of project
outcome because some of the algorithms used in the trial ‘do not handle missing values too
well’. This exclusion may have had an effect on the classification outcome. Future work which
will investigate this is described in the next section.
6 Conclusions & Future Work
This work has shown the potential for the use of Data Mining algorithms to determine project
outcome in the construction industry - an issue which is of paramount importance to the
companies involved in tendering for construction contracts.
Future work, already in progress, will investigate the use of the ‘excluded’ variables on the
project outcomes. Other work will also investigate the filling in of the missing values as well as
using other experts to classify the projects into more than 2 classes.
References
[1] Petersen J., 'Business Applications of Statistics for Data Mining', UNICOM Seminar on Data
Mining, July 1995, London, pp. 157-165.
[2] Leon M. & Valamudi P, 'Data Warehouse vendors do Data Mining', Info World, 1996,
vol.18, pp. 38-42.
[3] Burn-Thornton K.E., Cattrall D. & Simpson A.D., 'Polymorphic Functions for Data Mining
in ATM Networks', 4th IFIP Conference on ATM Networks, 4th IFIP ’96, July 1996, Ilkley,
UK.
[4] Burn-Thornton K.E. & Edenbrandt L., 'Myocardial Infarction - Pinpointing the Key Indicators
using Data Mining', Computers and Medicine, August 1998.
[5] Bounds D., Lloyd P. & Mathew B., ‘A comparison of neural networks and other pattern
recognition approaches to the diagnosis of low back disorders’, Neural Networks, 1990, vol.3.,
pp. 583 -591.
[6] Waddell G., ‘A new clinical model for the treatment of lower back-pain’, Spine, 1987, vol.
12, pp. 632-644.
[7] Enas G.G. & Choi S.C., 'Choice of the smoothing parameter and efficiency of the k-nn
neighbour classification', Comput. Math. Applic., 1986, vol. 12A, pp. 235-244.
[8] Friedman J., 'Regularised discriminant analysis', J. Amer. Statist. Assoc., 1989, vol. 84,
pp. 165-175.
[9] Kendall M.G., Stuart A. & Ord J.K., 'The advanced theory of statistics', 1984, vol. 3, 4th ed.,
Griffin, London.
[10] Quinlan J.R., 'Simplifying decision trees', Int. J. Man-Machine Studies, 1987, vol. 27, pp.
221-234.