Data mining VIMS data for information on truck condition
Tad S. Golosinski and Hui Hu
University of Missouri-Rolla, Rolla, MO 65409-0450, USA
Ralph Elias
University of Botswana, Gaborone, Botswana
ABSTRACT: The paper presents initial research on the use of data mining to analyze the condition of an off-highway mining truck. The raw data was collected using the Caterpillar VIMS system in a Botswana mine. The data mining tool was the IBM Intelligent Miner for Data. The results indicate that data mining allows identification and quantification of relations between the various types of data. As such, data mining offers the potential for development of a predictive tool for prognosticating equipment condition and performance. Development of this capacity requires further research.
Modern mining equipment is fitted with numerous sensors that monitor its condition and performance. The data collected by these sensors is used to alert the operator to the existence of abnormal operating conditions and to perform an emergency shut-down if the pre-set values of the monitored parameters are exceeded. This data is also used for post-failure diagnostics and for reporting and analysis of equipment performance.
It is believed that availability of this voluminous
data, together with availability of sophisticated data
processing methods and tools, may allow for extraction of additional information contained in the data.
One method that may be of value is data mining
(Golosinski, 2001).
The research presented in this paper investigates use of the data collected from various sensors installed on a mining truck to develop a predictive tool that would allow for reliable projection of both the equipment performance and condition into the future.
The subject of the research was data collected from a variety of sensors installed on an off-highway mining truck (CAT 785) by the Vital Information Management System (VIMS) of Caterpillar. The data mining tool was the IBM Intelligent Miner for Data.
Data mining is the next step beyond online analytical processing (OLAP) for querying data warehouses. It is frequently used in the analysis of customer relationships in the retail industry, for financial analysis and research, for fraud and risk management, for supply chain management and in e-business (Westphal and Blaxton, 1998). The representative methods used in data mining are:
• Association discovery
• Sequential pattern discovery
• Clustering
• Classification
• Value prediction
• Similar time sequences
Numerous data mining techniques are used. As an example, predictive model creation is supported by supervised induction techniques, link analysis is supported by association discovery and sequence discovery techniques, clustering techniques support database segmentation, and deviation detection is supported by statistical techniques. Analysis of results is usually accomplished through various forms of visualization that facilitate identification of patterns hidden in the data, as well as better comprehension of the information extracted by the data mining.
Data mining involves problem definition, data selection and preparation, data analysis and presentation of results.
Caterpillar's Vital Information Management System (VIMS) is installed on selected CAT mining equipment. It is a powerful tool for machine management that provides operators, service personnel and managers with information on a wide range of vital machine functions and on equipment production and performance. VIMS monitors and records the indications of numerous sensors that are integrated into the vehicle design. It has the capacity to alert the operator if these indications exceed the pre-set critical values and to conduct an emergency equipment shut-down if so programmed (Caterpillar, 2000). The flow of VIMS data is illustrated in fig. 1.
Figure 1. VIMS schematic.
The VIMS unit records the occurrence of certain events and real-time machine conditions.
The recorded data can be downloaded from the on-board VIMS unit to a notebook computer, or it can be sent to the central control unit via radio (VIMS Wireless). All collected data is grouped into the following seven categories:
3.1.1 Event list
The event list is a record of stored events (“what
happened and when”) that have occurred on the machine. The record contains the last 500 events that
have occurred before equipment shutdown, listed in
chronological order.
3.1.2 Snapshot
The snapshot stores a segment of equipment history
recorded in real time for all monitored parameters at
a one-second interval. The snapshot relates to a set
of pre-defined events and is triggered automatically
if one of these events occurs (e.g. abnormal condition or emergency situation).
3.1.3 Data logger
The data logger records all the machine parameters
that are monitored by VIMS. The data is sampled in
real time at one-second intervals. The logger is started and stopped at the operator command and can
record data for up to 30 minutes.
3.1.4 Trends
Trend information reported by VIMS consists of the minimums, maximums and averages of the selected indications calculated over a pre-selected period of time.
3.1.5 Cumulative
Cumulative information refers to the number of occurrences of specific events over a pre-set period of
time. An example of cumulative information can be
the total engine revolutions or total fuel consumption over the life of the machine or component.
3.1.6 Histogram
Histogram information records the performance history of a selected parameter since the last reset. For example, a histogram of the engine speed would indicate the percentages of time that the engine operated within pre-specified speed ranges.
3.1.7 Payload
Indications of the payload measurement system,
if installed, may be recorded if so specified.
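The trend and histogram categories described above can be illustrated with a minimal sketch. It is not VIMS code: the function names, the sample values and the range boundaries are assumptions made for illustration, and the samples stand in for a parameter logged at one-second intervals.

```python
from statistics import mean

def trend(samples):
    """Return (minimum, maximum, average) of the samples over the chosen period."""
    return min(samples), max(samples), mean(samples)

def histogram_percentages(samples, edges):
    """Percentage of samples in each half-open range [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for value in samples:
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += 1
                break
    return [100.0 * c / len(samples) for c in counts]

# Hypothetical engine speed samples (rpm), one per second.
engine_speed = [700, 720, 1500, 1650, 1900, 1850, 1700, 710]
# Illustrative speed range boundaries, not taken from VIMS.
speed_ranges = [0, 1000, 1800, 2500]

print(trend(engine_speed))
print(histogram_percentages(engine_speed, speed_ranges))
```

With one sample per second, each percentage is directly the share of operating time spent in that speed range.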
A variety of data mining software is commercially available from numerous vendors. It includes Intelligent Miner of IBM (International Business Machines Corporation), MineSet of SGI (Silicon Graphics Inc.), Clementine of ISL (Integral Solutions Limited of U.K.) and others. The IBM Intelligent Miner (IM) version 6.1 was used for the data mining work reported in this paper (IBM, 2000). It offers a large choice of algorithms, is easy to use, and has proven itself useful in many commercial applications.
The IM includes the following mining and statistics functions:
1. Mining functions: associations, demographic and
neural clustering, sequential patterns and similar
sequences, tree and neural classification, and
neural and RBF (Radial Basis Function) prediction.
2. Statistics functions: bivariate statistics, linear regression, principal component analysis, univariate curve fitting and factor analysis.
The IM allows modeling of pre-defined phenomena for events that can be either usual or unusual. Usual events describe the situation that is considered normal and for which the relations between different attributes are sought. For example, relations between truck operating and mechanical attributes can be defined, or a relation between the engine load and the truck payload. Definition and quantification of these relations may help to improve the efficiency of truck operation or assist with operator training.
Unusual events are failures of the monitored machine. Data mining of these events may allow for definition of algorithms that would facilitate prediction of machine failure events.
For example, if an important rule is found with high confidence in the large volume of VIMS data, this rule can be used to improve the VIMS system itself so that it predicts machine failure more reliably. An emergency condition based on this rule can be set up in VIMS to alert the operator. Moreover, proper data analysis is helpful for the design of the machine: based on the relations and rules discovered from the off-board data, the machine designer can focus attention on the important components.
To facilitate IM data mining of VIMS recorded data, the data has to be adapted to a format acceptable to the IM. The original VIMS data, downloaded from an on-board VIMS unit, can be merged into a Microsoft Access 97 database using the VIMS PC99 software. Unfortunately the IM does not accept the Access data format. Consequently the data first has to be converted into ASCII format, which is one of the acceptable data input formats for version 6.1 of the Intelligent Miner software.
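As an illustration of the conversion step, the sketch below writes records to delimited ASCII text. The paper does not give the exact file layout expected by the IM, so the comma-delimited layout, the field names and the sample values are all assumptions; the records are taken to be already exported from the database into Python dictionaries.

```python
import csv
import io

def write_ascii(records, fieldnames, stream):
    """Write records (dicts) as comma-delimited text with a header row."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)

# Hypothetical field names and values, for illustration only.
fields = ["time_s", "engine_speed_rpm", "payload_t"]
rows = [
    {"time_s": 0, "engine_speed_rpm": 1500, "payload_t": 0.0},
    {"time_s": 1, "engine_speed_rpm": 1520, "payload_t": 0.0},
]

buffer = io.StringIO()
write_ascii(rows, fields, buffer)
text = buffer.getvalue()
text.encode("ascii")  # raises UnicodeEncodeError if a non-ASCII character is present
print(text)
```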
The first step of data mining is data preparation.
It involves data clean-up, and identification and extraction of data that is of interest to the problem at
hand. This data must be presented in a form that is
able to represent all information consistently.
Subject to data mining were 4437 VIMS records
each of which contained values of 90 parameters
recorded on a CAT 789 truck operating in a diamond mine in Botswana. The records were taken on
March 8, 1999 and represent 4437 seconds of consecutive truck operation (sampling rate of one record
per second).
The purpose of the investigations was to confirm
the feasibility of determining which condition and
performance parameters of the truck are related to its
fuel consumption rate. After that the strength of the
relation between each of these parameters and the
fuel consumption rate was to be tested.
Three different data mining methods were used to
define and quantify the relation between the recorded data streams. These were classification, demographic clustering, and principal component analysis
combined with factor analysis (Bernson and Smith,
1997). All three are briefly summarized below.
6.1 Relationship discovery with Principal
Component Analysis and Factor Analysis
PCA (Principal Component Analysis) is used in statistics to extract the main relationships in data of high dimensionality. A common way to find the Principal Components of a data set is by calculating the eigenvectors of the data correlation matrix. These vectors give the directions in which the data cloud is stretched most. The projections of the data on the eigenvectors are the Principal Components. The corresponding eigenvalues indicate the amount of information the respective Principal Components represent. Principal Components corresponding to large eigenvalues represent much of the information in the data set and thus tell much about the relations between the data points. The Intelligent Miner implementation looks for standardized linear combinations of the original variables. This tool can be used to summarize data and identify linear relationships among variables. It is also a dimension-reduction technique. In contrast to factor analysis, principal component analysis transforms the vector describing the original variables linearly into a lower-dimensional subspace. Another benefit of this analysis tool is its handling of missing values. If a valid record contains missing values, these values are replaced with the mean value of the field or variable in question. Since the data may contain many missing values, this tool is also useful as a preparatory step for running a mining function that uses the generated components as input fields. Moreover, the mining methods then need not be run on all variables, which saves considerable time.
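The steps described above (mean replacement of missing values, standardization, eigenvectors of the correlation matrix, projection of the records) can be sketched as follows. This is an illustrative sketch, not the Intelligent Miner implementation: it finds only the first principal component, uses power iteration as a stand-in eigen-solver, and runs on synthetic data with hypothetical field names.

```python
import math
import random

def impute_mean(column):
    """Replace missing entries (None) with the mean of the values present."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

def standardize(column):
    """Scale a column to zero mean and unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def correlation_matrix(z_columns):
    """Correlation matrix of already standardized columns."""
    n = len(z_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_columns]
            for ci in z_columns]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and unit eigenvector of a
    symmetric positive semi-definite matrix such as a correlation matrix."""
    size = len(matrix)
    vector = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(size))
                   for i in range(size)]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    eigenvalue = sum(vector[i] * matrix[i][j] * vector[j]
                     for i in range(size) for j in range(size))
    return eigenvalue, vector

# Synthetic stand-in for VIMS records: two strongly related fields, one unrelated.
rng = random.Random(0)
base = [rng.gauss(0, 1) for _ in range(200)]
columns = [
    [b + 0.1 * rng.gauss(0, 1) for b in base],  # hypothetical "fuel rate"
    [b + 0.1 * rng.gauss(0, 1) for b in base],  # hypothetical "boost pressure"
    [rng.gauss(0, 1) for _ in range(200)],      # unrelated field
]
columns[2][5] = None  # one missing value, filled with the field mean

z_columns = [standardize(impute_mean(c)) for c in columns]
corr = correlation_matrix(z_columns)
eigenvalue, eigenvector = leading_eigenpair(corr)

# Scores: projection of each standardized record on the first eigenvector.
scores = [sum(z[k] * w for z, w in zip(z_columns, eigenvector)) for k in range(200)]
# Eigenvalues of a correlation matrix sum to the number of variables.
share = eigenvalue / len(corr)
print(round(eigenvalue, 3), [round(w, 3) for w in eigenvector], round(share, 3))
```

The two related fields receive nearly equal weights in the first eigenvector, and the share value is the fraction of the total standardized variance that this component represents.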
Factor Analysis is an exploratory approach that aims to make sense of multivariate data in a systematic manner. It searches for hidden variables in order to reduce data involving many variables down to a small number of dimensions. Factor Analysis discovers the relationships among variables in terms of a few underlying, but unobservable, random quantities called factors. It handles missing and invalid values in the same way as Principal Component Analysis. The factor loadings window, shown as fig. 8, offers a graphical representation of the factors. If the factors have no clear interpretation, factor rotation can simplify the factor structure and help the user to better identify the meaning of the calculated factors. Its application in this case is similar to that of principal component analysis.
6.2 Database segmentation with Demographic Clustering
In contrast to Principal Component Analysis and Factor Analysis, Clustering searches for hidden groups and classifies data into related clusters on the basis of the values of several variables. Demographic Clustering provides fast and natural clustering of very large databases. It automatically determines the number of clusters to be generated. Similarities between records are determined by comparing their field values. The clusters are then defined so that Condorcet's criterion is maximized (IBM, 2000).
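The paper does not spell out the criterion, so the following is only a sketch under stated assumptions. It assumes that the criterion is the sum of similarities over record pairs placed in the same cluster minus the sum over pairs placed in different clusters, and that the similarity of two records is the number of agreeing fields minus the number of differing fields. The exact similarity measure used by the IM may differ, and the records shown are hypothetical.

```python
from itertools import combinations

def similarity(a, b):
    """Signed similarity: each agreeing field counts +1, each differing field -1."""
    return sum(1 if x == y else -1 for x, y in zip(a, b))

def condorcet(records, labels):
    """Similarities of same-cluster pairs minus similarities of split pairs."""
    total = 0
    for i, j in combinations(range(len(records)), 2):
        s = similarity(records[i], records[j])
        total += s if labels[i] == labels[j] else -s
    return total

# Hypothetical categorical records: (load state, gear).
records = [("empty", "4th"), ("empty", "4th"), ("loaded", "1st")]

print(condorcet(records, [0, 0, 1]))  # similar records together, dissimilar apart
print(condorcet(records, [0, 0, 0]))  # everything in one cluster
print(condorcet(records, [0, 1, 2]))  # every record in its own cluster
```

Under these assumptions the first assignment scores highest, which also shows how the number of clusters can follow from maximizing the criterion rather than being fixed in advance.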
This tool presents the percentage of the parameter of interest that appears in the whole population (all records) and the cluster percentage, i.e. the percentage of the current cluster that the parameter accounts for. Different combinations of these two percentages allow interesting relations in the data set to be uncovered.
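The comparison of cluster and population percentages can be sketched as below. The function names and the record values are hypothetical; the counts were chosen only so that the example yields a 40% share inside the cluster against 5% in the whole population.

```python
from collections import Counter

def percentages(values):
    """Percentage of records taking each value."""
    counts = Counter(values)
    return {key: 100.0 * count / len(values) for key, count in counts.items()}

def compare(population, cluster_members):
    """Map each value to (percentage in the cluster, percentage in the population)."""
    pop = percentages(population)
    clu = percentages(cluster_members)
    return {key: (clu.get(key, 0.0), pop[key]) for key in pop}

# Hypothetical haul distance classes for 40 records, 5 of which form one cluster.
population = ["6-10 mi"] * 2 + ["other"] * 38
cluster_members = ["6-10 mi"] * 2 + ["other"] * 3

result = compare(population, cluster_members)
print(result)
```

A value whose cluster percentage is far above its population percentage is what characterizes the cluster.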
6.3 Classification
Classification is used to segregate database records into pre-defined classes based on selected criteria. Thus this technique can be used to define which truck operating or condition parameters determine the fuel consumption rate, which parameters determine its cycle time, and the like.
7.1 Statistical analysis
The chart in fig. 2 presents the IM display (Principal Components Result Viewer) of the principal attributes, generated by applying the IM Principal Component Analysis tool to selected VIMS data. In total it lists 65 principal attributes as related to truck fuel consumption rate out of the total of 90 listed in the database.
Figure 2. Principal Component Analysis
One of the principal utilities of this method is to reduce the number of parameters of interest that will form the input to the other data mining methods, and thus to simplify further investigations. In the case under consideration, statistical analysis allowed reduction of the parameters of interest to 65 principal ones, or by some 30%.
Another utility is to allow quantification of the correlation between the various parameters under consideration. It is expressed as correlation coefficients of the input variables. Fig. 3 and fig. 4 present the values of these coefficients for several representative parameters.
To illustrate the relations involved, fig. 3 presents several negative correlation coefficients for parameters determined to be related to engine fuel consumption rate. In this case, the turbocharger air inlet pressure has the largest negative correlation coefficient with the engine fuel consumption rate. Likewise the turbocharger outlet pressure has the largest positive correlation (fig. 4). This appears logical, as a properly operating turbocharger can boost engine power by up to 40%.
Figure 3. Negative correlation coefficients (engine fuel consumption rate).
Figure 4. Positive correlation coefficients (engine fuel consumption rate).
Besides this correlation, other correlations were also defined. Some other truck parameters that have a high positive correlation to the fuel consumption rate include:
• Booster Pressure, calculated by subtracting atmospheric pressure from the turbocharger outlet pressure
• Engine Load, calculated from the engine speed, throttle switch position, throttle position, boost pressure and atmospheric pressure, and expressed as a percentage of full load
• Right Exhaust Temperature and Left Exhaust Temperature, the temperature within the exhaust manifold of the engine on both sides of the truck.
It is relevant for the purpose of the subject research that the relations between various operating parameters of the truck can be found and quantified using data mining techniques, in this case the Principal Component Analysis technique. Full discussion of the defined correlations is beyond the scope of this paper.
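How correlation coefficients against the fuel consumption rate can be computed and ranked is sketched below. This is an illustration rather than the IM procedure: the parameter names and all values are hypothetical, and the only relation taken from the text is that boost pressure equals turbocharger outlet pressure minus atmospheric pressure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_by_correlation(target, parameters):
    """Sort parameters from most positive to most negative correlation with target."""
    coeffs = {name: pearson(values, target) for name, values in parameters.items()}
    return sorted(coeffs.items(), key=lambda item: item[1], reverse=True)

# Hypothetical readings, one value per record.
fuel_rate = [40, 80, 120, 160, 200]
turbo_outlet = [110, 130, 150, 175, 200]
atmospheric = [90, 90, 90, 90, 90]

parameters = {
    # Boost pressure = turbocharger outlet pressure minus atmospheric pressure.
    "boost_pressure": [o - a for o, a in zip(turbo_outlet, atmospheric)],
    "turbo_inlet_pressure": [99, 98, 96, 95, 93],
}

ranking = rank_by_correlation(fuel_rate, parameters)
for name, coefficient in ranking:
    print(f"{name}: {coefficient:+.3f}")
```

The head of the ranking corresponds to the kind of listing shown in fig. 4 and the tail to that in fig. 3.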
7.2 Tree-Clustering
Following the principal component analysis, the remaining data set was mined using the IM demographic clustering technique. As a result the data set was segmented into 9 clusters, as shown in fig. 5. The three largest clusters each account for 14% of the whole data set.
Figure 5. Demographic Clustering: IM output
Fig. 6 and 7 show a zoom of the clusters related to haul distance and to truck payload. The haul distance cluster, shown in fig. 6, indicates that the haul distance is one of the main determinants of fuel consumption rate. In this cluster the percentage of 6 to 10 mile long hauls is approximately 40%, while the same percentage for the whole population is only 5%. One possible explanation is that on long hauls the truck fuel consumption rate is larger since the truck spends more time running at full load. On short hauls more time is spent on loading, dumping, maneuvering and waiting, truck activities during which fuel consumption is low.
Figure 6. Demographic Clustering: haul distance cluster (horizontal scale: haul distance in miles).
Fig. 7 shows the payload cluster. It indicates that all trucks in the analyzed cluster were running empty (100% of the cluster), while in the whole population only around 50% of the trucks were empty. All the trucks in this cluster were traveling in 4th gear at a speed of 25 to 35 MPH and the fuel consumption rate was average. The fact that the truck was running empty for all the hauls in this cluster does not allow valid conclusions to be drawn on the fuel consumption rate.
Figure 7. Demographic clustering: payload cluster (horizontal scale: payload in tons).
The other clusters identified in this work are presented in fig. 5. These contain a variety of other information related to truck performance.
7.3 Tree-Classification
A sample of the results obtained using the classification technique of data mining is shown in fig. 8. It presents the statistical information and the confusion matrix of the data mining run.
The tree-classification mining function builds a classification model as a binary decision tree. Each interior node of the binary decision tree tests an attribute of a record. If the attribute value satisfies the test, the record is sent down the left branch of the node. If the attribute value does not satisfy the test, the record is sent down the right branch of the node. At the upper left corner of the display, the 4 classes are marked with different colors. They appear in the tree map as solid squares. The solid circles are the decision nodes. The binary decision tree consists of the root node on top, followed by non-leaf nodes and leaf nodes. Branches connect a node to 2 other nodes. Root and non-leaf nodes are represented as pie charts. Leaf nodes are represented as rectangles.
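The routing of a record through such a binary decision tree can be sketched as follows. The tree below is hypothetical and is not the tree produced by the IM; it is hand-built from the ground speed, payload and gear thresholds quoted in the selected observations of this section, and the "other" and "mixed" labels are placeholders introduced for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    label: str                                     # predominant class label
    test: Optional[Callable[[dict], bool]] = None  # split criterion, None for a leaf
    left: Optional["Node"] = None                  # taken when the test is satisfied
    right: Optional["Node"] = None                 # taken otherwise

def classify(node, record):
    """Send the record down the tree until a leaf is reached."""
    while node.test is not None:
        node = node.left if node.test(record) else node.right
    return node.label

# Hypothetical tree; only the threshold values come from the paper.
tree = Node(
    "mixed",
    test=lambda r: r["ground_speed"] > 31.5,
    left=Node(
        "low",
        test=lambda r: r["gear"] > 5,
        left=Node("low"),
        right=Node("other"),
    ),
    right=Node(
        "mixed",
        test=lambda r: 12.25 <= r["ground_speed"] <= 15.5,
        left=Node(
            "high",
            test=lambda r: r["payload"] > 126.85,
            left=Node("high"),
            right=Node("other"),
        ),
        right=Node("other"),
    ),
)

print(classify(tree, {"ground_speed": 14.0, "payload": 130.0, "gear": 2}))
print(classify(tree, {"ground_speed": 35.0, "payload": 0.0, "gear": 6}))
```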
Clicking on a node displays its characteristics in the pane at the bottom of the window (see fig. 9). This information includes:
Label: The predominant class label of the selected node.
Test: The split criterion for this node. This applies only to non-leaf nodes and specifies a simple selection.
Records: The number of records contained in each of the sub-nodes of the selected node.
Distributions: The number of records corresponding to each of the possible class labels. The classification is most meaningful if all records belong to one leaf node only. However, by pruning the binary decision tree, records of other nodes can be assigned to the selected node.
Purity: The percentage of correctly classified records assigned to a node.
For the fuel consumption rate run the IM defined four classes that group the various input parameters. These allow definition of the parameters that contribute most to a high fuel consumption rate. This is done by tracking the thicker black line with the arrow that links the nodes and continues on to the rectangles at the foot of the figure. Since the original plot is in color, the IM tracking is fairly straightforward.
Figure 9. Classification-Tree.
Selected observations that can be made in this case are:
• When ground speed is in the range from 12.25 MPH to 15.5 MPH and the payload is over 126.85 t, 96.8% of the 283 records indicate a high engine fuel consumption rate;
• When ground speed is more than 31.5 MPH and the actual gear is higher than 5, all 146 records show a low engine fuel consumption rate;
• The ground speed has more impact on the engine fuel consumption rate than do other parameters.
Engine Fuel Rate
Number of classes = 4
Errors = 1205 (27.78%)
Confusion matrix for pruned tree
Predicted class -> |  low | 100-200 (l/h) | 200-300 (l/h) | high |
low                | 1150 |           278 |           122 |  194 | total = 1744
100-200 (l/h)      |  100 |           801 |            20 |   35 | total = 956
200-300 (l/h)      |   89 |           121 |           386 |   61 | total = 657
high               |  100 |            24 |            61 |  795 | total = 980
total              | 1439 |          1224 |           589 | 1085 | total = 4337
Figure 8. Classification result statistics
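The error figure reported with the confusion matrix of fig. 8 can be checked with a short calculation. The sketch assumes that rows are actual classes and columns are predicted classes, and that the two row fragments printed without a class label belong to the "low" and "high" rows; that assignment is the one consistent with the printed row and column totals.

```python
classes = ["low", "100-200 (l/h)", "200-300 (l/h)", "high"]

# Rows: actual class, columns: predicted class (assumed orientation).
matrix = [
    [1150, 278, 122, 194],
    [100, 801, 20, 35],
    [89, 121, 386, 61],
    [100, 24, 61, 795],
]

row_totals = [sum(row) for row in matrix]
col_totals = [sum(col) for col in zip(*matrix)]
total = sum(row_totals)

# Correctly classified records lie on the diagonal.
correct = sum(matrix[i][i] for i in range(len(classes)))
errors = total - correct
error_rate = errors / total

print(f"records = {total}, errors = {errors} ({100 * error_rate:.2f}%)")
for name, row, row_total in zip(classes, matrix, row_totals):
    share = 100 * row[classes.index(name)] / row_total
    print(f"{name}: {share:.1f}% of actual records predicted correctly")
```

The per-class shares show where the tree is weakest, which the single overall error rate hides.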
The investigations presented above prove that data
mining techniques can be used to analyze performance of mining equipment. In particular the relations between its various operating, condition and
performance parameters can be defined and quantified.
These relations, in turn, can be used to develop
predictive capability related to equipment condition
and performance. Further research is needed to develop this capacity.
The investigations presented in this paper were funded by a grant from the Research Board of the University of Missouri System. The support and cooperation of Caterpillar, Inc. and of Debswana Diamond Mining Company are gratefully acknowledged.
Bernson, A. and Smith, S. J. 1997. Data warehousing, data
mining and OLAP. McGraw-Hill.
Caterpillar, Inc. 1999. Vital Information Management System
(VIMS): System Operation Testing and Adjusting. Company publication.
Golosinski, T. S. 2001. Data mining uses in mining. Proceedings, APCOM 2001, Beijing, China.
IBM (International Business Machines Corporation). 2000. Manual: Using the Intelligent Miner for Data. Company publication.
Westphal, C. and Blaxton, T. 1998. Data mining solutions. John Wiley and Sons, Inc.