Data Mining, a useful tool in veterinary epidemiology?
Valle, P.S.1, Flaten, O.2, Lien, G.2, Koesling, M.3, Ebbesvik, M.3 and Carroll, C.1,4
1 The Norwegian School of Veterinary Science, P.O. Box 8146 Dep., N-0033 Oslo, Norway
2 Norwegian Agricultural Economics Research Institute, P.O. Box 8024 Dep., 0030 Oslo, Norway
3 Research Institute and National Center for Ecological Agriculture, 6630 Tingvoll, Norway
4 Stat Tech, Inc., 3543 West Braddock Road, Alexandria, Virginia 22302, USA
Abstract
As more data have been amassed and interest in working with the ensuing data sets
has grown, methods for organizing and examining the data have evolved. The need
to work with these larger amounts of data has led to the development of ‘data mining’
methods and software. Data mining has a somewhat skewed reputation, and has often
been characterised as ‘data dredging’ or ‘fishing expeditions’ 1. However, most of us
must admit that such ‘expeditions’, or what one could also call hypothesis-generating
approaches in which we look for both likely and less likely associations, have occurred
within our own research. In principle, generating promising associations is what data
mining is all about.
In this paper we have applied one of the many commercial software packages available
(Enterprise Miner, SAS) to a small data set merged from questionnaire data and the
national dairy cattle health and production records. We looked for patterns separating
organic dairy farmers from conventional ones. The main framework of the data
mining approach, some of the core modelling methods and the data mining results are
briefly described and assessed.
Background
Today data can come from many sources: administrative records, governmental
records, laboratory records, industrial records, scientific studies, etc. These records
hold massive amounts of information, and the possible number of associations and data
patterns often goes far beyond the human mind’s capacity. Data mining typically deals
with observational data that have already been collected.
‘Data mining is the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that are both
understandable and useful to the data owner.’ 1 In many cases, the statistical methods
are not new, and data mining software products may even share the same underlying
algorithms used in statistical software sold by the same vendor. For example,
regression calculations may be performed with identical code. On the other hand,
some data mining products also contain algorithms that are less traditionally part of
statistical analysis software and methods, e.g. neural networks.
Material and methods
This paper reports on the application of the data mining tool Enterprise Miner (EM),
Release 4.1 (SAS Institute, Cary, NC). The data used were from three sources: dairy
cattle health records and dairy cattle production records maintained on Norwegian
dairy herds, and a data set based on responses to a mailed questionnaire survey on
risk, risk attitudes and risk handling practices among a sample of Norwegian organic
and conventional dairy farmers. The aggregate data set contained 481 records and 385
variables. The data set is small in data mining terms. Also, the number of variables is
very large relative to the number of records, which calls for special handling procedures.
Proceedings of the 10th International Symposium on Veterinary Epidemiology and Economics, 2003
Available at www.sciquest.org.nz
Data Mining and Statistics
SAS Institute defines data mining as “the process of Sampling, Exploring, Modifying,
Modeling, and Assessing (‘SEMMA’) large amounts of data to uncover previously
unknown patterns that can be utilized as a business advantage.” We followed a
SEMMA process using some of the tools available in EM, and as outlined by the
diagram provided in EM (Figure 1).
Figure 1. Diagram of the data mining process in Enterprise Miner, SAS.
In brief, the SEMMA process proposed by SAS and supported by EM consists of:
1) Specifying the input data set, in this case the merged health, production
and questionnaire data (oeko.hbspsv3), and defining a target variable, which
here is the binary variable separating the observations into organic and
conventional dairy farms.
2) Exploring the data by graphical procedures using e.g. Multiplot or different
graphical procedures in Insight.
3) Modifying the data by:
a. creating partitioned data sets, which in this case means splitting the data
into three subsets (training, validation and test data sets) by simple random
sampling. Other sampling options are also available.
b. replacing missing values according to appropriate
methods, e.g. replacing with the mean for interval variables, and
replacing with the most common value for binary, nominal or
ordinal variables. (Transformation can also be performed.)
c. variable selection for preselecting important input variables.
4) Modeling; in this case we used the predictive modelling tools:
a. logistic regression model, which by default, given the binary target
variable, fits a logistic regression model with independent group
variables coded by either GLM (dummy) or Deviation (effect) coding.
b. classification tree model.
c. neural network model, which enables one to fit nonlinear models such as a
multilayer perceptron (MLP). Neural networks are a class of software
algorithms originally developed by computer scientists in a
subdiscipline generally called artificial intelligence.
5) Assessing, which provides information on the ability of the developed models to
predict the target variable through the following charts: lift, profit, return on
investment, receiver operating characteristic (ROC) curves, diagnostic charts and
threshold-based charts.
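The partitioning and replacement steps (3a and 3b) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code that Enterprise Miner runs: the 40/30/30 split, the variable names hay_pct and region, the toy herd records and the helper functions partition, fit_fills and apply_fills are all hypothetical. The sketch estimates the replacement values from one data set and then applies them to any other, so that a validation or test set could be completed with values learned from the training set.

```python
import random
from statistics import mean, mode


def partition(records, fractions=(0.4, 0.3, 0.3), seed=0):
    """Split records into training, validation and test sets by
    simple random sampling. The fractions are an assumption."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_valid = int(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


def fit_fills(rows, interval_vars, class_vars):
    """Estimate replacement values: the mean for interval variables,
    the most common value for binary, nominal or ordinal variables."""
    fills = {}
    for var in interval_vars:
        fills[var] = mean(r[var] for r in rows if r[var] is not None)
    for var in class_vars:
        fills[var] = mode(r[var] for r in rows if r[var] is not None)
    return fills


def apply_fills(rows, fills):
    """Return copies of the rows with missing values (None) replaced."""
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in row.items()}
        for row in rows
    ]


# Hypothetical toy records; None marks a missing value.
herds = [
    {"organic": 1, "hay_pct": 30.0, "region": "north"},
    {"organic": 0, "hay_pct": None, "region": "south"},
    {"organic": 0, "hay_pct": 10.0, "region": None},
    {"organic": 1, "hay_pct": 20.0, "region": "south"},
]
fills = fit_fills(herds, interval_vars=["hay_pct"], class_vars=["region"])
completed = apply_fills(herds, fills)

# Partition 481 record indices, the size of the data set used here.
train, valid, test = partition(range(481))
```

Fixing the seed makes the partition reproducible, which matters when several models are to be compared on the same validation and test sets.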
Results
The classification tree model separates the organic and the conventional dairy
farmers first by a questionnaire variable where the farmers were asked to score the
importance of fertilizer for the plants, and second by the percentage of concentrate
in the feed ration. In the regression procedure the most important variables for
separating the two groups turned out to be the percentage of hay in the feed ration
and the need for fertilizer. Unfortunately, it is difficult to assess the importance of
individual inputs on the classification from the neural network.
The model assessment ROC procedure (Figure 2) shows that the regression procedure
provides the best classification, followed by the tree model. From the 40th
percentile the neural network outperforms the tree model.
Figure 2. ROC curves for the applied predictive modelling procedures.
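To make the comparison in Figure 2 concrete, the sketch below shows how an ROC curve and the area under it can be computed from a model's predicted scores. It is a minimal illustration in plain Python, not the assessment code used by Enterprise Miner; the functions roc_points and auc and the example labels and scores are hypothetical and do not come from the study data.

```python
def roc_points(labels, scores):
    """Return the ROC curve as (false positive rate, true positive rate)
    points, lowering the classification threshold step by step.
    labels: 1 for the target class (e.g. organic), 0 otherwise."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Records with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical example: six farms, with scores from two imaginary models.
labels = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
scores_b = [0.9, 0.8, 0.3, 0.6, 0.4, 0.2]

curve_a = roc_points(labels, scores_a)
curve_b = roc_points(labels, scores_b)
```

A single area value hides where along the curve one model is better than another. When two curves cross, as reported here for the neural network and the tree model, the points themselves need to be compared over the range of interest.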
Discussion
Modern data mining tools such as EM offer a way to organize and streamline the
process of analyzing primary data that have been collected for some specific purpose
and secondary data collected for other purposes. The data mining tools allow
researchers to quickly gain an overview of the data, identify difficulties and potentially
interesting patterns and relationships, and develop and assess models. But, like all
tools, they have their uses and misuses. An understanding of both the mathematical
modelling and the computational algorithms is essential to grasp the complexity of data
mining 1.
References
1 Hand, D., Mannila, H., Smyth, P. Principles of data mining. The MIT Press, Cambridge,
Massachusetts, USA, 2001, p. 22.