Data Mining, a useful tool in veterinary epidemiology?

Valle, P.S.1, Flaten, O.2, Lien, G.2, Koesling, M.3, Ebbesvik, M.3 and Carroll, C.1,4

1 The Norwegian School of Veterinary Science, P.O. Box 8146 Dep., N-0033 Oslo, Norway
2 Norwegian Agricultural Economics Research Institute, P.O. Box 8024 Dep., 0030 Oslo, Norway
3 Research Institute and National Center for Ecological Agriculture, 6630 Tingvoll, Norway
4 Stat Tech, Inc., 3543 West Braddock Road, Alexandria, Virginia 22302, USA

Abstract

As more data have been amassed and interest in working with the ensuing data sets has grown, methods for organizing and examining the data have evolved. The need to work with these larger amounts of data has led to the development of ‘data mining’ methods and software. Data mining has a somewhat skewed reputation, and has often been characterised as ‘data dredging’ or ‘fishing expeditions’ [1]. However, most of us must admit that such ‘expeditions’, or what one could also call hypothesis-generating approaches in which we look for both likely and less likely associations, have occurred within our own research. In principle, generating promising associations is what data mining is all about. In this paper we applied one of the many commercial software packages available (Enterprise Miner, SAS) to a small data set merged from a questionnaire data set and the national dairy cattle health and production records. We looked for patterns separating organic dairy farmers from conventional ones. The main framework of the data mining approach, some of the core modelling methods and the data mining results are briefly described and assessed.

Background

Today data can come from many sources: administrative records, governmental records, laboratory records, industrial records, scientific studies, etc. These records hold massive amounts of information, and the possible number of associations and data patterns often goes far beyond the capacity of the human mind.
Data mining typically deals with observational data that have already been collected. ‘Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.’ [1] In many cases the statistical methods are not new, and data mining software products may even share the same underlying algorithms as statistical software sold by the same vendor. For example, regression calculations may be performed with identical code. On the other hand, some data mining products also contain algorithms that are less traditionally a part of statistical analysis software and methods, e.g. neural networks.

Material and methods

This paper reports on the application of the data mining tool Enterprise Miner (EM), Release 4.1 (SAS Institute, Cary, NC). The data used were from three sources: dairy cattle health records and dairy cattle production records maintained on Norwegian dairy herds, and a data set based on responses to a mailed questionnaire survey on risk, risk attitudes and risk handling practices among a sample of Norwegian organic and conventional dairy farmers. The aggregate data set contained 481 records and 385 variables. The data set is small in data mining terms. Moreover, the number of variables is very large relative to the number of records, which calls for special handling procedures.

Proceedings of the 10th International Symposium on Veterinary Epidemiology and Economics, 2003
Available at www.sciquest.org.nz

Data Mining and Statistics

SAS Institute defines data mining as “the process of Selecting, Exploring, Modifying, Modeling, and Assessing (‘SEMMA’) large amounts of data to uncover previously unknown patterns that can be utilized as a business advantage.” We followed a SEMMA process using some of the tools available in EM, as outlined by the diagram provided in EM (Figure 1).

Figure 1. Diagram of the data mining process in Enterprise Miner, SAS.
In brief, the SEMMA process proposed by SAS and supported by EM is:

1) Specifying the input data set, in this case the merged health, production and questionnaire data (oeko.hbspsv3), and defining a target variable, here the binary variable separating the observations into organic and conventional dairy farms.
2) Exploring the data graphically, e.g. with Multiplot or the graphical procedures in Insight.
3) Modifying the data by:
   a. creating partitioned data sets; in this case the data were split into three subsets (training, validation and test data sets) by simple random sampling. Other sampling options are also available.
   b. replacing missing values by an appropriate method, e.g. the mean for interval variables and the most common value for binary, nominal or ordinal variables. (Transformations can also be performed.)
   c. variable selection, to preselect important input variables.
4) Modeling; in this case we used the predictive modelling tools:
   a. logistic regression model, which by default, given the binary target variable, fits a logistic regression model with independent group variables coded by either GLM (dummy) or Deviation (effect) coding.
   b. classification tree model.
   c. neural network model, which enables one to fit nonlinear models such as a multilayer perceptron (MLP). Neural networks are a class of software algorithms originally developed by computer scientists in a subdiscipline generally called artificial intelligence.
5) Assessing, which provides information on the ability of the developed models to predict the target variable through the following charts: lift, profit, return on investment, receiver operating characteristic (ROC) curves, diagnostic charts and threshold-based charts.
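The Modify step (3a, 3b) and the ROC chart of the Assess step (5) can be illustrated outside EM with a minimal Python sketch. This is not the Enterprise Miner implementation: the 40/30/30 split, the variable names (hay_pct, uses_fertilizer) and the choice to learn replacement values from the training set only are assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import mean


def partition(records, fractions=(0.4, 0.3, 0.3), seed=0):
    """Split records into training, validation and test sets by simple
    random sampling. The 40/30/30 fractions are an assumed example."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_valid = round(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


def fit_replacements(train, interval_vars, class_vars):
    """Learn replacement values from the training records only (an assumed
    design choice): mean for interval variables, mode for class variables."""
    fill = {}
    for var in interval_vars:
        fill[var] = mean(r[var] for r in train if r[var] is not None)
    for var in class_vars:
        observed = [r[var] for r in train if r[var] is not None]
        fill[var] = Counter(observed).most_common(1)[0][0]
    return fill


def replace_missing(records, fill):
    """Return copies of the records with missing values (None) replaced."""
    return [{k: (fill[k] if v is None and k in fill else v)
             for k, v in r.items()} for r in records]


def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs obtained by lowering
    the classification threshold over the distinct predicted scores.
    labels are 1 (e.g. organic) or 0 (e.g. conventional)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last record in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points
```

For a data set of the size used here, partition(range(481)) yields subsets of 192, 144 and 145 records under the assumed fractions. The replacement values are fitted on the training subset and then applied unchanged to the validation and test subsets, so that no information from the held-out data leaks into model fitting.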
Results

The classification tree separated the organic and the conventional dairy farmers first by a questionnaire variable in which the farmers were asked to score the importance of fertilizer for the plants, and second by the percentage of concentrate in the feed ration. In the regression procedure, the most important variables for separating the two groups were the percentage of hay in the feed ration and the need for fertilizer. Unfortunately, it is difficult to assess the importance of individual inputs to the classification from the neural network. The ROC model assessment (Figure 2) shows that the regression procedure provided the best classification, followed by the tree model. From the 40th percentile onwards, the neural network outperformed the tree model.

Figure 2. ROC for the applied predictive modelling procedures.

Discussion

Modern data mining tools such as EM offer a way to organize and streamline the process of analyzing primary data collected for a specific purpose and secondary data collected for other purposes. The data mining tools allow researchers to quickly gain an overview of the data, identify difficulties and potentially interesting patterns and relationships, and develop and assess models. But, like all tools, they have their uses and misuses. An understanding of both the mathematical modelling and the computational algorithms is essential to grasp the complexity of data mining [1].

References

1. Hand, D., Mannila, H., Smyth, P. Principles of Data Mining. The MIT Press, Cambridge, Massachusetts, USA, 2001, p. 22.