9th International Research/Expert Conference
“Trends in the Development of Machinery and Associated Technology”
TMT 2005, Antalya, Turkey, 26-30 September, 2005
WINE RECOGNITION ANALYSIS BY USING DATA MINING
Kivanc Kilicer
T.C. Bahcesehir University
Bahcesehir, Istanbul
Turkey
Adem Karahoca
T.C. Bahcesehir University
Bahcesehir, Istanbul
Turkey
ABSTRACT
The aim of this study is to evaluate and understand the indicators of wine quality by using data
mining methods. We used the Java-based program “Weka” to compare the effects of 13 constituents
found in each of three types of wine. The Wine recognition dataset contains the results of a chemical
analysis of 178 wines grown in the same region of Italy but derived from three different cultivars.
The analysis yielded 13 measurements for each wine. This dataset is often used to test and compare
the performance of various classification algorithms.
Keywords: Data Mining, Wine, Classification, Clustering, Discretization, Bayes, K-means
1. INTRODUCTION
The Wine recognition dataset contains the results of a chemical analysis of 178 wines grown in the
same region of Italy but derived from three different cultivars. The analysis yielded 13 measurements
for each wine. This dataset is often used to test and compare the performance of various
classification algorithms. The analysis determined the quantities of 13 constituents found in each of
the three types of wines [1, 5].
These are:
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline
By using data mining methods, our main goal is to evaluate these measurements and to understand the
indicators of wine classification.
PDF created with pdfFactory Pro trial version www.pdffactory.com
2. MATERIALS AND METHODS
To evaluate our dataset, we will use the pre-processing, classification and clustering methods of
data mining. Association rule mining is not applicable because all attribute values are numeric.
In preprocessing, our main aim is to summarize the data in the best way. With the help of
discretization and range removal, we will try to visualize the constituent ratios found in the three
types of wine.
In classification, our main aim is to build decision trees from the data and to make predictions that
are as close to reality as possible. We will use the J48 classifier to build the trees. To understand
entropy and gain values, we will use Bayes' theorem. To reach the best results we will also use the
SMO and Multilayer Perceptron classifiers [2].
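The entropy and gain values mentioned above can be illustrated with a short sketch. It computes the entropy of the class distribution of this dataset (59, 71 and 48 wines) and the information gain of one candidate split. The function names are our own and are not part of Weka, and the split counts are hypothetical values chosen only for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts_list):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(
        sum(child) / total * entropy(child) for child in child_counts_list
    )
    return entropy(parent_counts) - weighted


# Class sizes of the wine dataset: Class1, Class2, Class3.
class_counts = [59, 71, 48]
base_entropy = entropy(class_counts)

# Hypothetical split into two branches (illustrative counts only,
# not taken from the Weka output).
left = [0, 10, 48]
right = [59, 61, 0]
gain = information_gain(class_counts, [left, right])

print(f"entropy before split: {base_entropy:.3f} bits")
print(f"information gain:     {gain:.3f} bits")
```

A split with a higher gain separates the classes better. J48 is based on C4.5, which ranks candidate splits by gain ratio, a normalized variant of this quantity.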
During classification, we will also make a prediction with the help of the Weka software. In
clustering, our main aim is to observe the effects of the variables in one chart. Cobweb and K-means
will be helpful methods in our investigation for obtaining the cluster means and making judgements
about the constituents.
3. RESULTS
We used 178 different wines, of which 59 belong to “Class1”, 71 belong to “Class2” and 48 belong to
“Class3”. In the figures, “Class1” is shown in blue, “Class2” in red and “Class3” in cyan. Our dataset
produced the graphs of the 13 constituents present in these three classes:
Figure 1. Thirteen constituents in three different classes of wine quality
3.1. Preprocessing by Discretization
To obtain a clearer picture of our dataset, we discretized the data into 12 bins for each
constituent, using equal-frequency binning:
Figure 2. Discretizing the constituents into twelve bins
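Equal-frequency discretization into bins can be sketched as follows. This is a simplified rank-based version, not Weka's own filter, which chooses cut points between values and may therefore differ in detail (for example in how ties are handled). The function name and the sample values are hypothetical.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts.

    Values are ranked from smallest to largest and the ranks are divided
    evenly among the bins. Tied values may land in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


# Hypothetical values for one constituent (not taken from the dataset):
# 24 distinct numbers in a scrambled order.
sample = [11.0 + 0.15 * ((7 * i) % 24) for i in range(24)]
bins = equal_frequency_bins(sample, 12)

for b in range(12):
    members = sorted(v for v, k in zip(sample, bins) if k == b)
    print(b, [round(v, 2) for v in members])
```

With 24 values and 12 bins, every bin receives two values, regardless of how unevenly the values are spread over their range.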
3.2. Classification by Bayes' Theorem
Using the NaiveBayes classification method, we obtained prior probabilities P(h) of 33% for Class 1,
40% for Class 2 and 27% for Class 3. The Bayes classification method gave a good percentage of correct
predictions. Posterior probabilities are also shown below. The entropy value is close to zero, which
means that almost no surprises are expected in our predictions [3].
Figure 3. The prediction of Bayes' theorem can be accepted at a level of 98.3 per cent
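The prior probabilities follow directly from the class sizes, and Bayes' theorem combines them with class-conditional likelihoods to give posteriors. The sketch below reproduces the priors from the counts reported above. The likelihood values are hypothetical and serve only to show the calculation; they are not taken from the Weka output.

```python
# Class sizes of the wine dataset.
class_counts = {"Class1": 59, "Class2": 71, "Class3": 48}
total = sum(class_counts.values())

# Prior P(h) = number of wines in the class / total number of wines.
priors = {label: count / total for label, count in class_counts.items()}


def posteriors(priors, likelihoods):
    """Bayes' theorem: P(h | x) = P(x | h) P(h) / sum over h' of P(x | h') P(h')."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Hypothetical likelihoods P(x | h) for a single wine x (illustrative only).
likelihoods = {"Class1": 0.001, "Class2": 0.010, "Class3": 0.200}
post = posteriors(priors, likelihoods)

for label in class_counts:
    print(f"{label}: prior {priors[label]:.1%}, posterior {post[label]:.1%}")
```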
We used other classification methods such as Multilayer Perceptron [4] and SMO, where we also
obtained good results. As an example, we make a prediction with SMO by adding an instance whose class
is “?” to the dataset and choosing “output prediction” in the options menu.
The last line of the dataset is:
“13.42,4.65,2.55,20,93,3,.9,.47,1.32,4.3,.94,2.35,580,?”
The table below shows that Weka’s prediction is “class 3” with 95% certainty. This means that a wine
with these values for each constituent is expected to belong to class 3.
Figure 4. The output prediction of SMO shows that the wine falls in class 3 when the constituent
levels are 13.42, 4.65, 2.55, 20, 93, 3, 0.9, 0.47, 1.32, 4.3, 0.94, 2.35, 580 respectively
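To make the unlabeled instance easier to read, the sketch below maps its values to the 13 constituents. It assumes that the columns follow the order of the list in the Introduction and that the class is the last field. The function name is hypothetical and the code is independent of Weka.

```python
# Attribute names in the order listed in the Introduction (assumed column order).
ATTRIBUTES = [
    "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
    "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
    "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline",
]


def parse_instance(line):
    """Split a comma-separated row into attribute values and a class label.

    A trailing "?" marks an unknown class, which is returned as None.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != len(ATTRIBUTES) + 1:
        raise ValueError(
            f"expected {len(ATTRIBUTES) + 1} fields, got {len(fields)}"
        )
    values = {name: float(raw) for name, raw in zip(ATTRIBUTES, fields)}
    label = None if fields[-1] == "?" else fields[-1]
    return values, label


row = "13.42,4.65,2.55,20,93,3,.9,.47,1.32,4.3,.94,2.35,580,?"
values, label = parse_instance(row)

for name, value in values.items():
    print(f"{name}: {value}")
print("class:", label)
```

Under this column order, the instance has a flavanoid value of 0.9, a color intensity of 4.3 and a proline value of 580.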
3.3. Decision Trees and Clustering
Our analysis with the Weka J48 decision tree classifier showed that there is an important
relationship between flavanoids and color intensity. Color intensity is a good indicator for defining
the wine type. In addition, the amount of proline can affect the class if the flavanoid amount is
above 1.57. Using the Cobweb and K-means clusterers [6], we saw the same relationship between
flavanoids and color intensity. Overall, the flavanoid, proline and color intensity variables affect
the class of wine the most.
Figure 5. Clustering
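The K-means procedure can be sketched in a few lines. This is a plain version of Lloyd's algorithm with fixed starting centroids, not Weka's implementation. The two-dimensional points stand for (flavanoids, color intensity) pairs, but their values are made up for illustration and are not measurements from the dataset.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm) from given starting centroids."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        new_assignment = [
            min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignment, centroids


# Hypothetical (flavanoids, color intensity) pairs, three loose groups.
points = [
    (3.0, 5.5), (2.8, 5.0), (3.2, 6.0),
    (2.0, 3.0), (1.8, 2.8), (2.2, 3.2),
    (0.7, 7.5), (0.6, 8.0), (0.9, 7.0),
]
start = [points[0], points[3], points[6]]
assignment, centroids = kmeans(points, start)

print(assignment)
print([tuple(round(c, 2) for c in centroid) for centroid in centroids])
```

The final centroids play the role of the cluster means discussed above. Comparing them across variables shows which constituents separate the groups.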
4. CONCLUSIONS AND DISCUSSION
Our classification, clustering and decision tree methods show that the most important factors in
defining the wine quality are the color intensity, flavanoid, alcohol and proline attributes. In
addition, the three classes are clustered in different areas, which means that the classification of
the wine types is well made. The wines are more than 95% separable into the three classes mentioned.
With this reliable analysis, it is now easier to determine the type of an unknown third-party wine,
because our output prediction worked very well in the experiment described above. A future analysis
may help us to divide the regions of Italy and other countries into different clusters according to
the types of cultivation in grape fields.
5. REFERENCES
[1] Kment, Petr and Mihalijevic, Martin, Differentiation of Czech Wines Using Multielement Composition - A
Comparison With Vineyard Soil, Faculty of Science, Institute of Geochemistry, Mineralogy and Mineral
Resources, Charles University, Czech Republic
[2] S. Aeberhard, D. Coomans and O. de Vel, Comparison of Classifiers in High Dimensional Settings, Tech.
Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook
University of North Queensland
[3] S. Aeberhard, D. Coomans and O. de Vel, The Classification Performance of RDA, Tech.
Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook
University of North Queensland
[4] Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation,
Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy
[5] Ying, Guang-Guo and Williams, Bryan, Dissipation of Herbicides in Soil and Grapes in a South Australian
Vineyard (1999), Department of Environmental Science and Management, University of Adelaide, Australia
[6] J. Leonard and P. Andrieux, Infiltration Characteristics of Soils in Mediterranean Vineyards in Southern
France (1998), INRA, UFR Science du Sol, 2 Place Viala, 34060, Montpellier cedex 2, France