Supervised Data Classification of an Image
Segmentation Dataset (April 2016)
Aden G. Gleitz

Abstract—This paper documents the data mining of the Image Segmentation Dataset from the UCI Machine Learning Repository. The focus is on the different types of classifiers that can be used to determine the class of an instance. The paper also records the steps taken to improve the performance of the generated models and to determine the most important features in the dataset. It is found that the classes of the Image Segmentation Dataset can be accurately determined by a small number of rules. These rules use the color values measured over small sections of an image to determine what each section of the image depicts.
Index Terms—Data, dataset, mining, image, segmentation, k-means, IBK, J48, tree, learning, supervised, classification, attributes, instances, scatterplot, matrix.
I. INTRODUCTION
This report is a journal of the information found while data mining the Image Segmentation dataset from the UCI Machine Learning Repository, along with the approach used to obtain it. The dataset consists of nineteen attributes and over two thousand three hundred instances. It was constructed by segmenting seven different outdoor images into three-by-three blocks of pixels. For each of these blocks of pixels, the element of the image it was a part of was recorded as the class feature. Several other attributes record different color, hue, saturation, and intensity values of the center pixels in that block.
Multiple other journals and papers have cited this dataset in their works. The Journal of Machine Learning Research published the paper “Cluster Ensembles for High Dimensional Clustering: An Empirical Study,” in which this dataset was used to test clustering using three different construction methods. Another work, published in Pattern Recognition, cites this dataset as a test set for a global k-means clustering algorithm, which chooses favorable initial positions from which to start clustering.

This work was submitted on April 27, 2016. This work was supported in part by the Indiana University Southeast Informatics Department and Prof. C. J. Kimmer.
II. METHODS
The methods used to mine this dataset follow the
standard data mining process. The very first part is
selecting the data. For this dataset, the first task was to determine what type of data had been recorded. Every feature is numerical except for the class feature, which takes one of seven nominal values. This rules out filters and classifiers, such as ID3, that only work with nominal data. The main task to perform on this dataset is classification. Because the class feature is given, supervised learning can be used to create models that determine which class each instance belongs to.
In the preprocessing step of the data mining process, I had to decide whether any attributes needed to be left out. This was done partly by hand and partly with WEKA, the data analysis tool used for the rest of the machine learning process. The first feature removed from the Image Segmentation dataset was the ‘Region Pixel Count’ feature, because every instance had the same value of nine for it. Although leaving this feature in may not have affected any results, removing it reduces the number of attributes.
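As an illustration, removing a constant attribute can also be scripted with WEKA’s Java API rather than done in the Explorer. The following is a minimal sketch only; the file name segment.arff and the attribute name region-pixel-count are assumptions about how the local copy of the dataset is stored.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DropRegionPixelCount {
    public static void main(String[] args) throws Exception {
        // Load a local ARFF copy of the UCI Image Segmentation data (path is illustrative).
        Instances data = DataSource.read("segment.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Remove the constant 'Region Pixel Count' attribute by looking it up by name.
        Remove remove = new Remove();
        int idx = data.attribute("region-pixel-count").index() + 1; // Remove expects 1-based indices
        remove.setAttributeIndices(String.valueOf(idx));
        remove.setInputFormat(data);
        Instances trimmed = Filter.useFilter(data, remove);
        System.out.println("Attributes after removal: " + trimmed.numAttributes());
    }
}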
A. G. Gleitz is with Indiana University Southeast, 4201 Grant Line
Rd. New Albany, In 47150 USA (e-mail: [email protected]).
Before performing any further attribute selection or
preprocessing of the data, I started to apply different
classifiers with default parameters to see if I could gather
any good rules from the dataset as is. The first classifier
that I ran was ‘Instance Based Learning’ (IBK) in WEKA.
The IBK classifier is also known as K-Nearest Neighbor
(KNN). It is considered a lazy classifier because it waits until an instance is added or queried before determining that instance’s class. With the default parameters, IBK was able to create a model that was over ninety-seven percent accurate at classifying the instances. The default parameter is one nearest neighbor, so the class of the single closest instance is the class that is assigned.
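A minimal sketch of this step with the WEKA Java API is shown below. It assumes the dataset is saved locally as segment.arff with the class attribute last, and that the evaluation uses the Explorer’s default ten-fold cross-validation; the random seed is likewise an assumption.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IBkBaseline {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment.arff");   // illustrative path
        data.setClassIndex(data.numAttributes() - 1);

        IBk ibk = new IBk();                                 // default K = 1, plain nearest neighbor
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ibk, data, 10, new Random(1));
        System.out.printf("IBK accuracy: %.4f %%%n", eval.pctCorrect());
        System.out.println(eval.toMatrixString());           // confusion matrix, as in Appendix A
    }
}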
The next classifier used is J48, WEKA’s version of the
C4.5 algorithm. This is a decision tree classifier in which a sequence of comparisons is made to determine which class an instance should be placed in. These comparisons form a hierarchical tree structure that is followed during classification. Using this classifier with the default parameters, it was able to correctly classify just under ninety-seven percent of the instances. The tree model that was generated consists of seventy-seven different comparisons and has thirty-nine leaves. Even though this is a decently accurate model, it is difficult to explain the process by which it determines the class. There are only seven classes, but the tree is too large to gather much knowledge about any of them.
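The J48 run can be sketched in the same way; the tree text, leaf count, and size reported in Appendix B come from a model built on the full data, while the accuracy comes from cross-validation. The file path and the ten-fold setup are again assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Baseline {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();                    // defaults: confidence factor 0.25, min. 2 objects per leaf
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.printf("J48 accuracy: %.4f %%%n", eval.pctCorrect());

        tree.buildClassifier(data);              // build once on all data to inspect the tree itself
        System.out.println(tree);                // pruned tree, as listed in Appendix B
        System.out.println("Leaves: " + (int) tree.measureNumLeaves()
                + ", tree size: " + (int) tree.measureTreeSize());
    }
}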
After this I chose to use one more classifier called
Naïve Bayes. This classifier assumes that the features in
the dataset are all statistically independent. Probabilities
are then calculated for each class based on each feature
value of the instance. The largest probability is then what
is used to classify that instance. Starting again with the default parameters, I created a model based on this method of classification. The Naïve Bayes model predicted eighty percent of the instances in the dataset correctly. This lower accuracy led me to suspect that there may be a slight statistical dependence between features. From here I went back to the beginning of the process, to attribute selection and cleaning of the data.
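For completeness, the default Naïve Bayes run follows the same pattern; only the classifier class changes. This is a sketch under the same assumptions about the file path and evaluation setup.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesBaseline {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Default Naive Bayes: Gaussian estimates for each numeric attribute, independence assumed.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.printf("Naive Bayes accuracy: %.4f %%%n", eval.pctCorrect());
    }
}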
In revisiting the selection of the data, I considered both attribute selection and normalization. All of the features in the dataset are numeric, so it is a good candidate for normalization, which would bring all of the values to between zero and one. Doing this pulls the information closer together and helps with values that are very distant from others. Performing further attribute selection would also help to remove any features that are not helpful or that are statistically dependent on another feature. Therefore, this was the next method of processing the data that I pursued. I used WEKA’s attribute selection filter with the best-first search method to determine the best features of this dataset to keep for further processing.
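A sketch of these two preprocessing steps with the WEKA API is given below. The text only names the best-first search, so the CFS subset evaluator shown here is the filter’s default and is an assumption, as is the input path.

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.Normalize;

public class ReduceAndNormalize {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Scale every numeric attribute into the [0, 1] range.
        Normalize norm = new Normalize();
        norm.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, norm);

        // Supervised attribute selection: subset evaluator plus best-first search.
        AttributeSelection select = new AttributeSelection();
        select.setEvaluator(new CfsSubsetEval());
        select.setSearch(new BestFirst());
        select.setInputFormat(normalized);
        Instances reduced = Filter.useFilter(normalized, select);
        System.out.println("Attributes kept (excluding class): " + (reduced.numAttributes() - 1));
    }
}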
After running attribute selection, the dataset was reduced to seven features plus the class feature. The features that were removed were the line-density measures, along with a couple that shared mostly the same values across instances. With this reduced set of attributes, I moved forward to rerunning the classifiers to see the new models.
Using the IBK classifier again on the reduced set of
attributes, I checked to see if any improvements were
made at classifying the instances. There was only a slight
improvement on the number of correctly identified
instances, about three-tenths of a percent. Knowing that
these instances are each sections of pixels of an image, it
is possible that changing the parameters this classifier
uses can lead to better results. The main parameter of the instance-based classifier is ‘K’, the number of nearest instances used when classifying a new one. Using the software’s built-in parameter selection feature, I tested ten values of K, from one to ten, to find the best number of neighbors for this dataset. The search determined that the best number of neighbors is only one, so it did not improve on the current model.
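This parameter search over K can be expressed with WEKA’s CVParameterSelection meta-classifier, sketched below. The reduced dataset is assumed to have been saved as segment-reduced.arff; the string "K 1 10 10" asks for ten steps between one and ten.

import java.util.Arrays;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.CVParameterSelection;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneIBkK {
    public static void main(String[] args) throws Exception {
        Instances reduced = DataSource.read("segment-reduced.arff"); // illustrative path
        reduced.setClassIndex(reduced.numAttributes() - 1);

        // Search K over 1..10 using internal cross-validation.
        CVParameterSelection tuner = new CVParameterSelection();
        tuner.setClassifier(new IBk());
        tuner.addCVParameter("K 1 10 10");
        tuner.buildClassifier(reduced);
        System.out.println("Best options: " + Arrays.toString(tuner.getBestClassifierOptions()));

        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(tuner, reduced, 10, new Random(1));
        System.out.printf("Tuned IBK accuracy: %.4f %%%n", eval.pctCorrect());
    }
}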
It was then decided to revisit the decision tree classifier,
J48, with the reduced attribute set to make improvements
on the previous model. The tree that was created in this
model was larger than the previous one created with the
full dataset. In addition to the size of the tree increasing, the percentage of correctly identified instances dropped slightly. The next approach I took with this classifier was to adjust the parameters used to build the tree model: the confidence factor of the algorithm and the minimum number of objects that can be in each branch. Using automated parameter selection, I tested five values of the confidence factor at 0.1 intervals and five values of the minimum-objects parameter, from one to five. The results became slightly better than with the default parameters, by a few tenths of a percent; however, the tree again became larger to achieve this. This was expected, as becoming more accurate requires creating more rules to fit the data.
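The same meta-classifier handles the two J48 parameters at once. The sketch below assumes the reduced ARFF file from the attribute-selection step and encodes the five confidence-factor values and the five minimum-object values described above.

import java.util.Arrays;
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneJ48 {
    public static void main(String[] args) throws Exception {
        Instances reduced = DataSource.read("segment-reduced.arff"); // illustrative path
        reduced.setClassIndex(reduced.numAttributes() - 1);

        CVParameterSelection tuner = new CVParameterSelection();
        tuner.setClassifier(new J48());
        tuner.addCVParameter("C 0.1 0.5 5");   // confidence factor: 0.1, 0.2, ..., 0.5
        tuner.addCVParameter("M 1 5 5");       // minimum objects per branch: 1..5
        tuner.buildClassifier(reduced);
        System.out.println("Best options: " + Arrays.toString(tuner.getBestClassifierOptions()));
    }
}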
The next step that was taken was to rerun the Naïve
Bayes classifier with the reduced feature set. I predicted that this newer classification would do better than the original: since Naïve Bayes relies on statistical independence, some of the features that were removed may have been dependent on others. The percentage of correctly identified instances did increase as predicted, to almost eighty-seven percent correct. This is an increase of over six and a half percentage points compared to the full dataset.
These steps only focused on a few different types of
classification methods, so I then decided to see whether any other classifiers could generate a good model for this data. The first one I tried is one of the simplest classification rules, called Zero Rule. This method takes the largest class in the dataset and classifies every instance as that class. It was only fourteen percent accurate. This method is far from ideal, especially for datasets with many class values; this dataset has seven, with instances evenly distributed between the classes.
Another similar classification method is the One Rule
method. This is the next step I took in mining this dataset.
In the same way that Zero Rule takes a simple approach
to classification, One Rule takes a single feature that best
identifies the class and then only uses that feature to
determine the class of that instance. The parameter used here is minBucketSize, which determines how many instances must be in each bucket. The default for this is six, and it produced a rule that was sixty-three percent accurate. This was better than I expected for classification based on a single feature. I then used automated parameter selection to test thirty values of the minimum bucket size to find the best one. The value determined to be most ideal was thirteen, but it only contributed an additional half a percent of correctly classified instances.
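Both of these simple rule learners can be sketched in a few lines: ZeroR needs no parameters, and OneR exposes minBucketSize. The bucket size of thirteen below reflects the parameter search described above; the file path and ten-fold evaluation are assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleBaselines {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // ZeroR: always predict the majority class (roughly one-seventh of this balanced dataset).
        Evaluation zero = new Evaluation(data);
        zero.crossValidateModel(new ZeroR(), data, 10, new Random(1));
        System.out.printf("ZeroR accuracy: %.2f %%%n", zero.pctCorrect());

        // OneR: pick the single most predictive attribute; the bucket size controls its discretization.
        OneR oner = new OneR();
        oner.setMinBucketSize(13);
        Evaluation one = new Evaluation(data);
        one.crossValidateModel(oner, data, 10, new Random(1));
        System.out.printf("OneR accuracy: %.2f %%%n", one.pctCorrect());
    }
}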
Another notable classifier I used is called JRip. JRip creates a model that is a list of rules for classifying an instance. This method produced a model that was able to correctly identify instances over ninety-five percent of the time. More importantly, the model uses only seventeen rules to determine the class feature that accurately.
Fig. 1. This graph shows the dataset plotted by the rawred and rawblue features, two features that JRip relies on heavily in its rules. It shows that one class stands out, but the other six are clustered together.
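The JRip run can be sketched in the same style as the earlier classifiers; printing the built classifier yields the ordered rule list reproduced in Appendix C. The path and the ten-fold evaluation are assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class JRipRules {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment.arff");
        data.setClassIndex(data.numAttributes() - 1);

        JRip ripper = new JRip();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ripper, data, 10, new Random(1));
        System.out.printf("JRip accuracy: %.4f %%%n", eval.pctCorrect());

        ripper.buildClassifier(data);
        System.out.println(ripper);   // prints the ordered rule list, as in Appendix C
    }
}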
Another essential aspect of machine learning is the
ability to find clusters in the data. This was the next step
that was taken when determining information from this
dataset. I used the Simple K Means clusterer on the full feature set with the default parameters. With the algorithm’s default of two clusters, it did not provide any useful information, as I expected. I then reran the clusterer and forced it to find seven different clusters in the data. This worked better, as it created labeled clusters similar to the different classes. It did, however, combine two class values, cement and path, into one cluster, and it split the grass class value into two different clusters.
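A sketch of this clustering step is shown below. k-means must not see the class attribute, so it is removed before building the clusterer; a classes-to-clusters comparison is then attempted with ClusterEvaluation. The input path is an assumption, and this is a sketch rather than the exact run described above.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class SevenClusters {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Cluster on a copy of the data with the class attribute removed.
        Remove remove = new Remove();
        remove.setAttributeIndices(String.valueOf(data.classIndex() + 1)); // 1-based index
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(7);          // one cluster per class value instead of the default two
        kmeans.buildClusterer(noClass);

        // Classes-to-clusters style evaluation: map each cluster to its majority class and compare.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}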
After these attempts to create models of the data, I went
back again to preprocessing to attempt to clean up the data
any further. I decided to try to remove any possible
outliers in the data. This was accomplished by first standardizing the data. I then manually examined each attribute to see whether it followed a standard bell-curve distribution. Unfortunately, none of these features had a standard distribution, so removing instances on this basis would be likely to throw out potentially relevant data instead of removing likely outliers.
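The standardization step, plus a rough check of how far each attribute is from a bell-curve shape, can be sketched as follows. Comparing the mean to the median is only a crude stand-in for the manual inspection described above, and the file path is an assumption.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Standardize;

public class StandardizeAndInspect {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Standardize every numeric attribute to zero mean and unit variance.
        Standardize std = new Standardize();
        std.setInputFormat(data);
        Instances standardized = Filter.useFilter(data, std);

        // Crude skew check per attribute: a symmetric, bell-shaped attribute has mean close to median.
        for (int i = 0; i < standardized.numAttributes(); i++) {
            if (i == standardized.classIndex()) continue;
            double mean = standardized.meanOrMode(i);
            double median = standardized.kthSmallestValue(i, standardized.numInstances() / 2);
            System.out.printf("%-22s mean=%6.3f median=%6.3f%n",
                    standardized.attribute(i).name(), mean, median);
        }
    }
}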
III. RESULTS
The results of this project show that there are several
good classification methods when it comes to gathering
information about this dataset. The fact that every feature in the dataset is numerical makes classifying the data a challenge.
After examining the data with different algorithms and
models it was determined that one can accurately identify
the class based on the other features. Through the data mining process, I visualized the data and then processed all the instances through tree structures, rule models, and lazy classifiers. I then returned to the top of this process, was more selective about the data chosen in the preprocessing stage, and then optimized the parameters of the models to yield better results.
It was determined that overall the best classifier is the
Instance Based classifier using the nearest neighbors to
determine the class feature. Without any preprocessing, this method produced ninety-seven percent accurate results. Because this dataset is a collection of parts of an image, it is likely, although not guaranteed, to have a smaller percentage of outliers than a dataset of, for example, lab results.
Another result that was found is that one can create a list
of rules that do a respectable job of defining the class
feature. This was found by using the J48 classifier and
JRip. J48 created a list of rules in a tree structure that was
able to determine a fairly accurate method of
classification. However, the rules that were produced
from the JRip model are much more useful in this dataset.
While this classifier produces slightly less accurate results than the J48 model, it is much easier to read and understand, because the entire model consists of only seventeen rules to evaluate. This is in contrast to the tree, which has a size of seventy-seven nodes, each requiring an evaluation at a branch. In addition to being easier to read, this much lower number of rules is easier to explain to someone when discussing the data.
Furthermore, another result found while mining this image segmentation dataset is that there are features that do not add to the accuracy of the data model. These are features that share the same value across every instance, as well as attributes that have a statistical dependence on another feature. This was demonstrated by the Naïve Bayes classifier during the mining process. When computed with the full dataset, this classifier correctly identified only eighty percent of the instances. In the preprocessing of the data, it was determined that twelve attributes could be removed because they did not add to the predictive ability of the model or added redundancy to the data. Upon removing these attributes and recomputing the Naïve Bayes model, its accuracy increased by over six percentage points. This result shows that there was a statistical dependence among those features.
IV. CONCLUSION
It was determined that, given segments of images and the color values of their pixels, one can determine with about ninety-seven percent accuracy which element of the outdoor scene a segment of the picture depicts. This was established by using classification methods to create rules for the instances.
In conclusion, it was determined that the pixel count and line density features should be removed from the dataset because they do not provide additional information and have a statistical dependence on other features. It has been shown that this dataset can be reduced to just seven features and one class feature while still producing accurate predictive models. The simplest way of creating these results and of conveying these findings is the JRip model, because it uses the smallest number of rules for prediction.
REFERENCES
[1] A. Likas, N. Vlassis and J. J. Verbeek, "The global k-means clustering algorithm", Pattern Recognition, vol. 36, no. 2, pp. 451-461, 2003.
[2] X. Fern and C. Brodley, "Cluster ensembles for high dimensional clustering: an empirical study", Technical Reports (Electrical Engineering and Computer Science), 2006.
[3] "UCI Machine Learning Repository: Image Segmentation Data Set", Archive.ics.uci.edu, 2016. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Image+Segmentation. [Accessed: 14-Jan-2016].
[4] A. Bagirov, A. Rubinov, N. Soukhoroukova and J. Yearwood, "Unsupervised and supervised data classification via nonsmooth and global optimization", Top, vol. 11, no. 1, pp. 1-75, 2003.
[5] K. Doherty, R. Adams and N. Davey, "Unsupervised learning with normalised data and non-Euclidean norms", Applied Soft Computing, vol. 7, no. 1, pp. 203-210, 2007.
[6] I. Witten and E. Frank, Data Mining. Amsterdam: Morgan Kaufmann, 2005.
Appendix A
IBK Summary and Confusion Matrix
=== Summary ===
Correctly Classified Instances        2244               97.1429 %
Incorrectly Classified Instances        66                2.8571 %
Kappa statistic                          0.9667
Mean absolute error                      0.0089
Root mean squared error                  0.0902
Relative absolute error                  3.645  %
Root relative squared error             25.7796 %
Total Number of Instances             2310
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.994     0.003     0.985       0.994    0.989       0.996      brickface
1         0         1           1        1           1          sky
0.961     0.01      0.943       0.961    0.952       0.976      foliage
0.955     0.008     0.952       0.955    0.953       0.974      cement
0.9       0.012     0.928       0.9      0.914       0.947      window
1         0.002     0.991       1        0.995       0.999      path
0.991     0         1           0.991    0.995       0.996      grass
Weighted Avg.   0.971   0.005   0.971    0.971       0.971      0.984
=== Confusion Matrix ===
a b c d e f g <-- classified as
328 0 0 0 2 0 0 | a = brickface
0 330 0 0 0 0 0 | b = sky
0 0 317 2 11 0 0 | c = foliage
2 0 2 315 9 2 0 | d = cement
3 0 17 13 297 0 0 | e = window
0 0 0 0 0 330 0 | f = path
0 0 0 1 1 1 327 | g = grass
Appendix B
J48 Tree
J48 pruned tree
------------------

region-centroid-row <= 155
| rawred-mean <= 27.2222
| | hue-mean <= -1.89048
| | | hue-mean <= -2.24632: foliage (160.0/1.0)
| | | hue-mean > -2.24632
| | | | saturation-mean <= 0.772831
| | | | | region-centroid-col <= 110
| | | | | | rawred-mean <= 0.666667
| | | | | | | region-centroid-row <= 150: foliage (14.0/1.0)
| | | | | | | region-centroid-row > 150: window (2.0)
| | | | | | rawred-mean > 0.666667
| | | | | | | exred-mean <= -15.7778: foliage (10.0/2.0)
| | | | | | | exred-mean > -15.7778
| | | | | | | | hue-mean <= -2.03348
| | | | | | | | | rawblue-mean <= 31.6667
| | | | | | | | | | region-centroid-row <= 120: window (27.0)
| | | | | | | | | | region-centroid-row > 120
| | | | | | | | | | | exgreen-mean <= -7.11111: cement (14.0/1.0)
| | | | | | | | | | | exgreen-mean > -7.11111: window (13.0/1.0)
| | | | | | | | | rawblue-mean > 31.6667: cement (3.0)
| | | | | | | | hue-mean > -2.03348
| | | | | | | | | vedge-mean <= 2.44444
| | | | | | | | | | region-centroid-row <= 150: brickface (6.0/1.0)
| | | | | | | | | | region-centroid-row > 150: window (2.0)
| | | | | | | | | vedge-mean > 2.44444: cement (3.0)
| | | | | region-centroid-col > 110
| | | | | | exgreen-mean <= -14.3333: cement (11.0/1.0)
| | | | | | exgreen-mean > -14.3333
| | | | | | | rawred-mean <= 24.7778: window (169.0/8.0)
| | | | | | | rawred-mean > 24.7778
| | | | | | | | vedge-mean <= 1.72223: window (4.0)
| | | | | | | | vedge-mean > 1.72223: cement (7.0)
| | | | saturation-mean > 0.772831
| | | | | hue-mean <= -2.09121
| | | | | | region-centroid-row <= 132: foliage (94.0)
| | | | | | region-centroid-row > 132
| | | | | | | rawred-mean <= 0.444444
| | | | | | | | hedge-mean <= 0.277778
| | | | | | | | | hedge-mean <= 0.166667: window (9.0/1.0)
| | | | | | | | | hedge-mean > 0.166667
Aden Gleitz
| | | | | | | | | | region-centroid-col <= 86: window (3.0)
| | | | | | | | | | region-centroid-col > 86: foliage (4.0)
| | | | | | | | hedge-mean > 0.277778: foliage (18.0/1.0)
| | | | | | | rawred-mean > 0.444444: window (9.0/1.0)
| | | | | hue-mean > -2.09121
| | | | | | region-centroid-col <= 8: foliage (2.0)
| | | | | | region-centroid-col > 8: window (34.0)
| | hue-mean > -1.89048
| | | exgreen-mean <= -5
| | | | vedge-mean <= 2.77778
| | | | | exgreen-mean <= -7: brickface (295.0/2.0)
| | | | | exgreen-mean > -7
| | | | | | vedge-mean <= 0.888891: brickface (26.0)
| | | | | | vedge-mean > 0.888891: window (4.0/1.0)
| | | | vedge-mean > 2.77778
| | | | | region-centroid-row <= 107: brickface (6.0)
| | | | | region-centroid-row > 107: foliage (5.0/1.0)
| | | exgreen-mean > -5
| | | | rawgreen-mean <= 11.7778
| | | | | region-centroid-col <= 115: foliage (7.0/1.0)
| | | | | region-centroid-col > 115: window (58.0)
| | | | rawgreen-mean > 11.7778: grass (6.0)
| rawred-mean > 27.2222
| | rawblue-mean <= 91.4444
| | | hue-mean <= -2.21924: foliage (18.0)
| | | hue-mean > -2.21924: cement (265.0)
| | rawblue-mean > 91.4444: sky (330.0)
region-centroid-row > 155
| exblue-mean <= 9.77778: grass (325.0/1.0)
| exblue-mean > 9.77778
| | saturation-mean <= 0.386456
| | | region-centroid-row <= 159
| | | | hedge-mean <= 8.5: cement (3.0)
| | | | hedge-mean > 8.5: path (3.0)
| | | region-centroid-row > 159: path (327.0)
| | saturation-mean > 0.386456: cement (14.0)
Number of Leaves : 39
Size of the tree : 77
Time taken to build model: 0.44 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        2239               96.9264 %
Incorrectly Classified Instances        71                3.0736 %
Kappa statistic                          0.9641
Mean absolute error                      0.0104
Root mean squared error                  0.0914
Relative absolute error                  4.2494 %
Root relative squared error             26.1312 %
Total Number of Instances             2310
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.982     0.004     0.979       0.982    0.98        0.993      brickface
1         0.001     0.997       1        0.998       1          sky
0.936     0.01      0.942       0.936    0.939       0.977      foliage
0.948     0.006     0.966       0.948    0.957       0.976      cement
0.921     0.016     0.907       0.921    0.914       0.97       window
1         0.001     0.997       1        0.998       1          path
0.997     0.001     0.997       0.997    0.997       0.998      grass
Weighted Avg.   0.969   0.005   0.969    0.969       0.969      0.988
=== Confusion Matrix ===
a b c d e f g <-- classified as
324 0 1 2 3 0 0 | a = brickface
0 330 0 0 0 0 0 | b = sky
2 1 309 3 15 0 0 | c = foliage
3 0 0 313 13 0 1 | d = cement
2 0 18 6 304 0 0 | e = window
0 0 0 0 0 330 0 | f = path
0 0 0 0 0 1 329 | g = grass
Appendix C
JRIP Rules and Summary
JRIP rules:
===========
(intensity-mean >= 26.1111) and (hue-mean >= -2.17447) and (region-centroid-row <= 159) and (intensity-mean <=
72.8889) and (rawgreen-mean >= 22.3333) => class=cement (281.0/0.0)
(vedge-mean >= 1.72222) and (region-centroid-row <= 160) and (region-centroid-row >= 146) and (hedge-sd <=
1.86667) and (saturation-mean <= 0.541667) => class=cement (20.0/1.0)
(region-centroid-row >= 123) and (hue-mean <= -2.10408) and (hue-mean >= -2.17535) and (rawred-mean >= 8)
and (region-centroid-row <= 156) => class=cement (19.0/1.0)
(intensity-mean >= 86.2963) => class=sky (330.0/0.0)
(hue-mean >= 1.28706) => class=grass (327.0/0.0)
(hedge-mean <= 0.777777) and (region-centroid-col >= 128) and (saturation-mean <= 0.533928) and (exred-mean
<= 0.111111) => class=window (91.0/0.0)
(rawred-mean <= 18.2222) and (region-centroid-col >= 152) and (rawblue-mean >= 9.55556) and (hue-mean >= -2.20829) => class=window (82.0/0.0)
(intensity-mean <= 3.7037) and (hue-mean >= -2.08783) and (region-centroid-col >= 34) => class=window
(62.0/1.0)
(hue-mean <= -2.0793) and (hue-mean >= -2.21646) and (rawred-mean >= 0.666667) and (rawred-mean <=
25.6667) and (exgreen-mean <= -6.22222) and (exblue-mean <= 33.6667) => class=window (51.0/2.0)
(vedge-mean <= 0.277778) and (region-centroid-row >= 131) and (region-centroid-col >= 125) => class=window
(8.0/1.0)
(exgreen-mean >= -6.11111) and (region-centroid-row >= 133) and (hue-mean >= -2.1753) and (exgreen-mean <=
-3.11111) and (region-centroid-col >= 38) => class=window (18.0/3.0)
(intensity-mean <= 2.96296) and (region-centroid-row >= 133) and (rawred-mean >= 0.888889) => class=window
(5.0/0.0)
(exgreen-mean >= -6.33333) and (region-centroid-row <= 133) => class=foliage (233.0/5.0)
(hue-mean <= -2.0944) and (region-centroid-row <= 145) => class=foliage (98.0/11.0)
(rawred-mean <= 18.4444) and (exred-mean <= -6) => class=foliage (13.0/4.0)
(region-centroid-row <= 149) => class=brickface (334.0/7.0)
=> class=path (338.0/10.0)
Number of Rules : 17
Time taken to build model: 0.86 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        2204               95.4113 %
Incorrectly Classified Instances       106                4.5887 %
Kappa statistic                          0.9465
Mean absolute error                      0.0172
Root mean squared error                  0.1115
Relative absolute error                  7.0261 %
Root relative squared error             31.8519 %
Total Number of Instances             2310
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.982     0.005     0.97        0.982    0.976       0.991      brickface
0.997     0         1           0.997    0.998       0.998      sky
0.927     0.018     0.897       0.927    0.912       0.972      foliage
0.93      0.012     0.93        0.93     0.93        0.977      cement
0.864     0.014     0.913       0.864    0.888       0.965      window
0.988     0.005     0.97        0.988    0.979       0.996      path
0.991     0.001     0.997       0.991    0.994       0.999      grass
Weighted Avg.   0.954   0.008   0.954    0.954       0.954      0.986
=== Confusion Matrix ===
a b c d e f g <-- classified as
324 0 1 3 2 0 0 | a = brickface
0 329 0 1 0 0 0 | b = sky
2 0 306 6 14 2 0 | c = foliage
3 0 6 307 10 3 1 | d = cement
5 0 24 11 285 5 0 | e = window
0 0 3 1 0 326 0 | f = path
0 0 1 1 1 0 327 | g = grass
Appendix D
Naïve Bayes Pre-Attribute Selection
=== Summary ===
Correctly Classified Instances        1853               80.2165 %
Incorrectly Classified Instances       457               19.7835 %
Kappa statistic                          0.7692
Mean absolute error                      0.0575
Root mean squared error                  0.229
Relative absolute error                 23.4869 %
Root relative squared error             65.4546 %
Total Number of Instances             2310
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.967     0.046     0.778       0.967    0.862       0.992      brickface
0.994     0         1           0.994    0.997       0.999      sky
0.145     0.011     0.686       0.145    0.24        0.956      foliage
0.864     0.028     0.836       0.864    0.849       0.966      cement
0.709     0.142     0.453       0.709    0.553       0.889      window
0.955     0.003     0.981       0.955    0.968       0.999      path
0.982     0         1           0.982    0.991       0.993      grass
Weighted Avg.   0.802   0.033   0.819    0.802       0.78       0.971
=== Confusion Matrix ===
a b c d e f g <-- classified as
319 0 0 7 4 0 0 | a = brickface
0 328 0 2 0 0 0 | b = sky
8 0 48 10 264 0 0 | c = foliage
25 0 5 285 9 6 0 | d = cement
58 0 12 26 234 0 0 | e = window
0 0 4 11 0 315 0 | f = path
0 0 1 0 5 0 324 | g = grass
Appendix E
Naïve Bayes Post-Attribute Selection
=== Summary ===
Correctly Classified Instances        2004               86.7532 %
Incorrectly Classified Instances       306               13.2468 %
Kappa statistic                          0.8455
Mean absolute error                      0.0456
Root mean squared error                  0.1672
Relative absolute error                 18.627  %
Root relative squared error             47.7778 %
Total Number of Instances             2310
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.855     0.041     0.777       0.855    0.814       0.987      brickface
0.994     0         1           0.994    0.997       0.999      sky
0.736     0.027     0.818       0.736    0.775       0.968      foliage
0.873     0.023     0.865       0.873    0.869       0.98       cement
0.648     0.064     0.629       0.648    0.639       0.921      window
0.976     0         1           0.976    0.988       1          path
0.991     0         1           0.991    0.995       0.997      grass
Weighted Avg.   0.868   0.022   0.87     0.868       0.868      0.979
=== Confusion Matrix ===
a b c d e f g <-- classified as
282 0 1 4 43 0 0 | a = brickface
0 328 0 2 0 0 0 | b = sky
10 0 243 5 72 0 0 | c = foliage
28 0 4 288 10 0 0 | d = cement
43 0 47 26 214 0 0 | e = window
0 0 1 7 0 322 0 | f = path
0 0 1 1 1 0 327 | g = grass