Recent Advances in Intelligent Information Systems
ISBN 978-83-60434-59-8, pages 429–442
A Closest Fit Approach to Missing Attribute Values

Jerzy W. Grzymala-Busse (1, 2)

(1) Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA
(2) Institute of Computer Science, Polish Academy of Sciences, 01–237 Warsaw, Poland
Abstract
Recently, results on a comparison of seven successful methods of handling missing
attribute values were reported. This paper describes experimental results on the
three most successful of these seven methods. Two of them, based on the Closest
Fit idea (searching the remaining data set for the closest fitting case and replacing
a missing attribute value by the corresponding known value from that case), were
enhanced by 12 strategies (two options for replacing missing attribute values, three
interpretations of missing attribute values, and two types of rules: certain and
possible). Our results show that for a given data set the best method of handling
missing attribute values should be selected individually, by testing the two main
methods: the Local Closest Fit method with four options (both types of missing
attribute value replacement and two interpretations of missing attribute values),
and Most Common Value for symbolic attributes and Average Value for numerical
attributes, both restricted to a concept. All of these methods are local, i.e.,
restricted to a concept, which indicates once again that local approaches are
better than global ones.
1 Introduction
It is well known that mining incomplete data, i.e., data in which some attribute
values are missing, is much more challenging than mining complete data. Research
on selecting the best methods of handling missing attribute values is important
for data mining since many real-life data sets are incomplete.
In general, methods to handle missing attribute values may be categorized as
sequential and parallel. In sequential methods some technique is used to handle
missing attribute values, then the main process of acquiring knowledge is conducted. In parallel methods missing attribute values are taken into account during
knowledge acquisition, i.e., both processes are performed in parallel.
Sequential methods include techniques based on deleting cases with missing
attribute values, replacing a missing attribute value by the most common value of
that attribute, assigning all possible values to the missing attribute value, replacing
a missing attribute value by the mean for numerical attributes, assigning to a
missing attribute value the corresponding value taken from the closest fit case, or
replacing a missing attribute value by a new value, computed from a new data set,
in which the original attribute is a decision.
Parallel methods to handle missing attribute values include the MLEM2 (Modified Learning from Examples Module, version 2) rule induction algorithm, in which
rules are induced from the original data set, with missing attribute values considered to be lost values, attribute-concept values, or "do not care" conditions, see,
e.g., Grzymala-Busse (2002). MLEM2 is an option of the LERS (Learning from
Examples based on Rough Sets) data mining system. The C4.5, see Quinlan
(1993), and CART, see Breiman et al. (1984), approaches to missing attribute
values are other examples of methods from the same group.
There are three important interpretations of missing attribute values. The
first is a lost value interpretation, Grzymala-Busse and Wang (1997); Stefanowski
and Tsoukias (1999), where we consider a missing attribute value as potentially
important but unavailable because it was mistakenly erased or mistakenly not
recorded. We cannot replace a lost value by any known (specified) attribute value.
In the second interpretation, called a "do not care" condition, Grzymala-Busse
(1991); Kryszkiewicz (1999), it is assumed that a missing attribute value was
irrelevant during data collection. Typically, a "do not care" condition can be
replaced by any possible value from the attribute domain.
The third possibility is an attribute-concept value, where a missing attribute
value may be replaced by any possible value from the attribute domain restricted
to the concept to which the case belongs. The concept (or class) is the set of
all cases classified the same way. For example, if we want to use the attribute-concept
interpretation of a missing attribute value for the attribute Temperature
for a patient who is sick with flu, and other patients sick with flu have Temperature
values of either high or very high, then the typical values are high and
very high. The attribute-concept approach to missing attribute values was introduced in Grzymala-Busse (2004).
This paper is a continuation of Grzymala-Busse (2009), which presented experimental results on seven successful methods of handling missing attribute values, each with two additional options of using certain and possible rules. Among these seven methods, only three produced consistently good results in terms of an error rate evaluated by ten-fold cross validation. Two of these good methods were the Global and Local Closest Fit methods; the third was Most Common Value for symbolic attributes and Average Value for numerical attributes, both restricted to a concept.

The main objective of this paper was to conduct experiments using both Closest Fit methods, each with 12 options, to identify the best approach, and then to compare the best results with the third best method identified in Grzymala-Busse (2009), i.e., Most Common Value for symbolic attributes and Average Value for numerical attributes, both restricted to a concept. Though in six cases out of eight the Local Closest Fit methods outperformed the Most Common Value for symbolic attributes and Average Value for numerical attributes, both restricted to a concept, their performance does not differ significantly (at the 5% significance level using the two-tailed Wilcoxon test).
2 Conventional Methods to Handle Missing Attribute Values
We will describe some traditional methods to handle missing attribute values, all
sequential and considered by many authors as very successful.
2.1 Most Common Value for Symbolic Attributes, Average Value for
Numerical Attributes
There are two versions of this method. In the first version, any missing value of
a symbolic attribute is replaced by the most common value of that attribute, i.e.,
by the most probable known (specified) attribute value, and any missing value of
a numerical attribute is replaced by the average of the known values of that attribute.

In the second version of this method, the most common value of the attribute
restricted to the concept is used instead of the most common value among all cases.
Such a concept is the concept that contains the case with the missing attribute
value. A missing value of a numerical attribute is replaced by the average
of all known values of the attribute restricted to the concept.
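As an illustration, here is a minimal sketch of the concept-restricted version of this method. The data representation (a list of dictionaries with a "decision" key and "?" as the marker for a missing value) and the function name are assumptions made for this sketch, not part of the original description.

from collections import Counter
from statistics import mean

MISSING = "?"  # assumed marker for a missing attribute value

def impute_most_common_or_average(cases, attribute, numerical=False):
    """Replace missing values of `attribute` by the most common known value
    (symbolic attributes) or the average of known values (numerical attributes),
    restricted to the concept of each case."""
    for case in cases:
        if case[attribute] != MISSING:
            continue
        # restrict the search to cases from the same concept (same decision value)
        known = [c[attribute] for c in cases
                 if c["decision"] == case["decision"] and c[attribute] != MISSING]
        if not known:
            continue  # nothing to impute from within this concept
        if numerical:
            case[attribute] = mean(known)
        else:
            case[attribute] = Counter(known).most_common(1)[0][0]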
2.2 Closest Fit
Again, there are two versions of this method. The first version, called the Global
Closest Fit method, Grzymala-Busse et al. (2002), is based on replacing a missing
attribute value by the known value in another case that resembles as much as
possible the case with the missing attribute value. In searching for the closest fit
case we compare two vectors of attribute values: one vector corresponds to the
case with a missing attribute value, the other vector is a candidate for the closest
fit. The search is conducted over all cases, hence the name Global Closest Fit. For
each case a distance is computed, and the case for which the distance is the smallest
is the closest fitting case, used to determine the missing attribute value. Let
x and y be two cases. The value of x for the attribute a_i will be denoted by a_i(x),
where i = 1, 2, ..., n. The distance between cases x and y is computed as follows:
$$\operatorname{distance}(x, y) = \sum_{i=1}^{n} \operatorname{distance}(a_i(x), a_i(y)),$$

where

$$\operatorname{distance}(a_i(x), a_i(y)) =
\begin{cases}
0 & \text{if both } a_i(x) \text{ and } a_i(y) \text{ are known and } a_i(x) = a_i(y), \\
\dfrac{|a_i(x) - a_i(y)|}{r} & \text{if } a_i(x) \text{ and } a_i(y) \text{ are numbers and } a_i(x) \neq a_i(y), \\
1 & \text{otherwise,}
\end{cases}$$
and r is the difference between the maximum and the minimum of the known values of the
numerical attribute with a missing value. If there is a tie between two cases with the
same distance, a heuristic is necessary, for example, selecting the first case.
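The distance defined above translates directly into code. The following is an illustrative sketch only; treating all three missing-value markers uniformly as unknown and passing the range r explicitly are assumptions of the sketch.

MISSING_MARKERS = {"?", "*", "-"}  # assumed markers for the three kinds of missing values

def attribute_distance(vx, vy, value_range=None):
    """Distance between two values of a single attribute, as defined above."""
    if vx in MISSING_MARKERS or vy in MISSING_MARKERS:
        return 1.0                         # a missing value is never a match
    if vx == vy:
        return 0.0                         # both known and equal
    if isinstance(vx, (int, float)) and isinstance(vy, (int, float)) and value_range:
        return abs(vx - vy) / value_range  # value_range plays the role of r = max - min
    return 1.0                             # different symbolic values (or unknown range)

def case_distance(x, y, attributes, ranges):
    """Sum of attribute distances between cases x and y (dictionaries)."""
    return sum(attribute_distance(x[a], y[a], ranges.get(a)) for a in attributes)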
Table 1: An incomplete decision table

Case | Temperature | Headache | Cough | Flu (decision)
1    | high        | ?        | yes   | yes
2    | ?           | yes      | *     | yes
3    | high        | −        | yes   | yes
4    | −           | no       | *     | no
5    | *           | yes      | no    | no
6    | normal      | no       | ?     | no
7    | normal      | −        | no    | no

In this method, there are two approaches to the replacement of missing attribute
values by the known value of the closest case. In the first approach, called static,
there is an attempt to replace all missing attribute values after the entire data set
has been scanned. In the second approach, called dynamic, any missing attribute
value is replaced by the existing value immediately after the closest case is found;
this value may then be used as a known value in subsequent searches for the closest case.
In general, the Global Closest Fit method may produce data sets in which some
missing attribute values are not replaced by known values, even though the entire
data set is completely scanned for the closest cases. Additional iterations of the
method may reduce the number of missing attribute values, but may still not
replace all of them with known values.
The second closest fit method, called the Local Closest Fit (or Concept Closest
Fit), is similar to the Global Closest Fit method. The original data set, containing
missing attribute values, is split into smaller data sets, each corresponding to a
concept from the original data set, and then the Global Closest Fit is applied to
each smaller data set. This approach may be used with either the static or the
dynamic way of replacing missing attribute values.
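Building on the distance sketch above (MISSING_MARKERS and case_distance are reused from it), the following hypothetical helper illustrates the difference between static and dynamic replacement and shows that the Local Closest Fit simply applies the same procedure separately within each concept; all names are illustrative, and ties are broken by taking the first closest case, as mentioned earlier.

def closest_fit_impute(cases, attributes, ranges, dynamic=False):
    """Global Closest Fit: fill missing values from the closest fitting case.

    With dynamic=True a replacement is written back immediately, so later
    searches may already use it as a known value; with dynamic=False (static),
    all replacements are applied only after the whole data set has been scanned.
    """
    pending = []
    for x in cases:
        missing_attrs = [a for a in attributes if x[a] in MISSING_MARKERS]
        if not missing_attrs:
            continue
        candidates = [y for y in cases if y is not x]
        if not candidates:
            continue
        closest = min(candidates, key=lambda y: case_distance(x, y, attributes, ranges))
        for a in missing_attrs:
            if closest[a] not in MISSING_MARKERS:
                if dynamic:
                    x[a] = closest[a]            # visible to subsequent searches
                else:
                    pending.append((x, a, closest[a]))
    for x, a, value in pending:                  # static: apply at the end
        x[a] = value

def local_closest_fit_impute(cases, attributes, ranges, dynamic=False):
    """Local (Concept) Closest Fit: run the same procedure within each concept."""
    for d in {c["decision"] for c in cases}:
        subset = [c for c in cases if c["decision"] == d]
        closest_fit_impute(subset, attributes, ranges, dynamic=dynamic)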
3 Blocks of attribute-value pairs
In this section, we recall some basic ideas of rough set theory. We will
assume that lost values are denoted by "?", "do not care" conditions by "*",
and attribute-concept values by "−". We assume that the input data sets are
presented in the form of a decision table. An example of a decision table is
shown in Table 1.
Rows of the decision table represent cases, while columns are labeled by variables. The set of all cases will be denoted by U . In Table 1, U = {1, 2, ..., 7}.
Independent variables are called attributes and a dependent variable is called a
decision and is denoted by d. The set of all attributes will be denoted by A. In
Table 1, A = {Temperature, Headache, Cough}.
For the rest of the paper we will assume that all decision values are known,
i.e., they are not missing. Additionally, we will assume that for each case at least
one attribute value is known.
An important tool to analyze decision tables is a block of an attribute-value
pair. Let (a, v) be an attribute-value pair. For complete decision tables, i.e.,
decision tables in which every attribute value is known, a block of (a, v), denoted
by [(a, v)], is the set of all cases x for which a(x) = v. For incomplete decision
tables the definition of a block of an attribute-value pair is modified.
• If for an attribute a there exists a case x such that a(x) = ?, i.e., the corresponding value is lost, then the case x should not be included in any blocks
[(a, v)] for all values v of attribute a,
• If for an attribute a there exists a case x such that the corresponding value is
a "do not care" condition, i.e., a(x) = ∗, then the case x should be included in
blocks [(a, v)] for all known values v of attribute a,
• If for an attribute a there exists a case x such that the corresponding value is an
attribute-concept value, i.e., a(x) = −, then the corresponding case x should
be included in blocks [(a, v)] for all known values v ∈ V (x, a) of attribute a,
where

V(x, a) = {a(y) | a(y) is known, y ∈ U, d(y) = d(x)}.

For Table 1, V(3, Headache) = {yes}, V(4, Temperature) = {normal}, and
V(7, Headache) = {no, yes}, so the blocks of attribute-value pairs are:
[(Temperature, normal)] = {4, 5, 6, 7},
[(Temperature, high)] = {1, 3, 5},
[(Headache, no)] = {4, 6, 7},
[(Headache, yes)] = {2, 3, 5, 7},
[(Cough, no)] = {2, 4, 5, 7},
[(Cough, yes)] = {1, 2, 3, 4}.
For a case x ∈ U, the characteristic set K_B(x) is defined as the intersection of
the sets K(x, a) for all a ∈ B, where the set K(x, a) is defined in the following
way:
• If a(x) is specified, then K(x, a) is the block [(a, a(x))] of attribute a and its
value a(x),
• If a(x) = ? or a(x) = ∗, then K(x, a) = U,
• If a(x) = −, then K(x, a) is equal to the union of all blocks of attribute-value
pairs (a, v), where v ∈ V(x, a), if V(x, a) is nonempty. If V(x, a) is empty,
K(x, a) = U.
Table 2: Data sets used for experiments

Data set           | Number of cases | Number of attributes | Number of concepts | Type of attributes
Bankruptcy         | 66              | 5                    | 2                  | numerical
Breast cancer      | 277             | 9                    | 2                  | symbolic
Hepatitis          | 155             | 19                   | 2                  | discretized
House              | 434             | 16                   | 2                  | symbolic
Image segmentation | 210             | 19                   | 7                  | discretized
Iris               | 150             | 4                    | 3                  | numerical
Lymphography       | 148             | 18                   | 4                  | symbolic
Wine               | 178             | 12                   | 3                  | discretized

For Table 1 and B = A,
K_A(1) = {1, 3, 5} ∩ U ∩ {1, 2, 3, 4} = {1, 3},
K_A(2) = U ∩ {2, 3, 5, 7} ∩ U = {2, 3, 5, 7},
K_A(3) = {1, 3, 5} ∩ {2, 3, 5, 7} ∩ {1, 2, 3, 4} = {3},
K_A(4) = {4, 5, 6, 7} ∩ {4, 6, 7} ∩ U = {4, 6, 7},
K_A(5) = U ∩ {2, 3, 5, 7} ∩ {2, 4, 5, 7} = {2, 5, 7},
K_A(6) = {4, 5, 6, 7} ∩ {4, 6, 7} ∩ U = {4, 6, 7},
K_A(7) = {4, 5, 6, 7} ∩ ({2, 3, 5, 7} ∪ {4, 6, 7}) ∩ {2, 4, 5, 7} = {4, 5, 7}.
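The blocks and characteristic sets listed above can be recomputed mechanically. The sketch below encodes Table 1 and follows the three rules for "?", "*", and "−"; the encoding and all function names are assumptions made for illustration.

ATTRS = ["Temperature", "Headache", "Cough"]
TABLE = {  # Table 1; "?" = lost, "*" = do not care, "-" = attribute-concept
    1: {"Temperature": "high",   "Headache": "?",   "Cough": "yes", "Flu": "yes"},
    2: {"Temperature": "?",      "Headache": "yes", "Cough": "*",   "Flu": "yes"},
    3: {"Temperature": "high",   "Headache": "-",   "Cough": "yes", "Flu": "yes"},
    4: {"Temperature": "-",      "Headache": "no",  "Cough": "*",   "Flu": "no"},
    5: {"Temperature": "*",      "Headache": "yes", "Cough": "no",  "Flu": "no"},
    6: {"Temperature": "normal", "Headache": "no",  "Cough": "?",   "Flu": "no"},
    7: {"Temperature": "normal", "Headache": "-",   "Cough": "no",  "Flu": "no"},
}
SPECIAL = {"?", "*", "-"}

def V(x, a):
    """Known values of attribute a among cases with the same decision as case x."""
    return {TABLE[y][a] for y in TABLE
            if TABLE[y]["Flu"] == TABLE[x]["Flu"] and TABLE[y][a] not in SPECIAL}

def block(a, v):
    """Block [(a, v)] of the attribute-value pair (a, v) for an incomplete table."""
    return {x for x, row in TABLE.items()
            if row[a] == v or row[a] == "*" or (row[a] == "-" and v in V(x, a))}

def K(x, a):
    value = TABLE[x][a]
    if value in {"?", "*"}:
        return set(TABLE)                                    # K(x, a) = U
    if value == "-":
        values = V(x, a)
        return set().union(*(block(a, v) for v in values)) if values else set(TABLE)
    return block(a, value)

def characteristic_set(x):
    result = set(TABLE)
    for a in ATTRS:
        result &= K(x, a)
    return result

print({x: sorted(characteristic_set(x)) for x in TABLE})    # reproduces K_A(1), ..., K_A(7)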
The characteristic set K_B(x) may be interpreted as the set of cases that are
indistinguishable from x using all attributes from B and a given interpretation
of missing attribute values. Thus, K_A(x) is the set of all cases that cannot be
distinguished from x using all attributes. The characteristic relation R(B) is a
relation on U defined for x, y ∈ U as follows:

(x, y) ∈ R(B) if and only if y ∈ K_B(x).
The characteristic relation R(B) is reflexive but, in general, need not be
symmetric or transitive. Also, the characteristic relation R(B) is known if we
know the characteristic sets K_B(x) for all x ∈ U. In our example, R(A) = {(1, 1),
(1, 3), (2, 2), (2, 3), (2, 5), (2, 7), (3, 3), (4, 4), (4, 6), (4, 7), (5, 2), (5, 5), (5, 7),
(6, 4), (6, 6), (6, 7), (7, 4), (7, 5), (7, 7)}.
For a complete decision table, the characteristic relation R(B) is reduced to
the indiscernibility relation, see Pawlak (1991). Recently, characteristic relations
have been investigated in a number of papers, see, e.g., Li et al. (2007); Qi et al.
(2008); Song et al. (2008); Yang et al. (2007).
4 Lower and upper approximations
For incomplete decision tables, lower and upper approximations may be defined in
a few different ways. In this paper we use concept lower and upper approximations
for incomplete decision tables, following Grzymala-Busse (2003). Again, let X be
a concept, let B be a subset of the set A of all attributes, and let R(B) be the
characteristic relation of the incomplete decision table with characteristic sets
K_B(x), where x ∈ U. A concept B-lower approximation of the concept X,
introduced in Grzymala-Busse (2003), is defined as follows:

$$\underline{B}X = \bigcup \{K_B(x) \mid x \in X,\ K_B(x) \subseteq X\}.$$
Table 3: Number of missing attribute values in the input data set and in the data sets output by the Global and Local Closest Fit methods (static and dynamic replacement)

Data set           | Input data set | Global, static | Global, dynamic | Local, static | Local, dynamic
Bankruptcy         | 113            | 5              | 11              | 5             | 8
Breast cancer      | 869            | 117            | 175             | 149           | 184
Hepatitis          | 1,035          | 74             | 45              | 110           | 63
House              | 376            | 2              | 4               | 2             | 4
Image segmentation | 1,394          | 151            | 126             | 137           | 107
Iris               | 210            | 29             | 54              | 44            | 56
Lymphography       | 931            | 91             | 84              | 79            | 71
Wine               | 806            | 119            | 93              | 107           | 104
A concept B-upper approximation of the concept X is defined as follows:

$$\overline{B}X = \bigcup \{K_B(x) \mid x \in X,\ K_B(x) \cap X \neq \emptyset\} = \bigcup \{K_B(x) \mid x \in X\}.$$

The concept upper approximation was defined in Lin (1992) and Slowinski and
Vanderpooten (2000) as well. For the decision table presented in Table 1, the
concept A-lower and A-upper approximations are

$$\underline{A}\{1, 2, 3\} = \{1, 3\}, \qquad \underline{A}\{4, 5, 6, 7\} = \{4, 5, 6, 7\},$$
$$\overline{A}\{1, 2, 3\} = \{1, 2, 3, 5, 7\}, \qquad \overline{A}\{4, 5, 6, 7\} = \{2, 4, 5, 6, 7\}.$$
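Continuing the previous sketch, the two definitions translate directly into set operations; the snippet below reproduces the four approximations listed above and is an illustrative sketch, not the LERS implementation.

def lower_approximation(concept, characteristic_sets):
    """Union of K_B(x) over x in the concept such that K_B(x) is contained in the concept."""
    blocks = [k for x, k in characteristic_sets.items() if x in concept and k <= concept]
    return set().union(*blocks) if blocks else set()

def upper_approximation(concept, characteristic_sets):
    """Union of K_B(x) over all x in the concept."""
    return set().union(*(characteristic_sets[x] for x in concept))

K_A = {x: characteristic_set(x) for x in TABLE}   # characteristic_set and TABLE from the previous sketch
flu_yes, flu_no = {1, 2, 3}, {4, 5, 6, 7}
print(lower_approximation(flu_yes, K_A), upper_approximation(flu_yes, K_A))  # {1, 3} and {1, 2, 3, 5, 7}
print(lower_approximation(flu_no, K_A), upper_approximation(flu_no, K_A))    # {4, 5, 6, 7} and {2, 4, 5, 6, 7}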
5 MLEM2
Rules induced from the lower approximation of the concept certainly describe the
concept, hence such rules are called certain, Grzymala-Busse (1988). On the other
hand, rules induced from the upper approximation of the concept describe the
concept possibly, so these rules are called possible, Grzymala-Busse (1988).
In our experiments, we used the MLEM2 rule induction algorithm. LEM2
explores the search space of attribute-value pairs, for details see, e.g., Chan and
Grzymala-Busse (1991); Grzymala-Busse (1992).
MLEM2, a modified version of LEM2, induces rule sets directly from data
with numerical attributes and missing attribute values, see, e.g., Grzymala-Busse (2002).

Table 4: Error rates, Bankruptcy Data

Method            | Global, static | Global, dynamic | Local, static | Local, dynamic
?, certain rules  | 13.64%         | 7.58%           | 4.55%         | 3.03%
?, possible rules | 13.64%         | 7.58%           | 4.55%         | 3.03%
*, certain rules  | 13.64%         | 7.58%           | 4.55%         | 4.55%
*, possible rules | 13.64%         | 7.58%           | 4.55%         | 4.55%
−, certain rules  | 13.64%         | 7.58%           | 4.55%         | 4.55%
−, possible rules | 13.64%         | 7.58%           | 4.55%         | 4.55%

For Table 1, certain rules induced by MLEM2, using concept approximations, are:
2, 2, 2
(Temperature, high) & (Cough, yes) -> (Flu, yes)
1, 4, 4
(Temperature, normal) -> (Flu, no)
Possible rules, induced in the same way, are:
1, 2, 4
(Headache, yes) -> (Flu, yes)
1, 2, 3
(Temperature, high) -> (Flu, yes)
1, 4, 4
(Temperature, normal) -> (Flu, no)
1, 3, 4
(Cough, no) -> (Flu, no)
The above rules are in the LERS format: every rule is equipped with three
numbers, the total number of attribute-value pairs on the left-hand side of the
rule, the total number of cases correctly classified by the rule during training, and
the total number of training cases matching the left-hand side of the rule, see,
e.g., Grzymala-Busse (2002).
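To make the format concrete, the hypothetical snippet below encodes the two certain rules above together with their LERS triples and matches a case against them; the matching scheme is a simplification for illustration, not the full LERS classification strategy.

# (conditions, cases correctly classified, cases matching the rule), left-hand side, decision
CERTAIN_RULES = [
    ((2, 2, 2), {"Temperature": "high", "Cough": "yes"}, ("Flu", "yes")),
    ((1, 4, 4), {"Temperature": "normal"},               ("Flu", "no")),
]

def classify(case, rules):
    """Return the decisions of all rules whose left-hand side matches the case."""
    return [decision for _, conditions, decision in rules
            if all(case.get(a) == v for a, v in conditions.items())]

print(classify({"Temperature": "high", "Headache": "no", "Cough": "yes"}, CERTAIN_RULES))
# [('Flu', 'yes')]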
6 Experiments
We used eight data sets for our experiments, see Tables 2–3. All of these data sets,
except bankruptcy, are well-known data sets accessible at the University of California at
Irvine Data Depository, http://archive.ics.uci.edu/ml/. The data set bankruptcy
was collected by E. Altman and M. Heine at the New York University, School of
Business, in 1968. Note that only one of the data sets, house, was used for
experiments in its original form.
Table 5: Error rates, Breast Cancer Data (Slovenia)

Method            | Global, static | Global, dynamic | Local, static | Local, dynamic
?, certain rules  | 29.24%         | 31.41%          | 23.10%        | 24.91%
?, possible rules | 31.41%         | 29.24%          | 25.27%        | 25.63%
*, certain rules  | 29.24%         | 29.96%          | 22.38%        | 25.99%
*, possible rules | 31.05%         | 29.96%          | 27.08%        | 28.16%
−, certain rules  | 28.52%         | 30.32%          | 23.47%        | 27.44%
−, possible rules | 31.41%         | 29.60%          | 25.99%        | 28.16%
Table 6: Error rates, Hepatitis Data

Method            | Global, static | Global, dynamic | Local, static | Local, dynamic
?, certain rules  | 20.00%         | 28.39%          | 8.39%         | 10.32%
?, possible rules | 21.94%         | 25.81%          | 8.39%         | 12.26%
*, certain rules  | 23.23%         | 25.16%          | 14.19%        | 10.32%
*, possible rules | 23.23%         | 27.10%          | 11.61%        | 9.03%
−, certain rules  | 23.23%         | 25.16%          | 15.48%        | 9.68%
−, possible rules | 22.58%         | 27.10%          | 12.90%        | 10.97%
In the remaining seven data sets, some attribute values, about 35%, were randomly
removed; more exactly, the original attribute values were replaced by symbols of
missing attribute values. Three data sets, hepatitis, image segmentation, and wine,
were discretized using a discretization method based on agglomerative cluster
analysis, Chmielewski and Grzymala-Busse (1996).
In our previous experiments, Grzymala-Busse (2009), we tested seven methods
of handling missing attribute values:
1. Most common value for symbolic attributes and average value for numerical
attributes,
2. Most common value for symbolic attributes and average value for numerical
attributes, both restricted to a concept,
3. Global closest fit with static replacement of missing attribute values; if this
method outputs a data set with some missing attribute values, Method 5 is used,
4. Concept closest fit with static replacement of missing attribute values; if this
method outputs a data set with some missing attribute values, Method 5 is used,
5. Missing attribute values interpreted as lost values,
6. Missing attribute values interpreted as "do not care" conditions,
7. Missing attribute values interpreted as attribute-concept values.
Table 7: Error rates, House of Representatives Data (USA, 1984)

Method            | Global, static | Global, dynamic | Local, static | Local, dynamic
?, certain rules  | 3.92%          | 4.61%           | 4.38%         | 3.69%
?, possible rules | 4.15%          | 5.07%           | 4.61%         | 3.69%
*, certain rules  | 3.92%          | 4.61%           | 4.38%         | 3.69%
*, possible rules | 4.38%          | 5.07%           | 4.61%         | 3.69%
−, certain rules  | 3.92%          | 4.61%           | 4.38%         | 3.69%
−, possible rules | 4.38%          | 5.07%           | 4.61%         | 3.69%
Table 8: Error rates, Image Segmentation Data

Method            | Global, static | Global, dynamic | Local, static | Local, dynamic
?, certain rules  | 46.19%         | 55.24%          | 13.81%        | 13.81%
?, possible rules | 41.90%         | 45.24%          | 12.38%        | 12.38%
*, certain rules  | 60.48%         | 52.86%          | 16.67%        | 12.86%
*, possible rules | 40.95%         | 41.90%          | 11.90%        | 13.33%
−, certain rules  | 59.52%         | 55.71%          | 15.71%        | 15.71%
−, possible rules | 41.43%         | 41.43%          | 12.86%        | 14.76%
For all seven methods we induced both certain and possible rules. For every
data set the best of our seven methods was identified using the smallest error rate
(estimated by ten-fold cross validation) as an indicator of quality. Only three
methods out of seven were winners in this competition. Eight times the winner
was Method 4, six times the winner was Method 2, and four times the winner was
Method 3. The choice between certain and possible rules was not important.
For Methods 3 and 4, i.e., Global Closest Fit and Local Closest Fit, there exist
additional strategies, based on a choice between static and dynamic replacement of
missing attribute values and on three additional choices of interpreting the missing
attribute values remaining after the Closest Fit methods have been applied: as lost
values, "do not care" conditions, or attribute-concept values. Additionally, there are
two possible ways of rule induction: certain and possible rules. These additional
strategies were not tested in Grzymala-Busse (2009). In the experiments described
in this paper, we investigated these additional strategies.
Table 9: Error rates, Iris Plant Data

Method            | Global, static | Global, dynamic | Local, static | Local, dynamic
?, certain rules  | 10.00%         | 16.00%          | 2.67%         | 2.00%
?, possible rules | 10.67%         | 16.67%          | 2.67%         | 2.00%
*, certain rules  | 11.33%         | 14.00%          | 1.33%         | 2.67%
*, possible rules | 16.00%         | 14.00%          | 4.67%         | 4.67%
−, certain rules  | 13.33%         | 14.00%          | 6.67%         | 7.33%
−, possible rules | 16.67%         | 17.33%          | 6.67%         | 7.33%
Table 10: Error rates, Lymphography Data

Method            | Global, static | Global, dynamic | Local, static | Local, dynamic
?, certain rules  | 24.32%         | 27.70%          | 9.46%         | 9.46%
?, possible rules | 25.00%         | 22.97%          | 9.46%         | 9.46%
*, certain rules  | 25.00%         | 25.68%          | 7.43%         | 7.43%
*, possible rules | 25.68%         | 24.32%          | 7.43%         | 7.43%
−, certain rules  | 25.00%         | 26.35%          | 10.81%        | 10.81%
−, possible rules | 25.00%         | 25.68%          | 10.81%        | 10.81%
Thus, for every data set, 24 strategies of handling missing attribute values
were tested. Results are presented in Tables 4–11. As is clear from Tables 4–11,
Local Closest Fit always outperformed Global Closest Fit. Local Closest Fit is
designed to work on a single concept at a time, and this is the main reason for its
success. Additionally, one of the interpretations of missing attribute values, namely
the attribute-concept value, was outperformed by the remaining two interpretations,
lost values and "do not care" conditions, in all experiments. Static replacement of
missing attribute values was better for five out of seven data sets (with one tie);
however, the Wilcoxon matched-pairs signed-rank test indicates no significant
difference (at the 5% significance level using a two-tailed test). On the other hand, the
difference may be dramatic, as the wine data set shows: the error rate was only 2.25%
for static replacement and 15.17% for dynamic replacement. Similarly, the "do not
care" condition interpretation of missing attribute values was better for
five out of seven data sets (with one tie); again, the same test shows no significant
difference. Finally, as in our previous experiments, Grzymala-Busse (2009), we
may conclude that the choice between certain and possible rules is irrelevant:
certain rules were better for two data sets, possible rules for one, with five ties.
Table 11: Error rates, Wine Recognition Data

Method            | Global, static | Global, dynamic | Local, static | Local, dynamic
?, certain rules  | 13.48%         | 7.87%           | 5.62%         | 29.78%
?, possible rules | 13.48%         | 5.06%           | 5.62%         | 12.92%
*, certain rules  | 14.61%         | 7.30%           | 2.25%         | 29.21%
*, possible rules | 12.92%         | 5.06%           | 2.25%         | 15.17%
−, certain rules  | 16.85%         | 6.18%           | 3.37%         | 29.78%
−, possible rules | 13.48%         | 5.62%           | 3.37%         | 13.48%
The next task was to compare the best results accomplished using the Local
Closest Fit and Method 2, i.e., Most Common Value for symbolic attributes and
Average Value for numerical attributes, both restricted to a concept. Results are
presented in Table 12. For these two methods of handling missing attribute values,
even though Local Closest Fit outperformed Method 2 for six out of eight data
sets, the Wilcoxon matched-pairs signed-rank test shows that the performance on
our eight data sets does not differ significantly at the 5% significance level using a
two-tailed test.
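For reference, this significance test can be reproduced with a standard implementation of the Wilcoxon matched-pairs signed-rank test; the sketch below uses the best error rates from Table 12 and assumes that scipy is available.

from scipy.stats import wilcoxon

# Best error rates (%) from Table 12: Method 2 versus Local Closest Fit
method_2          = [7.58, 23.47, 12.90, 4.15, 9.52, 4.00, 6.08, 3.37]
local_closest_fit = [3.03, 22.38, 8.39, 3.69, 11.90, 1.33, 7.43, 2.25]

statistic, p_value = wilcoxon(method_2, local_closest_fit, alternative="two-sided")
print(statistic, p_value)  # p > 0.05: the difference is not significant at the 5% level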
7 Conclusions
A very successful method of handling missing attribute values (Method 2) and two
conventional methods based on the Closest Fit idea, enhanced by the rough set
methodology with 12 strategies each, were used in our experiments. It is clear that
Local Closest Fit outperformed Global Closest Fit and that the attribute-concept
value was the worst interpretation of missing attribute values.

Obviously, more experiments are needed, but for the time being it is clear that the
best methodology should be selected as either Local Closest Fit, equipped with
four options (both static and dynamic replacement of missing attribute values and
the lost value and "do not care" condition interpretations of missing attribute
values), or Most Common Value for symbolic attributes and Average Value for
numerical attributes, both restricted to a concept. The choice between certain
and possible rules seems to be unimportant.
References
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone (1984), Classification
and Regression Trees, Wadsworth & Brooks, Monterey, CA.
C. C. Chan and J. W. Grzymala-Busse (1991), On the attribute redundancy and the
learning programs ID3, PRISM, and LEM2, Technical report, Department of Computer Science, University of Kansas.
Table 12: Error rates, the best results

Data set           | Concept most common value and concept average value | Local Closest Fit
Bankruptcy         | 7.58%   | 3.03%
Breast cancer      | 23.47%  | 22.38%
Hepatitis          | 12.90%  | 8.39%
House              | 4.15%   | 3.69%
Image segmentation | 9.52%   | 11.90%
Iris               | 4.00%   | 1.33%
Lymphography       | 6.08%   | 7.43%
Wine               | 3.37%   | 2.25%
M. R. Chmielewski and J. W. Grzymala-Busse (1996), Global discretization of continuous attributes as preprocessing for machine learning, International Journal of Approximate Reasoning, 15(4):319–331.
J. W. Grzymala-Busse (1988), Knowledge acquisition under uncertainty—A rough set
approach, Journal of Intelligent & Robotic Systems, 1:3–16.
J. W. Grzymala-Busse (1991), On the unknown attribute values in learning from examples, in Proceedings of the ISMIS-91, 6th International Symposium on Methodologies
for Intelligent Systems, pp. 368–377.
J. W. Grzymala-Busse (1992), LERS—A system for learning from examples based on
rough sets, in R. Slowinski, editor, Intelligent Decision Support. Handbook of Applications and Advances of the Rough Set Theory, pp. 3–18, Kluwer Academic Publishers,
Dordrecht, Boston, London.
J. W. Grzymala-Busse (2002), MLEM2: A new algorithm for rule induction from
imperfect data, in Proceedings of the 9th International Conference on Information
Processing and Management of Uncertainty in Knowledge-Based Systems, pp. 243–
250.
J. W. Grzymala-Busse (2003), Rough set strategies to data with missing attribute
values, in Workshop Notes, Foundations and New Directions of Data Mining, in conjunction with the 3-rd International Conference on Data Mining, pp. 56–63.
J. W. Grzymala-Busse (2004), Three approaches to missing attribute values—A rough
set perspective, in Proceedings of the Workshop on Foundation of Data Mining, in
conjunction with the Fourth IEEE International Conference on Data Mining, pp. 55–
62.
J. W. Grzymala-Busse (2009), A comparison of traditional and rough set approaches
to missing attribute values in data mining, accepted for Data Mining 2009, the 10-th
International Conference on Data Mining, Detection, Protection and Security, Royal
Mare Village, Crete.
J. W. Grzymala-Busse, W. J. Grzymala-Busse, and L. K. Goodwin (2002), A comparison of three closest fit approaches to missing attribute values in preterm birth
data, International Journal of Intelligent Systems, 17(2):125–134.
J. W. Grzymala-Busse and A. Y. Wang (1997), Modified algorithms LEM1 and LEM2
for rule induction from data with missing attribute values, in Proceedings of the Fifth
International Workshop on Rough Sets and Soft Computing (RSSC’97) at the Third
Joint Conference on Information Sciences (JCIS’97), pp. 69–72.
M. Kryszkiewicz (1999), Rules in incomplete information systems, Information Sciences, 113(3-4):271–292.
T. Li, D. Ruan, W. Geert, J. Song, and Y. Xu (2007), A rough sets based characteristic
relation approach for dynamic attribute generalization in data mining, Knowledge-Based Systems, 20(5):485–494.
T. Y. Lin (1992), Topological and fuzzy rough sets, in R. Slowinski, editor, Intelligent
Decision Support. Handbook of Applications and Advances of the Rough Sets Theory,
pp. 287–304, Kluwer Academic Publishers, Dordrecht, Boston, London.
Z. Pawlak (1991), Rough Sets. Theoretical Aspects of Reasoning about Data, Kluwer
Academic Publishers, Dordrecht, Boston, London.
Y. S. Qi, L. Wei, H. J. Sun, Y. Q. Song, and Q. S. Sun (2008), Characteristic relations in
generalized incomplete information systems, in International Workshop on Knowledge
Discovery and Data Mining, pp. 519–523.
J. R. Quinlan (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA.
R. Slowinski and D. Vanderpooten (2000), A generalized definition of rough approximations based on similarity, IEEE Transactions on Knowledge and Data Engineering,
12:331–336.
J. Song, T. Li, and D. Ruan (2008), A new decision tree construction using the cloud
transform and rough sets, in Proceedings of the Rough Sets and Knowledge Technology
Conference, pp. 524–531.
J. Stefanowski and A. Tsoukias (1999), On the extension of rough sets under incomplete information, in Proceedings of the RSFDGrC’1999, 7th International Workshop
on New Directions in Rough Sets, Data Mining, and Granular-Soft Computing, pp.
73–81.
X. B. Yang, J. Y. Yang, C. Wu, and D. J. Yu (2007), Further investigation of characteristic relation in incomplete information systems, Systems Engineering - Theory &
Practice, 27(6):155–160.