Advanced Engineering Informatics 16 (2002) 135–143
www.elsevier.com/locate/aei
Graph-based induction and its applications
Takashi Matsuda*, Hiroshi Motoda, Takashi Washio
I.S.I.R., Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka 567-0047, Japan
Received 18 May 2001; accepted 15 January 2002
Abstract
A machine learning technique called Graph-based induction (GBI) efficiently extracts typical patterns from graph data by stepwise pair expansion (pairwise chunking). In this paper, we introduce GBI for general graph structured data, which can handle directed/undirected, colored/uncolored graphs with/without (self) loops and with colored/uncolored links. We show that its time complexity is almost linear with the size of the graph. We further show that GBI can effectively be applied to the extraction of typical patterns from DNA sequence data and organochlorine compound data, from which classification rules are to be generated, and that GBI also works as a feature construction component for other machine learning tools. © 2002 Published by Elsevier Science Ltd.
Keywords: Graph-based induction; General graph structured data; Data mining; Machine learning
1. Introduction
There has been a considerable amount of research on data mining seeking better performance over the last few years. Better performance includes mining from structured data, which is a new challenge, and there has been little work on this subject. Since structure is represented by proper relations and a graph can easily represent relations, knowledge discovery from graph structured data poses a general problem for mining from structured data. Some examples amenable to graph mining are finding typical Web browsing patterns, identifying characteristic substructures of chemical compounds and discovering diagnostic rules from patient history records.
The majority of the widely used methods are for data that do not have structure and are represented by attribute-value pairs. Decision trees [1,2] and induction rules [3,4] relate attribute values to target classes. Association rules, often used in data mining, also use this attribute-value pair representation. However, the attribute-value pair representation is not suitable for representing a more general data structure, and there are problems that need a more powerful representation.
* Corresponding author. Tel.: +81-6-6879-8542; fax: +81-6-6879-8544.
E-mail addresses: [email protected] (T. Matsuda), [email protected] (H. Motoda), [email protected] (T. Washio).
1474-0346/02/$ - see front matter © 2002 Published by Elsevier Science Ltd. PII: S1474-0346(02)00005-8

The most powerful representation that can handle relations, and thus structure, would be inductive logic programming
(ILP) [5], which uses first-order predicate logic. It can represent general relationships embedded in data, and has the merit that domain knowledge and acquired knowledge can be utilized as background knowledge. However, it is not efficient enough to solve large-scale problems, and its state of the art is not so mature that anyone can use the technique easily.
AGM (a priori-based graph mining) [6] was developed for the purpose of mining association rules among the frequently appearing substructures in a given graph data set. A graph transaction is represented by an adjacency matrix, and the frequent patterns appearing in the matrices are mined through an extended algorithm of basket analysis. This algorithm can extract all connected/disconnected substructures by complete search. However, its computation time increases exponentially with input graph size and support. AGM can use only frequency as the evaluation function.
SUBDUE [7] is an algorithm for extracting a subgraph which can best compress an input graph based on the minimum description length (MDL) principle. The found substructure can be considered a concept. This algorithm is based on a computationally constrained beam search. It begins with a substructure comprising only a single vertex in the input graph, and grows it incrementally by expanding a node in it. At each expansion it evaluates the total description length (DL) of the input graph, which is defined as the sum of two terms: the DL of the substructure and the DL of the input graph in which all the instances of the substructure are replaced by single nodes. It stops when the substructure that minimizes
the total DL is found. After the optimal substructure is found and the input graph is rewritten, the next iteration starts using the rewritten graph as a new input. In this way, SUBDUE finds a more abstract concept at each round of iteration. As is clear, the algorithm can find only one substructure at each iteration. Further, it does not strictly maintain the original input graph structure after compression, because its aim is to facilitate the global understanding of a complex database by forming hierarchical concepts and using them to approximately describe the input data.
Graph-based induction (GBI) [8] is a technique which was devised for the purpose of discovering typical patterns in directed graph data by recursively chunking two adjoining nodes, and its expressiveness lies in between the attribute-value pair representation and first-order logic. The computation time for GBI is very short because of its greedy search, and GBI does not lose any information of the graph structure after chunking. GBI can use various evaluation functions based on frequency. GBI is not suitable for pattern extraction from graph-structured data where many nodes share the same label, because of its greedy recursive chunking without backtracking. However, it is still thought effective in extracting patterns from graph structured data where each node has a distinct label (e.g. World Wide Web browsing data) or where some typical structures exist even if some nodes share the same labels (e.g. chemical structure data containing benzene rings, etc.).
The previous implementation of GBI [8, p. 78] could handle only tree structured data with node labels and link labels as inputs. We have enhanced the expressiveness of GBI so that it can handle general graph data having loops (including self-loops) with colored/uncolored nodes and links. The paper is organized as follows. In Section 2, we briefly describe the framework of GBI. In Section 3, we discuss the time complexity of GBI from both theoretical and experimental points of view, and in Sections 4 and 5, we show that GBI can successfully extract typical patterns or classification rules by applying it to DNA sequence data and chemical compound data, and that GBI can also be used as a means to construct compound attributes for use in other classifiers. In Section 6 we conclude the paper by summarizing the results and the future work.
2. Graph-based induction

The original GBI was so formulated as to minimize the graph size by replacing each found pattern with one node, so that it repeatedly contracted the graph [8]. We assume that typical patterns represent some concepts and 'typicality' is characterized by the pattern's frequency or the value of some evaluation function of its frequency. GBI is realized under this assumption by the idea of extracting typical patterns by stepwise pair expansion, as shown in Fig. 1. We can use statistical indices as an evaluation function, such as the frequency itself, Information Gain [1], Gain Ratio [2] and Gini Index [9], all of which are based on frequency. Said differently, we do not deal with concepts that cannot be measured with evaluation functions definable from frequency.

Fig. 1. The basic idea of the GBI method.

Fig. 2 explains the idea of 'stepwise pair expansion' by showing the process of extracting from an input graph 'a pair of data which are highly correlated'. Since a 'pair' grows into a complicated pattern by recursive stepwise pair expansion, each extracted pair is called a 'typical pattern' or an 'extracted pattern'.

In Fig. 2, it is assumed that two typical patterns (A, B) have already been extracted from the input graph. The stepwise pair expansion (pairwise chunking) repeats the following three steps until no more typical patterns are found.

Step 1. Rewrite all the patterns in the input graph which are identical to the newly chunked pattern to one node and assign a new label.
Step 2. Extract all the pairs consisting of two connected nodes in the contracted graph.
Step 3. Select the most typical pair from among the extracted pairs and register it as the pattern to chunk. If either or both nodes of the selected pair have already been rewritten at Step 1, they are restored to the original patterns before registration.

Fig. 2. The idea of pairwise chunking.
Fig. 3. The concept of the pair in the GBI method.
When a pair is chunked and either one or both of the parent and child nodes have already been chunked, it is important to keep track of where in the parent the link starts and where in the child it ends (see Fig. 3). All the extracted patterns are characterizations of the input data. The search at Step 2 is greedy and without backtracking, which means that in enumerating pairs no pattern which has been chunked into one node is restored to its original pattern. Because of this, not all the typical patterns that exist in the input graph are necessarily extracted. The problem of extracting all the isomorphic subgraphs is known to be NP-hard. Thus, GBI aims at extracting only meaningful typical patterns of a certain size. Its objective is neither finding all the typical patterns nor finding all the frequent patterns. The merit of GBI is its efficiency. Its time complexity is almost linear in the size of the graph, as shown in Section 3. It is very useful in particular when the size of the input graph is huge. GBI has disadvantages, too. When each node has a distinct label in the input graph, no ambiguity arises in selecting a pair to be chunked and GBI performs well. However, since the search in GBI is greedy, when the same label is shared by more than one node in the input graph, ambiguity arises when there are ties in the evaluation function or there is a chain of nodes of the same label. For example, in the case of a structure like a → a → a, we do not know which a → a is best to chunk.
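To make the loop concrete, here is a minimal Python sketch of stepwise pair expansion (our own illustration, not the authors' implementation; it assumes a directed graph with labeled nodes, uncolored links and frequency as the evaluation function):

from collections import Counter

def gbi_chunk(nodes, edges, min_freq=2, max_steps=100):
    # nodes: dict node_id -> label; edges: set of (src_id, dst_id) links.
    # Returns the chunked pattern labels in the order they were found.
    nodes, edges = dict(nodes), set(edges)
    patterns = []
    for _ in range(max_steps):
        # Step 2: enumerate connected pairs and count them by label pair.
        counts = Counter((nodes[s], nodes[d]) for s, d in edges)
        if not counts:
            break
        # Step 3: select the most typical (here: most frequent) pair.
        (src_lab, dst_lab), freq = counts.most_common(1)[0]
        if freq < min_freq:
            break
        patterns.append(f"({src_lab}->{dst_lab})")
        # Step 1: rewrite every non-overlapping occurrence into one node.
        used = set()
        for s, d in sorted(edges):
            if (nodes.get(s), nodes.get(d)) != (src_lab, dst_lab):
                continue
            if s in used or d in used or s == d:
                continue
            used.update((s, d))
            nodes[s] = patterns[-1]          # s becomes the chunk node
            del nodes[d]
            # Redirect links that touched d to the chunk node s.
            edges = {(s if a == d else a, s if b == d else b)
                     for a, b in edges if (a, b) != (s, d)}
        edges = {(a, b) for a, b in edges if a in nodes and b in nodes}
    return patterns

# Example: the pair (a -> b) occurs twice, so it is chunked first.
nodes = {1: "a", 2: "b", 3: "c", 4: "a", 5: "b"}
edges = {(1, 2), (2, 3), (4, 5)}
print(gbi_chunk(nodes, edges))   # -> ['(a->b)']

In this sketch, chunked nodes keep a composite label so that later iterations can chunk pairs of already-chunked patterns, which is what allows small pairs to grow into larger typical patterns.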
3. Performance evaluation
3.1. Theoretical evaluation of time complexity
The time complexity of the implemented program was
theoretically evaluated. Let N_i, l_i, P_i and C denote, respectively, at chunking step i, the total number of nodes in the graph, the average number of outgoing links from one node, the number of different kinds of pairs in the graph, and the number of different kinds of chunked patterns derived from the graph data.
The time complexity to read the input data is O(N_0 l_0), because the total number of links in the input graph is N_0 l_0 and all the link information in the graph must be read. The time complexity to count the number of pairs of each kind is O(N_0 l_0 log N_0 l_0), because all the links in the graph must be searched. Throughout this paper, we use frequency as the measure for chunking. The time complexity to select the pair to be chunked at chunking step i is O(P_i), because the most frequently appearing pair must be found by scanning all the pair information. The time complexity to perform the pairwise chunking is O(N_i l_i), because all the links in the graph must be searched. The time complexity to update the pair information is O(P_{i+1}), because all kinds of pairs in the graph must be searched. GBI repeats this process until the total number of chunked patterns becomes C. Therefore, the total time complexity is

$$O\left(N_0 l_0 + N_0 l_0 \log N_0 l_0 + \sum_{i=0}^{C-1} \left(P_i + N_i l_i + P_{i+1}\right)\right)$$

where N_i < N_0 and l_i < l_0, because the number of nodes and the number of links at chunking step i are less than the number of nodes and the number of links in the input graph. Furthermore, P_i ≤ N_i l_i < N_0 l_0, because the number of different kinds of pairs is less than the total number of links in the graph. Therefore, the time complexity of GBI is

$$< O\left(N_0 l_0 \log N_0 l_0 + \sum_{i=0}^{C-1} N_0 l_0\right) = O\left(N_0 l_0 \log N_0 l_0 + C N_0 l_0\right)$$

Since the maximum number of chunks C is less than the total number of links in the input graph, C ≤ N_0 l_0, and thus the maximum time complexity of GBI is O(N_0^2 l_0^2).
3.2. Experimental evaluation of time complexity
The time complexity of the implemented program was also experimentally evaluated by applying it to artificially generated datasets. The machine used for this experiment is a PC with a Pentium II 400 MHz CPU and 256 MB of memory. Random graphs were artificially generated, increasing the number of nodes by 100 from 100 to 10,000, with a fixed average number of outgoing links from each node (three and ten) and a fixed number of node labels (one and five). The frequency threshold was set at 4% of the total number of nodes in the initial input graph. That is, those pairs which appear most and whose occurrence frequency exceeds the threshold were chosen to be chunked. Under the above setting, the computation time was measured from the start to the end of execution. The result is shown in Fig. 4.
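For reference, random graphs of this kind can be generated with a sketch like the following (our own; parameter names are assumptions), and then fed to a chunking routine such as the gbi_chunk sketch in Section 2 with a threshold of 4% of the node count:

import random

def random_labeled_graph(n_nodes, avg_out_links, n_labels, seed=0):
    # Each node gets one of n_labels labels and avg_out_links outgoing links
    # to uniformly chosen targets (duplicate links collapse in the edge set,
    # so the realized average out-degree is approximately avg_out_links).
    rng = random.Random(seed)
    nodes = {i: f"L{rng.randrange(n_labels)}" for i in range(n_nodes)}
    edges = set()
    for src in range(n_nodes):
        for _ in range(avg_out_links):
            edges.add((src, rng.randrange(n_nodes)))
    return nodes, edges

# e.g. 1,000 nodes, average out-degree three, five node labels
nodes, edges = random_labeled_graph(1000, 3, 5)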
Fig. 4. Computation time vs. number of nodes.

It can be said from Fig. 4 that the computation time increases almost linearly with the number of nodes in the graph. It is noted that the computation time is longer for graphs having fewer node labels when the graph size is equal. This is because the probability of the same patterns being in the graph becomes higher for graphs with fewer node labels, and this invokes a larger number of pairwise chunkings. It is also noted that the computation time becomes longer for graphs having more outgoing links from each node when the graph size is equal.
Next, how the number of links in the graph affects the computation time was evaluated. Random graphs were again artificially generated, increasing the probability of an outgoing link per node by 10% from 10 to 100% (100% corresponds to the complete graph). Here, we fixed the graph size at 200 nodes and evaluated three cases where the number of different node labels was one, three and five. The same threshold value was used as above. The result is shown in Fig. 5.
It is found from Fig. 5 that the computation time is upper-bounded by a quadratic function for an increase in the number of links in the graph, as predicted by the theoretical evaluation. However, in real-world data, when the number of nodes in the graph increases, the number of outgoing links from one node would remain almost the same (e.g. the number of bonds in chemical compounds), a value specific to each domain. Therefore, the computation time for real-world data would behave as shown in Fig. 4.
Fig. 5. Computation time vs. number of links.

4. Extracting classification rules by GBI

As an initial test to evaluate the performance of GBI, we have applied GBI to extract classification rules from DNA sequence data [10].

Fig. 6. Extraction of classification rules from DNA sequence data.
4.1. Application to promoter DNA sequence data
The promoter dataset is one of the benchmark datasets provided by the UCI Machine Learning Repository [11]. A promoter is a genetic region which initiates the first step in the expression of an adjacent gene (transcription). The promoter dataset consists of strings that represent nucleotides (one of A, G, T or C). The input features are 57 sequential DNA nucleotides and the total number of instances is 106, including 53 positive instances (sample promoter sequences) and 53 negative instances (non-promoter sequences).
Fig. 6 shows the process of mapping the problem into a colored directed graph, using GBI to extract patterns and interpreting them as classification rules. In Fig. 6, for example, the first sequence in the upper left figure is mapped into a graph like the left one in the lower left figure. This graph shows that this sequence belongs to the positive class and has C as the first nucleotide, G as the second, T as the third and so on. We give a set of these graphs to GBI as input and extract patterns (lower right figure). Finally, the extracted patterns can be interpreted as rules as in the upper right figure. The same frequency threshold was used, i.e. 4% of the number of instances. The time required for extracting
patterns was 10 s using a PC with a Pentium II 400 MHz CPU and 256 MB of memory. Some examples of the resultant patterns from this dataset are shown in Fig. 7.

Fig. 7. Examples of extracted patterns from Promoter dataset.

Table 1
Comparison of prediction accuracy for promoter sequence classification

Learning method    No. of errors (out of 106)
ID3                19
C4.5               18
GBI                16
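One plausible encoding of a sequence as a colored directed graph in the spirit of Fig. 6 is sketched below (our own illustration; the exact node and link conventions of the authors' encoding are not reproduced here): the class label becomes a root node and the nucleotides are linked in reading order.

def sequence_to_graph(sequence, class_label, start_id=0):
    # Map one DNA sequence and its class to a small directed graph.
    # The class node is the root; each nucleotide node is chained in reading
    # order so that positional information is preserved by the structure.
    nodes = {start_id: class_label}          # e.g. "positive" or "negative"
    edges = set()
    prev = start_id
    for offset, base in enumerate(sequence, start=1):
        node_id = start_id + offset
        nodes[node_id] = base                # one of "A", "C", "G", "T"
        edges.add((prev, node_id))
        prev = node_id
    return nodes, edges

nodes, edges = sequence_to_graph("CGT", "positive")
# nodes: {0: 'positive', 1: 'C', 2: 'G', 3: 'T'}; edges: {(0,1), (1,2), (2,3)}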
These patterns can be interpreted as rules by considering the root node as a conclusion and the leaf nodes and the links to them as conditions. We measured the prediction accuracy of the extracted classification rules by leave-one-out since the dataset was small. Many rules of different importance (support and confidence in data mining terminology) are extracted by this method, and thus rule ordering is important. The rules are ordered from the lowest frequency to the highest, and those within a tie are ordered according to the size of the pattern, the largest coming first. That is, more specific rules are given higher priority. The patterns corresponding to rule conditions are matched against the test data in this order. Table 1 compares the experimental results (the number of errors out of the total 106 cases) with two other learning methods, ID3 and C4.5. Both ID3 [1] and C4.5 [2] are well-known decision tree learners. They are based on the divide-and-conquer algorithm. As a splitting function, ID3 uses information gain and C4.5 uses the information gain ratio. From this table, it is noted that the error rate of GBI is slightly lower than that of these standard tree-induction programs.
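The rule ordering and matching just described can be sketched as follows (our own illustration; the rule fields, the (position, base) condition format and the default class are assumptions, not the authors' data structures):

def order_rules(rules):
    # Lowest frequency first; within a tie the larger (more specific)
    # pattern comes first.
    return sorted(rules, key=lambda r: (r["freq"], -r["size"]))

def classify(instance, rules, default="negative"):
    # Return the conclusion of the first rule whose conditions all hold.
    # instance maps positions to bases; conditions are (position, base).
    for rule in order_rules(rules):
        if all(instance.get(pos) == base for pos, base in rule["conds"]):
            return rule["cls"]
    return default

# Hypothetical rules in the spirit of Fig. 7
rules = [
    {"conds": [(15, "T"), (16, "A")], "cls": "positive", "freq": 5, "size": 2},
    {"conds": [(39, "C")],            "cls": "negative", "freq": 7, "size": 1},
]
print(classify({15: "T", 16: "A", 39: "G"}, rules))  # -> "positive"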
4.2. Application to splice DNA sequence data
The splice dataset is also a set of nucleotide sequences provided by the UCI Machine Learning Repository [11]. This dataset consists of three classes, E/I, I/E and Neither, and the length of the string of one instance is 60. Class 'E/I' means a boundary between an 'exon', which is the region of a gene that contains the code for producing protein, and an 'intron', which is the part of a gene that is initially transcribed into the primary RNA transcript, and the same is said of class 'I/E'. This dataset contains 3,190 cases, of which 25% are I/E, 25% are E/I and the remaining 50% are Neither. In mapping the cases in the dataset into the graph structure, we constructed one subgraph for each sequence in the dataset, just the same as for the promoter DNA data. This time we set the frequency threshold at 1% of the total number of instances. Using the same machine, the computation time needed to extract patterns was about 10 min. Some examples of extracted patterns from this dataset are shown in Fig. 8.
Prediction accuracy of the extracted classification rules was evaluated by 10-fold cross-validation using the same rule ordering as in Section 4.1. Table 2 compares the experimental results (error rate) with ID3 and C4.5. The error rate of GBI is slightly smaller than that of ID3 and almost the same as that of C4.5. Thus, it can be said that GBI can be used as a classifier although it was not meant to be a pure classifier.

Table 2
Comparison of prediction accuracy

Learning method    Error rate (%)
ID3                10.6
C4.5               8.0
GBI                8.8
5. Application to chemical compound data
In this section, we report the results obtained by applying
GBI to chemical compound data [12].
5.1. Application to carcinogenicity data
Carcinogenesis prediction is one of the crucial problems in the chemical control of our environment and medical conditions and in the industrial development of new chemical compounds. However, experiments on living bodies and environments to evaluate carcinogenesis are quite expensive and very time consuming, and thus it is sometimes prohibitive to rely solely on experiments from both the economical and the efficiency points of view. It would be extremely useful if some of these properties could be predicted from the structure of the chemical substances before they are actually synthesized.
We have applied GBI to this carcinogenesis prediction using the organochlorine compound data. The task is to find structures typical of carcinogens among organic chlorides comprising C, H and Cl. The data were taken from the National Toxicology Program Database. We used the same small dataset that was used in Ref. [13], in which typical attributes representing substructures of the substances were symbolically extracted and used as inputs to a neural network by which a classifier was induced. The data consist of 41 organic chlorides, out of which 31 are carcinogenic (positive examples) and 10 are non-carcinogenic (negative examples). There are three kinds of links: single bonds, double bonds and aromatic bonds. Several examples of the organochlorine compounds that have carcinogenicity are shown in Fig. 9.
Fig. 8. Examples of extracted patterns from Splice dataset.
Fig. 9. Examples of organochlorine compounds.
In order to apply the algorithm to undirected graphs, undirected graphs are converted to directed graphs by imposing a certain fixed order on node labels. The direction of a link between nodes with the same label is set arbitrarily. For example, by ordering the node labels as a → b → c, the graph on the left in Fig. 10 is converted to the directed graph on the right.

Fig. 10. Conversion from undirected graph to directed graph.
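A minimal sketch of this conversion (our own illustration) orients each undirected edge from the node whose label comes earlier in the imposed label order, breaking ties between identical labels arbitrarily (here by node id):

def orient_edges(nodes, undirected_edges, label_order):
    # Convert an undirected labeled graph to a directed one by imposing a
    # fixed order on node labels, as in Fig. 10.
    rank = {lab: i for i, lab in enumerate(label_order)}
    directed = set()
    for u, v in undirected_edges:
        ru, rv = rank[nodes[u]], rank[nodes[v]]
        if ru < rv or (ru == rv and u <= v):   # same label: arbitrary (by id)
            directed.add((u, v))
        else:
            directed.add((v, u))
    return directed

nodes = {1: "a", 2: "b", 3: "c", 4: "a"}
print(orient_edges(nodes, {(2, 1), (3, 2), (1, 4)}, ["a", "b", "c"]))
# -> {(1, 2), (2, 3), (1, 4)} (set order may vary)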
We treated carbon, chlorine and the benzene ring as distinct nodes in the graphs and ignored hydrogen in this analysis. Further, we treated the single bond, double bond, triple bond and bond between benzene rings as links with different labels. Fig. 11 shows an example of the conversion from an organochlorine compound to its corresponding graph structured data.

Fig. 11. Conversion to a graph structured data.
The frequency threshold used was two, meaning that those pairs which appear most and which occur more than twice were chosen to be chunked. Note that the threshold here is given not as a percentage but as an absolute number of occurrences. The computation time required for pattern extraction was only 1 s on a machine with a Pentium III 600 MHz CPU and 384 MB of memory.
Figs. 12 and 13 show, respectively, patterns extracted from the positive and negative examples. By comparing these two sets of patterns, we can derive useful rules from the patterns that appear only in either the positive or the negative examples (see Fig. 14).
As explained earlier, not all the typical patterns can be extracted by GBI because of its greedy search. In order to evaluate how many typical patterns are actually extracted by GBI, a comparison was made with the patterns extracted by
AGM [6]. AGM is an algorithm for extracting subgraphs based on an extended a priori algorithm. AGM can extract all the typical patterns (both connected and unconnected subgraphs), but its time complexity is exponential in the size of the graph and also in the value of minimum support. In GBI, chunked patterns are never restored to the original pattern in searching for the next pair, so the frequency of each pattern is underestimated compared with its actual value, as shown in Fig. 15. Therefore, if the same minimum support used for AGM is set for GBI as the threshold, it is most likely that some patterns are missed. This can be resolved by setting the threshold to a lower value. Since the computation time for AGM increases exponentially with the value of minimum support, its value was set at 15% and patterns were extracted. As the result, a total of 454 different typical patterns were extracted by AGM, among which 436 were unconnected subgraphs. Thus the patterns that can be compared with GBI-extractable ones are only 18. GBI was able to extract 11 patterns among them. It was further confirmed that those which were not extracted by GBI were all subgraphs of the extracted patterns, e.g. a pattern such as b → c when a → b is chunked first and (a, b) → c is chunked next in the pattern a → b → c. Since any subgraph of an extracted pattern is also a frequent pattern, GBI successfully extracted all the typical (connected) patterns that AGM extracted as frequent. AGM took 17 min to extract patterns whereas GBI took only a second. GBI is by far more efficient than AGM.

Another concern is the effect of node numbering. Different numbering results in different representations of the same graph. GBI was run 10 times with different node orderings. We confirmed that each run gave the same typical patterns in this problem.

5.2. Application to mutagenicity data

Some chemical compounds are known to cause frequent mutations, which are structural alterations in DNA. Since there are so many chemical compounds, it is impossible to obtain mutagenicity data for every compound from biological experiments. Accurate evaluation of mutagenic activity from the chemical structure (structure–activity relationship) is desirable. Furthermore, the mechanism of mutation is extremely complex and known only in part. Some evidence supports the existence of multiple mechanistic pathways for different classes of chemical compounds. If this leads to a hypothesis for the key step in the mechanisms of mutation, it will be very important in mutagenesis research.

The dataset used in this experiment was taken from Ref. [14]. This data contains 230 aromatic or heteroaromatic nitro compounds. Mutagenesis activity was discretized into four categories and used as the class attribute:

Inactive: activity = −99
Low: −99 < activity < 0.0
Medium: 0.0 ≤ activity < 3.0
High: 3.0 ≤ activity

By this categorization, we can classify the above compounds into 22 Inactive cases, 68 Low cases, 105 Medium cases and 35 High cases. The percentages of the classes high, medium, low and inactive are 15.2, 45.7, 29.5 and 9.6%, respectively. Each compound is associated with two other features: LogP value and LUMO energy level. LogP value is the standard measure of hydrophobicity, where P is the water/octanol partition coefficient, and the LUMO energy level shows the energy level of the lowest unoccupied molecular orbital. These were each discretized into two intervals. There are four types of bonds (single, double, triple and aromatic) and four atoms (carbon, chlorine, nitrogen and hydrogen). A pair of an atom and its charge (+ or −) corresponds to a node label in the graph, and a bond between two atoms and its bond type are treated as a link between the nodes and its label, respectively.

In this experiment we use GBI as a tool to construct features and use C4.5 as a classifier. To provide C4.5 with a good set of features we added an extra function to GBI. When we select a pair to be chunked, all the existing pairs are evaluated by measure (1), which is more appropriate to select patterns for use in classification. Note that chunking is based on the frequency measure.

$$\mathrm{Max}\left\{\frac{i/I}{i/I+l/L+m/M+h/H},\ \frac{l/L}{i/I+l/L+m/M+h/H},\ \frac{m/M}{i/I+l/L+m/M+h/H},\ \frac{h/H}{i/I+l/L+m/M+h/H}\right\} \qquad (1)$$

This measure indicates the maximum relative class frequency, i.e. the degree of contribution to class membership. Here, i, l, m and h stand for the number of compounds of each class which have the pair as a subgraph, and I, L, M and H stand for the original number of compounds of each class.
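As a concrete reading of measure (1), the following sketch (our own; variable names are assumptions) computes the maximum relative class frequency of a candidate pair from the per-class counts i, l, m, h and the class sizes I, L, M, H:

def class_separability(counts, totals):
    # counts: compounds of each class containing the pair as a subgraph
    # totals: class sizes; returns the maximum relative class frequency.
    ratios = {c: counts[c] / totals[c] for c in totals}
    denom = sum(ratios.values())
    return max(r / denom for r in ratios.values()) if denom else 0.0

totals = {"I": 22, "L": 68, "M": 105, "H": 35}
# a hypothetical pair occurring mostly in High-activity compounds
print(round(class_separability({"I": 1, "L": 2, "M": 5, "H": 20}, totals), 2))
# -> about 0.82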
Fig. 12. Example patterns extracted from positive examples.
Fig. 13. Example patterns extracted from negative examples.
Fig. 14. Example rules derived from carcinogenicity data.
Fig. 15. Problem in counting pairs.
Fig. 16. Rule from mutagenicity data.

As stated earlier, we used frequency as the evaluation function for selecting a pair to be chunked. We experimented with two different threshold values for the pairs: contained in 10% or more of the compounds, and in 20% or more. We also used two different threshold values of the measure for attribute selection: 30% or more and 40% or more. We obtained 67 patterns for the chunking threshold of 10% and the attribute selection threshold of 30%, 26 patterns for 20 and 30%, 31 patterns for 10 and 40%, and 5 patterns for 20 and 40%, respectively.
We used these selected patterns, LogP value and LUMO
energy level as attributes for C4.5. LogP value and LUMO
value were discretized into two intervals beforehand.
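The feature-construction step can be pictured as follows (a sketch on our part; contains_pattern stands for a subgraph-containment test and all field names are hypothetical): each selected pattern becomes a binary attribute alongside the discretized LogP and LUMO values.

def build_attribute_table(compounds, patterns, contains_pattern):
    # Build one attribute-value row per compound: a binary flag for each
    # selected GBI pattern plus the two discretized global features.
    rows = []
    for comp in compounds:
        row = {f"pattern_{i}": int(contains_pattern(comp["graph"], p))
               for i, p in enumerate(patterns)}
        row["logp_high"] = comp["logp_high"]      # already discretized (0/1)
        row["lumo_high"] = comp["lumo_high"]
        row["class"] = comp["activity_class"]     # Inactive/Low/Medium/High
        rows.append(row)
    return rows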
The prediction error of C4.5 evaluated by 10-fold cross-validation is shown in Table 3.
The prediction error by 10-fold cross-validation is not good. The reason may be the discretization of the originally continuous class value. The following analysis justifies this. The distribution of the data classified by C4.5 is shown in Tables 4–7. For example, the second column of Table 4 shows that the compounds classified as class Low by C4.5 consist of 14 Inactive compounds, 39 Low compounds, 22 Medium compounds and 1 High compound.
If we give 2 points to cases correctly classified by C4.5, 1 point to cases misclassified into an adjoining class (note that the class values are ordered), and 0 points to other cases, and average them over all the data, we obtain 1.50, 1.56, 1.53 and 1.55 points, respectively. This shows that almost all cases are either classified correctly or misclassified into an adjoining class. In other words, misclassified cases are centered on the correct class and C4.5 has succeeded in identifying the global characteristics of the dataset.
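This scoring can be reproduced directly from Table 4 with a short computation (our own sketch; the confusion matrix is read with rows as the true class and columns as the class assigned by C4.5, which does not affect the score):

def adjacency_score(confusion):
    # Average 2/1/0 score: 2 for a correct class, 1 for an adjoining class,
    # 0 otherwise, over all cases in the confusion matrix.
    points = cases = 0
    for true_idx, row in enumerate(confusion):
        for pred_idx, n in enumerate(row):
            cases += n
            if true_idx == pred_idx:
                points += 2 * n
            elif abs(true_idx - pred_idx) == 1:
                points += n
    return points / cases

# Table 4 (chunking threshold 10%, attribute selection threshold 30%);
# classes ordered Inactive, Low, Medium, High
table4 = [[1, 14, 7, 0],
          [5, 39, 24, 0],
          [4, 22, 67, 12],
          [1, 1, 11, 22]]
print(round(adjacency_score(table4), 2))   # -> 1.5, matching the reported 1.50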
One typical rule that is characteristic of activity 'High' is shown in Fig. 16. Negation of the same pattern appears in rules for activity 'Low'. What is typical of this pattern is the coplanarity of the benzene ring and the NO2 group. According to a chemical specialist, steric hindrance to this coplanarity may decrease the mutagenicity of a molecule. This is a small discovery in the domain.
Table 3
Prediction error (%) by C4.5

Threshold for chunking              10%              20%
Threshold for attribute selection   30%     40%      30%     40%
w/o cv                              15.7    19.6     20.0    32.2
10 fcv                              43.9    40.0     42.2    43.5

Table 4
Distribution of data classified by C4.5. Threshold for chunking 10%, threshold for attribute selection 30% (rows: true class; columns: class assigned by C4.5)

            Inactive   Low   Medium   High
Inactive    1          14    7        0
Low         5          39    24       0
Medium      4          22    67       12
High        1          1     11       22

Table 5
Distribution of data classified by C4.5. Threshold for chunking 10%, threshold for attribute selection 40% (rows: true class; columns: class assigned by C4.5)

            Inactive   Low   Medium   High
Inactive    1          15    6        0
Low         3          44    21       0
Medium      3          18    73       10
High        1          0     15       20

Table 6
Distribution of data classified by C4.5. Threshold for chunking 20%, threshold for attribute selection 30% (rows: true class; columns: class assigned by C4.5)

            Inactive   Low   Medium   High
Inactive    1          15    6        0
Low         4          44    20       0
Medium      3          18    75       9
High        0          2     20       13

Table 7
Distribution of data classified by C4.5. Threshold for chunking 20%, threshold for attribute selection 40% (rows: true class; columns: class assigned by C4.5)

            Inactive   Low   Medium   High
Inactive    0          19    3        0
Low         2          45    21       0
Medium      1          25    71       8
High        0          0     21       14

6. Conclusions

In this paper, we showed how we can expand the expressiveness of the GBI algorithm to handle more general graphs, i.e. directed/undirected graphs with colored/uncolored nodes and links, and with/without loop structure (including self-loops). The time complexity of the implemented program was evaluated from both theoretical and experimental points of view. The algorithm runs almost linearly with the graph size (number of nodes in the graph). We further applied the enhanced GBI to two kinds of real-world data (a classification problem on DNA sequence data and extraction of typical patterns from chemical structure data) and showed its usefulness. We also showed GBI's potential capability as a feature construction tool.

Future work includes the following. The extracted patterns may be affected by node ordering because GBI employs a greedy search. No significant difference was seen in the experiments, but we feel that it is necessary to investigate this sensitivity and devise a way not to be strongly affected by node ordering. In this paper we limited the characterization of typicality primarily to frequency and secondarily to a heuristic class separability measure. Other frequency-based characterizations must be investigated. It is also felt necessary to make use of domain knowledge to control the chunking process in the current framework so that unnecessary patterns are not extracted and more focused patterns are extracted. Further, undirected graphs were mechanically converted into directed graphs to use the GBI algorithm without any change. It is necessary to improve the algorithm so that undirected graphs can be treated strictly.
References
[1] Quinlan JR. Induction of decision trees. Mach Learn 1986;1:81–106.
[2] Quinlan JR. C4.5: programs for machine learning. Los Altos, CA: Morgan Kaufmann, 1993.
[3] Michalski RS. Learning flexible concepts: fundamental ideas and a method based on two-tiered representation. Mach Learn, Artif Intell Approach 1990;3:63–102.
[4] Clark P, Niblett T. The CN2 induction algorithm. Mach Learn 1989;3:261–83.
[5] Muggleton S, de Raedt L. Inductive logic programming: theory and methods. J Logic Progr 1994;19/20:629–79.
[6] Inokuchi A, Washio T, Motoda H. An a priori-based algorithm for mining frequent substructures from graph data. Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery, 2000. p. 13–23.
[7] Cook DJ, Holder LB. Graph-based data mining. IEEE Intell Syst 2000;15(2):32–41.
[8] Yoshida K, Motoda H. CLIP: concept learning from inference patterns. Artif Intell 1995;75(1):63–92.
[9] Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. Belmont, CA: Wadsworth, 1984.
[10] Matsuda T, Horiuchi T, Motoda H, Washio T. Extension of graph-based induction for general graph structured data. In: Knowledge discovery and data mining: current issues and new applications. New York: Springer, 2000. LNAI 1805, p. 420–31.
[11] Blake CL, Keogh E, Merz C. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[12] Matsuda T, Horiuchi T, Motoda H, Washio T. Graph-based induction for general graph structured data and its application to chemical compound data. Proceedings of the Third International Conference on Discovery Science, 2000.
[13] Matsumoto T, Tanabe K. Prediction of carcinogenicity of organic chlorine-containing compounds by neural network. JCPE J 1999;11(1):29–34 (in Japanese).
[14] Debnath AK, Lopez de Compadre RL, Debnath G, Shusterman AJ, Hansch C. Structure–activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J Med Chem 1991;34:786–97.