Advanced Engineering Informatics 16 (2002) 135–143
www.elsevier.com/locate/aei
Graph-based induction and its applications
Takashi Matsuda*, Hiroshi Motoda, Takashi Washio
I.S.I.R., Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka 567-0047, Japan
Received 18 May 2001; accepted 15 January 2002
Abstract
A machine learning technique called Graph-based induction (GBI) efficiently extracts typical patterns from graph data by stepwise pair expansion (pairwise chunking). In this paper, we introduce GBI for general graph structured data, which can handle directed/undirected, colored/uncolored graphs with/without (self) loops and with colored/uncolored links. We show that its time complexity is almost linear with the size of the graph. We further show that GBI can effectively be applied to the extraction of typical patterns from DNA sequence data and organochlorine compound data, from which classification rules are to be generated, and that GBI also works as a feature construction component for other machine learning tools. © 2002 Published by Elsevier Science Ltd.
Keywords: Graph-based induction; General graph structured data; Data mining; Machine learning
1. Introduction
There has been a considerable amount of research on data mining seeking better performance over the last few years. Better performance includes mining from structured data, which is a new challenge, and there has been little work on this subject. Since structure is represented by proper relations and a graph can easily represent relations, knowledge discovery from graph structured data poses a general problem for mining from structured data. Some examples amenable to graph mining are finding typical Web browsing patterns, identifying characteristic substructures of chemical compounds and discovering diagnostic rules from patient history records.
The majority of the widely used methods are for data that do not have structure and are represented by attribute-value pairs. Decision trees [1,2] and induction rules [3,4] relate attribute values to target classes. Association rules, often used in data mining, also use this attribute-value pair representation. However, the attribute-value pair representation is not suitable for representing a more general data structure, and there are problems that need a more powerful representation.
* Corresponding author. Tel.: +81-6-6879-8542; fax: +81-6-6879-8544.
E-mail addresses: [email protected] (T. Matsuda), [email protected] (H. Motoda), [email protected] (T. Washio).
1474-0346/02/$ - see front matter © 2002 Published by Elsevier Science Ltd. PII: S1474-0346(02)00005-8

The most powerful representation that can handle relations, and thus structure, would be inductive logic programming
(ILP) [5], which uses first-order predicate logic. It can represent general relationships embedded in data, and has the merit that domain knowledge and acquired knowledge can be utilized as background knowledge. However, it is not efficient enough to solve large-scale problems, and its state of the art is not so mature that anyone can use the technique easily.
AGM (a priori-based graph mining) [6] was developed for the purpose of mining association rules among the frequently appearing substructures in a given graph data set. A graph transaction is represented by an adjacency matrix, and the frequent patterns appearing in the matrices are mined through an extended algorithm of basket analysis. This algorithm can extract all connected/disconnected substructures by complete search. However, its computation time increases exponentially with input graph size and support. AGM can use only frequency as the evaluation function.
SUBDUE [7] is an algorithm for extracting a subgraph which can best compress an input graph based on the minimum description length (MDL) principle. The found substructure can be considered a concept. This algorithm is based on a computationally constrained beam search. It begins with a substructure comprising only a single vertex in the input graph, and grows it incrementally by expanding a node in it. At each expansion it evaluates the total description length (DL) of the input graph, which is defined as the sum of two terms: the DL of the substructure and the DL of the input graph in which all the instances of the substructure are replaced by single nodes. It stops when the substructure that minimizes
the total DL is found. After the optimal substructure is found and the input graph is rewritten, the next iteration starts using the rewritten graph as a new input. In this way, SUBDUE finds a more abstract concept at each round of iteration. As is clear, the algorithm can find only one substructure at each iteration. Further, it does not strictly maintain the original input graph structure after compression, because its aim is to facilitate the global understanding of a complex database by forming hierarchical concepts and using them to approximately describe the input data.
Graph-based induction (GBI) [8] is a technique which was devised for the purpose of discovering typical patterns in directed graph data by recursively chunking two adjoining nodes, and its expressiveness lies in between the attribute-value pair representation and first-order logic. The computation time for GBI is very short because of its greedy search, and GBI does not lose any information of the graph structure after chunking. GBI can use various evaluation functions based on frequency. GBI is not suitable for pattern extraction from graph-structured data where many nodes share the same label, because of its greedy recursive chunking without backtracking. However, it is still thought effective in extracting patterns from graph structured data where each node has a distinct label (e.g. World Wide Web browsing data) or where some typical structures exist even if some nodes share the same labels (e.g. chemical structure data containing benzene rings, etc.).
The previous implementation of GBI [8, p. 78] could handle only tree structured data with node labels and link labels as inputs. We have enhanced the expressiveness of GBI so that it can handle general graph data having loops (including self-loops) with colored/uncolored nodes and links. The paper is organized as follows. In Section 2, we briefly describe the framework of GBI. In Section 3, we discuss the time complexity of GBI from both theoretical and experimental points of view, and in Sections 4 and 5, we show that GBI can successfully extract typical patterns or classification rules by applying it to DNA sequence data and chemical compound data, and that GBI can also be used as a means to construct compound attributes for use in other classifiers. In Section 6 we conclude the paper by summarizing the results and the future work.
2. Graph-based induction

The original GBI was so formulated as to minimize the graph size by replacing each found pattern with one node, so that it repeatedly contracted the graph [8]. We assume that typical patterns represent some concepts and 'typicality' is characterized by the pattern's frequency or the value of some evaluation function of its frequency. GBI is realized under this assumption by the idea of extracting typical patterns by stepwise pair expansion, as shown in Fig. 1. We can use statistical indices as an evaluation function, such as the frequency itself, Information Gain [1], Gain Ratio [2] and Gini Index [9], all of which are based on frequency. Said differently, we do not deal with concepts that cannot be measured with evaluation functions definable from frequency.

Fig. 1. The basic idea of the GBI method.

Fig. 2 explains the idea of 'stepwise pair expansion' by showing the process of extracting from an input graph 'a pair of data which are highly correlated'. Since a 'pair' grows into a complicated pattern by recursive stepwise pair expansion, each extracted pair is called a 'typical pattern' or an 'extracted pattern'.

In Fig. 2, it is assumed that two typical patterns (A, B) have already been extracted from the input graph. The stepwise pair expansion (pairwise chunking) repeats the following three steps until no more typical patterns are found.

Step 1. Rewrite all the patterns in the input graph which are identical to the newly chunked pattern to one node and assign a new label.
Step 2. Extract all the pairs consisting of two connected nodes in the contracted graph.
Step 3. Select the most typical pair from among the extracted pairs and register it as the pattern to chunk. If either or both nodes of the selected pair have already been rewritten at Step 1, they are restored to the original patterns before registration.

Fig. 2. The idea of pairwise chunking.
Fig. 3. The concept of the pair in the GBI method.
When a pair is chunked and either one or both of the parent and child nodes have already been chunked, it is important to keep track of where in the parent the link starts and where in the child it ends (see Fig. 3). All the extracted patterns are characterizations of the input data. The search at Step 2 is greedy and without backtracking, which means that in enumerating pairs no pattern which has been chunked into one node is restored to its original pattern. Because of this, not all the typical patterns that exist in the input graph are necessarily extracted. The problem of extracting all the isomorphic subgraphs is known to be NP-hard. Thus, GBI aims at extracting only meaningful typical patterns of a certain size. Its objective is neither finding all the typical patterns nor finding all the frequent patterns. The merit of GBI is its efficiency. Its time complexity is almost linear in the size of the graph, as shown in Section 3. It is very useful in particular when the size of the input graph is huge. GBI has disadvantages, too. When each node has a distinct label in the input graph, no ambiguity arises in selecting a pair to be chunked and GBI performs well. However, since the search in GBI is greedy, when the same label is shared by more than one node in the input graph, ambiguity arises when there are ties in the evaluation function or there is a chain of nodes of the same label. For example, in the case of a structure like a → a → a, we do not know which a → a is best to chunk.
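To make the loop concrete, here is a minimal Python sketch of stepwise pair expansion (our own illustration, not the authors' implementation; it assumes a directed graph with labeled nodes, uncolored links and frequency as the evaluation function):

from collections import Counter

def gbi_chunk(nodes, edges, min_freq=2, max_steps=100):
    # nodes: dict node_id -> label; edges: set of (src_id, dst_id) links.
    # Returns the chunked pattern labels in the order they were found.
    nodes, edges = dict(nodes), set(edges)
    patterns = []
    for _ in range(max_steps):
        # Step 2: enumerate connected pairs and count them by label pair.
        counts = Counter((nodes[s], nodes[d]) for s, d in edges)
        if not counts:
            break
        # Step 3: select the most typical (here: most frequent) pair.
        (src_lab, dst_lab), freq = counts.most_common(1)[0]
        if freq < min_freq:
            break
        patterns.append(f"({src_lab}->{dst_lab})")
        # Step 1: rewrite every non-overlapping occurrence into one node.
        used = set()
        for s, d in sorted(edges):
            if (nodes.get(s), nodes.get(d)) != (src_lab, dst_lab):
                continue
            if s in used or d in used or s == d:
                continue
            used.update((s, d))
            nodes[s] = patterns[-1]          # s becomes the chunk node
            del nodes[d]
            # Redirect links that touched d to the chunk node s.
            edges = {(s if a == d else a, s if b == d else b)
                     for a, b in edges if (a, b) != (s, d)}
        edges = {(a, b) for a, b in edges if a in nodes and b in nodes}
    return patterns

# Example: the pair (a -> b) occurs twice, so it is chunked first.
nodes = {1: "a", 2: "b", 3: "c", 4: "a", 5: "b"}
edges = {(1, 2), (2, 3), (4, 5)}
print(gbi_chunk(nodes, edges))   # -> ['(a->b)']

In this sketch, chunked nodes keep a composite label so that later iterations can chunk pairs of already-chunked patterns, which is what allows small pairs to grow into larger typical patterns.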
3. Performance evaluation
3.1. Theoretical evaluation of time complexity
The time complexity of the implemented program was
theoretically evaluated. Let N_i, l_i, P_i and C denote, respectively, at chunking step i, the total number of nodes in the graph, the average number of outgoing links from one node, the number of different kinds of pairs in the graph, and the number of different kinds of chunked patterns derived from the graph data.
The time complexity to read the input data is O(N_0 l_0), because the total number of links in the input graph is N_0 l_0 and all the link information in the graph must be read. The time complexity to count the number of pairs of each kind is O(N_0 l_0 log N_0 l_0), because all the links in the graph must be searched. Throughout this paper, we use frequency as the measure for chunking. The time complexity to select the pair to be chunked at chunking step i is O(P_i), because the most frequently appearing pair must be found by scanning all the pair information. The time complexity to perform the pairwise chunking is O(N_i l_i), because all the links in the graph must be searched. The time complexity to update the pair information is O(P_{i+1}), because all kinds of pairs in the graph must be searched. GBI repeats this process until the total number of chunked patterns becomes C. Therefore, the total time complexity is

$$O\left(N_0 l_0 + N_0 l_0 \log N_0 l_0 + \sum_{i=0}^{C-1} \left(P_i + N_i l_i + P_{i+1}\right)\right)$$

where N_i < N_0 and l_i < l_0, because the number of nodes and the number of links at chunking step i are less than the number of nodes and the number of links in the input graph. Furthermore, P_i ≤ N_i l_i < N_0 l_0, because the number of different kinds of pairs is less than the total number of links in the graph. Therefore, the time complexity of GBI is

$$< O\left(N_0 l_0 \log N_0 l_0 + \sum_{i=0}^{C-1} N_0 l_0\right) = O\left(N_0 l_0 \log N_0 l_0 + C N_0 l_0\right)$$

Since the maximum number of chunks C is less than the total number of links in the input graph, C ≤ N_0 l_0, and thus the maximum time complexity of GBI is O(N_0^2 l_0^2).
3.2. Experimental evaluation of time complexity
The time complexity of the implemented program was also experimentally evaluated by applying it to artificially generated datasets. The machine used for this experiment is a PC with a Pentium II 400 MHz CPU and 256 MB of memory. Random graphs were artificially generated, increasing the number of nodes by 100 from 100 to 10,000, with a fixed average number of outgoing links from each node (three and ten) and a fixed number of node labels (one and five). The frequency threshold was set at 4% of the total number of nodes in the initial input graph. That is, those pairs which appear most and whose occurrence frequency exceeds the threshold were chosen to be chunked. Under the above setting, the computation time was measured from the start to the end of execution. The result is shown in Fig. 4.
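For reference, random graphs of this kind can be generated with a sketch like the following (our own; parameter names are assumptions), and then fed to a chunking routine such as the gbi_chunk sketch in Section 2 with a threshold of 4% of the node count:

import random

def random_labeled_graph(n_nodes, avg_out_links, n_labels, seed=0):
    # Each node gets one of n_labels labels and avg_out_links outgoing links
    # to uniformly chosen targets (duplicate links collapse in the edge set,
    # so the realized average out-degree is approximately avg_out_links).
    rng = random.Random(seed)
    nodes = {i: f"L{rng.randrange(n_labels)}" for i in range(n_nodes)}
    edges = set()
    for src in range(n_nodes):
        for _ in range(avg_out_links):
            edges.add((src, rng.randrange(n_nodes)))
    return nodes, edges

# e.g. 1,000 nodes, average out-degree three, five node labels
nodes, edges = random_labeled_graph(1000, 3, 5)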
Fig. 4. Computation time vs. number of nodes.

It can be said from Fig. 4 that the computation time increases almost linearly with the number of nodes in the graph. It is noted that the computation time is longer for graphs having fewer node labels when the graph size is equal. This is because the probability of the same patterns being in the graph becomes higher for graphs with fewer node labels, and this invokes a larger number of pairwise chunkings. It is also noted that the computation time becomes longer for graphs having more outgoing links from each node when the graph size is equal.
Next, how the number of links in the graph affects the computation time was evaluated. Random graphs were again artificially generated, increasing the probability of an outgoing link per node by 10% from 10 to 100% (100% corresponds to the complete graph). Here, we fixed the graph size at 200 nodes and evaluated three cases where the number of different node labels was one, three and five. The same threshold value was used as above. The result is shown in Fig. 5.
It is found from Fig. 5 that the computation time is upper-bounded by a quadratic function for an increase in the number of links in the graph, as predicted by the theoretical evaluation. However, in real-world data, when the number of nodes in the graph increases, the number of outgoing links from one node would remain almost the same (e.g. the number of bonds in chemical compounds), a value specific to each domain. Therefore, the computation time for real-world data would behave as shown in Fig. 4.
Fig. 5. Computation time vs. number of links.

4. Extracting classification rules by GBI

As an initial test to evaluate the performance of GBI, we have applied GBI to extract classification rules from DNA sequence data [10].

Fig. 6. Extraction of classification rules from DNA sequence data.
4.1. Application to promoter DNA sequence data
The promoter dataset is one of the benchmark datasets provided by the UCI Machine Learning Repository [11]. A promoter is a genetic region which initiates the first step in the expression of an adjacent gene (transcription). The promoter dataset consists of strings that represent nucleotides (one of A, G, T or C). The input features are 57 sequential DNA nucleotides and the total number of instances is 106, including 53 positive instances (sample promoter sequences) and 53 negative instances (non-promoter sequences).
Fig. 6 shows the process of mapping the problem into a colored directed graph, using GBI to extract patterns and interpreting them as classification rules. In Fig. 6, for example, the first sequence in the upper left figure is mapped into a graph like the left one in the lower left figure. This graph shows that this sequence belongs to the positive class and has C as the first nucleotide, G as the second, T as the third and so on. We give a set of these graphs to GBI as input and extract patterns (lower right figure). Finally, the extracted patterns can be interpreted as rules as in the upper right figure. The same frequency threshold was used, i.e. 4% of the number of instances. The time required for extracting
patterns was 10 s using a PC with a Pentium II 400 MHz CPU and 256 MB of memory. Some examples of the resultant patterns from this dataset are shown in Fig. 7.

Fig. 7. Examples of extracted patterns from Promoter dataset.

Table 1
Comparison of prediction accuracy for promoter sequence classification

Learning method    No. of errors (out of 106)
ID3                19
C4.5               18
GBI                16
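One plausible encoding of a sequence as a colored directed graph in the spirit of Fig. 6 is sketched below (our own illustration; the exact node and link conventions of the authors' encoding are not reproduced here): the class label becomes a root node and the nucleotides are linked in reading order.

def sequence_to_graph(sequence, class_label, start_id=0):
    # Map one DNA sequence and its class to a small directed graph.
    # The class node is the root; each nucleotide node is chained in reading
    # order so that positional information is preserved by the structure.
    nodes = {start_id: class_label}          # e.g. "positive" or "negative"
    edges = set()
    prev = start_id
    for offset, base in enumerate(sequence, start=1):
        node_id = start_id + offset
        nodes[node_id] = base                # one of "A", "C", "G", "T"
        edges.add((prev, node_id))
        prev = node_id
    return nodes, edges

nodes, edges = sequence_to_graph("CGT", "positive")
# nodes: {0: 'positive', 1: 'C', 2: 'G', 3: 'T'}; edges: {(0,1), (1,2), (2,3)}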
These patterns can be interpreted as rules by considering the root node as a conclusion and the leaf nodes and the links to them as conditions. We measured the prediction accuracy of the extracted classification rules by leave-one-out since the dataset was small. Many rules of different importance (support and confidence in data mining terminology) are extracted by this method, and thus rule ordering is important. The rules are ordered from the lowest frequency to the highest, and those within a tie are ordered according to the size of the pattern, the largest coming first. That is, more specific rules are given higher priority. The patterns corresponding to rule conditions are matched against the test data in this order. Table 1 compares the experimental results (the number of errors out of the total 106 cases) with two other learning methods, ID3 and C4.5. Both ID3 [1] and C4.5 [2] are well-known decision tree learners. They are based on the divide-and-conquer algorithm. As a splitting function, ID3 uses information gain and C4.5 uses the information gain ratio. From this table, it is noted that the error rate of GBI is slightly lower than that of these standard tree-induction programs.
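The rule ordering and matching just described can be sketched as follows (our own illustration; the rule fields, the (position, base) condition format and the default class are assumptions, not the authors' data structures):

def order_rules(rules):
    # Lowest frequency first; within a tie the larger (more specific)
    # pattern comes first.
    return sorted(rules, key=lambda r: (r["freq"], -r["size"]))

def classify(instance, rules, default="negative"):
    # Return the conclusion of the first rule whose conditions all hold.
    # instance maps positions to bases; conditions are (position, base).
    for rule in order_rules(rules):
        if all(instance.get(pos) == base for pos, base in rule["conds"]):
            return rule["cls"]
    return default

# Hypothetical rules in the spirit of Fig. 7
rules = [
    {"conds": [(15, "T"), (16, "A")], "cls": "positive", "freq": 5, "size": 2},
    {"conds": [(39, "C")],            "cls": "negative", "freq": 7, "size": 1},
]
print(classify({15: "T", 16: "A", 39: "G"}, rules))  # -> "positive"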
4.2. Application to splice DNA sequence data
The splice dataset is also a set of nucleotide sequences provided by the UCI Machine Learning Repository [11]. This dataset consists of three classes, E/I, I/E and Neither, and the length of the string of one instance is 60. Class 'E/I' means a boundary between an 'exon', which is the region of a gene that contains the code for producing protein, and an 'intron', which is the part of a gene that is initially transcribed into the primary RNA transcript, and the same is said of class 'I/E'. This dataset contains 3,190 cases, of which 25% are I/E, 25% are E/I and the remaining 50% are Neither. In mapping the cases in the dataset into the graph structure, we constructed one subgraph for each sequence in the dataset, just the same as for the promoter DNA data. This time we set the frequency threshold at 1% of the total number of instances. Using the same machine, the computation time needed to extract patterns was about 10 min. Some examples of extracted patterns from this dataset are shown in Fig. 8.
Prediction accuracy of the extracted classification rules was evaluated by 10-fold cross-validation using the same rule ordering as in Section 4.1. Table 2 compares the experimental results (error rate) with ID3 and C4.5. The error rate of GBI is slightly smaller than that of ID3 and almost the same as that of C4.5. Thus, it can be said that GBI can be used as a classifier although it was not meant to be a pure classifier.

Table 2
Comparison of prediction accuracy

Learning method    Error rate (%)
ID3                10.6
C4.5               8.0
GBI                8.8
5. Application to chemical compound data
In this section, we report the results obtained by applying
GBI to chemical compound data [12].
5.1. Application to carcinogenicity data
Carcinogenesis prediction is one of the crucial problems in the chemical control of our environment and medical conditions and in the industrial development of new chemical compounds. However, experiments on living bodies and environments to evaluate carcinogenesis are quite expensive and very time consuming, and thus it is sometimes prohibitive to rely solely on experiments from both the economical and the efficiency points of view. It would be extremely useful if some of these properties could be predicted from the structure of the chemical substances before they are actually synthesized.
We have applied GBI to this carcinogenesis prediction using the organochlorine compound data. The task is to find structures typical of carcinogens among organic chlorides comprising C, H and Cl. The data were taken from the National Toxicology Program Database. We used the same small dataset that was used in Ref. [13], in which typical attributes representing substructures of the substances were symbolically extracted and used as inputs to a neural network by which a classifier was induced. The data consist of 41 organic chlorides, out of which 31 are carcinogenic (positive examples) and 10 are non-carcinogenic (negative examples). There are three kinds of links: single bonds, double bonds and aromatic bonds. Several examples of the organochlorine compounds that have carcinogenicity are shown in Fig. 9.
Fig. 8. Examples of extracted patterns from Splice dataset.
Fig. 9. Examples of organochlorine compounds.
In order to apply the algorithm to undirected graphs, undirected graphs are converted to directed graphs by imposing a certain fixed order on node labels. The direction of a link between nodes with the same label is set arbitrarily. For example, by ordering the node labels as a → b → c, the graph on the left in Fig. 10 is converted to the directed graph on the right.

Fig. 10. Conversion from undirected graph to directed graph.
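A minimal sketch of this conversion (our own illustration) orients each undirected edge from the node whose label comes earlier in the imposed label order, breaking ties between identical labels arbitrarily (here by node id):

def orient_edges(nodes, undirected_edges, label_order):
    # Convert an undirected labeled graph to a directed one by imposing a
    # fixed order on node labels, as in Fig. 10.
    rank = {lab: i for i, lab in enumerate(label_order)}
    directed = set()
    for u, v in undirected_edges:
        ru, rv = rank[nodes[u]], rank[nodes[v]]
        if ru < rv or (ru == rv and u <= v):   # same label: arbitrary (by id)
            directed.add((u, v))
        else:
            directed.add((v, u))
    return directed

nodes = {1: "a", 2: "b", 3: "c", 4: "a"}
print(orient_edges(nodes, {(2, 1), (3, 2), (1, 4)}, ["a", "b", "c"]))
# -> {(1, 2), (2, 3), (1, 4)} (set order may vary)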
We treated carbon, chlorine and the benzene ring as distinct nodes in the graphs and ignored hydrogen in this analysis. Further, we treated the single bond, double bond, triple bond and bond between benzene rings as links with different labels. Fig. 11 shows an example of the conversion from an organochlorine compound to its corresponding graph structured data.

Fig. 11. Conversion to a graph structured data.
The frequency threshold used was two, meaning that those pairs which appear most and which occur more than twice were chosen to be chunked. Note that the threshold here is given not as a percentage but as an absolute number of occurrences. The computation time required for pattern extraction was only 1 s on a machine with a Pentium III 600 MHz CPU and 384 MB of memory.
Figs. 12 and 13 show, respectively, patterns extracted from the positive and negative examples. By comparing these two sets of patterns, we can derive useful rules from the patterns that appear only in either the positive or the negative examples (see Fig. 14).
As explained earlier, not all the typical patterns can be extracted by GBI because of its greedy search. In order to evaluate how many typical patterns are actually extracted by GBI, a comparison was made with the patterns extracted by
AGM [6]. AGM is an algorithm for extracting subgraphs based on an extended a priori algorithm. AGM can extract all the typical patterns (both connected and unconnected subgraphs), but its time complexity is exponential in the size of the graph and also in the value of minimum support. In GBI, chunked patterns are never restored to the original pattern in searching for the next pair, so the frequency of each pattern is underestimated compared with its actual value, as shown in Fig. 15. Therefore, if the same minimum support used for AGM is set for GBI as the threshold, it is most likely that some patterns are missed. This can be resolved by setting the threshold to a lower value. Since the computation time for AGM increases exponentially with the value of minimum support, its value was set at 15% and patterns were extracted. As the result, a total of 454 different typical patterns were extracted by AGM, among which 436 were unconnected subgraphs. Thus the patterns that can be compared with GBI-extractable ones are only 18. GBI was able to extract 11 patterns among them. It was further confirmed that those which were not extracted by GBI were all subgraphs of the extracted patterns, e.g. a pattern such as b → c when a → b is chunked first and (a, b) → c is chunked next in the pattern a → b → c. Since any subgraph of an extracted pattern is also a frequent pattern, GBI successfully extracted all the typical (connected) patterns that AGM extracted as frequent. AGM took 17 min to extract patterns whereas GBI took only a second. GBI is by far more efficient than AGM.

Another concern is the effect of node numbering. Different numbering results in different representations of the same graph. GBI was run 10 times with different node orderings. We confirmed that each run gave the same typical patterns in this problem.

5.2. Application to mutagenicity data

Some chemical compounds are known to cause frequent mutations, which are structural alterations in DNA. Since there are so many chemical compounds, it is impossible to obtain mutagenicity data for every compound from biological experiments. Accurate evaluation of mutagenic activity from the chemical structure (structure–activity relationship) is desirable. Furthermore, the mechanism of mutation is extremely complex and known only in part. Some evidence supports the existence of multiple mechanistic pathways for different classes of chemical compounds. If this leads to a hypothesis for the key step in the mechanisms of mutation, it will be very important in mutagenesis research.

The dataset used in this experiment was taken from Ref. [14]. This data contains 230 aromatic or heteroaromatic nitro compounds. Mutagenesis activity was discretized into four categories and used as the class attribute:

Inactive: activity = −99
Low: −99 < activity < 0.0
Medium: 0.0 ≤ activity < 3.0
High: 3.0 ≤ activity

By this categorization, we can classify the above compounds into 22 Inactive cases, 68 Low cases, 105 Medium cases and 35 High cases. The percentages of the classes high, medium, low and inactive are 15.2, 45.7, 29.5 and 9.6%, respectively. Each compound is associated with two other features: LogP value and LUMO energy level. LogP value is the standard measure of hydrophobicity, where P is the water/octanol partition coefficient, and the LUMO energy level shows the energy level of the lowest unoccupied molecular orbital. These were each discretized into two intervals. There are four types of bonds (single, double, triple and aromatic) and four atoms (carbon, chlorine, nitrogen and hydrogen). A pair of an atom and its charge (+ or −) corresponds to a node label in the graph, and a bond between two atoms and its bond type are treated as a link between the nodes and its label, respectively.

In this experiment we use GBI as a tool to construct features and use C4.5 as a classifier. To provide C4.5 with a good set of features we added an extra function to GBI. When we select a pair to be chunked, all the existing pairs are evaluated by measure (1), which is more appropriate to select patterns for use in classification. Note that chunking is based on the frequency measure.

$$\mathrm{Max}\left\{\frac{i/I}{i/I+l/L+m/M+h/H},\ \frac{l/L}{i/I+l/L+m/M+h/H},\ \frac{m/M}{i/I+l/L+m/M+h/H},\ \frac{h/H}{i/I+l/L+m/M+h/H}\right\} \qquad (1)$$

This measure indicates the maximum relative class frequency, i.e. the degree of contribution to class membership. Here, i, l, m and h stand for the number of compounds of each class which have the pair as a subgraph, and I, L, M and H stand for the original number of compounds of each class.
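As a concrete reading of measure (1), the following sketch (our own; variable names are assumptions) computes the maximum relative class frequency of a candidate pair from the per-class counts i, l, m, h and the class sizes I, L, M, H:

def class_separability(counts, totals):
    # counts: compounds of each class containing the pair as a subgraph
    # totals: class sizes; returns the maximum relative class frequency.
    ratios = {c: counts[c] / totals[c] for c in totals}
    denom = sum(ratios.values())
    return max(r / denom for r in ratios.values()) if denom else 0.0

totals = {"I": 22, "L": 68, "M": 105, "H": 35}
# a hypothetical pair occurring mostly in High-activity compounds
print(round(class_separability({"I": 1, "L": 2, "M": 5, "H": 20}, totals), 2))
# -> about 0.82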
Fig. 12. Example patterns extracted from positive examples.
Fig. 13. Example patterns extracted from negative examples.
Fig. 14. Example rules derived from carcinogenicity data.
Fig. 15. Problem in counting pairs.
Fig. 16. Rule from mutagenicity data.

As stated earlier, we used frequency as the evaluation function for selecting a pair to be chunked. We experimented with two different threshold values for the pairs: contained in 10% or more of the compounds, and in 20% or more. We also used two different threshold values of the measure for attribute selection: 30% or more and 40% or more. We obtained 67 patterns for the chunking threshold of 10% and the attribute selection threshold of 30%, 26 patterns for 20 and 30%, 31 patterns for 10 and 40%, and 5 patterns for 20 and 40%, respectively.
We used these selected patterns, LogP value and LUMO
energy level as attributes for C4.5. LogP value and LUMO
value were discretized into two intervals beforehand.
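The feature-construction step can be pictured as follows (a sketch on our part; contains_pattern stands for a subgraph-containment test and all field names are hypothetical): each selected pattern becomes a binary attribute alongside the discretized LogP and LUMO values.

def build_attribute_table(compounds, patterns, contains_pattern):
    # Build one attribute-value row per compound: a binary flag for each
    # selected GBI pattern plus the two discretized global features.
    rows = []
    for comp in compounds:
        row = {f"pattern_{i}": int(contains_pattern(comp["graph"], p))
               for i, p in enumerate(patterns)}
        row["logp_high"] = comp["logp_high"]      # already discretized (0/1)
        row["lumo_high"] = comp["lumo_high"]
        row["class"] = comp["activity_class"]     # Inactive/Low/Medium/High
        rows.append(row)
    return rows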
The prediction error of C4.5 evaluated by 10-fold cross-validation is shown in Table 3.
The prediction error by 10-fold cross-validation is not good. The reason may be the discretization of the originally continuous class value. The following analysis justifies this. The distribution of the data classified by C4.5 is shown in Tables 4–7. For example, the second column of Table 4 shows that the compounds classified as class Low by C4.5 consist of 14 Inactive compounds, 39 Low compounds, 22 Medium compounds and 1 High compound.
If we give 2 points to cases correctly classified by C4.5, 1 point to cases misclassified into an adjoining class (note that the class values are ordered), and 0 points to other cases, and average them over all the data, we obtain 1.50, 1.56, 1.53 and 1.55 points, respectively. This shows that almost all cases are either classified correctly or misclassified into an adjoining class. In other words, misclassified cases are centered on the correct class and C4.5 has succeeded in identifying the global characteristics of the dataset.
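This scoring can be reproduced directly from Table 4 with a short computation (our own sketch; the confusion matrix is read with rows as the true class and columns as the class assigned by C4.5, which does not affect the score):

def adjacency_score(confusion):
    # Average 2/1/0 score: 2 for a correct class, 1 for an adjoining class,
    # 0 otherwise, over all cases in the confusion matrix.
    points = cases = 0
    for true_idx, row in enumerate(confusion):
        for pred_idx, n in enumerate(row):
            cases += n
            if true_idx == pred_idx:
                points += 2 * n
            elif abs(true_idx - pred_idx) == 1:
                points += n
    return points / cases

# Table 4 (chunking threshold 10%, attribute selection threshold 30%);
# classes ordered Inactive, Low, Medium, High
table4 = [[1, 14, 7, 0],
          [5, 39, 24, 0],
          [4, 22, 67, 12],
          [1, 1, 11, 22]]
print(round(adjacency_score(table4), 2))   # -> 1.5, matching the reported 1.50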
One typical rule that is characteristic of activity 'High' is shown in Fig. 16. Negation of the same pattern appears in rules for activity 'Low'. What is typical of this pattern is the coplanarity of the benzene ring and the NO2 group. According to a chemical specialist, steric hindrance to this coplanarity may decrease the mutagenicity of a molecule. This is a small discovery in the domain.
Table 3
Prediction error (%) by C4.5

Threshold for chunking              10%              20%
Threshold for attribute selection   30%     40%      30%     40%
w/o cv                              15.7    19.6     20.0    32.2
10 fcv                              43.9    40.0     42.2    43.5

Table 4
Distribution of data classified by C4.5. Threshold for chunking 10%, threshold for attribute selection 30% (rows: true class; columns: class assigned by C4.5)

            Inactive   Low   Medium   High
Inactive    1          14    7        0
Low         5          39    24       0
Medium      4          22    67       12
High        1          1     11       22

Table 5
Distribution of data classified by C4.5. Threshold for chunking 10%, threshold for attribute selection 40% (rows: true class; columns: class assigned by C4.5)

            Inactive   Low   Medium   High
Inactive    1          15    6        0
Low         3          44    21       0
Medium      3          18    73       10
High        1          0     15       20

Table 6
Distribution of data classified by C4.5. Threshold for chunking 20%, threshold for attribute selection 30% (rows: true class; columns: class assigned by C4.5)

            Inactive   Low   Medium   High
Inactive    1          15    6        0
Low         4          44    20       0
Medium      3          18    75       9
High        0          2     20       13

Table 7
Distribution of data classified by C4.5. Threshold for chunking 20%, threshold for attribute selection 40% (rows: true class; columns: class assigned by C4.5)

            Inactive   Low   Medium   High
Inactive    0          19    3        0
Low         2          45    21       0
Medium      1          25    71       8
High        0          0     21       14

6. Conclusions

In this paper, we showed how we can expand the expressiveness of the GBI algorithm to handle more general graphs, i.e. directed/undirected graphs with colored/uncolored nodes and links, and with/without loop structure (including self-loops). The time complexity of the implemented program was evaluated from both theoretical and experimental points of view. The algorithm runs almost linearly with the graph size (number of nodes in the graph). We further applied the enhanced GBI to two kinds of real-world data (a classification problem on DNA sequence data and extraction of typical patterns from chemical structure data) and showed its usefulness. We also showed GBI's potential capability as a feature construction tool.

Future work includes the following. The extracted patterns may be affected by node ordering because GBI employs a greedy search. No significant difference was seen in the experiments, but we feel that it is necessary to investigate this sensitivity and devise a way not to be strongly affected by node ordering. In this paper we limited the characterization of typicality primarily to frequency and secondarily to a heuristic class separability measure. Other frequency-based characterizations must be investigated. It is also felt necessary to make use of domain knowledge to control the chunking process in the current framework so that unnecessary patterns are not extracted and more focused patterns are extracted. Further, undirected graphs were mechanically converted into directed graphs to use the GBI algorithm without any change. It is necessary to improve the algorithm so that undirected graphs can be treated strictly.
References
[1] Quinlan JR. Induction of decision trees. Mach Learn 1986;1:81–106.
[2] Quinlan JR. C4.5: programs for machine learning. Los Altos, CA: Morgan Kaufmann, 1993.
[3] Michalski RS. Learning flexible concepts: fundamental ideas and a method based on two-tiered representation. Mach Learn, Artif Intell Approach 1990;3:63–102.
[4] Clark P, Niblett T. The CN2 induction algorithm. Mach Learn 1989;3:261–83.
[5] Muggleton S, de Raedt L. Inductive logic programming: theory and methods. J Logic Progr 1994;19/20:629–79.
[6] Inokuchi A, Washio T, Motoda H. An a priori-based algorithm for mining frequent substructures from graph data. Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery, 2000. p. 13–23.
[7] Cook DJ, Holder LB. Graph-based data mining. IEEE Intell Syst 2000;15(2):32–41.
[8] Yoshida K, Motoda H. CLIP: concept learning from inference patterns. Artif Intell 1995;75(1):63–92.
[9] Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. Belmont, CA: Wadsworth, 1984.
[10] Matsuda T, Horiuchi T, Motoda H, Washio T. Extension of graph-based induction for general graph structured data. In: Knowledge discovery and data mining: current issues and new applications. New York: Springer, 2000. LNAI 1805, p. 420–31.
[11] Blake CL, Keogh E, Merz C. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[12] Matsuda T, Horiuchi T, Motoda H, Washio T. Graph-based induction for general graph structured data and its application to chemical compound data. Proceedings of the Third International Conference on Discovery Science, 2000.
[13] Matsumoto T, Tanabe K. Prediction of carcinogenicity of organic chlorine-containing compounds by neural network. JCPE J 1999;11(1):29–34 (in Japanese).
[14] Debnath AK, Lopez de Compadre RL, Debnath G, Shusterman AJ, Hansch C. Structure–activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J Med Chem 1991;34:786–97.