Overview
A survey on mining multiple data sources
T. Ramkumar,1∗ S. Hariharan2 and S. Selvamuthukumaran1
1 Department of Computer Applications, A.V.C. College of Engineering, Tamil Nadu, India
2 Department of Computer Science and Engineering, TRP Engineering College, Tamil Nadu, India
∗ Correspondence to: [email protected]
Advancements in computer and communication technologies demand new perceptions of distributed computing environments and the development of distributed data sources for storing voluminous amounts of data. In such circumstances, mining multiple data sources to extract useful patterns of significance is considered a challenging task within the data mining community. The domain of multi-database mining (MDM) is regarded as a promising research area, as evidenced by numerous research attempts in the recent past. Though many methods exist for discovering knowledge from multiple data sources, they fall into two broad categories, namely (1) mono-database mining and (2) local pattern analysis. The main intent of this survey is to explain the idea behind those approaches and consolidate the research contributions along with their significance and limitations.
© 2012 Wiley Periodicals, Inc.
How to cite this article:
WIREs Data Mining Knowl Discov 2013, 3: 1–11 doi: 10.1002/widm.1077
INTRODUCTION
Rapid strides made in communication technology over wired and wireless networks have resulted in the development of various distributed applications. A distributed application might have data sources scattered over various geographical locations for handling huge volumes of data. This scenario allows organizations to promote multi-database applications toward fulfilling their operational needs. Thus, many organizations need to mine their multi-databases, distributed at branches, for the purpose of decision making. Consider the retail store Reliance India Ltd, which launched a retail revolution in India by growing from no stores to 1500 outlets in just six months. Each of these outlets produces a huge number of transactions on a daily basis. Developing an effective data mining technique to discover patterns from multiple branches thus becomes crucial for these types of applications.
The domain of multi-database mining (MDM) has gained significant attention because of (1) the increasing use of automatic data collection tools and the flood of data generated in the operational processes of an organization; (2) the changing nature of distributed repositories with different data sources and formats; (3) an organization's imperative need to analyze the contents and trends of branch databases; and (4) the need to enhance the effectiveness of the decision-making process by incorporating quality knowledge extracted from multi-databases.
The success of an MDM application largely depends on the data available in the multiple databases. In real-world applications, data stored in multiple places are often inconsistent and conflict with each other. Bright et al.1 discussed the following data representation issues in a multi-database environment. (1) Name differences: Databases may have different conventions for the naming of objects, leading to problems with synonyms and homonyms. A synonym means that the same data item has a different name in different databases. The global system must recognize the semantic equivalence of the items and map the differing local names to a single global name. A homonym means that different data items have the same name in different databases. The global system must recognize the semantic difference between the items and map the common names to different global names. (2) Format differences: Format differences include differences in data type, domain, scale, precision, and item combinations. As an example, we can cite the case where a part number is defined as an integer
in one database and as an alpha-numeric string in another. Sometimes data items are broken into separate
components in one database, while the combination
is recorded as a single quantity in another one. Multidatabase systems typically resolve format differences
by defining transformation functions between local
and global representations. Some functions may consist of simple numeric calculations such as converting square feet to acres. Others may require tables of
conversion values or algorithmic transformations. A
problem in this area is that the required local-to-global transformation may be very complex, especially
if updates are to be supported. (3) Structural differences: An object may be structured differently in different local databases. A data item may have a single
value in one database and multiple values in another.
An object may be represented as a single relation in
one location or as multiple relations in another. The
same item may be a data value in one location, an attribute in another and a relation in a third. So the data
often have discrepancies in structure and content that
must be cleaned. (4) Conflicting data: The problem
of conflicting data occurs when two databases record
the same data item but assign different values to it. This may be due to incomplete updates or system errors while manipulating such data.
The above issues show the importance of adopting suitable methods for the MDM problem, because a global organization's headquarters decisions are highly influenced by the quality knowledge synthesized from multiple databases. This survey is organized as follows. The two main ideas for MDM are presented in Major Methods for Multi-Database Mining, which includes the definitions, pros, and cons of the mono-mining and local pattern analysis strategies with schematic representations. Next, in Research Efforts Based on Mono-Database Mining and Research Efforts on Multi-Database Mining, research efforts based on mono-mining and on local pattern analysis are reviewed and discussed, respectively. Finally, conclusions and future research directions are presented in Conclusion and Scope for Future Work.
MAJOR METHODS FOR MULTI-DATABASE MINING
During the past decades, attempts have been made to enrich the knowledge discovery process by applying techniques from artificial intelligence to databases, forming an interesting research forum called Knowledge Discovery from Multiple Databases, also known as MDM. It can be defined as the process of mining data from multiple databases, which may be heterogeneous, and finding novel and useful patterns of significance.2 Though many methods exist for discovering knowledge from multiple data sources, they fall into two broad categories, namely (1) mono-database mining and (2) local pattern analysis.3
Mono-Database Mining
In mono-database mining, data from different data sources are aggregated into a centralized repository for the task of mining (see Figure 1). The main theme of mono-database mining is to discover patterns that are globally significant among the participatory data sources. Mono-database mining can
be defined as the process in which data from various databases are integrated, put in a data warehouse, and mined to identify global patterns of interest.

FIGURE 1 | Mono-database mining.
The primary technical challenge here is the communication cost between distributed data sources; it is often very costly and sometimes impossible to join multiple data sources into a single database.4 Selecting the relevant data sources for a specific application and then putting them together to mine the knowledge is a refinement of this approach. Though it is effective in reducing the search cost for a given application, it is application dependent and requires multiple scans for each application.5
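As a rough illustration of mono-database mining, the sketch below (hypothetical branch data, with a naive support counter standing in for a real mining algorithm such as Apriori) pools all branch transactions into one central data set and reports the itemsets that are frequent in the pooled data.

from itertools import combinations
from collections import Counter

# Hypothetical branch databases: each value is a list of transactions (item sets).
branch_db = {
    "B1": [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"}],
    "B2": [{"bread", "milk", "butter"}, {"wine", "salmon"}, {"bread", "milk"}],
}

def mono_database_mining(databases, min_support=0.4, max_size=2):
    """Pool every transaction into a single data set and count itemsets globally."""
    pooled = [t for db in databases.values() for t in db]   # the centralized repository
    counts = Counter()
    for t in pooled:
        for k in range(1, max_size + 1):
            counts.update(combinations(sorted(t), k))
    n = len(pooled)
    # Keep only itemsets whose global support meets the threshold.
    return {i: c / n for i, c in counts.items() if c / n >= min_support}

print(mono_database_mining(branch_db))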
Limitations of Mono-Database Mining
Mono-database mining cannot be considered a good solution for mining multiple databases because of the following limitations:

1. It is based on the traditional data warehouse architecture and is fundamentally inappropriate for most distributed and ubiquitous data mining applications. Because branch databases can be in different formats, much attention is required in the data preprocessing stage.

2. A single computer might take a very long time to process the entire data set. Though this can be addressed by employing parallel machines and associated software, the company may have to invest heavily in such software and hardware. From a cost–benefit analysis perspective, it would not be a feasible solution.6

3. It may be an unrealistic proposition to collect data from different branches for centralized processing, as branch databases handle huge volumes of transactions daily.

4. Even if the data can be quickly centralized using a relatively fast network, the privacy issue plays an increasingly important role in data mining applications based on mono-mining. For example, if a consortium of banks wants to collaborate in detecting fraud, the mono-mining approach is not feasible, because it requires collecting all the financial data pertaining to an individual customer from every bank into a single location, which jeopardizes the privacy of the banks' customers.

5. Putting all the data from the relevant databases into a single data set can destroy some important information that reflects the individuality of the branches. Branch databases may have different weights, and some branches provide a greater contribution to the whole company in terms of turnover, transactions, and so on.

The above limitations show that the traditional process of mono-database mining is inadequate, and local pattern analysis has been put forward as an alternative way of mining multiple data sources.

Local Pattern Analysis
The objective of local pattern analysis is to perform the data mining operation based on the type and availability of the distributed resources, without moving the data to a central repository. It mines important local patterns from the individual data sources, forwards the pattern bases, and thereby reduces data movement (Figure 2). Hence, a data mining application based on the local pattern analysis strategy is able to learn models from distributed data without exchanging the raw data. A local pattern7 could be a frequent itemset, an association rule, a causal rule, a dependency, or some other expression that shows the individuality of a branch site. MDM using local pattern analysis is defined as the process of synthesizing global patterns from the patterns forwarded by the individual sites. This approach is recommended when the application involves a large number of data sources and is likely to be more scalable. The primary focus is to synthesize the local mining results at multiple levels of abstraction in view of promoting regional and global features.

FIGURE 2 | Local pattern analysis.
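A minimal sketch of the site-side step is given below (hypothetical data and function names, with a naive support counter standing in for whatever algorithm a real branch would run). Each branch mines its own frequent patterns and forwards only the pattern base, together with its transaction count, to the central site; the raw transactions never leave the branch.

from itertools import combinations
from collections import Counter

def mine_local_patterns(transactions, min_support=0.4, max_size=2):
    """Mine frequent itemsets at one branch; only this summary is forwarded."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            counts.update(combinations(sorted(t), k))
    n = len(transactions)
    return {"size": n,
            "patterns": {i: c / n for i, c in counts.items() if c / n >= min_support}}

# Hypothetical branches forward their pattern bases for central synthesis.
branch_db = {
    "B1": [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"}],
    "B2": [{"bread", "milk", "butter"}, {"wine", "salmon"}, {"bread", "milk"}],
}
forwarded = {name: mine_local_patterns(db) for name, db in branch_db.items()}
print(forwarded["B1"]["patterns"])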
Advantages of Local Pattern Analysis
This approach provides the following advantages:
1. It is an in-place strategy8 (eliminates data
movement) and provides a feasible way to
generate pattern models when huge volumes
of data are distributed at various sites.
2. It captures the individuality of the data
sources and is able to find special patterns,
which are more important than the patterns
present in the integrated and unified single
database (mono-database).
3. It is of low complexity because it only mines
relevant individual data sources.
4. It offers a strategy for synthesizing forwarded patterns at multiple levels of abstraction to discover various kinds of patterns
distributed in data sources. For example,
global patterns (voted by many of the sites),
subglobal patterns (voted by some of the
sites) and local patterns (voted by few/single
sites).9
5. It provides a means for two-level decisions:
(1) global decisions—the central company’s
decisions for global applications on the basis of the synthesized patterns; (2) branch
decisions—decisions of the local branches on
the basis of features of local patterns mined
from the local databases.3
6. The main objective in knowledge discovery from databases is to capture interesting patterns with respect to the user's point of view. The user may not be a data mining expert, but is an expert in the field being mined.10 Thus the importance of any pattern depends upon the interest of the user. By using the local pattern analysis strategy, the heads of local branches can use different interestingness measures for evaluating the local patterns of their respective databases, which may not be the same as the interestingness measure used by the central head in global pattern synthesizing. For example, the measure lift may be used by branch 'B1' and the support11,12 measure may be used by branch 'B2' for evaluating the local rule A→B at their corresponding sites. Once synthesized, the rule A→B can be evaluated globally by using another measure, say correlation. This shows that the local pattern analysis strategy enables the local and central sites to adopt different interestingness metrics for evaluating patterns.
Limitation of Local Pattern Analysis
Though the local pattern analysis strategy offers reasonably good solutions to the MDM problem, researchers have seen the other side as well. Adhikari et al.13 pointed out that the frequency of data mining is a drawback of the local pattern analysis strategy for the MDM problem. In mono-mining, the mining of the database occurs only once. In the case of the local pattern analysis strategy, however, the frequency of mining grows with the number of databases. Despite the various advantages, this may be accounted a limitation.
RESEARCH EFFORTS BASED ON MONO-DATABASE MINING
For mining databases, a prototype knowledge discovery system, INLEN14 (Inference and Learning) has
been developed at George Mason University whose
principal knowledge discovery algorithm AQ learns
decision rules by performing inductive inference over
a set of training examples. The system provides valuable insights into characteristics and relationships that
exist in the database, but are unknown to the user.
The discovered knowledge is displayed in the form
of IF-THEN rules. The limitation was that INLEN had been used only for discovery in small, single databases.
To overcome this limitation, Ribeiro et al.15 extended this approach to discover knowledge in multiple databases by applying INLEN's methodology to individual databases and then further processing the discovered knowledge. Accordingly, the AQ algorithm has been modified to handle primary and foreign key information from two data sources. In their approach, the databases must reside on the same machine. Wrobel16 extended the concept of foreign keys to include foreign links, because MDM involves accessing nonkey attributes. In practice, useful databases may exist in remote locations and provide knowledge for the decision-making process. To respond to such a scenario, Aronis et al.17 introduced the WoRLD (Worldwide Relational Learning Daemon) system, an inductive rule-learning program that can learn from multiple databases distributed around the network. They proposed an approach called 'activation spreading', which computes the cardinal distribution of the feature values in the individual data sets and propagates the distribution across the different sites.
Turinsky and Grossman,8 in their work, discuss two types of strategies for mining multiple databases. Because the task of moving large data sets over the Internet may be a time-consuming and costly proposition, the first strategy is to leave the data in place, build local models, and combine the models at a central site. They call this scheme an in-place
strategy. At the other extreme, when the amount of
geographically distributed data is very small, it is possible to move all the data to a central site and build a
single model there. They call this a centralized strategy. Then, they describe an intermediate strategy of
optimal data and model partitions to achieve a given
level of accuracy at a minimum cost.
Grossman et al.18 introduced Papyrus, a system
for distributed data mining which supports various
strategies such as move data, move model, and move results, as well as a mixture of all, on the basis of the data distribution, the availability of resources, and the required accuracy. Prodromidis et al.19 adopted a metalearning
strategy for mining multiple databases by integrating
multiple classifiers computed over different databases
to form higher-level classifiers or classification model.
The process of metalearning starts with distributed
databases, or a set of data subsets of the original
database and concurrently running a learning algorithm on each of the subsets. Using an integration rule,
it combines the predictions from classifiers learned
from the subsets by recursively learning ‘combiner’
and ‘arbiter’ models in a bottom-up tree manner. The
focus of metalearning is to combine the predictions of
the learned models from the partitioned data subsets
in a parallel and distributed environment.
Kargupta and his colleagues20,21 considered a
collective framework to address data analysis for heterogeneous environments and proposed the collective
data mining (CDM) framework for predictive data
modeling. The main features of CDM can be summarized in the following steps: (1) Generate approximate orthonormal basis coefficients at each local site
(2) Move an appropriately chosen sample of the data sets from each site to a single site and generate the approximate basis coefficients corresponding to the nonlinear cross terms. (3) Combine the local models, transform the model into the user-described canonical representation, and output the model. Here, nonlinear terms represent a set of coefficients (or patterns)
that cannot be determined at a local site. In essence,
the performance of a CDM model depends on the
quality of estimated cross-terms. Typically, CDM requires an exchange of a small sample that is often
negligible compared to the entire data. On the basis
of the framework, various distributed data analysis
algorithms such as collective decision rule learning
using Fourier analysis, collective hierarchical clustering, collective multivariate regression using wavelets,
and collective principal component analysis are
developed.
For mining large databases, Savasere et al.22
proposed a partition algorithm. The algorithm mines frequent itemsets from each non-overlapping partition of the database, and global candidate patterns are generated from the union of all the frequent itemsets.
A second run is made on each partition to obtain
the frequency count of each of the candidate patterns, which are then summed up to obtain the global
support count. If the support count is found to be
greater than the minimum support count, the pattern
is deemed a global pattern, and global rules are generated from such patterns. This approach provides an elegant solution for mining huge centralized databases.
However, the method is not directly applicable to
MDM. To adapt this approach to MDM, each branch
database may be considered as a part of a partition
of the multi-database. The mined local frequent itemsets are forwarded to the center to form candidate
global patterns. The candidate global patterns are
transmitted to the local sites to mine for a second
time. The mined patterns are again forwarded by the
local sites to the center to assemble and evaluate global
rules. The above scheme requires two sets of scans
of local databases and three transmissions of mined
patterns across the network.
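A minimal sketch of this adaptation is given below, with each branch database treated as one partition (hypothetical data; a naive counter again stands in for a real local mining algorithm). The first pass collects locally frequent itemsets as global candidates; the second pass asks every branch for the exact counts of those candidates and sums them at the center.

from itertools import combinations
from collections import Counter

def local_frequent(transactions, min_support, max_size=2):
    """Scan 1 at a branch: return the locally frequent itemsets."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            counts.update(combinations(sorted(t), k))
    n = len(transactions)
    return {i for i, c in counts.items() if c / n >= min_support}

def local_counts(transactions, candidates):
    """Scan 2 at a branch: exact counts of the candidate global patterns."""
    return {i: sum(set(i) <= t for t in transactions) for i in candidates}

def partition_style_mdm(branch_db, min_support=0.4):
    # Transmission 1: every branch forwards its locally frequent itemsets.
    candidates = set().union(*(local_frequent(db, min_support) for db in branch_db.values()))
    # Transmissions 2 and 3: candidates go back to the branches, exact counts return.
    total_n = sum(len(db) for db in branch_db.values())
    global_counts = Counter()
    for db in branch_db.values():
        global_counts.update(local_counts(db, candidates))
    return {i: c / total_n for i, c in global_counts.items() if c / total_n >= min_support}

branch_db = {"B1": [{"bread", "milk"}, {"bread", "butter"}],
             "B2": [{"bread", "milk"}, {"milk", "butter"}, {"bread", "milk", "butter"}]}
print(partition_style_mdm(branch_db))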
TABLE 1 | Analysis of Research Attempts in Mono-Database Mining with Their Significance

Serial No. | Researchers | Issue Focused | Contribution
1 | Michalski et al.,14 Ribeiro et al.,15 and Aronis et al.17 | Characteristics and relationships existing in the database | Rule-based knowledge discovery algorithms
2 | Grossman et al.18 and Turinsky and Grossman8 | Data movement | Data movement strategies for the task of mining
3 | Prodromidis et al.19 | Database classification | Classification model for distributed database environments
4 | Kargupta et al.20,21 | Data analysis | Collective data mining framework
5 | Savasere et al.22 | Pattern discovery | Partitioning algorithm for mining large databases
6 | Zhong et al.23 | Pattern discovery | Procedure for mining peculiarity patterns in multiple databases
7 | Liu et al.24 | Database selection | Application-dependent selection procedure
8 | Wu et al.25 | Database classification | Application-independent database selection procedure
To find new, surprising, and interesting patterns hidden in data, peculiarity-oriented mining in multiple databases was introduced by Zhong et al.23 Peculiarity represents a new interpretation of interesting, unexpected relationships that are hidden in a relatively small amount of data. The main task of mining peculiarity rules is the identification of peculiar data. Peculiarity of data is characterized by two features: (1) being very different from the other objects in a data set, and (2) consisting of a relatively low number of objects. They argued that peculiarity rules are a typical regularity hidden in many scientific, statistical, and transactional databases. They proposed a peculiarity factor to determine whether an attribute value occurs in relatively low numbers and is very different from the other values, by evaluating the sum of the square roots of the conceptual distances between them. Finally, one can select the peculiar data by means of a peculiarity-factor threshold.
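For a single numeric attribute, the peculiarity factor can be sketched as follows; this is a simplified reading in which the conceptual distance is taken as the absolute difference and the exponent 0.5 supplies the square root mentioned above, so the authors' exact formulation23 may differ in detail.

def peculiarity_factor(values, alpha=0.5):
    """PF(x_i) = sum_j |x_i - x_j| ** alpha for every value of one attribute."""
    return [sum(abs(x - y) ** alpha for y in values) for x in values]

def peculiar_values(values, beta=1.5):
    """Select values whose factor exceeds mean + beta * std (an assumed threshold rule)."""
    pf = peculiarity_factor(values)
    mean = sum(pf) / len(pf)
    std = (sum((p - mean) ** 2 for p in pf) / len(pf)) ** 0.5
    return [x for x, p in zip(values, pf) if p > mean + beta * std]

ages = [21, 22, 23, 22, 24, 23, 80]   # hypothetical attribute values
print(peculiar_values(ages))           # the outlying value 80 is selected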
To deal with multiple databases, Liu et al.24 proposed to mine multi-databases by first identifying the relevant databases. They argued that the first step of MDM is to identify the databases that are most likely to be relevant to an application, for reasons of efficiency and accuracy. In their approach, a cluster of multi-databases is constructed for an application; this process, which is typically application dependent, is referred to as database selection. However, database selection has to be carried out multiple times to identify relevant databases for two or more real-world applications. In particular, when users need to mine their multi-databases without reference to any specific application, application-dependent techniques
do not work well. To cater to this requirement, Wu et al.25 proposed an application-independent database classification strategy for MDM. They presented a technique for clustering databases toward mining multiple databases. Multiple databases are classified by constructing a relevance measure called similarity. In particular, they defined measures such as |class|, Goodness, and a distance function for searching for good clusters in multi-databases. Both works focused on efficient data preparation techniques for MDM. Table 1 summarizes the research contributions based on the mono-mining strategy along with their significance.
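The |class|, Goodness, and distance measures of Wu et al.25 are not reproduced here; the sketch below only illustrates the general flavor of application-independent database classification, using a Jaccard similarity between the item sets of two databases and a similarity threshold as an assumed stand-in for their measures.

def jaccard(items_a, items_b):
    """Similarity between two databases summarized by their sets of items."""
    return len(items_a & items_b) / len(items_a | items_b)

def classify_databases(db_items, min_sim=0.5):
    """Greedy grouping: a database joins the first class it is similar enough to."""
    classes = []
    for name, items in db_items.items():
        for cls in classes:
            if all(jaccard(items, db_items[other]) >= min_sim for other in cls):
                cls.append(name)
                break
        else:
            classes.append([name])
    return classes

# Hypothetical item summaries of four branch databases.
db_items = {"D1": {"bread", "milk", "butter"},
            "D2": {"bread", "milk", "jam"},
            "D3": {"wine", "salmon"},
            "D4": {"wine", "salmon", "cheese"}}
print(classify_databases(db_items))    # [['D1', 'D2'], ['D3', 'D4']]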
The above efforts have provided a good insight
into mining multiple databases and tackled several
important issues in MDM. However, there are also
many potentially useful patterns in local databases.
Apart from the cost of moving huge volumes of data over a
communication network, the mono-database mining
strategy obliterates interesting local patterns at various sites. The following section reviews the research
efforts on local pattern analysis, which overcomes the
limitations of mono-mining.
RESEARCH EFFORTS ON MULTI-DATABASE MINING
Zhang et al.26 brought out the differences between mono-database mining and MDM by presenting novel significant patterns that are found in
MDM, which are not captured in mono-database
mining. They argue that any business organization,
with multiple branches, has two levels of decisions:
headquarter level (global) and branch level (local)
decisions. Following this logic, they classify the patterns in a multi-database system as local patterns,
high-vote patterns, exceptional patterns, and suggested patterns.
High-vote patterns are supported by most of the
branches or all branches of an interstate organization.
Such patterns reflect the common features among the
branch databases. According to these patterns, the
head company can make decisions for the common
profit of all the branches. Exceptional patterns have a
higher support in some branches but zero support in
other branches. According to these patterns, the head
company can adjust measures to local conditions and
make special policies for such branches. Suggested
patterns are supported by some of the branches, but by fewer branches than those supporting the high-vote patterns.
Because users are more likely to provide the patterns mined from their databases rather than their raw data, and a large number of local patterns are forwarded from the branch databases, a synthesizing model is necessary to gather global patterns from the
forwarded local patterns. Wu and Zhang27 advocated
a model for synthesizing high-frequency rules from
multiple databases through weighting. In many fields
such as probability and fuzzy set theory, weighting
has been considered as a common method for aggregating information. To aggregate association rules
from multiple databases, one also needs to determine
the weights of the data sources.
The weighting model advocated by Wu and
Zhang27 has been considered as a first attempt in
synthesizing global patterns from the forwarded local patterns. Their weighting model aims to synthesize high-frequency association rules from different
data sources. According to them, a rule is called a
high-frequency rule if it is supported or voted for by a
large number of data sources. Their rule weight is proportional to the number of data sources supporting
the rule. The weight of any data source in turn is calculated based on the number of high-frequency rules
supported by it. They assign higher weights to data sources that support a larger number of high-frequency rules and lower weights to data sources that support fewer high-frequency rules.
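A minimal sketch of such a weighted synthesis is shown below; it is only an illustration of the flow described above, not the precise weighting scheme of Wu and Zhang.27 Each source forwards (rule, local support) pairs, sources voting for many high-frequency rules receive larger weights, and the synthesized support of a rule is the weight-averaged local support.

from collections import defaultdict

def synthesize_rules(forwarded, min_sources=2):
    """forwarded: {source: {rule: local_support}}; returns synthesized supports."""
    # Rules voted for by at least `min_sources` sources are treated as high frequency.
    votes = defaultdict(int)
    for rules in forwarded.values():
        for rule in rules:
            votes[rule] += 1
    high_freq = {r for r, v in votes.items() if v >= min_sources}

    # A source's weight grows with the number of high-frequency rules it supports.
    raw_w = {s: sum(r in high_freq for r in rules) for s, rules in forwarded.items()}
    total = sum(raw_w.values()) or 1
    weight = {s: w / total for s, w in raw_w.items()}

    # Synthesized support of a rule: weighted sum of its local supports.
    return {r: sum(weight[s] * forwarded[s].get(r, 0.0) for s in forwarded)
            for r in high_freq}

forwarded = {"S1": {("A", "B"): 0.40, ("B", "C"): 0.25},
             "S2": {("A", "B"): 0.35},
             "S3": {("A", "B"): 0.50, ("B", "C"): 0.30}}
print(synthesize_rules(forwarded))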
The synthesizing model proposed by Wu and
Zhang27 works on similar sized data sources. When
numerous data sources are considered, it is practically
impossible to have similar sized data sources. To process data sources of different sizes, merging and splitting of data sources has to be done to make them of
the same size. But these types of operations are complex and a huge effort is required. When merging of
data sources is not possible because of data sharing
issues, Wu and Zhang27 suggested ignoring data sources whose sizes are below a user-specified threshold. Thus, some of the data sources may not participate in the rule synthesis process. Though Wu and Zhang's model27 attempts to synthesize the global association rules for the overall organization that would
have been discovered from the union of all the data
sources, the comparison of the synthesized results
with the mono-mining results obtainable by the union
of those data sources is not targeted in their work.
Nedunchezhian and Anbumani28 focused on two issues, namely data source selection and the selection of valid rules for synthesizing. They calculate the weight of a data source on the basis of two factors: (1) the number of high-frequency rules voted for by the data source and (2) the size of the data source. Accordingly, the weights of all participating data sources are calculated, and a data-source selection threshold is applied to identify candidate data sources for synthesizing high-frequency rules. To prune low-frequency rules at the local sites themselves, a procedure called support equalization is also presented, which equates the supports of the data sources and reduces the total number of rules forwarded to the central head.
Zhang et al.29 have advocated an approach for
synthesizing global exceptional patterns for MDM
applications. They have developed an algorithm for
identifying global exceptional patterns in multiple
databases. In their approach, every local database is
mined separately in a random order for synthesizing
global exceptional patterns. Kum et al.30 have developed a local mining approach for finding sequential patterns in multiple databases. They present a
novel algorithm to mine approximate sequential patterns called consensus patterns from large sequence
databases in two steps. First, sequences are clustered
by similarity. Then, consensus patterns are mined directly from each cluster through multiple alignments.
Adhikari and Rao6 extended the local pattern analysis model and introduced the notion of heavy association rules in their work. Heavy association rules are rules whose synthesized global supports are higher than a user-given threshold. Their synthesizing criterion is the same as the measure defined by Wu and Zhang,27 and they state that heavy association rules are sometimes more useful than high-frequency association rules. They also observed that heavy association rules may not be shared by all the databases. Therefore, they defined a high-frequency rule as a rule shared by at least n × r1 databases, and an exceptional rule as a rule shared by no more than n × r2 databases, where 'n' indicates the number of databases and r1 and r2 are user-defined thresholds. They presented an algorithm for synthesizing heavy association rules from multiple data sources and for reporting whether a heavy association rule is high-frequency or exceptional in multiple databases. The limitation of this model is that it provides approximate global patterns.
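A small sketch of this classification step is given below, under the simplifying assumptions that the synthesized support is a plain size-weighted average and that the heaviness threshold, r1, and r2 are supplied by the user; the actual synthesizing algorithm of Adhikari and Rao6 is richer than this.

def classify_rule(local_supports, sizes, heavy_min=0.3, r1=0.6, r2=0.25):
    """local_supports: {site: support, or None if the rule was not reported there}."""
    n = len(sizes)
    voters = [s for s, sup in local_supports.items() if sup is not None]
    total = sum(sizes.values())
    synthesized = sum(sizes[s] * local_supports[s] for s in voters) / total

    labels = []
    if synthesized >= heavy_min:
        labels.append("heavy")
    if len(voters) >= n * r1:
        labels.append("high-frequency")   # shared by at least n * r1 databases
    if len(voters) <= n * r2:
        labels.append("exceptional")      # shared by no more than n * r2 databases
    return synthesized, labels

sizes = {"D1": 1000, "D2": 800, "D3": 1200, "D4": 500}
rule_AB = {"D1": 0.42, "D2": 0.38, "D3": 0.45, "D4": None}
print(classify_rule(rule_AB, sizes))   # heavy and high-frequency, not exceptional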
Ramkumar and Srinivasan2 proposed a transactions-population-based weighting model for synthesizing high-frequency rules from different data sources. According to them, the rule weight is proportional to the sum of the weights of the data sources supporting the rule. The weight of any data source is calculated based on the population of the data source, that is, on the number of transactions in the database. Their goal in synthesizing global patterns from the forwarded local patterns is that the support and confidence of a synthesized pattern should be very nearly the same as would have been obtained if all the data sites were integrated and mono-mining were performed. They did not agree with the formulation that in a big company each branch, big or small, has equal power to vote for patterns. They also added that, in a pure business sense, all branches are not equal; the branches that have a high volume of business will have, and should have, a greater say in determining global policies based on global patterns.
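Under this model, one plausible reading of the synthesized support is a transaction-count-weighted average of the local supports, sketched below with hypothetical figures; the published model2 also covers confidence and other details not shown here.

def population_weighted_support(local_supports, tx_counts):
    """Weight each site's support by its share of the total transaction population."""
    total = sum(tx_counts.values())
    return sum(tx_counts[s] * local_supports.get(s, 0.0) for s in tx_counts) / total

tx_counts = {"S1": 100, "S2": 1000}          # S2 carries ten times the weight of S1
local_supports = {"S1": 0.50, "S2": 0.20}
print(population_weighted_support(local_supports, tx_counts))   # about 0.227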
The synthesizing models2,6,27 focused on synthesizing high-frequency rules only, because such rules emerge as global rules when all the data sources are integrated. High-frequency rules are truly valid for making global decisions by the head branch of any interstate company. However, in such a synthesis, regional patterns or rules get eliminated. For making decisions, say at regional levels, patterns that show the individuality of regions, clusters, or groups of branches become important. They can be explored only by synthesizing them from a multilevel perspective. In response to this demand, Ramkumar and Srinivasan9 extended their earlier work and proposed a multilevel rule synthesis framework using two interesting rule evaluation measures, namely, the effective and nominal vote rates. γeffective is defined as the effective vote rate, which is the cumulative percentage of votes received from different data sources for a given rule on the basis of the transactions-populations of the respective data sources. γnominal is defined as the nominal vote rate, which is the cumulative percentage of votes received from different data sources for a given rule on the basis of equal votes for all sites. Using these rule selection measures, local patterns are synthesized into global rules, subglobal rules, and local rules.
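The two vote rates can be sketched as follows, assuming a site 'votes' for a rule when the rule is locally frequent there: γnominal counts every voting site equally, whereas γeffective weights each vote by the site's share of the total transaction population. This is an interpretation of the definitions above, not the authors' exact formulation.9

def vote_rates(voting_sites, tx_counts):
    """Return (gamma_effective, gamma_nominal) as percentages for one rule."""
    n_sites = len(tx_counts)
    total_tx = sum(tx_counts.values())
    gamma_nominal = 100.0 * len(voting_sites) / n_sites
    gamma_effective = 100.0 * sum(tx_counts[s] for s in voting_sites) / total_tx
    return gamma_effective, gamma_nominal

tx_counts = {"S1": 100, "S2": 1000, "S3": 400}
print(vote_rates({"S1", "S3"}, tx_counts))   # (33.3, 66.7): the two views can disagree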
During the synthesizing process, when a rule or pattern is present at a site only weakly and fails to satisfy the minimum support threshold, that rule is not allowed to take part in the synthesizing procedure. Such circumstances do not imply that the rule is not present at all, because the rule may have some significance at the site, with a support value that lies between 0 and the minimum support. Ramkumar and Srinivasan31 focused on this issue and introduced the notion of a correction factor in the rule synthesizing process. With the inclusion of the correction factor, the synthesized results are improved. They also concluded that the domain expert should choose a suitable correction factor based on his knowledge and estimate of the distribution of the data. In the absence of detailed knowledge about the data distribution, they recommended a correction factor of 0.50.
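One plausible reading of the correction factor, sketched below, is that a site that did not report a rule (because the rule fell below the local minimum support) is credited with an assumed support of correction × minsup instead of zero; the exact treatment in Ramkumar and Srinivasan31 may differ.

def corrected_weighted_support(local_supports, tx_counts, minsup=0.1, correction=0.5):
    """Non-reporting sites contribute correction * minsup rather than zero support."""
    total = sum(tx_counts.values())
    synthesized = 0.0
    for site, n in tx_counts.items():
        sup = local_supports.get(site)
        synthesized += n * (sup if sup is not None else correction * minsup)
    return synthesized / total

tx_counts = {"S1": 500, "S2": 500}
print(corrected_weighted_support({"S1": 0.30}, tx_counts))   # 0.175 rather than 0.15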
Adhikari et al.32 proposed a model for mining global patterns in multiple transactional time-stamped databases. They argued that finding the variation of sales of an item over time is an important issue. Accordingly, they introduced the notion of the stability of an item, because stable items are useful in making many strategic decisions. On the basis of the degree of stability of an item, an algorithm for clustering different databases has been proposed. Zhang et al.33 proposed a method for obtaining local patterns from the individual databases based on customer lifetime values (CLVs). For computing CLVs, three attributes are taken into account, namely the customer id, the customer expenditure, and the period of the customer's lifecycle. Using a method called kernel estimation for mining global patterns (KEMGP), which adopts kernel estimation, global patterns are synthesized from the forwarded local patterns.
Adhikari et al.34 noted that association analysis
of select items in multiple market databases is an important as well as promising issue. As many important
decisions are based on a set of specific items called select items, they have proposed a model for mining
global patterns of select items in multiple databases.
A measure of overall association between two items in
databases is also proposed. They have designed an algorithm based on the proposed measure for grouping
frequent items in multiple databases.
The existing rule synthesizing methods commonly assume that an appropriate relevance analysis has been done among the databases and that the databases under consideration are highly relevant. This is equivalent to the assumption that all stores have the same type of business with identical metadata structures, which is hardly ever the case. The above problem has been attacked by He et al.35 They have proposed a synthesizing model for databases that contain different items and that may not be relevant to each other. They have argued that a simple rule synthesizing model, without a detailed understanding of the databases, is not adequate to reveal meaningful patterns inside the databases.
They have proposed a two-step clustering-based rule synthesizing framework. Accordingly, for databases with different items, clustering can be done at the item level, whereas for databases sharing similar items but different rules, the clusters generated from the item-level clustering are further clustered. By this two-step process, the final clusters contain both similar items and similar rules. Then the weighted rule synthesizing method proposed by Wu and Zhang27 is applied to such clusters to generate the synthesized rules. Table 2 summarizes the salient features of the research works based on the local pattern analysis strategy.

TABLE 2 | Analysis of Research Attempts in the Local Pattern Analysis Strategy with Their Significance

Serial No. | Researchers | Issue Focused | Contribution
1 | Zhang et al.26 | Local pattern analysis | Identification of new kinds of patterns in multi-database environments
2 | Wu and Zhang27 | Synthesizing model | Weighting model for synthesizing global patterns based on frequent rules voted for by the data sources
3 | Nedunchezhian and Anbumani28 | Database identification and setting up threshold values for synthesizing | Data source selection and support equalization for synthesizing global patterns
4 | Zhang et al.29 | Discovering new kinds of patterns | Synthesizing procedure for globally exceptional patterns
5 | Kum et al.30 | Discovering new kinds of patterns | Algorithms for sequential pattern discovery
6 | Adhikari and Rao6 | Discovering new kinds of patterns | Notion of heavy association rules in the synthesizing process on the basis of Wu and Zhang's model
7 | Ramkumar and Srinivasan2 | Synthesizing model | Transactions-population-based weighting model for synthesizing global patterns with a target of obtaining results closer to mono-mining
8 | Ramkumar and Srinivasan9 | Discovering new kinds of patterns | Notion of effective and nominal vote rates in rule synthesizing for pattern classification
9 | Ramkumar and Srinivasan31 | Optimization in synthesizing model | Notion of correction factor in the rule synthesizing process for improved synthesized results
10 | Adhikari et al.32 | Database clustering | Mining global patterns in time-stamped databases
11 | Adhikari et al.34 | Grouping items | Model for mining select items
12 | He et al.35 | Database clustering | Synthesizing model for databases dissimilar in nature
CONCLUSION AND SCOPE FOR FUTURE WORK
Research in MDM will become more important, imperative, and challenging with the increasing development of multi-databases. This paper has surveyed various research works in this growing field with emphasis on
mono-mining and local pattern analysis. There are
still several challenges in the local pattern analysis approach, which need further research.
Inclusion of Quantitative Information in the Allocation of Data Source Weights
Considering two sites S1 and S2 with respective populations of 100 and 1000 transactions, the transactions-population-based weighting model assigns S2 a weight 10 times that of S1. This would not be fair if the turnover of S1 is higher than that of S2. Thus, allocating site weights on the basis of the transactions-population alone may not be a good decision. To improve decisions, quantitative mining on the basis of turnover quantity or the cost of items sold may be carried out. The frequent rule Wine→Salmon (support = 10%, confidence = 80%) may be more important than the frequent rule Bread→Milk (support = 30%, confidence = 80%), even though the former holds a lower support. This is because the items in the first rule usually come with more profit per unit sale compared with the items in the second rule. Hence the inclusion of quantitative information in the allocation of site weights is needed, and a synthesizing model based on multiple minimum supports for the corresponding quantitative information is required.
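As one possible direction, and purely as an illustration rather than a model defined in the surveyed works, site weights could blend the transaction population with turnover, as in the sketch below.

def blended_site_weights(tx_counts, turnover, alpha=0.5):
    """Weight = alpha * share of transactions + (1 - alpha) * share of turnover."""
    total_tx = sum(tx_counts.values())
    total_turn = sum(turnover.values())
    return {s: alpha * tx_counts[s] / total_tx + (1 - alpha) * turnover[s] / total_turn
            for s in tx_counts}

tx_counts = {"S1": 100, "S2": 1000}
turnover = {"S1": 900000, "S2": 600000}   # S1 has fewer but higher-value transactions
print(blended_site_weights(tx_counts, turnover))   # S1 no longer gets one tenth of S2's weight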
Assigning Weights for Transactions
Assigning weights to transactions is also one of the future research directions, in our view. Different transactions have different weights in real-world data sets. For instance, in market basket analysis, each transaction is recorded with some profit. Transactions with a large number of items should be considered more important than transactions containing only a few items. This shows the need to assign different weights to different transactions to reflect their importance. To assign weights to transactions, factors such as the recency, frequency, monetary value, and duration (RFMD) values can be used by each of the local branches. The RFMD technique is one of the popular methods in market segmentation. Customers who have recently purchased (recency), customers who purchase many times (frequency), customers who spend more money (monetary value), and customers who spend more time on the seller's website (duration) are the main parameters of the RFMD technique. Weights are assigned to each parameter, and the weighted score for each transaction can be calculated. Also, by assigning transaction weights, the problem of considering all transactions equally in the rule mining process can be eliminated, and the extracted rules will have greater significance.
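A minimal sketch of an RFMD-style transaction weight is given below; the parameter weights and the normalization scales are assumptions made only for illustration, since the text does not fix them.

def rfmd_weight(recency, frequency, monetary, duration,
                w=(0.25, 0.25, 0.25, 0.25), scale=(30.0, 50.0, 1000.0, 60.0)):
    """Weighted score of normalized R, F, M, D values (smaller recency is better)."""
    r = max(0.0, 1.0 - recency / scale[0])   # days since the last purchase
    f = min(1.0, frequency / scale[1])       # number of purchases
    m = min(1.0, monetary / scale[2])        # money spent
    d = min(1.0, duration / scale[3])        # minutes spent on the seller's website
    return w[0] * r + w[1] * f + w[2] * m + w[3] * d

# A transaction from a recent, frequent, high-spending customer gets a larger weight.
print(rfmd_weight(recency=2, frequency=40, monetary=800, duration=45))
print(rfmd_weight(recency=25, frequency=3, monetary=50, duration=5))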
Global Classification Model for Rule Synthesizing
Supervised learning is a well-known data mining functionality, which is used to classify data records into a set of predefined class labels. Using classification techniques such as decision trees, mining the features of the local data sources for a given concept or class, and synthesizing them to form a global classification model for mining multiple databases, is also an interesting research direction.
Synthesizing Negative Association Rules
Mining negative association rules in multiple databases could also be one of the future research areas. A negative association rule describes a relationship between itemsets and implies that the occurrence of some itemsets is characterized by the absence of others.36 A positive association rule 'A→B' has three corresponding negative association rules, 'A→¬B', '¬A→B', and '¬A→¬B'. Negative association rules also play an important role in decision making. For example, when handling different medical databases coming from different areas, a center for disease control would be interested in finding out which factors are relatively irrelevant and which are absolutely irrelevant, although they may arise frequently. Thus there is sound scope for research in the development of an effective synthesizing model for negative association rules.
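The supports of the three negative forms can be derived from the positive counts using standard identities, for example supp(A→¬B) = supp(A) − supp(A∪B), as the sketch below shows; synthesizing such supports across databases could then follow the same weighting ideas discussed earlier.

def negative_rule_supports(supp_a, supp_b, supp_ab):
    """Supports of A->not B, not A->B and not A->not B from positive supports."""
    supp_a_notb = supp_a - supp_ab                    # A occurs without B
    supp_nota_b = supp_b - supp_ab                    # B occurs without A
    supp_nota_notb = 1.0 - supp_a - supp_b + supp_ab  # neither A nor B occurs
    return supp_a_notb, supp_nota_b, supp_nota_notb

print(negative_rule_supports(supp_a=0.5, supp_b=0.6, supp_ab=0.4))   # (0.1, 0.2, 0.3)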
REFERENCES
1. Bright MW, Hurson AR, Pakzad SH. A taxonomy and current issues in multidatabase systems. IEEE Comput 1992, 25:50–60.
2. Ramkumar T, Srinivasan R. Modified algorithms for synthesizing high-frequency rules from different data sources. Knowl Inf Syst 2008, 17:313–334.
3. Zhang S, Wu X, Zhang C. Multi-database mining. IEEE Comput Intell Bull 2003, 2:5–13.
4. Zhang S, Chen Q, Yang Q. Acquiring knowledge from inconsistent data sources through weighting. Data Knowl Eng 2010, 69:779–799.
5. Liu H, Lu H, Yao J. Identifying relevant databases for multidatabase mining. In: Proceedings of the Second Pacific-Asia Conference on Knowledge Discovery and Data Mining. Melbourne, Australia; 1998, 210–221.
6. Adhikari A, Rao PR. Synthesizing heavy association rules from different real data sources. Pattern Recognit Lett 2008, 29:59–71.
7. Zhang S, Zaki JM. Mining multiple data sources: local pattern analysis. Data Min Knowl Discov 2006, 12:121–125.
8. Turinsky K, Grossman R. A framework for finding distributed data mining strategies that are intermediate between centralized strategies and in-place strategies. In: Workshop on Distributed and Parallel Knowledge Discovery at the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000). Boston, MA; 2000, 1–7.
9. Ramkumar T, Srinivasan R. Multi-level synthesis of frequent rules from different data sources. Int J Comput Theory Eng 2010, 2:195–204.
10. Lenca P, Meyer P, Vaillant B, Lallich S. On selecting interestingness measures for association rules: user oriented description and multiple criteria decision aid. Eur J Oper Res 2008, 184:610–626.
11. Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. Washington, DC; 1993, 207–216.
12. Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the Twentieth International Conference on Very Large Databases (VLDB). Santiago de Chile, Chile; 1994, 478–499.
13. Adhikari A, Jain CL, Ramana S. Analysing effect of database grouping on multi-database mining. IEEE Intell Inf Bull 2011, 12:25–32.
14. Michalski RS, Kerschberg L, Kaufman KA, Ribeiro JS. Mining for knowledge in databases: the INLEN architecture, initial implementation and first results. J Intell Inf Syst: Integr AI and Database Technol 1992, 1:85–113.
15. Ribeiro J, Kaufman K, Kerschberg L. Knowledge discovery from multiple databases. In: Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95). Montreal, Canada; 1995, 240–245.
16. Wrobel S. An algorithm for multi-relational discovery of subgroups. In: Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery. Trondheim, Norway; 1997, 78–87.
17. Aronis J, Kolluri V, Provost F, Buchanan B. The WoRLD: knowledge discovery from multiple distributed databases. In: Proceedings of the Tenth International Florida AI Research Symposium. Daytona Beach, FL; 1997, 337–341.
18. Grossman RL, Bailey S, Ramu A, Malhi B, Turinsky A. The preliminary design of Papyrus: a system for high performance, distributed data mining over clusters. In: Advances in Distributed and Parallel Knowledge Discovery. Menlo Park, CA: AAAI/MIT Press; 2000, 259–275.
19. Prodromidis A, Chan P, Stolfo S. Meta-learning in distributed data mining systems: issues and approaches. In: Advances in Distributed and Parallel Knowledge Discovery. Menlo Park, CA: AAAI/MIT Press; 2000.
20. Kargupta H, Huang W, Sivakumar K, Johnson E. Distributed clustering using collective principal component analysis. Knowl Inf Syst 2001, 3:422–448.
21. Kargupta H, Huang W, Sivakumar K, Park B, Wang S. Collective principal component analysis from distributed, heterogeneous data. In: Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery. Lyon, France; 2000, 452–457.
22. Savasere A, Omiecinski E, Navathe S. An efficient algorithm for mining association rules in large databases. In: Proceedings of the Twenty-First International Conference on Very Large Data Bases. Zurich, Switzerland; 1995, 432–444.
23. Zhong N, Yao YY, Ohshima M. Peculiarity oriented multi-database mining. IEEE Trans Knowl Data Eng 2003, 15:952–960.
24. Liu H, Lu H, Yao J. Toward multi-database mining: identifying relevant databases. IEEE Trans Knowl Data Eng 2001, 13:541–553.
25. Wu X, Zhang C, Zhang S. Database classification for multi-database mining. Inf Syst 2005, 30:71–88.
26. Zhang S, Zhang C, Wu X. Knowledge Discovery in Multiple Databases. London: Springer-Verlag; 2004.
27. Wu X, Zhang S. Synthesizing high-frequency rules from different data sources. IEEE Trans Knowl Data Eng 2003, 15:353–367.
28. Nedunchezhian R, Anbumani K. Post mining: discovering valid rules from different sized data sources. Int J Inf Technol 2006, 3:47–53.
29. Zhang C, Liu M, Nie W. Identifying global exceptional patterns in multidatabase mining. IEEE Comput Intell Bull 2004, 3:19–24.
30. Kum HC, Chang JH, Wang W. Sequential pattern mining in multidatabases via multiple alignment. Data Min Knowl Discov 2006, 12:151–180.
31. Ramkumar T, Srinivasan R. The effect of correction factor in synthesizing global rules in a multi-database mining scenario. J Appl Comput Sci 2009, 3:33–38.
32. Adhikari J, Rao PR, Adhikari A. Clustering items in different data sources induced by stability. Int Arab J Inf Technol 2009, 6:394–402.
33. Zhang S, You X, Jin Z, Wu X. Mining globally interesting patterns from multiple databases using kernel estimation. Expert Syst Appl 2009, 36:10863–10869.
34. Adhikari A, Rao PR, Pedrycz W. Study of select items in different data sources by grouping. Knowl Inf Syst 2010, 27:23–43.
35. He D, Wu X, Zhu X. Rule synthesizing from multiple related databases. In: Proceedings of the Fourteenth Pacific-Asia Conference on Knowledge Discovery and Data Mining. Hyderabad, India; 2010, 201–213.
36. Zhang S, Wu X. Fundamentals of association rules in data mining and knowledge discovery. WIREs Data Min Knowl Discov 2011, 1:97–116.