Overview
A survey on mining multiple data sources
T. Ramkumar,1∗ S. Hariharan2 and S. Selvamuthukumaran1
1 Department of Computer Applications, A.V.C. College of Engineering, Tamil Nadu, India
2 Department of Computer Science and Engineering, TRP Engineering College, Tamil Nadu, India
∗ Correspondence to: [email protected]
Advancements in computer and communication technologies demand new perceptions of distributed computing environments and the development of distributed data sources for storing voluminous amounts of data. In such circumstances, mining multiple data sources to extract useful patterns of significance is considered a challenging task within the data mining community. The domain of multi-database mining (MDM) is regarded as a promising research area, as evidenced by numerous research attempts in the recent past. Though many methods exist for discovering knowledge from multiple data sources, they fall into two broad categories, namely (1) mono-database mining and (2) local pattern analysis. The main intent of this survey is to explain the idea behind those approaches and consolidate the research contributions along with their significance and limitations.
© 2012 Wiley Periodicals, Inc.
How to cite this article:
WIREs Data Mining Knowl Discov 2013, 3: 1–11 doi: 10.1002/widm.1077
INTRODUCTION
Rapid strides made in communication technology over wired and wireless networks have resulted in the development of various distributed applications. A distributed application might have data sources scattered over various geographical locations for handling huge volumes of data. This scenario allows organizations to promote multi-database applications toward fulfilling their operational needs. Thus, many organizations need to mine their multi-databases, distributed at branches, for the purpose of decision making. Consider the retail store Reliance India Ltd, which launched a retail revolution in India by growing from no stores to 1500 outlets in just six months. Each of these outlets produces a huge number of transactions on a daily basis. Developing an effective data mining technique to discover patterns from multiple branches thus becomes crucial for these types of applications.
The domain of multi-database mining (MDM) has gained significant attention because of (1) the increasing use of automatic data collection tools and the flood of data generated in the operational processes of an organization; (2) the changing nature of distributed repositories with different data sources and formats; (3) an organization's imperative need to analyze the contents and trends of branch databases; and (4) the need to enhance the effectiveness of the decision-making process by incorporating quality knowledge extracted from multi-databases.
The success of an MDM application largely depends on the data available in the multiple databases. In real-world applications, data stored in multiple places are often inconsistent and conflict with each other. Bright et al.1 discussed the following data representation issues in a multi-database environment. (1) Name differences: Databases may have different conventions for the naming of objects, leading to problems with synonyms and homonyms. A synonym means that the same data item has a different name in different databases. The global system must recognize the semantic equivalence of the items and map the differing local names to a single global name. A homonym means that different data items have the same name in different databases. The global system must recognize the semantic difference between the items and map the common names to different global names. (2) Format differences: Format differences include differences in data type, domain, scale, precision, and item combinations. As an example, we can cite the case where a part number is defined as an integer
in one database and as an alpha-numeric string in another. Sometimes data items are broken into separate
components in one database, while the combination
is recorded as a single quantity in another one. Multidatabase systems typically resolve format differences
by defining transformation functions between local
and global representations. Some functions may consist of simple numeric calculations such as converting square feet to acres. Others may require tables of
conversion values or algorithmic transformations. A
problem in this area is that the required local-to-global transformation may be very complex, especially
if updates are to be supported. (3) Structural differences: An object may be structured differently in different local databases. A data item may have a single
value in one database and multiple values in another.
An object may be represented as a single relation in
one location or as multiple relations in another. The
same item may be a data value in one location, an attribute in another and a relation in a third. So the data
often have discrepancies in structure and content that
must be cleaned. (4) Conflicting data: The problem
of conflicting data occurs when two databases record
the same data item but assign different values to it. This may be due to incomplete updates or system errors while manipulating such data.
The above issues show the importance of adopting suitable methods for the MDM problem, because a global organization's headquarters decisions are highly influenced by the quality knowledge synthesized from multiple databases. This survey is organized as follows. The two main ideas for MDM are presented in Major Methods for Multi-Database Mining, which includes the definitions, pros, and cons of the mono-mining and local pattern analysis strategies with schematic representations. Next, in Research Efforts Based on Mono-Database Mining and Research Efforts on Multi-Database Mining, research efforts based on mono-mining and on local pattern analysis are reviewed and discussed, respectively. Finally, conclusions and future research directions are presented in Conclusion and Scope for Future Work.
MAJOR METHODS FOR MULTI-DATABASE MINING
During the past decades, attempts have been made to enrich the knowledge discovery process by applying techniques from artificial intelligence to databases, forming an interesting research forum called Knowledge Discovery from Multiple Databases, also known as MDM. It can be defined as the process of mining data from multiple databases, which may be heterogeneous, and finding novel and useful patterns of significance.2 Though many methods exist for discovering knowledge from multiple data sources, they fall into two broad categories, namely (1) mono-database mining and (2) local pattern analysis.3
Mono-Database Mining
In mono-database mining, data from different data sources are aggregated into a centralized repository for the task of mining (see Figure 1). The main theme of mono-database mining is to discover patterns that are globally significant among the participatory data sources. Mono-database mining can
be defined as the process in which data from various databases are integrated, put in a data warehouse, and mined to identify global patterns of interest.

FIGURE 1 | Mono-database mining.
The primary technical challenge here is the communication cost between distributed data sources; it is often very costly and sometimes impossible to join multiple data sources into a single database.4 Selecting the relevant data sources for a specific application and then putting them together to mine the knowledge is a refinement of this approach. Though it is effective in reducing the search cost for a given application, it is application dependent and requires multiple scans for each application.5
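As a rough illustration of mono-database mining, the sketch below (hypothetical branch data, with a naive support counter standing in for a real mining algorithm such as Apriori) pools all branch transactions into one central data set and reports the itemsets that are frequent in the pooled data.

from itertools import combinations
from collections import Counter

# Hypothetical branch databases: each value is a list of transactions (item sets).
branch_db = {
    "B1": [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"}],
    "B2": [{"bread", "milk", "butter"}, {"wine", "salmon"}, {"bread", "milk"}],
}

def mono_database_mining(databases, min_support=0.4, max_size=2):
    """Pool every transaction into a single data set and count itemsets globally."""
    pooled = [t for db in databases.values() for t in db]   # the centralized repository
    counts = Counter()
    for t in pooled:
        for k in range(1, max_size + 1):
            counts.update(combinations(sorted(t), k))
    n = len(pooled)
    # Keep only itemsets whose global support meets the threshold.
    return {i: c / n for i, c in counts.items() if c / n >= min_support}

print(mono_database_mining(branch_db))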
Limitations of Mono-Database Mining
Mono-database mining cannot be considered a good solution for mining multiple databases because of the following limitations:

1. It is based on the traditional data warehouse architecture and is fundamentally inappropriate for most distributed and ubiquitous data mining applications. Because branch databases can be in different formats, much attention is required in the data preprocessing stage.

2. A single computer might take a very long time to process the entire data set. Though this can be addressed by employing parallel machines and associated software, the company may have to invest heavily in such software and hardware. From a cost–benefit analysis perspective, it would not be a feasible solution.6

3. It may be an unrealistic proposition to collect data from different branches for centralized processing, as branch databases handle huge volumes of transactions daily.

4. Even if the data can be quickly centralized using a relatively fast network, the privacy issue plays an increasingly important role in data mining applications based on mono-mining. For example, if a consortium of banks wants to collaborate in detecting fraud, the mono-mining approach is not feasible, because it requires collecting all the financial data pertaining to an individual customer from every bank into a single location, which jeopardizes the privacy of the banks' customers.

5. Putting all the data from the relevant databases into a single data set can destroy some important information that reflects the individuality of the branches. Branch databases may have different weights, and some branches provide a greater contribution to the whole company in terms of turnover, transactions, and so on.

The above limitations show that the traditional process of mono-database mining is inadequate, and local pattern analysis has been put forward as an alternative way of mining multiple data sources.

Local Pattern Analysis
The objective of local pattern analysis is to perform the data mining operation based on the type and availability of the distributed resources, without moving the data to a central repository. It mines important local patterns from the individual data sources, forwards the pattern bases, and thereby reduces data movement (Figure 2). Hence, a data mining application based on the local pattern analysis strategy is able to learn models from distributed data without exchanging the raw data. A local pattern7 could be a frequent itemset, an association rule, a causal rule, a dependency, or some other expression that shows the individuality of a branch site. MDM using local pattern analysis is defined as the process of synthesizing global patterns from the patterns forwarded by the individual sites. This approach is recommended when the application involves a large number of data sources and is likely to be more scalable. The primary focus is to synthesize the local mining results at multiple levels of abstraction in view of promoting regional and global features.

FIGURE 2 | Local pattern analysis.
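A minimal sketch of the site-side step is given below (hypothetical data and function names, with a naive support counter standing in for whatever algorithm a real branch would run). Each branch mines its own frequent patterns and forwards only the pattern base, together with its transaction count, to the central site; the raw transactions never leave the branch.

from itertools import combinations
from collections import Counter

def mine_local_patterns(transactions, min_support=0.4, max_size=2):
    """Mine frequent itemsets at one branch; only this summary is forwarded."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            counts.update(combinations(sorted(t), k))
    n = len(transactions)
    return {"size": n,
            "patterns": {i: c / n for i, c in counts.items() if c / n >= min_support}}

# Hypothetical branches forward their pattern bases for central synthesis.
branch_db = {
    "B1": [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"}],
    "B2": [{"bread", "milk", "butter"}, {"wine", "salmon"}, {"bread", "milk"}],
}
forwarded = {name: mine_local_patterns(db) for name, db in branch_db.items()}
print(forwarded["B1"]["patterns"])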
Advantages of Local Pattern Analysis
This approach provides the following advantages:
1. It is an in-place strategy8 (eliminates data
movement) and provides a feasible way to
generate pattern models when huge volumes
of data are distributed at various sites.
2. It captures the individuality of the data
sources and is able to find special patterns,
which are more important than the patterns
present in the integrated and unified single
database (mono-database).
3. It is of low complexity because it only mines
relevant individual data sources.
4. It offers a strategy for synthesizing forwarded patterns at multiple levels of abstraction to discover various kinds of patterns
distributed in data sources. For example,
global patterns (voted by many of the sites),
subglobal patterns (voted by some of the
sites) and local patterns (voted by few/single
sites).9
5. It provides a means for two-level decisions:
(1) global decisions—the central company’s
decisions for global applications on the basis of the synthesized patterns; (2) branch
decisions—decisions of the local branches on
the basis of features of local patterns mined
from the local databases.3
6. The main objective in knowledge discovery from databases is to capture interesting patterns with respect to the user's point of view. The user may not be a data mining expert, but is an expert in the field being mined.10 Thus the importance of any pattern depends upon the interest of the user. By using the local pattern analysis strategy, the heads of local branches can use different interestingness measures for evaluating the local patterns of their respective databases, which may not be the same as the interestingness measure used by the central head in global pattern synthesizing. For example, the measure lift may be used by branch 'B1' and the support11,12 measure may be used by branch 'B2' for evaluating the local rule A→B at their corresponding sites. Once synthesized, the rule A→B can be evaluated globally by using another measure, say correlation. This shows that the local pattern analysis strategy enables the local and central sites to adopt different interestingness metrics for evaluating patterns.
Limitation of Local Pattern Analysis
Though the local pattern analysis strategy offers reasonably good solutions to the MDM problem, researchers have seen the other side as well. Adhikari et al.13 pointed out that the frequency of data mining is a drawback of the local pattern analysis strategy for the MDM problem. In mono-mining, the mining of the database occurs only once. In the case of the local pattern analysis strategy, however, the frequency of mining grows with the number of databases. Despite the various advantages, this may be accounted a limitation.
RESEARCH EFFORTS BASED ON MONO-DATABASE MINING
For mining databases, a prototype knowledge discovery system, INLEN14 (Inference and Learning) has
been developed at George Mason University whose
principal knowledge discovery algorithm AQ learns
decision rules by performing inductive inference over
a set of training examples. The system provides valuable insights into characteristics and relationships that
exist in the database, but are unknown to the user.
The discovered knowledge is displayed in the form
of IF-THEN rules. The limitation was that INLEN had been used only for discovery in small, single databases.
To overcome this limitation, Ribeiro et al.15 extended this approach to discover knowledge in multiple databases by applying INLEN's methodology to individual databases and then further processing the discovered knowledge. Accordingly, the AQ algorithm has been modified to handle primary and foreign key information from two data sources. In their approach, the databases must reside on the same machine. Wrobel16 extended the concept of foreign keys to include foreign links, because MDM involves accessing nonkey attributes. In practice, useful databases may exist in remote locations and provide knowledge for the decision-making process. To respond to such a scenario, Aronis et al.17 introduced the WoRLD (Worldwide Relational Learning Daemon) system, an inductive rule-learning program that can learn from multiple databases distributed around the network. They proposed an approach called 'activation spreading', which computes the cardinal distribution of the feature values in the individual data sets and propagates the distribution across the different sites.
Turinsky and Grossman,8 in their work, discuss two types of strategies for mining multiple databases. Because the task of moving large data sets over the Internet may be a time-consuming and costly proposition, the first strategy is to leave the data in place, build local models, and combine the models at a central site. They call this scheme an in-place
strategy. At the other extreme, when the amount of
geographically distributed data is very small, it is possible to move all the data to a central site and build a
single model there. They call this a centralized strategy. Then, they describe an intermediate strategy of
optimal data and model partitions to achieve a given
level of accuracy at a minimum cost.
Grossman et al.18 introduced Papyrus, a system
for distributed data mining which supports various
strategies such as move data, move model, and move results, as well as a mixture of all, on the basis of the data distribution, the availability of resources, and the required accuracy. Prodromidis et al.19 adopted a metalearning
strategy for mining multiple databases by integrating
multiple classifiers computed over different databases
to form higher-level classifiers or classification model.
The process of metalearning starts with distributed
databases, or a set of data subsets of the original
database and concurrently running a learning algorithm on each of the subsets. Using an integration rule,
it combines the predictions from classifiers learned
from the subsets by recursively learning ‘combiner’
and ‘arbiter’ models in a bottom-up tree manner. The
focus of metalearning is to combine the predictions of
the learned models from the partitioned data subsets
in a parallel and distributed environment.
Kargupta and his colleagues20,21 considered a
collective framework to address data analysis for heterogeneous environments and proposed the collective
data mining (CDM) framework for predictive data
modeling. The main features of CDM can be summarized in the following steps: (1) Generate approximate orthonormal basis coefficients at each local site
(2) Move an appropriately chosen sample of the data sets from each site to a single site and generate the approximate basis coefficients corresponding to the nonlinear cross terms. (3) Combine the local models, transform the model into the user-described canonical representation, and output the model. Here, nonlinear terms represent a set of coefficients (or patterns)
that cannot be determined at a local site. In essence,
the performance of a CDM model depends on the
quality of estimated cross-terms. Typically, CDM requires an exchange of a small sample that is often
negligible compared to the entire data. On the basis
of the framework, various distributed data analysis
algorithms such as collective decision rule learning
using Fourier analysis, collective hierarchical clustering, collective multivariate regression using wavelets,
and collective principal component analysis are
developed.
For mining large databases, Savasere et al.22
proposed a partition algorithm. The algorithm mines frequent itemsets from each non-overlapping partition of the database, and global candidate patterns are generated from the union of all the frequent itemsets.
A second run is made on each partition to obtain
the frequency count of each of the candidate patterns, which are then summed up to obtain the global
support count. If the support count is found to be
greater than the minimum support count, the pattern
is deemed a global pattern, and global rules are generated from such patterns. This approach provides an elegant solution for mining huge centralized databases.
However, the method is not directly applicable to
MDM. To adapt this approach to MDM, each branch
database may be considered as a part of a partition
of the multi-database. The mined local frequent itemsets are forwarded to the center to form candidate
global patterns. The candidate global patterns are
transmitted to the local sites to mine for a second
time. The mined patterns are again forwarded by the
local sites to the center to assemble and evaluate global
rules. The above scheme requires two sets of scans
of local databases and three transmissions of mined
patterns across the network.
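A minimal sketch of this adaptation is given below, with each branch database treated as one partition (hypothetical data; a naive counter again stands in for a real local mining algorithm). The first pass collects locally frequent itemsets as global candidates; the second pass asks every branch for the exact counts of those candidates and sums them at the center.

from itertools import combinations
from collections import Counter

def local_frequent(transactions, min_support, max_size=2):
    """Scan 1 at a branch: return the locally frequent itemsets."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            counts.update(combinations(sorted(t), k))
    n = len(transactions)
    return {i for i, c in counts.items() if c / n >= min_support}

def local_counts(transactions, candidates):
    """Scan 2 at a branch: exact counts of the candidate global patterns."""
    return {i: sum(set(i) <= t for t in transactions) for i in candidates}

def partition_style_mdm(branch_db, min_support=0.4):
    # Transmission 1: every branch forwards its locally frequent itemsets.
    candidates = set().union(*(local_frequent(db, min_support) for db in branch_db.values()))
    # Transmissions 2 and 3: candidates go back to the branches, exact counts return.
    total_n = sum(len(db) for db in branch_db.values())
    global_counts = Counter()
    for db in branch_db.values():
        global_counts.update(local_counts(db, candidates))
    return {i: c / total_n for i, c in global_counts.items() if c / total_n >= min_support}

branch_db = {"B1": [{"bread", "milk"}, {"bread", "butter"}],
             "B2": [{"bread", "milk"}, {"milk", "butter"}, {"bread", "milk", "butter"}]}
print(partition_style_mdm(branch_db))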
TABLE 1 | Analysis of Research Attempts in Mono-Database Mining with Their Significance

Serial No. | Researchers | Issue Focused | Contribution
1 | Michalski et al.,14 Ribeiro et al.,15 and Aronis et al.17 | Characteristics and relationships existing in the database | Rule-based knowledge discovery algorithms
2 | Grossman et al.18 and Turinsky and Grossman8 | Data movement | Data movement strategies for the task of mining
3 | Prodromidis et al.19 | Database classification | Classification model for distributed database environments
4 | Kargupta et al.20,21 | Data analysis | Collective data mining framework
5 | Savasere et al.22 | Pattern discovery | Partitioning algorithm for mining large databases
6 | Zhong et al.23 | Pattern discovery | Procedure for mining peculiarity patterns in multiple databases
7 | Liu et al.24 | Database selection | Application-dependent selection procedure
8 | Wu et al.25 | Database classification | Application-independent database selection procedure
To find new, surprising, and interesting patterns hidden in data, peculiarity-oriented mining in multiple databases was introduced by Zhong et al.23 Peculiarity represents a new interpretation of interesting, unexpected relationships that are hidden in a relatively small amount of data. The main task of mining peculiarity rules is the identification of peculiar data. Peculiarity of data is characterized by two features: (1) being very different from the other objects in a data set, and (2) consisting of a relatively low number of objects. They argued that peculiarity rules are a typical regularity hidden in many scientific, statistical, and transactional databases. They proposed a peculiarity factor to determine whether an attribute value occurs in relatively low numbers and is very different from the other values, by evaluating the sum of the square roots of the conceptual distances between them. Finally, one can select the peculiar data by means of a peculiarity-factor threshold.
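For a single numeric attribute, the peculiarity factor can be sketched as follows; this is a simplified reading in which the conceptual distance is taken as the absolute difference and the exponent 0.5 supplies the square root mentioned above, so the authors' exact formulation23 may differ in detail.

def peculiarity_factor(values, alpha=0.5):
    """PF(x_i) = sum_j |x_i - x_j| ** alpha for every value of one attribute."""
    return [sum(abs(x - y) ** alpha for y in values) for x in values]

def peculiar_values(values, beta=1.5):
    """Select values whose factor exceeds mean + beta * std (an assumed threshold rule)."""
    pf = peculiarity_factor(values)
    mean = sum(pf) / len(pf)
    std = (sum((p - mean) ** 2 for p in pf) / len(pf)) ** 0.5
    return [x for x, p in zip(values, pf) if p > mean + beta * std]

ages = [21, 22, 23, 22, 24, 23, 80]   # hypothetical attribute values
print(peculiar_values(ages))           # the outlying value 80 is selected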
To deal with multiple databases, Liu et al.24 proposed to mine multi-databases by first identifying the relevant databases. They argued that the first step of MDM is to identify the databases that are most likely to be relevant to an application, for reasons of efficiency and accuracy. In their approach, a cluster of multi-databases is constructed for an application; this process, which is typically application dependent, is referred to as database selection. However, database selection has to be carried out multiple times to identify relevant databases for two or more real-world applications. In particular, when users need to mine their multi-databases without reference to any specific application, application-dependent techniques
do not work well. To cater to this requirement, Wu et al.25 proposed an application-independent database classification strategy for MDM. They presented a technique for clustering databases toward mining multiple databases. Multiple databases are classified by constructing a relevance measure called similarity. In particular, they defined measures such as |class|, Goodness, and a distance function for searching for good clusters in multi-databases. Both works focused on efficient data preparation techniques for MDM. Table 1 summarizes the research contributions based on the mono-mining strategy along with their significance.
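The |class|, Goodness, and distance measures of Wu et al.25 are not reproduced here; the sketch below only illustrates the general flavor of application-independent database classification, using a Jaccard similarity between the item sets of two databases and a similarity threshold as an assumed stand-in for their measures.

def jaccard(items_a, items_b):
    """Similarity between two databases summarized by their sets of items."""
    return len(items_a & items_b) / len(items_a | items_b)

def classify_databases(db_items, min_sim=0.5):
    """Greedy grouping: a database joins the first class it is similar enough to."""
    classes = []
    for name, items in db_items.items():
        for cls in classes:
            if all(jaccard(items, db_items[other]) >= min_sim for other in cls):
                cls.append(name)
                break
        else:
            classes.append([name])
    return classes

# Hypothetical item summaries of four branch databases.
db_items = {"D1": {"bread", "milk", "butter"},
            "D2": {"bread", "milk", "jam"},
            "D3": {"wine", "salmon"},
            "D4": {"wine", "salmon", "cheese"}}
print(classify_databases(db_items))    # [['D1', 'D2'], ['D3', 'D4']]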
The above efforts have provided a good insight
into mining multiple databases and tackled several
important issues in MDM. However, there are also
many potentially useful patterns in local databases.
Apart from the cost of moving huge volumes of data over a
communication network, the mono-database mining
strategy obliterates interesting local patterns at various sites. The following section reviews the research
efforts on local pattern analysis, which overcomes the
limitations of mono-mining.
RESEARCH EFFORTS ON MULTI-DATABASE MINING
Zhang et al.26 brought out the differences between mono-database mining and MDM by presenting novel significant patterns that are found in
MDM, which are not captured in mono-database
mining. They argue that any business organization,
with multiple branches, has two levels of decisions:
headquarter level (global) and branch level (local)
decisions. Following this logic, they classify the patterns in a multi-database system as local patterns,
high-vote patterns, exceptional patterns, and suggested patterns.
High-vote patterns are supported by most of the
branches or all branches of an interstate organization.
Such patterns reflect the common features among the
branch databases. According to these patterns, the
head company can make decisions for the common
profit of all the branches. Exceptional patterns have a
higher support in some branches but zero support in
other branches. According to these patterns, the head
company can adjust measures to local conditions and
make special policies for such branches. Suggested
patterns are supported by some of the branches, but by fewer branches than those supporting the high-vote patterns.
Because users are more likely to provide the patterns mined from their databases rather than their raw data, and a large number of local patterns are forwarded from the branch databases, a synthesizing model is necessary to gather global patterns from the
forwarded local patterns. Wu and Zhang27 advocated
a model for synthesizing high-frequency rules from
multiple databases through weighting. In many fields
such as probability and fuzzy set theory, weighting
has been considered as a common method for aggregating information. To aggregate association rules
from multiple databases, one also needs to determine
the weights of the data sources.
The weighting model advocated by Wu and
Zhang27 has been considered as a first attempt in
synthesizing global patterns from the forwarded local patterns. Their weighting model aims to synthesize high-frequency association rules from different
data sources. According to them, a rule is called a
high-frequency rule if it is supported or voted for by a
large number of data sources. Their rule weight is proportional to the number of data sources supporting
the rule. The weight of any data source in turn is calculated based on the number of high-frequency rules
supported by it. They assign higher weights to data sources that support a larger number of high-frequency rules and lower weights to data sources that support fewer high-frequency rules.
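A minimal sketch of such a weighted synthesis is shown below; it is only an illustration of the flow described above, not the precise weighting scheme of Wu and Zhang.27 Each source forwards (rule, local support) pairs, sources voting for many high-frequency rules receive larger weights, and the synthesized support of a rule is the weight-averaged local support.

from collections import defaultdict

def synthesize_rules(forwarded, min_sources=2):
    """forwarded: {source: {rule: local_support}}; returns synthesized supports."""
    # Rules voted for by at least `min_sources` sources are treated as high frequency.
    votes = defaultdict(int)
    for rules in forwarded.values():
        for rule in rules:
            votes[rule] += 1
    high_freq = {r for r, v in votes.items() if v >= min_sources}

    # A source's weight grows with the number of high-frequency rules it supports.
    raw_w = {s: sum(r in high_freq for r in rules) for s, rules in forwarded.items()}
    total = sum(raw_w.values()) or 1
    weight = {s: w / total for s, w in raw_w.items()}

    # Synthesized support of a rule: weighted sum of its local supports.
    return {r: sum(weight[s] * forwarded[s].get(r, 0.0) for s in forwarded)
            for r in high_freq}

forwarded = {"S1": {("A", "B"): 0.40, ("B", "C"): 0.25},
             "S2": {("A", "B"): 0.35},
             "S3": {("A", "B"): 0.50, ("B", "C"): 0.30}}
print(synthesize_rules(forwarded))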
The synthesizing model proposed by Wu and
Zhang27 works on similar sized data sources. When
numerous data sources are considered, it is practically
impossible to have similar sized data sources. To process data sources of different sizes, merging and splitting of data sources has to be done to make them of
the same size. But these types of operations are complex and a huge effort is required. When merging of
data sources is not possible because of data sharing
issues, Wu and Zhang27 suggested ignoring data sources whose sizes are below a user-specified threshold. Thus, some of the data sources may not participate in the rule synthesis process. Though Wu and Zhang's model27 attempts to synthesize the global association rules for the overall organization that would
have been discovered from the union of all the data
sources, the comparison of the synthesized results
with the mono-mining results obtainable by the union
of those data sources is not targeted in their work.
Nedunchezhian and Anbumani28 focused on two issues, namely data source selection and the selection of valid rules for synthesizing. They calculate the weight of a data source on the basis of two factors: (1) the number of high-frequency rules voted for by the data source and (2) the size of the data source. Accordingly, the weights of all participating data sources are calculated, and a data-source selection threshold is applied to identify candidate data sources for synthesizing high-frequency rules. To prune low-frequency rules at the local sites themselves, a procedure called support equalization is also presented, which equates the supports of the data sources and reduces the total number of rules forwarded to the central head.
Zhang et al.29 have advocated an approach for
synthesizing global exceptional patterns for MDM
applications. They have developed an algorithm for
identifying global exceptional patterns in multiple
databases. In their approach, every local database is
mined separately in a random order for synthesizing
global exceptional patterns. Kum et al.30 have developed a local mining approach for finding sequential patterns in multiple databases. They present a
novel algorithm to mine approximate sequential patterns called consensus patterns from large sequence
databases in two steps. First, sequences are clustered
by similarity. Then, consensus patterns are mined directly from each cluster through multiple alignments.
Adhikari and Rao6 extended the local pattern analysis model and introduced the notion of heavy association rules in their work. Heavy association rules are rules whose synthesized global supports are higher than a user-given threshold. Their synthesizing criterion is the same as the measure defined by Wu and Zhang,27 and they state that heavy association rules are sometimes more useful than high-frequency association rules. They also observed that heavy association rules may not be shared by all the databases. Therefore, they defined a high-frequency rule as a rule shared by at least n × r1 databases, and an exceptional rule as a rule shared by no more than n × r2 databases, where 'n' indicates the number of databases and r1 and r2 are user-defined thresholds. They presented an algorithm for synthesizing heavy association rules from multiple data sources and for reporting whether a heavy association rule is high-frequency or exceptional in multiple databases. The limitation of this model is that it provides approximate global patterns.
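A small sketch of this classification step is given below, under the simplifying assumptions that the synthesized support is a plain size-weighted average and that the heaviness threshold, r1, and r2 are supplied by the user; the actual synthesizing algorithm of Adhikari and Rao6 is richer than this.

def classify_rule(local_supports, sizes, heavy_min=0.3, r1=0.6, r2=0.25):
    """local_supports: {site: support, or None if the rule was not reported there}."""
    n = len(sizes)
    voters = [s for s, sup in local_supports.items() if sup is not None]
    total = sum(sizes.values())
    synthesized = sum(sizes[s] * local_supports[s] for s in voters) / total

    labels = []
    if synthesized >= heavy_min:
        labels.append("heavy")
    if len(voters) >= n * r1:
        labels.append("high-frequency")   # shared by at least n * r1 databases
    if len(voters) <= n * r2:
        labels.append("exceptional")      # shared by no more than n * r2 databases
    return synthesized, labels

sizes = {"D1": 1000, "D2": 800, "D3": 1200, "D4": 500}
rule_AB = {"D1": 0.42, "D2": 0.38, "D3": 0.45, "D4": None}
print(classify_rule(rule_AB, sizes))   # heavy and high-frequency, not exceptional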
Ramkumar and Srinivasan2 proposed a transactions-population-based weighting model for synthesizing high-frequency rules from different data sources. According to them, the rule weight is proportional to the sum of the weights of the data sources supporting the rule. The weight of any data source is calculated based on the population of the data source, that is, on the number of transactions in the database. Their goal in synthesizing global patterns from the forwarded local patterns is that the support and confidence of a synthesized pattern should be very nearly the same as would have been obtained if all the data sites were integrated and mono-mining were performed. They did not agree with the formulation that in a big company each branch, big or small, has equal power to vote for patterns. They also added that, in a pure business sense, all branches are not equal; the branches that have a high volume of business will have, and should have, a greater say in determining global policies based on global patterns.
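Under this model, one plausible reading of the synthesized support is a transaction-count-weighted average of the local supports, sketched below with hypothetical figures; the published model2 also covers confidence and other details not shown here.

def population_weighted_support(local_supports, tx_counts):
    """Weight each site's support by its share of the total transaction population."""
    total = sum(tx_counts.values())
    return sum(tx_counts[s] * local_supports.get(s, 0.0) for s in tx_counts) / total

tx_counts = {"S1": 100, "S2": 1000}          # S2 carries ten times the weight of S1
local_supports = {"S1": 0.50, "S2": 0.20}
print(population_weighted_support(local_supports, tx_counts))   # about 0.227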
The synthesizing models2,6,27 focused on synthesizing high-frequency rules only, because such rules emerge as global rules when all the data sources are integrated. High-frequency rules are truly valid for making global decisions by the head branch of any interstate company. However, in such a synthesis, regional patterns or rules get eliminated. For making decisions, say at regional levels, patterns that show the individuality of regions, clusters, or groups of branches become important. They can be explored only by synthesizing them from a multilevel perspective. In response to this demand, Ramkumar and Srinivasan9 extended their earlier work and proposed a multilevel rule synthesis framework using two interesting rule evaluation measures, namely, the effective and nominal vote rates. γeffective is defined as the effective vote rate, which is the cumulative percentage of votes received from different data sources for a given rule on the basis of the transactions-populations of the respective data sources. γnominal is defined as the nominal vote rate, which is the cumulative percentage of votes received from different data sources for a given rule on the basis of equal votes for all sites. Using these rule selection measures, local patterns are synthesized into global rules, subglobal rules, and local rules.
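The two vote rates can be sketched as follows, assuming a site 'votes' for a rule when the rule is locally frequent there: γnominal counts every voting site equally, whereas γeffective weights each vote by the site's share of the total transaction population. This is an interpretation of the definitions above, not the authors' exact formulation.9

def vote_rates(voting_sites, tx_counts):
    """Return (gamma_effective, gamma_nominal) as percentages for one rule."""
    n_sites = len(tx_counts)
    total_tx = sum(tx_counts.values())
    gamma_nominal = 100.0 * len(voting_sites) / n_sites
    gamma_effective = 100.0 * sum(tx_counts[s] for s in voting_sites) / total_tx
    return gamma_effective, gamma_nominal

tx_counts = {"S1": 100, "S2": 1000, "S3": 400}
print(vote_rates({"S1", "S3"}, tx_counts))   # (33.3, 66.7): the two views can disagree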
During the synthesizing process, when a rule or pattern is present at a site only weakly and fails to satisfy the minimum support threshold, that rule is not allowed to take part in the synthesizing procedure. Such circumstances do not imply that the rule is not present at all, because the rule may have some significance at the site, with a support value that lies between 0 and the minimum support. Ramkumar and Srinivasan31 focused on this issue and introduced the notion of a correction factor in the rule synthesizing process. With the inclusion of the correction factor, the synthesized results are improved. They also concluded that the domain expert should choose a suitable correction factor based on his knowledge and estimate of the distribution of the data. In the absence of detailed knowledge about the data distribution, they recommended a correction factor of 0.50.
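One plausible reading of the correction factor, sketched below, is that a site that did not report a rule (because the rule fell below the local minimum support) is credited with an assumed support of correction × minsup instead of zero; the exact treatment in Ramkumar and Srinivasan31 may differ.

def corrected_weighted_support(local_supports, tx_counts, minsup=0.1, correction=0.5):
    """Non-reporting sites contribute correction * minsup rather than zero support."""
    total = sum(tx_counts.values())
    synthesized = 0.0
    for site, n in tx_counts.items():
        sup = local_supports.get(site)
        synthesized += n * (sup if sup is not None else correction * minsup)
    return synthesized / total

tx_counts = {"S1": 500, "S2": 500}
print(corrected_weighted_support({"S1": 0.30}, tx_counts))   # 0.175 rather than 0.15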
Adhikari et al.32 proposed a model for mining global patterns in multiple transactional time-stamped databases. They argued that finding the variation of sales of an item over time is an important issue. Accordingly, they introduced the notion of the stability of an item, because stable items are useful in making many strategic decisions. On the basis of the degree of stability of an item, an algorithm for clustering different databases has been proposed. Zhang et al.33 proposed a method for obtaining local patterns from the individual databases based on customer lifetime values (CLVs). For computing CLVs, three attributes are taken into account, namely the customer id, the customer expenditure, and the period of the customer's lifecycle. Using a method called kernel estimation for mining global patterns (KEMGP), which adopts kernel estimation, global patterns are synthesized from the forwarded local patterns.
Adhikari et al.34 noted that association analysis
of select items in multiple market databases is an important as well as promising issue. As many important
decisions are based on a set of specific items called select items, they have proposed a model for mining
global patterns of select items in multiple databases.
A measure of overall association between two items in
databases is also proposed. They have designed an algorithm based on the proposed measure for grouping
frequent items in multiple databases.
The existing rule synthesizing methods commonly assume that an appropriate relevance analysis has been done among the databases and that the databases under consideration are highly relevant. This is equivalent to the assumption that all stores have the same type of business with identical metadata structures, which is hardly ever the case. The above problem has been attacked by He et al.35 They have proposed a synthesizing model for databases that contain different items and that may not be relevant to each other. They have argued that a simple rule synthesizing model, without a detailed understanding of the databases, is not adequate to reveal meaningful patterns inside the databases.
They have proposed a two-step clustering-based rule synthesizing framework. Accordingly, for databases with different items, clustering can be done at the item level, whereas for databases sharing similar items but different rules, the clusters generated from the item-level clustering are further clustered. By this two-step process, the final clusters contain both similar items and similar rules. Then the weighted rule synthesizing method proposed by Wu and Zhang27 is applied to such clusters to generate the synthesized rules. Table 2 summarizes the salient features of the research works based on the local pattern analysis strategy.

TABLE 2 | Analysis of Research Attempts in the Local Pattern Analysis Strategy with Their Significance

Serial No. | Researchers | Issue Focused | Contribution
1 | Zhang et al.26 | Local pattern analysis | Identification of new kinds of patterns in multi-database environments
2 | Wu and Zhang27 | Synthesizing model | Weighting model for synthesizing global patterns based on frequent rules voted for by the data sources
3 | Nedunchezhian and Anbumani28 | Database identification and setting up threshold values for synthesizing | Data source selection and support equalization for synthesizing global patterns
4 | Zhang et al.29 | Discovering new kinds of patterns | Synthesizing procedure for globally exceptional patterns
5 | Kum et al.30 | Discovering new kinds of patterns | Algorithms for sequential pattern discovery
6 | Adhikari and Rao6 | Discovering new kinds of patterns | Notion of heavy association rules in the synthesizing process on the basis of Wu and Zhang's model
7 | Ramkumar and Srinivasan2 | Synthesizing model | Transactions-population-based weighting model for synthesizing global patterns with a target of obtaining results closer to mono-mining
8 | Ramkumar and Srinivasan9 | Discovering new kinds of patterns | Notion of effective and nominal vote rates in rule synthesizing for pattern classification
9 | Ramkumar and Srinivasan31 | Optimization in synthesizing model | Notion of correction factor in the rule synthesizing process for improved synthesized results
10 | Adhikari et al.32 | Database clustering | Mining global patterns in time-stamped databases
11 | Adhikari et al.34 | Grouping items | Model for mining select items
12 | He et al.35 | Database clustering | Synthesizing model for databases dissimilar in nature
CONCLUSION AND SCOPE FOR FUTURE WORK
Research in MDM will become more important, imperative, and challenging with the increasing development of multi-databases. This paper has surveyed various research works in this growing field with emphasis on
mono-mining and local pattern analysis. There are
still several challenges in the local pattern analysis approach, which need further research.
Inclusion of Quantitative Information in the Allocation of Data Source Weights
Considering two sites S1 and S2 with respective populations of 100 and 1000 transactions, the transactions-population-based weighting model assigns S2 a weight 10 times that of S1. This would not be fair if the turnover of S1 is higher than that of S2. Thus, allocating site weights on the basis of the transactions-population alone may not be a good decision. To improve decisions, quantitative mining on the basis of turnover quantity or the cost of items sold may be carried out. The frequent rule Wine→Salmon (support = 10%, confidence = 80%) may be more important than the frequent rule Bread→Milk (support = 30%, confidence = 80%), even though the former holds a lower support. This is because the items in the first rule usually come with more profit per unit sale compared with the items in the second rule. Hence the inclusion of quantitative information in the allocation of site weights is needed, and a synthesizing model based on multiple minimum supports for the corresponding quantitative information is required.
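As one possible direction, and purely as an illustration rather than a model defined in the surveyed works, site weights could blend the transaction population with turnover, as in the sketch below.

def blended_site_weights(tx_counts, turnover, alpha=0.5):
    """Weight = alpha * share of transactions + (1 - alpha) * share of turnover."""
    total_tx = sum(tx_counts.values())
    total_turn = sum(turnover.values())
    return {s: alpha * tx_counts[s] / total_tx + (1 - alpha) * turnover[s] / total_turn
            for s in tx_counts}

tx_counts = {"S1": 100, "S2": 1000}
turnover = {"S1": 900000, "S2": 600000}   # S1 has fewer but higher-value transactions
print(blended_site_weights(tx_counts, turnover))   # S1 no longer gets one tenth of S2's weight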
Assigning Weights for Transactions
Assigning weights to transactions is also one of the future research directions, in our view. Different transactions have different weights in real-world data sets. For instance, in market basket analysis, each transaction is recorded with some profit. Transactions with a large number of items should be considered more important than transactions containing only a few items. This shows the need to assign different weights to different transactions to reflect their importance. To assign weights to transactions, factors such as the recency, frequency, monetary value, and duration (RFMD) values can be used by each of the local branches. The RFMD technique is one of the popular methods in market segmentation. Customers who have recently purchased (recency), customers who purchase many times (frequency), customers who spend more money (monetary value), and customers who spend more time on the seller's website (duration) are the main parameters of the RFMD technique. Weights are assigned to each parameter, and the weighted score for each transaction can be calculated. Also, by assigning transaction weights, the problem of considering all transactions equally in the rule mining process can be eliminated, and the extracted rules will have greater significance.
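A minimal sketch of an RFMD-style transaction weight is given below; the parameter weights and the normalization scales are assumptions made only for illustration, since the text does not fix them.

def rfmd_weight(recency, frequency, monetary, duration,
                w=(0.25, 0.25, 0.25, 0.25), scale=(30.0, 50.0, 1000.0, 60.0)):
    """Weighted score of normalized R, F, M, D values (smaller recency is better)."""
    r = max(0.0, 1.0 - recency / scale[0])   # days since the last purchase
    f = min(1.0, frequency / scale[1])       # number of purchases
    m = min(1.0, monetary / scale[2])        # money spent
    d = min(1.0, duration / scale[3])        # minutes spent on the seller's website
    return w[0] * r + w[1] * f + w[2] * m + w[3] * d

# A transaction from a recent, frequent, high-spending customer gets a larger weight.
print(rfmd_weight(recency=2, frequency=40, monetary=800, duration=45))
print(rfmd_weight(recency=25, frequency=3, monetary=50, duration=5))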
Global Classification Model for Rule Synthesizing
Supervised learning is a well-known data mining functionality, which is used to classify data records into a set of predefined class labels. Using classification techniques such as decision trees, mining the features of the local data sources for a given concept or class, and synthesizing them to form a global classification model for mining multiple databases, is also an interesting research direction.
Synthesizing Negative Association Rules
Mining negative association rules in multiple databases could also be one of the future research areas. A negative association rule describes a relationship between itemsets and implies that the occurrence of some itemsets is characterized by the absence of others.36 A positive association rule 'A→B' has three corresponding negative association rules, 'A→¬B', '¬A→B', and '¬A→¬B'. Negative association rules also play an important role in decision making. For example, when handling different medical databases coming from different areas, a center for disease control would be interested in finding out which factors are relatively irrelevant and which are absolutely irrelevant, although they may arise frequently. Thus there is sound scope for research in the development of an effective synthesizing model for negative association rules.
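The supports of the three negative forms can be derived from the positive counts using standard identities, for example supp(A→¬B) = supp(A) − supp(A∪B), as the sketch below shows; synthesizing such supports across databases could then follow the same weighting ideas discussed earlier.

def negative_rule_supports(supp_a, supp_b, supp_ab):
    """Supports of A->not B, not A->B and not A->not B from positive supports."""
    supp_a_notb = supp_a - supp_ab                    # A occurs without B
    supp_nota_b = supp_b - supp_ab                    # B occurs without A
    supp_nota_notb = 1.0 - supp_a - supp_b + supp_ab  # neither A nor B occurs
    return supp_a_notb, supp_nota_b, supp_nota_notb

print(negative_rule_supports(supp_a=0.5, supp_b=0.6, supp_ab=0.4))   # (0.1, 0.2, 0.3)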
REFERENCES
1. Bright MW, Hurson AR, Pakzad SH. A taxonomy and current issues in multidatabase systems. IEEE Comput 1992, 25:50–60.
2. Ramkumar T, Srinivasan R. Modified algorithms for synthesizing high-frequency rules from different data sources. Knowl Inf Syst 2008, 17:313–334.
3. Zhang S, Wu X, Zhang C. Multi-database mining. IEEE Comput Intell Bull 2003, 2:5–13.
4. Zhang S, Chen Q, Yang Q. Acquiring knowledge from inconsistent data sources through weighting. Data Knowl Eng 2010, 69:779–799.
5. Liu H, Lu H, Yao J. Identifying relevant databases for multidatabase mining. In: Proceedings of the Second Pacific-Asia Conference on Knowledge Discovery and Data Mining. Melbourne, Australia; 1998, 210–221.
6. Adhikari A, Rao PR. Synthesizing heavy association rules from different real data sources. Pattern Recognit Lett 2008, 29:59–71.
7. Zhang S, Zaki JM. Mining multiple data sources: local pattern analysis. Data Min Knowl Discov 2006, 12:121–125.
8. Turinsky K, Grossman R. A framework for finding distributed data mining strategies that are intermediate between centralized strategies and in-place strategies. In: Workshop on Distributed and Parallel Knowledge Discovery at the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000). Boston, MA; 2000, 1–7.
9. Ramkumar T, Srinivasan R. Multi-level synthesis of frequent rules from different data sources. Int J Comput Theory Eng 2010, 2:195–204.
10. Lenca P, Meyer P, Vaillant B, Lallich S. On selecting interestingness measures for association rules: user oriented description and multiple criteria decision aid. Eur J Oper Res 2008, 184:610–626.
11. Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. Washington, DC; 1993, 207–216.
12. Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the Twentieth International Conference on Very Large Databases (VLDB). Santiago de Chile, Chile; 1994, 478–499.
13. Adhikari A, Jain CL, Ramana S. Analysing effect of database grouping on multi-database mining. IEEE Intell Inf Bull 2011, 12:25–32.
14. Michalski RS, Kerschberg L, Kaufman KA, Ribeiro JS. Mining for knowledge in databases: the INLEN architecture, initial implementation and first results. J Intell Inf Syst: Integr AI and Database Technol 1992, 1:85–113.
15. Ribeiro J, Kaufman K, Kerschberg L. Knowledge discovery from multiple databases. In: Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95). Montreal, Canada; 1995, 240–245.
16. Wrobel S. An algorithm for multi-relational discovery of subgroups. In: Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery. Trondheim, Norway; 1997, 78–87.
17. Aronis J, Kolluri V, Provost F, Buchanan B. The WoRLD: knowledge discovery from multiple distributed databases. In: Proceedings of the Tenth International Florida AI Research Symposium. Daytona Beach, FL; 1997, 337–341.
18. Grossman RL, Bailey S, Ramu A, Malhi B, Turinsky A. The preliminary design of Papyrus: a system for high performance, distributed data mining over clusters. In: Advances in Distributed and Parallel Knowledge Discovery. Menlo Park, CA: AAAI/MIT Press; 2000, 259–275.
19. Prodromidis A, Chan P, Stolfo S. Meta-learning in distributed data mining systems: issues and approaches. In: Advances in Distributed and Parallel Knowledge Discovery. Menlo Park, CA: AAAI/MIT Press; 2000.
20. Kargupta H, Huang W, Sivakumar K, Johnson E. Distributed clustering using collective principal component analysis. Knowl Inf Syst 2001, 3:422–448.
21. Kargupta H, Huang W, Sivakumar K, Park B, Wang S. Collective principal component analysis from distributed, heterogeneous data. In: Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery. Lyon, France; 2000, 452–457.
22. Savasere A, Omiecinski E, Navathe S. An efficient algorithm for mining association rules in large databases. In: Proceedings of the Twenty-First International Conference on Very Large Data Bases. Zurich, Switzerland; 1995, 432–444.
23. Zhong N, Yao YY, Ohshima M. Peculiarity oriented multi-database mining. IEEE Trans Knowl Data Eng 2003, 15:952–960.
24. Liu H, Lu H, Yao J. Toward multi-database mining: identifying relevant databases. IEEE Trans Knowl Data Eng 2001, 13:541–553.
25. Wu X, Zhang C, Zhang S. Database classification for multi-database mining. Inf Syst 2005, 30:71–88.
26. Zhang S, Zhang C, Wu X. Knowledge Discovery in Multiple Databases. London: Springer-Verlag; 2004.
27. Wu X, Zhang S. Synthesizing high-frequency rules from different data sources. IEEE Trans Knowl Data Eng 2003, 15:353–367.
28. Nedunchezhian R, Anbumani K. Post mining: discovering valid rules from different sized data sources. Int J Inf Technol 2006, 3:47–53.
29. Zhang C, Liu M, Nie W. Identifying global exceptional patterns in multidatabase mining. IEEE Comput Intell Bull 2004, 3:19–24.
30. Kum HC, Chang JH, Wang W. Sequential pattern mining in multidatabases via multiple alignment. Data Min Knowl Discov 2006, 12:151–180.
31. Ramkumar T, Srinivasan R. The effect of correction factor in synthesizing global rules in a multi-database mining scenario. J Appl Comput Sci 2009, 3:33–38.
32. Adhikari J, Rao PR, Adhikari A. Clustering items in different data sources induced by stability. Int Arab J Inf Technol 2009, 6:394–402.
33. Zhang S, You X, Jin Z, Wu X. Mining globally interesting patterns from multiple databases using kernel estimation. Expert Syst Appl 2009, 36:10863–10869.
34. Adhikari A, Rao PR, Pedrycz W. Study of select items in different data sources by grouping. Knowl Inf Syst 2010, 27:23–43.
35. He D, Wu X, Zhu X. Rule synthesizing from multiple related databases. In: Proceedings of the Fourteenth Pacific-Asia Conference on Knowledge Discovery and Data Mining. Hyderabad, India; 2010, 201–213.
36. Zhang S, Wu X. Fundamentals of association rules in data mining and knowledge discovery. WIREs Data Min Knowl Discov 2011, 1:97–116.