A UNIFIED THEORY OF DATA MINING BASED ON
UNIPARTITE AND BIPARTITE GRAPHS
William Perrizo
Computer Science Department
North Dakota State University
IACC 258, Suite A15
Fargo, ND 58105-5164
[email protected]
Abstract
All Data Mining can be unified under the
concept of relationship analysis (graph analysis
is a dual concept). Horizontal versus vertical
implementation is an orthogonal issue (e.g.,
involving performance, scalability, etc.),
which we have discussed in other forums and is
beyond the scope of this paper. We give a
unified presentation of the three main areas of
data mining, namely Association Rule Mining,
Supervised Machine Learning (Classification)
and Unsupervised Machine Learning
(Clustering), in graph theory.
The purpose and benefit of such a view is that
scientists can understand and select data mining
tools to extract information from many-to-many
relationships (graphs) found in their research
domain more effectively.
Since Data Mining is a relatively new field,
many researchers are hard pressed to select the
proper method or tool to mine their data. For
example, in bioinformatics, clustering is the tool
of choice, even when the end result of the
application is a classification [4,5]. In Software
Engineering, many interactions between entity
pairs (e.g., Cross-checking of execution traces,
fault-localization analyses, etc. [1,2,3]) are
analyzed without the benefit of an understanding
of the relationship between, for example,
association rule mining and classification
(researchers mine for association rules when the
problem may be more appropriately cast as a
classification problem). In precision agriculture,
when data mining is used at all to predict yield, it
is, again, cast as an association rule mining
problem when it is more appropriately a
classification problem ([6]). It is hoped that this
paper will assist applied scientists in
understanding data mining tools better and allow
them to better select from the data mining tool
suite.
Keywords: data mining, clustering,
classification, association rule mining, graph,
bipartite, unipartite
Introduction
Data mining is a vast field which began as
Artificial Intelligence and Machine Learning and
has more recently developed an independent
existence. The concept of data mining, as
opposed to machine learning and artificial
intelligence, concentrates on databases rather
than single data sets. As presently
constituted, data mining includes the three
subareas of Association Rule Mining (ARM),
Clustering (also called unsupervised machine
learning) and Classification (also called
supervised learning). The purpose of this paper
is to provide a new view of all three areas of data
mining under the heading of graph analysis and
to put them in a proper context with each other
(currently they are treated as separate and often
unrelated method categories).
Association Rule Mining (ARM), Classification
and Clustering can be unified under the concept
of relationship analysis or graph analysis, which
is just a dual concept in the sense that any many-to-many relationship (the most general kind)
generates an undirected graph and vice versa.
The method of implementation of the data sets
(and therefore the methods of processing those
data sets) can vary from horizontal (the standard
approach) to the more recently developing
vertical [7,8,9] approach. Implementation is an
orthogonal issue (e.g., involving
performance, scalability, etc.) which is beyond
the scope of this paper.
In this paper, we give a unified presentation of
the three main areas of data mining, namely
Association Rule Mining, Supervised Machine
Learning (Classification) and Unsupervised
Machine Learning (Clustering), in graph theory
to benefit scientists who need to understand and
select data mining tools to extract information
from any interaction data found in their research
domains. Since data mining is a relatively new
field many of these researchers are hard pressed
to select methods to mine their data. In
bioinformatics clustering is the tool of choice,
even when the end result of the application is a
classification [4,5]. In Software Engineering,
many interactions between entity pairs (e.g.,
cross-checking of execution traces, fault-localization analyses, etc. [1,2,3]) are analyzed
without the benefit of an understanding of the
relationship between, for example, association
rule mining and classification (researchers mine
for association rules when the problem may be
more appropriately cast as a classification
problem). In precision agriculture, when data
mining is used at all to predict yield, it is, again,
cast as an association rule mining problem when
it is more appropriately a classification problem
([6]).
Unified Theory
We begin with a review of several necessary
concepts. A DEGREE=2 UNIPARTITE
relationship (between an entity, N, and itself)
can be modeled as an

EdgeSet of G=(N,E):
E(N,N) = { {ek,1, ek,2} | ek,1, ek,2 ∈ N, k=1,…,|E| }

This is the standard way of representing graph
data, namely by simply listing the set of edges
that make up the graph. The same graph can be
modeled (more efficiently?) as an

Index on E(N,N):
E(N,Nset) = { (n, Nsetn) | n ∈ N }, where
Nsetn ≡ the set of nodes related to n.

Then, if there are many edges, it may be more
efficient to bit-map the terminal edges. We note
that any time one has a list of entities from an
entity type, one can either list them or bit-map
them (bit-mapping requires a position
assignment function to assign a bit position to
each potential entity instance):

BitMap Index on E(N,N):
E(N,Nmap) = { (n, Nmapn) | n ∈ N }, where
Nmapn ≡ the map of nodes related to n, that is,
Nmapn(k) = 1 iff {n, f(k)} ∈ E, where
f:{1,…,|N|} → N is the position assignment table
for N.

Next, we give a simple example to make these
three alternatives clear. Let G=(N,E) with
N = {n1, n2, n3, n4} and

E(N,N):
(n1,n2) (n1,n3) (n1,n4)
(n2,n1) (n2,n2) (n2,n4)
(n3,n1) (n3,n4)
(n4,n1) (n4,n2) (n4,n3)

then

E(N,ES) (N, Nset):
n1 {n2, n3, n4}
n2 {n1, n2, n4}
n3 {n1, n4}
n4 {n1, n2, n3}

and

E(N,EM) (N, Nmap):
n1 0111
n2 1101
n3 1001
n4 1110

assuming the position assignment function is
f(i) = ni. One sees immediately that each
alternative in turn can be much more space
efficient.
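The three alternatives can be sketched in code. This is our own illustration, not from the paper; the variable names are ours.

```python
# Sketch of the three representations of the example graph
# G=(N,E) with N = {n1, n2, n3, n4}.

# 1. EdgeSet: simply list the edges.
edge_set = [("n1", "n2"), ("n1", "n3"), ("n1", "n4"),
            ("n2", "n1"), ("n2", "n2"), ("n2", "n4"),
            ("n3", "n1"), ("n3", "n4"),
            ("n4", "n1"), ("n4", "n2"), ("n4", "n3")]

# 2. Index E(N, Nset): for each node, the set of related nodes.
nset = {}
for a, b in edge_set:
    nset.setdefault(a, set()).add(b)

# 3. BitMap index E(N, Nmap), with position assignment f(i) = n_i.
nodes = ["n1", "n2", "n3", "n4"]          # f(1)=n1, f(2)=n2, ...
nmap = {n: "".join("1" if m in nset.get(n, set()) else "0" for m in nodes)
        for n in nodes}

print(sorted(nset["n3"]))   # ['n1', 'n4']
print(nmap["n2"])           # 1101
```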
Using the ubiquitous horizontal implementation
approach (files of horizontally structured
records), one implements either E(N, Nset) or
E(N, Nmap) as shown pictorially above (usually
E(N,EM) so that it is in first normal form).
However, using a vertical approach, one can
implement E(N, Nset) as a set of vertical bit
vectors by using some bit encoding. Standard
encoding is just "bit slice encoding" and then the
resulting bit vectors can be compressed into
P-trees [7,8,9] of any particular dimension for
efficiency. We show the uncompressed bit slice
vertical implementation of the graph next. Given
E(N,EM) (N, Nmap)
n1 0111
n2 1101
n3 1001
n4 1110
the vertical bit slices are

n1-slice: 0 1 1 1
n2-slice: 1 1 0 1
n3-slice: 1 0 0 1
n4-slice: 1 1 1 0

(the k-th slice is the k-th bit of each of the
maps of n1, n2, n3, n4, read top to bottom; for
this symmetric example the slices happen to
equal the rows).
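As a sketch (our own, assuming the bitmaps are held as bit strings), the vertical slicing amounts to reading columns of the horizontal bitmap rows:

```python
# Derive uncompressed vertical bit slices from the horizontal
# bitmap rows of the example graph.
rows = {"n1": "0111", "n2": "1101", "n3": "1001", "n4": "1110"}

order = ["n1", "n2", "n3", "n4"]
# slice k is the k-th bit of every row, read top to bottom
slices = {f"{order[k]}-slice": "".join(rows[r][k] for r in order)
          for k in range(4)}

# For this symmetric relation the slices equal the rows, so e.g.
print(slices["n1-slice"])   # 0111
```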
Given a DEGREE=2 BIPARTITE
RELATIONSHIP, e.g., between entities, T and I,
where N = T ∪ I (disjoint union), there are five
representations (with the more efficient position
assignment functions fT:{1,…,|T|} → T and
fI:{1,…,|I|} → I). The set of edges is the same,

E(T,I) = { {t,i} | t∈T and i∈I and {t,i}∈E },

but the T-index is,

E(T,Iset) = { (t, Isett) | t∈T },
Isett ≡ { i | {t,i}∈E }
(the set of i's related to t),

and the T-bitmap index is,

E(T,Imap) = { (t, Imapt) | t∈T }, Imapt(k)=1 iff
{t, fI(k)}∈E (the map of i's related to t),

and the I-index is,

E(I,Tset) = { (i, Tseti) | i∈I },
Tseti ≡ { t | {t,i}∈E }
(the set of t's related to i),

and the I-bitmap index is,

E(I,Tmap) = { (i, Tmapi) | i∈I }, Tmapi(k)=1 iff
{fT(k), i}∈E (the map of t's related to i).

The graph definitions above can be extended to
relationships of degree > 2, but doing so is
beyond the needs of this paper since we can
already unify data mining with degree=2 graphs.

Next we briefly recall the dualities that exist
among the concepts of an equivalence relation, a
partition and a label function on a set, N. A
Degree=2 UNIPARTITE RELATIONSHIP on N
that is also

reflexive ( (x,x)∈E ∀x∈N ),
symmetric ( (x,y)∈E ⇒ (y,x)∈E ), and
transitive ( (x,y),(y,z)∈E ⇒ (x,z)∈E )

is an EQUIVALENCE RELATION. A
CLUSTERING or PARTITION of N is a dual
formulation of an equivalence relation on N,
since the partition, {Ci}, into equivalence classes
is a clustering and vice versa. A LABEL
FUNCTION, L:{Ci} → Labels (assuming the
cluster components, the Ci's, are labeled by their
IDs) is also dual to a clustering or partition on N
in the sense that the pre-image partition,
{L-1(Lk)}, is a clustering and vice versa.
CLASSIFYING s∈S(A1,…,An) using the
training set, R(A1,…,An,L), is just a matter of
identifying the best R→R[L] pre-image cluster
for s based on R.

For completeness, we note here that a Degree=2
UNIPARTITE RELATIONSHIP on N which
also satisfies the properties,

reflexivity ( (x,x)∈E ∀x∈N ),
anti-symmetry ( (x,y)∈E and (y,x)∈E ⇒ y=x ), and
transitivity ( (x,y),(y,z)∈E ⇒ (x,z)∈E ),

is a PARTIAL ORDERING (a dual formulation
of a partial ordering is a directed unipartite
graph on N).
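The five bipartite representations can be sketched as follows; the toy entities T = {t1,t2}, I = {i1,i2,i3} and the variable names are our own illustration:

```python
# Five representations of a small bipartite relationship
# between T = {t1, t2} and I = {i1, i2, i3}.
E = [("t1", "i1"), ("t1", "i3"), ("t2", "i2"), ("t2", "i3")]  # 1. edge set

T, I = ["t1", "t2"], ["i1", "i2", "i3"]   # position assignments fT, fI

# 2. T-index: for each t, the set of i's related to t.
iset = {t: {i for (a, i) in E if a == t} for t in T}
# 3. I-index: for each i, the set of t's related to i.
tset = {i: {t for (t, b) in E if b == i} for i in I}
# 4. T-bitmap index: bit k of Imap_t is 1 iff {t, fI(k)} in E.
imap = {t: "".join("1" if i in iset[t] else "0" for i in I) for t in T}
# 5. I-bitmap index: bit k of Tmap_i is 1 iff {fT(k), i} in E.
tmap = {i: "".join("1" if t in tset[i] else "0" for t in T) for i in I}

print(imap["t1"])   # 101
print(tmap["i3"])   # 11
```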
Next we move to the notion of a Degree=2
BIPARTITE RELATIONSHIP on N = T ∪ I
(disjoint union). A degree two bipartite
relationship or graph on N = T ∪ I generates

I-Association Rules (I-ARs), A→C,
A,C ⊆ I, with A∩C=∅ (disjoint I-sets)

and

T-Association Rules (T-ARs), A→C,
A,C ⊆ T, with A∩C=∅ (disjoint T-sets).

There are measures of quality of association
rules. The main two are:

an I-AR, A→C, is T-frequent iff
T-SUPPORT(A∪C) ≥ MINSUPP

and

an I-AR, A→C, is T-confident iff
T-SUPPORT(A∪C) / T-SUPPORT(A) ≥
MINCONF, where

T-SUPPORT(A) ≡ |{ t | {t,i}∈E ∀i∈A }|,
T-SUPPORT(A∪C) ≡ |{ t | {t,i}∈E ∀i∈A∪C }|, and
MINSUPP and MINCONF are user-chosen
parameters.

Likewise, a T-AR, A→C, is I-frequent iff
I-SUPPORT(A∪C) ≥ MINSUPP, and a
T-AR, A→C, is I-confident iff
I-SUPPORT(A∪C) / I-SUPPORT(A) ≥
MINCONF.
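A minimal sketch of T-support and T-confidence of an I-association rule; the toy edge set and function name are our assumptions, not the paper's:

```python
# T-support and T-confidence of an I-association rule A -> C
# over a toy bipartite edge set.
E = {("t1", "i1"), ("t1", "i2"), ("t2", "i1"), ("t2", "i2"), ("t3", "i1")}
T = {"t1", "t2", "t3"}

def t_support(iset):
    """T-SUPPORT(iset) = |{ t | {t,i} in E for every i in iset }|."""
    return sum(all((t, i) in E for i in iset) for t in T)

A, C = {"i1"}, {"i2"}            # disjoint I-sets
supp = t_support(A | C)          # T-SUPPORT(A u C) = 2 (t1 and t2)
conf = supp / t_support(A)       # 2 / 3

# The rule is T-frequent iff supp >= MINSUPP and
# T-confident iff conf >= MINCONF.
```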
In many application areas, interactions have
other feature attributes. We can accommodate
feature attributes of entities and relationships as
node labels and edge labels respectively. Any
graph G=(N,E) can have both Node labels and
Edge Labels (possibly only their names or
identifiers). In general we assume node and edge
labels are structures (as complex as is needed to
capture the semantics of the application under
analysis).
Distance and similarity functions are important
in most clustering applications. A distance
function, d, on N can be modeled as a
non-negative real-valued edge label function on
the graph (N,E) with E = N×N, subject to the
conditions:

positive definite: ∀x,y∈N, d(x,y) ≥ 0 and
d(x,y)=0 iff x=y;
symmetric: ∀x,y∈N, d(x,y) = d(y,x);
triangle inequality: ∀x,y,z∈N,
d(x,y) + d(y,z) ≥ d(x,z).
A similarity function, s, on N measures closeness
rather than distance. The notions of a distance
function and a similarity function are certainly
dual; however, there really is no one canonical
duality transformation between these two notions
(in fact, there are many). However, in one
particularly important case, the case of
bit or Boolean data, the only distance is
Hamming distance (in the sense that all Lp
distances collapse to Hamming) and the
Hamming similarity really has only one
definition also. The definitions are as follows:

Hamming Distance on a Boolean Table,
R(A1..An), is dH(x,y) = |{ i | xi ≠ yi }|
(the count of bit
positions where x and y differ), and

Hamming Similarity on a Boolean Table,
R(A1..An), is sH(x,y) = |{ i | xi = yi }|
(the count of bit positions
where x and y are the same).
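These two definitions can be sketched directly (a straightforward illustration of the counts above):

```python
# Hamming distance and Hamming similarity on rows of a Boolean
# table R(A1..An), matching the definitions above.
def hamming_distance(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))   # positions that differ

def hamming_similarity(x, y):
    return sum(xi == yi for xi, yi in zip(x, y))   # positions that agree

x, y = [1, 0, 1, 1], [1, 1, 0, 1]
# distance and similarity always sum to the number of bit positions
assert hamming_distance(x, y) + hamming_similarity(x, y) == len(x)
print(hamming_distance(x, y))   # 2
```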
In general, for a DEGREE=2 BIPARTITE
EDGE-LABELLED GRAPH with edge label
function, l:E→EL (letting lt,i ≡ l(t,i)),

E(T,I,EL) = { (t,i,lt,i) | {t,i}∈E },

and

E(T,I-ELset) = { (t, I-ELsett) | t∈T },
where I-ELsett ≡ the set of (i, lt,i)
pairs : {t,i}∈E,

and

E(T,I-ELmap) = { (t, I-ELmapt) | t∈T },
where I-ELmapt(k,b)=1 iff
{t, fI(k)}∈E and the 2^b bit of l{t,fI(k)} is a 1 bit,

and

E(I,T-ELset) = { (i, T-ELseti) | i∈I },
where T-ELseti ≡ the set of (t, lt,i)
pairs : {t,i}∈E,

and

E(I,T-ELmap) = { (i, T-ELmapi) | i∈I },
where T-ELmapi(k,b)=1 iff
{fT(k), i}∈E and the 2^b bit of l{fT(k),i} is a 1 bit.

One can similarly define DEGREE=2
BIPARTITE NODE-LABELLED GRAPHs with
node label functions, lT:T→TL and lI:I→IL
(letting lTt ≡ lT(t) and lIi ≡ lI(i)).
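The label-bit test used by the edge-label bitmaps can be sketched; the integer labels, toy edges, and helper name here are our assumptions:

```python
# An edge-labelled bipartite graph E(T,I,EL) with integer edge
# labels, and the bit test used by I-ELmap:
# I-ELmap_t(k,b) = 1 iff {t, fI(k)} in E and bit 2^b of l(t,i) is 1.
edges = {("t1", "i1"): 5, ("t1", "i2"): 2}   # edge labels l(t,i)

def el_bit(t, i, b):
    """1 iff {t,i} in E and the 2^b bit of l(t,i) is 1."""
    l = edges.get((t, i))
    return 0 if l is None else (l >> b) & 1

print(el_bit("t1", "i1", 0))   # 1: label 5 = 0b101, bit 0 is set
print(el_bit("t1", "i2", 0))   # 0: label 2 = 0b010, bit 0 is clear
```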
Conclusion
A unifying theory of the three areas of data
mining, association rule mining, clustering and
classification, is given within the theory of uni-
and bi-partite graphs. With this unified theory,
an application scientist is able to see more
clearly which area of data mining is best for his
or her application needs. Given any interaction
between entities, there is a graph defined by that
interaction. The graph may be undirected or
directed and may have elaborately structured
node and edge labels. In any case, one can mine
for information in various ways. If the
interaction is uni-partite of degree two, then it
can be mined through clustering and
classification. If the interaction is bi-partite and
of degree two, it can be mined for association
rules in two distinct ways, depending upon
which of the bi-partite entities is selected to form
the antecedent and consequent rule components.
Even in the case of a bi-partite interaction, the
data can be clustered or classified by defining a
similarity function on one of the bi-partite
entities according to the signal it generates in the
other bi-partite entity. This can also be done in
two ways, depending upon which of the
bi-partite entities is clustered.
In bioinformatics, for example, the interaction
between genes and experiments is studied. This
is a bipartite interaction but it is almost always
studied in terms of the two uni-partite graphs
generated by focusing on just one of the entity
types. Seldom is it realized that the full bipartite
interaction relationship can be mined for
association rules in two ways and that in so
doing, truly spectacular rule relationships may
well emerge. Instead, bioinformaticists settle
for dual clustering (usually for the purpose of
classification or annotation). They
cluster genes in terms of similarities in their
experiment signals and they cluster experiments
in terms of their gene expression signals. They
do not mine for the strong association rules that
might lie hidden in the bipartite interaction data.
This association rule information may well hold
the key to unlocking the mysteries of biological
pathway understanding.
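The two dual clusterings described above can be sketched on toy data; the trivial identical-signal grouping below stands in for a real clustering algorithm, and the gene/experiment names are invented:

```python
# The same bipartite gene-experiment matrix can be clustered two
# ways: genes by their experiment signals, or experiments by their
# gene signals (the transpose).
genes = {"g1": (1, 0, 1), "g2": (1, 0, 1), "g3": (0, 1, 0)}  # cols: e1..e3

def group_identical(vectors):
    """Trivial clustering: group entities with identical signals."""
    clusters = {}
    for name, v in vectors.items():
        clusters.setdefault(v, []).append(name)
    return list(clusters.values())

gene_clusters = group_identical(genes)          # cluster the genes

# transpose: each experiment described by its signal across g1..g3
exps = {f"e{j+1}": tuple(genes[g][j] for g in ("g1", "g2", "g3"))
        for j in range(3)}
exp_clusters = group_identical(exps)            # cluster the experiments

print(gene_clusters)   # [['g1', 'g2'], ['g3']]
print(exp_clusters)    # [['e1', 'e3'], ['e2']]
```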
References
[1] James A. Jones and Mary Jean Harrold,
"Empirical Evaluation of the Tarantula
Automatic Fault-Localization Technique", ACM
ASE Conference, November 2005, Long Beach,
CA, pp. 273-282.
[2] James A. Jones, Mary Jean Harrold and
John Stasko, "Visualization of Test Information
to Assist Fault Localization", ACM ICSE
Conference, May 2002, Orlando, FL,
pp. 467-477.
[3] Tristan Denmat, Mireille Ducasse and
Olivier Ridoux, "Data Mining and
Cross-checking of Execution Traces", ACM
ASE Conference, November 2005, Long Beach,
CA, pp. 396-399.
[4] Dan E. Krane and Michael L. Raymer,
Fundamental Concepts of Bioinformatics,
Benjamin Cummings Publishing Company,
2003.
[5] Andreas D. Baxevanis and B. F. Francis
Ouellette, Bioinformatics: A Practical Guide to
the Analysis of Genes and Proteins, Third
Edition, Wiley Publishing Company, 2005.
[6] Stu Pocknee, "Analyzing Data for Precision
Agriculture", InfoAg Conference, August 2001,
Indianapolis, IN.
[7] Imad Rahal and William Perrizo, "A
Scalable Vertical Mining of Association Rules",
Journal of Information and Knowledge
Management, World Scientific, V3:4, December
2004,
http://www.worldscinet.com/03/0304/S02196492040304.html
[8] Imad Rahal and William Perrizo, "A
Predicate-tree Based Framework for
Accelerating Multilevel Secure Database
Queries", ISCA International Journal of
Computer Applications, March 2005.
[9] Qiang Ding and William Perrizo, "Cluster
Analysis of Spatial Data Using Peano Count
Trees", Information: An International Journal,
V7:1, pp. 15-26, 2004.