A UNIFIED THEORY OF DATA MINING BASED ON
UNIPARTITE AND BIPARTITE GRAPHS
William Perrizo
Computer Science Department
North Dakota State University
IACC 258, Suite A15
Fargo, ND 58105-5164
[email protected]
Abstract
All Data Mining can be unified under the
concept of relationship analysis (graph analysis
is a dual concept). Horizontal versus vertical
implementation is an orthogonal issue (e.g.,
involving performance, scalability, etc.),
which we have discussed in other forums and is
beyond the scope of this paper. We give a
unified presentation of the three main areas of
data mining, namely Association Rule Mining,
Supervised Machine Learning (Classification)
and Unsupervised Machine Learning
(Clustering), in graph theory.
The purpose and benefit of such a view is that
scientists can understand and select data mining
tools to extract information from many-to-many
relationships (graphs) found in their research
domain more effectively.
Since Data Mining is a relatively new field,
many researchers are hard pressed to select the
proper method or tool to mine their data. For
example, in bioinformatics, clustering is the tool
of choice, even when the end result of the
application is a classification [4,5]. In Software
Engineering, many interactions between entity
pairs (e.g., Cross-checking of execution traces,
fault-localization analyses, etc. [1,2,3]) are
analyzed without the benefit of an understanding
of the relationship between, for example,
association rule mining and classification
(researchers mine for association rules when the
problem may be more appropriately cast as a
classification problem). In precision agriculture,
when data mining is used at all to predict yield, it
is, again, cast as an association rule mining
problem when it is more appropriately a
classification problem ([6]). It is hoped that this
paper will assist applied scientists in
understanding data mining tools better and allow
them to better select from the data mining tool
suite.
Keywords: data mining, clustering,
classification, association rule mining, graph,
bipartite, unipartite
Introduction
Data mining is a vast field which began as
Artificial Intelligence and Machine Learning and
has more recently developed an independent
existence. The concept of data mining, as
opposed to machine learning and artificial
intelligence, concentrates on databases rather
than single data sets. As presently
constituted, data mining includes the three
subareas of Association Rule Mining (ARM),
Clustering (also called unsupervised machine
learning) and Classification (also called
supervised learning). The purpose of this paper
is to provide a new view of all three areas of data
mining under the heading of graph analysis and
to put them in a proper context with each other
(currently they are treated as separate and often
unrelated method categories).
Association Rule Mining (ARM), Classification
and Clustering can be unified under the concept
of relationship analysis or graph analysis, which
is just a dual concept in the sense that any many-to-many relationship (the most general kind)
generates an undirected graph and vice versa.
The method of implementation of the data sets
(and therefore the methods of processing those
data sets) can vary from horizontal (the standard
approach) to the more recently developing
vertical [7,8,9] approach. Implementation is an
orthogonal issue (e.g., involving
performance, scalability, etc.) which is beyond
the scope of this paper.
In this paper, we give a unified presentation of
the three main areas of data mining, namely
Association Rule Mining, Supervised Machine
Learning (Classification) and Unsupervised
Machine Learning (Clustering), in graph theory
to benefit scientists who need to understand and
select data mining tools to extract information
from any interaction data found in their research
domains. Since data mining is a relatively new
field many of these researchers are hard pressed
to select methods to mine their data. In
bioinformatics clustering is the tool of choice,
even when the end result of the application is a
classification [4,5]. In Software Engineering,
many interactions between entity pairs (e.g.,
cross-checking of execution traces, fault-localization analyses, etc. [1,2,3]) are analyzed
without the benefit of an understanding of the
relationship between, for example, association
rule mining and classification (researchers mine
for association rules when the problem may be
more appropriately cast as a classification
problem). In precision agriculture, when data
mining is used at all to predict yield, it is, again,
cast as an association rule mining problem when
it is more appropriately a classification problem
([6]).
Unified Theory
We begin with a review of several necessary
concepts. A DEGREE=2 UNIPARTITE
relationship (between an entity, N, and itself)
can be modeled as an

EdgeSet of G=(N,E):
E(N,N) = { {ek,1, ek,2} | ek,1, ek,2 ∈ N, k=1,…,|E| }

This is the standard way of representing graph
data, namely by simply listing the set of edges
that make up the graph. The same graph can be
modeled (more efficiently?) as an

Index on E(N,N):
E(N,Nset) = { (n, Nsetn) | n ∈ N }, where
Nsetn ≡ the set of nodes related to n.

Then, if there are many edges, it may be more
efficient to bit-map the terminal edges. We note
that any time one has a list of entities from an
entity type, one can either list them or bit-map
them (bit-mapping requires a position
assignment function to assign a bit position to
each potential entity instance):

BitMap Index on E(N,N):
E(N,Nmap) = { (n, Nmapn) | n ∈ N }, where
Nmapn ≡ the map of nodes related to n, that is,
Nmapn(k) = 1 iff {n, f(k)} ∈ E, where
f:{1,…,|N|} → N is the position assignment table
for N.

Next, we give a simple example to make these
three alternatives clear. Let G=(N,E) with
N = {n1, n2, n3, n4} and

E(N,N):
(n1,n2) (n1,n3) (n1,n4)
(n2,n1) (n2,n2) (n2,n4)
(n3,n1) (n3,n4)
(n4,n1) (n4,n2) (n4,n3)

then

E(N,ES) (N, Nset):
n1 {n2, n3, n4}
n2 {n1, n2, n4}
n3 {n1, n4}
n4 {n1, n2, n3}

and

E(N,EM) (N, Nmap):
n1 0111
n2 1101
n3 1001
n4 1110

assuming the position assignment function is
f(i) = ni. One sees immediately that each
alternative in turn can be much more space
efficient.
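The three alternatives can be sketched in code. This is our own illustration, not from the paper; the variable names are ours.

```python
# Sketch of the three representations of the example graph
# G=(N,E) with N = {n1, n2, n3, n4}.

# 1. EdgeSet: simply list the edges.
edge_set = [("n1", "n2"), ("n1", "n3"), ("n1", "n4"),
            ("n2", "n1"), ("n2", "n2"), ("n2", "n4"),
            ("n3", "n1"), ("n3", "n4"),
            ("n4", "n1"), ("n4", "n2"), ("n4", "n3")]

# 2. Index E(N, Nset): for each node, the set of related nodes.
nset = {}
for a, b in edge_set:
    nset.setdefault(a, set()).add(b)

# 3. BitMap index E(N, Nmap), with position assignment f(i) = n_i.
nodes = ["n1", "n2", "n3", "n4"]          # f(1)=n1, f(2)=n2, ...
nmap = {n: "".join("1" if m in nset.get(n, set()) else "0" for m in nodes)
        for n in nodes}

print(sorted(nset["n3"]))   # ['n1', 'n4']
print(nmap["n2"])           # 1101
```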
Using the ubiquitous horizontal implementation
approach (files of horizontally structured
records), one implements either E(N, Nset) or
E(N, Nmap) as shown pictorially above (usually
E(N,EM) so that it is in first normal form).
However, using a vertical approach, one can
implement E(N, Nset) as a set of vertical bit
vectors by using some bit encoding. Standard
encoding is just "bit slice encoding" and then the
resulting bit vectors can be compressed into
P-trees [7,8,9] of any particular dimension for
efficiency. We show the uncompressed bit slice
vertical implementation of the graph next. Given
E(N,EM) (N, Nmap)
n1 0111
n2 1101
n3 1001
n4 1110
the vertical bit slices are

n1-slice: 0 1 1 1
n2-slice: 1 1 0 1
n3-slice: 1 0 0 1
n4-slice: 1 1 1 0

(the k-th slice is the k-th bit of each of the
maps of n1, n2, n3, n4, read top to bottom; for
this symmetric example the slices happen to
equal the rows).
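As a sketch (our own, assuming the bitmaps are held as bit strings), the vertical slicing amounts to reading columns of the horizontal bitmap rows:

```python
# Derive uncompressed vertical bit slices from the horizontal
# bitmap rows of the example graph.
rows = {"n1": "0111", "n2": "1101", "n3": "1001", "n4": "1110"}

order = ["n1", "n2", "n3", "n4"]
# slice k is the k-th bit of every row, read top to bottom
slices = {f"{order[k]}-slice": "".join(rows[r][k] for r in order)
          for k in range(4)}

# For this symmetric relation the slices equal the rows, so e.g.
print(slices["n1-slice"])   # 0111
```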
Given a DEGREE=2 BIPARTITE
RELATIONSHIP, e.g., between entities, T and I,
where N = T ∪ I (disjoint union), there are five
representations (with the more efficient position
assignment functions fT:{1,…,|T|} → T and
fI:{1,…,|I|} → I). The set of edges is the same,

E(T,I) = { {t,i} | t∈T and i∈I and {t,i}∈E },

but the T-index is,

E(T,Iset) = { (t, Isett) | t∈T },
Isett ≡ { i | {t,i}∈E }
(the set of i's related to t),

and the T-bitmap index is,

E(T,Imap) = { (t, Imapt) | t∈T }, Imapt(k)=1 iff
{t, fI(k)}∈E (the map of i's related to t),

and the I-index is,

E(I,Tset) = { (i, Tseti) | i∈I },
Tseti ≡ { t | {t,i}∈E }
(the set of t's related to i),

and the I-bitmap index is,

E(I,Tmap) = { (i, Tmapi) | i∈I }, Tmapi(k)=1 iff
{fT(k), i}∈E (the map of t's related to i).

The graph definitions above can be extended to
relationships of degree > 2, but doing so is
beyond the needs of this paper since we can
already unify data mining with degree=2 graphs.

Next we briefly recall the dualities that exist
among the concepts of an equivalence relation, a
partition and a label function on a set, N. A
Degree=2 UNIPARTITE RELATIONSHIP on N
that is also

reflexive ( (x,x)∈E ∀x∈N ),
symmetric ( (x,y)∈E ⇒ (y,x)∈E ), and
transitive ( (x,y),(y,z)∈E ⇒ (x,z)∈E )

is an EQUIVALENCE RELATION. A
CLUSTERING or PARTITION of N is a dual
formulation of an equivalence relation on N,
since the partition, {Ci}, into equivalence classes
is a clustering and vice versa. A LABEL
FUNCTION, L:{Ci} → Labels (assuming the
cluster components, the Ci's, are labeled by their
IDs) is also dual to a clustering or partition on N
in the sense that the pre-image partition,
{L-1(Lk)}, is a clustering and vice versa.
CLASSIFYING s∈S(A1,…,An) using the
training set, R(A1,…,An,L), is just a matter of
identifying the best R→R[L] pre-image cluster
for s based on R.

For completeness, we note here that a Degree=2
UNIPARTITE RELATIONSHIP on N which
also satisfies the properties,

reflexivity ( (x,x)∈E ∀x∈N ),
anti-symmetry ( (x,y)∈E and (y,x)∈E ⇒ y=x ), and
transitivity ( (x,y),(y,z)∈E ⇒ (x,z)∈E ),

is a PARTIAL ORDERING (a dual formulation
of a partial ordering is a directed unipartite
graph on N).
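The five bipartite representations can be sketched as follows; the toy entities T = {t1,t2}, I = {i1,i2,i3} and the variable names are our own illustration:

```python
# Five representations of a small bipartite relationship
# between T = {t1, t2} and I = {i1, i2, i3}.
E = [("t1", "i1"), ("t1", "i3"), ("t2", "i2"), ("t2", "i3")]  # 1. edge set

T, I = ["t1", "t2"], ["i1", "i2", "i3"]   # position assignments fT, fI

# 2. T-index: for each t, the set of i's related to t.
iset = {t: {i for (a, i) in E if a == t} for t in T}
# 3. I-index: for each i, the set of t's related to i.
tset = {i: {t for (t, b) in E if b == i} for i in I}
# 4. T-bitmap index: bit k of Imap_t is 1 iff {t, fI(k)} in E.
imap = {t: "".join("1" if i in iset[t] else "0" for i in I) for t in T}
# 5. I-bitmap index: bit k of Tmap_i is 1 iff {fT(k), i} in E.
tmap = {i: "".join("1" if t in tset[i] else "0" for t in T) for i in I}

print(imap["t1"])   # 101
print(tmap["i3"])   # 11
```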
Next we move to the notion of a Degree=2
BIPARTITE RELATIONSHIP on N = T ∪ I
(disjoint union). A degree two bipartite
relationship or graph on N = T ∪ I generates

I-Association Rules (I-ARs), A→C,
A,C ⊆ I, with A∩C=∅ (disjoint I-sets)

and

T-Association Rules (T-ARs), A→C,
A,C ⊆ T, with A∩C=∅ (disjoint T-sets).

There are measures of quality of association
rules. The main two are:

an I-AR, A→C, is T-frequent iff
T-SUPPORT(A∪C) ≥ MINSUPP

and

an I-AR, A→C, is T-confident iff
T-SUPPORT(A∪C) / T-SUPPORT(A) ≥
MINCONF, where

T-SUPPORT(A) ≡ |{ t | {t,i}∈E ∀i∈A }|,
T-SUPPORT(A∪C) ≡ |{ t | {t,i}∈E ∀i∈A∪C }|, and
MINSUPP and MINCONF are user-chosen
parameters.

Likewise, a T-AR, A→C, is I-frequent iff
I-SUPPORT(A∪C) ≥ MINSUPP, and a
T-AR, A→C, is I-confident iff
I-SUPPORT(A∪C) / I-SUPPORT(A) ≥
MINCONF.
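A minimal sketch of T-support and T-confidence of an I-association rule; the toy edge set and function name are our assumptions, not the paper's:

```python
# T-support and T-confidence of an I-association rule A -> C
# over a toy bipartite edge set.
E = {("t1", "i1"), ("t1", "i2"), ("t2", "i1"), ("t2", "i2"), ("t3", "i1")}
T = {"t1", "t2", "t3"}

def t_support(iset):
    """T-SUPPORT(iset) = |{ t | {t,i} in E for every i in iset }|."""
    return sum(all((t, i) in E for i in iset) for t in T)

A, C = {"i1"}, {"i2"}            # disjoint I-sets
supp = t_support(A | C)          # T-SUPPORT(A u C) = 2 (t1 and t2)
conf = supp / t_support(A)       # 2 / 3

# The rule is T-frequent iff supp >= MINSUPP and
# T-confident iff conf >= MINCONF.
```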
In many application areas, interactions have
other feature attributes. We can accommodate
feature attributes of entities and relationships as
node labels and edge labels respectively. Any
graph G=(N,E) can have both Node labels and
Edge Labels (possibly only their names or
identifiers). In general we assume node and edge
labels are structures (as complex as is needed to
capture the semantics of the application under
analysis).
Distance and similarity functions are important
in most clustering applications. A distance
function, d, on N can be modeled as a
non-negative real-valued edge label function on
the graph (N,E) with E = N×N, subject to the
conditions:

positive definite: ∀x,y∈N, d(x,y) ≥ 0 and
d(x,y)=0 iff x=y;
symmetric: ∀x,y∈N, d(x,y) = d(y,x);
triangle inequality: ∀x,y,z∈N,
d(x,y) + d(y,z) ≥ d(x,z).
A similarity function, s, on N measures closeness
rather than distance. The notions of a distance
function and a similarity function are certainly
dual; however, there really is no one canonical
duality transformation between these two notions
(in fact, there are many). However, in one
particularly important case, the case of
bit or Boolean data, the only distance is
Hamming distance (in the sense that all Lp
distances collapse to Hamming) and the
Hamming similarity really has only one
definition also. The definitions are as follows:

Hamming Distance on a Boolean Table,
R(A1..An), is dH(x,y) = |{ i | xi ≠ yi }|
(the count of bit
positions where x and y differ), and

Hamming Similarity on a Boolean Table,
R(A1..An), is sH(x,y) = |{ i | xi = yi }|
(the count of bit positions
where x and y are the same).
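These two definitions can be sketched directly (a straightforward illustration of the counts above):

```python
# Hamming distance and Hamming similarity on rows of a Boolean
# table R(A1..An), matching the definitions above.
def hamming_distance(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))   # positions that differ

def hamming_similarity(x, y):
    return sum(xi == yi for xi, yi in zip(x, y))   # positions that agree

x, y = [1, 0, 1, 1], [1, 1, 0, 1]
# distance and similarity always sum to the number of bit positions
assert hamming_distance(x, y) + hamming_similarity(x, y) == len(x)
print(hamming_distance(x, y))   # 2
```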
In general, for a DEGREE=2 BIPARTITE
EDGE-LABELLED GRAPH with edge label
function, l:E→EL (letting lt,i ≡ l(t,i)),

E(T,I,EL) = { (t,i,lt,i) | {t,i}∈E },

and

E(T,I-ELset) = { (t, I-ELsett) | t∈T },
where I-ELsett ≡ the set of (i, lt,i)
pairs : {t,i}∈E,

and

E(T,I-ELmap) = { (t, I-ELmapt) | t∈T },
where I-ELmapt(k,b)=1 iff
{t, fI(k)}∈E and the 2^b bit of l{t,fI(k)} is a 1 bit,

and

E(I,T-ELset) = { (i, T-ELseti) | i∈I },
where T-ELseti ≡ the set of (t, lt,i)
pairs : {t,i}∈E,

and

E(I,T-ELmap) = { (i, T-ELmapi) | i∈I },
where T-ELmapi(k,b)=1 iff
{fT(k), i}∈E and the 2^b bit of l{fT(k),i} is a 1 bit.

One can similarly define DEGREE=2
BIPARTITE NODE-LABELLED GRAPHs with
node label functions, lT:T→TL and lI:I→IL
(letting lTt ≡ lT(t) and lIi ≡ lI(i)).
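The label-bit test used by the edge-label bitmaps can be sketched; the integer labels, toy edges, and helper name here are our assumptions:

```python
# An edge-labelled bipartite graph E(T,I,EL) with integer edge
# labels, and the bit test used by I-ELmap:
# I-ELmap_t(k,b) = 1 iff {t, fI(k)} in E and bit 2^b of l(t,i) is 1.
edges = {("t1", "i1"): 5, ("t1", "i2"): 2}   # edge labels l(t,i)

def el_bit(t, i, b):
    """1 iff {t,i} in E and the 2^b bit of l(t,i) is 1."""
    l = edges.get((t, i))
    return 0 if l is None else (l >> b) & 1

print(el_bit("t1", "i1", 0))   # 1: label 5 = 0b101, bit 0 is set
print(el_bit("t1", "i2", 0))   # 0: label 2 = 0b010, bit 0 is clear
```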
Conclusion
A unifying theory of the three areas of data
mining, association rule mining, clustering and
classification, is given within the theory of uni-
and bi-partite graphs. With this unified theory,
an application scientist is able to see more
clearly which area of data mining is best for his
or her application needs. Given any interaction
between entities, there is a graph defined by that
interaction. The graph may be undirected or
directed and may have elaborately structured
node and edge labels. In any case, one can mine
for information in various ways. If the
interaction is uni-partite of degree two, then it
can be mined through clustering and
classification. If the interaction is bi-partite and
of degree two, it can be mined for association
rules in two distinct ways, depending upon
which of the bi-partite entities is selected to form
the antecedent and consequent rule components.
Even in the case of a bi-partite interaction, the
data can be clustered or classified by defining a
similarity function on one of the bi-partite
entities according to the signal it generates in the
other bi-partite entity. This can also be done in
two ways, depending upon which of the
bi-partite entities is clustered.
In bioinformatics, for example, the interaction
between genes and experiments is studied. This
is a bipartite interaction but it is almost always
studied in terms of the two uni-partite graphs
generated by focusing on just one of the entity
types. Seldom is it realized that the full bipartite
interaction relationship can be mined for
association rules in two ways and that in so
doing, truly spectacular rule relationships may
well emerge. Instead, bioinformaticists settle
for dual clustering (usually for the purpose of
classification or annotation). They
cluster genes in terms of similarities in their
experiment signals and they cluster experiments
in terms of their gene expression signals. They
do not mine for the strong association rules that
might lie hidden in the bipartite interaction data.
This association rule information may well hold
the key to unlocking the mysteries of biological
pathway understanding.
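The two dual clusterings described above can be sketched on toy data; the trivial identical-signal grouping below stands in for a real clustering algorithm, and the gene/experiment names are invented:

```python
# The same bipartite gene-experiment matrix can be clustered two
# ways: genes by their experiment signals, or experiments by their
# gene signals (the transpose).
genes = {"g1": (1, 0, 1), "g2": (1, 0, 1), "g3": (0, 1, 0)}  # cols: e1..e3

def group_identical(vectors):
    """Trivial clustering: group entities with identical signals."""
    clusters = {}
    for name, v in vectors.items():
        clusters.setdefault(v, []).append(name)
    return list(clusters.values())

gene_clusters = group_identical(genes)          # cluster the genes

# transpose: each experiment described by its signal across g1..g3
exps = {f"e{j+1}": tuple(genes[g][j] for g in ("g1", "g2", "g3"))
        for j in range(3)}
exp_clusters = group_identical(exps)            # cluster the experiments

print(gene_clusters)   # [['g1', 'g2'], ['g3']]
print(exp_clusters)    # [['e1', 'e3'], ['e2']]
```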
References
[1] James A. Jones and Mary Jean Harrold,
"Empirical Evaluation of the Tarantula
Automatic Fault-Localization Technique", ACM
ASE Conference, November 2005, Long Beach,
CA, pp. 273-282.
[2] James A. Jones, Mary Jean Harrold and
John Stasko, "Visualization of Test Information
to Assist Fault Localization", ACM ICSE
Conference, May 2002, Orlando, FL,
pp. 467-477.
[3] Tristan Denmat, Mireille Ducasse and
Olivier Ridoux, "Data Mining and
Cross-checking of Execution Traces", ACM
ASE Conference, November 2005, Long Beach,
CA, pp. 396-399.
[4] Dan E. Krane and Michael L. Raymer,
Fundamental Concepts of Bioinformatics,
Benjamin Cummings Publishing Company,
2003.
[5] Andreas D. Baxevanis and B. F. Francis
Ouellette, Bioinformatics: A Practical Guide to
the Analysis of Genes and Proteins, Third
Edition, Wiley Publishing Company, 2005.
[6] Stu Pocknee, "Analyzing Data for Precision
Agriculture", InfoAg Conference, August 2001,
Indianapolis, IN.
[7] Imad Rahal and William Perrizo, "A
Scalable Vertical Mining of Association Rules",
Journal of Information and Knowledge
Management, World Scientific, V3:4, December
2004,
http://www.worldscinet.com/03/0304/S02196492040304.html
[8] Imad Rahal and William Perrizo, "A
Predicate-tree Based Framework for
Accelerating Multilevel Secure Database
Queries", ISCA International Journal of
Computer Applications, March 2005.
[9] Qiang Ding and William Perrizo, "Cluster
Analysis of Spatial Data Using Peano Count
Trees", Information: An International Journal,
V7:1, pp. 15-26, 2004.