Framework Unifying Association Rule Mining, Clustering and Classification

Anne Denton and William Perrizo
Dept of Computer Science, North Dakota State University

Introduction

The fundamental concept of a partition links almost all knowledge discovery and data mining.
– Such fundamental and unifying concepts are very important, since such a wide variety of problem domains is covered under the general headings of knowledge discovery and data mining.
The concept of a relation is also at the center of our model.
– The concept of an equivalence relation is central to the understanding of data patterns through similarity partitioning.
  • glues objects together
  • a reflexive, symmetric and transitive relation
– The concept of a comparison relation (partial order relation or hierarchy) is central to distinguishing similarity patterns.
  • distinguishes objects
  • irreflexive and transitive

Mathematical Foundations of Data Mining:

  Relation
    |
    +--> Equivalence Relation <--> Function <--> Partition <--> Closed Undirected Graph
    |
    +--> Partial Order Relation <--> Directed Acyclic Graph

The RELATION, and the restricted notions of equivalence relations (partitions) and order relations (Partially Ordered Sets, or POSets), are key to Database and Data Mining. The SELF-RELATION on a set S is a relation on {S,S'}, where S' is an alias for S. Truly, the Relational Model is ubiquitous in DB and DM. There is no other! Ted Codd was right on the money!

Relations

A relation, R(A1,..,An) with Dom(Ai) = Xi, is the f^-1(1)-component of the pre-image partition generated by a function f: X1 x .. x Xn -> {0,1}, which assigns 1 if the tuple "exists in the relation" and 0 if it "does not exist in the relation".
– Function pre-images, partitions and equivalence relations are pair-wise dual concepts.
• We partition the full Cartesian product of the attribute domains into two components (the pre-images under f, above) whenever we define a relation.
• Data mining and database querying are a matter of describing the non-randomness of that partition boundary (if it is non-random).
• Clearly, if f is identically 1, the relation is the entire Cartesian product and there is no boundary (one extreme).
• At the other extreme, f is the characteristic function of a singleton set and there is a clear boundary (clear non-randomness). (Data mining in this case is query processing.)
• Data mining can be viewed as finding and describing the non-randomness of that boundary.

Partitions

The fundamental concept of a partition links almost all knowledge discovery and data mining. Such fundamental and unifying concepts are very important, since such a wide variety of problem domains is covered under the general headings of knowledge discovery and data mining. For instance, a retail store that tries to analyze shopping behavior would not benefit much from a machine learning algorithm that allows prediction of one quantity as a function of some number of other variables. Yet such an algorithm may be precisely the right tool for an agricultural producer who wants to predict yield from the nitrogen and moisture values in his field. We will show that both problems, and their solutions, can be described in the framework of partitions and generalized database operations.
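To make the characteristic-function view of a relation above concrete, here is a minimal Python sketch (the toy domains, tuples and names are ours, not from the slides): the relation is recovered as the f^-1(1) component of the pre-image partition of the full Cartesian product.

from itertools import product

# Minimal sketch (ad hoc toy domains): a relation R(A1, A2) viewed as the
# f^-1(1) component of the pre-image partition of Dom(A1) x Dom(A2).
X1 = ["a", "b", "c"]
X2 = [0, 1]
R = {("a", 0), ("b", 1), ("c", 1)}           # tuples that "exist"

def f(t):                                    # f: X1 x X2 -> {0, 1}
    return 1 if t in R else 0

# Defining the relation partitions the full Cartesian product into the
# two pre-image components f^-1(0) and f^-1(1).
preimage = {0: [], 1: []}
for t in product(X1, X2):
    preimage[f(t)].append(t)

assert set(preimage[1]) == R                 # f^-1(1) recovers R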
A LABELED_PARTITION is a Partition in which every component is assigned a label (from some label space, L).

Graphs

An UNDIRECTED GRAPH is G=(N,E); N = node set, E = edge set (an edge is a pair of nodes). (A Directed Graph is just an undirected graph in which each edge is an ordered pair of nodes.)
A SUBGRAPH, G'=(N',E'), of G is a graph such that N' is a subset of N and E' is a subset of E.
A GRAPH-PARTITION of G is a set of subgraphs, {G1..Gn}, such that their edge sets, {E1..En}, form a partition of E.
A PATH in G is a subgraph, P=(N',E'), such that there exists an ordering of N', (n1..nk), such that {(n1,n2)..(nk-1,nk)} is a subset of E'.
A PATH-CONNECTED GRAPH (or just CONNECTED graph) is a graph such that for every pair of nodes (ni,nj) there is a path in G connecting ni to nj.
The CONNECTIVITY PARTITION is the partition into path-components.
A CLOSED GRAPH is a graph such that {n1,n2}, {n2,n3} edges ==> {n1,n3} is an edge. Equivalently, if there is a path from n1 to n2, there is an edge {n1,n2}. (Note: this provides the transitive property to the induced equivalence relation.)

Directed Graph

A DIRECTED GRAPH (digraph) G=(N,E) is such that
– N is a set (of nodes).
– E is a set of ordered pairs of nodes.
– A digraph induces a canonical graph by unordering each ordered pair in E.
A DIRECTED SUBGRAPH, G'=(N',E'), of a Directed Graph, G, is such that N' is a subset of N and E' is a subset of E.
A PATH in a Directed Graph, G, is a subgraph, P=(N',E'), such that there exists an ordering of N', (n1..nk), such that {(n1,n2)..(nk-1,nk)} is E'. A path is
– TRIVIAL if ni=nj for all i,j=1..k.
– SIMPLE if all ni are distinct, except possibly n1 and/or nk.
A CYCLE is a simple nontrivial path with n1=nk (a simple closed path).
A cycle is MINIMAL if, for any cycle nodes ni,nj, whenever (ni,nj) is an edge it is in the cycle.
A DIRECTED ACYCLIC GRAPH (DAG) is a digraph containing no cycles.
– SOURCE = node with outgoing but no incoming edges.
– SINK = node with incoming but no outgoing edges.
– ROOTED DAG = DAG with a unique source.
TRANSITIVE CLOSURE: if G=(N,E) is a digraph and G+=(N,E') is such that (a,b) is in E' iff there is a nontrivial path from a to b in G, then G+ is the transitive closure of G.
– For every G there is one and only one transitive closure.
• Transitive closure: put (ni,nj) in E' for every nontrivial path in G from ni to nj.

PARTIAL ORDER RELATIONS

A PARTIAL_ORDER is a binary Self-Relation, R, on S such that
– (s,s) is not in R for any s in S (irreflexive),
– if (s1,s2) is in R then (s2,s1) is not in R (anti-symmetric),
– if (s1,s2), (s2,s3) are in R then (s1,s3) is in R (transitive).
A LINEAR ORDER, R, on S is a Partial Order such that for every s1 not equal to s2 in S, either (s1,s2) or (s2,s1) is in R (of course never both).
A LATTICE ORDER, R, on S is a Partial Order on S (and (S,R) is a LATTICE) if for every s1 not equal to s2 in S there is an s3 in S such that (s1,s3) is in R and (s2,s3) is in R (every pair has an upper bound).
The PARTITION LATTICE of S is the lattice ordering of all Partitions of S under the ordering of sub-partitions, where a sub-partition, Q, of a partition, P, is such that every component of Q is a subset of a component of P (and, of course, of only one); a sketch of this ordering follows.
A CONCEPT HIERARCHY on an attribute is the Partition Lattice of that attribute
– THE concept hierarchy for that attribute, since any user-defined concept hierarchy, based on some domain knowledge, is a sub-lattice of this Mother Lattice.
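As a concrete check of the sub-partition ordering just defined, here is a minimal Python sketch (the partitions and names are ad hoc, not from the slides):

# Minimal sketch: test whether Q is a sub-partition (refinement) of P,
# i.e., every component of Q is a subset of some component of P.
# (Since partition components are disjoint, at most one such P-component
# can contain a given Q-component.)
def is_subpartition(Q, P):
    return all(any(q <= p for p in P) for q in Q)

P = [{1, 2, 3}, {4, 5, 6}]          # coarser partition of {1..6}
Q = [{1, 2}, {3}, {4}, {5, 6}]      # finer partition of {1..6}
assert is_subpartition(Q, P)        # Q is below P in the partition lattice
assert not is_subpartition(P, Q)    # but not vice versa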
DUALITIES on a set S: PARTITION <--> FUNCTION <--> EQUIVALENCE_RELATION <--> CLOSED_UNDIRECTED_GRAPH

The LABELED_EQUIVALENCE_RELATION induces the canonical LABELED_PARTITION of equivalence components (a, b belong to the same component iff they are equivalent).
The LABELED_PARTITION, P={C1..Cn}, induces the canonical FUNCTION, g: S -> {C1..Cn}, by g(s)=Ci iff s is in Ci (letting Ci stand for both the partition component and the label (name) assigned to that component).
The FUNCTION, g: S -> L, induces the canonical CLOSED UNDIRECTED GRAPH on S with edge set = {(s1,s2) | g(s1)=g(s2)}.
The CLOSED UNDIRECTED GRAPH induces the canonical LABELED EQUIVALENCE RELATION by: s1, s2 are equivalent iff there is a path connecting them. (A sketch of this round trip appears at the end of this section.)

Another DUALITY: Partially Ordered Set <--> Directed Graph

The POSet, L = (S, <), can be viewed as a Directed Acyclic Graph (DAG), G = (S,E), where (a,b) is in E iff a < b.
A DAG, G = (N,E), can be viewed as a POSet, L, where S=N and a < b iff a != b and (a,b) is in Closure(E).
– Often only non-transitive edges are included, i.e., if a < b and b < c, we don't include the edge (a,c), since the additional edges clutter the picture.
– Given a POSet L, one can diagram it as a DAG using the duality just described.
– However, since a diagram is intended to help visualization, and therefore should not be cluttered, we generally do not include all edges (we don't display the closure; in fact we usually display the minimal DAG that corresponds to the POSet).

Cartesian Products, Star Schemas, Histograms

In a RELATION, R(S1,..,Sk,A1,..,An), the S's are dimension (or structure) attributes and the A's are feature (or measurement) attributes; R is a subset of the CARTESIAN PRODUCT of all the attribute domains.
The STAR SCHEMA is the normalization of R into the dimension files
  S1(A1,1..A1,n1)
  S2(A2,1..A2,n2)
  ...
  Sk(Ak,1..Ak,nk)
and the fact cube
  C(S1..Sk, A(k+1),1..A(k+1),n(k+1))
– where {A1,1..A1,n1, A2,1..A2,n2, .., Ak,1..Ak,nk, A(k+1),1..A(k+1),n(k+1)} is a partition of {A1..An} into dimension features and cube measurements (or features).
A HISTOGRAM is the rollup of the Cartesian product along the structure attributes (using some aggregate, usually count) and, possibly, projected onto selected Ai's.
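For instance, a minimal Python sketch of such a histogram rollup (the fact tuples are ad hoc, not from the slides):

from collections import Counter

# Minimal sketch: a HISTOGRAM as the rollup of the fact relation along the
# structure attributes S1, S2 (aggregate = count), projected onto A1.
facts = [(0, 0, 4), (0, 1, 2), (1, 0, 4), (1, 1, 7), (2, 2, 2)]
histogram = Counter(a1 for (s1, s2, a1) in facts)
print(sorted(histogram.items()))   # [(2, 2), (4, 2), (7, 1)]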
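And, returning to the dualities at the top of this section, a minimal sketch (ad hoc set and labels) of the round trip FUNCTION -> CLOSED UNDIRECTED GRAPH -> EQUIVALENCE RELATION -> PARTITION:

from itertools import combinations

# Minimal sketch of the duality chain on a small set S (all names ad hoc).
S = ["a", "b", "c", "d", "e"]
g = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 0}      # labeling function g: S -> L

# g induces the canonical closed undirected graph: edge iff equal labels.
E = {(s1, s2) for s1, s2 in combinations(S, 2) if g[s1] == g[s2]}

# Path-connectivity on that graph gives the equivalence relation; its
# components form the canonical partition. Recover them with union-find.
parent = {s: s for s in S}
def find(s):
    while parent[s] != s:
        parent[s] = parent[parent[s]]
        s = parent[s]
    return s
for s1, s2 in E:
    parent[find(s1)] = find(s2)

partition = {}
for s in S:
    partition.setdefault(find(s), set()).add(s)
print(list(partition.values()))    # e.g. [{'a', 'b', 'e'}, {'c', 'd'}]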
For R(S1, S2, A1) the Cartesian product is:

[Figure: the Cartesian product of Dom(S1) x Dom(S2) x Dom(A1), drawn as an 8x8x8 cube over S1,0..S1,7, S2,0..S2,7 and A1,0..A1,7]

The Star Schema (A1 as a fact measurement or feature attribute):

[Figure: an 8x8 fact file over S1 and S2, with the A1-values stored in the cells]

Star Schema (A1 as a Feature Attribute of S2) (from 1-NF to 3-NF):

[Figure: an 8x8 fact file over S1 and S2 whose cells hold 0/1-values ("relationship existence"), with A1 moved out to the S2 dimension file shown below]
S2-dimension file:

  S2 | A1
  ---+---
   0 |  4
   1 |  2
   2 |  7
   3 |  1
   4 |  1
   5 |  2
   6 |  1
   7 |  6

A HISTOGRAM CUBE counts the number of facts for each value of a cube of feature attributes (of the fact cube and/or the dimension files). In this case we have just one such attribute, namely A1, so it is a 1-D cube:

[Figure: a 1-D histogram cube over A1,0..A1,7; each cell holds the count of facts showing that A1-value (the count of facts showing A1=0, etc.)]

Classification (supervised learning)

The set of all partitions under set containment forms a lattice. CLASSIFICATION is choosing a good partition level in the Partition Lattice of R.
An EAGER CLASSIFIER (e.g., a decision tree) selects, from the Partition Lattice of the training set T(A1..An,C), a closed-form partition in which the ClassCount-Histograms (CCHs) are sufficiently discriminatory in terms of picking a winner (i.e., the maximal class is sufficiently more populous than the next highest).
LAZY CLASSIFIERS (e.g., K-Nearest-Neighbor) focus locally around the unclassified sample, looking for a "neighborhood" (one partition component) in which the CCH is sufficiently discriminatory (then discarding that information). A CCH-based sketch appears at the end of these notes.
– If the locality is taken to be the entire training set (e.g., methods where every training point is a neighbor of some degree, depending on its distance from the unclassified sample), then the difference between lazy and eager goes away.

Clustering (unsupervised learning)

CLUSTERING is choosing a partition from the lattice of all partitions,
– using a similarity matrix, or
– using a distance function
  • actually a pseudo-metric is sufficient.

Association Rule Mining

How does ARM relate to Machine Learning (Classification and Clustering)?
Given a lattice universe, (U, <), a TransactionSet, T, which is a subset of U, a size function, s: (U,<) -> Z, which is monotone increasing, a support function, supp: (U,<) -> Z, which is monotone decreasing, and a support threshold, minsupp in Z:
A u in U is FREQUENT iff supp(u) > minsupp.
APRIORI-type algorithms solve two problems (see the sketch below):
– "Find all FREQUENT subsets of U." Exhaustive search is too complex, since the solution space has size 2^|U|, so APRIORI prunes the search of the solution space using the fact: "If u1 is infrequent and u1 < u2, then u2 is infrequent" (infrequency pushes up the lattice). Another useful fact (when searching for infrequent sets) is: "If u2 is frequent and u1 < u2, then u1 is frequent" (frequency pushes down the lattice).
– "Find all STRONG pairs in U2 = UxU - {(a,c) | there are no nonempty common progeny}" (disjoint pairs, in the case of set containment), where "strong" means conf(a,c) > minconf and conf: U2 -> Z is monotone on antecedents: a' < a => conf(a => u-a) < conf(a' => u-a'). Equivalently: "Find all strong pairs in U2 = {(a,u) in UxU | a subset of u subset of U}", where "strong" means conf(a, u-a) > minconf, with conf monotone on antecedents as above.
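Here is the promised sketch of the APRIORI pruning idea, under the usual itemset instantiation of the lattice (u1 < u2 is set containment, supp(u) counts the transactions containing u; the code and toy transactions are ours, not from the slides):

from itertools import combinations

# Minimal APRIORI sketch for the itemset lattice: grow candidates level by
# level and prune using "infrequency pushes up the lattice".
def apriori(transactions, minsupp):
    items = sorted({i for t in transactions for i in t})
    supp = lambda u: sum(1 for t in transactions if u <= t)
    frequent, level = [], [frozenset([i]) for i in items]
    while level:
        level = [u for u in level if supp(u) > minsupp]
        frequent.extend(level)
        # Candidate generation with pruning: a (k+1)-set can only be
        # frequent if all of its k-subsets are frequent.
        freq = set(level)
        level = list({a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1
                      and all(frozenset(s) in freq
                              for s in combinations(a | b, len(a)))})
    return frequent

T = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}]
print([sorted(u) for u in apriori([frozenset(t) for t in T], minsupp=1)])
# singletons, plus {beer, bread} and {bread, milk}; {beer, milk} is pruned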
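Finally, the CCH-based classification sketch promised above: a minimal lazy classifier (all names and data ad hoc) that forms a neighborhood around the unclassified sample, builds its ClassCount-Histogram, and picks a winner only if the histogram is sufficiently discriminatory:

from collections import Counter

# Minimal lazy-classifier sketch: the neighborhood is one partition
# component around the sample; a winner is declared only if the top class
# beats the runner-up by at least `margin` in the CCH.
def cch_classify(training, sample, dist, radius, margin=1):
    neighborhood = [c for x, c in training if dist(x, sample) <= radius]
    ranked = Counter(neighborhood).most_common(2)
    if not ranked:
        return None                      # empty neighborhood: no decision
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ranked[0][0]              # sufficiently discriminatory
    return None                          # near-tie: widen the neighborhood

train = [((1, 1), "a"), ((1, 2), "a"), ((5, 5), "b"), ((6, 5), "b")]
manhattan = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
print(cch_classify(train, (2, 1), manhattan, radius=2))   # -> 'a'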