Data Mining – lecture notes
Lecture: Krzysztof Bryś. Notes: Paweł Olszewski.

Organisation:
- project in pairs, with a report (user part, technical part, experiments; more information about it next time)
- lab mark – based on the project (more information next week)
- lecture – a test (multiple choice a/b/c/d, about 20 questions); if the test is passed, the final mark is the mark from the labs.
Software: STATGRAPHICS.

Topics:
1. Introduction: the Data Mining model, Data Mining techniques, data warehousing, potential applications
2. Clustering
3. Association rule discovery
4. Classification algorithms
5. Estimation and regression techniques
6. Deviation detection techniques
7. Visualization of Data Mining results

Books:
1. Lecture notes :)
2. J. Han, M. Kamber – "Data Mining: Concepts and Techniques", Morgan Kaufmann 2000
3. N. Indurkhya, S. M. Weiss – "Predictive Data Mining: A Practical Guide", Morgan Kaufmann 1997
4. I. Witten, E. Frank – "Data Mining", Morgan Kaufmann 2000
5. M. Berry, G. Linoff – "Mastering Data Mining"
6. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy – "Advances in Knowledge Discovery and Data Mining", AAAI Press 1996
7. P. Cichosz – "Systemy uczące się" (Learning Systems), WNT 2000

Data mining = knowledge discovery.

Data mining is the process of extracting useful (useful for the user) patterns and regularities from large bodies of data (too large to be examined by hand).

Remark: since the data sets we work with are very large, simple (efficient) methods should be used.

Potential applications of data mining methods: data compression, marketing, internet (e-business), banking (financial markets), medicine.

1) Data compression: regularities found in a data set are used for compact encoding and approximation.

2) Marketing: given sales data, the aim is to discover patterns which might explain purchasing behaviour. (A small code sketch reproducing the counts below follows after this list.)

   Example. Set of items: I = {Milk, Coke, Orange juice}. Set of transactions (1000 of each): {M, O}, {C, O}, {M}, {C, O}.

   Co-occurrence counts (the matrix is symmetric, so only the lower part is used):

        M      C      O
   M    x
   C    0      x
   O    1000   2000   x

   66% of the customers who buy Orange juice also buy Coke, so by putting these products near each other we may increase their sales; and by putting something like Milk between them we may also increase sales of the added product, because it lies in the middle of the items the customer needs (a rule C => O, with Milk placed between them).

3) Internet: instead of Milk, Orange juice and Coke we may have web pages, and find, for example, whether a person who likes cars also likes pets.

4) Banking: identify stock trading rules from historical market data. [Figure: a price chart over time; if such a pattern has occurred many times before, we may expect that at the corresponding point the price will increase.]

5) Medicine: characterise patient behaviour, identify successful therapies.
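The co-occurrence counts in the marketing example above can be reproduced with a few lines of code. This is a minimal sketch, not part of the original notes; the transaction lists simply mirror the four groups of 1000 transactions, and the item names are the ones used above.

```python
from itertools import combinations
from collections import Counter

# The four groups of 1000 transactions from the marketing example
transactions = (
    [{"Milk", "Orange juice"}] * 1000
    + [{"Coke", "Orange juice"}] * 1000
    + [{"Milk"}] * 1000
    + [{"Coke", "Orange juice"}] * 1000
)

# Count how often each pair of items occurs together (the symmetric matrix above)
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

print(pair_counts)   # ('Coke', 'Orange juice'): 2000, ('Milk', 'Orange juice'): 1000

# About 66% of the customers who buy Orange juice also buy Coke (2000 of 3000)
oj_buyers = sum("Orange juice" in t for t in transactions)
print(pair_counts[("Coke", "Orange juice")] / oj_buyers)
```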
Classes of Data Mining methods

The classification of data mining methods:
1) Classification
2) Clustering (segmentation)
3) Associations
4) Sequential (temporal) patterns
5) Dependency modeling (regression)

ad 1) The user defines some classes in the training data set. The data mining system constructs descriptions of the classes: methods which, after several questions have been answered, allow us to choose the proper class. [Figure: a tree of questions Q1, Q2, Q3 whose answers lead to the classes C1, ..., C6.] The system should find the best rule for each class.

A rule: LHS => RHS (left-hand side => right-hand side).

The categories of rules:
- exact rules (no exceptions) – must be satisfied by each member of a class
- strong rules (allow some exceptions) – must be satisfied by "almost" all members of a class (a limited number of exceptions)
- probabilistic rules (exceptions limited by a probability) – relate P(RHS | LHS) to P(RHS); when these are almost equal, i.e. P(RHS | LHS) ≈ P(RHS), the rule carries no information (it operates on probabilities).

ad 2) Clustering is the process of creating a partition of the whole data set such that all members of each class of the partition are similar according to some distance measure. If we change the measure, we also change the partition; the classes are characterised by the measure. [Figure: centres of the classes in a data set.] A new record is put into the nearest class: we measure the distance from the centres of the nearest classes.

ad 3) We find rules IF LHS THEN RHS by using some association function, which returns patterns that exist among a collection of items. We use this for a data set which consists of records, each of which contains some number of items (given a set of items, each record in the data set is a subset of this set).

Example: {A, B, C, D, E} – the set of items.
Data set: {A,C}, {A,C,D}, {A,B,C}, {B,C,D}, {A,B,D}, {A,C,E}

80% of the records containing A also contain C, so we find a rule: P(if A then C) = 0.8; likewise P(if C then A) = 0.8 – but it is a probabilistic rule.

ad 4) Sequential (temporal) patterns: given a set of sequences of records over a period of time.

Example:

  Day        1            2             3      4      5
  Smith      Milk         Beer          Milk   Beer   Milk
  Gates      Coke, Milk   Beer, Juice   Coke   Beer   Milk
  I. Jones   Milk         Milk          Milk   Milk   Milk, Juice

A sequence of records for Mr. Smith: Milk → Beer. Mr. Smith buys Milk and then Beer, so we may put other drinks between them – maybe he will also want to buy them :)

ad 5) The goal is to find a model which describes important dependencies between the variables. Using the association rule method we can find, for example, that 80% of all cases contain A and C. Is it A <=> C, or A => C (independent variable => dependent variable)?

The most important DM methods:
- statistical methods (regression) (1) (4) (5) (2)
- probabilistic methods (Bayesian methods, the apriori algorithm) (2)
- neural networks (2)
- genetic algorithms (2)
- decision trees (2)
- nearest neighbour method (3)
- data visualisation (2) (4)
- rule induction (1)

Knowledge Discovery Process

  QUESTION + DATA ──────► ANSWER SET
  QUESTION + DATA SET ──(DM)──► answer

1) Create a target data set (for example, a selection from a larger data set).
2) Clean and preprocess the data (eliminate errors and noise, fill in missing data). E.g. if in data set D1 the coding is 0 – Yes, 1 – No, while in D2 it is 0 – No, 1 – Yes, we have to correct the answers so that Yes = 0 and No = 1 in both cases.
3) Data reduction (delete attributes which are not useful).
4) Pattern extraction and discovery = data mining:
   a) choose the data mining goal
   b) choose the data mining algorithm(s)
   c) search for patterns of interest
5) Visualization of the data.
6) Interpretation of the discovered patterns.
7) Evaluation of the discovered knowledge (how may we use it?).

Decision trees

X – a training set of cases (examples). Each case is described by n attributes (the values of the attributes) and the class to which it belongs (the decision = the number of the class).

Example (playing tennis):
  X    Outlook    Temp.(°F)  Humidity  Windy?  Class
  1    sunny      75         70        T       P
  2    sunny      80         90        T       N
  3    sunny      85         85        F       N
  4    sunny      72         95        F       N
  5    sunny      69         70        F       P
  6    overcast   72         90        T       P
  7    overcast   83         78        F       P
  8    overcast   64         65        T       P
  9    overcast   81         75        F       P
  10   rain       71         80        T       N
  11   rain       65         70        T       N
  12   rain       75         80        F       P
  13   rain       68         80        F       P
  14   rain       70         96        F       P

An attribute is a function a: X → A, where A is the set of attribute values; e.g. for a = outlook, A = {sunny, overcast, rain}.

X is partitioned into classes C_1, ..., C_l (l categories).

By a test we mean a function t: X → R_t (the set of test outcomes).

A decision tree is a classifier which consists of:
- leaves, which correspond to classes,
- decision nodes, which correspond to tests (each branch and subtree starting in a decision node corresponds to one possible outcome of the test).

Remark: we will consider one-attribute tests.

Examples of tests: membership of some set, equality test, inequality test; e.g. "if outlook = overcast then play".

Next week: algorithms for constructing decision trees and for choosing tests.

Notation:
  X – the training set of examples
  a_1: X → A_1, ..., a_n: X → A_n – the attributes
  c: X → C – the classifying function; C – the set of classes; X = C_1 ∪ ... ∪ C_k
  t: X → R_t – a test; R_t – the set of test values (outcomes); we deal with one-attribute tests.

Classification of tests:
1° Identity test: t(x) = a(x), x ∈ X
2° Equality test: t(x) = 1 if a(x) = v, 2 if a(x) ≠ v
3° Membership test: t(x) = 1 if a(x) ∈ V, 2 if a(x) ∉ V
4° Partition test: t(x) = i if a(x) ∈ V_i, for a partition V_1, ..., V_n of the attribute values
5° Inequality test, V = (−∞, v): t(x) = 1 if a(x) ≤ v, 2 if a(x) > v

Algorithm for constructing a decision tree:
1° If X contains one or more examples, all belonging to the same class C_j, then the decision tree for the set X is a leaf identifying the class C_j.
2° If X contains no examples, then the decision tree in this node is a leaf, but the class to be associated with this leaf must be determined from information other than X (for example, the most frequent class in the parent node).
3° If X contains a mixture of classes, then choose a test based on a single attribute with possible outcomes o_1, ..., o_n. X is partitioned into subsets X_1, ..., X_n, where X_i contains all cases in X that have outcome o_i of the chosen test. The decision tree for X consists of a decision node identifying the test and one branch for each possible outcome.

We choose a test t in each node. Notation:

  X^d = {x ∈ X : c(x) = d}
  X_{t=r} = {x ∈ X : t(x) = r}
  X^d_{t=r} = {x ∈ X : c(x) = d ∧ t(x) = r}

(X – the set of examples, d – the label of a class, S – the set of possible tests.) For each attribute we may use many tests; a binary attribute has only one test.

function build_tree(X, d, S)
  IF STOP(X, S) THEN
    form a leaf l
    d_l := CLASS(X, d)
    RETURN(l)
  ELSE
    form a node n
    t_n := CHOOSE_TEST(X, S)
    d_n := CLASS(X, d)
    FOR each r ∈ R_{t_n}
      n(r) := build_tree(X_{t_n = r}, d_n, S \ {t_n})
    RETURN(n)

Possible stop criteria: STOP(X, S) = 1 if, for example,
- the set X is empty, or
- the set S is empty, or
- all (or "almost" all) examples in X belong to the same class,
and 0 otherwise; if one of these conditions is satisfied the algorithm stops.

CLASS(X, d) = the most frequent class of the examples in the set X, or d if X is empty.
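A minimal Python sketch of the build_tree procedure above (not from the notes). The test-selection function is passed in as a parameter – for example an information-gain-based CHOOSE_TEST as described next – and the stop criteria are the three listed above.

```python
from collections import Counter

def most_frequent_class(X, default):
    """CLASS(X, d): the most frequent class in X, or the default d when X is empty."""
    if not X:
        return default
    return Counter(c for _, c in X).most_common(1)[0][0]

def build_tree(X, default, tests, choose_test):
    """X is a list of (example, class) pairs; tests is a list of candidate test
    functions mapping an example to an outcome; choose_test(X, tests) picks one."""
    classes = {c for _, c in X}
    # STOP(X, S): X empty, S empty, or all examples belong to the same class
    if not X or not tests or len(classes) == 1:
        return ("leaf", most_frequent_class(X, default))
    t = choose_test(X, tests)
    node_default = most_frequent_class(X, default)
    branches = {}
    for r in {t(x) for x, _ in X}:            # one branch per observed outcome of t
        X_r = [(x, c) for x, c in X if t(x) == r]
        branches[r] = build_tree(X_r, node_default,
                                 [s for s in tests if s is not t], choose_test)
    return ("node", t, branches)
```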
(it returns d if the set X is empty)

CHOOSE_TEST(X, S)

For the set X, the expected information (logarithms base 2):

  I(X) = − Σ_{d ∈ C} (|X^d| / |X|) · log(|X^d| / |X|)

where X^d is the set of those examples that belong to class d, |X| is the cardinality of X, and |X^d| / |X| is the probability that an example belongs to the class d.

This entropy is large if the class cardinalities are (almost) equal, |X_1| ≈ ... ≈ |X_n|, and small if the cardinalities differ greatly.

Example: |X| = 1000, X = X_1 ∪ X_2:
  |X_1| = |X_2| = 500  – large entropy
  |X_1| = 1, |X_2| = 999  – small entropy

For a test t, the expected entropy of the test t:

  E_t(X) = Σ_{r ∈ R_t} (|X_{t=r}| / |X|) · E(X_{t=r})

where E(X_{t=r}) = − Σ_{d ∈ C} (|X^d_{t=r}| / |X_{t=r}|) · log(|X^d_{t=r}| / |X_{t=r}|) is the expected information in the set X_{t=r}.

Entropy measures the average amount of information (number of steps) needed to identify the class of an example when the test t is used – how many steps we need, on average, to reach a leaf, i.e. to find the class of the object.

The gain of information when the test t is used:

  g_t(X) = I(X) − E_t(X)

We choose from the set S the test t with the largest g_t(X). For a continuous attribute we need many tests.

Example: t = outlook (playing tennis). Outcomes (test values): sunny, overcast, rain; decisions (classes): PLAY, DON'T PLAY.

               sunny   overcast   rain
  PLAY           2        4        3
  DON'T PLAY     3        0        2

  I(X) = −(9/14)·log(9/14) − (5/14)·log(5/14) = 0.940

  E_t(X) = (5/14)·(−(2/5)log(2/5) − (3/5)log(3/5))
         + (4/14)·(−(4/4)log(4/4) − 0)
         + (5/14)·(−(3/5)log(3/5) − (2/5)log(2/5)) = 0.694

  g_t(X) = 0.940 − 0.694 = 0.246

Remark: the information gain prefers tests with many outcomes.

The information value (the average number of subtrees):

  IV_t(X) = − Σ_{r ∈ R_t} (|X_{t=r}| / |X|) · log(|X_{t=r}| / |X|)

The gain ratio is g_t(X) / IV_t(X); we choose the test with the largest ratio.

If P(t) is the cost of the test t, the test value is V_t(X) = g_t(X)² / P(t).

Remark: For a discrete attribute, one test with as many outcomes as the number of distinct values of the attribute is considered. For a continuous attribute the data should be sorted (with respect to the attribute) and the entropy gains should be calculated for each possible binary test a(x) < z vs. a(x) ≥ z; each threshold z gives one test, e.g. "if a(x) < 81".

Parallel algorithm for constructing a decision tree

P processors, N data items (examples). P(n) – the set of processors which handle the node n. If the node n is partitioned into child nodes n_1, n_2, ..., n_k then the processor group P(n) is also partitioned into k groups P_1, ..., P_k such that P_i handles n_i.

ALGORITHM
1) Expand a node n.
2) IF the number of child nodes < |P(n)| THEN
     assign a subset of processors to each child node in such a way that the number of processors assigned to a child node is proportional to the number of data items contained in that node, and follow the above steps recursively;
   ELSE
     partition the child nodes into |P(n)| groups such that each group has about an equal number of data items, assign each processor to one node group, and continue the computation for each processor independently.

NEXT WEEK: association rule discovery.

A rule LHS → RHS read off a decision tree: LHS = the outcomes of the tests along the path, RHS = the class; in each node we have information about the answer. Example: the tennis tree gives 5 rules – each path from the root to a leaf is a rule, e.g.:
  5: if outlook = overcast then play
  1: if outlook = sunny and humidity ≤ 75 then play
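The entropy and gain computation above (I(X) = 0.940, E_t(X) = 0.694, g_t(X) = 0.246) can be checked with a short script. A minimal sketch, not from the notes, assuming the 14 (outlook, class) pairs from the play-tennis table; only I(X), E_t(X) and g_t(X) are implemented.

```python
import math
from collections import Counter

def info(labels):
    """I(X) = -sum_d (|X^d|/|X|) * log2(|X^d|/|X|)"""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def expected_info(rows, test, label):
    """E_t(X) = sum_r (|X_{t=r}|/|X|) * I(X_{t=r})"""
    groups = {}
    for row in rows:
        groups.setdefault(test(row), []).append(label(row))
    return sum(len(g) / len(rows) * info(g) for g in groups.values())

def gain(rows, test, label):
    return info([label(r) for r in rows]) - expected_info(rows, test, label)

# (outlook, class) pairs from the training table
data = [("sunny", "P"), ("sunny", "N"), ("sunny", "N"), ("sunny", "N"), ("sunny", "P"),
        ("overcast", "P"), ("overcast", "P"), ("overcast", "P"), ("overcast", "P"),
        ("rain", "N"), ("rain", "N"), ("rain", "P"), ("rain", "P"), ("rain", "P")]

print(info([c for _, c in data]))                              # ~0.940
print(gain(data, test=lambda r: r[0], label=lambda r: r[1]))   # ~0.246
```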
Association Rules Discovery

Example:
  1000 × {beer, milk, juice}
  1000 × {beer, milk, water}
  1000 × {beer, milk}
  1000 × {beer, water}

  P(if b then m) = P(m ∧ b) / P(b) = 3/4 = 0.75
  P(if m then b) = P(m ∧ b) / P(m) = 3/3 = 1

so P(if b then m) < P(if m then b).

An association rule = a rule which implies some association relationship between attribute values in the data set.

The weather data can be treated in the same way: take as items the attribute values and decisions, I = {s, o, r, t, f, p, d} (sunny, overcast, rain, windy = true, windy = false, play, don't play), grouped by attribute: outlook → {s, o, r}, windy → {t, f}, decision → {p, d}. Each case becomes a transaction, e.g. T = { {s, t, p}, {s, t, d}, ... }. A rule such as IF (windy = false, outlook = sunny) THEN decision = play is an association rule; we may remove windy in this case.

T – a set of transactions (sets of items), I – a set of items (attributes, values, decisions).

An association rule "if X then Y" is denoted ⟨X, Y⟩, where X, Y ⊆ I. For each X, Y ⊆ I:

  S_T(X, Y) = |{t ∈ T : X ⊆ t ∧ Y ⊆ t}| / |T|   – the support of the rule ⟨X, Y⟩ in the set T
  Conf_T(X, Y) = |{t ∈ T : X ⊆ t ∧ Y ⊆ t}| / |{t ∈ T : X ⊆ t}|   – the confidence of the rule ⟨X, Y⟩ in the set T

With the notation T_X = {t ∈ T : X ⊆ t} and T_{X,Y} = {t ∈ T : X ⊆ t ∧ Y ⊆ t}:

  S_T(X, Y) = |T_{X,Y}| / |T|,   Conf_T(X, Y) = |T_{X,Y}| / |T_X|

Example: X = {car, cigarettes}, Y = {plane}, |T| = 1000 ("large" means >> 1). If t = {car, cigarettes, plane} is the only transaction containing X, then S_T(X, Y) = 1/1000, while Conf_T(X, Y) = |T_{X,Y}| / |T_X| = 1/1 = 1.

The support of the item set X: S_T(X) = |T_X| / |T|. Since T_{X∪Y} = T_X ∩ T_Y, we have S_T(X, Y) ≤ S_T(X) and S_T(X, Y) ≤ S_T(Y) (e.g. X = {car}, Y = {plane}).

A rule with large confidence and large support is a good rule.

2^I is the family of all subsets of I; each t ∈ T satisfies t ⊆ I. The number of all possible pairs ⟨X, Y⟩, where X and Y are subsets of I, is 2^|I| · 2^|I|.

A SUPERSET of the set X is a set containing X.

Apriori Algorithm

s = minimum support, r = minimum confidence.
1) Find all combinations (sets) of items that have support above the minimum support s. Call those combinations frequent itemsets.
2) Use the frequent itemsets to generate the desired rules: "if A,B then C,D" is a frequent rule if ABCD is a frequent itemset, i.e. S_T(ABCD) ≥ s, and Conf = S_T(ABCD) / S_T(AB) ≥ r.

DEF: A rule ⟨X, Y⟩ is frequent if S_T(X, Y) ≥ s and Conf_T(X, Y) ≥ r. A set X is frequent if S_T(X) ≥ s.

AprioriAlg()
  L_1 := { frequent 1-element itemsets }        // for each item we check the support
  FOR (k = 2; L_{k-1} ≠ ∅; k++) DO
    C_k := apriori_gen(L_{k-1})                 // new candidates
    // Remark: each subset of a frequent set is also frequent.
    FOR all t in the dataset DO
      FOR all candidates c ∈ C_k contained in t DO
        c.count++
    L_k := { c ∈ C_k : c.count ≥ s·|T| }
  RETURN (∪_k L_k)

Next time: more about apriori_gen(L_{k-1}).

Weka system – a collection of machine learning tools: http://www.cs.waikato.ac.nz/ml/weka
Database generator: http://www.datgen.com
Links to datasets on the net: http://mainseek.pl/ca/557351/datasets

apriori_gen

I – the set of items; T – a set of transactions; a transaction = a set of items (a subset of I). For each set A ⊆ I:

  S_T(A) = |{t ∈ T : A ⊆ t}| / |T|

and a set A is frequent ⇔ S_T(A) ≥ s (the minimal support).

L_k – the set of all k-element frequent itemsets; C_k – the set of k-element candidates; C_k = apriori_gen(L_{k-1}).

Remark: a set A is frequent ⇒ each subset of A is frequent.

C_k = apriori_gen(L_{k-1}):
1) Join L_{k-1} with L_{k-1}; the joining condition is that (k−2) items are the same; this gives the set C_k. For example, for A = {v_1, ..., v_{k−2}, v'_{k−1}} and B = {v_1, ..., v_{k−2}, v''_{k−1}} in L_{k-1} we get A ∪ B = {v_1, ..., v_{k−2}, v'_{k−1}, v''_{k−1}} with |A ∪ B| = k.
2) Delete from C_k those itemsets that have some (k−1)-element subset not in L_{k-1}.
Remark: if some subset of A is not frequent, then there is some (k−1)-element superset of it (still a subset of A) which is not frequent.

Apriori_gen(L_{k-1})
  1) for each A ∈ L_{k-1} do
       for each B ∈ L_{k-1} do
         if |A ∩ B| = k−2 then C_k := C_k ∪ {A ∪ B}
  2) for each D ∈ C_k do
       repeat
         d := new_element(D)
       until (d = null or D \ {d} ∉ L_{k-1})
       if d ≠ null then C_k := C_k \ {D}

During counting: if c ⊆ t then c.count++.
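A minimal Python sketch of apriori_gen and of the main Apriori loop described above (not from the notes). Itemsets are represented as frozensets, transactions as plain sets, and the minimum support is a fraction of |T|.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Candidates C_k from the (k-1)-element frequent itemsets L_prev (set of frozensets)."""
    # 1) Join: unite pairs of (k-1)-itemsets sharing k-2 items (union has exactly k items)
    C = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    # 2) Prune: drop candidates with a (k-1)-element subset that is not frequent
    return {c for c in C
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

def apriori(transactions, min_support):
    """Return all frequent itemsets of every size (transactions: a list of sets)."""
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    L = {c for c in items if sum(c <= t for t in transactions) / n >= min_support}
    frequent, k = set(L), 2
    while L:
        C = apriori_gen(L, k)
        counts = {c: sum(c <= t for t in transactions) for c in C}   # c.count
        L = {c for c, cnt in counts.items() if cnt / n >= min_support}
        frequent |= L
        k += 1
    return frequent
```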
Frequent itemsets can also be found in the following way:
1) Take a sample of the data (main-memory sized).
2) Run the apriori algorithm on this sample (find the frequent itemsets in the sample).
3) Verify that the frequent itemsets of the sample are frequent in the whole dataset.

Remarks:
1. This will miss sets that are frequent in the whole dataset but not in the sample.
2. The minimum support in the sample should be lower than the minimum support in the whole dataset (risk: there will be too many "candidates").

Sequential analysis

Example: [Figure: the transaction sequences of customers X, Y and Z, with example patterns contained in one or two of the sequences, and a pattern (j, m, b) contained in no sequence.] We are looking for sequential patterns.

Input data: a set of sequences called data sequences (each sequence is an ordered list of transactions/itemsets). Typically there is a time associated with each transaction.

A sequential pattern = a sequence of sets of items (NOT NECESSARILY TRANSACTIONS): b = ⟨b_1, ..., b_t⟩ such that b_1, ..., b_t ⊆ I, where I is the set of items and T is the set of transactions. X – the set of data sequences: X = { ⟨a_1, ..., a_l⟩ : a_1, ..., a_l ∈ T }.

Problem: find all sequential patterns which are "frequent" (with a user-specified minimum support).

The support of a sequential pattern b in the set X:

  S_X(b) = |{x ∈ X : b is contained in x}| / |X|

The length of a sequence = the number of itemsets in the sequence; a k-sequence = a sequence of length k; a 1-sequence = an itemset; a frequent 1-sequence = a frequent itemset.

NEXT TIME: two methods for sequential patterns.

13-11-2002

1-sequence = itemset; frequent set = large set = large itemset = litemset = a set of items with minimum support.

The support of a sequence b in the dataset X: S_X(b) = |{x ∈ X : b is contained in x}| / |X|.

Example: let A, B, C be itemsets and X = { ⟨A,C,B,A⟩, ⟨B,C,B,B⟩ } (customer sequences). Then S_X(⟨B⟩) = 2/2 = 1. Here X is the set of customer sequences, a transaction is a set of items, and a customer sequence is an ordered list of transactions. Counting the transactions individually, T = { A, C, B, A, B, C, B, B } and S_T(B) = 4/8 = 1/2.

DEF: A sequence b is large (frequent) if S_X(b) ≥ r, where r is the user-defined minimum support.

Remark: if a sequence is large then each itemset contained in this sequence is also large: b = ⟨b_1, ..., b_k⟩ and S_X(b) ≥ r ⇒ S_X(b_i) ≥ r for i = 1, ..., k. Moreover, each subsequence of b is also large.

AprioriAll counts the large sequences of every length 1, 2, 3, 4, 5, 6, ... (C_k → L_k); AprioriSome counts only some lengths in its forward phase (it avoids counting non-maximal large sequences).

The solution of the problem of mining sequential patterns consists of the following phases:
1. Sort phase (the set of transactions → the set of customer sequences).
2. Large itemsets phase (we find all large 1-sequences, i.e. large itemsets, using Apriori).
3. Transformation phase: each transaction t is replaced by the set of large itemsets contained in t (example: large itemsets A, B, C, D; t = {a, d, f, g} and A = {a, d, g} ⊆ t, so t → {A}); each customer sequence becomes the ordered list of these sets of itemsets. If no large itemset is contained in a transaction or in a customer sequence, we delete it (but the number of customers is not changed).
4. Sequence phase (we find the large sequences of each length).
5. Maximal phase (we find the maximal sequences among the set of all large sequences).

Families of algorithms for finding the patterns:
- Count All = count all large sequences;
- Count Some = count only the maximal large sequences: longer sequences are counted first, so that sequences contained in some longer large sequence need not be counted.

Remark: the time saved by not counting sequences contained in a longer sequence may be less than the time wasted counting candidate sequences without minimum support that would never have been counted, because their subsequences were not large.
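A minimal sketch of the containment test and of the support S_X(b) defined above (not from the notes): a pattern is a list of itemsets, and it is contained in a customer sequence if its itemsets are subsets of some transactions of the sequence, in the same order.

```python
def contains(sequence, pattern):
    """Is the pattern (a list of itemsets) contained in the customer sequence?"""
    i = 0
    for transaction in sequence:
        if i < len(pattern) and pattern[i] <= transaction:
            i += 1
    return i == len(pattern)

def support(X, pattern):
    """S_X(b) = |{x in X : b contained in x}| / |X|"""
    return sum(contains(x, pattern) for x in X) / len(X)

# The example above: X = { (A,C,B,A), (B,C,B,B) }; A, B, C are hypothetical itemsets
A, B, C = {"a"}, {"b"}, {"c"}
X = [[A, C, B, A], [B, C, B, B]]
print(support(X, [B]))       # 1.0, i.e. S_X(<B>) = 2/2
print(support(X, [C, B]))    # 1.0, <C, B> is contained in both sequences
```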
AprioriAll

// Forward phase
L_1 := the set of frequent itemsets (large 1-sequences)
FOR (k = 2; L_{k-1} ≠ ∅; k++) DO
  C_k := the set of new candidates generated from L_{k-1}
  FOR each customer sequence c in the dataset DO
    FOR each candidate d ∈ C_k contained in c DO
      d.count++
  L_k := { d ∈ C_k : d.count ≥ s }      // s = the minimum support
  last := k
// Answer: the maximal frequent sequences in ∪_k L_k
// Backward phase
FOR (k = last; k ≥ 2; k--) DO
  delete all sequences in L_k contained in some L_i, i > k
  return(L_k)

AprioriSome

// Forward phase
L_1 := the set of frequent itemsets; C_1 := L_1
FOR (k = 2; C_{k-1} ≠ ∅ and L_last ≠ ∅; k++) DO
  IF L_{k-1} is known THEN C_k := the set of new candidates generated from L_{k-1}
  ELSE C_k := the set of new candidates generated from C_{k-1}
  IF k = next(last) THEN
    FOR each customer sequence c DO
      FOR each candidate d ∈ C_k contained in c DO d.count++
    L_k := { d ∈ C_k : d.count ≥ s }
    last := k
// Backward phase (L_{k+1}, ..., L_last are already known)
FOR (k = last; k ≥ 2; k--) DO
  IF L_k was not found in the forward phase THEN
    delete all sequences in C_k contained in some L_i, i > k
    FOR each customer sequence c DO
      FOR each candidate d ∈ C_k contained in c DO d.count++
    L_k := { d ∈ C_k : d.count ≥ s }
  ELSE
    delete all sequences in L_k contained in some L_i, i > k
  return(L_k)

Remark: for lower minimum supports there are longer large sequences, and hence more non-maximal large sequences are generated; in this case AprioriSome is preferable.

How to generate the candidates C_k from L_{k-1}:
1) Join L_{k-1} with L_{k-1} (or C_{k-1} with C_{k-1}): for each A, B ∈ L_{k-1} (or C_{k-1}) we form A ∪ B.
2) Select those unions of sequences which have k−2 common itemsets (so that |A ∪ B| = k).
3) Delete all sequences c such that some subsequence of c of length k−1 is not in L_{k-1} (or C_{k-1}).

Example: L_1 = {1, 2, 3, 4, 5} (the frequent itemsets), minimum support S = 0.4,
T = { (1,5,2,3,4), (1,3,4,3,5), (1,2,3,4), (1,3,5), (4,5) }, |T| = 5, S·|X| = 2.

k = 2: supports of the candidate 2-sequences ((i, j) is supported by a sequence of the form (..., i, ..., j, ...)):

        →1    →2    →3    →4    →5
  1     0     0.4   0.8   0.6   0.6
  2     0     0     0.4   0.4   0
  3     0     0     0     0.6   0.4
  4     0     0     0.2   0     0.4
  5     0     0.2   0.2   0.2   0

L_2 = the 2-sequences with support ≥ 0.4. Now we join the remaining 2-sequences by a common element (e.g. (1,2) + (2,4) → (1,2,4)); we do not use those with support less than S = 0.4.

k = 3:  (1,2,3) 0.4, (1,2,4) 0.4, (1,3,4) 0.6, (1,3,5) 0.4, (1,4,5) 0.2, (2,3,4) 0.4, (2,3,5) 0, (2,4,5) 0, (3,4,5) 0
k = 4:  (1,2,3,4) 0.4.  STOP.

And now we go back, keeping only the maximal large sequences. Answer: (4,5), (1,3,5), (1,2,3,4).

Clustering

Given: points in some space X. Goal: group these points into some number of clusters, where each cluster consists of points which are "near" ("similar"). We have a set of points and want to divide it into clusters such that if we take a point from cluster A, then the distance from this point to any other point in cluster A is smaller than the distance between this point and any point that does not belong to cluster A.

A distance measure d is any function d: X × X → R which satisfies the following conditions:
1° d(x, x) = 0 for each x ∈ X
2° d(x, y) = d(y, x) for all x, y ∈ X (symmetry)
3° d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ X

Examples of distance measures

1) Euclidean space, k-dimensional: R^k = { (x_1, ..., x_k) : x_1, ..., x_k ∈ R }; for x = (x_1, ..., x_k), y = (y_1, ..., y_k):
   a. Euclidean distance: d(x, y) = sqrt( Σ_{i=1..k} (x_i − y_i)² )
   b. Manhattan distance: d(x, y) = Σ_{i=1..k} |x_i − y_i|
   c. Maximum over dimensions: d(x, y) = max_{i=1..k} |x_i − y_i|
   d. Hamming distance: d(x, y) = |{ i : x_i ≠ y_i }|

2) X – the space of all strings. The distance between two strings x, y:

   d(x, y) = |x| + |y| − 2·LCS(x, y)

where |x|, |y| are the lengths of x and y, and LCS(x, y) is the length of the longest common subsequence of x and y – the longest sequence z whose elements occur in both x and y in the same order, not necessarily consecutively. For example: x = abcdef, y = bababcdfe, LCS(x, y) = 5 (e.g. "abcdf"), so d(x, y) = 6 + 9 − 2·5 = 5.
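The distance measures above can be sketched directly in code (not from the notes); the LCS length is computed with a standard dynamic-programming routine.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def maximum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def lcs_length(x, y):
    """Length of the longest common subsequence (in order, not necessarily consecutive)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def string_distance(x, y):
    """d(x, y) = |x| + |y| - 2 * LCS(x, y)"""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(string_distance("abcdef", "bababcdfe"))   # 6 + 9 - 2*5 = 5
```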
Clustering of data sets

Given: S – the set of cases; each case consists of l values (corresponding to l variables), so S = { (x_i1, x_i2, ..., x_il) : i = 1, ..., n }, |S| = n. The data form an n × l matrix whose entry x_ik is the value of the k-th variable in the i-th case.

W – the set of weights of the variables, W = { w_k : k = 1, ..., l }, where w_k is the weight of the comparison of the k-th variable: w_k = 1 if the comparison of the k-th variable is valid, and w_k = 0 if not.

Example:             weight (kg)   height (mm)
  Mr. Smith          200           1500
  Mr. X              65            2100

The Euclidean distance between two cases:

  d_ij = d(x_i, x_j) = sqrt( Σ_{k=1..l} w_k (x_ik − x_jk)² / Σ_{k=1..l} w_k )

For each cluster C_p we define:
- the mean of C_p: m_p = (m_p1, ..., m_pl), where m_pk = ( Σ_{i: x_i ∈ C_p} x_ik ) / |C_p| for k = 1, ..., l
- the standard deviation of C_p: σ_p = (1 / |C_p|) · Σ_{i: x_i ∈ C_p} sqrt( Σ_{k=1..l} w_k (x_ik − m_pk)² / Σ_{k=1..l} w_k ), i.e. the average distance of the members of C_p from its mean (for three members, (d_1 + d_2 + d_3) / 3).

The distance between the i-th case x_i and the p-th cluster C_p: d(x_i, C_p) = d(x_i, m_p).
The distance between the clusters C_p and C_q: D_pq = d(C_p, C_q) := d(m_p, m_q).

Approaches to clustering:
1. Centroid approaches (the number of clusters is fixed).
2. Hierarchical approaches (the number of clusters changes): agglomerative or divisive.

A cluster = a group of cases; for a cluster C_p, d(x_i, C_p) = d(x_i, m_p) (here w_k = 1 for each k). If each x_ik (i = 1, ..., n, k = 1, ..., l) is a real number, then the mean m_p does not have to be one of the points: m_p is called the centroid of the cluster C_p (m_p is not necessarily an element of S; it is any point of R^l), with m_pk = ( Σ_{i: x_i ∈ C_p} x_ik ) / |C_p|. Otherwise (if the values of some variables are not numbers), provided some distance measure d(x_i, x_j) is given, the "mean" of the cluster C_p is called the clustroid: it is not necessarily a centre but one of the elements belonging to the cluster – the element of the dataset belonging to C_p that minimizes the sum of the distances to the other points of this cluster.

Standardization of the data

All cases consist of real numbers.

Remark: the values of each variable should be standardized, since we wish the variables to be treated equally; otherwise the clustering would be dominated by the diversity of only some variables – those with the largest standard deviation.

The standardized value of x_ik:

  x'_ik = (x_ik − m_k) / σ_k

where m_k = (1/n) Σ_{i=1..n} x_ik is the mean of the k-th variable and σ_k = sqrt( (1/n) Σ_{i=1..n} (x_ik − m_k)² ) is the standard deviation of the k-th variable.

Remark: the standardized variables have a mean of 0 and a standard deviation of 1.
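A minimal sketch of the z-score standardization and of the weighted distance d_ij defined above (not from the notes); each variable is standardized column by column before the distances are computed.

```python
import math

def standardize(column):
    """z-score: (x - m_k) / sigma_k, so the variable gets mean 0 and std 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]

def weighted_distance(xi, xj, w):
    """d_ij = sqrt( sum_k w_k (x_ik - x_jk)^2 / sum_k w_k )"""
    return math.sqrt(sum(wk * (a - b) ** 2 for a, b, wk in zip(xi, xj, w)) / sum(w))

# Standardize the two-variable example above (weight in kg, height in mm)
cases = [(200, 1500), (65, 2100)]
columns = [standardize(col) for col in zip(*cases)]
standardized = list(zip(*columns))
print(weighted_distance(standardized[0], standardized[1], w=[1, 1]))
```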
The standardized value is also called the standard score or z-score.

Example: five cases x_1, ..., x_5 described by five variables v_1, ..., v_5:

        v1     v2     v3     v4     v5
  x1    65.9   10.4   19.7   2.6    1.4
  x2    90.5   1.7    1.4    6.2    0.4
  x3    71.3   12.3   13.1   1.9    2.3
  x4    46.4   9.7    42     0      0.85
  x5    86.2   3.0    4.8    5.2    0.7

After standardization each variable has mean 0 and standard deviation 1; computing the distances between the standardized cases gives the distance matrix:

        x1     x2     x3     x4     x5
  x1    0      4.02   0.83   2.01   2.45
  x2           0      6.30   8.73   0.19
  x3                  0      4.23   4.39
  x4                         0      6.79
  x5                                0

[Figure: a possible clustering of these cases.]

Clustering methods:
1) k-means algorithms (the number of clusters is fixed)
2) Hierarchical clustering

1) k-means algorithm (the number of clusters is fixed):
a. Take k cases; each of them is the centroid of its own cluster.
b. Each other case is assigned to the cluster that has the nearest centroid.
c. Calculate the new position of the centroid of each cluster: if x_i is assigned to C_p then m_pk := m_pk + (x_ik − m_pk) / n_p for k = 1, ..., l, where n_p = |C_p|.
d. Repeat steps b and c until the centroids no longer change their positions (when recalculated in step c).

2) Hierarchical clustering – will be described next time, together with methods for use in practice.

Possible modifications of the original k-means algorithm:
1. At the beginning we can choose the k clusters by picking k points (cases) which are sufficiently far away from each other.
2. During the computation the number of clusters can be reduced (by joining two clusters if the distance between them is smaller than some user-defined value r) or increased (by splitting one cluster into two new clusters if it is sufficiently large).
3. We can split "the largest" cluster into two new clusters and merge two others (the closest two) to keep the number of clusters at k.

The standard deviation of the cluster C_p:

  σ_p = (1 / |C_p|) · Σ_{i: x_i ∈ C_p} sqrt( Σ_{k=1..l} w_k (x_ik − m_pk)² / Σ_{k=1..l} w_k )

where m_p = (m_p1, m_p2, ..., m_pl) is the centroid of the cluster C_p.

The increase of the sum of standard deviations obtained by splitting one cluster C_{p∪q} into C_p and C_q:

  I_{p,q} = σ_p + σ_q − σ_{p∪q}

We may keep splitting (k = 0, 1, ..., k_0) as long as |I_k| is small enough, where I_k is the increase of the sum of standard deviations when the number of clusters is increased from k to k+1.
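A minimal sketch of the k-means algorithm (steps a–d above), not from the notes: cases are tuples of numbers, plain Euclidean distance is used, and the centroids are recomputed from scratch in each pass rather than with the incremental update.

```python
import math
import random

def kmeans(cases, k, max_iter=100):
    """Cluster a list of numeric tuples into k clusters."""
    centroids = random.sample(cases, k)                 # (a) k cases as initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in cases:                                  # (b) assign to the nearest centroid
            nearest = min(range(k), key=lambda p: math.dist(x, centroids[p]))
            clusters[nearest].append(x)
        new_centroids = [                                # (c) recompute each centroid
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[p]
            for p, c in enumerate(clusters)
        ]
        if new_centroids == centroids:                   # (d) stop when nothing moves
            break
        centroids = new_centroids
    return clusters, centroids
```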
Hierarchical Clustering

1) Agglomerative clustering. We start with n clusters (each cluster contains only one case).
a. Compare all pairs of clusters and find the nearest pair.
b. The distance between this closest pair (denote it by D) is compared to some user-defined value r: if D < r then we join the nearest two clusters into one and RETURN to a; ELSE STOP.

Possible measures of "closeness" of two clusters:
- the distance between their centroids
- the maximum (or minimum, or average) distance between points of the compared clusters (one point from one cluster and the second from the other one)
- the increase of the standard deviation caused by joining the two clusters into one.

2) Divisive clustering. We start with one cluster containing all the points.
a. The distances between all pairs of cases within the same cluster C_p are calculated and the pair with the largest distance is selected.
b. The maximal distance D_p = max_{x_i, x_j ∈ C_p} d(x_i, x_j) between two cases in C_p is compared to some user-defined value s. If D_p > s then C_p is divided into two new clusters: the points x_i and x_j become the seed points of the new clusters, each case in C_p is placed into the new cluster with the nearest seed point, and we RETURN to a; ELSE STOP.

Mixed Clustering

For assigning a new object to the clusters we use 4 operators:
1) assign the new object to one of the existing clusters;
2) form a new, own cluster for the new object (if all existing clusters are far away from the new object);
3) split one cluster into 2 new clusters and assign the new object to the nearest of them (if the standard deviation of the cluster would otherwise become too large);
4) join two existing clusters into one and assign the new object to this cluster (if after adding the new object to one of these clusters the two clusters would be too close; the obtained cluster must not be too large).

Assigning a new object with the Bayes theorem:

  P(A_i | B) = P(B | A_i) · P(A_i) / P(B)

For a new case x = (x_1, ..., x_L), e.g. { outlook = sunny, temp = 75°F, humidity = 101% }, we assign x to the cluster C_i for which P(C_i | x) is maximal, i.e. for which P(x | C_i) · P(C_i) is maximal (j : P(A_j | B) = max_i P(A_i | B)). Assuming independence of the variables:

  P(x_1 = v_1, ..., x_L = v_L | x ∈ C_i) = P(x_1 = v_1 | x ∈ C_i) · ... · P(x_L = v_L | x ∈ C_i)

with

  P(x ∈ C_i) = |C_i| / |X|,   P(x_k = v_k | x ∈ C_i) = |{ y ∈ C_i : y_k = v_k }| / |C_i|

The value of a clustering into k clusters C_1, ..., C_k:

  V(C) = (1/k) · Σ_{p=1..k} (|C_p| / |X|) · [ Σ_{j=1..l} Σ_{v ∈ V_j} P(x_j = v | x ∈ C_p)² − Σ_{j=1..l} Σ_{v ∈ V_j} P(x_j = v)² ]

where V_j is the set of all values of the j-th attribute; the first double sum describes the probabilities after clustering, the second the probabilities before clustering.

Remark: a clustering C¹ is better than C² if V(C¹) > V(C²).

Application of clustering for finding missing values: let x_i = (x_i1, ..., x_ik, ..., x_il) with x_ik missing, and let x_i ∈ C_p with centroid m_p = (m_p1, ..., m_pk, ..., m_pl); we may substitute m_pk for the missing value. Use this only if at most 50% of all values in the cluster are missing.

Fuzzy Clustering

P(x_i ∈ C_p) ≤ 1 (case x_i, cluster C_p). S = {x_1, ..., x_n} – the set of cases; a clustering into clusters C_1, ..., C_k is given; d(x_i, C_p) – the distance between x_i and the cluster C_p.

The membership function (φ > 1 – the degree of fuzziness):

  m_ip = P(x_i ∈ C_p) = (1 / d(x_i, C_p)²)^(1/(φ−1)) / Σ_{q=1..k} (1 / d(x_i, C_q)²)^(1/(φ−1))

so Σ_{p=1..k} m_ip = 1 for each i (in hard clustering m_ip ∈ {0, 1}). For φ = 2 and k = 3:

  m_i1 = (1/d_1²) / (1/d_1² + 1/d_2² + 1/d_3²), and similarly for m_i2, m_i3.

The objective function:

  y = Σ_{i=1..n} Σ_{p=1..k} m_ip^φ · d²(x_i, C_p)

If we are sure that x_i belongs to a given cluster, its term contributes a very small number to y. If d_1 ≈ d_2 ≈ d_3 we do not know where to put x_i, but in fuzzy clustering we still get a noticeable difference between the contributions, since we take powers of these values: e.g. for memberships m_i1 = 0.1 and m_i2 = 0.9, with φ = 3 the weights become m_i1³ = 0.001 and m_i2³ = 0.729.

Remarks:
1) If φ → 1 we get the "hard" clustering.
2) As φ approaches infinity, the differences between the membership functions become very small.

If x is at a large distance from each cluster, then we may choose whichever cluster we want, because for large distances the differences between the memberships will be very small (see Remark 2).

We choose the clustering for which the objective function is minimum.
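A small sketch of the membership function above (the garbled formula was read as the standard fuzzy-c-means form, so treat the exact shape as an assumption); distances are assumed to be positive.

```python
def memberships(distances, phi=2.0):
    """m_ip = (1/d_ip^2)^(1/(phi-1)) / sum_q (1/d_iq^2)^(1/(phi-1)), phi > 1.

    distances: d(x_i, C_1), ..., d(x_i, C_k) of one case to the k clusters (all > 0)."""
    weights = [(1.0 / d ** 2) ** (1.0 / (phi - 1)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

print(memberships([1.0, 2.0, 4.0]))           # the nearest cluster gets the largest share
print(memberships([1.0, 2.0, 4.0], phi=5.0))  # larger phi: memberships move closer together
```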
Naive Bayes Classification

Bayes theorem:

  P(B | A) = P(A | B) · P(B) / P(A)

P(B) – the a priori probability, P(B | A) – the a posteriori probability.

Naive Bayes assumption: the evidence A can be split into independent parts (the attributes of the instance) A_1, ..., A_n, so

  P(A | B) = P(A_1 | B) · P(A_2 | B) · ... · P(A_n | B)
  P(B | A) = P(B) · P(A_1 | B) · ... · P(A_n | B) / P(A)

The weather data – attributes outlook ∈ {sunny, overcast, rainy}, temperature ∈ {hot, mild, cool}, humidity ∈ {high, normal}, windy ∈ {true, false}, decision ∈ {Yes, No}:

  Outlook  Temperature  Humidity  Windy  Decision
  S        h            H         F      N
  S        h            H         T      N
  O        h            H         F      Y
  R        m            H         F      Y
  R        c            N         F      Y
  R        c            N         T      N
  O        c            N         T      Y
  S        m            H         F      N
  S        c            N         F      Y
  R        m            N         F      Y
  S        m            N         T      Y
  O        m            H         T      Y
  O        h            N         F      Y
  R        m            H         T      N

New instance: A = { outlook = sunny, temperature = cool, humidity = high, windy = true }.

a) P(decision = Y | A) = P(B_1) · P(A | B_1) / P(A), where P(B_1) = 9/14 and
   P(A | B_1) = P(outlook = s | B_1) · P(temp = c | B_1) · P(humidity = h | B_1) · P(windy = true | B_1) = (2/9)·(3/9)·(3/9)·(3/9),
   so P(decision = Y | A) ∝ (9/14)·(2/9)·(3/9)·(3/9)·(3/9) ≈ 0.0053.

b) P(decision = N | A) = P(B_2) · P(A | B_2) / P(A), where P(B_2) = 5/14 and
   P(A | B_2) = (3/5)·(1/5)·(4/5)·(3/5),
   so P(decision = N | A) ∝ (5/14)·(3/5)·(1/5)·(4/5)·(3/5) ≈ 0.0206.

The factor 1/P(A) is common to both, so it does not affect the comparison: decision = N.

Data warehouse – a decision-support database that is maintained separately from the operational database. It supports processing by providing a solid platform for data analysis.

OLTP – On-Line Transaction Processing
OLAP – On-Line Analytical Processing (for finding rules)

                   OLAP                 OLTP
  users            knowledge worker     clerk, IT programmer
  function         decision support     day-to-day operation
  DB design        subject oriented     application oriented
  data             historical           current
  queries          complex              simple
  number of users  hundreds             thousands
  DB size          100 GB – TB          100 MB – GB
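The worked example above can be checked with a short script (not from the notes); it scores each decision class by P(B) · Π_i P(A_i | B), which is enough for the comparison since 1/P(A) is a common factor.

```python
from collections import Counter

# The 14 training cases: (outlook, temperature, humidity, windy) -> decision
rows = [
    (("S", "h", "H", "F"), "N"), (("S", "h", "H", "T"), "N"), (("O", "h", "H", "F"), "Y"),
    (("R", "m", "H", "F"), "Y"), (("R", "c", "N", "F"), "Y"), (("R", "c", "N", "T"), "N"),
    (("O", "c", "N", "T"), "Y"), (("S", "m", "H", "F"), "N"), (("S", "c", "N", "F"), "Y"),
    (("R", "m", "N", "F"), "Y"), (("S", "m", "N", "T"), "Y"), (("O", "m", "H", "T"), "Y"),
    (("O", "h", "N", "F"), "Y"), (("R", "m", "H", "T"), "N"),
]

def naive_bayes(rows, new_case):
    """Return the class maximizing P(B) * prod_i P(A_i | B), plus all the scores."""
    classes = Counter(label for _, label in rows)
    scores = {}
    for label, class_count in classes.items():
        score = class_count / len(rows)                       # prior P(B)
        for i, value in enumerate(new_case):                  # likelihoods P(A_i | B)
            matching = sum(1 for attrs, lab in rows if lab == label and attrs[i] == value)
            score *= matching / class_count
        scores[label] = score
    return max(scores, key=scores.get), scores

print(naive_bayes(rows, ("S", "c", "H", "T")))   # ('N', {'N': ~0.0206, 'Y': ~0.0053})
```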