AI IN FINANCE AND ECONOMICS: Data Mining
Jerzy Korczak
email: [email protected]
http://www.korczak-leliwa.pl
http://citi-lab.pl

Contents
- Data Mining: definitions
- Inductive Decision Trees
  - Algorithms ID3 and C4.5
  - Tools: WEKA, Orange, CADER, Statistica
- Discovery of Association Rules
  - Apriori and hash trees
  - Tools: WEKA, Orange, Associator, Merja
- Clustering algorithms
  - Classical algorithms: K-means
  - Scalable algorithms: CURE, DBSCAN
- Future Research

What is Data Mining?
Many definitions:
- Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
- Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
- Knowledge discovery in databases (KDD): extraction of interesting, non-trivial, implicit, previously unknown, and potentially useful information (knowledge) or patterns from data in large databases or other information repositories

Why data mining?
- Scientific point of view: data abstraction and KDD
- Commercial point of view: competitive pressure ("necessity is the mother of invention")
- Data is everywhere, so data mining should be everywhere, too. Understanding and using data is an imminent task.

Data Mining Tasks: classification and rule extraction
- Classification has been an essential theme in data mining and statistics research
- Methods: decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc.
- Tree pruning, boosting, and bagging techniques

Sample training set (credit data):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

Efficient and scalable classification methods
- Exploration of attribute-class pairs: SLIQ, SPRINT, RainForest, BOAT, etc.
- Classification of semi-structured and unstructured data
- Classification by clustering of association rules
- Association-based classification
- Web document classification

Statistics vs Data Mining
Statistics is a discipline dedicated to data analysis. What, then, are the differences?
- Huge amounts of data, from giga- to terabytes
- Fast computers: quick response, interactive analysis
- Multi-dimensional, powerful, thorough analysis
- High-level, "declarative" interfaces for the user's ease and control
- Automated or semi-automated: mining functions hidden or built into many systems

Classification
- Given: a database of tuples, each assigned a class label
- Develop a model/profile for each class
- Example profile for "good credit": (25 <= age <= 40 AND income > 40K) OR (married = YES)
- Sample applications: credit card approval (good, bad); bank locations (good, fair, poor); treatment effectiveness (good, fair, poor)

Bayesian Classification
- Based on Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X), where
  - P(H|X) is the a posteriori probability of hypothesis H conditioned on data X
  - P(H) is the a priori probability of H
  - P(X|H) is the probability of X conditioned on H
  - P(X) is the probability of X
- A naive Bayesian classifier assumes class-conditional independence: an attribute value on a given class is independent of the values of the other attributes
- Bayesian belief networks relax this assumption and allow dependencies
- Performance: comparable with decision tree and neural network classifiers

Naive Bayesian Classification
1) Let D be a training set of tuples with associated class labels C1, C2, ..., Cm, where each tuple is X = (x1, x2, ..., xn).
2) Given a tuple X, the classifier predicts that X belongs to the class having the highest posterior probability conditioned on X.
3) P(Ci|X) = P(X|Ci) P(Ci) / P(X)
4) P(X) is constant for all classes, so only P(X|Ci) P(Ci) needs to be maximized. If the class prior probabilities are not known, the classes are assumed equally likely, and we maximize P(X|Ci) alone.
5) To reduce computation, class-conditional independence is assumed:
   P(X|Ci) = Π_k P(xk|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci)
   where P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) can be estimated from the training set.
6) To predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci.
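To make steps 1-6 concrete, here is a minimal Python sketch of a naive Bayesian classifier for categorical attributes. It only illustrates the computation described above; it is not the lecture's tooling (WEKA, Statistica), the function names are ours, and no smoothing is applied for attribute values unseen in a class.

```python
from collections import Counter, defaultdict

def train_nb(tuples, labels):
    """Estimate the class priors P(Ci) and conditionals P(xk|Ci) by counting."""
    prior = Counter(labels)            # class counts, i.e. n * P(Ci)
    cond = defaultdict(Counter)        # (class, attribute index) -> value counts
    for x, c in zip(tuples, labels):
        for k, v in enumerate(x):
            cond[(c, k)][v] += 1
    return prior, cond, len(labels)

def classify_nb(x, prior, cond, n):
    """Return the class maximizing P(X|Ci)P(Ci); P(X) is constant and ignored."""
    best, best_p = None, -1.0
    for c, nc in prior.items():
        p = nc / n                     # P(Ci)
        for k, v in enumerate(x):
            p *= cond[(c, k)][v] / nc  # P(xk|Ci), class-conditional independence
        if p > best_p:
            best, best_p = c, p
    return best

# Tiny usage example with two categorical attributes:
X = [("sunny", "high"), ("sunny", "normal"), ("rainy", "high"), ("overcast", "normal")]
y = ["No", "Yes", "No", "Yes"]
model = train_nb(X, y)
print(classify_nb(("sunny", "high"), *model))   # -> 'No'
```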
Decision Tree Construction
- Pipeline: Data -> Decision Tree -> Decision Rules -> prediction on unseen data
- Key questions: Which attribute to select? How to branch on an attribute?
- ID3, C4.5 [Quinlan]: gain function based on entropy; normalized gain (gain ratio)

Discovery of decision rules: the GOLF example

Outlook   Temp  Humidity  Windy  Play?
sunny     85    85        F      No
sunny     80    90        T      No
overcast  83    78        F      Yes
rainy     70    96        F      Yes
rainy     68    80        F      Yes
rainy     65    70        T      No
overcast  64    65        T      Yes
sunny     72    95        F      No
sunny     69    70        F      Yes
rainy     75    80        F      Yes
sunny     75    70        T      Yes
overcast  72    90        T      Yes
overcast  81    75        F      Yes
rainy     71    80        T      No

A possible tree: the root tests Outlook (sunny, overcast, rainy); the sunny branch tests Humidity (<=75: Yes, >75: No), the rainy branch tests Windy (T: No, F: Yes), and overcast is always Yes. The sunny branch yields the rules:
- IF Outlook = sunny AND Humidity <= 75 THEN we play
- IF Outlook = sunny AND Humidity > 75 THEN we don't play

Constructing decision trees: divide and conquer
- While a leaf node contains objects of more than one class, split its set of objects E into partitions E1, E2, ..., En
- Splitting criterion: X = a for qualitative attributes, X <= a for quantitative attributes (both discrete-valued and continuous-valued attributes are handled)
- Heuristic for the "best" splitting criterion: minimize the expected number of tests needed to classify a given tuple

Attribute selection measure: Shannon entropy
Entropy characterizes the (im)purity, or homogeneity, of an arbitrary collection of instances. The entropy of a set S is

H(S) = -Σi ni/card(S) * log2(ni/card(S))

where ni is the number of tuples of S that belong to class i.

Interpretation:
- Shannon entropy is a measure of the average information content
- It is the minimum number of bits needed to encode a message from S
- An optimal code uses -log2(p) bits for a message with probability p

Attribute selection measure: information gain
- Let H be the entropy of the non-partitioned population
- Let H(i,j) be the entropy of the sub-population having value j of attribute i
- The information gain of attribute i is

G(i) = H - Σj |Sj|/|S| * H(i,j)

The attribute with the highest gain G(i) is chosen as the splitting attribute at node N.
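These formulas can be checked mechanically. Below is a small Python sketch (our own illustration, with hypothetical function names) that reproduces the GOLF numbers computed in the next section: H(S) ≈ 0.94 bits and G(Outlook) ≈ 0.246.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over classes i of ni/card(S) * log2(ni/card(S))."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """G(i) = H(S) - sum over values j of |Sj|/|S| * H(Sj)."""
    n = len(labels)
    g = entropy(labels)
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        g -= len(sub) / n * entropy(sub)
    return g

# The GOLF table above: the Outlook column and the Play class
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
print(round(entropy(play), 2))              # 0.94 bits
print(round(info_gain(outlook, play), 3))   # 0.247 (the slides round to 0.246)
```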
Example: GOLF, entropy of the training set
- Two classes {Yes, No}, four descriptive attributes, 14 tuples (9 Yes, 5 No)
- H(S) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.94 bits

GOLF: discretization of the continuous attributes
- Simple discretization of Temp: breakpoint = (64 + 85) / 2 = 74.5
  - IF Temp < 74.5 THEN cool ELSE hot
- Similarly for Humidity: IF Humidity < 80.5 THEN normal ELSE high

Example: GOLF, information gain of Outlook
G(Outlook) = H - 5/14 * H(Outlook,sunny) - 4/14 * H(Outlook,overcast) - 5/14 * H(Outlook,rainy)
- H(Outlook,sunny) = -2/5 * log2(2/5) - 3/5 * log2(3/5) = 0.971
- H(Outlook,overcast) = -4/4 * log2(4/4) - 0 = 0
- H(Outlook,rainy) = -3/5 * log2(3/5) - 2/5 * log2(2/5) = 0.971
- G(Outlook) = 0.246

Similarly, we compute G(Temp) = 0.029, G(Humidity) = 0.151, and G(Windy) = 0.048, so
MAX(G(Outlook), G(Temp), G(Humidity), G(Windy)) = G(Outlook).
Outlook has the highest information gain, and a branch is grown for each of its outcomes: sunny, overcast, rainy.

Selecting the next attribute
The root split [9+,5-] leaves Outlook = overcast pure ([4+,0-]: Yes), while sunny ([2+,3-]) and rainy ([3+,2-]) still need a test. For the sunny subset:
- G(S_sunny, Humidity) = 0.970 - (3/5)*0 - (2/5)*0 = 0.970
- G(S_sunny, Temp) = 0.970 - (2/5)*0 - (2/5)*1 - (1/5)*0 = 0.570
- G(S_sunny, Windy) = 0.970 - (2/5)*1 - (3/5)*0.918 = 0.019
so Humidity is chosen under Outlook = sunny (high: No [0+,3-], normal: Yes [2+,0-]).

For the rainy subset:
- G(S_rainy, Humidity) = 0.970 - (2/5)*1 - (3/5)*0.918 = 0.019
- G(S_rainy, Temp) = 0.970 - (0/5) - (3/5)*0.918 - (2/5)*1 = 0.019
- G(S_rainy, Windy) = 0.970 - (2/5)*0 - (3/5)*0 = 0.970
so Windy is chosen under Outlook = rainy.
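The worked example above is exactly one step of ID3. As an illustration only (a recursive sketch under our own naming, reusing entropy and info_gain from the earlier block; no pruning and no continuous-attribute handling), the whole divide-and-conquer construction fits in a few lines:

```python
from collections import Counter

def id3(rows, labels, attrs):
    """Divide and conquer: stop when a node is pure, otherwise split on the
    attribute with the highest information gain and recurse on each partition."""
    if len(set(labels)) == 1:              # leaf: only one class left
        return labels[0]
    if not attrs:                          # no attribute left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain([r[a] for r in rows], labels))
    tree = {}
    for v in set(r[best] for r in rows):   # one branch per outcome of `best`
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[(best, v)] = id3([rows[i] for i in idx],
                              [labels[i] for i in idx],
                              [a for a in attrs if a != best])
    return tree

# rows are dicts such as {"Outlook": "sunny", "Humidity": "high", "Windy": "F"};
# on the discretized GOLF data, the root split chosen is Outlook, as computed above.
```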
Example GOLF: the final decision tree

Outlook?
- sunny -> Humidity?
  - normal -> Yes
  - high -> No
- overcast -> Yes
- rainy -> Windy?
  - false -> Yes
  - true -> No

Rule extraction
- Traverse the decision tree using a depth-first strategy (expand a node at the deepest level of the tree)
- A decision rule is a path from the root to a leaf node
Decision rules:
- IF Outlook = sunny AND Humidity = normal THEN Yes
- IF Outlook = sunny AND Humidity = high THEN No
- IF Outlook = overcast THEN Yes
- IF Outlook = rainy AND Windy = false THEN Yes
- IF Outlook = rainy AND Windy = true THEN No

Overfitting
Why should we try to find a concise tree? Ockham's razor: the most likely hypothesis is the simplest one that is consistent with all observations.
- The more generic a hypothesis, the more chances to refute it
- The size of a hypothesis depends on the knowledge representation language
- Effect of a noisy example: (sunny, >75, normal, true, No)
- Experimental error vs. real error

Problems of the quality of decision rules
Avoid overfitting:
- Stop growing the tree when the data are no longer statistically significant, or
- Generate the full tree, then prune it
Selection of the best decision tree:
- Evaluate performance on distinct (held-out) data
- Carry out statistical tests
- MDL: minimize size(tree) + size(classification errors(tree))

Extensions
- Handling continuous attributes
- Gain ratio
- Missing attribute values
- Cost of attributes
- Rule post-pruning (C4.5), rule precision, precision-based rule ordering, cost complexity
- Software: Statistica, SAS, WEKA, Sipina, ...

C4.5
C4.5 is an extension of the basic ID3 [Quinlan, 93]:
- Avoiding overfitting the data: reduced-error pruning, rule post-pruning
- Handling continuous attributes
- Choosing an appropriate attribute selection measure
- Handling training data with missing attribute values
- Improving computational efficiency
- Handling attributes with differing costs: Gain^2(S,A) / Cost(A)

Gain ratio
The information gain is biased toward tests with many outcomes, and such a fine partitioning is sometimes useless for classification. C4.5 therefore applies a kind of normalization (a code sketch follows at the end of this section):

SplitInfo(S,A) = -Σ_{i=1..c} |Si|/|S| * log2(|Si|/|S|)
GainRatio(S,A) = Gain(S,A) / SplitInfo(S,A)

Tool: CADER

Decision trees: pros and cons
Pros:
- Fast execution time
- Generated rules are easy for humans to interpret
- Scale well to large data sets
- Can handle high-dimensional data
Cons:
- Cannot capture correlations among attributes
- Consider only axis-parallel cuts
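As announced in the gain-ratio discussion above, here is a minimal sketch of C4.5's normalization (our own illustration, reusing info_gain from the earlier block; not Quinlan's actual implementation):

```python
import math

def split_info(values):
    """SplitInfo(S,A) = -sum over partitions Si of |Si|/|S| * log2(|Si|/|S|)."""
    n = len(values)
    return -sum(values.count(v) / n * math.log2(values.count(v) / n)
                for v in set(values))

def gain_ratio(values, labels):
    """GainRatio(S,A) = Gain(S,A) / SplitInfo(S,A)."""
    si = split_info(values)
    return info_gain(values, labels) / si if si > 0 else 0.0

# On the GOLF data, gain_ratio(outlook, play) penalizes many-valued attributes:
# an "ID"-like attribute with one value per tuple would get a huge SplitInfo
# and therefore a small gain ratio, despite its maximal information gain.
```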
Mining Association Rules
- Given: a database of customer transactions, where each transaction is a set of items
- Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
- Example: 98% of people who purchase diapers and baby food also buy beer
- Any number of items may appear in the consequent/antecedent of a rule
- It is possible to specify constraints on the rules (e.g., find only rules involving expensive imported products)

Association rules: sample applications
- Market basket analysis (figure: customers buying chips, customers buying beer, and customers buying both)
- Attached mailing in direct marketing
- Fraud detection for medical insurance
- Department store floor/shelf planning

Confidence and support (MinSupport and MinConfidence)
- A rule must have some minimum user-specified confidence: "1 & 2 => 3" has 90% confidence if, when a customer bought 1 and 2, in 90% of cases the customer also bought 3
- A rule must have some minimum user-specified support: "1 & 2 => 3" should hold in some minimum percentage of transactions to have business value
- Find rules X & Y => Z with support > s and confidence > c, where
  - support s is the probability that a transaction contains {X, Y, Z}
  - confidence c is the conditional probability of the consequent Z given the antecedent {X, Y}:
    confidence = support(X,Y,Z) / support(X,Y)

Example transactions:

ID    Items
2000  A, B, C
1000  A, C
4000  A, D
5000  B, E, F

For minimum support 50% and minimum confidence 50%:
- A => C (support 50%, confidence 66.6%)
- C => A (support 50%, confidence 100%)

Algorithm Apriori
- Join step: the candidate set Ck is generated by joining Fk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

Fk: set of frequent itemsets of size k
Ck: set of candidate itemsets of size k

F1 = {frequent 1-itemsets}
for (k = 1; Fk != 0; k++) do {
    Ck+1 = new candidates generated from Fk
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t
    Fk+1 = candidates in Ck+1 with minimum support
}
Answer = ∪k Fk

(A runnable Python sketch of this loop appears at the end of this section.)

Apriori: example with min_support = 2

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D -> C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
-> L1: {1}:2, {2}:3, {3}:3, {5}:3
C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
-> L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
C3: {2 3 5}; scan D -> L3: {2 3 5}:2

Apriori's rule derivation
Frequent itemsets are not yet association rules. For every non-empty subset A of a frequent itemset X:
1. Let B = X - A
2. A => B is an association rule if confidence(A => B) >= minConf,
   where confidence(A => B) = support(AB) / support(A) and support(A => B) = support(AB)
Example: X = {2,3,5} is a frequent itemset with minSupp = 50%. Among the derived rules:
- (2,3) => 5 with confidence 100%
- 2 => (3,5) with confidence 67%
- ...

Computing association rules requires exponential computation
The itemset lattice over {a, b, c, d} contains every non-empty subset, from {a}, {b}, {c}, {d} up to {a,b,c,d}: given m items, there are 2^m - 1 possible item combinations.

Drawbacks of Apriori
- Principle: use frequent (k-1)-itemsets to generate candidate k-itemsets
- Counting candidate sets is expensive: 10^4 frequent 1-itemsets generate about 10^7 candidate 2-itemsets, and finding a 100-itemset requires generating on the order of 2^100 ≈ 10^30 candidates
- Many database scans: (n + 1) scans are needed to find frequent n-itemsets, which can take a prohibitive amount of time
- Ways of improvement: a better candidate-generation process, and efficient data structures for storing the candidate and frequent sets

Handling exponential complexity
Given n transactions and m different items:
- number of possible association rules: O(m * 2^(m-1))
- computation complexity: O(n * m * 2^m)
Systematic search for all patterns is based on the support constraint [Agarwal & Srikant]:
- If {A,B} has support at least α, then both A and B have support at least α
- If either A or B has support less than α, then {A,B} has support less than α
- Use patterns of n-1 items to find patterns of n items
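A compact Python sketch of the level-wise algorithm above, run on the lecture's example database (an illustration with our own function names; real implementations count candidates with hash trees rather than rescanning):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: frequent k-itemsets are joined into (k+1)-candidates,
    pruned unless every k-subset is frequent, then counted against the base."""
    items = {frozenset([i]) for t in transactions for i in t}
    freq = {}
    f_k = {s for s in items if sum(s <= t for t in transactions) >= min_support}
    k = 1
    while f_k:
        for s in f_k:
            freq[s] = sum(s <= t for t in transactions)
        # join step: F_k joined with itself gives the (k+1)-candidates
        cands = {a | b for a in f_k for b in f_k if len(a | b) == k + 1}
        # prune step: drop a candidate if any of its k-subsets is infrequent
        cands = {c for c in cands
                 if all(frozenset(s) in f_k for s in combinations(c, k))}
        f_k = {c for c in cands
               if sum(c <= t for t in transactions) >= min_support}
        k += 1
    return freq

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # TIDs 100..400 from above
print(apriori([frozenset(t) for t in D], 2))
# the result includes frozenset({2, 3, 5}) with support 2, matching L3 above
```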
Improving the efficiency of Apriori
- Hash-based itemset counting
- Transaction reduction (a transaction containing no frequent k-itemset need not be scanned in future iterations)
- Partitioning (to find candidate itemsets per partition)
- Sampling (mining on a subset of the given data)
- Dynamic itemset counting (adding candidate itemsets at different points during a scan)

Algorithm: hash-based itemset counting

Database D:
TID   List of items
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

Hash table with h(I, J) = (num(I) * 10 + num(J)) mod 7:

Address  Count  Content
0        2      {I1,I4} {I3,I5}
1        2      {I1,I5} {I1,I5}
2        4      {I2,I3} {I2,I3} {I2,I3} {I2,I3}
3        2      {I2,I4} {I2,I4}
4        2      {I2,I5} {I2,I5}
5        4      {I1,I2} {I1,I2} {I1,I2} {I1,I2}
6        4      {I1,I3} {I1,I3} {I1,I3} {I1,I3}

(A sketch of this bucket-counting pass appears after this section.)

Multilevel association rules
- A concept hierarchy is defined over the items, e.g. Product -> {Computer, Software}, Computer -> {Laptop, Desktop}, Laptop -> {HP, Dell}, Software -> {Office, AntiVir}
- Data can be generalized by replacing low-level concepts with their ancestors
- A top-down strategy is employed: the items on a lower level have a lower support rate

TID  Items
T1   {111, 121, 211, 221}
T2   {111, 211, 222, 323}
T3   {112, 122, 221, 411}
T4   {111, 121}
T5   {111, 122, 211, 221, 413}

Multilevel exploration (top-down):
- Find "strong" rules on the upper levels first: computer => software [support 20%, confidence 60%]
- Then pass to the lower levels, with lower support thresholds: laptop => Office [6%, 50%]
- Level-crossing rules: laptop => MS Office

Sequential patterns
[Agrawal, Srikant 95], [Srikant, Agrawal 96]
- Given: a sequence of customer transactions, each transaction being a set of items
- Find all maximal sequential patterns (sequential dependencies), such as (A B) C (D E), supported by more than a user-specified percentage of customers
- Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction; 10% is the support of the pattern
- An Apriori-style algorithm can be used to compute frequent sequences

Sequential patterns and time-series analysis
- Trend analysis: trend movement vs. cyclic variations, seasonal variations, and random fluctuations
- Similarity search in time-series databases: handling gaps, scaling, etc.; indexing methods and query languages for time series
- Sequential pattern mining: various kinds of sequences and various methods, from GSP to PrefixSpan
- Periodicity analysis: full periodicity, partial periodicity, cyclic association rules

Efficient methods for mining association rules
- Apriori [Agrawal, Srikant 94]
- DHP (Apriori + hashing) [Park, Chen, Yu 95]: a k-itemset enters Ck only if it hashes into a bucket satisfying minimum support
- Partitioning [Savasere, Omiecinski, Navathe 95]: any potentially frequent itemset appears as a frequent itemset in at least one of the partitions
- Random sampling [Toivonen 96]
- Dynamic Itemset Counting [Brin, Motwani, Ullman, Tsur 97]: during a pass, if an itemset becomes frequent, start counting support for all its supersets with frequent subsets
- FUP [Cheung, Han, Ng, Wong 96]: incremental algorithm
- Parallel and distributed methods: Count distribution [Agrawal, Shafer 96]; PDM [Park, Chen, Yu 95], which uses a hashing technique to identify k-itemsets from local databases; FDM [Cheung, Han, Ng, Fu, Fu 96]

Partitioning and clustering
(Figure: scatter plots of partitioned data with support thresholds s = 50, s/p = 25 for p = 2 partitions, and s/pq = 5.)
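As referenced in the hash-table example above, the following sketch reproduces the bucket counts for h(I,J) = (num(I)*10 + num(J)) mod 7. It is our own illustration; the `num` mapping of item Ik to k is an assumption matching the slide's numbering.

```python
from itertools import combinations

def num(item):
    return int(item[1:])          # assumed mapping: 'I1' -> 1, ..., 'I5' -> 5

def hash_bucket_counts(transactions, table_size=7):
    """DHP-style pass: hash every 2-itemset of every transaction into a bucket.
    A bucket whose count falls below min_support cannot contain a frequent
    2-itemset, so all its candidates can be pruned before C2 is generated."""
    counts = [0] * table_size
    for t in transactions:
        for i, j in combinations(sorted(t, key=num), 2):
            counts[(num(i) * 10 + num(j)) % table_size] += 1
    return counts

D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
     {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
print(hash_bucket_counts(D))      # [2, 2, 4, 2, 2, 4, 4], as in the table above
```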
Clustering
- Given: data points and the number of desired clusters K
- Group the data points into K clusters so that points within a cluster are more similar to each other than to points across clusters
- Sample applications: customer segmentation; market basket customer analysis; attached mailing in direct marketing; clustering companies with similar growth

Traditional algorithms (from the database and machine learning communities)
- Hierarchical clustering (agglomerative and divisive): nested partitions, tree structure
- Partitioning methods: K-means

K-means: example
Let A = {1, 2, 3, 6, 7, 8, 13, 15, 17}; create 3 clusters in A.
- Take 3 objects at random as seeds, e.g. 1, 2 and 3: C1={1}, M1=1; C2={2}, M2=2; C3={3}, M3=3
- Each object is assigned to the closest cluster; 6 goes to C3 because dist(M3,6) < dist(M2,6) and dist(M3,6) < dist(M1,6), and so do 7, 8, 13, 15, 17. Result: C1={1}, M1=1; C2={2}, M2=2; C3={3,6,7,8,13,15,17}, M3=69/7=9.86
- dist(3,M2) < dist(3,M3), so 3 moves to C2; the other objects stay in C3: C1={1}, M1=1; C2={2,3}, M2=2.5; C3={6,7,8,13,15,17}, M3=66/6=11
- dist(6,M2) < dist(6,M3), so 6 moves to C2; the other objects do not move: C2={2,3,6}, M2=11/3=3.67; C3={7,8,13,15,17}, M3=12
- dist(2,M1) < dist(2,M2), so 2 moves to C1; dist(7,M2) < dist(7,M3), so 7 moves to C2: C1={1,2}, M1=1.5; C2={3,6,7}, M2=16/3=5.33; C3={8,13,15,17}, M3=13.25
- dist(3,M1) < dist(3,M2), so 3 moves to C1; dist(8,M2) < dist(8,M3), so 8 moves to C2: C1={1,2,3}, M1=2; C2={6,7,8}, M2=7; C3={13,15,17}, M3=15
- Nothing changes any more. End.
(Figure: K-means iterations on a 2-D point set: initial centers, assignment of points, recomputation of centers.)
A batch-style sketch of this procedure appears at the end of this section.

Clustering: summary of drawbacks of traditional methods
- Partition-based algorithms split large clusters
- Centroid-based methods split large and non-hyperspherical clusters; the centers of subclusters can be far apart
- The minimum-spanning-tree algorithm is sensitive to outliers and to slight changes in position, and exhibits a chaining effect on strings of outliers
- They cannot scale up to large databases

Scalable clustering algorithms
- CLARANS: samples the database
- DBSCAN: density-based method
- BIRCH: partitions objects hierarchically using a tree structure
- CLIQUE: integrates density-based and grid-based methods
- CURE
- ROCK: merges clusters based on their interconnectivity
- COBWEB and CLASSIT
- Neural networks: SOM, GNG, ...

CURE (Clustering Using REpresentatives, 1998)
- Stops at k clusters
- Based on representative points, whereas the classical methods generate centroid-based clusters (figure)
- Merging via representative points: the representative points are shrunk toward the cluster center by a factor α (which dampens outliers); the points allow the shape of the cluster to be captured (figure)

Outlier discovery
- Given: data points and the number n of outliers to find
- Find the top-n outlier points; outliers are considerably dissimilar from the remainder of the data
- Sample applications: credit card fraud detection; telecom fraud detection; customer segmentation; medical analysis
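The example above walks through the moves one object at a time; a standard batch K-means (assign all points, then recompute all means) reaches the same final clusters on this data. A minimal 1-D sketch, with our own function name:

```python
def kmeans_1d(points, means, max_iters=20):
    """Batch K-means: assign each point to the nearest mean, recompute the
    means as cluster averages, and repeat until the means stop changing."""
    for _ in range(max_iters):
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        new_means = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, means)]
        if new_means == means:      # converged: assignments are stable
            break
        means = new_means
    return clusters, means

A = [1, 2, 3, 6, 7, 8, 13, 15, 17]
print(kmeans_1d(A, [1.0, 2.0, 3.0]))
# converges to C1={1,2,3}, C2={6,7,8}, C3={13,15,17} with means 2, 7, 15
```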
Data Mining with WEKA: the software
- Machine learning / data mining software written in Java (distributed under the GNU Public License)
- Used for research, education, and applications
- Complements "Data Mining" by Witten & Frank
- Main features:
  - Comprehensive set of data pre-processing tools, learning algorithms, and evaluation methods
  - Graphical user interfaces (including data visualization)
  - Environment for comparing learning algorithms

Social network analysis
- Which people are powerful? Which people influence other people?
- How does information spread within the network?
- Who is relatively isolated, and who is well connected?
Applications:
- Internet search engines: the Google search engine's PageRank algorithm
- Marketing: viral, "word-of-mouth" advertising (e.g., Hotmail's free email service)
- Fraud detection: AML (anti-money-laundering) systems, ...

Social network analysis: measures
SNA is the science of using network theory to construct, view, and analyze social networks.
- Degree centrality: the number of direct relationships that an entity has (a code sketch appears at the end of this section)
- Closeness centrality: how quickly an entity can reach the other entities in the network
- Betweenness centrality: an entity's position within a network in terms of its ability to make connections between other pairs or groups
- Eigenvalue (eigenvector) centrality: how close an entity is to other highly central entities within the network

Link analysis
Link analysis techniques are applied to data that can be represented as nodes and links.
- A node (vertex): a person, a bank account, a document, ...
- A link: a relationship, e.g., between two bank accounts
(Figures: example social networks with degree centrality; Large Graph Mining [C. Faloutsos et al., KDD 2009]; a Wroclaw network example; the network of Latvian parties, Unity, Zatler's Reform Party, the National Alliance, Harmony Center, and Greens/Farmers, where the ruling coalition of Unity, Zatler's Reform Party, and the National Alliance is all majority ethnic Latvian.)

Future research issues
- Incorporating constraints into existing data mining techniques:
  - Traditional algorithms impose a disproportionate computational cost on selective users and return an overwhelming volume of potentially useless results
  - A user-controlled focus is needed in the mining process, e.g., association rules containing certain items, or sequential patterns containing certain patterns
- Tight coupling with the DBMS:
  - Most data mining algorithms are based on flat-file data (i.e., loose coupling with the DBMS)
  - A set of standard data mining operators is needed (e.g., a sampling operator)
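To close, a small sketch of the simplest SNA measure defined above, degree centrality, on a hypothetical undirected link dataset (our own example, not from the lecture):

```python
def degree_centrality(edges):
    """Degree centrality: the number of direct relationships an entity has,
    here normalized by the n - 1 other nodes in the network."""
    nodes = {v for e in edges for v in e}
    degree = {v: 0 for v in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(nodes)
    return {v: d / (n - 1) for v, d in degree.items()}

# Hypothetical network: accounts linked by money transfers (AML-style data)
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
print(degree_centrality(edges))   # account "A" has the highest centrality, 0.75
```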