
An Excel-Based Data Mining Tool: iData Analyzer
DM.Lab in University of Seoul, Data Mining Laboratory
April 24th, 2008
Summarized by Sungjick Lee

Contents
• The iData Analyzer
• ESX: A Multipurpose Tool for Data Mining
• iDAV Format for Data Mining
• An Approach for Unsupervised Clustering
• An Approach for Supervised Learning

The iData Analyzer
• Scanning for errors: illegal numeric values, blank lines, missing items
• An exemplar-based data mining tool that builds a concept hierarchy to generalize data
• Allows users to extract a representative subset of the data
• A backpropagation neural network for supervised learning
• A self-organizing feature map for unsupervised clustering

ESX: A Multipurpose Tool for Data Mining (1/2)
• Supports both supervised learning and unsupervised clustering
• Makes no statistical assumptions about the nature of the data
• Provides an automated method for dealing with missing attribute values
• Works in domains containing both categorical and numerical data
• For supervised classification, determines those instances and attributes best able to classify new instances of unknown origin
• For unsupervised clustering, uses a globally optimizing evaluation function that encourages a best instance clustering

ESX: A Multipurpose Tool for Data Mining (2/2)
• Root Level (Root): summary information about the domain
• Concept Level (C1, C2, ..., Cn): summary statistics about the attribute values found within the instance level; class resemblance scores
• Instance Level (I11, I12, ..., I1j; I21, I22, ..., I2k; ...; In1, In2, ..., Inl): the instances that define the concept classes
• Report Generator: summary report in spreadsheet format
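The error scan listed for the iData Analyzer (illegal numeric values, blank lines, missing items) can be illustrated with a minimal Python sketch. This is not iDA's own implementation. The function name `scan_for_errors`, the `numeric_columns` parameter, and the sample rows are assumptions made for illustration only.

```python
import csv
import io


def scan_for_errors(rows, numeric_columns):
    """Return (row_number, message) pairs for the three kinds of problem
    named on the slide: blank lines, missing items, illegal numeric values."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        # A row with no cells, or only empty cells, counts as a blank line.
        if all(cell.strip() == "" for cell in row):
            problems.append((row_number, "blank line"))
            continue
        for column, cell in enumerate(row):
            value = cell.strip()
            if value == "":
                problems.append((row_number, f"missing item in column {column}"))
            elif column in numeric_columns:
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        (row_number, f"illegal numeric value {value!r} in column {column}")
                    )
    return problems


# Illustrative input only: three columns (Income Range, Sex, Age) with
# errors introduced on purpose. This is not data from the slides.
sample = "40-50K,Male,45\n\n30-40K,,40\n40-50K,Male,forty\n"
rows = list(csv.reader(io.StringIO(sample)))
problems = scan_for_errors(rows, numeric_columns={2})
for row_number, message in problems:
    print(f"row {row_number}: {message}")
```

The sketch only reports problems. How iDA repairs or rejects such rows is not described in the slides.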
iDAV Format for Data Mining

Table 4.1 • Credit Card Promotion Database: iDAV Format

Income Range  Magazine Promotion  Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Sex     Age
C             C                   C                C                         C                      C       R
I             I                   I                I                         I                      I       I
40–50K        Yes                 No               No                        No                     Male    45
30–40K        Yes                 Yes              Yes                       No                     Female  40
40–50K        No                  No               No                        No                     Male    42
30–40K        Yes                 Yes              Yes                       Yes                    Male    43
50–60K        Yes                 No               Yes                       No                     Female  38
20–30K        No                  No               No                        No                     Female  55
30–40K        Yes                 No               Yes                       Yes                    Male    35
20–30K        No                  Yes              No                        No                     Male    27
30–40K        Yes                 No               No                        No                     Male    43
30–40K        Yes                 Yes              Yes                       No                     Female  41
40–50K        No                  Yes              Yes                       No                     Female  43
20–30K        No                  Yes              Yes                       No                     Male    29
50–60K        Yes                 Yes              Yes                       No                     Female  39
40–50K        No                  Yes              No                        No                     Male    55
20–30K        No                  No               Yes                       Yes                    Female  19

Second row (attribute type):
• C : categorical (nominal)
• R : real-valued (numerical)

Third row (attribute usage):
• I : input attribute
• U : not used
• D : not used for classification or clustering, but attribute value summary information is displayed
• O : used as an output attribute

An Approach for Unsupervised Clustering
1. Enter data into a new Excel spreadsheet
2. Perform a data mining session
3. Read and interpret summary results
4. Read and interpret results for individual clusters
5. Visualize and interpret rules defining the individual clusters

An Approach for Unsupervised Clustering: Enter data into a new Excel spreadsheet
• CreditCardPromotion.xls

An Approach for Unsupervised Clustering: Perform a data mining session (1/2)
• Instance similarity: a value closer to 100 encourages the formation of new clusters
• Instance similarity: a value closer to 0 discourages the formation of new clusters
• 8 classes are too many! Change the instance similarity value and try again.
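The three header rows of the iDAV format in Table 4.1 (attribute names, type codes, usage codes) can be read with a short Python sketch. The function `parse_idav` and the dictionary layout are hypothetical and are not part of iDA. The rows are the first three instances of Table 4.1, with the en-dash in the income ranges written as a plain hyphen.

```python
# Code meanings taken from the legend of Table 4.1.
TYPE_CODES = {"C": "categorical (nominal)", "R": "real-valued (numerical)"}
USAGE_CODES = {
    "I": "input attribute",
    "U": "not used",
    "D": "display only (summary shown, not used for classification or clustering)",
    "O": "output attribute",
}


def parse_idav(rows):
    """Split iDAV rows into attribute descriptions and instance records."""
    names, types, usages = rows[0], rows[1], rows[2]
    attributes = []
    for name, type_code, usage_code in zip(names, types, usages):
        if type_code not in TYPE_CODES:
            raise ValueError(f"unknown type code {type_code!r} for {name}")
        if usage_code not in USAGE_CODES:
            raise ValueError(f"unknown usage code {usage_code!r} for {name}")
        attributes.append({"name": name, "type": type_code, "usage": usage_code})

    instances = []
    for raw in rows[3:]:
        record = {}
        for attribute, cell in zip(attributes, raw):
            # Real-valued attributes become floats; categorical ones stay strings.
            record[attribute["name"]] = float(cell) if attribute["type"] == "R" else cell
        instances.append(record)
    return attributes, instances


rows = [
    ["Income Range", "Magazine Promotion", "Watch Promotion",
     "Life Insurance Promotion", "Credit Card Insurance", "Sex", "Age"],
    ["C", "C", "C", "C", "C", "C", "R"],
    ["I", "I", "I", "I", "I", "I", "I"],
    ["40-50K", "Yes", "No", "No", "No", "Male", "45"],
    ["30-40K", "Yes", "Yes", "Yes", "No", "Female", "40"],
    ["40-50K", "No", "No", "No", "No", "Male", "42"],
]
attributes, instances = parse_idav(rows)
input_names = [a["name"] for a in attributes if a["usage"] == "I"]
print(input_names)
print(instances[0])
```

For a supervised session, the usage code of the chosen output attribute would be O instead of I.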
• The similarity criterion for real-valued attributes: 1.0 is usually appropriate

An Approach for Unsupervised Clustering: Perform a data mining session (2/2)
Attribute Significance = {largest class mean (Class 1 age = 43.33) - smallest class mean (Class 2 age = 37.00)} / the domain standard deviation

An Approach for Unsupervised Clustering: Result: RES RUL (the generated production rules)

Rules for Class 1
**Total Percent Coverage = 0.00%

Rules for Class 2
Income Range = "20-30,000"
  :rule accuracy 100.00%
  :rule coverage 80.00%
19.00 <= Age <= 29.00
  :rule accuracy 100.00%
  :rule coverage 60.00%
19.00 <= Age <= 29.00 and Income Range = "20-30,000"
  :rule accuracy 100.00%
  :rule coverage 60.00%
19.00 <= Age <= 29.00 and Magazine Promo = No
  :rule accuracy 100.00%
  :rule coverage 60.00%
(middle portion omitted)
**Total Percent Coverage = 80.00%

Rules for Class 3
Income Range = "30-40,000"
  :rule accuracy 80.00%
  :rule coverage 57.14%
Magazine Promo = Yes
  :rule accuracy 75.00%
  :rule coverage 85.71%
Life Ins Promo = Yes
  :rule accuracy 77.78%
  :rule coverage 100.00%
35.00 <= Age <= 43.00
  :rule accuracy 77.78%
  :rule coverage 100.00%
(middle portion omitted)
**Total Percent Coverage = 100.00%

An Approach for Unsupervised Clustering: Result: RES SUM (summary statistics) (1/2)
Resemblance Score
Are the within-class resemblance scores higher than the domain resemblance value? If not, why?
• Bad choice of attributes
• Bad choice of instances
• The domain does not contain definable classes

Attribute Significance = {largest class mean (Class 1 age = 43.33) - smallest class mean (Class 2 age = 37.00)} / the domain standard deviation (9.51)

An Approach for Unsupervised Clustering: Result: RES CLS (statistics about the individual class) (1/2)

Typicality
• The average similarity of an instance to all other members of its cluster

Predictability
• The degree to which a correct forecast is possible
• The percent of instances within a class that have the value
• A within-class measure
• If the score is 1, the value is necessary

Predictiveness
• The state of being predicted
• The probability that an instance resides in the class
• A between-class measure
• If the score is 1, the value is sufficient

An Approach for Unsupervised Clustering: Result: RES CLS (statistics about the individual class) (2/2)
• "Highly" means greater than or equal to 0.80

An Approach for Supervised Learning
1. Enter data into a new Excel spreadsheet and choose an output attribute
2. Perform a data mining session
3. Read and interpret summary results
4. Read and interpret test set results
5. Read and interpret results for individual clusters
6. Visualize and interpret class rules
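The attribute significance figures quoted for Age in the RES SUM slide can be checked with a few lines of Python. This is a sketch for verification, not iDA's code. The function name `attribute_significance` is an assumption. The ages come from Table 4.1 and the two class means from the slide. The use of the sample standard deviation is an inference: it reproduces the quoted 9.51, but the slides do not say which estimator iDA uses.

```python
import statistics

# Age column of Table 4.1 (all 15 instances in the domain).
ages = [45, 40, 42, 43, 38, 55, 35, 27, 43, 41, 43, 29, 39, 55, 19]


def attribute_significance(largest_class_mean, smallest_class_mean, domain_sd):
    """(largest class mean - smallest class mean) / domain standard deviation."""
    return (largest_class_mean - smallest_class_mean) / domain_sd


# Sample standard deviation of the whole domain, rounded as on the slide.
domain_sd = round(statistics.stdev(ages), 2)

# Class means quoted on the slide: Class 1 (largest) and Class 2 (smallest).
significance = attribute_significance(43.33, 37.00, domain_sd)

print(f"domain standard deviation: {domain_sd}")
print(f"attribute significance for Age: {significance:.2f}")
```

With these inputs the domain standard deviation comes out at 9.51 and the significance of Age at about 0.67.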
