Tutorial On Fuzzy Clustering

Jan Jantzen
Technical University of Denmark
[email protected]

Abstract

Problem: To extract rules from data
Method: Fuzzy c-means
Results: e.g., finding cancer cells

Cluster (www.m-w.com)

A number of similar individuals that occur together, as
a: two or more consecutive consonants or vowels in a segment of speech
b: a group of houses (...)
c: an aggregation of stars or galaxies that appear close together in the sky and are gravitationally associated.

Cluster analysis (www.m-w.com)

A statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics.

Vehicle Example

Vehicle   Top speed [km/h]   Colour   Air resistance   Weight [kg]
V1        220                red      0.30             1300
V2        230                black    0.32             1400
V3        260                red      0.29             1500
V4        140                gray     0.35             800
V5        155                blue     0.33             950
V6        130                white    0.40             600
V7        100                black    0.50             3000
V8        105                red      0.60             2500
V9        110                gray     0.55             3500

Vehicle Clusters

[Figure: weight [kg] plotted against top speed [km/h], with three clusters labelled Lorries, Sports cars, and Medium market cars.]

Terminology

[Figure: the same plot annotated with the terms object or data point, feature (each axis), feature space, cluster, and label.]

Example: Classify cracked tiles

475 Hz   557 Hz   Ok?
0.958    0.003    Yes
1.043    0.001    Yes
1.907    0.003    Yes
0.780    0.002    Yes
0.579    0.001    Yes
0.003    0.105    No
0.001    1.748    No
0.014    1.839    No
0.007    1.021    No
0.004    0.214    No

Table 1: frequency intensities for ten tiles.

Tiles are made from clay moulded into the right shape, brushed, glazed, and baked. Unfortunately, the baking may produce invisible cracks. Operators can detect the cracks by hitting the tiles with a hammer, and in an automated system the response is recorded with a microphone, filtered, Fourier transformed, and normalised. A small set of data is given in Table 1 (adapted from MIT, 1997).
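The tile plots in this tutorial show the logarithms of the intensities rather than the raw values. As a minimal sketch, the data of Table 1 can be held in a list and transformed as follows. The names `TILES` and `log_features` are hypothetical, and the natural logarithm is an assumption, chosen because it roughly matches the plotted axis range of -8 to 2.

```python
import math

# Table 1: (intensity at 475 Hz, intensity at 557 Hz, tile is OK?)
TILES = [
    (0.958, 0.003, True),
    (1.043, 0.001, True),
    (1.907, 0.003, True),
    (0.780, 0.002, True),
    (0.579, 0.001, True),
    (0.003, 0.105, False),
    (0.001, 1.748, False),
    (0.014, 1.839, False),
    (0.007, 1.021, False),
    (0.004, 0.214, False),
]


def log_features(rows):
    """Return (log 475 Hz, log 557 Hz) pairs; natural log is an assumption."""
    return [(math.log(a), math.log(b)) for a, b, _ in rows]


features = log_features(TILES)
for (x, y), (_, _, ok) in zip(features, TILES):
    # One line per tile: the two log intensities and the known condition.
    print(f"{x:7.3f} {y:7.3f} {'whole' if ok else 'cracked'}")
```

In this sample the whole tiles have the larger log intensity at 475 Hz and the cracked tiles have the larger one at 557 Hz, which is why the two groups appear well separated in the feature space.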
Algorithm: hard c-means (HCM)
(also known as k-means)

[Figure: Tiles data, log(intensity) at 557 Hz against log(intensity) at 475 Hz; o = whole tiles, * = cracked tiles, x = centres.]

Plot of tiles by frequencies (logarithms). The whole tiles (o) seem well separated from the cracked tiles (*). The objective is to find the two clusters.

[Figure: Tiles data with two cluster centres added.]

1. Place two cluster centres (x) at random.
2. Assign each data point (* and o) to the nearest cluster centre (x).

[Figure: Tiles data with the centres moved.]

1. Compute the new centre of each class.
2. Move the crosses (x).

[Figures: Tiles data at iterations 2, 3, and 4.]

Iteration 4 (then stop, because no visible change). Each data point belongs to the cluster defined by the nearest centre.

M =
0.0000 1.0000
0.0000 1.0000
0.0000 1.0000
0.0000 1.0000
0.0000 1.0000
1.0000 0.0000
1.0000 0.0000
1.0000 0.0000
1.0000 0.0000
1.0000 0.0000

The membership matrix M:
1. The last five data points (rows) belong to the first cluster (column)
2. The first five data points (rows) belong to the second cluster (column)

Membership matrix M

$$
m_{ik} =
\begin{cases}
1 & \text{if } \|u_k - c_i\|^2 \le \|u_k - c_j\|^2 \text{ for each } j \ne i \\
0 & \text{otherwise}
\end{cases}
$$

Here $u_k$ is data point $k$, $c_i$ is cluster centre $i$, $c_j$ is cluster centre $j$, and $\|\cdot\|$ is the distance.

c-partition

All clusters C together fill the whole universe U:
$$\bigcup_{i=1}^{c} C_i = U$$

Clusters do not overlap:
$$C_i \cap C_j = \emptyset \quad \text{for all } i \ne j$$

A cluster C is never empty and it is smaller than the whole universe U:
$$\emptyset \subset C_i \subset U \quad \text{for all } i$$

There must be at least 2 clusters in a c-partition and at most as many as the number of data points K:
$$2 \le c \le K$$

Objective function

Minimise the total sum of all distances:
$$J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \left( \sum_{k,\, u_k \in C_i} \|u_k - c_i\|^2 \right)$$

Algorithm: fuzzy c-means (FCM)

[Figure: Tiles data, log(intensity) at 557 Hz against log(intensity) at 475 Hz; o = whole tiles, * = cracked tiles, x = centres.]

Each data point belongs to two clusters to different degrees.

[Figure: Tiles data with two cluster centres added.]

1. Place two cluster centres.
2. Assign a fuzzy membership to each data point depending on distance.

[Figure: Tiles data with the centres moved.]

1. Compute the new centre of each class.
2. Move the crosses (x).

[Figures: Tiles data at iterations 2, 5, 10, and 13.]

Iteration 13 (then stop, because no visible change). Each data point belongs to the two clusters to a degree.

M =
0.0025 0.9975
0.0091 0.9909
0.0129 0.9871
0.0001 0.9999
0.0107 0.9893
0.9393 0.0607
0.9638 0.0362
0.9574 0.0426
0.9906 0.0094
0.9807 0.0193

The membership matrix M:
1. The last five data points (rows) belong mostly to the first cluster (column)
2. The first five data points (rows) belong mostly to the second cluster (column)

Fuzzy membership matrix M

$$
m_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{d_{ik}}{d_{jk}} \right)^{2/(q-1)}},
\qquad d_{ik} = \|u_k - c_i\|
$$

Here $m_{ik}$ is point $k$'s membership of cluster $i$, $q$ is the fuzziness exponent, $d_{ik}$ is the distance from point $k$ to the current cluster centre $i$, and $d_{jk}$ is the distance from point $k$ to the other cluster centres $j$.

Fuzzy membership matrix M

$$
\begin{aligned}
m_{ik} &= \frac{1}{\sum_{j=1}^{c} \left( \dfrac{d_{ik}}{d_{jk}} \right)^{2/(q-1)}} \\
&= \frac{1}{\left( \dfrac{d_{ik}}{d_{1k}} \right)^{2/(q-1)} + \left( \dfrac{d_{ik}}{d_{2k}} \right)^{2/(q-1)} + \cdots + \left( \dfrac{d_{ik}}{d_{ck}} \right)^{2/(q-1)}} \\
&= \frac{\dfrac{1}{d_{ik}^{2/(q-1)}}}{\dfrac{1}{d_{1k}^{2/(q-1)}} + \dfrac{1}{d_{2k}^{2/(q-1)}} + \cdots + \dfrac{1}{d_{ck}^{2/(q-1)}}}
\end{aligned}
$$

Gravitation to cluster i relative to total gravitation.

Electrical Analogy

[Figure: a voltage U driving a total current I through parallel resistors R_1, R_2, ..., carrying currents i_1, i_2, ...]

$$
U = R I, \qquad R = \frac{1}{\dfrac{1}{R_1} + \dfrac{1}{R_2} + \cdots + \dfrac{1}{R_c}}
$$

$$
i_i = \frac{1}{R_i} U, \qquad
\frac{i_i}{I} = \frac{\dfrac{1}{R_i}}{\dfrac{1}{R_1} + \dfrac{1}{R_2} + \cdots + \dfrac{1}{R_c}}
$$

Same form as $m_{ik}$.

Fuzzy Membership

[Figure: membership of a test point (0 to 1) plotted over positions 1 to 5, with the data point and the cluster centres marked; o is with q = 1.1, * is with q = 2.]

Fuzzy c-partition

All clusters C together fill the whole universe U.
Remark: The sum of memberships for a data point is 1, and the total for all points is K.
$$\bigcup_{i=1}^{c} C_i = U$$

Not valid: $C_i \cap C_j = \emptyset$ for all $i \ne j$. Clusters do overlap.

A cluster C is never empty and it is smaller than the whole universe U:
$$\emptyset \subset C_i \subset U \quad \text{for all } i$$

There must be at least 2 clusters in a c-partition and at most as many as the number of data points K:
$$2 \le c \le K$$

Example: Classify cancer cells

Normal smear

Using a small brush, cotton stick, or wooden stick, a specimen is taken from the uterine cervix and smeared onto a thin, rectangular glass plate, a slide. The purpose of the smear screening is to diagnose pre-malignant cell changes before they progress to cancer. The smear is stained using the Papanicolaou method, hence the name Pap smear. Different characteristics take on different colours, which are easy to distinguish under a microscope. A cyto-technician performs the screening under a microscope. It is time consuming and prone to error, as each slide may contain up to 300,000 cells.

Severely dysplastic smear

Dysplastic cells have undergone precancerous changes. They generally have longer and darker nuclei, and they have a tendency to cling together in large clusters. Mildly dysplastic cells have enlarged and bright nuclei. Moderately dysplastic cells have larger and darker nuclei. Severely dysplastic cells have large, dark, and often oddly shaped nuclei. The cytoplasm is dark, and it is relatively small.

Possible Features

Nucleus and cytoplasm area
Nucleus and cyto brightness
Nucleus shortest and longest diameter
Cyto shortest and longest diameter
Nucleus and cyto perimeter
Nucleus and cyto no of maxima
(...)

Classes are nonseparable

Hard Classifier (HCM)

[Figure: cells coloured by class; legend: Ok, light, moderate, severe.]

A cell is either one or the other class defined by a colour.

Fuzzy Classifier (FCM)

[Figure: cells coloured by class; legend: Ok, light, moderate, severe.]

A cell can belong to several classes to a degree, i.e., one column may have several colours.
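The fuzzy classifier rests on the fuzzy c-means iteration described earlier in the tutorial: place the centres, assign fuzzy memberships by distance, recompute the centres, and repeat until nothing changes. The following is a minimal sketch of that loop in Python (the tutorial's own exercise uses Matlab), applied to the tile data of Table 1. Several things here are assumptions and not taken from the slides: the natural logarithm of the intensities, the starting centres `START`, the choice q = 2, the stopping tolerance, and all function names. The centre update, a mean weighted by the memberships raised to the power q, is the standard fuzzy c-means update; the slides only say "compute the new centre of each class". The helper `harden` shows how a hard, one-class-per-point decision can be read off the fuzzy memberships.

```python
import math

# Table 1 intensities (475 Hz, 557 Hz): first five tiles whole, last five cracked.
RAW = [(0.958, 0.003), (1.043, 0.001), (1.907, 0.003), (0.780, 0.002),
       (0.579, 0.001), (0.003, 0.105), (0.001, 1.748), (0.014, 1.839),
       (0.007, 1.021), (0.004, 0.214)]
POINTS = [(math.log(a), math.log(b)) for a, b in RAW]  # natural log: assumption


def memberships(points, centres, q=2.0):
    """m[k][i] = 1 / sum_j (d_ik / d_jk) ** (2 / (q - 1))."""
    power = 2.0 / (q - 1.0)
    m = []
    for p in points:
        d = [math.dist(p, c) for c in centres]
        if 0.0 in d:
            # Point lies exactly on a centre: give it full membership there.
            hit = d.index(0.0)
            m.append([1.0 if i == hit else 0.0 for i in range(len(d))])
            continue
        m.append([1.0 / sum((di / dj) ** power for dj in d) for di in d])
    return m


def update_centres(points, m, q=2.0):
    """Mean of the points weighted by m[k][i] ** q (standard FCM update)."""
    dim = len(points[0])
    centres = []
    for i in range(len(m[0])):
        w = [row[i] ** q for row in m]
        total = sum(w)
        centres.append(tuple(
            sum(wk * p[a] for wk, p in zip(w, points)) / total
            for a in range(dim)))
    return centres


def fcm(points, centres, q=2.0, tol=1e-6, max_iter=100):
    """Alternate membership and centre updates until the centres settle."""
    for _ in range(max_iter):
        m = memberships(points, centres, q)
        new = update_centres(points, m, q)
        shift = max(math.dist(a, b) for a, b in zip(centres, new))
        centres = new
        if shift < tol:
            break
    return centres, memberships(points, centres, q)


def harden(m):
    """Index of the cluster with the largest membership for each point."""
    return [row.index(max(row)) for row in m]


START = [(-1.0, -5.0), (-5.0, -1.0)]  # hypothetical starting centres
centres, m = fcm(POINTS, START)
labels = harden(m)
for row, label in zip(m, labels):
    print(" ".join(f"{v:.4f}" for v in row), "-> cluster", label)
```

Each row of `m` sums to 1, in line with the remark on the fuzzy c-partition. Which column corresponds to the whole tiles depends only on the order of the starting centres, so the columns may appear swapped relative to the matrix M shown in the tutorial.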
Function approximation

[Figure: Output1 (-1.5 to 1.5) plotted against Input (0 to 1).]

Curve fitting in a multi-dimensional space is also called function approximation. Learning is equivalent to finding a function that best fits the training data.

Approximation by fuzzy sets

[Figure: two panels over the input range 0 to 1; the upper panel spans -2 to 2 and the lower panel spans 0 to 1.]

Procedure to find a model

1. Acquire data
2. Select structure
3. Find clusters, generate model
4. Validate model

Conclusions

Compared to neural networks, fuzzy models can be interpreted by human beings.
Applications: system identification, adaptive systems

Links

J. Jantzen: Neurofuzzy Modelling. Technical University of Denmark: Oersted-DTU, Tech report no 98-H-874 (nfmod), 1998. URL http://fuzzy.iau.dtu.dk/download/nfmod.pdf
PapSmear tutorial. URL http://fuzzy.iau.dtu.dk/smear/
U. Kaymak: Data Driven Fuzzy Modelling. PowerPoint, URL http://fuzzy.iau.dtu.dk/tutor/ddfm.htm

Exercise: fuzzy clustering (Matlab)

Download and follow the instructions in this text file: http://fuzzy.iau.dtu.dk/tutor/fcm/exerF5.txt
The exercise requires Matlab (no special toolboxes are required).