Data Analysis
Santiago González <[email protected]>

Contents
- Introduction
- CRISP-DM
- Tools
- Data understanding
- Data preparation
- Modeling
  - Association rules?
  - Supervised classification
  - Clustering
- Assessment & Evaluation
- Examples:
  - Neuron Classification
  - Alzheimer disease
  - Medulloblastoma
  - CliDaPa
  - …
- Special Guest: Prof. Ernestina Menasalvas, "Stream Mining"

Data Mining: Modeling

Data Mining Tasks
- Prediction methods: use some variables to predict unknown or future values of other variables.
- Description methods: find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...
- Association Rule Discovery [Descriptive] (unsupervised)
- Classification [Predictive] (supervised)
- Regression [Predictive]
- Clustering [Descriptive] (unsupervised)

Association Rule Discovery
Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that will predict the occurrence of an item based on the occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}

Association Rule Discovery
Example: let the rule discovered be {Bagels, …} --> {Potato Chips}
- Potato Chips as consequent: can be used to determine what should be done to boost its sales.
- Bagels in the antecedent: can be used to see which products would be affected if the store discontinued selling Bagels.
- Bagels in the antecedent and Potato Chips in the consequent: can be used to see which products should be sold with Bagels to promote the sale of Potato Chips.
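To make the rule measures concrete, here is a minimal sketch (plain Python, no libraries; the function names are mine, not from the lecture) of how the support and confidence of the two rules above could be computed from the five-transaction table:

```python
# The five market-basket transactions from the slide, one set per TID.
transactions = [
    {"Bread", "Coke", "Milk"},           # TID 1
    {"Beer", "Bread"},                   # TID 2
    {"Beer", "Coke", "Diaper", "Milk"},  # TID 3
    {"Beer", "Bread", "Diaper", "Milk"}, # TID 4
    {"Coke", "Diaper", "Milk"},          # TID 5
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(confidence({"Milk"}, {"Coke"}))            # 3/4 = 0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 2/3 ≈ 0.67
```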
Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class (categorical; the class may be binary or not).
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.

Classification Example

Training set (used to learn the classifier):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set (the learned model must fill in the class):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Classifying Galaxies
- Class: stages of formation (Early, Intermediate, Late).
- Attributes: image features, characteristics of the light waves received, etc.
- Data size: 72 million stars, 20 million galaxies; object catalog: 9 GB; image database: 150 GB.
Courtesy: http://aps.umn.edu

Cross validation
With a confusion matrix whose entries are a (true positives), b (false positives), c (false negatives) and d (true negatives):
- Well classified: (a + d) / Sum
- Wrongly classified: (b + c) / Sum
- True positive rate (sensitivity): a / (a + c)
- True negative rate (specificity): d / (b + d)
- False positive rate: b / (b + d)
- False negative rate: c / (a + c)

Classification: example
For the six test samples of the example confusion matrix:
- Well classified: 4/6
- Wrongly classified: 2/6
- True positive rate (sensitivity): 2/3
- True negative rate (specificity): 2/3
- False positive rate: 1/3
- False negative rate: 1/3

KNN
- Idea: use the information of the k nearest neighbours.
- We need to calculate the distance between samples in order to know which ones are nearest (Euclidean, Manhattan, etc.).
- Prior information needed: the number of neighbours, K, and a distance function, d(x, y).
(figure: learning data and testing data in feature space)

KNN
- Euclidean distance: d(x, y) = sqrt( Σi (xi − yi)² )
- Manhattan distance: d(x, y) = Σi |xi − yi|
- The two are quite similar; the difference is using the absolute value instead of the squared value.
(figure: example with K = 3, two attributes and Euclidean distance)

ID3
- Objective: create a decision tree as a method to approximate a target function over discrete values.
- Resistant to noise in the data.
- Able to find or learn a disjunction of expressions.
- The result can be expressed as if-then rules.
- Tries to find the simplest tree that best separates the samples.
- It is a recursive algorithm and uses information gain.

ID3
The most discriminative feature is the one with the highest information gain:

G(C, Attr1) = E(C) − Σi P(Attr1 = Vi) · E(C | Attr1 = Vi)

where the entropy of an attribute is

E(Attr1) = − Σi P(Attr1 = Vi) · log2 P(Attr1 = Vi) = − Σi P(Attr1 = Vi) · ln P(Attr1 = Vi) / ln 2

ID3: example
(figure: training table; is this feature important?)

ID3: example
G(AdministerTreatment, Gout) = G(AT, G)

G(AT, G) = E(AT) − P(G=Yes) · E(AT | G=Yes) − P(G=No) · E(AT | G=No)

E(AT | G=Yes) = − P(AT=Yes | G=Yes) · log2 P(AT=Yes | G=Yes) − P(AT=No | G=Yes) · log2 P(AT=No | G=Yes)
              = − 3/7 · log2(3/7) − 4/7 · log2(4/7) = 0.985

E(AT | G=No) = − P(AT=Yes | G=No) · log2 P(AT=Yes | G=No) − P(AT=No | G=No) · log2 P(AT=No | G=No)
             = − 6/7 · log2(6/7) − 1/7 · log2(1/7) = 0.592

E(AT) = − P(AT=Yes) · log2 P(AT=Yes) − P(AT=No) · log2 P(AT=No)
      = − 9/14 · log2(9/14) − 5/14 · log2(5/14) = 0.940

G(AT, G) = 0.940 − (7/14) · 0.985 − (7/14) · 0.592 = 0.151

ID3: example
(figures: the tree grown by successive splits on the most informative attributes)
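The numbers in this worked example can be reproduced with a short script (plain Python; the `entropy` helper is my naming, not from the lecture):

```python
from math import log2

def entropy(counts):
    """Entropy of a discrete distribution, given absolute class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

e_class = entropy([9, 5])  # E(AT): 9 Yes / 5 No         -> 0.940
e_g_yes = entropy([3, 4])  # E(AT | G=Yes): 3 Yes / 4 No -> 0.985
e_g_no  = entropy([6, 1])  # E(AT | G=No):  6 Yes / 1 No -> 0.592

gain = e_class - 7/14 * e_g_yes - 7/14 * e_g_no
print(round(gain, 3))  # ≈ 0.152; the slide reports 0.151 from rounded intermediate values
```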
Bayes Classifier
A probabilistic framework for solving classification problems.
- Conditional probability:
  P(C | A) = P(A, C) / P(A)
  P(A | C) = P(A, C) / P(C)
- Bayes theorem:
  P(C | A) = P(A | C) · P(C) / P(A)

Example of Bayes Theorem
Given:
- A doctor knows that meningitis causes a stiff neck 50% of the time.
- The prior probability of any patient having meningitis is 1/50,000.
- The prior probability of any patient having a stiff neck is 1/20.
If a patient has a stiff neck, what is the probability that he/she has meningitis?

P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002

Bayesian Classifiers
- Consider each attribute and the class label as random variables.
- Given a record with attributes (A1, A2, …, An), the goal is to predict the class C. Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An).
- Can we estimate P(C | A1, A2, …, An) directly from data?

Bayesian Classifiers
Approach:
- Compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem:

  P(C | A1, A2, …, An) = P(A1, A2, …, An | C) · P(C) / P(A1, A2, …, An)

- Choose the value of C that maximizes P(C | A1, A2, …, An); this is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) · P(C).
- How can we estimate P(A1, A2, …, An | C)?

Naïve Bayes Classifier
- Assume independence among the attributes Ai when the class is given:
  P(A1, A2, …, An | Cj) = P(A1 | Cj) · P(A2 | Cj) · … · P(An | Cj)
- P(Ai | Cj) can then be estimated for all Ai and Cj.
- A new point is classified as Cj if P(Cj) · Πi P(Ai | Cj) is maximal.

How to Estimate Probabilities from Data?
Using the training table of the classification example (Refund and Marital Status are categorical, Taxable Income is continuous, and the class attribute is here named Evade):
- Class priors: P(C) = Nc / N, e.g., P(No) = 7/10, P(Yes) = 3/10.
- For discrete attributes: P(Ai | Ck) = |Aik| / Nc,k, where |Aik| is the number of instances that have attribute value Ai and belong to class Ck.
  Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0.

How to Estimate Probabilities from Data?
For continuous attributes:
- Discretize the range into bins: one ordinal attribute per bin (this violates the independence assumption).
- Two-way split: (A < v) or (A > v); choose only one of the two splits as the new attribute.
- Probability density estimation: assume the attribute follows a normal distribution and use the data to estimate the parameters of the distribution (e.g., mean and standard deviation); once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c).
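A minimal sketch (plain Python; the record list and helper code are mine) showing how these estimates fall out of simple counting over the ten training records:

```python
from statistics import mean, variance

# (Refund, Marital Status, Taxable Income in K, Evade) for Tid 1..10.
records = [
    ("Yes", "Single",   125, "No"),  ("No",  "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No",  "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No",  "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No",  "Single",   90, "Yes"),
]

no_recs  = [r for r in records if r[3] == "No"]
yes_recs = [r for r in records if r[3] == "Yes"]

# Class priors: P(No) = 7/10, P(Yes) = 3/10.
print(len(no_recs) / len(records), len(yes_recs) / len(records))

# Discrete attribute: P(Status=Married | No) = 4/7.
print(sum(r[1] == "Married" for r in no_recs) / len(no_recs))

# Continuous attribute: normal-distribution parameters of Income given No.
incomes_no = [r[2] for r in no_recs]
print(mean(incomes_no), variance(incomes_no))  # 110 and 2975 (sample variance)
```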
How to Estimate Probabilities from Data?
Normal distribution, with one pair of parameters (μij, σij) for each attribute-class pair (Ai, cj):

P(Ai | cj) = 1 / sqrt(2π · σij²) · exp( −(Ai − μij)² / (2 · σij²) )

For (Income, Class=No): the sample mean is 110 and the sample variance is 2975, so

P(Income = 120 | No) = 1 / (sqrt(2π) · 54.54) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072

Example of Naïve Bayes Classifier
Given the test record X = (Refund = No, Marital Status = Married, Income = 120K).

Naïve Bayes estimates from the training data:
- P(Refund=Yes | No) = 3/7, P(Refund=No | No) = 4/7
- P(Refund=Yes | Yes) = 0, P(Refund=No | Yes) = 1
- P(Marital Status=Single | No) = 2/7, P(Marital Status=Divorced | No) = 1/7, P(Marital Status=Married | No) = 4/7
- P(Marital Status=Single | Yes) = 2/3, P(Marital Status=Divorced | Yes) = 1/3, P(Marital Status=Married | Yes) = 0
- Taxable income: if Class=No, sample mean = 110 and sample variance = 2975; if Class=Yes, sample mean = 90 and sample variance = 25.

P(X | Class=No) = P(Refund=No | No) · P(Married | No) · P(Income=120K | No)
                = 4/7 · 4/7 · 0.0072 = 0.0024
P(X | Class=Yes) = P(Refund=No | Yes) · P(Married | Yes) · P(Income=120K | Yes)
                 = 1 · 0 · 1.2×10⁻⁹ = 0

Since P(X | No) · P(No) > P(X | Yes) · P(Yes), we have P(No | X) > P(Yes | X), so the predicted class is No.

Naïve Bayes Classifier
If one of the conditional probabilities is zero, the entire expression becomes zero. Smoothed probability estimates avoid this:
- Original: P(Ai | C) = Nic / Nc
- Laplace: P(Ai | C) = (Nic + 1) / (Nc + c), where c is the number of classes
- m-estimate: P(Ai | C) = (Nic + m·p) / (Nc + m), where p is a prior probability and m is a parameter

Example of Naïve Bayes Classifier
A: attributes; M: mammals; N: non-mammals

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record A: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?

P(A | M) = 6/7 · 6/7 · 2/7 · 2/7 = 0.06
P(A | N) = 1/13 · 10/13 · 3/13 · 4/13 = 0.0042
P(A | M) · P(M) = 0.06 · 7/20 = 0.021
P(A | N) · P(N) = 0.0042 · 13/20 = 0.0027

P(A | M) · P(M) > P(A | N) · P(N) => Mammals

Data Mining Tasks...
Regression [Predictive]

Regression
- Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Greatly studied in statistics and in the neural network field.
- Examples:
  - Predicting the sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.
(figure: a regression line fitted to data points)
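Since the slides only gesture at how such a dependency model is fitted, here is a minimal sketch of linear regression by ordinary least squares (plain Python; the advertising/sales figures are made up purely for illustration):

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ≈ slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical (advertising expenditure, sales) pairs.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(spend, sales)
print(f"sales ≈ {slope:.2f} * spend + {intercept:.2f}")  # ≈ 1.99x + 0.05
```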
Data Mining Tasks...
Clustering [Descriptive]

Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  - data points in one cluster are more similar to one another;
  - data points in separate clusters are less similar to one another.
- A clustering is a set of clusters.
- Similarity measures: Euclidean distance if the attributes are continuous; otherwise, problem-specific measures.

Illustrating Clustering
(figure: Euclidean-distance-based clustering in 3-D space; intracluster distances are minimized while intercluster distances are maximized)

Clusters can be Ambiguous
(figure: how many clusters? the same points can be read as two, four or six clusters)

Types of Clusterings
There is an important distinction between partitional, hierarchical and density-based sets of clusters:
- Partitional clustering (e.g., K-Means): a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
- Hierarchical clustering (e.g., agglomerative): a set of nested clusters organized as a hierarchical tree.
- Density clustering (e.g., DBSCAN): clusters are regarded as regions of the data space in which the objects are dense, separated by regions of low object density (noise).

Partitional Clustering
(figure: original points and a partitional clustering of them)

K-Means
- Partitional clustering approach.
- Each cluster is associated with a centroid (center point).
- Each point is assigned to the cluster with the closest centroid.
- The number of clusters, K, must be specified.
- The basic algorithm is very simple (see the code sketch at the end of this clustering section).

K-Means
- Initial centroids are often chosen randomly, so the clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster.
- 'Closeness' is usually measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations, so the stopping condition is often changed to 'until relatively few points change clusters'.

Importance of Choosing Initial Centroids
(figures: two K-means runs over iterations 1-6 on the same points, converging to different clusterings depending on the initial centroids)

Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree.
- Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
(figure: nested clusters and the corresponding dendrogram)

Hierarchical Clustering
(figure: traditional and non-traditional hierarchical clusterings of points p1-p4, with their dendrograms)

DBSCAN
- Resistant to noise.
- Can handle clusters of different shapes and sizes.
(figure: original points and the clusters DBSCAN finds)
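As promised above, a minimal sketch of the basic K-means loop (plain Python, random initial centroids, squared Euclidean distance; the function names are mine):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)  # random initial centroids
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster's points.
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged: the centroids no longer move
            break
        centroids = new
    return centroids, clusters

pts = [(1.0, 1.0), (1.5, 2.0), (1.2, 1.4), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
print(kmeans(pts, k=2)[0])  # two centroids, one per group of points
```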
Data Mining: Assessment

Assessment
- Supervised: validation algorithms and metrics.
- Unsupervised: validation algorithms and metrics.

Supervised validation algorithms
- Resubstitution
- Hold-out
- N-fold cross validation
- Leave-one-out (the maximum number of folds): N-fold cross validation when N = dim(Data)
- 0.632 Bootstrap
(figures: how each scheme splits the data into training and test sets)

Supervised metrics
- Calibration: the distance between the real class and the predicted class. Continuous, [0, ∞).
- Discrimination: the probability of classification. Continuous, [0, 1].
- In classification, we want to obtain the lowest calibration possible and the highest discrimination possible.

Supervised metrics
Example: real class 1, predicted class 0.6 (using regression).
- Discrimination: 1, supposing that if Class_predicted > 0.5 then Class_predicted = 1.
- Calibration: 0.4 (1 − 0.6).

Supervised metrics
- Accuracy (well classified) [discrimination]
- Log likelihood [calibration]
- AUC [discrimination]
- Brier score [calibration + discrimination]
- …
Hosmer DW, Lemeshow S (2000) Applied Logistic Regression, 2nd edn. Wiley, New York.

AUC
- Area Under the ROC Curve. Continuous, [0, 1].
(figure: a ROC curve and the area beneath it)

Unsupervised validation
- Compactness: the members of each cluster should be as close to each other as possible. A common measure of compactness is the variance, which should be minimized.
- Separation: the clusters themselves should be widely spaced. There are three common approaches to measuring the distance between two different clusters:
  - Single linkage: measures the distance between the closest members of the clusters.
  - Complete linkage: measures the distance between the most distant members.
  - Comparison of centroids: measures the distance between the centers of the clusters.
(A code sketch of compactness and centroid-based separation closes this section.)
Maria Halkidi, Yannis Batistakis and Michalis Vazirgiannis, On Clustering Validation Techniques, Journal of Intelligent Information Systems, 2001.

Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into three types:
- External index: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy).
- Internal index: used to measure the goodness of a clustering structure without respect to external information (e.g., Sum of Squared Error, SSE).
- Relative index: used to compare two different clusterings; often an external or internal index is used for this function, e.g., SSE or entropy.
Maria Halkidi, Yannis Batistakis and Michalis Vazirgiannis, On Clustering Validation Techniques, Journal of Intelligent Information Systems, 2001.

Using Similarity Matrix for Cluster Validation
- Order the similarity matrix with respect to the cluster labels and inspect it visually.
(figure: points with complete-link clusters and the corresponding block-diagonal similarity matrix)

Using Similarity Matrix for Cluster Validation
- Clusters in random data are not so crisp.
(figure: similarity matrix for complete-link clusters of random data)

Using Similarity Matrix for Cluster Validation
(figure: similarity matrix for the clusters found by DBSCAN)
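A minimal sketch (plain Python; the function names are mine) of the two measures referenced above: compactness as within-cluster variance and separation as the comparison of centroids:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def compactness(cluster):
    """Variance: mean squared distance of the members to their centroid."""
    c = centroid(cluster)
    return sum(dist(p, c) ** 2 for p in cluster) / len(cluster)

def separation(c1, c2):
    """Comparison of centroids: distance between the cluster centers."""
    return dist(centroid(c1), centroid(c2))

a = [(1.0, 1.0), (1.5, 2.0), (1.2, 1.4)]
b = [(5.0, 7.0), (5.5, 6.5), (6.0, 7.2)]
print(compactness(a), compactness(b))  # low values: tight clusters
print(separation(a, b))                # high value: well-spaced clusters
```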