Download Presentation material from Sami Äyrämö

UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13.02.2007 University of Joensuu Sami Äyrämö 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY • The need for scalable and robust KD/DM/DM methods is caused by the constant growth of digital data storages: • Capability to measure and store data from different objects has increased (industrial processes, human beings, markets, … ) • Collecting data is cheap! • The consequence is that the essential information may be lost into the oceans of data – The growth of knowledge lags behind the growth of digital data sets • Lots of data, high dimensions, missing values, outliers,… – Scalable and robust data analysis methods are needed • NB! Data owners should retain that they have to collect and digitize the most important data – All collected digital data is not important – All important data may not be collected 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Modified from: U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, The KDD process for extracting useful knowledge from volumes of data, Communications of the ACM, Volume 39 , Issue 11, November 1996. 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY • Enhanced understanding of the target data leads to success 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • • • • • • • • • • • • • • E-commerce Web mining Marketing research Process industry (e.g., paper machines) Gene databases Social sciences Software quality analysis Document databases Image databases Radar applications Peace science Telecommunication Archeology and many others… DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Control Quality Sensors  d1m   d11       D   dt1  dtj  dtm         d n1  d nm  Temporal events 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Operator Laborant Process data Process data Quality Customer Control data Manager  d1m   d11      D   d t1  d tj  d tm        d n1  d nm  Feedback 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Clustering is the process of grouping the data into clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters The core method of data mining 155 48 178 60  160 90  ? 78  D 195 ?  250 71  : :  170  60 S E ...120 V K ... ?  H E ...136  ? ? ...180 S E ... ?   R ? ...117  : : :  ? E ...171 Refine,compess,simplify,… 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • • • • • • • • • • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Application domain → relevant variables → type of variables → distance measure • Non-informative variables, e.g., completely missing, constant, and dublicate variables Outliers and noise (robustness) Missing data (consistency) Scaling of data (equal weighting of variables) Number of clusters (may be ambiguous) Choice of clustering method (underlying assumptions) Non-uniquess of results (solution depends on the initial values) Large data set (scalability) Cluster presentation (dimension reductions) Interpretation (domain knowledge) 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Features: height, weight, eye color, eyeglasses,…, blood pressure Objects/ B N ...120 G Y ... ?  G N ...136  ? ? ...180 B N ... ?   R ? ...117  : : :  B N ...171 persons Outlier 155 48 178 60  160 90  ? 78  D 195 ?  250 71  : :  170  60 Missing value 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ Initialization DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Robust clustering Bad? But this is the global minimum? Better? Useful? Role of K? Clustering methods do not necessarily inherit the robustness of its robust components! 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Roughly speaking there are two types of missing values: 1. The true underlying value exists, but for one reason or another it is not entered into the data set • E.g., human mistake or measurement system failure 2. There may not exist true value in the real-world • • • • An unemployed person can not have an employer ”Do not exist” or ”do not know” may be the most important information (cf. Finnish parliamentary election 2007) Hence, missing data may contain the most important piece of information (”nonignorable variables”) Anyhow, missing data causes many challenges for cluster analysis • For instance, dissimilarity matrix may be impossible to compute (what is suitable distance measure for a pair of vectors (NaN , 3.2 , NaN) and (1.7 , NaN , 4.0)) 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • • • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Probability that jth component of vector xi∈ℝp is missing is independent of any other known or missing value of xi Any missing data strategy can be applied without producing significant bias on the data The easiest type of missing data from the statistical pointof-view 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • • • Probability that jth component of vector xi∈ℝp is missing may depend on the observed components of xi The observed values of (xi)j is a random sample within subclasses defined by the values of (xi)j, i≠k Some knowledge about the unavailable data values DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY x > 5 → y is missing 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • • • • Probability that jth component of vector xi∈ℝp is missing depend on its value E.g., value of a measurement is out of some permissible range and might therefore be censored or replaced by an empty value (in this case the mechanism of missing data is understood) Some knowledge about the unavailable data values No general method exists for this kind of missing data mechanism • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY x exists only if x > 0 Missing data value is actually an extreme outlier (any value between -∞ and ∞) 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • Fill in the missing values manually: • • Simple, but not usually recommended Treat unknown values as a special category Use only complete cases in the analysis • • • Time consuming (esp. DM applications) Replace by constant • • • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY A lot of information might be lost It may be more reasonable to discard a variable with a large fraction of missing values than to discard the large number of objects Available case strategy • • • • Uses all available data in computation Statistical efficiency of basic estimates is proportional to the fraction of missing data Implemented easily by projector technique where all computation is projected to the available vector components (more details later..) Extremely simple to the end-user 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Imputation • • • • • • Replace missing values with reasonable estimates (mean, median, mode, kNN) based on the available data values Simple approach Mean imputation works when the sample is drawn from unimodal normal distribution and the data is missing at random Reduces the inherent variability in the data Outliers may seriously distort the mean imputation (median might be better) Simple imputation methods are not generally suitable for data clustering • The potential class structure is not considered by simple imputation methods • Shrinking the within-cluster variations may disturb cluster validation 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Hot deck imputation • Replaces the missing values with values based on estimated missing data distributions – the missing data distribution is estimated from the available values • Sample EM-like hot deck algorithm: 1. Cluster the data using complete cases 2. Assign incomplete cases to the nearest clusters (using an appropriate distance measure) 3. Impute missing values using statistical within-cluster summaries 4. Cluster the whole data (both complete and imputed cases) 5. If changes in clustering repeat from Step 3 • Cold deck imputation uses other distinct data set for substitution of missing values 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • Nearest-neighbor imputation • • • Identifies candidates that are the most similar to the object with missing data (the candidate has to be complete in the required variables) and substitutes the missing values with values of candidates Appropriate distance measure is needed • E.g., l 2- or l ∞-norm on available variables (presumes standardization of the variables) • Upper limit d0 may be given for the distance in order to avoid occurence of strange objects (if the distance from the incomplete object to its closest complete-case neighbor is greater than d0 imputation will be omitted) Prediction-based imputation • • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Uses classification (nominal variable) and regression (continuous) models based on other variables to predict missing values Multiple imputation • • A collection of imputed data sets is produced and analyzed using standard methods The final result is obtained by combining the characterizations of the imputed data sets (computationally expensive) 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • • • • • • • • • • • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data presentation Choice of objects Choice of variables What to cluster: objects or variables Normalization/weighting of variables Choice of (dis)similarity measures! Choice of clustering criterion (objective function) Choice of missing data strategy Algorithms and computer implementation (and their reliability, e.g., convergence) Number of clusters Interpretation of results 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • • • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Data sets produced by industrial processes are often sparse and contain outliers that are caused e.g., by human errors and measurement failures (e.g., laboratory measurements) Traditional clustering methods (e.g., k-means, hierarchical methods) are not appropriate for comprehensive cluster analysis by themselves, because of their sensitivity to such defects Robust methods complement the traditional methods because of their • tolerance to incomplete, sparse and dirty data • usability (less manual trimming of data by the end user) Quality attributes T  0.55  0.37  m  NaN e  0.44 D  0.29   NaN  0.31   0.34 i 0.40 NaN 0.90 NaN 1.01 NaN NaN NaN NaN 120.0 55.3 0.90  67.3 129.0  0.04 71.1 NaN  401.0 70.0 NaN   0.01 64.0 NaN  0.02 67.0 NaN   0.03 69.0 NaN  0.67 0.20 0.05 Missing value Possible outlier 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY min J q (I,{c k }kK1 ), K α = 2, q=2 → K-means {c k }k 1 ,I i α = 1, q=2 → K-spatialmedians n  i 1 q for J q (I,{c k }kK1 )   Pi (xi  c ( I )i ) , α = 1, q=1 → K-coord.medians where, (I)i {1,..., K }, c k  R p , Pi  diag( p i ) and xi  R p . • K-prototype algorithm 1. Initialize the cluster prototypes (HOW?) 2. Assign each data point to the closest cluster prototype (DISTANCE MEASURE?) 3. Compute the new estimates for the cluster prototypes (ESTIMATOR?) 4. Termination: stop if termination criteria are satisfied (usually no changes in I) T. Kärkkäinen and S. Äyrämö, Robust clustering methods for erroneous and incomplete data, In Proc. of 5th Conference on Data Mining, September, 2004. 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY “Robustness signifies insensitivity to small deviations from the assumptions”, P. Huber, Robust statistics, 1981 • • • • • • Small deviation means either gross errors (outliers) in the minor part of data or small errors (noise) in a large part of data Statistical methods and tools are mainly based on classical statistics (mean, variance) Classical statistics are based on the assumption that data sets originates from normal distribution This often makes results useless when the normal assumption is not satisfied The classical statistics are computationally straightforward Robust statistics are more intractable from computational point-of-view 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Breakdown point: ”the smallest fraction of contamination that can cause the estimator T to take values arbitrarily far from T(X)” • • • • BP of the sample mean is 1/n and the coordinatewise and spatial medians 1/2 Influence function Gross-error sensitivity Local-shift sensitivity 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • The trimmed mean • • • Attractive robust alternative in one- or two-dimensions Translation and scale equivariant The spatial median • • • • • Requires parameters (distance or fraction of data) Difficult in high-dimensions The (coordinatewise) median • • • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Attractive robust alternative in high dimensions Computationally challenging – explicit solution does exist for the optimization problem Orthogonal equivariant Presumes equal weighting for variables – insensitive to scaling of variables Many others: Oja median,… 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • • • • • • • • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY High efficiency: Nearly equal efficiency with maximum likelihood estimators under ideal parametric models. Qualitative robustness: Estimates are influenced just slightly by small deviations from the assumed model. Quantitative robustness: Estimates are protected against large amounts of contamination or single gross errors (high breakdown point). Local-shift sensitivity: Smooth reaction to rounding and grouping. Rejection point: Separation between outliers and the bulk of data. Fisher consistency: Estimation of right quantity (for parametric models). Affine equivariance: The solution should be independent of the scales of the variables (multivariate cases) Computational practicality: The solution should be obtainable in a practical amount of computing time, even in a high dimension and/or with large amounts of data. 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY •Infinite local-shift sensitivity at the median of the distribution: 3D-sample 4D-sample 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY •Coordinatewise median does not always lie in the convex hull of the data in ℝp with p≥3: data : {ei }iN1 , ei  R p (ei  unit vecto r)  median  0 not contained in the convex hull for N  3. • Not orthogonal equivariant – sensitive to the rotations of data 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • • In statistics: a robust estimate of location (M-estimate). A.k.a. FermatWeber problem or single-facility location problem (used, e.g., in transportation cost minimization problems, networking problems) Good statistical properties on multivariate samples – – – – – – • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY More robust than the sample mean (breakdown point 50%) Inherently multivariate estimator on continuous variables The coordinatewise sample median is sensitive to so called ”inliers” and works better with discrete variables Rotation invariant In univariate case spatial median coincides with the coordinatewise median Statistical efficiency approach to the sample mean as p approach to infinity Treated as nonsmooth convex optimization problem – – Unique solution exists when the data are not collinear SOR-Weiszfeld type of algorithm with treatment for missing data Kärkkäinen, T. and Äyrämö, S., On Computation of Spatial Median for Robust Data Mining, In Proceedings of EUROGEN 2005, September, 2005, Germany. Valkonen, T., Convergence of a SOR-Weiszfeld type algorithm for incomplete data sets, Numerical Functional Analysis and Optimization, 27 (7-8), Dec. 2006. 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY   minn J q (u), for J q (u)  uR 1 N ux   i 1  i q The spatial median: (u  xi ) j      , for u  xi 2  0,  i 2  q  2 : 0  J 21 (u)   i , with  j u  xi 2 i 1    1, for u  xi 2  0.  i 2 N • • • • Nonsmooth Convex Unique solution if data is not collinear In univariate case spatial median coincides with the coordinatewise median 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Modified gradient: (Pi (u  x i )) j N J (u)    i , with  i  j  1 2 i 1 max( Pi (u  x i ) ,  ) , for all x i . ε-approximating problem: N minn J  (u), for J  (u)   uR i 1 Pi (u  x i )   2 N (Pi (u  x i )) j i 1 Pi (u  x i )   J  (u)    i , where  i  j  2 , for all x i . 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY • Classical direct search optimization methods (e.g., Nelder-Mead) – Accurate since not based on gradients – Scale poorly w.r.t. the number of dimensions – Requires a large amount of computing resources • Gradient-based methods – Presume differentiability – Inaccurate – Requires a lot of resources to compute • Iterative methods (e.g., Weiszfeld and its modifications) – Fast – Good scalability to high dimensions 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY • ε-approximation problem reformulation N minn J  (u), for J  (u)   uR Pi (u  x i )   2 i 1 • Basic iteration is based on the first-order necessary conditions for a stationary point of the cost function: n  i 1 Pi (u  x i ) Pi (u  x i )   2  0. 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY • Acceleration using SOR type of stepsize factor u k 1  u k   ( v  u k ), where   [0,2]. • On incomplete data sets operations can be performed correspondingly by projecting all computation to the available values! 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY • In this variant nonsmooth data points are removed from the computations. • Basic iteration is based on the first-order necessary conditions for a stationary point of the cost function:  i:x i B ( u , r ) Pi (u  x i ) Pi (u  x i ) 2  0. 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ • • • DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY Time complexity of one SOR/ASSOR iteration corresponds to two/three cost function evaluations (O(n)) The number of iterations of the SOR-based methods is remarkably smaller when compared to the number function evaluations of the classical optimization methods CG with ε-approximation problem formulation gives inaccurate solutions 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY • Needed to avoid the locally optimal solutions • Random – Practical only for small data sets • Distance-optimization – Maximizes between-cluster distances – Sensitive towards outliers • Density-estimation – Tries to detect dense regions from data by clustering several subsamples – Easily generalized to the robust clustering – Trimming 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY • Silhouettes • L1-datadepth based robust methods • Trimmed silhouettes – Concentrates on the tight cluster cores and between cluster distances – Trim away the most distant 50% of the cluster points (with respect to the spatial median) – Compute the average distance a(Ck) from the remaining most central points to the closest cluster prototype (measures the tightness of the core of the cluster) – Similarly, compute the average distance b(Ck) from the most central points to the second closest cluster prototype (measures the between-cluster distance) 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY • Trimmed silhouette width for cluster k is computed as: bCk   aCk  sr Ck   max aCk , bCk  • Robust silhouette coefficient is defined as: 1 sr K   K K  sC  i 1 i 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007 UNIVERSITY OF JYVÄSKYLÄ DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY 13/02/2007

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Presentation material from Sami Äyrämö