Survey

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Survey

Document related concepts

Transcript

Data Mining Lecture 1 – summary 0. Introduction From data to information Data mining relates to the area of application Passive: Observing relations Active: Finding the underlying model Passive: Clustering, Classification Active: Passive + Model construction + parameter estimation 1. Example: PCB production Normalization and standardization Hidden parameters and missing values Input, output, and state parameters 2. Data as Sets of Observations Parameters and Observations/Measurements Ordered Data Sets and alphanumeric sets and lexicographic sets Data is a set of ordered numeric data of n observations of d parameters Dynamic Sets and Time series sampling Data Quality 3. Statistics Uncertainty and Probability distributions: normal, uniform, Poisson, Bernouli, something Normal distribution Kolmogorov-Smirnov test Multivariate sets Mean Mean-centered data Correlation and correlation coefficient Covariance matrix Estimation Outliers and the linear correlation coefficient Transforming 4. Similarity and Distance Measures for Similarity and Dissimilarity How similar or dissimilar two data points are sim(p,q) in [0,1] sim(p,q) = sim(q,p) sim(p,p) = 1 dissim also in [0,1] : dissim = 1 – sim Distance measures and Metrics Distance measured according to certain rule in the data space dist(p,q) >= 0 if dist(p,q) = 0 then p = q triangle-inequality: dist(p,q) <= dist(p,a) + dist(a,q) notice that distance relates to the dissimilarity a vector space with a distance definition is a metric space Examples of Distances Euclidean : distE(p,q)2 = (p1 – q1)2 + (p2 – q2)2 + … + (pn – qn)2 Manhattan: distM(p,q) = |p1 – q1| + |p2 – q2| + … + |pn – qn| Max-norm: distmax(p,q) = max {|p1 – q1|, |p2 – q2|, …, |pn – qn| } Notice the graphs of dist@(p,0) in IR2 for @ = Euclidean, Manhattan, max Generalized p-norm 1/ d n d p d pi i 1 notice that for: d =2: ||p - q||d = Euclidean distance for: d =1: ||p - q||d = Manhattan distance for: d = ∞: ||p - q||d = Max-norm distance normally d >= 1 Riemannian Metric Let g be a definite non-negative matrix on IRd (i.e. all eigen values >= 0) then g induces a Riemannian metric on the space: p 2 g p T gp Involving peculiarities of the distribution Let ρ be a probability distribution on a data space. Let m be the mean and C be the covariance matrix associated to ρ : C (x m) (x)( x m) T dx then the Mahalanobis distance is defined as: dist Mahal (x) (x m)C 1 (x m) T It give a measure for the distance of a data point x to the center of the distribution. Notice that in the mean-centered space the Mahalanobis-distance is a Riemannian norm with metric g = C-1. Similarity and Distance Notice that we can define the similarity between two data points p and q as some function f: sim(p,q) = f(dist(p,q)). Examples are: f(d) = 1/(1 + d/L) f(d) = exp(-d/L) f(d) = - d/L if d < L and f(d) = 0 if d >= L where d = dist(p,q) and L is some characteristic length for the problem. Correlation 5. Visualizing and Exploring Data single variable-display: Histogram Cumulative distribution * Kernel estimate Two-variable representation Scatter plot Contour plot Multiple-variable representation Scatter plot matrix Trelis plot Chernoff face 6. Dimension Reduction Principal Components Analysis (PCA) Find the principal directions in the data, and use them to reduce the number of dimensions of the set by representing the data in linear combinations of the principal components. Works best for multivariate data. Finds the m < d eigen-vectors of the covariance matrix with the largest eigen-values. These eigen-vectors are the principal components. Decompose the data in these principal components and thus obtain as more concise data set. Caution1: Depends on the normalization of the data! Ideal for data with equal units. Caution2: works only in linear relations between the parameters Caution3: valuable information can be lost in pca Factor Analysis Represent data with fewer variables Not invariant for transformations: multiple equivalent solutions Widely used esp. in alpha-world and medicine Multidimensional scaling Equivalent to PCA, and also in case there are non-linear relations between the parameters. Input: similarity- or distance-map between the data-points. Output: a 2D- (or even higher D) map of the data points. Kohonen SO feature map (Will be addressed later) Input: distance or similarity-map, Output: a 2D- (or even higher D) map of the data points.