Data Mining Lecture 1 – summary
0. Introduction
From data to information
Data mining relates to the area of application
Passive: Observing relations
Active: Finding the underlying model
Passive: Clustering, Classification
Active: Passive + Model construction + parameter estimation
1. Example: PCB production
 Normalization and standardization
 Hidden parameters and missing values
 Input, output, and state parameters
2. Data as Sets of Observations
Parameters and Observations/Measurements
Ordered data sets: numeric, alphanumeric, and lexicographic ordering
A data set is an ordered set of n observations of d (numeric) parameters
Dynamic Sets and Time series
Data Quality
3. Statistics
Uncertainty and probability distributions: normal, uniform, Poisson, Bernoulli, and others
Normal distribution
Kolmogorov-Smirnov test
Multivariate sets
Mean-centered data
Correlation and correlation coefficient
Covariance matrix
Outliers and the linear correlation coefficient
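The points above on mean-centering, the covariance matrix, the correlation coefficient, and the effect of outliers can be sketched in numpy. This is a minimal illustration on an invented toy data set (the values are not from the lecture):

```python
import numpy as np

# Toy 2-parameter data set: 5 observations of (x, y), with y roughly 2x.
data = np.array([[1.0, 2.1],
                 [2.0, 3.9],
                 [3.0, 6.2],
                 [4.0, 8.1],
                 [5.0, 9.8]])

centered = data - data.mean(axis=0)            # mean-centered data
cov = centered.T @ centered / (len(data) - 1)  # covariance matrix (matches np.cov)
r = np.corrcoef(data[:, 0], data[:, 1])[0, 1]  # linear correlation coefficient

# A single outlier can drag the linear correlation coefficient down sharply:
with_outlier = np.vstack([data, [3.0, 30.0]])
r_out = np.corrcoef(with_outlier[:, 0], with_outlier[:, 1])[0, 1]
print(r, r_out)
```

On this near-linear sample r is close to 1, while the one added outlier lowers it substantially, which is why outliers deserve attention before computing correlations.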
4. Similarity and Distance
Measures for Similarity and Dissimilarity
How similar or dissimilar two data points are
sim(p,q) in [0,1]
sim(p,q) = sim(q,p)
sim(p,p) = 1
dissim also in [0,1] : dissim = 1 – sim
Distance measures and Metrics
Distance measured according to certain rule in the data space
dist(p,q) >= 0
dist(p,q) = 0 if and only if p = q
dist(p,q) = dist(q,p)
triangle inequality: dist(p,q) <= dist(p,a) + dist(a,q)
notice that distance relates to the dissimilarity
a vector space with a distance definition is a metric space
Examples of Distances
Euclidean: dist_E(p,q)^2 = (p1 – q1)^2 + (p2 – q2)^2 + … + (pn – qn)^2
Manhattan: dist_M(p,q) = |p1 – q1| + |p2 – q2| + … + |pn – qn|
Max-norm: dist_max(p,q) = max { |p1 – q1|, |p2 – q2|, …, |pn – qn| }
Notice the graphs of dist(p,0) in IR^2 for the Euclidean, Manhattan, and max distances
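The three distances above can be computed directly in numpy; a small check on two points in IR^2 (the points are chosen here just for illustration):

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

dist_euclid = np.sqrt(np.sum((p - q) ** 2))  # sqrt(3^2 + 4^2) = 5
dist_manhattan = np.sum(np.abs(p - q))       # 3 + 4 = 7
dist_max = np.max(np.abs(p - q))             # max(3, 4) = 4

print(dist_euclid, dist_manhattan, dist_max)  # 5.0 7.0 4.0
```

Note that the three values differ for the same pair of points: the choice of distance measure matters.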
Generalized d-norm
||p||_d = ( Σ_{i=1}^{n} |p_i|^d )^(1/d)
notice that for d = 2: ||p – q||_d = Euclidean distance
for d = 1: ||p – q||_d = Manhattan distance
for d = ∞: ||p – q||_d = Max-norm distance
normally d >= 1
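A minimal sketch of the generalized d-norm, checking that it reproduces the Euclidean and Manhattan distances and agrees with numpy's built-in norm (the function name `norm_d` is ours, not from the lecture):

```python
import numpy as np

def norm_d(v, d):
    """Generalized d-norm: (sum_i |v_i|^d)^(1/d)."""
    return np.sum(np.abs(v) ** d) ** (1.0 / d)

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
diff = p - q

print(norm_d(diff, 2))       # Euclidean distance: 5.0
print(norm_d(diff, 1))       # Manhattan distance: 7.0
print(np.max(np.abs(diff)))  # the d -> infinity limit: max-norm, 4.0
```

As d grows, the largest coordinate difference dominates the sum, which is why the d → ∞ limit is the max-norm.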
Riemannian Metric
Let g be a positive semi-definite matrix on IR^d (i.e. all eigenvalues >= 0);
then g induces a Riemannian metric (norm) on the space:
||p||_g = sqrt( p^T g p )
Involving peculiarities of the distribution
Let ρ be a probability distribution on a data space. Let m be the mean and C the
covariance matrix associated with ρ:
C = ∫ (x – m) ρ(x) (x – m)^T dx
then the Mahalanobis distance is defined as:
dist_Mahal(x) = sqrt( (x – m) C^(-1) (x – m)^T )
It gives a measure for the distance of a data point x to the center of the distribution.
Notice that in the mean-centered space the Mahalanobis distance is a Riemannian norm
with metric g = C^(-1).
Similarity and Distance
Notice that we can define the similarity between two data points p and q as some
function f: sim(p,q) = f(dist(p,q)). Examples are:
f(d) = 1/(1 + d/L)
f(d) = exp(-d/L)
f(d) = 1 – d/L if d < L and f(d) = 0 if d >= L
where d = dist(p,q) and L is some characteristic length for the problem.
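The three example functions f can be written out directly; a small sketch with an arbitrarily chosen characteristic length L (here we read the linear cut-off as 1 – d/L, so that f(0) = 1 as the similarity axioms require):

```python
import numpy as np

L = 10.0  # characteristic length for the problem (an arbitrary choice here)

def sim_inverse(d):
    return 1.0 / (1.0 + d / L)

def sim_exp(d):
    return np.exp(-d / L)

def sim_linear(d):
    # Linear cut-off: similarity 1 at d = 0, falling to 0 at d = L.
    return 1.0 - d / L if d < L else 0.0

for f in (sim_inverse, sim_exp, sim_linear):
    print(f(0.0), f(5.0), f(20.0))
```

All three map distance 0 to similarity 1 and decrease with d; only the linear cut-off reaches similarity exactly 0 at finite distance.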
5. Visualizing and Exploring Data
Single-variable display:
* Histogram
* Cumulative distribution
* Kernel estimate
Two-variable representation:
* Scatter plot
* Contour plot
Multiple-variable representation:
* Scatter plot matrix
* Trellis plot
* Chernoff faces
6. Dimension Reduction
Principal Components Analysis (PCA)
Find the principal directions in the data, and use them to reduce the number of dimensions of
the set by representing the data in linear combinations of the principal components.
Works best for multivariate data. Finds the m < d eigenvectors of the covariance matrix with
the largest eigenvalues. These eigenvectors are the principal components. Decompose the
data in these principal components and thus obtain a more concise data set.
Caution 1: depends on the normalization of the data! Ideal for data with equal units.
Caution 2: works only for linear relations between the parameters.
Caution 3: valuable information can be lost in PCA.
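The PCA recipe above (mean-center, eigendecompose the covariance matrix, keep the m eigenvectors with the largest eigenvalues) can be sketched in numpy on a synthetic data set; the data, m = 2, and the variable names are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
# 200 observations of 3 parameters; the third is (almost) a linear
# combination of the first two, so 2 principal components suffice.
base = rng.normal(size=(200, 2))
third = base @ np.array([0.5, -1.0]) + 0.01 * rng.normal(size=200)
data = np.column_stack([base, third])

centered = data - data.mean(axis=0)   # PCA works on mean-centered data
C = np.cov(centered.T)
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order

# Keep the m = 2 eigenvectors with the largest eigenvalues:
components = eigvecs[:, -2:]
reduced = centered @ components       # concise (200 x 2) representation

explained = eigvals[-2:].sum() / eigvals.sum()
print(reduced.shape, explained)
```

Here the two retained components explain nearly all the variance, illustrating the "linear relations" caveat: the reduction is near-lossless only because the third parameter really is a linear combination of the others.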
Factor Analysis
Represent data with fewer variables
Not invariant for transformations: multiple equivalent solutions
Widely used, especially in the humanities ("alpha" sciences) and in medicine
Multidimensional scaling
Equivalent to PCA for Euclidean distances, and applicable also when there are non-linear
relations between the parameters. Input: a similarity or distance map between the data
points. Output: a 2D (or higher-D) map of the data points.
Kohonen self-organizing (SO) feature map
(Will be addressed later.) Input: a distance or similarity map. Output: a 2D (or higher-D)
map of the data points.