Last lecture summary
Test-data and Cross Validation
[Figure: testing error and training error as a function of model complexity]
Test set method
• Split the data set into training and test data sets.
• Common ratio – 70:30
• Train the algorithm on the training set, assess its
performance on the test set.
• Disadvantages
– The method is simple; however, it wastes data.
– The test-set estimate of performance has high variance.
adapted from Cross Validation tutorial, Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
• stratified division
– the same proportion of each class in the training and test
sets
• Training error cannot be used as an indicator
of the model’s performance due to overfitting.
• Training data set – train a range of models, or
a given model with a range of values for its
parameters.
• Compare them on independent data – the
validation set.
– If the model design is iterated many times, some
overfitting to the validation data can occur, so it
may be necessary to keep aside a third set – the
test set – on which the performance of the
selected model is finally evaluated.
LOOCV (leave-one-out cross-validation)
1. choose one data point
2. remove it from the set
3. fit the remaining data points
4. note your error using the removed data point as
test
Repeat these steps for all points. When you are
done, report the mean squared error (in case of
regression).
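A minimal numpy sketch of LOOCV for regression (the toy data and the degree-1 polynomial model are illustrative assumptions, not from the lecture):

```python
import numpy as np

def loocv_mse(x, y, degree=1):
    """Leave-one-out CV: hold out each point, fit on the rest, record the squared error."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i                 # every point except the i-th
        coeffs = np.polyfit(x[mask], y[mask], degree)
        pred = np.polyval(coeffs, x[i])
        errors.append((y[i] - pred) ** 2)
    return np.mean(errors)

# toy data: a noisy line
x = np.linspace(0, 1, 20)
y = 2 * x + 0.1 * np.random.randn(20)
print(loocv_mse(x, y, degree=1))
```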
k-fold cross-validation
1. randomly break data into k partitions
2. remove one partition from the set
3. fit the remaining data points
4. note your error using the removed partition as
test data set
Repeat these steps for all partitions. When you are
done, report the mean squared error (in case of
regression).
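A corresponding numpy sketch of k-fold cross-validation (same assumed polynomial-regression setting; LOOCV is the special case k = n):

```python
import numpy as np

def kfold_mse(x, y, k=10, degree=1, seed=0):
    """k-fold CV: fit on k-1 partitions, test on the held-out one, average the MSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))                     # random partitioning
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[fold])
        errors.append(np.mean((y[fold] - pred) ** 2))
    return np.mean(errors)

x = np.linspace(0, 1, 50)
y = 2 * x + 0.1 * np.random.randn(50)
print(kfold_mse(x, y, k=10, degree=1))
```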
Selection and testing
• Complete procedure for algorithm selection and
estimation of its quality:
1. Divide the data into Train and Test sets.
2. Choose the algorithm by cross-validation on the Train set
(splitting it repeatedly into Train and Val parts).
3. Use this algorithm to construct a classifier using the whole Train set.
4. Estimate its quality on the Test set.
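An end-to-end sketch of this procedure using scikit-learn; the data, the candidate values of k and the 70:30 split are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# toy data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# 1. divide the data into Train / Test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. choose the algorithm (here: the value of k) by cross-validation on Train
candidates = [1, 3, 5, 15]
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
             for k in candidates]
best_k = candidates[int(np.argmax(cv_scores))]

# 3. construct the classifier on the whole Train set
clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)

# 4. estimate its quality on the Test set
print("chosen k:", best_k, "test accuracy:", clf.score(X_te, y_te))
```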
[Table: polynomial regression of degree 1–6, comparing MSE_train with MSE_10-fold and marking the chosen degree]
adapted from Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
Model selection via CV
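A sketch of how the degree could be chosen by 10-fold CV with scikit-learn (the synthetic data is an assumption for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 60)
y = np.sin(3 * x) + 0.2 * rng.standard_normal(60)     # assumed toy data

for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse_train = np.mean((y - model.fit(x[:, None], y).predict(x[:, None])) ** 2)
    mse_cv = -cross_val_score(model, x[:, None], y, cv=10,
                              scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: MSE_train={mse_train:.4f}  MSE_10-fold={mse_cv:.4f}")
# pick the degree with the lowest MSE_10-fold
```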
Nearest Neighbors Classification
• Similarity sij is a quantity that reflects the
strength of the relationship between two objects
or two features.
• Distance dij measures dissimilarity.
– Dissimilarity measures the discrepancy between
two objects based on several features.
– Distance satisfies the following conditions:
• distance is always positive or zero (dij ≥ 0)
• distance is zero if and only if an object is compared with itself (dij = 0 ⇔ i = j)
• distance is symmetric (dij = dji)
– In addition, if distance satisfies the triangle
inequality d(x, z) ≤ d(x, y) + d(y, z), then it is called a metric.
Distances for quantitative variables
• Minkowski distance (Lp norm)
$L_p = d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
• distance matrix – matrix with all pairwise
distances
       p1      p2      p3      p4
p1     0       2.828   3.162   5.099
p2     2.828   0       1.414   3.162
p3     3.162   1.414   0       2
p4     5.099   3.162   2       0
Manhattan distance
$L_1 = d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
[Figure: Manhattan distance between points x = (x1, x2) and y = (y1, y2)]
Euclidean distance
$L_2 = d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }$
[Figure: Euclidean distance between points x = (x1, x2) and y = (y1, y2)]
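A small sketch of these distances for a pair of points (the sample vectors are arbitrary):

```python
import numpy as np

def minkowski(x, y, p):
    """L_p distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, 1))   # Manhattan (L1): 7.0
print(minkowski(x, y, 2))   # Euclidean (L2): 5.0
```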
k-NN
• supervised learning
• target function f may be
– discrete-valued (classification)
– real-valued (regression)
• We assign a query point to the class of the
training instances most similar to it.
• k-NN is a lazy learner
• lazy learning
– generalization beyond the training data is delayed
until a query is made to the system
– opposed to eager learning – system tries to
generalize the training data before receiving
queries
Which k is best?
• k = 1 – fits noise and outliers, i.e. overfitting
• k = 15 – a larger value gives a smoother decision boundary;
too large a value smooths out distinctive behavior
Hastie et al., Elements of Statistical Learning
Real-valued target function
• Algorithm calculates the mean value of the k
nearest training examples.
Example (k = 3): the three nearest neighbors have values 12, 14 and 10,
so the predicted value is (12 + 14 + 10)/3 = 12.
Distance-weighted NN
• Give greater weight to closer neighbors.
Example (k = 4): two neighbors of one class lie at distances 1 and 2,
two neighbors of the other class at distances 4 and 5.
– unweighted: 2 votes vs. 2 votes (a tie)
– weighted by 1/d²: 1/1² + 1/2² = 1.25 votes vs. 1/4² + 1/5² ≈ 0.102 votes
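A minimal sketch of this inverse-squared-distance voting (the neighbor distances and labels mirror the example above; weighting by 1/d² is one common choice):

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label) pairs; each vote is weighted by 1/d**2."""
    votes = defaultdict(float)
    for d, label in neighbors:
        votes[label] += 1.0 / d ** 2
    return max(votes, key=votes.get), dict(votes)

neighbors = [(1, "A"), (2, "A"), (4, "B"), (5, "B")]
print(weighted_vote(neighbors))   # ('A', {'A': 1.25, 'B': 0.1025})
```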
k-NN issues
• Curse of dimensionality is a problem.
• Significant computation may be required to
process each new query.
• To find the nearest neighbors, one has to compute the
distance from the query to every stored training example.
• Efficient indexing of stored training examples
helps
– kd-tree
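For example, a kd-tree index can be built once and queried repeatedly, as in this SciPy sketch (the random data is just for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 3))        # stored training examples
tree = cKDTree(points)                # build the kd-tree index once

query = rng.random(3)
dist, idx = tree.query(query, k=5)    # 5 nearest neighbors without scanning all distances
print(dist, idx)
```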
Cluster Analysis
• We have data, we don’t
know classes.
• Assign data objects into
groups (called clusters) so
that data objects from the
same cluster are more
similar to each other than
objects from different
clusters.
Stages of clustering process
On clustering validation techniques, M. Halkidi, Y. Batistakis, M. Vazirgiannis
How would you solve the problem?
• How to find clusters?
• Group together most similar patterns.
Single linkage
(nearest neighbor method)
based on A Tutorial on Clustering Algorithms
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
Distance matrix between six Italian cities (BA = Bari, FL = Florence, MI = Milan, NA = Naples, RM = Rome, TO = Torino):

        BA    FL    MI    NA    RM    TO
BA       0   662   877   255   412   996
FL     662     0   295   468   268   400
MI     877   295     0   754   564   138
NA     255   468   754     0   219   869
RM     412   268   564   219     0   669
TO     996   400   138   869   669     0

The smallest entry is d(MI, TO) = 138, so MI and TO are merged into the cluster MI/TO. With single linkage, the distance from MI/TO to every other city is the minimum of the distances from MI and from TO:
d(MI/TO, BA) = min(877, 996) = 877
d(MI/TO, FL) = min(295, 400) = 295
d(MI/TO, NA) = min(754, 869) = 754
d(MI/TO, RM) = min(564, 669) = 564
Reduced distance matrix after the first merge:

         BA    FL  MI/TO    NA    RM
BA        0   662   877    255   412
FL      662     0   295    468   268
MI/TO   877   295     0    754   564
NA      255   468   754      0   219
RM      412   268   564    219     0
The smallest entry is now d(NA, RM) = 219, so NA and RM are merged:

         BA    FL  MI/TO  NA/RM
BA        0   662   877    255
FL      662     0   295    268
MI/TO   877   295     0    564
NA/RM   255   268   564      0
Next, BA joins NA/RM at distance 255:

            BA/NA/RM    FL  MI/TO
BA/NA/RM          0    268    564
FL              268      0    295
MI/TO           564    295      0
Then FL joins at distance 268:

               BA/FL/NA/RM  MI/TO
BA/FL/NA/RM              0    295
MI/TO                  295      0
Dendrogram – order of merges:
1. Torino → Milano (138)
2. Rome → Naples (219)
3. → Bari (255)
4. → Florence (268)
5. Join Torino–Milano and Rome–Naples–Bari–Florence (295)
[Dendrogram for single linkage: leaves TO, MI, FL, RM, NA, BA; merge heights 138, 219, 255, 268 and 295 on the dissimilarity axis]
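The same clustering can be reproduced with SciPy; a sketch using the city distance matrix above (the method string 'single' can be swapped for 'complete', 'average', 'centroid' or 'ward'):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

labels = ["BA", "FL", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
])

Z = linkage(squareform(D), method="single")  # condensed distances -> merge tree
print(Z[:, 2])                               # merge heights: 138, 219, 255, 268, 295
dendrogram(Z, labels=labels)
plt.show()
```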
Complete linkage
(farthest neighbor method)
Using the same distance matrix, MI and TO are again merged first (d = 138). With complete linkage, the distance from MI/TO to every other city is the maximum of the distances from MI and from TO:
d(MI/TO, BA) = max(877, 996) = 996
d(MI/TO, FL) = max(295, 400) = 400
d(MI/TO, NA) = max(754, 869) = 869
d(MI/TO, RM) = max(564, 669) = 669
Reduced distance matrix after the first merge (complete linkage):

         BA    FL  MI/TO    NA    RM
BA        0   662   996    255   412
FL      662     0   400    468   268
MI/TO   996   400     0    869   669
NA      255   468   869      0   219
RM      412   268   669    219     0
NA and RM are merged next (d = 219):

         BA    FL  MI/TO  NA/RM
BA        0   662   996    412
FL      662     0   400    468
MI/TO   996   400     0    869
NA/RM   412   468   869      0
The smallest entry is now d(FL, MI/TO) = 400, so FL joins MI/TO:

            BA  MI/TO/FL  NA/RM
BA           0       996    412
MI/TO/FL   996         0    869
NA/RM      412       869      0
[Figure: dendrograms over the cities BA, NA, RM, FL, MI, TO – complete linkage compared with single linkage]
Average linkage
(average linkage method)
Using the same distance matrix, MI and TO are merged first. With average linkage, the distance from MI/TO to another cluster is the average of the individual distances, e.g.
d(MI/TO, BA) = (877 + 996)/2 = 936.5
Centroid linkage
Each cluster is represented by its centroid, and distances are measured between centroids; e.g. the distance from the centroid of MI/TO to BA is 895.
Summary
How is similarity between clusters measured?
• single linkage (MIN)
• complete linkage (MAX)
• average linkage
• centroids
Ward’s linkage (method)
In Ward’s method, a metric does not have to be
chosen. Instead, sums of squares (i.e. squared
Euclidean distances) between the centroids of
clusters are computed.
• Ward's method says that the distance
between two clusters, A and B, is how much
the sum of squares will increase when we
merge them.
• At the beginning of clustering, the sum of
squares starts out at zero (because every point
is in its own cluster) and then grows as we
merge clusters.
• Ward’s method keeps this growth as small as
possible.
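For reference, the increase in the sum of squares caused by merging clusters A and B (with sizes n_A, n_B and centroids c_A, c_B) can be written explicitly; this formula is a standard addition, not shown on the slides:

$\Delta(A, B) = \sum_{x \in A \cup B} \|x - c_{A \cup B}\|^2 - \sum_{x \in A} \|x - c_A\|^2 - \sum_{x \in B} \|x - c_B\|^2 = \frac{n_A n_B}{n_A + n_B} \|c_A - c_B\|^2$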
Types of clustering
• hierarchical
– groups data with a sequence of nested partitions
• agglomerative
– bottom-up
– Start with each data point as its own cluster and keep joining
clusters until all points form one cluster.
• divisive
– top-down
– Initially all objects are in one cluster, then the cluster is
subdivided into smaller and smaller pieces.
• partitional
– divides data points into some prespecified number of
clusters without the hierarchical structure
– i.e. divides the space
Hierarchical clustering
• Agglomerative methods are used more widely.
• Divisive methods need to consider (2^(N−1) − 1)
possible two-subset divisions, which is very
computationally intensive.
– computational difficulty of finding the optimum
partitions
• Divisive clustering methods are better at
finding large clusters than agglomerative
methods.
Hierarchical clustering
• Disadvantages
– High computational complexity – at least O(N²).
• Needs to calculate all mutual distances.
– Inability to adjust once the splitting or merging is
performed
• no undo
k-means
• How to avoid the computing of all mutual
distances?
• Calculate distances from representatives
(centroids) of clusters.
• Advantage: number of centroids is much
lower than the number of data points.
• Disadvantage: number of centroids k must be
given in advance
k-means – kids algorithm
• Once there was a land with N houses.
• One day, K kings arrived in this land.
• Each house was taken by the nearest king.
• But the community wanted their king to be at the
center of the village, so the throne was moved
there.
• Then the kings realized that some houses were
closer to them now, so they took those houses,
but they lost some others. This went on and on…
• Until one day they couldn’t move anymore, so
they settled down and lived happily ever after in
their village.
k-means – adults algorithm
• decide on the number of clusters k
• randomly initialize k centroids
• repeat until convergence (centroids do not
move)
– assign each point to the cluster represented by
the nearest centroid
– move each centroid to the mean of all points in
its cluster
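A compact numpy sketch of this loop (the toy data and k = 3 are assumptions; production code would also handle clusters that become empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]       # random initial centroids
    for _ in range(n_iter):
        # assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # convergence: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
print(centroids)
```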
k-means applet
http://www.kovan.ceng.metu.edu.tr/~maya/kmeans/index.html
• Disadvantages:
– k must be determined in advance.
– Sensitive to initial conditions. The algorithm
minimizes the following “energy” function, but
may be trapped in a local minimum.
$J = \sum_{l=1}^{K} \sum_{x_i \in X_l} \| x_i - \mu_l \|^2$
– Applicable only when the mean is defined; what
about categorical data? E.g. replace the mean with the
mode (k-modes).
– The arithmetic mean is not robust to outliers (use a
medoid instead – k-medoids).
– Clusters tend to be spherical because the algorithm is
based on Euclidean distance.