Clustering methods: Part 10
Very large data sets
Pasi Fränti
5.5.2014
Speech and Image Processing Unit
School of Computing
University of Eastern Finland
Methods for large data sets
• Birch
• Clarans
• On-line EM
• Scalable EM
• GMG (studied here; no material for the others)
Gradual model generator (GMG)
[Kärkkäinen & Fränti, 2007: Pattern Recognition]
[Figure: GMG block diagram with the parts Data, Buffer, Selection, Model generation, Generated model, Model, Postprocessing (model size reduction), Output models]
Goal of the GMG algorithm
[Figure: contours of probability density distributions, EM vs. GMG]
Model update
• New data points are mapped to the model immediately when they are input.
• Points that are too far from every model remain in the buffer.
• Buffered points are re-tested when new models are created.
[Figure: model before update and after update]
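The three update rules above can be sketched as follows. This is a minimal illustration, not the implementation from the paper. It assumes each "model" is a single spherical component with a running mean, that "too far" means a Euclidean distance above a fixed threshold, and that a mapped point updates the mean incrementally. The names `Component`, `map_point`, `retest_buffer` and `max_dist` are hypothetical.

```python
import math


class Component:
    """Hypothetical spherical component: a running mean and a point count."""

    def __init__(self, mean):
        self.mean = list(mean)
        self.count = 1

    def distance(self, point):
        # Assumption: plain Euclidean distance to the component mean.
        return math.dist(self.mean, point)

    def absorb(self, point):
        # Incremental mean update: mean += (x - mean) / n
        self.count += 1
        for i, x in enumerate(point):
            self.mean[i] += (x - self.mean[i]) / self.count


def map_point(point, components, buffer, max_dist):
    """Map a point to its nearest component, or buffer it if all are too far."""
    if components:
        nearest = min(components, key=lambda c: c.distance(point))
        if nearest.distance(point) <= max_dist:
            nearest.absorb(point)
            return True
    buffer.append(point)
    return False


def retest_buffer(components, buffer, max_dist):
    """Re-test every buffered point, e.g. after a new component is created."""
    pending = list(buffer)
    buffer.clear()
    for point in pending:
        # Points that are still too far are appended back to the buffer.
        map_point(point, components, buffer, max_dist)
```

Each incoming point would go through `map_point`, and `retest_buffer` would be called once after every newly generated component.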
Generating new components
• When the buffer is full, selected points are used to generate new components.
• The most compact k-neighborhood is selected as the seed for a new component.
[Figure: data in buffer; selected points and a new component]
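One possible reading of the seed selection is sketched below. The slides do not define "most compact", so this sketch assumes it means the point whose k nearest neighbours in the buffer have the smallest total Euclidean distance to it. It also assumes the new component starts at the mean of the selected points. The function names are hypothetical.

```python
import math


def most_compact_neighborhood(buffer, k):
    """Return indices of the point and its k nearest neighbours whose
    total distance to that point is smallest (assumed compactness measure)."""
    if len(buffer) <= k:
        raise ValueError("need more than k buffered points")
    best_spread, best_idx = math.inf, None
    for i, p in enumerate(buffer):
        # Sort the other points by distance to p and keep the k closest.
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(buffer) if j != i
        )[:k]
        spread = sum(d for d, _ in others)
        if spread < best_spread:
            best_spread = spread
            best_idx = [i] + [j for _, j in others]
    return best_idx


def seed_component(buffer, k):
    """Remove the most compact neighbourhood from the buffer and return
    its mean (the seed of a new component) together with the points used."""
    idx = most_compact_neighborhood(buffer, k)
    selected = [buffer[j] for j in idx]
    dim = len(selected[0])
    mean = [sum(p[d] for p in selected) / len(selected) for d in range(dim)]
    # Delete from the highest index down so earlier indices stay valid.
    for j in sorted(idx, reverse=True):
        del buffer[j]
    return mean, selected
```

This brute-force search costs O(n² log n) in the buffer size n, which is acceptable only because the buffer is assumed to be small compared with the full data set.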
Example
[Figures: six example slides with 2-D plots; both axes range from 0 to 1 where tick labels survive]
Post-processing
[Figure: model before processing; both axes 0 to 1]
[Figure: model before processing vs. updated model; both axes 0 to 1]
[Figure: model before processing vs. updated model + data; both axes 0 to 1]
Literature
1. I. Kärkkäinen and P. Fränti, "Gradual model generator for single-pass clustering", Pattern Recognition, 40(3), 784-795, March 2007.
2. P. Bradley, U. Fayyad and C. Reina, "Clustering Very Large Databases Using EM Mixture Models", Proc. of the 15th Int. Conf. on Pattern Recognition, vol. 2, 76-80, 2000.
3. R. Ng and J. Han, "CLARANS: A Method for Clustering Objects for Spatial Data Mining", IEEE Trans. Knowledge & Data Engineering, 14(5), 1003-1016, 2002.
4. M. Sato and S. Ishii, "On-line EM Algorithm for the Normalized Gaussian Network", Neural Computation, 12(2), 407-432, 2000.
5. T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications", Data Mining and Knowledge Discovery, 1(2), 141-182, 1997.