Vertical Set Inner Product (VSIP) Technology with
Predicate-trees
William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND, 58105, USA
[email protected]
ABSTRACT
In this paper, a Set Inner Product construct is used to identify high-quality centroids for partitioning clustering methods such as k-means or k-medoids (or as a very fast clustering method in and of itself). A strong advantage is that the number of centroids, k, need not be pre-specified but is determined effectively within the algorithm. The method can also be used to identify outliers. It is fast and scales well to very large data sets. The method applies to vertically structured data sets and uses vertical structures called Predicate-trees, or P-trees. We show that the method can identify high-quality centroids (which may need no further refinement) and outliers, and that it is fast and scalable.
General Terms
Algorithms, Performance.
Keywords
Clustering, Vertical Set Inner Products, P-tree.
1. INTRODUCTION
One of the primary data mining tasks is clustering, which aims to discover and understand the natural structures or groups in a data set [8]. The goal of clustering is to collect similar objects mutually exclusively and collectively exhaustively, achieving minimal dissimilarity within each cluster and maximal dissimilarity among clusters [4]. Many useful clustering methods, such as partitioning, hierarchical, density-based, grid-based, and model-based methods, have been proposed over the last decade [9][4].

This paper focuses on partitioning clustering methods. In a partitioning clustering problem, the aim is to partition a given set of n points in m-dimensional space into k groups, called clusters, so that points within each cluster are near each other. The best-known partitioning methods are k-means and k-medoids, and their variants. These methods are successful when the clusters are compact clouds that are rather well separated from one another. However, this approach has three main shortcomings. The first is the necessity for users to specify k, the number of clusters, in advance, which is impractical for many real applications because users may have no prior knowledge of the distribution of the data. The second is that the computational complexity of these methods is O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations, which is neither efficient nor scalable for large datasets (n is large). The third is that the initial points are chosen randomly; if the points selected are far from the means or the true centroids, both the quality of the clusters and the efficiency of the process deteriorate significantly (t is large).

In this paper, we introduce the concept of the Set Inner Product to cope with these problems. Our new approach is founded on the vertical construction of the Set Inner Product, through which high-quality centroids are identified for partitioning clustering methods, so users do not need to pre-define k at all. A second advantage of our method is that the construction of the Set Inner Product is itself a clustering process, and this process can detect outliers as well. The calculation of the Set Inner Product is based on a vertical structure, the P-tree, which makes our method efficient and scalable to very large datasets. We successfully applied our algorithm to several real-world datasets, showing that it produces high-quality centroids and that it outperforms existing approaches in terms of efficiency and scalability.
The remainder of the paper is organized as follows. Section 2
presents a brief introduction to Predicate Trees (P-Trees). Section
3 presents inner product formulas and examples. Section 4
presents an algorithm for calculating the Set Inner Product. Set
Inner Products are very useful in determining the local density of
points and therefore in choosing centroids and identifying
outliers. The Set Inner Product calculation is very fast and
scalable, using vertical data structures. Section 5 discusses
performance evaluation. Section 6 presents conclusions and future
work.
2. PREDICATE TREES (P-TREES)
The P-tree is a vertically partitioned, lossless, compressed representation of binary data [2][7] that is well suited for representing data sets that are normally stored in a horizontal fashion. As seen in [5][6], P-trees have been used successfully in data mining applications such as k-nearest-neighbor classification. Any data set can be represented as a relation R with attributes A1 through An, denoted R(A1, A2, ..., An), with the key normally being A1.
In this explanation the data is byte-based, and the P-tree converts it into a bit-based representation. There is no limitation on the data type size; 8 bits is used for simplicity and can be expanded to multiple bytes without loss of generality. Each 8-bit attribute is converted to 8 P-trees, one per bit position, each recursively giving the count of 1-bits in that bit position of the attribute. For a table with 3 attributes of 8 bits each, we generate 24 P-trees. A P-tree recursively subdivides a bit band into halves until the bit level (purity) is reached, recording the count of 1-bits under the predicate "all bits are 1." This is also called bit sequential (bSQ) format: all the bits of a byte are vertically partitioned into separate bit vectors, which are represented by level 0 of a P-tree.
The following example illustrates this idea for one bit slice of an attribute Ai, with bits numbered from 7 down to 0, bit 7 being the most significant. Given A17, read as attribute 1, bit slice 7, Figure 1 shows the conversion to a P-tree.

[Figure 1. P-tree of attribute A17: the bit slice A17 (11100001) and its P-tree P17, shown from level L3 (root) down to L0 (leaf level).]

We can see that at Level 3 (L3) the tree represents the predicate "all 1," and since the root is 0 we conclude that the bit slice is not all 1's. We denote a half as pure when it contains 2^level 1-bits, with the root having the highest level and the deepest leaf having level 0. The compression is visible at L1, where two branches are not expanded: when a sub-tree is pure, that branch does not need to be expanded. Reading the leaves of the tree, we see 11100001 at the leaf level; the original bit slice is recovered by the formula (leaf value) * 2^L, read from left to right.
P-trees allow for efficient processing of data at the bit level and can be used to perform fast AND, OR, and exclusive-OR operations on the data. Several optimization techniques have been developed and can be found in the references.
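To make the bSQ decomposition and the root counts concrete, here is a minimal Python sketch of the idea. It is our illustration rather than the authors' implementation: the helper names (bit_slices, p_and, root_count) are ours, and the compressed tree levels are omitted, so only the vertical bit vectors and the counts returned by ANDing them are shown.

def bit_slices(values, bits=8):
    """Vertically decompose a column of unsigned integers into `bits` bit
    vectors (bSQ format). slices[j][t] is bit j of the t-th value."""
    return {j: [(v >> j) & 1 for v in values] for j in range(bits - 1, -1, -1)}

def p_and(*slices):
    """Bitwise AND of bit vectors: the basic P-tree combination operation."""
    return [int(all(bits)) for bits in zip(*slices)]

def root_count(slice_):
    """rc(P): the number of 1-bits, i.e. the count reported at the P-tree root."""
    return sum(slice_)

# A tiny column of byte values for attribute A1.
A1 = [0b11100001, 0b10000000, 0b00000001, 0b11111111]
P1 = bit_slices(A1)                       # P1[7] plays the role of P_{1,7}
print(root_count(P1[7]))                  # rows with bit 7 set -> 3
print(root_count(p_and(P1[7], P1[0])))    # rows with bits 7 and 0 set -> 2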
3. VERTICAL SET INNER PRODUCT
The vertical set inner product primarily measures the total variation of a set of points X about a point a. The formula is defined as follows:

(X - a) \circ (X - a)
  = \sum_{x \in X} (x - a) \circ (x - a)
  = \sum_{x \in X} \sum_{i=1}^{n} (x_i - a_i)^2
  = \sum_{x \in X} \sum_{i=1}^{n} x_i^2 \; - \; 2 \sum_{x \in X} \sum_{i=1}^{n} x_i a_i \; + \; \sum_{x \in X} \sum_{i=1}^{n} a_i^2
  = T_1 - T_2 + T_3

where n is the number of attributes (dimensions), b is the number of bits per attribute, x_{i,j} and a_{i,j} denote bit j of x_i and a_i, P_{i,j} is the P-tree of bit slice j of attribute i, P_X is the mask P-tree of the set X, rc(\cdot) denotes the root count, and

T_1 = \sum_{x \in X} \sum_{i=1}^{n} x_i^2
    = \sum_{i=1}^{n} \Big[ \sum_{j=b-1}^{0} 2^{2j} \, rc(P_X \wedge P_{i,j})
      \; + \; \sum_{j=b-1}^{1} \sum_{l=j-1}^{0} 2^{j+l+1} \, rc(P_X \wedge P_{i,j} \wedge P_{i,l}) \Big]

T_2 = 2 \sum_{x \in X} \sum_{i=1}^{n} x_i a_i
    = 2 \sum_{i=1}^{n} a_i \sum_{j=b-1}^{0} 2^{j} \, rc(P_X \wedge P_{i,j})

T_3 = \sum_{x \in X} \sum_{i=1}^{n} a_i^2
    = rc(P_X) \sum_{i=1}^{n} \Big( \sum_{j=b-1}^{0} 2^{j} a_{i,j} \Big)^2
    = rc(P_X) \sum_{i=1}^{n} a_i^2

The quantity (X - a) \circ (X - a) therefore measures the sum of the squared lengths of the vectors connecting the points of X to a. Thus (X - a) \circ (X - a) / N, where N refers to the total number of points in X, essentially measures the total variation of the set of points in class X about a.
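As a sanity check on the formulas above, the following is a small, purely illustrative Python sketch (our own, not the paper's code) that computes (X - a) o (X - a) both vertically, from root counts of ANDed bit slices following T1 - T2 + T3, and by a direct horizontal scan. It assumes unsigned b-bit attributes and models rc(P_X ^ P_{i,j}) by simply counting matching rows.

def bit(v, j):
    return (v >> j) & 1

def set_inner_product_vertical(X, a, b=8):
    """(X - a) o (X - a) via T1 - T2 + T3, using root counts of ANDed bit slices."""
    n = len(a)                                  # number of attributes
    rcX = len(X)                                # rc(P_X); here X is the whole set
    # rc(P_X ^ P_ij) and rc(P_X ^ P_ij ^ P_il), modeled by counting matching rows.
    rc1 = {(i, j): sum(bit(x[i], j) for x in X)
           for i in range(n) for j in range(b)}
    rc2 = {(i, j, l): sum(bit(x[i], j) & bit(x[i], l) for x in X)
           for i in range(n) for j in range(b) for l in range(j)}
    T1 = sum(2 ** (2 * j) * rc1[i, j] for i in range(n) for j in range(b)) \
       + sum(2 ** (j + l + 1) * rc2[i, j, l]
             for i in range(n) for j in range(1, b) for l in range(j))
    T2 = 2 * sum(a[i] * 2 ** j * rc1[i, j] for i in range(n) for j in range(b))
    T3 = rcX * sum(ai ** 2 for ai in a)
    return T1 - T2 + T3

def set_inner_product_horizontal(X, a):
    """The same quantity by a direct scan of the horizontal records."""
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in X)

X = [(3, 7, 250), (10, 12, 13), (200, 1, 5)]   # three rows, three 8-bit attributes
a = (9, 9, 9)
assert set_inner_product_vertical(X, a) == set_inner_product_horizontal(X, a)
print(set_inner_product_horizontal(X, a))

The dictionaries rc1 and rc2 depend only on X, which is what allows them to be pre-computed once and reused across many inner product calculations.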
4. ALGORITHM DESIGN
4.1 Algorithm
The Set Inner Product [1] algorithm produces a table of high-quality information, which can be used for clustering and outlier analysis. Creating this table is fast and efficient using vertical data structuring. Our choice of vertical data structure, the Predicate-tree or P-tree [2][7], allows us to build out rectangular neighborhoods of increasing radius until an upturn or downturn in density is discovered, and to do so in a scalable manner. A vector a is first randomly selected from the space X. The Set Inner Product of X about a is then calculated and inserted into a table. The table contains columns for the selected points a, the Set Inner Product about each a (which we denote as (X-a)o(X-a) for reasons which will become clear later), the radii to which the point a can be built out before the local density begins to change significantly, those local build-out densities, and the direction of change in the final local build-out density (up or down). A very important part of this algorithm is the pruning of X before iteratively picking the next value a (prune off the maximum built-out disk for which the density remains roughly constant). This pruning step facilitates the building of a table containing the essential local density information we need, without requiring a full scan of X. The table structure is shown in Table 1, and the algorithm is as follows (an illustrative sketch appears after the steps):
1. Select an a ∈ X.
2. Build out a disk D(a, r) of radius r about a, increasing r until the density changes significantly (this significance level is an input parameter).
3. Calculate the Set Inner Product about a, the build-out radii r, the build-out densities, and the direction of change of the final build-out density. Enter all of these into the table.
4. Prune all points in the largest disk of common density (these points have approximately the same local density as a).
5. Select another point from what remains of X.
6. Repeat from step 1 until X is empty.
7. Sort the table on the Set Inner Product column.
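The sketch below is a heavily simplified, purely illustrative rendering of these steps (our own, not the authors' implementation): it uses plain Euclidean disks and a naive ring-count density estimate in place of the P-tree build-out, and density_change_threshold is a hypothetical stand-in for the significance level mentioned in step 2.

import math, random

def dist(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def set_inner_product(X, a):
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in X)

def build_table(points, radii=(5, 10, 15), density_change_threshold=3.0):
    X = list(points)
    table = []
    while X:
        a = random.choice(X)                              # step 1: pick a point a
        sip = set_inner_product(points, a)
        densities, last, direction = [], 0, ''
        for r in radii:                                   # step 2: grow the disk
            inside = sum(1 for x in X if dist(x, a) <= r)
            ring = inside - last                          # points added by this ring
            last = inside
            prev = densities[-1] if densities else ring
            densities.append(ring)
            ratio = ring / max(prev, 1)
            if ratio > density_change_threshold or ratio < 1 / density_change_threshold:
                direction = '+' if ring > prev else '-'   # step 3: record the change
                break
        table.append((a, sip, radii[:len(densities)], densities, direction))
        keep = len(densities) - (2 if direction else 1)   # largest ~constant-density disk
        X = [x for x in X if dist(x, a) > radii[keep]]    # step 4: prune that disk
        # steps 5-6: the loop then picks another point from what remains of X
    table.sort(key=lambda row: row[1])                    # step 7: sort on the SIP column
    return table

# Tiny 2-D example: three nearby points and one far-away outlier.
for row in build_table([(1, 1), (2, 2), (1, 2), (50, 50)]):
    print(row)

In the vertical implementation, disk membership and the associated counts would come from P-tree AND operations and root counts rather than from distance scans over the raw data.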
[Figure 2. An Example of the Algorithm. (Figure labels: Outlier, Deep Cluster, Data Set.)]
Table 1. Local density information table.
a
a7
a3
a101
(X-a)o(X-a)
38
38
38
44
44
46
46
46
Build-out
Radius
5
10
15
5
10
5
10
15
Build-out
Density
10
11
3
9
1
18
17
38
+/variation
-
+
Lower values of (X-a)o(X-a) are associated with points a that are "deep cluster points" (deep within clusters), while higher values of (X-a)o(X-a) are associated with outliers. Middle values of (X-a)o(X-a) are further examined using the other columns of the table to determine their nature.

The table can be sorted on the Set Inner Product column. The sorted table can then be used to quickly select a high-quality value for k (the number of centroids). Those centroids can be used to cluster the data set (putting each non-centroid point with its closest centroid point) or, if we wish to perform k-means clustering (with, likely, very few iterations), the method can be used as a pre-processing step for k-means or k-medoid clustering. The method can also be used to identify outliers (the contents of the disks centered at the bottom rows of the table, until ~98.5% of the set has been identified as outliers), or to detect cluster boundary points by examining the outer disk where the variation changes.
Figure 2 shows an example of the algorithm pictorially. The clusters are built based on the total variation, and the disks are built out with increasing radius r subject to the density threshold. As we build out the disks, when a major change in the density of the disks is observed, the information is added to the table and the disk is pruned. For example, if there is a data point in the first disk but no data points in the next disk as we increase the radius r, we consider that point an outlier, store the information in the table, and prune the disk; we do not build out further disks around that point. In the case of boundary points, we store the information in the table and prune the points based on the change in the variation of the disk.
5. EXPERIMENTAL RESULTS
In this section we report the experiments we conducted to evaluate the algorithm. Most of the compute time of the algorithm is spent computing the set inner product. We show that the use of the vertical P-tree data structure for computing the set inner product scales with respect to the data size. We compare the execution time of the set inner product calculation using a vertical approach (vertical data structure and horizontal bitwise AND operation) with a horizontal approach (horizontal data structure and vertical scan operation), and report execution time with respect to the size of the data set. The performance of both approaches was observed under different machine specifications, including an SGI Altix CC-NUMA machine. Table 2 summarizes the machines used for the experiments.
Table 2. The specification of machines used.

  Machine     Specification
  AMD1GB      AMD Athlon K7 1.4 GHz, 1 GB RAM
  P42GB       Intel P4 2.4 GHz, 2 GB RAM
  SGI Altix   SGI Altix CC-NUMA, 12 processors, shared memory (12 x 4 GB RAM)
[Figure 3. Average time running under different machines: time to compute the Set Inner Product (seconds) versus data set size (x 1024^2), for the series Horz-AMD-1G, Horz-P4-2G, Horz-SGI-48G, and Virt-AMD-1G; out-of-memory points are marked.]
The experimental data was generated from a set of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota (longitude 97°42'18"W), taken in 1998. The image contains three bands: red, green, and blue reflectance values. We use the original image of size 1024x1024 pixels (a cardinality of 1,048,576). Corresponding synchronized data for soil moisture, soil nitrate, and crop yield were also used for the experimental evaluation. Combining all bands and the synchronized data, we obtained a dataset with 6 dimensions. Additional datasets of different sizes were synthetically generated from the original data to study timing and scalability.
We observed the timing with respect to scalability when executing on machines with different hardware configurations. We were forced to increase the hardware configuration to accommodate the horizontal version of the set inner product calculation on the larger data sets, whereas we were able to compute the set inner product for the largest data set using the vertical P-tree data structure on the smallest machine (the AMD Athlon with 1 GB of memory). Table 3 presents the average time to compute the set inner product using the two techniques on the different machines, and Figure 3 further illustrates performance with respect to scalability.
Table 3. Average time (seconds) to compute the set inner product under different hardware configurations (* : out of memory).

  Dataset Size    Horizontal                                Vertical
  (x 1024^2)      AMD-1GB    P4-2GB    SGI Altix 12x4GB     AMD-1GB
  1               0.55       0.46      1.37                 0.00008
  2               1.10       0.91      2.08                 0.00008
  4               2.15       1.85      3.97                 0.00010
  8               *          3.79      8.48                 0.00010
  16              *          *         16.64                0.00010
  24              *          *         28.80                0.00010
As Figure 3 shows, the horizontal approach is very sensitive to the available memory on the machine. The vertical approach, in contrast, requires only about 0.0001 seconds on average to complete the calculation on all data sets, far less than the horizontal approach. This significant improvement in computation time is due to the reuse of root count values that are pre-computed when the P-trees are created. Although different vectors a are fed into the calculation for different data points, the pre-computed root counts can be used repeatedly; they need to be computed only once and can be reused regardless of how many inner product calculations are performed, as long as the dataset does not change. Notice also that the vertical approach tends to have a constant execution time even as the dataset size grows.
One may argue that the pre-calculation of root counts makes this comparison unfair. However, consider the time required to load the vertical data structure into memory and perform the one-time root count operations for the vertical approach, versus the time to load the horizontal records into memory, given in Table 4. The performance of the vertical approach is comparable to that of the horizontal approach; in fact, loading the horizontal records takes slightly longer than loading the P-trees and computing the root counts. This illustrates the ability of the P-tree data structure to load efficiently and compute the simple counts. These timings were measured on a P4 with 2 GB of memory.
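As a small usage illustration of this reuse (again our own sketch, not the authors' code), the root-count tables depend only on the data set X, so they can be computed once and shared across arbitrarily many inner product queries:

# Root counts depend only on X, so compute them once and reuse them for every a.
def precompute_root_counts(X, b=8):
    bit = lambda v, j: (v >> j) & 1
    n = len(X[0])
    rc1 = {(i, j): sum(bit(x[i], j) for x in X)
           for i in range(n) for j in range(b)}
    rc2 = {(i, j, l): sum(bit(x[i], j) & bit(x[i], l) for x in X)
           for i in range(n) for j in range(b) for l in range(j)}
    return len(X), rc1, rc2

def sip_from_counts(counts, a, b=8):
    rcX, rc1, rc2 = counts
    n = len(a)
    T1 = sum(4 ** j * rc1[i, j] for i in range(n) for j in range(b)) + \
         sum(2 ** (j + l + 1) * rc2[i, j, l]
             for i in range(n) for j in range(1, b) for l in range(j))
    T2 = 2 * sum(a[i] * 2 ** j * rc1[i, j] for i in range(n) for j in range(b))
    return T1 - T2 + rcX * sum(ai ** 2 for ai in a)

X = [(3, 7, 250), (10, 12, 13), (200, 1, 5)]
counts = precompute_root_counts(X)            # one-time cost while X is unchanged
for a in [(9, 9, 9), (0, 0, 0), (128, 64, 2)]:
    print(a, sip_from_counts(counts, a))      # every query reuses the same counts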
Table 4. Time (seconds) for computing root counts and loading the dataset.

  Dataset Size    Vertical: Root Count Pre-Computation    Horizontal: Dataset
  (x 1024^2)      and P-tree Loading                      Loading
  1                3.900                                   4.974
  2                8.620                                  10.470
  4               18.690                                  19.914
  8               38.450                                  39.646
6. CONCLUSION
???.
7. REFERENCES
[1] Abidin, T., and Perrizo, W. Vertical Set Inner Products Formula. http://midas.cs.ndsu.nodak.edu/~abidin/research/PSIPs.pdf
[2] Ding, Q., Khan, M., Roy, A., and Perrizo, W. (2002). The P-tree Algebra. Proceedings of the ACM Symposium on Applied Computing, 426-431.
[3] Weisstein, E. W., et al. Total Variation. From MathWorld – A Wolfram Web Resource. http://mathworld.wolfram.com/TotalVariation.html
[4] Han, J., and Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA.
[5] Khan, M., Ding, Q., and Perrizo, W. (2002). K-Nearest Neighbor Classification of Spatial Data Streams Using P-trees. Proceedings of the PAKDD, 517-528.
[6] Perera, A., Denton, A., Kotala, P., Jockheck, W., Granda, W. V., and Perrizo, W. (2002). P-tree Classification of Yeast Gene Deletion Data. SIGKDD Explorations, 4(2): 108-109.
[7] Perrizo, W. (2001). Peano Count Tree Technology. Technical Report NDSU-CSOR-TR-01-1.
[8] Baumgartner, C., Plant, C., Kailing, K., Kriegel, H.-P., and Kroger, P. (2004). Subspace Selection for Clustering High-Dimensional Data. Proceedings of the 4th IEEE International Conference on Data Mining.
[9] Bohm, C., Kailing, K., Kriegel, H.-P., and Kroger, P. (2004). Density Connected Clustering with Local Subspace Preferences. Proceedings of the 4th IEEE International Conference on Data Mining.
[10] Feder, T., and Greene, D. (1988). Optimal Algorithms for Approximate Clustering. Proceedings of the 20th ACM Symposium on Theory of Computing, Chicago, Illinois, 434-444.