Vertical Set Square Distance Based Clustering without Prior Knowledge of K
Amal Perera, Taufik Abidin, Masum Serazi, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
{amal.perera, taufik.abidin, md.serazi, william.perrizo}@ndsu.edu
Abstract
Clustering is automated identification of groups of
objects based on similarity. In clustering two major
research issues are scalability and the requirement of
domain knowledge to determine input parameters. Most
approaches suggest the use of sampling to address the issue
of scalability. However, sampling does not guarantee the
best solution and can cause significant loss in accuracy.
Most approaches also require the use of domain knowledge,
trial and error techniques, or exhaustive searching to figure
out the required input parameters. In this paper we
introduce a new clustering technique based on the set
square distance. Cluster membership is determined based
on the set squared distance to the respective cluster. Just as
the mean represents a cluster in k-means and the medoid does in k-medoids,
here the cluster is represented by the entire set of its points for
each evaluation of membership. The set square distance for
all n items can be computed efficiently in O(n) using a
vertical data structure and a few pre-computed values.
A special ordering of the set square distance is used to break
the data into the “natural” clusters, in contrast to the need for
a known k in k-means or k-medoids style partition
clustering. Superior results are observed when the new
clustering technique is compared with the classical k-means
clustering. To prove the cluster quality and the resolution of
the unknown k, data sets with known classes such as the iris
data, the uci_kdd network intrusion data, and synthetic data
are used. The scalability of the proposed technique is
proved using a large RSI data set.
Keywords
Vertical Set Square Distance, P-trees, Clustering.
1. INTRODUCTION
Clustering is a very important human activity. Built-in
trainable clustering models are continuously trained from
early childhood allowing us to separate cats from dogs.
Clustering allows us to distinguish between different
objects. Given a set of points in multidimensional space, the
goal of clustering is to compute a partition of these points
into sets called clusters, such that the points in the same
cluster are more similar than points across different
clusters. Clustering allows us to identify dense and sparse
regions and, therefore, discover overall distribution of
interesting patterns and correlations in the data. Automated
clustering is very valuable in analyzing large data, and thus
has found applications in many areas such as data mining,
search engine indexing, pattern recognition, image
processing, trend analysis and many other areas [1][2].
A large number of clustering algorithms exist. In the
clustering literature these algorithms are grouped
into four categories: partitioning methods, hierarchical methods,
density-based (connectivity) methods, and grid-based
methods [1][3]. In partitioning methods, the n objects in the
original data set are broken into k partitions iteratively to
achieve a certain optimal criterion. The most classical and
popular partitioning methods are k-means [4] and k-medoid
[5]. The k clusters are represented by the center of gravity of the
cluster in k-means or by a representative object of the cluster in k-medoid. Each object in the space is assigned to the closest
cluster in each iteration. All the partition-based methods
suffer from the requirement of providing k (the number of
partitions) prior to clustering, are only able to identify spherical
clusters, and may split large genuine clusters in order to
optimize cluster quality [3].
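Since the classical k-means algorithm serves as the comparison baseline later in the paper, a minimal sketch may help fix the iteration just described. This is an illustrative Python implementation of the textbook algorithm, not code from the paper, and the function and parameter names are ours.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Textbook k-means: each cluster is represented by its center of gravity (mean)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign every point to the closest centroid (squared Euclidean distance).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids
```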
A hierarchical clustering algorithm produces a
representation of the nested grouping relationship among
objects. If the clustering hierarchy is formed from bottom
up, at the start each data object is a cluster by itself, then
small clusters are merged into bigger clusters at each level
of the hierarchy based on similarity until at the top of the
hierarchy all the data objects are in one cluster. The major
difference between hierarchical algorithms is how to
measure the similarity between each pair of clusters.
Hierarchical clustering algorithms require the setting of a
termination condition with some prior domain knowledge
and typically they have high computational complexity
[3][8]. Density-based clustering methods attempt to
separate the dense and sparse regions of objects in the data
space [1]. For each point of a cluster the density of data
points in the neighborhood has to exceed some threshold
[10]. Density-based clustering techniques allow the discovery of
arbitrarily shaped clusters through a linking phase. However, they
suffer from the requirement of setting prior parameters
based on domain knowledge to arrive at the best possible
clustering. A grid-based approach divides the data space
into a finite set of multidimensional grid cells and performs
clustering in each cell and then groups those neighboring
dense cells into clusters [1]. Determination of the cell size
and of other parameters affects the final quality of the
clustering.
In general, two of the most demanding challenges in
clustering are scalability and minimal requirement of
domain knowledge to determine the input parameters [1]. In
this work we describe a new clustering mechanism that is
scalable and operates without the need of an initial
parameter that determines the expected number of clusters
in the data set. We describe an efficient vertical technique
to compute the density based on influence using the set
square distance of each data point with respect to all other
data points in the space. Natural partitions in the density
values are used to initially partition the data set into
clusters. Subsequently, the cluster membership of each data
point is confirmed or reassigned by efficiently
recalculating the set square distance with respect to each
cluster.
2. RELATED WORK
Many clustering algorithms work well on small datasets
containing fewer than 200 data objects [1]. The NASA
Earth Observing System will deliver close to a terabyte of
remote sensing data per day and it is estimated that this
coordinated series of satellites will generate peta-bytes of
archived data in the next few years [12][13][14]. For real
world applications, the requirement is to cluster millions of
records using scalable techniques [11]. A general strategy
to scale-up clustering algorithms is to draw a sample or to
apply a kind of data compression before applying the
clustering algorithm to the resulting representative objects.
This may lead to biased results [1][14]. CLARA [1]
addresses the scalability issue by choosing a representative
sample of the data set and then continuing with the classical
k-medoid method. The effectiveness depends on the size of
the sample. CLARANS [6] is an example of a partition-
based clustering technique which uses a randomized and
bounded search strategy to achieve the optimal criterion.
This is achieved by not fixing the sample to a specific set
from the data set for the entire clustering process. An
exhaustive traversal of the search space is not achieved in
the final clustering. BIRCH [7] uses a tree structure that
records the sufficient statistics (summary) for subsets of
data that can be compressed and represented by the
summary. Initial threshold parameters are required to obtain
the best clustering and computational optimality in BIRCH.
Most of the clustering algorithms require the users to input
certain parameters [1]. Consequently, the clustering results
are sensitive to the input parameters. For example,
DENCLUE [10] requires the user to input the cell size to
compute the influence function. DBSCAN [15] needs the
neighborhood radius and minimum number of points that
are required to mark a neighborhood as a core object with
respect to density. To address the requirement for
parameters OPTICS [16] computes an augmented cluster
ordering for automatic and interactive cluster analysis.
OPTICS stores sufficient additional information enabling
the user to extract any density based clustering without
having to re-scan the data set. Parameter-less-ness comes at
a cost. OPTICS has a time complexity of O (n log n) when
used with a spatial index that allows it to easily walk
through the search space. Less expensive partition-based
techniques suffer from the requirement of specifying the number
of expected partitions (k) prior to clustering [1][3][26]. X-means [26] attempts to find k by repeatedly searching
over different k values and testing them against a model
based on Bayesian Information Criterion (BIC). G-means
[17] is another attempt to learn k using a repeated top down
division of the data set until each individual cluster
demonstrates a Gaussian data distribution within a user
specified significance level. ACE [14] maps the search
space to a grid using a suitable weighting function similar to
the particle-mesh method used in physics and then uses a
few agents to heuristically search through the mesh to
identify the natural clusters in the data. The initial weighting
costs only O(n), but the success of the technique depends
on the agent-based heuristic search and the size of the grid
cell. The authors suggest a linear weighting scheme based
on neighboring grid cells and a variable grid cell size to
avoid over-dependence on cell size for quality results.
The linear weighting scheme adds more compute time to
the process.
3. OUR APPROACH
Our approach attempts to address the problem of scalability
in clustering with a partition-based algorithm built on a
vertical data structure (P-tree1) that aids fast computation of
counts. Three major inherent issues with partition-based
algorithms are: the need to input K; the need to initialize the
clusters in a way that leads to an optimal solution; and the choice of cluster
representation (prototype) and the computation of membership
for each cluster. We solve the first two problems based on
the concept of being able to formally model the influence of
each data point using a function first proposed for
DENCLUE [10] and the use of an efficient technique to
compute the total influence rapidly over the entire search
space. Significantly large differences in the total influence
are used to identify the natural clusters in the data set. Data
points with similar total influence are initially put together
as initial clusters to get a better initialization in search of an
optimal clustering in the subsequent iterative process. Each
cluster is represented by the entire cluster. The third
issue is solved with the use of a vertical data structure. We
show an efficient technique that can compute the
membership for each data item by comparing the total
influence of each item against each cluster.
1 Patents are pending on the P-tree technology. This work was partially supported by GSA Grant ACT#: K96130308.
3.1 Influence and Density
The influence function can be interpreted as a function,
which describes the impact of a data point within its
neighborhood [10]. Examples of influence functions are the
parabolic function, the square wave function, and the Gaussian
function. The influence function can be applied to each data
point, and an indication of the overall density of the data space can
be calculated as the sum of the influence functions of all data
points [10]. The density function which results from a
Gaussian influence function for a point ‘a’ with respect to data points ‘xi‘ is
$$f^{D}_{Gaussian}(x, a) = \sum_{i=1}^{n} e^{-\frac{d(a, x_i)^2}{2\sigma^2}}$$
The Gaussian influence function is used in DENCLUE, and
since computing the density for all n data points is O(n²), a grid is used to
compute the density locally [9][10]. The influence function
should be radially symmetric about any point (either
variable), continuous and differentiable. Some other
influence functions are:
$$f^{D}_{Power2m}(x, a) = \sum_{i=1}^{n} \sum_{j=1}^{m} (-1)^{j}\, W_j\, d(x_i, a)^{2j}$$

$$f^{D}_{Parabolic}(x, a) = \sum_{i=1}^{n} \left( \sigma^2 - d(x_i, a)^2 \right)$$
We note that the power 2m function is a truncation of the
Gaussian Maclaurin series. Figure 1 shows the distribution
of the density for the Gaussian (b) and the Parabolic (c)
influence function.
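To make these formulas concrete, the following Python sketch evaluates the Gaussian and parabolic densities of a point against a small data set. It follows the equations as reconstructed above, with the smoothing parameter sigma as an assumed input. Note that, under this reconstruction, the parabolic density equals |X|·σ² minus the sum of squared distances, so ordering points by it is equivalent (in reverse) to ordering them by the set square distance used later.

```python
import numpy as np

def gaussian_density(a, X, sigma=1.0):
    """DENCLUE-style density: sum of Gaussian influences of all points x_i on a."""
    d2 = ((X - a) ** 2).sum(axis=1)              # d(a, x_i)^2 for every x_i
    return float(np.exp(-d2 / (2 * sigma ** 2)).sum())

def parabolic_density(a, X, sigma=1.0):
    """Parabolic-influence density: sum over x_i of (sigma^2 - d(a, x_i)^2)."""
    d2 = ((X - a) ** 2).sum(axis=1)
    return float((sigma ** 2 - d2).sum())
```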
Next we show how the density based on the Parabolic
influence function, hereafter denoted Set Square
Distance, can be efficiently computed using a vertical data
structure.
Vertical data representation consists of set
structures representing the data column-by-column rather
than row-by-row (relational data). Predicate-trees (P-tree)
are one choice of vertical data representation, which can be
used for data mining instead of the more common sets of
relational records. P-trees [23] are a lossless, compressed,
and data-mining-ready data structure. This data structure
has been successfully applied in data mining applications
ranging from Classification and Clustering with K-Nearest
Neighbor, to Classification with Decision Tree Induction, to
Association Rule Mining [18][20][22][24][25]. A basic P-tree represents one attribute bit that is reorganized into a
tree structure by recursively sub-dividing, while recording
the predicate truth value regarding purity for each division.
Each level of the tree contains truth-bits that represent pure
sub-trees and can then be used for fast computation of
counts. This construction is continued recursively down
each tree path until a sub-division is reached that is
entirely pure (which may or may not be at the leaf level).
The basic and complement P-trees are combined using
boolean algebra operations to produce P-trees for values,
entire tuples, value intervals, or any other attribute pattern.
The root count of any pattern tree will indicate the
occurrence count of that pattern. The P-tree data structure
provides the structure for counting patterns in an efficient
manner.
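As a rough illustration of how vertical bit slices and root counts are used, consider the sketch below. A real P-tree is a compressed quadrant tree with purity bits; the flat bit vectors here are only a simplified stand-in that we reuse in the later sketches.

```python
import numpy as np

def bit_slices(column, bits=8):
    """Vertical decomposition of one integer attribute into bit vectors P_j
    (a flat stand-in for the basic P-trees of that attribute)."""
    return {j: ((column >> j) & 1).astype(np.uint8) for j in range(bits)}

def root_count(p):
    """rc(P): number of truth bits represented by the (simplified) P-tree."""
    return int(p.sum())

# Example: count rows where a 3-bit attribute equals 5 (binary 101)
# by ANDing the appropriate basic and complement bit vectors.
col = np.array([5, 3, 5, 7, 1], dtype=np.uint8)
P = bit_slices(col, bits=3)
mask_5 = P[2] & (1 - P[1]) & P[0]      # bit2 = 1, bit1 = 0, bit0 = 1
assert root_count(mask_5) == 2
```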
Binary representation is intrinsically a fundamental
concept in vertical data structures. Let x be a numeric value
of attribute A1. Then the representation of x in b bits is
written as:
$$x = x_{1,b-1} \cdots x_{1,0} = \sum_{j=b-1}^{0} 2^{j} x_{1j}$$
where $x_{1,b-1}$ and $x_{1,0}$ are the highest and lowest order bits, respectively.

[Figure 1. Distribution of density based on influence function: (a) Data Set, (b) Gaussian, (c) Parabolic]
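Continuing the simplified bit-vector stand-in from above, this bit decomposition means that aggregates over a column can be recovered from root counts alone; for example, the column sum equals the sum over bit positions of 2^j times rc(P_{1,j}), which is the building block used in the VSSD derivation below. A tiny self-check:

```python
import numpy as np

col = np.array([5, 3, 12, 7], dtype=np.uint8)        # toy attribute values
slices = {j: (col >> j) & 1 for j in range(8)}       # vertical bit slices P_{1,j}

# Sum of the attribute over all rows, computed purely from root counts.
total = sum((2 ** j) * int(slices[j].sum()) for j in range(8))
assert total == int(col.sum())                        # both give 27
```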
Vertical Set Square Distance (VSSD) for a point ‘a’ in a
data set ‘X’ is defined as follows [24]:
$$f_{VSSD}(a, X) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$$
$$= \sum_{x \in X} \sum_{i=1}^{d} x_i^2 \;-\; 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i \;+\; \sum_{x \in X} \sum_{i=1}^{d} a_i^2 \;=\; T_1 + T_2 + T_3$$
where
$$T_1 = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 = \sum_{i=1}^{d} \sum_{j=b-1}^{0} 2^{2j}\, rc(P_X \wedge P_{i,j}) \;+\; \sum_{i=1}^{d} \sum_{\substack{j=b-1 \\ j \neq 0}}^{0} \sum_{l=j-1}^{0} 2^{\,j+l+1}\, rc(P_X \wedge P_{i,j} \wedge P_{i,l})$$
$$T_2 = -2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i = -2 \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^{j}\, rc(P_X \wedge P_{i,j})$$
$$T_3 = \sum_{x \in X} \sum_{i=1}^{d} a_i^2 = rc(P_X) \cdot \sum_{i=1}^{d} a_i^2$$
Pi,j indicates the P-tree for the jth bit of ith attribute. rc(P) denotes
the root count of a P-tree (number of truth bits). PX denotes the
P-tree (mask) for the subset X.
In the above computation the root count operations are
independent of ‘a’, allowing us to pre-compute them
once and reuse them when computing the VSSD of many
data points, as long as the corresponding data set (set
X) does not change. This observation provides us with a
technique to compute the VSSD influence-based density for
all n data points in O(n). Further, for a given cluster, the VSSD
influence-based density of each data point can be
computed efficiently to determine its cluster membership.
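The following sketch illustrates this pre-computation, again using flat bit vectors in place of real P-trees; the names (precompute_counts, vssd) and the fixed 8-bit width are ours. All root counts involving P_X and the P_{i,j} are computed once; each subsequent VSSD(a, X) evaluation touches only these counts and the components of ‘a’, never the data set itself. The T1 term here folds the diagonal and off-diagonal contributions of the derivation above into one double sum over bit positions.

```python
import numpy as np

BITS = 8

def precompute_counts(X_bits, mask):
    """Root counts that depend only on the set X (selected by 'mask'), not on 'a'.
    X_bits[i][j] is the bit vector for bit j of attribute i."""
    d = len(X_bits)
    rc_x = int(mask.sum())                                              # rc(P_X)
    rc1 = [[int((mask & X_bits[i][j]).sum()) for j in range(BITS)] for i in range(d)]
    rc2 = [[[int((mask & X_bits[i][j] & X_bits[i][l]).sum())
             for l in range(BITS)] for j in range(BITS)] for i in range(d)]
    return rc_x, rc1, rc2

def vssd(a, counts):
    """Set square distance from point a to the whole set X, using counts only."""
    rc_x, rc1, rc2 = counts
    d = len(rc1)
    t1 = sum(2 ** (j + l) * rc2[i][j][l]
             for i in range(d) for j in range(BITS) for l in range(BITS))
    t2 = -2 * sum(a[i] * sum(2 ** j * rc1[i][j] for j in range(BITS)) for i in range(d))
    t3 = rc_x * sum(int(a[i]) ** 2 for i in range(d))
    return t1 + t2 + t3

# Toy check against a brute-force scan.
X = np.array([[3, 5], [1, 7], [4, 4]], dtype=np.uint8)
X_bits = [[(X[:, i] >> j) & 1 for j in range(BITS)] for i in range(X.shape[1])]
counts = precompute_counts(X_bits, np.ones(len(X), dtype=np.uint8))
a = np.array([2, 6])
assert vssd(a, counts) == int(((X.astype(int) - a) ** 2).sum())        # both give 12
```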
3.2 Algorithm (VSSDClust)
The algorithm has two phases. In the initial phase the
VSSD is computed for every point in the data set. As the
VSSD values are computed, they are placed on a heap (sorted). Next,
the differences between each pair of consecutive VSSD values are
computed. It is assumed that outlier differences
indicate a separation between two clusters. Statistical
outliers are identified by applying the standard mean + 3
standard deviations rule to the VSSD difference values.
The ordered (by VSSD) data set is partitioned at the
outlier differences to arrive at the initial clustering. The initial
phase can be restated in the following steps:
1. Compute VSSD(a, DataSet) for every point a in the data set and place the values in a sorted heap.
2. Find the difference between VSSD(a, DataSet)i and VSSD(a, DataSet)i+1, where i and i+1 refer to the sorted order.
3. Identify the differences greater than {mean(difference) + 3 x StandardDeviation(difference)}.
4. Break the data set into clusters using these large differences as partition boundaries.
Figure 2 shows the sorted set square distances (VSSD) for 100 data points. The largest differences in VSSD are observed at cluster boundaries. This characteristic is used to find the natural partitions in the data set.

[Figure 2. Distribution of sorted Set Square Distance.]
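A minimal sketch of this initial phase, assuming the VSSD values have already been computed for every point (for example with the count-based routine above); the threshold follows the mean + 3 standard deviations rule on the consecutive differences:

```python
import numpy as np

def initial_partition(vssd_values):
    """Order points by VSSD, find outlier gaps between consecutive values,
    and cut the ordering there to form the initial clusters."""
    order = np.argsort(vssd_values)                  # points in sorted-VSSD order
    diffs = np.diff(vssd_values[order])              # consecutive differences
    threshold = diffs.mean() + 3 * diffs.std()       # mean + 3 x standard deviation
    cuts = np.where(diffs > threshold)[0] + 1        # boundaries of natural clusters
    return np.split(order, cuts)                     # list of index arrays, one per cluster
```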
In phase two of the algorithm the membership of each item in the data set
is confirmed or re-assigned based on the VSSD with respect
to each cluster. This step is similar to the classical K-means
algorithm, except that instead of the mean, each
cluster is represented by all the data points in the cluster,
and instead of the squared distance to the mean, the
set squared distance is used to determine cluster membership.
Phase 2: Iterate until (max iteration) or (no change in
cluster sizes) or (oscillation):
1. Compute the VSSD of every point a against each cluster
Ci, i.e., VSSD(a, Ci).
2. Re-assign the cluster membership of ‘a’ based on
min{VSSD(a, Ci)}.
In order to maintain scalability, it is important to use
a termination criterion that is not compute intensive.
The cluster sizes recorded at each iteration are used to track
the progress of the algorithm. If there is no change in the
number of data points in each cluster over subsequent
iterations, that is an indication that the clustering has arrived at a
stable solution. If the cluster sizes oscillate, it is also
important to terminate. In order to avoid having
the algorithm run indefinitely we also check against a maximum iteration count.
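A sketch of the phase-two loop with these termination checks. Here vssd_to_cluster(a, member_indices) stands in for the count-based VSSD of point a with respect to one cluster (its counts would be re-precomputed once per cluster per iteration), and the size-history check covers both the stable and the oscillating case; the function and parameter names are ours.

```python
import numpy as np

def phase_two(points, clusters, vssd_to_cluster, max_iter=20):
    """Reassign each point to the cluster with minimum VSSD until the cluster
    sizes stop changing, start oscillating, or max_iter is reached."""
    seen_sizes = set()
    for _ in range(max_iter):
        # Reassign every point to the cluster giving the smallest set square distance.
        labels = np.array([
            min(range(len(clusters)),
                key=lambda c: vssd_to_cluster(points[i], clusters[c]))
            for i in range(len(points))
        ])
        clusters = [np.where(labels == c)[0] for c in range(len(clusters))]
        sizes = tuple(len(c) for c in clusters)
        if sizes in seen_sizes:      # unchanged or previously seen sizes: stop
            break
        seen_sizes.add(sizes)
    return clusters
```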
4. EXPERIMENTAL RESULTS
To show the practical relevance of the new clustering
approach we present comparative experimental results in this
section. This approach is aimed at removing the need for the
parameter K and at achieving scalability with respect to the
cardinality of the data. To show the successful elimination
of K we use a few different synthetic data sets and also
real-world data sets with known clusters. We also compare
the results with the classical K-means algorithm to show the
relative difference in speed to obtain an optimal solution.
To show the linear scalability we use a large RSI image
data set with our approach and show the actual required
computation time with respect to data size. We use the
following quality measure, extensively used in text mining,
to compare the quality of the clustering. Note that this
measure can only be used with known clusters and it is
computationally expensive. Let C* = {C1*, ..., Ci*, ..., Cl*}
be the original clustering and C = {C1, ..., Ci, ..., Ck} be some
clustering of the data set. Then
$$re(i, j) = \frac{|C_j \cap C_i^*|}{|C_i^*|}, \qquad pr(i, j) = \frac{|C_j \cap C_i^*|}{|C_j|}$$
$$F_{i,j} = \frac{2 \cdot pr(i, j) \cdot re(i, j)}{pr(i, j) + re(i, j)}, \qquad F = \sum_{i=1}^{l} \frac{|C_i^*|}{N} \max_{j=1}^{k} F_{i,j}$$
Note: F = 1 for a perfect clustering. The F-measure will also
indicate whether the selection of the number of clusters is
appropriate.
Synthetic data: The following table shows the results for a
few synthetically generated cluster data sets. The
motivation is to show the capability of the algorithm to
independently find the natural clusters in the data set. The
classical K-means clustering algorithm with given K is used
as a comparison. The number of database scans (i.e., iterations)
required to achieve an F-measure of 1.0 (i.e., a perfect
clustering) is shown in Table 1.
Data Set     [scatter plot 1]   [scatter plot 2]   [scatter plot 3]
VSSD         2 iterations       2 iterations       6 iterations
K-means      8 iterations       8 iterations       14 iterations
Table 1. Synthetic data clustering results
Iris Plant Data: This is the Iris Plant data set
from the UCI machine learning repository [27]. The data
set was originally introduced by Fisher [28] and is
frequently used as an example of the problem of pattern
recognition. It contains four-dimensional patterns (sepal
length, sepal width, petal length, petal width) mapped into
one of three classes (iris setosa, iris versicolor, and iris
virginica), and 50 sample patterns per class, totaling 150
sample patterns. Table 2 shows the results of
the comparison. The k-means algorithm is executed for
different values of K in an attempt to obtain better results. It is
clearly observable that the VSSD-based clustering can
obtain comparable results at a lower computational cost.
             VSSD    K-Means (K=3)   K-Means (K=4)   K-Means (K=5)
Iterations   5       16              38              24
F-measure    0.84    0.80            0.74            0.69
Table 2. Iris Plant data clustering results
KDD-99 Network Intrusion Data: This is the data set [27]
used for The Third International Knowledge Discovery
and Data Mining Tools Competition, which was held in
conjunction with the KDD-99 conference. This dataset includes
a wide variety of intrusions simulated in a military network
environment. The original dataset contains 31 attributes of
information from a TCP dump. We used the 22 numeric
attributes. Each data item identifies the categorical type of
attack or intrusion such as Satan, Smurf, IpSweep,
PortSweep, Neptune, etc., or whether it is Normal. We randomly
sampled 3 datasets from the original dataset to include 2, 4,
and 6 clusters based on the intrusion type. Comparison
results are shown in Table 3. Once again the
results show a clear advantage in using the VSSD to
obtain comparable results at a much lower cost compared to
the classical k-means algorithm.
             6 Clusters                     4 Clusters                     2 Clusters
             VSSD   K-means                 VSSD   K-means                 VSSD   K-means
K =          -      5      6      7         -      3      4      5         -      2
Iterations   7      10     12     12        9      16     12     16        3      6
F-measure    .81    .81    .81    .81       .80    .80    .80    .79       .90    .90
Table 3. Network Intrusion data clustering results
RSI data: This RSI data set was generated based on a set
of aerial photographs from the Best Management Plot
(BMP) of Oakes Irrigation Test Area (OITA) near Oakes,
North Dakota (longitude 97°42'18"W). The photographs were
taken in 1998. The image contains three bands: red, green,
and blue reflectance values. We use the original image of
size 1024x1024 pixels (cardinality=1,048,576 pixels).
Corresponding synchronized data for soil moisture, soil
nitrate and crop yield were also used for experimental
evaluations to obtain a dataset with 6 dimensions.
Additional datasets with different sizes were synthetically
generated based on the original datasets to study the timing
and scalability of the VSSD technique presented in this paper.
Figure 3 shows a plot of the clustering time
with respect to the dataset size for 3 different types of
machines. The main observation is the linear scalability of the
approach up to 25 million rows with 6 columns of data.
[Figure 3. Scalability results for VSSD Clustering: time (seconds) vs. data set size (x 1,000,000 rows x 6 columns) for AMD-Ath-1G, Intl-P4-2G, and SGI-4G machines.]
It is important to note that this new clustering
technique will fail to identify the natural clusters when the
data points are distributed symmetrically throughout the
entire space. In most real-world datasets we do not
see this property. Thus it can be argued that the clustering
technique is applicable to real-world data sets. In the case of
complete symmetry the entire data set will be clustered as
one partition. One possible solution is to ask the user for an
upper bound for the number of clusters and to randomly
assign the original data points to the clusters and continue
with the second phase of the proposed clustering process.
Also a grid-based approach could be very easily wrapped
on top of the new clustering technique described in this
paper. The vertical P-tree data structure used for this work
inherently has a built-in index to the attribute space. This
can be used easily to compute the density values for each
cell, and could be followed by a step that links the local
clusters to form global clusters.
5. CONCLUSION
Two major problems in unsupervised learning, which seeks to identify groups of objects based on similarity in large datasets, are scalability and the requirement of domain knowledge to determine input parameters. Most of the existing approaches suggest the use of sampling to address the issue of scalability and require the use of domain knowledge, trial-and-error techniques, or exhaustive searching to figure out the required input parameters.
In this paper we introduce a new clustering technique based on the set square distance. This technique is scalable and does not need prior knowledge of the existing (expected) number of partitions. Efficient computation of the set square distance using a vertical data structure enables this breakthrough. We show how a special ordering of the set square distance, which is an indication of the density at each data point, can be used to break the data into the “natural” clusters. We also show the effectiveness of determining cluster membership based on the set square distance to the respective cluster. We demonstrate the cluster quality and the resolution of the unknown k of our new technique using data sets with known classes. We show the scalability of the proposed technique with respect to data set size by using a large RSI data set.
6. REFERENCES
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[2] K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[3] O. R. Zaïane, A. Foss, C. Lee, W. Wang, On Data Clustering Analysis: Scalability, Constraints and Validation, Proc. of the Sixth PAKDD'02, Taipei, Taiwan, pp. 28-39, May 2002.
[4] J. MacQueen, Some methods for classification and analysis of multivariate observations, Proc. 5th Berkeley Symp. Math. Statist. Prob., 1967.
[5] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, J. Wiley & Sons, New York, NY, 1990.
[6] R. Ng and J. Han, Efficient and effective clustering method for spatial data mining, Proc. Conf. on VLDB, pp. 144-155, 1994.
[7] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: an efficient data clustering method for very large databases, Proc. ACM-SIGMOD Intl. Conf. Management of Data, pp. 103-114, 1996.
[8] S. Guha, R. Rastogi, and K. Shim, CURE: an efficient clustering algorithm for large databases, Proc. ACM-SIGMOD Intl. Conf. Management of Data, pp. 73-84, 1998.
[9] A. Hinneburg and D. A. Keim, Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering, Proc. of 25th Intl. Conf. Very Large Data Bases, pp. 506-517, 1999.
[10] A. Hinneburg and D. A. Keim, An Efficient Approach to Clustering in Multimedia Databases with Noise, Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1998.
[11] M. M. Breunig, H.-P. Kriegel, P. Kröger, J. Sander, Data Bubbles: Quality Preserving Performance Boosting for Hierarchical Clustering, ACM SIGMOD, Santa Barbara, California, 2001.
[12] Goddard Space Flight Center, http://eospso.gsfc.nasa.gov, 2004.
[13] A. Zomaya, T. El-Ghazawi, and O. Frieder, Parallel and distributed computing for data mining, IEEE Concurrency, Vol. 7(4), 1999.
[14] W. Peter, J. Chiochetti, C. Giardina, New Unsupervised Clustering Algorithm for Large Datasets, SIGKDD, Washington DC, USA, 2003.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proc. ACM-SIGKDD, pp. 226-231, 1996.
[16] M. Ankerst, M. M. Breunig, H.-P. Kriegel, J. Sander, OPTICS: Ordering Points To Identify the Clustering Structure, Proc. ACM SIGMOD'99 Int. Conf. on Management of Data, Philadelphia, PA, 1999.
[17] G. Hamerly, C. Elkan, Learning the k in k-means, Seventeenth Annual Conference on Neural Information Processing Systems (NIPS), British Columbia, Canada, 2003.
[18] Q. Ding, M. Khan, A. Roy, and W. Perrizo, The P-tree Algebra, Proceedings of the ACM Symposium on Applied Computing, pp. 426-431, 2002.
[19] J. A. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, NY, 1975.
[20] M. Khan, Q. Ding, and W. Perrizo, K-Nearest Neighbor Classification of Spatial Data Streams using P-trees, Proceedings of the PAKDD, pp. 517-528, 2002.
[21] E. M. Knorr and R. T. Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets, Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), pp. 392-403, 1998.
[22] A. Perera, A. Denton, P. Kotala, W. Jockhec, W. V. Granda, and W. Perrizo, P-tree Classification of Yeast Gene Deletion Data, SIGKDD Explorations, 4(2), pp. 108-109, 2002.
[23] W. Perrizo, Peano Count Tree Technology, Technical Report NDSU-CSOR-TR-01-1, 2001.
[24] T. Abidin, A. Perera, M. Serazi, W. Perrizo, Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets, CATA-2005, New Orleans, 2005.
[25] I. Rahal and W. Perrizo, An Optimized Approach for KNN Text Categorization using P-Trees, Proceedings of the ACM Symposium on Applied Computing, pp. 613-617, 2004.
[26] D. Pelleg, A. W. Moore, X-means: Extending K-means with Efficient Estimation of the Number of Clusters, Proc. of the 17th International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, pp. 727-734, 2000.
[27] UCI Machine Learning Data Repository, http://www.ics.uci.edu/~mlearn/MLSummary.html, 2004.
[28] R. A. Fisher, The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics 7, pp. 179-188, 1936.