Pillar PK-means Clustering for Big Data
Dr. William Perrizo, ?, ?
North Dakota State University
([email protected])
ABSTRACT:
This paper describes an approach for the data mining
technique called clustering using vertically structured data
and k-means partitioning methodology. The partitioning
methodology is based on the scalar product with
judiciously chosen unit vectors. In any k-means clustering
method, choosing a good value for k and a good set of
initial cluster centroid points is very important. Typically
the user is left to guess at k and then stick with that guess, resulting
in a subset being treated as one cluster when it is really
several, or in a cluster being split into many pieces just to
meet the requirement that there be k clusters. In this work we do not
predefine k (the best choice of k is revealed as the
algorithm progresses). Also, in the process of determining
an appropriate k, we naturally get a good set of initial
centroid points (one for each of the k clusters). By good
centroid points, we simply mean ones that cause the k-means algorithm to converge rapidly.
For so-called “big data”, the speed at which a clustering
method can be trained is a critical issue. Many very good
algorithms are unusable in the big data environment simply
because they take an unacceptable amount of time.
Therefore, algorithm speed is very important. To address
the speed issue, for choosing a good set of initial centroids
and for the rest of the algorithm, we use horizontal
processing of vertically structured data rather than the
ubiquitous vertical (scan) processing of horizontal
(record) data. We use pTree, bit level, vertical data
structuring. pTree technology represents and processes
data differently from the ubiquitous horizontal data
technologies. In pTree technology, the data is structured
column-wise (into bit slices which may or may not be
compressed into tree structures) and the columns are
processed horizontally (typically across a few to a few
hundred pTrees), while in horizontal technologies, data is
structured row-wise and those rows are processed
vertically (often down billions of rows).
Introduction
Unsupervised Machine Learning or Clustering is an
important data mining technology for mining information
out of new data sets, especially very large datasets. The
assumption is usually that there is essentially nothing yet
known about the data set (therefore the process is
“unsupervised”). The goal is often to partition the data set
into subsets of “similar” or “correlated” records [4, 7].
There may be various levels of supervision available and,
of course, that additional information should be used to
advantage during the clustering process. For
instance, it may be known that there are exactly k clusters
or similarity subsets, in which case a method such as k-means clustering may be very productive. To
mine an RGB image for, say, red cars, white cars, grass,
pavement and bare ground, k would be five. It would
make sense to use that supervising knowledge by
employing k-means clustering starting with a mean set
consisting of RGB vectors as closely approximating the
clusters as one can guess, e.g., red_car=(150,0,0),
white_car=(85,85,85), grass=(0,150,0), etc. That is to say,
it is productive to view the level of supervision available to
us as a continuum and not just the two extremes.
pTrees are lossless, compressed and data-mining ready
data structures [9][10]. pTrees are lossless because the
vertical bit-wise partitioning that is used in the pTree
technology guarantees that all information is retained
completely. There is no loss of information in converting
horizontal data to this vertical format. pTrees are
compressed because in this technology, segments of bit
sequences which are either purely 1-bits or purely 0-bits,
are represented by a single bit. This compression saves a
considerable amount of space, but more importantly
facilitates faster processing. pTrees are data-mining
ready because the fast, horizontal data mining processes
involved can be done without the need to decompress the
structures first. pTree vertical data structures have been
exploited in various domains and data mining algorithms, ranging from classification [1,2,3] and clustering [4,7] to association rule mining [9].
In this paper, we assume there is no supervising
knowledge. We assume the data set is a table of numbers
with n columns and N rows and that two rows are close if
they are close in the Euclidean sense (i.e., as vectors in R^n). More
general assumptions could be made (e.g., that there are
categorical data columns as well, or that the similarity is
based on L1 distance or some correlation-based similarity)
but we feel that would only obscure the main points we
want to make.
We structure the data vertically into columns of bits (possibly compressed into tree structures), called predicate Trees or pTrees. The simplest example of pTree structuring of a non-negative integer table is to slice it vertically into its bit-position slices. The main reason we do that is so that we can process across the [usually relatively few] vertical pTree structures rather than processing down the [usually a very, very high number of] rows. Very often these days, data is called Big Data because there are many, many rows (billions or even trillions) while the number of columns, by comparison, is relatively small (tens, hundreds, or a few thousand). Therefore processing across (bit) columns rather than down the rows has a clear speed advantage, provided that the column processing can be done very efficiently. That is where the advantage of our approach lies: in devising very efficient (in terms of time taken) algorithms for horizontal processing of vertical (bit) structures. Our approach also benefits greatly from the fact that, in general, modern computing platforms can do logical processing of (even massive) bit arrays very quickly [9,10].

Horizontal Clustering of Vertical Data Algorithms

We employ a distance-dominating functional (a functional that assigns a non-negative integer to each row). By distance dominating, we simply mean that the distance between any two output functional values is always dominated by the distance between the two input vectors. A class of distance-dominating functionals we have found productive are those based on the dot or scalar product with a unit vector.

We let s◦t denote the dot or inner product, s◦t = Σ_i s_i t_i, of two vectors s, t ∈ R^n, and ||s|| = √(Σ_i s_i^2) the Euclidean norm. Note that |s◦t| ≤ ||s|| ||t||, and the distance between s and t is ||s−t||. We assert that the dot product applied to a vector, s, with a unit vector, d, is “distance dominating”, that is, |s◦d − t◦d| ≤ ||s−t||. This follows since |s◦d − t◦d| = |(s−t)◦d| ≤ ||s−t|| ||d|| = ||s−t||, since ||d|| = 1.
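As a quick illustration of this inequality, the following minimal Python sketch (ordinary floating-point arrays, not the pTree representation; the random unit direction is just an example) projects two vectors onto a unit vector and checks that the gap between the projections never exceeds the Euclidean distance between the vectors.

import numpy as np

rng = np.random.default_rng(0)

def unit_vector(n):
    # Return a random unit vector d in R^n (so ||d|| = 1).
    d = rng.normal(size=n)
    return d / np.linalg.norm(d)

# Two arbitrary points s, t in R^5 and a random unit direction d.
s = rng.normal(size=5)
t = rng.normal(size=5)
d = unit_vector(5)

# The projection gap |s◦d - t◦d| is dominated by the full distance ||s - t||.
proj_gap = abs(s @ d - t @ d)
full_dist = np.linalg.norm(s - t)
assert proj_gap <= full_dist + 1e-12
print(proj_gap, "<=", full_dist)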
The background algorithm we consider in this research is the very popular k-means clustering algorithm. Assume k initial centroid points for the k clusters have been chosen (possibly at random).
1. Place each point with the centroid point closest to it (this assumes some method of breaking ties).
2. Calculate the centroid (e.g., mean point) of each of these k new clusters.
3. If some stopping condition is realized (such as the density of every cluster being sufficiently high), stop; otherwise, go to 1.
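For reference, the loop above in ordinary (horizontal, row-at-a-time) Python might look like the following sketch; the paper's point is to carry out steps 1 and 2 with pTree operations instead, and the fixed iteration cap and centroid-movement tolerance used here as a stopping condition are illustrative assumptions.

import numpy as np

def kmeans(X, centroids, max_iter=100, tol=1e-6):
    # X: N x n array of points; centroids: k x n array of initial centroid points.
    centroids = centroids.copy()
    for _ in range(max_iter):
        # Step 1: place each point with its closest centroid (ties go to the lowest index).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its new cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(len(centroids))])
        # Step 3: stop when the centroids no longer move appreciably; otherwise repeat.
        moved = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if moved < tol:
            break
    return labels, centroids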
Formulas
In each round after the initial round, the calculation of the centroid points (we’ll assume centroid = mean) and the assignment of all points to their proper cluster is done using pTree horizontal processing of vertically structured data (pTrees).

Assuming the dataset has been vertically sliced into column sets, and that those column sets have been further vertically sliced into bit slices (which is the basic pTree structuring minus the compression into trees, which we will ignore for simplicity in this paper), the mean or average function, which computes the average of one complete table column by processing the vertical bit slices horizontally (with ANDs and ORs), goes as shown here.

The COUNT function is probably the simplest and most useful of all these aggregate functions. It is not necessary to write a special function for Count, because the pTree RootCount function, which efficiently counts the number of 1-bits in a full slice, provides the mechanism to implement it. Given a pTree Pi, RootCount(Pi) returns the number of 1-bits in Pi. Count is then RootCount(OR_{i=0..n}(Pi)).
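A minimal Python sketch of these two operations is shown below, packing each bit slice into a single Python integer whose bits are the column's rows; this packed-integer layout is an illustrative stand-in for the compressed pTree structure, not the actual implementation.

def root_count(p):
    # p is a bit slice packed into an int; RootCount is just its 1-bit count.
    return bin(p).count("1")

def count_nonzero(slices):
    # Count of rows with a nonzero value: RootCount(OR over all bit slices).
    combined = 0
    for p in slices:
        combined |= p
    return root_count(combined)

# Example column [5, 0, 3, 4], one bit per row in each slice (row r = bit r):
#   slice 0 (2^0): rows 0 and 2 have a 1-bit  -> 0b0101
#   slice 1 (2^1): row 2                      -> 0b0100
#   slice 2 (2^2): rows 0 and 3               -> 0b1001
slices = [0b0101, 0b0100, 0b1001]
print(count_nonzero(slices))   # 3 of the 4 rows are nonzero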
The Sum Aggregate function can total a column of
numerical values.
Evaluating Sum with pTrees
total ← 0
for i = 0 to n do
  total ← total + 2^i * RootCount(Pi)
endfor
return total
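A minimal Python version of this loop, using the same packed-integer bit slices as above (again an illustrative layout rather than the compressed pTree structure):

def ptree_sum(slices):
    # slices[i] is the bit slice for bit position i (weight 2^i), one bit per row.
    total = 0
    for i, p in enumerate(slices):
        total += (1 << i) * bin(p).count("1")   # 2^i * RootCount(Pi)
    return total

# Column [5, 0, 3, 4] from the example above: 5 + 0 + 3 + 4 = 12.
print(ptree_sum([0b0101, 0b0100, 0b1001]))   # 12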
The Average Aggregate, Mean or Expectation will show
the average value of a column and is calculated from
Count and Sum.
Average ← Sum / Count

The calculation of the Vector of Medians, or simply the median of any column, goes as follows. Median returns the median value in a column. Rank(K) returns the value that is the Kth largest value in a field. Therefore, for a very fast and quite accurate determination of the median value, use K = Roof(TableCardinality/2).

Evaluating Median with pTrees
median ← 0, pos ← N/2, Pc ← the pure-1 pTree (all rows)
for i = n to 0 do
  c ← RootCount(Pc AND Pi)
  if (c >= pos) then
    median ← median + 2^i
    Pc ← Pc AND Pi
  else
    pos ← pos − c
    Pc ← Pc AND NOT(Pi)
  endif
endfor
return median
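A minimal Python rendering of this bit-slice rank/median scan, with the same packed-integer layout as above; the all-rows starting mask and the Roof(N/2) rank follow the pseudocode, while the layout itself is an illustrative assumption.

def ptree_rank(slices, n_rows, pos):
    # Return the pos-th largest value in the column by scanning bit slices from
    # the most significant position down, narrowing a row mask Pc as we go.
    pc = (1 << n_rows) - 1                      # pure-1 pTree: every row present
    value = 0
    for i in reversed(range(len(slices))):
        c = bin(pc & slices[i]).count("1")      # RootCount(Pc AND Pi)
        if c >= pos:
            value += 1 << i                     # the pos-th largest has bit i = 1
            pc &= slices[i]                     # Pc AND Pi
        else:
            pos -= c                            # skip the c larger values
            pc &= ~slices[i]                    # Pc AND NOT(Pi)
    return value

def ptree_median(slices, n_rows):
    return ptree_rank(slices, n_rows, (n_rows + 1) // 2)   # K = Roof(N/2)

# Column [5, 0, 3, 4] again: the 2nd largest value is 4.
print(ptree_median([0b0101, 0b0100, 0b1001], 4))   # 4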
Finding the Pillars of X
(Choose the k in k-means intelligently; a minimal sketch follows the clusterer outline below.)
Let m1 be a point in X maximizing distance^2(X,a) = (X−a)◦(X−a), where a = AverageVector(X).
If m1 is an outlier, repeat until m1 is a non-outlier.
A point, m1, found in this manner is called a non-outlier pillar of X wrt a, or nop(X,a).
Let m2 ∈ nop(X,m1).
In general, if non-outlier pillars m1,...,mi-1 have been chosen, choose mi from nop(X,{m1,...,mi-1}) (i.e., mi maximizes Σ_{k=1..i-1} distance^2(X,mk) and is a non-outlier).

Pillar pkmeans clusterer:
Assign each (object, class) pair a ClassWeight ∈ Reals (all CWs are initially set to 0).
Classes are numbered as they are revealed.
As we identify pillars mj, compute Lmj = X◦(mj − mj-1).
1. For the next larger PCI in Ld(C), left-to-right:
  1.1a If followed by a PCD, Ck ← Avg(Ld^-1[PCI, PCD]). If Ck is the center of a sphere-gap (or barrel-gap), declare Classk and mask it off.
  1.1b If followed by another PCI, declare the next Classk = the sphere-gapped set around Ck = Avg(Ld^-1[(3·PCI1 + PCI2)/4, PCI2)). Mask it off.
2. For the next smaller PCD in Ld, from the left side:
  2.1a If preceded by a PCI, declare the next Classk = the subset of Ld^-1[PCI, PCD] sphere-gapped around Ck = Average. Mask it off.
  2.1b If preceded by another PCD, declare the next Classk = the subset of the same, sphere-gapped around Ck = Avg(Ld^-1([PCD2, (PCD1 + PCD2)/4])). Mask it off.
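As promised above, here is a minimal Python sketch of the pillar-selection idea: pick successive initial centroids that are as far as possible, in total squared distance, from those already chosen, skipping apparent outliers. It works on ordinary horizontal arrays rather than pTrees, and both the quantile-based outlier screen and the fixed k_max cap are illustrative assumptions; the paper does not fix k in advance and uses its own non-outlier criterion.

import numpy as np

def find_pillars(X, k_max=10):
    # X: N x n array. Pick successive far-apart, non-outlier rows as initial centroids.
    a = X.mean(axis=0)                                   # a = AverageVector(X)
    d2_mean = ((X - a) ** 2).sum(axis=1)                 # distance^2(x, a) for every row
    # Crude non-outlier screen (assumption): drop the farthest 1% from the mean.
    eligible = d2_mean <= np.quantile(d2_mean, 0.99)
    score = d2_mean.copy()                               # m1 maximizes distance^2(x, a)
    pillars = []
    for _ in range(k_max):
        if not eligible.any():
            break
        idx = int(np.where(eligible, score, -np.inf).argmax())
        pillars.append(X[idx])
        eligible[idx] = False                            # never pick the same row twice
        d2_new = ((X - X[idx]) ** 2).sum(axis=1)
        # The next pillar maximizes the summed squared distance to all pillars so far.
        score = d2_new if len(pillars) == 1 else score + d2_new
    return np.array(pillars)

# Typical use: hand the pillars to a k-means routine (e.g., the kmeans sketch above)
# as its initial centroids: labels, centroids = kmeans(X, find_pillars(X, k_max=5))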
REFERENCES
[1] T. Abidin and W. Perrizo, “SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining,” Proceedings of the 21st Association of Computing Machinery Symposium on Applied Computing (SAC-06), Dijon, France, April 23-27, 2006.
[2] T. Abidin, A. Dong, H. Li, and W. Perrizo, “Efficient Image Classification on Vertically Decomposed Data,” Institute of Electrical and Electronic Engineers (IEEE) International Conference on Multimedia Databases and Data Management (MDDM-06), Atlanta, Georgia, April 8, 2006.
[3] M. Khan, Q. Ding, and W. Perrizo, “K-nearest Neighbor Classification on Spatial Data Streams Using P-trees,” Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-02), pp. 517-528, Taipei, Taiwan, May 2002.
[4] Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, “Vertical Set Squared Distance Based Clustering without Prior Knowledge of K,” International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE-05), pp. 72-77, Toronto, Canada, July 20-22, 2005.
[5] Rahal, D. Ren, W. Perrizo, “A Scalable Vertical Model for Mining Association Rules,” Journal of Information and Knowledge Management (JIKM), V3:4, pp. 317-329, 2004.
[6] D. Ren, B. Wang, and W. Perrizo, “RDF: A Density-Based Outlier Detection Method using Vertical Data Representation,” Proceedings of the 4th Institute of Electrical and Electronic Engineers (IEEE) International Conference on Data Mining (ICDM-04), pp. 503-506, Nov 1-4, 2004.
[7] E. Wang, I. Rahal, W. Perrizo, “DAVYD: an iterative Density-based Approach for clusters with Varying Densities,” International Society of Computers and their Applications (ISCA) International Journal of Computers and Their Applications, V17:1, pp. 1-14, March 2010.
[8] Qin Ding, Qiang Ding, W. Perrizo, “PARM - An Efficient Algorithm to Mine Association Rules from Spatial Data,” Institute of Electrical and Electronic Engineers (IEEE) Transactions on Systems, Man, and Cybernetics, Volume 38, Number 6, ISSN 1083-4419, pp. 1513-1525, December 2008.
[9] Rahal, M. Serazi, A. Perera, Q. Ding, F. Pan, D. Ren, W. Wu, W. Perrizo, “DataMIME™,” Association of Computing Machinery, Management of Data (ACM SIGMOD 04), Paris, France, June 2004.
[10] Treeminer Inc., The Vertical Data Mining Company, 175 Admiral Cochrane Drive, Suite 300, Annapolis, Maryland 21401, http://www.treeminer.com
[11] H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford University Press, Oxford, 1965.
[12] M. S. Bazaraa, H. D. Sherali, C. M. Shetty, Nonlinear Programming, Theory and Algorithms, Third Edition, John Wiley and Sons, Inc., Hoboken, NJ, 2006.
[13] H. Stapleton, Linear Statistical Models, Second Edition, John Wiley and Sons, Inc., Hoboken, NJ, 2009.
[14] W. Perrizo, “Functional Analytic Unsupervised and Supervised Data Mining Technology,” International Conference on Computer Applications in Industry and Engineering, Los Angeles, September 2013.
Appendix
A Predicate tree, or P-tree, is a vertical data representation that represents the data column-by-column rather than row-by-row (the usual relational data representation). It was initially developed for mining spatial data [2][4]. Since then it has been used for mining many other types of data [3][5]. The creation of P-trees typically starts by converting a relational table of horizontal records to a set of vertical, compressed P-trees by decomposing each attribute in the table into separate bit vectors (e.g., one for each bit position of a numeric attribute, or one bitmap for each category of a categorical attribute). Such vertical partitioning guarantees that no information is lost.
Figure: Construction of P-trees from attribute A1.
The P-trees are built from the top down, stopping any branch as soon as purity (either purely 1-bits or purely 0-bits) is reached (giving the compression).
For example, let R be a relational table consisting of three numeric attributes, R(A1, A2, A3). To convert it into P-trees, we convert the attribute values into binary, take vertical bit slices of every attribute, and store them in separate files. Each bit slice is considered a P-tree, which indicates the predicate of whether a particular bit position is zero or one. The bit slice may be compressed by recursively dividing it into binary trees. Figure 2 depicts the conversion of a numerical attribute, A1, into P-trees. □
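To make the construction concrete, here is a minimal Python sketch that slices a numeric attribute into bit slices and compresses each slice top-down into a simple pure/mixed tree; the fan-out of two and the tuple node format are illustrative assumptions, not the actual P-tree storage format.

def bit_slices(column, width):
    # One slice per bit position: slice i holds bit i of every value, in row order.
    return [[(v >> i) & 1 for v in column] for i in range(width)]

def build_ptree(bits):
    # Compress a bit slice top-down, stopping any branch that is purely 1s or purely 0s.
    if all(b == 1 for b in bits):
        return ("pure1", len(bits))
    if all(b == 0 for b in bits):
        return ("pure0", len(bits))
    mid = len(bits) // 2
    return ("mixed", build_ptree(bits[:mid]), build_ptree(bits[mid:]))

# Example 3-bit attribute A1:
A1 = [5, 0, 3, 4, 7, 7, 7, 7]
for i, s in enumerate(bit_slices(A1, 3)):
    print("slice", i, s, "->", build_ptree(s))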