Linear, Spherical and Radial Functional Classification
on
Vertically Structured Data
Dr. William Perrizo
North Dakota State University
([email protected])
ABSTRACT:
In this paper we describe an approach to the data mining technique called classification, or prediction, using vertically structured data and a functional partitioning methodology. The partitioning methodology is based on three functionals: the linear (scalar product) functional, the spherical functional, and the radial (distance from a line) functional. Each of the three functionals maps an n-dimensional vector space onto a 1-dimensional line in a way that is "distance dominating". By "distance dominating" we mean that the separation between two functional images, f(x) and f(y), is less than or equal to the distance between x and y. By using such functionals, we are guaranteed that a gap of a given size in the functional values (on the line) reveals a gap of at least that size in the vector space. This fact allows us to separate classes by applying these easily computed functionals. The functionals are easily computed because they are applied to vertically structured data.
A prominent gap in the functional values of the classified training set points provides good cut points for separating hyperplane, spherical and tubular segments, which together form a hull model to be used in the classification of future unclassified samples.
For so-called "big data", the speed at which a classification model can be trained is a critical issue. Many very good classification algorithms are unusable in the big data environment because the training step takes an unacceptable amount of time, so the speed of training is very important. To address this speed issue, in this paper we use horizontal processing of vertically structured data rather than the ubiquitous vertical (scan) processing of horizontal (record) data. We use pTree, bit-level, vertical data structuring. pTree technology represents and processes data differently from the ubiquitous horizontal data technologies. In pTree technology, the data is structured column-wise (into bit slices) and the columns are processed horizontally (typically across a few to a few hundred bit-level columns), while in horizontal technologies, data is structured row-wise and those rows are processed vertically (often down millions, even billions, of rows).
pTrees are lossless, compressed and data-mining-ready data structures [9][10]. pTrees are lossless because the vertical bit-wise partitioning used in the pTree technology guarantees that all information is retained completely; there is no loss of information in converting horizontal data to this vertical format. pTrees are compressed because, in this technology, segments of bit sequences which are purely 1-bits or purely 0-bits are represented by a single bit within a tree structure. This compression saves a considerable amount of space but, more importantly, facilitates faster processing. pTrees are data-mining ready because the fast, horizontal data mining processes involved can be done without the need to decompress the structures first.
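As an illustration of the bit-slice idea, the following is a minimal Python sketch (ours, not the pTree implementation itself; the names bit_slices and compress_slice are illustrative). It slices a non-negative integer column into bit-level columns and run-length-compresses each slice into pure-0/pure-1 segments, the flat analogue of pure pTree nodes:

import numpy as np

def bit_slices(column, nbits):
    """Split a non-negative integer column into nbits vertical bit slices.
    slices[j][i] is bit j (0 = least significant) of row i."""
    column = np.asarray(column, dtype=np.uint64)
    return [((column >> j) & 1).astype(np.uint8) for j in range(nbits)]

def compress_slice(bits):
    """Run-length compress one bit slice: pure runs of 0s or 1s become
    (bit_value, run_length) pairs, the flat analogue of pure pTree nodes."""
    runs, start = [], 0
    for i in range(1, len(bits) + 1):
        if i == len(bits) or bits[i] != bits[start]:
            runs.append((int(bits[start]), i - start))
            start = i
    return runs

# Example: a tiny column of 3-bit values.
col = [5, 5, 5, 7, 0, 0, 1, 1]
slices = bit_slices(col, 3)
print([s.tolist() for s in slices])
print(compress_slice(slices[0]))   # least-significant-bit slice as pure runs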
pTree vertical data structures have been exploited in various domains and data mining algorithms, ranging from classification [1,2,3] and clustering [4,7] to association rule mining [9] and other data mining algorithms. Speed improvements are very important in data mining because many quite accurate algorithms require an unacceptable amount of processing time to complete, even with today's powerful computing systems and efficient software platforms. In this paper, we evaluate the speed of functional-based data mining algorithms when using pTree technology.
Introduction
Supervised Machine Learning, also called Classification or Prediction, is one of the important data mining technologies for mining information out of large data sets. The assumption is usually that there is a very large table of data in which the "class" of each instance is given (called the training data set) and there is another data set in which the classes are not known (called the test data set). The task is to predict the class of each test set object based on class information found in the training data set (hence "supervised" prediction) [1, 2, 3].
Unsupervised Machine Learning, or Clustering, is also an important data mining technology for mining information out of new data sets. The assumption in clustering is usually that essentially nothing is yet known about the data set (hence it is "unsupervised"). The goal of clustering is to partition the data set into subsets of "similar" or "correlated" objects [4, 7], often so that the data set can then be used as a training set for classifying future unclassified objects.
We note here that there may be various additional levels of supervision available in either classification or clustering and, of course, that additional information should be used to advantage during the machine learning process. That is to say, often the problem is neither purely supervised nor purely unsupervised. For instance, it may be known that there are exactly k similarity subsets, in which case a method such as k-means clustering may be productive. To mine an RGB image for, say, red cars, white cars, grass, pavement, bare ground, and other, k would be six. It would make sense to use that supervising knowledge by employing k-means clustering, starting with a mean set consisting of RGB vectors approximating the clusters as closely as one can guess, e.g., red_car=(150,0,0), white_car=(85,85,85), grass=(0,150,0), etc. That is to say, we should view the level of supervision available to us as a continuum and not just the two extremes. The ultimate in supervising knowledge is a very large training set, which has enough class information in it to very accurately assign predicted classes to all test instances. We can think of a training set as a set of records that have been "classified" by an expert (human or machine) into similarity classes (and assigned a class label).
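As an illustration of exploiting such partial supervision, here is a minimal k-means sketch seeded with the guessed class means above (ours, not the authors' code; the names seeded_kmeans, pixels and seeds, and the convergence test, are illustrative assumptions):

import numpy as np

def seeded_kmeans(points, seeds, iters=50):
    """Plain k-means, but initialized with domain-knowledge seed means."""
    means = np.asarray(seeds, dtype=float)
    for _ in range(iters):
        # Assign each point to the nearest current mean (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mean; keep the old mean if a cluster went empty.
        new_means = np.array([points[labels == k].mean(axis=0)
                              if np.any(labels == k) else means[k]
                              for k in range(len(means))])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

# Guessed RGB means for red_car, white_car, grass (more classes in practice).
seeds = [(150, 0, 0), (85, 85, 85), (0, 150, 0)]
pixels = np.random.randint(0, 256, size=(1000, 3))   # stand-in for image pixels
labels, means = seeded_kmeans(pixels, seeds)
print(means.round(1))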
In this paper we assume there is an existing training set of classified objects which is large enough to fully characterize the classes. We will assume the training set is a subset of a vector space of non-negative integers with n columns and N rows, and that two rows are similar if they are close in the Euclidean sense. More general assumptions could be made (e.g., that there are categorical data columns as well, or that the similarity is based on L1 distance or some correlation-based similarity), but we feel such generalizations would only obscure the main points we want to make.
We structure the data vertically into columns
of bits (possibly compressed into tree
structures), called predicate Trees or pTrees.
The simplest example of pTree structuring
of a non-negative integer table is to slice it
vertically into its bit-position slices. The
main reason we do that is so that we can
process across the (usually relatively few)
vertical pTree structures rather than
processing down the (usually very
numerous) rows. Very often these days,
data is called Big Data because there are
many, many rows (billions or even trillions)
while the number of columns, by
comparison, is relatively small (tens or
hundreds, sometimes thousands, but seldom
more than that). Therefore processing
across (bit) columns rather than down the
rows has a clear speed advantage, provided
that the column processing can be done very
efficiently. That is where the advantage of
our approach lies, in devising very efficient
(in terms of time taken) algorithms for
horizontal processing of vertical (bit)
structures. Our approach also benefits
greatly from the fact that, in general, modern
computing platforms can do logical
processing of (even massive) bit arrays very
quickly [9,10].
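As a rough illustration of horizontal processing across bit slices (ours, operating on uncompressed numpy arrays rather than actual pTrees; slices_of and dot_with_d are illustrative names), the sketch below computes the column of dot products Xod as a weighted sum of bit-slice vectors:

import numpy as np

def slices_of(column, nbits):
    """Vertical bit slices of one integer column (least significant first)."""
    col = np.asarray(column, dtype=np.int64)
    return [((col >> j) & 1) for j in range(nbits)]

def dot_with_d(bit_sliced_X, d):
    """Column of dot products X o d, computed across bit slices:
    X_k = sum_j 2**j * B_kj, so X o d = sum_k sum_j (d_k * 2**j) * B_kj."""
    n_rows = len(bit_sliced_X[0][0])
    result = np.zeros(n_rows, dtype=float)
    for d_k, slices in zip(d, bit_sliced_X):
        for j, bits in enumerate(slices):
            result += d_k * (1 << j) * bits
    return result

# Tiny example: 4 rows, 2 columns, 3-bit values.
X = np.array([[5, 1], [7, 2], [0, 3], [6, 6]])
bit_sliced_X = [slices_of(X[:, k], 3) for k in range(X.shape[1])]
d = np.array([1.0, 0.0])                  # unit vector e1
print(dot_with_d(bit_sliced_X, d))        # equals X[:, 0] = [5, 7, 0, 6]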
LSR: The Linear, Spherical, Radial Functional Algorithm for Horizontal Classification of Vertical Data
In this algorithm, we build a separate decision tree for each of a series of unit vectors, d, used in the dot product (linear) and radial functionals. The more unit vectors used, the better: the more hull segments we use to approximate the classes, and therefore the fewer false positives we get.
So, given a unit vector, d, and a point, p, in the vector space X = (X1,...,Xn), and letting xoy denote the dot product of vectors x and y, we write Xov for the column of dot product results xov, one for each row x in X.
The three functionals then are:

Ld,p ≡ (X-p)od = Xod - pod = Ld - pod
Sp ≡ (X-p)o(X-p) = XoX + Xo(-2p) + pop
Rd,p ≡ Sp - L²d,p
     = XoX + Xo(-2p) + pop - [ L²d - 2(pod)Ld + (pod)² ]
     = XoX - L²d + Xo(-2p + 2(pod)d) + pop - (pod)²
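As a one-line check (ours, not from the paper) that the linear functional is distance dominating, apply the Cauchy-Schwarz inequality with the unit vector d:

\[
|L_{d,p}(x) - L_{d,p}(y)| = |(x-p)\circ d - (y-p)\circ d| = |(x-y)\circ d|
\le \|x-y\|\,\|d\| = \|x-y\| .
\]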
The first, "Linear" functional can be viewed as a mapping of Rn to R1 via the dot product with d, which maps each vector point to its "shadow" on the d-line obtained through perpendicular projection. In the pTree sense, we view this as a mapping of the PTreeSet for X (bit slices for n columns) onto the ScalarPTreeSet of dot products (bit slices of the one derived column of dot products).
The second, "Spherical" functional produces the (bit slices of the) column of squared distances from the point p.
The third, "Radial" functional produces the (bit slices of the) column of squared radial distances from the d-line through the point p.
Assuming X is "Big Data" (say, with a trillion rows), the formulas derived above show that vertical structuring into pTrees can be of tremendous benefit, since the ScalarPTreeSet XoX can be precomputed. Then, for each d and p chosen, the only calculations required are the scalar calculations pod and pop, and the ScalarPTreeSet calculations Xod, Xo(-2p), L² and S - L².
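The following is a minimal numpy sketch (ours, on uncompressed arrays rather than pTrees; the function name lsr_columns is illustrative) of computing the L, S and R columns with XoX precomputed:

import numpy as np

def lsr_columns(X, d, p, XoX=None):
    """Columns of the three functionals for every row of X.
    L = (X - p) o d, S = squared distance from p, R = S - L**2
    (squared radial distance from the d-line through p).
    XoX (the row-wise self dot products) can be precomputed once and reused."""
    X = np.asarray(X, dtype=float)
    d = np.asarray(d, dtype=float) / np.linalg.norm(d)   # ensure d is a unit vector
    p = np.asarray(p, dtype=float)
    if XoX is None:
        XoX = np.einsum('ij,ij->i', X, X)                # precomputable once
    L = X @ d - p @ d
    S = XoX - 2.0 * (X @ p) + p @ p
    R = S - L ** 2
    return L, S, R

# Example with a small integer table and d = e1, p = origin.
X = np.array([[5, 1, 0], [7, 2, 3], [0, 3, 1]])
L, S, R = lsr_columns(X, d=[1, 0, 0], p=[0, 0, 0], XoX=np.einsum('ij,ij->i', X, X))
print(L, S, R)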
The LSR algorithm builds a decision tree for
each unit vector d, as follows:
LSR Decision Tree algorithm.
Build a decision tree for each ek (also, perhaps, for some or all arithmetic combinations of the ek).
Build branches until we have 100% true positives (no class duplication exists).
Then y isa class=C if y isa class=C in every d-decision_tree; else y isa Other.
We note that, at every node, we build a branch for each pair of classes in each interval.
At the root of any d-decision_tree we calculate the minimum and maximum of the linear projection values Ld,p(X) within each class, Ck. Note that we have a bit-map mask for each class Ck, and the L computation is done entirely through horizontal processing across the pTrees (or bit slices) of X.
At the next level of any d-decision_tree, we build a branch on every interval formed by those minimums and maximums, using both S and R.
At the level after that, we build a branch on every interval formed by the minimums and maximums of the previous node, using L with the unit vector that runs between each pair of class means and with the mean of the first class of the pair as p.
At all succeeding pairs of levels of any d-decision_tree, we build branches using the same pattern of L followed by S and R until there is purity (only one class).
This ends the LSR Decision Tree algorithm.
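As an illustration of the root step (ours, not the authors' pTree code; root_intervals is an illustrative name), the sketch below computes the per-class minimum and maximum of the L projection using boolean class masks in place of pTree bit-map masks:

import numpy as np

def root_intervals(X, d, class_masks):
    """Per-class [min, max] of the linear projection L_d(X) = X o d.
    class_masks maps a class label to a boolean mask over the rows of X,
    playing the role of the pTree bit-map mask for that class."""
    L = np.asarray(X, dtype=float) @ (np.asarray(d, float) / np.linalg.norm(d))
    return {c: (L[mask].min(), L[mask].max()) for c, mask in class_masks.items()}

# Example: two classes well separated along e1.
X = np.array([[1, 5], [2, 4], [8, 1], [9, 0]])
masks = {'A': np.array([True, True, False, False]),
         'B': np.array([False, False, True, True])}
print(root_intervals(X, d=[1, 0], class_masks=masks))
# {'A': (1.0, 2.0), 'B': (8.0, 9.0)} -- the gap between 2 and 8 gives a cut point.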
We note that we can improve the accuracy with respect to false positives (FPs) by forming intervals not just with the minimum and maximum values at each stage, but with all Precipitous Count Changes (PCCs). A PCC is a value whose count changes by at least some fixed percentage from its predecessor value's count; we used 25% as the threshold. PCCs can be either increases or decreases. We write PCI for a PCC that is an increase of 25% or more, and PCD for a decrease.
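A minimal sketch (ours) of locating PCIs and PCDs in a histogram of functional values, assuming counts are taken at consecutive values of the functional and using the 25% threshold:

def pccs(counts, threshold=0.25):
    """Return (PCIs, PCDs): positions whose count rises or falls by at least
    `threshold` (25% by default) relative to the predecessor's count."""
    pcis, pcds = [], []
    for i in range(1, len(counts)):
        prev, cur = counts[i - 1], counts[i]
        if prev == 0:
            if cur > 0:
                pcis.append(i)          # any rise from a zero count is a PCI
            continue
        change = (cur - prev) / prev
        if change >= threshold:
            pcis.append(i)
        elif change <= -threshold:
            pcds.append(i)
    return pcis, pcds

# Histogram of L values 0..9 (counts per value).
counts = [0, 4, 5, 5, 1, 0, 0, 3, 4, 4]
print(pccs(counts))   # PCIs near the start of each cluster, PCDs near the end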
Finally, we note that for convex ("roundish") classes, the mathematical convex hull of the class points is the optimal hull model to use (fewest false positives); however, often one or more of the classes is not convex, e.g., the horseshoe-shaped class of points in 2-space indicated by "@"s below.
[Figure: a horseshoe-shaped class of "@" points in 2-space, shown together with its convex hull.]
Here, the convex hull is definitely not the optimal model for the class, since any unclassified sample in the interior of the "horseshoe" would be a false positive (classified as being in the "@" class falsely).
On the other hand, with the serial application of the LSR decision tree algorithm, a much better fit is possible.
[Figure: the same horseshoe-shaped class, hulled by LSR cut points along a direction d; minL = pci1L, pcd1L, pci2L and maxL = pcd2L bound the linear-functional segments of the hull.]
To facilitate this, we recommend using as many unit vectors, d, as possible, since each one contributes another set of edge segments to the hull around each class and therefore (potentially) eliminates more false positive classifications. The only additional cost of an additional unit vector, d, is the cost of calculating the dot product and radial functionals with respect to that d. Therefore, we certainly recommend using all of the dimensional unit vectors, ek = (0,...,0,1,0,...,0), of the standard basis for the vector space, as well as all sums and differences of those unit vectors. Ultimately, it would be best to include an entire covering grid of unit vectors for the vector space (covering all angles up to some level of approximation). There would, of course, be considerable overlap in doing this, but always the potential for an increase in the accuracy of the classifier.
Implementation
In this section we develop the decision trees on a dataset taken from the University of California Irvine Machine Learning Repository called IRIS, which consists of 4 measurements of iris flower samples (petal length, petal width, sepal length and sepal width), along with the class of each iris as one of 3 classes: setosa (we will use S), versicolor (we will use E) and virginica (we will use I). This dataset was selected because it is very commonly used for this purpose in the literature and because it provides "supervision", that is, the samples are already classified.
For d = e1 = (1,0,0,0) the root pseudo code is:

if     43 ≤ L1000(y) = y1 < 49   {y isa S}
elseif 49 ≤ L1000(y) = y1 ≤ 58   {y isa SEI}1
elseif 59 ≤ L1000(y) = y1 ≤ 70   {y isa EI}2
elseif 70 < L1000(y) = y1 ≤ 79   {y isa I}
else                             {y isa Other}
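For concreteness, a direct Python rendering of this root rule (with the interval endpoints as reconstructed above; the function name root_rule_e1 is ours):

def root_rule_e1(y):
    """Root of the d = e1 decision tree for the IRIS data.
    y1 is the first attribute of the sample y (in the units used by the paper);
    'SEI1' and 'EI2' mean: descend into the corresponding recursive step."""
    y1 = y[0]
    if 43 <= y1 < 49:
        return 'S'
    elif 49 <= y1 <= 58:
        return 'SEI1'
    elif 59 <= y1 <= 70:
        return 'EI2'
    elif 70 < y1 <= 79:
        return 'I'
    else:
        return 'Other'

print(root_rule_e1([50, 33, 14, 2]))   # -> 'SEI1'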
The {y isa SEI}1 recursive step pseudo code:

if     0 ≤ R1000,AvgS(y) ≤ 99        {y isa S}
elseif 99 < R1000,AvgS(y) < 393      {y isa O}
elseif 393 < R1000,AvgS(y) ≤ 1096    {y isa E}
elseif 1096 < R1000,AvgS(y) < 1217   {y isa O}
elseif 1217 ≤ R1000,AvgS(y) ≤ 1826   {y isa I}
else                                 {y isa Other}
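The recursive steps use the radial functional R rather than L. A small sketch (ours) of evaluating R1000,AvgS for a single sample and applying the first branch above; the AvgS vector shown is a hypothetical setosa mean supplied for illustration:

import numpy as np

def radial_sq(y, d, p):
    """R_{d,p}(y) = S_p(y) - L_{d,p}(y)**2: squared distance of y from the
    line through p in direction d (d is normalized here for safety)."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    d = np.asarray(d, float) / np.linalg.norm(d)
    L = (y - p) @ d
    S = (y - p) @ (y - p)
    return S - L ** 2

# Hypothetical setosa mean and sample, in the paper's integer units.
avg_S = np.array([50, 34, 15, 2])
y = np.array([51, 35, 14, 2])
r = radial_sq(y, d=[1, 0, 0, 0], p=avg_S)
print(r, '-> y isa S' if 0 <= r <= 99 else '-> continue down the tree')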
The {y isa EI}2 recursive step pseudo code:

if     270 ≤ R1000,AvgS(y) < 792      {y isa I}
elseif 792 ≤ R1000,AvgS(y) ≤ 1558     {y isa EI}3
elseif 1558 < R1000,AvgS(y) ≤ 2568    {y isa I}
else                                  {y isa Other}
The {y isa EI}3 recursive step pseudo code:

if     5.7 ≤ LAvE-AvI(y) < 13.6     {y isa E}
elseif 13.6 ≤ LAvE-AvI(y) ≤ 15.9    {y isa EI}4
elseif 15.9 < LAvE-AvI(y) ≤ 16.6    {y isa I}
else                                {y isa Other}
The {y isa EI}4 recursive step pseudo code:

if     22 ≤ RAvE-AvI,AvgE(y) < 31    {y isa E}
elseif 31 ≤ RAvE-AvI,AvgE(y) ≤ 35    {y isa EI}5
elseif 35 < RAvE-AvI,AvgE(y) ≤ 54    {y isa I}
else                                 {y isa O}
The {y isa EI}5 recursive step pseudo code:

if     1 = LAvgE-AvgI,origin(y)   {y isa E}
elseif 6 = LAvgE-AvgI,origin(y)   {y isa I}
else                              {y isa O}
REFERENCES
[1] T. Abidin and W. Perrizo, "SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier," Proc. ACM Symposium on Applied Computing (SAC), Dijon, France, April 23-27, 2006.
[2] T. Abidin, A. Dong, H. Li, and W. Perrizo, "Efficient Image Classification on Vertical Data," IEEE Int'l Conf. on Multimedia Databases and Data Management (MDDM), Atlanta, Georgia, April 8, 2006.
[3] M. Khan, Q. Ding, and W. Perrizo, "KNN Classification on Spatial Data Stream," Proc. Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD), pp. 517-528, Taipei, Taiwan, 2002.
[4] Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, "Vertical Set Squared Distance Based Clustering without Prior Knowledge of K," Int'l Conf. on Intelligent Systems and Software Engineering (IASSE), pp. 72-77, Toronto, July 20-22, 2005.
[5] Rahal, D. Ren, and W. Perrizo, "Scalable Vertical Model for ARM," Journal of Information and Knowledge Management, V3:4, 2004.
[6] D. Ren, B. Wang, and W. Perrizo, "RDF: Density-based Outlier Detection Method using Vertical Data Representation," IEEE Int'l Conf. on Data Mining (ICDM), pp. 503-506, Nov. 2004.
[7] E. Wang, I. Rahal, and W. Perrizo, "DAVYD: Density-based Approach for Clusters with Varying Densities," ISCA Int'l Journal of Computers and Applications, V17:1, March 2010.
[8] Qin Ding, Qiang Ding, and W. Perrizo, "PARM - An Efficient Algorithm for ARM on Spatial Data," IEEE Trans. on Systems, Man, and Cybernetics, V38:6, pp. 1513-1525, Dec. 2008.
[9] Rahal, M. Serazi, A. Perera, Q. Ding, F. Pan, D. Ren, W. Wu, and W. Perrizo, "DataMIME™," ACM SIGMOD, Paris, France, June 2004.
[10] Treeminer Inc., 175 Admiral Cochrane Drive, Suite 300, Annapolis, Maryland 21401, http://www.treeminer.com
[11] H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford University Press, 1965.
[12] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty, Nonlinear Programming: Theory and Algorithms, John Wiley, Hoboken, NJ, 2006.
[13] H. Stapleton, Linear Statistical Models, John Wiley, Hoboken, NJ, 2009.