http://www.cs.ndsu.nodak.edu/~datasurg
Contact Person:
Dr. William Perrizo
Department of Computer Science
North Dakota State University
258 A10 IACC Building
1301 12th Ave North
Fargo, ND 58102, USA
Voice: +1 701 231 7248
Fax: +1 701 231 8255
Email: [email protected]
http://www.cs.ndsu.nodak.edu/~perrizo
SMART-TV: AN EFFICIENT AND SCALABLE NEAREST
NEIGHBORS BASED CLASSIFIER
Contents
Introduction
Problem Statement
P-tree Vertical Data Structure
Vertical Set Squared Distance
Graph of Total Variations
Proposed Algorithm
Experimental Results
Summary
References
Introduction
Recent advances in computing power, networking, information storage, and multimedia have led to a proliferation of stored data in domains such as bioinformatics, image analysis, the World Wide Web, networking, banking, and retailing. This explosive growth of data has created the need for efficient and scalable data mining techniques that can extract valuable and interesting information from large volumes of data.
Classification is one of the data mining functionalities. It involves predicting the class label of newly encountered objects using the feature attributes of a set of pre-classified objects. The predictive patterns obtained from classification can be used to understand the existing objects in the database and to understand how new objects are grouped. The main goal of classification is to determine the class membership of unclassified objects.
Problem Statement
The k-nearest neighbor (KNN) classifier is one of the methods used for classification, and it is the most commonly used neighborhood-based classifier due to its simplicity and good performance. Unlike decision tree induction, neural networks, and support vector machines, the classifier does not build a model in advance; instead, it defers all classification effort until a new object arrives, so the classification decision is made locally, based on the features of the new object. For each new object, the KNN classifier searches for the most similar objects in the training set and assigns a class label to the new object based on the plurality class of the k nearest neighbors. The notion of similarity, or closeness, between the training objects and the unclassified object is determined using a distance measure, e.g., Euclidean distance. The KNN classifier has shown good performance on various datasets. However, when the training set is very large, i.e., millions of objects, a brute-force search for the k nearest neighbors increases the classification time significantly; such an approach does not scale to very large datasets.
Most state-of-the-art classification algorithms use traditional horizontal, row-based structuring to represent the data. Traditional horizontal record-based structuring combined with sequential scans is known to scale poorly to very large data repositories. In his talk at the SIGMOD 2004 conference [3], Jim Gray of Microsoft emphasized that vertical, column-based structuring, as opposed to horizontal row-based structuring, is a good alternative for speeding up query processing and ensuring scalability.
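As a point of reference, the following is a minimal sketch (illustrative names, not the paper's code) of the brute-force KNN classifier just described over a horizontal, row-based training set; the full scan of all n rows on every query is exactly the cost that does not scale.

```python
# A hedged sketch of brute-force KNN over a horizontal (row-based) training set.
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query, k=5):
    """Classify `query` by plurality vote of its k nearest training rows."""
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))  # Euclidean distance to every row: O(n*d)
    nearest = np.argsort(dists)[:k]                         # indices of the k closest rows
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
```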
P-tree Vertical Data Structure
Vertical data representation structures and processes data differently from horizontal data representation. In vertical representation, the data is structured column by column and processed horizontally through logical operations such as AND and OR, while in horizontal representation, the data is structured row by row and processed vertically through scanning or by using indexes.
The P-tree vertical data structure is one choice of vertical data representation [6]. A P-tree is a lossless, compressed, and data-mining-ready data structure. It is lossless because the vertical bit-wise partitioning guarantees that the original data values are retained completely. It is compressed because segments of bit sequences that are either pure-1 or pure-0 can be represented by a single bit. It is data-mining ready because it addresses the curse of cardinality (the curse of scalability), one of the major issues in data mining. The P-tree vertical data structure has been exploited intensively in various domains and data mining algorithms, including classification [1][2][4], clustering [5], association rule mining [7], and outlier analysis [8]. In September 2005, P-tree technology was patented in the United States by North Dakota State University, patent number 6,941,303.
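To make the vertical structuring concrete, the sketch below is an illustration only: real P-trees add quadrant-wise compression, while here each bit slice is kept as an uncompressed boolean vector, which is enough to show how a query becomes an AND followed by a count of 1 bits.

```python
# Bit-sliced (vertical) representation of one attribute column, assuming
# b-bit unsigned values. Slice j holds bit j of every value in the column.
import numpy as np

def bit_slices(column, b=8):
    return [((column >> j) & 1).astype(bool) for j in range(b)]

# Example: the number of rows whose bit 2 and bit 1 are both 1 is a popcount
# of the AND of the two slices (values 7 and 6 below, so the count is 2).
col = np.array([5, 7, 2, 6], dtype=np.uint8)
P = bit_slices(col)
count = int(np.count_nonzero(P[2] & P[1]))
```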
Vertical Set Squared Distance
Let R(A1, A2, …, Ad) be a relation in d-dimensional space. A numerical value v of attribute Ai can be written in b-bit binary representation as

$$ v = v_{i(b-1)} \cdots v_{i1} v_{i0} = \sum_{j=0}^{b-1} 2^{j} v_{ij}, \quad \text{where each } v_{ij} \text{ is either 0 or 1.} $$

Now, let x be a vector in d-dimensional space. The binary representation of x in b bits can be written as

$$ x = \left( x_{1(b-1)} \cdots x_{10},\; x_{2(b-1)} \cdots x_{20},\; \ldots,\; x_{d(b-1)} \cdots x_{d0} \right) $$
Let X be a set of vectors in R, or let X be the relation R itself, let x ∈ X, and let a be the input vector. The total variation of X about a, denoted TV(X, a), measures the cumulative squared separation of the vectors in X about a. It can be computed quickly and scalably using the Vertical Set Squared Distance (VSSD) formula, derived as follows:
$$ TV(X, a) = (X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2 $$

$$ = \sum_{x \in X} \left( \sum_{i=1}^{d} x_i^2 - 2 \sum_{i=1}^{d} x_i a_i + \sum_{i=1}^{d} a_i^2 \right) $$

$$ = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 \;-\; 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i \;+\; \sum_{x \in X} \sum_{i=1}^{d} a_i^2 \;=\; T_1 - 2 T_2 + T_3 $$
$$ T_1 = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 = \sum_{i=1}^{d} \sum_{x \in X} \left( \sum_{j=0}^{b-1} 2^{j} x_{ij} \right)^2 = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k} \sum_{x \in X} x_{ij}\, x_{ik} $$
Let $P_X$ be a P-tree mask of the set X that can quickly identify the data points in X. $P_X$ is a bit pattern of 1s and 0s, where a 1 indicates that the point at that bit position belongs to X, while a 0 indicates otherwise. Using this mask, the above equation can be simplified by substituting $\sum_{x \in X} x_{ij}\, x_{ik}$ with $\mathrm{COUNT}(P_X \wedge P_{ij} \wedge P_{ik})$, where COUNT is the aggregate operation that counts the 1 bits in a given pattern and $P_{ij}$ is the P-tree of the i-th dimension at the j-th bit position. Thus, $T_1$ simplifies to

$$ T_1 = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k}\, \mathrm{COUNT}(P_X \wedge P_{ij} \wedge P_{ik}) $$
Similarly for terms T2 and T3, we get:
$$ T_2 = \sum_{x \in X} \sum_{i=1}^{d} x_i a_i = \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j} \sum_{x \in X} x_{ij} = \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\, \mathrm{COUNT}(P_X \wedge P_{ij}) $$

$$ T_3 = \sum_{x \in X} \sum_{i=1}^{d} a_i^2 = \mathrm{COUNT}(P_X) \sum_{i=1}^{d} a_i^2 $$
Therefore:
$$ TV(X, a) = (X - a) \circ (X - a) = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k}\, \mathrm{COUNT}(P_X \wedge P_{ij} \wedge P_{ik}) \;-\; 2 \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\, \mathrm{COUNT}(P_X \wedge P_{ij}) \;+\; \mathrm{COUNT}(P_X) \sum_{i=1}^{d} a_i^2 $$

Let us consider X as the relation R itself. In this case, the mask $P_X$ can be removed, since all objects now belong to R. The formula can then be rewritten as

$$ TV(X, a) = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k}\, \mathrm{COUNT}(P_{ij} \wedge P_{ik}) \;-\; 2 \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\, \mathrm{COUNT}(P_{ij}) \;+\; |X| \sum_{i=1}^{d} a_i^2 \qquad (7) $$

where |X| is the cardinality of R.
Furthermore, note that the COUNT operations are independent of a (the input vector). For classification problems where X is the training set (or the relation R itself), this independence allows us to pre-compute the count values, retain them, and reuse them every time a total variation is computed. This reusability of the count values speeds up the computation of total variation significantly.
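The sketch below illustrates this reuse under simplifying assumptions: X is the whole relation (so no P_X mask), attributes are b-bit unsigned integers, and the bit slices are plain boolean vectors rather than compressed P-trees. The COUNT tables are built once; TV(X, a) is then evaluated for any a without touching the data, and the result matches the brute-force definition.

```python
# Hedged sketch of the VSSD computation from precomputed COUNT values.
import numpy as np

def precompute_counts(X, b=8):
    """COUNT(Pij & Pik) and COUNT(Pij) for every dimension i and bit positions j, k."""
    n, d = X.shape
    slices = [[((X[:, i] >> j) & 1).astype(bool) for j in range(b)] for i in range(d)]
    pair = np.zeros((d, b, b))
    single = np.zeros((d, b))
    for i in range(d):
        for j in range(b):
            single[i, j] = np.count_nonzero(slices[i][j])
            for k in range(b):
                pair[i, j, k] = np.count_nonzero(slices[i][j] & slices[i][k])
    return pair, single

def total_variation(pair, single, n, a, b=8):
    """TV(X, a) = T1 - 2*T2 + T3 evaluated from the counts alone (no scan of X)."""
    w = 2.0 ** np.arange(b)                                        # weights 2^j
    t1 = sum((w[:, None] * w[None, :] * pair[i]).sum() for i in range(len(a)))
    t2 = sum(a[i] * (w * single[i]).sum() for i in range(len(a)))
    t3 = n * float(np.dot(a, a))
    return t1 - 2.0 * t2 + t3

# Sanity check against the definition sum_{x in X} |x - a|^2.
X = np.random.randint(0, 256, size=(1000, 3)).astype(np.uint64)
a = np.array([10.0, 200.0, 37.0])
pair, single = precompute_counts(X)
assert np.isclose(total_variation(pair, single, len(X), a),
                  ((X.astype(float) - a) ** 2).sum())
```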
Graph of Total Variations
Let us consider equally distributed data objects in a 2-dimensional space. The plot of the total variation of each object is illustrated in Figure 1. The graph of the total variations is always minimized at the mean.
Figure 1. The hyper-parabolic graph of total variations.
Proposed Algorithm
The proposed algorithm finds candidate neighbors by forming a total variation contour around the total variation of the unclassified object. The objects within the contour are treated as a superset of the nearest neighbors. These objects are identified efficiently using the P-tree range query algorithm, without scanning the total variation values. The algorithm then prunes the neighbor set by means of dimensional projections. After pruning, the k nearest neighbors are selected from the set and vote to determine the class label of the unclassified object.
The algorithm consists of two phases: preprocessing and classifying. In the preprocessing
phase, all steps are conducted only once, while in the classifying phase, the steps are
repeated for every new unclassified object.
Let R(A1, …, An, C) be a training space and let X(A1, …, An) = R[A1, …, An] be the feature subspace. TV(X, a) is the total variation of X about a, and let f(a) = TV(X, a) − TV(X, μ), where μ is the vector mean of X. Recall that:
$$ TV(X, a) = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k}\, \mathrm{COUNT}(P_{ij} \wedge P_{ik}) \;-\; 2 \sum_{i=1}^{d} a_i \sum_{x \in X} x_i \;+\; |X| \sum_{i=1}^{d} a_i^2 $$

$$ = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k}\, \mathrm{COUNT}(P_{ij} \wedge P_{ik}) \;-\; 2 |X| \sum_{i=1}^{d} a_i \mu_i \;+\; |X| \sum_{i=1}^{d} a_i^2 $$

$$ = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k}\, \mathrm{COUNT}(P_{ij} \wedge P_{ik}) \;+\; |X| \left( -2 \sum_{i=1}^{d} a_i \mu_i + \sum_{i=1}^{d} a_i^2 \right) $$
Since f(a) = TV(X, a) − TV(X, μ), the COUNT term cancels and we get:

$$ f(a) = |X| \left( -2 \sum_{i=1}^{d} a_i \mu_i + \sum_{i=1}^{d} a_i^2 \right) \;-\; |X| \left( -2 \sum_{i=1}^{d} \mu_i \mu_i + \sum_{i=1}^{d} \mu_i^2 \right) $$

$$ = |X| \left( -2 \sum_{i=1}^{d} \left( a_i \mu_i - \mu_i \mu_i \right) + \sum_{i=1}^{d} \left( a_i^2 - \mu_i^2 \right) \right) $$

$$ = |X| \left( \sum_{i=1}^{d} a_i^2 - 2 \sum_{i=1}^{d} \mu_i a_i + \sum_{i=1}^{d} \mu_i^2 \right) = |X|\, |a - \mu|^2 $$
Letting $g(a) = \ln(f(a)) = \ln|X| + \ln|a - \mu|^2$ and taking the first derivative of g with respect to the i-th dimension, the gradient of g at a is

$$ \frac{\partial g}{\partial a_i}(a) = \frac{2\,(a - \mu)_i}{|a - \mu|^2} $$

Note that the gradient is 0 if and only if a = μ, and the gradient length depends only on the length of a − μ. Thus, the isobars are hyper-circles, and the shape of the graph is a funnel, as shown in Figure 2.
Figure 2. The shape of graph g(a).
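A quick numeric check (illustrative only) of the identity derived above: f(a) = TV(X, a) − TV(X, μ) collapses to |X|·|a − μ|², so g(a) depends on a only through |a − μ|, which is why the isobars are hyper-circles.

```python
# Verify f(a) = |X| * |a - mu|^2 and g(a) = ln|X| + 2*ln|a - mu| on random data.
import numpy as np

X = np.random.rand(500, 4) * 100.0
mu = X.mean(axis=0)
a = np.array([12.0, 3.5, 80.0, 41.0])

tv = lambda v: ((X - v) ** 2).sum()            # brute-force total variation about v
f_a = tv(a) - tv(mu)
assert np.isclose(f_a, len(X) * ((a - mu) ** 2).sum())
g_a = np.log(len(X)) + 2.0 * np.log(np.linalg.norm(a - mu))
assert np.isclose(g_a, np.log(f_a))
```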
Preprocessing Phase:
1. Obtain the count values needed for computing the total variation and store them. The
complexity of this operation is O(db), where d is the number of dimensions and b is
the bit width.
2. Compute g(x) for every x ∈ X and create derived P-trees of these values. The complexity of this computation is O(n), since it is performed once for each object in X (see the sketch after this list).
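A minimal sketch of the preprocessing phase under the same assumptions as the VSSD sketch above (illustrative names and shapes). Step 1 corresponds to building the COUNT tables of that sketch; step 2 evaluates g(x) once per training object. The actual algorithm bit-slices the g values into derived P-trees so they can be range-queried; here they simply stay a plain array.

```python
# Preprocessing sketch: vector mean and g(x) for every training object.
import numpy as np

X_train = np.random.randint(0, 256, size=(10_000, 3)).astype(float)  # stand-in training features
mu = X_train.mean(axis=0)                                            # vector mean of the feature space
g_train = np.log(len(X_train)) + 2.0 * np.log(np.linalg.norm(X_train - mu, axis=1))
```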
Classifying Phase:
The following steps are performed to classify each unclassified object; a consolidated sketch of the whole phase follows step 6.
1. Determine the vector (a − μ), where a is the unclassified object and μ is the vector mean of the feature space X.
2. Given an epsilon (ε > 0), determine two vectors located on the lower and upper sides of a by moving ε units inward toward μ along the vector (a − μ) and ε units outward away from a. Let b and c be the vectors on the lower and upper sides of a, respectively; then b and c are determined as follows:

$$ b = \mu + \left( 1 - \frac{\varepsilon}{|a - \mu|} \right) (a - \mu) $$

$$ c = \mu + \left( 1 + \frac{\varepsilon}{|a - \mu|} \right) (a - \mu) $$
3. Calculate g(b) and g(c), so that g(b) ≤ g(a) ≤ g(c), and determine the interval [g(b), g(c)], which creates a total variation contour. The neighborhood mask for this interval can be created efficiently using the P-tree range query algorithm, without scanning the total variation values. The mask is a bit pattern of 1s and 0s, where a 1 indicates that the object belongs to the neighborhood, while a 0 indicates otherwise. The objects within the pre-image of this contour in the original feature space are the candidate neighbors (the ε-neighborhood of a, Nbrhd(a, ε)).
Figure 3. The pre-image of the contour of the interval [g(b), g(c)] creates Nbrhd(a, ε).
4. Prune the neighborhood using dimensional projections. For each dimension, a dimensional projection contour is created around the unclassified object. The width of the contour is specified using the same epsilon previously used to determine the vectors b and c. It is important to note that because the training set is already converted into the P-tree vertical structure, no additional derived P-trees are needed to conduct the projection; the P-trees of each dimension can be used directly at no extra cost.
A parameter called MS (manageable size of the candidate set) is needed for the pruning. This parameter specifies an upper bound on the number of neighbors in the candidate set: once the manageable size is reached, the pruning process terminates, and the candidate set is considered small enough to be scanned for the k nearest neighbors.
The pruning technique consists of two major steps. First, it obtains the count of candidate neighbors in each dimension. The rationale is to retain the neighbors that are predominant (close to the unclassified object) in most dimensions, so that when the Euclidean distance is measured (step 5 of the classifying phase) they are the true closest neighbors. The process of obtaining the counts starts from the first dimension: the dimensional projection contour around the unclassified object a is formed and the mask of this contour is created. The contour mask is then AND-ed with the mask of the pre-image of the total variation contour (the candidate set), and the count is obtained. Note that at this point no neighbors are pruned; only the count is obtained. The process continues for all dimensions, and at the end the counts are sorted in descending order.
The second step of the pruning is to intersect the dimensional projection contours with the pre-image of the total variation contour. The intersections proceed in order of the counts: the dimension with the highest count is intersected first, followed by the dimension with the second highest count, and so forth. After each intersection, the number of neighbors in the pre-image of the total variation contour (the candidate set) is updated. From an implementation perspective, this intersection is simply ANDing the mask of the total variation contour with the mask of the dimensional projection contour. The process continues until a manageable size of neighbors is reached or all dimensional projection contours have been intersected with the pre-image of the total variation contour.
Figure 4. Pruning the contour using dimensional projections.
5. Find the k nearest neighbors among the remaining candidates by measuring, for each x ∈ Nbrhd(a, ε), the Euclidean distance

$$ d_2(x, a) = \sqrt{\sum_{i=1}^{d} (x_i - a_i)^2} $$

6. Let the k nearest neighbors vote using a weighted vote.
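The following is a consolidated, hedged sketch of the classifying phase (illustrative function and parameter names, not the paper's implementation). The P-tree range query of step 3 and the projection masks of step 4 are stood in for by ordinary boolean filters over arrays, `ms` plays the role of the MS parameter, and the weighted vote of step 6 is taken to be inverse-distance weighting, which the text above does not specify.

```python
# End-to-end sketch of the classifying phase for one unclassified vector a.
import numpy as np
from collections import Counter

def classify(a, X_train, y_train, g_train, mu, eps=1.0, k=5, ms=1000):
    n, d = X_train.shape
    diff = a - mu
    # Steps 1-2: endpoints b and c, eps inward and eps outward along (a - mu).
    b = mu + (1.0 - eps / np.linalg.norm(diff)) * diff
    c = mu + (1.0 + eps / np.linalg.norm(diff)) * diff
    g = lambda v: np.log(n) + 2.0 * np.log(np.linalg.norm(v - mu))
    g_lo, g_hi = sorted((g(b), g(c)))
    # Step 3: contour mask over the precomputed g values; a P-tree range query
    # would produce this mask without scanning g_train.
    mask = (g_train >= g_lo) & (g_train <= g_hi)
    # Step 4: dimensional-projection pruning, most populated projection first,
    # stopping once the candidate set is no larger than ms.
    proj = [np.abs(X_train[:, i] - a[i]) <= eps for i in range(d)]
    order = np.argsort([-np.count_nonzero(mask & p) for p in proj])
    for i in order:
        if np.count_nonzero(mask) <= ms:
            break
        mask &= proj[i]
    idx = np.flatnonzero(mask)
    if idx.size == 0:                       # empty contour: widen eps in practice
        return None
    # Steps 5-6: exact Euclidean distances on the survivors, weighted vote.
    dists = np.linalg.norm(X_train[idx] - a, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter()
    for label, dist in zip(y_train[idx[nearest]], dists[nearest]):
        votes[label] += 1.0 / (dist + 1e-9)
    return votes.most_common(1)[0][0]
```

With the boolean filters replaced by P-tree ANDs and COUNTs, the control flow is the same: only candidates that survive the contour and the projections are ever compared with a by Euclidean distance.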
Experimental Results
The experiments were conducted on an Intel Pentium 4 CPU 2.6 GHz machine with 3.8 GB of RAM, running Red Hat Linux (kernel 2.4.20-8smp).
Datasets
Some of the datasets were taken from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html).
RSI dataset: This dataset is a set of aerial photographs from the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota, taken in 1998. The images contain three bands: red, green, and blue. Each band has values in the range 0 to 255, which can be represented in binary using 8 bits. The corresponding synchronized data for soil moisture, soil nitrate, and crop yield were also used, and crop yield was selected as the class attribute. Combining all the bands and the synchronized data yields a dataset with 6 dimensions (five feature attributes and one class attribute). To simulate different classes, the crop yield was divided into four categories: low yield (intensity 0 to 63), medium low yield (64 to 127), medium high yield (128 to 191), and high yield (192 to 255). Three synthetic datasets were generated to study the scalability and running time of the proposed algorithm; their cardinalities range from 32 to 96 million.
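For illustration, the four-way binning of the crop-yield band described above amounts to 64-wide intervals over the 0-255 intensity range; the helper below is a sketch with hypothetical labels.

```python
def yield_class(intensity):
    """Map a crop-yield intensity in [0, 255] to one of the four yield classes."""
    labels = ["low", "medium low", "medium high", "high"]
    return labels[min(int(intensity) // 64, 3)]

assert yield_class(63) == "low" and yield_class(128) == "medium high"
```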
KDDCUP 1999 dataset: This is the network intrusion dataset used in KDDCUP 1999. The dataset contains more than 4.8 million samples from a TCP dump. Each sample is labeled with a connection type such as Normal, IP Sweep, Neptune, Port Sweep, Satan, or Smurf.
Wisconsin Diagnostic Breast Cancer (WDBC): This dataset contains 569 diagnosed breast cancer samples with 30 real-valued features. The task is to predict the diagnosis as either Benign (B) or Malignant (M).
OPTICS dataset: The OPTICS dataset has eight different classes and contains 8,000 points in 2-dimensional space.
Iris dataset: This is the Iris plants dataset created by R. A. Fisher. The task is to classify iris plants into one of three varieties: Iris Setosa, Iris Versicolor, and Iris Virginica. The dataset contains 150 samples in a 4-dimensional space.
[Figure 5 plots classification time per sample (seconds) against dataset cardinality (0 to 96,000,000) for SMART-TV with P-tree, SMART-TV with scan, P-KNN with HOBBit, and KNN.]
Note: SMART-TV with scan uses a scanning approach to determine the superset of neighbors in the total variation contour; it is shown only for comparison.
Figure 5. Trend of classification time on the RSI dataset.
[Figure 6 plots classification time per sample (seconds) on the KDDCUP dataset for SMART-TV with P-tree, SMART-TV with scan, P-KNN with HOBBit, and KNN.]
Figure 6. Classification time on the KDDCUP dataset.
Table 1. Classification accuracy comparison on various datasets.

Dataset   SMART-TV with P-tree   P-KNN HOBBit   P-KNN EIN   KNN
KDDCUP    0.93                   0.91           --          0.93
OPTICS    0.99                   0.99           0.99        0.99
IRIS      0.97                   0.89           0.96        0.96
WDBC      0.94                   0.39           0.96        0.97
Summary
SMART-TV is an efficient and scalable nearest-neighbor-based classification algorithm that employs the P-tree vertical data structure. From the experimental results, we conclude that in terms of speed and scalability, SMART-TV is comparable to the other vertical nearest neighbor algorithms, and that in terms of classification accuracy, it is comparable to the KNN classifier.
References
[1] T. Abidin and W. Perrizo, "SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining," Proceedings of the 21st ACM Symposium on Applied Computing (SAC-06), Dijon, France, April 23-27, 2006.
[2] T. Abidin, A. Dong, H. Li, and W. Perrizo, "Efficient Image Classification on Vertically Decomposed Data," Proceedings of the 1st IEEE International Workshop on Multimedia Databases and Data Management (MDDM-06), Atlanta, Georgia, April 8, 2006.
[3] J. Gray, "The Next Database Revolution," Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1-4, Paris, France, 2004.
[4] M. Khan, Q. Ding, and W. Perrizo, "K-nearest Neighbor Classification on Spatial Data Streams Using P-trees," Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 517-528, Taipei, Taiwan, May 2002.
[5] A. Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, "Vertical Set Squared Distance Based Clustering without Prior Knowledge of K," Proceedings of the 14th International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE-05), pp. 72-77, Toronto, Canada, July 20-22, 2005.
[6] W. Perrizo, "Peano Count Tree Technology Lab Notes," Technical Report NDSU-CS-TR-01-1, Computer Science Department, North Dakota State University, 2001. http://www.cs.ndsu.nodak.edu/~perrizo/classes/785/pct.html
[7] I. Rahal, D. Ren, and W. Perrizo, "A Scalable Vertical Model for Mining Association Rules," Journal of Information & Knowledge Management (JIKM), Vol. 3, No. 4, pp. 317-329, 2004.
[8] D. Ren, B. Wang, and W. Perrizo, "RDF: A Density-Based Outlier Detection Method Using Vertical Data Representation," Proceedings of the 4th IEEE International Conference on Data Mining (ICDM-04), pp. 503-506, November 1-4, 2004.