A Vertical Approach to Computing Set Squared Distance
Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
{taufik.abidin, amal.perera, md.serazi, william.perrizo}@ndsu.edu
Abstract
Recent advances in computing power, networks, information storage, and multimedia have led to a proliferation of stored data in applications such as bioinformatics, pattern recognition and image analysis, the World Wide Web, networking, banking, and market basket research. This explosive growth of data poses great challenges and generates urgent needs for efficient and scalable algorithms that can deal with massive datasets. In this paper, we propose a vertical approach (algorithm) to computing the set squared distance, which measures the total variation of a set about a given point. Total variation analysis can be used in classification, clustering, and outlier detection algorithms. The proposed algorithm employs the P-tree¹ vertical data structure, one choice of vertical representation that has been experimentally shown to address the curse of scalability. The experimental results show that the proposed approach is fast and scales to very large datasets when computing total variation, compared to the conventional horizontal record-based approach.
Keywords
Vertical Data Structure, Vertical Set Squared Distance.
¹ P-tree technology is patented by North Dakota State University, United States Patent number 6,941,303.
1. INTRODUCTION
Distance (closeness) between data objects is frequently measured in data mining
algorithms. For example, in distance-based classification such as k-nearest neighbors, the class label of an unclassified object is determined by the plurality of class labels among its k nearest neighbors. These k nearest neighbors are found based on the proximity between the unclassified object and the objects in the training set [2][6][9]. Similarly, in partitioning-based clustering algorithms, such as k-means and k-medoids, the assignment of cluster membership of data points is based on the distance between the data points and the cluster representatives [5].
Distance is also measured in outlier analysis [7].
Several distance functions have been introduced in the literature, such as Minkowski, Manhattan, Euclidean, Max, and the like. Minkowski distance is defined as

$d_q(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^q \right)^{1/q}$

where q is a positive integer. This distance is the generalization of the Manhattan, Euclidean, and Max distances [4]. Manhattan distance, $d_1(x, y) = \sum_{i=1}^{d} |x_i - y_i|$, is Minkowski distance with q = 1; Euclidean distance, $d_2(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$, is Minkowski distance with q = 2; and Max distance, $d_\infty(x, y) = \max_{i=1}^{d} |x_i - y_i|$, is Minkowski distance with $q = \infty$.
In this paper, we propose a vertical approach to computing the set squared distance. The set squared distance, defined as $\sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$, measures the cumulative squared separation of the points in a set X about a fixed point a, i.e., the total variation of X about a. We denote the set squared distance as TV(X, a) in this paper. Figure 1 shows the hyper-parabolic graph of the total variations of equally distributed data points. The hyper-parabolic graph of the total variations is always minimized at the mean.
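This can be seen directly from the definition: setting the partial derivative of TV(X, a) with respect to each coordinate $a_i$ to zero yields the component mean,

$$\frac{\partial}{\partial a_i} TV(X, a) = \frac{\partial}{\partial a_i} \sum_{x \in X} \sum_{l=1}^{d} (x_l - a_l)^2 = -2 \sum_{x \in X} (x_i - a_i) = 0 \;\Longrightarrow\; a_i = \frac{1}{|X|} \sum_{x \in X} x_i ,$$

and since TV(X, a) is a sum of convex quadratics in a, this stationary point is the global minimum.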
Figure 1. The hyper-parabolic graph of the total variations on equally distributed data.
In this paper, we focus our work on the scalability of the algorithm. Because data is growing at a tremendous rate, scalability becomes a major issue, and newly developed algorithms should scale to very large datasets. To ensure scalability, the proposed algorithm employs the P-tree vertical data structure, which organizes data vertically and processes it horizontally through fast and efficient logical operations AND, OR, and NOT. Using the P-tree vertical data structure, the need for repeated dataset scans, as commonly required in the horizontal record-based approach, can be eliminated. We demonstrate the scalability of the proposed algorithm empirically through several
experiments.
Throughout the paper, we will use the term VSSD to refer to Vertical Set Squared
Distance, a vertical approach to computing set squared distance, and use the term HSSD to refer
to Horizontal Set Squared Distance, a horizontal approach to computing set squared distance. It
is worth noting here that the horizontal approach, implemented for comparison, uses a scanning
approach, which scans the entire dataset every time a total variation is computed. Scanning is a
common approach used in conventional horizontal record-based structure. In addition, the use of
indexes for the horizontal approach will not help much in this context.
The rest of the paper is organized as follows: in Section 2, we discuss the P-tree vertical data representation; in Section 3, we discuss the derivation of the vertical set squared distance and a strategy for retaining count values; in Section 4, we report the experimental results; and in Section 5, we conclude the work and give some directions for future work.
2. VERTICAL DATA REPRESENTATION
Vertical data representation represents and processes data differently from horizontal data
representation. In vertical data representation, the data is structured column-by-column and
processed horizontally through ANDing or ORing, while in horizontal data representation, the
data is structured row-by-row and processed vertically through scanning or using some indexes.
P-tree vertical data structure [11] is one choice of vertical data representation. It is a lossless,
compressed, and data mining ready data structure. P-tree is lossless because the vertical bit-wise
partitioning guarantees that the original data values can be retained completely. The compressed
property is obtained when the vertical bit sequences are constructed into tree structures and there
exist segments of bit sequences that are either pure-1 or pure-0. P-tree is data-mining ready
because it addresses the curse of cardinality or the curse of scalability problem, one of the major
issues in data mining. P-tree vertical data structure has been used in various data mining
algorithms [1][6][8][9][10] and has been experimentally proven to have great potential to address
the curse of scalability.
The construction of P-tree vertical data structure is started by converting the dataset,
normally arranged in a relation R(A1, A2,…, Ad) of horizontal records, into binary. Each attribute
in the relation is vertically partitioned (projected), and for each bit position in the attribute,
vertical bit sequences (containing 0s and 1s) are subsequently created. During partitioning, the
relative order of the data is retained to ensure convertibility. In 0-dimensional P-trees, the vertical bit sequences are left uncompressed and are not constructed into predicate trees. The size of 0-dimensional P-trees is equal to the cardinality of the dataset. In 1-dimensional compressed P-trees, the vertical bit sequences are constructed into predicate trees. In this compressed form, ANDing operations can be accelerated. The 1-dimensional P-trees are constructed by recursively halving the vertical bit sequences and recording the truth of the "purely 1-bits" predicate in each half.
A predicate 1 indicates that the bits in that half are all 1s, and a predicate 0 indicates otherwise.
Consider Figure 2 to get some insights on how 1-dimensional P-trees of a single attribute A1 are
constructed.
Figure 2. The illustration of P-tree vertical data structure.
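To make the construction concrete, the following C++ sketch (a minimal illustration, not the authors' implementation) slices a single integer attribute into vertical bit sequences, MSB first, and builds one 1-dimensional P-tree per bit position by recursively halving the sequence and recording the "purely 1-bits" predicate for each half; the attribute values used are illustrative only.

    #include <cstdint>
    #include <iostream>
    #include <memory>
    #include <vector>

    // One node of a 1-dimensional P-tree: records whether its segment is purely 1-bits.
    struct PTreeNode {
        bool pure1 = false;
        std::unique_ptr<PTreeNode> left, right;
    };

    // Recursively halve the bit sequence [lo, hi) and record the purity predicate.
    std::unique_ptr<PTreeNode> buildPTree(const std::vector<uint8_t>& bits,
                                          size_t lo, size_t hi) {
        auto node = std::make_unique<PTreeNode>();
        node->pure1 = true;
        for (size_t i = lo; i < hi; ++i)
            if (!bits[i]) { node->pure1 = false; break; }
        if (!node->pure1 && hi - lo > 1) {      // pure segments need no further splitting
            size_t mid = lo + (hi - lo) / 2;
            node->left  = buildPTree(bits, lo, mid);
            node->right = buildPTree(bits, mid, hi);
        }
        return node;
    }

    int main() {
        // Illustrative 3-bit values of a single attribute A1 (not from the paper's datasets).
        std::vector<int> A1 = {4, 2, 7, 5, 2, 1, 6, 3};
        const int b = 3;

        // Vertical (projected) partitioning: one bit sequence per bit position, MSB first.
        std::vector<std::vector<uint8_t>> bitSeq(b, std::vector<uint8_t>(A1.size()));
        for (size_t r = 0; r < A1.size(); ++r)
            for (int j = 0; j < b; ++j)
                bitSeq[j][r] = (A1[r] >> (b - 1 - j)) & 1;   // bitSeq[0] corresponds to P11

        // Build one 1-dimensional P-tree per bit position.
        for (int j = 0; j < b; ++j) {
            auto root = buildPTree(bitSeq[j], 0, bitSeq[j].size());
            std::cout << "P1" << (j + 1) << " root is pure-1: " << root->pure1 << "\n";
        }
        return 0;
    }

Keeping the relative order of the rows during the projection is what makes the representation lossless: reading bit j of every sequence at row r reproduces the original value.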
In the P-tree data structure, range queries, values, or any other patterns can be obtained using a combination of the Boolean algebra operators AND, OR, and NOT. Besides those operators, the COUNT operation is also very important in this structure; it counts the number of 1-bits in a basic or complement P-tree. For example, using the P-trees P11, P12, and P13 from Figure 2, the count values resulting from COUNT(P11), COUNT(P12), and COUNT(P13) are 4, 5, and 4, respectively. Refer to [3] for complete information about P-tree algebra and its operations.
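For concreteness, a minimal sketch of the AND and COUNT operations over uncompressed (0-dimensional) bit sequences is shown below. The plain std::vector<uint8_t> representation is an assumption made for illustration, not the P-tree API of [12]; the bit patterns are chosen so that COUNT(P11), COUNT(P12), and COUNT(P13) come out to 4, 5, and 4, as stated above.

    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <vector>

    using BitSeq = std::vector<uint8_t>;   // uncompressed (0-dimensional) bit sequence

    // Bitwise AND of two vertical bit sequences of equal length.
    BitSeq AND(const BitSeq& a, const BitSeq& b) {
        BitSeq r(a.size());
        for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] & b[i];
        return r;
    }

    // COUNT: the number of 1-bits in a bit sequence.
    size_t COUNT(const BitSeq& a) {
        return std::accumulate(a.begin(), a.end(), size_t{0});
    }

    int main() {
        // Bit sequences of attribute A1 (MSB, middle, LSB), consistent with the counts above.
        BitSeq P11 = {1, 0, 1, 1, 0, 0, 1, 0};
        BitSeq P12 = {0, 1, 1, 0, 1, 0, 1, 1};
        BitSeq P13 = {0, 0, 1, 1, 0, 1, 0, 1};

        std::cout << COUNT(P11) << " " << COUNT(P12) << " " << COUNT(P13) << "\n";  // 4 5 4
        std::cout << COUNT(AND(P11, P12)) << "\n";   // rows where both bits are 1
        return 0;
    }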
3. THE PROPOSED APPROACH
3.1. Vertical Set Squared Distance
Let R(A1, A2, …, Ad) be a relation in d-dimensional space. A numerical value v of attribute Ai can be written in a b-bit binary representation as follows:

$v = v_{i(b-1)} \cdots v_{i0} = \sum_{j=b-1}^{0} 2^j \cdot v_{ij}$, where $v_{ij}$ can either be 0 or 1   (1)

The first subscript corresponds to the attribute to which v belongs, and the second subscript corresponds to the bit order. The summation on the right-hand side of the equation is equal to the numerical value of v in base 10.
Now let x be a vector in d-dimensional space. The binary representation of x in b bits can be written as follows:

$x = (x_{1(b-1)} \cdots x_{10},\; x_{2(b-1)} \cdots x_{20},\; \ldots,\; x_{d(b-1)} \cdots x_{d0})$   (2)
Let X be a set of vectors in a relation R, x ∈ X, and let a be the vector being examined. The total variation of X about a, denoted as TV(X, a), measures the cumulative squared separation of the vectors in X about a. The total variation can be measured quickly and scalably using the vertical set squared distance, derived as follows:

$TV(X, a) = (X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$   (3)

$= \sum_{x \in X} \left( \sum_{i=1}^{d} x_i^2 - 2 \sum_{i=1}^{d} x_i a_i + \sum_{i=1}^{d} a_i^2 \right)$

$= \sum_{x \in X} \sum_{i=1}^{d} x_i^2 - 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i + \sum_{x \in X} \sum_{i=1}^{d} a_i^2$

$= T_1 - 2T_2 + T_3$
$T_1 = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 = \sum_{x \in X} \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} 2^j x_{ij} \right)^2 = \sum_{x \in X} \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} 2^j x_{ij} \right) \left( \sum_{k=b-1}^{0} 2^k x_{ik} \right) = \sum_{x \in X} \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} x_{ij} x_{ik}$   (4)

Commuting the summation over x ∈ X further inside, to first process all vectors that belong to X vertically and then process each attribute horizontally, we get:

$T_1 = \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \sum_{x \in X} x_{ij} x_{ik}$   (5)
Let PX be a P-tree mask of the set X that can quickly identify the data points in X. PX is a bit pattern containing 1s and 0s in which a 1-bit indicates that the point at that bit position belongs to X, while a 0-bit indicates otherwise. Using this mask, equation (5) can be simplified by substituting $\sum_{x \in X} x_{ij} x_{ik}$ with $COUNT(PX \wedge P_{ij} \wedge P_{ik})$. Recalling that the COUNT operation counts the number of 1-bits in the pattern, the simplified form of equation (5) can be written as:

$T_1 = \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \cdot COUNT(PX \wedge P_{ij} \wedge P_{ik})$   (6)
Similarly, for the terms T2 and T3, using the mask, the derivation is straightforward, as shown in equations (7) and (8) respectively.

$T_2 = \sum_{x \in X} \sum_{i=1}^{d} x_i a_i = \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^j \sum_{x \in X} x_{ij} = \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^j \cdot COUNT(PX \wedge P_{ij})$   (7)

$T_3 = \sum_{x \in X} \sum_{i=1}^{d} a_i^2 = COUNT(PX) \cdot \sum_{i=1}^{d} a_i^2$   (8)
Therefore, VSSD is defined to be:

$TV(X, a) = (X - a) \circ (X - a) = \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \cdot COUNT(PX \wedge P_{ij} \wedge P_{ik}) \;-\; 2 \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^j \cdot COUNT(PX \wedge P_{ij}) \;+\; COUNT(PX) \cdot \sum_{i=1}^{d} a_i^2$   (9)
Observe that the COUNT operations in equation (9) are independent of a. These include the single-operand COUNT(PX), the two-operand $\sum_{i=1}^{d} \sum_{j=b-1}^{0} COUNT(PX \wedge P_{ij})$, and the three-operand $\sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} COUNT(PX \wedge P_{ij} \wedge P_{ik})$. For a dataset that is rarely changed (static), such as a training set used in classification, the independence of the COUNT operations is an advantage. It allows us to precompute the count values in advance, retain them, and reuse them repeatedly during the computation of total variations. The reusability of the count values expedites the computation of total variation significantly.
3.2. A Strategy for Retaining Count Values
For illustration, consider a dataset containing 10 data points, as shown in Table 1. The dataset has two feature attributes, X and Y, and a class attribute containing two values, C1 and C2. The binary values of each data point are included in the table for clarity. The last two columns on the right-hand side of the table are the masks of each class, denoted as PX1 and PX2. In this example, we want to measure the total variations of each class about a fixed point, and thus the dataset is subdivided into two sets. The first set is the collection of data points in class C1, and the second set is the collection of data points in class C2. If the entire dataset is considered as a single set, without subdividing it into classes, we do not need a P-tree mask because there is only one set. In such a case, the resulting count of COUNT(PX) can be replaced with the cardinality of the dataset, $\sum_{i=1}^{d} \sum_{j=b-1}^{0} COUNT(PX \wedge P_{ij})$ can be simplified to $\sum_{i=1}^{d} \sum_{j=b-1}^{0} COUNT(P_{ij})$, and $\sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} COUNT(PX \wedge P_{ij} \wedge P_{ik})$ becomes $\sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} COUNT(P_{ij} \wedge P_{ik})$.
Table 1. Example dataset.

X   Y   CLASS   X in Binary   Y in Binary   PX1   PX2
7   6   C1      111           110           1     0
2   6   C2      010           110           0     1
6   3   C1      110           011           1     0
3   3   C2      011           011           0     1
3   4   C2      011           100           0     1
7   5   C1      111           101           1     0
7   2   C1      111           010           1     0
4   5   C2      100           101           0     1
1   4   C2      001           100           0     1
6   5   C2      110           101           0     1
The count values that need to be retained are the values returned by those COUNT operations. They are stored separately in three files. The first file holds the count value returned by COUNT(PX), the second file holds the count values returned by $\sum_{i=1}^{d} \sum_{j=b-1}^{0} COUNT(PX \wedge P_{ij})$, and the last file holds the count values returned by $\sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} COUNT(PX \wedge P_{ij} \wedge P_{ik})$. The count values in each file are organized in the appropriate order. For example, for the COUNT(PX) operation, the count value of the first set is placed first in the file, followed by the count value of the second set, and so forth. The count values of $\sum_{i=1}^{d} \sum_{j=b-1}^{0} COUNT(PX \wedge P_{ij})$ are also organized accordingly, i.e., the count values of the first set are stored first, followed by the count values of the next set. The number of count values resulting from $\sum_{i=1}^{d} \sum_{j=b-1}^{0} COUNT(PX \wedge P_{ij})$ for each set is equal to the summation of the bit lengths of each dimension. This number is needed to correctly retrieve the count values of the set when a total variation is computed. The same strategy is also used when storing the count values of $\sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} COUNT(PX \wedge P_{ij} \wedge P_{ik})$. The number of count values returned by this operation for each set is equal to the summation of the squared bit lengths of each dimension. Table 2 summarizes the count values of classes C1 and C2.
When the sets are not static, retaining the count values in advance is not as beneficial, since the members of the sets change dynamically. Retaining the count values can still be done in such a case, although it does not give as much advantage as when the sets are static.
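As a minimal sketch of how these count values can be produced, the C++ fragment below computes COUNT(PX), COUNT(PX ∧ Pij), and COUNT(PX ∧ Pij ∧ Pik) for class C1 of Table 1 over plain bit columns. The in-memory layout and the loop order (attribute i first, then j and k from bit position 2 down to 0, mirroring Table 2) are assumptions made for illustration; the authors' implementation stores the values in three files through the P-tree API.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    using Bits = std::vector<uint8_t>;

    // COUNT of the AND of two (or three) bit sequences.
    int countAnd(const Bits& a, const Bits& b) {
        int c = 0;
        for (size_t r = 0; r < a.size(); ++r) c += a[r] & b[r];
        return c;
    }
    int countAnd(const Bits& a, const Bits& b, const Bits& c2) {
        int c = 0;
        for (size_t r = 0; r < a.size(); ++r) c += a[r] & b[r] & c2[r];
        return c;
    }

    int main() {
        // Bit columns of Table 1 in row order; P[i][0] holds the most significant bit.
        std::vector<std::vector<Bits>> P = {
            { {1,0,1,0,0,1,1,1,0,1},     // attribute X, bit weight 2^2
              {1,1,1,1,1,1,1,0,0,1},     // attribute X, bit weight 2^1
              {1,0,0,1,1,1,1,0,1,0} },   // attribute X, bit weight 2^0
            { {1,1,0,0,1,1,0,1,1,1},     // attribute Y, bit weight 2^2
              {1,1,1,1,0,0,1,0,0,0},     // attribute Y, bit weight 2^1
              {0,0,1,1,0,1,0,1,0,1} }    // attribute Y, bit weight 2^0
        };
        Bits PX1 = {1,0,1,0,0,1,1,0,0,0};   // class mask of C1 from Table 1

        int cntX = 0;
        for (uint8_t bit : PX1) cntX += bit;
        std::cout << "COUNT(PX) = " << cntX << "\n";   // 4, as in Table 2 for class C1

        for (int i = 0; i < 2; ++i)
            for (int jw = 2; jw >= 0; --jw) {          // jw = bit weight exponent, as in Table 2
                const Bits& Pij = P[i][2 - jw];
                std::cout << "i=" << i + 1 << " j=" << jw
                          << "  COUNT(PX^Pij) = " << countAnd(PX1, Pij) << "\n";
                for (int kw = 2; kw >= 0; --kw)
                    std::cout << "    k=" << kw << "  COUNT(PX^Pij^Pik) = "
                              << countAnd(PX1, Pij, P[i][2 - kw]) << "\n";
            }
        return 0;
    }

Running the same loops with PX2 instead of PX1 reproduces the count values listed for class C2.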
Table 2. The count values of each class.

CLASS C1
i  j  k   COUNT(PX^Pij^Pik)   COUNT(PX^Pij)   COUNT(PX)
1  2  2   4                   4               4
1  2  1   4
1  2  0   3
1  1  2   4                   4
1  1  1   4
1  1  0   3
1  0  2   3                   3
1  0  1   3
1  0  0   3
2  2  2   2                   2
2  2  1   1
2  2  0   1
2  1  2   1                   3
2  1  1   3
2  1  0   1
2  0  2   1                   2
2  0  1   1
2  0  0   2

CLASS C2
i  j  k   COUNT(PX^Pij^Pik)   COUNT(PX^Pij)   COUNT(PX)
1  2  2   2                   2               6
1  2  1   1
1  2  0   0
1  1  2   1                   4
1  1  1   4
1  1  0   2
1  0  2   0                   3
1  0  1   2
1  0  0   3
2  2  2   5                   5
2  2  1   1
2  2  0   2
2  1  2   1                   2
2  1  1   2
2  1  0   1
2  0  2   2                   3
2  0  1   1
2  0  0   3

Subscript i represents the index of the attribute; j and k represent bit positions. COUNT(PX^Pij) is listed on the first row of each (i, j) group, and COUNT(PX) once per class.
3.3. Complexity Analysis
The cost of VSSD lies in the computation of the count values. When datasets are subdivided into several subsets, the complexity is $O(kdb^2)$, where k is the number of subsets, d is the number of feature attributes, and b is the average bit length. However, when the entire dataset is considered as a single set, the complexity is reduced to $O(db^2)$. The choice of whether to consider the entire dataset as a single set or to subdivide it into several subsets depends on the situation. It is important to note that the cost of computing the count values is higher when the dataset is subdivided into several subsets.

The complexity of computing a total variation using VSSD is O(1). This constant complexity is obtained because the same count values can be reused for any input point. The computation of a total variation is just a matter of taking the right count values and solving equation (9), but without the COUNT operations. For example, let a = (2, 3) be the point being examined, and let the count values of classes C1 and C2 be as listed in Table 2. The total variation of class C1 about a, denoted as TV(C1, a), can be computed quickly as follows:
$TV(C_1, a) = \sum_{i=1}^{2} \sum_{j=2}^{0} \sum_{k=2}^{0} 2^{j+k} \cdot COUNT(PX \wedge P_{ij} \wedge P_{ik}) \;-\; 2 \sum_{i=1}^{2} a_i \sum_{j=2}^{0} 2^j \cdot COUNT(PX \wedge P_{ij}) \;+\; COUNT(PX) \cdot \sum_{i=1}^{2} a_i^2$

$= 2^4 \cdot 4 + 2^3 \cdot 4 + 2^2 \cdot 3 + 2^3 \cdot 4 + 2^2 \cdot 4 + 2^1 \cdot 3 + 2^2 \cdot 3 + 2^1 \cdot 3 + 2^0 \cdot 3 \;+$
$\quad 2^4 \cdot 2 + 2^3 \cdot 1 + 2^2 \cdot 1 + 2^3 \cdot 1 + 2^2 \cdot 3 + 2^1 \cdot 1 + 2^2 \cdot 1 + 2^1 \cdot 1 + 2^0 \cdot 2 \;-$
$\quad 2 \cdot (2 \cdot (2^2 \cdot 4 + 2^1 \cdot 4 + 2^0 \cdot 3) + 3 \cdot (2^2 \cdot 2 + 2^1 \cdot 3 + 2^0 \cdot 2)) \;+$
$\quad 4 \cdot (2^2 + 3^2)$

$= 105$
Similarly, the total variation of class C2 about a, denoted as TV(C2, a), can be computed as follows:

$TV(C_2, a) = \sum_{i=1}^{2} \sum_{j=2}^{0} \sum_{k=2}^{0} 2^{j+k} \cdot COUNT(PX \wedge P_{ij} \wedge P_{ik}) \;-\; 2 \sum_{i=1}^{2} a_i \sum_{j=2}^{0} 2^j \cdot COUNT(PX \wedge P_{ij}) \;+\; COUNT(PX) \cdot \sum_{i=1}^{2} a_i^2$

$= 2^4 \cdot 2 + 2^3 \cdot 1 + 2^2 \cdot 0 + 2^3 \cdot 1 + 2^2 \cdot 4 + 2^1 \cdot 2 + 2^2 \cdot 0 + 2^1 \cdot 2 + 2^0 \cdot 3 \;+$
$\quad 2^4 \cdot 5 + 2^3 \cdot 1 + 2^2 \cdot 2 + 2^3 \cdot 1 + 2^2 \cdot 2 + 2^1 \cdot 1 + 2^2 \cdot 2 + 2^1 \cdot 1 + 2^0 \cdot 3 \;-$
$\quad 2 \cdot (2 \cdot (2^2 \cdot 2 + 2^1 \cdot 4 + 2^0 \cdot 3) + 3 \cdot (2^2 \cdot 5 + 2^1 \cdot 2 + 2^0 \cdot 3)) \;+$
$\quad 6 \cdot (2^2 + 3^2)$

$= 42$
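The same arithmetic can be checked mechanically. The short C++ fragment below plugs the retained count values of Table 2 into equation (9) for a = (2, 3) and reproduces TV(C1, a) = 105 and TV(C2, a) = 42; the flat array layout (attribute index first, then j and k from bit position 2 down to 0) is an assumption chosen to mirror Table 2, not the authors' file format.

    #include <iostream>
    #include <vector>

    // Evaluate equation (9) for d = 2 attributes of b = 3 bits from retained counts.
    // cnt3[i][j][k] = COUNT(PX^Pij^Pik), cnt2[i][j] = COUNT(PX^Pij), cnt1 = COUNT(PX),
    // with array index 0 corresponding to bit position 2 (weight 2^2).
    long long tv(const std::vector<std::vector<std::vector<int>>>& cnt3,
                 const std::vector<std::vector<int>>& cnt2, int cnt1,
                 const std::vector<int>& a) {
        long long t1 = 0, t2 = 0, t3 = 0;
        for (int i = 0; i < 2; ++i) {
            for (int j = 0; j < 3; ++j) {
                for (int k = 0; k < 3; ++k)
                    t1 += (1LL << (2 - j)) * (1LL << (2 - k)) * cnt3[i][j][k];
                t2 += a[i] * (1LL << (2 - j)) * cnt2[i][j];
            }
            t3 += a[i] * a[i];
        }
        return t1 - 2 * t2 + static_cast<long long>(cnt1) * t3;   // T1 - 2*T2 + T3
    }

    int main() {
        std::vector<int> a = {2, 3};
        // Count values of class C1 (Table 2), rows ordered j = 2,1,0 and k = 2,1,0.
        std::vector<std::vector<std::vector<int>>> c1_3 = {
            {{4,4,3},{4,4,3},{3,3,3}}, {{2,1,1},{1,3,1},{1,1,2}} };
        std::vector<std::vector<int>> c1_2 = {{4,4,3},{2,3,2}};
        // Count values of class C2 (Table 2).
        std::vector<std::vector<std::vector<int>>> c2_3 = {
            {{2,1,0},{1,4,2},{0,2,3}}, {{5,1,2},{1,2,1},{2,1,3}} };
        std::vector<std::vector<int>> c2_2 = {{2,4,3},{5,2,3}};

        std::cout << "TV(C1, a) = " << tv(c1_3, c1_2, 4, a) << "\n";  // 105
        std::cout << "TV(C2, a) = " << tv(c2_3, c2_2, 6, a) << "\n";  // 42
        return 0;
    }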
4. EXPERIMENTAL RESULTS
In this section, we report the experimental results. The objective is to compare the
efficiency and scalability between VSSD employing a vertical approach (vertical data structure
with horizontal bitwise operations) and HSSD utilizing a horizontal approach (horizontal data
structure with vertical scans). HSSD is defined as shown in equation (3). Both VSSD and HSSD
were implemented in the C++ programming language. The P-tree API [12] was also used in the implementation of VSSD. The performance of the algorithms was observed on several machines with different specifications, including an SGI Altix CC-NUMA machine. The specifications of the machines are listed in Table 3.
Table 3. The specification of the machines.

Machine Name   Specification                      Memory Size
AMD            AMD Athlon K7 1.4GHz processor     1GB
P4             Intel P4 2.4GHz processor          3.8GB
SGI Altix      SGI Altix CC-NUMA, 12 processors   Shared Memory (12 x 4 GB)
4.1. Datasets
The datasets were taken from a set of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota. Latitude and longitude are 97°42'18"W, taken in 1998. The images contain three bands: red, green, and blue reflectance values. The values are between 0 and 255, which in binary can be represented using 8 bits. The original image is 1024x1024 pixels (having a cardinality of 1,048,576), as depicted in Figure 3. Corresponding synchronized data for soil moisture, soil nitrate, and crop yield were also used. The crop yield was selected as the class attribute. Combining all bands and the synchronized data, we obtained a dataset with 6 dimensions (five feature attributes and one class attribute).
Additional datasets with different cardinalities were synthetically generated from the original dataset to study the speed and scalability of the methods. Both speed and scalability were evaluated with respect to dataset size. Due to the relatively small cardinality of the original dataset (1,048,576 records), we super-sampled it using a simple image processing tool to produce five larger datasets, having cardinalities of 2,097,152, 4,194,304 (2048x2048 pixels), 8,388,608, 16,777,216 (4096x4096 pixels), and 25,160,256 (5016x5016 pixels). We categorized the crop yield attribute into four different categories to simulate various subsets in the datasets. The categories are: low yield, having intensity between 0 and 63; medium low yield, having intensity between 64 and 127; medium high yield, having intensity between 128 and 191; and high yield, having intensity between 192 and 255.
Figure 3. The original image of the dataset.
4.2. Computation Time and Scalability
Our first observation was to evaluate the performance of VSSD and HSSD when running on different machines. We discovered that VSSD is significantly faster than HSSD on all machines. VSSD takes only 0.0004 seconds on average to compute the total variations of each set (low yield, medium low yield, medium high yield, and high yield) about 5 tested points. As discussed in Section 3.3, the cost of VSSD lies in the computation of count values. However, this computation is extremely fast because the COUNT operations simply count the number of 1-bits in the bit patterns. We found that for each dataset, the COUNT operations were executed 1,280 times, derived from 4 x 5 x 8², in line with the complexity of computing count values, O(kdb²). Table 4 summarizes the amount of time needed for VSSD to run all COUNT operations on the different datasets and machines. Notice that when running on the AMD machine, VSSD needs only 0.4 seconds on average to finish a single COUNT operation for the dataset of size 25,160,256, while on the P4 machine, VSSD needs only 0.15 seconds on average to finish a single COUNT operation for the same dataset. The COUNT operation was even faster when VSSD was running on the SGI Altix: it takes 183.81 seconds to complete all 1,280 COUNT operations, or only 0.14 seconds on average to complete a single COUNT operation.
Table 4. Time for VSSD to compute all count values.

                 Time (Seconds)
Dataset Sizes    AMD      P4       SGI Altix
1,048,576        14.57    5.05     4.39
2,097,152        36.32    11.05    9.19
4,194,304        75.89    24.03    21.73
8,388,608        147.79   49.69    50.25
16,777,216       305.22   97.59    121.73
25,160,256       513.98   192.07   183.81

The computation of total
variation is really fast for VSSD once the count values are obtained. It is a matter of taking the appropriate count values and completing the summation shown in equation (9), but without the COUNT operations. We only report the time to compute the total variations when VSSD was running on the AMD machine, because the same times were also observed on the other machines.

In contrast, HSSD takes more time to compute the total variations. The time is linear in the dataset size and differs on every machine. For example, on the AMD machine, HSSD takes 79.86 and 132.17 seconds on average to compute the total variations for the datasets of size 16,777,216 and 25,160,256, respectively. On the P4 machine, HSSD takes 98.62 and 155.06 seconds on average to compute the total variations for the same datasets. The same phenomenon was also found when HSSD was running on the SGI Altix machine: the average time to compute the total variations for the dataset of size 16,777,216 is twice the time for the dataset of size 8,388,608. Table 5 shows the average time to compute the total variations on different machines, and Figure 4 illustrates the time trend. The times in the table show a clear advantage for the proposed approach.
Table 5. The average time to compute the total variations under different machines.

                 Average Time to Compute the Total Variations (Seconds)
                 HSSD                               VSSD
Dataset Sizes    AMD      P4       SGI Altix        AMD
1,048,576        5.30     6.14     6.79             0.0004
2,097,152        10.58    12.27    13.84            0.0004
4,194,304        18.40    24.73    27.64            0.0004
8,388,608        36.85    50.15    55.10            0.0004
16,777,216       79.86    98.62    109.76           0.0004
25,160,256       132.17   155.06   164.95           0.0004
Figure 4. Time trend to compute total variations (time in seconds versus number of tuples, x 1024^2, for VSSD on AMD, HSSD on AMD, HSSD on P4, and HSSD on SGI Altix).
It is important to note that the significant disparity in the time to compute the total variations between VSSD and HSSD is due to the capability of VSSD to reuse the same count values once they are computed. As a result, VSSD tends to have a constant time when computing total variations, even when the dataset sizes are varied. On the other hand, HSSD must scan the datasets each time the total variations are computed. Thus, its time to compute a total variation is linear in the cardinality of the datasets.

Our second observation was to compare the time to load the datasets into memory. We discovered that when the datasets are organized in the P-tree vertical structure, they are loaded much faster than when they are organized in the horizontal structure; see Table 6. The reason is that when the datasets are organized in the P-tree vertical structure, they are stored in binary and hence can be loaded efficiently. On the other hand, when the datasets are organized in the horizontal structure, which is not a binary format, it takes more time to load them into memory.
Table 6. Loading time comparison.

                 Average Loading Time (Seconds)
                 Loading P-trees and Count Values    Loading Horizontal Datasets
Dataset Sizes    AMD     P4      SGI Altix           AMD      P4       SGI Altix
1,048,576        0.11    0.04    0.04                31.65    24.12    25.92
2,097,152        0.25    0.09    0.06                63.22    48.21    51.96
4,194,304        0.47    0.16    0.12                118.61   97.87    103.98
8,388,608        0.95    0.35    0.26                243.84   202.69   208.61
16,777,216       1.87    0.67    0.55                489.59   389.96   415.43
25,160,256       3.33    0.95    0.82                784.57   588.27   625.33
5. CONCLUSION
We have introduced a vertical approach to computing set squared distance (total variation) and evaluated its performance. The empirical results clearly show that VSSD is fast and scalable to very large datasets. The independence of the COUNT operations from the data point being examined is the major factor that makes the computation of total variation using VSSD extremely fast. The proposed approach is scalable due to the use of the P-tree vertical data structure.

For future work, we will apply total variation analysis in classification and clustering algorithms. The efficiency of computing total variation using the vertical approach provides a window of opportunity to develop fast and scalable data mining techniques.
6. REFERENCES
[1] T. Abidin and W. Perrizo, "SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining," Proceedings of the 21st ACM Symposium on Applied Computing (SAC-06), Dijon, France, April 23-27, 2006.
[2] T.M. Cover and P.E. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, IT-13, 1967, pp. 21-27.
[3] Q. Ding, M. Khan, A. Roy, and W. Perrizo, "The P-tree Algebra," Proceedings of the ACM Symposium on Applied Computing, pp. 426-431, 2002.
[4] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann, San Francisco, CA, 2001.
[5] J.A. Hartigan, "Clustering Algorithms," John Wiley & Sons, New York, NY, 1975.
[6] M. Khan, Q. Ding, and W. Perrizo, "K-nearest Neighbor Classification on Spatial Data Stream Using P-trees," Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 517-528, Taipei, Taiwan, May 2002.
[7] E.M. Knorr and R.T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets," Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), pp. 392-403, 1998.
[8] A. Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, "Vertical Set Square Distance Based Clustering without Prior Knowledge of K," Proceedings of the 14th International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE-05), Toronto, Canada, July 20-22, 2005.
[9] A. Perera, et al., "P-tree Classification of Yeast Gene Deletion Data," SIGKDD Explorations, 4(2), pp. 108-109, 2002.
[10] I. Rahal, D. Ren, and W. Perrizo, "A Scalable Vertical Model for Mining Association Rules," Journal of Information and Knowledge Management (JIKM), Vol. 3, No. 4, pp. 317-329, 2004.
[11] W. Perrizo, "Peano Count Tree Technology," Technical Report NDSU-CSOR-TR-01-1, 2001.
[12] P-tree Application Programming Interface (P-tree API) Documentation, http://midas.cs.ndsu.nodak.edu/~datasurg/ptree/