A Vertical Distance-based Outlier Detection Method with Local Pruning

Dongmei Ren, Imad Rahal, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105, USA
1-701-231-6403
[email protected]

Kirk Scott
Computer Science
The University of Alaska Anchorage
Anchorage, AK 99508
ABSTRACT
“One person’s noise is another person’s signal”. Outlier detection
is used to clean up datasets and also to discover useful anomalies,
such as criminal activities in electronic commerce, computer
intrusion attacks, terrorist threats, agricultural pest infestations,
etc. Thus, outlier detection is critically important in the
information-based society. This paper focuses on finding outliers
in large datasets using distance-based methods. First, to speed up outlier detection, we revise Knorr and Ng's distance-based outlier definition; second, a vertical data structure, instead of traditional horizontal structures, is adopted to facilitate efficient outlier detection further. We tested our method on the National Hockey League dataset and show an order-of-magnitude speed improvement over contemporary distance-based outlier detection approaches.
Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining
General Terms
Algorithms, Performance, Experimentation, Theory.
Keywords
Outlier Detection, Distance-based, P-Tree, Pruning, Neighborhood Search
1. INTRODUCTION
Many studies that consider outlier identification as their primary objective are in statistics. Barnett and Lewis provide a comprehensive treatment, listing about 100 discordance tests for normal, exponential, Poisson, and binomial distributions [21]. The choice of an appropriate discordance test depends on a number of factors: (a) the distribution, (b) whether or not the distribution parameters, such as the mean and the variance, are known, (c) the number of expected outliers, and (d) the types of expected outliers, such as upper or lower outliers in an ordered sample [1]. Yet, despite all of these options and decisions, there is no guarantee of finding outliers, either because there may not be any test developed for a specific combination, or because no standard distribution can adequately model the observed distribution [1].

The second category of outlier studies in statistics is depth-based. Each data object is represented as a point in a k-dimensional space and is assigned a depth; outliers are more likely to be data objects with smaller depths. Many definitions of depth have been proposed. In theory, depth-based approaches can work for large values of k, where k is the number of dimensions. In practice, however, while efficient algorithms exist for k equal to 2 or 3, depth-based approaches become inefficient for large datasets when k is larger than 3, because they rely on the computation of k-dimensional convex hulls, which has a lower bound complexity of Ω(N^(k/2)), where N is the number of samples [5].
In data mining, Knorr and Ng proposed a unified definition of outliers: an object O in a dataset T is a DB(p, D) outlier if at least fraction p of the objects in T lie at a distance greater than or equal to D from O [1][2]. They show that for many discordance tests in statistics, if an object O is an outlier according to a specific discordance test, then O is also a DB(p, D) outlier for some suitably defined p and D. They proposed a cell-structure-based outlier detection algorithm, where the whole set of cells resembles a data cube, and showed that their method works well with low dimensionality. However, the method has two shortcomings. One is that it cannot achieve good performance on very large or high-dimensional datasets. The other is that it takes a global view of the dataset, making it difficult to detect some kinds of outliers in datasets with more complex structure, where data are distributed with different densities and arbitrary shapes. There are other definitions of and methods for outliers, such as density-based and clustering-based methods; we focus on distance-based outlier detection in this paper.
In this paper, we propose a vertical "by-neighbor" outlier detection method with local pruning, which detects outliers efficiently and scales well to large datasets. Our contributions are: 1) we introduce a neighborhood-based outlier definition which improves the speed of the outlier detection process; based on this definition, we propose two properties which serve as a pruning rule and a "by-neighbor" labeling rule, respectively. 2) Using the pruning and "by-neighbor" techniques, we propose a new outlier detection algorithm which works much more efficiently over large datasets than current approaches in terms of speed and scalability. 3) Our method is implemented using a vertical representation of data, named P-Trees, which speeds up our method further. Experiments show that our vertical outlier detection method outperforms current distance-based outlier detection methods significantly in terms of speed and scalability with data size.
The paper is organized as follows. Section 2 first restates Knorr and Ng's distance-based outlier definition in terms of neighborhood; next, based on this restatement, a pruning method and a "by-neighbor" labeling method are proposed. Section 3 gives an overview of the P-Tree technology¹. Our outlier-detection algorithm with local pruning is described in section 4. We experimentally demonstrate our improvements in efficiency and scalability in section 5. In section 6, we provide a theoretical complexity study of our work. Finally, we conclude the paper with possible future research extensions.
2. DEFINITION OF DISTANCE-BASED
(DB) OUTLIERS
In this section, a formal definition of outliers is given based on neighborhood. We also propose properties which can be used to prune non-outlier data points.
2.1 Definitions
Definition of a DB outlier
In [1], Knorr and Ng defined a distance-based outlier as follows:
An object O in a dataset T is a DB(p, D) outlier if at least fraction p of the objects in T lie at a distance greater than or equal to a threshold distance D from O. An equivalent statement is:
An object O in a dataset T is a DB(p, D) outlier if at most fraction (1-p) of the objects in T lie at a distance less than D from O.
We give a formal neighborhood-based outlier definition in the following paragraphs.
Definition 1 (Neighborhood)
The neighborhood of a data point O with radius r is defined as the set Nbr(O, r) = {x ∈ X | |O - x| ≤ r}, where X is the dataset, x is a point in X, and |O - x| is the distance between O and x. It is also called the r-neighborhood of O, and the points in this neighborhood are called the neighbors of O. The number of neighbors of O is denoted N(Nbr(O, r)).
Definition 2 (Outliers)
Based on the neighborhood definition, a point O is considered a DB(p, D) outlier if N(Nbr(O, D)) ≤ (1-p)*|X|, where |X| is the total size of the dataset X. We define the outlier set as the subset of points x of X with N(Nbr(x, D)) ≤ (1-p)*|X|. The outlier set is denoted as Ols(X, p, D) = {x ∈ X | N(Nbr(x, D)) ≤ (1-p)*|X|}.

¹ Patents are pending on the P-tree technology. This work was partially supported by GSA Grant ACT#: K96130308.
2.2 Properties
Property 1
The maximum distance between two points in the neighborhood Nbr(O, D) is 2*D.

Proof: Let O1 and O2 be any two points in Nbr(O, D). By the triangle inequality, Dist(O1, O2) ≤ Dist(O1, O) + Dist(O, O2) ≤ D + D = 2*D. Hence 2*D is the maximum distance between any two points in Nbr(O, D), and Property 1 holds.
Property 2
If N(Nbr(O, D/2)) ≥ (1-p)*|X|, then none of the points in Nbr(O, D/2) are outliers.

Proof: Let Q be an arbitrary point in Nbr(O, D/2). For any point P in Nbr(O, D/2), Dist(Q, P) ≤ Dist(Q, O) + Dist(O, P) ≤ D/2 + D/2 = D, so Q's D-neighborhood Nbr(Q, D) completely contains Nbr(O, D/2). Since N(Nbr(Q, D)) ≥ N(Nbr(O, D/2)) ≥ (1-p)*|X|, we conclude that Q is not an outlier according to our outlier definition.

Property 2 can be used as a pruning rule: all points in Nbr(O, D/2) can be pruned off at once. This pruning makes the detection process more efficient, as we shall demonstrate in our experiments. Figure 1 shows the pruning pictorially; the entire Nbr(O, D/2) is pruned.
Figure 1. Pruning Nbr(O, D/2)
Property 3
If N(Nbr(O, 2D)) < (1-p)*|X|, then all the points in Nbr(O, D) are outliers.

Proof: For any point Q in Nbr(O, D) and any point P in Nbr(Q, D), Dist(O, P) ≤ Dist(O, Q) + Dist(Q, P) ≤ D + D = 2D, so Nbr(Q, D) ⊆ Nbr(O, 2D); that is, the D-neighborhood of Q is a subset of the 2D-neighborhood of O. From set theory, N(Nbr(Q, D)) ≤ N(Nbr(O, 2D)) < (1-p)*|X|. According to the aforementioned outlier definition, all the points in Nbr(O, D) are outliers.

According to Property 3, all points in Nbr(O, D) can be labeled as outliers at once. Property 3 therefore provides a powerful means to label outliers by neighborhood, instead of point by point, which speeds up the outlier detection process significantly. Figure 2 shows the "by-neighbor" process pictorially: all the points in Nbr(O, D) are labeled as outliers without undergoing further processing.
Figure 2. Labeling outliers by-neighbor
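The two properties can also be checked numerically. The sketch below, under the same naive-counting assumptions as the earlier snippet (Euclidean distance, brute-force counts; none of the paper's P-tree machinery), asserts both implications on any dataset it is given.

    import numpy as np

    def n_nbr(X, o, r):
        """N(Nbr(o, r)): how many points of X lie within distance r of o."""
        return int(np.sum(np.linalg.norm(X - o, axis=1) <= r))

    def check_properties(X, o, p, D):
        thresh = (1 - p) * len(X)
        # Property 2: if N(Nbr(o, D/2)) >= thresh, every q in Nbr(o, D/2)
        # already has enough D-neighbors, so none of them is an outlier.
        if n_nbr(X, o, D / 2) >= thresh:
            inner = X[np.linalg.norm(X - o, axis=1) <= D / 2]
            assert all(n_nbr(X, q, D) >= thresh for q in inner)
        # Property 3: if N(Nbr(o, 2*D)) < thresh, every q in Nbr(o, D)
        # has too few D-neighbors, so all of them are outliers.
        if n_nbr(X, o, 2 * D) < thresh:
            inner = X[np.linalg.norm(X - o, axis=1) <= D]
            assert all(n_nbr(X, q, D) < thresh for q in inner)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    for o in X:
        check_properties(X, o, p=0.95, D=0.5)  # no assertion ever fires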
3. REVIEW OF P-TREES
Most data mining algorithms assume that the data being mined
have some sort of structure such as relational tables in databases
or data cubes in data warehouses [13]. Traditionally, data are
represented horizontally and processed tuple by tuple (i.e. row by
row) in the database and data mining areas. The traditional
horizontally oriented record structures are known to scale poorly
to very large data sets.
In our previous work, we proposed a novel vertical data structure,
the P-Tree. In the P-Tree approach, we decompose attributes of
relational tables into separate files by bit position and compress
the vertical bit files using a data-mining-ready structure called the
P-tree. Instead of processing horizontal data vertically, we process
these vertical P-trees horizontally through fast logical operations.
Since P-trees markedly compress the data and the P-tree logical
operations scale extremely well, this common vertical data
structure approach has the potential of addressing the curse of
non-scalability with respect to size. A number of papers proposing P-Tree-based data-mining algorithms have been published, which explore and experimentally demonstrate that P-trees facilitate efficient data mining over large datasets.
In this section, we briefly review some useful features (which will
be used in this paper) of the P-Tree, including its optimized
logical operations.
3.1 Construction of P-Trees
Given a data set with d attributes, X = (A1, A2, ..., Ad), and the binary representation of the jth attribute Aj as bj.m bj.m-1 ... bj.i ... bj.1 bj.0, we decompose each attribute into bit files, one file for each bit position [15]. To build a P-tree, a bit file is recursively partitioned into halves, and each half into sub-halves, until each sub-half is pure, i.e., consists entirely of 1-bits or entirely of 0-bits.
The detailed construction of P-trees is illustrated by an example in Figure 3. For simplicity, assume each transaction has one attribute. We represent the attribute in binary, e.g., (7)10 = (111)2. We then vertically decompose the attribute into three separate bit files, as shown in b). The corresponding basic P-trees, P1, P2 and P3, are constructed from these bit files and shown in c), d) and e).

Figure 3. Construction of P-trees

As shown in c) of Figure 3, the root value, also called the root count, of the P1 tree is 3, which is the count of 1s in the entire bit file. The second level of P1 contains the counts of 1s in the two halves separately, which are 0 and 3.

3.2 P-Tree Operations
AND, OR and NOT logic operations are the most frequently used P-tree operations. For efficient implementation, we use a variation of P-trees, called Pure-1 trees (P1-trees). A tree is pure-1, denoted as p1, if all the values in the sub-tree are 1's. Figure 4 shows the P1-trees corresponding to the P-trees in c), d), and e) of Figure 3. Figure 5 shows the results of the AND (a), OR (b) and NOT (c) operations on P-trees.

Figure 4. P1-trees for the transaction set in Figure 3

Figure 5. AND, OR and NOT operations
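As a concrete illustration, the following Python sketch performs the vertical bit decomposition and computes root counts by recursive halving with early termination on pure sub-halves. It stores bit files as plain lists rather than compressed trees, and the transaction values are hypothetical, chosen only so that the first bit file reproduces the counts described for Figure 3 (root count 3; halves 0 and 3).

    def bit_columns(values, nbits):
        """Vertically decompose integers into nbits bit files, MSB first."""
        return [[(v >> (nbits - 1 - i)) & 1 for v in values] for i in range(nbits)]

    def root_count(bits):
        """Count of 1s, computed the way a P-tree would: recursively halve
        the bit file, stopping early when a sub-half is pure (all 0s or 1s)."""
        if all(b == bits[0] for b in bits):  # pure sub-half: no further split
            return sum(bits)
        half = len(bits) // 2
        return root_count(bits[:half]) + root_count(bits[half:])

    def p_and(a, b):
        """AND of two (uncompressed) bit files; real P-trees AND tree nodes."""
        return [x & y for x, y in zip(a, b)]

    values = [2, 3, 3, 1, 7, 7, 7, 0]      # hypothetical 3-bit transactions
    P1, P2, P3 = bit_columns(values, 3)    # three vertical bit files
    print(root_count(P1))                  # 3: the halves contribute 0 and 3
    print(root_count(p_and(P1, P2)))       # root count of P1 AND P2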
3.3 Predicate P-Trees
There are many variants of predicate P-Trees, such as value P-Trees, tuple P-Trees, mask P-Trees, etc. In this section, we describe inequality P-Trees, which will be used to search for neighbors in section 4.2.

An inequality P-tree represents the data points within a data set X that satisfy an inequality predicate, such as x > v or x < v. Without loss of generality, we discuss the two inequality P-trees Px≥v and Px≤v, which are calculated as follows.

Calculation of Px≥v: Let x be an m-bit data point within a data set X, and let Pm, Pm-1, ..., P0 be the P-trees for the vertical bit files of X. Let v = bm...bi...b0, where bi is the ith binary bit of v, and let Px≥v be the predicate tree for the predicate x ≥ v. Then Px≥v = Pm opm ... Pi opi Pi-1 ... op1 P0, i = 0, 1, ..., m, where:

1) opi is AND if bi = 1, and opi is OR if bi = 0;
2) the operators are right binding, meaning they associate from right to left, e.g., P2 op2 P1 op1 P0 is equivalent to (P2 op2 (P1 op1 P0)). For example, the inequality tree Px≥101 = (P2 AND (P1 OR P0)).

Calculation of Px≤v: The calculation of Px≤v is similar to that of Px≥v. Let x be an m-bit data point within a data set X, and let P'm, P'm-1, ..., P'0 be the complement P-trees for the vertical bit files of X. Let v = bm...bi...b0, where bi is the ith binary bit of v, and let Px≤v be the predicate tree for the predicate x ≤ v. Then Px≤v = P'm opm ... P'i opi P'i-1 ... opk+1 P'k, k ≤ i ≤ m, where:

1) opi is AND if bi = 0, and opi is OR if bi = 1;
2) k is the rightmost bit position with value 0, i.e., bk = 0 and bj = 1 for all j < k;
3) the operators are right binding. For example, the inequality tree Px≤101 = (P'2 OR P'1).
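The Px≥v recurrence can be exercised directly on uncompressed bit files packed into Python integers (bit j of P[i] holds bit i of point j). One caveat: as stated, the fold runs down to P0 even when the low bits of v are 0, which makes the base case vacuous; the sketch below therefore starts the fold at the rightmost 1-bit of v, mirroring the k cutoff of the Px≤v rule. That starting point is our reading, not something stated explicitly in the text.

    def p_ge(P, v, m):
        """Bit mask of points x with x >= v, for (m+1)-bit values and v > 0.

        Right-binding fold of the recurrence Pm op_m ... op_1 P0:
        op_i is AND where bit i of v is 1, OR where it is 0.
        """
        k = next(i for i in range(m + 1) if (v >> i) & 1)  # rightmost 1-bit of v
        acc = P[k]
        for i in range(k + 1, m + 1):
            acc = (acc & P[i]) if (v >> i) & 1 else (acc | P[i])
        return acc

    # 3-bit values of 8 points; P[i] packs bit i of every point into one integer.
    values = [5, 1, 7, 2, 6, 0, 3, 4]
    P = [sum(((x >> i) & 1) << j for j, x in enumerate(values)) for i in range(3)]
    mask = p_ge(P, 0b101, m=2)             # Px>=101 = P2 AND (P1 OR P0)
    print([j for j in range(8) if (mask >> j) & 1])  # -> [0, 2, 4] (values 5, 7, 6)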
4. A VERTICAL OUTLIER DETECTION METHOD WITH LOCAL PRUNING
In section 4.1, we propose a "by-neighbor" outlier detection method with local pruning based on Property 2 and Property 3 from section 2. In section 4.2, the method is implemented using the P-Tree data representation. Inequality P-Trees and the optimized logical operations on P-Trees make the neighborhood search significantly more efficient, which further speeds up the outlier detection process.

4.1 The "By-neighbor" Outlier Detection Algorithm with Local Pruning
The "by-neighbor" outlier detection algorithm is based on our neighborhood-based outlier definition together with Properties 2 and 3. The algorithm is described as follows; a runnable sketch of these steps appears after Figure 6.

Step 1: First, select one point O arbitrarily from the dataset X.

Step 2: Second, search for the D-neighborhood of O and calculate the number of neighbors, N(Nbr(O, D)).

In case N(Nbr(O, D)) ≥ (1-p)*|X|, where |X| is the total size of the dataset, reduce the radius from D to D/2 and calculate N(Nbr(O, D/2)). If N(Nbr(O, D/2)) ≥ (1-p)*|X|, then by Property 2 none of the D/2-neighbors are outliers. Those points need not undergo further testing, which is one of the sources of our algorithm's efficiency. In Figure 6, all points in the small gray neighborhood are pruned off.

In case N(Nbr(O, D)) < (1-p)*|X|, the algorithm expands the neighborhood radius from D to 2D and computes N(Nbr(O, 2D)). If N(Nbr(O, 2D)) < (1-p)*|X|, then by Property 3 all the D-neighbors can be claimed as outliers. Those points are inserted into the outlier set and need not undergo further detection either, which contributes to the speed improvement as well; otherwise, only point O is an outlier. In Figure 6, all points in the large gray neighborhood are outliers.

Step 3: The process is repeated until all points in the dataset X have been examined.

Figure 6. "By-neighbor" outlier detection with local pruning
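The sketch below is a direct Python rendering of Steps 1-3. It substitutes naive distance computations for the P-tree neighborhood search of section 4.2, so it shows the control flow (pruning and by-neighbor labeling) rather than the vertical implementation; all names are ours.

    import numpy as np

    def by_neighbor_outliers(X, p, D):
        """Detect DB(p, D) outliers with local pruning (Properties 2 and 3)."""
        thresh = (1 - p) * len(X)
        unprocessed = set(range(len(X)))
        outliers = set()
        while unprocessed:
            o = unprocessed.pop()                  # Step 1: arbitrary point
            d = np.linalg.norm(X - X[o], axis=1)   # Step 2: distances from o
            if np.sum(d <= D) > thresh:            # o is not an outlier
                if np.sum(d <= D / 2) >= thresh:   # Property 2: prune D/2-nbrs
                    unprocessed -= set(np.where(d <= D / 2)[0].tolist())
            else:                                  # o is an outlier
                outliers.add(o)
                if np.sum(d <= 2 * D) < thresh:    # Property 3: label D-nbrs
                    inner = set(np.where(d <= D)[0].tolist())
                    outliers |= inner
                    unprocessed -= inner
        return outliers                            # Step 3: loop until done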
4.2 Vertical Approach Using P-Trees
The algorithm is implemented using the vertical data model, P-Trees. By using P-Trees, the above outlier detection process can be made even faster. The speed comes from: 1) in the P-Tree-based approach, the neighborhood search is fast because it is conducted by inequality P-Tree ANDing; 2) the computation of the number of neighbors is efficient because it is done by extracting the value of the root node of the neighborhood P-Tree; and 3) the algorithm prunes points by neighborhood using the optimized P-Tree AND operation, which accelerates the process significantly.
The vertical method works as follows. First, the dataset to be mined is represented as a set of P-Trees. Second, one point O in the dataset is selected arbitrarily; the D-neighbors are then searched using the fast computation of inequality P-Trees, and are represented by an inequality P-Tree called the neighborhood P-Tree. In the neighborhood P-Tree, '1' means the point is a neighbor of O, while '0' means it is not. Third, the number of D-neighbors is calculated efficiently by extracting the value of the root node of the neighborhood P-Tree [15]. Finally, the pruning or the "by-neighbor" outlier labeling is performed. The pruning is executed efficiently by ANDing the unprocessed P-Tree, which represents the unprocessed dataset, with the complement of the D/2-neighborhood P-Tree, so that the pruned points are masked off. The "by-neighbor" outlier labeling is executed by ORing the D-neighborhood P-Tree into the outlier P-Tree, a P-tree representing the outlier set. By searching for neighbors with inequality P-Trees and pruning with the P-Tree AND operation, the vertical approach improves the outlier-detection process dramatically in terms of speed. A sketch of the neighborhood-mask idea follows; the formal algorithm is shown in Figure 7.
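For a single attribute, the D-neighborhood of O is the set of x with O - D ≤ x ≤ O + D, i.e., the AND of two inequality masks. The helper below reuses the p_ge sketch from section 3.3 and, for brevity, derives x ≤ hi as NOT(x ≥ hi + 1) instead of building Px≤v from complement P-trees as the paper does; the root count is read directly off the resulting mask.

    def neighborhood_mask(P, o_val, D, m, n_points):
        """Mask of points x with o_val - D <= x <= o_val + D, and its root count."""
        full = (1 << n_points) - 1          # mask with every point set
        top = (1 << (m + 1)) - 1            # largest (m+1)-bit value
        lo, hi = max(o_val - D, 0), min(o_val + D, top)
        ge_lo = p_ge(P, lo, m) if lo > 0 else full
        le_hi = full if hi == top else full ^ p_ge(P, hi + 1, m)  # NOT(x >= hi+1)
        mask = ge_lo & le_hi                # inequality P-tree ANDing
        return mask, bin(mask).count("1")   # '1' bits mark the D-neighbors

    values = [5, 1, 7, 2, 6, 0, 3, 4]
    P = [sum(((x >> i) & 1) << j for j, x in enumerate(values)) for i in range(3)]
    mask, count = neighborhood_mask(P, o_val=5, D=1, m=2, n_points=8)
    print(count, [j for j in range(8) if (mask >> j) & 1])  # 3 neighbors: 5, 6, 4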
Algorithm: "Vertical Outlier Detection with Local Pruning"
Input: D: distance threshold, f: outlier fraction, T: dataset
Output: Ols: outlier set
// N: total number of points in dataset T
// PT: P-tree represented dataset
// PU: P-tree represented unprocessed dataset
// PN: P-tree represented neighborhood
// PO: P-tree represented outlier set
// PNO: P-tree represented non-outlier (pruned) set

// Build up the P-Tree set for T
PT ← T;
PU ← PT;
WHILE (PU.size() > 0)
{
  select an arbitrary point P from PU;
  PU ← PU ∩ NOT(P);                // mark P as processed
  PN ← findNeighborhood(P, D);
  m ← PN.rootCount();              // retrieve the value of the root node
  IF m > (1-f) * N                 // P is not an outlier
    PN ← findNeighborhood(P, D/2);
    m ← PN.rootCount();
    IF m > (1-f) * N               // Property 2: prune all D/2-neighbors
      PNO ← PNO ∪ PN;              // ∪ is the OR operation on P-trees
      PU ← PU ∩ NOT(PNO);          // pruning: mask off non-outliers
    ENDIF
  ELSE                             // P is an outlier
    PO ← PO ∪ P;
    PN ← findNeighborhood(P, 2D);
    m ← PN.rootCount();
    IF m < (1-f) * N               // Property 3: label all D-neighbors
      PO ← PO ∪ findNeighborhood(P, D);
      PU ← PU ∩ NOT(PO);
    ENDIF
  ENDIF
}
ENDWHILE

Figure 7. Pseudo code of the algorithm
5. EXPERIMENTAL DEMONSTRATION
In this section, we design an experiment to compare our method
with Knorr and Ng’s nested loop approach (denoted as NL in
figures 8 and 9) [2]. We implemented a P-Tree-based outlier detection method without pruning, named PODM, and a P-Tree-based outlier detection method with pruning, named PODMP, as discussed in detail in section 4. Our purpose is to show our approach's efficiency and scalability.
We ran the experiment on a 1400-MHz AMD machine with 1GB of main memory, running Debian Linux version 4.0. We tested the algorithms on the National Hockey League (NHL) 1996 dataset.
To show how our algorithm scales as data size increases, the
dataset is divided into five groups with increasing size, where
larger data groups contain repeating records. Figure 8
demonstrates that the two P-Tree-based methods show a significant improvement in speed compared to Knorr and Ng's nested loop method. It also shows that, by using pruning and "by-neighbor" labeling, PODMP benefits from additional speed improvements; it has an order-of-magnitude speed improvement over the nested loop approach. As for scalability, PODMP is the best of the three: it scales well with increasing dataset size. Figure 9 shows this pictorially.
Figure 8. Comparison of NL, PODM, and PODMP (run time vs. data size, 256 to 65536 records)
Figure 9. Comparison of scalability of NL, PODM, and PODMP (run time vs. data size)
6. COMPUTATIONAL COMPLEXITY
ANALYSIS
In this section, we analyze our vertical outlier detection method and compare it with current methods with respect to computational complexity. For our method, the worst case is O(kN), where k is the cost of the P-Tree operations, which varies with the number of P-Trees, and N is the total number of points in the dataset. The best case is O(k); it occurs when we prune all the points in the dataset after processing the first point. With pruning, the complexity can be described as O(kM), where M ≤ N; in the worst case, M = N. Table 1 shows the comparison.
Table 1. Complexity comparison

Algorithm      | Complexity
-------------- | ------------------------------------------
Nested-loop    | O(N^2)
Tree-indexed   | O(N log N)
Cell-based     | linear in N, exponential in d (dimension)
Our approach   | O(kN), where k is much smaller than log N
7. CONCLUSION AND FUTURE WORK
Outlier detection is becoming critically important in many areas, such as monitoring criminal activity in electronic commerce and credit card fraud. In this paper, we propose a neighborhood-based outlier definition, a pruning rule, and a "by-neighbor" labeling technique, based on which a novel outlier detection algorithm is proposed. Pruning and "by-neighbor" labeling speed up the outlier detection process significantly. Our method is implemented using a vertical data representation, the P-Tree, instead of traditional horizontal structures, which speeds up our approach further. Experiments show that our method has an order-of-magnitude speed improvement over current distance-based outlier detection approaches, and that the scalability of our algorithm outperforms state-of-the-art approaches as well. The complexity analysis clearly shows the reason for these observed performance improvements.

In future work, we will explore the use of P-Trees in density-based outlier detection. The encouraging results shown in this paper demonstrate great potential for improving the efficiency of density-based outlier detection as well.
8. REFERENCES
[1] Knorr, Edwin M. and Raymond T. Ng. A Unified Notion of Outliers: Properties and Computation. 3rd International Conference on Knowledge Discovery and Data Mining Proceedings, 1997, pp. 219-222.
[2] Knorr, Edwin M. and Raymond T. Ng. Algorithms for Mining Distance-Based Outliers in Large Datasets. Very Large Data Bases Conference Proceedings, 1998, pp. 24-27.
[3] Knorr, Edwin M. and Raymond T. Ng. Finding Intensional Knowledge of Distance-Based Outliers. Very Large Data Bases Conference Proceedings, 1999, pp. 211-222.
[4] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient Algorithms for Mining Outliers from Large Data Sets. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, ISSN 0163-5808.
[5] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. OPTICS-OF: Identifying Local Outliers. Proceedings of PKDD '99, Prague, Czech Republic, Lecture Notes in Computer Science (LNAI 1704), Springer Verlag, 1999, pp. 262-270.
[6] Arning, Andreas, Rakesh Agrawal, and Prabhakar Raghavan. A Linear Method for Deviation Detection in Large Databases. 2nd International Conference on Knowledge Discovery and Data Mining Proceedings, 1996, pp. 164-169.
[7] Bowman, B., Debray, S. K., and Peterson, L. L. Reasoning about Naming Systems. ACM Trans. Program. Lang. Syst., 15, 5 (Nov. 1993), 795-825.
[8] Ding, W., and Marchionini, G. A Study on Video Browsing Strategies. Technical Report UMIACS-TR-97-40, University of Maryland, College Park, MD, 1997.
[9] Fröhlich, B. and Plate, J. The Cubic Mouse: A New Device for Three-Dimensional Input. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '00) (The Hague, The Netherlands, April 1-6, 2000). ACM Press, New York, NY, 2000, 526-531.
[10] Lamport, L. LaTeX User's Guide and Document Reference Manual. Addison-Wesley, Reading, MA, 1986.
[11] Sannella, M. J. Constraint Satisfaction and Debugging for Interactive User Interfaces. Ph.D. Thesis, University of Washington, Seattle, WA, 1994.
[12] S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-Driven Exploration of OLAP Data Cubes. EDBT '98.
[13] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
[14] Shashi Shekhar, Chang-Tien Lu, and Pusheng Zhang. Detecting Graph-Based Spatial Outliers. Intelligent Data Analysis 6 (2002) 451-468, IOS Press.
[15] Q. Ding, M. Khan, A. Roy, and W. Perrizo. The P-tree Algebra. Proceedings of the ACM SAC, Symposium on Applied Computing, 2002.
[16] M. Khan, Q. Ding, and W. Perrizo. K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees. Proceedings of PAKDD 2002, Springer-Verlag, Lecture Notes in Artificial Intelligence 2336, May 2002, pp. 517-528.
[17] Pan, F., Wang, B., Ren, D., Hu, X., and Perrizo, W. Proximal Support Vector Machine for Spatial Data Using Peano Trees. CAINE 2003.
[18] Imad Rahal and William Perrizo. An Optimized Approach for KNN Text Categorization Using P-trees. Proceedings of the ACM SAC, Symposium on Applied Computing (Nicosia, Cyprus), March 2004.
[19] Fei Pan, Imad Rahal, Yue Cui, and William Perrizo. Efficient Ranking of Keyword Queries Using P-Trees. 19th International Conference on Computers and Their Applications, pp. 278-281, 2004.
[20] Maum Serazi, Amal Perera, Qiang Ding, Vasiliy Malakhov, Imad Rahal, Fen Pan, Dongmei Ren, Weihua Wu, and William Perrizo. "DataMIME™". ACM SIGMOD, Paris, France, June 2004.
[21] V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons.