Derive High Confidence Rules for Spatial Data using
Tuple Count Cube
William Perrizo¹, Qin Ding¹, Qiang Ding¹, and Amalendu Roy¹
¹ Department of Computer Science, North Dakota State University,
Fargo, ND 58105-5164, USA
{William_Perrizo, Qin_Ding, Qiang_Ding, Amalendu_Roy}@ndsu.nodak.edu
Abstract. The traditional task of association rule mining is to find all rules
with high support and high confidence. In some applications, such as
mining spatial datasets for natural resources, the task is to find high
confidence rules even though their supports may be low. In still other
applications, such as the identification of agricultural pest infestations, the
task is to find high confidence rules preferably while the support is still
very low. The basic Apriori algorithm cannot be used to solve these
problems efficiently. In this paper, we propose a new model to derive high
confidence rules for spatial data. A new data structure, the Peano Count
Tree (PC-tree), is used in our model to represent all the information we
need. PC-trees represent spatial data bit-by-bit in a recursive quadrant-by-quadrant arrangement. Based on the PC-tree, we build a special data cube,
the Tuple Count Cube (TC-cube), to derive high confidence rules. Our
algorithm for deriving confident rules is fast and efficient. In addition, we
discuss some strategies for avoiding over-fitting (removing redundant and
misleading rules).
1 Introduction
Association rule mining [1,2,3,4,5], proposed by Agrawal, Imielinski and Swami in
1993, is one of the important tasks of data mining. The original application of
association rule mining is on market basket data. A typical example is “customers
who purchase one item are very likely to purchase another item at the same time”.
There are two accuracy measures, support and confidence, for each rule. The problem
of association rule mining is to find all the rules with support and confidence
exceeding some user-specified thresholds. The basic algorithms, such as Apriori [1]
and DHP [4], use the downward closure property of support to find frequent itemsets,
whose supports are above the threshold. After obtaining all frequent itemsets, which is
very time consuming, high confidence rules are derived in a very straightforward way.
However, in some applications, such as spatial data mining, we are also interested
in rules with high confidence that do not necessarily have high support. In still other
applications, such as the identification of agricultural pest infestations, the task is to
find high confidence rules preferably while the support is still very low. In these
cases, the traditional algorithms are not suitable. One might think we could simply set the minimum support to a very low value, so that high confidence rules with almost no support limit can be derived. However, this would lead to a huge number of frequent itemsets and is, thus, impractical.
In this paper, we propose a new model, including new data structures and
algorithms, to derive “confident” rules (high confidence only rules), especially for
spatial data. We use a data structure, called the Peano Count Tree (PC-tree), to store all the information we need. A PC-tree is a quadrant-based count tree. From the PC-trees, we build a data cube, the Tuple Count Cube or TC-cube, which exposes the
confident rules. We also use the attribute precision concept hierarchies and a natural
rule ranking to prune the complexity of our data mining algorithm.
The rest of the paper is organized as follows. In Section 2, we provide some background on spatial data. In Section 3, we describe the data structures we use for association rule mining, including PC-trees and TC-cubes. In Section 4, we detail our algorithms for deriving confident rules. Performance analysis and implementation issues are given in Section 5, followed by related work in Section 6. Finally, the conclusion is given.
2 Formats of Spatial Data
There are huge amounts of spatial data on which we can perform data mining to obtain useful information [16]. Spatial data are collected in different ways and are organized in different formats. BSQ, BIL and BIP are three typical formats.
An image contains several bands. For example, a TM6 (Thematic Mapper) scene contains 6 bands, while a TM7 scene contains 7 bands (Blue, Green, Red, NIR, MIR, TIR, MIR2), each of which contains reflectance values in the range 0~255.
An image can be organized into a relational table in which each pixel is a tuple and
each spectral band is an attribute. The primary key can be latitude and longitude pairs
which uniquely identify the pixels.
BSQ (Band Sequential) is a similar format, in which each band is stored as a
separate file. Raster order is used for each individual band. TM scenes are in BSQ
format. BIL (Band Interleaved by Line) is another format in which all the bands are
organized in one file and bands are interleaved by row (the first row of all bands is
followed by the second row of all bands, and so on). For example, SPOT data from
French satellites are in BIL format. In the BIP (Band Interleaved by Pixel) format,
there is also just one file in which the first pixel-value of the first band is followed by
the first pixel-value of the second band, ..., the first pixel-value of the last band,
followed by the second pixel-value of the first band, and so on. For example, TIFF
images are in BIP format. Fig. 1 gives an example of using BSQ, BIL and BIP
formats.
In this paper, we propose a new format, called bSQ (bit Sequential), to organize
images. The reflectance values of each band range from 0 to 255, represented as 8
bits. We split each band into a separate file for each bit position. Fig. 1 also gives an
example of bSQ format.
There are several reasons to use the bSQ format. First, different bits have different
degrees of contribution to the value. In some applications, we do not need all the bits
because the high order bits give us enough information. Second, the bSQ format
facilitates the representation of a precision hierarchy. Third, and most importantly, the bSQ format facilitates the creation of an efficient, rich data structure, the PC-tree, and accommodates algorithm pruning based on a one-bit-at-a-time approach.
We give a very simple illustrative example with only 2 data bands for a scene having only 2 rows and 2 columns (both decimal and binary representations are shown).
BAND-1                                 BAND-2
254 (1111 1110)   127 (0111 1111)       37 (0010 0101)   240 (1111 0000)
 14 (0000 1110)   193 (1100 0001)      200 (1100 1000)    19 (0001 0011)

bSQ format (16 files, one per bit position, pixels in raster order):
B11: 1001   B12: 1101   B13: 1100   B14: 1100
B15: 1110   B16: 1110   B17: 1110   B18: 0101
B21: 0110   B22: 0110   B23: 1100   B24: 0101
B25: 0010   B26: 1000   B27: 0001   B28: 1001

BSQ format (2 files)
Band 1: 254 127 14 193
Band 2: 37 240 200 19

BIL format (1 file)
254 127 37 240
14 193 200 19

BIP format (1 file)
254 37 127 240
14 200 193 19

Fig. 1. Two bands of a 2-row-2-column image and its BSQ, BIP, BIL and bSQ formats
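To make the bSQ split concrete, the following is a minimal Python sketch (the language, function name and file-naming scheme are our illustrative choices, not something prescribed by the format itself):

    def to_bsq(band_values, band_no):
        # band_values: 8-bit reflectance values in raster order.
        # One output file per bit position; bit 0 is the most significant.
        for bit in range(8):
            bits = [(v >> (7 - bit)) & 1 for v in band_values]
            with open("B%d%d.bsq" % (band_no, bit + 1), "w") as f:
                f.write("".join(str(b) for b in bits))

    to_bsq([254, 127, 14, 193], band_no=1)   # writes B11.bsq = "1001", etc., as in Fig. 1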
3
3.1
Data Structures
Basic PC-trees
We organize each bit file in the bSQ format into a tree structure, called a Peano Count
Tree (PC-tree). A PC-tree is a quadrant based tree. The idea is to recursively divide
the entire image into quadrants and record the count of 1-bits for each quadrant, thus
forming a quadrant count tree. PC-trees are somewhat similar in construction to other
data structures in the literature (e.g., Quadtrees [10] and HHcodes [14]).
For example, given an 8-row-8-column image, the PC-tree is as shown in Fig. 2.
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

PC-tree:                               PM-tree:
             55                                     m
   _____/  /    \  \_____                 _____/  /   \  \_____
  /       /      \       \               /       /     \       \
 16    __8__    _15__     16            1     __m__    _m__     1
      / / \ \   / / \ \                      / / \ \   / / \ \
     3  0 4  1 4  4 3  4                    m  0 1  m 1  1 m  1
   //|\   //|\      //|\                  //|\   //|\      //|\
   1110   0010      1101                  1110   0010      1101

Fig. 2. 8*8 image and its PC-tree (PC-tree and PM-tree)
In this example, 55 is the number of 1's in the entire image. This root level is labeled level 0. The numbers at the next level (level 1), 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are composed entirely of 1-bits (called "pure-1 quadrants"), we do not need subtrees for these two quadrants, so these branches terminate. Similarly, quadrants composed entirely of 0-bits are called "pure-0 quadrants" and also terminate their branches. This pattern continues recursively using the Peano or Z-ordering of the four subquadrants at each new level. Every branch terminates eventually (at the "leaf" level, each quadrant is a pure quadrant). If we were to expand all subtrees, including those for pure quadrants, the leaf sequence would be just the Peano ordering (or Z-ordering) of the original raster image. Thus, we use the name Peano Count Tree.
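The construction just described can be sketched in a few lines of Python (a toy version, assuming the image is a square 0/1 array with a power-of-2 side and fan-out 4; the compressed storage formats discussed later are not modeled here):

    def build_pctree(grid, row=0, col=0, size=None):
        # Each node records the 1-bit count of its quadrant; only mixed
        # quadrants get four children, in Peano (Z) order.
        if size is None:
            size = len(grid)
        count = sum(grid[r][c] for r in range(row, row + size)
                               for c in range(col, col + size))
        node = {"count": count}
        if 0 < count < size * size:      # pure-1 and pure-0 quadrants terminate
            half = size // 2
            node["children"] = [
                build_pctree(grid, row,        col,        half),   # upper left
                build_pctree(grid, row,        col + half, half),   # upper right
                build_pctree(grid, row + half, col,        half),   # lower left
                build_pctree(grid, row + half, col + half, half),   # lower right
            ]
        return node

Applied to the 8*8 image of Fig. 2, the root node has count 55 and children with counts 16, 8, 15 and 16.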
We note that the fan-out of the PC-tree need not be limited to 4. It can be any power of 4 (effectively skipping that number of levels in the tree). Also, the fan-out at any one level need not coincide with the fan-out at another level. The fan-out pattern can be chosen to produce maximum compression for each bSQ file.
For each band (assuming 8-bit data values), we get 8 basic PC-trees, one for each bit position. For band B1, we label the basic PC-trees P11, P12, …, P18. Pij is a lossless representation of the jth bits of the values from the ith band. In addition, Pij provides the 1-bit count for every quadrant of every dimension. Finally, we note that these PC-trees can be generated quite quickly and can be viewed as a "data mining ready", lossless format for storing spatial data.
The 8 basic PC-Trees defined above can be combined using simple logical
operations (AND, NOT, OR, COMPLEMENT) to produce PC-Trees for the original
values in a band (at any level of precision, 1-bit precision, 2-bit precision, etc.). We
let Pb,v denote the Peano Count Tree for band, b, and value, v, where v can be
expressed in 1-bit, 2-bit,.., or 8-bit precision. Pb,v is called a value PC-tree. Using the
full 8-bit precision (all 8 bits) for values, the value PC-tree Pb,11010011 can be constructed
from the basic PC-trees by ANDing basic PC-trees (for each 1-bit) and their
complements (for each 0 bit):
PCb,11010011 = PCb1 AND PCb2 AND PCb3’ AND PCb4 AND PCb5’ AND PCb6’ AND PCb7 AND PCb8
where ‘ indicates the bit-complement (which is simply the PC-tree with each count
replaced by its count complement in each quadrant).
From value PC-trees, we can construct tuple PC-trees. Tuple PC-tree for tuple
(v1,v2,…,vn), denoted PC (v1, v2, …, vn), is:
PC(v1,v2,…,vn) = PC1,v1 AND PC2,v2 AND … AND PCn,vn
where n is the total number of bands.
Basic (bit) PC-trees (i.e., P11, P12, …, P21, …, P88)
        |  AND
Value PC-trees (i.e., P1,001)
        |  AND
Tuple PC-trees (i.e., P001,010,111,011,001,110,011,101)
Fig. 3. Basic PC-trees, Value PC-trees (for 3-bit values) and Tuple PC-trees
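The composition of value and tuple PC-trees can be sketched as follows (operating on uncompressed bit lists rather than compressed trees, so that the AND/COMPLEMENT logic is visible; an illustration of the logic, not the paper's storage format):

    def value_ptree(basic, value, nbits):
        # basic[j] is the bit list of basic PC-tree j of this band (j = 0 is the MSB).
        # AND the basic tree for each 1-bit of 'value' and its complement for each 0-bit.
        result = [1] * len(basic[0])
        for j in range(nbits):
            bit = (value >> (nbits - 1 - j)) & 1
            result = [r & (b if bit else 1 - b) for r, b in zip(result, basic[j])]
        return result

    def tuple_ptree(value_ptrees):
        # AND the value PC-trees of all n bands; sum(result) is the root count.
        result = value_ptrees[0]
        for p in value_ptrees[1:]:
            result = [r & q for r, q in zip(result, p)]
        return result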
The AND operation is simply the pixel-wise AND of the bits. Before going
further, we note that the process of converting the BSQ data for a TM satellite image
(approximately 60 million pixels) to its basic PC-trees can be done in just a few
seconds using a high-performance PC. This is a one-time process. We also
note that we are storing the basic PC-trees in a “breadth-first” data structure which
specifies the pure-1 quadrants only. Using this data structure, each AND can be
completed in a few milliseconds and the result counts can be accumulated easily once
the AND and COMPLEMENT program has completed.
3.2 Variations of PC-tree
In order to optimize the AND operation, we use a variation of the PC-tree, called the PM-tree (Pure Mask tree). In the PM-tree, we use 3-value logic to represent pure-1, pure-0 and mixed quadrants. To simplify the exposition, we use 1 for pure-1, 0 for pure-0, and m for mixed quadrants. Thus, the PM-tree for the previous example is also given in Fig. 2.
The PM-tree specifies the location of the pure-1 quadrants of the operands. A pure-1 quadrant of the AND result can be identified easily as a coincidence of pure-1 quadrants in both operands, and a pure-0 quadrant of the AND result occurs wherever a pure-0 quadrant occurs in at least one of the operands.
3.3 Value Concept Hierarchy
Using the bSQ format, we can easily represent the value concept hierarchy of spatial data.
For example, for band n, we can use from 1 bit up to 8 bits to represent the
reflectances (Fig. 4).
1 bit:   [0,1): 0~127                          [1,1]: 128~255
2 bits:  [00,01): 0~63      [01,10): 64~127    [10,11): 128~191    [11,11]: 192~255
3 bits:  [000,001): 0~31    [001,010): 32~63   [010,011): 64~95    [011,100): 96~127
         [100,101): 128~159 [101,110): 160~191 [110,111): 192~223  [111,111]: 224~255
Fig. 4. Value Concept Hierarchy
3.4 Tuple Count Cube
For most spatial data mining, the root counts of the tuple PC-trees (e.g., PC(v1,v2,…,vn) =
PC1,v1 AND PC2,v2 AND … AND PCn,vn), are the numbers required, since root counts
tell us exactly the number of occurrences of that particular pattern over the space in
question. These root counts can be inserted into a data cube, called the Tuple Count
cube (TC-cube) of the spatial dataset. Each band corresponds to a dimension of the
cube, the band values labeling that dimension. The TC-cube cell at location,
(v1,v2,…,vn), contains the root count of PC(v1,v2,…,vn). For example, assuming just 3
bands, the (v1,v2,v3)th cell of the TC-cube contains the root count of PC(v1,v2,v3) =
PC1,v1 AND PC2,v2 AND PC3,v3. The cube can be contracted or expanded by going up
[down] in the value concept hierarchy.
4 Confident Rule Mining Algorithm
4.1 PC-tree ANDing Algorithm
We begin this section with a description of the AND algorithm. This algorithm is used
to compose the value PC-trees and to populate the TC-cube. The approach is to store
only the basic PC-trees and then generate value PC-tree root counts “on-the-fly” when
needed (in Section 5 we show this can be done in about 50ms). In this algorithm we
will assume the PC-tree is coded in its most compact form, a depth-first ordering of
the paths to each pure-1 quadrant.
Let us look at Operand 1 first (Fig. 5). Each path is represented by the sequence of quadrants in Peano order, beginning just below the root. Therefore, the depth-first pure-1 path code for this example is: 0 100 101 102 12 132 20 21 220 221 223 23 3 (0 indicates that the entire level-1 upper-left quadrant is pure 1s; 100 indicates the level-3 quadrant arrived at along the branch through node 1 (2nd node) of level 1, node 0 (1st node) of level 2, and node 0 of level 3; etc.). We take the second operand (Fig. 5), with depth-first pure-1 path code: 0 20 21 22 231. Since a quadrant will be pure 1s in the result only if it is pure 1s in both operands (or all operands, in the case there are more than 2), the AND is done by scanning the operands and outputting the matching pure-1 paths. The result is shown in Fig. 5.
Operand 1:           Operand 2:           AND Result:
1 1 1 1 1 1 0 0      1 1 1 1 0 0 0 0      1 1 1 1 0 0 0 0
1 1 1 1 1 0 0 0      1 1 1 1 0 0 0 0      1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0      1 1 1 1 0 0 0 0      1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 0      1 1 1 1 0 0 0 0      1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 1      1 1 1 1 0 0 0 0      1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 1      1 1 1 1 0 0 0 0      1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 1      1 1 0 1 0 0 0 0      1 1 0 1 0 0 0 0
0 1 1 1 1 1 1 1      1 1 0 0 0 0 0 0      0 1 0 0 0 0 0 0

PC-trees (root; level-1 counts; level-2 counts; leaves in brackets):
Operand 1:  55 ( 16   8(3[1110] 0 4 1[0010])   15(4 4 3[1101] 4)   16 )
Operand 2:  29 ( 16   0   13(4 4 4 1[0100])   0 )
AND Result: 28 ( 16   0   12(4 4 3[1101] 1[0100])   0 )

PM-trees:
Operand 1:  m ( 1   m(m[1110] 0 1 m[0010])   m(1 1 m[1101] 1)   1 )
Operand 2:  m ( 1   0   m(1 1 1 m[0100])   0 )
AND Result: m ( 1   0   m(1 1 m[1101] m[0100])   0 )

AND Process (on the depth-first pure-1 path codes):
0 100 101 102 12 132 20 21 220 221 223 23 3  &  0 20 21 22 231
   0            &  0    ->  0
   20           &  20   ->  20
   21           &  21   ->  21
   220 221 223  &  22   ->  220 221 223
   23           &  231  ->  231
RESULT: 0 20 21 220 221 223 231
Fig. 5. Operand 1, Operand 2, AND Result and AND Process
The pseudocode for the ANDing algorithm is given below.
Ptree_ANDing(P1, P2, Presult)
// pos1, pos2, pos3 record the current pure-1 quadrant path position of P1, P2, Presult
1. pos1 := 0; pos2 := 0; pos3 := 0;
2. DO WHILE (pos1 <> ENDofP1 AND pos2 <> ENDofP2)
   (a) IF P1.pos1 = P2.pos2 THEN BEGIN
           Presult.pos3 := P1.pos1; pos1 := pos1+1; pos2 := pos2+1; pos3 := pos3+1;
       END
   (b) ELSE IF P1.pos1 is a prefix of P2.pos2 THEN BEGIN
           Presult.pos3 := P2.pos2; pos2 := pos2+1; pos3 := pos3+1; END
   (c) ELSE IF P2.pos2 is a prefix of P1.pos1 THEN BEGIN
           Presult.pos3 := P1.pos1; pos1 := pos1+1; pos3 := pos3+1; END
   (d) ELSE IF P1.pos1 < P2.pos2 THEN pos1 := pos1+1;
   (e) ELSE pos2 := pos2+1;
   END IF
   END DO
Fig. 6. PC-tree ANDing algorithm
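A directly runnable version of Fig. 6 on path-code lists follows (our Python transcription; path codes are strings of quadrant digits as in Section 4.1, kept in depth-first order):

    def ptree_and(p1, p2):
        result, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            a, b = p1[i], p2[j]
            if a == b:                      # identical pure-1 quadrants
                result.append(a); i += 1; j += 1
            elif b.startswith(a):           # a's quadrant contains b's
                result.append(b); j += 1
            elif a.startswith(b):           # b's quadrant contains a's
                result.append(a); i += 1
            elif a < b:
                i += 1
            else:
                j += 1
        return result

    op1 = "0 100 101 102 12 132 20 21 220 221 223 23 3".split()
    op2 = "0 20 21 22 231".split()
    print(ptree_and(op1, op2))   # ['0', '20', '21', '220', '221', '223', '231']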
4.2 Mining Confident Rules from Spatial Data Using TC-cubes
In this section, a TC-cube based method for mining non-redundant, low-support, high-confidence rules is introduced. Such rules will be called confident rules. The main interest is in rules with low support, which are important for many application areas such as natural resource searches and agricultural pest infestation identification. However, a small positive support threshold is set in order to eliminate rules that result from noise and outliers (similar to [7], [8] and [15]). A high confidence threshold is set in order to find only the most confident rules.
To eliminate redundant rules resulting from over-fitting, an algorithm similar to the one introduced in [8] is used. In [8], rules are ranked based on confidence, support, rule-size and data-value ordering, respectively. Rules are compared with their generalizations for redundancy before they are included in the set of confident rules. In this paper, we use a similar rank definition, except that we do not use support level or data-value ordering. Since the support level is expected to be very low in many spatial applications, and since we set a minimum support only to eliminate rules resulting from noise, it is not used in rule ranking. Rules are declared redundant only if they are outranked by a generalization. We choose not to eliminate a rule which is outranked only by virtue of the specific data values involved.
A rule, r, ranks higher than rule, r', if confidence[r] > confidence[r'], or if
confidence[r] = confidence[r'] and the number of attributes in the antecedent of r is
less than the number in the antecedent of r'.
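This ordering is easy to state in code (a small sketch; rules are modeled as (confidence, antecedent) pairs, with the antecedent a set of attribute conditions):

    def ranks_higher(r1, r2):
        # Higher confidence wins; on a tie, the smaller antecedent wins.
        (c1, a1), (c2, a2) = r1, r2
        return c1 > c2 or (c1 == c2 and len(a1) < len(a2))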
A rule, r, generalizes a rule, r’, if they have the same consequent and the antecedent
of r is properly contained in the antecedent of r’. The algorithm for mining confident
rules from spatial data is given in Fig. 7.
Build the set of confident rules, C (initially empty), as follows.
   Start with 1-bit values, 2 bands;
   then 1-bit values and 3 bands; …
   then 2-bit values and 2 bands;
   then 2-bit values and 3 bands; …
   ...
At each stage defined above, do the following:
   Find all confident rules (support at least minimum_support and confidence at least
   minimum_confidence) by rolling up the TC-cube along each potential consequent set using
   summation. Compare these sums with the support threshold to isolate rule support sets with
   the minimum support. Compare the normalized TC-cube values (divide by the rolled-up sum)
   with the minimum confidence level to isolate the confident rules. Place any new confident
   rule in C, but only if its rank is higher than that of any of its generalizations already in C.
Fig. 7. Algorithm for mining confident rules
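One stage of this algorithm, run on the 1-bit, 2-band TC-cube of Fig. 9 below, can be sketched as follows (plain Python; the counts are taken from Fig. 9, with B2 as the consequent band):

    counts = [[25, 15], [5, 19]]        # counts[b2][b1], as in Fig. 9
    total, minsup, minconf = 64, 0.10, 0.80

    # Roll up along the consequent B2, i.e., sum each B1 column.
    col_sums = [counts[0][b1] + counts[1][b1] for b1 in (0, 1)]    # [30, 34]

    for b1 in (0, 1):
        if col_sums[b1] < minsup * total:        # support check (6.4 of 64)
            continue
        for b2 in (0, 1):
            conf = counts[b2][b1] / col_sums[b1] # normalized TC-cube value
            if conf >= minconf:
                print("B1={%d} => B2={%d}, c = %.1f%%" % (b1, b2, conf * 100))

    # prints: B1={0} => B2={0}, c = 83.3%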
The following example contains 3 bands of 3-bit spatial data in bSQ format.
Band-1:
B11         B12         B13
1111 0001   0000 1100   1111 0000
1101 0001   0000 1100   1111 0000
1100 0011   0000 1101   1111 0000
1100 0111   0000 1110   1111 0000
1111 0000   1111 0000   0011 0000
1111 0000   1111 0000   1011 0000
1111 0000   1001 0011   0000 0000
1111 0000   1011 0011   0001 0000

Band-2:
B21         B22         B23
0000 0011   0011 0000   1111 1111
0000 0011   0011 0000   1111 1111
0000 1100   1100 0000   1111 1111
0000 1100   1100 0000   1111 1111
1111 0000   1111 1111   1111 0011
1111 0000   1111 1111   1111 0010
1111 0000   1001 1111   1111 0011
1111 0000   1011 1111   1111 0011

Band-3:
B31         B32         B33
1111 0000   0000 0000   0000 1111
1111 0000   0000 0000   0000 1111
1111 0000   0000 1100   1011 1111
1111 0000   0000 1100   1011 1111
0000 1101   0000 0000   1111 1111
0000 1111   0000 0000   1111 1111
0000 0011   1100 0000   1111 1111
0000 0011   1100 0000   1111 1111

PM-trees (level-1 mask; children of each mixed node; leaves):
PM11: mm10   1m10 0mm1        1101 0101 0001
PM12: 0mmm   101m 11mm 0001   0110 1010 0111
PM13: 10m0   m10m             0010 0001
PM21: 0m10   0110
PM22: m0m1   0110 11mm        1010 0111
PM23: 111m   0m01             1110
PM31: 100m   1m01             0111
PM32: 0mm0   0010 0010
PM33: m111   00m1             1010
Fig. 8. 8*8 image data and its PM-trees
Assume a minimum confidence threshold of 80% and a minimum support threshold of 10%. Start with 1-bit values and 2 bands, B1 and B2. The TC-cube values (root counts from the PC-trees) are given in Fig. 9, while the rolled-up sums and confidence thresholds are given in Fig. 10.
            B1=0   B1=1
   B2=0      25     15
   B2=1       5     19
Fig. 9. TC-cube for band 1 and band 2
                 B1=0   B1=1
   B2=0           25     15    row sum 40 (threshold 32)
   B2=1            5     19    row sum 24 (threshold 19.2)
   column sums    30     34
   (thresholds)  (24)  (27.2)
Fig. 10. Rolled-up sums and confidence thresholds
All sums are at least the 10% support threshold (6.4 of 64 pixels). There is one confident rule:
C:
B1={0} => B2={0}, c = 83.3%
Continuing with 1-bit values and the two bands B1 and B3, we get the TC-cube with rolled-up sums and confidence thresholds shown in Fig. 11. There are no new confident rules. Similarly, the 1-bit TC-cube for bands B2 and B3 can be constructed (Fig. 12).
                 B1=0   B1=1
   B3=0           14     23    row sum 37 (threshold 29.6)
   B3=1           16     11    row sum 27 (threshold 21.6)
   column sums    30     34
   (thresholds)  (24)  (27.2)
Fig. 11. TC-cube for band 1 and band 3

                 B2=0   B2=1
   B3=0           13     24    row sum 37 (threshold 29.6)
   B3=1           27      0    row sum 27 (threshold 21.6)
   column sums    40     24
   (thresholds)  (32)  (19.2)
Fig. 12. TC-cube for band 2 and band 3
All sums are at least 10% of 64 (6.4); thus, all rules will have enough support. There are two new confident rules, B2={1} => B3={0} with confidence = 100% and B3={1} => B2={0} with confidence = 100%. Thus,
C:
B1={0} => B2={0}, c = 83.3%
B2={1} => B3={0}, c = 100%
B3={1} => B2={0}, c = 100%
Next consider 1-bit values and bands B1, B2 and B3. The counts, sums and confidence thresholds are given in Fig. 13:
Cell counts (B1, B2, B3):
                 B3=0            B3=1
            B1=0   B1=1     B1=0   B1=1
   B2=0       9      4        16     11
   B2=1       5     19         0      0

Rolled-up pair sums (80% confidence thresholds):
   B1,B2: (0,0)=25 (20),   (1,0)=15 (12),   (0,1)=5 (4),     (1,1)=19 (15.2)
   B1,B3: (0,0)=14 (11.2), (1,0)=23 (18.4), (0,1)=16 (12.8), (1,1)=11 (8.8)
   B2,B3: (0,0)=13 (10.4), (1,0)=24 (19.2), (0,1)=27 (21.6), (1,1)=0 (0)
Single-band sums: B1: 30, 34 (24, 27.2);  B2: 40, 24 (32, 19.2);  B3: 37, 27 (29.6, 21.6)
Fig. 13. The counts, sums and confidence thresholds for 1-bit values
Support sets B1={0}^B2={1} and B2={1}^B3={1} lack support. The new confident rules are:
B1={1}^B2={1} => B3={0}, c = 100%
B1={1}^B3={0} => B2={1}, c = 82.6%
B1={1}^B3={1} => B2={0}, c = 100%
B1={0}^B3={1} => B2={0}, c = 100%
B1={1}^B2={1} => B3={0} is not included because it is generalized by B2={1} => B3={0}, which is already in C and has higher rank. Also, B1={1}^B3={1} => B2={0} is not included because it is generalized by B3={1} => B2={0}, which is already in C and has higher rank. B1={0}^B3={1} => B2={0} is likewise not included because it is generalized by B3={1} => B2={0}, which has higher rank. Thus,
C:
B1={0} => B2={0}, c = 83.3%
B2={1} => B3={0}, c = 100%
B3={1} => B2={0}, c = 100%
B1={1}^B3={0} => B2={1}, c = 82.6%
Next, we consider 2-bit data values and proceed in the same way. Depending upon the goal of the data mining task (e.g., mining for classes of rules or for individual rules), the rules already in C can be used to obviate the need to consider 2-bit refinements of the rules in C. This simplifies the 2-bit stage markedly.
5 Implementation Issues and Performance Analysis
In our model, we build TC-cube values from basic PC-trees on-the-fly as needed. Once the TC-cube is built, we can perform the mining task with different parameters (i.e., different support and confidence thresholds) without rebuilding the cube. Using the roll-up cube operation, we can get the TC-cube for n-bit values from the TC-cube for (n+1)-bit values. This is a useful feature of the bit-value concept hierarchy.
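The precision roll-up can be sketched with a reshape-and-sum over each band axis (a NumPy illustration of the idea; value v at n bits covers values 2v and 2v+1 at n+1 bits):

    import numpy as np

    def roll_up(cube):
        # cube: one axis per band, each of length 2^(n+1); returns the n-bit cube.
        for axis in range(cube.ndim):
            shape = list(cube.shape)
            shape[axis] //= 2
            shape.insert(axis + 1, 2)                 # pair adjacent values
            cube = cube.reshape(shape).sum(axis=axis + 1)
        return cube

    cube_2bit = np.arange(4 ** 3).reshape(4, 4, 4)    # toy 2-bit, 3-band TC-cube
    print(roll_up(cube_2bit).shape)                   # (2, 2, 2): the 1-bit TC-cube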
We have enhanced the functionality of our model in two ways. First, we do not require the antecedent attribute to be specified; compared to other approaches for deriving high confidence rules, our model is more general. Second, we remove redundant rules based on the rule rank.
One important feature of our model is its scalability, in two senses. First, our model is scalable with respect to the data set size: the size of the TC-cube is independent of the data set size, depending only on the number of bands and the number of bits, and the mining cost depends only on the TC-cube size. For example, for an 8192×8192 image with three bands, the 2-bit TC-cube is as simple as that of the example in Section 4. By comparison, in the Apriori algorithm, the larger the data set, the higher the cost of the mining process. Therefore, the larger the data set, the greater the benefit of using our model.
The other aspect of scalability is that our model is scalable with respect to the support threshold. Our task focuses on mining high confidence rules with very small support. As the support threshold is decreased to a very low value, the cost of the Apriori algorithm increases dramatically, since it produces a huge number of frequent itemsets (a combinatorial explosion). In our model, however, the process is not based on frequent itemset generation, so it works well for low support thresholds.
As we mentioned, there is an additional cost to build the TC-cube. The key component of this cost is the PC-tree ANDing. We have implemented a parallel ANDing of PC-trees which is efficient on a cluster of computers.
We use an array of 16 dual 266-MHz-processor systems with a 400-MHz dual-processor machine as the control node. We partition the 2048×2048 image among all the nodes; each node contains the data for 512×512 pixels. These data are stored at the different nodes as another variation of the PC-tree, called the Peano Vector Tree (PV-tree). Here is how a PV-tree is constructed. First, we build a Peano Count Tree using fan-out 64 for each level. Then the tree is saved as bit vectors. For each internal node (except the root), we use two 64-bit bit vectors, one for pure 1 and the other for pure 0. At the leaf level we use only one vector (for pure 1).
The following algorithm (Fig. 14) describes this implementation in detail.
PROCEDURE SavePeanoTree (Tree PeanoTree)
// Peano tree with fan-out 64 and 3 levels, implemented as an array
BEGIN
  Vector PureOneVector := 0, PureZeroVector := 0; // Vector is a 64-bit data structure
  FOR i := 1 TO 64 DO BEGIN                       // level 1
    IF PeanoTree[i] = 4096 THEN turn on the ith bit of PureOneVector;
    ELSE IF PeanoTree[i] = 0 THEN turn on the ith bit of PureZeroVector;
  ENDFOR
  Write PureOneVector and PureZeroVector to the file;
  FOR each mixed node at level 1 DO BEGIN         // level 2
    ChildIndex := IndexOfCurrentNode * 64 + 1;
    PureOneVector := 0; PureZeroVector := 0;
    FOR i := ChildIndex TO ChildIndex + 63 DO BEGIN
      IF PeanoTree[i] = 64 THEN turn on bit (i - ChildIndex + 1) of PureOneVector;
      ELSE IF PeanoTree[i] = 0 THEN turn on bit (i - ChildIndex + 1) of PureZeroVector;
    ENDFOR
    Write PureOneVector and PureZeroVector to the file;
  ENDFOR
  FOR each mixed node at level 2 DO BEGIN         // level 3
    Save the 64 children (leaf bits) to the file;
  ENDFOR
END SavePeanoTree
Fig. 14. SavePeanoTree Algorithm
From a single TM scene, we will have 56 (7×8) Peano Vector Trees, all saved on a single node. Using 16 nodes, we cover a scene of size 2048×2048. When we need to perform an ANDing operation on the entire scene, each node calculates the local ANDing result of two Peano Vector Trees and sends the result to the control node, which produces the final result. The following algorithm (Fig. 15) describes the local ANDing operation.
FUNCTION LocalAND (Tree PeanoVectorTree1, Tree PeanoVectorTree2)
BEGIN
  unsigned long Result;
  Vector Mixed1, Mixed2;
  Extract the pure-one vector at the first level of each tree and perform a bit-wise AND.
  Find the total number of 1-bits in the resulting vector (say, n).
  Result := 4096 * n;
  Extract the pure-zero vector at the first level of each tree, complement it, and AND it
  with the complement of the corresponding pure-one vector (call the resulting vectors
  Mixed1 and Mixed2).
  FOR i := 1 TO 64 DO BEGIN
    IF the ith bit of Mixed1 is 0 THEN advance the pointer of Tree2;
    ELSE IF the ith bit of Mixed2 is 0 THEN advance the pointer of Tree1;
    ELSE IF the ith bit of both Mixed1 and Mixed2 is 1 THEN
      Extract the pure-one vector at the second level of each tree and
      perform a bit-wise AND on them.
      Find the total number of 1-bits in the resulting vector (say, m).
      Result := Result + 64 * m;
      Extract the pure-zero vector at the second level of each tree, complement it,
      and AND it with the complement of the corresponding pure-one vector (call the
      resulting vectors SecondMixed1 and SecondMixed2).
      FOR j := 1 TO 64 DO
        IF the jth bit of SecondMixed1 is 0 THEN advance the pointer of Tree2;
        ELSE IF the jth bit of SecondMixed2 is 0 THEN advance the pointer of Tree1;
        ELSE IF the jth bit of both SecondMixed1 and SecondMixed2 is 1 THEN
          Extract the pure-one vector at the leaf level of each tree and
          perform a bit-wise AND on them (call the resulting vector Pure).
          Find the number of 1-bits in Pure (say, l).
          Result := Result + l;
      ENDFOR
  ENDFOR
END LocalAND
Fig. 15. LocalAND Algorithm
We use the Message Passing Interface (MPI) on the cluster to implement the logical operations on Peano Vector Trees. This program uses the Single Program Multiple Data (SPMD) paradigm. The graph in Fig. 16 shows the ANDing times we observed (to perform the AND operation on two Peano Vector Trees) for a TM scene. The AND time varies from 6.72 ms to 52.12 ms.
[Graph: ANDing time (ms), from 0 to 60 on the y-axis, against the lower bit number of the two PC-trees, from 1 to 8 on the x-axis.]
Fig. 16. PC-tree ANDing time vs. Bit Number
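The SPMD pattern itself is simple; here is a sketch in Python with mpi4py standing in for our C/MPI implementation (the bit vectors are toy stand-ins; run with, e.g., mpiexec -n 16):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # Toy stand-ins for this node's two local Peano Vector Trees: plain integers
    # used as bit vectors, one bit per pixel of the local quadrant.
    pv1 = 0b1111000011110000 >> (comm.rank % 4)
    pv2 = 0b1111111100000000

    local = bin(pv1 & pv2).count("1")                 # local AND + 1-bit count
    total = comm.reduce(local, op=MPI.SUM, root=0)    # control node is rank 0
    if comm.rank == 0:
        print("root count of the scene-wide AND:", total)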
With this high-speed ANDing, the TC-cube can be built very quickly. For
example, for a 2-bit 3-band TC-cube, the total AND time is about 1s.
6 Related Work
There is other work discussing the problem of deriving high confidence rules [6,7,8]. Although these efforts deal with non-spatial data and are therefore not directly comparable, we make some rough comparisons. In [7], rules are found that have extremely high confidence but no (or extremely weak) support, and a set of algorithms is proposed to solve this problem. There are two disadvantages of this work. One is that only pairs of columns (attributes) are considered: all pairs of columns with similarity exceeding a pre-specified threshold are identified. The second disadvantage is that the similarity measure is bi-directional, i.e., it measures the co-occurrence of the antecedent and the consequent symmetrically.
In [6], a brute-force technique is used for mining classification rules: association rule mining is applied to the classification problem, i.e., a special rule set (a classifier) is derived. However, both support and confidence are used in the algorithm even though only the high confidence rules are targeted. Several pruning techniques are proposed, but there are trade-offs among them.
[8] and [15] are similar in that they both apply association rule mining to the classification task: they turn an arbitrary set of association rules into a classifier. A confidence-based pruning method is proposed using a property called "existential upward closure". The method is used for building a decision tree from association rules. The antecedent attribute is specified.
Our model is more general than the models cited above and is particularly efficient
and useful for spatial data mining.
The PC-tree structure is related to Quadtrees [10,11,13] and their variants (such as the point quadtree [13] and the region quadtree [10]), and to HHcodes [14]. PC-trees, quadtrees and HHcodes are alike in being quadrant based, but PC-trees differ in their focus on counts. PC-trees are beneficial not only for storing data but also for association rule mining, since they provide exactly the count information that association rule mining needs.
7 Conclusion
In this paper, we propose a new model to derive high confidence rules from spatial data. Data cube techniques are used in our model. The basic data structure of our model, the PC-tree, carries much more directly usable information than the original image file, yet is small in size. We build a Tuple Count cube from which the high confidence rules can be derived. Currently we use the 16-node system to perform the ANDing operations for images of size 2048×2048. In the future we will extend our system to 256 nodes so that we can handle images as large as 8192×8192. In that case, the PC-tree ANDing time will be approximately the same as in the 16-node system for a 2048×2048 image, since only the communication cost increases, and that increase is insignificant.
References
1. R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules Between Sets of Items in Large Databases. ACM SIGMOD 93, Washington, DC, May 1993.
2. R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. Proc. of Int'l Conf. on VLDB, Santiago, Chile, September 1994.
3. R. Srikant and R. Agrawal. Mining Quantitative Association Rules in Large Relational Tables. ACM SIGMOD 96, Montreal, Canada.
4. Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. An Effective Hash-Based Algorithm for Mining Association Rules. ACM SIGMOD 95, CA, 1995.
5. J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. ACM SIGMOD 2000, Dallas, Texas, May 2000.
6. R. J. Bayardo Jr. Brute-Force Mining of High-Confidence Classification Rules. KDD 97.
7. E. Cohen, M. Datar, S. Fujiwara, et al. Finding Interesting Associations without Support Pruning. VLDB 2000.
8. Ke Wang, Senqiang Zhou, and Yu He. Growing Decision Trees on Support-less Association Rules. KDD 2000, Boston, MA.
9. Volker Gaede and Oliver Gunther. Multidimensional Access Methods. ACM Computing Surveys, 30(2), 1998.
10. H. Samet. The Quadtree and Related Hierarchical Data Structures. ACM Computing Surveys, 16(2), 1984.
11. H. Samet. Applications of Spatial Data Structures. Addison-Wesley, Reading, Mass., 1990.
12. H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, Mass., 1990.
13. R. A. Finkel and J. L. Bentley. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica, 4(1), 1974.
14. HH-code. Available at http://www.statkart.no/nlhdb/iveher/hhtext.htm
15. B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD 98.
16. Jianning Dong, William Perrizo, Qin Ding, and Jingkai Zhou. The Application of Association Rule Mining on Remotely Sensed Data. Proc. of ACM Symposium on Applied Computing, Italy, March 2000.