Association Rule Mining on Remotely Sensed
Imagery Using Peano-trees (P-trees)
Qin Ding, Qiang Ding, and William Perrizo
Computer Science Department
North Dakota State University, USA
May 2002
(P-tree technology is patent pending by NDSU)
Outline
Concepts
– Association Rule Mining
– Market Basket Data
– Remotely Sensed Imagery (RSI) data
– Peano Count Trees (P-trees)
Association rule mining on RSI data using P-trees
 Performance analysis
 Conclusion
Association Rule Mining
Originally proposed for market basket data.
Given
– A set of items I = {i1,i2,…im} (e.g., items purchasable in a market)
– A set of transactions D
(e.g., customers checking out = id + itemset)
An association rule is X=>Y, where X, Y are disjoint itemsets
– X and Y are considered as events.
   E.g., X is the event that a transaction contains X.
   X=>Y is the event: “if transaction t contains X, then it contains Y”.
   X is called the antecedent, Y is called the consequent.
Two measures: support (% of transactions containing both X and Y) and confidence
(% of the transactions containing X that also contain Y)
Given minimum thresholds, minsup and minconf,
– Find the frequent itemsets which have support above minsup.
– Derive all rules supported by frequent sets, with confidence above minconf.
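The two measures can be illustrated with a minimal Python sketch. The basket contents and the function names `support` and `confidence` are hypothetical, chosen only for this illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing X, the fraction that also contain Y."""
    x = set(antecedent)
    xy = x | set(consequent)
    n_x = sum(1 for t in transactions if x <= set(t))
    n_xy = sum(1 for t in transactions if xy <= set(t))
    return n_xy / n_x if n_x else 0.0


# Hypothetical market basket data (not from the slides).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))       # 0.5 (2 of 4 baskets)
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```

A rule bread=>milk would be reported only if both values meet the chosen minsup and minconf thresholds.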
Association rule mining on RSI data
RSI data can be viewed as a relational table
– Each band (column) is an attribute (for simplicity we assume all
values are bytes)
– Each pixel (row) is a transaction.
– Each interval in each band is an item.
– Row/column or longitude/latitude is the primary key
ARM task on RSI data
– To mine implicit relations among different bands, for example,
relations among spectral bands and yield.
Example Rule (NDVI): NIR[192,255] ^ RED[0,63] => Yield[128,255]
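As a rough sketch of the "pixel = transaction, interval = item" view, the following Python maps 8-bit band values to interval items by keeping their leading bits. The pixel values and the helper names `interval_item` and `pixel_to_transaction` are assumptions made for illustration.

```python
def interval_item(band, value, bits=2):
    """Map an 8-bit band value to the interval item given by its top `bits` bits."""
    width = 256 >> bits                   # interval width, e.g. 64 for 2 bits
    low = (value >> (8 - bits)) * width   # lower bound of the interval
    return (band, low, low + width - 1)


def pixel_to_transaction(pixel, bits=2):
    """Turn one pixel (band name -> value) into a set of interval items."""
    return {interval_item(band, v, bits) for band, v in pixel.items()}


# Hypothetical pixel; values are not taken from the dataset in the slides.
pixel = {"NIR": 200, "RED": 40, "Yield": 150}

print(interval_item("NIR", 200))             # ('NIR', 192, 255)
print(interval_item("RED", 40))              # ('RED', 0, 63)
print(interval_item("Yield", 150, bits=1))   # ('Yield', 128, 255)
```

This pixel would therefore support the example rule: it contains NIR[192,255] and RED[0,63], and its yield falls in [128,255].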
Important ARM Algorithms
Apriori – stepwise algorithm
DHP (Direct Hashing and Pruning) – hash itemset counts and
prune transactions
Partition – divide the database into small partitions such that
each can be processed independently and efficiently in memory.
DIC (Dynamic Itemset Counting) – overlap the counting of
candidate itemsets at different points during a scan.
FP-growth – uses a Frequent Pattern tree (FP-tree) to avoid
candidate generation.
Others…
Remotely Sensed Imagery (RSI) Data
Satellite image
– TM (Thematic Mapper) imagery (6, 7 or 8 bands)
TM is a Landsat sensor. Landsat satellites have imaged the earth since 1972 (18-day repeat cycle for Landsat 1-3, 16 days from Landsat 4, the first to carry TM).
ETM+ (Landsat-7) contains 8 bands
– 7 VIR bands (Blue, Green, Red, NIR, MIR, TIR, MIR2)
– 1 Panchromatic band (PC).
Aerial photography
– TIFF (3 bands: Blue, Green, Red)
Ground data
– Yield, Moisture, Nitrate, Temperature, Elevation, etc
Precision Agriculture Dataset:
TIFF Image and related Bands
(1320×1320)
RGB
Yield
Moisture
Nitrate
As a relation
x    y    R   G   B   Y    M   N
812  445  43  60  59  146  83  188
812  446  43  58  50  146  83  188
812  447  44  60  52  146  83  187
812  448  43  63  54  146  83  186
812  449  43  69  52  146  83  186
812  450  47  73  54  146  83  185
812  451  50  68  58  146  83  184
812  452  51  65  54  146  83  183
812  453  46  63  54  146  83  182
812  454  33  53  50  146  83  182
812  455  30  49  47  146  83  181
812  456  41  55  54  146  83  180
812  457  40  55  57  146  83  179
812  458  43  56  52  146  83  178
812  459  42  52  52  146  83  177
812  460  40  58  45  146  83  176
812  461  40  66  47  146  83  176
812  462  38  59  47  145  83  175
812  463  34  51  55  145  82  175
812  464  39  53  63  145  82  174
812  465  36  54  57  145  82  173
812  466  42  57  48  145  82  173
812  467  40  59  43  145  82  172
812  468  39  68  50  145  82  172
812  469  40  56  57  145  82  172
812  470  30  45  43  145  82  172
812  471  33  57  45  145  82  172
812  472  35  58  62  145  82  173
812  473  30  54  63  145  82  173
812  474  30  57  52  145  82  173
x: Row
y: Column
R: Red
G: Green
B: Blue
Y: Yield
M: Moisture
N: Nitrate
Spatial Data Formats
BAND-1
254 (1111 1110)    127 (0111 1111)
 14 (0000 1110)    193 (1100 0001)
BAND-2
 37 (0010 0101)    240 (1111 0000)
200 (1100 1000)     19 (0001 0011)
BSQ format (2 files)
– Band 1: 254 127 14 193
– Band 2: 37 240 200 19
BIL format (1 file)
– 254 127 37 240
– 14 193 200 19
BIP format (1 file)
– 254 37 127 240
– 14 200 193 19
bSQ format (16 files)
B11: 1 0 0 1
B12: 1 1 0 1
B13: 1 1 0 0
B14: 1 1 0 0
B15: 1 1 1 0
B16: 1 1 1 0
B17: 1 1 1 0
B18: 0 1 0 1
B21: 0 1 1 0
B22: 0 1 1 0
B23: 1 1 0 0
B24: 0 1 0 1
B25: 0 0 1 0
B26: 1 0 0 0
B27: 0 0 0 1
B28: 1 0 0 1
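The four layouts can be reproduced from the 2×2 two-band example with a short Python sketch. The function names are hypothetical, and Bij is read here as bit j of band i with j = 1 the most significant bit, which is an interpretation that matches the bit files listed on the slide.

```python
# The 2x2 example image from the slide: one list of rows per band.
band1 = [[254, 127], [14, 193]]
band2 = [[37, 240], [200, 19]]
bands = [band1, band2]


def to_bsq(bands):
    """Band sequential: one file per band, pixels in raster order."""
    return [[v for row in band for v in row] for band in bands]


def to_bil(bands):
    """Band interleaved by line: for each image row, that row of every band."""
    n_rows = len(bands[0])
    return [v for r in range(n_rows) for band in bands for v in band[r]]


def to_bip(bands):
    """Band interleaved by pixel: for each pixel, its value in every band."""
    n_rows, n_cols = len(bands[0]), len(bands[0][0])
    return [band[r][c] for r in range(n_rows) for c in range(n_cols) for band in bands]


def to_bit_sequential(bands, bits=8):
    """bSQ: one bit file per (band i, bit position j), j = 1 is the top bit."""
    files = {}
    for i, band in enumerate(to_bsq(bands), start=1):
        for j in range(1, bits + 1):
            files[(i, j)] = [(v >> (bits - j)) & 1 for v in band]
    return files


print(to_bil(bands))                     # [254, 127, 37, 240, 14, 193, 200, 19]
print(to_bip(bands))                     # [254, 37, 127, 240, 14, 200, 193, 19]
print(to_bit_sequential(bands)[(1, 1)])  # [1, 0, 0, 1]
```

Two bands of 8 bits each give the 16 bit files of the bSQ layout.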
Peano Count Tree (P-tree)
P-tree represents RSI data bit-by-bit in a
recursive quadrant-by-quadrant arrangement.
 P-trees are a lossless compressed
representation of the original data.
[Figure: a 64-bit bSQ file]
An example of a P-tree
bSQ file arranged as a spatial dataset (2-D raster order):
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0
P-tree (quadrant counts, children in Peano order):
39
    16
    8 → 3 0 4 1     (leaf bits: 3 → 1110, 1 → 0010)
    15 → 4 4 3 4    (leaf bits: 3 → 1101)
    0
– Quadrant-based; Pure (Pure-1/Pure-0) quadrant
– Peano or Z-ordering
– Root Count
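The quadrant counts of this example (root count 39; level-1 counts 16, 8, 15, 0) can be reproduced with a short recursive sketch. The 8×8 grid below is an assumption: a bit pattern chosen to be consistent with the counts on the slide. The `(count, children)` tuple representation and the name `build_ptree` are hypothetical and do not reflect the patented implementation.

```python
# Assumed 8x8 bit grid (raster order) consistent with the counts on the slide.
GRID = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11110000",
    "11110000",
    "11110000",
    "01110000",
]


def build_ptree(grid, r0=0, c0=0, size=None):
    """Return (count, children) for the quadrant at (r0, c0) of side `size`.

    children is None for a pure quadrant (all 0s or all 1s); otherwise it
    holds the four sub-quadrants in Peano (Z) order:
    upper-left, upper-right, lower-left, lower-right.
    """
    if size is None:
        size = len(grid)
    count = sum(int(grid[r][c])
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size))
    if count in (0, size * size):   # pure-0 or pure-1: stop splitting
        return (count, None)
    half = size // 2
    children = [build_ptree(grid, r0 + dr, c0 + dc, half)
                for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))]
    return (count, children)


root = build_ptree(GRID)
print(root[0])                          # 39
print([child[0] for child in root[1]])  # [16, 8, 15, 0]
```

Recounting the bits at every level keeps the sketch short; a practical builder would more likely compute the counts bottom-up in a single pass.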
Peano Mask Tree (PM-tree)
[Figure: the same 8×8 dataset (root count 39) with its PM-tree]
PM-tree (1 = pure-1 quadrant, 0 = pure-0 quadrant, m = mixed quadrant):
m
    1
    m → m 0 1 m     (leaf bits: 1110, 0010)
    m → 1 1 m 1     (leaf bits: 1101)
    0
Truth-Trees (1 if the condition is true of the quadrant, else 0)
– E.g., Pure-1 and Pure-0 Trees
– All are lossless compressed representations of the dataset
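A PM-tree carries the same structure as the count tree, with each count replaced by a three-valued mask. The sketch below shows one way to derive it. The count-tree literal repeats the counts of the example (39; 16, 8, 15, 0), and the tuple representation and the names `leaves` and `to_mask` are assumptions made for illustration.

```python
def leaves(bits):
    """Four single-pixel leaves, each stored as (count, None)."""
    return [(b, None) for b in bits]


# Count tree of the example, as (count, children); children is None when pure.
COUNT_TREE = (39, [
    (16, None),
    (8, [(3, leaves([1, 1, 1, 0])), (0, None), (4, None), (1, leaves([0, 0, 1, 0]))]),
    (15, [(4, None), (4, None), (3, leaves([1, 1, 0, 1])), (4, None)]),
    (0, None),
])


def to_mask(node, size):
    """Convert a count node for a quadrant of side `size` to a mask node.

    Returns "1" (pure-1), "0" (pure-0), or ("m", [four children]) when mixed.
    """
    count, children = node
    if count == size * size:
        return "1"
    if count == 0:
        return "0"
    return ("m", [to_mask(child, size // 2) for child in children])


pm_tree = to_mask(COUNT_TREE, 8)
print(pm_tree[0])                                         # m
print([c if isinstance(c, str) else c[0] for c in pm_tree[1]])  # ['1', 'm', 'm', '0']
```

The mask form drops the counts, so a root count has to be recomputed from the quadrant sizes when it is needed.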
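[Figure: an 8×8 bSQ file in raster order and its P-tree]
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
P-tree (quadrant counts by level):
55                  (level 0)
    16              (level 1)
    8 → 3 0 4 1     (level 2; level-3 leaf bits: 3 → 1110, 1 → 0010)
    15 → 4 4 3 4    (level 2; level-3 leaf bits: 3 → 1101)
    16
– Peano or Z-ordering
– Pure-1/Pure-0 quadrant
– Root Count
– Level
– Fan-out
– QID (Quadrant ID): e.g., the pixel (7, 1) = (111, 001) has QID 10.10.11 = 2.2.3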
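The QID of a pixel follows from interleaving the bits of its row and column coordinates, one bit pair per tree level. A minimal sketch, assuming x is the row and y the column as in the relation legend, and with hypothetical function names:

```python
def quadrant_digits(x, y, levels):
    """Peano (Z-order) path of pixel (x, y): one digit 0-3 per tree level.

    At each level the digit is formed from one bit of x (row) followed by
    one bit of y (column), starting from the most significant bits.
    """
    digits = []
    for level in range(levels - 1, -1, -1):
        x_bit = (x >> level) & 1
        y_bit = (y >> level) & 1
        digits.append(2 * x_bit + y_bit)
    return digits


def quadrant_id(x, y, levels):
    """QID written as decimal digits, e.g. '2.2.3'."""
    return ".".join(str(d) for d in quadrant_digits(x, y, levels))


def quadrant_id_binary(x, y, levels):
    """QID written as bit pairs, e.g. '10.10.11'."""
    return ".".join(format(d, "02b") for d in quadrant_digits(x, y, levels))


print(quadrant_id(7, 1, 3))         # 2.2.3
print(quadrant_id_binary(7, 1, 3))  # 10.10.11
```

For an 8×8 image there are three levels, so every pixel has a three-digit QID.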
P-tree Operations
P-tree-1 (count form):
55
    16
    8 → 3 0 4 1     (leaf bits: 3 → 1110, 1 → 0010)
    15 → 4 4 3 4    (leaf bits: 3 → 1101)
    16
P-tree-1 (PM-tree form):
m
    1
    m → m 0 1 m     (leaf bits: 1110, 0010)
    m → 1 1 m 1     (leaf bits: 1101)
    1
P-tree-2:
m
    1
    0
    m → 1 1 1 m     (leaf bits: 0100)
    0
Complement of P-tree-1 (count form):
9
    0
    8 → 1 4 0 3     (leaf bits: 1 → 0001, 3 → 1101)
    1 → 0 0 1 0     (leaf bits: 1 → 0010)
    0
Complement of P-tree-1 (PM-tree form):
m
    0
    m → m 1 0 m     (leaf bits: 0001, 1101)
    m → 0 0 m 0     (leaf bits: 0010)
    0
AND-Result:
m
    1
    0
    m → 1 1 m m     (leaf bits: 1101, 0100)
    0
OR-Result:
m
    1
    m → m 0 1 m     (leaf bits: 1110, 0010)
    1
    1
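The three operations can be sketched recursively on a simple nested-list form of the PM-tree, where `1` is a pure-1 quadrant, `0` a pure-0 quadrant, and a four-element list a mixed quadrant. This representation and the function names are assumptions for illustration. The two trees are P-tree-1 and P-tree-2 from this slide.

```python
def complement(t):
    """Swap pure-1 and pure-0 quadrants; mixed quadrants stay mixed."""
    if t == 0:
        return 1
    if t == 1:
        return 0
    return [complement(c) for c in t]


def _simplify(children):
    """Collapse four identical pure children into one pure node."""
    if all(c == 0 for c in children):
        return 0
    if all(c == 1 for c in children):
        return 1
    return children


def p_and(a, b):
    if a == 0 or b == 0:   # a pure-0 operand makes the result pure-0
        return 0
    if a == 1:             # a pure-1 operand leaves the other unchanged
        return b
    if b == 1:
        return a
    return _simplify([p_and(x, y) for x, y in zip(a, b)])


def p_or(a, b):
    if a == 1 or b == 1:
        return 1
    if a == 0:
        return b
    if b == 0:
        return a
    return _simplify([p_or(x, y) for x, y in zip(a, b)])


def root_count(t, size):
    """Number of 1 bits under node t, a quadrant of side `size`."""
    if t == 0:
        return 0
    if t == 1:
        return size * size
    return sum(root_count(c, size // 2) for c in t)


T1 = [1, [[1, 1, 1, 0], 0, 1, [0, 0, 1, 0]], [1, 1, [1, 1, 0, 1], 1], 1]
T2 = [1, 0, [1, 1, 1, [0, 1, 0, 0]], 0]

print(p_and(T1, T2))                  # [1, 0, [1, 1, [1, 1, 0, 1], [0, 1, 0, 0]], 0]
print(p_or(T1, T2))                   # [1, [[1, 1, 1, 0], 0, 1, [0, 0, 1, 0]], 1, 1]
print(root_count(complement(T1), 8))  # 9
```

Pure quadrants short-circuit the recursion, which is where the compressed form saves work compared with a bit-by-bit AND.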
Ptree ANDing Operation
PM-tree1:
m
    1
    m → m 0 1 m     (leaf bits: 1110, 0010)
    m → 1 1 m 1     (leaf bits: 1101)
    1
PM-tree2:
m
    1
    0
    m → 1 1 1 m     (leaf bits: 0100)
    0
Result:
m
    1
    0
    m → 1 1 m m     (leaf bits: 1101, 0100)
    0
Depth-first Pure-1 path code
PM-tree1: 0 100 101 102 12 132 20 21 220 221 223 23 3
PM-tree2: 0 20 21 22 231
Matching paths:
    0 & 0 → 0
    20 & 20 → 20
    21 & 21 → 21
    220 221 223 & 22 → 220 221 223
    23 & 231 → 231
RESULT: 0 20 21 220 221 223 231
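The path-code AND can be sketched as a prefix test: two Pure-1 paths overlap exactly when one is a prefix of the other, and the overlap is the longer (more specific) path. The following is a simple quadratic illustration of that idea with a hypothetical function name, not the merge procedure used in the actual implementation.

```python
def and_pure1_paths(paths_a, paths_b):
    """AND two P-trees given as lists of Pure-1 quadrant paths.

    Each path is a string of quadrant digits (0-3) from the root down.
    When one path is a prefix of the other, the longer path is the common
    pure-1 region. Sibling paths in the result are not merged back into
    their parent.
    """
    result = set()
    for p in paths_a:
        for q in paths_b:
            if p.startswith(q):
                result.add(p)
            elif q.startswith(p):
                result.add(q)
    return sorted(result)


tree1 = "0 100 101 102 12 132 20 21 220 221 223 23 3".split()
tree2 = "0 20 21 22 231".split()

print(" ".join(and_pure1_paths(tree1, tree2)))  # 0 20 21 220 221 223 231
```

Since both path lists are already in depth-first order, a single merge pass over the two lists would presumably replace the nested loop in practice.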
Various P-trees
[Diagram: the P-tree types below, derived from one another with AND, OR, and COMPLEMENT]
– Basic P-trees: Pi,j
– Predicate P-trees: P(p)
– Value P-trees: Pi(v)
– Tuple P-trees: P(v1, v2, …, vn)
– Interval P-trees: Pi(v1, v2)
– Cube P-trees: P([v11, v12], …, [vN1, vN2])
Association Rule Mining on RSI Data
using P-trees
Admissible Itemsets (Asets )
– Asets are itemsets of the form Int1 × Int2 × ... × Intn =
Π i=1...n Inti, where Inti is an interval of values in Bandi
(some of which may be the full value range).
– Example: Aset {[01,01]1, [11,11]2}
P-ARM algorithm
 Pruning techniques
P-ARM algorithm
Procedure P-ARM
{
  Data_Discretization;
  F1 = {frequent 1-Asets};
  For (k = 2; Fk-1 ≠ ∅; k++) do begin
    Ck = p-gen(Fk-1);
    Forall candidate Asets c ∈ Ck do
      c.count = AND_rootcount(c);
    Fk = {c ∈ Ck | c.count >= minsup}
  end
  Answer = ∪k Fk
}
• F1 is determined directly from P-tree root counts and pruning techniques
rather than from a scan of the transaction database.
• The p-gen function differs from the apriori-gen function in Apriori in that
it applies some pruning techniques.
• The AND_rootcount function calculates Aset counts directly by ANDing the
appropriate basic P-trees instead of scanning the transaction database.
• The support count for Aset {B1[0,64), B2[64,128)} (or {[00,00]1, [01,01]2})
is the root count of P1(00) AND P2(01).
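The loop can be sketched in Python under a strong simplification: plain integer bitmaps stand in for value P-trees, so there is no quadrant compression and the "root count" is just a bit count. The pixel values, the 2-bit discretization, and every function name below are assumptions for illustration. Only band-based pruning and the usual Apriori subset check are shown in `p_gen`.

```python
from itertools import combinations


def value_bitmaps(pixels, bits=2):
    """Map item (band, interval) -> int bitmap; bit t is set when pixel t has it.

    These uncompressed bitmaps stand in for value P-trees.
    """
    maps = {}
    for t, pixel in enumerate(pixels):
        for band, value in enumerate(pixel, start=1):
            item = (band, value >> (8 - bits))   # keep the top `bits` bits
            maps[item] = maps.get(item, 0) | (1 << t)
    return maps


def and_rootcount(aset, maps, n_pixels):
    """Support count of an Aset: AND its item bitmaps, then count the 1 bits."""
    acc = (1 << n_pixels) - 1
    for item in aset:
        acc &= maps.get(item, 0)
    return bin(acc).count("1")


def p_gen(prev_frequent, k):
    """Candidate k-Asets built from the frequent (k-1)-Asets."""
    items = sorted({item for aset in prev_frequent for item in aset})
    candidates = set()
    for combo in combinations(items, k):
        if len({band for band, _ in combo}) < k:   # band-based pruning
            continue
        if all(frozenset(sub) in prev_frequent
               for sub in combinations(combo, k - 1)):
            candidates.add(frozenset(combo))
    return candidates


def p_arm(pixels, minsup_count, bits=2):
    """Return all frequent Asets; minsup is given as an absolute count."""
    maps = value_bitmaps(pixels, bits)
    n = len(pixels)
    frequent = {frozenset([item]) for item in maps
                if and_rootcount([item], maps, n) >= minsup_count}
    answer = set(frequent)
    k = 2
    while frequent:
        frequent = {c for c in p_gen(frequent, k)
                    if and_rootcount(c, maps, n) >= minsup_count}
        answer |= frequent
        k += 1
    return answer


# Hypothetical two-band pixels (band 1, band 2).
pixels = [(10, 70), (20, 100), (30, 200), (200, 90)]
maps = value_bitmaps(pixels)
print(and_rootcount([(1, 0), (2, 1)], maps, len(pixels)))  # 2
print(len(p_arm(pixels, minsup_count=2)))                  # 3
```

In the first call, item (1, 0) is band 1 in [0,64) and item (2, 1) is band 2 in [64,128), so the printed value is the support count of the Aset used as the example above, here on made-up pixels.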
Pruning Techniques
Band-based pruning
– An itemset with two items from the same band will have
support zero.
Constraint-based pruning
– E.g., specify yield as the only consequent band of interest.
– Note: in the performance comparisons we did not use this
pruning technique (to maintain fairness, since it is hard to
implement in other algorithms)
Bit-based pruning for multi-level rules
– If Aset [128,255] (or [1,1]2) is not frequent, then the Asets [128,191] (or
[10,10]2) and [192,255] (or [11,11]2) cannot be frequent either.
Others
P-ARM versus Apriori
[Figure: Scalability with support threshold. Run time (sec., axis 0 to 800) versus support threshold (10% to 90%) for P-ARM and Apriori on 1,742,400 pixels (transactions).]
P-ARM versus Apriori (cont.)
[Figure: Scalability with number of transactions. Time (sec., axis 0 to 1200) versus number of transactions (100K to 1700K) for Apriori and P-ARM at support threshold = 10%.]
P-ARM versus FP-growth
[Figure: Scalability with support threshold. Run time (sec., axis 0 to 800) versus support threshold (10% to 90%) for P-ARM and FP-growth on 1,742,400 pixels (transactions).]
P-ARM versus FP-growth (cont.)
[Figure: Scalability with the number of transactions. Time (sec., axis 0 to 1200) versus number of transactions (100K to 1700K) for FP-growth and P-ARM at support threshold = 10%.]
Conclusion
A model for association rule mining on RSI data
– P-trees facilitate fast calculation of support
– P-trees facilitate significant pruning techniques
Applications other than precision agriculture
– Flood prediction and monitoring
– Community and regional planning
– Virtual archeology
– Mineral exploration
– Bioinformatics/Genomics
– VLSI design