Association Rule Mining on Remotely Sensed Imagery Using Peano-trees (P-trees)
Qin Ding, Qiang Ding, and William Perrizo
Computer Science Department, North Dakota State University, USA
May 2002
(P-tree technology is patent pending by NDSU)

Outline
Concepts
– Association Rule Mining
– Market Basket Data
– Remotely Sensed Imagery (RSI) data
– Peano Count Trees (P-trees)
Association rule mining on RSI data using P-trees
Performance analysis
Conclusion

Association Rule Mining
Originally proposed for market basket data.
Given
– A set of items I = {i1, i2, …, im} (e.g., items purchasable in a market)
– A set of transactions D (e.g., customers checking out = id + itemset)
An association rule is X => Y, where X and Y are disjoint itemsets.
– X and Y are considered as events. E.g., X is the event that a transaction contains X. X => Y is the event: "if t contains X, then it contains Y".
– X is called the antecedent, Y is called the consequent.
Two measures: support (% of transactions containing both X and Y) and confidence (% of those transactions containing X which also contain Y).
Given minimum thresholds, minsup and minconf,
– Find the frequent itemsets, i.e., those with support above minsup.
– Derive all rules supported by frequent sets, with confidence above minconf.

Association rule mining on RSI data
RSI data can be viewed as a relational table
– Each band (column) is an attribute (for simplicity we assume all values are bytes).
– Each pixel (row) is a transaction.
– Each interval in each band is an item.
– Row/column or longitude/latitude is the primary key.
ARM task on RSI data
– To mine implicit relations among different bands, for example, relations among spectral bands and yield.
Example Rule (NDVI): NIR[192,255] ^ RED[0,63] => Yield[128,255]

Important ARM Algorithms
Apriori – stepwise algorithm
DHP (Direct Hashing and Pruning) – hash itemset counts and prune transactions
Partition – divide the database into small partitions such that each can be processed independently and efficiently in memory.
DIC (Dynamic Itemset Counting) – overlap the counting of candidate itemsets at different points during a scan.
FP-growth – uses a Frequent Pattern tree (FP-tree) to avoid candidate generation.
Others…

Remotely Sensed Imagery (RSI) Data
Satellite image
– TM (Thematic Mapper) imagery (6, 7 or 8 bands). TM is Landsat satellite imagery covering the earth every 18 days since 1972.
– ETM+ (Landsat-7) contains 8 bands: 7 VIR bands (Blue, Green, Red, NIR, MIR, TIR, MIR2) and 1 Panchromatic band (PC).
Aerial photography
– TIFF (3 bands: Blue, Green, Red)
Ground data
– Yield, Moisture, Nitrate, Temperature, Elevation, etc.

Precision Agriculture Dataset: TIFF Image and related Bands (1320×1320)
[Images: RGB, Yield, Moisture, Nitrate]
As a relation (x: Row, y: Column, R: Red, G: Green, B: Blue, Y: Yield, M: Moisture, N: Nitrate):
  x    y    R   G   B    Y    M    N
 812  445   43  60  59  146   83  188
 812  446   43  58  50  146   83  188
 812  447   44  60  52  146   83  187
 812  448   43  63  54  146   83  186
 812  449   43  69  52  146   83  186
 812  450   47  73  54  146   83  185
 812  451   50  68  58  146   83  184
 812  452   51  65  54  146   83  183
 812  453   46  63  54  146   83  182
 812  454   33  53  50  146   83  182
 812  455   30  49  47  146   83  181
 812  456   41  55  54  146   83  180
 812  457   40  55  57  146   83  179
 812  458   43  56  52  146   83  178
 812  459   42  52  52  146   83  177
 812  460   40  58  45  146   83  176
 812  461   40  66  47  146   83  176
 812  462   38  59  47  145   83  175
 812  463   34  51  55  145   82  175
 812  464   39  53  63  145   82  174
 812  465   36  54  57  145   82  173
 812  466   42  57  48  145   82  173
 812  467   40  59  43  145   82  172
 812  468   39  68  50  145   82  172
 812  469   40  56  57  145   82  172
 812  470   30  45  43  145   82  172
 812  471   33  57  45  145   82  172
 812  472   35  58  62  145   82  173
 812  473   30  54  63  145   82  173
 812  474   30  57  52  145   82  173

Spatial Data Formats
Example: a 2×2 image with two bands
BAND-1
 254 (1111 1110)   127 (0111 1111)
  14 (0000 1110)   193 (1100 0001)
BAND-2
  37 (0010 0101)   240 (1111 0000)
 200 (1100 1000)    19 (0001 0011)
BSQ format (2 files)
 Band 1: 254 127 14 193
 Band 2: 37 240 200 19
BIL format (1 file)
 254 127 37 240 14 193 200 19
BIP format (1 file)
 254 37 127 240 14 200 193 19
bSQ format (16 files)
 B11: 1 0 0 1    B21: 0 1 1 0
 B12: 1 1 0 1    B22: 0 1 1 0
 B13: 1 1 0 0    B23: 1 1 0 0
 B14: 1 1 0 0    B24: 0 1 0 1
 B15: 1 1 1 0    B25: 0 0 1 0
 B16: 1 1 1 0    B26: 1 0 0 0
 B17: 1 1 1 0    B27: 0 0 0 1
 B18: 0 1 0 1    B28: 1 0 0 1

Peano Count Tree (P-tree)
A P-tree represents RSI data bit-by-bit in a recursive quadrant-by-quadrant arrangement.
P-trees are a lossless compressed representation of the original data.
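The four layouts above can be reproduced mechanically from the two example bands. The following is a minimal Python sketch, not code from the presentation: the function names (to_bsq, to_bil, to_bip, to_bit_planes) are illustrative, and it assumes the pixels of each band are listed in raster order with 2 pixels per image row.

```python
# The two example bands of the 2x2 image, pixels in raster order (assumed).
band1 = [254, 127, 14, 193]
band2 = [37, 240, 200, 19]
bands = [band1, band2]
WIDTH = 2  # pixels per image row in the 2x2 example


def to_bsq(bands):
    """Band sequential: one file (list) per band."""
    return [list(b) for b in bands]


def to_bil(bands, width):
    """Band interleaved by line: row 0 of every band, then row 1 of every band, ..."""
    out = []
    for start in range(0, len(bands[0]), width):
        for b in bands:
            out.extend(b[start:start + width])
    return out


def to_bip(bands):
    """Band interleaved by pixel: all band values of pixel 0, then of pixel 1, ..."""
    return [v for pixel in zip(*bands) for v in pixel]


def to_bit_planes(bands, bits=8):
    """bSQ: one bit file per (band, bit position); position 1 is the most significant bit."""
    planes = {}
    for i, b in enumerate(bands, start=1):
        for j in range(1, bits + 1):
            shift = bits - j
            planes[f"B{i}{j}"] = [(v >> shift) & 1 for v in b]
    return planes


bil = to_bil(bands, WIDTH)      # [254, 127, 37, 240, 14, 193, 200, 19]
bip = to_bip(bands)             # [254, 37, 127, 240, 14, 200, 193, 19]
planes = to_bit_planes(bands)   # 16 bit files, B11 ... B28
```

Each of the 16 bit files is the input from which one basic P-tree would be built.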
bSQ file (64 bits)
11111111 11111111 11100000 11110010 11111111 11111111 11111111 11111111

An example of a P-tree
bSQ file arranged as a spatial dataset (2-D raster order):
 11 11 11 00
 11 11 10 00
 11 11 11 00
 11 11 11 10
 11 11 00 00
 11 11 00 00
 11 11 00 00
 01 11 00 00
Peano Count Tree:
 Level 0 (root count): 39
 Level 1: 16 8 15 0
 Level 2: under 8: 3 0 4 1; under 15: 4 4 3 4
 Level 3: under 3: 1110; under 1: 0010; under 3: 1101
– Quadrant-based
– Pure (Pure-1/Pure-0) quadrant
– Peano or Z-ordering
– Root Count

Peano Mask Tree (PM-tree)
For the same dataset (m = mixed, 1 = pure-1, 0 = pure-0):
 Level 0: m
 Level 1: 1 m m 0
 Level 2: under the first m: m 0 1 m; under the second m: 1 1 m 1
 Level 3: 1110, 0010, 1101
Truth-Trees (1 if the condition is true of the quadrant, else 0)
– E.g., Pure-1 and Pure-0 Trees
– All are lossless compressed representations of the dataset

A second example (2-D raster order):
 11 11 11 00
 11 11 10 00
 11 11 11 00
 11 11 11 10
 11 11 11 11
 11 11 11 11
 11 11 11 11
 01 11 11 11
Peano Count Tree:
 Level 0 (root count): 55
 Level 1: 16 8 15 16
 Level 2: under 8: 3 0 4 1; under 15: 4 4 3 4
 Level 3: under 3: 1110; under 1: 0010; under 3: 1101
Annotations: Peano or Z-ordering; Pure-1/Pure-0 quadrant; Root Count; Level; Fan-out; QID (Quadrant ID)
Example: pixel (7, 1) = (111, 001) has QID 2.2.3 (10.10.11)

P-tree Operations
P-tree (count form):
 Level 0: 55
 Level 1: 16 8 15 16
 Level 2: under 8: 3 0 4 1; under 15: 4 4 3 4
 Level 3: 1110, 0010, 1101
P-tree-1 (PM-tree form of the P-tree above):
 Level 0: m
 Level 1: 1 m m 1
 Level 2: under the first m: m 0 1 m; under the second m: 1 1 m 1
 Level 3: 1110, 0010, 1101
P-tree-2:
 Level 0: m
 Level 1: 1 0 m 0
 Level 2: under m: 1 1 1 m
 Level 3: 0100
Complement of P-tree-1 (count form):
 Level 0: 9
 Level 1: 0 8 1 0
 Level 2: under 8: 1 4 0 3; under 1: 0 0 1 0
 Level 3: 0001, 1101, 0010
Complement of P-tree-1 (PM-tree form):
 Level 0: m
 Level 1: 0 m m 0
 Level 2: under the first m: m 1 0 m; under the second m: 0 0 m 0
 Level 3: 0001, 1101, 0010
AND-Result (P-tree-1 AND P-tree-2):
 Level 0: m
 Level 1: 1 0 m 0
 Level 2: under m: 1 1 m m
 Level 3: 1101, 0100
OR-Result (P-tree-1 OR P-tree-2):
 Level 0: m
 Level 1: 1 m 1 1
 Level 2: under m: m 0 1 m
 Level 3: 1110, 0010

P-tree ANDing Operation
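The AND of two PM-trees can be sketched recursively. This is a minimal Python illustration rather than the NDSU implementation: it assumes a tree is encoded as 0 (pure-0), 1 (pure-1), or a list of four children in Peano order (upper-left, upper-right, lower-left, lower-right), and the function names build, p_and, and root_count are hypothetical.

```python
def build(grid, r=0, c=0, size=None):
    """Build a PM-tree from a square 0/1 grid whose side is a power of two."""
    if size is None:
        size = len(grid)
    if size == 1:
        return grid[r][c]
    h = size // 2
    kids = [build(grid, r, c, h), build(grid, r, c + h, h),
            build(grid, r + h, c, h), build(grid, r + h, c + h, h)]
    if all(k == 0 for k in kids):
        return 0          # pure-0 quadrant
    if all(k == 1 for k in kids):
        return 1          # pure-1 quadrant
    return kids           # mixed quadrant ("m")


def p_and(a, b):
    """AND two PM-trees without expanding pure quadrants."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    kids = [p_and(x, y) for x, y in zip(a, b)]
    return 0 if all(k == 0 for k in kids) else kids


def root_count(node, size):
    """Number of 1-bits represented by a tree covering a size x size area."""
    if node == 0:
        return 0
    if node == 1:
        return size * size
    return sum(root_count(k, size // 2) for k in node)


# The 8x8 example with root count 55, in raster order.
rows = ["11111100", "11111000", "11111100", "11111110",
        "11111111", "11111111", "11111111", "01111111"]
t1 = build([[int(ch) for ch in row] for row in rows])
# PM-tree2, written directly in the assumed encoding.
t2 = [1, 0, [1, 1, 1, [0, 1, 0, 0]], 0]
result = p_and(t1, t2)
```

Under this encoding, t1 comes out as the PM-tree with level 1 = 1 m m 1, and result has level 1 = 1 0 m 0 with leaves 1101 and 0100; root_count(result, 8) then gives the support count of the ANDed tree without touching individual pixels of pure quadrants.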
PM-tree1:
 Level 0: m
 Level 1: 1 m m 1
 Level 2: under the first m: m 0 1 m; under the second m: 1 1 m 1
 Level 3: 1110, 0010, 1101
PM-tree2:
 Level 0: m
 Level 1: 1 0 m 0
 Level 2: under m: 1 1 1 m
 Level 3: 0100
Result:
 Level 0: m
 Level 1: 1 0 m 0
 Level 2: under m: 1 1 m m
 Level 3: 1101, 0100
Depth-first Pure-1 path code
 PM-tree1: 0 100 101 102 12 132 20 21 220 221 223 23 3
 PM-tree2: 0 20 21 22 231
 Matching paths:
  0 & 0 -> 0
  20 & 20 -> 20
  21 & 21 -> 21
  220 221 223 & 22 -> 220 221 223
  23 & 231 -> 231
 RESULT: 0 20 21 220 221 223 231

Various P-trees
[Diagram: P-tree types derived from one another by AND, OR, and COMPLEMENT operations]
– Basic P-trees Pi,j
– Value P-trees Pi(v)
– Tuple P-trees P(v1, v2, …, vn)
– Interval P-trees Pi(v1, v2)
– Cube P-trees P([v11, v12], …, [vN1, vN2])
– Predicate P-trees P(p)

Association Rule Mining on RSI Data using P-trees
Admissible Itemsets (Asets)
– Asets are itemsets of the form Int1 × Int2 × ... × Intn = Π(i=1…n) Inti, where Inti is an interval of values in Bandi (some of which may be the full value range).
– Example: Aset {[01,01]1, [11,11]2}
P-ARM algorithm
Pruning techniques

P-ARM algorithm
Procedure P-ARM {
  Data_Discretization;
  F1 = {frequent 1-Asets};
  For (k = 2; Fk-1 ≠ ∅; k++) do begin
    Ck = p-gen(Fk-1);
    Forall candidate Asets c ∈ Ck do
      c.count = AND_rootcount(c);
    Fk = {c ∈ Ck | c.count >= minsup}
  end
  Answer = ∪k Fk
}
• F1 is determined directly from P-tree root counts and pruning techniques rather than from a scan of the transaction database.
• The p-gen function differs from the apriori-gen function in Apriori by using some pruning techniques.
• The AND_rootcount function is used to calculate Aset counts directly by ANDing the appropriate basic P-trees instead of scanning the transaction database. The support count for Aset {B1[0,64), B2[64,128)} (or {[00,00]1, [01,01]2}) is the root count of P1(00) AND P2(01).

Pruning Techniques
Band-based pruning
– An itemset with two items from the same band will have support zero.
Constraint-based pruning
– E.g., specify yield as the only consequent band of interest.
– Note: in the performance comparisons we did not use this pruning technique (to maintain fairness, since it is hard to implement in other algorithms).
Bit-based pruning for multi-level rules
– If Aset [128,255] (or [1,1]2) is not frequent, then the Asets [128,191] (or [10,10]2) and [192,255] (or [11,11]2) cannot be frequent either.
Others

P-ARM versus Apriori
[Figure: run time (sec.) versus support threshold (10% to 90%) for P-ARM and Apriori; 1,742,400 pixels (transactions)]
Scalability with support threshold

P-ARM versus Apriori (cont.)
[Figure: time (sec.) versus number of transactions (100K to 1,700K) for Apriori and P-ARM; support threshold = 10%]
Scalability with number of transactions

P-ARM versus FP-growth
[Figure: run time (sec.) versus support threshold (10% to 90%) for P-ARM and FP-growth; 1,742,400 pixels (transactions)]
Scalability with support threshold

P-ARM versus FP-growth (cont.)
[Figure: time (sec.) versus number of transactions (100K to 1,700K) for FP-growth and P-ARM; support threshold = 10%]
Scalability with the number of transactions

Conclusion
A model for association rule mining on RSI data
– P-trees facilitate fast calculation of support
– P-trees facilitate significant pruning techniques
Applications other than precision agriculture
– Flood prediction and monitoring
– Community and regional planning
– Virtual archeology
– Mineral exploration
– Bioinformatics/Genomics
– VLSI design
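As a closing illustration of the P-ARM procedure and its band-based pruning, here is a minimal Python sketch. It makes several assumptions that are not in the slides: flat bit vectors (Python integers) stand in for P-trees, so the AND is a bitwise `&` and the root count is a popcount; the three 8-pixel bands are invented toy data; and the names interval_masks and p_arm are hypothetical.

```python
from itertools import combinations


def interval_masks(bands, bits):
    """Map each (band, interval) item to a bitmask over pixels.

    The interval is given by the top `bits` bits of the byte value. The mask
    is a flat stand-in for a value P-tree; its popcount is the root count.
    """
    masks = {}
    for b, values in enumerate(bands, start=1):
        for p, v in enumerate(values):
            key = (b, v >> (8 - bits))
            masks[key] = masks.get(key, 0) | (1 << p)
    return masks


def p_arm(bands, bits, minsup_count):
    """Level-wise search for frequent Asets, counting support by ANDing masks."""
    def count(mask):
        return bin(mask).count("1")

    masks = interval_masks(bands, bits)
    # F1 comes straight from the per-item counts, with no scan of the pixel table.
    frequent = {frozenset([k]): m for k, m in masks.items()
                if count(m) >= minsup_count}
    answer = dict(frequent)
    k = 2
    while frequent:
        candidates = {}
        for a, b in combinations(frequent, 2):
            c = a | b
            if len(c) != k:
                continue
            # Band-based pruning: two items from the same band have support zero.
            if len({band for band, _ in c}) != k:
                continue
            # Apriori-style pruning: every (k-1)-subset must itself be frequent.
            if any(c - {item} not in frequent for item in c):
                continue
            candidates[c] = frequent[a] & frequent[b]  # AND, then count
        frequent = {c: m for c, m in candidates.items()
                    if count(m) >= minsup_count}
        answer.update(frequent)
        k += 1
    return {c: count(m) for c, m in answer.items()}


# Toy data (invented for illustration): 8 pixels, bands 1 to 3.
bands = [
    [200, 210, 220, 190, 50, 60, 205, 40],
    [30, 40, 20, 35, 150, 160, 25, 170],
    [180, 190, 170, 60, 70, 50, 200, 90],
]
asets = p_arm(bands, bits=1, minsup_count=4)
```

With one bit per band, item (1, 1) means band 1 in [128,255] and item (2, 0) means band 2 in [0,127]. On the toy data, the Aset {(1, 1), (2, 0)} has count 5 and {(1, 1), (2, 0), (3, 1)} has count 4, so the rule from the first two items to (3, 1) would have confidence 4/5.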