Deriving High Confidence Rules for Spatial Data Using the Tuple Count Cube

William Perrizo, Qin Ding, Qiang Ding, and Amalendu Roy
Department of Computer Science, North Dakota State University, Fargo, ND 58105-5164, USA
{William_Perrizo, Qin_Ding, Qiang_Ding, Amalendu_Roy}@ndsu.nodak.edu

Abstract. The traditional task of association rule mining is to find all rules with high support and high confidence. In some applications, such as mining spatial datasets for natural resources, the task is to find high confidence rules even though their supports may be low. In still other applications, such as the identification of agricultural pest infestations, the task is to find high confidence rules preferably while the support is still very low. The basic Apriori algorithm cannot solve these problems efficiently. In this paper, we propose a new model to derive high confidence rules for spatial data. A new data structure, the Peano Count Tree (PC-tree), is used in our model to represent all the information we need. PC-trees represent spatial data bit-by-bit in a recursive quadrant-by-quadrant arrangement. Based on the PC-tree, we build a special data cube, the Tuple Count Cube (TC-cube), to derive high confidence rules. Our algorithm for deriving confident rules is fast and efficient. In addition, we discuss some strategies for avoiding over-fitting (removing redundant and misleading rules).

1 Introduction

Association rule mining [1,2,3,4,5], proposed by Agrawal, Imielinski and Swami in 1993, is one of the important tasks of data mining. The original application of association rule mining was to market basket data. A typical example is "customers who purchase one item are very likely to purchase another item at the same time". There are two accuracy measures, support and confidence, for each rule. The problem of association rule mining is to find all rules whose support and confidence exceed user-specified thresholds. The basic algorithms, such as Apriori [1] and DHP [4], use the downward closure property of support to find frequent itemsets, i.e., itemsets whose supports are above the threshold. After obtaining all frequent itemsets, which is very time consuming, high confidence rules are derived in a straightforward way. However, in some applications, such as spatial data mining, we are also interested in rules with high confidence that do not necessarily have high support. In still other applications, such as the identification of agricultural pest infestations, the task is to find high confidence rules preferably while the support is still very low. In these cases, the traditional algorithms are not suitable. One might think that we could simply set the minimum support to a very low value, so that high confidence rules with almost no support limit can be derived. However, this would lead to a huge number of frequent itemsets and is, thus, impractical. In this paper, we propose a new model, including new data structures and algorithms, to derive "confident" rules (rules selected on confidence alone), especially for spatial data. We use a data structure, called the Peano Count Tree (PC-tree), to store all the information we need. A PC-tree is a quadrant-based count tree. From the PC-trees, we build a data cube, the Tuple Count Cube or TC-cube, which exposes the confident rules. We also use attribute precision concept hierarchies and a natural rule ranking to prune the complexity of our data mining algorithm.
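To make the two measures concrete, here is a minimal Python sketch (our illustration, not part of the paper; the transaction data are invented) that computes the support and confidence of a candidate rule A => B over a toy market-basket table:

  # Minimal sketch (not from the paper): support and confidence of A => B
  # over a toy market-basket dataset. Item names are made up.
  transactions = [
      {"bread", "milk"},
      {"bread", "butter"},
      {"bread", "milk", "butter"},
      {"milk"},
  ]

  def support(itemset, transactions):
      """Fraction of transactions containing every item in itemset."""
      return sum(itemset <= t for t in transactions) / len(transactions)

  def confidence(antecedent, consequent, transactions):
      """support(antecedent union consequent) / support(antecedent)."""
      return support(antecedent | consequent, transactions) / support(antecedent, transactions)

  print(support({"bread", "milk"}, transactions))        # 0.5
  print(confidence({"bread"}, {"milk"}, transactions))   # 0.666...

A rule is reported by the traditional algorithms only when both numbers exceed their thresholds; the problem addressed in this paper arises when the confidence threshold is high but the support may be very low.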
The rest of the paper is organized as follows. In Section 2, we provide some background on spatial data formats. In Section 3, we describe the data structures we use for association rule mining, including PC-trees and TC-cubes. In Section 4, we detail our algorithms for deriving confident rules. Performance analysis and implementation issues are given in Section 5, followed by related work in Section 6. Finally, the conclusion is given in Section 7.

2 Formats of Spatial Data

There are huge amounts of spatial data on which we can perform data mining to obtain useful information [16]. Spatial data are collected in different ways and are organized in different formats. BSQ, BIL and BIP are three typical formats. An image contains several bands. For example, a TM6 (Thematic Mapper) scene contains 6 bands, while a TM7 scene contains 7 bands (Blue, Green, Red, NIR, MIR, TIR, MIR2), each of which contains reflectance values in the range 0~255. An image can be organized into a relational table in which each pixel is a tuple and each spectral band is an attribute. The primary key can be the latitude and longitude pair, which uniquely identifies the pixel.

BSQ (Band Sequential) is a similar format, in which each band is stored as a separate file, with raster order used within each individual band. TM scenes are in BSQ format. BIL (Band Interleaved by Line) is another format in which all the bands are organized in one file and the bands are interleaved by row (the first row of all bands is followed by the second row of all bands, and so on). For example, SPOT data from French satellites are in BIL format. In the BIP (Band Interleaved by Pixel) format, there is also just one file, in which the first pixel-value of the first band is followed by the first pixel-value of the second band, ..., the first pixel-value of the last band, followed by the second pixel-value of the first band, and so on. For example, TIFF images are in BIP format. Fig. 1 gives an example of the BSQ, BIL and BIP formats.

In this paper, we propose a new format, called bSQ (bit Sequential), to organize images. The reflectance values of each band range from 0 to 255, represented as 8 bits. We split each band into a separate file for each bit position. Fig. 1 also gives an example of the bSQ format. There are several reasons to use the bSQ format. First, different bits contribute to the value to different degrees; in some applications, we do not need all the bits, because the high order bits give us enough information. Second, the bSQ format facilitates the representation of a precision hierarchy. Third, and most importantly, the bSQ format facilitates the creation of an efficient, rich data structure, the PC-tree, and accommodates algorithm pruning based on a one-bit-at-a-time approach.

We give a very simple illustrative example with only 2 data bands for a scene having only 2 rows and 2 columns (both decimal and binary representations are shown):

  BAND-1:  254 (1111 1110)   127 (0111 1111)
            14 (0000 1110)   193 (1100 0001)

  BAND-2:   37 (0010 0101)   240 (1111 0000)
           200 (1100 1000)    19 (0001 0011)

  BSQ format (2 files):   Band 1: 254 127 14 193
                          Band 2: 37 240 200 19
  BIL format (1 file):    254 127 37 240 14 193 200 19
  BIP format (1 file):    254 37 127 240 14 200 193 19
  bSQ format (16 files):  B11: 1001  B12: 1101  B13: 1100  B14: 1100
                          B15: 1110  B16: 1110  B17: 1110  B18: 0101
                          B21: 0110  B22: 0110  B23: 1100  B24: 0101
                          B25: 0010  B26: 1000  B27: 0001  B28: 1001

Fig. 1. Two bands of a 2-row-2-column image and its BSQ, BIP, BIL and bSQ formats
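As a sketch of how the bSQ split might be carried out (ours, not the paper's code; 8-bit values in raster order are assumed), each band simply sheds one file per bit position:

  # Minimal sketch of the bSQ split (our illustration): each 8-bit band
  # becomes 8 bit sequences, one per bit position (bit 1 = most significant).
  def bsq_split(band_values):
      """band_values: list of 0..255 reflectances in raster order.
      Returns 8 bit-strings, the MSB plane first, the LSB plane last."""
      planes = []
      for j in range(8):                       # j = 0 for the MSB, 7 for the LSB
          shift = 7 - j
          planes.append("".join(str((v >> shift) & 1) for v in band_values))
      return planes

  band1 = [254, 127, 14, 193]                  # the 2x2 example from Fig. 1
  for j, plane in enumerate(bsq_split(band1), start=1):
      print(f"B1{j}: {plane}")                 # B11: 1001, B12: 1101, ...

Applied to Band 1 of Fig. 1, this reproduces the bit files B11 = 1001, B12 = 1101, and so on.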
3 Data Structures

3.1 Basic PC-trees

We organize each bit file in the bSQ format into a tree structure, called a Peano Count Tree (PC-tree). A PC-tree is a quadrant-based tree. The idea is to recursively divide the entire image into quadrants and record the count of 1-bits for each quadrant, thus forming a quadrant count tree. PC-trees are somewhat similar in construction to other data structures in the literature (e.g., Quadtrees [10] and HHcodes [14]). For example, given the 8-row-8-column image below, the PC-tree and PM-tree are as shown in Fig. 2.

  1 1 1 1 1 1 0 0
  1 1 1 1 1 0 0 0
  1 1 1 1 1 1 0 0
  1 1 1 1 1 1 1 0
  1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1
  0 1 1 1 1 1 1 1

  PC-tree: 55 ( 16 | 8 ( 3(1110) 0 4 1(0010) ) | 15 ( 4 4 3(1101) 4 ) | 16 )
  PM-tree:  m (  1 | m ( m(1110) 0 1 m(0010) ) |  m ( 1 1 m(1101) 1 ) |  1 )

Fig. 2. An 8x8 image and its PC-tree and PM-tree (children listed in Peano order; leaf bit patterns in parentheses)

In this example, 55 is the number of 1s in the entire image. This root level is labeled level 0. The numbers at the next level (level 1), namely 16, 8, 15 and 16, are the 1-bit counts of the four major quadrants. Since the first and last quadrants are composed entirely of 1-bits (each is called a "pure-1 quadrant"), we do not need subtrees for these two quadrants, so those branches terminate. Similarly, quadrants composed entirely of 0-bits are called "pure-0 quadrants" and also terminate their branches. This pattern continues recursively, using the Peano (Z) ordering of the four sub-quadrants at each new level. Every branch terminates eventually (at the leaf level, each quadrant is a pure quadrant). If we were to expand all subtrees, including those for pure quadrants, the leaf sequence would be just the Peano ordering (Z-ordering) of the original raster image; thus, we use the name Peano Count Tree.

We note that the fan-out of the PC-tree need not be limited to 4; it can be any power of 4 (effectively skipping that number of levels in the tree). Also, the fan-out at any one level need not coincide with the fan-out at another level. The fan-out pattern can be chosen to produce maximum compression for each bSQ file.

For each band (assuming 8-bit data values), we get 8 basic PC-trees, one for each bit position. For band B1, we label the basic PC-trees P11, P12, ..., P18; Pij is a lossless representation of the jth bits of the values from the ith band. In addition, Pij provides the 1-bit count for every quadrant of every dimension. Finally, we note that these PC-trees can be generated quite quickly and can be viewed as a "data mining ready", lossless format for storing spatial data.
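The recursive construction can be sketched as follows (our illustration; the nested (count, children) tuple is an assumed in-memory form, not the paper's compact pure-1-path coding):

  # Minimal sketch (our illustration) of building a PC-tree for one bSQ bit
  # plane: split into 4 quadrants in Peano (Z) order, record the 1-bit count,
  # and stop at pure-0 or pure-1 quadrants.
  def build_pctree(grid):
      """grid: square list of 0/1 rows, side a power of 2.
      Returns (count, children); children is None for a pure quadrant."""
      n = len(grid)
      count = sum(sum(row) for row in grid)
      if count in (0, n * n):                      # pure-0 or pure-1 quadrant: stop
          return (count, None)
      h = n // 2
      quads = [                                    # Peano order: UL, UR, LL, LR
          [row[:h] for row in grid[:h]], [row[h:] for row in grid[:h]],
          [row[:h] for row in grid[h:]], [row[h:] for row in grid[h:]],
      ]
      return (count, [build_pctree(q) for q in quads])

For the image of Fig. 2, build_pctree returns root count 55 with child counts 16, 8, 15 and 16, matching the PC-tree above.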
The 8 basic PC-trees defined above can be combined using simple logical operations (AND, NOT, OR, COMPLEMENT) to produce PC-trees for the original values in a band, at any level of precision (1-bit precision, 2-bit precision, etc.). We let Pb,v denote the Peano Count Tree for band b and value v, where v can be expressed in 1-bit, 2-bit, ..., or 8-bit precision; Pb,v is called a value PC-tree. Using the full 8-bit precision for values, the value PC-tree Pb,11010011 can be constructed from the basic PC-trees by ANDing the basic PC-trees (for each 1-bit) and their complements (for each 0-bit):

  PCb,11010011 = PCb1 AND PCb2 AND PCb3' AND PCb4 AND PCb5' AND PCb6' AND PCb7 AND PCb8

where ' indicates the bit-complement (which is simply the PC-tree with each count replaced by its complement in each quadrant). From value PC-trees, we can construct tuple PC-trees. The tuple PC-tree for tuple (v1,v2,...,vn), denoted PC(v1,v2,...,vn), is

  PC(v1,v2,...,vn) = PC1,v1 AND PC2,v2 AND ... AND PCn,vn

where n is the total number of bands.

Fig. 3. Basic (bit) PC-trees (e.g., P11, P12, ..., P21, ..., P88) are ANDed into value PC-trees (e.g., P1,001 for 3-bit values), which are ANDed into tuple PC-trees (e.g., P001,010,111,011,001,110,011,101)

The AND operation is simply the pixel-wise AND of the bits. Before going further, we note that the process of converting the BSQ data for a TM satellite image (approximately 60 million pixels) to its basic PC-trees can be done in just a few seconds on a high performance PC. This is a one-time process. We also note that we store the basic PC-trees in a depth-first data structure which specifies the pure-1 quadrants only. Using this data structure, each AND can be completed in a few milliseconds, and the result counts can be accumulated easily once the AND and COMPLEMENT operations have completed.

3.2 Variations of the PC-tree

In order to optimize the AND operation, we use a variation of the PC-tree called the PM-tree (Pure Mask tree). In the PM-tree, we use a 3-value logic to represent pure-1, pure-0 and mixed quadrants: 1 for pure 1, 0 for pure 0, and m for mixed. The PM-tree for the previous example is also given in Fig. 2. The PM-tree specifies the locations of the pure-1 quadrants of the operands, so that the pure-1 quadrants of the AND result can easily be identified as the coincidences of pure-1 quadrants in both operands, while pure-0 quadrants of the AND result occur wherever a pure-0 quadrant occurs in at least one of the operands.

3.3 Value Concept Hierarchy

Using the bSQ format, we can easily represent the value concept hierarchy of spatial data. For example, for band n, we can use from 1 bit up to 8 bits to represent the reflectances (Fig. 4):

  1 bit:  0 -> 0~127      1 -> 128~255
  2 bits: 00 -> 0~63      01 -> 64~127     10 -> 128~191    11 -> 192~255
  3 bits: 000 -> 0~31     001 -> 32~63     010 -> 64~95     011 -> 96~127
          100 -> 128~159  101 -> 160~191   110 -> 192~223   111 -> 224~255

Fig. 4. Value concept hierarchy

3.4 Tuple Count Cube

For most spatial data mining, the root counts of the tuple PC-trees (e.g., PC(v1,v2,...,vn) = PC1,v1 AND PC2,v2 AND ... AND PCn,vn) are the numbers required, since the root count tells us exactly the number of occurrences of that particular tuple pattern over the space in question. These root counts can be inserted into a data cube, called the Tuple Count cube (TC-cube) of the spatial dataset. Each band corresponds to a dimension of the cube, the band values labeling that dimension. The TC-cube cell at location (v1,v2,...,vn) contains the root count of PC(v1,v2,...,vn). For example, assuming just 3 bands, the (v1,v2,v3)th cell of the TC-cube contains the root count of PC(v1,v2,v3) = PC1,v1 AND PC2,v2 AND PC3,v3.
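Since the root count of PC(v1,...,vn) is just the number of pixels whose band values equal (v1,...,vn) at the chosen precision, the TC-cube semantics can be sketched directly (our illustration; the paper computes these counts by PC-tree ANDing rather than by scanning pixels):

  # Minimal sketch (our illustration) of populating a TC-cube at a chosen
  # bit precision: cell (v1, ..., vn) holds the number of pixels whose band
  # values, truncated to the top `bits` bits, equal (v1, ..., vn). These are
  # the same root counts the paper obtains by ANDing PC-trees.
  from itertools import product

  def tc_cube(bands, bits=1):
      """bands: list of pixel-value lists (one per band, same raster order)."""
      cube = {v: 0 for v in product(range(2 ** bits), repeat=len(bands))}
      for pixel in zip(*bands):
          key = tuple(p >> (8 - bits) for p in pixel)   # top `bits` bits of each value
          cube[key] += 1
      return cube

  # The 2x2, 2-band example of Fig. 1.
  cube = tc_cube([[254, 127, 14, 193], [37, 240, 200, 19]])
  print(cube)   # cube[(1, 0)] counts pixels with B1 >= 128 and B2 < 128

For the 2x2 example of Fig. 1, cube[(1, 0)] = cube[(0, 1)] = 2 and the other two cells are 0.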
The cube can be contracted or expanded by going up or down in the value concept hierarchy.

4 Confident Rule Mining Algorithm

4.1 PC-tree ANDing Algorithm

We begin this section with a description of the AND algorithm. This algorithm is used to compose the value PC-trees and to populate the TC-cube. The approach is to store only the basic PC-trees and then generate value PC-tree root counts "on the fly" when needed (in Section 5 we show this can be done in about 50 ms). In this algorithm we assume the PC-tree is coded in its most compact form, a depth-first ordering of the paths to each pure-1 quadrant.

Consider operand 1 first (Fig. 5); it is the image of Fig. 2. Each path is represented by the sequence of quadrants, in Peano order, beginning just below the root. The depth-first pure-1 path code for this example is

  0 100 101 102 12 132 20 21 220 221 223 23 3

(0 indicates that the entire level-1 upper-left quadrant is pure 1s; 100 indicates the level-3 quadrant arrived at along the branch through node 1 (2nd node) of level 1, node 0 (1st node) of level 2 and node 0 of level 3; etc.). The second operand (Fig. 5) has depth-first pure-1 path code

  0 20 21 22 231

Since a quadrant is pure 1 in the result only if it is pure 1 in both operands (or all operands, in the case there are more than 2), the AND is done by scanning the operands and outputting the matching pure-1 paths. The result (Fig. 5) is

  0 20 21 220 221 223 231

Fig. 5. Operand 1 (root count 55), Operand 2 (root count 29), the AND result (root count 28) and the AND process (tree diagrams omitted)

The pseudo code for the ANDing algorithm is given below.

  Ptree_ANDing(P1, P2, Presult)
  // pos1, pos2, pos3 track the current pure-1 quadrant path of P1, P2, Presult
  1. pos1 := 0; pos2 := 0; pos3 := 0;
  2. DO WHILE (pos1 <> END_of_P1 AND pos2 <> END_of_P2)
     (a) IF P1.pos1 = P2.pos2 THEN
            Presult.pos3 := P1.pos1; pos1 := pos1+1; pos2 := pos2+1; pos3 := pos3+1;
     (b) ELSE IF P1.pos1 is a prefix of P2.pos2 THEN
            Presult.pos3 := P2.pos2; pos2 := pos2+1; pos3 := pos3+1;
     (c) ELSE IF P2.pos2 is a prefix of P1.pos1 THEN
            Presult.pos3 := P1.pos1; pos1 := pos1+1; pos3 := pos3+1;
     (d) ELSE IF P1.pos1 < P2.pos2 THEN pos1 := pos1+1;
     (e) ELSE pos2 := pos2+1;
     END IF
  END DO

Fig. 6. PC-tree ANDing algorithm
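As a sketch of Fig. 6 in running form (ours, not the paper's code; paths are held as sorted lists of digit strings, matching the depth-first pure-1 path coding above):

  # Minimal sketch (our illustration) of ANDing two PC-trees represented as
  # sorted lists of depth-first pure-1 quadrant paths: a quadrant is pure 1
  # in the result iff it lies inside a pure-1 quadrant of BOTH operands,
  # i.e. one path equals or is a prefix of the other.
  def ptree_and(paths1, paths2):
      result, i, j = [], 0, 0
      while i < len(paths1) and j < len(paths2):
          p1, p2 = paths1[i], paths2[j]
          if p1 == p2:                      # identical pure-1 quadrants
              result.append(p1); i += 1; j += 1
          elif p2.startswith(p1):           # p1's quadrant contains p2's
              result.append(p2); j += 1
          elif p1.startswith(p2):           # p2's quadrant contains p1's
              result.append(p1); i += 1
          elif p1 < p2:                     # disjoint quadrants; advance the smaller
              i += 1
          else:
              j += 1
      return result

  op1 = "0 100 101 102 12 132 20 21 220 221 223 23 3".split()
  op2 = "0 20 21 22 231".split()
  print(ptree_and(op1, op2))   # ['0', '20', '21', '220', '221', '223', '231']

Run on the two operands of Fig. 5, this reproduces the result path code given above.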
4.2 Mining Confident Rules from Spatial Data Using TC-cubes

In this section, a TC-cube based method for mining non-redundant, low-support, high-confidence rules is introduced. Such rules will be called confident rules. The main interest is in rules with low support, which are important in many application areas, such as natural resource searches and the identification of agricultural pest infestations. However, a small positive support threshold is set in order to eliminate rules that result from noise and outliers (similar to [7], [8] and [15]). A high confidence threshold is set in order to find only the most confident rules.

To eliminate redundant rules resulting from over-fitting, an algorithm similar to the one introduced in [8] is used. In [8], rules are ranked based on confidence, support, rule-size and data-value ordering, respectively, and rules are compared with their generalizations for redundancy before they are included in the set of confident rules. In this paper, we use a similar rank definition, except that we do not use support level or data-value ordering. Since the support level is expected to be very low in many spatial applications, and since we set a minimum support only to eliminate rules resulting from noise, support is not used in rule ranking. Rules are declared redundant only if they are outranked by a generalization; we choose not to eliminate a rule which is outranked only by virtue of the specific data values involved.

A rule r ranks higher than a rule r' if confidence[r] > confidence[r'], or if confidence[r] = confidence[r'] and the number of attributes in the antecedent of r is less than the number in the antecedent of r'. A rule r generalizes a rule r' if they have the same consequent and the antecedent of r is properly contained in the antecedent of r'. The algorithm for mining confident rules from spatial data is given in Fig. 7.

  Build the set of confident rules, C (initially empty), as follows.
  Start with 1-bit values and 2 bands; then 1-bit values and 3 bands; ...
  then 2-bit values and 2 bands; then 2-bit values and 3 bands; ...
  At each stage, do the following:
    Find all confident rules (support at least minimum_support and confidence
    at least minimum_confidence) by rolling up the TC-cube along each potential
    consequent set using summation. Compare these sums with the support
    threshold to isolate rule support sets with the minimum support. Compare
    the normalized TC-cube values (divide by the rolled-up sum) with the
    minimum confidence level to isolate the confident rules. Place any new
    confident rule in C, but only if its rank is higher than that of every
    generalization already in C.

Fig. 7. Algorithm for mining confident rules
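One stage of Fig. 7, for a single consequent band, can be sketched as follows (our illustration; it assumes the TC-cube is held as a dictionary from value tuples to root counts, like the tc_cube sketch in Section 3.4):

  # Minimal sketch (our illustration) of one stage of Fig. 7: roll up the
  # TC-cube along one consequent band, then test support and confidence.
  def confident_rules(cube, consequent_band, min_support, min_confidence):
      """cube: dict mapping value tuples to root counts; consequent_band:
      index of the band treated as the rule consequent. Yields
      (antecedent_values, consequent_value, confidence) triples."""
      total = sum(cube.values())
      rollup = {}                                   # sum over the consequent dimension
      for values, count in cube.items():
          antecedent = values[:consequent_band] + values[consequent_band + 1:]
          rollup[antecedent] = rollup.get(antecedent, 0) + count
      for values, count in cube.items():
          antecedent = values[:consequent_band] + values[consequent_band + 1:]
          if rollup[antecedent] >= min_support * total:         # support test
              if count >= min_confidence * rollup[antecedent]:  # confidence test
                  yield antecedent, values[consequent_band], count / rollup[antecedent]

For the 2-band cube of Fig. 9 below, {(0,0): 25, (0,1): 5, (1,0): 15, (1,1): 19}, confident_rules(cube, 1, 0.10, 0.80) yields exactly one rule, B1={0} => B2={0} with confidence 25/30 = 83.3%.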
The following example contains 3 bands of 3-bit spatial data in bSQ format. Fig. 8 shows the nine 8x8 bit planes (B11, B12, B13 for Band 1; B21, B22, B23 for Band 2; B31, B32, B33 for Band 3) together with their PM-trees.

Fig. 8. An 8x8 image of 3-bit data in 3 bands (bit planes B11 through B33) and its PM-trees (the bit planes and tree diagrams are omitted here)

Assume a minimum confidence threshold of 80% and a minimum support threshold of 10%. Start with 1-bit values and 2 bands, B1 and B2. The TC-cube values (root counts from the PC-trees) are given in Fig. 9, and the rolled-up sums and confidence thresholds are given in Fig. 10.

           B1=0   B1=1
  B2=0      25     15
  B2=1       5     19

Fig. 9. TC-cube for band 1 and band 2

  over B2:  B1=0: sum 30, 80% threshold 24      B1=1: sum 34, threshold 27.2
  over B1:  B2=0: sum 40, 80% threshold 32      B2=1: sum 24, threshold 19.2

Fig. 10. Rolled-up sums and confidence thresholds

All sums meet the 10% support threshold (6.4 of the 64 pixels). There is one confident rule:

  C:  B1={0} => B2={0}   confidence = 83.3%

Continuing with 1-bit values and the bands B1 and B3, we get the TC-cube with rolled-up sums and confidence thresholds shown in Fig. 11; there are no new confident rules. Similarly, the 1-bit TC-cube for bands B2 and B3 can be constructed (Fig. 12).

           B1=0   B1=1    sum (threshold)
  B3=0      14     23     37 (29.6)
  B3=1      16     11     27 (21.6)
  sum       30     34
  (thresh) (24)  (27.2)

Fig. 11. TC-cube for band 1 and band 3

           B2=0   B2=1    sum (threshold)
  B3=0      13     24     37 (29.6)
  B3=1      27      0     27 (21.6)
  sum       40     24
  (thresh) (32)  (19.2)

Fig. 12. TC-cube for band 2 and band 3

All sums are at least 10% of 64 (6.4); thus, all rules have enough support. There are two new confident rules: B2={1} => B3={0} with confidence = 100% and B3={1} => B2={0} with confidence = 100%. Thus,

  C:  B1={0} => B2={0}   c = 83.3%
      B2={1} => B3={0}   c = 100%
      B3={1} => B2={0}   c = 100%

Next, consider 1-bit values and the three bands B1, B2 and B3. The counts, sums and confidence thresholds are given in Fig. 13.

  Cell counts (B1, B2, B3):
    (0,0,0) = 9    (1,0,0) = 4    (0,1,0) = 5    (1,1,0) = 19
    (0,0,1) = 16   (1,0,1) = 11   (0,1,1) = 0    (1,1,1) = 0

  Rolled-up sums (80% thresholds):
    over B3:  B1B2 = 00: 25 (20)     01: 5 (4)       10: 15 (12)     11: 19 (15.2)
    over B2:  B1B3 = 00: 14 (11.2)   01: 16 (12.8)   10: 23 (18.4)   11: 11 (8.8)
    over B1:  B2B3 = 00: 13 (10.4)   01: 27 (21.6)   10: 24 (19.2)   11: 0 (0)

Fig. 13. The counts, sums and confidence thresholds for 1-bit values

The support sets B1={0}^B2={1} and B2={1}^B3={1} lack support. The new confident rules are:

  B1={1}^B2={1} => B3={0}   c = 100%
  B1={1}^B3={0} => B2={1}   c = 82.6%
  B1={1}^B3={1} => B2={0}   c = 100%
  B1={0}^B3={1} => B2={0}   c = 100%

B1={1}^B2={1} => B3={0} is not included in C because it is generalized by B2={1} => B3={0}, which is already in C and has higher rank. Likewise, B1={1}^B3={1} => B2={0} and B1={0}^B3={1} => B2={0} are not included because each is generalized by B3={1} => B2={0}, which is already in C and has higher rank. Thus,

  C:  B1={0} => B2={0}          c = 83.3%
      B2={1} => B3={0}          c = 100%
      B3={1} => B2={0}          c = 100%
      B1={1}^B3={0} => B2={1}   c = 82.6%

Next, we consider 2-bit data values and proceed in the same way. Depending upon the goal of the data mining task (e.g., mining for classes of rules versus individual rules), the rules already in C can be used to obviate the need to consider 2-bit refinements of those rules, which simplifies the 2-bit stage markedly.

5 Implementation Issues and Performance Analysis

In our model, we build TC-cube values from the basic PC-trees on the fly, as needed. Once the TC-cube is built, we can perform the mining task with different parameters (i.e., different support and confidence thresholds) without rebuilding the cube. Using the roll-up cube operation, we can obtain the TC-cube for n-bit values from the TC-cube for (n+1)-bit values; this is a useful feature of the bit value concept hierarchy.

We have enhanced the functionality of our model in two ways. First, we do not require the antecedent attribute to be specified, so our model is more general than other approaches for deriving high confidence rules. Second, we remove redundant rules based on the rule rank.

One important feature of our model is its scalability, in two senses. First, the model is scalable with respect to the data set size: the size of the TC-cube is independent of the data set size, depending only on the number of bands and the number of bits, and the mining cost depends only on the TC-cube size. For example, for an 8192x8192 image with three bands, the 2-bit TC-cube is as simple as that of the example in Section 4. By comparison, in the Apriori algorithm, the larger the data set, the higher the cost of the mining process; therefore, the larger the data set, the greater the benefit of using our model. Second, the model is scalable with respect to the support threshold. Our task focuses on mining high confidence rules with very small support. As the support threshold is decreased to a very low value, the cost of the Apriori algorithm increases dramatically, since a huge number of frequent itemsets results (a combinatorial explosion). In our model, the process is not based on frequent itemset generation, so it works well at low support thresholds.

As we mentioned, there is an additional cost to build the TC-cube; the key component of this cost is the PC-tree ANDing. We have implemented a parallel ANDing of PC-trees which is efficient on a cluster of computers, using an array of 16 dual 266 MHz processor systems with a 400 MHz dual processor as the control node. We partition a 2048x2048 image among the nodes, so that each node holds the data for 512x512 pixels.
These data are stored at the different nodes as another variation of the PC-tree, called the Peano Vector Tree (PV-tree). A PV-tree is constructed as follows. First, we build a Peano Count Tree using fan-out 64 at each level. Then the tree is saved as bit vectors: for each internal node (except the root), we use two 64-bit vectors, one marking pure-1 children and one marking pure-0 children; at the leaf level, we use only one vector (for pure 1). The following algorithm (Fig. 14) describes this implementation in detail.

  PROCEDURE SavePeanoTree(Tree PeanoTree)
  // Peano tree with fan-out 64 and 3 levels, implemented as an array
  BEGIN
    Vector PureOneVector := 0, PureZeroVector := 0;   // Vector is a 64-bit data structure
    FOR i := 1 TO 64 DO                               // level 1
      IF PeanoTree[i] = 4096 THEN turn on bit i of PureOneVector;
      ELSE IF PeanoTree[i] = 0 THEN turn on bit i of PureZeroVector;
    ENDFOR
    Write PureOneVector and PureZeroVector to the file;
    FOR each mixed node at level 1 DO                 // level 2
      ChildIndex := IndexOfCurrentNode * 64 + 1;
      PureOneVector := 0; PureZeroVector := 0;
      FOR i := ChildIndex TO ChildIndex + 63 DO
        IF PeanoTree[i] = 64 THEN turn on bit (i - ChildIndex + 1) of PureOneVector;
        ELSE IF PeanoTree[i] = 0 THEN turn on bit (i - ChildIndex + 1) of PureZeroVector;
      ENDFOR
      Write PureOneVector and PureZeroVector to the file;
    ENDFOR
    FOR each mixed node at level 2 DO                 // level 3
      Save its 64 children to the file;
    ENDFOR
  END SavePeanoTree

Fig. 14. SavePeanoTree algorithm

From a single TM scene, we have 56 (7x8) Peano Vector Trees, all saved in a single node. Using 16 nodes, we cover a scene of size 2048x2048. When we need to perform an ANDing operation on the entire scene, each node calculates the local ANDing result of two Peano Vector Trees and sends it to the control node, which combines the partial results into the final result. The following algorithm (Fig. 15) describes the local ANDing operation.

  FUNCTION LocalAND(Tree PVTree1, Tree PVTree2)
  BEGIN
    unsigned long Result;
    Vector Mixed1, Mixed2;
    Extract the pure-one vectors at the first level of the two trees and AND
      them; let n be the number of 1-bits in the resulting vector;
    Result := 4096 * n;
    For each tree, complement its first-level pure-zero vector and AND it with
      the complement of the corresponding pure-one vector (call the resulting
      vectors Mixed1 and Mixed2);
    FOR i := 1 TO 64 DO
      IF bit i of Mixed1 is 0 THEN advance the pointer for PVTree2;
      ELSE IF bit i of Mixed2 is 0 THEN advance the pointer for PVTree1;
      ELSE  // bit i of both Mixed1 and Mixed2 is 1
        Extract the second-level pure-one vectors of the two trees and AND
          them; let m be the number of 1-bits in the result;
        Result := Result + 64 * m;
        Compute the second-level mixed vectors, SecondMixed1 and SecondMixed2,
          in the same way as Mixed1 and Mixed2;
        FOR j := 1 TO 64 DO
          IF bit j of SecondMixed1 is 0 THEN advance the pointer for PVTree2;
          ELSE IF bit j of SecondMixed2 is 0 THEN advance the pointer for PVTree1;
          ELSE  // bit j of both SecondMixed1 and SecondMixed2 is 1
            Extract the leaf-level pure-one vectors of the two trees and AND
              them (call the result Pure); let l be the number of 1-bits in Pure;
            Result := Result + l;
        ENDFOR
    ENDFOR
  END LocalAND

Fig. 15. LocalAND algorithm
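To make the bit-vector arithmetic concrete, the level-1 step of LocalAND might look like the following (our reading of Fig. 15, not the paper's code; the 64-bit vectors are held as Python integers, and only the count contribution and the descend mask are shown):

  # Minimal sketch (our illustration) of the level-1 arithmetic in LocalAND.
  # pure1_*/pure0_* are 64-bit masks over the 64 level-1 quadrants of a
  # fan-out-64 PV-tree (each level-1 quadrant covers 64*64 = 4096 bits).
  MASK64 = (1 << 64) - 1

  def level1_and(pure1_a, pure0_a, pure1_b, pure0_b):
      both_pure1 = pure1_a & pure1_b                  # quadrants pure 1 in both trees
      count = 4096 * bin(both_pure1).count("1")       # each such quadrant adds 4096 ones
      mixed_a = ~pure1_a & ~pure0_a & MASK64          # quadrants that are neither pure
      mixed_b = ~pure1_b & ~pure0_b & MASK64
      descend = mixed_a & mixed_b                     # Fig. 15 descends where both are mixed
      return count, descend

level1_and returns the 1-bit count that is already guaranteed at level 1, plus a mask of the quadrants where the second-level vectors must be consulted; levels 2 and 3 repeat the same step with weights 64 and 1.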
We use the Message Passing Interface (MPI) on the cluster to implement the logical operations on Peano Vector Trees. The program uses the Single Program Multiple Data (SPMD) paradigm. Fig. 16 plots the measured time to AND two Peano Vector Trees for a TM scene against the bit number; the AND time varies from 6.72 ms to 52.12 ms.

Fig. 16. PC-tree ANDing time vs. bit number (time in ms against the lower bit number of the two PC-trees)

With this high-speed ANDing, the TC-cube can be built very quickly. For example, for a 2-bit, 3-band TC-cube, the total AND time is about 1 second.

6 Related Work

There is other work discussing the problem of deriving high confidence rules [6,7,8]. Although these papers deal with non-spatial data and are therefore not directly comparable, we make some rough comparisons. In [7], rules are found that have extremely high confidence but little or no support, and a set of algorithms is proposed to find them. This work has two disadvantages. One is that only pairs of columns (attributes) are considered: all pairs of columns with similarity exceeding a pre-specified threshold are identified. The other is that the similarity measure is bi-directional, i.e., it measures the co-occurrence of antecedent and consequent symmetrically. In [6], a brute-force technique is used for mining classification rules; association rule mining is applied to the classification problem, i.e., a special rule set (a classifier) is derived. However, both support and confidence are used in the algorithm even though only the high confidence rules are targeted. Several pruning techniques are proposed, but there are trade-offs among them. [8] and [15] are similar in that they both apply the association rule mining method to the classification task: they turn an arbitrary set of association rules into a classifier. In [8], a confidence-based pruning method is proposed using a property called "existential upward closure" and is used for building a decision tree from association rules; the antecedent attribute is specified. Our model is more general than the models cited above and is particularly efficient and useful for spatial data mining.

The PC-tree structure is related to Quadtrees [10,11,13] and their variants (such as the point quadtree [13] and the region quadtree [10]), and to HHcodes [14]. The similarity among the PC-tree, the quadtree and HHcodes is that they are all quadrant based; the difference is that the PC-tree focuses on counts. PC-trees are beneficial not only for storing data but also for association rule mining, since they provide exactly the count information the mining algorithm needs.

7 Conclusion

In this paper, we propose a new model to derive high confidence rules from spatial data. Data cube techniques are used in our model. The basic data structure of our model, the PC-tree, carries more information than the original image file yet is small in size. We build a Tuple Count cube from which the high confidence rules can be derived. Currently we use the 16-node system to perform the ANDing operations for images of size 2048x2048. In the future, we will extend our system to 256 nodes so that we can handle images as large as 8192x8192. In that case, the PC-tree ANDing time will be approximately the same as in the 16-node system for a 2048x2048 image, since only the communication cost increases, and that increase is insignificant.
References

1. R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. ACM SIGMOD 1993, Washington, DC, May 1993.
2. R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. Proc. of Int'l Conf. on VLDB, Santiago, Chile, September 1994.
3. R. Srikant and R. Agrawal. Mining Quantitative Association Rules in Large Relational Tables. ACM SIGMOD 1996, Montreal, Canada.
4. J. S. Park, M.-S. Chen and P. S. Yu. An Effective Hash-Based Algorithm for Mining Association Rules. ACM SIGMOD 1995, CA, 1995.
5. J. Han, J. Pei and Y. Yin. Mining Frequent Patterns without Candidate Generation. ACM SIGMOD 2000, Dallas, TX, May 2000.
6. R. J. Bayardo Jr. Brute-Force Mining of High-Confidence Classification Rules. KDD 1997.
7. E. Cohen, M. Datar, S. Fujiwara et al. Finding Interesting Associations without Support Pruning. VLDB 2000.
8. K. Wang, S. Zhou and Y. He. Growing Decision Trees on Support-less Association Rules. KDD 2000, Boston, MA.
9. V. Gaede and O. Gunther. Multidimensional Access Methods. ACM Computing Surveys, 30(2), 1998.
10. H. Samet. The Quadtree and Related Hierarchical Data Structures. ACM Computing Surveys, 16(2), 1984.
11. H. Samet. Applications of Spatial Data Structures. Addison-Wesley, Reading, MA, 1990.
12. H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA, 1990.
13. R. A. Finkel and J. L. Bentley. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica, 4(1), 1974.
14. HH-code. Available at http://www.statkart.no/nlhdb/iveher/hhtext.htm
15. B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD 1998.
16. J. Dong, W. Perrizo, Q. Ding and J. Zhou. The Application of Association Rule Mining on Remotely Sensed Data. Proc. of the ACM Symposium on Applied Computing, Italy, March 2000.