Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University, Fargo, ND [email protected] Vertical Data and pTrees Traditionally, a file or dataset comprises one or more horizontal records, each record comprises one or more fields, and each field comprises some number of bits that represent a piece of data. Traditionally, a dataset is structured horizontally as a set of such horizontal records, and is processed vertically, record-by-record. Alternatively, a file or dataset may be structured vertically as a set of columns and processed horizontally, column-by-column. A column-structured dataset may be further vertically decomposed as a set of bit-vectors or bitslices and processed horizontally, bitslice-by-bitslice using logical operations. A pTree is a tree-like organization of these bitslices having a zero bit value for each internal node and a zero or one bit value for each of its leaves (external nodes). The three basic pTree operations are AND, OR and complement. An example follows. predicate Trees (pTrees): slice by attribute or column vertically slice off each bit position (12 vertical structures) then compress each bit slice into a tree using a predicate e.g., the compression of R11 into pTree, P11 : A pTree example 1st, Vertically Processing of Horizontal Data (VPHD) e.g., find the number of occurences of 7 0 1 4 =2 2nd, using pTrees find the number of occurences of 7 0 1 4 Base 10 for Horizontally structured, record-oriented data, one must scan vertically R(A1 A2 A3 A4) 2 6 3 2 3 2 7 7 7 7 7 7 2 2 0 0 6 6 5 5 1 1 1 1 1 0 1 7 4 5 4 4 R[A1] R[A2] R[A3] R[A4] Base 2 = 010 011 010 010 011 010 111 111 111 111 110 111 010 010 000 000 110 110 101 101 001 001 001 001 001 000 001 111 100 101 100 100 0 0 0 0 0 0 1 1 R11 pure1? false=0 pure1? true=1 pure1? false=0 pure1? false=0 Record truth of predicate: "purely 1-bits" in a tree, recursively on halves, until the half is pure. 1. Whole thing pure1? false 0 2. Left half pure1? false 0 3. Right half pure1? false 0 4. Left half of rt half ? false0 5. Rt half of right half? true1 pure1? false=0 010 011 010 010 011 010 111 111 111 111 110 111 010 010 000 000 110 110 101 101 001 001 001 001 001 000 001 111 100 101 100 100 R11 R12 R13 R21 R22 R23 R31 R32 R33 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 R41 R42 R43 0 0 0 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 1 0 0 P11 P12 P13 P11 But it's pure0 so this branch ends 0 0 0 01 1 0 0 0 0 0 0 0 01 1 P21 P22 P23 P31 P32 P33 P41 P42 P43 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 01 0 0 0 0 1 0 1 0 0 0 0 10 10 10 01 00 00 0001 0100 ^ 10 ^ ^ ^ 01 ^ ^ ^ 10 01 01 01 01 7 0 1 4 = To count (7,0,1,4)s use 111000001100 P11^P12^P13^P’21^P’22^P’23^P’31^P’32^P33^P41^P’42^P’43 0 *23 0 0 *22 0 1 *21 *20 =2 Securing Vertical pTree Data How can a pTree database be made secure without causing the processing to be slowed down instance, requiring a slow decryption step prior to processing)? (by, for A first level of security may be provided by re-ordering (e.g., permuting) the pTrees. The re-ordering method or permutation becomes the security key (those who know it can use the data and those who do not, cannot). For big data, key compression techniques might be necessary. A second level of security can be provided through random bit padding (at the bottom and/or top) Padding will randomly place the true starting and ending position of the pTree somewhere in the middle of the structure. This second level is intended to prevent an adversary from discovering the re-ordering by focusing on a smaller part of the dataset (e.g., the first bit or the last bit only). A third level of security is also possible when needed. That third level would involve re-ordering the bits within the actual pTrees. For datasets in which the ordering of the record instances is random or otherwise unknown, this would be unnecessary. However, if there is a known ordering, it might be possible for the clever adversary to discover the starting and ending positions of the pTrees. At this time, this third level seems problematic since it would probably require the encryption and therefore a time consuming decryption step prior to processing. There may be a way to re-order the pTree bits without introducing significant additional processing time? This is "future work". The "bottom" padding might be unnecessary from a security perspective (add little or no additional security, however, we believe it will make it much easier to secure volatile datasets (by providing a large bottom pad that can be overwritten as the dataset grows in cardinality) Implementations This security scheme can be implemented per dataset or over the entire database as a whole. That is, for each dataset, there could be a massive string of bits in which all the pTrees of just that dataset are embedded (separate massive string of bits for each dataset), or there could be a much more massive string of bits in which all the pTrees of the database are embedded (and intermingled for additional security?). The latter might be the best choice since it does not introduce any additional processing time overhead and it does provide additional security. In general, a security monitor subsystem would be used to implement this security scheme. The security monitor would be responsible for identification, authentication and authorization. The security monitor could manage free space within the massive bit strings (be they per dataset or per database). Conclusion This vertical database scheme involves at least two levels of security. The first level is a matter of disguising the location of each vertical bitslice structure (e.g. permuting the ordering of the bitslice). The second level is a matter of disguising the bit position location of the pTree within a bigger bit string (to make it harder to focus on the first bits only, etc.). An important fact is that neither of these two schemes imposes almost no additional processing time. The reason we can say that is pTree processing always involves 1. identifying those pTrees that are to be involved in the processing 2. locating those pTrees (which is always a matter of following a pointer and we simple shuffle the order of those pointers so that those who know the shuffle can find the data and those who do not, cannot. 3. loading the pertinent pTrees (which is just as fast as long as you know at which bit position to start the load. Those who know that start position, can load and those who do not, cannot.). Conclusion continued A third possible level of security was mentioned, that of encrypting or otherwise anonymizing each pTree itself. Again, this may be unnecessary if there is no know or expected record ordering involved. One simple way to [somewhat] anonymize individual pTrees it to shuffle the order of the levels within each pTree. This would be metadata (manage by the security monitor) but once known, could be used with no or very very little additional processing delay, since each level of a multi-level pTree is processed separately and therefore the locating portion of the process involves a pointer and no additional processing would be required on that pointer (it is just that the correct level of the correct pTree to be located would not be in the expected place within the pTree as stored.). Other partial anonymizations, short of full encryption, might suffice in most cases.