Vertical Database Design for Scalable Data Mining
Qiang Ding, Masum Serazi, Taufik Abidin, Baoying Wang, William Perrizo
North Dakota State University

BACKGROUND

For several decades, and especially with the preeminence of relational database systems, data has almost always been formed into horizontal record structures and then processed vertically (vertical scans of files of horizontal records). This makes good sense when the requested result is a set of horizontal records. In knowledge discovery and data mining, however, researchers are typically interested in collective properties or predictions that can be expressed very briefly. Therefore, the approaches for scan-based processing of horizontal records are known to be inadequate for data mining in very large data repositories (Han & Kamber, 2001; Han, Pei & Yin, 2000; Shafer, Agrawal & Mehta, 1996).

In contrast, more and more advantages of vertical data organization have been recognized. For example, it makes hardware caching work well; it makes compression easy; and it may greatly increase the effectiveness of the I/O device, since only participating fields are retrieved instead of whole records. The vertical decomposition of a relation also permits a number of transactions to execute concurrently.

To address the scalability problems of horizontal processing, much effort has been focused on subsampling and indexing. However, subsampling requires that the subsampler know enough about the large dataset in the first place to subsample "representatively". That is, representative subsampling presupposes considerable knowledge about the data. For many large datasets, such knowledge may be inadequate or nonexistent. Index files are vertical structures: they are vertical access paths to sets of horizontal records.
Some indices, such as the Bit-Sliced Index (BSI) (Chan & Ioannidis, 1998; O'Neil & Quass, 1997; Rinfret et al., 2001) and the Encoded Bitmap Index (EBI) (Wu & Buchmann, 1998; Wu, 1998), do address the scalability problem in many cases, but they do so at the cost of creating and maintaining additional index files separate from the data files. Sybase IQ (Sybase Inc., 1997), a commercial DBMS designed specifically for data warehousing applications, requires data to be fully inverted to achieve scalable data analyses. However, the original horizontal records still need to be retrieved during the analyses.

A conceptually different approach is to build the whole database vertically. Such a database can be used not only for routine data management but also for data mining. Unlike horizontal databases, which are stored horizontally and processed vertically, vertical databases are stored vertically and processed horizontally. Together with their other characteristics, this has been shown to let vertical databases address the scalability issues.

VERTICAL DATABASE DESIGN

The concept of vertical data files is not new. Copeland and Khoshafian (1985) presented an attribute-level Decomposition Storage Model (DSM), similar to the Attribute Transposed File model (ATF) (Batory, 1979), that stores each column of a relational table in a separate table. DSM was shown to perform well. It uses surrogate keys to map individual attributes together, and hence requires a surrogate key to be associated with each attribute of each record in the database. Attribute-level vertical decomposition is also used in remotely sensed imagery, e.g. Landsat Thematic Mapper imagery, where it is called Band Sequential (BSQ) format. Beyond attribute-level decomposition, Wong et al. (1985) presented the Bit Transposed File model (BTF), which further partitions each column to the bit level and uses encoding methods to reduce storage space.
Because accessing files directly through an operating system is difficult, a higher access layer, the database, was invented. In most cases, databases are stored horizontally, which is suitable for data retrieval but not for data mining. A vertical database, on the other hand, can serve both data retrieval and data mining. In vertical databases, data is stored vertically and processed horizontally through fast, multi-operand logical operations such as AND, OR, XOR, and complement. The predicate tree (P-tree) is one lossless vertical structure that meets this requirement. P-trees are suitable for representing numerical and categorical data and have been used successfully in OLAP operations (Wang et al., 2003) and in various data mining applications, including classification (Khan et al., 2002), clustering (Denton et al., 2002), and association rule mining (Ding et al., 2002).

A vertical database consists of a set of P-trees rather than a set of relational tables. To convert a relational table of horizontal records to a set of vertical P-trees, the table is first projected into columns, one for each attribute, retaining the original record order in each. Each attribute column is then further decomposed into separate bit vectors, one for each bit position of the values in that attribute. Figure 1 shows a relational table with three attributes, all of which are numeric. Figure 2 shows the decomposition of the relational table R into a set of bit vectors.

R(A1, A2, A3)
A1  A2  A3
 5   2   7
 2   3   2
 7   2   2
 7   2   5
 2   5   5
 4   7   1
 3   2   1
 1   3   4

Figure 1. A relational table R.

R(A1, A2, A3) in binary    Bit vectors
A1   A2   A3   | A11 A12 A13 | A21 A22 A23 | A31 A32 A33
101  010  111  |  1   0   1  |  0   1   0  |  1   1   1
010  011  010  |  0   1   0  |  0   1   1  |  0   1   0
111  010  010  |  1   1   1  |  0   1   0  |  0   1   0
111  010  101  |  1   1   1  |  0   1   0  |  1   0   1
010  101  101  |  0   1   0  |  1   0   1  |  1   0   1
100  111  001  |  1   0   0  |  1   1   1  |  0   0   1
011  010  001  |  0   1   1  |  0   1   0  |  0   0   1
001  011  100  |  0   0   1  |  0   1   1  |  1   0   0

Figure 2. Vertical decomposition of the table R.

After the decomposition, each bit vector is converted into a P-tree. P-trees can be 1-dimensional, 2-dimensional, or multi-dimensional. If the data has a natural dimension, as spatial data does, the P-tree dimension is matched to the data dimension. Otherwise, the dimension can be chosen to optimize the compression ratio. Figure 3 shows the construction of three 1-dimensional P-trees from the bit vectors of the second attribute, A2. They are built by recording the truth of the predicate "purely 1-bits" recursively on halves of the bit vectors until purity is reached.

[Figure 3: (a) P21, (b) P22, (c) P23]

Figure 3. P-trees of attributes A21, A22 and A23.

With various built-in engines, such as a query engine, an OLAP engine, and a data mining engine, a vertical database can be used to accomplish SPJ queries (Ding et al., 2002), OLAP operations (Wang et al., 2003), and various data mining applications. The system structure is described in the next section.

A SYSTEM PROTOTYPE

As a proof of concept, a prototype system for scalable data mining has been developed and tested successfully on top of the vertical database concept. The prototype was designed as a multi-layered software framework. The system is formally named DataMIME™ (Serazi et al., 2004). Its layers include the Data Mining Interface (DMI), the Data Capture and Data Integration Interface (DCI/DII), the Data Mining Algorithm (DMA) layer, and the Distributed P-tree Management Interface (DPMI). DMI performs counting, the most important data mining operation provided by P-trees, on basic P-trees, value P-trees, tuple P-trees, interval P-trees, and cube P-trees. DMI also provides the P-tree algebra, which has four operations, AND, OR, NOT (complement), and XOR, to implement the pointwise logical operations on P-trees for the DMA layer.
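The decomposition into bit vectors, the construction of 1-dimensional P-trees, and the pointwise P-tree algebra with counting can be illustrated with a minimal Python sketch. It is not the DataMIME™ implementation: the names PTree, bit_vectors, build, p_and, p_not, and root_count are hypothetical, and the sketch assumes that bit position 1 is the most significant bit and that every tree halves its segment at the midpoint. It uses the A2 column of table R from Figure 1.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PTree:
    """Node of a 1-dimensional P-tree (hypothetical representation)."""
    pure1: int   # 1 if the segment consists purely of 1-bits, else 0
    size: int    # number of bits covered by this node
    # None for a pure segment (all 0s or all 1s); otherwise the two halves.
    children: Optional[Tuple["PTree", "PTree"]] = None


def bit_vectors(column: List[int], width: int) -> List[List[int]]:
    """Split a column into `width` bit vectors, most significant bit first."""
    return [[(v >> (width - 1 - j)) & 1 for v in column] for j in range(width)]


def build(bits: List[int]) -> PTree:
    """Record the 'purely 1-bits' predicate recursively on halves until purity."""
    n, ones = len(bits), sum(bits)
    if ones == n:
        return PTree(1, n)
    if ones == 0:
        return PTree(0, n)
    mid = n // 2
    return PTree(0, n, (build(bits[:mid]), build(bits[mid:])))


def root_count(t: PTree) -> int:
    """Number of 1-bits represented by the tree."""
    if t.children is None:
        return t.size * t.pure1
    return sum(root_count(c) for c in t.children)


def p_not(t: PTree) -> PTree:
    """Complement: flip pure leaves, keep the shape of mixed nodes."""
    if t.children is None:
        return PTree(1 - t.pure1, t.size)
    return PTree(0, t.size, (p_not(t.children[0]), p_not(t.children[1])))


def p_and(a: PTree, b: PTree) -> PTree:
    """Pointwise AND of two trees over the same segment, without expanding bits."""
    if a.children is None:
        return b if a.pure1 else a      # pure-1 AND b = b; pure-0 AND b = pure-0
    if b.children is None:
        return a if b.pure1 else b
    left = p_and(a.children[0], b.children[0])
    right = p_and(a.children[1], b.children[1])
    if left.children is None and right.children is None and left.pure1 == right.pure1:
        return PTree(left.pure1, a.size)  # both halves pure and equal: collapse
    return PTree(0, a.size, (left, right))


A2 = [2, 3, 2, 2, 5, 7, 2, 3]                    # column A2 of table R
p2 = [build(v) for v in bit_vectors(A2, 3)]      # P21, P22, P23

# Tuples with A2 = 2 (binary 010): NOT P21 AND P22 AND NOT P23.
value_2 = p_and(p_and(p_not(p2[0]), p2[1]), p_not(p2[2]))
print(root_count(value_2))                       # 4 tuples have A2 = 2
```

The count is obtained from logical operations on the compressed trees alone: pure subtrees short-circuit the AND, so no scan of the original records is needed.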
DCI/DII allows users to capture data and integrate it into the format required by the system (P-tree format). The DPMI layer provides access, location, and concurrency transparency by hiding the facts that data representations may differ, resource access protocols may vary, and resources may be located in different places and shared by several competing users. The DMA layer contains a collection of data mining tools, e.g. P-KNN (Khan et al., 2002), PINE (Perrizo et al., 2003), P-BAYESIAN (Perera et al., 2002), P-SVM (Pan et al., 2004), and P-ARM (Ding et al., 2002). Besides these core layers, the system provides a graphical user interface for flexible user interaction with the system.

To understand how the vertical database concept affects the system, a few key points must be grasped. Unlike in a traditional database, data is not stored in a horizontal, row-based format but in a compressed, vertical P-tree format. The DPMI layer is responsible for storing and managing this P-tree-based vertical data in the system. The efficient bitwise operations on vertical data give the data mining algorithms their scalability, and these operations are provided through the DMI layer. Finally, this uniform, efficient vertical data structure at the lowest layer can take advantage of the latest hardware.

CONCLUSION

Horizontal data structures have proven inefficient for data mining on very large data sets because of the high cost of scanning. It is therefore important to develop vertical data structures and algorithms that solve the scalability issue. Various structures have been proposed, among which the P-tree is a very promising vertical structure. This database model is not a set of indexes but a collection of representations of the dataset itself. P-trees have shown great performance in processing data containing large numbers of tuples, owing to the fast logical AND operation that requires no scanning (Ding et al., 2002).
In general, horizontal data organization is preferable for transactional data whose intended output is a relation, and vertical data structures are more appropriate for data mining on very large data sets.

REFERENCES

Batory, D. S. (1979). On Searching Transposed Files. ACM Transactions on Database Systems, 4(4):531-544.
Chan, C. Y. and Ioannidis, Y. (1998). Bitmap index design and evaluation. Proceedings of the ACM SIGMOD, 355-366.
Copeland, G. and Khoshafian, S. (1985). Decomposition Storage Model. Proceedings of the ACM SIGMOD, 268-279.
Denton, A., Ding, Q., Perrizo, W., and Ding, Q. (2002). Efficient Hierarchical Clustering of Large Data Sets Using P-Trees. Proceedings of the International Conference on Computer Applications in Industry and Engineering, 138-141.
Ding, Q., Ding, Q., and Perrizo, W. (2002). Association Rule Mining on Remotely Sensed Images Using P-trees. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 66-79.
Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA, Morgan Kaufmann.
Han, J., Pei, J., and Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX.
Khan, M., Ding, Q., and Perrizo, W. (2002). K-nearest neighbor classification on spatial data stream using P-trees. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Springer-Verlag, Lecture Notes in Artificial Intelligence 2336, 517-528.
O'Neil, P. and Quass, D. (1997). Improved Query Performance with Variant Indexes. Proceedings of the ACM SIGMOD, 38-49.
Perera, A., Serazi, M., and Perrizo, W. (2002). Performance Improvement for Bayesian Classification on Spatial Data with P-Trees. CAINE.
Perrizo, W., Ding, Q., Denton, A., Scott, K., Ding, Q., and Khan, M. (2003). Podium Incremental Neighbor Evaluator for Spatial Data using P-trees. SAC.
Rinfret, D., O'Neil, P., and O'Neil, E. (2001). Bit-Sliced Index Arithmetic. Proceedings of the ACM SIGMOD, 47-57.
Serazi, M., Perera, A., Ding, Q., Malakhov, V., Rahal, I., Pan, F., Ren, D., Wu, W., and Perrizo, W. (2004). DataMIME™. ACM SIGMOD.
Shafer, J., Agrawal, R., and Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 544-555, Bombay, India.
Sybase Inc. (1997). Sybase IQ Indexes. In Sybase IQ Administration Guide, Sybase IQ Release 11.2 Collection, chapter 5. AIPD Technical Publications.
Wong, H. K. T., Liu, H.-F., Olken, F., Rotem, D., and Wong, L. (1985). Bit Transposed Files. Proceedings of VLDB, 448-457.
Wu, M.-C. and Buchmann, A. (1998). Encoded bitmap indexing for data warehouses. Proceedings of the IEEE International Conference on Data Engineering, 220-230.
Wu, M.-C. (1998). Query Optimization for Selections using Bitmaps. Technical Report DVS98-2, DVS1, Computer Science Department, Technische Universität Darmstadt.

Terms and Definitions

Predicate Tree (P-tree): A lossless tree that is vertically structured and horizontally processed through fast multi-operand logical operations.
P-Tree Algebra: The set of logical operations, functions, and properties of P-trees. Basic logical operations include AND, OR, and complement.
Vertical Decomposition: A process of partitioning a relational table of horizontal records into separate vertical data files, either at the attribute level or at the bit level, usually retaining the original record order in each.
Vertical Database Design: A process of developing a vertical data model, usually with intended data mining functionality, that uses logical operations for fast data processing.
Vertical Data Mining: A process of finding patterns and knowledge in data that is organized in vertical structures, with the aim of addressing scalability issues.
Multi-Layered Software Framework: A layer-based software environment in which each layer is a group of entities dedicated to performing a particular task.
DataMIME™: A prototype system designed and implemented on top of vertical database technology and a multi-layered software framework by the DataSURG group at North Dakota State University, ND, USA.