Vertical Database Design for Scalable Data Mining

William Perrizo, North Dakota State University, USA
Qiang Ding, Concordia College – Moorhead, USA
Masum Serazi, North Dakota State University, USA
Taufik Abidin, North Dakota State University, USA
Baoying Wang, North Dakota State University, USA

INTRODUCTION

For several decades, and especially with the preeminence of relational database systems, data has almost always been formed into horizontal record structures and then processed vertically (vertical scans of files of horizontal records). This makes good sense when the requested result is a set of horizontal records. In knowledge discovery and data mining, however, researchers are typically interested in collective properties or predictions that can be expressed very briefly. Therefore, the approaches for scan-based processing of horizontal records are known to be inadequate for data mining in very large data repositories (Han & Kamber, 2001; Han, Pei & Yin, 2000; Shafer, Agrawal & Mehta, 1996).

By contrast, more and more advantages of vertical data organization have been recognized. For example, it makes hardware caching work well, it makes compression easy to do, and it may greatly increase the effectiveness of the I/O device, since only the participating fields are retrieved instead of the whole record. The vertical decomposition of a relation also permits a number of transactions to execute concurrently.

Much effort has also been focused on sub-sampling and indexing to address problems of scalability. However, sub-sampling requires that the sub-sampler know enough about the large dataset in the first place to sub-sample “representatively”. That is, representative sub-sampling presupposes considerable knowledge about the data. For many large datasets, such knowledge may be inadequate or non-existent. Index files are vertical structures that serve as vertical access paths to sets of horizontal records.
Some indices, such as the Bit-Sliced Index (BSI) (Chan & Ioannidis, 1998; O’Neil & Quass, 1997; Rinfret et al, 2001) and the Encoded Bitmap Index (EBI) (Wu & Buchmann, 1998; Wu, 1998), do address the scalability problem in many cases, but they do so at the cost of creating and maintaining additional index files separate from the data files. Sybase IQ (Sybase Inc., 1997), a commercial DBMS specially designed for data warehousing applications, requires data to be fully inverted to achieve scalable data analyses. However, the original horizontal records still need to be retrieved during the analyses.

Another approach, conceptually different from the above, is to build the whole database vertically. Such a database can be used not only for routine data management but also for data mining. Unlike horizontal databases, which are stored horizontally and processed vertically, vertical databases are stored vertically and processed horizontally. Together with their other characteristics, vertical databases have been shown to address the scalability issues.

BACKGROUND

The concept of vertical data files is, in fact, not new at all. Copeland & Khoshafian (1985) presented an attribute-level Decomposition Storage Model (DSM), similar to the Attribute Transposed File model (ATF) (Batory, 1979), that stores each column of a relational table in a separate table. DSM was shown to perform well. It utilizes surrogate keys to map individual attributes together, hence requiring a surrogate key to be associated with each attribute of each record in the database. Attribute-level vertical decomposition is also used in remotely sensed imagery, e.g. Landsat Thematic Mapper imagery, where it is called Band Sequential (BSQ) format. Beyond attribute-level decomposition, Wong et al (1985) presented the Bit Transposed File model (BTF), which further partitions each column to the bit level and uses encoding methods to reduce the storage space.
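
As a minimal sketch of the two levels of decomposition described above, the following Python fragment splits a small table into one (surrogate key, value) list per attribute, in the spirit of DSM, and then slices one column into bit vectors, in the spirit of BTF. The table contents, the function names, and the 3-bit value width are illustrative assumptions, not taken from the cited papers, and the encoding step that BTF uses to save space is omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical 3-attribute table; each row is one horizontal record.
ROWS: List[Tuple[int, int, int]] = [(5, 2, 7), (2, 3, 2), (7, 2, 2), (7, 2, 5)]
ATTRS = ["A1", "A2", "A3"]


def decompose_attributes(rows, attrs) -> Dict[str, List[Tuple[int, int]]]:
    """Attribute-level decomposition: one (surrogate_key, value) list per column."""
    return {
        name: [(key, row[i]) for key, row in enumerate(rows)]
        for i, name in enumerate(attrs)
    }


def bit_slices(values: List[int], width: int) -> List[List[int]]:
    """Bit-level decomposition: one bit vector per bit position, MSB first."""
    return [[(v >> (width - 1 - j)) & 1 for v in values] for j in range(width)]


def rebuild_row(columns, key: int) -> Tuple[int, ...]:
    """Reassemble a horizontal record by joining the columns on the surrogate key."""
    return tuple(dict(columns[name])[key] for name in ATTRS)


columns = decompose_attributes(ROWS, ATTRS)
a1_values = [value for _, value in columns["A1"]]
a1_bits = bit_slices(a1_values, width=3)
```

Here the record position doubles as the surrogate key, so a horizontal record can still be recovered from the vertical columns when it is needed; a scan over a single attribute touches only that attribute's column.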
Because accessing files directly through an operating system is difficult, a higher access layer known as the database was invented. In most cases, databases are stored horizontally, which is suitable for data retrieval but not for data mining. A vertical database, on the other hand, can serve both data retrieval and data mining purposes.

MAIN THRUST OF THE CHAPTER

Vertical Databases

In vertical databases, data is stored vertically and processed horizontally through fast, multi-operand logical operations such as AND, OR, XOR, and complement. The predicate tree (P-tree) is one lossless vertical structure that meets this requirement. P-trees are suitable for representing numerical and categorical data and have been successfully used in OLAP operations (Wang et al, 2003) and various data mining applications, including classification (Khan et al, 2002), clustering (Denton et al, 2002), and association rule mining (Ding et al, 2002). A vertical database consists of a set of P-trees rather than a set of relational tables.

To convert a relational table of horizontal records to a set of vertical P-trees, the table has to be projected into columns, one for each attribute, retaining the original record order in each. Then each attribute column is further decomposed into separate bit vectors, one for each bit position of the values in that attribute. Figure 1 shows a relational table with three attributes, all of which are numeric. Figure 2 shows the decomposition process from the relational table R to a set of bit vectors.

R (A1, A2, A3)

A1   A2   A3
5    2    7
2    3    2
7    2    2
7    2    5
2    5    5
4    7    1
3    2    1
1    3    4

Figure 1. Relational table R.

R (A1, A2, A3), with each value written as three bits

A1    A2    A3
101   010   111
010   011   010
111   010   010
111   010   101
010   101   101
100   111   001
011   010   001
001   011   100

Bit vectors in record order (Aij is bit j of attribute Ai, most significant bit first)

A11: 1 0 1 1 0 1 0 0
A12: 0 1 1 1 1 0 1 0
A13: 1 0 1 1 0 0 1 1
A21: 0 0 0 0 1 1 0 0
A22: 1 1 1 1 0 1 1 1
A23: 0 1 0 0 1 1 0 1
A31: 1 0 0 1 1 0 0 1
A32: 1 1 1 0 0 0 0 0
A33: 1 0 0 1 1 1 1 0

Figure 2. Vertical decomposition of the table R.

After the decomposition process, each bit vector is converted into a P-tree. P-trees can be 1-dimensional, 2-dimensional, or multi-dimensional. If the data has a natural dimension, for instance spatial data, the P-tree dimension is matched to the data dimension. Otherwise, the dimension can be chosen to optimize the compression ratio. Figure 3 shows the construction of three 1-dimensional P-trees from the bit vectors of the second attribute A2. They are built by recording the truth of the predicate “purely 1-bits” recursively on halves of the bit vectors until purity is reached.

(a) P21    (b) P22    (c) P23

Figure 3. P-trees of attributes A21, A22 and A23.

With various built-in engines, such as a query engine, an OLAP engine, and a data mining engine, a vertical database can be used to accomplish SPJ queries (Ding et al 2002), OLAP operations (Wang et al, 2003), and various data mining applications. The system structure is described in detail in the next section.

System Prototype

As a proof of concept, a prototype system has been developed and tested successfully for scalable data mining on top of the vertical database concept. A multi-layered software framework approach was taken to design the prototype. The system is formally named DataMIME™ (Serazi et al, 2004). The layers of the system include the Data Mining Interface (DMI), the Data Capture and Data Integration Interface (DCI/DII), the Data Mining Algorithm (DMA) layer, and the Distributed Ptree Management Interface (DPMI). DMI does counting, the most important operation for data mining provided by P-trees, including basic P-trees, value P-trees, tuple P-trees, interval P-trees, and cube P-trees. DMI also provides the P-tree algebra, which has four operations, AND, OR, NOT (complement), and XOR, to implement the pointwise logical operations on P-trees for the DMA layer.
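
The recursive "purely 1-bits" construction, together with the AND and counting operations just mentioned, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DataMIME implementation: the nested-tuple representation (a pair for a mixed segment instead of a node labelled 0) and the names build_ptree, root_count, and ptree_and are hypothetical. The input vectors are the three bit vectors of attribute A2 from the table R.

```python
from typing import List, Tuple, Union

# A node is 1 (segment is purely 1-bits), 0 (segment is purely 0-bits),
# or a (left, right) pair for a mixed segment that is split in half.
PTree = Union[int, Tuple["PTree", "PTree"]]


def build_ptree(bits: List[int]) -> PTree:
    """Recursively halve the bit vector until each segment is pure."""
    if all(b == 1 for b in bits):
        return 1
    if all(b == 0 for b in bits):
        return 0
    mid = len(bits) // 2
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))


def root_count(tree: PTree, size: int) -> int:
    """Number of 1-bits represented by a tree that covers `size` positions."""
    if tree == 1:
        return size
    if tree == 0:
        return 0
    left, right = tree
    return root_count(left, size // 2) + root_count(right, size - size // 2)


def ptree_and(a: PTree, b: PTree) -> PTree:
    """Pointwise AND of two P-trees built over vectors of the same length."""
    if a == 0 or b == 0:
        return 0          # a pure-0 segment decides the result without descent
    if a == 1:
        return b
    if b == 1:
        return a
    left, right = ptree_and(a[0], b[0]), ptree_and(a[1], b[1])
    return 0 if (left, right) == (0, 0) else (left, right)


A21 = [0, 0, 0, 0, 1, 1, 0, 0]
A22 = [1, 1, 1, 1, 0, 1, 1, 1]
A23 = [0, 1, 0, 0, 1, 1, 0, 1]
p21, p22, p23 = build_ptree(A21), build_ptree(A22), build_ptree(A23)
both_set = ptree_and(p21, p23)   # records whose first and last A2 bits are 1
```

Because pure segments are stored as single nodes, ptree_and stops descending as soon as one operand is pure, which is where the saving over a bit-by-bit scan comes from in this sketch.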
DCI/DII allows users to capture data and integrate it into the format required by the system (P-tree format). The DPMI layer provides access, location, and concurrency transparency by hiding the fact that data representation may differ, resource access protocols may vary, and resources may be located in different places and shared by several competing users. The DMA layer contains a collection of data mining tools, e.g. P-KNN (Khan et al, 2002), PINE (Perrizo et al, 2003), P-BAYESIAN (Perera et al, 2002), P-SVM (Pan et al, 2004), and P-ARM (Ding et al, 2002). Besides these core layers, the system provides a graphical user interface that adds flexible user interaction with the system.

To comprehend how the vertical database concept affects the system, some key concepts must be grasped. Unlike in a traditional database, data is not stored in a horizontal row-based format; rather, it is stored in a compressed vertical P-tree format. The DPMI layer is responsible for storing and managing this P-tree based vertical data in the system. The efficient bit-wise operations on vertical data provide the scalability for data mining algorithms, and they are made available through the DMI layer. Finally, this uniform, efficient vertical data structure at the lowest layer can take advantage of the latest hardware.

FUTURE TRENDS

Vertical databases will become more and more important as many data sets have become extremely large. Research has shown that scanning the entire data set horizontally is inefficient and non-scalable. Therefore, a new scalable approach is critical. The vertical database has proven to be a scalable methodology that can be used to perform fast, efficient, and effective data mining on large data sets by organizing data in vertical layouts and conducting logical operations on vertically partitioned data without scanning. There is also great potential to combine vertical databases with parallel data mining as well as with hardware.
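
The idea of answering a counting query with logical operations on vertical bit vectors, rather than by scanning horizontal records, can be sketched as follows. The sketch assumes uncompressed bit vectors packed into Python integers instead of P-trees, 3-bit attribute values, and the hypothetical helper names to_bitmap and value_bitmap. It uses the bit vectors of attributes A1 and A2 from the eight records of the table R and counts the records with A1 = 7 and A2 = 2.

```python
from typing import List

WIDTH = 3    # bits per attribute value (assumption for this example)
N_ROWS = 8   # number of records in the table R


def to_bitmap(bits: List[int]) -> int:
    """Pack a bit vector into one int, first record in the highest bit."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def value_bitmap(slices: List[int], target: int, n_rows: int) -> int:
    """Bitmap of the records whose attribute equals `target`.

    `slices` holds one bitmap per bit position, MSB first. A slice is used
    as is where the target bit is 1 and complemented where it is 0.
    """
    mask = (1 << n_rows) - 1
    result = mask
    for j, s in enumerate(slices):
        bit = (target >> (WIDTH - 1 - j)) & 1
        result &= s if bit else (~s & mask)
    return result


A1 = [to_bitmap(v) for v in ([1, 0, 1, 1, 0, 1, 0, 0],
                             [0, 1, 1, 1, 1, 0, 1, 0],
                             [1, 0, 1, 1, 0, 0, 1, 1])]
A2 = [to_bitmap(v) for v in ([0, 0, 0, 0, 1, 1, 0, 0],
                             [1, 1, 1, 1, 0, 1, 1, 1],
                             [0, 1, 0, 0, 1, 1, 0, 1])]

# One AND combines the two single-attribute predicates for all records at once.
matches = value_bitmap(A1, 7, N_ROWS) & value_bitmap(A2, 2, N_ROWS)
count = bin(matches).count("1")
```

Each logical operation here processes all eight records in one step and never touches attribute A3, which illustrates why only the participating bit vectors need to be read.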
CONCLUSION

Horizontal data structures have been shown to be inefficient for data mining on very large data sets because of the large cost of scanning. It is important to develop vertical data structures and algorithms to solve the scalability issue. Various structures have been proposed, among which the P-tree is a very promising vertical structure. This database model is not a set of indexes but a collection of representations of the dataset itself. P-trees have shown great performance in processing data containing a large number of tuples, owing to the fast logical AND operation without scanning (Ding et al, 2002). In general, horizontal data organization is preferable for transactional data whose intended output is a relation, and vertical data structures are more appropriate for data mining on very large data sets.

REFERENCES

Batory, D. S. (1979). On Searching Transposed Files. ACM Transactions on Database Systems, 4(4):531-544.
Chan, C. Y. and Ioannidis, Y. (1998). Bitmap Index Design and Evaluation. Proceedings of the ACM SIGMOD, 355-366.
Copeland, G. and Khoshafian, S. (1985). Decomposition Storage Model. Proceedings of the ACM SIGMOD, 268-279.
Denton, A., Ding, Q., Perrizo, W., and Ding, Q. (2002). Efficient Hierarchical Clustering of Large Data Sets Using Ptrees. Proceedings of International Conference on Computer Applications in Industry and Engineering, 138-141.
Ding, Q., Ding, Q., and Perrizo, W. (2002). Association Rule Mining on Remotely Sensed Images Using Ptrees. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 66-79.
Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA, Morgan Kaufmann.
Han, J., Pei, J., and Yin, Y. (2000). Mining Frequent Patterns without Candidate Generation. Proceedings of the ACM SIGMOD, 1-12.
Khan, M., Ding, Q., and Perrizo, W. (2002). K-nearest Neighbor Classification on Spatial Data Stream Using Ptrees. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 517-528.
O’Neil, P. and Quass, D. (1997). Improved Query Performance with Variant Indexes. Proceedings of the ACM SIGMOD, 38-49.
Perera, A., Serazi, M., and Perrizo, W. (2002). Performance Improvement for Bayesian Classification on Spatial Data with P-Trees. Proceedings of International Conference on Computer Applications in Industry and Engineering.
Perrizo, W., Ding, Q., Denton, A., Scott, K., Ding, Q., and Khan, M. (2003). Podium Incremental Neighbor Evaluator for Spatial Data using P-trees. ACM Symposium on Applied Computing.
Rinfret, D., O’Neil, P., and O’Neil, E. (2001). Bit-Sliced Index Arithmetic. Proceedings of the ACM SIGMOD, 47-57.
Serazi, M., Perera, A., Ding, Q., Malakhov, V., Rahal, I., Pan, F., Ren, D., Wu, W., and Perrizo, W. (2004). DataMIME™. Proceedings of the ACM International Conference on Management of Data.
Shafer, J., Agrawal, R., and Mehta, M. (1996). SPRINT: A Scalable Parallel Classifier for Data Mining. Proceedings of the International Conference on Very Large Data Bases, 544-555.
Sybase Inc. (1997). Sybase IQ Indexes. In Sybase IQ Administration Guide, Sybase IQ Release 11.2 Collection, Chapter 5. AIPD Technical Publications.
Wang, B., Pan, F., Ren, D., Cui, Y., Ding, Q., and Perrizo, W. (2003). Efficient OLAP Operations for Spatial Data Using P-trees. 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.
Wong, H. K. T., Liu, H.-F., Olken, F., Rotem, D., and Wong, L. (1985). Bit Transposed Files. Proceedings of the International Conference on Very Large Data Bases, 448-457.
Wu, M-C. and Buchmann, A. (1998). Encoded Bitmap Indexing for Data Warehouses. Proceedings of IEEE International Conference on Data Engineering, 220-230.
Wu, M-C. (1998). Query Optimization for Selections using Bitmaps. Technical Report, DVS98-2, DVS1, Computer Science Department, Technische Universität Darmstadt.

TERMS AND THEIR DEFINITION

Predicate Tree (P-tree): A lossless tree that is vertically structured and horizontally processed through fast multi-operand logical operations.
P-Tree Algebra: The set of logical operations, functions and properties of P-trees. Basic logical operations include AND, OR, and complement.
Vertical Decomposition: A process of partitioning a relational table of horizontal records into separate vertical data files, at either the attribute level or the bit level, usually retaining the original record order in each.
Vertical Database Design: A process of developing a vertical data model, usually with intended data mining functionality, that utilizes logical operations for fast data processing.
Vertical Data Mining: A process of finding patterns and knowledge in data that is organized in vertical structures, which aims to address the scalability issues.
Multi-Layered Software Framework: A layer-based software environment in which each layer is a group of entities dedicated to performing a particular task.
DataMIME™: A prototype system that has been designed and implemented on top of vertical database technology and a multi-layered software framework by the DataSURG group at North Dakota State University, ND, USA.