Vertical Database Design for Scalable
Data Mining
Qiang Ding, Masum Serazi, Taufik Abidin, Baoying Wang, William Perrizo
North Dakota State University
BACKGROUND
For several decades and especially with the preeminence of relational database
systems, data is almost always formed into horizontal record structures and then
processed vertically (vertical scans of files of horizontal records). This makes good
sense when the requested result is a set of horizontal records. In knowledge
discovery and data mining, however, researchers are typically interested in
collective properties or predictions that can be expressed very briefly. Therefore,
the approaches for scan-based processing of horizontal records are known to be
inadequate for data mining in very large data repositories (Han & Kamber, 2001;
Han, Pei & Yin, 2000; Shafer, Agrawal & Mehta, 1996).
In contrast, the advantages of vertical data organization are increasingly
recognized. For example, it makes hardware caching work well; it makes
compression easy to do; and it may greatly increase the effectiveness of the I/O
device, since only participating fields are retrieved instead of the whole record.
The vertical decomposition of a relation also permits a number of transactions to
execute concurrently. To address problems of scalability, much effort has been
focused on sub-sampling and indexing. However, sub-sampling requires that the
sub-sampler know enough about the large dataset in the first place to sub-sample
“representatively”. That is, representative sub-sampling presupposes
considerable knowledge about the data. For many large datasets, such knowledge
may be inadequate or non-existent.
Index files are vertical structures and they are vertical access paths to sets of
horizontal records. Some indices, such as the Bit-Sliced Index (BSI) (Chan &
Ioannidis, 1998; O’Neil & Quass, 1997; Rinfret et al, 2001), and Encoded Bitmap
Index (EBI) (Wu & Buchmann, 1998; Wu, 1998), do address the scalability
problem in many cases, but they do so at the cost of creating and maintaining
additional index files separate from the data files.
Sybase IQ (Sybase Inc., 1997), a commercial DBMS specially designed for data
warehousing applications, requires data to be fully inverted to achieve scalable data
analyses. However, original horizontal records still need to be retrieved during the
analyses.
Another approach, conceptually different from those above, is to build the
whole database vertically. Such a database can be used not only for routine data
management but also for data mining. Unlike horizontal databases, which are
stored horizontally and processed vertically, vertical databases are stored vertically
and processed horizontally. Together with their other characteristics, this allows
vertical databases to address the scalability issues.
VERTICAL DATABASE DESIGN
The concept of vertical data files is, in fact, not new at all. Copeland et al
(1985) presented an attribute-level Decomposition Storage Model called DSM,
similar to the Attribute Transposed File model (ATF) (Batory, 1979), that stores
each column of a relational table in a separate table. DSM was shown to perform
well. It utilizes surrogate keys to map individual attributes together, hence
requiring a surrogate key to be associated with each attribute of each record
in the database. Attribute-level vertical decomposition is also used in remotely
sensed imagery, e.g. Landsat Thematic Mapper imagery, where it is called Band
Sequential (BSQ) format. Beyond attribute-level decomposition, Wong et al (1985)
presented the Bit Transposed File model (BTF), which further partitions each
column to the bit level and utilizes encoding methods to reduce the storage space.
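The attribute-level decomposition described above can be illustrated with a
minimal sketch. It assumes that the surrogate key is simply the record position;
the function names decompose and reconstruct are hypothetical and are not taken
from the cited papers.

```python
# Hypothetical sketch of attribute-level decomposition in the spirit of DSM.
# Assumption: the surrogate key is the position of the record in the table.

def decompose(rows, attributes):
    """Split horizontal records into one (surrogate, value) list per attribute."""
    columns = {name: [] for name in attributes}
    for surrogate, row in enumerate(rows):
        for name, value in zip(attributes, row):
            columns[name].append((surrogate, value))
    return columns


def reconstruct(columns, attributes):
    """Rebuild horizontal records by joining the columns on the surrogate key."""
    lookups = [dict(columns[name]) for name in attributes]
    surrogates = sorted(lookups[0])
    return [tuple(lookup[s] for lookup in lookups) for s in surrogates]


rows = [(5, 2, 7), (2, 3, 2), (7, 2, 2)]
attributes = ["A1", "A2", "A3"]

columns = decompose(rows, attributes)
# Only the participating attribute is touched when a single column is needed.
a2_only = [value for _, value in columns["A2"]]
restored = reconstruct(columns, attributes)
```

Reading a2_only touches one column only, which is the I/O benefit attributed to
vertical organization, while reconstruct shows why every attribute value has to
carry its surrogate key.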
Because accessing files directly through an operating system is difficult, a
higher access layer, the database, was invented. In most cases, databases are
stored horizontally, which is suitable for data retrieval but not for data mining.
A vertical database, on the other hand, can serve both data retrieval and
data mining purposes.
In vertical databases, data is stored vertically and processed horizontally
through fast, multi-operand logical operations, such as AND, OR, XOR, and
complement. The predicate tree (P-tree) is one lossless vertical structure that
meets this requirement. P-trees can represent both numerical and categorical data
and have been successfully used in OLAP operations (Wang et al, 2003) and various
data mining applications, including classification (Khan et al, 2002), clustering
(Denton et al, 2002), and association rule mining (Ding et al, 2002).
A vertical database consists of a set of P-trees rather than a set of relational
tables. To convert a relational table of horizontal records to a set of vertical
P-trees, the table has to be projected into columns, one for each attribute,
retaining the original record order in each. Then each attribute column is further
decomposed into separate bit vectors, one for each bit position of the values in that
attribute. Figure 1 shows a relational table with three attributes, in which all of the
attributes are numeric. Figure 2 shows the decomposition process from the
relational table R to a set of bit vectors.
R (A1, A2, A3)
A1  A2  A3
5   2   7
2   3   2
7   2   2
7   2   5
2   5   5
4   7   1
3   2   1
1   3   4
Figure 1. A relational table R.
R (A1, A2, A3), values in binary
A1   A2   A3
101  010  111
010  011  010
111  010  010
111  010  101
010  101  101
100  111  001
011  010  001
001  011  100
Bit vectors, one per bit position (leftmost bit first), in record order
A11 = 10110100   A12 = 01111010   A13 = 10110011
A21 = 00001100   A22 = 11110111   A23 = 01001101
A31 = 10011001   A32 = 11100000   A33 = 10011110
Figure 2. Vertical decomposition of the table R
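The decomposition of Figure 2 can be sketched in a few lines. The sketch assumes
3-bit unsigned values and numbers the bit vectors from the highest-order bit, as
the figure suggests; the function names are hypothetical.

```python
# Minimal sketch of bit-level vertical decomposition (values from Figure 1).
# Assumption: every attribute fits in 3 bits and Ai1 is the highest-order bit.

def to_bit_vectors(column, width):
    """Return one bit vector per bit position, highest-order bit first."""
    return [
        [(value >> (width - 1 - position)) & 1 for value in column]
        for position in range(width)
    ]


def from_bit_vectors(vectors):
    """Recombine bit vectors (highest-order first) into the original values."""
    values = [0] * len(vectors[0])
    for bits in vectors:
        values = [(value << 1) | bit for value, bit in zip(values, bits)]
    return values


table = {
    "A1": [5, 2, 7, 7, 2, 4, 3, 1],
    "A2": [2, 3, 2, 2, 5, 7, 2, 3],
    "A3": [7, 2, 2, 5, 5, 1, 1, 4],
}

bit_vectors = {}
for name, column in table.items():
    for position, bits in enumerate(to_bit_vectors(column, width=3), start=1):
        bit_vectors[f"{name}{position}"] = bits

# The decomposition is lossless: each column can be rebuilt from its vectors.
a2_restored = from_bit_vectors([bit_vectors["A21"], bit_vectors["A22"], bit_vectors["A23"]])
```

Because the record order is retained in every bit vector, position k in each
vector always refers to the k-th record of R.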
After the decomposition process, each bit vector is converted into a P-tree.
P-trees can be 1-dimensional, 2-dimensional, or multi-dimensional. If the data
has a natural dimension, for instance spatial data, the P-tree dimension is matched
to the data dimension. Otherwise, the dimension can be chosen to optimize the
compression ratio. Figure 3 shows the construction of three 1-dimensional P-trees
from the bit vectors of the second attribute A2. They are built by recording the truth
of the predicate “purely 1-bits” recursively on halves of the bit vectors until purity
is reached.
Figure 3. P-trees of attributes A21, A22 and A23: (a) P21, (b) P22, (c) P23
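The recursive construction can be sketched as follows. The nested-tuple node
layout (value, left, right) is an assumption made for illustration and is not the
storage format of any actual P-tree implementation; the bit vector is A21, read
off the binary values of A2.

```python
# Minimal sketch of 1-dimensional P-tree construction by recursive halving.
# Assumption: a node is (value, left, right); pure nodes have no children.

def build_ptree(bits):
    """Record the predicate "purely 1-bits" on halves until purity is reached."""
    if all(bits):
        return (1, None, None)          # purely 1-bits: stop
    if not any(bits):
        return (0, None, None)          # purely 0-bits: stop
    middle = len(bits) // 2
    return (0, build_ptree(bits[:middle]), build_ptree(bits[middle:]))


def expand(node, length):
    """Recover the bit vector from the tree, showing that it is lossless."""
    value, left, right = node
    if left is None:
        return [value] * length
    middle = length // 2
    return expand(left, middle) + expand(right, length - middle)


def root_count(node, length):
    """Count the 1-bits without expanding pure subtrees."""
    value, left, right = node
    if left is None:
        return value * length
    middle = length // 2
    return root_count(left, middle) + root_count(right, length - middle)


a21 = [0, 0, 0, 0, 1, 1, 0, 0]
p21 = build_ptree(a21)
```

For this vector the left half is purely 0 and is stored as a single node, which
is where the compression comes from.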
With various built-in engines, such as a query engine, an OLAP engine, and a data
mining engine, a vertical database can be used to accomplish SPJ queries (Ding et al,
2002), OLAP operations (Wang et al, 2003), and various data mining applications.
The system structure is described in detail in the next section.
A SYSTEM PROTOTYPE
As a proof of concept, a prototype system for scalable data mining has been
developed and tested successfully on top of the vertical database concept.
The prototype was designed using a multi-layered software framework approach.
The system is formally named DataMIME™ (Serazi et al, 2004).
The layers of the system include the Data Mining Interface (DMI), the Data
Capture and Data Integration Interface (DCI/DII), the Data Mining Algorithm (DMA)
layer, and the Distributed Ptree Management Interface (DPMI). DMI does counting,
the most important operation for data mining provided by P-trees, including basic
P-trees, value P-trees, tuple P-trees, interval P-trees, and cube P-trees. DMI also
provides the P-tree algebra, which has four operations, AND, OR, NOT (complement),
and XOR, to implement the pointwise logical operations on P-trees for the DMA
layer. DCI/DII allows users to capture data and integrate it into the format
required by the system (P-tree format). The DPMI layer provides access, location,
and concurrency transparency by hiding the facts that data representation may
differ, that resource access protocols may vary, and that resources may be located
in different places and shared by several competing users. The DMA layer contains
a collection of data mining tools, e.g. P-KNN (Khan et al, 2002), PINE (Perrizo et
al, 2003), P-BAYESIAN (Perera et al, 2002), P-SVM (Pan et al, 2004), and P-ARM
(Ding et al, 2002). Besides these core layers, the system provides a graphical user
interface that adds flexible user interaction with the system.
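The counting role of the DMI layer can be illustrated on uncompressed bit
vectors. The sketch assumes that a value P-tree for a value v marks the records
whose attribute equals v, obtained by ANDing each basic bit vector or its
complement according to the bits of v; this reading and all function names are
assumptions, not the DataMIME™ interface.

```python
# Minimal sketch of the P-tree algebra and of counting, on plain bit vectors.
# Assumption: a value mask ANDs each basic vector (bit = 1) or its complement (bit = 0).

def bit_and(*vectors):
    """Pointwise AND of any number of equally long bit vectors."""
    return [int(all(bits)) for bits in zip(*vectors)]


def bit_or(*vectors):
    """Pointwise OR of any number of equally long bit vectors."""
    return [int(any(bits)) for bits in zip(*vectors)]


def bit_xor(left, right):
    """Pointwise XOR of two equally long bit vectors."""
    return [a ^ b for a, b in zip(left, right)]


def complement(vector):
    """Pointwise NOT of a bit vector."""
    return [1 - bit for bit in vector]


def value_mask(basic_vectors, value):
    """Mark the records whose attribute equals value (highest-order vector first)."""
    width = len(basic_vectors)
    operands = []
    for position, vector in enumerate(basic_vectors):
        bit = (value >> (width - 1 - position)) & 1
        operands.append(vector if bit else complement(vector))
    return bit_and(*operands)


# Basic bit vectors A21, A22, A23 of attribute A2 = 2, 3, 2, 2, 5, 7, 2, 3.
a2_bits = [
    [0, 0, 0, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 1, 0, 1],
]

mask = value_mask(a2_bits, 2)   # records with A2 = 2 (binary 010)
count = sum(mask)               # number of such records
```

The count is obtained from logical operations on three bit vectors alone; no
horizontal record is read.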
To understand how the vertical database concept affects the system, some key
concepts must be grasped. Unlike in a traditional database, data is not stored in
a horizontal, row-based format; rather, it is stored in a compressed vertical
P-tree format. The DPMI layer is responsible for storing and managing this
P-tree-based vertical data in the system. The efficient bit-wise operations on
vertical data provide the scalability for data mining algorithms, and these are
made available through the DMI layer. Finally, this uniform, efficient vertical
data structure at the lowest layer can take advantage of the latest hardware.
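How a logical operation can work on the compressed form directly may be sketched
as follows. The nested-tuple node layout and the function names are assumptions
made for illustration; the point is only that a pure subtree settles the result
for its whole segment without visiting individual bits.

```python
# Minimal sketch of AND on compressed 1-dimensional P-trees.
# Assumption: a node is (value, left, right); pure nodes have no children.

PURE0 = (0, None, None)
PURE1 = (1, None, None)


def build(bits):
    """Build a P-tree by recursive halving until each segment is pure."""
    if all(bits):
        return PURE1
    if not any(bits):
        return PURE0
    middle = len(bits) // 2
    return (0, build(bits[:middle]), build(bits[middle:]))


def ptree_and(a, b):
    """AND two P-trees built over bit vectors of the same length."""
    if a == PURE0 or b == PURE0:
        return PURE0                    # whole segment is 0, no descent needed
    if a == PURE1:
        return b                        # 1 AND x = x
    if b == PURE1:
        return a
    left = ptree_and(a[1], b[1])
    right = ptree_and(a[2], b[2])
    if left == PURE0 and right == PURE0:
        return PURE0                    # collapse a segment that became pure
    return (0, left, right)


a21 = [0, 0, 0, 0, 1, 1, 0, 0]
a22 = [1, 1, 1, 1, 0, 1, 1, 1]
result = ptree_and(build(a21), build(a22))
```

In this example the left half of the result is decided at the first level,
because the left half of the first operand is purely 0.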
CONCLUSION
Horizontal data structures have proven inefficient for data mining on
very large data sets because of the large cost of scanning. It is therefore
important to develop vertical data structures and algorithms to solve the
scalability issue. Various structures have been proposed, among which the P-tree
is a very promising vertical structure. This database model is not a set of
indexes, but a collection of representations of the dataset itself. P-trees have
shown great performance in processing data containing large numbers of tuples,
owing to the fast logical AND operation without scanning (Ding et al, 2002). In
general, horizontal data organization is preferable for transactional data whose
intended output is a relation, and vertical data organization is more appropriate
for data mining on very large data sets.
REFERENCES
Batory, D. S. (1979). On Searching Transposed Files. ACM Transactions on
Database Systems, 4(4):531-544.
Chan, C. Y. and Ioannidis, Y. (1998). Bitmap index design and evaluation.
Proceedings of the ACM SIGMOD, 355-366.
Copeland, G. and Khoshafian, S. (1985). Decomposition Storage Model.
Proceedings of the ACM SIGMOD, 268-279.
Denton, A., Ding, Q., Perrizo, W., and Ding, Q. (2002). Efficient Hierarchical
Clustering of Large Data Sets Using P-Trees. Proceedings of the International
Conference on Computer Applications in Industry and Engineering, 138-141.
Ding, Q., Ding, Q., and Perrizo, W. (2002). Association Rule Mining on
Remotely Sensed Images Using Ptrees. Proceedings of the Pacific-Asia Conference
on Knowledge Discovery and Data Mining, 66-79.
Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques. San
Francisco, CA, Morgan Kaufmann.
Han, J., Pei, J., and Yin, Y. (2000). Mining frequent patterns without
candidate generation. In Proceedings of the ACM International Conference on
Management of Data (SIGMOD), Dallas, TX.
Khan, M., Ding, Q., and Perrizo, W. (2002). K-nearest neighbor classification
on spatial data stream using P-trees. In Proceedings of the Pacific-Asia Conference
on Knowledge Discovery and Data Mining (PAKDD), Springer-Verlag, Lecture
Notes in Artificial Intelligence 2336, 517-528.
O’Neil, P. and Quass, D. (1997). Improved Query Performance with Variant
Indexes. Proceedings of the ACM SIGMOD, 38-49.
Perera, A., Serazi, M., and Perrizo, W. (2002). Performance Improvement for
Bayesian Classification on Spatial Data with P-Trees. CAINE.
Perrizo, W., Ding, Q., Denton, A., Scott, K., Ding, Q., and Khan, M. (2003).
Podium Incremental Neighbor Evaluator for Spatial Data using P-trees. SAC.
Rinfret, D., O’Neil, P., and O’Neil, E. (2001). Bit-Sliced Index Arithmetic.
Proceedings of the ACM SIGMOD, 47-57.
Serazi, M., Perera, A., Ding, Q., Malakhov, V., Rahal, I., Pan, F., Ren, D., Wu,
W., and Perrizo, W. (2004). DataMIME™. ACM SIGMOD.
Shafer, J., Agrawal, R., and Mehta, M. (1996). SPRINT: A scalable parallel
classifier for data mining. In Proceedings of the International Conference on Very
Large Data Bases (VLDB), 544-555, Bombay, India.
Sybase Inc. (1997). Sybase IQ Indexes. In Sybase IQ Administration Guide,
Sybase IQ Release 11.2 Collection, chapter 5. AIPD Technical Publications.
Wong, H. K. T., Liu, H.-F., Olken, F., Rotem, D., and Wong, L. (1985). Bit
Transposed Files. Proceedings of VLDB, 448-457.
Wu, M.-C. and Buchmann, A. (1998). Encoded bitmap indexing for data
warehouses. Proceedings of IEEE International Conference on Data Engineering,
220-230.
Wu, M.-C. (1998). Query Optimization for Selections using Bitmaps. Technical
Report, DVS98-2, DVS1, Computer Science Department, Technische Universität
Darmstadt.
Terms and Definitions
Predicate Tree (P-tree): A lossless tree that is vertically structured and horizontally
processed through fast multi-operand logical operations.
P-Tree Algebra: The set of logical operations, functions and properties of P-trees. Basic
logical operations include AND, OR, and complement.
Vertical Decomposition: A process of partitioning a relational table of horizontal
records into separate vertical data files, either at attribute level or at bit level,
usually retaining the original record order in each.
Vertical Database Design: A process of developing a vertical data model, usually with
intended data mining functionality that utilizes logical operations for fast data processing.
Vertical Data Mining: A process of finding patterns and knowledge from data that is
organized in vertical structures, which aims to address the scalability issues.
Multi-Layered Software Framework: A layer-based software environment where each
layer is a group of entities dedicated to perform a particular task.
DataMIME™: A prototype system that has been designed and implemented on top
of vertical database technology and a multi-layered software framework by the
DataSURG group at North Dakota State University, ND, USA.