Vertical Database Design for Scalable Data Mining
William Perrizo, North Dakota State University, USA
Qiang Ding, Concordia College – Moorhead, USA
Masum Serazi, North Dakota State University, USA
Taufik Abidin, North Dakota State University, USA
Baoying Wang, North Dakota State University, USA
INTRODUCTION
For several decades, and especially with the preeminence of relational
database systems, data has almost always been formed into horizontal record structures
and then processed vertically (vertical scans of files of horizontal records). This makes
good sense when the requested result is a set of horizontal records. In knowledge
discovery and data mining, however, researchers are typically interested in collective
properties or predictions that can be expressed very briefly. For such tasks,
scan-based processing of horizontal records is known to be
inadequate in very large data repositories (Han & Kamber, 2001;
Han, Pei & Yin, 2000; Shafer, Agrawal & Mehta, 1996).
In contrast, the advantages of vertical data organization are increasingly
recognized. For example, it makes hardware caching work well,
it makes compression easy, and it may greatly increase the effectiveness of the
I/O device, since only the participating fields are retrieved instead of the whole record.
The vertical decomposition of a relation also permits a number of transactions to
execute concurrently. Much effort has also been focused on sub-sampling and
indexing to address problems of scalability. However, sub-sampling requires that the
sub-sampler know enough about the large dataset in the first place to sub-sample
“representatively”. That is, representative sub-sampling presupposes considerable
knowledge about the data. For many large datasets, such knowledge may be
inadequate or non-existent.
Index files are vertical structures that serve as vertical access paths to sets of
horizontal records. Some indices, such as the Bit-Sliced Index (BSI) (Chan & Ioannidis,
1998; O’Neil & Quass, 1997; Rinfret et al, 2001) and the Encoded Bitmap Index (EBI)
(Wu & Buchmann, 1998; Wu, 1998), do address the scalability problem in many
cases, but they do so at the cost of creating and maintaining additional index files
separate from the data files.
Sybase IQ (Sybase Inc., 1997), a commercial DBMS specially designed for
data warehousing applications, requires data to be fully inverted to achieve scalable
data analyses. However, original horizontal records still need to be retrieved during
the analyses.
Another approach, conceptually different from those above, is to build
the whole database vertically. Such a database can be used not only for routine data
management but also for data mining. Unlike horizontal databases, which are
stored horizontally and processed vertically, vertical databases are stored vertically
and processed horizontally. Together with other characteristics, this organization has
been shown to address the scalability issues.
BACKGROUND
The concept of vertical data files is, in fact, not new at all. Copeland et al
(1985) presented an attribute-level Decomposition Storage Model called DSM,
similar to the Attribute Transposed File model (ATF) (Batory, 1979), that stores each
column of a relational table in a separate table. DSM was shown to
perform well. It utilizes surrogate keys to map individual attributes together, hence
requiring a surrogate key to be associated with each attribute of each record in the
database. Attribute-level vertical decomposition is also used in remotely sensed
imagery, e.g. Landsat Thematic Mapper imagery, where it is called Band Sequential
(BSQ) format. Beyond attribute-level decomposition, Wong et al (1985) presented
the Bit Transposed File model (BTF), which further partitioned each column to the bit
level and utilized encoding methods to reduce the storage space. Because accessing
files directly through an operating system is difficult, a higher access layer,
the database, was invented. In most cases, databases are stored
horizontally, which is suitable for data retrieval but not for data mining. A
vertical database, on the other hand, can serve both data retrieval and data mining
purposes.
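The attribute-level decomposition used by DSM can be illustrated with a
minimal Python sketch. It assumes surrogate keys are simply the 1-based record
positions; the function names to_dsm and from_dsm and the sample rows are
hypothetical and are not taken from Copeland et al.

```python
def to_dsm(rows, attribute_names):
    """Split horizontal records into one (surrogate, value) table per attribute.

    Assumption: the surrogate key is the 1-based position of the record.
    """
    return {
        name: [(surrogate, row[k]) for surrogate, row in enumerate(rows, start=1)]
        for k, name in enumerate(attribute_names)
    }


def from_dsm(dsm, attribute_names):
    """Rebuild horizontal records by matching surrogate keys across the tables."""
    lookups = [dict(dsm[name]) for name in attribute_names]
    surrogates = sorted(lookups[0])
    return [tuple(lookup[s] for lookup in lookups) for s in surrogates]


# Hypothetical sample data: three records with three numeric attributes.
sample_rows = [(5, 2, 7), (2, 3, 2), (7, 2, 2)]
sample_names = ["A1", "A2", "A3"]
sample_dsm = to_dsm(sample_rows, sample_names)
```

Each attribute table repeats the surrogate key, which is the storage cost
noted above; in exchange, a query that touches only A2 reads only
sample_dsm["A2"].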
MAIN THRUST OF THE CHAPTER
Vertical Databases
In vertical databases, data is stored vertically and processed horizontally
through fast, multi-operand logical operations, such as AND, OR, XOR, and
complement. The predicate tree (P-tree) is one lossless vertical structure that
meets this requirement. The P-tree is suitable for representing numerical and categorical
data and has been successfully used in OLAP operations (Wang et al, 2003) and various
data mining applications, including classification (Khan et al, 2002), clustering
(Denton et al, 2002), and association rule mining (Ding et al, 2002).
A vertical database consists of a set of P-trees rather than a set of relational
tables. To convert a relational table of horizontal records to a set of vertical P-trees,
the table has to be projected into columns, one for each attribute, retaining the
original record order in each. Then each attribute column is further decomposed into
separate bit vectors, one for each bit position of the values in that attribute. Figure 1
shows a relational table with three attributes, all of which are
numeric. Figure 2 shows the decomposition process from the relational table R to a
set of bit vectors.
R (A1, A2, A3)
A1  A2  A3
5   2   7
2   3   2
7   2   2
7   2   5
2   5   5
4   7   1
3   2   1
1   3   4
Figure 1. Relational table R.
R (A1, A2, A3), values in binary
A1   A2   A3
101  010  111
010  011  010
111  010  010
111  010  101
010  101  101
100  111  001
011  010  001
001  011  100

Bit vectors, one per bit position
A11 A12 A13  A21 A22 A23  A31 A32 A33
1   0   1    0   1   0    1   1   1
0   1   0    0   1   1    0   1   0
1   1   1    0   1   0    0   1   0
1   1   1    0   1   0    1   0   1
0   1   0    1   0   1    1   0   1
1   0   0    1   1   1    0   0   1
0   1   1    0   1   0    0   0   1
0   0   1    0   1   1    1   0   0
Figure 2. Vertical decomposition of the table R
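The decomposition of table R into bit vectors can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: every value
fits in 3 bits, bit position 1 is the most significant bit, and the names
decompose and recompose are hypothetical helpers rather than part of any
published system.

```python
# Table R from Figure 1; each attribute value is assumed to fit in 3 bits.
R = [
    (5, 2, 7),
    (2, 3, 2),
    (7, 2, 2),
    (7, 2, 5),
    (2, 5, 5),
    (4, 7, 1),
    (3, 2, 1),
    (1, 3, 4),
]
BITS = 3  # assumed bit width of every attribute


def decompose(table, bits=BITS):
    """Return {(attribute, bit_position): bit vector}, position 1 = most significant."""
    vectors = {}
    columns = list(zip(*table))  # project into columns, keeping record order
    for i, column in enumerate(columns, start=1):
        for j in range(1, bits + 1):
            shift = bits - j
            vectors[(i, j)] = [(value >> shift) & 1 for value in column]
    return vectors


def recompose(vectors, n_attrs, bits=BITS):
    """Rebuild the horizontal records, showing that the decomposition is lossless."""
    n_rows = len(vectors[(1, 1)])
    rows = []
    for r in range(n_rows):
        row = []
        for i in range(1, n_attrs + 1):
            value = 0
            for j in range(1, bits + 1):
                value = (value << 1) | vectors[(i, j)][r]
            row.append(value)
        rows.append(tuple(row))
    return rows


bit_vectors = decompose(R)
```

Here bit_vectors[(2, 1)] corresponds to the bit vector A21 of Figure 2, and
recompose(bit_vectors, 3) returns the original records because the record
order is retained in every bit vector.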
After the decomposition process, each bit vector is converted into a P-tree.
P-trees can be 1-dimensional, 2-dimensional, or multi-dimensional. If the data has
a natural dimension, as spatial data does, the P-tree dimension is matched to the
data dimension. Otherwise, the dimension can be chosen to optimize the compression
ratio. Figure 3 shows the construction of three 1-dimensional P-trees from the bit
vectors of the second attribute A2. They are built by recording the truth of the
predicate “purely 1-bits” recursively on halves of the bit vectors until purity is
reached.
[Figure 3: three P-tree diagrams, panels (a) P21, (b) P22, (c) P23]
Figure 3. P-trees of attributes A21, A22 and A23
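The recursive construction of a 1-dimensional P-tree can be sketched as
follows. This is a minimal illustration, not the authors' implementation: the
nested-tuple node layout (value, left, right) and the function names
build_ptree, count_ones and expand are assumptions made for this example.

```python
def build_ptree(bits):
    """Build a 1-dimensional P-tree as nested tuples (value, left, right).

    A node holds the truth of the predicate "purely 1-bits" for its segment.
    Recursion stops at pure segments (all 1s or all 0s), which become leaves
    with left = right = None.
    """
    if not bits:
        raise ValueError("bit vector must not be empty")
    if all(b == 1 for b in bits):
        return (1, None, None)  # purely 1-bits: predicate true
    if all(b == 0 for b in bits):
        return (0, None, None)  # purely 0-bits: predicate false, still pure
    mid = len(bits) // 2
    return (0, build_ptree(bits[:mid]), build_ptree(bits[mid:]))


def count_ones(tree, length):
    """Count the 1-bits represented by the tree without expanding it."""
    value, left, right = tree
    if left is None:
        return length if value == 1 else 0
    mid = length // 2
    return count_ones(left, mid) + count_ones(right, length - mid)


def expand(tree, length):
    """Recover the original bit vector, showing that the tree is lossless."""
    value, left, right = tree
    if left is None:
        return [value] * length
    mid = length // 2
    return expand(left, mid) + expand(right, length - mid)


A21 = [0, 0, 0, 0, 1, 1, 0, 0]  # first bit vector of attribute A2 in table R
P21 = build_ptree(A21)
```

For A21 the left half is purely 0-bits and becomes a single leaf, so only the
right half is split further. Long pure runs collapse in the same way, which is
where the compression comes from.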
With various built-in engines, such as a query engine, an OLAP engine, and a data
mining engine, a vertical database can be used to accomplish SPJ queries (Ding et al,
2002), OLAP operations (Wang et al, 2003) and various data mining applications.
The system structure is described in detail in the next section.
System Prototype
As a proof of concept, a prototype system for scalable data mining has been
developed and tested successfully on top of the vertical database concept. A
multi-layered software framework approach was taken to design the prototype.
The system is formally named DataMIME™ (Serazi et al, 2004).
The layers of the system include Data Mining Interface (DMI), Data Capture
and Data Integration Interface (DCI/DII), Data Mining Algorithm (DMA), and
Distributed Ptree Management Interface (DPMI). DMI provides counting, the most
important data mining operation supported by P-trees, over basic P-trees,
value P-trees, tuple P-trees, interval P-trees, and cube P-trees. DMI also provides the
P-tree algebra, which has four operations, AND, OR, NOT (complement) and XOR,
to implement the pointwise logical operations on P-trees for the Data Mining
Algorithm (DMA) layer. DCI/DII allows users to capture data and integrate it into the
format required by the system (P-tree format). The DPMI layer provides access, location,
and concurrency transparency by hiding the facts that data representation may differ,
that resource access protocols may vary, and that resources may be located in different
places and shared by several competing users. The DMA layer contains a collection of
data mining tools, e.g. P-KNN (Khan et al, 2002), PINE (Perrizo et al, 2003), P-BAYESIAN
(Perera et al, 2002), P-SVM (Pan et al, 2004), and P-ARM (Ding et al, 2002).
Besides these core layers, the system provides a graphical user interface that allows
flexible user interaction with the system.
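Counting through logical operations can be illustrated on plain,
uncompressed bit vectors. The sketch below is not the DMI interface: it
assumes bit vectors packed into Python integers, a 3-bit attribute, and
hypothetical helper names (pack, value_mask, count_ones). It counts the
records whose attribute equals a given value by ANDing each bit vector, or its
complement, according to the bits of that value.

```python
N_ROWS = 8
MASK = (1 << N_ROWS) - 1  # keeps complements within N_ROWS bits


def pack(bits):
    """Pack a list of 0/1 values into an int, first record in the highest bit."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word


# Bit vectors of attribute A2 from table R (position 1 = most significant bit).
A2 = {
    1: pack([0, 0, 0, 0, 1, 1, 0, 0]),
    2: pack([1, 1, 1, 1, 0, 1, 1, 1]),
    3: pack([0, 1, 0, 0, 1, 1, 0, 1]),
}


def value_mask(bit_vectors, value, bits=3):
    """AND the bit vectors (or their complements) that spell out `value`."""
    result = MASK
    for j in range(1, bits + 1):
        wanted = (value >> (bits - j)) & 1
        vector = bit_vectors[j]
        result &= vector if wanted else (~vector & MASK)
    return result


def count_ones(mask):
    """Count the records selected by a mask."""
    return bin(mask).count("1")


rows_with_a2_equal_2 = count_ones(value_mask(A2, 2))
```

No horizontal record is read: the count comes from three logical operations
on whole bit vectors. The same pattern extends to OR and XOR, and on
compressed P-trees the operations would be applied node by node rather than
on flat bit strings.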
To understand how the vertical database concept affects the system,
some key concepts must be grasped. Unlike in a traditional database, data
is not stored in a horizontal row-based format but in a compressed
vertical P-tree format. The DPMI layer is responsible for storing and managing this
P-tree based vertical data in the system. The efficient bit-wise operations on vertical
data provide the scalability of the data mining algorithms, and they are made available
through the DMI layer. Finally, this uniform, efficient vertical data structure at the
lowest layer can take advantage of the latest hardware.
FUTURE TRENDS
Vertical databases will become more and more important as many data sets have
become extremely large. Research has shown that scanning the entire data set horizontally
is inefficient and non-scalable. Therefore a new scalable approach is critical. The vertical
database has proven to be a scalable methodology that can be used to perform fast,
efficient and effective data mining on large data sets by organizing data in vertical layouts
and conducting logical operations on vertically partitioned data without scanning. Also,
there is great potential to combine vertical databases with parallel data mining as well as
with hardware.
CONCLUSION
Horizontal data structures have proven to be inefficient for data mining on
very large data sets due to the large cost of scanning. It is therefore important to develop
vertical data structures and algorithms to solve the scalability issue. Various
structures have been proposed, among which the P-tree is a very promising vertical
structure. This database model is not a set of indexes, but a collection of
representations of the dataset itself. P-trees have shown great performance in processing
data containing large numbers of tuples, due to the fast logical AND operation without
scanning (Ding et al, 2002). In general, horizontal data organization is preferable for
transactional data with intended output as a relation, and vertical data structure is
more appropriate for data mining on very large data sets.
REFERENCES
Batory, D. S. (1979). On Searching Transposed Files. ACM Transactions on
Database Systems, 4(4):531-544.
Chan, C. Y. and Ioannidis, Y. (1998). Bitmap Index Design and Evaluation.
Proceedings of the ACM SIGMOD, 355-366.
Copeland, G. and Khoshafian, S. (1985). Decomposition Storage Model. Proceedings
of the ACM SIGMOD, 268-279.
Denton, A., Ding, Q., Perrizo, W., and Ding, Q. (2002). Efficient Hierarchical
Clustering of Large Data Sets Using Ptrees. Proceedings of the International
Conference on Computer Applications in Industry and Engineering, 138-141.
Ding, Q., Ding, Q., and Perrizo, W. (2002). Association Rule Mining on Remotely
Sensed Images Using Ptrees. Proceedings of the Pacific-Asia Conference on
Knowledge Discovery and Data Mining, 66-79.
Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques. San
Francisco, CA, Morgan Kaufmann.
Han, J., Pei, J., and Yin, Y. (2000). Mining Frequent Patterns without Candidate
Generation. Proceedings of the ACM SIGMOD, 1-12.
Khan, M., Ding, Q., and Perrizo, W. (2002). K-nearest Neighbor Classification on
Spatial Data Stream Using Ptrees. Proceedings of the Pacific-Asia Conference
on Knowledge Discovery and Data Mining, 517-528.
O’Neil, P. and Quass, D. (1997). Improved Query Performance with Variant Indexes.
Proceedings of the ACM SIGMOD, 38-49.
Perera, A., Serazi, M., and Perrizo, W. (2002). Performance Improvement for
Bayesian Classification on Spatial Data with P-Trees. Proceedings of the
International Conference on Computer Applications in Industry and
Engineering.
Perrizo, W., Ding, Q., Denton, A., Scott, K., Ding, Q., and Khan, M. (2003). Podium
Incremental Neighbor Evaluator for Spatial Data using P-trees. ACM
Symposium on Applied Computing.
Rinfret, D., O’Neil, P., and O’Neil, E. (2001). Bit-Sliced Index Arithmetic.
Proceedings of the ACM SIGMOD, 47-57.
Serazi, M., Perera, A., Ding, Q., Malakhov, V., Rahal, I., Pan, F., Ren, D., Wu, W.,
and Perrizo, W. (2004). DataMIME™. Proceedings of the ACM International
Conference on Management of Data.
Shafer, J., Agrawal, R., and Mehta, M. (1996). SPRINT: A Scalable Parallel
Classifier for Data Mining. Proceedings of the International Conference on
Very Large Data Bases, 544-555.
Sybase Inc. (1997). Sybase IQ Indexes. In Sybase IQ Administration Guide, Sybase
IQ Release 11.2 Collection, Chapter 5. AIPD Technical Publications.
Wang, B., Pan, F., Ren, D., Cui, Y., Ding, Q., and Perrizo, W. (2003). Efficient
OLAP Operations for Spatial Data Using P-trees. 8th ACM SIGMOD
Workshop on Research Issues in Data Mining and Knowledge Discovery.
Wong, H. K. T., Liu, H.-F., Olken, F., Rotem, D., and Wong, L. (1985). Bit
Transposed Files. Proceedings of the International Conference on Very Large
Data Bases, 448-457.
Wu, M-C. and Buchmann, A. (1998). Encoded Bitmap Indexing for Data Warehouses.
Proceedings of IEEE International Conference on Data Engineering, 220-230.
Wu, M-C. (1998). Query Optimization for Selections using Bitmaps. Technical
Report, DVS98-2, DVS1, Computer Science Department, Technische
Universität Darmstadt.
TERMS AND THEIR DEFINITION
Predicate Tree (P-tree): A lossless tree that is vertically structured and horizontally
processed through fast multi-operand logical operations.
P-Tree Algebra: The set of logical operations, functions and properties of P-trees. Basic
logical operations include AND, OR, and complement.
Vertical Decomposition: A process of partitioning a relational table of horizontal records
into separate vertical data files, either at the attribute level or the bit level, usually
retaining the original record order in each.
Vertical Database Design: A process of developing a vertical data model, usually with
intended data mining functionality that utilizes logical operations for fast data processing.
Vertical Data Mining: A process of finding patterns and knowledge in data that is
organized in vertical structures, which aims to address the scalability issues.
Multi-Layered Software Framework: A layer-based software environment where each
layer is a group of entities dedicated to performing a particular task.
DataMIME™: A prototype system that has been designed and implemented on top of
vertical database technology and a multi-layered software framework by the DataSURG
group at North Dakota State University, ND, USA.