DataMIME™
I. Rahal, A. Perera, M. Serazi, Q. Ding, F. Pan, D. Ren, W. Wu and W. Perrizo
Computer Science Department
North Dakota State University
Fargo, ND
USA
[email protected]
Abstract
DataMIME™ is a universal data mining system developed by the DataSURG research group at North Dakota State University. The
system exploits a novel technology, the Ptree technology*, for compressed vertical data representation, which facilitates fast and efficient
data mining over large datasets. DataMIME™ provides a suite of data mining applications for tasks such as association rule mining,
classification, and clustering. The key to using the Ptree technology in DataMIME™ lies in its vertical† (i.e., column-based)
representation and processing of data, which underpins the demonstrated efficiency and scalability of the applications embedded in DataMIME™.
1. Vertical Partitioning
The concept of vertical partitioning has been studied in the context of both centralized and distributed database systems. It is a good
strategy especially when most queries retrieve only a small number of columns. Decomposing a relation also permits a number
of transactions to execute concurrently. Copeland et al. presented an attribute-level storage model called the Decomposition Storage Model (DSM) [5],
similar to the Attribute Transposed File (ATF) model [3], which stores each column of a relational table in a separate table. Wong et al.
show the advantage of encoding attribute values with a small number of bits to reduce storage space [15], and [14] presents a more recent
vertical model for frequent pattern mining. In this system, we decompose the attributes of relational tables into separate groups by bit
position (or bit slice). In addition, we compress the vertical bit groups using the Ptree technology.
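To make the bit-position decomposition concrete, here is a minimal Python sketch (not the DataMIME™ feeder code; the function name is ours) that splits an attribute column into vertical bit groups:

```python
def bit_slices(column, width):
    """Split a column of unsigned integers into `width` bit groups.

    Returns a list of bit vectors: slices[i][t] is bit i (MSB first)
    of the t-th tuple's value.
    """
    return [[(v >> (width - 1 - i)) & 1 for v in column]
            for i in range(width)]

values = [5, 2, 7, 0]             # one attribute, 3-bit values
slices = bit_slices(values, 3)
print(slices)                     # [[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 0]]
```

Each of the resulting bit groups would then be compressed into its own Ptree.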
2. The Ptree Technology
The basic data structure exploited by the Ptree technology [6][12] is the Ptree (Predicate or Peano tree). Formally, Ptrees are tree-like
data structures that store relational data in a column-wise, bit-compressed format: each attribute is split into bits (i.e., each
attribute value is represented by its binary equivalent), the bits in each bit position are grouped together across all tuples, and each bit group is represented by a
Ptree. Ptrees preserve the complete information of the original data and are structured to facilitate data mining. Once data has been represented using
P-trees, no scans of the database are needed to perform the data mining task at hand. Data querying is achieved through logical
operations such as AND, OR, and NOT, referred to in the literature as the Ptree algebra. Various aspects of Ptrees, including
representation, querying, algebra, speed, and compression, are discussed in greater detail in [6][9][12][13].
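As an illustration of the compression and algebra described above, the following is a simplified 1-D sketch assuming binary half-splitting and pure-node leaves; the actual Ptree uses Peano (quadrant) ordering over spatial data and other fan-outs tuned for compression:

```python
def build(bits):
    """Recursively compress a bit vector: a pure run becomes a single leaf."""
    if all(b == 1 for b in bits):
        return ('pure1', len(bits))
    if all(b == 0 for b in bits):
        return ('pure0', len(bits))
    h = len(bits) // 2
    return ('mixed', build(bits[:h]), build(bits[h:]))

def root_count(t):
    """Number of 1-bits represented by the tree."""
    if t[0] == 'pure1':
        return t[1]
    if t[0] == 'pure0':
        return 0
    return root_count(t[1]) + root_count(t[2])

def p_and(a, b):
    """Ptree AND; pure-0 and pure-1 branches short-circuit without expansion."""
    if a[0] == 'pure0' or b[0] == 'pure1':
        return a
    if b[0] == 'pure0' or a[0] == 'pure1':
        return b
    return ('mixed', p_and(a[1], b[1]), p_and(a[2], b[2]))

t1 = build([1, 1, 1, 1, 0, 0, 1, 0])
t2 = build([1, 1, 0, 1, 0, 0, 1, 1])
print(root_count(p_and(t1, t2)))   # 1-bits common to both vectors → 4
```

Note that the AND never decompresses a pure branch, which is what makes the algebra fast on compressed data.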
3. Description of DataMIME™
DataMIME™ is an efficient and scalable data mining system that provides the flexibility of plugging in new data mining applications when
needed. Clients interact with the DataMIME™ system to capture their data and convert it into the Ptree format, after which they can
apply different data mining applications. The data converter, along with all the data mining applications, executes on the server side.
Figure 1 depicts a logical view of the system.
* Patents are pending for the P-tree technology. This work was partially supported by GSA grant ACT#: K96130308.
† This is in contrast to the conventional horizontal (i.e., row-based) representation employed by the relational model.
Figure 1. The Logical Architecture of DataMIME™ (clients capture data, integrate it into DataMIME™, mine their data, and analyze the performance of the system over the Internet; a master server coordinates a set of slave servers)
3.1 Server-side Architecture
The multi-threaded, concurrent, and distributed DataMIME™ server has a four-layer architecture: DCI/DII, DMI, DPMI, and DMA.
This architecture is depicted in Figure 2.
Figure 2. Server Side Architecture of the system (the DCI/DII, DMI, and DMA layers, with room for new feeders and plugs for new applications alongside the already plugged applications)
DCI/DII (Data Integration and Capture Interface) layer: This layer allows users to capture data and integrate it into the system. The
main component of this layer is the feeder. Tailored feeders process particular formats of incoming data, and new feeders can be
integrated easily.
DMI (Data Mining Interface) layer: DMI is the engine of DataMIME™. It provides the P-tree algebra, which currently has four
operations: AND, OR, NOT (complement), and XOR.
DPMI (Distributed P-tree Management Interface) layer: The distributed P-tree Management Interface is responsible for managing
the Ptrees in the system.
DMA (Data Mining Algorithm) Layer: This layer provides the suite of available data mining applications.
3.2 Client Structure
On its client side, DataMIME™ has a graphical user interface (GUI) to enable visual interaction with users. The two main functionalities
of the system are:
- Capturing, which sends datasets along with their meta information to the DII/DCI layer of the server
- Mining, which sends requests to the DMA layer for applying data mining applications on a captured dataset
Figure 3. Client side architecture of the system: (a) Capturing, in which a meta-data generator passes data and meta-data through the client-side DCI; (b) Mining, in which unclassified data is sent through the client-side DMA and the resulting prediction model is shown in a visualization tool
3.3 System Features
- Ability to handle formatted record-based, relational-like data with numerical, categorical, and symbolic attributes. The data can be
in ASCII text, relational, or TIFF image format; in addition, easy conversion to any other machine-readable format is provided.
- The system provides various Ptree-based data mining applications spanning association rule mining, classification, prediction, and
similarity search.
- Clients of DataMIME™ can run on UNIX and Microsoft Windows (including 95, 98, NT, 2000, and XP), while the server is
designed to be a UNIX-only system.
- Scalability to large datasets.
4. Embedded Data Mining Applications
As mentioned above, the Applications Interface of DataMIME™ embeds a number of applications spanning various data mining areas.
We briefly present a sample of these applications next.
P-ARM
DataMIME™ contains two applications for association rule mining (ARM). Ptree Association Rule Mining, P-ARM, is a Ptree-based
version of the Apriori algorithm [1] that operates on binary-transaction datasets to find all association rules satisfying minimum
support and confidence thresholds [1][2]. In terms of speed and scalability, P-ARM has been shown [7] to outperform a number of ARM
implementations based on [1], in addition to FP-Growth [8].
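The core of the vertical approach can be illustrated with plain bit vectors standing in for Ptrees: the support of a candidate itemset is the root count of the AND of its items' bit groups, so no per-candidate database scan is needed. The item names and data below are illustrative only:

```python
from functools import reduce

# one bit vector per item over five transactions (vertical layout)
item_bits = {
    'A': [1, 1, 0, 1, 1],
    'B': [1, 0, 1, 1, 0],
    'C': [1, 1, 1, 1, 1],
}

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    anded = reduce(lambda x, y: [a & b for a, b in zip(x, y)],
                   (item_bits[i] for i in itemset))
    return sum(anded) / len(anded)

print(support({'A', 'B'}))   # → 0.4
print(support({'A', 'C'}))   # → 0.8
```

With real Ptrees, the AND and the count both operate on the compressed trees directly.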
PFC-ARM
The second ARM application in DataMIME™ is Ptree Fixed Consequent Association Rule Mining, PFC-ARM. Confident rules are always
a target for users; however, support thresholds fluctuate with the applications and datasets under study. Rules with very high support
are desired in market-basket research applications where, for example, one would like to study shopping patterns to learn which items
are bought together; in contrast, very low support rules are expected in text processing applications, where one is interested in finding words
that occur together, or in medical research, where one is looking for patterns that correspond to the death of a patient. PFC-ARM
has the objective of extracting confident rules having a fixed predefined consequent set [4] chosen by the user beforehand. In addition,
our approach relieves the user from the burden of specifying a minimum support threshold by extracting the highest-support rules,
making it applicable to these different support environments. The current version of PFC-ARM operates on binary transaction
data.
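A hedged sketch of this objective, using the same plain-bit-vector stand-in for Ptrees (all names and data are illustrative, not the PFC-ARM implementation): rules with the user's fixed consequent are filtered by confidence and then ranked by support, so no minimum-support threshold is needed.

```python
from itertools import combinations

item_bits = {'A': [1, 1, 0, 1], 'B': [1, 0, 1, 1], 'C': [1, 1, 1, 0]}

def confident_rules(item_bits, consequent, min_conf):
    """Rules antecedent -> consequent with confidence >= min_conf,
    ranked by support (highest first) instead of a support cutoff."""
    n = len(next(iter(item_bits.values())))
    def count(items):
        # transactions containing every item: AND of the bit vectors
        return sum(all(bits) for bits in zip(*(item_bits[i] for i in items)))
    candidates = [i for i in item_bits if i not in consequent]
    rules = []
    for r in range(1, len(candidates) + 1):
        for ante in combinations(candidates, r):
            supp_a = count(ante)
            supp_ac = count(set(ante) | set(consequent))
            if supp_a and supp_ac / supp_a >= min_conf:
                rules.append((set(ante), supp_ac / n))
    return sorted(rules, key=lambda t: -t[1])

print(confident_rules(item_bits, {'C'}, 0.6))
```

The real system avoids the exhaustive enumeration shown here by pruning with Ptree counts.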
P-kNN
The Ptree k-nearest neighbor classifier, P-kNN, is aimed at alleviating the high cost of building a
new classifier each time new data arrives. In such situations, kNN classification is a very good choice since no classifier needs to be built
ahead of time; kNN refers back to all the available raw data to select the closest neighbors, by distance, in order to produce a class
label for the sample being classified. kNN is extremely simple to implement and lends itself to a wide variety of applications. In this
context, the construction of the neighborhood is the most costly operation. By using the Ptree technology in P-kNN, we can efficiently
find a closed-kNN set [9], which has been shown experimentally to produce higher classification accuracy as well as significantly
higher speed [9][13].
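The closed-kNN idea can be sketched as follows: rather than cutting the neighborhood off at an arbitrary k-th point, every point whose distance equals the k-th smallest distance is included. This toy version uses linear scans where P-kNN uses Ptree operations, and the data is illustrative:

```python
def closed_knn(train, query, k, dist):
    """All training points within the k-th smallest distance of `query`
    (may return more than k points when there are ties)."""
    ds = sorted(dist(x, query) for x, _ in train)
    radius = ds[k - 1]                       # k-th smallest distance
    return [(x, label) for x, label in train if dist(x, query) <= radius]

manhattan = lambda a, b: sum(abs(u - v) for u, v in zip(a, b))
train = [((0, 0), 'a'), ((1, 0), 'a'), ((0, 1), 'b'), ((5, 5), 'b')]
print(closed_knn(train, (0, 0), 2, manhattan))   # 3 points: a tie at distance 1
```

Because the closed set never discards a tied neighbor, the resulting vote is better informed than a standard kNN vote of exactly k points.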
P-SVM
The traditional support vector machine (SVM) approach involves solving a quadratic optimization problem that requires considerable
computation time for large datasets. In DataMIME™, we offer a new and efficient proximal support vector machine [10], P-SVM.
Our main idea is to fit the classification boundary using segments of hyperplanes in the original data space. After finding the support vectors
around the boundary, the classification boundary is approximated by a hyperplane determined by the d nearest support vectors,
where d is the dimension of the input data.
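A toy illustration of the boundary-approximation step, restricted to two dimensions (d = 2) so that the local hyperplane is simply the line through the two nearest support vectors; this is our own simplification, not the P-SVM implementation:

```python
def local_boundary(support_vectors, query):
    """Line through the two support vectors nearest to `query`,
    returned as coefficients (a, b, c) of a*x + b*y + c = 0."""
    p, q = sorted(support_vectors,
                  key=lambda s: (s[0] - query[0])**2 + (s[1] - query[1])**2)[:2]
    a, b = q[1] - p[1], p[0] - q[0]
    c = -(a * p[0] + b * p[1])
    return a, b, c

svs = [(0.0, 1.0), (1.0, 2.0), (4.0, 5.0)]
a, b, c = local_boundary(svs, (0.5, 1.0))
# the sign of a*x + b*y + c tells which side of the local boundary a point is on
side = a * 0.5 + b * 3.0 + c
```

In d dimensions the analogous step solves a small d-by-d linear system for the hyperplane through the d nearest support vectors.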
P-Bayesian
A Bayesian classifier uses Bayes' theorem to predict class membership as a conditional probability that a given data sample falls into a
particular class. The complexity of computing the conditional probability values can become prohibitive for most applications
with large datasets. Bayesian belief networks relax many constraints and use information about the domain to build a conditional
probability table. Naïve Bayesian classification is a lazy classifier: computational cost is reduced by using the Naïve assumption
of class-conditional independence to calculate the conditional probabilities when required. Bayesian belief networks require building
time and domain knowledge, whereas the Naïve approach loses accuracy if the assumption is not valid. The Ptree data structure allows us
to compute the Bayesian probability values efficiently, without the Naïve assumption, by building Ptrees for the training data. Bayesian
classification with P-trees, P-Bayesian, has been successfully applied to remotely sensed image data to predict yield in precision
agriculture [11].
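The counting that P-Bayesian replaces with Ptree root counts can be sketched directly: the class-conditional probability of a full attribute pattern is estimated as count(pattern AND class) / count(class), with no independence assumption. The data below is illustrative only:

```python
def cond_prob(rows, labels, pattern, cls):
    """P(sample matches `pattern` | label == cls) from raw counts.
    `pattern` maps attribute index -> required value."""
    in_class = [r for r, l in zip(rows, labels) if l == cls]
    matches = [r for r in in_class
               if all(r[i] == v for i, v in pattern.items())]
    return len(matches) / len(in_class)

rows = [(1, 0), (1, 1), (0, 1), (1, 0)]
labels = ['y', 'y', 'n', 'y']
print(cond_prob(rows, labels, {0: 1, 1: 0}, 'y'))
```

With Ptrees, both counts are root counts of ANDed bit groups, so the joint probability is exact rather than a product of per-attribute estimates.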
References
[1] Agrawal, Imielinski, and Swami, Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD, International
Conference on Management of Data (Washington, D.C.), 1993.
[2] Agrawal and Srikant, Fast algorithms for mining association rules. Proceeding of the VLDB, International Conference on Very Large Databases
(Santiago, Chile), 1994.
[3] Batory, “On Searching Transposed Files”. ACM Transactions on Database Systems (TODS), 4(4):531-544, 1979.
[4] Bayardo, Agrawal, and Gunopulos, Constraint-Based Rule Mining in Large, Dense Databases. Proceedings of the IEEE ICDE, International
Conference on Data Engineering (Sydney, Australia), 188-197, 1999.
[5] Copeland and Khoshafian. A Decomposition Storage Model. Proceeding of the ACM SIGMOD, International Conference on Management of Data
(Austin, Texas), 268-279, May 1985.
[6] Ding, Khan, Roy, and Perrizo, The P-tree algebra. Proceedings of the ACM SAC, Symposium on Applied Computing (Madrid, Spain), 2002.
[7] Qin Ding, Qiang Ding, and Perrizo, Association Rule Mining on Remotely Sensed Images Using P-trees. Proceedings of the PAKDD, Pacific-Asia
Conference on Knowledge Discovery and Data Mining, Springer-Verlag, Lecture Notes in Artificial Intelligence 2336, 66-79, May 2002.
[8] Han, Pei and Yin, Mining Frequent Patterns without Candidate Generation. Proceedings of the ACM SIGMOD, International Conference on
Management of Data (Dallas, Texas), 2000.
[9] Khan, Ding, and Perrizo. K-nearest Neighbor Classification on Spatial Data Stream Using P-trees. Proceedings of the PAKDD, Pacific-Asia
Conference on Knowledge Discovery and Data Mining, Springer-Verlag, Lecture Notes in Artificial Intelligence 2336, 517-528, May 2002.
[10] Pan, Wang, Ren, Hu. and Perrizo, Proximal Support Vector Machine for Spatial Data Using Peano Trees, Proceedings of the ISCA CAINE,
International Conference on Computer Applications in Industry and Engineering (Las Vegas, Nevada), November 2003.
[11] Perera, Serazi, and Perrizo, Performance Improvement for Bayesian Classification with P-Trees. Proceedings of the ISCA CAINE, International
Conference on Computer Applications in Industry and Engineering (San Diego, California), November 2002.
[12] Perrizo, Peano count tree technology lab notes. Technical Report NDSU-CS-TR-01-1, 2001.
http://www.cs.ndsu.nodak.edu/~perrizo/classes/785/pct.html. January 2003.
[13] Rahal and Perrizo, An Optimized Approach for KNN Text Categorization using P-trees. Accepted for publication in the proceedings of the ACM SAC,
Symposium on Applied Computing (Nicosia, Cyprus), March 2004.
[14] Shenoy, Haritsa, Sudarshan, Bhalotia, Bawa and Shah, Turbo-charging vertical mining of large databases. Proceedings of the ACM SIGMOD,
International Conference on Management of Data (Dallas, Texas), 22-29, May 2000.
[15] H. K. T. Wong, Liu, Olken, Rotem, and L. Wong. Bit Transposed Files. Proceeding of the VLDB, International Conference on Very Large Databases
(Stockholm, Sweden), 1985.