DataMIME™
I. Rahal, A. Perera, M. Serazi, Q. Ding, F. Pan, D. Ren, W. Wu and W. Perrizo
Computer Science Department
North Dakota State University
Fargo, ND
USA
[email protected]
Abstract
DataMIME™ is a universal data mining system developed by the DataSURG research group at North Dakota State University. The
system exploits a novel technology, the Ptree technology*, for compressed vertical data representation, which facilitates fast and efficient
data mining over large datasets. DataMIME™ provides a suite of data mining applications for tasks such as association rule mining,
classification, and clustering. The key to using the Ptree technology in DataMIME™ lies in its vertical† (i.e., column-based)
representation and processing of data, which underpins the demonstrated efficiency and scalability of the applications embedded in DataMIME™.
1. Vertical Partitioning
The concept of vertical partitioning has been studied in the context of both centralized and distributed database systems. It is a good
strategy especially when most queries retrieve only a small number of columns. Decomposing a relation also permits a number
of transactions to execute concurrently. Copeland et al. presented an attribute-level storage model called the Decomposition Storage Model (DSM) [5],
similar to the Attribute Transposed File (ATF) model [3], which stores each column of a relational table in a separate table. Wong et al.
show the advantage of encoding attribute values with a small number of bits to reduce storage space [15], and [14] presents a more recent
vertical model for frequent pattern mining. In this system, we decompose the attributes of relational tables into separate groups by bit
position (or bit slice). In addition, we compress the vertical bit groups using the Ptree technology.
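To make the bit-position decomposition concrete, here is a minimal Python sketch (not the DataMIME™ feeder code; the function name is ours) that splits an attribute column into vertical bit groups:

```python
def bit_slices(column, width):
    """Split a column of unsigned integers into `width` bit groups.

    Returns a list of bit vectors: slices[i][t] is bit i (MSB first)
    of the t-th tuple's value.
    """
    return [[(v >> (width - 1 - i)) & 1 for v in column]
            for i in range(width)]

values = [5, 2, 7, 0]             # one attribute, 3-bit values
slices = bit_slices(values, 3)
print(slices)                     # [[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 0]]
```

Each of the resulting bit groups would then be compressed into its own Ptree.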
2. The Ptree Technology
The basic data structure exploited by the Ptree technology [6][12] is the Ptree (Predicate or Peano tree). Formally, Ptrees are tree-like
data structures that store relational data in a column-wise, bit-compressed format: each attribute is split into bits (i.e., each
attribute value is represented by its binary equivalent), the bits in each bit position are grouped together across all tuples, and each bit group is represented by a
Ptree. Ptrees preserve the complete information of the original data and are structured to facilitate data mining. Once data has been represented using
P-trees, no scans of the database are needed to perform the data mining task at hand. Data querying is achieved through logical
operations such as AND, OR, and NOT, referred to in the literature as the Ptree algebra. Various aspects of Ptrees, including
representation, querying, algebra, speed, and compression, are discussed in greater detail in [6][9][12][13].
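As an illustration of the compression and algebra described above, the following is a simplified 1-D sketch assuming binary half-splitting and pure-node leaves; the actual Ptree uses Peano (quadrant) ordering over spatial data and other fan-outs tuned for compression:

```python
def build(bits):
    """Recursively compress a bit vector: a pure run becomes a single leaf."""
    if all(b == 1 for b in bits):
        return ('pure1', len(bits))
    if all(b == 0 for b in bits):
        return ('pure0', len(bits))
    h = len(bits) // 2
    return ('mixed', build(bits[:h]), build(bits[h:]))

def root_count(t):
    """Number of 1-bits represented by the tree."""
    if t[0] == 'pure1':
        return t[1]
    if t[0] == 'pure0':
        return 0
    return root_count(t[1]) + root_count(t[2])

def p_and(a, b):
    """Ptree AND; pure-0 and pure-1 branches short-circuit without expansion."""
    if a[0] == 'pure0' or b[0] == 'pure1':
        return a
    if b[0] == 'pure0' or a[0] == 'pure1':
        return b
    return ('mixed', p_and(a[1], b[1]), p_and(a[2], b[2]))

t1 = build([1, 1, 1, 1, 0, 0, 1, 0])
t2 = build([1, 1, 0, 1, 0, 0, 1, 1])
print(root_count(p_and(t1, t2)))   # 1-bits common to both vectors → 4
```

Note that the AND never decompresses a pure branch, which is what makes the algebra fast on compressed data.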
3. Description of DataMIME™
DataMIME™ is an efficient and scalable data mining system that provides the flexibility of plugging in new data mining applications when
needed. Clients interact with the DataMIME™ system to capture their data and convert it into the Ptree format, after which they can
apply different data mining applications. The data converter, along with all the data mining applications, executes on the server side.
Figure 1 depicts a logical view of the system.
* Patents are pending for the P-tree technology. This work was partially supported by GSA grant ACT#: K96130308.
† This is in contrast to the conventional horizontal (i.e., row-based) representation employed by the relational model.
Figure 1. The Logical Architecture of DataMIME™ (clients capture data, integrate it into DataMIME™, mine their data, and analyze the performance of the system over the Internet; a master server coordinates a set of slave servers)
3.1 Server-side Architecture
The multi-threaded, concurrent, and distributed DataMIME™ server has a four-layer architecture: DCI/DII, DMI, DPMI, and DMA.
This architecture is depicted in Figure 2.
Figure 2. Server Side Architecture of the system (the DCI/DII, DMI, and DMA layers, with room for new feeders and plugs for new applications alongside the already plugged applications)
DCI/DII (Data Integration and Capture Interface) layer: This layer allows users to capture data and integrate it into the system. The
main component of this layer is the feeder. Tailored feeders process particular formats of incoming data, and new feeders can be
integrated easily.
DMI (Data Mining Interface) layer: DMI is the engine of DataMIME™. It provides the P-tree algebra, which currently has four
operations: AND, OR, NOT (complement), and XOR.
DPMI (Distributed P-tree Management Interface) layer: The distributed P-tree Management Interface is responsible for managing
the Ptrees in the system.
DMA (Data Mining Algorithm) Layer: This layer provides the suite of available data mining applications.
3.2 Client Structure
On its client side, DataMIME™ has a graphical user interface (GUI) to enable visual interaction with users. The two main functionalities
of the system are:
- Capturing, which sends datasets along with their meta information to the DII/DCI layer of the server
- Mining, which sends requests to the DMA layer for applying data mining applications on a captured dataset
Figure 3. Client side architecture of the system: (a) Capturing, in which a meta-data generator passes data and meta-data through the client-side DCI; (b) Mining, in which unclassified data is sent through the client-side DMA and the resulting prediction model is shown in a visualization tool
3.3 System Features
- Ability to handle formatted record-based, relational-like data with numerical, categorical, and symbolic attributes. The data can be
in ASCII text, relational, or TIFF image format; in addition, easy conversion to any other machine-readable format is provided.
- The system provides various Ptree-based data mining applications spanning association rule mining, classification, prediction, and
similarity search.
- Clients of DataMIME™ can run on UNIX and Microsoft Windows (including 95, 98, NT, 2000, and XP), while the server is
designed to be a UNIX-only system.
- Scalability to large datasets.
4. Embedded Data Mining Applications
As mentioned above, the Applications Interface of DataMIME™ embeds a number of applications spanning various data mining areas.
We briefly present a sample of these applications next.
P-ARM
DataMIME™ contains two applications for association rule mining (ARM). Ptree Association Rule Mining, P-ARM, is a Ptree-based
version of the Apriori algorithm [1] that operates on binary-transaction datasets to find all association rules satisfying minimum
support and confidence thresholds [1][2]. In terms of speed and scalability, P-ARM has been shown [7] to outperform a number of ARM
implementations based on [1], in addition to FP-Growth [8].
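The core of the vertical approach can be illustrated with plain bit vectors standing in for Ptrees: the support of a candidate itemset is the root count of the AND of its items' bit groups, so no per-candidate database scan is needed. The item names and data below are illustrative only:

```python
from functools import reduce

# one bit vector per item over five transactions (vertical layout)
item_bits = {
    'A': [1, 1, 0, 1, 1],
    'B': [1, 0, 1, 1, 0],
    'C': [1, 1, 1, 1, 1],
}

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    anded = reduce(lambda x, y: [a & b for a, b in zip(x, y)],
                   (item_bits[i] for i in itemset))
    return sum(anded) / len(anded)

print(support({'A', 'B'}))   # → 0.4
print(support({'A', 'C'}))   # → 0.8
```

With real Ptrees, the AND and the count both operate on the compressed trees directly.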
PFC-ARM
The second ARM application in DataMIME™ is Ptree Fixed Consequent Association Rule Mining, PFC-ARM. Confident rules are always
a target for users; however, support thresholds fluctuate with the applications and datasets under study. Rules with very high support
are desired in market-basket research applications where, for example, one would like to study shopping patterns to learn which items
are bought together; in contrast, very low support rules are expected in text processing applications, where one is interested in finding words
that occur together, or in medical research, where one is looking for patterns that correspond to the death of a patient. PFC-ARM
has the objective of extracting confident rules having a fixed predefined consequent set [4] chosen by the user beforehand. In addition,
our approach relieves the user from the burden of specifying a minimum support threshold by extracting the highest-support rules,
making it applicable to these different support environments. The current version of PFC-ARM operates on binary transaction
data.
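A hedged sketch of this objective, using the same plain-bit-vector stand-in for Ptrees (all names and data are illustrative, not the PFC-ARM implementation): rules with the user's fixed consequent are filtered by confidence and then ranked by support, so no minimum-support threshold is needed.

```python
from itertools import combinations

item_bits = {'A': [1, 1, 0, 1], 'B': [1, 0, 1, 1], 'C': [1, 1, 1, 0]}

def confident_rules(item_bits, consequent, min_conf):
    """Rules antecedent -> consequent with confidence >= min_conf,
    ranked by support (highest first) instead of a support cutoff."""
    n = len(next(iter(item_bits.values())))
    def count(items):
        # transactions containing every item: AND of the bit vectors
        return sum(all(bits) for bits in zip(*(item_bits[i] for i in items)))
    candidates = [i for i in item_bits if i not in consequent]
    rules = []
    for r in range(1, len(candidates) + 1):
        for ante in combinations(candidates, r):
            supp_a = count(ante)
            supp_ac = count(set(ante) | set(consequent))
            if supp_a and supp_ac / supp_a >= min_conf:
                rules.append((set(ante), supp_ac / n))
    return sorted(rules, key=lambda t: -t[1])

print(confident_rules(item_bits, {'C'}, 0.6))
```

The real system avoids the exhaustive enumeration shown here by pruning with Ptree counts.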
P-kNN
The Ptree k-nearest neighbor classifier, P-kNN, is aimed at alleviating the high cost of building a
new classifier each time new data arrives. In such situations, kNN classification is a very good choice since no classifier needs to be built
ahead of time; kNN refers back to all the available raw data to select the closest neighbors, by distance, in order to produce a class
label for the sample being classified. kNN is extremely simple to implement and lends itself to a wide variety of applications. In this
context, the construction of the neighborhood is the most costly operation. By using the Ptree technology in P-kNN, we can efficiently
find a closed-kNN set [9], which has been shown experimentally to produce higher classification accuracy as well as significantly
higher speed [9][13].
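The closed-kNN idea can be sketched as follows: rather than cutting the neighborhood off at an arbitrary k-th point, every point whose distance equals the k-th smallest distance is included. This toy version uses linear scans where P-kNN uses Ptree operations, and the data is illustrative:

```python
def closed_knn(train, query, k, dist):
    """All training points within the k-th smallest distance of `query`
    (may return more than k points when there are ties)."""
    ds = sorted(dist(x, query) for x, _ in train)
    radius = ds[k - 1]                       # k-th smallest distance
    return [(x, label) for x, label in train if dist(x, query) <= radius]

manhattan = lambda a, b: sum(abs(u - v) for u, v in zip(a, b))
train = [((0, 0), 'a'), ((1, 0), 'a'), ((0, 1), 'b'), ((5, 5), 'b')]
print(closed_knn(train, (0, 0), 2, manhattan))   # 3 points: a tie at distance 1
```

Because the closed set never discards a tied neighbor, the resulting vote is better informed than a standard kNN vote of exactly k points.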
P-SVM
The traditional support vector machine (SVM) approach involves solving a quadratic optimization problem that requires considerable
computation time for large datasets. In DataMIME™, we offer a new and efficient proximal support vector machine [10], P-SVM.
Our main idea is to fit the classification boundary using segments of hyperplanes in the original data space. After finding the support vectors
around the boundary, the classification boundary is approximated by a hyperplane determined by the d nearest support vectors,
where d is the dimension of the input data.
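A toy illustration of the boundary-approximation step, restricted to two dimensions (d = 2) so that the local hyperplane is simply the line through the two nearest support vectors; this is our own simplification, not the P-SVM implementation:

```python
def local_boundary(support_vectors, query):
    """Line through the two support vectors nearest to `query`,
    returned as coefficients (a, b, c) of a*x + b*y + c = 0."""
    p, q = sorted(support_vectors,
                  key=lambda s: (s[0] - query[0])**2 + (s[1] - query[1])**2)[:2]
    a, b = q[1] - p[1], p[0] - q[0]
    c = -(a * p[0] + b * p[1])
    return a, b, c

svs = [(0.0, 1.0), (1.0, 2.0), (4.0, 5.0)]
a, b, c = local_boundary(svs, (0.5, 1.0))
# the sign of a*x + b*y + c tells which side of the local boundary a point is on
side = a * 0.5 + b * 3.0 + c
```

In d dimensions the analogous step solves a small d-by-d linear system for the hyperplane through the d nearest support vectors.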
P-Bayesian
A Bayesian classifier uses Bayes' theorem to predict class membership as a conditional probability that a given data sample falls into a
particular class. The complexity of computing the conditional probability values can become prohibitive for most applications
with large datasets. Bayesian belief networks relax many constraints and use information about the domain to build a conditional
probability table. Naïve Bayesian classification is a lazy classifier: computational cost is reduced by using the Naïve assumption
of class-conditional independence to calculate the conditional probabilities when required. Bayesian belief networks require building
time and domain knowledge, whereas the Naïve approach loses accuracy if the assumption is not valid. The Ptree data structure allows us
to compute the Bayesian probability values efficiently, without the Naïve assumption, by building Ptrees for the training data. Bayesian
classification with P-trees, P-Bayesian, has been successfully applied to remotely sensed image data to predict yield in precision
agriculture [11].
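The counting that P-Bayesian replaces with Ptree root counts can be sketched directly: the class-conditional probability of a full attribute pattern is estimated as count(pattern AND class) / count(class), with no independence assumption. The data below is illustrative only:

```python
def cond_prob(rows, labels, pattern, cls):
    """P(sample matches `pattern` | label == cls) from raw counts.
    `pattern` maps attribute index -> required value."""
    in_class = [r for r, l in zip(rows, labels) if l == cls]
    matches = [r for r in in_class
               if all(r[i] == v for i, v in pattern.items())]
    return len(matches) / len(in_class)

rows = [(1, 0), (1, 1), (0, 1), (1, 0)]
labels = ['y', 'y', 'n', 'y']
print(cond_prob(rows, labels, {0: 1, 1: 0}, 'y'))
```

With Ptrees, both counts are root counts of ANDed bit groups, so the joint probability is exact rather than a product of per-attribute estimates.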
References
[1] Agrawal, Imielinski, and Swami, Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD, International
Conference on Management of Data (Washington, D.C.), 1993.
[2] Agrawal and Srikant, Fast algorithms for mining association rules. Proceeding of the VLDB, International Conference on Very Large Databases
(Santiago, Chile), 1994.
[3] Batory, “On Searching Transposed Files”. ACM Transactions on Database Systems (TODS), 4(4):531-544, 1979.
[4] Bayardo, Agrawal, and Gunopulos, Constraint-Based Rule Mining in Large, Dense Databases. Proceedings of the IEEE ICDE, International
Conference on Data Engineering (Sydney, Australia), 188-197, 1999.
[5] Copeland and Khoshafian. A Decomposition Storage Model. Proceeding of the ACM SIGMOD, International Conference on Management of Data
(Austin, Texas), 268-279, May 1985.
[6] Ding, Khan, Roy, and Perrizo, The P-tree algebra. Proceedings of the ACM SAC, Symposium on Applied Computing (Madrid, Spain), 2002.
[7] Qin Ding, Qiang Ding, and Perrizo, Association Rule Mining on Remotely Sensed Images Using P-trees. Proceedings of the PAKDD, Pacific-Asia
Conference on Knowledge Discovery and Data Mining, Springer-Verlag, Lecture Notes in Artificial Intelligence 2336, 66-79, May 2002.
[8] Han, Pei and Yin, Mining Frequent Patterns without Candidate Generation. Proceedings of the ACM SIGMOD, International Conference on
Management of Data (Dallas, Texas), 2000.
[9] Khan, Ding, and Perrizo. K-nearest Neighbor Classification on Spatial Data Stream Using P-trees. Proceedings of the PAKDD, Pacific-Asia
Conference on Knowledge Discovery and Data Mining, Springer-Verlag, Lecture Notes in Artificial Intelligence 2336, 517-528, May 2002.
[10] Pan, Wang, Ren, Hu. and Perrizo, Proximal Support Vector Machine for Spatial Data Using Peano Trees, Proceedings of the ISCA CAINE,
International Conference on Computer Applications in Industry and Engineering (Las Vegas, Nevada), November 2003.
[11] Perera, Serazi, and Perrizo, Performance Improvement for Bayesian Classification with P-Trees. Proceedings of the ISCA CAINE, International
Conference on Computer Applications in Industry and Engineering (San Diego, California), November 2002.
[12] Perrizo, Peano count tree technology lab notes. Technical Report NDSU-CS-TR-01-1, 2001.
http://www.cs.ndsu.nodak.edu/~perrizo/classes/785/pct.html. January 2003.
[13] Rahal and Perrizo, An Optimized Approach for KNN Text Categorization using P-trees. Accepted for publication in the proceedings of the ACM SAC,
Symposium on Applied Computing (Nicosia, Cyprus), March 2004.
[14] Shenoy, Haritsa, Sudarshan, Bhalotia, Bawa and Shah, Turbo-charging vertical mining of large databases. Proceedings of the ACM SIGMOD,
International Conference on Management of Data (Dallas, Texas), 22-29, May 2000.
[15] H. K. T. Wong, Liu, Olken, Rotem, and L. Wong. Bit Transposed Files. Proceeding of the VLDB, International Conference on Very Large Databases
(Stockholm, Sweden), 1985.