Soft Computing in Intelligent Systems and Information Processing, Proceedings of 1996 Asian Fuzzy Systems Symposium, Kenting, Taiwan, December 11-14, 1996, 332-337

SUPPORTING ROUGH SET THEORY IN VERY LARGE DATABASES USING ORACLE RDBMS

Rayne Chen*
Amdahl Corporation
1250 East Arques Avenue
Sunnyvale, California 94088-3470
[email protected]

T.Y. Lin**
Department of Electrical Engineering and Computer Science
University of California
Berkeley, California 94720

Earlier, a “new” rough set theory for very large databases was proposed by Lin. In this paper the authors evaluate the performance of that theory on a very large database. ORACLE, a relational database management system (RDBMS), is the market leader in open system databases. Windows NT has been growing into the data server environment and, with its open and cost-effective architecture, is a strong contender for decision support system applications. ORACLE running under Windows NT was therefore used in this report. The main goal of this research is to formulate a suitable rough set theory for very large databases.

1. Introduction

Data mining is one of the major new areas of research in database applications. Rough set theory is probably the most well developed theory and technology in data mining. In the past, rough set researchers focused on small to medium-sized applications, such as stock market and medical diagnosis systems. At CESA’96 in Lille, France, three papers attempted to deal with large databases [2], [3], [4]. The goal of this paper is to examine and evaluate the fundamental rough set operations in very large databases. The stress test table has 18 million rows and about three gigabytes of data.

Many decision support systems run on low-cost open system platforms with open system relational database management systems, such as ORACLE RDBMS on top of UNIX operating systems.
Microsoft Windows NT, typically running on x86 processors, is a new contender in the open operating system market and is capable of running many of the same server functions as UNIX operating systems [1]. So in this paper, we use ORACLE RDBMS running on a Windows NT system.

* This research is partially sponsored by Amdahl Corp. The views expressed are those of the author and do not necessarily represent the views of Amdahl Corp.
** This research is partially supported by Electric Power Research Institute, Palo Alto, California and Amdahl Corp.
*** T.Y. Lin is currently on leave from SJSU, San Jose, Ca 95192 ([email protected])

2. The Systems

2.1 The Platform

The system used is the same as in [1]. The NT platform was a generic x86 system with an Intel Pentium-133 CPU running Microsoft Windows NT 3.51. The ORACLE version used was 7.2. The detailed configuration is listed in the following table.

System Vendor    generic X86 System
1 CPU            Intel Pentium-133MHz, 256Kbytes cache
SPECint92        174.2
1 SCSI           Adaptec AHA-2940
Memory           64Mbytes, 70ns
OS/DBMS          Windows NT 3.51 / ORACLE 7.2.2.4.0
Table-spaces     Read-Only on NTFS
5 Disks          Seagate ST12400N SUN2.1G

2.2 The Database

The database used here is a subset of the one used in the performance study conducted in [1]. The main focus is on the table main. Most decision support system (and hence data mining) databases fall into the very large database (VLDB) category because of their size, so the database must be reasonably large to represent a real environment. To stress-test rough set operations in an RDBMS, the workload used a very large table. Columns with different characteristics were designed into the main data table to facilitate the different operations.
In the row size calculation, all integer and floating point fields are counted as 4 bytes long. Under this convention the 12 numeric columns of the schema in Section 2.3 account for 48 bytes and the five character columns for 106 bytes, giving 154 bytes per row. The table and index definitions are written in SQL. The row size, the number of rows of the main table, and the actual storage used by ORACLE on the NT platform are listed below.

Table Main
row size (bytes)                154
number of rows                  18,815,889
table size (MBs)                2,897.5
physical NT table size (MBs)    3,229.6

The total physical database size is about 6.7 gigabytes, including the temporary table spaces allocated for ORACLE. The physical database size is larger than the calculated size because of the data row storage allocation in each data block and the free space reserved by ORACLE.

2.3 The Database Schema

The main table was created using the following SQL:

    CREATE TABLE main (
      key  number(10),  i1g  number(9),  i1m  number(6),
      i1k  number(3),   c1k  char(4),    f1k  number(3,0),
      i1h  number(2),   i5   number(1),  i2   number(1),
      z1k  number(3),   z1h  number(2),  z5   number(1),
      z2   number(1),
      s1   char(28) default ('1234567890123456789012345678') not null,
      s2   char(23) default ('12345678901234567890123') not null,
      s3   char(28) default ('1234567890123456789012345678') not null,
      s4   char(23) default ('12345678901234567890123') not null);

3. Performance Considerations

3.1. Classical Rough Set Theory

According to [5], one needs at least n^2 time to compute the discernibility matrix, and finding a minimal reduct is NP-hard. For table main in this database, with n of about 18,000,000 rows, this is about 3.24 x 10^14 elementary steps. At 10 nanoseconds per step we need at least 0.18 * 18,000,000 seconds, which is equivalent to 37.5 days (24 hours a day). In this computation we assumed that (1) the execution time for one instruction is 10 nanoseconds and (2) there is adequate main memory to hold the discernibility matrix; such a computer is a “super” computer. So the classical approach cannot be applied directly to very large databases without modification [3].

3.2. Rough Set Theory for Large Databases: High Frequency Decision Tables

Lin proposed to use a set of tables [3] that consist of high frequency data to replace the classical rough set theory when databases are very large. The set of tables of “lower degrees” contains reducts and simplified rules (value reducts) of the table of the highest degree. However, we should point out that there is no inexpensive way to identify which “shorter” rules are simplifications of “longer” rules; it is NP-hard. So some sacrifices are necessary. We believe that all rules, short or long, are patterns in the data. It is relatively unimportant to know that certain rules refer to the same cases in the “past” or “training” data. We are using patterns for “future” data; precise history may not be very important.

Example 1. i1k_z2_table

    -- inner query: count rows for each (condition value, decision value) pair
    -- middle query: keep condition values that lead to exactly one decision value
    -- outer query: number of rules, rows covered, and average rows per rule
    select count(ccn), sum(ccn), avg(ccn)
    from (
      select mi1k, max(z2), max(cn) ccn, count(*)
      from (
        select mod(i1k,100) mi1k, z2, count(*) cn
        from main a
        group by mod(i1k,100), z2
      )
      group by mi1k
      having count(*) = 1
    );

Example 2. i1k_i1m_z2_table

    select count(ccn), sum(ccn), avg(ccn)
    from (
      select mi1k, mi1m, max(z2), max(cn) ccn, count(*)
      from (
        select mod(i1k,100) mi1k, mod(i1m,100) mi1m, z2, count(*) cn
        from main a
        group by mod(i1m,100), mod(i1k,100), z2
      )
      group by mi1k, mi1m
      having count(*) = 1
    );

Example 3. i1k_i1m_i1g_z2_table

    select count(ccn), sum(ccn), avg(ccn)
    from (
      select mi1k, mi1m, mi1g, max(z2), max(cn) ccn, count(*)
      from (
        select mod(i1k,100) mi1k, mod(i1m,100) mi1m, mod(i1g,100) mi1g, z2, count(*) cn
        from main
        group by mod(i1k,100), mod(i1m,100), mod(i1g,100), z2
      )
      group by mi1k, mi1m, mi1g
      having count(*) = 1
    );

Example 4. i1h_i1k_i1m_i1g_z2_table

    select count(ccn), sum(ccn), avg(ccn)
    from (
      select i1h, mi1k, mi1m, mi1g, max(z2), max(cn) ccn, count(*)
      from (
        select i1h, mod(i1k,100) mi1k, mod(i1m,100) mi1m, mod(i1g,100) mi1g, z2, count(*) cn
        from main
        group by i1h, mod(i1k,100), mod(i1m,100), mod(i1g,100), z2
      )
      group by i1h, mi1k, mi1m, mi1g
      having count(*) = 1
    );

The results of the experiments are shown below:

SQL Query Name        # of Rules Found   # of Rows Found   Rows/Rule   Elapsed Time (seconds)
i1g_z2                0                  0                             1275.87
i1m_z2                20                 3,761,767         188,088     1218.96
i1k_z2                30                 5,642,748         188,092     1196.58
i1h_z2                0                  0                             788.48
i1g_i1m_z2            2,000              3,761,767         1,881       1905.43
i1g_i1k_z2            3,000              5,642,748         1,881       1856.02
i1g_i1h_z2            0                  0                             1589.06
i1m_i1k_z2            4,400              8,277,322         1,881       1827.01
i1m_i1h_z2            2,000              3,761,767         1,881       1530.94
i1k_i1h_z2            3,000              5,642,748         1,881       1498.52
i1g_i1m_i1k_z2        442,622            8,312,982         18.78       3280.98
i1g_i1m_i1h_z2        219,483            4,057,313         18.49       2974.85
i1g_i1k_i1h_z2        309,405            5,780,064         18.68       2918.92
i1m_i1k_i1h_z2        442,589            8,312,215         18.78       2869.93
i1g_i1m_i1k_i1h_z2    16,852,465         18,070,168        1.07        7102.95

Note that if the number of rows found in a shorter table is greater than or equal to that of a longer table, then the shorter table is highly likely to be one of the reducts of the longer table. The row counts above therefore indicate that the only reducts are the whole longer tables. We would like to point out to data miners that the counts shown are repetitions of rules, not of raw data; inconsistent rows are not included. In this paper, we focus on the performance and feasibility of the proposed “new” theory. The artificial data created in [1] was used to avoid any hidden desire: table main was created before this project started, so it reflects neither a hidden desire nor the semantic aspects of the data.
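The four example queries share one pattern: count rows per (condition values, decision value) group, keep only the condition values that lead to a single decision value, and then summarize the surviving rules. The following is a minimal sketch of that pattern, not the authors' test harness. It assumes Python's built-in sqlite3 module in place of ORACLE, a hypothetical seven-row table, and hypothetical helper names (build_toy_table, high_frequency_rules, summarize); the mod(...,100) bucketing of the original queries is omitted because the toy values are already small.

```python
import sqlite3

# Hypothetical toy rows (i1k, i1m, z2); z2 plays the role of the decision column.
ROWS = [
    (1, 1, 0), (1, 1, 0), (1, 2, 0),
    (2, 1, 0), (2, 1, 1), (2, 2, 1),
    (3, 1, 1),
]


def build_toy_table():
    """Create an in-memory table named main holding the toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE main (i1k INTEGER, i1m INTEGER, z2 INTEGER)")
    conn.executemany("INSERT INTO main VALUES (?, ?, ?)", ROWS)
    return conn


def high_frequency_rules(conn, conditions, decision="z2"):
    """Return one tuple (condition values..., decision value, support) for each
    combination of condition values that maps to exactly one decision value."""
    cond = ", ".join(conditions)
    sql = f"""
        SELECT {cond}, MAX({decision}) AS decision_value, MAX(cn) AS support
        FROM (
            -- rows per (condition values, decision value) group
            SELECT {cond}, {decision}, COUNT(*) AS cn
            FROM main
            GROUP BY {cond}, {decision}
        )
        GROUP BY {cond}
        HAVING COUNT(*) = 1      -- consistent groups only
        ORDER BY {cond}
    """
    return conn.execute(sql).fetchall()


def summarize(rules):
    """Mirror count(ccn), sum(ccn), avg(ccn): rules, rows covered, rows per rule."""
    n_rules = len(rules)
    n_rows = sum(rule[-1] for rule in rules)
    return n_rules, n_rows, (n_rows / n_rules if n_rules else None)


if __name__ == "__main__":
    conn = build_toy_table()
    for conditions in (["i1k"], ["i1k", "i1m"]):
        rules = high_frequency_rules(conn, conditions)
        print(conditions, rules, summarize(rules))
```

In the toy data the value i1k = 2 occurs with both decision values, so it yields no rule in the shorter table; adding i1m as a second condition recovers one consistent rule for it, which is the same effect the longer tables show in the results above.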
We will report our semantic study in future reports.

Conclusions

This project showed that, in spite of the time complexity problem, one can still apply rough set methodology to very large databases. We believe this paper establishes that the tuned rough set methodology is a viable tool in data mining.

References

1. Chen, R., Using Windows NT for Decision Support with ORACLE RDBMS.
2. M. Fernandez-Baizan, E. Menasalvasruiz, J. Pena, M. Castano, E. Santos, R. Portaencasa, and C. Perez, Rough Sets as a Foundation to add Data Mining Capabilities to a RDMS, CESA’96 IMACS Multiconference, Symposium on Modeling, Analysis and Simulation, pp. 764-768.
3. Lin, T. Y., Rough Set Theory in Very Large Databases, CESA’96 IMACS Multiconference, Symp. on Modeling, Analysis and Simulation, pp. 936-941.
4. Nguyen, S. H., Nguyen, T. T., Polkowski, L., Skowron, A., Synak, P., and Wroblewski, J., Decisions Rules for Large Data Tables, pp. 942-947.
5. Skowron, A., and Rauszer, C., The discernibility matrices and functions in information systems, Decision Support by Experience - Application of the Rough Sets Theory, R. Slowinski (ed.), Kluwer Academic Publishers, 1992, pp. 331-362.

Appendix. ORACLE parameters in the NT system

The following is a partial list of the init.ora file on the NT system. This file contains the ORACLE configuration and tuning parameters for the NT system.

db_block_size = 8192
db_file_multiblock_read_count = 32
db_block_buffers = 2048
sort_area_size = 5000000
shared_pool_size = 1024000
parallel_min_servers = 5
parallel_max_servers = 20
compatible = 7.2.0
sequence_cache_entries = 100
sequence_cache_hash_buckets = 89
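To relate the listed parameters to the 64 Mbytes of memory on the test machine, the sizes they imply can be worked out directly. The sketch below assumes the usual reading of these settings: the buffer cache holds db_block_buffers blocks of db_block_size bytes, and a multiblock read fetches at most db_file_multiblock_read_count blocks. The function name derived_sizes and the dictionary layout are illustrative only.

```python
# Values copied from the init.ora listing above (sizes in bytes or block counts).
INIT_ORA = {
    "db_block_size": 8192,
    "db_file_multiblock_read_count": 32,
    "db_block_buffers": 2048,
    "sort_area_size": 5000000,
    "shared_pool_size": 1024000,
}


def derived_sizes(params):
    """Compute the memory and I/O sizes implied by the listed parameters."""
    block = params["db_block_size"]
    return {
        # buffer cache = number of buffers * block size
        "buffer_cache_bytes": params["db_block_buffers"] * block,
        # largest single sequential read issued during a full table scan
        "multiblock_read_bytes": params["db_file_multiblock_read_count"] * block,
        # sort area and shared pool are already given in bytes
        "sort_area_bytes": params["sort_area_size"],
        "shared_pool_bytes": params["shared_pool_size"],
    }


if __name__ == "__main__":
    for name, size in derived_sizes(INIT_ORA).items():
        print(f"{name}: {size} bytes ({size / 2**20:.2f} MiB)")
```

Under these assumptions the buffer cache is 16 MiB and each multiblock read covers 256 KiB, which suits the long sequential scans that the grouping queries perform over table main.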