Soft Computing in Intelligent Systems and Information Processing, Proceedings of
1996 Asian Fuzzy Systems Symposium, Kenting, Taiwan, December 11-14, 1996,
332-337
SUPPORTING ROUGH SET THEORY IN VERY LARGE DATABASES
USING ORACLE RDBMS
Rayne Chen*
Amdahl Corporation
1250 East Arques Avenue
Sunnyvale, California 94088-3470
[email protected]
T.Y. Lin**
Department of Electrical Engineering
and Computer Science
University of California
Berkeley, California 94720
Earlier, a “new” rough set theory for very large databases was proposed by Lin. In
this paper the authors evaluate the performance of such a rough set theory on a
very large database. ORACLE, a relational database management system (RDBMS),
is the market leader in open system databases. Windows NT has been growing into
the data server environment and, with its open and cost-effective architecture, is a
strong contender for decision support system applications. So ORACLE running
under Windows NT was used in this report. The main goal of this research is to
formulate a suitable rough set theory for very large databases.
1. Introduction
Data mining is one of the new major areas of research in applications of databases.
Rough set theory is probably the most well developed theory and technology in data
mining. In the past, rough set researchers focused on medium to small size applications,
such as stock market and medical diagnosis systems. At CESA’96 in Lille, France,
three papers attempted to deal with large databases [2], [3], [4]. The goal of this
paper is to examine and evaluate the fundamental rough set operations in very large
databases. The stress test table has 18 million rows and about three gigabytes of
data. Many decision support systems have been running on low cost open system
platforms with open system relational database management systems, such as
ORACLE RDBMS on top of UNIX operating systems. Microsoft Windows NT,
typically running on x86 processors, is a new contender in the open operating
system market and is capable of providing many of the same server functions as
UNIX operating systems [1]. So in this paper, we use ORACLE RDBMS running on
a Windows NT system.
2. The Systems
* This research is partially sponsored by Amdahl Corp. The views expressed are those of the
author and do not necessarily represent the views of Amdahl Corp.
** This research is partially supported by Electric Power Research Institute, Palo Alto,
California and Amdahl Corp.
*** T.Y. Lin is currently on leave from SJSU, San Jose, CA 95192 ([email protected])
2.1 The Platform
The system used is the same as in [1]. The NT platform was a generic x86 system
with an Intel Pentium-133 CPU running Microsoft Windows NT 3.51. The
ORACLE version used was 7.2. The detailed configuration is listed in the following
table.
System Vendor    generic X86 system
1 CPU            Intel Pentium-133MHz, 256Kbytes cache
SPECint92        174.2
1 SCSI           Adaptec AHA-2940
Memory           64Mbytes, 70ns
OS / DBMS        Windows NT 3.5.1 / ORACLE 7.2.2.4.0
Table-spaces     Read-Only on NTFS
5 Disks          Seagate ST12400N SUN2.1G
2.2 The Database
The database used here is a subset of the performance study conducted in [1]. The
main focus is on the table main. Most decision support system (hence data mining)
databases are in the very large database (VLDB) category because of their size. So
the database has to have a reasonable size to represent the real environment. In
order to conduct a stress test on rough set operations in an RDBMS, the workload used a
very large table. Columns of different characteristics were designed in the main data
table to facilitate the different operations. In the row size calculation, all integer and
floating point fields are counted as 4 bytes long. The table and index definitions are
defined by SQL. The row size, number of rows of the main table, and actual storage
size used by ORACLE on the NT platform are listed below.
Table    row size (bytes)    number of rows    table size (MBs)    physical NT table size (MBs)
Main     154                 18,815,889        2,897.5             3,229.6
The total physical database size is about 6.7 gigabytes, including the temporary
table spaces allocated for ORACLE. The physical database size is larger than the
calculated size because of the data row storage allocation in each data block and the
free space reserved by ORACLE.
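For reference (our arithmetic from the figures above), the block-level overhead for
table main alone is 3,229.6 MB / 2,897.5 MB, i.e. roughly 11% above the calculated size.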
2.3 The Database Schema
The main table was created using the following SQL:
CREATE TABLE main (
  key   number(10),
  i1k   number(3),
  i1h   number(2),
  z1k   number(3),
  z2    number(1),
  i1g   number(9),
  c1k   char(4),
  i5    number(1),
  z1h   number(2),
  i1m   number(6),
  f1k   number(3,0),
  i2    number(1),
  z5    number(1),
  s1    char(28) default ('1234567890123456789012345678') not null,
  s2    char(23) default ('12345678901234567890123') not null,
  s3    char(28) default ('1234567890123456789012345678') not null,
  s4    char(23) default ('12345678901234567890123') not null);
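As a quick consistency check (our arithmetic, using the 4-byte convention stated in
Section 2.2 for the numeric and char(4) columns):

  13 columns * 4 bytes + (28 + 23 + 28 + 23) bytes for s1-s4 = 52 + 102 = 154 bytes,

which matches the row size of 154 bytes reported for table main above.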
3. Performance Consideration
3.1. Classical Rough Set Theory
According to [5], one needs at least n² time to compute the discernibility matrix, and
finding a minimal reduct is NP-hard. For table main in this database, we need at least
0.18*18,000,000 seconds, which is equivalent to about 37.5 days (24 hours a day). In this
computation we assumed (1) that the execution time for one instruction is 10 nanoseconds
and (2) that there is adequate main memory to create the discernibility matrix;
such a computer is a “super” computer. So the classical approach cannot be applied
without modification to very large databases [3].
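Spelling the estimate out (our arithmetic, rounding the table to n = 18,000,000 rows
and taking 10 nanoseconds per instruction):

  n * n * 10 ns = 18,000,000 * (18,000,000 * 10 ns) = 18,000,000 * 0.18 s = 3,240,000 s,

and 3,240,000 seconds / 86,400 seconds per day is about 37.5 days.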
3.2. Rough Set Theory for Large Databases - High Frequency Decision Tables
Lin proposed to use a set of tables [3] that consist of high frequency data to replace
the classical rough set theory when databases are very large. The set of tables of “lower
degrees” contains reducts and simplified rules (value reducts) of the table of the
highest degree. However, we should point out that there is no inexpensive way
to identify which “shorter” rules are simplifications of “longer” rules; it is NP-hard.
So some sacrifices are necessary. We believe that all rules, whether short or
long, are patterns in the data. It is relatively unimportant to know that certain rules
refer to the same cases in the “past or training” data. We are using patterns for
“future” data; precise history may not be very important.
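The four example queries below share the same three-level structure. The following
commented sketch is our annotation of the Example 1 query (condition attribute
mod(i1k,100), decision attribute z2); the comments describe our reading of what each
level computes:

select count(ccn),    -- number of deterministic rules found
       sum(ccn),      -- number of rows covered by those rules
       avg(ccn)       -- average rows per rule
from ( -- one row per condition value whose rows all share a single z2 value
       select mi1k, max(z2), max(cn) ccn, count(*)
       from ( -- one row per (condition value, decision value) pair;
              -- cn = number of rows of main supporting that pair
              select mod(i1k,100) mi1k, z2, count(*) cn
              from main
              group by mod(i1k,100), z2 )
       group by mi1k
       having count(*) = 1 );   -- exactly one z2 value for this condition value

The three aggregates in the outermost query correspond to the “# of Rules Found”,
“# of Rows Found”, and “Rows/Rule” columns in the results table below.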
Example 1. i1k_z2_table
select count(ccn), sum(ccn), avg(ccn) from (
select mi1k, max(z2), max(cn) ccn, count(*) from (
select mod(i1k,100) mi1k,z2, count(*) cn from main a
group by mod(i1k,100),z2 )
group by mi1k having count(*) = 1 );
Example 2. i1k_i1m_z2_table
select count(ccn), sum(ccn), avg(ccn) from (
select mi1k, mi1m, max(z2), max(cn) ccn, count(*) from (
select mod(i1k,100) mi1k,mod(i1m,100) mi1m,z2, count(*) cn
from main a
group by mod(i1m,100),mod(i1k,100),z2 )
group by mi1k, mi1m having count(*) = 1 );
Example 3. i1k_i1m_i1g_z2_table
select count(ccn), sum(ccn), avg(ccn) from (
select mi1k, mi1m, mi1g, max(z2), max(cn) ccn, count(*) from (
select mod(i1k,100) mi1k,mod(i1m,100) mi1m, mod(i1g,100) mi1g,
z2, count(*) cn
from main
group by mod(i1k,100), mod(i1m,100), mod(i1g,100),z2 )
group by mi1k, mi1m, mi1g having count(*) = 1 );
Example 4. i1h_i1k_i1m_i1g_z2_table
select count(ccn), sum(ccn), avg(ccn) from (
select i1h, mi1k, mi1m, mi1g, max(z2), max(cn) ccn, count(*)
from ( select i1h, mod(i1k,100) mi1k, mod(i1m,100) mi1m,
mod(i1g,100) mi1g, z2, count(*) cn
from main
group by i1h, mod(i1k,100), mod(i1m,100), mod(i1g,100), z2 )
group by i1h, mi1k, mi1m, mi1g having count(*) = 1);
The results of the experiments are shown below:
SQL Query Name          # of Rules Found    # of Rows Found    Rows/Rule    Elapsed Time (seconds)
i1g_z2                  0                   0                  -            1275.87
i1m_z2                  20                  3,761,767          188088       1218.96
i1k_z2                  30                  5,642,748          188092       1196.58
i1h_z2                  0                   0                  -            788.48
i1g_i1m_z2              2,000               3,761,767          1881         1905.43
i1g_i1k_z2              3,000               5,642,748          1881         1856.02
i1g_i1h_z2              0                   0                  -            1589.06
i1m_i1k_z2              4,400               8,277,322          1881         1827.01
i1m_i1h_z2              2,000               3,761,767          1881         1530.94
i1k_i1h_z2              3,000               5,642,748          1881         1498.52
i1g_i1m_i1k_z2          442,622             8,312,982          18.78        3280.98
i1g_i1m_i1h_z2          219,483             4,057,313          18.49        2974.85
i1g_i1k_i1h_z2          309,405             5,780,064          18.68        2918.92
i1m_i1k_i1h_z2          442,589             8,312,215          18.78        2869.93
i1g_i1m_i1k_i1h_z2      16,852,465          18,070,168         1.07         7102.95
Note that if the number of rows found for a shorter table is greater than or equal to that of a
longer table, then the shorter table is highly likely to be one of the reducts of the
longer table. So the row counts in the table above indicate that the only reducts are
the whole longer tables. We would also like to point out to data miners that the figures
shown count repetitions of rules, not of data; inconsistent rows are not included. In this
paper, we focus on the performance and feasibility of the proposed “new” theory.
The artificial data created in [1] was used to avoid any hidden bias; table main
was created before this project started, so of course the table contains neither a
built-in desired outcome nor the semantic aspects of real data. We will report our semantic
study in future reports.
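As a concrete check with the figures above: the full combination i1g_i1m_i1k_i1h_z2
covers 18,070,168 rows, while the largest three-attribute combination, i1g_i1m_i1k_z2,
covers only 8,312,982 rows; since 8,312,982 < 18,070,168, no proper subset of
{i1g, i1m, i1k, i1h} passes the row-count test, so none is a candidate reduct of the full table.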
Conclusions
This project showed that, in spite of the time complexity problem, one can still apply
rough set methodology to very large databases. We believe this paper establishes
that the tuned rough set methodology is a viable tool in data mining.
References
1. Chen, R., Using Windows NT for Decision Support with ORACLE RDBMS.
2. Fernandez-Baizan, M., Menasalvasruiz, E., Pena, J., Castano, M., Santos, E.,
   Portaencasa, R., and Perez, C., Rough Sets as a Foundation to add Data Mining
   Capabilities to a RDMS, CESA’96 IMACS Multiconference, Symposium on
   Modeling, Analysis and Simulation, pp. 764-768.
3. Lin, T. Y., Rough Set Theory in Very Large Databases, CESA’96 IMACS
   Multiconference, Symposium on Modeling, Analysis and Simulation, pp. 936-941.
4. Nguyen, S. H., Nguyen, T. T., Polkowski, L., Skowron, A., Synak, P., and
   Wroblewski, J., Decision Rules for Large Data Tables, CESA’96 IMACS
   Multiconference, Symposium on Modeling, Analysis and Simulation, pp. 942-947.
5. Skowron, A., and Rauszer, C., The discernibility matrices and functions in
   information systems, in: Decision Support by Experience - Application of the
   Rough Sets Theory, R. Slowinski (ed.), Kluwer Academic Publishers, 1992,
   pp. 331-362.
Appendix. ORACLE parameters in the NT system
The following is a partial list of the init.ora file on the NT system. This file contains
the ORACLE configuration and tuning parameters for the NT system.
db_block_size                 = 8192
db_file_multiblock_read_count = 32
db_block_buffers              = 2048
sort_area_size                = 5000000
shared_pool_size              = 1024000
parallel_min_servers          = 5
parallel_max_servers          = 20
compatible                    = 7.2.0
sequence_cache_entries        = 100
sequence_cache_hash_buckets   = 89