Cover Feature

Mining Very Large Databases

The explosive growth of databases makes the scalability of data-mining techniques increasingly important. The authors describe algorithms that address three classical data-mining problems.

Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan
University of Wisconsin-Madison

Established companies have had decades to accumulate masses of data about their customers, suppliers, and products and services. The rapid pace of e-commerce means that Web startups can become huge enterprises in months, not years, amassing proportionately large databases as they grow. Data mining, also known as knowledge discovery in databases,1 gives organizations the tools to sift through these vast data stores to find the trends, patterns, and correlations that can guide strategic decision making.

Traditionally, algorithms for data analysis assume that the input data contains relatively few records. Current databases, however, are much too large to be held in main memory. Retrieving data from disk is markedly slower than accessing data in RAM. Thus, to be efficient, the data-mining techniques applied to very large databases must be highly scalable. An algorithm is said to be scalable if—given a fixed amount of main memory—its runtime increases linearly with the number of records in the input database. Recent work has focused on scaling data-mining algorithms to very large data sets. In this survey, we describe a broad range of algorithms that address three classical data-mining problems: market basket analysis, clustering, and classification.

MARKET BASKET ANALYSIS

A market basket is a collection of items purchased by a customer in an individual customer transaction, which is a well-defined business activity—for example, a customer's visit to a grocery store or an online purchase from a virtual store such as Amazon.com. Retailers accumulate huge collections of transactions by recording business activity over time. One common analysis run against a transactions database is to find sets of items, or itemsets, that appear together in many transactions. Each pattern extracted through the analysis consists of an itemset and the number of transactions that contain it. Businesses can use knowledge of these patterns to improve the placement of items in a store or the layout of mail-order catalog pages and Web pages.

An itemset containing i items is called an i-itemset. The percentage of transactions that contain an itemset is called the itemset's support. For an itemset to be interesting, its support must be higher than a user-specified minimum; such itemsets are said to be frequent.

Figure 1 shows three transactions stored in a relational database system. The database has five fields: a transaction identifier, a customer identifier, the item purchased, its price, and the transaction date. The first transaction shows a customer who bought a computer, MS Office, and Doom. As an example, the 2-itemset {hard disk, Doom} has a support of 67 percent.

Why is finding frequent itemsets a nontrivial problem? First, the number of customer transactions can be very large and usually will not fit in memory. Second, the potential number of frequent itemsets is exponential in the number of different items, although the actual number of frequent itemsets can be much smaller. The example in Figure 1 shows four different items, so there are 2^4 − 1 = 15 potential frequent itemsets. If the minimum support is 60 percent, only five itemsets are actually frequent.
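To make the support definition concrete, the following minimal sketch (written for this survey's running example; the function name and data layout are illustrative, not from any cited system) counts how many of Figure 1's transactions contain a given itemset:

    def support(itemset, transactions):
        """Fraction of transactions that contain every item in the itemset."""
        itemset = set(itemset)
        hits = sum(1 for t in transactions if itemset <= set(t))
        return hits / len(transactions)

    # The three transactions of Figure 1, reduced to their item sets.
    transactions = [
        {"computer", "MSOffice", "Doom"},
        {"hard disk", "Doom"},
        {"computer", "hard disk", "Doom"},
    ]
    print(support({"hard disk", "Doom"}, transactions))  # about 0.67: frequent at a 60 percent minimum
    print(support({"MSOffice"}, transactions))           # about 0.33: infrequent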
Thus, we want algorithms that are scalable with respect to the number of transactions and examine as few infrequent itemsets as possible. Efficient algorithms have been designed to address these criteria. The Apriori algorithm2 provided one early solution, which subsequent algorithms built upon.

APRIORI ALGORITHM

This algorithm computes the frequent itemsets in several rounds. Round i computes all frequent i-itemsets. A round has two steps: candidate generation and candidate counting. Consider the ith round. In the candidate generation step, the algorithm generates a set of candidate i-itemsets whose support has not yet been computed. In the candidate counting step, the algorithm scans the transaction database, counting the supports of the candidate itemsets. After the scan, the algorithm discards candidates with support lower than the user-specified minimum and retains only the frequent i-itemsets.

In the first round, the generated set of candidate itemsets contains all 1-itemsets. The algorithm counts their support during the candidate counting step. Thus, after the first round, all frequent 1-itemsets are known. What are the candidate itemsets generated during the candidate generation step of round two? Naively, all pairs of items are candidates. Apriori reduces the set of candidate itemsets by pruning—a priori—those candidate itemsets that cannot be frequent, based on knowledge about infrequent itemsets obtained from previous rounds. The pruning is based on the observation that if an itemset is frequent, all its subsets must be frequent as well. Therefore, before entering the candidate counting step, the algorithm can discard every candidate itemset with a subset that is infrequent.

Consider the database in Figure 1. Assume that the minimum support is 60 percent—so an itemset is frequent if it is contained in at least two transactions. In round one, all single items are candidate itemsets and are counted during the candidate counting step. In round two, only pairs of items in which each item is frequent can become candidates. For example, the itemset {MSOffice, Doom} is not a candidate, since round one determined that its subset {MSOffice} is not frequent. In round two, therefore, the algorithm counts the candidate itemsets {computer, Doom}, {hard disk, Doom}, and {computer, hard disk}. In round three, no candidate itemset survives the pruning step. The itemset {computer, hard disk, Doom} is pruned a priori because its subset {computer, hard disk} is not frequent. Thus, with respect to a minimum support of 60 percent, the frequent itemsets in our sample database and their support values are

• {computer} 67 percent,
• {hard disk} 67 percent,
• {Doom} 100 percent,
• {computer, Doom} 67 percent, and
• {hard disk, Doom} 67 percent.

Apriori counts not only the support of all frequent itemsets, but also the support of those infrequent candidate itemsets that could not be eliminated during the pruning step. The set of all candidate itemsets that are infrequent but whose support is counted by Apriori is called the negative border. Thus, an itemset is in the negative border if it is infrequent, but all its subsets are frequent. In our example, the negative border consists of itemsets {MSOffice} and {computer, hard disk}. All subsets of an itemset in the negative border are frequent; otherwise the itemset would have been eliminated by the subset-pruning step.
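The round structure just described can be sketched compactly. The code below is an illustrative simplification for small, in-memory data, not the published Apriori implementation; it shows candidate generation with a priori subset pruning followed by one counting scan per round.

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return a dict mapping each frequent itemset to its support count."""
        needed = min_support * len(transactions)
        # Round one: count all 1-itemsets.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c for s, c in counts.items() if c >= needed}
        result = dict(frequent)
        k = 2
        while frequent:
            prev = set(frequent)
            # Candidate generation: join frequent (k-1)-itemsets, then prune
            # every candidate that has an infrequent (k-1)-subset.
            candidates = set()
            for a in prev:
                for b in prev:
                    union = a | b
                    if len(union) == k and all(frozenset(s) in prev
                                               for s in combinations(union, k - 1)):
                        candidates.add(union)
            # Candidate counting: one scan over the transaction database.
            counts = {c: 0 for c in candidates}
            for t in transactions:
                items = set(t)
                for c in candidates:
                    if c <= items:
                        counts[c] += 1
            frequent = {s: c for s, c in counts.items() if c >= needed}
            result.update(frequent)
            k += 1
        return result

Run on Figure 1's transactions with a 60 percent minimum support, this returns exactly the five frequent itemsets listed above.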
TID   CID   Item        Price   Date
101   201   Computer    1,500   1/4/99
101   201   MSOffice    300     1/4/99
101   201   Doom        100     1/4/99
102   201   Hard disk   500     1/7/99
102   201   Doom        100     1/7/99
103   202   Computer    1,500   1/24/99
103   202   Hard disk   500     1/24/99
103   202   Doom        100     1/24/99

Figure 1. Database containing three transactions.

Optimizing Apriori

Apriori scans the database several times, depending on the size of the longest frequent itemset. Several refinements have been proposed that focus on reducing the number of database scans, the number of candidate itemsets counted in each scan, or both.

Partitioning. Ashok Savasere and colleagues3 developed Partition, an algorithm that requires only two scans of the transaction database. The database is divided into disjoint partitions, each small enough to fit in memory. In a first scan, the algorithm reads each partition and computes locally frequent itemsets on each partition using Apriori. In the second scan, the algorithm counts the support of all locally frequent itemsets toward the complete database. If an itemset is frequent with respect to the complete database, it must be frequent in at least one partition; therefore the second scan counts a superset of all potentially frequent itemsets.

Hashing. Jong Soo Park and colleagues4 proposed using probabilistic counting to reduce the number of candidate itemsets counted during each round of Apriori execution. This reduction is accomplished by subjecting each candidate k-itemset to a hash-based filtering step in addition to the pruning step. During candidate counting in round k − 1, the algorithm constructs a hash table. Each entry in the hash table is a counter that maintains the sum of the supports of the k-itemsets that correspond to that particular entry of the hash table. The algorithm uses this information in round k to prune the set of candidate k-itemsets. After subset pruning as in Apriori, the algorithm can remove a candidate itemset if the count in its hash table entry is smaller than the minimum support threshold.
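For the common case of pruning 2-itemset candidates, the idea can be sketched as follows: while counting single items, every pair in each transaction is hashed into a small array of counters; a pair's bucket count is an upper bound on its support, so any pair whose bucket falls below the minimum can be dropped before round two. This is a schematic illustration of the filtering step, not the cited algorithm as published.

    from itertools import combinations

    def round_two_candidates(transactions, min_count, n_buckets=1009):
        """Candidate pairs that survive both subset pruning and hash filtering."""
        item_counts = {}
        buckets = [0] * n_buckets
        for t in transactions:
            for item in t:
                item_counts[item] = item_counts.get(item, 0) + 1
            for pair in combinations(sorted(t), 2):
                buckets[hash(pair) % n_buckets] += 1  # bucket count >= pair's support
        frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
        return [pair for pair in combinations(frequent_items, 2)
                if buckets[hash(pair) % n_buckets] >= min_count]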
Sampling. Hannu Toivonen5 proposed a sampling-based algorithm that typically requires two scans of the database. The algorithm first takes a sample from the database and generates a set of candidate itemsets that are highly likely to be frequent in the complete database. In a subsequent scan over the database, the algorithm counts these itemsets' exact supports and the support of their negative border. If no itemset in the negative border is frequent, then the algorithm has discovered all frequent itemsets. Otherwise, some superset of an itemset in the negative border could be frequent, but its support has not yet been counted. The sampling algorithm generates and counts all such potentially frequent itemsets in a subsequent database scan.

Dynamic itemset counting. Sergey Brin and colleagues6 proposed the Dynamic Itemset Counting algorithm. DIC partitions the database into several blocks marked by start points and repeatedly scans the database. In contrast to Apriori, DIC can add new candidate itemsets at any start point, instead of just at the beginning of a new database scan. At each start point, DIC estimates the support of all itemsets that are currently counted and adds a new itemset to the set of candidate itemsets if all its subsets are estimated to be frequent. If DIC adds all frequent itemsets and their negative border to the set of candidate itemsets during the first scan, it will have counted each itemset's exact support at some point during the second scan; thus DIC will complete in two scans.

Extensions and generalizations

Several researchers have proposed extensions to the basic problem of finding frequent itemsets.

Is-a hierarchy. One extension considers an is-a hierarchy on database items. An is-a hierarchy defines which items are specializations or generalizations of other items. For instance, as shown in Figure 2, the items {computer, hard disk} in Figure 1 can be generalized to the item hardware. The extended problem is to compute itemsets that include items from different hierarchy levels.

Hardware: Computer, Hard disk
Software: MSOffice, Doom

Figure 2. Sample taxonomy for an is-a hierarchy of database items.

The presence of a hierarchy modifies the notion of when an item is contained in a transaction: In addition to the items listed explicitly, the transaction contains their ancestors in the taxonomy. This allows the detection of relationships involving higher hierarchy levels, since an itemset's support can increase if an item is replaced by one of its ancestors.

Consider the taxonomy in Figure 2. The transaction {computer, MSOffice} contains not only the items computer and MSOffice, but also hardware and software. In Figure 1's sample database, the support of the itemset {computer, MSOffice} is 33 percent, whereas the support of the itemset {computer, software} is 67 percent.

One approach to computing frequent itemsets in the presence of a taxonomy is to conceptually augment each transaction with the ancestors of all items in the transaction. Any algorithm for computing frequent itemsets can now be used on the augmented database. Optimizations on this basic strategy have been described by Rakesh Agrawal and Ramakrishnan Srikant.7

Sequential patterns. With each customer, we can associate a sequence of transactions ordered over time. The business goal is to find sequences of itemsets that many customers have purchased in approximately the same order.7,8 For each customer, the input database consists of an ordered sequence of transactions. Given an itemset sequence, the percentage of transaction sequences that contain it is called the itemset sequence's support. A transaction sequence contains an itemset sequence if each itemset is contained in one transaction and the following holds: If the ith itemset in the itemset sequence is contained in transaction j in the transaction sequence, the (i + 1)st itemset in the itemset sequence is contained in a transaction with a number greater than j. The goal of finding sequential patterns is to find all itemset sequences that have a support higher than a user-specified minimum. An itemset sequence is frequent if its support is larger than this minimum. In Figure 1, customer 201 is associated with the transaction sequence [{computer, MSOffice, Doom}, {hard disk, Doom}]. This transaction sequence contains the itemset sequence [{MSOffice}, {hard disk}].
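The containment test in this definition can be written directly: scan the customer's transactions in order and match each itemset against a strictly later transaction than the previous match. The sketch below is illustrative only and is not taken from the cited work.

    def contains_sequence(transaction_seq, itemset_seq):
        """True if the ordered transactions contain the itemsets in the given order."""
        pos = 0  # index of the first transaction still allowed to match
        for itemset in itemset_seq:
            wanted = set(itemset)
            while pos < len(transaction_seq) and not wanted <= set(transaction_seq[pos]):
                pos += 1
            if pos == len(transaction_seq):
                return False
            pos += 1  # the next itemset must match a strictly later transaction
        return True

    # Customer 201 from Figure 1: two transactions ordered by date.
    customer = [{"computer", "MSOffice", "Doom"}, {"hard disk", "Doom"}]
    print(contains_sequence(customer, [{"MSOffice"}, {"hard disk"}]))  # True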
Calendric market basket analysis. Sridhar Ramaswamy and colleagues9 use the time stamp associated with each transaction to define the problem of calendric market basket analysis. Even though an itemset's support may not be large with respect to the entire database, it might be large on a subset of the database that satisfies certain time constraints. Conversely, in certain cases, itemsets that are frequent on the entire database may gain their support from only certain subsets. The goal of calendric market basket analysis is to find all itemsets that are frequent in a set of user-defined time intervals.

CLUSTERING

Clustering distributes data into several groups so that similar objects fall into the same group. In Figure 1's sample database, assume that to cluster customers based on their purchase behavior, we compute for each customer the total number and average price of all items purchased. Figure 3 shows clustering information for nine customers, distributed across three clusters. Customers in cluster one purchase few high-priced items, customers in cluster two purchase many high-priced items, and customers in cluster three purchase few low-priced items. Figure 3's data does not match Figure 1's because the earlier figure accommodated only a few transactions.

Cluster 1: <2, 1,700>  <3, 2,000>   <4, 2,300>
Cluster 2: <10, 1,800> <12, 2,100>  <11, 2,040>
Cluster 3: <2, 100>    <3, 200>     <3, 150>

Figure 3. Sample set of clusters—data groups consisting of similar objects.

The clustering problem has been studied in many fields, including statistics, machine learning, and biology. However, scalability was not a design goal in these applications; researchers always assumed the complete data set would fit in main memory, and the focus was on improving the clustering quality. Consequently, these algorithms do not scale to large data sets. Recently, several new algorithms have been developed with greater emphasis on scalability, using techniques such as summarized cluster representations, sampling, and data structures supported by database systems.

Summarized cluster representations

Tian Zhang and colleagues10 proposed Birch, which uses summarized cluster representations to achieve speed and scalability while clustering a data set. The Birch approach can be thought of as a two-phase clustering technique: Birch is used to yield a collection of coarse clusters, and then other (main-memory) clustering algorithms can be used on this collection to identify "true clusters." As an analogy, if each data point is a marble on a table top, we replace clusters of marbles by tennis balls and then look for clusters of tennis balls. While the number of marbles may be large, we can control the number of tennis balls to make the second phase feasible with traditional clustering algorithms whose goal is to recover complex cluster shapes.

Other work on scalable clustering addressed Birch's limitations or applied the summarized cluster representation idea in different ways.

Birch and the Birch* framework. A cluster corresponds to a dense region of objects. Birch treats this region collectively through a summarized representation called its cluster feature. A cluster's CF is a triple consisting of the cluster's number of points, centroid, and radius, with the radius defined as the square root of the average squared distance of the cluster's points from the centroid. When a new point is added to a cluster, the new CF can be computed from the old CF; we do not need the set of points in the cluster. The incremental Birch algorithm exploits this property of a CF and maintains only the CFs of clusters, rather than the sets of points, while scanning the data.

Cluster features are efficient for two reasons:

• They occupy much less space than the naive representation, which maintains all objects in a cluster.
• They are sufficient for calculating all intercluster and intracluster measurements involved in making clustering decisions.
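As an illustration of why incremental maintenance works, a cluster feature can be realized as the triple (number of points, linear sum of the points, sum of squared norms); the centroid and radius in the text are then derived quantities, and absorbing a point or merging two clusters is simple arithmetic on the triples. The sketch below shows this representation for points in Euclidean space; it illustrates the idea rather than Birch's actual data structure or code.

    import math

    class ClusterFeature:
        """Summary of a cluster: point count, linear sum, and sum of squared norms."""

        def __init__(self, dim):
            self.n = 0
            self.ls = [0.0] * dim  # componentwise sum of the points
            self.ss = 0.0          # sum of squared norms of the points

        def add(self, point):
            """Absorb a new point without storing it."""
            self.n += 1
            self.ls = [a + x for a, x in zip(self.ls, point)]
            self.ss += sum(x * x for x in point)

        def centroid(self):
            return [a / self.n for a in self.ls]

        def radius(self):
            """Square root of the average squared distance of points from the centroid."""
            c = self.centroid()
            return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))

        def merge(self, other):
            """CF of the union of two clusters, computed from the two CFs alone."""
            union = ClusterFeature(len(self.ls))
            union.n = self.n + other.n
            union.ls = [a + b for a, b in zip(self.ls, other.ls)]
            union.ss = self.ss + other.ss
            return union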
Moreover, these calculations can be performed much faster than using all the objects in clusters. For instance, distances between clusters, radii of clusters, CFs—and hence other properties of merged clusters—can all be computed very quickly from the CFs of individual clusters.

In Birch, the CF's definition relies on vector operations like addition, subtraction, centroid computation, and so on. Therefore, Birch's definition of CF will not extend to data sets consisting of character strings, say, for which these operations are not defined. In recent work, the CF and CF-tree concepts used in Birch have been generalized in the Birch* framework11 to derive two new scalable clustering algorithms for data in an arbitrary metric space. These new algorithms will, for example, separate the set {University of Wisconsin-Madison, University of Wisconsin-Whitewater, University of Texas-Austin, University of Texas-Arlington} into two clusters of Wisconsin and Texas universities.

Other CF work. Recently, Paul Bradley and colleagues12 used CFs to develop a framework for scaling up the class of iterative clustering algorithms, such as the K-Means algorithm. Starting with an initial data-set partitioning, iterative clustering algorithms repeatedly move points between clusters until the distribution optimizes a criterion function. The framework functions by identifying sets of discardable, compressible, and main-memory points. A point is discardable if its membership in a cluster can be ascertained; the algorithm discards the actual points and retains only the CF of all discardable points. A point is compressible if it is not discardable but belongs to a tight subcluster—a set of points that always share cluster membership. Such points can move from one cluster to another, but they always move together. Such a subcluster is summarized using its CF. A point is a main-memory point if it is neither discardable nor compressible. Main-memory points are retained in main memory. The iterative clustering algorithm then moves only the main-memory points and the CFs of compressible points between clusters until the distribution optimizes the criterion function.

Gholamhosein Sheikholeslami and colleagues13 proposed WaveCluster, a clustering algorithm based on wavelet transforms. They first summarize the data by imposing a multidimensional grid on the data space. The number of points that map into a single cell summarizes all the points that mapped into the cell. This summary information typically fits in main memory. WaveCluster then applies the wavelet transform on the summarized data to determine clusters of arbitrary shapes.

Other approaches

Of the other proposed clustering algorithms for large data sets, we mention two sampling-based approaches and one based on database system support.

Raymond T. Ng and Jiawei Han14 proposed Clarans, which formulates the clustering problem as a randomized graph search. In Clarans, each node represents a partition of the data set into a user-specified number of clusters. A criterion function determines the clusters' quality. Clarans samples the solution space—all possible partitions of the data set—for a good solution. The random search for a solution stops at a node that meets the minimum among a prespecified number of local minima.

Sudipto Guha and colleagues15 proposed CURE, a sampling-based hierarchical clustering algorithm that discovers clusters of arbitrary shapes. In the DBScan algorithm, Martin Ester and colleagues16 proposed a density-based notion of a cluster that also lets the cluster take an arbitrary shape. The article "Chameleon: Hierarchical Clustering Using Dynamic Modeling" by George Karypis and colleagues (p. 68) covers these last two algorithms in detail.
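Returning to the summarization step that WaveCluster and other grid-based methods rely on, mapping points to cell counts is a one-pass aggregation. A minimal sketch, with an invented uniform-grid layout chosen purely for illustration:

    from collections import Counter

    def grid_summary(points, cell_size):
        """Map each point to its grid cell and count the points per cell."""
        cells = Counter()
        for p in points:
            cell = tuple(int(x // cell_size) for x in p)
            cells[cell] += 1
        return cells  # typically small enough to keep in main memory

    summary = grid_summary([(0.2, 0.3), (0.4, 0.1), (5.1, 4.9)], cell_size=1.0)
    print(summary)  # Counter({(0, 0): 2, (5, 4): 1})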
CLASSIFICATION

Assume that we have identified, through clustering of the aggregated purchase information of current customers, three different groups of customers, as shown in Figure 3. Assume that we purchase a mailing list with demographic information for potential customers. We would like to assign each person in the mailing list to one of three groups so that we can send a catalog tailored to that person's buying patterns. This data-mining task uses historical information about current customers to predict the cluster membership of new customers.

Our database with historical information, also called the training database, contains records that have several attributes. One designated attribute is called the dependent attribute, and the others are called predictor attributes. The goal is to build a model that takes the predictor attributes as inputs and outputs a value for the dependent attribute. If the dependent attribute is numerical, the problem is called a regression problem; otherwise it is called a classification problem. We concentrate on classification problems, although similar techniques apply to regression problems as well. For a classification problem, we refer to the attribute values of the dependent attribute as class labels. Figure 4 shows a sample training database with three predictor attributes: salary, age, and employment. Group is the dependent attribute.

Record ID   Salary   Age   Employment   Group
1           30K      30    Self         C
2           40K      35    Industry     C
3           70K      50    Academia     C
4           60K      45    Self         B
5           70K      30    Academia     B
6           60K      35    Industry     A
7           60K      35    Self         A
8           70K      30    Self         A
9           40K      45    Industry     C

Figure 4. Sample training database.

Researchers have proposed many classification models:17 neural networks, genetic algorithms, Bayesian methods, log-linear and other statistical methods, decision tables, and tree-structured models—so-called classification trees. Classification trees, also called decision trees, are attractive in a data-mining environment for several reasons:

• Their intuitive representation makes the resulting classification model easy to understand.
• Constructing decision trees does not require any input parameters from the analyst.
• The predictive accuracy of decision trees is equal to or higher than other classification models.
• Fast, scalable algorithms can be used to construct decision trees from very large training databases.

Each internal node of a decision tree is labeled with a predictor attribute, called the splitting attribute, and each leaf node is labeled with a class label. Each edge originating from an internal node is labeled with a splitting predicate that involves only the node's splitting attribute. The splitting predicates have the property that any record will take a unique path from the root to exactly one leaf node. The combined information about splitting attributes and splitting predicates at a node is called the splitting criterion. Figure 5 shows a possible decision tree for the training database from Figure 4.

[Figure 5. Sample decision tree for a catalog mailing. The root splits on Salary (<= 50K versus > 50K); the > 50K branch splits on Age (<= 40 versus > 40), and the Age <= 40 branch splits on Employment (Academia, Industry versus Self). The leaves carry the labels Group A, Group B, and Group C.]
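To illustrate how a record is routed through such a tree, the sketch below hard-codes a tree with the shape described for Figure 5; the particular leaf labels are an assumption made for illustration, since the figure's exact labeling is not fully recoverable here.

    def classify(record):
        """Route a record to a leaf of a Figure 5-shaped tree (labels assumed)."""
        if record["salary"] <= 50_000:
            return "Group C"
        if record["age"] > 40:
            return "Group B"                      # assumed label
        if record["employment"] in ("Academia", "Industry"):
            return "Group B"                      # assumed label
        return "Group A"                          # employment == "Self"

    print(classify({"salary": 40_000, "age": 35, "employment": "Industry"}))  # Group C
    print(classify({"salary": 60_000, "age": 35, "employment": "Self"}))      # Group A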
Decision tree construction algorithms consist of two phases: tree building and pruning. In tree building, the tree grows top-down in the following greedy way. Starting with the root node, the algorithm examines the database using a split selection method to compute the locally "best" splitting criterion. Then it partitions the database according to this splitting criterion and applies the procedure recursively. The algorithm then prunes the tree to control its size. Some decision tree construction algorithms separate tree building and pruning, while others interleave them to avoid the unnecessary expansion of some nodes. Figure 6 shows a code sample of the tree-building phase.

Input: node n, data partition D, split selection method CL
Output: decision tree for D rooted at n

Top-Down Decision Tree Induction Schema (Binary Splits):
BuildTree(Node n, data partition D, split selection method CL)
    Apply CL to D to find the splitting criterion for n
    if (n splits)
        Create children n1 and n2 of n
        Use best split to partition D into D1 and D2
        BuildTree(n1, D1, CL)
        BuildTree(n2, D2, CL)
    endif

Figure 6. Code sample for the tree-building phase.

The choice of splitting criterion determines the quality of the decision tree, and it has been the subject of considerable research. In addition, if the training database does not fit in memory, we need a scalable data access method. One such method, the Sprint algorithm introduced by John Shafer and colleagues,18 uses only a minimum amount of main memory and scales a popular split selection method called CART. Another approach, the RainForest framework,19 scales a broad class of split-selection methods, but has main-memory requirements that depend on the number of different attribute values in the input database.

Sprint. This classification-tree construction algorithm removes all relationships between main memory and the data set size. Sprint builds classification trees with binary splits and it requires sorted access to each attribute at each node. For each attribute, the algorithm creates an attribute list, which is a vertical partition of the training database D. For each tuple t ∈ D, the entry of t in the attribute list consists of the projection of t onto the attribute, the class label attribute, and the record identifier of t. The attribute list of each attribute is created at the beginning of the algorithm and sorted once in increasing order of attribute values.

At the root node, the algorithm scans all attribute lists once to determine the splitting criterion. Then it distributes each attribute list among the root's children through a hash-join with the attribute list of the splitting attribute. The record identifier, which is duplicated in each attribute list, establishes the connection between the different parts of the tuple. During the hash-join, the algorithm reads and distributes each attribute list sequentially, which preserves the initial sort order of the attribute list. The algorithm then recurses on each child partition.

RainForest. The RainForest framework19 operates from the premise that nearly all split-selection methods need only aggregate information to decide on the splitting criterion at a node. This aggregated information can be captured in a relatively compact data structure called the attribute-value class label group, or AVC group. Consider the root node of the tree, and let D be the training database. The AVC set of predictor attribute A is the projection of D onto A, where counts of the individual class labels are aggregated. The AVC group at a node consists of the AVC sets of all predictor attributes.

Consider the training database shown in Figure 4. The AVC group of the root node is shown in Figure 7. The size of a node's AVC group is not proportional to the number of records in the training database, but rather to the number of different attribute values. Thus, in most cases, the AVC group is much smaller than the training database and usually fits into main memory.

Salary   Group A   Group B   Group C
30K      0         0         1
40K      0         0         2
60K      2         1         0
70K      1         1         1

Age      Group A   Group B   Group C
30       1         1         1
35       2         0         1
45       0         1         1
50       0         0         1

Figure 7. AVC group of the root node for the sample input database in Figure 4.
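Constructing the AVC group at a node takes one sequential scan over the node's data partition, incrementing a counter for each combination of predictor attribute, attribute value, and class label. A minimal sketch of that aggregation, written for this survey's running example rather than taken from the RainForest implementation:

    from collections import defaultdict

    def avc_group(partition, predictors, class_attr="group"):
        """For each predictor attribute: counts of class labels per attribute value."""
        group = {a: defaultdict(lambda: defaultdict(int)) for a in predictors}
        for record in partition:  # one sequential scan over the partition
            label = record[class_attr]
            for a in predictors:
                group[a][record[a]][label] += 1
        return group

    # Two records in the style of Figure 4.
    training = [
        {"salary": "30K", "age": 30, "employment": "Self", "group": "C"},
        {"salary": "60K", "age": 35, "employment": "Industry", "group": "A"},
    ]
    root_avc = avc_group(training, ["salary", "age", "employment"])
    print(dict(root_avc["salary"]["30K"]))  # {'C': 1}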
Knowing that the AVC group contains all the information any split-selection method needs, the problem of scaling up an existing split-selection method is now reduced to the problem of efficiently constructing the AVC group at each node of the tree. One simple data access method works by performing a sequential scan over the training database to construct the root node's AVC group in main memory. The split-selection method then computes the split of the root node. In the next sequential scan, each record is read and appended to one child partition. The algorithm then recurses on each child partition.

Rajeev Rastogi and Kyuseok Shim20 developed an algorithm called Public that interleaves tree building and pruning. Public eagerly prunes nodes that need not be expanded further during tree building, thus saving on the expansion cost of some nodes in the tree.

Most current data-mining research assumes that data is static. In practice, data is maintained in data warehouses, which are updated continuously by the addition of records in batches. Given this scenario, we believe that future research must address algorithms for efficient model maintenance and methods to measure changes in data characteristics.

The current data-mining paradigm resembles that of traditional database systems. A user initiates data mining and awaits the complete result. But analysts are interested in quick, partial, or approximate results that can then be fine-tuned through a series of interactive queries. Thus, further research must focus on making data mining more interactive.

Finally, the Web is the largest repository of structured, semistructured, and unstructured data. The Web's dynamic nature, as well as the extreme variety of data types it holds, will challenge the research community for years to come. ❖

Acknowledgments

Venkatesh Ganti is supported by a Microsoft Graduate Fellowship. Johannes Gehrke is supported by an IBM Graduate Fellowship. The research for this article was supported by Grant 2053 from the IBM Corp.

References

1. U.M. Fayyad et al., eds., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Menlo Park, Calif., 1996.
2. R. Agrawal et al., "Fast Discovery of Association Rules," Advances in Knowledge Discovery and Data Mining, U.M. Fayyad et al., eds., AAAI/MIT Press, Menlo Park, Calif., 1996, pp. 307-328.
3. A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1995, pp. 432-444.
4. J.S. Park, M.-S. Chen, and P.S. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1995, pp. 175-186.
5. H. Toivonen, "Sampling Large Databases for Association Rules," Proc. 22nd Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1996, pp. 134-145.
6. S. Brin et al., "Dynamic Itemset Counting and Implication Rules for Market Basket Data," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1997, pp. 255-264.
7. R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. 11th Int'l Conf. Data Eng., IEEE CS Press, Los Alamitos, Calif., 1995, pp. 3-14.
8. H. Mannila, H. Toivonen, and A.I. Verkamo, "Discovering Frequent Episodes in Sequences," Proc. 1st Int'l Conf. Knowledge Discovery Databases and Data Mining, AAAI Press, Menlo Park, Calif., 1995, pp. 210-215.
9. S. Ramaswamy, S. Mahajan, and A. Silberschatz, "On the Discovery of Interesting Patterns in Association Rules," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 368-379.
10. T. Zhang, R. Ramakrishnan, and M. Livny, "Birch: An Efficient Data Clustering Method for Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1996, pp. 103-114.
11. V. Ganti et al., "Clustering Large Datasets in Arbitrary Metric Spaces," Proc. 15th Int'l Conf. Data Eng., IEEE CS Press, Los Alamitos, Calif., 1999, pp. 502-511.
12. P. Bradley, U. Fayyad, and C. Reina, "Scaling Clustering Algorithms to Large Databases," Proc. 4th Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1998, pp. 9-15.
13. G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 428-439.
14. R.T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. 20th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1994, pp. 144-155.
15. S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1998, pp. 73-84.
16. M. Ester et al., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. 2nd Int'l Conf. Knowledge Discovery Databases and Data Mining, AAAI Press, Menlo Park, Calif., 1996, pp. 226-231.
17. D. Michie, D.J. Spiegelhalter, and C.C. Taylor, Machine Learning, Neural and Statistical Classification, Ellis Horwood, Chichester, UK, 1994.
18. J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining," Proc. 22nd Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1996, pp. 544-555.
19. J. Gehrke, R. Ramakrishnan, and V. Ganti, "RainForest—a Framework for Fast Decision Tree Construction of Large Datasets," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 416-427.
20. R. Rastogi and K. Shim, "Public: A Decision Tree Classifier that Integrates Building and Pruning," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 404-415.

Venkatesh Ganti is a PhD candidate at the University of Wisconsin-Madison. His primary research interests are the exploratory analysis of large data sets and monitoring changes in data characteristics. Ganti received an MS in computer science from the University of Wisconsin-Madison.

Johannes Gehrke is a PhD candidate at the University of Wisconsin-Madison. His research interests include scalable techniques for data mining, performance of data-mining algorithms, and mining and monitoring evolving data sets.
Raghu Ramakrishnan is a professor in the Computer Sciences Department at the University of Wisconsin-Madison. His research interests include database languages, net databases, data mining, and interactive information visualization.

Contact Ganti, Gehrke, and Ramakrishnan at the University of Wisconsin-Madison, Dept. of Computer Science, Madison, WI 53706; {vganti, johannes, raghu}@cs.wisc.edu.