Cover Feature

Mining Very Large Databases
The explosive growth of databases makes the scalability of data-mining
techniques increasingly important. The authors describe algorithms that
address three classical data-mining problems.
Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan
University of Wisconsin-Madison

Established companies have had decades to
accumulate masses of data about their customers, suppliers, and products and services.
The rapid pace of e-commerce means that
Web startups can become huge enterprises in
months, not years, amassing proportionately large
databases as they grow. Data mining, also known as
knowledge discovery in databases,1 gives organizations the tools to sift through these vast data stores
to find the trends, patterns, and correlations that can
guide strategic decision making.
Traditionally, algorithms for data analysis assume
that the input data contains relatively few records.
Current databases, however, are much too large to be
held in main memory. Retrieving data from disk is
markedly slower than accessing data in RAM. Thus,
to be efficient, the data-mining techniques applied to
very large databases must be highly scalable. An algorithm is said to be scalable if—given a fixed amount of
main memory—its runtime increases linearly with the
number of records in the input database.
Recent work has focused on scaling data-mining
algorithms to very large data sets. In this survey, we
describe a broad range of algorithms that address three
classical data-mining problems: market basket analysis, clustering, and classification.
MARKET BASKET ANALYSIS
A market basket is a collection of items purchased
by a customer in an individual customer transaction,
which is a well-defined business activity—for example, a customer’s visit to a grocery store or an online
purchase from a virtual store such as Amazon.com.
Retailers accumulate huge collections of transactions
by recording business activity over time. One common
analysis run against a transactions database is to find
sets of items, or itemsets, that appear together in many
transactions. Each pattern extracted through the
analysis consists of an itemset and the number of transactions that contain it. Businesses can use knowledge
of these patterns to improve the placement of items in
a store or the layout of mail-order catalog pages and
Web pages.
An itemset containing i items is called an i-itemset.
The percentage of transactions that contain an itemset
is called the itemset’s support. For an itemset to be
interesting, its support must be higher than a user-specified minimum; such itemsets are said to be frequent.
Figure 1 shows three transactions stored in a relational database system. The database has five fields: a
transaction identifier, a customer identifier, the item
purchased, its price, and the transaction date. The first
transaction shows a customer who bought a computer,
MSOffice, and Doom. As an example, the 2-itemset
{hard disk, Doom} has a support of 67 percent.
Why is finding frequent itemsets a nontrivial problem? First, the number of customer transactions can
be very large and usually will not fit in memory.
Second, the potential number of frequent itemsets is
exponential in the number of different items, although
the actual number of frequent itemsets can be much
smaller. The example in Figure 1 shows four different
items, so there are 2^4 − 1 = 15 potential frequent itemsets. If the minimum support is 60 percent, only five
itemsets are actually frequent. Thus, we want algorithms that are scalable with respect to the number of
transactions and examine as few infrequent itemsets
as possible. Efficient algorithms have been designed to
address these criteria. The Apriori algorithm2 provided
one early solution, which subsequent algorithms built
upon.
APRIORI ALGORITHM
This algorithm computes the frequent itemsets in
several rounds. Round i computes all frequent i-itemsets. A round has two steps: candidate generation and
candidate counting. Consider the ith round. In the candidate generation step, the algorithm generates a set
of candidate i-itemsets whose support has not yet been
computed. In the candidate counting step, the algorithm scans the transaction database, counting the supports of the candidate itemsets. After the scan, the
algorithm discards candidates with support lower
than the user-specified minimum and retains only the
frequent i-itemsets.
In the first round, the generated set of candidate
itemsets contains all 1-itemsets. The algorithm counts
their support during the candidate counting step.
Thus, after the first round, all frequent 1-itemsets are
known. What are the candidate itemsets generated
during the candidate generation step of round two?
Naively, all pairs of items are candidates. Apriori
reduces the set of candidate itemsets by pruning—a
priori—those candidate itemsets that cannot be frequent, based on knowledge about infrequent itemsets
obtained from previous rounds. The pruning is based
on the observation that if an itemset is frequent, all its
subsets must be frequent as well. Therefore, before
entering the candidate counting step, the algorithm
can discard every candidate itemset with a subset that
is infrequent.
Consider the database in Figure 1. Assume that the
minimum support is 60 percent—so an itemset is frequent if it is contained in at least two transactions. In
round one, all single items are candidate itemsets and
are counted during the candidate counting step. In
round two, only pairs of items in which each item is
frequent can become candidates. For example, the
itemset {MSOffice, Doom} is not a candidate, since
round one determined that its subset {MSOffice} is not
frequent. In round two, therefore, the algorithm
counts the candidate itemsets {computer, Doom},
{hard disk, Doom}, and {computer, hard disk}. In
round three, no candidate itemset survives the pruning step. The itemset {computer, hard disk, Doom} is
pruned a priori because its subset {computer, hard
disk} is not frequent. Thus, with respect to a minimum
support of 60 percent, the frequent itemsets in our
sample database and their support values are
• {computer} 67 percent,
• {hard disk} 67 percent,
• {Doom} 100 percent,
• {computer, Doom} 67 percent, and
• {hard disk, Doom} 67 percent.
Apriori counts not only the support of all frequent
itemsets, but also the support of those infrequent candidate itemsets that could not be eliminated during
the pruning step. The set of all candidate itemsets that
are infrequent but whose support is counted by
Apriori is called the negative border. Thus, an itemset
is in the negative border if it is infrequent, but all its
subsets are frequent. In our example, the negative border consists of itemsets {MSOffice} and {computer,
hard disk}. All subsets of an itemset in the negative
border are frequent; otherwise the itemset would have
been eliminated by the subset-pruning step.
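As an illustration, here is a minimal Python sketch of the Apriori rounds just described: candidate generation with a-priori subset pruning, followed by candidate counting over the transactions. The function and variable names are ours, not part of the original algorithm description.

from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Form k-itemset candidates and prune any with an infrequent (k-1)-subset."""
    items = sorted({i for itemset in frequent_prev for i in itemset})
    candidates = set()
    for c in combinations(items, k):
        # a priori pruning: every (k-1)-subset must already be frequent
        if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1)):
            candidates.add(frozenset(c))
    return candidates

def count_support(candidates, transactions):
    """Candidate counting: one scan over the transactions."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts

def apriori(transactions, minsup):
    """Return all frequent itemsets with their supports (as fractions)."""
    n = len(transactions)
    # round one: all 1-itemsets are candidates
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while candidates:
        counts = count_support(candidates, transactions)
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
        frequent.update(level)
        k += 1
        candidates = generate_candidates(set(level), k)
    return frequent

# The three transactions of Figure 1 with 60 percent minimum support:
db = [{"computer", "MSOffice", "Doom"}, {"hard disk", "Doom"},
      {"computer", "hard disk", "Doom"}]
print(apriori(db, 0.6))   # five frequent itemsets, as listed in the text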
TID   CID   Item        Price   Date
101   201   Computer    1,500   1/4/99
101   201   MSOffice      300   1/4/99
101   201   Doom          100   1/4/99
102   201   Hard disk     500   1/7/99
102   201   Doom          100   1/7/99
103   202   Computer    1,500   1/24/99
103   202   Hard disk     500   1/24/99
103   202   Doom          100   1/24/99

Figure 1. Database containing three transactions.
Optimizing Apriori
Apriori scans the database several times, depending on the size of the longest frequent itemset. Several
refinements have been proposed that focus on reducing the number of database scans, the number of candidate itemsets counted in each scan, or both.
Partitioning. Ashok Savasere and colleagues3 developed Partition, an algorithm that requires only two
scans of the transaction database. The database is
divided into disjoint partitions, each small enough to
fit in memory. In a first scan, the algorithm reads each
partition and computes locally frequent itemsets on
each partition using Apriori.
In the second scan, the algorithm counts the support of all locally frequent itemsets toward the complete database. If an itemset is frequent with respect to
the complete database, it must be frequent in at least
one partition; therefore the second scan counts a
superset of all potentially frequent itemsets.
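A minimal sketch of this two-scan structure, reusing the apriori and count_support helpers from the earlier sketch (names and partitioning scheme are ours):

def partition_algorithm(transactions, minsup, num_partitions):
    """Scan 1: locally frequent itemsets per partition. Scan 2: recount globally."""
    n = len(transactions)
    size = -(-n // num_partitions)           # ceiling division
    parts = [transactions[i:i + size] for i in range(0, n, size)]

    # Scan 1: each partition is small enough to process in memory.
    local_union = set()
    for part in parts:
        local_union |= set(apriori(part, minsup))

    # Scan 2: count every locally frequent itemset over the full database.
    counts = count_support(local_union, transactions)
    return {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}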
Hashing. Jong Soo Park and colleagues4 proposed
using probabilistic counting to reduce the number of
candidate itemsets counted during each round of
Apriori execution. This reduction is accomplished by
subjecting each candidate k-itemset to a hash-based
filtering step in addition to the pruning step.
During candidate counting in round k − 1, the algorithm constructs a hash table. Each entry in the hash
table is a counter that maintains the sum of the supports of the k-itemsets that correspond to that particular entry of the hash table. The algorithm uses this
information in round k to prune the set of candidate
k-itemsets. After subset pruning as in Apriori, the
algorithm can remove a candidate itemset if the count
in its hash table entry is smaller than the minimum
support threshold.
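The hash-based filter can be sketched as follows (Python, our own naming; min_count is the absolute minimum support count). The bucket count is an upper bound on the support of any itemset hashing into that bucket, which is what makes the extra pruning safe.

from itertools import combinations

def build_hash_filter(transactions, k, num_buckets=1024):
    """During the round (k-1) scan, tally every k-itemset of each transaction into buckets."""
    buckets = [0] * num_buckets
    for t in transactions:
        for c in combinations(sorted(t), k):
            buckets[hash(c) % num_buckets] += 1
    return buckets

def hash_prune(candidates, buckets, min_count):
    """Keep only candidate k-itemsets whose bucket count can still reach the minimum support."""
    num_buckets = len(buckets)
    return {c for c in candidates
            if buckets[hash(tuple(sorted(c))) % num_buckets] >= min_count}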
Sampling. Hannu Toivonen5 proposed a sampling-based algorithm that typically requires two scans of
the database. The algorithm first takes a sample from
the database and generates a set of candidate itemsets that are highly likely to be frequent in the complete database. In a subsequent scan over the database, the algorithm counts these itemsets' exact supports and the support of their negative border. If no
itemset in the negative border is frequent, then the
algorithm has discovered all frequent itemsets.
Otherwise, some superset of an itemset in the negative border could be frequent, but its support has not
yet been counted. The sampling algorithm generates
and counts all such potentially frequent itemsets in a
subsequent database scan.
Dynamic itemset counting. Sergey Brin and colleagues6 proposed the Dynamic Itemset Counting
algorithm. DIC partitions the database into several
blocks marked by start points and repeatedly scans
the database. In contrast to Apriori, DIC can add new
candidate itemsets at any start point, instead of just
at the beginning of a new database scan. At each start
point, DIC estimates the support of all itemsets that
are currently counted and adds an itemset to the set of candidate itemsets if all of its subsets are estimated to be frequent.
If DIC adds all frequent itemsets and their negative
border to the set of candidate itemsets during the first
scan, it will have counted each itemset’s exact support
at some point during the second scan; thus DIC will
complete in two scans.
Extensions and generalizations
Several researchers have proposed extensions to the
basic problem of finding frequent itemsets.
Is-a hierarchy. One extension considers an is-a hierarchy on database items. An is-a hierarchy defines
which items are specializations or generalizations of
other items. For instance, as shown in Figure 2, the
items {computer, hard disk} in Figure 1 can be generalized to the item hardware. The extended problem is
to compute itemsets that include items from different
hierarchy levels.

Figure 2. Sample taxonomy for an is-a hierarchy of database items. (Hardware comprises computer and hard disk; software comprises MSOffice and Doom.)
The presence of a hierarchy modifies the notion of
when an item is contained in a transaction: In addition to the items listed explicitly, the transaction contains their ancestors in the taxonomy. This allows the
detection of relationships involving higher hierarchy
levels, since an itemset’s support can increase if an
item is replaced by one of its ancestors.
Consider the taxonomy in Figure 2. The transaction {computer, MSOffice} contains not only the items
computer and MSOffice, but also hardware and software. In Figure 1’s sample database, the support of
the itemset {computer, MSOffice} is 33 percent,
whereas the support of the itemset {computer, software} is 67 percent.
One approach to computing frequent itemsets in the
presence of a taxonomy is to conceptually augment each
transaction with the ancestors of all items in the transaction. Any algorithm for computing frequent itemsets
can now be used on the augmented database. Optimizations on this basic strategy have been described by
Rakesh Agrawal and Ramakrishnan Srikant.7
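The basic augmentation step can be sketched in a few lines of Python (the child-to-parent encoding of the taxonomy and the function name are our own):

def augment_with_ancestors(transaction, parent):
    """Extend a transaction with every ancestor of its items in the is-a hierarchy."""
    extended = set(transaction)
    for item in transaction:
        node = item
        while node in parent:          # walk up the taxonomy
            node = parent[node]
            extended.add(node)
    return extended

# The taxonomy of Figure 2 as a child -> parent map:
parent = {"computer": "hardware", "hard disk": "hardware",
          "MSOffice": "software", "Doom": "software"}
print(augment_with_ancestors({"computer", "MSOffice"}, parent))
# {'computer', 'MSOffice', 'hardware', 'software'}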
Sequential patterns. With each customer, we can associate a sequence of transactions ordered over time. The
business goal is to find sequences of itemsets that many
customers have purchased in approximately the same
order.7,8 For each customer, the input database consists
of an ordered sequence of transactions. Given an itemset sequence, the percentage of transaction sequences
that contain it is called the itemset sequence’s support.
A transaction sequence contains an itemset sequence
if each itemset is contained in one transaction and the
following holds: If the ith itemset in the itemset
sequence is contained in transaction j in the transaction
sequence, the (i + 1)st itemset in the itemset sequence is
contained in a transaction with a number greater than
j. The goal of finding sequential patterns is to find all
itemset sequences that have a support higher than a
user-specified minimum. An itemset sequence is frequent if its support is larger than this minimum.
In Figure 1, customer 201 is associated with the transaction sequence [{computer, MSOffice, Doom}, {hard disk, Doom}]. This transaction sequence contains the
itemset sequence [{MSOffice}, {hard disk}].
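The containment test just defined can be written directly as a short Python function (our own naming); greedily matching each itemset against the earliest later transaction is sufficient.

def contains_sequence(transaction_seq, itemset_seq):
    """True if each itemset is contained in some transaction, in strictly increasing order."""
    j = 0
    for itemset in itemset_seq:
        while j < len(transaction_seq) and not set(itemset) <= set(transaction_seq[j]):
            j += 1
        if j == len(transaction_seq):
            return False
        j += 1        # the next itemset must appear in a later transaction
    return True

# Customer 201 from Figure 1:
seq = [{"computer", "MSOffice", "Doom"}, {"hard disk", "Doom"}]
print(contains_sequence(seq, [{"MSOffice"}, {"hard disk"}]))   # True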
Calendric market basket analysis. Sridhar Ramaswamy and colleagues9 use the time stamp associated
with each transaction to define the problem of calendric market basket analysis. Even though an itemset’s
support may not be large with respect to the entire
database, it might be large on a subset of the database
that satisfies certain time constraints.
Conversely, in certain cases, itemsets that are frequent on the entire database may gain their support
from only certain subsets. The goal of calendric market basket analysis is to find all itemsets that are frequent in a set of user-defined time intervals.
CLUSTERING
Clustering distributes data into several groups so
that similar objects fall into the same group. In Figure
1’s sample database, assume that to cluster customers
based on their purchase behavior, we compute for each
customer the total number and average price of all
items purchased. Figure 3 shows clustering information for nine customers, distributed across three clusters. Customers in cluster one purchase few
high-priced items, customers in cluster two purchase
many high-priced items, and customers in cluster three
purchase few low-priced items. Figure 3’s data does
not match Figure 1's because the earlier figure accommodated only a few transactions.

Cluster 1: <2, 1,700>  <3, 2,000>  <4, 2,300>
Cluster 2: <10, 1,800>  <12, 2,100>  <11, 2,040>
Cluster 3: <2, 100>  <3, 200>  <3, 150>

Figure 3. Sample set of clusters—data groups consisting of similar objects. Each pair gives a customer's total number of items purchased and their average price.
The clustering problem has been studied in many
fields, including statistics, machine learning, and biology. However, scalability was not a design goal in
these applications; researchers always assumed the
complete data set would fit in main memory, and the
focus was on improving the clustering quality.
Consequently, these algorithms do not scale to large
data sets. Recently, several new algorithms with
greater emphasis on scalability have been developed,
including approaches based on summarized cluster representations, sampling, and data structures supported by database systems.
Summarized cluster representations
Tian Zhang and colleagues10 proposed Birch, which
uses summarized cluster representations to achieve
speed and scalability while clustering a data set. The
Birch approach can be thought of as a two-phase clustering technique: Birch is used to yield a collection of
coarse clusters, and then other (main-memory) clustering algorithms can be used on this collection to
identify “true clusters.” As an analogy, if each data
point is a marble on a table top, we replace clusters
of marbles by tennis balls and then look for clusters of
tennis balls. While the number of marbles may be
large, we can control the number of tennis balls to
make the second phase feasible with traditional clustering algorithms whose goal is to recover complex
cluster shapes. Other work on scalable clustering
addressed Birch’s limitations or applied the summarized cluster representation idea in different ways.
Birch and the Birch* framework. A cluster corresponds to a dense region of objects. Birch treats this
region collectively through a summarized representation called its cluster feature. A cluster’s CF is a triple
consisting of the cluster’s number of points, centroid,
and radius, with the cluster's radius defined as the square root of the average squared distance of the cluster's points from its centroid. When a new point
is added to a cluster, the new CF can be computed from
the old CF; we do not need the set of points in the cluster. The incremental Birch algorithm exploits this property of a CF and maintains only the CFs of clusters,
rather than the sets of points, while scanning the data.
Cluster features are efficient for two reasons:
• They occupy much less space than the naive representation, which maintains all objects in a cluster.
• They are sufficient for calculating all intercluster
and intracluster measurements involved in making clustering decisions. Moreover, these calculations can be performed much faster than using
all the objects in clusters. For instance, distances
between clusters, radii of clusters, CFs—and
hence other properties of merged clusters—can
all be computed very quickly from the CFs of
individual clusters.
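A minimal Python sketch of such a cluster feature follows. We store the point count, the linear sum, and the sum of squared norms, from which the centroid and radius used in the text can be derived; the class and method names are our own.

import math

class ClusterFeature:
    """Summarizes a cluster by its point count, linear sum, and sum of squared norms."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim          # linear sum of the points
        self.ss = 0.0                  # sum of squared norms of the points

    def add(self, point):
        """Absorb one point; the set of points itself is never stored."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)

    def centroid(self):
        return [a / self.n for a in self.ls]

    def radius(self):
        """Square root of the average squared distance of the points from the centroid."""
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))

    def merge(self, other):
        """CF of the union of two clusters, again without touching the raw points."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss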
In Birch, the CF’s definition relies on vector operations like addition, subtraction, centroid computation, and so on. Therefore, Birch’s definition of CF
will not extend to datasets consisting of character
strings, say, for which these operations are not defined.
In recent work, the CF and CF-tree concepts used
in Birch have been generalized in the Birch* framework11 to derive two new scalable clustering algorithms for data in an arbitrary metric space. These
new algorithms will, for example, separate the set
{University of Wisconsin-Madison, University of
Wisconsin-Whitewater, University of Texas-Austin,
University of Texas-Arlington} into two clusters of
Wisconsin and Texas universities.
Other CF work. Recently, Paul Bradley and colleagues12 used CFs to develop a framework for scaling up the class of iterative clustering algorithms, such
as the K-Means algorithm. Starting with an initial
data-set partitioning, iterative clustering algorithms
repeatedly move points between clusters until the distribution optimizes a criterion function.
The framework functions by identifying sets of discardable, compressible, and main-memory points. A
point is discardable if its membership in a cluster can
be ascertained; the algorithm discards the actual points
and retains only the CF of all discardable points.
A point is compressible if it is not discardable but
belongs to a tight subcluster—a set of points that
always share cluster membership. Such points can
move from one cluster to another, but they always
move together. Such a subcluster is summarized using
its CF.
A point is a main-memory point if it is neither discardable nor compressible. Main-memory points are
retained in main memory. The iterative clustering
algorithm then moves only the main-memory points
and the CFs of compressible points between clusters
until the distribution optimizes the criterion function.
Gholamhosein Sheikholeslami and colleagues13 proposed WaveCluster, a clustering algorithm based on
wavelet transforms. They first summarize the data by
imposing a multidimensional grid on the data space.
Each cell stores the number of points that map into it, and this count summarizes those points. This
summary information typically fits in main memory.
WaveCluster then applies the wavelet transform on
the summarized data to determine clusters of arbitrary
shapes.
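The grid-summarization step alone can be sketched as follows (Python, our own naming; the wavelet transform that WaveCluster then applies to the summary is not shown):

from collections import Counter

def grid_summary(points, cell_size):
    """Map each point to a grid cell and keep only a count per cell."""
    cells = Counter()
    for p in points:
        cell = tuple(int(x // cell_size) for x in p)
        cells[cell] += 1
    return cells

print(grid_summary([(0.2, 0.3), (0.4, 0.1), (5.0, 5.2)], cell_size=1.0))
# Counter({(0, 0): 2, (5, 5): 1})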
Other approaches
Of the other proposed clustering algorithms for
large data sets, we mention two sampling-based
approaches and one based on database system support.
Raymond T. Ng and Jiawei Han14 proposed
Clarans, which formulates the clustering problem as
a randomized graph search. In Clarans, each node represents a partition of the data set into a user-specified
number of clusters. A criterion function determines
the clusters’ quality. Clarans samples the solution
space—all possible partitions of the data set—for a
good solution. The random search for a solution stops
at a node that meets the minimum among a prespecified number of local minima.
Sudipto Guha and colleagues15 proposed CURE, a
sampling-based hierarchical clustering algorithm that
discovers clusters of arbitrary shapes. In the DBScan
algorithm, Martin Ester and colleagues16 proposed a
density-based notion of a cluster that also lets the cluster take an arbitrary shape.
The article "Chameleon: Hierarchical Clustering Using Dynamic Modeling" by George Karypis and colleagues (p. 68) covers these last two algorithms in detail.

CLASSIFICATION

Assume that we have identified, through clustering of the aggregated purchase information of current customers, three different groups of customers, as shown in Figure 3. Assume that we purchase a mailing list with demographic information for potential customers. We would like to assign each person in the mailing list to one of three groups so that we can send a catalog tailored to that person's buying patterns. This data-mining task uses historical information about current customers to predict the cluster membership of new customers.

Our database with historical information, also called the training database, contains records that have several attributes. One designated attribute is called the dependent attribute, and the others are called predictor attributes. The goal is to build a model that takes the predictor attributes as inputs and outputs a value for the dependent attribute.

If the dependent attribute is numerical, the problem is called a regression problem; otherwise it is called a classification problem. We concentrate on classification problems, although similar techniques apply to regression problems as well. For a classification problem, we refer to the attribute values of the dependent attribute as class labels. Figure 4 shows a sample training database with three predictor attributes: salary, age, and employment. Group is the dependent attribute.

Record ID   Salary   Age   Employment   Group
1           30K      30    Self         C
2           40K      35    Industry     C
3           70K      50    Academia     C
4           60K      45    Self         B
5           70K      30    Academia     B
6           60K      35    Industry     A
7           60K      35    Self         A
8           70K      30    Self         A
9           40K      45    Industry     C

Figure 4. Sample training database.

Researchers have proposed many classification models:17 neural networks, genetic algorithms, Bayesian methods, log-linear and other statistical methods, decision tables, and tree-structured models—so-called classification trees. Classification trees, also called decision trees, are attractive in a data-mining environment for several reasons:
• Their intuitive representation makes the resulting classification model easy to understand.
• Constructing decision trees does not require any
input parameters from the analyst.
• The predictive accuracy of decision trees is equal to or higher than that of other classification models.
• Fast, scalable algorithms can be used to construct
decision trees from very large training databases.
Each internal node of a decision tree is labeled with
a predictor attribute, called the splitting attribute, and
each leaf node is labeled with a class label. Each edge
originating from an internal node is labeled with a
splitting predicate that involves only the node's splitting attribute. The splitting predicates have the property that any record will take a unique path from the
root to exactly one leaf node. The combined information about splitting attributes and splitting predicates
at a node is called the splitting criterion. Figure 5 shows a possible decision tree for the training database from Figure 4.

Figure 5. Sample decision tree for a catalog mailing. The root node splits on salary (<= 50K versus > 50K); deeper nodes split on age and employment, and each leaf carries one of the labels Group A, Group B, or Group C.
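The node-and-edge structure just described can be made concrete with a small Python sketch. The types and the example predicates below are ours and are illustrative only, not the exact tree of Figure 5.

class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    """Internal node: a splitting attribute plus (splitting predicate, child) edges."""
    def __init__(self, attribute, edges):
        self.attribute = attribute
        self.edges = edges             # list of (predicate on the attribute value, subtree)

def classify(node, record):
    """Follow the unique root-to-leaf path determined by the splitting predicates."""
    while isinstance(node, Node):
        value = record[node.attribute]
        node = next(child for predicate, child in node.edges if predicate(value))
    return node.label

# Illustrative tree over the attributes of Figure 4:
tree = Node("salary", [
    (lambda s: s <= 50000, Leaf("C")),
    (lambda s: s > 50000, Node("age", [
        (lambda a: a > 40, Leaf("B")),
        (lambda a: a <= 40, Node("employment", [
            (lambda e: e == "Academia", Leaf("B")),
            (lambda e: e in ("Industry", "Self"), Leaf("A")),
        ])),
    ])),
])
print(classify(tree, {"salary": 60000, "age": 35, "employment": "Self"}))   # A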
Decision tree construction algorithms consist of two
phases: tree building and pruning. In tree building, the
tree grows top-down in the following greedy way.
Starting with the root node, the algorithm examines
the database using a split selection method to compute the locally “best” splitting criterion. Then it partitions the database according to this splitting criterion
and applies the procedure recursively. The algorithm
then prunes the tree to control its size. Some decision
tree construction algorithms separate tree building
and pruning, while others interleave them to avoid the
unnecessary expansion of some nodes. Figure 6 shows
a code sample of the tree-building phase.
The choice of splitting criterion determines the quality of the decision tree, and it has been the subject of
considerable research. In addition, if the training database does not fit in memory, we need a scalable data
access method. One such method, the Sprint algorithm
introduced by John Shafer and colleagues,18 uses only
a minimum amount of main memory and scales a popular split selection method called CART. Another
approach, the RainForest framework,19 scales a broad
class of split-selection methods, but has main-memory requirements that depend on the number of different attribute values in the input database.
Sprint. This classification-tree construction algorithm removes all relationships between main memory and the data set size. Sprint builds classification
trees with binary splits and it requires sorted access to
each attribute at each node. For each attribute, the
algorithm creates an attribute list, which is a vertical
partition of the training database D. For each tuple t
∈ D, the entry of t in the attribute list consists of the
projection of t onto the attribute, the class label
attribute, and the record identifier of t. The attribute
list of each attribute is created at the beginning of the
algorithm and sorted once in increasing order of
attribute values.
At the root node, the algorithm scans all attribute
lists once to determine the splitting criterion. Then it
distributes each attribute list among the root’s children through a hash-join with the attribute list of the
splitting attribute. The record identifier, which is duplicated in each attribute, establishes the connection
between the different parts of the tuple. During the
hash-join, the algorithm reads and distributes each
attribute list sequentially, which preserves the initial
sort order of the attribute list. The algorithm then
recurses on each child partition.
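Creating the sorted attribute lists can be sketched as follows (Python; the function name, argument names, and record layout keyed by the Figure 4 column headers are our assumptions):

def build_attribute_lists(records, predictors, label_attr="Group", id_attr="Record ID"):
    """One sorted (value, class label, record id) list per predictor attribute."""
    lists = {}
    for attr in predictors:
        lists[attr] = sorted(
            (r[attr], r[label_attr], r[id_attr]) for r in records)
    return lists

# For the training database of Figure 4 this yields, e.g., lists["Salary"] sorted by salary,
# which Sprint scans once per node to evaluate candidate binary splits.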
Input: node n, datapartition D, split selection method CL
Output: decision tree for D rooted at n

Top-Down Decision Tree Induction Schema (Binary Splits):
BuildTree(Node n, datapartition D, split selection method CL)
(1) Apply CL to D to find the splitting criterion for n
(2) if (n splits)
(3)    Create children n1 and n2 of n
(4)    Use best split to partition D into D1 and D2
(5)    BuildTree(n1, D1, CL)
(6)    BuildTree(n2, D2, CL)
(7) endif

Figure 6. Code sample for the tree-building phase.

RainForest. The RainForest framework19 operates
from the premise that nearly all split-selection methods need only aggregate information to decide on the
splitting criterion at a node. This aggregated information can be captured in a relatively compact data
structure called the attribute-value class label group,
or AVC group.
Consider the root node of the tree, and let D be the
training database. The AVC set of predictor attribute
A is the projection of D onto A, where counts of the
individual class labels are aggregated. The AVC group
at a node consists of the AVC sets of all predictor
attributes. Consider the training database shown in
Figure 4. The AVC group of the root node is shown in
Figure 7.
Salary   A   B   C
30K      0   0   1
40K      0   0   2
60K      2   1   0
70K      1   1   1

Age   A   B   C
30    1   1   1
35    2   0   1
45    0   1   1
50    0   0   1

Figure 7. AVC group of the root node for the sample input database in Figure 4.

The size of a node's AVC group is not proportional to the number of records in the training database, but rather to the number of different attribute values. Thus, in most cases, the AVC group is much smaller than the training database and usually fits into main memory. Knowing that the AVC group contains all
the information any split-selection method needs,
the problem of scaling up an existing split-selection
method is now reduced to the problem of efficiently
constructing the AVC group at each node of the
tree.
One simple data access method works by performing a sequential scan over the training database to construct the root node’s AVC group in
main memory. The split-selection method then computes the split of the root node. In the next sequential scan, each record is read and appended to one
child partition. The algorithm then recurses on each
child partition.
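The single-scan construction of the AVC group can be sketched in Python (our own naming; records are assumed to be dictionaries keyed by the Figure 4 column headers):

from collections import defaultdict

def avc_group(records, predictors, label_attr="Group"):
    """One scan over the partition: for each predictor, count (attribute value, class label) pairs."""
    group = {attr: defaultdict(lambda: defaultdict(int)) for attr in predictors}
    for r in records:
        for attr in predictors:
            group[attr][r[attr]][r[label_attr]] += 1
    return group

# For Figure 4's database, avc_group(records, ["Salary", "Age", "Employment"]) yields, for
# Salary and Age, the counts shown in Figure 7; its size depends only on the number of
# distinct attribute values, not on the number of records.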
Rajeev Rastogi and Kyuseok Shim20 developed
an algorithm called Public that interleaves tree
building and pruning. Public eagerly prunes nodes
that need not be expanded further during tree building, thus saving on the expansion cost of some
nodes in the tree.
Most current data-mining research assumes that
data is static. In practice, data is maintained in
data warehouses, which are updated continuously by the addition of records in batches. Given
this scenario, we believe that future research must
address algorithms for efficient model maintenance
and methods to measure changes in data characteristics.
The current data-mining paradigm resembles that
of traditional database systems. A user initiates data
mining and awaits the complete result. But analysts
are interested in quick, partial, or approximate
results that can then be fine-tuned through a series
of interactive queries. Thus, further research must
focus on making data mining more interactive.
Finally, the Web is the largest repository of structured, semistructured, and unstructured data. The
Web’s dynamic nature, as well as the extreme variety of data types it holds, will challenge the research
community for years to come. ❖
Acknowledgments
Venkatesh Ganti is supported by a Microsoft Graduate
Fellowship. Johannes Gehrke is supported by an IBM
Graduate Fellowship. The research for this article was supported by Grant 2053 from the IBM Corp.
References
1. U.M. Fayyad et al., eds. Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Menlo Park,
Calif., 1996.
2. R. Agrawal et al., “Fast Discovery of Association
Rules,” Advances in Knowledge Discovery and Data
Mining, U.M. Fayyad et al., eds., AAAI/MIT Press,
Menlo Park, Calif., 1996, pp. 307-328.
3. A. Savasere, E. Omiecinski, and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large
Databases,” Proc. 21st Int’l Conf. Very Large Data
Bases, Morgan Kaufmann, San Francisco, 1995, pp.
432-444.
4. J.S. Park, M.-S. Chen, and P.S. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1995, pp. 175-186.
5. H. Toivonen, "Sampling Large Databases for Association Rules," Proc. 22nd Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1996, pp. 134-145.
6. S. Brin et al., "Dynamic Itemset Counting and Implication Rules for Market Basket Data," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1997, pp. 255-264.
7. R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. 11th Int'l Conf. Data Eng., IEEE CS Press, Los Alamitos, Calif., 1995, pp. 3-14.
8. H. Mannila, H. Toivonen, and A.I. Verkamo, "Discovering Frequent Episodes in Sequences," Proc. 1st Int'l Conf. Knowledge Discovery Databases and Data Mining, AAAI Press, Menlo Park, Calif., 1995, pp. 210-215.
9. S. Ramaswamy, S. Mahajan, and A. Silberschatz, "On the Discovery of Interesting Patterns in Association Rules," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 368-379.
10. T. Zhang, R. Ramakrishnan, and M. Livny, “Birch: An
Efficient Data Clustering Method for Large Databases,”
Proc. ACM SIGMOD Int’l Conf. Management of Data,
ACM Press, New York, 1996, pp. 103-114.
11. V. Ganti et al., “Clustering Large Datasets in Arbitrary
Metric Spaces,” Proc. 15th Int’l Conf. Data Eng., IEEE
CS Press, Los Alamitos, Calif., 1999, pp. 502-511.
12. P. Bradley, U. Fayyad, and C. Reina, “Scaling Clustering Algorithms to Large Databases,” Proc. 4th Int’l
Conf. Knowledge Discovery and Data Mining, AAAI
Press, Menlo Park, Calif., 1998, pp. 9-15.
13. G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 428-439.
14. R.T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. 20th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1994, pp. 144-155.
15. S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1998, pp. 73-84.
16. M. Ester et al., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. 2nd Int'l Conf. Knowledge Discovery Databases and Data Mining, AAAI Press, Menlo Park, Calif., 1996, pp. 226-231.
17. D. Michie, D.J. Spiegelhalter, and C.C. Taylor, Machine Learning, Neural and Statistical Classification, Ellis Horwood, Chichester, UK, 1994.
18. J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining," Proc. 22nd Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1996, pp. 544-555.
19. J. Gehrke, R. Ramakrishnan, and V. Ganti, "RainForest—a Framework for Fast Decision Tree Construction of Large Datasets," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 416-427.
20. R. Rastogi and K. Shim, "Public: A Decision Tree Classifier that Integrates Building and Pruning," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 404-415.
Venkatesh Ganti is a PhD candidate at the University of Wisconsin-Madison. His primary research
interests are the exploratory analysis of large data sets
and monitoring changes in data characteristics. Ganti
received an MS in computer science from the University of Wisconsin-Madison.
Johannes Gehrke is a PhD candidate at the University
of Wisconsin-Madison. His research interests include
scalable techniques for data mining, performance of
data-mining algorithms, and mining and monitoring
evolving data sets.
Raghu Ramakrishnan is a professor in the Computer
Sciences Department at the University of Wisconsin-Madison. His research interests include database languages, net databases, data mining, and interactive
information visualization.
Contact Ganti, Gehrke, and Ramakrishnan at the
University of Wisconsin-Madison, Dept. of Computer
Science, Madison, WI 53706; {vganti, johannes,
raghu}@cs.wisc.edu.