Contents
9 Mining Complex Types of Data
9.1 Generalization and Multidimensional Analysis of Complex Data Objects
9.1.1 Generalization on structured data
9.1.2 Aggregation and approximation in spatial and multimedia data generalization
9.1.3 Generalization of object identifiers and class/subclass hierarchies
9.1.4 Generalization on inherited and derived properties
9.1.5 Generalization on class composition hierarchies
9.1.6 Class-based generalization and mining object data cubes
9.2 Mining Spatial and Multimedia Databases
9.2.1 Spatial data cube construction and spatial OLAP
9.2.2 Spatial association analysis
9.2.3 Spatial clustering methods
9.2.4 Spatial classification and spatial trend analysis
9.2.5 Mining raster databases
9.2.6 From spatial data mining to multimedia data mining
9.3 Mining Time-Series Databases
9.3.1 Trend analysis
9.3.2 Similarity search in time-series analysis
9.3.3 Frequent pattern mining
9.3.4 Periodicity analysis
9.4 Mining Text Databases
9.4.1 Text data analysis and information retrieval
9.4.2 Text mining: keyword-based association and document classification
9.5 Mining the World-Wide-Web
9.5.1 Mining the Web's link structures to identify authoritative Web pages
9.5.2 Automatic classification of Web documents
9.5.3 Construction of a multi-layered Web information base
9.5.4 Web usage mining
9.6 Summary
Chapter 9
Mining Complex Types of Data
Our previous studies of data mining techniques have focused on mining relational databases, transactional databases, and data warehouses formed by the transformation and integration of structured data. With the rapid progress of database systems, data collection tools, and WWW technologies, vast amounts of data in various complex forms, structured and unstructured, hypertext and multimedia, have been pouring in and growing explosively. Therefore, an increasingly important task in data mining is to mine complex types of data, including complex objects, spatial data, time-series data, hypertext and multimedia data, and WWW data.
In this chapter, we examine how to further develop the essential data mining techniques, such as characterization, classification, association, and clustering, and how to develop new ones to cope with complex types of data and perform fruitful data mining in complex information repositories. In particular, section 1 is devoted to the generalization of complex data objects, section 2 is on spatial and multimedia data mining, section 3 is on time-series data mining, section 4 is on mining text databases, and section 5 is on mining the World-Wide-Web. Since mining such complex types of data is a fast expanding research frontier, our discussion covers only some preliminary issues. We expect that many dedicated books on mining particular kinds of data will be available in the future.
9.1 Generalization and Multidimensional Analysis of Complex Data Objects
A major limitation of many commercial data warehouse and OLAP tools for multidimensional database analysis is their restriction on the allowable data types for dimensions and measures. Most data cube implementations confine dimensions to nonnumeric data, and measures to simple, aggregated values. To introduce data mining and multidimensional data analysis for complex objects, one needs to examine how to perform generalization on complex structured objects and construct object cubes for OLAP and mining in object databases.
In this section, we examine how such generalization can be performed.
The storage and access of complex structured data have been studied in object-relational and object-oriented database systems. These systems organize a large set of complex data objects into classes, which are in turn organized into class/subclass hierarchies. Each object in a class is associated with (1) an object identifier, (2) a set of attributes that may contain sophisticated data structures, set- or list-valued data, class composition hierarchies, multimedia data, etc., and (3) a set of methods that specify the computational routines or rules associated with the object class.
To facilitate generalization and induction in such databases, it is important to examine how each kind of component in object-relational and object-oriented databases can be generalized, and how the generalized data can be used for multidimensional data analysis and data mining.
9.1.1 Generalization on structured data
An important feature of an object-relational or object-oriented database is its capability of storing, accessing, and modeling complex structure-valued data, such as set-valued and list-valued data and data with nested structures. Such data can be generalized in several ways in order to summarize them and extract interesting patterns.
A set-valued attribute may be of homogeneous or heterogeneous type. Typically, set-valued data can be generalized by (1) generalization of each value in the set into its corresponding higher-level concept, or (2) derivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, the weighted average for numerical data, etc. Moreover, generalization can be performed by applying different generalization operators to explore alternative generalization paths. In this case, the result of generalization is a heterogeneous set.
For example, the hobby of a person is a set-valued attribute containing a set of values, such as {tennis, hockey, chess, violin, nintendo games}, which can be generalized into a set of high-level concepts, such as {sports, music, video games}, or into 5 (the number of hobbies in the set), or both. Moreover, a count can be associated with each generalized value to indicate how many elements are generalized to that value, such as {sports(3), music(1), video games(1)}, where sports(3) indicates three kinds of sports.
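To make this concrete, the following minimal Python sketch generalizes a set-valued hobby attribute through a concept hierarchy and attaches counts; the hierarchy, values, and function name are illustrative assumptions rather than part of any particular system.

# A minimal sketch of set-valued generalization with counts; the concept
# hierarchy and the hobby values below are hypothetical.
from collections import Counter

hierarchy = {
    "tennis": "sports", "hockey": "sports", "chess": "sports",
    "violin": "music", "nintendo games": "video games",
}

def generalize_set(values):
    """Generalize each value in the set and attach occurrence counts."""
    counts = Counter(hierarchy.get(v, v) for v in values)
    return {f"{concept}({n})" for concept, n in counts.items()}

hobbies = {"tennis", "hockey", "chess", "violin", "nintendo games"}
print(generalize_set(hobbies))  # e.g. {'sports(3)', 'music(1)', 'video games(1)'}
print(len(hobbies))             # the alternative generalization: 5 hobbies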
A set-valued attribute may be generalized into a set-valued or a single-valued attribute; whereas a single-valued attribute may also be generalized into a set-valued one if the "hierarchy" is a lattice or the generalization follows different paths. Further generalizations on such a generalized set-valued attribute should follow the generalization path of each value in the set.
A list-valued or sequence-valued attribute can be generalized in a way similar to a set-valued attribute, except that the order of the elements in the sequence should be preserved in the generalization. Each value in the list can be generalized into its corresponding higher-level concept. Alternatively, a list can be generalized according to its general behavior, such as the length of the list, the type of its elements, the value range, the weighted average value for numerical data, or by dropping unimportant elements in the list. A list may be generalized into a list, a set, or a single value. For example, a sequence (list) of data for a person's education record, "((B.Sc. in Electrical Engineering, U.B.C., Dec., 1980), (M.Sc. in Computer Engineering, U. Maryland, May, 1983), (Ph.D. in Computer Science, UCLA, Aug., 1987))", can be generalized by dropping less important descriptions (subattributes) of each tuple in the list, such as "((B.Sc., U.B.C., 1980), ...)", or by retaining only the most important tuple(s) in the list, such as "(Ph.D. in Computer Science, UCLA, 1987)", or both.
Set- and list-valued attributes are simple structure-valued attributes. In general, a structure-valued attribute may contain sets, tuples, lists, trees, records, etc., and their combinations. Moreover, one structure can be nested in another at any level. Similar to the generalization of set- and list-valued attributes, a structure-valued attribute can be generalized in several ways, such as (1) generalizing each attribute in the structure while maintaining the shape of the structure, (2) flattening the structure and generalizing the flattened structure, (3) summarizing the low-level structures by high-level concepts or aggregation, and (4) returning the type or an overview of the structure.
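As a small illustration of option (2), the sketch below flattens a nested record into dotted attribute paths, after which each flattened attribute could be generalized independently; the record layout is hypothetical.

# A sketch of flattening a nested structure so that the flattened
# attributes can then be generalized one by one.
def flatten(struct, prefix=""):
    """Recursively flatten a nested dict into dotted attribute paths."""
    flat = {}
    for key, value in struct.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

record = {"name": "J. Smith",
          "address": {"city": "Vancouver", "province": "B.C."},
          "education": {"degree": "Ph.D.", "year": 1987}}
print(flatten(record))
# {'name': 'J. Smith', 'address.city': 'Vancouver', ..., 'education.year': 1987}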
9.1.2 Aggregation and approximation in spatial and multimedia data generalization
Besides generalized concept substitution and structured data summarization, aggregation and approximation should be considered important means of generalization. They are especially useful for generalizing attributes with large sets of values, complex structures, spatial or multimedia data, etc.
Take spatial data as an example. It is desirable to generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage. Such generalization often requires merging a set of geographic areas by spatial operations, such as spatial union or spatial clustering methods. Aggregation and approximation are important techniques in such generalization. In a spatial merge, it is necessary not only to merge the regions of similar types within the same general class, but also to compute the total areas, average density, or other aggregate functions, while ignoring some scattered regions of different types if they are unimportant to the study. For example, different pieces of land used for different agricultural purposes, such as vegetables, grains, and fruits, can be merged into one large piece of agricultural land by spatial merge. However, such a piece of agricultural land may contain highways, houses, small stores, etc. If the majority of the land is used for agriculture, the scattered spots for other purposes can be ignored, and the whole region can be claimed as an agricultural area by approximation.
The spatial operators, such as spatial-union, spatial-overlapping, and spatial-intersection, which may require merging scattered small regions into large, clustered regions, can use spatial aggregation and approximation as data generalization operators.
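The following is a brief sketch of such a spatial union, assuming the availability of the shapely geometry library; the parcel polygons and their land-use labels are invented for illustration.

# Merging adjacent agricultural parcels into one generalized region by
# spatial union, with total area as an aggregate measure.
from shapely.geometry import Polygon
from shapely.ops import unary_union

parcels = [Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),   # vegetables
           Polygon([(2, 0), (4, 0), (4, 2), (2, 2)]),   # grains
           Polygon([(0, 2), (2, 2), (2, 4), (0, 4)])]   # fruits

merged = unary_union(parcels)   # one generalized agricultural region
print(merged.area)              # aggregate measure: total area (12.0)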
A multimedia database may contain complex texts, graphics, images, video fragments, maps, voice, music, and
other forms of audio/video information. Such multimedia data are typically stored as sequences of bytes with variable
lengths, and segments of data are linked together or indexed in a multidimensional way for easy reference. Generalization on multimedia data can be performed by recognition and extraction of the essential features and/or general patterns of such data.
There are many ways to extract the essential features or general patterns from segments of multimedia data. For an image, the size, color, shape, texture, and orientation of the contained objects or the major regions in the image can be extracted by aggregation and/or approximation. For a segment of music, its melody can be summarized based on the approximate patterns that repeatedly occur in the segment, and its style can be summarized based on its tone, tempo, major musical instruments played, etc. For an article, its abstract or general organizational structure, such as the table of contents and the subject and index terms frequently occurring in the article, may serve as generalization results.
In general, it is a challenging task to generalize multimedia data and spatial data to extract interesting knowledge implicitly stored in the data. Technologies developed for multimedia and spatial databases, such as content-based image retrieval, multidimensional indexing methods, and spatial data access and analysis techniques, should be integrated with data generalization and data mining techniques to achieve satisfactory results. More techniques for mining such data will be discussed in the following sections.
9.1.3 Generalization of object identifiers and class/subclass hierarchies
An essential component of an object-oriented database is the object identifier, whose role is to uniquely identify objects; it remains unchanged after structural reorganization of the data. At first glance, it may seem impossible to generalize an object identifier. However, since objects in an object-oriented database are organized into classes, which in turn are organized into class/subclass hierarchies, the generalization of an object can be performed by referring to its associated hierarchy. Thus an object identifier can be generalized as follows. First, the object identifier is generalized to the identifier of the lowest subclass to which the object belongs. This subclass identifier can then in turn be generalized to a higher-level class/subclass identifier by climbing up the class/subclass hierarchy. Similarly, a class or a subclass can be generalized to its corresponding superclass(es) by climbing up its associated class/subclass hierarchy.
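A minimal sketch of this hierarchy-climbing generalization follows; the class/subclass hierarchy is invented for illustration.

# Generalizing an object identifier by climbing a class/subclass
# hierarchy; the hierarchy below is hypothetical.
superclass = {                      # child class -> parent class
    "graduate_student": "student",
    "student": "person",
    "person": None,
}

def generalize_oid(lowest_class, levels=1):
    """Replace an OID by a class identifier 'levels' steps up the hierarchy."""
    cls = lowest_class
    for _ in range(levels):
        if superclass.get(cls) is None:
            break
        cls = superclass[cls]
    return cls

print(generalize_oid("graduate_student", levels=2))   # -> 'person'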
9.1.4 Generalization on inherited and derived properties
Since object-oriented databases are organized into class/subclass hierarchies, some attributes or methods of an object class are not explicitly specified in the class itself but are inherited from its higher-level classes. Some object-oriented database systems allow properties to be inherited from more than one superclass (called multiple inheritance) when the class/subclass "hierarchy" is organized in the shape of a lattice. The inherited properties of an object can be derived by query processing in the object-oriented database. From the data generalization point of view, it is unnecessary to distinguish which data are stored within the class and which are inherited from its superclass. As long as the set of relevant data is collected by query processing, the data mining process will treat the inherited data in the same way as the data stored in the object class, and perform generalization accordingly.
Methods are another important component of object-oriented databases. Much behavioral data about objects can be derived by the application of methods. Since a method is usually defined by a computational procedure/function or by a set of deduction rules, it is impossible to perform generalization on the method itself. However, generalization can be performed on the data derived by method application. That is, one should derive the task-relevant set of data by applying the method and, possibly, also by data retrieval, and then perform generalization treating the derived data like stored data.
9.1.5 Generalization on class composition hierarchies
An attribute of an object may be composed of or described by another object, some of whose attributes may in turn be composed of or described by other objects, thus forming a class composition hierarchy. Generalization on a class composition hierarchy can be viewed as generalization on a set of nested structured data (possibly infinite, if the nesting is recursive).
In principle, the reference to a composite object may traverse a long sequence of references along the corresponding class composition hierarchy. However, in most cases, the longer the sequence of references traversed, the weaker the semantic linkage between the original object and the referenced composite object. For example, the attribute "vehicles owned" of an object class "student" could refer to another object class "car", which may contain an attribute "auto dealer", which may refer to its "manager" with an attribute "children". Obviously, one is unlikely to find any interesting general regularities between a student and his/her car dealer's manager's children. Therefore, generalization on a class of objects should be performed on its own descriptive attribute values and methods, with only limited reference to its closely related components via close linkages in the class composition hierarchy. That is, in order to discover interesting knowledge, generalization should be performed on the objects in the class composition hierarchy that are closely related in semantics to the currently focused class(es), but not on those that have only remote and rather weak semantic linkages.
9.1.6 Class-based generalization and mining object data cubes
The methods discussed above are object-based generalization techniques. In a large object database, however, data mining and multidimensional analysis do not work on individual objects but on classes of objects. Thus an important question is how to perform class-based generalization for a large set of objects.
Since a set of objects in a class may share many attributes and methods, and the generalization of each attribute and method may apply a sequence of generalization operators, the major issue becomes how to coordinate the generalization processes among the different attributes and methods in the class(es) to produce interesting results.
For class-based generalization, the attribute-oriented induction method, developed in Chapter 4 for mining characteristics of relational databases, can be extended to mine data characteristics in object databases.
A generalization-based data mining process can be viewed as the application of a sequence of class-based generalization operators on different attributes, until the resulting class contains a small number of generalized objects that can be summarized as a concise, generalized rule in high-level terms. For efficient implementation, the generalization of the multidimensional attributes of a complex object class can be performed by examining each attribute (or dimension), generalizing each attribute to simple-valued data, and constructing a multidimensional data cube, called an object cube. Once an object cube is constructed, multidimensional analysis and data mining can be performed on it in a similar way as on relational data cubes.
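As a toy sketch of the last step, assuming each complex attribute has already been generalized to a single value, a small object cube can be materialized with ordinary grouping; the data and the use of the pandas library are assumptions of this sketch, not the book's implementation.

# A 2-D cuboid of an object cube over already-generalized attributes.
import pandas as pd

objects = pd.DataFrame({
    "major":  ["cs", "cs", "math", "math"],      # generalized attribute
    "region": ["west", "east", "west", "east"],  # generalized attribute
    "count":  [1, 1, 1, 1],
})

object_cube = objects.pivot_table(values="count", index="major",
                                  columns="region", aggfunc="sum")
print(object_cube)   # roll-ups sum along either dimension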
Notice that, from the application point of view, it is not always desirable to generalize a set of values to single-valued data. For example, the keyword attribute of a book may contain a set of keywords describing the book, and it may not make much sense to generalize this set of keywords to one single value. In this context, it is difficult to construct an object cube containing the keyword dimension. We will mention some progress in this direction in the next section when discussing spatial data cube construction. However, how to handle set-valued data effectively in object cube construction and object-based data mining remains a challenging research issue.
9.2 Mining Spatial and Multimedia Databases
A spatial database stores a large amount of space-related data, such as maps, remote sensing data, astronomical data, medical images, VLSI chip layouts, etc. Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases.
Spatial data has many features distinguishing it from relational data. It carries topological and/or distance information, is usually organized by sophisticated, multidimensional spatial indexing structures, is accessed by spatial data access methods, and often requires spatial reasoning, geometric computation, and spatial knowledge representation techniques. Spatial data mining demands an integration of data mining with spatial database technologies. A crucial challenge in spatial data mining is the development of efficient spatial data mining techniques, due to the huge amount of spatial data and the complexity of spatial data types and spatial access methods.
Spatial data mining can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and nonspatial data, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries. It is expected to have wide applications in geographic information systems, geomarketing, remote sensing, image database exploration, medical imaging, navigation, traffic control, environmental studies, and many other areas where spatial data is used.
Statistical spatial data analysis has been a popular approach to analyzing spatial data. Many statistical analysis algorithms and various optimization techniques have been developed for such analysis. The approach handles numerical data well and usually comes up with realistic models of spatial phenomena. However, it is usually based on the assumption of statistical independence among the spatially distributed data, whereas in reality spatial objects
are often interrelated. Moreover, most statistical modeling can only be done by experts with a fair amount of domain knowledge and statistical expertise. Furthermore, statistical methods do not work well with symbolic values or incomplete and inconclusive data, and their results can be expensive to compute. Spatial data mining extends traditional spatial analysis methods by putting emphasis on efficiency, cooperation with database systems, better interaction with users, and the discovery of new types of knowledge.
9.2.1 Spatial data cube construction and spatial OLAP
As with relational data, we can integrate spatial data and construct a data warehouse to facilitate spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes.
Let us examine an application example.
Example 9.1 There are about 3,000 weather probes distributed in British Columbia (B.C.), each recording daily temperature and precipitation for a designated small area and transmitting signals to a provincial weather station. A user may like to view weather patterns on a map by month, by region, and by different combinations of temperature and precipitation, or may even like to dynamically drill down or roll up along any dimension to explore desired patterns, such as wet and hot regions in the Fraser Valley in the summer of 1997.
To facilitate multidimensional analysis of weather patterns, it is desirable to construct a spatial data warehouse and support spatial OLAP. However, there are several challenging issues in the construction and utilization of spatial data warehouses.
The first challenge is the integration of spatial data from heterogeneous sources and systems. Spatial data is usually stored in different industry firms and government agencies using different data formats. Data formats are not only structure-specific (e.g., raster- vs. vector-based spatial data, object-oriented vs. relational models, different spatial storage and indexing structures, etc.), but also vendor-specific (e.g., ESRI, MapInfo, Intergraph, etc.). There has been a lot of work on the integration and exchange of heterogeneous spatial data, which has paved the way for spatial data integration and spatial data warehouse construction.
The second challenge is the realization of fast and flexible on-line analytical processing in a spatial data warehouse. This is the issue we discuss in detail here.
We consider that the star/snowflake schema model introduced in Chapter 2 is still a good choice for modeling spatial data warehouses, because it provides a concise and organized warehouse structure and facilitates OLAP operations. However, in a spatial warehouse, both dimensions and measures may contain spatial components.
There are three types of dimensions in a spatial data cube:
1. A nonspatial dimension is a dimension containing only nonspatial data. For example, two dimensions, temperature and precipitation, can be constructed for the warehouse in Example 9.1; each is a dimension containing nonspatial data whose generalizations are nonspatial, such as hot and wet.
2. A spatial-to-nonspatial dimension is a dimension whose primitive-level data is spatial but whose generalization, starting at a certain high level, becomes nonspatial. For example, state in the U.S. map is spatial data. However, each state can be generalized to some nonspatial value, such as pacific northwest or big state, and its further generalizations are nonspatial; it thus plays a role similar to a nonspatial dimension.
3. A spatial-to-spatial dimension is a dimension whose primitive-level and all of whose high-level generalized data are spatial. For example, equi-temperature-region is spatial data, and all of its generalized data, such as regions covering 0-5 degrees, 5-10 degrees, and so on, are also spatial.
We distinguish two types of measures in a spatial data cube.
1. A numerical measure is a measure containing only numerical data. For example, one measure in a spatial data warehouse could be the monthly revenue of a region, so that a roll-up may compute the total revenue by year, by county, etc. Numerical measures can be further classified into distributive, algebraic, and holistic, as in Chapter 2.
2. A spatial measure is a measure which contains a collection of pointers to spatial objects. For example, during the generalization (or roll-up) in the spatial data cube of Example 9.1, the regions with the same range of temperature
and precipitation will be grouped into the same cell, and the measure so formed contains a collection of pointers to those regions.
Figure 9.1: Star model of a spatial data warehouse: BC weather, and the corresponding BC weather probes map
Figure 9.2: Hierarchies for each dimension in BC weather:
region name: probe location < district < city < region < province
time: hour < day < month < season
temperature: any (cold, mild, hot); cold (below -20, -20 to -10, -10 to 0); mild (0 to 10, 10 to 15, 15 to 20); hot (20 to 25, 25 to 30, 30 to 35, above 35)
precipitation: any (dry, fair, wet); dry (0 to 0.05, 0.05 to 0.2); fair (0.2 to 0.5, 0.5 to 1.0, 1.0 to 1.5); wet (1.5 to 2.0, 2.0 to 3.0, 3.0 to 5.0, above 5.0)
A nonspatial data cube contains only nonspatial dimensions and numerical measures. If a spatial data cube contained spatial dimensions but no spatial measures, its OLAP operations, such as drilling or pivoting, could be implemented in a way similar to that of nonspatial data cubes. However, the introduction of spatial measures into spatial data cubes raises challenging implementation issues, as shown in the following example.
Example 9.2 A star model for the BC weather warehouse of Example 9.1 is shown in Figure 9.1. It consists of four dimensions: temperature, precipitation, time, and region name, and three measures: region map, area, and count. A concept hierarchy for each dimension can be created by users or experts, or generated automatically by data clustering analysis. Figure 9.2 presents the hierarchies for the dimensions in the BC weather warehouse.
Of the three measures, area and count are numerical measures, which can be computed in the same way as in a nonspatial data cube; whereas region map is a spatial measure which represents a collection of spatial pointers to the corresponding regions. Since different spatial OLAP operations result in different collections of spatial objects in region map, it is a major challenge to compute the merges of a large number of regions flexibly and dynamically. For example, two different roll-ups on the BC weather map data (Figure 9.1) may produce two different generalized region maps, as shown in Figure 9.3, each being the result of merging a large number of small (probe) regions from Figure 9.1.
Figure 9.3: Roll-up operations along different dimensions
Can we precompute all the possible spatial merges and store them in the corresponding cuboid cells of a spatial data cube? The answer is probably not. Unlike a numerical measure, where each aggregated value takes only a few bytes of space, a merged region map of B.C. may take megabytes of storage space. Thus, we face a dilemma in balancing the cost of on-line computation against the space overhead of storing computed measures: the substantial cost of on-the-fly computation of spatial aggregations calls for precomputation, yet the substantial overhead of storing aggregated spatial values discourages it.
There are at least three possible choices for the computation of spatial measures in spatial data cube construction.
1. Collect and store the corresponding spatial object pointers, but do not precompute spatial measures in the spatial data cube.
This can be implemented by storing, in the corresponding cube cell, a pointer to a collection of spatial object pointers, and invoking and performing the spatial merge (or other computation) of the corresponding spatial
objects on-the-fly, when necessary. This is still a good choice if only spatial display is required (i.e., no real spatial merge has to be performed), if there are not many regions to be merged in any pointer collection (so that on-line merging is not very costly), or if on-line spatial merge computation is fast enough. Recently, some efficient spatial merge methods have been developed for fast spatial OLAP. Since OLAP results are often used for further spatial analysis and mining, it is still recommended to precompute some of the spatially connected regions to speed up such analysis.
2. Precompute and store a rough approximation/estimation of the spatial measures in the spatial data cube.
This choice is good for a rough view or coarse estimation of spatial merge results, under the assumption that it takes little storage space to store the coarse estimate. For example, the minimum bounding rectangle (MBR), representable by two points, can be taken as a rough estimate of a merged region (see the sketch following this list). Such a precomputed result is small and can be presented to users quickly. If higher precision is needed for specific cells, the application can either fetch precomputed high-quality results, if available, or compute them on-the-fly.
3. Selectively precompute some spatial measures in the spatial data cube.
This seems to be a smart choice, but the question becomes which portion of the cube to select for materialization.
The selection can be performed at the cuboid level, i.e., either precompute and store each set of mergeable spatial regions for each cell of a selected cuboid, or precompute none if the cuboid is not selected. Since a cuboid usually consists of a large number of spatial objects, this may involve the precomputation and storage of a large number of mergeable spatial objects, some of which could be rarely used. Therefore, it is recommended to perform the selection at a finer granularity: examine each group of mergeable spatial objects in a cuboid to determine whether such a merge should be precomputed. The decision should be based on the utility (such as access frequency or access priority), the sharability of merged regions, and the balanced overall cost of space and on-line computation.
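The following sketch illustrates choice 2 above: the minimum bounding rectangle of a merged region, representable by just two corner points, serves as its coarse precomputed approximation. The coordinates are illustrative.

# MBR of a merged region: two corner points, a few bytes of storage.
def mbr(points):
    """Minimum bounding rectangle of a set of (x, y) vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

# Vertices of many small merged probe regions (illustrative coordinates).
region_points = [(1, 1), (4, 2), (3, 6), (0, 3)]
print(mbr(region_points))   # ((0, 1), (4, 6))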
With an efficient implementation of spatial data cubes and spatial OLAP, generalization-based descriptive spatial mining, such as spatial characterization and discrimination, can be performed efficiently.
9.2.2 Spatial association analysis
Similar to the mining of association rules in transactional and relational databases, spatial association rules can be mined in spatial databases. A spatial association rule is of the form X → Y [s%, c%], where X and Y are sets of spatial or nonspatial predicates, s% is the support of the rule, and c% is the confidence of the rule. For example, the following rule is a spatial association rule:
is_a(x, school) ∧ close_to(x, sports_center) → close_to(x, park) [0.5%, 80%]
This rule states that 80% of the schools that are close to sports centers are also close to parks, and that 0.5% of the data belong to such a case.
Various kinds of spatial predicates can constitute a spatial association rule. Examples include topological relations, such as intersects, overlaps, and disjoint; spatial orientations, such as left_of and west_of; and distance information, such as close_to and far_away.
Since spatial association mining needs to evaluate multiple spatial relationships among a large number of spatial objects, the process can be quite costly. An important mining optimization method, called progressive refinement, can be adopted in spatial association analysis. The method first mines large data sets roughly, using a fast, scalable algorithm, and then improves the quality of mining on the pruned data set using a more expensive algorithm.
To ensure that the pruned data set covers the complete set of answers when applying the high-quality data mining algorithms at a later stage, an important requirement for the rough mining algorithm in the early stage is the superset coverage property: it must preserve all the potential answers. In other words, it may act as a false-positive test, which can include some answers that do not belong to the final answer set, but it must not act as a false-negative test, which might exclude some potential answers.
For mining spatial associations related to the spatial predicate close_to, one can first collect the candidates that pass the minimum support threshold by (1) applying certain rough spatial evaluation algorithms, e.g., using a minimum
bounding rectangle (MBR) structure (which registers only two spatial points rather than a set of complex polygons), and (2) evaluating a relaxed spatial predicate, g_close_to, which is a generalized close_to covering a broader context, including close_to, touch, intersect, etc. If two spatial objects are closely located, their enclosing minimum bounding rectangles must be closely located, matching g_close_to. But the reverse is not always true: if the enclosing minimum bounding rectangles are closely located, the two spatial objects may or may not be closely located. Thus, MBR pruning is a false-positive test for closeness: only those candidates which pass the rough test need to be further examined using more expensive spatial computation algorithms. With this preprocessing step, only the patterns that are frequent at the approximation level need to be examined by more detailed, finer, but more expensive spatial computation.
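A hedged sketch of this filtering step follows: a cheap MBR-distance test implements a relaxed predicate in the spirit of g_close_to and preserves the superset coverage property, since it never rejects a truly close pair. The rectangle representation and the threshold are illustrative assumptions.

# Rough MBR filter for the relaxed predicate g_close_to.
def mbr_distance(a, b):
    """Minimum distance between two MBRs ((xmin, ymin), (xmax, ymax))."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    dx = max(bx1 - ax2, ax1 - bx2, 0)   # 0 if the rectangles overlap in x
    dy = max(by1 - ay2, ay1 - by2, 0)
    return (dx * dx + dy * dy) ** 0.5

def g_close_to(mbr_a, mbr_b, threshold=1.0):
    """Superset-coverage test: never rejects a truly close pair."""
    return mbr_distance(mbr_a, mbr_b) <= threshold

school = ((0, 0), (1, 1))
park = ((1.5, 0), (3, 1))
if g_close_to(school, park):
    pass  # only candidates passing the rough test get the exact polygon test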
9.2.3 Spatial clustering methods
Spatial data clustering identifies clusters, or densely populated regions, according to some distance measure, in a large, multidimensional data set. Spatial clustering methods were studied thoroughly in Chapter 8, since cluster analysis usually takes spatial data clustering as its examples and applications. Readers interested in spatial clustering should therefore refer to Chapter 8.
9.2.4 Spatial classification and spatial trend analysis
Spatial classification analyzes spatial objects to derive classification schemes, such as decision trees, with respect to certain spatial properties, such as the neighborhood of a district, highway, river, etc.
Suppose one would like to classify regions in a state into rich vs. poor according to average family income, and find the important factors that determine whether a region is rich or poor. This is a spatial classification problem. There are many features associated with each spatial object, such as being near a major airport, hosting a university, containing interstate highways, being near a lake, etc. These features can be used for relevance analysis and for finding interesting classification schemes.
Spatial trend analysis deals with a different issue: detecting changes and trends along a spatial dimension. Usually, trend analysis studies changes with time, such as the changes of temporal patterns in time-series data. Spatial trend analysis replaces time with space and studies the trend of some nonspatial data changing with space. For example, one may observe a trend of change in economic situation when moving away from the center of a city, or a trend of change in climate or vegetation with increasing distance from an ocean. For such analyses, regression and correlation analysis methods are often applied, utilizing spatial data structures and spatial access methods.
There are also many applications where patterns change with both space and time. For example, traffic flows on highways and in cities are related to both time and space. Weather patterns are also closely related to both time and space. To find spatio-temporal patterns and make good predictions, sophisticated methods for mining spatio-temporal data should be developed.
There have been a few interesting studies on spatial classification and spatial trend analysis. However, there have been few studies on spatio-temporal data mining. More methods and applications of spatial classification and trend analysis, especially those associated with time, need to be explored in the future.
9.2.5 Mining raster databases
Spatial database systems usually handle vector data, which consist of points, lines, polygons (regions), and their compositions, such as networks or partitions. Typical examples of such data are maps, design graphs, 3-D representations of the arrangement of the chains of protein molecules, etc. However, a huge amount of space-related data is in digital raster (image) form, such as satellite images, remote sensing data, computed tomography, etc. It is important to explore data mining in raster databases.
There have been many studies on mining raster data in scientific research, such as astronomy, seismology, and geoscientific research. In general, the following data mining methods have been explored in raster data mining.
1. Decision tree classification has been an essential data mining method in reported raster data mining applications. For example, one may take sky images that have been carefully classified by astronomers as the training set, and construct models for the recognition of galaxies, stars, and other stellar objects, based on
properties such as magnitudes, areas, intensity, image moments, orientation, etc. A large number of sky images taken by telescopes or space probes can then be tested against the constructed models to identify new celestial bodies. Similar studies have also been performed successfully to identify volcanoes on Venus.
2. Data preprocessing, such as noise reduction, data focusing, and feature extraction, is often important in raster data mining, since the images may contain noise, pictures may be taken from different angles, etc. Besides standard methods used in pattern recognition, such as edge detection and Hough transformation, one may explore techniques like the decomposition of images into eigenvectors or the adoption of probabilistic models to deal with uncertainty.
3. Parallel and distributed processing are useful, since raster data often comes in huge volumes and may require substantial processing power.
4. Raster data mining is closely linked to image analysis and scientific data mining, and thus many image analysis techniques and scientific data analysis methods can be applied to raster data mining.
9.2.6 From spatial data mining to multimedia data mining
With the popular use of audio-video equipment, CD-ROMs, and the Internet, many database systems store and manage a large number of multimedia objects, including audio data, images, video data, hypertext data (which contains text, text markups, and linkages), sequence data, etc. A database system which stores and manages a large collection of multimedia objects is called a multimedia database system. Typical multimedia database systems include NASA's EOS (Earth Observation System), the Human Genome project, and digital libraries.
Mining multimedia databases is a challenging task due to the huge size and unstructured nature of multimedia objects. However, some progress has been made in mining multimedia data. In this section, we introduce a few methods for mining multimedia databases, including content-based retrieval and similarity search of multimedia data, generalization and multidimensional analysis of multimedia data, and mining associations in multimedia data.
Similarity search in multimedia data
Given a set of images, the problem of similarity search is to find all images similar to a given image, or all pairs of similar images. Applications include medical diagnosis, weather prediction, Web search engines for images, and e-commerce.
Multi-dimensional analysis of multimedia data
Mining associations in multimedia data
9.3 Mining Time-Series Databases
A time-series database consists of sequences of values or events changing with time. Time-series databases are popular in many applications, such as studying the daily fluctuations of a stock market, business transaction sequences, traces of a dynamic production process, scientific experiments, medical treatments, Web page access sequences, and so on. There are many distinct issues in mining time-series databases, such as trend and periodicity analysis, similarity search in time-series analysis, and time-related frequent pattern mining.
9.3.1 Trend analysis
Trend analysis is one of the major applications of time-series analysis. In many cases, a time series involving a variable Y, such as the daily closing price of a share in a stock market, can be viewed as a function of time t, i.e., Y = F(t). Such a function can be drawn as a time-series graph, as shown in Figure 9.4, which describes a point moving with the passage of time.
Time-series movements can be characterized by the following components.
Figure 9.4: A time series.
1. Long-term or trend movements: these refer to the general direction in which a time series is moving over a long interval of time. The trend movement is indicated by a trend curve or, in some time series, a trend line. Typical methods for determining such a trend curve or trend line include the least squares method and the weighted moving average method.
2. Cyclic movements or cyclic variations: these refer to long-term oscillations, or swings, about a trend line or curve. The "cycles" may or may not be periodic, that is, they may or may not follow exactly similar patterns after equal intervals of time.
3. Seasonal movements or seasonal variations: these refer to the identical or nearly identical patterns that a time series appears to follow during corresponding months of successive years. Such movements are due to recurring events that take place annually, such as the sudden increase in department store sales before Christmas.
4. Irregular or random movements: these refer to the sporadic motion of a time series due to chance events, such as earthquakes, strikes, etc.
Time-series analysis that investigates these four factors, trend, cyclic, seasonal, and irregular, is often referred to as the decomposition of a time series into its basic component movements.
Given a set of numbers y_1, y_2, y_3, ..., a moving average of order n is the sequence of arithmetic means:
(y_1 + y_2 + ... + y_n)/n, (y_2 + y_3 + ... + y_{n+1})/n, (y_3 + y_4 + ... + y_{n+2})/n, ...   (9.1)
A moving average tends to reduce the amount of variation present in a set of data. Thus the process of replacing a time series by its moving average eliminates unwanted fluctuations, and is therefore also called smoothing of time series. If weighted arithmetic means are used in sequence (9.1), the resulting sequence is called a weighted moving average of order n.
Example 9.3 Given a sequence of nine numbers, its moving average of order 3 and its weighted moving average of order 3 with weights (1, 4, 1) can be shown in tabular form, where each number in the moving average is the mean of the three numbers immediately above it, and each number in the weighted moving average is the weighted average of the three numbers immediately above it.
Original data: 3 7 2 0 4 5 9 7 2
Moving average of order 3: 4 3 2 3 6 7 6
Weighted (1, 4, 1) moving average of order 3: 5.5 2.5 1 3.5 5.5 8 6.5
The first weighted average value is calculated as (1 x 3 + 4 x 7 + 1 x 2)/(1 + 4 + 1) = 33/6 = 5.5. The weighted average gives the central element more weight, to offset the smoothing effect, which could otherwise be strongly affected by extreme values.
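The computation in Example 9.3 can be reproduced directly; the short Python sketch below computes both the order-3 moving average and the weighted (1, 4, 1) moving average of the same nine numbers.

# Moving average and weighted moving average of order len(weights).
def moving_average(data, weights):
    n, total = len(weights), sum(weights)
    return [sum(w * x for w, x in zip(weights, data[i:i + n])) / total
            for i in range(len(data) - n + 1)]

data = [3, 7, 2, 0, 4, 5, 9, 7, 2]
print(moving_average(data, [1, 1, 1]))   # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
print(moving_average(data, [1, 4, 1]))   # [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]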
In general, we have the following curve-fitting methods for the estimation of a trend.
1. The freehand method: drawing an approximate line or curve to fit a set of data, based on an individual's judgement. The validity and quality of this method rely on that judgement, which makes it costly and barely reliable for any large-scale data mining.
2. The least squares method: taking the best-fitting curve C to be the least squares curve, that is, the curve minimizing the sum of squared deviations, sum_{i=1}^{n} d_i^2, where the deviation or error d_i is the difference between the value y_i of a point (x_i, y_i) and the corresponding value determined from the curve C.
3. The moving average method: using a moving average of appropriate order, one can eliminate cyclic, seasonal, and irregular patterns, leaving only the trend movement. However, a moving average loses the data at the beginning and end of a series, may sometimes generate cycles or other movements that are not present in the original data, and may be strongly affected by extreme values. Notice that the influence of extreme values can be reduced by using a weighted moving average with appropriate weights, as shown in Example 9.3.
In many business transactions, such as sales over a year, there are expected regular seasonal fluctuations, such as higher sales volumes during the Christmas season. Therefore, it is important to identify such seasonal variations and deseasonalize the data for trend and cyclic analysis. For this purpose, the concept of a seasonal index is introduced: a set of numbers showing the relative values of a variable during the months of a year. For example, if the sales during October, November, and December are 80%, 110%, and 140% of the average monthly sales for the whole year, respectively, then 80%, 110%, and 140% are the seasonal index numbers for these months. If the original monthly data are divided by the corresponding seasonal index numbers, the resulting data are said to be deseasonalized, or adjusted for seasonal variations. Such data still include trend, cyclic, and irregular movements.
The deseasonalized data can be adjusted for trend by dividing the data by their corresponding trend values. Furthermore, an appropriate moving average will smooth out the irregular variations and leave only the cyclic variations for further analysis. If periodicity or approximate periodicity of cycles occurs, cyclic indexes can be constructed in a way similar to seasonal indexes.
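A one-step sketch of deseasonalizing follows: each month's value is divided by its seasonal index. The sales figures are invented so that the October-December indexes above give round results.

# Deseasonalizing: divide each month's value by its seasonal index.
sales = {"Oct": 96.0, "Nov": 121.0, "Dec": 154.0}
seasonal_index = {"Oct": 0.80, "Nov": 1.10, "Dec": 1.40}

deseasonalized = {m: sales[m] / seasonal_index[m] for m in sales}
print(deseasonalized)   # {'Oct': 120.0, 'Nov': 110.0, 'Dec': 110.0}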
Finally, irregular or random variations can be estimated by adjusting the data for the trend, seasonal, and cyclic variations. In practice, irregular movements tend to have small magnitude and to follow a normal distribution: small deviations occur with high frequency, whereas large deviations occur with low frequency.
In practice, it is often beneficial to first graph the time series and estimate qualitatively the presence of long-term trend, seasonal variations, and cyclic variations. This may help in choosing the right method of analysis and in comprehending its results.
With a systematic analysis of the trend, cyclic, seasonal, and irregular movements, one is able to make long-term or short-term predictions, that is, to forecast the time series, with reasonable quality.
9.3.2 Similarity search in time-series analysis
Given a set of time-series sequences, the problem of similarity search is to find all data sequences that are similar to a given query sequence, or all pairs of similar sequences. Notice that, unlike normal database queries, which find data that match the given query exactly, a similarity search finds data sequences that differ only slightly from the given query sequence.
In general, there are two categories of similarity matching problems: whole sequence matching and subsequence matching. The former finds sequences (or pairs of sequences) that are similar in their entirety; the latter finds all sequences containing subsequences that are similar to a given query sequence. Similarity search in time-series analysis is useful in financial market analysis (e.g., stock data analysis), medical diagnosis (e.g., cardiogram analysis), and scientific or engineering databases (e.g., power consumption analysis).
Whole sequence matching
Two time sequences S and T are said to be ε-similar if they contain nonoverlapping subsequences s_1, s_2, ..., s_m and t_1, t_2, ..., t_m, respectively, such that
1. s_i < s_j and t_i < t_j for 1 <= i < j <= m (i.e., the subsequences appear in the same time order), and
2. there exist some scale λ and some translation δ such that λ(s_i) + δ ≈ t_i for all 1 <= i <= m, where ≈ is a similarity operator defined by a certain similarity measure, for example requiring that the fraction of the matching length to the total length of the two sequences be above the specified threshold ε.
To find whole sequence matches efficiently, one first extracts k features from every sequence, so that each sequence is represented as a point in k-dimensional space. One can then use a multidimensional indexing method to store and search these points. Notice, however, that spatial indices usually do not work well for high-dimensional data.
Usually, one resorts to distance-preserving orthonormal transformations; the Discrete Fourier Transform (DFT) and the Haar wavelet transform are two frequently used ones. Since the Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain, and since the DFT does a good job of concentrating the energy in the first few coefficients, keeping only the first few DFT coefficients yields a lower bound on the actual distance.
One implementation method goes as follows. Take the Euclidean distance as the similarity measure, obtain the Discrete Fourier Transform coefficients of each sequence in the database, and build a multidimensional index
using the first few Fourier coefficients. The index is used to retrieve sequences that are at most a certain small distance away from the query sequence. After such processing, one needs to compute the actual distance between sequences in the time domain and discard any false matches.
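The sketch below illustrates this implementation idea using numpy (an assumption of the sketch): with an orthonormal DFT, the distance over the first few coefficients lower-bounds the true Euclidean distance by Parseval's theorem, so index-level pruning never loses a true match. The sequences and the threshold are illustrative.

# DFT feature extraction and lower-bound pruning for similarity search.
import numpy as np

def dft_features(seq, k=3):
    """First k DFT coefficients, orthonormal so distances are preserved."""
    return np.fft.fft(seq, norm="ortho")[:k]

def lower_bound_distance(f1, f2):
    return np.linalg.norm(f1 - f2)   # never exceeds the true distance

s = np.array([3.0, 7, 2, 0, 4, 5, 9, 7])
q = np.array([3.0, 6, 2, 1, 4, 5, 9, 7])
if lower_bound_distance(dft_features(s), dft_features(q)) <= 2.0:
    exact = np.linalg.norm(s - q)    # confirm in the time domain
    print(exact)                     # discard if this exceeds the threshold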
Subsequence matching
An intuitive notion of subsequence similarity allows for nonmatching gaps, amplitude scaling, and offset translation. The matching subsequences need not be aligned along the time axis. The approach requires several parameters: the sliding window size, the width ε of an envelope for similarity, the maximum gap, and the matching fraction.
The similarity model is as follows:
Sequences are normalized with amplitude scaling and offset translation.
Two subsequences are considered similar if one lies within an envelope of width ε around the other, ignoring outliers.
Two sequences are said to be similar if they have enough non-overlapping, time-ordered pairs of similar subsequences.
The approach can be outlined in three steps:
Atomic matching: find all pairs of gap-free windows of the sliding window length that are similar.
Window stitching: stitch similar windows together to form pairs of larger similar subsequences, allowing gaps between atomic matches.
Subsequence ordering: linearly order the subsequence matches to determine whether enough similar pieces exist.
9.3.3 Frequent pattern mining
Different kinds of time-related frequent patterns
Most studies concentrate on symbolic patterns, although some consider numerical curve patterns in time series. Agrawal and Srikant [AS95] developed an Apriori-like technique [AS94] for mining sequential patterns. Mannila et al. [MTV95] consider frequent episodes in sequences, where episodes are essentially acyclic graphs of events whose edges specify the temporal before-and-after relationship, but without timing-interval restrictions. Inter-transaction association rules proposed by Lu et al. [LHF98] are implication rules whose two sides are totally ordered episodes with timing-interval restrictions (on the events in the episodes and on the two sides). Bettini et al. [BWJ98] consider a generalization of inter-transaction association rules: these are essentially rules whose left-hand and right-hand sides are episodes with time-interval restrictions.
Mining sequential patterns and episodes
Mining inter-transaction association rules
9.3.4 Periodicity analysis
The mining of periodic patterns, that is, the search for recurring patterns in time-series databases, is an important data mining problem with many applications. For example, seasons, tides, planet trajectories, daily power consumption, daily traffic patterns, and weekly TV programs all exhibit certain periodic patterns.
Periodic pattern mining problems can be partitioned into three categories: mining full periodic patterns, mining partial periodic patterns, and mining cyclic association rules. Full periodicity means that every point in time contributes (precisely or approximately) to the cyclic behavior of the time series. For example, all the days of the year approximately contribute to the seasonal cycle of the year. Partial periodicity specifies the periodic behavior of the time series at some, but not all, points in time. For example, Jim reads the New York Times from 7:00 to 7:30 every weekday morning, but his activities at other times do not have much regularity. Partial periodicity is a looser kind of periodicity than full periodicity, and it occurs more commonly in the real world. Cyclic association rules, the third category, are discussed at the end of this section.
Full periodicity mining
Partial periodicity mining
Most methods for nding full periodic patterns are either inapplicable to or prohibitively expensive for the mining
of partial periodic patterns , because of the mixture of periodic events and non-periodic events in the same period.
For example, FFT (Fast Fourier Transformation) cannot be applied to mining partial periodicity because it treats
the time-series as an inseparable ow of values. Some periodicity detection methods can detect some partial periodic
patterns, but only if the period, and the length and timing of the segment in the partial patterns with specic
behavior are explicitly specied. For the newspaper reading example, we need to explicitly specify details such as
\nd the regular activities of Jim during the half-hour after 7:00 for the period of 24 hours." A naive adaptation
of such methods to our partial periodic pattern mining problem would be prohibitively expensive, requiring their
application to a huge number of possible combinations of the three parameters of length, timing, and period.
An Apriori-like algorithm for mining imperfect partial periodic patterns with a given (single)
period was proposed in a recent study by two of the current authors [HGY98]. It is an interesting algorithm for mining imperfect
partial periodicity. However, a detailed examination of the data characteristics of partial periodicity shows
that Apriori pruning in mining partial periodicity may not be as effective as in mining association rules.
Our study has revealed the following new characteristics of partial periodic patterns in time series: the Apriori-like
property among partial periodic patterns still holds for any fixed period, but it does not hold for patterns across
different periods. Furthermore, there is a strong correlation among the frequencies of partial patterns.
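To make the single-period case concrete, the following sketch counts, for a fixed period, how often each (offset, event) pair recurs, keeping those whose recurrence ratio reaches a confidence threshold. It is a minimal sketch assuming the time series is given as a string of event symbols; the function name and representation are our own illustration, not the algorithm of [HGY98].

    from collections import Counter

    def partial_periodic_1_patterns(series, period, min_conf=0.75):
        # Count, for each offset within the period, how often each
        # event recurs there across all full periods of the series.
        n_periods = len(series) // period
        counts = Counter()
        for seg in range(n_periods):
            for offset in range(period):
                counts[(offset, series[seg * period + offset])] += 1
        # Keep (offset, event) pairs whose recurrence ratio (confidence)
        # reaches the threshold; non-periodic positions drop out.
        return {(off, ev): c / n_periods
                for (off, ev), c in counts.items()
                if c / n_periods >= min_conf}

    # 'a' recurs at offset 0 and 'd' at offset 3 in every period of 4:
    print(partial_periodic_1_patterns("abcdaxydazzdaqqd", 4))

Longer patterns can then be grown Apriori-style within the same fixed period, which is exactly where the pruning discussed above applies.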
Cyclic association rule mining
Similar to our problem, the mining of cyclic association rules by Özden et al. [ORS98]^1 also considers mining
patterns over a range of possible periods. Observe that cyclic association rules are partial periodic patterns with
perfect periodicity, in the sense that each pattern recurs in every cycle with 100% confidence. This perfect
periodicity leads to a key idea used in designing efficient cyclic association rule mining algorithms: as soon as it is
known that an association rule R does not hold at a particular instant of time, we can infer that R cannot have any
period that includes this time instant. For example, if the maximum period of interest is ℓ_max and it is discovered
that R does not hold in the first ℓ_max time instants, then R cannot have any periods. This idea leads to the useful
"cycle-elimination" strategy explored in that paper. Since real-life patterns are usually imperfect, our goal is not to
mine perfect periodicity, and thus "cycle-elimination"-based optimization will not be considered here.^2
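For contrast, here is a minimal sketch of the cycle-elimination idea, assuming we already know at which instants a rule holds; the representation is our own illustration, whereas [ORS98] integrates the strategy into candidate itemset elimination.

    def surviving_cycles(holds, max_period):
        # holds[t] is True iff rule R holds at time instant t.
        # A cycle is a pair (p, r): R would hold at every t with t mod p == r.
        candidates = {(p, r) for p in range(1, max_period + 1)
                      for r in range(p)}
        for t, ok in enumerate(holds):
            if not ok:
                # R fails at t, so eliminate every cycle that covers t.
                candidates -= {(p, t % p) for p in range(1, max_period + 1)}
        return candidates

    # R fails at instants 1, 2, and 5; only the cycle (period 3, phase 0)
    # survives, since R holds at instants 0 and 3.
    print(sorted(surviving_cycles([True, False, False, True, True, False], 3)))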
9.4 Mining Text Databases
Most previous studies of data mining have focused on structured data, such as relational, transactional, and data
warehouse data. In reality, however, a substantial portion of the available information is stored in text databases
(or document databases), which consist of large collections of documents from various sources, such as news articles,
research papers, books, digital libraries, e-mails, and Web pages. With the popular use of electronic publishing,
e-mail, CD-ROMs, and the WWW, information is increasingly available in electronic form, and the amount of on-line
text data has been growing rapidly.
Data stored in most text databases are semi-structured, in the sense that they are neither completely unstructured
nor completely structured. For example, a document may contain a few structured fields, such as title,
authors, publication date, length, and category, but also largely unstructured text components, such
as the abstract and contents. There have been many studies on the modeling and implementation of semi-structured data
in recent database research. Moreover, information retrieval techniques, such as text indexing methods, have been
developed to handle unstructured documents.
With the vast and fast-growing amount of text data, traditional information retrieval techniques have become
inadequate: there are often too many documents containing useful information, yet only a small fraction of them is
relevant to a particular individual, and without knowing what could be in the documents, it is difficult to formulate
correct or smart queries.
^1 It is important to point out that [ORS98] concentrates on the elimination of candidate itemsets for the association rule mining
algorithm, although the cycle-elimination strategy does lead to a small reduction in the number of patterns when we process the time
series from left to right.
^2 Note that a modified strategy, where we stop considering certain patterns as soon as the length of the remaining time series is
not enough to make the confidence higher than the threshold, can be used.
[Figure 9.5: Relationship between the set of relevant documents and the set of retrieved documents. The Venn diagram shows the relevant documents and the retrieved documents as overlapping subsets of all documents; their intersection contains the documents that are both relevant and retrieved.]
Also, with a large number of documents, people may wish to compare
different documents, rank the importance and relevance of documents, or find patterns and trends across multiple
documents. Furthermore, the Internet can be viewed as a huge, interconnected, dynamic text database. With the advent
and fast-growing popularity of the Internet, text mining has become an increasingly popular and essential theme in
data mining.
9.4.1 Text data analysis and information retrieval
Information retrieval (IR) is a field that has developed in parallel with database systems for many years.
Unlike database systems, which focus on the query and transaction processing of structured data,
information retrieval focuses on the organization and retrieval of information in large numbers of text-based
documents. A typical information retrieval problem is to locate relevant documents based on user input,
such as keywords or example documents; typical information retrieval systems include online library catalog
systems and online document management systems.
Since information retrieval and database systems handle different kinds of data, some database
system problems are usually not present in information retrieval systems, such as concurrency control, recovery,
transaction management, and update. Likewise, some common information retrieval problems are usually
not encountered in traditional database systems, such as unstructured documents, approximate search based on
keywords, and the notion of relevance.
To analyze a text database, the following simple model can be adopted: a document is represented by a string,
which can be identified by a set of keywords. Such a simple keyword-based information retrieval model encounters
two major difficulties. The first is the synonymy problem: a keyword, such as software product, may not appear
anywhere in a document even though the document is closely related to software products. The second is the
polysemy problem: the same keyword, such as mining, may mean different things in different contexts.
Basic measures for text retrieval
There are two basic measures for content-based text retrieval. One is precision, which is the percentage of retrieved
documents that are in fact relevant to the query. The other is recall, which is the percentage of the documents
that are relevant to the query (i.e., that are in the database and should be retrieved) and were in fact retrieved. Let
the set of documents relevant to a query be {Relevant}, and the set of documents retrieved
be {Retrieved}. The set of documents that are both relevant and retrieved is {Relevant} ∩ {Retrieved}, as
shown in the Venn diagram of Figure 9.5.
The two measures are defined formally as follows:

    recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|          (9.2)

    precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|      (9.3)
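In code, the two measures reduce to simple set arithmetic. A minimal sketch, assuming documents are represented by identifiers (the function name is our own):

    def precision_recall(relevant, retrieved):
        # Both arguments are sets of document identifiers.
        hit = relevant & retrieved               # relevant AND retrieved
        precision = len(hit) / len(retrieved)    # Equation (9.3)
        recall = len(hit) / len(relevant)        # Equation (9.2)
        return precision, recall

    # 2 of the 3 retrieved documents are relevant; 2 of 4 relevant found:
    print(precision_recall({1, 2, 3, 4}, {3, 4, 5}))   # (0.666..., 0.5)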
Keyword-based and similarity-based retrieval
Most information retrieval systems support keyword-based and/or similarity-based retrieval. For keyword-based
retrieval, the user poses one keyword or an expression formed from a set of keywords, such as car and repair shops, tea
or coffee, or database systems but not Oracle. A good information retrieval system should consider synonyms when
answering such queries. For example, when the query contains car, the system should consider including its synonyms,
automobile and vehicle, in the search as well. Similarity-based retrieval finds similar documents based on a set
of common keywords. The answer should be based on the degree of relevance, where relevance is measured by
the nearness of the keywords, the relative frequency of the keywords, and so on.
How do such keyword-based and similarity-based information retrieval systems work?
A text retrieval system often associates a stop list with a set of documents; this is a set of words that are
deemed "irrelevant". For example, a, the, of, for, with, and so on are stop words, even though they may appear frequently.
Stop lists may vary from one document collection to another. For example, database systems could be an important
keyword in a newspaper, but it may be considered a stop word in a set of research papers presented at a database
systems conference.
A group of syntactically slightly different words may share the same word stem. A text retrieval system needs
to identify groups of words that are small syntactic variants of each other and collect only their common word
stem. For example, the words drug, drugged, and drugs share the common word stem, drug, and one may view
them as different occurrences of the same word.
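The following toy sketch combines stop-word filtering with a deliberately crude suffix stripper; a real system would use a full stemming algorithm, such as Porter's, and the stop list and suffix list here are illustrative only.

    STOP_WORDS = {"a", "the", "of", "for", "with", "and"}

    def crude_stem(word):
        # Strip a few common suffixes; only a rough stand-in for a real stemmer.
        for suffix in ("ged", "ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def preprocess(text):
        words = [w.lower() for w in text.split()]
        return [crude_stem(w) for w in words if w not in STOP_WORDS]

    print(preprocess("the drugs and the drugged drug"))
    # ['drug', 'drug', 'drug']: all three variants collapse to one stem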
Starting with a set of d documents and a set of t terms, we can model each document as a vector v in the
t-dimensional space R^t. The j-th coordinate of v is a number that measures the association of the j-th term
with the given document: it is generally defined as 0 if the document does not contain the term, and nonzero otherwise.
There are many ways to define the term weighting for the nonzero entries in such a vector. For example, one can
simply define v_j = 1 whenever the j-th term occurs in the document, or let v_j be the term frequency, i.e., the number of
occurrences of term t_j in the document, or the relative term frequency, i.e., the term frequency divided by the total
number of occurrences of all the terms in the document.
Example 9.4 Table 9.1 shows a term frequency matrix, in which each column represents a document vector, and
each entry frequency_matrix(i, j) registers the number of occurrences of term t_i in document d_j.
    term/document    d1    d2    d3    d4    d5    d6    d7
    t1              321    84    31    68    72    15   430
    t2              354    91    71    56    82     6   392
    t3               15    32   167    46   289   225    17
    t4               22   143    72   203    51    15    54
    t5               74    87    85    92    25    54   121

Table 9.1: A term-document frequency matrix
Since similar documents should have similar term frequencies, one may measure the similarity among a set of
documents, or between a document and a query (which is often a set of keywords), based on their relative term
occurrences in the frequency table.
There have been many metrics proposed for measuring the similarity of two documents. A representative metric
is the cosine measure, defined as follows. Let v1 and v2 be two document vectors. Their cosine similarity is defined
by Equation (9.4),

    sim(v1, v2) = (v1 · v2) / (|v1| |v2|)                        (9.4)

where the inner product v1 · v2 is the standard vector dot product, defined as Σ_{i=1}^{t} v_{1i} v_{2i}, and the norm
|v1| in the denominator is defined as |v1| = √(v1 · v1).
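The measure is straightforward to compute. A short sketch, applied to two document vectors from Table 9.1 (the function name is our own):

    import math

    def cosine(v1, v2):
        # Equation (9.4): dot product over the product of the norms.
        dot = sum(x * y for x, y in zip(v1, v2))
        norm1 = math.sqrt(sum(x * x for x in v1))
        norm2 = math.sqrt(sum(x * x for x in v2))
        return dot / (norm1 * norm2)

    d1 = [321, 354, 15, 22, 74]       # column d1 of Table 9.1
    d7 = [430, 392, 17, 54, 121]      # column d7 of Table 9.1
    print(cosine(d1, d7))  # close to 1: similar relative term frequencies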
With such numerical similarity metrics defined on documents, one can construct similarity-based indices on
them. Text-based queries can then be represented as vectors, which can be used to search for their nearest
neighbors in a collection of documents. However, for any nontrivial document database, the number of documents
|D| and the number of terms |T| are usually quite large. Such high dimensionality not only makes efficient
computation difficult, since the resulting frequency table will have size |D| × |T|, but also leads to very
sparse vectors and increases the difficulty of detecting and exploiting relationships among terms (e.g., synonymy). To
overcome this problem, a latent semantic indexing method has been developed, which effectively reduces the size of
the frequency table to be analyzed.
Latent semantic indexing
The latent semantic indexing method uses singular value decomposition (SVD), a well-known technique
in matrix theory, to reduce the size of the term frequency table, retaining only the K most significant rows of the
frequency table, where K is usually taken to be around a few hundred (e.g., 200) for large document collections.
Notice that such a reduction, which takes a |D| × |T| matrix as input and represents it by a much smaller K × K matrix,
leads to some information loss, so we must ensure that only the least significant parts of the frequency
table are discarded.
Such a method for matrix transformation and SVD construction has been worked out successfully. The detailed
method is rather sophisticated and beyond the scope of this chapter; however, well-known SVD algorithms are
freely available through packages such as MATLAB and LAPACK.
In general, the latent semantic indexing method consists of the following basic steps.
1. Create a term frequency matrix, frequency_matrix.
2. Compute the singular value decomposition of frequency_matrix by splitting it into three smaller matrices, U, S, V, where U and V are orthogonal matrices (i.e., U^T U = I) and S is a singular (i.e., diagonal) matrix.
3. For each document d, identify its reduced vector, which consists of the entries of the frequency matrix whose corresponding rows have not been eliminated from the singular matrix S.
4. Store the set of all such vectors, and create indices for them using advanced multidimensional indexing techniques.
With singular value decomposition and multidimensional indexing, the transformed document vectors can be used to
compare the similarity of two documents or to find the top n matches for a query.
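As an illustration of steps 2 and 3, the sketch below uses NumPy's SVD routine to reduce a terms-by-documents frequency matrix (rows are terms, columns are documents, as in Table 9.1) to K-dimensional document vectors; it is a minimal rendering of the reduction under these assumptions, not the full indexing method.

    import numpy as np

    def lsi_document_vectors(freq_matrix, k):
        # Split the frequency matrix into U, S, V^T and keep only the
        # k largest singular values (the most significant rows).
        U, s, Vt = np.linalg.svd(freq_matrix, full_matrices=False)
        # Each column is one document, now represented in k dimensions.
        return np.diag(s[:k]) @ Vt[:k, :]

    A = np.array([[321.0, 84, 31], [354, 91, 71], [15, 32, 167]])
    vectors = lsi_document_vectors(A, 2)
    print(vectors.shape)   # (2, 3): three documents, two dimensions each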
Other text retrieval indexing techniques
Several other text retrieval indexing techniques are also widely adopted, including inverted indices and
signature files.
An inverted index is an index structure widely used in industry for indexing text documents. It maintains two
hash-indexed or B+-tree indexed tables: a document table and a term table. The former (the document table) consists of a
set of document records, each containing two fields: doc_id and posting_list, where the posting list is a list of terms (or
pointers to terms) that occur in the document, sorted according to some relevance measure. The latter (the term table)
consists of a set of term records, each containing two fields: term_id and posting_list, where the posting list specifies a
list of identifiers of the documents in which the term appears. With such an organization, it is easy to answer queries like
"find all the documents associated with a given set of terms" or "find all the terms associated with a given set of documents". For
example, to find all the documents associated with a set of terms, one can first find in the term table a list of document
identifiers for each term, and then intersect these lists to obtain the set of relevant documents, as sketched below. Inverted indices
are easy to implement, but they do not handle polysemy and synonymy well, and the posting lists
can become rather long, making the storage requirement quite large.
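A minimal sketch of the term table and the intersection query just described, assuming a dictionary of sets in place of hash or B+-tree structures and ignoring relevance ordering:

    from collections import defaultdict

    def build_term_table(docs):
        # docs maps doc_id -> list of (preprocessed) terms; the result
        # maps each term to its posting list of document identifiers.
        term_table = defaultdict(set)
        for doc_id, terms in docs.items():
            for term in terms:
                term_table[term].add(doc_id)
        return term_table

    def docs_with_all_terms(term_table, query_terms):
        # Fetch one posting list per query term, then intersect them.
        postings = [term_table.get(t, set()) for t in query_terms]
        return set.intersection(*postings) if postings else set()

    table = build_term_table({1: ["data", "mining"], 2: ["data", "warehouse"]})
    print(docs_with_all_terms(table, ["data", "mining"]))   # {1}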
A signature file is a file that stores a signature record for each document in the database. Each signature
has a fixed size of b bits. A simple encoding scheme works as follows: every bit of a document's signature is initialized to 0,
and a bit is set to 1 if the corresponding term appears in the document. A signature S1 matches another signature S2 if
every bit set in S2 is also set in S1. Since there are usually more terms than available bits,
multiple terms are mapped to the same bit. Such many-to-one mapping makes search rather expensive, since
a document whose signature matches the signature of a query does not necessarily contain the query's set of
keywords: the document has to be retrieved, parsed, stemmed, and checked. The scheme can be improved with a good
signature encoding, by first performing a frequency analysis, stemming, and stop-word filtering, and then
using hashing and superimposed coding techniques to encode the list of terms into a bit representation.
Nevertheless, the many-to-one mapping problem still exists, which is the major disadvantage of the approach.
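A toy sketch of signature construction and matching, assuming each term is hashed to one bit position of a b-bit word (superimposed coding; the hashing choice is illustrative only):

    def make_signature(terms, b=64):
        # Superimpose the terms' bit positions; several terms may be
        # mapped onto the same bit.
        sig = 0
        for term in terms:
            sig |= 1 << (hash(term) % b)
        return sig

    def may_contain(doc_sig, query_sig):
        # Every bit set in the query must also be set in the document.
        # Because of the many-to-one bit mapping this can give false
        # positives, so matching documents must still be fetched,
        # parsed, stemmed, and checked.
        return doc_sig & query_sig == query_sig

    doc = make_signature(["data", "mining", "warehouse"])
    print(may_contain(doc, make_signature(["data", "mining"])))   # True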
9.4.2 Text mining: keyword-based association and document classification
Keyword-based association analysis
Text data consists of structured, semi-structured, or unstructured text. Term-level text mining extracts associations among words or domain terms rather than among manually tagged concepts. The association generation process detects either compounds, i.e., domain-dependent terms such as [wall, street] or [treasury, secretary, james, baker], or uninterpretable associations such as [dollars, shares, exchange, total, commission, stake, securities].
Term-level text mining attempts to combine the advantages of two extremes. On the one hand, there is no need for human effort in tagging documents, and we do not lose most of the information present in the document, as happens in the tagged-documents approach. On the other hand, the number of meaningless results is greatly reduced, and the execution time of the mining algorithms is also reduced.
Document classification analysis
9.5 Mining the World-Wide-Web
With the fast advances in computer, network, satellite, and information technologies, the World-Wide-Web (or simply,
the Web) has become increasingly popular and important in today's society. With a vast amount of information
available on the Internet, and many on-line information services flourishing around it, the Web serves as a
huge, widely distributed, global information service center for news, advertisements, consumer information, financial
management, education, government, e-commerce, and many other services. Notably, the Web contains not only
a huge collection of documents but also a rich and dynamic collection of hyperlink information and access and usage
information. Such a great wealth of information provides rich sources for data mining.
However, the Web also poses great challenges for effective resource and knowledge discovery, based on the following
observations.
1. The Web seems too huge for effective data warehousing and data mining. The size of the Web is on the
order of hundreds of terabytes and is still growing rapidly. Many organizations and societies place most of their
publicly accessible information on the Web. It is impossible to set up a data warehouse to replicate, store, or
integrate all the data on the Web.
2. The complexity of Web pages is far greater than that of any traditional text document collection. Web
pages lack a unifying structure, and they contain far more variation in authoring style and content than
books and other traditional text-based documents. The Web can be considered a huge digital library; however,
the tremendous number of documents in this library are not arranged in any particular order.
There is no category index, nor title, author list, cover page, or table of contents. It can be a real challenge
to search for the information you want in such a library!
3. The Web is a highly dynamic information source. Not only does the Web grow at a rapid pace, but its
information is also updated constantly. News, stock market, company advertisement, and Web service pages
are updated regularly, and linkage information and access records are updated frequently as well.
4. The Web serves a broad diversity of user communities. The Internet currently connects about 50 million
workstations, and the user community is still expanding rapidly. Users may have very different backgrounds, interests,
and purposes of usage. Most users may not have good knowledge of the structure of the information
network, may not be aware of the heavy cost of a particular search, may easily get lost groping in the
"darkness" of the network, and may easily get bored taking many hops and waiting impatiently for a piece
of information.
5. Only a small portion of the information on the Web is truly relevant or useful. It is said that 99% of the Web
is useless to 99% of its users. Although this may not be obvious to everyone, it is true that a particular
person is interested in only a tiny portion of the Web, while the rest of the Web contains junk and undesirable
material that may swamp desired search results. How can we find the portion of the Web that is truly relevant
to our interests? How can we search for high-quality Web pages on a given topic?
These challenges have prompted flourishing research into the efficient and effective discovery and use of resources
on the Internet.
There are many index-based Web search engines, which crawl the Web, index Web pages, build and store
huge keyword-based indices, and help users locate sets of Web pages containing given keywords.
With such search engines, an experienced user may be able to quickly locate documents by providing a set of tightly
constrained keywords and phrases. However, current keyword-based search engines suffer from several deficiencies.
First, a topic of any breadth may easily contain hundreds of thousands of documents, which may lead to a huge
number of document entries returned by a search engine, many of which are only marginally relevant to the
topic or of very poor quality. Second, many documents that are highly relevant to a topic may not
contain the exact keywords. For example, using the keyword "data mining", one may find many Web pages related to
the "mining industry" but few papers related to knowledge discovery, statistical analysis, or machine learning,
although those topics are highly related to data mining. As another example, a search based on the keyword "search
engine" may not even find the most popular Web search engines, such as Yahoo!, AltaVista, or America Online, since
they barely describe themselves as search engines on their Web pages.
This indicates that current Web search engines are not sufficient for Web resource discovery, not to mention the more
challenging task of Web knowledge discovery, which is to find Web access patterns, Web structures, and the regularity
and dynamics of Web contents. Web mining aims to accomplish these tasks, helping people discover the structures
and dynamics of the WWW and find interesting, high-quality information among the oceans of Web pages.
In general, Web mining tasks can be classified into three categories: Web content mining, Web structure mining, and
Web usage mining. Alternatively, one may treat Web structure as a part of Web content, in which case Web mining can
simply be classified into two categories: Web content mining and Web usage mining.
In the following subsections, we discuss several important issues related to Web mining: mining the Web's link
structures, building a multi-layered Web information base, and Web log mining.
9.5.1 Mining the Web's link structures to identify authoritative Web pages
As discussed above, for any broad Web search topic, a current Web search engine often returns a large number of
Web pages. Many of these pages, though relevant, may be of rather low quality. Thus, besides the notion of relevance,
it is highly desirable to introduce a notion of authority into Web topic-oriented search. That is, the search task is
not only to locate a set of relevant pages, but also to identify the relevant pages of high quality.
How can we automatically identify authoritative Web pages on a topic?
Interestingly, the secret of authority hides in the Web's page linkages. The Web consists not only of pages but
also of hyperlinks pointing from one page to another. This hyperlink structure contains an enormous amount of latent
human annotation that can help automatically infer the notion of authority. When the author of one Web page creates
a hyperlink pointing to another page, this generally represents the author's endorsement of that page. The collective
endorsement of a page by different authors on the Web may indicate its importance, and may
naturally lead to the discovery of authoritative Web pages. The tremendous amount of Web linkage
information thus provides rich information about the relevance, quality, and structure of the Web's contents, and
hence a rich source for Web mining.
This idea has motivated some interesting studies on mining authoritative pages on the Web. However, unlike
journal citations, the Web's linkage structure has some unique features. First, not every hyperlink represents the
endorsement we seek: some links are created for other purposes, such as navigation or paid advertisement.
Still, if the majority of hyperlinks are for endorsement, the collective judgment will dominate. Second,
for commercial or competitive reasons, one authority seldom links to rival authorities in the
same field. For example, Coca-Cola may prefer not to endorse its competitor Pepsi by linking to Pepsi's Web pages,
and similarly for Honda and Toyota. Third, authoritative pages are seldom particularly self-descriptive. For example,
the main Web page of Yahoo! may not contain an explicit self-description such as "Web search engine".
These properties of Web link structures have led researchers to consider another important category of Web pages: hubs.
A hub is a Web page (or set of Web pages) that provides collections of links to authorities. Hub pages may not
be prominent themselves, and few links may point to them; however, they provide links to a
collection of prominent sites on a common topic. Such pages can be lists of recommended links on individual home
pages, such as recommended reference sites on a course home page, or professionally assembled resource lists on
commercial sites. Hubs implicitly confer authority on pages covering a focused topic. In general, a good hub
is a page that points to many good authorities, and a good authority is a page pointed to by many good hubs.
This mutual reinforcement relationship between hubs and authorities supports the mining of authoritative Web pages and
the automated discovery of high-quality Web structures and resources.
Based on these ideas, an interesting algorithm, called HITS (for Hyperlink-Induced Topic Search), has been developed.
It proceeds as follows.
First, HITS uses the query terms to collect a starting set of, say, 200 pages from an index-based search engine;
these pages form the root set. Since many of these pages are presumably relevant to the search topic, some of them should contain links to most of
the prominent authorities. Therefore, the root set can be expanded into a base set by including all the pages that
the root-set pages link to, and all the pages that link to a page in the root set, up to a designated size cutoff, such
as 1,000 to 5,000 pages.
Second, a weight-propagation phase is initiated. This is an iterative process that determines numerical estimates of
hub and authority weights. Since links between two pages within the same Web domain (i.e., sharing the
same first level in their URLs) often serve a navigational function and thus do not confer authority, such links are
excluded from the weight-propagation analysis.
We first associate a nonnegative authority weight a_p and a nonnegative hub weight h_p with each page p in the
base set, and initialize all a and h values to a uniform constant. The weights are normalized so that an invariant
is maintained: the squares of all the weights sum to 1. The authority and hub weights are updated based on Equations (9.5)
and (9.6):

    a_p = Σ_{q: q→p} h_q                    (9.5)

    h_p = Σ_{q: p→q} a_q                    (9.6)
Equation (9.5) implies that if a page is pointed to by many good hubs, its authority weight should increase (it is the
sum of the current hub weights of all the pages pointing to it). Equation (9.6) implies that if a page points to
many good authorities, its hub weight should increase (it is the sum of the current authority weights of all the pages
it points to).
These equations can be written in matrix form as follows. Let us number the pages {1, 2, ..., n} and define
their adjacency matrix A to be the n × n matrix in which A(i, j) is 1 if page i links to page j, and 0 otherwise. Similarly,
we define the authority weight vector a = (a_1, a_2, ..., a_n) and the hub weight vector h = (h_1, h_2, ..., h_n). Thus, we
have

    h = A a                                  (9.7)

    a = A^T h                                (9.8)

Unfolding these two equations k times, we have

    h = A a = (A A^T) h = (A A^T)^2 h = ... = (A A^T)^k h        (9.9)

    a = A^T h = (A^T A) a = (A^T A)^2 a = ... = (A^T A)^k a      (9.10)
According to linear algebra, these two sequences of iterations, when normalized, converge to the principal
eigenvectors of A A^T and A^T A, respectively. This also shows that the authority and hub weights are intrinsic
features of the collection of linked pages, not influenced by the initial weight settings.
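The iteration is easy to express with matrix operations. A minimal sketch in NumPy, using a fixed number of iterations in place of a convergence test (the four-page adjacency matrix below is a made-up example):

    import numpy as np

    def hits(A, iterations=50):
        # A[i, j] = 1 iff page i links to page j.
        n = A.shape[0]
        a = np.ones(n) / np.sqrt(n)   # authority weights
        h = np.ones(n) / np.sqrt(n)   # hub weights
        for _ in range(iterations):
            a = A.T @ h                # Equation (9.5)
            a /= np.linalg.norm(a)     # keep the sum of squares equal to 1
            h = A @ a                  # Equation (9.6)
            h /= np.linalg.norm(h)
        return a, h

    # Pages 0 and 3 each link to pages 1 and 2.
    A = np.array([[0, 1, 1, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 1, 1, 0]], dtype=float)
    a, h = hits(A)   # pages 1 and 2 emerge as authorities, 0 and 3 as hubs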
Finally, the HITS algorithm outputs a short list of the pages with the largest hub weights and the pages with the largest
authority weights for the given search topic. Many experiments have shown that HITS provides surprisingly good
search results for a wide range of queries.
Although relying extensively on links leads to encouraging results, ignoring textual context can cause
difficulties. For example, HITS sometimes drifts when hubs contain multiple topics. It may also suffer from "topic hijacking"
when many pages from a single Web site point to the same popular page, giving that site too large a share
of the authority weight. Such problems can be overcome by replacing the sums of Equations (9.5) and (9.6) with
weighted sums, scaling down the weights of multiple links from within the same site, using anchor text (the text
surrounding hyperlink definitions in Web pages) to adjust the weight of the links along which authority is propagated,
breaking large hub pages into smaller units, and so on.
By analyzing both Web links and textual context information, it has been reported that systems based on the HITS
algorithm, such as Clever, and systems based on similar principles, such as Google, can achieve better-quality search
results than those generated by term-index engines such as AltaVista and those created by human ontologists, as in
Yahoo!.
9.5.2 Automatic classification of Web documents
9.5.3 Construction of multi-layered Web information-base
9.5.4 Web usage mining
Web usage records are registered as Web logs in Web servers' log files. The behavior of Web page readers is imprinted
in these log files. Analyzing and exploring the regularities in this behavior can
improve system performance, enhance the quality and delivery of Internet information services to the end user, and
identify populations of potential customers for electronic commerce.
Web servers register a (Web) log entry for every access they receive. The server usually saves the URL requested,
the IP address from which the request originated, and a timestamp. For Web-based e-commerce servers, huge
numbers of Web access log records are collected; a popular Web site can see its Web log grow by hundreds
of megabytes every day. Condensing these colossal files of raw Web log data in order to retrieve significant and useful
information is a nontrivial task, as sketched below. Because it is not easy to perform systematic analysis on such huge amounts of data,
most institutions have not been able to make effective use of Web access history for server performance
enhancement, system design improvement, or customer targeting in electronic commerce. However, many people
have realized the potential value of such data.
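As a first condensation step, a log analyzer parses the raw entries and aggregates them. A small sketch, assuming logs in the Common Log Format (the regular expression and function names are our own illustration):

    import re
    from collections import Counter

    # Common Log Format:
    # host ident user [timestamp] "METHOD url HTTP/x" status bytes
    LOG_ENTRY = re.compile(
        r'(\S+) \S+ \S+ \[(.*?)\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+')

    def page_popularity(log_lines):
        # Count successfully served URLs, one log entry per line.
        hits = Counter()
        for line in log_lines:
            m = LOG_ENTRY.match(line)
            if m and m.group(4) == "200":
                hits[m.group(3)] += 1
        return hits.most_common()

    log = ['127.0.0.1 - - [10/Oct/1999:13:55:36 -0700] '
           '"GET /index.html HTTP/1.0" 200 2326']
    print(page_popularity(log))   # [('/index.html', 1)]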
Using Web log files, studies have been conducted on analyzing system performance, improving system design,
understanding the nature of Web traffic, and understanding user reaction and motivation [FdG97, GC97, Sul97, TG97].
One innovative study has proposed adaptive sites: Web sites that improve themselves by learning from user access
patterns [PE97]. While it is encouraging and exciting to see the various potential applications of Web log file analysis,
it is important to know that the success of such applications depends on what, and how much, valid and reliable
knowledge can be discovered from large amounts of raw log data.
9.6 Summary
Work it out!
Exercises
1. Work it out!
Bibliographic Notes
Mining complex types of data has been a popular research topic, with many research papers and tutorials appearing
in conferences and journals on data mining and database systems.
Multidimensional generalization and mining of complex types of data in object-oriented and object-relational
databases by the construction of object cubes was proposed by Han, Nishio, et al. [HNKW98]. A method for constructing
multiple-layered databases by generalization-based data mining techniques for handling semantic heterogeneity was
proposed by Han, Ng, et al. [HNFD98].
Lu, Han, and Ooi [LHO93] proposed a generalization-based spatial data mining method based on attribute-oriented
induction. Koperski and Han [KH95] proposed a progressive refinement method for mining spatial association
rules. Knorr and Ng [KN96] presented a method for mining aggregate proximity relationships and commonalities in
spatial databases. Spatial classification and trend analysis methods have been developed by Ester et al. [EKSX97].
Spatial clustering methods have been a focused topic in recent data mining research, with quite a few interesting
methods introduced, including distance-based methods, such as [NH94, EKX95, Hua98], hierarchical methods, such
as [ZRL96, GRS98, GRS99, KHK99], density-based methods, such as [EKSX96, ABKS99], and grid-based methods,
such as [WYM97, AGGR98, SCZ98]. Knorr and Ng [KN98] introduced the notion of distance-based outliers and
developed several algorithms for their efficient mining. For surveys of spatial data mining methods, one may refer to
Koperski, Adhikary, and Han [KAH96] and Ester, Kriegel, and Sander [EKS97]. A spatial data mining system
prototype, GeoMiner, was developed by Han, Koperski, and Stefanovic [HKS97].
For the analysis of raster or image data, Fayyad and Smyth [FS93] developed a classification method to analyze high-resolution
radar images for the identification of volcanoes on Venus. Fayyad et al. [FDW96] applied decision tree methods
to the classification of galaxies, stars, and other stellar objects in the Palomar Observatory Sky Survey (POSS-II)
project. Stolorz and Dean's QuakeFinder (KDD'96) is a data mining system for detecting earthquakes from remotely sensed imagery.
Agrawal and Srikant [AS95] developed an Apriori-like technique for mining sequential patterns. Mannila et
al. [MTV95] consider frequent episodes in sequences, where episodes are essentially acyclic graphs of events whose
edges specify the temporal before-and-after relationship but carry no timing-interval restrictions. Lu, Han, and Feng
[LHF98] proposed inter-transaction association rules, which are implication rules whose two sides are totally ordered
episodes with timing-interval restrictions (on the events in the episodes and on the two sides). Bettini et al. [BWJ98]
consider a generalization of inter-transaction association rules. Mining partial periodicity was studied by Han, Dong, and Yin
[HDY99]. Özden et al. [ORS98] studied methods for mining cyclic association rules. Sequence pattern mining for
plan failures was proposed by Zaki, Lesh, and Ogihara [ZLO98]. Plan mining by divide-and-conquer was studied by Han, Yang, and Kim [HYK99].
Information retrieval and text analysis methods have been introduced in many textbooks and surveys, including
Salton and McGill [SM83], Salton [Sal89], Yu and Meng [YM97], Raghavan [Rag97], Subrahmanian [Sub98], and Kleinberg
and Tomkins [KT99]. The latent semantic indexing method for document similarity analysis was developed by
Deerwester et al. [DDF+90]. Feldman and Hirsh [FH98] studied methods for mining association rules in text databases.
The theory and practice of multimedia database systems have been introduced in many textbooks and surveys,
including Subrahmanian [Sub98] and Yu and Meng [YM97]. A multimedia data mining system prototype, MultiMediaMiner, was developed by Zaïane et al. [ZHL+98].
Mining the Web's link structures to recognize authoritative Web pages was studied by Chakrabarti et al. [CDK+99]
and Kleinberg and Tomkins [KT99]. A Web mining language, WebML, was proposed by Zaïane and Han [ZH98].
A multi-layered database approach for constructing a Web warehouse was studied by Zaïane and Han [ZH95]. Web log
mining was studied by Zaïane, Xin, and Han [ZXH98].
Bibliography

[ABKS99] M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data, pages 49-60, Philadelphia, PA, June 1999.

[AGGR98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 94-105, Seattle, Washington, June 1998.

[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487-499, Santiago, Chile, September 1994.

[AS95] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering, pages 3-14, Taipei, Taiwan, March 1995.

[BWJ98] C. Bettini, X. Sean Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38, 1998.

[CDK+99] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg. Mining the Web's link structure. COMPUTER, 32:60-67, 1999.

[DDF+90] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990.

[EKS97] M. Ester, H.-P. Kriegel, and J. Sander. Spatial data mining: A database approach. In Proc. Int. Symp. Large Spatial Databases (SSD'97), pages 47-66, Berlin, Germany, July 1997.

[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. In Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining (KDD'96), pages 226-231, Portland, Oregon, August 1996.

[EKSX97] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. Density-connected sets and their application for trend detection in spatial databases. In Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (KDD'97), pages 10-15, Newport Beach, California, August 1997.

[EKX95] M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. In Proc. 4th Int. Symp. Large Spatial Databases (SSD'95), pages 67-82, Portland, Maine, August 1995.

[FdG97] R. Fuller and J. de Graaff. Measuring user motivation from server log files. http://www.microsoft.com/usability/webconf/fuller/fuller.htm, 1997.

[FDW96] U. M. Fayyad, S. G. Djorgovski, and N. Weir. Automating the analysis and cataloging of sky surveys. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 471-493. AAAI/MIT Press, 1996.

[FH98] R. Feldman and H. Hirsh. Finding associations in collections of text. In R. S. Michalski, I. Bratko, and M. Kubat, editors, Machine Learning and Data Mining: Methods and Applications, pages 223-240. John Wiley & Sons, 1998.

[FS93] U. Fayyad and P. Smyth. Image database exploration: Progress and challenges. In Proc. Knowledge Discovery in Databases Workshop, pages 14-27, Washington, D.C., 1993.

[GC97] J. Graham-Cumming. Hits and miss-es: A year watching the Web. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.

[GRS98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 73-84, Seattle, Washington, June 1998.

[GRS99] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proc. 1999 Int. Conf. Data Engineering, pages 512-521, Sydney, Australia, March 1999.

[HCC93] J. Han, Y. Cai, and N. Cercone. Data-driven discovery of quantitative rules in relational databases. IEEE Trans. Knowledge and Data Engineering, 5:29-40, 1993.

[HDY99] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proc. 1999 Int. Conf. Data Engineering (ICDE'99), pages 106-115, Sydney, Australia, April 1999.

[HF96] J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in data mining. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 399-421. AAAI/MIT Press, 1996.

[HGY98] J. Han, W. Gong, and Y. Yin. Mining segment-wise periodic patterns in time-related databases. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 214-218, New York City, NY, August 1998.

[HKS97] J. Han, K. Koperski, and N. Stefanovic. GeoMiner: A system prototype for spatial data mining. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 553-556, Tucson, Arizona, May 1997.

[HNFD98] J. Han, R. T. Ng, Y. Fu, and S. Dao. Dealing with semantic heterogeneity by generalization-based data mining techniques. In M. P. Papazoglou and G. Schlageter, editors, Cooperative Information Systems: Current Trends & Directions, pages 207-231. Academic Press, 1998.

[HNKW98] J. Han, S. Nishio, H. Kawano, and W. Wang. Generalization-based data mining in object-oriented databases using an object-cube model. Data and Knowledge Engineering, 25:55-97, 1998.

[HSK98] J. Han, N. Stefanovic, and K. Koperski. Selective materialization: An efficient method for spatial data cube construction. In Proc. 1998 Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD'98) [Lecture Notes in Artificial Intelligence, 1394, Springer Verlag, 1998], Melbourne, Australia, April 1998.

[Hua98] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2:283-304, 1998.

[HYK99] J. Han, Q. Yang, and E. Kim. Plan mining by divide-and-conquer. In Proc. 1999 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'99), pages 8:1-8:6, Philadelphia, PA, May 1999.

[KAH96] K. Koperski, J. Adhikary, and J. Han. Knowledge discovery in spatial databases: Progress and challenges. In Proc. 1996 SIGMOD Workshop Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pages 55-70, Montreal, Canada, June 1996.

[KH95] K. Koperski and J. Han. Discovery of spatial association rules in geographic information databases. In Proc. 4th Int. Symp. Large Spatial Databases (SSD'95), pages 47-66, Portland, Maine, August 1995.

[KHK99] G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32:68-75, 1999.

[KN96] E. Knorr and R. Ng. Finding aggregate proximity relationships and commonalities in spatial data mining. IEEE Trans. Knowledge and Data Engineering, 8:884-897, December 1996.

[KN98] E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 392-403, New York, NY, August 1998.

[KT99] J. Kleinberg and A. Tomkins. Application of linear algebra in information retrieval and hypertext analysis. In Proc. 18th ACM Symp. Principles of Database Systems (PODS), pages 185-193, Philadelphia, PA, May 1999.

[LHF98] H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. In Proc. 1998 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), pages 12:1-12:7, Seattle, Washington, June 1998.

[LHO93] W. Lu, J. Han, and B. C. Ooi. Knowledge discovery in large spatial databases. In Proc. Far East Workshop Geographic Information Systems, pages 275-289, Singapore, June 1993.

[MTV95] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining, pages 210-215, Montreal, Canada, August 1995.

[NH94] R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 144-155, Santiago, Chile, September 1994.

[ORS98] B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In Proc. 1998 Int. Conf. Data Engineering (ICDE'98), pages 412-421, Orlando, FL, February 1998.

[PE97] M. Perkowitz and O. Etzioni. Adaptive sites: Automatically learning from user access patterns. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.

[Rag97] P. Raghavan. Information retrieval algorithms: A survey. In Proc. 1997 ACM-SIAM Symp. Discrete Algorithms, 1997.

[Sal89] G. Salton. Automatic Text Processing. Addison-Wesley, 1989.

[SCZ98] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 428-439, New York, NY, August 1998.

[SM83] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[Sub98] V. S. Subrahmanian. Principles of Multimedia Database Systems. Morgan Kaufmann, 1998.

[Sul97] T. Sullivan. Reading reader reaction: A proposal for inferential analysis of Web server log files. In Proc. 3rd Conf. Human Factors & the Web, Denver, Colorado, June 1997.

[TG97] L. Tauscher and S. Greenberg. How people revisit Web pages: Empirical findings and implications for the design of history systems. International Journal of Human Computer Studies, Special issue on World Wide Web Usability, 47:97-138, 1997.

[WYM97] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proc. 1997 Int. Conf. Very Large Data Bases, pages 186-195, Athens, Greece, August 1997.

[YM97] C. T. Yu and W. Meng. Principles of Database Query Processing for Advanced Applications. Morgan Kaufmann, 1997.

[ZH95] O. R. Zaïane and J. Han. Resource and knowledge discovery in global information systems: A preliminary design and experiment. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 331-336, Montreal, Canada, August 1995.

[ZH98] O. R. Zaïane and J. Han. WebML: Querying the World-Wide Web for resources and knowledge. In Proc. Int. Workshop on Web Information and Data Management (WIDM'98), pages 9-12, Bethesda, Maryland, November 1998.

[ZHL+98] O. R. Zaïane, J. Han, Z. N. Li, J. Y. Chiang, and S. Chee. MultiMediaMiner: A system prototype for multimedia data mining. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 581-583, Seattle, Washington, June 1998.

[ZLO98] M. J. Zaki, N. Lesh, and M. Ogihara. PLANMINE: Sequence mining for plan failures. In Proc. 4th Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 369-373, New York, NY, August 1998.

[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, pages 103-114, Montreal, Canada, June 1996.

[ZTH99] X. Zhou, D. Truffet, and J. Han. Efficient polygon amalgamation methods for spatial OLAP and spatial data mining. In Proc. 6th Int. Symp. Large Spatial Databases (SSD'99), pages 167-187, Hong Kong, July 1999.

[ZXH98] O. R. Zaïane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. In Proc. Advances in Digital Libraries Conf. (ADL'98), pages 19-29, Santa Barbara, CA, April 1998.