Contents
9 Mining Complex Types of Data
9.1 Generalization and Multidimensional Analysis of Complex Data Objects
9.1.1 Generalization on structured data
9.1.2 Aggregation and approximation in spatial and multimedia data generalization
9.1.3 Generalization of object identifiers and class/subclass hierarchies
9.1.4 Generalization on inherited and derived properties
9.1.5 Generalization on class composition hierarchies
9.1.6 Class-based generalization and mining object data cubes
9.2 Mining Spatial and Multimedia Databases
9.2.1 Spatial data cube construction and spatial OLAP
9.2.2 Spatial association analysis
9.2.3 Spatial clustering methods
9.2.4 Spatial classification and spatial trend analysis
9.2.5 Mining raster databases
9.2.6 From spatial data mining to multimedia data mining
9.3 Mining Time-Series Databases
9.3.1 Trend analysis
9.3.2 Similarity search in time-series analysis
9.3.3 Frequent pattern mining
9.3.4 Periodicity analysis
9.4 Mining Text Databases
9.4.1 Text data analysis and information retrieval
9.4.2 Text mining: keyword-based association and document classification
9.5 Mining the World-Wide-Web
9.5.1 Mining the Web's link structures to identify authoritative Web pages
9.5.2 Automatic classification of Web documents
9.5.3 Construction of a multi-layered Web information base
9.5.4 Web usage mining
9.6 Summary
Chapter 9
Mining Complex Types of Data
Our previous studies of data mining techniques have focused on mining relational databases, transactional databases, and data warehouses formed by the transformation and integration of structured data. With the rapid progress of database systems, data collection tools, and WWW technologies, vast amounts of data in various complex forms, structured and unstructured, hypertext and multimedia, have been pouring in and growing explosively. Therefore, an increasingly important task in data mining is to mine complex types of data, including complex objects, spatial data, time-series data, hypertext and multimedia data, and WWW data.
In this chapter, we examine how to further develop the essential data mining techniques, such as characterization, classification, association, and clustering, and how to develop new ones to cope with complex types of data and perform fruitful data mining in complex information repositories. In particular, section 1 is devoted to the generalization of complex data objects, section 2 is on spatial and multimedia data mining, section 3 is on time-series data mining, section 4 is on mining text databases, and section 5 is on mining the World-Wide-Web. Since mining such complex types of data is a fast expanding research frontier, our discussion covers only some preliminary issues. We expect that many dedicated books on mining particular kinds of data will be available in the future.
9.1 Generalization and Multidimensional Analysis of Complex Data Objects
A major limitation of many commercial data warehouse and OLAP tools for multidimensional database analysis is their restriction on the allowable data types for dimensions and measures. Most data cube implementations confine dimensions to nonnumeric data, and measures to simple, aggregated values. To introduce data mining and multidimensional data analysis for complex objects, one needs to examine how to perform generalization on complex structured objects and construct object cubes for OLAP and mining in object databases.
In this section, we examine how such generalization can be performed.
The storage and access of complex structured data have been studied in object-relational and object-oriented database systems. These systems organize a large set of complex data objects into classes, which are in turn organized into class/subclass hierarchies. Each object in a class is associated with (1) an object identifier, (2) a set of attributes that may contain sophisticated data structures, set- or list-valued data, class composition hierarchies, multimedia data, etc., and (3) a set of methods that specify the computational routines or rules associated with the object class.
To facilitate generalization and induction in such databases, it is important to examine how each kind of component in object-relational and object-oriented databases can be generalized, and how the generalized data can be used for multidimensional data analysis and data mining.
9.1.1 Generalization on structured data
An important feature of an object-relational or object-oriented database is its capability of storing, accessing, and modeling complex structure-valued data, such as set-valued and list-valued data and data with nested structures. Such data can be generalized in several ways in order to summarize them and extract interesting patterns.
A set-valued attribute may be of homogeneous or heterogeneous type. Typically, set-valued data can be generalized by (1) generalization of each value in the set into its corresponding higher-level concept, or (2) derivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, the weighted average for numerical data, etc. Moreover, generalization can be performed by applying different generalization operators to explore alternative generalization paths. In this case, the result of generalization is a heterogeneous set.
For example, the hobby of a person is a set-valued attribute containing a set of values, such as {tennis, hockey, chess, violin, nintendo games}, which can be generalized into a set of high-level concepts, such as {sports, music, video games}, or into 5 (the number of hobbies in the set), or both. Moreover, a count can be associated with each generalized value to indicate how many elements are generalized to that value, such as {sports(3), music(1), video games(1)}, where sports(3) indicates three kinds of sports.
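To make this concrete, the following minimal Python sketch generalizes a set-valued hobby attribute through a concept hierarchy and attaches counts; the hierarchy, values, and function name are illustrative assumptions rather than part of any particular system.

# A minimal sketch of set-valued generalization with counts; the concept
# hierarchy and the hobby values below are hypothetical.
from collections import Counter

hierarchy = {
    "tennis": "sports", "hockey": "sports", "chess": "sports",
    "violin": "music", "nintendo games": "video games",
}

def generalize_set(values):
    """Generalize each value in the set and attach occurrence counts."""
    counts = Counter(hierarchy.get(v, v) for v in values)
    return {f"{concept}({n})" for concept, n in counts.items()}

hobbies = {"tennis", "hockey", "chess", "violin", "nintendo games"}
print(generalize_set(hobbies))  # e.g. {'sports(3)', 'music(1)', 'video games(1)'}
print(len(hobbies))             # the alternative generalization: 5 hobbies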
A set-valued attribute may be generalized into a set-valued or a single-valued attribute; whereas a single-valued attribute may also be generalized into a set-valued one if the "hierarchy" is a lattice or the generalization follows different paths. Further generalizations on such a generalized set-valued attribute should follow the generalization path of each value in the set.
A list-valued or sequence-valued attribute can be generalized in a way similar to a set-valued attribute, except that the order of the elements in the sequence should be preserved in the generalization. Each value in the list can be generalized into its corresponding higher-level concept. Alternatively, a list can be generalized according to its general behavior, such as the length of the list, the type of its elements, the value range, the weighted average value for numerical data, or by dropping unimportant elements in the list. A list may be generalized into a list, a set, or a single value. For example, a sequence (list) of data for a person's education record, "((B.Sc. in Electrical Engineering, U.B.C., Dec., 1980), (M.Sc. in Computer Engineering, U. Maryland, May, 1983), (Ph.D. in Computer Science, UCLA, Aug., 1987))", can be generalized by dropping less important descriptions (subattributes) of each tuple in the list, such as "((B.Sc., U.B.C., 1980), ...)", or by retaining only the most important tuple(s) in the list, such as "(Ph.D. in Computer Science, UCLA, 1987)", or both.
Set- and list-valued attributes are simple structure-valued attributes. In general, a structure-valued attribute may contain sets, tuples, lists, trees, records, etc., and their combinations. Moreover, one structure can be nested in another at any level. Similar to the generalization of set- and list-valued attributes, a structure-valued attribute can be generalized in several ways, such as (1) generalizing each attribute in the structure while maintaining the shape of the structure, (2) flattening the structure and generalizing the flattened structure, (3) summarizing the low-level structures by high-level concepts or aggregation, and (4) returning the type or an overview of the structure.
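As a small illustration of option (2), the sketch below flattens a nested record into dotted attribute paths, after which each flattened attribute could be generalized independently; the record layout is hypothetical.

# A sketch of flattening a nested structure so that the flattened
# attributes can then be generalized one by one.
def flatten(struct, prefix=""):
    """Recursively flatten a nested dict into dotted attribute paths."""
    flat = {}
    for key, value in struct.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

record = {"name": "J. Smith",
          "address": {"city": "Vancouver", "province": "B.C."},
          "education": {"degree": "Ph.D.", "year": 1987}}
print(flatten(record))
# {'name': 'J. Smith', 'address.city': 'Vancouver', ..., 'education.year': 1987}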
9.1.2 Aggregation and approximation in spatial and multimedia data generalization
Besides generalized concept substitution and structured data summarization, aggregation and approximation should be considered important means of generalization. They are especially useful for generalizing attributes with large sets of values, complex structures, spatial or multimedia data, etc.
Take spatial data as an example. It is desirable to generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage. Such generalization often requires merging a set of geographic areas by spatial operations, such as spatial union or spatial clustering methods. Aggregation and approximation are important techniques in such generalization. In a spatial merge, it is necessary not only to merge the regions of similar types within the same general class, but also to compute the total areas, average density, or other aggregate functions, while ignoring some scattered regions of different types if they are unimportant to the study. For example, different pieces of land used for different agricultural purposes, such as vegetables, grains, and fruits, can be merged into one large piece of agricultural land by spatial merge. However, such a piece of agricultural land may contain highways, houses, small stores, etc. If the majority of the land is used for agriculture, the scattered spots for other purposes can be ignored, and the whole region can be claimed as an agricultural area by approximation.
The spatial operators, such as spatial-union, spatial-overlapping, and spatial-intersection, which may require merging scattered small regions into large, clustered regions, can use spatial aggregation and approximation as data generalization operators.
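The following is a brief sketch of such a spatial union, assuming the availability of the shapely geometry library; the parcel polygons and their land-use labels are invented for illustration.

# Merging adjacent agricultural parcels into one generalized region by
# spatial union, with total area as an aggregate measure.
from shapely.geometry import Polygon
from shapely.ops import unary_union

parcels = [Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),   # vegetables
           Polygon([(2, 0), (4, 0), (4, 2), (2, 2)]),   # grains
           Polygon([(0, 2), (2, 2), (2, 4), (0, 4)])]   # fruits

merged = unary_union(parcels)   # one generalized agricultural region
print(merged.area)              # aggregate measure: total area (12.0)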
A multimedia database may contain complex texts, graphics, images, video fragments, maps, voice, music, and
other forms of audio/video information. Such multimedia data are typically stored as sequences of bytes with variable
lengths, and segments of data are linked together or indexed in a multidimensional way for easy reference. Generalization on multimedia data can be performed by recognition and extraction of the essential features and/or general patterns of such data.
There are many ways to extract the essential features or general patterns from segments of multimedia data. For an image, the size, color, shape, texture, and orientation of the contained objects or the major regions in the image can be extracted by aggregation and/or approximation. For a segment of music, its melody can be summarized based on the approximate patterns that repeatedly occur in the segment, and its style can be summarized based on its tone, tempo, major musical instruments played, etc. For an article, its abstract or general organizational structure, such as the table of contents and the subject and index terms frequently occurring in the article, may serve as generalization results.
In general, it is a challenging task to generalize multimedia data and spatial data to extract interesting knowledge implicitly stored in the data. Technologies developed for multimedia and spatial databases, such as content-based image retrieval, multidimensional indexing methods, and spatial data access and analysis techniques, should be integrated with data generalization and data mining techniques to achieve satisfactory results. More techniques for mining such data will be discussed in the following sections.
9.1.3 Generalization of object identifiers and class/subclass hierarchies
An essential component of an object-oriented database is the object identifier, whose role is to uniquely identify objects; it remains unchanged after structural reorganization of the data. At first glance, it may seem impossible to generalize an object identifier. However, since objects in an object-oriented database are organized into classes, which in turn are organized into class/subclass hierarchies, the generalization of an object can be performed by referring to its associated hierarchy. Thus an object identifier can be generalized as follows. First, the object identifier is generalized to the identifier of the lowest subclass to which the object belongs. This subclass identifier can then in turn be generalized to a higher-level class/subclass identifier by climbing up the class/subclass hierarchy. Similarly, a class or a subclass can be generalized to its corresponding superclass(es) by climbing up its associated class/subclass hierarchy.
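A minimal sketch of this hierarchy-climbing generalization follows; the class/subclass hierarchy is invented for illustration.

# Generalizing an object identifier by climbing a class/subclass
# hierarchy; the hierarchy below is hypothetical.
superclass = {                      # child class -> parent class
    "graduate_student": "student",
    "student": "person",
    "person": None,
}

def generalize_oid(lowest_class, levels=1):
    """Replace an OID by a class identifier 'levels' steps up the hierarchy."""
    cls = lowest_class
    for _ in range(levels):
        if superclass.get(cls) is None:
            break
        cls = superclass[cls]
    return cls

print(generalize_oid("graduate_student", levels=2))   # -> 'person'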
9.1.4 Generalization on inherited and derived properties
Since object-oriented databases are organized into class/subclass hierarchies, some attributes or methods of an object class are not explicitly specified in the class itself but are inherited from its higher-level classes. Some object-oriented database systems allow properties to be inherited from more than one superclass (called multiple inheritance) when the class/subclass "hierarchy" is organized in the shape of a lattice. The inherited properties of an object can be derived by query processing in the object-oriented database. From the data generalization point of view, it is unnecessary to distinguish which data are stored within the class and which are inherited from its superclass. As long as the set of relevant data is collected by query processing, the data mining process will treat the inherited data in the same way as the data stored in the object class, and perform generalization accordingly.
Methods are another important component of object-oriented databases. Much behavioral data about objects can be derived by the application of methods. Since a method is usually defined by a computational procedure/function or by a set of deduction rules, it is impossible to perform generalization on the method itself. However, generalization can be performed on the data derived by method application. That is, one should derive the task-relevant set of data by applying the method and, possibly, also by data retrieval, and then perform generalization treating the derived data like stored data.
9.1.5 Generalization on class composition hierarchies
An attribute of an object may be composed of or described by another object, some of whose attributes may in turn be composed of or described by other objects, thus forming a class composition hierarchy. Generalization on a class composition hierarchy can be viewed as generalization on a set of nested structured data (possibly infinite, if the nesting is recursive).
In principle, the reference to a composite object may traverse a long sequence of references along the corresponding class composition hierarchy. However, in most cases, the longer the sequence of references traversed, the weaker the semantic linkage between the original object and the referenced composite object. For example, the attribute "vehicles owned" of an object class "student" could refer to another object class "car", which may contain an attribute "auto dealer", which may refer to its "manager" with an attribute "children". Obviously, one is unlikely to find any interesting general regularities between a student and his/her car dealer's manager's children. Therefore, generalization on a class of objects should be performed on its own descriptive attribute values and methods, with only limited reference to its closely related components via close linkages in the class composition hierarchy. That is, in order to discover interesting knowledge, generalization should be performed on the objects in the class composition hierarchy that are closely related in semantics to the currently focused class(es), but not on those that have only remote and rather weak semantic linkages.
9.1.6 Class-based generalization and mining object data cubes
The methods discussed above are object-based generalization techniques. In a large object database, however, data mining and multidimensional analysis do not work on individual objects but on classes of objects. Thus an important question is how to perform class-based generalization for a large set of objects.
Since a set of objects in a class may share many attributes and methods, and the generalization of each attribute and method may apply a sequence of generalization operators, the major issue becomes how to coordinate the generalization processes among the different attributes and methods in the class(es) to produce interesting results.
For class-based generalization, the attribute-oriented induction method, developed in Chapter 4 for mining characteristics of relational databases, can be extended to mine data characteristics in object databases.
A generalization-based data mining process can be viewed as the application of a sequence of class-based generalization operators on different attributes, until the resulting class contains a small number of generalized objects that can be summarized as a concise, generalized rule in high-level terms. For efficient implementation, the generalization of the multidimensional attributes of a complex object class can be performed by examining each attribute (or dimension), generalizing each attribute to simple-valued data, and constructing a multidimensional data cube, called an object cube. Once an object cube is constructed, multidimensional analysis and data mining can be performed on it in a similar way as on relational data cubes.
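As a toy sketch of the last step, assuming each complex attribute has already been generalized to a single value, a small object cube can be materialized with ordinary grouping; the data and the use of the pandas library are assumptions of this sketch, not the book's implementation.

# A 2-D cuboid of an object cube over already-generalized attributes.
import pandas as pd

objects = pd.DataFrame({
    "major":  ["cs", "cs", "math", "math"],      # generalized attribute
    "region": ["west", "east", "west", "east"],  # generalized attribute
    "count":  [1, 1, 1, 1],
})

object_cube = objects.pivot_table(values="count", index="major",
                                  columns="region", aggfunc="sum")
print(object_cube)   # roll-ups sum along either dimension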
Notice that, from the application point of view, it is not always desirable to generalize a set of values to single-valued data. For example, the keyword attribute of a book may contain a set of keywords describing the book, and it may not make much sense to generalize this set of keywords to one single value. In this context, it is difficult to construct an object cube containing the keyword dimension. We will mention some progress in this direction in the next section when discussing spatial data cube construction. However, how to handle set-valued data effectively in object cube construction and object-based data mining remains a challenging research issue.
9.2 Mining Spatial and Multimedia Databases
A spatial database stores a large amount of space-related data, such as maps, remote sensing data, astronomical data, medical images, VLSI chip layouts, etc. Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases.
Spatial data has many features distinguishing it from relational data. It carries topological and/or distance information, is usually organized by sophisticated, multidimensional spatial indexing structures, is accessed by spatial data access methods, and often requires spatial reasoning, geometric computation, and spatial knowledge representation techniques. Spatial data mining demands an integration of data mining with spatial database technologies. A crucial challenge in spatial data mining is the development of efficient spatial data mining techniques, due to the huge amount of spatial data and the complexity of spatial data types and spatial access methods.
Spatial data mining can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and nonspatial data, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries. It is expected to have wide applications in geographic information systems, geomarketing, remote sensing, image database exploration, medical imaging, navigation, traffic control, environmental studies, and many other areas where spatial data is used.
Statistical spatial data analysis has been a popular approach to analyzing spatial data. Many statistical analysis algorithms and various optimization techniques have been developed for such analysis. The approach handles numerical data well and usually comes up with realistic models of spatial phenomena. However, it is usually based on the assumption of statistical independence among the spatially distributed data, whereas in reality spatial objects
are often interrelated. Moreover, most statistical modeling can only be done by experts with a fair amount of domain knowledge and statistical expertise. Furthermore, statistical methods do not work well with symbolic values or incomplete and inconclusive data, and their results can be expensive to compute. Spatial data mining extends traditional spatial analysis methods by putting emphasis on efficiency, cooperation with database systems, better interaction with users, and the discovery of new types of knowledge.
9.2.1 Spatial data cube construction and spatial OLAP
As with relational data, we can integrate spatial data and construct a data warehouse to facilitate spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes.
Let us examine an application example.
Example 9.1 There are about 3,000 weather probes distributed in British Columbia (B.C.), each recording daily temperature and precipitation for a designated small area and transmitting signals to a provincial weather station. A user may like to view weather patterns on a map by month, by region, and by different combinations of temperature and precipitation, or may even like to dynamically drill down or roll up along any dimension to explore desired patterns, such as wet and hot regions in the Fraser Valley in the summer of 1997.
To facilitate multidimensional analysis of weather patterns, it is desirable to construct a spatial data warehouse and support spatial OLAP. However, there are several challenging issues in the construction and utilization of spatial data warehouses.
The first challenge is the integration of spatial data from heterogeneous sources and systems. Spatial data is usually stored in different industry firms and government agencies using different data formats. Data formats are not only structure-specific (e.g., raster- vs. vector-based spatial data, object-oriented vs. relational models, different spatial storage and indexing structures, etc.), but also vendor-specific (e.g., ESRI, MapInfo, Intergraph, etc.). There has been a lot of work on the integration and exchange of heterogeneous spatial data, which has paved the way for spatial data integration and spatial data warehouse construction.
The second challenge is the realization of fast and flexible on-line analytical processing in a spatial data warehouse. This is the issue we discuss in detail here.
We consider that the star/snowflake schema model introduced in Chapter 2 is still a good choice for modeling spatial data warehouses, because it provides a concise and organized warehouse structure and facilitates OLAP operations. However, in a spatial warehouse, both dimensions and measures may contain spatial components.
There are three types of dimensions in a spatial data cube:
1. A nonspatial dimension is a dimension containing only nonspatial data. For example, two dimensions, temperature and precipitation, can be constructed for the warehouse in Example 9.1; each is a dimension containing nonspatial data whose generalizations are nonspatial, such as hot and wet.
2. A spatial-to-nonspatial dimension is a dimension whose primitive-level data is spatial but whose generalization, starting at a certain high level, becomes nonspatial. For example, state in the U.S. map is spatial data. However, each state can be generalized to some nonspatial value, such as pacific northwest or big state, and its further generalizations are nonspatial; it thus plays a role similar to a nonspatial dimension.
3. A spatial-to-spatial dimension is a dimension whose primitive-level and all of whose high-level generalized data are spatial. For example, equi-temperature-region is spatial data, and all of its generalized data, such as regions covering 0-5 degrees, 5-10 degrees, and so on, are also spatial.
We distinguish two types of measures in a spatial data cube.
1. A numerical measure is a measure containing only numerical data. For example, one measure in a spatial data warehouse could be the monthly revenue of a region, so that a roll-up may compute the total revenue by year, by county, etc. Numerical measures can be further classified into distributive, algebraic, and holistic, as in Chapter 2.
2. A spatial measure is a measure which contains a collection of pointers to spatial objects. For example, during the generalization (or roll-up) in the spatial data cube of Example 9.1, the regions with the same range of temperature
and precipitation will be grouped into the same cell, and the measure so formed contains a collection of pointers to those regions.
Figure 9.1: Star model of a spatial data warehouse: BC weather, and the corresponding BC weather probes map
Figure 9.2: Hierarchies for each dimension in BC weather:
region name: probe location < district < city < region < province
time: hour < day < month < season
temperature: any (cold, mild, hot); cold (below -20, -20 to -10, -10 to 0); mild (0 to 10, 10 to 15, 15 to 20); hot (20 to 25, 25 to 30, 30 to 35, above 35)
precipitation: any (dry, fair, wet); dry (0 to 0.05, 0.05 to 0.2); fair (0.2 to 0.5, 0.5 to 1.0, 1.0 to 1.5); wet (1.5 to 2.0, 2.0 to 3.0, 3.0 to 5.0, above 5.0)
A nonspatial data cube contains only nonspatial dimensions and numerical measures. If a spatial data cube contained spatial dimensions but no spatial measures, its OLAP operations, such as drilling or pivoting, could be implemented in a way similar to that of nonspatial data cubes. However, the introduction of spatial measures into spatial data cubes raises challenging implementation issues, as shown in the following example.
Example 9.2 A star model for the BC weather warehouse of Example 9.1 is shown in Figure 9.1. It consists of four dimensions: temperature, precipitation, time, and region name, and three measures: region map, area, and count. A concept hierarchy for each dimension can be created by users or experts, or generated automatically by data clustering analysis. Figure 9.2 presents the hierarchies for the dimensions in the BC weather warehouse.
Of the three measures, area and count are numerical measures, which can be computed in the same way as in a nonspatial data cube; whereas region map is a spatial measure which represents a collection of spatial pointers to the corresponding regions. Since different spatial OLAP operations result in different collections of spatial objects in region map, it is a major challenge to compute the merges of a large number of regions flexibly and dynamically. For example, two different roll-ups on the BC weather map data (Figure 9.1) may produce two different generalized region maps, as shown in Figure 9.3, each being the result of merging a large number of small (probe) regions from Figure 9.1.
Figure 9.3: Roll-up operations along different dimensions
Can we precompute all the possible spatial merges and store them in the corresponding cuboid cells of a spatial data cube? The answer is probably not. Unlike a numerical measure, where each aggregated value takes only a few bytes of space, a merged region map of B.C. may take megabytes of storage space. Thus, we face a dilemma in balancing the cost of on-line computation against the space overhead of storing computed measures: the substantial cost of on-the-fly computation of spatial aggregations calls for precomputation, yet the substantial overhead of storing aggregated spatial values discourages it.
There are at least three possible choices for the computation of spatial measures in spatial data cube construction.
1. Collect and store the corresponding spatial object pointers, but do not precompute spatial measures in the spatial data cube.
This can be implemented by storing, in the corresponding cube cell, a pointer to a collection of spatial object pointers, and invoking and performing the spatial merge (or other computation) of the corresponding spatial
objects on-the-fly, when necessary. This is still a good choice if only spatial display is required (i.e., no real spatial merge has to be performed), if there are not many regions to be merged in any pointer collection (so that on-line merging is not very costly), or if on-line spatial merge computation is fast enough. Recently, some efficient spatial merge methods have been developed for fast spatial OLAP. Since OLAP results are often used for further spatial analysis and mining, it is still recommended to precompute some of the spatially connected regions to speed up such analysis.
2. Precompute and store a rough approximation/estimation of the spatial measures in the spatial data cube.
This choice is good for a rough view or coarse estimation of spatial merge results, under the assumption that it takes little storage space to store the coarse estimate. For example, the minimum bounding rectangle (MBR), representable by two points, can be taken as a rough estimate of a merged region (see the sketch following this list). Such a precomputed result is small and can be presented to users quickly. If higher precision is needed for specific cells, the application can either fetch precomputed high-quality results, if available, or compute them on-the-fly.
3. Selectively precompute some spatial measures in the spatial data cube.
This seems to be a smart choice, but the question becomes which portion of the cube to select for materialization.
The selection can be performed at the cuboid level, i.e., either precompute and store each set of mergeable spatial regions for each cell of a selected cuboid, or precompute none if the cuboid is not selected. Since a cuboid usually consists of a large number of spatial objects, this may involve the precomputation and storage of a large number of mergeable spatial objects, some of which could be rarely used. Therefore, it is recommended to perform the selection at a finer granularity: examine each group of mergeable spatial objects in a cuboid to determine whether such a merge should be precomputed. The decision should be based on the utility (such as access frequency or access priority), the sharability of merged regions, and the balanced overall cost of space and on-line computation.
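The following sketch illustrates choice 2 above: the minimum bounding rectangle of a merged region, representable by just two corner points, serves as its coarse precomputed approximation. The coordinates are illustrative.

# MBR of a merged region: two corner points, a few bytes of storage.
def mbr(points):
    """Minimum bounding rectangle of a set of (x, y) vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

# Vertices of many small merged probe regions (illustrative coordinates).
region_points = [(1, 1), (4, 2), (3, 6), (0, 3)]
print(mbr(region_points))   # ((0, 1), (4, 6))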
With an efficient implementation of spatial data cubes and spatial OLAP, generalization-based descriptive spatial mining, such as spatial characterization and discrimination, can be performed efficiently.
9.2.2 Spatial association analysis
Similar to the mining of association rules in transactional and relational databases, spatial association rules can be mined in spatial databases. A spatial association rule is of the form X → Y [s%, c%], where X and Y are sets of spatial or nonspatial predicates, s% is the support of the rule, and c% is the confidence of the rule. For example, the following rule is a spatial association rule:
is_a(x, school) ∧ close_to(x, sports_center) → close_to(x, park) [0.5%, 80%]
This rule states that 80% of the schools that are close to sports centers are also close to parks, and that 0.5% of the data belong to such a case.
Various kinds of spatial predicates can constitute a spatial association rule. Examples include topological relations, such as intersects, overlaps, and disjoint; spatial orientations, such as left_of and west_of; and distance information, such as close_to and far_away.
Since spatial association mining needs to evaluate multiple spatial relationships among a large number of spatial objects, the process can be quite costly. An important mining optimization method, called progressive refinement, can be adopted in spatial association analysis. The method first mines large data sets roughly, using a fast, scalable algorithm, and then improves the quality of mining on the pruned data set using a more expensive algorithm.
To ensure that the pruned data set covers the complete set of answers when applying the high-quality data mining algorithms at a later stage, an important requirement for the rough mining algorithm in the early stage is the superset coverage property: it must preserve all the potential answers. In other words, it may act as a false-positive test, which can include some answers that do not belong to the final answer set, but it must not act as a false-negative test, which might exclude some potential answers.
For mining spatial associations related to the spatial predicate close_to, one can first collect the candidates that pass the minimum support threshold by (1) applying certain rough spatial evaluation algorithms, e.g., using a minimum
bounding rectangle (MBR) structure (which registers only two spatial points rather than a set of complex polygons), and (2) evaluating a relaxed spatial predicate, g_close_to, which is a generalized close_to covering a broader context, including close_to, touch, intersect, etc. If two spatial objects are closely located, their enclosing minimum bounding rectangles must be closely located, matching g_close_to. But the reverse is not always true: if the enclosing minimum bounding rectangles are closely located, the two spatial objects may or may not be closely located. Thus, MBR pruning is a false-positive test for closeness: only those candidates which pass the rough test need to be further examined using more expensive spatial computation algorithms. With this preprocessing step, only the patterns that are frequent at the approximation level need to be examined by more detailed, finer, but more expensive spatial computation.
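A hedged sketch of this filtering step follows: a cheap MBR-distance test implements a relaxed predicate in the spirit of g_close_to and preserves the superset coverage property, since it never rejects a truly close pair. The rectangle representation and the threshold are illustrative assumptions.

# Rough MBR filter for the relaxed predicate g_close_to.
def mbr_distance(a, b):
    """Minimum distance between two MBRs ((xmin, ymin), (xmax, ymax))."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    dx = max(bx1 - ax2, ax1 - bx2, 0)   # 0 if the rectangles overlap in x
    dy = max(by1 - ay2, ay1 - by2, 0)
    return (dx * dx + dy * dy) ** 0.5

def g_close_to(mbr_a, mbr_b, threshold=1.0):
    """Superset-coverage test: never rejects a truly close pair."""
    return mbr_distance(mbr_a, mbr_b) <= threshold

school = ((0, 0), (1, 1))
park = ((1.5, 0), (3, 1))
if g_close_to(school, park):
    pass  # only candidates passing the rough test get the exact polygon test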
9.2.3 Spatial clustering methods
Spatial data clustering identifies clusters, or densely populated regions, according to some distance measure, in a large, multidimensional data set. Spatial clustering methods were studied thoroughly in Chapter 8, since cluster analysis usually takes spatial data clustering as its examples and applications. Readers interested in spatial clustering should therefore refer to Chapter 8.
9.2.4 Spatial classification and spatial trend analysis
Spatial classification analyzes spatial objects to derive classification schemes, such as decision trees, with respect to certain spatial properties, such as the neighborhood of a district, highway, river, etc.
Suppose one would like to classify regions in a state into rich vs. poor according to average family income, and find the important factors that determine whether a region is rich or poor. This is a spatial classification problem. There are many features associated with each spatial object, such as being near a major airport, hosting a university, containing interstate highways, being near a lake, etc. These features can be used for relevance analysis and for finding interesting classification schemes.
Spatial trend analysis deals with a different issue: detecting changes and trends along a spatial dimension. Usually, trend analysis studies changes with time, such as the changes of temporal patterns in time-series data. Spatial trend analysis replaces time with space and studies the trend of some nonspatial data changing with space. For example, one may observe a trend of change in economic situation when moving away from the center of a city, or a trend of change in climate or vegetation with increasing distance from an ocean. For such analyses, regression and correlation analysis methods are often applied, utilizing spatial data structures and spatial access methods.
There are also many applications where patterns change with both space and time. For example, traffic flows on highways and in cities are related to both time and space. Weather patterns are also closely related to both time and space. To find spatio-temporal patterns and make good predictions, sophisticated methods for mining spatio-temporal data should be developed.
There have been a few interesting studies on spatial classification and spatial trend analysis. However, there have been few studies on spatio-temporal data mining. More methods and applications of spatial classification and trend analysis, especially those associated with time, need to be explored in the future.
9.2.5 Mining raster databases
Spatial database systems usually handle vector data, which consist of points, lines, polygons (regions), and their compositions, such as networks or partitions. Typical examples of such data are maps, design graphs, 3-D representations of the arrangement of the chains of protein molecules, etc. However, a huge amount of space-related data is in digital raster (image) form, such as satellite images, remote sensing data, computed tomography, etc. It is important to explore data mining in raster databases.
There have been many studies on mining raster data in scientific research, such as astronomy, seismology, and geoscientific research. In general, the following data mining methods have been explored in raster data mining.
1. Decision tree classification has been an essential data mining method in reported raster data mining applications. For example, one may take sky images that have been carefully classified by astronomers as the training set, and construct models for the recognition of galaxies, stars, and other stellar objects, based on
properties such as magnitudes, areas, intensity, image moments, orientation, etc. A large number of sky images taken by telescopes or space probes can then be tested against the constructed models to identify new celestial bodies. Similar studies have also been performed successfully to identify volcanoes on Venus.
2. Data preprocessing, such as noise reduction, data focusing, and feature extraction, is often important in raster data mining, since the images may contain noise, pictures may be taken from different angles, etc. Besides standard methods used in pattern recognition, such as edge detection and Hough transformation, one may explore techniques like the decomposition of images into eigenvectors or the adoption of probabilistic models to deal with uncertainty.
3. Parallel and distributed processing are useful, since raster data often comes in huge volumes and may require substantial processing power.
4. Raster data mining is closely linked to image analysis and scientific data mining, and thus many image analysis techniques and scientific data analysis methods can be applied to raster data mining.
9.2.6 From spatial data mining to multimedia data mining
With the popular use of audio-video equipment, CD-ROMs, and the Internet, many database systems store and manage a large number of multimedia objects, including audio data, images, video data, hypertext data (which contains text, text markups, and linkages), sequence data, etc. A database system which stores and manages a large collection of multimedia objects is called a multimedia database system. Typical multimedia database systems include NASA's EOS (Earth Observation System), the Human Genome project, and digital libraries.
Mining multimedia databases is a challenging task due to the huge size and unstructured nature of multimedia objects. However, some progress has been made in mining multimedia data. In this section, we introduce a few methods for mining multimedia databases, including content-based retrieval and similarity search of multimedia data, generalization and multidimensional analysis of multimedia data, and mining associations in multimedia data.
Similarity search in multimedia data
Given a set of images, the problem of similarity search is to find all images similar to a given image, or all pairs of similar images. Applications include medical diagnosis, weather prediction, Web search engines for images, and e-commerce.
Multi-dimensional analysis of multimedia data
Mining associations in multimedia data
9.3 Mining Time-Series Databases
A time-series database consists of sequences of values or events changing with time. Time-series databases are popular in many applications, such as studying the daily fluctuations of a stock market, business transaction sequences, traces of a dynamic production process, scientific experiments, medical treatments, Web page access sequences, and so on. There are many distinct issues in mining time-series databases, such as trend and periodicity analysis, similarity search in time-series analysis, and time-related frequent pattern mining.
9.3.1 Trend analysis
Trend analysis is one of the major applications of time-series analysis. In many cases, a time series involving a variable Y, such as the daily closing price of a share in a stock market, can be viewed as a function of time t, i.e., Y = F(t). Such a function can be drawn as a time-series graph, as shown in Figure 9.4, which describes a point moving with the passage of time.
Time-series movements can be characterized by the following components.
Figure 9.4: A time series.
1. Long-term or trend movements: these refer to the general direction in which a time series is moving over a long interval of time. The trend movement is indicated by a trend curve or, in some time series, a trend line. Typical methods for determining such a trend curve or trend line include the least squares method and the weighted moving average method.
2. Cyclic movements or cyclic variations: these refer to long-term oscillations, or swings, about a trend line or curve. The "cycles" may or may not be periodic, that is, they may or may not follow exactly similar patterns after equal intervals of time.
3. Seasonal movements or seasonal variations: these refer to the identical or nearly identical patterns that a time series appears to follow during corresponding months of successive years. Such movements are due to recurring events that take place annually, such as the sudden increase in department store sales before Christmas.
4. Irregular or random movements: these refer to the sporadic motion of a time series due to chance events, such as earthquakes, strikes, etc.
Time-series analysis that investigates these four factors, trend, cyclic, seasonal, and irregular, is often referred to as the decomposition of a time series into its basic component movements.
Given a set of numbers y_1, y_2, y_3, ..., a moving average of order n is the sequence of arithmetic means:
(y_1 + y_2 + ... + y_n)/n, (y_2 + y_3 + ... + y_{n+1})/n, (y_3 + y_4 + ... + y_{n+2})/n, ...   (9.1)
A moving average tends to reduce the amount of variation present in a set of data. Thus the process of replacing a time series by its moving average eliminates unwanted fluctuations, and is therefore also called smoothing of time series. If weighted arithmetic means are used in sequence (9.1), the resulting sequence is called a weighted moving average of order n.
Example 9.3 Given a sequence of nine numbers, its moving average of order 3 and its weighted moving average of order 3 with weights (1, 4, 1) can be shown in tabular form, where each number in the moving average is the mean of the three numbers immediately above it, and each number in the weighted moving average is the weighted average of the three numbers immediately above it.
Original data: 3 7 2 0 4 5 9 7 2
Moving average of order 3: 4 3 2 3 6 7 6
Weighted (1, 4, 1) moving average of order 3: 5.5 2.5 1 3.5 5.5 8 6.5
The first weighted average value is calculated as (1 x 3 + 4 x 7 + 1 x 2)/(1 + 4 + 1) = 33/6 = 5.5. The weighted average gives the central element more weight, to offset the smoothing effect, which could otherwise be strongly affected by extreme values.
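The computation in Example 9.3 can be reproduced directly; the short Python sketch below computes both the order-3 moving average and the weighted (1, 4, 1) moving average of the same nine numbers.

# Moving average and weighted moving average of order len(weights).
def moving_average(data, weights):
    n, total = len(weights), sum(weights)
    return [sum(w * x for w, x in zip(weights, data[i:i + n])) / total
            for i in range(len(data) - n + 1)]

data = [3, 7, 2, 0, 4, 5, 9, 7, 2]
print(moving_average(data, [1, 1, 1]))   # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
print(moving_average(data, [1, 4, 1]))   # [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]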
In general, we have the following curve-fitting methods for the estimation of a trend.
1. The freehand method: drawing an approximate line or curve to fit a set of data, based on an individual's judgement. The validity and quality of this method rely on that judgement, which makes it costly and barely reliable for any large-scale data mining.
2. The least squares method: taking the best-fitting curve C to be the least squares curve, that is, the curve minimizing the sum of squared deviations, sum_{i=1}^{n} d_i^2, where the deviation or error d_i is the difference between the value y_i of a point (x_i, y_i) and the corresponding value determined from the curve C.
3. The moving average method: using a moving average of appropriate order, one can eliminate cyclic, seasonal, and irregular patterns, leaving only the trend movement. However, a moving average loses the data at the beginning and end of a series, may sometimes generate cycles or other movements that are not present in the original data, and may be strongly affected by extreme values. Notice that the influence of extreme values can be reduced by using a weighted moving average with appropriate weights, as shown in Example 9.3.
In many business transactions, such as sales over a year, there are expected regular seasonal fluctuations, such as higher sales volumes during the Christmas season. Therefore, it is important to identify such seasonal variations and deseasonalize the data for trend and cyclic analysis. For this purpose, the concept of a seasonal index is introduced: a set of numbers showing the relative values of a variable during the months of a year. For example, if the sales during October, November, and December are 80%, 110%, and 140% of the average monthly sales for the whole year, respectively, then 80%, 110%, and 140% are the seasonal index numbers for these months. If the original monthly data are divided by the corresponding seasonal index numbers, the resulting data are said to be deseasonalized, or adjusted for seasonal variations. Such data still include trend, cyclic, and irregular movements.
The deseasonalized data can be adjusted for trend by dividing the data by their corresponding trend values. Furthermore, an appropriate moving average will smooth out the irregular variations and leave only the cyclic variations for further analysis. If periodicity or approximate periodicity of cycles occurs, cyclic indexes can be constructed in a way similar to seasonal indexes.
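A one-step sketch of deseasonalizing follows: each month's value is divided by its seasonal index. The sales figures are invented so that the October-December indexes above give round results.

# Deseasonalizing: divide each month's value by its seasonal index.
sales = {"Oct": 96.0, "Nov": 121.0, "Dec": 154.0}
seasonal_index = {"Oct": 0.80, "Nov": 1.10, "Dec": 1.40}

deseasonalized = {m: sales[m] / seasonal_index[m] for m in sales}
print(deseasonalized)   # {'Oct': 120.0, 'Nov': 110.0, 'Dec': 110.0}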
Finally, irregular or random variations can be estimated by adjusting the data for the trend, seasonal, and cyclic variations. In practice, irregular movements tend to have small magnitude and to follow a normal distribution: small deviations occur with high frequency, whereas large deviations occur with low frequency.
In practice, it is often beneficial to first graph the time series and estimate qualitatively the presence of long-term trend, seasonal variations, and cyclic variations. This may help in choosing the right method of analysis and in comprehending its results.
With a systematic analysis of the trend, cyclic, seasonal, and irregular movements, one is able to make long-term or short-term predictions, that is, to forecast the time series, with reasonable quality.
9.3.2 Similarity search in time-series analysis
Given a set of time-series sequences, the problem of similarity search is to find all data sequences that are similar to a given query sequence, or all pairs of similar sequences. Notice that, unlike normal database queries, which find data that match the given query exactly, a similarity search finds data sequences that differ only slightly from the given query sequence.
In general, there are two categories of similarity matching problems: whole sequence matching and subsequence matching. The former finds sequences (or pairs of sequences) that are similar in their entirety; the latter finds all sequences containing subsequences that are similar to a given query sequence. Similarity search in time-series analysis is useful in financial market analysis (e.g., stock data analysis), medical diagnosis (e.g., cardiogram analysis), and scientific or engineering databases (e.g., power consumption analysis).
Whole sequence matching
Two time sequences S and T are said to be ε-similar if they contain nonoverlapping subsequences s_1, s_2, ..., s_m and t_1, t_2, ..., t_m, respectively, such that
1. s_i < s_j and t_i < t_j for 1 <= i < j <= m (i.e., the subsequences appear in the same time order), and
2. there exist some scale λ and some translation δ such that λ(s_i) + δ ≈ t_i for all 1 <= i <= m, where ≈ is a similarity operator defined by a certain similarity measure, for example requiring that the fraction of the matching length to the total length of the two sequences be above the specified threshold ε.
To find whole sequence matches efficiently, one first extracts k features from every sequence, so that each sequence is represented as a point in k-dimensional space. One can then use a multidimensional indexing method to store and search these points. Notice, however, that spatial indices usually do not work well for high-dimensional data.
Usually, one resorts to distance-preserving orthonormal transformations; the Discrete Fourier Transform (DFT) and the Haar wavelet transform are two frequently used ones. Since the Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain, and since the DFT does a good job of concentrating the energy in the first few coefficients, keeping only the first few DFT coefficients yields a lower bound on the actual distance.
One implementation method goes as follows. Take the Euclidean distance as the similarity measure, obtain the Discrete Fourier Transform coefficients of each sequence in the database, and build a multidimensional index
using the first few Fourier coefficients. The index is used to retrieve sequences that are at most a certain small distance away from the query sequence. After such processing, one needs to compute the actual distance between sequences in the time domain and discard any false matches.
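The sketch below illustrates this implementation idea using numpy (an assumption of the sketch): with an orthonormal DFT, the distance over the first few coefficients lower-bounds the true Euclidean distance by Parseval's theorem, so index-level pruning never loses a true match. The sequences and the threshold are illustrative.

# DFT feature extraction and lower-bound pruning for similarity search.
import numpy as np

def dft_features(seq, k=3):
    """First k DFT coefficients, orthonormal so distances are preserved."""
    return np.fft.fft(seq, norm="ortho")[:k]

def lower_bound_distance(f1, f2):
    return np.linalg.norm(f1 - f2)   # never exceeds the true distance

s = np.array([3.0, 7, 2, 0, 4, 5, 9, 7])
q = np.array([3.0, 6, 2, 1, 4, 5, 9, 7])
if lower_bound_distance(dft_features(s), dft_features(q)) <= 2.0:
    exact = np.linalg.norm(s - q)    # confirm in the time domain
    print(exact)                     # discard if this exceeds the threshold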
Subsequence matching
An intuitive notion of subsequence similarity allows for nonmatching gaps, amplitude scaling, and offset translation. The matching subsequences need not be aligned along the time axis. The approach requires several parameters: the sliding window size, the width ε of an envelope for similarity, the maximum gap, and the matching fraction.
The similarity model is as follows:
Sequences are normalized with amplitude scaling and offset translation.
Two subsequences are considered similar if one lies within an envelope of width ε around the other, ignoring outliers.
Two sequences are said to be similar if they have enough non-overlapping, time-ordered pairs of similar subsequences.
The approach can be outlined in three steps:
Atomic matching: find all pairs of gap-free windows of the sliding window length that are similar.
Window stitching: stitch similar windows together to form pairs of larger similar subsequences, allowing gaps between atomic matches.
Subsequence ordering: linearly order the subsequence matches to determine whether enough similar pieces exist.
9.3.3 Frequent pattern mining
Different kinds of time-related frequent patterns
Most studies concentrate on symbolic patterns, although some consider numerical curve patterns in time series. Agrawal and Srikant [AS95] developed an Apriori-like technique [AS94] for mining sequential patterns. Mannila et al. [MTV95] consider frequent episodes in sequences, where episodes are essentially acyclic graphs of events whose edges specify the temporal before-and-after relationship, but without timing-interval restrictions. Inter-transaction association rules proposed by Lu et al. [LHF98] are implication rules whose two sides are totally ordered episodes with timing-interval restrictions (on the events in the episodes and on the two sides). Bettini et al. [BWJ98] consider a generalization of inter-transaction association rules: these are essentially rules whose left-hand and right-hand sides are episodes with time-interval restrictions.
Mining sequential patterns and episodes
Mining inter-transaction association rules
9.3.4 Periodicity analysis
The mining of periodic patterns, that is, the search for recurring patterns in time-series databases, is an important data mining problem with many applications. For example, seasons, tides, planet trajectories, daily power consumption, daily traffic patterns, and weekly TV programs all exhibit certain periodic patterns.
Periodic pattern mining problems can be partitioned into three categories: mining full periodic patterns, mining partial periodic patterns, and mining cyclic association rules. Full periodicity means that every point in time contributes (precisely or approximately) to the cyclic behavior of the time series. For example, all the days of the year approximately contribute to the seasonal cycle of the year. Partial periodicity specifies the periodic behavior of the time series at some, but not all, points in time. For example, Jim reads the New York Times from 7:00 to 7:30 every weekday morning, but his activities at other times do not have much regularity. Partial periodicity is a looser kind of periodicity than full periodicity, and it occurs more commonly in the real world. Cyclic association rules, the third category, are discussed at the end of this section.
Full periodicity mining
Partial periodicity mining
Most methods for nding full periodic patterns are either inapplicable to or prohibitively expensive for the mining
of partial periodic patterns , because of the mixture of periodic events and non-periodic events in the same period.
For example, FFT (Fast Fourier Transformation) cannot be applied to mining partial periodicity because it treats
the time-series as an inseparable ow of values. Some periodicity detection methods can detect some partial periodic
patterns, but only if the period, and the length and timing of the segment in the partial patterns with specic
behavior are explicitly specied. For the newspaper reading example, we need to explicitly specify details such as
\nd the regular activities of Jim during the half-hour after 7:00 for the period of 24 hours." A naive adaptation
of such methods to our partial periodic pattern mining problem would be prohibitively expensive, requiring their
application to a huge number of possible combinations of the three parameters of length, timing, and period.
An Apriori-like algorithm for mining imperfect partial periodic patterns with a given (single)
period was proposed in a recent study by two of the current authors [HGY98]. It is an interesting algorithm for mining imperfect
partial periodicity. However, a detailed examination of the data characteristics of partial periodicity shows
that Apriori pruning in mining partial periodicity may not be as effective as in mining association rules.
Our study has revealed the following new characteristics of partial periodic patterns in time series: the Apriori-like
property among partial periodic patterns still holds for any fixed period, but it does not hold for patterns across
different periods. Furthermore, there is a strong correlation among the frequencies of partial patterns.
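To make the single-period case concrete, the following sketch counts, for a fixed period, how often each (offset, event) pair recurs, keeping those whose recurrence ratio reaches a confidence threshold. It is a minimal sketch assuming the time series is given as a string of event symbols; the function name and representation are our own illustration, not the algorithm of [HGY98].

    from collections import Counter

    def partial_periodic_1_patterns(series, period, min_conf=0.75):
        # Count, for each offset within the period, how often each
        # event recurs there across all full periods of the series.
        n_periods = len(series) // period
        counts = Counter()
        for seg in range(n_periods):
            for offset in range(period):
                counts[(offset, series[seg * period + offset])] += 1
        # Keep (offset, event) pairs whose recurrence ratio (confidence)
        # reaches the threshold; non-periodic positions drop out.
        return {(off, ev): c / n_periods
                for (off, ev), c in counts.items()
                if c / n_periods >= min_conf}

    # 'a' recurs at offset 0 and 'd' at offset 3 in every period of 4:
    print(partial_periodic_1_patterns("abcdaxydazzdaqqd", 4))

Longer patterns can then be grown Apriori-style within the same fixed period, which is exactly where the pruning discussed above applies.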
Cyclic association rule mining
Similar to our problem, the mining of cyclic association rules by Özden et al. [ORS98]^1 also considers mining
patterns over a range of possible periods. Observe that cyclic association rules are partial periodic patterns with
perfect periodicity, in the sense that each pattern recurs in every cycle with 100% confidence. This perfect
periodicity leads to a key idea used in designing efficient cyclic association rule mining algorithms: as soon as it is
known that an association rule R does not hold at a particular instant of time, we can infer that R cannot have any
period that includes this time instant. For example, if the maximum period of interest is ℓ_max and it is discovered
that R does not hold in the first ℓ_max time instants, then R cannot have any periods. This idea leads to the useful
"cycle-elimination" strategy explored in that paper. Since real-life patterns are usually imperfect, our goal is not to
mine perfect periodicity, and thus "cycle-elimination"-based optimization will not be considered here.^2
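For contrast, here is a minimal sketch of the cycle-elimination idea, assuming we already know at which instants a rule holds; the representation is our own illustration, whereas [ORS98] integrates the strategy into candidate itemset elimination.

    def surviving_cycles(holds, max_period):
        # holds[t] is True iff rule R holds at time instant t.
        # A cycle is a pair (p, r): R would hold at every t with t mod p == r.
        candidates = {(p, r) for p in range(1, max_period + 1)
                      for r in range(p)}
        for t, ok in enumerate(holds):
            if not ok:
                # R fails at t, so eliminate every cycle that covers t.
                candidates -= {(p, t % p) for p in range(1, max_period + 1)}
        return candidates

    # R fails at instants 1, 2, and 5; only the cycle (period 3, phase 0)
    # survives, since R holds at instants 0 and 3.
    print(sorted(surviving_cycles([True, False, False, True, True, False], 3)))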
9.4 Mining Text Databases
Most previous studies of data mining have focused on structured data, such as relational, transactional, and data
warehouse data. In reality, however, a substantial portion of the available information is stored in text databases
(or document databases), which consist of large collections of documents from various sources, such as news articles,
research papers, books, digital libraries, e-mails, and Web pages. With the popular use of electronic publishing,
e-mail, CD-ROMs, and the WWW, information is increasingly available in electronic form, and the amount of on-line
text data has been growing rapidly.
Data stored in most text databases are semi-structured, in the sense that they are neither completely unstructured
nor completely structured. For example, a document may contain a few structured fields, such as title,
authors, publication date, length, and category, but also largely unstructured text components, such
as the abstract and contents. There have been many studies on the modeling and implementation of semi-structured data
in recent database research. Moreover, information retrieval techniques, such as text indexing methods, have been
developed to handle unstructured documents.
With the vast and fast-growing amount of text data, traditional information retrieval techniques have become
inadequate: there are often too many documents containing useful information, yet only a small fraction of them is
relevant to a particular individual, and without knowing what could be in the documents, it is difficult to formulate
correct or smart queries.
^1 It is important to point out that [ORS98] concentrates on the elimination of candidate itemsets for the association rule mining
algorithm, although the cycle-elimination strategy does lead to a small reduction in the number of patterns when we process the time
series from left to right.
^2 Note that a modified strategy, where we stop considering certain patterns as soon as the length of the remaining time series is
not enough to make the confidence higher than the threshold, can be used.
[Figure 9.5: Relationship between the set of relevant documents and the set of retrieved documents. The Venn diagram shows the relevant documents and the retrieved documents as overlapping subsets of all documents; their intersection contains the documents that are both relevant and retrieved.]
Also, with a large number of documents, people may wish to compare
different documents, rank the importance and relevance of documents, or find patterns and trends across multiple
documents. Furthermore, the Internet can be viewed as a huge, interconnected, dynamic text database. With the advent
and fast-growing popularity of the Internet, text mining has become an increasingly popular and essential theme in
data mining.
9.4.1 Text data analysis and information retrieval
Information retrieval (IR) is a field that has developed in parallel with database systems for many years.
Unlike database systems, which focus on the query and transaction processing of structured data,
information retrieval focuses on the organization and retrieval of information in large numbers of text-based
documents. A typical information retrieval problem is to locate relevant documents based on user input,
such as keywords or example documents; typical information retrieval systems include online library catalog
systems and online document management systems.
Since information retrieval and database systems handle different kinds of data, some database
system problems are usually not present in information retrieval systems, such as concurrency control, recovery,
transaction management, and update. Likewise, some common information retrieval problems are usually
not encountered in traditional database systems, such as unstructured documents, approximate search based on
keywords, and the notion of relevance.
To analyze a text database, the following simple model can be adopted: a document is represented by a string,
which can be identified by a set of keywords. Such a simple keyword-based information retrieval model encounters
two major difficulties. The first is the synonymy problem: a keyword, such as software product, may not appear
anywhere in a document even though the document is closely related to software products. The second is the
polysemy problem: the same keyword, such as mining, may mean different things in different contexts.
Basic measures for text retrieval
There are two basic measures for content-based text retrieval. One is precision, which is the percentage of retrieved
documents that are in fact relevant to the query. The other is recall, which is the percentage of the documents
that are relevant to the query (i.e., that are in the database and should be retrieved) and were in fact retrieved. Let
the set of documents relevant to a query be {Relevant}, and the set of documents retrieved
be {Retrieved}. The set of documents that are both relevant and retrieved is {Relevant} ∩ {Retrieved}, as
shown in the Venn diagram of Figure 9.5.
The two measures are defined formally as follows:

    recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|          (9.2)

    precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|      (9.3)
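In code, the two measures reduce to simple set arithmetic. A minimal sketch, assuming documents are represented by identifiers (the function name is our own):

    def precision_recall(relevant, retrieved):
        # Both arguments are sets of document identifiers.
        hit = relevant & retrieved               # relevant AND retrieved
        precision = len(hit) / len(retrieved)    # Equation (9.3)
        recall = len(hit) / len(relevant)        # Equation (9.2)
        return precision, recall

    # 2 of the 3 retrieved documents are relevant; 2 of 4 relevant found:
    print(precision_recall({1, 2, 3, 4}, {3, 4, 5}))   # (0.666..., 0.5)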
Keyword-based and similarity-based retrieval
Most information retrieval systems support keyword-based and/or similarity-based retrieval. For keyword-based
retrieval, the user poses one keyword or an expression formed from a set of keywords, such as car and repair shops, tea
or coffee, or database systems but not Oracle. A good information retrieval system should consider synonyms when
answering such queries. For example, when the query contains car, the system should consider including its synonyms,
automobile and vehicle, in the search as well. Similarity-based retrieval finds similar documents based on a set
of common keywords. The answer should be based on the degree of relevance, where relevance is measured by
the nearness of the keywords, the relative frequency of the keywords, and so on.
How do such keyword-based and similarity-based information retrieval systems work?
A text retrieval system often associates a stop list with a set of documents; this is a set of words that are
deemed "irrelevant". For example, a, the, of, for, with, and so on are stop words, even though they may appear frequently.
Stop lists may vary from one document collection to another. For example, database systems could be an important
keyword in a newspaper, but it may be considered a stop word in a set of research papers presented at a database
systems conference.
A group of syntactically slightly different words may share the same word stem. A text retrieval system needs
to identify groups of words that are small syntactic variants of each other and collect only their common word
stem. For example, the words drug, drugged, and drugs share the common word stem, drug, and one may view
them as different occurrences of the same word.
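The following toy sketch combines stop-word filtering with a deliberately crude suffix stripper; a real system would use a full stemming algorithm, such as Porter's, and the stop list and suffix list here are illustrative only.

    STOP_WORDS = {"a", "the", "of", "for", "with", "and"}

    def crude_stem(word):
        # Strip a few common suffixes; only a rough stand-in for a real stemmer.
        for suffix in ("ged", "ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def preprocess(text):
        words = [w.lower() for w in text.split()]
        return [crude_stem(w) for w in words if w not in STOP_WORDS]

    print(preprocess("the drugs and the drugged drug"))
    # ['drug', 'drug', 'drug']: all three variants collapse to one stem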
Starting with a set of d documents and a set of t terms, we can model each document as a vector v in the
t-dimensional space R^t. The j-th coordinate of v is a number that measures the association of the j-th term
with the given document: it is generally defined as 0 if the document does not contain the term, and nonzero otherwise.
There are many ways to define the term weighting for the nonzero entries in such a vector. For example, one can
simply define v_j = 1 whenever the j-th term occurs in the document, or let v_j be the term frequency, i.e., the number of
occurrences of term t_j in the document, or the relative term frequency, i.e., the term frequency divided by the total
number of occurrences of all the terms in the document.
Example 9.4 Table 9.1 shows a term frequency matrix, in which each column represents a document vector, and
each entry frequency_matrix(i, j) registers the number of occurrences of term t_i in document d_j.
    term/document    d1    d2    d3    d4    d5    d6    d7
    t1              321    84    31    68    72    15   430
    t2              354    91    71    56    82     6   392
    t3               15    32   167    46   289   225    17
    t4               22   143    72   203    51    15    54
    t5               74    87    85    92    25    54   121

Table 9.1: A term-document frequency matrix
Since similar documents should have similar term frequencies, one may measure the similarity among a set of
documents, or between a document and a query (which is often a set of keywords), based on their relative term
occurrences in the frequency table.
There have been many metrics proposed for measuring the similarity of two documents. A representative metric
is the cosine measure, defined as follows. Let v1 and v2 be two document vectors. Their cosine similarity is defined
by Equation (9.4),

    sim(v1, v2) = (v1 · v2) / (|v1| |v2|)                        (9.4)

where the inner product v1 · v2 is the standard vector dot product, defined as Σ_{i=1}^{t} v_{1i} v_{2i}, and the norm
|v1| in the denominator is defined as |v1| = √(v1 · v1).
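The measure is straightforward to compute. A short sketch, applied to two document vectors from Table 9.1 (the function name is our own):

    import math

    def cosine(v1, v2):
        # Equation (9.4): dot product over the product of the norms.
        dot = sum(x * y for x, y in zip(v1, v2))
        norm1 = math.sqrt(sum(x * x for x in v1))
        norm2 = math.sqrt(sum(x * x for x in v2))
        return dot / (norm1 * norm2)

    d1 = [321, 354, 15, 22, 74]       # column d1 of Table 9.1
    d7 = [430, 392, 17, 54, 121]      # column d7 of Table 9.1
    print(cosine(d1, d7))  # close to 1: similar relative term frequencies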
With such numerical similarity metrics defined on documents, one can construct similarity-based indices on
them. Text-based queries can then be represented as vectors, which can be used to search for their nearest
neighbors in a collection of documents. However, for any nontrivial document database, the number of documents
|D| and the number of terms |T| are usually quite large. Such high dimensionality not only makes efficient
computation difficult, since the resulting frequency table will have size |D| × |T|, but also leads to very
sparse vectors and increases the difficulty of detecting and exploiting relationships among terms (e.g., synonymy). To
overcome this problem, a latent semantic indexing method has been developed, which effectively reduces the size of
the frequency table to be analyzed.
Latent semantic indexing
The latent semantic indexing method uses singular value decomposition (SVD), a well-known technique
in matrix theory, to reduce the size of the term frequency table, retaining only the K most significant rows of the
frequency table, where K is usually taken to be around a few hundred (e.g., 200) for large document collections.
Notice that such a reduction, which takes a |D| × |T| matrix as input and represents it by a much smaller K × K matrix,
leads to some information loss, so we must ensure that only the least significant parts of the frequency
table are discarded.
Such a method for matrix transformation and SVD construction has been worked out successfully. The detailed
method is rather sophisticated and beyond the scope of this chapter; however, well-known SVD algorithms are
freely available through packages such as MATLAB and LAPACK.
In general, the latent semantic indexing method consists of the following basic steps.
1. Create a term frequency matrix, frequency_matrix.
2. Compute the singular value decomposition of frequency_matrix by splitting it into three smaller matrices, U, S, V, where U and V are orthogonal matrices (i.e., U^T U = I) and S is a singular (i.e., diagonal) matrix.
3. For each document d, identify its reduced vector, which consists of the entries of the frequency matrix whose corresponding rows have not been eliminated from the singular matrix S.
4. Store the set of all such vectors, and create indices for them using advanced multidimensional indexing techniques.
With singular value decomposition and multidimensional indexing, the transformed document vectors can be used to
compare the similarity of two documents or to find the top n matches for a query.
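As an illustration of steps 2 and 3, the sketch below uses NumPy's SVD routine to reduce a terms-by-documents frequency matrix (rows are terms, columns are documents, as in Table 9.1) to K-dimensional document vectors; it is a minimal rendering of the reduction under these assumptions, not the full indexing method.

    import numpy as np

    def lsi_document_vectors(freq_matrix, k):
        # Split the frequency matrix into U, S, V^T and keep only the
        # k largest singular values (the most significant rows).
        U, s, Vt = np.linalg.svd(freq_matrix, full_matrices=False)
        # Each column is one document, now represented in k dimensions.
        return np.diag(s[:k]) @ Vt[:k, :]

    A = np.array([[321.0, 84, 31], [354, 91, 71], [15, 32, 167]])
    vectors = lsi_document_vectors(A, 2)
    print(vectors.shape)   # (2, 3): three documents, two dimensions each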
Other text retrieval indexing techniques
Several other text retrieval indexing techniques are also widely adopted, including inverted indices and
signature files.
An inverted index is an index structure widely used in industry for indexing text documents. It maintains two
hash-indexed or B+-tree indexed tables: a document table and a term table. The former (the document table) consists of a
set of document records, each containing two fields: doc_id and posting_list, where the posting list is a list of terms (or
pointers to terms) that occur in the document, sorted according to some relevance measure. The latter (the term table)
consists of a set of term records, each containing two fields: term_id and posting_list, where the posting list specifies a
list of identifiers of the documents in which the term appears. With such an organization, it is easy to answer queries like
"find all the documents associated with a given set of terms" or "find all the terms associated with a given set of documents". For
example, to find all the documents associated with a set of terms, one can first find in the term table a list of document
identifiers for each term, and then intersect these lists to obtain the set of relevant documents, as sketched below. Inverted indices
are easy to implement, but they do not handle polysemy and synonymy well, and the posting lists
can become rather long, making the storage requirement quite large.
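A minimal sketch of the term table and the intersection query just described, assuming a dictionary of sets in place of hash or B+-tree structures and ignoring relevance ordering:

    from collections import defaultdict

    def build_term_table(docs):
        # docs maps doc_id -> list of (preprocessed) terms; the result
        # maps each term to its posting list of document identifiers.
        term_table = defaultdict(set)
        for doc_id, terms in docs.items():
            for term in terms:
                term_table[term].add(doc_id)
        return term_table

    def docs_with_all_terms(term_table, query_terms):
        # Fetch one posting list per query term, then intersect them.
        postings = [term_table.get(t, set()) for t in query_terms]
        return set.intersection(*postings) if postings else set()

    table = build_term_table({1: ["data", "mining"], 2: ["data", "warehouse"]})
    print(docs_with_all_terms(table, ["data", "mining"]))   # {1}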
A signature file is a file that stores a signature record for each document in the database. Each signature
has a fixed size of b bits. A simple encoding scheme works as follows: every bit of a document's signature is initialized to 0,
and a bit is set to 1 if the corresponding term appears in the document. A signature S1 matches another signature S2 if
every bit set in S2 is also set in S1. Since there are usually more terms than available bits,
multiple terms are mapped to the same bit. Such many-to-one mapping makes search rather expensive, since
a document whose signature matches the signature of a query does not necessarily contain the query's set of
keywords: the document has to be retrieved, parsed, stemmed, and checked. The scheme can be improved with a good
signature encoding, by first performing a frequency analysis, stemming, and stop-word filtering, and then
using hashing and superimposed coding techniques to encode the list of terms into a bit representation.
Nevertheless, the many-to-one mapping problem still exists, which is the major disadvantage of the approach.
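A toy sketch of signature construction and matching, assuming each term is hashed to one bit position of a b-bit word (superimposed coding; the hashing choice is illustrative only):

    def make_signature(terms, b=64):
        # Superimpose the terms' bit positions; several terms may be
        # mapped onto the same bit.
        sig = 0
        for term in terms:
            sig |= 1 << (hash(term) % b)
        return sig

    def may_contain(doc_sig, query_sig):
        # Every bit set in the query must also be set in the document.
        # Because of the many-to-one bit mapping this can give false
        # positives, so matching documents must still be fetched,
        # parsed, stemmed, and checked.
        return doc_sig & query_sig == query_sig

    doc = make_signature(["data", "mining", "warehouse"])
    print(may_contain(doc, make_signature(["data", "mining"])))   # True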
9.4.2 Text mining: keyword-based association and document classification
Keyword-based association analysis
Text data consists of structured, semi-structured, or unstructured text. Term-level text mining extracts associations among words or domain terms rather than among manually tagged concepts. The association generation process detects either compounds, i.e., domain-dependent terms such as [wall, street] or [treasury, secretary, james, baker], or uninterpretable associations such as [dollars, shares, exchange, total, commission, stake, securities].
Term-level text mining attempts to combine the advantages of two extremes. On the one hand, there is no need for human effort in tagging documents, and we do not lose most of the information present in the document, as happens in the tagged-documents approach. On the other hand, the number of meaningless results is greatly reduced, and the execution time of the mining algorithms is also reduced.
Document classification analysis
9.5 Mining the World-Wide-Web
With the fast advances in computer, network, satellite, and information technologies, the World-Wide-Web (or simply,
the Web) has become increasingly popular and important in today's society. With a vast amount of information
available on the Internet, and many on-line information services flourishing around it, the Web serves as a
huge, widely distributed, global information service center for news, advertisements, consumer information, financial
management, education, government, e-commerce, and many other services. Notably, the Web contains not only
a huge collection of documents but also a rich and dynamic collection of hyperlink information and access and usage
information. Such a great wealth of information provides rich sources for data mining.
However, the Web also poses great challenges for effective resource and knowledge discovery, based on the following
observations.
1. The Web seems too huge for effective data warehousing and data mining. The size of the Web is on the
order of hundreds of terabytes and is still growing rapidly. Many organizations and societies place most of their
publicly accessible information on the Web. It is impossible to set up a data warehouse to replicate, store, or
integrate all the data on the Web.
2. The complexity of Web pages is far greater than that of any traditional text document collection. Web
pages lack a unifying structure, and they contain far more variation in authoring style and content than
books and other traditional text-based documents. The Web can be considered a huge digital library; however,
the tremendous number of documents in this library are not arranged in any particular order.
There is no category index, nor title, author list, cover page, or table of contents. It can be a real challenge
to search for the information you want in such a library!
3. The Web is a highly dynamic information source. Not only does the Web grow at a rapid pace, but its
information is also updated constantly. News, stock market, company advertisement, and Web service pages
are updated regularly, and linkage information and access records are updated frequently as well.
4. The Web serves a broad diversity of user communities. The Internet currently connects about 50 million
workstations, and the user community is still expanding rapidly. Users may have very different backgrounds, interests,
and purposes of usage. Most users may not have good knowledge of the structure of the information
network, may not be aware of the heavy cost of a particular search, may easily get lost groping in the
"darkness" of the network, and may easily get bored taking many hops and waiting impatiently for a piece
of information.
5. Only a small portion of the information on the Web is truly relevant or useful. It is said that 99% of the Web
is useless to 99% of its users. Although this may not be obvious to everyone, it is true that a particular
person is interested in only a tiny portion of the Web, while the rest of the Web contains junk and undesirable
material that may swamp desired search results. How can we find the portion of the Web that is truly relevant
to our interests? How can we search for high-quality Web pages on a given topic?
These challenges have prompted flourishing research into the efficient and effective discovery and use of resources
on the Internet.
There are many index-based Web search engines, which crawl the Web, index Web pages, build and store
huge keyword-based indices, and help users locate sets of Web pages containing given keywords.
With such search engines, an experienced user may be able to quickly locate documents by providing a set of tightly
constrained keywords and phrases. However, current keyword-based search engines suffer from several deficiencies.
First, a topic of any breadth may easily contain hundreds of thousands of documents, which may lead to a huge
number of document entries returned by a search engine, many of which are only marginally relevant to the
topic or of very poor quality. Second, many documents that are highly relevant to a topic may not
contain the exact keywords. For example, using the keyword "data mining", one may find many Web pages related to
the "mining industry" but few papers related to knowledge discovery, statistical analysis, or machine learning,
although those topics are highly related to data mining. As another example, a search based on the keyword "search
engine" may not even find the most popular Web search engines, such as Yahoo!, AltaVista, or America Online, since
they barely describe themselves as search engines on their Web pages.
This indicates that current Web search engines are not sufficient for Web resource discovery, not to mention the more
challenging task of Web knowledge discovery, which is to find Web access patterns, Web structures, and the regularity
and dynamics of Web contents. Web mining aims to accomplish these tasks, helping people discover the structures
and dynamics of the WWW and find interesting, high-quality information among the oceans of Web pages.
In general, Web mining tasks can be classified into three categories: Web content mining, Web structure mining, and
Web usage mining. Alternatively, one may treat Web structure as a part of Web content, in which case Web mining can
simply be classified into two categories: Web content mining and Web usage mining.
In the following subsections, we discuss several important issues related to Web mining: mining the Web's link
structures, building a multi-layered Web information base, and Web log mining.
9.5.1 Mining the Web's link structures to identify authoritative Web pages
As discussed above, for any broad Web search topic, a current Web search engine often returns a large number of
Web pages. Many of these pages, though relevant, may be of rather low quality. Thus, besides the notion of relevance,
it is highly desirable to introduce a notion of authority into Web topic-oriented search. That is, the search task is
not only to locate a set of relevant pages, but also to identify the relevant pages of high quality.
How can we automatically identify authoritative Web pages on a topic?
Interestingly, the secret of authority hides in the Web's page linkages. The Web consists not only of pages but
also of hyperlinks pointing from one page to another. This hyperlink structure contains an enormous amount of latent
human annotation that can help automatically infer the notion of authority. When the author of one Web page creates
a hyperlink pointing to another page, this generally represents the author's endorsement of that page. The collective
endorsement of a page by different authors on the Web may indicate its importance, and may
naturally lead to the discovery of authoritative Web pages. The tremendous amount of Web linkage
information thus provides rich information about the relevance, quality, and structure of the Web's contents, and
hence a rich source for Web mining.
This idea has motivated some interesting studies on mining authoritative pages on the Web. However, unlike
journal citations, the Web's linkage structure has some unique features. First, not every hyperlink represents the
endorsement we seek: some links are created for other purposes, such as navigation or paid advertisement.
Still, if the majority of hyperlinks are for endorsement, the collective judgment will dominate. Second,
for commercial or competitive reasons, one authority seldom links to rival authorities in the
same field. For example, Coca-Cola may prefer not to endorse its competitor Pepsi by linking to Pepsi's Web pages,
and similarly for Honda and Toyota. Third, authoritative pages are seldom particularly self-descriptive. For example,
the main Web page of Yahoo! may not contain an explicit self-description such as "Web search engine".
These properties of Web link structures have led researchers to consider another important category of Web pages: hubs.
A hub is a Web page (or set of Web pages) that provides collections of links to authorities. Hub pages may not
be prominent themselves, and few links may point to them; however, they provide links to a
collection of prominent sites on a common topic. Such pages can be lists of recommended links on individual home
pages, such as recommended reference sites on a course home page, or professionally assembled resource lists on
commercial sites. Hubs implicitly confer authority on pages covering a focused topic. In general, a good hub
is a page that points to many good authorities, and a good authority is a page pointed to by many good hubs.
This mutual reinforcement relationship between hubs and authorities supports the mining of authoritative Web pages and
the automated discovery of high-quality Web structures and resources.
Based on these ideas, an interesting algorithm, called HITS (for Hyperlink-Induced Topic Search), has been developed.
It proceeds as follows.
First, HITS uses the query terms to collect a starting set of, say, 200 pages from an index-based search engine;
these pages form the root set. Since many of these pages are presumably relevant to the search topic, some of them should contain links to most of
the prominent authorities. Therefore, the root set can be expanded into a base set by including all the pages that
the root-set pages link to, and all the pages that link to a page in the root set, up to a designated size cutoff, such
as 1,000 to 5,000 pages.
Second, a weight-propagation phase is initiated. This is an iterative process that determines numerical estimates of
hub and authority weights. Since links between two pages within the same Web domain (i.e., sharing the
same first level in their URLs) often serve a navigational function and thus do not confer authority, such links are
excluded from the weight-propagation analysis.
We first associate a nonnegative authority weight a_p and a nonnegative hub weight h_p with each page p in the
base set, and initialize all a and h values to a uniform constant. The weights are normalized so that an invariant
is maintained: the squares of all the weights sum to 1. The authority and hub weights are updated based on Equations (9.5)
and (9.6):

    a_p = Σ_{q: q→p} h_q                    (9.5)

    h_p = Σ_{q: p→q} a_q                    (9.6)
Equation (9.5) implies that if a page is pointed to by many good hubs, its authority weight should increase (it is the
sum of the current hub weights of all the pages pointing to it). Equation (9.6) implies that if a page points to
many good authorities, its hub weight should increase (it is the sum of the current authority weights of all the pages
it points to).
These equations can be written in matrix form as follows. Let us number the pages {1, 2, ..., n} and define
their adjacency matrix A to be the n × n matrix in which A(i, j) is 1 if page i links to page j, and 0 otherwise. Similarly,
we define the authority weight vector a = (a_1, a_2, ..., a_n) and the hub weight vector h = (h_1, h_2, ..., h_n). Thus, we
have

    h = A a                                  (9.7)

    a = A^T h                                (9.8)

Unfolding these two equations k times, we have

    h = A a = (A A^T) h = (A A^T)^2 h = ... = (A A^T)^k h        (9.9)

    a = A^T h = (A^T A) a = (A^T A)^2 a = ... = (A^T A)^k a      (9.10)
According to linear algebra, these two sequences of iterations, when normalized, converge to the principal
eigenvectors of A A^T and A^T A, respectively. This also shows that the authority and hub weights are intrinsic
features of the collection of linked pages, not influenced by the initial weight settings.
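The iteration is easy to express with matrix operations. A minimal sketch in NumPy, using a fixed number of iterations in place of a convergence test (the four-page adjacency matrix below is a made-up example):

    import numpy as np

    def hits(A, iterations=50):
        # A[i, j] = 1 iff page i links to page j.
        n = A.shape[0]
        a = np.ones(n) / np.sqrt(n)   # authority weights
        h = np.ones(n) / np.sqrt(n)   # hub weights
        for _ in range(iterations):
            a = A.T @ h                # Equation (9.5)
            a /= np.linalg.norm(a)     # keep the sum of squares equal to 1
            h = A @ a                  # Equation (9.6)
            h /= np.linalg.norm(h)
        return a, h

    # Pages 0 and 3 each link to pages 1 and 2.
    A = np.array([[0, 1, 1, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 1, 1, 0]], dtype=float)
    a, h = hits(A)   # pages 1 and 2 emerge as authorities, 0 and 3 as hubs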
Finally, the HITS algorithm outputs a short list of the pages with the largest hub weights and the pages with the largest
authority weights for the given search topic. Many experiments have shown that HITS provides surprisingly good
search results for a wide range of queries.
Although relying extensively on links leads to encouraging results, ignoring textual context can cause
difficulties. For example, HITS sometimes drifts when hubs contain multiple topics. It may also suffer from "topic hijacking"
when many pages from a single Web site point to the same popular page, giving that site too large a share
of the authority weight. Such problems can be overcome by replacing the sums of Equations (9.5) and (9.6) with
weighted sums, scaling down the weights of multiple links from within the same site, using anchor text (the text
surrounding hyperlink definitions in Web pages) to adjust the weight of the links along which authority is propagated,
breaking large hub pages into smaller units, and so on.
By analyzing both Web links and textual context information, it has been reported that systems based on the HITS
algorithm, such as Clever, and systems based on similar principles, such as Google, can achieve better-quality search
results than those generated by term-index engines such as AltaVista and those created by human ontologists, as in
Yahoo!.
9.5.2 Automatic classification of Web documents
9.5.3 Construction of multi-layered Web information-base
9.5.4 Web usage mining
Web usage records are registered as Web logs in Web servers' log files. The behavior of Web page readers is imprinted
in these log files. Analyzing and exploring the regularities in this behavior can
improve system performance, enhance the quality and delivery of Internet information services to the end user, and
identify populations of potential customers for electronic commerce.
Web servers register a (Web) log entry for every access they receive. The server usually saves the URL requested,
the IP address from which the request originated, and a timestamp. For Web-based e-commerce servers, huge
numbers of Web access log records are collected; a popular Web site can see its Web log grow by hundreds
of megabytes every day. Condensing these colossal files of raw Web log data in order to retrieve significant and useful
information is a nontrivial task, as sketched below. Because it is not easy to perform systematic analysis on such huge amounts of data,
most institutions have not been able to make effective use of Web access history for server performance
enhancement, system design improvement, or customer targeting in electronic commerce. However, many people
have realized the potential value of such data.
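As a first condensation step, a log analyzer parses the raw entries and aggregates them. A small sketch, assuming logs in the Common Log Format (the regular expression and function names are our own illustration):

    import re
    from collections import Counter

    # Common Log Format:
    # host ident user [timestamp] "METHOD url HTTP/x" status bytes
    LOG_ENTRY = re.compile(
        r'(\S+) \S+ \S+ \[(.*?)\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+')

    def page_popularity(log_lines):
        # Count successfully served URLs, one log entry per line.
        hits = Counter()
        for line in log_lines:
            m = LOG_ENTRY.match(line)
            if m and m.group(4) == "200":
                hits[m.group(3)] += 1
        return hits.most_common()

    log = ['127.0.0.1 - - [10/Oct/1999:13:55:36 -0700] '
           '"GET /index.html HTTP/1.0" 200 2326']
    print(page_popularity(log))   # [('/index.html', 1)]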
Using Web log files, studies have been conducted on analyzing system performance, improving system design,
understanding the nature of Web traffic, and understanding user reaction and motivation [FdG97, GC97, Sul97, TG97].
One innovative study has proposed adaptive sites: Web sites that improve themselves by learning from user access
patterns [PE97]. While it is encouraging and exciting to see the various potential applications of Web log file analysis,
it is important to know that the success of such applications depends on what, and how much, valid and reliable
knowledge can be discovered from large amounts of raw log data.
9.6 Summary
Work it out!
Exercises
1. Work it out!
Bibliographic Notes
Mining complex types of data has been a popular research topic, with many research papers and tutorials appearing
in conferences and journals on data mining and database systems.
Multidimensional generalization and mining of complex types of data in object-oriented and object-relational
databases by the construction of object cubes was proposed by Han, Nishio, et al. [HNKW98]. A method for constructing
multiple-layered databases by generalization-based data mining techniques for handling semantic heterogeneity was
proposed by Han, Ng, et al. [HNFD98].
Lu, Han, and Ooi [LHO93] proposed a generalization-based spatial data mining method based on attribute-oriented
induction. Koperski and Han [KH95] proposed a progressive refinement method for mining spatial association
rules. Knorr and Ng [KN96] presented a method for mining aggregate proximity relationships and commonalities in
spatial databases. Spatial classification and trend analysis methods have been developed by Ester et al. [EKSX97].
Spatial clustering methods have been a focused topic in recent data mining research, with quite a few interesting
methods introduced, including distance-based methods, such as [NH94, EKX95, Hua98], hierarchical methods, such
as [ZRL96, GRS98, GRS99, KHK99], density-based methods, such as [EKSX96, ABKS99], and grid-based methods,
such as [WYM97, AGGR98, SCZ98]. Knorr and Ng [KN98] introduced the notion of distance-based outliers and
developed several algorithms for their efficient mining. For surveys of spatial data mining methods, one may refer to
Koperski, Adhikary, and Han [KAH96] and Ester, Kriegel, and Sander [EKS97]. A spatial data mining system
prototype, GeoMiner, was developed by Han, Koperski, and Stefanovic [HKS97].
For the analysis of raster or image data, Fayyad and Smyth [FS93] developed a classification method to analyze high-resolution
radar images for the identification of volcanoes on Venus. Fayyad et al. [FDW96] applied decision tree methods
to the classification of galaxies, stars, and other stellar objects in the Palomar Observatory Sky Survey (POSS-II)
project. Stolorz and Dean's QuakeFinder (KDD'96) is a data mining system for detecting earthquakes from remotely sensed imagery.
Agrawal and Srikant [AS95] developed an Apriori-like technique for mining sequential patterns. Mannila et
al. [MTV95] consider frequent episodes in sequences, where episodes are essentially acyclic graphs of events whose
edges specify the temporal before-and-after relationship but carry no timing-interval restrictions. Lu, Han, and Feng
[LHF98] proposed inter-transaction association rules, which are implication rules whose two sides are totally ordered
episodes with timing-interval restrictions (on the events in the episodes and on the two sides). Bettini et al. [BWJ98]
consider a generalization of inter-transaction association rules. Mining partial periodicity was studied by Han, Dong, and Yin
[HDY99]. Özden et al. [ORS98] studied methods for mining cyclic association rules. Sequence pattern mining for
plan failures was proposed by Zaki, Lesh, and Ogihara [ZLO98]. Plan mining by divide-and-conquer was studied by Han, Yang, and Kim [HYK99].
Information retrieval and text analysis methods have been introduced in many textbooks and surveys, including
Salton and McGill [SM83], Salton [Sal89], Yu and Meng [YM97], Raghavan [Rag97], Subrahmanian [Sub98], and Kleinberg
and Tomkins [KT99]. The latent semantic indexing method for document similarity analysis was developed by
Deerwester et al. [DDF+90]. Feldman and Hirsh [FH98] studied methods for mining association rules in text databases.
The theory and practice of multimedia database systems have been introduced in many textbooks and surveys,
including Subrahmanian [Sub98] and Yu and Meng [YM97]. A multimedia data mining system prototype, MultiMediaMiner, was developed by Zaïane et al. [ZHL+98].
Mining the Web's link structures to recognize authoritative Web pages was studied by Chakrabarti et al. [CDK+99]
and Kleinberg and Tomkins [KT99]. A Web mining language, WebML, was proposed by Zaïane and Han [ZH98].
A multi-layered database approach for constructing a Web warehouse was studied by Zaïane and Han [ZH95]. Web log
mining was studied by Zaïane, Xin, and Han [ZXH98].
Bibliography

[ABKS99] M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data, pages 49-60, Philadelphia, PA, June 1999.

[AGGR98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 94-105, Seattle, Washington, June 1998.

[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487-499, Santiago, Chile, September 1994.

[AS95] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering, pages 3-14, Taipei, Taiwan, March 1995.

[BWJ98] C. Bettini, X. Sean Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38, 1998.

[CDK+99] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg. Mining the Web's link structure. COMPUTER, 32:60-67, 1999.

[DDF+90] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990.

[EKS97] M. Ester, H.-P. Kriegel, and J. Sander. Spatial data mining: A database approach. In Proc. Int. Symp. Large Spatial Databases (SSD'97), pages 47-66, Berlin, Germany, July 1997.

[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. In Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining (KDD'96), pages 226-231, Portland, Oregon, August 1996.

[EKSX97] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. Density-connected sets and their application for trend detection in spatial databases. In Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (KDD'97), pages 10-15, Newport Beach, California, August 1997.

[EKX95] M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. In Proc. 4th Int. Symp. Large Spatial Databases (SSD'95), pages 67-82, Portland, Maine, August 1995.

[FdG97] R. Fuller and J. de Graaff. Measuring user motivation from server log files. http://www.microsoft.com/usability/webconf/fuller/fuller.htm, 1997.

[FDW96] U. M. Fayyad, S. G. Djorgovski, and N. Weir. Automating the analysis and cataloging of sky surveys. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 471-493. AAAI/MIT Press, 1996.

[FH98] R. Feldman and H. Hirsh. Finding associations in collections of text. In R. S. Michalski, I. Bratko, and M. Kubat, editors, Machine Learning and Data Mining: Methods and Applications, pages 223-240. John Wiley & Sons, 1998.

[FS93] U. Fayyad and P. Smyth. Image database exploration: Progress and challenges. In Proc. Knowledge Discovery in Databases Workshop, pages 14-27, Washington, D.C., 1993.

[GC97] J. Graham-Cumming. Hits and miss-es: A year watching the Web. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.

[GRS98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 73-84, Seattle, Washington, June 1998.

[GRS99] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proc. 1999 Int. Conf. Data Engineering, pages 512-521, Sydney, Australia, March 1999.

[HCC93] J. Han, Y. Cai, and N. Cercone. Data-driven discovery of quantitative rules in relational databases. IEEE Trans. Knowledge and Data Engineering, 5:29-40, 1993.

[HDY99] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proc. 1999 Int. Conf. Data Engineering (ICDE'99), pages 106-115, Sydney, Australia, April 1999.

[HF96] J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in data mining. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 399-421. AAAI/MIT Press, 1996.

[HGY98] J. Han, W. Gong, and Y. Yin. Mining segment-wise periodic patterns in time-related databases. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 214-218, New York City, NY, August 1998.

[HKS97] J. Han, K. Koperski, and N. Stefanovic. GeoMiner: A system prototype for spatial data mining. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 553-556, Tucson, Arizona, May 1997.

[HNFD98] J. Han, R. T. Ng, Y. Fu, and S. Dao. Dealing with semantic heterogeneity by generalization-based data mining techniques. In M. P. Papazoglou and G. Schlageter, editors, Cooperative Information Systems: Current Trends & Directions, pages 207-231. Academic Press, 1998.

[HNKW98] J. Han, S. Nishio, H. Kawano, and W. Wang. Generalization-based data mining in object-oriented databases using an object-cube model. Data and Knowledge Engineering, 25:55-97, 1998.

[HSK98] J. Han, N. Stefanovic, and K. Koperski. Selective materialization: An efficient method for spatial data cube construction. In Proc. 1998 Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD'98) [Lecture Notes in Artificial Intelligence, 1394, Springer Verlag, 1998], Melbourne, Australia, April 1998.

[Hua98] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2:283-304, 1998.

[HYK99] J. Han, Q. Yang, and E. Kim. Plan mining by divide-and-conquer. In Proc. 1999 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'99), pages 8:1-8:6, Philadelphia, PA, May 1999.

[KAH96] K. Koperski, J. Adhikary, and J. Han. Knowledge discovery in spatial databases: Progress and challenges. In Proc. 1996 SIGMOD Workshop Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pages 55-70, Montreal, Canada, June 1996.

[KH95] K. Koperski and J. Han. Discovery of spatial association rules in geographic information databases. In Proc. 4th Int. Symp. Large Spatial Databases (SSD'95), pages 47-66, Portland, Maine, August 1995.

[KHK99] G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32:68-75, 1999.

[KN96] E. Knorr and R. Ng. Finding aggregate proximity relationships and commonalities in spatial data mining. IEEE Trans. Knowledge and Data Engineering, 8:884-897, December 1996.

[KN98] E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 392-403, New York, NY, August 1998.

[KT99] J. Kleinberg and A. Tomkins. Application of linear algebra in information retrieval and hypertext analysis. In Proc. 18th ACM Symp. Principles of Database Systems (PODS), pages 185-193, Philadelphia, PA, May 1999.

[LHF98] H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. In Proc. 1998 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), pages 12:1-12:7, Seattle, Washington, June 1998.

[LHO93] W. Lu, J. Han, and B. C. Ooi. Knowledge discovery in large spatial databases. In Proc. Far East Workshop Geographic Information Systems, pages 275-289, Singapore, June 1993.

[MTV95] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining, pages 210-215, Montreal, Canada, August 1995.

[NH94] R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 144-155, Santiago, Chile, September 1994.

[ORS98] B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In Proc. 1998 Int. Conf. Data Engineering (ICDE'98), pages 412-421, Orlando, FL, February 1998.

[PE97] M. Perkowitz and O. Etzioni. Adaptive sites: Automatically learning from user access patterns. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.

[Rag97] P. Raghavan. Information retrieval algorithms: A survey. In Proc. 1997 ACM-SIAM Symp. Discrete Algorithms, 1997.

[Sal89] G. Salton. Automatic Text Processing. Addison-Wesley, 1989.

[SCZ98] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 428-439, New York, NY, August 1998.

[SM83] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[Sub98] V. S. Subrahmanian. Principles of Multimedia Database Systems. Morgan Kaufmann, 1998.

[Sul97] T. Sullivan. Reading reader reaction: A proposal for inferential analysis of Web server log files. In Proc. 3rd Conf. Human Factors & the Web, Denver, Colorado, June 1997.

[TG97] L. Tauscher and S. Greenberg. How people revisit Web pages: Empirical findings and implications for the design of history systems. International Journal of Human Computer Studies, Special issue on World Wide Web Usability, 47:97-138, 1997.

[WYM97] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proc. 1997 Int. Conf. Very Large Data Bases, pages 186-195, Athens, Greece, August 1997.

[YM97] C. T. Yu and W. Meng. Principles of Database Query Processing for Advanced Applications. Morgan Kaufmann, 1997.

[ZH95] O. R. Zaïane and J. Han. Resource and knowledge discovery in global information systems: A preliminary design and experiment. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 331-336, Montreal, Canada, August 1995.

[ZH98] O. R. Zaïane and J. Han. WebML: Querying the World-Wide Web for resources and knowledge. In Proc. Int. Workshop on Web Information and Data Management (WIDM'98), pages 9-12, Bethesda, Maryland, November 1998.

[ZHL+98] O. R. Zaïane, J. Han, Z. N. Li, J. Y. Chiang, and S. Chee. MultiMediaMiner: A system prototype for multimedia data mining. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 581-583, Seattle, Washington, June 1998.

[ZLO98] M. J. Zaki, N. Lesh, and M. Ogihara. PLANMINE: Sequence mining for plan failures. In Proc. 4th Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 369-373, New York, NY, August 1998.

[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, pages 103-114, Montreal, Canada, June 1996.

[ZTH99] X. Zhou, D. Truffet, and J. Han. Efficient polygon amalgamation methods for spatial OLAP and spatial data mining. In Proc. 6th Int. Symp. Large Spatial Databases (SSD'99), pages 167-187, Hong Kong, July 1999.

[ZXH98] O. R. Zaïane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. In Proc. Advances in Digital Libraries Conf. (ADL'98), pages 19-29, Santa Barbara, CA, April 1998.