DOLAP 2004, November 13, 2004

A New OLAP Aggregation Based on the AHC Technique

R. Ben Messaoud, O. Boussaid, S. Rabaséda
Laboratoire ERIC, Université de Lyon 2
5, avenue Pierre-Mendès-France, 69676 Bron Cedex, France
http://eric.univ-lyon2.fr

Complex data

Definition: data are considered complex if they are …
- Multi-format: information can be carried by different kinds of data (numeric, symbolic, texts, images, sounds, videos …)
- Multi-structure: structured, unstructured or semi-structured (relational databases, XML documents …)
- Multi-source: data come from different sources (distributed databases, web …)
- Multi-modal: the same information can be described differently (data in different languages …)
- Multi-version: data are updated through time (temporal databases, periodical inventory …)

General context

[Diagram: complex data, MDBMS, OLAP, data mining, OpAC]
- Huge volumes of complex data
- Warehousing complex data … OLAP facts as complex objects
- Analyzing complex data:
  - Current OLAP tools are not suited to processing complex data
  - Data mining is able to process complex data such as images, texts, videos …
- Coupling OLAP and data mining makes it possible to analyze complex data on-line
- New operator OpAC: Operator of Aggregation by Clustering (AHC)

Outline

0 Complex data and general context
1 Related work: Coupling OLAP and data mining
2 Objectives of the proposed operator
3 Formalization of the operator
4 Implementation and demonstration
5 Conclusion and future work

Related work

Three approaches for coupling OLAP and data mining:
- First approach: extending the query languages of decision support systems
- Second approach: adapting the multidimensional environment to classical data mining techniques
- Third approach: adapting data mining methods to multidimensional data

[Diagram: data mining, OLAP, DBMS]

Related work

These works proved that:
- Associating data mining to OLAP is a promising way to involve rich analysis tasks
- Data mining is able to extend the analysis power of OLAP

Use data mining to enhance OLAP tools in order to process complex data
- OpAC: a new OLAP operator based on a data mining technique

Objectives

Classic OLAP aggregation vs. OpAC aggregation

Classic OLAP:
- Summarizes numerical data in a fewer number of values
- Computes additive measures (Sum, Average, Max, Min …)

Example: Sales cube

  Location            Sales    Count
  Washington          $2520    120
    Bellingham        $700     32
    Bremerton         $400     20
    Olympia           $850     44
    Redmond           $250     9
    Seattle           $320     15
  California          $2410    129
    Berkeley          $820     41
    Beverly Hills     $910     50
    Los Angeles       $680     38

Objectives

Classic OLAP aggregation vs. OpAC aggregation

OpAC aggregation:
- What about aggregating complex objects?
- How to aggregate images, texts or videos with classic OLAP tools?
- Complex objects are not additive OLAP measures …

Example: Images cube

  Images shown: Orange coral; Nebraska, USA; Toco toucan; Maldives
  Aggregate of the images: ?

  Size      ASM
  3560px    0.016
  2340px    0.021
  4434px    0.014
  3260px    0.012

Objectives

How to aggregate complex objects?
- Using a data mining technique: AHC (Agglomerative Hierarchical Clustering)
- The AHC aggregates data
- The hierarchical aspect of the AHC

Objectives

[Figure: images described by "L1Normalized for high homogeneity" and "L1Normalized for low entropy"; homogeneity levels: Very high, High, Medium, Low, Very low]
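
The idea of aggregating non-additive objects by AHC can be illustrated with a minimal sketch. It assumes Euclidean distance and average linkage, since the slides do not state which dissimilarity metric or aggregation criterion is used by default. The image names, the descriptor values and the helper names (`average_linkage`, `ahc`, `cut`) are hypothetical and are not part of OpAC.

```python
from itertools import combinations
from math import dist

# Hypothetical descriptors (entropy, homogeneity) for four images;
# the values are invented for illustration, not taken from the cube.
images = {
    "img1": (0.10, 0.90),
    "img2": (0.15, 0.85),
    "img3": (0.80, 0.20),
    "img4": (0.85, 0.10),
}

def average_linkage(a, b, points):
    """Mean Euclidean distance between all cross-cluster pairs."""
    pairs = [dist(points[p], points[q]) for p in a for q in b]
    return sum(pairs) / len(pairs)

def ahc(points):
    """Merge the two closest clusters until one remains; return the merges."""
    clusters = [(label,) for label in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], points))
        merges.append((a, b, average_linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges

def cut(points, merges, k):
    """Replay the merges until only k clusters (aggregates) are left."""
    clusters = {(label,) for label in points}
    for a, b, _ in merges:
        if len(clusters) <= k:
            break
        clusters -= {a, b}
        clusters.add(tuple(sorted(a + b)))
    return sorted(clusters)

merges = ahc(images)
aggregates = cut(images, merges, k=2)
print(aggregates)
```

The list of merges plays the role of the hierarchy: cutting it at a different `k` gives a coarser or finer set of aggregates, which is the hierarchical aspect the slide refers to.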

Formalization

Notation:
- $D_i$: the $i$-th dimension of a data cube $C$
- $h_{ij}$: the $j$-th hierarchical level of the dimension $D_i$
- $g_{ijt}$: the $t$-th modality of $h_{ij}$

The set of individuals:

  $\Omega = \{ g_{ijt} \mid g_{ijt} \in h_{ij} \}$

The set of variables:
- A dimension retained for individuals cannot generate variables
- Only one hierarchical level of a dimension is allowed to generate variables
- $\Sigma$ is made of the variables $X$ such that $X(g_{ijt})$ = measure of $g_{srv}$ crossed with $g_{ijt}$, where $g_{srv} \in h_{sr}$, $s \neq i$, and $r$ is unique for each $s$

Formalization

Evaluation tools:
- Minimize the intra-cluster distances
- Maximize the inter-cluster distances

Inter- and intra-cluster inertia:
- $A_1, A_2, \ldots, A_k$ is a partition of $\Omega$
- $P(A_i)$ is the weight of $A_i$
- $G(A_i)$ is the gravity center of $A_i$

  $I_{intra}(k) = \sum_{i=1}^{k} I(A_i)$

  $I_{inter}(k) = \sum_{i=1}^{k} P(A_i) \, d(G(A_i), G(\Omega))$

Formalization

[Figure: inter-cluster and intra-cluster inertia curves; homogeneity levels: Very high, High, Medium, Low, Very low]
- Individuals: modalities from the dimension of images
- Variables:
  - L1Normalized values of images for all possible modalities of the entropy dimension
  - L1Normalized values of images for all possible modalities of the homogeneity dimension

Formalization

Results:
- Exploits the cube's facts describing images to construct groups of similar complex objects
- Highlights significant groups of objects by a clustering technique
- Clusters (aggregates) are defined from both dimensions and measures of a data cube
- Implementation of a prototype
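
The intra- and inter-cluster inertia from the Evaluation tools slide can be sketched as follows. This is only an illustration under stated assumptions: every individual has the same weight, so P(A_i) is the share of individuals in A_i; d is taken to be the squared Euclidean distance; and I(A_i) is taken to be the weighted sum of d between the members of A_i and G(A_i). The slides do not define these details, and the data points and function names are hypothetical.

```python
def gravity_center(cluster):
    """G(A): coordinate-wise mean of the points in the cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def sq_dist(x, y):
    """Assumed d: squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def inertia(cluster, total_n):
    """Assumed I(A): sum of d(x, G(A)) over x in A, each weighted 1/total_n."""
    g = gravity_center(cluster)
    return sum(sq_dist(x, g) for x in cluster) / total_n

def intra_inertia(partition):
    """I_intra(k): sum of I(A_i) over the k clusters."""
    n = sum(len(c) for c in partition)
    return sum(inertia(c, n) for c in partition)

def inter_inertia(partition):
    """I_inter(k): sum of P(A_i) * d(G(A_i), G(all individuals))."""
    n = sum(len(c) for c in partition)
    everything = [x for c in partition for x in c]
    g_all = gravity_center(everything)
    return sum(len(c) / n * sq_dist(gravity_center(c), g_all)
               for c in partition)

# Hypothetical partition of four individuals into k = 2 clusters.
partition = [
    [(0.10, 0.90), (0.15, 0.85)],
    [(0.80, 0.20), (0.85, 0.10)],
]
intra = intra_inertia(partition)
inter = inter_inertia(partition)
total = inertia([x for c in partition for x in c], 4)
print(intra, inter, total)
```

Under these assumptions the two quantities add up to the total inertia of the individuals, so a partition that lowers the intra-cluster inertia raises the inter-cluster inertia by the same amount. This is why the two criteria on the slide can be read together when choosing the number of clusters.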

Implementation

Prototype:
- Data loading module:
  - Connects to a data cube on Analysis Services of MS SQL Server
  - Uses MDX queries to import information about the cube's structure
  - Extracts the data selected by the user
- Parameter setting interface:
  - Assists the user in extracting individuals and variables from the cube
  - Selects modalities and measures
  - Defines the clustering problem
- Clustering module:
  - Allows the definition of the clustering parameters, such as the dissimilarity metric and the aggregation criterion
  - Constructs the AHC
  - Plots the results of the AHC on a dendrogram

Implementation

Images dataset: 3000 images collected from the web
- Semantic annotation: description, subject and theme
- Descriptors of texture, such as:
  - ENT: Entropy
  - CON: Contrast
  - L1Normalized: Medium Color Characteristic
  - …
- Three color channels: RGB

Implementation

Demonstration

Conclusion

- OpAC is a possible way to realize on-line analysis over complex data
- OpAC aggregates complex objects
- Aggregates (clusters) are defined from both dimensions and measures of a data cube
- Prototype available at: http://bdd.univ-lyon2.fr/?page=logiciel&id=5

Future work

- The current evaluation tool may present some limits
  - Use other evaluation indicators to evaluate the quality of partitions
  - Assist the user in finding the best number of clusters
- Exploit the aggregates generated by OpAC in order to reorganize the cube's dimensions
  - Get a new cube with remarkable regions
- Use other data mining techniques to enhance the power of OLAP with explanation and prediction capabilities

The End