Survey
Asian Journal of Information Technology 4 (5): 466-470, 2005
© Grace Publications, 2005

A Survey of Data Mining Query Language (DMQL) for Multimedia Databases

Bilal A. H. Abul-Huda, Basel Bani Ismail, Osama Al-Horani and Ahmad El-Mustafah
Department of Computer Information System, College of Information Technology and Computer Science, Yarmouk University, P.O. Box 4762, 21163 Irbid, Jordan

Abstract: Data mining is the process of discovering interesting knowledge (patterns) from large amounts of data stored in databases, data warehouses or other information repositories. Multimedia data mining is the mining of high-level multimedia information and knowledge from large multimedia databases. Mining multimedia data is, however, still at an experimental stage. Substantial progress in data mining and data warehousing research has been witnessed in the last few years. Numerous research and commercial systems have been developed for mining knowledge in relational databases and data warehouses (Fayyad et al., 1996). Although multimedia has been a major focus for many researchers around the world, data mining from multimedia databases is still in its infancy and has so far produced few results. Many techniques for representing, storing, indexing and retrieving multimedia data have been proposed. However, few researchers have ventured into the multimedia data mining field. The emerging data mining tools and systems create a demand for a powerful data mining query language. The concepts of such a language for relational databases are discussed in (Han et al., 1996). With the increasing popularity of multimedia databases, it is important to design a data mining query language for such databases. A multimedia data mining system prototype, MultiMediaMiner, has been designed and developed.
It includes the construction of a multimedia data cube, which facilitates multi-dimensional analysis of multimedia data based primarily on visual content, and the mining of multiple kinds of knowledge in image and video databases, including characterization (summarization), discrimination (comparison), classification, association and clustering.

Key words: Data mining, Data mining query language, Knowledge discovery, Data warehousing, Data cube, Multimedia database and Information retrieval

Introduction

Data mining means the discovery of knowledge and useful information from the large amounts of data stored in databases. It involves an integration of techniques from multiple disciplines, such as database technology, statistics, machine learning, neural networks and information retrieval. A desired feature of data mining systems is the ability to support ad hoc and interactive data mining in order to facilitate flexible and effective knowledge discovery. A data mining query language can be designed to support such a feature. The importance of designing a good data mining query language can also be seen by observing the history of relational database systems. A lot of research has been conducted on data mining in relational databases to mine specific kinds of knowledge. Some experimental data mining systems have also been developed for relational databases, such as DBMiner, Explora, MineSet and Quest (Elfeky et al., 2000). It is reasonably easy to design a data mining language for relational databases. It is a much greater challenge to design a language for knowledge mining in other kinds of databases, such as transactional databases, object-oriented databases, spatial databases and multimedia databases (Han et al., 1996).
Multimedia data mining is a subfield of data mining that deals with the extraction of implicit knowledge, multimedia data relationships or other patterns not explicitly stored in multimedia databases. Multimedia data mining is not limited to images, video or sound, but encompasses text mining as well. There has been interesting research in text mining from text documents and in querying and mining Web or semi-structured data. The availability of affordable imaging technology is leading to an explosion of data in the form of images and video. Many relational databases now include multimedia information, such as photos of customers or videos about real estate. Global information networks like the Internet, as well as specialized databases, are filled with a variety of multimedia, medical images, satellite pictures and so on, which calls for means to retrieve, classify and understand this data. With huge amounts of multimedia data collected by video cameras and audio recorders, satellite telemetry systems, remote sensing systems, surveillance cameras and other data collection tools, it is crucial to develop tools for discovering interesting knowledge in large multimedia databases. Moreover, with the popularity of multimedia objects in extended and object-relational databases, it is becoming important to mine knowledge related to both multimedia and relational data in large databases and, where possible, to deal with them in the same manner (Osmar et al., 1998).
For similarity searching in multimedia data, there are two main families of multimedia indexing and retrieval systems: (1) description-based retrieval systems, which build indices and perform object retrieval based on image descriptions, such as keywords, size, caption and time of creation, and (2) content-based retrieval systems, which support retrieval based on the image content, such as colour histogram, texture, shape, objects and wavelet transforms. In a content-based retrieval system, there are often two kinds of queries: image sample-based queries and image feature specification queries. Since multimedia retrieval is based on similarity calculations over semantics and media content, exact matches are not expected. We view querying a multimedia database as a combination of information retrieval (IR), image matching and traditional database query processing, and it should be conducted through perpetual query reformulation to home in on the target results.

Query processing in multimedia databases is different from that in traditional database systems. Because the contents stored in traditional database systems are precise, query results are certain. In information retrieval, documents are represented as keyword lists. To retrieve a document, systems compare the keywords specified by users with the documents' keyword lists. Images, in a similar manner, are represented as features, and image matching is carried out by comparing these features. In both information retrieval and image matching, results are based on similarity calculations: similarity between the semantic meanings of keywords and between image features, respectively (Wen-Syan et al., 1997).

The current MultiMediaMiner system includes four data mining modules for mining knowledge in image and video databases: characterization (summarization), discrimination (comparison), classification and association. Additional modules are in the design and development stage (Osmar et al., 1998).
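The similarity-based matching described in this section can be illustrated with a small sketch of an image sample-based query over colour histograms. This is not the method of any system surveyed here: the function names, the choice of 8 bins per colour channel and the use of histogram intersection as the similarity measure are all assumptions made for illustration. Images are modelled simply as lists of (R, G, B) pixels.

```python
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255

# Assumption for this sketch: 8 bins per channel, i.e. 8 x 8 x 8 = 512 bins.
BINS_PER_CHANNEL = 8


def colour_histogram(pixels: Sequence[Pixel]) -> List[float]:
    """Return a normalised colour histogram over quantised RGB bins."""
    counts = [0] * (BINS_PER_CHANNEL ** 3)
    step = 256 // BINS_PER_CHANNEL  # width of one bin on each channel
    for r, g, b in pixels:
        index = ((r // step) * BINS_PER_CHANNEL + g // step) * BINS_PER_CHANNEL + b // step
        counts[index] += 1
    total = len(pixels)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1]; 1.0 means identical normalised histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_by_similarity(
    query: Sequence[Pixel], database: Dict[str, Sequence[Pixel]]
) -> List[Tuple[str, float]]:
    """Rank stored images by similarity to the query image, best first."""
    q = colour_histogram(query)
    scored = [
        (name, histogram_intersection(q, colour_histogram(pixels)))
        for name, pixels in database.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy "images" of four pixels each.
    images = {
        "sky": [(10, 20, 250)] * 4,
        "sea": [(10, 20, 250)] * 3 + [(0, 200, 0)],
        "grass": [(0, 200, 0)] * 4,
    }
    for name, score in rank_by_similarity(images["sky"], images):
        print(name, score)
```

The result is a ranked list rather than a yes/no answer, which is why exact matches are not expected and why the user may refine the query and search again.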
Design of a Data Mining Language: The philosophy of data mining may strongly influence the design of a data mining language. Hence, the main principle in designing a data mining query language is to support the specification of the following primitives: (1) the set of task-relevant data to be mined, (2) the kind of knowledge to be mined, (3) the background knowledge (such as concept hierarchy information) to be used in the discovery process, (4) the interestingness measures and thresholds for pattern evaluation and (5) the expected representation for visualizing the discovered patterns.

Specification of Task-Relevant Data: This involves specifying the following information: (1) the database or data warehouse name, (2) the database tables or data warehouse cubes, (3) the conditions for data selection, (4) the relevant attributes or dimensions and (5) the data grouping criteria.

Mining Different Kinds of Rules: Based on the above considerations, a data mining query language, DMQL, has been designed for mining several kinds of knowledge in multimedia databases. It consists of the specifications of four major primitives in data mining: (1) the set of data relevant to a data mining process, (2) the kind of knowledge to be discovered, (3) the background knowledge and (4) the justification of the interestingness of the knowledge (i.e., thresholds). The first primitive, the set of relevant data, can be specified in a way similar to a relational query, which is used to fetch the set of relevant data from the database. The second primitive, the kind of knowledge to be discovered, may include generalized relations, characteristic rules, discriminant rules, classification rules, association rules, etc., which are detailed as follows (Han et al., 1996).

A generalized relation is a relation obtained by generalizing from a large set of low-level data. A generalized relation can then be used for the extraction of different kinds of rules or be viewed at high concept levels from different angles.
A characteristic rule is an assertion that characterizes a concept satisfied by all or most of the examples in the class under examination (called the target class). For example, the symptoms of a specific disease can be summarized by a characteristic rule.

A discriminant rule is an assertion that discriminates a concept of the class being examined (the target class) from other classes (called contrasting classes). For example, to distinguish one disease from others, a discriminant rule should summarize the symptoms that discriminate this disease from the others.

Classification rules are a set of rules that classify the relevant set of data. They are usually obtained by first classifying the data (i.e., obtaining a preferred classification scheme) and then returning a set of rules associated with each class or subclass. For example, one may classify diseases and provide the symptoms that describe each class or subclass.

An association rule describes association relationships among a set of data (patterns). For example, one may discover a set of symptoms that frequently occur together with certain kinds of diseases and then study the reasons behind this.

The third primitive, the background knowledge, is a set of concept hierarchies or generalization operators that provide corresponding higher-level concepts and assist generalization processes. The fourth primitive, the interestingness or significance of the knowledge to be discovered, can be specified as a set of mining thresholds that depend on the kinds of rules to be mined.

Specification of Background Knowledge (Concept Hierarchies): A concept hierarchy (or lattice) defines a sequence of mappings from a set of concepts to their higher-level correspondences. Concept hierarchies represent the background knowledge needed to control the generalization process, which is a preliminary step in most data mining algorithms.
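A concept hierarchy of this kind can be sketched as a mapping from each concept to its next-higher concept. The example below is only an illustration: the place names, the function names and the `max_distinct` parameter (a cap on how many distinct values may remain after generalization) are hypothetical and are not taken from DMQL or from any of the systems surveyed.

```python
from typing import Dict, List

# Hypothetical hierarchy: each concept maps to its next-higher concept.
PARENT: Dict[str, str] = {
    "University Street": "Irbid",
    "Palestine Street": "Irbid",
    "Rainbow Street": "Amman",
    "Irbid": "Jordan",
    "Amman": "Jordan",
}


def generalize(value: str, levels: int = 1) -> str:
    """Climb `levels` steps up the hierarchy, stopping at the top concept."""
    for _ in range(levels):
        if value not in PARENT:
            break
        value = PARENT[value]
    return value


def generalize_column(values: List[str], max_distinct: int) -> List[str]:
    """Generalize all values one level at a time until at most
    `max_distinct` distinct values remain or nothing can climb further."""
    current = list(values)
    while len(set(current)) > max_distinct:
        climbed = [generalize(v) for v in current]
        if climbed == current:  # every value is already a top-level concept
            break
        current = climbed
    return current


if __name__ == "__main__":
    addresses = ["University Street", "Palestine Street", "Rainbow Street"]
    print(generalize_column(addresses, max_distinct=2))
```

Keeping the hierarchy as data, separate from the generalization routine, reflects its role as background knowledge supplied to the mining process rather than built into it.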
Generalization of an attribute means replacing its value with a higher-level one based on a concept hierarchy tree. For example, a person's address can be generalized from a detailed address, such as the street, into a higher-level one, such as a district, a city or a country (Elfeky et al., 2000).

Specification of Interestingness and Thresholds: A data mining task may need to specify a set of thresholds to control its data mining process, including guiding an induction process, constraining the search for interesting knowledge and testing the interestingness or significance of the discovered knowledge. This requires the introduction of the fourth set of primitives, a set of data mining thresholds, in DMQL (Han et al., 1996). Several kinds of thresholds should be specified to control the mining process. The attribute threshold is the maximum number of distinct values allowed for an attribute in the generalized objects. It is specified independently of the kind of rules, since it is applied in the generalization step before the kind of rules to be mined is considered. The other kinds of thresholds depend on the type of rules being mined. For example, mining association rules requires a support threshold, which is the minimum support value of a rule, and a confidence threshold, which is the minimum confidence value of a rule. Similarly, mining classification rules requires a classification threshold such that further classification of a set of classified objects becomes unnecessary if a substantial portion (no less than the specified threshold) of the classified objects belongs to the same class (Elfeky et al., 2000).

Presentation and Visualization of Discovered Patterns: For data mining to be effective, data mining systems should be able to display the discovered patterns in multiple forms, such as rules, tables, crosstabs, reports, charts, graphs, decision trees and cubes.
Allowing the visualization of discovered patterns in various forms can help users with different backgrounds to identify patterns of interest and to interact with or guide the system in further discovery.

A Database Mining System Prototype: The MultiMediaMiner system is based on DBMiner, an on-line analytical data mining system, and C-BIRD, a system for Content-Based Image Retrieval from Digital libraries. The DBMiner system currently contains the following five data mining functional modules: characterizer, comparator, associator, predictor and classifier. A general description of these functional modules is given in (Han et al., 1997). DBMiner applies multi-dimensional database structures (Han et al., 1997), attribute-oriented induction, multi-level association analysis (Han et al., 1995), statistical data analysis and machine learning approaches for mining these different kinds of rules in relational databases and data warehouses.

The C-BIRD system contains four major components: (i) Image Excavator (a web agent) for the extraction of images and videos from multimedia repositories, (ii) a pre-processor for the extraction of image features and the storage of precomputed data in a database, (iii) a user interface and (iv) a search kernel for matching queries with image and video features in the database. The database used by C-BIRD is kept in addition to the image repository and contains mainly meta-data extracted by the pre-processor and the Image Excavator, such as colour, texture and shape characteristics and automatically generated keywords.

MultiMediaMiner, the general architecture of which is shown in Fig. 1, inherits the C-BIRD database (Osmar et al., 1998). For each image collected, the database contains some descriptive information, a feature descriptor and a layout descriptor. The original image is not stored directly in the database; only its feature descriptors are stored.
The descriptive information encompasses fields such as: image file name, image URL, image and video type (e.g. gif, jpeg, bmp, avi, mpeg, ...), a list of all known web pages referring to the image (i.e. parent URLs), a list of keywords and a thumbnail used by the user interface for image and video browsing. The feature descriptor is a set of vectors, one for each visual characteristic. The main vectors are: a colour vector containing the colour histogram quantized to 512 colours (8 × 8 × 8 for R × G × B), an MFC (Most Frequent Colour) vector and an MFO (Most Frequent Orientation) vector. The MFC and MFO vectors contain 5 colour centroids and 5 edge orientation centroids for the 5 most frequent colours and the 5 most frequent orientations, respectively. The edge orientations used are 0°, 22.5°, 45°, 67.5°, 90° and so on. The layout descriptor contains a colour layout vector and an edge layout vector. Regardless of their original size, all images are assigned an 8 × 8 grid. The most frequent colour for each of the 64 cells is stored in the colour layout vector, and the number of edges for each orientation in each of the cells is stored in the edge layout vector. Other grid sizes, such as 4 × 4, 2 × 2 and 1 × 1, can be derived easily (Osmar et al., 1998).

The Image Excavator uses image contextual information, such as HTML tags in web pages, to derive keywords. By traversing on-line directory structures, like the Yahoo directory, it is possible to create hierarchies of keywords mapped onto the directories in which the image was found. These hierarchies are used as concept hierarchies for the keyword dimension in the multidimensional data cube (Osmar et al., 1998). The multimedia data cube can have many dimensions.
The following are some examples: the size of the image or video in bytes; the width and height of the frames (or pictures), which constitute 2 dimensions; the date on which the image or video was created (or last modified); the format type of the image or video; the frame sequence duration in seconds; the Internet domain of the image or video; the Internet domain of pages referencing the image or video (parent URL); the keywords; a colour dimension; an edge-orientation dimension; etc. (Osmar et al., 1998).

Fig. 1: General Architecture of MultiMediaMiner

The MultiMediaMiner system includes four functional mining modules: characterizer, comparator, classifier and associator. Many data mining techniques are used in the development of these modules, including data cube construction and search (Chaudhuri et al., 1997), attribute-oriented induction (Han et al., 1997) and mining multi-level association rules (Han et al., 1995). The functionalities of these modules are described as follows (Osmar et al., 1998).

MM-Characterizer: This module discovers a set of characteristic features at multiple abstraction levels from a relevant set of data in a multimedia database. It provides users with a multiple-level view of the data in the database with roll-up and drill-down capabilities. For example, the module may describe the general characteristics of image sequences based on the topic of the video, the topic being a high-level keyword defined in the concept hierarchy. The user can drill down along the topic dimension to find characteristics of the image sequences based on more concrete topics (Osmar et al., 1998).

MM-Comparator: This module discovers a set of comparison characteristics contrasting the features of different classes of the relevant sets of data in a multimedia database. It compares and distinguishes the general features of one set of data, known as the target class, from the other set(s) of data, known as the contrasting class(es).
For example, the module may show the differences in video duration and colour richness between videos served in the commercial Internet domain (com) and videos served in the education domain (edu) and created in July 1997 (Osmar et al., 1998).

MM-Associator: This module finds a set of association rules from the relevant set(s) of data in an image and video database. An association rule shows the frequently occurring patterns (or relationships) of a set of data items in a database. A typical association rule is of the form "X → Y [s%; c%]", where X and Y are sets of predicates, s% is the support of the rule (the probability that X and Y hold together among all the possible cases) and c% is the confidence of the rule (the conditional probability that Y is true given that X is true). For example, the module answers questions like: "what are the relationships among still images, the frequent colours used in them, their size and the keyword 'sky'?" One possible association rule among many to be found is "if an image is big and is related to sky, it is blue with a probability of 68%" or "if an image is small and is related to sky, it is dark blue with a probability of 55%" (Osmar et al., 1998).

MM-Classifier: This module classifies multimedia data based on some provided class labels, such as topics (based on keywords). The result is a classification of a large set of multimedia data together with a characteristic description of each class. This classification, represented as a decision tree, can also be used for prediction (Osmar et al., 1998).

Conclusions

Multimedia data mining is the mining of high-level multimedia information and knowledge from large multimedia databases. Mining multimedia data is, however, still at an experimental stage. Although multimedia has been a major focus for many researchers around the world, data mining from multimedia databases is still in its infancy and has so far produced few results.
A multimedia data mining system prototype, MultiMediaMiner, has been designed and developed. It includes the construction of a multimedia data cube, which facilitates multi-dimensional analysis of multimedia data based primarily on visual content, and the mining of multiple kinds of knowledge in image and video databases, including characterization (summarization), discrimination (comparison), classification, association and clustering.

Three major tasks call for further research into the design and development of the MultiMediaMiner system. The first task is to improve the design and construction of the multimedia data cube. The current implementation supports only a limited number of intervals on the colour and texture dimensions in the data cube. The second task is to enhance the data mining algorithms to take advantage of the MFC and MFO centroids in order to discover interesting spatial relationships. The third task is the incremental addition of new data mining functionalities to the system.

References

Fayyad, U. M., G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, 1996. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.

Han, J., Y. Fu, K. Koperski, W. Wang and O. Zaïane, 1996. DMQL: A Data Mining Query Language for Relational Databases. Proc. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Quebec, Canada, pp: 27-34.

Elfeky, M. G., A. A. Saad and S. A. Fouad, 2000. ODMQL: Object Data Mining Query Language. Lecture Notes in Computer Science (LNCS), 1944, pp: 128-140.

Zaïane, Osmar R., J. Han, Z.-N. Li and J. Hou, 1998. Mining Multimedia Data. Proc. CASCON'98: Meeting of Minds, Toronto, Canada, pp: 83-96.

Li, Wen-Syan, K. S. Candan, K. Hirata and Y. Hara, 1997. Facilitating Multimedia Database Exploration through Visual Interfaces and Perpetual Query Reformulations. Proc. VLDB, pp: 538-547.

Zaïane, Osmar R., J. Han, Z.-N. Li, J. Y. Chiang and S. Chee, 1998. MultiMediaMiner: A System Prototype for Multimedia Data Mining. Proc. 1998 ACM-SIGMOD Conf. on Management of Data, Seattle, Washington, pp: 581-583.

Han, J., J. Chiang, S. Chee, J. Chen, Q. Chen, S. Cheng, W. Gong, M. Kamber, G. Liu, K. Koperski, Y. Lu, N. Stefanovic, L. Winstone, B. Xia, O. R. Zaïane, S. Zhang and H. Zhu, 1997. DBMiner: A System for Data Mining in Relational Databases and Data Warehouses. Proc. CASCON'97: Meeting of Minds, Toronto, Canada, pp: 249-260.

Han, J. and Y. Fu, 1995. Discovery of Multiple-Level Association Rules from Large Databases. Proc. Intl. Conf. Very Large Data Bases, Zurich, Switzerland, pp: 420-431.

Chaudhuri, S. and U. Dayal, 1997. An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record, 26: 65-74.