Asian Journal of Information Technology 4 (5): 466-470, 2005
© Grace Publications, 2005
A Survey of Data Mining Query Language (DMQL) for Multimedia Databases
Bilal A. H. Abul-Huda, Basel Bani Ismail, Osama Al-Horani and Ahmad El-Mustafah
Department of Computer Information System, College of Information Technology and Computer Science,
Yarmouk University, P.O. Box 4762, 21163 Irbid Jordan
Abstract: Data Mining is the process of discovering interesting knowledge (patterns) from large amounts of data
stored either in databases, data warehouses, or other information repositories. Multimedia data mining is the
mining of high-level multimedia information and knowledge from large multimedia databases. Mining multimedia
data is, however, at an experimental stage. Substantial progress in the field of data mining and data warehousing
research has been witnessed in the last few years. Numerous research and commercial systems for data mining
and data warehousing have been developed for mining knowledge in relational database and data warehouses
(Fayyad et al., 1996). Although multimedia has been a major focus for many researchers around the
world, data mining from multimedia databases is still in its infancy and has so far produced few results.
Many techniques for representing, storing, indexing and retrieving multimedia data have been proposed.
However, few researchers have ventured into the multimedia data mining field. The emerging data mining
tools and systems create demand for a powerful data mining query language. The concepts of such a
language for relational databases are discussed in (Han et al., 1996). With the increasing popularity of
multimedia databases, it is important to design a data mining query language for such databases. A multimedia
data mining system prototype, MultiMediaMiner, has been designed and developed. It includes the construction of
a multimedia data cube which facilitates multiple dimensional analysis of multimedia data, primarily based on
visual content and the mining of multiple kinds of knowledge, including characterization (summarization),
discrimination (comparison), classification, association and clustering, in image and video databases.
Key words: Data mining, Data mining query language, Knowledge discovery, Data warehousing, Data cube,
Multimedia database and Information retrieval
Introduction
Data Mining means the discovery of knowledge and useful information from the large amounts of data stored in
databases. Data mining involves an integration of techniques from multiple disciplines, such as database
technology, statistics, machine learning, neural networks and information retrieval.
A desirable feature of data mining systems is the ability to support ad hoc and interactive data mining in order to
facilitate flexible and effective knowledge discovery. A data mining query language can be designed to support such a
feature. The importance of a well-designed data mining query language can also be seen from the
history of relational database systems. Much research has been conducted on data mining in
relational databases to mine specific kinds of knowledge. Several experimental data mining
systems have also been developed for relational databases, such as DBMiner, Explora, MineSet and Quest
(Elfeky et al., 2000).
Designing a data mining language for relational databases is reasonably easy. It is a great
challenge, however, to design a language for knowledge mining in other kinds of databases, such as transactional
databases, object-oriented databases, spatial databases and multimedia databases (Han et al., 1996).
Multimedia data mining is a subfield of data mining that deals with the extraction of implicit knowledge,
multimedia data relationships, or other patterns not explicitly stored in multimedia databases. Multimedia data
mining is not limited to images, video or sound, but encompasses text mining as well. There has been
interesting research in text mining from text documents and Web or semi-structured data querying and mining.
The availability of affordable imaging technology is leading to an explosion of data in the forms of image and
video. Many relational databases are now including multimedia information, such as photos of customers, videos
about real estate, etc. The proliferation of huge amounts of multimedia data is becoming prominent. Global
information networks like the Internet, as well as specialized databases, are filled with a variety of multimedia,
medical images, satellite pictures, etc., necessitating means to retrieve, classify and understand this data. With
huge amounts of multimedia data collected by video cameras and audio recorders, satellite telemetry systems,
remote sensing systems, surveillance cameras and other data collection tools, it is crucial to develop tools for
discovery of interesting knowledge from large multimedia databases. Moreover, with the popularity of multimedia
Abul-Huda: A survey of data mining query language
objects in extended and object-relational databases, it is becoming important to mine knowledge related to
both multimedia and relational data in large databases and, possibly, to handle both in the same manner
(Osmar et al., 1998).
For similarity searching in multimedia data, there are two main families of multimedia indexing and retrieval
systems: description-based retrieval systems, which build indices and perform object retrieval based on image
descriptions, such as keywords, size, caption and time of creation, and content-based retrieval systems, which
support retrieval based on the image content, such as colour histogram, texture, shape, objects and wavelet
transforms. In a content-based retrieval system, there are often two kinds of queries: image sample-based
queries and image feature specification queries.
Since multimedia retrieval is based on similarity calculations of semantics and media-based search, exact
matches are not expected. We view querying a multimedia database as a combination of information retrieval (IR), image matching and
traditional database query processing, conducted through perpetual query reformulation to
home in on target results. Query processing in multimedia databases is different from that in traditional database
systems. Because contents stored in traditional database systems are rather precise, query results are certain. In
information retrieval, documents are represented as keyword lists. To retrieve a document, systems compare
keywords specified by users with the documents’ keyword lists. Images, in a similar manner, are represented as
features. Image matching is carried out through comparing these features. In both information retrieval and image
matching, results are based on similarity calculations: similarity between the semantic meanings of
keywords in the former and between image features in the latter (Wen-Syan et al., 1997).
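The similarity-based matching described above can be illustrated with a minimal Python sketch. It assumes each image is represented by a normalised colour histogram and uses histogram intersection as the similarity measure. The measure, the function names and the three-bin sample data are illustrative assumptions, not details taken from the systems surveyed here.

```python
from typing import Dict, List, Tuple


def histogram_intersection(h1: List[float], h2: List[float]) -> float:
    """Similarity in [0, 1] between two normalised histograms of equal length."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_by_similarity(
    query: List[float], database: Dict[str, List[float]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k most similar images; no exact match is required."""
    scored = [(name, histogram_intersection(query, hist))
              for name, hist in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical three-bin histograms (real systems use many more bins).
images = {
    "sunset.jpg": [0.7, 0.2, 0.1],
    "forest.jpg": [0.1, 0.8, 0.1],
    "ocean.jpg": [0.1, 0.2, 0.7],
}
ranking = rank_by_similarity([0.6, 0.3, 0.1], images, top_k=2)
print(ranking)
```

The result is a ranked list rather than a yes/no answer, which is why the user is expected to refine the query repeatedly.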
The current MultiMediaMiner system includes four data mining modules for mining knowledge in image and video
databases: characterization (summarization), discrimination (comparison), classification and association.
Additional modules are in the design and development stage (Osmar et al., 1998).
Design of a Data Mining Language: The philosophy of data mining may strongly influence the design of a data
mining language. Hence, the main principle in designing a data mining query language is to support the
specifications of the following primitives:
The set of task-relevant data to be mined.
The kind of knowledge to be mined.
The background knowledge (such as conceptual hierarchy information, etc.) to be used in the discovery process.
The interestingness measures and thresholds for pattern evaluation.
The expected representation for visualizing the discovered patterns.
Specification of Task-Relevant Data: This involves specifying the following information.
Database or data warehouse name
Database tables or data warehouse cubes
Conditions for data selection
Relevant attributes or dimensions
Data grouping criteria
Mining Different Kinds of Rules: Based on the above considerations, a data mining query language, DMQL, has
been designed for mining several kinds of knowledge in multimedia databases. It consists of the specifications
of four major primitives in data mining: (1) the set of data relevant to a data mining process, (2) the kind of
knowledge to be discovered, (3) the background knowledge and (4) the justification of the interestingness of the
knowledge (i.e., thresholds). The first primitive, the set of relevant data, can be specified in a way similar to that of
a relational query, which is to be used to fetch the set of relevant data from the database. The second primitive,
the kind of knowledge to be discovered, may include generalized relations, characteristic rules, discriminant rules,
classification rules, association rules, etc., which are detailed as follows (Han et al., 1996).
A generalized relation is a relation obtained by generalizing from a large set of low level data. A generalized
relation can then be used for extraction of different kinds of rules or be viewed at high concept levels from different
angles.
A characteristic rule is an assertion which characterizes a concept satisfied by all or most of the examples in the
class undergoing examination (called the target class). For example, the symptoms of a specific disease can be
summarized by a characteristic rule.
A discriminant rule is an assertion which discriminates a concept of the class being examined (the target class)
from other classes (called contrasting classes). For example, to distinguish one disease from others, a
discriminant rule should summarize the symptoms that discriminate this disease from others.
A classification rule is a set of rules which classifies the relevant set of data, which is usually obtained by first
classifying the data (i.e., obtaining a preferred classification scheme) and then returning a set of rules associated
with each class or subclass. For example, one may classify diseases and provide the symptoms which describe
each class or subclass.
An association rule describes association relationships among a set of data (patterns). For example, one may
discover a set of symptoms frequently occurring together with certain kinds of diseases and further study the
reasons behind it.
The third primitive, the background knowledge, is a set of concept hierarchies or generalization operators which
provide corresponding higher level concepts and assist generalization processes. The fourth primitive, the
interestingness or significance of the knowledge to be discovered, can be specified as a set of different mining
thresholds depending on the kinds of rules to be mined.
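As a rough illustration of how the four primitives fit together, the sketch below bundles them into one Python data structure. It is not DMQL syntax: the class name `MiningQuery`, its field names, the rule-kind labels and the sample values are all hypothetical, chosen only to mirror the primitives listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Kinds of knowledge named in the text (labels are illustrative).
RULE_KINDS = {"generalized_relation", "characteristic", "discriminant",
              "classification", "association"}


@dataclass
class MiningQuery:
    """Hypothetical container for the four primitives (not DMQL syntax)."""
    # Primitive 1: the set of task-relevant data.
    database: str
    tables: List[str]
    attributes: List[str]
    condition: str = ""
    # Primitive 2: the kind of knowledge to be discovered.
    rule_kind: str = "characteristic"
    # Primitive 3: background knowledge, attribute -> concept hierarchy name.
    hierarchies: Dict[str, str] = field(default_factory=dict)
    # Primitive 4: interestingness thresholds.
    thresholds: Dict[str, float] = field(default_factory=dict)

    def problems(self) -> List[str]:
        """List inconsistencies between the primitives, if any."""
        found = []
        if self.rule_kind not in RULE_KINDS:
            found.append(f"unknown rule kind: {self.rule_kind}")
        if self.rule_kind == "association":
            for name in ("support", "confidence"):
                if name not in self.thresholds:
                    found.append(f"association rules need a {name} threshold")
        for attribute in self.hierarchies:
            if attribute not in self.attributes:
                found.append(f"hierarchy given for unselected attribute: {attribute}")
        return found


query = MiningQuery(
    database="multimedia_db",          # hypothetical names throughout
    tables=["images"],
    attributes=["keyword", "colour", "size"],
    condition="format = 'jpeg'",
    rule_kind="association",
    hierarchies={"keyword": "keyword_hierarchy"},
    thresholds={"support": 0.05},
)
print(query.problems())
```

The check shows how the thresholds required depend on the kind of rule requested.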
Specification of Background Knowledge (Concept Hierarchies): A concept hierarchy (or lattice) defines a
sequence of mappings from a set of concepts to their higher level correspondences. Concept hierarchies
represent necessary background knowledge to control the generalization process that is a preliminary step in
most data mining algorithms. Generalization of an attribute means replacing its value with a higher-level one based on a
concept hierarchy tree. For example, a person’s address can be generalized from a detailed address, such as the
street, into a higher-level one, such as a district, a city or a country (Elfeky et al., 2000).
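The address example can be sketched in Python by storing the concept hierarchy as a child-to-parent mapping and climbing it a given number of levels. The street and district names are invented for illustration, and the mapping representation is an assumption rather than the structure used by any of the cited systems.

```python
from typing import Dict

# Concept hierarchy as child -> parent links (street and district are made up).
ADDRESS_HIERARCHY: Dict[str, str] = {
    "12 Main Street": "North District",
    "North District": "Irbid",
    "Irbid": "Jordan",
}


def generalize(value: str, hierarchy: Dict[str, str], levels: int = 1) -> str:
    """Replace a value by its ancestor `levels` steps up the hierarchy.

    Climbing stops early when the top of the hierarchy is reached.
    """
    current = value
    for _ in range(levels):
        parent = hierarchy.get(current)
        if parent is None:
            break
        current = parent
    return current


print(generalize("12 Main Street", ADDRESS_HIERARCHY, levels=1))
print(generalize("12 Main Street", ADDRESS_HIERARCHY, levels=2))
```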
Specification of Interestingness and Thresholds: A data mining task may need to specify a set of thresholds to
control its data mining process, including guiding an induction process, constraining search for interesting
knowledge, testing the interestingness or significance of the discovered knowledge, etc. This requires the
introduction of the fourth set of primitives, a set of data mining thresholds, in DMQL (Han et al., 1996).
There are many kinds of thresholds that should be specified to control the mining process. The attribute threshold
is the maximum allowed number of distinct values for an attribute in the generalized objects. It is specified
independent of the kind of rules since it is considered in the generalization step before considering the kind of
rules to be mined. The other kinds of thresholds depend on the specified type of rules being mined. For example,
mining association rules should specify a support threshold that is the minimum support value of a rule and a
confidence threshold that is the minimum confidence value of a rule. Also, mining classification rules should
specify a classification threshold such that further classification on a set of classified objects may become
unnecessary if a substantial portion (no less than the specified threshold) of the classified objects belong to the
same class (Elfeky et al., 2000).
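A minimal sketch of how an attribute threshold might drive generalization: values are lifted one level up a concept hierarchy until the number of distinct values no longer exceeds the threshold. The function name, the level-by-level strategy and the sample hierarchy are assumptions made for illustration. The cited papers describe the threshold but not this exact procedure.

```python
from typing import Dict, List


def generalize_attribute(
    values: List[str], parent_of: Dict[str, str], attribute_threshold: int
) -> List[str]:
    """Climb the hierarchy until at most `attribute_threshold` distinct values remain."""
    current = list(values)
    while len(set(current)) > attribute_threshold:
        climbed = [parent_of.get(v, v) for v in current]
        if climbed == current:  # top of the hierarchy reached, cannot generalize further
            break
        current = climbed
    return current


# Hypothetical keyword hierarchy.
parent_of = {
    "eagle": "bird", "sparrow": "bird",
    "salmon": "fish", "trout": "fish",
    "oak": "tree",
    "bird": "animal", "fish": "animal", "tree": "plant",
}
keywords = ["eagle", "sparrow", "salmon", "trout", "oak"]

print(generalize_attribute(keywords, parent_of, attribute_threshold=3))
print(generalize_attribute(keywords, parent_of, attribute_threshold=2))
```

A lower threshold forces more climbing and therefore a more abstract description of the same data.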
Presentation and Visualization of Discovered Patterns: For data mining to be effective, data mining systems
should be able to display the discovered patterns in multiple forms, such as rules, tables, crosstabs, reports,
charts, graphs, decision trees and cubes. Allowing the visualization of discovered patterns in various forms can
help users with different backgrounds to identify patterns of interest and to interact or guide the system in further
discovery.
A Database Mining System Prototype: The MultiMediaMiner system builds on two earlier systems: DBMiner, an on-line
analytical data mining system, and C-BIRD, a system for Content-Based Image Retrieval from Digital
libraries.
The DBMiner system currently contains the following five data mining functional modules:
characterizer, comparator, associator, predictor and classifier. A general description of these functional modules
is given in (Han et al., 1997). DBMiner applies multi-dimensional database structures (Han et al., 1997),
attribute-oriented induction, multi-level association analysis (Han et al., 1995), statistical data analysis and machine
learning approaches for mining these different kinds of rules in relational databases and data warehouses.
The C-BIRD system contains four major components: (i) Image Excavator (a web agent) for the extraction of images
and videos from multimedia repositories, (ii) a pre-processor for the extraction of image features and storing
precomputed data in a database, (iii) a user interface and (iv) a search kernel for matching queries with image
and video features in the database. The database used by C-BIRD is an addition to the image repository and
contains mainly meta-data extracted by the pre-processor and the Image Excavator, like colour, texture and shape
characteristics and automatically generated keywords. MultiMediaMiner, the general architecture of which is
shown in Fig. 1, inherits the C-BIRD database (Osmar et al., 1998).
For each image collected, the database contains some descriptive information, a feature descriptor and a layout
descriptor. The original image is not directly stored in the database; only its feature descriptors are stored. The
descriptive information encompasses fields like: image file name, image URL, image and video type (e.g. gif,
jpeg, bmp, avi, mpeg, ...), a list of all known web pages referring to the image (i.e. parent URLs), a list of
keywords and a thumbnail used by the user interface for image and video browsing. The feature descriptor is a
set of vectors for each visual characteristic. The main vectors are: a colour vector containing the colour histogram
quantized to 512 colours (8 × 8 × 8 for R × G × B), MFC (Most Frequent Colour) vector and MFO (Most Frequent
Orientation) vector. The MFC and MFO contain 5 colour centroids and 5 edge orientation centroids for the 5 most
frequent colours and 5 most frequent orientations, respectively. The edge orientations used are: 0°, 22.5°, 45°,
67.5°, 90° and so on. The layout descriptor contains a colour layout vector and an edge layout vector. Regardless
of their original size, all images are assigned an 8 × 8 grid. The most frequent colours for each of the 64 cells
are stored in the colour layout vector and the number of edges for each orientation in each of the cells is stored
in the edge layout vector. Other sizes of grids, like 4 × 4, 2 × 2 and 1 × 1, can be derived easily (Osmar et al.,
1998).
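The colour vector and the grid derivation can be sketched as follows. The sketch assumes uniform quantization that keeps the top three bits of each 8-bit channel, which yields 8 × 8 × 8 = 512 bins. The source gives the bin count but not the quantization scheme. It also assumes that a coarser grid of per-cell edge counts is obtained by summing 2 × 2 blocks of cells. All function names are illustrative.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def colour_bin(pixel: Pixel) -> int:
    """Map an RGB pixel to one of 512 bins (8 levels per channel)."""
    r, g, b = pixel
    return (r >> 5) * 64 + (g >> 5) * 8 + (b >> 5)


def colour_histogram(pixels: List[Pixel]) -> List[int]:
    """Count the pixels that fall into each of the 512 colour bins."""
    histogram = [0] * 512
    for pixel in pixels:
        histogram[colour_bin(pixel)] += 1
    return histogram


def coarsen_grid(grid: List[List[int]]) -> List[List[int]]:
    """Halve the grid resolution by summing 2 x 2 blocks (e.g. 8 x 8 -> 4 x 4)."""
    half = len(grid) // 2
    return [
        [grid[2 * i][2 * j] + grid[2 * i][2 * j + 1]
         + grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]
         for j in range(half)]
        for i in range(half)
    ]


pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255)]
histogram = colour_histogram(pixels)

edge_counts_8x8 = [[1] * 8 for _ in range(8)]   # hypothetical edge counts per cell
edge_counts_4x4 = coarsen_grid(edge_counts_8x8)
edge_counts_2x2 = coarsen_grid(edge_counts_4x4)
```

Summation only applies to count-like values such as edges per cell. The most frequent colours of a merged cell would have to be recomputed from the underlying colour counts.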
The Image Excavator uses image contextual information, like HTML tags in web pages, to derive keywords. By
traversing on-line directory structures, like the Yahoo directory, it is possible to create hierarchies of keywords
mapped on the directories in which the image was found. These graphs are used as concept hierarchies for the
dimension keyword in the multidimensional data cube (Osmar et al., 1998).
The multimedia data cube can have many dimensions. The following are some examples: the size of the image
or video in bytes; the width and height of the frames (or pictures) constitute 2 dimensions; the date on which the
image or video was created (or last modified); the format type of the image or video; the frame sequence duration
in seconds; the image or video Internet domain; the Internet domain of pages referencing the image or video
(parent URL); the keywords; a colour dimension; an edge-orientation dimension; etc. (Osmar et al., 1998).
Fig. 1: General Architecture of MultiMediaMiner
The mining modules of the MultiMediaMiner system include four functional modules, characterizer, comparator,
classifier and associator. Many data mining techniques are used in the development of these modules, including
data cube construction and search (Chaudhuri et al., 1997), attribute-oriented induction (Han et al., 1997), mining
multi-level association rules (Han et al., 1995), etc.
The functionalities of these modules are described as follows (Osmar et al., 1998).
MM-Characterizer: This module discovers a set of characteristic features at multiple abstraction levels from a
relevant set of data in a multimedia database. It provides users with a multiple-level view of the data in the
database with roll-up and drill-down capabilities. For example, the module may describe the general
characteristics of image sequences based on the topic of the video, the topic being a high level keyword defined
in the concept hierarchy. The user can drill-down along the topic dimension to find characteristics of the image
sequences based on more concrete topics (Osmar et al., 1998).
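Roll-up along a keyword (topic) hierarchy can be sketched as a simple aggregation over cube cells. The two-dimensional cube (keyword, format), the hierarchy and the records below are invented for illustration. The actual MultiMediaMiner cube has many more dimensions and its implementation is not described at this level of detail.

```python
from collections import Counter
from typing import Dict

# Hypothetical keyword hierarchy: concrete topic -> more general topic.
PARENT_TOPIC: Dict[str, str] = {
    "eagle": "bird", "sparrow": "bird", "salmon": "fish",
}


def roll_up(cells: Counter, parent_of: Dict[str, str]) -> Counter:
    """Aggregate cell counts one level up the keyword dimension."""
    rolled = Counter()
    for (keyword, fmt), count in cells.items():
        rolled[(parent_of.get(keyword, keyword), fmt)] += count
    return rolled


# Each record is (keyword, format); the base cube counts records per cell.
records = [
    ("eagle", "jpeg"), ("sparrow", "jpeg"),
    ("eagle", "gif"), ("salmon", "jpeg"),
]
base_cube = Counter(records)
general_cube = roll_up(base_cube, PARENT_TOPIC)
print(dict(general_cube))
```

Drilling down is the reverse step: the user goes back from `general_cube` to the finer cells of `base_cube` for a chosen topic.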
MM-Comparator: This module discovers a set of comparison characteristics contrasting the features of different
classes of the relevant sets of data in a multimedia database. It compares and distinguishes the general features
of one set of data, known as the target class, from the other set(s) of data, known as the contrasting class(es). For
example, the module may show the differences in video duration and colour richness between videos served in
the commercial Internet domain (com) and videos served in the educational domain (edu) and created in July 1997
(Osmar et al., 1998).
MM-Associator: This module finds a set of association rules from the relevant set(s) of data in an image and video
database. An association rule shows the frequently occurring patterns (or relationships) of a set of data items in a
database. A typical association rule is in the form of “X → Y [s%; c%]”, where X and Y are sets of predicates, s% is
the support of the rule (the probability that X and Y hold together among all the possible cases) and c% is the
confidence of the rule (the conditional probability that Y is true under the condition of X). For example, the module
mines association rules like: “what are the relationships among still images, the frequent colours used in them, their
size and the keyword ‘sky’?” One possible association rule among many to be found is “if an image is big and is
related to sky, it is blue with a probability of 68%” or “if an image is small and is related to sky, it is dark blue with a
probability of 55%” (Osmar et al., 1998).
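The definitions of support and confidence given above can be checked on a toy example. In the sketch below each image is described by a set of predicates. Support is the fraction of images satisfying both X and Y, and confidence is the fraction of images satisfying X that also satisfy Y. The predicate strings and the five sample images are hypothetical and do not reproduce the 68% and 55% figures quoted from the original study.

```python
from typing import List, Set, Tuple


def support_and_confidence(
    transactions: List[Set[str]], x: Set[str], y: Set[str]
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule X -> Y over the transactions."""
    total = len(transactions)
    count_x = sum(1 for t in transactions if x <= t)
    count_xy = sum(1 for t in transactions if (x | y) <= t)
    support = count_xy / total if total else 0.0
    confidence = count_xy / count_x if count_x else 0.0
    return support, confidence


# Hypothetical image descriptions.
images = [
    {"size=big", "keyword=sky", "colour=blue"},
    {"size=big", "keyword=sky", "colour=blue"},
    {"size=big", "keyword=sky", "colour=grey"},
    {"size=small", "keyword=sky", "colour=dark blue"},
    {"size=big", "keyword=sea", "colour=blue"},
]
support, confidence = support_and_confidence(
    images, x={"size=big", "keyword=sky"}, y={"colour=blue"}
)
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

A rule would be reported only if both values reach the support and confidence thresholds specified for the mining task.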
MM-Classifier: This module classifies multimedia data based on some provided class labels, such as topics
(based on keywords). The result is an elegant classification of a large set of multimedia data and a characteristic
description of each class. This classification, represented as a decision tree, can also be used for prediction
(Osmar et al., 1998).
Conclusions
Multimedia data mining is the mining of high-level multimedia information and knowledge from large multimedia
databases. Mining multimedia data is, however, at an experimental stage. Although multimedia has been a major focus
for many researchers around the world, data mining from multimedia databases is still in its infancy and has
so far produced few results. A multimedia data mining system prototype, MultiMediaMiner, has been
designed and developed. It includes the construction of a multimedia data cube which facilitates multiple
dimensional analysis of multimedia data, primarily based on visual content and the mining of multiple kinds of
knowledge, including characterization (summarization), discrimination (comparison), classification, association
and clustering, in image and video databases. There are three major tasks calling for further research into the
design and development of the MultiMediaMiner system.
The first task is the improvement of the design and construction of the multimedia data cube. The current implementation
supports only a limited number of intervals on the colour and texture dimensions in the data cube. The second task
is to enhance data mining algorithms to take advantage of the MFC and MFO centroids in order to discover
interesting spatial relationships. The third task is the incremental addition of new data mining functionalities into
the system.
References
Fayyad, U. M., G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, 1996. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press.
Han, J., Y. Fu, K. Koperski, W. Wang and O. Zaiane, 1996. "DMQL: A Data Mining Query Language for Relational
Databases," Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge
Discovery, Montreal, Quebec, Canada, pp: 27-34.
Elfeky, M. G., A. A. Saad and S. A. Fouad, 2000, "ODMQL: Object Data Mining Query Language." Lecture Notes in
Computer Sci.,1944 (LNCS), pp: 128-140.
Osmar R. Zaïane, Jiawei Han, Ze-Nian Li and Jean Hou, 1998. "Mining Multimedia Data." Proc. CASCON'98:
Meeting of Minds, Toronto, Canada, pp: 83-96.
Wen-Syan Li, K. Selçuk Candan, Kyoji Hirata and Yoshinori Hara, 1997. Facilitating Multimedia Database Exploration
through Visual Interfaces and Perpetual Query Reformulations. VLDB, pp: 538-547.
Osmar R. Zaïane, Jiawei Han, Ze-Nian Li, Jenny Y. Chiang and Sonny Chee, 1998. MultiMediaMiner: A system prototype for
multimedia data mining. In Proc. 1998 ACM-SIGMOD Conf. on Management of Data, Seattle, Washington,
pp: 581-583.
Han, J., J. Chiang, S. Chee, J. Chen, Q. Chen, S. Cheng, W. Gong, M. Kamber, G. Liu, K. Koperski, Y. Lu, N.
Stefanovic, L. Winstone, B. Xia, O. R. Zaïane, S. Zhang and H. Zhu, 1997. DBMiner: A system for data mining
in relational databases and data warehouses. In Proc. CASCON'97: Meeting of Minds, Toronto, Canada,
pp: 249-260.
Han, J. and Y. Fu, 1995. Discovery of multiple-level association rules from large databases. In Proc. Intl. Conf.
Very Large Data Bases, Zurich, Switzerland, pp: 420-431.
Chaudhuri, S. and U. Dayal, 1997. An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD
Record 26: 65-74.