VOL. 3, NO. 9 SEP, 2012
ISSN 2079-8407
Journal of Emerging Trends in Computing and Information Sciences
©2009-2012 CIS Journal. All rights reserved.
http://www.cisjournal.org
Using AK-Mode Algorithm to Cluster OLAP Requirements
1 Nouha Arfaoui, 2 Jalel Akaichi
1,2 BIRT - Institut Supérieur de Gestion, 41, Avenue de la liberté, Cité Bouchoucha, Le Bardo 2000, Tunisia
1 [email protected], 2 [email protected]
ABSTRACT
Data warehousing is becoming increasingly important for strategic decision making through its capacity to integrate heterogeneous data from multiple information sources into a common storage space for querying and analysis. Since its design is not an easy task, we propose exploiting the OLAP requirements to construct the schemas of data marts, which will later be used to build the schema of the Data Warehouse. The construction of the Data Marts is done through the clustering of the schemas that correspond to the OLAP requirements. In this work we focus on the clustering step and propose the use of AK-Mode, an extension of the k-mode algorithm. AK-Mode integrates an ontology to take into consideration the semantic aspect of our data.
Keywords: AK-Mode, Clustering, Simple Matching, Ontology, Data Warehouse Schema, Data Mart Schema.
1. INTRODUCTION
Data warehousing is becoming increasingly important for strategic decision making through its capacity to integrate heterogeneous data from multiple information sources into a common storage space for querying and analysis.
In order to fully exploit the DW, it is essential to have a good design, able to satisfy the specified needs and thus give a complete and centralized view of all existing data. The design is not an easy task because of the need to acquire significant knowledge about both the application scope and the design techniques used, so as to ensure an adequate understanding of the different concepts. Mastering them demands ever more effort and time, especially with the continuous change and evolution of the domain. This requires resorting to design methods such as top-down, bottom-up, and middle-out.
The idea is to construct the Data Warehouse schema from the OLAP requirements. In order to facilitate this task, we propose the use of clustering as a data mining technique to group the different schemas resulting from the process of transforming the requirements. As a result, we get schemas grouped according to their departments or business functions. Then, for each cluster, we construct the corresponding Data Mart schema.
In this work we focus on clustering the schemas, and we start by defining this notion. Clustering is the unsupervised classification of patterns into groups called clusters [3]; it involves dividing a set of data points into non-overlapping groups, or clusters, of points [35]. The objects of one cluster are more similar to each other than to those of other clusters, so clustering aims at maximizing the homogeneity within each group. To determine the notion of similarity we use some measure of proximity. In the literature, clustering was first used on numerical data, and consequently many algorithms have been proposed. The base of each algorithm is the use of coefficients to calculate the similarity/dissimilarity measure between objects. With the emergence of categorical data and its use in real databases, it became important to look for new algorithms to cluster this kind of data, and many new ones have been proposed. The challenge, at this level, is to solve the problem of the similarity measure [6]. In fact, the traditional measures used with numerical data cannot be applied directly; they must be modified to take into consideration the specific characteristics of categorical data. As a consequence, new measures have been proposed to cluster categorical data.
The problem in our case is that the schemas provide additional information that we have to take into consideration when we make the comparison. This information can influence the result of the clustering; for example, several words can be used to denote the same thing. So if we use the traditional measures, the result of our comparison will not reflect reality. The traditional version of k-mode ignores this level. To overcome this problem, we propose "AK-Mode", which extends the Simple Matching (SM) dissimilarity measure by adding an ontology; in this way, we improve the effectiveness of this measure.
The outline of this paper is as follows: in section 2, we detail some works that have mixed the OLAP and Data Mining technologies. In the next section, we propose the use of a multidimensional table to express the OLAP requirements; the information is visualized using an intermediate schema. Section 4 describes the AK-Mode algorithm, argues for this choice, and explains the different modifications. We finish with the conclusion.
2. STATE OF THE ART
In this section we list the various works that have mixed the OLAP and Data Mining technologies to take advantage of both.
In [25], Ben Messaoud et al. propose OpAC (Operator for Aggregation by Clustering), a new operator for multidimensional on-line analysis. It consists of using agglomerative hierarchical clustering to achieve a semantic aggregation on the attributes of a data cube
dimension. The authors take advantage of both OLAP and Data Mining to obtain, in the end, an analysis process that provides exploration, explication, and prediction capabilities. Another data mining system, DBMiner, was presented in [10]. It integrates different data mining functions, such as characterization, comparison, association, classification, prediction, and clustering, and it integrates database and OLAP technologies. It has the advantage of mining various kinds of knowledge at multiple levels of abstraction from different databases as well as from data warehouses. It offers an SQL-like data mining query language, DMQL, and a graphical user interface to facilitate interactive mining. In the same context, Goil, in [30], proposes PARSIMONY, a parallel and scalable OLAP and Data Mining framework for large datasets. The system partitions the datasets into chunks, so the data can be stored either as dense blocks or in a sparse representation. There is also iDiff [33], an operator that automates manual discovery processes using mining technology. iDiff returns summarized reasons for drops or increases observed at an aggregated level, and this can be done in a single step.
In [24], Chen et al. propose a scalable DW and OLAP-based engine for analyzing web log records. The proposed framework supports the typical OLAP operations and DM operations such as extended multilevel and multidimensional association rules. The OLAP server is used as a computation engine to support the DM operations.
Data mining can also be applied to detect outliers. In this field, [32] implements an OLAP-outlier-based data association method resulting from the combination of OLAP and Data Mining. The proposed method integrates both the outlier detection concept from DM and ideas from the OLAP field, and it is used to solve the data association problem. Such a combination can also be used to discover causal relations among heterogeneous databases, as presented in [16], which proposes a computer software agent. This agent combines Data Warehouse, OLAP, and KDD functionalities in order to support knowledge discovery tasks. The solution consists of developing IIMiner (Integrated Interactive data Miner), which provides convenient ways for the user to interact with the KDD processes using OLAP and DW techniques. In [29], Goil and Choudhary present a parallel multidimensional framework for large data sets in OLAP. This framework has been integrated with the DM of association rules to facilitate handling a large number of dimensions and large datasets.
3. OLAP REQUIREMENTS
This section is devoted to presenting the OLAP requirements and their modeling, first through the multidimensional table and then using an intermediate schema.
The requirements play a crucial role in the DW design process; since requirements analysis is the first step, it can cause the failure of the whole project if it is faulty. Despite its importance, not much attention has been paid to this phase: 85% of DW projects fail to meet business objectives and 40% of DW projects are never completed; the authors of [11] also consider poor communication between the different stakeholders to be the cause of these failures.
According to [9], this phase is used to specify "what data should be available and how it should be organized as well as what queries are of interest", and it serves to extract the important elements related to the multidimensional schema (facts, measures, dimensions, hierarchies); this extraction helps to manipulate and calculate the data.
3.1 The Multidimensional Table
Different works in the literature propose the use of the n-dimensional table (also called multidimensional table) as a way to express the needs of decision makers. Since decision makers are typically not computer scientists, i.e., they can find it difficult to express their needs using SQL queries (especially the GROUP BY and/or HAVING clauses), we propose, as a solution, the use of the n-dimensional table [18], [8].
It is a tabular representation that can show the fact studied by the decision maker. This fact and its measures can be analyzed according to dimensions and their granularity levels. This choice is made because:
- It is easy to use for a non-computer-scientist user.
- It allows seeing values of certain attributes as a function of others.
- The representation is close to decision makers' vision of data [8].
We propose the use of a multidimensional table (MT) to visualize the structure of the data, i.e., the fact, the dimensions, and the measures. Fig. 1 shows the model that we adopt to present our MT, which has the following structure (a sketch of this structure as a data type follows Fig. 1):
- "Dom": the domain of analysis.
- "F": the fact corresponding to the analyzed subject.
- "M": the measures, which are defined through aggregation functions "f": {f1(m1), ..., fn(mn)}.
- "CalculFun": the function used to calculate the measure.
- "D": the dimensions related to the subject of analysis.
- "L": the levels.
- "HStar": a function HStar: HD -> HL1 x ... x HLn that associates the different levels with their linked dimension instance.
Fig 1: The model of multidimensional table
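To make the structure above concrete, the following is a minimal Python sketch (ours, not from the paper); the class and field names mirror the paper's notation, and the example domain name is an assumption for illustration only.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class MultidimensionalTable:
    dom: str                              # "Dom": the domain of analysis
    fact: str                             # "F": the analyzed subject
    measures: Dict[str, str]              # "M": measure name -> aggregation function name
    calcul_fun: Dict[str, Callable] = field(default_factory=dict)  # "CalculFun"
    dimensions: List[str] = field(default_factory=list)            # "D": axes of analysis
    levels: Dict[str, List[str]] = field(default_factory=dict)     # "L": levels per dimension
    # "HStar": maps a dimension instance to its tuple of level values
    hstar: Dict[Tuple[str, str], Tuple[str, ...]] = field(default_factory=dict)

# The MT of Fig. 2: fact "Sale", measure "Benefit" aggregated with "Sum",
# dimension "Customer" analyzed at levels "Customer ID" and "Quantity".
mt = MultidimensionalTable(
    dom="Bookstore",   # assumed domain name
    fact="Sale",
    measures={"Benefit": "Sum"},
    dimensions=["Customer", "Supplier", "List Book"],
    levels={"Customer": ["Customer ID", "Quantity"]},
)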
Fig. 2 presents an example of an MT where we have three dimensions, "Customer", "Supplier", and "List Book", one fact, "Sale", and the measure "Benefit", which will be calculated using the "Sum" function. For the dimension "Customer" we take into consideration two levels, "Customer ID" and "Quantity".

Fig 2: Example of Multidimensional Table (MT)

3.2 The Schema of the OLAP Requirement

Using the database and the multidimensional table, we propose the visualization of an intermediate schema. This schema allows users to validate their requirements themselves. The schema is then transformed into an XML file to facilitate its manipulation during the following stage.

Applying this to our example, we get Fig. 3, which is the intermediate schema corresponding to the MT of Fig. 2.

Fig 3: The intermediate schema corresponding to the MT (Fig. 2)

4. THE AK-MODE ALGORITHM

In this section, we propose AK-Mode, an extension of the k-mode algorithm in which we use an ontology to calculate the dissimilarity distance.

Data Mining (DM) is "the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner" [7]. Many techniques and algorithms are used; among them are clustering, classification, prediction, etc.

In our case, we propose the use of clustering because it is the process of partitioning a given population of events or items into sets of similar elements, so that items within a cluster have high similarity to one another but are very dissimilar to items in other clusters [1].

Clustering can be applied to various types of data: continuous numerical variables [26], [21], binary variables [12], and categorical variables [21]. In our case we propose its use to cluster OLAP-Requirement Schemas (ORS). Each ORS is composed of a set of dimensions, measures, a fact, and levels.

For categorical data, the available algorithms include K-MODE [19], ROCK [31], QROCK [17], CACTUS [28], COOLCAT [5], CLICK [20], LIMBO [23], MULIC [4], etc. Our choice is based on time complexity. Table 1 presents the time complexity of the different algorithms.

Table 1: The algorithms that are used to cluster categorical data

Algorithm | Complexity  | Coefficient
K-MODE    | O(n)        | Simple Matching
ROCK      | O(kn^2)     | Links
QROCK     | O(n^2)      | Threshold
CACTUS    | Scalable    | Support
COOLCAT   | O(n^2)      | Entropy
CLICK     | Scalable    | Co-occurrence
LIMBO     | O(n log n)  | Information Bottleneck
MULIC     | O(n^2)      | Hamming measure

We can notice that k-mode has the lowest complexity, O(n), but it cannot deal with our data because it does not take into consideration the semantic aspect of the elements. So, we extend it to deal with our case and propose AK-Mode.

4.1 The Algorithm

The new algorithm uses an extension of "Simple Matching" as well as an extension of the mode update step. The rest of the algorithm is kept unchanged and is described as follows [36] (a sketch in code follows the steps):

a) Select k initial modes.
b) Allocate each object to the cluster whose mode is nearest to it, using formula (1):

   d(A, B) = ∑ δ(ai, bi), where δ(ai, bi) = 0 if ai = bi and δ(ai, bi) = 1 if ai ≠ bi   (1)

   Update the mode of the cluster after each allocation.
c) After all objects have been allocated to their respective clusters, retest the objects against the new modes and update the clusters.
d) Repeat steps (b) and (c) until there is no change in the clusters.
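For illustration, here is a minimal Python sketch of this generic k-mode loop with the simple matching dissimilarity of formula (1). It is ours, not the paper's implementation, and it uses a compact variant that updates the modes once per pass rather than after each allocation.

import random
from collections import Counter

def simple_matching(a, b):
    """Formula (1): d(A, B) = sum of delta(ai, bi); delta = 0 on match, 1 otherwise."""
    return sum(0 if ai == bi else 1 for ai, bi in zip(a, b))

def update_mode(cluster):
    """New mode: the most frequent value in each attribute position."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_mode(objects, k, max_iter=100):
    modes = random.sample(objects, k)              # step (a): select k initial modes
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for obj in objects:                        # step (b): allocate to nearest mode
            j = min(range(k), key=lambda i: simple_matching(obj, modes[i]))
            clusters[j].append(obj)
        new_modes = [update_mode(c) if c else modes[i]   # step (c): update modes
                     for i, c in enumerate(clusters)]
        if new_modes == modes:                     # step (d): stop when nothing changes
            break
        modes = new_modes
    return clusters, modes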
4.2 The Ontology

The ontology is used to resolve the heterogeneity problem [34]; in our case, it can also be used to improve clustering quality using hierarchical knowledge [2], [14].

We propose in this section a way to improve the simple matching dissimilarity measure. In fact, the traditional measures, those used for numerical data, categorical data, and even heterogeneous data, ignore semantic knowledge. This negatively influences the quality of the interpretations [15], especially given the possibility of adding semantic information about the domain in some fields [27].

In our work, we propose the use of two kinds of ontology: the WordNet ontology and a domain ontology.

WordNet ontology: it is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations [13].

Domain ontology: it contains information about the different classes as well as the relationships between them. We define the most general concepts, which are the following:

- Domain: indicates the domain to which the schema belongs.
- Schema: used to group the different facts, dimensions, measures, hierarchies, and levels that belong to the same schema. This concept includes the different terms used to designate the same meaning.
- Fact: corresponds to the subject of analysis. It includes all the different ways used to describe one fact.
- Dimension: corresponds to an axis of analysis. It serves to group the different ways to describe one specific dimension.
- Measure: every fact has one or more measures, which are numerical. We keep information about the different words used to describe a specific measure.
- Hierarchy: a logical structure used to order levels as a means of organizing data.
- Level: represents a position in a hierarchy. We keep information about the different terms used to describe the same level.

Concerning the relationships, we have:

- is-Schema(Si, Dj): indicates that "Si" is a schema that belongs to the domain "Dj".
- is-Fact(Fi, Sj): indicates that "Fi" is a fact that belongs to the schema "Sj".
- is-Dimension(Di, Fj): indicates that "Di" is a dimension that belongs to the fact "Fj".
- is-Measure(Mi, Fj): indicates that "Mi" is a measure that belongs to the fact "Fj".
- is-Hierarchy(Hi, Dj): indicates that "Hi" is a hierarchy that characterizes the dimension "Dj".
- is-Level(Li, Hj): indicates that "Li" is a level that exists in the hierarchy "Hj".

The ontology must take into consideration the following kinds of correspondence (see the sketch after this list):

- Partial-Name: different words, pre- or post-fixed, are used to designate the same meaning. Example: "Tab Product", "Product", and "Product Table".
- Levenshtein Name: the case where there are misspellings, for example "Customer" and a variant such as "Custumer"; here we need to calculate the degree of similarity between the words.
- Synonymous: different words can be used to talk about the same thing. Example: "Customer" and "Client".
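To make these three checks concrete, here is a minimal Python sketch. It is our illustration, not code from the paper; the levenshtein helper and the tiny SYNONYMS table are hypothetical stand-ins for the WordNet and domain ontology lookups.

def levenshtein(s, t):
    """Classic edit distance, used here to tolerate misspellings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

# Hypothetical stand-in for the WordNet/domain ontology lookup.
SYNONYMS = {"customer": {"client"}, "seller": {"salesman"}, "time": {"date"}}

def onto_match(a, b, max_edits=2):
    a, b = a.lower(), b.lower()
    if a == b:
        return True
    # Partial-Name: one term is the other pre- or post-fixed ("Tab Product" vs "Product").
    if a in b.split() or b in a.split():
        return True
    # Levenshtein Name: tolerate small misspellings ("Customer" vs "Custumer").
    if levenshtein(a, b) <= max_edits:
        return True
    # Synonymous: different words for the same thing ("Customer" vs "Client").
    return b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set())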
4.3 The Extension of Simple Matching Based on Ontology

We start this part with an example that clarifies the importance of adding the ontology; then we present the new simple matching dissimilarity measure.
Running Example. The simple matching coefficient as applied to categorical data is calculated using formula (1). This coefficient cannot be applied to calculate the dissimilarity measure between two schemas; this is why we propose the following algorithm (Fig. 4). It takes as input the 'Mode' and the 'ORS' and gives as result the 'Simple Matching Coef' CoefSM, with:
- CoefD: the number of similar dimensions.
- CoefM: the number of similar measures.
- CoefL: the number of similar level names.
Then, to calculate 'CoefSM' we need to define 'MaxD' (the maximum number of existing dimensions), 'MaxM' (the maximum number of existing measures), and 'MaxL' (the maximum number of existing level names).
Input: ORS, Mode
Output: CoefSM
Begin
  CoefD = SimilarityFunctionD(ORS, Mode)
  CoefM = SimilarityFunctionM(ORS, Mode)
  CoefL = SimilarityFunctionL(ORS, Mode)
  CoefSM = [(MaxD - CoefD)/MaxD] + [(MaxM - CoefM)/MaxM] + [(MaxL - CoefL)/MaxL]
End

Fig 4: The "Simple Matching" algorithm
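A direct Python transcription of the Fig. 4 computation follows (our sketch; the three similarity functions are passed in as parameters and are assumed to return the counts CoefD, CoefM, and CoefL defined above).

def simple_matching_coef(ors, mode, sim_d, sim_m, sim_l, max_d, max_m, max_l):
    """Fig. 4 dissimilarity: 0 when every dimension, measure and level name matches."""
    coef_d = sim_d(ors, mode)   # number of similar dimensions (CoefD)
    coef_m = sim_m(ors, mode)   # number of similar measures (CoefM)
    coef_l = sim_l(ors, mode)   # number of similar level names (CoefL)
    return ((max_d - coef_d) / max_d
            + (max_m - coef_m) / max_m
            + (max_l - coef_l) / max_l)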
We consider the two following schemas; each one corresponds to an OLAP requirement schema. The first schema (Fig. 5) is composed of one fact table, "Sales", and four dimensions: "Customer", "Time", "Product", and "Seller". "Customer" contains one key, "Id-Customer", and two attributes, "FN-Customer" (FN = First Name) and "LN-Customer" (LN = Last Name). "Time" is composed of one key, "Id-Time", and four attributes: "Month", "Week", "Day", and "Hour". "Seller" has one key, "Id-Seller", and two attributes, "FN-Seller" and "LN-Seller". "Product" is defined by a key, "Id-Product", and two attributes, "Name-Product" and "Category-Product".

Fig 5: First example of OLAP requirement schema

The second schema (Fig. 6) is composed of one fact table, "Sales", and four dimensions: "Customer", "Salesman", "Date", and "Product". "Customer" contains one key, "Customer-ID", and two attributes, "First Name" and "Last Name". "Salesman" is composed of "Salesman-ID", which is the key, and two attributes, "First Name" and "Last Name". "Date" has one key, "Date-ID", and four attributes: "Month", "Week", "Day", and "Hour". "Product" is defined by one key, "Product-ID", and two attributes, "Product-Name" and "Product-Category".

Fig 6: Second example of OLAP requirement schema

Let us apply the simple matching, as presented in Fig. 4, to calculate the dissimilarity measure between the two schemas (Fig. 5 and Fig. 6); we get:

CoefD = 2; CoefM = 2; CoefL = 4;
CoefSM(1) = [(4 - 2)/4] + [(2 - 2)/2] + [(14 - 4)/14] = 0.5 + 0 + 0.714 = 1.214
But this coefficient does not reflect reality, since "Salesman" and "Seller" mean the same thing; the same holds for many other pairs, such as "FN-Customer" and "First Name", "LN-Customer" and "Last Name", etc. So, if we take the semantics of the words into consideration, we get the following values:

CoefD = 4; CoefM = 2; CoefL = 14;
CoefSM(2) = [(4 - 4)/4] + [(2 - 2)/2] + [(14 - 14)/14] = 0 + 0 + 0 = 0

According to CoefSM(2), the two schemas are similar.
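As a quick check, plugging the counts of this running example into the Fig. 4 formula reproduces both values (a self-contained snippet of ours; the maxima 4, 2, and 14 come from the two schemas above):

def coef_sm(coef_d, coef_m, coef_l, max_d=4, max_m=2, max_l=14):
    return (max_d - coef_d)/max_d + (max_m - coef_m)/max_m + (max_l - coef_l)/max_l

print(round(coef_sm(2, 2, 4), 3))   # plain simple matching: 1.214
print(coef_sm(4, 2, 14))            # with the ontology:     0.0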
The extension of Simple Matching. Based on the example presented above, we can conclude that simple matching as used in other works does not correspond to our needs; so, to improve this coefficient, we propose modifying the three functions "Similarity Function D" (Fig. 7), "Similarity Function M", and "Similarity Function L". The "OntoTerm" function serves to take into consideration the different kinds of correspondence explained in section 4.2.
Input: Schema, Mode
Output: CoefSD
Begin
  For (each Schema.Dimension)
    If (Mode.Dimension.Equals(OntoTerm(Schema.Dimension)))
      CoefSD++
    End If
  End For
End

Fig 7: "Similarity Function D" algorithm
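The same loop can be rendered in Python as a minimal sketch (ours; onto_term stands for the ontology-backed lookup of section 4.2 and is stubbed here with exact matching):

def onto_term(name, mode_names):
    """Stub for the ontology lookup: returns the mode-side name that `name`
    corresponds to, or None. A real version would apply the Partial-Name,
    Levenshtein and Synonymous checks of section 4.2."""
    return name if name in mode_names else None

def similarity_function_d(schema_dims, mode_dims):
    """Fig. 7: count the schema dimensions whose names match a mode dimension."""
    coef_sd = 0
    for dim in schema_dims:
        if onto_term(dim, mode_dims) is not None:
            coef_sd += 1
    return coef_sd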
5. IMPLEMENTATION
In this section, we present the implementation of our system. To realize this purpose, we used Eclipse as the Java editor, SQL Server 2008, and different libraries, such as Jena, to deal with the XML files. Fig. 8 presents the structure of the database that we use to store the information extracted from the schemas; we need the following tables: "Schema", "Dimension", "Level Name", "Measure", and "Mode".
Fig 8: The structure of the tables of our database
Fig. 9 presents the interface of our system. The user starts by specifying the paths of the different XML files representing the OLAP requirement schemas (a loading sketch follows Fig. 9). Once he/she validates the selection, our system extracts the different elements, including schemas, dimensions, levels, and measures, and stores them in the tables. The existing schemas are displayed in a list from which the user can initialize the modes. In fact, the number of selected modes corresponds to the number of clusters. For example, in Fig. 9, the user chooses 3 modes, so k = 3. Once he/she finishes the specifications, he/she clicks on the "Cluster" button to start the clustering process.
Fig 9: The interface of our application
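The paper does not give the XML layout of an ORS file, so the element and attribute names below are assumptions; the sketch only illustrates the extraction step using the standard ElementTree parser.

import xml.etree.ElementTree as ET

def load_ors(path):
    """Extract schema, dimensions, levels and measures from one ORS file.
    The tag and attribute names are assumed; adapt them to the actual layout."""
    root = ET.parse(path).getroot()
    return {
        "schema": root.get("name"),
        "dimensions": [d.get("name") for d in root.iter("dimension")],
        "levels": [l.get("name") for l in root.iter("level")],
        "measures": [m.get("name") for m in root.iter("measure")],
    }

# One entry per selected file; choosing k of them as initial modes gives k clusters.
schemas = [load_ors(p) for p in ["ors1.xml", "ors2.xml", "ors3.xml"]]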
Fig. 10 presents an example of the result of clustering the schemas. We propose presenting the elements as a graph so we can see the clusters and their content. From one cluster we can distinguish the different schemas; for each schema, we can determine the dimensions and the measures, and for each dimension we can see the levels.
Fig 10: Example of the result of the clustering
6. CONCLUSION

In this work we proposed AK-Mode, an extension of k-mode used to cluster the schemas extracted from the OLAP requirements. The proposed algorithm integrates an ontology to take into consideration the semantic aspect when comparing the different schemas.

The goal behind this work is to get a set of clusters, each containing a set of schemas belonging to the same domain, which facilitates the construction of the data mart schemas that will later be used to build the schema of the data warehouse.

As a perspective, we propose the use of a "union-based algorithm" instead of a "frequency-based algorithm" to improve the update of the "Mode"; we also propose matching and mapping techniques to ensure the fusion of the different schemas existing in one cluster, to get the corresponding data mart schema.
REFERENCES

[1] A. Omari, M. B. Lamine, and S. Conrad, "On Using Clustering And Classification During The Design Phase To Build Well-Structured Retail Websites", IADIS European Conference on Data Mining 2008, Amsterdam, The Netherlands, 2008, pp. 51-59.

[2] A. Hotho, S. Staab, and G. Stumme, "Wordnet improves Text Document Clustering", In: Proceedings of the Semantic Web Workshop at SIGIR-2003, 26th Annual International ACM SIGIR Conference, Toronto, Canada, 2003.

[3] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Comput. Surv., Vol. 31, 1999, pp. 264-323.

[4] B. Andreopoulos, A. An, and X. Wang, "MULIC: Multi-Layer Increasing Coherence Clustering of Categorical data sets", Technical Report CS-2004-07, York University, 2004.

[5] D. Barbara, J. Couto, and Y. Li, "COOLCAT: An entropy-based algorithm for categorical clustering", In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, 2002, pp. 582-589.

[6] D. Chen, D. W. Cui, C. X. Wang, and Z. R. Wang, "A Rough Set-Based Hierarchical Clustering Algorithm for Categorical Data", International Journal of Information Technology, Vol. 12, 2006.

[7] D. Hand, H. Mannila, and P. Smyth, "Principles of Data Mining", MIT Press, Cambridge, MA, 2001.

[8] E. Annoni, F. Ravat, O. Teste, and G. Zurfluh, "Towards multidimensional requirement design", In: DaWaK 2006, LNCS, Vol. 4081, 2006, pp. 75-84.

[9] E. Zimányi and E. Malinowski, "Advanced data warehouse design", Springer, 2008.

[10] J. Han, J. Y. Chiang, S. Chee, J. Chen, Q. Chen, S. Cheng, W. Gong, M. Kamber, K. Koperski, G. Liu, Y. Lu, N. Stefanovic, L. Winstone, B. B. Xia, O. R. Zaiane, S. Zhang, and H. Zhu, "DBMiner: A System for Data Mining in Relational Databases and Data Warehouses", In: Proceedings of CASCON'97: Meeting of Minds, Toronto, Canada, 1997.

[11] J. Schiefer, B. List, and R. M. Bruckner, "A Holistic Approach for Managing Requirements of Data Warehouse Systems", In: Proceedings of the 8th Americas Conference on Information Systems, 2002.

[12] H. Rezankova, "Cluster Analysis and Categorical Data", Profesional Publishing, Vysoka Skola Ekonomicka v Praze, Praha, 2009.

[13] http://wordnet.princeton.edu/

[14] L. Jing, L. Zhou, M. K. Ng, and J. Z. Huang, "Ontology-based Distance Measure for Text Clustering", In: Proceedings of the Fourth Workshop on Text Mining, Sixth SIAM International Conference on Data Mining, Bethesda, Maryland, 2006.

[15] M. Batet, A. Valls, and K. Gibert, "Improving classical clustering with ontologies", In: Proceedings of IASC08, Japan, 2008.

[16] M. Chen, Q. Zhu, and Z. Chen, "An integrated interactive environment for knowledge discovery from heterogeneous data sources", Information and Software Technology, Vol. 43, 2001, pp. 487-496.

[17] M. Dutta, A. K. Mahanta, and A. K. Pujari, "QROCK: A Quick Version of the ROCK Algorithm for Clustering of Categorical Data", Pattern Recognition Letters, 2005, pp. 2364-2373.

[18] M. Gyssens and L. V. S. Lakshmanan, "A Foundation for Multi-Dimensional Databases", In: 23rd Int. Conf. on Very Large Data Bases (VLDB), 1997, pp. 106-115.

[19] M. K. Ng, M. J. Li, J. Z. Huang, and Z. He, "On the Impact of Dissimilarity Measure in k-modes Clustering Algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, pp. 503-507.

[20] M. Peters and M. J. Zaki, "CLICK: Clustering Categorical Data using K-partite Maximal Cliques", International Conference on Data Engineering (ICDE), 2005.

[21] M. Yan, "Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion", PhD thesis, Blacksburg, Virginia, November 2005.

[22] O. M. San, V. N. Huynh, and Y. Nakamori, "An Alternative Extension of the k-Means Algorithm for Clustering Categorical Data", Journal of Applied Mathematics and Computer Science, 2004, pp. 241-247.

[23] P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik, "LIMBO: Scalable Clustering of Categorical Data", In: Proceedings of the 9th International Conference on Extending Database Technology (EDBT), Heraklion, Greece, 2004, pp. 123-146.

[24] Q. Chen, U. Dayal, and M. Hsu, "An OLAP-based Scalable Web Access Analysis Engine", In: Proceedings of CASCON'97: Meeting of Minds, Toronto, Canada, 1997.

[25] R. Ben Messaoud, S. Rabaséda, O. Boussaid, and F. Bentayeb, "OpAC: A New OLAP Operator Based on a Data Mining Method", Sixth International Baltic Conference on Databases and Information Systems (DB&IS 04), Riga, Latvia, 2004.

[26] R. Shahid, S. Bertazzon, M. L. Knudtson, and W. A. Ghali, "Comparison of distance measures in spatial analytical modeling for health service planning", BMC Health Services Research, 2009.

[27] R. Studer, V. R. Benjamins, and D. Fensel, "Knowledge Engineering: Principles and Methods", IEEE Trans. on Data and Knowledge Engineering, 1998, pp. 161-197.

[28] S. Aranganayagi and K. Thangavel, "Clustering Categorical Data using Bayesian Concept", International Journal of Computer Theory and Engineering, 2009, pp. 119-125.

[29] S. Goil and A. Choudhary, "High Performance Multidimensional Analysis and Data Mining", In: International Database Engineering and Application Symposium, 1999.

[30] S. Goil, "PARSIMONY: An Infrastructure for Parallel Multidimensional Analysis and Data Mining", Journal of Parallel and Distributed Computing, 2001, pp. 285-321.

[31] S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes", Inf. Syst., UK: Elsevier Science Ltd, 2000, pp. 345-366.

[32] S. Lin and D. E. Brown, "Outlier-based Data Association: Combining OLAP and Data Mining", Technical Report, Department of Systems Engineering, University of Virginia, 2002.

[33] S. Sarawagi, "iDiff: Informative Summarization of Differences in Multidimensional Aggregates", Data Min. Knowl. Discov., 2001, pp. 255-276.

[34] V. Alexiev, M. Breu, J. Bruijn, D. Fensel, R. Lara, and H. Lausen, "Information Integration with Ontologies: Experiences from an Industrial Showcase", John Wiley & Sons, Ltd., 2005.

[35] V. Faber, "Clustering and the Continuous k-means Algorithm", Los Alamos Science, 1994, pp. 138-144.

[36] Z. Huang, "A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining", In: Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.