Open Problems in Relational Data Clustering
University of Maryland Baltimore County
Adam Anthony
Marie desJardins
Overview
• Data clustering is the task of detecting patterns in a set of data.
• Most algorithms take non-relational data as input and are sometimes unable to find significant patterns.
• Many data sets include relational information as well as independent object attributes.
• Relational data clustering techniques can help find strong patterns in such sets.
• Two areas of interest in relational data clustering are clustering heterogeneous data and relation selection.
Heterogeneous Data
It can be very difficult to compare objects of different types. For example, how can actors be compared to directors? One possibility is an inter-cluster relation signature.
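As a minimal sketch of how such a signature could be computed, assume a reference clustering of one object type is already available (here, hypothetical movie ids grouped under the labels Boxing, Comedy, and Drama). Each person's signature counts that person's links into each reference cluster, which places actors and directors in the same vector space. The function name, the ids, and the link data are illustrative assumptions, not data from the poster.

```python
import math
from collections import Counter

def relation_signatures(links, reference_clusters):
    """Count, for each object, its links into each reference cluster.

    links: iterable of (object, reference_object) pairs.
    reference_clusters: dict mapping reference_object -> cluster label.
    Returns (labels, signatures), where signatures[obj][i] is the number
    of links from obj into the cluster labels[i].
    """
    labels = sorted(set(reference_clusters.values()))
    counts = {}
    for obj, ref in links:
        counts.setdefault(obj, Counter())[reference_clusters[ref]] += 1
    signatures = {obj: [c[label] for label in labels] for obj, c in counts.items()}
    return labels, signatures

# Hypothetical reference clustering of movies (assumed to be given).
reference_clusters = {"m1": "Boxing", "m2": "Boxing", "m3": "Drama", "m4": "Comedy"}

# Hypothetical acted-in / directed links from people to movies.
links = [
    ("actor_a", "m1"), ("actor_a", "m2"),
    ("director_b", "m1"), ("director_b", "m3"),
    ("actor_c", "m3"),
]

labels, signatures = relation_signatures(links, reference_clusters)
# labels == ["Boxing", "Comedy", "Drama"]
# signatures["actor_a"] == [2, 0, 0]; signatures["director_b"] == [1, 0, 1]

# Actors and directors can now be compared with any vector distance.
distance = math.dist(signatures["actor_a"], signatures["director_b"])
```

Any standard vector clustering algorithm could then be run on the signatures; which one to use is left open here.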
Example Data Sets
A feature space is a set of objects with attributes,
$FS = \{o_1, o_2, \ldots, o_n\}$,
where
$o_i = \langle a_1, a_2, \ldots, a_m \rangle$.
A relation space is a set of relation graphs,
$RS = \{RG_1, RG_2, \ldots, RG_K\}$,
where
$RG_i = \{O_i, R_i\}$,
$O_i \subseteq FS$,
and $R_i$ is a set of edges for a specific relation.
Internet Movie Database
Attributes include personal data such as awards received, financial earnings, age, gender, or Hollywood Stock Exchange rating. Examples of relations are acted-in, directed, and sequel.
CIA World Factbook
Attribute values come from categories like government, economics, and population. Relations can be derived from sources such as common membership in international organizations.
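The feature space and relation space definitions can be mirrored in a small data structure. The sketch below is one possible encoding, not the authors' implementation: the feature space is a dict from object to attributes, and each relation graph is a pair (objects, edges) built from common membership in one organization. The `continent` attribute, the helper name `membership_graph`, and the short member lists are illustrative assumptions.

```python
from itertools import combinations

# Feature space FS: object -> attribute vector (illustrative attribute only).
feature_space = {
    "Italy": {"continent": "Europe"},
    "US": {"continent": "North America"},
    "UK": {"continent": "Europe"},
    "Japan": {"continent": "Asia"},
    "China": {"continent": "Asia"},
}

def membership_graph(members, fs):
    """Build RG = (O, R) from common membership.

    O is the set of member objects and must be a subset of the feature
    space; R holds one undirected edge per pair of co-members.
    """
    objects = frozenset(members)
    if not objects.issubset(fs):
        raise ValueError("relation graph objects must come from the feature space")
    edges = frozenset(frozenset(pair) for pair in combinations(sorted(objects), 2))
    return objects, edges

# Relation space RS: one relation graph per organization (partial lists).
relation_space = {
    "G-8": membership_graph(["Italy", "US", "UK", "Japan"], feature_space),
    "UNSC": membership_graph(["US", "UK", "China"], feature_space),
}
```

Storing edges as frozensets keeps them undirected, so the pair (US, UK) is the same edge regardless of order.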
[Figure: example relation graphs. Node labels include Norman Jewison, Carl Weathers, and Talia Shire with directed and acted-in edges (Internet Movie Database), and Italy, US, UK, Botswana, and China with G-8, UNSC, G-77, AU, and AsDB membership edges (CIA World Factbook). A second panel shows the same people with link counts to Boxing, Comedy, and Drama.]
Clustering heterogeneous data with inter-cluster relation signatures takes three steps:
1. Cluster one set of homogeneous data. This is the reference clustering.
2. For each object, create a vector that records the number of links from that object to each cluster discovered in step 1. This is the inter-cluster relation signature.
3. Cluster all objects based on the inter-cluster relation signatures.
[Figure: inter-cluster relation signature example. Labels include Ron Howard, Boxing, Comedy, and Drama.]
Relation Selection
It is intuitive that, just as some features are not helpful for clustering a data set, some relations might provide little information for a relational clustering algorithm, or even harm its performance. As relational clustering algorithms continue to develop, detecting such relation graphs will become more important.
The graph on the right includes an additional relation graph (blue links) that represents the World Trade Organization, which fully connects all countries shown (redundant links omitted). Including the WTO as one of the relation graphs obscures the patterns that can be seen in the graph on the left, making a clustering harder to find.
We find this situation to be similar to cases in the feature space where an attribute has the same value for all objects. Removing the WTO graph reduces the size of the total graph and makes finding patterns easier.
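The WTO example suggests one simple filter, sketched below under stated assumptions: drop any relation graph whose edge density reaches a threshold, so that a graph connecting every pair of objects is removed in the same way a constant attribute would be. This is an illustration of the idea, not a relation selection method proposed on the poster. The helper names, the `max_density` parameter, and the short G-8 member list are assumptions.

```python
from itertools import combinations

def clique(members):
    """Edges that fully connect the given members (undirected pairs)."""
    return {frozenset(pair) for pair in combinations(sorted(members), 2)}

def edge_density(edges, n_objects):
    """Fraction of all possible undirected edges that are present."""
    possible = n_objects * (n_objects - 1) // 2
    return len(edges) / possible if possible else 0.0

def select_relations(relation_space, objects, max_density=1.0):
    """Keep only relation graphs whose density is below max_density.

    With the default threshold, only graphs that fully connect all
    objects are dropped.
    """
    n = len(objects)
    return {
        name: edges
        for name, edges in relation_space.items()
        if edge_density(edges, n) < max_density
    }

countries = ["Italy", "US", "UK", "Botswana", "Kenya", "China", "Thailand", "Japan"]
relation_space = {
    "G-8": clique(["Italy", "US", "UK", "Japan"]),  # illustrative subset
    "WTO": clique(countries),                        # connects every pair
}

kept = select_relations(relation_space, countries)
```

Lowering `max_density` would also drop nearly complete graphs; what threshold is appropriate for a given data set is an open question here.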
[Figure: two relation graphs over Italy, US, UK, Botswana, Kenya, China, Thailand, and Japan. Left: links for shared membership in the G-8, UNSC, G-77, AU, and AsDB. Right: the same graph with additional WTO links (blue).]
Prior Research
• Join related objects to form independent compound objects, cluster normally
(Yin et al., 2005).
• Use attribute-based distance measures as weights in a relation graph; adapt
a graph cutting algorithm to use edge weights (Neville et al., 2003).
• Probabilistic relational model with an adapted EM algorithm (Taskar et al.,
2001).
• Calculate a hybrid metric that linearly combines relation similarity and
attribute similarity, then run a single-link algorithm (Bhattacharya and Getoor, 2005).
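The last approach above, a linear combination of relation and attribute similarity, can be illustrated in generic form. The sketch below is not the metric of Bhattacharya and Getoor: the weight `alpha`, the matching-attribute similarity, and the Jaccard similarity over neighbor sets are all assumptions chosen to keep the example small.

```python
def attribute_similarity(a, b):
    """Fraction of shared attribute names whose values match."""
    keys = a.keys() & b.keys()
    return sum(a[k] == b[k] for k in keys) / len(keys) if keys else 0.0

def relation_similarity(neighbors_a, neighbors_b):
    """Jaccard similarity of two objects' neighbor sets in a relation graph."""
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0

def hybrid_similarity(attr_sim, rel_sim, alpha=0.5):
    """Linear combination: alpha weights the relational part."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * attr_sim + alpha * rel_sim

# Two hypothetical objects with attributes and neighbor sets.
attr = attribute_similarity({"x": 1, "y": 2}, {"x": 1, "y": 3})
rel = relation_similarity({"a", "b"}, {"b", "c"})
score = hybrid_similarity(attr, rel, alpha=0.25)
```

A single-link algorithm would then merge the pair of clusters containing the two most similar objects at each step, using this score as the similarity.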
This research was funded by NSF grant #0545726.
Conclusion
• Early research in relational clustering has been successful.
• Analyzing relational patterns can help us develop methods
for comparing heterogeneous data objects.
• Development of relation selection techniques will help
improve existing relational clustering algorithms.
References
Bhattacharya, I., & Getoor, L. (2005). Entity resolution in graph data (Technical Report CS-TR-4758).
University of Maryland.
Neville, J., Adler, M., & Jensen, D. (2003). Clustering relational data using attribute and link information.
Proceedings of the Text Mining and Link Analysis Workshop.
Taskar, B., Segal, E., & Koller, D. (2001). Probabilistic classification and clustering in relational data.
Proceedings of IJCAI-01, 17th International Joint Conference on Artificial Intelligence (pp. 870–878).
Seattle, US.
Yin, X., Han, J., & Yu, P. S. (2005). Cross-relational clustering with user’s guidance. KDD ’05:
Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in
Data Mining (pp. 344–353). New York, NY, USA: ACM Press.