Open Problems in Relational Data Clustering
Adam Anthony, Marie desJardins
University of Maryland Baltimore County

Overview
• Data clustering is the task of detecting patterns in a set of data.
• Most algorithms take non-relational data as input and are sometimes unable to find significant patterns.
• Many data sets include relational information as well as independent object attributes.
• Relational data clustering techniques can help find strong patterns in such sets.
• Two areas of interest in relational data clustering are clustering heterogeneous data and relation selection.

Feature Space
A feature space is a set of objects with attributes, FS = {o_1, o_2, …, o_n}, where o_i = <a_1, a_2, …, a_m>.

Relation Space
A relation space is a set of relation graphs, RS = {RG_1, RG_2, ..., RG_K}, where RG_i = {O_i, R_i}, O_i ⊆ FS, and R_i is a set of edges for a specific relation.

Example Data Sets
Internet Movie Database: Attributes include personal data such as awards received, financial earnings, age, gender, or Hollywood stock exchange rating. Examples of relations are acted-in, directed, and sequel.
CIA World Factbook: Attribute values come from categories like government, economics, and population. Relations can be derived from sources such as common membership in international organizations.

[Figure: relation graphs for the two example data sets. Movie graph labels: Norman Jewison, Carl Weathers, Talia Shire, Ron Howard; edge types: acted-in, directed; other labels: Boxing, Comedy, Drama. Country graph labels: Italy, US, UK, China, Japan, Botswana, Kenya, Thailand; edge labels: G-8, UNSC, G-77, AU, AsDB.]

Heterogeneous Data
It can be very difficult to compare objects of different types. For example, how can actors be compared to directors? One possibility is an inter-cluster relation signature:
1. Cluster one set of homogeneous data. This is the reference clustering.
2. For each object, create a vector that records the number of links from that object to each cluster discovered in step 1. This is the inter-cluster relation signature.
3. Cluster all objects based on the inter-cluster relation signatures.

Relation Selection
It is intuitive that, just as some features are not helpful for clustering a data set, some relations might provide little information for a relational clustering algorithm, or even harm its performance. As relational clustering algorithms continue to develop, detecting such relation graphs will become more important.
The graph on the right includes an additional relation graph (blue links) that represents the World Trade Organization, which fully connects all countries shown (redundant links omitted). Including the WTO as one of the relation graphs obscures the patterns that can be seen in the graph on the left, making a clustering harder to find.
We find this situation to be similar to cases in the feature space where an attribute has the same value for all objects. Removing the WTO graph reduces the size of the total graph and makes finding patterns easier.

Prior Research
• Join related objects to form independent compound objects, then cluster normally (Yin et al., 2005).
• Use attribute-based distance measures as weights in a relation graph; adapt a graph cutting algorithm to use edge weights (Neville et al., 2003).
• Probabilistic relational model with an adapted EM algorithm (Taskar et al., 2001).
• Calculate a hybrid metric that linearly combines relation similarity and attribute similarity, then run a single-link algorithm (Bhattacharya and Getoor, 2005).

This research was funded by NSF grant #0545726.

Conclusion
• Early research in relational clustering has been successful.
• Analyzing relational patterns can help us develop methods for comparing heterogeneous data objects.
• Development of relation selection techniques will help improve existing relational clustering algorithms.

References
Bhattacharya, I., & Getoor, L. (2005). Entity resolution in graph data (Technical Report CS-TR-4758). University of Maryland.
Neville, J., Adler, M., & Jensen, D. (2003). Clustering relational data using attribute and link information. Proceedings of the Text Mining and Link Analysis Workshop.
Taskar, B., Segal, E., & Koller, D. (2001). Probabilistic classification and clustering in relational data. Proceedings of IJCAI-01, 17th International Joint Conference on Artificial Intelligence (pp. 870–878). Seattle, US.
Yin, X., Han, J., & Yu, P. S. (2005). Cross-relational clustering with user’s guidance. KDD ’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (pp. 344–353). New York, NY, USA: ACM Press.
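
Sketch: inter-cluster relation signature
The three-step inter-cluster relation signature procedure described under Heterogeneous Data could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the object names and links are invented toy data, the reference clustering of step 1 is taken as given, acted-in and directed links are pooled into one link list per person, and the small k-means routine is only a stand-in for whichever clustering algorithm is used in step 3.

```python
from collections import Counter

def relation_signatures(links, reference_clustering, cluster_labels):
    """Step 2: count each object's links into every reference cluster."""
    signatures = {}
    for obj, targets in links.items():
        counts = Counter(reference_clustering[t] for t in targets)
        signatures[obj] = tuple(counts.get(c, 0) for c in cluster_labels)
    return signatures

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iterations=20):
    """Tiny deterministic k-means; a stand-in for any vector clustering method."""
    names = sorted(points)
    # Seed the centers with the first k distinct signatures in name order.
    centers = []
    for name in names:
        if points[name] not in centers:
            centers.append(points[name])
        if len(centers) == k:
            break
    assignment = {}
    for _ in range(iterations):
        # Assign every object to its nearest center.
        assignment = {
            name: min(range(len(centers)),
                      key=lambda i: squared_distance(points[name], centers[i]))
            for name in names
        }
        # Move each center to the mean of its members.
        for i in range(len(centers)):
            members = [points[n] for n in names if assignment[n] == i]
            if members:
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment

# Step 1 (assumed already done): a reference clustering of the movies.
# All names below are hypothetical toy data.
reference_clustering = {
    "movie_a": "Boxing", "movie_b": "Boxing",
    "movie_c": "Comedy", "movie_d": "Comedy",
}
cluster_labels = ["Boxing", "Comedy"]

# Links from people to movies (acted-in and directed pooled together).
links = {
    "actor_1": ["movie_a", "movie_b"],
    "director_1": ["movie_a"],
    "actor_2": ["movie_c", "movie_d"],
    "director_2": ["movie_c"],
}

# Step 2: build the signatures. Step 3: cluster actors and directors together.
signatures = relation_signatures(links, reference_clustering, cluster_labels)
assignment = kmeans(signatures, k=2)
print(signatures)
print(assignment)
```

Because actors and directors are both described by link counts into the same reference clusters, they end up in one shared vector space and can be compared directly, which is the point of the signature.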
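
Sketch: flagging an uninformative relation graph
The WTO example under Relation Selection suggests one simple check: a relation graph that links every pair of objects behaves like an attribute with the same value for all objects. The sketch below illustrates only that degenerate case. It is not a relation selection algorithm from the poster; the function names, the density threshold, and the toy membership lists (loosely following the labels in the poster's figure, and not authoritative) are assumptions made for illustration.

```python
from itertools import combinations

def edge_density(objects, edges):
    """Fraction of all possible undirected object pairs linked by the relation."""
    possible = len(objects) * (len(objects) - 1) // 2
    linked = {frozenset(e) for e in edges if len(set(e)) == 2}
    return len(linked) / possible if possible else 0.0

def select_relations(objects, relation_space, max_density=1.0):
    """Keep relation graphs whose density is strictly below max_density."""
    return {
        name: edges
        for name, edges in relation_space.items()
        if edge_density(objects, edges) < max_density
    }

def clique(members):
    """Common membership modeled as links between every pair of members."""
    return list(combinations(members, 2))

# Toy data: illustrative memberships only.
countries = ["Italy", "US", "UK", "Japan", "China", "Botswana", "Kenya", "Thailand"]
relation_space = {
    "G-8": clique(["Italy", "US", "UK", "Japan"]),
    "UNSC": clique(["US", "UK", "China"]),
    "G-77": clique(["China", "Botswana", "Kenya", "Thailand"]),
    "WTO": clique(countries),  # fully connects all countries shown
}

selected = select_relations(countries, relation_space)
print(sorted(selected))
```

A lower `max_density` would also drop nearly complete graphs; choosing that threshold, or a better informativeness measure, is the open problem the poster points to.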