Open Problems in Relational Data Clustering
University of Maryland Baltimore County
Adam Anthony [email protected]
Marie desJardins [email protected]

Overview
• Data clustering is the task of detecting patterns in a set of data.
• Most algorithms take non-relational data as input and are sometimes unable to find significant patterns.
• Many data sets can include relational information, as well as independent object attributes.
• Relational data clustering techniques can help find strong patterns in such sets.
• Two areas of interest in relational data clustering are clustering heterogeneous data and relation selection.

Example Data Sets
A feature space is a set of objects with attributes, FS = {o_1, o_2, ..., o_n}, where o_i = <a_1, a_2, ..., a_m>.
A relation space is a set of relation graphs, RS = {RG_1, RG_2, ..., RG_K}, where RG_i = {O_i, R_i}, O_i ⊆ FS, and R_i is a set of edges for a specific relation.

Internet Movie Database: Attributes include personal data such as awards received, financial earnings, age, gender, or Hollywood stock exchange rating. Examples of relations are acted-in, directed, and sequel.

CIA World Factbook: Attribute values come from categories like government, economics, and population. Relations can be derived from sources such as common membership in international organizations.

[Figure: example relation graphs, with panel labels Relation Space and Feature Space. Movie example: Norman Jewison, Ron Howard, Carl Weathers, and Talia Shire, with directed and acted-in edges. Country example: Italy, US, UK, China, and Botswana, with edges labeled by common membership in G-8, UNSC, G-77, AU, and AsDB.]

Heterogeneous Data
It can be very difficult to compare objects of different types. For example, how can actors be compared to directors? One possibility is an inter-cluster relation signature:
1. Cluster one set of homogeneous data. This is the reference clustering.
2. For each object, create a vector that records the number of links from that object to each cluster discovered in step 1. This is the inter-cluster relation signature.
3. Cluster all objects based on the inter-cluster relation signatures.

[Figure: directed and acted-in link counts from Ron Howard, Norman Jewison, Carl Weathers, and Talia Shire to Boxing, Comedy, and Drama.]

Relation Selection
It is intuitive that, just as some features are not helpful for clustering a data set, some relations might provide little information for a relational clustering algorithm, or even harm its performance. As relational clustering algorithms continue to develop, detecting such graphs will become more important.

The graph on the right includes an additional relation graph (blue links) that represents the World Trade Organization, which fully connects all countries shown (redundant links omitted). Including the WTO as one of the relation graphs obscures the patterns that can be seen in the graph on the left, making a clustering harder to find.

We find this situation to be similar to cases in the feature space where an attribute has the same value for all objects. Removing the WTO graph reduces the size of the total graph and makes finding patterns easier.

[Figure: two country relation graphs over Italy, US, UK, China, Botswana, Kenya, Thailand, and Japan, with edges labeled G-8, UNSC, G-77, AU, and AsDB. Left: without the WTO relation graph. Right: with the WTO relation graph added.]

Prior Research
• Join related objects to form independent compound objects, then cluster normally (Yin et al., 2005).
• Use attribute-based distance measures as weights in a relation graph; adapt a graph cutting algorithm to use edge weights (Neville et al., 2003).
• Probabilistic relational model with an adapted EM algorithm (Taskar et al., 2001).
• Calculate a hybrid metric that linearly combines relation similarity and attribute similarity, then run a single-link algorithm (Bhattacharya and Getoor, 2005).

Conclusion
• Early research in relational clustering has been successful.
• Analyzing relational patterns can help us develop methods for comparing heterogeneous data objects.
• Development of relation selection techniques will help improve existing relational clustering algorithms.

This research was funded by NSF grant #0545726.

References
Bhattacharya, I., & Getoor, L. (2005). Entity resolution in graph data (Technical Report CS-TR-4758). University of Maryland.
Neville, J., Adler, M., & Jensen, D. (2003). Clustering relational data using attribute and link information. Proceedings of the Text Mining and Link Analysis Workshop.
Taskar, B., Segal, E., & Koller, D. (2001). Probabilistic classification and clustering in relational data. Proceedings of IJCAI-01, 17th International Joint Conference on Artificial Intelligence (pp. 870–878). Seattle, US.
Yin, X., Han, J., & Yu, P. S. (2005). Cross-relational clustering with user’s guidance. KDD ’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (pp. 344–353). New York, NY, USA: ACM Press.
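
The three-step inter-cluster relation signature procedure described under Heterogeneous Data can be sketched roughly as follows. This is only an illustration under stated assumptions: the object and movie identifiers are hypothetical toy data (not taken from the Internet Movie Database), the reference clustering of step 1 is assumed to be given, and plain k-means with deterministic seeding stands in for whatever clustering algorithm one would actually use in step 3.

```python
from collections import Counter

def relation_signatures(links, reference_clusters, cluster_ids):
    """Step 2: for each object, count its links into each reference cluster."""
    signatures = {}
    for obj, neighbors in links.items():
        counts = Counter(reference_clusters[n] for n in neighbors
                         if n in reference_clusters)
        signatures[obj] = [counts.get(c, 0) for c in cluster_ids]
    return signatures

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iterations=20):
    """Step 3 (assumed choice): k-means seeded with the first k distinct vectors."""
    names = sorted(vectors)
    centers = []
    for name in names:
        v = [float(x) for x in vectors[name]]
        if v not in centers:
            centers.append(v)
        if len(centers) == k:
            break
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            name: min(range(len(centers)),
                      key=lambda c: squared_distance(vectors[name], centers[c]))
            for name in names
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for c in range(len(centers)):
            members = [vectors[n] for n in names if assignment[n] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Step 1 (assumed given): hypothetical movies already clustered by genre.
reference_clusters = {"m1": "boxing", "m2": "boxing",
                      "m3": "comedy", "m4": "comedy", "m5": "drama"}
cluster_ids = ["boxing", "comedy", "drama"]

# Hypothetical acted-in / directed links from people to movies.
links = {
    "actor_a": ["m1", "m2"],
    "director_b": ["m1", "m2", "m5"],
    "actor_c": ["m3", "m4"],
    "director_d": ["m3", "m4", "m5"],
}

signatures = relation_signatures(links, reference_clusters, cluster_ids)
clusters = kmeans(signatures, k=2)
```

Because actors and directors are both described by link counts into the same reference clusters, their signatures live in one vector space and can be compared directly, even though their own attributes differ in type.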
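
The Relation Selection discussion compares a fully connected relation graph to an attribute that has the same value for every object. The poster does not propose a specific selection method, so the following is only one simple heuristic under that analogy: compute the edge density of each relation graph and set aside any graph that reaches a threshold (fully connected by default). The function names, the `max_density` parameter, and the toy objects and relations are all hypothetical.

```python
from itertools import combinations

def edge_density(objects, edges):
    """Fraction of all possible undirected object pairs that the relation links."""
    possible = len(objects) * (len(objects) - 1) // 2
    if possible == 0:
        return 0.0
    # Treat edges as undirected and ignore duplicates and self-loops.
    distinct = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    return len(distinct) / possible

def select_relations(objects, relation_space, max_density=1.0):
    """Split relation graphs into kept and dropped by density (assumed heuristic)."""
    kept, dropped = [], []
    for name, edges in relation_space.items():
        if edge_density(objects, edges) >= max_density:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped

# Hypothetical toy relation space over four objects.
objects = ["c1", "c2", "c3", "c4"]
relation_space = {
    "org_a": [("c1", "c2")],
    "org_b": [("c3", "c4"), ("c2", "c3")],
    "org_all": list(combinations(objects, 2)),  # links every pair, like the WTO graph
}

kept, dropped = select_relations(objects, relation_space)
```

A density test only catches the extreme case the poster illustrates. Relations that are uninformative for other reasons would need a different criterion.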