Network Data as Complex Data Objects: An Approach using Symbolic Data Analysis

Giuseppe Giordano and Paula Brito

Abstract In this work we focus on network data as defined in the framework of Social Network Analysis and Graph Theory. Each network is represented as a complex data object by using the Symbolic Data Analysis (SDA) approach. SDA aims at extending statistics and data mining methods from first-order objects (i.e. micro-data) to second-order objects, often obtained by aggregating micro-data into more or less large groups, taking into account the variability that is inherent to the data. The definition of a graph structure as a complex data object considers the different structural information that can be of interest to retrieve. Descriptive statistics provide a first insight into the network structure. The basic idea is to aggregate the information attached to each node, in terms of its centrality and role in the network, and express it as symbolic data by means of interval or histogram-valued variables. We then obtain a symbolic data table where each row pertains to a different network and the columns hold network indices, i.e. each row defines a Network Symbolic Object (NSO). Symbolic data analysis of NSO could be applied i) to compare several networks that emerged at different points in time, ii) to compute similarities among networks and iii) to represent networks as points in a reduced embedding (metric space). A simulation study as well as an empirical case study are provided.

Key words: Histogram-Valued Data, Social Network Analysis, Symbolic Data.

Giuseppe Giordano
Department of Economics and Statistics, Via Ponte Don Melillo, 84084 Fisciano (Salerno), ITALY e-mail: [email protected]

Paula Brito
FEP & LIAAD INESC TEC, Univ. Porto, Rua Dr.
Roberto Frias, 4200-464 Porto, PORTUGAL e-mail: [email protected]

1 Introduction and Definitions

In recent years the network paradigm has established itself as one of the most attractive and valuable models to describe and represent the complexity of relationships among a wide variety of actors. In a wider sense, the concept of a network can be applied to any kind of actors able to establish a relationship. However, the concept assumes special importance when the actors are individuals and when the relationships refer to specific states or properties attached to each pair of subjects (personal relations such as trust, acquaintance, collaboration, etc.). These kinds of networks involve human beings, and the study of their birth, growth, shape and topology is the scope of Social Network Analysis. Nevertheless, the concept of a network is so intuitive and easy to generalize that the underlying paradigm has been successfully applied to very different fields of knowledge, ranging from Communication and Transportation to Economics and Finance, through Medicine, Ecology, Linguistics, Computer Science and many more.

Because of its generality in representing connectivity among interacting parts, the network skeleton has been investigated from different and complementary points of view: with descriptive and statistical modelling aims, addressing numerical and computational drawbacks, and modelling evolutionary and dynamical features.

Network data are characterized by two sets of items: nodes and edges; the mathematical counterpart is a graph. Let G(𝒩, ℰ) be the graph formed by the set 𝒩 of N nodes (vertices) and by the set ℰ of K edges; different kinds of information can be attached to both nodes and edges as attribute data. Thus N is the order and K is the size of the graph. The degree of a node is the number of edges that connect to it.
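The definitions of order, size and degree can be illustrated with a minimal sketch in plain Python. The toy edge list and the helper name `degrees` are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def degrees(nodes, edges):
    """Return {node: degree} for an undirected simple graph.

    Assumes every edge is a pair of distinct nodes listed once
    (no loops, no multiple edges).
    """
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


# Hypothetical toy graph (not taken from the paper).
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]

order = len(nodes)  # N, the number of nodes
size = len(edges)   # K, the number of edges
deg = degrees(nodes, edges)

print(order, size, deg)
```

In an undirected graph each edge contributes to the degree of both of its endpoints, so the degrees always sum to 2K; this gives a quick sanity check on any edge list.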
In the following we refer to undirected simple finite graphs, that is, edges between two nodes have no orientation, no loops are considered and no more than one edge exists between any two different nodes.

A network is analysed by looking at the different pieces of information attached to the different sets: i) node data; ii) node-attribute data; iii) edge data and iv) edge-attribute data. The first two are monadic data, i.e. they pertain to individual units, as with traditional statistical variables; types iii) and iv) are dyadic data, since they pertain to pairs of nodes. For instance, the node list is the set of all nodes (possibly ordered and labelled), while the edge list (or incidence list) is the set of all pairs of nodes sharing a link. On these lists we may define attribute data in terms of node attributes (e.g. the usual socio-demographic variables when nodes represent individuals, as in social networks) and edge attributes (e.g. cost, length), recorded as dichotomous or continuous values (binary or weighted network).

On such network data, several analyses can be carried out with descriptive or modelling purposes. From a descriptive point of view, network statistical indices can be defined at a local level, as measurements on each node (or edge), or at a global level, as measurements on the entire network. In the first case we may retrieve the statistical distribution of each network index and analyse it. Several network indices have been defined to take into account the centrality position and the role of the nodes in the network (see Freeman (1979) [2] and Wasserman and Faust (1994) [6] for definitions and interpretations), such as degree centrality, density, cohesion, etc. The most important are Degree, Closeness, Betweenness and Eigenvector centrality.
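Each node-level index named above yields one value per node, and therefore a distribution per network. The following is a minimal sketch, in plain Python, of how such a distribution might be summarized as an interval or as a histogram. Only degree and closeness centrality are computed; betweenness and eigenvector centrality are left out for brevity. The toy graph, the function names, the normalization by N - 1 and the choice of four equal-width bins on [0, 1] are all assumptions made for illustration, not the authors' procedure.

```python
from collections import deque


def adjacency(nodes, edges):
    """Build {node: set of neighbours} for an undirected simple graph."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def degree_centrality(adj):
    """Degree divided by N - 1 (assumed normalization)."""
    n = len(adj)
    return {v: len(nb) / (n - 1) for v, nb in adj.items()}


def closeness_centrality(adj):
    """(N - 1) / sum of shortest-path distances, via BFS.

    Assumes a connected graph with at least two nodes.
    """
    n = len(adj)
    out = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        out[s] = (n - 1) / sum(dist.values())
    return out


def as_interval(values):
    """Interval-valued summary: (min, max)."""
    return (min(values), max(values))


def as_histogram(values, bins=4, lo=0.0, hi=1.0):
    """Histogram-valued summary: relative frequencies on equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in values:
        k = min(int((x - lo) / width), bins - 1)  # put x == hi in last bin
        counts[k] += 1
    return [c / len(values) for c in counts]


# Hypothetical toy graph (not taken from the paper).
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]

adj = adjacency(nodes, edges)
deg_c = degree_centrality(adj)
clo_c = closeness_centrality(adj)

# One row of a symbolic data table: one network, several symbolic variables.
nso_row = {
    "degree_interval": as_interval(list(deg_c.values())),
    "degree_histogram": as_histogram(list(deg_c.values())),
    "closeness_interval": as_interval(list(clo_c.values())),
    "closeness_histogram": as_histogram(list(clo_c.values())),
}
print(nso_row)
```

Repeating this for several networks and stacking the resulting rows would give a table with one row per network. Fixing the same bin edges for all networks keeps the histograms directly comparable, which is why the sketch bins on [0, 1] rather than on the observed range.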
Moreover, global statistics can be computed to capture topological characteristics of the network as a whole, as well as the presence of subgroups (social structures); examples are density, diameter, number of cliques and size of the largest clique.

2 Graphs as complex data objects

The definition of the graph structure G as a complex data object should consider the different structural information that can be of interest to retrieve. We may start from descriptive statistical measures that provide a first insight into the network structure. The basic idea is to aggregate the information attached to each node, in terms of its centrality and role in the network, and express it as symbolic data by means of interval or histogram-valued variables (see, for instance, Bock & Diday (2000) [1] or Noirhomme-Fraiture and Brito (2011) [4]), so that the whole network can be expressed through the logical union of these different measurements. The final output should allow us to build a symbolic data table where each row pertains to a different network and each column to a network index. That is, each row defines a Network Symbolic Object (NSO).

Symbolic data analysis of NSO could be applied to compare several networks that emerged on different occasions and at different times, to compute similarities among networks (see, e.g., Verde and Irpino (2008) [5]) and to represent networks as points in a reduced embedding (metric space).

In the following, a simulation study is carried out to generate several network data structures. Traditional network analysis of such data produces a symbolic data table representing the statistical distributions of the network indices.

2.1 Simulation design

The simulation scheme controls for three factors (generating process, process parameter and graph order), each at three levels, for a total of 3 × 3 × 3 = 27 network data structures. The following factors are considered:
- Order of the graph: N ∈ {100; 300; 500}.
- Generating process: GP ∈ {Random Graph; Preferential Attachment; Small-World}.
- Process parameters: for each generating process, a specific parameter is varied that controls, respectively:
  • the density of the Random Graph: p ∈ {0.01; 0.03; 0.05};
  • the power of the Preferential Attachment: λ ∈ {0.75; 1.00; 1.25};
  • the rewiring probability of the Small-World model: π ∈ {0.005; 0.01; 0.05}.

Figure 1 presents the graph representations of the 27 networks, arranged by factor and level of the simulation design.

Fig. 1 The 27 networks generated by the simulation scheme

2.2 Data analysis

Suitable multivariate symbolic data analysis may then be performed on the obtained symbolic data array. In a first step we follow a clustering approach, using different attribute representations, different combinations of attributes and different dissimilarity measures. Classical hierarchical clustering based on a quantile representation (see Ichino (2008) [3]) of the symbolic network data is performed using different aggregation indices, providing dendrograms on the set of networks. Other distances, better suited to the type of data at hand (see, e.g., Verde and Irpino (2008) [5]), are also to be used. On the other hand, conceptual clustering approaches, which take the network symbolic descriptions directly into account (and are not solely based on distance matrices), may provide a different insight. Future work should address discriminant analysis, to highlight the role of the different retrieved attributes and their discriminant power with respect to the various network classes, or else to identify particular network patterns.

References

1. Bock, H.-H., Diday, E.: Analysis of Symbolic Data, XVIII, Springer, Berlin (2000).
2. Freeman, L.C.: Centrality in Social Networks I: Conceptual Clarification. Social Networks, 1, 215–239 (1979).
3. Ichino, M.: Symbolic PCA for Histogram-Valued Data. In: Proc.
IASC 2008, Yokohama, Japan (2008).
4. Noirhomme-Fraiture, M., Brito, P.: Far Beyond the Classical Data Models: Symbolic Data Analysis. Statistical Analysis and Data Mining, 4, 2, 157–170 (2011).
5. Verde, R., Irpino, A.: Comparing Histogram Data Using a Mahalanobis-Wasserstein Distance. In: Brito, P. (ed.) COMPSTAT 2008, pp. 77–89. Physica-Verlag HD (2008).
6. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge University Press, New York (1994).