Network Data as Complex Data Objects: An
Approach using Symbolic Data Analysis
Giuseppe Giordano and Paula Brito
Abstract In this work we focus on network data as defined in the framework of
Social Network Analysis and Graph Theory. Each network is represented as a complex data object by using the Symbolic Data Analysis (SDA) approach. SDA aims
at extending statistics and data mining methods from first-order (i.e. micro-data) to
second-order objects, often obtained by aggregation of micro-data into more or less
large groups, taking into account variability that is inherent to the data. The definition of a graph structure as a complex data object considers the different structural
information that can be of interest to retrieve. Descriptive statistics provide a first
insight into the network structure. The basic idea is to aggregate information attached to each node in terms of its centrality and role in the network and express
it as symbolic data by means of interval or histogram-valued variables. We then
obtain a symbolic data table where each row pertains to a different network and
columns hold network indices, i.e. each row defines a Network Symbolic Object
(NSO). Symbolic data analysis of NSOs could be applied i) to compare
several networks that emerged at different occasions in time, ii) to compute
similarities among networks and iii) to represent networks as points on a reduced
embedding (metric space). A simulation study as well as an empirical case study are
provided.
Key words: Histogram-Valued Data, Social Network Analysis, Symbolic Data.
Giuseppe Giordano
Department of Economics and Statistics, Via Ponte Don Melillo, 84084 Fisciano (Salerno), ITALY
e-mail: [email protected]
Paula Brito
FEP & LIAAD INESC TEC, Univ. Porto, Rua Dr. Roberto Frias, 4200-464 Porto, PORTUGAL
e-mail: [email protected]
1 Introduction and Definitions
In recent years the network paradigm has established itself as one of the most attractive and
valuable models to describe and represent the complexity of relationships among a
wide variety of actors. In a wider sense, the concept of network can be applied to any
kind of actors able to establish a relationship. However, the concept assumes special importance when the actors are individuals and when the relationships
are related to specific states or properties attached to each pair of subjects (personal
relations such as trust, acquaintance, collaboration, etc.). These kinds of networks involve human beings, and the study of their birth, growth, shape
and topology is the scope of Social Network Analysis. Nevertheless, the concept of
network is so intuitive and easy to generalize that the underlying paradigm
has been successfully applied to very different fields of knowledge, ranging from
Communication and Transportation to Economics and Finance, through Medicine,
Ecology, Linguistics, Computer Science and many more. Because of its generality in representing connectivity among interacting parts, the network skeleton has
been investigated from different and complementary points of view, with descriptive
and statistical modelling aims, addressing numerical and computational issues,
and modelling evolutionary and dynamical features.
Network data are characterized by two sets of items: nodes and edges; the mathematical counterpart is a graph. Let G(𝒩, ℰ) be the graph formed by the set
𝒩 of N nodes (vertices) and by the set ℰ of K edges; different kinds of information
can be attached to both nodes and edges as attribute data. Thus N is the order and K is
the size of the graph. The degree of a node is the number of edges that connect to it.
In the following we refer to undirected simple finite graphs, that is, edges between
two nodes have no orientation, no loops are considered and no more than one edge
exists between any two different nodes.
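As an illustration of these definitions, the following minimal sketch computes the order N, the size K and the node degrees of an undirected simple graph. The function name `graph_summary` and the toy data are hypothetical and introduced here only for illustration; loops and repeated edges are discarded so that the result is a simple graph.

```python
def graph_summary(nodes, edge_list):
    """Return (order N, size K, degree per node) for an undirected simple graph.

    Assumes every edge endpoint appears in `nodes`.
    """
    edges = set()
    for u, v in edge_list:
        if u != v:  # loops are not allowed in a simple graph
            # frozenset makes (u, v) and (v, u) the same undirected edge,
            # and the set keeps at most one edge per pair of nodes
            edges.add(frozenset((u, v)))
    degree = {node: 0 for node in nodes}
    for edge in edges:
        for node in edge:
            degree[node] += 1
    return len(degree), len(edges), degree


# Toy data (illustrative only): "e" is an isolated node, ("d", "c") repeats ("c", "d")
toy_nodes = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "c")]
order, size, degree = graph_summary(toy_nodes, toy_edges)
print(order, size, degree)
```

In any undirected graph the degrees sum to 2K, which gives a quick consistency check on the output.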
A network is analysed looking at the different pieces of information attached to
the different sets: i) node data; ii) node-attribute data; iii) edge data and iv) edge-attribute data. The first two are monadic data, i.e. they refer to individual units as in traditional
statistical variables; types iii) and iv) are dyadic data, since they pertain to pairs of
nodes. For instance, the node list is the set of all nodes (possibly ordered and
labelled), while the edge list (or incidence list) is the set of all pairs of nodes sharing
a link. On these lists we may define attribute data in terms of node attributes (e.g.
usual socio-demographic variables when nodes represent individuals, as in social
networks) and edge attributes (e.g. cost, length), recorded as dichotomous or continuous
values (binary or weighted network). On such network data, several analyses can be
carried out with descriptive or modelling purposes.
From a descriptive point of view, network statistical indices can be defined at
local level, in terms of node (or edge) measurement, or at a global level, in terms of
measurements for the entire network. In the first case we may retrieve the statistical
distribution of each network index and analyse it. Several network indices have been
defined to take into account the centrality position and the role of the nodes in the
net (see Freeman (1979) [2] and Wasserman and Faust (1994) [6] for definitions and
interpretations) such as degree centrality, density, cohesion, etc. The most important
are Degree, Closeness, Betweenness and Eigenvector centrality. Moreover, global
statistics could be computed to capture some topological characteristics of the network as a whole, and the presence of subgroups (social structures), such as density,
diameter, number of cliques and size of the largest clique.
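The distinction between local and global indices can be sketched as follows. This is only an illustrative implementation under stated assumptions: the graph is given as an adjacency dictionary, it is connected (closeness is set to 0 otherwise), and the usual normalisations by N - 1 are used; the function names are hypothetical. Betweenness, eigenvector centrality and clique counts are omitted for brevity.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in number of edges) from `source` to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def local_and_global_indices(adj):
    """Local indices (degree and closeness centrality) and global ones (density, diameter)."""
    n = len(adj)
    k = sum(len(neighbours) for neighbours in adj.values()) // 2
    degree_c = {u: len(adj[u]) / (n - 1) for u in adj}
    closeness = {}
    diameter = 0
    for u in adj:
        dist = bfs_distances(adj, u)
        total = sum(dist.values())
        # Closeness is defined here only when every node is reachable from u
        closeness[u] = (n - 1) / total if len(dist) == n and total > 0 else 0.0
        diameter = max(diameter, max(dist.values()))
    density = 2 * k / (n * (n - 1))
    return degree_c, closeness, density, diameter


# Toy path graph a - b - c - d (illustrative only)
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
degree_c, closeness, density, diameter = local_and_global_indices(path)
print(degree_c, closeness, density, diameter)
```

The local indices return one value per node, hence a distribution for each network, whereas density and diameter return a single value for the whole network.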
2 Graphs as complex data objects
The definition of the graph structure G as a complex data object should consider
the different structural information that can be of interest to retrieve. We may start
from descriptive statistical measures that provide a first insight into the network
structure. The basic idea is to aggregate information attached to each node in terms
of its centrality and role in the network and express it as symbolic data by means
of interval or histogram-valued variables (see, for instance, Bock & Diday (2000)
[1] or Noirhomme-Fraiture and Brito (2011) [4]) so that the whole network could
be expressed through the logical union of such different measurements. The final
output should allow building a symbolic data table where each row pertains to a
different network and columns to the network indices. That is, each row defines a
Network Symbolic Object (NSO). Symbolic data analysis of NSOs could be applied
to compare several networks that emerged at different occasions in
time, to compute similarities among networks (see, e.g., Verde and Irpino (2008)
[5]) and to represent networks as points on a reduced embedding (metric space).
In the following a simulation study is carried out to generate several network data
structures. Traditional network analysis of such data produces a symbolic data table,
representing the statistical distributions of the network indices.
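A minimal sketch of how such a symbolic data table could be assembled is given below. It assumes that a node-level index (here the degree) has already been computed for each network, and it summarises each distribution both as an interval [min, max] and as a histogram of relative frequencies over common bins. The names `to_interval`, `to_histogram` and `build_symbolic_table`, the bin edges and the toy degree lists are hypothetical choices made for illustration.

```python
def to_interval(values):
    """Interval-valued description: the range of the observed values."""
    return (min(values), max(values))


def to_histogram(values, bin_edges):
    """Histogram-valued description: relative frequencies over the given bins.

    Bins are [e0, e1), [e1, e2), ..., with the last bin closed on the right.
    Values are assumed to lie within [bin_edges[0], bin_edges[-1]].
    """
    counts = [0] * (len(bin_edges) - 1)
    for x in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if bin_edges[i] <= x < bin_edges[i + 1] or (is_last and x == bin_edges[-1]):
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


def build_symbolic_table(index_by_network, bin_edges):
    """One row per network (an NSO), one column per symbolic variable."""
    return {
        name: {
            "degree_interval": to_interval(values),
            "degree_hist": to_histogram(values, bin_edges),
        }
        for name, values in index_by_network.items()
    }


# Toy node degrees for two networks (illustrative data only)
degrees = {"net_A": [1, 2, 2, 3], "net_B": [1, 1, 1, 3]}
table = build_symbolic_table(degrees, bin_edges=[0, 2, 4])
for name, row in table.items():
    print(name, row)
```

Note that the two toy networks share the same interval but differ in their histograms, which shows why histogram-valued variables retain more of the within-network variability.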
2.1 Simulation design
The simulation scheme controls for three factors: generating process, process parameter and graph order, each at three levels, for a total of 27 network data structures.
The following factors are considered:
- Order of the graph: N ∈ {100; 300; 500}.
- Generating process: GP ∈ {Random Graph; Preferential Attachment; Small-World}.
- Process parameters: for each generating process specific parameters have been
considered that control, respectively:
• the density of the Random Graph: p ∈ {0.01; 0.03; 0.05};
• the power of the Preferential Attachment: λ ∈ {0.75; 1.00; 1.25};
• the rewiring probability of the Small-World model: π ∈ {0.005; 0.01; 0.05}.
Figure 1 presents the graph representations of the 27 networks, arranged by attribute and levels of the simulation design.
Fig. 1 The 27 networks generated by the simulation scheme
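The factorial structure of the design can be sketched as below. The factor levels are those listed above; everything else is an assumption made for illustration. In particular, the Random Graph process is interpreted here as the Erdős-Rényi G(n, p) model, the function `random_graph` and the seed are hypothetical, and the Preferential Attachment and Small-World generators are not implemented in this sketch.

```python
import itertools
import random

ORDERS = [100, 300, 500]
PARAMETERS = {
    "Random Graph": [0.01, 0.03, 0.05],             # density p
    "Preferential Attachment": [0.75, 1.00, 1.25],  # power lambda
    "Small-World": [0.005, 0.01, 0.05],             # rewiring probability pi
}

# Each process is crossed with its own three parameter levels and the three orders:
# 3 processes x 3 parameter levels x 3 orders = 27 configurations
design = [
    (process, param, n)
    for process, levels in PARAMETERS.items()
    for param, n in itertools.product(levels, ORDERS)
]


def random_graph(n, p, rng):
    """Assumed G(n, p) model: each of the n(n-1)/2 possible edges is kept with probability p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
edges = random_graph(100, 0.03, rng)
print(len(design), "configurations;", len(edges), "edges in one simulated random graph")
```

Because the process parameter is nested within the generating process, the grid is built per process rather than as a single product over all three factors.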
2.2 Data analysis
Suitable multivariate symbolic data analysis may then be performed on the obtained
symbolic data array. In a first step we follow a clustering approach, using different
attribute representations, different combinations of attributes and different dissimilarity measures. Classical hierarchical clustering, based on a quantile representation
(see Ichino (2008) [3]) of the symbolic network data, is performed using different
aggregation indices and provides dendrograms on the set of networks. Other distances, better suited to the type of data at hand (see, e.g., Verde and Irpino (2008)
[5]), are also to be used. On the other hand, conceptual clustering approaches, which
take the network symbolic descriptions directly into account (and not solely
based on distance matrices) may provide a different insight.
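To make the idea of a quantile representation concrete, the sketch below maps each histogram-valued description to a vector of quantiles and compares networks through the Euclidean distance between these vectors. It is a simplified illustration and not necessarily the procedure of [3] or the distance of [5]: values are assumed to be uniformly distributed within each bin, and the quantile levels, function names and toy histograms are hypothetical.

```python
import math


def histogram_quantile(bin_edges, freqs, q):
    """Quantile of level q (0 <= q <= 1), assuming values are uniform within each bin."""
    cumulative = 0.0
    for i, f in enumerate(freqs):
        if f > 0 and cumulative + f >= q:
            frac = (q - cumulative) / f  # position of the quantile inside bin i
            return bin_edges[i] + frac * (bin_edges[i + 1] - bin_edges[i])
        cumulative += f
    return bin_edges[-1]


def quantile_vector(bin_edges, freqs, levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Represent a histogram by its quantiles at the chosen levels."""
    return [histogram_quantile(bin_edges, freqs, q) for q in levels]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy histogram-valued descriptions of three networks on common bins (illustrative only)
bin_edges = [0, 2, 4]
histograms = {"net_A": [0.25, 0.75], "net_B": [0.75, 0.25], "net_C": [0.25, 0.75]}
vectors = {name: quantile_vector(bin_edges, f) for name, f in histograms.items()}

# Pairwise dissimilarity matrix, usable as input to any hierarchical clustering routine
names = sorted(vectors)
distances = {(a, b): euclidean(vectors[a], vectors[b]) for a in names for b in names}
print(distances)
```

Once every network is a point in a common quantile space, classical agglomerative methods can be applied to the resulting dissimilarity matrix with any aggregation index.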
Future work should address discriminant analysis, to highlight the role
of the different retrieved attributes and their discriminant power with respect to the
various network classes, or else to identify particular network patterns.
References
1. Bock, H.-H., Diday, E.: Analysis of Symbolic Data, XVIII, Springer, Berlin (2000).
2. Freeman, L.C.: Centrality in Social Networks I: Conceptual Clarification. Social Networks.
1, 215–239 (1979).
3. Ichino, M.: Symbolic PCA for Histogram-Valued Data. In: Proc. IASC 2008, Yokohama,
Japan, (2008).
4. Noirhomme-Fraiture, M., Brito, P.: Far Beyond the Classical Data Models: Symbolic Data
Analysis. Statistical Analysis and Data Mining, 4, 2, 157-170, (2011).
5. Verde, R., Irpino, A.: Comparing Histogram Data Using a Mahalanobis-Wasserstein Distance.
In: Paula Brito (ed.) COMPSTAT 2008, pp. 77-89. Physica-Verlag HD (2008).
6. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge
University Press, New York (1994).