Int'l Conf. Artificial Intelligence | ICAI'15 |
Distributed Dynamic Graph Analytic Framework:
Scalable Layered Multi-Modal Network Analysis
M. Margitus (1) and W. Tagliaferri, Jr. (2)
(1) Information Fusion Group, CUBRC Inc., Buffalo, NY, USA
(2) Information Fusion Group, CUBRC Inc., Rome, NY, USA
Abstract – Dynamic Graph Analytic Framework (DYGRAF)
is a domain agnostic framework from which data alignment,
data association, and layered multi-modal network analysis
can be performed. Earlier versions of DYGRAF have been
able to provide analytic insight for small and medium-sized
data sets; however, scalability becomes problematic as the
amount of collected data increases. In this work, we discuss
extensions to DYGRAF that allow analysis to be performed
over data of considerable size. By augmenting its current
modules with distributed computing components, DYGRAF’s
scalability can be extended to accommodate the effective
analysis of the large amount of data being collected, as well
as the massive layered multi-modal network derived.
Keywords: multi-modal network analysis; distributed
computing; information fusion; social network analysis; graph
1 Introduction
The introduction and incorporation of state-of-the-art
data collection capabilities and platforms have afforded
decision makers and analysts access to unprecedented levels
of data. The sheer volume of this data, and the disparity in its
content and representation within and across sources, make it
challenging to analyze the data and identify salient
and/or actionable intelligence in a timely manner. To address
these challenges, the analyst must be able to fully understand
the environment of interest, requiring the ability to investigate
interconnected relationships of many diverse data sources
simultaneously, as they evolve both spatially and temporally.
An effective way to meet this requirement is through the use
of layered multi-modal network analysis.
Layered multi-modal network analysis (LMMNA) [1] is
a technique in which multiple types of objects and
relationships from a set of disparate layers of source data are
represented, and analyzed as a graph, i.e. a finite nonempty
set of vertices, together with a set of (un)ordered pairs of
vertices called edges. The resultant network is valuable,
providing a single graph which encapsulates the structural and
semantic relationships of multiple data layers, and assembles
previously disconnected information into a common picture.
Dynamic Graph Analytic Framework (DYGRAF) [2] is
a framework through which LMMNA can be adeptly
performed on small and medium-sized data sets. The
extensions presented in this work serve to extend the
scalability of DYGRAF’s analytic capabilities, to address the
phenomenon of information overload, and provide timely
situation awareness for analysts, improving their ability to
better understand and anticipate activities within their area of
responsibility.
The contribution of this work is a system that can be
applied in order to achieve effective LMMNA at scale,
through modifications and novel interactions between existing
technologies. The remainder of this paper is structured as
follows. Section 2 describes the current state, capabilities,
and limitations of DYGRAF. Section 3 discusses the
extensions implemented to achieve the scalability goals of
DYGRAF. Section 4 discusses initial graph analytic results
achieved through the distributed extensions, and Section 5
wraps up the discussion of DYGRAF and the extensions
presented.
2 Dynamic Graph Analytic Framework
DYGRAF is a domain agnostic framework that provides
the infrastructure to perform LMMNA via a modular design.
At a high level, DYGRAF facilitates the alignment,
association, and analysis of disparate layers of data to yield a
cohesive and comprehensive picture of the evolving cross-layer situation.
The framework incorporates technologies that
(1) amalgamate the disparate network layers via
standardization and normalization techniques; (2) apply an
entity resolution architecture to associate like vertices within
and across layers; (3) implement graph clustering and nesting
techniques to form a partition of associated information of
interest from each layer; and (4) introduce batch and real-time
graph analytic algorithms and heuristics to derive and
maintain graph measures, graph matching, and graph
querying, to identify semantic and structural elements and
discover new patterns of interest.
Four modules are responsible for DYGRAF’s end-to-end
functionality, illustrated in Figure 1: data alignment, data
association, Evidentiary Graph building, and graph analytics.
Each module plays an important role in the preparation or
analysis of the layered multi-modal network representation of
the original disparate data sources.
Figure 1. DYGRAF processing pipeline, including the data alignment, data association (entity resolution),
Evidentiary Graph builder (graphical representation) and graph analytics modules.
The data alignment module is responsible for the
standardization and normalization of heterogeneous data
layers (e.g. communications, financial) through an ontological
alignment process.
To accurately compare disparate
representations of entity attributes across data sources, a
mapping is created to assign data instances to classes in
ontologies.
An ontology is a specification of a
representational vocabulary for a shared domain, defining
classes, relations, functions, and other objects [3]. DYGRAF
utilizes ontologies comprised of classes that represent features
and characteristics of people, locations, events, and other
objects found within the data. Data alignment aids in
determining the similarity of two entities during the entity
resolution process by standardizing the representation of each
entity’s characteristics. This process eliminates the need to
record and utilize each distinct representation of the same
attribute (e.g. Last Name, Last, Surname), a set of descriptors
that grows with the set of data sources.
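The attribute standardization described above can be sketched as a lookup from source-specific field names to canonical attribute names. This is a minimal illustration only: the mapping table, the canonical names (familyName, givenName), and the align_record function are hypothetical and are not taken from DYGRAF’s ontologies or code.

```python
# Hypothetical mapping from source-specific field names to canonical
# attribute names. These names are illustrative assumptions, not the
# ontology classes used by DYGRAF.
FIELD_TO_ATTRIBUTE = {
    "last name": "familyName",
    "last": "familyName",
    "surname": "familyName",
    "first name": "givenName",
    "first": "givenName",
}


def align_record(record):
    """Return a copy of `record` keyed by canonical attribute names.

    Field names are matched case-insensitively. Unmapped fields are
    kept under their original name so no source data is dropped.
    """
    aligned = {}
    for field_name, value in record.items():
        key = FIELD_TO_ATTRIBUTE.get(field_name.strip().lower(), field_name)
        aligned[key] = value
    return aligned


# Two sources describing the same person with different field names.
source_a = {"Last Name": "Smith", "First Name": "Ann"}
source_b = {"Surname": "Smith", "first": "Ann", "phone": "555-0100"}
print(align_record(source_a))
print(align_record(source_b))
```

After alignment, both records expose the family name under the same key, so a later comparison step does not need to know about each source’s naming.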
The data association module utilizes homogenized data
produced from the data alignment module to perform entity
resolution. Within this process, entities are evaluated for
similarity via attribute, rule-based, and semantic comparisons.
Pairs of entities that meet or exceed a similarity threshold,
subject to additional constraints, are merged. The process is
performed hierarchically: within source, within layer, and
across layers. Entity resolution is initially a batch process,
then is incrementally performed as data updates, or additions,
become available.
The results produced from the data association module
are used in the construction of the Evidentiary Graph,
DYGRAF’s layered multi-modal network. The Evidentiary
Graph is a single graph that encapsulates the structural and
semantic relationships of the multiple data layers, as well as
the unique entities and objects obtained through the entity
resolution process. It provides a composite structured view
from which emerging activities can be identified, and from
which viable actions can be formulated. Within DYGRAF,
layered multi-modal networks are represented by an attributed
graph model (sometimes referred to as a property graph
model), where objects and relationships are represented by
vertices and edges, respectively, and there exists a variable set
of attributes on each vertex and edge, describing the features
of the object or relationship being modeled.
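The attributed (property) graph model described above can be sketched with a few plain data classes. This is an illustrative in-memory structure under our own naming (Vertex, Edge, PropertyGraph); it is not DYGRAF’s implementation and not the TinkerPop Blueprints API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    vid: str
    # Variable set of attributes describing the modeled object.
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    label: str
    # Variable set of attributes describing the modeled relationship.
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: Dict[str, Vertex] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_vertex(self, vid, **attributes):
        self.vertices[vid] = Vertex(vid, dict(attributes))

    def add_edge(self, source, target, label, **attributes):
        # Both endpoints must already exist in the graph.
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, target, label, dict(attributes)))

    def neighbors(self, vid):
        """Vertices adjacent to `vid`, ignoring edge direction."""
        adjacent = set()
        for edge in self.edges:
            if edge.source == vid:
                adjacent.add(edge.target)
            elif edge.target == vid:
                adjacent.add(edge.source)
        return adjacent


# A person from a communications layer linked to an account from a
# financial layer; attribute sets differ per vertex and per edge.
g = PropertyGraph()
g.add_vertex("p1", kind="person", name="Ann Smith", layer="communications")
g.add_vertex("a1", kind="account", layer="financial")
g.add_edge("p1", "a1", "owns", since=2010)
print(g.neighbors("p1"))
```

Because attributes are stored as open dictionaries, a new data field introduced by a later ingest does not require a schema change.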
The graph analytics module utilizes the Evidentiary
Graph as a starting point for LMMNA. The module consists
of a set of algorithms and heuristics that execute a priori, or in
real-time, to provide topological, semantic, and discovery
analysis of the layered multi-modal network. Incorporated
into this module are classical social network analysis
centrality measurements; community detection algorithms;
graph matching heuristics used to find user defined patterns
within the Evidentiary Graph; graph querying capabilities;
subgraph extraction tools; and path calculation algorithms.
The plug-in design of the module permits extensibility and
customization for an analyst’s objectives, allowing algorithms
to be added or exchanged, if needed.
3 Distributed extensions
DYGRAF has the capacity to achieve data alignment,
entity resolution, and LMMNA on small and medium-sized
data sets. When the Evidentiary Graph grows beyond these
sizes, the runtime and capability of the traditional
algorithms included within DYGRAF begin to degrade,
preventing the analyst from effectively meeting his or her
objectives. Investigating massive data sources, which are
quickly becoming typical, requires the modules of DYGRAF to
be augmented with components suited for distributed
computing.
Figure 2. Extensions of DYGRAF, incorporating distributed technologies to improve analytic scalability.
The development of distributed computing frameworks,
such as Apache Hadoop [4], and of graph processing
systems has produced a better toolset for large-scale
analysis than was previously available. Leveraging
these advancements, extensions to DYGRAF have been
developed and implemented to improve the scalability
of the data association module and the graph analytics
module. Existing distributed technologies have been
incorporated into these modules to perform large-scale
entity resolution, distributed storage, and iterative
distributed graph processing within a Hadoop environment
(Figure 2).
3.1 MapReduce data association
To efficiently analyze and glean the best intelligence
from a layered multi-modal network, the information
described within the network must be robust and, at the same
time, have a terse representation. In disparate layers of data,
multiple mentions of a single entity often appear across a set
of data sources. Generating a layered multi-modal network,
and including these entities without a resolution, association,
or deduplication process would result in multiple occurrences
of the same entity within the graph, increasing the size of the
network and segmenting information.
DYGRAF utilizes an entity resolution process to prevent
duplication and segmentation of entities, their attributes, and
their relationships. Entity resolution is the problem of
identifying when entity mentions from within and across
multiple data sources refer to the same entity. The process
can be decomposed into two steps: scoring and clustering.
The scoring step measures the similarity between entities for
each pair represented in the data. The clustering step attempts
to use the scores generated from the scoring step to group
entities by their true unique identities.
DYGRAF implements entity resolution within the data
association module using a non-parallelized graph association
approach [5]. As the number of entities under investigation
increases, a limit is reached where the number of pairwise
similarity comparisons and assignments exceeds the
computational capabilities of a single computer. Transitioning
from a linear data association module, where scoring and
entity resolution are performed sequentially across all data
sources, to a MapReduce-based association module, shown in
Figure 3, removes this single-machine limit on the entity
resolution process.
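The two-step decomposition above (scoring, then clustering) can be sketched in plain Python. This is a sequential stand-in written under stated assumptions, not the MapReduce jobs or the graph association model of [5]: the Jaccard similarity, the union-find clustering, the threshold value, and all function names are illustrative choices of ours. It also makes the scaling problem concrete, since n mentions produce n(n-1)/2 pairs to score.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two mentions given as attribute dicts."""
    items_a, items_b = set(a.items()), set(b.items())
    if not items_a and not items_b:
        return 0.0
    return len(items_a & items_b) / len(items_a | items_b)


def score_pairs(mentions):
    """Scoring step: yield (id_a, id_b, score) for every unordered pair.

    Each pair is scored independently of all others, which is what
    makes this step a natural candidate for parallel execution.
    """
    for (id_a, a), (id_b, b) in combinations(sorted(mentions.items()), 2):
        yield id_a, id_b, jaccard(a, b)


def cluster(mention_ids, scored_pairs, threshold):
    """Clustering step: merge pairs scoring at or above `threshold`.

    Uses union-find as a simple stand-in for a constrained
    assignment model.
    """
    parent = {m: m for m in mention_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for id_a, id_b, score in scored_pairs:
        if score >= threshold:
            parent[find(id_a)] = find(id_b)

    clusters = {}
    for m in mention_ids:
        clusters.setdefault(find(m), set()).add(m)
    return sorted(sorted(members) for members in clusters.values())


mentions = {
    "m1": {"familyName": "Smith", "givenName": "Ann", "city": "Buffalo"},
    "m2": {"familyName": "Smith", "givenName": "Ann", "city": "Rome"},
    "m3": {"familyName": "Jones", "givenName": "Bob", "city": "Rome"},
}
scores = list(score_pairs(mentions))
print(len(scores), "pairs scored")
print(cluster(list(mentions), scores, threshold=0.5))
```

In this toy input, m1 and m2 share two of four distinct attribute values and are merged at a threshold of 0.5, while m3 remains its own entity.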
The MapReduce data association extension utilizes
normalized data from the data alignment module, recording
unique entity data to the Hadoop Distributed File System
(HDFS). A MapReduce job is executed to read this data and
compute similarity scores between every pair of entities. This
sequence satisfies the scoring step. For the clustering step, a
MapReduce job is executed to read the similarity scores and,
based on the constraints of the model, decide when to assign
similar mentions to a cluster. Each cluster represents what the
algorithm has determined to be a unique entity, information
used during the construction of the Evidentiary Graph.
Figure 3. The MapReduce data association module. Data
are retrieved from the relational database and transferred
to HDFS where MapReduce jobs are executed to calculate
similarity scores and determine which entities shall be
clustered into a single unique entity.
The data association extension holds an advantage over
DYGRAF’s ordinary entity resolution process, providing
parallelization during the scoring and assignment (solving)
steps. The extension decreases the execution time needed for
this task and, in situations that require the analysis of
extremely large data sets, enables the
module to complete its task when it otherwise could not.
3.2 Graph storage and data access
The large-scale topology of the Evidentiary Graph
requires a storage solution that permits efficient data querying
and analysis capabilities on a graph structure. Relational
databases are proficient at persisting and querying regular
structures; however, their schemas and table definitions are
more rigid and less suitable for a graph representation,
especially when representing dynamic graphs, where new data
fields may be introduced at each iteration of data ingest.
Standard graph-oriented flat files, such as GraphML, also do
not provide an acceptable solution, as these types of files
are not easily modifiable and become more difficult to
interact with as the file size increases.
To avoid the shortcomings of the aforementioned storage
solutions, a distributed graph database has been chosen as the
storage method for this extension. A graph database provides
more flexibility than other storage solutions, such as relational
data models, and does not require a predefined schema or
consistent data keys across each entry of the same type. These
criteria are especially important when modeling data that
evolves over time, data containing complex relations, or data
that has the potential for a multitude of query perspectives.
The inherent graph structure retained through the usage of a
graph database for persistence allows straightforward
integration with other graph-based technologies (e.g. graph
servers, graph visualizers, graph query languages), requiring
little transformation from data structure to graph model
implementation.
Titan [6] is a graph database capable of being distributed
across multiple computer cluster nodes, and has flexible
backend storage. Backed by Apache HBase [7], Titan
provides DYGRAF with the ability to distribute an
Evidentiary Graph, too large to be stored on one computer,
across multiple machines, and it provides support for
thousands of concurrent query transactions.
3.3 Large-scale graph exploration and analysis
Effectively analyzing massive disparate sources of data
through DYGRAF’s LMMNA capabilities requires the graph
analytics module to accommodate large-scale Evidentiary
Graphs. Existing analytics within the module, which include
traditional graph algorithms and social network analysis
metrics, are capable of analyzing small and medium-sized
Evidentiary Graphs. To mitigate the scalability shortcomings
of graph analysis, DYGRAF’s distributed analytical
capabilities have been augmented with scalable graph
technologies, namely, the TinkerPop graph technology suite,
and Apache Giraph, a distributed graph processing system.
The TinkerPop graph suite [8] is an open-source
collection of graph technologies which provide property graph
modeling and interaction libraries, graph query and traversal
capabilities, and graph exposure mechanisms. Within the
distributed extensions to DYGRAF’s graph analytics module,
TinkerPop Blueprints (property graph API), Gremlin (graph
traversal/querying), and Rexster (graph server) are valuable
components that allow DYGRAF to be interoperable with
other third-party applications that utilize these technologies,
such as Titan, which implements the Blueprints API graph
model for its graph representation. Additionally, the
TinkerPop Blueprints API has been chosen as the common
graph API throughout each of DYGRAF’s previously
established graph capabilities, as well as DYGRAF’s
extensions.
Table I. Graph Analysis Comparison
                           Graph Statistics     Local Graph Analytic   Distributed Graph Analytic
Data Set                   Vertices   Edges     Runtime (min)          Runtime (min)
VAST 2010                  547        1064      0.339                  7.73
Global Terrorism Database  230132     632238    -                      13.33
The graph querying capabilities of DYGRAF have been
extended with the incorporation of TinkerPop Gremlin, a
graph traversal language capable of composing highly
detailed, flexible, and sophisticated graph queries across a
property graph, executed in a depth-first manner. By
incorporating the ability to construct and execute Gremlin
queries over the Evidentiary Graph, a user can filter the graph
by entity connections, entity or relationship attributes,
conditional statements, or custom functions for the data under
investigation. Gremlin also provides a mechanism for users to
calculate shortest paths between two entities, and derive
subgraphs of the Evidentiary Graph, such as a k-hop
neighborhood.
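To make the notion of a k-hop neighborhood concrete, the following minimal sketch collects every vertex reachable within k edges of a starting vertex. It is plain Python over an adjacency dictionary using a breadth-first search; it is not Gremlin and does not reflect how Gremlin evaluates a traversal. The function name and the input format are assumptions made for illustration.

```python
from collections import deque


def k_hop_neighborhood(adjacency, start, k):
    """Return the set of vertices within `k` hops of `start`, inclusive.

    `adjacency` maps each vertex to an iterable of its neighbours.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if start not in adjacency:
        raise KeyError(start)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == k:
            continue  # do not expand beyond the k-th hop
        for neighbour in adjacency.get(vertex, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[vertex] + 1
                queue.append(neighbour)
    return set(distance)


# A simple path a - b - c - d.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(k_hop_neighborhood(adjacency, "a", 2))
```

The subgraph induced by the returned vertex set is the k-hop neighborhood; restricting later analysis to it keeps the working set small even when the full graph is large.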
Apache Giraph [9] is leveraged to execute DYGRAF’s
large-scale graph and social network analysis algorithms on
the Evidentiary Graph, and derived subgraphs. Giraph is an
open source iterative graph processing system, modeled after
Google’s Pregel [10], that implements the Bulk Synchronous
Parallel (BSP) model of parallel computing [11]. Each
algorithm developed using the Giraph API is designed from a
vertex-centric parallel perspective, as opposed to the
traditional “top-down” graph algorithm approach, which
considers the entire graph throughout each step of the
algorithm.
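The vertex-centric, superstep-based style described above can be illustrated with a small single-process simulation. The sketch below labels each vertex with the smallest vertex id in its connected component by exchanging messages between supersteps. It is not the Giraph API and not one of DYGRAF’s algorithms: the function name, the adjacency-dictionary input (assumed symmetric, with every vertex present as a key), and the choice of algorithm are ours.

```python
def bsp_min_label(adjacency, max_supersteps=100):
    """Label each vertex with the smallest vertex id in its component.

    Assumes an undirected graph: `adjacency` is symmetric and every
    vertex appears as a key. Each vertex only sees its own value and
    the messages delivered to it, as in a vertex-centric model.
    """
    value = {v: v for v in adjacency}

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {v: [] for v in adjacency}
    for v, neighbours in adjacency.items():
        for n in neighbours:
            inbox[n].append(value[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in adjacency}
        any_message = False
        for v in adjacency:  # conceptually executed in parallel
            messages = inbox[v]
            if not messages:
                continue  # no incoming messages: vertex stays halted
            smallest = min(messages)
            if smallest < value[v]:
                value[v] = smallest
                for n in adjacency[v]:
                    outbox[n].append(smallest)
                    any_message = True
        if not any_message:
            break  # all vertices halted and no messages in flight
        inbox = outbox  # barrier: messages arrive in the next superstep
    return value


# Two components plus one isolated vertex.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
print(bsp_min_label(graph))
```

Each vertex’s update depends only on local state and received messages, so the inner loop could be partitioned across workers, with the swap of inbox and outbox playing the role of the synchronization barrier.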
To extract the graph topology from Titan, DYGRAF
supplies custom input and output formats that instruct Giraph
on how to interpret the data, and how to handle or process
algorithmic results. For each distributed algorithm included
in DYGRAF, a Giraph configuration is constructed and the
corresponding job is executed over the iterative processing
framework. The results calculated by each algorithm are
then made available directly as text files in HDFS, as well as
written as additional attributes for each vertex/edge in Titan,
so that they can be included in subsequent queries and
analysis.
4 Initial analytical results
To illustrate the value provided by DYGRAF’s
distributed extensions, two data sets—one small, one large—
were chosen to be analyzed as preliminary proof of concept
exercises. The small data set was constructed from the VAST
2010 data set [12], augmented with additional synthetic data,
to represent hospital admittance, financial, and
communication records. The Evidentiary Graph derived from
this set consisted of 547 vertices and 1064 edges, each with
varying sets of attributes. The large data set consisted of
records from the Global Terrorism Database (GTD) [13],
which included individual/faction information, event
occurrences and event descriptions. The resulting Evidentiary
Graph contained 230132 vertices and 632238 edges.
To compare runtime and processing ability, both data
sets were analyzed using a subset of social network analysis
algorithms included in DYGRAF, as well as their counterparts
introduced through DYGRAF’s distributed extensions. As the
objective of the exercise was in part to show a proof of
concept, the distributed extensions were executed on a single
VM pseudo-cluster, providing 8 GB of RAM. The results are
presented in Table I.
Analysis results of the VAST 2010 data set show that the
non-distributed version of DYGRAF outperforms the
distributed extensions with respect to runtime on small data
sets that fit into memory. This disparity is mainly caused by
the overhead incurred in order to configure and execute
Giraph jobs. The real value of the distributed extensions is
illustrated through the results of the GTD analysis. Although
the distributed extensions took 13.33 minutes to execute the
set of social network analysis (SNA) metrics, the same set of
metrics did not complete using non-distributed methods. The
distributed extensions therefore allow DYGRAF to analyze
data sets that it previously could not process. These initial
results provide motivation for
subsequent experiments with a multi-node cluster, to test
scalability limits and runtime.
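The figures reported above can be put in proportion with a few lines of arithmetic. The sketch below uses only the values stated in the text and in Table I; the variable names are ours, and because both runs were made on a single VM pseudo-cluster over two different data sets, the ratios should be read as indicative rather than as a scalability measurement.

```python
# Figures copied from Table I; None marks the run that did not complete.
RESULTS = {
    "VAST 2010": {"vertices": 547, "edges": 1064,
                  "local_min": 0.339, "distributed_min": 7.73},
    "GTD": {"vertices": 230132, "edges": 632238,
            "local_min": None, "distributed_min": 13.33},
}

small, large = RESULTS["VAST 2010"], RESULTS["GTD"]

# How much slower the distributed run was on the small graph.
overhead_factor = small["distributed_min"] / small["local_min"]

# How much larger the big graph is, and how much longer it took.
edge_growth = large["edges"] / small["edges"]
runtime_growth = large["distributed_min"] / small["distributed_min"]

print(f"distributed vs. local on VAST 2010: {overhead_factor:.1f}x slower")
print(f"edge count growth, VAST 2010 to GTD: {edge_growth:.0f}x")
print(f"distributed runtime growth: {runtime_growth:.2f}x")
```

The distributed runtime grows far less than the edge count between the two data sets, which is consistent with the observation that job configuration and startup overhead dominates the small-graph runtime.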
5 Conclusions
DYGRAF is a domain agnostic framework used to
provide layered multi-modal network analysis resulting in a
cohesive and comprehensive picture of the current
cross-network situation faced by analysts and decision makers,
enabling them to correlate and investigate activity and time
occurrences in related networks. DYGRAF provides the
capabilities to reduce the amount of mental correlation
required by the analyst, while maximizing their ability to
understand and potentially predict when and where a major
event or activity of interest is about to occur. By leveraging
data collected from multiple disparate sources, DYGRAF is
able to shed light upon the relationships between diverse and
seemingly unrelated information.
As data collection techniques and storage capabilities
become more advanced, the amount of data that exists
describing events and activities of interest has grown
considerably.
Furthermore, to fully understand the
information found in this data, there is a desire to analyze the
sets in their entirety, with the goal of capturing any trends,
patterns, or hidden relationships. In order to meet these needs
and eliminate scalability restrictions, DYGRAF has
introduced distributed computing extensions to its framework.
Augmenting its capabilities with high-performance computing
technologies allows DYGRAF to perform the large-scale data
alignment, data association, and LMMNA needed to fully
exploit the massive data sets that are becoming common.
6 Acknowledgment
The authors would like to thank Mr. Craig Anken and
Mr. John Spina of the DYGRAF program management team
for their help and support, as well as Dr. Gregory Tauer for
his MapReduce data association research.
7 References
[1] P. LaMonica, T. Waskiewicz, “Layered multi-modal
network analysis of textual data for improved situation
awareness,” In Proc. International Conference of Knowledge
Engineering (IKE), July 2011.
[2] M. Margitus, W. Tagliaferri Jr., M. Sudit, P. LaMonica,
“Dynamic graph analytic framework (DYGRAF): greater
situation awareness through layered multi-modal network
analysis,” Proc. SPIE 8402, Evolutionary and Bio-Inspired
Computation: Theory and Applications IV, 84020E (May 1,
2012); doi:10.1117/12.920598
[3] T.R. Gruber, “A translation approach to portable
ontology specifications,” Knowledge Acquisition, 5(2): 199-220, 1993.
[4] Apache Hadoop [Online]. Available
http://hadoop.apache.org
[5] G. Tauer, R. Nagi, M. Sudit, “The graph association
problem: mathematical models and a Lagrangian heuristic,”
Naval Research Logistics (NRL), vol. 60, issue 3, pp. 251-268, March 2013.
[6] Titan Distributed Graph Database [Online]. Available
http://thinkaurelius.github.io/titan/
[7] Apache HBase [Online]. Available
http://hbase.apache.org
[8] TinkerPop [Online]. Available
http://www.tinkerpop.com
[9] Apache Giraph [Online]. Available
http://giraph.apache.org
[10] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn,
N. Leiser, G. Czajkowski, “Pregel: a system for large-scale
graph processing,” In Proceedings of the 2010 ACM
SIGMOD International Conference on Management of Data
(SIGMOD ’10). ACM, New York, NY, USA, 135-146.
[11] L. Valiant, “A bridging model for parallel computation,”
Commun. ACM 33, 8 (August 1990), 103-111.
[12] VAST Challenge 2010 [Online]. Available
http://hcil2.cs.umd.edu/newvarepository/VAST%20Challenge%202010/challenges/Grand%20Challenge%202010/
[13] Global Terrorism Database [Online]. Available
http://www.start.umd.edu/gtd/