206 Int'l Conf. Artificial Intelligence | ICAI'15 |

Distributed Dynamic Graph Analytic Framework: Scalable Layered Multi-Modal Network Analysis

M. Margitus1 and W. Tagliaferri, Jr.2
1 Information Fusion Group, CUBRC Inc., Buffalo, NY, USA
2 Information Fusion Group, CUBRC Inc., Rome, NY, USA

Abstract – Dynamic Graph Analytic Framework (DYGRAF) is a domain agnostic framework from which data alignment, data association, and layered multi-modal network analysis can be performed. Earlier versions of DYGRAF provided analytic insight for small and medium-sized data sets; however, scalability becomes problematic as the amount of collected data increases. In this work, we discuss extensions to DYGRAF that allow analysis to be performed over data of considerable size. By augmenting its current modules with distributed computing components, DYGRAF’s scalability can be extended to accommodate the effective analysis of the large amount of data being collected, as well as the massive layered multi-modal network derived from it.

Keywords: multi-modal network analysis; distributed computing; information fusion; social network analysis; graph

1 Introduction

The introduction and incorporation of state-of-the-art data collection capabilities and platforms has afforded decision makers and analysts access to unprecedented levels of data. The sheer volume of this data, and the disparity in its content and representation within and across sources, make it challenging to analyze the data and to identify salient and/or actionable intelligence in a timely manner. To address these challenges, the analyst must be able to fully understand the environment of interest, which requires the ability to investigate the interconnected relationships of many diverse data sources simultaneously, as they evolve both spatially and temporally. An effective way to meet this requirement is through the use of layered multi-modal network analysis.
Layered multi-modal network analysis (LMMNA) [1] is a technique in which multiple types of objects and relationships from a set of disparate layers of source data are represented and analyzed as a graph, i.e. a finite nonempty set of vertices, together with a set of (un)ordered pairs of vertices called edges. The resultant network is valuable: it provides a single graph that encapsulates the structural and semantic relationships of multiple data layers, and assembles previously disconnected information into a common picture.

Dynamic Graph Analytic Framework (DYGRAF) [2] is a framework through which LMMNA can be adeptly performed on small and medium-sized data sets. The extensions presented in this work extend the scalability of DYGRAF’s analytic capabilities, address the phenomenon of information overload, and provide timely situation awareness for analysts, improving their ability to understand and anticipate activities within their area of responsibility. The contribution of this work is a system that can be applied to achieve effective LMMNA at scale, through modifications of and novel interactions between existing technologies.

The remainder of this paper is structured as follows. Section 2 describes the current state, capabilities, and limitations of DYGRAF. Section 3 discusses the extensions implemented to achieve the scalability goals of DYGRAF. Section 4 discusses initial graph analytic results achieved through the distributed extensions, and Section 5 concludes the discussion of DYGRAF and the extensions presented.

2 Dynamic Graph Analytic Framework

DYGRAF is a domain agnostic framework that provides the infrastructure to perform LMMNA via a modular design. At a high level, DYGRAF facilitates the alignment, association, and analysis of disparate layers of data to yield a cohesive and comprehensive picture of the evolving cross-layer situation.
The framework incorporates technologies that amalgamate, via standardization and normalization techniques, the disparate network layers; apply an entity resolution architecture to associate like vertices within and across layers; implement graph clustering and nesting techniques to (1) form a partition of associated information of interest from each layer and (2) introduce batch and real-time graph analytic algorithms and heuristics to derive and maintain graph measures, graph matching, and graph querying, to identify semantic and structural elements and discover new patterns of interest.

Four modules are responsible for DYGRAF’s end-to-end functionality, illustrated in Figure 1: data alignment, data association, Evidentiary Graph building, and graph analytics. Each module plays an important role in the preparation or analysis of the layered multi-modal network representation of the original disparate data sources.

Figure 1. DYGRAF processing pipeline, including the data alignment, data association (entity resolution), Evidentiary Graph builder (graphical representation) and graph analytics modules.

The data alignment module is responsible for the standardization and normalization of heterogeneous data layers (e.g. communications, financial) through an ontological alignment process. To accurately compare disparate representations of entity attributes across data sources, a mapping is created to assign data instances to classes in ontologies. An ontology is a specification of a representational vocabulary for a shared domain, defining classes, relations, functions, and other objects [3]. DYGRAF utilizes ontologies comprised of classes that represent features and characteristics of people, locations, events, and other objects found within the data.
Data alignment aids in determining the similarity of two entities during the entity resolution process by standardizing the representation of each entity’s characteristics. This process eliminates the need to record and utilize each distinct representation of the same attribute (e.g. Last Name, Last, Surname), a set of descriptors that grows with the number of data sources.

The data association module utilizes the homogenized data produced by the data alignment module to perform entity resolution. Within this process, entities are evaluated for similarity via attribute, rule-based, and semantic comparisons. Pairs of entities that meet or exceed a similarity threshold, subject to additional constraints, are merged. The process is performed hierarchically: within source, within layer, and across layers. Entity resolution initially runs as a batch process and is then performed incrementally as data updates or additions become available.

The results produced by the data association module are used in the construction of the Evidentiary Graph, DYGRAF’s layered multi-modal network. The Evidentiary Graph is a single graph that encapsulates the structural and semantic relationships of the multiple data layers, as well as the unique entities and objects obtained through the entity resolution process. It provides a composite structured view from which emerging activities can be identified, and from which viable actions can be formulated. Within DYGRAF, layered multi-modal networks are represented by an attributed graph model (sometimes referred to as a property graph model), in which objects and relationships are represented by vertices and edges, respectively, and each vertex and edge carries a variable set of attributes describing the features of the object or relationship being modeled.

The graph analytics module utilizes the Evidentiary Graph as a starting point for LMMNA.
The module consists of a set of algorithms and heuristics that execute a priori, or in real time, to provide topological, semantic, and discovery analysis of the layered multi-modal network. Incorporated into this module are classical social network analysis centrality measurements; community detection algorithms; graph matching heuristics used to find user-defined patterns within the Evidentiary Graph; graph querying capabilities; subgraph extraction tools; and path calculation algorithms. The plug-in design of the module permits extensibility and customization for an analyst’s objectives, allowing algorithms to be added or exchanged, if needed.

3 Distributed extensions

DYGRAF has the capacity to achieve data alignment, entity resolution, and LMMNA on small and medium-sized data sets. When the Evidentiary Graph exceeds these thresholds, the computation time and ability of the traditional algorithms included within DYGRAF begin to degrade, preventing the analyst from effectively meeting his or her objectives. Investigating massive data sources, which are quickly becoming typical, requires the modules of DYGRAF to be augmented with components suited for distributed computing.

Figure 2. Extensions of DYGRAF, incorporating distributed technologies to improve analytic scalability.

The development of distributed computing frameworks, for example Apache Hadoop [4], and of graph processing systems has produced a better toolset for large-scale analysis than was previously available. Leveraging these advancements, extensions to DYGRAF have been developed and implemented to improve the scalability of the data association module and the graph analytics module.
Existing distributed technologies have been incorporated into these modules, and have been leveraged to perform large-scale entity resolution, distributed storage, and iterative distributed graph processing within a Hadoop environment (Figure 2).

3.1 MapReduce data association

To efficiently analyze and glean the best intelligence from a layered multi-modal network, the information described within the network must be robust and, at the same time, have a terse representation. In disparate layers of data, multiple mentions of a single entity often appear across a set of data sources. Generating a layered multi-modal network that includes these entities without a resolution, association, or deduplication process would result in multiple occurrences of the same entity within the graph, increasing the size of the network and segmenting information. DYGRAF utilizes an entity resolution process to prevent duplication and segmentation of entities, their attributes, and their relationships.

Entity resolution is the problem of identifying when entity mentions from within and across multiple data sources refer to the same entity. The process can be decomposed into two steps: scoring and clustering. The scoring step measures the similarity between entities for each pair represented in the data. The clustering step uses the scores generated by the scoring step to group entities by their true unique identities. DYGRAF implements entity resolution within the data association module using a non-parallelized graph association approach [5]. As the number of entities under investigation increases, a limit is reached where the number of pairwise similarity comparisons and assignments exceeds the computational capabilities of a single computer.
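The two steps, and the quadratic growth in pairwise comparisons that limits a single machine, can be illustrated with a minimal single-process Python sketch. It is not DYGRAF’s implementation: the Jaccard similarity over attribute values, the threshold of 0.5, the union-find merging rule, and all names (MENTIONS, score_pairs, cluster) are assumptions made for illustration, and the merging rule is a simple stand-in for the graph association approach of [5].

```python
from itertools import combinations

# Hypothetical entity mentions, already normalized by a data alignment step.
MENTIONS = {
    "m1": {"last_name": "smith", "first_name": "john", "city": "buffalo"},
    "m2": {"last_name": "smith", "first_name": "john", "city": "rome"},
    "m3": {"last_name": "jones", "first_name": "mary", "city": "rome"},
}


def jaccard(a: dict, b: dict) -> float:
    """Fraction of (attribute, value) pairs that two mentions share."""
    items_a, items_b = set(a.items()), set(b.items())
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


def score_pairs(mentions: dict) -> dict:
    """Scoring step: one similarity per unordered pair, n*(n-1)/2 in total."""
    return {
        (i, j): jaccard(mentions[i], mentions[j])
        for i, j in combinations(sorted(mentions), 2)
    }


def cluster(mentions: dict, scores: dict, threshold: float) -> list:
    """Clustering step: merge every pair scoring at or above the threshold."""
    parent = {m: m for m in mentions}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in scores.items():
        if score >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for m in mentions:
        groups.setdefault(find(m), set()).add(m)
    return sorted(groups.values(), key=min)


SCORES = score_pairs(MENTIONS)
CLUSTERS = cluster(MENTIONS, SCORES, threshold=0.5)
```

With n mentions the scoring step evaluates n(n-1)/2 pairs, so one million mentions would already require roughly 5 x 10^11 comparisons, which is the kind of growth that motivates distributing the work.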
Transitioning from a linear data association module, where scoring and entity resolution are performed sequentially across all data sources, to a MapReduce-based association module, shown in Figure 3, removes this scalability limitation from the entity resolution process. The MapReduce data association extension takes normalized data from the data alignment module and records unique entity data to the Hadoop Distributed File System (HDFS). A MapReduce job is executed to read this data and compute similarity scores between every pair of entities. This sequence satisfies the scoring step. For the clustering step, a MapReduce job is executed to read the similarity scores and, based on the constraints of the model, decide when to assign similar mentions to a cluster. Each cluster represents what the algorithm has determined to be a unique entity, and this information is used during the construction of the Evidentiary Graph.

Figure 3. The MapReduce data association module. Data are retrieved from the relational database and transferred to HDFS where MapReduce jobs are executed to calculate similarity scores and determine which entities shall be clustered into a single unique entity.

The data association extension holds an advantage over DYGRAF’s ordinary entity resolution process, providing parallelization during the scoring and assignment (solving) steps. The extension decreases the execution time needed for this task and, in situations that require the analysis of extremely large data sets, enables the module to complete its task when it otherwise could not.

3.2 Graph storage and data access

The large-scale topology of the Evidentiary Graph requires a storage solution that permits efficient data querying and analysis on a graph structure. Relational databases are proficient at persisting and querying regular structures; however, their schemas and table definitions are more rigid and less suitable for a graph representation, especially when representing dynamic graphs, where new data fields may be introduced at each iteration of data ingest. Standard graph-oriented flat files, such as GraphML, also do not provide an acceptable solution, as these types of files are not easily modifiable and become more difficult to interact with as the file size increases.

To avoid the shortcomings of the aforementioned storage solutions, a distributed graph database has been chosen as the storage method for this extension. A graph database provides more flexibility than other storage solutions, such as relational data models, and does not require a predefined schema or consistent data keys across each entry of the same type. These criteria are especially important when modeling data that evolves over time, data containing complex relations, or data that has the potential for a multitude of query perspectives. The inherent graph structure retained through the usage of a graph database for persistence allows straightforward integration with other graph-based technologies (e.g. graph servers, graph visualizers, graph query languages), requiring little transformation from data structure to graph model implementation.

Titan [6] is a graph database capable of being distributed across multiple computer cluster nodes, and has flexible backend storage. Backed by Apache HBase [7], Titan provides DYGRAF with the ability to distribute an Evidentiary Graph that is too large to be stored on one computer across multiple machines, and it provides support for thousands of concurrent query transactions.
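The schema flexibility that motivates a graph database here can be illustrated with a small in-memory property graph in Python. This is a sketch only: it is not the Titan or Blueprints API, and the class PropertyGraph, its methods, and the sample vertices and attributes are all hypothetical. It shows that vertices of the same kind may carry different attribute keys and that a later ingest can add new fields without any schema change.

```python
from collections import defaultdict


class PropertyGraph:
    """Minimal attributed (property) graph with no predefined schema."""

    def __init__(self):
        self.vertices = {}                   # vertex id -> attribute dict
        self.out_edges = defaultdict(list)   # vertex id -> [(label, target, attrs)]

    def add_vertex(self, vid, **attrs):
        # Re-adding an existing vertex merges in any new attributes.
        self.vertices.setdefault(vid, {}).update(attrs)

    def add_edge(self, src, label, dst, **attrs):
        # Endpoints are created on demand so edges never dangle.
        for v in (src, dst):
            self.vertices.setdefault(v, {})
        self.out_edges[src].append((label, dst, attrs))

    def find(self, **criteria):
        """Ids of vertices whose attributes match every given key/value."""
        return sorted(
            vid for vid, attrs in self.vertices.items()
            if all(attrs.get(k) == v for k, v in criteria.items())
        )

    def neighbors(self, vid, label=None):
        """Targets of outgoing edges, optionally filtered by edge label."""
        return [
            dst for lbl, dst, _ in self.out_edges.get(vid, [])
            if label is None or lbl == label
        ]


g = PropertyGraph()
g.add_vertex("p1", kind="person", last_name="Smith")
g.add_vertex("a1", kind="account", bank="First Bank")
g.add_edge("p1", "owns", "a1", since=2009)

# A later ingest introduces a new attribute and a new relationship type.
g.add_vertex("p1", phone="555-0100")
g.add_vertex("p2", kind="person")
g.add_edge("p1", "called", "p2", duration_s=42)
```

In a relational model, the new phone attribute and the new "called" relationship would typically require altering a table or adding one; in the property graph they are simply additional key/value pairs and edges.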
3.3 Large-scale graph exploration and analysis

Effectively analyzing massive disparate sources of data through DYGRAF’s LMMNA capabilities requires the graph analytics module to accommodate large-scale Evidentiary Graphs. Existing analytics within the module, which include traditional graph algorithms and social network analysis metrics, are capable of analyzing small and medium-sized Evidentiary Graphs. To mitigate the scalability shortcomings of graph analysis, DYGRAF’s analytical capabilities have been augmented with scalable graph technologies, namely the TinkerPop graph technology suite and Apache Giraph, a distributed graph processing system.

The TinkerPop graph suite [8] is an open-source collection of graph technologies that provides property graph modeling and interaction libraries, graph query and traversal capabilities, and graph exposure mechanisms. Within the distributed extensions to DYGRAF’s graph analytics module, TinkerPop Blueprints (property graph API), Gremlin (graph traversal/querying), and Rexster (graph server) are valuable components that allow DYGRAF to interoperate with third-party applications that utilize these technologies, such as Titan, which implements the Blueprints API graph model for its graph representation. Additionally, the TinkerPop Blueprints API has been chosen as the common graph API throughout each of DYGRAF’s previously established graph capabilities, as well as DYGRAF’s extensions.

The graph querying capabilities of DYGRAF have been extended with the incorporation of TinkerPop Gremlin, a graph traversal language capable of composing highly detailed, flexible, and sophisticated graph queries across a property graph, executed in a depth-first manner.
By incorporating the ability to construct and execute Gremlin queries over the Evidentiary Graph, a user can filter the graph by entity connections, entity or relationship attributes, conditional statements, or custom functions for the data under investigation. Gremlin also provides a mechanism for users to calculate shortest paths between two entities, and to derive subgraphs of the Evidentiary Graph, such as a k-hop neighborhood.

Apache Giraph [9] is leveraged to execute DYGRAF’s large-scale graph and social network analysis algorithms on the Evidentiary Graph and derived subgraphs. Giraph is an open-source iterative graph processing system, modeled after Google’s Pregel [10], that implements the Bulk Synchronous Parallel model of parallel computing [11]. Each algorithm developed using the Giraph API is designed from a vertex-centric parallel perspective, as opposed to the traditional “top-down” graph algorithm approach, which considers the entire graph throughout each step of the algorithm. Extracting the graph topology from Titan, DYGRAF supplies custom input and output formats that instruct Giraph on how to interpret the data, and how to handle or process algorithmic results. For each distributed algorithm included in DYGRAF, a Giraph configuration is constructed and the corresponding job is executed on the iterative processing framework. The results calculated by each algorithm are then made available directly as text files in HDFS, and are also written as additional attributes on each vertex/edge in Titan, so that they can be included in subsequent queries and analysis.

Table I. Graph Analysis Comparison

Data Set                     Vertices   Edges     Local Graph Analytic   Distributed Graph Analytic
                                                  Runtime (min)          Runtime (min)
VAST 2010                    547        1064      0.339                  7.73
Global Terrorism Database    230132     632238    -                      13.33
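The vertex-centric, superstep-based style of computation described in Section 3.3 can be illustrated with a short single-process Python sketch. It does not use the Giraph API (which is Java) and is not one of DYGRAF’s algorithms; the function name bsp_connected_components, the choice of connected-component labeling as the example, and the sample graph are assumptions made for illustration. Each vertex only reads its incoming messages, updates its own value, and sends messages to its neighbors, with a barrier between supersteps.

```python
def bsp_connected_components(adjacency: dict) -> dict:
    """Label every vertex with the smallest vertex id in its component.

    `adjacency` maps each vertex id to a list of neighbor ids. Every vertex
    must appear as a key and the graph is assumed to be undirected, i.e.
    each edge is listed in both directions.
    """
    value = {v: v for v in adjacency}

    # Superstep 0: every vertex sends its own id to all of its neighbors.
    inbox = {v: [] for v in adjacency}
    for v, neighbors in adjacency.items():
        for n in neighbors:
            inbox[n].append(value[v])

    # Later supersteps run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in adjacency}
        for v, messages in inbox.items():  # conceptually executed in parallel
            if not messages:
                continue  # a vertex with no incoming messages stays halted
            smallest = min(messages)
            if smallest < value[v]:
                # The vertex changes only its own state, then notifies neighbors.
                value[v] = smallest
                for n in adjacency[v]:
                    outbox[n].append(smallest)
        inbox = outbox  # barrier: messages are delivered in the next superstep
    return value


# Two components: {1, 2, 3} and {4, 5}.
GRAPH = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
LABELS = bsp_connected_components(GRAPH)
```

Because no step inspects the whole graph, the per-vertex work inside each superstep can be partitioned across machines, which is the property a system such as Giraph exploits.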
4 Initial analytical results

To illustrate the value provided by DYGRAF’s distributed extensions, two data sets (one small, one large) were chosen to be analyzed as preliminary proof-of-concept exercises. The small data set was constructed from the VAST 2010 data set [12], augmented with additional synthetic data, to represent hospital admittance, financial, and communication records. The Evidentiary Graph derived from this set consisted of 547 vertices and 1064 edges, each with varying sets of attributes. The large data set consisted of records from the Global Terrorism Database (GTD) [13], which included individual/faction information, event occurrences, and event descriptions. The resulting Evidentiary Graph contained 230132 vertices and 632238 edges.

To compare runtime and processing ability, both data sets were analyzed using a subset of the social network analysis algorithms included in DYGRAF, as well as their counterparts introduced through DYGRAF’s distributed extensions. As the objective of the exercise was in part to show a proof of concept, the distributed extensions were executed on a single-VM pseudo-cluster with 8 GB of RAM. The results are presented in Table I.

Analysis results of the VAST 2010 data set show that the non-distributed version of DYGRAF outperforms the distributed extensions with respect to runtime on small data sets that fit into memory (0.339 minutes versus 7.73 minutes). This disparity is mainly caused by the overhead incurred in configuring and executing Giraph jobs. The real value of the distributed extensions is illustrated by the results of the GTD analysis. Although the runtime to execute the set of social network analysis (SNA) metrics was 13.33 minutes, the same set of metrics did not complete using non-distributed methods. This accomplishment is valuable in that the distributed extensions allow DYGRAF to analyze data sets that it could not previously process.
These initial results provide motivation for subsequent experiments with a multi-node cluster, to test scalability limits and runtime.

5 Conclusions

DYGRAF is a domain agnostic framework used to provide layered multi-modal network analysis resulting in a cohesive and comprehensive picture of the current cross network situation faced by analysts and decision makers, enabling them to correlate and investigate activity and time occurrences in related networks. DYGRAF provides the capabilities to reduce the amount of mental correlation required by the analyst, while maximizing their ability to understand and potentially predict when and where a major event or activity of interest is about to occur. By leveraging data collected from multiple disparate sources, DYGRAF is able to shed light upon the relationships between diverse and seemingly unrelated information.

As data collection techniques and storage capabilities become more advanced, the amount of data that exists describing events and activities of interest has grown considerably. Furthermore, to fully understand the information found in this data, there is a desire to analyze the sets in their entirety, with the goal of capturing any trends, patterns, or hidden relationships. In order to meet these needs and eliminate scalability restrictions, DYGRAF has introduced distributed computing extensions to its framework. Augmenting its capabilities with high-performance computing technologies allows DYGRAF to perform the large-scale data alignment, data association, and LMMNA needed to fully exploit the massive data sets that are becoming common.

6 Acknowledgment

The authors would like to thank Mr. Craig Anken and Mr. John Spina of the DYGRAF program management team for their help and support, as well as Dr. Gregory Tauer for his MapReduce data association research.

7 References

[1] P. LaMonica, T. Waskiewicz, “Layered multi-modal network analysis of textual data for improved situation awareness,” In Proc. International Conference of Knowledge Engineering (IKE), July 2011.
[2] M. Margitus, W. Tagliaferri Jr., M. Sudit, P. LaMonica, “Dynamic graph analytic framework (DYGRAF): greater situation awareness through layered multi-modal network analysis,” Proc. SPIE 8402, Evolutionary and Bio-Inspired Computation: Theory and Applications IV, 84020E (May 1, 2012); doi:10.1117/12.920598
[3] T.R. Gruber, “A translation approach to portable ontologies,” Knowledge Acquisition, 5(2):199-220, 1993.
[4] Apache Hadoop [Online]. Available: http://hadoop.apache.org
[5] G. Tauer, R. Nagi, M. Sudit, “The graph association problem: mathematical models and a Lagrangian heuristic,” Naval Research Logistics (NRL), vol. 60, issue 3, pp. 251-268, March 2013.
[6] Titan Distributed Graph Database [Online]. Available: http://thinkaurelius.github.io/titan/
[7] Apache HBase [Online]. Available: http://hbase.apache.org
[8] TinkerPop [Online]. Available: http://www.tinkerpop.com
[9] Apache Giraph [Online]. Available: http://giraph.apache.org
[10] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, G. Czajkowski, “Pregel: a system for large-scale graph processing,” In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD ’10). ACM, New York, NY, USA, 135-146.
[11] L. Valiant, “A bridging model for parallel computation,” Commun. ACM 33, 8 (August 1990), 103-111.
[12] VAST Challenge 2010 [Online]. Available: http://hcil2.cs.umd.edu/newvarepository/VAST%20Challenge%202010/challenges/Grand%20Challenge%202010/
[13] Global Terrorism Database [Online]. Available: http://www.start.umd.edu/gtd/