DS-RT 2010
DISTRIBUTED AND INTERACTIVE SIMULATIONS AT LARGE SCALE FOR TRANSCONTINENTAL EXPERIMENTATION
19 October 2010
Dan M. Davis, for Gottschalk, Lucas, Yao & Wagenbreth
[email protected], (310) 448-8434
Approved for public release; distribution is unlimited.

Co-Authors
- Dr. Thomas D. Gottschalk, Center for Advanced Computing Research, California Institute of Technology, Pasadena, California 91125, [email protected]
- Prof. Robert F. Lucas, Dr. Ke-Thia Yao, and Gene Wagenbreth, Information Sciences Institute, University of Southern California, Marina del Rey, California 90292, {rflucas, kyao, genew}@isi.edu

Overview
- Needs of the user community at JFCOM
- Three technologies:
  - general-purpose GPUs and their implementation
  - high-bandwidth studies with Interest Management
  - distributed data management
- Lessons learned
- Theses

Theses
- Transcontinentally distributed simulations require advanced technologies and innovative paradigmatic approaches.
- Academic researchers often have such capabilities, which are typically not yet in the journeyman programmer's toolbox.
- GPGPUs provide usable acceleration, but conservative assessments of the potential speed-up are warranted (a worked Amdahl's-Law example appears below, just before the "Why GPUs?" slide).
- Transcontinental high-performance communications ("10 Gig") are possible with Interest Management.
- With distributed systems and a data glut from High Performance Computing simulations, new approaches to data management are required.

JFCOM as a Distributed Simulation User
- U.S. Joint Forces Command, Norfolk, Virginia: one of DoD's combatant commands, with a key role in transforming defense capabilities
- Two JFCOM directorates use agent-based simulations:
  - J7, Training: trains forces, develops doctrine, leads training requirements analysis, and provides an interoperable training environment
  - J9, Concept Development and Experimentation: develops innovative joint concepts and capabilities, and provides proven solutions to problems facing the joint force
- Simulations are typically GENSER Secret and characterized by:
  - interactive use by hundreds of personnel
  - transcontinental distribution, yet real-time response
  - terminals manned overwhelmingly by uniformed warfighters

Plan View Display
[screenshot slide]

3D Stealth View
[screenshot slide]

Simulation Federates
- Agent-based models use rules for entity behavior; the entities are autonomous agents
- Can be Human-In-The-Loop (HITL) and run in real time
- Large compute clusters are required to run large-scale simulations
- The standard interface is HLA RTI communication (IEEE 1516) and its publish/subscribe model, which supplanted the old DIS; USC/Caltech software routers scale better
- Common codes in use at JFCOM:
  - Joint Semi-Automated Forces (JSAF)
  - "Culture", a stripped-down civilian instantiation of JSAF
  - Simulating Location & Attack of Mobile Enemy Missiles (SLAMEM)
  - OneSAF

Computing Infrastructure
- Koa and Glenn: deployed spring 2004 at MHPCC and the ASC-MSRC; two 32-bit Linux clusters of 256 nodes and 512 cores each
- Joshua: deployed 2008; a GPGPU-enhanced, 64-bit cluster of 256 nodes and 1,024 cores
- DREN connectivity, with users in Virginia, California, and at Army bases
- The application tolerates network latency: real-time, interactive supercomputing

Joshua GPGPU-Enhanced Cluster Configuration
- 256 nodes in 4U chassis
- Per node: two AMD "Santa Rosa" 2220 2.8 GHz dual-core processors (1,024 cores total)
- Per node: one NVIDIA 8800 video card
- Memory: 16 GB of DDR2-667 DIMMs per node
- GigE inter-node communications
- Delivered to the Joint Advanced Tactics and Training Laboratory (JATTL) in Suffolk, VA

Perspective: Entity Growth vs. Time
[Chart: number and complexity of JSAF entities by experiment, rising from 3,600 (UE 98-1, 1997) through 12,000; 50,000; 107,000; and 250,000 (SAF Express, 1997; J9901, 1999; AO-00, 2000; JSAF/SPP Urban Resolve and tests, 2004, following the DARPA/Caltech SPP proof of principle) to 1,000,000 on the DC clusters at MHPCC & ASC-MSRC (2006) and 1,400,000 on the DHPI GPU-enhanced cluster Joshua (2008), with 10,000,000 on the horizon. SCALE and FIDELITY: experiments continue to require orders-of-magnitude larger and more complex battlespaces.]
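The theses above urge conservative speed-up estimates, and the "Why GPUs?" slide that follows explains why: Amdahl's Law. A minimal worked example in Python, with illustrative figures assumed to match the deck's "100X kernel, 2-3X typical" observation (the 50-70% accelerated fractions are assumptions, not measured JSAF numbers):

```python
# Amdahl's Law: if a fraction p of the runtime is accelerated by a
# factor s, the overall speedup is 1 / ((1 - p) + p / s).
def amdahl_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

# Illustrative figures (assumed): a 100X GPU kernel applied to code
# that spends 50-70% of its runtime in the accelerated routines.
for p in (0.5, 0.6, 0.7):
    print(f"accelerated fraction {p:.0%}: overall speedup {amdahl_speedup(p, 100.0):.1f}X")
# accelerated fraction 50%: overall speedup 2.0X
# accelerated fraction 60%: overall speedup 2.5X
# accelerated fraction 70%: overall speedup 3.3X
```

Even an infinitely fast kernel cannot beat 1/(1-p), so the unaccelerated share of the code, not the GPU, sets the ceiling.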
Why GPUs?
- GPU performance can be 100X that of the host, and this speed-up is expected to grow
- But don't forget Gene Amdahl's Law: 2-3X overall is typical (see the worked example above)
- Early SAF work (UNC, SAIC, USC) accelerated line of sight, route finding, collision detection, and sparse matrix factorization
- ISI verified similar bottlenecks in JSAF
- New ideas: use in scenario generation for new multi-spectral sensors

Route Planning Performance Impact
- Time spent in route planning is the critical bottleneck

Benefits of GPGPU Computing
- Joshua has provided many benefits; some are not easily quantified
- Training, analysis, or evaluation in cities otherwise off-limits due to:
  - security issues
  - public resistance to combat troops in their city
  - diplomatic sensitivities about U.S. interest in cities of potential conflict
- Joshua does save personnel costs: an Army division costs ~$20M per day, while the DHPI cluster can run such an exercise with only ~100 technicians, a saving of perhaps ~$19.5M each day
- Good visibility with the leadership elite, e.g. Congressional visits
- A Lieutenant General noted that it was probably the only time in his career he would have an opportunity to command so large a unit: 1,500 soldiers across the country participated, all connected

Transcontinental Network High Bandwidth Research
- The issue here was the potential exploitation of high-bandwidth networks, e.g. "10 Gig" (10 Gigabit-per-second) nets
- The nodes of this WAN were located at ISI-East in Virginia, the University of Illinois at Chicago, and ISI-West in California
- Previous work indicated the utility of interest-managed communications on cluster meshes, high-bandwidth Local Area Networks (LANs), and lower-bandwidth WANs
- Interest-limited message exchange was done using Caltech's MeshRouter formalism (a generic sketch of the idea follows the local-cluster test description below)

Interest Managed Software Router Diagram
[diagram slide]

Transcontinental Testbed for "10 Gig" Proof of Concept
[diagram slide]

Tests on Local ISI Cluster
- Prepared for the wide-area tests by running generalizations of various configurations on the ISI-West cluster
- Tests involved configurations with:
  - a single router process on its own node
  - a single publish process on its own node
  - multiple subscriber processes on either
- Performance results for these configurations are summarized in the table that follows; bandwidth numbers reflect only the data rates into the subscriber processes
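The MeshRouter formalism itself is Caltech's; what follows is only a generic sketch of the interest-managed publish/subscribe idea these tests exercise, with hypothetical Python stand-ins rather than the actual MeshRouter or RTI-s APIs:

```python
# Generic interest-managed routing sketch (hypothetical stand-in, not
# the MeshRouter/RTI-s API): the router forwards each published message
# only to subscribers whose declared interest set contains the
# message's interest state.
from collections import defaultdict

class InterestRouter:
    def __init__(self):
        self.subscriptions = defaultdict(set)  # interest state -> callbacks

    def subscribe(self, callback, interest_states):
        for state in interest_states:
            self.subscriptions[state].add(callback)

    def publish(self, interest_state, message):
        # Uninterested subscribers never see the traffic, which is what
        # keeps aggregate load bounded on a shared WAN link.
        for callback in self.subscriptions.get(interest_state, ()):
            callback(interest_state, message)

router = InterestRouter()
router.subscribe(lambda s, m: print(f"display got {s}: {m!r}"),
                 {"red-air", "red-ground"})
router.publish("red-ground", b"contact report ...")    # delivered
router.publish("blue-ground", b"position update ...")  # filtered out
```

Because traffic is filtered at the router by declared interest, load grows with interest overlap rather than with the raw publisher count; the table that follows shows the measured consequences on a single router node.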
Bandwidth Results for Various Test Configurations

Subscribers | Subscriber Nodes | Per-Node Bandwidth | Total Bandwidth | Router Load
     1      |        1         |    680 Mbit/sec    |  690 Mbit/sec   |  41% busy
     2      |        1         |    672 Mbit/sec    |  1.3 Gbit/sec   |  54% busy
     4      |        1         |    504 Mbit/sec    |  2.0 Gbit/sec   |  59% busy
     6      |        2         |    344 Mbit/sec    |  2.1 Gbit/sec   |  55% busy
     8      |        2         |    288 Mbit/sec    |  2.3 Gbit/sec   |  51% busy
    16      |        2         |    160 Mbit/sec    |  2.6 Gbit/sec   |  44% busy

Benchmark Tests: Two Forms
- Application processes:
  - publish processes emit messages of a specified length and interest state; the nominal total publication rate (MByte/sec) is controlled by a data file
  - subscribe processes receive messages for a specified interest state, collect messages from multiple publishers, and measure the actual incoming message rates
- Routers direct individual messages from publishers to subscribers according to the interest declarations
- Router processes were instrumented to determine the fraction of time spent on management

Aggregate Bandwidth versus Number of Subscribers
[chart slide]

10 Gig WAN Test Results
- WAN tests were then performed; this entailed a great deal of troubleshooting of the various routers along the WAN
- A number of variants of the basic configuration were explored:
  - the number of distinct interest states
  - the number of processors associated with a single router process
  - the number of replicas of the basic "router plus associated pub/sub" nodes at each site
- Typical performance numbers, for a test with eight participating nodes at each of the UIC and ISI sites, are summarized in the table that follows

Performance Measures for a Typical WAN Test

Message Length | Client BW (bytes/sec) | Single MR BW (bytes/sec) | Aggregate BW (bits/sec)
   0.4 KByte   |         3.2 M         |          16.0 M          |          1.0 G
   0.8 KByte   |         6.4 M         |          32.0 M          |          2.1 G
   1.6 KByte   |        12.8 M         |          64.0 M          |          4.1 G
     2 KByte   |        14.3 M         |          71.5 M          |          4.6 G
   100 KByte   |         0.8 M         |           4.0 M          |          0.3 G

Final Results: ~5 Gigabits over a "10 Gig" Line
- Aggregate throughput is rather poorer for individual message sizes that are either too small or too large; this is consistent with experience on a single SPP
- The constraints are largely due to the nature of the RTI-s communications primitives
- As long as messages are near the optimal size, the aggregate bandwidth for the WAN test is ~4.6 Gbit/sec
- WAN tests with varying configurations give similar results: maximum total WAN bandwidth of 4.6-4.9 Gbit/sec

Data Requirements
- Larger scale: global rather than theatre
- Higher fidelity: greater complexity in the models of entities (sensors, people, vehicles, etc.)
- Urban operations: millions of civilians
- All of the above produce dramatic increases in data relative to previously experienced events

Terrain: Large, but NOT a Significant Data Issue
[chart slide]

Growth and the Impending Data Armageddon
- JFCOM had an immediate need for more entities (>10X)
- Memory is limited on the nodes and in Tesla, and tertiary storage is very limited
- Existing practice generates a terabyte a week while keeping only 20% of the current data
- With 10X more entities and a 10X behavior improvement needed, the net growth required is almost three orders of magnitude (see the arithmetic sketch below)
- Validation is currently face validity only; a more quantitative, statistical approach is needed (future work at Caltech by Dr. Thomas Gottschalk)
- Data mining efforts are now commencing
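A quick check of the "almost three orders of magnitude" figure above. The factor-of-five retention term is an assumed reading (keeping all data instead of today's 20%); the deck states the individual factors but not the product:

```python
import math

# Assumed reading of the growth factors on the slide above: the net
# data-volume growth is the product of the individual factors.
entities  = 10        # "need 10X more entities"
behavior  = 10        # "need 10X behavior improvement"
retention = 1 / 0.20  # keep all data instead of the current 20% (assumed)

growth = entities * behavior * retention
print(f"net growth: {growth:.0f}X (~{math.log10(growth):.1f} orders of magnitude)")
print(f"1 TByte/week today -> ~{growth:.0f} TByte/week")
# net growth: 500X (~2.7 orders of magnitude)
# 1 TByte/week today -> ~500 TByte/week
```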
Two Key Challenges
- Collect the "fire hoses" of data generated by large-scale, distributed, sensor-rich environments:
  - without interfering with communication
  - without interfering with simulator performance
- Maximally exploit the collected data, efficiently:
  - without overwhelming users
  - without losing critical content
- Goal: a unified, distributed logging/analysis infrastructure that helps the users, does not burden the computing/networking managers, and does not unduly tax network traffic loads

Limitation of the First System: It Did Not Scale!
- Two separate data analysis systems: one for near-real-time use during the event, another for post-event processing
- Near-real time: too much data to access over the wide-area network without crashing
- Post-event processing: 1-2 weeks to stage the data to a centralized data store
- Discards Green entities (80% of the data)

Handling Dynamic Data
- Data are NOT static during runs, but users still need access
- The logger continuously inserts new data from the simulation, so distributed queries are needed to combine remote data sources; the distributed logger inserts data into the SDG data store at each site
- Problems:
  - the local cache becomes invalid with respect to inserts
  - data cannot be prepositioned to optimize queries
- ISI strategy: explore the trade-offs
  - compute on demand, for better efficiency
  - compute on insert, for faster queries
  - variable fidelity: periodic updates
  - dynamic pre-computation: detect frequent queries

Handling Distributed Data
- Analyze the data in place: data are generated on distributed nodes, so leave them where they are generated and distribute data access so that the data appear to be at a single site
- Take advantage of HPC hardware capabilities: large-capacity data storage, high-bandwidth networks, and data archival
- Exploit JSAF query characteristics: a limited number of joins; counting/aggregation-type queries; data products several orders of magnitude smaller than the raw data

Notional Diagram of the Scalable Data Grid
[diagram slide]

Multidimensional Analysis: Sensor/Target Scoreboard
- Summarizes sensor contact reports: positive and negative sensor detections
- Displays two-dimensional views of the data
- Provides three levels of drill-down:
  - sensor platform type vs. target class
  - sensor platforms vs. sensor modes
  - the list of contact reports

Multidimensional Analysis
- The raw data have other dimensions of potential interest: detection status; time and location; terrain type and entity density; weather conditions
- Each dimension can be aggregated at multiple levels (location: country, grid square; time: minutes, hours, days)
- Multiple dimensions can be collapsed and expanded for viewing

Multidimensional Analysis: Dimension Lattice
- Data are classified along the dimensions t (sensor platform type), p (sensor platform), c (target class), m (sensor mode), and o (target object), with * meaning "any"
- The dimension lattice runs from * through the single dimensions t, p, c, m, o and the pairs tc, pc, to, po, tm, pm, cm, om to the triples tcm, pcm, tom, pom
- The Sensor/Target Scoreboard's drill-downs, seen in the context of multidimensional analysis, classify the data along three dimensions and drill down to three nodes in the dimension lattice (a small aggregation sketch follows the "Cube Dimension Editor" slide)

Cube Dimension Editor
[screenshot slide]
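As the dimension-lattice slide above notes, drill-down is just aggregation at different nodes of the lattice. A minimal Python sketch with made-up contact reports; this illustrates the idea only and is not the actual SDG or scoreboard code:

```python
from collections import Counter

# Made-up contact reports: (platform type, platform, target class, mode).
reports = [
    ("UAV",  "uav-07",   "truck",    "EO"),
    ("UAV",  "uav-07",   "truck",    "IR"),
    ("UAV",  "uav-12",   "civilian", "EO"),
    ("GMTI", "jstars-1", "truck",    "radar"),
]
FIELDS = {"t": 0, "p": 1, "c": 2, "m": 3}  # lattice letters -> columns

def rollup(reports, dims):
    """Aggregate report counts at one lattice node, e.g. dims='tc'
    for sensor-platform-type x target-class."""
    counts = Counter()
    for r in reports:
        counts[tuple(r[FIELDS[d]] for d in dims)] += 1
    return counts

print(rollup(reports, "tc"))   # coarse scoreboard: type x class
print(rollup(reports, "pcm"))  # drill-down: platform x class x mode
```

Collapsing a dimension is dropping its letter from dims, and expanding is adding one, which mirrors the collapse/expand operations described above.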
SDG: Broader Applicability?
- The Scalable Data Grid is a distributed data management application/middleware that effectively:
  - collects and stores high volumes of data, at very high data rates, from geographically distributed sources
  - accesses, queries, and analyzes the distributed data
  - utilizes the distributed computing resources of the HPC systems
  - provides a multidimensional framework for viewing the data
- Potential application areas:
  - large-scale distributed simulations
  - instrumented live training exercises
  - high-volume instrumented physics research
  - virtually any distributed data environment using HPC

K-Means Testing
- K-Means is a popular data mining algorithm
- It requires three inputs:
  - an integer k, indicating the number of desired clusters
  - a distance function over the data instances
  - the set of n data instances to be clustered, each typically represented as a vector
- The output is a set of k points representing the means of the k clusters; each of the n data instances is assigned to the nearest cluster mean under the distance function (a minimal sketch appears at the end of this deck)

K-Means Clustering of Three Clusters
[Scatter plot: data grouped into three clusters; x axis roughly -6,000 to 2,000, y axis roughly -2,000 to 5,000]

Conclusions
- Large-scale distributed simulations demand more power, and technology can deliver that power
- This paper set out three emerging technologies:
  - GPGPU acceleration of Linux clusters
  - high-bandwidth, interest-managed communications
  - distributed data management
- This work shows simulation researchers' new abilities to generate simulations, to move the data around more effectively, to store the data more efficiently, and to analyze the data better
- Researchers would benefit from understanding the technologies' power, limitations, and costs

Research Funded by JFCOM and AFRL
This material is based on research sponsored by the U.S. Joint Forces Command via a contract with the Lockheed Martin Corporation and SimIS, Inc., and on research sponsored by the Air Force Research Laboratory under agreement numbers F30602-02-C-0213 and FA8750-05-2-0204. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government.

Approved for public release; distribution is unlimited.
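For reference, a minimal K-Means implementation matching the description on the "K-Means Testing" slide: pure Python with a Euclidean distance function; the three synthetic Gaussian blobs are an assumption chosen to resemble the scatter plot's axis ranges, not the actual test data.

```python
import math
import random

def kmeans(points, k, iters=100):
    """Cluster the n data instances into k groups; return (means, clusters)."""
    means = random.sample(points, k)  # initialize means with k random instances
    for _ in range(iters):
        # Assign each instance to the nearest cluster mean (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, means[j]))
            clusters[nearest].append(p)
        # Recompute each mean from its assigned instances.
        new_means = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else means[i]
                     for i, cl in enumerate(clusters)]
        if new_means == means:  # converged
            break
        means = new_means
    return means, clusters

# Synthetic 2-D data: three Gaussian blobs (assumed, for illustration only).
random.seed(1)
pts = [(random.gauss(cx, 300.0), random.gauss(cy, 300.0))
       for cx, cy in [(-4000, 3000), (-1000, -500), (500, 1500)] for _ in range(50)]
means, _ = kmeans(pts, k=3)
print(sorted(tuple(round(x) for x in m) for m in means))
```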