Multi-Threaded Event Reconstruction in Java
Norman Graf
CHEP 2010, Taipei
October 21, 2010

Why Multi-threading?
- Moore’s law still holds, but clock-speed of CPUs fell off the curve several years ago.
  - Can no longer get improved performance “for free” from faster CPUs.
  - Even getting slower…
- Current trend is towards multi- or many-core architectures sharing memory.
- Don’t believe there are “silver bullets”:
  - no compiler switch to optimize for many-core
  - no libraries to link against to give concurrency
- Requires a paradigm shift in coding.

Multi-threading advantages
- “Many hands make light work.” John Heywood
- “The real performance payoff of dividing a program's workload into tasks comes when there are a large number of independent, homogeneous tasks that can be processed concurrently.” Java Concurrency in Practice

Multi-threading gotchas
- “Too many cooks spoil the broth.” Anonymous
- “Amdahl’s Law” Gene Amdahl, 1967

HEP Reconstruction Parallelism
- Currently, HEP employs run- and event-based parallelism.
  - Conditions usually ~static per run.
  - Set up geometry once and access throughout job.
  - Program then processes each event serially and independently.
- Memory footprint of program, conditions, and increasingly so for events, is becoming significant.
- Investigate whether parallelism within an event reconstruction holds promise for the future.
  - One event in memory, multiple CPUs processing

HEP Event Reconstruction
- Modern collider detectors are complex, but modular, and a number of tasks can be easily identified as independent. For example:
- Digitization, clustering, centroid calculation of hits in silicon trackers easily factorizable.
  - By subsystem (barrel vs endcap), by layer, by wafer.
- Clustering of calorimeter cells
  - By subsystem (barrel/endcap, EM/Had), modules
- Jet flavor-tagging
  - vertexing done only on list of associated tracks
  - lepton association only done on jet constituents.
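
Amdahl’s Law, cited on the “gotchas” slide above, can be written out as a short worked example. The symbols are assumptions introduced here for illustration: p is the fraction of the reconstruction time that can run concurrently, N is the number of threads, and S(N) is the resulting speedup. The value p = 0.95 is purely illustrative and is not a measurement from this study.

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}

% Illustrative numbers only: p = 0.95, N = 24
S(24) = \frac{1}{0.05 + \dfrac{0.95}{24}}
      = \frac{1}{0.05 + 0.0396}
      \approx 11.2,
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{0.05} = 20
```

Under these assumed numbers, even a 5% serial fraction caps the gain at a factor of 20, however many cores are available.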

[Figure: Silicon Detector Inner Tracker, illustrating threading granularity: thread by subdetector (Barrel, Endcap), thread by layer (Layer 1, Layer 2), thread by wafer.]

org.lcsim
- A fast and flexible reconstruction and analysis framework developed for ILC physics and detector response simulations.
- Written in Java.
- Plug & Play reconstruction Drivers.
- Runtime configurable with xml file.
- Supports a number of different subdetector types:
  - TPC, Si pixel and -strip; sampling and total absorption crystal calorimetry; …
- Perfect development environment to test ideas of multi-threaded event reconstruction.

Threads in Java
- Java has included support for concurrency since its beginning, and this support has improved over time.
  - Thread, Runnable, Callable
- Makes it easy to develop, implement and study.
- Use existing Java reconstruction package lcsim.org to study feasibility of multi-threaded approach to event reconstruction.
- ISO C++ standard doesn’t mention threads.
  - Usual solutions involve non-portable, platform-specific concurrency features and libraries.
  - boost a possible solution; C++0x draft offers threads.
- Idea is to study the concept in an environment currently supportive of this approach (Java), and apply if, and when, needed and supported in C++.

Thread Class
- Most basic method is to extend base class Thread.
- run() method
  - accepts no arguments
  - returns no values
  - cannot throw checked exceptions

    class ThreadTask extends Thread {
        public void run() { … }
    }
    …
    Thread t = new ThreadTask();
    t.start();
    t.join();   // blocks until task completes
    …

Runnable Interface
- Runnable Interface allows user class to be active while not subclassing Thread.
- run() method
  - accepts no arguments
  - returns no values
  - can’t throw checked exceptions

    public interface Runnable {
        public void run();
    }

    // in Java, "extends" must precede "implements"
    class RunnableTask extends UsefulBaseClass implements Runnable {
        public void run() { … }
    }
    …
    Runnable runnable = new RunnableTask();
    Thread t = new Thread(runnable);   // passed to Thread as arg.
    t.start();
    t.join();
    …
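
The Runnable pattern above can be made concrete with a small, self-contained sketch. The class and method names (SumTask, RunnableSketch, sumInThread) are hypothetical and chosen only for illustration. Because run() returns nothing, the task stores its result in a field, and the caller reads it through an extra getter once join() has returned.

```java
// Hypothetical task: sums an array on a separate thread.
class SumTask implements Runnable {
    private final int[] values;
    private long result;   // written by run(), read by the caller after join()

    SumTask(int[] values) {
        this.values = values;
    }

    @Override
    public void run() {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        result = sum;
    }

    // A method outside the Runnable interface, needed to get the value back.
    long getResult() {
        return result;
    }
}

class RunnableSketch {
    static long sumInThread(int[] values) {
        SumTask task = new SumTask(values);
        Thread t = new Thread(task);   // the Runnable is passed to the Thread
        t.start();
        try {
            t.join();                  // blocks until the task completes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return task.getResult();       // safe to read: join() has returned
    }

    public static void main(String[] args) {
        System.out.println(sumInThread(new int[] {1, 2, 3, 4}));   // prints 10
    }
}
```

The getter is the extra bookkeeping that a result-bearing interface would make unnecessary.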
- To get a value back from the now-completed task, you must use a method outside the interface and wait for some kind of notification message that the task completed.

Callable Interface
- Callable Interface allows user class to inherit from other classes.
- Best suited for result-bearing tasks.
- call() method
  - returns typed value
  - can throw checked exceptions

    public interface Callable<V> {
        V call() throws Exception;
    }

    // implements Callable (not Runnable); "extends" must precede "implements"
    class CallableTask extends UsefulBaseClass implements Callable<Object> {
        public Object call() { … }
    }

- Cannot pass a Callable into a Thread to execute.
- Requires use of ExecutorService to execute the Callable object.

ExecutorService
- Part of the java.util.concurrent package.
- Asynchronous task handler.
- Creates, manages, runs thread pools.
- The Executors class provides factory methods, including:
  - newSingleThreadExecutor()
    - single thread, unbounded queue for tasks
  - newFixedThreadPool(int nThreads)
    - specified maximum thread pool size, unbounded task queue
    - if thread dies, a new one will be created to replace it.
  - newCachedThreadPool()
    - open-ended # of threads, grows and shrinks on demand.
    - caches threads for short period of time for re-use

ExecutorService and Callable
- The service accepts Callable objects to run by way of the submit() method, which returns a Future object representing that task:

    <T> Future<T> submit(Callable<T> task)

- or by way of the invokeAll() method:

    <T> List<Future<T>> invokeAll(Collection<? extends Callable<T>> tasks) throws InterruptedException

- Future’s get() method will return the given result upon successful completion.

    public interface Future<V> {
        …
        V get() throws InterruptedException, ExecutionException;
    }

Thread-safe Collections
- With multiple threads running, need to worry about concurrent access both for reading from and writing to the Event.
- Original Java collections (e.g. Vector) were thread-safe, but slow.
- The commonly used collection classes in java.util are not currently synchronized.
- Synchronization wrappers add automatic synchronization (thread-safety) to an arbitrary Java collection.
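
The pieces introduced above (Callable, ExecutorService, Future and a synchronization wrapper) can be combined in one small sketch. Everything specific to it is an assumption made for illustration: the names SynchronizedListSketch, SquareTask and fillConcurrently are hypothetical, and the "work" is simply appending squares to a shared list rather than anything from org.lcsim.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SynchronizedListSketch {

    // Hypothetical task: appends the squares of [start, end) to a shared
    // list and returns how many entries it added.
    static class SquareTask implements Callable<Integer> {
        private final int start;
        private final int end;
        private final List<Integer> shared;

        SquareTask(int start, int end, List<Integer> shared) {
            this.start = start;
            this.end = end;
            this.shared = shared;
        }

        @Override
        public Integer call() {
            for (int i = start; i < end; i++) {
                shared.add(i * i);   // add() is synchronized by the wrapper
            }
            return end - start;
        }
    }

    static List<Integer> fillConcurrently(int nTasks, int perTask) {
        // the wrapper makes concurrent add() calls on the ArrayList safe
        List<Integer> shared = Collections.synchronizedList(new ArrayList<Integer>());

        List<Callable<Integer>> tasks = new ArrayList<Callable<Integer>>();
        for (int t = 0; t < nTasks; t++) {
            tasks.add(new SquareTask(t * perTask, (t + 1) * perTask, shared));
        }

        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            // invokeAll() blocks until every task has completed
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                f.get();   // rethrows any task failure as ExecutionException
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(fillConcurrently(4, 1000).size());   // prints 4000
    }
}
```

The order of entries in the shared list depends on thread scheduling, so only the total count is deterministic. Replacing the wrapper with a bare ArrayList would make the concurrent add() calls unsafe.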

    List<Type> list = Collections.synchronizedList(new ArrayList<Type>());

Calorimeter Clustering Example I

    // the map containing the calorimeter hits keyed on subdetector name
    Map<String, List<CalorimeterHit>> hitmap = new HashMap<String, List<CalorimeterHit>>();
    …
    // a container to hold resulting clusters…
    List<Cluster> clusterList = new ArrayList<Cluster>();
    // A Clusterer to cluster hits
    Clusterer c = new Clusterer();
    for (String s : keys) {
        List<Cluster> clusters = c.cluster(s, hitmap.get(s));
        clusterList.addAll(clusters);
    }
    …

Calorimeter Clustering Example II

    // how many processors are available?
    int nProcessors = Runtime.getRuntime().availableProcessors();
    // create a fixed number of threads for processing
    ExecutorService threadExecutor = Executors.newFixedThreadPool(nProcessors);
    // the map containing the calorimeter hits keyed on subdetector name
    Map<String, List<CalorimeterHit>> hitmap = new HashMap<String, List<CalorimeterHit>>();
    // a container to hold resulting clusters…
    List<Cluster> clusterList = new ArrayList<Cluster>();
    // a collection to hold the clustering tasks
    Collection<Callable<List<Cluster>>> tasks = new LinkedList<Callable<List<Cluster>>>();
    // create one task per subdetector and add to task list
    for (String s : keys) {
        tasks.add(new CallableClusterer(s, hitmap.get(s)));
    }
    // process all tasks
    List<Future<List<Cluster>>> futures = threadExecutor.invokeAll(tasks);
    // analyze output
    for (Future<List<Cluster>> f : futures) {
        List<Cluster> clusters = f.get();
        clusterList.addAll(clusters);
    }
    …

Calorimeter Clustering Example III

    // how many processors are available?
    int nProcessors = Runtime.getRuntime().availableProcessors();
    // create a fixed number of threads for processing
    ExecutorService threadExecutor = Executors.newFixedThreadPool(nProcessors);
    // the map containing the calorimeter hits keyed on subdetector name
    Map<String, List<CalorimeterHit>> hitmap = new HashMap<String, List<CalorimeterHit>>();
    // a thread-safe container to hold resulting clusters…
    List<Cluster> clusterList = Collections.synchronizedList(new ArrayList<Cluster>());
    // a collection to hold the clustering tasks
    Collection<Callable<List<Cluster>>> tasks = new LinkedList<Callable<List<Cluster>>>();
    // create one task per subdetector
    for (String s : keys) {
        tasks.add(new CallableClusterer(s, hitmap.get(s), clusterList));
    }
    // process all tasks
    List<Future<List<Cluster>>> futures = threadExecutor.invokeAll(tasks);
    …

- Makes more efficient use of threads by adding clusters directly to the clusterList instead of adding them all after all the threads have finished.
- Can still check status of Future objects to make sure all tasks have finished successfully.

Testing on multi-core systems.
- My home PC:
  - Intel Core i7 with hyperthreading, giving 8 cores
  - 12GB of RAM
- 'ki-eval01':
  - Dual Intel 'Westmere' 6-core CPUs.
  - Intel hyperthreading feature is enabled, which doubles the number of (logical) cores from 12 to a total of 24.
  - 48GB of RAM available
- 'ki-eval05':
  - Dual AMD 12-core CPUs.
  - No hyperthreading, so a total of 24 cores.
  - 64GB of RAM.
- Thanks to Stuart Marshall and Yemi Adesanya at KIPAC for granting access.

CPU Intensive Example (e.g. Digitization)

[Figure: Speedup Factor versus Number of Threads (axis ticks from 5 to 30) for two systems: Dual AMD 12-core CPUs, and Dual Intel 'Westmere' 6-core CPUs + HT.]

Analysis Process
- Very large phase space for optimization.
- Balance granularity of threaded tasks with overhead and data structures
  - i.e. is wafer-level threading realistic for Si tracker?
- Amdahl’s Law limits maximum gain
  - Not all tasks lend themselves to concurrent processing
    - e.g. track-finding spans detector elements
  - But not an “all-or-nothing” game
- Enabling Intel HT could lead to immediate gains
- Need tools to monitor threads, CPU and memory.

JConsole
- “JConsole uses the extensive instrumentation of the Java Virtual Machine (Java VM) to provide information about the performance and resource consumption of applications running on the Java platform.”
- Local or remote connection
- images from java.sun.com

[Screenshots: JConsole Overview; JConsole Memory; JConsole Threads; JConsole MBean Operations.]

Threading MBean
- findMonitorDeadlockedThreads.
  - Detects if any threads are deadlocked on the object monitor locks. This operation returns an array of deadlocked thread IDs.
- getThreadInfo.
  - Returns the thread information. This includes the name, stack trace, and the monitor lock that the thread is currently blocked on, if any, and which thread is holding that lock, and thread contention statistics.
- getThreadCpuTime.
  - Returns the CPU time consumed by a given thread

[Screenshot: Obtaining Detailed Thread Information.]

JConsole Extensibility
- Extremely functional tool as-is.
- Extend functionality by:
  - implementing custom MBeans.
  - using JConsole plug-in API.

Summary and Outlook
- HEP event reconstruction is inherently modular and lends itself well to a multi-threaded approach.
- lcsim.org’s modular approach to “generic” reconstruction was easily modified to accommodate multi-threaded reconstruction.
- Java’s built-in support for concurrent processing and tools to monitor results make coding and analysis straightforward.
- Current work is “proof-of-concept” study.
  - Process just begun, still learning, interested in collaborating with others.
- Motivated by curiosity, not by need.
  - Events are small enough, and Java code runs fast enough, that current serial reconstruction was more than adequate for ILC LOI exercise involving analysis of tens of millions of events.
  - Job submission environments (e.g. LSF or Grid) target individual processors, so do not (yet) benefit from multi-cores.
- Hope that experience and “lessons learned” from studies of threaded event reconstruction in Java will be applicable to C++ reconstruction if and when it is needed and supported.
- Thanks to Tony Johnson for stimulating discussions and help.
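
Backup: the Threading MBean operations listed on the JConsole slides can also be called programmatically through java.lang.management. The sketch below is illustrative only; the class and method names (ThreadingMBeanSketch, countMonitorDeadlockedThreads, describeCurrentThread) are hypothetical, and it inspects only the calling thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class ThreadingMBeanSketch {

    // Number of threads currently deadlocked on object monitor locks.
    static int countMonitorDeadlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findMonitorDeadlockedThreads();   // null when there are none
        return ids == null ? 0 : ids.length;
    }

    // Name, state and CPU time (in nanoseconds) of the calling thread.
    static String describeCurrentThread() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        ThreadInfo info = mx.getThreadInfo(id);
        // CPU time measurement is optional on a JVM, so check before asking
        long cpuNanos = mx.isThreadCpuTimeSupported() ? mx.getThreadCpuTime(id) : -1L;
        return info.getThreadName()
                + " state=" + info.getThreadState()
                + " cpuNanos=" + cpuNanos;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countMonitorDeadlockedThreads());
        System.out.println(describeCurrentThread());
    }
}
```

The same calls could be applied to each worker thread of a reconstruction job to compare per-thread CPU time, which is the kind of monitoring the Analysis slide asks for.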