Multi-Threaded Event Reconstruction in Java
Norman Graf
CHEP 2010, Taipei
October 21, 2010
Why Multi-threading?


Moore’s law still holds, but clock-speed of CPUs fell off the curve several years ago.
    Even getting slower…
Can no longer get improved performance “for free” from faster CPUs.
Current trend is towards multi- or many-core architectures sharing memory.
Don’t believe there are “silver bullets”
    no compiler switch to optimize for many-core
    no libraries to link against to give concurrency
Requires a paradigm shift in coding.
2
Multi-threading advantages

“Many hands make light work.”
John Heywood

“The real performance payoff of dividing a
program's workload into tasks comes when
there are a large number of independent,
homogeneous tasks that can be processed
concurrently.”
Java Concurrency in Practice
3
Multi-threading gotchas

“Too many cooks spoil the broth.”
Anonymous

“Amdahl’s Law”
Gene Amdahl, 1967
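Amdahl's Law in its usual form: if a fraction p of the run time can be spread over N threads and the remainder stays serial, the best possible speedup is 1 / ((1 - p) + p / N). A minimal sketch of that formula follows; the class and method names are illustrative and do not come from the talk.

```java
/** Illustrative helper (not from the talk): evaluates Amdahl's Law. */
final class AmdahlSpeedup {

    /**
     * @param parallelFraction fraction of the serial run time that can be parallelised, in [0, 1]
     * @param nThreads         number of threads sharing the parallel part, at least 1
     * @return the ideal speedup 1 / ((1 - p) + p / N)
     */
    static double speedup(double parallelFraction, int nThreads) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || nThreads < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and N >= 1");
        }
        double serialFraction = 1.0 - parallelFraction;
        return 1.0 / (serialFraction + parallelFraction / nThreads);
    }

    public static void main(String[] args) {
        // Tabulate the ideal speedup for a job that is 95% parallelisable.
        for (int n : new int[] {1, 2, 4, 8, 12, 24}) {
            System.out.printf("p=0.95, N=%2d -> speedup %.2f%n", n, speedup(0.95, n));
        }
    }
}
```

With p = 0.95 the speedup can never exceed 1 / 0.05 = 20, however many threads are added.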
4
HEP Reconstruction Parallelism

Currently, HEP employs run- and event-based parallelism.
Conditions usually ~static per run.
Set up geometry once and access throughout job.
Program then processes each event serially and independently.
Memory footprint of program, conditions, and increasingly so for events, is becoming significant.
Investigate whether parallelism within an event reconstruction holds promise for the future.
    One event in memory, multiple CPUs processing
5
HEP Event Reconstruction


Modern collider detectors are complex, but modular, and a number of tasks can be easily identified as independent. For example:
Digitization, clustering, centroid calculation of hits in silicon trackers easily factorizable.
    By subsystem (barrel vs endcap), by layer, by wafer.
Clustering of calorimeter cells
    By subsystem (barrel/endcap, EM/Had), modules
Jet flavor-tagging
    vertexing done only on list of associated tracks
    lepton association only done on jet constituents.
6
[Figure: Silicon Detector Inner Tracker, illustrating three threading granularities: thread by subdetector (barrel, endcap), thread by layer (layer 1, layer 2), and thread by wafer.]
7
org.lcsim






A fast and flexible reconstruction and analysis
framework developed for ILC physics and
detector response simulations.
Written in Java.
Plug & Play reconstruction Drivers.
Runtime configurable with xml file.
Supports a number of different subdetector
types: TPC, Si pixel and -strip; sampling and
total absorption crystal calorimetry; …
Perfect development environment to test ideas
of multi-threaded event reconstruction.
8
Threads in Java

Java has included support for concurrency since its beginning, and that support has improved over time.
    Thread, Runnable, Callable
    Makes it easy to develop, implement and study.
Use existing Java reconstruction package lcsim.org to study feasibility of multi-threaded approach to event reconstruction.
ISO C++ standard doesn’t mention threads.
    Usual solutions involve non-portable, platform-specific concurrency features and libraries.
    Boost a possible solution; C++0x draft offers threads.
Idea is to study the concept in an environment currently supportive of this approach (Java), and apply if, and when, needed and supported in C++.
9
Thread Class
Most basic method is to extend base class Thread.
run() method
    accepts no arguments
    returns no values
    cannot throw checked exceptions

class ThreadTask extends Thread
{
    public void run() { … }
}
…
Thread t = new ThreadTask();
t.start();
t.join();    // blocks until task completes
…
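Because run() takes no arguments and returns nothing, inputs usually go in through the constructor and the result comes back through a field that is read after join(). A minimal sketch of that pattern follows; SumTask is a hypothetical stand-in for real reconstruction work.

```java
/** Hypothetical task (not from the talk): sums the integers 1..n on its own thread. */
final class SumTask extends Thread {
    private final int n;   // input, handed over through the constructor
    private long result;   // output, read by the caller after join()

    SumTask(int n) {
        this.n = n;
    }

    @Override
    public void run() {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i;
        }
        result = sum;
    }

    long getResult() {
        return result;
    }

    /** Starts the task, waits for it to finish, and returns its result. */
    static long runAndWait(int n) {
        SumTask t = new SumTask(n);
        t.start();
        try {
            t.join();   // blocks until run() has returned
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the task", e);
        }
        return t.getResult();
    }

    public static void main(String[] args) {
        System.out.println(runAndWait(100));   // prints 5050
    }
}
```

Reading the field only after join() matters: join() guarantees that everything the task wrote is visible to the waiting thread.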
10
Runnable Interface
Runnable interface allows user class to be active while not subclassing Thread.
run() method
    accepts no arguments
    returns no values
    can’t throw checked exceptions

public interface Runnable
{
    public void run();
}

class RunnableTask
    extends UsefulBaseClass
    implements Runnable
{
    public void run() { … }
}
…
Runnable runnable = new RunnableTask();
Thread t = new Thread(runnable);    // passed to Thread as arg.
t.start();
t.join();
…
To get a value back from the now-completed task, you must use a method outside the interface and wait for some kind of notification message that the task completed.
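One possible form of "a method outside the interface" plus a completion notification is an extra getter guarded by a java.util.concurrent.CountDownLatch. This is a sketch under that assumption; CountingTask and its methods are hypothetical names, not code from the talk.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical Runnable that exposes its result through an extra getter. */
final class CountingTask implements Runnable {
    private final int[] values;
    private final CountDownLatch done = new CountDownLatch(1);
    private int positives;

    CountingTask(int[] values) {
        this.values = values;
    }

    @Override
    public void run() {
        int count = 0;
        for (int v : values) {
            if (v > 0) {
                count++;
            }
        }
        positives = count;
        done.countDown();   // the "notification" that the task has completed
    }

    /** Not part of Runnable: waits for the notification, then returns the result. */
    int awaitPositives() {
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the task", e);
        }
        return positives;
    }

    /** Runs the task on a new thread and returns the number of positive entries. */
    static int countPositives(int[] values) {
        CountingTask task = new CountingTask(values);
        new Thread(task).start();   // the Runnable is passed to Thread as an argument
        return task.awaitPositives();
    }

    public static void main(String[] args) {
        System.out.println(countPositives(new int[] {3, -1, 0, 7}));   // prints 2
    }
}
```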

11
Callable Interface
Callable interface allows user class to inherit from other classes.
Best suited for result-bearing tasks
call() method
    returns typed value
    can throw checked exceptions

public interface Callable<V>
{
    V call() throws Exception;
}

class CallableTask
    extends UsefulBaseClass
    implements Callable<Object>
{
    public Object call() { … }
}

Cannot pass a Callable into a Thread to execute.
Requires use of ExecutorService to execute the Callable object.
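A minimal sketch of a result-bearing task handed to an ExecutorService follows. LengthTask is a hypothetical example class; the executor calls (Executors.newSingleThreadExecutor, submit, Future.get, shutdown) are the standard java.util.concurrent API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical result-bearing task: returns the length of a string. */
final class LengthTask implements Callable<Integer> {
    private final String text;

    LengthTask(String text) {
        this.text = text;
    }

    @Override
    public Integer call() {
        return text.length();   // the typed return value travels back through the Future
    }

    /** Submits one LengthTask to an executor and waits for its result. */
    static int lengthOf(String text) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Integer> future = executor.submit(new LengthTask(text));
            return future.get();   // blocks until call() has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the task", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdown();   // let the worker thread exit
        }
    }

    public static void main(String[] args) {
        System.out.println(lengthOf("calorimeter"));   // prints 11
    }
}
```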
12
ExecutorService


Part of the java.util.concurrent package.
Asynchronous task handler.
    Creates, manages, runs thread pools.
The Executors class provides factory methods, including:
newSingleThreadExecutor()
    single thread, unbounded queue for tasks
newFixedThreadPool(int nThreads)
    specified maximum thread pool size, unbounded task queue
    if thread dies, a new one will be created to replace it.
newCachedThreadPool()
    open-ended # of threads, grows and shrinks on demand.
    caches threads for short period of time for re-use
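The bounded size of a fixed pool can be observed directly by recording which worker threads actually ran the tasks. The sketch below is illustrative only: PoolDemo and its method are hypothetical names, and it uses Java 8 lambdas, which postdate the talk.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Illustrative demo (hypothetical names): a fixed pool reuses a bounded set of threads. */
final class PoolDemo {

    /** Runs nTasks trivial tasks on a pool of nThreads; returns how many distinct threads did the work. */
    static int distinctWorkerThreads(int nThreads, int nTasks) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        Set<String> workerNames = ConcurrentHashMap.newKeySet();   // thread-safe set
        for (int i = 0; i < nTasks; i++) {
            pool.execute(() -> workerNames.add(Thread.currentThread().getName()));
        }
        pool.shutdown();   // accept no new tasks; tasks already queued still run
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pool", e);
        }
        return workerNames.size();
    }

    public static void main(String[] args) {
        // 50 tasks queued on 3 threads: never more than 3 distinct workers.
        System.out.println(distinctWorkerThreads(3, 50));
    }
}
```

The tasks beyond the pool size wait in the unbounded queue until a worker becomes free, which is why the count never exceeds nThreads.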
13
ExecutorService and Callable

The service accepts Callable objects to run

by way of the submit() method, which returns a Future object representing that task:
<T> Future<T> submit(Callable<T> task)

or by way of the invokeAll() method, which returns one Future per task:
<T> List<Future<T>> invokeAll(Collection<? extends Callable<T>> tasks)
    throws InterruptedException

Future’s get() method blocks until the task completes, then returns its result; a failure inside the task surfaces as an ExecutionException.
public interface Future<V>
{
    …
    V get() throws InterruptedException, ExecutionException;
}
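A short sketch of both points: invokeAll() hands back one Future per task in the order of the task list, and an exception thrown inside call() reaches the caller wrapped in an ExecutionException. FutureDemo, its methods, and the error message are hypothetical; the sketch uses Java 8 lambdas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative demo (hypothetical names) of invokeAll() and Future.get(). */
final class FutureDemo {

    /** Squares each input on a small pool; results come back in task order. */
    static List<Integer> squares(List<Integer> inputs) {
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (final Integer x : inputs) {
            tasks.add(() -> x * x);
        }
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pool.invokeAll(tasks)) {   // returns once all tasks are done
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Returns the message of an exception thrown inside call(), as seen through get(). */
    static String failureMessage() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> failing = () -> {
                throw new Exception("bad hit collection");
            };
            pool.submit(failing).get();
            return null;   // not reached in practice: get() throws
        } catch (ExecutionException e) {
            return e.getCause().getMessage();   // the original exception is the cause
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the task", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> inputs = new ArrayList<>();
        inputs.add(1);
        inputs.add(2);
        inputs.add(3);
        System.out.println(squares(inputs));    // prints [1, 4, 9]
        System.out.println(failureMessage());   // prints bad hit collection
    }
}
```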
14
Thread-safe Collections




With multiple threads running, need to worry
about concurrent access both for reading from
and writing to the Event.
Original Java collections (e.g. Vector) were
thread-safe, but slow.
The commonly used collection classes in
java.util are not currently synchronized.
Synchronization wrappers add automatic
synchronization (thread-safety) to an arbitrary
Java collection.
List<Type> list = Collections.synchronizedList(new ArrayList<Type>());
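A minimal sketch of the wrapper under concurrent writes follows: several threads append to one synchronized list and no element is lost. SyncListDemo is a hypothetical name, and the sketch uses a Java 8 lambda.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative demo (hypothetical names): concurrent add() on a synchronized list. */
final class SyncListDemo {

    /** nThreads threads each append perThread integers; returns the final list size. */
    static int fill(final int nThreads, final int perThread) {
        final List<Integer> list = Collections.synchronizedList(new ArrayList<Integer>());
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    list.add(i);   // each add() holds the wrapper's lock
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return list.size();
    }

    public static void main(String[] args) {
        System.out.println(fill(4, 1000));   // prints 4000
    }
}
```

The wrapper only makes individual calls atomic. Iterating over the list while other threads may modify it still requires an explicit synchronized (list) { … } block around the loop.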
15
Calorimeter Clustering Example I
// the map containing the calorimeter hits keyed on subdetector name
Map<String, List<CalorimeterHit>> hitmap = new HashMap<String, List<CalorimeterHit>>();
…
// a container to hold resulting clusters…
List<Cluster> clusterList = new ArrayList<Cluster>();
// a Clusterer to cluster hits
Clusterer c = new Clusterer();
for (String s : keys) {
    List<Cluster> clusters = c.cluster(s, hitmap.get(s));
    clusterList.addAll(clusters);
}
…
16
Calorimeter Clustering Example II
// how many processors are available?
int nProcessors = Runtime.getRuntime().availableProcessors();
// create a fixed number of threads for processing
ExecutorService threadExecutor = Executors.newFixedThreadPool(nProcessors);
// the map containing the calorimeter hits keyed on subdetector name
Map<String, List<CalorimeterHit>> hitmap = new HashMap<String, List<CalorimeterHit>>();
// a container to hold resulting clusters…
List<Cluster> clusterList = new ArrayList<Cluster>();
// a collection to hold the clustering tasks
Collection<Callable<List<Cluster>>> tasks = new LinkedList<Callable<List<Cluster>>>();
// create one task per subdetector and add to task list
for (String s : keys) {
    tasks.add(new CallableClusterer(s, hitmap.get(s)));
}
// process all tasks
List<Future<List<Cluster>>> futures = threadExecutor.invokeAll(tasks);
// analyze output
for (Future<List<Cluster>> f : futures) {
    List<Cluster> clusters = f.get();
    clusterList.addAll(clusters);
}
…
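A self-contained toy version of the same pattern can be run without org.lcsim. Everything specific here is a stand-in and an assumption: a "hit" is just an energy (Double), a "cluster" is the summed energy of one subdetector, and ToyClusterer replaces the CallableClusterer of the slide, whose implementation is not shown in the talk.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy stand-in for CallableClusterer: one "cluster" (summed energy) per subdetector. */
final class ToyClusterer implements Callable<List<Double>> {
    private final List<Double> hits;

    ToyClusterer(List<Double> hits) {
        this.hits = hits;
    }

    @Override
    public List<Double> call() {
        double sum = 0.0;
        for (double energy : hits) {
            sum += energy;
        }
        List<Double> clusters = new ArrayList<>();
        clusters.add(sum);   // toy "clustering": a single cluster holding the total energy
        return clusters;
    }

    /** One task per subdetector, run with invokeAll(); results are merged on the calling thread. */
    static List<Double> clusterAll(Map<String, List<Double>> hitmap) {
        int nProcessors = Runtime.getRuntime().availableProcessors();
        ExecutorService threadExecutor = Executors.newFixedThreadPool(nProcessors);
        try {
            Collection<Callable<List<Double>>> tasks = new ArrayList<>();
            for (List<Double> hits : hitmap.values()) {
                tasks.add(new ToyClusterer(hits));
            }
            List<Double> clusterList = new ArrayList<>();
            for (Future<List<Double>> f : threadExecutor.invokeAll(tasks)) {
                clusterList.addAll(f.get());
            }
            return clusterList;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while clustering", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("clustering task failed", e.getCause());
        } finally {
            threadExecutor.shutdown();   // otherwise the pool threads keep the JVM alive
        }
    }

    public static void main(String[] args) {
        Map<String, List<Double>> hitmap = new LinkedHashMap<>();   // hypothetical subdetector names
        hitmap.put("EMBarrel", java.util.Arrays.asList(1.0, 2.0));
        hitmap.put("HadBarrel", java.util.Arrays.asList(0.5));
        System.out.println(clusterAll(hitmap));   // prints [3.0, 0.5]
    }
}
```

Because invokeAll() returns the futures in task order, the merged list is deterministic even though the tasks themselves run concurrently.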
17
Calorimeter Clustering Example III
// how many processors are available?
int nProcessors = Runtime.getRuntime().availableProcessors();
// create a fixed number of threads for processing
ExecutorService threadExecutor = Executors.newFixedThreadPool(nProcessors);
// the map containing the calorimeter hits keyed on subdetector name
Map<String, List<CalorimeterHit>> hitmap = new HashMap<String, List<CalorimeterHit>>();
// a thread-safe container to hold resulting clusters…
List<Cluster> clusterList = Collections.synchronizedList(new ArrayList<Cluster>());
// a collection to hold the clustering tasks
Collection<Callable<List<Cluster>>> tasks = new LinkedList<Callable<List<Cluster>>>();
// create one task per subdetector
for (String s : keys) {
    tasks.add(new CallableClusterer(s, hitmap.get(s), clusterList));
}
// process all tasks
List<Future<List<Cluster>>> futures = threadExecutor.invokeAll(tasks);
…
Makes more efficient use of threads by adding clusters directly to the clusterList
instead of adding them all after all the threads have finished. Can still check status
of Future objects to make sure all tasks have finished successfully.
18
Testing on multi-core systems.

My home PC:
    Intel Core i7 with hyperthreading, giving 8 logical cores
    12GB of RAM
'ki-eval01':
    Dual Intel 'Westmere' 6-core CPUs.
    Intel hyperthreading is enabled, which doubles the number of logical cores from 12 to a total of 24.
    48GB of RAM available
'ki-eval05':
    Dual AMD 12-core CPUs.
    No hyperthreading, so a total of 24 cores.
    64GB of RAM.
Thanks to Stuart Marshall and Yemi Adesanya at KIPAC for granting access.
19
CPU Intensive Example (e.g. Digitization)
[Plot: speedup factor versus number of threads, with one curve for the dual AMD 12-core CPUs and one for the dual Intel 'Westmere' 6-core CPUs + HT.]
20
Analysis Process


Very large phase space for optimization.
Balance granularity of threaded tasks with overhead and data structures
    i.e. is wafer-level threading realistic for Si tracker?
Amdahl’s Law limits maximum gain
    Not all tasks lend themselves to concurrent processing
    e.g. track-finding spans detector elements
But not an “all-or-nothing” game
Enabling Intel HT could lead to immediate gains
Need tools to monitor threads, CPU and memory.
21
JConsole


“JConsole uses the extensive instrumentation of
the Java Virtual Machine (Java VM) to provide
information about the performance and resource
consumption of applications running on the Java
platform.”
Local or remote connection
images from java.sun.com
22
JConsole Overview
23
JConsole Memory
24
JConsole Threads
25
JConsole MBean Operations
26
Threading MBean

findMonitorDeadlockedThreads.
    Detects if any threads are deadlocked on the object monitor locks. This operation returns an array of deadlocked thread IDs.
getThreadInfo.
    Returns the thread information. This includes the name, stack trace, and the monitor lock that the thread is currently blocked on, if any, and which thread is holding that lock, and thread contention statistics.
getThreadCpuTime.
    Returns the CPU time consumed by a given thread
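The same three operations are also available programmatically through java.lang.management.ThreadMXBean, without going through JConsole. A minimal sketch follows; the wrapper class ThreadingMBeanDemo and its method names are hypothetical, while the MXBean calls are the standard platform API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Illustrative wrapper (hypothetical names) around the platform ThreadMXBean. */
final class ThreadingMBeanDemo {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Number of threads deadlocked on object monitors (0 if none). */
    static int deadlockedCount() {
        long[] ids = THREADS.findMonitorDeadlockedThreads();   // null when nothing is deadlocked
        return ids == null ? 0 : ids.length;
    }

    /** Name of the calling thread, as reported by getThreadInfo(). */
    static String currentThreadName() {
        ThreadInfo info = THREADS.getThreadInfo(Thread.currentThread().getId());
        return info.getThreadName();
    }

    /** CPU time of the calling thread in nanoseconds, or -1 if measurement is unavailable. */
    static long currentThreadCpuNanos() {
        if (!THREADS.isThreadCpuTimeSupported() || !THREADS.isThreadCpuTimeEnabled()) {
            return -1L;
        }
        return THREADS.getThreadCpuTime(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + deadlockedCount());
        System.out.println("current thread:     " + currentThreadName());
        System.out.println("CPU time (ns):      " + currentThreadCpuNanos());
    }
}
```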
27
Obtaining Detailed Thread Information
28
JConsole Extensibility


Extremely functional tool as-is.
Extend functionality by:


implementing custom MBeans.
using JConsole plug-in API.
29
Summary and Outlook




HEP event reconstruction is inherently modular and lends itself
well to a multi-threaded approach.
lcsim.org’s modular approach to “generic” reconstruction was
easily modified to accommodate multi-threaded reconstruction.
Java’s built-in support for concurrent processing and tools to
monitor results make coding and analysis straightforward.
Current work is “proof-of-concept” study.
    Process just begun, still learning, interested in collaborating with others.
Motivated by curiosity, not by need.
    Events are small enough, and Java code runs fast enough, that current serial reconstruction was more than adequate for ILC LOI exercise involving analysis of tens of millions of events.
    Job submission environments (e.g. LSF or Grid) target individual processors, so do not (yet) benefit from multi-cores.
Hope that experience and “lessons learned” from studies of
threaded event reconstruction in Java will be applicable to C++
reconstruction if and when it is needed and supported.
Thanks to Tony Johnson for stimulating discussions and help.
30