IBM Dash
An Implicitly Parallel Mathematical Language
CASCON - Nov 2014
IBM: Bob Blainey, Ettore Tiotto, Taylor Lloyd, John Keenleyside, Fiona Fan
Univ. of Alberta: Jose Nelson Amaral
© 2014 IBM Corporation
Agenda
Overview
• Motivation and Design Objectives
• Example – Monte Carlo
Dash Language Highlights
• Building Blocks: defining and using procedures and functions
• Operations on vectors and matrices
• Higher-level primitives: generators, filters, maps
Case Study (Extreme Blue)
• Monte Carlo simulation of the Heston volatility model
• Heston model calibration using Differential Evolution
Financial Industry Development Cycle
Typical development process: different tools, and often different people, at each stage.
• 50% of financial institutions are looking to improve execution time
• 82% of buy-side firms wish to reduce recoding time
Source: MathWorks survey results, 2012
Parallel Programming Landscape and IBM Dash
Goal: expressivity with good performance.
[Diagram: programming approaches plotted on two axes — EXPRESSIVITY (harder to use ↔ easier to use) and PERFORMANCE (slower ↔ faster)]
• High-Level Languages: easier to use, but difficult to translate into efficient GPU code (slower)
• Low-Level APIs (e.g. CUDA, cuRAND): faster, but harder to use; multiple programming paradigms
• Specialized Libraries: hide some of the complexity at the cost of expressivity
Dash aims to combine ease of use with performance by leveraging specialized libraries under the covers.
IBM Dash
What does it look like?
Example: Monte Carlo written in Dash

// pure function: return type inferred
function norm(double x, double y) = x^2 + y^2;

procedure pi() returns double {
  const n = 1_000_000;                          // inferred variable type
  double random stream s = uniform(0.0, 1.0);   // random variable: uniform distr. (RT)
  // vector generator: the compiler can parallelize this computation
  const count = [i in 1..n | norm(sample(s), sample(s)) < 1.0 ? 1.0 : 0.0];
  return mean(count) * 4.0;                     // mean() is a reduction
}
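The pi example above can be sketched in plain Python — a minimal, sequential illustration of the same Monte Carlo estimate, not the Dash runtime:

```python
import random

def norm(x, y):
    # pure function: x^2 + y^2
    return x * x + y * y

def estimate_pi(n=1_000_000):
    # count samples falling inside the unit quarter-circle,
    # then scale the hit rate by 4 to estimate pi
    count = sum(1.0 for _ in range(n)
                if norm(random.random(), random.random()) < 1.0)
    return count / n * 4.0
```

Dash's generator/reduction form lets the compiler parallelize the sampling loop; the Python version makes the same arithmetic explicit but runs serially.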
Language Tour
Functions & Procedures
Procedure: semantically similar to a Fortran subroutine
• No return value, but mutable arguments are supported
• Can have side effects (i.e. modify global state)
• Allowed to call other procedures

// a procedure definition
procedure proc_name(real in, var real out) { statements }

Function: maps a tuple of arguments to a return value
• Depends exclusively on the values passed to it → a mathematical function
• No side effects
• Separate invocations with the same arguments yield identical results

// a function definition
function func_name(parameters) returns type { statements; return statement; }
Functions & Procedures
Functions can be defined using a “compact assignment form”:

function mpy (real x, integer(32) y) returns real = x * y;

A function can return multiple values (a tuple of values):

function max_and_min (real vector v) returns (real, real) {
  // find the max and min values in vector v
  return (max, min); // pack the return values into a tuple
}

real min = 0.0, max = 0.0;
real vect[10] = …;
max, min = max_and_min(vect); // “unpacks” the returned tuple
_, min = max_and_min(vect);   // drop a return value using “_”
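Dash's tuple returns and “_” discards behave much like Python's; a minimal sketch of the same pattern (names are illustrative):

```python
def max_and_min(v):
    # find the max and min values in vector v; pack them into a tuple
    return max(v), min(v)

vect = [3.0, 1.0, 4.0, 1.5]
hi, lo = max_and_min(vect)   # unpacks the returned tuple
_, lo = max_and_min(vect)    # drop a return value using "_"
```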
Control Structures – Loops
Use of loops is discouraged…
• …we have higher-level language primitives
Loops can occasionally be useful; Dash supports:
• Pre-predicated while and until loops
  – the loop condition is evaluated before the body executes
• Post-predicated while and until loops
  – the loop condition is evaluated after the body executes
• Counted loops
  – execution is controlled by a counter variable over a loop domain
  – example:

loop (n in 10..20, i in 2..n) // domain: n ∈ [10,19], i ∈ [2,n]
{
  if i == n
    call found_prime(n);
}
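A rough Python equivalent of the counted-loop domain above. The Dash snippet presumably relies on the inner loop exiting once a divisor of n is found; the sketch below makes that check explicit, and found_prime is a hypothetical stand-in for the user's procedure:

```python
primes = []

def found_prime(n):
    # hypothetical callback standing in for the Dash procedure
    primes.append(n)

for n in range(10, 20):          # domain: n in [10, 19]
    for i in range(2, n + 1):    # domain: i in [2, n]
        if n % i == 0:
            break                # i stops at the smallest divisor of n
    if i == n:                   # only for a prime is that divisor n itself
        found_prime(n)
```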
Vectors and Matrices (Arrays) – some examples
Elements are indexed using square brackets (one-based):

int vector v[N];   // a vector of N elements
int matrix m[N,N]; // an NxN square matrix
m[2,1] = v[1];

Rich array-slicing operations (encouraged… they reduce the need for loops):

m[1,*] = v;               // copy vector v into the first row of matrix m
m[2,3..7 by 2] = [1,2,3]; // modify a portion of the second row

Element-wise operations:

vector v = c;        // splat a scalar to initialize a vector
even = v[2..* by 2]; // gather the even elements of v
w = even * odd;      // apply a binary operation to all elements
.. = w[v];           // sparse gather operation (e.g. index vector)
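These slicing forms map closely onto NumPy, modulo NumPy's zero-based, end-exclusive indexing (all names below are illustrative):

```python
import numpy as np

N = 8
v = np.arange(1, N + 1)           # a vector of N elements: [1..8]
m = np.zeros((N, N), dtype=int)   # an NxN square matrix

m[0, :] = v                       # Dash: m[1,*] = v  (copy v into the first row)
m[1, 2:7:2] = [1, 2, 3]           # Dash: m[2,3..7 by 2] = [1,2,3]

even = v[1::2]                    # Dash: v[2..* by 2]  (2nd, 4th, ... elements)
odd = v[0::2]
w = even * odd                    # element-wise binary operation
g = w[np.array([0, 2])]           # gather via an index vector
```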
High-Level Primitives
Map: useful when we want to apply a function to all elements of an array
• In Dash this is done by passing the array to a function that takes a scalar argument

function mpy (int x, int y) = x * y; // scalar function

// map: [mpy(v[1],c), mpy(v[2],c), mpy(v[3],c)]
function mpy_vector(int vector v, int c)
  = mpy(v, c); // elements of v multiplied concurrently

Generator: allows creation of an array from a specification

// yields an NxM matrix
const m = [i in 1..N, j in 1..M | mpy(i,j)];

Filter: useful to select the elements that satisfy a predicate

// select all elements of v greater than a certain threshold
w,_ = filter (i in 1..length(v) | v[i] > thresh);
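The three primitives have direct counterparts in Python comprehensions — a sequential sketch, whereas Dash evaluates these implicitly in parallel:

```python
def mpy(x, y):
    return x * y

v, c = [1, 2, 3], 10

# map: apply the scalar function to every element of v
mapped = [mpy(x, c) for x in v]

# generator: build an N x M matrix from a specification
N, M = 2, 3
m = [[mpy(i, j) for j in range(1, M + 1)] for i in range(1, N + 1)]

# filter: one-based indices of the elements of v greater than a threshold
thresh = 1
w = [i for i in range(1, len(v) + 1) if v[i - 1] > thresh]
```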
Compiler Infrastructure
The IBM Dash Compiler Architecture
[Diagram: the front of the compilation pipeline]
Dash source files (.ds)
  → Dash Front End: Lexer, Parser, Semantic Analyzer, Dash IR generator
  → Dash IR
  → Dash High-Level Optimizer (Dash IR transformer): Expression Simplifier, Peephole Optimizations, Inliner, Generator Fusion
The IBM Dash Compiler Architecture
[Diagram: the pipeline extended with the C backend]
Dash source files (.ds)
  → Dash Front End: Lexer, Parser, Semantic Analyzer, IR generator
  → Dash IR
  → Dash High-Level Optimizer (Dash IR transformer): Expression Simplifier, Peephole Optimizations, Inliner, Generator Fuser
  → Dash IR
  → Dash C Generator (Dash IR → C)
  → .c
  → C Compiler (GCC, XLC, ICC, …)
The IBM Dash Compiler Architecture
LLVM supports many processors: PowerPC, x86, ARM, GPU (via PTX), …
[Diagram: the pipeline extended with the LLVM backend]
Dash source files (.ds)
  → Dash Front End: Lexer, Parser, Semantic Analyzer, Dash IR generator
  → Dash IR
  → Dash High-Level Optimizer (Dash IR transformer): Expression Simplifier, Peephole Optimizations, Inliner, Generator Fuser, …
  → Dash IR
  → either the Dash LLVM Generator (Dash IR → LLVM IR) → .bc → LLVM Optimizer and Code Generator
  → or the Dash C Generator (Dash IR → C) → .c → C Compiler (GCC, XLC, ICC, …)
The IBM Dash Compiler Architecture
LLVM supports many processors: PowerPC, x86, ARM, GPU (via PTX), …
[Diagram: the complete pipeline, through linking]
Dash source files (.ds)
  → Dash Front End: Lexer, Parser, Semantic Analyzer, Dash IR generator
  → Dash IR
  → Dash High-Level Optimizer (Dash IR transformer): Expression Simplifier, Peephole Optimizations, Inliner, Generator Fuser, …
  → Dash IR
  → either the Dash LLVM Generator (Dash IR → LLVM IR) → .bc → LLVM Optimizer and Code Generator → .o
  → or the Dash C Generator (Dash IR → C) → .c → C Compiler (GCC, XLC, ICC, …) → .o
  → System Linker, together with the Dash Runtime Library (.so)
  → Executable
How IBM Dash Works
1. A mathematician creates the financial model. Mathematical syntax is embedded in a statically-typed, C-like language (minus pointers): vectors, matrices, pure functions, sequences, random variables, generators, filters, …
2. Dash maps the model to the most efficient parallel hardware, through static compilation, optimization, and automatic parallelization.
3. The model runs with maximum performance.

Code fragments shown on the slide:

const x = random();
const y = random();
int a = b + c;
int n = img->rows;

trial (
  Experiment (vec, n),
  num_sims
);

fft (
  vec, p1, p2, p3
);
Asian Option Pricing using the Heston Model
Equations

  dS_t = μ S_t dt + √V_t S_t dW_t¹
  dV_t = κ (θ − V_t) dt + ξ √V_t dW_t²
  dW_t¹ dW_t² = ρ dt

Description
• The Heston model characterizes the stock price using two correlated Brownian processes (W¹ and W²)
• European options can be priced using an analytical closed-form solution
• Asian options have no analytical solution; they are priced using a Monte Carlo simulation
Heston Model: Dash Code Highlights
We implemented a Monte Carlo simulation of Asian option pricing in Dash. Parallel computation is expressed using a high-level language construct (trial):

const payoffs = trial (monte_carlo(heston_model, opt, time_steps), num_sims);

function monte_carlo(HestonModel model, Option opt, int time_steps)
  returns double
{
  double random stream dist = normal(0.0, 1.0);
  double vector spot_draws = sample(dist, time_steps);
  double vector vol_draws  = sample(dist, time_steps);
  // correlate the two vectors of normal random draws using factor rho
  vol_draws = [i in 1..length(vol_draws) | model.rho * vol_draws[i]
               + spot_draws[i] * (1 - model.rho*model.rho) ^ 0.5];
  const vol_path  = compute_vol_path(model, opt, vol_draws);
  const spot_path = compute_spot_path(model, opt, spot_draws, vol_path);
  return call_option_payoff(opt, spot_path[length(spot_path)]);
}
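The correlation step inside monte_carlo can be sketched in NumPy, mirroring the slide's formula (rho is the model's spot/volatility correlation parameter; names are illustrative):

```python
import numpy as np

def correlate_draws(spot_draws, vol_draws, rho):
    # combine two independent standard-normal draw vectors into a
    # unit-variance mix, weighted by rho as in the Dash generator
    return rho * vol_draws + spot_draws * (1.0 - rho * rho) ** 0.5

rng = np.random.default_rng(0)
spot = rng.standard_normal(10_000)
vol = correlate_draws(spot, rng.standard_normal(10_000), rho=-0.7)
```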
Heston Model: GPU Exploitation
The Dash compiler offloads the Monte Carlo computation to an NVIDIA GPU by generating C+CUDA code for the trial operation.
[Diagram: the Dash program from the previous slide (the trial call and the monte_carlo function) flowing through the compiler]
• Dash Compiler: Dash Front End → Dash IR → Dash High-Level Optimizer → Code Generator → .cu → NVCC → .o → System Linker, with the Dash Runtime (.so)
• Generated code (C + CUDA): main program → kernel wrapper → GPU kernel → device functions (volatility path, spot path)
Heston Model: Code Complexity
[Bar chart: lines of code per implementation — 450+, 256, 82, 81 — a 4.8X reduction in complexity]
• We implemented the Monte Carlo simulation (Heston model) in Python (NumPy), C++, C++/CUDA, and Dash
• Complexity: Dash is as expressive as Python and ~5X more expressive than C++/CUDA
• It took about a week to write, test, and debug the C++/CUDA version
Heston Model: Performance
[Bar chart: speedups on GPU and CPU (single core) — 16X and 13X]
• Dash (GPU backend prototype) is 16X faster than optimized sequential C++ and 600X faster than Python
• Performance of the GPU backend prototype is expected to increase as we develop the Dash optimizer further
Higher performance results in improved model accuracy.
Heston Model Calibration
• The process of “fitting” the model to observed market data
• To calibrate the model to market data, we minimize an objective function over 5 parameters
• We do this using Differential Evolution, a genetic algorithm by Storn & Price
  http://link.springer.com/article/10.1023%2FA%3A1008202821328
• We implemented the algorithm in C++ and in Dash
  – found several opportunities to use generators and simplify the code
  – generators expose application parallelism → the compiler generates parallel code
• Compared C++ vs. Dash performance on an Intel multicore system
Heston Model Calibration

procedure desolver(int maxpop, int MAX_ITR, double CP, double BETA,
                   double vector data, double lambda, double vector lb,
                   double vector ub) {
  const dim = length(lb);
  double vector fval[maxpop];
  double matrix px[maxpop,dim], px_new[maxpop,dim];
  int random stream us = uniform(1, maxpop);
  double random stream s = uniform(0.0, 1.0);

  px   = [i in 1 .. maxpop | initialpx(s, lb, ub, i)];
  fval = [i in 1 .. maxpop | evaluate(data, lambda, px[i,*])];
  loop while (itr < MAX_ITR) {
    var cand_px_new = [i in 1 .. maxpop | gen_vec(us,s,CP,BETA,px,lb,ub,i)];
    double vector fit_values = [i in 1 .. maxpop |
                                evaluate(data, lambda, cand_px_new[i,*])];
    // selection
    double matrix px_new = [i in 1 .. maxpop | (fit_values[i] < fval[i])
                                               ? cand_px_new[i,*]
                                               : px[i,*]];
    // update the objective values after selection
    fval = [i in 1 .. maxpop | min(fval[i], fit_values[i])];
    px = px_new; itr += 1;
  }
}

Work in progress: generator fusion to expose coarse-grain parallelism.
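A minimal Python sketch of the Differential Evolution loop above. The Dash helpers initialpx, gen_vec, and evaluate are replaced here by a generic objective function and the classic rand/1/bin mutation of Storn & Price; all names and defaults are illustrative, and bounds are only enforced at initialization:

```python
import random

def desolver(objective, lb, ub, maxpop=20, max_itr=200, cp=0.9, beta=0.8, seed=0):
    rng = random.Random(seed)
    dim = len(lb)
    # initial population: uniform within the bounds
    px = [[rng.uniform(lb[d], ub[d]) for d in range(dim)] for _ in range(maxpop)]
    fval = [objective(x) for x in px]
    for _ in range(max_itr):
        px_new, fval_new = [], []
        for i in range(maxpop):
            # pick three distinct members other than i
            a, b, c = rng.sample([j for j in range(maxpop) if j != i], 3)
            # rand/1/bin: scaled difference mutation plus crossover
            cand = [px[a][d] + beta * (px[b][d] - px[c][d])
                    if rng.random() < cp else px[i][d]
                    for d in range(dim)]
            f = objective(cand)
            # selection: keep the better of candidate and parent
            if f < fval[i]:
                px_new.append(cand); fval_new.append(f)
            else:
                px_new.append(px[i]); fval_new.append(fval[i])
        px, fval = px_new, fval_new
    best = min(range(maxpop), key=lambda i: fval[i])
    return px[best], fval[best]

# usage: minimize a 2-D sphere function over [-5, 5]^2
best_x, best_f = desolver(lambda x: sum(t * t for t in x),
                          lb=[-5.0, -5.0], ub=[5.0, 5.0])
```

In the Dash version each generator (candidate construction, evaluation, selection) is a parallel array expression; the Python sketch serializes the same steps in explicit loops.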
Heston Model: Calibration Performance
[Bar chart: 16X and 13X speedup bars]
• Dash (CPU backend) is 13X faster than optimized sequential C++
• Parallelization is achieved by exploiting OpenMP
• Generator fusion may improve performance further by eliminating array copies between adjacent generators
Higher performance results in improved model accuracy.
Vision for IBM Dash
GOAL: solve two key problems for the mathematical programming domain
• agility of development (fast iterations)
• performance and scalability for big data
Advance language design
• encompass more mathematical abstractions; develop implicitly parallel primitives such as scan, filter, etc.
Advance compiler design
• support many (hybrid) parallel targets: CPU+GPU initially, then perhaps FPGA, then perhaps clusters, etc.
SUMMARY: create a tool which simultaneously provides a productive language for mathematical modeling and insulates programmers from the complexity and evolution of hybrid systems.