IBM Dash: An Implicitly Parallel Mathematical Language
CASCON, November 2014
IBM: Bob Blainey, Ettore Tiotto, Taylor Lloyd, John Keenleyside, Fiona Fan
University of Alberta: Jose Nelson Amaral

Agenda
• Overview
  - Motivation and design objectives
  - Example: Monte Carlo
• Dash language highlights
  - Building blocks: defining and using procedures and functions
  - Operations on vectors and matrices
  - Higher-level primitives: generators, filters, maps
• Case study (Extreme Blue)
  - Monte Carlo simulation of the Heston volatility model
  - Heston model calibration using Differential Evolution

Financial Industry Development Cycle
• The typical development process involves different tools and often different people.
• 50% of financial institutions are looking to improve execution time.
• 82% of buy-side firms wish to reduce recoding time.
• Source: MathWorks survey results, 2012.

Parallel Programming Landscape and IBM Dash
Goal: expressivity with good performance.
• High-level languages: easier to use, but slower and difficult to translate into efficient GPU code.
• Low-level APIs (e.g. CUDA): faster, but harder to use and involve multiple programming paradigms.
• Specialized libraries (e.g. cuRAND): hide some of the complexity, at the cost of expressivity.
• IBM Dash aims to combine the expressivity of high-level languages with the performance of low-level APIs, leveraging specialized libraries where possible.

IBM Dash: What does it look like?

Example: Monte Carlo written in Dash

    // pure function; return type inferred
    function norm(double x, double y) = x^2 + y^2;

    procedure pi() returns double {
        const n = 1_000_000;                        // inferred variable type
        double random stream s = uniform(0.0, 1.0); // uniform distribution (random variable)
        // vector generator: the compiler can parallelize this computation
        const count = [i in 1..n | norm(sample(s), sample(s)) < 1.0 ? 1.0 : 0.0];
        return mean(count) * 4.0;                   // reduction
    }
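For comparison, a rough sequential C++ equivalent of the pi() procedure above might look like the sketch below. It is illustrative only (the function name, seeding, and structure are assumptions, not part of the Dash material); in Dash the generator expresses the same computation declaratively, which is what lets the compiler parallelize it.

    #include <cstdio>
    #include <random>

    // Sequential C++ sketch of the Dash pi() procedure.
    // In Dash the generator [i in 1..n | ...] has no loop-carried dependence,
    // so the compiler is free to run iterations in parallel; here the
    // parallelism would have to be recovered from an explicit loop.
    double estimate_pi() {
        const long n = 1'000'000;
        std::mt19937_64 gen(42);
        std::uniform_real_distribution<double> uniform(0.0, 1.0);

        long count = 0;
        for (long i = 0; i < n; ++i) {
            double x = uniform(gen), y = uniform(gen);
            if (x * x + y * y < 1.0) ++count;        // point falls inside the unit circle
        }
        return 4.0 * static_cast<double>(count) / n; // mean(count) * 4.0
    }

    int main() { std::printf("pi ~= %f\n", estimate_pi()); }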
Language Tour

Functions & Procedures
Procedure: semantically similar to a Fortran subroutine.
• No return value, but mutable arguments are supported.
• Can have side effects (i.e., modify global state).
• Allowed to call other procedures.

    // a procedure definition
    procedure proc_name(real in, var real out) {
        statements
    }

Function: maps a tuple of arguments to a return value.
• Depends exclusively on the values passed to it (a mathematical function).
• No side effects: separate invocations with the same arguments yield identical results.

    // a function definition
    function func_name(parameters) returns type {
        statements;
        return statement;
    }

Functions & Procedures (continued)
Functions can be defined using a compact assignment form:

    function mpy(real x, integer(32) y) returns real = x * y;

A function can return multiple values (a tuple of values):

    function max_and_min(real vector v) returns (real, real) {
        // find min and max values in vector v
        return (max, min);            // pack return values into a tuple
    }

    real min = 0.0, max = 0.0;
    real vect[10] = …;
    min, max = max_and_min(vect);     // "unpacks" the returned tuple
    _, max = max_and_min(vect);       // drop a return value using "_"

Control Structures: Loops
Use of loops is discouraged; Dash provides higher-level language primitives instead. When loops are occasionally useful, Dash supports:
• Pre-predicated while and until loops: the loop condition is evaluated before the body executes.
• Post-predicated while and until loops: the loop condition is evaluated after the body executes.
• Counted loops: execution is controlled by a counter variable over a loop domain. Example:

    loop (n in 10..20, i in 2..n)   // domain: n in [10,19], i in [2,n]
    {
        if i == n call found_prime(n);
    }

Vectors and Matrices (Arrays): Some Examples
Elements are indexed using square brackets (one-based):

    int vector v[N];     // a vector of N elements
    int matrix m[N,N];   // an NxN square matrix
    m[2,1] = v[1];

Rich array slicing operations (encouraged, since they reduce the need for loops):

    m[1,*] = v;                // copy vector v into the first row of matrix m
    m[2, 3..7 by 2] = [1,2,3]; // modify a portion of the second row

Element-wise operations:

    vector v = c;              // splat a scalar to initialize a vector
    even = v[2..* by 2];       // gather the even elements of v
    w = even * odd;            // apply a binary operation to all elements
    .. = w[v];                 // sparse gather operation (e.g. index vector)
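As a point of comparison (not part of the Dash material), C++'s std::valarray offers a rough analogue of these strided slices and element-wise operations; the sketch below only illustrates the same idea in standard C++, with illustrative values of my own.

    #include <cstdio>
    #include <valarray>

    int main() {
        // Element-wise operations and strided slices, loosely analogous to
        // Dash's vector operations.
        std::valarray<int> v = {10, 20, 30, 40, 50, 60, 70, 80};

        // Gather even/odd elements (Dash: v[2..* by 2], one-based indexing).
        std::valarray<int> even = v[std::slice(1, 4, 2)];   // 20, 40, 60, 80
        std::valarray<int> odd  = v[std::slice(0, 4, 2)];   // 10, 30, 50, 70

        std::valarray<int> w = even * odd;   // element-wise multiply, no explicit loop

        for (int x : w) std::printf("%d ", x);
        std::printf("\n");
        return 0;
    }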
High-Level Primitives
Map: apply a function to all elements of an array. In Dash this is done by passing the array to a function that takes a scalar argument:

    function mpy(int x, int y) = x * y;   // scalar function

    // map: [mpy(v[1],c), mpy(v[2],c), mpy(v[3],c)]
    function mpy_vector(int vector v, int c) = mpy(v, c);  // elements of v multiplied concurrently

Generator: creates an array from a specification:

    // yields an NxM matrix
    const m = [i in 1..N, j in 1..M | mpy(i,j)];

Filter: selects the elements that satisfy a predicate:

    // select all elements of v greater than a certain threshold
    w,_ = filter (i in 1..length(v) | v[i] > thresh);

Compiler Infrastructure

The IBM Dash Compiler Architecture
• Dash front end: lexer, parser, semantic analyzer, and Dash IR generator. Input: Dash source files (.ds); output: Dash IR.
• Dash high-level optimizer: Dash IR transformer, expression simplifier, peephole optimizations, inliner, and generator fuser.
• Dash C generator: lowers Dash IR to C (.c), which is compiled by a C compiler (GCC, XLC, ICC, …).
• Dash LLVM generator: lowers Dash IR to LLVM IR (.bc); the LLVM optimizer and code generator support many processors (PowerPC, x86, ARM, GPU via PTX, …).
• The resulting object files (.o) are combined with the Dash runtime library (.so) by the system linker to produce the executable.
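The slides do not show the code the C backend emits, but conceptually a data-parallel generator such as const count = [i in 1..n | f(i)] can be lowered to a loop over a freshly allocated array. The sketch below is speculative (not the actual generated code) and uses an OpenMP pragma as one possible parallelization strategy.

    #include <vector>

    // Hypothetical lowering of the Dash generator
    //   const count = [i in 1..n | f(i)];
    // Each iteration is independent, so a backend is free to parallelize it.
    std::vector<double> lower_generator(long n, double (*f)(long)) {
        std::vector<double> out(n);
        #pragma omp parallel for        // one possible parallelization strategy
        for (long i = 1; i <= n; ++i)   // Dash ranges are one-based
            out[i - 1] = f(i);
        return out;
    }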
How IBM Dash Works
1. A mathematician creates the financial model using mathematical abstractions: vectors, matrices, pure functions, sequences, random variables, generators, filters, ….
2. Dash maps the model to the most efficient parallel hardware.
3. The model runs with maximum performance.
• Mathematical syntax embedded in a statically typed, C-like language (minus pointers).
• Static compilation, optimization, and automatic parallelization.
(The diagram shows representative model code such as trial(Experiment(vec, n), num_sims) and fft(vec, p1, p2, p3).)

Asian Option Pricing Using the Heston Model
Equations:

    dS_t = \mu S_t \, dt + \sqrt{V_t}\, S_t \, dW_t^1
    dV_t = \kappa (\theta - V_t) \, dt + \xi \sqrt{V_t}\, dW_t^2
    dW_t^1 \, dW_t^2 = \rho \, dt

Description:
• The Heston model characterizes the stock price using two correlated Brownian processes (W1 and W2).
• European options can be priced using an analytical closed-form solution.
• Asian options have no analytical solution; they are priced using a Monte Carlo simulation.

Heston Model: Dash Code Highlights
A Monte Carlo simulation of Asian option pricing was implemented in Dash. The parallel computation is expressed using a high-level language construct (trial):

    const payoffs = trial (monte_carlo(heston_model, opt, time_steps), num_sims);

    function monte_carlo(HestonModel model, Option opt, int time_steps) returns double
    {
        double random stream dist = normal(0.0, 1.0);
        double vector spot_draws = sample(dist, time_steps);
        double vector vol_draws  = sample(dist, time_steps);

        // Correlate the two vectors of normal random draws using factor rho
        vol_draws = [i in 1..length(vol_draws) |
                       model.rho * vol_draws[i] +
                       spot_draws[i] * (1 - model.rho*model.rho) ^ 0.5];

        const vol_path  = compute_vol_path(model, opt, vol_draws);
        const spot_path = compute_spot_path(model, opt, spot_draws, vol_path);
        return call_option_payoff(opt, spot_path[length(spot_path)]);
    }
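compute_vol_path and compute_spot_path are not shown on the slides. One common way to realize them is a full-truncation Euler discretization of the two SDEs above; the C++ sketch below is my own illustration of that step, not the Dash team's implementation, and the helper name and parameter names (kappa, theta, xi, …) are assumptions.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Illustrative full-truncation Euler step for the Heston SDEs above
    // (hypothetical helper; not the API used on the slides).
    void heston_paths(double s0, double v0, double mu, double kappa, double theta,
                      double xi, double dt,
                      const std::vector<double>& spot_draws,   // correlated normal draws
                      const std::vector<double>& vol_draws,
                      std::vector<double>& spot_path,
                      std::vector<double>& vol_path) {
        const std::size_t n = spot_draws.size();
        spot_path.assign(n + 1, s0);
        vol_path.assign(n + 1, v0);
        for (std::size_t i = 0; i < n; ++i) {
            double v = std::max(vol_path[i], 0.0);   // full truncation keeps V >= 0
            vol_path[i + 1]  = vol_path[i] + kappa * (theta - v) * dt
                             + xi * std::sqrt(v * dt) * vol_draws[i];
            spot_path[i + 1] = spot_path[i] * std::exp((mu - 0.5 * v) * dt
                             + std::sqrt(v * dt) * spot_draws[i]);
        }
    }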
Heston Model: GPU Exploitation
The Dash compiler offloads the Monte Carlo computation to an NVIDIA GPU by generating C+CUDA code for the trial operation.
• Compilation pipeline: Dash front end, Dash IR, Dash high-level optimizer, runtime code generator (.cu), NVCC (.o), and the system linker together with the Dash runtime library (.so).
• Generated code (C + CUDA): main program, kernel wrapper, GPU kernel, a device function for the volatility path, and a device function for the spot path.

Heston Model: Code Complexity
[Chart: lines of code by implementation; 450+ (C++/CUDA), 256 (C++), roughly 81 and 82 (Python and Dash); a 4.8X reduction in complexity.]
• The Monte Carlo simulation of the Heston model was implemented in Python (NumPy), C++, C++/CUDA, and Dash.
• Complexity: Dash is as expressive as Python and roughly 5X more expressive than C++/CUDA.
• It took about a week to write, test, and debug the C++/CUDA version.

Heston Model: Performance
[Performance chart: GPU vs. CPU (single core), 16X and 13X.]
• Dash (GPU backend prototype) is 16X faster than optimized sequential C++ and 600X faster than Python.
• Performance of the GPU backend prototype is expected to increase as the Dash optimizer is developed further.
• Higher performance results in improved model accuracy.

Heston Model Calibration
The process of "fitting" the model to observed market data.
• To calibrate the model to market data, minimize the objective function over five parameters.
• This is done using Differential Evolution (a genetic algorithm) by Storn & Price: http://link.springer.com/article/10.1023%2FA%3A1008202821328
• The algorithm was implemented in C++ and in Dash; the Dash version found several opportunities to use generators and simplify the code.
• Generators expose application parallelism; the compiler generates parallel code.
• C++ performance was compared against Dash on an Intel multicore system.

Heston Model Calibration: Dash Code
Work in progress: generator fusion to expose coarse-grain parallelism.

    procedure desolver(int maxpop, int MAX_ITR, double CP, double BETA,
                       double vector data, double lambda,
                       double vector lb, double vector ub)
    {
        const dim = length(lb);
        double vector fval[maxpop];
        double matrix px[maxpop,dim], px_new[maxpop,dim];
        int random stream us = uniform(1, maxpop);
        double random stream s = uniform(0.0, 1.0);

        px   = [i in 1..maxpop | initialpx(s, lb, ub, i)];
        fval = [i in 1..maxpop | evaluate(data, lambda, px[i,*])];

        loop while (itr < MAX_ITR) {
            var cand_px_new = [i in 1..maxpop | gen_vec(us, s, CP, BETA, px, lb, ub, i)];
            double vector fit_values = [i in 1..maxpop | evaluate(data, lambda, cand_px_new[i,*])];

            // selection
            double matrix px_new = [i in 1..maxpop |
                                      (fit_values[i] < fval[i]) ? cand_px_new[i,*] : px[i,*]];

            // update objective values after selection
            fval = [i in 1..maxpop | min(fval[i], fit_values[i])];
            px = px_new;
            itr += 1;
        }
    }

Heston Model: Calibration Performance
[Performance chart: 16X and 13X.]
• Dash (CPU backend) is 13X faster than optimized sequential C++.
• Parallelization is achieved by exploiting OpenMP.
• Generator fusion may improve performance further by eliminating array copies between adjacent generators.
• Higher performance results in improved model accuracy.

Vision for IBM Dash
GOAL: solve two key problems for the mathematical programming domain:
• agility of development (fast iterations)
• performance and scalability for big data
Advance language design: encompass more mathematical abstractions; develop implicitly parallel primitives such as scan, filter, etc.
Advance compiler design: support many (hybrid) parallel targets, CPU+GPU initially, then perhaps FPGA, then perhaps clusters.
SUMMARY: create a tool that simultaneously provides a productive language for mathematical modeling and insulates programmers from the complexity and evolution of hybrid systems.