Using Today's Fastest Chips to Design the Chips of Tomorrow
Mauro Calderara, Sascha Brück, Mathieu Luisier

Overview
  What we want to do → Quantum Transport: electrons and structures
  How we do it → How GPUs saved the day

Probably you're familiar with this

Zooming in

The future?
  (link to video: http://iis.ee.ethz.ch/~mauro/movie_SC15.avi)

From a somewhat more abstract POV
  [Figure label: Device]
  This is what we're ultimately interested in!
  How do electrons behave w.r.t. the device?

Change in parameters → change in behavior?
  Gate voltage
  Material properties
  Dimensions

How do electrons behave w.r.t. the device?
  Applies not just to transistors
  Batteries
  Storage devices
  ...

How would we do that?
  The "easy" case: → device behaves like bulk material
  The "difficult" case: → device behaves like atomic structure

The cost of going small
  Why is this "easy" ... and this "difficult"?
  "Easy": can assume it is "infinite" and use a semi-empirical model.
  "Difficult": very finite! Need to do it from first principles.
  Semi-empirical → O(Hours)
  First principles → O(Months)

How we do it → How GPUs saved the day

Where does all that time go?
  ~ 40x
  Solve an eigenvalue problem (not discussed here).
  Invert the matrix from before (selectively!) using a recursive algorithm.

Avoiding the inversion, use a sparse solver instead
  Instead of trying to invert selectively, solve the system using a generic sparse solver package
  ~ 40x
  Gain: speed, parallelism, capacity for somewhat larger systems
  Cost: code is now memory-bandwidth bound
  And: not such a good fit for GPUs ...

Tackling the eigenvalue problem
  ~ 200x
  We've been able to solve that one

Good speedup so far (now: O(Days), still not quite there...)
  Now what?
  ~ 70x overall
  But: memory-bandwidth bound by sparse solver
  [Figure labels: Advisor, PhD student]

A Sparse Solver for Transport Problems running on GPUs
  Inverting sparse system not feasible
  In our case: also not necessary
  [Equation lost in extraction; only the fragment "-1 =" survives]
  Need first and last block rows only
  If we can compute this fast, we can
    interleave the solving step with the BC computation
    obtain the full solution very efficiently

Obtaining the first and last block columns of the inverse
  Recursive algorithm based on the Schwinger-Dyson equation

    for i = N:1 (backward sweep)
      X_i ← (A_{i,i} − A_{i,i+1} X_{i+1}) \ A_{i,i−1}

    for i = 2:N
      Q_i ← −X_i Q_{i−1}

  xGEMM + xGESV + xGEMM
  Very fast on accelerators
  Parallelizable

A Sparse Solver for Transport Problems running on GPUs
  Runs on GPUs, compute bound
  [Plot: Performance [log(FLOPS)] vs. Arithmetic Intensity [log(FLOPS/Byte)]]
  Interleaves with EV computation
  Memory efficient
  Much faster than sparse solvers
  ~ 10x / 80x
  Whole simulation: O(Hours)

Summary
  Transforming a sparse problem to a dense one can be a good thing
  Large speedup over state of the art (15x - 150x)
  Significant increase in capacity (100'000 atoms → 10x - 100x)
  Uses hybrid resources very efficiently (15 PF sustained)
  Made ballistic ab-initio QT simulations for realistic structures a reality
  (link to video: http://iis.ee.ethz.ch/~mauro/movie_Ag_Switch.avi)

Questions?

Mauro Calderara | Apr 08 2016
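
Backup: sketch of the block recursion
  The two sweeps from the slide "Obtaining the first and last block columns of the inverse" can be tried out on a small dense test case. The following is a minimal NumPy sketch, not the authors' GPU implementation. Several things in it are assumptions that the slides do not state: A is taken to be block tridiagonal with square blocks of equal size, the backward sweep stops at i = 2, and the first block is started as Q_1 = (A_{1,1} − A_{1,2} X_2)^{-1}. All function names are hypothetical, and only the first block column is computed.

```python
import numpy as np


def first_block_column_of_inverse(diag, lower, upper):
    """Blocks Q_1..Q_N of the first block column of A^{-1} (zero-based lists).

    Assumed layout of the block tridiagonal matrix A:
      diag[i]  = A_{i,i}    for i = 0..N-1
      lower[i] = A_{i+1,i}  for i = 0..N-2
      upper[i] = A_{i,i+1}  for i = 0..N-2
    """
    n = len(diag)
    x = [None] * n
    # Backward sweep: X_i = (A_{i,i} - A_{i,i+1} X_{i+1}) \ A_{i,i-1}
    for i in range(n - 1, 0, -1):
        schur = diag[i].copy()
        if i + 1 < n:
            schur = schur - upper[i] @ x[i + 1]      # GEMM
        x[i] = np.linalg.solve(schur, lower[i - 1])  # GESV
    # Assumed starting block: Q_1 = (A_{1,1} - A_{1,2} X_2)^{-1}
    schur = diag[0] - upper[0] @ x[1] if n > 1 else diag[0]
    q = [np.linalg.inv(schur)]
    # Forward sweep: Q_i = -X_i Q_{i-1}
    for i in range(1, n):
        q.append(-x[i] @ q[i - 1])                   # GEMM
    return q


def assemble_dense(diag, lower, upper):
    """Build the full dense matrix, only to check the recursion."""
    n, b = len(diag), diag[0].shape[0]
    a = np.zeros((n * b, n * b))
    for i in range(n):
        a[i * b:(i + 1) * b, i * b:(i + 1) * b] = diag[i]
        if i + 1 < n:
            a[i * b:(i + 1) * b, (i + 1) * b:(i + 2) * b] = upper[i]
            a[(i + 1) * b:(i + 2) * b, i * b:(i + 1) * b] = lower[i]
    return a


def demo_error(n_blocks=5, block=3, seed=0):
    """Max abs difference between the recursion and a dense inverse."""
    rng = np.random.default_rng(seed)
    # Strongly diagonally dominant blocks keep every solve well conditioned.
    diag = [rng.uniform(-1, 1, (block, block)) + 10 * np.eye(block)
            for _ in range(n_blocks)]
    lower = [rng.uniform(-1, 1, (block, block)) for _ in range(n_blocks - 1)]
    upper = [rng.uniform(-1, 1, (block, block)) for _ in range(n_blocks - 1)]
    q = np.vstack(first_block_column_of_inverse(diag, lower, upper))
    reference = np.linalg.inv(assemble_dense(diag, lower, upper))[:, :block]
    return float(np.max(np.abs(q - reference)))


if __name__ == "__main__":
    print("max abs error vs. dense inverse:", demo_error())
```

  Each step of the sweeps is a dense matrix product or a dense solve, which is the xGEMM + xGESV + xGEMM pattern named on the slide. The last block column would follow from the mirrored sweeps (forward for X, backward for Q), again under the same assumptions.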