Download 6.896 Project Presentations

6.895/SMA5509 Project Presentations October 15-17 2003 Improving Cilk Adaptively Parallel Processor Allocation for Cilk Jobs Kunal Agrawal Siddhartha Sen Goal    Design and implement a dynamic processorallocation system for adaptively parallel jobs (jobs for which the number of processors that can be used without waste varies during execution) The problem of allocating processors to adaptively parallel jobs is called the adaptively parallel processor-allocation problem [2]. At any given time, each job j has a desire dj and an allotment mj Goal (cont.)  We want to design a processor-allocation system that achieves a “fair” and “efficient” allocation among all jobs. – – “fair” means that whenever a job receives fewer processors than it desires, no other job receives more than one more processor than this job received [2] “efficient” means that no job receives more processors than it desires [2] Illustration of Problem  …  dk = 10 mk = 6 … P processors dj = 6 mj = 5 Main Algorithmic Questions  Dynamically determine the desires of each job in the system – – –  Heuristics on steal rate Heuristics on number of visible stack frames Heuristics on number of threads in ready deques Dynamically determine the allotment for each job such that the resulting allocation is fair and efficient – – – SRLBA algorithm [2] Cilk macroscheduler algorithm [3] Lottery scheduling techniques [5] Assumptions     All jobs on the system are Cilk jobs Jobs can enter and leave the system and change their parallelism during execution All jobs are mutually trusting (they will stay within the bound of their allotments and communicate their desires honestly) Each job has at least one processor to start with References 1. 2. 3. 4. 5. 6. Supercomputing Technologies Group. Cilk 5.3.2 Reference Manual. MIT Lab for Computer Science, November 2001. B. Song. Scheduling adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology, January 1998. R. D. Blumofe, C. E. Leiserson, and B. Song. Automatic processor allocation for work-stealing jobs. C. A. Waldspurger. Lottery and Stride Scheduling: Flexible Proportional-Share Resource Management. PhD thesis, Massachusetts Institute of Technology, September 1995. C. A. Waldspurger and W. E. Weihl. Lottery scheduling: Flexible proportionalshare resource management. In Proceedings of the First Symposium on Operating System Design and Implementation, pages 1-11. USENIX, November 1994. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An e±cient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 207-216, Santa Barbara, California, July 1995. Fast Serial-Append for Cilk 6.895 – Theory of Parallel Systems Project Presentation (Proposal) Alexandru Caracas Serial-Append (cont'd) ● ● ● A Cilk dag Numbers represent the serial execution of threads For example: – Consider all threads perform file I/O operations Serial-Append (cont'd) ● ● ● Same Cilk dag Numbers represent the parallel execution of threads Goal: – We would like to execute the threads in parallel while still allowing for the serial order to be reconstructed. Serial-Append (cont'd) ● ● Partitioning of the Cilk dag Colors represent the I/O operations done by a processor between two steal operations. Serial-Append ● The Serial-Append of the computation is obtained by reading the output in the order: – orange (I) – green (II) – dark-red (III) – purple (IV) – red (V) Project Goals ● Efficient serial-append algorithm – ● ● ● possibly using B+-trees Analyze possible implementation: – kernel level – formatted file – Cilk file system Implementation – ● (one of the above) ● Port Cheerio to Cilk 5.4. Comparison with own implementation Port serial applications to use serial-append (analyze their performance) References ● ● ● Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 356-368, Santa Fe, New Mexico, November 1994. Matthew S. DeBergalis. A parallel file I/O API for Cilk. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, June 2000. Supercomputing Technology Group MIT Laboratory for Computer Science. Cilk 5.3.2 Reference Manual, November 2001. Available at http://supertech.lcs.mit.edu/cilk/manual5.3.2.pdf. A Space-Efficient Global Scheduler for Cilk Jason Hickey Tyeler Quentmeyer Idea  Implement a new scheduler for Cilk based on Narlikar and Blelloch’s Space-Efficient Scheduling of Nested Parallelism paper AsyncDF Scheduler  Idea: minimize peak memory usage by preempting threads that try to allocate large blocks of memory  Algorithm: Roughly, AsyncDF runs the P left-most threads in the computation DAG Example In parallel for i = 1 to n Temporary B[n] In parallel for j = 1 to n F(B,i,j) Free B Comparison to Cilk  Cilk    Distributed Run threads like a serial program until a processor runs out of work and has to steal Space: p*S1  AsyncDF    Global Preempt a thread when it tries to allocate a large block of memory and replace it with a computationally heavy thread Space: S1 + O(K*D*p) AsyncDF Optmizations  Two versions of scheduler   Serial Parallel  Running P left-most threads does not exploit the locality of a serial program  Solution: Look at the cP left-most threads and “group” threads together for locality Project Plan  Implement serial version of AsyncDF     Experiment with algorithms to group threads for locality Performance comparison to Cilk Different values of AsyncDF’s tunable parameter Case studies    Recursive matrix multiplication Strassen multiplication N-body problem  Implement parallel version of AsyncDF  Performance comparison to Cilk  Additional experiments/modifications as research suggests Questions? Automatic conversion of NSP DAGs to SP DAGs Sajindra Jayasena Sharad Ganesh Objectives : Design of an efficient Algorithm to convert an arbitrary NSP DAG to SP DAG - Correctness preserving - Identifying different graph topologies Analysis of Space and Time complexities - Bound on increase in critical path length Perform empirical analysis using cilk 5/24/2017 Theory of Parallel Systems 6.895 Transactional Cilk Language-Level Complex Transactions C. Scott Ananian An Evaluation of Nested Transactions Sean Lie [email protected] 6.895 Theory of Parallel Systems Wednesday October 15th, 2003 Transactional Memory  The programmer is given the ability to define atomic regions of arbitrary size. Start_Transaction; Flow[i] = Flow[i] – X; Flow[j] = Flow[j] + X; End_Transaction;  With hardware support, different transactions can be executed concurrently when there are no memory conflicts. Nested Transactions  Nested transactions can be handled by simply merging all inner transactions into the outermost transaction. Start_Transaction; Start_Transaction; .. .. Start_Transaction ; Start_Transactio n; .. .. End_Transaction; End_Transaction; .. .. .. .. .. .. End_Transaction; End_Transaction; Concurrency  Merging transactions is simple but may decrease the program concurrency. Start_Transaction; .. Start_Transaction ; Conflict! Access S; Start_Transaction; End_Transaction; Access S; .. .. .. End_Transaction; End_Transaction; Nested Concurrent Transactions  Nested Concurrent Transactions (NCT):   Questions:     When is NCT actually useful? How much overhead is incurred by NCT? How much do we gain from using NCT? The tool needed to get the answers:    Allow the inner transaction to run and complete independently from the outer transaction. Transactional memory hardware simulator – UVSIM UVSIM currently supports merging nested transactions. How to get the answers:    Identify applications where NCT is useful. Look at malloc? Implement necessary infrastructure to run NCTs in UVSIM. Evaluate performance of NCT vs. merging transactions. Compiler Support for Atomic Transactions in Cilk Jim Sukha Tushara Karunaratna Determinacy Race Example If we could guarantee that this line of code executed atomically, then the program would have no determinacy race. New Keywords in Cilk It would be convenient if programmers could make a section of code execute atomically by using xbegin and xend keywords. Simulation of Transactional Memory For a transaction, keep track of the memory locations accessed. Transformation load/store operations in original code. As long as there are no conflicts, the transaction executes as usual. If there is a conflict, the transaction is aborted. Restore the modified locations to their original values, wait some time, and then try again If we successfully completed, commit the changes and cleanup. Work Involved • Modify the Cilk parser to accept xbegin and xend as keywords. • Identify all load and store operations in user code, and replace them with calls to functions from the runtime system. • Implement the runtime system. An initial implementation is to divide memory into blocks and to use a hash table to store a lock and backup values for each block. • Experiment with different runtime implementations. Data Race in Transactional Cilk Xie Yong Background • Current Cilk: achieve atomicity using lock – problems such as priority inversion, deadlock, etc. – Nontrivial to code • Transactional memory – Software TM: overhead -> slow – Hardware TM Cilk’s transaction everywhere • Every instruction becomes part of the a transaction • Based on HTM • Cilk transaction: – cut Cilk program into atomic pieces – Base on some language construct or compiler automatic generate Data Race • Very different from traditional definition in Nondeterminator • Assumption (correct parallelization) • Definition (1st half of Kai’s master thesis) – Efficient detection (v.s. NP-complete proved by Kai) – Make use of current algorithms/tools Non-Determinacy Detection Linear-time determinacy race detection Jeremy Fineman Motivation • Nondeterminator has two parts: – Check whether threads are logically parallel – Use shadow spaces to detect determinacy race • SP-Bags algorithm uses LCA lookup to determine whether two threads are parallel • LCA lookup with disjoint-sets data structure takes O(α(v,v)) time. • We do not care about the LCA. We just want to know if two threads are logically parallel. New algorithm for determinacyrace detection • Similar to SP-Bags algorithm with two parts: – Check whether threads are logically parallel – Use shadow spaces to detect determinacy race • Use order maintenance data-structure to determine whether threads are parallel • Gain: Order maintenance operations are O(1) amortized time. The algorithm • Maintain two orders: – Regular serial, depth-first, left-to right execution. • At each spawn, follow spawn thread before continuation thread – Right-to-left execution. • At each spawn, follow continuation thread before spawn thread • Claim: Two threads e1 and e2 are parallel if and only if e1 precedes e2 in one order, and e2 precedes e1 in the other. Depth-first, left-to-right execution e0 e5 F e6 e7 e1 e2 F1 F2 e3 F2 e4 F3 Right-to-left execution e0 e5 F e6 e7 e1 e2 F1 F2 e3 F2 e4 F3 How to maintain both orders in serial execution • Keep two pieces of state for each procedure F: – CF is current thread in F – SF is next sync statement in F. • In both orders, insert new threads after current thread • On spawn, insert continuation thread before spawn thread in one order. Do the opposite in the other. • On sync, advance CF to SF • In any other thread, update shadow spaces based on current thread Example e0 e2 F s1 e4 e1 e5 e3 CF: e5420 Order: e0 se1 e2 e3 e4 SF: s12 Order: e0 se12 e14 e3 e5 e5 Project Proposal: Parallel Nondeterminator He Yuxiong Objective I propose Parallel Nondeterminator to: • Check the determinacy race in the parallel execution of the program written in the language like Cilk • Develop efficient algorithm to decide the concurrency between threads • Develop efficient algorithm to reduce the number of entries in access history Primitive Idea in Concurrency Test (1) Labeling Scheme (2) Set operation Thread representation (fid, tid) Two sets: Parallel set PS(f) ={(pf, ptid) | all threads with tid >= ptid in function pf is parallel with the running thread of f} Children set CS(f)={fid | fid is the descendant of current function } Operations: Spawn: Thread Tx of function Fi spawn function Fj Operations on child Fj Operations on parent Fi Sync: Function Fi executes sync PS(Fi)=PS(Fi)-CS(Fi) Return: Function Fj returns to Function Fi CS(Fi) = CS(Fi) + CS(Fj) Release PS(Fj) and CS(Fj) Concurrency Test: Check if (fx, tx) is parallel with the current running thread (fc, tc): Primitive Idea for Access History Serial read(l), write(l) Simplest parallel program without nested parallelism Two parallel read records, one write record Language structure like Cilk read: max level of parallelism, one write record Q: Is it possible to keep only two read records for each shared location in Cilk Parallel Nondeterminator? Keep two parallel read records with highest level of LCA in parent child spawn tree. Thank you very much! Parallel Nondeterminator Wang Junqing Using Cilk Accelerating Multiprocessor Simulation 6.895 Project Proposal Kenneth C. Barr Introduction • Inspiration – UVSIM is slow – FlexSIM is cool * Interleave short periods of detailed simulation with long periods of functional warmup (eg., cache and predictor updates, but not out-of-order logic) * Achieve high accuracy in fraction of the time – Multi-configuration simulation is cool * Recognize structural properties. E.G., “contents of FA cache are subset of all larger FA caches with same line size” so search small->large. Once we hit is small cache, we can stop searching * Simulate many configurations with a single run Kenneth C. Barr – MIT 6.895 October 15, 2003 The Meta Directory • Combine previous ideas to speed multiprocessor simulation • Sequential Consistency is what matters, perhaps detailed consistency mechanisms can be interleaved with a fast, functional equivalent method Kenneth C. Barr – MIT 6.895 October 15, 2003 Meta-directory example • Detailed simulation: – – – – – – P1 P2 P3 P2 P1 P2 ld x ld x ld x st x st x ld x ;null -> Sh, data sent to P1 ;Sh -> Sh, data sent to P2 ;Sh- > Sh data sent to P3 ;Sh -> Ex, inv sent to P1 and P3, data sent to P2 ;Ex -> Ex, inv sent to P2, data sent to P1 ;Ex -> Sh, data sent to P2 • Shortcut – Record access in a small “meta-directory:” x: {P1, 0, r}, {P2, 1, r}, {P3, 2, r}, {P2, 3, w}, {P1, 4, w}, {P2, 5, r} – All reads and writes occur in a memory; no messages sent or received, no directory modeled, no cache model in processor (?) – When it comes time for detailed simulation, we can reconstruct directory by scanning backwards: x is shared by P1 and P2. Kenneth C. Barr – MIT 6.895 October 15, 2003 Challenges • Accesses stored in circular queue. How many records needed for each address? • What happens when processor writes back data, current scheme introduces false hits. • Does this scheme always work? Some proofs would be nice. Kenneth C. Barr – MIT 6.895 October 15, 2003 Methodology • Platform – – – – I got Simics to boot SMP Linux But Bochs is open-source X86 is key if this is to be long-term framework Cilk benchmarks? • Tasks – – – – Create detailed directory model Create meta directory Create a way to switch back and forth maintaining program correctness Measure the dramatic improvement! Kenneth C. Barr – MIT 6.895 October 15, 2003 Conclusion • Reality – Sounds daunting for December deadline, but if I can prove feasibility or fatal flaws, I’d be excited • Suggestions? Kenneth C. Barr – MIT 6.895 October 15, 2003 Project Proposal: Parallelizing METIS Zardosht Kasheff What is METIS? • Graph Partitioning algorithm • Developed at University of Minnesota by George Karypis and Vipin Kumar • Information on METIS: http://wwwusers.cs.umn.edu/~karypis/metis/ Stages of Algorithm: Coarsening • Task: Create sequentially smaller graphs that make good representation of original graph by collapsing connected nodes. • Issues: – Minor concurrency issues. – Maintaining data locality. – Writing large amount of data to memory in a scalable fashion. Stages of Algorithm: Initial partitioning. • Task: partition small graph • Issues: none. Runtime of this portion of algorithm very small. Stages of Algorithm: Uncoarsening and Refinement • Uncoarsening and Refinement: project partition of coarsest graph to original graph and refine partition. • Issues: – Major concurrency issues. – Remaining issues under research. Parallel Sorting Paul Youn I propose to parallelize a sorting algorithm. Initially, looking at Quick Sort, but may investigate other sorting algorithms. Sorting is a problem that can potentially be sped up significantly (mergesort). As an alternative, considering other algorithms. Goals • • • • • • Similar to Lab 1 approach. Speedy serial algorithm. Basic parallel algorithm. Runtime Analysis. Empirical verification of runtime analysis. Parallel speedups. HELP! • Not sure about the scope of the problem. • Anyone out there interested in this stuff? • Other appropriate algorithms? Max-flow? PROJECT PROPOSAL SMA5509 Implement FIR filter in parallel computing using Cilk Name: Pham Duc Minh Matric: HT030502H Email: [email protected] [email protected] Discrete Signal  the discrete input signal is x(n)  the output signal is y(n)  the impulse response h(n) y ( n)   x ( m) h ( n  m) i, j i  j n FIR filter Discrete time LTI (Linearity and Time Invariance) systems can be classified into FIR and IIR. An FIR filter has impulse response h(n) that extends only over a finite time interval, say 0≤ n ≤ M, and is identically zero beyond that: {h0, h1, h2 . . . , hM, 0, 0, 0 . . .} M is referred to as the filter order. The output y(n) in FIR is simplified to the finite-sum form: M y ( n )   h( m) x ( n  m) m 0 or, explicitly y(n)  h0 x(n)  h1 x(n  1)  h2 x(n  2)  ...  hM x(n  M ) Methods to process 1. Block processing a. Convolution y ( n)   x ( m) h ( n  m) i, j i  j n (convolution table form) b. Direct Form The length of the input signal x(n) is L The length of output y(n) will be Ly = Lx+Lh-1 y ( n)  min( n , M )  h(m) x(n  m) max( 0, n  L 1) c. Matrix form  y0  y  h0 0 0 0 0   x0   1  h h 0 0 0   x    1   Hx  1 0 y   y2   ..........................  ....  ...  0 0 0 0 h   x  M   L 1   y L  M 1     with the dimension of H is L y  Lx  ( L  M )  L a. Overlap-Add Block Convolution Method  above methods are not feasible in infinitive or extremely long applications  divide the long input into contiguous non-overlapping blocks of manageable length, say L samples  y0 = h*x0  y1 = h*x1  y2 = h*x2 Thus the resulting block y0 starts at absolute time n = 0 block y1 starts at n = L, and so on. The last M samples of each output block will overlap with the first M outputs of the next block . 2. Sampling Processing The direct form I/O convolutional equation y(n)  h0 x(n)  h1 x(n  1)  ...hM x(n  M ) We define the internal states w1(n), w2(2), w3(n)…. w0(n) = x(n) w1(n) = x(n-1) = w0(n-1) w2(n) = x(n-2) = w1(n-1) ….. With this definition, we can y(n) in the form y(n)  h0 w0 (n)  h1w1 (n  1)  ...hM wM (n) Goal of project Use Cilk to implement FIR filter in parallel Compare with the processes of DSP kit (pipeline) and the process of FPGA (Field Programmable Gate Array) (also parallel). How to implement Multithreaded programming Cilk compiler Other things  If I finish soon, I hope additional work suggestion from staff. Cache-Oblivious Algorithms SMA5509: Theory of Parallel Systems Project Proposal Cache-oblivious sorting for Burrow-Wheeler Transform Sriram Saroop Advait D. Karande bzip2 compression  Lossless text compression scheme Input data Burrow Wheeler Transform Move-to-Front coding Huffman encoding Compressed data Burrows-Wheeler Transform (1994)      A pre-processing step for data compression Involves sorting of all rotations of the block of data to be compressed Rationale: Transformed strings compress better Worst case complexity of N2log(N), where N = size of block to be sorted Can be improved to N(logN)2 (Sadakane ’98) Why make it cache-oblivious?    Performance of BWT heavily dependent on cache behavior (Seward ’00) Avoid slowdown for large files with high degree of repetitiveness Especially useful in applications like bzip2 that are required to perform well on different memory structures Project Plan    Design and implementation of a cache oblivious algorithm for BWT sorting Performance comparisons with standard bzip2 for available benchmarks Back-up: Optimizations for BWT including parallelization, and incorporating Cache-Obliviousness in other parts of bzip2 compression New Cache Oblivious Algorithms Neel Kamal Zhang JiaHui Our Goals Designing Cache Oblivious Algorithms for problems in several areas of Compute Science. Finding theoretical bounds of number of cache misses for these algorithms. Trying to parallelize them … Large Integer Arithmetic Operations • Basic Operators – Addition, Subtraction – Multiplication – Factorial – Matrix Multiplication • Application: RSA encryption algorithm Large Integer Multiplication Basic Idea: A B C D B A B A x C x x D C x D Dynamic Programming Algorithms • Longest Common Sequence Basic Idea A B C B D A B B 0 1 1 1 1 1 1 D 0 1 1 1 2 2 2 C 0 1 2 2 2 2 2 A 1 1 2 2 2 3 3 B 1 2 2 3 3 3 4 A 1 2 2 3 3 4 4 Graph Algorithm • All-pair shortest path algorithm – Floyd • Connectivity Testing of a Graph • And more … Computational Geometry • Finding the closest pair of points • And More … Concurrent Cache-Oblivious Algorithms Seth Gilbert Cache-Oblivious Algorithms  Optimal memory access  Matrix multiply, Fast Fourier Transform  B-trees, Priority Queues  Concurrent?  wait-free, lock-free, obstruction-free  Synchronization?  DCAS, CAS, LL/SC Packed Memory Structure  Operations:  Traverse(k) -- O(k/B)  Insert -- O(log2n/B)  Delete -- O(log2n/B)  Goal:  Design a lock-free version of the cacheoblivious algorithm using CAS More  Basis for many data structure  B-tree  distributed data structures  Maybe parallel nondeterminator The End Packed Memory Structure V1   V2  V3 V4   size(array) < 4 x num. elements  elements well-spaced in array V5

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download 6.896 Project Presentations