SE-292 High Performance Computing
Intro to Concurrent Programming & Parallel Architecture
Sathish Vadhiyar

Concurrent Programming
- Until now, execution involved one flow of control through a program.
- Concurrent programming is about programs with multiple flows of control.
- For example: a program that runs as multiple processes cooperating to achieve a common goal.
- To cooperate, processes must somehow communicate.

Inter-Process Communication (IPC)
Using files:
- The parent process creates two files before forking the child process.
- The child inherits file descriptors from the parent, and they share the file pointers.
- One file can be used for the parent to write and the child to read; the other for the child to write and the parent to read.

Pipes:
- The OS supports a mechanism called a pipe. A producer writes at one end (the write end) and a consumer reads from the other end (the read end).
- A pipe corresponds to two file descriptors (int fd[2]). A read from fd[0] accesses data written to fd[1] in FIFO order, and vice versa.
- Typically used with fork: the parent process creates a pipe and uses it to communicate with a child process.

Other IPC Mechanisms
- Shared variables / shared memory: processes communicate through variables that are shared between them; other variables remain private to each process. Special OS support lets a program specify objects that are to be placed in shared regions of the address space (e.g., POSIX shared memory: shmget, shmat).
- Message passing: processes communicate by sending and receiving messages to each other, with special OS support for these messages.

More Ideas on IPC Mechanisms
- Sometimes processes don't need to communicate explicit values to cooperate; they might just have to synchronize their activities.
- Example: Process 1 reads two matrices, Process 2 multiplies them, and Process 3 writes the result matrix. Process 2 should not start work until Process 1 finishes reading, and so on.
This is called process synchronization, and it is provided by synchronization primitives. Examples: mutex lock, semaphore, barrier.

Programming with Shared Variables
Consider a two-process program in which both processes increment a shared variable:

    shared int X = 0;
    P1: X++;        P2: X++;

Q: What is the value of X after this?
Complication: remember that X++ compiles into something like

    LOAD  R1, 0(R2)
    ADD   R1, R1, 1
    STORE 0(R2), R1

The Problem with Shared Variables
- The final value of X could be 1!
- P1 loads X into R1 and increments R1; P2 loads X into its register before P1 stores the new value into X. Net result: P1 stores 1, and P2 also stores 1.
- Moral of the example: it is necessary to synchronize processes that interact using shared variables. The problem arises whenever two or more processes try to update a shared variable.
- Critical section: the part of the program where a shared variable is accessed like this.

Critical Section Problem: Mutual Exclusion
- Processes must be synchronized so that they access the shared variable one at a time in the critical section; this is called mutual exclusion.
- Mutex lock: a synchronization primitive.
  AcquireLock(L): done before the critical section; returns when it is safe for the process to enter the critical section.
  ReleaseLock(L): done after the critical section; allows another process to acquire the lock.

Implementing a Lock

    int L = 0;       /* 0: lock available */
    AcquireLock(L):  while (L == 1) ;    /* `busy waiting' */
                     L = 1;
    ReleaseLock(L):  L = 0;

Why This Implementation Fails
Assume that lock L is currently available (L = 0) and that two processes, P1 and P2, try to acquire it. AcquireLock compiles to:

    wait: LW   R1, Addr(L)
          BNEZ R1, wait
          ADDI R1, R0, 1
          SW   R1, Addr(L)

A possible interleaving: P1 executes LW, reading 0; a context switch occurs; P2 executes LW (also reading 0), BNEZ, ADDI, SW, and enters the critical section; another context switch occurs; P1 resumes with BNEZ and ADDI on its stale register value.
P1 then executes SW and enters the critical section as well.

THE IMPLEMENTATION ALLOWS PROCESSES P1 AND P2 TO BE IN THE CRITICAL SECTION TOGETHER!

Busy-Wait Lock Implementation
- Hardware support is useful for implementing a lock. Example: the Test&Set instruction.

    Test&Set(Lock):
        tmp  = Lock
        Lock = 1
        return tmp

- These three steps happen atomically, or indivisibly: all three happen as one operation, with nothing happening in between. This is an atomic read-modify-write (RMW) instruction.

Busy-Wait Lock with Test&Set

    AcquireLock(L):  while (Test&Set(L)) ;
    ReleaseLock(L):  L = 0;

- Consider the case where P1 is currently in a critical section and P2-P10 are executing AcquireLock, i.e., all are spinning in the while loop. When P1 releases the lock, by the definition of Test&Set exactly one of P2-P10 will read the new lock value of 0 and set L back to 1.

More on Locks
- Other names for this kind of lock: mutex, spin-wait lock, busy-wait lock.
- There are also locks where, instead of busy waiting, an unsuccessful process gets blocked by the operating system.

Semaphore
- A more general synchronization mechanism, with two operations: P (wait) and V (signal).
- P(S): if S is nonzero, decrements S and returns. Otherwise, suspends the process until S becomes nonzero, when the process is restarted; after restarting, it decrements S and returns.
- V(S): increments S by 1. If there are processes blocked on S, restarts exactly one of them.

Critical Section Problem & Semaphore
- Semaphore S = 1; before the critical section: P(S); after the critical section: V(S).
- Semaphores can do more than mutex locks: initialize S to 10, and up to 10 processes will be allowed to proceed.
- Pipeline example: P1 reads matrices, P2 multiplies, P3 writes the product. With semaphores S1 = S2 = 0: at the end of P1, V(S1); at the beginning of P2, P(S1); and so on.

Deadlock
- Consider the following process:  P1: lock(L); wait(L);
  P1 is waiting for something (release of a lock that it is itself holding) that will never happen.
- This is a simple case of a general problem called deadlock: a cycle of processes waiting for resources held by others, while holding resources needed by others.

Classical Problems
Producers-Consumers Problem (bounded-buffer problem)
- A producer process makes things and puts them
into a fixed-size shared buffer; a consumer process takes things out of the shared buffer and uses them.
- We must ensure that the producer doesn't put into a full buffer and the consumer doesn't take out of an empty buffer, while treating buffer accesses as a critical section.

Producers-Consumers Problem

    shared Buffer[0 .. N-1]

    Producer: repeatedly
        produce x;
        if (buffer is full) wait for consumption;
        Buffer[i++] = x;
        signal consumer;

    Consumer: repeatedly
        if (buffer is empty) wait for production;
        y = Buffer[--i];
        consume y;
        signal producer;

Dining Philosophers Problem
- N philosophers sit around a circular table, with a plate of food in front of each and a fork between every two philosophers.
- Each philosopher repeatedly eats (using 2 forks) and thinks.
- Problem: avoid deadlock; be fair.

THREADS

Thread
- "Weight" relates to a basic unit of CPU utilization: a thread of control in a process, a `lightweight process'.
- Lighter in terms of creation time, context-switch time, and size of context (recall the context of a process).

Threads and Processes
- Thread context: thread id, stack, stack pointer, PC, GPR values. So thread context switching can be fast.
- The many threads in a process share the remaining parts of the process context, in particular the virtual address space (other than the stacks). So threads in the same process share variables that are not stack-allocated.

Threads and Sharing
- A thread shares with the other threads of its process: the code section, data section, open files, and signals.

Threads
- Benefits: responsiveness, communication, parallelism, and scalability.
- Types: user threads and kernel threads.
- Multithreading models:
  Many-to-one: efficient, but the entire process blocks if one thread makes a blocking system call.
  One-to-one (e.g., Linux):
parallelism, but heavyweight.
  Many-to-many: a balance between the above two schemes.

Thread Implementation
- Threads can be supported either in the operating system or by a library.
- Pthreads: the POSIX thread library, a standard for defining thread creation and synchronization.

    int pthread_create(pthread_t *thread,
                       const pthread_attr_t *attr,
                       void *(*start_routine)(void *),
                       void *arg);

- Related: pthread_attr, pthread_join, pthread_exit, pthread_detach. Do "man -k pthread".

Synchronization Primitives
- Mutex locks:
    int pthread_mutex_lock(pthread_mutex_t *mutex);
  If the mutex is already locked, the calling thread blocks until the mutex becomes available. Returns with the mutex object referenced by mutex in the locked state, with the calling thread as its owner.
    pthread_mutex_unlock
- Semaphores: sem_init, sem_wait, sem_post.

Pthread Scheduling
- Process contention scope (PCS): scheduling user-level threads among a set of kernel threads.
- System contention scope (SCS): scheduling kernel threads for the CPU.
- Functions for setting the scope: pthread_attr_setscope, pthread_attr_getscope. Use PTHREAD_SCOPE_PROCESS for PCS and PTHREAD_SCOPE_SYSTEM for SCS.

Thread Safety
- A function is thread-safe if it always produces correct results when called repeatedly from multiple concurrent threads.
- Thread-unsafe functions are those that don't protect shared variables, that keep state across multiple invocations, that return a pointer to a static variable, or that call thread-unsafe functions.
- Races: when the correctness of a program depends on one thread reaching a point x before another thread reaches a point y.

PARALLEL ARCHITECTURE
A parallel machine is a computer system with more than one processor.
Motivations:
- Faster execution time due to non-dependencies between regions of code
- Presents a level of modularity
- Resource constraints; large databases
- A certain class of algorithms lends itself to parallelism
- Aggregate bandwidth to memory/disk; increase in data throughput
- Clock rate improved about 40% per year in the past decade, while memory access time improved only about 10% per year

Classification of Architectures: Flynn's Classification
In terms of parallelism in the instruction and data streams:
- Single Instruction Single Data (SISD): serial computers.
- Single Instruction Multiple Data (SIMD): vector processors and processor arrays. Examples: CM-2, Cray C90, Cray YMP, Hitachi 3600.
- Multiple Instruction Single Data (MISD): not popular.
- Multiple Instruction Multiple Data (MIMD): most popular. IBM SP and most other supercomputers, clusters, computational Grids, etc.
(Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/)

Classification of Architectures: Based on Memory
- Shared memory, of two types: UMA and NUMA. Examples: HP Exemplar, SGI Origin, Sequent NUMA-Q.

Classification 2: Shared Memory vs Message Passing
- Shared memory machine: the n processors share a physical address space, and communication can be done through this shared memory.
- The alternative is sometimes referred to as a message-passing machine or a distributed-memory machine.
[Figures: processors (P) and memory modules (M) connected by an interconnect; processors connected by an interconnect to a single main memory.]

Shared Memory Machines
- The shared memory could itself be distributed among the processor nodes: each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time.
- Terms: NUMA (Non-Uniform Memory Access) vs UMA (Uniform Memory Access) architecture.

Classification of Architectures: Based on Memory (contd.)
- Distributed memory; more recently, multi-cores.
- Yet another classification: MPPs, NOW (Berkeley), COW, computational Grids.

Parallel Architecture: Interconnection Networks
An interconnection network is defined
by switches, links, and interfaces.
- Switches provide the mapping between input and output ports, buffering, routing, etc.
- Interfaces connect nodes to the network.

Parallel Architecture: Interconnections
- Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other. Examples: shared bus, multiple bus, crossbar, MIN (multistage interconnection network).
- Direct interconnects: nodes are connected directly to each other. Topologies: linear, ring, star, mesh, torus, hypercube.
- Routing techniques: how the route taken by a message from source to destination is decided.
- Network topologies:
  Static: point-to-point communication links among processing nodes.
  Dynamic: communication links are formed dynamically by switches.

Interconnection Networks
- Static: completely connected, star, linear array, ring (1-D torus), mesh, k-d mesh (d dimensions with k nodes in each dimension), hypercube (a log p-dimensional mesh with 2 nodes per dimension), trees (e.g., our campus network).
- Dynamic: bus, crossbar, multistage.
- For more details, and an evaluation of topologies, refer to the book by Grama et al.

Indirect Interconnects
- Shared bus, multiple bus, crossbar switch, multistage interconnection network (built from 2x2 crossbars).

Direct Interconnect Topologies
- Star, linear array, ring, 2-D mesh, torus, hypercube (binary n-cube, e.g., n = 2, n = 3).

Evaluating Interconnection Topologies
- Diameter: the maximum distance between any two processing nodes.
  Fully connected: 1; star: 2; ring: p/2; hypercube: log p.
- Connectivity: the multiplicity of paths between two nodes.
Measured as the minimum number of arcs that must be removed from the network to break it into two disconnected networks.
  Linear array: 1; ring: 2; 2-D mesh: 2; 2-D mesh with wraparound: 4; d-dimensional hypercube: d.

Evaluating Interconnection Topologies (contd.)
- Bisection width: the minimum number of links that must be removed from the network to partition it into two equal halves.
  Ring: 2; p-node 2-D mesh: sqrt(p); tree: 1; star: 1; completely connected: p^2/4; hypercube: p/2.
- Channel width: the number of bits that can be communicated simultaneously over a link, i.e., the number of physical wires between two nodes.
- Channel rate: the performance of a single physical wire.
- Channel bandwidth: channel rate times channel width.
- Bisection bandwidth: the maximum volume of communication between the two halves of the network, i.e., bisection width times channel bandwidth.

Shared Memory Architecture: Caches
[Figure: P1 reads X = 0 into its cache and then writes X = 1; P2's subsequent read of X hits a stale copy with X = 0. Cache hit: wrong data!]

Cache Coherence Problem
- If each processor in a shared-memory multiprocessor has a data cache, there is a potential data-consistency problem: the cache coherence problem (a shared variable is modified in a private cache).
- Objective: processes should not read `stale' data.
- Solution: hardware cache coherence mechanisms.

Cache Coherence Protocols
- Write-update: propagate the cache line to the other processors on every write by a processor.
- Write-invalidate: each processor gets the updated cache line whenever it reads stale data.
- Which is better?
Invalidation-Based Cache Coherence
[Figure: P1 reads X = 0 and then writes X = 1; the write invalidates P2's cached copy, so P2's next read fetches the updated value X = 1.]

Cache Coherence Using Invalidate Protocols
- Three states are associated with data items:
  Shared: the variable is shared by 2 caches.
  Invalid: another processor (say P0) has updated the data item.
  Dirty: the state of the data item in P0.
- Implementations:
  Snoopy, for bus-based architectures: a shared bus interconnect where all cache controllers monitor all bus activity. Since there is only one operation on the bus at a time, cache controllers can be built to take corrective action and enforce coherence in the caches. Memory operations are propagated over the bus and snooped.
  Directory-based: instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors. A central directory maintains the states of cache blocks and the associated processors, implemented with presence bits.