Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
OVERVIEW OF MULTICORE, PARALLEL COMPUTING, AND DATA MINING 1 Indiana University Computer Science Dept. Seung-Hee Bae 1 OUTLINE Multicore Parallel Computing & MPI Data Mining 2 Multicore Toward Concurrency What is Multicore? Shared cache architecture Recognition, Mining, and Synthesis (RMS) Parallel Computing & MPI Data Mining 3 TOWARD CONCURRENCY IN SOFTWARE Exponential growth (Moore’s Law) can’t continue Previous CPU performance gains Clock speed: getting more cycles Become harder to exploit higher clock speeds due to several physical issues, such as, heat, power consumption, and current leakage problems. (2GHz:2001, 3.4GHz:2004, now?) Execution optimization: more work per cycle Pipelining, branch prediction, executing multiple instructions in the same clock cycle reordering the instruction stream: changing meaning of programs. Cache Increasing the size of on-chip cache: main memory is much slower than the CPU 4 TOWARD CONCURRENCY IN SOFTWARE 2 Current CPU performance gains Moore’s law is over? Not yet (# of transistors ↑) Hyperthreading Running two or more threads in parallel inside a single CPU Runs some instructions in parallel One each of most basic CPU features, (except extra registers) 5% ~ 15 %, 40% under ideal conditions It doesn’t help single-threaded applications Multicore Running two or more actual CPUs on one chip. Less than double the speed even in the ideal case. It will boost reasonably well-written multi-thread applications, but not single-threaded applications. 2 * 3GHz < 6 GHz Coordination overhead between the cores to ensure cache coherency. Cache Only this will broadly benefit most existing applications. A cache miss costs 10 to 50 times. 5 WHAT IS MULTICORE? Single Chip Multiple distinct processing Engine E.g.) Shared-cache Dual Core Architecture Core 0 Core 1 CPU CPU L1 Cache L1 Cache L2 Cache 6 SHARED-CACHE ARCHITECTURE Options for the last-level cache private to each core sharing the last-level cache among diff. cores Benefits of the Shared-Cache Architecture Efficient use of the last-level cache. reduce resource underutilization. Reduce cache-coherence complexity reduced false sharing because of shared cache. reduce data-storage redundancy same data only needs to be stored once. reduce front-side bus traffic data requests can be resolved at the shared-cache level instead of system memory. 7 SOFTWARE TECHNIQUES FOR SHARED-CACHE MULTICORE SYSTEMS Cache blocking (Data Tiling) Allow data to stay in the cache while being processing by data loops. Reducing unnecessary cache traffic. (Better cache hit ratio.) Hold approach (Late update) Each thread maintain its own private copy of data. Updating the shared copy only when it is necessary. Reducing the frequency of access to the shared data. Avoid false sharing What is false sharing? (unnecessary cache line update.) How to avoid false sharing? To allocate non-shared data to different cache lines. (padding) To copy the global variable to a local function variable, then copy the data back 8 before the function exits. RECOGNITION, MINING, AND SYNTHESIS (RMS) Era of Tera is coming quickly Teraflops (computing power), Terabits (comm.), Terabytes (storage) World data is doubling every three years and is now measured exabytes (a billion billion bytes) Need computing model to deal this enormous sea of information Working with Models Recognition (What is ?) Mining (Is it ?) Identifying that a set of data constitutes a model and then constructing that model. Search for instances of the model. Synthesis (What if ?) Create a potential instance of that model in an imaginary world. 9 RMS 2 (from P.Dubey, “Recognition, Mining and Synthesis Moves Computers to the Era of Tera,” Technology@Intel Magazine, Feb. 2005.) Examples Medicine (a tumor) Business (hiring) Investment 10 Multicore Parallel Computing & MPI Parallel architectures (Shared-Memory vs. Distributed-Memory) Decomposing Program (Data Parallelism vs. Task Parallelism) MPI and OpenMP Data Mining 11 PARALLEL COMPUTING: INTRODUCTION Parallel computing More than just a strategy for achieving good performance Vision for how computation can seamlessly scale from a single processor to virtually limitless computing power Parallel computing software systems Goal: to make parallel programming easier and the resulting applications more portable and scalable while achieving good performance. Difficulty Explicitly parallel program is difficult e.g.) computation, partitioning, synchronization, and data movement (correct answer & high performance) Must be machine-independent – portability Complexity of the problems being attacked. Parallel Computing Challenges Concurrency & Communication Need for high performance Diversity of Architectures 12 PARALLEL ARCHITECTURE 1 Shared-memory machines Have a single shared address space that can be accessed by any processor. Examples Multicore Symmetric multiprocessor (SMP) Uniform Memory Access (UMA) Access time is independent of the loc. Use bus or completely connected net. Not scalable Shared-Memory Programming model Need for synchronization to preserve the integrity E.g.) Open Specifications for MultiProcessing (OpenMP) Distributed-memory machines The system memory is packaged with individual nodes of one or more processors (c.f. Use separate computers connected by a network) E.g. Cluster communication is required to provide data from a processor to a different processor. support message-passing programming model Send-receive communication steps. E.g.) Message Passing Interface (MPI) 13 PARALLEL ARCHITECTURE 2 Shared-Memory Distributed Memory Pros • Lower latency and higher BW • Data are available to all of the CPUs through load and store instructions • Single address space • Scalable, if a scalable interconnection network is used. • Quite fast local data access. Cons • cache coherency issue • synchronization is explicitly needed to access shared data. • scalability issue • Communication required to access data in a diff. processor. • Communication management problem 1. Long latency Consolidation of messages btwn the same pair of processors 2. Long transmission time Overlapping communication and computing 14 PARALLEL ARCHITECTURE 3 Hybrid systems Distributed shared-memory (DSM) Distributed-memory machine which allows a processor to directly access a datum in a remote memory. Latency varies with the distance to the remote memory. Emphasize the Non-Uniform Memory Access (NUMA) characteristics. SMP clusters distributed-memory system with SMP as a unit. 15 PARALLEL PROGRAM: Decomposition 1 Decomposing Programs Decomposition: Identifying Decomposition strategy Task (Functional) parallelism the portions for the parallelism. Different processors carry out different functions. Data parallelism Subdivides the data domain of a problem into multiple regions and assigns different processors to compute the results for each region. More commonly used in scientific problems. Natural form of scalability Programming models Shared-memory programming model Need for synchronization to preserve the integrity Message-passing model Communication is required to access a remote data location. 16 PARALLEL PROGRAM: DECOMPOSITION 2 Data Parallelism Exploit the parallelism inherent in many large data structures. Same Task on diff. data. (SPMD) Can be expressed by ALL parallel programming models (i.e. MPI, HPF like, OpenMP like) Features Scalable Hard to express when geometry irregular or dynamic Functional Parallelism Coarse grain parallelism Parallelism btwn the parts of many systems. Diff. task on the same or diff. data. Features Parallelism limited in size Tens not millions Synchronization probably good as parallelism Decomposition natural E.g.) workflow 17 PARALLEL PROGRAM: DECOMPOSITION 3 Load balance and scalability Scalable: running time is inversely proportional to the number of processors used. Speedup(n) = T(1)/T(n) Second definition of scalability: scaled speedup Scalable if speedup(n) ≈ n Scalable if the running time remains the same when the number of processors and the problem size are increased by a factor of n. Why scalability is not achieved? a region that must be run sequentially. Total speedup ≤ T(1)/Ts (Amdahl’s Law) Require for a high degree of communication or coordination. Poor load balance (major goal of parallel programming) If one of the processors takes half of the parallel work, speedup will be limited to a factor of two. 18 PARALLEL PROGRAM Memory-Hierarchy Management Blocking Ensuring that data remains in cache between subsequent accesses to the same memory location. Elimination of False Sharing False sharing: When two diff. processors are accessing distinct data items that reside on the same cache block. Ensure that data used by diff. processors reside on diff. cache blocks. (by padding: inserting empty bytes in a data structure.) Communication Minimization and Placement Move send and receive commands far enough apart so that time spent on communication can be overlapped. Stride-one access Programs in which the loops access contiguous data items are much more efficient than those that do not. 19 MESSAGE PASSING INTERFACE (MPI) 1 Message Passing Interface (MPI) A specification for a set of functions for managing movement of data among sets of communicating processes. The dominant scalable parallel computing paradigm with scientific problem. Explicit message send and receive using rendezvous model. Point-to-point communication Collective communication Commonly implemented in terms of an SPMD model All processes execute essentially the same logic. Pros: scalable and portable Race condition avoided (implicit synch. w/ the copy) Cons: implements details at communication. 20 MPI 6 Key Functions MPI_INIT MPI_COMM_RANK MPI_COMM_SIZE MPI_SEND MPI_RECV MPI_FINALIZE Collective Communications Barrier, Broadcast, Gather, Scatter, All-to-all, Exchange General reduction operation (sum, minimum, scan) Blocking, nonblocking, buffered, synchronous messaging 21 OPEN SPECIFICATIONS FOR MULTIPROCESSING (OPENMP) 1 Appropriate to Shared-Memory. A sophisticated set of annotations (compiler directives) for traditional C, C++, or Fortran codes to aid compilers producing parallel codes. It provides parallel loops and collective operations such as summation over loop indices. Provide lock variables to allow fine-grain synchronization btwn threads. 22 OPENMP 2 Directives: instruct the compiler to Create threads Perform synchronization operations. Manage shared memory. Examples PARALLEL DO ~ END PARALLEL DO: explicit parallel loop. SCHEDULE (STATIC): assign continuous blocks at compile time. SCHEDULE (DYNAMIC): assign continuous blocks at run-time. REDUCTION(+: x): final values of var. x is determined global sum. PARALLEL SECTIONS: task parallelism. OpenMP synchronization primitives Critical sections Atomic updates Barriers Master selection 23 OPENMP 3 Summary Work decomposition Ideal target system: uniform-access, shared-memory. Specify where multiple threads should be applied, and how to assign work to those threads. Pros: Excellent programming interface for uniform-access, sharedmemory machines. Cons: No way to specify locality in machines w/ non-uniform sharedmemory or distributed memory. 24 Multicore Parallel Computing & MPI Data Mining Expectation Maximization (EM) Deterministic Annealing (DA) Hidden Markov Model (HMM) Support Vector Machine (SVM) 25 EXPECTATION MAXIMIZATION (EM) Expectation Maximization (EM) A general algorithm for maximum-likelihood (ML) estimation where the data are “incomplete” or the likelihood function involves latent variables. An efficient iterative procedure Goal: estimate unknown parameters, given measurement. Hill climbing approach guarantee to reach local maxima. Two Steps E-step (Expectation): the missing data are estimated given the observed data and current estimate of the model parameters. M-step (Maximization): the likelihood function is maximized under the assumption that the missing data are known. (The estimated missing data from the E-step are used in lieu of the actual missing data.) 26 Those two steps are repeated until the likelihood converges. DETERMINISTIC ANNEALING (DA) Purpose: avoid local minima (optimization) Clustering example of unsupervised learning Simulated Annealing (SA) A sequence of random moves is generated and the random decision to accept a move depends on the cost of resulting configuration relative to the current state cost (Monte Carlo Method) Deterministic Annealing (DA) Deterministic: don’t wandering randomly (minimize the free energy directly) Annealing: still want to avoid local minima with certain level of uncertainty. maintain the free energy at its minimum. eq) F = D – TH (T: temperature, H: Shannon Entropy, D: cost) At large T, entropy (H) dominates while at small T cost dominates. Annealing lowers temperature so solution tracks continuously 27 DA FOR CLUSTERING Start with a single cluster giving as solution Y1 as centroid For some annealing schedule for T, iterate above algorithm testing covariance matrix in Xi about each cluster center to see if “elongated” Split cluster if elongation “long enough” You do not need to assume number of clusters but rather a final resolution T or equivalent At T=0, uninteresting solution is N clusters; one at each point xi 28 HIDDEN MARKOV MODEL (HMM) 1 Markov model A system which may be described at any time as being in one of a set of N distinct states, S1, S2, …, SN. State transition probability The special case of a discrete, first order Markov chain: P[qt = Sj|qt-1 = Si, qt-2 = Sk, …] = P[qt = Sj|qt-1 = Si] (1) Furthermore, consider those processes in which the right-hand side of (1) is independent of time, thereby leading to the set of state transition probability aij of the form aij = P[qt = Sj|qt-1 = Si], 1 ≤ i, j ≤ N, aij ≥ 0 ∑J aij = 1 Initial state probability 29 HIDDEN MARKOV MODEL (HMM) 2 Hidden Markov Model Observation is a probabilistic function of the state. State is hidden. Elements of an HMM N, the number of states in the model. (Although the states are hidden) M, the number of distinct observation symbols per state, i.e. the discrete alphabet size. The state transition probability distribution A = {aij}, where aij = P[qt = Sj|qt-1 = Si], 1 ≤ i, j ≤ N, aij ≥ 0 ∑J aij = 1 The observation symbol probability distribution (emission probability) in state j, B = {bj(k)}, where bj(k) = P[vk at t| qt = Sj], 1 ≤ j ≤ N, 1 ≤ k ≤ M The initial state distribution π = {πi} where πi = P[q1 = Si], 1 ≤ j ≤ N 30 Compact notation: λ = (A, B, π) HIDDEN MARKOV MODEL (HMM) 3 Three Basic Problems for HMMs Prob(observation seq | model): Given the observation sequence O = O1O2 … OT, and a model λ = (A, B, π), how do we efficiently compute P(O| λ), the probability of the observation sequence, given the model? Finding Optimal State Sequence: Given the observation sequence O = O1O2 … OT, and a model λ = (A, B, π), how do we choose a corresponding state sequence Q = q1q2 … qT which is optimal in some meaningful sense (i.e. best “explains” the observations)? Finding Optimal Model Parameters: How do we adjust the model parameters λ = (A, B, π) to maximize P(O| λ) ? 31 HIDDEN MARKOV MODEL (HMM) 4 Solution to the three basic problems for HMMs Solution to the problem 1 (Forward-Backward procedure) Enumeration (straightforward way): computationally unfeasible. Forward Procedure Consider forward variable αt(i) = P(O1O2 … Ot, qt = Si| λ) i.e., the probability of the partial observation sequence, O1O2 … Ot, (until time t) and state Si at time t, given the model λ. Solution to the problem 2 (Viterbi algorithm) Optimality criterion: to find the single best state sequence (path), i.e., to maximize P(Q|O, λ) which is equivalent to maximizing P(Q, O| λ). A formal technique for finding this single best state sequence exists, based on dynamic programming methods, 32 and is called Viterbi algorithm. HIDDEN MARKOV MODEL (HMM) 5 Solution to the Problem 3. (Baum-Welch Algorithm) The third problem of HMMs is to determine a method to adjust the model parameters (A, B, π) to maximize the probability of the observation sequence given the model. Choose λ = (A, B, π) such that P(O| λ) is locally maximized using an iterative procedure such as the Baum-Welch method (or equivalently the EM (expectation-modification) method) or using gradient techniques. Reestimation (iterative update and improvement), define ξt(i, j), the probability of being in state Si at time t and state Sj at time t+1, given the model and the observation sequence, i.e. ξt(i, j) = P(qt = Si, qt+1 = Sj|O, λ) 33