Signalling in the Heterogeneous Architecture Multiprocessor Paradigm
Antonio Núñez, Victor Reyes, Tomás Bautista
Keynote. IUMA, Institute for Applied Microelectronics, ULPGC
SPIE Gran Canaria 2003

Index
- MPSoC Architectures -> Hetero MPSoC
- Communication Architectures -> Split Transport and Signalling Networks
- Previous and Related work
- Our SystemC Based Modelling Approach
- Experiments
- Conclusions

Technological Forecasts
Moore's Law: the number of transistors per chip doubles every two years.
ITRS: SoC (MPSoC, GALS, NoC)

Year of 1st shipment     1997   1999   2002   2005   2008   2011   2014
Local Clock (GHz)        0.75   1.25   2.1    3.5    6      10     16.9
Across Chip (GHz)        0.75   1.2    1.6    2      2.5    3      3.674
Chip Size (mm²)          300    340    430    520    620    750    901
Dense Lines (nm)         250    180    130    100    70     50     35
Number of chip I/O       1515   1867   2553   3492   4776   6532   8935
Transistors per chip     11M    21M    76M    200M   520M   1.4B   3.62B

Processor to DRAM Performance Gap
- µProc (CPU, "Moore's Law"): 60%/yr.
- DRAM: 7%/yr.
- Processor-memory performance gap: grows 50% / year.
[Figure: performance (log scale, 1 to 1000) versus time, 1980 to 2000, for µProc and DRAM.]

Logic to Memory Area Gap
[Figure]

Logic to Productivity Gap
[Figure]

-> Platform based design
-> Communication architectures

Processor Architecture Paradigms
Cfr. Ungerer et al., Patterson et al., Tenhunen et al., Computer special issue
Processor/Memory/Switch: processor-, memory- and communications-dominated systems; communications architecture
- Processor-Mono: speed-up of a single-threaded application
  - Advanced superscalar
  - Trace cache (Patt, Sohi…)
  - Superspeculative
  - Multiscalar processors
- Processor-Multi: speed-up of multi-threaded applications
  - Simultaneous multithreading (SMT)
  - Chip multiprocessors (CMPs): homo, many.., hetero
- Memory (Patterson): Processor-in-Memory, IRAM, others
- Network on Chip (Mihal, Tenhunen, Goossens)

Monoprocessor: Superflow Processor
Fine granularity, data word.
The Superflow processor speculates on:
- instruction flow: two-phase branch predictor combined with trace cache
- register data flow:
  - dependence prediction: predicts the register value dependences between instructions
  - source operand value prediction
  - constant value prediction
  - value stride prediction: speculates on constant, incremental increases in operand values
- memory data flow: prediction of load values, of load addresses, and alias prediction

Com-arch in Superflow Processor
[Figure]

Multiscalar Processors
A program is represented as a control flow graph (CFG), where basic blocks are nodes and arcs represent flow of control.
A multiscalar processor walks through the CFG speculatively, taking task-sized steps, without pausing to inspect any of the instructions within a task.
The tasks are distributed to a number of parallel PEs within a processor.
Each PE fetches and executes instructions belonging to its assigned task.
The primary constraint: it must preserve the sequential program semantics.

Multiscalar mode of execution
[Figure: CFG with nodes A to E; Task A on PE 0, Task B on PE 1, Task D on PE 2, Task E on PE 3; data values.]

Com-arch in Multiscalar processor
[Figure]

Multiscalar, Trace and Speculative Multithreaded Processors
- Multiscalar: a program is statically partitioned into tasks, which are marked by annotations of the CFG.
- Trace processor: tasks are generated from traces of the trace cache.
- Speculative multithreading: tasks are otherwise dynamically constructed.
Common target: increase single-thread program performance by dynamically exploiting thread-level speculation in addition to instruction-level parallelism.
Here a "thread" means a "HW thread".

Multis: additional utilization of more coarse-grained parallelism
- CMPs: chip multiprocessors (or multiprocessor chips) integrate two or more complete processors on a single chip; every functional unit of a processor is duplicated.
- SMTs: simultaneous multithreaded processors store multiple contexts in different register sets on the chip; the functional units are multiplexed between the threads, and instructions of different contexts are executed simultaneously.

CMPs-Homo: Com-arch by shared global memory
[Figure: four processors; labels: primary cache, secondary cache, global memory. Caption: shared global memory, no caches.]

CMPs-Homo: Com-arch by shared primary cache
[Figure: four processors, primary cache, secondary cache, global memory. Caption: shared primary cache.]

CMPs-Homo: Com-arch by global memory, caches
[Figure, two panels, each with four processors with private primary caches: (1) shared caches and memory; (2) shared secondary cache.]

Com-arch in Hydra: A Single-Chip Multiprocessor
[Figure: a single chip with CPU 0 to CPU 3, each with a primary I-cache, a primary D-cache and a memory controller; centralized bus arbitration mechanisms; on-chip secondary cache; off-chip L3 interface to a cache SRAM array; Rambus memory interface to DRAM main memory; DMA I/O bus interface to an I/O device.]

CMPs-Hetero: Communications Architecture
Architectures found in today's heterogeneous processors for platform based design.
E.g. CPU cores, AMBA buses, internal/external shared memories.
[Figure: RISC core, internal/external memory, AMBA bus, engines, shared bus, external I/O.]

CMPs-Hetero: Communications Architecture, Arbiters
[Figure]

Multithreaded Processors
Aim: latency tolerance.
What is the problem? Load access latencies measured on an Alpha Server 4100 SMP with four Alpha 21164 processors are:
- 7 cycles for a primary cache miss which hits in the on-chip L2 cache of the 21164 processor,
- 21 cycles for an L2 cache miss which hits in the L3 (board-level) cache,
- 80 cycles for a miss that is served by the memory, and
- 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory.

Multithreading
Multithreading: the ability to pursue two or more threads of control in parallel within a processor pipeline.
Advantage: the latencies that arise in the computation of a single instruction stream are filled by computations of another thread.
Multithreaded processors are able to bridge latencies by switching to another thread of control, in contrast to chip multiprocessors.
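
The latency-filling argument above can be made concrete with a small idealised model. This is only a sketch under assumptions that are not in the slides: each thread computes for a fixed number of cycles (`run_cycles`, a hypothetical parameter) between misses, every miss costs a fixed latency, and a context switch is free. The function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Fraction of cycles in which the pipeline does useful work when `threads`
// hardware threads each alternate run_cycles of work with latency_cycles of
// stall, and the processor switches thread on every stall at zero cost.
// (Idealised model; the saturating form is an assumption, not a measurement.)
double utilization(int threads, double run_cycles, double latency_cycles) {
    assert(threads >= 1 && run_cycles > 0.0 && latency_cycles >= 0.0);
    double u = threads * run_cycles / (run_cycles + latency_cycles);
    return std::min(1.0, u);
}

// Smallest number of threads for which the model above reaches utilization 1.
int threads_to_saturate(double run_cycles, double latency_cycles) {
    assert(run_cycles > 0.0 && latency_cycles >= 0.0);
    return static_cast<int>(std::ceil((run_cycles + latency_cycles) / run_cycles));
}
```

For example, with the 80-cycle memory miss quoted above and an assumed 20 cycles of work between misses, `utilization(1, 20, 80)` is 0.2 and `threads_to_saturate(20, 80)` is 5. Real pipelines pay a switch cost and have variable run lengths, so these figures are upper bounds at best.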

Approaches of Multithreaded Processors
- Cycle-by-cycle interleaving: an instruction of another thread is fetched and fed into the execution pipeline at each processor cycle.
- Block interleaving: the instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch.
- Simultaneous multithreading (SMT): instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor. It combines wide superscalar instruction issue with multithreading.

Multithreading versus Non-Multithreading Approaches
[Figure: time (processor cycles) with context switches marked, for (a) single-threaded scalar, (b) cycle-by-cycle interleaving multithreaded scalar, (c) block interleaving multithreaded scalar.]

Simultaneous Multithreading (SMT) and Chip Multiprocessors (CMP)
[Figure: issue slots versus time (processor cycles) for (a) SMT and (b) CMP.]

Combining SMT and Multimedia
- Start with a wide-issue superscalar general-purpose processor
- Enhance by simultaneous multithreading
- Enhance by multimedia unit(s)
- Enhance by on-chip RAM memory for constants and local variables

The SMT Multimedia Processor
[Figure: block diagram with memory interface, DCache, ICache, I/O, local memory, global and local L/S units, thread control, branch unit, BTAC, register file, simple and complex integer units, and pipeline stages IF, ID, Rename, RI, RT, WB.]

IPC of Maximum Processor Models
[Figure: IPC (0 to 7) versus number of threads (1 to 8) and issue width (1 to 8).]

Combining CMP-hetero and Multimedia
- Start with a general-purpose processor
- Enhance by hierarchical-bus com-arch
- Enhance by hardware accelerators and copros, including multimedia unit(s)
- Enhance by on-chip RAM memories for constants, local variables, frames…

Real implementation example: Philips Eclipse architecture instance for video coding
[Figure]

CMP or SMT?
- The performance race between SMT and CMP is not yet decided.
- CMP is easier to implement, but only SMT has the ability to hide latencies.
- A functional partitioning is not easily reached within an SMT processor due to the centralized instruction issue.
- A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.
- A combination of simultaneous multithreading with the CMP may be superior.
- Research: combine SMT or CMP organization with the ability to create threads, either with compiler support or fully dynamically out of a single thread; thread-level speculation; close to multiscalar.

Processor-in-Memory
- Technological trends have produced a large and growing gap between processor speed and DRAM access latency.
- Today, it takes dozens of cycles for data to travel between the CPU and main memory.
- The CPU-centric design philosophy has led to very complex superscalar processors with deep pipelines. Much of this complexity is devoted to hiding memory access latency.
- Memory wall: the phenomenon that access times are increasingly limiting system performance.
- Memory-centric design is envisioned for the future.

PIM or Intelligent RAM (IRAM)
PIM (processor-in-memory) or IRAM (intelligent RAM) approaches couple processor execution with large, high-bandwidth, on-chip DRAM banks. PIM or IRAM merge processor and memory into a single chip.
Advantages:
- The processor-DRAM gap in access speed will keep increasing in the future.
- PIM provides higher bandwidth and lower latency for (on-chip) memory accesses.
- DRAM can accommodate 30 to 50 times more data than the same chip area devoted to caches.
- On-chip memory may be treated as main memory, in contrast to a cache, which is just a redundant memory copy.
- PIM decreases energy consumption in the memory system due to the reduction of off-chip accesses.
VIRAM, CODE

V-IRAM-2: 0.13 µm, Fast Logic, 1 GHz
16 GFLOPS (64b) / 64 GOPS (16b) / 128 MB
[Figure: 2-way superscalar processor with 8K I cache and 8K D cache; vector instruction queue; vector registers; arithmetic (+, x, ÷) and load/store units operating as 8 x 64, 16 x 32 or 32 x 16; memory crossbar switch connecting banks of memory modules (M) over 8 x 64 paths; I/O and serial I/O.]

NoC Processor Architecture
Network-on-chip, specialized PEs, advanced interconnect technologies.
Will use packet network architectures in 2010.
[Figure: PEs, a DSP, a PE array, an on-chip memory controller to external memory, and external I/O, connected through switch nodes of a packet network.]

NoC Mescal Communication Architecture: General Paradigm
The Mescal Communication Architecture is a general, coarse-grained on-chip interconnection scheme for various system components such as Processing Elements, memory and other communicating elements.
[Figure: processing elements (PE) with caches ($), memories (MEM), switches and a bridge.]

NoC Mescal Abstract System Architecture
[Figure: two processing elements, each issuing communication instructions (send/recv) to a communication assist, which issues on-chip-network operations to the on-chip network. Protocol stack: application, presentation, session, transport, network, data link and physical layers.]

NoC Communication Architecture
Translation of network operations to packet switch operations.
[Figure: on-chip-network operations pass through a packet assembler and a packet deassembler as packet switch network operations. Corresponding protocol stack: network layer, data link layer, physical layer. Packet switching network with nodes N0 to N7.]

NoC: Example for a bus
Translation of network operations to bus operations.
[Figure: on-chip-network operations pass through bus interface adapters as bus operations on the on-chip bus. Corresponding protocol stack: data link layer, physical layer.]

Today's Communication Architecture Paradigms: Topology
Single and Shared Transport and Signalling Channel
- p2p
- Bus
- Hierarchical bus
- Switch: crossbar, multistage…
- Ring
- Trees
- Network: circuit switched, packet switched without connection, packet switched with connection..

Today's Communication Architecture Paradigms: Topology
Split Transport and Signalling
- Transport topology (bus, h-bus, switch, ring, network…)
- Signalling (addresses and routing, services, synchronisms)
  - Associated channel topology
  - Common channel topology…
- Protocol layer stack: the software and process view of the generation of hardware signalling requires mapping onto actual interfaces

Today's Communications Architecture Paradigms: Bandwidth
- Application granularity
- Transport granularity: fine grain, medium grain, coarse grain; bus sizes, transfer sizes
- Traffic characterization, e.g. streaming, burstiness, interval requests, space-time distribution

Today's Communications Architecture Paradigms: Protocols
- High level signalling primitives mapping
- Communications to architecture mapping
- Access policies mapping, priorities, static, dynamic
- Traffic and flow control: burstiness, request intervals, concurrency

Today's Communications Architecture Paradigms: Signalling
- Addressing, routing info
- Service info
- Hand-shake and command sync strobes
- High level signalling primitives mapping
- Communications to architecture mapping
- Access policies mapping, priorities, static, dynamic
- Traffic and flow control: burstiness, request intervals, concurrency, streaming ...

Com-arch Modelling: Ptolemy-Mescal
UC Berkeley Ptolemy I & II, Mescal, UCSD-Dey, PR-Vissers, Goossens, Lippen.., TIMA-Jerraya..
Components for channels:
- Synchronous digital bus (shared or point-to-point)
- ARM AMBA bus
- IBM CoreConnect bus
- Analog channel
Actors encapsulate the physical layer.
Each actor has a common interface to make experimentation possible.
The Ptolemy actor interface is at a higher level than the channel's actual electrical interface.

Com-arch Modelling: Ptolemy-Mescal
Components for CommAssists:
- Queues
- Arbitrators
- PE interfaces
- Bus interfaces
- External memory or I/O cycle generators
- Switches
- Small memories
- Parameterizable components
- Programmable components
Designing a CA is very similar to designing a PE.

Com-arch Modelling: Ptolemy-Mescal
- Encapsulate a PE model as a composite actor
- Combine with CA components to make a Communicator
- Encapsulate the Communicator model as a composite actor
- Combine multiple Communicators with Channel components to make a complete system

Case study: Communication architecture in HA-MPSoC
- Mapping communicating processes and threads on HA-MPSoC requires efficient ways of implementing the on-chip communication.
- Previous work: comparative performance of different classes of data communication architectures (San Diego).
- But the communication architecture can be split into the data communication architecture, and the signalling and synchronization architecture.
- The impact of different signalling and synchronization architectural options on the overall performance has not been sufficiently studied.

Our focus: Signalling in the HA-MPSoC paradigm, split sync, SystemC modelling
- New solutions for signalling and synchronization in the HA-MPSoC paradigm
- Based on a technique for modelling the communication and synchronization architectures using SystemC
- High abstraction modelling based on the Kahn Process Network Model of Computation
- Here: variations on Dey's simple communication architecture (bus)

Previous related work: UCSD-Dey
- Analysis of the performance of various SoC communication architectures under different classes of on-chip communication traffic
- Identifying parts of the application's "communication traffic space" for which different communication architectures are well-suited
- Methodology based on POLIS/PTOLEMY

Previous related work: Dey's communication architectures
- Static Priority Based Shared Bus Architecture
- Two-level TDMA Based Architecture
- Hierarchical Bus Architecture
- Ring Based Architecture

Abstracting high level communication
- KPN: concurrent tasks interconnected by channels (FIFOs)
- Processes have to share service administrative information related to the FIFOs
- Administrative information is divided in two parts: static and dynamic information
- The update of the dynamic information of the FIFO is the synchronization aspect of the complete signalling function

A simple KPN example
Producer -> FIFO -> Consumer
Administrative information:
- Base address memory
- FIFO size
- Number of data in FIFO

Signalling Primitives in MPSoC
For flexibility and scalability, a protocol for communicating tasks is needed: a set of primitives for data communication and synchronization.
The Eclipse (Philips Research) example:
- Primitives for data communication:
  void Read(int port_id, int offset, int n_bytes, Bytes *bytevector)
  void Write(int port_id, int offset, int n_bytes, Bytes *bytevector)
- Primitives for data synchronization:
  bool GetSpace(int port_id, int n_bytes)
  void PutSpace(int port_id, int n_bytes)

Our SystemC-based Modelling
- Executable specification of a system described at different abstraction levels (functional untimed, timed, transaction level and cycle-true)
- TLM is a natural method to perform system level performance simulation
- The SystemC Master/Slave library hides the more complex details of C++ programming and fits well for TLM development
- The design time of complex MPSoC models can be greatly shortened using the SystemC Master/Slave library

Application modelling
Chain of P processors interconnected through FIFOs.
Simulation parameters: number of processes (P), token size (data granularity), request intervals, waiting cycles, transfer cycles, execution time, total simulation time.
[Figure: chain Pin -> FIFO1 -> P1 -> … -> PP-2 -> FIFOP-1 -> Pout.]
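
The FIFO administrative information and the chain-of-processes application described above can be sketched in plain C++. This is a minimal illustration, not the authors' SystemC model and not the Eclipse API: the names `FifoAdmin`, `Fifo` and `run_chain`, the use of `int` tokens, and the round-robin scheduling are all assumptions made for the example. The `has_room`/`has_data` checks play a role loosely similar to a space query, and the update of `count` stands for the synchronization step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Administrative information of one FIFO channel (illustrative names).
struct FifoAdmin {
    std::size_t base_address;  // static: where the buffer starts in memory
    std::size_t size;          // static: capacity in tokens
    std::size_t count;         // dynamic: number of tokens currently stored
};

class Fifo {
public:
    Fifo(std::size_t base_address, std::size_t size)
        : admin_{base_address, size, 0}, buffer_(size), head_(0) {}

    // Producer side: is there room for n more tokens?
    bool has_room(std::size_t n) const { return admin_.size - admin_.count >= n; }
    // Consumer side: are n tokens available?
    bool has_data(std::size_t n) const { return admin_.count >= n; }

    // Store one token; returns false (caller retries later) when full.
    bool write(int token) {
        if (!has_room(1)) return false;
        buffer_[(head_ + admin_.count) % admin_.size] = token;
        ++admin_.count;  // synchronization: update of the dynamic information
        return true;
    }
    // Fetch one token; returns false when empty.
    bool read(int& token) {
        if (!has_data(1)) return false;
        token = buffer_[head_];
        head_ = (head_ + 1) % admin_.size;
        --admin_.count;  // synchronization: update of the dynamic information
        return true;
    }
    const FifoAdmin& admin() const { return admin_; }

private:
    FifoAdmin admin_;
    std::vector<int> buffer_;
    std::size_t head_;
};

// Chain of P processes linked by P-1 FIFOs, stepped round-robin.
// Pin produces 0..n_tokens-1, inner processes forward, Pout collects.
std::vector<int> run_chain(std::size_t P, int n_tokens, std::size_t fifo_size) {
    assert(P >= 2 && fifo_size >= 1);
    std::vector<Fifo> fifos;
    for (std::size_t i = 0; i + 1 < P; ++i) fifos.emplace_back(i * fifo_size, fifo_size);
    std::vector<int> received;
    int next = 0;
    while (static_cast<int>(received.size()) < n_tokens) {
        if (next < n_tokens && fifos.front().write(next)) ++next;   // Pin
        for (std::size_t i = 1; i + 1 < P; ++i) {                   // inner processes
            int t = 0;
            if (fifos[i].has_room(1) && fifos[i - 1].read(t)) fifos[i].write(t);
        }
        int t = 0;
        if (fifos.back().read(t)) received.push_back(t);            // Pout
    }
    return received;
}
```

Calling `run_chain(4, 5, 2)` pushes five tokens through a four-process chain with two-token FIFOs and returns them in production order. A timed model would additionally charge waiting and transfer cycles for each access to the administrative information, which is where the synchronization topologies studied later differ.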

Average Communication rate: Static Priority Based Shared Bus Architecture
[Figure: average communication rate (0 to 250) versus token size (1, 10, 50, 100) for Inter-Request = 10, 100, 500 and 1000.]

Average Communication rate: Two-level TDMA Based Architecture
[Figure: average communication rate (0 to 250) versus token size (1, 10, 50, 100) for Inter-Request = 10, 100, 500 and 1000.]

Average Communication rate: Hierarchical Bus Architecture
[Figure: average communication rate (0 to 350) versus token size (1, 10, 50, 100) for Inter-Request = 10, 100, 500 and 1000.]

Average Communication rate: Ring Based Architecture
[Figure: average communication rate (0 to 350) versus token size (1, 10, 50, 100) for Inter-Request = 10, 100, 500 and 1000.]

Reminder of Dey's communication architectures
- Static Priority Based Shared Bus Architecture
- Two-level TDMA Based Architecture
- Hierarchical Bus Architecture
- Ring Based Architecture

Experiments: Additional models of communication architectures
[Figure: overview of the five architectures detailed next, each with processors P1 to P4, data wrappers (Wd), memory (MEM) and arbiter (ARB), and, depending on the variant, a SYNC module or sync wrappers (Ws).]

Centralized architecture using shared memory (Mem)
[Figure: MEM holding the sync information, ARB, and P1 to P4 each with a Wd wrapper.]

Centralized architecture using a central synchronization module (Central)
[Figure: MEM, ARB, P1 to P4 each with a Wd wrapper, and a SYNC module.]

Distributed architecture, same bus for data transport and synchronization (Single-Bus)
[Figure: MEM, ARB, and P1 to P4 each with a combined Wd-Ws wrapper.]

Distributed architecture, splitting data transport bus and sync bus (2-Busses)
[Figure: MEM and ARB on the data bus, P1 to P4 each with Wd and Ws wrappers, and a second ARB on the sync bus.]

Distributed architecture with ring topology for synchronization (Ring)
[Figure: MEM and ARB on the data bus, and P1 to P4 each with Wd and Ws wrappers.]

Implementation example: Philips Eclipse architecture instance for video coding
[Figure]

Additional measurements
Quantify which synchronization topology allows the shortest execution time for an application, i.e. which is the most efficient from the performance point of view.
The Coprocessor Usage percentage figure (Ucop):
%Ucop = (Texec / Tsim) · 100

% Coprocessor Usage, P = 4
[Figure: % coprocessor usage (0 to 10) versus token size (1, 4, 8, 16) for 2-busses, Ring, Single-bus, Mem and Central.]

% Coprocessor Usage, P = 8
[Figure: % coprocessor usage (0 to 5) versus token size (1, 4, 8, 16) for 2-busses, Ring, Single-bus, Mem and Central.]

Conclusions
- Increasing importance of communication architecture, MPSoCs <-> NoCs
- Design space exploration extended with communication architectures
- The SystemC master/slave library is a powerful modelling tool
- Large performance spread found due to communication topologies, signalling protocols, and traffic characteristics
- Need for more qualitative and quantitative modelling, analysis, studies, tools
- Consider splitting transport and signalling
- Hierarchical buses, rings, plus splitting ++