A Tiled Processor Architecture Prototype: The Raw Microprocessor (October '02)

Tiled Processor Architecture (TPA)
• Each tile couples a compute processor (PC, IMEM, DMEM, REG, ALU, FPU) with a programmable switch (SMEM, PC, SWITCH)
• Lots of ALUs and registers; short, programmable wires; lower power
• Programmable: supports both ILP and streams

A prototype TPA: the Raw microprocessor [billion-transistor IEEE Computer issue, '97]
• The Raw chip is a mesh of identical tiles; packet streams, disk streams, video, and RDRAM attach at the edges
• Software-scheduled interconnects: routing can be static or dynamic, but the compiler determines instruction placement and routes
• Tight integration of the interconnect: the static router's input and output FIFOs are mapped into the register file as r24–r27 and feed the 0-cycle "local bypass network" of the processor pipeline (IF, D, E, M1, M2, A, TL, F, P, TV, U, F4, WB stages around the RF)

Point-to-point, bypass-integrated, compiler-orchestrated on-chip networks.

How to "program the wires"
• Tile 10 computes fmul r24, r3, r4 while its switch executes route P->E; Tile 11 computes fadd r5, r3, r25 while its switch executes route W->P
• Each switch is a software-controlled crossbar: writing r24 injects the fmul result eastward, and reading r25 consumes the operand arriving from the west

The result of orchestrating the wires: one chip concurrently hosts a C program spread across several tiles as an ILP computation, an MPI program, an httpd, a custom datapath pipeline, and streams to memory, while unused tiles sleep.

Pipeline perspective
• We have replaced the bypass paths, ALU-register bus, FPU-integer bus, register-cache bus, cache-memory bus, etc. with a general, point-to-point, routed interconnect: the scalar operand network (SON)
• The SON is a fundamentally new kind of network, optimized for both scalar and stream transport

Programming models and software for tiled processor architectures
o Conventional scalar programs (C, C++, Java) — or, how to do ILP
o Stream programs

Scalar (ILP) program mapping [Lee, Amarasinghe et al., "Space-Time Scheduling", ASPLOS '98]
Existing languages will work. Start with a C program; several transformations later it is in renamed three-address form (a reconstructed C source is sketched below):

  seed.0 = seed              v1.2 = v1                  v2.4 = v2
  pval1 = seed.0 * 3.0       pval0 = pval1 + 2.0        tmp0.1 = pval0 / 2.0
  pval2 = seed.0 * v1.2      tmp1.3 = pval2 + 2.0
  pval3 = seed.0 * v2.4      tmp2.5 = pval3 + 2.0
  pval5 = seed.0 * 6.0       pval4 = pval5 + 2.0        tmp3.6 = pval4 / 3.0
  pval6 = tmp1.3 - tmp2.5    v2.7 = pval6 * 5.0
  pval7 = tmp1.3 + tmp2.5    v1.8 = pval7 * 3.0
  v0.9 = tmp0.1 - v1.8       v3.10 = tmp3.6 - v2.7
  tmp0 = tmp0.1   tmp1 = tmp1.3   tmp2 = tmp2.5   tmp3 = tmp3.6
  v0 = v0.9       v1 = v1.8       v2 = v2.7       v3 = v3.10
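The slide shows only the transformed three-address code. As a concrete anchor, here is a minimal C sketch of a source program that lowers to that listing: every expression and constant is read off the slide (with the slide's "seed.o" typo corrected to seed.0), while the function wrapper, the float type, the test values in main, and the elision of the tmp0–tmp3 copy-outs are assumptions made for illustration.

    #include <stdio.h>

    /* Reconstructed from the three-address listing above; the temporaries
     * pval*/tmp* in the comments name the corresponding slide values.     */
    static void kernel(float seed, float *v0, float *v1, float *v2, float *v3)
    {
        float tmp0 = (seed * 3.0f + 2.0f) / 2.0f; /* pval1, pval0, tmp0.1 */
        float tmp1 = seed * *v1 + 2.0f;           /* pval2, tmp1.3        */
        float tmp2 = seed * *v2 + 2.0f;           /* pval3, tmp2.5        */
        float tmp3 = (seed * 6.0f + 2.0f) / 3.0f; /* pval5, pval4, tmp3.6 */
        *v2 = (tmp1 - tmp2) * 5.0f;               /* pval6, v2.7          */
        *v1 = (tmp1 + tmp2) * 3.0f;               /* pval7, v1.8          */
        *v0 = tmp0 - *v1;                         /* v0.9                 */
        *v3 = tmp3 - *v2;                         /* v3.10                */
    }

    int main(void)
    {
        float v0 = 0.0f, v1 = 1.0f, v2 = 2.0f, v3 = 0.0f;
        kernel(0.5f, &v0, &v1, &v2, &v3);         /* hypothetical inputs  */
        printf("v0=%f v1=%f v2=%f v3=%f\n", v0, v1, v2, v3);
        return 0;
    }

Note the abundant instruction-level parallelism: the four chains through tmp0–tmp3 are mutually independent until the final combines, which is exactly what the mapping below exploits.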
The compiler maps this code to tiles in four steps: build the program graph, cluster it, place the clusters, and then route and schedule.

Program graph and clustering
• The renamed code is turned into a dataflow graph, and operations are grouped into clusters so that tightly dependent chains stay together and inter-cluster operand traffic is minimized.

Placement
• Each cluster is assigned to a physical tile (Tile1–Tile4).

Routing and instruction scheduling
• For every tile the compiler emits two instruction streams: processor code and switch code. In the processor code, operand transport appears as send()/recv() on the register-mapped network ports; in the switch code it appears as route instructions such as route(t,E) (tile to east) or route(W,S) (west to south). One plausible reading of the slide's four scheduled columns, with time running downward (a threaded C sketch of this blocking transport follows below):

  Tile1: seed.0=seed; send(seed.0); pval1=seed.0*3.0; pval0=pval1+2.0;
         tmp0.1=pval0/2.0; send(tmp0.1); tmp0=tmp0.1
  Tile2: v1.2=v1; seed.0=recv(); pval2=seed.0*v1.2; tmp1.3=pval2+2.0;
         send(tmp1.3); tmp1=tmp1.3; tmp2.5=recv(); pval7=tmp1.3+tmp2.5;
         v1.8=pval7*3.0; v1=v1.8; tmp0.1=recv(); v0.9=tmp0.1-v1.8; v0=v0.9
  Tile3: v2.4=v2; seed.0=recv(); pval3=seed.0*v2.4; tmp2.5=pval3+2.0;
         tmp2=tmp2.5; send(tmp2.5); tmp1.3=recv(); pval6=tmp1.3-tmp2.5;
         v2.7=pval6*5.0; send(v2.7); v2=v2.7
  Tile4: seed.0=recv(); pval5=seed.0*6.0; pval4=pval5+2.0; tmp3.6=pval4/3.0;
         tmp3=tmp3.6; v2.7=recv(); v3.10=tmp3.6-v2.7; v3=v3.10

• The matching switch schedules (route(t,E), route(N,t), route(W,t), route(W,S), route(E,t), route(W,N), route(S,t), ...) move each operand hop by hop between the four tiles.
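The semantics that make this schedule safe is that send() and recv() block on the network FIFOs. Here is a minimal POSIX-threads sketch of that blocking operand transport for the first hop of the example (seed.0 from the producer tile to one consumer); chan_t, chan_send, chan_recv, and the tile1/tile2 functions are invented for illustration and model the FIFO semantics in software, not Raw's actual ISA.

    #include <pthread.h>
    #include <stdio.h>

    /* A 1-deep blocking channel standing in for a static-network FIFO.     */
    typedef struct {
        pthread_mutex_t m;
        pthread_cond_t  cv;
        int             full;
        double          val;
    } chan_t;

    static void chan_send(chan_t *c, double v) {   /* blocks while full     */
        pthread_mutex_lock(&c->m);
        while (c->full) pthread_cond_wait(&c->cv, &c->m);
        c->val = v; c->full = 1;
        pthread_cond_signal(&c->cv);
        pthread_mutex_unlock(&c->m);
    }

    static double chan_recv(chan_t *c) {           /* blocks while empty    */
        pthread_mutex_lock(&c->m);
        while (!c->full) pthread_cond_wait(&c->cv, &c->m);
        double v = c->val; c->full = 0;
        pthread_cond_signal(&c->cv);
        pthread_mutex_unlock(&c->m);
        return v;
    }

    static chan_t seed_ch = { PTHREAD_MUTEX_INITIALIZER,
                              PTHREAD_COND_INITIALIZER, 0, 0.0 };

    static void *tile1(void *arg) {  /* producer: sends, then keeps computing */
        double seed = 7.0;                         /* hypothetical input    */
        chan_send(&seed_ch, seed);                 /* send(seed.0)          */
        double tmp0 = (seed * 3.0 + 2.0) / 2.0;    /* local chain continues */
        printf("tile1: tmp0 = %f\n", tmp0);
        return NULL;
    }

    static void *tile2(void *arg) {  /* consumer: stalls until operand lands */
        double v1 = 2.0;                           /* hypothetical input    */
        double seed = chan_recv(&seed_ch);         /* seed.0 = recv()       */
        double tmp1 = seed * v1 + 2.0;
        printf("tile2: tmp1 = %f\n", tmp1);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, tile1, NULL);
        pthread_create(&t2, NULL, tile2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

On Raw the same effect costs no library calls at all: the FIFO is a register name, so the "send" is just writing r24 in the fmul destination field, and flow control in the switch provides the blocking.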
Raw die photo
• 0.18 micron process, 16 tiles, 425 MHz, 18 W (running vpenta); of course, a custom IC designed by an industrial design team could do much better. (Photos: Raw die and Raw motherboard.)

Generation 1: Nexperia (MIPS PR4450)
[Block diagram: a MIPS CPU and two TM32 TriMedia VLIW cores with caches, a memory controller, and dozens of peripheral/accelerator IP blocks (MPEG, 1394, MBS, VIP, QVCP, AIO, UART, USB, IIC, ...) connected through bridges (M-PI, T-PI, C-Bridge) and DCS buses.]
• Focus on computation: programmable cores, domain-specific cores, L1 caches; the reuse level is raised from standard cells (SC) to IP blocks
• Communication is kept straightforward: heterogeneous buses + bridges
• Data is communicated via external memory, under synchronization control of a programmable core
• Example: Nexperia — 0.18 µm / 8M, 1.8 V / 4.5 W, 75 clock domains, 35 M transistors

Conventional architectures (Nexperia): a symmetric multiprocessor [Culler]
• A task graph of jobs (in1 -> job1, in2 -> job2, ... -> out) is mapped onto processors and accelerators, each with caches and an (RT)OS, on a bus-based interconnect to shared memory
• Rationale: driven by flexibility — applications are unknown at design time
• Dynamic load balancing via flexible binding of Kahn processes to processors
• An extension of well-known computer architectures (shared memory, caches, ...), adopting a general-purpose view and using existing skills
• Key issues: cache coherency, memory consistency, and the (RT)OS
• Performance analysis via simulation

Problems (1): timing events
• Classic approach: processors communicate point-to-point via SDRAM buffers (C_1, C_2, ..., C_20) under synchronisation control of the CPU
• P1: communication makes extra claims on a scarce resource (memory bandwidth)
• P2: lots of events are exchanged with the higher software levels (application level, sync level, drivers) just so a consumer starts when its data is available

Problems (2): timing & processor stalls
• Processor stalls, e.g. 60% of cycles, with large variation
• Caused by miss rates and miss penalties across caches, memories, and busses: e.g. a 3% L1 miss rate, with a penalty through L2 and SDRAM of best case 8 cc, average case 20 cc, worst case 3000 cc
• Unpredictability at every arbiter: "easy to program, hard to tune" — at what programming effort, and at what cost?

Problems (3): end-to-end timing
• A typical video flow graph [Janssen, Dec. '02] traverses many IPs (TM-CPUs with D$/I$, an STC ASIP) and the DDR SDRAM; the interaction between multiple local arbiters makes the end-to-end timing of the application hard to establish.

Problems (4): compositionality
• Multiple independent applications are active simultaneously on the same SoC and share the DDR SDRAM; resource sharing couples their timing.

Generation 1: problem summary
• Timing: coarse-level synchronization events; latency-critical traffic; end-to-end timing behavior of the application is unknown; composition of several applications sharing resources (virtualization) is unsolved; a 3% miss rate with a 20 cc penalty already yields 60% stalls; local arbiters interact
• Power: roughly 2x the necessary dissipation
• Area: expensive caches, DDR SDRAM
• NRE cost (>20 M$ per ITRS, largely due to software): verification by simulation

Towards a solution
• Treat the chip as a distributed system: tiles become very much autonomous (GALS techniques)
• For performance and predictability reasons we want to decouple communication from computation: tiles run independently from other tiles and from communication actions
• Add a communication assist (CA) that acts as the initiator of communication actions, arbitrates the access to the local memory, and can stall the processor; this way communication and computation concerns are separated (a sketch of the resulting programming style closes this section)

Generation 2 architecture [Culler][Hijdra]
• Cluster/tile = computation + local memory; clusters are heterogeneous: CPU, DSP, ASIPs, ASICs, memory-only, or I/O
• Clusters are autonomous; each pairs its processor and local memory with a CA (slave toward the processor, master toward the network), and communication is done via an on-chip network, with the SDRAM controller attached as one more cluster
• A generic scalable multiprocessor architecture = a collection of essentially complete computers, including one or more processors and memory, communicating through a general-purpose high-performance scalable interconnect and a communication assist. [C&S, p. 51]
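To make the separation of concerns concrete, here is a minimal C sketch of the double-buffered style a CA enables, under stated assumptions: ca_start_fill()/ca_wait_done() stand in for a hypothetical CA driver interface (invented for illustration; on silicon the CA is a hardware block that moves the data while the processor keeps computing), and they are stubbed out in software so the example runs.

    #include <stdio.h>
    #include <stddef.h>

    #define BUF_WORDS 256

    static int    next_word   = 0;     /* stub "stream source"              */
    static int   *pending_dst = NULL;  /* transfer posted, not yet complete */
    static size_t pending_len = 0;

    static void ca_start_fill(int *dst, size_t words) { /* post a transfer  */
        pending_dst = dst;
        pending_len = words;
    }

    static void ca_wait_done(void) {   /* stall until the CA finishes       */
        for (size_t i = 0; i < pending_len; i++)
            pending_dst[i] = next_word++;
        pending_dst = NULL;
        pending_len = 0;
    }

    /* The processor computes on `work` while the CA fills `fill`; the two
     * never touch the same buffer, so communication and computation
     * overlap without interfering.                                         */
    static long process_stream(int nblocks) {
        static int bufA[BUF_WORDS], bufB[BUF_WORDS];
        int *work = bufA, *fill = bufB;
        long acc = 0;

        ca_start_fill(work, BUF_WORDS);     /* prime the first buffer       */
        for (int b = 0; b < nblocks; b++) {
            ca_wait_done();                 /* the CA may stall us here     */
            if (b + 1 < nblocks)
                ca_start_fill(fill, BUF_WORDS);
            for (int i = 0; i < BUF_WORDS; i++)
                acc += work[i];             /* compute on local memory only */
            int *t = work; work = fill; fill = t;  /* swap buffers          */
        }
        return acc;
    }

    int main(void) {
        printf("sum = %ld\n", process_stream(4));
        return 0;
    }

The processor only ever blocks at ca_wait_done(), so its timing depends on the CA's contract rather than on cache misses and bus arbitration: exactly the predictability argument made above.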