The CA1024: A Massively Parallel Processor for Cost-Effective HDTV

Connex Technology Proprietary and Confidential

Company Background
• Fabless semiconductor company in Silicon Valley
• VC funded (series A & B)
• In the product-development stage with 26+ employees
  – Deep experience with video algorithms, processor design, and digital-video system software
• Core asset: ConnexArray™ vector-processor architecture
  – Architecture verified in CA4096 test chip
• Six patent applications on Connex vector-processor technology
  – 1 US patent granted, 3 US patents pending, 2 US provisional
  – Granted and pending patents also filed in China, Taiwan, Korea, EEC, Japan, Singapore
• Initial market focus on DTV

Presentation Agenda
• Why a massively parallel processor (MPP)?
• How is MPP integrated in an SoC?
• Processor performance
• Project status

Challenges
• HDTV codec & post-processing are computationally intensive
• Computation is dominated by data-parallel processes
• HDTV is a fast-evolving domain
• ASICs are a very costly solution

Our Solution: Integral Parallel Machine
• Data-parallel computation
• Time-parallel computation (supported by speculative parallelism)
• I/O process is transparent to the computational process

Key Technology
• Fully programmable solution for HDTV video encoding, decoding, and transcoding at the system and algorithm levels
  – Simple programming model
• Silicon-efficient architecture; die size competitive with similar-function ASICs
  – Re-use of transistors
  – Minimal dedicated hard-wired blocks
• Sufficient performance to enable multistandard, multichannel, high-definition DTV
  – Linearly scalable

The Connex Architecture
[Figure: block diagram with Sequencer, I/O Controller, and Connex Array (dimensions labeled m and n). Each cell shows a 16-bit RAM (addresses 0 to 255), registers R0 to R7, the AUX, I/O, Connex, Index, and Select elements, and a 16-bit ALU.]
• CA1024-PVP: m = n = 32 for a 1,024-PE Connex Machine
• Test chip: m = n = 64 for a 4,096-PE Connex Array; sequencer and I/O control in an FPGA
• 3.2 GByte/sec I/O channel in parallel with code running on the Connex Array

Connex Cell Architecture
[Figure: one Connex cell with RAM (addresses 0 to 255), registers R0 to R7, AUX, I/O, Connex, Select, Index, and a 16-bit ALU.]
• PE (Processing Element) has eight accumulator registers, including Connex, Aux, and I/O special-function registers
• Select flag enables or disables instruction processing
• Index is a unique cell number used to direct certain instructions
• Bidirectional 16-bit bus to 256 RAM locations
• Connex register includes connections for shifts to/from adjacent PE
• Aux and I/O registers dedicated to specific instruction functions

ConnexArray Structure
• Replicated Connex cells each include PE and local RAM
• Linear interconnect of neighbor registers
• Conditional execution based on state of select bit or index value
• All selected cells execute the same instruction stream
[Figure: cells 0, 1, ..., 1023, each with RAM (addresses 0 to 255), registers R0 to R7, and a 16-bit ALU. Cell 0 is marked On, cell 1 Off, cell 1023 On.]

Connex Data-Array Structure
[Figure: data array with elements n = 0 to 1023 across and lines m = 0 to 255 down.]
• 16-bit data operands
• 256 lines with 1024 16-bit elements per line
• 1 GByte data I/O in parallel with computation operations

Full Line Operations: Operate On All Elements in Parallel
[Figure: Line i OP Line j = Line k across columns 0 to 1023, where OP is +, -, *, XOR, etc.]
• Line k = Line i OP Line j
• Line k = Line i OP scalar value (repeated for all elements)

Columns Active Based On Repeating Patterns
[Figure: Line i OP Line j = Line k, with only the marked columns active.]
Example: Mark all odd columns active. Or mark every third column active. Or mark every third and fourth column active, etc.

Columns Active Based On Results of Previous Operations
[Figure: Line i OP Line j = Line k, with an irregular set of columns active.]
Example: Apparently random columns are marked active, based on data-dependent results of previous operations. This enables selective processing based on data content.

Outer-Loop Parallelism: Program in context of 128+ data-structure instances
Example: 8x8 DCT
[Figure: 8x8 blocks placed side by side across columns 0 to 1023, occupying lines 0 to 7.]
Example: 128 sets of 8x8 run in parallel in a 1024-cell array

I/O System
[Figure: block diagram with Connex Array, IS, I/O Plane, IOC, Interrupts, Switch Fabric, DDR-DRAM Controller, and four DRAM devices.]

Computational-Intensive Architecture
• All forms of parallelism are strongly segregated
  – Connex Array for data-parallel computation
  – Speculative Array for time-parallel computation
• The granularity perfectly fits the application domain
  – 16-bit processing elements
  – no MACs, no FPUs, no multipliers…

High I/O Bandwidth
• External I/O: 3.2 GBytes/sec
  – Serial access and random access with similar performance
• Internal I/O: 400 GBytes/sec

Area & Power Efficiency
• 2 GOPS/mm² (peak performance)
• GOPS/Watt is 25–50 times greater than a mature sequential technology

Programming Connex
• CPL (Connex Programming Language) is an extension of C with C/C++ syntax
• Code that operates on scalar data is written in regular C notation
• Connex-specific operators defined for features not available in C, e.g. operations on vectors, selections
• CPL uses sequential operators and control structures on vector and select datatypes

    {
        ...
        const short OFFSET = 15;
        ...
        short vector x, y;
        short vector min, max;
        ...
        sel = all;
        x += OFFSET;
        ...
        min = (x < y)? x : y;
        max = (x > y)? x : y;
        ...
    }

  Vectors are arrays of scalar components.
  Selections are arrays of Boolean values that dictate which vector components are active.

• Using CPL, the Connex Machine is programmed the same way as conventional sequential machines
• Hides the complexities of the parallel execution hardware
• Complete SDK

Performance
• DCT: 0.35 clock cycle per pixel
• SAD: 0.0025 clock cycle per pixel

H.264 Dual HD Stream Decoding

    Stage                   Clock cycles per macroblock
    Dezigzagging             37.3
    Intra Prediction         54.1
    IT/IQ                    97.3
    Motion Compensation     114.3
    Deblocking Filter        27.1
    Total                   337.8

Allowed clock cycles per macroblock (2-channel 1080i): 409 cycles

H.264 CABAC (SA) Decoding
• Targeted profile and level: 4.1 Main Profile
• Bit-rate/stream considered: 35 Mbps (45 Mbps maximum)
• Number of bins to decode using CABAC: 47M/sec
• Number of clock cycles per bin: 1 cycle
• Cycles to decode bins/stream: 50 MHz
• Typical bit-rate expected for DVB: 10 Mbps
• Cycles to decode bins for typical stream (DVB): 15 MHz

[Figure: CA1024 system block diagram. Blocks: CA1024 Programmable Media Processor with ConnexArray™, Instruction Sequencer, I/O Controller, SA, Switch Fabric, DDR-DRAM Ctrl, Host CPU, TS/Sec CPU, Audio CPU, Video CPU. Interfaces: 64-bit-wide DRAM (400 MHz data rate), GPIO, I2C, Test ICE JTAG, Ext. Bus, HOST I/F (PCI v2.2 or Generic), Flash, Video In (BT.656/1120), Video Out (BT.656/1120), Audio In (2x-I2S or S/PDIF, 1xI2S), Audio Out (2x-I2S or S/PDIF, 5x-I2S S/PDIF). Functions listed: Multi-Codec Processing, Pre-Analysis, 3D Filter, Scaling, Graphics Processing, Video Merge/Blend, Motion Adaptive De-interlacing.]

CA1024 Project Status
[Figure: chip diagram with blocks labeled CWOA, DDR, ACF, SA, PCI, four MIPS blocks, and four CA256 blocks.]
• TSMC 0.13 micron
• 676-pin PBGA
• Samples Q3 2006
• [email protected]

In Summary
• Fully programmable processor
• Computational-intensive architecture
• High-bandwidth I/O
• Connex Programming Language & SDK
• Die-area and power-efficient architecture

Thank You!
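
Supplementary sketch: the selection mechanism described in the "Columns Active" and "Programming Connex" slides can be mimicked in ordinary software to see how it behaves. The Python model below is only an illustration under stated assumptions. It is not CPL and not part of the Connex SDK, and the names select_pattern, select_where, and line_op are hypothetical. It assumes that a cell whose select flag is off keeps its previous destination value, and it does not model 16-bit wraparound.

```python
# Minimal software model of select-masked line operations.
# Hypothetical names throughout; not CPL and not the Connex SDK.

N_CELLS = 1024  # cells per line, as in the CA1024


def select_pattern(period, active_offsets, n=N_CELLS):
    """Selection that repeats every `period` columns.

    A column is active when (column index mod period) is in active_offsets.
    """
    offsets = set(active_offsets)
    return [(col % period) in offsets for col in range(n)]


def select_where(predicate, line):
    """Data-dependent selection: active where predicate(element) is true."""
    return [bool(predicate(v)) for v in line]


def line_op(op, a, b, sel, dest):
    """Return dest with (a OP b) written into the selected columns only.

    `b` may be a line or a scalar; a scalar is repeated for all elements.
    Assumption: unselected columns keep their old `dest` value.
    """
    if not isinstance(b, list):
        b = [b] * len(a)
    return [op(x, y) if s else d for x, y, s, d in zip(a, b, sel, dest)]


def add(p, q):
    return p + q


# Software analogue of the CPL fragment: sel = all; x += OFFSET; min/max.
OFFSET = 15
all_cells = [True] * N_CELLS
x = list(range(N_CELLS))
y = [500] * N_CELLS

x = line_op(add, x, OFFSET, all_cells, x)   # x += OFFSET in every cell
lo = line_op(min, x, y, all_cells, x)       # min = (x < y) ? x : y
hi = line_op(max, x, y, all_cells, x)       # max = (x > y) ? x : y

# Repeating-pattern selections.
odd = select_pattern(2, [1])                # columns 1, 3, 5, ...
every_third = select_pattern(3, [0])        # columns 0, 3, 6, ...

# Data-dependent selection, then an operation applied only where it holds.
big = select_where(lambda v: v > 1000, x)
bumped = line_op(add, x, 1, odd, x)         # +1 in odd columns only
```

In this model a selection is just a list of Booleans, so pattern-based and data-dependent selections are interchangeable arguments to the same line operation, which is the property the slides attribute to the select flag.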