The CA1024: A Massively Parallel Processor
for Cost-Effective HDTV
Connex Technology Proprietary
and Confidential
Company Background
• Fabless semiconductor company in Silicon Valley
• VC funded (series A & B)
• In the product-development stage with 26+ employees
– Deep experience with video algorithms, processor design, and
digital-video system software
• Core asset: ConnexArray™ vector-processor architecture
– Architecture verified in CA4096 test chip
• Six patent applications on Connex vector-processor technology
– 1 US patent granted, 3 US patents pending, 2 US provisional
– Granted and pending patents also filed in China, Taiwan, Korea,
EEC, Japan, Singapore
• Initial market focus on DTV
Presentation Agenda
• Why a massively parallel processor
(MPP)?
• How is MPP integrated in an SoC?
• Processor performance
• Project status
Challenges
• HDTV codec & post-processing are
computationally intensive
• Computation is dominated by data-parallel processes
• HDTV is a fast-evolving domain
• ASICs are a very costly solution
Our Solution:
Integral Parallel Machine
• Data-parallel computation
• Time-parallel computation (supported by
speculative parallelism)
• I/O process is transparent to the
computational process
Key Technology
• Fully programmable solution for HDTV video
encoding, decoding, and transcoding at the system
and algorithm levels
– Simple programming model
• Silicon-efficient architecture; die size competitive
with similar-function ASICs
– Re-use of transistors
– Minimal dedicated hard-wired blocks
• Sufficient performance to enable multistandard,
multichannel, high-definition DTV
– Linearly scalable
The Connex Architecture
[Figure: block diagram of the Connex Machine. Blocks shown: Sequencer, I/O Controller, and the Connex Array of m × n cells. Each cell is drawn with a RAM (addresses 0 to 255), registers R0 to R7, Connex, Aux, I/O, Index, and Select, and a 16-bit ALU.]
• CA1024-PVP: m = n = 32 for a 1,024-PE Connex Machine
• Test chip: m = n = 64 for a 4,096-PE Connex Array; sequencer and I/O control in an FPGA
• 3.2 GByte/sec I/O channel in parallel with code running on the Connex Array
Connex Cell Architecture
• PE (Processing Element) has eight accumulator registers, including Connex, Aux, and I/O special-function registers
• Select flag enables or disables instruction processing
• Index is a unique cell number used to direct certain instructions
• Bidirectional 16-bit bus to 256 RAM locations
• Connex register includes connections for shifts to/from adjacent PE
• Aux and I/O registers dedicated to specific instruction functions
[Figure: one Connex cell, drawn with a RAM (addresses 0 to 255), registers R0 to R7, Connex, Aux, I/O, Select, and Index, and a 16-bit ALU.]
ConnexArray Structure
• Replicated Connex cells each include PE and local RAM
• Linear interconnect of neighbor registers
• Conditional execution based on state of select bit or index value
• All selected cells execute the same instruction stream
[Figure: cells 0, 1, and 1023 of the array, each with a RAM (addresses 0 to 255), registers R0 to R7, and a 16-bit ALU; select state shown as On, Off, On.]
Connex Data-Array Structure
[Figure: data array of 16-bit data operands. Element index n runs from 0 to 1023 across each line; line index m runs from 0 to 255.]
256 lines with 1024 16-bit elements per line
1GByte data I/O in parallel with computation operations
Full Line Operations:
Operate On All Elements in Parallel
[Figure: Line i combined element by element with Line j (+, -, *, XOR, etc.) to give Line k; elements 0 to 1023, lines 0 to 255.]
Line k = Line i OP Line j
Line k = Line i OP scalar value (repeated for all elements)
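The full-line operations above can be sketched in plain Python. This is only an illustrative model, not Connex code: the helper name `line_op` is hypothetical, and wrapping results to 16 bits is an assumption about how overflow behaves.

```python
import operator

NUM_LINES = 256    # lines in the data array (RAM addresses 0..255)
NUM_CELLS = 1024   # elements per line, one per processing element
MASK16 = 0xFFFF    # ASSUMPTION: results wrap to 16 bits

def line_op(op, line_i, operand):
    """Apply op to every element; operand is another line or a scalar."""
    if isinstance(operand, int):
        operand = [operand] * len(line_i)  # scalar repeated for all elements
    return [op(a, b) & MASK16 for a, b in zip(line_i, operand)]

# Data array: 256 lines of 1024 16-bit elements.
data = [[0] * NUM_CELLS for _ in range(NUM_LINES)]
data[0] = list(range(NUM_CELLS))   # Line i
data[1] = [3] * NUM_CELLS          # Line j

data[2] = line_op(operator.add, data[0], data[1])  # Line k = Line i + Line j
data[3] = line_op(operator.xor, data[0], 0x00FF)   # Line k = Line i XOR scalar
```

On the real array each line operation would be issued once and executed by all 1,024 cells together; the Python loop only imitates that effect sequentially.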
Columns Active Based On
Repeating Patterns
[Figure: Line k = Line i OP Line j (+, -, *, XOR, etc.), computed only in the active columns; elements 0 to 1023, lines 0 to 255.]
Example: Mark all odd columns active. Or mark every third column active. Or mark every third and fourth column active, etc.
Columns Active Based On Results
of Previous Operations
[Figure: Line k = Line i OP Line j (+, -, *, XOR, etc.), computed only in the active columns; elements 0 to 1023, lines 0 to 255.]
Example: An apparently random set of columns is marked active, based on the data-dependent results of previous operations. This enables selective processing based on data content.
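The two ways of activating columns, by repeating index pattern and by the result of an earlier operation, can be modeled with Boolean masks. The sketch below is a plain-Python illustration under two assumptions: inactive columns keep their previous Line k value, and "every third column" starts at column 0. The helper `masked_op` and the sample data are hypothetical.

```python
NUM_CELLS = 1024

def masked_op(op, line_i, line_j, line_k, select):
    """Return the new Line k: op applied in active columns only.

    ASSUMPTION: inactive columns leave Line k unchanged.
    """
    return [op(a, b) if s else k
            for a, b, k, s in zip(line_i, line_j, line_k, select)]

def add(a, b):
    return a + b

line_i = list(range(NUM_CELLS))
line_j = [10] * NUM_CELLS
line_k = [0] * NUM_CELLS

# Repeating patterns derived from the cell index.
odd = [idx % 2 == 1 for idx in range(NUM_CELLS)]
every_third = [idx % 3 == 0 for idx in range(NUM_CELLS)]

# Data-dependent selection: columns where an earlier result exceeds a threshold.
previous = [(idx * 37) % 101 for idx in range(NUM_CELLS)]  # made-up earlier result
above = [value > 50 for value in previous]

k_odd = masked_op(add, line_i, line_j, line_k, odd)
k_data = masked_op(add, line_i, line_j, line_k, above)
```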
Outer-Loop Parallelism:
Program in context of 128+ data-structure instances
Example: 8x8 DCT
[Figure: 8x8 blocks laid side by side across elements 0 to 1023 of the data array (lines 0 to 255), with Line i and Line j marked.]
Example: 128 sets of 8x8 run in parallel in a 1024-cell array
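One possible data layout for the 128-block example is sketched below in plain Python. The layout (block b occupying columns 8b to 8b+7, one block row per line) is an assumption read off the figure, and the level shift is only a stand-in for a real DCT stage; `pack` and `unpack` are hypothetical helpers.

```python
BLOCK = 8
NUM_CELLS = 1024
NUM_BLOCKS = NUM_CELLS // BLOCK  # 128 blocks fit side by side

def pack(blocks):
    """Lay out the 8x8 blocks as 8 lines of 1024 elements each."""
    return [[blocks[b][r][c] for b in range(NUM_BLOCKS) for c in range(BLOCK)]
            for r in range(BLOCK)]

def unpack(lines):
    """Inverse of pack: recover the list of 8x8 blocks."""
    return [[lines[r][b * BLOCK:(b + 1) * BLOCK] for r in range(BLOCK)]
            for b in range(NUM_BLOCKS)]

# Made-up input: 128 blocks with distinct contents.
blocks = [[[b + r + c for c in range(BLOCK)] for r in range(BLOCK)]
          for b in range(NUM_BLOCKS)]
lines = pack(blocks)

# Eight line operations touch row r of all 128 blocks at once.
# Stand-in step: subtract 128 from every sample.
shifted = [[v - 128 for v in line] for line in lines]
result = unpack(shifted)
```

The point of the layout is that the program is written as if for a single 8x8 block, and the same instruction stream processes all 128 instances.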
I/O System
[Figure: I/O system block diagram. Blocks shown: Connex Array, IS, I/O Plane, IOC, Interrupts, Switch Fabric, DDR-DRAM Controller, and four DRAMs.]
Computational-Intensive
Architecture
• All forms of parallelism are strongly segregated
– Connex Array for data-parallel computation
– Speculative Array for time-parallel computation
• The granularity perfectly fits the application domain
– 16-bit processing elements
– no MACs, no FPUs, no multipliers…
High I/O Bandwidth
• External I/O: 3.2 GBytes/sec
– Serial access and random access with similar
performance
• Internal I/O: 400 GBytes/sec
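The external figure can be cross-checked against the DRAM interface labeled in the SoC block diagram later in the deck (64-bit wide, 400 MHz data rate). The internal figure is harder to verify: the sketch below assumes each of the 1,024 PEs moves one 16-bit word per clock cycle, which the deck does not state, and derives the clock rate that assumption would imply.

```python
DRAM_BUS_BITS = 64          # "64-bit Wide DRAM" in the SoC block diagram
DRAM_DATA_RATE_HZ = 400e6   # "400 MHz Data Rate" in the SoC block diagram

# External I/O: bytes per transfer times transfers per second.
external_bytes_per_sec = DRAM_BUS_BITS // 8 * DRAM_DATA_RATE_HZ

NUM_PES = 1024
BYTES_PER_ACCESS = 2             # 16-bit bus between each PE and its RAM
INTERNAL_BYTES_PER_SEC = 400e9   # figure quoted on the slide

# ASSUMPTION: every PE performs one 16-bit RAM access per clock cycle.
bytes_per_cycle = NUM_PES * BYTES_PER_ACCESS
implied_clock_hz = INTERNAL_BYTES_PER_SEC / bytes_per_cycle

print(f"external: {external_bytes_per_sec / 1e9:.1f} GBytes/sec")
print(f"implied clock: {implied_clock_hz / 1e6:.1f} MHz")
```

Under that assumption the internal figure corresponds to a clock of roughly 195 MHz; this is an inference, not a number given in the deck.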
Area & Power Efficiency
• 2 GOPS/mm² (peak performance)
• GOPS/Watt is 25–50 times greater than that of a
mature sequential technology
Programming Connex
• CPL (Connex Programming Language) is an extension of C with C/C++ syntax
• Code that operates on scalar data is written in regular C notation
• Connex-specific operators are defined for features not available in C, e.g. operations on vectors and selections
• CPL uses sequential operators and control structures on vector and select datatypes
• Using CPL, the Connex Machine is programmed the same way as conventional sequential machines
• Hides the complexities of the parallel execution hardware
• Complete SDK

{
    ...
    const short OFFSET = 15;
    ...
    short vector x, y;
    short vector min, max;
    ...
    sel = all;
    x += OFFSET;
    ...
    min = (x < y)? x : y;
    max = (x > y)? x : y;
    ...
}

Vectors are arrays of scalar components.
Selections are arrays of Boolean values that dictate which vector components are active.
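To make the semantics of the CPL fragment concrete, here is a rough plain-Python emulation. It is not CPL and not part of the SDK: vectors are modeled as lists, the selection as a list of Booleans, and the four-element sample data and the helper `where` are hypothetical.

```python
OFFSET = 15

def where(cond, a, b):
    """Element-wise (cond ? a : b) over three equal-length lists."""
    return [p if c else q for c, p, q in zip(cond, a, b)]

# Made-up sample vectors; a real vector would have one component per PE.
x = [1, 40, 7, 22]
y = [20, 5, 30, 37]

sel = [True] * len(x)                                    # sel = all;
x = [xi + OFFSET if s else xi for xi, s in zip(x, sel)]  # x += OFFSET;

vmin = where([a < b for a, b in zip(x, y)], x, y)        # min = (x < y)? x : y;
vmax = where([a > b for a, b in zip(x, y)], x, y)        # max = (x > y)? x : y;

print(x, vmin, vmax)
```

Each list comprehension corresponds to a single CPL statement, which is the sense in which the vector code reads like sequential code.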
Performance
• DCT: 0.35 clock cycles per pixel
• SAD: 0.0025 clock cycles per pixel
H.264 Dual HD Stream Decoding
Stage                                Clock Cycles Per Macroblock
Dezigzagging                         37.3
Intra Prediction                     54.1
IT/IQ                                97.3
Motion Compensation                  114.3
Deblocking Filter                    27.1
Total [ Clock Cycles/Macroblock ]    337.8
Allowed clock cycles per macroblock (2-channel 1080i): 409 cycles
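The 409-cycle budget can be related to a clock rate with a little arithmetic. The sketch below assumes 1080i is coded as 1920x1088 at 30 frames per second with 16x16 macroblocks; these are common H.264 conventions, not values stated in the deck, so the implied clock is an estimate only.

```python
WIDTH, HEIGHT = 1920, 1088   # ASSUMPTION: 1080 lines coded as 1088
MB_SIZE = 16                 # H.264 macroblocks are 16x16 pixels
FRAMES_PER_SEC = 30          # ASSUMPTION: 1080i at 60 fields per second
CHANNELS = 2                 # dual HD stream

BUDGET_CYCLES = 409          # allowed cycles per macroblock (from the slide)
TOTAL_CYCLES = 337.8         # stated total per macroblock (from the slide)

mbs_per_frame = (WIDTH // MB_SIZE) * (HEIGHT // MB_SIZE)
mbs_per_sec = mbs_per_frame * FRAMES_PER_SEC * CHANNELS

implied_clock_hz = BUDGET_CYCLES * mbs_per_sec
headroom_cycles = BUDGET_CYCLES - TOTAL_CYCLES
utilisation = TOTAL_CYCLES / BUDGET_CYCLES

print(f"macroblocks/sec: {mbs_per_sec}")
print(f"implied clock: {implied_clock_hz / 1e6:.1f} MHz")
print(f"headroom: {headroom_cycles:.1f} cycles ({1 - utilisation:.1%})")
```

Under these assumptions the budget corresponds to a clock of about 200 MHz, and the stated total leaves roughly 17% of the budget unused.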
H.264 CABAC (SA) Decoding
• Targeted profile and level: 4.1 Main Profile
• Bit rate per stream considered: 35 Mbps (45 Mbps maximum)
• Number of bins to decode using CABAC: 47M/sec
• Number of clock cycles per bin: 1 cycle
• Cycles to decode bins/stream: 50 MHz
• Typical bit rate expected for DVB: 10 Mbps
• Cycles to decode bins for typical stream (DVB): 15 MHz
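The cycle figures follow from the bin rate at one cycle per bin. The sketch below reproduces that arithmetic and, as an assumption, scales the bin rate linearly with bit rate to estimate the DVB case; the deck does not say how its 15 MHz figure was derived.

```python
BINS_PER_SEC = 47e6    # bins to decode per second at 35 Mbps (from the slide)
CYCLES_PER_BIN = 1     # from the slide
BIT_RATE = 35e6        # bit rate considered per stream
DVB_BIT_RATE = 10e6    # typical DVB bit rate

cycles_per_sec = BINS_PER_SEC * CYCLES_PER_BIN

# ASSUMPTION: the number of bins scales linearly with the bit rate.
bins_per_bit = BINS_PER_SEC / BIT_RATE
dvb_cycles_per_sec = bins_per_bit * DVB_BIT_RATE * CYCLES_PER_BIN

print(f"35 Mbps stream: {cycles_per_sec / 1e6:.1f} M cycles/sec")
print(f"10 Mbps stream: {dvb_cycles_per_sec / 1e6:.1f} M cycles/sec")
```

Both estimates fall slightly below the 50 MHz and 15 MHz quoted on the slide, which would be consistent with rounding up for margin, though the deck does not say so.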
Programmable Media Processor
[Figure: CA1024 SoC block diagram. Blocks shown: ConnexArray™, Instruction Sequencer, I/O Controller, SA, Host CPU, TS/Sec CPU, Audio CPU, Video CPU, Switch Fabric, DDR-DRAM Ctrl to 64-bit wide DRAM (400 MHz data rate), HOST I/F (PCI v2.2 or Generic), Ext. Bus, Flash, GPIO, I2C, Test ICE, and JTAG.]
• Video In / Video Out: BT.656/1120
• Audio In / Audio Out: I2S and S/PDIF (figure labels: 2x-I2S or S/PDIF, 5x-I2S, S/PDIF, 1xI2S)
• Functions: Multi-Codec Processing, Pre-Analysis, 3D Filter, Scaling, Graphics Processing, Video Merge/Blend, Motion Adaptive De-interlacing
CA1024 Project Status
[Figure: chip floorplan with blocks labeled CWOA, DDR, ACF, SA, PCI, four MIPS, and four CA256.]
• TSMC 0.13 micron
• 676-pin PBGA
• Samples Q3 2006
• [email protected]
In Summary…..
• Fully programmable processor
• Computational-intensive architecture
• High-bandwidth I/O
• Connex Programming Language & SDK
• Die-area and power-efficient architecture
Thank You !