Multiprocessor System-on-Chip (MPSoC) Technology
Wayne Wolf, Ahmed Amine Jerraya and Grant Martin
Presented by Santosh Ponnala
1
Brief Overview
• Introduction
• Multiprocessors and the Evolution of MPSoCs
• How Applications Influence Architecture
• Architectures for Real-Time Low-Power Systems
• CAD Challenges in MPSoCs
• Conclusion
2
Introduction
• What is an MPSoC?
• Where are they used?
• System requirements?
• Why MPSoC?
• What is a multiprocessor?
• How is an MPSoC different from a multiprocessor?
3
What is a Parallel Architecture?
• A large collection of processing elements that communicate and cooperate to solve large problems fast. [Almasi and Gottlieb]
• “collection of processing elements”
– How many? How powerful is each? Scalability?
• “that can communicate”
– How do PEs communicate? (shared memory vs. message passing)
– Interconnection networks (bus, crossbar, ...)
[Figures: Serial Computing; Parallel Computing]
4
Why Use Parallel Computing?
Main Reasons:
• Save time and/or money
• Solve larger Problems
• Provide Concurrency
• Limits to serial computing
5
Taxonomy Of Parallel Computers
6
Vector vs Array Processing
Vector Processing
Let n be the size of each vector. Then the time to compute f(V1, V2) is k + (n-1) cycles, where k is the length (number of stages) of the pipeline for f.
Array Processing
In array processing, the operations f(V1j, V2j) on the components of the two vectors are all carried out simultaneously, in one step.
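The two timing claims above can be checked with a few lines of Python. This is only an illustrative sketch: the function names, the use of cycles as the time unit, and the generalization to fewer processing elements than vector components (`num_pes`) are assumptions that do not come from the slides.

```python
import math


def pipelined_vector_time(n: int, k: int) -> int:
    """Cycles to push n element pairs through a k-stage pipeline for f.

    The first result appears after k cycles; each of the remaining
    n - 1 results follows one cycle later, giving k + (n - 1).
    """
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    return k + (n - 1)


def array_time(n: int, num_pes: int) -> int:
    """Steps for an array processor with num_pes processing elements.

    With at least n PEs every component is handled at once (one step).
    The ceil() for num_pes < n is an assumption added for illustration.
    """
    if n < 1 or num_pes < 1:
        raise ValueError("n and num_pes must be positive")
    return math.ceil(n / num_pes)
```

For example, with n = 64 and k = 4 the pipelined version needs 4 + 63 = 67 cycles, while an array of 64 processing elements finishes in one step.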
7
Early Multiprocessors
[Figure: Architecture of the ILLIAC IV. CU = Control Unit, PE = Processing Element, PEM = PE Memory module.]
• The ILLIAC IV was not fully operational until 1975. Between then and 1981 it was the world's fastest computer.
• It performed vector and array operations in parallel.
• Level of integration tracks Moore’s law: doubling every 18-24 months.
• Generic model of multiprocessors: a collection of computers (CPU + memory) communicating over an interconnect network. [Culler et al.]
8
Why did uniprocessor performance grow so fast?
• ~ half from circuit improvement (smaller transistors, faster
clock, etc.)
• ~ half from architecture/organization:
• Instruction Level Parallelism (ILP)
– Pipelining: RISC, CISC with RISC backend
– Superscalar
– Out of order execution
• Memory hierarchy (Caches)
– Exploiting spatial and temporal locality
– Multiple cache levels
9
History of Multiprocessors
• 80s – early 90s: prime time for parallel architecture research
– A full processor cannot fit on one chip, so systems naturally need multiple chips (and processors)
• 90s: at the low end, uniprocessor system speed grows much faster than parallel system speed
– A microprocessor fits on a chip. So do the branch predictor, multiple functional units, large caches, etc.
– Microprocessors also exploit parallelism (pipelining, multiple issue, VLIW), forms of parallelism originally invented for multiprocessors
• 90s: emergence of distributed (vs. parallel) machines, driven by progress in network technologies
– Network bandwidth grows faster than Moore’s law
– Fast interconnection networks get cheap
– They connect cheap uniprocessor systems into a large distributed machine
– Network of Workstations, clusters, Grid.
• 00s: parallel architectures are back
– Transistors per chip >> transistors needed for one microprocessor
– Harder to get more performance from a uniprocessor
– E.g. Intel Pentium D, Core Duo, AMD Dual Core, IBM Power5, Sun Niagara, etc.
10
History of MPSoCs
1. Lucent Daytona MPSoC
• Designed for wireless base stations, in which identical signal processing was performed on a number of data channels.
• Split-transaction bus.
• Processing element is based on SPARC V8.
• Reconfigurable L1 cache.
• SIMD architecture
11
2. C-5 Network Processor
• Application: packet processing in networks.
• Packets are handled by channel processors.
• Each cluster has 4 processors.
• Packet processors intercept individual IP data packets and process them using application software.
• Executive processor: RISC CPU
• Operating frequency: 166 MHz - 233 MHz
12
3. Philips Viper Nexperia
• Application: multimedia processing.
• Has two CPUs:
– master: MIPS PR3940
– slave: TriMedia TM32
• Has three buses.
• Memory controller for the external DRAM interface, and DMA units for each CPU.
• Can run many operating systems, including Windows CE, Linux, and VxWorks.
• The CPUs share the same resources and use semaphores to negotiate ownership of shared resources.
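The semaphore idea in the last bullet can be illustrated in software. The sketch below is a loose analogy only: two Python threads stand in for the master and slave CPUs, and `resource_semaphore`, `cpu_task`, and `run_demo` are hypothetical names. It does not model the actual hardware mechanism on the chip.

```python
import threading

# One binary semaphore guards one shared resource (for example a shared
# memory region). Both the name and the resource are hypothetical.
resource_semaphore = threading.Semaphore(1)
event_log = []  # (cpu_name, event) tuples, appended only while holding the semaphore


def cpu_task(cpu_name: str, accesses: int) -> None:
    """Acquire the semaphore, use the resource, release it; repeat."""
    for _ in range(accesses):
        with resource_semaphore:  # blocks until ownership is granted
            event_log.append((cpu_name, "acquire"))
            # ... the current owner would use the shared resource here ...
            event_log.append((cpu_name, "release"))


def run_demo(accesses: int = 3) -> list:
    """Run two threads standing in for the master and slave CPUs."""
    event_log.clear()
    threads = [
        threading.Thread(target=cpu_task, args=(name, accesses))
        for name in ("master", "slave")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(event_log)
```

Because both log entries are written while the semaphore is held, every "acquire" in the log is immediately followed by a "release" from the same CPU, whatever order the threads happen to run in.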
13
4. TI OMAP 5912
• Application: Cell phone Processor.
• Designed to support 2.5G and 3G wireless
applications.
• In addition to basic voice services, it is intended
for speech processing, location-based services,
security, gaming, and multimedia.
• Has two CPUs: an ARM9 and a TMS320C55x
digital signal processor (DSP)
• C55x DSP performs signal processing as slave.
• ARM runs operating system, dispatches tasks to
DSP.
• SRAM capacity: 192 KB
14
5. STMicro Nomadik
• Designed for mobile multimedia.
• Accelerators built around the MMDSP+ core:
– One instruction per cycle.
– 16- and 24-bit fixed-point, 32-bit floating-point.
• Host processor: ARM926EJ
• Two programmable accelerators on the bus: a video accelerator and an audio accelerator.
• The video accelerator is itself a heterogeneous MP.
15
Moore’s Law
• A law of physics
• A law of process technology
• A law of micro-architecture
• A law of psychology
• Most of us are familiar with
Moore’s Law growth of
transistors
• Other characteristics appear to
have reached a ceiling
16
Multiprocessors: Implementation Technology concerns
(billion-transistor CMOS implementation technology)
• Design Issues.
– Transistor gate delay
– Interconnect delay*
– Exponential increase in processor clock rates
• Result of these trends.
• Design Complexity.
17
UltraSPARC Niagara
• 8 CPU cores
• Only a single floating-point unit
• 4 DDR2 buses
• 4-way L2 cache
• Built-in self-test
• Operated at 1.4 GHz
• Capable of processing up to 32 concurrent threads.
18
Comparing Alternative Multiprocessor Architectures
[Figure panels: Superscalar, SMT, CMP]
• Logic, wire, and design complexity will increasingly favor CMP over superscalar and SMT implementations.
19
Parallel vs Distributed Computers
Characteristics of Superscalar, SMT, and CMP
architectures
20
How to use increasing Transistors
Year  Processor          Transistors  Feature size  Data width  Frequency  Features
1971  4004               2300         10000 nm      4           740 kHz    First microprocessor
1978  8086               29000        3000 nm       16          10 MHz     IBM PC/AT
1985  80386              275000       1000 nm       32          33 MHz     Pipelining
1989  80486              1200000      800 nm        32          100 MHz    Integral FPU
1993  Pentium            3100000      800 nm        32          150 MHz    On-chip L1 cache; superscalar
1995  Pentium Pro        5500000      600 nm        32          200 MHz    Out-of-order execution
1997  Pentium MMX P55C   4500000      350 nm        32          450 MHz    Dynamic branch prediction; MMX (SIMD) instructions
1999  Pentium III        28000000     180 nm        32          1.1 GHz    On-chip L2 cache
2004  Pentium 4E         125000000    90 nm         32          3.8 GHz    Hyper-threading
2006  Xeon Tulsa         167000000    65 nm         64          3.4 GHz    Dual-core
2010  Xeon 7500 Nehalem  2300000000   45 nm         64          2.26 GHz   Eight cores
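As a rough sanity check on the transistor counts listed for this slide, the average doubling period can be estimated from the first and last entries (4004 in 1971 with 2300 transistors, Xeon 7500 in 2010 with 2.3 billion). The helper name below is hypothetical and the calculation assumes smooth exponential growth between the two points.

```python
import math


def doubling_period_years(year0, count0, year1, count1):
    """Average years per doubling between two (year, transistor count) points."""
    doublings = math.log2(count1 / count0)
    return (year1 - year0) / doublings


# First and last entries of the slide's data: 4004 (1971) and Xeon 7500 (2010).
period = doubling_period_years(1971, 2_300, 2010, 2_300_000_000)
print(f"{period:.2f} years per doubling")
```

The result comes out just under two years per doubling, which is in line with the 18-24 month figure usually quoted for Moore's law.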
21
Multi-nonsense
• Multi-core was a solution to a performance problem
• Hardware works sequentially
• Make the hardware simple – thousands of cores
• Do in parallel at a slower clock rate to save power
• ILP is dead
• Examine what is (rather than what can be)
• Communication: off-chip hard, on-chip easy
• Abstraction is a pure good
• Programmers are all dumb and need to be protected
• Thinking in parallel is hard
22
Power and Memory considerations
23
Performance Improvements
• Computer engineers improve performance by reducing C/I (cycles per instruction, CPI)
• I/P (instructions per program) is the domain of CS: writing software
• S/C (seconds per cycle) is the domain of EE/VLSI: IC fabrication
• CPI is improved by getting more instructions done in each cycle
• This means doing work in parallel, distributed across the functional units of the IC
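The three ratios on this slide are commonly combined into a single expression for execution time: seconds per program = (instructions per program) × (cycles per instruction) × (seconds per cycle). The sketch below assumes that standard relationship; the function names and the example numbers are illustrative only.

```python
def execution_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Seconds = (instructions/program) * (cycles/instruction) * (seconds/cycle)."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    # Dividing by the clock rate is the same as multiplying by seconds per cycle.
    return instructions * cpi / clock_hz


def speedup_from_cpi(old_cpi: float, new_cpi: float) -> float:
    """Speedup when only C/I changes; I/P and S/C cancel out of the ratio."""
    return old_cpi / new_cpi
```

For instance, a program of 10^9 instructions at a CPI of 2.0 on a 1 GHz clock takes about 2 seconds; halving the CPI to 1.0 with the same program and clock halves the time.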
24
How Applications Influence Architecture
• Complex applications
• Nature of the computations
• E.g. MPEG-2 encoder.
• Memory bandwidth requirements of an encoder vary across the block diagram.
[Figure: MPEG-2 encoder]
• Standards-based design
• Many high-volume markets are standards-driven:
– wireless
– multimedia
– networking
• The standard defines the basic I/O requirements.
• Real-time operation.
• Low power/energy operation.
• Standards committees often provide reference implementations (very single-threaded).
25
Platform based design
• What is a Platform?
A partial design:
– for a particular type of system
– includes embedded processor(s), may include embedded software
– customizable to a customer’s requirements:
• software
• component changes
• Why Platforms?
Any given space has a limited number of good solutions to its basic problems.
– A platform captures the good solutions to the important design challenges in that space.
– A platform reuses architectures.
• Standards encourage platform-based design.
26
Alternative to platforms
• General-purpose architectures.
– May require much more area to accomplish the same task.
– Often much less energy-efficient.
• Reconfigurable systems.
– Good for pieces of the system, but tough to compete with software for miscellaneous tasks.
[Figures: Intel; Xilinx]
27
Platform vs. full-custom
• Platform has many fewer degrees of freedom:
– harder to differentiate
– can analyze design characteristics.
• Full-custom:
– extremely long design cycles
– may use less aggressive design styles if you can’t
reuse some pieces.
• Costs of platform-based design
– Masks.
– design of the platform + customization.
– Design verification.
28
Platform based Design (reduces cost)
• Divide system design into 2 phases
• design a platform for a class of
applications
• adapt the platform for a particular
product in that application space
• Homogeneous MP vs Heterogeneous MP
• Examples of platforms:
• Data rate
• Power and energy consumption
• Buffering and Memory Management
• Product Design -- S/W Driven
(Customization)
• Usefulness of platform depends largely on
the quality and capabilities of the SDE.
29
Architectures for Real-time Low-Power Systems
• Performance and Power efficiency
• Benchmarks: high-performance data networking, voice recognition, video compression/decompression, and other applications
[Figure: Power consumption trends for desktop processors, from Austin et al. [Aus04], 2004 IEEE Computer Society]
30
Architectures for Real-time Low-Power Systems (contd.)
• Real-Time Performance
– Homogeneous architecture
– Heterogeneous architecture
– E.g. shared-memory MP
– Software methods to eliminate conflicts.
• Application Structure
– Homogeneous vs heterogeneous architecture
31
CAD Challenges in MPSoCs
1. Configurable processors and instruction set synthesis.
• CPU configuration (tools that generate an HDL description)
• Coarse-grained and fine-grained instruction extensions.
• E.g. MIMOLA, LISA, Tensilica Xtensa.
• Instruction set synthesis
• 1% rule [Holmer and Despain]
32
CAD Challenges in MPSoCs (contd.)
2. Encoding
• Signal encoding improves area and power consumption.
• E.g. code compression [Wolfe and Chanin (Huffman)] and bus encoding.
• Data compression (more complex)
• E.g. Lempel-Ziv compression (L3 - MM)
• Bus-invert coding [Stan and Burleson]
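Bus-invert coding, named in the last bullet, reduces switching on a bus by sending the bitwise complement of a word (plus one extra "invert" line) whenever that would toggle fewer data lines. The sketch below is a simplified illustration: the function names, the 8-bit default width, and the all-zero initial bus state are assumptions, and the transition of the invert line itself is not taken into account when deciding.

```python
def bus_invert_encode(words, width=8):
    """Return a list of (bus_value, invert_bit) pairs for the given data words."""
    mask = (1 << width) - 1
    prev = 0  # assumed initial state of the data lines
    encoded = []
    for w in words:
        w &= mask
        toggles = bin(prev ^ w).count("1")  # Hamming distance to current bus state
        if toggles > width // 2:
            sent, invert = (~w) & mask, 1   # cheaper to send the complement
        else:
            sent, invert = w, 0
        encoded.append((sent, invert))
        prev = sent
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [((~sent) & mask) if invert else sent for sent, invert in encoded]


def count_transitions(values, prev=0):
    """Total number of bit toggles when the values are driven in sequence."""
    total = 0
    for v in values:
        total += bin(prev ^ v).count("1")
        prev = v
    return total
```

A sequence that alternates between 0x00 and 0xFF is the extreme case: unencoded, every change toggles all eight data lines, while the encoded stream keeps the data lines constant and toggles only the invert line.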
33
CAD Challenges in MPSoCs (contd.)
3. Interconnect-driven design
• Early SoCs were driven by design approach.
• Interconnect choices are based on conventional bus concepts.
• Bus? (Single set of wires shared among multiple devices)
• Best-known SoC buses: ARM AMBA, IBM CoreConnect.
• Growth in complexity of SoCs (communication bottleneck)
• Network on chip (NoC)
• Use a hierarchical network with routers for data communication
• Single shared bus vs multiple communication channels
• E.g. Sonics SiliconBackplane (TDMA-style interconnection network)
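The general idea behind a TDMA-style interconnect can be sketched as a fixed table of time slots, each owned by one initiator, so that the longest wait for bus access is known in advance. The code below is a generic illustration under that assumption; the slot table, initiator names, and function names are hypothetical and do not describe the actual Sonics protocol.

```python
def wait_cycles(slot_table, initiator, cycle):
    """Cycles a request issued at `cycle` waits until the initiator's next slot."""
    n = len(slot_table)
    for wait in range(n):
        if slot_table[(cycle + wait) % n] == initiator:
            return wait
    raise ValueError(f"{initiator!r} owns no slot in the table")


def worst_case_wait(slot_table, initiator):
    """Longest possible wait over every cycle at which a request could arrive."""
    return max(wait_cycles(slot_table, initiator, c) for c in range(len(slot_table)))


# Hypothetical schedule: the CPU gets every other slot, DSP and DMA one slot each.
SLOT_TABLE = ["cpu", "dsp", "cpu", "dma"]
```

With this table the CPU never waits more than one cycle, while the DSP and the DMA unit may wait up to three. Giving an initiator more slots trades bandwidth from the others for a tighter latency bound.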
34
CAD Challenges in MPSoCs (contd.)
4. Memory system optimizations
• Cache: everything (placement, replacement, allocation, and write-back) is managed by hardware.
• vs. scratchpad: everything is managed by software.
• Servers and general-purpose systems use caches.
• A scratchpad provides predictability of hits/misses.
• Important for ensuring the real-time property.
• Complexity increases with applications.
• Worst-case time is more tightly bounded.
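The predictability argument can be made concrete with a toy model. In the sketch below, a small direct-mapped cache decides hits from its access history, whereas a scratchpad access is fast exactly when the address lies in a range that software has chosen. The sizes, addresses, and function names are assumptions for illustration, not parameters of any real system.

```python
def cache_hits(addresses, num_lines=4, line_size=16):
    """Hit/miss per access for a direct-mapped cache (hardware-managed)."""
    tags = [None] * num_lines
    hits = []
    for addr in addresses:
        block = addr // line_size
        index = block % num_lines
        hits.append(tags[index] == block)  # hit only if this block is resident
        tags[index] = block                # on a miss, replace the resident block
    return hits


def scratchpad_hits(addresses, base=0, size=64):
    """An access is fast exactly when it falls in the software-chosen range."""
    return [base <= addr < base + size for addr in addresses]
```

Alternating between addresses 0 and 64 makes the two blocks evict each other in this cache, so every access misses, while repeating address 0 hits after the first access. The scratchpad result for address 0 is the same in both traces, which is why its worst-case timing is easier to bound.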
35
CAD Challenges in MPSoCs (contd.)
5. Hardware/Software Codesign
• Used to explore the design space of heterogeneous MPs
• Cost estimation (area, power, and performance)
6. SDEs
• SDEs exist for single processors (commercial and open-source)
• No comparable retargeting technology for multiprocessors
• MPSoC development environments tend to be a collection of tools with no substantial connection between them.
• Difficult to determine the true state of the system.
36
Conclusion
• MPSoCs are an important chapter in the
history of multiprocessing
• System Designers like uniprocessors with
sufficient computation power.
• DSPs (Audio Processing)
• Von Neumann architecture supports
traditional software development tools
• Computational power (Moore’s Law) vs low-power, low-cost, real-time requirements.
37