Paper Review:
Area-Performance Trade-offs in
Tiled Dataflow Architectures
Ke Peng
Instructor: Chun-Hsi Huang
CSE 340 Computer Architecture
University of Connecticut
Reference
Steven Swanson, Andrew Putnam, Martha Mercaldi, Ken
Michelson, Andrew Petersen, Andrew Schwerin, Mark
Oskin, and Susan J. Eggers
Computer Science & Engineering, University of
Washington
“Area-Performance Trade-offs in Tiled Dataflow
Architectures”
Proceedings of the 33rd International Symposium on
Computer Architecture (ISCA), IEEE, 2006
Outline
Background Introduction
Experimental Infrastructure
WaveScalar Architecture
Evaluation
Conclusion
Background Introduction
Many issues must be addressed in processor design
Wire delay
Fabrication reliability
Design complexity
Processing elements (PEs) are designed once and
replicated across a chip
Examples of tiled architectures
RAW
SmartMemories
TRIPS
WaveScalar
Background Introduction
Tiled WaveScalar Architecture
(Captured online, University of Washington)
Background Introduction
Benefits of the PE design
Decreases design and verification time
Provides robustness against fabrication errors
Reduces wire delay for data and control signal
transmission
Good performance is achievable only if all aspects
of the microarchitecture are properly designed.
Challenges
Tile number vs. tile size
Highly utilized tiles vs. possibly more powerful tiles
Partitioning and distribution of data memory across the chip
Tile interconnection
Etc.
Background Introduction
This paper focuses on the WaveScalar processor and
explores the area-performance trade-offs
encountered when designing a tiled architecture.
WaveScalar is a tiled dataflow architecture
Based on PE replication
Hierarchical data networks
Distributed hardware data structures, including the
caches, store buffers, and specialized dataflow memories
(token store).
Experimental Infrastructure
Synthesizable RTL WaveScalar model
TSMC (Taiwan Semiconductor Manufacturing
Company) 90 nm technology
Use Synopsys DesignWare IP
Synopsys Design Compiler for front-end synthesis
Cadence First Encounter for back-end synthesis
Synopsys VCS for RTL simulation and functional
verification
Hierarchical architecture and single-voltage design,
which can be extended to a multiple-voltage design.
Experimental Infrastructure
Three workloads to evaluate the WaveScalar
processor
Spec2000 benchmark suite (ammp, art, equake, gzip,
twolf and mcf), for single-threaded performance
evaluation
Mediabench (rawdaudio, mpeg2encode, djpeg), for
media-processing performance evaluation
Splash2 benchmarks (fft, lu-contiguous, ocean-noncontiguous,
raytrace, water-spatial, radix), for
multi-threaded performance evaluation
WaveScalar Architecture
Processing Elements
(PEs)
The PE is the heart of a
WaveScalar machine.
Execution resources
(Captured from Steven ISCA’06)
WaveScalar Architecture
Five pipeline stages of PE
Input stage:
Operand messages arrive at the PE either from itself or
another PE.
Match stage:
Operands enter the matching table; this stage determines
which instructions are ready to fire and issues the table
indices of eligible instructions into the instruction
scheduling queue.
Dispatch stage:
Selects an instruction from the scheduling queue and reads
its operands from the matching table for execution.
Execute stage:
Executes an instruction
Output stage:
Sends outputs to consumer instructions via the
interconnection network.
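A minimal software sketch of the matching and dispatch behavior these stages describe, assuming invented class and field names (the slides do not give the RTL details): an instruction becomes eligible to fire once all of its operands have arrived in the matching table.

```python
from collections import defaultdict, deque

class MatchingTable:
    """Illustrative model of the Match/Dispatch stages of a PE."""

    def __init__(self, arity):
        self.arity = arity              # operands each instruction needs
        self.table = defaultdict(dict)  # instruction id -> {operand slot: value}
        self.ready = deque()            # scheduling queue of fireable instructions

    def deliver(self, inst, slot, value):
        """Input/Match stages: record an arriving operand message and,
        once the instruction has all operands, enqueue it for dispatch."""
        self.table[inst][slot] = value
        if len(self.table[inst]) == self.arity[inst]:
            self.ready.append(inst)

    def dispatch(self):
        """Dispatch stage: pop the next eligible instruction and read
        its operands out of the matching table for execution."""
        inst = self.ready.popleft()
        return inst, self.table.pop(inst)

mt = MatchingTable({"add1": 2})
mt.deliver("add1", 0, 3)      # first operand arrives: not fireable yet
mt.deliver("add1", 1, 4)      # second arrives: "add1" enters the queue
inst, ops = mt.dispatch()
print(inst, ops[0] + ops[1])  # add1 7
```

The key property this models is that firing is driven purely by operand arrival, not by a program counter.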
WaveScalar Architecture
Several PEs are combined into a single pod
They share bypass networks
Several pods are combined into a single domain
Several domains are combined into a single cluster
These groupings are parameters that affect the performance
and area trade-offs of WaveScalar.
A 2-PE pod is 15% faster on average than isolated PEs
Increasing the number of PEs in each pod would further
increase performance, but adversely affects cycle time.
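The grouping levels compose multiplicatively, so the total PE count follows directly from the per-level sizes. A tiny sketch with invented example sizes (the paper's actual baseline configuration is given in its figures):

```python
def total_pes(pes_per_pod, pods_per_domain, domains_per_cluster, clusters):
    """Total processing elements for a given hierarchy configuration."""
    return pes_per_pod * pods_per_domain * domains_per_cluster * clusters

# Hypothetical example: 2-PE pods, 4 pods/domain, 4 domains/cluster, 4 clusters.
print(total_pes(2, 4, 4, 4))  # 128 PEs
```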
WaveScalar Architecture
The configuration of the baseline WaveScalar processor
(Captured from Steven ISCA’06)
WaveScalar Architecture
Hierarchical organization of the WaveScalar microarchitecture
(Captured from Steven ISCA’06)
WaveScalar Architecture
Four-level hierarchical interconnect network
Intra-pod
Intra-domain
Broadcast-based
Pseudo-PEs (Mem, Net) serve as gateways to the
memory system and to PEs in other domains or clusters
7% area overhead
Intra-cluster
Small network; area overhead negligible
Inter-cluster
Responsible for all long-distance communication
1% of total chip area overhead
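The routing implied by this hierarchy can be sketched as a level-selection function: a message climbs only to the first level at which sender and receiver differ. The (cluster, domain, pod, pe) coordinate tuple below is an illustrative encoding, not the hardware's actual addressing scheme.

```python
def network_level(src, dst):
    """Pick the network level that carries a message between two PEs.
    src and dst are (cluster, domain, pod, pe) coordinate tuples."""
    if src[0] != dst[0]:
        return "inter-cluster"   # long-distance communication
    if src[1] != dst[1]:
        return "intra-cluster"
    if src[2] != dst[2]:
        return "intra-domain"    # broadcast-based level
    return "intra-pod"           # shared bypass network (or the same PE)

print(network_level((0, 1, 2, 0), (0, 1, 2, 1)))  # intra-pod
print(network_level((0, 1, 2, 0), (1, 0, 0, 0)))  # inter-cluster
```

This structure is what makes the traffic statistics reported later possible: most messages never need to leave the lower levels.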
WaveScalar Architecture
Hierarchical cluster interconnects
(Captured from Steven ISCA’06)
WaveScalar Architecture
Memory Subsystem
Wave-ordered store buffers
Memory interface that enables WaveScalar to execute
programs written in imperative languages (C, C++, Java)
Store decoupling technique processes store-address
and store-data messages separately
Partial store queues for storing addresses
Occupies approximately 6.2% of the cluster area
Conventional memory hierarchy with distributed L1
and L2 caches
The L1 data cache is 4-way set associative with 128-byte
lines and a 3-cycle hit cost
The L2 hit latency is 20-30 cycles
Main memory latency is modeled at 200 cycles
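As a rough illustration of these latencies, an average memory-access cost can be estimated from the cycle counts above. The hit-rate values and the weighted-average formula are illustrative assumptions, not figures from the paper.

```python
def avg_mem_latency(l1_hit_rate, l2_hit_rate, l2_latency=25):
    """Average access latency in cycles for the modeled hierarchy:
    3-cycle L1 hit, 20-30 cycle L2 hit (25 used here), 200-cycle memory."""
    l1 = 3            # L1 hit cost (cycles)
    mem = 200         # modeled main-memory latency (cycles)
    miss_l1 = 1.0 - l1_hit_rate
    miss_l2 = 1.0 - l2_hit_rate
    return (l1_hit_rate * l1
            + miss_l1 * l2_hit_rate * l2_latency
            + miss_l1 * miss_l2 * mem)

# Hypothetical 95% L1 and 80% L2 hit rates.
print(avg_mem_latency(0.95, 0.80))  # about 5.85 cycles
```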
Evaluation
Die area spent for the baseline design
(Captured from Steven ISCA’06)
Evaluation
The configuration of the baseline WaveScalar processor
(Captured from Steven ISCA’06)
Evaluation
Many parameters affect the area required for
WaveScalar designs.
This paper considers the 7 parameters with the
strongest effect on the area requirements.
It ignores some minor effects
For example, it assumes that wiring costs do not decrease
with fewer than 4 domains
Evaluation
WaveScalar processor area model
(Captured from Steven ISCA’06)
Evaluation
The parameter ranges allow for over 21,000 WaveScalar
processor configurations.
To select the configurations, the authors:
Eliminate clearly poor, unbalanced designs
Bound the die size at 400 mm² in 90 nm technology
Reduce the number of designs to 201
Report AIPC (Alpha-equivalent instructions executed per
cycle) instead of IPC.
Evaluation
Pareto-optimal WaveScalar Designs
(Captured from Steven ISCA’06)
Evaluation
Pareto-optimal configurations for Splash2
(Captured from Steven ISCA’06)
Evaluation
Goal of WaveScalar’s hierarchical interconnect:
Isolate as much traffic as possible in the lower levels of the
hierarchy, within a PE, a pod or a domain.
On average, 40% of network traffic remains within a
pod
52% of network traffic remains within a domain
On average, just 1.5% of traffic traverses the
inter-cluster interconnect
Evaluation
Pareto-optimal WaveScalar Designs
(Captured from Steven ISCA’06)
Conclusion
This paper presents the WaveScalar processor
architecture in detail
It presents the parameters that significantly affect
area and performance
It proposes the WaveScalar architecture and explores the
area/performance trade-offs through simulation and
analysis
It reveals that WaveScalar processors can be tuned for
either area efficiency or maximum performance
across a wide range of processor sizes.
The hierarchical interconnect network is very
effective.
Over 50% of messages stay within a domain
Over 80% of messages stay within a cluster
Thank you!