ARM and Magma Implementation Reference Methodologies Addressing High Performance, Low Power, Small Area and Full Automation
Reining in Time-To-Market for Next Generation Embedded Design
CONTENTS
Talus® and Automation
• Accelerating Hierarchical Implementation of the ARM® Cortex™-A9 MPCore™ processors with Magma’s Hydra™
• Jumpstarting ARM Cortex-A9 MPCore Processor-based SoC Designs with Talus and Hydra
High-Performance
• A Fully Automated High Performance Implementation of ARM Cortex-A8
Low-Power
• Magma iRM for ARM Powered-optimized Cortex-R4 Processors
• Rapid Implementation of Low Power ARM Microprocessors
A Compendium of Articles by Magma Design Automation from the ARM Information Quarterly Magazine,
a publication of ARM and Convergence Promotions.
ARM® and Magma® in Partnership
[Figure: iRM flow diagram. Labels: ARM RTL deliverables; application constraints + technology libraries; ARM and Magma iRM; hardened core; abstracted model; industry-standard views/models for SoC integration + high-quality ARM processor implementation.]
Implementation Reference Methodologies (iRMs) for customer-specific hardening of synthesizable ARM processors
• Co-developed by ARM and Magma
• The proven route to successful silicon
• The basis of a custom deployment methodology
• Eases learning curve to adoption of ARM IP
• Enables rapid, application-specific implementation
A process of continuous improvement
• Reap the benefit of researching new tools and techniques
• Leverage best practices for embedded design implementation
Most mobile phone chips embed ARM® processors.
Most mobile phone chips are designed with Magma® software.
Shouldn’t you be using ARM and Magma?
Accelerating Hierarchical Implementation of the ARM Cortex-A9 with Magma’s Hydra
Author:
Stuart Riches and Philip Watson, ARM Ltd., Cambridge; Jim Schultz, Pete Churchill, Somasekhar Eerappa and Vasu Madabushi, Magma Design Automation
Synopsis:
Multicore processor-based systems
are the future of embedded computing
in high-performance hand-held and
power-hungry devices. The ARM
Cortex-A9 MPCore multicore processor
is ideally poised to meet this demand
for System-on-Chip (SoC) designs that
embed such processors. The sheer size
of such SoCs powered by the ARM
Cortex-A9 MPCore multicore processor
will make it difficult to manage
traditional tasks such as design
partitioning, time budgeting, hierarchy
management, block shaping, power
planning, etc. With time-to-market
pressures shrinking the design time,
SoC implementers are turning to
hierarchical chip planning and finishing
systems to automate these traditional
tasks and provide them with faster
ways to achieve quality floorplans to
close their designs.
Need for Multicore Processor-based Solutions
Next-generation smart phones and Mobile
Internet Devices (MID) will incorporate
exciting features that allow watching high
definition movies, making video phone
calls, playing 3-D video games, watching
live TV, GPS satellite-based navigation and
not least of all, provide a rich Internet
browsing experience. These capabilities
are possible in large part due to the rapid
convergence of technologies in the embedded System-on-Chip (SoC) platform that
earlier required desktop computing power
and bandwidth to accomplish. Designing such handheld or mobile devices means combining rich functionality and high-performance computing while minimizing power consumption.
Delivering the necessary performance
requires higher processor frequencies,
which increases power consumption and
heat dissipation. Moreover, pushing synthesis tools to achieve that last increase in
MHz results in a significant area penalty.
In summary, both power and area
increase exponentially as the envelope of
achievable frequencies is pushed. The solution lies in scaling high-performance computing and power consumption to meet a
particular device’s requirement.
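The trade-off between frequency and power described above can be illustrated with the standard first-order relation for CMOS switching power, P = a · C · Vdd² · f. The sketch below is not taken from the article: the capacitance, voltages and frequencies are hypothetical values chosen only to show the shape of the trade-off, leakage is ignored, and the two-core case assumes the workload splits perfectly across both cores.

```python
def dynamic_power(c_eff_farads, vdd_volts, freq_hz, activity=1.0):
    """First-order CMOS switching power: P = activity * C * Vdd^2 * f."""
    return activity * c_eff_farads * vdd_volts ** 2 * freq_hz


# Hypothetical effective switched capacitance of one core (farads).
C_EFF = 1.0e-9

# One core pushed to a high frequency, which needs a higher supply voltage.
single_core_w = dynamic_power(C_EFF, vdd_volts=1.2, freq_hz=1.0e9)

# Two cores, each at half the frequency, which tolerates a lower supply.
dual_core_w = 2 * dynamic_power(C_EFF, vdd_volts=1.0, freq_hz=0.5e9)

print(f"single core: {single_core_w:.2f} W, dual core: {dual_core_w:.2f} W")
```

Because the supply voltage enters as a square, the two slower cores deliver the same nominal throughput for less switching power under these assumptions.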
Advanced technologies available within
the ARM Cortex-A9 MPCore processors
help accelerate system performance,
reduce power consumption, and include
key new features and approaches that
enhance embedded multicore designs.
These multicore processors have better power and area efficiency at the same performance point compared with uni-processors. For example, the ARM Cortex-A9 MPCore processors have a smaller memory footprint that contributes to area and power reduction. The ability to turn off processors with power gating depending on the workload requirement, and voltage and frequency scaling also enable significant power savings. Multicore processors thus offer excellent scalability of performance and power.
This article will describe the quad-core hierarchical implementation of ARM’s latest Cortex-A9 MPCore multicore processor using Hydra, Magma’s automated hierarchical design solution. New technologies, features and the ability to provide hand-off quality floorplans in different stages of the design cycle and early-prototyping-to-final-implementation stages will be discussed.
Figure 1: ARM Cortex-A9 MPCore architecture of a quad-core configuration
Information Quarterly, Volume 7, Number 3, 2008
Design Size and Complexity of SoCs with Embedded ARM Cortex-A9 MPCore Multicore Processors
To meet the increasing demand for compute power, the size and complexity of embedded processors have increased. The relative size of an ARM processor has always been just a fraction of the whole SoC. However, there is currently a significant, in fact exponential, change. For
example, the ARM 926EJ-S™ processor can
be efficiently implemented with under 75K
placeable instances, while the ARM11™
MPCore™ (quad) and the Cortex™-A8
processors require approximately 500K
placeable instances. There were tremendous advantages in maintaining a flat
approach with these implementations,
which were more straightforward when
artificial timing boundaries and floorplan
regions were not required. The Power,
Performance and Area (PPA) targets were
always easier to achieve in a flat implementation since the macro and cell placers
had more degrees of freedom to exercise.
However, we have reached a point where, in balancing the need for turnaround time (TAT) and straightforward implementation methodologies, some compromise is required. Take for example the Cortex-A9 MPCore, where the dual configuration including the NEON™ and Floating Point Unit (FPU) blocks is about 800K instances in size and a full quad configuration includes well over one million cells. In addition, the ARM Cortex-A9 MPCore multicore processor allows excellent configurability with choice of single, dual and quad-cores, processor trace macrocell (PTM), L1 cache size; including NEON (SIMD media processing/DSP engine), FPU, L2 cache controllers, debug interfaces and AXI bus masters. Such a full configuration can contribute in minor ways to increasing the design size, complexity, and challenges posed by deep sub-micron process technologies and libraries.
Figure 2: Quad-Cortex-A9 MPCore-based SoCs: Increasing design size contributing to complexity and challenge
Complex Cortex-A9 MPCore-based SoCs can have hundreds of macros, multiple voltage domains and complex clock topology, particularly in the mobile market segment where power management is paramount. Such exponential growth in the complexity of chips means several weeks of runtime to implement physical blocks flat, exceeding the bounds of reasonable TAT. We have moved into an era where the processor core has become a very large IP subsystem in its own right and the target for a hardened re-useable block is much larger. As a result, a hierarchical approach becomes a natural choice, but that will require high quality floorplanning and hierarchical planning solutions.
Today, a methodology is needed that enables hierarchical design planning and prototyping of designs using advanced technologies: one that performs automated partitioning and shaping, optimized macro placement, global route-based pin assignment, accurate budgeting and black-box handling. The advantages of such an approach would be ease of use, faster TAT, early feedback and predictability, and tight correlation with the final implementation. Magma’s Hydra provides such an infrastructure for a comprehensive floorplan synthesis and hierarchical planning solution. Hydra also uses the same underlying engines for standard cell and macro placement, physical optimization, timing and routing from the Talus® IC Implementation platform, ensuring the quality and accuracy of the floorplan. This also ensures that both the logic and physical design engineer can close on a solution faster. An added advantage is the hierarchical design planning capabilities of Hydra, which allow designers to quickly explore multiple floorplan implementations. Utilizing Relative Floorplanning Constraints™ (RFCs), designers can create convergent floorplans by retaining floorplan changes between iterations. This becomes very useful when accommodating late-arriving changes to the design that require small changes to the core size and shape without performing time-consuming macro placement updates.
With lessons learned from an earlier
proven Talus-based hierarchical implementation methodology for ARM11
MPCore, we set out to implement the
quad-core Cortex-A9 MPCore processor
with Hydra.
Design Planning: First Impressions with Hydra
Design exploration is a difficult, time-consuming, iterative task. Designers typically make a large number of iterations
before homing in on an optimal floorplan.
Don’t we often hear SoC Project Managers
complain when floorplan changes are
made? A common refrain is “I thought
you finished the floorplan last week?” For
an embedded processor core in an SoC,
these changes will likely include size,
shape, pin placement, macro placement
and power grid.
Figure 3: Hydra analyzes data and suggests an
implementation: virtual flat placement before
automatic partitioning
The challenge posed by the full configuration quad Cortex-A9 MPCore multicore
processor was that it was almost three
times the size of an ARM11 MPCore
processor with 2200 top-level boundary
pins, 150 macros and a frequency target of
600 MHz with a TSMC 65GP ARM
Advantage-HS physical IP library. In addition, it also included memory-bist and
scan runtime.
One of the conundrums of floorplanning a
large SoC with an embedded multicore
processor is figuring out where to start.
This is where an automated solution is
extremely valuable. Pictured in Figure 4 is
the result after the initial floorplan of the
quad-Cortex-A9 MPCore processor. The
logic was placed and the design was partitioned and automatically shaped into
hierarchical blocks. In the first pass, a
square aspect ratio of 4mm x 4mm was
specified. While the chances of having a
square area to place such a large multicore processor in a design are almost negligible, the important point is that Hydra
provided us a solid starting point for further exploration, in less than three hours
of runtime.
Figure 4: Result after the initial floorplan of the quad-Cortex-A9 MPCore processor
Channel Sizing
Let us now look at the results of multiple shaping experiments. Hydra offers flexibility in design style with support for near-abutted and channel-based hierarchical designs. From a congestion analysis standpoint, it was clear that locating the highly-connected Snoop Control Unit (SCU) block in the middle (orange color in Figure 5) would generate quite a bit of congestion, leading to pin-assignment issues. This allowed for further exploration with the tool to seek a better solution for the heavily congested SCU block.
At first, some of the critical (timing-driven) logic from the SCU block was re-partitioned by placing it at the top level and allowing it to move into the channel. Additionally, the SCU partition was given a smaller area utilization target in order to be able to grow the block. Finally, given that there was more top-level logic, the global channels were widened. Honoring these constraints, new partitions were shaped by Hydra in under an hour for the entire design.
Figure 5: Quad-Cortex-A9 MPCore: Hydra uses channel sizing to provide mixed hierarchy/flat support
Partial Floorplan Re-use
Even after widening the channels and pulling the SCU logic to the top, the block continued to have issues. Each of the four CPUs has 1500 pins and the SCU contained 5500 pins just by itself. To solve this, some radical approaches were adopted. It was decided to delete the SCU partition completely and put the logic at the top level. The real challenge lay in re-using what was already achieved until that point and building from there. The CPUs highlighted by the bright green and yellow blocks were frozen while the turquoise and purple CPU blocks were manually re-shaped. A few of the macros were manually fixed, while placement of other macros was re-used using Relative Floorplanning Constraints derived from the previous iteration by the tool. This gave us the results seen in Figure 6.
Figure 6: Floorplan results after partial reuse of results
Wider Aspect Ratio for a Better/Efficient Floorplan
In a real design scenario, if the chip integrator changes the floorplan to, say, a wider aspect ratio, how would one make use of the knowledge gained from the previous run? The solution is to do an intelligent restart! Pre-defined partitions and top-level logic from the previous run were used. Relative Floorplanning Constraints can be used to assign relative locations to floorplan objects. The channel size determined in the previous run was also re-used. Given that the previous pin assignment was good, the relative pin locations were re-used too. In 20 minutes, we were able to re-place and shape the design.
Figure 7: Quad-Cortex-A9 MPCore: Hydra re-use allows rapid prototyping
Also important to note in Figure 7 above is the rectilinear shape of the four CPU hierarchical blocks. If trapped in a world of pure rectangles, floorplanning would become a nearly impossible task. Rectilinear shapes are a must to fully utilize the silicon area available. Various hierarchical blocks in a design will have different shape requirements as dictated by internal hard macros, external location and connectivity. Hydra’s shaper provided an easy process for creating initial rectilinear shapes and refining current shapes as the design matures, as seen by the progression of our work.
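The benefit of rectilinear shapes can be made concrete with a little geometry. The sketch below is an illustration only, not part of Hydra: it computes the area of a hypothetical L-shaped block with the shoelace formula and compares it with the block's bounding rectangle. The coordinates are invented and the units are arbitrary (millimetres are assumed here).

```python
def polygon_area(vertices):
    """Shoelace formula. Vertices are (x, y) corners listed in order."""
    total = 0.0
    count = len(vertices)
    for i in range(count):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % count]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bounding_box_area(vertices):
    """Area of the smallest axis-aligned rectangle enclosing the shape."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical L-shaped block: a 2 x 2 square with one 1 x 1 corner removed.
l_block = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]

block_area = polygon_area(l_block)
freed_area = bounding_box_area(l_block) - block_area
print(f"block area {block_area}, area left for a neighbour {freed_area}")
```

A pure-rectangle floorplan would have to reserve the whole bounding box for this block; the rectilinear outline hands the notch back to a neighbouring block or channel.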
Using Relative Floorplanning Constraints (RFCs) on the Quad Cortex-A9 MPCore Processor Implementation
The auto-interactive macro placer in Hydra is straightforward and easy to run and the results are excellent for getting quick feedback on a floorplan shape and pin location. This is a solid starting point to use Relative Floorplanning Constraints effectively for further changes. Hydra offers the ability to extract the relative constraints from an existing macro placement, which can then be adjusted and maintained going forward. The RFC extraction capability is very noteworthy: this feature provides the bridge from the prototype world to the production world.
Figure 8: Arrows showing anchor points of using Relative Floorplanning Constraints in the Cortex-A9 MPCore floorplan: Define arrays easily and locate arrays using relation to an object
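As a rough illustration of how relative constraints differ from absolute coordinates, the sketch below resolves rows of RAM macros that are anchored to the upper-right corner of the core, with each later row stacked just below the previous one. It is not Hydra's RFC syntax or algorithm; the `Macro` class, the `resolve_rows` function, the macro names and all dimensions (assumed to be in microns) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Macro:
    name: str
    width: float
    height: float


def resolve_rows(rows, core_width, core_height, gap=0.0):
    """Turn relative intent into absolute lower-left (x, y) coordinates.

    rows[0] hugs the top edge of the core; each later row sits just below
    the previous one. Within a row, macros are packed right to left from
    the right edge. Every row is assumed to be non-empty.
    """
    placement = {}
    y_top = core_height
    for row in rows:
        x_right = core_width
        for macro in row:
            placement[macro.name] = (x_right - macro.width, y_top - macro.height)
            x_right -= macro.width + gap
        y_top -= max(m.height for m in row) + gap
    return placement


# Eight RAMs in the upper-right corner, four more stacked just below them.
rams_a = [Macro(f"ram_a{i}", 100.0, 50.0) for i in range(8)]
rams_b = [Macro(f"ram_b{i}", 100.0, 50.0) for i in range(4)]
before = resolve_rows([rams_a, rams_b], core_width=4000.0, core_height=4000.0)

# Late change: the first group of RAMs grows and the core becomes wider.
# The same relative description is simply resolved again.
rams_a_big = [Macro(m.name, 120.0, 60.0) for m in rams_a]
after = resolve_rows([rams_a_big, rams_b], core_width=5000.0, core_height=3200.0)
```

The description of intent (which group goes where, relative to what) never changes between the two calls; only the resolved coordinates do, which is why a coordinate dump would have needed regenerating while a relative description does not.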
So, why would one want to use Relative Floorplanning Constraints? They can be used to guide the macro placer to get the final floorplan. One can fully specify a relative set of constraints without the need to run the macro placer. It offers a method for capturing and implementing true “designer intent”. It allows a floorplan to be specified (scripted) as a human would think (such as “a group of 8 rams in the upper right corner, 4 more rams stacked just below them,” etc.), and not in a Cartesian based dump-file (as a tool would think). Most importantly, Relative Floorplanning Constraints will allow small changes in the design to be absorbed without a need to change the macro placement. Case in point was the real-life example of having to change the RAM size two days before release of the ARM and Magma Implementation Reference Methodology (iRM) for the Cortex-A9 MPCore. An updated floorplan file was not necessary because it was all placed with relative constraints, thereby saving us several days in TAT.
Addressing Channel Congestion
Congestion analysis helps determine the custom channel sizing. Through our previous work we were successfully able to remove the congestion in the central area. However, there were still some channels that needed closer investigation. The solution was to custom size those channels. It wouldn’t be prudent to globally make all channels larger. This was accomplished by reviewing the congestion in the Hydra GUI with the congestion map and the channel report to identify hot spots. The channel report also suggests channel sizes for a given utilization. Additionally, blockages can be added to manually alleviate any boundary pin congestion. It took under 20 minutes to turn around the design, including shaping, macro placement and global route.
Figure 9: The wide floorplan and dispersed SCU logic was successful in reducing the central congestion and further analysis of local channel congestion suggested custom sizing.
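The idea of sizing a channel for a given utilization can be sketched with simple track counting. This is a back-of-the-envelope estimate, not the calculation behind Hydra's channel report: the function name, the track pitch, the number of routing layers and the utilization targets are all assumptions, and treating all 5500 SCU pins as wires crossing one channel is a deliberately pessimistic simplification.

```python
import math


def channel_width_um(nets_crossing, track_pitch_um, routing_layers, target_utilization):
    """Estimate the width a routing channel needs to carry a number of wires.

    Each routing layer offers one track per pitch across the channel width,
    and only a fraction (target_utilization) of tracks is assumed usable.
    """
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    if routing_layers < 1:
        raise ValueError("routing_layers must be at least 1")
    tracks_per_layer = math.ceil(nets_crossing / (routing_layers * target_utilization))
    return tracks_per_layer * track_pitch_um


# Hypothetical technology numbers: 0.2 um pitch, 2 layers running along the channel.
relaxed = channel_width_um(5500, track_pitch_um=0.2, routing_layers=2, target_utilization=0.5)
tight = channel_width_um(5500, track_pitch_um=0.2, routing_layers=2, target_utilization=1.0)
print(f"50% utilization: {relaxed:.0f} um, 100% utilization: {tight:.0f} um")
```

Applying an estimate like this per channel, rather than one global width, reflects the point made above: only the congested channels need to grow.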
Figure 10: After modifying the channels and rerunning the placer & global router the design is nearly congestion-free.
Congestion Clean Floorplan Implementation
In summary, we have a nearly congestion-free design before full blown optimization. The key here is to fix any gross timing violations and congestion issues before going into optimization. This reduces the overall runtime since the tool will not try to fix impossible paths. A very acceptable floorplan was settled in a matter of days not weeks. A natural progression would be to experiment implementing the four CPUs as repeated blocks.
Figure 11: Early analysis and partitioning solves gross timing/congestion issues before optimization, reducing overall TAT
Conclusions
Based on the results we obtained, it is clear
that the above approaches offer many
benefits. Rectilinear shaping and autointeractive macro placement are valuable
features of Hydra that allowed optimal use
of the quad Cortex-A9 MPCore floorplan
area. When late-arriving RTL and aspect
ratios of macros change, Relative
Floorplanning Constraints may be effectively utilized to feed back to the shaper
and cluster placement engine to reduce
TAT significantly. Gross timing violations
and congestion can be fully addressed
even before going into full optimization.
Overall, the auto-interactive features of Hydra offer a fast prototyping, hierarchical chip planning and finishing system for SoCs that embed different configurations of the Cortex-A9 MPCore multicore processor. Utilizing these and other features of Hydra early in the design process will yield confidence in closing the overall design after placement optimization and routing, as well as saving months of time compared to a manual approach, thereby maximizing productivity.
Hear Magma Design Automation give two
presentations at the ARM Developer’s
Conference:
1. Quad-core Cortex-A9 MPCore Multicore
Processor Implementation
2. Advanced Techniques for Implementing
Cortex-M3 based Ultra-Low Power Designs
Visit www.rtcgroup.com/arm/2008/
for more information.
DESIGN STRATEGIES AND METHODOLOGIES
Magma iRM for ARM Powered™ optimized Cortex-R4 Processors
Author:
Vasu Madabushi, Gary Powell and Joe
Walston, Magma Design Automation;
Stuart Riches, ARM Limited, UK
Synopsis:
As the feature size of process technology
becomes smaller, the leakage power
dominates overall power dissipation and
the problem gets worse if the design is
supposed to perform at very high frequencies. This article addresses specific problems
for achieving high performance and low
power while implementing nanometer
system-on-a-chip (SoC) designs with the
Magma iRM for the ARM Cortex-R4, which
is based on Magma Design Automation’s
IC implementation software. This article also
presents solutions and identifies necessary
enhancements to the overall methodology
so that the entire design flow can be
automated to accelerate turnaround time.
The ARM® Cortex™-R4 processor targets
deeply embedded applications in the
extremely competitive imaging, automotive,
wireless base-band and storage vertical
markets, where achieving optimal performance at the least possible overall cost is
paramount.
Magma’s IC implementation software
lends itself to a power-aware design flow
that lets designers make timing-vs.-power
and area-vs.-power trade-offs at different
stages of the flow. It provides access to
appropriate low-power analysis and
optimization engines that are integrated
with, and applied throughout, the entire
RTL-to-GDSII flow.
Stringent Requirements of Cortex-R4 Embedded Applications
Cortex-R4 covers a wide area of applications, including mass storage/hard-disk drive controllers, digital video and still cameras, car chassis/braking systems, mobile wireless modems, intelligent PC-independent printers, networking and home gateways. In such deeply embedded high-volume systems, balancing performance targets with overall cost is a delicate act. For example, given the relatively low cost (price per chip) of disk-drive controllers with embedded Cortex-R4 processors, any area savings at a given performance point may directly translate into increased profitability. On the other hand, safety-critical automotive applications require design margins to be built into the chips to address reliability concerns due to extreme temperature variations and an extended product lifetime in the vicinity of 10+ years. Under such conditions, poor power supply (mesh) design can lead to voltage (IR) drop issues. Analyzing and preventing electromigration during the design phase is necessary in order to avoid thermal breakdown on the power network and to address signal integrity issues. All of these can place severe restrictions on process choices, maximum operating frequency and design complexity.
Figure 1: The ARM Cortex-R4 architecture
Information Quarterly, Volume 5, Number 4, 2006
Here’s an overview of Cortex-R4 features that mitigate some of these requirements:
• ARMv7-R architecture
– Thumb®-2 technology ensures a more efficient processor design
– Optional MPU: applications that don’t require it can save on area
– 32-bit signed/unsigned hardware divider for control applications
– Improvements to interrupt handling for hard real-time applications
• Micro-architecture
– 8-stage selective dual-issue pipeline ensures minimized area overhead
– Optional I and D caches, Tightly Coupled Memories (TCM)
– AMBA 3 AXI master port for efficient on-chip interconnect
  • DMA access to TCMs through slave port
– Global history branch predictor and function call return stack
• Flexible synthesis-time configurability ensures efficiency and significant area savings depending on application market requirement
– Each cache can be 4KB - 64KB or can be removed
– 1 to 3 TCMs (up to 8MB), or can be removed
– 8 or 12 regions in MPU, or no MPU
– 2 – 8 Breakpoints, 1 – 8 Watchpoints
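The synthesis-time options listed above can be captured as a small configuration record with range checks. The sketch below is only an illustration: the class and field names are hypothetical and are not the processor's actual RTL parameters, the "up to 8MB" limit is assumed here to apply to the largest single TCM (the list does not say whether it is per TCM or in total), and only the ranges quoted in the list are checked.

```python
from dataclasses import dataclass
from typing import List, Optional

KB = 1024
MB = 1024 * KB


@dataclass
class CortexR4Config:
    """Hypothetical record of the synthesis-time options quoted in the text."""
    icache_bytes: Optional[int] = 8 * KB   # None means the cache is removed
    dcache_bytes: Optional[int] = 8 * KB   # None means the cache is removed
    tcm_count: int = 0                     # 0 means the TCMs are removed
    largest_tcm_bytes: int = 0             # assumed: the 8MB limit is per TCM
    mpu_regions: int = 0                   # 0 means no MPU
    breakpoints: int = 2
    watchpoints: int = 1

    def problems(self) -> List[str]:
        """Return a list of human-readable range violations (empty if valid)."""
        issues = []
        for label, size in (("icache", self.icache_bytes), ("dcache", self.dcache_bytes)):
            if size is not None and not 4 * KB <= size <= 64 * KB:
                issues.append(f"{label} must be 4KB to 64KB or removed")
        if not 0 <= self.tcm_count <= 3:
            issues.append("tcm_count must be 1 to 3, or 0 for none")
        if self.tcm_count == 0 and self.largest_tcm_bytes != 0:
            issues.append("TCM size given but no TCMs configured")
        if self.largest_tcm_bytes > 8 * MB:
            issues.append("TCM size must not exceed 8MB")
        if self.mpu_regions not in (0, 8, 12):
            issues.append("mpu_regions must be 8, 12, or 0 for no MPU")
        if not 2 <= self.breakpoints <= 8:
            issues.append("breakpoints must be 2 to 8")
        if not 1 <= self.watchpoints <= 8:
            issues.append("watchpoints must be 1 to 8")
        return issues


# An area-lean configuration: no caches, one 64KB TCM, no MPU.
lean = CortexR4Config(icache_bytes=None, dcache_bytes=None,
                      tcm_count=1, largest_tcm_bytes=64 * KB)
print(lean.problems())
```

Checking a configuration like this before launching a flow is a cheap way to catch an out-of-range option before a long synthesis run.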
Relatively lower target clock frequencies
mean that Cortex-R4 implementations are
smaller than the ARM11 family of processors. Cortex-R4 includes a number of features that reduce supporting memory cost.
The processor pipeline has fewer stages
and yet accesses local RAMs over two clock
cycles. These enable the use of a lower
speed memory library, which drastically
reduces silicon area and power consumption. Moreover, use of the RAMs from ARM Physical IP Metro® library in Cortex-R4 implementations gives about 35% area and greater than 50% power savings. All of the above and the Magma iRM for ARM contribute towards making timing closure easier, shortening the design cycle and reducing risk.
Cortex-R4 also offers an increase in performance over the ARM946E-S processor, in terms of maximum operating frequency and improved computing efficiency, without compromising on low power and size.
In order to closely match Cortex-R4 features with its wide application coverage,
end-users can take advantage of the excellent configurability at synthesis time.
The Need for an Integrated Power-Aware Design Flow in Implementing Cortex-R4-Based SoC Designs
In traditional flows, power considerations
are addressed by stand-alone tools and
without paying enough attention to how
they simultaneously impact timing, area
and turnaround time. The lack of integration between point tools and the rest of the
design environment can result in a
tremendous amount of “false errors” that
can render a design impossible to close.
Worse yet, this lack of integration coupled
with limited repair capabilities can result
in an unreliable power network, causing
numerous time-consuming design iterations.
Another problem with the majority of today’s design environments is that engineers concentrate on analyzing and addressing power considerations during physical design. Hence, poor decisions made during the early stages of the design make it almost impossible to fix any problems during implementation.
What today’s SoC designers need is access
to proven solutions to address all of the
above issues, shrinking process geometries
(65-nm and below) and lower power consumption, while simultaneously improving performance and reducing area. The
Magma iRM for ARM adds tremendous
value by helping ARM licensees become
familiar with best practices that are robust
and deliver a completely hardened macro
for re-use in a hierarchical SoC design. The
Magma iRM for ARM takes advantage of
powerful features in Magma’s tools, such
as a single executable with a single timing
engine and a single, unified data model.
The common data model for algorithms
allows for analysis, optimization and
implementation for timing and power and
signal integrity to be performed concurrently. Standard format outputs allow end-users to perform quick and easy verification of the implementation.
Salient Features of the Cortex-R4 Magma iRM for ARM
The current Magma iRM for Cortex-R4 is based on Magma’s Blast 5.0 toolset, which contains:
• Full RTL-to-GDSII Flow
- Verilog RTL synthesis
- DFT scan chain insertion
- Floorplanning, power grid & physical
synthesis
- Clock tree synthesis
- Final routing
- DRC checking
- Static timing analysis
• Flexible Packaged Flow
- Scalable across different technologies
- Suitable for different core
configurations
- Direct Tcl script access for quick
customizations
• Advanced Feature Support
- Cross-talk delay analysis and
avoidance optimization techniques
- Insertion and optimization of clock
gating
- Options to control optimization effort
All of the above will also equally apply to
the recently announced ARM Cortex-R4F
processor, because they share a common
flow. The additional advanced features of
the Cortex-R4F processor include support
for Error-Correcting Code (ECC) in the
cache memories and TCM, the extension
of error detection into the interconnect, a
synthesis-optional Floating-Point Unit
(FPU) and additional synthesis configuration of DMA.
Key Low-power Considerations for
Implementing the ARM Cortex-R4
Processor
The key low power design considerations
include dynamic and static (leakage)
power dissipation, temperature and performance, and voltage drop effects. These
are addressed continually throughout
Magma’s RTL-to-GDSII flow.
Dynamic Power Dissipation
During synthesis, gate sizes and cell counts
are reduced, which directly translates to
lower dynamic power. Automatic clock-gate insertion and optimization has the dual advantage of reducing both overall area and dynamic power consumption. Moreover, the clock tree in any design will typically consume a significant portion of the budgeted power. During clock tree synthesis (CTS), advanced clock gate cloning, buffering, clustering and multi-Vt techniques are used to lower total power. Power-aware routing within the Magma environment minimizes capacitance on high-switching nets by spreading the wires.
Static Power Dissipation
Static power dissipation is associated with
logic gates when they are inactive. Static
power dissipation has an exponential
dependence on temperature. This means
that as the chip heats up, its static power
dissipation increases exponentially. Static
power dissipation also has an exponential
dependence on the switching threshold of
the transistors (Vt). The delay (switching
time) associated with a transistor is affected by the switching threshold of that transistor (Vt) and the supply voltage to that
transistor (Vdd).
All of this means that engineers have to perform a complicated balancing act, because lowering the supply voltage reduces the amount of heat being generated, which in turn lowers the static power dissipation. However, lowering the supply voltage also increases gate delays. By comparison, lowering the transistors' switching thresholds speeds them up, but this exponentially increases their leakage and therefore their static power dissipation.
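The exponential dependence of leakage on threshold voltage and temperature can be sketched with the textbook subthreshold relation, in which leakage is proportional to exp(-Vt / (n · kT/q)). This is a simplified model, not something taken from the article or from Magma's tools: the slope factor `n`, the threshold voltages and the temperatures are assumed values, and second-order effects (the temperature dependence of Vt itself, gate leakage, drain-induced barrier lowering) are ignored.

```python
import math

BOLTZMANN_OVER_Q = 8.617333e-5  # k/q in volts per kelvin


def thermal_voltage(temp_kelvin):
    """Thermal voltage kT/q in volts."""
    return BOLTZMANN_OVER_Q * temp_kelvin


def relative_leakage(vt_volts, temp_kelvin, n=1.5):
    """Subthreshold leakage up to a constant prefactor: exp(-Vt / (n * kT/q))."""
    return math.exp(-vt_volts / (n * thermal_voltage(temp_kelvin)))


# Hypothetical low-Vt and high-Vt cells at room temperature (300 K).
low_vs_high_vt = relative_leakage(0.30, 300.0) / relative_leakage(0.40, 300.0)

# The same high-Vt cell at 125 C (398 K) compared with room temperature.
hot_vs_cool = relative_leakage(0.40, 398.0) / relative_leakage(0.40, 300.0)

print(f"low-Vt leaks about {low_vs_high_vt:.0f}x more than high-Vt")
print(f"hot silicon leaks about {hot_vs_cool:.0f}x more than cool silicon")
```

Even with these rough numbers, a 100 mV reduction in threshold costs roughly an order of magnitude in leakage, which is why low-Vt cells are worth rationing.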
The Magma iRM for ARM helps automate
the leakage power trade-off and optimization by effectively using low Vt transistors
only on timing-critical paths and high Vt
transistors on non-critical paths. This
automation is a default feature of the
Magma iRM for ARM and is very easy to
use and implement. The only pre-requisite
is the multi-Vt library preparation step
prior to running the Magma iRM for ARM,
which sorts the standard cell models
according to logic and Vt class. Magma
application notes are available to help
with the multi-Vt library preparation.
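The principle of using low-Vt cells only where timing demands it can be sketched as a greedy swap. The sketch below is not the optimization performed by the Magma iRM, which works on a full timing graph; it is a toy version under stated assumptions: every cell starts as high-Vt, paths are given as explicit lists of cell names, and delays are hypothetical integers in picoseconds.

```python
def assign_low_vt(paths, delay_hvt, delay_lvt, clock_period):
    """Return the set of cells swapped to low Vt.

    paths: list of paths, each a list of cell names.
    delay_hvt / delay_lvt: dicts mapping cell name to delay for each Vt class.
    A cell is swapped only while its path still misses the clock period.
    """
    low_vt = set()

    def delay(cell):
        return delay_lvt[cell] if cell in low_vt else delay_hvt[cell]

    for path in paths:
        # Try the cells with the largest speed-up first.
        candidates = sorted(
            (cell for cell in path if cell not in low_vt),
            key=lambda cell: delay_hvt[cell] - delay_lvt[cell],
            reverse=True,
        )
        for cell in candidates:
            if sum(delay(c) for c in path) <= clock_period:
                break  # path meets timing; keep remaining cells high-Vt
            low_vt.add(cell)
    return low_vt


# Hypothetical cell delays in picoseconds.
hvt = {"u1": 500, "u2": 600, "u3": 400, "u4": 300}
lvt = {"u1": 400, "u2": 400, "u3": 300, "u4": 250}
paths = [["u1", "u2", "u3"], ["u3", "u4"]]

swapped = assign_low_vt(paths, hvt, lvt, clock_period=1400)
print(sorted(swapped))
```

In this example only one cell on the critical path is swapped, so every other cell keeps the low leakage of the high-Vt class.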
Challenges Faced with Variations in
Temperature, Power and Voltage Drop
Effects
Power consumption - both static and
dynamic - increases a device's operating
temperature. This may force engineers to
employ expensive device packaging and
external cooling technology.
To accommodate variations in operating
temperature and supply voltage, designers
have traditionally been obliged to pad
device characteristics and design margins.
However, creating a device's power network using excessively conservative design
practices consumes valuable silicon real
estate (leading to increased costs), increases routing congestion, and results in performance that is significantly below the silicon's full potential. This simply is unacceptable in today's highly competitive
marketplace.
Every power and ground track segment
has a small amount of resistance associated with it. This means that the logic gate
closest to the IC's primary power or ground
pins is presented with the optimal supply.
The next gate in the chain will be presented with a slightly degraded supply, and so
on down the chain.
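The cumulative degradation along a supply rail can be illustrated with a short calculation. This is a minimal sketch, assuming a single rail fed from one end, the same resistance for every segment, and a constant DC current per gate; the function name and all numeric values are illustrative and do not come from the article.

```python
def rail_voltages(v_supply, segment_resistance, gate_currents):
    """Return the DC voltage seen by each gate along a rail fed from one end.

    Segment i carries the current of gate i and of every gate after it,
    so the drop across it is segment_resistance * sum(gate_currents[i:]).
    """
    voltages = []
    v = v_supply
    remaining = sum(gate_currents)
    for current in gate_currents:
        v -= segment_resistance * remaining  # IR drop across this segment
        voltages.append(v)
        remaining -= current  # downstream segments carry less current
    return voltages


# Hypothetical values: 1.0 V supply, 0.5 ohm per segment, 1 mA per gate.
volts = rail_voltages(1.0, 0.5, [0.001] * 4)
for index, value in enumerate(volts):
    print(f"gate {index}: {value:.4f} V")
```

Each gate farther from the supply pin sees a lower voltage than the one before it, which is the DC behaviour the paragraph describes.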
The problem is exacerbated with transient
or alternating current (AC) voltage drop
effects. These occur when gates switch from
one value to another or, in the worst case,
when entire blocks are switched on or off.
This causes transitory power surges, which
momentarily reduce the voltage supply to
gates farther down the power supply
chain.
The reason voltage drop effects are so
important is that the input-to-output
delays across a logic gate increase as the
voltage supplied to that gate is reduced,
which can cause the gate to miss its timing
specifications. There is also an increase in
the interconnect delays associated with
wires driven by under-powered gates.
Furthermore, a gate's input switching
thresholds are modified when its supply is
reduced, which causes that gate to become
more susceptible to noise.
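The sensitivity of gate delay to supply voltage can be sketched with the widely used alpha-power law, in which delay is proportional to V / (V - Vth)^alpha. The article does not name a delay model, so the model choice, the threshold voltage and the exponent below are all assumptions made purely for illustration.

```python
def relative_gate_delay(v_supply, v_threshold=0.3, alpha=1.3):
    """Gate delay in arbitrary units under the alpha-power law (assumed model)."""
    if v_supply <= v_threshold:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return v_supply / (v_supply - v_threshold) ** alpha


# Hypothetical operating points: nominal 1.0 V and a 10% voltage drop.
nominal = relative_gate_delay(1.0)
drooped = relative_gate_delay(0.9)
slowdown = drooped / nominal
print(f"delay increase at 0.9 V: {100 * (slowdown - 1):.1f}%")
```

Under these assumed parameters a modest supply reduction produces a noticeable delay increase, which is why voltage drop has to be accounted for during timing analysis.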
Voltage drop effects are becoming increasingly significant because the resistivity of
the power and ground tracks rises as a
function of decreasing feature sizes (track
widths). These effects can be minimized by
increasing the width of power and ground
tracks, but this consumes valuable real
estate on the silicon, which typically causes routing congestion. In order to solve
these problems, the logic functions have to
be spaced farther apart, which increases
delays (and power consumption) due to
longer signal tracks. Thus, implementing
an optimal power network requires the
balancing of many diverse factors.
Advanced Features Address Variations
in Temperature, Power and Voltage
Drop Effects
The advanced power analysis and repair features of Magma tools enable the analysis
of power, voltage drop, temperature and
the impact of voltage drop on timing.
Automatic power grid synthesis incorporates
IR drop analysis and is used to ensure optimal power distribution without overdesigning the power grid. These play an
important part in avoiding electromigration and thermal breakdown, while keeping cost to a minimum.
They are crucial in safety-critical
automotive applications of Cortex-R4,
such as anti-lock braking systems (ABS)
and electronic stability control (ESC).
Subsequently, intelligent de-coupling
capacitance insertion can be employed on
the power grid to minimize the transients,
keeping leakage power in check as well as
improving yield.
Yet another consideration is that the on-chip temperature gradient (the difference
in temperatures at different portions of the
device caused by unbalanced power consumption) can produce mechanical stress,
which may degrade the device's reliability.

65-nm devices are prone to voltage drop
effects, which are caused by the resistance
associated with the network of wires used
to distribute power and ground from the
external pins to the internal circuitry [with
direct current (DC) related voltage drops,
these are also often referred to as IR drop
effects].

| What | Detail | Advantages |
|---|---|---|
| Target Library | ARM Physical IP Sage X for TSMC 90G process | High-density, High-speed Libraries |
| Cortex-R4 Cache Configuration | 16K Data and 16K Instruction Cache | |
| Target Frequency | 385 MHz | Top performance target achieved including Cross-talk delay avoidance and optimization |
| Max Power | Leakage: 9.8 mW; Internal: 97.8 mW; swcap: 50.0 mW; Total Max Power: 157.7 mW | Low power of approximately 0.4 mW/MHz |
| Standard Cell Count | 115K cells (at 63% Utilization) | |
| Total Cell Area | 1.010 mm² | Low Area |
| Total Area with Memories | 3.487 mm² | Low overall cost of implementation |

Table 1: Vital statistics of the results achieved with the Magma iRM for ARM, in internal implementations
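As a quick consistency check, the power components and the per-MHz figure in Table 1 can be recomputed from the table's own numbers. The variable names are illustrative; the values are copied from the table.

```python
# Values copied from Table 1 (power in mW, frequency in MHz).
leakage_mw = 9.8
internal_mw = 97.8
swcap_mw = 50.0
reported_total_mw = 157.7
target_mhz = 385

component_sum_mw = leakage_mw + internal_mw + swcap_mw
mw_per_mhz = reported_total_mw / target_mhz

print(f"sum of components: {component_sum_mw:.1f} mW")
print(f"power per MHz: {mw_per_mhz:.2f} mW/MHz")
```

The three components sum to within rounding of the reported total, and the ratio is consistent with the "approximately 0.4 mW/MHz" quoted in the table.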
Typical Results Achieved with the
Magma iRM for ARM
Enabling rapid timing closure and fast
turn-around-time (TAT), the RTL-to-GDSII
Magma iRM for ARM takes designers from
the simplicity of an ‘out of the box’ experience to a first working Cortex-R4 processor
in less than one engineering day (once
set up, it typically takes an overnight run to
achieve predictable results, given the configurability of the Cortex-R4 at synthesis time).
Customer Success with Magma iRM
for ARM
Broadcom Corporation has benefited from
the tremendous value of the Magma iRM
for ARM, which captures best practices and
provided them with a working, proven
flow right out of the box. Given the experiments SoC designers typically perform,
the user-configurable and repeatable
nature of the Magma iRM for ARM afforded Broadcom quicker TAT. This allowed
their system designers to explore “what-if”
scenarios on chip function (given the synthesis-time configurability of the Cortex-R4 processor) versus the cost of different physical implementations. With its ease of setup,
the pre-packaged RTL-to-GDSII flow
offered a complete turnkey methodology
for hardening the Cortex-R4 processor and
embedding it in their SoCs. Here are some
sample results from two of Broadcom’s
designs that validate the Magma iRM for
ARM and exceeded their expectations:
Overnight Run to Harden Cortex-R4 with ARM-Magma RM
Performance target achieved: 400 MHz
Floorplan Area: 0.73 mm × 0.73 mm (0.53 mm²)
Library: TSMC 65G 7 LM
Standard Cell Count: 102,060
Standard Cell Area: 0.43 mm²
Utilization: 81.2%
RTL-to-GDSII Runtime: 13.5 CPU Hours
Table 2: High speed 65nm Implementation of Cortex-R4: 10GBit PHY Transceiver at Broadcom

Figure 2: Layout of the Cortex-R4 used in the 10GBit PHY transceiver chip

40% Reduction in Leakage Power with Concurrent Multi-Vt Optimization
Performance target achieved: 150 MHz
Floorplan Area: 0.98 mm × 0.80 mm (0.784 mm²)
Library: TSMC 65LP
Standard Cell Count: 96,239 (Regular Vt: 15,338; High Vt: 80,901)
Standard Cell Area: 0.58 mm²
Utilization: 74%
RTL-to-GDSII Runtime: 9.6 CPU Hours

Figure 3: Layout of the Cortex-R4 used in the disk drive controller chip

Summary
Magma is pushing forward
with excellent low power features and ease of use for chip building. The
Magma iRM for ARM with Cortex-R4 support delivers low-risk, high-performance,
power-optimized embedded processor hardening with a top-down approach. This
greatly simplifies large SoC implementations in deeply embedded applications
in the automotive, imaging and storage
markets.

The ARM and Magma Partnership
Developed in collaboration between ARM
and Magma engineers, the Magma iRM
for ARM has allowed
embedded SoCs based
on Cortex-R4 processor implementations
to take advantage of
rapid time-to-market,
process portability
and a predictable
and proven route to
first-time working silicon. The Magma
iRM for ARM is a pre-verified flow that aids
rapid, application-specific
implementation of ARM powered
SoCs and provides
ease of integration
with the rest of the
chip using the
Magma flow. It provides an excellent
starting point for
implementing several ARM processors
and eases the learning curve to adoption
of ARM IP.

As a result of the ARM-Magma
partnership, end-users benefit
from continuous improvement
of the Magma iRM for ARM, new
EDA technologies and flows, as
well as reference methodology
development and support for
newer ARM processors.
Rapid Implementation of Low Power
ARM Microprocessors
Author:
Alan Gibbons,
Magma Design Automation, Inc.
Synopsis:
As the level of integration on mobile
devices increases, the competing
requirements of high performance
and low power consumption
placed on the implementation of leading-edge
microprocessors are set to continue.
Concurrent optimization and analysis of
power, timing and area must form an
integral part of any implementation flow
for these processor cores. The Blast
Power based low power implementation
solution by Magma Design Automation
is one answer.

Extended battery life, color displays
and the ability to support multiple concurrent applications on mobile devices are
placing considerable focus on power management in the development of ARM-based
system chips.
Careful consideration must be given to
minimizing both static and dynamic
power dissipation without sacrificing performance, and market forces dictate that
designers must meet this challenge within
ever-decreasing time-to-market windows.
Consequently, there is a pressing industry
requirement for the ability to rapidly
implement application-specific, high-performance, low-power processor cores.
ARM has been addressing the needs of low
power applications for years through
novel microprocessor architecture and
design techniques that have culminated
recently in the launch of the flagship
ARM1176JZF-S processor, a processor targeted specifically at the low power needs
of the consumer and wireless application
market.
In the era of synthesizable cores however,
the work required to implement very low
power processors is split across both ARM
and their licensee - the development of the
processor architecture and design falling
into ARM’s domain and the ability to
implement technology specific versions of
the processor in the hands of the licensee.
In order to satisfy an increasingly
demanding customer base, these technology-specific implementations must be both high performance
and energy efficient, and they must be created rapidly
through the use of comprehensive, integrated design flows.
The power consumption of a CMOS device
includes both the dynamic power associated with activity and static power which
reflects the energy consumed when the
device is idle. As we move from one
process to the next, this power consumption
grows exponentially and the need to
address it as a primary design goal
becomes more apparent.
The commercial impacts of increased
power consumption can be severe. In the
mobile world, where we have a finite energy budget, trade-offs must be made in the
feature set of the mobile device – for
example providing a color display at the
expense of prolonged battery life, or the
ability to provide support for both video
and messaging simultaneously. In the
wired world, increased power consumption directly affects packaging costs and
form factors as well as device performance
and failure.
Clearly, the need to reduce the power dissipation is a critical factor in the continued
development of highly integrated portable
devices. The methods used to reduce the
power budget have application at the system level, during sub-system architecture,
IC design and library development. This
article focuses on the techniques we can
employ to reduce the power dissipation at
the IC level.
Reducing Dynamic Power
Dissipation
Dynamic power dissipation can be represented as
$$P_{\text{dynamic}} \equiv a_f \times C \times V^{2}$$
where $a_f$ is the activity expressed as a
function of frequency, $V$ is the supply voltage and $C$ is the capacitance being
switched.
Clearly, by reducing the switching activity,
the voltage or the operating frequency of
the design, we will reduce the dynamic
power dissipation and by simultaneously
reducing all three we will realize significant power reduction.
The most obvious way to reduce the
switching activity of a design is to drop the
frequency. However, there are many other
Volume 4, Number 1, 2005
D E S I G N S T R AT E G I E S A N D M E T H O D O L O G I E S
well-documented techniques that can be
employed within a design to help reduce
the switching activity at a given frequency, including multi-level clock gating,
operand isolation, pin swapping,
technology mapping (hiding high toggle
rate nets within more complex cells), and
factoring.
Various combinations of these techniques
are commonly employed in CMOS designs
today. However, with the exception of
clock gating, these techniques have only a marginal impact on the overall power dissipation. By far the greatest gain in dynamic
power reduction can be achieved by reducing the voltage at which the design operates. Reducing the voltage necessarily reduces the performance of the design,
so if we couple this reduction in voltage with a corresponding reduction in frequency, dynamic power falls with the square of the voltage and linearly with the frequency, giving an almost cubic reduction in the dynamic power dissipation of
the design.
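The near-cubic effect of scaling voltage and frequency together can be checked numerically. This is a minimal sketch that applies the dynamic power relation given earlier in this section; the function name and the operating point (activity-frequency term, capacitance, voltage) are hypothetical values chosen only for illustration.

```python
def dynamic_power(activity_frequency_hz, capacitance_f, voltage_v):
    """Dynamic power in watts: activity-frequency term x capacitance x voltage squared."""
    return activity_frequency_hz * capacitance_f * voltage_v ** 2


# Hypothetical operating point: 100 MHz effective activity, 1 nF switched, 1.2 V.
baseline = dynamic_power(100e6, 1e-9, 1.2)

# Scale the voltage and the frequency by the same factor.
scale = 0.8
scaled = dynamic_power(100e6 * scale, 1e-9, 1.2 * scale)

ratio = scaled / baseline  # equals scale ** 3 up to floating-point rounding
print(f"power ratio after scaling V and f by {scale}: {ratio:.3f}")
```

A 20% reduction in both voltage and frequency roughly halves the dynamic power under this relation. The reduction is only "almost" cubic in practice because the achievable frequency does not track voltage exactly linearly.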
The architecture of the ARM1176JZF-S
processor can be easily partitioned into
separate voltage domains, or islands. It
can operate synchronously, where each
domain is at the same voltage, or asynchronously, where the domains are at different voltage levels. These voltage levels can change over time through interaction with the operating system.
When operating asynchronously, the
domains communicate through level
shifters that sit on the domain interfaces.
When operating synchronously, these
level shifters can be bypassed, restoring the
cycles per instruction (CPI) performance of
the core. The ARM1176JZF-S offers the
designer high performance in
synchronous mode and significant power savings when operating asynchronously during times when the workload
is light.
Reducing static power dissipation
Addressing the dynamic power dissipation
of a high performance microprocessor is
certainly going to provide significant
power savings to the designer. However, as
process migration passes 65nm and continues towards 45nm, where the operating
voltages are lower and the switching
thresholds roll off more rapidly, static
power dissipation is expected to exceed
dynamic power dissipation and become
the dominant contributor to the total
power of a device. Static power minimization must be considered an integral
part of the power reduction strategy.
Figure 2: Static vs. Dynamic Power with process
Migration
Static power dissipation is dominated by
sub-threshold leakage current through the
individual transistors. Although small
in magnitude, the cumulative effect of this
leakage current in a system chip with hundreds of millions of transistors is significant and cannot be ignored.
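The leakage current through a transistor is
a combination of a number of components (including sub-threshold and gate-oxide leakage). It can be approximated as:
$$\text{Leakage} \cong \exp\left(\frac{-q V_{th}}{kT}\right)$$
where $V_{th}$ is the switching threshold of the transistor, $T$ is the temperature, $q$ is the electron charge and $k$ is Boltzmann's constant.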
Figure 1: ARM1176JZF-S Voltage Domains
One important point about this equation
is that it shows that static power dissipation has an exponential dependence on
temperature. This means that as the chip
heats up, its static power dissipation
increases exponentially. Furthermore we
see that static power dissipation has an
inverse exponential dependence on the
switching threshold of the transistors.
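Both dependencies can be made concrete by evaluating the approximation numerically. The sketch below uses the simplified exponential form given in this section as-is; real devices include additional factors (such as a sub-threshold slope coefficient) that soften these numbers, and the threshold voltages and temperatures are hypothetical.

```python
import math

Q = 1.602176634e-19  # electron charge in coulombs
K_B = 1.380649e-23   # Boltzmann constant in joules per kelvin


def relative_leakage(v_threshold, temperature_k):
    """Relative leakage from the simplified form exp(-q * Vth / (k * T))."""
    return math.exp(-Q * v_threshold / (K_B * temperature_k))


# Hypothetical comparison points.
hot_vs_cool = relative_leakage(0.3, 360.0) / relative_leakage(0.3, 300.0)
high_vs_low_vt = relative_leakage(0.4, 300.0) / relative_leakage(0.3, 300.0)

print(f"leakage ratio, 360 K vs 300 K at Vth = 0.3 V: {hot_vs_cool:.1f}x")
print(f"leakage ratio, Vth = 0.4 V vs 0.3 V at 300 K: {high_vs_low_vt:.3f}x")
```

Raising the temperature multiplies the leakage, while raising the threshold voltage divides it, which is the trade-off that multi-threshold design exploits.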
However, as mentioned previously, the
challenge to the designer is to design to a
significantly reduced power budget while
also maintaining high levels of performance
for the design. Increasing the switching threshold of a transistor has the effect
of increasing delay through the device and
consequently reducing the performance.
The ability to use multi-threshold transistors in a design is an excellent technique
for reduction of static power dissipation.
Low (or regular) threshold transistors are
used on timing critical parts of the design
and high threshold transistors on non-critical paths to minimize leakage.
Power Gating
In addition to the use of multi-threshold
libraries, power gating can be used to further reduce the effects of leakage power.
Leakage is state dependent, but unlike
dynamic power it is not activity dependent. Therefore, even when a device has no
switching activity it is still dissipating leakage power. Multi-Threshold CMOS (MTCMOS) switch cells can be used to isolate
specific regions of the design. These
regions can then be powered down when
inactive to significantly reduce leakage
power. The ARM1176JZF-S architecture
lends itself well to this approach, where the
design is partitioned into multiple voltage
domains with active regions and MTCMOS regions defined. MTCMOS switch
cells are inserted into the power mesh in
the MTCMOS region. These switch cells
are enabled by a sleep signal sourced from
the active region. When enabled, these
switch cells disconnect the inactive part of
the design from the power network,
isolating the logic from the
power mesh and reducing the leakage current.
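The benefit of powering down an inactive region depends on how long it sleeps and how much leakage remains once the switch cells are off. The following is a rough sketch under stated assumptions: the residual leakage fraction, the duty cycle and the function name are hypothetical, and wake-up energy is ignored.

```python
def average_leakage_mw(leakage_mw, active_fraction, gated_residual=0.05):
    """Time-averaged leakage of a region that is power gated while inactive.

    gated_residual is the assumed fraction of leakage that remains when the
    switch cells are off (a hypothetical value, not taken from the article).
    """
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    asleep_fraction = 1.0 - active_fraction
    return leakage_mw * (active_fraction + asleep_fraction * gated_residual)


# Hypothetical region: 10 mW of leakage, active 20% of the time.
gated = average_leakage_mw(10.0, 0.2)
ungated = average_leakage_mw(10.0, 1.0)
print(f"average leakage with gating: {gated:.2f} mW (ungated: {ungated:.2f} mW)")
```

The longer a region sits idle, the closer its average leakage gets to the residual level, which is why power gating pays off most for blocks with light workloads.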
Comprehensive, integrated tool flow
Implementation of an ARM1176JZF-S
processor with optimal performance that
supports the previously mentioned techniques to reduce power places a significant burden on the design flow and technology/library combination used. The
design flow must be capable of achieving
excellent performance with a traditional
SI aware approach for nanometer technology while automating the handling of
multiple voltage domains and minimizing leakage. Specifically, support is
required for automatic insertion of level
shifters and isolation cells between voltage
domains, domain specific optimization,
power grid synthesis, multi-Vt optimization and switch cell insertion.
Managing these low power techniques
must be an integral part of the optimization flow and cannot be an afterthought.
Simultaneous optimization for timing,
area, power and SI is a minimum requirement for an ARM1176JZF-S design flow.
Further, in order to meet aggressive time to
market requirements, simultaneous optimization for these design characteristics
requires a unified data model with support
for concurrent processing. This level of
integration extends to the analysis environment. Lack of integration between
implementation and analysis tools can
lead to a significant time penalty in
resolving false errors and inconsistent
data that can force designers to overcompensate in certain areas of the flow resulting in a sub-optimal implementation.
Magma Design Automation’s Blast Fusion
RTL-to-GDSII design flow enables the
designer to meet this challenge today
through the ability to continually optimize for timing, power and area through
all phases of the design flow. Blast Fusion
integrates a number of focused solutions
around a unified data model that enables
the optimization, implementation and
analysis engines to get immediate access
to continually updated logical, physical,
timing and power data. This single pass
approach allows the engines to make
instant decisions that ensure optimal
results. Specifically, Blast Create is used for
synthesis, Blast Plan Pro for prototyping,
Blast Noise for signal integrity, Blast Rail
for power integrity and Blast Power for
power management.
Blast Power forms the heart of the implementation flow for the ARM1176JZF-S and
is used to provide a complete low power
implementation solution supporting
power aware synthesis, leakage power
minimization with Multi-Vt libraries,
automated support for dynamic voltage
and frequency scaling, including automated level shifter insertion, domain
based optimization and multi-corner optimization. In addition, Blast Power provides automated power grid synthesis and
can automatically insert and optimize
decap cells based on transient voltage
drop analysis.
Multiple Voltage Domains
Blast Power provides a domain based
methodology to handle multiple voltage
domains within a design. Using domains
with associated floorplans a design can be
partitioned into a number of regions each
operating at different voltages and frequencies.
Through the specification of domains, the
designer is able to identify the power and
ground nets, the nature of the power supply (constant, switched, variable etc.) and
other salient process and temperature
characteristics. By associating domains
with floorplans, the designer is able to
identify the logical to physical relationship and identify which cells are attached
to which domain and the physical location of the domain within the design. The
domains and floorplans are maintained
through synthesis and physical optimization as well as during timing and power
analysis. New cells added during the
implementation process are automatically
attached to the correct domain and connected to the corresponding power supply.
Having partitioned the design into specific
voltage islands (domains) and physical
regions (floorplans) Blast Power can automatically determine which interfaces need
to be level shifted and insert the appropriate type of level shifter and/or isolation
cell. The sensitivities associated with level
shifter insertion, such as buffering and secondary power supply routing are handled
automatically within the tool.
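The decision of which interfaces need which cells can be illustrated with a simplified rule set. This is not Blast Power's algorithm, which the article does not describe; it is a hypothetical sketch in which a crossing gets a level shifter when the two domain voltages differ and an isolation cell when the driving domain can be powered down. The `Domain` class and `interface_cells` function are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Domain:
    name: str
    voltage: float
    switchable: bool = False  # True if the domain can be powered down


def interface_cells(driver: Domain, receiver: Domain) -> List[str]:
    """Return the special cells assumed necessary on a driver-to-receiver crossing."""
    cells = []
    if driver.name == receiver.name:
        return cells  # signal stays inside one domain
    if driver.switchable:
        cells.append("isolation")  # clamp outputs of a domain that may be off
    if driver.voltage != receiver.voltage:
        cells.append("level_shifter")  # translate between supply levels
    return cells


# Hypothetical domains.
core = Domain("core", 1.0, switchable=True)
always_on = Domain("always_on", 1.2)

crossing = interface_cells(core, always_on)
internal = interface_cells(always_on, always_on)
print(crossing, internal)
```

A production flow also has to choose the shifter direction, place the cells in the correct physical region and route the secondary supply, as noted in the paragraph above.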
This automated domain based approach
significantly reduces the complexity of the
multi-voltage implementation process.
Figure 3: Comprehensive RTL to GDSII Solution
Figure 4: ARM1176JZF-S Physical Domains
Leakage Mitigation
Typically a multi-Vt library contains two
or more versions of the same standard cell
set: one set contains high-Vt cells and the
other contains low-Vt cells. Blast Power
automatically reduces leakage current in
the design by using high-Vt (slower, lower
leakage) cells for non-critical paths and
low-Vt (faster, higher leakage) cells for the
timing-critical paths in the design.
Through the unified data model and integrated analysis environment, concurrent
optimization for both timing and leakage
is possible, yielding an optimal implementation. Unlike conventional flows, the
Blast Power multi-Vt flow performs leakage optimization at different stages of the
design flow, resulting in superior QoR.
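The idea of spending timing slack on high-Vt cells can be sketched with a simple greedy pass. This is an illustrative simplification, not the optimization Blast Power performs: it assumes each cell belongs to exactly one path, that swapping any cell to high-Vt costs the same fixed delay, and that slack is tracked in whole picoseconds. The function name and data layout are hypothetical.

```python
def assign_vt(paths, hvt_delay_penalty_ps):
    """Greedily assign high-Vt cells wherever a path still has enough slack.

    paths maps a path name to (slack_ps, [cell names]).
    Returns a dict mapping each cell name to "HVT" or "LVT".
    """
    assignment = {}
    for slack_ps, cells in paths.values():
        for cell in cells:
            if slack_ps >= hvt_delay_penalty_ps:
                assignment[cell] = "HVT"  # slack can absorb the slower cell
                slack_ps -= hvt_delay_penalty_ps
            else:
                assignment[cell] = "LVT"  # keep the fast cell to protect timing
    return assignment


# Hypothetical design: one critical path and one path with 250 ps of slack.
paths = {
    "critical": (0, ["u1", "u2"]),
    "relaxed": (250, ["u3", "u4", "u5"]),
}
result = assign_vt(paths, hvt_delay_penalty_ps=100)
print(result)
```

Cells on the critical path stay low-Vt, while the relaxed path absorbs as many high-Vt swaps as its slack allows. A real flow must handle cells shared between paths and re-time the design after each change.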
Multiple Corner Analysis
Technology libraries are characterized for
leakage and dynamic power for all possible arcs that can be exercised. For multi-corner analysis, traditional flows handle
one PVT/library at a time, leading to multiple sequential runs. Depending on the
number of corners for the design, this can
result in long run times, multiple design iterations and convergence problems. Magma uses the super
corner approach to concurrently analyze
and optimize multi-corner designs much
faster, helping cover timing criticality for
all corners. When performing concurrent
optimization and analysis at different
operating corners, the correct selection and
derating of the characterization data is
required.
When changing the voltage of a supply
net, the timing and power behavior of all
cells supplied by that net changes. If the
voltage differs only slightly from the characterization value then derating may suffice. Blast Power supports numerous derating methods, including k-factor, polynomial models, and support for ECSM characterized libraries.
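Linear k-factor derating can be sketched as scaling a characterized delay by one multiplier per operating-condition deviation. The form below is a common simplified convention and is assumed here; the article does not give Blast Power's exact formulation, and the coefficients and deviations are hypothetical.

```python
def derate_delay(nominal_delay_ps, k_volt, delta_v, k_temp, delta_t):
    """Scale a characterized delay for small voltage and temperature deviations.

    k_volt is the fractional delay change per volt and k_temp the fractional
    change per degree Celsius (both assumed, illustrative coefficients).
    """
    return nominal_delay_ps * (1.0 + k_volt * delta_v) * (1.0 + k_temp * delta_t)


# Hypothetical case: 100 ps nominal delay, supply 50 mV below and temperature
# 20 degrees above the characterization point.
derated = derate_delay(100.0, k_volt=-0.5, delta_v=-0.05, k_temp=0.001, delta_t=20.0)
print(f"derated delay: {derated:.2f} ps")
```

Because the correction is linear, it is only trustworthy close to the characterization point, which is why the text recommends additional characterization for larger voltage changes.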
However, to improve accuracy and avoid
derating altogether, the cell library can be
characterized at many different operating
conditions. This more extensive characterization data is then used by Blast
Power to target the operating condition
of choice.
1650 Technology Drive, San Jose, CA 95110 USA | Tel: 408-565-7500 | Fax: 408-565-7501 | www.magma-da.com
© 2008 Magma Design Automation, Inc. All rights reserved. Magma is a registered trademark of Magma Design Automation.
All other product and company names are trademarks or registered trademarks of their respective companies. 10/08