Compendium of Articles by Magma Design Automation

ARM and Magma Implementation Reference Methodologies
Addressing High Performance, Low Power, Small Area and Full Automation
Reining in Time-To-Market for Next Generation Embedded Design

A Compendium of Articles by Magma Design Automation from the ARM Information Quarterly Magazine, a publication of ARM and Convergence Promotions.

CONTENTS

Talus® and Automation
• Accelerating Hierarchical Implementation of the ARM® Cortex™-A9 MPCore™ processors with Magma’s Hydra™
• Jumpstarting ARM Cortex-A9 MPCore Processor-based SoC Designs with Talus and Hydra

High-Performance
• A Fully Automated High Performance Implementation of ARM Cortex-A8

Low-Power
• Magma iRM for ARM Powered-optimized Cortex-R4 Processors
• Rapid Implementation of Low Power ARM Microprocessors

ARM® and Magma® in Partnership

Diagram labels: ARM RTL Deliverables; Application Constraints + Technology Libraries; ARM and Magma iRM; Hardened Core; Abstracted Model; Industry Standard Views/Models for SoC Integration + High-Quality ARM Processor Implementation.

Implementation Reference Methodologies (iRMs) for customer-specific hardening of synthesizable ARM processors
• Co-developed by ARM and Magma
• The proven route to successful silicon
• The basis of a custom deployment methodology
• Eases learning curve to adoption of ARM IP
• Enables rapid, application-specific implementation

A process of continuous improvement
• Reap the benefit of researching new tools and techniques
• Leverage best practices for embedded design implementation

Most mobile phone chips embed ARM® processors. Most mobile phone chips are designed with Magma® software. Shouldn’t you be using ARM and Magma?
Accelerating Hierarchical Implementation of the ARM Cortex-A9 with Magma’s Hydra

Authors: Stuart Riches and Philip Watson, ARM Ltd., Cambridge; Jim Schultz, Pete Churchill, Somasekhar Eerappa and Vasu Madabushi, Magma Design Automation

Synopsis: Multicore processor-based systems are the future of embedded computing in high-performance hand-held and power-hungry devices. The ARM Cortex-A9 MPCore multicore processor is ideally poised to meet this demand for System-on-Chip (SoC) designs that embed such processors. The sheer size of SoCs powered by the ARM Cortex-A9 MPCore multicore processor will make it difficult to manage traditional tasks such as design partitioning, time budgeting, hierarchy management, block shaping and power planning. With time-to-market pressures shrinking the design time, SoC implementers are turning to hierarchical chip planning and finishing systems to automate these traditional tasks and to reach quality floorplans faster so that they can close their designs.

Need for Multicore Processor-based Solutions

Next-generation smart phones and Mobile Internet Devices (MID) will incorporate exciting features that allow watching high definition movies, making video phone calls, playing 3-D video games, watching live TV, navigating by GPS satellite and, not least of all, enjoying a rich Internet browsing experience. These capabilities are possible in large part because technologies that earlier required desktop computing power and bandwidth are rapidly converging in the embedded System-on-Chip (SoC) platform. Designing such handheld or mobile devices means combining rich functionality and high-performance computing while minimizing power consumption. Delivering the necessary performance requires higher processor frequencies, which increases power consumption and heat dissipation. Moreover, pushing synthesis tools to achieve that last increase in MHz results in a significant area penalty.
In summary, both power and area increase exponentially as the envelope of achievable frequencies is pushed. The solution lies in scaling high-performance computing and power consumption to meet a particular device’s requirement. Advanced technologies available within the ARM Cortex-A9 MPCore processors help accelerate system performance, reduce power consumption, and include key new features and approaches that enhance embedded multicore designs. These multicore processors have better power and area efficiency at the same performance point compared with uni-processors. For example, the ARM Cortex-A9 MPCore processors have a smaller memory footprint that contributes to area and power reduction. The ability to turn off processors with power gating depending on the workload requirement, together with voltage and frequency scaling, also enables significant power savings. Multicore processors thus offer excellent scalability of performance and power.

This article will describe the quad-core hierarchical implementation of ARM’s latest Cortex-A9 MPCore multicore processor using Hydra, Magma’s automated hierarchical design solution. New technologies, features and the ability to provide hand-off quality floorplans at different stages of the design cycle, from early prototyping to final implementation, will be discussed.

Figure 1: ARM Cortex-A9 MPCore architecture of a quad-core configuration

(Information Quarterly, Volume 7, Number 3, 2008)

Design Size and Complexity of SoCs with Embedded ARM Cortex-A9 MPCore Multicore Processors

To meet the increasing demand for compute power, the size and complexity of embedded processors have increased. The relative size of an ARM processor has always been just a fraction of the whole SoC. However, there is currently a significant, in fact, exponential change.
For example, the ARM926EJ-S™ processor can be efficiently implemented with under 75K placeable instances, while the ARM11™ MPCore™ (quad) and the Cortex™-A8 processors require approximately 500K placeable instances. There were tremendous advantages in maintaining a flat approach with these implementations, which were more straightforward when artificial timing boundaries and floorplan regions were not required. The Power, Performance and Area (PPA) targets were always easier to achieve in a flat implementation since the macro and cell placers had more degrees of freedom to exercise.

However, we have reached a point where, in balancing the need for turnaround time (TAT) and straightforward implementation methodologies, some compromise is required. Take for example the Cortex-A9 MPCore, where the dual configuration including the NEON™ and Floating Point Unit (FPU) blocks is about 800K instances in size and a full quad configuration includes well over one million cells. In addition, the ARM Cortex-A9 MPCore multicore processor allows excellent configurability with a choice of single, dual and quad cores, processor trace macrocell (PTM) and L1 cache size, including NEON (SIMD media processing/DSP engine), FPU, L2 cache controllers, debug interfaces and AXI bus masters. Such a full configuration can contribute in major ways to increasing the design size, complexity, and challenges posed by deep sub-micron process technologies and libraries.

Figure 2: Quad-Cortex-A9 MPCore-based SoCs: Increasing design size contributing to complexity and challenge

Complex Cortex-A9 MPCore-based SoCs can have hundreds of macros, multiple voltage domains and complex clock topology, particularly in the mobile market segment where power management is paramount. Such exponential growth in the complexity of chips means several weeks of runtime to implement physical blocks flat, exceeding the bounds of reasonable TAT. We have moved into an era where the processor core has become a very large IP subsystem in its own right and the target for a hardened re-useable block is much larger. As a result, a hierarchical approach becomes a natural choice, but that will require high quality floorplanning and hierarchical planning solutions.

What is needed today is a methodology that enables hierarchical design planning and prototyping of designs using advanced technologies: one that performs automated partitioning and shaping, optimized macro placement, global route-based pin assignment, accurate budgeting and black-box handling. The advantages of such an approach would be ease of use, faster TAT, early feedback and predictability, and tight correlation with the final implementation. Magma’s Hydra provides such an infrastructure for a comprehensive floorplan synthesis and hierarchical planning solution. Hydra also uses the same underlying engines for standard cell and macro placement, physical optimization, timing and routing as the Talus® IC Implementation platform, ensuring the quality and accuracy of the floorplan. This also ensures that both the logic and the physical design engineer can close on a solution faster.

An added advantage is the hierarchical design planning capabilities of Hydra, which allow designers to quickly explore multiple floorplan implementations. Utilizing Relative Floorplanning Constraints™ (RFCs), designers can create convergent floorplans by retaining floorplan changes between iterations. This becomes very useful when accommodating late-arriving changes to the design that require small changes to the core size and shape without performing time-consuming macro placement updates.

With lessons learned from an earlier proven Talus-based hierarchical implementation methodology for ARM11 MPCore, we set out to implement the quad-core Cortex-A9 MPCore processor with Hydra.

Design Planning: First Impressions with Hydra

Design exploration is a difficult, time-consuming and iterative task. Designers typically make a large number of iterations before homing in on an optimal floorplan. Don’t we often hear SoC Project Managers complain when floorplan changes are made? A common refrain is “I thought you finished the floorplan last week?” For an embedded processor core in an SoC, these changes will likely include size, shape, pin placement, macro placement and power grid.

Figure 3: Hydra analyzes data and suggests an implementation: virtual flat placement before automatic partitioning

The challenge posed by the full configuration quad Cortex-A9 MPCore multicore processor was that it was almost three times the size of an ARM11 MPCore processor, with 2200 top-level boundary pins, 150 macros and a frequency target of 600 MHz with a TSMC 65GP ARM Advantage-HS physical IP library. In addition, it also included memory BIST and scan runtime.

One of the conundrums of floorplanning a large SoC with an embedded multicore processor is figuring out where to start. This is where an automated solution is extremely valuable. Pictured in Figure 4 is the result after the initial floorplan of the quad-Cortex-A9 MPCore processor. The logic was placed and the design was partitioned and automatically shaped into hierarchical blocks. In the first pass, a square 4 mm x 4 mm outline was specified. While the chances of having a square area in which to place such a large multicore processor are almost negligible, the important point is that Hydra provided us with a solid starting point for further exploration in less than three hours of runtime.

Figure 4: Result after the initial floorplan of the quad-Cortex-A9 MPCore processor

Channel Sizing

Let us now look at the results of multiple shaping experiments. Hydra offers flexibility in design style with support for near-abutted and channel-based hierarchical designs. From a congestion analysis standpoint, it was clear that locating the highly-connected Snoop Control Unit (SCU) block in the middle (orange color in Figure 5) would generate quite a bit of congestion, leading to pin-assignment issues. This allowed for further exploration with the tool to seek a better solution for the heavily congested SCU block. At first, some of the timing-critical logic from the SCU block was re-partitioned by placing it at the top level and allowing it to move into the channel. Additionally, the SCU partition was given a smaller area utilization target so that the block could grow. Finally, given that there was more top-level logic, the global channels were widened. Honoring these constraints, Hydra shaped new partitions for the entire design in under an hour.

Figure 5: Quad-Cortex-A9 MPCore: Hydra uses channel sizing to provide mixed hierarchy/flat support

Partial Floorplan Re-use

Even after widening the channels and pulling the SCU logic to the top, the block continued to have issues. Each of the four CPUs has 1500 pins and the SCU contained 5500 pins just by itself. To solve this, some radical approaches were adopted. It was decided to delete the SCU partition completely and put the logic at the top level. The real challenge lay in reusing what had already been achieved up to that point and building from there. The CPUs highlighted by the bright green and yellow blocks were frozen while the turquoise and purple CPU blocks were manually re-shaped. A few of the macros were manually fixed, while the placement of other macros was reused through Relative Floorplanning Constraints that the tool derived from the previous iteration. This gave us the results seen in Figure 6.

Figure 6: Floorplan results after partial reuse of results

Wider Aspect Ratio for a Better/Efficient Floorplan

In a real design scenario, if the chip integrator changes the floorplan to, say, a wider aspect ratio, how would one make use of the knowledge gained from the previous run? The solution is to do an intelligent restart! Pre-defined partitions and top-level logic from the previous run were used. Relative Floorplanning Constraints can be used to assign relative locations to floorplan objects. The channel size determined in the previous run was also reused. Given that the previous pin assignment was good, the relative pin locations were re-used too. In 20 minutes, we were able to re-place and shape the design.

Figure 7: Quad-Cortex-A9 MPCore: Hydra re-use allows rapid prototyping

Also important to note in Figure 7 above is the rectilinear shape of the four CPU hierarchical blocks. If trapped in a world of pure rectangles, floorplanning would become a nearly impossible task. Rectilinear shapes are a must to fully utilize the silicon area available. Various hierarchical blocks in a design will have different shape requirements as dictated by internal hard macros, external location and connectivity. Hydra’s shaper provided an easy process for creating initial rectilinear shapes and refining current shapes as the design matures, as seen by the progression of our work.

Using Relative Floorplanning Constraints (RFCs) on the Quad Cortex-A9 MPCore Processor Implementation

The auto-interactive macro placer in Hydra is straightforward and easy to run, and the results are excellent for getting quick feedback on a floorplan shape and pin location. This is a solid starting point from which to use Relative Floorplanning Constraints effectively for further changes. Hydra offers the ability to extract the relative constraints from an existing macro placement, which can then be adjusted and easily maintained going forward. The RFC extraction capability is very noteworthy: this feature provides the bridge from the prototype world to the production world.

Figure 8: Arrows showing anchor points of using Relative Floorplanning Constraints in the Cortex-A9 MPCore floorplan: Define arrays and locate arrays using relation to an object

So, why would one want to use Relative Floorplanning Constraints? They can be used to guide the macro placer to get the final floorplan. One can fully specify a relative set of constraints without the need to run the macro placer. They offer a method for capturing and implementing true “designer intent”. They allow a floorplan to be specified (scripted) as a human would think (such as “a group of 8 rams in the upper right corner, 4 more rams stacked just below them,” etc.), and not in a Cartesian-based dump file (as a tool would think). Most importantly, Relative Floorplanning Constraints allow small changes in the design to be absorbed without a need to change the macro placement. Case in point was the real-life example of having to change the RAM size two days before release of the ARM and Magma implementation Reference Methodology (iRM) for the Cortex-A9 MPCore. An updated floorplan file was not necessary because everything was placed with relative constraints, thereby saving us several days in TAT.

Addressing Channel Congestion

Congestion analysis helps determine the custom channel sizing. Through our previous work we were able to remove the congestion in the central area. However, there were still some channels that needed closer investigation. The solution was to custom size those channels. It wouldn’t be prudent to globally make all channels larger. This was accomplished by reviewing the congestion in the Hydra GUI with the congestion map and the channel report to identify hot spots. The channel report also suggests channel sizes for a given utilization. Additionally, blockages can be added to manually alleviate any boundary pin congestion. It took under 20 minutes to turn around the design, including shaping, macro placement and global route.

Figure 9: The wide floorplan and dispersed SCU logic was successful in reducing the central congestion and further analysis of local channel congestion suggested custom sizing.

Figure 10: After modifying the channels and rerunning the placer & global router the design is nearly congestion-free.

Congestion Clean Floorplan Implementation

In summary, we have a nearly congestion-free design before full-blown optimization. The key here is to fix any gross timing violations and congestion issues before going into optimization. This reduces the overall runtime since the tool will not try to fix impossible paths. A very acceptable floorplan was settled in a matter of days, not weeks. A natural progression would be to experiment with implementing the four CPUs as repeated blocks.

Figure 11: Early analysis and partitioning solves gross timing/congestion issues before optimization, reducing overall TAT

Conclusions

Based on the results we obtained, it is clear that the above approaches offer many benefits. Rectilinear shaping and auto-interactive macro placement are valuable features of Hydra that allowed optimal use of the quad Cortex-A9 MPCore floorplan area. When late-arriving RTL and aspect ratios of macros change, Relative Floorplanning Constraints may be effectively utilized to feed back to the shaper and cluster placement engine to reduce TAT significantly. Gross timing violations and congestion can be fully addressed even before going into full optimization. Overall, the auto-interactive features of Hydra offer a fast prototyping, hierarchical chip planning and finishing system for SoCs that embed different configurations of the Cortex-A9 MPCore multicore processor. Utilizing these and other features of Hydra early in the design process will yield confidence in closing the overall design after placement optimization and routing, as well as saving months of time compared to a manual approach, thereby maximizing productivity.
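The relative-constraint idea from the RFC section can be illustrated with a small sketch. It is not Hydra's RFC syntax or engine, which the article does not show; the `Rect` type, the helper functions, the RAM dimensions and the wider 5 mm x 3.2 mm outline are all hypothetical. Only the 4 mm x 4 mm first-pass outline and the "8 rams in the upper right corner, 4 more stacked just below" description come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; (x, y) is the lower-left corner, units are microns."""
    x: float
    y: float
    w: float
    h: float


def array_in_upper_right(core: Rect, cols: int, rows: int, cell_w: float, cell_h: float) -> Rect:
    """Bounding box of a cols x rows macro array pinned to the core's upper-right corner."""
    w, h = cols * cell_w, rows * cell_h
    return Rect(core.x + core.w - w, core.y + core.h - h, w, h)


def array_below(ref: Rect, cols: int, rows: int, cell_w: float, cell_h: float) -> Rect:
    """Bounding box of an array stacked directly below `ref`, right edges aligned."""
    w, h = cols * cell_w, rows * cell_h
    return Rect(ref.x + ref.w - w, ref.y - h, w, h)


def resolve(core: Rect, ram_w: float, ram_h: float) -> dict:
    """Turn the relative description into absolute coordinates for one core outline."""
    # "a group of 8 rams in the upper right corner" (modelled here as 4 columns x 2 rows)
    group_a = array_in_upper_right(core, cols=4, rows=2, cell_w=ram_w, cell_h=ram_h)
    # "4 more rams stacked just below them"
    group_b = array_below(group_a, cols=4, rows=1, cell_w=ram_w, cell_h=ram_h)
    return {"group_a": group_a, "group_b": group_b}


# The same relative description resolved against three situations (all sizes hypothetical
# except the 4000 x 4000 micron first-pass outline).
square = resolve(Rect(0, 0, 4000, 4000), ram_w=200, ram_h=150)
wide = resolve(Rect(0, 0, 5000, 3200), ram_w=200, ram_h=150)
bigger_ram = resolve(Rect(0, 0, 5000, 3200), ram_w=220, ram_h=150)

for name, result in (("square", square), ("wide", wide), ("bigger_ram", bigger_ram)):
    print(name, result["group_a"], result["group_b"])
```

The description itself never changes between the three calls; only the outline or the RAM size does. That is the property the article relies on when a late RAM-size change did not require a new floorplan file.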
Hear Magma Design Automation give two presentations at the ARM Developer’s Conference:
1. Quad-core Cortex-A9 MPCore Multicore Processor Implementation
2. Advanced Techniques for Implementing Cortex-M3 based Ultra-Low Power Designs
Visit www.rtcgroup.com/arm/2008/ for more information.

DESIGN STRATEGIES AND METHODOLOGIES

Magma iRM for ARM Powered™ optimized Cortex-R4 Processors

Authors: Vasu Madabushi, Gary Powell and Joe Walston, Magma Design Automation; Stuart Riches, ARM Limited, UK

Synopsis: As the feature size of process technology becomes smaller, leakage power dominates overall power dissipation, and the problem gets worse if the design is supposed to perform at very high frequencies. This article addresses specific problems in achieving high performance and low power while implementing nanometer system-on-a-chip (SoC) designs with the Magma iRM for the ARM Cortex-R4, which is based on Magma Design Automation’s IC implementation software. This article also presents solutions and identifies necessary enhancements to the overall methodology so that the entire design flow can be automated to accelerate turnaround time.

The ARM® Cortex™-R4 processor targets deeply embedded applications in the extremely competitive imaging, automotive, wireless base-band and storage vertical markets, where achieving optimal performance at the least possible overall cost is paramount. Magma’s IC implementation software lends itself to a power-aware design flow that lets designers make timing-vs.-power and area-vs.-power trade-offs at different stages of the flow. It provides access to appropriate low-power analysis and optimization engines that are integrated with, and applied throughout, the entire RTL-to-GDSII flow.

Stringent Requirements of Cortex-R4 Embedded Applications

Cortex-R4 covers a wide area of applications, including mass storage/hard-disk drive controllers, digital video and still cameras, car chassis/braking systems, mobile wireless modems, intelligent PC-independent printers, networking and home gateways. In such deeply embedded high-volume systems, balancing performance targets with overall cost is a delicate act. For example, given the relatively low cost (price per chip) of disk-drive controllers with embedded Cortex-R4 processors, any area savings at a given performance point may directly translate into increased profitability. On the other hand, safety-critical automotive applications require design margins to be built into the chips to address reliability concerns due to extreme temperature variations and an extended product lifetime of 10+ years. Under such conditions, poor power supply (mesh) design can lead to voltage (IR) drop issues. Analyzing and preventing electromigration during the design phase is necessary in order to avoid thermal breakdown on the power network and to address signal integrity issues. All of these can place severe restrictions on process choices, maximum operating frequency and design complexity.
Here’s an overview of Cortex-R4 features that mitigate some of these requirements:

• ARMv7-R architecture
  – Thumb®-2 technology ensures a more efficient processor design
  – Optional MPU: applications that don’t require it can save on area
  – 32-bit signed/unsigned hardware divider for control applications
  – Improvements to interrupt handling for hard real time applications
• Micro-architecture
  – 8 stage selective dual issue pipeline ensures minimized area overhead
  – Optional I and D caches, Tightly Coupled Memories (TCM)
  – AMBA 3 AXI master port for efficient on-chip interconnect
    • DMA access to TCMs through slave port
  – Global history branch predictor and function call return stack
• Flexible synthesis-time configurability ensures efficiency and significant area savings depending on application market requirement
  – Each cache can be 4KB to 64KB or can be removed
  – 1 to 3 TCMs (up to 8MB), or can be removed
  – 8 or 12 regions in MPU, or no MPU
  – 2 to 8 Breakpoints, 1 to 8 Watchpoints

Figure 1: The ARM Cortex-R4 architecture

(Information Quarterly, Volume 5, Number 4, 2006)

Relatively lower target clock frequencies mean that Cortex-R4 implementations are smaller than the ARM11 family of processors. Cortex-R4 includes a number of features that reduce supporting memory cost. The processor pipeline has fewer stages and yet accesses local RAMs over two clock cycles. This enables the use of a lower speed memory library, which drastically reduces silicon area and power consumption. Moreover, use of the RAMs from the ARM Physical IP Metro® library in Cortex-R4 implementations gives about 35% area and greater than 50% power savings. All of the above, together with the Magma iRM for ARM, contribute towards making timing closure easier, shortening the design cycle and reducing risk.
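The synthesis-time options listed above amount to a small set of range checks. The sketch below captures them in a hypothetical `CortexR4Config` class; it is not an ARM or Magma deliverable, and every field name is invented for illustration. Two points are assumptions rather than facts from the article: the "up to 8MB" limit is applied to each TCM individually, and no power-of-two restriction is placed on sizes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class CortexR4Config:
    """Hypothetical container for the synthesis-time options listed in the article."""
    icache_bytes: Optional[int] = 16 * KB   # None means the cache is removed
    dcache_bytes: Optional[int] = 16 * KB   # None means the cache is removed
    tcm_bytes: Tuple[int, ...] = ()         # one entry per TCM; empty means no TCM
    mpu_regions: int = 0                    # 0 means no MPU
    breakpoints: int = 2
    watchpoints: int = 1

    def problems(self) -> list:
        """Return a human-readable message for every option outside the stated ranges."""
        issues = []
        for name, size in (("icache", self.icache_bytes), ("dcache", self.dcache_bytes)):
            if size is not None and not 4 * KB <= size <= 64 * KB:
                issues.append(f"{name}: {size} bytes is outside 4KB to 64KB")
        if len(self.tcm_bytes) > 3:
            issues.append(f"tcm: {len(self.tcm_bytes)} TCMs requested, at most 3 allowed")
        for index, size in enumerate(self.tcm_bytes):
            # Assumption: the 8MB ceiling applies per TCM.
            if not 0 < size <= 8 * MB:
                issues.append(f"tcm[{index}]: {size} bytes is outside the 8MB limit")
        if self.mpu_regions not in (0, 8, 12):
            issues.append(f"mpu: {self.mpu_regions} regions, expected 0, 8 or 12")
        if not 2 <= self.breakpoints <= 8:
            issues.append(f"breakpoints: {self.breakpoints}, expected 2 to 8")
        if not 1 <= self.watchpoints <= 8:
            issues.append(f"watchpoints: {self.watchpoints}, expected 1 to 8")
        return issues


default_cfg = CortexR4Config()  # 16K instruction and 16K data cache
bad_cfg = CortexR4Config(icache_bytes=128 * KB, mpu_regions=10, watchpoints=0)

print(default_cfg.problems())
for message in bad_cfg.problems():
    print(message)
```

Checking a configuration before launching an implementation run is cheap compared with discovering an unsupported option after an overnight flow.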
Cortex-R4 also offers an increase in performance over the ARM946E-S processor, in terms of both maximum operating frequency and improved computing efficiency, without compromising on low power and size. In order to closely match Cortex-R4 features with its wide application coverage, end-users can take advantage of the excellent configurability at synthesis time.

The Need for an Integrated Power-Aware Design Flow in Implementing Cortex-R4-Based SoC Designs

In traditional flows, power considerations are addressed by stand-alone tools, without enough attention being paid to how they simultaneously impact timing, area and turnaround time. The lack of integration between point tools and the rest of the design environment can result in a tremendous number of “false errors” that can render a design impossible to close. Worse yet, this lack of integration coupled with limited repair capabilities can result in an unreliable power network, causing numerous time-consuming design iterations. Another problem with the majority of today’s design environments is that engineers concentrate on analyzing and addressing power considerations during physical design. Hence, poor decisions made during the early stages of the design make it almost impossible to fix any problems during implementation.

What today’s SoC designers need is access to proven solutions that address all of the above issues, along with shrinking process geometries (65-nm and below) and lower power consumption, while simultaneously improving performance and reducing area. The Magma iRM for ARM adds tremendous value by helping ARM licensees become familiar with best practices that are robust and deliver a completely hardened macro for re-use in a hierarchical SoC design. The Magma iRM for ARM takes advantage of powerful features in Magma’s tools, such as a single executable with a single timing engine and a single, unified data model.
The common data model for algorithms allows analysis, optimization and implementation for timing, power and signal integrity to be performed concurrently. Standard format outputs allow end-users to perform quick and easy verification of the implementation.

Salient Features of the Cortex-R4 Magma iRM for ARM

The current Magma iRM for Cortex-R4 is based on Magma’s Blast 5.0 toolset, which contains:

• Full RTL-to-GDSII Flow
  - Verilog RTL synthesis
  - DFT scan chain insertion
  - Floorplanning, power grid & physical synthesis
  - Clock tree synthesis
  - Final routing
  - DRC checking
  - Static timing analysis
• Flexible Packaged Flow
  - Scalable across different technologies
  - Suitable for different core configurations
  - Direct Tcl script access for quick customizations
• Advanced Feature Support
  - Cross-talk delay analysis and avoidance optimization techniques
  - Insertion and optimization of clock gating
  - Options to control optimization effort

All of the above will also equally apply to the recently announced ARM Cortex-R4F processor, because the two processors share a common flow. The additional advanced features of the Cortex-R4F processor include support for Error-Correcting Code (ECC) in the cache memories and TCM, the extension of error detection into the interconnect, a synthesis-optional Floating-Point Unit (FPU) and additional synthesis configuration of DMA.

Key Low-power Considerations for Implementing the ARM Cortex-R4 Processor

The key low power design considerations include dynamic and static (leakage) power dissipation, temperature and performance, and voltage drop effects. These are addressed continually throughout Magma’s RTL-to-GDSII flow.

Dynamic Power Dissipation

During synthesis, gate sizes and cell counts are reduced, which directly translates to lower dynamic power. Automatic clock-gate insertion and optimization have the dual advantage of reducing the overall area and the dynamic power consumption.
Moreover, the clock tree in any design will typically consume a significant portion of the budgeted power. During clock tree synthesis (CTS), advanced clock gate cloning, buffering, clustering and multi-Vt techniques are used to lower total power. Power-aware routing within the Magma environment minimizes capacitance on high switching nets by spreading the wires.

Static Power Dissipation

Static power dissipation is associated with logic gates when they are inactive. Static power dissipation has an exponential dependence on temperature. This means that as the chip heats up, its static power dissipation increases exponentially. Static power dissipation also has an exponential dependence on the switching threshold of the transistors (Vt). The delay (switching time) associated with a transistor is affected by the switching threshold of that transistor (Vt) and the supply voltage to that transistor (Vdd).

All of this means that engineers have to perform a complicated balancing act, because lowering the supply voltage reduces the amount of heat being generated, which in turn lowers the static power dissipation. However, lowering the supply voltage also increases gate delays. By comparison, lowering the transistors' switching thresholds speeds them up, but this exponentially increases their leakage and therefore their static power dissipation.

The Magma iRM for ARM helps automate the leakage power trade-off and optimization by effectively using low Vt transistors only on timing-critical paths and high Vt transistors on non-critical paths. This automation is a default feature of the Magma iRM for ARM and is very easy to use and implement. The only pre-requisite is the multi-Vt library preparation step prior to running the Magma iRM for ARM, which sorts the standard cell models according to logic and Vt class.
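The low-Vt-on-critical-paths policy described above can be sketched with a toy greedy pass. This is not Magma's optimization algorithm, which the article does not describe; the delay and leakage numbers, the cell names and the `assign_vt` function are hypothetical, chosen only so that low-Vt cells are faster and leakier than high-Vt cells.

```python
# Hypothetical two-entry "library": low Vt is faster but leaks ten times more.
LIB = {
    "hvt": {"delay_ps": 300, "leakage_nw": 100},
    "lvt": {"delay_ps": 200, "leakage_nw": 1000},
}


def path_delay(path, vt):
    """Sum of cell delays along one path for the current Vt assignment."""
    return sum(LIB[vt[cell]]["delay_ps"] for cell in path)


def total_leakage(vt):
    """Total leakage of all cells for the current Vt assignment."""
    return sum(LIB[kind]["leakage_nw"] for kind in vt.values())


def assign_vt(paths, period_ps):
    """Start with every cell at high Vt, then swap cells to low Vt only on paths that miss timing."""
    vt = {cell: "hvt" for path in paths.values() for cell in path}
    # Visit the slowest paths first so shared cells are fixed where they matter most.
    for name in sorted(paths, key=lambda n: path_delay(paths[n], vt), reverse=True):
        for cell in paths[name]:
            if path_delay(paths[name], vt) <= period_ps:
                break  # this path meets timing; leave its remaining cells at high Vt
            vt[cell] = "lvt"
    return vt


paths = {
    "critical": ["a", "b", "c", "d"],  # 1200 ps at high Vt, misses a 1000 ps period
    "relaxed": ["e", "f"],             # 600 ps at high Vt, already meets timing
}
vt = assign_vt(paths, period_ps=1000)
all_lvt = {cell: "lvt" for cell in vt}

print(sorted(cell for cell, kind in vt.items() if kind == "lvt"))
print(total_leakage(vt), "nW versus", total_leakage(all_lvt), "nW with low Vt everywhere")
```

A path that still fails after all of its cells are swapped is simply left failing in this sketch; a production flow would combine Vt swapping with sizing, buffering and placement changes.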
Magma application notes are available to help with the multi-Vt library preparation.

Challenges Faced with Variations in Temperature, Power and Voltage Drop Effects

Power consumption, both static and dynamic, increases a device's operating temperature. This may force engineers to employ expensive device packaging and external cooling technology. To accommodate variations in operating temperature and supply voltage, designers have traditionally been obliged to pad device characteristics and design margins. However, creating a device's power network using excessively conservative design practices consumes valuable silicon real estate (leading to increased costs), increases routing congestion, and results in performance that is significantly below the silicon's full potential. This simply is unacceptable in today's highly competitive marketplace.

Every power and ground track segment has a small amount of resistance associated with it. This means that the logic gate closest to the IC's primary power or ground pins is presented with the optimal supply. The next gate in the chain will be presented with a slightly degraded supply, and so on down the chain. The problem is exacerbated by transient or alternating current (AC) voltage drop effects. These occur when gates switch from one value to another or, in the worst case, when entire blocks are switched on or off. This causes transitory power surges, which momentarily reduce the voltage supply to gates farther down the power supply chain.

The reason voltage drop effects are so important is that the input-to-output delays across a logic gate increase as the voltage supplied to that gate is reduced, which can cause the gate to miss its timing specifications. There is also an increase in the interconnect delays associated with wires driven by under-powered gates. Furthermore, a gate's input switching thresholds are modified when its supply is reduced, which causes that gate to become more susceptible to noise.
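The "slightly degraded supply down the chain" effect is plain Ohm's law applied segment by segment. The sketch below assumes a single rail fed from one pin, with equal segment resistances and steady (DC) gate currents. The function name and all numeric values are hypothetical and are not taken from the article.

```python
import math


def rail_voltages(vdd, seg_res_ohm, gate_currents_a):
    """Supply voltage seen by each gate along a single rail fed from one end.

    Segment k carries the current of gate k and of every gate farther down the
    chain, so the drop across it is seg_res_ohm times that downstream current.
    """
    voltages = []
    voltage = vdd
    downstream = sum(gate_currents_a)
    for gate_current in gate_currents_a:
        voltage -= seg_res_ohm * downstream  # IR drop across the segment feeding this gate
        voltages.append(voltage)
        downstream -= gate_current           # this gate's current does not flow any farther
    return voltages


# Hypothetical numbers: 1.0 V supply, 0.5 ohm per segment, four gates drawing 10 mA each.
levels = rail_voltages(vdd=1.0, seg_res_ohm=0.5, gate_currents_a=[0.01, 0.01, 0.01, 0.01])
for index, level in enumerate(levels, start=1):
    print(f"gate {index}: {level:.3f} V")
```

With these values the first segment carries 40 mA and drops 20 mV, while the last carries 10 mA and drops 5 mV, so the farthest gate sees 0.95 V. Widening the tracks lowers the segment resistance, which is the area-versus-drop trade-off the article goes on to discuss.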
65-nm devices are prone to voltage drop effects, which are caused by the resistance associated with the network of wires used to distribute power and ground from the external pins to the internal circuitry (with direct current (DC) related voltage drops, these are also often referred to as IR drop effects). Voltage drop effects are becoming increasingly significant because the resistivity of the power and ground tracks rises as a function of decreasing feature sizes (track widths). These effects can be minimized by increasing the width of power and ground tracks, but this consumes valuable real estate on the silicon, which typically causes routing congestion. In order to solve these problems, the logic functions have to be spaced farther apart, which increases delays (and power consumption) due to longer signal tracks. Thus, implementing an optimal power network requires the balancing of many diverse factors.

Yet another consideration is that the on-chip temperature gradient (the difference in temperatures at different portions of the device caused by unbalanced power consumption) can produce mechanical stress, which may degrade the device's reliability.

Advanced Features Address Variations in Temperature, Power and Voltage Drop Effects

Advanced power analysis and repair features of Magma tools enable the analysis of power, voltage drop, temperature and the impact of voltage drop on timing. Automatic power grid synthesis incorporates IR drop analysis and is used to ensure optimal power distribution without overdesigning the power grid. These play an important part in avoiding electromigration and thermal breakdown, while keeping the cost to the absolute minimum. They are crucial in safety-critical automotive applications of Cortex-R4, such as anti-lock braking systems (ABS) and electronic stability control (ESC). Subsequently, intelligent de-coupling capacitance insertion can be employed on the power grid to minimize the transients, keeping leakage power in check as well as improving yield.

Typical Results Achieved with the Magma iRM for ARM

Enabling rapid timing closure and fast turnaround time (TAT), the RTL-to-GDSII Magma iRM for ARM takes designers from the simplicity of an ‘out of the box’ experience to a first working Cortex-R4 processor in less than one engineering day (once set up, it typically takes an overnight run to achieve predictable results, given the configurability of the Cortex-R4 at synthesis time).

| What | Detail | Advantages |
|------|--------|------------|
| Target Library | ARM Physical IP Sage-X for TSMC 90G process | High-density, high-speed libraries |
| Cortex-R4 Cache Configuration | 16K Data and 16K Instruction Cache | |
| Target Frequency | 385 MHz | Top performance target achieved, including cross-talk delay avoidance and optimization |
| Max Power | Leakage: 9.8 mW; Internal: 97.8 mW; swcap: 50.0 mW; Total Max Power: 157.7 mW | Low power of approximately 0.4 mW/MHz |
| Standard Cell Count | 115K cells (at 63% utilization) | |
| Total Cell Area | 1.010 mm2 | Low area |
| Total Area with Memories | 3.487 mm2 | Low overall cost of implementation |

Table 1: Vital statistics of the results achieved with the Magma iRM for ARM, in internal implementations

Customer Success with Magma iRM for ARM

Broadcom Corporation has benefited from the tremendous value of the Magma iRM for ARM, which captures best practices and provided them with a working, proven flow right out of the box. Given the experiments SoC designers typically perform, the user-configurability and repeatable nature of the Magma iRM for ARM afforded Broadcom quicker TAT. This allowed their system designers to explore “what-if” scenarios on chip function (given the synthesis-time configurability of the Cortex-R4 processor) versus cost of different physical implementations.
With ease of setup, the pre-packaged RTL-to-GDSII flow offered a complete turnkey methodology for hardening the Cortex-R4 processor and embedding it in their SoCs. Here are some sample results from two of Broadcom's designs that validate the Magma iRM for ARM and exceeded their expectations:

Overnight Run to Harden Cortex-R4 with ARM-Magma RM

Performance target achieved | 400 MHz
Floorplan Area | .73 mm x .73 mm (0.53 mm2)
Library | TSMC 65G 7 LM
Standard Cell Count | 102,060
Standard Cell Area | 0.43 mm2
Utilization | 81.2%
RTL-to-GDSII Runtime | 13.5 CPU hours

Table 2: High speed 65nm Implementation of Cortex-R4: 10GBit PHY Transceiver at Broadcom

Figure 2: Layout of the Cortex-R4 used in the 10GBit PHY transceiver chip

40% Reduction in Leakage Power with Concurrent Multi-Vt Optimization

Performance target achieved | 150 MHz
Floorplan Area | .98 mm x .80 mm (0.784 mm2)
Library | TSMC 65LP
Standard Cell Count | 96,239 (Regular Vt: 15,338; High Vt: 80,901)
Standard Cell Area | 0.58 mm2
Utilization | 74%
RTL-to-GDSII Runtime | 9.6 CPU hours

Figure 3: Layout of the Cortex-R4 used in the disk drive controller chip

Summary

The ARM and Magma Partnership

Developed in collaboration between ARM and Magma engineers, the Magma iRM for ARM has allowed embedded SoC implementations based on the Cortex-R4 processor to take advantage of rapid time-to-market, process portability and a predictable and proven route to first-time working silicon. The Magma iRM for ARM is a pre-verified flow that aids rapid, application-specific implementation of ARM Powered SoCs and provides ease of integration with the rest of the chip using the Magma flow. It provides an excellent starting point for implementing several ARM processors and eases the learning curve to adoption of ARM IP.

As a result of the ARM-Magma partnership, end-users benefit from continuous improvement of the Magma iRM for ARM, new EDA technologies and flows, as well as reference methodology development and support for newer ARM processors.

The Magma iRM for ARM with Cortex-R4 support delivers low-risk, high-performance, power-optimized embedded processor hardening with a top-down approach. This greatly simplifies large SoC implementations in deeply embedded applications in the automotive, imaging and storage markets. Magma is pushing forward with excellent low power features and ease of use for chip building.

Rapid Implementation of Low Power ARM Microprocessors

Author: Alan Gibbons, Magma Design Automation, Inc.

Synopsis: As the level of integration on mobile devices increases, the competing requirements of high performance and low power consumption placed on the implementation of leading-edge microprocessors are set to continue. Concurrent optimization and analysis of power, timing and area must form an integral part of any implementation flow for these processor cores. The Blast Power based low power implementation solution by Magma Design Automation is one answer.

Extended battery life, color displays and the ability to support multiple concurrent applications on mobile devices are placing considerable focus on power management in the development of ARM-based system chips. Careful consideration must be given to minimizing both static and dynamic power dissipation without sacrificing performance, and market forces dictate that designers must meet this challenge within ever-decreasing time-to-market windows. Consequently, there is a pressing industry requirement for the ability to rapidly implement application-specific, high-performance, low-power processor cores.

ARM has been addressing the needs of low power applications for years through novel microprocessor architecture and design techniques that have culminated recently in the launch of the flagship ARM1176JZF-S processor, a processor targeted specifically at the low power needs of the consumer and wireless application market. In the era of synthesizable cores, however, the work required to implement very low power processors is split between ARM and its licensees: the development of the processor architecture and design falls into ARM's domain, while the ability to implement technology-specific versions of the processor is in the hands of the licensee. To satisfy an increasingly demanding customer base, these technology-specific implementations must be both high performance and energy efficient, and must be created rapidly through the use of comprehensive, integrated design flows.

The power consumption of a CMOS device includes both the dynamic power associated with activity and the static power, which reflects the energy consumed when the device is idle. As we move from one process to the next, this power consumption grows exponentially and the need to address it as a primary design goal becomes more apparent. The commercial impacts of increased power consumption can be severe.
In the mobile world, where we have a finite energy budget, trade-offs must be made in the feature set of the mobile device: for example, providing a color display at the expense of prolonged battery life, or the ability to support both video and messaging simultaneously. In the wired world, increased power consumption directly affects packaging costs and form factors as well as device performance and failure. Clearly, the need to reduce power dissipation is a critical factor in the continued development of highly integrated portable devices.

The methods used to reduce the power budget have application at the system level, during sub-system architecture, IC design and library development. This article focuses on the techniques we can employ to reduce the power dissipation at the IC level.

(Information Quarterly, Volume 4, Number 1, 2005: Design Strategies and Methodologies)

Reducing Dynamic Power Dissipation

Dynamic power dissipation can be represented as

$$\text{DynamicPower} \equiv a_f \times C \times V^2$$

where $a_f$ is the activity expressed as a function of frequency, $V$ is the supply voltage and $C$ is the capacitance being switched. Clearly, by reducing the switching activity, the voltage or the operating frequency of the design, we will reduce the dynamic power dissipation, and by simultaneously reducing all three we will realize significant power reduction.

The most obvious way to reduce the switching activity of a design is to drop the frequency. However, there are many other well-documented techniques that can be employed within a design to help reduce the switching activity at a given frequency. These include multi-level clock gating, operand isolation, pin swapping, technology mapping (hiding high toggle rate nets within more complex cells) and factoring. Various combinations of these techniques are commonly employed in CMOS designs today; however, with the exception of clock gating, these techniques have a marginal impact on the overall power dissipation.
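To make the dynamic power relationship concrete, the following minimal sketch evaluates it for a baseline operating point and for one where voltage and frequency are both scaled down by 20%. It assumes that the activity term can be written as a toggle fraction multiplied by the clock frequency; the function name and all numeric values are hypothetical and are not taken from the article.

```python
def dynamic_power(activity, frequency_hz, capacitance_f, voltage_v):
    """Dynamic power in watts, modeled as activity * f * C * V^2.

    Assumption: the article's activity term a_f is split here into a
    dimensionless toggle fraction (activity) times the clock frequency.
    """
    return activity * frequency_hz * capacitance_f * voltage_v ** 2


# Illustrative numbers only: 20% activity, 400 MHz, 1 nF switched, 1.2 V.
baseline = dynamic_power(0.2, 400e6, 1e-9, 1.2)

# Scale both voltage and frequency to 80% of their baseline values.
scaled = dynamic_power(0.2, 400e6 * 0.8, 1e-9, 1.2 * 0.8)

# Frequency contributes a factor of 0.8 and voltage a factor of 0.8 squared,
# so the ratio is 0.8 cubed.
ratio = scaled / baseline
print(f"baseline {baseline:.4f} W, scaled {scaled:.4f} W, ratio {ratio:.3f}")
```

Under this model, lowering frequency alone gives only a linear saving, whereas lowering voltage and frequency together gives a saving that grows with the cube of the scaling factor.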
By far the greatest gain in dynamic power reduction can be achieved by reducing the voltage at which the design operates. Reducing the voltage necessarily reduces the performance of the design, so if we can couple this reduction in voltage with a corresponding reduction in frequency, we get an almost cubic reduction in the dynamic power dissipation of the design.

The architecture of the ARM1176JZF-S processor can be easily partitioned into separate voltage domains, or islands. It can operate synchronously, where each domain is at the same voltage, or asynchronously, where the domains are at different voltage levels, and these voltage levels can change over time through interaction with the operating system. When operating asynchronously, the domains communicate through level shifters that sit on the domain interfaces. When operating synchronously, these level shifters can be bypassed, restoring the cycles per instruction (CPI) performance of the core. The ARM1176JZF-S offers the designer both high performance in synchronous mode and significant power savings when operating asynchronously during times when the workload is light.

Figure 1: ARM1176JZF-S Voltage Domains

Reducing Static Power Dissipation

Addressing the dynamic power dissipation of a high performance microprocessor is certainly going to provide significant power savings to the designer. However, as process migration passes 65nm and continues towards 45nm, where the operating voltages are lower and the switching thresholds roll off more rapidly, static power dissipation is expected to exceed dynamic power dissipation and become the dominant contributor to the total power of a device. Static power minimization must be considered an integral part of the power reduction strategy.

Figure 2: Static vs. Dynamic Power with Process Migration

Static power dissipation is dominated by sub-threshold leakage current through the individual transistors. Although small in magnitude, the cumulative effect of this leakage current in a system chip with hundreds of millions of transistors is significant and cannot be ignored. The leakage current through a transistor is a combination of a number of components (including sub-threshold and gate-oxide leakage), and can be approximated as

$$\text{Leakage} \cong \exp\left(-\frac{q V_{th}}{kT}\right)$$

where $q$ is the electron charge, $V_{th}$ is the switching threshold voltage of the transistor, $k$ is Boltzmann's constant and $T$ is the absolute temperature.

One important point about this equation is that it shows that static power dissipation has an exponential dependence on temperature. This means that as the chip heats up, its static power dissipation increases exponentially. Furthermore, we see that static power dissipation has an inverse exponential dependence on the switching threshold of the transistors. However, as mentioned previously, the challenge to the designer is to design to a significantly reduced power budget while also maintaining high levels of performance for the design. Increasing the switching threshold of a transistor has the effect of increasing delay through the device and consequently reducing the performance. The ability to use multi-threshold transistors in a design is an excellent technique for reduction of static power dissipation. Low (or regular) threshold transistors are used on timing-critical parts of the design and high threshold transistors on non-critical paths to minimize leakage.

Power Gating

In addition to the use of multi-threshold libraries, power gating can be used to further reduce the effects of leakage power. Leakage is state dependent, but unlike dynamic power it is not activity dependent. Therefore, even when a device has no switching activity, it is still dissipating leakage power. Multi-Threshold CMOS (MTCMOS) switch cells can be used to isolate specific regions of the design.
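The exponential dependence of leakage on threshold voltage and temperature can be checked numerically with the approximation given earlier. The sketch below uses the article's simplified exponent as written; it omits the sub-threshold slope (ideality) factor that fuller device models include, and the threshold voltages and temperatures are hypothetical values chosen for illustration.

```python
import math

ELECTRON_CHARGE = 1.602176634e-19  # coulombs
BOLTZMANN = 1.380649e-23           # joules per kelvin


def relative_leakage(vth_volts, temp_kelvin):
    """Relative leakage from the article's approximation exp(-q*Vth / (k*T)).

    Only ratios between two calls are meaningful; the prefactor that would
    set absolute current is deliberately left out.
    """
    return math.exp(-ELECTRON_CHARGE * vth_volts / (BOLTZMANN * temp_kelvin))


# Hypothetical low-Vt (0.3 V) and high-Vt (0.4 V) devices at 300 K.
vt_ratio = relative_leakage(0.3, 300.0) / relative_leakage(0.4, 300.0)

# The same high-Vt device at 400 K versus 300 K.
temp_ratio = relative_leakage(0.4, 400.0) / relative_leakage(0.4, 300.0)

print(f"low-Vt leaks about {vt_ratio:.0f}x more than high-Vt at 300 K")
print(f"high-Vt leaks about {temp_ratio:.0f}x more at 400 K than at 300 K")
```

Even a 0.1 V difference in threshold changes leakage by well over an order of magnitude in this simplified model, which is why restricting low-threshold cells to timing-critical paths is so effective.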
Regions isolated by MTCMOS switch cells can then be powered down when inactive to significantly reduce leakage power. The ARM1176JZF-S architecture lends itself well to this approach: the design is partitioned into multiple voltage domains with active regions and MTCMOS regions defined. MTCMOS switch cells are inserted into the power mesh in the MTCMOS region. These switch cells are enabled by a sleep signal sourced from the active region. When enabled, they disconnect the inactive part of the design from the power network, isolating the logic from the power mesh and reducing the leakage current.

Comprehensive, Integrated Tool Flow

Implementing an ARM1176JZF-S processor with optimal performance while supporting the previously mentioned power reduction techniques places a significant burden on the design flow and on the technology/library combination used. The design flow must be capable of achieving excellent performance with a traditional SI-aware approach for nanometer technology while automating the handling of multiple voltage domains and minimizing leakage. Specifically, support is required for automatic insertion of level shifters and isolation cells between voltage domains, domain-specific optimization, power grid synthesis, multi-Vt optimization and switch cell insertion. Managing these low power techniques must be an integral part of the optimization flow and cannot be an afterthought.

Simultaneous optimization for timing, area, power and SI is a minimum requirement for an ARM1176JZF-S design flow. Further, in order to meet aggressive time-to-market requirements, simultaneous optimization for these design characteristics requires a unified data model with support for concurrent processing. This level of integration extends to the analysis environment.
Lack of integration between implementation and analysis tools can lead to a significant time penalty in resolving false errors and inconsistent data, which can force designers to overcompensate in certain areas of the flow, resulting in a sub-optimal implementation.

Magma Design Automation's Blast Fusion RTL-to-GDSII design flow enables the designer to meet this challenge today through the ability to continually optimize for timing, power and area through all phases of the design flow. Blast Fusion integrates a number of focused solutions around a unified data model that gives the optimization, implementation and analysis engines immediate access to continually updated logical, physical, timing and power data. This single-pass approach allows the engines to make instant decisions that ensure optimal results. Specifically, Blast Create is used for synthesis, Blast Plan Pro for prototyping, Blast Noise for signal integrity, Blast Rail for power integrity and Blast Power for power management.

Blast Power forms the heart of the implementation flow for the ARM1176JZF-S and provides a complete low power implementation solution supporting power-aware synthesis, leakage power minimization with multi-Vt libraries, automated support for dynamic voltage and frequency scaling (including automated level shifter insertion), domain-based optimization and multi-corner optimization. In addition, Blast Power provides automated power grid synthesis and can automatically insert and optimize decap cells based on transient voltage drop analysis.

Multiple Voltage Domains

Blast Power provides a domain-based methodology to handle multiple voltage domains within a design. Using domains with associated floorplans, a design can be partitioned into a number of regions, each operating at different voltages and frequencies.
Through the specification of domains, the designer is able to identify the power and ground nets, the nature of the power supply (constant, switched, variable, etc.) and other salient process and temperature characteristics. By associating domains with floorplans, the designer is able to identify the logical-to-physical relationship: which cells are attached to which domain, and the physical location of the domain within the design. The domains and floorplans are maintained through synthesis and physical optimization as well as during timing and power analysis. New cells added during the implementation process are automatically attached to the correct domain and connected to the corresponding power supply. Having partitioned the design into specific voltage islands (domains) and physical regions (floorplans), Blast Power can automatically determine which interfaces need to be level shifted and insert the appropriate type of level shifter and/or isolation cell. The sensitivities associated with level shifter insertion, such as buffering and secondary power supply routing, are handled automatically within the tool. This automated domain-based approach significantly reduces the complexity of the multi-voltage implementation process.

Figure 3: Comprehensive RTL to GDSII Solution

Figure 4: ARM1176JZF-S Physical Domains

Leakage Mitigation

Typically, a multi-Vt library contains two or more versions of the same standard cell set: one set contains high-Vt cells and the other contains low-Vt cells. Blast Power automatically reduces leakage current in the design by using high-Vt (slower, lower leakage) cells for non-critical paths and low-Vt (faster, higher leakage) cells for the timing-critical paths in the design. Through the unified data model and integrated analysis environment, concurrent optimization for both timing and leakage is possible, yielding an optimal implementation.
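The basic idea of multi-Vt assignment, using high-Vt cells wherever timing slack allows, can be sketched as a simple greedy pass. This is an illustrative toy and not the algorithm used by Blast Power, which the article does not describe: the `Cell` fields, the per-cell slack model and the numbers are all assumptions, and a real flow would re-run timing analysis because cells on a shared path compete for the same slack.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    slack_ps: float          # hypothetical slack of the worst path through the cell
    delay_penalty_ps: float  # hypothetical extra delay if swapped to high-Vt
    vt: str = "low"          # every cell starts as a fast, leaky low-Vt cell


def assign_high_vt(cells):
    """Swap a cell to high-Vt only if its slack can absorb the delay penalty.

    Simplification: each cell is treated independently, so slack shared
    between cells on the same path is not modeled.
    """
    for cell in cells:
        if cell.slack_ps >= cell.delay_penalty_ps:
            cell.vt = "high"
            cell.slack_ps -= cell.delay_penalty_ps
    return cells


cells = assign_high_vt([
    Cell("u_noncritical", slack_ps=50.0, delay_penalty_ps=20.0),
    Cell("u_critical", slack_ps=0.0, delay_penalty_ps=20.0),
])
for cell in cells:
    print(f"{cell.name}: {cell.vt}-Vt, remaining slack {cell.slack_ps} ps")
```

In this sketch the non-critical cell moves to high-Vt while the critical cell stays low-Vt, mirroring the split between regular-Vt and high-Vt cell counts reported for multi-Vt designs.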
Unlike conventional flows, the Blast Power multi-Vt flow performs leakage optimization at different stages of the design flow, resulting in superior quality of results (QoR).

Multiple Corner Analysis

Technology libraries are characterized for leakage and dynamic power for all possible arcs that can be exercised. For multi-corner analysis, traditional flows handle one PVT/library at a time, leading to multiple sequential runs. Depending on the number of corners for the design, this can result in long run times, multiple design iterations and convergence problems. Magma uses the super corner approach to concurrently analyze and optimize multi-corner designs much faster, helping cover timing criticality for all corners.

When performing concurrent optimization and analysis at different operating corners, the correct selection and derating of the characterization data is required. When the voltage of a supply net changes, the timing and power behavior of all cells supplied by that net changes. If the voltage differs only slightly from the characterization value, then derating may suffice. Blast Power supports numerous derating methods, including k-factor and polynomial models, as well as ECSM-characterized libraries. However, to improve accuracy and avoid derating altogether, the cell library can be characterized at many different operating conditions. This more extensive characterization data is then used by Blast Power to target the operating condition of choice.

1650 Technology Drive, San Jose, CA 95110 USA | Tel: 408-565-7500 | Fax: 408-565-7501 | www.magma-da.com

© 2008 Magma Design Automation, Inc. All rights reserved. Magma is a registered trademark of Magma Design Automation. All other product and company names are trademarks or registered trademarks of their respective companies. 10/08
