Energy Management in Multicore Designs with Embedded Virtualization
Multicore processors are becoming widespread in embedded devices, making earlier
methods of energy management inadequate. The use of virtualization via a hypervisor
can greatly improve energy savings in multicore systems.
by Gernot Heiser, CTO, Open Kernel Labs
Energy conservation is an increasingly important requirement in the design of computer
systems. In the desktop and server space, the main drivers are the cost of electricity and air
conditioning, along with environmental concerns. For intelligent devices, there is the
additional desire to extend the usable battery life between charges or replacement, or at
least to keep it constant in the face of increasingly sophisticated functionality.
Energy conservation comes from building greener, leaner circuits and from using
deployed systems more efficiently. On the hardware side, conservation primarily entails
using lower static supply voltages and reducing leakage current. Under software
control, energy management (EM) focuses on mechanisms that support dynamic voltage and
frequency scaling (together, DVFS).
Voltage scaling involves lowering (and raising) processor core supply voltage (Vcc) from
its nominal value (e.g., 1.3 VDC) downwards towards a minimum value. Since Power =
Voltage x Current, voltage scaling saves energy over time by reducing power
consumption.
However, when operating at reduced voltage, processor circuitry runs slower, and
therefore scaling down the core voltage also requires scaling down the frequency of the
CPU clock, resulting in decreased processor performance.
Downward scaling of the frequency in itself reduces power consumption, as switching
the circuits between logic levels requires energy to charge and discharge circuit
capacitance. This dynamic power varies directly with frequency, and with the square of
voltage:
Pdyn ∝ f × V²
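For example, under this relationship, scaling a core from a nominal 1.3 V at 600 MHz down
to 1.0 V at 300 MHz would cut dynamic power to (300/600) × (1.0/1.3)² ≈ 30 percent of its
original value (illustrative figures; real operating points are hardware-specific).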
Due to the linear relationship between dynamic power and frequency, scaling frequency
alone while reducing power does not actually reduce the dynamic energy use: a particular
number of CPU cycles still requires the same amount of dynamic energy. In fact, running
at a lower frequency may result in increased energy usage, because of static power
consumption (from leakage current). By definition, static power is independent of
frequency, and therefore static energy is proportional to time. At lower frequency,
execution time increases and so does static energy.
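To see why, consider a task that requires N CPU cycles. Its execution time is t = N/f, so
its dynamic energy is

Edyn = Pdyn × t ∝ (f × V²) × (N/f) = N × V²

which is independent of frequency, while its static energy, Estat = Pstat × t = Pstat × N/f,
grows as the frequency drops.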
Furthermore, the power used by RAM is independent of CPU core voltage. It comprises
both a static component and a dynamic component roughly proportional to the rate of
memory accesses (loads and stores) by the CPU, which depends on the core
frequency. However, memory accesses are slower than CPU operations, and the CPU
frequently stalls waiting for data from memory. When the CPU runs slower, the number
of stall cycles (which waste dynamic CPU energy) is reduced.
Thus, the relationship between energy consumption and core frequency is a complex
function of hardware characteristics and program behavior. While energy use by
CPU-bound programs (which rarely access memory) tends to be minimized at high clock rates,
for memory-bound programs, minimal energy consumption occurs at low frequencies.
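This trade-off can be made concrete with a first-order model. The C sketch below is
illustrative only: every constant is invented for demonstration, and the assumption that
the minimum stable voltage rises linearly with frequency is a simplification of real
silicon.

/* First-order model: total energy to run a fixed task as a function of
 * CPU clock frequency. All constants are invented for illustration;
 * real systems require measured values. */
#include <stdio.h>

#define CYCLES       1e9    /* cycles of pure computation                    */
#define MEM_ACCESSES 5e6    /* memory accesses (raise this for memory-bound) */
#define T_MEM        60e-9  /* seconds per access, frequency-independent     */
#define P_STATIC     0.10   /* W: leakage, paid over the whole run time      */
#define P_MEM        0.05   /* W: DRAM background power, also time-based     */
#define E_ACCESS     10e-9  /* J of dynamic DRAM energy per access           */
#define C_EFF        1e-9   /* F: effective switched capacitance             */

static double volts(double f)          /* assumed: minimum stable Vcc rises  */
{                                      /* roughly linearly with frequency    */
    return 0.8 + 0.5 * (f / 1e9);
}

static double energy(double f)
{
    double t = CYCLES / f + MEM_ACCESSES * T_MEM;  /* total run time         */
    double v = volts(f);
    double e_dyn = C_EFF * v * v * CYCLES;         /* CPU switching energy   */
    double e_mem = E_ACCESS * MEM_ACCESSES;        /* DRAM access energy     */
    return e_dyn + e_mem + (P_STATIC + P_MEM) * t; /* plus time-based terms  */
}

int main(void)
{
    for (double f = 200e6; f <= 1000e6; f += 200e6)
        printf("%4.0f MHz: %.3f J\n", f / 1e6, energy(f));
    return 0;
}

Raising MEM_ACCESSES shifts the energy minimum toward lower frequencies, while a
CPU-bound workload (few accesses) is penalized at low frequency by the time-based static
terms, matching the behavior described above.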
DVFS is today a standard feature of most microprocessor families, but has yielded mixed
results from software systems that attempt to drive it, due to the above complexities.
While OS kernels and individual device drivers can readily address DVFS control
registers, divergent approaches and heuristics exist for scaling, and many factors can
influence the return on DVFS investment (a minimal control example follows the list
below):

• Relative CPU and memory power consumption
• Importance of static vs. dynamic power use by CPU, memory and other components
• Degree of memory-boundedness of applications
• Complex trade-offs between DVFS operating points and sleep modes
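As one concrete instance, Linux exposes DVFS control to user space through the cpufreq
sysfs interface. The minimal sketch below assumes a kernel built with the "userspace"
governor and a platform-specific set of available frequencies; error handling is
abbreviated.

/* Minimal sketch: driving DVFS from user space via the Linux cpufreq
 * sysfs interface. Assumes the "userspace" governor is available;
 * paths and valid frequencies vary by platform. */
#include <stdio.h>
#include <stdlib.h>

static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s", val);
    return fclose(f);
}

int main(void)
{
    const char *base = "/sys/devices/system/cpu/cpu0/cpufreq";
    char path[256];

    /* Hand frequency control to user space... */
    snprintf(path, sizeof(path), "%s/scaling_governor", base);
    if (write_str(path, "userspace") != 0)
        return EXIT_FAILURE;

    /* ...then request a 300 MHz operating point (value in kHz; it must
     * be one of scaling_available_frequencies). */
    snprintf(path, sizeof(path), "%s/scaling_setspeed", base);
    if (write_str(path, "300000") != 0)
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}

In practice a governor such as ondemand makes these transitions automatically, based on
measured load; the heuristics it applies are exactly where the factors listed above come
into play.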
Enter Multicore and Multi-processing
Multi-processing on multicore silicon, once a high-end capability, is now
mainstream. Silicon suppliers routinely integrate specialized companion processors
alongside applications CPUs on a single substrate (asymmetric multiprocessing or AMP),
and are also deploying 2x, 4x and other parallel configurations of the same ARM, MIPS,
Power or x86 architecture cores (symmetric multiprocessing or SMP). Driving this
evolution are needs for dedicated silicon to process multimedia, graphics, baseband, etc.,
the need to sustain growth of compute capability without power-hungry high frequency
clocks, and requirements to run multiple OSs on a single device.
While multicore, multi-processed systems may limit power consumption in some areas,
they present steep challenges to energy management paradigms optimized for single-chip
systems. In particular, multicore limits the scope and capability of DVFS because most
SoC subsystems share clocks and power supplies. One consequence is that scaling the
operating voltage of one of several SoC subsystems (when even possible) can limit its
ability to use local buses to communicate with other subsystems, and to access shared
memory (including its own DRAM). Clock frequency scaling of a single SoC subsystem
also presents interoperability challenges, especially for synchronous buses. And, since
multicore systems in SMP configuration usually share Vcc, clock, cache and other
resources, DVFS must apply to all constituent cores rather than to a useful
subset.
Silicon supplier roadmaps point to further multiplying numbers of cores—today 2x on
embedded CPUs, and soon 4x, 8x and beyond. This surfeit of available silicon will
encourage designers to dedicate one or more cores to particular subsystems or
functionalities (CPU-function affinity). Some dedicated operations, like media
processing, will use cores in a binary fashion – at full throttle or not at all. However, most
other functions will impose varying loads, ranging from a share of a single core to
saturating multiple cores. All-or-nothing use is fairly easy to manage, but dynamic loads
on multiple cores present much greater EM challenges.
Multiple OSs and Energy Management
Most OSs are mediocre resource managers. If OSs were competent in resource
management, virtualization in the data center would not enjoy the mission-critical role it
does today. Embedded OSs aren't any better at managing resources than their server
counterparts. They assume static provisioning and full resource availability with
simplistic state models for resources under their purview.
Many intelligent devices also deploy multiple OSs: high-level OSs like Android, Linux,
Symbian, Windows CE or Windows Mobile to provide user services and to run end-user
applications, and one or more RTOSs to handle low-level chores like wireless baseband
and signal processing. These OSs and the programs they host may run on dedicated
silicon, may occupy dedicated cores on a multicore system, or can also run in dedicated
partitions of memory and cycles of a single shared CPU.
High-level application OSs typically include their own EM schemes that leverage DVFS
(e.g., Linux apm and dpm, and Windows/BIOS ACPI). Most RTOSs eschew any
operations that curtail real-time responsiveness, leaving OEMs and integrators to roll
their own or do without. Some do offer explicit power management APIs (e.g., vxLib
vxPowerDown()), but usually lack coherent mechanisms for policy.
Whatever the inherent EM capability of resident OSs, there remains the challenge of
coordinating efforts in a multi-OS environment. Even if one among several OSs is
capable of managing energy in its own domain, it will have no awareness of the EM
capabilities and state of its peer OSs in the same system, adding to development and
integration headaches. Even if all co-resident application OSs and RTOSs have some
EM capacity, however rudimentary, how can system developers and integrators
coordinate and optimize operation and energy management policy across OS domains?
A Real-world Example – Mobile Handset Energy Management
Legacy smartphones typically employ dedicated, separate CPUs for application
processing, for multimedia and graphics, and for real-time wireless baseband modem
operations. As multicore CPUs and virtualization become more ubiquitous, designs are
migrating these separate operations onto partitioned single- and multicore processors. A
good example lies in the Motorola Evoke handset.
Devices like the Evoke and other current-generation handsets consolidate the diverse
functions of a smartphone onto a single processor with one or two cores. Next-generation
devices will build on silicon with even greater numbers of available cores, and
will surely find ways for each subsystem to consume available compute capacity.
In these devices, one subsystem would be the baseband modem, whose real-time software
stack would constitute a load that fully occupies one or more cores during peak
processing (e.g., for streaming or voice conferencing), but typically consumes perhaps a
fifth of a single core’s capacity when quiescent. Another subsystem would be an HD
multimedia stack, requiring an additional core (or more) at full load, and zero when no
media is displayed. A GUI stack might use a half core during heavy user interaction, and
zero when quiescent. User applications would consume any remaining compute capacity
when executing, and would occupy either zero cores when quiescent, or represent some
other finite load with background processing (Figure 1).
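A back-of-envelope budget makes the scenario concrete. The C sketch below simply sums
the subsystem loads quoted above (expressed as fractions of one core) for one
hypothetical moment and rounds up to whole cores; the figures are illustrative.

/* Back-of-envelope core budgeting for the scenario above. Loads are
 * fractions of one core; values mirror the text and are illustrative. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double baseband   = 0.2;  /* quiescent; rises to >= 1.0 at peak   */
    double multimedia = 1.0;  /* HD playback active; 0 when no media  */
    double gui        = 0.5;  /* heavy user interaction; 0 when idle  */
    double apps       = 0.3;  /* background third-party processing    */

    double demand = baseband + multimedia + gui + apps;
    printf("aggregate demand %.1f core(s) -> power %d core(s)\n",
           demand, (int)ceil(demand));
    return 0;
}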
DVFS Challenges
Clearly, each of the functional stacks and the OSes that host them present unique
challenges to managing energy with DVFS. In combination on a multicore CPU, it
becomes nearly impossible to determine useful DVFS operation points and policy for
transition among them, both on a per function basis and a coordinated one.
The above scenario clearly illustrates that localized DVFS schemes are inadequate to
address the needs of next-generation multi-stack multicore designs. Further analysis of
the scenario also highlights the limitations of the coarse-grained assignment of functions
to available CPU cores:
• Peak loads for different subsystems can consume most or all of one or more cores’
compute capacity
• Gross assignment/dedication of functions to cores can waste available compute
capacity and potentially starve functions at peak load
• Real-world total load is unpredictable due to third-party applications (e.g., with
Android Market), and additional demands placed on communications and multimedia
stacks from those applications and the traffic they generate
• Scalable loads dictate sharing of available CPU cores across functions
• Most silicon cannot run stably at all frequencies and voltages. Real-world energy
management paradigms build on discrete, stable pairings of voltage and frequency
(operating points)
Virtualization for Energy Management
Rather than try to salvage legacy EM paradigms from each functional domain, let’s
employ the approach favored by data center IT managers – using virtualization for energy
management.
DVFS wrings incremental gains in energy efficiency by reducing voltage and clock
frequency. A given CPU offers developers and integrators a set of safe “operating
points” with fixed voltages and frequencies. As load/demand increases or decreases, EM
middleware or EM-aware OSes transition from operating point to operating point (Figure
2). A logical extension of applying DVFS is reduction of voltage to 0 VDC and
completely stopping the CPU clock. That is, utilizing only two operating points – Full
Stop and Full Throttle – but employing them across the range of available cores: OP1
uses one core, OP2 uses two, etc.
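A sketch of what that scheme looks like in practice: the operating point is simply the
number of powered cores. The example below uses the Linux CPU-hotplug sysfs interface as
a stand-in for the mechanism; a hypervisor would make the equivalent transition
internally, and core 0 is left running since it typically cannot be unplugged.

/* Sketch of the "Full Stop / Full Throttle" scheme: operating point
 * OPn means n cores powered, the rest stopped. Illustrative only. */
#include <stdio.h>

#define NCORES 4

/* Bring core 'cpu' online (1) or take it to Full Stop (0). */
static void set_core(int cpu, int online)
{
    char path[64];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/online", cpu);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fprintf(f, "%d", online);
    fclose(f);
}

/* Select operating point OPn: n cores running, the rest stopped. */
static void select_op(int n)
{
    for (int cpu = 1; cpu < NCORES; cpu++)
        set_core(cpu, cpu < n ? 1 : 0);
}

int main(void)
{
    select_op(2);   /* e.g., aggregate demand of ~1.7 cores -> OP2 */
    return 0;
}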
In multicore systems without virtualization, or with simple partitioning, wholesale
shutdown of CPU cores presents additional challenges, because loads (OSes and
application threads) are tightly bound to one or more cores (complete CPU affinity, as in
Figure 3). Shutting down a CPU core requires a policy decision between a shallow sleep
mode—with fast entry and exit, but significant remaining leakage power—and a deep
sleep mode—with low leakage power but high overhead at both entry and exit. Also,
migrating loads across CPUs is nearly impossible; only loads already running as SMP
can shed CPUs.
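The shallow-versus-deep decision reduces to a break-even calculation: deep sleep pays off
only when the idle period is long enough to amortize its transition cost. A minimal
sketch, with invented numbers standing in for data-sheet values:

/* Shallow vs. deep sleep as a break-even calculation. All numbers are
 * invented for illustration; real values come from the SoC data sheet
 * or from measurement. */
#include <stdio.h>

int main(void)
{
    double p_shallow = 0.050;  /* W: residual leakage in shallow sleep   */
    double p_deep    = 0.002;  /* W: leakage in deep sleep               */
    double e_trans   = 0.010;  /* J: energy to enter and exit deep sleep */

    /* Deep sleep wins once (p_shallow - p_deep) * t exceeds e_trans. */
    double breakeven = e_trans / (p_shallow - p_deep);
    printf("prefer deep sleep for idle periods longer than %.2f s\n",
           breakeven);
    return 0;
}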
Introducing virtualization neatly addresses the challenges of CPU shutdown and CPU
core affinity. First, instead of binding loads to actual CPUs, a full-featured Type 1
hypervisor associates functional subsystems with dedicated virtual CPUs. Based on real
compute needs and on policy established at integration time, the hypervisor can bind
virtual CPUs to one or more physical CPUs (Figure 4) and/or can share available physical
CPUs among virtual CPUs as needed, as suggested in Figure 1.
To facilitate energy conservation, a hypervisor enables Full Stop of underutilized CPU
cores by (re)mapping virtual CPUs and their loads onto fewer physical CPUs (Figure 5).
This neat trick is only possible via the construct of virtual CPUs, which facilitates
arbitrary mapping of loads to physical silicon and transparent migration of running loads
across CPU cores. The resulting consolidation means that on average, more CPUs are in
an off state and they remain there for longer, allowing more efficient use of deep sleep
states.
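The consolidation step can be pictured as a bin-packing problem: pack virtual-CPU loads
(fractions of one core) onto as few physical cores as possible and stop the rest. The
first-fit sketch below is purely illustrative; a production hypervisor would also weigh
migration cost, cache affinity and real-time constraints.

/* First-fit consolidation of vCPU loads onto physical cores, so that
 * unused cores can be put into deep sleep. Loads are hypothetical. */
#include <stdio.h>

#define NPCPU 4

int main(void)
{
    double vcpu_load[] = { 0.2, 0.5, 0.3, 0.6 };  /* e.g., baseband, GUI... */
    int nvcpu = sizeof vcpu_load / sizeof vcpu_load[0];
    double pcpu_used[NPCPU] = { 0 };

    /* Place each vCPU on the first core with spare capacity. */
    for (int v = 0; v < nvcpu; v++)
        for (int p = 0; p < NPCPU; p++)
            if (pcpu_used[p] + vcpu_load[v] <= 1.0) {
                pcpu_used[p] += vcpu_load[v];
                printf("vCPU%d -> pCPU%d\n", v, p);
                break;
            }

    for (int p = 0; p < NPCPU; p++)
        if (pcpu_used[p] == 0.0)
            printf("pCPU%d unused -> deep sleep\n", p);
    return 0;
}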
Shutting down whole cores leads to linear, and therefore highly predictable,
performance/energy tradeoffs, unlike DVFS, and is thus easier to manage. DVFS
can still be employed on the active cores to fine-tune the energy-performance tradeoff.
Since energy management is now handled by the hypervisor, with full knowledge of
performance requirements, hardware-imposed constraints such as common core voltage
are readily incorporated.
As mentioned earlier, embedded OSes are notoriously poor at resource management.
Those with native energy management schemes have “been taught” to monitor their own
loads and make energy management policy transitions. They are not, however, equipped
to manage energy and CPU utilization outside their own purview, on other CPU cores
running different OSes and accompanying loads. For multiple, diverse hosted OSes in a
multicore system, effective energy management must “step outside” of the local context
of functional subsystems (baseband, GUI, etc.) to a global scope that encompasses these
subsystems together.
The foregoing has been a brief review of energy management mechanisms, and of the
challenges presented to EM schemes by modern multicore systems. Of available
software-based energy management mechanisms, only virtualization is positioned (in the
global architecture/stack) to manage energy for all cores and all functional subsystems
in concert. Since it is the hypervisor that actually dispatches threads to run on
physical silicon, it is uniquely and ideally situated to comprehend actual CPU loading
(not calculated guest OS loads), and to scale power and energy use by bringing available
cores in and out of service.
Open Kernel Labs, Chicago, IL. (312) 924-1145. www.ok-labs.com