MediaTek CorePilot 2.0
MediaTek CorePilot 2.0™
MediaTek continues to lead in power and thermal innovation with our advanced
heterogeneous computing architecture that maximizes device performance and
power efficiency
In July 2013, MediaTek delivered the industry’s first mobile System on a Chip (SoC) with
Heterogeneous Multi-Processing. Called CorePilot 1.0, the technology maximized device
performance and power saving through interactive power management, adaptive thermal
management and advanced scheduler algorithms. In 2015, MediaTek introduces CorePilot 2.0,
an evolution of CorePilot 1.0, which adds new advancements in heterogeneous computing
including both CPUs and GPUs. This cooperation, named Device Fusion, is an “intelligent”
technology which can efficiently execute OpenCL programs by fusing GPU and CPU computing
capabilities. With mobile phone overheating dominating the headlines lately, OEMs are eager
to find SoCs that address power and thermal challenges. CorePilot 2.0 and Device Fusion
expertly address the power/performance relationship of heterogeneous computing
architectures and enable programmers to focus on application software development without
needing to consider the hardware platform configuration.
Introduction
For modern mobile app programmers, effectively balancing powerful computing device
capabilities with power and thermal constraints can often be a significant challenge.
Programmers seek to meet user performance needs without overheating the device or causing
rapid battery drain. MediaTek, a pioneering fabless semiconductor company and a market
leader in cutting-edge Systems on a Chip for wireless communications and connectivity, HDTV,
DVD and Blu-ray, addressed the optimization of mobile device CPU core performance through
the development of CorePilot 1.0. For an in-depth history of CorePilot 1.0, see MediaTek’s
CorePilot 1.0 white paper.
This white paper discusses the challenges and solutions of CorePilot 2.0, an evolution of
CorePilot 1.0, in addressing workload management and optimization through the use of
multiple processing devices. This white paper will show:

CorePilot 2.0’s Device Fusion technology achieves up to 146% performance
improvement when compared to using CPU or GPU only architectures.

CorePilot 2.0’s Device Fusion technology leads to lower energy consumption of up to
18% when compared to using CPU or GPU only architectures.

CorePilot 2.0’s Device Fusion technology frees programmers from predicting what
computing device is best suited to which task.

CorePilot 2.0 positions MediaTek as the leader in power resource management
innovation and the only company currently offering a parallel computing device
scheduling algorithm.
The Challenge: Programming to Take Full Advantage of
Multiple Processing Devices
Recent studies suggest that using CPUs and GPUs together is a more efficient way of computing
compared with using CPUs or GPUs alone. Data show that different types of computing
processing units may be better suited to different types of workloads. For example, CPUs are
generally good at control-intensive workloads while GPUs perform well at computing-intensive
tasks.
Currently, programmers often struggle to write apps that can effectively use heterogeneous
computing devices (GPUs and CPUs in combination) to achieve high performance, efficient
computing. Anticipating both the computing capacity of a particular device and which program
is best suited for a particular device (program affinity) is challenging. If incorrect predictions are
made by the programmer, performance could be severely degraded.
The Problem of Executing OpenCL Programs in a Mobile Device SoC
Open Computing Language (OpenCL), a popular open standard computing platform for
heterogeneous computing, is designed to serve as the common high-level language for
optimizing multiple computing devices. An OpenCL program can be executed on any device,
including mobile devices, that supports the OpenCL standard Application Program Interface
(API).
Although OpenCL programs can be executed on all devices supporting OpenCL, the
performance may not meet the programmer’s expectation. The reasons are as follows:
1. The program affinity for a specific program is not easy to predict. The affinity of a
program could be affected by algorithm design, implementation, and/or the
architecture of the target device. A program running on the un-preferred computing
device results in lower performance when compared to the preferred computing device.
For example, two different implementations of the same object detection algorithm,
violaJones_NAIVE and violaJones_LWS32, show different program affinities, leading to
a large performance difference when executed on the CPU and the GPU, as shown in Figure 1,
below. According to the figure, executing violaJones_LWS32 (which prefers the GPU) on the CPU
results in severe performance degradation compared to the GPU. violaJones_NAIVE
shows the opposite result.
Figure 1. Performance Differences on CPU and GPU
2. Programmers often have trouble predicting the computing capabilities of the target
devices. Even if the programmers can focus on specific CPUs and GPUs of the target
devices, the computing capability of CPUs and GPUs still may not be able to be correctly
predicted because the computing abilities of CPUs and GPUs are often reduced by
power budgets or thermal constraints. Once a program is sent to a computing device
which has lower than anticipated computing capability, the desired performance cannot
be achieved. In addition, for throughput-oriented programs that are executed in parallel
on both CPUs and GPUs, performance is degraded further.
For example, the performance of parallel execution of the program “Juliaset” is shown
in Figure 2, below.
Figure 2. Performance of Parallel Executing Juliaset with Different Dispatch Ratios
(y-axis: performance in Mpixel/second, 0 to 160, higher is better; x-axis: job dispatching ratio to GPU, 0 to 1)
The y-axis indicates the performance of parallel execution. The x-axis shows the ratio
n, where n is the fraction of jobs dispatched to the GPU and 1-n is the fraction of jobs
dispatched to the CPU. As shown, the maximum performance occurs when the ratio is
0.56, indicating that the GPU and the CPU handle 56% and 44% (i.e. 100% - 56%) of the
jobs, respectively. If the CPU’s and the GPU’s computing capabilities are anticipated
incorrectly (a ratio of less than 0.5, for instance), the performance of parallel execution
can be lower than that of sending all tasks to the GPU.
3. The final reason programmers may not accurately predict computing capacity is the
additional overhead (any combination of excess or indirect computation time, memory,
bandwidth or other resources) required for parallel execution. For parallel programs,
enabling data sharing and synchronization requires additional overhead. Such
overheads are typically hardware-dependent. Once the overheads outweigh the
benefits of parallel execution, the performance could fall below predicted requirements.
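The effect of the dispatch ratio and of parallel-execution overhead described in items 2 and 3 can be illustrated with a minimal sketch. It assumes an idealized model that is not taken from the white paper: each device processes its share at a constant rate, both devices run concurrently, and a fixed overhead is paid only when both devices are used. The function names and the rates 56 and 44 are hypothetical; the rates were chosen only so that the best ratio comes out at 0.56.

```python
def parallel_throughput(n, gpu_rate, cpu_rate, overhead_s=0.0, work=1.0):
    """Throughput (work units per second) when a fraction n of the work goes to the GPU.

    gpu_rate and cpu_rate are hypothetical sustained rates in work units per second.
    overhead_s is an assumed fixed cost for data sharing and synchronization,
    charged only when both devices take part.
    """
    if not 0.0 <= n <= 1.0:
        raise ValueError("n must be between 0 and 1")
    gpu_time = n * work / gpu_rate
    cpu_time = (1.0 - n) * work / cpu_rate
    uses_both = 0.0 < n < 1.0
    # Both devices run concurrently, so the slower share decides the elapsed time.
    elapsed = max(gpu_time, cpu_time) + (overhead_s if uses_both else 0.0)
    return work / elapsed


def best_ratio(gpu_rate, cpu_rate):
    """Ratio at which both devices finish together in this overhead-free model."""
    return gpu_rate / (gpu_rate + cpu_rate)


if __name__ == "__main__":
    gpu_rate, cpu_rate = 56.0, 44.0  # hypothetical rates, not measured values
    for n in (0.0, 0.2, best_ratio(gpu_rate, cpu_rate), 1.0):
        print(f"n={n:.2f}  throughput={parallel_throughput(n, gpu_rate, cpu_rate):.1f}")
```

In this model a poorly chosen ratio such as 0.2 yields less throughput than sending everything to the GPU, and a large enough overhead erases the gain even at the best ratio. Real devices also change their rates under power and thermal limits, which this sketch ignores.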
CorePilot 2.0: Heterogeneous Computing with a Fused
CPU+GPU Device
To address these problems, CorePilot 2.0 introduces Device Fusion, an “intelligent” technology
which can efficiently execute OpenCL programs by fusing GPU and CPU computing capabilities.
As shown in Figure 3 below, Device Fusion is able to flexibly dispatch each kernel (functional
part) of an OpenCL program to the most suitable computing device.
Figure 3. Dispatch Options in Device Fusion
For throughput-oriented programs, Device Fusion provides an infrastructure that can
automatically maintain parallel execution on GPUs and CPUs. By using Device Fusion,
programmers can focus on program development and obtain performance improvements
without being affected by platform issues.
CorePilot 2.0 Advantage: Freeing Programmers to Focus on Algorithm
Development
Device Fusion is presented as a virtual OpenCL device. The virtual device sits on top of the CPU and
GPU devices (the standalone OpenCL CPU and GPU devices are still available) and is compliant
with the standard OpenCL API, so existing OpenCL applications can take advantage of
Device Fusion without any modification. An overview of Device Fusion is shown in Figure 4, below.
Figure 4. Overview of Device Fusion
The virtual device includes three modules: the kernel analysis module, the parallel execution
infrastructure module, and the dispatch strategy module. When an OpenCL program is
assigned to the virtual device, each kernel of the program will be initially analyzed to collect the
necessary information, such as whether it’s optimal for the kernel to be executed in parallel. If
the parallel execution is allowed, the parallel execution infrastructure module prepares the
infrastructure, enabling data sharing and synchronization, for example. Finally, the dispatch
strategy module determines the number of jobs (i.e. work-items of the OpenCL program) of the
kernel to be directed to the CPUs and GPUs respectively, according to the internal
load-balancing algorithm. If parallel execution is not available, the virtual device tries to dispatch the
entire kernel to the most suitable computing device at the direction of the dispatch strategy
module.
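The decision flow of the three modules can be sketched as follows. This is only an illustration under stated assumptions: the white paper does not disclose the internal load-balancing algorithm, so the proportional split used here is a stand-in, and the `KernelInfo` type, the `dispatch` function, and all rate values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelInfo:
    """Hypothetical result of the kernel analysis step for one OpenCL kernel."""
    name: str
    work_items: int
    parallel_ok: bool  # whether running on CPU and GPU together is worthwhile
    cpu_rate: float    # assumed estimate of work-items per second on the CPU
    gpu_rate: float    # assumed estimate of work-items per second on the GPU


def dispatch(kernel):
    """Return (cpu_items, gpu_items) for one kernel.

    Assumption: work-items are split in proportion to the estimated rates.
    The real dispatch strategy module may use a different policy.
    """
    if kernel.parallel_ok:
        # A real implementation would first set up data sharing and
        # synchronization (the parallel execution infrastructure step).
        total_rate = kernel.cpu_rate + kernel.gpu_rate
        gpu_items = round(kernel.work_items * kernel.gpu_rate / total_rate)
        return kernel.work_items - gpu_items, gpu_items
    # No parallel execution: send the whole kernel to the faster device.
    if kernel.gpu_rate >= kernel.cpu_rate:
        return 0, kernel.work_items
    return kernel.work_items, 0


if __name__ == "__main__":
    kernels = [
        KernelInfo("find_neighbor", 1000, True, cpu_rate=44.0, gpu_rate=56.0),
        KernelInfo("IntegralStep1", 500, False, cpu_rate=20.0, gpu_rate=80.0),
        KernelInfo("ViolaJones", 500, False, cpu_rate=80.0, gpu_rate=20.0),
    ]
    for k in kernels:
        cpu_items, gpu_items = dispatch(k)
        print(f"{k.name}: CPU={cpu_items} GPU={gpu_items}")
```

The kernel names are borrowed from the case studies in this paper, but the rates attached to them are invented for the example and do not reflect measurements.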
The MediaTek Advantage: A Leader in Power Management
With mobile phone overheating dominating the headlines lately, OEMs are eager to find SoCs
which address power and thermal challenges. And, for consumers, power consumption is one
of the leading criteria in making a mobile device purchase decision. MediaTek continues to
meet consumer demands by leading the industry in power and thermal management
innovations with its patented CorePilot technology. CorePilot 1.0 maximized device
performance and power saving through interactive power management, adaptive thermal
management, and advanced scheduler algorithms. CorePilot 2.0 adds asymmetric big.LITTLE
CPU cores to the advanced scheduling algorithms and enables efficient computing by sending
workloads written in OpenCL to the most suitable computing device (the CPUs or the GPUs)
or to both computing devices. Currently, only MediaTek offers this Device Fusion technology.
Our competitors offer CPU-only or GPU-only scheduling solutions, limiting efficiency and
thermal management capabilities.
Case Studies Using CorePilot 2.0
Superresolution
Superresolution is an image-processing algorithm which can enhance image resolution and
extract significant detail from the original image. Superresolution requires significant
computation. Superresolution divides the process into several stages: find_neighbor,
col_upsample, row_upsample, sub and blur2. Each stage is implemented as an OpenCL kernel.
In this case study, images at three different resolutions (1 Mpixel, 2 Mpixel and 4 Mpixel) were
enlarged by 2.25x using GPU only, CPU only, and Device Fusion (in CorePilot 2.0).
As shown in Figure 5, Device Fusion outperforms GPU-only and CPU-only execution at all
three resolutions, by up to 46%. The increase in performance is the result of the
parallel execution of the major kernel, find_neighbor, and of sending the load to the most suitable
device.
Figure 5. Execution Times Breakdown of Each Stage in Superresolution under GPU, CPU and
Device Fusion
(y-axis: execution time in ms, 0 to 1800, lower is better; bars for Adreno420, Device Fusion, MT6795CPU and MT6795GPU at each of the 1Mpixel, 2Mpixel and 4Mpixel images, broken down by stage: find_neighbor, col_upsample, row_upsample, sub, blur2)
Face Detection
Face detection is broadly used in mobile devices when taking pictures or unlocking devices. The
task flow of face detection is shown in Figure 6, below. Using CorePilot 2.0, each function was
sent to the most suitable computing device. For example, the function “IntegralStep1” was sent
to the GPU and the function “ViolaJones” was sent to the CPU.
Figure 6. Flow of Face Detection and Performance/Energy Consumption Results Comparison
Figure 6 shows both the performance and energy consumption results. According to the figure,
Device Fusion leads to a performance improvement of up to 146% when compared with CPU-only
or GPU-only processing. In addition, Device Fusion can reduce energy consumption by as much
as 18% in this instance when compared with CPU-only or GPU-only processing.
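To put these percentages in more familiar terms, the short sketch below converts them into a speedup factor and a remaining-energy fraction. It assumes that "performance improvement" means throughput relative to the single-device baseline and that the energy reduction is relative to the same baseline; the paper does not spell out either definition, and the function names are hypothetical.

```python
def speedup_factor(improvement_pct):
    """Throughput relative to the baseline, assuming improvement is a throughput gain."""
    return 1.0 + improvement_pct / 100.0


def remaining_energy_fraction(reduction_pct):
    """Energy used relative to the baseline after the stated reduction."""
    return 1.0 - reduction_pct / 100.0


if __name__ == "__main__":
    print(f"speedup: {speedup_factor(146):.2f}x")
    print(f"energy:  {remaining_energy_fraction(18):.2f} of baseline")
```

Under these assumptions, the best case reported for face detection corresponds to roughly 2.46 times the baseline throughput at about 82% of the baseline energy.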
Conclusion
CorePilot 1.0 was designed for efficient management of symmetric and asymmetric multi-core
CPUs to achieve load balance and better performance. The enhanced CorePilot 2.0 not only
manages CPU cores, but also efficiently manages both CPU and GPU cores in modern mobile
device SoCs. CorePilot 2.0 includes Device Fusion technology, which enables efficient
computing by sending workloads written in OpenCL to the most suitable computing
device or to both computing devices. Since CPUs and GPUs are best suited to different
workloads, CorePilot 2.0 can determine which task will perform better on which computing
device. This frees up programmers to concentrate more on program development rather than
predicting computing device efficiency outcomes.