MediaTek CorePilot 2.0™

MediaTek continues to lead in power and thermal innovation with our advanced heterogeneous computing architecture that maximizes device performance and power efficiency.

In July 2013, MediaTek delivered the industry’s first mobile System on a Chip (SoC) with Heterogeneous Multi-Processing. Called CorePilot 1.0, the technology maximized device performance and power saving through interactive power management, adaptive thermal management and advanced scheduler algorithms.

In 2015, MediaTek introduces CorePilot 2.0, an evolution of CorePilot 1.0 that extends heterogeneous computing to cover both CPUs and GPUs. This CPU and GPU cooperation, named Device Fusion, is an “intelligent” technology which can efficiently execute OpenCL programs by fusing GPU and CPU computing capabilities.

With mobile phone overheating dominating the headlines lately, OEMs are eager to find SoCs that address power and thermal challenges. CorePilot 2.0 and Device Fusion address the power/performance relationship of heterogeneous computing architectures and enable programmers to focus on application software development without needing to consider the hardware platform configuration.

Introduction

For modern mobile app programmers, balancing the capabilities of powerful computing devices against power and thermal constraints is often a significant challenge. Programmers seek to meet user performance needs without overheating the device or causing rapid battery drain.

MediaTek, a pioneering fabless semiconductor company and a market leader in cutting-edge Systems on a Chip for wireless communications and connectivity, HDTV, DVD and Blu-ray, addressed the optimization of mobile device CPU core performance through the development of CorePilot 1.0. For an in-depth history of CorePilot 1.0, see the CorePilot 1.0 white paper.

This white paper discusses the challenges and solutions of CorePilot 2.0, an evolution of CorePilot 1.0, in addressing workload management and optimization through the use of multiple processing devices. This white paper will show:

CorePilot 2.0’s Device Fusion technology achieves up to 146% performance improvement when compared to CPU-only or GPU-only execution.
CorePilot 2.0’s Device Fusion technology lowers energy consumption by up to 18% when compared to CPU-only or GPU-only execution.
CorePilot 2.0’s Device Fusion technology frees programmers from predicting which computing device is best suited to which task.
CorePilot 2.0 positions MediaTek as the leader in power resource management innovation and the only company currently offering a parallel computing device scheduling algorithm.

The Challenge: Programming to Take Full Advantage of Multiple Processing Devices

Recent studies suggest that using CPUs and GPUs together is a more efficient way of computing than using CPUs or GPUs alone. Data show that different types of processing units may be better suited to different types of workloads. For example, CPUs are generally good at control-intensive workloads while GPUs perform well at compute-intensive tasks.

Currently, programmers often struggle to write apps that can effectively use heterogeneous computing devices (GPUs and CPUs in combination) to achieve high-performance, efficient computing. Anticipating both the computing capacity of a particular device and which program is best suited to a particular device (program affinity) is challenging. If the programmer predicts incorrectly, performance could be severely degraded.

The Problem of Executing OpenCL Programs on a Mobile Device SoC

Open Computing Language (OpenCL), a popular open standard computing platform for heterogeneous computing, is designed to serve as the common high-level language for optimizing multiple computing devices.
An OpenCL program can be executed on any device, including mobile devices, that supports the OpenCL standard Application Program Interface (API). Although OpenCL programs can be executed on all devices supporting OpenCL, the performance may not meet the programmer’s expectation. The reasons are as follows:

1. The program affinity for a specific program is not easy to predict. The affinity of a program can be affected by algorithm design, implementation, and/or the architecture of the target device. A program running on its un-preferred computing device achieves lower performance than on its preferred computing device. For example, two different implementations of the same object detection algorithm, violaJones_NAIVE and violaJones_LWS32, show different program affinities, leading to a large performance difference when executed on the CPU and the GPU, as shown in Figure 1, below. From the figure, executing violaJones_LWS32 (which prefers the GPU) on the CPU results in severe performance degradation compared to the GPU. violaJones_NAIVE shows the opposite result.

Figure 1. Performance Differences on CPU and GPU

2. Programmers often have trouble predicting the computing capabilities of the target devices. Even if programmers can focus on the specific CPUs and GPUs of the target devices, the computing capability of those CPUs and GPUs still may not be predicted correctly, because it is often reduced by power budgets or thermal constraints. Once a program is sent to a computing device which has lower than anticipated computing capability, the desired performance cannot be achieved. In addition, for throughput-oriented programs that are executed in parallel on both CPUs and GPUs, performance is degraded further. For example, the performance of parallel execution of the program “Juliaset” is shown in Figure 2, below.
Figure 2. Performance of Parallel Executing Juliaset with Different Dispatch Ratios (x-axis: job dispatching ratio to GPU, 0 to 1; y-axis: Mpixel/second, 0 to 160; higher is better)

The y-axis indicates the performance of a parallel execution. The x-axis shows the ratio n, where n is the fraction of jobs dispatched to the GPU and 1-n is the fraction of jobs dispatched to the CPU. As shown, the maximum performance occurs when the ratio is 0.56, indicating that the GPU and the CPU handle 56% and 44% (i.e. 100% - 56%) of the jobs, respectively. If the CPU’s and the GPU’s computing capabilities are anticipated incorrectly (for example, by choosing a ratio below 0.5), the performance of parallel execution could be lower than that of sending all tasks to the GPU.

3. The final reason programmers may not accurately predict computing capacity is the additional overhead (any combination of excess or indirect computation time, memory, bandwidth or other resources) required for parallel execution. For parallel programs, enabling data sharing and synchronization requires additional overhead. Such overheads are typically hardware-dependent. Once the overheads outweigh the benefits of parallel execution, the performance could fall below predicted requirements.

CorePilot 2.0: Heterogeneous Computing with a Fused CPU+GPU Device

To address these problems, CorePilot 2.0 introduces Device Fusion, an “intelligent” technology which can efficiently execute OpenCL programs by fusing GPU and CPU computing capabilities. As shown in Figure 3 below, Device Fusion is able to flexibly dispatch each kernel (functional part) of an OpenCL program to the most suitable computing device.

Figure 3. Dispatch Options in Device Fusion

For throughput-oriented programs, Device Fusion provides an infrastructure that can automatically maintain parallel execution on GPUs and CPUs.
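
To make the dispatch-ratio trade-off for throughput-oriented programs concrete, the following is a minimal sketch of a deliberately simplified model. It is not MediaTek's load-balancing algorithm, which this paper does not disclose. It assumes each device processes jobs at a fixed rate, both devices run in parallel, and sharing and synchronization overheads are zero. The function names and the rates of 56 and 44 jobs per unit time are hypothetical, chosen only so that the balanced ratio matches the 0.56 reported for Juliaset.

```python
def combined_throughput(ratio_to_gpu, gpu_rate, cpu_rate):
    """Jobs per unit time when a fraction of the jobs goes to the GPU.

    Simplified model (an assumption, not the Device Fusion algorithm):
    both devices work in parallel at fixed rates, so the batch finishes
    when the slower side finishes. Overheads are ignored.
    """
    if not 0.0 <= ratio_to_gpu <= 1.0:
        raise ValueError("ratio_to_gpu must be between 0 and 1")
    if gpu_rate <= 0 or cpu_rate <= 0:
        raise ValueError("rates must be positive")
    gpu_time = ratio_to_gpu / gpu_rate          # time for the GPU share of one job unit
    cpu_time = (1.0 - ratio_to_gpu) / cpu_rate  # time for the CPU share of one job unit
    return 1.0 / max(gpu_time, cpu_time)


def balanced_ratio(gpu_rate, cpu_rate):
    """Ratio at which both devices finish at the same time in this model."""
    return gpu_rate / (gpu_rate + cpu_rate)


def split_work_items(total_items, ratio_to_gpu):
    """Split a work-item count into (gpu_items, cpu_items) for a given ratio."""
    gpu_items = round(total_items * ratio_to_gpu)
    return gpu_items, total_items - gpu_items


if __name__ == "__main__":
    GPU_RATE, CPU_RATE = 56.0, 44.0  # hypothetical rates, arbitrary units
    best = balanced_ratio(GPU_RATE, CPU_RATE)
    for ratio in (0.0, 0.1, best, 1.0):
        throughput = combined_throughput(ratio, GPU_RATE, CPU_RATE)
        print(f"ratio to GPU = {ratio:.2f}: throughput = {throughput:.1f}")
    print("split of 1000 work-items:", split_work_items(1000, best))
```

In this model, a poorly chosen ratio such as 0.1 yields lower throughput than sending everything to the GPU, which mirrors the qualitative point made for Figure 2. Real devices also pay the overheads described above and can be throttled by power or thermal limits, so the best ratio on real hardware may shift while a program runs.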
By using Device Fusion, programmers can focus on program development and obtain performance improvements without being affected by platform issues.

CorePilot 2.0 Advantage: Freeing Programmers to Focus on Algorithm Development

Device Fusion is presented as a virtual OpenCL device. The virtual device sits on top of the CPU and GPU devices (the standalone OpenCL CPU and GPU devices remain available) and is compliant with the standard OpenCL API, so existing OpenCL applications can take advantage of Device Fusion without any modification. An overview of Device Fusion is shown in Figure 4, below.

Figure 4. Overview of Device Fusion

The virtual device includes three modules: the kernel analysis module, the parallel execution infrastructure module, and the dispatch strategy module. When an OpenCL program is assigned to the virtual device, each kernel of the program is first analyzed to collect the necessary information, such as whether the kernel is suitable for parallel execution. If parallel execution is possible, the parallel execution infrastructure module prepares the infrastructure, for example by enabling data sharing and synchronization. Finally, the dispatch strategy module determines the number of jobs (i.e. work-items of the OpenCL program) of the kernel to be directed to the CPUs and GPUs respectively, according to the internal load-balancing algorithm. If parallel execution is not possible, the virtual device tries to dispatch the entire kernel to the most suitable computing device, as directed by the dispatch strategy module.

The MediaTek Advantage: A Leader in Power Management

With mobile phone overheating dominating the headlines lately, OEMs are eager to find SoCs which address power and thermal challenges. And, for consumers, power consumption is one of the leading criteria in making a mobile device purchase decision.
MediaTek continues to meet consumer demands by leading the industry in power and thermal management innovations with its patented CorePilot technology. CorePilot 1.0 maximized device performance and power saving through interactive power management, adaptive thermal management, and advanced scheduler algorithms. CorePilot 2.0 adds asymmetric big.LITTLE CPU cores to the advanced scheduling algorithms and enables efficient computing by sending workloads written in OpenCL to the most suitable computing device, CPUs or GPUs, or to both computing devices.

Currently, only MediaTek offers this Device Fusion technology. Our competitors offer CPU-only or GPU-only scheduling solutions, limiting efficiency and thermal management capabilities.

Case Studies Using CorePilot 2.0

Superresolution

Superresolution is an image-processing algorithm which can enhance image resolution and extract significant detail from the original image. Superresolution requires significant computation. It divides the process into several stages: find_neighbor, col_upsample, row_upsample, sub and blur2. Each stage is implemented as an OpenCL kernel.

In this case study, images at three different resolutions (1 Mpixel, 2 Mpixel and 4 Mpixel) were enlarged by 2.25x using GPU only, CPU only, and Device Fusion (in CorePilot 2.0). As shown in Figure 5, Device Fusion outperforms GPU-only and CPU-only execution at all three resolutions by up to 46%. The increase in performance is the result of the parallel execution of the major kernel, find_neighbor, and of sending the load to the most suitable device.

Figure 5. Execution Times Breakdown of Each Stage in Superresolution under GPU, CPU and Device Fusion (y-axis: execution time in ms, 0 to 1800; stages: find_neighbor, col_upsample, row_upsample, sub, blur2; series: MT6795GPU, MT6795CPU, Device Fusion, Adreno420; groups: 1Mpixel-image, 2Mpixel-image, 4Mpixel-image; lower is better)

Face Detection

Face detection is broadly used in mobile devices when taking pictures or unlocking devices. The task flow of face detection is shown in Figure 6, below. Using CorePilot 2.0, each function was sent to the most suitable computing device. For example, the function “IntegralStep1” was sent to the GPU and the function “ViolaJones” was sent to the CPU.

Figure 6. Flow of Face Detection and Performance/Energy Consumption Results Comparison

Figure 6 shows both the performance and energy consumption results. According to the figure, Device Fusion leads to a performance improvement of up to 146% when compared with CPU-only or GPU-only processing. In addition to the performance improvement, Device Fusion reduces energy consumption by as much as 18% in this instance when compared with CPU-only or GPU-only processing.

Conclusion

CorePilot 1.0 was designed for efficient management of symmetric and asymmetric multi-core CPUs to achieve load balance and better performance. The enhanced CorePilot 2.0 not only manages CPU cores, but also efficiently manages both CPU and GPU cores in modern mobile device SoCs. CorePilot 2.0 includes Device Fusion technology, which enables efficient computing by sending workloads written in OpenCL to the most suitable computing device or to both computing devices. Since CPUs and GPUs are best suited to different workloads, CorePilot 2.0 determines which task will perform better on which computing device. This frees programmers to concentrate on program development rather than on predicting how efficiently each computing device will run their code.