Software for embedded multiprocessors
- Introduction on embedded OS
- Code parallelization
- Interprocessor communication
- Task allocation
- Power management

Introduction on embedded operating systems

OS Overview
- Kernel, device drivers, boot loader, user interface, file system and utilities.
- Kernel components: interrupt handler, scheduler, memory manager, system services (networking and IPC).
- The kernel runs in a protected memory space (kernel space) with full access to the hardware, while applications run in user space.
- Applications communicate with the kernel through library functions (printf(), strcpy(), open(), ...) that issue system calls (e.g. write()).

OS Overview (II)
- The operating system takes control of execution on: timer interrupts, I/O interrupts, system calls, and exceptions (undefined instruction, data abort, page faults, etc.).

Processes
- A process is a unique execution of a program: several copies of a program may run simultaneously or at different times.
- A process has its own state: registers and memory.
- The operating system manages processes.

Process state
- A process can be in one of three states: executing on the CPU, ready to run, or waiting for data.
- Transitions: a ready process starts executing when it gets the CPU; an executing process goes back to ready when preempted, or to waiting when it needs data; a waiting process becomes ready when the data arrive.

Processes and CPUs
- Activation record: copy of the process state (PC, registers) kept in memory.
- Context switch: the current CPU context goes out; the new CPU context goes in.

Terms
- Thread = lightweight process: a process that shares its memory space with other processes.
- Reentrancy: ability of a program to be executed several times with the same results.

Processes in POSIX
- Create a process with fork(): the parent process keeps executing the old program; the child process executes a new program.
- fork() creates the child:

  childid = fork();
  if (childid == 0) { /* child operations */ } else { /* parent operations */ }

- execv() overlays the child with code from a file:

  childid = fork();
  if (childid == 0) {
    execv("mychild", childargs);
    perror("execv");
    exit(1);
  }

Context switching
- Who controls when the context is switched? How is the context switched?

Co-operative multitasking
- Improvement on co-routines: hides the context-switching mechanism, but still relies on processes to give up the CPU.
- Each process allows a context switch at a cswitch() call; a separate scheduler chooses which process runs next.

Problems with co-operative multitasking
- Programming errors can keep other processes out: a process never gives up the CPU, or waits too long to switch and misses input.

Context switching
- Must copy all registers to the activation record, keeping a proper return value for the PC.
- Must copy the new activation record into the CPU state.
- How does the program that copies the context keep its own context?

Preemptive multitasking
- Most powerful form of multitasking: a timer interrupt invokes the OS, which switches contexts.
- The OS controls when contexts switch and determines which process runs next.

Flow of control with preemption
- P1 runs until a timer interrupt enters the OS; the OS may resume P1 or, at a later interrupt, switch to P2.

Preemptive context switching
- The timer interrupt gives control to the OS, which saves the interrupted process's state in its activation record.
- The OS chooses the next process to run and installs its activation record as the current CPU state.

Operating systems
- The operating system controls resources: who gets the CPU, when I/O takes place, how much memory is allocated.
- The most important resource is the CPU itself; CPU access is controlled by the scheduler.
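As a worked version of the fork()/execv() fragments above, here is a minimal, self-contained sketch; "mychild" and its argument list are placeholders, not a program defined in these notes:

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  int main(void) {
      char *childargs[] = { "mychild", NULL };   /* placeholder argument list */

      pid_t childid = fork();                    /* duplicate the calling process */
      if (childid < 0) {
          perror("fork");
          exit(1);
      }
      if (childid == 0) {
          /* Child: overlay this process image with the new program. */
          execv("mychild", childargs);
          perror("execv");                       /* reached only if execv fails */
          exit(1);
      }
      /* Parent: keeps executing the old program, then reaps the child. */
      waitpid(childid, NULL, 0);
      return 0;
  }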
Design Issues
- Kernel space / user space / real-time space.
- Monolithic versus micro-kernel:
  - Monolithic: OS services (including device drivers, networking and file system) run in privileged mode (easier to make efficient) (Linux, WinNT).
  - Microkernel: privileged mode only for task management, scheduling, IPC, interrupt handling and memory management (more robust) (QNX, VxWorks).
- Preemptable kernel or not.
- Memory management versus shared memory.
- Dedicated versus general.

Embedded vs General Purpose OS
- Small footprint.
- Stability (must run for years without manual intervention).
- Hardware watchdogs.
- Little power.
- Autonomous reboot (safely and instantly).

Taxonomy
- High-end embedded systems: down-sized derivatives of existing general-purpose OSes (routers, switches, PDAs, set-top boxes).
- Deeply embedded OS: small OSes with a handful of basic functions, designed from the ground up for a particular application; they typically lack high-performance GUI and networking (automotive control, digital cameras, mobile phones). They are statically linked to the application: after compilation, the OS kernel and the applications are concatenated into a single package that is loaded onto the embedded machine.
- Run-time environment: boot routine + embedded libraries; Java and C++ run-times offer functionality such as memory management, threading, task synchronization and exception handling.
- (Figure: typical OS configuration — user programs on top of the operating system on top of the hardware — versus typical embedded configuration — a single user program, including the operating system components, directly on the hardware.)

Real-time operating systems: Dynamic vs Static Loading
- Dynamic loading: the OS is loaded as a separate entity and applications are dynamically loaded into memory (more flexibility; code relocation is needed) (e.g. uClinux).
- Static loading: the OS is linked and loaded together with the applications (no flexibility, higher predictability) (e.g. eCos, RTEMS); the OS is a set of libraries that provide OS services.
- What about: memory protection? (shared address space); system calls? (no CPU mode change required); process creation (fork, exec)? (shared address space, no overlaying); file system? (only for input/output data).

Static Loading
- No address space separation: user applications run with the same access privilege as the kernel.
- Functions are accessed as function calls, not as "system calls": no need for copying parameters and data, no need for state saving.
- The result is speed and control.

Dynamic Loading
- Address space separation: the OS sits at a constant address, while a process is relocated from the file system into system memory at run time (compile-time addresses != run-time addresses).
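To make the static-loading point concrete, the sketch below contrasts the two service paths; rtos_write() is a hypothetical, statically linked "OS service" stub invented for illustration, not an API of eCos, RTEMS or uClinux:

  #include <string.h>
  #include <unistd.h>

  /* Hypothetical statically linked OS "service": just a C function living in
     the same image, address space and privilege level as the application
     (stubbed here as a copy into a driver buffer). */
  static char uart_buf[256];

  static int rtos_write(const void *buf, unsigned len) {
      if (len > sizeof(uart_buf)) len = sizeof(uart_buf);
      memcpy(uart_buf, buf, len);      /* plain function call: no trap, no mode change */
      return (int)len;
  }

  int main(void) {
      const char *msg = "hello\n";
      rtos_write(msg, strlen(msg));    /* static loading: direct function call */
      write(1, msg, strlen(msg));      /* dynamic loading: real system call,
                                          CPU mode change + parameter copying */
      return 0;
  }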
Focus on software for embedded multiprocessors

Embedded vs. General Purpose
- Server applications: Symmetric Multi-Processing; homogeneous cores with a large coherent memory space on a multi-core die or bus; general tasks known late, tasks run on any core; high-performance, high-speed microprocessors; SMT (Simultaneous Multi-Threading) programming models. Examples: large server chips (e.g. Sun Niagara, 8 cores x 4 threads), scientific multi-processors.
- Embedded applications: Asymmetric Multi-Processing; shared local memories, HW FIFOs and other direct connections; dataflow programming models; differentiated processors with specific tasks known early, mapped to dedicated processors; configurable and extensible processors for performance and power efficiency. Classical example: smart mobile — RISC + DSP + media processors.
- Communication requirements differ accordingly between the two domains.

Parallel programming of embedded multiprocessors

Parallelism & Programming Models
- Multiprocessing is difficult: concurrency, and "fear of concurrency".
- No robust and general models to automatically extract concurrency have emerged in 20-30+ years of research.
- Many programming models/libraries exist: SMT, SMP, OpenMP, MPI (message passing interface).
- Users manually modify code: concurrent tasks or threads, communication, synchronisation.
- Today: coarse-grained (whole application / data-wise) concurrency — unmodified source plus an MP scheduler, with an API for communication and synchronisation.

Sequential execution model
- The most common and the most well understood.
- Supported by traditional (imperative) languages (C, C++, Fortran, etc.); huge bulk of legacy code.
- We are trained to solve problems algorithmically (as a sequence of steps), and microprocessors were originally designed to run sequential code.
- The easiest to debug: tracing the state of the CPU, step-by-step execution.
- But... it HIDES parallelism!

Types of Parallelism
- Instruction Level Parallelism (ILP): compilers and hardware are mature.
- Task parallelism: parallelism explicit in the algorithm; between filters without a producer/consumer relationship.
- Data parallelism: between iterations of a stateless filter; placed within a scatter/gather pair (fission); filters with state cannot be parallelized this way.
- Pipeline parallelism: between producers and consumers; stateful filters can be parallelized.

Parallelizing Loops: a Key Problem
- 90% of execution time is in loops.
- Automatic extraction has had only partial success, mostly on "well-behaved" loops.
- Challenges: dependency analysis and interaction with data placement.
- Cooperative approaches are common: the programmer drives automatic parallelization (OpenMP).
- FORALL: no loop-carried dependences, fully parallel.
- FORACROSS: some loop-carried dependences.
- Parallelized loops rely on barrier synchronization.

Barrier with Pthreads
- Only the master core initializes the synchronization structures (pthread_mutex_init()) and creates the workers (pthread_create()); a barrier separates the serial region from the parallel region, as shown in the sketch below.

Pthreads on Heterogeneous CPUs?
- Heterogeneous MPSoC: a master CPU and several slave CPUs on an interconnect, each with a private memory, plus a shared memory.
- Issues: there is an OS running on each core, and the master core has no means to fork new threads on the slave nodes, so pthreads is not a suitable solution.
- Solution: a standalone implementation, with master/slave cores instead of threads, and synchronization through shared memory.
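For the symmetric, single-OS case discussed above, a FORALL loop with a master that creates workers and meets them at a barrier can be written with plain pthreads. This is a minimal sketch for illustration, not the course's MPARM implementation; the array size and worker count are arbitrary:

  #include <pthread.h>
  #include <stdio.h>

  #define NWORKERS 4
  #define N 1024

  static double a[N];
  static pthread_barrier_t bar;

  /* Each worker handles a contiguous chunk of a FORALL loop (no loop-carried
     dependences), then waits on the barrier before the master uses the results. */
  static void *worker(void *arg) {
      long id = (long)arg;
      int chunk = N / NWORKERS;
      for (int i = (int)id * chunk; i < ((int)id + 1) * chunk; i++)
          a[i] = i * 0.5;                       /* independent iterations */
      pthread_barrier_wait(&bar);
      return NULL;
  }

  int main(void) {
      pthread_t t[NWORKERS];
      pthread_barrier_init(&bar, NULL, NWORKERS + 1);   /* workers + master */
      for (long id = 0; id < NWORKERS; id++)
          pthread_create(&t[id], NULL, worker, (void *)id);
      pthread_barrier_wait(&bar);               /* master waits for the parallel region */
      for (long id = 0; id < NWORKERS; id++)
          pthread_join(t[id], NULL);
      printf("a[N-1] = %f\n", a[N - 1]);
      return 0;
  }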
SPMD Barrier
- All cores initialize the synchronization and data structures in shared memory; the slaves notify their presence on the barrier to the master.
- The master releases the slaves as soon as it is ready to start the parallel region.
- Serial region: additional serial code is executed only by the master core while the slaves wait on the barrier; then the parallel region starts.

Code implementation flow
- Original C code → parallel compiler → parallel code → gcc → binary code, running on the MPARM platform (master CPU and slave CPUs with private memories, plus a shared memory on the interconnect).

Runtime Library
- The runtime library is responsible for:
  - initializing the needed synchronization features, creating new worker threads (in the original implementation) and coordinating their parallel execution over multiple cores;
  - providing the implementation of the synchronization facilities (locks, barriers).

Code Execution
- Each CPU executes the same program (SPMD). Based on the CPU id, the portions of code to be executed by master and slaves are separated.
- The master CPU executes the serial code, initializes the synchronization structures in shared memory, etc.
- The slave CPUs only execute the parallel regions of code, behaving like the typical slave threads.

  int main() {
    ...
    if (MASTERID) {
      /* serial code, synchronization */
    }
    ...
    if (SLAVEID) {
      /* parallel code */
    }
    ...
  }

- The synchronization structures (barriers, locks) live in the shared memory, accessed by all CPUs over the NoC interconnect.

Synchronization
- Parallel programming through shared memory requires global and point-to-point synchronization.
- On symmetric architectures, implementations use the pthreads library synchronization facilities; on heterogeneous architectures, hardware semaphores must be used.
- pthreads-based implementation:

  void lock(pthread_mutex_t *lock)   { pthread_mutex_lock(lock); }
  void unlock(pthread_mutex_t *lock) { pthread_mutex_unlock(lock); }

- Shared-memory (spin) implementation for heterogeneous architectures:

  void lock(int *lock)   { while (*lock); }
  void unlock(int *lock) { *lock = 0; }

Typical Barrier implementation
- Shared counters protected by locks:

  struct barrier {
    lock_type lock;
    int entry_count;
    int exit_count;
  } *bar;

  LOCK(bar->lock);
  bar->entry_count++;
  if (bar->entry_count < nproc) {
    UNLOCK(bar->lock);
    while (bar->entry_count != nproc);
    LOCK(bar->lock);
    bar->exit_count++;
    if (bar->exit_count == nproc)
      bar->entry_count = 0;
    UNLOCK(bar->lock);
  } else {
    bar->exit_count = 1;
    if (bar->exit_count == nproc)
      bar->entry_count = 0;
    UNLOCK(bar->lock);
  }
  while (bar->exit_count != nproc);

Barrier Implementation Issues
- This approach is not very scalable: every processor notifies its arrival at the barrier by incrementing a common shared variable, so as the number of cores increases, contention for the shared resource may increase significantly.
- Possible solution: a vector of flags, one per core, instead of a single shared counter.

New Barrier Implementation
- Each slave updates only its own flag, and the master spin-waits on each slave's flag to detect its presence on the barrier; in the release phase only the slaves spin-wait, on a counter that is updated solely by the master, so there is no busy-waiting contention on a shared counter.

  typedef struct Barrier {
    int entered[NSLAVES];
    int usecount;
  } Barrier;

  void Slave_Enter(Barrier *b, int id) {
    int ent = b->usecount;
    b->entered[id] = 1;            /* announce presence on the barrier */
    while (ent == b->usecount);    /* spin until the master releases */
  }

  void Master_Wait(Barrier *b, int num_procs) {
    int i;
    for (i = 1; i < num_procs; i++)
      while (!b->entered[i]);      /* wait for each slave's flag */
    /* reset flags to 0 */
  }

  void Master_Release(Barrier *b) {
    b->usecount++;                 /* releases all spinning slaves */
  }
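Putting the pieces together, here is a hedged sketch of how the SPMD master/slave structure might drive this flag-based barrier around a parallel region. To keep the sketch self-contained and runnable it emulates the cores with host pthreads, purely as a stand-in for the MPSoC cores and their shared memory; the core-id convention (core 0 = master) is an assumption, and real shared-memory spinning would need proper memory barriers rather than volatile alone:

  #include <pthread.h>
  #include <stdio.h>

  #define NCORES 4                 /* core 0 = master, cores 1..3 = slaves */

  typedef struct Barrier {
      volatile int entered[NCORES];
      volatile int usecount;
  } Barrier;

  static Barrier bar;              /* would live in shared memory on the MPSoC */
  static int result[NCORES];

  static void Slave_Enter(Barrier *b, int id) {
      int ent = b->usecount;
      b->entered[id] = 1;                    /* announce arrival */
      while (ent == b->usecount);            /* spin until the master releases */
  }
  static void Master_Wait(Barrier *b, int n) {
      for (int i = 1; i < n; i++) while (!b->entered[i]);
      for (int i = 1; i < n; i++) b->entered[i] = 0;     /* reset flags */
  }
  static void Master_Release(Barrier *b) { b->usecount++; }

  /* SPMD body: every "core" runs the same code and branches on its id. */
  static void *spmd_main(void *arg) {
      int id = (int)(long)arg;
      if (id == 0) {                         /* master: serial region, then release */
          Master_Wait(&bar, NCORES);
          Master_Release(&bar);
      } else {
          Slave_Enter(&bar, id);             /* slaves wait for the parallel region */
      }
      result[id] = id * id;                  /* parallel region: independent work */
      return NULL;
  }

  int main(void) {
      pthread_t t[NCORES];
      for (long id = 0; id < NCORES; id++)
          pthread_create(&t[id], NULL, spmd_main, (void *)id);
      for (long id = 0; id < NCORES; id++)
          pthread_join(t[id], NULL);
      printf("result[3] = %d\n", result[3]);
      return 0;
  }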
Compiler aware of synchronization cost?
- A lightweight implementation of the synchronization structures allows parallelized code with a large number of barrier instructions to still perform better than the serial version.
- It would be useful to let the compiler know about the cost of synchronization: this would allow it not only to select the parallelizable loops, but also to establish whether the parallelization is worthwhile.
- For workloads that are well distributed across the cores the proposed barrier performs well, but for a high degree of load imbalance an interrupt-based implementation may be better suited; the compiler may choose which barrier instruction to insert depending on the expected amount of busy waiting.

Performance analysis
- Ideal parallel execution time was calculated by simulating the computational load on one core for the various configurations; as expected, it almost halves with each doubling of the number of working cores.
- The time needed for initializing the synchronization structures in shared memory was measured on a single-core simulation; it is expected to be invariant with increasing numbers of cores.
- Ideal synchronization time was estimated for the various configurations by making the master wait on the barrier after all slaves have entered. In the real case, synchronization requires additional waiting time: simultaneous accesses to the shared memory generate traffic on the bus that produces a significant overhead.
- The overall execution time also has to include the contribution of the waiting cycles due to concurrent accesses to shared memory; those cycles are further lengthened by the polling on the synchronization structures in shared memory.
- (Figure: upper-triangular 32x32 matrix filling; cost (*) broken down into Texec, Tsync and Tinit, ideal vs. overhead, for the serial case and for 1, 2, 4 and 8 cores.)
- (*) Overall number of cycles, normalized by the number of cycles spent for an ideal bus transaction (1 read + 1 write).

Performance analysis 2
- For a small computational load (i.e. few matrix elements, upper-triangular 32x32 matrix filling), initialization and synchronization have a big impact on overall performance: no speedup. Possible optimizations: on barriers, to reduce accesses to shared memory; on initialization, by serializing/interleaving accesses to the bus.
- For a bigger computational load (upper-triangular 1024x1024 matrix filling), the initialization and synchronization contributions go completely unnoticed and there is a big speedup margin. Speedup is nevertheless heavily limited by frequent accesses to shared memory. Would pure computation follow the profile of the ideal (blue) bars? Would cacheable shared memory regions help?

Example: MP-Queue library
- MP-Queue is a library intended for message passing among different cores in an MPSoC environment.
- Highly optimized C implementation:
  - low-level exploitation of data structures and semaphores: low overhead;
  - data transfer optimized for performance: analyses of disassembled code;
  - synchronization operations optimized for minimal interconnect utilization.
- Producer-consumer paradigm, with different topologies: 1-N, N-1, N-N.

Communication library API
1. Autoinit_system(): every core has to call it at the very beginning; allocates the data structures and prepares the semaphore arrays.
2. Autoinit_producer(): to be called by a producer core only; requires a queue id; creates the queue buffers and signals their position to the n consumers.
3. Autoinit_consumer(): to be called by a consumer core only; requires a queue id; waits for the n producers to be bound to the consumer structures.
4. Read(): gets a message from the circular buffer (consumer only).
5. Write(): puts a message into the circular buffer (producer only).
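A hedged usage sketch of the API above for a simple one-producer / one-consumer pair follows. The slides give only the function names and their roles, so the prototypes, the queue id, the token size and the get_core_id() helper below are assumptions for illustration only:

  /* MP-Queue usage sketch: prototypes are assumed, not the real headers. */
  #define QUEUE_ID    0
  #define TOKEN_WORDS 32

  extern void Autoinit_system(void);
  extern void Autoinit_producer(int queue_id);
  extern void Autoinit_consumer(int queue_id);
  extern void Write(int queue_id, const int *token);   /* producer only */
  extern void Read(int queue_id, int *token);          /* consumer only */
  extern int  get_core_id(void);                       /* assumed platform helper */

  void producer_consumer_example(void) {
      int token[TOKEN_WORDS];

      Autoinit_system();                /* every core: allocate structures, semaphore arrays */

      if (get_core_id() == 0) {         /* this core acts as the producer */
          Autoinit_producer(QUEUE_ID);  /* create queue buffers, signal position to consumers */
          for (int i = 0; i < TOKEN_WORDS; i++)
              token[i] = i;
          Write(QUEUE_ID, token);       /* put a message into the circular buffer */
      } else {                          /* this core acts as the consumer */
          Autoinit_consumer(QUEUE_ID);  /* wait for the producers to bind */
          Read(QUEUE_ID, token);        /* get a message from the circular buffer */
      }
  }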
Communication semantics
- Notification mechanisms available: round robin, notify all, target core specification.
- The i-th producer: gets the write position index, puts the data into the buffer, and signals either one consumer (round-robin / fixed) or all consumers (notify all).
- The i-th consumer: gets the read position index, gets the data from the buffer, and signals either one producer (round-robin / fixed) or all producers (notify all).

Architectural Flexibility
1. Multi-core architectures with distributed memory.
2. Purely shared-memory based architectures.
3. Hybrid platforms.

Transaction Chart
- Shared bus accesses are minimized as much as possible: local polling on scratchpad memories.
- Insertion and extraction indexes are stored in shared memory and protected by a mutex.
- The data transfer section involves the shared bus and is critical for performance.

Sequence diagrams
- 1 producer and 1 consumer (parallel activity); synchronization time vs. pure data transfer time.
- Phases: local polling on the scratchpad semaphore, signaling to the remote core's scratchpad, and "pure" data transfer to and from the FIFO buffer in shared memory (shown for 8-word and 64-word messages).

Communication efficiency
- Comparison against ideal point-to-point communication, for 16 and 32 words per token.
- 1-N queues leverage bus pipelining: bigger asymptotic efficiency.
- Interrupt-based notification allows more than one task per core, but introduces significant overhead (up to 15%).
- Low-level optimizations are critical!

Growth of assembly length for copy sections
- From 32-word messages onwards, gcc avoids emitting the multiple-load / multiple-store copy sequence (it is not produced by the compiler any more), since code size would otherwise keep rising steeply.
- Where high throughput is required, a less compact but more optimized representation is desired.

Compiler-aware optimization benefits
- The compiler may be forced to unroll the data transfer loops: about 15% improvement with 32-word messages.
- A typical JPEG 8x8 block is encoded in a 32-word struct (8x8x16-bit data).
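As an illustration of the data-transfer point above, the sketch below contrasts a plain word-copy loop with a manually unrolled version for a 32-word token. It is an assumption-level example of the kind of copy code being discussed, not the actual MP-Queue source:

  #define TOKEN_WORDS 32

  /* Compact representation: a small loop the compiler may keep rolled. */
  static void copy_token_loop(int *dst, const int *src) {
      for (int i = 0; i < TOKEN_WORDS; i++)
          dst[i] = src[i];
  }

  /* Throughput-oriented representation: unrolled in groups of 8, trading code
     size for fewer branches and better load/store scheduling. */
  static void copy_token_unrolled(int *dst, const int *src) {
      for (int i = 0; i < TOKEN_WORDS; i += 8) {
          dst[i]     = src[i];     dst[i + 1] = src[i + 1];
          dst[i + 2] = src[i + 2]; dst[i + 3] = src[i + 3];
          dst[i + 4] = src[i + 4]; dst[i + 5] = src[i + 5];
          dst[i + 6] = src[i + 6]; dst[i + 7] = src[i + 7];
      }
  }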
Task allocation in MPSoC architectures

Application Mapping
- The tasks of an application (T1..T8) must be allocated to processors (Proc. 1 .. Proc. N), each with its private memory on a shared interconnect, and scheduled over time under resource constraints and a deadline.
- The problem of allocation, scheduling and frequency selection for task graphs on multi-processors in a distributed real-time system is NP-hard.
- New tool flows are needed for efficient mapping of multi-task applications onto hardware platforms.

Approach
- Focus: statically scheduled applications.
- Objectives:
  - a complete approach to allocation, scheduling and frequency selection, with high computational efficiency w.r.t. commercial solvers and high accuracy of the generated solutions;
  - a new methodology for multi-task application development: to quickly develop multi-task applications and to easily apply the optimal solution found by the optimizer.

Target architecture
- An architectural template for a message-oriented, distributed-memory MPSoC:
  - support for message exchange between the computation tiles;
  - single-token communication;
  - availability of local memory devices at the computation tiles and of remote memories for program data.
- Several MPSoC platforms available on the market match this template: the Silicon Hive Avispa-CH1 processor, the Cradle CT3600 family of multiprocessors, the Cell processor, the ARM MPCore platform.
- The throughput requirement is reflected in the maximum tolerable scheduling period T of each processor (activities Act. A .. Act. N repeating with period T).

Application model
- A task graph represents:
  - a group of tasks T and the task dependencies;
  - execution times expressed in clock cycles: WCN(Ti);
  - communication times (writes and reads) expressed as WCN(W_TiTj) and WCN(R_TiTj).
- These values can be back-annotated from functional simulation.
- Example graph: Task1 writes to Task2 and Task3; Task2 feeds Task4 and Task3 feeds Task5; Task4 and Task5 join into Task6, each edge annotated with its write and read costs (WCN(W_T1T2), WCN(R_T1T2), ...).

Example
- Graph of activities: 12 nodes, 2 CPUs.
- Node types: Normal, Branch, Conditional, Terminator; node behaviours: Or, And, Fork, Branch.
- The optimizer produces the task allocation, the task scheduling, the arc priorities, and the frequency & voltage settings.
- Configuration excerpt produced for the run-time:

  #define TASK_NUMBER 12
  #define N_CPU 2
  // Node type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
  uint node_type[TASK_NUMBER]      = {0,1,1,0,..};
  // Node behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
  uint node_behaviour[TASK_NUMBER] = {2,3,3,..};
  uint task_on_core[TASK_NUMBER]   = {1,1,2,1};
  int  schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};
  uint queue_consumer[..][..] = {{1,2,2,1,..}, {0,0,0,1,1,.}, {0,0,0,0,0,1,1..}, {0,0,0,0,..}..};

- (Figure: the resulting schedule of the 12 nodes on the 2 CPUs against the deadline.)

Queue ordering optimization
- Communication ordering affects system performance: for example, CPU2 may stall T1 waiting for communication C2 while CPU1 runs T4, depending on the order in which the queue operations are scheduled.

Synchronization among tasks
- Example with two processors: T4 is suspended while waiting for the data C2 produced by T2 and is re-activated when they arrive; non-blocking semaphores are used.

Logic-Based Benders Decomposition
- The problem is decomposed into two sub-problems:
  - allocation & assignment of frequency settings → Integer Programming (IP), under memory constraints; objective function: minimizing the energy consumption for execution and communication of tasks (communication cost & energy);
  - scheduling → Constraint Programming (CP), under the real-time constraint; objective function: minimizing the energy consumption during frequency switching.
- A valid allocation is passed to the scheduler; if no feasible schedule exists, a no-good (linear) constraint is returned to the master problem.
- The process continues until the master problem and the sub-problem converge to the same value; the methodology has been proven to converge to the optimal solution [J.N. Hooker and G. Ottosson].

Application Development Methodology
- Simulator → CTG characterization (application profiles) → optimizer (optimization phase) → application development support → optimal SW application implementation → platform execution.
GSM Encoder (case study)
- Task graph: 10 computational tasks, 15 communication tasks.
- Throughput required: 1 frame / 10 ms.
- With 2 processors and 4 possible frequency & voltage settings: without optimizations 50.9 μJ, with optimizations 17.1 μJ (-66.4%).

Energy Management
- Basic techniques: shutdown and DVFS.
- Advanced techniques: feedback control.

Energy Optimization in MPSoCs
- Two main steps:
  - workload allocation to the processing elements: task mapping and scheduling;
  - after workload allocation, the resources of the processing elements should be adapted to the required performance to minimize energy consumption: shut-down and voltage scaling.

Shut-Down
- When the system is idle the processor can be placed into a low-power state; deeper states save more power but reduce reactivity:
  - core clock gating (woken up by a timer interrupt) — no need for context restore;
  - core power gating (woken up by on-chip peripherals) — context restore needed;
  - chip power gating (woken up by external, on-board interrupts) — context restore needed.

Frequency/Voltage Scaling
- DVFS: adapting frequency and voltage to the workload.
- Frequency must be scaled together with voltage to keep the circuit functional.
- Dynamic power grows with the square of Vdd and linearly with clock speed: P = Ceff * Vdd^2 * f.
- Scaling V and f by a factor s -> power scales as s^3.

Power Manager Implementation
- A power management policy consists of algorithms that use input information (workload, operational conditions) and parameter settings to generate commands that steer the mechanisms towards operating points [Pouwelse03].
- A dynamic power management system is a set of rules and procedures that move the system from one operating point to another as events occur [IBM/Montavista 02].

Power Manager Components
- Monitoring: utilization, idle times, busy times.
- Prediction: averaging (e.g. EMA), filtering (e.g. LMS); per-task based (e.g. Vertigo) or global utilization (e.g. Grunwald).
- Control: shutdown, DVFS; open-loop or closed-loop (e.g. adaptive control).
- Update rule: establish the decision points.

Traditional Approach
- Idle-time and per-task/global utilization monitors feed idle-time and workload predictors, which drive the shutdown and DVFS controllers; the update rule places the decision points in the idle time between tasks.

Limitations
- Assuming no specific information from the applications is available, traditional approaches are based on observation of the utilization history.
- Slow adaptation impacts system reactivity: specific techniques for tracking interactive tasks have been proposed [Flautner2002], and for soft real-time (multimedia) applications deadlines may be missed.
- Frequent voltage oscillations impact energy efficiency, because of the square relationship between power and voltage and the cost of switching (power / time / functionality).

Multimedia applications
- Multimedia applications can be represented as communicating objects, e.g. in the Gstreamer multimedia framework, where data flow between objects through pads.

Streaming Applications
- Multimedia streaming applications are going multicore: the objects are mapped to tasks that are distributed on the cores to increase performance.
- Specific tasks can be executed by hardware accelerators such as IPU or GPU units.
- Example: a pipeline P0 → P1 → P2 → P3 mapped onto CORE #0, CORE #1 and CORE #2.

Overcoming Traditional Approaches
- Key observation: multimedia applications are mapped onto MPSoCs as communicating tasks (example: a software FM radio with stages P0..P5 and an external peripheral).
- Pipelines with single and parallel blocks (split-join) communicating through queues are typical; feedback paths are also common.

Frequency, Voltage Setting Middleware
- Migration plus dynamic f/Vdd setting is critical for energy management.
- DVFS for multi-stage producer-consumer streaming exploits information on the occupancy of the synchronization queues:
  - at equilibrium, the average output rate should match the input rate in each queue;
  - the queue occupancy level is monitored to adjust the processing element speed and Vdd.
- Each stage N has a frequency controller that observes the occupancy of its output queue towards stage N+1; the controllers live in the OS/middleware layer, on top of the communication and synchronization layer, with one OS instance per processor [Carta TECS07].

Middleware Support
- Producer side: main() { produce_data(); write_queue(); } — consumer side: main() { read_queue(); consume_data(); }.
- On each queue operation the communication library runs: check_queue_level(); run_control_algorithm(); set_frequency_and_voltage();
- Almost OS-independent; if interrupts are used, an OS-specific ISR must be written.
- Easy integration into the communication library; support for Gstreamer and OpenMAX.

Feedback Controlled DVS
- A technique to perform run-time energy optimization of pipelined computation in MPSoCs.
- Queue occupancy provides an estimate of the level of performance required of the preceding block.
- Unlike traditional approaches, the idea is to look at the occupancy level of the inter-processor queues to compute the speed of the processing elements (speed control driven by the queues).

Thermal optimization
- Energy/power optimization is NOT thermal optimization: temperature awareness is needed (OS-MPSoC thermal studies).

Research directions
- Focus on embedded multimedia streaming and interactive applications.
- Efficient automatic code parallelization for embedded multiprocessors.
- Efficient communication and synchronization infrastructure.
- Static + dynamic task allocation for performance/energy/thermal balancing.
- EU projects: PREDATOR, REALITY.
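A minimal sketch of the queue-occupancy feedback idea described above; the set_frequency() hook, the operating points, the capacity and the proportional rule are all assumptions for illustration, since the slides do not specify the middleware's actual control algorithm:

  /* Feedback-controlled DVFS sketch: pick a producer stage's frequency from the
     occupancy of its output queue. All names, levels and thresholds are illustrative. */
  #include <stdio.h>

  #define QUEUE_CAPACITY 64

  /* Available operating points in MHz (assumed; platform-specific in reality). */
  static const int freq_mhz[] = { 100, 200, 300, 400 };
  #define N_FREQ ((int)(sizeof(freq_mhz) / sizeof(freq_mhz[0])))

  static void set_frequency(int mhz) {
      /* Placeholder for the platform/OS call that changes f and Vdd together. */
      printf("setting producer frequency to %d MHz\n", mhz);
  }

  /* Called by the communication library on each write_queue()/read_queue():
     a fuller output queue means the consumer is keeping up, so the producer can
     slow down; an emptying queue means the producer must speed up. */
  static void run_control_algorithm(int occupancy) {
      int fill = (occupancy * N_FREQ) / (QUEUE_CAPACITY + 1);   /* 0..N_FREQ-1 */
      int idx  = N_FREQ - 1 - fill;          /* emptier queue -> higher frequency */
      set_frequency(freq_mhz[idx]);
  }

  int main(void) {
      run_control_algorithm(5);      /* nearly empty queue -> fastest setting */
      run_control_algorithm(60);     /* nearly full queue  -> slowest setting */
      return 0;
  }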