Politecnico di Milano
Faculty of Information Engineering
Master of Science program in Computer Engineering

A novel methodology
for dynamically reconfigurable
embedded systems design

Advisor: Prof. Donatella Sciuto
Co-advisor: Ing. Marco Domenico Santambrogio

Master's thesis by
Vincenzo Rana
Student ID 674672

Academic Year 2005/2006
to you
Riassunto della tesi (Thesis summary)

Over the last few years, the embedded digital systems scenario has been considerably influenced by the development of dynamically reconfigurable architectures. These architectures introduce an additional degree of freedom into the design flow, thanks to which it becomes possible to greatly increase the flexibility of the developed systems. Today, several classes of applications could benefit, both in terms of costs and of services, from the ability to modify their hardware functionalities, which are much faster than the corresponding software implementations, even after the production phase, so as to keep pace with changing user needs, with variations in data encodings or with the evolution of communication protocols.

This is made possible by the use of reprogrammable devices such as FPGAs (Field Programmable Gate Arrays). The distinctive feature of these devices is that they can be dynamically reconfigured even while in use. Some of them also make it possible to exploit partial reconfiguration, which involves only a portion of the whole device, while the remaining part, not directly involved in the reconfiguration process, can keep performing its functionality without any interference.

One of the reconfiguration approaches described in [1] is the module-based one, characterized by the idea of dividing each reprogrammable device into a certain number of parts, each of which is called a reconfigurable slot, or simply a slot. In this scenario one or more slots can be used to configure on the device a component able to perform a specific functionality, called a module. A fundamentally important aspect is the possibility of guaranteeing the correct operation of the modules configured on slots not involved in the reconfiguration process, even while this process is running.

A second method for carrying out a reconfiguration is the difference-based one, which does not require the definition of slots and modules and therefore involves a smaller design effort. However, this approach is adequate only when the differences between one configuration and the next are small, mainly because the process it is based on is suitable only for introducing minor changes into the system.
In the design and development of dynamically reconfigurable embedded digital systems, the approach that seems to lead to the most satisfactory results is, as described in [2], modular-based design. The main idea behind this approach is to consider the system specification as composed of a set of mutually independent components (called IP-Cores, Intellectual Property-Cores), which are synthesized individually and finally assembled together to produce the desired system. The definition of these IP-Cores recalls the concept of module in the module-based reconfiguration approach, to which modular-based design is therefore strictly related.

Each IP-Core, which can thus be regarded as a module to be configured into a given set of slots, consists of two distinct parts: the core logic and the communication logic. The first component, often referred to simply as the logic, implements the primary functionality of the whole module, while the second allows the module to be plugged into a complex system and to interact with the other elements of that system, for example other IP-Cores.

The dynamic reconfiguration of these modules is therefore the primary key to giving the developed system a high level of flexibility. To simplify the management of these reconfiguration processes it is often useful to employ a software controller, which can be developed either as a stand-alone application or with the support of an Operating System.

• The first solution, which involves the implementation of a dedicated application, is mainly oriented towards the creation of a specific solution optimized for a single particular problem. However, this choice requires a huge investment in terms of both design and development effort, besides considerably increasing the time needed to carry out these phases, and is therefore advisable only in particular circumstances.

• On the contrary, the second solution can be adopted during prototyping or to enhance the flexibility of the whole system, since in this way it is possible to exploit the classical services offered by an Operating System, such as process scheduling techniques or inter-process communication mechanisms, applying them to simplify and optimize the reconfiguration management.

The aim of this thesis is the definition of a methodology able to fully describe the whole modular design process of a dynamically reconfigurable embedded digital system, and to guide its development, starting from the high-level specification of the original application. Among the key features of the proposed approach is the possibility of providing the designer with a tool that both considerably reduces the time needed to develop the system and improves, as well as simplifies, the development process itself.

To achieve this objective the BE-DRESD flow has been defined, composed of the following series of elements, each dedicated to a specific functionality.
• The input of this flow consists of the high-level specification of an application able to solve a given problem and of its hardware description, for example a VHDL (Very high speed integrated circuit Hardware Description Language) description. These initial descriptions are analyzed by DRESD-HLR (DRESD High-Level Reconfiguration) in order to create a graph and to extract information about recurrent structures, which will be used in the subsequent phases of the flow.

• Another component, called DRESD-BE (DRESD Back-End), exploits the previously obtained information to generate the hardware architecture on which the system will be based, together with the hardware modules, which can be statically inserted into the fixed part of the architecture or dynamically configured in the final system.

• The creation of the software part of the developed system is instead handled by DRESD-SW (DRESD SoftWare), through which it is possible to obtain both the stand-alone version and the one based on an Operating System able to manage the reconfiguration processes.

• In addition to these components, the BE-DRESD flow also contains DRESD-VAL (DRESD Validation), which consists of two applications, SyCERS and BAnMaT (Bitstream Analyzer and Manipulator Tool), and is used to validate the results of the previous phases and to obtain information able to guide the refinement cycle of the developed solution.

• A further component of the main flow is DRESD-DB (DRESD DataBase), a sort of database that provides all the other elements of the flow with the information about the device for which the system has to be designed.

• The last phase is carried out by DRESD-TM (DRESD Technology Management) and consists in the generation of the final solution, composed of the hardware part, produced by DRESD-BE, of the software part, produced by DRESD-SW, and of the information needed for their physical placement on the hardware platform in use.
The innovative contribution of this thesis consists, in addition to the definition of the BE-DRESD flow for the design and development of dynamically reconfigurable systems, in the integration of the pre-existing elements contained in this flow with a new series of infrastructures able to fill the gaps found in the current state of the art.

In particular, the phase carried out by DRESD-BE has been completed with the introduction of IPGen (IP-Core Generator), an application able to use the information extracted by DRESD-HLR, namely the logic needed to implement the extracted functionalities, to automatically generate the IP-Cores, that is, the modules containing both the core logic and the communication logic towards the Wishbone Bus. The modules obtained through IPGen can therefore be used directly both as static components of the fixed part of the architecture and as dynamically configurable modules within the developed system.

The second innovative contribution of this thesis is the creation of DRESD-SW, which consists both in the extension of the Linux Operating System with a support for dynamic reconfigurability and in the development of a centralized reconfiguration manager for that Operating System.
• The reconfiguration support consists of the kernel module that manages the Reconfiguration Controller, the hardware controller that performs the physical reconfiguration of the reprogrammable devices; of the kernel module for the interaction with the MAC (Media Access Control), which is responsible for the address space on the Wishbone Bus; of the kernel module called LOL (Load On Linux), which handles the dynamic addition and removal of components within the system, storing their main information; and of the Reconfiguration Library, whose purpose is to simplify the use of the kernel modules just presented (a minimal kernel-module sketch is given after this list).

• The centralized reconfiguration manager is the ROTFL Daemon (Reconfiguration Of The FPGA under Linux), which handles the requests for the addition or removal of a module coming, through a socket-based communication, from the ROTFL Library, the library that every application has to include in order to perform a dynamic reconfiguration. These requests are handled by the three elements that compose the ROTFL Daemon: the ROTFL Module Manager, which implements a sort of cache of the configured modules; the ROTFL Allocation Manager, whose purpose is to search for the set of slots to be used to configure the requested module; and the ROTFL Positioning Manager, whose goal is to select the correct bitstream able to configure the requested module in the position chosen by the ROTFL Allocation Manager.
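As a point of reference for the kernel-side support just described, the following is a minimal sketch of the general shape a Linux kernel module takes; the module name, log messages and comments are illustrative placeholders, not the actual LOL or Reconfiguration Controller sources.

    /* Minimal Linux kernel module skeleton of the kind the reconfiguration
     * support (Reconfiguration Controller, MAC, LOL) is built on.
     * Name and messages are placeholders, not the thesis code. */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>

    static int __init reconf_support_init(void)
    {
        printk(KERN_INFO "reconf_support: module loaded\n");
        /* A real module would register a character device here, so that
         * user-space libraries can issue reconfiguration requests. */
        return 0;
    }

    static void __exit reconf_support_exit(void)
    {
        printk(KERN_INFO "reconf_support: module unloaded\n");
    }

    module_init(reconf_support_init);
    module_exit(reconf_support_exit);

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Skeleton of a reconfiguration-support kernel module");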
Thanks to the development of these new components of the BE-DRESD flow and to the proposed methodology, it becomes possible to automatically generate the IP-Cores, starting from their core logic, to include in the final solution an Operating System able to handle dynamic reconfigurability and, finally, to exploit a set of drivers through which a simple but powerful communication channel with the dynamically configured modules can be established.

The thesis is organized into six chapters. The first one, Chapter 1, introduces the embedded digital systems scenario, with particular reference to the possibility of extending such systems through the use of dynamic reconfigurability.

Chapter 2 presents a study of the state of the art in the field of reconfigurable embedded digital systems. The first part of the chapter focuses on the main configurable and reconfigurable platforms found in the literature. The analysis of these platforms, each of which is manually developed and specifically optimized for the solution of a particular problem, makes evident the lack of a flow able to abstract and automate the development process so as to fully exploit the potential of such systems. The second part of the chapter describes the most representative development methodologies, which try to remedy this lack without fully succeeding, since they are limited to a partial view of the flow or confined to a level of abstraction too high to automate a real development process. The last part of the chapter presents the most important reconfiguration supports for Operating Systems, in which the absence of a DMA (Direct Memory Access) service and the lack of a centralized reconfiguration manager have been observed.

The aspects that emerged from the analyses carried out in the previous chapter lead, in Chapter 3, to the presentation of the methodology adopted to define the proposed design flow. In particular, the first part of the chapter emphasizes the automatic generation of the IP-Cores, given their basic functionality, while the second part focuses on the aspects related to the reconfiguration support in Operating Systems, such as the management of module reconfiguration, of the automatic loading of the drivers needed by the configured modules, and of their communication with the whole system.

The purpose of Chapter 4 is to describe in detail the physical implementation of the methodology proposed in the previous chapter, placing considerable emphasis both on the integration of the design flow with the automatic generation of the modules, which can be used as fixed or reconfigurable components of the final system, and on the development of a software architecture based on the Linux Operating System and composed of a series of kernel modules, libraries and a centralized reconfiguration manager, able to exploit the dynamic reconfiguration mechanisms in embedded systems.

Chapter 5 introduces a large collection of experimental results, making the validation of the proposed methodology possible. The first part of the chapter is devoted to the implementation of the tool for the automatic generation of the IP-Cores, while the second part presents the hardware platform used for the development of the software architecture and a wide range of experimental results concerning the software architecture itself.

Finally, Chapter 6 draws the final conclusions about the proposed methodology and its implementation, highlighting some possible extensions and future work that can broaden and improve the approach described in this thesis.
Contents
Riassunto della tesi (Thesis summary)

1 Introduction

2 State of the art
   2.1 Configurable systems
        2.1.1 Reconfigurable Pipelined Datapath
        2.1.2 Configurable Pipelined State Machine
        2.1.3 Configurable Architecture for High-Speed Communication Systems
        2.1.4 Configurable FPGA-Based Hardware Architecture for Adaptive Processing of Noisy Signals for Target Detection based on Constant False Alarm Rate (CFAR) Algorithms
        2.1.5 Configurable, High-Throughput LDPC Decoder Architecture for Irregular Codes
        2.1.6 Common features and limits of configurable systems
   2.2 Reconfigurable systems
        2.2.1 PipeRench
        2.2.2 MorphoSys
        2.2.3 Splash
        2.2.4 Garp
        2.2.5 Raw Architecture Workstation
        2.2.6 Common features and limits of reconfigurable systems
   2.3 Development methodologies
        2.3.1 RECONF2
        2.3.2 ADRIATIC
        2.3.3 Common features and limits of development methodologies
   2.4 Software reconfiguration supports
        2.4.1 Embedded Linux as a platform for dynamically self-reconfiguring systems-on-chip
        2.4.2 Caronte
        2.4.3 BORPH
        2.4.4 Common features and limits of software reconfiguration supports
   2.5 Concluding remarks

3 Proposed methodology
   3.1 BE-DRESD flow
   3.2 DRESD-BE
        3.2.1 Cores handling
        3.2.2 Automatic IP-Core generation
   3.3 DRESD-SW
        3.3.1 Reconfiguration layer
        3.3.2 Dynamic reconfiguration management
        3.3.3 IP-Cores devices access
              3.3.3.1 Dynamic device drivers loading and unloading
              3.3.3.2 IP-Core user-side drivers
   3.4 Concluding remarks

4 Design flow software development
   4.1 IPGen
   4.2 Software architecture
        4.2.1 Underlying platform
        4.2.2 Linux kernel modules infrastructure
              4.2.2.1 The Reconfiguration Controller kernel module
              4.2.2.2 The MAC kernel module
              4.2.2.3 The LOL kernel module
              4.2.2.4 The Reconfiguration Library
        4.2.3 The ROTFL architecture
              4.2.3.1 The ROTFL Library
              4.2.3.2 The ROTFL Daemon
              4.2.3.3 The ROTFL Module Manager
              4.2.3.4 The ROTFL Allocation Manager
              4.2.3.5 The ROTFL Positioning Manager
              4.2.3.6 The ROTFL Repository
   4.3 Concluding remarks

5 Experimental results
   5.1 IPGen
   5.2 Software architecture
        5.2.1 RAPTOR2000 board
        5.2.2 ROTFL Allocation Manager
        5.2.3 ROTFL architecture
   5.3 Concluding remarks

6 Conclusions and future work

Bibliography
List of Tables
2.1 Configurable systems features
2.2 Reconfigurable systems features
5.1 IPGen tests
5.2 Temporal performance
5.3 Final results
5.4 Comparison with the exhaustive algorithm
5.5 Hardware reconfiguration latency
5.6 ROTFL performance
List of Figures
2.1 PipeRench reconfigurable pipeline
2.2 MorphoSys reconfigurable processor architecture
2.3 Garp reconfigurable processor architecture
2.4 RECONF2 design flow
2.5 ADRIATIC design flow
2.6 IP-Core Manager
3.1 BE-DRESD flow
3.2 YaRA Modular Architecture Creation
3.3 IP-Core schematic
3.4 DRESD-SW design flow
3.5 Drivers hierarchy
4.1 Reading process diagram
4.2 Writing process diagram
4.3 Multi-FPGA scenarios
4.4 Linux kernel modules infrastructure
4.5 Reconfiguration Controller registers
4.6 Command Register
4.7 Software Architecture schematic
4.8 Architectural layers
4.9 Socket communication
4.10 Genetic algorithm chromosome
4.11 Fitness evaluation examples
5.1 Multi-FPGAs system on RAPTOR2000
5.2 Module cached scenario
5.3 Reconfiguration latencies
Chapter 1
Introduction
In recent years, the embedded systems design scenario has been significantly affected by dynamically reconfigurable architectures. By exploiting the potential of these architectures, it is possible to introduce a new degree of freedom into the design workflow, which increases the flexibility of the developed systems. Several different classes of applications, in fact, would benefit from the ability to change their functionalities after the system has been produced.
This is possible thanks to the employment of reprogrammable devices, such as FPGAs (Field Programmable Gate Arrays), which are characterized by the ability to be partially reconfigured at run-time, while the rest of the device, not involved in the reconfiguration process, keeps working.
From a general point of view, as described in [1], partial reconfiguration can
be performed by using the following approaches.
• The first one is the module-based approach, which is characterized by the division of reprogrammable devices into a certain number of portions, each of which is called a reconfigurable slot. In this scenario it is possible to reconfigure one or more reconfigurable slots with a hardware component able to perform a specific functionality, called a module. Obviously, the modules contained in slots that are not involved in the reconfiguration task must not be stopped during the reconfiguration process.
• The second approach is the difference-based one, which does not require the definition of slots and modules, but which is suitable only when the differences between two configurations are very small, since the process on which it is based supports only small changes in the design.
The most general design approach for dynamically reconfigurable embedded systems, as described in [2], is modular-based design. This approach is strongly connected to the module-based reconfiguration approach and is based on the idea of considering the system specification as composed of a set of several independent modules (called IP-Cores, Intellectual Property-Cores) that can be individually synthesized and finally assembled to produce the desired system.
Each one of these IP-Cores consists of two parts:
• the core logic, often called just core for short, that implements the module
functionality, and
• the communication logic, which allows the component to be plugged into a system and to interact with the rest of the system, for example with other IP-Cores.
To handle the dynamic reconfiguration of these modules it is often useful to use a software controller. This controller can be developed as a stand-alone software application or with the support of an Operating System. The first choice is oriented towards the creation of a specific solution optimized for a particular problem. However, this solution requires a large investment in terms of design and implementation effort, and considerably increases the time to market. On the contrary, the second choice can be followed to increase the flexibility of the whole system, since in this way it is possible to exploit the classical services that an Operating System can provide, such as process scheduling techniques or inter-process communication systems, applying them to improve the reconfiguration management.
The aim of this thesis is to define a complete methodology, based on the modular design approach, that describes the whole design process and drives the creation of partially and dynamically reconfigurable embedded systems, starting from the high-level specification of the original application. The proposed approach provides the designer with a methodology able to strongly reduce the time to market of the final implementation of the system and to simplify the development process, since all the design phases have been automated. To achieve this objective, a tool that automatically generates IP-Cores, starting from their core logic, has been implemented, and the Linux Operating System has been extended with both a reconfiguration support and a centralized reconfiguration manager able to handle dynamic reconfigurability. In addition, a collection of drivers has been developed to realize a simple and powerful communication channel with the configured modules.
The thesis is composed of six chapters. Chapter 2 presents a review of the state of the art in the reconfigurable embedded systems area. The analysis of existing approaches starts by presenting configurable and reconfigurable hardware platforms and ends with the description of the most representative development methodologies and Operating System reconfiguration supports, with particular attention to their common features and limits.
Chapter 3 introduces the methodology adopted to define the proposed design workflow for dynamically reconfigurable embedded systems. In particular, this chapter focuses on the automatic IP-Core generation, once the core functionality is provided, and on the Operating System reconfiguration support aspects, such as module reconfiguration and communication handling.
The aim of Chapter 4 is to detail the actual implementation of the proposed methodology, emphasizing the integration of the design flow with the automatic generation of reconfigurable and fixed hardware modules, and the development of a software architecture, based on the standard Linux Operating System, that allows the exploitation of reconfiguration on embedded systems.
Finally, Chapter 5 introduces a large collection of experimental results of the presented implementation in order to validate the proposed methodology, while Chapter 6 draws the final conclusions, outlining some possible extensions and future work based on the approach described in this thesis.
Chapter 2
State of the art
This chapter describes the state of the art of both configurable and reconfigurable systems, of their development methodologies and of software reconfiguration supports.
Section 2.1 presents the more general configurable systems. These kinds of systems are characterized by their flexibility, which can be obtained in different ways: it is possible to employ either hybrid systems, consisting of static hardware and programmable logic, or programmable devices. Each solution brings different benefits and disadvantages, which will be analyzed at the end of the section.
A more restricted set of the previously introduced configurable systems is described in Section 2.2. Reconfigurable systems employ a dynamic approach to configuration, adding another degree of freedom to the flexibility of configurable systems. In this way, in fact, it is possible to modify the system components at run-time, by changing the cores of an architecture while some other cores are still running.
In Section 2.3 two development methodologies are presented. The aim of these methodologies is to guide the design of a configurable or reconfigurable system; they can be applied to simplify the development or to improve the performance of the previously described platforms. To achieve these objectives, each methodology introduces both general architecture structures and flows that have to be followed during the design of this kind of systems.
Section 2.4 describes a set of software solutions that support reconfiguration tasks. The most suitable way to achieve this objective is to extend an OS (Operating System) with a reconfiguration support. An important aspect of this support is the development of either a manager or a set of tools that can simplify and improve the management of the reconfiguration processes, implementing useful services such as allocation policies or device handling.
Finally, Section 2.5 summarizes the most important aspects that characterize the described systems, methodologies and software supports.
2.1 Configurable systems
In recent years many attempts have been made to fill the gap between GPPs (General-Purpose Processors) and ASICs (Application-Specific Integrated Circuits).
General-purpose microprocessors [3] are digital electronic components, built with transistors on a single semiconductor integrated circuit, that interpret instructions and process the data contained in a program. This kind of component gives a system very good flexibility, because it is possible to write several different applications that solve different problems and run on the same microprocessor. In this way the same component can be used to achieve various objectives, but at the cost of a remarkable delay and increased power consumption, since a general-purpose microprocessor is slower and more power-hungry than its full-custom counterpart.
ASICs are integrated circuits customized for a particular use, in order to achieve very small chips and good performance, matching the computation exactly (high throughput and low latency). Unfortunately, this kind of component requires conspicuous non-recurring engineering costs (the cost of setting up a factory to produce a particular ASIC) and a long time to market (long design cycle), and its flexibility, if any, is very low.
Between these opposite alternatives it is possible to find a compromise, by
using DSPs (Digital Signal Processors), FPGAs (Field-Programmable Gate Arrays) or hybrid systems containing a mix of the previous solutions.
DSPs are special-purpose microprocessors designed specifically for digital signal processing, generally in real-time. These devices are either not programmable or have limited programming facilities, but they are cheaper and more specialized than general-purpose microprocessors, in order to achieve better performance on a certain class of problems.
FPGAs are semiconductor devices containing logic blocks, which can be configured to compute arbitrary functions, and configurable wiring, which can be
used to connect the logic blocks as well as registers together into arbitrary circuits. Traditional FPGAs are very generic, but some of the higher-end FPGAs,
such as the Xilinx Virtex 4 and Virtex 5 families, offer multiple subfamilies, each
optimized for a different market area. The optimizations are achieved by crafting different mixes of memory, logic, multiplier-accumulator (MAC) blocks and
high-speed I/O. Using this kind of component it is possible to obtain good flexibility in the system (thanks to the ability to re-program the device), to decrease the time to market and to reduce the non-recurring engineering costs, even if FPGAs are generally slower than their ASIC counterparts and draw more power.
In the next subsections a few examples of configurable systems will be described, showing some of the ways in which a configurable architecture can be developed to find a trade-off between the flexibility of a general-purpose processor and the performance of an ASIC. In the RaPiD (Reconfigurable Pipelined Datapath) and Configurable Pipelined State Machine approaches, respectively described in Section 2.1.1 and in Section 2.1.2, an attempt was made to build coarse-grained adaptable ASICs or hybrid ASIC/FPGA architectures by introducing some programmable elements to interconnect hardware logic. The remaining approaches, described in Sections 2.1.3, 2.1.4 and 2.1.5, employ FPGAs to develop FPGA-based architectures or as coprocessors, to give a good degree of flexibility to the whole system.
In the last subsection, 2.1.6, all the presented approaches will be analyzed to find common features and limits; this analysis can be useful to find the right way to simplify the development task, to maximize the flexibility of FPGA-based systems and to improve the time to market, reducing the time required by the development, interfacing and integration phases.
2.1.1 Reconfigurable Pipelined Datapath
RaPiD [4] is a research project developed at the Department of Computer Science and Engineering of the University of Washington, focused on defining coarse-grained adaptable architectures that address the performance/power/price constraints posed by mobile/embedded systems platforms for a wide range of highly repetitive and computationally-intensive applications in the signal and image processing domain.
This is accomplished by mapping the computation into a deep pipeline using a configurable array of coarse-grained computational units. RaPiD provides
a large number of ALUs, multipliers, registers and memory modules that can
be configured into the appropriate pipelined datapath; this datapath is a linear
array of functional units communicating in mostly nearest-neighbor fashion.
Mapping applications to RaPiD involves designing the underlying datapath and providing the dynamic control required for the different parts of the computation. The control design can be hard because control signals are generated at different times and travel at different rates. To simplify this task it is possible to use Rapid-C [5], an ad-hoc programming language for developing RaPiD systems; but, even if this language provides a nice abstraction of the architecture, the programmer is still responsible for all the scheduling of data and operations in the datapath.
2.1.2 Configurable Pipelined State Machine
The Configurable Pipelined State Machine [6], developed at the Institute of Microelectronic Systems of Darmstadt University of Technology in Germany, is an FSM (Finite State Machine) where all the units relevant for control and transition logic are configurable, while the basic structural components, like state registers, are built of fixed logic. This architecture is the result of a combined approach, and it is faster and smaller than an FPGA implementation while providing full programmability. Since in this specific case the underlying pipeline structure is the same for all possible applications, it is possible to limit configurability to the logic producing the control signals and to the state transition logic, while the basic architectural structure can remain fixed hardware.

The result of this approach is a hybrid ASIC that implements in hardware the basic architectural structure of a pipelined state machine while allowing control and state transition logic to be configured; however, this solution is built ad-hoc to solve this specific class of problems, and thus it is impossible to apply the same structure to a generalized set of scenarios.
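To make the split between fixed structure and configurable logic concrete, the following minimal sketch is a software analogy, not code from [6]: the state-update machinery plays the role of the fixed state registers, while the transition table, which can be replaced at run-time, plays the role of the configurable transition logic.

    /* Software analogy of a configurable state machine: the update machinery
     * is fixed, while the transition table is "configured" at run-time.
     * States, inputs and table contents are illustrative assumptions. */
    #include <stdio.h>

    #define NUM_STATES 3
    #define NUM_INPUTS 2

    /* "Configurable" part: the next-state table, indexed by [state][input]. */
    typedef int transition_table[NUM_STATES][NUM_INPUTS];

    /* Fixed part: the state-register update, identical for every application. */
    static int step(transition_table table, int state, int input)
    {
        return table[state][input];
    }

    int main(void)
    {
        /* One possible "configuration": a modulo-3 counter driven by input 1. */
        transition_table counter = {
            {0, 1},
            {1, 2},
            {2, 0},
        };
        int inputs[] = {1, 1, 0, 1, 1, 1};
        int state = 0;

        for (int i = 0; i < 6; i++) {
            state = step(counter, state, inputs[i]);
            printf("input %d -> state %d\n", inputs[i], state);
        }
        return 0;
    }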
2.1.3 Configurable Architecture for High-Speed Communication Systems
The Configurable Architecture for High-Speed Communication Systems [7], developed at the Center for Wireless Telecommunication of Virginia Polytechnic Institute and State University in Virginia, is a prototype of a rapidly deployable last-mile wireless high-speed communications system to support emergency management. Given the high bandwidth required and the amount of data that needs to be transported, a hybrid architecture was used, with processing elements implemented partially as software running on a microprocessor and partially as FPGA hardware logic blocks.

The hybrid architecture is a combination of a specialized processor (Motorola PowerQuicc II 8255) for packet-level operations and a programmable logic device (Xilinx Virtex XCV600) for bit-level operations, with a dual-port memory that allows the processor and the FPGA to read and write data simultaneously and is highly fit for the specific application.
2.1.4 Configurable FPGA-Based Hardware Architecture for Adaptive Processing of Noisy Signals for Target Detection based on Constant False Alarm Rate (CFAR) Algorithms
The Configurable FPGA-Based Hardware Architecture for Adaptive Processing of Noisy Signals for Target Detection based on Constant False Alarm Rate (CFAR) Algorithms [8] has been designed at the National Institute for Astrophysics, Optics and Electronics, in Mexico, specifically to be configured for the Cell-Average version (CA-CFAR) of the CFAR algorithm and for two variations of it: the Max and the Min CFAR. However, there are other versions of the CFAR algorithm, such as the Order Statistics CFAR, that have not been taken into account.
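To fix ideas, the cell-averaging scheme at the heart of CA-CFAR can be sketched in a few lines of software: the cell under test is declared a target when it exceeds a threshold obtained by averaging the surrounding reference cells (skipping a few guard cells) and scaling the result. This is the generic textbook formulation, with assumed window sizes and scale factor, not the hardware design of [8].

    /* Minimal software sketch of CA-CFAR detection: average the reference
     * cells around the cell under test (skipping guard cells), scale the
     * estimate, and compare. Window sizes and scale are assumptions. */
    #include <stdio.h>

    #define REF_CELLS   8    /* reference cells per side */
    #define GUARD_CELLS 2    /* guard cells per side     */

    /* Returns 1 if the cell at index `cut` exceeds the adaptive threshold. */
    static int ca_cfar(const double *signal, int n, int cut, double scale)
    {
        double sum = 0.0;
        int count = 0;

        for (int off = GUARD_CELLS + 1; off <= GUARD_CELLS + REF_CELLS; off++) {
            if (cut - off >= 0) { sum += signal[cut - off]; count++; }
            if (cut + off < n)  { sum += signal[cut + off]; count++; }
        }
        if (count == 0)
            return 0;

        double threshold = scale * (sum / count);  /* noise-level estimate */
        return signal[cut] > threshold;
    }

    int main(void)
    {
        double signal[32];
        for (int i = 0; i < 32; i++) signal[i] = 1.0;  /* flat noise floor */
        signal[16] = 8.0;                              /* injected target  */

        printf("cell 16: %s\n", ca_cfar(signal, 32, 16, 4.0) ? "target" : "noise");
        printf("cell  4: %s\n", ca_cfar(signal, 32,  4, 4.0) ? "target" : "noise");
        return 0;
    }

The Max and Min variants mentioned above differ only in combining the averages of the two half-windows with a maximum or a minimum instead of a single mean.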
This architecture has been implemented on an FPGA device, providing good performance; in fact, it is 18 times faster than the required theoretical processing time, about 10 times faster than the software implementation on a personal computer with a Pentium IV processor running at 2.4 GHz and 512 Mbytes of main memory, and 3 times faster than the solution using a TMS320C6203 DSP device from Texas Instruments.
Even if this architecture efficiently implements a class of related CFAR algorithms for adaptive signal processing and target detection (the CA-CFAR, the MAX-CFAR and the MIN-CFAR algorithms), and it can be extended to more complex CFAR algorithms such as the Order Statistics ones, since it exploits the parallel nature of CFAR signal processing, no attempt has been made to find a generalized structure of the architecture or a common development flow able to solve a wide set of problems, covering a whole class of related applications.
2.1.5 Configurable, High-Throughput LDPC Decoder Architecture for Irregular Codes
The Configurable, High-Throughput LDPC Decoder Architecture for Irregular Codes [9] is suitable for scenarios in which a very high data rate must be ensured over a noisy channel. In these contexts, to provide a reliable communication infrastructure and to guarantee low power consumption, error correcting codes can be used to eliminate or reduce the need for retransmission; an example is the Low Density Parity Check (LDPC) codes, which can assure very good performance on noisy channels and are a good candidate for the next generation of wireless devices.
To create a flexible architecture able to support different block lengths and code rates, a Virtex4-xc4vfx60 FPGA has been used to implement the whole architecture. The clock frequency of the generated design is 160 MHz, against the 412 MHz of an ASIC solution, and its latency is between 5 and 11 microseconds, while the ASIC solution latency is between 2.2 and 4.5 microseconds. The ASIC system is thus slightly faster than the FPGA-based system, but it only supports one code, so it is necessary to develop a different ASIC for each different block length or code rate, requiring a considerable investment and increasing the time to market.
Differently from the approach presented in Section 2.1.4, this approach exploits an interesting advantage of the FPGA-based solution, introducing the re-use of the hardware to develop different versions of the system, even if this is still far from achieving a structure that can be used in a generalized flow and applied to a large group of scenarios.
2.1.6 Common features and limits of configurable systems
All the presented scenarios share the lack of a generalized flow and of a complete methodology that allows the development of a configurable system to be abstracted and automated. Without such a flow it is impossible to fully exploit the true potential of this kind of systems, since there is no way to automatically reach a low-level implementation starting from a high-level specification of the application; in fact, all the proposed approaches are characterized by ad-hoc solutions, manually developed in a different way for each case.
With a generalized and automatic flow, instead, in addition to simplifying the development task, it would also be possible to maximize the flexibility of the system by developing various implementations, to improve the exploration of the solution space or to solve the same problem in different ways. This task does not require too much effort, since it can be enough to modify some parts of the high-level specification to generate a different low-level implementation, more suitable for a different scenario. Furthermore, it is possible to improve the time to market, since this kind of flow reduces the time required by the development, interfacing and integration phases.

Table 2.1 shows the platform or platforms on which each approach has been developed and the main features that characterize each solution.

Table 2.1: Configurable systems features

    Platform   ASIC   FPGA   DSP   GPP   Flexibility   Partial gen. flow   Complete gen. flow
    2.1.1       X      -      -     -         X                -                    -
    2.1.2       X      -      -     -         X                -                    -
    2.1.3       -      X      X     -         X                -                    -
    2.1.4       -      X      -     -         X                X                    -
    2.1.5       -      X      -     -         X                X                    -
The first two approaches, 2.1.1 and 2.1.2, are basically developed with ASIC technologies and provide a low level of flexibility, while no generalization has been introduced: in the first one the programmer is still responsible for all the scheduling of data and operations in the datapath, while the second one is voluntarily built ad-hoc to solve its specific problem.

Even if the third solution, 2.1.3, is a hybrid system that uses both a DSP and an FPGA to provide more flexibility, it is still developed for a single application, without taking into account the possibility of extending the same solution to other similar problems.
On the contrary, in the last two approaches, 2.1.4 and 2.1.5, both developed on an FPGA-based architecture, there is an attempt to generalize the solution. The 2.1.4 system, in fact, efficiently implements a class of related CFAR algorithms for adaptive signal processing and target detection: the CA-CFAR, the MAX-CFAR and the MIN-CFAR algorithms. The 2.1.5 system, instead, is able to support different block lengths and code rates of the same LDPC code. Anyway, in none of the approaches is there any trace of an effort to create a really generalized flow that can automate or improve all the development phases, or that can bring a successful system to solve a different problem.
2.2 Reconfigurable systems
Reconfigurable systems add another degree of freedom to the flexibility of configurable systems, since they make it possible to modify the system components
at run-time. In this way it is possible to change the cores of an architecture while
some other cores are still running.
In the next subsections some representative reconfigurable architectures will
be presented.
The PipeRench architecture [10], described in Section 2.2.1, introduces the concept of hardware virtualization, making it possible to execute a design of any size on a compatible device of any capacity.
The MorphoSys architecture [11], presented in Section 2.2.2, is a reconfigurable computer architecture targeted at computationally intensive applications; it consists of a TinyRISC processor, a programmable processing unit, and an RC-Array, the reconfigurable hardware unit.
The Splash processor [12], described in Section 2.2.3, is a special-purpose parallel processor that is able to exploit temporal parallelism (pipelining) or data
parallelism (single instruction multiple data stream) present in the applications.
In this processor the computing elements are programmable FPGA devices.
The Garp architecture [13], outlined in Section 2.2.4, is the integration of a reconfigurable computing unit with an ordinary processor on a single chip. Programming for the Garp system is an automatic task, since the Garp compiler is
able to automatically extract loops from ANSI C programs.
The RAW architecture [15], introduced in Section 2.2.5, is a simple, wire-efficient multicore architecture, in which it is possible to increase performance by exploiting fine-grained parallelism.
Finally, in Section 2.2.6, common features and limits of reconfigurable systems will be described, to show the main characteristics and trends of the current scenario of this kind of systems.
2.2.1 PipeRench
The PipeRench project [10] allows a hardware design of any size to execute on a compatible device of any capacity, by virtualizing hardware. The PipeRench system provides both the extremely fast reconfiguration necessary for hardware virtualization and the compilation tools for this architecture. In this way it is possible to solve both of the problems that inhibit the deployment of applications based on run-time reconfiguration: the first is that design methodologies for partially reconfigurable applications are completely ad-hoc, while the second is the lack, in existing FPGAs, of reconfiguration mechanisms that adequately support local run-time reconfiguration.
This solution is suitable for scenarios in which the available resources are not enough for the computation, so the reconfigurable pipeline can be exploited to virtualize pipeline stages. This technique implies that at every clock cycle a new stage is configured, in a way that makes it possible to execute the computation even if the whole pipeline is never configured at the same time. Figure 2.1 shows a Virtual Pipestage and a Physical Pipestage of an example in which the application consists of five stages and the physical pipeline consists of only 3 stages. In this example each stage is configured in one cycle and then executed for the next two cycles, so the effective throughput is two computed results every five clock cycles. More generally, the throughput of a virtualized application with v virtual stages on a system with p physical stages is (p − 1)/v.
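As a quick check, instantiating the formula on the example above gives

    \[
      T \;=\; \frac{p-1}{v} \;=\; \frac{3-1}{5} \;=\; \frac{2}{5},
    \]

i.e. two results every five clock cycles, matching the one-cycle-configure, two-cycle-execute schedule described for Figure 2.1.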
Figure 2.1: PipeRench reconfigurable pipeline

However, the reconfigurable pipeline structure introduces some relevant constraints that limit the freedom of the design. For example, the state of a stage can only depend on the previous stages, so in this kind of system only connections between consecutive stages are allowed.
2.2.2 MorphoSys
MorphoSys [11] is a reconfigurable computer architecture targeted at computationally intensive applications. Figure 2.2 shows the MorphoSys system, which consists of the following components:
• TinyRISC, a general-purpose 32-bit RISC processor
• RC-Array (Reconfigurable Cells Array), the reconfigurable hardware unit
• framebuffer, the embedded data memory of the reconfigurable processor
• DMA (Direct Memory Access), used to transfer data from external memory
• context memory, which stores 32-bit instruction words for the RC-Array.

Figure 2.2: MorphoSys reconfigurable processor architecture
The execution model of the MorphoSys processor is based on the partitioning of applications into sequential and data-parallel tasks; the former are executed by the programmable processing unit, the TinyRISC processor, while the latter are mapped onto the reconfigurable hardware unit, the RC-Array. This is composed of a two-dimensional array of reconfigurable cells (RCs), whose configurations are stored in the context memory. During execution, configuration data are fetched from the context memory, while the computational data for the RC-Array are loaded into the framebuffer from the external memory. Data transfers between the MorphoSys elements and the external memory are managed by the DMA and requested by the TinyRISC processor. After data loading, the RC-Array is enabled by the TinyRISC with a specific command; however, during the computation, it is possible to change the context of specific RCs by reconfiguring only the selected part of the array.
To use the MorphoSys architecture it is necessary to write both the RC-Array configuration program and the instruction program for the TinyRISC processor. The first can be realized using a specific assembler language, while the second can be obtained from a C compiler. However, the current version of the compiler is not able to manage the RC-Array, so the control instructions have to be manually inserted by the programmer. Thus, even if the low-level details of the hardware components, such as the composition of the RC-Array and of the interconnection network, are deeply described, the MorphoSys solution does not present a complete methodology to implement the whole reconfigurable system. Furthermore, it is not explained how the sequential and parallel tasks can be derived from a given application and how they are managed by the scheduler.
2.2.3 Splash
The Splash processor [12], developed at the IDA Supercomputing Research Center, is an attached special-purpose parallel processor designed to accelerate the solution of problems which exhibit at least modest amounts of temporal parallelism (pipelining) or data parallelism (single instruction, multiple data streams). In this processor the computing elements are programmable FPGA devices.
The system is composed of a normal workstation (a Sun SparcStation host),
an interface board and an array of Splash boards (from 1 up to 16). The reconfigurable elements in the Splash system all consist of Xilinx XC4010 FPGAs.
The interface board between the workstation and the array consists of an input and an output DMA channel, each controlled by an FPGA (called XL and XR), connected to the SparcStation host via the Sun SBus channel. The XL element is connected to the first board, while the XR is attached to the last one.
Splash boards consist of 16 FPGAs (X1...X16), a crossbar switch and a seventeenth FPGA (X0), which acts as a control element for the board. Within a board, each FPGA is connected to its left and right neighbours and to the crossbar switch. The boards are connected to each other in a chain, and the X0 element of each board is also connected to the interface board.
The workstation performs a wide range of operations, since it acts as a general controller for the reconfiguration of FPGA elements and crossbar switches,
sends computational data and control signals to the array and collects the results.
The Splash architecture has been designed to supply a Single-Instruction Multiple-Data (SIMD) computational model, where each board has all its processing elements configured to perform the same operations on different data in parallel, but the flexibility of the architecture allows many different computational models. As an example, pipelining can be used to perform a flow of computation on the same data by connecting different programmable elements so that the output of one FPGA is the input of the next one.
Dynamic reconfiguration in the Splash model consists in modifying two kinds of elements within a board: the crossbar switch and/or the processing elements. In the first scenario, reconfiguring the crossbar switch interconnections provides an easy way to modify the data flow in the system without the need to modify the single computing elements; in the second one, single FPGAs are reconfigured to change the kind of computation performed on the data.
Programming the Splash system is done by writing a behavioral description of the algorithm using the VHSIC Hardware Description Language (VHDL), which goes through a process of refinement and debugging using the Splash simulator. The algorithm is then manually partitioned onto the different processing elements. Thus, the Splash solution does not provide a methodology to partition an implementation of an algorithm onto the array modules; the process must be performed manually. This makes programming the Splash system quite difficult, as it requires direct and low-level knowledge of the physical implementation of the system. Also, there is no direct way to derive a configuration for the crossbar switch even when the mapping of functional units onto the FPGAs is known.
2.2.4 Garp
The focus of the Garp [13] research at the University of California, Berkeley is the integration of a reconfigurable computing unit with an ordinary RISC processor to form a single combined processor chip.

Figure 2.3 shows the chip containing the Garp reconfigurable processor architecture, in which the reconfigurable array works as a slave coprocessor of the master microprocessor. The RISC core is a single-issue MIPS-II, provided with a set of special instructions that manage the reconfigurable array by modifying its configuration. The reconfigurable array is organized as 32 rows by 23 columns of 2-bit logic blocks, and it can be used to speed up parts of the computation, for example loops. Furthermore, the reconfigurable array can perform data cache or memory accesses to the shared memory independently of the MIPS core.
Figure 2.3: Garp reconfigurable processor architecture
Programming the Garp system [14] is an automatic task; however, the Garp compiler is able to automatically extract only the compute-intensive loops of ANSI C programs for acceleration on the tightly-coupled dynamically reconfigurable coprocessor.
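As an illustration of what such a compiler works on, the following is a generic compute-intensive ANSI C loop of the kind Garp's compiler could extract and map onto the reconfigurable array, while the surrounding control code stays on the MIPS core. The FIR-filter kernel is an invented example, not code from [14].

    #define TAPS 8

    /* A generic compute-intensive ANSI C kernel (an FIR filter) of the kind
     * a loop-extracting compiler such as Garp's could map onto the array. */
    void fir_filter(const int *input, int *output, int n, const int coeff[TAPS])
    {
        int i, t, acc;

        for (i = TAPS - 1; i < n; i++) {   /* candidate loop for extraction */
            acc = 0;
            for (t = 0; t < TAPS; t++)
                acc += coeff[t] * input[i - t];
            output[i] = acc >> 8;          /* fixed-point scaling */
        }
    }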
2.2.5 Raw Architecture Workstation
The Raw Architecture Workstation (RAW) [15] [16] is a simple, wire-efficient multicore architecture. Its goal is to increase the performance of applications in which the compiler can discover and statically schedule fine-grained parallelism.

The RAW project's approach to achieving this goal is to implement a simple, highly parallel VLSI architecture and to fully expose the low-level details of the hardware architecture to the compiler, so that the software can orchestrate the execution of the application by applying techniques such as pipelining, synchronization and conflict elimination for shared resources through static scheduling and routing. RAW is composed of a set of interconnected tiles, each of which can be crossed by a signal in just one clock cycle. Each tile is composed of:
• an instruction memory
• a switch-instruction memory
• a data memory
• an ALU
• an FPU
• registers
• a programmable switch
This approach acquires the same set of features that makes ASICs popular
for specific applications. First, RAW implements fine-grained communication
between large numbers of replicated processing elements and, thereby, is able
Table 2.2: Reconfigurable systems features

System      Reconfiguration   Granularity      Application domain
PipeRench   Partial           Fine-grained     Hardware accelerator
MorphoSys   Partial           Coarse-grained   Data parallelism
Splash      Complete          Coarse-grained   Data parallelism
Garp        Complete          Fine-grained     General-purpose
RAW         Partial           Coarse-grained   Hardware accelerator
to exploit huge amounts of fine-grained parallelism in applications, when this
parallelism exists. Second, it exposes the complete details of the underlying
hardware architecture to the software system, so the compiler or the software
in general can determine, and implement, the best allocation of resources for
each application.
Since the RAW solution makes it possible to write an application in a high-level programming language and to compile it for the RAW architecture, it is well suited to the development of hardware accelerators. However, this architecture is not flexible enough to be applicable in a generalized embedded systems scenario; in this kind of system, in fact, there is the need to dynamically reconfigure the cores to allow the run-time modification of the configuration of the System on Chip (SoC). This can be done, for example, by implementing the whole system on a FPGA and exploiting its reconfiguration features.
2.2.6 Common features and limits of reconfigurable systems
Table 2.2 summarizes the features of the presented approaches. Even if each of them presents a good solution for a specific scenario, they are far from being a general solution for a wide class of problems, since each one presents some aspects that limit its applicability to different contexts.
In the PipeRench approach, to execute a design of any size on a compatible
device with any capacity, the concept of hardware virtualization has been introduced, but to adopt this solution it is mandatory to introduce some relevant
constraints. These constraints considerably limit the freedom of the design,
decreasing its degree of flexibility.
In the MorphoSys solution, even if the instruction program for the TinyRISC processor can be obtained with a C compiler, the programmer is still responsible for the insertion of the control instructions that manage the RC-Array. Thus there is no automatic way to obtain the final system, since the control instructions have to be inserted manually.
Also in the Splash solution there is a similar problem, since it does not present
a methodology to partition an implementation of an algorithm on the array
modules, but the process must be done manually. This makes programming
the Splash system quite difficult, as it requires direct and low-level knowledge
of the physical implementation of the system.
In contrast, programming for the Garp system is an automatic task,
but the true limit is that the Garp compiler is able to automatically extract only
compute-intensive loops of ANSI C programs.
The RAW solution, instead, is well suited to the development of hardware accelerators, but it is not flexible enough to be applicable in a generalized embedded
systems scenario in which run-time reconfiguration may be necessary.
2.3 Development methodologies
To develop a configurable or a reconfigurable system it is possible to build an ad-hoc solution or to follow a generalized design flow. The first choice implies a considerable investment in terms of both the time and the effort required to build a specific and optimized solution for the given problem, while the second one makes it possible to exploit the re-use of knowledge, cores and software to reach a good solution to the same problem more rapidly.
In the next subsections, 2.3.1 and 2.3.2, the RECONF2 and the ADRIATIC
development methodologies will be presented. The first methodology takes
as input a static version of an application, executes a partitioning of the given
application and then implements both the partitioned parts and the reconfiguration controller in hardware. The second one, instead, introduces a system
level implementation of the flow, even if some implementation problems are not described in detail and no solutions are given for them.
Finally, Section 2.3.3 will describe the common features of the presented approaches and the main limits that characterize each one of them, to show the
absence of a complete and generalized flow that is able to describe and to guide
the whole design flow of a reconfigurable system.
2.3.1 RECONF2
The aim of RECONF2 [17] is to allow the implementation of adaptive system architectures by developing a complete design environment that takes advantage of dynamically reconfigurable FPGAs; in particular it is targeted at real-time image processing or signal processing applications.
The RECONF2 flow builds a set of partial bitstreams representing different features and then uses this collection to partially reconfigure the FPGA when needed; the reconfiguration task can be under the control of the FPGA itself or of an external controller.
A set of tools and associated methodologies have been developed to accomplish the following tasks:
• automatic or manual partitioning of a conventional design,
• specification of the dynamic constraints,
• verification of the dynamic implementation through dynamic simulations
in all steps of the design flow,
• automatic generation of the configuration controller core for VHDL or C
implementation,
• dynamic floorplanning management and guidelines for modular back-end implementation.
Figure 2.4 shows the proposed design flow. It is possible to use as input for
this flow a conventional VHDL static description of the application or multiple
descriptions of a given VHDL entity, to enable dynamic switching between two architectures sharing the same interfaces and area on the FPGA. The steps that characterize this approach are the partitioning of the design code, the verification of the dynamic behavior and the generation of the configuration controller.
Figure 2.4: RECONF2 design flow
The main limit of the RECONF2 solution is that it is not possible to
integrate the system with both a hardware and a software part, since both the
partitioned application and the reconfiguration controller are implemented in
hardware in the final system.
2.3.2 ADRIATIC
The aim of the ADRIATIC [18] project is to define a methodology able to guide the codesign of reconfigurable SoCs, with particular attention to cores situated in the
wireless communication application domain.
Figure 2.5 shows the whole design flow. The first phase is the system specification, in which the functionality of the system can be described by using a
high-level language program, like in a standard design flow.
Figure 2.5: ADRIATIC design flow
This executable specification can be used to accomplish the following tasks:
• generation of the test-bench, which can be used in the other phases of the
design,
• partitioning of the application to specify which part of the system will
be implemented in hardware (either static or dynamically reconfigurable
hardware),
• accurate definition of the application domain and of the designer's knowledge.
To derive the final architecture from the input specification, the dynamically
reconfigurable hardware has to be identified; each dynamically reconfigurable
hardware block can be considered as a hardware block that can be scheduled
for a certain time interval.
During the partitioning phase it has to be decided, for each part of the system, whether it has to be implemented in software, in hardware or in a reconfigurable
hardware block. To help in this decision, some general guidelines have been
developed.
In the mapping phase the functionalities defined by the executable specification are modified to obtain thorough simulation results.
In conclusion, the ADRIATIC flow is a solution that can be easily applied to
the system-level of a design. In this phase, in fact, it is possible to draw benefits
from the general rules that guide the partitioning and from the mapping phase.
However, there is no detailed description of the following phases, which take
place at RTL level, thus there are some implementation problems that cannot
find a solution within the ADRIATIC flow.
2.3.3 Common features and limits of development methodologies
Both the presented flows try to remedy the lack of a generalized flow that is able to describe the design of a reconfigurable system.
RECONF2 is a solution that automates the whole flow, from the high-level
description of the application to the synthesis phase, but it is limited to the
hardware. No software part, in fact, can be included in the final architecture,
since both the partitioned parts derived from the original application and the
reconfiguration controller are always implemented in hardware.
In contrast, ADRIATIC takes into account, in addition to the static and the reconfigurable hardware, also a software part, but the flow is described solely from the system specification phase to the system-level simulation. Thus it can be applied only to the system level, and not to the lower-level implementation phases, which take place at the RTL level.
2.4 Software reconfiguration supports
A dynamically reconfigurable architecture often needs software integration to control the scheduling of the reconfiguration. This kind of task can be implemented as a stand-alone software application or with the support of an Operating System. The first choice is oriented towards creating a specific solution that is optimized for a specific problem. This solution requires a big investment in terms of design and implementation effort, and considerably increases the time to market. The second choice, instead, can be followed to increase the flexibility of the whole system. In this way, in fact, it is possible to exploit the classical services that an Operating System can provide, such as process scheduling techniques or inter-process communication systems, applying them to improve the reconfiguration management.
In the next subsections some solutions to integrate an Operating System
with a reconfiguration support will be presented.
Section 2.4.1 describes the approach developed at the University of Queensland, Australia, which aims at creating a set of tools to simplify the design and
the implementation of reconfigurable systems. Embedded Linux is the host
used to achieve this goal.
Section 2.4.2 introduces the Caronte solution, which is a natural extension of the approach presented in Section 2.4.1. This solution adds to the embedded Linux
a module that is responsible for the management of the devices dynamically
mapped on the FPGA.
Section 2.4.3 presents the BORPH approach, which consists of an extended
Linux kernel that is able to manage FPGA resources as if they were additional
CPUs of the reconfigurable computer on which it is running.
Finally, Section 2.4.4 compares all the presented solutions to identify their features and limits. Even if the described approaches all use an OS to manage the reconfiguration, there are various ways to support reconfiguration requests, and these different ways lead to different solutions.
2.4.1 Embedded Linux as a platform for dynamically self-reconfiguring systems-on-chip
The approach developed at the University of Queensland, Australia, [20], to design
and implement meaningful systems employing dynamic self reconfiguration,
or DRSs (Dynamic Reconfigurable Systems), is focussed on the creation of a
platform of tools that can simplify these tasks. To achieve this goal, embedded
Linux is proposed as a natural host for such a platform.
As part of the reconfigurable system-on-chip (RSoC) research project called
Egret [22], an embedded Linux kernel called uClinux has been successfully
ported to the Xilinx Microblaze soft-core processor [21]. The capability to support research and experimentation into dynamic and self reconfiguring systems
is one of Egret's design requirements. uClinux is a port of the Linux kernel to embedded processors lacking a memory management unit (MMU), like the Xilinx Microblaze. Nevertheless, uClinux offers an interface almost identical to standard Linux, including command shells, C library support and Unix system calls.
In addition to this, support for Xilinx FPGA self-reconfiguration has been
integrated into the Microblaze uClinux kernel, using the standard Linux device
driver model. This solution allows the exploitation of the power and the flexibility given by the Linux platform to rapidly develop a set of tools whose purpose
is to perform complex dynamic self-reconfiguration tasks.
This support is provided by an abstraction layer for the Xilinx Internal Configuration Access Port (ICAP). Xilinx developed an OPB interface to the ICAP
module, which allows frame-by-frame readback and partial configuration in ICAP-supported devices. Using this OPB interface it is possible to connect this peripheral to the Microblaze soft-core processor.
To integrate this device within the Linux kernel, the standard device driver
architecture used by all Linux devices has been adopted. To follow the Linux
philosophy, a device driver has been developed that just implements mechanism (the provided capabilities), without any reference to the policy (how those
capabilities can be used).
The result of this approach is a character-based device driver, which implements the read(), write() and ioctl() system calls:
• read: initiates a read of the specified number of bytes from the ICAP into a user memory buffer,
• write: the specified number of bytes are written to the ICAP from a user
memory buffer,
• ioctl: interface to device specific control operations, such as querying the
status, or changing operating modes.
This device, which is registered in the Linux device subsystem (/dev/icap), may be accessed using standard Linux system calls, such as open, read and write. In this way the kernel mediates between user programs, which implement policy, and the device driver, which implements mechanism.
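As a concrete illustration, a user program could stream a partial bitstream into the FPGA through this driver as follows; this is only a sketch based on the interface described above, and the bitstream file name is a placeholder.

```c
/* Minimal sketch: streaming a (partial) bitstream to the ICAP through
 * the character device described above. The bitstream file name is a
 * placeholder and error handling is abbreviated. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    FILE *bs = fopen("partial.bit", "rb");    /* placeholder bitstream */
    int icap = open("/dev/icap", O_WRONLY);   /* device node from the text */
    char buf[4096];
    size_t n;

    if (bs == NULL || icap < 0) {
        perror("open");
        return EXIT_FAILURE;
    }
    /* Each write() pushes configuration data to the ICAP. */
    while ((n = fread(buf, 1, sizeof buf, bs)) > 0) {
        if (write(icap, buf, n) != (ssize_t)n) {
            perror("write");
            return EXIT_FAILURE;
        }
    }
    close(icap);
    fclose(bs);
    return EXIT_SUCCESS;
}
```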
In addition to this, it is possible to develop a collection of small tools, each focussed on performing a single job, and to use the shell as a mechanism to chain these tools together. This is one of the underlying principles of Un*x-like operating systems, and it can make the combination of uClinux and the ICAP device driver very powerful and easy to use.
However, when an application accesses the ICAP device driver to perform
a reconfiguration, the processor is kept occupied for the whole time interval
needed to reconfigure the FPGA, since there is no possibility to exploit DMA.
Furthermore, this approach doesn't present a centralized manager that is able to manage the reconfiguration at a high level, but each reconfiguration is
performed as a single task. In this way it is not possible to exploit the benefits derived from services such as caching. Finally, each reconfiguration request needs to specify the bitstream with which to perform the reconfiguration itself, since there is no abstraction layer that makes it possible to ask for a module without knowing the name of its corresponding bitstream file.
2.4.2 Caronte
The Caronte solution [23], developed at the Politecnico di Milano, is a natural extension of the approach presented in Section 2.4.1, to which the IPCM (Intellectual Property-Core Manager) has been added.
As shown in Figure 2.6, the IPCM is responsible for the management of the
IP-Cores dynamically mapped on the FPGA.
Figure 2.6: IP-Core Manager
The main task of the IPCM is to handle the dynamic addition and removal of IP-Cores that takes place during partial reconfiguration. The cores can communicate with this module, providing information about device type and I/O
memory location, which the operating system needs in order to access the device.
The IPCM hides the differences in device types from the kernel, since all of them are interfaced using a single major number, and the IPCM itself distinguishes among them and selects the correct driver implementing the necessary calls.
From the kernel point of view, in fact, it is a standard module which registers
a major number (by default 121) among character devices that will be used to
access all the IP-Core devices.
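A minimal sketch of how such a registration might look in a kernel module is shown below; the IPCM source is not reproduced here, so the function names and the stubbed file operations are illustrative, with only the major number (121) taken from the text.

```c
/* Sketch of an IPCM-style module registering one character-device
 * major number (121, the default mentioned in the text) for all
 * IP-Core devices. The file operations are stubs: a real module
 * would dispatch to the per-family driver using the minor number. */
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>

#define IPCM_MAJOR 121

static ssize_t ipcm_read(struct file *f, char __user *buf,
                         size_t len, loff_t *off)
{
    /* look up the IP-Core family from the minor number and
     * delegate to the corresponding driver's read routine */
    return 0;
}

static const struct file_operations ipcm_fops = {
    .owner = THIS_MODULE,
    .read  = ipcm_read,
};

static int __init ipcm_init(void)
{
    return register_chrdev(IPCM_MAJOR, "ipcm", &ipcm_fops);
}

static void __exit ipcm_exit(void)
{
    unregister_chrdev(IPCM_MAJOR, "ipcm");
}

module_init(ipcm_init);
module_exit(ipcm_exit);
MODULE_LICENSE("GPL");
```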
However, this solution limits the number of IP-Cores that can be configured: with the current implementation of the IPCM it is possible to configure just 16 kinds of IP-Core, which have to be statically assigned a number from 0 to 15, and 16 IP-Cores for each kind. In addition to this, even if each IP-Core is registered automatically, it is not possible to automatically load or unload its corresponding driver, so this operation has to be performed manually.
An important advantage provided by the IPCM is an easier programming
interface for the development of IP-Core drivers, since it hides the kernel internal structures and integrates all the common operations for the devices.
The main disadvantage of this approach is the absence of a unique reconfiguration manager that is able to implement the caching, the allocation and
the positioning mechanisms. Each of these phases can be implemented manually by exploiting the information contained in the IPCM module, but there is no framework that implements them in an integrated and automatic way, to improve the system performance.
Finally, an intermediate layer abstracting the reconfiguration requests is
missing. This layer can be useful to hide from the software applications any details about the low-level implementation of the reconfiguration routines and to make it possible to change them without modifying the software application. In fact, it
can be used as an interface to decouple the high-level user applications that
request reconfigurations from the low-level kernel tasks that perform the real
reconfiguration process.
2.4.3 BORPH
BORPH [19] (Berkeley Os for ReProgrammable Hardware) is an Operating System designed at the University of California for FPGA-based reconfigurable computers. It is an extended Linux kernel that is able to handle FPGA resources as native computational resources on BEE2 (Berkeley Emulation Engine 2), which is a reconfigurable computer.
This OS, in addition to allowing a simple way to perform FPGA reconfiguration, also provides useful standard system services, such as the ability for
FPGAs to read or write to the standard Linux file system, allowing them to
communicate with the rest of the system easily and systematically.
To achieve this goal, BORPH introduces the concept of hardware process, that is, a hardware design running on a FPGA; a hardware process is a standard
user process, so it behaves just like a normal software program running on a
processor. However, it is not possible to partition a given application to derive a software and a hardware part, and there is no automatic flow that easily leads to the generation of a hardware process from a high-level specification. Thus each change in the high-level specification of the problem has to be directly translated into a manual change of the low-level hardware description.
To deploy a hardware process on the reconfigurable devices, BORPH exploits the concept of hardware regions, which are the smallest reconfigurable regions that can be managed. Even if it is possible to imagine a hardware region as a partially reconfigurable region on a single FPGA, on a BEE2 module it is implemented only as an entire user FPGA. Thus each hardware process, even a very small one, needs to be deployed on an entire dedicated FPGA.
Furthermore, the hardware configuration of a hardware process is encapsulated in the executable file, so it is hard to completely exploit hardware re-use and it is impossible to implement caching policies. In addition to this, it is not possible to choose at run-time the most suitable hardware description of the hardware process that has to be deployed, for example depending either on the FPGA availability or on the performance required by the user.
Finally, the BORPH solution doesn't make it possible to completely separate the software layer from the hardware layer: they still remain at the same level. This implies that both the hardware and the software side have to be specifically developed to work together. In other words, to write a software application that uses a hardware process it is necessary to know exactly how it behaves, since it is not possible either to use a common library for the communication or to write software controllers for the hardware processes.
2.4.4 Common features and limits of software reconfiguration supports
The presented approaches have in common the idea of extending an Operating System with a reconfiguration support. The different solutions derive from the different ways in which this support has been developed. Thus, in addition to some common problems, specific limits will also be analyzed in the following paragraphs.
In all the presented approaches the processor is involved in the configuration of the FPGA for the whole time interval needed by the reconfiguration
process. This is due to the lack of a DMA service, which makes it compulsory to give the reconfiguration module the whole bitstream instead of only its memory address.
Even if Caronte presents a sort of centralized manager, in all the solutions it is not possible to manage the reconfiguration at a high level. Each reconfiguration is performed as a single task, so it is not possible to exploit a framework that provides services such as caching, allocation and positioning. In fact, each reconfiguration request has to specify the right bitstream with which to perform the reconfiguration itself. It is not possible to request just a desired functionality and let the system decide the most suitable configuration file to use to perform the reconfiguration.
Another limit is the lack of an intermediate layer to decouple the user-side
applications from the low-level kernel tasks that perform the reconfiguration.
This can be implemented, for example, with a library and it can be useful to
abstract the reconfiguration requests, hiding from the user-side software applications all the steps that have to be followed to perform a reconfiguration or to manage a configured IP-Core.
A constraint of the Caronte solution is the limited number of IP-Cores that
can be supported by the system. The current implementation of the IPCM, in fact, makes it compulsory to statically assign a number from 0 to 15 to each of the 16 supported kinds of IP-Core. Furthermore, it is possible to instantiate only 16 IP-Cores for each
previously declared type.
Finally, a specific disadvantage of the BORPH solution is that it is based on the BEE2 platform, in which each hardware process, even a very small one, needs to
be deployed on an entire dedicated FPGA. This is due to the definition of the
smallest reconfigurable region as a complete FPGA.
2.5 Concluding remarks
The analysis of the presented configurable systems leads to the conclusion that the main problem is the lack of a generalized flow that allows the abstraction and automation of the development of a configurable system. All the described approaches are ad-hoc solutions that are manually optimized for the specific problem but require considerable time and effort to be developed. In fact, it is very hard to exploit the potential of this kind of system without a general flow that can guide the whole design.
A complete design flow could offer several advantages: from the simplification of the development phase to the improvement of the flexibility of the
system. Furthermore, it is also possible to shorten the time to market, by reducing the time required by the development, the interfacing and the integration
phases.
The discussion on reconfigurable systems also leads to similar conclusions.
Even if each of the described reconfigurable systems presents a good solution
for a specific scenario, they are far from being a general solution for a wide
class of problems. Each solution, in fact, presents some aspects that limit its
applicability to different contexts, so it is not possible to apply the same solution
structure to solve a different, even if similar, problem.
The presented development methodologies try to remedy this lack of a generalized flow that is able to abstract the design of a configurable or reconfigurable system.
RECONF2 is a solution that automates the whole flow, from the high-level
description of the application to the synthesis phase, but the real problem is
that it is limited to the hardware. Both the partitioned parts derived from the
original application and the reconfiguration controller are always implemented
in hardware. Thus, it is not possible to include a software part in the final architecture.
In contrast, ADRIATIC takes into account both a hardware and a software part. The main limit of this solution is that the flow is described solely
from the system specification phase to the system level simulation. Thus it can
be applied only to the system level, and not to the implementation phases.
Finally, Operating System reconfiguration supports have been described. A considerable disadvantage of the presented approaches is the absence of a DMA service. This forces the processor to be employed for the whole time interval needed by the reconfiguration process.
Furthermore, in all the solutions it is not possible to manage the reconfiguration at a high level. There is no framework able to provide services such as
caching, allocation and positioning.
Another considerable limit is the lack of an intermediate layer to decouple
the user-side applications from the low-level kernel tasks that perform the reconfiguration. This layer can be useful to abstract the reconfiguration requests.
In this way it is possible to hide from the user-side software applications all the kernel tasks that are necessary to perform a reconfiguration or to manage a configured IP-Core.
A specific constraint of the Caronte solution is the limited number of IP-Cores
that can be supported by the system.
In conclusion, a specific disadvantage of the BORPH solution is that on the
BEE2 each hardware process, even if it is very small, needs to be deployed on an
entire dedicated FPGA, since the FPGA is the smallest reconfigurable area that can be managed.
Chapter 3
Proposed methodology
The aim of this chapter is to introduce a flow that is able to guide the design of a configurable or reconfigurable system, starting from the high-level specification of
an application. This flow simplifies, improves and automates the development
process. Moreover, it is possible to include in the final solution an Operating
System that is able to manage the reconfiguration, to add more flexibility to the
whole system.
The analysis of previous works, presented in Chapter 2, shows that there is still no complete methodology that is able to describe the design of a reconfigurable system. The presented approaches are limited either to an
abstract description of the flow or to a reduced portion of the whole process.
In addition to this, while reconfigurable hardware platforms have already gained a wide range of uses in different scenarios, a homogeneous and centralized support for dynamic reconfiguration within a standard Operating System is still missing. As shown in Sections 2.1 and 2.2, embedded systems most of the time make use of a standalone application explicitly designed for the particular target system and with a complete knowledge of the hardware on which it
has to run.
However, the development of a standalone application makes it hard to exploit application design reuse. In fact, it requires a major effort to develop a new system, since little can be derived from a previous one. Furthermore, it
reduces the flexibility of the developed system and considerably increases the time to market.
Section 3.1 introduces the BE-DRESD flow, whose goal is to find a solution to the presented limits. This flow represents the proposed methodology
for dynamically reconfigurable systems design and consists of several components, each one performing a specific task. In particular, the main contribution
of this thesis is the integration of DRESD-BE with a tool, called IPGen, for the automatic generation of IP-Cores starting from their core descriptions, and the creation of DRESD-SW, which consists of the extension of the Linux Operating System with a reconfiguration support and the development of a centralized reconfiguration manager.
Section 3.2 describes in detail DRESD-BE, which is responsible for the creation of the hardware architecture of the final system. This architecture includes the reconfigurable and the fixed modules, which are automatically created by IPGen.
Section 3.3 presents DRESD-SW, whose main task is to analyze and modify the software part of the original application to make it suitable for interaction with the reconfiguration manager of the Linux Operating System extended with the reconfiguration support.
Finally, Section 3.4 summarizes all the main aspects of the presented flow,
giving an overall view of the proposed methodology.
3.1 BE-DRESD flow
The schematic of the flow proposed in this thesis, called BE-DRESD, is shown
in Figure 3.1. The input of this flow consists of both a high-level specification
of the application that solves a particular problem and its translation to a hardware description, such as a VHDL (Very high speed integrated circuit Hardware
Description Language) description. It is possible to write this hardware description either manually or by using development frameworks such as CoDeveloper.
CoDeveloper is a C-language development system for coarse-grained programmable hardware targets, including mixed processor and FPGA platforms.
Figure 3.1: BE-DRESD flow
CoDeveloper’s core technology is the Impulse C library and related tools that allow standard ANSI C to be used for the expression of highly parallel applications and algorithms targeting mixed hardware/software targets. In this way it
is possible to obtain synthesizable HDL from C applications, so this framework can be integrated into the BE-DRESD flow to restrict the required inputs to ANSI C programs.
The BE-DRESD flow is composed of several components, each one implementing a different stage of the flow:
• DRESD-HLR: DRESD High-Level Reconfiguration takes the input description of the application and tries to extract from it the recurrent structures
that will be used by the following stages,
• DRESD-BE: DRESD Back-End is responsible for the creation of both reconfigurable modules and configurable or reconfigurable architectures,
• DRESD-SW: DRESD SoftWare is the generator of the software part of the final solution, which consists of either a standalone software application or an Operating System with a reconfiguration support, device drivers, user-side drivers and the software part of the application,
• DRESD-VAL: DRESD Validation is composed of two tools, SyCERS and BAnMaT, and it is used to validate the output of DRESD-HLR and DRESD-BE,
• DRESD-DB: DRESD DataBase provides useful information on the target
device to the other components of BE-DRESD,
• DRESD-TM: DRESD Technology Management is the final stage of the flow
and it takes as input the output generated by DRESD-BE and DRESD-SW
to create the final solution.
The focus of this thesis is on the components highlighted in Figure 3.1, which are the back-end, DRESD-BE, and the software manager, DRESD-SW. These components, which together with DRESD-HLR represent the core functionalities of BE-DRESD, have been integrated with the other pre-existing elements in order
to achieve a complete flow. A more detailed description of these two components is presented respectively in Sections 3.2 and 3.3. The input of DRESD-BE
is the output generated by DRESD-HLR, which is validated using SyCERS.
DRESD-HLR analyzes the input description to create a graph on which it
is able to work. The obtained graph is explored to find recurrent structures
with which it is possible to cover the graph itself. This task can be driven by
the information produced by the validation phases. To achieve this goal it is
possible to use different algorithms, for example to maximize the number of
instances of the same structures present in the graph or to maximize the size
of the structure itself. In any case, when an adequate set of structures has been
found, it is extracted from the original graph.
The obtained partitioning information is validated with SyCERS (DRESD-VAL). This validation phase can be useful to obtain performance measurements that can drive the refinement cycle. The DRESD-HLR process is then repeated several times, until the validation constraints are satisfied. When the validation stage is successfully passed, the generated output is given to the DRESD-BE
and DRESD-SW components.
When the DRESD-BE and DRESD-SW processes are completed, their outputs are taken as input by DRESD-TM. The output of DRESD-SW is a set of
executables that can be either compressed together to form a ramdisk image or
directly given to the DRESD-TM.
In contrast, the output of DRESD-BE has to be validated with BAnMaT (Bitstream Analyzer and Manipulator Tool), which is the second tool of DRESD-VAL. This bitstream validation phase can impact both DRESD-BE and DRESD-HLR, since its output can guide both processes. In fact, if the validation constraints are not satisfied, it is possible to repeat these phases to try to fulfill them. When these constraints are satisfied, instead, the obtained bitstreams are finally given to DRESD-TM.
The aim of DRESD-DB is to provide the other components with a description of the target device. This device is part of the platform on which the final solution
has to be deployed. Each step of the BE-DRESD flow needs this information to
create the constraints, to improve the exploration of the feasible solutions or to
optimize the process itself. Physically DRESD-DB is a database that contains
all the necessary information about a wide range of devices. This is the set of
supported devices, but it is also possible to extend it with new descriptions, to
increase the flexibility of the BE-DRESD approach.
The last step of the proposed flow is represented by DRESD-TM. In this
phase the executables and the bitstreams are put together with the deployment information. This information is necessary to establish where each part of the solution has to be placed, since there is no fixed position in which each part of the
solution can be located. In this way it is possible to create the final solution that
implements the given application and that specifies how it has to be deployed
to solve the original problem.
3.2 DRESD-BE
DRESD-BE is the stage in which the reconfigurable architecture is developed.
The general system on which each output of this phase is based is YaRA (Yet
Another Reconfigurable Architecture) [24]. This architecture has been chosen
since it can be adopted to solve several very wide classes of problems, thus providing considerable flexibility to the flow.
YaRA consists of two parts: a fixed part, YaRA_FIX, and a reconfigurable part, YaRA_REC, which is a collection of reconfigurable IP-Cores (or modules). Each possible configuration of YaRA_FIX with a different set of reconfigurable IP-Cores gives origin to a static snapshot of the system, which is called YaRA_TOP.
It is possible to imagine, in a general view, that these static snapshots are used to create the bitstreams (a complete bitstream and a group of partial bitstreams) that will be used to set up the system (the complete bitstream) and to pass from one static snapshot to another (the partial bitstreams).
In particular, the adopted solution consists of the generation of a complete bitstream that configures the system with the Top and a set of empty modules. Then, for each module, two partial bitstreams have to be created: one used to configure it over an empty module and another one to come back to the empty module. In this way, to change from one IP-Core to a different one, it is necessary to pass from the first module to the empty module, after which it is possible to configure the desired IP-Core.
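The swap protocol can be summarized in a short sketch; both functions below are hypothetical stand-ins for loading the corresponding partial bitstreams, since the actual loading mechanism depends on the target platform.

```c
#include <stdio.h>

/* Hypothetical helper: stands for writing the named partial bitstream
 * for the given slot to the reconfiguration port. */
static void load_partial(const char *bitstream, int slot)
{
    printf("configuring slot %d with %s\n", slot, bitstream);
}

/* The swap protocol: to replace IP-Core A with IP-Core B on the same
 * slot, first return the slot to the empty module, then configure B. */
static void swap_a_with_b(int slot)
{
    load_partial("A_to_empty.bit", slot);   /* back to the empty module */
    load_partial("empty_to_B.bit", slot);   /* configure the desired core */
}

int main(void)
{
    swap_a_with_b(0);
    return 0;
}
```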
Figure 3.2 shows the YaRA Modular Architecture Creation phase, which is part
of DRESD-BE. Its inputs are provided by DRESD-DB and DRESD-HLR. These
inputs consist of both information about the processor and the reconfigurable
bus that have to be used, and information about the set of cores that have been
extracted from the original specification. Also DRESD-VAL is involved in this
flow, since it gives useful guidelines to improve the previous solution in a refinement cycle.
The first step of the flow is the creation of YaRA_TOP. This goal is achieved
starting from the generation of the System.vhd and the .ncd files with the
Figure 3.2: YaRA Modular Architecture Creation
EDK System Creator tool. The first one represents the VHDL description of
YaRA_FIX, while the others are the descriptions of the fixed components included in YaRA_FIX. Inputs of this tool, in addition to the standard inputs of the
whole flow, are the IP-Cores that have been selected to be inserted in the fixed
part of the architecture.
These modules are provided by IPGen (IP-Core Generator), which aims at creating an IP-Core for each given core. This process mainly requires the mapping of
the signals of the core to internal registers and their interface with the signals
present in the chosen communication infrastructure. IPGen concepts will be described in more detail in Section 3.2.2.
After that, the Fixed Generator tool produces YaRA_FIX and YaRA(–), which is
the first version of the complete architecture, in which there is no information
about the communication infrastructure and reconfigurable modules. Another
tool, COMIC (COMmunication Infrastructure Creator), produces the next version of the solution, YaRA(-); this version contains the communication
infrastructure, but reconfigurable modules are still missing.
The last tool, System Configuration Tool, completes the YaRA(-) description
with the reconfigurable modules collection. These reconfigurable modules are
provided by IPGen, in the same way in which it provides fixed modules to EDK
System Creator.
The output of this final step is a group of possible configurations of the system. If this group consists of just one configuration, the flow output is a codesign of the original specification, since it consists of a configurable system that doesn't need dynamic reconfiguration. Otherwise, if the group consists of more than one configuration, the final result is a dynamically reconfigurable system, and the different configurations represent the possible states or instances of the system at a particular instant.
3.2.1 Cores handling
The main task of DRESD-HLR is the generation of the recurrent structures list.
Using this list it is possible to perform the gender assignment phase, in which
each structure is assigned to the hardware, to the reconfigurable hardware or to
the software side. The following step is the creation of an architecture that includes the functionalities that have to be implemented in hardware. Since these
functionalities are extracted by DRESD-HLR from the specification, they consist
of the minimum logic that is necessary to express their purpose, thus they are
not already suitable to be used with a bus-based communication infrastructure.
On the other hand, to build a reconfigurable architecture using a general structure, it is useful to implement a bus communication. This communication allows the fixed components of the architecture to interact with a standard interface of the modules, which uses the same kind of signals and behaves in a
similar way (for example the reset signal is always interpreted in the same way)
for each different reconfigurable IP-Core. This layer of abstraction provides
more flexibility to the system and makes it possible to adopt the reconfiguration model described in Section 3.2, in which each module can be substituted
with another one.
As previously hinted, the cores extracted by DRESD-HLR are not suitable to
be used with a bus-based communication infrastructure. However, it is possible
to adapt them to the bus communication without changing their internal logic.
This is the task that IPGen is able to perform in an automatic way.
There are two main concepts on which the automatic IP-Core generation task is based:
• the need to preserve both the internal structure and the functionality of
each IP-Core,
• the possibility to flatten each VHDL description.
On one side it is compulsory to preserve the functionality of each IP-Core,
since it is part of the original specification and therefore cannot be arbitrarily modified. Furthermore, it is very desirable to preserve the internal structure as well,
since in this way it is possible to abstract the implementation of each module to
concentrate the efforts on the analysis of its interface.
On the other hand, the introduction of a hierarchy of wrappers doesn't directly imply a considerable loss of performance or waste of the resources required
to implement it in hardware. This is possible because VHDL allows the synthesis of each description in a flattened way, so the overhead introduced by the
wrappers hierarchy is very small.
These considerations lead to the conclusion that a good solution is to develop a sort of wrapper hierarchy to create an IP-Core from each given core, as described in more detail in Section 3.2.2. In addition to the adaptation
of the core to the bus communication, in a way that makes it possible to use
it with the YaRA architecture, this choice also offers an efficient, elegant and
human-readable solution to the proposed issue.
3.2.2 Automatic IP-Core generation
The aim of this phase is to build a complete IP-Core starting from its core logic. This
task can be automatically performed through three steps:
• registers mapping
• address space assignment
• signals interfacing
Figure 3.3 shows the result of these steps. The core logic is included in a
more complex component, called IP-Core, which is able to communicate with the
target bus.
Registers mapping is necessary since each core has a different signal set. These sets can differ in the total number of signals, in their size or in their
type. In this scenario the most suitable solution is to use a standard set of signals
for the communication with the rest of the system, and to use these standard
signals to manage the specific signals of each core.
To make this idea applicable, it is necessary to find a way to temporarily store a specific signal during its set-up (to avoid undesired
interferences with the core logic) and to make it available also when the standard signals are managing other specific signals. The easiest way in which this
decoupling can be done is by introducing a group of registers that correspond
to the specific signals set. Each register, then, can be assigned to a specific signal
in a direct way, while the standard signals can interact only with the registers
and not with the specific signals set.
Figure 3.3: IP-Core schematic
The second step that has to be performed is the address space assignment.
Once the standard signals are mapped on the registers set, it is necessary to
assign to each register a specific address. In this way it is possible to use the
data contained in the address signal to refer to a specific register.
This solution allows the use of a small collection of signals both to write and
to read from each specific signal of the cores. This group of signals consists of
an address signal, a data signal and a few control signals. The address signal
contains the address of the register that has to be accessed, while the data signal
either contains the data that has to be written on the selected register or is the
place where the data read from the selected register can be stored.
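From the software point of view, the effect of the address space assignment is that each specific signal of the core becomes reachable at a fixed offset from the IP-Core base address. The following C sketch illustrates the idea; the base address and the register offsets are purely illustrative.

```c
#include <stdint.h>

/* Hypothetical register map produced by the address space assignment:
 * each specific signal of the core is reachable at a fixed offset from
 * the IP-Core base address. Addresses and offsets are illustrative. */
#define IPCORE_BASE   0x80000000UL
#define REG_OPERAND_A 0x0
#define REG_OPERAND_B 0x4
#define REG_RESULT    0x8

/* The data signal is written to (or read from) the register selected
 * by the address signal; volatile prevents the compiler from caching
 * hardware accesses. */
static inline void reg_write(uintptr_t offset, uint32_t value)
{
    *(volatile uint32_t *)(IPCORE_BASE + offset) = value;
}

static inline uint32_t reg_read(uintptr_t offset)
{
    return *(volatile uint32_t *)(IPCORE_BASE + offset);
}
```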
In the last step the signals interface phase is performed. In this phase the
signals of the target bus interface have to be used to interact with the registers.
The address and the data signals are involved in the creation of routines to read from or to write to a particular register, whose address is specified in the address
signal. Also the control signals, such as the reset signal, are used to manage the
core in the correct way.
After the execution of this step, the IP-Core is ready to be bound to the target
bus and to work properly with it, since the signals of its interface are the only
set of signals that the developed IP-Core needs to perform the correct functionality. These signals, in fact, are used to set up and manage the registers, which are directly mapped on the specific signals of the contained core.
This sort of wrapper hierarchy achieves the main objective of the automatic IP-Core generation, that is, to automatically create, starting from a given core
logic, a module that is compatible with the target bus.
3.3 DRESD-SW
DRESD-SW is the component of the BE-DRESD flow that is located between DRESD-HLR and DRESD-TM. The aim of this component is to generate the software part of the final solution.
This software part can be developed as a standalone application that includes the reconfiguration controller, using a specific reconfiguration library. Otherwise, it can be designed to run on an Operating System that provides reconfiguration mechanisms.
Figure 3.4 presents the DRESD-SW design flow. As shown in this figure, its
input set consists of the following items:
• the base Operating System
• the core descriptions
• the software application
The base OS is the platform on which the final software solution will run. In
the proposed approach the Linux OS has been considered a good choice to obtain both flexibility and performance, so it has been adopted to develop the first
version of this flow. However, it is also possible to follow the same flow with a different OS, developing another specific reconfiguration support and compiling all the code using the right cross-compilation target.
Figure 3.4: DRESD-SW design flow
Another input of this component is the set of core descriptions. These descriptions are extracted from the original specification by DRESD-HLR and have to
be analyzed by DRESD-SW to obtain the corresponding collections of drivers.
There are two different kinds of drivers:
• device drivers
• user-side drivers
Device drivers, often called just drivers for short, provide the Operating System with the information on how to control and communicate with a particular
hardware device. This kind of driver implements the basic functions that the OS needs to manage different devices, such as writing to or reading from a particular register of a hardware module. These functions allow both data transfer and control register management.
On the other hand, user-side drivers abstract the device-driver layer to provide the user-side software applications with a simple and efficient way to manage devices. This makes it possible to avoid direct calls to the devices configured in the OS in the user-side applications, since these calls are
already implemented and grouped in the functions provided by the user-side
drivers.
The last input of this component is the software application from which
cores have been extracted. This application consists of both the software controller and the software parts specified during the partitioning phase. In other
words, it represents the whole original specification, excluding the parts that
have been chosen to be implemented in hardware during the partitioning
phase.
This application is analyzed and modified by replacing the portions of code
that have been selected to be implemented as hardware cores with the corresponding function calls to the previously described user-side drivers, and to the
library provided by the reconfiguration support of the selected OS. The main
purpose of this library is to provide the software applications (for example the
controller) with a fast and powerful mechanism to request or to discard a module.
The whole software part is then cross-compiled to obtain binaries that can
run on the processor of the target system. These binaries perform all the tasks
that have been selected to remain in software and control the reconfigurable or
fixed modules. The functions provided by the OS support library are used to
obtain or to release an IP-Core, while user-side drivers are used to access the
IP-Core, writing or reading data from it.
The OS reconfiguration support, whose main functions are used in the software controller, has to be included in the OS to extend it with all the necessary reconfiguration mechanisms. Once the support has been merged with the OS, an extended OS that is able to manage reconfiguration is obtained. This extended OS, however, does not yet support communication with the particular IP-Cores needed by each different application. To provide it with these
specific communication procedures it is necessary to include in the OS all the
device drivers obtained from the cores analysis.
The result of these steps is an OS that is able both to manage reconfiguration and to provide all the necessary communication functions to the software
applications that need to use IP-Cores. On this OS it is possible to run the binaries obtained from the cross-compilation previously described. The integration
between the OS and the user-side binaries represents the software part of the
solution and it is also the final output of DRESD-SW.
3.3.1 Reconfiguration layer
One of the main aspects of the OS reconfiguration support is the complete abstraction of the reconfiguration task. In other words it is fundamental that this
support decouples the user-side applications from the system processes that
have to be executed to perform a reconfiguration.
In this way it is possible to obtain several benefits, by exploiting the following advantages:
• simplification of the reconfiguration calls,
• code reuse and portability,
• support for different low-level implementations.
The introduction of a reconfiguration layer that completely hides the low-level reconfiguration processes from the user-side applications simplifies the task of
writing software that uses hardware modules. The functions that this layer has to provide to the user-side applications are the following ones (a possible C interface is sketched after the list):
• module request: to ask the reconfiguration manager for a particular module that has to be configured on the system,
• module release: to let the reconfiguration manager know that a specific module instance is no longer in use and can be deleted or cached,
• module removal: to ask the reconfiguration manager to explicitly delete a particular module instance from the system,
• modules list: to know the list of configured modules and their relative status.
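A possible C interface for this layer is sketched below; the names and signatures are illustrative and do not correspond to an actual library, but they show how the four functions can be exposed without leaking any low-level detail.

```c
/* reconf.h -- hypothetical interface of the reconfiguration layer.
 * Names and signatures are illustrative; the point is that no
 * low-level detail (bitstreams, slots, reconfiguration ports)
 * leaks into the user-side application. */
#ifndef RECONF_H
#define RECONF_H

#include <stddef.h>

typedef int module_handle_t;  /* opaque reference to a configured module */

/* Module request: the manager chooses the bitstream and the placement,
 * possibly reusing a cached instance of the same family. */
module_handle_t reconf_request(const char *family);

/* Module release: the instance is no longer in use and may be
 * deleted or kept cached, at the manager's discretion. */
int reconf_release(module_handle_t m);

/* Module removal: explicitly delete the instance from the system. */
int reconf_remove(module_handle_t m);

/* Modules list: fill 'out' with up to 'max' configured modules;
 * the return value is the number of entries, each with its status. */
size_t reconf_list(module_handle_t *out, size_t max);

#endif /* RECONF_H */
```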
This abstraction approach also allows code reuse and portability, since the high-level reconfiguration calls don't contain any information about their low-level implementation. In this way it is possible both to reuse the same code or the same portion of code in different situations and to port it to different hardware platforms.
For the same reason it is possible to implement in various ways the same reconfiguration tasks, for example by following different cache policies or different allocation mechanisms, and to choose at runtime the most suitable solution
for each particular scenario. The only constraint is that each implementation
has to satisfy the standard interfaces of the reconfiguration functions.
3.3.2 Dynamic reconfiguration management
In the proposed approach the Linux OS has been extended with a centralized
reconfiguration manager to support and manage external and internal reconfigurations. The choice of a centralized manager instead of a distributed solution
has been followed because in this way it is possible to exploit several advantages, brought by the possibility to implement the following policies:
• cache policy
• allocation policy
• positioning policy
The first policy represents the way in which cached modules are managed. When a module is no longer in use, in fact, it is possible to perform either a hard-removal or a soft-removal to delete it. The hard-removal configures the slots occupied by the unused IP-Core with blank modules, physically removing all the logic of the deleted module. The soft-removal, instead, leaves the FPGA configuration unaltered, but performs a logical removal by deleting all the information associated with the deleted modules.
Another way to manage a module removal is to keep both the module configured on the reprogrammable device and its information, while setting its status as cached. In this way the cached module can be assigned to other applications that require an IP-Core of the same family. This approach leads to a remarkable improvement of the temporal performance, since it introduces the possibility of satisfying a module request without performing any physical reconfiguration.
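The following sketch shows how the three options might be expressed around a per-module status field; the names are hypothetical and configure_blank() stands for the configuration of blank modules over the occupied slots.

```c
/* Hypothetical sketch of the cache policy choices described above.
 * configure_blank() stands for configuring blank modules over the
 * slots occupied by the removed IP-Core. */
enum module_status { MOD_RUNNING, MOD_CACHED, MOD_FREE };
enum removal_mode  { HARD_REMOVAL, SOFT_REMOVAL, KEEP_CACHED };

struct module {
    const char *family;
    int first_slot;
    int n_slots;
    enum module_status status;
};

void configure_blank(int first_slot, int n_slots);  /* hypothetical helper */

void handle_release(struct module *m, enum removal_mode mode)
{
    switch (mode) {
    case HARD_REMOVAL:
        /* physically remove the logic by writing blank modules */
        configure_blank(m->first_slot, m->n_slots);
        /* fall through: the bookkeeping is deleted in both cases */
    case SOFT_REMOVAL:
        /* logical removal only: the FPGA configuration is untouched */
        m->status = MOD_FREE;
        break;
    case KEEP_CACHED:
        /* a later request for the same family can reuse this instance
         * without any physical reconfiguration */
        m->status = MOD_CACHED;
        break;
    }
}
```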
Allocation policies aim at defining how to position a given module. Implementing this kind of policy allows the exploitation of well-known algorithms to maximize the number of IP-Cores that it is possible to configure on the same device. This can be seen as a reduction of the number of refused modules, that is, the modules that cannot be placed on the device because there is no more space available.
The main concept to follow while implementing allocation policies is to minimize the fragmentation of the devices. So, for each required module, it is necessary to find the minimum set of consecutive free slots where it is possible to
configure the module itself. In this way larger groups of free slots are left available for larger modules, without breaking them into several smaller groups.
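One way to realize this guideline is a best-fit search over the slot map: among all the runs of consecutive free slots that are large enough for the module, the smallest one is chosen. The following sketch assumes a simple boolean occupancy array; it is an illustration of the principle, not the actual allocation algorithm.

```c
/* Best-fit allocation sketch: find the smallest run of consecutive
 * free slots that can host a module of 'need' slots, so that larger
 * free runs are preserved for larger modules. Returns the first slot
 * of the chosen run, or -1 if the module must be refused. */
int allocate_slots(const int occupied[], int n_slots, int need)
{
    int best_start = -1, best_len = n_slots + 1;
    int i = 0;
    while (i < n_slots) {
        if (occupied[i]) { i++; continue; }
        int start = i, len = 0;
        while (i < n_slots && !occupied[i]) { len++; i++; }
        if (len >= need && len < best_len) {
            best_len = len;
            best_start = start;
        }
    }
    return best_start;
}
```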
Positioning policies concern the selection of the bitstream that is able to perform the desired reconfiguration. There are two possible ways in which this
selection can be executed.
The first is suitable for a scenario in which for each feasible position of the
module on the reprogrammable device there is a different bitstream. In this case
the positioning layer searches for the right collection of bitstreams for the desired
family of IP-Cores and then selects the bitstream that corresponds to the place
chosen in the allocation phase.
The second way is suitable when there is a component that is able to modify a bitstream to shift its position within the FPGA. In this case the positioning layer has to select the right base bitstream, that is, the only bitstream that represents the whole family of bitstreams corresponding to the same module. This information is then used to set up the relocation component that performs the shifting of the base bitstream to the desired position. In this way it is possible to
obtain a new bitstream with which it is possible to configure the desired module
in the position selected in the allocation phase.
All of these policies have to access one or more databases in which all the module and bitstream information is stored. There is the need, then, for a database where it is possible to store the current configuration of the FPGA, with the status of each module, to know if it is either running or cached. Furthermore it is compulsory to develop a database from which module family data and the relative bitstreams can be retrieved.
3.3.3
IP-Cores devices access
Once an IP-Core has been configured on the reprogrammable device, there is the need to establish a communication channel between the Operating System and the module itself. This channel can be used by the OS to accomplish the application requests of writing to or reading from an IP-Core, since it is not acceptable to let the software applications directly access the configurable hardware.
The best way to achieve this goal is to follow the standard Linux philosophy, which proposes the implementation of device drivers. Figure 3.5 shows the driver hierarchy that decouples application requests from OS communication tasks.
Figure 3.5: Drivers hierarchy (software applications call user-side drivers, which access the devices in /dev; each device is handled by a device driver that communicates with the corresponding IP-Core)
Each IP-Core family is managed by the same device driver, so the number of device drivers loaded by the OS at any time corresponds to the number of types of IP-Cores that can be handled. The device driver is able to distinguish a module from another of the same family by its memory address space, since it is unique for each module.
The development of a centralized and automatic reconfiguration manager implies the implementation of a mechanism to dynamically manage this kind of driver. The device drivers needed to handle the configured IP-Cores have to be dynamically loaded, while, if no more modules of a certain family are present on the FPGA, the corresponding device driver has to be unloaded. This aspect of the reconfiguration manager is described in more detail in Section 3.3.3.1.
To allow user-side applications to access IP-Cores, the OS provides them with a collection of devices, located in the /dev directory. Each device corresponds to a different IP-Core, so each set of devices that corresponds to modules of the same family has to refer to the same device driver.
A device is characterized by its major number and its minor number. Each IP-Core family is represented by the same major number, which corresponds to a specific device driver, while the minor number distinguishes between different IP-Cores of the same type.
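As a concrete, purely illustrative example, such a device node could be created from user space with the standard mknod(2) call; the path and the numbers below are invented for the example.

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* makedev() */

/* Hypothetical example: create /dev/device_2b for the IP-Core family
 * handled by the driver with major number 240, instance (minor) 1.   */
int create_ipcore_node(void)
{
    dev_t dev = makedev(240, 1);   /* major = family, minor = instance */
    return mknod("/dev/device_2b", S_IFCHR | 0660, dev);
}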
To avoid including direct calls to the devices in the user-side applications, it can be useful to develop a collection of user-side drivers, each of them able to manage a complete family of IP-Cores. The way in which this kind of driver considerably simplifies the access to the configured modules is described in Section 3.3.3.2.
3.3.3.1
Dynamic device drivers loading and unloading
During the reconfiguration of a module, it is necessary to check if an appropriate device driver is already loaded in the OS. If this driver is not found among the loaded device drivers, it is compulsory to load it, otherwise it would not be possible to manage the communication with the requested IP-Core.
After this phase, the configured module has to be registered to the right device driver, to set up its memory address space. During this process a unique minor number is assigned to the module. The association of this minor number with the major number of the device driver identifies the device that corresponds to the configured module. The name of this device is then used by the user-side application to manage the IP-Core.
When a module is no longer in use, and the caching policy has decided that it cannot be kept in cache, it has to be unregistered from its device driver. This operation is useful to free the memory address space allocated for the unused IP-Core. If in the whole system there is no module of the same family as the removed IP-Core, then it is also possible to unload its device driver.
Following the presented steps it is possible to implement a dynamic management of the device drivers to automatically set up the communication channel that each configured module needs to be used by both the OS and the
user-side applications.
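A user-space sketch of the check-then-load step is reported below; it assumes that the drivers are shipped as loadable kernel modules and that they can be loaded with insmod, which is an assumption made for the example and not necessarily the way the real manager works.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns nonzero if `name` appears among the loaded kernel modules. */
static int driver_loaded(const char *name)
{
    char line[256];
    int found = 0;
    FILE *f = fopen("/proc/modules", "r");

    if (!f)
        return 0;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, name, strlen(name)) == 0) {
            found = 1;
            break;
        }
    fclose(f);
    return found;
}

/* Load the family driver only if it is not already present. */
int ensure_driver(const char *name, const char *path)
{
    char cmd[512];

    if (driver_loaded(name))
        return 0;
    snprintf(cmd, sizeof cmd, "/sbin/insmod %s", path);
    return system(cmd);
}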
3.3.3.2
IP-Core user-side drivers
Even if it is possible for the user-side applications to directly access the devices, this is not a simple and clear way to manage configured modules, since it requires knowing the way in which each device driver operates.
A more powerful way is to develop a collection of user-side drivers, each of them able to interact with a whole IP-Core family. These user-side drivers
provide applications with a set of functions that perform the following classes
of tasks:
• reading from module registers,
• writing to module registers, and
• changing the module status.
Each of these classes is directly translated into the corresponding set of instructions that interact with the device to perform the required process.
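For illustration, a user-side driver for a hypothetical family could wrap the device access as follows; the register offsets and the ioctl request code are invented for the example and do not come from the real drivers.

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define IPCORE_SET_STATUS 0x4201   /* invented ioctl request code */

/* Read one 32-bit register of the module behind `dev_path`. */
int ipcore_read_reg(const char *dev_path, off_t reg, uint32_t *value)
{
    int fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, value, sizeof *value, reg);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}

/* Change the module status through an ioctl on the device. */
int ipcore_set_status(const char *dev_path, int status)
{
    int fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return -1;
    int ret = ioctl(fd, IPCORE_SET_STATUS, status);
    close(fd);
    return ret;
}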
The introduction of this layer of drivers not only simplifies and speeds up the implementation of the communication with reconfigurable modules, but also makes it possible to change the implementation of a device driver without the need for substantial modifications to the user-side application. It is sufficient to develop a new user-side driver that is compatible with the new implementation of the device driver and that exports the same interface for the communication functions. This approach adds flexibility to the whole driver hierarchy.
3.4
Concluding remarks
The proposed methodology is described by the BE-DRESD flow, in which a
high-level specification of an application that solves a particular problem is analyzed and modified by DRESD-HLR. This component allows the identification
of recurrent structures in the input code that can be potentially implemented as
hardware modules, and in this way it performs a hardware/software partitioning of the high-level specification.
The following steps, which have to be executed after the DRESD-VAL validation, are represented by DRESD-BE and DRESD-SW, the original contributions of this
thesis. The first one is responsible for the creation of the base architecture and
for the generation of both reconfigurable and fixed hardware modules. On the
other hand, the second component modifies the software part of the original
application to make it able to manage reconfigurable modules. The result of
this process can be integrated with an OS that is extended with a reconfiguration
support.
DRESD-DB provides all the other components with information about the target device, to automatically develop a solution that can be physically deployed on the real target system. The deployment information is associated with the hardware and the software solutions in DRESD-TM, which produces the final configurable or reconfigurable system implementing the original high-level specification.
Chapter 4
Design flow software development
The aim of this chapter is to describe the development details of the methodological aspects previously presented in Chapter 3. These aspects concern both the integration of the DRESD flow with the automatic generation of reconfigurable and fixed hardware modules, and the design of a software architecture, based on a standard Operating System, that allows the exploitation of reconfiguration.
In particular, Section 4.1 introduces IPGen, a tool able to integrate the DRESD-BE flow with the automatic generation of IP-Cores, starting from their core logic. These IP-Cores can be used either as fixed or as reconfigurable modules that have to be plugged into the final architecture.
The following section, Section 4.2, presents the development details of the software architecture that allows reconfiguration tasks to be performed on top of the Linux Operating System.
The class of underlying platforms on which it is possible to run the same software architecture is introduced in Section 4.2.1. Since this collection of platforms can be described in the same way from an abstract point of view, it is possible to manage reconfiguration processes without the need for modifications to the proposed software architecture.
The developed solution is based both on the Linux low-level reconfiguration support and on the centralized reconfiguration manager. The first one is presented in Section 4.2.2 and consists of several kernel modules that implement the low-level operations needed by reconfiguration processes, such as the
setup of the right address space on the Wishbone bus or the physical reconfiguration of the reprogrammable device. The second one is introduced in Section 4.2.3 and is composed of three different managers that are able to handle reconfiguration requests at different abstraction layers.
4.1
IPGen
As hinted in the previous chapter, the IPGen tool can be used to automate and to speed up the generation of bus-compatible components, which is part of the proposed embedded systems design flow. In particular, the IP-Core generation phase is involved in the design of the hardware architecture, which is part of the DRESD-BE flow.
The core functionalities extracted from the original specification, in fact, cannot be directly used in the YaRA architecture. To be plugged into YaRA they need to be adapted for the Wishbone bus communication. After this phase, the obtained IP-Cores can be used either as fixed or as reconfigurable modules in the developed architecture.
The creation of a complete IP-Core is a process that can be divided into two
distinct phases:
• the generation of the IP-Core logic, that is the core functionality of the
whole component, and
• the implementation of the communication infrastructure that makes it
possible to interconnect the IP-Core with the rest of the system.
IPGen is a software tool that performs the second step in an automatic way, starting from a given VHDL description of the core logic and the information about the bus that has to be used to communicate with the system. To achieve this task, IPGen performs the following steps.
• The first step that has to be executed is the input phase, in which the tool
is provided with the VHDL description of the core and the indication of
the chosen communication infrastructure.
Figure 4.1: Reading process diagram
• The second step is the reading process, shown in Figure 4.1, in which IPGen reads and interprets the input VHDL description to store all the information needed by the following step (a C sketch of this scan is given after the list). This phase can be further divided into the following operations:
– the recognition of the VHDL entity declaration pattern in the VHDL
description;
– the building of the signals list: the basic idea followed by this step is
that when a signal is recognized, it is analyzed and its information is
stored in the signals list; this action is repeated until the end of the
signal declaration is reached; and
– the storing of the core’s entity name and the file path in two variables
used by the following process.
Figure 4.2: Writing process diagram
• The third step is the writing process, shown in Figure 4.2, which takes as input the signals list, the core's entity name, its path and the kind of bus infrastructure that has to be used. The aim of this last step is to write the IP-
Core VHDL description, and this objective is achieved by performing the
following actions:
– the creation of a stub VHDL file between the core and the IP-Core
VHDL descriptions, that allows the input signals of the core to be
written by the bus master, and the outputs to be read. An important
feature of the tool is that the address decoding logic is automatically
generated and included in the stub; and
– the generation of the top architecture VHDL file, that is the final IPCore that contains both the processing logic and the chosen bus interface.
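The following C fragment sketches the reading step on a drastically simplified VHDL input (one declaration per line, no comments); the data structures are invented for the example, and a real scanner, like the one in IPGen, has to be considerably more robust.

#include <stdio.h>
#include <string.h>

struct signal_info {
    char name[64];
    char dir[8];         /* e.g. "in" or "out" */
};

/* Scan a VHDL file for "entity <name> is" and collect the port signals
 * until the end of the declaration is reached.                         */
int scan_entity(const char *path, char *entity,
                struct signal_info *sig, int max_sig)
{
    char line[256];
    int n = 0, in_entity = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (!in_entity && sscanf(line, " entity %63s is", entity) == 1) {
            in_entity = 1;                    /* entity pattern recognized */
        } else if (in_entity && n < max_sig &&
                   sscanf(line, " %63[^ :] : %7s",
                          sig[n].name, sig[n].dir) == 2) {
            n++;                              /* signal added to the list  */
        } else if (in_entity && strstr(line, "end")) {
            break;                            /* end of core recognized    */
        }
    }
    fclose(f);
    return n;                                 /* number of signals found   */
}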
If an error occurs during the execution of either the reading or the writing phase, the tool is halted and an error message is returned. This message contains information that is useful to understand where and why the process failed. Even if the tool cannot detect all VHDL syntax errors, since it is not a VHDL parser and it does not validate the analyzed code, it is however able to check the entity declaration syntax. On the contrary, if the execution ends correctly, the created IP-Core, whose structure is shown in Figure 3.3, is ready to be plugged into the architecture for which it has been developed.
Within the DRESD-BE flow, to be more precise in the YaRA Modular Architecture Creation phase, the IP-Cores obtained thanks to IPGen are used by YaRA Top Creator. The fixed modules are included in the fixed part of the architecture, by using EDK System Creator and Fix Generator, while the reconfigurable modules are plugged into the architecture by the System Configuration Tool during the last step, as shown in Figure 3.2. In this way it is possible to automatically obtain a working architecture based on YaRA that supports all the functionalities of the original specification that have been implemented as hardware components.
4.2
Software architecture
The software architecture is the part of the system that is responsible both for
managing dynamic reconfiguration and for handling the reconfigurable hardware.
This architecture can be implemented either as a standalone application or as an Operating System support. The first solution is designed to solve just one class of problems, so it can be deeply optimized, but it is necessary to rewrite the whole application if the context changes.
The second solution is a more general one and consists of a layer that provides access to the reconfigurable hardware at a very high level of abstraction; moreover, this partial dynamic reconfiguration support makes it easy to exploit the inter-process communication and process scheduling provided by the OS.
Section 4.2.1 presents a class of reconfigurable embedded systems that can be
described in the same way from an abstract point of view, using the concepts of
master and slave FPGAs. This class represents the collection of systems that the
proposed software can handle and on which it is able to manage reconfiguration
tasks.
Section 4.2.2 introduces the low-level implementation of the Linux reconfiguration support. It consists of a collection of kernel modules that have to be loaded in the OS to enable the hardware components responsible for the reconfiguration process. Two of them are essential to grant access to two IP-Cores on the master FPGA that are responsible for the reconfiguration of the slave FPGAs and for the setup of the communication on the Wishbone bus. The last one is the manager of the dynamically registered devices. Furthermore, to simplify the access to these kernel modules, a common library, called the Reconfiguration Library, is also introduced.
This OS support can be extended, as described in Section 4.2.3, with a centralized reconfiguration manager. This manager, called ROTFL Daemon and described in Section 4.2.3.2, has both to implement all the policies previously introduced in Section 3.3.2 and to allow an easy communication between user-side
applications and the Linux kernel modules that perform physical reconfigurations.
4.2.1
Underlying platform
The most general platform on which a configurable or reconfigurable system can be developed is a multi-FPGA scenario where the reconfigurable resources are distributed over several interconnected FPGAs. The master FPGA has to be able to reconfigure, partially or totally, the other slave FPGAs. These slave FPGAs can be divided into several slots that can be filled with IP-Cores by the master FPGA.
The main challenge in such a scenario is to hide from the user applications the system characteristics and the additional effort regarding the communication with dynamic modules.
Figure 4.3 shows a collection of different scenarios to which the previously described abstraction can be applied. In all these scenarios, each master FPGA
is characterized by the presence of an embedded PowerPC processor, on which
the Operating System runs, in addition to the static hardware components such
as a memory controller, general purpose inputs/outputs, and a reconfiguration
manager.
Slave FPGAs, instead, hold the reconfigurable resources used to dynamically load hardware modules into the system. These resources are used according to a 1D-placement with a granularity of four CLB (Configurable Logic
Block) columns. This means that dynamic modules always use the full height
of the FPGA, while their width is a multiple of four CLB columns.
In the first scenario, called Scenario A in Figure 4.3, there is just one FPGA that is used both as a master FPGA and as a slave FPGA. This FPGA is logically divided into two different parts:
• a fixed part, that is the part of the FPGA that contains the PowerPC processor and that acts as a single master FPGA, and
• a reconfigurable part, that is handled as a single slave FPGA, even if the
number of slots that it is possible to configure is smaller.
Figure 4.3: Multi-FPGA scenarios (Scenario A: a single FPGA acting as both master and slave; Scenarios B, C and D: one master FPGA with a PowerPC processor and several slave FPGAs divided into slots)
On the contrary, in all the remaining scenarios each FPGA of the system acts either as a master or as a slave FPGA, without logical internal divisions.
The differences between these scenarios reside in the different ways in which
the communication infrastructure is implemented. The second scenario, called
Scenario B in Figure 4.3, presents a chain communication in which the master
FPGA can communicate with just one slave FPGA, and each slave FPGA can
communicate just with the following one.
Scenario C and Scenario D, instead, represent respectively a point-to-point connection and a bus-based connection. In both these scenarios the master FPGA is able to communicate directly with each slave FPGA.
Even if the presented scenarios differ in the logical partitioning of the master and slave FPGA sets and in their communication infrastructures, they can be reduced to the same class of platforms from the software point of view. For this reason they can be handled by the same software architecture, as described in the following section.
4.2.2
Linux kernel modules infrastructure
The low-level implementation of the partial dynamic reconfiguration support
consists of three kernel modules and a library. Figure 4.4 shows the hierarchy
between the software applications, the Reconfiguration Library, the kernel and
the kernel modules.
The first kernel module is the Reconfiguration Controller kernel module, described in Section 4.2.2.1, which provides the interaction with the hardware Reconfiguration Controller; this kernel module manages partial or complete reconfigurations of an FPGA by simply providing the controller with the bitstream base address, its size and the slave FPGA that has to be configured with the given bitstream.
Section 4.2.2.2 presents the second module, the MAC (Media Access Control) kernel module, which allows communication with the hardware MAC component, i.e. the IP-Core responsible for the dynamic changes of the address space of the configured modules on the Wishbone bus. With this kernel module it is possible to set up the right address space on the Wishbone bus for each new IP-Core added to a slave FPGA.
Figure 4.4: Linux kernel modules infrastructure (software applications use the Reconfiguration Library, which reaches the Reconfiguration Controller, MAC and LOL kernel modules through the kernel)
The third kernel module is the LOL (Load On Linux) kernel module, introduced in Section 4.2.2.3, which does not refer directly to a hardware component of the system. It is a centralized manager that is able to handle the dynamic registering and unregistering of other devices. Its function is to store all the information about the registered devices and to allow both the addition of a new device and the removal of an existing one from the system.
Finally, Section 4.2.2.4 describes the Reconfiguration Library, whose purpose is to simplify writing applications that have to manage the presented kernel modules. For this reason the library also offers a set of functions that allow read, write and IOCTL calls to be made in a simplified way on both the Reconfiguration Controller device and the MAC device.
4.2.2.1
The Reconfigurator Controller kernel module
The Reconfigurator Controller kernel module is an interface for the Reconfiguration Controller, that is, a hardware component that has to be present in the final reconfigurable system, since it allows the reconfiguration of a slave FPGA with a given bitstream.
A special feature of the Reconfiguration Controller component is its Direct Memory Access (DMA) to the SDRAM (Synchronous Dynamic Random Access Memory). This enables very fast configurations when downloading bitstreams from a given position within the memory to a selected FPGA.
It is possible to communicate with this hardware component through its registers, whose schematic is shown in Figure 4.5. The Bitstream base address register contains the base address of the bitstream that has to be used to reconfigure the selected FPGA, while the Bitstream dimension register represents the dimension, expressed in bytes, of the bitstream itself. The size of these two registers is 32 bits.
Figure 4.5: Reconfiguration Controller registers (offset 0x000: Bitstream base address, bits 31-0; offset 0x008: Bitstream dimension in bytes, bits 31-0; offset 0x020: Command, bits 7-0)
The last register is the Command register, shown in Figure 4.6. It is smaller than the previous registers, since its size is just 8 bits. In particular, bit number 4 is used to select a complete (indicated with a 0) or a partial (indicated with a 1) reconfiguration, while the last three bits (bits 2, 1 and 0) are used to select the FPGA that has to be reconfigured.
The work of the Reconfiguration Controller is divided into two phases: the setup phase and the reconfiguration phase.
• In the setup phase it is possible to set the right data to specify which bitstream has to be used to perform a complete or partial reconfiguration;
Figure 4.6: Command Register (bit 4: complete (0) or partial (1) reconfiguration; bits 2-0: slave FPGA number)
the information needed by the controller is the memory base address at which the bitstream is stored and its dimension expressed in bytes. Both the base address and the dimension of the bitstream are written to their respective registers on the controller when the proper IOCTL calls are performed.
• The second step, the reconfiguration phase, starts when the Command register of the Reconfiguration Controller is modified; this step performs the
specified kind of physical reconfiguration (complete or partial) of the selected slave FPGA.
Since the Reconfiguration Controller works with DMA, the processor on which the Operating System is running is involved only in the setup phase, while during the reconfiguration phase it is free to work on other processes.
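Putting the register description together, a driver-side sketch of the two phases might look like the following C fragment; the way the registers are mapped into a pointer is an assumption, while the offsets and the Command encoding follow Figures 4.5 and 4.6.

#include <stdint.h>

/* Register offsets from Figure 4.5 (byte offsets from the base address). */
#define REG_BITSTREAM_ADDR 0x000   /* bitstream base address           */
#define REG_BITSTREAM_SIZE 0x008   /* bitstream dimension in bytes     */
#define REG_COMMAND        0x020   /* command register (8 bits used)   */

/* Command encoding from Figure 4.6: bit 4 = partial (1) or complete (0)
 * reconfiguration, bits 2..0 = slave FPGA number.                       */
static inline uint8_t make_command(int partial, int fpga)
{
    return (uint8_t)(((partial & 1) << 4) | (fpga & 0x7));
}

/* `regs` is assumed to be an already mapped pointer to the controller. */
void reconfigure(volatile uint8_t *regs,
                 uint32_t bitstream_addr, uint32_t bitstream_size,
                 int partial, int fpga)
{
    /* setup phase: tell the controller where the bitstream lives */
    *(volatile uint32_t *)(regs + REG_BITSTREAM_ADDR) = bitstream_addr;
    *(volatile uint32_t *)(regs + REG_BITSTREAM_SIZE) = bitstream_size;

    /* reconfiguration phase: writing the Command register starts the
     * DMA transfer, and the processor is free to run other processes */
    *(regs + REG_COMMAND) = make_command(partial, fpga);
}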
4.2.2.2
The MAC kernel module
Each slave FPGA comprises a Wishbone bus to which the hardware modules are dynamically connected. The bus-bridges that are used to connect the modules to the processor system require the Medium Access Control (MAC) for the communication with the modules.
These MACs differ considerably from those used in standard on-chip bus
systems, since they have to deal with a changing number of communication
participants. Thus, they provide the ability to allocate address space for each
loaded module at run-time. This allows for a very flexible use of the available
bandwidth as well as for multiple instantiations of modules (e.g., two identical
Adder-modules loaded for different tasks).
The MAC kernel module is the part of the system that provides the setup of the communication between the processor and the configured IP-Cores, setting the correct address space for each one of them; in fact, when an IP-Core is configured on a slave FPGA its address range is known and it is possible to search for a free space for it on the Wishbone bus.
The information about the address space reserved for the new IP-Core on this bus is passed to the MAC through an IOCTL call to the MAC kernel module and then the MAC itself takes care of the communication setup.
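From the manager's point of view, this setup can be pictured as a single IOCTL call on the MAC device; the device path, the request code and the argument layout below are assumptions made for the sake of the example.

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define MAC_SET_RANGE 0x4d01        /* invented ioctl request code */

struct mac_range {                  /* invented argument layout    */
    uint32_t base_addr;             /* low end of the address space  */
    uint32_t high_addr;             /* high end of the address space */
    int      module_nr;             /* module owning the range       */
};

int mac_assign_range(uint32_t base, uint32_t high, int module_nr)
{
    struct mac_range r = { base, high, module_nr };
    int fd = open("/dev/mac", O_RDWR);

    if (fd < 0)
        return -1;
    int ret = ioctl(fd, MAC_SET_RANGE, &r);
    close(fd);
    return ret;
}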
4.2.2.3
The LOL kernel module
The LOL (Load On Linux) kernel module is used to dynamically manage the registering and unregistering of devices. Each time a device driver is loaded, it makes a call to a function exported by the LOL kernel module. This function is used to communicate to the LOL kernel module the values of three function pointers and of an integer number:
• the first function pointer refers to the add_device function of the loaded device driver, which registers a new device; when a new device is added, the device driver has to store its minor number and to register the corresponding memory space for communication;
• the second pointer is the rem_device function pointer, which is responsible for the deletion of an existing device; when the deletion of a device is performed the device driver has both to update its table of devices, deleting all the information corresponding to the removed device, and to free the memory space occupied by the removed device;
• the last pointer refers to the clean-up function that is used to unload the device driver; this request can be performed when no more devices are registered on it and either there is no more space in memory to keep the device driver or the device driver is no longer useful for the system;
• the integer number represents the major number of the device driver itself; this major number is dynamically assigned to the device driver by the Operating System. Since it is necessary to know this number to establish a communication with the device driver and there is no way to know it directly, the device driver has to give it to another kernel module whose major number is well known. This can be the LOL kernel module, which is able to store this major number in a place that is accessible by the upper level; otherwise the communication between the software applications and the device driver cannot take place.
The LOL kernel module stores the information on each device driver in a table.
When a new device has to be added or an existing one has to be deleted from a
device driver already loaded in the system, it is possible to find in this table the
pointer to the right function that is able to perform the requested action.
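The registration described above can be pictured as a small descriptor handed to LOL when the driver is loaded; the names in this C sketch are invented, but its content mirrors the three function pointers and the major number of the list.

/* Hypothetical descriptor a device driver hands to the LOL kernel
 * module when it is loaded; it mirrors the four items listed above. */
struct lol_driver {
    int (*add_device)(int minor);    /* register a new device     */
    int (*rem_device)(int minor);    /* delete an existing device */
    void (*cleanup)(void);           /* unload the driver         */
    int major;                       /* dynamically assigned      */
};

#define LOL_MAX_DRIVERS 16

/* LOL stores the descriptors in a table, indexed at registration
 * time, where the right function pointer can later be found.      */
static struct lol_driver lol_table[LOL_MAX_DRIVERS];
static int lol_count;

int lol_register_driver(const struct lol_driver *drv)
{
    if (lol_count >= LOL_MAX_DRIVERS)
        return -1;
    lol_table[lol_count++] = *drv;
    return 0;
}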
4.2.2.4
The Reconfiguration Library
The aim of the Reconfiguration Library is to provide a simple and optimized mechanism to interact with both the Reconfigurator Controller kernel module and the MAC kernel module.
To improve the usability of the Reconfigurator Controller kernel module, the
Reconfiguration Library offers a collection of functions that implement the IOCTL
calls that are necessary to perform the following actions:
• write the bitstream base address in the corresponding Bitstream base address register of the Reconfigurator Controller,
• write the bitstream size in the corresponding Bitstream dimension register
of the Reconfigurator Controller,
• write the command in the corresponding Command register of the Reconfigurator Controller, and
• reset all the Reconfigurator Controller registers.
In addition to these simple processes, the Reconfiguration Library also implements two complex processes that combine the basic IOCTL calls to achieve the
following flows:
• configuration, which takes as input the bitstream base address, the bitstream size and the number of the slave FPGA on which to perform a total configuration with the given bitstream, and
• reconfiguration, which is similar to the previous flow, but performs a partial reconfiguration on the selected slave FPGA.
On the other hand, to improve the usability of the MAC kernel module, the Reconfiguration Library offers a set of functions that allow the following IOCTL calls to be performed:
• reset all the MAC kernel module registers,
• write the base address of the address space on the corresponding register
of the MAC kernel module,
• write the high address of the address space on the corresponding register
of the MAC kernel module, and
• write the number of the module that corresponds to the selected address space range.
In conclusion, using the Reconfiguration Library it is possible to communicate with the Reconfigurator Controller and the MAC, both to perform a partial or a complete reconfiguration and to set up the correct address space on the Wishbone bus for the reconfigured module, with just a few function calls.
4.2.3
The ROTFL architecture
The reconfiguration support described in Section 4.2.2 makes it possible to configure a slave FPGA with a given bitstream and to set up the correct address space for it on the Wishbone bus.
Thanks to this support, partial dynamic reconfiguration can be performed by the Operating System in a very simple way, so there is the need for an architecture capable of receiving module requests from software applications, of successfully completing the whole reconfiguration process and of answering the requests by giving back to the applications the name of the device that has to be used to perform the requested functionality.
This kind of architecture can be used as a support to write software applications, such as the software controller, able to execute some processes with reconfigurable hardware instead of using only the processor on which the Operating System is running.
These applications don’t have to manage anything about reconfiguration,
they only have to know the interfaces of the functions that are necessary to
perform the following steps:
• to request a hardware module that is able to perform the desired functionality,
• to interact with the requested module, writing to and reading from its registers, and
• to delete the module when it is no longer in use.
The software architecture proposed in this thesis, called ROTFL (Reconfiguration Of The FPGA under Linux), implements the previous functions. As
shown in Figure 4.7, it is characterized by three components: the ROTFL Library, the ROTFL Daemon and the ROTFL Repository.
• The ROTFL Library, detailed in Section 4.2.3.1, is an interface that provides
the possibility to communicate through sockets with the ROTFL Daemon;
Figure 4.7: Software Architecture schematic (the software application uses the ROTFL Library to reach the ROTFL Daemon, composed of the Module, Allocation and Positioning Managers, which relies on the ROTFL Repository and on the LOL, MAC and Reconfigurator Controller kernel modules)
in other words it allows the interaction, using a simple function call, with
the ROTFL Daemon by sending a command to it and by receiving from it
the result of the process.
• The ROTFL Daemon, an application that runs on the Operating System waiting for a socket command, is presented in Section 4.2.3.2; it is capable of handling requests such as the configuration of a new module or the deletion of an existing one. These tasks are accomplished by the three managers of which the ROTFL Daemon is composed: the ROTFL Module Manager, the ROTFL Allocation Manager and the ROTFL Positioning Manager. Each of these managers tries to handle the requests by itself, and only if this is not possible are the requests forwarded to the next manager, which is located at a lower level in the hierarchy.
• The ROTFL Repository is introduced in Section 4.2.3.6; it is a sort of database to store and retrieve information about bitstream locations, bitstream dimensions, device driver names and paths, and module specifications.
For the implementation of dynamically reconfigurable systems it can be useful to adopt a layer model that systematically abstracts from the hardware resources. Each layer represents a set of components within the HW/SW architecture that is part of the reconfiguration process.
By defining these layers, and especially the interfaces between neighboring layers, the reusability of existing components is increased, while the error-proneness of the system design is significantly reduced.
This layer model has been applied to the development of the proposed software architecture. The layers in which the ROTFL architecture can be divided
are shown in Figure 4.8.
Figure 4.8: Architectural layers (Application, Module Management, Allocation, Positioning, Configuration and Hardware Layers)
The uppermost layer is the Application Layer. It represents all applications
that are using dynamically reconfigurable hardware. Any application that
wants to load a new hardware module makes a module request to the Module
Management Layer.
This layer holds a list of all currently loaded hardware modules. In case of a
module request, it checks whether any inactive module of the requested type is
available, and returns a reference to this module to the application, changing its
status from cached to running. If no such module exists, the Module Management
Layer requests a module placement from the Allocation Layer.
The Allocation Layer is responsible for choosing appropriate reconfigurable
resources for requested modules as well as for allocating address spaces on the
communication infrastructures (e.g., the Wishbone bus). This layer is the uppermost layer that knows about the physical arrangement of the reconfigurable
resources such as the existence of multiple FPGAs.
When the most suitable reconfigurable resources for the requested module
have been found, both this information and the module type are given to the
following layer. This is the Positioning Layer, which loads a bitstream from a local
bitstreams repository and adapts the position information to the given position.
This manipulated bitstream is then given to the Configuration Layer. This
layer contains interfaces to all existing reconfigurable resources, such as the
ICAP for self-configuration, or the SelectMap and JTAG interfaces for external
configuration.
The reconfigurable resources themselves, which can be distributed over several FPGAs, are represented by the Hardware Layer.
The whole software architecture has been developed according to the presented layer model, using well-defined interfaces between contiguous layers.
In this way it is possible to obtain a high level of flexibility, since there is no
need to change the whole architecture structure if a single layer has to be modified.
4.2.3.1
The ROTFL Library
The ROTFL Library is mainly used by the software applications that want to
work with reconfigurable hardware. This library simplifies the reconfiguration
tasks by providing the user-side applications or the software reconfiguration
controller with the following functions.
• The ROTFL_add function is used to add a new hardware module to the system. When a user-side application needs an IP-Core, it has to call this function with the name of the desired module. The second parameter that this function takes as input is a char pointer to the string that will contain the result of the requested process. This function returns an integer value that indicates whether the requested process has been successfully completed. If the returned value is zero, the second parameter given to the function contains the name of the device that is able to perform the desired functionality. On the contrary, if the returned value is different from zero, it means that an error has occurred and the second parameter contains a message that describes the error type.
• Another function provided by the ROTFL Library is the ROTFL_del function, which can be used when a module is no longer in use and no longer useful for the user-side application. This function is similar to the previous one from the input point of view, since the first parameter represents the name of the module that has to be deleted and the second one still denotes the string on which the result description has to be written. The output is also homologous to the previous one: if the returned value is different from zero an error has occurred during the execution of the requested operation and the string pointed to by the second parameter contains the error description, while a returned value equal to zero stands for a successfully completed operation.
• The ROTFL_rem function is almost identical to ROTFL_del, but it specifies to the ROTFL Daemon that the removed module cannot be handled as a cached module, but has to be physically removed from the system. This can be useful either when it is known that this module cannot be useful to any other application or when there is the need to prevent anyone else from using the same IP-Core.
• The last function provided by the ROTFL Library is the ROTFL_list function, which takes as input a pointer to the string that will contain the list of the modules physically configured on the reprogrammable devices. This list is a sort of description of the status of each slave FPGA, with explicit information about running and cached modules. It is possible to employ this information to monitor the availability of the slots of each reprogrammable device.
The ROTFL Library converts each of the presented function calls into a socket communication flow with the ROTFL Daemon. This mechanism allows both to hide the handling of the socket communication from user-side applications and to export a collection of simple functions that abstract the implementation of the reconfiguration tasks.
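A typical usage pattern from an application is sketched below; the exact prototypes are not spelled out in the text, so the buffer handling and the module name used here are assumptions.

#include <stdio.h>

/* Prototypes as suggested by the description above; the declarations
 * in the real ROTFL Library header may differ.                        */
int ROTFL_add(const char *module, char *result);
int ROTFL_del(const char *module, char *result);

int use_adder(void)
{
    char result[256];

    /* ask for an IP-Core of the "adder" family (name is an example) */
    if (ROTFL_add("adder", result) != 0) {
        fprintf(stderr, "module refused: %s\n", result);
        return -1;
    }
    /* `result` now holds the device name to be used through the
     * corresponding user-side driver                               */
    printf("adder available as %s\n", result);

    /* ... interact with the module through its device ... */

    ROTFL_del("adder", result);    /* the module becomes cached */
    return 0;
}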
4.2.3.2
The ROTFL Daemon
The ROTFL Daemon is the centralized manager that is located between the ROTFL Library and the Operating System reconfiguration support, which consists of the collection of kernel modules presented in Section 4.2.2.
The aim of this component is to manage each socket request that comes from the Application Layer, in general from the ROTFL Library, as shown in Figure 4.9. These commands can be the request to add, to delete or to remove a module, or to get the list of the configured IP-Cores.
This daemon consists of the three following managers, each of which is located on the corresponding layer:
• the ROTFL Module Manager, presented in Section 4.2.3.3, that is located in
the Module Management Layer,
• the ROTFL Allocation Manager, described in Section 4.2.3.4, that implements the Allocation Layer, and
• the ROTFL Positioning Manager, introduced in Section 4.2.3.5, that is located in the Positioning Layer.
Figure 4.9: Socket communication (the client, i.e. the ROTFL Library, sends a command to the server, i.e. the ROTFL Daemon, which elaborates the request and returns the result to the client)
These managers have been developed to communicate with each other through well-defined interfaces, so it is possible to develop several versions of the same manager that use the same interface.
Since these different implementations of the managers are completely interchangeable, it is possible to choose the most suitable solution for each different
scenario without the need to change the whole structure.
4.2.3.3
The ROTFL Module Manager
When the ROTFL Daemon receives a request, the first manager that analyzes the received command is the ROTFL Module Manager. The aim of this manager is to implement a sort of module cache.
This manager is able to handle all the requests that come from the ROTFL
Library. In particular commands are treated in the following ways.
• When the configuration of a new module is requested, with a module_manager_add function call, the first step that the ROTFL Module Manager performs is to check whether the table of the configured IP-Cores contains a cached module of the same kind as the requested one (a sketch of this flow is given after the list). The manager is able to accomplish the module request by itself only if this search ends successfully, otherwise the request is forwarded to the ROTFL Allocation Manager.
To be more precise, if the cached module is found, the manager returns its name to the user-side application, without performing any physical reconfiguration. The status of this module is changed from cached to running, to avoid accidentally deleting it or assigning it to another user-side application.
On the contrary, if no cached module is found, then the module request has to be forwarded to the following manager, the ROTFL Allocation Manager, which will perform the real reconfiguration of the new IP-Core. At the same time the ROTFL Module Manager has to retrieve from the ROTFL Repository the name of the device driver that is able to handle the requested module. If this device driver is not yet loaded in the system, the ROTFL Module Manager has to load it, otherwise the Operating System will not be able to manage the newly configured module.
• When the deletion of a module is requested, with a module_manager_del function call, the only step that the ROTFL Module Manager has to perform is to change the status of the selected module from running to cached. In this way the module will be available for other user-side applications that need an IP-Core of the same kind as the cached one. In this specific case no physical reconfiguration has to be performed, since the configuration status of the reprogrammable devices has to remain the same.
• In a similar way, when the removal of a module is requested with a module_manager_rem function call, there is no need to reconfigure the reprogrammable device. To perform the removal of the selected module, in fact, it is sufficient to delete from the table of the configured IP-Cores all the information concerning the selected module, to free its address space on the Wishbone bus and to unregister the corresponding device to free the memory reserved for it. In this way all the resources occupied by the removed module are freed and made available for other IP-Cores.
• The last request that can come from the ROTFL Library is the list of the configured modules. If the user-side application requests this list, the ROTFL Module Manager has just to return all the information contained in the table of the configured IP-Cores, since it is always coherent with the real status of all the slave FPGAs.
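The add flow of the first item can be summarized by the following C sketch; the table layout and the function names are hypothetical, and the device driver handling is omitted for brevity.

#include <string.h>

/* Hypothetical view of the ROTFL Module Manager table. */
enum { ST_FREE = 0, ST_RUNNING = 1, ST_CACHED = -2 };

struct ipcore {
    char family[32];
    char device[32];     /* e.g. the name of the device in /dev */
    int  status;
};

#define MAX_CORES 32
static struct ipcore table[MAX_CORES];

/* Defined elsewhere: forwards the request to the Allocation Manager. */
int allocation_manager_add(const char *family, char *device);

int module_manager_add(const char *family, char *device)
{
    /* first look for a cached module of the requested family */
    for (int i = 0; i < MAX_CORES; i++)
        if (table[i].status == ST_CACHED &&
            strcmp(table[i].family, family) == 0) {
            table[i].status = ST_RUNNING;     /* cached -> running  */
            strcpy(device, table[i].device);  /* no reconfiguration */
            return 0;
        }
    /* cache miss: a physical reconfiguration is needed */
    return allocation_manager_add(family, device);
}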
4.2.3.4
The ROTFL Allocation Manager
When it is impossible to find a cached module that is able to perform the requested functionality, the new module has to be configured somewhere on one of the reprogrammable devices. The aim of the ROTFL Allocation Manager is to find a suitable location to place the module that has to be configured.
Moreover the ROTFL Allocation Manager also has to find a free place on the
Wishbone bus where it is possible to position the address space that the module
needs to establish the communication with the rest of the system.
These tasks can be performed in several different ways and using various algorithms. The simplest algorithm that accomplishes these processes is the one that returns the first suitable position that is found. However, it is possible to implement algorithms that are able to choose the most suitable FPGA on which to configure the module and the best place for its address space on the Wishbone bus, by following metrics like the time that has to be spent for the search or the fragmentation of the FPGAs.
To achieve its objective, each algorithm needs to retrieve from the ROTFL Repository the information on the size and the range of the address space of the new module. It is possible that a single module can be configured with different bitstreams that need different combinations of slots and that represent IP-Cores of different sizes. In this case the algorithm can also evaluate which one is the most suitable for each particular situation.
When the ROTFL Allocation Manager has found both the place where it is
possible to configure the IP-Core and the base address of its address space on
the Wishbone bus, this information is given to the ROTFL Positioning Manager.
This last manager is responsible for the selection of the bitstream that is able to
configure the selected FPGA at the position specified by the ROTFL Allocation
Manager.
The ROTFL Allocation Manager implemented using a genetic algorithm
Genetic algorithms, in computer science, are a class of search techniques modeled on evolutionary biology. The ROTFL Allocation Manager has been implemented using an algorithm of this class, since it makes it possible to look for a good sub-optimal solution in a reasonable time, while an exhaustive search can be excessively slow. Moreover, using a genetic algorithm it is possible to adapt the search process to the current status of the system, by tuning parameters such as the probability of the crossover or mutation process. It is possible, in fact, that at a particular moment the system needs a fast configuration process (which implies a fast allocation task) or a very accurate solution to avoid wasting configurable resources (which requires a very precise allocation task).
Since a genetic algorithm is appropriate for both the presented situations,
it can represent an excellent and parametric compromise between the optimality of the final solution and the time constraints imposed by the dynamically
reconfigurable scenario.
Genetic algorithms Evolution is a long-time-scale process that changes a population of organisms by generating better offspring through reproduction. Borrowing this idea from biology, a learning process can be modelled as evolution. Genetic algorithms are thus inspired by Darwin's theory of evolution: problems are solved by an evolutionary process that mimics natural evolution in looking for a best (fittest) solution (survivor). They are a part of evolutionary computing.
Genetic algorithms are based on the following concepts:
• chromosome is the coding of a possible solution for a given problem
• gene is the coding of a part of the solution
• allele is one of the elements used to code the genes
• fitness is the evaluation of the actual solution
• crossover is the generation of a new solution by mixing two existing solutions
• mutation is a random change in the solution
According to Darwin’s theory of evolution the best chromosome survive to
create new offspring. Crossover and mutation depend on the encoding of chromosomes. Mutation is intended to prevent falling of all solutions in the population into a local optimum.
The basic genetic algorithm is based on the following steps:
1. generation of a random population of chromosomes;
2. evaluation of the fitness of each chromosome in the population;
3. creation of a new population by repeating the following steps until the
new population is complete:
(a) selection of two parent chromosomes from a population according to
their fitness;
(b) crossover of the parents to form a new offspring;
(c) mutation of the new offspring at each locus;
(d) placement of the new offspring in the new population;
4. the new population is used for a further run of the algorithm;
5. if the end condition is satisfied, the best solution in current population is
returned;
6. otherwise the cycle is repeated starting again from point 2.
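The skeleton of these steps could read as follows in C; fitness evaluation, selection, crossover and mutation are left as declared stubs, since their definition is problem-specific (the encoding actually used by the ROTFL Allocation Manager is discussed next).

#define POP_SIZE 32
#define GENOME   16    /* genes per chromosome (example size) */

typedef struct {
    int gene[GENOME];
    double fitness;
} chromosome;

/* Problem-specific operators, only declared in this sketch. */
double evaluate(const chromosome *c);
const chromosome *select_parent(const chromosome pop[]);
chromosome crossover(const chromosome *a, const chromosome *b);
void mutate(chromosome *c, double prob);

void genetic_search(chromosome pop[], int generations, double mut_prob)
{
    for (int g = 0; g < generations; g++) {
        chromosome next[POP_SIZE];

        for (int i = 0; i < POP_SIZE; i++)              /* step 2 */
            pop[i].fitness = evaluate(&pop[i]);
        for (int i = 0; i < POP_SIZE; i++) {            /* step 3 */
            const chromosome *a = select_parent(pop);   /* 3(a)   */
            const chromosome *b = select_parent(pop);
            next[i] = crossover(a, b);                  /* 3(b)   */
            mutate(&next[i], mut_prob);                 /* 3(c)   */
        }
        for (int i = 0; i < POP_SIZE; i++)              /* step 4 */
            pop[i] = next[i];
    }
}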
In general, genetic algorithms are best suited for the following cases:
• a big, non-unimodal and non-smooth search space
• a noisy and usually non-analytic fitness function
• looking for a good sub-optimum in a reasonable time
They can be used for many applications, for example optimization, prediction, classification, economy, ecology and automatic programming. In this case the algorithm has been applied to the allocation of dynamically reconfigurable modules. When a new module has to be reconfigured in the system, in fact, there is the need to find a suitable free place where it can be configured. This search task has been modeled with a genetic algorithm in which each
chromosome represents a configuration status of the reprogrammable devices
and both crossover and mutation processes try to change the previously found
location for the new module in order to achieve a better fitness, which stands for
the goodness of the final solution.
Encoding There are many parameters and settings that can be implemented in a different way for each class of problems: how to create chromosomes and what kind of encoding is suitable for each particular situation; how to select parents for the crossover process, following the idea that better parents will produce better offspring; and how to define the crossover and mutation tasks, which are the two basic operators of genetic algorithms.
Therefore, the first step in developing a genetic algorithm is defining a suitable solution encoding. A chromosome should in some way contain information about the solution that it represents. Since the encoding depends mainly on the problem to be solved, for the ROTFL Allocation Manager a pair of arrays has been chosen, the Slots and the Modules arrays. Figure 4.10 shows an example chromosome of a system that contains only one slave FPGA with four slots.
The first array consists of a collection of genes, which contain the information on which module is configured on each slot of the reprogrammable device.
In particular each gene directly corresponds to a single slot of a slave FPGA.
Since on a device of n slots it is possible to configure not more than n modules
Figure 4.10: Genetic algorithm chromosome
Slots (slots 0-3): 1 0 3 3
Modules (modules 0-3): 0 -2 0 1
(this is possible only when each configured module requires just one slot), the alleles of this kind of gene are represented by the numbers from 0 to n-1.
The numbers contained in the Slots array correspond to the positions of genes in the second array. The Modules array, in fact, is composed of a set of genes that represent hardware IP-Cores. The following numbers represent the codification of the alleles for this second kind of gene:
• 0: this number means that the module is not configured on the reprogrammable device, since it has not yet been placed or it has already been deleted from the system
• 1: this number indicates that the module has already been configured on the FPGA and it is still running, so at this moment it cannot be directly unloaded from the system
• -2: a module characterized by this number is a cached IP-Core. In other words it is a module that has already been placed on the reprogrammable device but is not currently used by any user-side application, thus it is possible to unload it and overwrite its slots with the configuration of a more useful IP-Core
The example shown in Figure 4.10 represents a status of the system in which
the second module (module 1) is configured on the first slot of the FPGA (slot 0)
and the fourth module (module 3) is placed on the third and on the fourth slot
(slot 2 and slot 3), while the second slot (slot 1) is free (since the first module,
module 0, is not configured).
The Modules array gives further information, indicating that the second module (module 1) is cached, while the fourth module (module 3) is still running. This means that the biggest module that can be configured starting from this status is a module that requires two slots, since it can be configured on the first two slots of the FPGA (slot 0 and slot 1), by unloading the second module (module 1), which is currently cached.
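In C, the chromosome of Figure 4.10 is simply the following pair of arrays; a slot is free when the module it points to is not configured.

/* Module status encoding from the list above. */
enum { NOT_CONFIGURED = 0, RUNNING = 1, CACHED = -2 };

/* Chromosome of Figure 4.10: one slave FPGA with four slots. */
int slots[4]   = { 1, 0, 3, 3 };             /* module index per slot */
int modules[4] = { NOT_CONFIGURED, CACHED,   /* status per module     */
                   NOT_CONFIGURED, RUNNING };

/* Slot i is free when the module configured on it is not present;
 * with the values above, only slot 1 is free.                      */
int slot_is_free(int i)
{
    return modules[slots[i]] == NOT_CONFIGURED;
}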
After the choice of the proper coding for chromosomes, genes and alleles, a suitable fitness function has to be defined. The main objective of the ROTFL Allocation Manager is to handle the configurable space of the reprogrammable device in order to avoid both a waste of slots and the refusal of the configuration of an IP-Core, which happens when there is no place where it can be configured. This means that it is desirable to keep the free slots all together, without breaking them into many smaller separate sets of free slots, since a large collection of contiguous slots makes it possible to configure bigger modules as well.
For this reason the fitness function has been defined as a number that increases by a small quantity for each free slot. This quantity starts from a default value, but it gets bigger when a free slot is followed by another free slot. On the contrary, when a free slot is followed by a slot containing a cached or a running module, the gain goes back to the default value. Moreover, to prefer solutions with a large number of cached modules, which are useful to speed up the reconfiguration process, a fixed reward has also been introduced for each cached IP-Core of the solution.
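A direct C translation of this fitness definition, with the same parameters used in Figure 4.11 (default gain of 2, increment of 1 per contiguous free slot, reward of 1 per cached module), could be the following; on the three chromosomes of the figure it yields exactly the fitness values 13, 11 and 16 discussed below.

enum { NOT_CONFIGURED = 0, RUNNING = 1, CACHED = -2 };

int fitness(const int slots[], const int modules[],
            int n_slots, int n_modules)
{
    int score = 0;
    int gain = 2;                        /* default gain per free slot */

    for (int i = 0; i < n_slots; i++) {
        if (modules[slots[i]] == NOT_CONFIGURED) {
            score += gain;               /* a free slot scores ...     */
            gain += 1;                   /* ... and raises the gain    */
        } else {
            gain = 2;                    /* occupied slot resets it    */
        }
    }
    for (int m = 0; m < n_modules; m++)
        if (modules[m] == CACHED)
            score += 1;                  /* fixed reward per cached IP-Core */
    return score;
}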
Figure 4.11 shows an example of the evaluation of the fitness function on three given chromosomes, with a default gain of 2 points, increased by 1 point for each contiguous free slot, and a fixed reward of 1 point for each cached module. The three chromosomes are very similar, but the seventh module (module 6) is placed in a different position in each solution. In the first example (A), the seventh module is located at the end of the FPGA, in the second example (B) it is configured so that it breaks the set of the last four free slots, while in the third example (C) it has been placed in the most suitable location, that is, the second slot (slot 1). Even if the number of configured IP-Cores, the number of cached modules and the total number of free slots are the same for all the solutions, the first one presents two sets of free slots (whose sizes are respectively 1 and 3 slots) with a fitness of 13, the second one three sets (whose sizes are respectively 1, 2 and 1 slots) with a fitness of 11, and the third one a single set (whose size is 4 slots) with a fitness of 16. Obviously the last solution is the most suitable one, since it is the only one that allows the configuration of a new module that requires 4 contiguous slots; in fact it presents the highest fitness within the class of the presented solutions.
Development details. The proposed genetic algorithm for the ROTFL Allocation Manager is executed each time a set of new modules has to be configured on the reprogrammable devices of the system, and it makes it possible to choose, for each module, the best location where it should be placed.
If each module can be placed in n positions, an exhaustive search with a set of m IP-Cores requires n^m evaluations of feasible solutions. With a genetic algorithm it is possible to considerably decrease the time required by the allocation process, since it works on a smaller set of solutions, trying to modify them to reach a good sub-optimal solution in a reasonable time.
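As a rough, hedged illustration with the test setup used in Chapter 5 (about n ≈ 400 candidate positions per module, requests of m = 3 modules, a population of 30 chromosomes evolved for at most 15 rounds):

    n^m \approx 400^3 = 6.4 \times 10^7 \ \text{exhaustive evaluations}
    \qquad \text{vs.} \qquad
    30 \times 15 = 450 \ \text{chromosome evaluations}

so the genetic algorithm explores several orders of magnitude fewer solutions, at the cost of sub-optimality.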
The size of the initial population is a parameter of the algorithm and it can be changed to tune the performance of the ROTFL Allocation Manager. From this population a set of chromosomes is chosen to create a new population. These chromosomes are called the parents of the offspring, which is formed through the crossover process.
The crossover task is performed by randomly choosing two parents. The new chromosome is generated by keeping the locations of the first half of the m modules from the first parent, while the other locations are taken directly from the second parent. During this phase it is possible to introduce, with a given probability, a mutation, defined as a change in the partial solutions found by the parents. In other words, a location inherited from the parents can be randomly modified, to prevent all the solutions in the population from falling into a local optimum.
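A hedged C sketch of this crossover-with-mutation step, assuming a chromosome is simply an array of m slot indices, one per module (the helper name and the uniform random mutation are illustrative):

    #include <stdlib.h>

    #define MUTATION_PROB 0.5   /* value used in the tests of Chapter 5 */

    void crossover(const int *parent_a, const int *parent_b,
                   int *child, int m, int n_positions)
    {
        for (int i = 0; i < m; i++) {
            /* first half of the module locations from the first parent,
             * the remaining ones from the second parent                 */
            child[i] = (i < m / 2) ? parent_a[i] : parent_b[i];

            /* occasionally perturb the inherited location, so that the
             * population does not collapse into a local optimum         */
            if ((double)rand() / RAND_MAX < MUTATION_PROB)
                child[i] = rand() % n_positions;
        }
    }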
Figure 4.11: Fitness evaluation examples (the Slots, Modules and Fitness arrays of the three chromosomes A, B and C, whose per-slot gains sum to 13, 11 and 16 points respectively)
4.2.3.5 The ROTFL Positioning Manager
The available position for the required module and its corresponding address space on the Wishbone bus, found by the ROTFL Allocation Manager, are the inputs of the following manager, called the ROTFL Positioning Manager. The aim of this manager is both to set up the MAC kernel module with the information concerning the Wishbone bus address space and to retrieve from the ROTFL Repository the base address and the size of the bitstream able to configure the required module at the selected location of the slave FPGA. This information is then used to set up the Reconfiguration Controller kernel module that will perform the physical reconfiguration of the slave FPGA.
Since each partial bitstream is able to configure a module only in one specific position, and this position is encapsulated in the bitstream itself, there is the need to store in memory, for each kind of IP-Core, a different bitstream for each location in which the module can be configured. To avoid this waste of space, it is possible to keep in memory just one bitstream for each IP-Core type, using this base bitstream with a relocation filter, that is, a hardware component able to shift a bitstream to make it suitable for another set of slots of the slave FPGA. The ROTFL Positioning Manager has to know whether the system in which it is working contains a hardware relocation filter, since it behaves in the following ways depending on the presence of this filter (a code sketch of this dispatch follows the list below).
• On one hand, if the hardware relocation filter is not present in the system, the ROTFL Positioning Manager has to search in the ROTFL Repository for the right class of bitstreams able to configure the requested type of IP-Core. Then it has to find, among this collection of bitstreams, the only one that can configure the selected module in the right position of the slave FPGA, that is, the one allocated by the ROTFL Allocation Manager in the previous phase.

• On the other hand, if the system contains the hardware relocation filter, the ROTFL Positioning Manager has to search in the ROTFL Repository for the base bitstream of the selected type of IP-Core. Both the retrieved bitstream and the position to which it has to be relocated, that is, the position allocated by the ROTFL Allocation Manager, are then used to initialize the relocation filter. The output of this component is a new bitstream able to configure the same type of IP-Core as the selected one, but in a different position of the slave FPGA.
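A minimal C sketch of this two-way behavior; the bitstream_t type and every helper function below are hypothetical placeholders, not the actual ROTFL API.

    typedef struct { unsigned long base_addr; unsigned long size; } bitstream_t;

    /* Hypothetical repository and filter helpers, assumed to exist. */
    extern bitstream_t rotfl_repo_lookup(int core_type, int position);
    extern bitstream_t rotfl_repo_lookup_base(int core_type);
    extern bitstream_t relocation_filter_run(const bitstream_t *base, int position);

    bitstream_t select_bitstream(int core_type, int target_position,
                                 int has_relocation_filter)
    {
        if (!has_relocation_filter)
            /* one pre-built bitstream per (core type, position) pair */
            return rotfl_repo_lookup(core_type, target_position);

        /* a single base bitstream per core type, relocated in hardware */
        bitstream_t base = rotfl_repo_lookup_base(core_type);
        return relocation_filter_run(&base, target_position);
    }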
The following step is to set up the Reconfiguration Controller kernel module with the information concerning the proper bitstream, that is, the bitstream retrieved from the ROTFL Repository if the relocation filter is not present in the system, or the adapted one if the system contains the relocation filter. This information, which consists of both the base memory address at which the bitstream file is stored and the size of the file itself, is then used by the Reconfiguration Controller to perform the actual reconfiguration of the slave FPGA. After this task the IP-Core is physically configured in the system, but the communication infrastructure has not yet been established.
To establish communication between the new module and the rest of the system, in fact, a last step has to be performed: the MAC kernel module must be set up with the address space information provided by the ROTFL Allocation Manager, which consists of the base address and the range of the address space assigned to the new module on the Wishbone bus.
4.2.3.6 The ROTFL Repository
The ROTFL Repository is a database that simplifies the management of bitstream files and of their corresponding information, such as their functionality and the name of the device driver able to manage them.
When a new module is requested, the ROTFL Daemon has to check whether the repository contains all the necessary information about this module, i.e., the base address and the size of its bitstream (or bitstreams), the size of the module itself, the required range of address space on the bus, and the name of the device driver able to manage the required module.
In particular, this information is needed by the following stages of the whole reconfiguration process.
• The bitstream base address and its size are required by the Reconfiguration Controller to perform the reconfiguration of the slave FPGA.

• The module size and the address space range are essential to search for free space both on the slave FPGAs and on the Wishbone bus; the ROTFL Allocation Manager uses this information to run the search algorithm that finds where the new module can be configured and which address space is available on the Wishbone bus to establish the communication between the module and the rest of the system.

• The name of the device driver is used by the ROTFL Module Manager, which passes this information to the LOL kernel module. The latter is responsible for the loading and the unloading of the device driver each time a new module is added to the system or an existing one is removed from it.
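As a hedged sketch, a single repository entry could gather exactly the information listed above; the structure and field names are illustrative, not the actual ROTFL definitions.

    #define DRIVER_NAME_LEN 64

    struct rotfl_repo_entry {
        unsigned long bitstream_base;      /* where the bitstream is stored;
                                              used by the Reconfiguration
                                              Controller                     */
        unsigned long bitstream_size;
        unsigned int  module_slots;        /* slots needed on the slave FPGA */
        unsigned long wishbone_range;      /* address space needed on the bus */
        char driver_name[DRIVER_NAME_LEN]; /* loaded and unloaded by the
                                              LOL kernel module              */
    };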
4.3 Concluding remarks
This chapter focuses on the software architecture of the components developed in this thesis, which constitute part of the entire design flow described in Chapter 3.
The tool described in the first part of this chapter is IPGen. This tool makes it possible to take the core logic of a component, which represents the core functionality of the whole IP-Core, and to automatically build both the register mapping and the interface with the desired bus. In this way it is possible to obtain a complete IP-Core that is ready to be plugged into the final system either as a fixed or as a reconfigurable module.
The rest of the chapter focuses on the description of the proposed software architecture. First, a class of reconfigurable hardware platforms has been introduced; these platforms can all be viewed in a uniform way from an abstract point of view, which allows the developed software architecture to be applied to every platform of the family.
After that, the Linux kernel modules infrastructure has been described. This infrastructure, which consists of several kernel modules and a library, constitutes the Linux OS reconfiguration support. The Reconfiguration Controller kernel module is responsible for the physical reconfiguration of the programmable device with a given bitstream. The MAC kernel module sets up the bus communication by configuring the Wishbone address space of the new module with the correct information. The LOL kernel module, in addition to allowing the dynamic registering and unregistering of devices, stores all the information about the registered devices. Finally, the Reconfiguration Library aims at simplifying the writing of applications that have to manage the previously described kernel modules.
In the last part of the chapter, the ROTFL architecture has been described. This architecture consists of a daemon, the ROTFL Daemon; a library, the ROTFL Library, which simplifies the interaction with the daemon; and a repository, the ROTFL Repository, which represents a sort of dynamic database. The ROTFL Daemon implements the Module Management Layer, the Allocation Layer and the Positioning Layer. Each layer is handled by a different manager that can be developed using several algorithms, whose implementation is described by a well-defined interface. In this way it is possible to choose at run-time the most suitable solution for each situation.
Finally, an example has been shown to demonstrate how it is possible to develop a different version of a manager of the ROTFL Daemon using a well-known algorithm. In particular, a new version of the ROTFL Allocation Manager has been built using a genetic algorithm.
Chapter 5
Experimental results
The goal of this chapter is to present an overall view of the experimental results of the proposed implementation of the methodologies introduced in Chapter 3.
In particular, Section 5.1 focuses on the description of the tests for the IPGen tool. This tool has been tested with several components, and all the generated IP-Cores have been physically configured on an FPGA to verify their effectiveness.
The second part of this chapter, Section 5.2, presents a prototyping platform
on which the proposed software architecture is able to run and a collection of
experimental results on the ROTFL architecture. These results describe both the
number of slices and BlockRAMs occupied by the components that are necessary to enable partial reconfiguration and the timing performance of the software architecture on the developed system.
Finally, Section 5.3 summarizes all the results presented in this chapter, in order to provide a complete overview of the performance of the described implementations and to evaluate the effectiveness of the proposed approaches.
5.1 IPGen
The methodology for the automatic generation of IP-Cores, presented in Section 3.2.2, has been developed as explained in Section 4.1 and has been tested under different Operating Systems and architectures. These tests concern several
types of components, starting from some small IP-Cores such as an adder, an XOR and two different multipliers. Moreover, more complex examples have also been examined, e.g., a Discrete Fourier Transform core, various implementations of the AES algorithm, a Siemens Mobile Communications description of a complex ALU and a video editing core that converts the image color space from RGB to YCbCr.
Table 5.1 presents some relevant results, considering both the input core, which represents the core logic, and the obtained component, that is, the final IP-Core produced by the IPGen tool. For each of them, the size in terms of 4-input LUTs and the number of occupied slices are reported, both as absolute values and as percentages of the total FPGA capacity. In addition, the time needed by IPGen to create the IP-Core is specified.
Table 5.1: IPGen tests

IP-Core                 4-input LUTs   Perc.   Slices   Perc.   Time (s)
Core: Mult1                       30      0%       26      1%
IP-Core: Mult1                   172      2%      122      2%      0.049
Core: Mult2                       64      1%       37      1%
IP-Core: Mult2                   339      4%      205      4%      0.053
Core: IrDA                        15      1%       11      1%
IP-Core: IrDA                    146      1%      103      2%      0.045
Core: FIR                        273      2%      153      3%
IP-Core: FIR                     308      3%      173      3%      0.058
Core: AES128                    4124     42%     2132     43%
IP-Core: AES128                 4314     44%     2250     46%      0.075
Core: RGB2YCbCr                 1028     10%      913     18%
IP-Core: RGB2YCbCr               848      9%      940     19%      0.063
Core: Complex ALU               1750     18%      950     19%
IP-Core: Complex ALU            2089     21%     1079     22%      0.071
On one hand, the relative overhead due to the interfacing of the core logic with the Wishbone bus is acceptable, both for the 4-input LUTs and for the occupied slices, especially when the core size is significant. This allows the generated IP-Cores to be used in the final reconfigurable system without wasting too much space on the reconfigurable devices.

On the other hand, the computation time is extremely low with respect to the whole embedded system design process. In particular, it is almost constant, 0.065 seconds on average, ranging from 0.045 seconds to 0.075 seconds.

In conclusion, the proposed flow for automatic IP-Core generation has successfully passed all the proposed tests, generating working components that can be imported into standard architectures with Wishbone bus communication to obtain a bitstream that can be directly downloaded onto an FPGA. Moreover, the IPGen tool, which implements this flow, is characterized by very good performance, introducing only a small overhead in the size of the final IP-Core.
5.2 Software architecture
To prove the correctness of the software architecture proposed in Section 4.2.3,
the ROTFL solution has been tested on the RAPTOR2000 board [25]. Using this
board, whose detailed description can be found in Section 5.2.1, it is possible to
implement several kinds of reconfigurable systems that can be associated with
the platform classes introduced in Section 4.2.1.
The first set of tests concerns the implementation of the ROTFL Allocation Manager obtained by using a genetic algorithm. This solution has been tested and compared with several other implementations of the same manager, and the final results are described in Section 5.2.2.
Tests and results concerning the whole ROTFL architecture, instead, are presented in Section 5.2.3. The aim of these tests is to obtain information on the latency introduced by the partial reconfiguration of the slave FPGAs and on the Operating System timing overhead. In this way it is possible to evaluate the timing performance of the ROTFL architecture in a real reconfigurable embedded system.
5.2.1 RAPTOR2000 board
For a prototype implementation of one of the multi-FPGA reconfigurable systems proposed in Section 4.2.1, the RAPTOR2000 hardware architecture [25] has been used. RAPTOR2000 is a prototyping platform that consists of a motherboard and up to six daughter-boards. The motherboard provides several communication infrastructures and a configuration environment for the partial and dynamic configuration of the FPGAs located on the daughter-boards.
Figure 5.1 shows the schematic of the system that has been developed by using the RAPTOR2000 board. It consists of a Xilinx Virtex-2Pro FPGA and two Xilinx Virtex-II FPGAs. The Virtex-2Pro FPGA, which is used to run the software solution, contains a PowerPC and several static hardware components, such as a memory controller, general-purpose inputs/outputs and the Reconfiguration Controller (represented by the VCM, Virtex Configuration Manager).
Figure 5.1: Multi-FPGA system on RAPTOR2000
The Virtex-II FPGAs represent the reconfigurable resources used to dynamically load hardware modules into the system. Moreover, each Virtex-II FPGA
includes a Wishbone bus to which the hardware modules are connected dynamically. The bus-bridges that are used to connect the modules to the processor system include the Medium Access Control (MAC) for the communication
with the modules.
The Reconfiguration Controller is a hardware component that represents the
Allocation Layer and a part of the Positioning Layer. A special feature of this component is its direct memory access (DMA) to the local SDRAM memory. This
enables very fast configurations when downloading bitstreams from a given position within the memory to a selected FPGA within the RAPTOR2000 system.
5.2.2 ROTFL Allocation Manager
To prove the flexibility and the adaptability of the ROTFL architecture, one of
its components, the ROTFL Allocation Manager, has been developed using a genetic algorithm, as described in Section 4.2.3.4. To evaluate the performance
of the proposed solution, it has been tested and the obtained results have been
compared with those achieved by other implementations of the same manager
that use different algorithms.
In the performed tests, the only parameter imposed by the system is the number of reconfigurable slots. This number represents the size of the reprogrammable devices divided by the size of each slot. Obviously, the duration of the exhaustive algorithm considerably increases with the number of slots of the system, while both the random and the genetic algorithm are almost independent of this parameter: there is a slight performance reduction due to the increased size of the data structures, but the overall complexity of these algorithms remains the same.
The number of modules that have to be placed at the same time on the reconfigurable system also affects the presented algorithms in a similar way, but the difference is that this parameter can be chosen and modified either at compile-time or at run-time.
Moreover, the following set of parameters can be used to specifically tune
the performance of the genetic algorithm.
• Minimum fitness: this is the minimum fitness that allows a solution to be
chosen as the final solution of the algorithm before the maximum number
of rounds has been processed. If this limit is too high, then the maximum
number of rounds is always reached.
• Maximum number of rounds: this number represents the number of evolution cycles that are performed to obtain the final solution. On one hand, if this number is too small, the final solution may have a very low fitness. On the other hand, if it is too large, the performance of the algorithm can be drastically reduced.
• Initial population size: this is the number of chromosomes of the initial, randomly created population. Obviously, a bigger initial population is more likely to contain a good solution than a smaller one.
• Selection size: this number represents the number of parent chromosomes that are kept in the following generation and that are used to form the new offspring, preserving high-fitness solutions across generations.
• Crossover probability: this is the probability of performing a crossover between the two parents during the reproduction phase, mixing two good solutions in the hope of forming a better one. If no crossover is executed, one of the two parents is copied directly into the new population.
• Mutation probability: this is the probability of performing a mutation during the generation of the offspring, randomly modifying new chromosomes.
All these parameters can be modified to tune the genetic algorithm performance or to adapt the algorithm to a specific situation. The following results have been obtained by using a system with 400 slots and by configuring the genetic algorithm with a minimum fitness of 10000 points, a maximum of 15 rounds, a population of 30 individuals, a selection size of 15 individuals, and both a crossover and a mutation probability of 50%.
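The parameters just listed can be gathered in a hypothetical C configuration structure; the struct itself is illustrative, while the values are the ones used in the tests below.

    struct ga_config {
        int    min_fitness;      /* early-exit threshold                  */
        int    max_rounds;       /* upper bound on evolution cycles       */
        int    population_size;  /* chromosomes in the initial population */
        int    selection_size;   /* parents kept for the next generation  */
        double crossover_prob;   /* probability of mixing two parents     */
        double mutation_prob;    /* probability of perturbing a gene      */
    };

    static const struct ga_config test_config = {
        .min_fitness     = 10000,
        .max_rounds      = 15,
        .population_size = 30,
        .selection_size  = 15,
        .crossover_prob  = 0.5,
        .mutation_prob   = 0.5,
    };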
To evaluate the performance of the proposed algorithm, both the timing performance and the quality of the final results of the following collection of different implementations have been examined.
• The null implementation of the ROTFL Allocation Manager is the base implementation, useful just to estimate the overhead due to the testbench. It follows a very simple behavior, since it always answers the test application by refusing the requested module. The time taken by this implementation is the time spent in the creation of the requests and in the management of the FPGA structure by the test application.
• The random solution is an implementation that tries to place the requested
module in a random feasible location. If that position is free, then the
module can be configured, otherwise the request is directly refused.
• The genetic algorithm is the proposed solution that implements the
ROTFL Allocation Manager using the previously described approach.
• The exhaustive solution, finally, tries all the possible placement combinations to find the solution that minimizes the fragmentation of the slave FPGAs. This implementation is obviously the slowest, but it is also the one that provides the best final results.
Table 5.2 shows the performance of the previously presented algorithms. The total time, expressed in seconds, is based on a test that performs 100 rounds, each consisting of 100 insertions of 3 modules, in a system with 400 reconfigurable slots. The normalized time, also expressed in seconds, is the time effectively spent by the algorithm, computed by subtracting the overhead of the test application from the total time. Finally, the average time for each module is the estimated time required by each single module insertion.

Table 5.2: Temporal performance

Algorithm     Total time (s)   Normalized time (s)   Average time per module (ms)
Null                      69                     0                              0
Random                    74                     5                           0.17
Genetic                  104                    35                           1.17
Exhaustive               157                    88                           2.93
The second table, Table 5.3, presents the number of refused modules and the total caching reward for each algorithm. In addition, it also shows the normalized values of these parameters, calculated by subtracting from the total values the results of the exhaustive algorithm, which are the best obtainable.

Table 5.3: Final results

Algorithm     Refused modules   Normalized refused modules   Caching reward   Normalized caching reward
Null                    10000                         6527                0                      -92247
Random                   7392                         3919            22465                      -69782
Genetic                  4670                         1197            67614                      -24633
Exhaustive               3473                            0            92247                           0
Finally, Table 5.4 describes the temporal improvement that the genetic algorithm and the random algorithm are able to obtain with respect to the exhaustive algorithm. Even if the random solution brings a considerable temporal improvement (1760%), it cannot be chosen as a suitable solution, since its results, in terms of both refused modules and caching reward, are really unacceptable (a 40%-76% worsening). On the other side, the genetic solution obtains a more modest temporal improvement (251%), but it keeps the worsening of the final results much lower (12%-27%).

Table 5.4: Comparison with the exhaustive algorithm

Algorithm   Temporal improvement (%)   Rejection worsening (%)   Caching worsening (%)
Random                          1760                      39.2                   75.6
Genetic                          251                      12                     26.7
In conclusion, the genetic algorithm seems to implement the best compromise between temporal performance and effectiveness of the final results, which consist of both the number of refused modules and the number of IP-Cores kept in cache.
5.2.3 ROTFL architecture
The performance of the whole ROTFL architecture is affected mainly by the
latency introduced by the partial reconfiguration of the slave FPGAs and by the
overhead caused by the Operating System.
Furthermore, additional FPGA resources are required to enable partial reconfiguration, i.e., the Virtex Configuration Manager (VCM) introduced in Section 5.2.1, which represents the Reconfiguration Controller. The VCM module uses 1726 slices (18.6%) and 6 BlockRAMs (6.8%) of the Xilinx Virtex-2Pro FPGA (XC2VP20). The high area requirement is caused mainly by the integrated readback functionality, which will be used in future implementations.
In contrast, the additional resources required for partial reconfiguration in the other components of the static architecture can be neglected, since the resource overhead in these components is smaller than 1%.
The latency for partial reconfiguration introduced by the hardware components is composed of the following parts.
• First, a static time that is required to initiate the DMA transfer of the partial
configuration bitstreams from the SDRAM to the configuration interface
(VCM), plus the time required to initialize the configuration interface of
the FPGA and to flush the configuration buffer at the end of the configuration.
• Second, the time needed to download the bitstream to the FPGA. This
time depends on the size of the reconfigurable hardware modules.
The static time is 158 clock cycles before reconfiguration and 824 clock cycles for buffer flushing after reconfiguration. Moreover, the number of clock
cycles needed to reconfigure one CLB column of the used Xilinx Virtex-II FPGA
(XC2V4000) is 18,128. Therefore, the time to reconfigure a hardware module in
the proposed system is
(158 + n · 18128 + 824) · 20 ns    (5.1)

where n is the number of reconfigured CLB columns and 20 ns is the reconfiguration clock period used in the prototype implementation. Table 5.5 shows the reconfiguration time introduced by the hardware for typical module sizes.

Table 5.5: Hardware reconfiguration latency

Columns    Latency (µs)
      4         1469.88
      8         2920.12
     12         4370.36

These modules only use CLB columns; the download time changes insignificantly if embedded multipliers or BlockRAMs are used. If the BlockRAM contents also have to be written during reconfiguration, an additional 1054.72 µs apply per BlockRAM column. Equation 5.1 assumes that no data compression is used for the partial bitstreams and thus gives worst-case times.
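Equation 5.1 can be transcribed directly into a small C helper; as a sanity check, the sketch below reproduces the Table 5.5 values (e.g. 4 columns yield 1469.88 µs).

    #include <stdio.h>

    /* Hardware reconfiguration latency of Equation 5.1, in microseconds. */
    double reconfig_latency_us(int n_columns)
    {
        const long setup_cycles  = 158;    /* DMA/interface initialization */
        const long flush_cycles  = 824;    /* configuration buffer flush   */
        const long column_cycles = 18128;  /* one CLB column (XC2V4000)    */
        const double clock_ns    = 20.0;   /* reconfiguration clock period */

        long cycles = setup_cycles + (long)n_columns * column_cycles + flush_cycles;
        return cycles * clock_ns / 1000.0; /* ns -> us */
    }

    int main(void)
    {
        for (int n = 4; n <= 12; n += 4)   /* 4, 8 and 12 columns */
            printf("%2d columns: %.2f us\n", n, reconfig_latency_us(n));
        return 0;
    }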
On the other hand, there is the time overhead caused by the Operating System configuration support. Table 5.6 shows the performance of the ROTFL software architecture, broken down into the following tasks.
Table 5.6: ROTFL performance

Task                             Time (µs)   Notes
Daemon startup                         500   once
Device driver setup                    650   once each driver
Module loading (if not cached)        3450   each loading
Module loading (if cached)            2500   each loading
Read                                   3.6   4 bytes read
Write                                  2.7   4 bytes write
• The first task to be executed is the ROTFL Daemon startup, which initializes all data structures and prepares the ROTFL Daemon to accept configuration requests; it takes around 500 µs, but it needs to be performed just once, when the ROTFL Daemon starts.
• The second task is the device driver setup, which loads the correct driver and initializes all necessary devices for a specific module; it takes around 650 µs and is executed once for each kind of module.
• Module loading time differs depending on whether the requested module is cached: in the first case it takes around 2500 µs, otherwise around 3450 µs. To be more precise, the module used to calculate these results is 4 columns wide.
• Finally, reading from and writing to a configured module take around 3.6 µs for a 4-byte read and 2.7 µs for a 4-byte write.
5.3 Concluding remarks
The results concerning the IPGen tool show that the generation of the IP-Core introduces, with respect to the core logic, only a small resource overhead, which can even be neglected when the size of the original core is significant. Moreover, the performance of this tool is also very good, since on average it is possible to obtain a complete IP-Core in just 0.065 seconds.
On the other hand, the results of the ROTFL architecture prove that the Operating System temporal overhead is acceptable, since the duration of a reconfiguration performed by using the OS reconfiguration support is comparable to
the hardware reconfiguration latency.
In the worst case, in fact, using a module that is just 4 columns wide, the hardware reconfiguration latency is around 1500 µs, while the same reconfiguration performed through the ROTFL architecture takes around 3500 µs (including the delay introduced by socket communication). Furthermore, considering the scenario where the requested module is cached, independently of its size, the performance can be considerably improved, since the whole reconfiguration process constantly takes just 2500 µs, as shown in Figure 5.2.
Figure 5.2: Module cached scenario (time axis 0-2 ms: socket communication, ROTFL Daemon, socket communication)

Finally, using wider modules it is possible to completely hide the software overhead due to the ROTFL Daemon even when the requested module is not cached, since the hardware reconfiguration latency grows linearly with the module size, while the ROTFL overhead remains constant, as shown in Figure 5.3.
Figure 5.3: Reconfiguration latencies (time axis 0-6 ms: socket communication, ROTFL Daemon and socket communication, followed by the hardware reconfiguration of 4-, 8- and 12-column modules)
Chapter 6
Conclusions and future work
Previous chapters have introduced a methodology for reconfigurable embedded systems design that strongly reduces both the time to market of the final implementation of the system and the effort required for its development. This methodology has been described through a flow integrated with two main components that represent the original contribution of this thesis: the automatic IP-Core generation and the Operating System reconfiguration support.
• The automatic IP-Core generation task can be achieved by using the IPGen tool, whose goal is the definition of an automated flow for the interfacing process of IP-Cores. In this way it is possible to obtain, starting from a core functionality, a complete component that is ready for bus communication, without requiring user interaction.
Preliminary results show that the proposed approach provides the design flow with a simple and powerful way to automatically obtain working IP-Cores, which can be used as fixed or reconfigurable modules of the final system. The IPGen tool, in fact, has been tested using several component cores, and the generated modules have been plugged into real systems and downloaded onto an FPGA to verify their correctness.
The performance achieved by IPGen is good, since the resource overhead introduced to obtain bus-compatible IP-Cores is very small and, in some cases, in particular with large components, it can be neglected. The temporal performance is also excellent, given that on average the IP-Core generation phase takes around 0.065 seconds.
• The proposed OS reconfiguration support has been developed to be applicable to a wide class of reconfigurable scenarios, characterized by the presence of multi-FPGA reconfigurable systems. The presented scenarios can also be seen as basic components of a more complex distributed system, where each of them can be considered as a node of the distributed solution.
Moreover, for the development of the whole ROTFL architecture, a layered structure has been chosen. This solution brings several remarkable benefits to the final system. First, it is possible to exploit a high-level and very effective user interface that makes use of common OS concepts, such as the assignment of names like /dev/module_0 to each configured module, while completely hiding from the user the dynamic aspects, and the associated complications, of reconfigurable hardware. In addition, the proposed solution allows many resources (i.e., many FPGAs) to be combined into one unique virtual hardware component, allowing the ROTFL Daemon, that is, the centralized manager, to handle flexible and scalable hardware architectures. Furthermore, the layered structure of the ROTFL architecture makes it possible to easily adapt each of its components to a specific situation without modifying the whole software architecture, leading both to high customizability and reusability and to low error-proneness.
Thanks to these aspects of the ROTFL architecture, it is possible to develop, as future work, a collection of different versions of the ROTFL Module Manager, of the ROTFL Allocation Manager and of the ROTFL Positioning Manager that use disparate algorithms. These managers have to strictly respect the defined interface, making it possible to choose the most suitable algorithm for each specific situation without changing the structure of the whole ROTFL architecture.
Finally, it is possible to imagine a scenario where the ROTFL Repository might be extended to support the dynamic management of both bitstream files and module information. In this way it will be possible to load a new IP-Core class into the ROTFL architecture at run-time as well, and not only during the development phase.
Bibliography
[1] Two flows for partial reconfiguration: module based or difference based, Xilinx
Inc., XAPP290, September 2004
[2] Development system reference guide, Xilinx Inc., 2005
[3] Computer Architecture: A Quantitative Approach, J. Hennessy, D. Patterson, Morgan Kaufmann, San Mateo, 1990
[4] The General Rapid Architecture Description, Carl Ebeling, University of Washington Technical Report: UW-CSE-02-06-02, 2002
[5] Rapid-C Manual, Carl Ebeling, University of Washington Technical Report:
UW-CSE-02-07-06, 2002
[6] A Configurable Pipelined State Machine as a Hybrid ASIC and Configurable Architecture, Peter Zipf, Claude Stötzler, Manfred Glesner, Institute of Microelectronic Systems, Darmstadt University of Technology, Germany, 2004
[7] Configurable Architecture for System-Level Prototyping of High-Speed Embedded
Wireless Communication Systems, Visvanathan Subramanian, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, 2003
[8] A Configurable FPGA-based Hardware Architecture for Adaptive Processing of
Noisy Signals for Target Detection Based on Constant False Alarm Rate (CFAR)
Algorithms, René Cumplido, César Torres, Santos López, National Institute
for Astrophysics Optics and Electronics, Puebla, Mexico, 2004
[9] Configurable, High throughput LDPC decoder Architecture for Irregular codes,
Marjan Karkooti, Yang Sun, Joseph. R. Cavallaro, Center for Multimedia
Communications, ECE department
[10] PipeRench: A reconfigurable architecture and compiler, Seth Copen Goldstein, Herman Schmit, Mihai Budiu, Srihari Cadambi, Matt Moe, R. Reed Taylor, IEEE Computer, Vol. 33, No. 4, April 2000
[11] The MorphoSys Dynamically Reconfigurable System-On-Chip, G. Lu, E.M.C. Filho, M. Lee, N. Bagherzadeh, and F.J. Kurdahi, 1st NASA/DoD Workshop on Evolvable Hardware (EH '99), July 19-21, 1999, Pasadena, CA, USA, IEEE Computer Society, 1999
[12] The Splash 2 Reconfigurable Processor and Applications, Jeffrey M. Arnold, Duncan A. Buell, Dzung T. Hoang, Daniel V. Pryor, Nabeel Shirazi, Mark R. Thistle, Proceedings of the International Conference on Computer Design, CS Press, 1993
[13] Garp: A MIPS Processor with a Reconfigurable Coprocessor, John R. Hauser and John Wawrzynek, IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM '97, April 16-18, 1997
[14] The Garp Architecture and C Compiler, Timothy J. Callahan, John R. Hauser, John Wawrzynek, Computer, vol. 33, no. 4, pp. 62-69, April 2000
[15] Baring it all to Software: Raw Machines, Elliot Waingold, Michael Taylor,
Devabhaktuni Srikrishna, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim,
Matthew Frank, Peter Finch, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, Anant Agarwal, IEEE Computer, September 1997
[16] The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs, Michael Bedford Taylor, Jason Kim, Jason Miller,
David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffmann, Paul
Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark
Seneski, Nathan Shnidman, Volker Strumpen Matt Frank, Saman Amarasinghe, Anant Agarwal, IEEE Micro, Mar/Apr 2002
[17] Managing partial dynamic reconfiguration in Virtex II Pro FPGAs, Philippe Butel, Gerard Habay, Alain Rachet, Xcell Journal, Fall 2004
[18] System-level modeling of dynamically reconfigurable hardware with SystemC,
Antti Pelkonen, Kostas Masselos, Miroslav Cupék, IPDPS ’03: Proceedings
of the 17th International Symposium on Parallel and Distributed Processing, Washington, DC, USA, 2003, IEEE Computer Society, 2003
[19] BORPH Operating System, Berkeley Emulation Engine 2 Operating System, http://bee2.eecs.berkeley.edu/wiki/Bee2OperatingSystem, Berkeley, June 2006
[20] Embedded Linux as a platform for dynamically self-reconfiguring systems-onchip, John Williams and Neil Bergmann, Proceedings of the International
Conference on Engineering of Reconfigurable Systems and Algorithms,
CSREA Press, June 2004
[21] A Flexible Platform for Real-Time Reconfigurable Systems on Chip, N. W.
Bergmann, J. A. Williams, P. J. Waldeck, Proceedings of the International
Conference on Engineering of Reconfigurable Systems and Algorithms,
Las Vegas, USA, 2003
[22] The Egret Platform For Reconfigurable System-On-Chip, Neil W. Bergmann
and John Williams, Proceedings of the IEEE International Conference on
Field-Programmable Technology, IEEE, 2003
[23] A software platform to support dynamically reconfigurable Systems-on-Chip under the GNU/Linux operating system, Alberto Donato, Master’s thesis, Politecnico di Milano, 2005
[24] YaRA: un’architettura riconfigurabile basata sull’approccio monodimensionale,
Alessio Montone, Antonio Piazzi, Bachelor’s thesis, Politecnico di Milano,
2006
[25] A Prototyping Platform for Dynamically Reconfigurable System on Chip Designs,
Heiko Kalte, Mario Porrmann and Ulrich Rückert, Proceedings of the IEEE
Workshop Heterogeneous reconfigurable Systems on Chip (SoC), 2002
Acknowledgments

I warn you right away that the number of people thanked in this section will be quite high.

The reason for this decision is that I do not believe there is anyone who is not worth thanking, or who has not engraved their name deeply enough into my life to deserve a thank-you.

I would like to start with all those who have guided me throughout my school career, in particular my primary school teachers, Laura and Tiziana, my high school professors, Costantini, Pilone (the pheasant was really good!) and Rigotti, and my university professors, Ferrandi and Sciuto; continuing with all my schoolmates and university classmates: Anna, Fly (nobody will notice that I have just rebooted this computer...), Gianluis (the important thing is never to overdo it), Max, Nino, Quintana (the Care Bear), Randa, Ritz and Teo; and then with the friends of the MICROLAB: Ack, Ale (of IP-Gen), Ale (Mele), Birdack, Carlo (who can go back one day in time by traveling around the world!), Chiara (without whom I would not have made it...), Davide, Diego (my iBook twin! :D), Edo, Francesca (ah, those notes...), Frascino (a.k.a. Gattuso), Gegi, Ics, Il Supremo, Leo, Katia, Malex (written with the small 2), Marco (congratulations on the five-a-side football and on OGame), Osprey, Roberto, Shumi (truly identical), Teo (Germany... -.-), Teo (of IP-Gen), Tia, Vik and Zeph.

Nor can I forget all the fantastic people I met in Germany, among whom: Anne (both of them), Annett, Annkatrin (the creature...), Boris, Cheng Yee (with the little red sombrero and a bucket of sangria), Christina (uh uh...), Hanna (ufficizmo), Jan (our indispensable buddy), Jenny (thanks for the photos), Jens (master of FPGA Editor), Kim (both of them), Markus (don't worry, by tomorrow for sure...), Miriam (both of them), Nadine, Verena and finally the legendary Su (sasaaa).

Why not also mention man's best friends, who have given me so many moments of happiness, filling some of the saddest moments with joy: Birillo, Licia (or Felicina, with the little bows...), Mila and Yoshi (Bodino Bodenaus).

There are also people who are always present, caring and kind, and who deserve heartfelt thanks, such as Chiara, Giulia, Andrea (Gardaland was great!), Lorena, Donato and Ing. Jannelli.

Of fundamental importance, moreover, have been the affection and closeness of all my relatives, always ready to spur me on, advise me and support me, among whom: Elio, Mino, Paolo, grandma Tina, grandpa Ettore, grandma Michela, grandpa Vincenzo, Salva, Oscar (companion of a thousand adventures, in real life and beyond), aunt Mina, lo Zione, Mum and Dad.

Finally, heartfelt thanks to my dearest friends, with whom I grew up and keep growing day after day: Laura, Nadia, Marta, Flavia, Katia, Paolo (with the checkered shirt), Tanis (by now Jacopo's official name!), Luca (no longer Ciccio), Valeria, Sabri, Fuca (also known as Johnfuc), Ale, Geppo (also known as Roby :D) and Max (truly a fixed point of reference).

One last, special thank-you is duly dedicated to Marco (better known as Santa), who helped me crown this dream in the best possible way, devoting to me every moment of his free time (actually, let me correct myself: I do not think Marco has ever had any free time, only some moments in which he was less busy than others), even if often with substantial delays of several hours... but always and in any case at my side, at every instant: thanks for everything.

... and the list of people I am grateful to would not be over yet, many others have accompanied me along this long journey, but I will have to limit myself to thanking them all together, since, as Max says, the space dedicated to the acknowledgments is about to en
Written with LaTeX 2ε and BibTeX
Printed on September 29, 2006