Politecnico di Milano
Faculty of Information Engineering
Master of Science program in Computer Engineering

A novel methodology
for dynamically reconfigurable
embedded systems design

Advisor: Prof. Donatella Sciuto
Co-advisor: Ing. Marco Domenico Santambrogio

Master's thesis by
Vincenzo Rana
Student ID 674672

Academic Year 2005/2006
to you
Riassunto della tesi (Thesis summary)

Over the last few years, the embedded digital systems scenario has been considerably influenced by the development of dynamically reconfigurable architectures. These architectures introduce an additional degree of freedom into the design flow, thanks to which it becomes possible to greatly increase the flexibility of the developed systems. Today, several classes of applications could benefit, both in terms of costs and of services, from the ability to modify their hardware functionalities, which are much faster than the corresponding software implementations, even after the production phase, so as to keep pace with changing user needs, with variations in data encodings or with the evolution of communication protocols.

This is made possible by the use of reprogrammable devices such as FPGAs (Field Programmable Gate Arrays). The distinctive feature of these devices is that they can be dynamically reconfigured even while in use. Some of them also make it possible to exploit partial reconfiguration, which involves only a portion of the whole device, while the remaining part, not directly involved in the reconfiguration process, can keep performing its functionality without any interference.

One of the reconfiguration approaches described in [1] is the module-based one, characterized by the idea of dividing each reprogrammable device into a certain number of parts, each of which is called a reconfigurable slot, or simply a slot. In this scenario one or more slots can be used to configure on the device a component able to perform a specific functionality, called a module. A fundamentally important aspect is the possibility of guaranteeing the correct operation of the modules configured on slots not involved in the reconfiguration process, even while this process is running.

A second method for carrying out a reconfiguration is the difference-based one, which does not require the definition of slots and modules and therefore involves a smaller design effort. However, this approach is adequate only when the differences between one configuration and the next are small, mainly because the process it is based on is suitable only for introducing minor changes into the system.
In the design and development of dynamically reconfigurable embedded digital systems, the approach that seems to lead to the most satisfactory results is, as described in [2], modular-based design. The main idea behind this approach is to consider the system specification as composed of a set of mutually independent components (called IP-Cores, Intellectual Property-Cores), which are synthesized individually and finally assembled together to produce the desired system. The definition of these IP-Cores recalls the concept of module in the module-based reconfiguration approach, to which modular-based design is therefore strictly related.

Each IP-Core, which can thus be regarded as a module to be configured into a given set of slots, consists of two distinct parts: the core logic and the communication logic. The first component, often referred to simply as the logic, implements the primary functionality of the whole module, while the second allows the module to be plugged into a complex system and to interact with the other elements of that system, for example other IP-Cores.

The dynamic reconfiguration of these modules is therefore the primary key to giving the developed system a high level of flexibility. To simplify the management of these reconfiguration processes it is often useful to employ a software controller, which can be developed either as a stand-alone application or with the support of an Operating System.

• The first solution, which involves the implementation of a dedicated application, is mainly oriented towards the creation of a specific solution optimized for a single particular problem. However, this choice requires a huge investment in terms of both design and development effort, besides considerably increasing the time needed to carry out these phases, and is therefore advisable only in particular circumstances.

• On the contrary, the second solution can be adopted during prototyping or to enhance the flexibility of the whole system, since in this way it is possible to exploit the classical services offered by an Operating System, such as process scheduling techniques or inter-process communication mechanisms, applying them to simplify and optimize the reconfiguration management.

The aim of this thesis is the definition of a methodology able to fully describe the whole modular design process of a dynamically reconfigurable embedded digital system, and to guide its development, starting from the high-level specification of the original application. Among the key features of the proposed approach is the possibility of providing the designer with a tool that both considerably reduces the time needed to develop the system and improves, as well as simplifies, the development process itself.

To achieve this objective the BE-DRESD flow has been defined, composed of the following series of elements, each dedicated to a specific functionality.
• The input of this flow consists of the high-level specification of an application able to solve a given problem and of its hardware description, for example a VHDL (Very high speed integrated circuit Hardware Description Language) description. These initial descriptions are analyzed by DRESD-HLR (DRESD High-Level Reconfiguration) in order to create a graph and to extract information about recurrent structures, which will be used in the subsequent phases of the flow.

• Another component, called DRESD-BE (DRESD Back-End), exploits the previously obtained information to generate the hardware architecture on which the system will be based, together with the hardware modules, which can be statically inserted into the fixed part of the architecture or dynamically configured in the final system.

• The creation of the software part of the developed system is instead handled by DRESD-SW (DRESD SoftWare), through which it is possible to obtain both the stand-alone version and the one based on an Operating System able to manage the reconfiguration processes.

• In addition to these components, the BE-DRESD flow also contains DRESD-VAL (DRESD Validation), which consists of two applications, SyCERS and BAnMaT (Bitstream Analyzer and Manipulator Tool), and is used to validate the results of the previous phases and to obtain information able to guide the refinement cycle of the developed solution.

• A further component of the main flow is DRESD-DB (DRESD DataBase), a sort of database that provides all the other elements of the flow with the information about the device for which the system has to be designed.

• The last phase is carried out by DRESD-TM (DRESD Technology Management) and consists in the generation of the final solution, composed of the hardware part, produced by DRESD-BE, of the software part, produced by DRESD-SW, and of the information needed for their physical placement on the hardware platform in use.
The innovative contribution of this thesis consists, in addition to the definition of the BE-DRESD flow for the design and development of dynamically reconfigurable systems, in the integration of the pre-existing elements contained in this flow with a new series of infrastructures able to fill the gaps found in the current state of the art.

In particular, the phase carried out by DRESD-BE has been completed with the introduction of IPGen (IP-Core Generator), an application able to use the information extracted by DRESD-HLR, namely the logic needed to implement the extracted functionalities, to automatically generate the IP-Cores, that is, the modules containing both the core logic and the communication logic towards the Wishbone Bus. The modules obtained through IPGen can therefore be used directly both as static components of the fixed part of the architecture and as dynamically configurable modules within the developed system.

The second innovative contribution of this thesis is the creation of DRESD-SW, which consists both in the extension of the Linux Operating System with a support for dynamic reconfigurability and in the development of a centralized reconfiguration manager for that Operating System.
• The reconfiguration support consists of the kernel module that manages the Reconfiguration Controller, the hardware controller that performs the physical reconfiguration of the reprogrammable devices; of the kernel module for the interaction with the MAC (Media Access Control), which is responsible for the address space on the Wishbone Bus; of the kernel module called LOL (Load On Linux), which handles the dynamic addition and removal of components within the system, storing their main information; and of the Reconfiguration Library, whose purpose is to simplify the use of the kernel modules just presented (a minimal kernel-module sketch is given after this list).

• The centralized reconfiguration manager is the ROTFL Daemon (Reconfiguration Of The FPGA under Linux), which handles the requests for the addition or removal of a module coming, through a socket-based communication, from the ROTFL Library, the library that every application has to include in order to perform a dynamic reconfiguration. These requests are handled by the three elements that compose the ROTFL Daemon: the ROTFL Module Manager, which implements a sort of cache of the configured modules; the ROTFL Allocation Manager, whose purpose is to search for the set of slots to be used to configure the requested module; and the ROTFL Positioning Manager, whose goal is to select the correct bitstream able to configure the requested module in the position chosen by the ROTFL Allocation Manager.
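As a point of reference for the kernel-side support just described, the following is a minimal sketch of the general shape a Linux kernel module takes; the module name, log messages and comments are illustrative placeholders, not the actual LOL or Reconfiguration Controller sources.

    /* Minimal Linux kernel module skeleton of the kind the reconfiguration
     * support (Reconfiguration Controller, MAC, LOL) is built on.
     * Name and messages are placeholders, not the thesis code. */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>

    static int __init reconf_support_init(void)
    {
        printk(KERN_INFO "reconf_support: module loaded\n");
        /* A real module would register a character device here, so that
         * user-space libraries can issue reconfiguration requests. */
        return 0;
    }

    static void __exit reconf_support_exit(void)
    {
        printk(KERN_INFO "reconf_support: module unloaded\n");
    }

    module_init(reconf_support_init);
    module_exit(reconf_support_exit);

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Skeleton of a reconfiguration-support kernel module");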
Thanks to the development of these new components of the BE-DRESD flow and to the proposed methodology, it becomes possible to automatically generate the IP-Cores, starting from their core logic, to include in the final solution an Operating System able to handle dynamic reconfigurability and, finally, to exploit a set of drivers through which a simple but powerful communication channel with the dynamically configured modules can be established.

The thesis is organized into six chapters. The first one, Chapter 1, introduces the embedded digital systems scenario, with particular reference to the possibility of extending such systems through the use of dynamic reconfigurability.

Chapter 2 presents a study of the state of the art in the field of reconfigurable embedded digital systems. The first part of the chapter focuses on the main configurable and reconfigurable platforms found in the literature. The analysis of these platforms, each of which is manually developed and specifically optimized for the solution of a particular problem, makes evident the lack of a flow able to abstract and automate the development process so as to fully exploit the potential of such systems. The second part of the chapter describes the most representative development methodologies, which try to remedy this lack without fully succeeding, since they are limited to a partial view of the flow or confined to a level of abstraction too high to automate a real development process. The last part of the chapter presents the most important reconfiguration supports for Operating Systems, in which the absence of a DMA (Direct Memory Access) service and the lack of a centralized reconfiguration manager have been observed.

The aspects that emerged from the analyses carried out in the previous chapter lead, in Chapter 3, to the presentation of the methodology adopted to define the proposed design flow. In particular, the first part of the chapter emphasizes the automatic generation of the IP-Cores, given their basic functionality, while the second part focuses on the aspects related to the reconfiguration support in Operating Systems, such as the management of module reconfiguration, of the automatic loading of the drivers needed by the configured modules, and of their communication with the whole system.

The purpose of Chapter 4 is to describe in detail the physical implementation of the methodology proposed in the previous chapter, placing considerable emphasis both on the integration of the design flow with the automatic generation of the modules, which can be used as fixed or reconfigurable components of the final system, and on the development of a software architecture based on the Linux Operating System and composed of a series of kernel modules, libraries and a centralized reconfiguration manager, able to exploit the dynamic reconfiguration mechanisms in embedded systems.

Chapter 5 introduces a large collection of experimental results, making the validation of the proposed methodology possible. The first part of the chapter is devoted to the implementation of the tool for the automatic generation of the IP-Cores, while the second part presents the hardware platform used for the development of the software architecture and a wide range of experimental results concerning the software architecture itself.

Finally, Chapter 6 draws the final conclusions about the proposed methodology and its implementation, highlighting some possible extensions and future work that can broaden and improve the approach described in this thesis.
Contents
Riassunto della tesi (Thesis summary)

1 Introduction

2 State of the art
   2.1 Configurable systems
        2.1.1 Reconfigurable Pipelined Datapath
        2.1.2 Configurable Pipelined State Machine
        2.1.3 Configurable Architecture for High-Speed Communication Systems
        2.1.4 Configurable FPGA-Based Hardware Architecture for Adaptive Processing of Noisy Signals for Target Detection based on Constant False Alarm Rate (CFAR) Algorithms
        2.1.5 Configurable, High-Throughput LDPC Decoder Architecture for Irregular Codes
        2.1.6 Common features and limits of configurable systems
   2.2 Reconfigurable systems
        2.2.1 PipeRench
        2.2.2 MorphoSys
        2.2.3 Splash
        2.2.4 Garp
        2.2.5 Raw Architecture Workstation
        2.2.6 Common features and limits of reconfigurable systems
   2.3 Development methodologies
        2.3.1 RECONF2
        2.3.2 ADRIATIC
        2.3.3 Common features and limits of development methodologies
   2.4 Software reconfiguration supports
        2.4.1 Embedded Linux as a platform for dynamically self-reconfiguring systems-on-chip
        2.4.2 Caronte
        2.4.3 BORPH
        2.4.4 Common features and limits of software reconfiguration supports
   2.5 Concluding remarks

3 Proposed methodology
   3.1 BE-DRESD flow
   3.2 DRESD-BE
        3.2.1 Cores handling
        3.2.2 Automatic IP-Core generation
   3.3 DRESD-SW
        3.3.1 Reconfiguration layer
        3.3.2 Dynamic reconfiguration management
        3.3.3 IP-Cores devices access
              3.3.3.1 Dynamic device drivers loading and unloading
              3.3.3.2 IP-Core user-side drivers
   3.4 Concluding remarks

4 Design flow software development
   4.1 IPGen
   4.2 Software architecture
        4.2.1 Underlying platform
        4.2.2 Linux kernel modules infrastructure
              4.2.2.1 The Reconfiguration Controller kernel module
              4.2.2.2 The MAC kernel module
              4.2.2.3 The LOL kernel module
              4.2.2.4 The Reconfiguration Library
        4.2.3 The ROTFL architecture
              4.2.3.1 The ROTFL Library
              4.2.3.2 The ROTFL Daemon
              4.2.3.3 The ROTFL Module Manager
              4.2.3.4 The ROTFL Allocation Manager
              4.2.3.5 The ROTFL Positioning Manager
              4.2.3.6 The ROTFL Repository
   4.3 Concluding remarks

5 Experimental results
   5.1 IPGen
   5.2 Software architecture
        5.2.1 RAPTOR2000 board
        5.2.2 ROTFL Allocation Manager
        5.2.3 ROTFL architecture
   5.3 Concluding remarks

6 Conclusions and future work

Bibliography
List of Tables
2.1 Configurable systems features
2.2 Reconfigurable systems features
5.1 IPGen tests
5.2 Temporal performance
5.3 Final results
5.4 Comparison with the exhaustive algorithm
5.5 Hardware reconfiguration latency
5.6 ROTFL performance
List of Figures
2.1 PipeRench reconfigurable pipeline
2.2 MorphoSys reconfigurable processor architecture
2.3 Garp reconfigurable processor architecture
2.4 RECONF2 design flow
2.5 ADRIATIC design flow
2.6 IP-Core Manager
3.1 BE-DRESD flow
3.2 YaRA Modular Architecture Creation
3.3 IP-Core schematic
3.4 DRESD-SW design flow
3.5 Drivers hierarchy
4.1 Reading process diagram
4.2 Writing process diagram
4.3 Multi-FPGA scenarios
4.4 Linux kernel modules infrastructure
4.5 Reconfiguration Controller registers
4.6 Command Register
4.7 Software Architecture schematic
4.8 Architectural layers
4.9 Socket communication
4.10 Genetic algorithm chromosome
4.11 Fitness evaluation examples
5.1 Multi-FPGAs system on RAPTOR2000
5.2 Module cached scenario
5.3 Reconfiguration latencies
Chapter 1
Introduction
In recent years, the embedded systems design scenario has been significantly affected by dynamically reconfigurable architectures. By exploiting the potential of these architectures, it is possible to introduce a new degree of freedom into the design workflow, which increases the flexibility of the developed systems. Several different classes of applications, in fact, would benefit from the ability to change their functionalities after the system has been produced.
This is possible thanks to the employment of reprogrammable devices, such as FPGAs (Field Programmable Gate Arrays), which are characterized by the ability to be partially reconfigured at run-time, while the rest of the device, not involved in the reconfiguration process, keeps working.
From a general point of view, as described in [1], partial reconfiguration can
be performed by using the following approaches.
• The first one is the module-based approach, which is characterized by the division of reprogrammable devices into a certain number of portions, each of which is called a reconfigurable slot. In this scenario it is possible to reconfigure one or more reconfigurable slots with a hardware component able to perform a specific functionality, called a module. Obviously, the modules contained in slots that are not involved in the reconfiguration task must not be stopped during the reconfiguration process.
• The second approach is the difference-based one, which does not require the definition of slots and modules, but which is suitable only when the differences between two configurations are very small, since the process on which it is based supports only small changes in the design.
The most general design approach for dynamically reconfigurable embedded systems, as described in [2], is modular-based design. This approach is strongly connected to the module-based reconfiguration approach and is based on the idea of considering the system specification as composed of a set of several independent modules (called IP-Cores, Intellectual Property-Cores) that can be individually synthesized and finally assembled to produce the desired system.
Each one of these IP-Cores consists of two parts:
• the core logic, often called just core for short, that implements the module
functionality, and
• the communication logic, which allows the component to be plugged into a system and to interact with the rest of the system, for example with other IP-Cores.
To handle the dynamic reconfiguration of these modules it is often useful to use a software controller. This controller can be developed as a stand-alone software application or with the support of an Operating System. The first choice is oriented towards the creation of a specific solution optimized for a particular problem. However, this solution requires a large investment in terms of design and implementation effort, and considerably increases the time to market. On the contrary, the second choice can be followed to increase the flexibility of the whole system, since in this way it is possible to exploit the classical services that an Operating System can provide, such as process scheduling techniques or inter-process communication systems, applying them to improve the reconfiguration management.
The aim of this thesis is to define a complete methodology, based on the modular design approach, that describes the whole design process and drives the creation of partially and dynamically reconfigurable embedded systems, starting from the high-level specification of the original application. The proposed approach provides the designer with a methodology able to strongly reduce the time to market of the final implementation of the system and to simplify the development process, since all the design phases have been automated. To achieve this objective, a tool that automatically generates IP-Cores, starting from their core logic, has been implemented, and the Linux Operating System has been extended with both a reconfiguration support and a centralized reconfiguration manager able to handle dynamic reconfigurability. In addition, a collection of drivers has been developed to realize a simple and powerful communication channel with the configured modules.
The thesis is composed of six chapters. Chapter 2 presents a review of the state of the art in the reconfigurable embedded systems area. The analysis of existing approaches starts by presenting configurable and reconfigurable hardware platforms and ends with the description of the most representative development methodologies and Operating System reconfiguration supports, with particular attention to their common features and limits.
Chapter 3 introduces the methodology adopted to define the proposed design workflow for dynamically reconfigurable embedded systems. In particular, this chapter focuses on the automatic IP-Core generation, once the core functionality is provided, and on the Operating System reconfiguration support aspects, such as module reconfiguration and communication handling.
The aim of Chapter 4 is to detail the actual implementation of the proposed methodology, emphasizing the integration of the design flow with the automatic generation of reconfigurable and fixed hardware modules, and the development of a software architecture, based on the standard Linux Operating System, that allows the exploitation of reconfiguration on embedded systems.
Finally, Chapter 5 introduces a large collection of experimental results of the presented implementation in order to validate the proposed methodology, while Chapter 6 draws the final conclusions, outlining some possible extensions and future work based on the approach described in this thesis.
Chapter 2
State of the art
This chapter describes the state of the art of both configurable and reconfigurable systems, of their development methodologies and of software reconfiguration supports.
Section 2.1 presents the more general configurable systems. These kinds of systems are characterized by their flexibility, which can be obtained in different ways: it is possible to employ either hybrid systems, consisting of static hardware and programmable logic, or programmable devices. Each solution brings different benefits and disadvantages, which will be analyzed at the end of the section.
A more restricted set of the previously introduced configurable systems is described in Section 2.2. Reconfigurable systems employ a dynamic approach to configuration, adding another degree of freedom to the flexibility of configurable systems. In this way, in fact, it is possible to modify the system components at run-time, by changing the cores of an architecture while some other cores are still running.
In Section 2.3 two development methodologies are presented. The aim of these methodologies is to guide the design of a configurable or reconfigurable system; they can be applied to simplify the development or to improve the performance of the previously described platforms. To achieve these objectives, each methodology introduces both general architecture structures and flows that have to be followed during the design of this kind of systems.
Section 2.4 describes a set of software solutions that support reconfiguration tasks. The most suitable way to achieve this objective is to extend an OS (Operating System) with a reconfiguration support. An important aspect of this support is the development of either a manager or a set of tools that can simplify and improve the management of the reconfiguration processes, implementing useful services such as allocation policies or device handling.
Finally, Section 2.5 summarizes the most important aspects that characterize the described systems, methodologies and software supports.
2.1 Configurable systems
In recent years many attempts have been made to fill the gap between GPPs (General-Purpose Processors) and ASICs (Application-Specific Integrated Circuits).
General-purpose microprocessors [3] are digital electronic components, built with transistors on a single semiconductor integrated circuit, that interpret instructions and process the data contained in a program. This kind of component gives a system very good flexibility, because it is possible to write several different applications that solve different problems and run on the same microprocessor. In this way the same component can be used to achieve various objectives, but at the cost of a remarkable delay and increased power consumption, since a general-purpose microprocessor is slower and more power-hungry than its full-custom counterpart.
ASICs are integrated circuits customized for a particular use, in order to achieve very small chips and good performance, matching the computation exactly (high throughput and low latency). Unfortunately, this kind of component requires conspicuous non-recurring engineering costs (the cost of setting up a factory to produce a particular ASIC) and a long time to market (long design cycle), and its flexibility, if any, is very low.
Between these opposite alternatives it is possible to find a compromise, by
using DSPs (Digital Signal Processors), FPGAs (Field-Programmable Gate Arrays) or hybrid systems containing a mix of the previous solutions.
DSPs are special-purpose microprocessors designed specifically for digital signal processing, generally in real-time. These devices are either not programmable or have limited programming facilities, but they are cheaper and more specialized than general-purpose microprocessors, in order to achieve better performance on a certain class of problems.
FPGAs are semiconductor devices containing logic blocks, which can be configured to compute arbitrary functions, and configurable wiring, which can be
used to connect the logic blocks as well as registers together into arbitrary circuits. Traditional FPGAs are very generic, but some of the higher-end FPGAs,
such as the Xilinx Virtex 4 and Virtex 5 families, offer multiple subfamilies, each
optimized for a different market area. The optimizations are achieved by crafting different mixes of memory, logic, multiplier-accumulator (MAC) blocks and
high-speed I/O. Using this kind of component it is possible to obtain good flexibility in the system (thanks to the ability to re-program the device), to decrease the time to market and to reduce the non-recurring engineering costs, even if FPGAs are generally slower than their ASIC counterparts and draw more power.
In the next subsections a few examples of configurable systems will be described, showing some of the ways in which a configurable architecture can be developed to find a trade-off between the flexibility of a general-purpose processor and the performance of an ASIC. In the RaPiD (Reconfigurable Pipelined Datapath) and Configurable Pipelined State Machine approaches, respectively described in Section 2.1.1 and in Section 2.1.2, an attempt was made to build coarse-grained adaptable ASICs or hybrid ASIC/FPGA architectures by introducing some programmable elements to interconnect hardware logic. The remaining approaches, described in Sections 2.1.3, 2.1.4 and 2.1.5, employ FPGAs to develop FPGA-based architectures or as coprocessors, to give a good degree of flexibility to the whole system.
In the last subsection, 2.1.6, all the presented approaches will be analyzed to find common features and limits; this analysis can be useful to find the right way to simplify the development task, to maximize the flexibility of FPGA-based systems and to improve the time to market, reducing the time required by the development, interfacing and integration phases.
2.1.1 Reconfigurable Pipelined Datapath
RaPiD [4] is a research project developed at the Department of Computer Science and Engineering of the University of Washington, focused on defining coarse-grained adaptable architectures that address the performance/power/price constraints posed by mobile/embedded systems platforms for a wide range of highly repetitive and computationally-intensive applications in the signal and image processing domain.
This is accomplished by mapping the computation into a deep pipeline using a configurable array of coarse-grained computational units. RaPiD provides
a large number of ALUs, multipliers, registers and memory modules that can
be configured into the appropriate pipelined datapath; this datapath is a linear
array of functional units communicating in mostly nearest-neighbor fashion.
Mapping applications to RaPiD involves designing the underlying datapath and providing the dynamic control required for the different parts of the computation. The control design can be hard because control signals are generated at different times and travel at different rates. To simplify this task it is possible to use Rapid-C [5], an ad-hoc programming language for developing RaPiD systems; but, even if this language provides a nice abstraction of the architecture, the programmer is still responsible for all the scheduling of data and operations in the datapath.
2.1.2 Configurable Pipelined State Machine
The Configurable Pipelined State Machine [6], developed at the Institute of Microelectronic Systems of Darmstadt University of Technology in Germany, is an FSM (Finite State Machine) where all the units relevant for control and transition logic are configurable, while the basic structural components, like state registers, are built of fixed logic. This architecture is the result of a combined approach, and it is faster and smaller than an FPGA implementation while providing full programmability. Since in this specific case the underlying pipeline structure is the same for all possible applications, it is possible to limit configurability to the logic producing the control signals and to the state transition logic, while the basic architectural structure can remain fixed hardware.

The result of this approach is a hybrid ASIC that implements in hardware the basic architectural structure of a pipelined state machine while allowing control and state transition logic to be configured; however, this solution is built ad-hoc to solve this specific class of problems, and thus it is impossible to apply the same structure to a generalized set of scenarios.
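To make the split between fixed structure and configurable logic concrete, the following minimal sketch is a software analogy, not code from [6]: the state-update machinery plays the role of the fixed state registers, while the transition table, which can be replaced at run-time, plays the role of the configurable transition logic.

    /* Software analogy of a configurable state machine: the update machinery
     * is fixed, while the transition table is "configured" at run-time.
     * States, inputs and table contents are illustrative assumptions. */
    #include <stdio.h>

    #define NUM_STATES 3
    #define NUM_INPUTS 2

    /* "Configurable" part: the next-state table, indexed by [state][input]. */
    typedef int transition_table[NUM_STATES][NUM_INPUTS];

    /* Fixed part: the state-register update, identical for every application. */
    static int step(transition_table table, int state, int input)
    {
        return table[state][input];
    }

    int main(void)
    {
        /* One possible "configuration": a modulo-3 counter driven by input 1. */
        transition_table counter = {
            {0, 1},
            {1, 2},
            {2, 0},
        };
        int inputs[] = {1, 1, 0, 1, 1, 1};
        int state = 0;

        for (int i = 0; i < 6; i++) {
            state = step(counter, state, inputs[i]);
            printf("input %d -> state %d\n", inputs[i], state);
        }
        return 0;
    }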
2.1.3 Configurable Architecture for High-Speed Communication Systems
The Configurable Architecture for High-Speed Communication Systems [7], developed at the Center for Wireless Telecommunication of Virginia Polytechnic Institute and State University in Virginia, is a prototype of a rapidly deployable last-mile wireless high-speed communications system to support emergency management. Given the high bandwidth required and the amount of data that needs to be transported, a hybrid architecture was used, with processing elements implemented partially as software running on a microprocessor and partially as FPGA hardware logic blocks.

The hybrid architecture is a combination of a specialized processor (Motorola PowerQuicc II 8255) for packet-level operations and a programmable logic device (Xilinx Virtex XCV600) for bit-level operations, with a dual-port memory that allows the processor and the FPGA to read and write data simultaneously and is highly fit for the specific application.
2.1.4 Configurable FPGA-Based Hardware Architecture for Adaptive Processing of Noisy Signals for Target Detection based on Constant False Alarm Rate (CFAR) Algorithms
The Configurable FPGA-Based Hardware Architecture for Adaptive Processing of Noisy Signals for Target Detection based on Constant False Alarm Rate (CFAR) Algorithms [8] has been designed at the National Institute for Astrophysics, Optics and Electronics, in Mexico, specifically to be configured for the Cell-Average version (CA-CFAR) of the CFAR algorithm and for two variations of it: the Max and the Min CFAR. However, there are other versions of the CFAR algorithm, such as the Order Statistics CFAR, that have not been taken into account.
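To fix ideas, the cell-averaging scheme at the heart of CA-CFAR can be sketched in a few lines of software: the cell under test is declared a target when it exceeds a threshold obtained by averaging the surrounding reference cells (skipping a few guard cells) and scaling the result. This is the generic textbook formulation, with assumed window sizes and scale factor, not the hardware design of [8].

    /* Minimal software sketch of CA-CFAR detection: average the reference
     * cells around the cell under test (skipping guard cells), scale the
     * estimate, and compare. Window sizes and scale are assumptions. */
    #include <stdio.h>

    #define REF_CELLS   8    /* reference cells per side */
    #define GUARD_CELLS 2    /* guard cells per side     */

    /* Returns 1 if the cell at index `cut` exceeds the adaptive threshold. */
    static int ca_cfar(const double *signal, int n, int cut, double scale)
    {
        double sum = 0.0;
        int count = 0;

        for (int off = GUARD_CELLS + 1; off <= GUARD_CELLS + REF_CELLS; off++) {
            if (cut - off >= 0) { sum += signal[cut - off]; count++; }
            if (cut + off < n)  { sum += signal[cut + off]; count++; }
        }
        if (count == 0)
            return 0;

        double threshold = scale * (sum / count);  /* noise-level estimate */
        return signal[cut] > threshold;
    }

    int main(void)
    {
        double signal[32];
        for (int i = 0; i < 32; i++) signal[i] = 1.0;  /* flat noise floor */
        signal[16] = 8.0;                              /* injected target  */

        printf("cell 16: %s\n", ca_cfar(signal, 32, 16, 4.0) ? "target" : "noise");
        printf("cell  4: %s\n", ca_cfar(signal, 32,  4, 4.0) ? "target" : "noise");
        return 0;
    }

The Max and Min variants mentioned above differ only in combining the averages of the two half-windows with a maximum or a minimum instead of a single mean.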
This architecture has been implemented on an FPGA device, providing good performance; in fact, it is 18 times faster than the required theoretical processing time, about 10 times faster than the software implementation on a personal computer with a Pentium IV processor running at 2.4 GHz and 512 Mbytes of main memory, and 3 times faster than the solution using a TMS320C6203 DSP device from Texas Instruments.
Even if this architecture efficiently implements a class of related CFAR algorithms for adaptive signal processing and target detection (the CA-CFAR, the MAX-CFAR and the MIN-CFAR algorithms), and it can be extended to more complex CFAR algorithms such as the Order Statistics ones, since it exploits the parallel nature of CFAR signal processing, no attempt has been made to find a generalized structure of the architecture or a common development flow able to solve a wide set of problems, covering a whole class of related applications.
2.1.5 Configurable, High-Throughput LDPC Decoder Architecture for Irregular Codes
The Configurable, High-Throughput LDPC Decoder Architecture for Irregular Codes [9] is suitable for scenarios in which a very high data rate must be ensured over a noisy channel. In these contexts, to provide a reliable communication infrastructure and to guarantee low power consumption, error correcting codes can be used to eliminate or reduce the need for retransmission; an example is the Low Density Parity Check (LDPC) codes, which can assure very good performance on noisy channels and are a good candidate for the next generation of wireless devices.
To create a flexible architecture able to support different block lengths and code rates, a Virtex4-xc4vfx60 FPGA has been used to implement the whole architecture. The clock frequency of the generated design is 160 MHz, against the 412 MHz of an ASIC solution, and its latency is between 5 and 11 microseconds, while the ASIC solution latency is between 2.2 and 4.5 microseconds. The ASIC system is thus slightly faster than the FPGA-based system, but it only supports one code, so it is necessary to develop a different ASIC for each different block length or code rate, requiring a considerable investment and increasing the time to market.
Differently from the approach presented in Section 2.1.4, this approach exploits an interesting advantage of the FPGA-based solution, introducing the re-use of the hardware to develop different versions of the system, even if this is still far from achieving a structure that can be used in a generalized flow and applied to a large group of scenarios.
2.1.6 Common features and limits of configurable systems
All the presented scenarios share the lack of a generalized flow and of a complete methodology that allows the development of a configurable system to be abstracted and automated. Without such a flow it is impossible to fully exploit the true potential of this kind of systems, since there is no way to automatically reach a low-level implementation starting from a high-level specification of the application; in fact, all the proposed approaches are characterized by ad-hoc solutions, manually developed in a different way for each case.
With a generalized and automatic flow, instead, in addition to simplifying the development task, it would also be possible to maximize the flexibility of the system by developing various implementations, to improve the exploration of the solution space or to solve the same problem in different ways. This task does not require too much effort, since it can be enough to modify some parts of the high-level specification to generate a different low-level implementation, more suitable for a different scenario. Furthermore, it is possible to improve the time to market, since this kind of flow reduces the time required by the development, interfacing and integration phases.

Table 2.1 shows the platform or platforms on which each approach has been developed and the main features that characterize each solution.

Table 2.1: Configurable systems features

    Platform   ASIC   FPGA   DSP   GPP   Flexibility   Partial gen. flow   Complete gen. flow
    2.1.1       X      -      -     -         X                -                    -
    2.1.2       X      -      -     -         X                -                    -
    2.1.3       -      X      X     -         X                -                    -
    2.1.4       -      X      -     -         X                X                    -
    2.1.5       -      X      -     -         X                X                    -
The first two approaches, 2.1.1 and 2.1.2, are basically developed with ASIC technologies and provide a low level of flexibility, while no generalization has been introduced: in the first one the programmer is still responsible for all the scheduling of data and operations in the datapath, while the second one is voluntarily built ad-hoc to solve its specific problem.

Even if the third solution, 2.1.3, is a hybrid system that uses both a DSP and an FPGA to provide more flexibility, it is still developed for a single application, without taking into account the possibility of extending the same solution to other similar problems.
On the contrary, in the last two approaches, 2.1.4 and 2.1.5, both developed on an FPGA-based architecture, there is an attempt to generalize the solution. The 2.1.4 system, in fact, efficiently implements a class of related CFAR algorithms for adaptive signal processing and target detection: the CA-CFAR, the MAX-CFAR and the MIN-CFAR algorithms. The 2.1.5 system, instead, is able to support different block lengths and code rates of the same LDPC code. Anyway, in none of the approaches is there any trace of an effort to create a really generalized flow that can automate or improve all the development phases, or that can bring a successful system to solve a different problem.
2.2 Reconfigurable systems
Reconfigurable systems add another degree of freedom to the flexibility of configurable systems, since they make it possible to modify the system components
at run-time. In this way it is possible to change the cores of an architecture while
some other cores are still running.
In the next subsections some representative reconfigurable architectures will
be presented.
The PipeRench architecture [10], described in Section 2.2.1, introduces the concept of hardware virtualization, making it possible to execute a design of any size on a compatible device of any capacity.
The MorphoSys architecture [11], presented in Section 2.2.2, is a reconfigurable computer architecture targeted at computationally intensive applications; it consists of a TinyRISC processor, a programmable processing unit, and an RC-Array, the reconfigurable hardware unit.
The Splash processor [12], described in Section 2.2.3, is a special-purpose parallel processor that is able to exploit temporal parallelism (pipelining) or data
parallelism (single instruction multiple data stream) present in the applications.
In this processor the computing elements are programmable FPGA devices.
The Garp architecture [13], outlined in Section 2.2.4, is the integration of a reconfigurable computing unit with an ordinary processor on a single chip. Programming for the Garp system is an automatic task, since the Garp compiler is
able to automatically extract loops from ANSI C programs.
The RAW architecture [15], introduced in Section 2.2.5, is a simple, wire-efficient multicore architecture, in which it is possible to increase performance by exploiting fine-grained parallelism.
Finally, in Section 2.2.6, common features and limits of reconfigurable systems will be described, to show the main characteristics and trends of the current scenario of this kind of systems.
2.2.1 PipeRench
The PipeRench project [10] allows a hardware design of any size to execute on a compatible device of any capacity, by virtualizing hardware. The PipeRench system provides both the extremely fast reconfiguration necessary for hardware virtualization and the compilation tools for this architecture. In this way it is possible to solve both of the problems that inhibit the deployment of applications based on run-time reconfiguration: the first is that design methodologies for partially reconfigurable applications are completely ad-hoc, while the second is the lack, in existing FPGAs, of reconfiguration mechanisms that adequately support local run-time reconfiguration.
This solution is suitable for scenarios in which the available resources are not enough for the computation, so the reconfigurable pipeline can be exploited to virtualize pipeline stages. This technique implies that at every clock cycle a new stage is configured, in a way that makes it possible to execute the computation even if the whole pipeline is never configured at the same time. Figure 2.1 shows a Virtual Pipestage and a Physical Pipestage of an example in which the application consists of five stages and the physical pipeline consists of only 3 stages. In this example each stage is configured in one cycle and then executed for the next two cycles, so the effective throughput is two computed results every five clock cycles. More generally, the throughput of a virtualized application with v virtual stages on a system with p physical stages is (p − 1)/v.
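As a quick check, instantiating the formula on the example above gives

    \[
      T \;=\; \frac{p-1}{v} \;=\; \frac{3-1}{5} \;=\; \frac{2}{5},
    \]

i.e. two results every five clock cycles, matching the one-cycle-configure, two-cycle-execute schedule described for Figure 2.1.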
Figure 2.1: PipeRench reconfigurable pipeline

However, the reconfigurable pipeline structure introduces some relevant constraints that limit the freedom of the design. For example, the state of a stage can only depend on the previous stages, so in this kind of system only connections between consecutive stages are allowed.
2.2.2 MorphoSys
MorphoSys [11] is a reconfigurable computer architecture targeted at computationally intensive applications. Figure 2.2 shows the MorphoSys system, which consists of the following components:
• TinyRISC, a general-purpose 32-bit RISC processor
• RC-Array (Reconfigurable Cells Array), the reconfigurable hardware unit
• framebuffer, the embedded data memory of the reconfigurable processor
• DMA (Direct Memory Access), used to transfer data from external memory
• context memory, which stores 32-bit instruction words for the RC-Array.

Figure 2.2: MorphoSys reconfigurable processor architecture
The execution model of the MorphoSys processor is based on the partitioning of applications into sequential and data-parallel tasks; the former are executed by the programmable processing unit, the TinyRISC processor, while the latter are mapped onto the reconfigurable hardware unit, the RC-Array. This is composed of a two-dimensional array of reconfigurable cells (RCs), whose configurations are stored in the context memory. During execution, configuration data are fetched from the context memory, while the computational data for the RC-Array are loaded into the framebuffer from the external memory. Data transfers between the MorphoSys elements and the external memory are managed by the DMA and requested by the TinyRISC processor. After data loading, the RC-Array is enabled by the TinyRISC with a specific command; however, during the computation, it is possible to change the context of specific RCs by reconfiguring only the selected part of the array.
To use the MorphoSys architecture it is necessary to write both the RC-Array configuration program and the instruction program for the TinyRISC processor. The first can be realized using a specific assembler language, while the second can be obtained from a C compiler. However, the current version of the compiler is not able to manage the RC-Array, so the control instructions have to be manually inserted by the programmer. Thus, even if the low-level details of the hardware components, such as the composition of the RC-Array and of the interconnection network, are deeply described, the MorphoSys solution does not present a complete methodology to implement the whole reconfigurable system. Furthermore, it is not explained how the sequential and parallel tasks can be derived from a given application and how they are managed by the scheduler.
2.2.3 Splash
The Splash processor [12], developed at the IDA Supercomputing Research Center, is an attached special-purpose parallel processor designed to accelerate the solution of problems which exhibit at least modest amounts of temporal parallelism (pipelining) or data parallelism (single instruction, multiple data streams). In this processor the computing elements are programmable FPGA devices.
The system is composed of a normal workstation (a Sun SparcStation host),
an interface board and an array of Splash boards (from 1 up to 16). The reconfigurable elements in the Splash system all consist of Xilinx XC4010 FPGAs.
The interface board between the workstation and the array consists of an input and an output DMA channel, each controlled by an FPGA (called XL and XR), connected to the SparcStation host via the Sun SBus channel. The XL element is connected to the first board, while the XR is attached to the last one.
Splash boards consist of 16 FPGAs (X1...X16), a crossbar switch and a seventeenth FPGA (X0), which acts as a control element for the board. Within a board, each FPGA is connected to its left and right neighbours and to the crossbar switch. The boards are connected to each other in a chain, and the X0 element of each board is also connected to the interface board.
The workstation performs a wide range of operations, since it acts as a general controller for the reconfiguration of FPGA elements and crossbar switches,
sends computational data and control signals to the array and collects the results.
The Splash architecture has been designed to supply a Single-Instruction Multiple-Data (SIMD) computational model, where each board has all its processing elements configured to perform the same operations on different data in parallel, but the flexibility of the architecture allows many different computational models. As an example, pipelining can be used to perform a flow of computation on the same data by connecting different programmable elements so that the output of one FPGA is the input of the next one.
Dynamic reconfiguration in the Splash model consists in modifying two kinds of elements within a board: the crossbar switch and/or the processing elements. In the first scenario, reconfiguring the crossbar switch interconnections provides an easy way to modify the data flow in the system without the need to modify the single computing elements; in the second one, single FPGAs are reconfigured to change the kind of computation performed on the data.
Programming the Splash system is done by writing a behavioral description of the algorithm using the VHSIC Hardware Description Language (VHDL), which goes through a process of refinement and debugging using the Splash simulator. The algorithm is then manually partitioned onto the different processing elements. Thus, the Splash solution does not provide a methodology to partition an implementation of an algorithm onto the array modules; the process must be performed manually. This makes programming the Splash system quite difficult, as it requires direct and low-level knowledge of the physical implementation of the system. Also, there is no direct way to derive a configuration for the crossbar switch even when the mapping of functional units onto the FPGAs is known.
2.2.4 Garp
The focus of the Garp [13] research at the University of California, Berkeley is the integration of a reconfigurable computing unit with an ordinary RISC processor to form a single combined processor chip.

Figure 2.3 shows the chip containing the Garp reconfigurable processor architecture, in which the reconfigurable array works as a slave coprocessor of the master microprocessor. The RISC core is a single-issue MIPS-II, provided with a set of special instructions that manage the reconfigurable array by modifying its configuration. The reconfigurable array is organized as 32 rows by 23 columns of 2-bit logic blocks, and it can be used to speed up parts of the computation, for example loops. Furthermore, the reconfigurable array can perform data cache or memory accesses to the shared memory independently of the MIPS core.
Figure 2.3: Garp reconfigurable processor architecture
Programming the Garp system [14] is an automatic task; however, the Garp compiler is able to automatically extract only the compute-intensive loops of ANSI C programs for acceleration on the tightly-coupled dynamically reconfigurable coprocessor.
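As an illustration of what such a compiler works on, the following is a generic compute-intensive ANSI C loop of the kind Garp's compiler could extract and map onto the reconfigurable array, while the surrounding control code stays on the MIPS core. The FIR-filter kernel is an invented example, not code from [14].

    #define TAPS 8

    /* A generic compute-intensive ANSI C kernel (an FIR filter) of the kind
     * a loop-extracting compiler such as Garp's could map onto the array. */
    void fir_filter(const int *input, int *output, int n, const int coeff[TAPS])
    {
        int i, t, acc;

        for (i = TAPS - 1; i < n; i++) {   /* candidate loop for extraction */
            acc = 0;
            for (t = 0; t < TAPS; t++)
                acc += coeff[t] * input[i - t];
            output[i] = acc >> 8;          /* fixed-point scaling */
        }
    }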
2.2.5 Raw Architecture Workstation
The Raw Architecture Workstation (RAW) [15] [16] is a simple, wire-efficient multicore architecture. Its goal is to increase the performance of applications in which the compiler can discover and statically schedule fine-grained parallelism.

The RAW project's approach to achieving this goal is to implement a simple, highly parallel VLSI architecture and to fully expose the low-level details of the hardware architecture to the compiler, so that the software can orchestrate the execution of the application by applying techniques such as pipelining, synchronization and conflict elimination for shared resources through static scheduling and routing. RAW is composed of a set of interconnected tiles, each of which can be crossed by a signal in just one clock cycle. Each tile is composed of:
• an instruction memory
• a switch-instruction memory
• a data memory
• an ALU
• an FPU
• registers
• a programmable switch
This approach acquires the same set of features that makes ASICs popular
for specific applications. First, RAW implements fine-grained communication
between large numbers of replicated processing elements and, thereby, is able
Table 2.2: Reconfigurable systems features

System      Reconfiguration   Granularity      Application domain
PipeRench   Partial           Fine-grained     Hardware accelerator
MorphoSys   Partial           Coarse-grained   Data parallelism
Splash      Complete          Coarse-grained   Data parallelism
Garp        Complete          Fine-grained     General-purpose
RAW         Partial           Coarse-grained   Hardware accelerator
to exploit huge amounts of fine-grained parallelism in applications, when this
parallelism exists. Second, it exposes the complete details of the underlying
hardware architecture to the software system, so the compiler or the software
in general can determine, and implement, the best allocation of resources for
each application.
Since the RAW solution makes it possible to write an application in a high-level programming language and to compile it for the RAW architecture, it is well suited to the development of hardware accelerators. However, this architecture is not flexible enough to be applicable in a generalized embedded systems scenario; in this kind of system, in fact, there is the need to dynamically reconfigure the cores to allow the run-time modification of the configuration of the System on Chip (SoC). This can be done, for example, by implementing the whole system on a FPGA and exploiting its reconfiguration features.
2.2.6 Common features and limits of reconfigurable systems
Table 2.2 summarizes the features of the presented approaches. Even if each of them presents a good solution for a specific scenario, they are far from being a general solution for a wide class of problems, since each one presents some aspects that limit its applicability to different contexts.
In the PipeRench approach, to execute a design of any size on a compatible
device with any capacity, the concept of hardware virtualization has been introduced, but to adopt this solution it is mandatory to introduce some relevant
constraints. These constraints considerably limit the freedom of the design,
decreasing its degree of flexibility.
In the MorphoSys solution, even if the instruction program for the TinyRISC processor can be obtained with a C compiler, the programmer is still responsible for the insertion of the control instructions that manage the RC-Array. Thus there is no automatic way to obtain the final system, since the control instructions have to be inserted manually.
Also in the Splash solution there is a similar problem, since it does not present
a methodology to partition an implementation of an algorithm on the array
modules, but the process must be done manually. This makes programming
the Splash system quite difficult, as it requires direct and low-level knowledge
of the physical implementation of the system.
In contrast, programming for the Garp system is an automatic task,
but the true limit is that the Garp compiler is able to automatically extract only
compute-intensive loops of ANSI C programs.
The RAW solution, instead, is well suited to the development of hardware accelerators, but it is not flexible enough to be applicable in a generalized embedded
systems scenario in which run-time reconfiguration may be necessary.
2.3 Development methodologies
To develop a configurable or a reconfigurable system it is possible to build an ad-hoc solution or to follow a generalized design flow. The first choice implies a considerable investment in terms of both the time and the effort required to build a specific and optimized solution for the given problem, while the second one makes it possible to exploit the re-use of knowledge, cores and software to reach a good solution to the same problem more rapidly.
In the next subsections, 2.3.1 and 2.3.2, the RECONF2 and the ADRIATIC
development methodologies will be presented. The first methodology takes
as input a static version of an application, executes a partitioning of the given
application and then implements both the partitioned parts and the reconfiguration controller in hardware. The second one, instead, introduces a system
level implementation of the flow, even if some implementation problems are not described in detail and no solutions are given for them.
Finally, Section 2.3.3 will describe the common features of the presented approaches and the main limits that characterize each one of them, to show the
absence of a complete and generalized flow that is able to describe and to guide
the whole design flow of a reconfigurable system.
2.3.1 RECONF2
The aim of RECONF2 [17] is to allow the implementation of adaptive system architectures by developing a complete design environment that takes advantage of dynamically reconfigurable FPGAs; in particular it is targeted at real-time image processing or signal processing applications.
The RECONF2 flow builds a set of partial bitstreams representing different features and then uses this collection to partially reconfigure the FPGA when needed; the reconfiguration task can be under the control of the FPGA itself or of an external controller.
A set of tools and associated methodologies have been developed to accomplish the following tasks:
• automatic or manual partitioning of a conventional design,
• specification of the dynamic constraints,
• verification of the dynamic implementation through dynamic simulations
in all steps of the design flow,
• automatic generation of the configuration controller core for VHDL or C
implementation,
• dynamic floorplanning management and guidelines for modular back-end implementation.
Figure 2.4 shows the proposed design flow. It is possible to use as input for
this flow a conventional VHDL static description of the application or multiple
descriptions of a given VHDL entity, to enable dynamic switching between two architectures sharing the same interfaces and area on the FPGA. The steps that characterize this approach are the partitioning of the design code, the verification of the dynamic behavior and the generation of the configuration controller.
Figure 2.4: RECONF2 design flow
The main limit of the RECONF2 solution is that it is not possible to
integrate the system with both a hardware and a software part, since both the
partitioned application and the reconfiguration controller are implemented in
hardware in the final system.
2.3.2 ADRIATIC
The aim of the ADRIATIC [18] project is to define a methodology able to guide the codesign of reconfigurable SoCs, with particular attention to cores situated in the
wireless communication application domain.
Figure 2.5 shows the whole design flow. The first phase is the system specification, in which the functionality of the system can be described by using a
high-level language program, like in a standard design flow.
Figure 2.5: ADRIATIC design flow
This executable specification can be used to accomplish the following tasks:
• generation of the test-bench, which can be used in the other phases of the
design,
• partitioning of the application to specify which part of the system will
be implemented in hardware (either static or dynamically reconfigurable
hardware),
• accurate definition of the application domain and of the designer's knowledge.
To derive the final architecture from the input specification, the dynamically
reconfigurable hardware has to be identified; each dynamically reconfigurable
hardware block can be considered as a hardware block that can be scheduled
for a certain time interval.
During the partitioning phase it has to be decided, for each part of the system, whether it has to be implemented in software, in hardware or in a reconfigurable
hardware block. To help in this decision, some general guidelines have been
developed.
In the mapping phase the functionalities defined by the executable specification are modified to obtain thorough simulation results.
In conclusion, the ADRIATIC flow is a solution that can be easily applied to
the system-level of a design. In this phase, in fact, it is possible to draw benefits
from the general rules that guide the partitioning and from the mapping phase.
However, there is no detailed description of the following phases, which take
place at RTL level, thus there are some implementation problems that cannot
find a solution within the ADRIATIC flow.
2.3.3 Common features and limits of development methodologies
Both the presented flows try to remedy the lack of a generalized flow that is able to describe the design of a reconfigurable system.
RECONF2 is a solution that automates the whole flow, from the high-level
description of the application to the synthesis phase, but it is limited to the
hardware. No software part, in fact, can be included in the final architecture,
since both the partitioned parts derived from the original application and the
reconfiguration controller are always implemented in hardware.
In contrast, ADRIATIC takes into account, in addition to the static and the reconfigurable hardware, also a software part, but the flow is described solely from the system specification phase to the system-level simulation. Thus it can be applied only to the system level, and not to the lower-level implementation phases, which take place at the RTL level.
2.4 Software reconfiguration supports
A dynamically reconfigurable architecture often needs software integration to control the scheduling of the reconfiguration. This kind of task can be implemented as a stand-alone software application or with the support of an Operating System. The first choice is oriented towards creating a specific solution that is optimized for a specific problem. This solution requires a big investment in terms of design and implementation effort, and considerably increases the time to market. The second choice, instead, can be followed to increase the flexibility of the whole system. In this way, in fact, it is possible to exploit the classical services that an Operating System can provide, such as process scheduling techniques or inter-process communication systems, applying them to improve the reconfiguration management.
In the next subsections some solutions to integrate an Operating System
with a reconfiguration support will be presented.
Section 2.4.1 describes the approach developed at the University of Queensland, Australia, which aims at creating a set of tools to simplify the design and
the implementation of reconfigurable systems. Embedded Linux is the host
used to achieve this goal.
Section 2.4.2 introduces the Caronte solution, which is a natural extension of the approach presented in Section 2.4.1. This solution adds to the embedded Linux
a module that is responsible for the management of the devices dynamically
mapped on the FPGA.
Section 2.4.3 presents the BORPH approach, which consists of an extended
Linux kernel that is able to manage FPGA resources as if they were additional
CPUs of the reconfigurable computer on which it is running.
Finally, Section 2.4.4 compares all the presented solutions to identify their features and limits. Even if the described approaches all use an OS to manage the reconfiguration, there are various ways to support reconfiguration requests, and these different ways lead to different solutions.
2.4.1 Embedded Linux as a platform for dynamically self-reconfiguring systems-on-chip
The approach developed at the University of Queensland, Australia, [20], to design
and implement meaningful systems employing dynamic self reconfiguration,
or DRSs (Dynamic Reconfigurable Systems), is focussed on the creation of a
platform of tools that can simplify these tasks. To achieve this goal, embedded
Linux is proposed as a natural host for such a platform.
As part of the reconfigurable system-on-chip (RSoC) research project called
Egret [22], an embedded Linux kernel called uClinux has been successfully
ported to the Xilinx Microblaze soft-core processor [21]. The capability to support research and experimentation into dynamic and self reconfiguring systems
is one of Egret's design requirements. uClinux is a port of the Linux kernel to embedded processors lacking a memory management unit (MMU), like the Xilinx Microblaze. Nevertheless, uClinux offers an interface almost identical to standard Linux, including command shells, C library support and Unix system calls.
In addition to this, support for Xilinx FPGA self-reconfiguration has been
integrated into the Microblaze uClinux kernel, using the standard Linux device
driver model. This solution allows the exploitation of the power and the flexibility given by the Linux platform to rapidly develop a set of tools whose purpose
is to perform complex dynamic self-reconfiguration tasks.
This support is provided by an abstraction layer for the Xilinx Internal Configuration Access Port (ICAP). Xilinx developed an OPB interface to the ICAP
module, which allows frame-by-frame readback and partial configuration in ICAP-supported devices. Using this OPB interface it is possible to connect this peripheral to the Microblaze soft-core processor.
To integrate this device within the Linux kernel, the standard device driver
architecture used by all Linux devices has been adopted. To follow the Linux
philosophy, a device driver has been developed that just implements mechanism (the provided capabilities), without any reference to the policy (how those
capabilities can be used).
The result of this approach is a character-based device driver, which implements the read(), write() and ioctl() system calls:
• read: initiates a read of the specified number of bytes from the ICAP into a user memory buffer,
• write: the specified number of bytes are written to the ICAP from a user
memory buffer,
• ioctl: interface to device specific control operations, such as querying the
status, or changing operating modes.
This device, which is registered in the Linux device subsystem (/dev/icap), may be accessed using standard Linux system calls, such as open, read and write. In this way the kernel mediates between user programs, which implement policy, and the device driver, which implements mechanism.
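As a concrete illustration, a user program could stream a partial bitstream into the FPGA through this driver as follows; this is only a sketch based on the interface described above, and the bitstream file name is a placeholder.

```c
/* Minimal sketch: streaming a (partial) bitstream to the ICAP through
 * the character device described above. The bitstream file name is a
 * placeholder and error handling is abbreviated. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    FILE *bs = fopen("partial.bit", "rb");    /* placeholder bitstream */
    int icap = open("/dev/icap", O_WRONLY);   /* device node from the text */
    char buf[4096];
    size_t n;

    if (bs == NULL || icap < 0) {
        perror("open");
        return EXIT_FAILURE;
    }
    /* Each write() pushes configuration data to the ICAP. */
    while ((n = fread(buf, 1, sizeof buf, bs)) > 0) {
        if (write(icap, buf, n) != (ssize_t)n) {
            perror("write");
            return EXIT_FAILURE;
        }
    }
    close(icap);
    fclose(bs);
    return EXIT_SUCCESS;
}
```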
In addition to this, it is possible to develop a collection of small tools, each focussed on performing a single job, and to use the shell as a mechanism to chain these tools together. This is one of the underlying principles of Un*x-like operating systems, and it can make the combination of uClinux and the ICAP device driver very powerful and easy to use.
However, when an application accesses the ICAP device driver to perform
a reconfiguration, the processor is kept occupied for the whole time interval
needed to reconfigure the FPGA, since there is no possibility to exploit DMA.
Furthermore, this approach doesn't present a centralized manager that is able to manage the reconfiguration at a high level, but each reconfiguration is
performed as a single task. In this way it is not possible to exploit the benefits derived from services such as caching. Finally, each reconfiguration request needs to specify the bitstream with which to perform the reconfiguration itself, since there is no abstraction layer that makes it possible to ask for a module without knowing the name of its corresponding bitstream file.
2.4.2 Caronte
The Caronte solution [23], developed at the Politecnico di Milano, is a natural extension of the approach presented in Section 2.4.1, to which the IPCM (Intellectual Property-Core Manager) has been added.
As shown in Figure 2.6, the IPCM is responsible for the management of the
IP-Cores dynamically mapped on the FPGA.
Figure 2.6: IP-Core Manager
The main task of the IPCM is to handle the dynamic addition and removal of IP-Cores that takes place during partial reconfiguration. The cores can communicate with this module, providing information about device type and I/O
memory location, which the operating system needs in order to access the device.
The IPCM hides the differences in device types from the kernel, since all of them are interfaced using a single major number, and the IPCM itself distinguishes among them and selects the correct driver implementing the necessary calls.
From the kernel point of view, in fact, it is a standard module which registers
a major number (by default 121) among character devices that will be used to
access all the IP-Core devices.
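A minimal sketch of how such a registration might look in a kernel module is shown below; the IPCM source is not reproduced here, so the function names and the stubbed file operations are illustrative, with only the major number (121) taken from the text.

```c
/* Sketch of an IPCM-style module registering one character-device
 * major number (121, the default mentioned in the text) for all
 * IP-Core devices. The file operations are stubs: a real module
 * would dispatch to the per-family driver using the minor number. */
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>

#define IPCM_MAJOR 121

static ssize_t ipcm_read(struct file *f, char __user *buf,
                         size_t len, loff_t *off)
{
    /* look up the IP-Core family from the minor number and
     * delegate to the corresponding driver's read routine */
    return 0;
}

static const struct file_operations ipcm_fops = {
    .owner = THIS_MODULE,
    .read  = ipcm_read,
};

static int __init ipcm_init(void)
{
    return register_chrdev(IPCM_MAJOR, "ipcm", &ipcm_fops);
}

static void __exit ipcm_exit(void)
{
    unregister_chrdev(IPCM_MAJOR, "ipcm");
}

module_init(ipcm_init);
module_exit(ipcm_exit);
MODULE_LICENSE("GPL");
```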
However, this solution limits the number of IP-Cores that can be configured: with the current implementation of the IPCM it is possible to configure just 16 kinds of IP-Core, which have to be statically assigned a number from 0 to 15, and 16 IP-Cores for each kind. In addition to this, even if each IP-Core is registered automatically, it is not possible to automatically load or unload its corresponding driver, so this operation has to be performed manually.
An important advantage provided by the IPCM is an easier programming
interface for the development of IP-Core drivers, since it hides the kernel internal structures and integrates all the common operations for the devices.
The main disadvantage of this approach is the absence of a unique reconfiguration manager that is able to implement the caching, the allocation and
the positioning mechanisms. Each of these phases can be implemented manually by exploiting the information contained in the IPCM module, but there is no framework that implements them in an integrated and automatic way, to improve the system performance.
Finally, an intermediate layer abstracting the reconfiguration requests is
missing. This layer can be useful to hide from the software applications any details about the low-level implementation of the reconfiguration routines and to make it possible to change them without modifying the software application. In fact, it
can be used as an interface to decouple the high-level user applications that
request reconfigurations from the low-level kernel tasks that perform the real
reconfiguration process.
2.4.3 BORPH
BORPH [19] (Berkeley Os for ReProgrammable Hardware) is an Operating System designed at the University of California for FPGA-based reconfigurable computers. It is an extended Linux kernel that is able to handle FPGA resources as native computational resources on BEE2 (Berkeley Emulation Engine 2), which is a reconfigurable computer.
This OS, in addition to allowing a simple way to perform FPGA reconfiguration, also provides useful standard system services, such as the ability for
FPGAs to read or write to the standard Linux file system, allowing them to
communicate with the rest of the system easily and systematically.
To achieve this goal, BORPH introduces the concept of hardware process, that is, a hardware design running on a FPGA; a hardware process is a standard
user process, so it behaves just like a normal software program running on a
processor. However, it is not possible to partition a given application to derive a software and a hardware part, and there is no automatic flow that easily leads to the generation of a hardware process from a high-level specification. Thus each change in the high-level specification of the problem has to be directly translated into a manual change of the low-level hardware description.
To deploy a hardware process on the reconfigurable devices, BORPH exploits the concept of hardware regions, which are the smallest reconfigurable regions that can be managed. Even if it is possible to imagine a hardware region as a partially reconfigurable region on a single FPGA, on a BEE2 module it is implemented only as an entire user FPGA. Thus each hardware process, even a very small one, needs to be deployed on an entire dedicated FPGA.
Furthermore, the hardware configuration of a hardware process is encapsulated in the executable file, so it is hard to completely exploit hardware re-use and it is impossible to implement caching policies. In addition to this, it is not possible to choose at run-time the most suitable hardware description of the hardware process that has to be deployed, for example depending either on the FPGA availability or on the performance required by the user.
Finally, the BORPH solution doesn't make it possible to completely separate the software layer from the hardware layer: they still remain at the same level. This implies that both the hardware and the software side have to be specifically developed to work together. In other words, to write a software application that uses a hardware process it is necessary to know exactly how it behaves, since it is not possible either to use a common library for the communication or to write software controllers for the hardware processes.
2.4.4 Common features and limits of software reconfiguration supports
The presented approaches have in common the idea of extending an Operating System with a reconfiguration support. The different solutions derive from the different ways in which this support has been developed. Thus, in addition to some common problems, specific limits will also be analyzed in the following paragraphs.
In all the presented approaches the processor is involved in the configuration of the FPGA for the whole time interval needed by the reconfiguration
process. This is due to the lack of a DMA service, which makes it compulsory to give the reconfiguration module the whole bitstream instead of only its memory address.
Even if Caronte presents a sort of centralized manager, in all the solutions it is not possible to manage the reconfiguration at a high level. Each reconfiguration is performed as a single task, so it is not possible to exploit a framework that provides services such as caching, allocation and positioning. In fact, each reconfiguration request has to specify the right bitstream with which to perform the reconfiguration itself. It is not possible to request just a desired functionality and let the system decide the most suitable configuration file to use to perform the reconfiguration.
Another limit is the lack of an intermediate layer to decouple the user-side
applications from the low-level kernel tasks that perform the reconfiguration.
This can be implemented, for example, with a library and it can be useful to
abstract the reconfiguration requests, hiding from the user-side software applications all the steps that have to be followed to perform a reconfiguration or to manage a configured IP-Core.
A constraint of the Caronte solution is the limited number of IP-Cores that
can be supported by the system. The current implementation of the IPCM, in fact, makes it compulsory to statically assign a number from 0 to 15 to each of the 16 supported kinds of IP-Core. Furthermore, it is possible to instantiate only 16 IP-Cores for each
previously declared type.
Finally, a specific disadvantage of the BORPH solution is that it is based on the BEE2 platform, in which each hardware process, even a very small one, needs to
be deployed on an entire dedicated FPGA. This is due to the definition of the
smallest reconfigurable region as a complete FPGA.
2.5 Concluding remarks
The analysis of the presented configurable systems leads to the conclusion that the main problem is the lack of a generalized flow that allows the abstraction and automation of the development of a configurable system. All the described approaches are ad-hoc solutions that are manually optimized for the specific problem but require considerable time and effort to be developed. In fact, it is very hard to exploit the potential of this kind of system without a general flow that can guide the whole design.
A complete design flow could offer several advantages: from the simplification of the development phase to the improvement of the flexibility of the
system. Furthermore, it is also possible to shorten the time to market, by reducing the time required by the development, the interfacing and the integration
phases.
The discussion on reconfigurable systems also leads to similar conclusions.
Even if each of the described reconfigurable systems presents a good solution
for a specific scenario, they are far from being a general solution for a wide
class of problems. Each solution, in fact, presents some aspects that limit its
applicability to different contexts, so it is not possible to apply the same solution
structure to solve a different, even if similar, problem.
The presented development methodologies try to remedy this lack of a generalized flow that is able to abstract the design of a configurable or reconfigurable system.
RECONF2 is a solution that automates the whole flow, from the high-level
description of the application to the synthesis phase, but the real problem is
that it is limited to the hardware. Both the partitioned parts derived from the
original application and the reconfiguration controller are always implemented
in hardware. Thus, it is not possible to include a software part in the final architecture.
In contrast, ADRIATIC takes into account both a hardware and a software part. The main limit of this solution is that the flow is described solely
from the system specification phase to the system level simulation. Thus it can
be applied only to the system level, and not to the implementation phases.
Finally, Operating System reconfiguration supports have been described. A considerable disadvantage of the presented approaches is the absence of a DMA service. This forces the processor to be employed for the whole time interval needed by the reconfiguration process.
Furthermore, in all the solutions it is not possible to manage the reconfiguration at a high level. There is no framework able to provide services such as
caching, allocation and positioning.
Another considerable limit is the lack of an intermediate layer to decouple
the user-side applications from the low-level kernel tasks that perform the reconfiguration. This layer can be useful to abstract the reconfiguration requests.
In this way it is possible to hide from the user-side software applications all the kernel tasks that are necessary to perform a reconfiguration or to manage a configured IP-Core.
A specific constraint of the Caronte solution is the limited number of IP-Cores
that can be supported by the system.
In conclusion, a specific disadvantage of the BORPH solution is that on the
BEE2 each hardware process, even if it is very small, needs to be deployed on an
entire dedicated FPGA, since the FPGA is the smallest reconfigurable area that can be managed.
Chapter 3
Proposed methodology
The aim of this chapter is to introduce a flow that is able to guide the design of a configurable or reconfigurable system, starting from the high-level specification of
an application. This flow simplifies, improves and automates the development
process. Moreover, it is possible to include in the final solution an Operating
System that is able to manage the reconfiguration, to add more flexibility to the
whole system.
The analysis of previous works, presented in Chapter 2, shows that there is still no complete methodology that is able to describe the design of a reconfigurable system. The presented approaches are limited either to an
abstract description of the flow or to a reduced portion of the whole process.
In addition to this, while reconfigurable hardware platforms have already gained a wide range of uses in different scenarios, a homogeneous and centralized support for dynamic reconfiguration within a standard Operating System is still missing. As shown in Sections 2.1 and 2.2, embedded systems most of the time make use of a standalone application explicitly designed for the particular target system and with a complete knowledge of the hardware on which it
has to run.
However, the development of a standalone application makes it hard to exploit application design reuse. In fact, it requires a major effort to develop a new system, since little can be derived from a previous one. Furthermore, it
reduces the flexibility of the developed system and considerably increases the time to market.
Section 3.1 introduces the BE-DRESD flow, whose goal is to find a solution to the presented limits. This flow represents the proposed methodology
for dynamically reconfigurable systems design and consists of several components, each one performing a specific task. In particular, the main contribution
of this thesis is the integration of DRESD-BE with a tool, called IPGen, for the automatic generation of IP-Cores starting from their core descriptions, and the creation of DRESD-SW, which consists of the extension of the Linux Operating System with a reconfiguration support and the development of a centralized reconfiguration manager.
Section 3.2 describes in detail DRESD-BE, which is responsible for the creation of the hardware architecture of the final system. This architecture includes the reconfigurable and the fixed modules, which are automatically created by IPGen.
Section 3.3 presents DRESD-SW, whose main task is to analyze and modify the software part of the original application to make it suitable for interaction with the reconfiguration manager of the Linux Operating System extended with the reconfiguration support.
Finally, Section 3.4 summarizes all the main aspects of the presented flow,
giving an overall view of the proposed methodology.
3.1 BE-DRESD flow
The schematic of the flow proposed in this thesis, called BE-DRESD, is shown
in Figure 3.1. The input of this flow consists of both a high-level specification
of the application that solves a particular problem and its translation to a hardware description, such as a VHDL (Very high speed integrated circuit Hardware
Description Language) description. It is possible to write this hardware description either manually or by using development frameworks such as CoDeveloper.
CoDeveloper is a C-language development system for coarse-grained programmable hardware targets, including mixed processor and FPGA platforms.
Figure 3.1: BE-DRESD flow
CoDeveloper’s core technology is the Impulse C library and related tools that allow standard ANSI C to be used for the expression of highly parallel applications and algorithms targeting mixed hardware/software targets. In this way it
is possible to obtain synthesizable HDL from C applications, so this framework can be integrated into the BE-DRESD flow to restrict the required inputs to ANSI C programs.
The BE-DRESD flow is composed of several components, each one implementing a different stage of the flow:
• DRESD-HLR: DRESD High-Level Reconfiguration takes the input description of the application and tries to extract from it the recurrent structures
that will be used by the following stages,
• DRESD-BE: DRESD Back-End is responsible for the creation of both reconfigurable modules and configurable or reconfigurable architectures,
• DRESD-SW: DRESD SoftWare is the generator of the software part of the final solution, which consists of either a standalone software application or an Operating System with a reconfiguration support, device drivers, user-side drivers and the software part of the application,
• DRESD-VAL: DRESD Validation is composed of two tools, SyCERS and BAnMaT, and it is used to validate the output of DRESD-HLR and DRESD-BE,
• DRESD-DB: DRESD DataBase provides useful information on the target
device to the other components of BE-DRESD,
• DRESD-TM: DRESD Technology Management is the final stage of the flow
and it takes as input the output generated by DRESD-BE and DRESD-SW
to create the final solution.
The focus of this thesis is on the components highlighted in Figure 3.1, which are the back-end, DRESD-BE, and the software manager, DRESD-SW. These components, which together with DRESD-HLR represent the core functionalities of BE-DRESD, have been integrated with the other pre-existing elements in order
to achieve a complete flow. A more detailed description of these two components is presented respectively in Sections 3.2 and 3.3. The input of DRESD-BE
is the output generated by DRESD-HLR, which is validated using SyCERS.
DRESD-HLR analyzes the input description to create a graph on which it
is able to work. The obtained graph is explored to find recurrent structures
with which it is possible to cover the graph itself. This task can be driven by
the information produced by the validation phases. To achieve this goal it is
possible to use different algorithms, for example to maximize the number of
instances of the same structures present in the graph or to maximize the size
of the structure itself. In any case, when an adequate set of structures has been
found, it is extracted from the original graph.
The obtained partitioning information is validated with SyCERS (DRESD-VAL). This validation phase can be useful to obtain performance measurements that can drive the refinement cycle. The DRESD-HLR process is then repeated several times, until the validation constraints are satisfied. When the validation stage is successfully passed, the generated output is given to the DRESD-BE
and DRESD-SW components.
When the DRESD-BE and DRESD-SW processes are completed, their outputs are taken as input by DRESD-TM. The output of DRESD-SW is a set of
executables that can be either compressed together to form a ramdisk image or
directly given to the DRESD-TM.
In contrast, the output of DRESD-BE has to be validated with BAnMaT (Bitstream Analyzer and Manipulator Tool), which is the second tool of DRESD-VAL. This bitstream validation phase can impact both DRESD-BE and DRESD-HLR, since its output can guide both processes. In fact, if the validation constraints are not satisfied, it is possible to repeat these phases to try to fulfill them. When these constraints are satisfied, instead, the obtained bitstreams are finally given to DRESD-TM.
The aim of DRESD-DB is to provide the other components with a description of the target device. This device is part of the platform on which the final solution
has to be deployed. Each step of the BE-DRESD flow needs this information to
create the constraints, to improve the exploration of the feasible solutions or to
optimize the process itself. Physically DRESD-DB is a database that contains
all the necessary information about a wide range of devices. This is the set of
supported devices, but it is also possible to extend it with new descriptions, to
increase the flexibility of the BE-DRESD approach.
The last step of the proposed flow is represented by DRESD-TM. In this
phase the executables and the bitstreams are put together with the deployment information. This information is necessary to establish where each part of the solution has to be placed, since there is no fixed position in which each part of the
solution can be located. In this way it is possible to create the final solution that
implements the given application and that specifies how it has to be deployed
to solve the original problem.
3.2 DRESD-BE
DRESD-BE is the stage in which the reconfigurable architecture is developed.
The general system on which each output of this phase is based is YaRA (Yet
Another Reconfigurable Architecture) [24]. This architecture has been chosen
since it can be adopted to solve several very wide classes of problems, thus providing considerable flexibility to the flow.
YaRA consists of two parts: a fixed part, YaRA_FIX, and a reconfigurable part, YaRA_REC, which is a collection of reconfigurable IP-Cores (or modules). Each possible configuration of YaRA_FIX with a different set of reconfigurable IP-Cores gives origin to a static snapshot of the system, which is called YaRA_TOP.
It is possible to imagine, in a general view, that these static snapshots are used to create the bitstreams (a complete bitstream and a group of partial bitstreams) that will be used to set up the system (the complete bitstream) and to pass from one static snapshot to another (the partial bitstreams).
In particular, the adopted solution consists of the generation of a complete bitstream that configures the system with the Top and a set of empty modules. Then, for each module, two partial bitstreams have to be created: one used to configure it over an empty module and another one to come back to the empty module. In this way, to change from one IP-Core to a different one, it is necessary to pass from the first module to the empty module, after which it is possible to configure the desired IP-Core.
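The swap protocol can be summarized in a short sketch; both functions below are hypothetical stand-ins for loading the corresponding partial bitstreams, since the actual loading mechanism depends on the target platform.

```c
#include <stdio.h>

/* Hypothetical helper: stands for writing the named partial bitstream
 * for the given slot to the reconfiguration port. */
static void load_partial(const char *bitstream, int slot)
{
    printf("configuring slot %d with %s\n", slot, bitstream);
}

/* The swap protocol: to replace IP-Core A with IP-Core B on the same
 * slot, first return the slot to the empty module, then configure B. */
static void swap_a_with_b(int slot)
{
    load_partial("A_to_empty.bit", slot);   /* back to the empty module */
    load_partial("empty_to_B.bit", slot);   /* configure the desired core */
}

int main(void)
{
    swap_a_with_b(0);
    return 0;
}
```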
Figure 3.2 shows the YaRA Modular Architecture Creation phase, which is part
of DRESD-BE. Its inputs are provided by DRESD-DB and DRESD-HLR. These
inputs consist of both information about the processor and the reconfigurable
bus that have to be used, and information about the set of cores that have been
extracted from the original specification. Also DRESD-VAL is involved in this
flow, since it gives useful guidelines to improve the previous solution in a refinement cycle.
The first step of the flow is the creation of YaRA_TOP. This goal is achieved
starting from the generation of the System.vhd and the .ncd files with the
Figure 3.2: YaRA Modular Architecture Creation
EDK System Creator tool. The first one represents the VHDL description of
YaRA_FIX, while the others are the descriptions of the fixed components included in YaRA_FIX. Inputs of this tool, in addition to the standard inputs of the
whole flow, are the IP-Cores that have been selected to be inserted in the fixed
part of the architecture.
These modules are provided by IPGen (IP-Core Generator), which aims at creating an IP-Core for each given core. This process mainly requires the mapping of
the signals of the core to internal registers and their interface with the signals
present in the chosen communication infrastructure. IPGen concepts will be described in more detail in Section 3.2.2.
After that, the Fixed Generator tool produces YaRA_FIX and YaRA(–), which is
the first version of the complete architecture, in which there is no information
about the communication infrastructure and reconfigurable modules. Another
tool, COMIC (COMmunication Infrastructure Creator), produces the next version of the solution, YaRA(-); this version contains the communication
infrastructure, but reconfigurable modules are still missing.
The last tool, System Configuration Tool, completes the YaRA(-) description
with the reconfigurable modules collection. These reconfigurable modules are
provided by IPGen, in the same way in which it provides fixed modules to EDK
System Creator.
The output of this final step is a group of possible configurations of the system. If this group consists of just one configuration, the flow output is a codesign of the original specification, since it consists of a configurable system that doesn't need dynamic reconfiguration. Otherwise, if the group consists of more than one configuration, the final result is a dynamically reconfigurable system, and the different configurations represent the possible states or instances of the system at a particular instant.
3.2.1 Cores handling
The main task of DRESD-HLR is the generation of the recurrent structures list.
Using this list it is possible to perform the gender assignment phase, in which
each structure is assigned to the hardware, to the reconfigurable hardware or to
the software side. The following step is the creation of an architecture that includes the functionalities that have to be implemented in hardware. Since these
functionalities are extracted by DRESD-HLR from the specification, they consist
of the minimum logic that is necessary to express their purpose, thus they are
not already suitable to be used with a bus-based communication infrastructure.
On the other hand, to build a reconfigurable architecture using a general structure, it is useful to implement a bus communication. This communication allows the fixed components of the architecture to interact with a standard interface of the modules, which uses the same kind of signals and behaves in a
similar way (for example the reset signal is always interpreted in the same way)
for each different reconfigurable IP-Core. This layer of abstraction provides
more flexibility to the system and makes it possible to adopt the reconfiguration model described in Section 3.2, in which each module can be substituted
with another one.
As previously hinted, the cores extracted by DRESD-HLR are not suitable to
be used with a bus-based communication infrastructure. However, it is possible
to adapt them to the bus communication without changing their internal logic.
This is the task that IPGen is able to perform in an automatic way.
There are two main concepts on which the automatic IP-Core generation task is based:
• the need to preserve both the internal structure and the functionality of
each IP-Core,
• the possibility to flatten each VHDL description.
On one side it is compulsory to preserve the functionality of each IP-Core,
since it is part of the original specification and therefore cannot be arbitrarily modified. Furthermore, it is very desirable to preserve the internal structure as well,
since in this way it is possible to abstract the implementation of each module to
concentrate the efforts on the analysis of its interface.
On the other hand, the introduction of a hierarchy of wrappers doesn't directly imply a considerable loss of performance or waste of the resources required
to implement it in hardware. This is possible because VHDL allows the synthesis of each description in a flattened way, so the overhead introduced by the
wrappers hierarchy is very small.
These considerations lead to the conclusion that a good solution is to develop a sort of wrapper hierarchy to create an IP-Core from each given core, as described in more detail in Section 3.2.2. In addition to the adaptation
of the core to the bus communication, in a way that makes it possible to use
it with the YaRA architecture, this choice also offers an efficient, elegant and
human-readable solution to the proposed issue.
3.2.2 Automatic IP-Core generation
The aim of this phase is to build a complete IP-Core starting from its core logic. This
task can be automatically performed through three steps:
• registers mapping
• address space assignment
• signals interfacing
Figure 3.3 shows the result of these steps. The core logic is included in a
more complex component, called IP-Core, which is able to communicate with the
target bus.
Registers mapping is necessary since each core has a different signal set. These sets can differ in the total number of signals, in their size or in their
type. In this scenario the most suitable solution is to use a standard set of signals
for the communication with the rest of the system, and to use these standard
signals to manage the specific signals of each core.
To make this idea applicable, it is necessary to find a way to temporarily store a specific signal during its set-up (to avoid undesired
interferences with the core logic) and to make it available also when the standard signals are managing other specific signals. The easiest way in which this
decoupling can be done is by introducing a group of registers that correspond
to the specific signals set. Each register, then, can be assigned to a specific signal
in a direct way, while the standard signals can interact only with the registers
and not with the specific signals set.
Figure 3.3: IP-Core schematic
The second step that has to be performed is the address space assignment.
Once the standard signals are mapped on the registers set, it is necessary to
assign to each register a specific address. In this way it is possible to use the
data contained in the address signal to refer to a specific register.
This solution allows the use of a small collection of signals both to write and
to read from each specific signal of the cores. This group of signals consists of
an address signal, a data signal and a few control signals. The address signal
contains the address of the register that has to be accessed, while the data signal
either contains the data that has to be written on the selected register or is the
place where the data read from the selected register can be stored.
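From the software point of view, the effect of the address space assignment is that each specific signal of the core becomes reachable at a fixed offset from the IP-Core base address. The following C sketch illustrates the idea; the base address and the register offsets are purely illustrative.

```c
#include <stdint.h>

/* Hypothetical register map produced by the address space assignment:
 * each specific signal of the core is reachable at a fixed offset from
 * the IP-Core base address. Addresses and offsets are illustrative. */
#define IPCORE_BASE   0x80000000UL
#define REG_OPERAND_A 0x0
#define REG_OPERAND_B 0x4
#define REG_RESULT    0x8

/* The data signal is written to (or read from) the register selected
 * by the address signal; volatile prevents the compiler from caching
 * hardware accesses. */
static inline void reg_write(uintptr_t offset, uint32_t value)
{
    *(volatile uint32_t *)(IPCORE_BASE + offset) = value;
}

static inline uint32_t reg_read(uintptr_t offset)
{
    return *(volatile uint32_t *)(IPCORE_BASE + offset);
}
```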
In the last step the signals interface phase is performed. In this phase the
signals of the target bus interface have to be used to interact with the registers.
The address and the data signals are involved in the creation of routines to read from or to write to a particular register, whose address is specified in the address
signal. Also the control signals, such as the reset signal, are used to manage the
core in the correct way.
After the execution of this step, the IP-Core is ready to be bound to the target
bus and to work properly with it, since the signals of its interface are the only
set of signals that the developed IP-Core needs to perform the correct functionality. These signals, in fact, are used to set up and manage the registers, which are directly mapped on the specific signals of the contained core.
This sort of wrapper hierarchy achieves the main objective of the automatic IP-Core generation, that is, to automatically create, starting from a given core
logic, a module that is compatible with the target bus.
3.3 DRESD-SW
DRESD-SW is the component of the BE-DRESD flow that is located between DRESD-HLR and DRESD-TM. The aim of this component is to generate the software part of the final solution.
This software part can be developed as a standalone application that includes the reconfiguration controller, using a specific reconfiguration library. Otherwise, it can be designed to run on an Operating System that provides reconfiguration mechanisms.
Figure 3.4 presents the DRESD-SW design flow. As shown in this figure, its
input set consists of the following items:
• the base Operating System
• the core descriptions
• the software application
The base OS is the platform on which the final software solution will run. In
the proposed approach the Linux OS has been considered a good choice to obtain both flexibility and performance, so it has been adopted to develop the first
version of this flow. However, it is also possible to follow the same flow with a different OS, developing another specific reconfiguration support and compiling all the code using the right cross-compilation target.
Figure 3.4: DRESD-SW design flow
Another input of this component is the set of core descriptions. These descriptions are extracted from the original specification by DRESD-HLR and have to
be analyzed by DRESD-SW to obtain the corresponding collections of drivers.
There are two different kinds of drivers:
• device drivers
• user-side drivers
Device drivers, often called just drivers for short, provide the Operating System with the information on how to control and communicate with a particular
hardware device. This kind of driver implements the basic functions that the OS needs to manage different devices, such as writing to or reading from a particular register of a hardware module. These functions allow both data transfer and control register management.
On the other hand, user-side drivers abstract the device-driver layer to provide the user-side software applications with a simple and efficient way to manage devices. This makes it possible to avoid direct calls to the devices configured in the OS in the user-side applications, since these calls are
already implemented and grouped in the functions provided by the user-side
drivers.
The last input of this component is the software application from which
cores have been extracted. This application consists of both the software controller and the software parts specified during the partitioning phase. In other
words, it represents the whole original specification, excluding the parts that
have been chosen to be implemented in hardware during the partitioning
phase.
This application is analyzed and modified by replacing the portions of code
that have been selected to be implemented as hardware cores with the corresponding function calls to the previously described user-side drivers, and to the
library provided by the reconfiguration support of the selected OS. The main
purpose of this library is to provide the software applications (for example the
controller) with a fast and powerful mechanism to request or to discard a module.
The whole software part is then cross-compiled to obtain binaries that can
run on the processor of the target system. These binaries perform all the tasks
that have been selected to remain in software and control the reconfigurable or
fixed modules. The functions provided by the OS support library are used to
obtain or to release an IP-Core, while user-side drivers are used to access the
IP-Core, writing or reading data from it.
The OS reconfiguration support, whose main functions are used in the software controller, has to be included in the OS to extend it with all the necessary reconfiguration mechanisms. Once the support has been merged with the OS, an extended OS that is able to manage reconfiguration is obtained. This extended OS, however, does not yet support communication with the particular IP-Cores needed by each different application. To provide it with these
specific communication procedures it is necessary to include in the OS all the
device drivers obtained from the cores analysis.
The result of these steps is an OS that is able both to manage reconfiguration and to provide all the necessary communication functions to the software
applications that need to use IP-Cores. On this OS it is possible to run the binaries obtained from the cross-compilation previously described. The integration
between the OS and the user-side binaries represents the software part of the
solution and it is also the final output of DRESD-SW.
3.3.1 Reconfiguration layer
One of the main aspects of the OS reconfiguration support is the complete abstraction of the reconfiguration task. In other words it is fundamental that this
support decouples the user-side applications from the system processes that
have to be executed to perform a reconfiguration.
In this way it is possible to obtain several benefits, by exploiting the following advantages:
• simplification of the reconfiguration calls,
• code reuse and portability,
• support for different low-level implementations.
The introduction of a reconfiguration layer that completely hides the low-level reconfiguration processes from the user-side applications simplifies the task of
writing software that uses hardware modules. The functions that this layer has to provide to the user-side applications are the following ones (a possible C interface is sketched after the list):
• module request: to ask the reconfiguration manager for a particular module that has to be configured on the system,
• module release: to let the reconfiguration manager know that a specific module instance is no longer in use and can be deleted or cached,
• module removal: to ask the reconfiguration manager to explicitly delete a particular module instance from the system,
• modules list: to know the list of configured modules and their relative status.
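A possible C interface for this layer is sketched below; the names and signatures are illustrative and do not correspond to an actual library, but they show how the four functions can be exposed without leaking any low-level detail.

```c
/* reconf.h -- hypothetical interface of the reconfiguration layer.
 * Names and signatures are illustrative; the point is that no
 * low-level detail (bitstreams, slots, reconfiguration ports)
 * leaks into the user-side application. */
#ifndef RECONF_H
#define RECONF_H

#include <stddef.h>

typedef int module_handle_t;  /* opaque reference to a configured module */

/* Module request: the manager chooses the bitstream and the placement,
 * possibly reusing a cached instance of the same family. */
module_handle_t reconf_request(const char *family);

/* Module release: the instance is no longer in use and may be
 * deleted or kept cached, at the manager's discretion. */
int reconf_release(module_handle_t m);

/* Module removal: explicitly delete the instance from the system. */
int reconf_remove(module_handle_t m);

/* Modules list: fill 'out' with up to 'max' configured modules;
 * the return value is the number of entries, each with its status. */
size_t reconf_list(module_handle_t *out, size_t max);

#endif /* RECONF_H */
```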
This abstraction approach also allows code reuse and portability, since the high-level reconfiguration calls don't contain any information about their low-level implementation. In this way it is possible both to reuse the same code or the same portion of code in different situations and to port it to different hardware platforms.
For the same reason it is possible to implement in various ways the same reconfiguration tasks, for example by following different cache policies or different allocation mechanisms, and to choose at runtime the most suitable solution
for each particular scenario. The only constraint is that each implementation
has to satisfy the standard interfaces of the reconfiguration functions.
3.3.2 Dynamic reconfiguration management
In the proposed approach the Linux OS has been extended with a centralized
reconfiguration manager to support and manage external and internal reconfigurations. The choice of a centralized manager instead of a distributed solution
has been followed because in this way it is possible to exploit several advantages, brought by the possibility to implement the following policies:
• cache policy
• allocation policy
• positioning policy
The first policy represents the way in which cached modules are managed. When a module is no longer in use, in fact, it is possible to perform either a hard-removal or a soft-removal to delete it. The hard-removal configures the slots occupied by the unused IP-Core with blank modules, physically removing all the logic of the deleted module. The soft-removal, instead, leaves the FPGA configuration unaltered, but performs a logical removal by deleting all the information associated with the deleted modules.
Another way to manage a module removal is to keep both the module configured on the reprogrammable device and its information, while setting its status as cached. In this way the cached module can be assigned to other applications that require an IP-Core of the same family. This approach leads to a remarkable improvement of the temporal performance, since it introduces the possibility of satisfying a module request without performing any physical reconfiguration.
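The following sketch shows how the three options might be expressed around a per-module status field; the names are hypothetical and configure_blank() stands for the configuration of blank modules over the occupied slots.

```c
/* Hypothetical sketch of the cache policy choices described above.
 * configure_blank() stands for configuring blank modules over the
 * slots occupied by the removed IP-Core. */
enum module_status { MOD_RUNNING, MOD_CACHED, MOD_FREE };
enum removal_mode  { HARD_REMOVAL, SOFT_REMOVAL, KEEP_CACHED };

struct module {
    const char *family;
    int first_slot;
    int n_slots;
    enum module_status status;
};

void configure_blank(int first_slot, int n_slots);  /* hypothetical helper */

void handle_release(struct module *m, enum removal_mode mode)
{
    switch (mode) {
    case HARD_REMOVAL:
        /* physically remove the logic by writing blank modules */
        configure_blank(m->first_slot, m->n_slots);
        /* fall through: the bookkeeping is deleted in both cases */
    case SOFT_REMOVAL:
        /* logical removal only: the FPGA configuration is untouched */
        m->status = MOD_FREE;
        break;
    case KEEP_CACHED:
        /* a later request for the same family can reuse this instance
         * without any physical reconfiguration */
        m->status = MOD_CACHED;
        break;
    }
}
```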
Allocation policies aim at defining how to position a given module. Implementing this kind of policy allows the exploitation of well-known algorithms to maximize the number of IP-Cores that it is possible to configure on the same device. This can be seen as a reduction of the number of refused modules, that is, the modules that cannot be placed on the device because there is no more space available.
The main concept to follow while implementing allocation policies is to minimize the fragmentation of the devices. So, for each required module, it is necessary to find the minimum set of consecutive free slots where it is possible to
configure the module itself. In this way larger groups of free slots are left available for larger modules, without breaking them into several smaller groups.
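One way to realize this guideline is a best-fit search over the slot map: among all the runs of consecutive free slots that are large enough for the module, the smallest one is chosen. The following sketch assumes a simple boolean occupancy array; it is an illustration of the principle, not the actual allocation algorithm.

```c
/* Best-fit allocation sketch: find the smallest run of consecutive
 * free slots that can host a module of 'need' slots, so that larger
 * free runs are preserved for larger modules. Returns the first slot
 * of the chosen run, or -1 if the module must be refused. */
int allocate_slots(const int occupied[], int n_slots, int need)
{
    int best_start = -1, best_len = n_slots + 1;
    int i = 0;
    while (i < n_slots) {
        if (occupied[i]) { i++; continue; }
        int start = i, len = 0;
        while (i < n_slots && !occupied[i]) { len++; i++; }
        if (len >= need && len < best_len) {
            best_len = len;
            best_start = start;
        }
    }
    return best_start;
}
```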
Positioning policies concern the selection of the bitstream that is able to perform the desired reconfiguration. There are two possible ways in which this
selection can be executed.
The first is suitable for a scenario in which for each feasible position of the
module on the reprogrammable device there is a different bitstream. In this case
the positioning layer searches for the right collection of bitstreams for the desired
family of IP-Cores and then selects the bitstream that corresponds to the place
chosen in the allocation phase.
The second way is suitable when there is a component that is able to modify a bitstream to shift its position within the FPGA. In this case the positioning layer has to select the right base bitstream, that is, the only bitstream that represents the whole family of bitstreams corresponding to the same module. This information is then used to set up the relocation component that performs the shifting of the base bitstream to the desired position. In this way it is possible to
obtain a new bitstream with which it is possible to configure the desired module
in the position selected in the allocation phase.
All of these policies have to access one or more databases in which all the module and bitstream information is stored. There is the need, then, for a database where it is possible to store the current configuration of the FPGA, with the status of each module, to know if it is either running or cached. Furthermore it is compulsory to develop a database from which module family data and the relative bitstreams can be retrieved.
3.3.3
IP-Cores devices access
Once an IP-Core has been configured on the reprogrammable device, there is the need to establish a communication channel between the Operating System and the module itself. This channel can be used by the OS to accomplish the application requests of writing to or reading from an IP-Core, since it is not acceptable to let the software applications directly access the configurable hardware.
The best way to achieve this goal is to follow the standard Linux philosophy, which proposes the implementation of device drivers. Figure 3.5 shows the driver hierarchy that decouples application requests from OS communication tasks.
Figure 3.5: Drivers hierarchy (software applications call user-side drivers, which access the devices in /dev; each device is handled by a device driver that communicates with the corresponding IP-Core)
Each IP-Core family is managed by the same device driver, so the number of device drivers loaded by the OS at any time corresponds to the number of types of IP-Cores that can be handled. The device driver is able to distinguish a module from another of the same family by its memory address space, since it is unique for each module.
The development of a centralized and automatic reconfiguration manager implies the implementation of a mechanism to dynamically manage this kind of driver. The device drivers needed to handle the configured IP-Cores have to be dynamically loaded, while, if no more modules of a certain family are present on the FPGA, the corresponding device driver has to be unloaded. This aspect of the reconfiguration manager is described in more detail in Section 3.3.3.1.
To allow user-side applications to access IP-Cores, the OS provides them with a collection of devices, located in the /dev directory. Each device corresponds to a different IP-Core, so each set of devices that corresponds to modules of the same family has to refer to the same device driver.
A device is characterized by its major number and its minor number. Each IP-Core family is represented by the same major number, which corresponds to a specific device driver, while the minor number distinguishes between different IP-Cores of the same type.
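As a concrete, purely illustrative example, such a device node could be created from user space with the standard mknod(2) call; the path and the numbers below are invented for the example.

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* makedev() */

/* Hypothetical example: create /dev/device_2b for the IP-Core family
 * handled by the driver with major number 240, instance (minor) 1.   */
int create_ipcore_node(void)
{
    dev_t dev = makedev(240, 1);   /* major = family, minor = instance */
    return mknod("/dev/device_2b", S_IFCHR | 0660, dev);
}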
To avoid including direct calls to the devices in the user-side applications, it can be useful to develop a collection of user-side drivers, each of them able to manage a complete family of IP-Cores. The way in which this kind of driver considerably simplifies the access to the configured modules is described in Section 3.3.3.2.
3.3.3.1
Dynamic device drivers loading and unloading
During the reconfiguration of a module, it is necessary to check if an appropriate device driver is already loaded in the OS. If this driver is not found among the loaded device drivers, it is compulsory to load it, otherwise it would not be possible to manage the communication with the requested IP-Core.
After this phase, the configured module has to be registered to the right device driver, to set up its memory address space. During this process a unique minor number is assigned to the module. The association of this minor number with the major number of the device driver identifies the device that corresponds to the configured module. The name of this device is then used by the user-side application to manage the IP-Core.
When a module is no longer in use, and the caching policy has decided that it cannot be kept in cache, it has to be unregistered from its device driver. This operation is useful to free the memory address space allocated for the unused IP-Core. If in the whole system there is no module of the same family as the removed IP-Core, then it is also possible to unload its device driver.
Following the presented steps it is possible to implement a dynamic management of the device drivers to automatically set up the communication channel that each configured module needs to be used by both the OS and the
user-side applications.
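A user-space sketch of the check-then-load step is reported below; it assumes that the drivers are shipped as loadable kernel modules and that they can be loaded with insmod, which is an assumption made for the example and not necessarily the way the real manager works.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns nonzero if `name` appears among the loaded kernel modules. */
static int driver_loaded(const char *name)
{
    char line[256];
    int found = 0;
    FILE *f = fopen("/proc/modules", "r");

    if (!f)
        return 0;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, name, strlen(name)) == 0) {
            found = 1;
            break;
        }
    fclose(f);
    return found;
}

/* Load the family driver only if it is not already present. */
int ensure_driver(const char *name, const char *path)
{
    char cmd[512];

    if (driver_loaded(name))
        return 0;
    snprintf(cmd, sizeof cmd, "/sbin/insmod %s", path);
    return system(cmd);
}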
3.3.3.2
IP-Core user-side drivers
Even if it is possible for the user-side applications to directly access the devices, this is not a simple and clear way to manage configured modules, since it requires knowing the way in which each device driver operates.
A more powerful way is to develop a collection of user-side drivers, each of them able to interact with a whole IP-Core family. These user-side drivers
provide applications with a set of functions that perform the following classes
of tasks:
• reading from module registers,
• writing to module registers, and
• changing the module status.
Each of these classes is directly translated into the corresponding set of instructions that interact with the device to perform the required process.
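For illustration, a user-side driver for a hypothetical family could wrap the device access as follows; the register offsets and the ioctl request code are invented for the example and do not come from the real drivers.

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define IPCORE_SET_STATUS 0x4201   /* invented ioctl request code */

/* Read one 32-bit register of the module behind `dev_path`. */
int ipcore_read_reg(const char *dev_path, off_t reg, uint32_t *value)
{
    int fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, value, sizeof *value, reg);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}

/* Change the module status through an ioctl on the device. */
int ipcore_set_status(const char *dev_path, int status)
{
    int fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return -1;
    int ret = ioctl(fd, IPCORE_SET_STATUS, status);
    close(fd);
    return ret;
}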
The introduction of this layer of drivers not only simplifies and speeds up the implementation of the communication with reconfigurable modules, but also makes it possible to change the implementation of a device driver without the need for substantial modifications to the user-side application. It is sufficient to develop a new user-side driver that is compatible with the new implementation of the device driver and that exports the same interface for the communication functions. This approach adds flexibility to the whole driver hierarchy.
3.4
Concluding remarks
The proposed methodology is described by the BE-DRESD flow, in which a
high-level specification of an application that solves a particular problem is analyzed and modified by DRESD-HLR. This component allows the identification
of recurrent structures in the input code that can be potentially implemented as
hardware modules, and in this way it performs a hardware/software partitioning of the high-level specification.
The following steps, which have to be executed after the DRESD-VAL validation, are represented by DRESD-BE and DRESD-SW, the original contributions of this
thesis. The first one is responsible for the creation of the base architecture and
for the generation of both reconfigurable and fixed hardware modules. On the
other hand, the second component modifies the software part of the original
application to make it able to manage reconfigurable modules. The result of
this process can be integrated with an OS that is extended with a reconfiguration
support.
DRESD-DB provides all the other components with information about the target device, to automatically develop a solution that can be physically deployed on the real target system. The deployment information is associated with the hardware and the software solutions in DRESD-TM, which produces the final configurable or reconfigurable system implementing the original high-level specification.
Chapter 4
Design flow software development
The aim of this chapter is to describe the development details of the methodological aspects previously presented in Chapter 3. These aspects concern both the integration of the DRESD flow with the automatic generation of reconfigurable and fixed hardware modules, and the design of a software architecture, based on a standard Operating System, that allows the exploitation of reconfiguration.
In particular, Section 4.1 introduces IPGen, a tool able to integrate the DRESD-BE flow with the automatic generation of IP-Cores, starting from their core logic. These IP-Cores can be used either as fixed or as reconfigurable modules that have to be plugged into the final architecture.
The following section, Section 4.2, presents the development details of the software architecture that allows reconfiguration tasks to be performed on top of the Linux Operating System.
The class of underlying platforms on which it is possible to run the same software architecture is introduced in Section 4.2.1. Since this collection of platforms can be described in the same way from an abstract point of view, it is possible to manage reconfiguration processes without the need for modifications to the proposed software architecture.
The developed solution is based both on the Linux low-level reconfiguration support and on the centralized reconfiguration manager. The first one is presented in Section 4.2.2 and consists of several kernel modules that implement the low-level operations needed by reconfiguration processes, such as the
setup of the right address space on the Wishbone bus or the physical reconfiguration of the reprogrammable device. The second one is introduced in Section 4.2.3 and is composed of three different managers that are able to handle reconfiguration requests at different abstraction layers.
4.1
IPGen
As hinted in the previous chapter, the IPGen tool can be used to automate and to speed up the generation of bus-compatible components, which is part of the proposed embedded systems design flow. In particular, the IP-Core generation phase is involved in the design of the hardware architecture, which is part of the DRESD-BE flow.
The core functionalities extracted from the original specification, in fact, cannot be directly used in the YaRA architecture. To be plugged into YaRA they need to be adapted for the Wishbone bus communication. After this phase, the obtained IP-Cores can be used either as fixed or as reconfigurable modules in the developed architecture.
The creation of a complete IP-Core is a process that can be divided into two
distinct phases:
• the generation of the IP-Core logic, that is the core functionality of the
whole component, and
• the implementation of the communication infrastructure that makes it
possible to interconnect the IP-Core with the rest of the system.
IPGen is a software tool that performs the second step in an automatic way, starting from a given VHDL description of the core logic and the information about the bus that has to be used to communicate with the system. To achieve this task, IPGen performs the following steps.
• The first step that has to be executed is the input phase, in which the tool
is provided with the VHDL description of the core and the indication of
the chosen communication infrastructure.
Figure 4.1: Reading process diagram
• The second step is the reading process, shown in Figure 4.1, in which IPGen reads and interprets the input VHDL description to store all the information needed by the following step (a C sketch of this scan is given after the list). This phase can be further divided into the following operations:
– the recognition of the VHDL entity declaration pattern in the VHDL
description;
– the building of the signals list: the basic idea followed by this step is
that when a signal is recognized, it is analyzed and its information is
stored in the signals list; this action is repeated until the end of the
signal declaration is reached; and
– the storing of the core’s entity name and the file path in two variables
used by the following process.
Figure 4.2: Writing process diagram
• The third step is the writing process, shown in Figure 4.2, which takes as input the signals list, the core's entity name, its path and the kind of bus infrastructure that has to be used. The aim of this last step is to write the IP-
Core VHDL description, and this objective is achieved by performing the
following actions:
– the creation of a stub VHDL file between the core and the IP-Core
VHDL descriptions, that allows the input signals of the core to be
written by the bus master, and the outputs to be read. An important
feature of the tool is that the address decoding logic is automatically
generated and included in the stub; and
– the generation of the top architecture VHDL file, that is the final IPCore that contains both the processing logic and the chosen bus interface.
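The following C fragment sketches the reading step on a drastically simplified VHDL input (one declaration per line, no comments); the data structures are invented for the example, and a real scanner, like the one in IPGen, has to be considerably more robust.

#include <stdio.h>
#include <string.h>

struct signal_info {
    char name[64];
    char dir[8];         /* e.g. "in" or "out" */
};

/* Scan a VHDL file for "entity <name> is" and collect the port signals
 * until the end of the declaration is reached.                         */
int scan_entity(const char *path, char *entity,
                struct signal_info *sig, int max_sig)
{
    char line[256];
    int n = 0, in_entity = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (!in_entity && sscanf(line, " entity %63s is", entity) == 1) {
            in_entity = 1;                    /* entity pattern recognized */
        } else if (in_entity && n < max_sig &&
                   sscanf(line, " %63[^ :] : %7s",
                          sig[n].name, sig[n].dir) == 2) {
            n++;                              /* signal added to the list  */
        } else if (in_entity && strstr(line, "end")) {
            break;                            /* end of core recognized    */
        }
    }
    fclose(f);
    return n;                                 /* number of signals found   */
}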
If an error occurs during the execution of either the reading or the writing phase, the tool is halted and an error message is returned. This message contains information that is useful to understand where and why the process failed. Even if the tool cannot detect all VHDL syntax errors, since it is not a VHDL parser and it does not validate the analyzed code, it is however able to check the entity declaration syntax. On the contrary, if the execution ends correctly, the created IP-Core, whose structure is shown in Figure 3.3, is ready to be plugged into the architecture for which it has been developed.
Within the DRESD-BE flow, to be more precise in the YaRA Modular Architecture Creation phase, the IP-Cores obtained thanks to IPGen are used by YaRA Top Creator. The fixed modules are included in the fixed part of the architecture, by using EDK System Creator and Fix Generator, while the reconfigurable modules are plugged into the architecture by the System Configuration Tool during the last step, as shown in Figure 3.2. In this way it is possible to automatically obtain a working architecture based on YaRA that supports all the functionalities of the original specification that have been implemented as hardware components.
4.2
Software architecture
The software architecture is the part of the system that is responsible both for
managing dynamic reconfiguration and for handling the reconfigurable hardware.
This architecture can be implemented either as a standalone application or as an Operating System support. The first solution is designed to solve just one class of problems, so it can be deeply optimized, but it is necessary to rewrite the whole application if the context changes.
The second solution is a more general one and consists of a layer that provides access to the reconfigurable hardware at a very high level of abstraction; moreover, this partial dynamic reconfiguration support makes it easy to exploit the inter-process communication and process scheduling provided by the OS.
Section 4.2.1 presents a class of reconfigurable embedded systems that can be
described in the same way from an abstract point of view, using the concepts of
master and slave FPGAs. This class represents the collection of systems that the
proposed software can handle and on which it is able to manage reconfiguration
tasks.
Section 4.2.2 introduces the low-level implementation of the Linux reconfiguration support. It consists of a collection of kernel modules that have to be loaded in the OS to enable the hardware components responsible for the reconfiguration process. Two of them are essential to grant access to two IP-Cores on the master FPGA that are responsible for the reconfiguration of the slave FPGAs and for the setup of the communication on the Wishbone bus. The last one is the manager of the dynamically registered devices. Furthermore, to simplify the access to these kernel modules, a common library, called the Reconfiguration Library, is also introduced.
This OS support can be extended, as described in Section 4.2.3, with a centralized reconfiguration manager. This manager, called ROTFL Daemon and described in Section 4.2.3.2, has both to implement all the policies previously introduced in Section 3.3.2 and to allow an easy communication between user-side
applications and the Linux kernel modules that perform physical reconfigurations.
4.2.1
Underlying platform
The most general platform on which a configurable or reconfigurable system can be developed is a multi-FPGA scenario where the reconfigurable resources are distributed over several interconnected FPGAs. The master FPGA has to be able to reconfigure, partially or totally, the other slave FPGAs. These slave FPGAs can be divided into several slots that can be filled with IP-Cores by the master FPGA.
The main challenge in such a scenario is to hide from the user applications the system characteristics and the additional effort regarding the communication with dynamic modules.
Figure 4.3 shows a collection of different scenarios to which the previously described abstraction can be applied. In all these scenarios, each master FPGA
is characterized by the presence of an embedded PowerPC processor, on which
the Operating System runs, in addition to the static hardware components such
as a memory controller, general purpose inputs/outputs, and a reconfiguration
manager.
Slave FPGAs, instead, hold the reconfigurable resources used to dynamically load hardware modules into the system. These resources are used according to a 1D-placement with a granularity of four CLB (Configurable Logic
Block) columns. This means that dynamic modules always use the full height
of the FPGA, while their width is a multiple of four CLB columns.
In the first scenario, called Scenario A in Figure 4.3, there is just one FPGA that is used both as a master FPGA and as a slave FPGA. This FPGA is logically divided into two different parts:
• a fixed part, that is the part of the FPGA that contains the PowerPC processor and that acts as a single master FPGA, and
• a reconfigurable part, that is handled as a single slave FPGA, even if the
number of slots that it is possible to configure is smaller.
Figure 4.3: Multi-FPGA scenarios (Scenario A: a single FPGA acting as both master and slave; Scenarios B, C and D: one master FPGA with a PowerPC processor and several slave FPGAs divided into slots)
On the contrary, in all the remaining scenarios each FPGA of the system acts either as a master or as a slave FPGA, without logical internal divisions.
The differences between these scenarios reside in the different ways in which
the communication infrastructure is implemented. The second scenario, called
Scenario B in Figure 4.3, presents a chain communication in which the master
FPGA can communicate with just one slave FPGA, and each slave FPGA can
communicate just with the following one.
Scenario C and Scenario D, instead, represent respectively a point-to-point connection and a bus-based connection. In both these scenarios the master FPGA is able to communicate directly with each slave FPGA.
Even if the presented scenarios differ in the logical partitioning of the master and slave FPGA sets and in their communication infrastructures, they can be reduced to the same class of platforms from the software point of view. For this reason they can be handled by the same software architecture, as described in the following section.
4.2.2
Linux kernel modules infrastructure
The low-level implementation of the partial dynamic reconfiguration support
consists of three kernel modules and a library. Figure 4.4 shows the hierarchy
between the software applications, the Reconfiguration Library, the kernel and
the kernel modules.
The first kernel module is the Reconfiguration Controller kernel module, described in Section 4.2.2.1, which provides the interaction with the hardware Reconfiguration Controller; this kernel module manages partial or complete reconfigurations of an FPGA by simply providing the controller with the bitstream base address, its size and the slave FPGA that has to be configured with the given bitstream.
Section 4.2.2.2 presents the second module, the MAC (Media Access Control) kernel module, which allows communication with the hardware MAC component, i.e. the IP-Core responsible for the dynamic changes of the address space of the configured modules on the Wishbone bus. With this kernel module it is possible to set up the right address space on the Wishbone bus for each new IP-Core added to a slave FPGA.
Figure 4.4: Linux kernel modules infrastructure (software applications use the Reconfiguration Library, which reaches the Reconfiguration Controller, MAC and LOL kernel modules through the kernel)
The third kernel module is the LOL (Load On Linux) kernel module, introduced in Section 4.2.2.3, which does not refer directly to a hardware component of the system. It is a centralized manager that is able to handle the dynamic registering and unregistering of other devices. Its function is to store all the information about the registered devices and to allow both the addition of a new device and the removal of an existing one from the system.
Finally, Section 4.2.2.4 describes the Reconfiguration Library, whose purpose is to simplify writing applications that have to manage the presented kernel modules. For this reason the library also offers a set of functions that allow read, write and IOCTL calls to be made in a simplified way on both the Reconfiguration Controller device and the MAC device.
4.2.2.1
The Reconfigurator Controller kernel module
The Reconfigurator Controller kernel module is an interface for the Reconfiguration Controller, that is, a hardware component that has to be present in the final reconfigurable system, since it allows the reconfiguration of a slave FPGA with a given bitstream.
A special feature of the Reconfiguration Controller component is its Direct Memory Access (DMA) to the SDRAM (Synchronous Dynamic Random Access Memory). This enables very fast configurations when downloading bitstreams from a given position within the memory to a selected FPGA.
It is possible to communicate with this hardware component through its registers, whose schematic is shown in Figure 4.5. The Bitstream base address register contains the base address of the bitstream that has to be used to reconfigure the selected FPGA, while the Bitstream dimension register represents the dimension, expressed in bytes, of the bitstream itself. The size of these two registers is 32 bits.
Figure 4.5: Reconfiguration Controller registers (offset 0x000: Bitstream base address, bits 31-0; offset 0x008: Bitstream dimension in bytes, bits 31-0; offset 0x020: Command, bits 7-0)
The last register is the Command register, shown in Figure 4.6. It is smaller than the previous registers, since its size is just 8 bits. In particular, bit number 4 is used to select a complete (indicated with a 0) or a partial (indicated with a 1) reconfiguration, while the last three bits (bits 2, 1 and 0) are used to select the FPGA that has to be reconfigured.
The work of the Reconfiguration Controller is divided into two phases: the setup phase and the reconfiguration phase.
• In the setup phase it is possible to set the right data to specify which bitstream has to be used to perform a complete or partial reconfiguration;
Figure 4.6: Command Register (bit 4: complete (0) or partial (1) reconfiguration; bits 2-0: slave FPGA number)
the information needed by the controller is the memory base address at which the bitstream is stored and its dimension expressed in bytes. Both the base address and the dimension of the bitstream are written to their respective registers on the controller when the proper IOCTL calls are performed.
• The second step, the reconfiguration phase, starts when the Command register of the Reconfiguration Controller is modified; this step performs the
specified kind of physical reconfiguration (complete or partial) of the selected slave FPGA.
Since the Reconfiguration Controller works with DMA, the processor on which the Operating System is running is involved only in the setup phase, while during the reconfiguration phase it is free to work on other processes.
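Putting the register description together, a driver-side sketch of the two phases might look like the following C fragment; the way the registers are mapped into a pointer is an assumption, while the offsets and the Command encoding follow Figures 4.5 and 4.6.

#include <stdint.h>

/* Register offsets from Figure 4.5 (byte offsets from the base address). */
#define REG_BITSTREAM_ADDR 0x000   /* bitstream base address           */
#define REG_BITSTREAM_SIZE 0x008   /* bitstream dimension in bytes     */
#define REG_COMMAND        0x020   /* command register (8 bits used)   */

/* Command encoding from Figure 4.6: bit 4 = partial (1) or complete (0)
 * reconfiguration, bits 2..0 = slave FPGA number.                       */
static inline uint8_t make_command(int partial, int fpga)
{
    return (uint8_t)(((partial & 1) << 4) | (fpga & 0x7));
}

/* `regs` is assumed to be an already mapped pointer to the controller. */
void reconfigure(volatile uint8_t *regs,
                 uint32_t bitstream_addr, uint32_t bitstream_size,
                 int partial, int fpga)
{
    /* setup phase: tell the controller where the bitstream lives */
    *(volatile uint32_t *)(regs + REG_BITSTREAM_ADDR) = bitstream_addr;
    *(volatile uint32_t *)(regs + REG_BITSTREAM_SIZE) = bitstream_size;

    /* reconfiguration phase: writing the Command register starts the
     * DMA transfer, and the processor is free to run other processes */
    *(regs + REG_COMMAND) = make_command(partial, fpga);
}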
4.2.2.2
The MAC kernel module
Each slave FPGA comprises a Wishbone bus to which the hardware modules are dynamically connected. The bus-bridges that are used to connect the modules to the processor system require the Medium Access Control (MAC) for the communication with the modules.
These MACs differ considerably from those used in standard on-chip bus
systems, since they have to deal with a changing number of communication
participants. Thus, they provide the ability to allocate address space for each
loaded module at run-time. This allows for a very flexible use of the available
bandwidth as well as for multiple instantiations of modules (e.g., two identical
Adder-modules loaded for different tasks).
The MAC kernel module is the part of the system that provides the setup of the communication between the processor and the configured IP-Cores, setting the correct address space for each one of them; in fact, when an IP-Core is configured on a slave FPGA its address range is known and it is possible to search for a free space for it on the Wishbone bus.
The information about the address space reserved for the new IP-Core on this bus is passed to the MAC through an IOCTL call to the MAC kernel module and then the MAC itself takes care of the communication setup.
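From the manager's point of view, this setup can be pictured as a single IOCTL call on the MAC device; the device path, the request code and the argument layout below are assumptions made for the sake of the example.

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define MAC_SET_RANGE 0x4d01        /* invented ioctl request code */

struct mac_range {                  /* invented argument layout    */
    uint32_t base_addr;             /* low end of the address space  */
    uint32_t high_addr;             /* high end of the address space */
    int      module_nr;             /* module owning the range       */
};

int mac_assign_range(uint32_t base, uint32_t high, int module_nr)
{
    struct mac_range r = { base, high, module_nr };
    int fd = open("/dev/mac", O_RDWR);

    if (fd < 0)
        return -1;
    int ret = ioctl(fd, MAC_SET_RANGE, &r);
    close(fd);
    return ret;
}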
4.2.2.3
The LOL kernel module
The LOL (Load On Linux) kernel module is used to dynamically manage the registering and unregistering of devices. Each time a device driver is loaded, it makes a call to a function exported by the LOL kernel module. This function is used to communicate to the LOL kernel module the values of three function pointers and of an integer number:
• the first function pointer refers to the add_device function of the loaded device driver, which registers a new device; when a new device is added, the device driver has to store its minor number and to register the corresponding memory space for communication;
• the second pointer is the rem_device function pointer, which is responsible for the deletion of an existing device; when the deletion of a device is performed the device driver has both to update its table of devices, deleting all the information corresponding to the removed device, and to free the memory space occupied by the removed device;
• the last pointer refers to the clean-up function that is used to unload the device driver; this request can be performed when no more devices are registered on it and either there is no more space in memory to keep the device driver or the device driver is no longer useful for the system;
• the integer number represents the major number of the device driver itself; this major number is dynamically assigned to the device driver by the Operating System. Since it is necessary to know this number to establish a communication with the device driver and there is no way to know it directly, the device driver has to give it to another kernel module whose major number is well known. This can be the LOL kernel module, which is able to store this major number in a place that is accessible by the upper level; otherwise the communication between the software applications and the device driver cannot take place.
The LOL kernel module stores the information on each device driver in a table.
When a new device has to be added or an existing one has to be deleted from a
device driver already loaded in the system, it is possible to find in this table the
pointer to the right function that is able to perform the requested action.
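The registration described above can be pictured as a small descriptor handed to LOL when the driver is loaded; the names in this C sketch are invented, but its content mirrors the three function pointers and the major number of the list.

/* Hypothetical descriptor a device driver hands to the LOL kernel
 * module when it is loaded; it mirrors the four items listed above. */
struct lol_driver {
    int (*add_device)(int minor);    /* register a new device     */
    int (*rem_device)(int minor);    /* delete an existing device */
    void (*cleanup)(void);           /* unload the driver         */
    int major;                       /* dynamically assigned      */
};

#define LOL_MAX_DRIVERS 16

/* LOL stores the descriptors in a table, indexed at registration
 * time, where the right function pointer can later be found.      */
static struct lol_driver lol_table[LOL_MAX_DRIVERS];
static int lol_count;

int lol_register_driver(const struct lol_driver *drv)
{
    if (lol_count >= LOL_MAX_DRIVERS)
        return -1;
    lol_table[lol_count++] = *drv;
    return 0;
}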
4.2.2.4
The Reconfiguration Library
The aim of the Reconfiguration Library is to provide a simple and optimized mechanism to interact with both the Reconfigurator Controller kernel module and the MAC kernel module.
To improve the usability of the Reconfigurator Controller kernel module, the
Reconfiguration Library offers a collection of functions that implement the IOCTL
calls that are necessary to perform the following actions:
• write the bitstream base address in the corresponding Bitstream base address register of the Reconfigurator Controller,
• write the bitstream size in the corresponding Bitstream dimension register
of the Reconfigurator Controller,
• write the command in the corresponding Command register of the Reconfigurator Controller, and
• reset all the Reconfigurator Controller registers.
In addition to these simple processes, the Reconfiguration Library also implements two complex processes that combine the basic IOCTL calls to achieve the
following flows:
• configuration, which takes as input the bitstream base address, the bitstream size and the number of the slave FPGA on which to perform a total configuration with the given bitstream, and
• reconfiguration, which is similar to the previous flow, but performs a partial reconfiguration on the selected slave FPGA.
On the other hand, to improve the usability of the MAC kernel module, the Reconfiguration Library offers a set of functions that allow the following IOCTL calls to be performed:
• reset all the MAC kernel module registers,
• write the base address of the address space on the corresponding register
of the MAC kernel module,
• write the high address of the address space on the corresponding register
of the MAC kernel module, and
• write the number of the module that corresponds to the selected address space range.
In conclusion, using the Reconfiguration Library it is possible to communicate with the Reconfigurator Controller and the MAC, both to perform a partial or a complete reconfiguration and to set up the correct address space on the Wishbone bus for the reconfigured module, with just a few function calls.
4.2.3
The ROTFL architecture
The reconfiguration support described in Section 4.2.2 makes it possible to configure a slave FPGA with a given bitstream and to set up the correct address space for it on the Wishbone bus.
Thanks to this support, partial dynamic reconfiguration can be performed by the Operating System in a very simple way, so there is the need for an architecture capable of receiving module requests from software applications, of successfully completing the whole reconfiguration process and of answering the requests by giving back to the applications the name of the device that has to be used to perform the requested functionality.
This kind of architecture can be used as a support to write software applications, such as the software controller, able to execute some processes with reconfigurable hardware instead of using only the processor on which the Operating System is running.
These applications don’t have to manage anything about reconfiguration,
they only have to know the interfaces of the functions that are necessary to
perform the following steps:
• to request a hardware module that is able to perform the desired functionality,
• to interact with the requested module, writing to and reading from its registers, and
• to delete the module when it is no longer in use.
The software architecture proposed in this thesis, called ROTFL (Reconfiguration Of The FPGA under Linux), implements the previous functions. As
shown in Figure 4.7, it is characterized by three components: the ROTFL Library, the ROTFL Daemon and the ROTFL Repository.
• The ROTFL Library, detailed in Section 4.2.3.1, is an interface that provides
the possibility to communicate through sockets with the ROTFL Daemon;
Figure 4.7: Software Architecture schematic (the software application uses the ROTFL Library to reach the ROTFL Daemon, composed of the Module, Allocation and Positioning Managers, which relies on the ROTFL Repository and on the LOL, MAC and Reconfigurator Controller kernel modules)
in other words it allows the interaction, using a simple function call, with
the ROTFL Daemon by sending a command to it and by receiving from it
the result of the process.
• The ROTFL Daemon, an application that runs on the Operating System waiting for a socket command, is presented in Section 4.2.3.2; it is capable of handling requests such as the configuration of a new module or the deletion of an existing one. These tasks are accomplished by the three managers of which the ROTFL Daemon is composed: the ROTFL Module Manager, the ROTFL Allocation Manager and the ROTFL Positioning Manager. Each of these managers tries to handle the requests by itself, and only if this is not possible are the requests forwarded to the next manager, which is located at a lower level in the hierarchy.
• The ROTFL Repository is introduced in Section 4.2.3.6; it is a sort of database to store and retrieve information about bitstream locations, bitstream dimensions, device driver names and paths, and module specifications.
For the implementation of dynamically reconfigurable systems it can be useful to adopt a layer model that systematically abstracts from the hardware resources. Each layer represents a set of components within the HW/SW architecture that is part of the reconfiguration process.
By defining these layers, and especially the interfaces between neighboring layers, the reusability of existing components is increased, while the error-proneness of the system design is significantly reduced.
This layer model has been applied to the development of the proposed software architecture. The layers in which the ROTFL architecture can be divided
are shown in Figure 4.8.
Figure 4.8: Architectural layers (Application, Module Management, Allocation, Positioning, Configuration and Hardware Layers)
The uppermost layer is the Application Layer. It represents all applications
that are using dynamically reconfigurable hardware. Any application that
wants to load a new hardware module makes a module request to the Module
Management Layer.
This layer holds a list of all currently loaded hardware modules. In case of a
module request, it checks whether any inactive module of the requested type is
available, and returns a reference to this module to the application, changing its
status from cached to running. If no such module exists, the Module Management
Layer requests a module placement from the Allocation Layer.
The Allocation Layer is responsible for choosing appropriate reconfigurable
resources for requested modules as well as for allocating address spaces on the
communication infrastructures (e.g., the Wishbone bus). This layer is the uppermost layer that knows about the physical arrangement of the reconfigurable
resources such as the existence of multiple FPGAs.
When the most suitable reconfigurable resources for the requested module
have been found, both this information and the module type are given to the
following layer. This is the Positioning Layer, which loads a bitstream from a local
bitstreams repository and adapts the position information to the given position.
This manipulated bitstream is then given to the Configuration Layer. This
layer contains interfaces to all existing reconfigurable resources, such as the
ICAP for self-configuration, or the SelectMap and JTAG interfaces for external
configuration.
The reconfigurable resources themselves, which can be distributed over several FPGAs, are represented by the Hardware Layer.
The whole software architecture has been developed according to the presented layer model, using well-defined interfaces between contiguous layers.
In this way it is possible to obtain a high level of flexibility, since there is no
need to change the whole architecture structure if a single layer has to be modified.
4.2.3.1
The ROTFL Library
The ROTFL Library is mainly used by the software applications that want to
work with reconfigurable hardware. This library simplifies the reconfiguration
tasks by providing the user-side applications or the software reconfiguration
controller with the following functions.
• The ROTFL_add function is used to add a new hardware module to the system. When a user-side application needs an IP-Core, it has to call this function with the name of the desired module. The second parameter that this function takes as input is a char pointer to the string that will contain the result of the requested process. This function returns an integer value that indicates whether the requested process has been successfully completed. If the returned value is zero, the second parameter given to the function contains the name of the device that is able to perform the desired functionality. On the contrary, if the returned value is different from zero, it means that an error has occurred and the second parameter contains a message that describes the error type.
• Another function provided by the ROTFL Library is the ROTFL_del function, which can be used when a module is no longer in use and no longer useful for the user-side application. This function is similar to the previous one from the input point of view, since the first parameter represents the name of the module that has to be deleted and the second one still denotes the string on which the result description has to be written. The output is also homologous to the previous one: if the returned value is different from zero an error has occurred during the execution of the requested operation and the string pointed to by the second parameter contains the error description, while a returned value equal to zero stands for a successfully completed operation.
• The ROTFL_rem function is almost identical to ROTFL_del, but it specifies to the ROTFL Daemon that the removed module cannot be handled as a cached module, but has to be physically removed from the system. This can be useful either when it is known that this module cannot be useful to any other application or when there is the need to prevent anyone else from using the same IP-Core.
• The last function provided by the ROTFL Library is the ROTFL_list function, which takes as input a pointer to the string that will contain the list of the modules physically configured on the reprogrammable devices. This list is a sort of description of the status of each slave FPGA, with explicit information about running and cached modules. It is possible to employ this information to monitor the availability of the slots of each reprogrammable device.
The ROTFL Library converts each of the presented function calls into a socket communication flow with the ROTFL Daemon. This mechanism allows both to hide the handling of the socket communication from user-side applications and to export a collection of simple functions that abstract the implementation of the reconfiguration tasks.
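A typical usage pattern from an application is sketched below; the exact prototypes are not spelled out in the text, so the buffer handling and the module name used here are assumptions.

#include <stdio.h>

/* Prototypes as suggested by the description above; the declarations
 * in the real ROTFL Library header may differ.                        */
int ROTFL_add(const char *module, char *result);
int ROTFL_del(const char *module, char *result);

int use_adder(void)
{
    char result[256];

    /* ask for an IP-Core of the "adder" family (name is an example) */
    if (ROTFL_add("adder", result) != 0) {
        fprintf(stderr, "module refused: %s\n", result);
        return -1;
    }
    /* `result` now holds the device name to be used through the
     * corresponding user-side driver                               */
    printf("adder available as %s\n", result);

    /* ... interact with the module through its device ... */

    ROTFL_del("adder", result);    /* the module becomes cached */
    return 0;
}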
4.2.3.2
The ROTFL Daemon
The ROTFL Daemon is the centralized manager that is located between the ROTFL Library and the Operating System reconfiguration support, which consists of the collection of kernel modules presented in Section 4.2.2.
The aim of this component is to manage each socket request that comes from the Application Layer, in general from the ROTFL Library, as shown in Figure 4.9. These commands can be the request to add, to delete or to remove a module, or to get the list of the configured IP-Cores.
This daemon consists of the three following managers, each of which is located on the corresponding layer:
• the ROTFL Module Manager, presented in Section 4.2.3.3, that is located in
the Module Management Layer,
• the ROTFL Allocation Manager, described in Section 4.2.3.4, that implements the Allocation Layer, and
• the ROTFL Positioning Manager, introduced in Section 4.2.3.5, that is located in the Positioning Layer.
Figure 4.9: Socket communication (the client, i.e. the ROTFL Library, sends a command to the server, i.e. the ROTFL Daemon, which elaborates the request and returns the result to the client)
These managers have been developed to communicate with each other through well-defined interfaces, so it is possible to develop several versions of the same manager that use the same interface.
Since these different implementations of the managers are completely interchangeable, it is possible to choose the most suitable solution for each different
scenario without the need to change the whole structure.
4.2.3.3
The ROTFL Module Manager
When the ROTFL Daemon receives a request, the first manager that analyzes the received command is the ROTFL Module Manager. The aim of this manager is to implement a sort of module cache.
This manager is able to handle all the requests that come from the ROTFL
Library. In particular commands are treated in the following ways.
• When the configuration of a new module is requested, with a module_manager_add function call, the first step that the ROTFL Module Manager performs is to check whether the table of the configured IP-Cores contains a cached module of the same kind as the requested one (a sketch of this flow is given after the list). The manager is able to accomplish the module request by itself only if this search ends successfully, otherwise the request is forwarded to the ROTFL Allocation Manager.
To be more precise, if the cached module is found, the manager returns its name to the user-side application, without performing any physical reconfiguration. The status of this module is changed from cached to running, to avoid accidentally deleting it or assigning it to another user-side application.
On the contrary, if no cached module is found, then the module request has to be forwarded to the following manager, the ROTFL Allocation Manager, which will perform the real reconfiguration of the new IP-Core. At the same time the ROTFL Module Manager has to retrieve from the ROTFL Repository the name of the device driver that is able to handle the requested module. If this device driver is not yet loaded in the system, the ROTFL Module Manager has to load it, otherwise the Operating System will not be able to manage the newly configured module.
• When the deletion of a module is requested, with a module_manager_del function call, the only step that the ROTFL Module Manager has to perform is to change the status of the selected module from running to cached. In this way the module will be available for other user-side applications that need an IP-Core of the same kind as the cached one. In this specific case no physical reconfiguration has to be performed, since the configuration status of the reprogrammable devices has to remain the same.
• In a similar way, when the removal of a module is requested with a module_manager_rem function call, there is no need to reconfigure the reprogrammable device. To perform the removal of the selected module, in fact, it is sufficient to delete from the table of the configured IP-Cores all the information concerning the selected module, to free its address space on the Wishbone bus and to unregister the corresponding device to free the memory reserved for it. In this way all the resources occupied by the removed module are freed and made available for other IP-Cores.
• The last request that can come from the ROTFL Library is the list of the configured modules. If the user-side application requests this list, the ROTFL Module Manager has just to return all the information contained in the table of the configured IP-Cores, since it is always coherent with the real status of all the slave FPGAs.
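The add flow of the first item can be summarized by the following C sketch; the table layout and the function names are hypothetical, and the device driver handling is omitted for brevity.

#include <string.h>

/* Hypothetical view of the ROTFL Module Manager table. */
enum { ST_FREE = 0, ST_RUNNING = 1, ST_CACHED = -2 };

struct ipcore {
    char family[32];
    char device[32];     /* e.g. the name of the device in /dev */
    int  status;
};

#define MAX_CORES 32
static struct ipcore table[MAX_CORES];

/* Defined elsewhere: forwards the request to the Allocation Manager. */
int allocation_manager_add(const char *family, char *device);

int module_manager_add(const char *family, char *device)
{
    /* first look for a cached module of the requested family */
    for (int i = 0; i < MAX_CORES; i++)
        if (table[i].status == ST_CACHED &&
            strcmp(table[i].family, family) == 0) {
            table[i].status = ST_RUNNING;     /* cached -> running  */
            strcpy(device, table[i].device);  /* no reconfiguration */
            return 0;
        }
    /* cache miss: a physical reconfiguration is needed */
    return allocation_manager_add(family, device);
}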
4.2.3.4
The ROTFL Allocation Manager
When it is impossible to find a cached module that is able to perform the requested functionality, the new module has to be configured somewhere on one of the reprogrammable devices. The aim of the ROTFL Allocation Manager is to find a suitable location to place the module that has to be configured.
Moreover the ROTFL Allocation Manager also has to find a free place on the
Wishbone bus where it is possible to position the address space that the module
needs to establish the communication with the rest of the system.
These tasks can be performed in several different ways and using various algorithms. The simplest algorithm that accomplishes these processes is the one that returns the first suitable position that is found. However, it is possible to implement algorithms that are able to choose the most suitable FPGA on which to configure the module and the best place for its address space on the Wishbone bus, by following metrics like the time that has to be spent for the search or the fragmentation of the FPGAs.
To achieve its objective, each algorithm needs to retrieve from the ROTFL Repository the information on the size and the range of the address space of the new module. It is possible that a single module can be configured with different bitstreams that need different combinations of slots and that represent IP-Cores of different sizes. In this case the algorithm can also evaluate which one is the most suitable for each particular situation.
When the ROTFL Allocation Manager has found both the place where it is
possible to configure the IP-Core and the base address of its address space on
the Wishbone bus, this information is given to the ROTFL Positioning Manager.
This last manager is responsible for the selection of the bitstream that is able to
configure the selected FPGA at the position specified by the ROTFL Allocation
Manager.
The ROTFL Allocation Manager implemented using a genetic algorithm
Genetic algorithms, in computer science, are a class of search techniques modeled on evolutionary biology. The ROTFL Allocation Manager has been implemented using an algorithm of this class, since it makes it possible to look for a good sub-optimal solution in a reasonable time, while an exhaustive search can be excessively slow. Moreover, using a genetic algorithm it is possible to adapt the search process to the current status of the system, by tuning parameters such as the probability of the crossover or mutation process. It is possible, in fact, that at a particular moment the system needs a fast configuration process (which implies a fast allocation task) or a very accurate solution to avoid wasting configurable resources (which requires a very precise allocation task).
Since a genetic algorithm is appropriate for both the presented situations,
it can represent an excellent and parametric compromise between the optimality of the final solution and the time constraints imposed by the dynamically
reconfigurable scenario.
Genetic algorithms Evolution is a long-time-scale process that changes a population of organisms by generating better offspring through reproduction. Borrowing this idea from biology, a learning process can be modelled as evolution. Genetic algorithms are thus inspired by Darwin's theory of evolution: problems are solved by an evolutionary process that mimics natural evolution in looking for a best (fittest) solution (survivor). They are a part of evolutionary computing.
Genetic algorithms are based on the following concepts:
• chromosome is the coding of a possible solution for a given problem
• gene is the coding of a part of the solution
• allele is one of the elements used to code the genes
• fitness is the evaluation of the actual solution
• crossover is the generation of a new solution by mixing two existing solutions
• mutation is a random change in the solution
According to Darwin’s theory of evolution the best chromosome survive to
create new offspring. Crossover and mutation depend on the encoding of chromosomes. Mutation is intended to prevent falling of all solutions in the population into a local optimum.
The basic genetic algorithm is based on the following steps:
1. generation of a random population of chromosomes;
2. evaluation of the fitness of each chromosome in the population;
3. creation of a new population by repeating the following steps until the
new population is complete:
(a) selection of two parent chromosomes from a population according to
their fitness;
(b) crossover of the parents to form a new offspring;
(c) mutation of the new offspring at each locus;
(d) placement of the new offspring in the new population;
4. the new population is used for a further run of the algorithm;
5. if the end condition is satisfied, the best solution in current population is
returned;
6. otherwise the cycle is repeated starting again from point 2.
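The skeleton of these steps could read as follows in C; fitness evaluation, selection, crossover and mutation are left as declared stubs, since their definition is problem-specific (the encoding actually used by the ROTFL Allocation Manager is discussed next).

#define POP_SIZE 32
#define GENOME   16    /* genes per chromosome (example size) */

typedef struct {
    int gene[GENOME];
    double fitness;
} chromosome;

/* Problem-specific operators, only declared in this sketch. */
double evaluate(const chromosome *c);
const chromosome *select_parent(const chromosome pop[]);
chromosome crossover(const chromosome *a, const chromosome *b);
void mutate(chromosome *c, double prob);

void genetic_search(chromosome pop[], int generations, double mut_prob)
{
    for (int g = 0; g < generations; g++) {
        chromosome next[POP_SIZE];

        for (int i = 0; i < POP_SIZE; i++)              /* step 2 */
            pop[i].fitness = evaluate(&pop[i]);
        for (int i = 0; i < POP_SIZE; i++) {            /* step 3 */
            const chromosome *a = select_parent(pop);   /* 3(a)   */
            const chromosome *b = select_parent(pop);
            next[i] = crossover(a, b);                  /* 3(b)   */
            mutate(&next[i], mut_prob);                 /* 3(c)   */
        }
        for (int i = 0; i < POP_SIZE; i++)              /* step 4 */
            pop[i] = next[i];
    }
}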
In general, genetic algorithms are best suited for the following cases:
• a big, non-unimodal and non-smooth search space
• a noisy and usually non-analytic fitness function
• looking for a good sub-optimum in a reasonable time
They can be used for many applications, for example optimization, prediction, classification, economy, ecology and automatic programming. In this case the algorithm has been applied to the allocation of dynamically reconfigurable modules. When a new module has to be reconfigured in the system, in fact, there is the need to find a suitable free place where it can be configured. This search task has been modeled with a genetic algorithm in which each
chromosome represents a configuration status of the reprogrammable devices
and both crossover and mutation processes try to change the previously found
location for the new module in order to achieve a better fitness, which stands for
the goodness of the final solution.
Encoding There are many parameters and settings that can be implemented in a different way for each class of problems: how to create chromosomes and what kind of encoding is suitable for each particular situation; how to select parents for the crossover process, following the idea that better parents will produce better offspring; and how to define the crossover and mutation tasks, which are the two basic operators of genetic algorithms.
Therefore, the first step in developing a genetic algorithm is defining a suitable solution encoding. A chromosome should in some way contain information about the solution that it represents. Since the encoding depends mainly on the problem to be solved, for the ROTFL Allocation Manager a pair of arrays has been chosen, the Slots and the Modules arrays. Figure 4.10 shows an example chromosome of a system that contains only one slave FPGA with four slots.
The first array consists of a collection of genes, which contain the information on which module is configured on each slot of the reprogrammable device.
In particular each gene directly corresponds to a single slot of a slave FPGA.
Since on a device of n slots it is possible to configure not more than n modules
Figure 4.10: Genetic algorithm chromosome
Slots (slots 0-3): 1 0 3 3
Modules (modules 0-3): 0 -2 0 1
(this is possible only when each configured module requires just one slot), the alleles of this kind of gene are represented by the numbers from 0 to n-1.
The numbers contained in the Slots array correspond to the positions of genes in the second array. The Modules array, in fact, is composed of a set of genes that represent hardware IP-Cores. The following numbers represent the codification of the alleles for this second kind of gene:
• 0: this number means that the module is not configured on the reprogrammable device, since it has not yet been placed or it has already been deleted from the system
• 1: this number indicates that the module has already been configured on the FPGA and it is still running, so at this moment it cannot be directly unloaded from the system
• -2: a module characterized by this number is a cached IP-Core. In other words it is a module that has already been placed on the reprogrammable device but is not currently used by any user-side application, thus it is possible to unload it and overwrite its slots with the configuration of a more useful IP-Core
The example shown in Figure 4.10 represents a status of the system in which
the second module (module 1) is configured on the first slot of the FPGA (slot 0)
and the fourth module (module 3) is placed on the third and on the fourth slot
(slot 2 and slot 3), while the second slot (slot 1) is free (since the first module,
module 0, is not configured).
The Modules array gives further information, indicating that the second module (module 1) is cached, while the fourth module (module 3) is still running. This means that the biggest module that can be configured starting from this status is a module that requires two slots, since it can be configured on the first two slots of the FPGA (slot 0 and slot 1), by unloading the second module (module 1), which is currently cached.
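In C, the chromosome of Figure 4.10 is simply the following pair of arrays; a slot is free when the module it points to is not configured.

/* Module status encoding from the list above. */
enum { NOT_CONFIGURED = 0, RUNNING = 1, CACHED = -2 };

/* Chromosome of Figure 4.10: one slave FPGA with four slots. */
int slots[4]   = { 1, 0, 3, 3 };             /* module index per slot */
int modules[4] = { NOT_CONFIGURED, CACHED,   /* status per module     */
                   NOT_CONFIGURED, RUNNING };

/* Slot i is free when the module configured on it is not present;
 * with the values above, only slot 1 is free.                      */
int slot_is_free(int i)
{
    return modules[slots[i]] == NOT_CONFIGURED;
}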
After the choice of the proper coding for chromosomes, genes and alleles, a suitable fitness function has to be defined. The main objective of the ROTFL Allocation Manager is to handle the configurable space of the reprogrammable device in order to avoid both a waste of slots and the refusal of the configuration of an IP-Core, which happens when there is no place where it can be configured. This means that it is desirable to keep the free slots all together, without breaking them into many smaller separate sets of free slots, since a large collection of contiguous slots makes it possible to configure bigger modules as well.
For this reason the fitness function has been defined as a number that increases by a small quantity for each free slot. This quantity starts from a default value, but it gets bigger when a free slot is followed by another free slot. On the contrary, when a free slot is followed by a slot containing a cached or a running module, the gain goes back to the default value. Moreover, to prefer solutions with a large number of cached modules, which are useful to speed up the reconfiguration process, a fixed reward has also been introduced for each cached IP-Core of the solution.
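A direct C translation of this fitness definition, with the same parameters used in Figure 4.11 (default gain of 2, increment of 1 per contiguous free slot, reward of 1 per cached module), could be the following; on the three chromosomes of the figure it yields exactly the fitness values 13, 11 and 16 discussed below.

enum { NOT_CONFIGURED = 0, RUNNING = 1, CACHED = -2 };

int fitness(const int slots[], const int modules[],
            int n_slots, int n_modules)
{
    int score = 0;
    int gain = 2;                        /* default gain per free slot */

    for (int i = 0; i < n_slots; i++) {
        if (modules[slots[i]] == NOT_CONFIGURED) {
            score += gain;               /* a free slot scores ...     */
            gain += 1;                   /* ... and raises the gain    */
        } else {
            gain = 2;                    /* occupied slot resets it    */
        }
    }
    for (int m = 0; m < n_modules; m++)
        if (modules[m] == CACHED)
            score += 1;                  /* fixed reward per cached IP-Core */
    return score;
}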
Figure 4.11 shows an example of the evaluation of the fitness function on three given chromosomes, with a default gain of 2 points, increased by 1 point for each contiguous free slot, and a fixed reward of 1 point for each cached module. The three chromosomes are very similar, but the seventh module (module 6) is placed in a different position in each solution. In the first example (A), the seventh module is located at the end of the FPGA, in the second example (B) it is configured so that it breaks the set of the last four free slots, while in the third example (C) it has been placed in the most suitable location, that is, the second slot (slot 1). Even if the number of configured IP-Cores, the number of cached modules and the total number of free slots are the same for all the solutions, the first one presents two sets of free slots (whose sizes are respectively 1 and 3 slots) with a fitness of 13, the second one three sets (whose sizes are respectively 1, 2 and 1 slots) with a fitness of 11, and the third one a single set (whose size is 4 slots) with a fitness of 16. Obviously the last solution is the most suitable one, since it is the only one that allows the configuration of a new module that requires 4 contiguous slots; in fact it presents the highest fitness within the class of the presented solutions.
Development details. The proposed genetic algorithm for the ROTFL Allocation Manager is executed each time a set of new modules has to be configured on the reprogrammable devices of the system, and it makes it possible to choose, for each module, the best location where it should be placed.
If each module can be placed in n positions, an exhaustive search with a set of m IP-Cores requires n^m evaluations of feasible solutions. With a genetic algorithm it is possible to considerably decrease the time required by the allocation process, since it works on a smaller set of solutions, trying to modify them to reach a good sub-optimal solution in a reasonable time.
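As a rough, hedged illustration with the test setup used in Chapter 5 (about n ≈ 400 candidate positions per module, requests of m = 3 modules, a population of 30 chromosomes evolved for at most 15 rounds):

    n^m \approx 400^3 = 6.4 \times 10^7 \ \text{exhaustive evaluations}
    \qquad \text{vs.} \qquad
    30 \times 15 = 450 \ \text{chromosome evaluations}

so the genetic algorithm explores several orders of magnitude fewer solutions, at the cost of sub-optimality.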
The size of the initial population is a parameter of the algorithm and it can be changed to tune the performance of the ROTFL Allocation Manager. From this population a set of chromosomes is chosen to create a new population. These chromosomes are called the parents of the offspring, which is formed through the crossover process.
The crossover task is performed by randomly choosing two parents. The new chromosome is generated by keeping the locations of the first half of the m modules from the first parent, while the other locations are taken directly from the second parent. During this phase it is possible to introduce, with a given probability, a mutation, defined as a change in the partial solutions found by the parents. In other words, a location inherited from the parents can be randomly modified, to prevent all the solutions in the population from falling into a local optimum.
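A hedged C sketch of this crossover-with-mutation step, assuming a chromosome is simply an array of m slot indices, one per module (the helper name and the uniform random mutation are illustrative):

    #include <stdlib.h>

    #define MUTATION_PROB 0.5   /* value used in the tests of Chapter 5 */

    void crossover(const int *parent_a, const int *parent_b,
                   int *child, int m, int n_positions)
    {
        for (int i = 0; i < m; i++) {
            /* first half of the module locations from the first parent,
             * the remaining ones from the second parent                 */
            child[i] = (i < m / 2) ? parent_a[i] : parent_b[i];

            /* occasionally perturb the inherited location, so that the
             * population does not collapse into a local optimum         */
            if ((double)rand() / RAND_MAX < MUTATION_PROB)
                child[i] = rand() % n_positions;
        }
    }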
Figure 4.11: Fitness evaluation examples (the Slots, Modules and Fitness arrays of the three chromosomes A, B and C, whose per-slot gains sum to 13, 11 and 16 points respectively)
4.2.3.5 The ROTFL Positioning Manager
The available position for the required module and its corresponding address space on the Wishbone bus, found by the ROTFL Allocation Manager, are the inputs of the following manager, called the ROTFL Positioning Manager. The aim of this manager is both to set up the MAC kernel module with the information concerning the Wishbone bus address space and to retrieve from the ROTFL Repository the base address and the size of the bitstream able to configure the required module at the selected location of the slave FPGA. This information is then used to set up the Reconfiguration Controller kernel module that will perform the physical reconfiguration of the slave FPGA.
Since each partial bitstream is able to configure a module only in one specific position, and this position is encapsulated in the bitstream itself, there is the need to store in memory, for each kind of IP-Core, a different bitstream for each location in which the module can be configured. To avoid this waste of space, it is possible to keep in memory just one bitstream for each IP-Core type, using this base bitstream with a relocation filter, that is, a hardware component able to shift a bitstream to make it suitable for another set of slots of the slave FPGA. The ROTFL Positioning Manager has to know whether the system in which it is working contains a hardware relocation filter, since it behaves in the following ways depending on the presence of this filter (a code sketch of this dispatch follows the list below).
• On one hand, if the hardware relocation filter is not present in the system, the ROTFL Positioning Manager has to search in the ROTFL Repository for the right class of bitstreams able to configure the requested type of IP-Core. Then it has to find, among this collection of bitstreams, the only one that can configure the selected module in the right position of the slave FPGA, that is, the one allocated by the ROTFL Allocation Manager in the previous phase.

• On the other hand, if the system contains the hardware relocation filter, the ROTFL Positioning Manager has to search in the ROTFL Repository for the base bitstream of the selected type of IP-Core. Both the retrieved bitstream and the position to which it has to be relocated, that is, the position allocated by the ROTFL Allocation Manager, are then used to initialize the relocation filter. The output of this component is a new bitstream able to configure the same type of IP-Core as the selected one, but in a different position of the slave FPGA.
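A minimal C sketch of this two-way behavior; the bitstream_t type and every helper function below are hypothetical placeholders, not the actual ROTFL API.

    typedef struct { unsigned long base_addr; unsigned long size; } bitstream_t;

    /* Hypothetical repository and filter helpers, assumed to exist. */
    extern bitstream_t rotfl_repo_lookup(int core_type, int position);
    extern bitstream_t rotfl_repo_lookup_base(int core_type);
    extern bitstream_t relocation_filter_run(const bitstream_t *base, int position);

    bitstream_t select_bitstream(int core_type, int target_position,
                                 int has_relocation_filter)
    {
        if (!has_relocation_filter)
            /* one pre-built bitstream per (core type, position) pair */
            return rotfl_repo_lookup(core_type, target_position);

        /* a single base bitstream per core type, relocated in hardware */
        bitstream_t base = rotfl_repo_lookup_base(core_type);
        return relocation_filter_run(&base, target_position);
    }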
The following step is to set up the Reconfiguration Controller kernel module with the information concerning the proper bitstream, that is, the bitstream retrieved from the ROTFL Repository if the relocation filter is not present in the system, or the adapted one if the system contains the relocation filter. This information, which consists of both the base memory address at which the bitstream file is stored and the size of the file itself, is then used by the Reconfiguration Controller to perform the actual reconfiguration of the slave FPGA. After this task the IP-Core is physically configured in the system, but the communication infrastructure has not yet been established.
To establish communication between the new module and the rest of the system, in fact, a last step has to be performed: the MAC kernel module must be set up with the address space information provided by the ROTFL Allocation Manager, which consists of the base address and the range of the address space assigned to the new module on the Wishbone bus.
4.2.3.6 The ROTFL Repository
The ROTFL Repository is a database that simplifies the management of bitstream files and of their corresponding information, such as their functionality and the name of the device driver able to manage them.
When a new module is requested, the ROTFL Daemon has to check whether the repository contains all the necessary information about this module, i.e., the base address and the size of its bitstream (or bitstreams), the size of the module itself, the required range of address space on the bus, and the name of the device driver able to manage the required module.
In particular, this information is needed by the following stages of the whole reconfiguration process.
• The bitstream base address and its size are required by the Reconfiguration Controller to perform the reconfiguration of the slave FPGA.

• The module size and the address space range are essential to search for free space both on the slave FPGAs and on the Wishbone bus; the ROTFL Allocation Manager uses this information to run the search algorithm that finds where the new module can be configured and which address space is available on the Wishbone bus to establish the communication between the module and the rest of the system.

• The name of the device driver is used by the ROTFL Module Manager, which passes this information to the LOL kernel module. The latter is responsible for the loading and the unloading of the device driver each time a new module is added to the system or an existing one is removed from it.
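As a hedged sketch, a single repository entry could gather exactly the information listed above; the structure and field names are illustrative, not the actual ROTFL definitions.

    #define DRIVER_NAME_LEN 64

    struct rotfl_repo_entry {
        unsigned long bitstream_base;      /* where the bitstream is stored;
                                              used by the Reconfiguration
                                              Controller                     */
        unsigned long bitstream_size;
        unsigned int  module_slots;        /* slots needed on the slave FPGA */
        unsigned long wishbone_range;      /* address space needed on the bus */
        char driver_name[DRIVER_NAME_LEN]; /* loaded and unloaded by the
                                              LOL kernel module              */
    };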
4.3 Concluding remarks
This chapter focuses on the software architecture of the components developed in this thesis, which constitute part of the entire design flow described in Chapter 3.
The tool described in the first part of this chapter is IPGen. This tool makes it possible to take the core logic of a component, which represents the core functionality of the whole IP-Core, and to automatically build both the register mapping and the interface with the desired bus. In this way it is possible to obtain a complete IP-Core that is ready to be plugged into the final system either as a fixed or as a reconfigurable module.
The rest of the chapter focuses on the description of the proposed software architecture. First, a class of reconfigurable hardware platforms has been introduced; these platforms can all be viewed in a uniform way from an abstract point of view, which allows the developed software architecture to be applied to every platform of the family.
After that, the Linux kernel modules infrastructure has been described. This infrastructure, which consists of several kernel modules and a library, constitutes the Linux OS reconfiguration support. The Reconfiguration Controller kernel module is responsible for the physical reconfiguration of the programmable device with a given bitstream. The MAC kernel module sets up the bus communication by configuring the Wishbone address space of the new module with the correct information. The LOL kernel module, in addition to allowing the dynamic registering and unregistering of devices, stores all the information about the registered devices. Finally, the Reconfiguration Library aims at simplifying the writing of applications that have to manage the previously described kernel modules.
In the last part of the chapter, the ROTFL architecture has been described. This architecture consists of a daemon, the ROTFL Daemon; a library, the ROTFL Library, which simplifies the interaction with the daemon; and a repository, the ROTFL Repository, which represents a sort of dynamic database. The ROTFL Daemon implements the Module Management Layer, the Allocation Layer and the Positioning Layer. Each layer is handled by a different manager that can be developed using several algorithms, whose implementation is described by a well-defined interface. In this way it is possible to choose at run-time the most suitable solution for each situation.
Finally, an example has been shown to demonstrate how it is possible to develop a different version of a manager of the ROTFL Daemon using a well-known algorithm. In particular, a new version of the ROTFL Allocation Manager has been built using a genetic algorithm.
Chapter 5
Experimental results
The goal of this chapter is to present an overall view of the experimental results of the proposed implementation of the methodologies introduced in Chapter 3.
In particular, Section 5.1 focuses on the description of the tests for the IPGen tool. This tool has been tested with several components, and all the generated IP-Cores have been physically configured on an FPGA to verify their effectiveness.
The second part of this chapter, Section 5.2, presents a prototyping platform
on which the proposed software architecture is able to run and a collection of
experimental results on the ROTFL architecture. These results describe both the
number of slices and BlockRAMs occupied by the components that are necessary to enable partial reconfiguration and the timing performance of the software architecture on the developed system.
Finally, Section 5.3 summarizes all the results presented in this chapter, in order to provide a complete overview of the performance of the described implementations and to evaluate the effectiveness of the proposed approaches.
5.1 IPGen
The methodology for the automatic generation of IP-Cores, presented in Section 3.2.2, has been developed as explained in Section 4.1 and has been tested under different Operating Systems and architectures. These tests concern several
types of components, starting from some small IP-Cores such as an adder, an XOR and two different multipliers. Moreover, more complex examples have also been examined, e.g., a Discrete Fourier Transform core, various implementations of the AES algorithm, a Siemens Mobile Communications description of a complex ALU and a video editing core that converts the image color space from RGB to YCbCr.
Table 5.1 presents some relevant results, considering both the input core, which represents the core logic, and the obtained component, that is, the final IP-Core produced by the IPGen tool. For each of them, the size in terms of 4-input LUTs and the number of occupied slices are reported, both as absolute values and as percentages of the total FPGA capacity. In addition, the time needed by IPGen to create the IP-Core is specified.
Table 5.1: IPGen tests

IP-Core                 4-input LUTs   Perc.   Slices   Perc.   Time (s)
Core: Mult1                       30      0%       26      1%
IP-Core: Mult1                   172      2%      122      2%      0.049
Core: Mult2                       64      1%       37      1%
IP-Core: Mult2                   339      4%      205      4%      0.053
Core: IrDA                        15      1%       11      1%
IP-Core: IrDA                    146      1%      103      2%      0.045
Core: FIR                        273      2%      153      3%
IP-Core: FIR                     308      3%      173      3%      0.058
Core: AES128                    4124     42%     2132     43%
IP-Core: AES128                 4314     44%     2250     46%      0.075
Core: RGB2YCbCr                 1028     10%      913     18%
IP-Core: RGB2YCbCr               848      9%      940     19%      0.063
Core: Complex ALU               1750     18%      950     19%
IP-Core: Complex ALU            2089     21%     1079     22%      0.071
On one hand, the relative overhead due to the interfacing of the core logic with the Wishbone bus is acceptable, both for the 4-input LUTs and for the occupied slices, especially when the core size is significant. This allows the generated IP-Cores to be used in the final reconfigurable system without wasting too much space on the reconfigurable devices.

On the other hand, the computation time is extremely low with respect to the whole embedded system design process. In particular, it is almost constant, 0.065 seconds on average, ranging from 0.045 seconds to 0.075 seconds.

In conclusion, the proposed flow for automatic IP-Core generation has successfully passed all the proposed tests, generating working components that can be imported into standard architectures with Wishbone bus communication to obtain a bitstream that can be directly downloaded onto an FPGA. Moreover, the IPGen tool, which implements this flow, is characterized by very good performance, introducing only a small overhead in the size of the final IP-Core.
5.2 Software architecture
To prove the correctness of the software architecture proposed in Section 4.2.3,
the ROTFL solution has been tested on the RAPTOR2000 board [25]. Using this
board, whose detailed description can be found in Section 5.2.1, it is possible to
implement several kinds of reconfigurable systems that can be associated with
the platform classes introduced in Section 4.2.1.
The first set of tests concerns the implementation of the ROTFL Allocation Manager obtained by using a genetic algorithm. This solution has been tested and compared with several other implementations of the same manager, and the final results are described in Section 5.2.2.
Tests and results concerning the whole ROTFL architecture, instead, are presented in Section 5.2.3. The aim of these tests is to obtain information on the latency introduced by the partial reconfiguration of the slave FPGAs and on the Operating System timing overhead. In this way it is possible to evaluate the timing performance of the ROTFL architecture in a real reconfigurable embedded system.
5.2.1 RAPTOR2000 board
For a prototype implementation of one of the multi-FPGA reconfigurable systems proposed in Section 4.2.1, the RAPTOR2000 hardware architecture [25] has been used. RAPTOR2000 is a prototyping platform that consists of a motherboard and up to six daughter-boards. The motherboard provides several communication infrastructures and a configuration environment for the partial and dynamic configuration of the FPGAs located on the daughter-boards.
Figure 5.1 shows the schematic of the system that has been developed by using the RAPTOR2000 board. It consists of a Xilinx Virtex-2Pro FPGA and two Xilinx Virtex-II FPGAs. The Virtex-2Pro FPGA, which is used to run the software solution, contains a PowerPC and several static hardware components, such as a memory controller, general-purpose inputs/outputs and the Reconfiguration Controller (represented by the VCM, Virtex Configuration Manager).
Figure 5.1: Multi-FPGA system on RAPTOR2000
The Virtex-II FPGAs represent the reconfigurable resources used to dynamically load hardware modules into the system. Moreover, each Virtex-II FPGA
includes a Wishbone bus to which the hardware modules are connected dynamically. The bus-bridges that are used to connect the modules to the processor system include the Medium Access Control (MAC) for the communication
with the modules.
The Reconfiguration Controller is a hardware component that represents the
Allocation Layer and a part of the Positioning Layer. A special feature of this component is its direct memory access (DMA) to the local SDRAM memory. This
enables very fast configurations when downloading bitstreams from a given position within the memory to a selected FPGA within the RAPTOR2000 system.
5.2.2 ROTFL Allocation Manager
To prove the flexibility and the adaptability of the ROTFL architecture, one of
its components, the ROTFL Allocation Manager, has been developed using a genetic algorithm, as described in Section 4.2.3.4. To evaluate the performance
of the proposed solution, it has been tested and the obtained results have been
compared with those achieved by other implementations of the same manager
that use different algorithms.
In the performed tests, the only parameter imposed by the system is the number of reconfigurable slots. This number represents the size of the reprogrammable devices divided by the size of each slot. Obviously, the duration of the exhaustive algorithm considerably increases with the number of slots of the system, while both the random and the genetic algorithm are almost independent of this parameter: there is a slight performance reduction due to the increased size of the data structures, but the overall complexity of these algorithms remains the same.
The number of modules that have to be placed at the same time on the reconfigurable system also affects the presented algorithms in a similar way, but the difference is that this parameter can be chosen and modified either at compile-time or at run-time.
Moreover, the following set of parameters can be used to specifically tune
the performance of the genetic algorithm.
• Minimum fitness: this is the minimum fitness that allows a solution to be
chosen as the final solution of the algorithm before the maximum number
of rounds has been processed. If this limit is too high, then the maximum
number of rounds is always reached.
• Maximum number of rounds: this number represents the number of evolution cycles that are performed to obtain the final solution. On one hand, if this number is too small, the final solution may have a very low fitness. On the other hand, if it is too large, the performance of the algorithm can be drastically reduced.
• Initial population size: this is the number of chromosomes of the initial, randomly created population. Obviously, a bigger initial population is more likely to contain a good solution than a smaller one.
• Selection size: this number represents the number of parent chromosomes that are kept in the following generation and that are used to form the new offspring, preserving high-fitness solutions across generations.
• Crossover probability: this is the probability of performing a crossover between the two parents during the reproduction phase, mixing two good solutions in the hope of forming a better one. If no crossover is executed, one of the two parents is copied directly into the new population.
• Mutation probability: this is the probability of performing a mutation during the generation of the offspring, randomly modifying new chromosomes.
All these parameters can be modified to tune the genetic algorithm performance or to adapt the algorithm to a specific situation. The following results have been obtained by using a system with 400 slots and by configuring the genetic algorithm with a minimum fitness of 10000 points, a maximum of 15 rounds, a population of 30 individuals, a selection size of 15 individuals, and both a crossover and a mutation probability of 50%.
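The parameters just listed can be gathered in a hypothetical C configuration structure; the struct itself is illustrative, while the values are the ones used in the tests below.

    struct ga_config {
        int    min_fitness;      /* early-exit threshold                  */
        int    max_rounds;       /* upper bound on evolution cycles       */
        int    population_size;  /* chromosomes in the initial population */
        int    selection_size;   /* parents kept for the next generation  */
        double crossover_prob;   /* probability of mixing two parents     */
        double mutation_prob;    /* probability of perturbing a gene      */
    };

    static const struct ga_config test_config = {
        .min_fitness     = 10000,
        .max_rounds      = 15,
        .population_size = 30,
        .selection_size  = 15,
        .crossover_prob  = 0.5,
        .mutation_prob   = 0.5,
    };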
To evaluate the performance of the proposed algorithm, both the timing performance and the quality of the final results of the following collection of different implementations have been examined.
• The null implementation of the ROTFL Allocation Manager is the base implementation, useful just to estimate the overhead due to the testbench. It follows a very simple behavior, since it always answers the test application by refusing the requested module. The time taken by this implementation is the time spent in the creation of the requests and in the management of the FPGA structure by the test application.
• The random solution is an implementation that tries to place the requested
module in a random feasible location. If that position is free, then the
module can be configured, otherwise the request is directly refused.
• The genetic algorithm is the proposed solution that implements the
ROTFL Allocation Manager using the previously described approach.
• The exhaustive solution, finally, tries all the possible placement combinations to find the solution that minimizes the fragmentation of the slave FPGAs. This implementation is obviously the slowest, but it is also the one that provides the best final results.
Table 5.2 shows the performance of the previously presented algorithms. The total time, expressed in seconds, is based on a test that performs 100 rounds, each consisting of 100 insertions of 3 modules, in a system with 400 reconfigurable slots. The normalized time, also expressed in seconds, is the time effectively spent by the algorithm, computed by subtracting the overhead of the test application from the total time. Finally, the average time for each module is the estimated time required by each single module insertion.

Table 5.2: Temporal performance

Algorithm     Total time (s)   Normalized time (s)   Average time per module (ms)
Null                      69                     0                              0
Random                    74                     5                           0.17
Genetic                  104                    35                           1.17
Exhaustive               157                    88                           2.93
The second table, Table 5.3, presents the number of refused modules and the total caching reward for each algorithm. In addition, it also shows the normalized values of these parameters, calculated by subtracting from the total values the results of the exhaustive algorithm, which are the best obtainable.

Table 5.3: Final results

Algorithm     Refused modules   Normalized refused modules   Caching reward   Normalized caching reward
Null                    10000                         6527                0                      -92247
Random                   7392                         3919            22465                      -69782
Genetic                  4670                         1197            67614                      -24633
Exhaustive               3473                            0            92247                           0
Finally, Table 5.4 describes the temporal improvement that the genetic algorithm and the random algorithm are able to obtain with respect to the exhaustive algorithm. Even if the random solution brings a considerable temporal improvement (1760%), it cannot be chosen as a suitable solution, since its results, in terms of both refused modules and caching reward, are really unacceptable (a 40%-76% worsening). On the other side, the genetic solution obtains a more modest temporal improvement (251%), but it keeps the worsening of the final results much lower (12%-27%).

Table 5.4: Comparison with the exhaustive algorithm

Algorithm   Temporal improvement (%)   Rejection worsening (%)   Caching worsening (%)
Random                          1760                      39.2                   75.6
Genetic                          251                      12                     26.7
In conclusion, the genetic algorithm seems to implement the best compromise between temporal performance and effectiveness of the final results, which consist of both the number of refused modules and the number of IP-Cores kept in cache.
5.2.3 ROTFL architecture
The performance of the whole ROTFL architecture is affected mainly by the
latency introduced by the partial reconfiguration of the slave FPGAs and by the
overhead caused by the Operating System.
Furthermore, additional FPGA resources are required to enable partial reconfiguration, i.e., the Virtex Configuration Manager (VCM) introduced in Section 5.2.1, which represents the Reconfiguration Controller. The VCM module uses 1726 slices (18.6%) and 6 BlockRAMs (6.8%) of the Xilinx Virtex-2Pro FPGA (XC2VP20). The high area requirement is caused mainly by the integrated readback functionality, which will be used in future implementations.
In contrast, the additional resources required for partial reconfiguration in the other components of the static architecture can be neglected, since the resource overhead in these components is smaller than 1%.
The latency for partial reconfiguration introduced by the hardware components is composed of the following parts.
• First, a static time that is required to initiate the DMA transfer of the partial
configuration bitstreams from the SDRAM to the configuration interface
(VCM), plus the time required to initialize the configuration interface of
the FPGA and to flush the configuration buffer at the end of the configuration.
• Second, the time needed to download the bitstream to the FPGA. This
time depends on the size of the reconfigurable hardware modules.
The static time is 158 clock cycles before reconfiguration and 824 clock cycles for buffer flushing after reconfiguration. Moreover, the number of clock
cycles needed to reconfigure one CLB column of the used Xilinx Virtex-II FPGA
(XC2V4000) is 18,128. Therefore, the time to reconfigure a hardware module in
the proposed system is
(158 + n · 18128 + 824) · 20 ns    (5.1)

where n is the number of reconfigured CLB columns and 20 ns is the reconfiguration clock period used in the prototype implementation. Table 5.5 shows the reconfiguration time introduced by the hardware for typical module sizes.

Table 5.5: Hardware reconfiguration latency

Columns    Latency (µs)
      4         1469.88
      8         2920.12
     12         4370.36

These modules only use CLB columns; the download time changes insignificantly if embedded multipliers or BlockRAMs are used. If the BlockRAM contents also have to be written during reconfiguration, an additional 1054.72 µs apply per BlockRAM column. Equation 5.1 assumes that no data compression is used for the partial bitstreams and thus gives worst-case times.
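Equation 5.1 can be transcribed directly into a small C helper; as a sanity check, the sketch below reproduces the Table 5.5 values (e.g. 4 columns yield 1469.88 µs).

    #include <stdio.h>

    /* Hardware reconfiguration latency of Equation 5.1, in microseconds. */
    double reconfig_latency_us(int n_columns)
    {
        const long setup_cycles  = 158;    /* DMA/interface initialization */
        const long flush_cycles  = 824;    /* configuration buffer flush   */
        const long column_cycles = 18128;  /* one CLB column (XC2V4000)    */
        const double clock_ns    = 20.0;   /* reconfiguration clock period */

        long cycles = setup_cycles + (long)n_columns * column_cycles + flush_cycles;
        return cycles * clock_ns / 1000.0; /* ns -> us */
    }

    int main(void)
    {
        for (int n = 4; n <= 12; n += 4)   /* 4, 8 and 12 columns */
            printf("%2d columns: %.2f us\n", n, reconfig_latency_us(n));
        return 0;
    }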
On the other hand, there is the time overhead caused by the Operating System configuration support. Table 5.6 shows the performance of the ROTFL software architecture, broken down into the following tasks.
Table 5.6: ROTFL performance

Task                             Time (µs)   Notes
Daemon startup                         500   once
Device driver setup                    650   once each driver
Module loading (if not cached)        3450   each loading
Module loading (if cached)            2500   each loading
Read                                   3.6   4 bytes read
Write                                  2.7   4 bytes write
• The first task to be executed is the ROTFL Daemon startup, which initializes all data structures and prepares the ROTFL Daemon to accept configuration requests; it takes around 500 µs, but it needs to be performed just once, when the ROTFL Daemon starts.
• The second task is the device driver setup, which loads the correct driver and initializes all necessary devices for a specific module; it takes around 650 µs and is executed once for each kind of module.
• Module loading time differs depending on whether the requested module is cached: in the first case it takes around 2500 µs, otherwise around 3450 µs. To be more precise, the module used to calculate these results is 4 columns wide.
• Finally, reading from and writing to a configured module take around 3.6 µs for a 4-byte read and 2.7 µs for a 4-byte write.
5.3 Concluding remarks
The results concerning the IPGen tool show that the generation of the IP-Core introduces, with respect to the core logic, only a small resource overhead, which can even be neglected when the size of the original core is significant. Moreover, the performance of this tool is also very good, since on average it is possible to obtain a complete IP-Core in just 0.065 seconds.
On the other hand, the results of the ROTFL architecture prove that the Operating System temporal overhead is acceptable, since the duration of a reconfiguration performed by using the OS reconfiguration support is comparable to
the hardware reconfiguration latency.
In the worst case, in fact, using a module that is just 4 columns wide, the hardware reconfiguration latency is around 1500 µs, while the same reconfiguration performed through the ROTFL architecture takes around 3500 µs (including the delay introduced by socket communication). Furthermore, considering the scenario where the requested module is cached, independently of its size, the performance can be considerably improved, since the whole reconfiguration process constantly takes just 2500 µs, as shown in Figure 5.2.
Figure 5.2: Module cached scenario (time axis 0-2 ms: socket communication, ROTFL Daemon, socket communication)

Finally, using wider modules it is possible to completely hide the software overhead due to the ROTFL Daemon even when the requested module is not cached, since the hardware reconfiguration latency grows linearly with the module size, while the ROTFL overhead remains constant, as shown in Figure 5.3.
Figure 5.3: Reconfiguration latencies (time axis 0-6 ms: socket communication, ROTFL Daemon and socket communication, followed by the hardware reconfiguration of 4-, 8- and 12-column modules)
Chapter 6
Conclusions and future work
Previous chapters have introduced a methodology for reconfigurable embedded systems design that strongly reduces both the time to market of the final implementation of the system and the effort required for its development. This methodology has been described through a flow integrated with two main components that represent the original contribution of this thesis: the automatic IP-Core generation and the Operating System reconfiguration support.
• The automatic IP-Core generation task can be achieved by using the IPGen tool, whose goal is the definition of an automated flow for the interfacing process of IP-Cores. In this way it is possible to obtain, starting from a core functionality, a complete component that is ready for bus communication, without requiring user interaction.
Preliminary results show that the proposed approach provides the design flow with a simple and powerful way to automatically obtain working IP-Cores, which can be used as fixed or reconfigurable modules of the final system. The IPGen tool, in fact, has been tested using several component cores, and the generated modules have been plugged into real systems and downloaded onto an FPGA to verify their correctness.
The performance achieved by IPGen is good, since the resource overhead introduced to obtain bus-compatible IP-Cores is very small and, in some cases, in particular with large components, it can be neglected. The temporal performance is also excellent, given that on average the IP-Core generation phase takes around 0.065 seconds.
• The proposed OS reconfiguration support has been developed to be applicable to a wide class of reconfigurable scenarios, characterized by the presence of multi-FPGA reconfigurable systems. The presented scenarios can also be seen as basic components of a more complex distributed system, where each of them can be considered as a node of the distributed solution.
Moreover, for the development of the whole ROTFL architecture, a layered structure has been chosen. This solution brings several remarkable benefits to the final system. First, it is possible to exploit a high-level and very effective user interface that makes use of common OS concepts, such as the assignment of names like /dev/module_0 to each configured module, while completely hiding from the user the dynamic aspects, and the associated complications, of reconfigurable hardware. In addition, the proposed solution allows many resources (i.e., many FPGAs) to be combined into one unique virtual hardware component, allowing the ROTFL Daemon, that is, the centralized manager, to handle flexible and scalable hardware architectures. Furthermore, the layered structure of the ROTFL architecture makes it possible to easily adapt each of its components to a specific situation without modifying the whole software architecture, leading both to high customizability and reusability and to low error-proneness.
Thanks to these aspects of the ROTFL architecture, it is possible to develop, as future work, a collection of different versions of the ROTFL Module Manager, of the ROTFL Allocation Manager and of the ROTFL Positioning Manager that use disparate algorithms. These managers have to strictly respect the defined interface, making it possible to choose the most suitable algorithm for each specific situation without changing the structure of the whole ROTFL architecture.
Finally, it is possible to imagine a scenario where the ROTFL Repository might be extended to support the dynamic management of both bitstream files and module information. In this way it will be possible to load a new IP-Core class into the ROTFL architecture at run-time as well, and not only during the development phase.
Bibliography
[1] Two flows for partial reconfiguration: module based or difference based, Xilinx
Inc., XAPP290, September 2004
[2] Development system reference guide, Xilinx Inc., 2005
[3] Computer Architecture: A Quantitative Approach, J. Hennessy, D. Patterson, Morgan Kaufmann, San Mateo, 1990
[4] The General Rapid Architecture Description, Carl Ebeling, University of Washington Technical Report: UW-CSE-02-06-02, 2002
[5] Rapid-C Manual, Carl Ebeling, University of Washington Technical Report:
UW-CSE-02-07-06, 2002
[6] A Configurable Pipelined State Machine as a Hybrid ASIC and Configurable Architecture, Peter Zipf, Claude Stötzler, Manfred Glesner, Institute of Microelectronic Systems, Darmstadt University of Technology, Germany, 2004
[7] Configurable Architecture for System-Level Prototyping of High-Speed Embedded
Wireless Communication Systems, Visvanathan Subramanian, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, 2003
[8] A Configurable FPGA-based Hardware Architecture for Adaptive Processing of
Noisy Signals for Target Detection Based on Constant False Alarm Rate (CFAR)
Algorithms, René Cumplido, César Torres, Santos López, National Institute
for Astrophysics Optics and Electronics, Puebla, Mexico, 2004
[9] Configurable, High throughput LDPC decoder Architecture for Irregular codes,
Marjan Karkooti, Yang Sun, Joseph. R. Cavallaro, Center for Multimedia
Communications, ECE department
[10] PipeRench: A reconfigurable architecture and compiler, Seth Copen Goldstein, Herman Schmit, Mihai Budiu, Srihari Cadambi, Matt Moe, R. Reed Taylor, IEEE Computer, Vol. 33, No. 4, April 2000
[11] The MorphoSys Dynamically Reconfigurable System-On-Chip, G. Lu, E.M.C. Filho, M. Lee, N. Bagherzadeh, and F.J. Kurdahi, 1st NASA/DoD Workshop on Evolvable Hardware (EH '99), July 19-21, 1999, Pasadena, CA, USA, IEEE Computer Society, 1999
[12] The Splash 2 Reconfigurable Processor and Applications, Jeffrey M. Arnold, Duncan A. Buell, Dzung T. Hoang, Daniel V. Pryor, Nabeel Shirazi, Mark R. Thistle, Proceedings of the International Conference on Computer Design, CS Press, 1993
[13] Garp: A MIPS Processor with a Reconfigurable Coprocessor, John R. Hauser and John Wawrzynek, IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM '97, April 16-18, 1997
[14] The Garp Architecture and C Compiler, Timothy J. Callahan, John R. Hauser, John Wawrzynek, Computer, vol. 33, no. 4, pp. 62-69, April 2000
[15] Baring it all to Software: Raw Machines, Elliot Waingold, Michael Taylor,
Devabhaktuni Srikrishna, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim,
Matthew Frank, Peter Finch, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, Anant Agarwal, IEEE Computer, September 1997
[16] The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs, Michael Bedford Taylor, Jason Kim, Jason Miller,
David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffmann, Paul
Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark
Seneski, Nathan Shnidman, Volker Strumpen Matt Frank, Saman Amarasinghe, Anant Agarwal, IEEE Micro, Mar/Apr 2002
[17] Managing partial dynamic reconfiguration in Virtex II Pro FPGAs, Philippe Butel, Gerard Habay, Alain Rachet, Xcell Journal, Fall 2004
[18] System-level modeling of dynamically reconfigurable hardware with SystemC,
Antti Pelkonen, Kostas Masselos, Miroslav Cupék, IPDPS ’03: Proceedings
of the 17th International Symposium on Parallel and Distributed Processing, Washington, DC, USA, 2003, IEEE Computer Society, 2003
[19] BORPH Operating System, Berkeley Emulation Engine 2 Operating System, http://bee2.eecs.berkeley.edu/wiki/Bee2OperatingSystem, Berkeley, June 2006
[20] Embedded Linux as a platform for dynamically self-reconfiguring systems-onchip, John Williams and Neil Bergmann, Proceedings of the International
Conference on Engineering of Reconfigurable Systems and Algorithms,
CSREA Press, June 2004
[21] A Flexible Platform for Real-Time Reconfigurable Systems on Chip, N. W.
Bergmann, J. A. Williams, P. J. Waldeck, Proceedings of the International
Conference on Engineering of Reconfigurable Systems and Algorithms,
Las Vegas, USA, 2003
[22] The Egret Platform For Reconfigurable System-On-Chip, Neil W. Bergmann
and John Williams, Proceedings of the IEEE International Conference on
Field-Programmable Technology, IEEE, 2003
[23] A software platform to support dynamically reconfigurable Systems-on-Chip under the GNU/Linux operating system, Alberto Donato, Master’s thesis, Politecnico di Milano, 2005
[24] YaRA: un’architettura riconfigurabile basata sull’approccio monodimensionale,
Alessio Montone, Antonio Piazzi, Bachelor’s thesis, Politecnico di Milano,
2006
[25] A Prototyping Platform for Dynamically Reconfigurable System on Chip Designs,
Heiko Kalte, Mario Porrmann and Ulrich Rückert, Proceedings of the IEEE
Workshop Heterogeneous reconfigurable Systems on Chip (SoC), 2002
Acknowledgments

I warn you right away that the number of people thanked in this section will be quite high.

The reason for this decision is that I do not believe there is anyone who is not worth thanking, or who has not engraved their name deeply enough into my life to deserve a thank-you.

I would like to start with all those who have guided me throughout my school career, in particular my primary school teachers, Laura and Tiziana, my high school professors, Costantini, Pilone (the pheasant was really good!) and Rigotti, and my university professors, Ferrandi and Sciuto; continuing with all my schoolmates and university classmates: Anna, Fly (nobody will notice that I have just rebooted this computer...), Gianluis (the important thing is never to overdo it), Max, Nino, Quintana (the Care Bear), Randa, Ritz and Teo; and then with the friends of the MICROLAB: Ack, Ale (of IP-Gen), Ale (Mele), Birdack, Carlo (who can go back one day in time by traveling around the world!), Chiara (without whom I would not have made it...), Davide, Diego (my iBook twin! :D), Edo, Francesca (ah, those notes...), Frascino (a.k.a. Gattuso), Gegi, Ics, Il Supremo, Leo, Katia, Malex (written with the small 2), Marco (congratulations on the five-a-side football and on OGame), Osprey, Roberto, Shumi (truly identical), Teo (Germany... -.-), Teo (of IP-Gen), Tia, Vik and Zeph.

Nor can I forget all the fantastic people I met in Germany, among whom: Anne (both of them), Annett, Annkatrin (the creature...), Boris, Cheng Yee (with the little red sombrero and a bucket of sangria), Christina (uh uh...), Hanna (ufficizmo), Jan (our indispensable buddy), Jenny (thanks for the photos), Jens (master of FPGA Editor), Kim (both of them), Markus (don't worry, by tomorrow for sure...), Miriam (both of them), Nadine, Verena and finally the legendary Su (sasaaa).

Why not also mention man's best friends, who have given me so many moments of happiness, filling some of the saddest moments with joy: Birillo, Licia (or Felicina, with the little bows...), Mila and Yoshi (Bodino Bodenaus).

There are also people who are always present, caring and kind, and who deserve heartfelt thanks, such as Chiara, Giulia, Andrea (Gardaland was great!), Lorena, Donato and Ing. Jannelli.

Of fundamental importance, moreover, have been the affection and closeness of all my relatives, always ready to spur me on, advise me and support me, among whom: Elio, Mino, Paolo, grandma Tina, grandpa Ettore, grandma Michela, grandpa Vincenzo, Salva, Oscar (companion of a thousand adventures, in real life and beyond), aunt Mina, lo Zione, Mum and Dad.

Finally, heartfelt thanks to my dearest friends, with whom I grew up and keep growing day after day: Laura, Nadia, Marta, Flavia, Katia, Paolo (with the checkered shirt), Tanis (by now Jacopo's official name!), Luca (no longer Ciccio), Valeria, Sabri, Fuca (also known as Johnfuc), Ale, Geppo (also known as Roby :D) and Max (truly a fixed point of reference).

One last, special thank-you is duly dedicated to Marco (better known as Santa), who helped me crown this dream in the best possible way, devoting to me every moment of his free time (actually, let me correct myself: I do not think Marco has ever had any free time, only some moments in which he was less busy than others), even if often with substantial delays of several hours... but always and in any case at my side, at every instant: thanks for everything.

... and the list of people I am grateful to would not be over yet, many others have accompanied me along this long journey, but I will have to limit myself to thanking them all together, since, as Max says, the space dedicated to the acknowledgments is about to en
Written with LaTeX 2ε and BibTeX
Printed on September 29, 2006