CMPE 511 COMPUTER ARCHITECTURE
TERM PAPER
“Transputer Architecture and Parallel Applications”
By
Serdar SARI
Boğaziçi Üniversitesi
2003
TABLE OF CONTENTS
1 INTRODUCTION
2 MULTI-TRANSPUTER SYSTEMS
3 TRANSPUTER ARCHITECTURE
3.1 THE PROCESSOR
3.1.1 Registers
3.1.2 Instruction Set
3.2 MEMORY
3.3 FLOATING POINT UNIT
3.4 TIMERS
3.5 SYSTEM SERVICES
3.6 LINK INTERFACE
3.6.1 Link Communication
3.6.2 Link Protocols
3.7 T9000 ARCHITECTURE
4 OCCAM LANGUAGE
5 PARALLEL APPLICATIONS
6 CONCLUSION
REFERENCES
APPENDIX
1 INTRODUCTION
The transputer was the first single-chip computer designed for message-passing multiprocessor
systems. It was a novel architecture produced by the UK-based company Inmos Ltd. It nevertheless
became extinct, because it was late in renewing itself against rapidly improving technology and
because the later products of the family lost their efficiency: their cost/performance ratio was
very high compared with that of other processors on the market.
The transputer, the most interesting processor for parallel processing systems, was designed
explicitly as a basic building block for parallel processing. It has a RISC type of instruction set
and combines a fast microprocessor, several communication ports, fast memory, an external
memory interface, timers, clocks and scheduling, all on one standard chip [1][2]. The name
(transistor + computer) was chosen to indicate the role the individual transputer would play:
large numbers of them would be used as members of arrays, just as transistors had been earlier.
Being simple and easy to implement made multi-transputer systems very popular in applications
that need parallel computing [7] achieved by message passing [8] and very fast computation, such
as robot control [3][4], image processing [5] and databases [6].
In the 1980s many considered the transputer to be the design for the future of computing. Today,
just over a decade later, this interesting chip has largely been forgotten. Nevertheless, it still
deserves consideration. In this paper, section 2 gives a general view of transputers and
multi-transputer systems; section 3 discusses the architecture of the transputer and its main
features; section 4 briefly explains the Occam language, which was born and died with the
transputer; and section 5 discusses the parallel applications in which transputers were widely
used.
2 MULTI-TRANSPUTER SYSTEMS
The transputer is built to serve as a single processor in a MIMD (Multiple Instruction stream,
Multiple Data stream) concurrent system. MIMD systems are more general and versatile than SIMD
(Single Instruction stream, Multiple Data stream) systems, and the transputer's design provides
efficient and wholly reliable solutions to the problems this generality brings. First, you do
not need a separate program for each processor. Second, no extra mechanism is needed
whenever two or more processors have to be synchronized, even though all the processors are
free to run at their own speeds.
The transputer is designed to work with unshared memory, so interprocessor communication is
normally done by message passing. A very small number of shared-memory transputer systems
were nevertheless constructed for special purposes. Message passing enforces the
disciplines that are necessary for sharing information safely, so it automatically avoids the
problems of shared-memory systems, in which these disciplines must be explicitly programmed and
carefully followed. In addition, for very large networks of processors, the memory access problem
in shared-memory systems requires expensive solutions. However, message-passing systems have
their own disadvantages. Code and data have to be physically transferred to the local memory of
each node, which can constitute a significant overhead. An application program may be
computationally intensive yet impossible to divide among transputers. Finally, message-passing
systems are inefficient at emulating shared-memory multiprocessor operations.
Each sequential part of a parallel program is called a process. A process starts, performs a number
of actions, and then stops or terminates. Each action may be a computational assignment, an input
or an output. Processes can run on the same transputer and be time-shared, or run on different
transputers concurrently. A communication route between two processes is called a
channel. A channel between processes on the same transputer is called a soft (internal) channel;
a channel between processes on different transputers is called a hard (external) channel. Hard
channels are implemented by the transputer's communication engines, called links. Channels are
point-to-point and one-way; each link carries one channel in each direction.
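The process and channel model described above can be imitated in ordinary Python. The following is only a rough sketch, not transputer code: the `Channel` class, the two example processes and the `None` end-of-stream marker are all hypothetical names chosen for illustration. It shows a one-way, point-to-point channel on which the sender is held up until the receiver has taken the value.

```python
import queue
import threading


class Channel:
    """One-way, point-to-point channel with synchronised transfer (illustrative)."""

    def __init__(self):
        self._data = queue.Queue(maxsize=1)  # carries one value at a time
        self._ack = queue.Queue(maxsize=1)   # receiver confirms it took the value

    def send(self, value):
        self._data.put(value)
        self._ack.get()            # block until the receiver has taken the value

    def receive(self):
        value = self._data.get()   # block until a sender offers a value
        self._ack.put(None)
        return value


def producer(out):
    """Process 1: outputs three numbers, then a hypothetical end marker."""
    for n in range(3):
        out.send(n)
    out.send(None)


def doubler(inp, results):
    """Process 2: inputs numbers and stores twice their value."""
    while True:
        n = inp.receive()
        if n is None:
            break
        results.append(2 * n)


def run_demo():
    channel = Channel()
    results = []
    workers = [
        threading.Thread(target=producer, args=(channel,)),
        threading.Thread(target=doubler, args=(channel, results)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

Calling `run_demo()` runs both processes to completion. Neither process knows whether the other runs on the same processor or elsewhere, which is the property that lets the same program be mapped onto one transputer or many.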
In terms of how information is transferred over links, transputer networks belong to the class of
circuit-switched networks. First a path is established between the source and the destination
through all the required intermediate nodes, and all the links on it are reserved. The information
is then sent through the network, and afterwards the links are released. This means no buffering
and minimum latency and delay.
3 TRANSPUTER ARCHITECTURE
The goal behind the transputer was to produce a family of chips, ranging in power and cost, that
could be wired together to form a complete computer. The first generation comprises the 16-bit
transputers T212, T222 and T225; the 32-bit transputers without a floating point unit T400, T414,
T425 and T426; and the 32-bit transputers with a floating point unit T800, T801 and T805. All have
the same architecture, similar instruction sets and fully compatible communication links. The
second generation consists of the T9000, a 32-bit transputer with a 64-bit floating point unit.
Although its general architecture is much the same, it is a new design and a much more complex
chip than its predecessors.
All the transputers except the T9000 have an identical architecture. The T805, one of the best
known, is shown in Figure 1. It consists of a conventional, sequential RISC processor, a
communication subsystem, four high-speed inter-processor links, 4 Kbytes of on-chip RAM, an
on-chip memory interface, a floating point unit and other system services. The following sections
briefly explain these functional units one by one.
Figure 1: IMS T805 architecture
3.1 THE PROCESSOR
The transputer processor is in some ways a conventional microprocessor. It executes one
instruction at a time and has a pipelined fetch.
3.1.1 Registers:
Areg, Breg and Creg: These are used to evaluate expressions and to hold instruction operands
and results. They are called the evaluation registers and are arranged as a stack. Only the Areg is
connected to the internal buses, so only the Areg can be read or written. Writing the Areg pushes
the old contents of the Areg into the Breg and the old contents of the Breg into the Creg; the old
contents of the Creg are lost. Reading the Areg pops the contents of the Breg into the Areg and
the contents of the Creg into the Breg. There is no protection against pushing so many values onto
the stack that it overflows; this is left to compilers and assembly code writers. These
features lead to simplified register connections, compact instructions and faster register access.
Iptr, Oreg and Wreg: These are called the sequential control registers. The instruction pointer
(Iptr) holds the address of the next instruction. The operand register (Oreg) holds the operand
for the current instruction; it cannot be directly loaded from (or stored in) the data part of the
memory. The workspace register (Wreg) holds the workspace pointer (Wptr), which is the address of
an area of memory called the local workspace.
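The push and pop behaviour of the three evaluation registers can be modelled in a few lines of Python. This is a sketch for illustration only: the class and method names are invented here, and what the Creg holds after a pop is an assumption (the model simply leaves it unchanged).

```python
class EvaluationStack:
    """Toy model of the Areg/Breg/Creg evaluation stack (names are illustrative)."""

    def __init__(self):
        self.a = 0
        self.b = 0
        self.c = 0

    def push(self, value):
        """Write the Areg: Breg moves to Creg, Areg moves to Breg."""
        self.c = self.b          # the old Creg value is lost, with no overflow check
        self.b = self.a
        self.a = value

    def pop(self):
        """Read the Areg: Breg moves to Areg, Creg moves to Breg."""
        value = self.a
        self.a = self.b
        self.b = self.c          # assumption: Creg keeps its old value
        return value


def demo():
    stack = EvaluationStack()
    for value in (1, 2, 3, 4):   # the fourth push silently discards the value 1
        stack.push(value)
    return stack.a, stack.b, stack.c
```

Pushing four values onto the three-register stack shows why overflow has to be avoided by the compiler: the first value pushed is simply gone.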
3.1.2 Instruction Set:
All the transputers have the same instruction format. Each instruction is 8 bits (1 byte) long.
The 4 most significant bits give the opcode and the 4 least significant bits hold the
data (operand). Every instruction is executed in the same sequence. First the Iptr is incremented.
Next, the four data bits are copied into the four least significant bits of the Oreg. Then the
function given by the opcode is executed. Finally the Oreg is set back to zero, unless the function
is a prefix.
Prefixing: Since the instruction format reserves only 4 bits for data, a single instruction can
carry only an operand in the range 0 to 15. This gives a small number of very fast instructions
with limited data, but larger operands are always needed. Prefixes, instructions that load values
into the Oreg, are used to build them: pfix builds large positive values and nfix builds
negative values. Forming instruction operands in the operand register in this way is a somewhat
unusual aspect of the transputer's instruction set.
Pfix: Copies its 4 bits of data into the Oreg and shifts the Oreg left by 4 bits. This leaves the
last 4 bits empty, ready for the next instruction. If that is another prefix instruction, it again
copies 4 bits of data and shifts by 4 bits. Finally, an instruction such as ldc (load constant)
copies its 4 bits of data into the empty part of the Oreg and then loads the Oreg into the Areg.
Nfix: Like pfix, it copies 4 bits into the Oreg and shifts the Oreg left by 4 bits, but it
complements the Oreg, turning zeros into ones and vice versa, before the shift. This converts a
small positive number into a small negative one.
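The way prefixes build up an operand can be traced with a small interpreter. The sketch below is a simplified model, not a full transputer emulator: it handles only pfix, nfix, ldc and opr, assumes a 32-bit Oreg, and takes the function codes #4 (ldc) and #6 (nfix) from the standard transputer function code table as an assumption, since the text gives only #2 (pfix) and #F (opr) through its examples.

```python
WORD_MASK = 0xFFFFFFFF                      # assumption: 32-bit transputer
PFIX, LDC, NFIX, OPR = 0x2, 0x4, 0x6, 0xF   # function codes (ldc, nfix assumed)


def to_signed(word):
    """Interpret a 32-bit pattern as a signed value."""
    return word - (1 << 32) if word & 0x80000000 else word


def run(code):
    """Decode a byte sequence; return (constants loaded by ldc, opr numbers)."""
    oreg = 0
    constants = []
    operations = []
    for byte in code:
        function, data = byte >> 4, byte & 0xF
        oreg |= data                          # data goes into the low 4 bits
        if function == PFIX:
            oreg = (oreg << 4) & WORD_MASK    # shift up, keep accumulating
        elif function == NFIX:
            oreg = (~oreg << 4) & WORD_MASK   # complement first, then shift
        else:
            if function == LDC:
                constants.append(to_signed(oreg))
            elif function == OPR:
                operations.append(oreg)
            oreg = 0                          # cleared after every non-prefix
    return constants, operations
```

For instance, the three bytes #27 #25 #44 (pfix #7; pfix #5; ldc #4) load the constant #754, and the two bytes #61 #41 (nfix #1; ldc #1) load -31.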
Direct Instructions: The 4 bits of opcode in an instruction give 16 possible function
codes. The instructions with these codes are called functions or direct instructions.
Indirect Instructions: Since a transputer has between 100 and 150 different instructions,
sixteen function codes are not enough. For this reason an instruction called operate (opr) is
used. The indirect instructions are numbered. To tell the processor which indirect instruction to
execute, the number of that instruction is given in the Oreg as the data for the operate
instruction.
Short Indirect Instructions: The first 16 indirect instructions, which can be called directly
with opr. For example, opr #0 calls the rev instruction (machine code #F0), which swaps the Areg
and Breg.
Long Indirect Instructions: The remaining ones, which need a pfix instruction before the opr.
For example, mint (which loads the constant MinInt into the Areg) has the operation code #42
and is called by pfix #4; opr #2.
Some important instructions of the instruction set can be seen in Appendix A.
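Going the other way, an assembler has to turn a function code and an arbitrary operand into the shortest sequence of prefix bytes. The recursive scheme below is a sketch of how this is commonly done; the function name `encode` is invented for illustration, and the codes #4 (ldc) and #6 (nfix) are assumed from the standard function code table rather than taken from the text above.

```python
PFIX, LDC, NFIX, OPR = 0x2, 0x4, 0x6, 0xF   # function codes (ldc, nfix assumed)


def encode(function, operand):
    """Return the instruction bytes for one function with any integer operand."""
    low_nibble = operand & 0xF
    final_byte = bytes([(function << 4) | low_nibble])
    if 0 <= operand < 16:
        return final_byte                    # fits in one byte, no prefix needed
    if operand >= 16:
        prefixes = encode(PFIX, operand >> 4)
    else:
        # Negative operand: nfix will complement, so prefix the complement.
        prefixes = encode(NFIX, (~operand) >> 4)
    return prefixes + final_byte


def hex_string(code):
    """Format bytes as '#24 #F2' in the notation used in this paper."""
    return " ".join("#%02X" % byte for byte in code)
```

With this sketch, `hex_string(encode(OPR, 0x42))` gives `#24 #F2`, which is the pfix #4; opr #2 sequence for mint, while a short indirect instruction such as rev needs only the single byte `#F0`.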
3.2 Memory
All transputer memory is arranged in bytes, and the programmer can access individual bytes. A
32-bit address gives an address space of 4 Gbytes. Unlike conventional address ranges, the range
of addresses is signed: it starts at MinInt #80000000 (# indicates that the number is written in
hexadecimal) and goes up to MaxInt #7FFFFFFF, with #00000000 lying in the middle. Because
addresses are signed, no unsigned arithmetic is needed for address calculations, which reduces the
size of the instruction set and microcode. With conventional unsigned addresses, for example,
#70000000 + #70000000 would be a sensible address calculation, but it could not be done with
ordinary signed arithmetic because it would overflow. When designing the physical address
decoding, users just need to remember that in the bottom half of the memory space the most
significant address bit is high. A memory map that is common, efficient and supported by the
company is shown in Figure 2. In fact, the only restriction for the programmer is that the
on-chip RAM has to be at the bottom, starting at MinInt; otherwise any arrangement is possible.
Figure2: Memory map
The processor and links do not know what physical device they are addressing. They cannot tell
the difference between on-chip memory, external memory or other memory-mapped devices.
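The relation between signed transputer addresses and the bit pattern seen by the address decoding logic can be checked with a short calculation. The sketch below assumes a 32-bit part with the 4 Kbytes of on-chip RAM quoted earlier for the T805; the function names are invented for illustration.

```python
WORD_BITS = 32                         # assumption: 32-bit transputer
MIN_INT = -(1 << (WORD_BITS - 1))      # #80000000 as a signed value
MAX_INT = (1 << (WORD_BITS - 1)) - 1   # #7FFFFFFF
ON_CHIP_BYTES = 4 * 1024               # T805 on-chip RAM size from section 3


def bus_pattern(address):
    """Bit pattern on the address lines for a signed transputer address."""
    if not MIN_INT <= address <= MAX_INT:
        raise ValueError("address outside the signed 32-bit range")
    return address & 0xFFFFFFFF


def offset_from_bottom(address):
    """Distance in bytes from the lowest address, MinInt."""
    return address - MIN_INT


def is_on_chip(address):
    """True if the address falls in the on-chip RAM at the bottom of memory."""
    return 0 <= offset_from_bottom(address) < ON_CHIP_BYTES


def top_address_bit(address):
    """Most significant address bit, as the decoding hardware sees it."""
    return bus_pattern(address) >> (WORD_BITS - 1)
```

The lowest address, MinInt, appears on the bus as #80000000, so its top bit is high, while address 0 in the middle of the range appears as #00000000.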
To simplify board design, the transputer has an external memory interface, usually abbreviated
EMI. There are two distinct types of transputer memory interface. The fast, or two-cycle,
interface is optimized for simple memory systems that use SRAM, ROM and other devices. It is
incorporated in all the 16-bit transputers and the T801. The other type is the programmable, or
three-cycle, interface, which is used in all 32-bit TXXX transputers except the T801. It is
designed to simplify the interface with DRAM and other complex memory-mapped devices. By
providing as much on-chip support as possible, it minimizes the amount of external logic required.
Transputer memory is divided into workspaces that keep the parameters of the different procedures
or processes. As stated before, the Wptr (workspace pointer) holds the bottom address of the
workspace of the current process. This pointer can be thought of, and also used, as a stack
pointer, and the transputer's instruction set supports this. This property makes it easy to switch
context and to deal with variables. When a context switch occurs, the transputer saves the
processor state and then loads the workspace address of the upcoming process into the Wptr. With
an 8-bit instruction format, a transputer can need four cycles to load an address, which seems
inefficient when dealing with variables. For this reason the address in the Wptr is used as a
base. Since all the parameters belonging to a process are kept in its workspace, the only thing
that must be done is to load the small constant that is added to the address in the Wptr to reach
the desired address.
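Workspace-relative addressing can be illustrated with a toy model. The sketch below is an assumption-laden simplification: memory is a Python dictionary, the word size is taken as 4 bytes, the workspace addresses are made up, and `ldl`/`stl` are modelled after the load local and store local instructions, which take a small word offset from the Wptr.

```python
BYTES_PER_WORD = 4   # assumption: 32-bit transputer


class Processor:
    """Toy model of Wptr-relative variable access (illustrative only)."""

    def __init__(self):
        self.memory = {}   # byte address -> word value
        self.wptr = 0

    def switch_to(self, workspace_address):
        """Context switch: point the Wptr at another process's workspace."""
        self.wptr = workspace_address

    def stl(self, offset, value):
        """Store local: write the word at Wptr + offset words."""
        self.memory[self.wptr + offset * BYTES_PER_WORD] = value

    def ldl(self, offset):
        """Load local: read the word at Wptr + offset words."""
        return self.memory[self.wptr + offset * BYTES_PER_WORD]


def demo():
    cpu = Processor()
    cpu.switch_to(0x1000)      # hypothetical workspace of process A
    cpu.stl(1, 11)
    cpu.switch_to(0x2000)      # hypothetical workspace of process B
    cpu.stl(1, 22)
    seen_by_b = cpu.ldl(1)
    cpu.switch_to(0x1000)
    seen_by_a = cpu.ldl(1)
    return seen_by_a, seen_by_b
```

Both processes use the same short offset, so each variable access fits in a single instruction byte, yet each process reaches its own copy of the variable because only the Wptr differs.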
3.3 Floating Point Unit
The floating point unit can be thought of as a separate coprocessor under the control of its
master, the CPU. It can run at the same time as the CPU but cannot run a different parallel
process. It has its own evaluation stack registers: FAreg, FBreg and FCreg. There are 53
floating-point instructions. Programming it in a high-level language rather than in assembly is
strongly advised. It follows the IEEE standard for floating point formats, operations and results.
32-bit numbers have 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa;
64-bit numbers have 1 bit for the sign, 11 bits for the exponent and 52 bits for the mantissa. It
also supports the special results Inf (infinity) and NaN (not a number, an undefined result).
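The field widths quoted above can be verified on any machine that uses the IEEE formats. The following sketch uses Python's standard `struct` module to split a number into its sign, exponent and mantissa fields; the function names are chosen for illustration and nothing here is specific to the transputer.

```python
import struct


def fields32(x):
    """Split a single-precision value into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def fields64(x):
    """Split a double-precision value into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


def is_special(x):
    """Inf and NaN are marked by an exponent field of all ones."""
    return fields64(x)[1] == 0x7FF
```

The exponent is stored with a bias (127 for 32-bit, 1023 for 64-bit numbers), so 1.0 has exponent field 127 and mantissa 0 in single precision. Infinity has an all-ones exponent with a zero mantissa, and NaN an all-ones exponent with a non-zero mantissa.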
3.4 Timers
The transputer has two timers, which can be accessed by the programmer. High resolution timer:
increments every five periods of ClockIn, which gives one-microsecond resolution with the normal
5 MHz clock. Low resolution timer: this is 64 times slower, so it increments every 64
microseconds. These speeds are independent of the transputer model, processor speed and word
length.
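The timer periods follow directly from the ClockIn frequency. As a small worked calculation (function names are illustrative, and integer nanoseconds are used to avoid rounding):

```python
NS_PER_SECOND = 1_000_000_000


def high_res_tick_ns(clock_in_hz=5_000_000):
    """High resolution timer: one tick every five ClockIn periods."""
    return 5 * NS_PER_SECOND // clock_in_hz


def low_res_tick_ns(clock_in_hz=5_000_000):
    """Low resolution timer: 64 times slower than the high resolution timer."""
    return 64 * high_res_tick_ns(clock_in_hz)
```

With the normal 5 MHz ClockIn this gives 1000 ns (one microsecond) and 64000 ns (64 microseconds) per tick.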
3.5 System Services
In multi-processor systems it is convenient to have a hierarchy of control. For example, a
host should be able to boot up a network of transputers, detect when an error occurs and debug the
network. These are achieved by means of the reset, analyse and error pins, which are called the
system service pins.
3.6 Link Interface
The INMOS link is effectively a serial DMA port. It is an interface that reads or writes memory at
one end and sends or receives high-speed serial data packets at the other end. Links are extremely
flexible. They can be used to interface with peripherals through a link adaptor; an ASIC
(application-specific integrated circuit) can use a link to read and write directly into a
transputer's memory at high speed; and, most commonly, they are used to talk to another processor,
usually another transputer.
3.6.1 Link Communication
The hardware connection of links is extremely simple over short distances, which is what they
were designed for. Links are serial ports in order to simplify board design; just two tracks are
required for each link connection (fig3).
The four links and the processor have independent access to the memory. The processor sets up a
link and is then free to execute other code while dedicated link logic handles the communication.
All four links can be inputting and outputting simultaneously while the processor is running code.
Of course there could be a bandwidth problem when all the links and the processor access memory at
the same time, but this is not a common event (fig4).
The links are designed so that transputers do not need to be synchronized in order to talk to each
other. However, they need to agree on the nominal bit rate (input clock * internal phase-locked
loop). This means that the transputers in a network may be driven either from a common clock or
from separate clocks.
Figure3
Figure4
3.6.2 Link Protocols
Every data packet is acknowledged, which means that the transputer only has to buffer a single
incoming data packet. This also provides the synchronizing effect of channels between transputers
at the programming level. A link begins to run when it gets a command from the processor, i.e. when
the processor executes an input or output instruction. For an output, the link sends a data packet,
assuming that something is waiting to receive the data, waits until an acknowledge arrives, and
then sends the next packet; if no acknowledge ever arrives, it may wait forever. For an input, the
link checks whether a data packet has arrived, waits if it has not, and sends an acknowledge as
soon as it gets the packet. The problem here is that the acknowledge does not confirm that the
packet has been received correctly. This protocol is expected by the hardware on transputers and
link adaptors.
All link packets begin with a single high start bit and end with a low stop bit. The second bit of a
packet indicates whether it is a data packet or an acknowledge; a high bit signifies data and a low
bit signifies an acknowledge.
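The packet framing and the one-packet-at-a-time handshake can be sketched as follows. Several details are assumptions that the text above does not state: that a data packet carries 8 data bits, that they are sent least significant bit first, and that an acknowledge is just the start bit followed by the low bit. The function names are invented for illustration, and the "wire" is simply a Python list.

```python
ACK_PACKET = [1, 0]   # start bit, then a low bit (assumed two-bit acknowledge)


def data_packet(byte):
    """Frame one byte: start bit, data flag, 8 data bits (LSB first), stop bit."""
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [1, 1] + data_bits + [0]


def decode(packet):
    """Return ('ack', None) or ('data', byte) for a received packet."""
    if packet[0] != 1:
        raise ValueError("missing start bit")
    if packet[1] == 0:
        return ("ack", None)
    if len(packet) != 11 or packet[10] != 0:
        raise ValueError("malformed data packet")
    value = sum(bit << i for i, bit in enumerate(packet[2:10]))
    return ("data", value)


def send_message(payload):
    """Simulate an output: every data packet is acknowledged before the next."""
    wire_log = []
    received = bytearray()
    for byte in payload:
        packet = data_packet(byte)
        wire_log.append(packet)
        kind, value = decode(packet)      # the receiving link takes the packet
        if kind != "data":
            raise ValueError("expected a data packet")
        received.append(value)
        wire_log.append(ACK_PACKET)       # and answers with an acknowledge
    return bytes(received), wire_log
```

Under these assumptions each byte costs 11 bits in one direction plus 2 bits in the other. As the text notes, the acknowledge carries no check value, so it cannot tell the sender whether the byte arrived intact.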
3.7 T9000 Second Generation
The T9000 was the last transputer version, produced to compete in the market. Its architecture
(fig5) differs from that of the T8XX in that it had a true 16 kB high-speed cache instead of RAM,
a five-stage pipeline (fig6), a grouper that collected instructions out of the cache and grouped
them into larger packages of 4 bytes to feed the pipeline faster, a crossbar data address bus and
a link system upgraded to a new 100 MHz mode. But long delays in the T9000's development meant
that the faster load-store designs were already outperforming it by the time it was to be
released. In fact it also failed to reach its own performance goal of besting the T800 by ten
times.
Figure 5: T9000 Processor Architecture and Processor pipeline
4 OCCAM LANGUAGE
Message passing can be achieved either by designing a special parallel programming language or
by using a normal high-level language and providing a library of external procedures or system
calls for message passing.
Transputers were typically programmed in the Occam programming language, although they could
also be programmed in Fortran, BASIC or Pascal. Occam supported thread-style tasks in the language,
and in most cases simply writing a program in Occam resulted in a threaded application. With the
task support and communications built into the chip and the language interacting with it directly,
writing code for things like device controllers became a triviality: even the most basic code could
watch the serial ports for I/O, and would automatically sleep when there was no data.
Occam is a block-structured language that uses indentation rather than brackets or BEGIN-END to
show a compound structure. Each level of indentation consists of two spaces, and each statement is
normally placed on a separate line. It uses prefix operators, and comments are introduced with --.
Data types are declared in the usual way, but a colon shows that the declaration prefixes a
process (INT x: declares the variable x, and [10]INT x: declares a one-dimensional array x with
10 elements). Variables are declared prior to a process or 'subprocess' and not at the beginning of
the complete program. Their scope is then given by the level of indentation.
Five primitive processes exist in Occam:
– Assignment
  variable := expression
  example: x := y + 2
– Input
  channel ? variable
  example: keyboard ? char
– Output
  channel ! expression
  example: screen ! char
– Skip
  SKIP -- NOP that terminates
– Stop
  STOP -- NOP that never terminates
Unlike most programming languages, in which statements are executed one after another in the
sequence written unless control statements are used, in Occam processes can be specified as
executing concurrently or sequentially. Sequential operation is specified with the sequence (SEQ)
process, in which each component process is executed after the previous process has finished.
General:
SEQ
  Process1
  Process2
  :
Example: takes data from an input channel c1 and sends it to an output channel c2
INT x:
SEQ
  c1 ? x
  c2 ! x
Concurrent operation is specified with the parallel (PAR) process, in which all component
processes are executed simultaneously. It is used in the same way as SEQ.
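For readers more familiar with conventional languages, the difference between SEQ and PAR can be approximated in Python. This is only an analogy, not a translation of Occam semantics: the helper names `seq` and `par` are invented here, Python threads stand in for Occam processes, and, as in Occam, the `par` construct as a whole finishes only when every component has finished.

```python
import threading


def seq(*processes):
    """Run each process only after the previous one has finished (like SEQ)."""
    for process in processes:
        process()


def par(*processes):
    """Start all processes together and wait for all of them (like PAR)."""
    threads = [threading.Thread(target=process) for process in processes]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def demo():
    results = {}
    order = []

    def square():
        results["square"] = 7 * 7

    def cube():
        results["cube"] = 7 * 7 * 7

    # The two calculations are independent, so they may run in either order;
    # the final step runs strictly after both have terminated.
    seq(lambda: par(square, cube),
        lambda: order.append(results["square"] + results["cube"]))
    return order
```

Here the two component processes of `par` write to different entries, so no ordering between them is needed. In Occam the same independence is expected of the components of a PAR, which exchange data only through channels.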
Repetitive processes are written with WHILE, and conditionals with IF. General
representation:
WHILE Boolean expression
  Process
IF
  Boolean expression
    Process
Needless to say, Occam is a simple language, whose main details have been given above. Perhaps on
purpose, it lacks some features found in conventional high-level languages: its data structures
are limited and recursion is not allowed.
5 PARALLEL APPLICATIONS
There are areas of commerce, industry and science where high performance is always required.
Among these, transputers are good candidates for the following: high-performance graphics, where
graphics engines generate the sophisticated images required today in advertising and films;
image processing, where an image can be broken into smaller segments that are each processed by a
separate device; database applications, where the data may be spread over a number of storage
devices with separate processors handling each device in a coordinated, cooperating manner; and
robotics, for the control of joints that work concurrently.
Parallel machines have been around for many years, but their impact in the commercial world has
been limited to high-performance areas. With the advent of the transputer, inexpensive parallelism
steadily became a reality. The principal areas in which transputers entered the market were
add-on boards, parallel workstations, locally intelligent terminals and large systems for command
and control. To be more specific, the following novel applications can be given:
Digital telephones: Transputers were suitable for the control of digital communications and were
once used in this exciting application.
Video telephones: KDD of Japan supplied these devices, which used transputers for video
compression at rates of 1-2 frames per second.
Laser printers: The UK company Eidolon produced a fast and portable raster image processor
based on the transputer for formatting images in a laser printer. Only two transputers were
required to drive a 40 page per minute system with full page description language graphics
processing.
Control systems: A French company, CGEE-Alsthom, used a transputer-based control
system installed in nuclear power stations. The Controbloc P20 system incorporated 400
transputers and featured processor and functional redundancy intended to provide a high level of
reliability.
Artificial intelligence: The cost of AI workstations had traditionally been high, but it was
reduced by the development of inexpensive AI workstations based on transputers.
Optical character recognition and text digitization: Progress in this area was still in its early
stages in the 1980s, so not many systems with transputers were used.
Neural networks: Transputers were surprisingly well suited to building neural structures, and a
number of research organizations built working machines.
6 CONCLUSION
The transputer was a unique device in the 1980s, designed for parallel processing; its novel
architecture and simple implementation made it one of the most popular and most discussed
processors in the literature. There are still fans who keep it alive on their websites. For areas
that need parallel processing and high performance, the transputer was undoubtedly a very good
solution. The company, Inmos, which did not manage to renew the design and bring prices down, is
responsible for its extinction. Perhaps, had Inmos been a US-based company, we would still be
using transputers. Nevertheless, researchers and designers who want efficient parallel processing
should not overlook the idea behind the architecture of the transputer.
References
1 - J. Hinton and A. Pinder, “Transputer Hardware and System Design”, Prentice Hall, 1993.
2 - IMS T805 Transputer, INMOS Ltd, Bristol, UK.
3 - R. Zhang, E. B. Fernandez and J. Wu, “A Parallel Implementation of Robot Control Equations on IMS T414 Transputers”, Transputer Research and Applications (NATUG4), D. L. Fielding, Ed., IOS Press, 1990.
4 - F. Hamisi and D. A. Fraser, “Transputer-based implementation of real-time robot position control”, Microprocessors and Microsystems, 1989, vol. 13, pp. 644-652.
5 - S. Hemann, “A transputer based shuffle shift machine for image processing and reconstruction”, Proceedings 1990 of the 29th IEEE conference, pp. 445-450.
6 - M. Walden and K. Sere, “Free Text Retrieval on Transputer Networks”, Microprocessors and Microsystems, 1989, vol. 13, pp. 179-183.
7 - N. Tucker, “Commercial Issues: parallel processing and the transputer”, Microprocessors and Microsystems, 1989, vol. 13, pp. 139-144.
8 - B. Wilkinson, “Computer Architecture: Design and Performance”, Prentice Hall, 1996.
9 - Inmos Limited, “Occam 2 Reference Manual”, Prentice Hall, Hemel Hempstead, 1988.
APPENDIX A
TRANSPUTER INSTRUCTION SET
[Instruction tables not reproduced: Function Codes; Processor Initialization Operation Codes; Arithmetic/Logical Operation Codes; Long Arithmetic Operation Codes; Input/Output Operation Codes; Scheduling Operation Codes; Control Operation Codes]