CMPE 511 COMPUTER ARCHITECTURE TERM PAPER

"Transputer Architecture and Parallel Applications"

By Serdar SARI

Boğaziçi Üniversitesi, 2003

TABLE OF CONTENTS

1 INTRODUCTION
2 MULTI-TRANSPUTER SYSTEMS
3 TRANSPUTER ARCHITECTURE
  3.1 THE PROCESSOR
    3.1.1 Registers
    3.1.2 Instruction Set
  3.2 MEMORY
  3.3 FLOATING POINT UNIT
  3.4 TIMERS
  3.5 SYSTEM SERVICES
  3.6 LINK INTERFACE
    3.6.1 Link Communication
    3.6.2 Link Protocols
  3.7 T9000 SECOND GENERATION
4 OCCAM LANGUAGE
5 PARALLEL APPLICATIONS
6 CONCLUSION
REFERENCES
APPENDIX A

1 INTRODUCTION

The transputer was the first single-chip computer designed for message-passing multiprocessor systems. It was a novel architecture produced by the UK-based company Inmos Ltd., but it became extinct because the company was late in renewing the design against rapidly improving technology, and because the later members of the family lost the cost/performance advantage that had been very high compared with the other processors on the market.

The transputer, the most interesting processor for parallel processing systems, was designed explicitly as a basic building block for parallel processing. It has a RISC-style instruction set and combines a fast microprocessor, several communication ports, fast memory, an external memory interface, timers, clocks and scheduling on one standard chip [1][2]. The name (transistor + computer) was chosen to indicate the role the individual transputer would play: numbers of them would be used as members of arrays, just as transistors had been earlier. Being simple and easy to implement made multi-transputer systems very popular in applications that need parallel computing [7] achieved by message passing [8], such as robot control [3][4], image processing [5] and databases [6], where very fast computation is required. In the 1980s many considered the transputer to be the design for the future of computing. Today, just over a decade later, this interesting chip has been largely forgotten. Nevertheless, it still deserves consideration. In this paper, section 2 gives a general view of transputers and introduces multi-transputer systems; section 3 discusses the architecture of the transputer and its main features; section 4 briefly explains the Occam language, which was born and died with the transputer; and section 5 discusses the parallel applications in which transputers were widely used.

2 MULTI-TRANSPUTER SYSTEMS

The transputer is built to serve as a single processor in a MIMD (Multiple Instruction stream, Multiple Data stream) concurrent system. Its design therefore provides efficient and wholly reliable solutions, owing to the generality and versatility of MIMD systems compared with SIMD (Single Instruction stream, Multiple Data stream) systems. First, a separate program is not needed for each processor. Second, no extra mechanism is needed when two or more of the processors have to be synchronized, even though all the processors are free to run at their own speeds. The transputer is designed to work with unshared memory, so interprocessor communication is normally done by message passing.
A small number of shared-memory transputer systems have nevertheless been constructed for special purposes. Message passing enforces the disciplines that are necessary for safely sharing information, so it automatically avoids the problems of shared-memory systems, in which these disciplines must be explicitly programmed and carefully followed. In very large networks of processors, the memory-access problem of shared-memory systems also requires expensive solutions. However, message-passing systems have their own disadvantages. Code and data have to be physically transferred to the local memory of each node, which can constitute a significant overhead. An application program may be computationally intensive in a way that cannot be divided among transputers. Message-passing systems are also inefficient at emulating shared-memory multiprocessor operations.

Each sequential part of a parallel program is called a process. A process starts, performs a number of actions, and then stops or terminates. Each action may be a computational assignment, an input or an output. Processes can run on the same transputer and be time-shared, or on different transputers and run concurrently. A communication route between two processes is called a channel. If the processes are on the same transputer, it is called a soft (internal) channel; if they are on different transputers, it is called a hard (external) channel. Hard channels are implemented by the transputer's communication engines, called links. Links are point-to-point and one-way. In the way information is transferred over links, transputer networks belong to the class of circuit-switched networks: first a path is established between the source and the destination through all the required intermediate nodes and all links on the path are reserved; the information is then sent through the network; afterwards the links are released. This means no buffering and minimum latency and delay.

3 TRANSPUTER ARCHITECTURE

The goal behind the transputer was to produce a family of chips ranging in power and cost that could be wired together to form a complete computer. The first generation comprises the 16-bit transputers T212, T222 and T225; the 32-bit transputers without a floating point unit T400, T414, T425 and T426; and the 32-bit transputers with a floating point unit T800, T801 and T805. All have the same architecture, similar instruction sets and fully compatible communication links. The second generation is the T9000, again with an on-chip floating point unit. Although the general architecture is much the same, it is a new design and a much more complex chip than its predecessors.

All the transputers except the T9000 have essentially identical architectures. The T805, one of the best-known members of the family, is shown in Figure 1. It consists of a conventional, sequential, RISC-like processor, a communication subsystem with four high-speed inter-processor links, 4 KB of on-chip RAM, an on-chip external memory interface, a floating point unit and other system services. The following sections briefly explain these functional units one by one.

Figure 1: IMS T805 architecture

3.1 THE PROCESSOR

The transputer processor is in some ways a conventional microprocessor. It executes one instruction at a time and has a pipelined fetch.

3.1.1 Registers

Areg, Breg and Creg: these are used to evaluate expressions and to hold instruction operands and results. They are called the evaluation registers and are arranged as a stack. Only the Areg is connected to the internal buses, so only the Areg can be read or written directly.
Writing to the Areg pushes the contents of the Areg into the Breg and the contents of the Breg into the Creg; reading the Areg pops the Breg into the Areg and the Creg into the Breg. The old contents of the Creg are lost on a push. There is no protection against pushing so many values that the stack overflows; this is left to compilers and assembly-language programmers. These features lead to simple register connections, compact instructions and fast register access.

Iptr, Oreg, Wreg: these are called the sequential control registers. The instruction pointer (Iptr) holds the address of the next instruction. The operand register (Oreg) holds the operand for the current instruction; it cannot be directly loaded from, or stored to, the data part of memory. The workspace register (Wreg) holds the workspace pointer (Wptr), the address of an area of memory called the local workspace.

3.1.2 Instruction Set

All transputers have the same instruction format. Each instruction is 8 bits (1 byte) long: the 4 most significant bits give the opcode and the 4 least significant bits hold the data (operand). Execution of every instruction follows the same sequence. First the Iptr is incremented. Next, the four data bits are copied into the four least significant bits of the Oreg. Then the function given by the opcode is executed. Finally, the Oreg is set back to zero, unless the function is a prefix.

Prefixing: since the instruction format reserves only 4 bits for data, a single instruction's operand must lie in the range 0 to 15. This gives a small number of very fast instructions with limited data, but larger operands are often needed. Prefix instructions, which load values into the Oreg, are therefore used to build larger operands: pfix builds large positive values and nfix builds large negative values. The use of the operand register in the formation of instruction operands is a somewhat unusual aspect of the transputer's instruction set.

Pfix: copies its 4 data bits into the Oreg and shifts the Oreg left by 4 bits. This leaves the bottom 4 bits empty, ready for the next instruction. If that is another prefix instruction, it again copies 4 bits of data and shifts by 4. Finally an instruction such as ldc (load constant) copies its 4 data bits into the empty part of the Oreg and then transfers the result into the Areg.

Nfix: like pfix, it copies 4 bits into the Oreg, but it then complements the Oreg, turning zeros into ones and vice versa, and shifts it left by 4 bits; this converts a small positive number into a small negative one.

Direct instructions: the 4 opcode bits of an instruction give 16 possible function codes. The instructions with these codes are called functions or direct instructions.

Indirect instructions: since transputers have between roughly 100 and 150 different instructions, sixteen function codes are not enough. For this reason an instruction called operate (opr) is used. The indirect instructions are numbered; to tell the processor which indirect instruction to execute, the number of the instruction is given as the data for the operate instruction in the Oreg.

Short indirect instructions: the first 16 indirect instructions, which can be called directly with opr. For example, opr #0 calls the rev instruction (machine code #F0), which swaps the Areg and Breg.

Long indirect instructions: the remaining indirect instructions, which must be called with the help of a pfix instruction. For example, mint (load the constant MinInt into the Areg) has machine code #42 and is called by pfix #4; opr #2.

The most important instructions of the instruction set are listed in Appendix A.
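As a rough illustration of this prefixing mechanism, the following C sketch (not transputer code) simulates how the operand register accumulates a full operand from 4-bit instruction nibbles. It assumes the standard Inmos function codes #2 for pfix, #6 for nfix and #4 for ldc; everything else is simplified.

    #include <stdint.h>
    #include <stdio.h>

    /* Every transputer instruction byte is 4 bits of function code and
       4 bits of data.  pfix and nfix accumulate data in Oreg; any other
       function then uses the accumulated operand and clears Oreg. */

    static uint32_t Oreg = 0;

    static int32_t execute(uint8_t instr) {
        uint8_t fn   = instr >> 4;       /* 4-bit function code  */
        uint8_t data = instr & 0x0F;     /* 4-bit data (operand) */

        Oreg |= data;                    /* data goes into the low 4 bits */

        switch (fn) {
        case 0x2:                        /* pfix: keep building          */
            Oreg <<= 4;
            return 0;
        case 0x6:                        /* nfix: complement, then shift */
            Oreg = ~Oreg;
            Oreg <<= 4;
            return 0;
        default: {                       /* e.g. ldc: use the operand    */
            int32_t operand = (int32_t)Oreg;
            Oreg = 0;                    /* cleared for next instruction */
            return operand;
        }
        }
    }

    int main(void) {
        /* ldc #234 is encoded as pfix #2; pfix #3; ldc #4 */
        execute(0x22); execute(0x23);
        printf("operand = %#x\n", (unsigned)execute(0x44));  /* 0x234 */

        /* ldc -2 is encoded as nfix #0; ldc #E */
        execute(0x60);
        printf("operand = %d\n", (int)execute(0x4E));         /* -2 */
        return 0;
    }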
3.2 Memory

All transputer memory is arranged in bytes, and the programmer can access individual bytes. The 32-bit address space gives 4 GB of addressable memory. Unlike conventional processors, the range of addresses starts at MinInt, #80000000 (# indicates that the number is written in hexadecimal), and goes up to MaxInt, #7FFFFFFF, with #00000000 lying in the middle. There is no need for separate unsigned address arithmetic, which reduces the size of the instruction set and the microcode. For example, in a conventional address space the calculation #70000000 + #70000000 would be a sensible address calculation, but it could not be done with ordinary signed arithmetic because it overflows. The only thing users need to remember when designing the physical address decoding is that in the bottom half of the memory space the most significant address bit is high. A memory map that is common, efficient and supported by the company is shown in Figure 2. In fact the only restriction on the programmer is that the on-chip RAM has to be at the bottom of memory, starting at MinInt; any other arrangement is possible.

Figure 2: Memory map

The processor and the links do not know what physical device they are addressing: they cannot tell the difference between on-chip memory, external memory and other memory-mapped devices. To simplify board design, the transputer has an external memory interface, usually abbreviated EMI. There are two distinct types of transputer memory interface. The fast, or two-cycle, interface is optimized for simple memory systems that use SRAM, ROM and similar devices; it is incorporated in all the 16-bit transputers and in the T801. The other type is the programmable, or three-cycle, interface used in all the 32-bit transputers except the T801. It is designed to simplify interfacing with DRAM and other complex memory-mapped devices, and by providing as much on-chip support as possible it minimizes the amount of external logic required.

Transputer memory is divided into workspaces that hold the parameters of different procedures or processes. As stated before, Wptr (the workspace pointer) holds the bottom address of the current process's workspace. This pointer can be thought of, and used, as a stack pointer, and the transputer's instruction set supports this; the property makes it easy to switch context and to address variables. When a context switch occurs, the transputer saves the processor state and then loads the workspace address of the next process into Wptr. As stated before, with an 8-bit instruction format a transputer needs a sequence of prefix instructions to load a full address, which would make access to variables inefficient. For this reason the address in Wptr is used instead: since all the parameters belonging to a process are kept in its workspace, the only thing that has to be loaded is the small constant that is added to the address in Wptr to reach the desired location.

3.3 Floating Point Unit

The floating point unit can be thought of as a separate coprocessor under the control of the master, the CPU: it can run at the same time as the CPU, but it cannot run a different parallel process. It has its own evaluation stack of registers, FAreg, FBreg and FCreg, and there are 53 floating-point instructions; programming it from a high-level language rather than assembly is strongly advised. It follows the IEEE standard for floating-point formats, operations and results: 32-bit numbers have 1 sign bit, 8 exponent bits and 23 mantissa bits, and 64-bit numbers have 1 sign bit, 11 exponent bits and 52 mantissa bits. It also supports special results such as Inf (infinity) and NaN (not a number).
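To make the floating-point formats concrete, the following small C sketch (ordinary host code, not transputer code; the example value is arbitrary) unpacks a single-precision number into the 1/8/23-bit fields described above.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Split an IEEE 754 single-precision value into sign, exponent
       and mantissa fields (1, 8 and 23 bits respectively). */
    int main(void) {
        float x = -6.25f;                        /* -1.5625 * 2^2 */
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);          /* reinterpret the 32 bits */

        unsigned sign     = bits >> 31;          /* 1 bit                       */
        unsigned exponent = (bits >> 23) & 0xFF; /* 8 bits, biased by 127       */
        unsigned mantissa = bits & 0x7FFFFF;     /* 23 bits, implicit leading 1 */

        /* Prints: sign=1 exponent=129 (unbiased 2) mantissa=0x480000 */
        printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
               sign, exponent, (int)exponent - 127, mantissa);
        return 0;
    }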
3.4 Timers

The transputer has two timers, which can be accessed by the programmer. The high-resolution timer increments every five periods of ClockIn (one-microsecond resolution with the normal 5 MHz input clock). The low-resolution timer is 64 times slower, incrementing every 64 microseconds. These rates are independent of transputer model, processor speed and word length.

3.5 System Services

In multi-processor systems it is convenient to have a hierarchy of control. For example, a host should be able to boot up a network of transputers, detect when an error occurs and debug the network. This is achieved by means of the reset, analyse and error pins, which are called the system service pins.

3.6 Link Interface

The INMOS link is effectively a serial DMA port: an interface that reads or writes memory at one end and sends or receives high-speed serial data packets at the other. Links are extremely flexible and can be used for interfacing with peripherals through a link adaptor; an ASIC (application-specific integrated circuit) can use a link to read and write directly into a transputer's memory at high speed; and, most commonly, a link is used to talk to another processor, usually another transputer.

3.6.1 Link Communication

The hardware connection of links is extremely simple over the short distances for which they were designed. Links are serial in order to simplify board design; just two tracks are required for each link connection (Figure 3). The four links and the processor have independent access to memory. The processor sets up a link transfer and is then free to execute other code while dedicated link logic handles the communication, so all four links can be inputting and outputting simultaneously while the processor is running code. There could of course be a bandwidth problem if all the links and the processor accessed memory at the same time, but this is not a common event (Figure 4). The links are designed so that transputers do not need to be synchronized in order to talk to each other; they only need to agree on a nominal bit rate, derived from the input clock and an internal phase-locked loop. This means that the transputers in a network may be driven either from a common clock or from separate clocks.

Figure 3: Link connection. Figure 4: Links and processor sharing access to memory.

3.6.2 Link Protocols

Every data packet is acknowledged, which means that a transputer only has to buffer a single incoming data packet. This also provides the synchronizing behaviour of channels between transputers at the programming level. A link begins to run when it gets a command from the processor, i.e. when the processor executes an input or output instruction. On output it sends a data packet, assuming that something is waiting to receive the data, waits until an acknowledge arrives, and then sends the next packet (with the chance of waiting forever). On input it checks whether data has arrived, waits if not, and sends an acknowledge as soon as it receives the packet. A weakness of this protocol is that the acknowledge does not confirm that the packet has been received correctly. The protocol is expected by the hardware on all transputers and link adaptors. All link packets begin with a single high start bit and end with a low stop bit. The second bit of a packet indicates whether it is a data packet or an acknowledge: a high bit signifies data and a low bit signifies an acknowledge.
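As a sketch of the packet format just described (ordinary C, assuming the standard Inmos layout of an 11-bit data packet and a 2-bit acknowledge, with the data byte sent least significant bit first), the following prints the bit sequences that would appear on the wire.

    #include <stdint.h>
    #include <stdio.h>

    static void emit(int bit) { putchar(bit ? '1' : '0'); }

    /* Data packet: high start bit, high flag bit (data), 8 data bits,
       low stop bit -- 11 bits in total. */
    static void send_data_packet(uint8_t byte) {
        emit(1);                          /* high start bit         */
        emit(1);                          /* high flag: data packet */
        for (int i = 0; i < 8; i++)
            emit((byte >> i) & 1);        /* data, low bit first    */
        emit(0);                          /* low stop bit           */
        putchar('\n');
    }

    /* Acknowledge packet: high start bit followed by a low bit,
       which marks it as an acknowledge and also ends the packet. */
    static void send_ack_packet(void) {
        emit(1);
        emit(0);
        putchar('\n');
    }

    int main(void) {
        send_data_packet(0x41);           /* one data byte          */
        send_ack_packet();                /* the receiver's reply   */
        return 0;
    }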
3.7 T9000 Second Generation

The T9000 was the last transputer, produced to compete in the market. Its architecture (Figure 5) differs from the T8XX in that it has a true 16 KB high-speed cache instead of plain RAM, a five-stage pipeline (Figure 6), a grouper that collects instructions and groups them into larger packages of 4 bytes to feed the pipeline faster, a crossbar data/address bus, and a link system upgraded to a new 100 MHz mode. But long delays in the T9000's development meant that faster load-store designs were already outperforming it by the time it was to be released; in fact it also failed to reach its own performance goal of besting the T800 by a factor of ten.

Figure 5: T9000 processor architecture. Figure 6: Processor pipeline.

4 OCCAM LANGUAGE

Message passing can be supported either by designing a special parallel programming language or by using a normal high-level language and providing a library of external procedures or system calls for message passing. Transputers were typically programmed in the Occam programming language, although they could also be programmed in Fortran, BASIC or Pascal. Occam supports thread-style tasks in the language itself, and in most cases simply writing a program in Occam results in a threaded application. With task support and communication built into the chip, and the language interacting with them directly, writing code for things like device controllers becomes trivial: even the most basic code can watch the serial ports for I/O and will automatically sleep when there is no data.

Occam is a block-structured language that uses indentation, rather than brackets or BEGIN-END, to show compound structure. Each level of indentation consists of two spaces and each statement is normally placed on a separate line. It uses prefix operators, and comments are introduced with --. Data declarations look conventional, except that a colon shows the declaration prefixes a process:

  INT x:        -- declares the variable x
  [10]INT x:    -- declares a one-dimensional array x with 10 elements

Variables are declared prior to a process or 'subprocess', not at the beginning of the complete program, and they have the scope given by their level of indentation.

Five primitive processes exist in Occam for data transfer:

  Assignment   variable := expression    example:  x := y + 2
  Input        channel ? variable        example:  keyboard ? char
  Output       channel ! expression      example:  screen ! char
  Skip         SKIP                      -- does nothing and terminates
  Stop         STOP                      -- does nothing and never terminates

Unlike most programming languages, in which statements are executed one after another in the sequence written unless control statements are used, in Occam processes can be specified as executing either sequentially or concurrently. Sequential operation is specified with the sequence (SEQ) constructor, and each component process is executed after the previous one has finished. The general form is

  SEQ
    Process1
    Process2

and, as an example, the following takes data from an input channel c1 and sends it to an output channel c2:

  INT x:
  SEQ
    c1 ? x
    c2 ! x

Concurrent operation is specified with the parallel (PAR) constructor: all component processes are executed simultaneously. Its usage is the same as that of SEQ. Repetitive processes are written with WHILE, and conditionals with IF; their general forms are

  WHILE boolean expression
    Process

  IF
    boolean expression
      Process

It is clear from the main details given here that Occam is a simple language. Perhaps deliberately, it lacks some features found in conventional high-level languages: data structures are limited and recursion is not allowed.
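As a rough analogue of the channel communication described above (ordinary C with POSIX threads, purely illustrative: this is not how Occam or the transputer implements channels), the following sketch builds a blocking, synchronizing channel and runs a sender and a receiver as two concurrent processes, mimicking c1 ! v and c1 ? x under PAR.

    #include <pthread.h>
    #include <stdio.h>

    /* A one-slot synchronous channel: a send blocks until the value
       has been taken by a receive, giving the rendezvous behaviour
       of an Occam channel. */
    typedef struct {
        pthread_mutex_t m;
        pthread_cond_t  cv;
        int value;
        int full;                               /* 1 while a value is waiting */
    } channel;

    static void chan_init(channel *c) {
        pthread_mutex_init(&c->m, NULL);
        pthread_cond_init(&c->cv, NULL);
        c->full = 0;
    }

    static void chan_send(channel *c, int v) {  /* Occam: c ! v */
        pthread_mutex_lock(&c->m);
        while (c->full) pthread_cond_wait(&c->cv, &c->m);
        c->value = v;
        c->full = 1;
        pthread_cond_broadcast(&c->cv);
        while (c->full) pthread_cond_wait(&c->cv, &c->m);  /* wait for receiver */
        pthread_mutex_unlock(&c->m);
    }

    static int chan_recv(channel *c) {          /* Occam: c ? x */
        pthread_mutex_lock(&c->m);
        while (!c->full) pthread_cond_wait(&c->cv, &c->m);
        int v = c->value;
        c->full = 0;
        pthread_cond_broadcast(&c->cv);
        pthread_mutex_unlock(&c->m);
        return v;
    }

    static channel c1;

    static void *sender(void *arg) {
        (void)arg;
        for (int i = 0; i < 3; i++) chan_send(&c1, i);
        return NULL;
    }

    int main(void) {
        chan_init(&c1);
        pthread_t t;
        pthread_create(&t, NULL, sender, NULL); /* both processes run "in PAR" */
        for (int i = 0; i < 3; i++)
            printf("received %d\n", chan_recv(&c1));
        pthread_join(&t, NULL);
        return 0;
    }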
5 PARALLEL APPLICATIONS

There are areas of commerce, industry and science where high performance is always required and where transputers are good candidates: high-performance graphics, where graphics engines generate the sophisticated images required today in advertising and films; image processing, where an image can be broken into smaller segments and each segment processed by a separate device; database applications, where the data may be spread over a number of storage devices with separate processors handling each device in a coordinated, cooperating manner; and robotics, where joints that work concurrently have to be controlled. Parallel machines have been around for many years, but their impact in the commercial world has been limited to high-performance areas. With the advent of the transputer, inexpensive parallelism steadily became a reality. The principal areas in which transputers entered the market were add-on boards, parallel workstations, locally intelligent terminals and large systems for command and control. To be more specific, the following novel applications can be given.

Digital telephones: transputers were well suited to the control of digital communications and were for a time used in this application.

Video telephones: KDD of Japan supplied video telephones that used transputers for video compression at rates of 1-2 frames per second.

Laser printers: the UK company Eidolon produced a fast and portable raster image processor based on the transputer for formatting images in a laser printer. Only two transputers were required to drive a 40-page-per-minute system with full page-description-language graphics processing.

Control systems: the French company CGEE-Alsthom used a transputer-based control system installed in nuclear power stations. The Controbloc P20 system incorporates 400 transputers and features processor and functional redundancy, which should provide a high level of reliability.

Artificial intelligence: the cost of AI workstations had traditionally been high, but it was reduced by the development of inexpensive AI workstations based on transputers.

Optical character recognition and text digitization: progress in this area was still at an early stage in the 1980s, so not many transputer-based systems were in use.

Neural networks: transputers were surprisingly well suited to building neural structures, and a number of research organizations built working machines.

6 CONCLUSION

The transputer was a unique device of the 1980s, designed for parallel processing; its novel architecture and simple implementation made it one of the most popular and most talked-about processors in the literature. Indeed, there are still fans who keep it alive on their websites. For areas that need parallel processing and high performance, the transputer was undoubtedly a very good solution. Responsibility for its extinction lies with the company, Inmos, which did not manage to renew the design or bring prices down; perhaps, had it been a US-based company, we would still be using transputers. Nevertheless, researchers and designers who want efficient parallel processing should not skip over the ideas behind the architecture of the transputer.

REFERENCES

[1] J. Hinton and A. Pinder, Transputer Hardware and System Design, Prentice Hall, 1993.

[2] IMS T805 Transputer, INMOS Ltd, Bristol, UK.

[3] R. Zhang, E. B. Fernandez and J. Wu, "A Parallel Implementation of Robot Control Equations on IMS T414 Transputers", in Transputer Research and Applications (NATUG 4), D. L. Fielding, Ed., IOS Press, 1990.
[4] F. Hamisi and D. A. Fraser, "Transputer-based implementation of real-time robot position control", Microprocessors and Microsystems, vol. 13, pp. 644-652, 1989.

[5] S. Hemann, "A transputer based shuffle shift machine for image processing and reconstruction", Proceedings of the 29th IEEE Conference, 1990, pp. 445-450.

[6] M. Walden and K. Sere, "Free Text Retrieval on Transputer Networks", Microprocessors and Microsystems, vol. 13, pp. 179-183, 1989.

[7] N. Tucker, "Commercial Issues: parallel processing and the transputer", Microprocessors and Microsystems, vol. 13, pp. 139-144, 1989.

[8] B. Wilkinson, Computer Architecture: Design and Performance, Prentice Hall, 1996.

[9] Inmos Limited, Occam 2 Reference Manual, Prentice Hall, Hemel Hempstead, 1988.

APPENDIX A: TRANSPUTER INSTRUCTION SET

(Tables of instruction codes: function codes; processor initialization operation codes; arithmetic/logical operation codes; long arithmetic operation codes; input/output operation codes; scheduling operation codes; control operation codes.)