ARM Introduction & Instruction Set Architecture
Aleksandar Milenkovic
E-mail: [email protected]
Web: http://www.ece.uah.edu/~milenka

Outline
- ARM Architecture
- ARM Organization and Implementation
- ARM Instruction Set
- Thumb Instruction Set
- Architectural Support for System Development
- ARM Processor Cores
- Memory Hierarchy
- Architectural Support for Operating Systems
- ARM CPU Cores
- Embedded ARM Applications

ARM History
- ARM: Acorn RISC Machine (1983-1985)
  - Acorn Computers Limited, Cambridge, England
- ARM: Advanced RISC Machine (1990)
  - ARM Limited, 1990
  - ARM has been licensed to many semiconductor manufacturers

ARM's visible registers
- User level
  - 15 GPRs, PC, CPSR (current program status register)
- Remaining registers are used for system-level programming and for handling exceptions

  Usable in user mode:   r0-r14, r15 (PC), CPSR
  System modes only:
    fiq mode:            r8_fiq-r14_fiq, SPSR_fiq
    svc mode:            r13_svc, r14_svc, SPSR_svc
    abort mode:          r13_abt, r14_abt, SPSR_abt
    irq mode:            r13_irq, r14_irq, SPSR_irq
    undefined mode:      r13_und, r14_und, SPSR_und

ARM CPSR format
- N (Negative), Z (Zero), C (Carry), V (oVerflow)
- mode: controls the processor mode
- T: controls the instruction set
  - T = 1: instruction stream is 16-bit Thumb instructions
  - T = 0: instruction stream is 32-bit ARM instructions
- I, F: interrupt enables

  bits 31-28   bits 27-8   bit 7   bit 6   bit 5   bits 4-0
  N Z C V      unused      I       F       T       mode

ARM memory organization
- Linear array of bytes numbered from 0 to 2^32 - 1
- Data items
  - bytes (8 bits)
  - half-words (16 bits): always aligned to 2-byte boundaries (start at an even byte address)
  - words (32 bits): always aligned to 4-byte boundaries (start at a byte address which is a multiple of 4)

[Figure: memory map for byte addresses 0 to 23, bit 31 on the left and bit 0 on the right, showing byte0-byte3, half-word4, byte6, word8, half-word12, half-word14 and word16]

ARM instruction set
- Load-store architecture
  - operands are in GPRs
  - load/store: the only
    instructions that operate on memory
- Instructions
  - Data Processing: use and change only register values
  - Data Transfer: copy memory values into registers (load) or copy register values into memory (store)
  - Control Flow
    - branch
    - branch-and-link: save return address to resume the original sequence
    - trapping into system code: supervisor calls

ARM instruction set (cont'd)
- Three-address data processing instructions
- Conditional execution of every instruction
- Powerful load/store multiple register instructions
- Ability to perform a general shift operation and a general ALU operation in a single instruction that executes in a single clock cycle
- Open instruction set extension through the coprocessor instruction set, including adding new registers and data types to the programmer's model
- Very dense 16-bit compressed representation of the instruction set in the Thumb architecture

I/O system
- I/O is memory mapped
  - internal registers of peripherals (disk controllers, network interfaces, etc.) are addressable locations within the ARM's memory map and may be read and written using the load-store instructions
- Peripherals may use either the normal interrupt (IRQ) or fast interrupt (FIQ) input
  - normally most interrupt sources share the IRQ input, while just one or two time-critical sources are connected to the FIQ input
- Some systems may include external DMA hardware to handle high-bandwidth I/O traffic

ARM exceptions
- ARM supports a range of interrupts, traps, and supervisor calls; all are grouped under the general heading of exceptions
- Handling exceptions
  - current state is saved by copying the PC into r14_exc and CPSR into SPSR_exc (exc stands for the exception type)
  - processor operating mode is changed to the appropriate exception mode
  - PC is forced to a value between 0x00 and 0x1C, the particular value depending on the type of exception
  - the instruction at the location the PC is forced to (the vector address) usually contains a branch to the exception handler; the exception
    handler will use r13_exc, which is normally initialized to point to a dedicated stack in memory, to save some user registers
  - return: restore the user registers and then restore PC and CPSR atomically

ARM cross-development toolkit
- Software development
  - tools developed by ARM Limited
  - public domain tools (ARM back end for the gcc C compiler)
- Cross-development
  - tools run on a different architecture from the one for which they produce code

[Figure: ARM cross-development toolkit, with the labels C source, C libraries, asm source, C compiler, assembler, .aof, object libraries, linker, .axf, debug, ARMsd, system model, ARMulator, development board]

Outline
- ARM Architecture
- ARM Assembly Language Programming
- ARM Organization and Implementation
- ARM Instruction Set
- Architectural Support for High-level Languages
- Thumb Instruction Set
- Architectural Support for System Development
- ARM Processor Cores
- Memory Hierarchy
- Architectural Support for Operating Systems
- ARM CPU Cores
- Embedded ARM Applications

ARM Instruction Set
- Data Processing Instructions
- Data Transfer Instructions
- Control flow Instructions

Data Processing Instructions
- Classes of data processing instructions
  - Arithmetic operations
  - Bit-wise logical operations
  - Register-movement operations
  - Comparison operations
- Operands: 32 bits wide; there are 3 ways to specify operands
  - come from registers
  - the second operand may be a constant (immediate)
  - shifted register operand
- Result: 32 bits wide, placed in a register
  - long multiply produces a 64-bit result

Data Processing Instructions (cont'd)
Arithmetic operations
    ADD r0, r1, r2        r0 := r1 + r2
    ADC r0, r1, r2        r0 := r1 + r2 + C
    SUB r0, r1, r2        r0 := r1 - r2
    SBC r0, r1, r2        r0 := r1 - r2 + C - 1
    RSB r0, r1, r2        r0 := r2 - r1
    RSC r0, r1, r2        r0 := r2 - r1 + C - 1
Bit-wise logical operations
    AND r0, r1, r2        r0 := r1 and r2
    ORR r0, r1, r2        r0 := r1 or r2
    EOR r0, r1, r2        r0 := r1 xor r2
    BIC r0, r1, r2        r0 := r1 and (not r2)
Register movement
    MOV r0, r2            r0 := r2
    MVN r0, r2            r0 := not r2
Comparison operations
    CMP r1, r2            set cc on r1 - r2
    CMN r1, r2            set cc on r1 + r2
    TST r1, r2            set cc on r1 and r2
    TEQ r1, r2            set cc on r1 xor r2

Data Processing Instructions (cont'd)
- Immediate operands: immediate = (0..255) x 2^(2n), 0 <= n <= 12
    ADD r3, r3, #3            r3 := r3 + 3
    AND r8, r7, #&ff          r8 := r7[7:0]; & for hex
- Shifted register operands
  - the second operand is subject to a shift operation before it is combined with the first operand
    ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
    ADD r5, r5, r3, LSL r2    r5 := r5 + 2^r2 x r3

ARM shift operations
- LSL: Logical Shift Left
- LSR: Logical Shift Right
- ASR: Arithmetic Shift Right
- ROR: Rotate Right
- RRX: Rotate Right Extended by 1 place

[Figure: bit-level diagrams of LSL #5, LSR #5, ASR #5 with a positive operand, ASR #5 with a negative operand, ROR #5, and RRX through the C flag]

Setting the condition codes
- Any DPI can set the condition codes (N, Z, V, and C)
  - for all DPIs except the comparison operations a specific request must be made
  - at the assembly language level this request is indicated by adding an `S` to the opcode
- Example (64-bit add, r3:r2 := r1:r0 + r3:r2)
    ADDS r2, r2, r0           ; carry out to C
    ADC  r3, r3, r1           ; ...
                              ; add into high word
- Arithmetic operations set all the flags (N, Z, C, and V)
- Logical and move operations set N and Z
  - they preserve V, and either preserve C when there is no shift operation or set C according to the shift operation (the bit that falls off)

Multiplies
- Example (Multiply, Multiply-Accumulate)
    MUL r4, r3, r2            r4 := [r3 x r2]<31:0>
    MLA r4, r3, r2, r1        r4 := [r3 x r2 + r1]<31:0>
- Note
  - the least significant 32 bits are placed in the result register, the rest are ignored
  - an immediate second operand is not supported
  - the result register must not be the same as the first source register
  - if the `S` bit is set, V is preserved and C is rendered meaningless
- Example (r0 = r0 x 35)
    ADD r0, r0, r0, LSL #2    ; r0' = r0 x 5
    RSB r0, r0, r0, LSL #3    ; r0'' = 7 x r0'

Data transfer instructions
- Single register load and store instructions
  - transfer of a data item (byte, half-word, word) between ARM registers and memory
- Multiple register load and store instructions
  - enable transfer of large quantities of data
  - used for procedure entry and exit, to save/restore workspace registers, to copy blocks of data around memory
- Single register swap instructions
  - allow exchange between a register and memory in one instruction
  - used to implement semaphores to ensure mutual exclusion on accesses to shared data in multiprocessor systems

Data Transfer Instructions (cont'd)
Single register load and store
- Register-indirect addressing
    LDR  r0, [r1]             r0 := mem32[r1]
    STR  r0, [r1]             mem32[r1] := r0
    Note: r1 keeps a word address (2 LSBs are 0)
    LDRB r0, [r1]             r0 := mem8[r1]
    Note: no restrictions for r1
- Base+offset addressing (offset of up to 4 Kbytes)
    LDR  r0, [r1, #4]         r0 := mem32[r1 + 4]
- Auto-indexing addressing
    LDR  r0, [r1, #4]!        r0 := mem32[r1 + 4]; r1 := r1 + 4
- Post-indexed addressing
    LDR  r0, [r1], #4         r0 := mem32[r1]; r1 := r1 + 4

Data Transfer Instructions (cont'd)
Copy with explicit pointer updates:
    COPY:   ADR  r1, TABLE1       ; r1 points to TABLE1
            ADR  r2, TABLE2       ; r2 points to TABLE2
    LOOP:   LDR  r0, [r1]
            STR  r0, [r2]
            ADD  r1, r1, #4
            ADD  r2, r2, #4
            ...
    TABLE1: ...
    TABLE2: ...
The same copy with post-indexed addressing:
    COPY:   ADR  r1, TABLE1       ; r1 points to TABLE1
            ADR  r2, TABLE2       ; r2 points to TABLE2
    LOOP:   LDR  r0, [r1], #4
            STR  r0, [r2], #4
            ...
    TABLE1: ...
    TABLE2: ...

Data Transfer Instructions
Multiple register data transfers
    LDMIA r1, {r0, r2, r5}    r0 := mem32[r1]
                              r2 := mem32[r1 + 4]
                              r5 := mem32[r1 + 8]
    Note: any subset (or all) of the registers may be transferred with a single instruction
    Note: the order of registers within the list is insignificant
    Note: including r15 in the list will cause a change in the control flow
- Stack organizations
  - FA: full ascending
  - EA: empty ascending
  - FD: full descending
  - ED: empty descending
- Block copy view
  - data is to be stored above or below the address held in the base register
  - address incrementing or decrementing begins before or after storing the first value

Multiple register transfer addressing modes
[Figure: memory between 0x1000 and 0x1018 showing where r0, r1 and r5 are stored, and the base register before (r9) and after (r9') each of STMIA r9!, {r0,r1,r5}; STMIB r9!, {r0,r1,r5}; STMDA r9!, {r0,r1,r5}; STMDB r9!, {r0,r1,r5}]

The mapping between the stack and block copy views
                        Ascending              Descending
                        Full       Empty       Full       Empty
  Increment  Before     STMIB                             LDMIB
                        STMFA                             LDMED
             After                 STMIA       LDMIA
                                   STMEA       LDMFD
  Decrement  Before                LDMDB       STMDB
                                   LDMEA       STMFD
             After      LDMDA                             STMDA
                        LDMFA                             STMED

Control flow instructions
  Branch   Interpretation      Normal uses
  B        Unconditional       Always take this branch
  BAL      Always              Always take this branch
  BEQ      Equal               Comparison equal or zero result
  BNE      Not equal           Comparison not equal or non-zero result
  BPL      Plus                Result positive or zero
  BMI      Minus               Result minus or negative
  BCC      Carry clear         Arithmetic operation did not give carry-out
  BLO      Lower               Unsigned comparison gave lower
  BCS      Carry set           Arithmetic operation gave carry-out
  BHS      Higher or same      Unsigned comparison gave higher or same
  BVC      Overflow clear      Signed integer operation; no overflow occurred
  BVS      Overflow set        Signed integer operation; overflow occurred
  BGT      Greater than        Signed integer comparison gave greater than
  BGE      Greater or equal    Signed integer comparison gave greater or equal
  BLT      Less than           Signed integer comparison gave less than
  BLE      Less or equal       Signed integer comparison gave less than or equal
  BHI      Higher              Unsigned comparison gave higher
  BLS      Lower or same       Unsigned comparison gave lower or same

Conditional execution
- Conditional execution avoids branch instructions that are used only to skip a small number of non-branch instructions
- Example
            CMP   r0, #5
            BEQ   BYPASS          ; if (r0 != 5) {
            ADD   r1, r1, r0      ;   r1 := r1 + r0 - r2
            SUB   r1, r1, r2      ; }
    BYPASS: ...
  With conditional execution
            CMP   r0, #5
            ADDNE r1, r1, r0
            SUBNE r1, r1, r2
            ...
  Note: add the 2-letter condition after the 3-letter opcode
            ; if ((a==b) && (c==d)) e++;
            CMP   r0, r1
            CMPEQ r2, r3
            ADDEQ r4, r4, #1

Branch and link instructions
- Branch to subroutine (r14 serves as a link register)
            BL    SUBR            ; branch to SUBR
            ..                    ; return here
    SUBR:   ..                    ; SUBR entry point
            MOV   pc, r14         ; return
- Nested subroutines
            BL    SUB1
            ..
    SUB1:   STMFD r13!, {r0-r2,r14}   ; save work and link register
            BL    SUB2
            ..
            LDMFD r13!, {r0-r2,pc}
    SUB2:   ..
            MOV   pc, r14             ; copy r14 into r15

Supervisor calls
- Supervisor is a program which operates at a privileged level; it can do things that a user-level program cannot do directly
  - Example: send text to the display
- ARM ISA includes SWI (SoftWare Interrupt)
            SWI   SWI_WriteC      ; output r0[7:0]
            SWI   SWI_Exit        ; return from a user program back to monitor

Jump tables
- Call one of a set of subroutines depending on a value computed by the program
            BL    JTAB
            ...
    JTAB:   CMP   r0, #0
            BEQ   SUB0
            CMP   r0, #1
            BEQ   SUB1
            CMP   r0, #2
            BEQ   SUB2
  Note: slow when the list is long, and all subroutines are equally frequent
            BL    JTAB
            ...
    JTAB:   ADR   r1, SUBTAB
            CMP   r0, #SUBMAX     ; overrun?
            LDRLS pc, [r1, r0, LSL #2]
            B     ERROR
    SUBTAB: DCD   SUB0
            DCD   SUB1
            DCD   SUB2
            ...

Hello ARM World!
            AREA  HelloW, CODE, READONLY   ; declare code area
SWI_WriteC  EQU   &0                       ; output character in r0
SWI_Exit    EQU   &11                      ; finish program
            ENTRY                          ; code entry point
START:      ADR   r1, TEXT                 ; r1 <- Hello ARM World!
LOOP:       LDRB  r0, [r1], #1             ; get the next byte
            CMP   r0, #0                   ; check for text end
            SWINE SWI_WriteC               ; if not end of string, print
            BNE   LOOP
            SWI   SWI_Exit                 ; end of execution
TEXT        =     "Hello ARM World!", &0a, &0d, 0
            END

ARM Organization and Implementation

Outline
- ARM Architecture
- ARM Organization and Implementation
- ARM Instruction Set
- Architectural Support for High-level Languages
- Thumb Instruction Set
- Architectural Support for System Development
- ARM Processor Cores
- Memory Hierarchy
- Architectural Support for Operating Systems
- ARM CPU Cores
- Embedded ARM Applications

ARM organization
- Register file
  - 2 read ports, 1 write port, plus 1 read port and 1 write port reserved for r15 (pc)
- Barrel shifter: shifts or rotates one operand by any number of bits
- ALU: performs the arithmetic and logic functions required
- Memory address register + incrementer
- Memory data registers
- Instruction decoder and associated control logic

[Figure: ARM organization, with the labels A[31:0], address register, incrementer, PC, register bank, multiply register, barrel shifter, ALU, A bus, B bus, ALU bus, instruction decode & control, data out register, data in register, D[31:0]]

Three-stage pipeline
- Fetch
  - the instruction is fetched from memory and placed in the instruction pipeline
- Decode
  - the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
- Execute
  - the instruction owns the datapath; the register bank is read, an operand shifted, the ALU result generated and written back into a destination register

ARM single-cycle instruction pipeline
[Figure: instructions 1, 2 and 3 each pass through fetch, decode and execute, overlapped by one cycle along the time axis]

ARM single-cycle
instruction pipeline
    cycle:               1        2        3         4         5
    1  add r0,r1,#5      fetch    decode   execute
    2  sub r2,r3,r6               fetch    decode    execute
    3  cmp r2,#3                           fetch     decode    execute

ARM multi-cycle instruction pipeline
- Decode logic is always generating the control signals for the datapath to use in the next cycle

[Figure: pipeline timing for the five-instruction sequence ADD, STR, ADD, ADD, ADD; the STR uses the datapath for two cycles (calc. addr., data xfer), which delays the instructions behind it]

ARM multi-cycle LDMIA (load multiple) instruction
- Instruction sequence: ldmia r0,{r2,r3}; sub r2,r3,r6; cmp r2,#3
- The ldmia executes in two cycles (ld r2, ld r3)
- Decode stage stays occupied, since ldmia must continue to remember the decoded instruction
- sub is fetched at the normal time but not decoded until the LDMIA is finishing
- cmp is delayed

Control stalls: due to branches
- Branches often introduce stalls (branch penalty)
  - Stall time may depend on whether the branch is taken
- May have to squash instructions that already started executing
- Don't know what to fetch until the condition is evaluated

ARM pipelined branch
- Instruction sequence: bne foo; sub r2,r3,r6; foo add r0,r1,r2
- Decision is not made until the third clock cycle
- Two cycles of work are thrown away if the bne is taken

Pipeline: how it works
- All instructions occupy the datapath for one or more adjacent cycles
- For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
- During the first datapath cycle each instruction issues a fetch for the next instruction but one
- Branch instructions flush and refill the instruction pipeline

ARM9TDMI 5-stage pipeline
- Fetch
- Decode
  - instruction is decoded
  - register operands read (3 read ports)
- Execute
  - an operand is shifted and the ALU result generated, or
  - address is computed
- Buffer/data
  - data memory is accessed (load, store)
- Write-back
  - write to register file

[Figure: ARM9TDMI 5-stage pipeline organization: I-cache fetch, instruction decode and register read, shift/ALU execute with forwarding paths, D-cache buffer/data, register write-back]
ARM9TDMI Data Forwarding
- Dependent instructions back to back:
    ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
    ADD r5, r5, r3, LSL r2    r5 := r5 + 2^r2 x r3
- The same pair separated by an independent instruction:
    ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
    ADD r8, r9, r10           r8 := r9 + r10
    ADD r5, r5, r3, LSL r2    r5 := r5 + 2^r2 x r3
- Stall?
    LDR r3, [r2]              r3 := mem[r2]
    ADD r1, r2, r3            r1 := r2 + r3

[Figure: ARM9TDMI 5-stage pipeline organization with the forwarding paths highlighted]

ARM9TDMI PC generation
- 3-stage pipeline
  - PC behavior: operands are read in the execute stage, so r15 = PC + 8
- 5-stage pipeline
  - operands are read in the decode stage, so r15 = PC + 4?
  - incompatibilities between 3-stage and 5-stage implementations => unacceptable
  - to avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs

Data processing instruction datapath activity (Ex)
- Reg-Reg
    Rd = Rn op Rm
    r15 = AR + 4
    AR = AR + 4
- Reg-Imm
    Rd = Rn op Imm
    r15 = AR + 4
    AR = AR + 4

[Figure, panel (a): register-register operations]
[Figure, panel (b): register-immediate operations]

STR (store register) datapath activity (Ex1, Ex2)
- Compute address (Ex1)
    AR = Rn op Disp
    r15 = AR + 4
- Store data (Ex2)
    AR = PC
    mem[AR] = Rd<x:y>
    If autoindexing => Rn = Rn +/- 4

[Figure, panel (a): 1st cycle, compute address; panel (b): 2nd cycle, store data & auto-index]

The first two (of three) cycles of a branch instruction
- Compute target address
    AR = PC + Disp, lsl #2
- Save return address (if required)
    r14 = PC
    AR = AR + 4
- Third cycle: do a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch?

[Figure, panel (a): 1st cycle, compute branch target]
[Figure, panel (b): 2nd cycle, save return address]

ARM Implementation
- Datapath
  - RTL (Register Transfer Level)
- Control unit
  - FSM (Finite State Machine)

2-phase non-overlapping clock scheme
- Most ARMs do not operate on edge-sensitive registers
- Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
- Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2

[Figure: phase 1 and phase 2 waveforms over 1 clock cycle]

ARM datapath timing
- Register read
  - Register read buses: dynamic, precharged during phase 2
  - During phase 1 selected registers discharge the read buses, which become valid early in phase 1
- Shift operation
  - second operand passes through the barrel shifter
- ALU operation
  - ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
  - ALU processes the operands during phase 2, producing the valid output towards the end of the phase
  - the result is latched in the destination register at the end of phase 2

ARM datapath timing (cont'd)
[Figure: timing diagram across phase 1 and phase 2 showing register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time, and the precharge that invalidates the buses]

Minimum Datapath Delay = Register read time + Shifter Delay + ALU Delay + Register write set-up time + Phase 2 to phase 1 non-overlap time

The original ARM1 ripple-carry adder
- Carry logic: use CMOS AOI (And-Or-Invert) gate
- Even bits use the circuit shown in the figure
- Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
- Worst case path: 32 gates long

[Figure: one-bit ripple-carry adder cell with inputs A, B, Cin and outputs sum, Cout]

ARM2 4-bit carry look-ahead scheme
- Carry Generate (G), Carry Propagate (P)
- Cout[3] = Cin[0].P + G
- Use AOI and alternate AND/OR gates
- Worst case: 8 gates long
[Figure: 4-bit adder logic with inputs A[3:0], B[3:0], Cin[0] and outputs sum[3:0], Cout[3], G, P]

The ARM2 ALU logic for one result bit
- ALU functions
  - data operations (add, sub, ...)
  - address computations for memory accesses
  - branch target computations
  - bit-wise logical operations
  - ...

[Figure: ALU logic for one result bit, with function select lines fs[5:0], the NA and NB buses, carry logic, G, P and the ALU bus]

ARM2 ALU function codes
  fs5  fs4  fs3  fs2  fs1  fs0   ALU output
  0    0    0    1    0    0     A and B
  0    0    1    0    0    0     A and not B
  0    0    1    0    0    1     A xor B
  0    1    1    0    0    1     A plus not B plus carry
  0    1    0    1    1    0     A plus B plus carry
  1    1    0    1    1    0     not A plus B plus carry
  0    0    0    0    0    0     A
  0    0    0    0    0    1     A or B
  0    0    0    1    0    1     B
  0    0    1    0    1    0     not B
  0    0    1    1    0    0     zero

The ARM6 carry-select adder scheme
- Compute sums of various fields of the word for a carry-in of zero and a carry-in of one
- The final result is selected by using the correct carry-in value to control a multiplexor
- Worst case: O(log2[word width]) gates long

[Figure: carry-select adder; the fields a,b[3:0] through a,b[31:28] feed adders producing s and s+1, and multiplexors assemble sum[3:0], sum[7:4], sum[15:8] and sum[31:16]]

Note: Be careful! Fan-out on some of these gates is high, so direct comparison with previous schemes is not applicable.
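
The carry-select idea can be modeled in a few lines of Python. This is only a behavioral sketch, not the ARM6 circuit: the field widths 4, 4, 8, 16 are read off the sum[3:0], sum[7:4], sum[15:8], sum[31:16] labels in the figure, the function name `carry_select_add` is made up for illustration, and the software loop is sequential where the hardware computes all fields in parallel.

```python
# Behavioral model of a 32-bit carry-select adder (illustrative sketch).
# Field widths follow the figure labels; the name is hypothetical.
FIELD_WIDTHS = (4, 4, 8, 16)  # sum[3:0], sum[7:4], sum[15:8], sum[31:16]


def carry_select_add(a, b, carry_in=0):
    """Return (32-bit sum, carry-out) of a + b + carry_in."""
    result = 0
    carry = carry_in
    offset = 0
    for width in FIELD_WIDTHS:
        mask = (1 << width) - 1
        field_a = (a >> offset) & mask
        field_b = (b >> offset) & mask
        # Both candidate sums are formed without knowing the real carry-in.
        sum_if_carry0 = field_a + field_b
        sum_if_carry1 = field_a + field_b + 1
        # The multiplexor: the real carry-in only selects between them.
        chosen = sum_if_carry1 if carry else sum_if_carry0
        result |= (chosen & mask) << offset
        carry = chosen >> width  # carry-out of this field
        offset += width
    return result, carry


if __name__ == "__main__":
    total, carry_out = carry_select_add(0x0000FFFF, 0x00000001)
    print(hex(total), carry_out)
```

Each field does its additions before the incoming carry is known, so the carry only has to ripple through the selection step, one multiplexor per field.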
The ARM6 ALU organization
- Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexor selects the output

[Figure: ARM6 ALU organization: A and B operand latches, invert A and invert B XOR gates, logic functions unit and adder (with C in) in parallel, result mux, zero detect, and the N, Z, C, V outputs]

ARM9 carry arbitration encoding
- Carry arbitration adder
- One-bit encoding
    ai   bi   | Ci | vi, wi
    0    0    | 0  | 0, 0
    1    1    | 1  | 1, 1
    1    0    | u  | 1, 0
    0    1    | u  | 1, 0
- Two-bit encoding
    ai    bi    ai-1  bi-1  | Ci | vi, wi
    0     0     -     -     | 0  | 0, 0
    1     1     -     -     | 1  | 1, 1
    0(1)  1(0)  0     0     | 0  | 0, 0
    0(1)  1(0)  1     1     | 1  | 1, 1
    0(1)  1(0)  0(1)  1(0)  | u  | 1, 0
- In the one-bit table, vi = ai or bi and wi = ai and bi

The cross-bar switch barrel shifter
- Shifter delay is critical since it contributes directly to the datapath cycle time
- Cross-bar switch matrix (32 x 32)
- Principle for a 4x4 matrix

[Figure: 4x4 switch matrix connecting in[3:0] to out[3:0], with diagonals labeled right 3, right 2, right 1, no shift, left 1, left 2, left 3]

The cross-bar switch barrel shifter (cont'd)
- Precharged logic is used => each switch is a single NMOS transistor
- Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
- For rotate right, the right shift diagonal is enabled together with the complementary shift left diagonal (e.g., 'right 1' + 'left 3')
- Arithmetic shift right: use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately

Multiplier design
- All ARMs apart from the first prototype have included support for integer multiplication
  - older ARM cores include low-cost multiplication hardware that supports only the 32-bit result multiply and multiply-accumulate
  - recent ARM cores have high-performance multiplication hardware and support 64-bit result multiply and multiply-accumulate
- Low cost implementation
  - Use the datapath iteratively, employing the barrel shifter and ALU to generate a 2-bit product in each clock cycle
  - use early termination to stop the iterations when there are no more ones in the multiply register

The 2-bit multiplication algorithm, Nth cycle
- Control settings for the Nth cycle of the multiplication
- Use existing shifter and ALU + additional hardware
  - dedicated two-bits-per-cycle shift register for the multiplier and a few gates for the Booth's algorithm control logic (overhead is a few per cent of the area of the ARM core)

  Carry-in   Multiplier   Shift           ALU      Carry-out
  0          x0           LSL #2N         A + 0    0
  0          x1           LSL #2N         A + B    0
  0          x2           LSL #(2N + 1)   A - B    1
  0          x3           LSL #2N         A - B    1
  1          x0           LSL #2N         A + B    0
  1          x1           LSL #(2N + 1)   A + B    0
  1          x2           LSL #2N         A - B    1
  1          x3           LSL #2N         A + 0    1

High speed multiplication
- Where multiplication performance is very important, more hardware resources must be dedicated
  - in some embedded systems the ARM core is used to perform real-time digital signal processing (DSP); DSP programs are typically multiplication intensive
- Use intermediate results which include partial sums and partial carries
  - Carry-save adders are used for this
- These two binary results are added together at the end of the multiplication
  - The main ALU is used for this

Carry-propagate (a) and carry-save (b) adder structures
- A carry-propagate adder takes two conventional (irredundant) binary numbers as inputs and produces a binary sum
- A carry-save adder takes one
  binary and one redundant (partial sum and partial carry) input and produces a sum in redundant binary representation (sum and carry)

[Figure: (a) carry-propagate adder, four full adders with each Cout feeding the next Cin; (b) carry-save adder, four full adders with separate A, B, Cin inputs and S, Cout outputs]

ARM high-speed multiplier organization
- CSA has 4 layers of adders, each handling 2 multiplier bits => multiply 8 bits per clock cycle
- Partial sum and carry are cleared at the beginning or initialized to accumulate a value
- Multiplier is shifted right 8 bits per cycle in the 'Rs' register
- Partial sum and carry are rotated right 8 bits per cycle
- Performance: up to 4 clock cycles (early termination is possible)
- Complexity: 160 bits in shift registers, 128 bits of carry-save adder logic (up to 10% of simpler cores)

ARM high-speed multiplier organization
[Figure: registers Rs (shifted right 8 bits/cycle) and Rm feed the carry-save adders; partial sum and partial carry are rotated 8 bits/cycle, with initialization for MLA; the ALU adds the partials]

ARM2 register cell circuit
[Figure: register cell with write, read A and read B controls, connected to the ALU bus, A bus and B bus]

ARM register bank floorplan
[Figure: floorplan with A bus read decoders, B bus read decoders, write decoders, register cells, PC, Vdd and Vss rails, and the ALU bus, PC bus, INC bus, A bus and B bus]

ARM core datapath buses
[Figure: address register, incrementer, register bank, multiplier, shifter, ALU, data in, data out and instruction pipe, connected by the buses Ad, PC, inc, A, B, W, shift out, Din and instruction]

ARM control logic structure
[Figure: instruction and coprocessor inputs feed the decode PLA and cycle count, which drive address control, register control, ALU control, multiply control, load/store multiple control and shifter control]
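
The carry-save scheme from the high-speed multiplication slides can be made concrete with a short Python model. It is a simplified sketch under stated assumptions, not the ARM multiplier: it compresses one multiplier bit per carry-save step where the hardware handles 2 bits per layer with Booth recoding, it works on unsigned 32-bit values, and the names `carry_save_add` and `csa_multiply` are made up for illustration. It does keep the three properties the slides describe: the result is held as a partial sum and a partial carry, 8 multiplier bits are consumed per cycle with early termination, and a single carry-propagate add (the main ALU's job) finishes the result.

```python
# Simplified model of a carry-save multiplier (illustrative sketch only).
MASK32 = 0xFFFFFFFF


def carry_save_add(a, b, c):
    """3:2 compressor: reduce three 32-bit words to (partial sum, partial carry).

    No carry propagates between bit positions; each bit is an independent
    full adder whose carry-out is moved one position to the left.
    """
    partial_sum = (a ^ b ^ c) & MASK32
    partial_carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK32
    return partial_sum, partial_carry


def csa_multiply(rm, rs, acc=0):
    """Return (low 32 bits of rm * rs + acc, number of 8-bit cycles used)."""
    rm &= MASK32
    rs &= MASK32
    partial_sum, partial_carry = acc & MASK32, 0  # acc models MLA initialization
    shift = 0
    cycles = 0
    while rs != 0:  # early termination: stop when no multiplier bits remain
        for bit in range(8):  # 8 multiplier bits per cycle
            if (rs >> bit) & 1:
                addend = (rm << (shift + bit)) & MASK32
                partial_sum, partial_carry = carry_save_add(
                    partial_sum, partial_carry, addend)
        rs >>= 8
        shift += 8
        cycles += 1
    # Final carry-propagate addition of the two partial results.
    return (partial_sum + partial_carry) & MASK32, cycles


if __name__ == "__main__":
    print(csa_multiply(7, 5))
```

A small multiplier such as 5 finishes in one cycle, while a multiplier with a non-zero top byte needs all four, which matches the "up to 4 clock cycles" figure on the slide.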