ARM Introduction & Instruction Set Architecture
Aleksandar Milenkovic
E-mail: [email protected]
Web: http://www.ece.uah.edu/~milenka
Outline
- ARM Architecture
- ARM Organization and Implementation
- ARM Instruction Set
- Thumb Instruction Set
- Architectural Support for System Development
- ARM Processor Cores
- Memory Hierarchy
- Architectural Support for Operating Systems
- ARM CPU Cores
- Embedded ARM Applications
ARM History
- ARM – Acorn RISC Machine (1983 – 1985)
  - Acorn Computers Limited, Cambridge, England
- ARM – Advanced RISC Machine (1990)
  - ARM Limited, 1990
  - ARM has been licensed to many semiconductor manufacturers
ARM's visible registers
- User level
  - 15 GPRs (r0-r14), PC (r15), CPSR (current program status register)
- Remaining registers are used for system-level programming and for handling exceptions

Registers by mode:
  usable in user mode:   r0-r14, r15 (PC), CPSR
  system modes only (banked registers):
    fiq mode:            r8_fiq-r14_fiq, SPSR_fiq
    svc mode:            r13_svc, r14_svc, SPSR_svc
    abort mode:          r13_abt, r14_abt, SPSR_abt
    irq mode:            r13_irq, r14_irq, SPSR_irq
    undefined mode:      r13_und, r14_und, SPSR_und
ARM CPSR format
- N (Negative), Z (Zero), C (Carry), V (oVerflow)
- mode – controls the processor mode
- T – controls the instruction set
  - T = 1 – instruction stream is 16-bit Thumb instructions
  - T = 0 – instruction stream is 32-bit ARM instructions
- I, F – interrupt enables (IRQ and FIQ; a set bit disables the corresponding interrupt)

Bit layout:
  bits 31-28:  N Z C V
  bits 27-8:   unused
  bit 7:       I
  bit 6:       F
  bit 5:       T
  bits 4-0:    mode
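A minimal sketch of how the CPSR fields could be picked apart in software. The bit positions follow the layout on this slide; the mode encodings in `MODE_NAMES` are not on the slide and are assumed here from the standard ARM encoding, and `decode_cpsr` is a hypothetical helper name.

```python
# Mode field encodings (assumed: standard ARM values, not listed on the slide).
MODE_NAMES = {
    0b10000: "user", 0b10001: "fiq", 0b10010: "irq", 0b10011: "svc",
    0b10111: "abort", 0b11011: "undefined", 0b11111: "system",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its named fields."""
    return {
        "N": (cpsr >> 31) & 1,
        "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1,
        "V": (cpsr >> 28) & 1,
        "I": (cpsr >> 7) & 1,
        "F": (cpsr >> 6) & 1,
        "T": (cpsr >> 5) & 1,
        "mode": MODE_NAMES.get(cpsr & 0x1F, "unknown"),
    }

# Example value: Z and C set, IRQ and FIQ masked, ARM state, svc mode.
fields = decode_cpsr(0x600000D3)
print(fields)
```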
ARM memory organization
- Linear array of bytes numbered from 0 to 2^32 - 1
- Data items
  - bytes (8 bits)
  - half-words (16 bits) – always aligned to 2-byte boundaries (start at an even byte address)
  - words (32 bits) – always aligned to 4-byte boundaries (start at a byte address which is a multiple of 4)

Example layout (bit 31 on the left, bit 0 on the right):
  byte address    contents
  23 22 21 20     (unlabelled)
  19 18 17 16     word16
  15 14 13 12     half-word14, half-word12
  11 10  9  8     word8
   7  6  5  4     byte6, half-word4
   3  2  1  0     byte3, byte2, byte1, byte0
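The alignment rules above can be sketched as a one-line check. `is_aligned` is a hypothetical helper, not part of any ARM tool; sizes are given in bytes.

```python
def is_aligned(address, size_bytes):
    """True if an item of size_bytes may start at this byte address."""
    return address % size_bytes == 0

# word16, half-word14 and byte6 from the layout are all legally placed;
# a word starting at address 6 would not be.
print(is_aligned(16, 4), is_aligned(14, 2), is_aligned(6, 1), is_aligned(6, 4))
```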
ARM instruction set
- Load-store architecture
  - operands are in GPRs
  - load/store – the only instructions that operate on memory
- Instructions
  - Data Processing – use and change only register values
  - Data Transfer – copy memory values into registers (load) or copy register values into memory (store)
  - Control Flow
    - branch
    - branch-and-link – save return address to resume the original sequence
    - trapping into system code – supervisor calls
ARM instruction set (cont'd)
- Three-address data processing instructions
- Conditional execution of every instruction
- Powerful load/store multiple register instructions
- Ability to perform a general shift operation and a general ALU operation in a single instruction that executes in a single clock cycle
- Open instruction set extension through the coprocessor instruction set, including adding new registers and data types to the programmer's model
- Very dense 16-bit compressed representation of the instruction set in the Thumb architecture
I/O system
- I/O is memory mapped
  - internal registers of peripherals (disk controllers, network interfaces, etc.) are addressable locations within the ARM's memory map and may be read and written using the load-store instructions
- Peripherals may use either the normal interrupt (IRQ) or the fast interrupt (FIQ) input
  - normally most interrupt sources share the IRQ input, while just one or two time-critical sources are connected to the FIQ input
- Some systems may include external DMA hardware to handle high-bandwidth I/O traffic
ARM exceptions
- ARM supports a range of interrupts, traps, and supervisor calls – all are grouped under the general heading of exceptions
- Handling exceptions
  - current state is saved by copying the PC into r14_exc and the CPSR into SPSR_exc (exc stands for the exception type)
  - processor operating mode is changed to the appropriate exception mode
  - PC is forced to a value between 0x00 and 0x1C, the particular value depending on the type of exception
  - the instruction at the location the PC is forced to (the vector address) usually contains a branch to the exception handler; the exception handler will use r13_exc, which is normally initialized to point to a dedicated stack in memory, to save some user registers
  - return: restore the user registers and then restore the PC and CPSR atomically
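The entry and return steps above can be sketched as a small state model. This is an illustration under assumptions: the vector addresses and the exception-to-mode mapping are the standard ARM values (the slide only gives the 0x00 to 0x1C range), the saved return address is passed in as a parameter rather than derived from the pipeline, and `enter_exception` and `return_from_exception` are hypothetical names.

```python
# Assumed standard ARM vector addresses and target modes (not on the slide).
VECTORS = {"reset": 0x00, "undefined": 0x04, "swi": 0x08, "prefetch_abort": 0x0C,
           "data_abort": 0x10, "irq": 0x18, "fiq": 0x1C}
EXC_MODE = {"reset": "svc", "undefined": "und", "swi": "svc", "prefetch_abort": "abt",
            "data_abort": "abt", "irq": "irq", "fiq": "fiq"}

def enter_exception(state, kind, return_address):
    """Save PC and CPSR into the banked r14/SPSR, switch mode, jump to the vector."""
    mode = EXC_MODE[kind]
    state["r14_" + mode] = return_address      # PC copy -> r14_exc
    state["SPSR_" + mode] = dict(state["CPSR"])  # CPSR copy -> SPSR_exc
    state["CPSR"]["mode"] = mode
    state["pc"] = VECTORS[kind]

def return_from_exception(state):
    """Restore PC and CPSR together from the current mode's r14 and SPSR."""
    mode = state["CPSR"]["mode"]
    state["pc"], state["CPSR"] = state["r14_" + mode], state["SPSR_" + mode]

state = {"pc": 0x8000, "CPSR": {"flags": "nZcv", "mode": "usr"}}
enter_exception(state, "swi", return_address=0x8004)
print(hex(state["pc"]), state["CPSR"]["mode"])
return_from_exception(state)
print(hex(state["pc"]), state["CPSR"]["mode"])
```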
ARM cross-development toolkit
- Software development
  - tools developed by ARM Limited
  - public domain tools (ARM back end for the gcc C compiler)
- Cross-development
  - tools run on a different architecture from the one for which they produce code

[Figure: tool flow. C source and C libraries -> C compiler -> .aof; asm source -> assembler -> .aof; .aof files and object libraries -> linker -> .axf -> debug (ARMsd) -> system model (ARMulator) or development board]
Outline
- ARM Architecture
- ARM Assembly Language Programming
- ARM Organization and Implementation
- ARM Instruction Set
- Architectural Support for High-level Languages
- Thumb Instruction Set
- Architectural Support for System Development
- ARM Processor Cores
- Memory Hierarchy
- Architectural Support for Operating Systems
- ARM CPU Cores
- Embedded ARM Applications
ARM Instruction Set
- Data Processing Instructions
- Data Transfer Instructions
- Control Flow Instructions
Data Processing Instructions
- Classes of data processing instructions
  - Arithmetic operations
  - Bit-wise logical operations
  - Register-movement operations
  - Comparison operations
- Operands: 32 bits wide; there are 3 ways to specify operands
  - come from registers
  - the second operand may be a constant (immediate)
  - shifted register operand
- Result: 32 bits wide, placed in a register
  - long multiply produces a 64-bit result
Data Processing Instructions (cont'd)
Arithmetic Operations
    ADD r0, r1, r2    ; r0 := r1 + r2
    ADC r0, r1, r2    ; r0 := r1 + r2 + C
    SUB r0, r1, r2    ; r0 := r1 - r2
    SBC r0, r1, r2    ; r0 := r1 - r2 + C - 1
    RSB r0, r1, r2    ; r0 := r2 - r1
    RSC r0, r1, r2    ; r0 := r2 - r1 + C - 1
Bit-wise Logical Operations
    AND r0, r1, r2    ; r0 := r1 and r2
    ORR r0, r1, r2    ; r0 := r1 or r2
    EOR r0, r1, r2    ; r0 := r1 xor r2
    BIC r0, r1, r2    ; r0 := r1 and (not r2)
Register Movement
    MOV r0, r2        ; r0 := r2
    MVN r0, r2        ; r0 := not r2
Comparison Operations
    CMP r1, r2        ; set cc on r1 - r2
    CMN r1, r2        ; set cc on r1 + r2
    TST r1, r2        ; set cc on r1 and r2
    TEQ r1, r2        ; set cc on r1 xor r2
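The register-transfer descriptions of the arithmetic, logical and move operations can be checked with a small model. This is a sketch only: `dp` is a hypothetical function, operands are plain Python integers, and results are wrapped to 32 bits; flag setting is not modelled.

```python
MASK = 0xFFFFFFFF  # results are 32 bits wide

def dp(op, rn, op2, carry=0):
    """Result of a data processing operation on first operand rn and second operand op2."""
    results = {
        "ADD": rn + op2,
        "ADC": rn + op2 + carry,
        "SUB": rn - op2,
        "SBC": rn - op2 + carry - 1,
        "RSB": op2 - rn,
        "RSC": op2 - rn + carry - 1,
        "AND": rn & op2,
        "ORR": rn | op2,
        "EOR": rn ^ op2,
        "BIC": rn & ~op2,
        "MOV": op2,      # rn is ignored
        "MVN": ~op2,     # rn is ignored
    }
    return results[op] & MASK

print(hex(dp("SUB", 1, 2)))        # wraps around to a 32-bit value
print(hex(dp("BIC", 0xFF, 0x0F)))  # clears the low four bits
```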
Data Processing Instructions (cont'd)
- Immediate operands: immediate = (0 -> 255) x 2^(2n), 0 <= n <= 12
    ADD r3, r3, #3            ; r3 := r3 + 3
    AND r8, r7, #&ff          ; r8 := r7[7:0], & for hex
- Shifted register operands
  - the second operand is subject to a shift operation before it is combined with the first operand
    ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
    ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3
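A short sketch of the immediate rule as stated on this slide: a constant is usable if it is an 8-bit value multiplied by an even power of two. `encode_immediate` is a hypothetical helper; it implements only the formula above and does not model the rotate-based encoding used in the actual instruction word, which is an assumption left out of scope here.

```python
def encode_immediate(value):
    """Return (m, n) with value == m * 2**(2*n), 0 <= m <= 255, 0 <= n <= 12, else None."""
    for n in range(13):
        m = value >> (2 * n)
        if m <= 255 and (m << (2 * n)) == value:
            return m, n
    return None

print(encode_immediate(3))      # fits directly
print(encode_immediate(0x100))  # needs a shift by 2
print(encode_immediate(0x101))  # cannot be expressed
```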
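ARM shift operations
- LSL – Logical Shift Left
- LSR – Logical Shift Right
- ASR – Arithmetic Shift Right
- ROR – Rotate Right
- RRX – Rotate Right Extended by 1 place

[Figure: effect on a 32-bit word (bits 31 to 0) of LSL #5, LSR #5, ASR #5 with a positive operand, ASR #5 with a negative operand, ROR #5, and RRX. LSL and LSR fill the vacated bits with 0s; ASR fills them with copies of the sign bit (0s for a positive operand, 1s for a negative operand); RRX rotates through the C flag]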
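The five shift types can be modelled on 32-bit integers as follows. This is a sketch for shift amounts 0 to 31; the function names are hypothetical, and `rrx` returns the new carry as a second value.

```python
MASK = 0xFFFFFFFF

def lsl(x, n):
    return (x << n) & MASK            # zeros enter from the right

def lsr(x, n):
    return (x & MASK) >> n            # zeros enter from the left

def asr(x, n):
    x &= MASK
    if x & 0x80000000:                # negative operand: fill with ones
        return ((x >> n) | (MASK << (32 - n))) & MASK
    return x >> n                     # positive operand: fill with zeros

def ror(x, n):
    x &= MASK
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK

def rrx(x, carry):
    """Rotate right by one place through the carry; returns (result, new carry)."""
    x &= MASK
    return (carry << 31) | (x >> 1), x & 1

print(hex(asr(0x80000000, 5)), hex(ror(0x1F, 5)))
```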
Setting the condition codes
- Any DPI can set the condition codes (N, Z, V, and C)
  - for all DPIs except the comparison operations a specific request must be made
  - at the assembly language level this request is indicated by adding an `S` to the opcode
  - Example: 64-bit addition, [r3:r2] := [r1:r0] + [r3:r2]
    ADDS r2, r2, r0    ; carry out to C
    ADC  r3, r3, r1    ; ... add into high word
- Arithmetic operations set all the flags (N, Z, C, and V)
- Logical and move operations set N and Z
  - they preserve V, and either preserve C when there is no shift operation or set C according to the shift operation (the bit that falls off)
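The ADDS/ADC pair in the example can be traced with a small model of the carry flag. A sketch only: `adds`, `adc` and `add64` are hypothetical helpers, and only the C flag is modelled.

```python
MASK = 0xFFFFFFFF

def adds(a, b):
    """32-bit add that also returns the carry out, as ADDS leaves it in C."""
    total = a + b
    return total & MASK, total >> 32

def adc(a, b, carry):
    """32-bit add with carry in, as ADC."""
    return (a + b + carry) & MASK

def add64(lo_a, hi_a, lo_b, hi_b):
    lo, carry = adds(lo_a, lo_b)   # ADDS r2, r2, r0
    hi = adc(hi_a, hi_b, carry)    # ADC  r3, r3, r1
    return lo, hi

# The low words overflow, so the carry is propagated into the high word.
print(add64(0xFFFFFFFF, 0, 1, 0))
```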
Multiplies
- Example (Multiply, Multiply-Accumulate)
    MUL r4, r3, r2        ; r4 := [r3 x r2]<31:0>
    MLA r4, r3, r2, r1    ; r4 := [r3 x r2 + r1]<31:0>
- Note
  - the least significant 32 bits are placed in the result register, the rest are ignored
  - an immediate second operand is not supported
  - the result register must not be the same as the first source register
  - if the `S` bit is set, V is preserved and C is rendered meaningless
- Example (r0 = r0 x 35)
    ADD r0, r0, r0, LSL #2    ; r0' = r0 x 5
    RSB r0, r0, r0, LSL #3    ; r0'' = 7 x r0'
Data transfer instructions
- Single register load and store instructions
  - transfer of a data item (byte, half-word, word) between ARM registers and memory
- Multiple register load and store instructions
  - enable transfer of large quantities of data
  - used for procedure entry and exit, to save/restore workspace registers, to copy blocks of data around memory
- Single register swap instructions
  - allow exchange between a register and memory in one instruction
  - used to implement semaphores to ensure mutual exclusion on accesses to shared data in multis
Data Transfer Instructions (cont'd)
Single register load and store
Register-indirect addressing
    LDR r0, [r1]          ; r0 := mem32[r1]
    STR r0, [r1]          ; mem32[r1] := r0
    Note: r1 keeps a word address (2 LSBs are 0)
    LDRB r0, [r1]         ; r0 := mem8[r1]
    Note: no restrictions for r1
Base+offset addressing (offset of up to 4 Kbytes)
    LDR r0, [r1, #4]      ; r0 := mem32[r1 + 4]
Auto-indexing addressing
    LDR r0, [r1, #4]!     ; r0 := mem32[r1 + 4]
                          ; r1 := r1 + 4
Post-indexed addressing
    LDR r0, [r1], #4      ; r0 := mem32[r1]
                          ; r1 := r1 + 4
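The difference between the three offset forms is only in which address is used and whether the base register is written back. A minimal sketch, assuming registers and word-addressed memory are held in Python dictionaries; the function names are hypothetical.

```python
def ldr_offset(regs, mem, rd, rn, offset):
    """LDR rd, [rn, #offset]: base register unchanged."""
    regs[rd] = mem[regs[rn] + offset]

def ldr_auto_index(regs, mem, rd, rn, offset):
    """LDR rd, [rn, #offset]!: load from the new address, then write it back to rn."""
    address = regs[rn] + offset
    regs[rd] = mem[address]
    regs[rn] = address

def ldr_post_index(regs, mem, rd, rn, offset):
    """LDR rd, [rn], #offset: load from the old address, then update rn."""
    regs[rd] = mem[regs[rn]]
    regs[rn] = regs[rn] + offset

mem = {0x1000: 11, 0x1004: 22}
regs = {"r0": 0, "r1": 0x1000}
ldr_post_index(regs, mem, "r0", "r1", 4)
print(regs)  # r0 holds the word at 0x1000, r1 has advanced to 0x1004
```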
Data Transfer Instructions (cont'd)
COPY:   ADR r1, TABLE1      ; r1 points to TABLE1
        ADR r2, TABLE2      ; r2 points to TABLE2
LOOP:   LDR r0, [r1]
        STR r0, [r2]
        ADD r1, r1, #4
        ADD r2, r2, #4
        ...
TABLE1: ...
TABLE2: ...

COPY:   ADR r1, TABLE1      ; r1 points to TABLE1
        ADR r2, TABLE2      ; r2 points to TABLE2
LOOP:   LDR r0, [r1], #4
        STR r0, [r2], #4
        ...
TABLE1: ...
TABLE2: ...
Data Transfer Instructions
Multiple register data transfers
    LDMIA r1, {r0, r2, r5}    ; r0 := mem32[r1]
                              ; r2 := mem32[r1 + 4]
                              ; r5 := mem32[r1 + 8]
Note: any subset (or all) of the registers may be transferred with a single instruction
Note: the order of registers within the list is insignificant
Note: including r15 in the list will cause a change in the control flow
- Stack organizations
  - FA – full ascending
  - EA – empty ascending
  - FD – full descending
  - ED – empty descending
- Block copy view
  - data is to be stored above or below the address held in the base register
  - address incrementing or decrementing begins before or after storing the first value
Multiple register transfer addressing modes
[Figure: memory between addresses 1000 (hex) and 1018 (hex) after each of STMIA r9!, {r0,r1,r5}; STMIB r9!, {r0,r1,r5}; STMDA r9!, {r0,r1,r5}; STMDB r9!, {r0,r1,r5}. In each panel r9 starts at 100c (hex), r9' marks the updated base, and the stored values r0, r1, r5 occupy consecutive words]
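The four store-multiple variants differ only in where the block starts and where the base ends up. A sketch of that address arithmetic, assuming write-back (`!`) and values listed in ascending register-number order (the lowest register always goes to the lowest address); `stm` is a hypothetical helper.

```python
def stm(mode, base, values):
    """Return ({address: value}, updated_base) for STM<mode> base!, {values}."""
    size = 4 * len(values)
    if mode == "IA":      # increment after
        start, new_base = base, base + size
    elif mode == "IB":    # increment before
        start, new_base = base + 4, base + size
    elif mode == "DA":    # decrement after
        start, new_base = base - size + 4, base - size
    elif mode == "DB":    # decrement before
        start, new_base = base - size, base - size
    else:
        raise ValueError("unknown addressing mode: " + mode)
    memory = {start + 4 * i: v for i, v in enumerate(values)}
    return memory, new_base

for mode in ("IA", "IB", "DA", "DB"):
    memory, r9_new = stm(mode, 0x100C, ["r0", "r1", "r5"])
    print(mode, {hex(a): v for a, v in memory.items()}, hex(r9_new))
```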
The mapping between the stack and block copy views

                        Ascending                  Descending
                        Full         Empty         Full         Empty
  Increment  Before     STMIB/STMFA                             LDMIB/LDMED
             After                   STMIA/STMEA   LDMIA/LDMFD
  Decrement  Before                  LDMDB/LDMEA   STMDB/STMFD
             After      LDMDA/LDMFA                             STMDA/STMED
Control flow instructions

  Branch  Interpretation     Normal uses
  B       Unconditional      Always take this branch
  BAL     Always             Always take this branch
  BEQ     Equal              Comparison equal or zero result
  BNE     Not equal          Comparison not equal or non-zero result
  BPL     Plus               Result positive or zero
  BMI     Minus              Result minus or negative
  BCC     Carry clear        Arithmetic operation did not give carry-out
  BLO     Lower              Unsigned comparison gave lower
  BCS     Carry set          Arithmetic operation gave carry-out
  BHS     Higher or same     Unsigned comparison gave higher or same
  BVC     Overflow clear     Signed integer operation; no overflow occurred
  BVS     Overflow set       Signed integer operation; overflow occurred
  BGT     Greater than       Signed integer comparison gave greater than
  BGE     Greater or equal   Signed integer comparison gave greater or equal
  BLT     Less than          Signed integer comparison gave less than
  BLE     Less or equal      Signed integer comparison gave less than or equal
  BHI     Higher             Unsigned comparison gave higher
  BLS     Lower or same      Unsigned comparison gave lower or same
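Each branch condition is a test on the N, Z, C and V flags. The sketch below computes the flags a CMP would set and evaluates a condition against them; the flag expressions are the standard ARM definitions, which are assumed here since the slide lists only the interpretations. `cmp_flags` and `branch_taken` are hypothetical names.

```python
MASK = 0xFFFFFFFF

def cmp_flags(a, b):
    """Flags (N, Z, C, V) set by CMP a, b, i.e. by computing a - b on 32-bit values."""
    result = (a - b) & MASK
    n = result >> 31
    z = int(result == 0)
    c = int((a & MASK) >= (b & MASK))          # C set means no borrow
    v = (((a ^ b) & (a ^ result)) >> 31) & 1   # signed overflow
    return n, z, c, v

CONDITIONS = {
    "AL": lambda n, z, c, v: True,
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "PL": lambda n, z, c, v: n == 0,
    "MI": lambda n, z, c, v: n == 1,
    "CC": lambda n, z, c, v: c == 0,
    "LO": lambda n, z, c, v: c == 0,
    "CS": lambda n, z, c, v: c == 1,
    "HS": lambda n, z, c, v: c == 1,
    "VC": lambda n, z, c, v: v == 0,
    "VS": lambda n, z, c, v: v == 1,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "HI": lambda n, z, c, v: c == 1 and z == 0,
    "LS": lambda n, z, c, v: c == 0 or z == 1,
}

def branch_taken(cond, flags):
    return CONDITIONS[cond](*flags)

# 0xFFFFFFFF is higher than 1 as an unsigned value but less than 1 as a signed value (-1).
flags = cmp_flags(0xFFFFFFFF, 1)
print(branch_taken("HI", flags), branch_taken("LT", flags))
```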
Conditional execution
- Conditional execution is used to avoid branch instructions that skip a small number of non-branch instructions
- Example
        CMP r0, #5          ; if (r0 != 5) {
        BEQ BYPASS          ;
        ADD r1, r1, r0      ;   r1 := r1 + r0 - r2
        SUB r1, r1, r2      ; }
BYPASS: ...
With conditional execution
        CMP r0, #5
        ADDNE r1, r1, r0
        SUBNE r1, r1, r2
        ...
Note: add the 2-letter condition after the 3-letter opcode
        ; if ((a==b) && (c==d)) e++;
        CMP r0, r1
        CMPEQ r2, r3
        ADDEQ r4, r4, #1
Branch and link instructions
- Branch to subroutine (r14 serves as a link register)
        BL SUBR             ; branch to SUBR
        ..                  ; return here
SUBR:   ..                  ; SUBR entry point
        MOV pc, r14         ; return
- Nested subroutines
        BL SUB1
        ..
SUB1:   STMFD r13!, {r0-r2,r14}    ; save work and link register
        BL SUB2
        ..
        LDMFD r13!, {r0-r2,pc}
SUB2:   ..
        MOV pc, r14         ; copy r14 into r15
Supervisor calls
- The supervisor is a program which operates at a privileged level – it can do things that a user-level program cannot do directly
  - Example: send text to the display
- The ARM ISA includes SWI (SoftWare Interrupt)
        ; output r0[7:0]
        SWI SWI_WriteC
        ; return from a user program back to monitor
        SWI SWI_Exit
Jump tables
- Call one of a set of subroutines depending on a value computed by the program
        BL JTAB
        ...
JTAB:   CMP r0, #0
        BEQ SUB0
        CMP r0, #1
        BEQ SUB1
        CMP r0, #2
        BEQ SUB2
Note: slow when the list is long and all subroutines are equally frequent

        BL JTAB
        ...
JTAB:   ADR r1, SUBTAB
        CMP r0, #SUBMAX     ; overrun?
        LDRLS pc, [r1, r0, LSL #2]
        B ERROR
SUBTAB: DCD SUB0
        DCD SUB1
        DCD SUB2
        ...
Hello ARM World!
        AREA HelloW, CODE, READONLY    ; declare code area
SWI_WriteC EQU &0                      ; output character in r0
SWI_Exit   EQU &11                     ; finish program
        ENTRY                          ; code entry point
START:  ADR r1, TEXT                   ; r1 <- Hello ARM World!
LOOP:   LDRB r0, [r1], #1              ; get the next byte
        CMP r0, #0                     ; check for text end
        SWINE SWI_WriteC               ; if not end of string, print
        BNE LOOP
        SWI SWI_Exit                   ; end of execution
TEXT    = "Hello ARM World!", &0a, &0d, 0
        END
ARM Organization and Implementation
ARM organization
- Register file
  - 2 read ports, 1 write port + 1 read, 1 write port reserved for r15 (pc)
- Barrel shifter – shifts or rotates one operand by any number of bits
- ALU – performs the arithmetic and logic functions required
- Memory address register + incrementer
- Memory data registers
- Instruction decoder and associated control logic

[Figure: ARM organization block diagram: address register (A[31:0]) and incrementer, register bank with PC, multiply register, barrel shifter, ALU, data out register and data in register (D[31:0]), instruction decode & control, connected by the A bus, B bus and ALU bus]
Three-stage pipeline
- Fetch
  - the instruction is fetched from memory and placed in the instruction pipeline
- Decode
  - the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
- Execute
  - the instruction owns the datapath; the register bank is read, an operand shifted, the ALU result generated and written back into a destination register
ARM single-cycle instruction pipeline

  instruction        time ->
  1  add r0,r1,#5    fetch    decode   execute
  2  sub r2,r3,r6             fetch    decode   execute
  3  cmp r2,#3                         fetch    decode   execute
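The staggered timing for single-cycle instructions can be generated mechanically: instruction i enters fetch in cycle i and each later stage follows one cycle behind. A sketch assuming no stalls and one instruction issued per cycle; `pipeline_schedule` is a hypothetical helper.

```python
STAGES = ["fetch", "decode", "execute"]

def pipeline_schedule(instructions):
    """Map each instruction to the clock cycle (from 1) in which it occupies each stage."""
    return {
        ins: {stage: i + 1 + s for s, stage in enumerate(STAGES)}
        for i, ins in enumerate(instructions)
    }

schedule = pipeline_schedule(["add r0,r1,#5", "sub r2,r3,r6", "cmp r2,#3"])
for ins, cycles in schedule.items():
    print(ins, cycles)
```

With three instructions the last execute falls in cycle 5, so n single-cycle instructions complete in n + 2 cycles once the pipeline is counted from the first fetch.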
ARM multi-cycle instruction pipeline
- Decode logic is always generating the control signals for the datapath to use in the next cycle

[Figure: pipeline timing for the instruction sequence ADD, STR, ADD, ADD, ADD. Each ADD passes through fetch, decode and execute; the STR passes through fetch, decode, calc. addr. and data xfer, occupying the datapath for two cycles and delaying the instructions behind it]
ARM multi-cycle LDMIA (load multiple) instruction

  instruction          time ->
  ldmia r0,{r2,r3}     fetch    decode   ex ld r2   ex ld r3
  sub r2,r3,r6                  fetch    (waits)    decode     ex sub
  cmp r2,#3                                         fetch      decode   ex cmp

- Decode stage occupied: ldmia must continue to remember the decoded instruction
- Instruction delayed: sub is fetched at the normal time but not decoded until LDMIA is finishing
Control stalls: due to branches
- Branches often introduce stalls (branch penalty)
  - Stall time may depend on whether the branch is taken
- May have to squash instructions that already started executing
- Don't know what to fetch until the condition is evaluated
ARM pipelined branch
- Decision not made until the third clock cycle
- Two cycles of work thrown away if bne is taken

  instruction            time ->
  bne foo                fetch    decode   ex bne   ex bne   ex bne
  sub r2,r3,r6                    fetch    decode   (thrown away)
  foo add r0,r1,r2                                  fetch    decode   ex add
Pipeline: how it works
- All instructions occupy the datapath for one or more adjacent cycles
- For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
- During the first datapath cycle each instruction issues a fetch for the next instruction but one
- Branch instructions flush and refill the instruction pipeline
ARM9TDMI 5-stage pipeline
- Fetch
- Decode
  - instruction is decoded
  - register operands read (3 read ports)
- Execute
  - an operand is shifted and the ALU result generated, or
  - address is computed
- Buffer/data
  - data memory is accessed (load, store)
- Write-back
  - write to register file

[Figure: ARM9TDMI 5-stage pipeline organization. Fetch: next pc, +4, I-cache. Decode: instruction decode, register read (r15 = pc + 8), immediate fields. Execute: mul, shift, ALU, forwarding paths, LDM/STM post-index and pre-index address generation. Buffer/data: load/store address, D-cache, byte repl., rot/sgn ex. Write-back: register write. PC updates arrive from B, BL, MOV pc, SUBS pc (execute) and LDR pc (write-back)]
ARM9TDMI Data Forwarding
- Data forwarding: result needed by the next instruction
        ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
        ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3
- Data forwarding: result needed by the next instruction but one
        ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
        ADD r8, r9, r10           ; r8 := r9 + r10
        ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3
- Stall?
        LDR r3, [r2]              ; r3 := mem[r2]
        ADD r1, r2, r3            ; r1 := r2 + r3

[Figure: the ARM9TDMI 5-stage pipeline organization, highlighting the forwarding paths into the execute stage]
ARM9TDMI PC generation
- 3-stage pipeline
  - PC behavior: operands are read in the execution stage, r15 = PC + 8
- 5-stage pipeline
  - operands are read in the decode stage and r15 = PC + 4?
  - incompatibilities between 3-stage and 5-stage implementations => unacceptable
  - to avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs

[Figure: the ARM9TDMI 5-stage pipeline organization, highlighting the pc + 4 and pc + 8 paths from the fetch stage to r15 in the register read stage]
Data processing instruction datapath activity (Ex)
- Reg-Reg
  - Rd = Rn op Rm
  - r15 = AR + 4
  - AR = AR + 4
- Reg-Imm
  - Rd = Rn op Imm
  - r15 = AR + 4
  - AR = AR + 4

[Figure: datapath activity for (a) register – register operations and (b) register – immediate operations. In both, the address register is incremented into the PC, and the shifter and ALU operate as the instruction specifies; in (b) the immediate comes from bits [7:0] of the instruction pipe]
STR (store register) datapath activity (Ex1, Ex2)
- Compute address (Ex1)
  - AR = Rn op Disp
  - r15 = AR + 4
- Store data (Ex2)
  - AR = PC
  - mem[AR] = Rd<x:y>
  - if autoindexing => Rn = Rn +/- 4

[Figure: datapath activity for (a) 1st cycle – compute address: the displacement from instruction bits [11:0] passes through the shifter (lsl #0) and the ALU computes A + B or A - B; (b) 2nd cycle – store data & auto-index: Rd is driven to data out (byte?) and the ALU computes A, A + B or A - B]
The first two (of three) cycles of a branch instruction
- Compute target address (1st cycle)
  - AR = PC + Disp, lsl #2
- Save return address, if required (2nd cycle)
  - r14 = PC
  - AR = AR + 4
- Third cycle: do a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch?

[Figure: datapath activity for (a) 1st cycle – compute branch target: the displacement from instruction bits [23:0] is shifted lsl #2 and the ALU computes A + B; (b) 2nd cycle – save return address: the ALU passes A through to r14]
ARM Implementation
- Datapath
  - RTL (Register Transfer Level)
- Control unit
  - FSM (Finite State Machine)
2-phase non-overlapping clock scheme
- Most ARMs do not operate on edge-sensitive registers
- Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
- Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2

[Figure: phase 1 and phase 2 clock waveforms over 1 clock cycle]
ARM datapath timing
- Register read
  - Register read buses – dynamic, precharged during phase 2
  - During phase 1 selected registers discharge the read buses, which become valid early in phase 1
- Shift operation
  - second operand passes through the barrel shifter
- ALU operation
  - ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
  - ALU processes the operands during phase 2, producing the valid output towards the end of the phase
  - the result is latched in the destination register at the end of phase 2
ARM datapath timing (cont'd)
[Figure: datapath timing over phase 1 and phase 2: register read time (read bus valid), shift time (shift out valid), ALU operands latched, ALU time (ALU out), register write time; the phase 2 precharge invalidates the buses]

Minimum Datapath Delay =
    Register read time +
    Shifter delay +
    ALU delay +
    Register write set-up time +
    Phase 2 to phase 1 non-overlap time
The original ARM1 ripple-carry adder
- Carry logic: uses a CMOS AOI (And-Or-Invert) gate
- Even bits use the circuit shown below
- Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
- Worst case path: 32 gates long

[Figure: one bit of the ripple-carry adder circuit, with inputs A, B and Cin and outputs sum and Cout]
ARM2 4-bit carry look-ahead scheme
- Carry Generate (G), Carry Propagate (P)
- Cout[3] = Cin[0].P + G
- Use AOI and alternate AND/OR gates
- Worst case: 8 gates long

[Figure: 4-bit adder logic block with inputs A[3:0], B[3:0] and Cin[0] and outputs sum[3:0], G, P and Cout[3]]
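The group carry equation Cout[3] = Cin[0].P + G can be checked against ordinary addition. A behavioural sketch only, not the gate-level AOI circuit: per-bit propagate is taken as a OR b (an assumption; a XOR b would also work for the carry), and the function names are hypothetical.

```python
def group_generate_propagate(a, b, width=4):
    """Group G (carry out with carry-in 0) and group P (carry-in passes straight through)."""
    g_group, p_group = 0, 1
    for i in range(width):               # from the least significant bit upwards
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = ai & bi, ai | bi          # per-bit generate and propagate
        g_group = g | (p & g_group)
        p_group &= p
    return g_group, p_group

def group_carry_out(a, b, cin):
    """Cout[3] = Cin[0].P + G for one 4-bit group."""
    g, p = group_generate_propagate(a, b)
    return g | (p & cin)

print(group_carry_out(0b1111, 0b0000, 1))  # carry-in propagates through all four bits
print(group_carry_out(0b1000, 0b1000, 0))  # carry is generated in the top bit
```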
The ARM2 ALU logic for one result bit
- ALU functions
  - data operations (add, sub, ...)
  - address computations for memory accesses
  - branch target computations
  - bit-wise logical operations
  - ...

[Figure: logic for one ALU result bit: inputs from the NA bus and NB bus, function select lines fs0 to fs5, carry logic with G and P, output to the ALU bus]
ARM2 ALU function codes

  fs5  fs4  fs3  fs2  fs1  fs0   ALU output
   0    0    0    1    0    0    A and B
   0    0    1    0    0    0    A and not B
   0    0    1    0    0    1    A xor B
   0    1    1    0    0    1    A plus not B plus carry
   0    1    0    1    1    0    A plus B plus carry
   1    1    0    1    1    0    not A plus B plus carry
   0    0    0    0    0    0    A
   0    0    0    0    0    1    A or B
   0    0    0    1    0    1    B
   0    0    1    0    1    0    not B
   0    0    1    1    0    0    zero
The ARM6 carry-select adder scheme
- Compute sums of various fields of the word for a carry-in of zero and a carry-in of one
- The final result is selected by using the correct carry-in value to control a multiplexor
- Worst case: O(log2[word width]) gates long
- Note: Be careful! Fan-out on some of these gates is high, so direct comparison with previous schemes is not applicable.

[Figure: carry-select adder: adders on fields a,b[3:0] up to a,b[31:28] each produce s and s+1 (sum for carry-in 0 and carry-in 1); multiplexors select sum[3:0], sum[7:4], sum[15:8] and sum[31:16]]
The ARM6 ALU organization
- Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexor selects the output

[Figure: ARM6 ALU organization: A operand latch and B operand latch, each followed by XOR gates (invert A, invert B); logic functions unit and adder (with C in) in parallel; function and logic/arithmetic controls; result mux producing the result, with C and V from the adder, N from the result, and zero detect producing Z]
ARM9 carry arbitration encoding
- Carry arbitration adder (u = carry not yet known)

  One-bit encoding:
    ai    bi   | Ci | vi, wi
    0     0    | 0  | 0, 0
    1     1    | 1  | 1, 1
    1     0    | u  | 1, 0
    0     1    | u  | 1, 0

  Two-bit arbitration:
    ai    bi    ai-1  bi-1 | Ci | vi, wi
    0     0     -     -    | 0  | 0, 0
    1     1     -     -    | 1  | 1, 1
    0(1)  1(0)  0     0    | 0  | 0, 0
    0(1)  1(0)  1     1    | 1  | 1, 1
    0(1)  1(0)  0(1)  1(0) | u  | 1, 0

  vi = ai OR bi
  wi = ai AND bi
The cross-bar switch barrel shifter
- Shifter delay is critical since it contributes directly to the datapath cycle time
- Cross-bar switch matrix (32 x 32)
- Principle for a 4 x 4 matrix

[Figure: 4 x 4 cross-bar switch matrix connecting inputs in[3] to in[0] to outputs out[0] to out[3]; each diagonal of switches implements one shift: right 3, right 2, right 1, no shift, left 1, left 2, left 3]
The cross-bar switch barrel shifter (cont'd)
- Precharged logic is used => each switch is a single NMOS transistor
- Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
- For rotate right, the right shift diagonal is enabled + the complementary shift left diagonal (e.g., 'right 1' + 'left 3')
- Arithmetic shift right: uses sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
Multiplier design
- All ARMs apart from the first prototype have included support for integer multiplication
  - older ARM cores include low-cost multiplication hardware that supports only the 32-bit result multiply and multiply-accumulate
  - recent ARM cores have high-performance multiplication hardware and support 64-bit result multiply and multiply-accumulate
- Low cost implementation
  - Use the datapath iteratively, employing the barrel shifter and ALU to generate a 2-bit product in each clock cycle
  - use early termination to stop the iterations when there are no more ones in the multiply register
The 2-bit multiplication algorithm, Nth cycle
- Control settings for the Nth cycle of the multiplication
- Use existing shifter and ALU + additional hardware
  - dedicated two-bits-per-cycle shift register for the multiplier and a few gates for the Booth's algorithm control logic (overhead is a few per cent on the area of the ARM core)

  Carry-in   Multiplier   Shift            ALU      Carry-out
  0          x0           LSL #2N          A + 0    0
  0          x1           LSL #2N          A + B    0
  0          x2           LSL #(2N + 1)    A - B    1
  0          x3           LSL #2N          A - B    1
  1          x0           LSL #2N          A + B    0
  1          x1           LSL #(2N + 1)    A + B    0
  1          x2           LSL #2N          A - B    1
  1          x3           LSL #2N          A + 0    1
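The control table for the Nth cycle can be turned directly into a working multiplier model. This is a sketch under assumptions: it always runs all 16 cycles of a 32-bit multiplier (early termination is left out), keeps only the low 32 bits of the result as MUL does, and uses a hypothetical `booth_multiply` name; the `BOOTH` entries are the rows of the table, written as (extra shift, ALU sign, carry-out).

```python
MASK = 0xFFFFFFFF

# (carry-in, multiplier bits) -> (extra shift beyond 2N, sign of B in the ALU op, carry-out)
BOOTH = {
    (0, 0): (0, 0, 0),   # A + 0
    (0, 1): (0, +1, 0),  # A + B, LSL #2N
    (0, 2): (1, -1, 1),  # A - B, LSL #(2N + 1)
    (0, 3): (0, -1, 1),  # A - B, LSL #2N
    (1, 0): (0, +1, 0),  # A + B, LSL #2N
    (1, 1): (1, +1, 0),  # A + B, LSL #(2N + 1)
    (1, 2): (0, -1, 1),  # A - B, LSL #2N
    (1, 3): (0, 0, 1),   # A + 0
}

def booth_multiply(multiplicand, multiplier):
    """Low 32 bits of multiplicand x multiplier, two multiplier bits per cycle."""
    acc, carry = 0, 0
    for n in range(16):
        bits = (multiplier >> (2 * n)) & 3
        extra, sign, carry = BOOTH[(carry, bits)]
        acc = (acc + sign * (multiplicand << (2 * n + extra))) & MASK
    return acc

print(booth_multiply(7, 5))
```

The final carry-out has weight 2^32, so dropping it is consistent with keeping only the least significant 32 bits of the product.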
High speed multiplication
- Where multiplication performance is very important, more hardware resources must be dedicated
  - in some embedded systems the ARM core is used to perform real-time digital signal processing (DSP) – DSP programs are typically multiplication intensive
- Use intermediate results which include partial sums and partial carries
  - Carry-save adders are used for this
- These two binary results are added together at the end of the multiplication
  - The main ALU is used for this
Carry-propagate (a) and carry-save (b) adder structures
- A carry-propagate adder takes two conventional (irredundant) binary numbers as inputs and produces a binary sum
- A carry-save adder takes one binary and one redundant (partial sum and partial carry) input and produces a sum in redundant binary representation (sum and carry)

[Figure: (a) carry-propagate adder: a chain of full adders (inputs A, B, Cin; outputs Cout, S) with each Cout feeding the next Cin; (b) carry-save adder: a row of independent full adders whose S and Cout outputs form the partial sum and partial carry]
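A carry-save step can be written with bit-wise operations on whole words: every bit position is an independent full adder, and the carries are kept as a second word instead of rippling. A minimal sketch on unbounded Python integers; `carry_save_add` is a hypothetical name.

```python
def carry_save_add(a, partial_sum, partial_carry):
    """One carry-save layer: returns a new (partial_sum, partial_carry) pair."""
    s = a ^ partial_sum ^ partial_carry
    c = ((a & partial_sum) | (a & partial_carry) | (partial_sum & partial_carry)) << 1
    return s, c

# Accumulate several values without propagating carries, then do one
# carry-propagate addition at the end (the job given to the main ALU).
ps, pc = 0, 0
for value in (13, 200, 7, 1024):
    ps, pc = carry_save_add(value, ps, pc)
print(ps + pc)
```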
ARM high-speed multiplier organization
- CSA has 4 layers of adders, each handling 2 multiplier bits => multiplies 8 bits per clock cycle
- Partial sum and carry are cleared at the beginning or initialized to accumulate a value
- Multiplier is shifted right 8 bits per cycle in the 'Rs' register
- Partial sum and carry are rotated right 8 bits per cycle
- Performance: up to 4 clock cycles (early termination is possible)
- Complexity: 160 bits in shift registers, 128 bits of carry-save adder logic (up to 10% of simpler cores)
ARM high-speed multiplier organization
[Figure: high-speed multiplier datapath: registers Rm and Rs (Rs shifted right 8 bits/cycle); initialization for MLA; carry-save adders producing the partial sum and partial carry, which are rotated 8 bits/cycle; ALU (add partials)]
ARM2 register cell circuit
[Figure: register cell with a write line from the ALU bus and two read lines (read A, read B) driving the A bus and B bus]
ARM register bank floorplan
[Figure: register bank floorplan: array of register cells with A bus read decoders, B bus read decoders and write decoders; Vdd and Vss rails; ALU bus, PC bus and INC bus connections; PC cells; A bus and B bus outputs]
ARM core datapath buses
[Figure: datapath buses: address register and incrementer (Ad and inc buses, PC), register bank, multiplier, shifter (shift out), ALU, instruction pipe, data in and data out; buses labelled A, B, W, and Din]
ARM control logic structure
[Figure: control logic structure: instruction and coprocessor inputs feed the decode PLA and cycle count; outputs are address control, register control, ALU control, multiply control, load/store multiple and shifter control]