Download Lecture 2 - 中央研究院資訊科學研究所

Document related concepts
Transcript
嵌入式處理器架構與程式設計
王建民
中央研究院 資訊所
2008年 7月
Contents
Introduction
 Computer Architecture
 ARM Architecture
 Development Tools
 GNU Development Tools
 ARM Instruction Set
 ARM Assembly Language
 ARM Assembly Programming
 GNU ARM ToolChain
 Interrupts and Monitor

2
Lecture 2
Computer Architecture
Outline



Basic Concepts
Instruction Set Architecture
Machine Organization
4
What is “Computer Architecture”?
Application Software
Software
Programming System
Operating System
Processor
Hardware
Memory
Instruction Set Architecture
I/O System
Circuits
Devices
5
What is “Computer Architecture”?

Instruction Set Architecture (ISA)




Interface between hardware and software
The true language of a machine
The hardware’s specification; defines what a
machine does
Computer Organization


The guts of the machine; how the hardware
works?
The implementation; must obey the ISA
abstraction
6
Machine Organization
Computer
Processor
(CPU)
(active)
Control
(“brain”)
Datapath
(“brawn”)
Memory
(passive)
(where
programs,
& data
live when
running)
Devices
Input
Output
Keyboard,
Mouse
Disk
(where
programs,
& data
live when
not running)
Display,
Printer
7
Stored Program Computer



1944: The First Electronic Computer ENIAC at
IAS, Princeton Univ. (18,000 vacuum tubes)
Stored-Program Concept – Storing programs as
numbers – by John von Neumann – Eckert and
Mauchly worked in engineering the concept.
Idea: A program is written as a sequence of
instructions, represented by binary numbers. The
instructions are stored in the memory just as data.
They are read one by one, decoded and then
executed by the CPU.
8
Execution Cycle
Instruction
Fetch
Obtain instruction from program storage
Instruction
Decode
Determine required actions and instruction size
Operand
Fetch
Locate and obtain operand data
Execute
Compute result value or status
Result
Store
Next
Instruction
Deposit results in storage for later use
Determine successor instruction
9
The Instruction Set
The actual programmer visible instruction set
software
instruction set
hardware
10
Instruction-Set Processor Design1

Architecture (ISA)
programmer/compiler view


“functional appearance to its immediate
user/system programmer”
Opcodes, addressing modes, architected
registers, IEEE floating point
11
Instruction-Set Processor Design2

Implementation (µarchitecture)
processor designer/view


“logical structure or organization that performs
the architecture”
Pipelining, functional units, caches, physical
registers
12
Instruction-Set Processor Design3

Realization (chip)
chip/system designer view


“physical structure that embodies the
implementation”
Gates, cells, transistors, wires
13
Outline



Basic Concepts
Instruction Set Architecture
Machine Organization
14
Levels of Abstraction
temp = v[k];
High Level Language
Program (e.g., C)
v[k] = v[k+1];
v[k+1] = temp;
Compiler
lw
lw
sw
sw
Assembly Language
Program (e.g., MIPS)
Assembler
Machine Language
Program (MIPS)
0000
1010
1100
0101
1001
1111
0110
1000
$15,
$16,
$16,
$15,
1100
0101
1010
0000
0110
1000
1111
1001
0($2)
4($2)
0($2)
4($2)
1010
0000
0101
1100
1111
1001
1000
0110
0101
1100
0000
1010
1000
0110
1001
1111
Machine Interpretation
Datapath Transfer
Specification
°
°
IR <- Imem[PC]; PC <- PC + 4
15
Recall in C language


Operators: +, -, *, /, %
Operands:



Variables
Constants
Assignment statement:
variable = expression

Expressions consist of operators operating on
operands
16
When Translating to Assembly
a = b + 5;
load
load
add
store
$r1, M[b]
$r2, 5
$r3, $r1, $r2
$r3, M[a]
Statement
Constant
Operands
Memory
Register
Operator
17
Components of an ISA

Organization of programmable storage




Data Types


Encoding and representation
Instruction Format


Registers
Memory
Addressing modes
How are instructions specified?
Instruction Set

What operations can be performed?
18
Basic ISA Classes1

Accumulator (only one register)
1 address: Add A

Stack
0 address: Add

; acc ← acc + mem[A]
; tos ← tos + next
General Purpose Register
2 address: Add A, B
; EA(A) ← EA(A) + EA(B)
3 address: Add A, B, C ; EA(A) ← EA(B) + EA(C)
19
Basic ISA Classes2

Load/Store
Only load/store instructions can access memory
Load Ra, Rb
; Ra ← mem[Rb]
Store Ra, Rb
; mem[Rb] ← Ra


Memory to Memory
All operands and destinations can be memory addresses
Add A, B, C
; mem[A] ← mem[B] + mem[C]

20
Comparison of Four ISA Classes


Code sequence for C = A+ B
Stack
Accumulator
Register
(reg-mem)
Register
(load-store)
Push A
Push B
Add
Pop C
Load A
Add B
Store C
Load R1,A
Add R1,B
Store R1,C
Load R1,A
Load R2,B
Add R3,R1,R2
Store R3,C
Comparison: Bytes per instruction? Number of
instructions? Cycles per instructions?
21
CISC vs. RISC

CISC (Complex Instruction Set Computer)





May have memory-memory instructions
Variable instruction length
Relatively fewer registers
Complex addressing modes
RISC (Reduced Instruction Set Computer)




Have only load-store instructions
Uniform instruction format
Identical general-purpose registers
Simple addressing modes
22
General Purpose Registers Dominate

Advantages of registers


Registers are faster than memory
Registers are easier for a compiler to use


E.g., as a place for temporary storage
Registers can hold variables


Memory traffic is reduced (since registers are faster than
memory)
Code density is improved (since register named with fewer bits
than memory location)
23
MIPS Registers as an Example

32 registers, each is 32 bits wide







Groups of 32 bits called a word in MIPS
Registers are numbered from 0 to 31
Each can be referred to by number or name
Number references:
$0, $1, $2, … $30, $31
By convention, each register also has a name to make it
easier to code, e.g.,
$16 - $23 
$s0 - $s7 (C variables)
$8 - $15 
$t0 - $t7 (temporary)
32 x 32-bit FP registers (paired DP)
Others: HI, LO, PC
24
Memory Addressing


Since 1980 almost every machine uses addresses
to level of 8-bits (byte)
2 questions for the design of ISA


Read a 32-bit word as four loads of bytes from
sequential byte addresses or as one load word from a
single byte address?
Can a word be place on any byte-boundary?
25
Memory Organization



Viewed as a large single dimension array, with an
address
A memory address is an index into the array
“Byte addressing” means that the index points to a
byte of memory
8 bits of data
0
1
2
3
4
5
6
...
8 bits of data
8 bits of data
8 bits of data
8 bits of data
8 bits of data
8 bits of data
26
Word Addressing


Every word in memory has an address, similar to
an index in an array
Early computers numbered words like C numbers
elements of an array:

Memory[0], Memory[1], Memory[2], …
Called the “address” of a word

Today machines address memory as bytes, hence
word addresses differ by 4


Memory[0], Memory[4], Memory[8], …
Computers needed to access 8-bit bytes as well as
words (4 bytes/word)
27
Alignment

An ISA may require that all words start at
addresses that are multiples of 4 bytes
(called alignment)
0
1
2
3
Aligned
Not
Aligned
28
Endianess

Big Endian: address of most significant byte = word
address (xx00 = Big End of word)
 IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA

Little Endian: address of least significant byte = word
address (xx00 = Little End of word)
 Intel 80x86, DEC Vax, DEC Alpha (Windows NT)
3
2
1
0
msb
0
big endian byte 0
little endian byte 0
lsb
1
2
3
29
Addressing Modes
Addressing Mode
Register
Immediate
Displacement
Register indirect
Indexed/Base
Direct or absolute
Memory indirect
Auto-increment
Auto-decrement
Scaled
Example
Add R4,R3
Add R4,#3
Add R4,100(R1)
Add R4,(R1)
Add R4,(R1+R2)
Add R4,(1000)
Add R4,@(R3)
Add R1,(R2)+
Meaning
R4←R4+R3
R4←R4+3
R4←R4+mem[100+R1]
R4←R4+mem[R1]
R4←R4+mem[R1+R2]
R4←R4+mem[1000]
R4←R4+mem[mem[R3]]
R1←R1+mem[R2];
R2←R2+d;
Add R1,-(R2)
R2←R2-d;
R1←R1+mem[R2];
Add R4,100(R1)[R2]
R4←R4+mem[100+R1+R2*d]
30
Addressing Mode Usage

3 programs measured on machine with all address
modes (VAX)








Displacement:
Immediate:
Register deferred (indirect):
Scaled:
Memory indirect:
Misc:
42% avg, 32% to 55%
33% avg, 17% to 43%
13% avg, 3% to 24%
7% avg, 0% to 16%
3% avg, 1% to 6%
2% avg, 0% to 3%
88% displacement, immediate & register indirect
Immediate Size:


50% to 60% fit within 8 bits
75% to 80% fit within 16 bits
31
Instruction Formats1
Variable:
…
…
Fixed:
Hybrid:
32
Instruction Formats2

If code size is most important, use variable length
instructions:




If performance is most important, use fixed length
instructions




Difficult control design to compute next address
Complex operations, so use microprogramming
Slow due to several memory accesses
Simple to decode, so use hardware
Wastes code space because of simple operations
Works well with pipelining
Recent embedded machines added optional mode
to execute subset of 16-bit wide instructions
33
Typical Operations
Data Movement
Arithmetic
Shift
Logic
Control (Jump/Branch)
Subroutine Linkage
Interrupt
Synchronization
String
Graphis (MMX)
register-register movement
memory-memory movement
load/store, in/out, push/pop
integer or floating-point
add, subtract, multiply, divide
shift left/right, rotate left/right
not, and, or, xor, set, clear
unconditional, conditional
call, return
trap, return
test&set (atomic r-m-w)
search, translate
parallel subword ops (4 16-bit add)
34
Top 10 80x86 Instructions
° Rank instruction
Integer Av erage Percent total executed
1
load
22%
2
conditional branch
20%
3
compare
16%
4
store
12%
5
add
8%
6
and
6%
7
sub
5%
8
mov e register-register 4%
9
call
1%
10
return
1%
Total
96%
° Simple instructions dominate instruction frequency
35
Summary

While theoretically we can talk about complicated
addressing modes and instructions, the ones we
actually use in programs are the simple ones
 RISC philosophy
36
MIPS Instruction Set Design1




Use general purpose registers with a load-store
architecture: YES
Provide at least 16 general purpose registers plus
separate floating-point registers: 31 GPR & 32
FPR
Support basic addressing modes: displacement
(with an address offset size of 12 to 16 bits),
immediate (size 8 to 16 bits), and register deferred:
YES: 16 bits for immediate, displacement
All addressing modes apply to all data transfer
instructions: YES
37
MIPS Instruction Set Design2




Use fixed instruction encoding if interested in
performance and use variable instruction encoding
if interested in code size: Fixed
Support these data sizes and types: 8-bit, 16-bit,
32-bit integers and 32-bit and 64-bit IEEE 754
floating point numbers: YES
Support these simple instructions, since they will
dominate the number of instructions executed:
load, store, add, subtract, move register-register,
and, shift, compare equal, compare not equal,
branch (with a PC-relative address at least 8-bits
long), jump, call, and return: YES, 16b
Aim for a minimalist instruction set: YES
38
MIPS ISA as an Example

Registers
Instruction Categories






Load/store
Computational
Jump and branch
Floating point
Memory management
special
$r0 - $r31
PC
HI
LO
3 Instruction Formats: all 32 bits wide
OP
$rs
$rt
OP
$rs
$rt
OP
$rd
sa
funct
immediate
jump target
39
Outline



Basic Concepts
Instruction Set Architecture
Machine Organization
40
Machine Organization
Computer
Processor
(CPU)
(active)
Control
(“brain”)
Datapath
(“brawn”)
Memory
(passive)
(where
programs,
& data
live when
running)
Devices
Input
Output
Keyboard,
Mouse
Disk
(where
programs,
& data
live when
not running)
Display,
Printer
41
Semiconductor Memory, DRAM

Semiconductor memory began to be competitive
in early 1970s


First commercial DRAM was Intel 1103



Intel formed to exploit market for semiconductor
memory
1Kbit of storage on single chip
charge on a capacitor used to hold value
Semiconductor memory quickly replaced core
memory in ‘70s
42
DRAM Architecture

Bits stored in 2-dimensional arrays on chip
Modern chips have around 4 logical banks on each
chip
bit lines
Col.
1
N
N+M
Col.
2M
M
word lines
Row 1
Row Address
Decoder

Row 2N
Column Decoder &
Sense Amplifiers
Data
Memory cell
(one bit)
D
43
DRAM Operation

Row access (RAS)





Column access (CAS)





decode row address, enable addressed row (often multiple Kb in
row)
bitlines share charge with storage cell
small change in voltage detected by sense amplifiers which latch
whole row of bits
sense amplifiers drive bitlines full rail to recharge storage cells
decode column address to select small number of sense amplifier
latches (4, 8, 16, or 32 bits depending on DRAM package)
on read, send latched bits out to chip pins
on write, change sense amplifier latches which then charge storage
cells to required value
can perform multiple column accesses on same row without
another row access (burst mode)
Precharge

charges bit lines to known value, required before next row access
44
Processor-DRAM Performance Gap
Processor-DRAM performance gap grows 50%/year

CPU
µProc
60%/yr.
(2X/1.5yr)
100
10
DRAM
1
DRAM
5%/yr.
(2X/15 yrs)
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
Performance
1000
Time
45
Memory Hierarchy



Fact: Large memories are slow, fast memories are
small
How do we create a memory that is large, cheap
and fast (most of the time)?
Hierarchy of Levels




Uses smaller and faster memory technologies close to
the processor
Fast access time in highest level of hierarchy
Cheap, slow memory furthest from processor
The aim of memory hierarchy design is to have
access time close to the highest level and size
equal to the lowest level
46
Current Memory Hierarchy
Processor
Control
1ns
0.0005
-Regs
L1
cache
Speed(ns):
Size (MB):
Cost ($/MB):
Technology:
regs
Datapath
2ns
0.1
$100
SRAM
L2
Cache
6ns
1-4
$30
SRAM
Main
Memory
Secondary
Memory
100ns
100-1000
$1
DRAM
10,000,000ns
100,000
$0.05
Disk
47
Why Hierarchy works: Natural Locality

The Principle of Locality:


Programs access a relatively small portion of the
address space at any second
Temporal Locality (Locality in Time)  Recently accessed
data tend to be referenced again soon

Spatial Locality (Locality in Space)  nearby items will tend
to be referenced soon
48
How is the hierarchy managed?

Registers « Memory


Cache « Main Memory


By the compiler (or assembly language programmer)
By hardware
Main Memory « Disks


By combination of hardware and the operating system
(virtual memory)
By the programmer (files)
49
Inside a Cache
Address
Processor
Address
CACHE
Data
copy of main
memory
location 100
Address
Tag
100
Data Data
Byte Byte
304
Data
Byte
Data
Main
Memory
copy of main
memory
location 101
Line
6848
416
Data Block
50
Virtual Memory1




Idea 1: Many programs sharing DRAM memory
so that context switches can occur
Idea 2: Allow program to be written without
memory constraints – program can exceed the size
of the main memory
Idea 3: Relocation: Parts of the program can be
placed at different locations in the memory instead
of a big chunk.
Virtual Memory:


DRAM memory holds many programs running at same
time (processes)
Use DRAM memory as a kind of “cache” for disk
51
Virtual Memory2



Each process has its own private “virtual address
space” (e.g., 232 Bytes); CPU actually generates
“virtual addresses”
Each computer has a “physical address space”
(e.g., 128 MegaBytes DRAM); also called “real
memory”
Address translation: mapping virtual addresses to
physical addresses


Allows multiple programs to use (different chunks of
physical) memory at same time
Also allows some chunks of virtual memory to be
represented on disk, not in main memory
52
Virtual Memory3


VM divides memory into equal sized pages
Address translation relocates entire pages


Offsets within the pages do not change
If make page size a power of two, the virtual address
separates into two fields:
Virtual Page Number

Page Offset
virtual address
like cache index, offset fields
53
Address Translation
Virtual Address
31 30 29 28 27 . . . . . . . . . . . . . . . . . 12 11 10
98 ......... 3210
Virtual Page Number
Page Offset
1KB page
size
Translation
Physical Page Number
29 28 27 . . . . . . . . . . . . . . . . . .12 11 10
Page Offset
98 ......... 3210
Physical Address
54
I/O Device Examples and Speeds

I/O Speed: bytes transferred per second

From mouse to display: million-to-1
Device
Behavior
Partner
Keyboard
Mouse
Laser Printer
Magnetic Disk
Modem
Network-LAN
Graphics Display
Input
Input
Output
Storage
I or O
I or O
Output
Human
Human
Human
Machine
Machine
Machine
Human
Data Rate
(Mbit/sec)
0.0001
0.0038
3.2.000
240-2560
0.016-0.064
100-1000
800-8000
55
Buses in PC
56
Instruction Set Architecture for I/O


Some machines have special input and output
instructions
Alternative model (used by MIPS):



Input: ~ reads a sequence of bytes
Output: ~ writes a sequence of bytes
Memory also a sequence of bytes, so use loads for
input, stores for output


Called “Memory Mapped Input/Output”
A portion of the address space dedicated to
communication paths to Input or Output devices (no
memory there)
57
Memory Mapped I/O


Certain addresses are not regular memory
Instead, they correspond to registers in I/O devices
address 0
0xFFFF0000
cmd reg.
data reg.
0xFFFFFFFF
58
Processor-I/O Speed Mismatch

500 MHz microprocessor can execute 500 million
load or store instructions per second, or 2,000,000
KB/s data rate


Input: device may not be ready to send data as fast
as the processor loads it



I/O devices from 0.01 KB/s to 30,000 KB/s
Also, might be waiting for human to act
Output: device may not be ready to accept data as
fast as processor stores it
What to do?
59
Polling

Path to device generally has 2 registers:




1 register says it’s OK to read/write
(I/O ready), often called Control Register
1 register that contains data, often called
Data Register
Processor reads from Control Register in loop,
waiting for device to set Ready bit in Control
Register to say its OK (0  1)
Processor then loads from (input) or writes to
(output) data register

Load from device/Store into Data Register resets Ready
bit (1  0) of Control Register
60
Cost of Polling?


Assume for a processor with a 500-MHz clock it
takes 400 clock cycles for a polling operation (call
polling routine, accessing the device, and
returning).
Determine % of processor time for polling



Mouse: polled 30 times/sec so as not to miss user
movement
Floppy disk: transfers data in 2-byte units and has a
data rate of 50 KB/second.
No data transfer can be missed.
Hard disk: transfers data in 16-byte chunks and can
transfer at 8 MB/second. Again, no transfer can be
missed.
61
% Processor time to poll mouse, floppy

Mouse Polling Clocks/sec = 30 * 400 = 12000
clocks/sec


% Processor for polling = 12*103/500*106 = 0.002% 
Polling mouse little impact on processor
Times Polling Floppy/sec = 50 KB/s /2B = 25K
polls/sec
Floppy Polling Clocks/sec
= 25K * 400 = 10,000,000 clocks/sec
6
6
 % Processor for polling = 10*10 /500*10 = 2%
 OK if not too many I/O devices

62
% Processor time to hard disk

Times Polling Disk/sec = 8 MB/s /16B =
500K polls/sec


Disk Polling Clocks/sec = 500K * 400 =
200,000,000 clocks/sec
% Processor for polling = 200*106/500*106 =
40%
 Unacceptable
63
Interrupt



Wasteful to have processor spend most of its time
“spin-waiting” for I/O to be ready
Wish we could have an unplanned procedure call
that would be invoked only when I/O device is
ready
Solution: use exception mechanism to help I/O.
Interrupt program when I/O ready, return when
done with data transfer
64
Benefit of Interrupt-Driven I/O


500 clock cycle overhead for each transfer,
including interrupt. Find the % of processor
consumed if the hard disk is only active 5% of the
time.
Interrupt rate = polling rate




Disk Interrupts/sec = 8 MB/s /16B = 500K
interrupts/sec
Disk Polling Clocks/sec = 500K * 500
= 250,000,000 clocks/sec
% Processor for during transfer: 250*106/500*106=
50%
Disk active 5%  5% * 50%  2.5% busy
65
Direct Memory Access (DMA)






How to transfer data between a Device and Memory?
Wastage of CPU cycles if done through CPU.
Let the device controller transfer data directly to and from
memory  DMA
The CPU sets up the DMA transfer by supplying the type
of operation, memory address and number of bytes to be
transferred.
The DMA controller contacts the bus directly, provides
memory address and transfers the data
Once the DMA transfer is complete, the controller
interrupts the CPU to inform completion.
Cycle Stealing – Bus gives priority to DMA controller thus
stealing cycles from the CPU
66
Responsibilities leading to OS




The I/O system is shared by multiple programs
using the processor
Low-level control of I/O device is complex
because requires managing a set of concurrent
events and because requirements for correct
device control are often very detailed
I/O systems often use interrupts to communicate
information about I/O operations
Would like I/O services for all user programs
under safe control
67