What's a RAM?
• Random Access Memory
• Two main types: Static RAM (SRAM) and Dynamic RAM (DRAM)
  (Other types: Flash RAM, SDRAM, DDR RAM, Video RAM, FeRAM)
  – The differences lie in how the bits are stored:
    • Speed of retrieval
    • Space requirements
    • Volatility
    • Cost
Dynamic RAM (DRAM)
[Figure: DRAM cell with a word line and a bit line]
Read: drive the word line, sense the value on the bit line (this destroys the saved value).
Write: drive the word line, drive the new value on the bit line.
Static RAM (SRAM)
[Figure: SRAM cell with a word line and a complementary pair of bit lines (Bit and !Bit)]
Read: drive the word line, sense the value on the bit lines.
Write: drive the word line, drive the new value (strongly) on the bit lines.
SRAM vs DRAM: pros and cons
Big win for DRAM:
• DRAM has a 6-10X density advantage at the same technology generation.
SRAM advantages:
• SRAM has deterministic latency: its cells do not need to be refreshed.
• SRAM is much faster: transistors drive the bit lines on reads.
• SRAM is easy to design in a logic fabrication process (and premium logic processes have SRAM add-ons).
Basic RAM Architecture
[Figure: a RAM array. The address decoder drives the word lines; each word line selects a row of bit cells; the bit lines run to sense amplifiers that resolve the stored high/low values and produce the data output]
SRAM array: simpler than a DRAM array
Architects specify the number of rows and columns. Word and bit lines slow down as the array grows larger!
[Figure: SRAM array with a write driver per column, parallel data I/O lines, and muxes added to select a subset of the bits]
Dimensions
[Figure: length scales from 1 cm down to 1 Å. Chip size (~1 cm); diameter of a human hair (25 µm); 1996 devices (0.35 µm); deep-UV wavelength (0.248 µm); 2001 devices (0.18 µm); 2007 devices (0.1 µm); 2014 devices (~10 nm); X-ray wavelength (0.6 nm); silicon atom radius (1.17 Å)]
1980-2005: the CPU-DRAM speed gap
[Figure: performance (1/latency) vs. year. CPU performance grew ~60% per year (2X in 1.5 years) until the power wall; DRAM performance grew ~9% per year (2X in 10 years); the gap grew ~50% per year]
Caches and the Pipeline
[Figure: the classic 5-stage MIPS pipeline datapath (IF/ID, ID/EX, EX/MEM, MEM/WB latches, register file, ALU, sign extend, next-PC logic). The instruction memory accessed in IF is the instruction cache (I-$) and the data memory accessed in MEM is the data cache (D-$); both are backed by main memory]
Memory Hierarchy
In 1998:
SRAM   .5 - 5 ns            $4,000 to $10,000 per Gbyte
DRAM   50 - 70 ns           $100 to $200 per Gbyte
Disk   5 to 20 million ns   $0.50 to $2 per Gbyte

Users want memory that is both fast and cheap. The solution: a memory hierarchy.
[Figure: pyramid of levels in the memory hierarchy, from the CPU through Level 1 (cache), Level 2, ... Level n (memory, then disk); the distance from the CPU in access time increases going down, and so does the size of the memory at each level]
A memory hierarchy in which the faster but smaller part is "close" to the CPU and used most of the time, and in which the slower but larger part is "far" from the CPU, gives us the illusion of having a fast, large, inexpensive memory.
What is a cache?
• Small, fast storage used to improve the average access time to slow memory.
• Exploits spatial and temporal locality.
• In computer architecture, almost everything is a cache!
  – Registers: a cache on variables
  – First-level cache: a cache on the second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on the page table
  – Branch prediction: a cache on prediction information?
[Figure: the hierarchy from Proc/Regs through L1-Cache, L2-Cache and Memory down to Disk, Tape, etc.; levels get bigger going down and faster going up]
Differences in Memory Levels

Level                 Memory Technology   Size          Typical Access Time   Cost per Gbyte
Registers             D Flip-Flops        64 x 32-bit   .5 - 3 ns             N/A
L1 Cache (on chip)    SRAM                16 Kbytes     .5 - 5 ns             $4,000 - $10,000
L2 Cache (off chip)   SRAM                256 Kbytes    .5 - 5 ns             $4,000 - $10,000
Main Memory           DRAM                64 Mbytes     50 - 70 ns            $100 - $200
Secondary Storage     Magnetic Disk       2 Gbytes      10 - 20 ms            $0.50 - $2
2005 Memory Hierarchy: Apple iMac G5 (1.6 GHz, $1299.00)

                   Reg        L1 Inst    L1 Data    L2         DRAM    Disk
Size               1K         64K        32K        512K       256M    80G
Latency (cycles)   1          3          3          11         160     1e7
Managed by         compiler   hardware   hardware   hardware   OS, hardware, application

Goal: the illusion of a large, fast, cheap memory. Let programs address a memory space that scales to the size of the disk, at a speed that is usually as fast as register access.
PowerPC 970 FX (90 nm, 58 M transistors)
[Die photo: the registers (1K), the L1 instruction cache (64K), the L1 data cache (32K) and the 512K L2 are visible on the die]
Memory Technology: costs
• Faster memories are more expensive per bit.
• Slower memories usually take less area per bit.

Memory Technology   Typical access time    $ per Gbyte
SRAM                .5 - 5 ns              $4,000 - $10,000
DRAM                50 - 70 ns             $100 - $200
Magnetic Disk       10 - 20 million ns     $0.50 - $2

How do we make the best use of cache memory?
How can we guess which memory locations the program will want to access next?
Locality
• Temporal locality (locality in time):
  – If we accessed a certain address, the chances are high that we will access it again shortly. For data this is so because we probably update it; for instructions it is so because we tend to use loops.
• Spatial locality (locality in space):
  – If we accessed a certain address, the chances are high that we will access its neighbors.
  – For instructions this is due to the sequential nature of programs. For data it is so because we use groups of variables such as arrays.
• So, let's keep recent data and instructions in the cache.
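As a concrete illustration (my own minimal sketch, not from the slides), the loop below exhibits both kinds of locality: the loop body's instructions are fetched over and over (temporal locality), and consecutive array elements sit at neighboring addresses (spatial locality).

```python
# Minimal illustration of temporal and spatial locality (hypothetical example).
data = [1] * 1024              # an array occupies consecutive memory addresses
total = 0
for i in range(len(data)):     # the loop's instructions are reused -> temporal locality
    total += data[i]           # data[i], data[i+1], ... are neighbors -> spatial locality
print(total)                   # 1024
```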
The cache principle
The key terms:
Hit: a successful search for info in the cache. If it is in the cache, we have a hit and we continue executing the instructions.
Miss: an unsuccessful search for info in the cache. If it is not in the cache, we have a miss and we have to bring the requested data from a slower memory one level up the hierarchy. Until then, we must stall the CPU execution!
Block: the basic unit that is loaded into the cache when a miss occurs. The minimal block size is a single word.
Cache Misses
• Compulsory (cold start or process migration; first reference): the first access to a block
  – "Cold" fact of life: not a whole lot you can do about it
  – Note: if you are going to run billions of instructions, compulsory misses are insignificant
• Capacity:
  – The cache cannot contain all the blocks accessed by the program
  – Solution: increase the cache size
• Conflict (collision):
  – Multiple memory locations map to the same cache location
  – Solution 1: increase the cache size
  – Solution 2: increase associativity
• Coherence (invalidation): another process (e.g., I/O) updates memory
Direct Mapped Cache
[Figure: an 8-entry direct-mapped cache (indices 000-111) and a memory whose addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001 and 11101 all map to cache index 001: the index is simply the low-order bits of the address]
General idea
• Group the memory into blocks of addresses.
  – The contents of an entire block are loaded into the cache at the same time.
• The cache contains several blocks.
  – When the CPU tries to access a memory location that is not in the cache, the block containing this location is loaded into the cache, and another block is potentially evicted to make room.
Questions: Where is the block placed in the cache? How do we decide which block to evict? What should the block size be?
Direct Mapped Cache
[Figure: direct-mapped cache, block = 1 word, cache size = 16 words (2^n blocks)]
One possible arrangement for a MIPS cache (cache size 2^12 bytes):
[Figure: the 32-bit address (bit positions 31..0) is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset (bits 1-0); the cache is an array of 1024 entries (0..1023), each holding a valid bit, a 20-bit tag and 32 bits of data; the stored tag is compared with the address tag to produce Hit, and the data field supplies Data]
Another possibility for MIPS (the actual DECStation 3100, cache size 2^16 bytes):
[Figure-only slide]

Direct Mapped Cache: MIPS architecture (Figure 7.6)
[Figure: the 32-bit address (bits 31..0) is split into a 20-bit tag, a 10-bit index and a byte offset; 1024 entries (0..1023) of valid bit, tag and data; the stored tag is compared with the address tag ("compare tags") to produce Hit, and the selected entry supplies Data]
Direct Mapped Cache: temporal example (Figure 7.6)
lw $1, 22($0)   (address 10 110)  ->  miss: valid bit off at index 110
lw $2, 26($0)   (address 11 010)  ->  miss: valid bit off at index 010
lw $3, 22($0)   (address 10 110)  ->  hit!

Resulting cache contents:
Index   Valid   Tag   Data
000     N
001     N
010     Y       11    Memory[11010]
011     N
100     N
101     N
110     Y       10    Memory[10110]
111     N
Direct Mapped Cache: worst case, always miss! (Figure 7.6)
lw $1, 22($0)   (address 10 110)  ->  miss: valid bit off at index 110
lw $2, 30($0)   (address 11 110)  ->  miss: tag mismatch at index 110
lw $3, 6($0)    (address 00 110)  ->  miss: tag mismatch at index 110

All three addresses map to index 110, so that single entry is replaced on every access:
Index   Valid   Tag             Data
110     Y       10 -> 11 -> 00  Memory[10110] -> Memory[11110] -> Memory[00110]
(all other entries remain invalid)
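The behavior in the two examples above can be reproduced with a small direct-mapped cache model. The sketch below is my own illustration (the field widths, class and method names are assumptions, not a specific machine): the low-order bits of the address form the index and the remaining high bits form the tag.

```python
# Sketch of a direct-mapped cache with one-word blocks (illustrative only).
class DirectMappedCache:
    def __init__(self, index_bits=3):          # 2^3 = 8 entries, as in the figures above
        self.index_bits = index_bits
        n = 1 << index_bits
        self.valid = [False] * n
        self.tag = [None] * n
        self.data = [None] * n

    def access(self, addr):
        """Return 'hit' or 'miss'; on a miss, load the word into the cache."""
        index = addr & ((1 << self.index_bits) - 1)   # low-order bits select the entry
        tag = addr >> self.index_bits                 # remaining bits identify the block
        if self.valid[index] and self.tag[index] == tag:
            return "hit"
        self.valid[index] = True                      # miss: (re)fill this entry
        self.tag[index] = tag
        self.data[index] = f"Memory[{addr:05b}]"
        return "miss"

cache = DirectMappedCache()
for addr in (22, 26, 22):       # the temporal example: miss, miss, hit
    print(addr, cache.access(addr))
for addr in (30, 6):            # the worst case: both map to index 110 and miss on the tag
    print(addr, cache.access(addr))
```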
Handling writes
• Write-through
  – Anything we write is written both to the cache and to the memory (for now we discuss one-word blocks).
• Write-through usually uses a write buffer
  – Since writing to the slower memory takes too much time, we use an intermediate buffer. It absorbs the write "bursts" of the program and slowly but surely writes them to memory. (If the buffer gets full, we must stall the CPU.)
• Write-back
  – Another method is to copy the cache block into memory only when the block is replaced by another block. This is called write-back or copy-back.
Write Buffer for Write-Through
[Figure: Processor and Cache, with a Write Buffer between them and DRAM]
• A write buffer is needed between the cache and memory
  – Processor: writes data into the cache and the write buffer
  – Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO:
  – Typical number of entries: 4
  – Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
• Memory system design problem:
  – If store frequency (w.r.t. time) -> 1 / DRAM write cycle, the write buffer saturates
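A minimal sketch of the FIFO write buffer described above (my own illustration; the 4-entry depth matches the slide's "typical" value, while the class and method names and the timing model are assumptions):

```python
from collections import deque

# Sketch of a write-through cache's FIFO write buffer (illustrative only).
class WriteBuffer:
    def __init__(self, capacity=4):            # "typical number of entries: 4"
        self.fifo = deque()
        self.capacity = capacity

    def cpu_store(self, addr, value):
        """Called by the processor on a store. Returns True if the CPU must stall."""
        if len(self.fifo) == self.capacity:
            return True                         # buffer full: stall until memory drains it
        self.fifo.append((addr, value))         # the write enters the buffer (and the cache)
        return False

    def memory_controller_tick(self, memory):
        """Called once per DRAM write cycle: drain one pending write to memory."""
        if self.fifo:
            addr, value = self.fifo.popleft()
            memory[addr] = value

    def contains(self, addr):
        """Reads must check the buffer to avoid RAW hazards on not-yet-written data."""
        return any(a == addr for a, _ in self.fifo)

memory = {}
wb = WriteBuffer()
for a in range(6):                              # a burst of stores
    if wb.cpu_store(a, a * 10):                 # buffer full: model a stall,
        wb.memory_controller_tick(memory)       # wait for one drain, then retry
        wb.cpu_store(a, a * 10)
while wb.fifo:
    wb.memory_controller_tick(memory)
print(memory)
```

Reads would also have to check the buffer (the RAW-hazard issue discussed on the next slide), which is what `contains` is for.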
State machines to manage the write buffer
Solution: add a "write buffer" to the cache datapath.
[Figure: Processor, Cache and Lower-Level Memory, with a Write Buffer holding data awaiting write-through to the lower-level memory]
Q. Why a write buffer?  A. So the CPU doesn't stall.
Q. Why a buffer, why not just one register?  A. Bursts of writes are common.
Q. Are Read-After-Write (RAW) hazards an issue for the write buffer?  A. Yes! Drain the buffer before the next read, or check the write buffer on reads.
On reads, the state machine checks the cache and the write buffer (what if the word was removed from the cache before the lower-level write?). On writes, the state machine stalls on a full write buffer and handles duplicates in the write buffer.
Direct Mapped Cache
[Figure: direct-mapped cache, block = 1 word, cache size = 16 words]
Direct Mapped Cache
[Figure: direct-mapped cache, block = 4 words (1 block = 4 words), cache size = 16 words]
This is still called a direct-mapped cache, since each block in memory is mapped directly to a single block in the cache.
A 4-word-block direct-mapped implementation
[Figure: the 32-bit address is split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) and a 2-bit byte offset (bits 1-0); 4K entries, each holding a valid bit, a 16-bit tag and 128 bits (4 words) of data; a mux selects one of the four 32-bit words using the block offset, and the tag compare produces Hit]
When we have more than a single word in a block, the efficiency of storage is slightly higher, since we have one tag per block instead of one per word. On the other hand, we slow the cache down somewhat since we add multiplexors. Anyhow, this is not the main issue; the issue is reducing the miss rate.
A 2^m-word-block implementation
[Figure: the address is split into a tag (30-n-m bits), an index (n bits), a block offset (m bits) and a 2-bit byte offset inside a word; 2^n entries, each holding a valid bit, a tag and 32*2^m bits of data; a mux selects one of the 2^m words using the block offset, and the tag compare produces Hit]
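Summarizing the figure's address fields (with n index bits, 2^m words per block, 32-bit addresses and 32-bit words), the widths work out as follows:

```latex
\begin{aligned}
\text{byte offset within a word} &= 2 \text{ bits}\\
\text{block offset} &= m \text{ bits}\\
\text{index} &= n \text{ bits}\\
\text{tag} &= 32 - 2 - m - n = 30 - n - m \text{ bits}\\
\text{data capacity} &= 2^{n} \cdot 2^{m} \cdot 4 \text{ bytes}
\end{aligned}
```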
Block size and miss rate
When we increase the block size, the miss rate, especially for instructions, is reduced. However, if we leave the cache size as is, we reach a situation where there are too few blocks, so we have to replace them even before we have taken advantage of the locality, i.e., before we have used the entire block. That increases the miss rate again (and explains the right-hand side of the graph below).
[Figure: miss rate (0-40%) vs. block size (4-256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB and 256 KB; the miss rate first drops as blocks get larger, then rises again for the small caches]
The block size
The block does not have to be a single word. When we increase the size of the cache blocks, we improve the hit rate, since we reduce the misses thanks to the spatial locality of the program (mainly) but also of the data (e.g., in image processing). Here is a comparison of the miss rates of two programs with one-word vs. four-word blocks:

Program   Block size (words)   Instruction miss rate   Data miss rate   Effective combined miss rate
gcc       1                    6.1%                    2.1%             5.4%
gcc       4                    2.0%                    1.7%             1.9%
spice     1                    1.2%                    1.3%             1.2%
spice     4                    0.3%                    0.6%             0.4%
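The "effective combined miss rate" column is a weighted average of the instruction and data miss rates, weighted by each program's mix of instruction and data references (a general identity; the exact reference counts behind the table are not given here):

```latex
\text{combined miss rate} \;=\; \frac{I \cdot m_{\text{inst}} + D \cdot m_{\text{data}}}{I + D}
```

where I and D are the numbers of instruction and data references and m_inst and m_data are the corresponding miss rates.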
1. Reduce misses via larger block size
[Figure: miss rate (0-25%) vs. block size (16-256 bytes) for cache sizes of 1K, 4K, 16K, 64K and 256K]
Block size and write
When we have more than a single word in a block, then when we write (a single word) into a block, we must first read the entire block from memory (unless it is already in the cache) and only then write to the cache and to the memory. If we already had the block in the cache, the process is exactly as it was for a single-word-block cache.

Separate instruction and data caches
Note that we usually have separate instruction and data caches. A single cache for both could give some flexibility, since sometimes there is more room for data, but two separate caches have twice the bandwidth, i.e., we can read both at the same time (twice as fast). That is why most CPUs use separate instruction and data caches.
Block size and read
When we have more than a single word in a block, we need to wait longer to read the entire block. There are techniques to start the writing into the cache as soon as possible. The other approach is to design the memory so that reading is faster, especially reading consecutive addresses. This is done by reading several words in parallel.
Faster CPUs need better caches
It is shown in the book (section 7.3, pp. 565-567) that when we improve the CPU (shorten the CPI or the clock period) but leave the cache as is, the relative miss penalty increases. This means that faster CPUs need better caches. Better means we should reduce the miss rate and reduce the miss penalty.

Reducing the miss rate
This is done by giving the cache more flexibility in keeping data. So far we allowed a memory block to be mapped to a single block in the cache; we called that a direct-mapped cache, and it has no flexibility. The most flexible scheme is one in which a block can be stored in any of the cache blocks. That way, we can keep frequently used blocks that would always compete for the same cache block in a direct-mapped implementation. Such a flexible scheme is called a fully associative cache. In a fully associative cache the tag must be compared against all cache entries.
We also have a compromise called an "N-way set-associative" cache: each memory block is mapped to one of a set of N blocks in the cache.
Note that for caches with more than one possible mapping, we must employ some replacement policy (LRU or random are used).
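A minimal sketch of an N-way set-associative cache with LRU replacement (my own illustration; the sizes, names and one-word blocks are assumptions). A direct-mapped cache is the special case ways = 1, and a fully associative cache is num_sets = 1.

```python
# Sketch of an N-way set-associative cache with LRU replacement (illustrative only).
from collections import OrderedDict

class SetAssociativeCache:
    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # each set is an OrderedDict: tag -> data, ordered from LRU to MRU
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block_addr):
        index = block_addr % self.num_sets       # which set the block maps to
        tag = block_addr // self.num_sets        # identifies the block within the set
        s = self.sets[index]
        if tag in s:
            s.move_to_end(tag)                   # update the LRU order on a hit
            return "hit"
        if len(s) == self.ways:                  # set full: evict the least recently used
            s.popitem(last=False)
        s[tag] = f"Memory[block {block_addr}]"   # load the missing block
        return "miss"

cache = SetAssociativeCache(num_sets=4, ways=2)
# two blocks that would conflict in a direct-mapped cache (same index) now coexist:
for b in (0, 4, 0, 4):
    print(b, cache.access(b))                    # miss, miss, hit, hit
```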
Direct Mapped Cache
[Figure: direct-mapped cache, block = 4 words (1 block = 4 words), cache size = 16 words]
2-way set-associative cache
[Figure: 2-way set-associative cache, block = 4 words, cache size = 32 words; each set holds 2 blocks (ways) of 4 words, giving N*2^n blocks of 2^m words in general]
A 4-way set-associative cache
[Figure: here the block size is 1 word; the cache is effectively 4 "regular" caches in parallel plus a multiplexor that selects the data of the way that hit]
A 2-way set-associative cache
[Figure: the address is split into a tag (30-n-m bits), an index (n bits), a block offset (m bits) and a byte offset; there are two ways, each with 2^n entries of valid bit, tag and 32*2^m bits of data and its own word-select mux; each way produces its own Hit1/Hit2 and Data1/Data2, and a final mux driven by the hit signals selects the data of the way that hit]
Fully associative cache
[Figure: fully associative cache, block = 4 words (1 block = 4 words), cache size = 32 words; any of the 8 blocks can hold any memory block]
A fully associative cache
[Figure: there is no index; the tag (30-m bits) of the address is compared in parallel against the tags of all N blocks; each block holds 2^m words, and a mux selects the requested word from the block that hit]
Here the block size is 2^m words. We see that we have only N blocks.
Suppose we have 2^k words in a cache (one-word blocks)
• Direct mapped: 2^n sets of one block each (n = k)
• N-way set associative: 2^n sets of N blocks each (N = 2^(k-n))
• Fully associative: a single set of N = 2^k blocks
[Figure: searching for block address 12 in the three cache types. Direct mapped (blocks 0-7): only block 12 mod 8 = 4 is checked, by comparing one tag. Set associative (sets 0-3): the two tags of set 12 mod 4 = 0 are searched. Fully associative: the tags of all blocks are searched in parallel]
Cache Block Replacement
After a cache read miss, if there are no empty cache blocks, which block should be removed from the cache?
• The Least Recently Used (LRU) block? Appealing, but hard to implement.
• A randomly chosen block? Easy to implement, but how well does it work?

Miss rate for a 2-way set-associative cache:
Size      Random   LRU
16 KB     5.7%     5.2%
64 KB     2.0%     1.9%
256 KB    1.17%    1.15%
Also, try other LRU approximations.
Which block should be replaced on a miss?
• Easy for direct mapped
• Set associative or fully associative:
  – Random
  – LRU (Least Recently Used)

Miss rates by associativity:
            2-way            4-way            8-way
Size        LRU     Random   LRU     Random   LRU     Random
16 KB       5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB       1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB      1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
Faster CPUs need better caches
Better means we should reduce the miss rate; for that we used a 4-way set-associative cache. Better also means reducing the miss penalty.

Reducing the miss penalty
This is done by using a 2-level cache. The first-level cache is on the same chip as the CPU; actually it is part of the CPU. It is very fast (1-2 ns, less than one clock cycle), it is small, and its block is also small, so it can be 4-way set associative. The level-2 cache is off chip, about 10 times slower, but still about 10 times faster than the memory (DRAM). It has a larger block and is almost always 2-way set associative or direct mapped; it is mainly aimed at reducing the read penalty. Analyzing such caches is complicated; usually simulations are required.
An optimal single-level cache is usually larger and slower than the level-1 cache, and faster and smaller than the level-2 cache.
Note that we usually have separate instruction and data caches.
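A back-of-the-envelope average memory access time (AMAT) calculation shows why the second level pays off. All the numbers below are assumed for illustration only; they are not taken from the lecture.

```python
# Illustrative AMAT calculation for a two-level cache (all numbers are assumed).
l1_hit_time  = 1     # cycles
l1_miss_rate = 0.05
l2_hit_time  = 10    # cycles to get the data from L2
l2_miss_rate = 0.25  # local miss rate of L2
dram_penalty = 100   # cycles

amat_one_level = l1_hit_time + l1_miss_rate * dram_penalty
amat_two_level = l1_hit_time + l1_miss_rate * (l2_hit_time + l2_miss_rate * dram_penalty)

print(amat_one_level)  # 1 + 0.05*100             = 6.0 cycles
print(amat_two_level)  # 1 + 0.05*(10 + 0.25*100) = 2.75 cycles
```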
The Limits of Physical Addressing
[Figure: the CPU is connected directly to memory; the address lines A0-A31 carry "physical addresses" of memory locations and the data lines D0-D31 carry the data]
"Programming the Apple ][ ..."
• All programs share one address space: the physical address space
• Machine-language programs must be aware of the machine organization
• There is no way to prevent a program from accessing any machine resource
Solution: Add a Layer of Indirection
[Figure: the CPU issues "virtual addresses" (A0-A31); an address-translation unit maps them to the "physical addresses" (A0-A31) presented to memory; the data lines D0-D31 carry the data]
• User programs run in a standardized virtual address space
• Address-translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
• The hardware supports "modern" OS features: protection, translation, sharing
Virtual Memory
[Figure-only slides]
Address translation
The translation is simple. We use the LSBs to point at the address inside a page and the rest of the bits, the MSBs, to point at a "virtual" page. The translation replaces the virtual page number with a physical page number that has fewer bits. This means that the physical memory is smaller than the virtual memory, so we have to load and store pages whenever required. Before VM, the programmer was responsible for loading and replacing "overlays" of code or data; VM takes this burden away.
By the way, using pages and "relocating" the code and the data every time they are loaded into memory also enables better usage of memory: large contiguous areas are not required.
Address translation
The translation is done by a table called the page table. We have such a table, residing in main memory, for each process. A special register, the page table register, points at the start of the table. When switching programs, i.e., switching to another process, we change the contents of that register so it points to the appropriate page table. [To switch a process also means storing all the registers, including the PC, of the current process and retrieving those of the process we want to switch to. This is done by the operating system every now and then, according to some predetermined rule.]
We need a valid bit, the same as in caches, which tells whether the page is valid or not.
In VM we have fully associative placement of pages in physical memory, to reduce the chance of a page fault. We also apply sophisticated algorithms for the replacement of pages. Since the read/write time (from/to disk) is very long, we use a software mechanism instead of the hardware used in caches. Also, we use a write-back scheme and not write-through.
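A minimal sketch of the translation described above (my own illustration; the 4 KB page size, the tiny page table and the names are assumptions): the virtual page number indexes the page table, the valid bit decides between a physical page number and a page fault, and the page offset is carried over unchanged.

```python
# Sketch of virtual-to-physical address translation through a page table (illustrative).
PAGE_SIZE = 4096          # bytes -> 12 offset bits

# page table: virtual page number -> (valid, physical page number or disk location)
page_table = {
    0: (True, 7),                   # VPN 0 is resident in physical page 7
    1: (False, "disk:slot-42"),     # VPN 1 is not in memory: its copy lives on disk
}

def translate(virtual_addr):
    vpn = virtual_addr // PAGE_SIZE       # high bits: virtual page number
    offset = virtual_addr % PAGE_SIZE     # low bits: offset inside the page
    valid, entry = page_table.get(vpn, (False, None))
    if not valid:
        raise RuntimeError(f"page fault on VPN {vpn}")  # the OS loads the page, updates the table
    return entry * PAGE_SIZE + offset     # physical page number, same offset

print(hex(translate(0x0123)))             # VPN 0, offset 0x123 -> physical 0x7123
```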
The page table
The operating system (OS) creates a copy of all the pages of a process on the disk. It loads the requested pages into physical memory and keeps track of which pages are loaded and which are not. The page table can also be used to point at the pages on the disk: if the valid bit is on, the table holds the physical page address; if the valid bit is off, the table holds the page's disk address.
When a page fault occurs and all of physical memory is in use, the OS must choose which page to replace. LRU is often used. However, to simplify things, a "use" (or "reference") bit is set by hardware every time a page is accessed, and every now and then these bits are cleared by the OS. According to these bits, the OS can decide which pages have a higher chance of being used and keep them in memory.
The page table could be very big, so there are techniques to keep it small. We do not prepare room for every possible virtual address, but add an entry whenever a new page is requested. We sometimes have a page table with two parts: one for the heap, growing upwards, and one for the stack, growing downwards. Some OSs use hashing to translate between the virtual page address and the page table. Sometimes the page table itself is allowed to be paged.
Note that every access to memory is made of two reads: first we read the physical page address from the page table, then we can perform the real read.
Address Translation
[Figure: the CPU issues a virtual address (VA); the translation produces a physical address (PA) that goes to the cache; on a cache miss, main memory is accessed and the data is returned]
• The page table is a large data structure in memory
• Two memory accesses for every load, store, or instruction fetch!!!
• Virtually addressed cache?
• Cache the address translations?
TLB
Note that every access to memory is made of two reads: first we read the physical page address from the page table, then we can perform the real read. To avoid that, we use a special cache for address translations, called a "Translation-Lookaside Buffer" (TLB). It is a small cache (32-4096 entries) with blocks of 1 or 2 page addresses, a very fast hit time (less than half a clock cycle, to leave enough time to fetch the data using the address coming out of the TLB), and a small miss rate (0.01%-1%). A TLB miss causes a delay of 10-30 clock cycles to access the real page table and update the TLB.
What about writes? Whenever we write to a page in physical memory, we must set a bit in the TLB (and eventually, when the entry is replaced in the TLB, in the page table). This bit is called the "dirty" bit. When a "dirty" page is removed from physical memory, it should be copied to the disk to replace the old, un-updated page that was originally on the disk. If the dirty bit is off, no copy is required, since the original page is untouched.
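A minimal sketch of a TLB sitting in front of the page table (my own illustration; the entry count, the LRU policy and the simplified page table are assumptions). A hit avoids the page-table read; a write marks the entry dirty, as described above.

```python
# Sketch of a TLB in front of the page table (illustrative only).
from collections import OrderedDict

PAGE_SIZE = 4096
page_table = {0: 7, 1: 12, 2: 3}       # VPN -> physical page number (all valid here)

class TLB:
    def __init__(self, entries=4):
        self.entries = entries
        self.map = OrderedDict()        # VPN -> (PPN, dirty), kept in LRU order

    def translate(self, virtual_addr, is_write=False):
        vpn, offset = divmod(virtual_addr, PAGE_SIZE)
        if vpn in self.map:                           # TLB hit: fast path
            ppn, dirty = self.map[vpn]
            self.map.move_to_end(vpn)
        else:                                         # TLB miss: walk the page table (slow)
            ppn, dirty = page_table[vpn], False       # a missing VPN would be a page fault
            if len(self.map) == self.entries:
                self.map.popitem(last=False)          # evict the LRU entry
            self.map[vpn] = (ppn, dirty)
        if is_write:
            self.map[vpn] = (ppn, True)               # set the dirty bit on a write
        return ppn * PAGE_SIZE + offset

tlb = TLB()
print(hex(tlb.translate(0x2010)))                    # miss: VPN 2 -> PPN 3 -> 0x3010
print(hex(tlb.translate(0x2014, is_write=True)))     # hit, and the page is marked dirty
```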
TLBs
A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation-Lookaside Buffer, or TLB.
TLB entry: Virtual Address | Physical Address | Dirty | Ref | Valid | Access
A TLB is really just a cache on the page table mappings. TLB access time is comparable to cache access time (much less than main memory access time).
Translation-Lookaside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
TLBs are usually small, typically not more than 128-256 entries even on high-end machines. This permits a fully associative lookup on those machines. Most mid-range machines use small n-way set-associative organizations.
[Figure: translation with a TLB. The CPU issues a VA; a TLB lookup (about 1/2 t) yields the PA on a hit; on a TLB miss, the full translation (about 20 t) is performed; the PA then goes to the cache (about t) and, on a cache miss, to main memory]
TLB and Cache together
So here is the complete picture: the CPU generates a virtual address (the PC during fetch, or ALUOut during a lw or sw instruction). The bits go directly to the TLB. If there is a hit, the output of the TLB provides the physical page address. We combine these lines with the LSBs of the virtual address and use the resulting physical address to access memory.
This address is connected to the cache. If a cache hit is detected, the data immediately appears at the output of the cache. All of this takes less than a clock cycle, so we can use the data on the next rising edge of the clock.
Reducing Translation Time
Machines with TLBs go one step further to reduce the number of cycles per cache access: they overlap the cache access with the TLB access. The high-order bits of the VA are used to look up the TLB while the low-order bits are used as the index into the cache.
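For this overlap to work, the cache index must be taken entirely from the untranslated page-offset bits, which constrains the cache geometry (my own formulation of the usual rule):

```latex
\text{index bits} + \text{block-offset bits} \;\le\; \text{page-offset bits}
\quad\Longrightarrow\quad
\text{cache size} \;\le\; \text{page size} \times \text{associativity}
```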
Protection
While handling a page fault we can detect that a program is trying to access a virtual page that is not defined. A regular process cannot be allowed to access the page table itself, i.e., to read and write the page table; only kernel (OS) processes can do that. There can also be restrictions on writing to certain pages. All this can be achieved with special bits in the TLB (a kernel bit, a write-access bit, etc.). Any violation should cause an exception that will be handled by the OS.
In some OSs and CPUs, not all pages have the same size; we then use the term segment instead of page. In that case we need hardware support that detects when the CPU tries to access an address beyond the limit of the segment.
End of caches & VM
Conventional Wisdom Changes!!
• CW1: Old: power is free, transistors are expensive.
  New: power is expensive, transistors are free.
• CW2: Old: if power is a concern, only dynamic power matters.
  New: static power is 40% of the total; the concern is leakage.
• CW7: Old: multiply is slow, lw and sw are fast.
  New: multiply is fast, lw and sw are slow (200 clock cycles to DRAM).
Conventional Wisdom Changes!!
• CW10: Old: don't bother parallelizing, just wait.
  New: the wait is too long. Go parallel.
• CW11: Old: increasing the clock rate is the way to improve performance.
  New: parallelism is the way to increase performance.
Backup slides
Direct Mapped Cache
[Figure: direct-mapped cache, block = 1 word, cache size = 16 words]
For MIPS:
[Figure-only slide]
A fully associative cache
[Figure: there is no index; the tag (30-m bits) of the address is compared in parallel against the tags of all N blocks; each block holds 2^m words, and a mux selects the requested word from the block that hit]
Here the block size is 2^m words. We see that we have only N blocks.
Option                 TLB                   L1 Cache          L2 Cache          VM (page)
Block Size             4-8 bytes (1 PTE)     4-32 bytes        32-256 bytes      4k-16k bytes
Hit Time               1 cycle               1-2 cycles        6-15 cycles       10-100 cycles
Miss Penalty           10-30 cycles          8-66 cycles       30-200 cycles     700k-6M cycles
Local Miss Rate        .1 - 2%               .5 - 20%          13 - 15%          .00001 - .001%
Size                   32B - 8KB             1 - 128 KB        256KB - 16MB      --
Backing Store          L1 Cache              L2 Cache          DRAM              Disks
Q1: Block Placement    Fully or set assoc.   DM                DM or SA          Fully associative
Q2: Block ID           Tag/block             Tag/block         Tag/block         Table
Q3: Block Replacement  Random (not last)     N.A. (for DM)     Random (if SA)    LRU/LFU
Q4: Writes             Flush on PTE write    Through or back   Write-back        Write-back
What happens on a cache miss?
For an in-order pipeline, there are 2 options:
• Freeze the pipeline in the Mem stage (popular early on: Sparc, R4000)
    IF  ID  EX  Mem stall stall stall ... stall Mem Wr
        IF  ID  EX  stall stall stall ... stall stall Ex  Wr
• Use Full/Empty bits in registers + an MSHR queue
  – MSHR = "Miss Status/Handler Registers" (Kroft). Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
    · Per cache line: keep info about the memory address.
    · For each word: the register (if any) that is waiting for the result.
    · Used to "merge" multiple requests to one memory line.
  – A new load creates an MSHR entry and sets its destination register to "Empty". The load is "released" from the pipeline.
  – An attempt to use the register before the result returns causes the instruction to block in the decode stage.
  – This gives limited "out-of-order" execution with respect to loads. Popular with in-order superscalar architectures.
• Out-of-order pipelines already have this functionality built in (load queues, etc.).
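A minimal data-structure sketch of the MSHR queue described above (my own illustration; the field and method names are assumptions):

```python
# Sketch of an MSHR (Miss Status/Handler Register) queue (illustrative only).
from dataclasses import dataclass, field

@dataclass
class MSHREntry:
    line_addr: int                                     # memory line with an outstanding miss
    waiting_regs: dict = field(default_factory=dict)   # word offset -> destination register

class MSHRQueue:
    def __init__(self, num_entries=4):
        self.num_entries = num_entries
        self.entries = {}                              # line_addr -> MSHREntry

    def record_load(self, line_addr, word_offset, dest_reg):
        """New load miss: merge into an existing entry for the same line if possible."""
        entry = self.entries.get(line_addr)
        if entry is None:
            if len(self.entries) == self.num_entries:
                return False                           # no free MSHR: the load must stall
            entry = self.entries[line_addr] = MSHREntry(line_addr)
        entry.waiting_regs[word_offset] = dest_reg     # dest_reg stays "Empty" until data returns
        return True

    def line_returned(self, line_addr):
        """The memory line arrived: return every register waiting on it so it can be filled."""
        return self.entries.pop(line_addr).waiting_regs
```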
Review: Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time

Memory stall clock cycles =
  (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)

Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty

Note: the memory hit time is included in the execution cycles.
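A small worked example of these formulas (the program statistics are assumed purely for illustration):

```python
# Worked example of the cache performance formulas above (all inputs are assumed).
instructions     = 1_000_000
cpi_base         = 1.0        # CPU execution cycles per instruction, without memory stalls
mem_accesses     = 1_300_000  # instruction fetches + loads/stores
miss_rate        = 0.02
miss_penalty     = 100        # cycles
clock_cycle_time = 1e-9       # seconds

memory_stall_cycles = mem_accesses * miss_rate * miss_penalty
cpu_time = (instructions * cpi_base + memory_stall_cycles) * clock_cycle_time

print(memory_stall_cycles)    # 2,600,000 stall cycles
print(cpu_time)               # (1,000,000 + 2,600,000) * 1 ns = 3.6 ms
```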
Review: Four Questions for Memory Hierarchy
Designers
• Q1: Where can a block be placed in the upper level?
(Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)