Chapter 8: Part II
Storage, Network and Other Peripherals
Performance Analysis: Sync. vs. Async.
- Synchronous bus: clock cycle time = 50 ns; each bus transaction takes one clock cycle
- Asynchronous bus: 40 ns per handshake
- Data portion of both buses: 32 bits
- Question: Find the bandwidth of each bus when performing one-word reads from a 200 ns memory.

Sync. vs. Async. Buses (I)

For the synchronous bus:
1. Send the address to memory: 50 ns
2. Read the memory: 200 ns
3. Send the data to the device: 50 ns

Total time = 300 ns
Bandwidth = 4 bytes / 300 ns = 13.3 MB/s
Sync. vs. Async. Buses (II)

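For the asynchronous bus (steps are the seven steps of the handshaking protocol):
- Step 1: 40 ns
- Steps 2, 3, 4: max(3 x 40 ns, 200 ns) = 200 ns (the memory read overlaps these handshakes)
- Steps 5, 6, 7: 3 x 40 ns = 120 ns

Total time = 360 ns
Maximum bandwidth = 4 bytes / 360 ns = 11.1 MB/s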
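The two bandwidth figures can be reproduced with a few lines of Python. This is only a sketch of the arithmetic on the slides: the function names are made up for illustration, and 1 MB is taken as 10^6 bytes, which is what the slide's 13.3 MB/s and 11.1 MB/s imply.

```python
def sync_read_time_ns(clock_ns: float, memory_ns: float) -> float:
    """Address cycle + memory read + data cycle on the synchronous bus."""
    return clock_ns + memory_ns + clock_ns


def async_read_time_ns(handshake_ns: float, memory_ns: float) -> float:
    """Seven-step handshake: step 1, then steps 2-4 overlapped with the
    memory read, then steps 5-7."""
    step_1 = handshake_ns
    steps_2_to_4 = max(3 * handshake_ns, memory_ns)
    steps_5_to_7 = 3 * handshake_ns
    return step_1 + steps_2_to_4 + steps_5_to_7


def bandwidth_mb_per_s(bytes_moved: int, time_ns: float) -> float:
    """Bytes per nanosecond scaled to MB/s (assumes 1 MB = 10^6 bytes)."""
    return bytes_moved / time_ns * 1e3


sync_ns = sync_read_time_ns(clock_ns=50, memory_ns=200)
async_ns = async_read_time_ns(handshake_ns=40, memory_ns=200)
sync_bw = bandwidth_mb_per_s(4, sync_ns)
async_bw = bandwidth_mb_per_s(4, async_ns)

print(f"synchronous:  {sync_ns:.0f} ns per read, {sync_bw:.1f} MB/s")
print(f"asynchronous: {async_ns:.0f} ns per read, {async_bw:.1f} MB/s")
```

Changing `memory_ns` shows where the asynchronous bus would win: once the memory is faster than three handshakes (120 ns), the `max` term stops being dominated by the memory access.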
Increasing Bus Bandwidth
- Data bus width
- Separate versus multiplexed address and data lines
- Block transfers

Performance Analysis of Two Bus Schemes

Given a system with
- a memory and bus system supporting block access of 4 to 16 words (32-bit words)
- a 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer taking 1 clock cycle, and 1 clock cycle to send an address to memory
- two idle clock cycles needed between bus operations
- a memory access time of 200 ns for the first 4 words, with each additional set of 4 words requiring 20 ns
Question
- Find the sustained bandwidth and latency for a read of 256 words for transfers using 4-word blocks and 16-word blocks.
- Find the effective number of bus transactions per second for each case.

4-Word Block Transfer
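For each 4-word block:
- 1 clock cycle to send the address to memory
- 200 ns / (5 ns/cycle) = 40 cycles to read memory
- 2 cycles to send the data from memory (4 words over a 64-bit bus)
- 2 idle cycles
- Total = 45 cycles

A 256-word read requires 256/4 = 64 transactions, i.e. 45 x 64 = 2880 cycles.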

4-Word Block Transfer
- Latency = 2880 cycles x 5 ns/cycle = 14400 ns
- Number of bus transactions per second = 64 x 1 s / 14400 ns = 4.44M transactions/s
- Bandwidth = (256 x 4 bytes) x 1 / 14400 ns = 71.11 MB/s

16-Word Block Transfer
- 1 clock cycle to send the address to memory
- 40 cycles to read the first 4 words from memory
- 2 cycles to send the data, during which the read of the next 4 words is started
- 2 idle cycles between transfers, during which the read of the next 4 words is completed
- The last two steps are repeated 3 more times (4 times in all) to read a total of 16 words
16-Word Block Transfer
- Total cycles required per 16-word block = 1 + 40 + 4 x (2 + 2) = 57 cycles
- 256/16 = 16 transactions are required
- Total number of cycles required for 256 words = 16 x 57 = 912 cycles; latency = 912 x 5 ns = 4560 ns
- Number of bus transactions per second = 16 x 1 s / 4560 ns = 3.51M transactions/s
- Bandwidth = (256 x 4 bytes) x 1 / 4560 ns = 224.56 MB/s
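Both block sizes follow the same cycle accounting, so the comparison can be sketched as one parameterized function. The function name and the dictionary keys are illustrative, not part of the slides. The sketch assumes 32-bit words, 1 MB = 10^6 bytes, and that the 20 ns access for each further group of 4 words is fully hidden behind the 2 transfer cycles plus 2 idle cycles of the previous group.

```python
CLOCK_NS = 5      # 200 MHz bus clock -> 5 ns per cycle
WORD_BYTES = 4    # 32-bit words, matching the slides' "256 x 4 bytes"


def block_read(total_words: int, block_words: int) -> dict:
    """Cycle count, latency, transaction rate and bandwidth for one read."""
    address_cycles = 1
    first_access_cycles = 200 // CLOCK_NS   # 40 cycles for the first 4 words
    groups = block_words // 4               # 4-word groups in one block
    # Every group costs 2 transfer cycles + 2 idle cycles; the 20 ns
    # (4-cycle) access for the next group overlaps with those 4 cycles.
    cycles_per_block = address_cycles + first_access_cycles + groups * (2 + 2)
    transactions = total_words // block_words
    total_cycles = transactions * cycles_per_block
    latency_ns = total_cycles * CLOCK_NS
    return {
        "cycles_per_block": cycles_per_block,
        "transactions": transactions,
        "total_cycles": total_cycles,
        "latency_ns": latency_ns,
        "transactions_per_s": transactions / (latency_ns * 1e-9),
        "bandwidth_mb_s": total_words * WORD_BYTES / latency_ns * 1e3,
    }


r4 = block_read(total_words=256, block_words=4)
r16 = block_read(total_words=256, block_words=16)
for name, r in (("4-word", r4), ("16-word", r16)):
    print(f"{name}: {r['latency_ns']} ns, "
          f"{r['transactions_per_s'] / 1e6:.2f}M transactions/s, "
          f"{r['bandwidth_mb_s']:.2f} MB/s")
```

The 16-word case is roughly three times faster because the 40-cycle initial memory access is paid 16 times instead of 64 times.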
Bus Arbitration

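- Daisy chain arbitration (simple, but not very fair: a low-priority device can be locked out by higher-priority devices earlier in the chain)
- Centralized arbitration (requires an arbiter), e.g., PCI
- Self selection, e.g., NuBus used in Macintosh
- Collision detection, e.g., Ethernet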
Bus Standards
- PCI (a general-purpose backplane bus)
- SCSI (Small Computer System Interface)
- IEEE 1394 (Firewire)
- USB 2.0

Characteristic        Firewire (1394)             USB 2.0
Bus width (signals)   4                           2
Clocking              asynchronous                asynchronous
Peak bandwidth        50 MB/s (Firewire 400),     0.2 MB/s, 1.5 MB/s,
                      100 MB/s (Firewire 800)     or 60 MB/s
Hot pluggable         Yes                         Yes
Max. # of devices     63                          127
Max. bus length       4.5 m                       5 m
Interfacing I/O Devices
- How is a user I/O request transformed into a device command and communicated to the device?
- How is data actually transferred to or from a memory location?
- What is the role of the operating system?

Role of the OS

The OS plays a major role in handling I/O, in that:
- the I/O system is shared by multiple programs using the processor
- I/O systems often use interrupts, which cause a transfer to supervisor mode
- low-level control of I/O is complex
Communications between OS and I/O Devices
- The OS must be able to give commands to the I/O devices.
- An I/O device must be able to notify the OS when an operation has completed or an error has occurred.
- Data must be transferred between memory and an I/O device.

Giving Commands to I/O

To give a command, the processor must be able to address the device and to supply command words:
- memory-mapped I/O: portions of the address space are assigned to I/O devices
- special I/O instructions: dedicated I/O instructions in the processor
Communicating with the Processor
- Polling
- Interrupts
- DMA

Polling
- Polling: the processor periodically checks the status of an I/O device.
- Overhead of polling in an I/O system:
    - Example 1: mouse
    - Example 2: floppy disk
    - Example 3: hard disk
Mouse
- Assume the number of clock cycles for a polling operation, including transferring to the polling routine, accessing the device, and restarting the user program, is 400, with a 500 MHz clock.
- The mouse must be polled 30 times a second to ensure that no user movement is missed.
- Fraction of CPU time = 30 x 400 / (500 x 10^6) = 0.0024%, or about 0.002%
Floppy Disk
- The floppy disk transfers data to the processor in 16-bit units and has a data rate of 50 KB/s.
- Polling rate = (50 KB/s) / (2 bytes/poll) = 25K polls/s
- Fraction of CPU time = 25K x 400 / (500 x 10^6) = 2%

Hard Disk
- Transfers in 4-word blocks
- Transfer rate: 4 MB/s
- Polling rate = (4 MB/s) / (4 x 4 bytes/poll) = 250K polls/s
- Fraction of CPU time = 250K x 400 / (500 x 10^6) = 20%
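The three polling examples share one formula, which the following sketch makes explicit. The names are illustrative. It assumes 1 KB = 10^3 bytes and 1 MB = 10^6 bytes, as the slides' 25K and 250K polling rates imply.

```python
CLOCK_HZ = 500e6        # 500 MHz processor clock
CYCLES_PER_POLL = 400   # cost of one polling operation


def polling_fraction(polls_per_second: float) -> float:
    """Fraction of CPU cycles spent polling at the given rate."""
    return polls_per_second * CYCLES_PER_POLL / CLOCK_HZ


# The mouse rate is given directly; each disk must be polled once per
# transferred unit, so its rate is (data rate) / (bytes per poll).
rates = {
    "mouse": 30,
    "floppy disk": 50e3 / 2,     # 50 KB/s in 2-byte units
    "hard disk": 4e6 / (4 * 4),  # 4 MB/s in 4-word (16-byte) blocks
}
fractions = {name: polling_fraction(rate) for name, rate in rates.items()}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.4%} of CPU time")
```

The overhead scales linearly with the polling rate, which is why polling is acceptable for the mouse but costly for the hard disk.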

Overhead of Polling
- Polling can be restricted to the periods when the device is active, which reduces the overhead.
- However, the overhead is still significant, which motivates another design: interrupt-driven I/O.

Overhead of Interrupt-Driven I/O
- Assume the overhead for each transfer, including the interrupt, is 500 cycles.
- Cycles per second for the disk = 250K x 500 = 125 x 10^6 cycles
- Fraction of processor consumed = 125 x 10^6 / (500 x 10^6) = 25%
- Assuming the disk is transferring data 5% of the time, fraction of CPU on average = 25% x 5% = 1.25%
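A short sketch of the same calculation, with the share of time the disk is active as a parameter. The names are illustrative. The transfer rate of 250K per second is carried over from the hard disk example (4 MB/s in 16-byte blocks, with 1 MB = 10^6 bytes).

```python
CLOCK_HZ = 500e6               # 500 MHz processor clock
CYCLES_PER_INTERRUPT = 500     # overhead per transfer, including the interrupt
TRANSFERS_PER_SECOND = 4e6 / 16  # 250K transfers/s while the disk is active


def interrupt_fraction(active_share: float) -> float:
    """Average CPU fraction when the disk transfers `active_share` of the time."""
    while_active = TRANSFERS_PER_SECOND * CYCLES_PER_INTERRUPT / CLOCK_HZ
    return while_active * active_share


always_busy = interrupt_fraction(1.0)
five_percent = interrupt_fraction(0.05)
print(f"disk always transferring: {always_busy:.2%}")
print(f"disk active 5% of the time: {five_percent:.2%}")
```

Unlike polling, the processor pays only while transfers actually occur, so the average cost tracks the disk's activity.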
Direct Memory Access (DMA)
- If the disk is transferring data most of the time, the overhead of interrupt-driven I/O is still high.
- For a high-bandwidth device, let the device controller transfer data directly to or from memory without involving the processor; this is known as direct memory access.
- An interrupt is used only to signal the completion of the I/O transfer or an error.
- Note: How does DMA affect cache design?
Overhead of I/O Using DMA
- Assume the initial setup of a DMA transfer takes 1000 cycles, handling the interrupt at DMA completion takes 500 cycles, and the average transfer from disk is 8 KB.
- Each DMA transfer takes 8 KB / (4 MB/s) = 2 x 10^-3 s
- If the disk is constantly transferring data, it requires (1000 + 500) / (2 x 10^-3 s) = 750 x 10^3 cycles per second
- Fraction of CPU time = 750 x 10^3 / (500 x 10^6) = 0.15%
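The DMA figure can be checked the same way. This sketch uses illustrative names and assumes 1 KB = 10^3 bytes and 1 MB = 10^6 bytes, which is what the slide's 2 x 10^-3 s transfer time implies.

```python
CLOCK_HZ = 500e6                  # 500 MHz processor clock
SETUP_CYCLES = 1000               # processor work to start one DMA transfer
COMPLETION_CYCLES = 500           # interrupt handling when the transfer ends
TRANSFER_BYTES = 8e3              # average transfer of 8 KB
DISK_BYTES_PER_S = 4e6            # 4 MB/s disk

transfer_time_s = TRANSFER_BYTES / DISK_BYTES_PER_S
# With the disk constantly busy, one setup + one completion interrupt
# is paid every transfer_time_s seconds.
cycles_per_second = (SETUP_CYCLES + COMPLETION_CYCLES) / transfer_time_s
dma_fraction = cycles_per_second / CLOCK_HZ

print(f"time per DMA transfer: {transfer_time_s * 1e3:.1f} ms")
print(f"processor cycles per second: {cycles_per_second:,.0f}")
print(f"fraction of CPU time: {dma_fraction:.2%}")
```

Compared with the 25% cost of interrupt-driven I/O for a constantly busy disk, the processor is involved only once per 8 KB transfer rather than once per 16 bytes.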
I/O System Design
- Latency constraints: ensuring that the latency to complete an I/O operation is bounded.
- Bandwidth constraints
- Performance analysis techniques:
    - queuing theory
    - simulation
    - analysis

I/O System Design: Example
- CPU: 3 billion instructions per second (BIPS), with an average of 100,000 OS instructions per I/O operation
- Backplane bus transfer rate: 1000 MB/s
- SCSI Ultra320 controller with a transfer rate of 320 MB/s, accommodating up to 7 disks
- Disk bandwidth = 75 MB/s; seek + rotational latency = 6 ms
- Workload: 64 KB reads; the user program needs 200,000 instructions per I/O
Example

Find
- the maximum sustainable I/O rate
- the number of disks and SCSI controllers required.
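The slides do not show the solution, so the following is only one possible way to work the example, under assumptions that are not stated in the transcript: the CPU and the backplane bus are the two candidate bottlenecks, each disk handles one I/O at a time, the time per disk I/O is latency plus transfer time, and 1 KB = 10^3 bytes. All names are illustrative.

```python
import math

CPU_INSTR_PER_S = 3e9                 # 3 BIPS
INSTR_PER_IO = 200_000 + 100_000      # user program + OS instructions
BUS_BYTES_PER_S = 1000e6              # backplane bus, 1000 MB/s
IO_BYTES = 64e3                       # 64 KB reads (assumes 1 KB = 10^3 bytes)
DISK_BYTES_PER_S = 75e6               # 75 MB/s per disk
DISK_LATENCY_S = 6e-3                 # seek + rotational latency
SCSI_BYTES_PER_S = 320e6              # per controller
DISKS_PER_CONTROLLER = 7

# The slowest of CPU and bus sets the maximum sustainable I/O rate.
cpu_io_rate = CPU_INSTR_PER_S / INSTR_PER_IO
bus_io_rate = BUS_BYTES_PER_S / IO_BYTES
max_io_rate = min(cpu_io_rate, bus_io_rate)

# Assumption: one I/O at a time per disk, latency + transfer time each.
time_per_io_s = DISK_LATENCY_S + IO_BYTES / DISK_BYTES_PER_S
disks = math.ceil(max_io_rate * time_per_io_s)

# Check that 7 disks cannot saturate one controller before dividing.
controller_load = DISKS_PER_CONTROLLER * IO_BYTES / time_per_io_s
controllers = math.ceil(disks / DISKS_PER_CONTROLLER)

print(f"CPU limit: {cpu_io_rate:.0f} I/Os per second")
print(f"bus limit: {bus_io_rate:.0f} I/Os per second")
print(f"disks needed: {disks}, controllers needed: {controllers}")
print(f"load per full controller: {controller_load / 1e6:.1f} MB/s")
```

Under these assumptions the CPU is the bottleneck at 10,000 I/Os per second, which calls for 69 disks and 10 controllers; a full controller carries well under its 320 MB/s limit, so the controllers do not constrain the rate.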
Real Stuff: Buses and Network of P4
Intel P4 I/O Chip Sets
A Digital Camera
SoC (System on a chip)