Lecture 13: Architectural Support for OS & I/O
Peng Liu [email protected]

Last time in Lecture 12
• Virtual Memory
 – Large address space for each executing program
 – Memory management for multiple programs

Virtual Memory in a Nutshell (Review)
• Use hard disk (or Flash) as large storage for the data of all programs
 – Main memory (DRAM) is a cache for the disk
 – Managed jointly by hardware and the operating system (OS)
• Each running program has its own virtual address space
 – Address space as shown in previous figure
 – Protected from other programs
• Frequently-used portions of the virtual address space are copied to DRAM
 – DRAM = physical address space
 – Hardware + OS translate virtual addresses (VA) used by the program into physical addresses (PA) used by the hardware
 – Translation enables relocation (DRAM ↔ disk) & protection

Terminology for Virtual Memory (Review)
• Virtual memory uses DRAM as a cache for disk
• New terms
 – A VM block is called a "page"
  • The unit of data moving between disk and DRAM
  • Typically larger than a cache block (e.g., 4KB or 16KB)
  • Virtual and physical address spaces can be divided up into virtual pages and physical pages (e.g., contiguous chunks of 4KB)
 – A VM miss is called a "page fault"
  • If the valid bit for a virtual page is off, a page fault occurs
  • Once the operating system gets control, it must find the page in the next level of the hierarchy and decide where to place the requested page in main memory

Page Faults (Similar to "Cache Misses") (Review)
• What if an object is on disk rather than in memory?
– The page table entry indicates the virtual address is not in memory
 – The OS exception handler is invoked to move data from disk into memory
  • The OS has full control over placement
  • Full associativity to minimize future misses
[Figure: before and after a page fault — CPU virtual addresses 0..N-1 map through the page table either to physical memory addresses 0..P-1 or to disk]

Using Separate Address Spaces Per Program (Review)
• Each program has its own virtual address space and its own page table
 – Address 0x400000 in different programs can map to different locations or to the same location, as desired
 – The OS controls how virtual pages are assigned to physical memory

Translation Process Explained (Review)
• Valid page
 – Check access rights (R, W, X) against the access type
  • Generate the physical address if allowed
  • Generate a protection fault (exception) if the access is illegal
• Invalid page
 – The page is not currently mapped and a page fault is generated
• Faults are handled by the operating system
 – Sometimes due to a program error => program terminated
  • E.g.
accessing out of the bounds of an array
 – Sometimes due to "caching" => refill & restart
  • The desired data or code is available on disk
  • Space is allocated in DRAM, the page is copied from disk, and the page table is updated
  • Replacement may be needed

VM: Replacement and Writes (Review)
• To reduce the page fault rate, the OS uses least-recently used (LRU) replacement
 – The reference bit (aka use bit) in the PTE is set to 1 on access to a page
 – Periodically cleared to 0 by the OS
 – A page with reference bit = 0 has not been used recently
• Disk writes take millions of cycles
 – Write a block at once, not individual locations
 – Write-through is impractical
 – Use write-back
 – The dirty bit in the PTE is set when the page is written

Fast Translation Using a TLB (Review)

Recap: Virtual Memory (Review)
• Use hard disk (or Flash) as large storage for the data of all programs
 – Main memory (DRAM) is a cache for the disk
 – Managed jointly by hardware and the operating system (OS)
• Each running program has its own virtual address space
 – Address space as shown in previous figure
 – Protected from other programs
• Frequently-used portions of the virtual address space are copied to DRAM
 – DRAM = physical address space
 – Hardware + OS translate virtual addresses (VA) used by the program into physical addresses (PA) used by the hardware
 – Translation enables relocation & protection

Page Table Issues & Solutions
• Page table access latency
 – Use a TLB to cache page table entries close to the processor
 – Limit the size of the L1 cache to overlap TLB & cache accesses
• TLB coverage
 – Larger pages
 – Multi-level TLBs
• Page table size
 – Multi-level page tables

Architectural Support for Operating Systems (OS)
• Operating system (OS)
 – Manages hardware resources
  • Processor, main memory (DRAM), I/O devices
 – Provides virtualization
  • Each program thinks it has exclusive access to resources
 – Provides protection, isolation, and sharing
  • Between user programs, and between programs and the OS
• Examples of operating systems
 – Windows (XP, Vista, Win7), MacOS, Linux, BSD Unix, Solaris...
– Symbian, Windows Mobile, Android

Processes
• Definition: a process is an instance of a running program
• A process provides each program with two key abstractions
 – Logical control flow
  • Illusion of exclusive use of the processor
  • Private set of register values (including the PC)
 – Private address space
  • Illusion of exclusive use of main memory of "infinite" size
• How are these illusions maintained?
 – Process execution is interleaved (multitasking)
  • On available processors
 – The address space is managed by the virtual memory system

Execution of Processes
• Each process has its own logical control flow
 – A & B are concurrent processes; B & C are sequential
 – Control flows for concurrent processes are physically disjoint in time
 – However, one can think of concurrent processes as running in parallel

Process Address Space
• Each process has its own private address space (code, data, heap, stack)
 – 0x8000_0000-0xFFFF_FFFF: kernel (OS) memory, invisible to user code
 – User stack (created at runtime), growing down from $sp (stack pointer)
 – Run-time heap (managed by malloc), growing up toward brk
 – 0x1000_0000: read/write segment (.data, .bss)
 – 0x0040_0000: read-only segment (.text), loaded from the executable file
 – 0x0000_0000-0x0040_0000: unused
• User processes cannot access the top region of memory
 – It is used by the OS kernel
 – The kernel is the shared core of the OS

Supporting Concurrent Processes through Context Switching
• Processes are managed by the OS kernel
 – Important: the kernel is not a separate process, but rather runs as part of some user process
• Control flow passes from one process to another via a context switch
• Time between context switches: 10-20 ms

HW Support for the OS
• Mechanisms to protect the OS from processes
 – Modes + virtual memory
• Mechanisms to switch control flow between the OS and processes
 – System calls + exceptions
• Mechanisms to protect processes from each other
 – Virtual memory
• Mechanisms to interact with I/O devices
 – Primarily memory-mapped I/O

Hardware Modes (Protecting the OS from Processes)
• 2 modes are
needed, but some architectures have more
• User mode: used to run user processes
 – Accessible instructions: the user portion of the ISA
  • The instructions we have seen so far
 – Accessible state: user registers and memory
  • But virtual memory translation is always on
  • Cannot access the EPC, Cause, ... registers
 – Exceptions and interrupts are always on
• Kernel mode: used to run (most of) the kernel
 – Accessible instructions: the whole ISA
  • User + privileged instructions
 – Accessible state: everything
 – Virtual memory, exceptions, and interrupts may be turned off

Altering the Control Flow: System Calls
• So far, we have branches and jumps
 – They implement control flow within a program
• Expected switch to the OS: the system call instruction (syscall)
 – A jump (or function call) to OS code
  • E.g., in order to perform a disk access
 – Also similar to a program-initiated exception
  • Switches the processor to kernel mode & disables interrupts
  • Jumps to a predefined place in the kernel
• Returning from a syscall: use the eret instruction (privileged)
 – Switches to user mode & enables interrupts
 – Jumps to the program counter indicated by the EPC

Altering the Control Flow: Exceptions & Interrupts
• Exceptions & interrupts implement unexpected switches to the OS
 – Exceptions: due to a problem in the program
  • E.g., divide by zero, illegal instruction, page fault
 – Interrupts: due to an external event
  • An I/O event, hitting ctrl+c, a periodic timer interrupt
• Exceptions & interrupts operate similarly to system calls
 – Set the EPC & Cause registers
 – Switch to kernel mode, turn off interrupts
 – Jump to a predefined part of the OS
 – Return to the user program using the eret instruction

Dealing with Switches to the OS
• A syscall, exception, or interrupt transfers control to the OS kernel
• Kernel action (starting in the exception handling code)
 – Examine the issue & correct it if possible
 – Return to the same process or to another process

A Simple Exception Handler
• Can't survive nested exceptions
 – Since it does not re-enable
interrupts
• Does not use any user registers
 – No need to save them

Exception Example #1
• Memory reference
 – User writes to a memory location
 – That virtual page is currently on disk
 – The OS must load the page into memory
  • Update the page table & TLB
 – Return to the faulting instruction
 – Successful on the second try

Exception Example #2
• Memory reference
 – User writes to a memory location
 – The address is not valid or the access rights are wrong
 – The OS decides to terminate the process
  • Sends the SIGSEGV signal to the user process
  • The user process exits with "segmentation fault"
 – The OS switches to another active process

Using Virtual Memory & Exceptions: Page Sharing
• Example: the Unix fork() system call
 – Creates a 2nd process which is a clone of the current one
 – How can we avoid copying all of the address space?
  • Expensive in time and space
• Idea: share pages in physical memory
 – Each process gets a separate page table (separate virtual space)
 – But both PTs point to the same physical pages
 – Saves time and space on fork()
• The same approach supports any type of sharing
 – Two processes running the same program
 – Two processes that share data

Page Sharing

Input/Output

Diversity of Devices
• I/O devices can be characterized by
 – Behavior: input, output, storage
 – Partner: human or machine
 – Data rate: bytes/sec, transfers/sec

I/O System Design Goals
• Performance measures
 – Latency (response time)
 – Throughput (bandwidth)
• Dependability is important
 – Resilience in the face of failures
 – Particularly for storage devices
• Expandability
• Computer classes
 – Desktop: response time and diversity of devices
 – Server: throughput, expandability, failure resilience
 – Embedded: cost and response time

Throughput vs.
Response Time
• Throughput
 – Aggregate measure of the amount of data moved per unit time, averaged over a window
  • Measured in bytes/sec or transfers/sec
 – Sometimes referred to as bandwidth
  • Examples: memory bandwidth, disk bandwidth
• Response time
 – The time to do a single I/O operation
  • Measured in seconds or cycles
 – Sometimes referred to as latency
  • Example: write a block of bytes to disk
  • Example: send a data packet over the network

Throughput - Response Time Tradeoffs
• Throughput is the number of tasks completed by the server per unit time
• Response time is the elapsed time between a task entering the queue and the task being completed by the server
• There is a tradeoff between throughput and response time
 – The highest possible throughput is achieved when the server is always busy and the queue is never empty
 – The fastest response time is achieved when the queue is empty and the server is idle when the task arrives

Magnetic Hard Disks
• Characteristics
 – Long-term, nonvolatile storage
 – Large, inexpensive, but slow (mechanical device)
• Usage
 – Virtual memory (swap area)
 – File system

Disk Storage
• Nonvolatile, rotating magnetic storage

Hard Disk
• Basic operation
 – A rotating platter coated with a magnetic surface
 – A moving read/write head used to access the disk
• Features of hard disks
 – Platters are rigid (ceramic or metal)
 – High density, since the head can be controlled more precisely
 – High data rate with higher rotational speed
 – Can include multiple platters
• Incredible improvements
 – Examples of I/O device performance being technology driven
 – Capacity: 2x every year
 – Transfer rate: 1.4x every year
 – Price approaching $1/GB (GB = 10^9 bytes)

Hard Disk Organization
• >= 1 platter per disk, with disk heads on both sides
• Platters are divided into concentric tracks
• Each track is divided into sectors
 – The unit of transfer to/from the disk (i.e., the disk block)
 – Some disks have a constant number of sectors per track
 – Others keep a constant bit density, which places more sectors on outer tracks
• A common track
across multiple platters is referred to as a cylinder

Measuring Disk Access Time
• Each read or write has three major components
 – Seek time: the time to position the arm over the proper track
 – Rotational latency: the wait for the desired sector to rotate under the read/write head
 – Transfer time: the time required to transfer a block of bits (a sector) under the read/write head
• Note that these represent only the "raw performance" of the disk
 – Other components: I/O bus, controllers, other caches, interleaving, etc.

Seek Time
• Seek time is the time to position the arm over the proper track
• Average seek time is the time it takes to move the read/write head from its current position to a track on the disk
• The industry definition is the time for all possible seeks divided by the number of possible seeks
 – In practice, locality reduces this to 25-33% of this number
 – Note that some manufacturers report minimum seek times rather than average seek times

Rotational Latency
• Rotational latency is the time spent waiting for the desired sector to rotate under the read/write head
• Based on the rotational speed, usually measured in revolutions per minute (RPM)
• Average rotational latency = 0.5 rotation / rotation rate
• Example, at 7200 RPM:
 Average rotational latency = 0.5 rotation / (7200 RPM / (60 sec/min)) = 4.2 ms

Transfer Time
• Transfer time is the time required to transfer a block of bits
• A function of the transfer size, rotational speed, and recording density
 – The transfer size is usually a sector
• Most drives today use caches to help buffer the effects of seek time and rotational latency

Typical Hard Drive
• Typical hard disk drive parameters
 – Rotation speed: 3600, 5200, 7200, or 15000 RPM
 – Tracks per surface: 500-2000 tracks
 – Sectors per track: 32-128 sectors
 – Sector size: 512B-1024B
 – Minimum seek time: often approximately 0.1 ms
 – Average seek time: often approximately 5-10 ms
 – Access time is often
approximately 9-10 ms
 – Transfer rate: often 50-200 MB/s

Average Access Example
• Consider the Seagate 36.4GB Ultra SCSI
 – Rotation speed: 10000 RPM
 – Sector size: 512B
 – Average seek time: 5.7 ms
 – Transfer rate: 24.5 MB/s
 – Controller overhead: 1 ms
• What is the average read time?

I/O System Design
• Satisfying a latency requirement
 – For time-critical operations
 – If the system is unloaded
  • Add up the latency of the components
• Maximizing throughput at steady state (loaded system)
 – Find the "weakest link" (the lowest-bandwidth component)
 – Configure it to operate at its maximum bandwidth
 – Balance the remaining components in the system

Buses
• A bus is a shared communication link that connects multiple devices
• A single set of wires connects multiple "subsystems", as opposed to a point-to-point link, which connects only two components together
• The wires connect in parallel, so a 32-bit bus has 32 wires of data

Advantages/Disadvantages
• Advantages
 – Broadcast capability of a shared communication link
 – Versatility
  • A new device can be added easily
  • Peripherals can be moved between computer systems that use the same bus standard
 – Low cost
  • A single set of wires is shared multiple ways
• Disadvantages
 – Communication bottleneck
  • The bandwidth of the bus can limit the maximum I/O throughput
 – Limited maximum bus speed, due to
  • The length of the bus
  • The number of devices on the bus
  • The need to support a range of devices with varying latencies and transfer rates

Bus Organization
• Bus components
 – Control lines
  • Signal the beginning and end of transactions
  • Indicate the type of information on the data lines
 – Data lines
  • Carry information between the source and the destination
  • Can include data, addresses, or complex commands
• Processor-memory bus (or front-side bus, or system bus)
 – Short, high-speed bus
 – Connects memory and processor directly
 – Designed to match the memory system and achieve the maximum memory-to-processor bandwidth (cache transfers)
 – Designed specifically for a given processor/memory system (proprietary)
•
I/O bus (or peripheral bus)
 – Usually long and slow
 – Connects devices to the processor-memory bus
 – Must match a wide range of I/O device performance characteristics
 – Industry standard

Synchronous versus Asynchronous
• Synchronous bus
 – Includes a clock in the control lines
 – Fixed protocol for communication relative to the clock
 – Advantages
  • Involves very little logic and can therefore run very fast
 – Disadvantages
  • Every device on the bus must run at the same clock rate
  • To avoid clock skew, the bus cannot be long if it is fast
 – Example: the processor-memory bus
• Asynchronous bus
 – No clock control line
 – Can easily accommodate a wide range of devices
 – No clock skew problems, so the bus can be quite long
 – Requires a handshaking protocol

Increasing Bus Bandwidth
• Several factors account for bus bandwidth
 – Wider bus width
  • Increasing the data bus width => more data per bus cycle
  • Cost: more bus lines
 – Separate address and data lines
  • An address and data can be transmitted in one bus cycle if separate address and data lines are available
  • Cost: more bus lines
 – Block transfers
  • Transfer multiple words in back-to-back bus cycles
  • Only one address needs to be sent at the start
  • The bus is not released until the last word is transferred
  • Costs: increased complexity, and increased response time for pending requests
 – Split transactions ("pipelining the bus")
  • Free the bus during the time between the request and the data transfer
  • Costs: increased complexity and higher potential latency
  • [Figure: a split-transaction bus with separate address and data wires]

Accessing the Bus
• How is the bus reserved by a device that wishes to use it?
• Master-slave arrangement
 – Only the bus master can control access to the bus
 – The bus master initiates and controls all bus requests
 – A slave responds to read and write requests
• A simple system
 – The processor is the only bus master
 – All bus requests must be controlled by the processor
 – The major drawback is that the processor must be involved in every transfer

Multiple Masters
• With multiple masters, arbitration must be used so that only one device is granted access to the bus at a given time
• Arbitration
 – A bus master wanting to use the bus asserts a bus request
 – The bus master cannot use the bus until its request is granted
 – The bus master must signal the arbiter when it is finished using the bus
• Bus arbitration goals
 – Bus priority: the highest-priority device should be serviced first
 – Fairness: the lowest-priority devices should not starve

Centralized Parallel Arbitration
• Advantages
 – Centralized control, where all devices submit requests
 – Any fair priority scheme can be implemented (FCFS, round-robin)
• Disadvantages
 – Potential bottleneck at the central arbiter

Operating System Tasks
• The OS is a resource manager and acts as the interface between I/O hardware and the programs that request I/O
• Several important characteristics of the I/O system:
 – The I/O system is shared by multiple programs
 – I/O systems often use interrupts to notify the CPU
  • Interrupts = externally generated "exceptions"
  • Typically "input available" or "output complete" messages
  • The OS handles interrupts by transferring control to kernel mode
 – Low-level control of an I/O device is complex
  • Managing a set of concurrent events
  • The requirements for correct device control are very detailed
  • I/O device drivers are the most common area for bugs in an OS!
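The round-robin arbitration mentioned above can be sketched as follows. This is an illustrative software model of the policy, not a hardware description; the function name and encoding of requests are invented for the example. The grant pointer always advances past the previous winner, which is what prevents low-priority masters from starving.

```python
# Round-robin bus arbitration sketch: scan masters starting just after the
# last one granted, and grant the first that is requesting the bus.

def arbitrate(requests, last_granted, n_masters):
    """requests: set of master ids currently asserting a bus request."""
    for i in range(1, n_masters + 1):
        candidate = (last_granted + i) % n_masters
        if candidate in requests:
            return candidate
    return None  # no requests pending

# Masters 0 and 2 both request; master 0 won last time, so 2 goes next.
grant = arbitrate({0, 2}, last_granted=0, n_masters=4)
print(grant)   # 2
```

A fixed-priority arbiter would always pick the lowest-numbered requester instead; round-robin trades a little extra logic for the fairness goal on the slide.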
OS Communication
• The operating system should prevent user programs from communicating with I/O devices directly
 – It must protect I/O resources to keep sharing fair
 – Protection of shared I/O resources cannot be provided if user programs could perform I/O directly
• Three types of communication are required
 – The OS must be able to give commands to I/O devices
 – An I/O device must be able to notify the OS when it has completed an operation or has encountered an error
 – Data must be transferred between memory and an I/O device

I/O Commands: A Method for Addressing a Device
• Memory-mapped I/O:
 – Portions of the address space are assigned to each I/O device
 – I/O addresses correspond to device registers
 – User programs are prevented from issuing I/O operations directly, since the I/O address space is protected by the address translation mechanism

I/O Commands
• I/O devices are managed by I/O controller hardware
 – Transfers data to/from the device
 – Synchronizes operation with software
• Command registers
 – Cause the device to do something
• Status registers
 – Indicate what the device is doing and the occurrence of errors
• Data registers
 – Write: transfer data to a device
 – Read: transfer data from a device

Communicating with the CPU
• Method #1: Polling
 – The I/O device places information in a status register
 – The OS periodically checks the status register
 – Whether polling is used depends on whether the device can initiate I/O independently
  • For instance, polling works well for a mouse, since it has a fixed I/O rate and initiates its own data (whenever it is moved)
  • For others, such as disk access, I/O only occurs under the control of the OS, so we poll only when the OS knows the device is active
 – Advantages
  • Simple to implement
  • The processor is in control and does the work
 – Disadvantage
  • Polling overhead and data transfer consume CPU time

Polling and Programmed I/O

I/O Notification
• Method #2: I/O Interrupt
 – When an I/O device needs attention, it interrupts the processor
 – The interrupt
must tell the OS about the event and which device raised it
  • Using "cause" register(s): the kernel "asks" what interrupted
  • Using vectored interrupts: a different exception handler for each device
 – I/O interrupts are asynchronous events, and can happen at any time
  • The processor waits until the current instruction is completed
 – Interrupts may have different priorities
 – Advantage: execution is only halted during the actual transfer
 – Disadvantage: the software overhead of interrupt processing

I/O Interrupts

Data Transfer
• The third component of I/O communication is the transfer of data from the I/O device to memory (or vice versa)
• Simple approach: "programmed" I/O
 – Software on the processor moves all data between memory addresses and I/O addresses
 – Simple and flexible, but wastes CPU time
 – Also, lots of excess data movement in modern systems
  • Mem -> Network -> CPU -> Network -> graphics
• So we need a solution that allows data transfer to happen without the processor's involvement

Delegating I/O: DMA
• Direct Memory Access (DMA)
 – Transfers blocks of data to or from memory without CPU intervention
 – Communication is coordinated by the DMA controller
  • DMA controllers are integrated in memory or I/O controller chips
 – The DMA controller acts as a bus master in bus-based systems
• DMA steps
 – The processor sets up the DMA transfer by supplying
  • The identity of the device and the operation (read/write)
  • The memory address for the source/destination
  • The number of bytes to transfer
 – The DMA controller starts the operation by arbitrating for the bus and then starting the transfer when the data is ready
 – The controller notifies the processor when the DMA transfer is complete or on an error
  • Usually using an interrupt

DMA Problems: Virtual vs.
Physical Addresses
• If DMA uses physical addresses
 – Memory accesses across physical page boundaries may not correspond to contiguous virtual pages (or even the same application)
• Solution 1: <= 1 page per DMA transfer
 – Chain a series of 1-page requests provided by the OS
 – A single interrupt at the end of the last DMA request in the chain
• Solution 2: the DMA engine uses virtual addresses
 – Multi-page DMA requests are now easy
 – A TLB is necessary for the DMA engine
• For DMA with physical addresses: pages must be pinned in DRAM
 – The OS should not page out to disk pages involved with pending I/O

I/O Summary
• I/O performance has to take into account many variables
 – Response time and throughput
 – CPU, memory, bus, I/O device
 – Workload, e.g., transaction processing
• I/O devices span a wide spectrum
 – Disks, graphics, and networks
• Buses
 – Bandwidth
 – Synchronization
 – Transactions
• OS and I/O
 – Communication: polling and interrupts
 – Handling I/O outside the CPU: DMA
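As a closing worked example, the rotational-latency formula and the Seagate average-read-time question from the disk slides can be checked numerically. The formulas come from the slides (average read time = average seek + average rotational latency + transfer time + controller overhead); the helper names are invented, and queueing delay is ignored, as on the slides.

```python
# Disk timing arithmetic from the lecture's formulas.

def avg_rotational_latency_ms(rpm):
    # Average rotational latency = 0.5 rotation / rotation rate.
    return 0.5 / (rpm / 60.0) * 1000.0

def avg_read_time_ms(rpm, sector_bytes, seek_ms, rate_mb_s, ctrl_ms):
    # avg read = avg seek + avg rotational latency + transfer + controller.
    transfer_ms = sector_bytes / (rate_mb_s * 1e6) * 1000.0
    return seek_ms + avg_rotational_latency_ms(rpm) + transfer_ms + ctrl_ms

# 7200 RPM drive: 0.5 rotation / (7200/60 rev/s) = 4.2 ms, as on the slide.
print(round(avg_rotational_latency_ms(7200), 1))               # 4.2

# Seagate 36.4GB Ultra SCSI: 5.7 + 3.0 + ~0.02 + 1.0 = ~9.72 ms.
print(round(avg_read_time_ms(10000, 512, 5.7, 24.5, 1.0), 2))  # 9.72
```

Note how little the 512B transfer itself contributes (~0.02 ms): the mechanical components (seek and rotation) dominate, which is why the slides emphasize them as the "raw performance" of the disk.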