Design Tradeoffs For Software-Managed TLBs
Authors: Nagle, Uhlig, Stanley, Sechrest, Mudge & Brown

Definition.
The virtual-to-physical address translation operation sits on the critical path between the CPU and the cache. If every memory request from the processor required one or more accesses to main memory (to read page table entries), the processor would be very slow. The TLB is a cache for page table entries: it works in much the same way as a data cache, storing recently accessed page table entries, and it is consulted on every address request issued by the CPU. Each TLB entry covers a whole page of physical memory, so a relatively small number of TLB entries covers a large amount of memory. This large coverage of main memory by each TLB entry is what gives TLBs their high hit rate.

TLB Types.
• Fully associative, in early TLB designs.
• Set associative, which is more common in newer designs.

The Problem.
This paper discusses software-managed TLB design tradeoffs and their interaction with a range of operating systems. Software management can impose considerable penalties, which are highly dependent on the operating system's structure and its use of virtual memory. Specifically, memory references that require mappings not present in the TLB result in misses that must be serviced either by hardware or by software.

Test Environment.
• DECstation 3100 with a MIPS R2000 processor.
• The R2000 contains a 64-entry fully associative TLB.
• The R2000 TLB hardware supports partitioning the TLB into two sets, an upper and a lower set.
• The lower set consists of entries 0-7 and is used for page table entries with slow retrieval.
• The upper set consists of entries 8-63 and contains the more frequently used level 1 user PTEs.
(A small C sketch of this partitioned TLB appears at the end of this section.)

Test Tools.
• Monster, a system analysis tool that makes it possible to monitor actual miss-handling costs in CPU cycles.
• Tapeworm, a TLB simulator compiled directly into the kernel so that it can intercept all of the actual TLB misses caused by both user-process and OS-kernel memory references. The TLB information that Tapeworm extracts from the running system is used to obtain TLB miss counts and to simulate different TLB configurations.

System Monitoring with Monster.
Monster is a hardware monitoring system comprising a monitored DECstation 3100, an attached logic analyzer, and a controlling workstation. It measures the amount of time taken to handle each TLB miss.

TLB Simulation with Tapeworm.
The Tapeworm simulator is built into the operating system and is invoked whenever there is a TLB miss. The simulator uses the real TLB misses to drive its own simulated TLB configuration.

Trace-Driven Simulation.
Trace-driven simulation is the traditional technique for studying the components of computer memory systems such as TLBs. It feeds a sequence of memory references to the simulation model to mimic the way a real processor might exercise the design.

Problems with Trace-Driven Simulation.
• It is difficult to obtain accurate traces.
• It consumes considerable processing and storage resources.
• It assumes that address traces are invariant to changes in the structural parameters of the simulated TLB.

Solution.
Compile the TLB simulator, Tapeworm, directly into the operating system kernel. This approach:
• accounts for all system activity, including multiple-process and kernel interactions;
• does not require address traces;
• considers all TLB misses, whether caused by user-level tasks or by the kernel.

Benchmarks, Operating Systems, Test Results.
OS Impact on Software-Managed TLBs: different operating systems gave different results, even though the same applications were run on each system.
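To make the mechanism concrete, here is a minimal C sketch of the kind of TLB described in the test environment above: 64 fully associative entries, with replacement confined to the upper partition. It is a hypothetical model written for these notes, not the paper's Tapeworm code; all names and constants are assumptions of the sketch.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define TLB_ENTRIES 64
#define LOWER_SLOTS 8      /* entries 0-7: pinned slots for slow-to-retrieve PTEs */
#define PAGE_SHIFT  12     /* assumes 4 KB pages */

typedef struct {
    uint32_t vpn;          /* virtual page number */
    uint32_t pfn;          /* physical frame number */
    bool     valid;
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];

/* Fully associative lookup: the VPN is compared against every entry. */
static bool tlb_lookup(uint32_t vaddr, uint32_t *pfn_out)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn_out = tlb[i].pfn;
            return true;   /* hit: translation proceeds at full speed */
        }
    }
    return false;          /* miss: a software handler must refill */
}

/* Refill after a miss. Random replacement is confined to the upper
   partition, so the lower (pinned) slots are never evicted. */
static void tlb_refill(uint32_t vaddr, uint32_t pfn_from_pte)
{
    int victim = LOWER_SLOTS + rand() % (TLB_ENTRIES - LOWER_SLOTS);
    tlb[victim].vpn   = vaddr >> PAGE_SHIFT;
    tlb[victim].pfn   = pfn_from_pte;
    tlb[victim].valid = true;
}

With 4 KB pages, the 64 entries of this model cover 64 × 4 KB = 256 KB of memory at once, which illustrates why even a small number of entries yields a high hit rate.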
There is a difference both in the number of TLB misses and in the total TLB service time.

Increasing TLB Performance.
• Additional TLB miss vectors.
• Increase the lower slots in the TLB partition.
• Increase the TLB size.
• Modify the TLB associativity.

TLB Miss Vectors.
• L1 User – miss on a level 1 user PTE.
• L1 Kernel – miss on a level 1 kernel PTE.
• L2 – miss on a level 2 PTE, after a level 1 user miss.
• L3 – miss on a level 3 PTE, after a level 1 kernel miss.
• Modify – protection violation (e.g. a write to a protected page).
• Invalid – page fault.
(A sketch of per-vector miss accounting appears after the Conclusion.)

TLB Miss Vector Results: Modifying the Lower TLB Partition.
• OSF/1: increasing from 4 to 5 lower slots decreases miss-handling time by 50%.
• Mach 3.0: performance continues to improve up to 8 slots.
• Microkernels benefit from a larger lower TLB partition because many system services (e.g. the Unix server on Mach 3.0) are mapped through L2 PTEs.

Increasing TLB Size.
• Achieved by building TLBs with additional upper slots.
• The most significant component is L1 kernel misses, due to the large number of mapped data structures in the kernel.
• Allowing the uTLB handler to service L1 kernel misses reduces the TLB service time.
• In each system there is a noticeable improvement in TLB service time as the TLB size increases.

Conclusion.
Software management of TLBs magnifies the importance of the interactions between TLBs and operating systems, because of the large variation in TLB miss service times that can exist. TLB behavior depends upon the kernel's use of virtual memory to map its own data structures, including the page tables themselves. TLB behavior also depends upon the division of service functionality between the kernel and separate user tasks.
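As a companion to the miss-vector taxonomy above, here is a hypothetical C sketch of the kind of per-vector bookkeeping a kernel-resident tool in the spirit of Tapeworm and Monster could maintain. The vector names follow the list above, but the cycle costs are placeholders, not measurements from the paper.

#include <stdio.h>

/* The six miss vectors described above. */
typedef enum { L1_USER, L1_KERNEL, L2, L3, MODIFY, INVALID, NUM_VECTORS } vec;

static const char *vec_name[NUM_VECTORS] =
    { "L1 user", "L1 kernel", "L2", "L3", "Modify", "Invalid" };

static unsigned long vec_count[NUM_VECTORS];   /* misses seen per vector  */
static unsigned long vec_cycles[NUM_VECTORS];  /* service time per vector */

/* Placeholder per-miss costs in CPU cycles. The real costs vary widely
   across vectors and operating systems, which is the paper's point. */
static const unsigned cost[NUM_VECTORS] = { 20, 20, 300, 300, 300, 300 };

void record_miss(vec v)
{
    vec_count[v]++;
    vec_cycles[v] += cost[v];
}

void report(void)
{
    for (int v = 0; v < NUM_VECTORS; v++)
        printf("%-9s  misses: %10lu  cycles: %12lu\n",
               vec_name[v], vec_count[v], vec_cycles[v]);
}

Because total service time is the sum of count × cost over all vectors, shifting misses from an expensive vector to a cheap one (for example, letting the fast uTLB handler service L1 kernel misses, or pinning L2 PTEs in the lower partition) can matter more than reducing the raw miss count.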