Effects of Clock Resolution on the Scheduling of Interactive and Soft Real-Time Processes

Yoav Etsion, Dan Tsafrir, Dror G. Feitelson
School of Computer Science and Engineering, The Hebrew University, 91904 Jerusalem, Israel
{etsman, dants, feit}@cs.huji.ac.il

ABSTRACT

It is commonly agreed that scheduling mechanisms in general purpose operating systems do not provide adequate support for modern interactive applications, notably multimedia applications. The common solution to this problem is to devise specialized scheduling mechanisms that take the specific needs of such applications into account. A much simpler alternative is to better tune existing systems. In particular, we show that conventional scheduling algorithms typically have only little and possibly misleading information regarding the CPU usage of processes, because increasing CPU rates have caused the common 100 Hz clock interrupt rate to be coarser than most application time quanta. We therefore conduct an experimental analysis of what happens if this rate is significantly increased. Results indicate that much higher clock interrupt rates are possible with acceptable overheads, and lead to much better information. In addition we show that increasing the clock rate can provide a measure of support for soft real-time requirements, even when using a general-purpose operating system. For example, we achieve a sub-millisecond latency under heavily loaded conditions.

1. INTRODUCTION

Contemporary computer workloads, especially on the desktop, contain a significant multimedia component: playing of music and sound effects, displaying video clips and animations, etc. These workloads are not well supported by conventional operating system schedulers, which prioritize processes according to recent CPU usage [18]. This deficiency is often attributed to the lack of specific support for real-time features, and to the fact that multimedia applications consume significant CPU resources themselves.

The common solution to this problem has been to design specialized programming APIs that enable applications to request special treatment, and schedulers that respect these requests [19, 8, 22]. For example, applications may be required to specify timing constraints such as deadlines. To support such deadlines, the conventional operating system scheduler has to be modified, or a real-time system can be used. While this approach solves the problem, it suffers from two drawbacks. One is price. Real-time operating systems are much more expensive than commodity desktop operating systems like Linux or Windows. The price reflects the difficulty of implementing industrial strength real-time scheduling. This difficulty, and the requirement for careful testing of all important scenarios, are the reasons that many interesting proposals made in academia do not make it into production systems. The other drawback is the need for specialized interfaces, which may reduce the portability of applications, and require a larger learning and coding effort.

An alternative is to stick with commodity desktop operating systems, and tune them to better support modern workloads. While this may lead to sub-optimal results, it has the important benefit of being immediately applicable to the vast majority of systems installed around the world. It is therefore worthwhile to perform a detailed analysis of this approach, including what can be done, what results may be expected, and what are its inherent limitations.
Categories and Subject Descriptors: D.4.1 [Process Management]: Scheduling; D.4.8 [Performance]: Measurements; C.4 [Performance of Systems]: Design studies

General Terms: Measurement, Performance

Keywords: Clock interrupt rate, Interactive process, Linux, Overhead, Scheduling, Soft real-time, Tuning

(Supported by a Usenix scholastic grant.)

1.1 Commodity Scheduling Algorithms

Prevalent commodity systems (as opposed to research systems) use a simple scheduler that has not changed much in 30 years. The basic idea is that processes are run in priority order. Priority has a static component (e.g., operating system processes have a higher initial priority than user processes) and a dynamic part. The dynamic part depends on CPU usage: the more CPU cycles used by a process, the lower its priority becomes. This negative feedback (running reduces your priority to run more) ensures that all processes get a fair share of the resources. CPU usage is forgotten after some time, in order to focus on recent activity and not on distant history.

While the basic ideas are the same, specific systems employ different variations. For example, in Solaris the priorities of processes that wake up after waiting for an event are set according to a table, and the allocated quantum duration is longer if the priority is lower [17]. In Linux the relationship goes the other way, with the same number serving as both the allocation and the priority [5, 4]. In Windows NT and 2000, the priority and quanta allocated to threads are determined by a set of rules rather than a formula, but the effect is the same [24]. For example, threads that seem to be starved get a double quantum at the highest possible priority, and threads waiting for I/O or user input also get a priority boost. In all cases, processes that do not use the CPU very much, such as I/O-bound processes, enjoy a higher priority for those short bursts in which they want it. This was sufficient for the interactive applications of twenty years ago. It is no longer sufficient for modern multimedia applications (a class of applications that did not exist when these schedulers were designed), because their CPU usage is relatively high.

1.2 The Resolution of Clock Interrupts

Computer systems have two clocks: a hardware clock that governs the instruction cycle, and an operating system clock that governs system activity. Unlike the hardware clock, the frequency of the system clock is not predefined: rather, it is set by the operating system on startup. Thus the system can decide for itself what frequency it wants to use. It is this tunability that is the focus of the present paper.

The importance of the system clock (also called the timer interrupt rate) lies in the fact that commodity systems measure time using this clock, including CPU usage and when timers should go off. The reason that timers are aligned with clock ticks is to simplify their implementation and bound the overhead. The alternative of setting a special interrupt for each timer event requires more bookkeeping and risks high overhead if many timers are set with very short intervals.

The most common frequency used today is 100 Hz: it is used in Linux, the BSD family, Solaris, the Windows family, and Mac OS X. This hasn't changed much in the last 30 years. For example, back in 1976 Unix version 6 running on a PDP-11 used a clock interrupt rate of 60 Hz [16]. Since that time the hardware clock rate has increased by about 3 orders of magnitude, from several megahertz to over 3 gigahertz [23]. As a consequence, the size of an operating system tick has increased a lot, and is now on the order of 10 million cycles or instructions. Simple interactive applications such as text editors don't require that many cycles per quantum(1), making the tick rate obsolete: it is too coarse for measuring the running time of an interactive process. For example, the operating system cannot distinguish between processes that run for a thousand cycles and those that run for a million cycles, because using 100 Hz ticks on a 1 GHz processor both look like 0 time.

(1) Interestingly, this same consideration has also motivated the approach of making the hardware clock slower, rather than making the operating system clock faster as we propose. This has the benefit of reducing power consumption [10].

A special case of time measurement is setting the time that a process may run before it is preempted. This duration, called the allocation quantum, is also measured in clock ticks. Changing the clock resolution therefore implicitly affects the quantum size. However, in reality these two parameters need not be correlated, and they can be set independently of each other. The question is how to set each one.

A related issue is providing support for soft real-time applications such as games with realistic video rendering, which require accurate timing down to several milliseconds. These applications require significant CPU resources, but in a fragmented manner, and are barely served by a 100 Hz tick rate. In some cases, the limited clock interrupt rate may actually prevent the operating system from providing required services. An example is given in Figure 1. This shows the desired and achieved frame rates of the Xine MPEG viewer showing 500 frames of a short clip that is already loaded into memory, when running on a Linux system with clock interrupt rates of 100 Hz and 1000 Hz. For this benchmark the disk and CPU power are not bottlenecks, and the desired frame rates can all be achieved. However, when using a 100 Hz system, the viewer repeatedly discards frames because the system does not wake it up in time to display them if the desired frame rate is 60 frames per second. This is an important deficiency, as 60 frames/sec is mandated by the MPEG standard.

[Figure 1: Desired and achieved frame rate for the Xine MPEG viewer, on systems with 100 Hz and 1000 Hz clock interrupt rates.]

Even finer timing services are required in other, non-desktop applications. Video rates of up to 1000 frames per second are used for recording high-speed events, such as vehicle crash experiments [26]. Similar high rates can also be expected for sampling sensors in various situations. Even higher rates are necessary in networking, for the implementation of rate-based transmission [25, 2]. Full utilization of a 100 Mb/s Fast Ethernet with 1500-byte packets requires a packet to be transmitted every 120 µs, i.e. 8333 times a second. On a gigabit link, the interval drops to 12 µs, and the rate jumps up to 83,333 times a second.
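To make the granularity gap concrete, here is a small back-of-the-envelope helper (our illustration, not part of the original measurements); the CPU frequencies, tick rates, and packet sizes are simply the example values quoted above. It prints how many CPU cycles fit into a single timer tick, and how many timer events per second rate-based transmission would need on the links just mentioned.

```c
#include <stdio.h>

int main(void)
{
    /* Example hardware clock rates (Hz) spanning the range discussed above. */
    const double cpu_hz[]  = { 90e6, 1e9, 3e9 };
    const double tick_hz[] = { 100.0, 1000.0, 20000.0 };   /* system clock rates */

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            printf("CPU %5.0f MHz, tick %5.0f Hz -> %12.0f cycles per tick\n",
                   cpu_hz[i] / 1e6, tick_hz[j], cpu_hz[i] / tick_hz[j]);

    /* Timer events per second needed for rate-based transmission of
       1500-byte (12,000-bit) packets on a fully utilized link. */
    const double link_bps[] = { 100e6, 1e9 };               /* Fast Ethernet, GbE */
    for (int i = 0; i < 2; i++) {
        double pkts_per_sec = link_bps[i] / (1500.0 * 8.0);
        printf("link %4.0f Mb/s -> one packet every %5.1f us (%6.0f per second)\n",
               link_bps[i] / 1e6, 1e6 / pkts_per_sec, pkts_per_sec);
    }
    return 0;
}
```

At 100 Hz, a tick on a 1 GHz processor spans ten million cycles, so a quantum of a few hundred thousand cycles is simply invisible to tick-based accounting.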
Increasing the clock interrupt rate may be expected to provide much better timing support than that available today with 100 Hz. However, this comes at the possible expense of additional overhead, and has therefore been discouraged by Linux developers (this will probably change, as the 2.5 development kernel has switched to 1000 Hz for the prevalent Intel architecture; in the past, such a rate was recommended only for the Alpha processor, which according to the kernel mailing list was "strong enough to handle it") and by Sun documentation ("exercise great care if you are considering setting high-resolution ticks(2) ... this setting should never, ever be made on a production system without extensive testing first" [17, p. 56]). Our goal is to investigate this tradeoff more thoroughly.

(2) This specifically means 1000 Hz.

1.3 Related Work

Other approaches to improving the soft real-time service provided by commodity systems include RT-Linux, one-shot timers, soft timers, firm timers, and priority adjustments.

The RT-Linux project uses virtual machine technology to run a real-time executive under Linux, only allowing Linux to run when there are no urgent real-time tasks that need the processor [3]. Thus Linux does not run on the native hardware, but on a virtual machine. The result is a juxtaposition of a hard real-time system and a Linux system. In particular, the real-time services are not available for the Linux processes, so real-time applications must be partitioned into two independent parts. However, communication between the two parts is supported.

One-shot timers do not have a pre-defined periodicity. Instead, they are set according to need. The system stores timer requests sorted by time. Whenever a timer event is fired, the system sets a timer interrupt for the next event. Variants of one-shot timers have been used in several systems, including the Pebble operating system, the Nemesis operating system for multimedia [15], and the KURT real-time system [25]. The problem is that this may lead to high overhead if many timing events are requested with fine resolution.

In soft timers the timing of system events is also not tied to periodic clock interrupts [2]. Instead, the system opportunistically makes use of convenient circumstances in order to provide higher-resolution services. For example, on each return from a system call the system may check whether any timer has become ready, and fire the respective events. As such opportunities occur at a much higher rate than the timer interrupts, the average resolution is much improved (in other words, soft timers are such a good idea specifically because the resolution of clock interrupts is so outdated). However, the timing of a specific event cannot be guaranteed, and the original low-resolution timer interrupts serve as a fallback. Using a higher clock rate, as we suggest, can guarantee a much smaller maximal deviation from the desired time.

Firm timers combine soft timers with one-shot timers [13]. This combination reduces the need for timer interrupts, alleviating the risk of excessive overheads. Firm timers together with a preemptible kernel and suitable scheduling have been shown to be effective in supporting time-sensitive applications on a commodity operating system.

Priority adjustments allow a measure of control over when processes will run, enabling the emulation of real-time services [1]. This is essentially similar to the implementation of hard real-time support in the kernel, except for the fact that it is done by an external process, and can only use the primitives provided by the underlying commodity system.

Finally, there are also various programming projects to improve the responsiveness and performance of the Linux kernel.
One is the preemptible kernel patch, which has been adopted as part of the 2.5 development kernel. It reduces interrupt processing latency by allowing long kernel operations to be preempted.

A major difference between the above approaches and ours is that they either require special APIs, make non-trivial modifications to the system, or both. Such modifications cannot be made by any user, and require a substantial review process before they are incorporated in standard software releases (if at all). For example, one-shot timers and soft timers have been known since the mid '90s, but are yet to be incorporated in a major system. By contradistinction, we focus on a single simple tuning knob, the clock interrupt rate, and investigate the benefits and the costs of turning it to much higher values than commonly done. Previous work on multimedia scheduling, with the exception of [19], has made no mention of the underlying system clock, and focused on designs for meeting deadline and latency constraints.

1.4 Preview of Results

Our goal is to show that increasing the clock interrupt rate is both possible and desirable. Measurements of the overheads involved in interrupt handling and context switching indicate that current CPUs can tolerate much higher clock interrupt rates than those common today (Section 3). We then go on to demonstrate the following:

- Using a higher tick rate allows the system to perform much more accurate billing, thus giving a better discrimination among processes with different CPU usage levels (Section 4).
- Using a higher tick rate also allows the system to provide a certain "best effort" style of real-time processing, in which applications can obtain high-resolution timing measurements and alarms (as exemplified in Figure 1, and expanded in Section 5).
- For applications that use time scales that are related to human perception, a modest increase in tick rate to 1000 Hz may suffice. Applications that operate at smaller time scales, e.g. to monitor certain sensors, may require much higher rates and shortening of scheduling quantum lengths (Section 7).

We conclude that improved clock resolution, and the shorter quanta that it makes possible, should be a part of any solution to the problem of scheduling soft real-time applications, and should be taken into account explicitly.

2. METHODOLOGY AND APPLICATIONS

Before presenting detailed measurement results, we first describe the experimental platform and introduce the applications used in the measurements.

2.1 The Test Platform

Most measurements were done on a 664 MHz Pentium III machine, equipped with 256 MB RAM, and a 3DFX Voodoo3 graphics accelerator with 16 MB RAM that supports OpenGL in hardware. In addition, we performed cross-platform comparisons using machines ranging from a Pentium 90 to a Pentium IV 2.4 GHz. The operating system is a 2.4.8 Linux kernel (RedHat 7.0), with the XFree86 4.1 X server. The same kernel was compiled for all the different architectures, which may result in minor differences in the generated code due to architecture-specific ifdefs. The default clock interrupt rate is 100 Hz. We modified the kernel to run at up to 20,000 Hz. The modifications were essentially straightforward, and involved extending kernel ifdefs to this range and correcting the calculation of bogomips(3).

(3) Bogomips are an estimate of the clock rate computed by the Linux kernel upon booting. The correction prevents division by zero in this calculation.

The measurements were conducted using klogger, a kernel logger we developed that supports fine-grain events. While the code is integrated into the kernel, its activation at runtime is controlled by applying a special sysctl call using the /proc file system. In order to reduce interference and overhead, logged events are stored in a sizeable buffer in memory (we typically use 4 MB), and only exported at large intervals. This export is performed by a daemon that wakes up every few seconds (the interval is reduced for higher clock rates to ensure that events are not lost). The implementation is based on inlined code to access the CPU's cycle counter and store the logged data. Each event has a 20-byte header including a serial number and timestamp with cycle resolution, followed by event-specific data. The overhead of each event is only a few hundred cycles (we estimate that at 100 Hz the overhead for logging is 0.63%, at 1000 Hz it is 0.95%, and at 20,000 Hz 1.18%). In our use, we log all scheduling-related events: context switching, recalculation of priorities, forks, execs, and changing the state of processes.
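The klogger code itself is part of our modified kernel and is not reproduced here. As a rough illustration of its key ingredient, the cycle-resolution timestamp, the following user-level sketch reads the IA-32 time-stamp counter with the rdtsc instruction and fills an approximate event record; the field layout is hypothetical and only mimics the 20-byte header described above.

```c
#include <stdint.h>
#include <stdio.h>

/* Read the IA-32 time-stamp counter (cycle counter). */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Approximate klogger-style event header: serial number plus cycle
   timestamp plus event type (the real kernel layout may differ). */
struct klog_event {
    uint32_t serial;
    uint64_t timestamp;     /* cycles, from rdtsc */
    uint32_t type;          /* e.g., context switch, fork, exec, ... */
} __attribute__((packed));

int main(void)
{
    struct klog_event ev = { .serial = 1, .timestamp = rdtsc(), .type = 0 };
    uint64_t later = rdtsc();
    printf("event %u logged at cycle %llu; logging cost ~%llu cycles\n",
           ev.serial,
           (unsigned long long)ev.timestamp,
           (unsigned long long)(later - ev.timestamp));
    return 0;
}
```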
2.2 The Workload

The system's behavior was measured with different clock rates and different workloads. The workloads were composed of the following applications:

- A classic interactive application: the Emacs text editor. During the test the editor was used for standard typing at a rate of about 8 characters per second.
- The Xine MPEG viewer, which was used to show a short video clip in a loop. Xine's implementation is multithreaded, making it a suitable representative of this growing class of applications [11]. Specifically, Xine uses 6 distinct processes. The two most important ones are the decoder, which reads the data stream from the disk and generates frames for display, and the displayer, which displays the frames at the appropriate rate. The displayer keeps track of time using alarms with a resolution of 4 ms. On each alarm it checks whether the next frame should be displayed, and if so, sends the frame to the X server. If it is too late, the frame is discarded. If it is very late, the displayer can also notify the decoder to skip certain frames. In the experiments, audio output was sent to /dev/null rather than to the sound card, to allow focus on interactions with the X server.
- Quake 3, which represents a modern interactive application (role playing game). Quake uses the X server's Direct Rendering Infrastructure (DRI) [21] feature, which enables the OpenGL graphics library to access the hardware directly, without proxying all the requests through the X server. This results in some of the graphics processing being done by the Graphical Processor Unit (GPU) on the accelerator. Another interesting feature of Quake is that it is adaptive: it can change its frame rate based on how much CPU time it gets. Thus when Quake competes with other processes, its frame rate will drop. In our experiments, when running alone it is always ready to run and can use all available CPU time.
- CPU-bound processes that serve as a background load that can absorb any number of available CPU cycles, and compete with the interactive and real-time processes.

In addition, the system ran a host of default processes, mostly various daemons. Of these, the most important with regard to interactive processes is obviously the X server.
3. CLOCK RESOLUTION AND OVERHEADS

A major concern regarding increasing the clock interrupt rate is the resulting increase in overheads: with more clock interrupts more time will be wasted on processing them, and there may also be more context switches (as will be explained below in Section 6), which in turn lead to reduced cache and TLB efficiency. This is the reason why today only the Alpha version of Linux employs a rate of 1024 Hz by default. This is compounded by the concern that operating systems in general become less efficient on machines with higher hardware clock rates [20]. We will show that these concerns are unfounded, and a clock interrupt rate of 1000 Hz or more is perfectly possible.

The overhead caused by clock interrupts may be divided into two parts: direct overhead for running the interrupt handling routine, and indirect overhead due to reduced cache and TLB efficiency. The direct overhead can easily be measured using klogger. We have performed such measurements on a range of Pentium processors with clock rates from 90 MHz to 2.4 GHz, and on an Athlon XP 1700+ at 1.467 GHz with DDR-SDRAM memory. The results are shown in Table 1. We find that the overhead for interrupt processing is dropping at a much slower rate than expected according to the CPU clock rate; in fact, it is relatively stable in terms of absolute time. This is due to an optimization in the Linux implementation of gettimeofday(), whereby overhead is reduced by accessing the 8253 timer chip on each clock interrupt, rather than when gettimeofday() itself is called, and extrapolating using the cycle counter register. This takes a constant amount of time and therefore adds overhead to the interrupt handling that is not related to the CPU clock rate. Even so, the overhead is still short enough to allow many more interrupts than are used today, up to an order of 10,000 Hz. Alternatively, by removing this optimization, the overhead of clock interrupt processing can be reduced considerably, to allow much higher rates. A good compromise might be to increase the clock interrupt rate but leave the rate at which the 8253 is accessed at 100 Hz. This will amortize the overhead of the off-chip access, thus reducing the overhead per clock interrupt.

Table 1: Interrupt processing overheads on different processor generations (average±standard deviation).

Processor   | Default: Cycles | Default: µs | Without 8253: Cycles | Without 8253: µs
P-90        | 814±180         | 9.02        | 498±466              | 5.53
PP-200      | 1654±553        | 8.31        | 462±762              | 2.32
PII-350     | 2342±303        | 6.71        | 306±311              | 0.88
PIII-664    | 3972±462        | 5.98        | 327±487              | 0.49
PIII-1.133  | 6377±602        | 5.64        | 426±914              | 0.38
PIV-2.4     | 14603±436       | 6.11        | 445±550              | 0.19
A1.467      | 10494±396       | 7.15        | 202±461              | 0.14

A related issue is the overhead for running the scheduler. More clock interrupts imply more calls to the scheduler. More serious is the fact that in Linux the scheduler overhead is proportional to the number of processes in the ready queue. However, this only becomes an important factor for very large numbers of processes. It is also partly offset by the fact that with more ready processes it takes longer to complete a scheduling epoch, and therefore priority recalculations are done less frequently.

As a side note, it is interesting to compare clock interrupt processing overhead to other types of overhead. Ousterhout has claimed that in general operating systems do not become faster as fast as hardware [20]. We have repeated some of his measurements on the platforms listed above.

Table 2: Other overheads on different processor generations (average±standard deviation).

Processor   | Context switch: Cycles | Context switch: µs | Cache BW: MB/s | Trap: Cycles | Trap: µs
P-90        | 1871±656               | 20.75              | 281            | 153±24       | 1.70
PP-200      | 1530±389               | 7.69               | 705±26         | 379±75       | 1.91
PII-350     | 1327±331               | 3.80               | 1314±29        | 343±68       | 0.98
PIII-664    | 1317±424               | 1.98               | 2512±32        | 348±163      | 0.52
PIII-1.133  | 1330±441               | 1.18               | 4286±82        | 364±278      | 0.32
PIV-2.4     | 3792±857               | 1.59               | 3016±47        | 1712±32      | 0.72
A1.467      | 1436±477               | 0.98               | 3962±63        | 274±20       | 0.19

The results (Table 2) show that the overhead for context switching (measured using two processes that exchange a byte via a pipe) takes roughly the same number of cycles, regardless of CPU clock speed (except on the P-IV, which is using DDR-SDRAM memory at 266 MHz and not the newer RDRAM). It therefore does become faster as fast as the hardware. We also found that the trap overhead (measured by the repeated invocation of getpid) and cache bandwidth (measured using memcpy) behave similarly. This is more optimistic than Ousterhout's results. The difference may be due to the fact that Ousterhout compared RISC vs. CISC architectures, and there is also a difference in methodology: we measure time and cycles directly, whereas Ousterhout based his results on performance relative to a MicroVAX II and on estimated MIPS ratings.

The indirect overhead of clock interrupt processing can only be assessed by measuring the total overhead in the context of a specific application (as was done, for example, in [2]). The application we used is sorting of a large array that occupies about half of the L2 cache (the L2 cache was 256 KB on all platforms except for the P-II 350, which had an L2 cache of 512 KB). The sorting algorithm was introsort, which is used by the STL that ships with gcc. The sorting was done repeatedly, where each iteration first initializes the array randomly and then sorts it (but the same random sequences were used to compare the different platforms). By measuring the time per iteration under different conditions, we can factor out the added total overhead due to additional clock interrupts (as is shown below). To also check the overhead caused by additional context switching among processes, we used different multiprogramming levels, running 1, 2, 4, or 8 copies of the test application at the same time. All this was repeated for different CPU generations with different (hardware) clock rates.

[Figure 2: Increase in overhead due to increasing the clock interrupt rate from a base case of 100 Hz, for multiprogramming levels of 1 and 8 processes. The basic quantum is 50 ms.]

Assuming that the amount of work to sort the array once is essentially fixed, measuring this time as a function of the clock interrupt rate will show how much time was added due to overhead. Figure 2 shows this added overhead as a percentage of the total time required at 100 Hz. From this we see that the added overhead at 1000 Hz is negligible, and even at 5000 Hz it is quite low. Note, however, that this is after removing the gettimeofday() optimization, i.e. without accessing the 8253 chip on each interrupt. For higher clock rates, the overhead increases linearly, with a slope that becomes flatter with each new processor generation (except for the P-IV). Essentially the same results are obtained with a multiprogramming level of 8. Thus we can expect higher clock interrupt rates to be increasingly acceptable.
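The factoring mentioned above ("as is shown below") amounts to simple arithmetic on the measured iteration times. The sketch below encodes that calculation, using as inputs the per-interrupt cost from Table 1 and the P-III 664 iteration times quoted later in this section; it is an illustration of the computation, not an additional measurement.

```c
#include <stdio.h>

/* Split the measured slowdown of the sort benchmark into direct overhead
   (interrupt handler invocations) and indirect overhead (cache effects and
   scheduler work), using the P-III 664 example: iteration times at 100 Hz
   and 10,000 Hz, and the per-interrupt cost without the 8253 access. */
int main(void)
{
    double t_base_ms  = 12.675;             /* time to sort once at 100 Hz     */
    double t_fast_ms  = 13.397;             /* time to sort once at 10,000 Hz  */
    double cost_us    = 0.49;               /* direct cost per clock interrupt */
    double extra_rate = 10000.0 - 100.0;    /* additional interrupts/second    */

    double extra_interrupts = extra_rate * (t_fast_ms / 1000.0);
    double direct_us   = extra_interrupts * cost_us;
    double total_us    = (t_fast_ms - t_base_ms) * 1000.0;
    double indirect_us = total_us - direct_us;

    printf("extra interrupts per iteration: %.0f\n", extra_interrupts);
    printf("direct overhead:   %.0f us\n", direct_us);
    printf("total overhead:    %.0f us\n", total_us);
    printf("indirect overhead: %.0f us (%.0f%% of total)\n",
           indirect_us, 100.0 * indirect_us / total_us);
    return 0;
}
```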
The overhead also depends on the length of the quanta, i.e. on how much time is allocated to a process each time it runs. In Linux, the default allocation is 50 ms, which translates to 5 ticks(4). When raising the clock interrupt rate, the question is whether to stick with the allocation of 50 ms, or to reduce it by defining the allocation in terms of ticks, so as to improve responsiveness. The results shown in Figure 2 were for 50 ms. Figure 3 shows the same experiments when using 5 ticks, meaning that the quanta are 10 or 100 times shorter when using 1000 Hz or 10,000 Hz interrupt rates, respectively. As shown in the graphs this leads to much higher overheads, especially under higher loads, probably because there are many more context switches. This may limit the realistic clock interrupt rate to 1000 Hz or a bit more, but probably not as high as 5000 Hz (in this case the P-IV is substantially better than the other platforms, but this is due to using performance relative to 100 Hz, which was worse than for other platforms for an unknown reason). Note, however, that 1000 Hz is an order of magnitude above what is common today, and already leads to significant benefits, as shown in subsequent sections; the added overhead in this case is just a few percentage points, much less than the 10-30% which were the norm a mere decade ago [7].

(4) The actual allocation is 5 ticks plus one, to ensure that the allocation is strictly positive, as the 5 is derived from the integral quotient of two constants.

[Figure 3: Increase in overhead due to increasing the clock interrupt rate from a base case of 100 Hz, for multiprogramming levels of 1 and 8 processes. Quanta are 6 clock ticks, so they become shorter for high clock rates.]

Our measurements also allow for an assessment of the relative costs of direct and indirect overhead. For example, when switching from 100 Hz to 10,000 Hz, the extra time can be attributed to 9900 additional clock interrupts each second. By subtracting the cost of 9900 calls to the interrupt processing routine (from Table 1), we can find how much of this extra time should be attributed to indirect overhead, that is mainly to cache effects. For example, consider the case of a P-III 664 MHz machine running a single sorting process with 50 ms quanta. The average time to sort an array once is 12.675 ms on the 100 Hz system, and 13.397 ms on the 10,000 Hz system. During this time the 10,000 Hz system suffered an additional 9900 × 0.013397 ≈ 133 interrupts. According to Table 1 the overhead for each one (without accessing the 8253 chip) is 0.49 µs, so the total additional overhead was 133 × 0.49 ≈ 65 µs. But the difference in the time to sort an array is 13397 − 12675 = 722 µs! Thus 722 − 65 = 657 µs are unaccounted for, and should be attributed to cache effects and scheduler overhead. In other words, 657/722 = 91% of the overhead is indirect, and only 9% is direct. This number is typical of many of the configurations checked. The indirect overhead on the P-IV and Athlon machines, and when using shorter quanta on all machines, is higher, and may even reach 99%. This means that the figures given in Table 1 should be multiplied by at least 10 (and in some extreme cases by as much as 100) to derive the real cost of increasing the clock interrupt rate.

4. CLOCK RESOLUTION AND BILLING

Practically all commodity operating systems use priority-based schedulers, and factor CPU usage into their priority calculations. CPU usage is measured in ticks, and is based on sampling: the process running when a clock interrupt occurs is billed for this tick. But the coarse granularity of ticks implies that billing may be inaccurate, leading to inaccurate information used by the scheduler.

The relationship between actual CPU consumption and billing on a 100 Hz system is shown at the top of Figure 4. The X axis in these graphs is the effective quantum length: the exact time from when the process is scheduled to run until when it is preempted or blocked. While the effective quantum tends to be widely distributed, billing is done in integral numbers of ticks. In particular, for Emacs and X the typical quantum is very short, and they are practically never billed!

[Figure 4: The relationship between effective quantum durations and how much the process is billed, for Xine, Quake, Emacs, and X (with Xine), using a kernel running at 100 Hz and at 1000 Hz. Concentrations of data points are rendered as larger disks; otherwise the graphs would have a clean steps shape, because the billing (Y axis) is in whole ticks. Note also that the optimal would be a diagonal line with slope 1.]

Using klogger, we can tabulate all the times each application is scheduled, for how much time, and whether or not this was billed. The data is summarized in Table 3. The billing ratio is the time for which an application was billed by the scheduler, divided by the total time actually consumed by it during the test. The miss percentage is the percentage of the application's quanta that were totally missed by the scheduler and not billed for at all.

Table 3: Scheduler billing success rate.

Application | Billing ratio: 100Hz | 1000Hz | Missed quanta: 100Hz | 1000Hz
Emacs       | 1.0746               | 0.9468 | 95.96%               | 73.42%
Xine        | 1.2750               | 1.0249 | 89.46%               | 74.81%
Quake       | 1.0310               | 1.0337 | 54.17%               | 23.23%
X Server*   | 0.0202               | 0.9319 | 99.43%               | 64.05%
CPU-bound   | 1.0071               | 1.0043 | 7.86%                | 7.83%
CPU+Quake   | 1.0333               | 1.0390 | 26.71%               | 2.36%
* When running Xine

The table shows that even though very many quanta are totally missed by the scheduler, especially for interactive applications, most applications are actually billed with reasonable accuracy in the long run. This is a result of the probabilistic nature of the sampling. Since most of the quanta are shorter than one clock tick, and the scheduler can only count in complete tick units, many of the quanta are not billed at all. But when a short quantum does happen to include a clock interrupt, it is over-billed and charged a full tick. On average, these two effects tend to cancel out, because the probability that a quantum includes a tick is proportional to its duration. The same averaging happens also for quanta that are longer than a tick: some are rounded up to the next whole tick, while others are rounded down.

A notable exception is the X server when running with Xine (we used Xine because it intensively uses the X server, as opposed to Quake which uses DRI). As shown below in Section 6, when running at 100 Hz this application has quanta that are either extremely short (around 68% of the quanta), or 0.8-0.9 of a tick (the remaining 32%). Given the distribution of quanta, we should expect over 30% of them to include a tick and be counted. But the scheduler misses over 99% of them, and only bills about 2% of the consumed time!
This turns out to be the result of synchronization with the operating system ticks. Specifically, the long quanta always occur after a very short quantum of a Xine process that was activated by a timer alarm. This is the displayer, which checks whether to display the next frame. When it decides that the time is right, it passes the frame to X. The X server then awakes and takes a relatively long time to actually display the frame, but just less than a full tick. As the timer alarm is carried out on a tick, these long quanta always start very soon after one tick, and complete just before the next tick. Thus, despite being nearly a tick long, they are hardly ever counted.

When running the kernel at 1000 Hz we can see that the situation improves dramatically: the effective quantum length, even for interactive applications, is typically several ticks long, so the scheduler bills the process an amount that reflects the actual consumed time much more accurately. In particular, on a 1000 Hz system X is billed for over 93% of the time it consumed, with the missed quanta percentage dropping to 64%, the fraction of quanta that are indeed very short.

An alternative to this whole discussion is of course the option to measure runtime accurately, rather than sampling on clock interrupts. This can be done easily by accessing the CPU cycle counter [6]. However, this involves modifying the operating system, whereas we are only interested in the effects obtainable by simple tuning of the clock interrupt rate.

5. CLOCK RESOLUTION AND TIMING

Increasing the kernel's clock resolution also yields a major benefit in terms of the system's ability to provide accurate timing services. Specifically, with a high-resolution clock it is possible to deliver high-resolution timer interrupts. This is especially significant for soft real-time applications such as multimedia players, which rely on timer events to keep correct time.

A striking example was given in the introduction, where it was shown that the Xine MPEG player was sometimes unable to display a movie at a rate of 60 frames per second (which is mandated by the MPEG standard). This is somewhat surprising, because the underlying system clock rate is 100 Hz, higher than the desired rate. The problem stems from the relative timing of the clock interrupts and the times at which frames are to be displayed. Xine operates according to two rules: it does not display a frame ahead of its time, and it skips frames that are late by more than half a frame duration. A frame will therefore be displayed only if the clock interrupt that causes Xine's timer signal to be delivered occurs in the first half of the frame's scheduled display time. In the case of 60 frames per second on a 100 Hz system, the smallest common multiple of the frame duration (1000/60 = 16 2/3 ms) and the clock interval (10 ms) is 50 ms. Such an interval is shown in Figure 5.

[Figure 5: Relationship of clock interrupts to frame display times that causes frames to be skipped. In this example the relative shift between the ticks and the desired frame display times is 5 5/6 ms, and frame 2 is skipped.]

In this example frame 2 will be skipped, because interrupt 2 is a bit too early, whereas interrupt 3 is already too late. In general, the question of whether this will indeed happen depends on the relative shift between the scheduled frame times and the clock interrupts. A simple inspection of the figure indicates that frame 1 will be skipped if the shift (between the first clock interrupt and the first frame) is in the range of 8 1/3 to 10 ms, frame 2 will be skipped for shifts in the range 5 to 6 2/3 ms, and frame 3 will be skipped for shifts in the range 1 2/3 to 3 1/3 ms, for a total of 5 ms out of the 10 ms between ticks. Assuming the initial shift is random, there is therefore a 50% chance of entering a pattern in which a third of the frames are skipped, leading to the observed frame rate of about 40 frames per second (in reality, though, this happens much less than 50% of the time, because the initial program startup tends to be synchronized with a clock tick).

To check this analysis we also tried a much more extreme case: running a movie at 50 frames per second on a 50 Hz system. In this case, either all clock interrupts fall in the first half of their respective frames, and all frames are shown, or else all interrupts fall in the second half of their frames, and all are skipped. And indeed, we observed runs in which all frames were skipped and the screen remained black throughout the entire movie.

The implication of the above is that the timing service has to have much finer resolution than that of the requests. For Xine to display a movie at 60 Hz, the timing service needs a resolution of 4 ms. This is required for the application to function correctly, not for the actual viewing, and therefore applies despite the fact that this clock resolution is much higher than the screen refresh rate.
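The shift analysis lends itself to a small simulation. The following sketch (our illustration of the argument, not code from Xine or the kernel) marks a frame as displayed only if some clock tick falls in the first half of its scheduled slot, and sweeps the initial shift for 60 frames per second on a 100 Hz clock.

```c
#include <stdio.h>

/* For a given shift (ms) between the tick train and the frame schedule,
   count how many of the frames in one 50 ms cycle get displayed: a frame
   is shown only if a tick falls in the first half of its display slot. */
static int frames_shown(double shift_ms)
{
    const double tick  = 10.0;           /* 100 Hz clock         */
    const double frame = 1000.0 / 60.0;  /* 16 2/3 ms per frame  */
    int shown = 0;

    for (int f = 0; f < 3; f++) {                /* 3 frames per 50 ms cycle */
        double start = f * frame;
        double half  = start + frame / 2.0;
        for (int t = 0; t < 6; t++) {            /* ticks covering the cycle */
            double tick_time = shift_ms + t * tick;
            if (tick_time >= start && tick_time < half) { shown++; break; }
        }
    }
    return shown;
}

int main(void)
{
    /* Sweep the initial shift over one tick period. */
    for (double s = 0.0; s < 10.0; s += 0.5)
        printf("shift %4.1f ms -> %d of 3 frames shown\n", s, frames_shown(s));
    return 0;
}
```

Sweeping the shift reproduces the figure's conclusion: for half of the possible shifts, one frame out of every three is dropped, which matches the observed drop from 60 to about 40 frames per second.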
6. CLOCK RESOLUTION AND THE INTERLEAVING OF APPLICATIONS

Recall that we define the effective quantum length to be the interval from when a process is scheduled until it is descheduled for some reason. On our Linux system, the allocation for a quantum is 50 ms plus one tick. However, as we can see from Figures 4 and 6 (introduced below), our interactive applications never even approach this limit. They are always preempted or blocked much sooner, often quite soon in their first tick. In other words, the effective quantum length is very short. This enables the system to support more than 100 quanta per second, even if the clock interrupt rate is only 100 Hz, as shown in Table 4. It also explains the success of soft timers [2].

Table 4: Average quanta per second achieved by each application when running in isolation.

Application          | Quanta/sec: 100Hz | 1000Hz
Emacs                | 22.36             | 34.60
Xine (all processes) | 470.67            | 695.94
Quake                | 187.88            | 273.85
X Server (w/Xine)    | 71.35             | 148.21
CPU-bound            | 28.81             | 38.97

The distributions of the effective quantum length for the different applications are shown in Figure 6, for 100 Hz and 1000 Hz systems. An interesting observation is that when running the kernel at 1000 Hz the effective quanta become even shorter. This happens because the system has more opportunities to intervene and preempt a process, either because it woke up another process that has higher priority, or due to a timer alarm that has expired. However, the total CPU usage does not change significantly (Table 5). Thus increasing the clock rate did not change the amount of computation performed, but the way in which it is partitioned into quanta, and the granularity at which the processes are interleaved with each other.

[Figure 6: Cumulative distribution plots of the effective quantum durations of the different applications (Xine, Emacs, X with Xine, Quake, and CPU-bound processes running alone and with Quake), at 100 Hz and 1000 Hz.]

Table 5: CPU usage distribution when running Xine.

Application | 100Hz  | 1000Hz
Xine        | 39.42% | 40.42%
X Server    | 20.10% | 20.79%
idle loop   | 31.46% | 31.58%
other       | 9.02%  | 7.21%

A specific example is provided by Xine. One of the Xine processes sets a 4 ms alarm, which is used to synchronize the video stream. In a 100 Hz system, the alarm signal is only delivered every 10 ms, because this is the size of a tick. But when using a 1000 Hz clock the system can actually deliver the signals on time. As a result the maximal effective quanta of X and the other Xine processes are reduced to 4 ms, because they get interrupted by the Xine process with the 4 ms timer.

Likewise, the service received by CPU-bound applications is not independent of the interactive processes that accompany them. To investigate this effect, these processes were measured alone and running with Quake. When running alone, their quanta are typically indeed an integral number of ticks long. Most of the time the number of ticks is less than the full allocation, due to interruptions from system daemons or klogger, but a sizeable fraction do achieve the allocated 50 ms plus one tick (which is an additional 10 ms at 100 Hz, but only 1 ms at 1000 Hz). But when Quake is added, the quanta of the CPU-bound processes are shortened to the same range as those of Quake, and moreover, they become less predictable. This also leads to an increase in the number of quanta that are missed for billing (Table 3), unless the higher clock rate of 1000 Hz is used.

7. TOWARDS BEST-EFFORT SUPPORT FOR REAL-TIME

In this section we set out to explore how close a general purpose system can come to supporting real-time processes in terms of timing delays, only by tuning the clock interrupt rate and reducing the allocated quanta. The metric that we use in order to perform such an evaluation is latency: the difference between the time at which an alarm requested by a process should expire, and the time at which this process was actually assigned a CPU. Without worrying about overhead (for the moment), our aim is to show that under loads of up to 8 processes, we can bound the latency to be less than 1 millisecond.

As there are very many types of soft real-time applications, we sample the possible space by considering three types of processes (a sketch of the corresponding measurement loop appears below):

1. BLK: A process repeatedly sets alarms without performing any type of computation. Our experiments involved processes that requested an alarm signal 500 times, with delays that are uniformly distributed between 1 and 1000 milliseconds.

2. N%: Same as BLK, with the difference that a process computes for a certain fraction (N%) of the time till the next alarm. Specifically, we checked computation of N = 1, 2, 4, and 8% of this interval. Note for example that a combination of 8 processes computing for 8% of the time leads to an average of 64% CPU utilization. To check what happens when the CPU is not left idle, we also added CPU-bound processes that do not set timers.

3. CONT: Same as N% where N = 100%, i.e. the process computes continuously.

For each of the above 3 types, we checked combinations of 1, 2, 4, and 8 processes. All the processes that set timers were assigned to the (POSIX) Round-Robin class. Note that a combination of more than one CONT process constitutes the worst-case scenario, because, contrary to the other workloads, the CPU is always busy and there are always alternative processes with similar priorities (in the Round-Robin queue) that are waiting to run.
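A minimal user-level rendition of the BLK workload might look as follows. This is an illustrative sketch rather than the actual test program: it assumes standard POSIX calls (sched_setscheduler, setitimer, sigwaitinfo, gettimeofday), needs root privileges to enter the Round-Robin class, and omits the computation phase used by the N% and CONT variants.

```c
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sched.h>
#include <sys/time.h>

int main(void)
{
    /* Put the process in the POSIX Round-Robin class (requires root). */
    struct sched_param sp = { .sched_priority = 1 };
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0)
        perror("sched_setscheduler");

    /* Block SIGALRM so it can be received synchronously with sigwaitinfo. */
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGALRM);
    sigprocmask(SIG_BLOCK, &set, NULL);

    for (int i = 0; i < 500; i++) {
        long delay_ms = 1 + rand() % 1000;          /* uniform 1..1000 ms */

        struct timeval before, after;
        gettimeofday(&before, NULL);

        /* One-shot alarm after delay_ms. */
        struct itimerval itv = { {0, 0}, { delay_ms / 1000, (delay_ms % 1000) * 1000 } };
        setitimer(ITIMER_REAL, &itv, NULL);

        sigwaitinfo(&set, NULL);                    /* sleep until the alarm fires */
        gettimeofday(&after, NULL);

        long elapsed_us = (after.tv_sec - before.tv_sec) * 1000000L
                        + (after.tv_usec - before.tv_usec);
        long latency_us = elapsed_us - delay_ms * 1000L;
        printf("requested %4ld ms, latency %6ld us\n", delay_ms, latency_us);
    }
    return 0;
}
```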
The base system we used is the default configuration of Linux, with a 100 Hz clock interrupt rate and a 60 ms (6 ticks) maximal quantum duration. In order to achieve our sub-millisecond latency goal, we compared this with a rather aggressive alternative: a 20,000 Hz clock interrupt rate and 100 µs (2 tick) quanta (note that we are changing two parameters at once: both the clock resolution and the number of ticks in a quantum). Theoretically, for this configuration the maximal latency would be 100 µs × 7 = 700 µs < 1 ms, because even if a process is positioned at the end of the run-queue it only needs to wait for seven other processes to run for 100 µs each.

[Figure 7: Distributions of latencies till a timer signal is delivered, for 1, 2, 4, and 8 processes that compute continuously and also set timers for random intervals of up to one second, on the 100 Hz system with default 60 ms quanta and on the 20,000 Hz system with 100 µs quanta.]

The results shown in Figure 7 confirm our expectations. This figure is associated with the worst-case scenario of a workload composed solely of CONT processes. Examining the results for the original 100 Hz system (left of Figure 7), we see that a single process receives the signal within one tick, as may be expected. When more processes are present, there is also a positive probability that a process will nevertheless receive the signal within a tick: 1/2, 1/4, and 1/8 for 2, 4, and 8 processes, respectively. The Y axis of the figure shows that the actual fractions were 0.53, 0.30, and 0.16 (respectively), slightly more than the associated probabilities. But a process may also be forced to wait for other processes that precede it to exhaust their quanta. This leads to the step-like shape of the graphs, because the wait is typically an integral number of ticks. The maximal wait is a full quantum for each of the other processes. In the case of 8 competing processes, for example, the maximum is 60 ms for each of the other 7, for a total of 420 ms (= 420,000 µs).

The situation on the 20,000 Hz system is essentially the same, except that the time scale is much, much shorter: the latency is almost always less than a millisecond, as expected. In other words, the high clock interrupt rate and rapid context switching allow the system to deliver timer signals in a timely manner, despite having to cycle through all competing processes. Table 6 shows that this is the case for all our experiments (for brevity only selected experiments are shown). Note that using the higher clock rate also provides significantly improved latencies in the experiments where processes only compute for a fraction of the time till the timer event. With 100 Hz even this scenario sometimes causes conflicts, despite the relatively low overall CPU utilization. The relatively few long-latency events that remain in the high clock-rate case are attributed to conflicts with system daemons that perform disk I/O, such as the pager. Similar effects have been noted in other systems [14]. These problems are expected to go away in the next Linux kernel, which is preemptive; they should not be an issue in other kernels that are already preemptive (such as Solaris).
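The worst-case reasoning above, namely that a waiting process may have to sit behind each competing process for one full quantum, also works in reverse: given a target latency bound and an expected number of processes, it dictates how short the quantum must be (this is the configuration rule suggested again in Section 8). The helper below merely encodes that arithmetic with the example numbers used in this section.

```c
#include <stdio.h>

/* Worst-case latency if a process must wait for all other runnable
   processes to exhaust one quantum each. */
static double worst_case_latency_us(int nprocs, double quantum_us)
{
    return (nprocs - 1) * quantum_us;
}

/* Largest quantum that still keeps the worst case under a given bound. */
static double max_quantum_us(int nprocs, double bound_us)
{
    return bound_us / (nprocs - 1);
}

int main(void)
{
    /* The extreme configuration: 8 processes, 100 us quanta. */
    printf("8 procs, 100 us quanta -> worst case %.0f us\n",
           worst_case_latency_us(8, 100.0));
    /* The default configuration: 8 processes, 60 ms quanta. */
    printf("8 procs, 60 ms quanta  -> worst case %.0f ms\n",
           worst_case_latency_us(8, 60000.0) / 1000.0);
    /* Quantum needed for a 1 ms bound with 8 competing processes. */
    printf("1 ms bound, 8 procs    -> quantum <= %.0f us\n",
           max_quantum_us(8, 1000.0));
    return 0;
}
```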
Table 6: Tails of the distributions of latencies to deliver timer signals in different experimental settings. Table values are latencies in microseconds, for various percentiles of the distribution.

Processes (type / number) | 100Hz: 0.9 | 0.95    | 0.99    | max     | 20,000Hz: 0.9 | 0.95 | 0.99   | max
BLK  / 2                  | 5          | 8       | 11      | 40      | 13            | 14   | 21     | 23
BLK  / 8                  | 5          | 12      | 22      | 420     | 7             | 9    | 13     | 25
CONT / 2                  | 50,003     | 60,003  | 60,004  | 160,006 | 102           | 103  | 18,468 | 60,448
CONT / 8                  | 370,014    | 400,014 | 420,015 | 740,025 | 656           | 706  | 15,096 | 68,139
2%   / 2                  | 6          | 9       | 9,193   | 19,153  | 13            | 15   | 23     | 837
2%   / 8                  | 2,910      | 8,419   | 17,940  | 32,944  | 12            | 52   | 53     | 1,809
8%   / 2                  | 9          | 12,431  | 39,512  | 60,003  | 14            | 19   | 53     | 3,797
8%   / 8                  | 40,003     | 60,005  | 130,006 | 294,291 | 53            | 53   | 54     | 37,328
4%   / 1+2CPU             | 50,003     | 50,003  | 50,004  | 50,005  | 55            | 56   | 200    | 256
4%   / 1+8CPU             | 50,003     | 50,003  | 170,014 | 280,010 | 56            | 57   | 59     | 856

But what about overheads? As shown in Figure 3, when running continuously computing processes (in that case, a sorting application) with a 20,000 Hz clock interrupt rate and quanta of 6 ticks, the additional overhead can reach 35% on contemporary architectures. The overhead for the shorter 2-tick quanta used here may be even higher. This seems like an expensive and unacceptable price to pay. However, if we examine the application throughput on different platforms the picture is not so bleak.

[Figure 8: Throughput of the sort application, measured as how many millions of numbers were sorted per second on each platform, with 8 competing processes, for the base and extreme configurations.]

Figure 8 compares the achieved throughput, as measured by numbers sorted per second, for two configurations. The base configuration uses 100 Hz interrupts and 60 ms quanta. The extreme configuration uses 20,000 Hz interrupts and 100 µs quanta. While performance drops dramatically when comparing the two configurations on the same platform, the extreme configuration of each platform still typically outperforms the base configuration on the previous platform. For example, a PIII-664 running the base configuration manages to sort about 2,559,000 numbers per second, while the PIII-1.133 with the extreme configuration sorts about 3,136,000 numbers per second (the P-IV consistently performs worse than previous generations). This is an optimistic result which means that in order to match or even improve on the performance of an existing platform, while achieving sub-millisecond latency, all one has to do is upgrade to the next generation. This is usually much cheaper than purchasing the industrial hard real-time alternative.

8. CONCLUSIONS AND FUTURE WORK

General purpose systems, such as Linux and Windows, are already often used for soft real-time applications such as viewing video, playing music, or burning CDs. Other less common applications include various control functions, ranging from laboratory experiment control to traffic-light control. Such applications are not critical to the degree that they require a full-fledged real-time system. However, they may face problems on a typical commodity system due to the lack of adequate support for high-resolution timing services. A special case is "timeline gaps", where the processor is totally unavailable for a relatively long time [14].

Various solutions have been proposed for this problem, typically based on explicit support for timing functions. In particular, very good results are obtained by using soft timers or one-shot timers. The idea there is to change the kernel's timing mechanism from the current periodic time sampling to event-based time sampling. However, since this event-based approach calls for a massive redesign of a major kernel subsystem, it has remained more of an academic exercise and has yet to make it into the world of mainstream operating systems.
The goal of this paper is to check the degree to which existing systems can provide reasonable soft real-time services, specifically for interactive applications, just by leveraging the very fast hardware that is now routinely available, without any sophisticated modifications to the system. The mechanism is simply to increase the frequency of the periodic timer sampling. We show that this solution, although suffering from non-negligible overhead, is viable on today's ultra-fast CPUs. We also show that implementing this solution in mainstream operating systems is as trivial as turning a tuning knob, possibly even at system runtime.

We started with the observation that there is a large and growing gap between the CPU clock rates, which grow exponentially, and the system clock interrupt rates, which are rather stable at 100 Hz. We showed that by increasing the clock interrupt rate by a mere order of magnitude, to 1000 Hz, one achieves significant advantages in terms of timing and billing services, while keeping the overheads acceptably low. The modifications required to the system are rather trivial: to increase the clock interrupt rate, and reduce the default quantum length. As multimedia applications typically operate in this range (i.e. with timers of several milliseconds), such an increase may be enough to satisfy this important class of applications. A similar observation has been made by Nieh and Lam with regard to the scheduling of multimedia applications in the SMART scheduler [19]. A rate of 1000 Hz is used in the experimental Linux 2.5 kernel, and also on personal systems of some kernel hackers [12]. For more demanding applications, we experimented with raising the clock interrupt rate up to 20,000 Hz, and found that by doing so applications are guaranteed to receive timer signals within one millisecond of the correct times with high probability, even under loaded conditions.

In addition to suggesting that 1000 Hz be used as the minimal default clock rate, we also propose that the HZ value and the quantum length be settable parameters, rather than compiled constants. This will enable users of systems that are dedicated to a time-sensitive task to configure them so as to bound the latency, by shortening the quantum so that when multiplied by the expected number of processes in the system the product is less than the desired bound. Of course, this functionality has to be traded off against the overhead it entails. Such detailed considerations can only be made by knowledgeable users on a case-by-case basis. Even so, this is expected to be cost effective relative to the alternative of procuring a hard real-time system.

The last missing piece is the correct prioritization of applications under heavy load conditions. The problem is that modern interactive applications may use quite a lot of CPU power to generate realistic graphics and video in real time, and may therefore be hard to distinguish from low priority CPU-bound applications. This is especially hard when faced with multi-threaded applications (like Xine), or if applications are adaptive (as Quake is) and can always use additional compute power to improve their output. Our future work therefore deals with alternative mechanisms for the identification of interactive processes. The mechanisms we are considering involve tracking the interactions of applications with the X server, and thus with the input and output devices that represent the local user [9].

Acknowledgements

Many thanks are due to Danny Braniss and Tomer Klainer for providing access to various platforms and helping make them work.

9. REFERENCES
[1] B. Adelberg, H. Garcia-Molina, and B. Kao, "Emulating soft real-time scheduling using traditional operating system schedulers". In Real-Time Systems Symp., Oct 1994.
[2] M. Aron and P. Druschel, "Soft timers: efficient microsecond software timer support for network processing". ACM Trans. Comput. Syst. 18(3), pp. 197-228, Aug 2000.
[3] M. Barabanov and V. Yodaiken, "Introducing real-time Linux". Linux Journal 34, Feb 1997. http://www.linuxjournal.com/article.php?sid=0232.
[4] M. Beck, H. Bohme, M. Dziadzka, U. Kunitz, R. Magnus, and D. Verworner, Linux Kernel Internals. Addison-Wesley, 2nd ed., 1998.
[5] D. P. Bovet and M. Cesati, Understanding the Linux Kernel. O'Reilly, 2001.
[6] J. B. Chen, Y. Endo, K. Chan, D. Mazieres, A. Dias, M. Seltzer, and M. D. Smith, "The measured performance of personal computer operating systems". ACM Trans. Comput. Syst. 14(1), pp. 3-40, Feb 1996.
[7] R. T. Dimpsey and R. K. Iyer, "Modeling and measuring multiprogramming and system overheads on a shared memory multiprocessor: case study". J. Parallel & Distributed Comput. 12(4), pp. 402-414, Aug 1991.
[8] K. J. Duda and D. R. Cheriton, "Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler". In 17th Symp. Operating Systems Principles, pp. 261-276, Dec 1999.
[9] Y. Etsion, D. Tsafrir, and D. G. Feitelson, Human-Centered Scheduling of Interactive and Multimedia Applications on a Loaded Desktop. Technical Report, Hebrew University, Mar 2003.
[10] K. Flautner and T. Mudge, "Vertigo: automatic performance-setting for Linux". In 5th Symp. Operating Systems Design & Implementation, pp. 105-116, Dec 2002.
[11] K. Flautner, R. Uhlig, S. Reinhardt, and T. Mudge, "Thread-level parallelism and interactive performance of desktop applications". In 9th Intl. Conf. Architect. Support for Prog. Lang. & Operating Syst., pp. 129-138, Nov 2000.
[12] FreeBSD Documentation Server, Thread on "clock granularity (kernel option HZ)". URL http://docs.freebsd.org/mail/archive/2002/freebsd-hackers/20020203.freebsd-hackers.html, Feb 2002.
[13] A. Goel, L. Abeni, C. Krasic, J. Snow, and J. Walpole, "Supporting time-sensitive applications on a commodity OS". In 5th Symp. Operating Systems Design & Implementation, pp. 165-180, Dec 2002.
[14] J. Gwinn, "Some measurements of timeline gaps in VAX/VMS". Operating Syst. Rev. 28(2), pp. 92-96, Apr 1994.
[15] I. Leslie, D. McAuley, R. Black, T. Roscoe, P. Barham, D. Evers, R. Fairbairns, and E. Hyden, "The design and implementation of an operating system to support distributed multimedia applications". IEEE J. Select Areas in Commun. 14(7), pp. 1280-1297, Sep 1996.
[16] J. Lions, Lions' Commentary on UNIX 6th Edition, with Source Code. Annabooks, 1996.
[17] J. Mauro and R. McDougall, Solaris Internals. Prentice Hall, 2001.
[18] J. Nieh, J. G. Hanko, J. D. Northcutt, and G. A. Wall, "SVR4 UNIX scheduler unacceptable for multimedia applications". In 4th Int'l Workshop on Network & Operating System Support for Digital Audio and Video, Nov 1993.
[19] J. Nieh and M. S. Lam, "The design, implementation and evaluation of SMART: a scheduler for multimedia applications". In 16th Symp. Operating Systems Principles, pp. 184-197, Oct 1997.
[20] J. K. Ousterhout, "Why aren't operating systems getting faster as fast as hardware?". In USENIX Summer Conf., pp. 247-256, Jun 1990.
[21] B. Paul, "Introduction to the Direct Rendering Infrastructure". http://dri.sourceforge.net/doc/DRIintro.html, August 2000.
[22] M. A. Rau and E. Smirni, "Adaptive CPU scheduling policies for mixed multimedia and best-effort workloads". In Modeling, Anal. & Simulation of Comput. & Telecomm. Syst., pp. 252-261, Oct 1999.
[23] R. Ronen, A. Mendelson, K. Lai, S.-L. Lu, F. Pollack, and J. P. Shen, "Coming challenges in microarchitecture and architecture". Proc. IEEE 89(3), pp. 325-340, Mar 2001.
[24] D. A. Solomon and M. E. Russinovich, Inside Microsoft Windows 2000. Microsoft Press, 3rd ed., 2000.
[25] B. Srinivasan, S. Pather, R. Hill, F. Ansari, and D. Niehaus, "A firm real-time system implementation using commercial off-the-shelf hardware and free software". In 4th IEEE Real-Time Technology & App. Symp., pp. 112-119, Jun 1998.
[26] D. Tyrell, K. Severson, A. B. Perlman, B. Brickle, and C. Vaningen-Dunn, "Rail passenger equipment crashworthiness testing requirements and implementation". In Intl. Mechanical Engineering Congress & Exposition, Nov 2000.