Disengaged Scheduling for Fair, Protected Access to
Fast Computational Accelerators
Konstantinos Menychtas
Kai Shen
Michael L. Scott
University of Rochester
{kmenycht,kshen,scott}@cs.rochester.edu
Abstract
Today’s operating systems treat GPUs and other computational accelerators as if they were simple devices, with
bounded and predictable response times. With accelerators
assuming an increasing share of the workload on modern
machines, this strategy is already problematic, and likely to
become untenable soon. If the operating system is to enforce
fair sharing of the machine, it must assume responsibility for
accelerator scheduling and resource management.
Fair, safe scheduling is a particular challenge on fast accelerators, which allow applications to avoid kernel-crossing
overhead by interacting directly with the device. We propose
a disengaged scheduling strategy in which the kernel intercedes between applications and the accelerator on an infrequent basis, to monitor their use of accelerator cycles and to
determine which applications should be granted access over
the next time interval.
Our strategy assumes a well-defined, narrow interface exported by the accelerator. We build upon such an interface, systematically inferred for the latest Nvidia GPUs. We construct several example schedulers, including Disengaged Timeslice with overuse control, which guarantees fairness, and Disengaged Fair Queueing, which is effective in limiting resource idleness but offers only a probabilistic fairness guarantee. Both schedulers ensure
fair sharing of the GPU, even among uncooperative or adversarial applications; Disengaged Fair Queueing incurs a
4% overhead on average (max 18%) compared to direct device access across our evaluation scenarios.
Categories and Subject Descriptors C.1.3 [Processor Architectures]: Other Architecture Styles; D.4.1 [Operating Systems]: Process Management—Scheduling; D.4.8 [Operating Systems]: Performance

Keywords Operating system protection; scheduling; fairness; hardware accelerators; GPUs

1. Introduction

Future microprocessors seem increasingly likely to be highly
heterogeneous, with large numbers of specialized computational accelerators. General-purpose GPUs are the canonical
accelerator on current machines; other current or near-future
examples include compression, encryption, XML parsing,
and media transcoding engines, as well as reconfigurable
(FPGA) substrates.
Current OS strategies, which treat accelerators as if they
were simple devices, are ill suited to operations of unpredictable, potentially unbounded length. Consider, for example, a work-conserving GPU that alternates between requests
(compute “kernels,” rendering calls) from two concurrent
applications. The application with larger requests will tend
to receive more time. A greedy application may intentionally “batch” its work into larger requests to hog resources. A
malicious application may launch a denial-of-service attack
by submitting a request with an infinite loop.
Modern accelerator system architectures (Figure 1) pose
at least two major challenges to fair, safe management. First,
microsecond-level request latencies strongly motivate designers to avoid the overhead of user/kernel domain switching (up to 40% for small requests according to experiments
later in this paper) by granting applications direct device access from user space. In bypassing the operating system, this
direct access undermines the OS’s traditional responsibility
for protected resource management. Second, the application-hardware interface for accelerators is usually hidden within
a stack of black-box modules (libraries, run-time systems,
drivers, hardware), making it inaccessible to programmers
outside the vendor company. To enable resource management in the protected domain of the OS kernel, but outside
such black-box stacks, certain aspects of the interface must
be disclosed and documented.
To increase fairness among applications, several research projects have suggested changes or additions to the
application-library interface [12, 15, 19, 20, 32], relying
primarily on open-source stacks for this purpose. Unfortunately, as discussed in Section 2, such changes invariably
serve either to replace the direct-access interface, adding a
per-request syscall overhead that can be significant, or to
impose a voluntary level of cooperation in front of that interface. In this latter case, nothing prevents an application
from ignoring the new interface and accessing the device directly. Simply put, the current state of the art with respect to accelerator management says "you can have at most two of protection, efficiency, and fairness—not all three."

Figure 1. For efficiency, accelerators (like Nvidia GPUs) receive requests directly from user space, through a memory-mapped interface. The syscall-based OS path is used only for occasional maintenance such as initialization. Commonly, much of the involved software (libraries, drivers) and hardware is unpublished (black boxes).
While we are not the first to bemoan this state of affairs [29], we are, we believe, the first to suggest a viable
solution, and to build a prototype system in which accelerator access is simultaneously protected, efficient, and fair.
Our key idea, called disengagement, allows an OS resource
scheduler to maintain fairness by interceding on a small
number of acceleration requests (by disabling the direct-mapped interface and intercepting resulting faults) while
granting the majority of requests unhindered direct access.
To demonstrate the versatility and efficiency of this idea, we
build a prototype: a Linux kernel module that enables disengaged scheduling on three generations of Nvidia GPUs. We
construct our prototype for the proprietary, unmodified software stack. We leverage our previously proposed state machine inferencing approach to uncover the black-box GPU
interface [25] necessary for scheduling.
We have built and evaluated schedulers that illustrate
the richness of the design space. Our Disengaged Timeslice
scheduler ensures fully protected fair sharing among applications, but can lead to underutilization when applications
contain substantial “off” periods between GPU requests. Our
Disengaged Fair Queueing scheduler limits idleness and
maintains high efficiency with a strong probabilistic guarantee of fairness. We describe how we build upon the uncovered GPU interface; how we enable protection, efficiency,
and fairness; and how hardware trends may influence—and
be influenced by—our scheduling techniques.
2. Related Work
Some GPU management systems—e.g., PTask [32] and Pegasus [15]—provide a replacement user-level library that arranges to make a system or hypervisor call on each GPU
request. In addition to the problem of per-request overhead,
this approach provides safety from erroneous or malicious
applications only if the vendor’s direct-mapped interface is
disabled—a step that may compromise compatibility with
pre-existing applications. Elliott and Anderson [12] require
an application to acquire a kernel mutex before using the
GPU, but this convention, like the use of a replacement library, cannot be enforced. Though such approaches benefit from the clear semantics of the libraries’ documented
APIs, they can be imprecise in their accounting: API calls
to start acceleration requests at this level are translated asynchronously to actual resource usage by the driver. We build
our solution at the GPU software / hardware interface, where
the system software cannot be circumvented and resource
utilization accounting can be timely and precise.
Other systems, including GERM [11], TimeGraph/Gdev
[19, 20] and LoGV [13] in a virtualized setting, replace the
vendor black-box software stack with custom open-source
drivers and libraries (e.g., Nouveau [27] and PathScale [31]).
These enable easy operating system integration, and can enforce an OS scheduling policy if configured to be notified
of all requests made to the GPU. For small, frequent acceleration requests, the overhead of trapping to the kernel can
make this approach problematic. The use of a custom, modified stack also necessitates integration with user-level GPU
libraries; for several important packages (e.g., OpenCL, currently available only as closed source), this is not always an
option. Our approach is independent of the rest of the GPU
stack: it intercedes at the level of the hardware-software interface. (For prototyping purposes, this interface is uncovered through systematic reverse engineering [25]; for production systems it will require that vendors provide a modest
amount of additional documentation.)
Several projects have addressed the issue of fairness in
GPU scheduling. GERM [11] enables fair-share resource allocation using a deficit round-robin scheduler [34]. TimeGraph [19] supports fairness by penalizing overuse beyond
a reservation. Gdev [20] employs a non-preemptive variant of Xen’s Credit scheduler to realize fairness. Beyond
the realm of GPUs, fair scheduling techniques (particularly
the classic fair queueing schedulers [10, 14, 18, 30, 33])
have been adopted successfully in network and I/O resource
management. Significantly, all these schedulers track and/or
control each individual device request, imposing the added
cost of OS kernel-level management on fast accelerators that
could otherwise process requests in microseconds. Disengaged scheduling enables us to redesign existing scheduling policies to meet fast accelerator requirements, striking a
better balance between protection and efficiency while maintaining good fairness guarantees.
Beyond the GPU, additional computational accelerators
are rapidly emerging for power-efficient computing through
increased specialization [8]. The IBM PowerEN processor [23] contains accelerators for tasks (cryptography, compression, XML parsing) of importance for web services.
Altera [1] provides custom FPGA circuitry for OpenCL.
The faster the accelerator, and the more frequent its use,
the greater the need for direct, low-latency access from
user applications. To the extent that such access bypasses
the OS kernel, protected resource management becomes
a significant challenge. The ongoing trend toward deeper
CPU/accelerator integration (e.g., AMD’s APUs [7, 9] or
HSA architecture [24], ARM’s Mali GPUs [4], and Intel’s
QuickAssist [17] accelerator interface) and low-overhead
accelerator APIs (e.g., AMD's Mantle [3] for GPUs) leads
us to believe that disengaged scheduling can be an attractive
strategy for future system design.
Managing fast devices for which per-request engagement
is too expensive is reminiscent of the Soft Timers work [5].
Specifically, Aron and Druschel argued that per-packet interrupts are too expensive for fast networks and that batched
processing through rate-based clocking and network polling
is necessary for high efficiency [5]. Exceptionless system
calls [35] are another mechanism to avoid the overhead of
frequent kernel traps by batching requests. Our proposal also
avoids per-request manipulation overheads to achieve high
efficiency, but our specific techniques are entirely different.
Rather than promote batched and/or delayed processing of
system interactions, which co-designed user-level libraries
already do for GPUs [21, 22, 28], we aim for synchronous
involvement of the OS kernel in resource management, at
times and conditions of its choosing.
Recent projects have proposed that accelerator resources
be managed in coordination with conventional CPU scheduling to meet overall performance goals [15, 29]. Helios advocated the support of affinity scheduling on heterogeneous
machines [26]. We recognize the value of coordinated resource management—in fact we see our work as enabling
it: for accelerators that implement low-latency requests,
coordinated management will require OS-level accelerator
scheduling that is not only fair but also safe and fast.
3. Protected Scheduling with Efficiency
From the kernel’s perspective, and for the purpose of scheduling, accelerators can be thought of as processing resources
with an event-based interface. A “task” (the resource principal to which we wish to provide fair service—e.g., a process
or virtual machine) can be charged for the resources occupied by its “acceleration request” (e.g., compute, graphics).
To schedule such requests, we need to know when they are
submitted for processing and when they complete.
Submission and completion are easy to track if the former employs the syscall interface and the latter employs interrupts. However, and primarily for reasons of efficiency,
vendors like Nvidia avoid both, and build an interface that
is directly mapped to user space instead. A request can be
submitted by initializing buffers and a descriptor object in
shared memory and then writing the address of the descriptor
into a queue that is watched by the device. Likewise, completion can be detected by polling a flag or counter that is set
by the device.
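To make this concrete, the following sketch models such a user-mapped submission/completion interface. The structure layout, field names, and fence-counter discipline are illustrative assumptions for exposition only; the actual Nvidia data structures are unpublished and differ in detail.

    #include <stdint.h>

    /* Hypothetical model of a user-mapped GPU channel (not the real layout). */
    struct request_desc { uint64_t cmd_gpu_addr; uint32_t cmd_len; uint32_t fence; };

    struct channel {
        volatile uint32_t   *doorbell;       /* memory-mapped device register        */
        struct request_desc *ring;           /* ring buffer shared with the device   */
        volatile uint32_t   *fence_counter;  /* advanced by the device on completion */
        uint32_t head, tail, ring_size, next_fence;
    };

    /* Submit: publish a descriptor, then notify the device with a plain store.
     * No syscall is involved, which is the whole point of the direct mapping. */
    static uint32_t submit(struct channel *ch, uint64_t cmds, uint32_t len)
    {
        uint32_t slot = ch->tail % ch->ring_size;
        ch->ring[slot] = (struct request_desc){ cmds, len, ++ch->next_fence };
        ch->tail++;
        *ch->doorbell = ch->tail;
        return ch->next_fence;
    }

    /* Completion is detected by polling the fence counter written by the device. */
    static int is_complete(const struct channel *ch, uint32_t fence)
    {
        return (int32_t)(*ch->fence_counter - fence) >= 0;  /* wrap-safe compare */
    }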
Leveraging our previous work on state-machine inferencing of the black-box GPU interface [25], we observe that the
kernel can, when desired, learn of requests passed through
the direct-mapped interface by means of interception—by
(temporarily) unmapping device registers and catching the
resulting page faults. The page fault handler can then pass
control to an OS resource manager that may dispatch or
queue/delay the request according to a desired policy. Request completion can likewise be realized through independent polling. In comparison to relying on a modified
user-level library, interception has the significant advantage
of enforcing policy even on unmodified, buggy, or selfish / malicious applications. All it requires of the accelerator
architecture is a well-defined interface based on scheduling events—an interface that is largely independent of the
internal details of black or white-box GPU stacks.
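The kernel-level version of this interception is described in Section 4; the same mechanism can be illustrated entirely in user space with mprotect and a SIGSEGV handler. The program below is only an analogue of the technique, not NEON code: it protects a page standing in for the doorbell register, catches the fault raised by a store to it, lets a "scheduler" observe the event, and then allows the store to proceed.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static volatile int *fake_doorbell;          /* stands in for a mapped register */
    static volatile sig_atomic_t intercepted;

    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)si; (void)ctx;
        intercepted++;   /* a real handler would account for / delay the request */
        /* Re-open the page so the faulting store completes on return; the kernel
         * prototype instead single-steps the instruction and re-protects at once. */
        mprotect((void *)fake_doorbell, sysconf(_SC_PAGESIZE), PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        long pg = sysconf(_SC_PAGESIZE);
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        fake_doorbell = mmap(NULL, pg, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        mprotect((void *)fake_doorbell, pg, PROT_NONE);  /* start intercepting        */
        *fake_doorbell = 42;                             /* faults, observed, retried */
        printf("intercepted %d store(s); doorbell holds %d\n",
               (int)intercepted, *fake_doorbell);
        return 0;
    }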
Cost of OS management Unfortunately, trapping to the
OS—whether via syscalls or faults—carries nontrivial costs.
The cost of a user / kernel mode switch, including the effects of cache pollution and lost user-mode IPC, can be thousands of CPU cycles [35, 36]. Such cost is a substantial addition to the base time (305 cycles on our Nvidia GTX670-based system) required to submit a request directly with a
write to I/O space. In applications with small, frequent requests, high OS protection costs can induce significant slowdown. Figure 2 plots statistics for three realistic applications, selected for their small, frequent requests: glxgears
(standard OpenGL [22] demo), Particles (OpenCL / GL from
CUDA SDK [28]), and Simple 3D Texture (OpenCL / GL
from CUDA SDK [28]). More than half of all GPU requests
from these applications are submitted and serviced in less
than 10 µs. With a deeply integrated CPU-GPU microarchitecture, request latency could be lowered even more—to the
level of a simple memory/cache write/read, further increasing the relative burden of interception.
To put the above in perspective, we compared the throughput that can be achieved with an accelerator software / hardware stack employing a direct-access interface (Nvidia
Linux binary driver v310.14) and with one that relies on
traps to the kernel for every acceleration request (AMD Catalyst v13.8 Linux binary driver). The comparison uses the
same machine hardware, and we hand-tune the OpenCL requests to our Nvidia (GTX670) or AMD (Radeon HD 7470)
PCIe GPU such that the requests are equal-sized. We discover that, for requests on the order of 10–100 µs, a throughput increase of 8–35% can be expected by simply accessing
the device directly, without trapping to the kernel for every
request. These throughput gains are even higher (48–170%)
if the traps to the kernel actually entail nontrivial processing
in GPU driver routines.
Figure 2. CDFs of request inter-arrival periods and request service periods (percentage of events per bin, over log2 time bins in µs) for glxgears, oclParticles, and oclSimpleTexture3D on an Nvidia GTX670 GPU. A large percentage of arriving requests are short and submitted in short intervals ("back-to-back").
Overview of schedulers Sections 3.1, 3.2, and 3.3 present
new request schedulers for fast accelerators that leverage
interception to explicitly balance protection, fairness, and
efficiency. The Timeslice scheduler retains the overhead of
per-request kernel intervention, but arranges to share the
GPU fairly among unmodified applications. The Disengaged
Timeslice scheduler extends this design by eliminating kernel intervention in most cases. Both variants allow only one
application at a time to use the accelerator. The Disengaged
Fair Queueing scheduler eliminates this limitation, but provides only statistical guarantees of fairness.
Our disengaged schedulers maintain the high performance of the direct-mapped accelerator interface by limiting
kernel involvement to a small number of interactions. In so
doing, they rely on information about aspects of the accelerator design that can affect resource usage and accounting
accuracy. For the purposes of the current description, we assume the characteristics of an Nvidia GPU, as revealed by
reverse engineering and public documentation. We describe
how device features affect our prototype design as we encounter them in the algorithm descriptions; we revisit and
discuss how future software / hardware developments can
influence and be influenced by our design in Section 6.
3.1 Timeslice with Overuse Control
To achieve fairness in as simple a manner as possible, we
start with a standard token-based timeslice policy. Specifically, a token that indicates permission to access the shared
accelerator is passed among the active tasks. Only the current token holder is permitted to submit requests, and all request events (submission, completion) are trapped and communicated to the scheduler. For the sake of throughput, we
permit overlapping nonblocking requests from the task that
holds the token.
Simply preventing a task from submitting requests outside its timeslice does not suffice for fairness. Imagine, for
example, a task that issues a long series of requests, each of
which requires 0.9 of a timeslice to complete. A naive scheduler
might allow such a task to issue two requests in each timeslice, and to steal 80% of each subsequent timeslice that belongs to some other task. To avoid this overuse problem, we
must, upon the completion of requests that overrun the end
of a timeslice, deduct their excess execution time from future
timeslices of the submitting task. We account for overuse by
waiting at the end of a timeslice for all outstanding requests
to complete, and charging the token holder for any excess.
When a task’s accrued overuse exceeds a full timeslice, we
skip the task’s next turn to hold the token, and subtract a
timeslice from its accrued overuse.
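A minimal sketch of this overuse bookkeeping follows; the names, the microsecond units, and the 30 ms timeslice (the value we later use in Section 5) are illustrative.

    #include <stdint.h>

    #define TIMESLICE_US 30000   /* 30 ms timeslice, as configured in Section 5 */

    struct task_acct { int64_t overuse_us; };

    /* At the end of a slice we wait for the holder's outstanding requests and
     * charge it for any time they ran past the slice boundary. */
    static void end_of_slice(struct task_acct *holder,
                             int64_t slice_end_us, int64_t drain_end_us)
    {
        if (drain_end_us > slice_end_us)
            holder->overuse_us += drain_end_us - slice_end_us;
    }

    /* When it is a task's turn, a full timeslice of accrued overuse costs it
     * that turn; returns 1 if the task may take the token now. */
    static int may_take_token(struct task_acct *next)
    {
        if (next->overuse_us >= TIMESLICE_US) {
            next->overuse_us -= TIMESLICE_US;
            return 0;
        }
        return 1;
    }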
We also need to address the problem that a single long
request may monopolize the accelerator for an excessive
amount of time, compromising overall system responsiveness. In fact, Turing-complete accelerators, like GPUs, may
have to deal with malicious (or poorly written) applications that submit requests that never complete. In a timeslice scheduler, it is trivial to identify the task responsible for
an over-long request: it must be the last token holder. Subsequent options for restoring responsiveness then depend on
accelerator features.
From model to prototype Ideally, we would like to preempt over-long requests, or define them as errors, kill them
on the accelerator, and prevent the source task from sending more. Lacking vendor support for preemption or a documented method to kill a request, we can still protect against
over-long requests by killing the offending task, provided
this leaves the accelerator in a clean state. Specifically, upon
detecting that overuse has exceeded a predefined threshold,
we terminate the associated OS process and let the existing
accelerator stack follow its normal exit protocol, returning
occupied resources back to the available pool. While the involved cleanup depends on (undocumented) accelerator features, it appears to work correctly on modern GPUs.
Limitations Our Timeslice scheduler can guarantee fair resource use in the face of unpredictable (and even unbounded)
request run times. It suffers, however, from two efficiency
drawbacks. First, its fault-based capture of each submitted
request incurs significant overhead on fast accelerators. Second, it is not work-conserving: the accelerator may be idled
during the timeslice of a task that temporarily has no requests to submit—even if other tasks are waiting for accelerator access. Our Disengaged Timeslice scheduler successfully tackles the overhead problem. Our Disengaged Fair
Queueing scheduler also tackles the problem of involuntary
resource idleness.
3.2 Disengaged Timeslice
Fortunately, the interception of every request is not required
for timeslice scheduling. During a timeslice, direct, unmonitored access from the token holder task can safely be allowed. The OS will trap and delay accesses from all other
tasks. When passing the token between tasks, the OS will
update page tables to enable or disable direct access. We say
such a scheduler is largely disengaged; it allows fast, direct
access to accelerators most of the time, interceding (and incurring cost) on an infrequent basis only. Note that while
requests from all tasks other than the token holder will be
trapped and delayed, such tasks will typically stop running
as soon as they wait for a blocking request. Large numbers
of traps are thus unlikely.
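The token hand-off itself reduces to flipping page protections. The fragment below shows a user-space analogue using mprotect on a (simulated) doorbell page; in the prototype the equivalent change is made in the page tables from kernel code, and the hand-off is preceded by draining and overuse accounting as in the previous section.

    #include <stddef.h>
    #include <sys/mman.h>

    struct ts_task {
        void  *doorbell_page;   /* the task's mapped channel-register page */
        size_t page_len;
    };

    /* Close the previous holder's direct-mapped interface and open the next
     * holder's for the duration of its slice; everyone else stays protected
     * and will fault (and be delayed) if it tries to submit. */
    static void pass_token(struct ts_task *prev, struct ts_task *next)
    {
        if (prev)
            mprotect(prev->doorbell_page, prev->page_len, PROT_NONE);
        mprotect(next->doorbell_page, next->page_len, PROT_READ | PROT_WRITE);
    }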
When the scheduler re-engages at the start of a new timeslice, requests from the token holder of the last timeslice may
still be pending. To account for overuse, we must again wait
for these requests to finish.
From model to prototype Ideally, the accelerator would
expose information about the status of request(s) on which
it is currently working, in a location accessible to the scheduler. While we strongly suspect that current hardware has the
capability to do this, it is not documented, and we have been
unable to deduce it via reverse engineering. We have, however, been able to discover the semantics of data structures
shared between the user and the GPU [2, 16, 19, 25, 27].
These structures include a reference counter that is written
by the hardware upon the completion of each request. Upon
re-engaging, we can traverse in-memory structures to find
the reference number of the last submitted request, and then
poll the reference counter for an indication of its completion.
While it is at least conceivable that a malicious application
could spoof this mechanism, it suffices for performance testing; we assume that a production-quality scheduler would
query the status of the GPU directly.
Limitations Our Disengaged Timeslice scheduler does not
suffer from high per-request management costs, but it may
still lead to poor utilization when the token holder is unable
to keep the accelerator busy. This problem is addressed by
the third of our example schedulers.
3.3 Disengaged Fair Queueing
Disengaged resource scheduling is a general concept—allow
fast, direct device access in the common case; intercede only
when necessary to efficiently realize resource management
objectives. Beyond Disengaged Timeslice, we also develop
a disengaged variant of the classic fair queueing algorithm,
which achieves fairness while maintaining work-conserving
properties—i.e., it avoids idling the resource when there is
pending work to do.
A standard fair queueing scheduler assigns a start tag and
a finish tag to each resource request. These tags serve as
surrogates for the request-issuing task’s cumulative resource
usage before and after the request’s execution. Specifically,
the start tag is the larger of the current system virtual time (as
of request submission) and the finish tag of the most recent
previous request by the same task. The finish tag is the start
tag plus the expected resource usage of the request. Request
dispatch is ordered by each pending request’s finish tag [10,
30] or start tag [14, 18, 33]. Multiple requests (potentially
from different tasks) may be dispatched to the device at
the same time [18, 33]. While giving proportional resource
uses to active tasks, a fair queueing scheduler addresses the
concern that an inactive task—one that currently has no work
to do—might build up its resource credit without bound
and then reclaim it in a sudden burst, causing prolonged
unresponsiveness to others. To avoid such a scenario, the
scheduler advances the system virtual time to reflect only the
progress of active tasks. Since the start tag of each submitted
request is brought forward to at least the system virtual time,
any claim to resources from a task’s idle period is forfeited.
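For reference, the per-request tag computation of standard fair queueing looks as follows (a sketch with illustrative names); our disengaged variant below replaces this per-request bookkeeping with coarse-grained, periodically refreshed per-task virtual times.

    #include <stdint.h>

    struct fq_task { uint64_t last_finish_tag; };

    static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

    /* Start tag: the larger of the system virtual time at submission and the
     * task's previous finish tag. Finish tag: start plus expected usage.
     * Dispatch is ordered by start or finish tag, depending on the variant. */
    static void fq_tag_request(struct fq_task *t, uint64_t system_vtime,
                               uint64_t expected_usage,
                               uint64_t *start_tag, uint64_t *finish_tag)
    {
        *start_tag  = max_u64(system_vtime, t->last_finish_tag);
        *finish_tag = *start_tag + expected_usage;
        t->last_finish_tag = *finish_tag;
    }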
Design For the sake of efficiency, our disengaged scheduler avoids intercepting and manipulating most requests.
Like other fair queueing schedulers, however, it does track
cumulative per-task resource usage and system-wide virtual time. In the absence of per-request control, we replace
the request start/finish tags in standard fair queueing with
a probabilistically-updated per-task virtual time that closely
approximates the task’s cumulative resource usage. Time
values are updated using statistics obtained through periodic
engagement. Control is similarly coarse grain: the interfaces
of tasks that are running ahead in resource usage may be disabled for the interval between consecutive engagements, to
allow their peers to catch up. Requests from all other tasks
are allowed to run freely in the disengaged time interval.
Conceptually, our design assumes that at each engagement we can acquire knowledge of each task’s active/idle
status and its resource usage in the previous time interval.
Actions then taken at each engagement are as follows (a code sketch follows the list):
1. We advance each active task’s virtual time by adding its
resource use in the last interval. We then advance the
system-wide virtual time to be the oldest virtual time
among the active tasks.
2. If an inactive task’s virtual time is behind the system virtual time, we move it forward to the system virtual time.
As in standard fair queueing, this prevents an inactive
task from hoarding unused resources.
3. For the subsequent time interval, up to the next engagement, we deny access to any tasks whose virtual time is
ahead of the system virtual time by at least the length of
the interval. That is, even if the slowest active task acquires exclusive resource use in the upcoming interval, it
will at best catch up with the tasks whose access is being
denied.
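The three steps above amount to the following per-engagement bookkeeping (a sketch; field names are ours, and the per-interval usage figures would come from sampling or, ideally, from hardware counters):

    #include <stddef.h>
    #include <stdint.h>

    struct dfq_task {
        uint64_t vtime;       /* cumulative accelerator-usage estimate            */
        uint64_t last_usage;  /* estimated usage in the previous interval         */
        int      active;      /* had outstanding requests in that interval        */
        int      allowed;     /* may access the device until the next engagement  */
    };

    static void dfq_engage(struct dfq_task *tasks, size_t n, uint64_t interval)
    {
        uint64_t system_vtime = UINT64_MAX;

        /* 1. Advance active tasks; system virtual time = oldest active vtime. */
        for (size_t i = 0; i < n; i++) {
            if (tasks[i].active) {
                tasks[i].vtime += tasks[i].last_usage;
                if (tasks[i].vtime < system_vtime)
                    system_vtime = tasks[i].vtime;
            }
        }
        if (system_vtime == UINT64_MAX)
            return;  /* no task was active in this interval */

        for (size_t i = 0; i < n; i++) {
            /* 2. Inactive tasks may not hoard credit from their idle period. */
            if (!tasks[i].active && tasks[i].vtime < system_vtime)
                tasks[i].vtime = system_vtime;
            /* 3. Deny the next interval to tasks at least one interval ahead. */
            tasks[i].allowed = (tasks[i].vtime < system_vtime + interval);
        }
    }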
Compared to standard fair queueing, which captures and
controls each request, our statistics maintenance and control
are both coarse grained. Imbalance will be remedied only
when it exceeds the size of the inter-engagement interval. Furthermore, the correctness of our approach depends on accurate statistics gathering, which can be challenging in the absence of hardware assistance. We discuss this issue next.

Figure 3. An illustration of periodic activities in Disengaged Fair Queueing. An engagement episode starts with a barrier and the draining of outstanding requests on the GPU, which is followed by the short sampling runs of active tasks (two tasks, T1 and T2, in this case), virtual time maintenance and the scheduling decision, and finally a longer disengaged free-run period.
From model to prototype Ideally, an accelerator would
maintain and export the statistics needed for Disengaged
Fair Queueing, in a location accessible to the scheduler.
It could, for example, maintain cumulative resource usage
for each task, together with an indication of which task(s)
have requests pending and active. Lacking such hardware
assistance, we have developed a software mechanism to estimate the required statistics. In general, accelerators (like
our GPU) accept processing requests through independent
queues mapped in each task's address space, cycling round-robin among queues with pending requests and processing
each request separately, one after the other. Most times, there
is (or can be assumed) a one-to-one mapping between tasks
and queues, so the resource share attributed to each task in
a given time interval is proportional to the average request
run-time experienced from its requests queue. The design of
our software mechanism relies on this observation. To obtain the necessary estimates, we cycle among tasks during a
period of re-engagement, providing each with exclusive access to the accelerator for a brief interval of time and actively
monitoring all of its requests. This interval is not necessarily of fixed duration: we collect either a predefined number
of requests deemed sufficient for average estimation, or as
many requests as can be observed in a predefined maximum
interval length—whichever occurs first. As in the timeslice
schedulers, we place a (documented) limit on the maximum
time that any request is permitted to run, and kill any task
that exceeds this limit.
Figure 3 illustrates key activities in Disengaged Fair
Queueing. Prior to sampling, a barrier stops new request
submission in every task, and then allows the accelerator to drain active requests. (For this we employ the same
reference-counter-based mechanism used to detect overuse
at the end of a timeslice in Section 3.2.) After sampling,
we disengage for another free-run period. Blocking new requests while draining is free, since the device is known to be
busy. Draining completes immediately if the device is found
not to be working on queued requests at the barrier.
Because an engagement period includes a sampling run for every active task, its aggregate duration can be non-negligible. The disengaged free-run period has to be substantially longer to limit the draining and sampling cost. To ensure this, we set the free-run period to be several times the duration of the last engagement period.
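Concretely, the timing rules reduce to a per-task sampling cutoff and a scaled free-run length; the constants below are the ones we use in the evaluation (Section 5.2) and are configuration choices, not requirements.

    /* Sampling stops at a time cap or a request-count cap, whichever is hit
     * first; the next free run is a fixed multiple of the whole engagement. */
    #define SAMPLE_MAX_US        5000   /* per-task exclusive sampling cap (5 ms) */
    #define SAMPLE_MAX_REQUESTS    32   /* ...or after this many requests         */
    #define FREERUN_SCALE           5   /* free run = 5x the last engagement      */

    static int sampling_done(unsigned long elapsed_us, unsigned int requests_seen)
    {
        return elapsed_us >= SAMPLE_MAX_US || requests_seen >= SAMPLE_MAX_REQUESTS;
    }

    static unsigned long freerun_length_us(unsigned long last_engagement_us)
    {
        return FREERUN_SCALE * last_engagement_us;
    }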
Limitations The request-size estimate developed during
the sampling period of Disengaged Fair Queueing acts as
a resource-utilization proxy for the whole task’s free-run period. In the common case, this heuristic is accurate, as accelerated applications tend to employ a single request queue
and the GPU appears to schedule among queues with outstanding requests in a round-robin fashion. It is possible,
however, to have applications that not only maintain multiple queues (e.g. separate compute and graphics rendering
queues), but also keep only some of them active (i.e. holding outstanding requests) at certain times. When at least one
task among co-runners behaves this way, estimating the average request size on every queue might not be enough for
accurate task scheduling. In principle, better estimates might
be based on more detailed reverse engineering of the GPU’s
internal scheduling algorithm. For now, we are content to assume that each task employs a single queue, in the hope that
production systems will have access to statistics that eliminate the need for estimation.
Our estimation mechanism also assumes that observed
task behavior during an engagement interval reflects behavior in the previous disengaged free-run. However, this estimate can be imprecise, resulting in imperfect fairness. One
potential problem is that periodic behavior patterns—rather
than overall task behavior—might be sampled. The non-fixed sampling-period run time introduces a degree of randomness in the sampling frequency and duration, but further
randomizing these parameters would increase confidence in
our ability to thwart any attempt by a malicious application
to subvert the scheduling policy. If the disengaged free-run
period is considered to be too long (e.g., because it dilates
the potential time required to identify and kill an infinite-loop request), we could shorten it by sampling only a subset
of the active tasks in each engagement-disengagement cycle.
Our implementation of the Disengaged Fair Queueing
scheduler comes very close to the work-conserving goal
of not idling the resource when there is pending work to
do. Specifically, it allows multiple tasks to issue requests
in the dominant free-run periods. While exclusive access
is enforced during sampling intervals, these intervals are
bounded, and apply only to tasks that have issued requests
in the preceding free-run period; we do not waste sampling
time on idle tasks. Moreover, sampling is only required in the
absence of true hardware statistics, which we have already
argued that vendors could provide.
The more serious obstacle to true work conservation is
the possibility that all tasks that are granted access during
a disengaged period will, by sheer coincidence, become simultaneously idle, while tasks that were not granted access
(because they had been using more than their share of accumulated time) still have work to do. Since our algorithm
can be effective in limiting idleness, even in the presence of
“selfish,” if not outright malicious applications, and on the
assumption that troublesome scenarios will be rare and difficult to exploit, the degree to which our Disengaged Fair
Queueing scheduler falls short of being fully work conserving seems likely to be small.
4. Prototype Implementation
Our prototype of interception-based OS-level GPU scheduling, which we call NEON, is built on the Linux 3.4.7 kernel. It
comprises an autonomous kernel module supporting our request management and scheduling mechanisms (about 8000
lines of code); the necessary Linux kernel hooks (to ioctl,
mmap, munmap, copy_task, exit_task, and do_page_fault—
less than 100 lines); kernel-resident control structs (about
500 lines); and a few hooks to the Nvidia driver’s binary
interface (initialization, ioctl, mmap requests). Some of the
latter hooks reside in the driver’s binary-wrapper code (less
than 100 lines); others, for licensing reasons, are inside the
Linux kernel (about 500 lines).
Our request management code embodies a state-machine
model of interaction among the user application, driver, and
device [25], together with a detailed understanding of the
structure and semantics of buffers shared between the application and device. Needed information was obtained through
a combination of publicly available documentation and reverse engineering [2, 16, 19, 25, 27].
From a functional perspective, NEON comprises three
principal components: an initialization phase, used to identify the virtual memory areas of all memory-mapped device registers and buffers associated with a given channel
(a GPU request queue and its associated software infrastructure); a page-fault–handling mechanism, used to catch
the device register writes that constitute request submission;
and a polling-thread service, used within the kernel to detect
reference-counter updates that indicate completion of previously submitted requests. The page-fault–handling mechanism and polling-thread service together define a kernelinternal interface through which any event-based scheduler
can be coupled to our system.
The goal of the initialization phase is to identify the device, task, and GPU context (address space) associated with
every channel, together with the location of three key vir-
tual memory areas (VMAs). The GPU context encapsulates
channels whose requests may be causally related (e.g., issued by the same process); NEON avoids deadlocks by avoiding re-ordering of requests that belong to the same context.
The three VMAs contain the command buffer, in which requests are constructed; the ring buffer, in which pointers to
consecutive requests are enqueued; and the channel register, the device register through which user-level libraries notify the GPU of ring buffer updates, which represent new
requests. As soon as the NEON state machine has identified
all three of these VMAs, it marks the channel as “active”—
ready to access the channel registers and issue requests to
the GPU.
During engagement periods, the page-fault–handling
mechanism catches the submission of requests. Having identified the page in which the channel register is mapped, we
protect its page by marking it as non-present. At the time of
a fault, we scan through the buffers associated with the channel to find the location of the reference counter that will be
used for the request, and the value that will indicate its completion. We then map the location into kernel space, making
it available to the polling-thread service. The page-fault handler runs in process context, so it can pass control to the GPU
scheduler, which then decides whether the current process
should sleep or be allowed to continue and access the GPU.
After the application is allowed to proceed, we single-step
through the faulting instruction and mark the page again as
“non-present.”
Periodically, or at the scheduler's prompt, the polling-thread service iterates over the kernel-resident structures
associated with active GPU requests, looking for the address/value pairs that indicate request completion. When an
appropriate value is found—i.e., when the last submitted
command’s target reference value is read at the respective
polled address—the scheduler is invoked to update accounting information and to block or unblock application threads
as appropriate.
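In outline, the polling service is a loop over (address, target value) pairs recorded at interception time; the sketch below uses illustrative types and omits the kernel-thread and locking machinery.

    #include <stddef.h>
    #include <stdint.h>

    struct pending_req {
        volatile uint32_t *addr;    /* kernel mapping of the reference counter */
        uint32_t           target;  /* reference value of the last submission  */
        int                done;
    };

    typedef void (*completion_cb)(struct pending_req *p);

    /* Invoke the scheduler callback for every request whose reference counter
     * has reached (or passed) its target value. */
    static void poll_completions(struct pending_req *reqs, size_t n, completion_cb cb)
    {
        for (size_t i = 0; i < n; i++) {
            if (!reqs[i].done && (int32_t)(*reqs[i].addr - reqs[i].target) >= 0) {
                reqs[i].done = 1;
                cb(&reqs[i]);   /* update accounting, unblock application threads */
            }
        }
    }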
The page-fault–handling mechanism allows us to capture
and, if desired, delay request submission. The polling-thread
service allows us to detect request completion and, if desired, resume threads that were previously delayed. These request interception and completion notification mechanisms
support the implementation of all our schedulers.
Disengagement is implemented by letting the scheduling
policy choose if and when it should restore protection on
channel register VMAs. A barrier requires that all the channel registers in use on the GPU be actively tracked for new
requests—thus their respective pages must be marked "non-present." It is not possible to miss out on newly established
channel mappings while disengaged: all requests to establish such a mapping are captured as syscalls or invocations
of kernel-exposed functions that we monitor.
Though higher level API calls may be processed out of
order, requests arriving at the same channel are always pro-
cessed in order by the GPU. We rely on this fact in our post–
re-engagement status update mechanism, to identify the current status of the last submitted request. Specifically, we inspect the values of reference counters and the reference number values for the last submitted request in every active channel. Kernel mappings have been created upon first encounter
for the former, but the latter do not lie at fixed addresses in
the application’s address space. To access them safely, we
scan through the command queue to identify their locations
and build temporary memory mappings in the kernel’s address space. Then, by walking the page table, we can find
and read the last submitted commands’ reference values.
5. Experimental Evaluation
NEON is modularized to accommodate different devices; we
have deployed it on several Nvidia GPUs. The most recent
is a GTX670, with a “Kepler” microarchitecture (GK104
chip core) and 2 GB RAM. Older tested systems include the
GTX275 and NVS295, with the “Tesla” microarchitecture
(GT200b and G98 chip cores, respectively), and the GTX460, with the "Fermi" microarchitecture (GF104 chip core). We
limit our presentation here to the Kepler GPU, as its modern software / hardware interface (faster context switching
among channels, faster memory subsystem) lets us focus on
the cost of scheduling small, frequent requests. Our host machine has a 2.27 GHz Intel Xeon E5520 CPU (Nehalem microarchitecture) and triple-channel 1.066 GHz DDR3 RAM.
5.1 Experimental Workload
NEON is agnostic with respect to the type of workload: it
can schedule compute or graphics requests, or even DMA
requests. The performance evaluation in this paper includes
compute, graphics, and combined (CUDA or OpenCL plus
OpenGL) applications, but is primarily focused on compute
requests—they are easy to understand, are a good proxy
for broad accelerator workloads, and can capture generic,
possibly unbounded requests. Our experiments employ the
Nvidia 310.14 driver and CUDA 5.0 libraries, which support
acceleration for OpenCL 1.1 and OpenGL 4.2.
Our intent is to focus on behavior appropriate to emerging
platforms with on-chip GPUs [7, 9, 24]. Whereas the crisscrossing delays of discrete systems encourage programmers
to create massive compute requests, and to pipeline smaller
graphics requests asynchronously, systems with on-chip accelerators can be expected to accommodate applications
with smaller, more interactive requests, yielding significant
improvements in flexibility, programmability, and resource
utilization. Of course, CPU-GPU integration also increases
the importance of direct device access from user space [3],
and thus of disengaged scheduling. In an attempt to capture
these trends, we have deliberately chosen benchmarks with
relatively small requests.
Our OpenCL applications are from the AMD APP SDK v2.6, with no modifications except for the use of high-resolution x86 timers to collect performance results.
Application            Area                  µs per round   µs per request
BinarySearch           Searching                      161               57
BitonicSort            Sorting                       1292              202
DCT                    Compression                    197               66
EigenValue             Algebra                        163               56
FastWalshTransform     Encryption                     310              119
FFT                    Signal Processing              268               48
FloydWarshall          Graph Analysis                5631              141
LUDecomposition        Algebra                       1490              308
MatrixMulDouble        Algebra                      12628              637
MatrixMultiplication   Algebra                       3788              436
MatrixTranspose        Algebra                       1153              284
PrefixSum              Data Processing                157               55
RadixSort              Sorting                       8082              210
Reduction              Data Processing               1147              282
ScanLargeArrays        Data Processing                197               72
glxgears               Graphics                        72               37
oclParticles           Physics/Graphics              2006           12/302
simpleTexture3D        Texturing/Graphics            2472          108/171

Table 1. Benchmarks used in our evaluation. A "round" is the performance unit of interest to the user: the run time of compute "kernel(s)" for OpenCL applications, or that of a frame rendering for graphics or combined applications. A round can require multiple GPU acceleration requests.
In addition to GPU computation, most of these applications also
contain some setup and data transfer time. They generally
make repetitive (looping) requests to the GPU of roughly
similar size. We select inputs that are meaningful and nontrivial (hundreds or thousands of elements in arrays, matrices or lists) but do not lead to executions dominated by data
transfer.
Our OpenGL application (glxgears) is a standard graphics microbenchmark, chosen for its simplicity and its frequent short acceleration requests. We also chose a combined
compute/graphics (OpenCL plus OpenGL) microbenchmark
from Nvidia’s SDK v4.0.8: oclParticles, a particles collision
physics simulation. Synchronization to the monitor refresh
rates is disabled for our experiments, both at the driver and
the application level, and the performance metric used is the
raw, average frame rate: our intent is to stress the system
interface more than the GPU processing capabilities.
Table 1 summarizes our benchmarks and their characteristics. The last two columns list how long an execution of a
“round” of computations or rendering takes for each application, and the average acceleration request size realized when
the application runs alone using our Kepler GPU. A “round”
in the OpenCL applications is generally the execution of one
iteration of the main loop of the algorithm, including one or
more “compute kernel(s).” For OpenGL applications, it is
the work to render one frame. The run time of a round indicates performance from the user’s perspective; we use it
to quantify speedup or slowdown from the perspective of a
given application across different workloads (job mixes) or
using different schedulers. A “request” is the basic unit of
work that is submitted to the GPU at the device interface. We
capture the average request size for each application through
our request interception mechanism (but without applying
any scheduling policy). We have verified that our GPU request size estimation agrees, to within 5% or less, with the
time reported by standard profiling tools. Combined compute/graphics applications report two average request sizes,
one per request type. Note that requests are not necessarily
meaningful to the application or user, but they are the focal
point of our OS-level GPU schedulers. There is no necessary correlation between the API calls and GPU requests in
a compute/rendering round. Many GPU-library API calls do
not translate to GPU requests, while other calls could translate to more than one. Moreover, as revealed in the course of
reverse engineering, there also exist trivial requests, perhaps
to change mode/state, that arrive at the GPU and are never
checked for completion. NEON schedules requests without
regard to their higher-level purpose.
To enable flexible adjustment of workload parameters,
we also developed a “Throttle” microbenchmark. Throttle
serves as a controlled means of measuring basic system overheads, and as a well-understood competitive sharer in multiprogrammed scenarios. It makes repetitive, blocking compute requests, which occupy the GPU for a user-specified
amount of time. We can also control idle (sleep/think) time
between requests, to simulate nonsaturating workloads. No
data transfers between the host and the device occur during
throttle execution; with the exception of a small amount of
initial setup, only compute requests are sent to the device.
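For reference, a Throttle-like workload can be approximated with a few dozen lines of OpenCL host code; the sketch below is not the tool used in our experiments, and it assumes the spin-loop iteration count is calibrated offline to the desired request length. Error checking is omitted.

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static const char *src =
        "__kernel void spin(ulong iters, __global uint *out) {\n"
        "    uint x = 0;\n"
        "    for (ulong i = 0; i < iters; ++i) x += (uint)i;\n"
        "    out[get_global_id(0)] = x;\n"
        "}\n";

    int main(int argc, char **argv)
    {
        cl_ulong iters    = (argc > 1) ? strtoull(argv[1], NULL, 10) : 100000;
        unsigned sleep_us = (argc > 2) ? (unsigned)atoi(argv[2]) : 0;

        cl_platform_id plat; cl_device_id dev; cl_int err;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx     = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
        cl_program prog    = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k        = clCreateKernel(prog, "spin", &err);
        cl_mem out         = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                            sizeof(cl_uint), NULL, &err);
        clSetKernelArg(k, 0, sizeof(cl_ulong), &iters);
        clSetKernelArg(k, 1, sizeof(cl_mem), &out);

        size_t gws = 1;
        for (int i = 0; i < 10000; i++) {
            clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, NULL);
            clFinish(q);                     /* blocking request of tunable length */
            if (sleep_us) usleep(sleep_us);  /* think time: nonsaturating mode     */
        }
        return 0;
    }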
5.2 Overhead of Standalone Execution
NEON routines are invoked during context creation, to mark
the memory areas to be protected; upon every tracked request (depending on scheduling policy), to handle the induced page fault; and at context exit, to safely return allocated system resources. In addition, we arrange for timer
interrupts to trigger execution of the polling service thread,
schedule the next task on the GPU (and perhaps dis- or re-engage) for the Timeslice algorithm, and sample the next
task or begin a free-run period for the Disengaged Fair
Queueing scheduler.
We measure the overhead of each scheduler by comparing standalone application performance under that scheduler
against the performance of direct device access. Results appear in Figure 4. For this experiment, and the rest of the
evaluation, we chose the following configuration parameters: For all algorithms, the polling service thread is woken up when the scheduler decides, or at 1 ms intervals. The
(engaged) Timeslice and Disengaged Timeslice use a 30 ms
timeslice. For Disengaged Fair Queueing, a task’s sampling
period lasts either 5 ms or as much as is required to intercept a fixed number of requests, whichever is smaller. The
follow-up free-run period is set to be 5× as long, so in the standalone application case it should be 25 ms (even if the fixed request number mark was hit before the end of 5 ms).

Figure 4. Standalone application execution slowdown under our scheduling policies compared to direct device access.
NEON is not particularly sensitive to configuration parameters. We tested different settings, but found the above to be
sufficient. The polling thread frequency is fast enough for
the average request size, but not enough to impose a noticeable load even for single-CPU systems. A 30 ms timeslice is
long enough to minimize the cost of token passing, but not
long enough to immediately introduce jitter in more interactive applications, which follow the “100 ms human perception” rule. For most of the workloads we tested, including
applications with a graphics rendering component, 5 ms or
32 requests is enough to capture variability in request size
and frequency during sampling. We increase this number
to 96 requests for combined compute/graphics applications,
to capture the full variability in request sizes. Though real-time rendering quality is not our goal, testing is easier when
graphics applications remain responsive as designed; in our
experience, the chosen configuration parameters preserved
fluid behavior and stable measurements across all tested inputs.
As shown in Figure 4, the (engaged) Timeslice scheduler incurs low cost for applications with large requests—
most notably MatrixMulDouble and oclParticles. Note, however, that in the latter application, the performance metric is frame rendering times, which are generally the result of longer non-blocking graphics requests. The cost of
(engaged) Timeslice is significant for small-request applications (in particular, 38% slowdown for BitonicSort, 30%
for FastWalshTransform, and 40% for FloydWarshall). This is
due to the large request interception and manipulation overhead imposed on all requests. Disengaged Timeslice suffers
only the post re-engagement status update costs and the infrequent trapping to the kernel at the edges of the timeslice,
which is generally no more than 2%. Similarly, Disengaged
Fair Queueing incurs costs only on a small subset of GPU
requests; its overhead is no more than 5% for all our applications. The principal source of extra overhead is idleness during draining, due to the granularity of polling.

Figure 5. Standalone Throttle execution (at a range of request sizes) slowdown under our scheduling policies compared to direct device access.
Using the controlled Throttle microbenchmark, Figure 5
reports the scheduling overhead variation at different request
sizes. It confirms that the (engaged) Timeslice scheduler incurs high costs for small-request applications, while Disengaged Timeslice incurs no more than 2% and Disengaged
Fair Queueing no more than 5% overhead in all cases.
5.3 Evaluation on Concurrent Executions
To evaluate scheduling fairness and efficiency, we measure
the slowdown exhibited in multiprogrammed scenarios, relative to the performance attained when applications have full,
unobstructed access to the device. We first evaluate concurrent execution between two co-running applications—one of
our benchmarks together with the Throttle microbenchmark
at a controlled request size. The configuration parameters in
this experiment are the same as for the overheads test, resulting in 50 ms “free-run” periods.
Fairness Figure 6 shows the fairness of our concurrent executions under different schedulers. The normalized run time
in the direct device access scenario shows significantly unfair access to the GPU. The effect can be seen on both corunners: a large-request-size Throttle workload leads to large
slowdown of some applications (more than 10× for DCT,
7× for FFT, 12× for glxgears), while a smaller request size
Throttle can suffer significant slowdown against other applications (more than 3× for DCT and 7× for FFT). The
high-level explanation is simple—on accelerators that issue
requests round-robin from different channels, the one with
larger requests will receive a proportionally larger share of
device time. In comparison, our schedulers achieve much
more even results—in general, each co-scheduled task experiences the expected 2× slowdown. More detailed discussion of variations appears in the paragraphs below.
There appears to be a slightly uneven slowdown exhibited
by the (engaged) Timeslice scheduler for smaller Throttle request sizes. For example, for DCT or FFT, the slowdown of
Throttle appears to range from 2× to almost 3×. This variation is due to per-request management costs. Throttle, being our streamlined control microbenchmark, makes back-to-back compute requests to the GPU, with little to no CPU
processing time between the calls. At a given request size,
most of its co-runners generate fewer requests per timeslice.
Since (engaged) Timeslice imposes overhead on every request, the aggressive Throttle microbenchmark tends to suffer more. By contrast, Disengaged Timeslice does not apply this overhead and thus maintains an almost uniform 2×
slowdown for each co-runner and for every Throttle request
size.
A significant anomaly appears when we run the OpenGL
glxgears graphics application vs. Throttle (19 µs) under Disengaged Fair Queueing, where glxgears suffers a significantly higher slowdown than Throttle does. A closer inspection shows that glxgears requests complete at almost
one third the rate that Throttle (19 µs) requests do during
the free-run period. This behavior appears to contradict our
(simplistic) assumption of round-robin channel cycling during the Disengaged Fair Queueing free-run period. Interestingly, the disparity in completion rates goes down with larger
Throttle requests: Disengaged Fair Queueing appears to produce fair resource utilization between glxgears and Throttle
with larger requests. In general, the internal scheduling of
our Nvidia device appears to be more uniform for OpenCL
than for graphics applications. As noted in the “Limitations”
paragraph of Section 3.3, a production-quality implementation of our Disengaged Fair Queueing scheduler should ideally be based on vendor documentation of device-internal
scheduling or resource usage statistics.
Finally, we note that combined compute/graphics applications, such as oclParticles and simpleTexture3D, tend to
create separate channels for their different types of requests.
For multiprogrammed workloads with multiple channels per
task, the limitations of imprecise understanding of internal
GPU scheduling are even more evident than they were for
glxgears: not only do requests arrive on two channels, but
also for some duration of time requests may be outstanding in one channel but not on the other. As a result, even if
the GPU did cycle round-robin among active channels, the
resource distribution among competitors might not follow
their active channel ratio, but could change over time. In a
workload with compute/graphics applications, our request-size estimate across the two (or more) channels of every task
becomes an invalid proxy of resource usage during free-run
periods and unfairness arises: in the final row of Figure 6,
Throttle suffers a 2.3× to 2.7× slowdown, while oclParticles
sees 1.3× to 1.5×. Completely fair scheduling with Disengaged Fair Queueing would again require accurate, vendorprovided information that our reverse-engineering effort in
building this prototype was unable to provide.
Figure 6. Performance and fairness of concurrent executions. Results cover four application-pair executions (one per row: DCT vs. Throttle, FFT vs. Throttle, glxgears (OpenGL) vs. Throttle, and oclParticles (OpenGL + OpenCL) vs. Throttle) and four schedulers (one per column). We use several variations of Throttle with different request sizes (from 19 µs to 1.7 ms). The application runtime is normalized to that when it runs alone with direct device access.
Efficiency In addition to fairness, we also assess the overall system efficiency of multiprogrammed workloads. Here
we utilize a concurrency efficiency metric that measures the
performance of the concurrent execution relative to that of
the applications running alone. We assign a base efficiency
of 1.0 to each application’s running-alone performance (in
the absence of resource competition). We then assess the
performance of a concurrent execution relative to this base.
Given N concurrent OpenCL applications whose per-round
run times are $t_1, t_2, \ldots, t_N$, respectively, when running alone on the GPU, and $t_1^c, t_2^c, \ldots, t_N^c$ when running together, the concurrency efficiency is defined as $\sum_{i=1}^{N} (t_i / t_i^c)$. The intent of the
formula is to sum the resource shares allotted to each application. A sum of less than one indicates that resources have
been lost; a sum of more than one indicates mutual synergy
(e.g., due to overlapped computation).
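To make the metric concrete, the following C sketch (our own illustration, not part of the NEON prototype; the run times shown are hypothetical) computes concurrency efficiency from per-round measurements.

#include <stdio.h>

/* Concurrency efficiency: sum over applications of
 * (run time alone) / (run time when running together).
 * A result below 1.0 indicates lost resources; above 1.0
 * indicates synergy (e.g., overlapped DMA and compute). */
static double concurrency_efficiency(const double *t_alone,
                                     const double *t_concurrent,
                                     int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += t_alone[i] / t_concurrent[i];
    return sum;
}

int main(void)
{
    /* Hypothetical per-round run times (ms) for two co-runners. */
    double alone[]      = { 10.0, 40.0 };
    double concurrent[] = { 22.0, 78.0 };

    printf("concurrency efficiency = %.2f\n",
           concurrency_efficiency(alone, concurrent, 2));
    return 0;
}

For the hypothetical numbers above, the metric comes to roughly 0.97, indicating a small loss to context switching.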
Figure 7. Efficiency of concurrent executions under Direct Device Access, (Engaged) Timeslice, Disengaged Timeslice, and Disengaged Fair Queueing. The metric of concurrency efficiency is defined in the text.
Figure 7 reports the concurrency efficiency of direct device access and our three fair schedulers. It might at first be surprising that the concurrency efficiency of direct device access deviates from 1.0. For small requests, the GPU may frequently switch between multiple application contexts, leading to substantial context-switch costs and <1.0 concurrency
efficiency. Conversely, the efficiency of direct device access
may also be larger than 1.0, due to overlap between DMA
and compute requests.
For its part, the (engaged) Timeslice scheduler incurs
high request management costs. Its efficiency losses with
respect to direct device access average 19% and reach as high as 42%; specifically, average/max efficiency losses are 10%/16% (DCT vs. Throttle), 27%/42% (FFT vs. Throttle), 5%/6% (glxgears vs. Throttle), and 35%/37% (oclParticles vs.
Throttle).
Disengaged Timeslice adds little per-request overhead
while also limiting channel switching on the GPU, so it
enjoys some efficiency advantage over (engaged) Timeslice:
average/max losses with respect to direct device access are
10%/35% (3%/12%, 16%/35%, −5%/0%, and 27%/33% per
respective test).
Overall, Disengaged Fair Queueing does a better job
at allowing “background concurrency”: its average/max
losses with respect to direct device access are only 4%/18%
(0%/3%, 6%/12%, 0%/2% and 11%/18% per respective
test). Unfortunately, our incomplete understanding of GPU
internal scheduling limits our achievable efficiency, as the
imprecise slowdown applied to applications with multiple
channels results in wasted opportunities for Disengaged Fair
Queueing. This is why oclParticles exhibits the worst efficiency loss among the tested cases.
Scalability The performance, fairness, and efficiency characteristics of our schedulers are maintained under higher
concurrency. Figure 8 shows that for four concurrent applications, including one making large requests (Throttle) and
three making smaller ones (BinarySearch, DCT, FFT), the
average slowdown remains at 4–5×. Following the pattern
we have observed before, the efficiency of concurrent runs
drops more when the scheduler is fully engaged (a 13% loss relative to direct device access), but less so when we use disengaged scheduling: 8% for Disengaged Timeslice and 7% for Disengaged Fair Queueing.
Figure 8. Fairness (bars, left Y axis) and efficiency (line,
right Y axis) when running four concurrent applications.
5.4 Evaluation on Nonsaturating Workloads
Next we assess our scheduler performance under nonsaturating workloads—those that interleave GPU requests with
“off” periods of CPU-only work. Nonsaturating workloads
lead to GPU under-utilization with non-work-conserving
schedulers. In timeslice scheduling, for instance, resources
are wasted if a timeslice is only used partially. As explained
in Section 3.3, Disengaged Fair Queueing maintains work-conserving properties, and therefore wastes less time on
nonsaturating workloads. Direct device access, of course,
is fully work-conserving.
To create nonsaturating workloads, we interleave sleep
time in our Throttle microbenchmark to reach a specified
ratio (the proportion of “off” time under standalone execution). Figure 9 illustrates the results for DCT and the nonsaturating Throttle workload. Note that fairness does not necessarily require co-runners to suffer equally: execution is fair
as long as no task slows down significantly more than 2×.
The results for Disengaged Fair Queueing should thus be
regarded as better than those for the timeslice schedulers,
because Throttle does not suffer, while DCT actually benefits from its co-runner’s partial idleness. Figure 10 further
shows the efficiency losses for our three schedulers and direct device access. At an 80% Throttle sleep ratio, the losses relative to direct access are 36%, 34%, and essentially 0% for (engaged) Timeslice, Disengaged Timeslice, and Disengaged Fair Queueing, respectively.
Figure 9. Performance and fairness of concurrent executions for nonsaturating workloads, demonstrated for DCT vs. Throttle with GPU "off" periods.
Figure 10. Efficiency of concurrent executions for nonsaturating workloads.
6. Future System Considerations
6.1 Interface Specification
Protected OS-level accelerator management requires relatively little information: an event-based scheduling interface
that allows us to identify when acceleration requests have
been made or completed, utilization information that allows
us to account for per-channel resource usage, and sufficient
semantic understanding of requests to avoid deadlocks and
any unnecessary resource idleness.
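As a rough illustration of how little would need to be documented, the C declarations below sketch one possible shape for such an interface; every name and field here is hypothetical rather than drawn from any vendor specification.

/* Hypothetical sketch of the minimal interface a vendor might
 * document for OS-level accelerator scheduling. Names and fields
 * are illustrative only. */
#include <stdint.h>

enum accel_event_type {
    ACCEL_EVENT_SUBMIT,      /* a request was placed on a channel    */
    ACCEL_EVENT_COMPLETE,    /* a previously submitted request done  */
};

struct accel_event {
    enum accel_event_type type;
    uint32_t task_id;        /* owning OS task                       */
    uint32_t context_id;     /* accelerator context                  */
    uint32_t channel_id;     /* submission channel                   */
    uint64_t request_seq;    /* per-channel request reference number */
};

/* Per-channel utilization counters the device could expose, so the
 * kernel can account for resource usage without trapping every
 * request. */
struct accel_channel_stats {
    uint64_t busy_cycles;    /* accelerator cycles consumed          */
    uint64_t bytes_dma;      /* DMA traffic attributable to channel  */
    uint64_t requests_done;  /* completed request count              */
};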
Both standard interfaces (system calls) and alternative approaches (virtual memory) can capture request submission
and completion. We believe that the increasing integration of
CPUs and accelerators will only increase the need for fast,
memory-based interfaces. By documenting basic information regarding the layout of accelerator request queues and
core component interactions, and by exposing necessary performance counters to obtain accurate utilization statistics,
vendors could enable low-overhead OS kernel involvement in accelerator scheduling, providing full protection and efficiency through disengaged scheduling. Vendor assistance at this level is necessary: systems like NEON, which rely on reverse engineering and inference, cannot reach production-level quality.
A “partial opening” of interface specifications, as envisioned here, would still allow vendors to retain proprietary
control over much of the accelerator software/hardware system. For instance, while a request submission event needs
to be identifiable for manipulation, much of the data layout and semantics within a request data structure can remain unpublished. On the other hand, in order to enable
deep scheduling optimizations, vendors would have to reveal request-ordering semantics and synchronization models. For our work with NEON we assumed no request reordering within GPU contexts, an assumption that flows
naturally from the semantics of the context concept. For a
higher degree of optimizations (e.g., reordering of requests
across CUDA streams [28]), vendors could tag requests to
indicate the requirements to be met.
A hardware accelerator may accept multiple requests simultaneously to exploit internal parallelism or to optimize
for data affinity. The OS needs additional information to
manage such parallelism. Most obviously, completion events
need to carry labels that link them to the corresponding requests. For NEON, the task, context, channel, and request reference number have served as surrogate labels. More subtly,
under concurrent submission, an outstanding request may be
queued in the device, so that the time between submission
and completion does not accurately reflect its resource usage. The hardware can facilitate OS accounting by including
resource usage information in each completion event.
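The following sketch suggests how a kernel might consume such labeled, usage-carrying completion events for per-channel accounting; the event layout, the accounting table, and all names are hypothetical.

/* Sketch of kernel-side accounting driven by labeled completion
 * events. All structures and names are illustrative. */
#include <stdint.h>

#define MAX_CHANNELS 128

struct channel_account {
    uint64_t cycles_charged;   /* accelerator cycles billed so far */
    uint64_t completions;
};

static struct channel_account accounts[MAX_CHANNELS];

struct accel_completion {
    uint32_t channel_id;
    uint64_t request_seq;      /* links completion to submission   */
    uint64_t cycles_used;      /* device-reported resource usage   */
};

/* Charge the reporting channel with the device-reported cycles,
 * rather than wall-clock time between submission and completion,
 * which overstates usage for requests that sat queued. */
static void account_completion(const struct accel_completion *c)
{
    struct channel_account *a = &accounts[c->channel_id % MAX_CHANNELS];
    a->cycles_charged += c->cycles_used;
    a->completions++;
}

Charging device-reported cycles rather than submission-to-completion latency avoids over-billing requests that spent most of that interval waiting in the device queue.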
Direct accelerator access from other (e.g., I/O) devices is
currently limited to DMA (e.g., RDMA for GPUDirect in
the CUDA SDK [28]). Applications and services using the
GPU and I/O devices will still initiate DMA and compute
requests through the same memory-mapped interface, and
thus be amenable to scheduling. Appropriate specifications
for any alternative, peer-to-peer communication path to the
accelerator would be necessary if devices other than the CPU
were to enjoy rich, direct accelerator interactions.
6.2 Hardware Preemption Support
The reverse-engineering efforts of the open-source community [27, 31] suggest that some type of request arbitration
is possible for current GPUs, but true hardware preemption
is officially still among the list of upcoming GPU features
(e.g., in the roadmap of AMD HSA [24]). In its absence,
tricks such as cutting longer requests into smaller pieces have
been shown to enhance GPU interactivity, at least for mutually cooperative applications [6]. True hardware preemption
support would save state and safely context-switch from an
ongoing request to the next in the GPU queue. If the preemption decision were left to a hardware scheduler, it would
be fast but oblivious to system scheduling objectives. Disengaged scheduling can help make context-switching decisions appropriately coarse-grained, so that they can be deferred to
the OS. Simple documentation of existing mechanisms to
identify and kill the currently running context would enable
full protection for schedulers like Disengaged Fair Queueing. True hardware preemption would allow us to tolerate
requests of arbitrary length, without sacrificing interactivity
or becoming vulnerable to infinite loops.
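To make the role of such a mechanism concrete, here is a minimal sketch of a watchdog that an OS might drive from a (soft) timer; gpu_current_context() and gpu_kill_context() stand in for the hypothetical vendor-documented primitives, and the policy is a simplification of the overuse control described earlier.

#include <stdint.h>

/* Hypothetical vendor-documented primitives. */
extern uint32_t gpu_current_context(void);      /* context now running on the GPU */
extern void     gpu_kill_context(uint32_t ctx); /* terminate that context         */

/* Called periodically from a (soft) timer while a timeslice is being
 * monitored: if the slice owner is still running well past the end of
 * its allotment, terminate it rather than let it monopolize the GPU. */
void overuse_watchdog(uint32_t slice_owner,
                      uint64_t now_ns, uint64_t slice_end_ns,
                      uint64_t grace_ns)
{
    if (now_ns <= slice_end_ns + grace_ns)
        return;                               /* still within allotment + grace */

    if (gpu_current_context() == slice_owner)
        gpu_kill_context(slice_owner);        /* overuse: reclaim the GPU */
}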
6.3 Protection of Other Resources
Beyond the scheduling of cycles, OS-level protection can enforce the safe use of other GPU resources, such as channels
and memory, to prevent unfair or abusive behavior. Channels
establish a user-mapped address range through which commands are submitted to the hardware. Existing GPUs do not
multiplex requests from different tasks on a channel; instead,
an established channel is held by the task that created it (and
its associated context) for as long as the application deems
necessary. It is thus possible to devise a simple denial-ofservice attack by creating many contexts and occupying all
available GPU channels, denying subsequent requests
from other applications; on our Nvidia GTX670 GPU, after 48 contexts had been created (with one compute and
one DMA channel each), no other application could use the
GPU.
An OS-level resource manager may enable protected allocation of GPU channels to avoid such DoS attacks. We
describe a possible policy below. First, we limit the number of channels in any one application to a small constant
C, chosen to match the characteristics of anticipated applications. Attempts to allocate more than C channels return an
“out of resources” error. Second, with a total of D channels
available on the GPU, we permit no more than D/C applications to use the GPU at any given time. More elaborate
policies, involving prioritization, preemption, etc., could be
added when appropriate.
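A minimal sketch of the admission check implied by this policy follows; the values chosen for C and D, and all bookkeeping names, are illustrative placeholders.

#include <stdbool.h>

#define PER_APP_CHANNEL_LIMIT 4   /* C: channels allowed per application (illustrative) */
#define TOTAL_GPU_CHANNELS   96   /* D: channels exposed by the device (illustrative)   */
#define MAX_GPU_APPS (TOTAL_GPU_CHANNELS / PER_APP_CHANNEL_LIMIT)

struct gpu_app_state {
    int channels_in_use;
};

static int active_gpu_apps;       /* applications currently holding channels */

/* Returns true if the channel allocation is admitted; false maps to an
 * "out of resources" error returned to the application. */
bool admit_channel_alloc(struct gpu_app_state *app)
{
    if (app->channels_in_use >= PER_APP_CHANNEL_LIMIT)
        return false;                         /* per-application cap (C) reached  */

    if (app->channels_in_use == 0) {
        if (active_gpu_apps >= MAX_GPU_APPS)  /* no more than D/C concurrent apps */
            return false;
        active_gpu_apps++;                    /* first channel: admit a new app   */
    }

    app->channels_in_use++;
    return true;
}

A matching release path would decrement both counters when an application destroys its last channel.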
Our GTX670 GPU has 2 GB of onboard RAM that
is shared by active tasks. This memory is managed with
the help of the system’s virtual memory manager and the
IOMMU, already providing sufficient isolation among tasks.
It is conceivable that an erroneous or malicious application
might exhaust the GPU memory and prevent normal use by
others. In theory, our OS-level protection framework could
prevent such abuses by accounting for each application's use of
GPU memory and blocking excessive consumption. This,
however, requires uncovering additional GPU semantics on
memory buffer management that we have not yet explored.
7. Conclusion
This paper has presented a framework for fair, safe, and efficient OS-level management of GPU resources—and, by extension, of other fast computational accelerators. We believe
our request interception mechanism, combined with disengagement, to be the first resource management technique capable of approaching the performance of direct device access from user space. Our Timeslice scheduler with overuse
control can guarantee fairness during multiprogrammed executions. Its disengaged variant improves performance without sacrificing either fairness or protection, by intercepting
only a small subset of requests, allowing most to reach the
device directly. Our Disengaged Fair Queueing scheduler is
effective at limiting idleness, and incurs only 4% average
overhead (max 18%) relative to direct device access across
our evaluation cases.
While our prototype Disengaged Fair Queueing scheduler ensures fairness between multiple compute (OpenCL)
applications, its fairness support is less than ideal for graphics (OpenGL)-related applications. This is due to our insufficient understanding of the device-internal scheduling of
concurrent graphics/compute requests during free-run periods; a production-quality implementation based on vendorsupplied information would not suffer from this limitation.
Even in our prototype, the limitation does not affect the fairness of our Disengaged Timeslice scheduler, which grants
GPU access to no more than one application at a time.
We have identified necessary and/or desirable hardware
features to support disengaged resource management, including the ability to identify the currently running GPU request, to account for per-context resource usage during disengaged execution, to cleanly terminate an aberrant context,
and (ideally) to suspend and resume requests. For the purposes of prototype construction, we have identified surrogates for some of these features via reverse engineering. We
encourage device vendors to embrace a limited “opening” of
the hardware-software interface specification sufficient for
production use.
Acknowledgments
This work was supported in part by the National Science
Foundation under grants CCF-0937571, CCR-0963759,
CCF-1116055, CNS-1116109, CNS-1217372, CNS-1239423,
CCF-1255729, and CNS-1319417. Our thanks as well to
Shinpei Kato and the anonymous reviewers, for comments
that helped to improve this paper.
References
[1] Altera. www.altera.com.
[2] AMD. Radeon R5xx Acceleration: v1.2, Feb. 2008.
[3] AMD’s revolutionary Mantle.
http://www.amd.com/us/products/technologies/mantle.
[4] ARM graphics plus GPU compute. www.arm.com/products/
multimedia/mali-graphics-plus-gpu-compute.
[5] M. Aron and P. Druschel. Soft timers: Efficient microsecond
software timer support for network processing. ACM Trans.
on Computer Systems, 18(3):197–228, Aug. 2000.
[6] C. Basaran and K.-D. Kang. Supporting preemptive task executions and memory copies in GPGPUs. In Proc. of the 24th Euromicro Conf. on Real-Time Systems (ECRTS), pages 287–296. IEEE, 2012.
[7] N. Brookwood. AMD Fusion family of APUs. AMD White Paper, Mar. 2010.
[8] C. Cascaval, S. Chatterjee, H. Franke, K. Gildea, and
P. Pattnaik. A taxonomy of accelerator architectures and their
programming models. IBM Journal of Research and
Development, 54(5):5–1, 2010.
[9] M. Daga, A. M. Aji, and W.-C. Feng. On the efficacy of a
fused CPU+GPU processor (or APU) for parallel computing.
In Proc. of the 2011 Symp. on Application Accelerators in
High-Performance Computing, Knoxville, TN, July 2011.
[10] A. Demers, S. Keshav, and S. Shenker. Analysis and
simulation of a fair queueing algorithm. In Proc. of the ACM
SIGCOMM Conf., Austin, TX, Sept. 1989.
[11] A. Dwarakinath. A fair-share scheduler for the graphics
processing unit. Master’s thesis, Stony Brook University,
Aug. 2008.
[12] G. A. Elliott and J. H. Anderson. Globally scheduled
real-time multiprocessor systems with GPUs. Real-Time
Systems, 48(1):34–74, Jan. 2012.
[13] M. Gottschlag, M. Hillenbrand, J. Kehne, J. Stoess, and
F. Bellosa. LoGV: Low-overhead GPGPU virtualization. In
Proceedings of the 4th Intl. Workshop on Frontiers of
Heterogeneous Computing. IEEE, 2013.
[14] A. G. Greenberg and N. Madras. How fair is fair queuing.
Journal of the ACM, 39(3):568–598, July 1992.
[15] V. Gupta, K. Schwan, N. Tolia, V. Talwar, and
P. Ranganathan. Pegasus: Coordinated scheduling for
virtualized accelerator-based systems. In Proc. of the
USENIX Annual Technical Conf., Portland, OR, June 2011.
[16] Intel. OpenSource HD Graphics Programmer’s Reference
Manual: vol. 1, part 2, May 2012.
[17] Intel Corporation. Enabling consistent platform-level
services for tightly coupled accelerators. White Paper, 2008.
[18] W. Jin, J. S. Chase, and J. Kaur. Interposed proportional
sharing for a storage service utility. In Proc. of the ACM
SIGMETRICS Conf., New York, NY, June 2004.
[19] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa.
TimeGraph: GPU scheduling for real-time multi-tasking
environments. In Proc. of the USENIX Annual Technical
Conf., Portland, OR, June 2011.
[20] S. Kato, M. McThrow, C. Maltzahn, and S. Brandt. Gdev:
First-class GPU resource management in the operating
system. In Proc. of the USENIX Annual Technical Conf., Boston, MA, June 2012.
[21] Khronos OpenCL Working Group and others. The OpenCL specification, v1.2, 2011. A. Munshi, Ed.
[22] Khronos OpenCL Working Group and others. The OpenGL graphics system: A specification, v4.2, 2012. Mark Segal and Kurt Akeley, Eds.
[23] A. Krishna, T. Heil, N. Lindberg, F. Toussi, and S. VanderWiel. Hardware acceleration in the IBM PowerEN processor: Architecture and performance. In Proc. of the 21st Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, Sept. 2012.
[24] G. Kyriazis. Heterogeneous system architecture: A technical review. Technical report, AMD, Aug. 2012. Rev. 1.0.
[25] K. Menychtas, K. Shen, and M. L. Scott. Enabling OS research by inferring interactions in the black-box GPU stack. In Proc. of the USENIX Annual Technical Conf., San Jose, CA, June 2013.
[26] E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt. Helios: Heterogeneous multiprocessing with satellite kernels. In Proc. of the 22nd ACM Symp. on Operating Systems Principles (SOSP), Big Sky, MT, Oct. 2009.
[27] Nouveau: Accelerated open source driver for Nvidia cards. nouveau.freedesktop.org.
[28] NVIDIA Corp. CUDA SDK, v5.0. http://docs.nvidia.com/cuda/.
[29] S. Panneerselvam and M. M. Swift. Operating systems should manage accelerators. In Proc. of the Second USENIX Workshop on Hot Topics in Parallelism, Berkeley, CA, June 2012.
[30] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: The single-node case. IEEE/ACM Trans. on Networking, 1(3), June 1993.
[31] Nouveau, by PathScale Inc. github.com/pathscale/pscnv.
[32] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. PTask: Operating system abstractions to manage GPUs as compute devices. In Proc. of the 23rd ACM Symp. on Operating Systems Principles (SOSP), Cascais, Portugal, Oct. 2011.
[33] K. Shen and S. Park. FlashFQ: A fair queueing I/O scheduler for Flash-based SSDs. In Proc. of the USENIX Annual Technical Conf., San Jose, CA, June 2013.
[34] M. Shreedhar and G. Varghese. Efficient fair queueing using deficit round robin. In Proc. of the ACM SIGCOMM Conf., Cambridge, MA, Aug. 1995.
[35] L. Soares and M. Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In Proc. of the 9th USENIX Conf. on Operating Systems Design and Implementation (OSDI), Vancouver, BC, Canada, Oct. 2010.
[36] P. M. Stillwell, V. Chadha, O. Tickoo, S. Zhang, R. Illikkal, R. Iyer, and D. Newell. HiPPAI: High performance portable accelerator interface for SoCs. In Proc. of the 16th IEEE Conf. on High Performance Computing (HiPC), Kochi, India, Dec. 2009.