Online Bipartite Perfect Matching With Augmentations
Kamalika Chaudhuri∗, Constantinos Daskalakis†, Robert D. Kleinberg‡, and Henry Lin†
∗ Information Theory and Applications Center, U.C. San Diego
Email: [email protected]
† Division of Computer Science, U.C. Berkeley
E-mail: [email protected], [email protected]
‡ Computer Science Department, Cornell University
E-mail: [email protected]
Abstract—In this paper, we study an online bipartite matching
problem, motivated by applications in wireless communication,
content delivery, and job scheduling. In our problem, we have
a bipartite graph G between n clients and n servers, which
represents the servers to which each client can connect. Although
the edges of G are unknown at the start, we learn the graph
over time, as each client arrives and requests to be matched to
a server. As each client arrives, she reveals the servers to which
she can connect, and the goal of the algorithm is to maintain a
matching between the clients who have arrived and the servers.
Assuming that G has a perfect matching which allows all clients
to be matched to servers, the goal of the online algorithm is to
minimize the switching cost, the total number of times a client
needs to switch servers in order to maintain a matching at all
times.
Although there are no known algorithms which are guaranteed
to yield switching cost better than the trivial O(n^2) in the worst
case, we show that the switching cost can be much lower in
three natural settings. In our first result, we show that for any
arbitrary graph G with a perfect matching, if the clients arrive
in random order, then the total switching cost is only O(n log n)
with high probability. This bound is tight, as we show an example
where the switching cost is Ω(n log n) in expectation. In our
second result, we show that if each client has edges to Θ(log n)
uniformly random servers, then the total switching cost is even
better; in this case, it is only O(n) with high probability, and we
also have a lower bound of Ω(n/ log n). In terms of the number
of edges needed for each client, our result is tight, since Ω(log n)
edges are needed to guarantee a perfect matching in G with
high probability. In our last result, we derive the first algorithm
known to yield total cost O(n log n) when the underlying graph G
is a forest. This matches the existing lower bound, which shows
that any online algorithm must have switching cost Ω(n log n)
even when G is restricted to be a forest.
I. INTRODUCTION
In this paper, we study an online bipartite matching problem,
which models a scenario in which clients arrive over time
and request permanent service from a set of given servers.
As each client arrives, she announces a set of feasible servers
capable of servicing her request, and our goal is to provide
service to each client persistently by maintaining a matching
at all times between clients who have arrived and servers
capable of servicing their requests. We would like to assign
clients to servers permanently without ever having to reassign
clients to different servers, but when a new client arrives
we may be forced to reassign existing clients to alternative
servers to ensure that all clients can receive service. Since it
is often more important to provide service to all clients, the
goal of our algorithm will be to always maintain a matching
between arrived clients and feasible servers, while minimizing
the switching cost: the total number of times that clients are
reassigned to different servers.
Our online bipartite matching problem has a wide variety of
applications spanning diverse areas, including streaming content delivery, web hosting, remote data storage, job scheduling,
and hashing. Below we describe a few applications, each of which
can be modeled as an instance of the online bipartite matching
problem described above. In the following examples, we
always refer to the entities requesting service as clients and
the entities providing service as servers for consistency. In
examples where it is not clear, we mark in parentheses which
entities are clients and which are servers.
• Streaming Content Delivery, Web Hosting, and Remote Data Storage: We have a set of servers capable of
streaming content online, hosting web pages, or storing
data remotely. A sequence of clients arrives requesting
to have their content streamed online, their web pages
hosted online, or their data stored online. Due to locality,
security, cost, routing policy, or other reasons, the streaming
content, web page, or data from each client can only
be hosted at a subset of server locations.
• Job Scheduling: We have a set of servers with differing
capabilities available to process job requests from persistent
sources: jobs that need to be processed over a long
or indefinite period of time (e.g. protein folding, genomic
research, SETI@HOME). A sequence of persistent job
requests (clients) arrives, each revealing a subset of servers
capable of servicing it.
• Hashing: We have locations in a hash table (servers)
available to store data objects (clients), and a set of
hash functions. Data objects arrive over time and can be
assigned to a location in the hash table if one of the hash
functions maps the data object to that location.
Note that in all the examples above, it is reasonable to
assume that clients can be reassigned to different servers, but at
a cost. For instance, in the streaming content example, clients
may not be able to access their content while it is being
transferred from one server to another. Thus, it is desirable
to minimize the number of reassignments our algorithm incurs,
since each reassignment causes such an interruption.
Before stating our results, we define our problem more formally. Each instance of the online bipartite matching problem
is defined by:
• A bipartite graph G between clients and servers, which
represents the servers to which each client can connect
• A permutation σ, which represents the sequence in which
the clients arrive
The graph G is unknown at the beginning, but as clients
arrive according to σ, each client reveals the servers to which
she has edges. The goal of the algorithm is to maintain
a matching between arrived clients and servers at all times,
while minimizing the total number of times that clients are
reassigned to different servers. Alternatively, as each client
arrives, our algorithm can be viewed as finding an augmenting
path (a path that alternates between matched and unmatched
edges) in the graph revealed so far, from the arriving client to
an unmatched server. It should be clear that upper bounding
the total length of the augmenting paths used by the algorithm
also upper bounds the total switching cost. Since the total length
of the augmenting paths is close to the switching cost, it can
also be used to lower bound the switching cost in some cases.
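To make the augmenting-path view concrete, here is a minimal Python sketch of the greedy strategy discussed below (connect each arriving client along a shortest augmenting path, found by BFS, and count how many previously matched clients are switched); the data layout and function names are our own illustration, and client and server identifiers are assumed distinct.

from collections import deque

def shortest_augmenting_path(adj, match_server, client):
    """BFS from `client` to the nearest free server, alternating unmatched
    edges (client -> server) and matched edges (server -> its current client).
    Returns [client, server, client, ..., free server], or None."""
    parent = {client: None}
    queue = deque([client])
    while queue:
        c = queue.popleft()
        for s in adj[c]:                  # servers this client can connect to
            if s in parent:
                continue
            parent[s] = c
            if s not in match_server:     # free server: reconstruct the path
                path, node = [], s
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            parent[match_server[s]] = s   # follow the matched edge back to a client
            queue.append(match_server[s])
    return None

def greedy_online_matching(arrivals):
    """arrivals: list of (client, feasible_servers) pairs in arrival order.
    Returns the total switching cost: the number of already-matched clients
    that are reassigned over the whole run."""
    adj, match_server, switching_cost = {}, {}, 0
    for client, servers in arrivals:
        adj[client] = list(servers)
        path = shortest_augmenting_path(adj, match_server, client)
        assert path is not None, "instance is assumed to admit a matching"
        switching_cost += len(path) // 2 - 1   # every client on the path but the new one switches
        for c, s in zip(path[0::2], path[1::2]):
            match_server[s] = c                # flip the matching along the path
    return switching_cost

For instance, greedy_online_matching([("c1", ["s1", "s2"]), ("c2", ["s1"])]) returns 1, since c1 must be switched from s1 to s2 when c2 arrives.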
For simplicity, we assume that each server may service
(or be matched to) at most one client at a time, although our
results can be fully generalized to the case where servers can
service multiple requests. We also assume that G consists of
n clients and n servers and that G contains a perfect matching,
although our results essentially still hold without these
assumptions; see the Appendix for further discussion.
Although this problem was introduced over a decade
ago [1], we still know surprisingly little about the optimal
algorithm. For example, there is a very natural greedy algorithm, which provides service to each new client by using
an augmenting path of minimal length, but is this greedy
algorithm optimal? What is its worst-case switching cost?
There are no known upper bounds to show that the greedy
algorithm or any other algorithm performs better than O(n^2) in
the worst case, but an upper bound of O(n^2) is trivial since any
reasonable algorithm switches at most O(n) clients per
arriving client. Furthermore, an existing lower bound in [1]
only shows that the total switching cost has to be Ω(n log n),
so a large gap exists between the known upper and lower
bounds.
A. Our Results
Although worst case analysis for this problem has not
yielded fruitful results, the worst case may not often occur in
practice, and it may be reasonable to make certain assumptions
about the graph G or the arrival order σ. For example, in the
case of remote data storage, one might imagine that the graph
G which determines the servers which can store a client’s
data is fixed, but perhaps the arrival order σ is random. Does
the greedy algorithm perform provably better than O(n^2) in
this case? Moreover in some cases, it might be reasonable to
assume that the graph G is generated randomly. For example,
in the case of hashing, if Θ(log n) random hash functions are
chosen, then the set of edges between clients (data elements)
and servers (hash locations) is a graph where each client
has Θ(log n) random edges to servers. Can better bounds be
proved in this case as well?
In this paper, we show that indeed the switching cost can
be much better under these conditions, despite the lack of
worst-case upper bounds better than O(n^2). In the first case,
we show that when G is any arbitrary bipartite graph with a
perfect matching, and σ is chosen uniformly at random, then
the greedy algorithm performs well and achieves O(n log n)
switching cost with high probability. This bound is tight as
we show that there is a distribution over graphs for which
any deterministic algorithm must incur cost Ω(n log n) in
expectation, even if σ is chosen uniformly at random.
In the second case, where each client has Θ(log n) random
edges to servers, we show that the switching cost can be even
better. In this case, we prove that the switching cost is O(n)
with high probability, and it nearly matches our lower bound of
Ω(n/ log n). In terms of the number of edges used to generate
the graph G, our result is tight, since Ω(log n) random edges
are needed per node, in order to guarantee a perfect matching
with high probability.
Lastly, we also make the first progress in over a decade in
the original worst case model where σ is chosen adversarially,
but G is known to be acyclic (i.e. a forest). In this setting,
we derive a new algorithm which achieves cost O(n log n),
which matches the existing lower bound of Ω(n log n) for
forests. Although the networks that occur in practice are not
often acyclic, we view our last result as making progress
towards finding an O(n log n) solution in the general worst
case setting.
B. Related Work
The problem studied here was first introduced by Grove
et al. [1] in a paper which focused on a special case of the
problem where each client has degree at most two. For this
special class of instances, they prove that the greedy algorithm
incurs a worst-case switching cost of O(n log n), and they
give a matching lower bound by showing there are cases
where any algorithm must have cost Ω(n log n). The paper
goes on to consider the case in which clients can connect
and disconnect over time; for this problem, assuming that
each client has degree at most two, they present a randomized
algorithm which has a competitive ratio of O(√n), where the
competitive ratio of an online algorithm is the ratio between
the cost incurred by the online algorithm, which does not
know the input sequence in advance, and the cost incurred by
an optimal algorithm which does know the input sequence in
advance. Prior to our work, the special case in which each
client has degree at most two was the only class of instances
for which an optimal online algorithm was known.
Our problem is related to previous work on online load balancing with task preemption, which has been studied by many
authors; see [2], [3], [4] for example. The main difference
between our work and the previous work on load balancing
is that our model assumes a hard capacity constraint on the
servers and allows clients to be reassigned, while the previous
load balancing work generally does not assume a hard capacity
constraint on the servers, or does not allow reassignments.
Without hard capacity constraints or reassignments, the goal
in online load balancing is often to minimize the maximum
load, or if reassignments are allowed, the goal is to minimize
the number of reassignments while always maintaining a
maximum load which is close to the optimal maximum load
achievable.
For example, a series of works by Azar et al. [5], [6],
[7] studied online load balancing without the possibility of
task preemption (i.e. without reassigning clients, in our terminology). They established that when tasks arrive but never
depart, the greedy algorithm, which assigns each task to the
least-loaded admissible server, has O(log n) competitive ratio
with respect to the maximum load. When tasks both arrive
and depart, they show that the optimal competitive ratio is
O(√n). When jobs arrive and depart and task preemption
is allowed, the situation improves dramatically: Phillips and
Westbrook [3] give an algorithm which always maintains a
maximum load within a factor of O(log n) of optimal, while
only incurring a reassignment cost of O(m), where m is the
number of arrivals and departures. Westbrook [4] also gives
an algorithm which is O(1)-competitive with respect to the
maximum load and with a reassignment cost of O(m log n).
Andrews et al. [2] study online load balancing with reassignment costs in a model in which any client may be assigned to
any server, but clients have arbitrary sizes and reassignment
costs. (The same model was considered, in lesser generality,
in [4].) They give an online algorithm which is 3.5981-competitive with respect to load and 6.8285-competitive with
respect to reassignment cost.
Another line of related work involves a variant of our online bipartite matching problem, where reassignments are not
allowed and capacity constraints must be maintained exactly.
Given these restrictions, the goal of the online algorithm is
to match as many clients as possible. This problem was first
studied by Karp et al. [8], and later generalized by Goel and
Mehta [9], and Mehta et al. [10] for the purposes of studying
an online adword placement problem.
Our work is also loosely related to the recent work of
Godfrey, who analyzes the load balancing properties of certain
random processes which assign clients to servers [11]. Godfrey
proves that only very weak conditions are needed on the
random process, in order to ensure that the servers stay roughly
balanced with high probability.
Our load-balancing scenario in Section III also has connections
to hashing. A large body of theoretical and experimental
research has focused on hashing schemes and dictionary data
structures; a dictionary data structure is a table containing
items, and the goal is to design a data structure which supports
fast insertions and accesses with fairly high space utilization.
Typically, dictionaries are used in combination
with a hashing scheme, which is a mapping from the items
to their locations in the table. A common assumption in
theoretical work is that such mappings are random. Under this
assumption, the connection between hashing and our setting is
as follows: if an item has d randomly chosen locations where
it can be inserted into the table, then the problem of inserting a
series of items into the table reduces to the problem of finding
a matching in a random bipartite graph in an online manner.
Under this setting, our algorithm is very similar to the
Cuckoo Hashing scheme of Pagh and Rodler [12]. Cuckoo
hashing, essentially, uses the following algorithm for inserting
items into tables or, equivalently, matching new clients to
servers. If there is an empty server adjacent to the client to be
matched, then the client is matched to this server; otherwise,
it is matched arbitrarily to one of its adjacent servers, and
a new matching is recursively found for the client which
was previously attached to this server. In [12], [13], it was
shown that if the degree of each node is d = O(ln(1/ε)), and
if there are n clients and n(1 + ε) servers, then the expected
amortized switching cost per client is O_ε(1), where the O_ε(1)
is a constant dependent on ε and the expectation is with respect
to the randomness in the graph between clients and servers. In
contrast, our results imply that if d = O(log n), and if there
are n clients and n servers, the amortized switching cost per
client is still O(1) with high probability. By making ε small,
say ε = O(1/n), the result in [13] can also be used to prove
bounds on the switching cost when n clients are matched to
n servers, but the bounds produced are very weak, since the
bound on the switching cost per client depends on ε and is
super-polynomial in 1/ε. Thus, our results provide a novel
extension of the results of [12], [13] to the case where there is no
surplus of servers over clients.
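As an illustration, the insertion rule just described can be sketched as follows; this is a simplified, hypothetical Python rendering of cuckoo-style insertion with a bounded number of evictions, not the exact procedure of [12].

import random

def cuckoo_insert(table, hash_locs, item, max_evictions=100):
    """Insert `item` into `table` (a dict: location -> item).
    hash_locs(item) returns the list of locations the item may occupy.
    If every candidate location is occupied, evict a random occupant and
    re-insert it, up to max_evictions displacements."""
    for _ in range(max_evictions):
        locs = hash_locs(item)
        for loc in locs:
            if loc not in table:        # free location: place the item
                table[loc] = item
                return True
        loc = random.choice(locs)       # all candidates full: evict someone
        item, table[loc] = table[loc], item
    return False                        # give up after too many evictions

For example, with d hash functions one might call cuckoo_insert(table, lambda x: [hash((x, i)) % n for i in range(d)], key) for some table size n; each eviction corresponds to one unit of switching cost in our terminology.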
II. RANDOM ARRIVAL ORDER
In this section, we assume that the bipartite graph G between clients and servers is arbitrary, but that the clients arrive
in a uniformly random order (i.e. σ is uniformly distributed
over all n! possible arrival orders). Under this assumption,
we prove that the natural greedy algorithm for the problem,
which connects each client by using the shortest augmenting
path available, suffers a switching cost of O(n log n) with
high probability. We also prove a matching lower bound of
Ω(n log n) on the total switching cost for any online algorithm,
which shows that O(n log n) is the best switching cost that can
be achieved in the random-ordering model.
A. The Upper Bound
To obtain some intuition, we start by showing that the
total switching cost is at most O(n log n) in expectation. Our
analysis here is just for intuition, as a more sophisticated
argument is needed to show that the total cost is O(n log n)
with high probability. To prove that the total switching cost is
O(n log n) in expectation, we start by proving the following
lemma, which upper bounds the expected cost of each arriving
client.
Lemma II.1 Given that i clients remain to arrive, the expected
number of servers that the greedy algorithm needs to switch in
order to connect the next client is at most n/i.
Given the lemma, it is easy to sum over all clients and bound
the total expected switching cost by ∑_{i=1}^{n} n/i = O(n log n).
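For completeness, the sum evaluates via the harmonic number H_n:
\[
\sum_{i=1}^{n} \frac{n}{i} \;=\; n\,H_n \;\le\; n\,(\ln n + 1) \;=\; O(n \log n).
\]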
We now prove the lemma.
Proof: To prove the lemma, let us assume for notational
purposes that clients {1, . . . , i} have yet to arrive, and let us
define d_k to be the number of servers that need to be switched
in a shortest augmenting path from client k to a free server,
if client k arrives next, for k ∈ {1, . . . , i}. Note that when
clients {1, . . . , i} have yet to arrive, the expected cost of the
next arriving client is ∑_{k=1}^{i} (d_k · 1/i) = (1/i) ∑_{k=1}^{i} d_k, since each
client k has probability 1/i of arriving next and experiences
switching cost d_k if it arrives next. Now, if we can show
that ∑_{k=1}^{i} d_k ≤ n, then the lemma follows easily, since the
expected switching cost of the next arriving client can then be
upper bounded by (1/i) ∑_{k=1}^{i} d_k ≤ (1/i) · n = n/i.
To show that ∑_{k=1}^{i} d_k ≤ n, note that if all i remaining
clients were to arrive next all at once, then it would be
easy to connect the remaining i clients with switching cost
n: simply compute a perfect matching, and then switch all
clients to match the same servers as designated in the perfect
matching. This connects all remaining clients, and causes
each client to switch servers at most once, and thus has total
switching cost n. Furthermore, this implies the existence of
i augmenting paths, which can be used to connect the remaining
clients {1, . . . , i} with total switching cost n. Moreover, if we
look at ∑_{k=1}^{i} d_k, it must be the case that ∑_{k=1}^{i} d_k ≤ n, since
we have established that there exists a set of i augmenting
paths with total switching cost n, and d_k is the fewest number
of servers that need to be switched if a client k ∈ {1, . . . , i}
arrives next. Therefore, the expected switching cost of the next
arriving client is at most (1/i) ∑_{k=1}^{i} d_k ≤ n/i, and the lemma
follows.
We now prove the main theorem of this section, which states
that the total switching cost is bounded by O(n log n) with high
probability.
Theorem II.2 For any bipartite graph G between clients and
servers, if the clients in G arrive in uniformly random order,
then the switching cost incurred by the greedy algorithm is at
most O(n log n) with probability at least 1 − n^{-2}.
Proof: To prove the cost is O(n log n) with high probability,
we couple our online matching process with a sequence
of n^2 binary random variables, X_{i,j} for i, j ∈ {1, . . . , n},
such that ∑_{i,j∈{1,...,n}} X_{i,j} stochastically dominates the total
augmentation cost. We then prove a large deviation bound
on ∑_{i,j∈{1,...,n}} X_{i,j}. Ideally, it would be nice if we could
simply define X_{i,j} to be a random variable which is 1 if
server j is switched by the greedy algorithm when i clients
have yet to arrive and a new client arrives, and is 0 otherwise.
Although it is easy to see that ∑_{i,j∈{1,...,n}} X_{i,j} would then be
the total switching cost, it is difficult to prove a large deviation
bound on ∑_{i,j∈{1,...,n}} X_{i,j}, since the X_{i,j} variables are not
independent.
Thus, in order to prove a large deviation bound, we
will define our X_{i,j} variables in another way, such that
∑_{i,j∈{1,...,n}} X_{i,j} stochastically dominates the total switching
cost, and each X_{i,j} variable is independent of X_{i',j}, for all
i' ≠ i and for each fixed j. We can then take advantage of this
independence and use a Chernoff bound to prove that, for each
fixed j, ∑_{i=1}^{n} X_{i,j} is O(log n) with high probability. From
there, it follows almost immediately that ∑_{i,j∈{1,...,n}} X_{i,j} is
O(n log n) with high probability.
Now to define our X_{i,j} variables: each X_{i,j} variable will
be 0 or 1, depending on the servers that get switched when
there are i clients yet to arrive and a new client arrives, and
our coupling will be defined so that for each i ∈ {1, . . . , n},
∑_{j=1}^{n} X_{i,j} stochastically dominates the number of servers that
need to be switched in order to connect a new arriving client,
when i clients remain to arrive. Each X_{i,j} variable will be
1 with probability 1/i, and will be independent of any other
variable X_{i',j}, for i' ≠ i and fixed j.
To define X_{i,j} and our coupling formally, let us assume
that i clients {1, . . . , i} have yet to arrive, and let us define
d_k to represent the minimum number of servers that need to
be switched in order to connect client k, if it were to arrive
next, for k ∈ {1, . . . , i}. Recall that ∑_{k=1}^{i} d_k ≤ n from the
argument in Lemma II.1, and thus we can assign to each
client k ∈ {1, . . . , i} a set of d_k distinct binary random variables
from {X_{i,j} | j ∈ {1, . . . , n}} without overlap. These variables
can be easily coupled to our random process to stochastically
dominate the switching cost, if the kth client arrives next:
if the kth client arrives next, then we set the d_k binary
variables assigned to it to have value 1, and 0 otherwise. Note
that this causes each binary variable assigned to a client in
{X_{i,j} | j ∈ {1, . . . , n}} to be 1 with probability 1/i, since
each client arrives next with probability exactly 1/i. For
consistency, we also set the variables in {X_{i,j} | j ∈ {1, . . . , n}}
which are not assigned to any client to have value 1 independently
at random with probability 1/i, and 0 otherwise. As we have
defined our coupling, note that ∑_{j=1}^{n} X_{i,j} always dominates
the number of servers that need to be switched in order to
connect a new client, when i clients remain to arrive. Furthermore,
Pr[X_{i,j} = 1] = 1/i for any i, j ∈ {1, . . . , n}.
Note that even though we assign the variables to cover each
client's switching cost in a way that is dependent on events
in the past, each X_{i,j} variable is 0 or 1 with probability
1/i, independent of all the previously defined X_{i',j} random
variables with i < i', because regardless of which assignment
is made, each X_{i,j} variable is 0 or 1 based only on which client
is selected to arrive at the current time, an event which is
independent of the previous X_{i',j} random variables. Thus,
our random process defines the random variables X_{i,j} in a
way such that each X_{i,j} variable is independent of the random
variables X_{i',j}, where i' ≠ i and j is fixed. Furthermore, it
is not hard to see that E[∑_{i=1}^{n} X_{i,j}] = ∑_{i=1}^{n} 1/i = O(log n)
for any j ∈ {1, . . . , n}, and it is easy to apply a standard
Chernoff bound to conclude that ∑_{i=1}^{n} X_{i,j} = O(log n) with
probability at least 1 − n^{-3}. A simple union bound then
implies that ∑_{i,j∈{1,...,n}} X_{i,j} = O(n log n) with probability
at least 1 − n^{-2}. Therefore, the total switching cost must be
at most O(n log n) with probability at least 1 − n^{-2}, and the
theorem holds.
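As a sketch of the Chernoff step (one standard way to instantiate it, stated here for completeness): for each fixed j the variables X_{1,j}, . . . , X_{n,j} are independent with E[∑_i X_{i,j}] = H_n = ∑_{i=1}^{n} 1/i, so the multiplicative Chernoff bound gives
\[
\Pr\Big[\sum_{i=1}^{n} X_{i,j} \ge (1+\delta)\,H_n\Big] \;\le\; \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{H_n} \;=\; e^{-\big((1+\delta)\ln(1+\delta)-\delta\big)\,H_n}.
\]
Choosing the constant δ large enough that (1 + δ) ln(1 + δ) − δ ≥ 3 makes the right-hand side at most e^{−3H_n} ≤ n^{−3}, while (1 + δ)H_n = O(log n); the union bound over the n choices of j then costs a factor of n, giving the stated probability 1 − n^{−2}.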
B. The Lower Bound
Due to space limitations, we defer the proof of our lower
bound, which shows our upper bound is optimal, to the
appendix.
Theorem II.3 For any online algorithm, there exists a graph
G such that the algorithm incurs an expected switching cost
of Ω(n log n) when the clients of G arrive in random order.
III. RANDOM CONNECTION MODEL
In this section we study the following scenario: there is
an equal number n of servers and clients, the clients arrive
sequentially, and each of them selects, at arrival, a random
subset of O(log n) servers. We provide an online matching
protocol for this setting which incurs total switching cost of
O(n) with high probability, so that the average switching
cost is O(1) per client. Moreover, we give a lower bound,
showing that any online matching protocol requires switching
cost Ω(n/ log n) with high probability, which establishes that
our upper bound is tight to within a factor of O(log n). Before
proceeding to the details of our results, we note that, since the
numbers of servers and clients are equal, we need to assume
that every client selects Ω(log n) servers in order to guarantee
that a matching exists after all clients arrive.
The idea of our protocol is this: When a new client arrives,
the current server-client graph is examined, and a binary tree
originating at the new client is grown carefully, in order for
an augmenting path to be found along the edges of that tree.
Our analysis establishes that the binary tree that is explored
contains a free server at a constant depth on average, and as a
result, the augmenting path, and hence the resulting switching
cost, is constant on average as well.
Theorem III.1 Consider the following online matching problem: There are n servers, and n clients arrive sequentially,
each client selecting O(log n) servers at random. There exists
a client-server assignment protocol for this online arrival
model which, with high probability, succeeds in matching each
client with a server with overall switching cost of O(n).
Proof: Let S = {s1 , . . . , sn } be the set of servers and
C = {c1 , . . . , cn } the set of clients, and let us suppose that the
clients arrive in the order c1 , c2 , . . . , cn , where t, t = 1, . . . , n,
is the arrival time of client ct . Without loss of generality, we
assume that n = 2^ℓ for some integer ℓ; we also assume that
every client selects a random set of α · log2 n servers from
the set S with repetition, for some sufficiently large constant
α > 0; with minor modifications in the proof, the result
extends to the case where n is not a power of 2 and the clients
select servers without repetition. For this model, we show
that the ONLINE MATCHING PROTOCOL described in Figure
1 satisfies the following properties with high probability.
(Here and throughout this section, we interpret “with high
probability” to mean “with probability at least 1 − O(1/n).”)
• at every time step t > 0, the clients c1 , . . . , ct are matched
with a subset of t servers;
• the total switching cost incurred by the protocol is O(n).
In the description of the protocol in Figure 1 we use the
following notation. We denote by ft : {c1 , . . . , ct } → S the
matching of clients to servers that the protocol maintains at
time t, and by gt : S → {¬} ∪ {c1 , . . . , ct } the “inverse
matching”, defined so that gt (s) = c iff c ∈ {c1 , . . . , ct } and
ft (c) = s. Finally, by Gt we denote the bipartite graph with
node set {c1 , . . . , ct } ∪ S and an edge between client ci , i ≤ t,
and server s iff s ∈ Sci , where Sci is the set of servers that
client ci selects.
ONLINE MATCHING PROTOCOL
At time step t > 0:
1) client ct chooses a random set Sct of α · log2 n
servers from the set S with repetition;
2) ct orders Sct arbitrarily into a list Lct and sets
jct = 1 (to be used as an index into her list of
servers);
3) P := BINARY BFS(ct , ft−1 , gt−1 );
/* upon success, BINARY BFS returns an augmenting
path on the graph Gt originating at the node ct */
if P ≠ ∅ then augment the matching ft−1 along
path P; define ft , gt appropriately;
else declare FAIL;
Fig. 1. High-level description of the protocol; when client ct joins the
network, an augmenting path from ct to a free server is sought; if such a path
is found, the current matching is augmented along this path to include client
ct .
We will make use of the following fact about the Coupon
Collector Model (proof in the appendix).
Lemma III.2 (Coupon Collector Model) Suppose that
there are n types of coupons. If every coupon has one of the
n types uniformly at random, then, with probability at least
1 − (2/e)^n, after k · n coupons are requested, at most n/k − 1
types of coupons are uncollected.
To analyze the performance of the ONLINE MATCHING
PROTOCOL, we are going to temporarily forget the fact that
every client has a list of α · log2 n servers; we will pretend
instead that every client has an infinite list of servers selected
uniformly at random. Under this assumption, with probability
1, the protocol does not declare FAIL and every client gets
matched with a server. We establish the following lemma
which concludes the proof of the theorem.
BINARY BFS
input: client ct , current matching ft−1 , inverse matching gt−1 ;
output: augmenting path P on the graph Gt starting at node ct , or ∅;
1) let σ := Lct (jct );
2) if gt−1 (σ) = ¬ then P := (ct , σ); return P;
   else initialize a queue data structure Q containing the element gt−1 (σ);
3) while Q ≠ ∅
   a) c := deQueue(Q);
   b) for r = 1, 2
      if jc < α log2 n then
         i) jc := jc + 1; σ := Lc (jc );
         ii) if gt−1 (σ) = ¬ then
               let P be the path from node ct to node σ on the
               tree created by the process;
               return P;
             else push gt−1 (σ) into Q;
      else declare FAIL;
Fig. 2. When a client ct joins the network, the BINARY BFS process
explores the graph Gt in a breadth-first-search fashion in order to find an
augmenting path from ct to a free server. However, each time a client node is
encountered by the BFS process, only two of its adjacent edges are explored.
Note that our BINARY BFS process may actually explore an edge e, which
goes from some client to a server u which has already been explored and
thus creates a cycle. In this case, we do not add edge e to our tree as it creates
a cycle, although we do continue our BINARY BFS by exploring two more
edges from the client with which server u is matched.
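For concreteness, a minimal Python sketch of the BINARY BFS idea follows. The data layout (dicts for the lists L, the indices j, and the inverse matching g) is our own, client and server identifiers are assumed distinct, and the cycle case is handled by simply skipping the repeated server, which glosses over the slightly different treatment described in the caption of Fig. 2.

from collections import deque

def binary_bfs(ct, L, j, g, cap):
    """Sketch of BINARY BFS.  L[c] is client c's list of random servers,
    j[c] counts how many entries of L[c] have been revealed so far,
    g[s] is the client currently matched to server s (absent if s is free),
    and cap = alpha * log2(n) bounds how many servers a client may reveal.
    Each dequeued client reveals at most two fresh servers from her list.
    Returns an augmenting path [ct, s, c, s, ..., free server] or None (FAIL)."""
    parent = {}

    def path_to(s):
        # walk back from server s to ct through the explored tree
        path = [s]
        while path[-1] != ct:
            path.append(parent[path[-1]])
        return path[::-1]

    def reveal(c):
        # expose the next unexplored server on c's list, if any remain
        if j[c] >= cap:
            return None
        s = L[c][j[c]]
        j[c] += 1
        return s

    queue = deque([ct])
    while queue:
        c = queue.popleft()
        for _ in range(1 if c == ct else 2):   # ct reveals one server, others two
            s = reveal(c)
            if s is None:
                return None                    # a client exhausted her list: FAIL
            if s in parent:
                continue                       # already explored (would close a cycle)
            parent[s] = c
            if s not in g:
                return path_to(s)              # free server: augmenting path found
            parent[g[s]] = s                   # follow the matched edge to its client
            queue.append(g[s])
    return None

The protocol of Fig. 1 would then flip the current matching along the returned path, exactly as in the greedy sketch of Section I.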
Lemma III.3 Under the assumption of infinite server-lists,
the following are satisfied with high probability, i.e. with
probability at least 1 − O(1/n):
1) the total augmentation cost incurred by the protocol is
O(n);
2) no client explores more than α · log2 n servers from her
infinite list of servers.
Proof: To show Part 1 of the lemma, let us divide the
arrival times of the clients into log2 n + 1 ≡ ℓ + 1 progressively
smaller intervals Ij = {bj , . . . , ej }, j = 1, . . . , log2 n + 1,
where each interval j has n/2^j time steps. In particular, we
define:
• bj = n − n/2^{j−1} + 1, for all j = 1, . . . , log2 n + 1;
• ej = n − n/2^j , for all j = 1, . . . , log2 n + 1.
Let Ti be the tree constructed by the BINARY BFS process
when client ci arrives and denote by |Vi | the number of client
nodes in Ti . We show the following lemma.
Lemma III.4 Under the infinite server-lists assumption, with
high probability over the random choices of servers by the
clients, for all j ∈ {1, . . . , log2 n + 1}:
∑_{i∈Ij} |Vi | ≤ 2^j n.
Proof: Let us fix any j ∈ {1, . . . , log2 n + 1}. Let Kj be
the number of times Steps 1 and 3(b)i of BINARY BFS are
invoked between the arrival of client c_{bj} and until client c_{ej}
is matched with a server. Observe that, every time Steps 1
and 3(b)i of BINARY BFS are invoked, the identity of server
σ is independent of the identities of the servers revealed in the
preceding invocations of Steps 1 and 3(b)i of BINARY BFS.
Also, observe that, if the number of distinct servers that the
invocations of Steps 1 and 3(b)i of BINARY BFS have returned
over the course of the protocol is at least n − n/2^j , then the
clients of the set {ci }_{i∈∪_{t≤j} It} have all been matched with
servers. From Lemma III.2 it follows that, with probability at
least 1 − (2/e)^n , after 2^j n invocations of Steps 1 and 3(b)i of
BINARY BFS, n − n/2^j distinct servers will be discovered.
Hence, with probability at least 1 − (2/e)^n ,
Kj ≤ 2^j n.    (1)
It is easy to see that the number of clients encountered in the
trees in phase j is upper bounded by the number of new edges
explored in phase j, so that the following holds:
∑_{i∈Ij} |Vi | ≤ Kj ≤ 2^j n.
The lemma then follows by a union bound.
From Lemma III.4 it follows that, for all j = 1, . . . , log2 n + 1,
(1/|Ij |) ∑_{i∈Ij} |Vi | ≤ (1/|Ij |) 2^j n ≤ 2^{2j}
⇒ log2 { (1/|Ij |) ∑_{i∈Ij} |Vi | } ≤ 2j.    (2)
By Jensen's inequality and the concavity of log2 it follows
that
log2 { (1/|Ij |) ∑_{i∈Ij} |Vi | } ≥ (1/|Ij |) ∑_{i∈Ij} log2 |Vi |.
By combining the above with (2), it follows that
∑_{i∈Ij} log2 |Vi | ≤ 2j · n/2^j ,
and summing over j it follows that
∑_{i=1}^{n} log2 |Vi | ≤ n ∑_{j=1}^{log2 n} 2j/2^j + 2(log2 n + 1) = O(n).
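For completeness, the geometric-arithmetic sum in the last display is bounded by a constant:
\[
\sum_{j=1}^{\log_2 n} \frac{2j}{2^{j}} \;\le\; 2\sum_{j\ge 1} \frac{j}{2^{j}} \;=\; 4,
\qquad\text{hence}\qquad \sum_{i=1}^{n}\log_2 |V_i| \;\le\; 4n + 2(\log_2 n + 1) \;=\; O(n).
\]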
Finally, observe that, when a client ci joins the network,
the augmenting path chosen by the protocol has length
O(1 + log2 |Vi |). It follows that the total augmentation cost
paid by the protocol is O(n).
To prove part 2 of the lemma, we make use of the following
well-known facts.
Lemma III.5 (Coupon Collector Model) In a coupon collector
model with n coupon types, the probability that not all coupon
types have been collected after 2n loge n steps is at most 1/n.
Lemma III.6 (Balls in Bins) In the balls and bins model, if
m balls are thrown into n bins, then, with probability at least
1 − 1/n, the maximum load of a bin is O(m/n + log n).
From Lemma III.5, it follows that, with probability at least
1 − 1/n, the total number of invocations of Steps 1 and 3(b)i
of BINARY BFS throughout the course of the protocol is at
most 2n loge n. Since every client participates exactly once
in an invocation of Step 1, to conclude the proof of Part 2,
it is enough to show that, with high probability, no client c
participates more than O(log n) times in an invocation of Step
3(b)i. We argue this by coupling the process of selecting clients
for invoking Step 3(b)i with the process of throwing m =
n log n balls into n bins. For our purposes the bins are the
clients and a ball, thrown at the time of an invocation of a
Step 1 or a Step 3(b)i of BINARY BFS, is received by a client
(bin) c if the server selected by the invocation is matched with
client c; as a result of this receipt, c will be the next client
to participate in an invocation of Step 3(b)i, at which time
another ball will be thrown, etc. Note that a ball might not
be received by any client; in particular, for any ball that gets
thrown, the probability that a client c receives it is at most 1/n.
Hence, the maximum load that a client receives in our model
is dominated by the maximum load of a bin in a balls-in-bins
process whereby 2n loge n balls are thrown into n bins; the
latter is O(log n) with probability at least 1 − 1/n by Lemma
III.6. This concludes the proof of part 2 of the lemma.
We establish next a lower bound of Ω(n/ log n) on the
switching cost of any online protocol. Hence, the protocol
described in the proof of Theorem III.1 is optimal up to a
factor of O(log n). The proof is postponed to the appendix.
Theorem III.7 In the setting of Theorem III.1, any online
matching protocol has switching cost of Ω(n/ log n), with high
probability.
IV. ONLINE BIPARTITE MATCHING ON FORESTS
In this section, we provide an algorithm that achieves a
switching cost of O(n log n), when the connection graph on
n clients and n servers is a forest, and contains a perfect
matching between clients and servers. The main result of this
section is as follows.
Theorem IV.1 Let G be the underlying graph of connections
between clients and servers. If G is a forest, then, for any
arrival order of clients, there is an algorithm which has a
switching cost of O(n log n).
Before we describe the algorithm, we first make some
preliminary observations, which provide a foundation for
stating our algorithm. Note that at the start of our online
process, we have n server nodes and no edges, and thus our
connection graph contains n connected components consisting
of a single server node each. As new clients arrive with
their edges, these connected components become merged. In
particular, one should note that when a new client i arrives
with di ≥ 2 edges, the client’s di edges must connect to di
distinct components in the graph, since our final graph must
be a forest. Thus a client arriving with di ≥ 2 edges causes
di components to merge into a single connected component.
As our process proceeds, our online algorithm monitors
these connected components, and designates a node from each
connected component to be the root of the component. The
root and size of each component are key elements in deciding
the augmenting path to be used to connect a new client. At
the beginning of the process, each server node starts as the
root node of its own connected component; this will change
as clients arrive and components merge. Now to define how
our algorithm maintains the root nodes and selects augmenting
paths to connect each new arriving client i, we have three
different cases:
• Case I: If client i has a single edge, then this edge
connects to a single connected component C and thus
all potential augmenting paths from i must pass through
component C. Among the set of potential augmenting
paths, choose the augmenting path that stays furthest
away from the root of C.
• If a client i has more than one edge, then we have two
cases to consider, depending on the set of connected
components to which client i can find an augmenting
path. Let Si denote the set of connected components to
which client i can find an augmenting path, and let Ti
denote the set of connected components to which client
i has an edge.
– Case II: If the set Si contains only one connected component, and that connected component is the unique
largest component of Ti , then follow the rule in case
I.
– Case III: Otherwise, choose any augmenting path into
the smallest component in Si .
At the end of both case II and case III, all components
in Ti have merged into a single component T . Assign T
to have the same root node as the largest component in
Ti , breaking ties arbitrarily if needed.
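The component bookkeeping that drives this case analysis can be sketched with a small union-find structure; the sketch below is our own illustration (class and method names are hypothetical), and it omits the augmenting-path selection itself, e.g. the "stay furthest from the root" rule.

class ComponentTracker:
    """Tracks the connected components of the revealed forest.  Every server
    starts as the root of its own component; when an arriving client merges
    several components, the merged component keeps the root of the largest
    constituent.  Here the union-find representative of a component doubles
    as its designated root, because the largest constituent's representative
    survives each merge."""
    def __init__(self, servers):
        self.parent = {s: s for s in servers}
        self.size = {s: 1 for s in servers}

    def find(self, s):
        # path-halving union-find lookup
        while self.parent[s] != s:
            self.parent[s] = self.parent[self.parent[s]]
            s = self.parent[s]
        return s

    def merge_on_arrival(self, adjacent_servers):
        """Merge the components touched by a new client's edges; return the
        representative (= designated root) of the merged component."""
        reps = {self.find(s) for s in adjacent_servers}
        winner = max(reps, key=lambda r: self.size[r])   # largest component wins
        for r in reps - {winner}:
            self.parent[r] = winner
            self.size[winner] += self.size[r]
        self.size[winner] += 1                           # count the new client node too
        return winner

The sizes and roots maintained here are exactly the quantities that the case analysis above and the doubling argument in the proof below rely on.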
We now prove Theorem IV.1 by bounding the total
switching cost of our algorithm.
Proof of Theorem IV.1: We first bound the total cost of augmentations arising from case III by O(n log n). Note that each
time a server switches clients via case III, its component size
at least doubles, and thus each server can switch clients at most
O(log n) times via case III augmentations. Therefore, the total
cost arising from case III augmentations is O(n log n).
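In symbols: a server's component has size at most n, and each of its Case III switches at least doubles that size, so
\[
\mathrm{cost}_{\mathrm{III}} \;\le\; \sum_{\text{servers } s} \#\{\text{Case III switches of } s\} \;\le\; n \cdot \log_2 n \;=\; O(n \log n).
\]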
Lemma IV.2 shows that the switching cost arising from
case I and case II augmentations is at most O(n log n). The
theorem thus follows by combining Lemma IV.2 with our
bound on the total cost of augmentations arising from case
III. Due to space limitations, we defer to the appendix the proof
of the following lemma, which bounds the switching cost
arising from case I and case II augmentations and completes
the proof of Theorem IV.1.
Lemma IV.2 Let G be the underlying graph of connections
between clients and servers. If G is a forest, then, for any
arrival order of clients, the total switching cost of case I and
case II augmentations is at most O(n log n).
V. CONCLUSION
In conclusion, we study three variants of the online bipartite
matching problem. First, we show that when the underlying bipartite graph is arbitrary, and the clients arrive in a
random order, the greedy algorithm, which always uses the
shortest available augmenting path, achieves a switching cost of
O(n log n) with high probability. We also study the problem
when the arrival order is adversarial and the underlying graph
is a forest, and provide an O(n log n) algorithm for this case.
Finally, we study the problem in random bipartite graphs of
degree O(log n) and show an algorithm which achieves O(n)
switching cost.
The main open question is to find an algorithm which
achieves a switching cost of O(n log n) for any arbitrary
bipartite graph between clients and servers, and for an adversarial arrival order of clients. In particular, it would be
interesting to find a proof that the greedy algorithm achieves
a switching cost of O(n log n) in any bipartite graph, for any
arrival order of the clients, or to find a counterexample that
suggests otherwise.
REFERENCES
[1] E. F. Grove, M.-Y. Kao, P. Krishnan, and J. S. Vitter, “Online perfect
matching and mobile computing,” in Proc. 4th International Workshop
on Algorithms and Data Structures (WADS), 1995, pp. 194–205.
[2] M. Andrews, M. X. Goemans, and L. Zhang, “Improved bounds for online load balancing,” in Computing and Combinatorics, Second Annual
International Conference (COCOON), 1996, pp. 1–10.
[3] S. Phillips and J. Westbrook, “Online load balancing and network flow,”
in Proc. 25th ACM Symposium on Theory of Computing (STOC), 1993,
pp. 402–411.
[4] J. Westbrook, “Load balancing for response time,” in Proc. 3rd Annual
European Symposium on Algorithms (ESA), 1995, pp. 355–368.
[5] Y. Azar, A. Broder, and A. Karlin, “On-line load balancing,” in Proc.
33rd IEEE Symposium on Foundations of Computer Science (FOCS),
1993, pp. 218–225.
[6] Y. Azar, B. Kalyanasundaram, S. Plotkin, K. Pruhs, and O. Waarts,
“Online load balancing of temporary tasks,” in Proc. 2nd International
Workshop on Algorithms and Data Structures (WADS), 1993, pp. 119–
130.
[7] Y. Azar, J. Naor, and R. Rom, “The competitiveness of on-line assignments,” in Proc. 3rd ACM-SIAM Symposium on Discrete Algorithms
(SODA), 1992, pp. 203–210.
[8] R. M. Karp, U. V. Vazirani, and V. V. Vazirani, “An optimal algorithm
for on-line bipartite matching,” in STOC ’90: Proceedings of the twenty-second annual ACM symposium on Theory of computing. New York,
NY, USA: ACM, 1990, pp. 352–358.
[9] G. Goel and A. Mehta, “Online budgeted matching in random input
models with applications to adwords,” in SODA ’08: Proceedings of
the nineteenth annual ACM-SIAM symposium on Discrete algorithms.
Philadelphia, PA, USA: Society for Industrial and Applied Mathematics,
2008, pp. 982–991.
[10] A. Mehta, A. Saberi, U. Vazirani, and V. Vazirani, “Adwords and
generalized online matching,” J. ACM, vol. 54, no. 5, p. 22, 2007.
[11] B. Godfrey, “Balls and bins with structure: Balanced allocations on
hypergraphs,” in Proc. 19th ACM-SIAM Symposium on Discrete Algorithms
(SODA), 2008.
[12] R. Pagh and F. F. Rodler, “Cuckoo hashing,” in Proc. 9th Annual
European Symposium on Algorithms (ESA), 2001, pp. 121–133.
[13] D. Fotakis, R. Pagh, P. Sanders, and P. G. Spirakis, “Space efficient
hash tables with worst case constant access times,” in Proc. 20th Annual
Symposium on Theoretical Aspects of Computer Science (STACS), 2003,
pp. 271–282.
APPENDIX
A. Discussion
Although we assume the total number of arriving clients is
the same as the total number of servers, it is not difficult to see
our upper bounds on the switching cost still hold if n clients
arrive and there are m > n servers. We omit formal details,
but the bounds on the switching cost are the same for this case
when measuring the switching costs relative to n, the number
of clients.
We also assume that each server may service (i.e. be
matched to) at most one client, but one should note that our
results also hold when servers can service multiple clients. In
this more general setting, when n clients are to be matched
to m servers who can service s1 , s2 , . . . , sm clients each, our
bounds also hold and are exactly the same as the unit capacity
case, when measuring the cost relative to n the number of
clients. The bounds from the unit capacity case also apply
here, since we can reduce this more general problem to the
unit capacity case by creating si server nodes for each server
i ∈ {1, . . . , m}. By treating each server i as si unit capacity
servers, and creating edges appropriately, it is not hard to see
that any bounds for the unit capacity case also yield equivalent
bounds for the general capacity case in terms of n, the number
of clients that arrive.
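As an illustration of this reduction, here is a small Python sketch (with names of our own choosing): each server i of capacity si is expanded into si unit-capacity copies that inherit all of its edges.

def expand_capacities(edges, capacity):
    """Reduce a general-capacity instance to a unit-capacity one.
    edges: dict mapping client -> list of servers; capacity: dict mapping
    server -> s_i.  Each server i becomes s_i unit-capacity copies
    (i, 0), ..., (i, s_i - 1), and every client edge to i is duplicated
    to all of its copies."""
    unit_edges = {}
    for client, servers in edges.items():
        unit_edges[client] = [(s, k) for s in servers for k in range(capacity[s])]
    return unit_edges

Any upper bound on the switching cost proved for the resulting unit-capacity instance then carries over, since the cost is measured relative to n, the number of clients.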
Lastly, we assume that there is a matching at each step of the
algorithm. We make this assumption because if a new client
i arrives and no matching exists between the arrived clients
and the servers, then we might as well ignore client i. We feel
no remorse ignoring client i, since no algorithm could have
matched i and the other arrived clients. By ignoring the clients
whom we cannot possibly serve (without disconnecting other
clients), we are then left with an instance where at each step
a matching exists between the arrived clients and the existing
servers.
B. Skipped Proofs
In this section, we present the details of some of the proofs
from previous sections of the paper.
Proof of Theorem II.3: Using Yao’s lemma, it suffices to
specify a distribution on input instances such that for any
deterministic online algorithm, the expected switching cost
incurred by the algorithm is Ω(n log n), where the expectation
is taken over the random choice of instances and the random
ordering of clients. We define our instance distribution as
follows: form a random bipartite graph between n clients and n
servers by choosing a random permutation π of {1, 2, . . . , n}
and matching client i to servers π(i) and π(i + 1), where
π(n + 1) is interpreted to mean π(1). In other words, the
bipartite graph is a random 2n-cycle on the set of clients and
servers.
This input distribution is invariant under permutations of
the clients, so we may assume without loss of generality
that the clients’ arrival order is 1, 2, . . . , n. At the arrival
time of client k, the servers belong to n − k + 1 different
connected components (including isolated servers) and each
of these connected components is a path. Client k is adjacent
to endpoints of two of these paths, namely, the arcs of the cycle
which extend from client k to the first higher-numbered client
encountered when going around the cycle in either direction.
The theorem now follows from a sequence of observations
outlined below.
1) Conditioning on the cyclic ordering of all clients besides
k, client k is equally likely to be spliced anywhere into
the cycle. In particular, it has probability 1/3 of being
spliced into the middle third of an arc between higher-numbered clients.
2) Consider the cyclic ordering of all clients besides k. The
clients with label greater than k partition the ordering into
n − k arcs. It is easy to see the average length of these
arcs is (n − 1)/(n − k), and hence, the expected distance
from client k to the nearest higher-numbered client is at
least (n − 1)/(3(n − k)).
3) The input distribution is invariant under the operation
of cutting out an arc of the cycle, with its servers at
both endpoints, and reattaching it backwards. Therefore,
conditional on the states of the two paths to which client
k < n is connected at the time of its arrival (i.e., the
sets of clients and servers in those paths and the current
matching between them), each of the ways for k to
connect to these two paths is equally likely. As client
k connects to these two paths by choosing an endpoint
of each, the number of ways for k to connect to the two
paths is the product of the number of endpoints of the
paths. As each path either is a single node with 1 endpoint
or has 2 endpoints, the number of ways of connecting
client k to these two paths is at most 4. Thus, conditional
on the states of the two paths at the time of client k’s
arrival, with probability at least 1/4, client k connects to
each of these paths at the endpoint which is furthest from
the free server on that path.
4) Combining (2) and (3), we find that the expected cost
incurred at the arrival time of client k is at least
(n − 1)/(12(n − k)). Summing over k, we get the stated
lower bound.
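For completeness, summing this per-client bound over k = 1, . . . , n − 1 (the last client can simply be dropped from the sum) gives
\[
\sum_{k=1}^{n-1} \frac{n-1}{12\,(n-k)} \;=\; \frac{n-1}{12}\sum_{m=1}^{n-1}\frac{1}{m} \;=\; \frac{n-1}{12}\,H_{n-1} \;=\; \Omega(n\log n).
\]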
Proof of Lemma III.2: Let S be any set of n − n/k = (1 − 1/k)n
coupon types. Then, if k · n coupons are requested,
Pr[all k · n coupons are from the set of types S] ≤ (1 − 1/k)^{kn} ≤ e^{−n}.
It follows that, if k · n coupons are requested, then
Pr[at least n/k coupon types are uncollected]
  ≤ ∑_{S : |S| = (1−1/k)n} Pr[all k · n coupons are from the set of types S]
  ≤ 2^n · e^{−n} = (2/e)^n,
where the summation ranges over all sets S of coupon types
of size |S| = (1 − 1/k)n, and there are at most 2^n such sets.
Proof of Theorem III.7: Let S = {s1 , . . . , sn } be the set
of servers and C = {c1 , . . . , cn } the set of clients, and let
us suppose that the clients arrive in the order c1 , c2 , . . . , cn ,
where t, t = 1, . . . , n, is the arrival time of client ct . Suppose
also that every client selects a random set of D := α · log2 n
servers with repetition; as in the proof of Theorem III.1, with
minor modifications the argument extends to the case where
the selection happens without repetition. Denoting by Sct the
set of servers that client ct chooses, let us define the following
collection of events for t = 1, . . . , n:
At := “When client ct arrives, all servers in Sct are occupied by other clients.”
Recall that an online matching protocol needs to maintain a
matching of the clients with servers at all times. Hence, if the
event At happens, the protocol must incur an extra cost of
at least 1 in period t in order to service client ct . Moreover,
observe that
Pr[At ] = ((t − 1)/n)^D .
Therefore, the expected switching cost incurred by the protocol
is
E[switching cost] ≥ ∑_{t=1}^{n} ((t − 1)/n)^D = (1/n^D) ∑_{t=1}^{n−1} t^D
  ≥ (1/n^D) ∫_{0}^{n−2} t^D dt = (1/n^D) · (n − 2)^{D+1}/(D + 1)
  = ((n − 2)^D/n^D) · (n − 2)/(D + 1) = Ω(n/ log n).
So, in expectation, every online matching protocol has cost
Ω(n/ log n). To show that this is also true with high probability,
note that the events {At }_{t=1}^{n} are independent. The result
follows from an easy Chernoff bound.
Proof of Lemma IV.2: Due to space limitations, we omit the
full proof of this lemma, but complete details can be found in
an online version of this paper.