Facility Location with Service Installation Costs (Extended Abstract) David B. Shmoys∗ Chaitanya Swamy∗ Retsef Levi† Abstract We consider a generalization of the uncapacitated facility location problem which we call Facility Location with Service Installation Costs. We are given a set of facilities, F,a set of demands or clients D, and a set of services S.Each facility i has a facility opening cost fi, and we have a service fl installation cost of for every facility-service pair (i, l). Each client j in Drequests a specific service g(j) ∈Sand the cost of assigning a client j to facility i is given by cij . We want to open a set of facilities, install services at the open facilities, and assign each client j to an open facility at which service g(j) is installed, so as to minimize the sum of the facility opening costs, the service installation costs and the client assignment costs. Our main result is a primal-dual 6-approximation algorithm under the assumption that there is an ordering on the facilities such that if i comes before i in this ordering then for every service type l, fil ≤fil . This includes (as special cases) the settings where the service installation cost i fl depends only on the service type l, or depends only on the location i. With arbitrary service installation costs, the problem becomes as hard as the set-cover problem. Our algorithm extends the algorithm of Jain & Vazirani [9] in a novel way. If the service installation cost depends only on the service type and not on the location, we give an LP rounding algorithm that attains an improved approximation ratio of i 2.391. The algorithm combines both clustered randomized rounding [6] and the filtering based technique of [10, 14]. We also consider the k-median version of the problem where there is an additional requirement that at most k facilities may be opened. We use our primal-dual algorithm to give a constant-factor approximation for this problem when the service installation cost depends only on the service type. 1 Introduction Facility location problems have been widely studied in the Operations Research community (see for e.g. [12]). In its simplest version, uncapacitated facility location (UFL), we are given a set of facilities, F,and a set of demands or clients D. Each facility i has a facility opening cost fi and the cost of assigning a client j to facility i is given by cij .We want to open some facilities ∗{shmoys,swamy}@cs.cornell.edu. Dept. of Computer Science, Cornell University, Ithaca, NY 14853. Research supported partially by NSF grant CCR-9912422. †rl227@cornell.edu. School of ORIE, Cornell University, Ithaca, NY 14853. Research supported partially by a grant from Motorola and NSF grant CCR-9912422. from the set F and assign each demand to an open facility. The goal is to minimize the sum of the facility opening costs and the client assignment costs. This problem has a wide range of applications. For example, a company might want to open its warehouses at some locations so that its total cost of opening warehouses and servicing customers is minimized. In various applications, the clients are differentiated according to the kind of service they require and to satisfy the service requirement of a client we have to assign it to a facility that can provide the service required by the client. For example, in the warehouse location problem above, the customers may be retail stores that request different kinds of supplies. A warehouse may store different kinds of supplies. To satisfy a customer we have to assign it to a warehouse that holds inventory of the type requested by the customer. We model such a setting by saying that in addition to facilities and clients, we have a set of services S. Each client j in Drequests a specific service g(j) ∈S. To satisfy client j we have to assign it to an open facility on which service g(j) is installed. Further, if we install service l on an open facility i we incur a service installation cost of fl.We want to open a set i of facilities, install services at the open facilities, and assign each client j to an open facility i such that service g(j) is installed at i. The cost of a solution is the sum of the facility opening costs, the service installation costs and the client assignment costs, and the goal is to find a solution with minimum total cost. We call this problem, Facility Location with Service Installation Costs.In the warehouse location problem, the service installation cost corresponds to the initial cost of setting up the warehouse to store the particular kind of inventory. The notion of service-dependent fixed costs is also used in inventory problems where one incurs a joint setup cost to start a new order and an item-dependent fixed cost to order a specific item, so one needs to coordinate the placement of item orders; see [1] for a survey. We assume throughout that the assignment costs cij form a metric. Applications and Related Work. Facility location with service installation costs can be used to model a data management/caching problem. Here we are given a set of locations in a network at which caches may be built and a set of processes located at the nodes of the network that request data items. Each process requests a specific data item. To satisfy the request we must assign the process to a cache that stores the requested data item, incurring an access cost proportional to the distance between the process site and the cache location. Building a cache at a location incurs a location dependent cost and storing a data item in a cache at a particular location incurs a cost that depends on the data item and the location. The goal is to build caches, store data items in the caches and assign each process to a cache containing the data item requested by the process, so as to minimize the total cost of building caches, storing data items and the access cost of requests. This is exactly the facility location problem with service installation costs where the caches are facilities, the processes are clients and the data items correspond to services. Baev & Rajaraman [3] considered a closely related problem called the data placement problem and gave a 20.5approximation algorithm. Here caches of fixed capacity are already built at certain locations and the goal is to find a placement of data items to caches respecting the cache capacities that minimizes the sum of the access costs and the cost of storing data items. The ratio has recently been improved by Swamy [16]. Facility location with service installation costs is a generalization of UFL — if there is just one service type then this is simply the uncapacitated facility location problem. There is a large body of literature that deals with designing approximation algorithms for metric UFL and we sample only a few results below; see [13] for a survey of this and earlier work. Shmoys, Tardos & Aardal [14] gave the first constant- factor approximation algorithm for this problem using the filtering technique of Lin & Vitter [10] to round the optimal solution of a linear program. Chudak & Shmoys [5, 6] gave an LP rounding based 1+ 2 e approximation algorithm. They combined randomized rounding and the decomposition results of [14] to get a variant that might be called clustered randomized rounding. Sviridenko [15] improved the ratio to 1.58. Jain & Vazirani [9] gave a combinatorial primal-dual 3-approximation algorithm where the LP is used only in the analysis. The current best ratio for UFL is 1.52 [11] obtained by building upon a dual-fitting based greedy algorithm of Jain, Mahdian, Markakis, Saberi & Vazirani [8]. Our Results. Our main result is a primal-dual 6approximation algorithm for the facility location problem with service installation costs under the assumption that there is an ordering on the facilities such that if i comes before i in this ordering then for every service type l, fl ≤ fil . This is reasonable in many settings; i for example, one expects the inventory setup cost of a warehouse in New York city to be less than the inventory setup cost in a remote town like Ithaca regardless of the kind of inventory. As special cases this includes the cases where the service installation cost fl depends only i on the service type l, or depends only on the location i. In the former setting where the service installation cost depends only on the service type we give an algorithm based on LP rounding that attains a much improved approximation ratio of 2.391. We show that with arbitrary service installation costs the problem becomes as hard as the set-cover problem. Combined with the result of Feige [7], this shows that no polynomial-time algorithm with a ratio of (1 − )ln |D| exists for this problem in O(log log n)]. We the general case unless NP ⊆ DTIME[n also consider the k-median version of the problem where there is an additional requirement that at most k facilities may be opened. We use our primal-dual algorithm to give a constant-factor approximation for this problem when the service installation cost depends only on the service type. Our Techniques. Facility location with service installation costs is a generalization of UFL. It differs however from traditional multi-level extensions of facility location where we assume that a demand can be assigned to any facility in any level. In our problem demand j may only be assigned to a level 1 “facility” (i, g(j)) and then to facility i in level 2. Moreover, existing techniques for UFL and multi-level facility location do not readily generalize. If there were no facility opening costs we could decouple the problem into several UFL instances, one for each service type, and solve each one separately. With facility opening costs this approach fares badly since we may end up opening a lot of facilities and spend too much on the facility opening costs. All known algorithms for UFL rely on the fact, either in the design or the analysis, that a client j can be moved from a facility i to another nearby facility i without increasing its assignment cost by much, and leaving the facility opening cost unchanged. In our problem, reassigning j to i may now require us to install service g(j)on i caus g(j) ing us to pay the installation cost f which could be i large. Technically the hard part is to find a way to reassign clients to nearby facilities so that we do not pay too much to install services at the new locations. With arbitrary service installation costs such a reassignment need not be possible since we can encode the constraint that a client may only be assigned to a specific set of facilities, making the problem set-cover hard. We build upon the primal-dual algorithm of [9] in a novel way and give a 6-approximation algorithm under the assumption that the facilities are ordered so that if i comes before i then fl ≤ fl for every service ii l. At a high level, the idea is to consider an integer programming formulation of the problem and the dual of its linear programming relaxation, and construct simultaneously an integer primal solution and a dual solution. Each client j has a dual variable αj which can be interpreted intuitively as the payment that j is willing to make to get itself assigned to an open facility. The Jain-Vazirani (JV) algorithm [9] for UFL works in two phases. In phase I we grow each dual variable αj uniformly and gradually build a primal feasible solution. Once αj becomes equal to cij for some facility i, j starts paying toward the facility opening cost of i. When the total contribution to i from the various clients equals fi, we declare i to be tentatively open and assign all the unassigned clients contributing toward i to i.Phase I ends when each client is assigned to a tentatively open facility. At this point a client could be contributing towards multiple tentatively open facilities. We call a set of facilities independent if each client contributes towards at most one facility in the set. In phase II we select a maximal independent subset of tentatively open facilities and open these. The analysis shows that if the facility i to which a client j was assigned in phase I is not opened, then there is a “nearby” open facility i to which j can be reassigned. Our algorithm also proceeds in phases. During phase I we tentatively open some facilities, tentatively install service l on some facilities for each service type l, and assign each client j to a tentatively open facility on which service g(j) is tentatively installed. Phase II is more involved. We have to select a set of facilities to open and install services in such a way that we can, (1) pay for installing the services, and (2) ensure that if a client j has to be reassigned there is a nearby open facility on which service g(j) is installed. We show that we can achieve properties (1) and (2) if we look at the facilities in a particular order and pick a maximal independent subset greedily. This gives us a 6-approximation algorithm. Our primal-dual algorithm exploits the property that in the JV algorithm any maximal independent set of tentatively open locations may be picked. Although this is a well-known fact, to our knowledge this has been used in only a couple of applications previously. Bartal, Charikar and Raz [4] consider a clustering problem and use the JV algorithm to solve a relaxation of the problem by picking an appropriate maximal independent set. Archer, Rajagopalan and Shmoys [2] pick a maximum independent set and use this to prove a bound on the integrality gap of the k-median LP. When the service installation cost depends only on the service type, we give an LP-rounding algorithm that combines clustered randomized rounding [6] and the filtering based technique of [10, 14]. A feature of the algorithm is that we bound distances cij using both the αj bound due to complementary slackness and the bound obtained by filtering. This gives a better performance guarantee than that obtained by using either of the two bounds separately. Sviridenko [15] also used the two bounds in conjunction to improve the approximation ratio for UFL from 1+ 2 to 1.58. e 2 A Linear Program We can formulate the problem as an integer program and relax the integrality constraints to get a linear program. We use i to index the facilities in F, j to index the clients in D and l to index the services in S. l min fiyi + fil y + cij xij (P) i i il ji s.t. xij ≥ 1 ∀j ig(j) xij ≤ y ∀i, j i xij ≤ yi ∀i, j l xij ,yi,yi ≥ 0 ∀i, j, l. l Variable yi indicates if facility i is open, y indicates if i service type l is installed at i,and xij indicates if client j is connected to facility i. The first constraint states that each client must be assigned to a facility, the second and the third constraints say that if client j is assigned to facility i, then service g(j)mustbeinstalledon i and i must be open. An integral solution corresponds exactly to a solution to our problem. Let Gl be the set of clients requesting service l. The dual program is, max αj (D) j s.t. αj ≤ cij + βij + θij ∀i, j (2.1) θij ≤ fl i ∀i, l j∈Gl βij ≤ fi ∀i (2.2) j αj ,βij ,θij ≥ 0 ∀i, j. Intuitively αj is the budget that j is willing to spend to get itself assigned to an open facility. Constraint (2.1) says that a part of this goes towards paying for the assignment cost cij . The rest gets divided into a payment for the service installation cost θij ,and a payment for the facility opening cost βij . 3 A Primal-Dual Algorithm We consider instances of the problem where there is an ordering on the facilities in F such that if i comes before i in this ordering then for every service type l, fl ≤ fl ii . Equivalently this means that for any two locations i, i , T T the vectors fl and fl are comparable. ii l=1...|S| l=1...|S| Let O denote this total ordering on the facilities. We say that i ≤ i if i comes before i in the ordering O. The algorithm is strongly motivated by the primal- dual algorithm of Jain and Vazirani for the traditional uncapacitated facility location algorithm. In our algorithm, we first construct a feasible dual solution, and then use this dual solution to extract a feasible (integer) primal solution. As has been the norm, our algorithm is a dual ascent algorithm, so all dual variables are only increased throughout the execution of the algorithm. We next describe the algorithm. There is a notion of time around which the algorithm is specified. We start at time t = 0, all dual variables are initialized to 0, each demand j is said to be unfrozen,and all facilities are closed. At first, the variables that are increased are the αj s; more precisely, for any unfrozen demand j, αj is always equal to the time t.We say that demand j is tight with facility i, or has reached i, if αj ≥ cij . As time increases, we will freeze demand points, tentatively open facilities, and tentatively install services at facilities. We increase the αj of each demand j until one of the following events happens: 1. Suppose that demand j becomes tight with facility i. If service g(j) is not tentatively installed at i, then we start increasing θij at the same rate as αj ; that is, if αj = t,then θij = t−cij . If service g(j)is tentatively installed, but i is not tentatively open, we instead increase βij at the same rate as αj ;that is, if αj = t,then θij remains 0, but βij = t − cij . Finally, if service g(j) is tentatively installed, and i is tentatively open, we freeze demand point j (and no longer increase αj ). 2. Suppose that for a facility i and a service type l,we get that j∈Gl θij = fil: in this case, we tentatively install service l at i.If i is also tentatively open, then we freeze each demand j ∈ Gl that is tight with i (and no longer increase αj ). If i is not yet tentatively open, then for each demand j ∈ Gl that is tight with i, we no longer increase θij , but instead start increasing βij at the same rate as αj . 3. Suppose that for a facility i, j βij = fi:in this case, we tentatively open i. For each demand j,we do not increase βij from now on. If demand j is tight with i and service g(j) is tentatively installed at i, we freeze j (and no longer increase αj ). We only raise the αj ,βij ,θij of unfrozen demands. Frozen demands do not participate in any events. We continue this process until all demands become frozen. Let (α, β, θ) denote the final dual solution obtained by the above process. Observe that if i is the facility that caused j to freeze, then service g(j) must be tentatively installed at i,and i must be tentatively open. We now specify which facilities to open, how to install services on facilities, and how to assign demands to facilities. Let F be the set of tentatively open facilities, and let Fl ⊆ F be the set of tentatively open facilities on which service l is tentatively installed. For facility i ∈ F ,let ti be the time at which i became tentatively open. If i ∈ Fl,let til be the time at which service l was tentatively installed at i. Opening facilities. We open a subset of facilities from F .We say that i, i ∈ F are dependent if there is a demand j such that both βij and βij are positive. We consider the facilities in F in the order given by O and pick a maximal independent set of facilities, F ⊆ F . We open the facilities in F . Installing services. Consider service type l and the set of facilities Fl. We say that facilities i, i ∈ Fl are service-l-dependent if there exists some demand j in Gl such that both θij and θij are positive. We pick a maximal independent subset Fl by looking at facilities in Fl in a particular order: first we consider facilities in Fl F in increasing order of til, and then facilities in Fl − F in increasing order of ti. We first install service l on all facilities in Fl F . Furthermore, for each i ∈ Fl − F , we pick a facility i ∈ F such that i and i are dependent and i ≤ i (in the ordering given by O), and install service l on facility i .We say that i is the neighbor of i and denote it by nbr(i). Note that nbr(.) depends only on i and not on the service type l:if i/∈ F , then we can choose a single facility i ∈ F such that i ≤ i regardless of the service type l. Assigning demands. We assign each client j to the nearest open facility at which service g(j) is installed. 3.1 Analysis. We now bound the performance of our algorithm. The following lemma just says what it means for a demand j to get frozen. Lemma 3.1. Let i be the facility that causes a demand j to freeze. Then, i is tentatively open, service g(j) is tentatively installed at i,and αj =max(cij ,ti,tig(j)). We start by bounding the cost incurred in opening facilities, and installing services. Let D be the subset of demands {j : ∃i∈ F s.t. βij >0}. Lemma 3.2. The cost of opening facilities is at most αj . Furthermore, the cost of installing services j∈D is at most j αj . Proof. For each facility i that is tentatively opened, we have that fi = By the construction of F , j∈D βij . we know that for each demand j, there is at most one i ∈ F such that βij > 0. Summing over all facilities, and using this fact, we see that fi = βij = βij ≤ αj . i∈Fi∈Fj∈D j∈Di∈Fj∈D By the definition of nbr(i), we know that for any service type l, fl ≤ fl.For each i ∈ Fl , we install service nbr(i) i l either at i ∈ Fl or at nbr(i), and so we can upper bound the total cost of installing services of type l by fil . Since each service l is tentatively installed i∈F l only when fl = j∈Gl θij , wehavethat fl = ii∈Fi l θij . The notion of service-l-dependence j∈Gl i∈F insures that l for each demand j ∈ Gl,there is at most one facility i ∈ Fl for which θij is positive. We obtain that the total cost of installing service l is at most j∈Gl αj , which immediately implies the lemma. We next bound the assignment cost incurred by the solution computed. The following facts, which follow directly from the construction of the algorithm, will be useful in this analysis. Fact 3.1. Suppose that βik is positive. Then it follows that cik ≤ αk − βik and αk ≤ ti. Fact 3.2. Suppose that θik is positive. Then cik ≤ αk − θik and cik 0, it follows that til ≤ αj − βij . So by the triangle inequality, cij ≤ 3αj . Now consider a demand j/∈ D , and again let g(j)= l.Let i be the facility that caused j to freeze, and so αj =max(cij ,ti,til). If i ∈ F Fl ,then service l is installed at i,and cij ≤ αj . Suppose that i ∈ Fl − F ,and let i = nbr(i) (an open facility at which service l is installed, see Fig. 1a). By Claim 3.1, cii ≤ 2ti =⇒ cij ≤ 3αj . Next suppose that i ∈ Fl .Since i ∈ Fl,there must exist i ∈ Fl such that i was not picked in Fl because i and i are service-l-dependent due to a client k.If i is also in F (Fig. 1b), then service l is installed there. Applying Claim 3.2, we obtain that both cik and cik are less than αk ≤ max(ti,til) ≤ αj , and hence cij ≤ 3αj . However, if i ∈ F (so i/∈ F ), then let i = nbr(i); service lis installed at i (Fig. 1c). Since i and iservicel- dependent, we have that ti ≤ ti, cii ≤ 2min(ti ,ti ) (Claim 3.1), and cij ≤ 3αj as above, which implies that cij ≤ 5αj . This completes the proof. Theorem 3.1. The above algorithm returns a solution of cost at most 6 j αj ≤ 6 · OPT . Proof. Follows from Lemma 3.2 and Lemma 3.3. 4 An LP-Rounding Algorithm We now give an algorithm for the special case where fl = fl , i.e., the installation cost depends only on i the service l and not on the location at which it is installed. Note that this case is handled by the primal- dual algorithm above. Here we adapt the rounding procedure of [6] to give deterministic and randomized approximation algorithms achieving ratios of 6 and 2.391 respectively. Let (x,y)and (α,β,θ) be the optimal solutions to (P) and (D) respectively. We can ensure that for (a) (b) j k i i Facility icaused jto freeze, iand i are service-l-dependent, j k i i (c) j k k i i ≤ αj ≤ ti ≤ αk ≤ αk i ≤ ti i∈ F − F,i = nbr(i) i/∈ F ll , i = nbr(i)i,iare service-l-dependent Figure 1: Bounding the assignment cost of j. (a), (b) Different 3-hop cases, and (c) the 5-hop case. g(j) l every i, j and l, xij =0 or xij = y and y =0or ii l y = yi. We will round the fractional solution (x, y) i to an integer solution losing a factor of at most 6. Let Fj = {i : xij > 0}. We describe the algorithm briefly. A1. First, for every service type l, we consider the clients in Gl and cluster the facilities on which service l is installed around some cluster centers: pick j ∈ Gl with smallest αj value and form a cluster around j consisting of the facilities in Fj . We remove every client k ∈ Gl (including j)that is assigned (fractionally) to some facility in the cluster created, and recurse on the remaining set of clients until no client in Gl is left. Let Dl be the set of cluster centers. A2. Let D = Dl. We cannot open a facility in every l cluster since different clusters could share the same fractional facility weight (yi) if the cluster centers request different services. Say that j, k ∈ D are dependent if Fj Fk = φ. Note that this can only happen if j and k request different services. We consider clients in D in order of increasing αj and pick a maximal independent subset D .For each client j ∈ D , we open the facility in Fj with smallest fi and install service g(j)on it. Further for every k ∈ D − D ,there is some j ∈ D with αj ≤ αk such that j and k are dependent. We pick some such j and install service g(k)onthe facility opened from Fj .Call j the neighbor of k and denote it nbr(k). A3. We assign demand j to a facility as follows: (i) if j ∈ D it is assigned to the facility opened from Fj , (ii) if j ∈ D − D ,then nbr(j)= k ∈ D and j is assigned to the facility opened from Fk, and (iii) if j/∈ D,there is some k ∈ Dg(j) ⊆ D such that αk ≤ αj and j was removed from Gl in step A1 because a cluster was created around k;we assign j to the same facility as k. Note that this is a feasible assignment of demands to facilities. Lemma 4.1. The facility opening cost is at most fiyi. i Proof. The clusters corresponding to clients in D are disjoint and we open the cheapest facility in each cluster — the lemma follows. Lemma 4.2. Thecost of installingservices is atmost fll i yi. i,l Proof. The cost of installing a service is independent of the location at which it is installed, and for any service type l, we install service on at most one (new) location per cluster center in Dl. So the total cost of installing services is at most fl = lj∈Dl fl fll yi ≤ i yi. lj∈Dl i∈Fj i,l Lemma 4.3. The assignment cost of client j is at most 5αj . Proof. If j ∈ D , we assign j to a facility i ∈ Fj ,and cij ≤ αj by complementary slackness. If j ∈ D − D and nbr(j)= k ∈ D , we assign j to the facility i opened from Fk.Since αk ≤ αj and cjk ≤ 2αj , cij ≤ 3αj .If j/∈ D, j is assigned to the same facility as k where k ∈ Dg(j) ⊆ D and j was removed from Gg(j) in step A1 because a cluster was created around k .So αk ≤ αj and cjk ≤ 2αj .From above, k is assigned to a facility i with cik ≤ 3αk ,so cij ≤ 5αj . Thus we have proved the following theorem. Theorem 4.1. The cost of the solution returned is at most 6 · OPT. 4.1 Improvement using randomization. We give a rounding procedure that combines clustered randomized rounding [6] and the filtering based technique of [10, 14]. We define some notation first. Let 0 < <1 1 be a parameter that we will set later and r = .Sort γ the facilities in Fj by increasing cij .Let i be the first facility in this ordering such that xij ≥ . i∈Fj :cij ≤cij Let Nj be the subset of Fj consisting of all facilities (including i) that come before i in this ordering. Define ¯ Cj ( )= cij and Cj = i cij xij . To simplify things we assume that each yi ≤ and for any j, i∈Nj yi is exactly .If some yi > , then we can create at most 1/ copies of i and set yi ≤ for each of the copies so that yi = yi (setting the variables copies i l xij ,yi accordingly). Similarly if i∈Nj yi > ,we can take the facility i in Nj that comes last in the ordering in Fj and split it into two copies i1 and i2 setting yi2 = i∈Nj yi − , yi1 = yi −yi2 (and the other variables accordingly). We include only i1 in Nj thus en suring that i∈Nj yi = . The cost of the fractional solution remains unchanged by these transformations and any solution to the modified instance gives a solution to the original instance of no greater cost. R1. This is the same as step A1 except that we choose j ∈Gl with smallest 2αj +Cj ( )+C ¯ j as the cluster center. This gives us a set of cluster centers Dl for each service type l. R2. Weprune theset D = as in step A2 but l Dl modify the notion of dependency to say that j,k ∈ D are dependent if Nj Nk = φ, and consider the clients in D in increasing order of Cj ( )+ C ¯ j .For k ∈D−D we define nbr(k) as before. We call the facilities in Nj for clients j ∈ D central facilities, and the rest as non-central facilities. R3. For every client j ∈ D we randomly open exactly one facility in Nj by choosing facility i with proba bility yi/ i∈Nj yi = r·yi. This facility now serves as a backup facility for all the clients that would get assigned to this facility in step A3 of the deterministic algorithm. R4. Independent of step R3, each non-central facility i is opened independently with probability r·yi. R5. For any facility i, be it a central or a non-central facility, if iis opened (in R3 or R4), we install on it all services that are installed on it in the fractional l solution, i.e., all l such that yi >0. R6. For every client j ∈ D−D , if no facility from Fj is open, we install service g(j) on the facility opened in R3 from Nnbr(j). R7. We assign demand j to the nearest open facility at which service g(j) is installed. Lemmas 4.1 and 4.2 are modified to the following. Lemma 4.4. The expected cost of opening facilities is r· fiyi. The expected cost of installing services is at i 1 fll most r+ i yi. er i,l Proof. Each facility i is opened with probability r ·yi. The cost of installing services in step R5 is bounded by Pr[i is opened (in R3 or R4)] l fl = · il:y>0 ii r i fl flll l yi l = r· i y since yi >0=⇒ y = yi. l:y>0 i i,l i i i Consider client j ∈D−D with g(j)= l. For every non-central facility i ∈ Fj ,let Ei be the event that i is opened in step R4 and pi =Pr[Ei]= r ·yi. For every cluster center k ∈ D such that Sk = Fj Nk = φ,let Ek be the event that a facility from Sk is open after 􀀀 yi step R3. Let pk =Pr[Ek ]= 􀀀 i∈Sk = r · yi. yi i∈Sk i∈Nk Let m be the total number of events. All the events Ei are independent. The probability that service l is installed in step R6 due to client j, is the probability that no facility from Fj is open after steps 􀀀 R3 and R4, m − pn −r which is at most (1 − pn) ≤ e n = e . n=1 So the cost of installing services in step R6 is at most 1 fg(j) ≤ 1 fll l i y since y =1 and er j∈D−Der i,l i i∈Fj i any two clients in Dl have disjoint Fj . To bound the assignment cost, we bound the assignment cost incurred under a provably worse way of assigning demands to facilities. Demand j is assigned to a facility as follows. If some facility i ∈ Fj is open, we assign demand j to the nearest such facility. Otherwise if j ∈ D − D , j is assigned to its backup facility. If j/∈ D, there is some client k ∈ Dg(j) ⊆ D such that j was removed from Gg(j) because a cluster was formed around k in step R1. We assign j to the same facility as k;so j may be assigned either to a facility in Fk or to its backup facility in Nnbr(k),if k/∈D and no facility from Fk is open. Note that service g(j)isinstalledonthe facility to which j is assigned. We need the following lemma from [6] (see also [15]). Lemma 4.5. Let d1 ≤ d2 ≤ ... ≤ dm and 0 ≤ pn ≤ 1 for n=1,... ,m.Then, p1d1 +(1−p1)p2d2 +···+(1−p1) ···(1 −pm−1)pmdm n≤m pndn ≤ 1 − (1 −pn) . n≤m pn n≤m Lemma 4.6. Let j be any demand. Let X be the distance between j and the facility assigned to it and Z be the event that no facility i ∈ Fj is open. Then, ¯ (i) If j/∈ D, E X|Z ≤ 3αj + Cj ( )+ Cj , (ii) 1 E X ≤C ¯ j + (3αj + Cj ( )). er Proof. If j ∈ D ,E X = i∈Nj cij xij / i∈Nj xij ≤ ¯ Cj since every facility in Fj −Nj is farther from j than every facility in Nj .For j/∈ D , we show (i) and use it to prove (ii). Suppose j ∈ D−D , k = nbr(j)and A= Nj Nk = φ. For any facility i ∈ A we have cij ≤ αj and cik ≤ Ck( ). Let B be the distance between j and its backup facility in Nk.Event Z implies that j is assigned to the backup facility in Nk so conditioned on Z, X = B.If there is some i∈ A such that cik ≤C ¯ k we ¯ have a deterministic bound of B ≤ αj + Ck + Ck( ). If there is no such i in A, since the unconditional distance between k and the backup facility in Nk is at ¯ most Ck , by conditioning on Z we are only removing weight from facilities that are farther than the average distance. So the conditional expected distance between ¯ k and the backup facility is at most Ck implying that ¯ E B|Z =E X|Z ≤ αj + Ck ( )+ Ck.In either case ¯ E X|Z ≤ αj + Cj ( )+ Cj , where the last inequality follows since we look at clients in D in increasing order of Cj ( )+ C ¯ j and k was picked before j. If j/∈ D, there must be a client k ∈ Dg(j) such that j was removed from Gg(j) because a cluster was formed around k in step R1. So j,k are assigned to the same ¯¯ facility, 2αk + Ck ( )+ Ck ≤ 2αj + Cj ( )+ Cj and cjk ≤αj + αk . If a facility in Fk is open then we have a deterministic bound of X ≤ αj +2αk .Otherwise j,k are assigned to the backup facility for k and by the above bound on E X|Z for k , the conditional expected distance from j to the backup facility is at most αj +2αk + Ck ( )+ C ¯ k ≤3αj + Cj ( )+ C ¯ j . We now prove part (ii). For a non-central facility i∈ Fj ,let pi and Ei be as defined in Lemma 4.4 and let di = cij . For every k ∈D such that Sk = Fj Nk = φ, let pk,Ek be as defined in Lemma 4.4 and define dk = E distance from j to Sk|Ek = cij yi/yi. i∈Sk i∈Sk Let the distances be ordered so that d1 ≤d2 ≤...≤dm where there are m events in all. Since the events Ei are m independent and yi = xij , p=Pr[Z]= (1 −pn) ≤ 􀀀 n=1 − pn n e = e−r.So, E X ≤ p1d1 +(1 −p1)p2d2 + ··· +(1 −p1) ···(1 −pm−1)pmdm + p·E X|Z n≤m nn ≤ pd (1 −p)+ p(3αj + Cj ( )+ C ¯ j ) p n≤mn 1¯ ≤ Cj + (3αj + Cj ( )). r e Theorem 4.2. The randomized algorithm produces a where r =1/ .Taking =0.67674 we get a solution of cost at most 2.391 ·OPT. solution of expected cost at most, 4 1 max r+ ,1+ r re(1 − )e3 + re·OPT Proof. From the definition of Cj ( )and the Markov ¯ property we have, Cj ( ) ≤ Cj . The proof now follows 1−γ from Lemmas 4.4 and 4.6. 5The k-Median Variant We consider a variant of this problem where we have the additional constraint that at most k facilities may be opened. This adds the constraint i yi ≤ k to the linear program (P). The objective function of the dual (D) gets modified to max j αj −kz and constraint (2.2) changes to j βij ≤ fi + z.Let (KP) and (KD) be the modified primal and dual programs and OPTK be the value of an optimal k-median solution. 5.1 A modified primal-dual algorithm. We first modify the primal-dual algorithm of Section 3 to obtain the stronger guarantee that we return a solution to (P) of cost (O,I,C), and a solution (α,β,θ)to (D) such that 6O + I + C ≤ 6 j αj ≤ 6 ·OPT ,where O,I,C denote respectively the facility opening cost, the service installation cost and the client assignment cost. We describe the algorithm briefly and sketch the analysis. The dual ascent process is the same as in Section 3 but we modify the way in which we open facilities and install services to ensure that a demand j does not pay for both opening a facility and for installing a service at some other facility. To do this we consider a more detailed notion of dependence between facilities. We classify 4 types of dependence between facilities. Say that the ordered pair (i,i)is, (1) ff-dependent (f for facility) if there is a demand j such that βij ,βij >0. (2) sf-l dependent (s for service) if there exists j ∈ Gl such that θij ,βij >0. (3) ss-ldependent if for some j ∈Gl,both θij ,θij >0. (4) fs-l dependent if for some j ∈Gl,both βij ,θij >0. Recall that F is the set of tentatively open facilities, Fl ⊆ F the set of tentatively open facilities on which service l is tentatively installed, ti is the time at which facility i ∈ F became tentatively open, and for i ∈ Fl, til is the time at which service lwas tentatively installed at i. Initially for each facility i∈F,let Si be the set of services that are tentatively installed at i. M1. We first pick a set F ⊆ F of facilities to open, and for each i ∈ F aset Ti ⊆ Si of services to install at facility i. Initially F = φ and Ti = φ for all i. We consider facilities in F in the order given by O. While F = φ, 1. Let i ∈F be the currently considered facility. Remove i from F ,set F ←F ∪{i},Ti = Si. 2. For each i ∈ F we do the following. a) If (i, i ) is ff-dependent or ∃l ∈ Ti s.t. (i, i )is sf-l dependent or ∃l ∈ Si s.t. (i, i )is fs-l dependent and til 0}. By the construction of F ,we know that for each demand j thereis atmostone i ∈F such that βij > 0; for j ∈D let o(j) denote this unique facility in F . By design we ensure that any demand j ∈ Gl contributes θij > 0 for at most one facility i ∈ Al ∪ Fl and further that if j ∈ D then i = o(j) is the only such facility. This gives the following. Lemma 5.1. The cost of opening facilities is βo(j)j . The cost of installing services is at j∈D most θo(j)j + αj . j∈D j/∈D The bound on the assignment cost incurred is similar to the bound in Lemma 4.3 and is proved similarly. Lemma 5.2. If j ∈ D , the assignment cost of j is at most 3(αj − βo(j)j ).If j/∈ D the assignment cost incurred for j is at most 5αj . Combining the above two lemmas and the fact that for j ∈ D , αj = co(j)j + βo(j)j + θo(j)j ,we get the following theorem. Theorem 5.1. The solution returned has cost (O, I, C) such that 6O + I + C ≤6 j αj ≤6 ·OPT . 5.2 An 18-approximation algorithm. Suppose we fix z and run the above primal-dual algorithm with the facility costs modified to fi + z. Suppose the algorithm returns a primal solution of cost (O, I, C)thatopens k facilities and a dual solution (α, β, θ). Here O is the facility opening cost with the original costs fi. Then, we can show that we have a solution of cost at most 6·OPTK .since (α, β, θ, z) is a feasible solution to (KD) and by Theorem 5.1, 6(O+kz)+I +C+ ≤6 αj =⇒ j 6O+I+C ≤6( j αj −kz) ≤6·OPTK . So our goal is to find such a z in polynomial time. We do not quite know how to do this, instead we find two values z1 >z2,close together such that the algorithm opens k1 k facilities for z2.When z is very large, e.g., |D|(maxij cij +maxil fl), the algorithm will i open just one facility and at z = 0 the algorithm opens >k (we assume this — otherwise the solution at z =0 costs at most 6·OPTK since (α, β, θ, 0) is a feasible dual solution). We perform a bisection search in this range to find z1 and z2. We combine these two solutions to first get a fractional solution of cost 6(1 − )−1 ·OPTK ,for some small , that (fractionally) opens k facilities and in which each demand is assigned to at most two facilities. If the service cost depends only on the service type, we can round this solution losing a factor of 3(1 − )to get an integer solution that opens exactly k facilities, thus getting an overall approximation ratio of 18. This idea was usedin[9] forthe k-median variant of UFL. However our final rounding procedure differs from the one in [9] since we also have to pay for installing services and need to ensure that demand j is connected to an open facility at which service g(j) is installed. Theorem 5.2. If the service installation cost depends only on the service and not on the location, we can get an 18-approximation algorithm for the k-median variant of facility location with service installation costs. 6Extensions Arbitrary Demands. Our results carry over to the case where instead of unit demands, client j may have a demand dj ≥0. We can reduce this to the unit demand case by making dj copies of client j, but this makes the algorithm run in pseudo-polynomial time. We can simulate this reduction however. In the primal-dual algorithm we raise αj at rate dj and say that j has reached i if αj ≥ dj cij . In the LP rounding algorithms of Section 4 the only change is that αj gets replaced with αj /dj in steps A1, A2 and R1. The analysis in Sections 3, 4 and 5 extends in a straightforward way and we get the same approximation guarantees. Acknowledgments. We thank Robin Roundy for stimulating discussions and helpful suggestions. References [1] N. Aksoy and S. Erenguc. Multi-item inventory models with coordinated replenishments: a survey. International Journal of Operations & Production Management, 8:63–73, 1988. [2] A. Archer, R. Rajagopalan and D. Shmoys. Lagrangian relaxation for the k-median problem: new insights and continuity properties. In Proceedings of 11th ESA, pages 31–42, 2003 [3] I. Baev and R. Rajaraman. Approximation algorithms for data placement in arbitrary networks. In Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 661–670, 2001. [4] Y. Bartal, M. Charikar and D. Raz. Approximating min-sum k-clustering in metric spaces. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, pages 11–20, 2001. [5] F. A. Chudak. Improved approximation algorithms for uncapacitated facility location. In Proceedings of 5th IPCO, pages 180–194, 1998. [6] F. Chudak and D. Shmoys. Improved approximation algorithms for the uncapacitated facility location problem. SIAM Journal on Computing.To appear. [7] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 45:634–652, 1998. [8] K. Jain, M. Mahdian, E. Markakis, A. Saberi, and V. Vazirani. Greedy facility location algorithms analyzed using dual-fitting with factor-revealing LP. Journal of the ACM.To appear. [9] K. Jain and V.V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM, 48:274–296, 2001. [10] J. H. Lin and J. S. Vitter. -approximations with minimum packing constraint violation. In Proceedings of the 24th Annual ACM Symposium on Theory of Computing, pages 771–782, 1992. [11] M. Mahdian, Y. Ye, and J. Zhang. Improved approximation algorithms for metric facility location. In Proceedings of 5th APPROX, pages 229–242, 2002. [12] P. Mirchandani and R. Francis, eds. Discrete Location Theory. John Wiley and Sons, Inc., New York, 1990. [13] D. B. Shmoys. Approximation algorithms for facility location problems. In Proceedings of 3rd APPROX, pages 27–33, 2000. ´ [14] D. B. Shmoys, E. Tardos, and K. I. Aardal. Approximation algorithms for facility location problems. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages 265–274, 1997. [15] M. Sviridenko. An improved approximation algorithm for the metric uncapacitated facility location problem. In Proceedings of 9th IPCO, pages 240–257, 2002. [16] C. Swamy. A note on the data placement problem. Unpublished manuscript, 2003.