Fastest Isotonic Regression Algorithms

Quentin F. Stout
[email protected]
www.eecs.umich.edu/~qstout/

Version as of August 2014

Below are tables of the fastest known isotonic regression algorithms for various Lp metrics and partial orderings. Throughout, “fastest” means in O-notation, not in any measurements of implementations. I’ve cited the first paper to give a correct algorithm with the given time bound, to the best of my knowledge; in some cases I’ve listed two if they appeared nearly contemporaneously. I’ve only included algorithms for exact calculations (to within numerical accuracy), not approximations. I think I’ve included all orderings of practical interest. Feel free to contact me about corrections or updates.

This is a list of the fastest known algorithms, not an historical review nor a survey of applications; several of the papers listed below have extensive references within them. I’ve also omitted related topics such as unimodal regression, integer-valued regression, isotonic regression with constraints on the number of level sets or the differences between adjacent ones, etc. I might include some of these in the future if there is sufficient interest.

No parallel algorithms are considered since, regrettably, there has been no interesting work in this area, even though such algorithms would be useful for large data sets. Contact me if you know of any, or want to fund their development.

A directed acyclic graph (DAG) G with n vertices V = {v1, ..., vn} and m edges defines a partial order over the vertices, where vi precedes vj if and only if there is a path from vi to vj. It is assumed that G is connected, and hence m ≥ n − 1; if it isn’t connected then the algorithms would be applied to each component independently of the others. A function z = (z1, ..., zn) on G is isotonic if whenever vi precedes vj, then zi ≤ zj.
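Since vi precedes vj exactly when there is a path from vi to vj, and ≤ is transitive, a function is isotonic if and only if it is isotonic along every edge. A minimal sketch of this check (hypothetical code, not from any of the cited papers):

```python
# Checking whether a function z on a DAG is isotonic.
# Path-isotonicity follows from edge-isotonicity by transitivity of <=,
# so it suffices to examine each directed edge (i, j) once.

def is_isotonic(edges, z):
    """True iff z[i] <= z[j] for every directed edge (i, j)."""
    return all(z[i] <= z[j] for (i, j) in edges)

# Example: a "V"-shaped ordering v0 -> v2 <- v1
edges = [(0, 2), (1, 2)]
print(is_isotonic(edges, [1.0, 0.5, 2.0]))  # True
print(is_isotonic(edges, [1.0, 0.5, 0.0]))  # False
```

This takes Θ(m) time, in contrast to checking all Θ(n^2) predecessor/successor pairs.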
By data (y, w) on G we mean that there is a weighted value (yi, wi) at vertex vi, 1 ≤ i ≤ n, where yi is an arbitrary real number and wi, the weight, is ≥ 0. For 1 ≤ p ≤ ∞, given data (y, w) on G, an Lp isotonic regression of the data is a real-valued isotonic function z over V that minimizes

    ( sum_{i=1}^n wi |yi − zi|^p )^{1/p}    if 1 ≤ p < ∞
    max_{i=1}^n wi |yi − zi|                if p = ∞

among all isotonic functions. The Lp regression error is the value of this expression.

The orderings listed in the tables are linear (also known as total), tree, points in multidimensional space, and general (i.e., an algorithm that applies to all orderings). A DAG of points in multidimensional space is the isotonic version of multivariate regression. In d-dimensional space (the “dim” orderings), point p = (p1, ..., pd) precedes point q = (q1, ..., qd) iff pi ≤ qi for all 1 ≤ i ≤ d; in some settings, q is said to dominate p. In the tables the multidimensional orderings are further subdivided into regular grids and points in arbitrary positions, and into dimension 2 and dimension ≥ 3, because different algorithms can be used in these cases. The constants in the O-notational analyses depend on d, but in general the papers do not explicitly determine them.

The metrics considered are L1, L2, and L∞. These metrics also go under a variety of other names: L1 is also known as Manhattan or taxi-cab distance, median regression, or least absolute deviation; L2 is squared error regression or Euclidean distance; and L∞ is also known as minimax optimization, the uniform metric, Chebyshev distance, supremum, or maximum absolute deviation. Some results for general Lp appear in [12].

The tables list the best time known for the given ordering and metric, and citations to the references where the algorithm is described, or to a relevant note. All times are worst-case.
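The Lp regression error defined above translates directly into code; a minimal sketch (hypothetical code, not part of the survey):

```python
# Lp regression error of a candidate isotonic function z for data (y, w),
# computed directly from the definition: a weighted p-norm of the
# residuals for finite p, and a weighted maximum deviation for p = infinity.

def lp_error(y, w, z, p):
    """Lp regression error of candidate values z for weighted data (y, w)."""
    if p == float('inf'):
        return max(wi * abs(yi - zi) for yi, wi, zi in zip(y, w, z))
    return sum(wi * abs(yi - zi) ** p for yi, wi, zi in zip(y, w, z)) ** (1.0 / p)

y, w = [1.0, -1.0, 0.0], [1.0, 1.0, 1.0]
print(lp_error(y, w, [0.0, 0.0, 0.0], 2))             # sqrt(2), about 1.4142
print(lp_error(y, w, [0.0, 0.0, 0.5], float('inf')))  # 1.0
```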
                             weighted                          unweighted
                             time                  reference   time                                  reference
  linear                     Θ(n log n)            [1, 9]      Θ(n log n)                            W
  tree                       Θ(n log n)            [12]        Θ(n log n)                            W
  2-dim grid                 Θ(n log n)            [12]        Θ(n log n)                            W
  2-dim arbitrary, note d)   Θ(n log^2 n)          [12]        Θ(n log^2 n)                          W
  d ≥ 3 grid                 Θ(n^2 log n)          A           Θ(n^1.5 log^{d+1} n)                  [13]
  d ≥ 3 arbitrary, note d)   Θ(n^2 log^d n)        A           Θ(n^1.5 log^{d+1} n)                  [13]
  arbitrary                  Θ(nm + n^2 log n)     [2]         Θ(min{nm + n^2 log n, n^2.5 log n})   W, [12]

A: Result implied by that for arbitrary DAG
W: Result implied by that for weighted data

Table 1: L1. See Comment 4.

                             time                   reference
  linear                     Θ(n)                   note a)
  tree                       Θ(n log n)             [7]
  2-dim grid                 Θ(n^2)                 [8]
  2-dim arbitrary, note d)   Θ(n^2 log n)           [12], note c)
  d ≥ 3 grid                 Θ(n^3 log n)           A
  d ≥ 3 arbitrary, note d)   Θ(n^3 log^d n)         A
  arbitrary                  Θ(n^2 m + n^3 log n)   [6, 12], note b)

A: Result implied by that for arbitrary DAG

Table 2: L2, no improvements known for unweighted data. Also see Comment 3.

                             weighted                             unweighted
                             time               reference         time               reference
  linear                     Θ(n)               [14]              Θ(n)               A
  tree                       Θ(n)               [14]              Θ(n)               A
  2-dim grid                 Θ(n)               [14]              Θ(n)               A
  2-dim arbitrary, note d)   Θ(n log n)         [14]              Θ(n log n)         A
  d ≥ 3 grid                 Θ(n)               [14]              Θ(n)               A
  d ≥ 3 arbitrary, note d)   Θ(n log^{d-1} n)   [14]              Θ(n log^{d-1} n)   A
  arbitrary                  Θ(m log n)         [5, 10], note f)  Θ(m)               note e)

A: Result implied by that for arbitrary DAG

Table 3: L∞. See Comments 4, 5.

Notes concerning the tables:

a) For linear orders the “pool adjacent violators” (PAV) approach has been repeatedly rediscovered; apparently the first paper to use it is by Ayer et al. [3]. For the L2 metric it is trivial to implement in linear time, and similarly for the L∞ metric with unweighted data, while for the other metrics appropriate data structures are needed. Previously the fastest known algorithm for weighted L∞ used this approach, taking Θ(n log n) time, but now the fastest takes Θ(n) time and is not based on PAV [14].

b) Maxwell and Muckstadt’s algorithm [6], with a small correction by Spouge, Wan, and Wilbur [8], takes Θ(n^4) time.
The algorithm in [12] takes Θ(n^2 m + n^3 log n) time, which is an improvement when m = o(n^2).

c) For 2-dimensional points with arbitrary placement, [12] shows how to use a balanced tree to simulate the 2-dimensional grid algorithms for the L1 and L2 metrics, obtaining the indicated results.

d) For points in arbitrary position and d ≥ 3 (and d = 2 for the L∞ metric), the results are based on embedding the points into a DAG which has more vertices but which can represent the domination ordering using relatively few edges. The best result for general sparse DAGs is then applied to this new DAG. This approach was first used in the original version of [10], but was subsequently moved to [13]. The new DAG has Θ(n log^{d-1} n) edges and vertices, and can be constructed in time linear in its size. The time analysis for L2 uses the fact that the number of iterations needed is linear in n, rather than linear in the size of the new DAG. For L1 and L2 the analyses also use the fact that a minimum cost flow step in [2] uses a number of steps linear in the number of vertices with nonzero weight, which too is n rather than the size of the new DAG. If one instead used the natural representation of the domination ordering for points in arbitrary position there would be Θ(n^2) edges, and so the bounds for general dense orderings would apply.

e) For unweighted data the time for L∞ is Θ(m), since the regression value at a vertex can be chosen to be the average of the maximum value of any predecessor (including itself) and the minimum value of any successor (including itself). This can be computed via topological sorting, and has been repeatedly rediscovered.

f) The algorithm in [10] is a modest improvement of the algorithm of Kaufman and Tamir [5], reducing the time from Θ(m log n + n log^2 n) to Θ(m log n). This is faster for sparse graphs where m = o(n log n), which is relevant for all of the other orderings considered. Unfortunately, the approach is based on parametric search, which is decidedly impractical.
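As note a) says, PAV for the L2 metric on a linear order is trivial to implement in linear time. A minimal sketch of the idea (my illustration, not code from any of the cited papers; it assumes strictly positive weights for simplicity):

```python
# Pool Adjacent Violators (PAV) for weighted L2 isotonic regression on a
# linear order. Each element starts as its own block; while the previous
# block's weighted mean exceeds the current one's, the blocks are pooled.
# Each block's regression value is the weighted mean of the data it absorbed.

def pav_l2(y, w):
    """Weighted L2 isotonic regression of data y with positive weights w."""
    blocks = []  # each block: [weighted sum, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([wi * yi, wi, 1])
        # pool while the last two blocks violate isotonicity
        # (means compared by cross-multiplication; weights are positive)
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, t, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += t
            blocks[-1][2] += c
    z = []
    for s, t, c in blocks:
        z.extend([s / t] * c)
    return z

print(pav_l2([1.0, -1.0, 0.0], [1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```

The total work is linear because each element can be pooled into a larger block at most once.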
Comments:

1. Most of the entries have changed since I first posted this in 2009, partially because I happened to develop some new algorithms. Many of these incorporate prior results that are referenced in the papers. Please let me know of faster algorithms so that I can incorporate them. The most recent major update to this document was in August 2014. Among other things, I changed to pdf, instead of html, because it simplified keeping the cross-indexing up to date and the O-notation looks better. I also updated most of the entries for weighted L∞. I hadn’t planned on working on them, and had assumed that the previous results were optimal (e.g., Θ(n log n) for the linear order), but stumbled across a helpful approach [14] while working on a different problem.

2. While many isotonic regression algorithms have been published, it appears that few efficient, exact ones have been implemented. I’m as guilty of this as anyone. Some of them would require significant work, and for weighted L∞ regression on an arbitrary DAG the result is purely of theoretical interest since it relies on parametric search. I’m not sure which of the published ones would be faster in practice, e.g., how large does n need to be before the Θ(n) algorithm for linear orders [14] is faster than the simpler Θ(n log n) prefix approach [10]? However, some of the implementations I’ve seen are of slower algorithms even when the one faster in O-notation is also likely to be faster in practice and about the same difficulty to program. I might start providing pointers to good implementations, so if you have favorites please send me a pointer to them.

3. For DAGs other than linear orders and trees, the fastest L2 algorithms use a recursive approach based on splitting the vertices into those where the regression value is above the average and those where it is below. In the worst case, at each step either almost all vertices have a regression value larger than the average, or almost all are less.
This introduces a factor of n which in practice is likely to be closer to log n. Reducing the time of the L2 algorithms, and implementing them efficiently, is the improvement of most importance to people who want to use isotonic regression.

4. In general there are many optimal isotonic regressions when the L1 and L∞ metrics are used. A natural “best” regression is the limit, as p → 1 or p → ∞ respectively, of Lp isotonic regression. This doesn’t appear to have a name, so I’ve called it “strict isotonic regression”. The standard approach for unweighted L∞, setting the regression value at a vertex to be the average of the largest preceding data value (including its own) and the smallest following one (including its own), does not produce the strict isotonic regression. For example, the regression of the data 1, −1, 0 would be 0, 0, 0.5, i.e., there is an unnecessary error at the last vertex. The isotonic regression for every other Lp metric would set the last value to 0, and hence the strict L∞ isotonic regression does so as well. For L∞ the fastest known algorithms for computing the strict regression take Θ(n log n) time for linear and tree orderings and Θ(TC(n, m)) time for a general ordering, where TC(n, m) is the time required to determine the transitive closure of a DAG of n vertices and m edges. The best bound known for TC(n, m) is Θ(min{nm, n^2.376}). These algorithms appear in [11], along with an analysis of mathematical reasons for preferring strict isotonic regression. Apparently no algorithms have been published for strict L1 isotonic regression. Most L1 algorithms, when applied to the unweighted data 1, 0, 1, would produce 1, 1, 1 or 0, 0, 1, while the strict regression is 0.5, 0.5, 1. PAV can be used to produce the strict L1 isotonic regression for linear and tree orderings, but often can only produce an approximation accurate to within machine error. This occurs even if the data is unweighted with only integer values.
The difficulty has to do with determining the proper median of, say, 10, 2, 1, 0 vs. 5, 2, 1, 0: in both cases the strict median is between 1 and 2, but it is slightly larger in the former case. Jackson [4] gives a formula for the strict median.

5. Currently, all of the fastest algorithms for weighted L∞ isotonic regression use an indirect approach, based on queries determining whether there is an isotonic regression with error ≤ ε and, if there is, producing one. A search is used to find the minimum such ε. Unfortunately the results, while optimal, are not always appealing. For example, using the algorithms as they are described, the unweighted data 1, −1, 3 results in 0, 0, 2, while some users would prefer 0, 0, 3. The latter is the strict regression (see Comment 4).

References

[1] Ahuja, RK and Orlin, JB (2001), “A fast scaling algorithm for minimizing separable convex functions subject to chain constraints”, Operations Research 49, pp. 784–789.

[2] Angelov, S, Harb, B, Kannan, S, and Wang, L-S (2006), “Weighted isotonic regression under the L1 norm”, Symposium on Discrete Algorithms (SODA), pp. 783–791.

[3] Ayer, M, Brunk, HD, Ewing, GM, Reid, WT, and Silverman, E (1955), “An empirical distribution function for sampling with incomplete information”, Annals of Mathematical Statistics 26, pp. 641–647.

[4] Jackson, D (1921), “Note on the median of a set of numbers”, Bulletin of the American Mathematical Society 27, pp. 160–164.

[5] Kaufman, Y and Tamir, A (1993), “Locating service centers with precedence constraints”, Discrete Applied Mathematics 47, pp. 251–261.

[6] Maxwell, WL and Muckstadt, JA (1985), “Establishing consistent and realistic reorder intervals in production-distribution systems”, Operations Research 33, pp. 1316–1341.

[7] Pardalos, PM and Xue, G (1999), “Algorithms for a class of isotonic regression problems”, Algorithmica 23, pp. 211–222.
[8] Spouge, J, Wan, H, and Wilbur, WJ (2003), “Least squares isotonic regression in two dimensions”, Journal of Optimization Theory and Applications 117, pp. 585–605.

[9] Stout, QF (2008), “Unimodal regression via prefix isotonic regression”, Computational Statistics and Data Analysis 53, pp. 289–297. www.eecs.umich.edu/~qstout/abs/UniReg.html A preliminary version appeared in “Optimal algorithms for unimodal regression”, Computing Science and Statistics 32, 2000.

[10] Stout, QF (2011), “Weighted L∞ isotonic regression”, submitted. www.eecs.umich.edu/~qstout/abs/LinfinityIsoReg.html This is a major revision of the original version that was posted on the web in 2008; some of the material in that paper was moved to [13].

[11] Stout, QF (2012), “Strict L∞ isotonic regression”, Journal of Optimization Theory and Applications 152, pp. 121–135. www.eecs.umich.edu/~qstout/abs/StrictIso.html

[12] Stout, QF (2013), “Isotonic regression via partitioning”, Algorithmica 66, pp. 93–112. www.eecs.umich.edu/~qstout/abs/L1IsoReg.html

[13] Stout, QF (2013), “Isotonic regression for multiple independent variables”, Algorithmica, to appear. www.eecs.umich.edu/~qstout/abs/MultidimIsoReg.html

[14] Stout, QF (2014), “L∞ isotonic regression for linear and multidimensional orders”, in preparation. www.eecs.umich.edu/~qstout/abs/LinftyIsoRegLinear.html

Copyright 2009–2014 Quentin F. Stout