Anticoncentration regularizers for stochastic combinatorial problems

Geoffrey J. Gordon
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

Shiva Kaul
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

Abstract

Statistically optimal estimators often seem difficult to compute. When they are the solution to a combinatorial optimization problem, NP-hardness motivates the use of suboptimal alternatives. For example, the non-convex $\ell_0$ norm is ideal for enforcing sparsity, but is typically overlooked in favor of the convex $\ell_1$ norm. We introduce a new regularizer which is small enough to preserve statistical optimality but large enough to circumvent worst-case computational intractability. This regularizer rounds the objective to a fractional precision and smooths it with a random perturbation. Using this technique, we obtain a combinatorial algorithm for noisy sparsity recovery which runs in polynomial time and requires a minimal amount of data.

The sparsity recovery problem illustrates a common tradeoff between computational and statistical efficiency. The goal is to estimate $\mathrm{supp}(h^*)$, the positions of the non-zero entries of some unknown $h^* \in \mathbb{R}^D$. A variety of resources are used to solve the problem: randomness, used to draw $M$ "sensing" vectors $x_m \in \mathbb{R}^D$ with independent $N(0,1)$ entries; data $\{y_m = \langle h^*, x_m\rangle + g_m\}_{m=1}^M$ contaminated with independent "noise" $g_m \sim N(0, \sigma^2)$; and time, during which $\hat h_M$ is chosen to minimize $F(h) = \mathbb{P}(\mathrm{supp}(h) \neq \mathrm{supp}(h^*))$ with regard to the external randomness of $g_1, \ldots, g_M$ and the internal randomness of $x_1, \ldots, x_M$. We hope for asymptotically reliable recovery in the sense that $F(\hat h_M) \to 0$ as $M \to \infty$. This hope is realized in many situations; we focus on a challenging setting where the sparsity level $S = \|h^*\|_0 = |\mathrm{supp}(h^*)|$ is a constant fraction of $D$ and the minimum component $\min_d |h^*_d|^2$ is a constant multiple of $\log S / S$. (It is challenging because $h^*$ is fairly dense, but not impossible because there is a significant amount of signal.) Here, asymptotically reliable recovery demands $M = \Omega(S)$ [18]. This optimal statistical rate is achieved by direct "empirical" minimization [18]:

$$\min_h \; \sum_{m=1}^{M} \| y_m - \langle h, x_m \rangle \|^2 \quad \text{s.t.} \quad \|h\|_0 \le S \qquad (\text{Direct}_2)$$
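To make the setup concrete, the following sketch generates data from the model above and solves (Direct$_2$) by brute-force enumeration of supports. It is illustrative only (the names and parameter values are our own), and its exponential search over supports is exactly what the rest of this note seeks to avoid.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    D, S, M, sigma = 8, 3, 40, 0.1            # tiny illustrative sizes

    # Unknown S-sparse signal h* and the observation model y_m = <h*, x_m> + g_m.
    h_star = np.zeros(D)
    support = rng.choice(D, size=S, replace=False)
    h_star[support] = rng.normal(size=S)
    X = rng.normal(size=(M, D))                # sensing vectors with iid N(0, 1) entries
    y = X @ h_star + sigma * rng.normal(size=M)

    # Direct_2: minimize the squared loss subject to ||h||_0 <= S. The least-squares
    # fit on each candidate support gives the best h for that support, and any support
    # of size < S is contained in one of size S, so enumerating size-S supports solves
    # the cardinality-constrained problem exactly.
    best_loss, best_support = np.inf, None
    for cols in itertools.combinations(range(D), S):
        Xc = X[:, list(cols)]
        coef, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        loss = np.sum((y - Xc @ coef) ** 2)
        if loss < best_loss:
            best_loss, best_support = loss, cols

    print(sorted(best_support), sorted(support.tolist()))

The enumeration visits $\binom{D}{S}$ supports, which is exponential in $D$ when $S$ is a constant fraction of $D$.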

Unfortunately, this minimization (without our attendant conditions on $x_m$ and $y_m$) is strongly NP-hard¹ [7]. This worst-case hardness discouraged average-case analysis of Direct$_2$ in favor of alternatives such as convex relaxations [2] and greedy approximations [16]. Such methods are effective, but typically attain computational efficiency by sacrificing statistical efficiency. For example, the Lasso is a widely used alternative which relaxes the non-convex cardinality constraint to an $\ell_1$ norm penalty. It is asymptotically reliable only if $M = \Omega(S \log(D - S))$; that is, a linear amount of data no longer suffices [17]. This general pattern, sacrificing data for time by relaxing a hard empirical problem, applies to other important problems such as halfspace learning and matrix completion [11, 4].

¹ A problem is strongly NP-hard if it remains NP-hard even when the input is encoded in unary rather than binary; that is, when the numerical values of its inputs are bounded by a polynomial in their lengths. A runtime is pseudopolynomial if it is polynomial in the unary length of the input. Weakly NP-hard problems, such as knapsack, may be solved in pseudopolynomial time, but unless P=NP the same is not true for strongly NP-hard problems.

We offer a complementary technique to relaxation, based on a simple observation: when $M$ is small, the objective $F_M$ fluctuates considerably, and so yields an opportunity to "sneak in" a small bit of regularization. Whereas typical regularizers are designed to outweigh the fluctuations, the anticoncentration regularizer hides in their shadow. Since it vanishes faster than $F_M$ stabilizes, it does not affect asymptotic reliability, yet it is still large enough to admit a polynomial time algorithm. Moreover, within the confines of linear data and polynomial time, it may continuously balance the linear coefficient and the polynomial degree. This is similar to how statistical penalties balance approximation error and estimation error.

We briefly describe the high-level ideas of this approach. We start with a mild variant of Direct$_2$ where the objective is normalized and the $\ell_2$ loss is swapped for $\ell_1$:

$$\min_h \; F_M(h) = \frac{1}{M}\sum_{m=1}^{M} | y_m - \langle h, x_m \rangle | \quad \text{s.t.} \quad \|h\|_0 \le S \qquad (\text{Direct}_1)$$
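For reference, the absolute values in (Direct$_1$) can be removed with the standard auxiliary-variable linearization (our notation; the note's exact encoding of the cardinality constraint may differ):

$$\min_{h,\, t} \; \frac{1}{M}\sum_{m=1}^{M} t_m \quad \text{s.t.} \quad -t_m \le y_m - \langle h, x_m\rangle \le t_m \;\; (m = 1, \ldots, M), \quad \|h\|_0 \le S$$

Once each $h_d$ is restricted to finitely many values, as in Section 1.1 below, and the cardinality constraint is encoded with indicator variables, this becomes an integer linear program.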

This variant is merely for exposition, so that we can work with integer linear programs. Direct$_1$ inherits strong NP-hardness from Direct$_2$; we conjecture that it inherits asymptotic reliability as well. Our main result follows.

Theorem 1  Given $M = \Omega(S)$ in the aforementioned challenging setting for $h^*$, a randomized polynomial time algorithm achieves the same asymptotic reliability as Direct$_1$.

The remainder of this note is a proof sketch of Theorem 1 and a discussion of related work.

1 The proof sketch

The proof consists of four steps:

1. Form a new program Round by constraining Direct$_1$'s decision variables to take a random number of values which is probably polynomial. Its solution is still asymptotically reliable.
2. Form a new program SmoothRound by randomly perturbing the coefficients in Round's objective (or, indirectly, its constraint matrix). Its solution is still asymptotically reliable.
3. SmoothRound can be solved in expected polynomial time iff Round can be solved in pseudopolynomial time.
4. Round can be solved in pseudopolynomial time since its constraint matrix has constant branchwidth (as defined later).

We show the changes to Direct$_1$ lower the rate of convergence of $F(\hat h_M)$ by just a factor of $O(1)$ or even $o(1)$. This way we don't have to touch the original proof of asymptotic reliability. Indeed, little of the proof pertains specifically to sparsity recovery.

1.1 Rounding

The first step is the most technical; we present only brief, high-level intuition. Without loss of generality, suppose the numbers defining Direct$_1$ lie within $[-1, 1]$. These numbers are almost surely irrational, so any optimization procedure must first round them to some bounded length. Standard arguments show that the length $L$ of the solution is polynomial in the length of the input.

We want to show that there is a data tradeoff constant $c \ge 1$ and rounding parameters $k \ge 0$, $\{a_d\}_{d=1}^D$, and $\{b_d\}_{d=1}^D$ such that, in terms of asymptotic reliability, the solution to Direct$_1$ is no better than the solution to

$$\min_h \; F_{cM}(h) \quad \text{s.t.} \quad \|h\|_0 \le S, \;\; \forall d: a_d \le 2^{-k} h_d \le b_d \qquad (\text{Round})$$

where $a_d$, $h_d$, and $b_d$ are integers. If the cardinality of each $[a_d, b_d]$ is bounded by a polynomial, then of course $h_d$ is similarly bounded, so the constraints effectively round the objective. As the spacing $k$ grows, a polynomial bound is easier to attain, but asymptotic reliability may suffer. We may obtain high bit-precision without losing accuracy if we have a good idea of where the minimizer will be. Fortunately, we may obtain such an estimate in polynomial time by solving the linear program $\hat\eta = \operatorname{argmin}_h F_{cM}(h)$. $\hat\eta$ is normally distributed around $h^*$; the bounded-precision solution $\bar\eta$ essentially shares this distribution for large enough $L$. Since $\mathbb{E}(\bar\eta) \approx h^*$, we will take $[a_d, b_d] = \bar\eta_d \pm \gamma$ for some $\gamma > 0$. Our strategy is, for some constant $\delta > 0$, to choose the smallest $k$ such that $\gamma$ is polynomial and the rounded $\bar h^*_M \in [a_d, b_d]$ with probability $1 - \delta$. Since we know the distribution of $\hat\eta$, we may obtain such a confidence interval.
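The polynomial-time estimate $\hat\eta$ is just a least absolute deviations fit, which is a linear program. The sketch below is illustrative only (SciPy-based, with our own placeholder choice of $\gamma$); it computes $\hat\eta$ and the intervals $[\hat\eta_d - \gamma, \hat\eta_d + \gamma]$, and omits the integer rescaling by $2^{-k}$ and the choice of the smallest adequate $k$.

    import numpy as np
    from scipy.optimize import linprog

    def lad_estimate(X, y):
        """Least absolute deviations, min_h (1/M) sum_m |y_m - <h, x_m>|, as an LP.
        Decision variables are [h (D entries), t (M entries)] with t_m >= |y_m - <h, x_m>|."""
        M, D = X.shape
        c = np.concatenate([np.zeros(D), np.ones(M) / M])
        # Encode y_m - <h, x_m> <= t_m  and  <h, x_m> - y_m <= t_m.
        A_ub = np.block([[-X, -np.eye(M)], [X, -np.eye(M)]])
        b_ub = np.concatenate([-y, y])
        bounds = [(None, None)] * D + [(0, None)] * M
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        return res.x[:D]

    # Reusing X, y from the earlier sketch:
    # eta_hat = lad_estimate(X, y)
    # gamma = 0.25                  # illustrative value; the note derives gamma from
    #                               # the known distribution of the estimate
    # a, b = eta_hat - gamma, eta_hat + gamma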

1.2 Smoothing

We now show that randomly perturbing the coefficients of the objective does not affect asymptotic recovery. Let each $\rho_m$ be a uniform random variable drawn from $[-r, r]$. The new objective is

$$A_M(h) = \frac{1}{M}\sum_{m=1}^{M} (1 + \rho_m)\, | y_m - \langle h, x_m \rangle |$$

The full result is based on the next lemma, which is a simple consequence of iterated expectation.

Lemma 2  $\exists c = O(1)$, $\forall M \ge M_0$, $\exists r > 0$, $\forall h$: $A_{cM}(h)$ has the same expectation as, but no more variance than, $F_M(h)$.

Note that $|y_m - \langle h, x_m\rangle|$ has a half-normal distribution with mean $\sqrt{\tfrac{2}{\pi}}\sqrt{\|h - h^*\|_2^2 + \sigma^2}$ and variance $\tfrac{\pi - 2}{\pi}\,(\|h - h^*\|_2^2 + \sigma^2)$. First,

$$\mathbb{E}(A_{cM}(h)) = \mathbb{E}(\mathbb{E}(A_{cM}(h) \mid \rho)) = \mathbb{E}\!\left((1 + \rho_m)\sqrt{\tfrac{2}{\pi}}\sqrt{\|h - h^*\|_2^2 + \sigma^2}\right) = \mathbb{E}(F_M(h))$$

where the last equality follows by symmetry of $\rho_m$. Next,

$$\mathbb{V}(A_{cM}(h)) = \mathbb{E}(\mathbb{V}(A_{cM}(h) \mid \rho)) + \mathbb{V}(\mathbb{E}(A_{cM}(h) \mid \rho)) = \mathbb{V}(F_{cM}(h)) + \tfrac{2}{\pi}(\|h - h^*\|_2^2 + \sigma^2)\,\mathbb{V}(\rho_m)$$

and we can take $r = O(M^{-1/2})$ to satisfy the lemma. Alternatively, we may trade time and data by letting $c = 1 - 1/p(M)$ for some polynomial $p$ and sizing $r$ appropriately.
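The bookkeeping in Lemma 2 is easy to check numerically. The sketch below is our own, with arbitrary parameter values: it estimates the mean and variance of $F_M(h)$ and of the perturbed $A_{cM}(h)$ over fresh draws of the data and of $\rho$. With $c$ a small constant and $r$ on the order of $M^{-1/2}$, the two empirical means agree while the variance of $A_{cM}$ is no larger.

    import numpy as np

    rng = np.random.default_rng(1)
    D, M, sigma, c = 8, 200, 0.1, 2            # illustrative values
    h_star = rng.normal(size=D)
    h = h_star + 0.1 * rng.normal(size=D)      # an arbitrary candidate h
    r = 1.0 / np.sqrt(M)                       # perturbation range, r = O(M^{-1/2})

    def F(num_samples):
        """One draw of the empirical objective (1/N) sum_m |y_m - <h, x_m>|."""
        X = rng.normal(size=(num_samples, D))
        y = X @ h_star + sigma * rng.normal(size=num_samples)
        return np.mean(np.abs(y - X @ h))

    def A(num_samples):
        """One draw of the perturbed objective with weights (1 + rho_m)."""
        X = rng.normal(size=(num_samples, D))
        y = X @ h_star + sigma * rng.normal(size=num_samples)
        rho = rng.uniform(-r, r, size=num_samples)
        return np.mean((1 + rho) * np.abs(y - X @ h))

    trials = 5000
    F_samples = np.array([F(M) for _ in range(trials)])
    A_samples = np.array([A(c * M) for _ in range(trials)])
    print("means:    ", F_samples.mean(), A_samples.mean())
    print("variances:", F_samples.var(), A_samples.var())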

1.3 SmoothRound is easier

We now reap the benefits of rounding and smoothing. We apply a result from the smoothed analysis of algorithms, wherein nature perturbs the input, and the algorithm's runtime may vary with the length of the original input and the amount of perturbation. Since our perturbation parameter $1/r$ is polynomial in $M$, we may simulate nature. The result internally manages precision. We say an algorithm has probably polynomial runtime if, with probability $1 - \delta$, its running time is polynomially bounded in the length of the input, $1/r$, and $1/\delta$. Such an algorithm does not necessarily have expected polynomial running time, but we may achieve the latter by wrapping the algorithm in a loop which restarts execution after a polynomial number of steps.
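The restart wrapper is a generic device. A minimal sketch (our own; the step budget and the return convention are illustrative assumptions):

    import random

    def run_with_restarts(algorithm, step_budget):
        """Repeatedly run `algorithm(seed, step_budget)` with fresh internal
        randomness, restarting whenever the (polynomial) step budget is exhausted.
        If each attempt succeeds with probability at least 1 - delta, the number
        of restarts is geometric, so the expected total runtime stays polynomial."""
        while True:
            seed = random.getrandbits(64)        # fresh randomness for this attempt
            result = algorithm(seed, step_budget)
            if result is not None:               # convention: None means budget exhausted
                return result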

Proposition 3 (adapted from [9])  Since its decision variables take at most a polynomial number of values, SmoothRound can be solved in probably polynomial time if (and only if) Round can be solved by a possibly randomized algorithm in expected pseudopolynomial time.

1.4 Round in pseudopolynomial time

To complete the argument, we show Round can be solved in pseudopolynomial time. We must exploit some structure of the input, since the general form of the problem is still strongly NP-hard. This hardness follows by reduction from the exact cover problem: given a collection of elements $E$ and a collection of sets $Z \subseteq 2^E$, determine whether there is a "cover" $C \subseteq Z$ such that each element belongs to exactly one set in $C$. In the reduction, $E$ corresponds to the rows of the constraint matrix and $Z$ corresponds to its columns; entry $(e, z)$ is 1 iff $z$ contains $e$. The threshold vector is set to all 1's to enforce the "exactly one" condition. The columns of the constraint matrix therefore encode difficult dependencies.

This is not true of the constraint matrices used for sparsity recovery. The bulk of Round's constraint matrix $K$ encodes the "design" matrix $X$ whose rows are $x_1, \ldots, x_M$. Since $X$ has iid $N(0,1)$ entries, its columns are nearly linearly independent with overwhelming probability. $K$'s inability to encode difficult dependencies is captured by its branchwidth, defined as follows: a branch decomposition is a binary tree in which each column of $K$ appears at exactly one leaf. Cutting an edge in this tree effectively partitions the columns into two sets $K_1$ and $K_2$. The branchwidth of $K$ is

$$\min_{\text{decompositions}} \; \max_{\text{cuts}} \; \bigl( \operatorname{rank}(K_1) + \operatorname{rank}(K_2) - \operatorname{rank}(K) + 1 \bigr)$$
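The quantity inside the min and max is easy to compute for any particular cut; computing the branchwidth itself requires minimizing over all decompositions, which we do not attempt here. A small sketch (ours, using NumPy's numerical rank):

    import numpy as np

    def cut_width(K, left_cols, right_cols):
        """Width of one cut of a branch decomposition: the columns of K are split
        into K1 (left_cols) and K2 (right_cols), and the width of the cut is
        rank(K1) + rank(K2) - rank(K) + 1."""
        K1 = K[:, left_cols]
        K2 = K[:, right_cols]
        return (np.linalg.matrix_rank(K1)
                + np.linalg.matrix_rank(K2)
                - np.linalg.matrix_rank(K) + 1)

    # One particular cut of the columns of a 4x4 identity matrix:
    K = np.eye(4)
    print(cut_width(K, [0, 1], [2, 3]))   # 2 + 2 - 4 + 1 = 1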

$K$ has $O(M + D)$ columns and $O(M)$ rows. But since $M = \Omega(S) = \Omega(D)$, we may assume that there are as many rows as columns. In this case, the linear independence of $X$'s columns will ensure that $K$ has constant branchwidth. (The columns corresponding to the auxiliary decision variables add at most a constant factor.) Given this condition, we may use an off-the-shelf algorithm.

Proposition 4 (from [3])  An integer linear program can be solved in pseudopolynomial time if its decision variables take polynomially many values and its constraint matrix has constant branchwidth.

2 Related work

Regularization was introduced as a numerical computing technique, so its computational benefits have been deeply studied. Smoothed analysis demonstrates how small perturbations can reduce computational complexity [14], and it has been applied to learning problems [6, 8]. Random perturbations are a popular internal device within algorithms [15]; in learning, their use dates back to Hannan [5]. Properties such as strong convexity confer both computational and statistical benefits [19, 10]. Computational learning theory mostly focuses on solving learning problems in polynomial time; however, regularization is not often used to draw such computational distinctions, perhaps because data is often considered free. Recent work has underscored rich distinctions within the confines of polynomial time and limited data [1, 13, 12]. To the best of our knowledge, no previous work identifies the opportunity to regularize a statistically optimal, NP-hard empirical problem, and thereby derive an optimal algorithm for the original problem. We believe this technique could be useful for other learning problems, especially as a complement to algorithms tuned for speed.

References

[1] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 161–168. NIPS Foundation (http://books.nips.cc), 2008.
[2] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. ArXiv Mathematics e-prints, March 2005.
[3] William Cunningham and Jim Geelen. On integer programming and the branch-width of the constraint matrix. In Matteo Fischetti and David Williamson, editors, Integer Programming and Combinatorial Optimization, volume 4513 of Lecture Notes in Computer Science, pages 158–166. Springer Berlin / Heidelberg, 2007.
[4] M. Fazel, H. Hindi, and S. P. Boyd. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In Proceedings of the 2003 American Control Conference, volume 3, pages 2156–2162, June 2003.
[5] J. Hannan. Approximation to Bayes risk in repeated plays. In M. Dresher, A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3. Princeton University Press, 1957.
[6] Adam Tauman Kalai, Alex Samorodnitsky, and Shang-Hua Teng. Learning and smoothed analysis. In Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS '09, pages 395–404, Washington, DC, USA, 2009. IEEE Computer Society.
[7] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM J. Comput., 24:227–234, April 1995.
[8] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Stochastic and constrained adversaries. ArXiv e-prints, April 2011.
[9] Heiko Röglin and Berthold Vöcking. Smoothed analysis of integer programming. Math. Program., 110(1):21–56, 2007.
[10] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670, 2010.
[11] Shai Shalev-Shwartz, Ohad Shamir, and Karthik Sridharan. Learning kernel-based halfspaces with the zero-one loss. CoRR, abs/1005.3681, 2010.
[12] Shai Shalev-Shwartz, Ohad Shamir, and Eran Tromer. Using more data to speed-up training time. CoRR, abs/1106.1216, 2011.
[13] Shai Shalev-Shwartz and Nathan Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 928–935, New York, NY, USA, 2008. ACM.
[14] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis: An attempt to explain the behavior of algorithms in practice. Commun. ACM, 52(10):76–84, 2009.
[15] S. H. Teng. Algorithm design and analysis with perturbations. In Fourth International Congress of Chinese Mathematicians, 2007.
[16] J. A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, October 2004.
[17] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity. ArXiv Mathematics e-prints, May 2006.
[18] M. J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Transactions on Information Theory, 55(12):5728–5741, December 2009.
[19] Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

