Neurocomputing 115 (2013) 178–185


Letters

A fast convex conjugated algorithm for sparse recovery

Ran He (a), Xiaotong Yuan (b), Wei-Shi Zheng (c)

(a) National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
(b) Department of Statistics, Rutgers University, NJ 08816, USA
(c) School of Information Science and Technology, Sun Yat-sen University, Guangzhou 510275, China

Article history: Received 6 August 2012; received in revised form 14 October 2012; accepted 27 December 2012; available online 9 February 2013. Communicated by Xiaoqin Zhang.

Abstract

Sparse recovery aims to find the sparsest solution of an underdetermined system Xβ = y. This paper studies simple yet efficient sparse recovery algorithms from a novel viewpoint of convex conjugacy. To this end, we induce a family of convex conjugated loss functions as smooth approximations of the ℓ0-norm. We then apply the additive form of half-quadratic (HQ) optimization to these loss functions and reformulate the sparse recovery problem as an augmented quadratic constraint problem that can be efficiently solved by alternate minimization. At each iteration, we compute the auxiliary vector of HQ via the minimizer function and then project this vector into the nullspace of the homogeneous linear system Xβ = 0, so that a feasible and sparser solution is obtained. Extensive experiments on random sparse signals and robust face recognition corroborate our claims and validate that our method outperforms state-of-the-art ℓ1 minimization algorithms in terms of computational cost and estimation error. © 2013 Elsevier B.V. All rights reserved.

Keywords: Sparse representation; Half-quadratic minimization; ℓ1 minimization

1. Introduction

Recovering a sparse signal has attracted growing interest in machine learning, computer vision, signal/image processing and statistical inference. Sparse recovery methods try to find the sparsest solution of an underdetermined linear system Xβ = y [1,2]. The canonical form of this problem is

    min_β ‖β‖_0   s.t.   Xβ = y,                                                (1)

where X ∈ R^{d×n} is a training matrix whose columns form a redundant basis (i.e., d ≪ n), y ∈ R^d is an input signal/image vector, β ∈ R^n is an unknown vector to be estimated, and ‖·‖_0 denotes the ℓ0-norm, i.e., the number of nonzero entries of β. Since the ℓ0 minimization problem in (1) is NP-hard and thus intractable, many approximate loss functions have been proposed. In [3,4], the ℓp-norm is used as an approximation of the ℓ0-norm. In [5], the correntropy induced metric is further used as an ℓ0-norm approximator. In compressed sensing, the most commonly used approximator of the ℓ0-norm is the ℓ1-norm.

In recent years, a variety of fast algorithms have been developed to solve the ℓ1 minimization problem: basis pursuit [6], orthogonal matching pursuit [7], least angle regression [8], gradient projection [9,10], homotopy [11–13], iterative shrinkage–thresholding [14,15], proximal gradient [16–18], alternating direction [19] and Bregman iterative methods [20,21]. In addition, convex conjugacy (or the Legendre–Fenchel transform [22]) is used in [23] to find a sparse solution, which is actually an additive form of half-quadratic (HQ) optimization [24]. The authors in [25,26] then make use of HQ to learn a nonnegative sparse representation and a robust sparse representation respectively. Negahban et al. [27] proposed a unified M-estimation framework for sparsity estimation. A complete review of optimization algorithms for sparse recovery and compressed sensing is beyond the scope of this paper; we refer interested readers to [20,28,29] and the references therein. Although these methods indeed improve the computational efficiency of ℓ1 minimization and run very fast on small datasets, they are still computationally expensive for large-scale applications [28]. In addition, their performance appears to be sensitive to parameters, which need careful tuning for different applications.

To attack the above problems, this paper combines research in convex optimization [22], robust statistics [30] and compressed sensing [29] to develop simple yet efficient algorithms for sparse recovery. First, based on the theory of convex conjugacy, we induce a family of loss functions as approximations of the ℓ0-norm. We establish some important properties of these functions and discuss the relationship between their minimizer functions and the soft-thresholding function [28]. Second, we propose a general optimization method to solve these loss functions for the sparse recovery problem. We reformulate this sparse problem as an augmented quadratic constraint problem and use alternate minimization to solve it. At each iteration, we compute the auxiliary vector of HQ via the minimizer function and then project this vector into the nullspace of the homogeneous linear system Xβ = 0, so that we obtain a new feasible solution of Xβ = y that is sparser than the one in the previous iteration. Both theoretical analysis and experimental results on middle-scale datasets demonstrate that the proposed method has a linear complexity O(tdn), where t is the number of iterations, and show that a descending direction from the nullspace may be useful and powerful for reducing sparsity. To the best of our knowledge, this time complexity is smaller than those of the fast algorithms reported in [28]. Extensive experiments on random sparse signals and robust face recognition corroborate our claims and validate that our method performs better than state-of-the-art ℓ1 minimization algorithms in terms of computational cost.

The paper is organized as follows. In Section 2 we first induce a family of loss functions for sparse recovery and then propose a fast O(tdn) algorithm via HQ optimization. In Section 3, we provide numerical results on random sparse signals and robust face recognition, showing the advantage of our methods. Finally, we draw conclusions and discuss future work in Section 4.


2. Fast sparse recovery algorithm

In this section, we address a general optimization problem for sparse recovery. Substituting the ℓ0-norm in (1) with a loss function from the family in Table 1, we have

    min_β Σ_{i=1}^{n} φ(β_i)   s.t.   Xβ = y.                                   (2)

According to convex conjugacy (Lemma 1 in Section 2.1), we can reformulate (2) as its augmented HQ problem

    min_{β,p} Σ_{i=1}^{n} {(β_i − p_i)² + ψ(p_i)}   s.t.   Xβ = y,              (3)

where ψ(·) is a dual potential function determined by the theory of convex conjugacy [22] and p_i is determined by the minimizer function δ(·). In the following two subsections, we first discuss the properties of this family of loss functions and establish a connection between sparse recovery and robust statistics, and then propose a fast yet simple algorithm to solve (2).

2.1. A family of loss functions

In compressed sensing [1], both half-quadratic optimization [24] and the Moreau proximal operator (MPO) [32] are widely used; both are based on convex conjugacy (or the Legendre–Fenchel transform [22]). Chartrand [23] resorts to the additive form of HQ optimization for sparse recovery. The authors in [25,26] then make use of HQ to learn a robust sparse representation. Based on the MPO, one obtains the soft-thresholding function of ℓ1 regularization [32]. Inspired by this convex conjugacy observation, we study a family of convex conjugated loss functions and their minimizer functions for sparse recovery. Table 1 tabulates four potential loss functions, among which the L2L1 and correntropy loss functions have been used as objectives in compressed sensing and optimized by gradient based methods. We denote φ(·) of the Huber, L2L1, correntropy and log loss functions by φ_H(·), φ_21(·), φ_c(·) and φ_L(·) respectively. All four functions can be optimized by the additive form of HQ optimization based on the following lemma.

Lemma 1. If φ(·) is a differentiable function that satisfies the five conditions in [24] (i.e., [24, Eq. (59)]), then there exists a dual potential function ψ(·) (also called the conjugate function in [22]) such that

    φ(x) = min_q {(q − x)² + ψ(q)},                                              (4)

where q is an auxiliary variable uniquely determined by the minimizer function δ(·) associated with φ(·) [24].

Lemma 1 shows that we can cast a loss function φ(·) into a half-quadratic formulation and obtain a resultant (augmented) cost function,

    min_x φ(x) = min_{x,q} {(q − x)² + ψ(q)}.                                    (5)

Since q is uniquely determined by the minimizer function δ(·), we do not need the analytic form of ψ(·) once q is fixed. Table 1 also gives the minimizer functions of the four loss functions. Based on the properties of the additive form of HQ optimization, it is easy to prove that the minimizer function in HQ also satisfies the conditions of the MPO [32]. It is interesting to highlight that when φ(·) is the Huber loss function, the dual potential function ψ(·) of φ(·) is the absolute value function |·| [24]. Given a vector β, summing over its entries we have

    Σ_{i=1}^{n} φ_H(β_i) = min_p {‖β − p‖_2² + ε‖p‖_1},                          (6)

where p ∈ R^n is an auxiliary vector and ‖p‖_1 ≜ Σ_i |p_i| is the ℓ1-norm. Each p_i is uniquely determined by the minimizer function δ_H(·). The right-hand side of (6) is widely used in compressed sensing, where the minimizer function δ_H(·) is also known as the MPO or soft-thresholding function. The remaining three loss functions behave similarly to the soft-thresholding function and hence can also be used as approximations of the ℓ0-norm for sparse recovery.

It is also interesting to note that the four loss functions belong to the so-called robust M-estimators in robust statistics [30,31,33–35]. M-estimators are used to detect outliers that differ significantly from uncorrupted data. In sparse recovery, the nonzero coefficients differ significantly from the zero coefficients, and hence the sparse nonzero coefficients can be treated and detected as outliers. This analysis indicates that optimization techniques from sparse recovery can potentially be introduced to solve robust M-estimator based objectives in robust statistics; on the other hand, optimization techniques from robust statistics can also be introduced to solve sparse recovery problems.
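To make the half-quadratic identity in (6) concrete, the short NumPy sketch below (an illustrative snippet, not the authors' code) checks numerically that soft-thresholding realizes the inner minimization over p. Depending on how the Huber loss is scaled, the ℓ1 weight appears as ε or 2ε; the snippet writes the penalty as 2ε|p| so that its closed-form minimizer is exactly δ_H from Table 1.

```python
import numpy as np

def soft_threshold(x, eps):
    """Minimizer function delta_H: argmin_p (x - p)**2 + 2*eps*|p|."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

eps = 0.5
p_grid = np.linspace(-3.0, 3.0, 20001)            # dense grid for brute-force minimization
for x in np.linspace(-2.0, 2.0, 9):
    vals = (x - p_grid) ** 2 + 2.0 * eps * np.abs(p_grid)
    p_star = p_grid[np.argmin(vals)]              # numerical minimizer of the augmented cost
    assert abs(p_star - soft_threshold(x, eps)) < 1e-3
```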

Table 1
Loss functions and their minimizer functions in HQ optimization. We denote φ(·) of the Huber, L2L1, correntropy and log loss functions by φ_H(·), φ_21(·), φ_c(·) and φ_L(·) respectively. Note that the dual potential function ψ(·) of φ_H(·) is |·|; δ_H(·) is just the soft-thresholding function of compressed sensing [28]. In robust statistics [30,31,25], the four loss functions also belong to the so-called robust M-estimators.

HQ function           φ(x)                                          δ(x)
Huber [24,23]         x²/2 if |x| ≤ ε;  ε|x| − ε²/2 if |x| > ε      0 if |x| ≤ ε;  x − ε·sign(x) if |x| > ε
L2L1 [24,25]          √(c + x²/ε)                                   (1 − 1/√(c + x²/ε)) x
Correntropy [5,25]    1 − exp(−x²/ε)                                (1 − exp(−x²/ε)) x
Log [24,25]           |x| − ε log(1 + |x|/ε)                        (1 − 1/(c + |x|/ε)) x
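The minimizer functions δ(·) in Table 1 are simple elementwise maps, so they are cheap to evaluate inside an iterative solver. The sketch below is illustrative Python (the constant c is left as a parameter because its value is not fixed by the table as reconstructed here) and implements all four.

```python
import numpy as np

def delta_huber(x, eps):
    """Soft-thresholding: 0 for |x| <= eps, x - eps*sign(x) otherwise."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def delta_l2l1(x, eps, c=1.0):
    """Minimizer of the L2L1 loss sqrt(c + x**2/eps)."""
    return (1.0 - 1.0 / np.sqrt(c + x ** 2 / eps)) * x

def delta_correntropy(x, eps):
    """Minimizer of the correntropy loss 1 - exp(-x**2/eps)."""
    return (1.0 - np.exp(-x ** 2 / eps)) * x

def delta_log(x, eps, c=1.0):
    """Minimizer of the log loss |x| - eps*log(1 + |x|/eps)."""
    return (1.0 - 1.0 / (c + np.abs(x) / eps)) * x
```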


According to the above connection between sparse recovery and robust statistics, we can put a sparse linear representation β and a noise term e ∈ R^d together and optimize them as a single vector α ≜ [β; e]. Note, however, that no sparsity assumption is necessarily placed on e. This analysis gives a complementary explanation of ℓ1 minimization based on dense error correction [36] from the viewpoint of robust M-estimators.

2.2. Fast algorithm via HQ optimization

In this subsection, we develop a fast algorithm to solve (2). To solve the constrained optimization problem in (3), we introduce Lagrange multipliers λ and obtain the Lagrangian function

    L(β, p) ≜ (β − p)ᵀ(β − p) − λᵀ(Xβ − y) + Σ_{i=1}^{n} ψ(p_i).                  (7)

Setting the derivative of L(β, p) with respect to β to zero gives

    ∂L(β, p)/∂β = 2(β − p) − Xᵀλ = 0.                                            (8)

Multiplying both sides of (8) by X and using the equality constraint Xβ = y, we have

    2X(β − p) − XXᵀλ = 0  ⇒  2y − 2Xp − XXᵀλ = 0
                          ⇒  λ = 2(XXᵀ)⁻¹y − 2(XXᵀ)⁻¹Xp.                          (9)

Substituting (9) back into (8), we obtain the analytic solution of (3):

    β* = Xᵀ(XXᵀ)⁻¹y + p − Xᵀ(XXᵀ)⁻¹Xp.                                            (10)

Since d is often much smaller than n in an underdetermined system, one can assume that XXᵀ is invertible. Let G ≜ Xᵀ(XXᵀ)⁻¹ ∈ R^{n×d} and ŷ ≜ Gy; then (10) can be written as

    β* = ŷ + p − G(Xp).                                                           (11)

Based on the additive form of HQ optimization, we then obtain the following alternate minimization to solve (2):

    p_i^t = δ_H(β_i^{t−1}),                                                       (12)
    β^t = ŷ + p^t − G(Xp^t),                                                      (13)

where t is the iteration number. Here we further introduce the variable β_0^t ≜ p^t − G(Xp^t). Since Xβ_0^t = Xp^t − XG(Xp^t) = Xp^t − Xp^t = 0, we have β_0^t ∈ N(X), where N(X) denotes the nullspace of the matrix X. Furthermore, it is easy to prove that for every t, β^t is a feasible solution of the underdetermined linear system Xβ = y. From these two observations, we learn that at each iteration the alternate minimization in (12) and (13) projects the vector p^t into the nullspace of X and uses the projected vector to construct a feasible solution.

In (13), we obtain a feasible solution β^t via the auxiliary vector p^t. By the properties of the minimizer function δ(·), small entries of β^{t−1} are penalized, so p^t tends to be sparser than β^{t−1}. Considering the objective function in (3), we find that β^t is actually the feasible solution of Xβ = y that is closest to p^t.

The major computational cost of (12) and (13) is the multiplication G(Xp^t), whose cost is 2dn and which grows as n increases. Fortunately, p^t is computed by the minimizer function; if δ_H(·), which is a truncating function, is used, there will be a large number of zero entries in p^t in most iterations, which helps to further reduce the computational cost. Hence we treat the zero and nonzero entries separately and introduce two index sets F and Z. Let F and Z be two subsets of {1, ..., n} such that F ∪ Z = {1, ..., n} and F ∩ Z = ∅, and let F and Z index the nonzero and zero entries of p respectively. Then we have the following partition of p and column partition of X:

    p = [p_F; p_Z]   and   X = [X_F  X_Z].                                        (14)

Let g^t ≜ Xp^t; since p_Z^t = 0, we have

    g^t = X_F p_F^t.                                                              (15)

Algorithm 1. Fast sparse recovery algorithm via HQ.

Input: X ∈ R^{d×n}, G = Xᵀ(XXᵀ)⁻¹ and y ∈ R^d.
Output: β ∈ R^n.
1: ŷ ← Gy, p ← 0, ε ← 0.1, γ ← 0.99, t ← 1.
2: repeat
3:   Least squares step: compute g^t = X_F p_F^t; compute β^t = ŷ + p^t − G g^t.
4:   HQ step: compute p_i^t = δ_H(β_i^t).
5:   Parameter step: compute ε ← γε.
6: until converged
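For concreteness, here is a compact NumPy sketch of Algorithm 1. It is an illustrative reimplementation rather than the authors' released code: the stopping test (change in β between iterations), its tolerance and the maximum iteration count are assumptions, while the initialization ε = 0.1, the decay γ = 0.99, the feasibility step (13) and the soft-thresholding minimizer δ_H follow the description above.

```python
import numpy as np

def delta_huber(x, eps):
    # Soft-thresholding minimizer of the Huber loss (Table 1).
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def hq_sparse_recovery(X, y, eps=0.1, gamma=0.99, max_iter=500, tol=1e-6):
    """Alternate minimization (12)-(13): HQ step, nullspace projection, epsilon decay."""
    d, n = X.shape
    G = X.T @ np.linalg.inv(X @ X.T)       # G = X^T (X X^T)^{-1}, computed once per dictionary
    y_hat = G @ y                          # minimum-norm feasible solution, y_hat = Gy
    beta = y_hat.copy()
    for _ in range(max_iter):
        p = delta_huber(beta, eps)         # HQ step (12): shrink small entries of beta
        F = np.flatnonzero(p)              # nonzero set F; exploit the sparsity of p, eq. (15)
        g = X[:, F] @ p[F] if F.size else np.zeros(d)
        beta_new = y_hat + p - G @ g       # least squares step (13): X @ beta_new == y
        eps *= gamma                       # parameter step: decay the threshold
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```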

Algorithm 1 summarizes the optimization procedure of our alternate minimization. Inspired by iterative shrinkage–thresholding methods [14], we also apply a descending parameter γ to reduce ε in the minimizer function δ(·). Since the data matrix X is given, we can pre-compute and store the matrix G. (Note that GX ∈ R^{n×n} is an n × n matrix; when n is large, the memory required for GX is prohibitive, so we do not pre-compute the matrix product of G and X.) The computational cost of Algorithm 1 mainly involves three steps. The first step is the computation of ŷ, whose cost is nd. The second step is the computation of g^t, whose cost is nd + (rn)d, where r is the average sparsity ratio. The third step involves vector additions and the computation of p^t, whose cost is 4n. Hence the computational cost of Algorithm 1 is nd + t(4n + nd + (rn)d). When the number of samples becomes large, rn ≪ n, and hence the time complexity is O(tdn). Note that since it is unnecessary to compute G = Xᵀ(XXᵀ)⁻¹ for each input y, the computational complexity of Algorithm 1 does not include the computation of G. The complexity can be further reduced by a parallel strategy, i.e., by resorting to parallel algorithms for the matrix-vector multiplication Gv (v = y or v = g^t). Given a parallel degree ν ≪ n, we obtain a parallel algorithm with complexity O(tν). In addition, the experimental results and analysis in [24] show that HQ based methods are substantially faster than gradient based ones; a systematic analysis of the convergence rate of HQ optimization is given in [24].
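Since G = Xᵀ(XXᵀ)⁻¹ is formed once per dictionary X and reused for every input y, it is worth computing it with a factorization rather than an explicit inverse. The snippet below is a small design suggestion along those lines (not taken from the paper): it obtains G through a Cholesky solve of XXᵀ, which is numerically preferable when XXᵀ is well conditioned.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def precompute_G(X):
    """Return G = X^T (X X^T)^{-1} without explicitly inverting X X^T."""
    K = X @ X.T                       # d x d Gram matrix, assumed invertible (d << n)
    c, low = cho_factor(K)
    Z = cho_solve((c, low), X)        # solves K Z = X, i.e. Z = (X X^T)^{-1} X
    return Z.T                        # G = Z^T because K is symmetric
```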

3. Experiments

In this section, we systematically and quantitatively compare our method with several state-of-the-art ℓ1 minimization algorithms, on random sparse signals and on robust face recognition. All algorithms were run on an Intel(R) Core(TM) 2.93 GHz Windows XP personal computer with 4 GB memory.

3.1. Algorithmic setting


Fig. 1. Comparison of the nine algorithms (OMP, BP, HOMO, PFP, L1LS, SpaRSA, FISTA, ADM, HQ) for a fixed dimension d = 500 and sparsity ratio r = 0.1, with varying n: (a) average run time as a function of n; (b) average estimation error as a function of n; (c) average sparsity as a function of n. The dashed line indicates the groundtruth sparsity.

Considering that [28] provides a good and general software platform (http://www.eecs.berkeley.edu/yang/software/l1benchmark/) for evaluating different sparse recovery methods, we make use of this platform to compare our method with eight other state-of-the-art methods: orthogonal matching pursuit (OMP) [7], basis pursuit (BP) [6], Homotopy (HOMO) [13,37], polytope faces pursuit (PFP) [38], large-scale ℓ1-regularized least squares (L1LS) [10], sparse reconstruction by separable approximation (SpaRSA) [39], the fast iterative shrinkage–thresholding algorithm (FISTA) [17] and the alternating direction method (ADM) [19]. We denote our Algorithm 1 as HQ. The parameter γ in Algorithm 1 is fixed to 0.99.

3.2. Performance on random sparse signals

3.2.1. Sparse recovery

In order to evaluate the nine algorithms fairly, we use a given groundtruth sparse signal x0 as in [28]: an algorithm exits once the ℓ2-norm difference between x^t and x0 falls below a threshold. Moreover, we set the maximal number of iterations for all algorithms to 500. We generate a random matrix X and vector y with the platform in [28] (a rough synthetic stand-in for this setup is sketched after the observations below). The dimension d = 500 and the sparsity ratio r = 0.1 are fixed, and the number of training samples n is varied from 5000 to 20,000. All simulations are averaged over 20 runs. Fig. 1 displays the performance of the different methods in terms of average running time, sparsity and estimation error; here, sparsity is calculated with the truncation method of [25]. From these results, we can draw the following observations:

(1) As reported in [28], OMP, HOMO and SpaRSA run faster than BP, PFP, ADM, FISTA and L1LS when the dimension d is 500. When n is small, all algorithms have similar average run times; however, as n grows, BP, PFP, ADM, FISTA and L1LS take longer to exit. When n is larger than 5000, our HQ method runs faster than its eight competitors.

(2) All methods fail to find the groundtruth x0. Since we use the threshold as an exit condition, most of the compared methods obtain similar estimation errors; our HQ method achieves the smallest. Among the other methods, although OMP has a smaller run time, its estimation error is significantly larger than those of the others.

(3) Although most of the methods obtain similar estimation errors, their sparsities differ. This is because different methods use different optimization strategies; as a result, some methods produce many small nonzero entries and only a limited number of large coefficients in the learned vector, leading to a large sparsity count. Fig. 2 gives examples of these scenarios.
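As a rough stand-in for the benchmark generator of [28] (whose exact routine is not reproduced here), the following snippet builds a synthetic instance with d = 500 and sparsity ratio r = 0.1; the number of nonzeros is taken relative to d, a common convention in such benchmarks, though the original platform may differ.

```python
import numpy as np

def make_sparse_problem(d=500, n=5000, r=0.1, seed=0):
    """Random underdetermined system X @ beta0 = y with a sparse groundtruth beta0."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((d, n)) / np.sqrt(d)    # random Gaussian dictionary
    k = int(r * d)                                   # number of nonzero groundtruth entries
    beta0 = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    beta0[support] = rng.standard_normal(k)
    y = X @ beta0
    return X, y, beta0
```

With the solver sketched in Section 2.2, `X, y, beta0 = make_sparse_problem(n=5000)` followed by `beta = hq_sparse_recovery(X, y)` yields the estimation error as `np.linalg.norm(beta - beta0)`, the quantity reported in Fig. 1(b).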

3.2.2. Different loss functions

In this subsection, we evaluate the proposed family of loss functions for sparse recovery. An algorithm exits when its sparsity variance falls below a threshold. In the first experiment, simulations explore the sparse representations learned by different methods, where d and n are fixed to 500 and 1000 respectively. The coefficients learned by the different sparse algorithms are plotted in Fig. 2, and the groundtruth x0 is plotted in Fig. 2(f). We find that only the coefficients of HQ-correntropy are very close to x0 (the estimation error is 1e−6). Although the other sparse methods also find a sparse representation, that representation may not coincide with x0. We consider this phenomenon a coincidence of this particular simulation: since the system Xβ = y is underdetermined, the groundtruth x0 may not be the sparsest solution. As reported in [5,40], correntropy based nonconvex objectives seem to outperform convex ones.
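The comparison in this subsection only requires swapping the minimizer function inside the HQ iteration; nothing else in the solver changes. The variant below is illustrative (reusing the δ(·) sketches given after Table 1) and parameterizes the earlier solver by δ, so that HQ-Huber, HQ-L2L1, HQ-Correntropy and HQ-Log differ only in which function is passed in.

```python
import numpy as np

def hq_sparse_recovery_with(delta, X, y, eps=0.1, gamma=0.99, max_iter=500, tol=1e-6):
    """HQ sparse recovery with a user-supplied minimizer function delta(x, eps)."""
    G = X.T @ np.linalg.inv(X @ X.T)
    y_hat = G @ y
    beta = y_hat.copy()
    for _ in range(max_iter):
        p = delta(beta, eps)                     # HQ step with the chosen loss
        beta_new = y_hat + p - G @ (X @ p)       # feasibility projection, eq. (13)
        eps *= gamma
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

# e.g. beta_c = hq_sparse_recovery_with(delta_correntropy, X, y)
```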


Fig. 2. Sparse representations learned by different methods. The estimation error of HQ-Correntropy is 1e−6: (a) Homotopy; (b) ADM; (c) BP; (d) FISTA; (e) SpaRSA; (f) Groundtruth; (g) HQ-Huber; (h) HQ-Correntropy; (i) HQ-L2L1; and (j) HQ-Log.

Fig. 3. Performance of different loss functions (Huber, Correntropy, L2L1, Log) as a function of the number of training samples. Left: average sparsity; right: average estimation error.

In the second experiment, simulations demonstrate the performance of the different smooth loss functions as n is varied from 1000 to 20,000; all simulations are averaged over 20 runs. Fig. 3 shows the results. We observe that when n is 1000, both the correntropy and L2L1 based algorithms achieve small estimation errors, which means they almost recover the groundtruth x0. Although HQ-Log yields a large sparsity, all methods have similar estimation errors when n becomes large. These results validate that all potential functions in Table 1 can be used in sparse recovery to find a sparse representation.

3.2.3. Parameter setting

In sparse recovery, it is tedious to determine appropriate parameters (such as the regularizer and the gradient step length), which often significantly affect the sparsity and convergence of an algorithm. One merit of our HQ algorithm is that it has only one parameter γ (see Algorithm 1). In this subsection, we systematically discuss the effect of γ: d and n are fixed to 500 and 1000 respectively, γ is varied from 0.9 to 0.99, the Huber loss function is used, and all experiments are averaged over 20 runs. Fig. 4 displays our method's performance as a function of γ in terms of iterations and estimation error. We observe that the number of iterations and the average run time increase as γ increases.

However, sparsity and estimation error decrease as γ increases. A large value of γ corresponds to a low learning rate: when γ is large, our HQ method searches more thoroughly for a descending and feasible solution from the nullspace, and hence needs more iterations to find a sparser solution with a small estimation error. Since the objective function in (2) is convex when the Huber loss is used, there is a global solution of (2) when ε (or γ) is fixed. However, if we use a descending ε, the learning rate γ affects the optimal solution, as shown in Fig. 4. For a given γ, our HQ method always finds a sparse solution due to its alternate minimization, but this solution may not be the sparsest one. Considering estimation error, we recommend γ = 0.99.
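A sweep like the one summarized in Fig. 4 can be reproduced in a few lines with the helper sketches above (make_sparse_problem and hq_sparse_recovery); the exit criteria of the original experiments differ, so this is only illustrative.

```python
import numpy as np

X, y, beta0 = make_sparse_problem(d=500, n=1000, r=0.1)
for gamma in (0.90, 0.93, 0.96, 0.99):
    beta = hq_sparse_recovery(X, y, gamma=gamma)
    err = np.linalg.norm(beta - beta0) / np.linalg.norm(beta0)
    nnz = int(np.sum(np.abs(beta) > 1e-3))       # crude sparsity count via truncation
    print(f"gamma={gamma:.2f}  error={err:.3f}  nonzeros={nnz}")
```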

3.3. Performance on robust face recognition

Robust sparse recovery has drawn much attention in computer vision and signal processing [41]. It has shown its superiority in dealing with large occlusion and corruption in many applications, such as robust face recognition [42,25,43]. In robust sparse recovery, one often solves the following ℓ1 minimization problem [44]:

    min_α ‖α‖_1   s.t.   Aα = y,                                                  (16)

where α = [β; e] and A ≜ [X I]; here e ∈ R^d is a term that models noise and I is an identity matrix.

Fig. 4. Performance as a function of the parameter γ in Algorithm 1: (a) iterations; (b) run time; (c) sparsity; and (d) estimation error.

Fig. 5. Recognition accuracy on the AR database (recognition rate versus feature dimension for HQ-Correntropy, HQ-L2L1, HQ-Huber, SpaRSA, ADM, HOMO, FISTA and OMP).

We can easily extend (16) to a general problem based on the proposed loss functions in Table 1 and use Algorithm 1 to learn a robust sparse representation. In this subsection, we apply our HQ methods to robust face recognition and perform experiments on the AR database [45] as in [42,25]. For training, we select 952 facial images (about 8 per person) of unoccluded frontal views without illumination variation. For testing, we use images of persons wearing sunglasses, which occlude roughly 20% of the face. A sparse algorithm exits when its sparsity variance falls below a threshold; in addition, we set the maximal number of iterations for all algorithms to 500. Considering computational costs, we omit the L1LS, BP and PFP methods from the comparison, as in [28]. Note that the purpose of this experiment is to compare different sparse recovery methods rather than to achieve the highest accuracy on the AR database.
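To apply the HQ solver to (16), the dictionary is simply augmented with an identity block so that the recovered vector splits into a sparse code β and a dense error term e. The sketch below is illustrative (it reuses hq_sparse_recovery from the Section 2.2 sketch, and `X_train` is a placeholder for the matrix of training images):

```python
import numpy as np

def robust_code(X_train, y_img):
    """Solve (16): alpha = [beta; e] with the augmented dictionary A = [X  I]."""
    d, n = X_train.shape
    A = np.hstack([X_train, np.eye(d)])     # A = [X I], so A @ alpha = X @ beta + e
    alpha = hq_sparse_recovery(A, y_img)    # A A^T = X X^T + I is always invertible
    beta, e = alpha[:n], alpha[n:]          # sparse coefficients and dense error
    return beta, e
```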

Fig. 5 plots the recognition rates of the compared methods using downsampled images of dimensions 161 and 644, which correspond to downsampling ratios of 1/8 and 1/4 respectively. Since all methods try to solve the same optimization problem, they are equal in theory and would find the same solution if their parameters were well tuned. However, as reported in [28], there is a large difference between the recognition rates when there is occlusion or corruption. Among the eight fast methods compared, our HQ-correntropy method achieves the highest recognition rates and the OMP method obtains the lowest, the former being nearly 50% higher. When the feature dimension is 161, the recognition rates of ADM, HOMO, SpaRSA and HQ-Huber are very close, which may be because they all use the soft-thresholding function (i.e., the minimizer function of the Huber loss). When the feature dimension is 644, the methods can be ordered in ascending recognition rate as OMP, FISTA, HOMO, ADM, SpaRSA, HQ-Huber, HQ-L2L1 and HQ-correntropy. From this face recognition experiment, we find that although all compared methods are equal in theory, they may obtain different recognition rates. How to determine appropriate parameters for soft-thresholding based sparse methods is still an open problem and beyond the scope of this work.

4. Conclusion and future work

This paper has studied the sparse recovery problem from the theory of convex conjugacy and induced a family of smooth loss functions as approximations of the ℓ0-norm. Theoretical analysis has shown that the loss functions used in robust statistics and in sparse recovery share similar properties. To solve these convex conjugated objectives, we have introduced the additive form of HQ optimization, reformulated the sparse problem as an augmented quadratic problem, and proposed a simple yet efficient alternate minimization algorithm. At each iteration, we compute the auxiliary vector of HQ and project it into the nullspace; we then obtain a feasible and sparser solution based on the projected vector. Experimental results on random sparse signals and robust face recognition demonstrate that our method has linear complexity and outperforms state-of-the-art ℓ1 minimization algorithms in terms of computational cost and estimation error. Both theoretical analysis and experimental results show that a descending direction from the nullspace of the homogeneous linear system Xβ = 0 may be useful to reduce sparsity. One line of future work is to introduce this observation into gradient based sparse algorithms. Another is to apply the proposed family of loss functions and HQ optimization to feature selection, group sparsity, and low-rank matrix recovery.

Acknowledgments

We would like to thank the associate editor and the reviewers for their valuable comments and advice. This work was supported in part by the Research Foundation for the Doctoral Program of the Ministry of Education of China (#20100041120009), the Natural Science Foundation of China (#61103155, #61102111) and the Strategic Priority Research Program of the Chinese Academy of Sciences (#XDA06030300).

References

[1] D. Donoho, Compressed sensing, IEEE Trans. Inf. Theory 52 (4) (2006) 1289–1306.
[2] E.J. Candes, T. Tao, Decoding by linear programming, IEEE Trans. Inf. Theory 51 (12) (2005) 4203–4215.
[3] R. Chartrand, Exact reconstruction of sparse signals via nonconvex minimization, IEEE Signal Process. Lett. 14 (10) (2007) 707–710.
[4] I. Daubechies, R. Devore, M. Fornasier, C.S. Gunturk, Iteratively reweighted least squares minimization for sparse recovery, Commun. Pure Appl. Math. 63 (1) (2010) 1–38.
[5] S. Seth, J.C. Principe, Compressed signal reconstruction using the correntropy induced metric, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 3845–3848.
[6] S. Chen, D. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput. 20 (1998) 33–61.
[7] G. Davis, S. Mallat, M. Avellaneda, Adaptive greedy approximations, J. Constr. Approx. 13 (1997) 57–98.
[8] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Ann. Stat. 32 (2) (2004) 407–499.
[9] M. Figueiredo, R. Nowak, S.J. Wright, Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems, IEEE J. Sel. Top. Signal Process. 1 (4) (2007) 586–597.
[10] S.J. Kim, K. Koh, M. Lustig, S. Boyd, D. Gorinevsky, An interior-point method for large-scale l1-regularized least squares, IEEE J. Sel. Top. Signal Process. 1 (4) (2007) 606–617.
[11] M. Osborne, B. Presnell, B. Turlach, A new approach to variable selection in least squares problems, IMA J. Numer. Anal. 20 (2000) 389–404.
[12] D. Malioutov, M. Cetin, A. Willsky, Homotopy continuation for sparse signal representation, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.
[13] D. Donoho, Y. Tsaig, Fast solution of l1-norm minimization problems when the solution may be sparse, IEEE Trans. Inf. Theory 54 (11) (2008) 4789–4812.
[14] I. Daubechies, M. Defrise, C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Commun. Pure Appl. Math. 57 (2006) 1413–1457.
[15] P.L. Combettes, V.R. Wajs, Signal recovery by proximal forward-backward splitting, SIAM J. Multiscale Model. Simul. 4 (5) (2005) 1168–1200.
[16] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k²), Soviet Math. Doklady 27 (2) (1983) 372–376.
[17] A. Beck, M. Teboulle, A fast iterative shrinkage–thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. 2 (1) (2009) 183–202.
[18] S. Becker, J. Bobin, E.J. Candes, NESTA: a fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci. 4 (1) (2011) 1–39.

[19] J. Yang, Y. Zhang, Alternating direction algorithms for l1-problems in compressive sensing, preprint, arXiv:0912.1185.
[20] W. Yin, S. Osher, D. Goldfarb, J. Darbon, Bregman iterative algorithms for l1-minimization with applications to compressed sensing, SIAM J. Imaging Sci. 1 (1) (2008) 143–168.
[21] M.V. Afonso, J.M. Bioucas-Dias, M.A.T. Figueiredo, Image denoising with compound regularization using a Bregman iterative algorithm, in: Conference on Telecommunications, 2009.
[22] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[23] R. Chartrand, Fast algorithms for nonconvex compressive sensing: MRI reconstruction from very few data, in: Proceedings of the IEEE International Symposium on Biomedical Imaging, 2009, pp. 262–265.
[24] M. Nikolova, M.K. Ng, Analysis of half-quadratic minimization methods for signal and image recovery, SIAM J. Sci. Comput. 27 (3) (2005) 937–966.
[25] R. He, W.-S. Zheng, B.-G. Hu, Maximum correntropy criterion for robust face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1561–1576.
[26] R. He, W.-S. Zheng, B.-G. Hu, X.-W. Kong, A regularized correntropy framework for robust pattern recognition, Neural Comput. 23 (8) (2011) 2074–2100.
[27] S. Negahban, P. Ravikumar, M. Wainwright, B. Yu, A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, CoRR abs/1010.2731.
[28] A.Y. Yang, S.S. Sastry, A. Ganesh, Y. Ma, Fast l1-minimization algorithms and an application in robust face recognition: a review, in: International Conference on Image Processing, 2010.
[29] M. Fornasier, Theoretical foundations and numerical methods for sparse recovery, Radon Ser. Comput. Appl. Math. 9 (2010) 1–121.
[30] P. Huber, Robust Statistics, Wiley, 1981.
[31] Z. Zhang, Parameter estimation techniques: a tutorial with application to conic fitting, Image Vision Comput. 15 (1) (1997) 59–76.
[32] J. Bioucas-Dias, M. Figueiredo, A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration, IEEE Trans. Image Process. 16 (12) (2007) 2992–3004.
[33] R. He, B.-G. Hu, X.-T. Yuan, Robust discriminant analysis based on nonparametric maximum entropy, in: Asian Conference on Machine Learning (ACML), Nanjing, China, 2009.
[34] R. He, B.-G. Hu, W.-S. Zheng, X.W. Kong, Robust principal component analysis based on maximum correntropy criterion, IEEE Trans. Image Process. 20 (6) (2011) 1485–1494.
[35] R. He, B.-G. Hu, X. Yuan, W.-S. Zheng, Principal component analysis based on non-parametric maximum entropy, Neurocomputing 73 (10) (2010) 1840–1852.
[36] J. Wright, Y. Ma, Dense error correction via l1-minimization, IEEE Trans. Inf. Theory 56 (7) (2010) 3540–3560.
[37] M.S. Asif, J.K. Romberg, Dynamic updating for l1 minimization, CoRR, 2009.
[38] M. Plumbley, Recovery of sparse representations by polytope faces pursuit, in: International Conference on Independent Component Analysis and Blind Source Separation, 2006, pp. 206–213.
[39] S. Wright, R. Nowak, M. Figueiredo, Sparse reconstruction by separable approximation, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2008.
[40] R. Chartrand, W. Yin, Iteratively reweighted algorithms for compressive sensing, in: IEEE Conference on Acoustics, Speech and Signal Processing, 2008, pp. 3869–3872.
[41] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T.S. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proc. IEEE 98 (6) (2010) 1031–1044.
[42] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[43] Z. Lei, R. Chu, R. He, S. Liao, S. Li, Face recognition by discriminant analysis with Gabor tensor representation, in: Advances in Biometrics, 2006, pp. 87–95.
[44] W. Li, B. Li, X. Zhang, W. Hu, H. Wang, G. Luo, Occlusion handling with l1-regularized sparse reconstruction, in: Asian Conference on Computer Vision, 2010, pp. 630–640.
[45] A.M. Martinez, R. Benavente, The AR face database, Computer Vision Center (CVC), Technical Report.

Ran He received the B.S. degree in Computer Science from the Dalian University of Technology, China, and the Ph.D. degree in Pattern Recognition and Intelligent Systems from the Institute of Automation, Chinese Academy of Sciences, China, in 2009. He is currently an associate professor with NLPR, Institute of Automation, Chinese Academy of Sciences, and serves as an associate editor of Neurocomputing (Elsevier). His research interests include information theoretic learning and computer vision.


Xiao-Tong Yuan received the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2009. He is currently a Post-Doctoral Research Fellow with the Department of Statistics and Biostatistics, Rutgers University, Newark, NJ. He has authored or co-authored over 40 technical papers on a wide range of research topics. His current research interests include machine learning, data mining, and computer vision. Dr. Yuan was the recipient of the Classification Task Prize in PASCAL VOC 2010.

Wei-Shi Zheng received his Ph.D. degree in Applied Mathematics from Sun Yat-sen University, China, in 2008. After that, he was a Postdoctoral Researcher on the European SAMURAI Research Project at Queen Mary University of London, UK. He has now joined Sun Yat-sen University as an associate professor under the 100-people program of Sun Yat-sen University. His current research interests are in object association and categorization for visual surveillance. He is also interested in feature extraction, kernel methods, transfer learning, and face image analysis.
