Thong T. Do

Trac D. Tran

The Johns Hopkins University

The Johns Hopkins University

The Johns Hopkins University

[email protected]

[email protected]

[email protected]

ABSTRACT

The low-rank matrix approximation problem involves finding a rank-$k$ version of an $m \times n$ matrix $A$, labeled $\tilde A_k$, such that $\tilde A_k$ is as "close" as possible to the best SVD approximation of $A$ at the same rank level. Previous approaches approximate $A$ by adaptively and non-uniformly sampling some columns (or rows) of $A$, hoping that this subset of columns contains enough information about $A$; the sub-matrix is then used for the approximation process. However, these approaches are often computationally intensive due to the complexity of the adaptive sampling. In this paper, we propose a fast and efficient algorithm which first pre-processes $A$ in order to spread out the information (energy) of every column (or row) of $A$, and then randomly selects some of its columns (or rows). Finally, a rank-$k$ approximation is generated from the row space of the selected set. The preprocessing step is performed by uniformly randomizing the signs of the entries of $A$ and transforming all columns of $A$ by an orthonormal matrix $F$ with an existing fast implementation (e.g. Hadamard, FFT, DCT). Our main contributions are summarized as follows. 1) We show that by uniformly selecting at random $d = O\big(\frac{1}{\eta}\, k \max\{\log k, \log \frac{1}{\beta}\}\big)$ rows of the preprocessed matrix, we guarantee the relative Frobenius norm error approximation $(1 + \eta)\, \|A - A_k\|_F$ with probability at least $1 - 5\beta$. 2) With the same $d$, we establish a spectral norm error approximation $\big(2 + \sqrt{\frac{2m}{d}}\big)\, \|A - A_k\|$ with probability at least $1 - 2\beta$. 3) The algorithm requires 2 passes over the data and runs in time $O\big(mn \log d + (m + n) d^2\big)$, which, to the best of our knowledge, makes it the fastest algorithm when the matrix $A$ is dense. 4) As a bonus, applying this framework to the well-known least-squares approximation problem $\min_x \|Ax - b\|$, where $A \in \mathbb{R}^{m \times r}$, we show that by randomly choosing $d = O\big(\frac{1}{\eta}\, \gamma\, r \log m\big)$ rows, the approximate solution is proportional to the optimal one with a factor of $\eta$ and with extremely high probability, $1 - 6m^{-\gamma}$, say.

STOC'09, May 31–June 2, 2009, Bethesda, Maryland, USA.
Copyright 2009 ACM 978-1-60558-506-2/09/05.

Categories and Subject Descriptors F.2.1 [Numerical Algorithms and Problems]: Computations on matrices

General Terms Algorithms, Theory

1. INTRODUCTION

Low-rank matrix approximation has been widely used in many applications, including latent semantic indexing, DNA microarray data, facial recognition, web search and clustering, just to name a few (see [1] for a detailed explanation of these applications). The Singular Value Decomposition (SVD) [2] gives the optimal rank-$k$ approximation of a matrix $A$ in the sense of both the Frobenius and the spectral norm error bound. However, computing the SVD requires amounts of memory and time that are superlinear in the size of $A$; hence the true SVD becomes prohibitive in many applications. Recently, several developments showed that a subset of rows (or columns) obtained by adaptively sampling a few rows (or columns) of $A$ is sufficient for the approximation process, particularly the work of Frieze et al. [3], Drineas et al. [1], [4], Har-Peled [5], Deshpande et al. [6], [7] and Sarlós [8]. The first adaptive-sampling algorithm originated from Frieze et al. [3], which shows how to sample rows of $A$; Drineas et al. [1] then proposed a simpler algorithm to compute an approximation of the SVD. Both methods claim an additive Frobenius norm error which depends on the Frobenius norm of the input matrix $A$. This is undesirable, as this norm of $A$ is often large. Also building on Frieze et al.'s idea, Har-Peled [5] and Deshpande et al. [7] have more recently proposed more intriguing sampling techniques, one employing geometric ideas while the other pioneers the concept of volume sampling. They both showed that the Frobenius norm approximation error is relative, meaning that the error is within a factor $(1 + \eta)$ of the best rank-$k$ approximation, which is far better than an additive error. However, the amount of time needed to construct the sampling distribution appears to be costly. Most recently, Sarlós [8] proposed a different approach based on results from the fast Johnson-Lindenstrauss transform (FJLT) [9]. That paper shows that if a set $C$ of $d = O(k/\eta + k \log k)$ random linear combinations of the rows of $A$ is constructed, then the rank-$k$ approximation generated from the row space of $C$ achieves a relative error with constant probability (at least $1/2$, say). This algorithm is the best so far in the sense of the least number of passes over the input data $A$ (only two), as well as the running time $O(M d + (n + m) d^2)$, where $M$ denotes the number of non-zero elements of $A$.

With a similar flavor but from a different perspective, we propose a new fast and efficient algorithm, demonstrating that by uniformly sampling $d = O\big(\frac{1}{\eta}\, k \max\{\log k, \log \frac{1}{\beta}\}\big)$ rows of $A$ after preprocessing, the rank-$k$ approximation also guarantees the relative error with probability at least $(1 - 5\beta)$. Moreover, the algorithm needs only two passes over the input data, and its running time is $O\big(mn \log d + (m + n) d^2\big)$, which, to the best of our knowledge, makes it the fastest algorithm when the matrix $A$ is dense. Our algorithm also produces a low-rank approximation whose spectral norm error is within a factor proportional to $\sqrt{2m/d}$ of the best rank-$k$ version.

Our approach is motivated by remarkable results in the compressed sensing (CS) community. The CS problem states that one can perfectly recover a $k$-sparse signal $x \in \mathbb{R}^m$, which has at most $k$ nonzero coefficients, from an observation $y \in \mathbb{R}^d$ that is a random projection of $x$ onto a much lower dimensional space: $d = O(k \log m)$. To explain this magical phenomenon intuitively, observe that $x$ is a low dimensional signal embedded in a high dimensional space; hence, under some good linear transforms, $y$ will preserve enough information about $x$ that the critical information can be efficiently extracted by a good reconstruction algorithm. In a recent paper [10], we showed that a structurally random matrix (SRM) is a promising candidate for compressed sensing. Mathematically, an SRM $\Phi$ of size $d \times m$ is a product of three matrices,
$$\Phi = \sqrt{\frac{m}{d}}\, S F D, \qquad (1)$$
where
• $D$, the local randomizer, is a diagonal matrix whose diagonal entries are i.i.d. Rademacher random variables: $P(D_{ii} = \pm 1) = 1/2$;
• $F$ is any $m \times m$ orthonormal matrix. In practice, one can choose $F$ among the many with efficient fast-computable implementations, for instance, the Hadamard transform, FFT, DCT, and DWT;
• $S$, the uniformly random downsampler, is a $d \times m$ matrix whose $d$ rows are randomly selected from the rows of an $m \times m$ identity matrix.

The intuition behind the SRM is the Uncertainty Principle, which states that a signal whose energy is spread out in the time/spatial domain is concentrated in the frequency domain, and vice versa. From this observation, it is necessary to randomize all entries of $x$ to produce a noise-like signal before applying a fast transform to it. In our proposed algorithm, we utilize the SRM $\Phi$ as a preprocessor. By projecting the columns of $A$ onto $\Phi$, we hope that the row space of $\Phi A$ will contain a good approximation of the row space of the entire matrix $A$. In a later section of the paper, we will show that a sufficient selection of rows of $F D A$ produces a fast approximation with a relative error in the Frobenius norm as well as a small error in the spectral norm.
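As a concrete illustration of the uncertainty-principle intuition, the following minimal numpy sketch (ours, not from the paper) builds $\Phi = \sqrt{m/d}\, S F D$ with $F$ a Hadamard matrix and shows how the randomized transform $F D$ spreads the energy of a 1-sparse vector uniformly over all coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 16, 8

# F: orthonormal Hadamard matrix built by Sylvester's recursion
F = np.array([[1.0]])
while F.shape[0] < m:
    F = np.block([[F, F], [F, -F]])
F /= np.sqrt(m)

D = np.diag(rng.choice([-1.0, 1.0], size=m))   # local randomizer
omega = rng.choice(m, size=d, replace=False)    # uniform random downsampler S
Phi = np.sqrt(m / d) * F[omega] @ D             # Phi = sqrt(m/d) S F D, d x m

# a 1-sparse vector has all of its energy in one coordinate ...
x = np.zeros(m)
x[3] = 1.0
z = F @ D @ x
# ... but F D x spreads that energy uniformly: every entry has magnitude 1/sqrt(m)
assert np.allclose(np.abs(z), 1 / np.sqrt(m))
# F D is orthonormal, so the total energy is preserved
assert np.isclose(np.linalg.norm(z), np.linalg.norm(x))
```

Uniformly sampling $d$ entries of the noise-like vector $F D x$ then captures roughly a $d/m$ fraction of the energy of any sparse input, which is the behavior the local randomizer is designed to enforce.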

It is necessary to note that our algorithm differs from that of Sarlós [8] in the preprocessing step. The latter uses the so-called FJLT, which is basically a combination of three matrices, $\Phi = P F D$, where $F$ and $D$ are both defined as above and $P$ is a sparse matrix whose entries are chosen with the distribution specified in [9]. Hence, the projections of the columns of $A$ onto $\Phi$ are obtained by multiplications with $P$. In our algorithm, however, instead of using matrix multiplication, $d$ rows of the matrix $F D A$ are selected uniformly at random. Hence, if $F$ is a fast transform such as the Hadamard transform or FFT, the computation time can be reduced from $O(mnd)$ to $O(mn \log d)$ [11] for a dense input matrix $A$. Our proofs are also significantly different: we use remarkable results in Banach space theory to obtain our results, while Sarlós argues based on the Johnson-Lindenstrauss lemma. In addition, unlike previous adaptive sampling techniques [3], [1], [4], [5], [7], whose sampling distributions are based mostly on the characteristics of the rows (or columns) of $A$, our algorithm samples rows of $F D A$ uniformly at random, which is superior in its simplicity.

While preparing this paper, we became aware of two closely related efforts on constructing a preprocess similar to the SRM [11], [12]. However, there remain several significant differences from our approach. In the first effort [11], the entries of the diagonal matrix $D$ are complex and distributed uniformly on the unit circle. However, the authors were only able to prove an approximation bound when the matrix $S$ samples on the order of $k^2$ coordinates ($d = O(k^2)$), which is far above $k \log k$. Not only are the techniques different, but our resulting spectral norm bound is also better by a factor of $\sqrt{d}$. Moreover, our approach is more general in the choice of the fast transform $F$. In the second effort [12], the problem is different: the authors apply a preprocessing step to improve least-squares approximation. In fact, by applying our technique, the number of samples there can be decreased by a factor of $\log(r \log m)$, and the probability of success is substantially higher. We will present this result in another version in the near future.

The paper is organized as follows. The next section covers critical background material on matrix norms and linear algebra. In Section 3, we introduce the algorithm and present our main results for matrix approximation. We devote the last sections to our arguments and concluding remarks.

2. NOTATIONS AND REVIEW OF LINEAR ALGEBRA

Here are some notations used throughout the paper. Matrices are represented by bold capital letters and vectors by bold lower-case letters; matrix and vector entries are not bold, just like any scalar. For instance, $A$ is a matrix, and its entry at the $i$th row and $j$th column is $A_{ij}$; similarly, $a$ is a vector, and $a_i$ is its $i$th entry. When we mention a vector $a$, we assume it is a row vector, while $a^*$ is its transpose. For a matrix $A$, $a_i$ denotes the $i$th row of $A$.

2.1 Matrix and vector norms

Several different matrix norms are used throughout this paper. The spectral norm of a matrix $X$ is denoted by $\|X\|$, whereas the Frobenius norm is represented as $\|X\|_F$. If we denote the Euclidean inner product between two matrices by $\langle X, Y\rangle = \mathrm{trace}(X^* Y)$, then $\|X\|_F^2 = \langle X, X\rangle$. It can easily be verified that $\|X\|_F = \sup_{\|Y\|_F = 1} \langle X, Y\rangle$.

Another useful norm for our purpose is the Schatten norm. Given a parameter $q \ge 1$, the Schatten $q$-norm of a matrix $X$ is defined as $\|X\|_{S_q} = \big( \sum_i s_i^q \big)^{1/q}$, where the $s_i$ are the singular values of the matrix $X$. Note that when $q = \infty$, the Schatten $q$-norm is the spectral norm, $\|X\|_{S_\infty} = \|X\|$, and the Schatten 2-norm is the Frobenius norm, $\|X\|_{S_2} = \|X\|_F$. The following properties of the Schatten norm are used in the paper:
1) when $p \le q$, the inequality $\|X\|_{S_q} \le \|X\|_{S_p}$ holds;
2) if $r$ is the rank of $X$, then with $q \ge \log r$ it holds that $\|X\| \le \|X\|_{S_q} \le e\, \|X\|$.

With vectors, the only norm we consider is the $\ell_2$-norm, so we simply denote the $\ell_2$-norm of a vector by $\|x\|$, which is equal to $\sqrt{\langle x, x\rangle}$, where $\langle x, y\rangle$ is the Euclidean inner product between two vectors. As with matrices, we can straightforwardly establish $\|x\| = \sup_{\|y\| = 1} \langle x, y\rangle$.

Algorithm 1: Low-rank matrix approximation algorithm
Input: matrix $A \in \mathbb{R}^{m \times n}$, error parameter $\eta$ and probability of success $\beta$.
Output: matrix $\tilde A_k \in \mathbb{R}^{m \times n}$ of rank at most $k$.
1. Set $d = O\big(\frac{1}{\eta}\, \mu^2 m^2 \max\{k, \sqrt{k} \log \frac{2m}{\beta}\} \max\{\log k, \log \frac{3}{\beta}\}\big)$, where $\mu = \max_{i,j} |F_{ij}|^2$.
2. Randomly pick a subset $\Omega$ of $d$ entries from the set $\{1, 2, \ldots, m\}$, the probability of selecting each entry being $d/m$.
3. Compute $C = \sqrt{\frac{m}{d}}\, F_\Omega D A = \Phi A$.
4. Project the rows of $A$ onto $C$ to obtain $P_C(A)$.
5. Compute the best rank-$k$ approximation of $P_C(A)$: $\tilde A_k = P_{C,k}(A)$.

Remark 1. It is important to understand the significance of the parameter $\mu$ here: $\mu$ can be seen as a measure of how concentrated or spread out the magnitudes of the rows of the transform $F$ are. The value of $\mu$ ranges from $1/m$ to $1$. In the worst case, when $\mu = 1$ and $F$ is a diagonal matrix with $|F_{ii}| = 1$, most of the entries of $A$ are totally lost in the random selection process, except when $d = m$; hence the matrix $C$ is certainly not able to represent a good approximation of $A$ at small $d$. On the other hand, when $\mu = 1/m$ and $F$ is a Hadamard or Fourier matrix, the entries of $F$ are spread out uniformly and the number of rows to select is optimal: $d = O\big(\frac{1}{\eta} \max\{k, \sqrt{k} \log \frac{2m}{\beta}\} \max\{\log k, \log \frac{3}{\beta}\}\big)$.

Our main result for Algorithm 1 is presented in the following theorem. Its proof is delayed to the next section.

Theorem 1. Suppose $A \in \mathbb{R}^{m \times n}$, and let $\eta \in (0, 1]$ and $\beta \in (0, 1)$. Running Algorithm 1 returns a rank-$k$ approximation matrix $\tilde A_k$ for which the three following claims hold:
1. $\|A - \tilde A_k\|_F \le (1 + \eta)\, \|A - A_k\|_F$ with probability at least $1 - 5\beta$;
2. $\|A - \tilde A_k\| < \big(2 + \sqrt{\frac{2m}{d}}\big)\, \|A - A_k\|$ with probability at least $1 - 2\beta$;
3. the algorithm requires two passes over the data and runs in time $O\big(mn \log d + (m + n) d^2\big)$.

2.2 Singular value decomposition

The Singular Value Decomposition (SVD) of a matrix $A \in \mathbb{R}^{m \times n}$ is denoted by $A = U \Sigma V^*$, where $U \in \mathbb{R}^{m \times \rho}$, $\Sigma \in \mathbb{R}^{\rho \times \rho}$ and $V \in \mathbb{R}^{n \times \rho}$, with $\rho$ being the rank of $A$. The best rank-$k$ approximation of $A$ with respect to both the spectral and the Frobenius norm turns out to be $A_k = U_k \Sigma_k V_k^*$, where $U_k$ consists of the first $k$ columns of $U$.

The orthogonal projection of the rows of a matrix $A$ onto a matrix $C$ is denoted by $P_C(A) = A C^\dagger C$, where $C^\dagger$ is the Moore-Penrose pseudoinverse of $C$. The best rank-$k$ approximation of $P_C(A)$ is denoted by $P_{C,k}(A)$.
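The projector $P_C(A) = A C^\dagger C$ and its rank-$k$ truncation $P_{C,k}(A)$ translate directly into a few lines of numpy; the helper names below are ours:

```python
import numpy as np

def project_rows(A, C):
    """P_C(A) = A C^+ C: orthogonal projection of the rows of A
    onto the row space of C (C^+ is the Moore-Penrose pseudoinverse)."""
    return A @ np.linalg.pinv(C) @ C

def best_rank_k(M, k):
    """Best rank-k approximation of M via a truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 12))
C = rng.standard_normal((5, 12))

P = project_rows(A, C)
# every row of P lies in the row space of C, so projecting again is a no-op
assert np.allclose(project_rows(P, C), P)
Ak_tilde = best_rank_k(P, 3)          # P_{C,k}(A)
assert np.linalg.matrix_rank(Ak_tilde) <= 3
```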

3. OUR ALGORITHM AND MAIN RESULTS

Our Algorithm 1 takes the matrix $A \in \mathbb{R}^{m \times n}$, an error parameter $\eta$ and a probability of success $\beta$ as inputs. At first, $A$ is preprocessed by randomizing the signs of its rows and passing it through a fast linear transform, for instance, Hadamard, FFT, DCT or a wavelet transform. A matrix $C$ is then constructed by uniformly sampling a random subset of rows of the matrix $F D A$. The next steps follow the algorithms of A. Deshpande et al. [7] and T. Sarlós [8]: the low-rank approximation is constructed by projecting $A$ onto $C$ and computing the best rank-$k$ approximation of this projection.
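The steps above can be sketched compactly in numpy. This is our own illustrative implementation, assuming $F$ is the Hadamard transform (so $m$ must be a power of two); for simplicity, step 2 samples exactly $d$ rows without replacement rather than with independent Bernoulli selections:

```python
import numpy as np

def fwht(X):
    """Orthonormal fast Walsh-Hadamard transform along axis 0
    (the number of rows must be a power of two)."""
    m = X.shape[0]
    Y = X.astype(float).copy()
    h = 1
    while h < m:
        for i in range(0, m, 2 * h):
            a = Y[i:i + h].copy()
            b = Y[i + h:i + 2 * h].copy()
            Y[i:i + h] = a + b
            Y[i + h:i + 2 * h] = a - b
        h *= 2
    return Y / np.sqrt(m)

def srm_low_rank(A, k, d, rng):
    """Algorithm 1 sketch: preprocess with the SRM, sample d rows,
    project and truncate to rank k."""
    m, n = A.shape
    signs = rng.choice([-1.0, 1.0], size=m)       # diagonal of D
    FDA = fwht(signs[:, None] * A)                # F D A, computed fast
    omega = rng.choice(m, size=d, replace=False)  # subset Omega
    C = np.sqrt(m / d) * FDA[omega]               # C = Phi A  (d x n)
    P = A @ np.linalg.pinv(C) @ C                 # P_C(A)
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]            # tilde A_k = P_{C,k}(A)

rng = np.random.default_rng(0)
m, n, k = 64, 32, 3
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # exactly rank k
A_tilde = srm_low_rank(A, k, d=24, rng=rng)
# for an exactly rank-k input, the row space of C generically equals that of A,
# so the sketch recovers A up to round-off
assert np.linalg.norm(A - A_tilde) <= 1e-6 * np.linalg.norm(A)
```

The only dense operations are the fast transform ($O(mn \log m)$ here; subsampled variants achieve $O(mn \log d)$) and small SVD/pseudoinverse computations on $d \times n$ and $m \times d$-sized factors.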

Remark 2. Recently, two inspiring works of N. Ailon and E. Liberty [13], [14] have significantly improved the running time of the Johnson-Lindenstrauss (JL) transform. In particular, they showed that by exploiting results from coding theory, a fast linear mapping $\Phi \in \mathbb{R}^{d \times m}$ which guarantees the JL lemma can be designed efficiently [13]. The 'efficiency' means that the complexity of projecting $A$ onto $\Phi$ is reduced to $O(mn \log d)$ (the effort of [14] even achieves a lower complexity, of order $O(mn)$; nevertheless, the maximum entries of each column of $A$ must then be strictly restricted). These FJLTs can be applied directly to Sarlós's algorithm [8] to reduce its running time, so it is now essential to evaluate the proposed algorithm against Sarlós's. For a fair comparison, we fix the transform matrix to be Hadamard and set the probability of success to $1 - m^{-1}$. Then our algorithm needs $d = O\big(\frac{1}{\eta} \max\{k \log m, \sqrt{k} \log^2 m\}\big)$ and $O\big(mn \log(k \log m) + (m + n)(k \log m)^2\big)$ computational complexity (when $k$ is less than $O(\log^2 m)$), while applying the best FJLT so far [13] to Sarlós's algorithm requires $O\big(mn \log(k \log k) + (m + n)(k \log k)^2 \log m\big)$ complexity, which is still higher than ours by approximately a factor of $\log m$.

4. PROOF OF THEOREM 1

Suppose the matrix $A$ has rank $\rho$ and let $A = U \Sigma V^*$. Denote by the pairs $U_k, V_k$ and $U_{\rho-k}, V_{\rho-k}$ the matrices of the first $k$ and the last $(\rho - k)$ singular vectors of $U$ and $V$, respectively. Also denote by $\Sigma_k$ and $\Sigma_{\rho-k}$ the diagonal matrices of the first $k$ and the last $(\rho - k)$ singular values of $\Sigma$. In order to establish Theorem 1, we use the following two lemmas and theorems. The lemmas will be proven at the end of this section; we leave the proofs of Theorems 2 and 3 for the next section.

The two following lemmas give the Frobenius and spectral norm bounds of $(A - \tilde A_k)$.

Lemma 1. Let $H = U_{\rho-k} \Sigma_{\rho-k}$. If the matrix $(\Phi U_k)$ has full column rank, then the following inequality holds:
$$\|A - P_{C,k}(A)\|_F \le \|A - A_k\|_F + \big\| (\Phi U_k)^\dagger\, \Phi H \big\|_F.$$

In the next lemma, we consider the spectral norm bound.

Lemma 2. Let $H = U_{\rho-k} \Sigma_{\rho-k}$. If the matrix $(\Phi U_k)$ has full column rank, then the following inequality holds:
$$\|A - P_{C,k}(A)\| \le 2\, \|A - A_k\| + \big\| (\Phi U_k)^\dagger\, \Phi H \big\|.$$

From the fact that for any two matrices $X$ and $Y$, $\|X Y\|_F \le \|X\|\, \|Y\|_F$, we have
$$\big\| (\Phi U_k)^\dagger\, \Phi H \big\|_F \le \big\| (U_k^* \Phi^* \Phi U_k)^{-1} \big\|\; \big\| U_k^* \Phi^* \Phi H \big\|_F. \qquad (2)$$
A similar bound holds for the spectral norm of $(\Phi U_k)^\dagger \Phi H$.

We now introduce random variables for the sign-randomization and sampling processes. Let $\epsilon_i = D_{ii}$ be i.i.d. Rademacher random variables with $P(\epsilon_i = \pm 1) = 1/2$. Also let $\delta_j$ be i.i.d. Bernoulli $0/1$ random variables with $P(\delta_j = 1) = d/m$, whose subscript $j$ represents the entry selected from the set $\{1, 2, \ldots, m\}$; as mentioned above, $j \in \Omega$ precisely when $\delta_j = 1$. For consistency, we let $u_i$ and $h_k$ denote the row vectors of $U_k$ and $H$, respectively, and denote by $\{e_j \in \mathbb{R}^m\}_{1 \le j \le m}$ the standard basis of the Euclidean space. It follows that $F_\Omega^* F_\Omega = \sum_{j \in \Omega} f_j^* f_j = \sum_{j=1}^m \delta_j f_j^* f_j$. Hence,
$$(U_k^* D^*)(F_\Omega^* F_\Omega)(D H) = \Big( \sum_{i=1}^m \epsilon_i u_i^* e_i \Big) \Big( \sum_{j=1}^m \delta_j f_j^* f_j \Big) \Big( \sum_{k=1}^m \epsilon_k e_k^* h_k \Big)$$
$$= \sum_{j=1}^m \sum_{i=1}^m \sum_{k=1}^m \delta_j\, \epsilon_i \epsilon_k\, u_i^*\, (e_i f_j^*)(f_j e_k^*)\, h_k \qquad (3)$$
$$= \sum_{j=1}^m \delta_j \sum_{i=1}^m \sum_{k=1}^m \epsilon_i \epsilon_k\, F_{ij}^* F_{jk}\, u_i^* h_k = \sum_{j=1}^m \delta_j \Big( \sum_{i=1}^m \epsilon_i F_{ij}^* u_i^* \Big) \Big( \sum_{k=1}^m \epsilon_k F_{jk} h_k \Big) = \sum_{j=1}^m \delta_j\, x_j^* y_j,$$
where the row vectors
$$x_j := \sum_{i=1}^m \epsilon_i F_{ij}\, u_i \qquad\text{and}\qquad y_j := \sum_{k=1}^m \epsilon_k F_{jk}\, h_k. \qquad (4)$$
Hence,
$$U_k^* \Phi^* \Phi H = \frac{m}{d} \sum_{j=1}^m \delta_j\, x_j^* y_j. \qquad (5)$$
Likewise, $U_k^* \Phi^* \Phi U_k = \frac{m}{d} \sum_{j=1}^m \delta_j\, x_j^* x_j$.

In order to prove Theorem 1, it is sufficient to
1. find a condition on $d$ such that the matrix $\sum_{j=1}^m \delta_j x_j^* x_j$ is invertible;
2. find the spectral norm bound of $\big( \sum_{j=1}^m \delta_j x_j^* x_j \big)^{-1}$;
3. find the Frobenius norm bound of $\sum_{j=1}^m \delta_j x_j^* y_j$.

The next theorem is dedicated to showing the bound on $d$ under which $\sum_{j=1}^m \delta_j x_j^* x_j$ is invertible, and establishes the probabilistic bound of $\big\| I - \frac{m}{d} \sum_{j=1}^m \delta_j x_j^* x_j \big\|$.

Theorem 2. Let $S = \big\| I - \frac{m}{d} \sum_{i=1}^m \delta_i x_i^* x_i \big\|$. Suppose the number of samples $d$ obeys $d \ge 25\, m q \max_j \|x_j\|^2$ for $q \ge \log k$. Then
$$(\mathbb{E} S^q)^{1/q} \le 5 \sqrt{\frac{m q}{d}}\, \max_i \|x_i\|. \qquad (6)$$
Moreover, if $d \ge \frac{C_1}{a^2}\, \mu m \max\big\{k, \log \frac{2m}{\beta}\big\} \log k$ for a positive constant $C_1 \le 25$, then
$$P(S \le a) \ge 1 - 2\beta, \qquad (7)$$
where $a$ is any constant in $(0, 1)$.

In the next theorem, we consider the Frobenius norm bound of the sum $\sum_{j=1}^m \delta_j x_j^* y_j$.

Theorem 3. Let $S_F = \big\| \sum_{j=1}^m \delta_j x_j^* y_j \big\|_F$ and denote $\alpha = \|A - A_k\|_F^2$. Then
$$\mathbb{E} S_F \le 4.56\, \max_i \|x_i\| \max_i \|y_i\| + 2.56 \sqrt{\frac{d}{m}}\, \max\Big\{ \sqrt{\alpha}\, \max_i \|x_i\|,\; k^{1/4} \max_i \|y_i\| \Big\}. \qquad (8)$$
Also, there is a small numerical constant $C$ such that
$$P\Big( S_F \le C \mu \sqrt{\alpha}\, \sqrt{d k}\, \max\Big\{ \sqrt{k},\, \sqrt{\log \tfrac{2m}{\beta}} \Big\} \sqrt{\log \tfrac{3}{\beta}} \Big) \ge 1 - 3\beta. \qquad (9)$$

Proof (Theorem 1). We are now ready to verify the claims of Theorem 1. Theorem 2 states that as soon as we sample enough rows of $\Phi$ ($d \ge \frac{C_1}{a^2} \mu m \max\{k, \log \frac{2m}{\beta}\} \log k$, say), we have $\| I - U_k^* \Phi^* \Phi U_k \| \le a < 1$ with high probability, which implies that $\Phi U_k$ has full column rank. Furthermore, every singular value of $\Phi U_k$ is bounded in the range
$$\sqrt{1 - a} \le s_{\min}(\Phi U_k) \le s_{\max}(\Phi U_k) \le \sqrt{1 + a}. \qquad (10)$$

It is easy to verify that for a matrix $B$, $\|(B^* B)^{-1}\| \le 1 / s_{\min}^2(B)$. Therefore, with the choice $a = 1/5$, we obtain
$$P\Big( \big\| (U_k^* \Phi^* \Phi U_k)^{-1} \big\| \le \frac{5}{4} \Big) \ge 1 - \beta. \qquad (11)$$
Based on the above inequalities, we can now simply bound the second term of Lemma 2 as follows:
$$\big\| (\Phi U_k)^\dagger\, \Phi H \big\| \le \big\| (U_k^* \Phi^* \Phi U_k)^{-1} \big\|\; \|\Phi U_k\|\; \|\Phi U_{\rho-k}\|\; \|\Sigma_{\rho-k}\| < \sqrt{\frac{2m}{d}}\, \|A - A_k\|$$
with probability at least $(1 - 2\beta)$. This inequality, in combination with Lemma 2, proves the second statement of Theorem 1.

Combining the second claim of Theorem 3 with (2) and (11), we have, with probability at least $1 - 5\beta$,
$$\big\| (\Phi U_k)^\dagger\, \Phi H \big\|_F \le \frac{5m}{4d}\, C \mu \sqrt{\alpha}\, \sqrt{d k}\, \max\Big\{ \sqrt{k},\, \sqrt{\log \tfrac{2m}{\beta}} \Big\} \sqrt{\log \tfrac{3}{\beta}} := \eta \sqrt{\alpha},$$
where $\eta < 1$. Setting $C_2 := C^2 (5/4)^2$, we get
$$d = C_2\, \frac{1}{\eta}\, \mu^2 m^2 \max\Big\{ k,\, \sqrt{k} \log \frac{2m}{\beta} \Big\} \log \frac{3}{\beta}.$$
In combination with the condition on $d$ under which $\Phi U_k$ has full column rank, $d$ must satisfy
$$d = C\, \frac{1}{\eta}\, \mu^2 m^2 \max\Big\{ k,\, \sqrt{k} \log \frac{2m}{\beta} \Big\} \max\Big\{ \log k,\, \log \frac{3}{\beta} \Big\},$$
where $C := \max\{5 C_1, C_2\}$ is a small numerical constant. From Lemma 1, we then conclude the first statement of Theorem 1.

The running time can be bounded as in [8] and [7]. Note that with our SRM we do not have to use matrix multiplication; hence the computational time is reduced by a factor of $d / \log d$.

4.1 Proof of Lemma 1

Proof. The proof is similar to that of Theorem 1 in [4], with small modifications (see also Theorem 14 in [8]). Note that $P_{C,k}(A)$ is the best rank-$k$ approximation of $A$ on the row space spanned by $C$. Therefore,
$$\|A - P_{C,k}(A)\|_F \le \|A - B\|_F,$$
where $B$ is any matrix of rank $k$ whose rows lie in the row space of $C$. We also have
$$\|A - B\|_F \le \|A - A_k\|_F + \|A_k - B\|_F.$$
We choose $B = A_k (\Phi A_k)^\dagger C = A_k (\Phi A_k)^\dagger \Phi A$. In [4], the authors showed that
$$\big\| A_k - A_k (\Phi A_k)^\dagger \Phi A \big\|_F = \big\| (\Phi U_k)^\dagger\, \Phi U_{\rho-k} \Sigma_{\rho-k} \big\|_F.$$
The proof is completed.
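Lemma 1 is deterministic once $\Phi U_k$ has full column rank, so it can be sanity-checked numerically. The sketch below (ours) uses a Gaussian $\Phi$ for convenience rather than an SRM:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k, d = 30, 20, 4, 15
A = rng.standard_normal((m, n))
Phi = rng.standard_normal((d, m)) / np.sqrt(d)  # any sketch with Phi U_k full column rank

U, s, Vt = np.linalg.svd(A, full_matrices=False)
rho = int(np.sum(s > 1e-12))
Ak = (U[:, :k] * s[:k]) @ Vt[:k]                # best rank-k approximation A_k
H = U[:, k:rho] * s[k:rho]                      # H = U_{rho-k} Sigma_{rho-k}

C = Phi @ A
P = A @ np.linalg.pinv(C) @ C                   # P_C(A)
Up, sp, Vtp = np.linalg.svd(P, full_matrices=False)
PCk = (Up[:, :k] * sp[:k]) @ Vtp[:k]            # P_{C,k}(A)

lhs = np.linalg.norm(A - PCk, 'fro')
rhs = np.linalg.norm(A - Ak, 'fro') + \
      np.linalg.norm(np.linalg.pinv(Phi @ U[:, :k]) @ (Phi @ H), 'fro')
# Lemma 1: ||A - P_{C,k}(A)||_F <= ||A - A_k||_F + ||(Phi U_k)^+ Phi H||_F
assert lhs <= rhs + 1e-10
```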

4.2 Proof of Lemma 2

Before proving Lemma 2, we show that the orthogonal projection of the rows of a matrix $A$ onto a matrix $C$ also preserves the optimality of the spectral norm.

Proposition 1.
$$\big\| A - A C^\dagger C \big\| \le \|A - B\|,$$
where $B$ is any matrix whose rows lie in the space spanned by the rows of $C$.

Proof. Denote $\hat x = \arg\max_{\|x\| = 1} \|x^* A - x^* A C^\dagger C\|$ and write $B = P C$ for some matrix $P$. We have
$$\|A - P C\| = \sup_{\|x\| = 1} \|x^* A - x^* P C\| = \sup_{\|x\| = 1} \|x^* A - z^* C\| \qquad (\text{denoting } z^* = x^* P)$$
$$\ge \|\hat x^* A - z^* C\| \ge \|\hat x^* A - \hat x^* A C^\dagger C\| = \big\| A - A C^\dagger C \big\| \qquad (\text{by the definition of } \hat x).$$
The last inequality holds since $\hat x^* A C^\dagger C$ is the orthogonal projection of $\hat x^* A$ onto the row space of $C$, and thus $\hat x^* A - \hat x^* A C^\dagger C$ has minimal norm.

The next proposition has been stated in Theorem 1.6 of [15] and is simple to verify: if $B$ is a matrix with bounded norm, then for any bounded-norm matrix $A$, the singular values of $AB$ satisfy the following.

Proposition 2. For every $i = 1, \ldots, \mathrm{rank}(A)$,
$$s_i(A B) \le \|B\|\, s_i(A).$$

Proof (Lemma 2). From the triangle inequality, we have
$$\|A - P_{C,k}(A)\| \le \big\| A - A C^\dagger C \big\| + \big\| A C^\dagger C - P_{C,k}(A) \big\|.$$
Recall that $P_{C,k}(A)$ is the best rank-$k$ approximation of $A C^\dagger C$. Hence,
$$\big\| A C^\dagger C - P_{C,k}(A) \big\| = s_{k+1}(A C^\dagger C).$$
By Proposition 2, $s_{k+1}(A C^\dagger C) \le \|C^\dagger C\|\, s_{k+1}(A) = s_{k+1}(A) = \|A - A_k\|$. In addition, from the triangle inequality,
$$\big\| A - A C^\dagger C \big\| \le \big\| A_k - A_k C^\dagger C \big\| + \big\| (A - A_k) - (A - A_k) C^\dagger C \big\| \le \big\| A_k - A_k C^\dagger C \big\| + \|A - A_k\|.$$
Proposition 1 shows that $\|A_k - A_k C^\dagger C\| \le \|A_k - B\|$, where $B$ is any matrix whose rows lie in the row space spanned by $C$. As in Lemma 1, we choose $B = A_k (\Phi A_k)^\dagger C$. The proof in Theorem 1 of [4] also holds for the spectral norm, and we obtain
$$\big\| A_k - A_k (\Phi A_k)^\dagger \Phi A \big\| = \big\| (\Phi U_k)^\dagger\, \Phi U_{\rho-k} \Sigma_{\rho-k} \big\|.$$
The lemma follows.
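Proposition 2, the key step above, is easy to check numerically (our sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))
B = rng.standard_normal((6, 5))

sA = np.linalg.svd(A, compute_uv=False)          # singular values of A, descending
sAB = np.linalg.svd(A @ B, compute_uv=False)     # singular values of AB
normB = np.linalg.svd(B, compute_uv=False)[0]    # spectral norm ||B||

# Proposition 2: s_i(AB) <= ||B|| * s_i(A) for every i
assert np.all(sAB <= normB * sA[:len(sAB)] + 1e-10)
```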

5. PROOFS OF THEOREMS 2 AND 3

For the moment, we take for granted that the expectation bounds (6) and (8) of Theorems 2 and 3 hold, and focus on verifying (7) and (9); the arguments for the expectation bounds are postponed to the Appendix.

First, we state two lemmas bounding $\max_j \|x_j\|$ and $\max_j \|y_j\|$, where $x_j$ and $y_j$ are sums of vectors with random $\pm 1$ weights. We leave their proofs to the end of the section.

Lemma 3. Let $x_j$ be defined as in (4). Then
$$P\Big( \max_{1 \le j \le m} \|x_j\| \le \sqrt{\mu k} + 4 \sqrt{\mu \log \tfrac{2m}{\beta}} \Big) \ge 1 - \beta. \qquad (12)$$

Lemma 4. Let $y_j$ be defined as in (4). Then
$$P\Big( \max_{1 \le j \le m} \|y_j\| \le \sqrt{\mu \alpha} + 4 \sqrt{\tfrac{\mu \alpha}{r_k} \log \tfrac{2m}{\beta}} \Big) \ge 1 - \beta, \qquad (13)$$
where $\alpha = \|A - A_k\|_F^2$ and $r_k$ is the numerical rank of $(A - A_k)$, defined as $r_k = \alpha / s_{k+1}^2$.

5.1 Proof of Theorem 2

Proof. Take $q = \log k$ and apply Markov's inequality: for each $t > 0$, $P(S \ge t\, (\mathbb{E} S^q)^{1/q}) \le t^{-q}$. By choosing $t = e$ and $\beta \ge t^{-q}$, we attain
$$P\Big( S \le 5 e \sqrt{\frac{m \log k}{d}}\, \max_j \|x_j\| \Big) \ge 1 - \beta.$$
Combining this with Lemma 3 and letting
$$a = C_1 \sqrt{\frac{m \log k}{d}\, \mu \max\Big\{ k,\, 4 \log \frac{2m}{\beta} \Big\}}$$
with $C_1 \le 10 e$, we conclude the proof.

5.2

P ∗ In order to bound Frobenius norm of the sum of m j=1 δj x j y j , one of the easy way is to apply a simple Markov’s inequality. However, the probabilistic bound is not tight. Instead, we use a remarkable result from Talagrand [16] (see also Corollary 7.8 of [17] which bound the supremum of a sum of independent random variables Z1 , Z2 , ..., Zm in Banach space

g∈G

m X

g(Zi )

i=1

F

the Frobenius norm of the sum is highly concentrated around its expectation Lemma 6. With SF defined in Theorem 3, its probabilistic bound is r 3 (15) P SF ≤ C1 log · ESF ≥ 1 − β β

∗

X kF = supkG X G) Proof. At first we note that kX GkF =1 trace(X ∗ X = supkG hX , G i. Let Z := δ x y , we have j j j j GkF =1

m m m

X X X

Z j , G i = sup Zj) SF = Z j = sup hZ g (Z

GkF =1 GkF =1 kG kG j=1

Since SF > 0, the expected value of SF is equal to the expected value of SF . That means ESF = ESF . The absolute Z j ) can be bounded value of g(Z ∗

Z j )| = δj x j y j , G ≤ δj x ∗j y j F ≤ x ∗j y j F = kx xj k kyy j k |g(Z

n X

2

There will exist a small constant such that C1 q C log β3 + 1. The proof is now concluded.

=

3 β

=

Proof. (Theorem 3) Lemmas 3 and 4 imply that with the probability at least 1−β √ xj k ≤ C1 µ max kx

1≤j≤m

s

2m max k, log β

and

r 2m log β

where C1 ≤ 5 and C2 ≤ 5. We consider two cases: k ≥ log 2m and k < log 2m . For the former, define probabilistic β β evens p xj k ≤ 5 µk M = max kx and

m m X

d X

x ∗j y j 2 = d trace x ∗j y j y ∗j x j F m j=1 m j=1

m m X d X d kyy j k2 trace x ∗j x j ≤ max kyy j k2 trace x ∗j x j m j=1 m j j=1

q log

We are now ready to prove the Theorem 3

1≤j≤m

and therefore,

j=1

d xj k2 α max kx j m

d 2 2 xj k , k max kyy j k σ = sup Eg (Zi ) := max α max kx j j m g∈G i=1 2

√ max kyy j k ≤ C2 µα

2 2 d ∗ d

x ∗j y j 2 Z j ) = Eδj x ∗j y j , G = Eg 2 (Z xj y j , G ≤ F m m

Eg 2 (Zj ) =

=

So, we choose

xj k maxj kyy j k. We now Hence, we can take η := maxj kx compute

m X

!

r 3 3 =β P SF ≥ C log + 1 ESF ≤ 3 exp − log β β

where C1 is a small numerical constant.

j=1

m X d xj k2 trace Eg (Zj ) ≤ max kx y ∗j y j m j j=1 j=1 2

The last inequality comes from a simple observation that log (1 + x) ≥ 2x/3 at 0 ≤ x ≤ 1. Hence, t must be selected 2 such that ηt ≤ Eq SF . Choose t = C log β3 ESF where C is a small numerical constant. By simple algebraic calculation, one can show that ηt ≤ E2 SF as d ≥ C 2 µm log β3 . Therefore,

In the next Lemma, we claim

P

that with high probability

∗ the deviation of m δ x y j=1 j j j is small. This implies that

F

m X

Apply the powerful Talagrand’s result (14) and note that from expectation inequality (8), σ 2 +ηESF ≤ σESF +ηESF = E2 SF t ηt P (SF ≥ t + ESF ) ≤ 3 exp − log 1 + 2 Cη E SF 2 1 t ≤ 3 exp − . C E2 SF

Lemma 5. If |g| ≤ η for every g ∈ G and {g(Zi )}1≤i≤m have zero mean for every g ∈ G. Then for all t ≥ 0, ηt t log 1 + , P (|S − ES| ≥ t) ≤ 3 exp − Cη σ 2 + ηES (14) Pn P 2 where σ 2 = supg∈G n i=1 Eg (Zi ), S = supg∈G i=1 g(Zi ) , and C > 0 is a small numerical constant.

j=1

d k max kyy j k2 j m

Prove similarly and combine with (24), we also attain

Proof of Theorem 3

S = sup

Eg 2 (Zj ) ≤

1≤j≤m

! r 2m max kyy j k ≤ 5 µα log 1≤j≤m β

N=

√

then from (8), suppose M and N hold, we have r √ 2m ESF ≤ 4.56C1 C2 µ α k log β r r √ p 2m d 1/4 + 2.56 max C1 α µk, C2 k µα log m β r √ √ √ 2m ≤ µ α C3 k + C4 d k max{ k, log } β r √ √ √ 2m ≤ C5 µ α d k max{ k, log } β where C3 = 4.56C1 C2 ≤ 114, C4 = 2.56 ∗ C1 < 13 and C5 = C3 + C4 < 127. The second inequality follows from µ ≥ 1/m We define the even r √ √ √ 2m } P = ESF ≤ C5 µ α d k max{ k, log β which occurs as both evens M and N hold. We have P(P c ) ≤ P(M c ∪ N c ) ≤ P(M c ) + P(N c ) ≤ 2β. Hence P(P ) ≥ 1 − 2β. Combine with Lemma 6 and set C = C5 ×C1 , we conclude that with probability no less than (1 − 3β) r √ √ √ 3 2m } log SF ≤ Cµ α d k max{ k, log β β Likewise, if k < log (1 − 3β)

2m , β

g k≤1 i=1 kg

Remark 3. It is important to compare the use of uncomplicated Markov’s inequality to the beautiful Talagrand concentration measurement (Lemma 5) applied in Lemma 6. If the former is utilized, then Lemma 6 is guaranteed with a small constant probability, for instant, 1/2. This leads to the conclusion of Theorem 3 with probability 1/2. However, in this case the running time will be boosted by a logarithmic proportion of m (log m) in order for the algorithm to success with extremely high probability, 1 − m−1 , say (see also remark 2 for a fair comparison with other’s works and how our technique is better in term of computational complexity).

Proofs of Lemma 3 and 4

In order to bound maxi kxi k and maxi kyi k, we use a powerful result from Theorem 7.3 of [17] which strongly bound the supremum of a sum of vectors b 1 , b 2 , ..., b m with random weights in Banach space. Lemma 7. Let {ηi }1≤i≤m be a sequence of independent random variable such that kηi k ≤ 1 almost surely with i = 1, 2, ..., m and let b 1 , b 2 , ..., b m be vectors in Banach space. Then for every t ≥ 0,

! m

X

t2

, (16) P ηib i ≥ M + t ≤ 2 exp −

16σ 2 i=1

P

and where M is either the mean or median of m i=1 ηib i P m 2 2 σ = supkgg k≤1 i=1 hgg , b i i

g k≤1 kg m X

≤ µ sup g g k≤1 kg

i=1

! U k k2 = µ u ∗i u i g ∗ = µ kU

i=1

√ EZ 2 we have that M =

From Jensen’s inequality: EZ ≤ q xj k2 where E kx

xj k ≤ E kx

n X

xj k2 = kx

i k Fij∗ Fkj u ∗i u k

i,k=1

=

X

2i Fij∗ Fij u ∗i u i +

i

X

i k Fij∗ Fkj u ∗i u k

i6=k

Since {i } is an i.i.d Rademacher sequence, 2i = 1 and Ei k = 0 with i 6= k. Hence, xj k2 = E kx

we obtain with probability at least

r √ √ 2m 3 SF ≤ Cµ α d k log log β β Combine both condition on k, we complete the proof of Theorem 3.

5.3

Lemma 7 asserts that the sum is distributed like Gaussian √ around its mean or median, with standard deviation 2 2σ. xj k Applying Lemma 7, we can bound the maximum of kx and kyy j k

P

= Proof. (Lemma 3) We have kxj k = m i=1 i Fij u i

P

m ib i where b i := Fij u i . The bound for σ 2 can be i=1 obtained by ! m m X X 2 2 ∗ ∗ σ = sup hgg , b i i = sup g Fij Fij u i u i g ∗

X

Fij∗ Fij u ∗i u i ≤ µ

i

m X

U k k2F = µk u ∗i u i = µ kU

i=1

√ which leads M ≤ µk. Lemma 7 now gives us t2 xj k ≥ M + t) ≤ 2 exp − P (kx 16µ Apply a union bound for a supremum of a random process t2 xj k ≤ M + t ≤ 2m exp − P max kx 1≤j≤m 16µ q √ , we obtain By choosing t = 4 µ log 2m β P

xj k ≤ max kx

1≤j≤m

p

√ µk + 4 µ

r 2m − log 2m β = β log ≤ 2me β

The Lemma follows.

Proof. (Lemma 4) Following the same line of proof as in Lemma 3, with $y_j = \sum_{i=1}^m \varepsilon_i F_{ji} h_i$, we can obtain an upper bound for $\sigma^2$:
$$
\sigma^2 = \sup_{\|g\|=1} \sum_{i=1}^m \langle g, F_{ji} h_i \rangle^2
= \sup_{\|g\|=1} g \Bigl( \sum_{i=1}^m F_{ji}^* F_{ji}\, h_i^* h_i \Bigr) g^*
\le \mu \sup_{\|g\|=1} g \Bigl( \sum_{i=1}^m h_i^* h_i \Bigr) g^*
= \mu\, \|H^* H\|,
$$
where $H^* H = \Sigma_{\rho-k}^* U_{\rho-k}^* U_{\rho-k} \Sigma_{\rho-k} = \Sigma_{\rho-k}^2$. Hence,
$$
\sigma^2 \le \mu\, \bigl\| \Sigma_{\rho-k}^2 \bigr\| = \mu\, \|A - A_k\|^2 = \mu\, \delta_{k+1}^2 = \mu\, \frac{\alpha}{r_k},
$$
and $M = E\|y_j\| \le \sqrt{E\|y_j\|^2}$, where
$$
E\|y_j\|^2 = \sum_{i=1}^m F_{ji}^* F_{ji}\, h_i^* h_i
\le \mu \sum_{i=1}^m h_i^* h_i
= \mu \cdot \mathrm{trace}(H^* H)
= \mu \cdot \mathrm{trace}\bigl( \Sigma_{\rho-k}^2 \bigr)
= \mu\, \|A - A_k\|_F^2 = \mu \alpha.
$$
Therefore, $M \le \sqrt{\mu\alpha}$. Applying Lemma 7 and then taking the union bound for the supremum of $\|y_j\|$, we attain
$$
P\Bigl( \max_{1 \le j \le m} \|y_j\| \ge M + t \Bigr) \le 2m \exp\Bigl( -\frac{t^2}{16\mu\alpha / r_k} \Bigr).
$$
Choosing $t = 4\sqrt{\frac{\mu\alpha}{r_k} \log\frac{2m}{\beta}}$ and combining with $M \le \sqrt{\mu\alpha}$,
$$
P\Bigl( \max_{1 \le j \le m} \|y_j\| \ge \sqrt{\mu\alpha} + 4\sqrt{\frac{\mu\alpha}{r_k} \log\frac{2m}{\beta}} \Bigr)
\le 2m\, e^{-\log\frac{2m}{\beta}} = \beta.
$$
The proof is now complete.
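The concentration bounds of Lemmas 3 and 4 are easy to check numerically. Below is a minimal NumPy simulation of Lemma 3 (our own illustration, not part of the paper): we take $F$ to be a normalized Hadamard matrix, for which $\mu = \max_{i,j}|F_{ij}|^2 = 1/m$, and form $x_j = \sum_i \varepsilon_i F_{ij} u_i$ as the $j$-th row of $F^T D U_k$; the dimensions and trial count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, beta = 256, 8, 0.05

# Orthonormal F with uniformly small entries: a normalized Hadamard matrix,
# built by Sylvester doubling, so that mu = max |F_ij|^2 = 1/m.
H = np.array([[1.0]])
while H.shape[0] < m:
    H = np.block([[H, H], [H, -H]])
F = H / np.sqrt(m)
mu = np.max(F**2)

# U_k: an m x k matrix with orthonormal columns; u_i are its rows.
U, _ = np.linalg.qr(rng.standard_normal((m, k)))

max_norms = []
for _ in range(50):
    eps = rng.choice([-1.0, 1.0], size=m)   # Rademacher signs (the matrix D)
    X = F.T @ (eps[:, None] * U)            # row j is x_j = sum_i eps_i F_ij u_i
    max_norms.append(np.linalg.norm(X, axis=1).max())

# Lemma 3's high-probability bound: sqrt(mu k) + 4 sqrt(mu log(2m/beta)).
bound = np.sqrt(mu * k) + 4 * np.sqrt(mu * np.log(2 * m / beta))
print(max(max_norms), bound)
```

In runs of this sketch the observed maximum row norm should sit well below the (deliberately loose) bound, consistent with the lemma's failure probability of at most $\beta$ per trial.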

6. EXPECTATION BOUNDS

Our main arguments are mostly based on the Noncommutative Khintchine inequality, which bounds the Schatten norm of a sum of a Rademacher series [18].

Lemma 8. (Noncommutative Khintchine inequality) Let $\{X_i\}$ ($1 \le i \le n$) be a set of matrices of the same dimension and let $\{\varepsilon_i\}$ be an independent Rademacher sequence. For each $q \ge 2$,
$$
\Bigl( E \Bigl\| \sum_i \varepsilon_i X_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
\le C_q \max\Biggl\{ \Bigl\| \Bigl( \sum_i X_i X_i^* \Bigr)^{1/2} \Bigr\|_{S_q},\;
\Bigl\| \Bigl( \sum_i X_i^* X_i \Bigr)^{1/2} \Bigr\|_{S_q} \Biggr\},
\qquad (17)
$$
where the constant $C_q \le 2^{-1/4}\sqrt{\pi/e}\,\sqrt{q}$.

The next theorem is used to prove the expectation bound of (8). We now state a stronger result.

Theorem 4. Let $X$ and $Y$ be two matrices of size $m \times k_1$ and $m \times k_2$ which satisfy $X^*Y = 0$, and let $\{x_i\}$ and $\{y_i\}$ be the row vectors of $X$ and $Y$, respectively. Denote by $\{\delta_i\}$ a sequence of independent identically distributed $\{0,1\}$ Bernoulli random variables with $P(\delta_i = 1) = \delta$. Then for $q \ge 2$,
$$
\Bigl( E_\delta \Bigl\| \sum_i \delta_i\, x_i^* y_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
\le 2\sqrt{2}\, C_q^2 \max_i \|x_i\| \max_i \|y_i\|
+ 2\sqrt{\delta}\, C_q \max\Biggl\{ \max_i \|x_i\| \Bigl\| \sum_i y_i^* y_i \Bigr\|_{S_q}^{1/2},\;
\max_i \|y_i\| \Bigl\| \sum_i x_i^* x_i \Bigr\|_{S_q}^{1/2} \Biggr\},
\qquad (18)
$$
where the constant $C_q$ is defined in Lemma 8.

At first, we start with a lemma which is useful for the proof of the above theorem.

Lemma 9. Let $\{x_i\}_{1 \le i \le m}$ be a set of row vectors of the same length $k$ and denote by $\{\delta_i\}$ a sequence of independent identically distributed $\{0,1\}$ Bernoulli random variables with $P(\delta_i = 1) = \delta$. The following inequality is satisfied:
$$
\Bigl( E_\delta \Bigl\| \sum_i \delta_i\, x_i^* x_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
\le 2 C_q^2 \max_i \|x_i\|^2 + \delta \Bigl\| \sum_i x_i^* x_i \Bigr\|_{S_q},
\qquad (19)
$$
where $q \ge 2$ and the constant $C_q$ is defined in Lemma 8.

Proof. (Lemma 9) The line of proof is similar to Theorem 3.1 of Rudelson and Vershynin [19]. Let $\{\delta_i'\}$ be an independent copy of the sequence $\{\delta_i\}$. By first applying Hölder's inequality and then Jensen's inequality,
$$
E_1 = \Bigl( E_\delta \Bigl\| \sum_i \delta_i\, x_i^* x_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
= \Bigl( E_\delta \Bigl\| \sum_i (\delta_i - \delta)\, x_i^* x_i + \delta \sum_i x_i^* x_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
\le \Bigl( E_\delta E_{\delta'} \Bigl\| \sum_i (\delta_i - \delta_i')\, x_i^* x_i + \delta \sum_i x_i^* x_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
\le \Bigl( E_\delta E_{\delta'} \Bigl\| \sum_i (\delta_i - \delta_i')\, x_i^* x_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
+ \delta \Bigl\| \sum_i x_i^* x_i \Bigr\|_{S_q}.
\qquad (20)
$$
Define $\{\varepsilon_i\}$ to be an independent Rademacher sequence, independent of the sequences $\{\delta_i\}$ and $\{\delta_i'\}$. Since $(\delta_i - \delta_i')$ is symmetric, $(\delta_i - \delta_i')$ has the same distribution as $\varepsilon_i(\delta_i - \delta_i')$. Therefore, the first term of (20) is bounded by
$$
\Bigl( E_\delta E_{\delta'} E_\varepsilon \Bigl\| \sum_i \varepsilon_i (\delta_i - \delta_i')\, x_i^* x_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
\le 2 \Bigl( E_\delta E_\varepsilon \Bigl\| \sum_i \varepsilon_i \delta_i\, x_i^* x_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
= 2\, (E_\delta E_2)^{1/q},
$$
where the second inequality follows from Hölder's inequality and $E_2 := E_\varepsilon \bigl\| \sum_i \varepsilon_i \delta_i\, x_i^* x_i \bigr\|_{S_q}^q$.

Applying the Khintchine inequality (17) to bound $E_2$, we obtain
$$
E_2^{1/q} \le C_q \Bigl\| \Bigl( \sum_i \delta_i\, x_i^* x_i\, \|x_i\|^2 \Bigr)^{1/2} \Bigr\|_{S_q}.
$$
Each term $x_i^* x_i \|x_i\|^2$ of the sum is positive semidefinite, so from Weyl's result (Theorem 4.3.1 of [20]), which states that adding a positive semidefinite matrix can only increase all the singular values, we can replace the weights $\|x_i\|^2$ by $\max_i \|x_i\|^2$ and move them outside the norm:
$$
\Bigl( E_\varepsilon \Bigl\| \sum_i \varepsilon_i \delta_i\, x_i^* x_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
\le C_q \max_i \|x_i\| \,\Bigl\| \Bigl( \sum_i \delta_i\, x_i^* x_i \Bigr)^{1/2} \Bigr\|_{S_q}.
$$
Hence,
$$
(E_\delta E_2)^{1/q}
\le C_q \max_i \|x_i\| \Bigl( E_\delta \Bigl\| \sum_i \delta_i\, x_i^* x_i \Bigr\|_{S_q}^{q/2} \Bigr)^{1/q}
\le C_q \max_i \|x_i\| \Bigl( E_\delta \Bigl\| \sum_i \delta_i\, x_i^* x_i \Bigr\|_{S_q}^{q} \Bigr)^{1/(2q)}
= C_q \max_i \|x_i\|\, \sqrt{E_1},
\qquad (21)
$$
where the last inequality follows from $E\sqrt{Z} \le \sqrt{E(Z)}$. From (21), one can see that $(E_\delta E_2)^{1/q} \le C_q^2 \max_i \|x_i\|^2$; substituting this back into (20), Lemma 9 now follows.

Proof. (Theorem 4) As in the above proof, define $\{\delta_i'\}$ as an independent copy of the sequence $\{\delta_i\}$. Since $X^*Y = 0$ (so the term $\delta \sum_i x_i^* y_i$ vanishes), following the above Lemma's proof we obtain
$$
E_1 = \Bigl( E_\delta \Bigl\| \sum_i \delta_i\, x_i^* y_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
= \Bigl( E_\delta \Bigl\| \sum_i (\delta_i - \delta)\, x_i^* y_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
= \Bigl( E_\delta E_{\delta'} \Bigl\| \sum_i (\delta_i - \delta_i')\, x_i^* y_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
\le 2 \Bigl( E_\delta E_\varepsilon \Bigl\| \sum_i \varepsilon_i \delta_i\, x_i^* y_i \Bigr\|_{S_q}^q \Bigr)^{1/q}
= 2\, (E_\delta E_2)^{1/q},
$$
where $\{\varepsilon_i\}$ is an independent Rademacher sequence. Applying Khintchine's inequality to $E_2$, we obtain
$$
E_2 \le C_q^q \max\{B_1, B_2\}^q = C_q^q \max\{B_1^q, B_2^q\},
$$
where $B_1 = \bigl\| \bigl( \sum_i \delta_i\, x_i^* x_i\, \|y_i\|^2 \bigr)^{1/2} \bigr\|_{S_q}$ and $B_2$ is defined the same as $B_1$ except that $x_i$ is replaced by $y_i$ (and $y_i$ by $x_i$). The expectation of $E_2$ is now bounded by
$$
E_\delta E_2 \le C_q^q\, E_\delta\bigl( \max\{B_1^q, B_2^q\} \bigr) \le C_q^q \max\{ E_\delta B_1^q,\; E_\delta B_2^q \},
$$
where the second inequality follows from $E\max\{a, b\} \le \max\{Ea, Eb\}$ with nonnegative $a, b$. We have
$$
E_1 \le 2 C_q \max\bigl\{ (E_\delta B_1^q)^{1/q},\; (E_\delta B_2^q)^{1/q} \bigr\}.
$$
It is sufficient to bound $(E_\delta B_1^q)^{1/q}$; the bound for $(E_\delta B_2^q)^{1/q}$ can be attained likewise. Also from Theorem 4.3.1 of [20], one can see that $B_1 \le \max_i \|y_i\| \bigl\| \bigl( \sum_i \delta_i\, x_i^* x_i \bigr)^{1/2} \bigr\|_{S_q}$. Hence,
$$
(E_\delta B_1^q)^{1/q} \le \max_i \|y_i\| \sqrt{ \Bigl( E_\delta \Bigl\| \sum_i \delta_i\, x_i^* x_i \Bigr\|_{S_q}^q \Bigr)^{1/q} } := \max_i \|y_i\|\, \sqrt{E_3},
$$
where the inequality holds from $E\sqrt{Z} \le \sqrt{E(Z)}$. An upper bound for $\sqrt{E_3}$ follows from Lemma 9 and, with nonnegative $a$ and $b$, $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$:
$$
\sqrt{E_3} \le \sqrt{2}\, C_q \max_i \|x_i\| + \sqrt{\delta}\, \Bigl\| \sum_i x_i^* x_i \Bigr\|_{S_q}^{1/2}.
$$
The proof is completed.

One can observe that the first claim of Theorem 3 is a corollary of Theorem 4. The condition $X^*Y = 0$ holds since
$$
X^*Y = \sum_{i=1}^m x_i^* y_i = (U_k^* D^*)(F^* F)(D H) = U_k^* H = U_k^* U_{\rho-k} \Sigma_{\rho-k} = 0,
$$
where the third equality holds due to the orthonormality of $F$ and $D$ (see (3)). We restate this in a corollary.

Corollary 1. With the same notations as in Theorem 4, denote by $\{\delta_i\}$ a sequence of independent identically distributed $\{0,1\}$ Bernoulli random variables with $P(\delta_i = 1) = \frac{d}{m}$. Then
$$
E_\delta \Bigl\| \sum_i \delta_i\, x_i^* y_i \Bigr\|_F
\le 4.65 \max_i \|x_i\| \max_i \|y_i\|
+ 2.56 \sqrt{\frac{d}{m}} \max\Bigl\{ \sqrt{\alpha} \max_i \|x_i\|,\; k^{1/4} \max_i \|y_i\| \Bigr\}.
\qquad (22)
$$

Proof. Applying Theorem 4 to this case, one needs the expected spectral and Frobenius norms of the sums $\sum_j x_j^* x_j$ and $\sum_j y_j^* y_j$. We first compute
$$
\sum_j x_j^* x_j = U_k^* D^* F^* F D U_k = U_k^* U_k = I_k.
$$
Hence, the spectral and Frobenius norms of this sum are as follows:
$$
\Bigl\| \sum_{j=1}^m x_j^* x_j \Bigr\| = 1 \quad\text{and}\quad \Bigl\| \sum_{j=1}^m x_j^* x_j \Bigr\|_F = \sqrt{k}.
\qquad (23)
$$
Similarly, $\sum_{j=1}^m y_j^* y_j = H^* H = V_{\rho-k} \Sigma_{\rho-k}^2 V_{\rho-k}^*$. Therefore,
$$
\Bigl\| \sum_{j=1}^m y_j^* y_j \Bigr\| = s_{k+1}^2 \quad\text{and}\quad
\Bigl\| \sum_{j=1}^m y_j^* y_j \Bigr\|_F = \bigl\| \Sigma_{\rho-k}^2 \bigr\|_F \le \|A - A_k\|_F^2 = \alpha.
\qquad (24)
$$
Substituting (23) and (24) into Theorem 4 with $q = 2$ (for which the Schatten norm is the Frobenius norm) gives (22).

So far, we have only considered the Schatten $q$-norm with $2 \le q < \infty$ of a sum of matrices with random Bernoulli weights. When $q = \infty$, the Schatten norm is the spectral norm. In the next theorem, we derive another useful result which is analogous to Theorem 3.1 of Rudelson and Vershynin [19]. The theorem guarantees the invertibility of a sub-matrix formed by sampling a few columns (or rows) of a matrix $X$.

Theorem 5. Let $\{x_i\}_{1 \le i \le m}$ be the rows of a matrix $X$ of size $m \times k$ which obeys $X^*X = I$, and denote by $\{\delta_i\}$ a sequence of independent identically distributed $\{0,1\}$ Bernoulli random variables with $P(\delta_i = 1) = \delta$. Then with $q \ge \log k$, the following inequality is satisfied:
$$
\Bigl( E_\delta \Bigl\| I_{k\times k} - \frac{1}{\delta} \sum_{i=1}^m \delta_i\, x_i^* x_i \Bigr\|^q \Bigr)^{1/q}
\le C \sqrt{\frac{q}{\delta}} \max_i \|x_i\|,
\qquad (25)
$$
provided that the right-hand side of (25) is less than 1, where the constant $C = 2^{3/4}\sqrt{\pi e} \approx 5$.

Proof. Denote by $E$ the expectation on the left-hand side. Remarking that $\|X\| \le \|X\|_{S_q}$, we have
$$
E \le \Bigl( E_\delta \Bigl\| \frac{1}{\delta} \sum_i (\delta_i - \delta)\, x_i^* x_i \Bigr\|_{S_q}^q \Bigr)^{1/q}.
$$
Following precisely the proof of Lemma 9 and remarking that $\|X\|_{S_q} \le e\|X\|$ as $q \ge \log(\mathrm{rank}(X))$, we obtain
$$
E \le 2e\, C_q \max_i \|x_i\|\, \sqrt{\frac{1}{\delta}}\, \Bigl( E_\delta \Bigl\| \frac{1}{\delta} \sum_i \delta_i\, x_i^* x_i \Bigr\|^q \Bigr)^{1/(2q)}.
$$
From Minkowski's inequality, $(E\|X + Y\|^q)^{1/q} \le (E\|X\|^q)^{1/q} + (E\|Y\|^q)^{1/q}$, we see that
$$
\Bigl( E_\delta \Bigl\| \frac{1}{\delta} \sum_i \delta_i\, x_i^* x_i \Bigr\|^q \Bigr)^{1/q}
\le \Bigl( E_\delta \Bigl\| I_{k\times k} - \frac{1}{\delta} \sum_i \delta_i\, x_i^* x_i \Bigr\|^q \Bigr)^{1/q} + 1 = E + 1.
$$
Therefore, $E \le 2e\, C_q \sqrt{\frac{1}{\delta}} \max_i \|x_i\|\, \sqrt{E + 1}$. If $E \le 1$, then $\sqrt{E + 1} \le \sqrt{2}$, which leads to
$$
E \le 2\sqrt{2}\, e\, C_q \max_i \|x_i\|\, \sqrt{\frac{1}{\delta}},
$$
provided that the right-hand side is less than 1. Substituting the value of $C_q$ from Lemma 8, we finish the proof. The first claim of Theorem 2 is the exact corollary of Theorem 5 with $\delta = d/m$.
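Theorem 5's invertibility guarantee is easy to sanity-check numerically. The sketch below is our own illustration (not from the paper): we build a maximally incoherent $X$ with $X^T X = I$ from the first $k$ columns of a normalized Hadamard matrix, sample its rows with Bernoulli($\delta$) weights, and compare the empirical deviation $\|I - \frac{1}{\delta}\sum_i \delta_i x_i^* x_i\|$ against the right-hand side of (25).

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, delta = 2048, 8, 0.5

# X = first k columns of a normalized Hadamard matrix: X^T X = I_k and
# every row has norm exactly sqrt(k/m) (maximally incoherent rows).
H = np.array([[1.0]])
while H.shape[0] < m:
    H = np.block([[H, H], [H, -H]])
X = H[:, :k] / np.sqrt(m)

q = np.log(k)                          # Theorem 5 requires q >= log k
C = 2**0.75 * np.sqrt(np.pi * np.e)    # the constant of Theorem 5, ~ 4.9
rhs = C * np.sqrt(q / delta) * np.linalg.norm(X, axis=1).max()

devs = []
for _ in range(20):
    keep = rng.random(m) < delta              # Bernoulli(delta) row sampling
    S = (X[keep].T @ X[keep]) / delta         # (1/delta) sum_i delta_i x_i^* x_i
    devs.append(np.linalg.norm(np.eye(k) - S, 2))

print(max(devs), rhs)
```

With these parameters the right-hand side of (25) is below 1, so the theorem's proviso is met, and the empirical deviations should come out substantially smaller than the bound.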

7. CONCLUSION

In this paper, we presented a fast and efficient algorithm for low-rank matrix approximation as well as for least-squares approximation. Using remarkable techniques from Banach space theory, we showed that our algorithm produces a rank-k matrix approximation that achieves a relative error bound in the Frobenius norm. In experiments, we observe that our algorithm also attains the relative error bound in the spectral norm when the singular values decay roughly according to a power law. We leave the problem of proving this bound for future research.

Acknowledgements. We are grateful to the reviewers for their valuable comments, which improved the overall quality of the paper.
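For concreteness, the pipeline summarized in this paper (randomize signs, apply a fast orthonormal transform, uniformly sample rows, and build a rank-k approximation from the sampled row space) can be sketched as follows. This is our own minimal NumPy illustration: the Hadamard matrix stands in for any fast transform, and the sizes and synthetic test matrix are assumptions, not the paper's exact implementation.

```python
import numpy as np

def hadamard(m):
    """Normalized Hadamard matrix of size m (m must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < m:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(m)

def randomized_lowrank(A, k, d, rng):
    m, n = A.shape
    # 1) Preprocess: random signs, then an orthonormal transform with uniformly
    #    small entries, to spread out the energy of every row of A.
    B = hadamard(m) @ (rng.choice([-1.0, 1.0], size=m)[:, None] * A)
    # 2) Uniformly sample d rows of the preprocessed matrix.
    rows = rng.choice(m, size=d, replace=False)
    R = B[rows]
    # 3) Rank-k approximation from the sampled row space: project A onto the
    #    top-k right singular vectors of R.
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    Vk = Vt[:k]
    return (A @ Vk.T) @ Vk

rng = np.random.default_rng(2)
m, n, k, d = 256, 120, 5, 60
# A synthetic test matrix with a rapidly decaying spectrum.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
s = 2.0 ** -np.arange(n)
A = (U * s) @ np.linalg.qr(rng.standard_normal((n, n)))[0].T

Ak = randomized_lowrank(A, k, d, rng)
err = np.linalg.norm(A - Ak, 'fro')
best = np.sqrt((np.linalg.svd(A, compute_uv=False)[k:] ** 2).sum())
print(err, best)
```

On such a fast-decaying spectrum the sampled approximation should land within a small constant factor of the best rank-k Frobenius error, in line with the relative-error guarantee.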

8. REFERENCES

[1] P. Drineas, R. Kannan, and M. W. Mahoney, "Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix," SIAM Journal on Computing, vol. 36.
[2] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, 1983.
[3] A. Frieze, R. Kannan, and S. Vempala, "Fast Monte-Carlo algorithms for finding low-rank approximations," Journal of the ACM, vol. 51.
[4] P. Drineas, M. W. Mahoney, and S. Muthukrishnan, "Subspace sampling and relative error matrix approximation: Column-based methods," Proc. of APPROX-RANDOM, pp. 316–326, 2006.
[5] S. Har-Peled, "Low rank matrix approximation in linear time," manuscript.
[6] A. Deshpande and S. Vempala, "Adaptive sampling and fast low-rank matrix approximation," Proc. of the 10th International Workshop on Randomization and Computation, 2006.
[7] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang, "Matrix approximation and projective clustering via volume sampling," Theory of Computing, vol. 2, pp. 225–247, 2006.
[8] T. Sarlós, "Improved approximation algorithms for large matrices via random projections," Proc. of the 47th Annual IEEE Symposium on Foundations of Computer Science, pp. 143–152, 2006.
[9] N. Ailon and B. Chazelle, "Approximate nearest neighbors and the fast Johnson–Lindenstrauss transform," Proc. of the 38th Annual ACM Symposium on Theory of Computing, pp. 557–563, 2006.
[10] T. T. Do, T. D. Tran, and L. Gan, "Fast compressive sampling with structurally random matrices," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 3369–3372, May 2008.
[11] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert, "A fast randomized algorithm for the approximation of matrices," Yale University Technical Report No. 1386.
[12] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós, "Faster least squares approximation," under review, SIAM Journal on Computing, 2008.
[13] N. Ailon and E. Liberty, "Fast dimension reduction using Rademacher series on dual BCH codes," Symposium on Discrete Algorithms (SODA), pp. 1–9, 2008.
[14] N. Ailon and E. Liberty, "Dense fast random projections and lean Walsh transforms," Workshop on Randomization and Computation (RANDOM), pp. 512–522, 2008.
[15] B. Simon, Trace Ideals and Their Applications, American Mathematical Society, 2005.
[16] M. Talagrand, "New concentration inequalities in product spaces," Inventiones Mathematicae, vol. 126, pp. 505–563, 1996.
[17] M. Ledoux, The Concentration of Measure Phenomenon, American Mathematical Society, 2001.
[18] A. Buchholz, "Operator Khintchine inequality in non-commutative probability," Mathematische Annalen, vol. 319, 2001.
[19] M. Rudelson and R. Vershynin, "Sampling from large matrices: An approach through geometric functional analysis," to appear in Journal of the ACM.
[20] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, 2005.